Introduction to Parallel Algorithms and Parallel Program Design

Parallel Computing, CIS 410/510, Department of Computer and Information Science
Lecture 12 – Introduction to Parallel Algorithms

Methodological Design
•  Partition
  ❍  Task/data decomposition
•  Communication
  ❍  Task execution coordination
•  Agglomeration
  ❍  Evaluation of the structure
•  Mapping
  ❍  Resource assignment
Reference: I. Foster, "Designing and Building Parallel Programs," Addison-Wesley, 1995. The book is available online; see the course webpage.

Partitioning
•  The partitioning stage is intended to expose opportunities for parallel execution
•  Focus on defining a large number of small tasks, yielding a fine-grained decomposition of the problem
•  A good partition divides into small pieces both the computational tasks associated with a problem and the data on which those tasks operate
•  Domain decomposition focuses on the computation's data
•  Functional decomposition focuses on the computation's tasks
•  Mixing domain and functional decomposition is possible

Domain and Functional Decomposition
•  Domain decomposition of a 2D/3D grid
•  Functional decomposition of a climate model

Partitioning Checklist
•  Does your partition define at least an order of magnitude more tasks than there are processors in your target computer? If not, you may lose design flexibility.
•  Does your partition avoid redundant computation and storage requirements? If not, it may not be scalable.
•  Are tasks of comparable size? If not, it may be hard to allocate each processor equal amounts of work.
•  Does the number of tasks scale with problem size? If not, you may not be able to solve larger problems with more processors.
•  Have you identified several alternative partitions?

Communication (Interaction)
•  Tasks generated by a partition must interact to allow the computation to proceed
  ❍  Information flow: data and control
•  Types of communication
  ❍  Local vs. global: locality of communication
  ❍  Structured vs. unstructured: communication patterns
  ❍  Static vs. dynamic: determined by runtime conditions
  ❍  Synchronous vs. asynchronous: degree of coordination
•  Granularity and frequency of communication
  ❍  Size of data exchange
•  Think of communication as interaction and control
  ❍  Applicable to both shared and distributed memory parallelism (a minimal MPI sketch follows the communication checklist below)

Types of Communication
•  Point-to-point
•  Group-based
•  Hierarchical
•  Collective

Communication Design Checklist
•  Is the distribution of communications equal?
  ❍  Unbalanced communication may limit scalability
•  What is the communication locality?
  ❍  Wider communication locales are more expensive
•  What is the degree of communication concurrency?
  ❍  Communication operations may be parallelized
•  Is computation associated with different tasks able to proceed concurrently? Can communication be overlapped with computation?
  ❍  Try to reorder computation and communication to expose opportunities for parallelism
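To make the communication ideas above concrete, here is a minimal sketch (my own code, not from the lecture) of local, structured, static, synchronous communication on a 1-D domain decomposition: each MPI process owns a slice of a grid and exchanges one ghost cell with each of its left and right neighbors. The local size N and the use of MPI_Sendrecv are illustrative assumptions.

```c
/* A minimal sketch (not lecture code) of local, structured, static,
 * synchronous communication on a 1-D domain decomposition: each MPI process
 * owns N interior cells plus two ghost cells and exchanges one value with
 * its left and right neighbors. The local size N is an arbitrary assumption. */
#include <mpi.h>
#include <stdio.h>

#define N 8   /* local sub-domain size (assumed) */

int main(int argc, char **argv)
{
    int rank, size;
    double u[N + 2];   /* u[1..N] interior, u[0] and u[N+1] ghost cells */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Neighbors in the 1-D decomposition; MPI_PROC_NULL skips the exchange
     * at the physical boundaries. */
    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (int i = 1; i <= N; i++)   /* fill interior cells with global indices */
        u[i] = rank * N + (i - 1);
    u[0] = u[N + 1] = -1.0;        /* sentinel for ghost cells never filled */

    /* Send rightmost interior cell right, receive left ghost from the left. */
    MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 0,
                 &u[0], 1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* Send leftmost interior cell left, receive right ghost from the right. */
    MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  1,
                 &u[N + 1], 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d: ghost cells %.0f and %.0f\n", rank, u[0], u[N + 1]);
    MPI_Finalize();
    return 0;
}
```

This is the point-to-point, local, static pattern from the list above; a collective operation such as MPI_Allreduce would be the group-based/collective counterpart.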
Agglomeration
•  Move from parallel abstractions to a real implementation
•  Revisit partitioning and communication
  ❍  With a view to efficient algorithm execution
•  Is it useful to agglomerate?
  ❍  What happens when tasks are combined?
•  Is it useful to replicate data and/or computation?
•  Agglomeration changes important algorithmic and performance ratios
  ❍  Surface-to-volume: reduction in communication at the expense of decreasing parallelism
  ❍  Communication/computation: which cost dominates
•  Replication may allow a reduction in communication
•  Maintain flexibility to allow overlap

Types of Agglomeration
•  Element to column
•  Element to block
  ❍  Better surface-to-volume ratio
•  Task merging
•  Task reduction
  ❍  Reduces communication

Agglomeration Design Checklist
•  Has increased locality reduced communication costs?
•  Is replicated computation worth it?
•  Does data replication compromise scalability?
•  Is the computation still balanced?
•  Is scalability in problem size still possible?
•  Is there still sufficient concurrency?
•  Is there room for more agglomeration?
•  Fine-grained vs. coarse-grained?

Mapping
•  Specify where each task is to execute
  ❍  Less of a concern on shared-memory systems
•  Attempt to minimize execution time
  ❍  Place concurrent tasks on different processors to enhance physical concurrency
  ❍  Place communicating tasks on the same processor, or on processors close to each other, to increase locality
  ❍  These strategies can conflict
•  The mapping problem is NP-complete
  ❍  Use problem classifications and heuristics
•  Static and dynamic load balancing (a minimal OpenMP scheduling sketch follows the mapping checklist below)

Mapping Algorithms
•  Load balancing (partitioning) algorithms
•  Data-based algorithms
  ❍  Think of computational load with respect to the amount of data being operated on
  ❍  Assign data (i.e., work) in some known manner to balance the load
  ❍  Take data interactions into account
•  Task-based (task scheduling) algorithms
  ❍  Used when functional decomposition yields many tasks with weak locality requirements
  ❍  Use task assignment to keep processors busy computing
  ❍  Consider centralized and decentralized schemes

Mapping Design Checklist
•  Is static mapping too restrictive and non-responsive?
•  Is dynamic mapping too costly in overhead?
•  Does centralized scheduling lead to bottlenecks?
•  Do dynamic load-balancing schemes require too much coordination to re-balance the load?
•  What is the tradeoff of dynamic scheduling complexity versus performance improvement?
•  Are there enough tasks to achieve high levels of concurrency? If not, processors may idle.
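The static versus dynamic mapping trade-off in the checklist above can be illustrated with OpenMP loop schedules. The sketch below is my own (the cost function, problem size, and chunk size of 16 are arbitrary assumptions, not lecture material): iterations have deliberately uneven cost, so schedule(static) assigns unequal work to threads while schedule(dynamic) lets idle threads pick up the remaining chunks at run time.

```c
/* A minimal sketch (not lecture code) contrasting static and dynamic mapping
 * of loop iterations to threads via OpenMP schedules. The cost function,
 * problem size, and chunk size are illustrative assumptions. */
#include <omp.h>
#include <stdio.h>

/* Hypothetical task whose cost grows with i, creating load imbalance. */
static double work(int i)
{
    double s = 0.0;
    for (int k = 0; k < i * 1000; k++)
        s += (double)k * 1e-9;
    return s;
}

int main(void)
{
    const int n = 2048;
    double total = 0.0;

    /* Static mapping: iterations are divided among threads up front,
     * so early (cheap) and late (expensive) blocks cost unequal time. */
    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += work(i);
    double t_static = omp_get_wtime() - t0;

    /* Dynamic mapping: idle threads grab the next chunk of 16 iterations
     * at run time, which tends to re-balance the uneven load. */
    t0 = omp_get_wtime();
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += work(i);
    double t_dynamic = omp_get_wtime() - t0;

    printf("static: %.3f s   dynamic: %.3f s   (total = %g)\n",
           t_static, t_dynamic, total);
    return 0;
}
```

With most compilers this builds with an OpenMP flag such as -fopenmp; the dynamic schedule trades a small amount of scheduling overhead for better load balance, which is exactly the overhead-versus-responsiveness question in the checklist.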
Types of Parallel Programs
•  Flavors of parallelism
  ❍  Data parallelism
    ◆  All processors do the same thing on different data
  ❍  Task parallelism
    ◆  Processors are assigned tasks that do different things
•  Parallel execution models
  ❍  Data parallel
  ❍  Pipelining (producer-consumer)
  ❍  Task graph
  ❍  Work pool
  ❍  Master-worker

Data Parallel
•  Data is decomposed (mapped) onto processors
•  Processors perform similar (identical) tasks on their data
•  Tasks are applied concurrently
•  Load balance is obtained through data partitioning
  ❍  Equal amounts of work assigned
•  There may certainly be interactions between processors
•  Data parallelism scalability
  ❍  Degree of parallelism tends to increase with problem size
  ❍  Makes data parallel algorithms more efficient
•  Single Program Multiple Data (SPMD)
  ❍  Convenient way to implement data parallel computation
  ❍  More associated with distributed memory parallel execution

Matrix-Vector Multiplication
•  A x b = y
•  Allocate tasks to the rows of A: y_i = Σ_j A_i,j · b_j
•  Dependencies?
•  Speedup?
•  Computing each element of y can be done independently (a minimal code sketch appears at the end of this section)

Matrix-Vector Multiplication (Limited Tasks)
•  Suppose we only have 4 tasks
•  Dependencies?
•  Speedup?

Matrix Multiplication
•  A x B = C
•  A_i,: · B_:,j = C_i,j  (row i of A dotted with column j of B)
•  Row partitioning
  ❍  N tasks
•  Block partitioning
  ❍  N·N/B tasks (B = block size)
•  Shading in the figure shows data sharing in the B matrix

Granularity of Task and Data Decompositions
•  Granularity can be with respect to tasks and data
•  Task granularity
  ❍  Equivalent to choosing the number of tasks
  ❍  Fine-grained decomposition results in a large number of (small) tasks
  ❍  Large-grained decomposition yields a smaller number of (larger) tasks
  ❍  Translates to data granularity after the tasks are chosen
    ◆  Consider matrix multiplication
•  Data granularity
  ❍  Think in terms of the amount of data needed in an operation
  ❍  Relative to the data as a whole
  ❍  Decomposition decisions are based on input, output, input-output, or intermediate data
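As a closing illustration of the data parallel model and the matrix-vector example above, here is a minimal sketch (my own code, not the lecture's): each row i of A is an independent task computing y_i = Σ_j A_i,j · b_j, and OpenMP maps those row tasks onto threads. With T threads and schedule(static), each thread effectively receives a contiguous block of about N/T rows, which corresponds to the "limited tasks" case. The size N and the test data are arbitrary assumptions.

```c
/* A minimal sketch (not lecture code) of the data parallel matrix-vector
 * multiply A x b = y: each row i is an independent task computing
 * y_i = sum_j A_i,j * b_j. With T threads and schedule(static), each thread
 * gets a contiguous block of roughly N/T rows (the "limited tasks" case).
 * The size N and the test data are arbitrary assumptions. */
#include <omp.h>
#include <stdio.h>

#define N 512   /* matrix dimension (assumed) */

static double A[N][N], b[N], y[N];

int main(void)
{
    /* Test data: A = 2*I and b = all ones, so every y_i should equal 2. */
    for (int i = 0; i < N; i++) {
        b[i] = 1.0;
        for (int j = 0; j < N; j++)
            A[i][j] = (i == j) ? 2.0 : 0.0;
    }

    /* One "task" per row; rows have no dependencies, so they run in parallel. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            sum += A[i][j] * b[j];
        y[i] = sum;
    }

    printf("y[0] = %g, y[N-1] = %g (expected 2 and 2)\n", y[0], y[N - 1]);
    return 0;
}
```

Because the row tasks are independent, there are no dependencies to serialize the computation; in practice the achievable speedup is limited mainly by memory bandwidth and by how evenly the N rows divide among the threads.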