



Introduction to Parallel Algorithms and Parallel Program Design

Parallel Computing, CIS 410/510, Department of Computer and Information Science
Lecture 12 – Introduction to Parallel Algorithms

Methodological Design
• Partition ❍ Task/data decomposition
• Communication ❍ Task execution coordination
• Agglomeration ❍ Evaluation of the structure
• Mapping ❍ Resource assignment
Reference: I. Foster, "Designing and Building Parallel Programs," Addison-Wesley, 1995. The book is online; see the course webpage.

Partitioning
• The partitioning stage is intended to expose opportunities for parallel execution
• Focus on defining a large number of small tasks to yield a fine-grained decomposition of the problem
• A good partition divides into small pieces both the computational tasks associated with a problem and the data on which the tasks operate
• Domain decomposition focuses on the computation's data
• Functional decomposition focuses on the computation's tasks
• Mixing domain and functional decomposition is possible

Domain and Functional Decomposition
• Domain decomposition of a 2D / 3D grid
• Functional decomposition of a climate model

Partitioning Checklist
• Does your partition define at least an order of magnitude more tasks than there are processors in your target computer? If not, you may lose design flexibility.
• Does your partition avoid redundant computation and storage requirements? If not, it may not be scalable.
• Are tasks of comparable size? If not, it may be hard to allocate each processor equal amounts of work.
• Does the number of tasks scale with problem size? If not, you may not be able to solve larger problems with more processors.
• Have you identified several alternative partitions?
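To make the partitioning idea concrete, here is a minimal sketch (not from Foster's text) of a 1D domain decomposition: n array elements are split into near-equal contiguous blocks, one block per task. The helper name block_range and the use of OpenMP threads as "tasks" are illustrative assumptions.

#include <algorithm>
#include <cstddef>
#include <vector>
#include <omp.h>

// Half-open range [lo, hi) owned by task `rank` out of `ntasks`.
// Remainder elements are spread over the first n % ntasks tasks.
static void block_range(std::size_t n, int ntasks, int rank,
                        std::size_t& lo, std::size_t& hi) {
    std::size_t base = n / ntasks, rem = n % ntasks;
    lo = rank * base + std::min<std::size_t>(rank, rem);
    hi = lo + base + (static_cast<std::size_t>(rank) < rem ? 1 : 0);
}

// Each thread applies the same operation to its own block: the data, not the
// function, is what gets partitioned (domain decomposition).
void scale(std::vector<double>& x, double a) {
    #pragma omp parallel
    {
        std::size_t lo, hi;
        block_range(x.size(), omp_get_num_threads(), omp_get_thread_num(), lo, hi);
        for (std::size_t i = lo; i < hi; ++i) x[i] *= a;
    }
}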
Communication (Interaction)
• Tasks generated by a partition must interact to allow the computation to proceed
  ❍ Information flow: data and control
• Types of communication
  ❍ Local vs. global: locality of communication
  ❍ Structured vs. unstructured: communication patterns
  ❍ Static vs. dynamic: determined by runtime conditions
  ❍ Synchronous vs. asynchronous: degree of coordination
• Granularity and frequency of communication
  ❍ Size of data exchange
• Think of communication as interaction and control
  ❍ Applicable to both shared- and distributed-memory parallelism

Types of Communication
• Point-to-point
• Group-based
• Hierarchical
• Collective

Communication Design Checklist
• Is the distribution of communications equal?
  ❍ Unbalanced communication may limit scalability
• What is the communication locality?
  ❍ Wider communication locales are more expensive
• What is the degree of communication concurrency?
  ❍ Communication operations may be parallelized
• Is computation associated with different tasks able to proceed concurrently? Can communication be overlapped with computation?
  ❍ Try to reorder computation and communication to expose opportunities for parallelism

Agglomeration
• Move from parallel abstractions to a real implementation
• Revisit partitioning and communication
  ❍ With a view to efficient algorithm execution
• Is it useful to agglomerate?
  ❍ What happens when tasks are combined?
• Is it useful to replicate data and/or computation?
• Agglomeration changes important algorithm and performance ratios
  ❍ Surface-to-volume: reduction in communication at the expense of decreasing parallelism
  ❍ Communication/computation: which cost dominates?
• Replication may allow a reduction in communication
• Maintain flexibility to allow overlap

Types of Agglomeration
• Element to column
• Element to block
  ❍ Better surface-to-volume ratio
• Task merging
• Task reduction
  ❍ Reduces communication

Agglomeration Design Checklist
• Has increased locality reduced communication costs?
• Is replicated computation worth it?
• Does data replication compromise scalability?
• Is the computation still balanced?
• Is scalability in problem size still possible?
• Is there still sufficient concurrency?
• Is there room for more agglomeration?
• Fine-grained vs. coarse-grained?
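A minimal sketch of agglomeration in shared memory, under the assumption that OpenMP chunks stand in for agglomerated tasks: instead of one logical task per element (high scheduling overhead), elements are grouped into blocks of `chunk` iterations, trading some flexibility for lower overhead and a better surface-to-volume ratio.

#include <vector>
#include <omp.h>

// Fine-grained view: one task per element.
// Agglomerated view: one task per block of `chunk` elements; fewer, larger tasks.
void add_chunked(std::vector<double>& y, const std::vector<double>& x, int chunk) {
    #pragma omp parallel for schedule(static, chunk)
    for (long i = 0; i < static_cast<long>(y.size()); ++i)
        y[i] += x[i];
}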
Mapping
• Specify where each task is to execute
  ❍ Less of a concern on shared-memory systems
• Attempt to minimize execution time
  ❍ Place concurrent tasks on different processors to enhance physical concurrency
  ❍ Place communicating tasks on the same processor, or on processors close to each other, to increase locality
  ❍ These strategies can conflict
• The mapping problem is NP-complete
  ❍ Use problem classifications and heuristics
• Static and dynamic load balancing

Mapping Algorithms
• Load balancing (partitioning) algorithms
• Data-based algorithms
  ❍ Think of computational load with respect to the amount of data being operated on
  ❍ Assign data (i.e., work) in some known manner to balance the load
  ❍ Take data interactions into account
• Task-based (task scheduling) algorithms
  ❍ Used when functional decomposition yields many tasks with weak locality requirements
  ❍ Use task assignment to keep processors busy computing
  ❍ Consider centralized and decentralized schemes

Mapping Design Checklist
• Is static mapping too restrictive and non-responsive?
• Is dynamic mapping too costly in overhead?
• Does centralized scheduling lead to bottlenecks?
• Do dynamic load-balancing schemes require too much coordination to rebalance the load?
• What is the tradeoff of dynamic scheduling complexity versus performance improvement?
• Are there enough tasks to achieve high levels of concurrency? If not, processors may idle.
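A small sketch of the static vs. dynamic mapping tradeoff, assuming OpenMP loop schedules as the mapping mechanism and a deliberately irregular work() function (both are illustrative, not from the slides): a static block mapping leaves threads idle when iteration costs vary, while a dynamic schedule rebalances at the cost of scheduling overhead.

#include <cmath>
#include <omp.h>

// Irregular per-iteration cost: later iterations do more work.
static double work(int i) {
    double s = 0.0;
    for (int k = 0; k < (i % 1000) * 1000; ++k) s += std::sin(k);
    return s;
}

double run(int n, bool dynamic_mapping) {
    double total = 0.0;
    if (dynamic_mapping) {
        // Dynamic mapping: idle threads grab the next chunk of 16 iterations.
        #pragma omp parallel for schedule(dynamic, 16) reduction(+ : total)
        for (int i = 0; i < n; ++i) total += work(i);
    } else {
        // Static mapping: iterations are assigned in fixed blocks up front.
        #pragma omp parallel for schedule(static) reduction(+ : total)
        for (int i = 0; i < n; ++i) total += work(i);
    }
    return total;
}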
Types of Parallel Programs
• Flavors of parallelism
  ❍ Data parallelism ◆ all processors do the same thing on different data
  ❍ Task parallelism ◆ processors are assigned tasks that do different things
• Parallel execution models
  ❍ Data parallel
  ❍ Pipelining (producer-consumer)
  ❍ Task graph
  ❍ Work pool
  ❍ Master-worker

Data Parallel
• Data is decomposed (mapped) onto processors
• Processors perform similar (identical) tasks on their data
• Tasks are applied concurrently
• Load balance is obtained through data partitioning
  ❍ Equal amounts of work assigned
• There may certainly be interactions between processors
• Data parallelism scalability
  ❍ The degree of parallelism tends to increase with problem size
  ❍ This makes data-parallel algorithms more efficient
• Single Program Multiple Data (SPMD)
  ❍ Convenient way to implement data-parallel computation
  ❍ More associated with distributed-memory parallel execution

Matrix-Vector Multiplication
• A x b = y, with y_i = Σ_j A_{i,j} b_j
• Allocate tasks to rows of A
• Dependencies? Speedup?
• Computing each element of y can be done independently

Matrix-Vector Multiplication (Limited Tasks)
• Suppose we only have 4 tasks
• Dependencies? Speedup?

Matrix Multiplication
• A x B = C, with A_{i,:} · B_{:,j} = C_{i,j}
• Row partitioning ❍ N tasks
• Block partitioning ❍ N·N/B tasks
• (Figure: shading shows data sharing in the B matrix)

Granularity of Task and Data Decompositions
• Granularity can be with respect to tasks and data
• Task granularity
  ❍ Equivalent to choosing the number of tasks
  ❍ A fine-grained decomposition results in a large number of (small) tasks
  ❍ A large-grained decomposition has a smaller number of (larger) tasks
  ❍ Translates to data granularity after the tasks are chosen ◆ consider matrix multiplication
• Data granularity
  ❍ Think in terms of the amount of data needed in an operation
  ❍ Relative to the data as a whole
  ❍ Decomposition decisions based on input, output, input-output, or intermediate data

Mesh Allocation to Processors
• Mesh model of Lake Superior
• How to assign mesh elements to processors?
• Distribute onto 8 processors: randomly, or by graph partitioning for minimum edge cut

Pipeline Model
• Stream of data operated on by a succession of tasks (Task 1 → Task 2 → Task 3 → Task 4)
  ❍ Each task is assigned to a processor (P1–P4)
• Consider N data units
• Sequential vs. parallel execution (each task assigned to its own processor)
  ❍ (Figure: timelines for 4 and 8 data units; with one task per processor the execution is 4-way parallel, though it runs for a longer time as more data units stream through)
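Before the performance analysis, here is a minimal sketch of the pipeline (producer-consumer) model with two stages connected by a queue; a longer pipeline, as in the four-task figure, simply chains more stage threads with one channel between each pair. The Channel type and the data values are illustrative assumptions.

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

// One queue connecting two pipeline stages.
struct Channel {
    std::queue<int> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void push(int v) {
        { std::lock_guard<std::mutex> lk(m); q.push(v); }
        cv.notify_one();
    }
    bool pop(int& v) {                       // returns false when the stream has ended
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !q.empty() || done; });
        if (q.empty()) return false;
        v = q.front(); q.pop();
        return true;
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_all();
    }
};

int main() {
    Channel ch;
    std::thread producer([&] {               // Task 1: produce N data units
        for (int i = 0; i < 8; ++i) ch.push(i * i);
        ch.close();
    });
    std::thread consumer([&] {               // Task 2: consume and process them
        int v;
        while (ch.pop(v)) std::printf("processed %d\n", v);
    });
    producer.join();
    consumer.join();
    return 0;
}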
Pipeline Performance
• N data units and T tasks
• Each task takes unit time t
• Sequential time = N·T·t
• Parallel pipeline time = start + finish + (N − 2T)/T · t = O(N/T) (for N ≫ T)
• Try to find a lot of data to pipeline
• Try to divide the computation into many pipeline tasks
  ❍ More tasks to do (longer pipelines)
  ❍ Shorter tasks to do
• Pipeline computation is a special form of producer-consumer parallelism
  ❍ Producer tasks output data that is input by consumer tasks

Task Graphs
• Computations in any parallel algorithm can be viewed as a task dependency graph
• Task dependency graphs can be nontrivial
  ❍ Pipeline (e.g., Task 1 → Task 2 → Task 3 → Task 4)
  ❍ Arbitrary (represents the algorithm's dependencies)
• (Figure: example task graphs; the numbers are the time taken to perform each task)

Task Graph Performance
• Determined by the critical path (span)
  ❍ The sequence of dependent tasks that takes the longest time
  ❍ (Figures: example graphs with minimum times of 27 and 34)
  ❍ The critical path length bounds parallel execution time

Task Assignment (Mapping) to Processors
• Given a set of tasks and a number of processors
• How to assign tasks to processors?
• Should take dependencies into account
• The task mapping will determine execution time
• (Figures: two example mappings and their total execution times)

Task Graphs in Action: Task-Graph-Based Languages/Frameworks
• Uintah task-graph-based PDE solver
  ❍ CSAFE: Center for Simulation of Accidental Fires and Explosions, University of Utah (S. Parker)
  ❍ Large-granularity tasks
  ❍ (Figures: Uintah and Wasatch task graphs for a PDE solver)
• PLASMA (J. Dongarra)
  ❍ DAG-based parallel linear algebra software
  ❍ DAGuE: a generic distributed DAG engine for HPC
  ❍ (Figure: DAG of QR for a 4 × 4 tile matrix on a 2 × 2 grid of processors)
• Intel CnC (K. Knobe): new language for graph-based parallelism
• Charm++ (L. Kale): object-based virtualization
• And many others (V. Sarkar, …)
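A minimal sketch of DAG execution of the kind these frameworks automate, using OpenMP task dependences (an assumption; the slides do not prescribe OpenMP). Four tasks form the graph T1 → {T2, T3} → T4, and the runtime runs T2 and T3 in parallel once T1 completes.

#include <cstdio>
#include <omp.h>

int main() {
    int a = 0, b = 0, c = 0, d = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)
        a = 1;                                   // T1
        #pragma omp task depend(in: a) depend(out: b)
        b = a + 1;                               // T2
        #pragma omp task depend(in: a) depend(out: c)
        c = a + 2;                               // T3
        #pragma omp task depend(in: b, c) depend(out: d)
        d = b + c;                               // T4: critical path T1 -> T2/T3 -> T4
    }
    std::printf("d = %d\n", d);                  // all tasks finish at the region's barrier
    return 0;
}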
Bag o' Tasks Model and Worker Pool
• A set of tasks to be performed
• How do we schedule them?
  ❍ Find independent tasks
  ❍ Assign tasks to available processors
• Bag o' Tasks approach
  ❍ Tasks are stored in a bag waiting to run
  ❍ If all of a task's dependencies are satisfied, it is moved to a ready-to-run queue
  ❍ The scheduler assigns a task to a free processor
  ❍ (Figure: a bag of independent tasks feeding a set of processors)
• A dynamic approach that is effective for load balancing

Master-Worker Parallelism
• One or more master processes generate work
• Masters allocate work to worker processes
• Workers idle if they have nothing to do
• Workers are mostly "stupid" and must be told what to do
  ❍ They execute independently
  ❍ They may need to synchronize, but must be told to do so
• The master may become a bottleneck if one is not careful
• What are the performance factors and expected performance behavior?
  ❍ Consider task granularity and asynchrony
  ❍ How do they interact?

Master-Worker Execution Model (Li Li)
• Li Li, "Model-based Automatic Performance Diagnosis of Parallel Computations," Ph.D. thesis, 2007.

MW Execution Trace (Li Li)
• (Figure: execution trace of a master-worker computation)

Search-Based (Exploratory) Decomposition
• 15-puzzle problem
• 15 tiles numbered 1 through 15 placed in a 4×4 grid
  ❍ A blank tile is located somewhere in the grid
  ❍ The initial configuration is out of order
  ❍ Find the shortest sequence of moves to put it in order
• Sequential search across the space of solutions
  ❍ May involve some heuristics

Parallelizing the 15-Puzzle Problem
• Enumerate the move choices at each stage
• Assign them to processors
• May do pruning
• Wasted work is possible

Divide-and-Conquer Parallelism
• Break the problem up in an orderly manner into smaller, more manageable chunks and solve them
• Quicksort example
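A minimal work-pool sketch of the bag-of-tasks / master-worker scheduling described above, under the assumption that the tasks are independent and that a shared atomic counter plays the role of the bag; the task body and counts are placeholders. Each idle worker simply takes the next task, so faster workers end up doing more tasks.

#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int num_tasks = 64;
    const unsigned num_workers = std::max(1u, std::thread::hardware_concurrency());
    std::atomic<int> next{0};
    std::vector<double> result(num_tasks, 0.0);

    auto worker = [&] {
        for (;;) {
            int t = next.fetch_add(1);           // take one task out of the bag
            if (t >= num_tasks) break;           // bag is empty
            double s = 0.0;                      // placeholder "task": some local work
            for (int k = 0; k <= t * 1000; ++k) s += 1.0 / (k + 1);
            result[t] = s;
        }
    };

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < num_workers; ++w) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    std::printf("task 0 -> %.3f, task %d -> %.3f\n",
                result[0], num_tasks - 1, result[num_tasks - 1]);
    return 0;
}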
Dense Matrix Algorithms
• A great deal of activity in algorithms and software for solving linear algebra problems
  ❍ Solution of linear systems (Ax = b)
  ❍ Least-squares solution of over- or under-determined systems (min ‖Ax − b‖)
  ❍ Computation of eigenvalues and eigenvectors (Ax = λx)
  ❍ Driven by numerical problem solving in scientific computation
• Solutions involve various forms of matrix computations
• Focus on high-performance matrix algorithms
  ❍ The key insight is to maximize the computation-to-communication ratio

Solving a System of Linear Equations
• Ax = b:
  a_{0,0} x_0 + a_{0,1} x_1 + … + a_{0,n−1} x_{n−1} = b_0
  a_{1,0} x_0 + a_{1,1} x_1 + … + a_{1,n−1} x_{n−1} = b_1
  …
  a_{n−1,0} x_0 + a_{n−1,1} x_1 + … + a_{n−1,n−1} x_{n−1} = b_{n−1}
• Gaussian elimination (classic algorithm)
  ❍ Forward elimination to Ux = y (U is upper triangular)
    ◆ without or with partial pivoting
  ❍ Back substitution to solve for x
  ❍ Parallel algorithms are based on partitioning of A

Sequential Gaussian Elimination
1. procedure GAUSSIAN_ELIMINATION(A, b, y)
2. begin
3.   for k := 0 to n−1 do                 /* outer loop */
4.   begin
5.     for j := k+1 to n−1 do
6.       A[k,j] := A[k,j] / A[k,k];        /* division step */
7.     y[k] := b[k] / A[k,k];
8.     A[k,k] := 1;
9.     for i := k+1 to n−1 do
10.    begin
11.      for j := k+1 to n−1 do
12.        A[i,j] := A[i,j] − A[i,k] × A[k,j];   /* elimination step */
13.      b[i] := b[i] − A[i,k] × y[k];
14.      A[i,k] := 0;
15.    endfor;    /* line 9 */
16.  endfor;      /* line 3 */
17. end GAUSSIAN_ELIMINATION

Computation Step in Gaussian Elimination
• 5x + 3y = 22  →  x = (22 − 3y) / 5
• 8x + 2y = 13  →  8(22 − 3y)/5 + 2y = 13  →  y = (13 − 176/5) / (−24/5 + 2)

Row-wise Partitioning on Eight Processes
• (Figures, two slides: row-wise 1D partitioning of the n × n matrix across eight processes)

2D Mesh Partitioning on 64 Processes
• (Figure: 2D mesh partitioning of the matrix across 64 processes)

Back Substitution to Find the Solution
1. procedure BACK_SUBSTITUTION(U, x, y)
2. begin
3.   for k := n−1 downto 0 do             /* main loop */
4.   begin
5.     x[k] := y[k];
6.     for i := k−1 downto 0 do
7.       y[i] := y[i] − x[k] × U[i,k];
8.   endfor;
9. end BACK_SUBSTITUTION
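A direct C++ translation of the two procedures above, offered as a sketch: no pivoting, and nonzero pivots A[k][k] are assumed. The rows updated in the elimination step are independent for a given k, so the i-loop can be parallelized, which is what the row-wise partitioning exploits; the OpenMP pragma is an illustrative shared-memory stand-in for that distribution.

#include <vector>
#include <omp.h>

std::vector<double> gauss_solve(std::vector<std::vector<double>> A, std::vector<double> b) {
    const int n = static_cast<int>(b.size());
    std::vector<double> y(n), x(n);
    for (int k = 0; k < n; ++k) {
        for (int j = k + 1; j < n; ++j) A[k][j] /= A[k][k];   // division step
        y[k] = b[k] / A[k][k];
        A[k][k] = 1.0;
        #pragma omp parallel for                              // elimination step: rows are independent
        for (int i = k + 1; i < n; ++i) {
            for (int j = k + 1; j < n; ++j) A[i][j] -= A[i][k] * A[k][j];
            b[i] -= A[i][k] * y[k];
            A[i][k] = 0.0;
        }
    }
    for (int k = n - 1; k >= 0; --k) {                        // back substitution (A now holds U)
        x[k] = y[k];
        for (int i = k - 1; i >= 0; --i) y[i] -= x[k] * A[i][k];
    }
    return x;
}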
Dense Linear Algebra (www.netlib.gov)
• Basic Linear Algebra Subroutines (BLAS)
  ❍ Level 1 (vector-vector): vectorization
  ❍ Level 2 (matrix-vector): vectorization, parallelization
  ❍ Level 3 (matrix-matrix): parallelization
• LINPACK (Fortran)
  ❍ Linear equations and linear least-squares
• EISPACK (Fortran)
  ❍ Eigenvalues and eigenvectors for classes of matrices
• LAPACK (Fortran, C) (LINPACK + EISPACK)
  ❍ Uses BLAS internally
• ScaLAPACK (Fortran, C, MPI): scalable LAPACK

Numerical Libraries
• PETSc
  ❍ Data structures / routines for partial differential equations
  ❍ MPI-based
• SuperLU
  ❍ Large sparse nonsymmetric linear systems
• Hypre
  ❍ Large sparse linear systems
• TAO
  ❍ Toolkit for Advanced Optimization
• DOE ACTS
  ❍ Advanced CompuTational Software

Sorting Algorithms
• The task of arranging an unordered collection into order
• A permutation of a sequence of elements
• Internal versus external sorting
  ❍ External sorting uses auxiliary storage
• Comparison-based
  ❍ Compare pairs of elements and exchange
  ❍ O(n log n)
• Non-comparison-based
  ❍ Use known properties of the elements
  ❍ O(n)

Sorting on Parallel Computers
• Where are the elements stored?
  ❍ They need to be distributed across processes
  ❍ Sorted order will be with respect to process order
• How are comparisons performed?
  ❍ One element per process
    ◆ compare-exchange
    ◆ interprocess communication will dominate execution time
  ❍ More than one element per process
    ◆ compare-split
• Sorting networks
  ❍ Based on the comparison network model
• Contrast with shared-memory sorting algorithms
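A minimal sketch of the compare-split operation for the "more than one element per process" case: each of two processes holds a sorted block; after exchanging and merging, the lower-ranked process keeps the smaller half and the higher-ranked process the larger half. Here both blocks live in one address space for illustration; with MPI each process would send its block and merge locally.

#include <algorithm>
#include <vector>

// Assumes lo_block and hi_block are each already sorted.
void compare_split(std::vector<int>& lo_block, std::vector<int>& hi_block) {
    std::vector<int> merged(lo_block.size() + hi_block.size());
    std::merge(lo_block.begin(), lo_block.end(),
               hi_block.begin(), hi_block.end(), merged.begin());
    std::copy(merged.begin(), merged.begin() + lo_block.size(), lo_block.begin());
    std::copy(merged.begin() + lo_block.size(), merged.end(), hi_block.begin());
}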
Single vs. Multi-Element Comparison
• One element per processor: compare-exchange
• Multiple elements per processor: compare-split

Sorting Networks
• Networks that sort n elements in less than O(n log n) time
• The key component in the network is a comparator
  ❍ Increasing or decreasing comparator
• Comparators connected in parallel permute the elements

Sorting Network Design
• Multiple comparator stages, each containing several comparators
• Connected together by an interconnection network
• The output of the last stage is the sorted list
• O(log² n) sorting time
• Any sorting network can be converted into a sequential algorithm

Bitonic Sort
• Create a bitonic sequence, then sort the sequence
• Bitonic sequence
  ❍ A sequence of elements a_0, a_1, …, a_{n−1}
  ❍ a_0, a_1, …, a_i is monotonically increasing
  ❍ a_i, a_{i+1}, …, a_{n−1} is monotonically decreasing
• Sorting using bitonic splits is called bitonic merge
• A bitonic merge network is a network of comparators
  ❍ It implements bitonic merge
• A bitonic sequence is formed from an unordered sequence
  ❍ Bitonic sort creates a bitonic sequence
  ❍ Start with sequences of size two (bitonic by default)

Bitonic Sort Network
• (Figure: alternating increasing and decreasing comparator stages turn an unordered sequence into a bitonic sequence)

Bitonic Merge Network
• (Figure: a network of comparators turns a bitonic sequence into a sorted sequence)

Parallel Bitonic Sort on a Hypercube
1. procedure BITONIC_SORT(label, d)
2. begin
3.   for i := 0 to d−1 do
4.     for j := i downto 0 do
5.       if the (i+1)st bit of label ≠ the jth bit of label then
6.         comp_exchange_max(j);
7.       else
8.         comp_exchange_min(j);
9. end BITONIC_SORT

Parallel Bitonic Sort on a Hypercube (cont.)
• (Figure: communication pattern of the last stage)

Bubble Sort and Variants
• Sorting algorithms of O(n²) can easily be parallelized
• Bubble sort compares and exchanges adjacent elements
  ❍ O(n) work per pass
  ❍ O(n) passes
  ❍ Available parallelism?
• Odd-even transposition sort
  ❍ Compares and exchanges odd and even pairs
  ❍ After n phases, the elements are sorted
  ❍ Available parallelism?

Odd-Even Transposition Sort
• (Figure: example of the odd and even compare-exchange phases)

Parallel Odd-Even Transposition Sort
1. procedure ODD-EVEN_PAR(n)
2. begin
3.   id := process's label
4.   for i := 1 to n do
5.   begin
6.     if i is odd then
7.       if id is odd then
8.         compare_exchange_min(id + 1);
9.       else
10.        compare_exchange_max(id − 1);
11.    if i is even then
12.      if id is even then
13.        compare_exchange_min(id + 1);
14.      else
15.        compare_exchange_max(id − 1);
16.  endfor
17. end ODD-EVEN_PAR
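A shared-memory analogue of the odd-even transposition pseudocode above, sketched with OpenMP (an assumption; the pseudocode itself is message-passing style). Within one phase all pairs are independent, so the inner loop is a parallel for; n phases suffice to sort.

#include <algorithm>
#include <vector>
#include <omp.h>

void odd_even_sort(std::vector<int>& a) {
    const int n = static_cast<int>(a.size());
    for (int phase = 0; phase < n; ++phase) {
        // start = 1: pairs (0,1), (2,3), ...; start = 2: pairs (1,2), (3,4), ...
        const int start = (phase % 2 == 0) ? 1 : 2;
        #pragma omp parallel for
        for (int i = start; i < n; i += 2)
            if (a[i - 1] > a[i]) std::swap(a[i - 1], a[i]);   // compare-exchange
    }
}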
Quicksort
• Quicksort has an average complexity of O(n log n)
• Divide-and-conquer algorithm
  ❍ Divide into two subsequences where every element in the first is less than or equal to every element in the second
  ❍ A pivot is used to split the sequence
  ❍ The conquer step recursively applies the quicksort algorithm
• Available parallelism?

Sequential Quicksort
1. procedure QUICKSORT(A, q, r)
2. begin
3.   if q < r then
4.   begin
5.     x := A[q];
6.     s := q;
7.     for i := q+1 to r do
8.       if A[i] ≤ x then
9.       begin
10.        s := s + 1;
11.        swap(A[s], A[i]);
12.      end if
13.    swap(A[q], A[s]);
14.    QUICKSORT(A, q, s);
15.    QUICKSORT(A, s+1, r);
16.  end if
17. end QUICKSORT

Parallel Shared Address Space Quicksort
• (Figures, two slides: parallel shared-address-space quicksort)

Bucket Sort and Sample Sort
• Bucket sort is popular when the elements (values) are uniformly distributed over an interval
  ❍ Create m buckets and place each element in the appropriate bucket
  ❍ O(n log(n/m))
  ❍ If m = n, the value can be used as an index to achieve O(n) time
• Sample sort is used when the uniform-distribution assumption does not hold
  ❍ Distribute to m buckets and sort each with quicksort
  ❍ Draw a sample of size s
  ❍ Sort the samples and choose m−1 elements to be splitters
  ❍ Split into m buckets and proceed with bucket sort

Parallel Sample Sort
• (Figure: parallel sample sort)
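To close the sorting discussion, here is a task-parallel sketch of the quicksort pseudocode shown above, using OpenMP tasks in a shared address space (the OpenMP mechanism and the 2048-element cutoff are assumptions, not from the slides). The two recursive calls are independent, so each becomes a task; small subproblems are sorted sequentially to keep task granularity reasonable.

#include <algorithm>
#include <vector>
#include <omp.h>

static void quicksort_task(std::vector<int>& A, int q, int r) {
    if (q >= r) return;
    if (r - q < 2048) { std::sort(A.begin() + q, A.begin() + r + 1); return; }
    int x = A[q], s = q;                      // pivot = first element, as in the pseudocode
    for (int i = q + 1; i <= r; ++i)
        if (A[i] <= x) std::swap(A[++s], A[i]);
    std::swap(A[q], A[s]);                    // pivot now at position s
    #pragma omp task shared(A)
    quicksort_task(A, q, s - 1);
    #pragma omp task shared(A)
    quicksort_task(A, s + 1, r);
    #pragma omp taskwait
}

void parallel_quicksort(std::vector<int>& A) {
    #pragma omp parallel
    #pragma omp single
    quicksort_task(A, 0, static_cast<int>(A.size()) - 1);
}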
Graph Algorithms
• Graph theory is important in computer science
• Many complex problems are graph problems
• G = (V, E)
  ❍ V: finite set of points called vertices
  ❍ E: finite set of edges
  ❍ e ∈ E is a pair (u, v), where u, v ∈ V
  ❍ Unordered (undirected) and ordered (directed) graphs

Graph Terminology
• Vertices u and v are adjacent if (u, v) is an edge
• There is a path from u to v if there is a sequence of edges starting at u and ending at v
• If there exists a path, v is reachable from u
• A graph is connected if all pairs of vertices are connected by a path
• A weighted graph associates a weight with each edge
• The adjacency matrix is an n × n array A such that
  ❍ A_{i,j} = 1 if (v_i, v_j) ∈ E, and 0 otherwise
  ❍ It can be modified for weighted graphs (∞ means no edge)
  ❍ Graphs can also be represented as adjacency lists

Graph Representations
• Adjacency matrix
• Adjacency list

Minimum Spanning Tree
• A spanning tree of an undirected graph G is a subgraph of G that is a tree containing all the vertices of G
• The minimum spanning tree (MST) of a weighted undirected graph is a spanning tree with minimum weight
• Prim's algorithm can be used
  ❍ Greedy algorithm
  ❍ Selects an arbitrary starting vertex
  ❍ Repeatedly chooses a new vertex guaranteed to be in the MST
  ❍ O(n²)
  ❍ Prim's algorithm is iterative

Prim's Minimum Spanning Tree Algorithm
1. procedure PRIM_MST(V, E, w, r)
2. begin
3.   V_T := {r};
4.   d[r] := 0;
5.   for all v ∈ (V − V_T) do
6.     if edge (r, v) exists set d[v] := w(r, v);
7.     else set d[v] := ∞;
8.   while V_T ≠ V do
9.   begin
10.    find a vertex u such that d[u] := min{ d[v] | v ∈ (V − V_T) };
11.    V_T := V_T ∪ {u};
12.    for all v ∈ (V − V_T) do
13.      d[v] := min{ d[v], w(u, v) };
14.  endwhile
15. end PRIM_MST

Example: Prim's MST Algorithm
• (Figures, two slides: stepping through Prim's algorithm on an example graph)

Parallel Formulation of Prim's Algorithm
• It is difficult to perform different iterations of the while loop in parallel because d[v] may change each time
• Each iteration can be parallelized, though
• Partition the vertices into p subsets V_i, i = 0, …, p−1
• Each process P_i computes d[u_i] = min{ d[v] | v ∈ (V − V_T) ∩ V_i }
• The global minimum is obtained using an all-to-one reduction
• The new vertex is added to V_T and broadcast to all processes
• New values of d[v] are computed for the local vertices
• O(n²/p) + O(n log p) (computation + communication)

Partitioning in Prim's Algorithm
• (Figure: partitioning of the distance array d and the adjacency matrix among p processes)

Single-Source Shortest Paths
• Find the shortest paths from a vertex v to all other vertices
• The shortest path in a weighted graph is the path with minimum total weight
• Weights may represent time, cost, loss, or any other quantity that accumulates additively along a path
• Dijkstra's algorithm finds the shortest paths from a source vertex s
  ❍ Similar to Prim's MST algorithm
    ◆ like MST construction with s as the starting vertex
  ❍ Incrementally finds the shortest paths in a greedy manner
  ❍ Keeps track of the minimum cost to reach a vertex from s
  ❍ O(n²)

Dijkstra's Single-Source Shortest Path Algorithm
1. procedure DIJKSTRA_SINGLE_SOURCE_SP(V, E, w, s)
2. begin
3.   V_T := {s};
4.   for all v ∈ (V − V_T) do
5.     if (s, v) exists set l[v] := w(s, v);
6.     else set l[v] := ∞;
7.   while V_T ≠ V do
8.   begin
9.     find a vertex u such that l[u] := min{ l[v] | v ∈ (V − V_T) };
10.    V_T := V_T ∪ {u};
11.    for all v ∈ (V − V_T) do
12.      l[v] := min{ l[v], l[u] + w(u, v) };
13.  endwhile
14. end DIJKSTRA_SINGLE_SOURCE_SP
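A sequential C++ sketch of the O(n²) Dijkstra pseudocode above, assuming an adjacency-matrix representation with infinity for "no edge" (the function name and types are illustrative). In the parallel formulation discussed next, each process would own a slice of l[] and the "find minimum" step becomes an all-to-one reduction.

#include <algorithm>
#include <limits>
#include <vector>

std::vector<double> dijkstra(const std::vector<std::vector<double>>& w, int s) {
    const double INF = std::numeric_limits<double>::infinity();
    const int n = static_cast<int>(w.size());
    std::vector<double> l(n, INF);
    std::vector<bool> in_VT(n, false);
    in_VT[s] = true;
    l[s] = 0.0;
    for (int v = 0; v < n; ++v)
        if (v != s) l[v] = w[s][v];                  // w[s][v] == INF if there is no edge
    for (int iter = 0; iter < n - 1; ++iter) {
        int u = -1;
        for (int v = 0; v < n; ++v)                  // find the closest vertex not yet in V_T
            if (!in_VT[v] && (u < 0 || l[v] < l[u])) u = v;
        if (u < 0 || l[u] == INF) break;             // remaining vertices are unreachable
        in_VT[u] = true;
        for (int v = 0; v < n; ++v)                  // relax edges out of u
            if (!in_VT[v]) l[v] = std::min(l[v], l[u] + w[u][v]);
    }
    return l;
}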
Parallel Formulation of Dijkstra's Algorithm
• Very similar to the parallel formulation of Prim's MST algorithm
• Use 1D block mapping as before
• All processes perform computation and communication similar to that performed in Prim's algorithm
• Parallel performance is the same
  ❍ O(n²/p) + O(n log p)
  ❍ Scalability
    ◆ O(n²) is the sequential time
    ◆ speedup = O(n²) / (O(n²/p) + O(n log p))

All-Pairs Shortest Paths
• Find the shortest path between all pairs of vertices
• The outcome is an n × n matrix D = {d_{i,j}} such that d_{i,j} is the cost of the shortest path from vertex v_i to vertex v_j
• Dijkstra's algorithm
  ❍ Execute the single-source algorithm from each vertex
  ❍ O(n³)
  ❍ Source-partitioned formulation (use the sequential algorithm)
  ❍ Source-parallel formulation (use the parallel algorithm)
• Floyd's algorithm
  ❍ Builds up the distance matrix from the bottom up

Floyd's All-Pairs Shortest Paths Algorithm
1. procedure FLOYD_ALL_PAIRS_SP(A)
2. begin
3.   D^(0) := A;
4.   for k := 1 to n do
5.     for i := 1 to n do
6.       for j := 1 to n do
7.         d^(k)_{i,j} := min( d^(k−1)_{i,j}, d^(k−1)_{i,k} + d^(k−1)_{k,j} );
8. end FLOYD_ALL_PAIRS_SP

Parallel Floyd's Algorithm
1. procedure FLOYD_ALL_PAIRS_PARALLEL(A)
2. begin
3.   D^(0) := A;
4.   for k := 1 to n do
5.     forall P_{i,j}, where i, j ≤ n, do in parallel
6.       d^(k)_{i,j} := min( d^(k−1)_{i,j}, d^(k−1)_{i,k} + d^(k−1)_{k,j} );
7. end FLOYD_ALL_PAIRS_PARALLEL
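A shared-memory sketch of the parallel Floyd formulation above, using OpenMP as an assumed stand-in for the P_{i,j} processes: for a fixed k, all (i, j) updates are independent, so the i-loop is a parallel for. D is the adjacency matrix with infinity for missing edges, updated in place; this is safe because row k and column k do not change during iteration k (assuming D[k][k] = 0, i.e., no negative cycles).

#include <algorithm>
#include <vector>
#include <omp.h>

void floyd_all_pairs(std::vector<std::vector<double>>& D) {
    const int n = static_cast<int>(D.size());
    for (int k = 0; k < n; ++k) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                D[i][j] = std::min(D[i][j], D[i][k] + D[k][j]);
    }
}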
q  Who should use the BGL ❍  Programmers comfortable with C++ ❍  Users with graph sizes from tens of vertices to millions of vertices Introduction to Parallel Computing, University of Oregon, IPCC Lecture 12 – Introduction to Parallel Algorithms 81 Parallel BGL q  A version of C++ BGL for computational clusters ❍  Distributed memory for huge graphs ❍  Parallel processing for improved performance q  An active research project q  Closely related to the original BGL ❍  Parallelizing BGL programs should be “easy” A  simple,  directed  graph…   distributed  across  3  processors   Introduction to Parallel Computing, University of Oregon, IPCC Lecture 12 – Introduction to Parallel Algorithms 82 Parallel Graph Algorithms •  Connected components •  Breadthfirst search •  Strongly connected •  Eager Dijkstra’s single components source shortest paths •  Biconnected components •  Crauser et al. singlesource •  PageRank shortest paths •  Graph coloring •  Depthfirst search •  FruchtermanReingold •  Minimum spanning tree layout (Boruvka, Dehne Götz) •  Maxflow (Dinic’s) Introduction to Parallel Computing, University of Oregon, IPCC Lecture 12 – Introduction to Parallel Algorithms 83 BigData and MapReduce q  Bigdata deals with processing large data sets q  Nature of data processing problem makes it amenable to parallelism ❍  Looking for features in the data ❍  Extracting certain characteristics ❍  Analyzing properties with complex data mining algorithms q  Data size makes it opportunistic for partitioning into large of subsets and processing these in parallel q  We need new algorithms, data structures, and programming models to deal with problems Introduction to Parallel Computing, University of Oregon, IPCC Lecture 12 – Introduction to Parallel Algorithms 84 A Simple BigData Problem q  Consider a large data collection of text documents q  Suppose we want to find how often a particular word occurs and determine a probability distribution for all word occurrences Sequential algorithm web 2 Count words and weed 1 update statistics Get next green 2 document sun 1 Data   Find  and   moon 1 collecHon   count  words   land 1 Generate   part 1 probability   distribuHons   Check if more documents Introduction to Parallel Computing, University of Oregon, IPCC Lecture 12 – Introduction to Parallel Algorithms 85 Parallelization Approach q  Map: partition the data collection into subsets of documents and process each subset in parallel q  Reduce: assemble the partial frequency tables to derive final probability distribution web 2 web 2 Parallel algorithm weed 1 web 2 weed 1 green 2 weed 1 green 2 Get next Count words and sun 1 green 2 document update statistics sun 1 moon 1 sun 1 Data   moon 1 Find  and   land 1 moon 1 collecHon   count  words   land 1 part 1 land 1 part 1 part 1 Check if more Generate   probability   documents distribuHons   Introduction to Parallel Computing, University of Oregon, IPCC Lecture 12 – Introduction to Parallel Algorithms 86 Parallelization Approach q  Map: partition the data collection into subsets of documents and process each subset in parallel q  Reduce: assemble the partial frequency tables to derive final probability distribution Parallel algorithm web 2 weed 1 Get next Count words and green 2 document update statistics sun 1 Data   Find  and   moon 1 collecHon   count  words   land 1 part 1 Check if more Generate   probability   documents distribuHons   Introduction to Parallel Computing, University of Oregon, IPCC Lecture 12 – Introduction to Parallel Algorithms 
Actually, It Is Not Easy to Parallelize…
• Different programming models: message passing, shared memory
• Fundamental issues: scheduling, data distribution, synchronization, interprocess communication, robustness, fault tolerance, …
• Architectural issues: Flynn's taxonomy (SIMD, MIMD, etc.), network topology, bisection bandwidth, cache coherence, …
• Different programming constructs: mutexes, condition variables, barriers, …; masters/slaves, producers/consumers, work queues, …
• Common problems: livelock, deadlock, data starvation, priority inversion, …; dining philosophers, sleeping barbers, cigarette smokers, …
• Actually, a programmer's nightmare…

MapReduce Parallel Programming
• Has become an important distributed parallel programming paradigm for large-scale applications
  ❍ Also applies to shared-memory parallelism
  ❍ One of the core technologies powering big IT companies such as Google, IBM, Yahoo, and Facebook
• The framework runs on a cluster of machines, automatically partitions a job into a number of small tasks, and processes them in parallel
• Can be captured by combining the Map and Reduce parallel patterns

MapReduce Example
• MAP: input data → (key, value) pairs
• (Figure: the data collection is divided into split 1 … split n; each split is mapped on its own processor, emitting (word, 1) pairs such as (web, 1), (weed, 1), (green, 1), (sun, 1), (moon, 1), (land, 1), (part, 1))

MapReduce
• MAP: input data → (key, value) pairs
• REDUCE: (key, value) pairs → result
• (Figure: splits of the data collection flow through Map tasks and then Reduce tasks to produce the final result)
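To tie the word-count example together, here is a minimal map/reduce-style sketch on a single node using threads; the document splits are placeholders, and a real MapReduce framework would distribute the same map and reduce steps across a cluster, with the shuffle grouping equal keys between them.

#include <cstdio>
#include <map>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

using Counts = std::map<std::string, long>;

// MAP: one document split -> a partial (word, count) table.
static Counts map_count(const std::string& doc) {
    Counts c;
    std::istringstream in(doc);
    std::string w;
    while (in >> w) ++c[w];
    return c;
}

int main() {
    std::vector<std::string> splits = {                  // placeholder "documents"
        "the web and the moon", "green sun green land", "web of the weed"};
    std::vector<Counts> partial(splits.size());

    std::vector<std::thread> mappers;                    // one mapper per split, run in parallel
    for (std::size_t i = 0; i < splits.size(); ++i)
        mappers.emplace_back([&, i] { partial[i] = map_count(splits[i]); });
    for (auto& t : mappers) t.join();

    Counts total;                                        // REDUCE: merge the partial tables
    for (const auto& p : partial)
        for (const auto& kv : p) total[kv.first] += kv.second;

    for (const auto& kv : total)
        std::printf("%s %ld\n", kv.first.c_str(), kv.second);
    return 0;
}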