



Multithreaded Parallelism and Performance Measures

Marc Moreno Maza
University of Western Ontario, London, Ontario (Canada)
CS 3101

Plan
1 Parallelism Complexity Measures
2 cilk_for Loops
3 Scheduling Theory and Implementation
4 Measuring Parallelism in Practice
5 Announcements

Parallelism Complexity Measures

The fork-join parallelism model

Example: fib(4)

int fib(int n)
{
  if (n < 2)
    return n;
  else {
    int x, y;
    x = cilk_spawn fib(n-1);
    y = fib(n-2);
    cilk_sync;
    return (x + y);
  }
}

The model is "processor oblivious": the computation dag unfolds
dynamically at runtime. We shall also call this model multithreaded
parallelism.

Terminology

A strand is a maximal sequence of instructions that ends with a spawn,
sync, or return (either explicit or implicit) statement.
At runtime, the spawn relation causes procedure instances to be
structured as a rooted tree, called the spawn tree or parallel
instruction stream, where dependencies among strands form a dag.

Work and span

We define several performance measures.
We assume an ideal situation: no cache issues, no interprocessor costs.

T_p is the minimum running time on p processors.
T_1 is called the work, that is, the sum of the number of instructions
at each node.
T_∞ is the minimum running time with infinitely many processors, called
the span.

The critical path length

Assuming all strands run in unit time, the longest path in the dag is
equal to T_∞. For this reason, T_∞ is also referred to as the critical
path length.

Work law

We have: T_p ≥ T_1 / p.
Indeed, in the best case, p processors can do p units of work per unit
of time.

Span law

We have: T_p ≥ T_∞.
Indeed, T_p < T_∞ would contradict the definitions of T_p and T_∞.

Speedup on p processors

T_1 / T_p is called the speedup on p processors.
A parallel program execution can have:
  linear speedup:      T_1 / T_p = Θ(p)
  superlinear speedup: T_1 / T_p = ω(p) (not possible in this model,
                       though it is possible in others)
  sublinear speedup:   T_1 / T_p = o(p)

Parallelism

Because the span law dictates that T_p ≥ T_∞, the maximum possible
speedup given T_1 and T_∞ is
  T_1 / T_∞ = parallelism
            = the average amount of work per step along the span.

The Fibonacci example

For fib(4), we have T_1 = 17 and T_∞ = 8, and thus the parallelism is
T_1 / T_∞ = 17/8 = 2.125.
What about T_1(fib(n)) and T_∞(fib(n))?

Series composition

If A and B are composed in series (B starts after A completes):
  Work: T_1(A ∪ B) = T_1(A) + T_1(B)
  Span: T_∞(A ∪ B) = T_∞(A) + T_∞(B)

Parallel composition

If A and B are composed in parallel:
  Work: T_1(A ∪ B) = T_1(A) + T_1(B)
  Span: T_∞(A ∪ B) = max(T_∞(A), T_∞(B))

cilk_for Loops

For loop parallelism in Cilk++

Transposing an n × n matrix A in place:

cilk_for (int i = 1; i < n; ++i) {
  for (int j = 0; j < i; ++j) {
    double temp = A[i][j];
    A[i][j] = A[j][i];
    A[j][i] = temp;
  }
}

The iterations of a cilk_for loop execute in parallel.
Implementation of cilk_for loops in Cilk++

Up to details (next week), the previous loop is compiled as follows,
using a divide-and-conquer implementation:

void recur(int lo, int hi)
{
  if (hi > lo) { // coarsen
    int mid = lo + (hi - lo) / 2;
    cilk_spawn recur(lo, mid);
    recur(mid + 1, hi);
    cilk_sync;
  } else {
    for (int j = 0; j < hi; ++j) {
      double temp = A[hi][j];
      A[hi][j] = A[j][hi];
      A[j][hi] = temp;
    }
  }
}

Analysis of parallel for loops

Here we do not assume that each strand runs in unit time.
  Span of loop control: Θ(log n)
  Max span of an iteration: Θ(n)
  Span: Θ(n)
  Work: Θ(n^2)
  Parallelism: Θ(n)

Parallelizing the inner loop

cilk_for (int i = 1; i < n; ++i) {
  cilk_for (int j = 0; j < i; ++j) {
    double temp = A[i][j];
    A[i][j] = A[j][i];
    A[j][i] = temp;
  }
}

  Span of outer loop control: Θ(log n)
  Max span of an inner loop control: Θ(log n)
  Span of an iteration: Θ(1)
  Span: Θ(log n)
  Work: Θ(n^2)
  Parallelism: Θ(n^2 / log n)

Scheduling Theory and Implementation

Scheduling

A scheduler's job is to map a computation to particular processors.
Such a mapping is called a schedule.
If decisions are made at runtime, the scheduler is online; otherwise,
it is offline.
Cilk++'s scheduler maps strands onto processors dynamically at runtime.
Greedy scheduling (1/2)

A strand is ready if all its predecessors have executed.
A scheduler is greedy if it attempts to do as much work as possible at
every step.

Greedy scheduling (2/2)

In any greedy schedule, there are two types of steps:
  complete step: there are at least p strands that are ready to run;
  the greedy scheduler selects any p of them and runs them.
  incomplete step: there are strictly fewer than p strands that are
  ready to run; the greedy scheduler runs them all.

Theorem of Graham and Brent

For any greedy schedule, we have T_p ≤ T_1/p + T_∞.
  The number of complete steps is at most T_1/p, by definition of T_1.
  The number of incomplete steps is at most T_∞. Indeed, let G' be the
  subgraph of G that remains to be executed immediately prior to an
  incomplete step.
  (i) During this incomplete step, all strands that can be run are
      actually run.
  (ii) Hence removing this incomplete step from G' reduces T_∞ by one.

The work-stealing scheduler (1/9)

Cilk/Cilk++'s randomized work-stealing scheduler load-balances the
computation at runtime. Each processor maintains a ready deque:
  A ready deque is a double-ended queue, where each entry is a
  procedure instance that is ready to execute.
  Adding a procedure instance to the bottom of the deque represents a
  procedure call being spawned.
  A procedure instance being deleted from the bottom of the deque
  represents the processor beginning/resuming execution on that
  procedure.
  Deletion from the top of the deque corresponds to that procedure
  instance being stolen.
A mathematical proof guarantees near-perfect linear speedup on
applications with sufficient parallelism, as long as the architecture
has sufficient memory bandwidth.
A spawn/return in Cilk is over 100 times faster than a Pthread
create/exit and less than 3 times slower than an ordinary C function
call on a modern Intel processor.

The work-stealing scheduler (2/9 - 9/9)

Each processor possesses a deque. (The remaining figures in this series
illustrate the deques evolving as procedures are spawned, called,
returned from, and stolen.)

Performances of the work-stealing scheduler

Assume that
  each strand executes in unit time,
  for almost all "parallel steps" there are at least p strands to run,
  each processor is either working or stealing.
Then, the randomized work-stealing scheduler is expected to run in
  T_p = T_1/p + O(T_∞).

During a steal-free parallel step (a step at which all processors have
work on their deque) each of the p processors consumes 1 unit of work;
thus, there are at most T_1/p steal-free parallel steps.
During a parallel step with steals, each thief may reduce the running
time by 1 with probability 1/p; thus, the expected number of steals is
O(p T_∞).
Therefore, the expected running time is
  T_p = (T_1 + O(p T_∞)) / p = T_1/p + O(T_∞).     (1)

Overheads and burden

Obviously T_1/p + T_∞ will underestimate T_p in practice.
Many factors (simplifying assumptions of the fork-join parallelism
model, architecture limitations, costs of executing the parallel
constructs, overheads of scheduling) will make T_p larger in practice.
One may want to estimate the impact of those factors:
  1 by improving the estimate of the randomized work-stealing
    complexity result,
  2 by comparing a Cilk++ program with its C++ elision,
  3 by estimating the costs of spawning and synchronizing.
Cilk++ estimates T_p as T_p = T_1/p + 1.7 × burden_span, where
burden_span is 15000 instructions times the number of continuation
edges along the critical path.
Measuring Parallelism in Practice

Cilkview

Cilkview computes work and span to derive upper bounds on parallel
performance (the work law and the span law give the "linear speedup"
and parallelism bounds on the measured speedup).
Cilkview also estimates scheduling overhead to compute a burdened span
for lower bounds; burdened parallelism estimates the scheduling
overheads.

The Fibonacci Cilk++ example

Code fragment:

long fib(int n)
{
  if (n < 2) return n;
  long x, y;
  x = cilk_spawn fib(n-1);
  y = fib(n-2);
  cilk_sync;
  return x + y;
}

Fibonacci program timing

The environment for benchmarking:
  model name : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
  L2 cache size : 4096 KB
  memory size : 3 GB

        cores = 1   cores = 2             cores = 4
  n     timing(s)   timing(s)   speedup   timing(s)   speedup
  30        0.086       0.046     1.870       0.025     3.440
  35        0.776       0.436     1.780       0.206     3.767
  40        8.931       4.842     1.844       2.399     3.723
  45      105.263      54.017     1.949      27.200     3.870
  50     1165.000     665.115     1.752     340.638     3.420

Quicksort

Code in cilk/examples/qsort:

void sample_qsort(int *begin, int *end)
{
  if (begin != end) {
    --end;  // exclude last element (pivot)
    int *middle = std::partition(begin, end,
                      std::bind2nd(std::less<int>(), *end));
    using std::swap;
    swap(*end, *middle);  // move pivot into place
    cilk_spawn sample_qsort(begin, middle);
    sample_qsort(++middle, ++end);
    cilk_sync;
  }
}

Quicksort timing

Timing for sorting an array of integers:

               cores = 1   cores = 2             cores = 4
  # of int     timing(s)   timing(s)   speedup   timing(s)   speedup
   10 x 10^6       1.958       1.016     1.927       0.541     3.619
   50 x 10^6      10.518       5.469     1.923       2.847     3.694
  100 x 10^6      21.481      11.096     1.936       5.954     3.608
  500 x 10^6     114.300      57.996     1.971      31.086     3.677

Matrix multiplication

Code in cilk/examples/matrix. Timing of multiplying a 687 x 837 matrix
by an 837 x 1107 matrix:

              iterative                  recursive
  threshold   st(s)   pt(s)   su         st(s)   pt(s)   su
  10          1.273   1.165   0.721      1.674   0.399   4.195
  16          1.270   1.787   0.711      1.408   0.349   4.034
  32          1.280   1.757   0.729      1.223   0.308   3.971
  48          1.258   1.760   0.715      1.164   0.293   3.973
  64          1.258   1.798   0.700      1.159   0.291   3.983
  80          1.252   1.773   0.706      1.267   0.320   3.959

  st = sequential time; pt = parallel time with 4 cores; su = speedup

The cilkview example from the documentation

Using cilk_for to perform operations over an array in parallel:

static const int COUNT = 4;
static const int ITERATION = 1000000;
long arr[COUNT];

long do_work(long k)
{
  long x = 15;
  static const int nn = 87;
  for (long i = 1; i < nn; ++i)
    x = x / i + k % i;
  return x;
}

int cilk_main()
{
  for (int j = 0; j < ITERATION; j++)
    cilk_for (int i = 0; i < COUNT; i++)
      arr[i] += do_work(j * i + i + j);
  return 0;
}

1) Parallelism Profile
  Work :                          6,480,801,250 ins
  Span :                          2,116,801,250 ins
  Burdened span :                31,920,801,250 ins
  Parallelism :                            3.06
  Burdened parallelism :                   0.20
  Number of spawns/syncs :            3,000,000
  Average instructions / strand :           720
  Strands along span :                4,000,001
  Average instructions / strand on span :   529

2) Speedup Estimate
   2 processors: 0.21 - 2.00
   4 processors: 0.15 - 3.06
   8 processors: 0.13 - 3.06
  16 processors: 0.13 - 3.06
  32 processors: 0.12 - 3.06
A simple fix

Inverting the two for loops:

int cilk_main()
{
  cilk_for (int i = 0; i < COUNT; i++)
    for (int j = 0; j < ITERATION; j++)
      arr[i] += do_work(j * i + i + j);
  return 0;
}

1) Parallelism Profile
  Work :                          5,295,801,529 ins
  Span :                          1,326,801,107 ins
  Burdened span :                 1,326,830,911 ins
  Parallelism :                            3.99
  Burdened parallelism :                   3.99
  Number of spawns/syncs :                    3
  Average instructions / strand :   529,580,152
  Strands along span :                        5
  Average instructions / strand on span : 265,360,221

2) Speedup Estimate
   2 processors: 1.40 - 2.00
   4 processors: 1.76 - 3.99
   8 processors: 2.01 - 3.99
  16 processors: 2.17 - 3.99
  32 processors: 2.25 - 3.99

Timing

            cores = 1   cores = 2             cores = 4
  version   timing(s)   timing(s)   speedup   timing(s)   speedup
  original      7.719       9.611     0.803      10.758     0.718
  improved      7.471       3.724     2.006       1.888     3.957

Announcements

Acknowledgements

Charles E. Leiserson (MIT) for providing me with the sources of his
lecture notes.
Matteo Frigo (Intel) for supporting the work of my team with Cilk++.
Yuzhen Xie (UWO) for helping me with the images used in these slides.
Liyun Li (UWO) for generating the experimental data.

References

Matteo Frigo, Charles E. Leiserson, and Keith H. Randall.
The Implementation of the Cilk-5 Multithreaded Language.
Proceedings of the ACM SIGPLAN '98 Conference on Programming Language
Design and Implementation, pages 212-223, June 1998.

Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul,
Charles E. Leiserson, Keith H. Randall, and Yuli Zhou.
Cilk: An Efficient Multithreaded Runtime System.
Journal of Parallel and Distributed Computing, pages 55-69,
August 25, 1996.

Robert D. Blumofe and Charles E. Leiserson.
Scheduling Multithreaded Computations by Work Stealing.
Journal of the ACM, Vol. 46, No. 5, pages 720-748, September 1999.