Question? Leave a message!




Issues with Multithreaded Parallelism on Multicore Architectures

Issues with Multithreaded Parallelism on Multicore Architectures
Issues with Multithreaded Parallelism on Multicore Architectures Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS3101 (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 1 / 35Plan (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 2 / 35Example 1: a small loop with grain size = 1 Code: const int N = 100 1000 1000; void cilkforgrainsize1() pragma cilkgrainsize = 1 cilkfor (int i = 0; i N; ++i) fib(2); Expectations: Parallelism should be large, perhaps (N) or (N= logN). We should see great speedup. (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 3 / 35Speedup is indeed great. . . (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 4 / 35. . . but performance is lousy (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 5 / 35Recall how cilk for is implemented Source: cilkfor (int i = A; i B; ++i) BODY(i) Implementation: void recur(int lo, int hi) if ((hi lo) GRAINSIZE) int mid = lo + (hi lo) / 2; cilkspawn recur(lo, mid); cilkspawn recur(mid, hi); else for (int i = lo; i hi; ++i) BODY(i); recur(A, B); (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 6 / 35Default grain size Cilk++ chooses a grain size if you don't specify one. void cilkfordefaultgrainsize() cilkfor (int i = 0; i N; ++i) fib(2); Cilk++'s heuristic for the grain size:   N grain size = min ; 512 : 8P Generates about 8P parallel leaves. Works well if the loop iterations are not too unbalanced. (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 7 / 35Speedup with default grain size (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 8 / 35Large grain size A large grain size should be even faster, right void cilkforlargegrainsize() pragma cilkgrainsize = N cilkfor (int i = 0; i N; ++i) fib(2); Actually, no (except for noise): Grain size Runtime 1 8.55 s default (= 512) 2.44 s 8 N (= 10 ) 2.42 s (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 9 / 35Speedup with grain size =N (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 10 / 35Tradeo between grain size and parallelism Use Cilkview to understand the tradeo : Grain size Parallelism 1 6,951,154 default (= 512) 248,784 8 N (= 10 ) 1 In Cilkview,P = 1:     N N default grain size = min ; 512 = min ; 512 : 8P 8 (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 11 / 35Lessons learned Measure overhead before measuring speedup. Compare 1processor Cilk++ versus serial code. Small grain size) higher work overhead. Large grain size) less parallelism. The default grain size is designed for small loops that are reasonably balanced. You may want to use a smaller grain size for unbalanced loops or loops with large bodies. Use Cilkview to measure the parallelism of your program. (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 12 / 35Example 2: A for loop that spawns Code: const int N = 10 1000 1000; / empty test function / void f() void forspawn() for (int i = 0; i N; ++i) cilkspawn f(); Expectations: I am spawningN parallel things. Parallelism should be (N), right (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 13 / 35\Speedup" of for spawn() (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 14 / 35Insucient parallelism PPA analysis: PPA says that both work and span are (N). Parallelism is 1:62, independent ofN. Too little parallelism: no speedup. Why is the span (N) for (int i = 0; i N; ++i) cilkspawn f(); (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 15 / 35Alternative: a cilk for loop. Code: / empty test function / void f() void testcilkfor() cilkfor (int i = 0; i N; ++i) f(); PPA analysis: The parallelism is about 2000 (with default grain size). The parallelism is high. As we saw earlier, this kind of loop yields good performance and speedup. (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 16 / 35Lessons learned cilkfor() is di erent from for(...) cilkspawn. The span of for(...) cilkspawn is (N). For simple at loops, cilkfor() is generally preferable because it has higher parallelism. However, for(...) cilkspawn might be better when the work load is not uniformly distributed across all iterations. Use Cilkview to measure the parallelism of your program. (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 17 / 35Example 3: Vector addition Code: const int N = 50 1000 1000; double AN, BN, CN; void vectoradd() cilkfor (int i = 0; i N; ++i) Ai = Bi + Ci; Expectations: Cilkview says that the parallelism is 68,377. This will work great (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 18 / 35Speedup of vector add() (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 19 / 35Bandwidth of the memory system A typical machine: AMD Phenom 920 (Feb. 2009). Cache level daxpy bandwidth L1 19.6 GB/s per core L2 18.3 GB/s per core L3 13.8 GB/s shared DRAM 7.1 GB/s shared daxpy: xi = axi + yi, double precision. The memory bottleneck: A single core can generally saturate most of the memory hierarchy. Multiple cores that access memory will con ict and slow each other down. (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 20 / 35How do you determine if memory is a bottleneck Hard problem: No general solution. Requires guesswork. Two useful techniques: Use a pro ler such as the Intel VTune. Interpreting the output is nontrivial. No sensitivity analysis. Perturb the environment to understand the e ect of the CPU and memory speeds upon the program speed. (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 21 / 35How to perturb the environment Overclock/underclock the processor, e.g. using the power controls. If the program runs at the same speed on a slower processor, then the memory is (probably) a bottleneck. Overclock/underclock the DRAM from the BIOS. If the program runs at the same speed on a slower DRAM, then the memory is not a bottleneck. Add spurious work to your program while keeping the memory accesses constant. RunP independent copies of the serial program concurrently. If they slow each other down then memory is probably a bottleneck. (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 22 / 35Perturbing vector add() const int N = 50 1000 1000; double AN, BN, CN; void vectoradd() cilkfor (int i = 0; i N; ++i) Ai = Bi + Ci; fib(5); // waste time (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 23 / 35Speedup of perturbed vector add() (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 24 / 35Interpreting the perturbed results The memory is a bottleneck: A little extra work (fib(5)) keeps 8 cores busy. A little more extra work (fib(10)) keeps 16 cores busy. Thus, we have enough parallelism. The memory is probably a bottleneck. (If the machine had a shared FPU, the FPU could also be a bottleneck.) OK, but how do you x it vectoradd cannot be xed in isolation. You must generally restructure your program to increase the reuse of cached data. Compare the iterative and recursive matrix multiplication from yesterday. (Or you can buy a newer CPU and faster memory.) (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 25 / 35Lessons learned Memory is a common bottleneck. One way to diagnose bottlenecks is to perturb the program or the environment. Fixing memory bottlenecks usually requires algorithmic changes. (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 26 / 35Example 4: Nested loops Code: const int N = 1000 1000; void innerparallel() for (int i = 0; i N; ++i) cilkfor (int j = 0; j 4; ++j) fib(10); / do some work / Expectations: The inner loop does 4 things in parallel. The parallelism should be about 4. Cilkview says that the parallelism is 3.6. We should see some speedup. (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 27 / 35\Speedup" of inner parallel() (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 28 / 35Interchanging loops Code: const int N = 1000 1000; void outerparallel() cilkfor (int j = 0; j 4; ++j) for (int i = 0; i N; ++i) fib(10); / do some work / Expectations: The outer loop does 4 things in parallel. The parallelism should be about 4. Cilkview says that the parallelism is 4. Same as the previous program, which didn't work. (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 29 / 35Speedup of outer parallel() (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 30 / 35Parallelism vs. burdened parallelism Parallelism: The best speedup you can hope for. Burdened parallelism: Parallelism after accounting for the unavoidable migration overheads. Depends upon: How well we implement the Cilk++ scheduler. How you express the parallelism in your program. Cilkview prints the burdened parallelism: 0.29 for innerparallel(), 4.0 for outerparallel(). In a good program, parallelism and burdened parallelism are about equal. (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 31 / 35What is the burdened parallelism Code: A(); cilkspawn B(); C(); D(); cilksync; E(); Burdened critical path: The burden is (10000) cycles (locks, malloc, cache warmup, reducers, (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 32 / 35 etc.)The burden in our examples (N) spawns/syncs on the critical path (large burden): void innerparallel() for (int i = 0; i N; ++i) cilkfor (int j = 0; j 4; ++j) fib(10); / do some work / (1) spawns/syncs on the critical path (small burden): void outerparallel() cilkfor (int j = 0; j 4; ++j) for (int i = 0; i N; ++i) fib(10); / do some work / (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 33 / 35Lessons learned Insucient parallelism yields no speedup; high burden yields slowdown. Many spawns but small parallelism: suspect large burden. Cilkview helps by printing the burdened span and parallelism. The burden can be interpreted as the number of spawns/syncs on the critical path. If the burdened parallelism and the parallelism are approximately equal, your program is ok. (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 34 / 35Summary and notes We have learned to identify and (when possible) address these problems: High overhead due to small grain size in cilkfor loops. Insucient parallelism. Insucient memory bandwidth. Insucient burdened parallelism. (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 35 / 35