Parallelism on multicore processors

Issues with Multithreaded Parallelism on Multicore Architectures and parallelism on multicore processors ppt
Dr.JakeFinlay Profile Pic
Dr.JakeFinlay,Germany,Teacher
Published Date:22-07-2017
Your Website URL(Optional)
Comment
Issues with Multithreaded Parallelism on Multicore Architectures Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS3101 (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 1 / 35Example 1: a small loop with grain size = 1 Code: const int N = 100 1000 1000; void cilk_for_grainsize_1() pragma cilk_grainsize = 1 cilk_for (int i = 0; i N; ++i) fib(2); Expectations: Parallelism should be large, perhaps (N) or (N= logN). We should see great speedup. (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 3 / 35Speedup is indeed great. . . (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 4 / 35. . . but performance is lousy (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 5 / 35Recall how cilk for is implemented Source: cilk_for (int i = A; i B; ++i) BODY(i) Implementation: void recur(int lo, int hi) if ((hi - lo) GRAINSIZE) int mid = lo + (hi - lo) / 2; cilk_spawn recur(lo, mid); cilk_spawn recur(mid, hi); else for (int i = lo; i hi; ++i) BODY(i); recur(A, B); (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 6 / 35Default grain size Cilk++ chooses a grain size if you don't specify one. void cilk_for_default_grainsize() cilk_for (int i = 0; i N; ++i) fib(2); Cilk++'s heuristic for the grain size:   N grain size = min ; 512 : 8P Generates about 8P parallel leaves. Works well if the loop iterations are not too unbalanced. (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 7 / 35Speedup with default grain size (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 8 / 35Large grain size A large grain size should be even faster, right? void cilk_for_large_grainsize() pragma cilk_grainsize = N cilk_for (int i = 0; i N; ++i) fib(2); Actually, no (except for noise): Grain size Runtime 1 8.55 s default (= 512) 2.44 s 8 N (= 10 ) 2.42 s (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 9 / 35Speedup with grain size =N (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 10 / 35Trade-o between grain size and parallelism Use Cilkview to understand the trade-o : Grain size Parallelism 1 6,951,154 default (= 512) 248,784 8 N (= 10 ) 1 In Cilkview,P = 1:     N N default grain size = min ; 512 = min ; 512 : 8P 8 (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 11 / 35Lessons learned Measure overhead before measuring speedup. Compare 1-processor Cilk++ versus serial code. Small grain size) higher work overhead. Large grain size) less parallelism. The default grain size is designed for small loops that are reasonably balanced. You may want to use a smaller grain size for unbalanced loops or loops with large bodies. Use Cilkview to measure the parallelism of your program. (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 12 / 35Example 2: A for loop that spawns Code: const int N = 10 1000 1000; / empty test function / void f() void for_spawn() for (int i = 0; i N; ++i) cilk_spawn f(); Expectations: I am spawningN parallel things. Parallelism should be (N), right? (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 13 / 35\Speedup" of for spawn() (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 14 / 35Insucient parallelism PPA analysis: PPA says that both work and span are (N). Parallelism is 1:62, independent ofN. Too little parallelism: no speedup. Why is the span (N)? for (int i = 0; i N; ++i) cilk_spawn f(); (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 15 / 35Alternative: a cilk for loop. Code: / empty test function / void f() void test_cilk_for() cilk_for (int i = 0; i N; ++i) f(); PPA analysis: The parallelism is about 2000 (with default grain size). The parallelism is high. As we saw earlier, this kind of loop yields good performance and speedup. (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 16 / 35Lessons learned cilk_for() is di erent from for(...) cilk_spawn. The span of for(...) cilk_spawn is (N). For simple at loops, cilk_for() is generally preferable because it has higher parallelism. However, for(...) cilk_spawn might be better when the work load is not uniformly distributed across all iterations. Use Cilkview to measure the parallelism of your program. (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 17 / 35Example 3: Vector addition Code: const int N = 50 1000 1000; double AN, BN, CN; void vector_add() cilk_for (int i = 0; i N; ++i) Ai = Bi + Ci; Expectations: Cilkview says that the parallelism is 68,377. This will work great (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 18 / 35Speedup of vector add() (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 19 / 35Bandwidth of the memory system A typical machine: AMD Phenom 920 (Feb. 2009). Cache level daxpy bandwidth L1 19.6 GB/s per core L2 18.3 GB/s per core L3 13.8 GB/s shared DRAM 7.1 GB/s shared daxpy: xi = axi + yi, double precision. The memory bottleneck: A single core can generally saturate most of the memory hierarchy. Multiple cores that access memory will con ict and slow each other down. (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 20 / 35How do you determine if memory is a bottleneck? Hard problem: No general solution. Requires guesswork. Two useful techniques: Use a pro ler such as the Intel VTune. Interpreting the output is nontrivial. No sensitivity analysis. Perturb the environment to understand the e ect of the CPU and memory speeds upon the program speed. (Moreno Maza) Issues with Multithreaded Parallelism on Multicore Architectures CS3101 21 / 35