Question? Leave a message!




Analysis of Multithreaded Algorithms

Analysis of Multithreaded Algorithms
Analysis of Multithreaded Algorithms Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS44029535 (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 1 / 27Plan 1 Matrix Multiplication 2 Merge Sort 3 Tableau Construction (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 2 / 27Matrix Multiplication Plan 1 Matrix Multiplication 2 Merge Sort 3 Tableau Construction (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 3 / 27Matrix Multiplication Matrix multiplication c c c a a a b b b 11 12⋯ 1n 11 12⋯ 1n 11 12⋯ 1n cc cc cc aa aa aa bb bb bb 21 22⋯ 2n 21 22⋯ 2n 21 22⋯ 2n =· ⋮⋮ ⋱ ⋮ ⋮⋮ ⋱ ⋮ ⋮⋮ ⋱ ⋮ cc cc cc aa aa aa bb bb bb n11 n22⋯ nn n11 n22⋯ nn n11 n22⋯ nn CA B We will study three approaches: a naive and iterative one a divideandconquer one a divideandconquer one with memory management consideration (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 4 / 27Matrix Multiplication Naive iterative matrix multiplication cilkfor (int i=1; in; ++i) cilkfor (int j=0; jn; ++j) for (int k=0; kn; ++k Cij += Aik Bkj; Work: Span: Parallelism: (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 5 / 27Matrix Multiplication Naive iterative matrix multiplication cilkfor (int i=1; in; ++i) cilkfor (int j=0; jn; ++j) for (int k=0; kn; ++k Cij += Aik Bkj; 3 Work: (n ) Span: (n) 2 Parallelism: (n ) (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 6 / 27Matrix Multiplication Matrix multiplication based on block decomposition C C C C A A A A BB BB 11 12 11 12 11 12 = · C C C C A A A A BB BB 21 22 21 22 21 22 A A BB A A BB A A BB A A BB 11 11 11 11 11 11 12 12 12 12 21 21 12 12 22 22 =+ A B A B A B A B 21 21 11 11 21 21 12 12 22 22 21 21 22 22 22 22 The divideandconquer approach is simply the one based on blocking, presented in the rst lecture. (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 7 / 27Matrix Multiplication Divideandconquer matrix multiplication // C C + A B void MMult(T C, T A, T B, int n, int size) T D = new Tnn; //base case partition matrices cilkspawn MMult(C11, A11, B11, n/2, size); cilkspawn MMult(C12, A11, B12, n/2, size); cilkspawn MMult(C22, A21, B12, n/2, size); cilkspawn MMult(C21, A21, B11, n/2, size); cilkspawn MMult(D11, A12, B21, n/2, size); cilkspawn MMult(D12, A12, B22, n/2, size); cilkspawn MMult(D22, A22, B22, n/2, size); MMult(D21, A22, B21, n/2, size); cilksync; MAdd(C, D, n, size); // C += D; delete D; Work Span Parallelism (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 8 / 27Matrix Multiplication Divideandconquer matrix multiplication void MMult(T C, T A, T B, int n, int size) T D = new Tnn; //base case partition matrices cilkspawn MMult(C11, A11, B11, n/2, size); cilkspawn MMult(C12, A11, B12, n/2, size); cilkspawn MMult(C22, A21, B12, n/2, size); cilkspawn MMult(C21, A21, B11, n/2, size); cilkspawn MMult(D11, A12, B21, n/2, size); cilkspawn MMult(D12, A12, B22, n/2, size); cilkspawn MMult(D22, A22, B22, n/2, size); MMult(D21, A22, B21, n/2, size); cilksync; MAdd(C, D, n, size); // C += D; delete D; A (n) and M (n): times on p proc. for nn Add and Mult. p p 2 A (n) = 4A (n=2) + (1) = (n ) 1 1 A (n) = A (n=2) + (1) = (lgn) 1 1 2 3 M (n) = 8M (n=2) +A (n) = 8M (n=2) + (n ) = (n ) 1 1 1 1 2 M (n) = M (n=2) + (lgn) = (lg n) 1 1 3 2 M (n)=M (n) = (n = lg n) 1 1 (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 9 / 27Matrix Multiplication Divideandconquer matrix multiplication: No temporaries template typename T void MMult2(T C, T A, T B, int n, int size) //base case partition matrices cilkspawn MMult2(C11, A11, B11, n/2, size); cilkspawn MMult2(C12, A11, B12, n/2, size); cilkspawn MMult2(C22, A21, B12, n/2, size); MMult2(C21, A21, B11, n/2, size); cilksync; cilkspawn MMult2(C11, A12, B21, n/2, size); cilkspawn MMult2(C12, A12, B22, n/2, size); cilkspawn MMult2(C22, A22, B22, n/2, size); MMult2(C21, A22, B21, n/2, size); cilksync; Work Span Parallelism (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 10 / 27Matrix Multiplication Divideandconquer matrix multiplication: No temporaries template typename T void MMult2(T C, T A, T B, int n, int size) //base case partition matrices cilkspawn MMult2(C11, A11, B11, n/2, size); cilkspawn MMult2(C12, A11, B12, n/2, size); cilkspawn MMult2(C22, A21, B12, n/2, size); MMult2(C21, A21, B11, n/2, size); cilksync; cilkspawn MMult2(C11, A12, B21, n/2, size); cilkspawn MMult2(C12, A12, B22, n/2, size); cilkspawn MMult2(C22, A22, B22, n/2, size); MMult2(C21, A22, B21, n/2, size); cilksync; MA (n): time on p proc. for nn MultAdd. p 3 MA (n) = (n ) 1 MA (n) = 2MA (n=2) + (1) = (n) 1 1 2 MA (n)=MA (n) = (n ) 1 1 Besides, saving space often saves time due to hierarchical memory. (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 11 / 27Merge Sort Plan 1 Matrix Multiplication 2 Merge Sort 3 Tableau Construction (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 12 / 27Merge Sort Merging two sorted arrays void Merge(T C, T A, T B, int na, int nb) while (na0 nb0) if (A = B) C++ = A++; na; else C++ = B++; nb; while (na0) C++ = A++; na; while (nb0) C++ = B++; nb; Time for merging n elements is (n). 33 12 12 19 19 46 46 (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 13 / 27 44 14 14 21 21 23 23Merge Sort Merge sort 3 3 4 4 12 12 14 14 19 19 21 21 33 33 46 46 merge 3 12 19 46 4 14 21 33 merge 31 192 46 433 1421 merge g 19 3 12 46 33 4 21 14 (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 14 / 27Merge Sort Parallel merge sort with serial merge template typename T void MergeSort(T B, T A, int n) if (n==1) B0 = A0; else T Cn; cilkspawn MergeSort(C, A, n/2); MergeSort(C+n/2, A+n/2, nn/2); cilksync; Merge(B, C, C+n/2, n/2, nn/2); Work Span (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 15 / 27Merge Sort Parallel merge sort with serial merge template typename T void MergeSort(T B, T A, int n) if (n==1) B0 = A0; else T Cn; cilkspawn MergeSort(C, A, n/2); MergeSort(C+n/2, A+n/2, nn/2); cilksync; Merge(B, C, C+n/2, n/2, nn/2); T (n) = 2T (n=2) + (n) thus T (n) == (n lgn). 1 1 1 T (n) = T (n=2) + (n) thus T (n) = (n). 1 1 1 T (n)=T (n) = (lgn). Puny parallelism 1 1 We need to parallelize the merge (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 16 / 27Merge Sort Parallel merge 0n ma = na/2a A A ≤≤ Ama Ama≥≥ Ama Ama Recursive Recursive Bina Binary ry Search Search PMerge PMerge na na≥≥ nb nb B B≤≤ Ama Ama ≥≥ Ama Ama 0n mb1 mb b Idea: if the total number of elements to be sorted in n = n +n then the a b maximum number of elements in any of the two merges is at most 3n=4. (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 17 / 27Merge Sort Parallel merge template typename T void PMerge(T C, T A, T B, int na, int nb) if (na nb) PMerge(C, B, A, nb, na); else if (na==0) return; else int ma = na/2; int mb = BinarySearch(Ama, B, nb); Cma+mb = Ama; cilkspawn PMerge(C, A, B, ma, mb); PMerge(C+ma+mb+1, A+ma+1, B+mb, nama1, nbmb); cilksync; One should coarse the base case for eciency. Work Span (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 18 / 27Merge Sort Parallel merge template typename T void PMerge(T C, T A, T B, int na, int nb) if (na nb) PMerge(C, B, A, nb, na); else if (na==0) return; else int ma = na/2; int mb = BinarySearch(Ama, B, nb); Cma+mb = Ama; cilkspawn PMerge(C, A, B, ma, mb); PMerge(C+ma+mb+1, A+ma+1, B+mb, nama1, nbmb); cilksync; Let PM (n) be the pprocessor running time of PMerge. p In the worst case, the span of PMerge is 2 PM (n) PM (3n=4) + (lgn) = (lg n) 1 1 The worstcase work of PMerge satis es the recurrence PM (n) PM ( n) +PM ((1 )n) + (lgn) 1 1 1 (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 19 / 27 , where is a constant in the range 1=4  3=4.Merge Sort Analyzing parallel merge Recall PM (n) PM ( n) +PM ((1 )n) + (lgn) for some 1 1 1 1=4  3=4. To solve this hairy equation we use the substitution method. We assume there exist some constants a;b 0 such that PM (n) anb lgn holds for all 1=4  3=4. 1 After substitution, this hypothesis implies: PM (n) anb lgnb lgn + (lgn). 1 We can pick b large enough such that we have PM (n) anb lgn 1 for all 1=4  3=4 and all n 1/ Then pick a large enough to satisfy the base conditions. Finally we have PM (n) = (n). 1 (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 20 / 27Merge Sort Parallel merge sort with parallel merge template typename T void PMergeSort(T B, T A, int n) if (n==1) B0 = A0; else T Cn; cilkspawn PMergeSort(C, A, n/2); PMergeSort(C+n/2, A+n/2, nn/2); cilksync; PMerge(B, C, C+n/2, n/2, nn/2); Work Span (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 21 / 27Merge Sort Parallel merge sort with parallel merge template typename T void PMergeSort(T B, T A, int n) if (n==1) B0 = A0; else T Cn; cilkspawn PMergeSort(C, A, n/2); PMergeSort(C+n/2, A+n/2, nn/2); cilksync; PMerge(B, C, C+n/2, n/2, nn/2); The work satis es T (n) = 2T (n=2) + (n) (as usual) and we have 1 1 T (n) = (nlog(n)). 1 The worst case criticalpath length of the MergeSort now satis es 2 3 T (n) = T (n=2) + (lg n) = (lg n) 1 1 . 3 2 The parallelism is now (n lgn)=(lg n) = (n= lg n). (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 22 / 27Tableau Construction Plan 1 Matrix Multiplication 2 Merge Sort 3 Tableau Construction (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 23 / 27Tableau Construction Tableau construction 00 01 02 03 04 05 06 07 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 20 21 22 23 24 25 26 27 30 30 31 31 32 32 33 33 34 34 35 35 36 36 37 37 40 41 42 43 44 45 46 47 50 51 52 53 54 55 56 57 60 61 62 63 64 65 66 67 70 71 72 73 74 75 76 77 Constructing a tableau A satisfying a relation of the form: Ai;j = R(Ai 1;j;Ai 1;j 1;Ai;j 1): (1) 2 The work is (n ). (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 24 / 27Tableau Construction Recursive construction n Parallel code I; cilkspawn II; III;; II II II cilksync; n IV; III IV 2 T (n) = 4T (n=2) + (1), thus T (n) = (n ). 1 1 1 log 3 2 T (n) = 3T (n=2) + (1), thus T (n) = (n ). 1 1 1 2log 3 0:41 2 Parallelism: (n ) = (n ). (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 25 / 27Tableau Construction A more parallel construction n I; cilk ilkspawn II II; III; I II IV cilksync; cilk cilkspawn spawn IV; IV; cilkspawn V; n VI; III V VII cilk cilksync sync;; cilkspawn VII; VIII; VI VIII IX cilksync; IX IX; 2 T (n) = 9T (n=3) + (1), thus T (n) = (n ). 1 1 1 log 5 3 T (n) = 5T (n=3) + (1), thus T (n) = (n ). 1 1 1 2log 5 0:53 3 Parallelism: (n ) = (n ). This nineway dnc has more parallelism than the four way but exhibits more cache complexity (more on this later). (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 26 / 27Tableau Construction Acknowledgements Charles E. Leiserson (MIT) for providing me with the sources of its lecture notes. Matteo Frigo (Intel) for supporting the work of my team with Cilk++ and o ering us the next lecture. Yuzhen Xie (UWO) for helping me with the images used in these slides. Liyun Li (UWO) for generating the experimental data. (Moreno Maza) Analysis of Multithreaded Algorithms CS44029535 27 / 27
Website URL
Comment