Analysis of Multithreaded Algorithms

Analysis of Multithreaded Algorithms
Dr.JakeFinlay Profile Pic
Dr.JakeFinlay,Germany,Teacher
Published Date:22-07-2017
Your Website URL(Optional)
Comment
Analysis of Multithreaded Algorithms Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS4402-9535 (Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 1 / 27Plan 1 Matrix Multiplication 2 Merge Sort 3 Tableau Construction (Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 2 / 27Matrix Multiplication Plan 1 Matrix Multiplication 2 Merge Sort 3 Tableau Construction (Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 3 / 27Matrix Multiplication Matrix multiplication c c c a a a b b b 11 12⋯ 1n 11 12⋯ 1n 11 12⋯ 1n cc cc cc aa aa aa bb bb bb 21 22⋯ 2n 21 22⋯ 2n 21 22⋯ 2n =· ⋮⋮ ⋱ ⋮ ⋮⋮ ⋱ ⋮ ⋮⋮ ⋱ ⋮ cc cc cc aa aa aa bb bb bb n11 n22⋯ nn n11 n22⋯ nn n11 n22⋯ nn CA B We will study three approaches: a naive and iterative one a divide-and-conquer one a divide-and-conquer one with memory management consideration (Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 4 / 27Matrix Multiplication Naive iterative matrix multiplication cilk_for (int i=1; in; ++i) cilk_for (int j=0; jn; ++j) for (int k=0; kn; ++k Cij += Aik Bkj; Work: ? Span: ? Parallelism: ? (Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 5 / 27Matrix Multiplication Naive iterative matrix multiplication cilk_for (int i=1; in; ++i) cilk_for (int j=0; jn; ++j) for (int k=0; kn; ++k Cij += Aik Bkj; 3 Work: (n ) Span: (n) 2 Parallelism: (n ) (Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 6 / 27Matrix Multiplication Matrix multiplication based on block decomposition C C C C A A A A BB BB 11 12 11 12 11 12 = · C C C C A A A A BB BB 21 22 21 22 21 22 A A BB A A BB A A BB A A BB 11 11 11 11 11 11 12 12 12 12 21 21 12 12 22 22 =+ A B A B A B A B 21 21 11 11 21 21 12 12 22 22 21 21 22 22 22 22 The divide-and-conquer approach is simply the one based on blocking, presented in the rst lecture. (Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 7 / 27Matrix Multiplication Divide-and-conquer matrix multiplication // C - C + A B void MMult(T C, T A, T B, int n, int size) T D = new Tnn; //base case & partition matrices cilk_spawn MMult(C11, A11, B11, n/2, size); cilk_spawn MMult(C12, A11, B12, n/2, size); cilk_spawn MMult(C22, A21, B12, n/2, size); cilk_spawn MMult(C21, A21, B11, n/2, size); cilk_spawn MMult(D11, A12, B21, n/2, size); cilk_spawn MMult(D12, A12, B22, n/2, size); cilk_spawn MMult(D22, A22, B22, n/2, size); MMult(D21, A22, B21, n/2, size); cilk_sync; MAdd(C, D, n, size); // C += D; delete D; Work ? Span ? Parallelism ? (Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 8 / 27Matrix Multiplication Divide-and-conquer matrix multiplication void MMult(T C, T A, T B, int n, int size) T D = new Tnn; //base case & partition matrices cilk_spawn MMult(C11, A11, B11, n/2, size); cilk_spawn MMult(C12, A11, B12, n/2, size); cilk_spawn MMult(C22, A21, B12, n/2, size); cilk_spawn MMult(C21, A21, B11, n/2, size); cilk_spawn MMult(D11, A12, B21, n/2, size); cilk_spawn MMult(D12, A12, B22, n/2, size); cilk_spawn MMult(D22, A22, B22, n/2, size); MMult(D21, A22, B21, n/2, size); cilk_sync; MAdd(C, D, n, size); // C += D; delete D; A (n) and M (n): times on p proc. for nn Add and Mult. p p 2 A (n) = 4A (n=2) + (1) = (n ) 1 1 A (n) = A (n=2) + (1) = (lgn) 1 1 2 3 M (n) = 8M (n=2) +A (n) = 8M (n=2) + (n ) = (n ) 1 1 1 1 2 M (n) = M (n=2) + (lgn) = (lg n) 1 1 3 2 M (n)=M (n) = (n = lg n) 1 1 (Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 9 / 27Matrix Multiplication Divide-and-conquer matrix multiplication: No temporaries template typename T void MMult2(T C, T A, T B, int n, int size) //base case & partition matrices cilk_spawn MMult2(C11, A11, B11, n/2, size); cilk_spawn MMult2(C12, A11, B12, n/2, size); cilk_spawn MMult2(C22, A21, B12, n/2, size); MMult2(C21, A21, B11, n/2, size); cilk_sync; cilk_spawn MMult2(C11, A12, B21, n/2, size); cilk_spawn MMult2(C12, A12, B22, n/2, size); cilk_spawn MMult2(C22, A22, B22, n/2, size); MMult2(C21, A22, B21, n/2, size); cilk_sync; Work ? Span ? Parallelism ? (Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 10 / 27Matrix Multiplication Divide-and-conquer matrix multiplication: No temporaries template typename T void MMult2(T C, T A, T B, int n, int size) //base case & partition matrices cilk_spawn MMult2(C11, A11, B11, n/2, size); cilk_spawn MMult2(C12, A11, B12, n/2, size); cilk_spawn MMult2(C22, A21, B12, n/2, size); MMult2(C21, A21, B11, n/2, size); cilk_sync; cilk_spawn MMult2(C11, A12, B21, n/2, size); cilk_spawn MMult2(C12, A12, B22, n/2, size); cilk_spawn MMult2(C22, A22, B22, n/2, size); MMult2(C21, A22, B21, n/2, size); cilk_sync; MA (n): time on p proc. for nn Mult-Add. p 3 MA (n) = (n ) 1 MA (n) = 2MA (n=2) + (1) = (n) 1 1 2 MA (n)=MA (n) = (n ) 1 1 Besides, saving space often saves time due to hierarchical memory. (Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 11 / 27Merge Sort Plan 1 Matrix Multiplication 2 Merge Sort 3 Tableau Construction (Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 12 / 27Merge Sort Merging two sorted arrays void Merge(T C, T A, T B, int na, int nb) while (na0 && nb0) if (A = B) C++ = A++; na; else C++ = B++; nb; while (na0) C++ = A++; na; while (nb0) C++ = B++; nb; Time for merging n elements is (n). 33 12 12 19 19 46 46 (Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 13 / 27 44 14 14 21 21 23 23Merge Sort Merge sort 3 3 4 4 12 12 14 14 19 19 21 21 33 33 46 46 merge 3 12 19 46 4 14 21 33 merge 31 192 46 433 1421 merge g 19 3 12 46 33 4 21 14 (Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 14 / 27Merge Sort Parallel merge sort with serial merge template typename T void MergeSort(T B, T A, int n) if (n==1) B0 = A0; else T Cn; cilk_spawn MergeSort(C, A, n/2); MergeSort(C+n/2, A+n/2, n-n/2); cilk_sync; Merge(B, C, C+n/2, n/2, n-n/2); Work? Span? (Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 15 / 27Merge Sort Parallel merge sort with serial merge template typename T void MergeSort(T B, T A, int n) if (n==1) B0 = A0; else T Cn; cilk_spawn MergeSort(C, A, n/2); MergeSort(C+n/2, A+n/2, n-n/2); cilk_sync; Merge(B, C, C+n/2, n/2, n-n/2); T (n) = 2T (n=2) + (n) thus T (n) == (n lgn). 1 1 1 T (n) = T (n=2) + (n) thus T (n) = (n). 1 1 1 T (n)=T (n) = (lgn). Puny parallelism 1 1 We need to parallelize the merge (Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 16 / 27Merge Sort Parallel merge 0n ma = na/2a A A ≤≤ Ama Ama≥≥ Ama Ama Recursive Recursive Bina Binary ry Search Search P_Merge P_Merge na na≥≥ nb nb B B≤≤ Ama Ama ≥≥ Ama Ama 0n mb-1 mb b Idea: if the total number of elements to be sorted in n = n +n then the a b maximum number of elements in any of the two merges is at most 3n=4. (Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 17 / 27Merge Sort Parallel merge template typename T void P_Merge(T C, T A, T B, int na, int nb) if (na nb) P_Merge(C, B, A, nb, na); else if (na==0) return; else int ma = na/2; int mb = BinarySearch(Ama, B, nb); Cma+mb = Ama; cilk_spawn P_Merge(C, A, B, ma, mb); P_Merge(C+ma+mb+1, A+ma+1, B+mb, na-ma-1, nb-mb); cilk_sync; One should coarse the base case for eciency. Work? Span? (Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 18 / 27Merge Sort Parallel merge template typename T void P_Merge(T C, T A, T B, int na, int nb) if (na nb) P_Merge(C, B, A, nb, na); else if (na==0) return; else int ma = na/2; int mb = BinarySearch(Ama, B, nb); Cma+mb = Ama; cilk_spawn P_Merge(C, A, B, ma, mb); P_Merge(C+ma+mb+1, A+ma+1, B+mb, na-ma-1, nb-mb); cilk_sync; Let PM (n) be the p-processor running time of P-Merge. p In the worst case, the span of P-Merge is 2 PM (n) PM (3n=4) + (lgn) = (lg n) 1 1 The worst-case work of P-Merge satis es the recurrence PM (n) PM ( n) +PM ((1 )n) + (lgn) 1 1 1 (Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 19 / 27 , where is a constant in the range 1=4  3=4.Merge Sort Analyzing parallel merge Recall PM (n) PM ( n) +PM ((1 )n) + (lgn) for some 1 1 1 1=4  3=4. To solve this hairy equation we use the substitution method. We assume there exist some constants a;b 0 such that PM (n) anb lgn holds for all 1=4  3=4. 1 After substitution, this hypothesis implies: PM (n) anb lgnb lgn + (lgn). 1 We can pick b large enough such that we have PM (n) anb lgn 1 for all 1=4  3=4 and all n 1/ Then pick a large enough to satisfy the base conditions. Finally we have PM (n) = (n). 1 (Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 20 / 27