Analysis of Multithreaded Algorithms
Marc Moreno Maza
University of Western Ontario, London, Ontario (Canada)
CS4402-9535
(Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 1 / 27Plan
1 Matrix Multiplication
2
Merge Sort
3
Tableau Construction
(Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 2 / 27Matrix Multiplication
Plan
1 Matrix Multiplication
2
Merge Sort
3
Tableau Construction
(Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 3 / 27Matrix Multiplication
Matrix multiplication
c c c a a a b b b
11 12⋯ 1n 11 12⋯ 1n 11 12⋯ 1n
cc cc cc aa aa aa bb bb bb
21 22⋯ 2n 21 22⋯ 2n 21 22⋯ 2n
=·
⋮⋮ ⋱ ⋮ ⋮⋮ ⋱ ⋮ ⋮⋮ ⋱ ⋮
cc cc cc aa aa aa bb bb bb
n11 n22⋯ nn n11 n22⋯ nn n11 n22⋯ nn
CA B
We will study three approaches:
a naive and iterative one
a divide-and-conquer one
a divide-and-conquer one with memory management consideration
(Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 4 / 27Matrix Multiplication
Naive iterative matrix multiplication
cilk_for (int i=1; in; ++i)
cilk_for (int j=0; jn; ++j)
for (int k=0; kn; ++k
Cij += Aik Bkj;
Work: ?
Span: ?
Parallelism: ?
(Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 5 / 27Matrix Multiplication
Naive iterative matrix multiplication
cilk_for (int i=1; in; ++i)
cilk_for (int j=0; jn; ++j)
for (int k=0; kn; ++k
Cij += Aik Bkj;
3
Work: (n )
Span: (n)
2
Parallelism: (n )
(Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 6 / 27Matrix Multiplication
Matrix multiplication based on block decomposition
C C C C A A A A BB BB
11 12 11 12 11 12
= ·
C C C C A A A A BB BB
21 22 21 22 21 22
A A BB A A BB A A BB A A BB
11 11 11 11 11 11 12 12 12 12 21 21 12 12 22 22
=+
A B A B
A B A B
21 21 11 11 21 21 12 12 22 22 21 21 22 22 22 22
The divide-and-conquer approach is simply the one based on blocking,
presented in the rst lecture.
(Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 7 / 27Matrix Multiplication
Divide-and-conquer matrix multiplication
// C - C + A B
void MMult(T C, T A, T B, int n, int size)
T D = new Tnn;
//base case & partition matrices
cilk_spawn MMult(C11, A11, B11, n/2, size);
cilk_spawn MMult(C12, A11, B12, n/2, size);
cilk_spawn MMult(C22, A21, B12, n/2, size);
cilk_spawn MMult(C21, A21, B11, n/2, size);
cilk_spawn MMult(D11, A12, B21, n/2, size);
cilk_spawn MMult(D12, A12, B22, n/2, size);
cilk_spawn MMult(D22, A22, B22, n/2, size);
MMult(D21, A22, B21, n/2, size);
cilk_sync;
MAdd(C, D, n, size); // C += D;
delete D;
Work ? Span ? Parallelism ?
(Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 8 / 27Matrix Multiplication
Divide-and-conquer matrix multiplication
void MMult(T C, T A, T B, int n, int size)
T D = new Tnn;
//base case & partition matrices
cilk_spawn MMult(C11, A11, B11, n/2, size);
cilk_spawn MMult(C12, A11, B12, n/2, size);
cilk_spawn MMult(C22, A21, B12, n/2, size);
cilk_spawn MMult(C21, A21, B11, n/2, size);
cilk_spawn MMult(D11, A12, B21, n/2, size);
cilk_spawn MMult(D12, A12, B22, n/2, size);
cilk_spawn MMult(D22, A22, B22, n/2, size);
MMult(D21, A22, B21, n/2, size);
cilk_sync; MAdd(C, D, n, size); // C += D;
delete D;
A (n) and M (n): times on p proc. for nn Add and Mult.
p p
2
A (n) = 4A (n=2) + (1) = (n )
1 1
A (n) = A (n=2) + (1) = (lgn)
1 1
2 3
M (n) = 8M (n=2) +A (n) = 8M (n=2) + (n ) = (n )
1 1 1 1
2
M (n) = M (n=2) + (lgn) = (lg n)
1 1
3 2
M (n)=M (n) = (n = lg n)
1 1
(Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 9 / 27Matrix Multiplication
Divide-and-conquer matrix multiplication: No temporaries
template typename T
void MMult2(T C, T A, T B, int n, int size)
//base case & partition matrices
cilk_spawn MMult2(C11, A11, B11, n/2, size);
cilk_spawn MMult2(C12, A11, B12, n/2, size);
cilk_spawn MMult2(C22, A21, B12, n/2, size);
MMult2(C21, A21, B11, n/2, size);
cilk_sync;
cilk_spawn MMult2(C11, A12, B21, n/2, size);
cilk_spawn MMult2(C12, A12, B22, n/2, size);
cilk_spawn MMult2(C22, A22, B22, n/2, size);
MMult2(C21, A22, B21, n/2, size);
cilk_sync;
Work ? Span ? Parallelism ?
(Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 10 / 27Matrix Multiplication
Divide-and-conquer matrix multiplication: No temporaries
template typename T
void MMult2(T C, T A, T B, int n, int size)
//base case & partition matrices
cilk_spawn MMult2(C11, A11, B11, n/2, size);
cilk_spawn MMult2(C12, A11, B12, n/2, size);
cilk_spawn MMult2(C22, A21, B12, n/2, size);
MMult2(C21, A21, B11, n/2, size);
cilk_sync;
cilk_spawn MMult2(C11, A12, B21, n/2, size);
cilk_spawn MMult2(C12, A12, B22, n/2, size);
cilk_spawn MMult2(C22, A22, B22, n/2, size);
MMult2(C21, A22, B21, n/2, size);
cilk_sync;
MA (n): time on p proc. for nn Mult-Add.
p
3
MA (n) = (n )
1
MA (n) = 2MA (n=2) + (1) = (n)
1 1
2
MA (n)=MA (n) = (n )
1 1
Besides, saving space often saves time due to hierarchical memory.
(Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 11 / 27Merge Sort
Plan
1 Matrix Multiplication
2
Merge Sort
3
Tableau Construction
(Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 12 / 27Merge Sort
Merging two sorted arrays
void Merge(T C, T A, T B, int na, int nb)
while (na0 && nb0)
if (A = B)
C++ = A++; na;
else
C++ = B++; nb;
while (na0)
C++ = A++; na;
while (nb0)
C++ = B++; nb;
Time for merging n elements is (n).
33 12 12 19 19 46 46
(Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 13 / 27
44 14 14 21 21 23 23Merge Sort
Merge sort
3 3 4 4 12 12 14 14 19 19 21 21 33 33 46 46
merge
3 12 19 46 4 14 21 33
merge
31 192 46 433 1421
merge g
19 3 12 46 33 4 21 14
(Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 14 / 27Merge Sort
Parallel merge sort with serial merge
template typename T
void MergeSort(T B, T A, int n)
if (n==1)
B0 = A0;
else
T Cn;
cilk_spawn MergeSort(C, A, n/2);
MergeSort(C+n/2, A+n/2, n-n/2);
cilk_sync;
Merge(B, C, C+n/2, n/2, n-n/2);
Work?
Span?
(Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 15 / 27Merge Sort
Parallel merge sort with serial merge
template typename T
void MergeSort(T B, T A, int n)
if (n==1)
B0 = A0;
else
T Cn;
cilk_spawn MergeSort(C, A, n/2);
MergeSort(C+n/2, A+n/2, n-n/2);
cilk_sync;
Merge(B, C, C+n/2, n/2, n-n/2);
T (n) = 2T (n=2) + (n) thus T (n) == (n lgn).
1 1 1
T (n) = T (n=2) + (n) thus T (n) = (n).
1 1 1
T (n)=T (n) = (lgn). Puny parallelism
1 1
We need to parallelize the merge
(Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 16 / 27Merge Sort
Parallel merge
0n ma = na/2a
A A
≤≤ Ama Ama≥≥ Ama Ama
Recursive Recursive
Bina Binary ry Search Search
P_Merge P_Merge
na na≥≥ nb nb
B B≤≤ Ama Ama
≥≥ Ama Ama
0n mb-1 mb b
Idea: if the total number of elements to be sorted in n = n +n then the
a
b
maximum number of elements in any of the two merges is at most 3n=4.
(Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 17 / 27Merge Sort
Parallel merge
template typename T
void P_Merge(T C, T A, T B, int na, int nb)
if (na nb)
P_Merge(C, B, A, nb, na);
else if (na==0)
return;
else
int ma = na/2;
int mb = BinarySearch(Ama, B, nb);
Cma+mb = Ama;
cilk_spawn P_Merge(C, A, B, ma, mb);
P_Merge(C+ma+mb+1, A+ma+1, B+mb, na-ma-1, nb-mb);
cilk_sync;
One should coarse the base case for eciency.
Work? Span?
(Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 18 / 27Merge Sort
Parallel merge
template typename T
void P_Merge(T C, T A, T B, int na, int nb)
if (na nb)
P_Merge(C, B, A, nb, na);
else if (na==0)
return;
else
int ma = na/2;
int mb = BinarySearch(Ama, B, nb);
Cma+mb = Ama;
cilk_spawn P_Merge(C, A, B, ma, mb);
P_Merge(C+ma+mb+1, A+ma+1, B+mb, na-ma-1, nb-mb);
cilk_sync;
Let PM (n) be the p-processor running time of P-Merge.
p
In the worst case, the span of P-Merge is
2
PM (n) PM (3n=4) + (lgn) = (lg n)
1 1
The worst-case work of P-Merge satises the recurrence
PM (n) PM (n) +PM ((1 )n) + (lgn)
1 1 1
(Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 19 / 27
, where is a constant in the range 1=4 3=4.Merge Sort
Analyzing parallel merge
Recall PM (n) PM (n) +PM ((1 )n) + (lgn) for some
1 1 1
1=4 3=4.
To solve this hairy equation we use the substitution method.
We assume there exist some constants a;b 0 such that
PM (n) an b lgn holds for all 1=4 3=4.
1
After substitution, this hypothesis implies:
PM (n) an b lgn b lgn + (lgn).
1
We can pick b large enough such that we have PM (n) an b lgn
1
for all 1=4 3=4 and all n 1/
Then pick a large enough to satisfy the base conditions.
Finally we have PM (n) = (n).
1
(Moreno Maza) Analysis of Multithreaded Algorithms CS4402-9535 20 / 27
Advise:Why You Wasting Money in Costly SEO Tools, Use World's Best Free SEO Tool Ubersuggest.