



Parallel Random-Access Machines

Marc Moreno Maza
University of Western Ontario, London, Ontario (Canada)
CS3101

Plan
1. The PRAM Model
2. Performance counters
3. Handling Shared Memory Access Conflicts: PRAM submodels
4. Simulation of large PRAMs on small PRAMs
5. Comparing the Computational Power of PRAM Submodels
6. More PRAM algorithms and exercises

The PRAM Model

The RAM Model

Recall that the Random Access Machine is a convenient model of a sequential computer. Its features are as follows.
- The computation unit executes a user-defined program.
- It uses a read-only input tape and a write-only output tape.
- The RAM has an unbounded number of local memory cells.
- Each memory cell can hold an integer of unbounded size.
- The instruction set includes operations for: moving data between memory cells, comparisons and conditional branching, additions, subtractions, multiplications.
- Execution starts with the first instruction of the program and terminates when a Halt instruction is reached.
- Each operation takes one time unit regardless of the operand sizes.

Time complexity = the number of instructions executed.
Space complexity = the number of memory cells accessed.

The PRAM Model: Definition (1/6)

Architecture. The Parallel Random Access Machine is a natural generalization of the RAM. It is also an idealization of a shared memory machine. Its features are as follows.
- It holds an unbounded collection of RAM processors P_0, P_1, P_2, ... without tapes.
- It holds an unbounded collection of shared memory cells M_0, M_1, M_2, ...
- Each processor P_i has its own (unbounded) local memory (register set), and P_i knows its index i.
- Each processor P_i can access any shared memory cell M_j in unit time, unless there is a conflict (see further).

The PRAM Model: Definition (2/6) (figure)

The PRAM Model: Definition (3/6)

Program execution (1/2)
- The input of a PRAM program consists of n items stored in M_0, ..., M_{n-1}.
- The output of a PRAM program consists of n' items stored in n' memory cells, say M_n, ..., M_{n+n'-1}.
- A PRAM instruction executes in a 3-phase cycle:
  1. Read (if needed) from a shared memory cell,
  2. Compute locally (if needed),
  3. Write in a shared memory cell (if needed).
- All processors execute their 3-phase cycles synchronously.
- Special assumptions have to be made in order to resolve shared memory access conflicts.
- The only way processors can exchange data is by writing into and reading from memory cells.

The PRAM Model: Definition (4/6)

Program execution (2/2)
- P_0 has a special activation register specifying the maximum index of an active processor:
  1. Initially, only P_0 is active; it computes the number of required active processors,
  2. Then, P_0 loads this number in the activation register,
  3. The corresponding processors start executing their programs.
- Computations proceed until P_0 halts, at which time all other active processors are halted.
- Parallel time complexity = the time for P_0's computations.
- Parallel space complexity = the maximum number of shared memory cells in use during the computations.
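To make the synchronous 3-phase cycle concrete, here is a minimal sketch in Python (not part of the original slides; the `pram_step` helper and its interface are illustrative assumptions). It simulates one PRAM step by completing all reads before any write becomes visible, which is exactly what the cycle guarantees.

```python
# Illustrative sketch of one synchronous PRAM step (not from the slides).
# Each processor is described by (read_addr, compute, write_addr).
# All reads happen against the old memory; writes land only afterwards.

def pram_step(memory, processors):
    """Simulate one 3-phase cycle under the EREW assumption
    (read/write addresses of distinct processors are distinct)."""
    reads = [memory[r] for (r, _, _) in processors]              # phase 1: READ
    results = [f(v) for (_, f, _), v in zip(processors, reads)]  # phase 2: COMPUTE
    for (_, _, w), v in zip(processors, results):                # phase 3: WRITE
        memory[w] = v
    return memory

# Example: two processors swap M[0] and M[1] in a single step, which
# would be impossible if reads and writes were interleaved.
mem = {0: "a", 1: "b"}
pram_step(mem, [(0, lambda v: v, 1), (1, lambda v: v, 0)])
```

Because phase 1 finishes before phase 3 starts, both processors read the old values and the swap succeeds in one step.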
The PRAM Model: Definition (6/6)

Summary of main assumptions:
- Inputs/outputs are placed in the shared memory.
- A memory cell stores an arbitrarily large integer.
- Each instruction takes unit time.
- Instructions are synchronized across the processors.

PRAM complexity measures for each individual processor:
- time: number of instructions executed,
- space: number of memory cells accessed.

PRAM machine:
- time: time taken by the longest running processor,
- hardware: maximum number of active processors.

The PRAM Model: Remarks

The PRAM model is attractive for designing parallel algorithms:
- It is natural: the number of operations executed per one cycle on p processors is at most p.
- It is strong: any processor can read or write any shared memory cell in unit time.
- It is simple: it ignores any communication or synchronization overhead.

This natural, strong and simple PRAM model can be used as a benchmark: if a problem has no feasible (or efficient) solution on a PRAM, then it is likely that it has no feasible (or efficient) solution on any parallel machine.

The PRAM model is an idealization of existing shared memory parallel machines. The PRAM ignores lower level architecture constraints (memory access overhead, synchronization overhead, intercommunication throughput, connectivity, speed limits, etc.).

Constrained PRAM Models (1/2)

A small-memory PRAM satisfies the axioms of a PRAM except that it has a bounded number of shared memory cells. An m-cell PRAM is a small-memory PRAM with m shared memory cells.

If the input (or output) data set exceeds the capacity of the shared memory, then this data can be distributed evenly among the registers of the processors.
Limiting the amount of shared memory corresponds to restricting the amount of information that can be communicated between processors in one step. For example, a distributed memory machine with processors interconnected by a shared bus can be modeled as a PRAM with a single shared memory cell.

Constrained PRAM Models (2/2)

A small PRAM satisfies the axioms of a PRAM except that it has a bounded number of processors. A p-processor PRAM is a small PRAM with p + 1 processors (counting P_0).

Performance counters

Performance counters (1/8)

Recall that the parallel time, denoted by T(n,p), is the time elapsed from the start of a parallel computation to the moment where the last processor finishes, on input data of size n, using p processors. T(n,p) takes into account:
- computational steps (such as adding, multiplying, swapping variables),
- routing (or communication) steps (such as transferring and exchanging information between processors).

Performance counters (2/8)

Example 1. Parallel search of an item x in an unsorted input file with n items, in a shared memory with p processors, where any cell can be accessed by only one processor at a time. Broadcasting x costs O(log(p)), leading to

  T(n,p) = O(log(p)) + O(n/p).

Performance counters (3/8)

Definition. The parallel efficiency, denoted by E(n,p), is

  E(n,p) = SU(n) / (p · T(n,p)),

where SU(n) is a lower bound for a sequential execution.
Observe that we have SU(n) ≤ p · T(n,p) and thus E(n,p) ≤ 1. One also often considers the speedup factor defined by

  S(n,p) = SU(n) / T(n,p).

Performance counters (4/8)

Remark. Reasons for inefficiency:
- large communication latency compared to computational performance (it would be better to calculate locally rather than remotely),
- too big an overhead in synchronization,
- poor coordination, poor load distribution (processors must wait for dependent data),
- lack of useful work to do (too many processors for too little work).

Performance counters (5/8)

The work is defined by W(n,p) = a_{t_start} + ... + a_{t_end}, where a_t is the number of active processors at time t.

A data-processor isoefficiency function is an asymptotically maximal function f_1 such that for all p_0 > 0 there exists n_0 such that for all n ≥ n_0 we have E(n, f_1(n)) ≥ E(n_0, p_0).

A processor-data isoefficiency function is an asymptotically minimal function f_2 such that for all n_0 > 0 there exists p_0 such that for all p ≥ p_0 we have E(f_2(p), p) ≥ E(n_0, p_0).

The isoefficiency function f_2 quantifies the growth rate of the problem size required to keep the efficiency fixed while increasing the number of processors. It reflects the ability of a parallel algorithm to maintain a constant efficiency. A large isoefficiency function f_2 indicates poor scalability, whereas a small one indicates that only a small increment in the problem size is sufficient for efficient exploitation of newly added processors.

Performance counters (6/8)

Example 2. Consider the following problem: summing n numbers on a small PRAM with p ≤ n processors. With the assumption that every "basic" operation runs in unit time, we have SU(n) = n.
- Each processor adds locally ⌈n/p⌉ numbers.
- Then the p partial sums are summed using a parallel binary reduction on p processors in ⌈log(p)⌉ iterations.
- Thus, we have T(n,p) ∈ O(n/p + log(p)).
- Elementary computations give

  f_1(n) = n / log(n)   and   f_2(p) = p · log(p).

Performance counters (7/8)

Example 3 (1/2). Consider a tridiagonal linear system of order n:

  ...
  a_{i-2} x_{i-2} + b_{i-1} x_{i-1} + c_i x_i = e_{i-1}
  a_{i-1} x_{i-1} + b_i x_i + c_{i+1} x_{i+1} = e_i
  a_i x_i + b_{i+1} x_{i+1} + c_{i+2} x_{i+2} = e_{i+1}
  ...

For every even i, replacing x_i with (e_i − c_{i+1} x_{i+1} − a_{i-1} x_{i-1}) / b_i leads to another tridiagonal system, of order n/2:

  ...
  A_{i-3} x_{i-3} + B_{i-1} x_{i-1} + C_{i+1} x_{i+1} = E_{i-1}
  A_{i-1} x_{i-1} + B_{i+1} x_{i+1} + C_{i+3} x_{i+3} = E_{i+1}
  ...

Performance counters (8/8)

Example 3 (2/2). This scheme is attractive because:
- the number of processors, here p = n, can be set such that the number of parallel steps, here O(log n), is known and small,
- processor activity (scheduling) is easy to organize,
- data communication is not intensive.

Handling Shared Memory Access Conflicts: PRAM submodels

Handling Shared Memory Access Conflicts (1/23)

Definition.
- EREW (Exclusive Read Exclusive Write): no two processors are allowed to read or write the same shared memory cell simultaneously.
- CREW (Concurrent Read Exclusive Write): simultaneous reads of the same memory cell are allowed, but no two processors can write the same shared memory cell simultaneously.
- PRIORITY CRCW (PRIORITY Concurrent Read Concurrent Write): simultaneous reads of the same memory cell are allowed. Processors are assigned fixed and distinct priorities. In case of write conflict, the processor with highest priority is allowed to complete its WRITE.

Handling Shared Memory Access Conflicts (2/23)

Definition.
- ARBITRARY CRCW (ARBITRARY Concurrent Read Concurrent Write): simultaneous reads of the same memory cell are allowed. In case of write conflict, one randomly chosen processor is allowed to complete its WRITE. An algorithm written for this model should make no assumptions about which processor is chosen in case of write conflict.
- COMMON CRCW (COMMON Concurrent Read Concurrent Write): simultaneous reads of the same memory cell are allowed. In case of write conflict, all processors are allowed to complete their WRITE iff all values to be written are equal. An algorithm written for this model should make sure that this condition is satisfied. If not, the algorithm is illegal and the machine state will be undefined.

Handling Shared Memory Access Conflicts (3/23)

Example 4: concurrent search. Consider a p-processor PRAM with p ≤ n. Assume that the shared memory contains n distinct items and P_0 owns a value x. The goal is to let P_0 know whether x occurs within the n items.

Concurrent search, EREW PRAM algorithm:
(a) P_0 broadcasts x to P_1, P_2, ..., P_p in O(log(p)) steps using a binary broadcast tree.
(b) Every processor P_1, P_2, ..., P_p performs a local search on (at most) ⌈n/p⌉ items, hence in ⌈n/p⌉ steps.
(c) Every processor defines a Boolean flag Found. The final answer is obtained by a parallel reduction, that is, by means of a binary tree.

This leads to T(n,p) = O(log(p) + ⌈n/p⌉).
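The three phases of the EREW search can be mimicked in Python. The sketch below is an illustration (the function name and step accounting are my assumptions, not from the slides): it counts ⌈log p⌉ broadcast steps, ⌈n/p⌉ local-search steps and ⌈log p⌉ reduction steps, and combines the per-processor flags.

```python
import math

def erew_search(items, x, p):
    """Simulate the EREW concurrent search; return (found, parallel_steps).
    Broadcast of x: ceil(log2(p)) steps (binary broadcast tree).
    Local search: each processor scans at most ceil(n/p) items.
    Reduction of the Found flags: ceil(log2(p)) steps (binary tree)."""
    n = len(items)
    chunk = math.ceil(n / p)
    broadcast_steps = math.ceil(math.log2(p)) if p > 1 else 0
    # each processor searches its own chunk; no two touch the same cell
    flags = [x in items[k * chunk:(k + 1) * chunk] for k in range(p)]
    reduce_steps = math.ceil(math.log2(p)) if p > 1 else 0
    return any(flags), broadcast_steps + chunk + reduce_steps

found, steps = erew_search(list(range(100)), 42, p=8)
```

With n = 100 and p = 8 this gives 3 + 13 + 3 = 19 parallel steps, matching T(n,p) = O(log(p) + ⌈n/p⌉).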
Handling Shared Memory Access Conflicts (4/23)

Concurrent search, CREW PRAM algorithm: a similar approach, but P_1, P_2, ..., P_p can read x in O(1). However, the final reduction is still in log(p), leading again to T(n,p) = O(log(p) + ⌈n/p⌉).

Concurrent search, COMMON PRAM algorithm: now the final step takes O(1). Indeed, those processors with their flag Found equal to true can write simultaneously to the same memory cell, initialized to false. Hence, we have T(n,p) = O(⌈n/p⌉).

Handling Shared Memory Access Conflicts (5/23)

Example 5: statement. On a CREW PRAM machine, what does the following pseudo-code do?

  A[1..6] := 0,0,0,0,0,1;
  for each 1 <= step <= 5 do
      for each 1 <= i <= 5 do in parallel
          A[i] := A[i] + A[i+1]; // done by processor i
  print A;

Handling Shared Memory Access Conflicts (6/23)

Example 5: solution. No data races occur, thanks to the execution model (the 3-phase cycle) and CREW handling. On an actual computer, there would be data races and an uncertain result, that is, a non-deterministic answer.

Handling Shared Memory Access Conflicts (7/23)

Example 6: statement. Write an EREW PRAM program for the following task: given 2n input integer numbers, compute their maximum.

Handling Shared Memory Access Conflicts (8/23)

Example 6: solution.
Input: 2n integer numbers stored in M[1], ..., M[2n], where n ≥ 2 is a power of 2.
Output: The maximum of those numbers, written at M[2n+1].
Program:

  Active processors P1, ..., Pn;
  // id is the index of one of the active processors
  step := 0; jump := 2^step;
  while jump <= n do
      if (id mod jump = 0)
          M[2*id] := max(M[2*id], M[2*id - jump]);
      step := step + 1;
      jump := 2^step;
  if (id = n) then M[2n+1] := M[2n];

As in GPU programs, scalar variables (variables of type int, float, char) are, by default, stored only in the register file of each processor. Hence, in the above program, step and jump are not in shared memory, but in the register file of each processor.

Handling Shared Memory Access Conflicts (9/23)

Example 7: statement. This is a follow-up on Example 6.
1. What are T(2n,n), SU(n), S(2n,n)?
2. What are W(2n,n), E(2n,n)?
3. Propose a variant of the algorithm for an input of size n using p processors, for a well-chosen value of p, such that we have S(n,p) ∈ Θ(p).

Handling Shared Memory Access Conflicts (10/23)

Example 7: solution.
1. log(n), n, n/log(n).
2. n·log(n), n/(n·log(n)) = 1/log(n).
3. Algorithm:
   1. Use p := n/log(n) processors, instead of n.
   2. Make each of these p processors compute serially the maximum of log(n) numbers. This requires log(n) parallel steps and has total work n.
   3. Run the previous algorithm on these p "local maxima". This takes log(p) ∈ O(log(n)) steps with a total work of p·log(p) ∈ O((n/log(n))·log(n)).
   4. Therefore the algorithm runs in at most 2·log(n) parallel steps and uses n/log(n) processors. Thus, we have S(n,p) ∈ Θ(p).

Handling Shared Memory Access Conflicts (11/23)

Example 8: statement. Write a COMMON CRCW PRAM program for the following task: given n input integer numbers, compute their maximum, such that the program runs essentially in constant time, that is, O(1).
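Before moving to the CRCW solution, the EREW tournament of Example 6 can be checked with a short sequential simulation. The sketch below is illustrative (the function name is an assumption); it runs the same stride-doubling rounds over a 1-based array M[1..2n] and reports the number of parallel steps.

```python
def erew_max(values):
    """Simulate Example 6's EREW tournament: the input occupies M[1..2n]
    (n a power of 2); pairs at stride `jump` are combined until the
    maximum reaches M[2n]. Returns (maximum, parallel_steps)."""
    two_n = len(values)
    n = two_n // 2
    M = [None] + list(values)        # 1-based shared memory
    steps = 0
    jump = 1
    while jump <= n:
        for pid in range(1, n + 1):  # one synchronous cycle of n processors
            if pid % jump == 0:
                M[2 * pid] = max(M[2 * pid], M[2 * pid - jump])
        jump *= 2
        steps += 1
    return M[2 * n], steps

result = erew_max([3, 1, 4, 2])
```

Each round reads cells written only in earlier rounds, so the sequential loop is faithful to the synchronous PRAM semantics; the step count is log(2n) = log(n) + 1.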
Handling Shared Memory Access Conflicts (12/23)

Example 8: solution.
Input: n integer numbers stored in M[1], ..., M[n], where n ≥ 2.
Output: The maximum of those numbers, written at M[n+1].
Program:

  Active processors P1, ..., Pn^2;
  // id is the index of one of the active processors
  if (id <= n) M[n + id] := true;
  i := ((id - 1) mod n) + 1;
  j := ((id - 1) quo n) + 1;
  if (M[i] < M[j]) M[n + i] := false;
  if (M[n + i] = true) M[n+1] := M[i];

Handling Shared Memory Access Conflicts (13/23)

Example 9: statement. This is a follow-up on Example 8.
1. What are T(n,n^2), SU(n), S(n,n^2)?
2. What are W(n,n^2), E(n,n^2)?

Handling Shared Memory Access Conflicts (14/23)

Example 9: solution.
1. O(1), n, n.
2. n^2, 1/n.

Handling Shared Memory Access Conflicts (15/23)

Example 10: statement. Write a CREW PRAM program, then an EREW PRAM program, for the following task: given two polynomials of degree less than n, say a = a_{n-1} x^{n-1} + ... + a_1 x + a_0 and b = b_{n-1} x^{n-1} + ... + b_1 x + b_0, compute their product in parallel time O(log(n)). We may make assumptions of the form "n is a power of 2".

Handling Shared Memory Access Conflicts (16/23)

Example 10: CREW solution (1/3).
Input: Two polynomials a = a_{n-1} x^{n-1} + ... + a_1 x + a_0 and b = b_{n-1} x^{n-1} + ... + b_1 x + b_0 such that M[i] holds a_{i-1} and M[n+i] holds b_{i-1}, for 1 <= i <= n, where n is a power of 2.
Output: Their product.
Program:

  Active processors P1, ..., Pn^2;
  // id is the index of one of the active processors
  i := ((id - 1) mod n) + 1;
  j := ((id - 1) quo n) + 1;
  M[2n + id] := M[i] * M[n + j];
  ...

The problem in the above code is that we have to sum up all the products contributing to the same coefficient of ab. Indeed, we need to write these products into consecutive memory locations to sum them conveniently.

Handling Shared Memory Access Conflicts (17/23)

Example 10: CREW solution (2/3).
Observe that ab has 2n - 1 coefficients. The number n_d of terms contributing to x^d satisfies

  n_d = d + 1        for 0 <= d <= n - 1,
  n_d = 2n - 1 - d   for n <= d <= 2n - 2.

Observe that 1 <= n_d <= n for all 0 <= d <= 2n - 2. For each d in {0, ..., 2n - 2}, we allocate n slots (we assume that the memory allocator initializes them to zero) to write the n_d terms contributing to x^d. More precisely, M[2n + d*n + i + 1] stores the product M[i + 1] * M[n + j + 1] if d = i + j, for 0 <= i, j <= n - 1. Note the shifted range for i and j.

Handling Shared Memory Access Conflicts (18/23)

Example 10: CREW solution (3/3).

  Active processors P1, ..., Pn^2;
  // id is the index of one of the active processors
  i := (id - 1) mod n;
  j := (id - 1) quo n;
  // observe that i and j are now in 0..(n-1)
  d := i + j;
  M[2n + d*n + i + 1] := M[i + 1] * M[n + j + 1];

After this point, n processors can work together on a parallel reduction for each d. Since d ∈ {0, ..., 2n - 2}, each processor will participate in at most 2 parallel reductions. For simplicity, the code should make each processor work on 2 parallel reductions. Hence additional "zero" slots must be added.
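The layout above (one zero-initialized block of n slots per output degree d) can be prototyped sequentially. This Python sketch is an illustration (function name and variable names are assumptions): it writes each product a_i·b_j into slot i of block d = i + j, exactly as the addressing scheme M[2n + d·n + i + 1] does, then sums each block, standing in for one parallel reduction per degree.

```python
def poly_mul_grid(a, b):
    """CREW-style product layout: (2n-1) blocks of n zero slots, one
    block per output degree d; the product a[i]*b[j] goes to slot i of
    block d = i + j (no write conflict: i determines the slot). Each
    block is then summed, mimicking one parallel reduction per degree."""
    n = len(a)
    slots = [[0] * n for _ in range(2 * n - 1)]   # block d, slot i
    for i in range(n):                            # in PRAM: n^2 processors,
        for j in range(n):                        # one (i, j) pair each
            slots[i + j][i] = a[i] * b[j]
    return [sum(block) for block in slots]        # one reduction per degree

# (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2
coeffs = poly_mul_grid([1, 2], [3, 4])
```

Since slot index i is unique within a block, no two (i, j) pairs write the same cell, which is why the write phase is conflict-free even though the reads of a[i] and b[j] are concurrent (hence CREW).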
Handling Shared Memory Access Conflicts (19/23)

Example 10: complete CREW solution (1/2).

  // Active processors P1, ..., Pn^2;
  // id is the index of one of the active processors
  // compute the products concurrently, without write conflict
  i := (id - 1) mod n;
  j := (id - 1) quo n;
  d := i + j;
  M[2n + d*n + i + 1] := M[i + 1] * M[n + j + 1];
  // from M[2n] to M[2n + n(2n-1)], do reductions with stride n and
  // keep the results from M[2n + n(2n-1) + 1] to M[2n + n(2n-1) + 2n - 1];
  // we only have n^2 processors, so we need two reduction sweeps
  for (i := 1; i < n; i := 2*i)
      if (id mod 2i = 0)
          M[2n + id] += M[2n + id + i];
  if ((id - 1) mod n = 0)
      M[2n + (2n-1)*n + (id quo n) + 1] := M[2n + id];

Handling Shared Memory Access Conflicts (20/23)

Example 10: complete CREW solution (2/2).

  for (i := 1; i < n; i := 2*i)
      if (id mod 2i = 0)
          M[2n + n^2 + id] += M[2n + n^2 + id + i];
  if ((id - 1) mod n = 0 and id <= n*(n-1))
      M[2n + (2n-1)*n + n + (id quo n) + 1] := M[2n + n^2 + id];

The complexity analysis goes as follows:

  T(2n, n^2) = Θ(1) (multiplication phase) + Θ(log(n)) (addition phase) = Θ(log(n)),
  E(2n, n^2) = Θ(n^2) / Θ(n^2 · log(n)) = Θ(1/log(n)).

Handling Shared Memory Access Conflicts (21/23)

Example 10: EREW solution. The difference with the CREW situation is the need to broadcast the 2n input coefficients in order to prevent concurrent reads:
- Since each input coefficient needs to be read n times, we need n copies, that is, n^3 coefficients in total. (A better estimate can be achieved using a divide-and-conquer process.)
- Since we have n^2 processors at hand, we need n parallel steps to write those n^3 coefficients.

Then, the complexity analysis goes as follows:

  T(2n, n^2) = Θ(n) (broadcast) + Θ(1) (multiplications) + Θ(log(n)) (additions) = Θ(n),
  E(2n, n^2) = Θ(n^2) / Θ(n^2 · n) = Θ(1/n).

Handling Shared Memory Access Conflicts (22/23)

Example 10: COMMON CRCW solution. The difference with the CREW situation is the fact that each reduction (or addition) of the n_d products contributing to the same coefficient of the product can be done in Θ(1) time:
- We have 2n - 1 reductions. The one contributing to the coefficient of x^d requires n_d - 1 additions of two numbers, thus n_d^2 processors.
- This gives Θ(n^3) processors in total.

Then, the complexity analysis goes as follows:

  T(2n, n^3) = Θ(1) (multiplication phase) + Θ(1) (addition phase) = Θ(1),
  E(2n, n^3) = Θ(n^2) / Θ(n^3 · 1) = Θ(1/n).

Simulation of large PRAMs on small PRAMs

Simulation of large PRAMs on small PRAMs (1/3)

Proposition 1. Let p' <= p. Then any problem that can be solved on a p-processor PRAM in t steps can be solved on a p'-processor PRAM in t' = O(t·p/p') steps, assuming the same size of shared memory.

Proof. In order to reach this result, each of the processors P'_i of the p'-processor PRAM simulates a group G_i of (at most) ⌈p/p'⌉ processors of the p-processor PRAM, as follows. Each simulating processor P'_i simulates one 3-phase cycle of G_i by
1. executing all their READ instructions,
2. executing all their local COMPUTATIONS,
3. executing all their WRITE instructions.
One can check that, whatever the model for handling shared memory access conflicts, the simulating PRAM will produce the same result as the larger PRAM.
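Proposition 1's simulation can be illustrated in Python. The sketch below is an assumption-laden prototype (the `simulate_step` helper and the triple encoding of a processor are made up for the illustration): one step of p virtual processors is executed by p' workers, each sweeping its group of ⌈p/p'⌉ virtual processors, with all READs completed before any WRITE so the synchronous semantics is preserved.

```python
import math

def simulate_step(memory, virtual, p_prime):
    """One step of p virtual processors on p' physical ones.
    `virtual` is a list of (read_addr, compute, write_addr) triples.
    Each physical processor sweeps its group of ceil(p/p') virtual
    processors; reads are completed for ALL virtual processors before
    any write, preserving the synchronous PRAM semantics."""
    p = len(virtual)
    group = math.ceil(p / p_prime)
    pending = []
    for w in range(p_prime):                       # phases 1+2, group by group
        for (r, f, dst) in virtual[w * group:(w + 1) * group]:
            pending.append((dst, f(memory[r])))
    for dst, val in pending:                       # phase 3: all WRITEs
        memory[dst] = val
    return memory

# 4 virtual processors doubling M[0..3] in place, run on 2 physical ones
mem = {i: i + 1 for i in range(4)}
simulate_step(mem, [(i, lambda v: 2 * v, i) for i in range(4)], p_prime=2)
```

The sweep costs ⌈p/p'⌉ sub-steps per phase, which is where the O(t·p/p') bound comes from.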
Simulation of large PRAMs on small PRAMs (2/3)

Proposition 2. Assume m' <= m. Then any problem that can be solved on a p-processor, m-cell PRAM in t steps can be solved on a max(p,m')-processor, m'-cell PRAM in t' = O(t·m/m') steps.

Proof of Proposition 2 (1/2). Naturally, the idea is to use the register sets of the processors of the m'-cell PRAM in order to compensate for the diminution of shared memory. This is why it is necessary to assume that the m'-cell PRAM has at least m' processors. (After that, one can use Proposition 1 to save on processors.) Let P_1, ..., P_p be the processors of the m-cell PRAM:
- We use processors P'_1, ..., P'_{m''} on the m'-cell PRAM to simulate P_1, ..., P_p, where m'' = max(p, m').
- Moreover, we (mentally) group the m cells of the m-cell PRAM into m' contiguous segments S_1, ..., S_{m'} of size m/m'.

Simulation of large PRAMs on small PRAMs (3/3)

Proof of Proposition 2 (2/2).
- We use the register set of processor P'_i for simulating the segment S_i, for all 1 <= i <= m'.
- We use the shared memory cell M'_i, for 1 <= i <= m', on the m'-cell PRAM, as an auxiliary memory.
Simulation of one 3-phase cycle of the m-cell PRAM:
- READ: for all 0 <= k < m/m', repeat:
  1. for all 1 <= i <= m', the processor P'_i writes the value of the k-th cell of S_i into M'_i;
  2. for all 1 <= i <= p, the processor P'_i reads from the shared memory, provided that P_i would read its value at a position congruent to k modulo m/m'.
- COMPUTE: the local computation of P_i is simulated by P'_i, for all 1 <= i <= p.
- WRITE: analogous to READ.
Comparing the Computational Power of PRAM Submodels

Remark. By PRAM submodels, we mean either EREW, CREW, COMMON, ARBITRARY or PRIORITY.

Definition. PRAM submodel A is computationally stronger than PRAM submodel B, written A >= B, if any algorithm written for B will run unchanged on A in the same parallel time, assuming the same basic properties.

Proposition 3. We have:

  PRIORITY >= ARBITRARY >= COMMON >= CREW >= EREW.

Theorem 1. Any polylog time PRAM algorithm is robust with respect to all PRAM submodels.

Remark. In other words, any PRAM algorithm which runs in polylog time on one submodel can be simulated on any other PRAM submodel and run within the same complexity class. This results from Proposition 3 and Lemma 2. Lemma 1 provides a result weaker than Lemma 2, but the proof of the former helps in understanding the proof of the latter.

Lemma 1. Assume PRIORITY CRCW with the priority scheme based trivially on indexing: lower-indexed processors have higher priority. Then one step of a p-processor, m-cell PRIORITY CRCW can be simulated by a p-processor, mp-cell EREW PRAM in O(log(p)) steps.
Proof of Lemma 1 (1/3). Naturally, the idea is to:
- store all the WRITE (or READ) requests for one cycle in memory,
- evaluate their priorities,
- execute the instruction of the winner.
But there is a trap: we should avoid access conflicts also during this simulation algorithm.

Proof of Lemma 1 (2/3).
(1) Each PRIORITY processor P_k is simulated by EREW processor P'_k, for all 1 <= k <= p.
(2) Each shared memory cell M_i, for all i = 1, ..., m, of PRIORITY is simulated by an array of p shared memory cells M'_{i,k}, k = 1, ..., p, of EREW:
- M'_{i,1} plays the role of M_i,
- M'_{i,2}, ..., M'_{i,p} are auxiliary cells used for resolving conflicts, initially empty,
- M'_{i,1}, ..., M'_{i,p} are regarded as the rows of a complete binary tree T_i with p leaves and height ⌈log(p)⌉; initially, they are regarded as the leaf row of T_i.

Proof of Lemma 1 (3/3).
(3) Simulation of a PRIORITY WRITE substep. Each EREW processor must find out whether it is the processor with lowest index within the group asking to write to the same memory cell; if so, it must become the group winner and perform the WRITE operation; the other processors of its group just fail and do not write. This is done as follows:
1. For all 1 <= k <= p, repeat: if P_k wants to write into M_i, then P'_k turns active and becomes the k-th leaf of T_i.
2. Each active left processor stores its ID into the parent cell in its tree, marks it as occupied and remains active.
3. Each active right processor checks its parent cell: if it is empty, then it stores its ID there and remains active; otherwise it becomes inactive.
4. This is repeated one row after another, from bottom to top in T_i, in ⌈log(p)⌉ iterations.
5. The processor who managed to reach the root of T_i becomes the winner and can WRITE.

Lemma 2. One step of a PRIORITY CRCW with p processors and m shared memory cells can be simulated by an EREW PRAM in O(log(p)) steps, with p processors and m + p shared memory cells.

Proof of Lemma 2 (1/3).
1. Each PRIORITY processor P_k is simulated by EREW processor P'_k.
2. Each PRIORITY cell M_i is simulated by EREW cell M'_i.
3. EREW uses an auxiliary array A of p cells.
4. If P_k wants to access M_i, then processor P'_k writes the pair (i,k) into A[k].
5. If P_k does not want to access any PRIORITY cell, processor P'_k writes (0,k) into A[k].
6. All p processors sort the array A into lexicographic order using an O(log p)-time parallel sort.

Proof of Lemma 2 (2/3).
1. Each P'_k appends to cell A[k] a flag f defined as follows:
   - f = 0 if the first component of A[k] is either 0 or the same as the first component of A[k-1],
   - f = 1 otherwise.
2. Further steps differ for the simulation of WRITE or READ.
PRIORITY WRITE:
1. Each P'_k reads the triple (i,j,f) from cell A[k] and writes it into A[j].
2. Each P'_k reads the triple (i,k,f) from cell A[k] and writes into M_i iff f = 1.
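The WRITE half of Lemma 2's simulation can be prototyped sequentially. The sketch below is an illustration, not the lemma's EREW program (Python's built-in sort stands in for the O(log p) parallel sort, and the function name is an assumption): requests (i, k) are sorted lexicographically, the first request per cell gets flag f = 1, and only flagged processors write, so the lowest-indexed (highest-priority) processor wins each cell.

```python
def priority_write(memory, requests):
    """Resolve PRIORITY CRCW writes as in Lemma 2's WRITE simulation.
    `requests` maps processor index k -> (cell i, value) or None.
    Sorting the (i, k) pairs groups the requests per cell with the
    lowest-indexed processor first; flag f = 1 marks the first request
    of each group, and only those processors actually write."""
    A = sorted((req[0], k) for k, req in requests.items() if req is not None)
    prev_cell = None
    for (i, k) in A:
        f = 1 if i != prev_cell else 0   # first request for cell i?
        if f:
            memory[i] = requests[k][1]   # the group winner writes
        prev_cell = i
    return memory

# processors 3 and 5 both target cell 7: the lower index (priority) wins
mem = {}
priority_write(mem, {5: (7, "loser"), 3: (7, "winner"), 4: (9, "ok")})
```

Lexicographic order on (cell, processor) is exactly what makes "first of each group" coincide with "highest priority", which is why the flag computation only needs to compare each A[k] with A[k-1].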
Proof of Lemma 2 (3/3)
PRIORITY READ:
1. Each P'_k reads the triple (i,j,f) from cell A[k].
2. If f = 1, it reads the value v_i from M'[i] and overwrites the third component of A[k] (the flag f) with v_i.
3. In at most log(p) steps, this third component is then distributed into the subsequent cells of A until it reaches either the end of A or an element with a different first component.
4. Each P'_k reads the triple (i,j,v_i) from cell A[k] and writes it into A[j].
5. Each P'_k that asked for a READ reads the value v_i from the triple (i,k,v_i) in cell A[k].

More PRAM algorithms and exercises

Parallel scan (1/5)
Scanning is another common and important data-parallel primitive. The problem seems inherently sequential, but there is an efficient parallel algorithm.
Applications: sorting, lexical analysis, string comparison, polynomial evaluation, stream compaction, building histograms and data structures (graphs, trees, etc.) in parallel.

Parallel scan (2/5)
Let S be a set and let + : S x S -> S be an associative operation on S with 0 as identity. Let A[0..n-1] be an array of n elements of S.
The all-prefixes-sum, or inclusive scan, of A computes the array B of n elements of S defined by
    B[i] = A[0]             if i = 0
    B[i] = B[i-1] + A[i]    if 0 < i < n
The exclusive scan of A computes the array C of n elements of S defined by
    C[i] = 0                if i = 0
    C[i] = C[i-1] + A[i-1]  if 0 < i < n
An exclusive scan can be obtained from an inclusive scan by shifting the resulting array right by one element and inserting the identity. Similarly, an inclusive scan can be obtained from an exclusive scan. We shall focus on the exclusive scan.

Parallel scan (3/5)
Here is a sequential algorithm for the exclusive scan:

    void scan(float *output, float *input, int length)
    {
        output[0] = 0; // since this is a prescan, not a scan
        for (int j = 1; j < length; ++j)
            output[j] = input[j-1] + output[j-1];
    }

Parallel scan (4/5)
Write a CREW algorithm for parallel scanning that implements the principle used in the following example. Analyze its efficiency.

Parallel scan (5/5)
Input: elements located in M[1], ..., M[n], where n is a power of 2.
Output: the n prefix sums located in M[n+1], ..., M[2n].
Program:
    Active processors P_1, ..., P_n; // id is the active processor's index
    for d := 0 to log(n) - 1 do
        if d is even then
            if id > 2^d then
                M[n + id] := M[id] + M[id - 2^d]
            else
                M[n + id] := M[id]
            end if
        else
            if id > 2^d then
                M[id] := M[n + id] + M[n + id - 2^d]
            else
                M[id] := M[n + id]
            end if
        end if
    end for
    if d is odd then M[n + id] := M[id] end if

Mysterious algorithm (1/5)
What does the following CREW PRAM algorithm compute?
Input: n elements located in M[1], ..., M[n], where n >= 2 is a power of 2.
Output: some values located in M[n+1], ..., M[2n].
Program:
    Active processors P_1, ..., P_n; // id is the index of one of the active processors
    M[n + id] := M[id];
    M[2n + id] := id + 1;
    for d := 1 to log(n) do
        if M[2n + id] <= n then
            j := M[2n + id];
            v := M[n + id];
            M[n + j] := v + M[n + j];
            jj := M[2n + j];
            M[2n + id] := jj;
        end if
    end for
Analyze its efficiency.

Mysterious algorithm (2/4), (3/4), (4/4)
[Worked example on a sample input, shown step by step in figures.]