Question? Leave a message!




Parallel Random-Access Machines

Parallel Random-Access Machines
Dr.JakeFinlay Profile Pic
Dr.JakeFinlay,Germany,Teacher
Published Date:22-07-2017
Website URL
Comment
Parallel Random-Access Machines Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS3101 (Moreno Maza) Parallel Random-Access Machines CS3101 1 / 69Plan 1 The PRAM Model 2 Performance counters 3 Handling Shared Memory Access Con icts: PRAM submodels 4 Simulation of large PRAMs on small PRAMs 5 Comparing the Computational Power of PRAM Submodels 6 More PRAM algorithms and exercies (Moreno Maza) Parallel Random-Access Machines CS3101 2 / 69The PRAM Model Plan 1 The PRAM Model 2 Performance counters 3 Handling Shared Memory Access Con icts: PRAM submodels 4 Simulation of large PRAMs on small PRAMs 5 Comparing the Computational Power of PRAM Submodels 6 More PRAM algorithms and exercies (Moreno Maza) Parallel Random-Access Machines CS3101 3 / 69The PRAM Model The RAM Model Recall The Random Access Machine is a convenient model of a sequential computer. Its features are as follows. The computation unit executes a user de ned program. It uses a read-only input tape and a write-only output tape. The RAM has an unbounded number of local memory cells. Each memory cell can hold an integer of unbounded size. The instruction set includes operations for: moving data between memory cells, comparisons and conditional branching, additions, subtractions, multiplications. Execution starts with the rst instruction of the program and terminates when an Halt instruction is reached. Each operation takes one time unit regardless of the the operand sizes. (Moreno Maza) Parallel Random-Access Machines CS3101 4 / 69 Time complexity = the number of instructions executed. Space complexity = the number of memory cells accessed.The PRAM Model The PRAM Model: De nition (1/6) Architecture The Parallel Random Access Machine is a natural generalization of RAM. It is also an idealization of a shared memory machine. Its features are as follows. It holds an unbounded collection of RAM processors P ;P ;P ;::: 0 1 2 without tapes. It holds an unbounded collection of shared memory cells M0;M1;M2;::: Each processor P has its own (unbounded) local memory (register i set) and P knows its index i. i Each processor P can access any shared memory cell Mj in unit i time, unless there is a con ict (see further). (Moreno Maza) Parallel Random-Access Machines CS3101 5 / 69The PRAM Model The PRAM Model: De nition (2/6) (Moreno Maza) Parallel Random-Access Machines CS3101 6 / 69The PRAM Model The PRAM Model: De nition (3/6) Program execution (1/2) The input of a PRAM program consists of n items stored in M0;:::;Mn 1. 0 0 The output of a PRAM program consists of n items stored in n 0 memory cells, say Mn;:::;Mn +n 1.  A PRAM instruction executes in a 3-phase cycle: 1 Read (if needed) from a shared memory cell, 2 Compute locally (if needed), 3 Write in a shared memory cell (if needed). All processors execute their 3-phase cycles synchronously. Special assumptions have to be made in order to resolve shared memory access con icts. The only way processors can exchange data is by writing into and reading from memory cells. (Moreno Maza) Parallel Random-Access Machines CS3101 7 / 69The PRAM Model The PRAM Model: De nition (4/6) Program execution (2/2) P has a special activation register specifying the maximum index of 0 an active processor: 1 Initially, only P is active; it computes the number of required active 0 processors, 2 Then, P loads this number in the activation register, 0 3 The corresponding processors start executing their programs. Computations proceed until P halts, at which time all other active 0 processors are halted. Parallel time complexity = the time for P 's computations. 0 Parallel space complexity = the maximum number of shared memory cells in use during the computations. (Moreno Maza) Parallel Random-Access Machines CS3101 8 / 69The PRAM Model The PRAM Model: De nition (6/6) Summary of main assumptions Inputs/Outputs are placed in the shared memory Memory cell stores an arbitrarily large integer Each instruction takes unit time Instructions are synchronized across the processors PRAM complexity measures for each individual processor time: number of instructions executed space: number of memory cells accessed PRAM machine time: time taken by the longest running processor hardware: maximum number of active processors (Moreno Maza) Parallel Random-Access Machines CS3101 9 / 69The PRAM Model The PRAM Model: Remarks The PRAM Model is attractive for designing parallel algorithms: It is natural: the number of operations executed per one cycle on p processors is at most p. It is strong: any processor can read or write any shared memory cell in unit time. It is simple: ignoring any communication or synchronization overhead. This natural, strong and simple PRAM model can be used as a benchmark: If a problem has no feasible (or ecient) solution on a PRAM then it is likely that it has no feasible (or ecient) solution on any parallel machine. The PRAM model is an idealization of existing shared memory parallel machines. The PRAM ignores lower level architecture constraints (memory access overhead, synchronization overhead, intercommunication throughput, connectivity, speed limits, etc.) (Moreno Maza) Parallel Random-Access Machines CS3101 10 / 69The PRAM Model Constrained PRAM Models (1/2) A small-memory PRAM satis es the axioms of a PRAM except that is has a bounded number of shared memory cells. A m-cell PRAM is a small-memory PRAM with m shared memory cells. If the input (or output) data set exceeds the capacity of the shared memory, then this data can be distributed evenly among the registers of the processors. Limiting the amount of shared memory corresponds to restricting the amount of information that can be communicated between processors in one step. For example, a distributed memory machine with processors interconnected by a shared bus can be modeled as a PRAM with a single shared memory. (Moreno Maza) Parallel Random-Access Machines CS3101 11 / 69The PRAM Model Constrained PRAM Models (2/2) A small PRAM satis es the axioms of a PRAM except that is has a bounded number of processors. A p-processor PRAM is a small PRAM with p + 1 processors (counting P ). 0 (Moreno Maza) Parallel Random-Access Machines CS3101 12 / 69Performance counters Plan 1 The PRAM Model 2 Performance counters 3 Handling Shared Memory Access Con icts: PRAM submodels 4 Simulation of large PRAMs on small PRAMs 5 Comparing the Computational Power of PRAM Submodels 6 More PRAM algorithms and exercies (Moreno Maza) Parallel Random-Access Machines CS3101 13 / 69Performance counters Performance counters (1/8) Recall The Parallel Time, denoted by T (n;p), is the time elapsed from the start of a parallel computation to the moment where the last processor nishes, on an input data of size n, and using p processors. T (n;p) takes into account computational steps (such as adding, multiplying, swapping variables), routing (or communication) steps (such as transferring and exchanging information between processors). (Moreno Maza) Parallel Random-Access Machines CS3101 14 / 69Performance counters Performance counters (2/8) Example 1 Parallel search of an item x in an unsorted input le with n items, in a shared memory with p processors, where any cell can be accessed by only one processor at a time. Broadcasting x costs O(log(p)), leading to T (n;p) = O(log(p)) +O(n=p): (Moreno Maza) Parallel Random-Access Machines CS3101 15 / 69Performance counters Performance counters (3/8) De nition The parallel eciency, denoted by E (n;p), is SU(n) E (n;p) = ; pT (n;p) where SU(n) is a lower bound for a sequential execution. Observe that we have SU(n) pT (n;p) and thus E (n;p) 1. One also often considers the speedup factor de ned by SU(n) S(n;p) = : T (n;p) (Moreno Maza) Parallel Random-Access Machines CS3101 16 / 69Performance counters Performance counters (4/8) Remark Reasons for ineciency: large communication latency compared to computational performances (it would be better to calculate locally rather than remotely) too big overhead in synchronization, poor coordination, poor load distribution (processors must wait for dependent data), lack of useful work to do (too many processors for too little work). (Moreno Maza) Parallel Random-Access Machines CS3101 17 / 69Performance counters Performance counters (5/8) The Work is de ned by W (n;p) = a + +a where a is the t t t start end number of active processors a time t. A data-processor iso-eciency function is an asymptotically maximal function f such that for all p 0 there exists n such that for n n we 1 0 0 0 have E (n;f (n)) E (n ;p ). 1 0 0 A processor-data iso-eciency function is an asymptotically minimal function f such that for all n 0 there exists p such that for p p we 2 0 0 0 have E (f (p);p) E (n ;p ). 2 0 0 The iso-eciency function f quanti es the growth rate of the problem size, 2 required to keep the eciency xed while increasing the number of processors. It re ects the ability of a parallel algorithm to maintain a constant eciency. A large iso-eciency function f indicates poor 2 scalability, whereas a small one indicates that only a small increment in the problem size is sucient for ecient exploitation of newly added processors. (Moreno Maza) Parallel Random-Access Machines CS3101 18 / 69Performance counters Performance counters (6/8) Example 2 Consider the following problem: summing n numbers on a small PRAM with p n processors. With the assumption that every \basic" operation runs in unit time, we have SU(n) = n. n Each processor adds locallyd e numbers. p Then the p partial sums are summed using a parallel binary reduction on p processors indlog(p)e iterations. n Thus, we have: T (n;p)2 O( + log(p)). p Elementary computations give n f (n) = and f (p) = p log(p): 1 2 log(n) (Moreno Maza) Parallel Random-Access Machines CS3101 19 / 69Performance counters Performance counters (7/8) Example 3 (1/2) Consider a tridiagonal linear system of order n:     a x + b x + c x = e i2 i2 i1 i1 i i i1 a x + b x + c x = e i1 i1 i i i+1 i+1 i a x + b x + c x = e i i i+1 i+1 i+2 i+2 i+1     ec x a x i i+1 i+1 i1 i1 For every even i replacing x with leads to another i b i tridiagonal system of order n=2:     A x + B x + C x = E i3 i3 i1 i1 i+1 i+1 i1 A x + B x + C x = E i1 i1 i+1 i+1 i+3 i+3 i+1     (Moreno Maza) Parallel Random-Access Machines CS3101 20 / 69