




Parallel and Distributed Computing with Julia
Marc Moreno Maza
University of Western Ontario, London, Ontario (Canada)
CS3101 (updated January 7, 2015)

Plan
1 Tasks: Concurrent Function Calls
2 Julia's Principles for Parallel Computing
3 Tips on Moving Code and Data
4 Around the Parallel Julia Code for Fibonacci
5 Parallel Maps and Reductions
6 Distributed Computing with Arrays: First Examples
7 Distributed Arrays
8 Map Reduce
9 Shared Arrays
10 Matrix Multiplication Using Shared Arrays
11 Synchronization
12 A Simple Simulation Using Distributed Arrays

Tasks: Concurrent Function Calls

Tasks (aka Coroutines)
Tasks are a control flow feature that allows computations to be suspended and resumed in a flexible manner.
This feature is sometimes called by other names, such as symmetric coroutines, lightweight threads, cooperative multitasking, or one-shot continuations.
When a piece of computing work (in practice, executing a particular function) is designated as a Task, it becomes possible to interrupt it by switching to another Task.
The original Task can later be resumed, at which point it will pick up right where it left off.

The producer-consumer scheme
One complex procedure is generating values and another complex procedure is consuming them.
The consumer cannot simply call a producer function to get a value, because the producer may have more values to generate and so might not yet be ready to return.
With tasks, the producer and consumer can both run as long as they need to, passing values back and forth as necessary.
Julia provides the functions produce and consume for implementing this scheme.

Producer-consumer scheme: example

function producer()
    produce("start")
    for n=1:2
        produce(2n)
    end
    produce("stop")
end

To consume values, first the producer is wrapped in a Task, then consume is called repeatedly on that object:

julia> p = Task(producer)
Task
julia> consume(p)
"start"
julia> consume(p)
2
julia> consume(p)
4
julia> consume(p)
"stop"

Tasks as iterators
A Task can be used as an iterable object in a for loop, in which case the loop variable takes on all the produced values:

julia> for x in Task(producer)
         println(x)
       end
start
2
4
stop

More about tasks

julia> for x in [1,2,4]
         println(x)
       end
1
2
4

julia> t = @task for x in [1,2,4]
         println(x)
       end
Task (runnable) @0x00000000045c62e0

julia> istaskdone(t)
false

julia> current_task()
Task (waiting) @0x00000000041473b0

julia> consume(t)
1
2
4
1-element Array{Any,1}:
 nothing
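To make the iteration pattern concrete, here is a small additional sketch (not from the original slides): a producer that lazily generates the first few squares. The name squares_producer is made up for illustration, and it assumes the same produce/consume API used above (Julia 0.2/0.3 era).

# a minimal sketch, assuming the produce/consume API shown above;
# the function name squares_producer is hypothetical
function squares_producer()
    for n = 1:5
        produce(n^2)              # hand one value at a time to whoever consumes the task
    end
end

for s in Task(squares_producer)   # the for loop calls consume behind the scenes
    println(s)                    # prints 1, 4, 9, 16, 25
end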
Julia's Principles for Parallel Computing

Julia's message passing principle
Julia provides a multiprocessing environment based on message passing to allow programs to run on multiple processors in shared or distributed memory.
Julia's implementation of message passing is one-sided:
- the programmer needs to explicitly manage only one processor in a two-processor operation;
- these operations typically do not look like message send and message receive but rather resemble higher-level operations like calls to user functions.

Two key notions: remote references and remote calls
A remote reference is an object that can be used from any processor to refer to an object stored on a particular processor.
A remote call is a request by one processor to call a certain function on certain arguments on another (possibly the same) processor.
A remote call returns a remote reference.

How remote calls are handled in the program flow
Remote calls return immediately: the processor that made the call can proceed to its next operation while the remote call happens somewhere else.
You can wait for a remote call to finish by calling wait on its remote reference, and you can obtain the full value of the result using fetch.

Remote references and remote calls: example

moreno@gorgosaurus: julia -p 4

julia> r = remotecall(2, rand, 2, 2)
RemoteRef(2,1,6)
julia> fetch(r)
2x2 Array{Float64,2}:
 0.675311  0.735236
 0.682474  0.569424
julia> s = @spawnat 2 1+fetch(r)
RemoteRef(2,1,8)
julia> fetch(s)
2x2 Array{Float64,2}:
 1.67531  1.73524
 1.68247  1.56942

Comments on the example
Starting with julia -p n provides n processors on the local machine.
The first argument to remotecall is the index of the processor that will do the work.
In the first line we asked processor 2 to construct a 2-by-2 random matrix, and in the third line we asked it to add 1 to it.
The @spawnat macro evaluates the expression in the second argument on the processor specified by the first argument.
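As a small complement (not in the original slides), the sketch below shows wait and fetch used together on the same remote reference; it assumes a session with at least one worker process (e.g. julia -p 2).

# a sketch, assuming Julia was started with at least one worker (e.g. julia -p 2)
ref = remotecall(2, svdvals, rand(300, 300))  # returns immediately with a RemoteRef
wait(ref)                                     # block until worker 2 has finished
sv = fetch(ref)                               # move the result to the local process
println(maximum(sv))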
More on remote references

julia> remotecall_fetch(2, getindex, r, 1, 1)
0.675311345332873

remotecall_fetch
Occasionally you might want a remotely-computed value immediately.
The function remotecall_fetch exists for this purpose.
It is equivalent to fetch(remotecall(...)) but is more efficient.
Note that getindex(r,1,1) is equivalent to r[1,1], so this call fetches the first element of the remote reference r.

The macro @spawn
The syntax of remotecall is not especially convenient.
The macro @spawn makes things easier:
- it operates on an expression rather than a function, and
- it chooses the processor where to do the operation for you.

julia> r = @spawn rand(2,2)
RemoteRef(3,1,12)
julia> s = @spawn 1+fetch(r)
RemoteRef(3,1,13)
julia> fetch(s)
2x2 Array{Float64,2}:
 1.6117   1.20542
 1.12406  1.51088

Remarks on the example
Note that we used 1+fetch(r) instead of 1+r.
This is because we do not know where the code will run, so in general a fetch might be required to move r to the processor doing the addition.
In this case, @spawn is smart enough to perform the computation on the processor that owns r, so the fetch will be a no-op.

Tips on Moving Code and Data

Availability of a function to processors (1/3)
One important point is that your code must be available on any processor that runs it. For example, type the following into the julia prompt:

julia> function rand2(dims...)
         return 2*rand(dims...)
       end

julia> rand2(2,2)
2x2 Float64 Array:
 0.153756  0.368514
 1.15119   0.918912

julia> @spawn rand2(2,2)
RemoteRef(1,1,1)
julia> @spawn rand2(2,2)
RemoteRef(2,1,2)
julia> exception on 2: in anonymous: rand2 not defined

Availability of a function to processors (2/3)
In the previous example, processor 1 knew about the function rand2, but processor 2 did not.
To make your code available to all processors, the require function will automatically load a source file on all currently available processors:

julia> require("myfile")

In a cluster, the contents of the file (and any files loaded recursively) will be sent over the network.

Availability of a function to processors (3/3)

julia> @everywhere id = myid()
julia> remotecall_fetch(2, () -> id)
2
julia> workers()
4-element Array{Int64,1}:
 2
 3
 4
 5

The @everywhere macro executes a statement on all processes.

Running Julia with several processes or several machines
Each process has an associated identifier.
The process providing the interactive julia prompt always has an id equal to 1, as would the julia process running the driver script in the example above.
The processes used by default for parallel operations are referred to as workers.
When there is only one process, process 1 is considered a worker.
Otherwise, workers are considered to be all processes other than process 1.

The base Julia installation has built-in support for two types of clusters:
- a local cluster specified with the -p option as shown above;
- a cluster spanning machines using the --machinefile option. This uses a passwordless ssh login to start julia worker processes (from the same path as the current host) on the specified machines.
Functions addprocs, rmprocs, workers, and others are available as a programmatic means of adding, removing and querying the processes in a cluster.
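The short sketch below (an addition, not from the slides) shows the programmatic route: adding local workers with addprocs and then making a helper function available everywhere; rand2 reuses the example above.

# a sketch of the programmatic route; assumes a plain local Julia session
addprocs(3)                       # start 3 extra worker processes on this machine
@everywhere function rand2(dims...)
    return 2*rand(dims...)        # now every process knows rand2
end
fetch(@spawn rand2(2,2))          # works, whichever worker is chosen
rmprocs(workers()[end])           # remove the last worker again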
Data Movement (1/4)

Motivation
Sending messages and moving data constitute most of the overhead in a parallel program.
Reducing the number of messages and the amount of data sent is critical to achieving performance and scalability.
To this end, it is important to understand the data movement performed by Julia's various parallel programming constructs.

Data Movement (2/4)

fetch and @spawn
fetch can be considered an explicit data movement operation, since it directly asks that an object be moved to the local machine.
@spawn (and a few related constructs) also moves data, but this is not as obvious, hence it can be called an implicit data movement operation.
Consider these two approaches to constructing and squaring a random matrix. Which one is the most efficient?

# method 1
A = rand(1000,1000)
Bref = @spawn A^2
...
fetch(Bref)

# method 2
Bref = @spawn rand(1000,1000)^2
...
fetch(Bref)

Data Movement (3/4)

Answer to the question
The difference seems trivial, but in fact is quite significant due to the behavior of @spawn.
In the first method, a random matrix is constructed locally, then sent to another processor where it is squared.
In the second method, a random matrix is both constructed and squared on another processor.
Therefore the second method sends much less data than the first.

Data Movement (4/4)

Remarks on the previous example
In the previous toy example, the two methods are easy to distinguish and choose from.
However, in a real program designing data movement might require more thought and very likely some measurement.
For example, if the first processor needs matrix A, then the first method might be better.
Or, if processing A is expensive but only the current processor has it, then moving it to another processor might be unavoidable.
Or, if the current processor has very little to do between the @spawn and fetch(Bref), then it might be better to eliminate the parallelism altogether.
Or imagine rand(1000,1000) is replaced with a more expensive operation.
Then it might make sense to add another @spawn statement just for this step.
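To make the last remark concrete, here is a small sketch (an addition, not from the slides) where the construction and the squaring each get their own @spawn, so both steps run off the main process; randn is used here as a stand-in for a more expensive generator.

# a sketch: one @spawn for the (supposedly expensive) construction, one for the squaring
Aref = @spawn randn(1000,1000)          # stand-in for an expensive construction step
Bref = @spawn fetch(Aref)^2             # squaring; may run on the process that owns Aref
# ... do other local work here ...
B = fetch(Bref)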
Around the Parallel Julia Code for Fibonacci

Fibonacci (1/4)

A fresh approach to technical computing
Documentation: http://docs.julialang.org
Type "help()" to list help topics
Version 0.2.0-prerelease+3622
Commit c9bb96c 2013-09-04 15:34:41 UTC
x86_64-redhat-linux

julia> addprocs(3)
3-element Array{Any,1}:
 2
 3
 4

julia> @everywhere function fib(n)
         if (n < 2)
           return n
         else
           return fib(n-1) + fib(n-2)
         end
       end

Fibonacci (2/4)

julia> z = @spawn fib(10)
RemoteRef(3,1,8)
julia> fetch(z)
55
julia> @time [fib(i) for i=1:45];
elapsed time: 27.646200328 seconds (416 bytes allocated)

Fibonacci (3/4)

julia> @everywhere function fib_parallel(n)
         if (n < 40)
           return fib(n)
         else
           x = @spawn fib_parallel(n-1)
           y = fib_parallel(n-2)
           return fetch(x) + y
         end
       end

julia> @time [fib_parallel(i) for i=1:45];
elapsed time: 12.315891358 seconds (62472 bytes allocated)

Fibonacci (4/4)

julia> @time @parallel [fib(45) for i=1:4]
elapsed time: 11.186433545 seconds (74564 bytes allocated)
4-element DArray{Int64,1,Array{Int64,1}}:
 1134903170
 1134903170
 1134903170
 1134903170

julia> @time [fib(45) for i=1:4]
elapsed time: 42.185831168 seconds (80 bytes allocated)
4-element Array{Int64,1}:
 1134903170
 1134903170
 1134903170
 1134903170

Parallel Maps and Reductions

A first example of parallel reduction

julia> @everywhere function count_heads(n)
         c::Int = 0
         for i=1:n
           c += randbool()
         end
         c
       end

julia> a = @spawn count_heads(100000000)
RemoteRef(7,1,31)
julia> b = @spawn count_heads(100000000)
RemoteRef(2,1,32)
julia> fetch(a)+fetch(b)
99993168

This simple example demonstrates a powerful and often-used parallel programming pattern: reduction.
Many iterations run independently over several processors, and then their results are combined using some function.
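The same reduction can be spread over every available worker instead of just two; the sketch below (an addition, not from the slides) combines the pattern above with workers() and a comprehension.

# a sketch: one count_heads task per worker, results combined with sum
refs = [(@spawn count_heads(100000000)) for w in workers()]
total = sum(map(fetch, refs))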
Parallel reduction using @parallel (1/4)

Usage of parallel for loops
In the previous example, we used two explicit @spawn statements, which limits the parallelism to two processors.
To run on any number of processors, we can use a parallel for loop, which can be written in Julia like this:

nheads = @parallel (+) for i=1:200000000
    randbool()
end

Comments
This construct implements the pattern of
- assigning iterations to multiple processors, and
- combining them with a specified reduction (in this case (+)).
Notice that the reduction operator can be omitted if it is not needed.
However, the semantics of such a parallel for-loop can be dramatically different from its serial elision, as we shall see on the example of the next slide.

Parallel reduction using @parallel (2/4)

julia> a = zeros(4)
4-element Array{Float64,1}:
 0.0
 0.0
 0.0
 0.0

julia> @parallel for i=1:4
         a[i] = i
       end

julia> a
4-element Array{Float64,1}:
 0.0
 0.0
 0.0
 0.0

julia> for i=1:4
         a[i] = i
       end

julia> a
4-element Array{Float64,1}:
 1.0
 2.0
 3.0
 4.0

Parallel reduction using @parallel (3/4)

Evaluation of a parallel for-loop
Iterations run on different processors and do not happen in a specified order.
Consequently, writes to variables or arrays will not be globally visible.
Any variables used inside the parallel loop will be copied and broadcast to each processor.
Processors produce results which are made visible to the launching processor via the reduction.
This explains why the following code will not work as intended:

julia> @parallel for i=1:4
         a[i] = i
       end

Comments on the example
Each processor will have a separate copy of a.
Parallel for loops like these must be avoided.
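One common way around this pitfall, added here as a forward reference to the Shared Arrays section below (it is not part of the original slides), is to write into a SharedArray, whose memory is visible to all local workers.

# a sketch: the same loop, but writing into shared memory (see the Shared Arrays section)
a = SharedArray(Float64, 4)
@sync @parallel for i=1:4
    a[i] = i          # each worker writes a distinct slot of the shared array
end
# after the @sync completes, a == [1.0, 2.0, 3.0, 4.0] on the master process as well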
Parallel reduction using @parallel (4/4)

Use of "outside" variables in parallel for-loops
Using outside variables in parallel loops is perfectly reasonable if the variables are read-only. See the example on the next slide.
In some cases no reduction operator is needed, and we merely wish to apply a function to all elements in some collection.
This is another useful operation called parallel map, implemented in Julia as the pmap function.
For example, we could compute the rank of several large random matrices in parallel as follows:

julia> M = {rand(1000,1000) for i=1:4};
julia> pmap(rank, M)
4-element Array{Any,1}:
 1000
 1000
 1000
 1000

Use of "outside" variables in parallel for-loops

julia> tic()
0x0000730b8e54d53a
julia> R = [(@spawnat i rank(M[i])) for i=1:4]
4-element Array{Any,1}:
 RemoteRef(1,1,57)
 RemoteRef(2,1,58)
 RemoteRef(3,1,59)
 RemoteRef(4,1,60)
julia> toc()
elapsed time: 5.252494335 seconds
julia> tic()
0x0000731c4a2ef8cc
julia> S = 0
0
julia> for i=1:4
         S = S + fetch(R[i])
       end
julia> toc()
elapsed time: 8.340909436 seconds
julia> S
4000

julia> @time @parallel (+) for i=1:4
         rank(M[i])
       end
elapsed time: 1.23295268 seconds (234965420 bytes allocated)
4000

Distributed Computing with Arrays: First Examples

Computing the maximum value of an array in parallel

julia> @everywhere function maxnum_serial(a,s,e)
         if s == e
           a[s]
         else
           mid = ifloor((s+e)/2)
           low  = maxnum_serial(a,s,mid)
           high = maxnum_serial(a,mid+1,e)
           low > high ? low : high
         end
       end

julia> @everywhere function maxnum_parallel(a,s,e)
         if (e - s) <= 10000000
           maxnum_serial(a,s,e)
         else
           mid = ifloor((s+e)/2)
           low_remote = @spawn maxnum_parallel(a,s,mid)
           high = maxnum_parallel(a,mid+1,e)
           low = fetch(low_remote)
           low > high ? low : high
         end
       end

julia> a = rand(20000000);
julia> @time maxnum_serial(a,1,20000000)
elapsed time: 0.458792535 seconds (61556 bytes allocated)
0.999999919794377
julia> @time maxnum_parallel(a,1,20000000)    # two recursive calls
elapsed time: 0.654630977 seconds (268541944 bytes allocated)
0.999999919794377

As we can see, the parallel version runs slower than its serial counterpart.
Indeed, the amount of work (the number of comparisons) is of the same order of magnitude as the amount of data transfer (the number of values to move from one processor to another), and the latter costs many more clock cycles.

Computing the minimum and maximum values of an array in parallel

julia> @everywhere function minimum_maximum_serial(a,s,e)
         if s == e
           a[s], a[s]
         else
           mid = ifloor((s+e)/2)
           X = minimum_maximum_serial(a,s,mid)
           Y = minimum_maximum_serial(a,mid+1,e)
           min(X[1],Y[1]), max(X[2],Y[2])
         end
       end

julia> @everywhere function minimum_maximum_parallel(a,s,e)
         if (e - s) <= 10000000
           minimum_maximum_serial(a,s,e)
         else
           mid = ifloor((s+e)/2)
           R = @spawn minimum_maximum_parallel(a,s,mid)
           Y = minimum_maximum_parallel(a,mid+1,e)
           X = fetch(R)
           min(X[1],Y[1]), max(X[2],Y[2])
         end
       end

julia> a = rand(20000000);
julia> @time minimum_maximum_serial(a,1,20000000)
elapsed time: 7.89881551 seconds (3840094852 bytes allocated)
julia> @time minimum_maximum_parallel(a,1,20000000)
elapsed time: 4.32320816 seconds (2188546996 bytes allocated)

In-place serial merge sort

function mergesort(data, istart, iend)
    if (istart < iend)
        mid = (istart + iend) >> 1
        mergesort(data, istart, mid)
        mergesort(data, mid+1, iend)
        merge(data, istart, mid, iend)
    end
end

function merge(data, istart, mid, iend)
    n = iend - istart + 1
    temp = zeros(n)
    s = istart
    m = mid+1
    for tem = 1:n
        if s <= mid && (m > iend || data[s] <= data[m])
            temp[tem] = data[s]
            s = s+1
        else
            temp[tem] = data[m]
            m = m+1
        end
    end
    data[istart:iend] = temp[1:n]
end

julia> n = 1000000
julia> A = [rem(rand(Int32),10) for i=1:n];
julia> @time mergesort(A, 1, n);
elapsed time: 0.6119898 seconds (447195104 bytes allocated)

Out-of-place serial merge sort

function mergesort(data, istart, iend)
    if (istart < iend)
        mid = ifloor((istart + iend)/2)
        a = mergesort(data, istart, mid)
        b = mergesort(data, mid+1, iend)
        c = merge(a, b, istart, mid, iend)
    else
        data[istart]
    end
end

@everywhere function merge(a, b, istart, mid, iend)
    n  = iend - istart + 1
    nb = iend - mid
    na = mid - istart + 1
    c = zeros(n)
    s = 1
    m = 1
    for tem = 1:n
        if s <= na && (m > nb || a[s] <= b[m])
            c[tem] = a[s]
            s = s+1
        else
            c[tem] = b[m]
            m = m+1
        end
    end
    c
end

julia> n = 1000000;
julia> A = [rem(rand(Int32),10) for i=1:n];
julia> @time mergesort(A, 1, n);
elapsed time: 0.60765198 seconds (348516200 bytes allocated)

Out-of-place parallel merge sort

@everywhere function mergesort_serial(data, istart, iend)
    if (istart < iend)
        mid = ifloor((istart + iend)/2)
        a = mergesort_serial(data, istart, mid)
        b = mergesort_serial(data, mid+1, iend)
        c = merge(a, b, istart, mid, iend)
    else
        data[istart]
    end
end

@everywhere function mergesort_parallel(data, istart, iend)
    if (iend - istart <= 2500000)
        mergesort_serial(data, istart, iend)
    else
        mid = ifloor((istart + iend)/2)
        a = @spawn mergesort_parallel(data, istart, mid)
        b = mergesort_parallel(data, mid+1, iend)
        c = merge(fetch(a), b, istart, mid, iend)
    end
end

julia> n = 10000000;
julia> A = [rem(rand(Int32),10) for i=1:n];
julia> @time mergesort_serial(A, 1, n);
elapsed time: 9.25899279 seconds (3533393840 bytes allocated, 21.86% gc time)
julia> @time mergesort_parallel(A, 1, n);
elapsed time: 6.142867529 seconds (1292099096 bytes allocated, 9.75% gc time)
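A quick sanity check (an addition, not from the slides): the out-of-place sorts return a new array, so their output can be validated with issorted and compared against Julia's built-in sort.

# a sketch: validate the parallel sort against the built-in one
n = 100000
A = [rem(rand(Int32), 10) for i = 1:n]
B = mergesort_parallel(A, 1, n)
println(issorted(B))            # should print true
println(B == sort(A))           # same values in the same (sorted) order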
Distributed Arrays

Distributed Arrays (1/7)

Idea
Large computations are often organized around large arrays of data.
In these cases, a particularly natural way to obtain parallelism is to distribute arrays among several processes.
This combines the memory resources of multiple machines, allowing use of arrays too large to fit on one machine.
Each process operates on the part of the array it owns, providing a ready answer to the question of how a program should be divided among machines.

The DArray type
Julia distributed arrays are implemented by the DArray type.
A DArray has an element type and dimensions just like an Array.
A DArray can also use arbitrary array-like types to represent the local chunks that store actual data.
The data in a DArray is distributed by dividing the index space into some number of blocks in each dimension.

Distributed Arrays (2/7)

Constructing distributed arrays
Common kinds of arrays can be constructed with functions beginning with d:
dzeros(100,100,10)
dones(100,100,10)
drand(100,100,10)
drandn(100,100,10)
dfill(x, 100,100,10)
In the last case, each element will be initialized to the specified value x.
These functions automatically pick a distribution for you.

Constructing distributed arrays with more control
For more control, you can specify which processors to use, and how the data should be distributed:
dzeros((100,100), workers()[1:4], [1,4])
The second argument specifies that the array should be created on the first four workers.
When dividing data among a large number of processes, one often sees diminishing returns in performance.
Placing DArrays on a subset of processes allows multiple DArray computations to happen at once, with a higher ratio of work to communication on each process.
The third argument specifies a distribution; the nth element of this array specifies how many pieces dimension n should be divided into.
In this example the first dimension will not be divided, and the second dimension will be divided into 4 pieces. Therefore each local chunk will be of size (100,25).
Note that the product of the distribution array must equal the number of processors.
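To see the effect of the dist argument, the sketch below (an addition; it assumes a session with at least four workers) builds the (100,100) array above and asks each owner for the size of its local chunk.

# a sketch, assuming at least 4 workers (e.g. julia -p 4)
d = dzeros((100,100), workers()[1:4], [1,4])
sizes = map(fetch, [(@spawnat p size(localpart(d))) for p in procs(d)])
println(sizes)        # expected: four chunks of size (100,25)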
Distributed Arrays (3/7)

Constructing distributed arrays with even more control
The primitive DArray constructor has the following somewhat elaborate signature:
DArray(init, dims, procs, dist)
- init is a function that accepts a tuple of index ranges. This function should allocate a local chunk of the distributed array and initialize it for the specified indices.
- dims is the overall size of the distributed array.
- procs optionally specifies a vector of processor IDs to use.
- dist is an integer vector specifying how many chunks the distributed array should be divided into in each dimension.
The last two arguments are optional, and defaults will be used if they are omitted.

Example
As an example, here is how to turn the local array constructor fill into a distributed array constructor:
dfill(v, args...) = DArray(I -> fill(v, map(length,I)), args...)
In this case the init function only needs to call fill with the dimensions of the local piece it is creating.

Distributed Arrays (4/7)

julia> @everywhere function par(I)
         # create our local patch
         # I is a tuple of intervals; each interval is regarded as a 1D array with integer entries
         # size(I[1], 1) gives the number of entries in I[1]
         # size(I[2], 1) gives the number of entries in I[2]
         d = (size(I[1], 1), size(I[2], 1))
         m = fill(myid(), d)
         return m
       end

julia> @everywhere h = 8
julia> @everywhere w = 8
julia> m = DArray(par, (h, w), [2:5])
8x8 DArray{Int64,2,Array{Int64,2}}:
 2  2  2  2  4  4  4  4
 2  2  2  2  4  4  4  4
 2  2  2  2  4  4  4  4
 2  2  2  2  4  4  4  4
 3  3  3  3  5  5  5  5
 3  3  3  3  5  5  5  5
 3  3  3  3  5  5  5  5
 3  3  3  3  5  5  5  5

Distributed Arrays (5/7)

julia> m.chunks
2x2 Array{RemoteRef,2}:
 RemoteRef(2,1,28)  RemoteRef(4,1,30)
 RemoteRef(3,1,29)  RemoteRef(5,1,31)

julia> m.indexes
2x2 Array{(Range1{Int64},Range1{Int64}),2}:
 (1:4,1:4)  (1:4,5:8)
 (5:8,1:4)  (5:8,5:8)

julia> @spawn rank(m)
RemoteRef(3,1,289)
julia> @spawn rank(m)
RemoteRef(4,1,290)
julia> @spawn rank(m)
RemoteRef(5,1,291)
julia> exception on 3: exception on 4: exception on 5:
ERROR: no method svdvals(DArray{Int64,2,Array{Int64,2}},)
 in rank at linalg/generic.jl:87
 in anonymous at multi.jl:1239
 in anonymous at multi.jl:804
 in run_work_thunk at multi.jl:563
 in anonymous at task.jl:76

Distributed Arrays (6/7)

julia> @spawnat 2 println(localpart(m))
RemoteRef(2,1,292)
julia> mm = @spawnat 2 rank(localpart(m))
RemoteRef(2,1,293)
julia> fetch(mm)
        From worker 2:  2 2 2 2
        From worker 2:  2 2 2 2
        From worker 2:  2 2 2 2
        From worker 2:  2 2 2 2
1

julia> ? DArray
Base.DArray(init, dims, procs, dist)
Construct a distributed array. "init" is a function that accepts a tuple of index ranges. This function should allocate a local chunk of the distributed array and initialize it for the specified indices. "dims" is the overall size of the distributed array. "procs" optionally specifies a vector of processor IDs to use. "dist" is an integer vector specifying how many chunks the distributed array should be divided into in each dimension.
For example, the "dfill" function that creates a distributed array and fills it with a value "v" is implemented as:
"dfill(v, args...) = DArray(I->fill(v, map(length,I)), args...)"

Distributed Arrays (7/7)

Operations on distributed arrays
- distribute(a::Array) converts a local array to a distributed array.
- localpart(a::DArray) obtains the locally-stored portion of a DArray.
- myindexes(a::DArray) gives a tuple of the index ranges owned by the local process.
- convert(Array, a::DArray) brings all the data to the local processor.
- Indexing a DArray (square brackets) with ranges of indexes always creates a SubArray, not copying any data.
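A short illustration of these operations (an addition, not from the slides), reusing the 8x8 DArray m built above:

# a sketch, reusing the DArray m from the slides above
whole = convert(Array, m)                       # gather every chunk into a local Array
view  = m[1:4, 5:8]                             # a SubArray over the chunk owned by worker 4
mine  = fetch(@spawnat 2 myindexes(m))          # index ranges owned by worker 2, e.g. (1:4,1:4)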
Map Reduce

Distributed arrays and parallel reduction (1/4)

moreno@compute03: julia -p 5
A fresh approach to technical computing
Documentation: http://docs.julialang.org
Type "help()" to list help topics
Version 0.2.0-prerelease+3622
Commit c9bb96c 2013-09-04 15:34:41 UTC
x86_64-redhat-linux

julia> da = @parallel [2i for i = 1:10]
10-element DArray{Int64,1,Array{Int64,1}}:
  2
  4
  6
  8
 10
 12
 14
 16
 18
 20

Distributed arrays and parallel reduction (2/4)

julia> procs(da)
4-element Array{Int64,1}:
 2
 3
 4
 5

julia> da.chunks
4-element Array{RemoteRef,1}:
 RemoteRef(2,1,1)
 RemoteRef(3,1,2)
 RemoteRef(4,1,3)
 RemoteRef(5,1,4)

julia> da.indexes
4-element Array{(Range1{Int64},),1}:
 (1:3,)
 (4:5,)
 (6:8,)
 (9:10,)

julia> da[3]
6
julia> da[3:5]
3-element SubArray{Int64,1,DArray{Int64,1,Array{Int64,1}},(Range1{Int64},)}:
  6
  8
 10

Distributed arrays and parallel reduction (3/4)

julia> fetch(@spawnat 2 da[3])
6

julia> [(@spawnat p sum(localpart(da))) for p=procs(da)]
4-element Array{Any,1}:
 RemoteRef(2,1,71)
 RemoteRef(3,1,72)
 RemoteRef(4,1,73)
 RemoteRef(5,1,74)

julia> map(fetch, [(@spawnat p sum(localpart(da))) for p=procs(da)])
4-element Array{Any,1}:
 12
 18
 42
 38

julia> sum(da)
110

Distributed arrays and parallel reduction (4/4)

julia> reduce(+, map(fetch, [(@spawnat p sum(localpart(da))) for p=procs(da)]))
110

julia> preduce(f,d) = reduce(f, map(fetch, [(@spawnat p f(localpart(d))) for p=procs(d)]))
# methods for generic function preduce
preduce(f,d) at none:1

julia> function Base.minimum(x::Int64, y::Int64)
         min(x,y)
       end
minimum (generic function with 10 methods)

julia> preduce(minimum, da)
2
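preduce works with any operation whose per-chunk reduction carries the same name; the sketch below (an addition, not from the slides) mirrors the minimum example by giving Base.maximum a two-argument method so it can serve both as the per-chunk reducer and as the combining step.

# a sketch mirroring the minimum example above
function Base.maximum(x::Int64, y::Int64)
    max(x, y)                 # two-argument method used by the final reduce on process 1
end
preduce(maximum, da)          # expected result: 20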
Shared Arrays

Shared arrays (1/6)

Shared arrays vs distributed arrays
Shared Arrays use system shared memory to map the same array across many processes.
While there are some similarities to a DArray, the behavior of a SharedArray is quite different.
In a DArray, each process has local access to just a chunk of the data, and no two processes share the same chunk; in contrast, in a SharedArray each participating process has access to the entire array.
A SharedArray is a good choice when you want to have a large amount of data jointly accessible to two or more processes on the same machine.

Shared arrays (2/6)

Shared arrays vs regular arrays
SharedArray indexing (assignment and accessing values) works just as with regular arrays, and is efficient because the underlying memory is available to the local process.
Therefore, most algorithms work naturally on SharedArrays, albeit in single-process mode.
In cases where an algorithm insists on an Array input, the underlying array can be retrieved from a SharedArray by calling sdata(S).

Shared arrays (3/6)

The constructor for a shared array is of the form:
SharedArray(T::Type, dims::NTuple; init=false, pids=Int[])
which creates a shared array of element type T and size dims across the processes specified by pids.
Unlike distributed arrays, a shared array is accessible only from those participating workers specified by the pids named argument (and the creating process too, if it is on the same host).
If an init function, of signature initfn(S::SharedArray), is specified, then it is called on all the participating workers.
You can arrange it so that each worker runs the init function on a distinct portion of the array, thereby parallelizing initialization.

Shared arrays (4/6)

Here is a brief example (with Julia started with -p 4):

julia> S = SharedArray(Int, (3,4), init = S -> S[localindexes(S)] = myid())
3x4 SharedArray{Int64,2}:
 1  2  4  5
 1  3  4  5
 2  3  5  5

julia> S[3,2] = 7
7
julia> S
3x4 SharedArray{Int64,2}:
 1  2  4  5
 1  3  4  5
 2  7  5  5

localindexes provides disjoint one-dimensional ranges of indexes, and is sometimes convenient for splitting up tasks among processes.
You can, of course, divide the work any way you wish:
S = SharedArray(Int, (4,4), init = S -> S[myid():nworkers()+1:length(S)] = myid())

Shared arrays (5/6)

Continuing the example (with Julia started with -p 3):

julia> S
4x4 SharedArray{Int64,2}:
 1  1  1  1
 2  2  2  2
 3  3  3  3
 4  4  4  4

julia> for i=1:3, j=1:4
         S[i,j] = myid()
       end
julia> S
4x4 SharedArray{Int64,2}:
 1  1  1  1
 1  1  1  1
 1  1  1  1
 4  4  4  4

julia> @spawn for i=1:3, j=1:4
         S[i,j] = myid()
       end
RemoteRef(3,1,33)
julia> S
4x4 SharedArray{Int64,2}:
 3  3  3  3
 3  3  3  3
 3  3  3  3
 4  4  4  4

Shared arrays (6/6)

Since all processes have access to the underlying data, you do have to be careful not to set up conflicts. For example:

@sync begin
    for p in workers()
        @spawn for i=1:4, j=1:4
            S[i,j] = myid()
        end
    end
end

would result in undefined behavior: because each process fills the entire array with its own pid, whichever process is the last to execute (for any particular element of S) will have its pid retained.
One could even get more random behavior as follows:

@sync begin
    for p in workers()
        @async begin
            remotecall_wait(p, fill!, S, p)
        end
    end
end
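One way to avoid such conflicts, sketched below as an addition to the slides, is to give each worker an explicitly disjoint slice to write; here each worker fills one row of S (assuming Julia was started with -p 3, so there are 3 workers and S has at least 3 rows).

# a sketch: disjoint writes, one row per worker (assumes -p 3 and the 4x4 S above)
refs = [(@spawnat p (S[k, :] = myid())) for (k, p) in enumerate(workers())]
map(wait, refs)        # rows 1..3 now hold the pids 2..4; no element is written twice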
Matrix Multiplication Using Shared Arrays

Blockwise matrix multiplication (1/3)
Assume that we want to multiply two square matrices A and B of order n, yielding a square matrix C of order n. Assume also that n is a power of 2.
Then each of A, B, C can be divided into 4 blocks (themselves matrices) of order n/2 as depicted below:

[ C11 C12 ]   [ A11 A12 ]   [ B11 B12 ]
[ C21 C22 ] = [ A21 A22 ] * [ B21 B22 ]

with

C11 = A11*B11 + A12*B21    C12 = A11*B12 + A12*B22
C21 = A21*B11 + A22*B21    C22 = A21*B12 + A22*B22

This leads to a recursive process for multiplying matrices, with 8 recursive calls, namely for A11*B11, A11*B12, ..., A22*B22.
In practice, the recursive calls should be performed until a base case is reached (typically n = 32 or n = 64 or n = 128, depending on the machine, the type of the input coefficients and the initial value of n).
The code on the next slide implements these ideas.

Blockwise matrix multiplication (2/3)

function dacmm(i0, i1, j0, j1, k0, k1, A, B, C, n, basecase)
    # A, B, C are matrices; we accumulate the product A*B into C
    if n > basecase
        n = div(n, 2)
        dacmm(i0, i1, j0, j1, k0, k1, A, B, C, n, basecase)
        dacmm(i0, i1, j0, j1+n, k0, k1+n, A, B, C, n, basecase)
        dacmm(i0+n, i1, j0, j1, k0+n, k1, A, B, C, n, basecase)
        dacmm(i0+n, i1, j0, j1+n, k0+n, k1+n, A, B, C, n, basecase)
        dacmm(i0, i1+n, j0+n, j1, k0, k1, A, B, C, n, basecase)
        dacmm(i0, i1+n, j0+n, j1+n, k0, k1+n, A, B, C, n, basecase)
        dacmm(i0+n, i1+n, j0+n, j1, k0+n, k1, A, B, C, n, basecase)
        dacmm(i0+n, i1+n, j0+n, j1+n, k0+n, k1+n, A, B, C, n, basecase)
    else
        for i=1:n, j=1:n, k=1:n
            C[i+k0,k1+j] = C[i+k0,k1+j] + A[i+i0,i1+k] * B[k+j0,j1+j]
        end
    end
end

Blockwise matrix multiplication (3/3)

julia> n = 4
4
julia> basecase = 2
2
julia> A = [rem(rand(Int32),5) for i=1:n, j=1:n]
4x4 Array{Int64,2}:
 4  2  0  3
 1  4  1  0
 1  0  0  4
 2  3  4  2
julia> B = [rem(rand(Int32),5) for i=1:n, j=1:n]
4x4 Array{Int64,2}:
 3  4  4  2
 4  4  3  1
 4  4  0  2
 0  3  2  3
julia> C = zeros(Int32,n,n);
julia> dacmm(0, 0, 0, 0, 0, 0, A, B, C, n, basecase)
julia> C
4x4 Array{Int32,2}:
 4   17  16   1
 15  16  16   4
 3   8   4   14
 2   10  21  13

Parallel blockwise matrix multiplication (1/2)

@everywhere function dacmm_parallel(i0, i1, j0, j1, k0, k1, A, B, C, s, X)
    if s > X
        s = div(s, 2)
        lrf = [@spawn dacmm_parallel(i0, i1, j0, j1, k0, k1, A, B, C, s, X),
               @spawn dacmm_parallel(i0, i1, j0, j1+s, k0, k1+s, A, B, C, s, X),
               @spawn dacmm_parallel(i0+s, i1, j0, j1, k0+s, k1, A, B, C, s, X),
               @spawn dacmm_parallel(i0+s, i1, j0, j1+s, k0+s, k1+s, A, B, C, s, X)]
        pmap(fetch, lrf)
        lrf = [@spawn dacmm_parallel(i0, i1+s, j0+s, j1, k0, k1, A, B, C, s, X),
               @spawn dacmm_parallel(i0, i1+s, j0+s, j1+s, k0, k1+s, A, B, C, s, X),
               @spawn dacmm_parallel(i0+s, i1+s, j0+s, j1, k0+s, k1, A, B, C, s, X),
               @spawn dacmm_parallel(i0+s, i1+s, j0+s, j1+s, k0+s, k1+s, A, B, C, s, X)]
        pmap(fetch, lrf)
    else
        for i=0:(s-1), j=0:(s-1), k=0:(s-1)
            C[i+k0,k1+j] += A[i+i0,i1+k] * B[k+j0,j1+j]
        end
    end
end

Parallel blockwise matrix multiplication (2/2)

s = 8
A = convert(SharedArray, rand(s,s))
B = convert(SharedArray, rand(s,s))
C = convert(SharedArray, zeros(s,s))
dacmm_parallel(1,1,1,1,1,1,A,B,C,s,8)
dacmm_parallel(1,1,1,1,1,1,A,B,C,s,2)

s = 1024
A = convert(SharedArray, rand(s,s))
B = convert(SharedArray, rand(s,s))
C = convert(SharedArray, zeros(s,s));
@time dacmm_parallel(1,1,1,1,1,1,A,B,C,s,64)
# 4.486267909 seconds
C = convert(SharedArray, zeros(s,s));
@time dacmm_parallel(1,1,1,1,1,1,A,B,C,s,1024)
# 45.38339897 seconds
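As a quick correctness check (an addition to the slides), the result of dacmm_parallel can be compared against Julia's built-in matrix product on a small instance; sdata retrieves the plain Array behind each SharedArray.

# a sketch: compare the blockwise product against the built-in one on a small case
s = 64
A = convert(SharedArray, rand(s,s))
B = convert(SharedArray, rand(s,s))
C = convert(SharedArray, zeros(s,s))
dacmm_parallel(1, 1, 1, 1, 1, 1, A, B, C, s, 16)
err = maximum(abs(sdata(C) - sdata(A)*sdata(B)))   # elementwise abs on this Julia version
println(err)    # should be on the order of machine precision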
Synchronization

How does Julia schedule computations?

Julia's scheduling strategy is based on tasks
Julia's parallel programming platform uses Tasks (aka Coroutines) to switch among multiple computations.
Whenever code performs a communication operation like fetch or wait, the current task is suspended and a scheduler picks another task to run.
A task is restarted when the event it is waiting for completes.

Dynamic scheduling
For many problems, it is not necessary to think about tasks directly.
However, they can be used to wait for multiple events at the same time, which provides for dynamic scheduling.
In dynamic scheduling, a program decides what to compute or where to compute it based on when other jobs finish.
This is needed for unpredictable or unbalanced workloads, where we want to assign more work to processes only when they finish their current tasks.
As an example, consider computing the ranks of matrices of different sizes:

M = {rand(800,800), rand(600,600), rand(800,800), rand(600,600)}
pmap(rank, M)

Implementation of pmap

Main idea
Processor 1 dispatches the arguments of function f to the workers via remotecall_fetch.

Details
Each worker is associated with a local task feeding work to it.
This mapping is done in the for loop, where each iteration is run asynchronously.
Indeed, each of these iterations submits remote calls via remotecall_fetch and waits; note the use of the while true loop.
Once a remote call is submitted, the corresponding task is interrupted and another iteration can run; note that all these tasks are local to processor 1, hence only one runs at a time.
Each worker knows which item to pick from the list lst thanks to the function nextidx().
Maybe another task has changed the variable i by the time a call to nextidx() returns: but this does not matter, thanks to the use of the local variable idx.

function pmap(f, lst)
    np = nprocs()          # determine the number of processes available
    n = length(lst)
    results = cell(n)
    i = 1
    # function to produce the next work item from the queue.
    # in this case it's just an index.
    nextidx() = (idx=i; i+=1; idx)
    @sync begin
        for p=1:np
            if p != myid() || np == 1
                @async begin
                    while true
                        idx = nextidx()
                        if idx > n
                            break
                        end
                        results[idx] = remotecall_fetch(p, f, lst[idx])
                    end
                end
            end
        end
    end
    results
end

@spawnlocal, @sync and @everywhere

@spawnlocal (recently renamed @async)
@spawnlocal is similar to @spawn, but only runs tasks on the local processor.
In the pmap example above, we use it to create a feeder task for each processor.
Each task picks the next index that needs to be computed, then waits for its processor to finish, then repeats until we run out of indexes.

@sync
A @sync block is used to wait for all the local tasks to complete, at which point the whole operation is done.
Notice that all the feeder tasks are able to share the state i via nextidx() since they all run on the same processor.
However, no locking is required, since the tasks are scheduled cooperatively and not preemptively.
This means context switches only occur at well-defined points (during the fetch operation).

@everywhere
It is often useful to execute a statement on all processors, particularly for setup tasks such as loading source files and defining common variables.
This can be done with the @everywhere macro.
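As a minimal illustration of @sync and @async outside of pmap (an addition, not from the slides), the following waits for one quick remote call per worker; all feeder tasks run on processor 1.

# a sketch: one feeder task per worker, all waited on by the enclosing @sync
@sync begin
    for p in workers()
        @async println("worker ", p, " reports id ", remotecall_fetch(p, myid))
    end
end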
A Simple Simulation Using Distributed Arrays

Simulation 1/14

julia> @everywhere function SimulationSerial(A, N, T)
         for t=0:(T-1)
           past    = rem(t,3) + 1
           present = rem(t+1,3) + 1
           future  = rem(t+2,3) + 1
           for x=1:N
             A[future,x] = (A[present,x] + A[past,x]) / 2
           end
         end
       end

Comments
Consider a simple simulation with a stencil of the form A[t+2,i] = (A[t+1,i] + A[t,i]) / 2.
We start with a serial function realizing T time steps at N points.

Simulation 2/14

julia> N = 16
16
julia> T = 7
7
julia> A = rand(3,N)
3x16 Array{Float64,2}:
 0.0685805  0.0163473  0.782845  0.0100164  0.449585   0.937391  0.571368  ...
 0.397517   0.90764    0.468425  0.830325   0.0634363  0.733477  0.267525  ...
 0.792513   0.54764    0.183695  0.597147   0.75237    0.68958   0.129608  ...
julia> for j=1:N A[3,j] = 0 end
julia> A
3x16 Array{Float64,2}:
 0.0685805  0.0163473  0.782845  0.0100164  0.449585   0.937391  0.571368  ...
 0.397517   0.90764    0.468425  0.830325   0.0634363  0.733477  0.267525  ...
 0.0        0.0        0.0       0.0        0.0        0.0       0.0       ...
julia> SimulationSerial(A, N, T)
julia> A
3x16 Array{Float64,2}:
 0.284445  0.601258  0.576507  0.548344  0.196175  0.803572  0.371971  ...
 0.289585  0.615184  0.571594  0.561161  0.190141  0.800386  0.367223  ...
 0.287015  0.608221  0.57405   0.554753  0.193158  0.801979  0.369597  ...

Comments
We continue with very simple input data for testing our serial code.

Simulation 3/14

julia> dA = dones(3,N)
3x16 DArray{Float64,2,Array{Float64,2}}:
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  ...
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  ...
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  ...
julia> for p=procs(dA) @spawnat p println(localpart(dA)) end
julia> for p=procs(dA) @spawnat p println((dA.indexes)[p-1]) end
julia> for p=procs(dA) @spawnat p println(size((dA.indexes)[p-1][2],1)) end
        From worker 9:  1 1
        From worker 9:  1 1
        From worker 9:  1 1
        From worker 7:  1 1
        From worker 7:  1 1
        ...
        From worker 7:  (1:3,11:12)
        From worker 8:  2
        From worker 8:  (1:3,13:14)

Comments
In preparation for a parallel implementation, we review how to manipulate distributed arrays.
Simulation 4/14

julia> function SimulationParallel(dA, N, T)
         P = length(procs(dA))
         Nlocal = [size((dA.indexes)[w][2],1) for w=1:P]
         refs = [(@spawnat (procs(dA))[w] SimulationSerial(localpart(dA), Nlocal[w], T)) for w=1:P]
         pmap(fetch, refs)
       end

Comments
In this code, each worker updates its local part without exchanging data with the other workers.
Remote calls get workers to start computing at essentially the same time.
The last statement of the code forces the workers to complete before returning from the function.

Simulation 5/14

julia> N = 1000000
1000000
julia> T = 1000
1000
julia> A = rand(3,N) ; for j=1:N A[3,j] = 0 end ; A
3x1000000 Array{Float64,2}:
 0.690014  0.539029  0.182901  0.272785  0.709785  0.784796  0.140619  ...
 0.523338  0.96348   0.278264  0.135104  0.288478  0.032159  0.136924  ...
 0.0       0.0       0.0       0.0       0.0       0.0       0.0       ...
julia> @time SimulationSerial(A, N, T)
elapsed time: 6.561795108 seconds (13880 bytes allocated)

Comments
Now we consider a large example with 1,000,000 points and 1,000 time steps.
The serial code runs in about 6.5 seconds.

Simulation 6/14

julia> dA = drand(3,N) ; dA
3x1000000 DArray{Float64,2,Array{Float64,2}}:
 0.714203  0.789365  0.79275   0.381862  0.350883  0.423851  0.455572  ...
 0.851916  0.46507   0.99652   0.551413  0.411047  0.665104  0.502293  ...
 0.802434  0.663831  0.352931  0.787101  0.763005  0.736609  0.653733  ...
julia> @time SimulationParallel(dA,N,T)
elapsed time: 3.890954306 seconds (18475672 bytes allocated, 0.61% gc time)
8-element Array{Any,1}:
 nothing
 nothing
 nothing
 nothing
 nothing
 nothing
 nothing
 nothing

Comments
Our first parallel function runs about twice as fast on 8 cores.

Simulation 7/14

function SimulationParallelWithSynchronization(dA, N, T)
    Ps = procs(dA)
    P = length(Ps)
    Nlocal = [size((dA.indexes)[w][2],1) for w=1:P]
    for t=0:(T-1)
        refs = [(@spawnat Ps[w] SimulationSerial(localpart(dA), Nlocal[w], 1)) for w=1:P]
        pmap(fetch, refs)
    end
end

Comments
Now we consider a more challenging situation where synchronization (among workers) and data communication are needed after each time step.

Simulation 8/14

julia> N = 1000000 ; T = 1000 ; A = rand(3,N)
3x1000000 Array{Float64,2}:
 0.216248  0.786213  0.703382  0.90462   0.115365  0.612519  0.016185  ...
 0.111335  0.345602  0.447664  0.842326  0.835184  0.210003  0.557303  ...
 0.66509   0.655981  0.522527  0.100767  0.224793  0.652794  0.444837  ...
julia> @time SimulationSerial(A, N, T)
elapsed time: 6.560018627 seconds (80 bytes allocated)
julia> dA = drand(3,N)
3x1000000 DArray{Float64,2,Array{Float64,2}}: ...
julia> @time SimulationParallelWithSynchronization(dA,N,T)
elapsed time: 7.372995738 seconds (186383344 bytes allocated, 14.39% gc time)

Comments
This results in a severe slowdown: the new parallel code is slower than its serial counterpart.

Simulation 9/14

function SimulationParallelWithLessSynchronization(dA, N, T, s)
    Ps = procs(dA)
    P = length(Ps)
    Nlocal = [size((dA.indexes)[w][2],1) for w=1:P]
    for t=0:div(T-1, s)
        refs = [(@spawnat Ps[w] SimulationSerial(localpart(dA), Nlocal[w], s)) for w=1:P]
        pmap(fetch, refs)
    end
end

Comments
Assume now that synchronization (among workers) and data communication are needed only after every s time steps, where s is an extra argument of the function.
Simulation 10/14

julia> dA = drand(3,N)
3x1000000 DArray{Float64,2,Array{Float64,2}}: ...
julia> @time SimulationParallelWithLessSynchronization(dA,N,T,10)
elapsed time: 4.171525224 seconds (20101176 bytes allocated, 2.38% gc time)
julia> dA = drand(3,N)
3x1000000 DArray{Float64,2,Array{Float64,2}}: ...
julia> @time SimulationParallelWithLessSynchronization(dA,N,T,100)
elapsed time: 3.952158437 seconds (1889936 bytes allocated, 1.69% gc time)

Comments
This new parallel code runs faster than the serial code on 8 cores.

Simulation 11/14

julia> N = 1000000; T = 10000 ; A = rand(3,N)
3x1000000 Array{Float64,2}: ...
julia> @time SimulationSerial(A, N, T)
elapsed time: 65.727261258 seconds (80 bytes allocated)

Comments
From now on, T is multiplied by 10, which multiplies the serial time by 10.

Simulation 12/14

julia> dA = drand(3,N)
3x1000000 DArray{Float64,2,Array{Float64,2}}: ...
julia> @time SimulationParallel(dA,N,T)
elapsed time: 33.086805207 seconds (186440 bytes allocated)
8-element Array{Any,1}:
 nothing
 ...
 nothing

Comments
The parallel time without communication is also multiplied by 10.

Simulation 14/14

julia> dA = drand(3,N)
3x1000000 DArray{Float64,2,Array{Float64,2}}: ...
julia> @time SimulationParallelWithSynchronization(dA,N,T)
elapsed time: 72.346711931 seconds (1878056616 bytes allocated, 14.94% gc time)
julia> dA = drand(3,N)
3x1000000 DArray{Float64,2,Array{Float64,2}}: ...
julia> @time SimulationParallelWithLessSynchronization(dA,N,T,1000)
elapsed time: 32.460148169 seconds (1977856 bytes allocated, 0.19% gc time)

Comments
The parallel time with lots of communication and synchronization is also multiplied by 10.
The parallel time with little communication and synchronization is only multiplied by 8.
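To reproduce the comparison end to end, a small driver along the following lines can be used (an addition to the slides; it assumes the four simulation functions above and a session started with -p 8):

# a sketch of a driver comparing the four versions (assumes julia -p 8)
N = 1000000; T = 1000
A = rand(3, N); for j=1:N A[3,j] = 0 end
dA = drand(3, N)
@time SimulationSerial(A, N, T)
@time SimulationParallel(dA, N, T)
@time SimulationParallelWithSynchronization(dA, N, T)
@time SimulationParallelWithLessSynchronization(dA, N, T, 100)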