Introduction to Multicore Programming

Marc Moreno Maza
University of Western Ontario, London, Ontario (Canada)
CS 3101

Plan

1. Multicore Architecture
   - Multicore processor
   - CPU Cache
   - CPU Coherence
2. Concurrency Platforms
   - An overview of Cilk++
   - Race Conditions and Cilkscreen
   - MMM in Cilk++

Multicore Architecture

Multicore processor

[Figure: a chip multiprocessor (CMP): several processors P on one chip, connected to memory, I/O and the network.]

A multicore processor is an integrated circuit to which two or more individual processors (called cores in this setting) have been attached.
- In a manycore processor, the number of cores is large enough that traditional multiprocessor techniques are no longer efficient.
- Cores on a multicore device can be coupled tightly or loosely: they may or may not share a cache, and they may implement inter-core communication methods or message passing.
- Cores on a multicore chip implement the same architectural features as single-core systems, such as instruction-level parallelism (ILP), vector processing, SIMD and multithreading.
- Many applications do not yet achieve large speedup factors: parallelizing algorithms and software is a major ongoing research area.
CPU Cache (1/7)

- A CPU cache is an auxiliary memory which is smaller and faster than the main memory and which stores copies of the main memory locations that are expected to be frequently used.
- Most modern desktop and server CPUs have at least three independent caches: the data cache, the instruction cache and the translation lookaside buffer.

CPU Cache (2/7)

Each location in each memory (main or cache) has
- a datum (cache line), which ranges between 8 and 512 bytes in size, while a datum requested by a CPU instruction ranges between 1 and 16 bytes;
- a unique index (called an address in the case of the main memory).
In the cache, each location also has a tag (storing the address of the corresponding cached datum).

CPU Cache (3/7)

When the CPU needs to read or write a location, it checks the cache:
- if it finds it there, we have a cache hit;
- if not, we have a cache miss and (in most cases) the processor needs to create a new entry in the cache.
Making room for a new entry requires a replacement policy: the Least Recently Used (LRU) policy discards the least recently used items first; this requires the use of age bits.
CPU Cache (4/7)

Read latency (the time to read a datum from the main memory) requires keeping the CPU busy with something else:
- out-of-order execution: attempt to execute independent instructions arising after the instruction that is waiting due to the cache miss;
- hyper-threading (HT): allows an alternate thread to use the CPU.

CPU Cache (5/7)

Modifying data in the cache requires a write policy for updating the main memory:
- write-through cache: writes are immediately mirrored to main memory;
- write-back cache: the main memory is mirrored when that data is evicted from the cache.
The cache copy may become out-of-date or stale if other processors modify the original entry in the main memory.

CPU Cache (6/7)

The placement policy decides where in the cache a copy of a particular entry of main memory will go:
- fully associative: any entry in the cache can hold it;
- direct mapped: only one possible entry in the cache can hold it;
- N-way set associative: N possible entries can hold it.

Cache Performance for SPEC CPU2000

By J.F. Cantin and M.D. Hill. The SPEC CPU2000 suite is a collection of 26 compute-intensive, non-trivial programs used to evaluate the performance of a computer's CPU, memory system, and compilers (http://www.spec.org/osg/cpu2000).
Cache Coherence (1/6)

[Figure: processor P1 reads x=3 first from the backing store (higher-level memory).]

Cache Coherence (2/6)

[Figure: next, processor P2 loads x=3 from the same memory.]

Cache Coherence (3/6)

[Figure: processor P4 loads x=3 from the same memory.]

Cache Coherence (4/6)

[Figure: processor P2 issues a write x=5.]

Cache Coherence (5/6)

[Figure: processor P2 writes x=5 in its local cache.]

Cache Coherence (6/6)

[Figure: processor P1 issues a read of x, which is now invalid in its cache.]

MSI Protocol

In this cache coherence protocol each block contained inside a cache can have one of three possible states:
- M: the cache line has been modified and the corresponding data is inconsistent with the backing store; the cache has the responsibility to write the block to the backing store when it is evicted.
- S: this block is unmodified and is shared, that is, it exists in at least one cache. The cache can evict the data without writing it to the backing store.
- I: this block is invalid, and must be fetched from memory or another cache if the block is to be stored in this cache.
These coherency states are maintained through communication between the caches and the backing store. The caches have different responsibilities when blocks are read or written, or when they learn of other caches issuing reads or writes for a block.

True Sharing and False Sharing

True sharing:
- True sharing cache misses occur whenever two processors access the same data word.
- True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness.
- A computation is said to have temporal locality if it reuses much of the data it has been accessing.
- Programs with high temporal locality tend to have less true sharing.

False sharing:
- False sharing results when different processors use different data that happen to be co-located on the same cache line.
- A computation is said to have spatial locality if it uses multiple words in a cache line before the line is displaced from the cache.
- Enhancing spatial locality often minimizes false sharing.

See Data and Computation Transformations for Multiprocessors by J.M. Anderson, S.P. Amarasinghe and M.S. Lam: http://suif.stanford.edu/papers/anderson95/paper.html

Multicore processor (cont'd)

Advantages:
- Cache coherency circuitry can operate at a much higher rate than is possible off-chip.
- Reduced power consumption for a dual-core versus two coupled single-core processors (better-quality communication signals, and the cache can be shared).

Challenges:
- Adjustments to existing software (including the OS) are required to maximize performance.
- Production yields go down (an Intel quad-core is in fact a double dual-core).
- Two processing cores sharing the same bus and memory bandwidth may limit performance.
- High levels of false or true sharing and synchronization can easily overwhelm the advantage of parallelism.

Concurrency Platforms

From Cilk to Cilk++

- Cilk has been developed since 1994 at the MIT Laboratory for Computer Science by Prof. Charles E. Leiserson and his group, in particular by Matteo Frigo.
- Besides being used for research and teaching, Cilk was the system used to code the three world-class chess programs: Tech, Socrates, and Cilkchess.
- Over the years, the implementations of Cilk have run on computers ranging from networks of Linux laptops to an 1824-node Intel Paragon.
- From 2007 to 2009, Cilk led to Cilk++, developed by Cilk Arts, an MIT spin-off, which was acquired by Intel in July 2009 and became Cilk Plus; see http://www.cilk.com/
- Cilk++ can be freely downloaded at http://software.intel.com/enus/articles/downloadintelcilksdk/
- Cilk is still developed at MIT: http://supertech.csail.mit.edu/cilk/

Cilk++

- Cilk++ (resp. Cilk) is a small set of linguistic extensions to C++ (resp. C) supporting fork-join parallelism.
- Both Cilk and Cilk++ feature a provably efficient work-stealing scheduler.
- Cilk++ provides a hyperobject library for parallelizing code with global variables and performing reduction for data aggregation.
- Cilk++ includes the Cilkscreen race detector and the Cilkview performance analyzer.

Nested Parallelism in Cilk++

    int fib(int n)
    {
      if (n < 2) return n;
      int x, y;
      x = cilk_spawn fib(n-1);
      y = fib(n-2);
      cilk_sync;
      return x+y;
    }

- The named child function call fib(n-1), spawned with cilk_spawn, may execute in parallel while its parent executes fib(n-2).
- The Cilk++ keywords cilk_spawn and cilk_sync grant permission for parallel execution. They do not command parallel execution.

Loop Parallelism in Cilk++

The following loop transposes an n x n matrix A in place, swapping A[i][j] with A[j][i] so that A becomes its transpose:

    // indices run from 0, not 1
    cilk_for (int i=1; i<n; ++i) {
      for (int j=0; j<i; ++j) {
        double temp = A[i][j];
        A[i][j] = A[j][i];
        A[j][i] = temp;
      }
    }

The iterations of a cilk_for loop may execute in parallel.

Serial Semantics (1/2)

- Cilk (resp. Cilk++) is a multithreaded language for parallel programming that generalizes the semantics of C (resp. C++) by introducing linguistic constructs for parallel control.
- Cilk (resp. Cilk++) is a faithful extension of C (resp. C++):
  - The C (resp. C++) elision of a Cilk (resp. Cilk++) program is a correct implementation of the semantics of the program.
  - Moreover, on one processor, a parallel Cilk (resp. Cilk++) program scales down to run nearly as fast as its C (resp. C++) elision.
To obtain the serialization of a Cilk++ program:

    #define cilk_for for
    #define cilk_spawn
    #define cilk_sync

Serial Semantics (2/2)

Cilk++ source:

    int fib(int n)
    {
      if (n < 2) return (n);
      else {
        int x, y;
        x = cilk_spawn fib(n-1);
        y = fib(n-2);
        cilk_sync;
        return (x+y);
      }
    }

Serialization:

    int fib(int n)
    {
      if (n < 2) return (n);
      else {
        int x, y;
        x = fib(n-1);
        y = fib(n-2);
        return (x+y);
      }
    }

Scheduling

When the same fib code runs under the Cilk++ runtime, the spawned work is scheduled across the processors of a shared-memory machine. [Figure: processors P connected to memory, I/O and the network.]

Race Bugs (1/3)

Example:

    int x = 0;
    cilk_for(int i=0; i<2; ++i) {
      x++;
    }
    assert(x == 2);

The dependency graph has A (int x = 0;) preceding the two parallel iterations B and C (each executing x++;), which both precede D (assert(x == 2);). B and C both update x, so they race.

- Iterations of a cilk_for should be independent.
- Between a cilk_spawn and the corresponding cilk_sync, the code of the spawned child should be independent of the code of the parent, including code executed by additional spawned or called children.
- The arguments to a spawned function are evaluated in the parent before the spawn occurs.

Race Bugs (2/3)

The two parallel increments from the previous example expand into load, increment and store steps; the numbers give one possible execution order:

    1  x = 0;
    2  r1 = x;        4  r2 = x;
    3  r1++;          5  r2++;
    6  x = r1;        7  x = r2;
    8  assert(x == 2);

With this interleaving both loads (steps 2 and 4) read x == 0, both stores (steps 6 and 7) write 1, and the assertion fails.

Race Bugs (3/3)

Watch out for races in packed data structures such as:

    struct {
      char a;
      char b;
    } x;

Updating x.a and x.b in parallel can cause races.
Cilkscreen

- If an ostensibly deterministic Cilk++ program run on a given input could possibly behave any differently than its serialization, the Cilkscreen race detector guarantees to report and localize the offending race.
- It employs a regression-test methodology (where the programmer provides test inputs) and dynamic instrumentation of binary code.
- It identifies filenames, lines and variables involved in the race.
- It runs about 20 times slower than real-time.

MMM in Cilk++

    template <typename T>
    void multiply_iter_par(int ii, int jj, int kk, T* A, T* B, T* C)
    {
      cilk_for(int i = 0; i < ii; ++i)
        for (int k = 0; k < kk; ++k)
          cilk_for(int j = 0; j < jj; ++j)
            C[i * jj + j] += A[i * kk + k] * B[k * jj + j];
    }

This does not scale up well, due to poor locality and uncontrolled granularity.

    template <typename T>
    void multiply_rec_seq_helper(int i0, int i1, int j0, int j1,
                                 int k0, int k1, T* A, ptrdiff_t lda,
                                 T* B, ptrdiff_t ldb, T* C, ptrdiff_t ldc)
    {
      int di = i1 - i0;
      int dj = j1 - j0;
      int dk = k1 - k0;
      if (di >= dj && di >= dk && di >= RECURSION_THRESHOLD) {
        int mi = i0 + di / 2;
        multiply_rec_seq_helper(i0, mi, j0, j1, k0, k1, A, lda, B, ldb, C, ldc);
        multiply_rec_seq_helper(mi, i1, j0, j1, k0, k1, A, lda, B, ldb, C, ldc);
      } else if (dj >= dk && dj >= RECURSION_THRESHOLD) {
        int mj = j0 + dj / 2;
        multiply_rec_seq_helper(i0, i1, j0, mj, k0, k1, A, lda, B, ldb, C, ldc);
        multiply_rec_seq_helper(i0, i1, mj, j1, k0, k1, A, lda, B, ldb, C, ldc);
      } else if (dk >= RECURSION_THRESHOLD) {
        int mk = k0 + dk / 2;
        multiply_rec_seq_helper(i0, i1, j0, j1, k0, mk, A, lda, B, ldb, C, ldc);
        multiply_rec_seq_helper(i0, i1, j0, j1, mk, k1, A, lda, B, ldb, C, ldc);
      } else {
        for (int i = i0; i < i1; ++i)
          for (int k = k0; k < k1; ++k)
            for (int j = j0; j < j1; ++j)
              C[i * ldc + j] += A[i * lda + k] * B[k * ldb + j];
      }
    }

    template <typename T>
    inline void multiply_rec_seq(int ii, int jj, int kk, T* A, T* B, T* C)
    {
      multiply_rec_seq_helper(0, ii, 0, jj, 0, kk, A, kk, B, jj, C, jj);
    }

Multiplying a 4000x8000 matrix by an 8000x4000 matrix on 32 cores = 8 sockets x 4 cores (Quad Core AMD Opteron 8354) per socket. The 32 cores share an L3 32-way set-associative cache of 2 MBytes.

    cores   Elision (s)   Parallel (s)   speedup
      8       420.906        51.365        8.19
     16       432.419        25.845       16.73
     24       413.681        17.361       23.83
     32       389.300        13.051       29.83