



Introduction to Parallel Processing

EECC756 Shaaban, Lecture 1, Spring 2011 (PCA Chapter 1.1, 1.2)

• Parallel Computer Architecture: definition and broad issues involved
  – A Generic Parallel Computer Architecture
• The Need And Feasibility of Parallel Computing (Why?)
  – Scientific Supercomputing Trends
  – CPU Performance and Technology Trends, Parallelism in Microprocessor Generations
  – Computer System Peak FLOP Rating History/Near Future
• The Goal of Parallel Processing
• Elements of Parallel Computing
• Factors Affecting Parallel System Performance
• Parallel Architectures History
  – Parallel Programming Models
  – Flynn's 1972 Classification of Computer Architecture
• Current Trends In Parallel Architectures
  – Modern Parallel Architecture Layered Framework
• Shared Address Space Parallel Architectures
• Message-Passing Multicomputers: Message-Passing Programming Tools
• Data Parallel Systems
• Dataflow Architectures
• Systolic Architectures: Matrix Multiplication Systolic Array Example

Parallel Computer Architecture
A parallel computer (or multiple processor system) is a collection of communicating processing elements (processors) that cooperate to solve large computational problems fast by dividing such problems into parallel tasks, exploiting Thread-Level Parallelism (TLP), i.e. parallel processing. (Task = computation done on one processor.)
• Broad issues involved:
  – The concurrency and communication characteristics of parallel algorithms for a given computational problem (represented by dependency graphs).
  – Computing resources and computation allocation:
    • The number of processing elements (PEs), the computing power of each element, and the amount/organization of physical memory used.
    • What portions of the computation and data are allocated or mapped to each PE.
  – Data access, communication and synchronization:
    • How the processing elements cooperate and communicate.
    • How data is shared/transmitted between processors.
    • Abstractions and primitives for cooperation/communication and synchronization.
    • The characteristics and performance of the parallel system network (system interconnects).
  – Parallel processing performance and scalability goals:
    • Maximize the performance enhancement of parallelism (maximize speedup) by minimizing parallelization overheads and balancing the workload on processors.
    • Scalability of performance to larger systems/problems.
Processor = programmable computing element that runs stored programs written using a predefined instruction set.
Processing Elements = PEs = Processors.

A Generic Parallel Computer Architecture
[Figure: a set of processing nodes connected by a parallel machine network; each node contains one or more processors (P), memory (Mem), and a communication assist (CA)/network interface, plus an operating system and parallel programming environments (custom or industry standard).]
1 – Processing nodes: each processing node contains one or more processing elements (PEs) or processors (custom or commercial microprocessors; single or multiple processors per chip, e.g. 2-8 cores per chip; homogeneous or heterogeneous), a memory system, plus a communication assist (CA): network interface and communication controller.
2 – Parallel machine network (system interconnects), custom or industry standard: the function of the parallel machine network is to efficiently (i.e. at low communication cost) transfer information (data, results, ...) from source node to destination node as needed to allow cooperation among the parallel processing nodes in solving large computational problems divided into a number of parallel computational tasks.
Parallel Computer = Multiple Processor System.

The Need And Feasibility of Parallel Computing
• Application demands (the driving force): more computing cycles/memory needed.
  – Scientific/engineering computing: CFD, biology, chemistry, physics, ...
  – General-purpose computing: video, graphics, CAD, databases, transaction processing, gaming, ...
  – Mainstream multithreaded programs are similar to parallel programs.
• Technology trends (Moore's Law still alive):
  – The number of transistors on a chip is growing rapidly. Clock rates are expected to continue to go up, but only slowly; actual performance returns are diminishing due to deeper pipelines.
  – Increased transistor density allows integrating multiple processor cores per chip, creating Chip-Multiprocessors (CMPs), even for mainstream computing applications (desktop/laptop, ...) and multitasking (multiple independent programs).
• Architecture trends:
  – Instruction-level parallelism (ILP) is valuable (superscalar, VLIW) but limited.
  – Increased clock rates require deeper pipelines with longer latencies and higher CPIs.
  – Coarser-level parallelism (at the task or thread level, TLP), utilized in multiprocessor systems, is the most viable approach to further improve performance. This is the main motivation for the development of chip-multiprocessors (CMPs), i.e. multi-core processors.
• Economics:
  – The increased utilization of commodity off-the-shelf (COTS) components in high-performance parallel computing systems, instead of the costly custom components used in traditional supercomputers, leads to much lower parallel system cost.
  – Today's microprocessors offer high performance and have multiprocessor support, eliminating the need to design expensive custom PEs.
  – Commercial System Area Networks (SANs) offer an alternative to custom, more costly networks.

Why is Parallel Processing Needed? Challenging Applications in Applied Science/Engineering
(The traditional driving force for HPC/parallel processing)
• Astrophysics
• Atmospheric and ocean modeling
• Bioinformatics
• Biomolecular simulation: protein folding
• Computational chemistry
• Computational Fluid Dynamics (CFD)
• Computational physics
• Computer vision and image understanding
• Data mining and data-intensive computing
• Engineering analysis (CAD/CAM)
• Global climate modeling and forecasting
• Material sciences
• Military applications
• Quantum chemistry
• VLSI design
• ...
Such applications have very high (1) computational and (2) memory requirements that cannot be met with single-processor architectures. Many applications contain a large degree of computational parallelism. This is the driving force for High Performance Computing (HPC) and multiple processor system development.

Why is Parallel Processing Needed? Scientific Computing Demands
[Figure: computational and memory requirements of grand-challenge scientific applications versus uniprocessor capability (roughly 3-5 GFLOPS for a uniprocessor).]
• Computational and memory demands exceed the capabilities of even the fastest current uniprocessor systems.
• GFLOP = 10^9 FLOPS; TeraFLOP = 1000 GFLOPS = 10^12 FLOPS; PetaFLOP = 1000 TeraFLOPS = 10^15 FLOPS.

Scientific Supercomputing Trends
• Proving ground and driver for innovative architecture and advanced high-performance computing (HPC) techniques:
  – The market is much smaller relative to the commercial (desktop/server) segment.
  – Dominated by costly vector machines starting in the 1970s and through the 1980s.
  – Microprocessors have made huge gains in the performance needed for such applications, enabled by high transistor density per chip:
    • High clock rates (downside: higher CPI).
    • Multiple pipelined floating-point units.
    • Instruction-level parallelism.
    • Effective use of caches.
    • Multiple processor cores per chip (2 cores 2002-2005, 4 by the end of 2006, 6-12 cores in 2011).
  – However, even the fastest current single-microprocessor systems still cannot meet the needed computational demands (as shown on the previous slide).
• Currently: large-scale microprocessor-based multiprocessor systems and computer clusters have replaced vector supercomputers that utilize custom processors.

Uniprocessor Performance Evaluation
• CPU performance benchmarking is heavily program-mix dependent.
• Ideal performance requires a perfect machine/program match.
• Performance measures:
  – Total CPU execution time (in seconds):
      T = TC / f = TC x C = I x CPI x C = I x (CPI_execution + M x k) x C
    where TC = total program execution clock cycles, f = clock rate, C = CPU clock cycle time = 1/f, I = instructions executed count, CPI = cycles per instruction, CPI_execution = CPI with ideal memory, M = memory stall cycles per memory access, and k = memory accesses per instruction.
  – MIPS rating (in million instructions per second):
      MIPS = I / (T x 10^6) = f / (CPI x 10^6) = f x I / (TC x 10^6)
  – Throughput rate (in programs per second):
      Wp = 1 / T = f / (I x CPI) = (MIPS x 10^6) / I
• Performance factors (I, CPI_execution, M, k, C) are influenced by: instruction-set architecture (ISA), compiler design, CPU microarchitecture, implementation and control, cache and memory hierarchy, program access locality, and program instruction mix and instruction dependencies.
(T = I x CPI x C)

Single CPU Performance Trends
• The microprocessor is currently the most natural building block for multiprocessor systems in terms of cost and performance.
• This is even more true with the development of cost-effective multi-core microprocessors that support TLP at the chip level.
[Figure: relative performance of supercomputers (custom processors), mainframes, minicomputers, and microprocessors (commodity processors), 1965-1995; microprocessors show the steepest growth.]

Microprocessor Frequency Trend
[Figure: clock frequency (MHz) and gate delays per clock for Intel, IBM PowerPC, and DEC processors, 1987-2005.]
• Frequency used to double each generation while the number of gate delays per clock dropped by about 25%, leading to deeper pipelines with more stages (e.g. the Intel Pentium 4E has 30+ pipeline stages). This is no longer the case.
• Reality check: clock frequency scaling is slowing down (did silicon finally hit the wall?). Why:
  1 – Power leakage.
  2 – Clock distribution delays.
• Result: deeper pipelines, longer stalls, higher CPI (lowers effective performance per cycle).
• Solution: exploit TLP at the chip level with chip-multiprocessors (CMPs).
(T = I x CPI x C)

Transistor Count Growth Rate
(Enabling technology for chip-level thread-level parallelism (TLP))
• Currently about 2 billion transistors per chip: an ~800,000x transistor density increase in the last 38 years (the Intel 4004, circa 1970, had 2300 transistors).
• Moore's Law: 2X transistors/chip every 1.5 years; it still holds.
• One billion transistors/chip was reached in 2005, two billion in 2008-9, now three billion.
• Transistor count grows faster than clock rate: currently about 40% per year.
• This enables thread-level parallelism (TLP) at the chip level: chip-multiprocessors (CMPs) + simultaneous multithreaded (SMT) processors.
• Single-threaded uniprocessors do not efficiently utilize the increased transistor count (limited ILP, increasingly large caches).
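Before moving on, here is a worked sketch of the CPU time and MIPS formulas from the Uniprocessor Performance Evaluation slide above. All parameter values are hypothetical, chosen only to show how the terms combine:

    /* A worked sketch of T = I x CPI x C and the MIPS rating.
       All parameter values are hypothetical, for illustration only. */
    #include <stdio.h>

    int main(void) {
        double I    = 2.0e9;   /* I: instructions executed                           */
        double CPIe = 1.2;     /* CPI_execution: CPI with ideal memory               */
        double M    = 2.0;     /* M: memory stall cycles per memory access (average) */
        double k    = 1.3;     /* k: memory accesses per instruction                 */
        double f    = 2.0e9;   /* f: clock rate in Hz, so C = 1/f seconds            */
        double C    = 1.0 / f;

        double CPI  = CPIe + M * k;     /* effective cycles per instruction          */
        double T    = I * CPI * C;      /* total CPU time:  T = I x CPI x C          */
        double MIPS = I / (T * 1.0e6);  /* equivalently f / (CPI x 10^6)             */

        printf("CPI = %.2f, T = %.2f s, MIPS rating = %.1f\n", CPI, T, MIPS);
        return 0;
    }

With these made-up inputs, CPI = 3.80, T = 3.80 s, and the MIPS rating works out to about 526, which also equals f / (CPI x 10^6) as the formula above states.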
Parallelism in Microprocessor VLSI Generations
[Figure: transistors per chip (i4004 through Pentium-class processors, roughly 1,000 to 100,000,000), 1970-2005, annotated with the dominant form of parallelism exploited in each generation.]
• Bit-level parallelism (through roughly the mid-1980s).
• Instruction-level parallelism (ILP): pipelining (CPI = 1, versus multi-cycle non-pipelined execution with CPI >> 1), then superscalar/VLIW issue of multiple micro-operations per cycle.
• Thread-level parallelism (TLP) on a single chip:
  – Simultaneous multithreading (SMT), e.g. Intel's Hyperthreading.
  – Chip-multiprocessors (CMPs), e.g. IBM POWER4 and POWER5, Intel Pentium D and Core Duo, AMD Athlon 64 X2 and Dual-Core Opteron, Sun UltraSPARC T1 (Niagara).
• Chip-level TLP/parallel processing is even more important due to the slowing clock rate increase.
• Each microprocessor generation improves performance by exploiting more levels of parallelism.

Current Dual-Core Chip-Multiprocessor Architectures
• Single die, shared L2 cache: cores communicate through the shared cache (lowest communication latency). Examples: IBM POWER4/5, Intel Pentium Core Duo (Yonah), Conroe (Core 2), Core i7, Sun UltraSPARC T1 (Niagara).
• Single die, private caches, shared system interface: cores communicate using on-chip interconnects (shared system interface, on-chip crossbar/switch), with a shared L2 or L3 in some designs. Examples: AMD Dual-Core Opteron, Athlon 64 X2, AMD Phenom, Intel Itanium 2 (Montecito).
• Two dice in a shared package, private caches, private system interfaces: cores communicate over the external Front Side Bus (FSB) (highest communication latency). Examples: Intel Pentium D, Intel quad-core parts (two dual-core chips).
Source: Real World Technologies, http://www.realworldtech.com/page.cfmArticleID=RWT101405234615

Microprocessors Vs. Vector Processors — Uniprocessor Performance: LINPACK
[Figure: LINPACK MFLOPS, 1975-2000, for CRAY vector processors (n = 100 and n = 1,000: CRAY 1s, X-MP, Y-MP, C90, T94) versus microprocessors (n = 100 and n = 1,000: MIPS, Sun, DEC Alpha, HP, IBM Power2/RS6000); microprocessors close the gap with vector machines by the mid-1990s.]
• Now about 5-20 GFLOPS per microprocessor core. (1 GFLOP = 10^9 FLOPS)

Parallel Performance: LINPACK
• Current top LINPACK performance (since Nov. 2010): about 2,566,000 GFLOPS = 2,566 TeraFLOPS = 2.566 PetaFLOPS, achieved by Tianhe-1A (National Supercomputing Center in Tianjin, China), with 186,368 processor cores: 14,336 Intel Xeon X5670 6-core processors at 2.9 GHz + 7,168 Nvidia Tesla M2050 (8-core) GPUs.
• [Figure: LINPACK GFLOPS of massively parallel systems, 1985-1996 (iPSC/860, nCUBE/2, CM-2, CM-200, CM-5, Delta, Paragon XP/S, T3D, T932, C90, ASCI Red) versus CRAY peak and MPP peak; ASCI Red reaches 1 TeraFLOP (10^12 FLOPS = 1000 GFLOPS).]
• The current ranking of the top 500 parallel supercomputers in the world is found at: www.top500.org

Why is Parallel Processing Needed? LINPACK Performance Trends
[Figure: the two previous LINPACK plots side by side — uniprocessor performance (MFLOPS) and parallel system performance (GFLOPS) — showing parallel systems far outpacing uniprocessors.]

Computer System Peak FLOP Rating History
• Current top peak FP performance (since Nov. 2010): about 4,701,000 GFLOPS = 4,701 TeraFLOPS = 4.701 PetaFLOPS, again Tianhe-1A (National Supercomputing Center in Tianjin, China), with 186,368 processor cores: 14,336 Intel Xeon X5670 6-core processors at 2.9 GHz + 7,168 Nvidia Tesla M2050 (8-core) GPUs.
• [Figure: peak FLOP rating of computer systems over time, from the TeraFLOP range (10^12 FLOPS = 1000 GFLOPS) to the PetaFLOP range (10^15 FLOPS = 1000 TeraFLOPS), topped by Tianhe-1A.]
• The current ranking of the top 500 parallel supercomputers in the world is found at: www.top500.org

TOP500 Supercomputers (November 2005): The Top 10
[Table omitted; source and current list: www.top500.org]

TOP500 Supercomputers, 32nd List (November 2008): The Top 10
[Table omitted; source and current list: www.top500.org]

TOP500 Supercomputers, 34th List (November 2009): The Top 10
[Table omitted; source and current list: www.top500.org]

TOP500 Supercomputers, 36th List (November 2010): The Top 10 (current list)
[Table omitted; source and current list: www.top500.org]

The Goal of Parallel Processing
• Goal of applications in using parallel machines: maximize speedup over single-processor performance.
    Speedup (p processors) = Performance (p processors) / Performance (1 processor)
• For a fixed problem size (input data set), performance = 1/time, so the fixed-problem-size parallel speedup is:
    Speedup_fixed-problem (p processors) = Time (1 processor) / Time (p processors)
• Ideal speedup = number of processors = p. This is very hard to achieve, due to parallelization overheads: communication cost, dependencies, load imbalance, ...

The Goal of Parallel Processing
• The parallel processing goal is to maximize parallel speedup (for a fixed problem size):
    Speedup = Time(1) / Time(p)
            = (sequential work on one processor) / Max over processors of (Work + Synch Wait Time + Comm Cost + Extra Work)
  i.e. the parallel time is set by the processor with the maximum execution time, and the terms other than useful work are parallelization overheads.
• Ideal speedup = p = number of processors. Very hard to achieve: it implies no parallelization overheads and perfect load balance among all processors.
• Maximize parallel speedup by:
  1 – Balancing computations (and overheads) across processors, so every processor does the same amount of work.
  2 – Minimizing communication cost and the other overheads associated with each step of parallel program creation and execution.
• Performance scalability: achieve a good speedup for the parallel application on the parallel architecture as both problem size and machine size (number of processors) are increased.
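As a small illustration of the speedup expression above, the sketch below takes hypothetical per-processor totals of work, synchronization wait, communication, and extra work, and shows how the slowest processor sets Time(p) and hence the speedup (all numbers are invented):

    /* Minimal sketch of Speedup = Time(1) / Max over processors(...).
       The per-processor times below are hypothetical. */
    #include <stdio.h>

    #define P 4

    int main(void) {
        double time_1 = 20.0;   /* Time(1): sequential work on one processor         */
        /* Hypothetical per-processor totals: work + synch wait + comm + extra work: */
        double time_per_proc[P] = {5.0, 5.5, 6.5, 5.0};

        double time_p = 0.0;    /* Time(p) is set by the slowest processor           */
        for (int i = 0; i < P; i++)
            if (time_per_proc[i] > time_p) time_p = time_per_proc[i];

        double speedup = time_1 / time_p;
        printf("Time(p) = %.1f, Speedup = %.2f (ideal = %d)\n", time_p, speedup, P);
        return 0;
    }

Here the imbalance plus overheads (the 6.5-unit processor) limit the speedup to about 3.1 on 4 processors, illustrating why load balancing and overhead minimization are listed as the two levers above.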
Elements of Parallel Computing
[Diagram: computing problems (the HPC driving force) are mapped, via parallel algorithms and data structures (dependency analysis, task dependency graphs) and parallel programming in high-level languages, onto the hardware architecture (processing nodes/network) through the operating system and applications software (binding: compile, load), with performance evaluation (e.g. parallel speedup) closing the loop.]

1 Computing problems (the driving force):
  – Numerical computing: science and engineering numerical problems demand intensive integer and floating-point computations.
  – Logical reasoning: artificial intelligence (AI) demands logic inferences, symbolic manipulations, and large space searches.
2 Parallel algorithms and data structures:
  – Special algorithms and data structures are needed to specify the computations and communication present in computing problems (from dependency analysis).
  – Most numerical algorithms are deterministic, using regular data structures.
  – Symbolic processing may use heuristics or non-deterministic searches.
  – Parallel algorithm development requires interdisciplinary interaction.
3 Hardware resources (computing power):
  A – Processors, memory, and peripheral devices (processing nodes) form the hardware core of a computer system.
  B – Processor connectivity (system interconnects, network) and memory organization influence the system architecture. (Communication/connectivity)
4 Operating systems:
  – Manage the allocation of resources to running processes.
  – Mapping to match algorithmic structures with the hardware architecture and vice versa: processor scheduling, memory mapping, interprocessor communication.
  – Parallelism exploitation is possible at: 1 algorithm design, 2 program writing, 3 compilation, and 4 run time.
5 System software support:
  – Needed for the development of efficient programs in high-level languages (HLLs).
  – Assemblers, loaders.
  – Portable parallel programming languages/libraries.
  – User interfaces and tools.
6 Compiler support (approaches to parallel programming, illustrated on the next slide):
  (a) Implicit parallelism approach:
    • Parallelizing compiler: can automatically detect parallelism in sequential source code and transform it into parallel constructs/code.
    • Source code is written in conventional sequential languages.
  (b) Explicit parallelism approach:
    • The programmer explicitly specifies parallelism using either a sequential compiler (conventional sequential HLL) plus a low-level library of the target parallel computer, or a concurrent (parallel) HLL.
    • Concurrency-preserving compiler: the compiler in this case preserves the parallelism explicitly specified by the programmer. It may perform some program flow analysis, dependence checking, and limited optimizations for parallelism detection.
Approaches to Parallel Programming
[Diagram, two paths:
(a) Implicit parallelism: the programmer writes source code in sequential languages (C, C++, FORTRAN, LISP, ...); a parallelizing compiler automatically detects parallelism in the sequential source code and transforms it into parallel constructs/code; the parallel object code is executed by the runtime system.
(b) Explicit parallelism: the programmer explicitly specifies parallelism using parallel constructs, writing source code in concurrent dialects of C, C++, FORTRAN, LISP, ...; a concurrency-preserving compiler produces concurrent object code, which is executed by the runtime system.]

Factors Affecting Parallel System Performance
• Parallel algorithm related:
  – Available concurrency and its profile, grain size, uniformity, and patterns (i.e. inherent parallelism; dependencies between computations represented by a dependency graph).
  – Type of parallelism present: functional and/or data parallelism.
  – Required communication/synchronization, its uniformity and patterns.
  – Data size requirements.
  – Communication-to-computation ratio (C-to-C ratio; lower is better).
• Parallel program related:
  – Programming model used.
  – Resulting data/code memory requirements, locality, and working set characteristics.
  – Parallel task grain size.
  – Assignment (mapping) of tasks to processors: dynamic or static.
  – Cost of communication/synchronization primitives.
• Hardware/architecture related:
  – Total CPU computational power available (number of processors, i.e. hardware parallelism).
  – Types of computation modes supported.
  – Shared address space vs. message passing.
  – Communication network characteristics (topology, bandwidth, latency).
  – Memory hierarchy properties.
(Concurrency = parallelism)

Sequential Execution vs. a Possible Parallel Execution Schedule on Two Processors (P0, P1)
[Diagram: a task dependency graph with tasks A-G (each task is a computation run on one processor), executed sequentially on one processor in 21 time units, versus a two-processor schedule with communication and idle periods that finishes in 16 time units.]
• Assume the computation time for each task A-G is 3 time units.
• Assume the communication time between parallel tasks is 1 time unit.
• Assume communication can overlap with computation.
• Sequential time T1 = 21; two-processor schedule time T2 = 16.
• Speedup on two processors = T1/T2 = 21/16 = 1.3.
• What would the speedup be with 3 processors? 4 processors? 5, ...? (A simple parallel execution example; see the sketch below.)
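A small sketch of how such a task dependency graph limits speedup: summing the task times gives the one-processor time T1, and the longest dependency chain (critical path) is a lower bound on parallel time no matter how many processors are used. The seven-task graph encoded below is only an illustrative stand-in with made-up edges, not necessarily the exact graph drawn on the slide; each task takes 3 time units as above, and cross-processor communication costs are ignored here:

    /* Sketch: one-processor time (T1) and critical-path length of a task graph.
       The dependency edges below are hypothetical, for illustration only. */
    #include <stdio.h>

    #define NTASKS 7                      /* tasks A..G, each taking 3 time units */
    static const int task_time = 3;

    /* dep[i][j] = 1 if task j depends on task i (hypothetical edges) */
    static const int dep[NTASKS][NTASKS] = {
        /*        A  B  C  D  E  F  G */
        /* A */ { 0, 1, 1, 0, 0, 0, 0 },
        /* B */ { 0, 0, 0, 1, 1, 0, 0 },
        /* C */ { 0, 0, 0, 0, 0, 1, 0 },
        /* D */ { 0, 0, 0, 0, 0, 0, 1 },
        /* E */ { 0, 0, 0, 0, 0, 0, 1 },
        /* F */ { 0, 0, 0, 0, 0, 0, 1 },
        /* G */ { 0, 0, 0, 0, 0, 0, 0 },
    };

    /* Length of the longest dependency chain starting at task t. */
    static int chain_from(int t) {
        int best = 0;
        for (int j = 0; j < NTASKS; j++)
            if (dep[t][j]) {
                int len = chain_from(j);
                if (len > best) best = len;
            }
        return task_time + best;
    }

    int main(void) {
        int t1 = NTASKS * task_time;      /* T1 = 7 x 3 = 21, as on the slide     */
        int critical = chain_from(0);     /* longest chain starting at task A     */
        printf("T1 = %d, critical path = %d, speedup <= %.2f (before comm overheads)\n",
               t1, critical, (double)t1 / critical);
        return 0;
    }

For this made-up graph the critical path is 12 time units, bounding the speedup at 21/12 = 1.75; the two-processor schedule on the slide achieves 21/16 = 1.3, lower than the dependency-only bound because of communication and idle time.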
Evolution of Computer Architecture
[Diagram (I/E = instruction fetch and execute): non-pipelined scalar, sequential machines evolve via lookahead, functional parallelism, and limited I/E overlap into pipelined machines (single or multiple issue, with multiple pipelined functional units). The implicit-parallelism path leads to vector/data-parallel machines (memory-to-memory and register-to-register vector architectures). The explicit-parallelism path leads to parallel machines: SIMD (single instruction stream over multiple data streams — associative processors, processor arrays, data parallel computers) and MIMD (multiple instruction streams over multiple data streams — shared-memory multiprocessors, and message-passing multicomputers including massively parallel processors (MPPs) and clusters).]

Parallel Architectures History
• Historically, parallel architectures were tied to parallel programming models:
  – Divergent architectures, with no predictable pattern of growth: systolic arrays, dataflow, data parallel/SIMD, message passing, and shared memory, each with its own application software and system software/architecture stack.
(More on this next lecture.)

Parallel Programming Models
• A programming model is the methodology used in coding parallel applications; it specifies 1 communication and 2 synchronization.
• Examples:
  – Multiprogramming, or multitasking (not true parallel processing): no communication or synchronization at the program level; a number of independent programs run on different processors in the system. (However, this is a good way to utilize multi-core processors for the masses.)
  – Shared memory address space (SAS): parallel program threads or tasks communicate implicitly using a shared memory address space (shared data in memory).
  – Message passing: explicit point-to-point communication (via send/receive pairs) is used between parallel program tasks, using messages.
  – Data parallel: more regimented, global actions on data (i.e. the same operation over all elements of an array or vector); can be implemented with a shared address space or with message passing.

Flynn's 1972 Classification of Computer Architecture (Taxonomy)
(Instruction stream = thread of control or hardware context)
(a) • Single Instruction stream over a Single Data stream (SISD): conventional sequential machines or uniprocessors.
(b) • Single Instruction stream over Multiple Data streams (SIMD): vector computers, arrays of synchronized processing elements; data parallel systems.
(c) • Multiple Instruction streams and a Single Data stream (MISD): systolic arrays for pipelined execution.
(d) • Multiple Instruction streams over Multiple Data streams (MIMD): parallel computers:
    • Shared-memory multiprocessors (tightly coupled processors).
    • Multicomputers: unshared, distributed memory, with message passing used instead (e.g. clusters) (loosely coupled processors).
Architectures are classified according to the number of instruction streams (threads) and the number of data streams.

Flynn's Classification of Computer Architecture (Taxonomy)
[Diagram (CU = control unit, PE = processing element, M = memory): SISD — a uniprocessor, one control unit driving one processing element; SIMD — one control unit broadcasting to an array of synchronized processing elements, each with its own data memory; MISD — systolic arrays for pipelined execution; MIMD — parallel computers or multiprocessor systems, shown here as a distributed-memory multiprocessor with multiple control units and processing elements, each with local memory.]
Current Trends In Parallel Architectures
• The extension of "computer architecture" to support communication and cooperation:
  – OLD: Instruction Set Architecture (ISA) (conventional or sequential).
  – NEW: communication architecture.
• A communication architecture defines:
  1 – Critical abstractions, boundaries, and primitives (interfaces).
  2 – Organizational structures that implement the interfaces (in hardware or software).
• Compilers, libraries and the OS are important bridges today (i.e. software abstraction layers).
(More on this next lecture.)

Modern Parallel Architecture Layered Framework
[Diagram, software layers over hardware layers: parallel applications (CAD, database, scientific modeling); programming models (multiprogramming, shared address space, message passing, data parallel); compilation or library and the communication abstraction at the user/system boundary; operating system support; and, below the hardware/software boundary, the communication hardware (ISA) and the physical communication medium (processing nodes and interconnects).]
(More on this next lecture.)

Shared Address Space (SAS) Parallel Architectures
• Any processor can directly reference any memory location (in the shared address space):
  – Communication occurs implicitly as a result of loads and stores.
• Convenient:
  – Location transparency.
  – Similar programming model to time-sharing on uniprocessors, except that processes run on different processors.
  – Good throughput on multiprogrammed (i.e. multitasking) workloads.
• Naturally provided on a wide range of platforms:
  – Wide range of scale: a few to hundreds of processors.
• Popularly known as shared memory machines or the shared memory model:
  – Ambiguous: memory may be physically distributed among processing nodes (i.e. distributed shared memory multiprocessors).
• Sometimes called tightly coupled parallel computers.

Shared Address Space (SAS) Parallel Programming Model
• Process: a virtual address space plus one or more threads of control.
• Portions of the address spaces of processes are shared:
  [Diagram: the virtual address spaces of a collection of processes P0..Pn map onto the machine physical address space; each process has a private portion, and the shared portion maps to common physical addresses, so a store by one process is visible to loads by the others.]
• In SAS, communication is implicit via loads/stores; ordering/synchronization is explicit, using synchronization primitives.
• Writes to a shared address are visible to other threads (in other processes too).
• Natural extension of the uniprocessor model:
  – Conventional memory operations are used for communication.
  – Special atomic operations are needed for synchronization, i.e. for event ordering and mutual exclusion, using locks, semaphores, etc. (Thus synchronization is explicit.)
• The OS uses shared memory to coordinate processes.
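As a minimal concrete sketch of this model (not taken from the course material), the POSIX-threads program below has threads communicate implicitly through an ordinary shared variable, while synchronization is made explicit with a lock, matching the split described above:

    /* Minimal shared-address-space sketch using POSIX threads.
       Build with something like: cc -pthread sas_demo.c */
    #include <stdio.h>
    #include <pthread.h>

    #define NTHREADS 4

    static long shared_sum = 0;                              /* shared data in memory    */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; /* explicit synchronization */

    static void *worker(void *arg) {
        long my_part = (long)arg;           /* each thread's private contribution */
        pthread_mutex_lock(&lock);          /* mutual exclusion for the update    */
        shared_sum += my_part;              /* communication via an ordinary store */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)(i + 1));
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        printf("shared_sum = %ld\n", shared_sum);   /* 1 + 2 + 3 + 4 = 10 */
        return 0;
    }

No data is ever sent explicitly: the threads simply load and store the shared variable, and the lock supplies the explicit ordering/mutual exclusion the slide calls for.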
Models of Shared-Memory Multiprocessors
1 • The Uniform Memory Access (UMA) model:
  – All physical memory is shared by all processors.
  – All processors have equal access (i.e. equal memory bandwidth and access latency) to all memory addresses.
  – Also referred to as Symmetric Memory Processors (SMPs).
2 • The distributed memory or Non-Uniform Memory Access (NUMA) model:
  – Shared memory is physically distributed locally among processors; access latency to remote memory is higher.
3 • The Cache-Only Memory Architecture (COMA) model:
  – A special case of a NUMA machine where all distributed main memory is converted to caches.
  – No memory hierarchy at each processor.

Models of Shared-Memory Multiprocessors
[Diagram (P = processor, M or Mem = memory, C = cache, D = cache directory): 1 UMA/SMP — processors and I/O devices share memory modules across an interconnect (bus, crossbar, or multistage network); 2 NUMA (distributed memory) — each node pairs a processor with local memory, and nodes are connected by a network; 3 COMA — each node's local memory acts as a cache with a cache directory per node, connected by a network.]

Uniform Memory Access (UMA) Example: Intel Pentium Pro Quad (circa 1997)
[Diagram: a 4-way SMP; four Pentium Pro processor modules (each with a 256 KB L2 cache and interrupt controller) share the Pentium Pro front side bus (64-bit data, 36-bit address, 66 MHz), with a memory controller/MIU to 1-, 2-, or 4-way interleaved DRAM and PCI bridges to I/O cards.]
• All coherence and multiprocessing glue is in the processor module.
• Highly integrated, targeted at high volume.
• Used as the computing node in Intel's ASCI-Red MPP.
• A bus-based Symmetric Memory Processor (SMP): a single Front Side Bus (FSB) is shared among the processors, which severely limits scalability to only 2-4 processors.

Non-Uniform Memory Access (NUMA) Example: AMD 8-way Opteron Server Node (circa 2003)
[Diagram: eight Opteron processors connected by dedicated point-to-point HyperTransport links, each with its own local DRAM.]
• Dedicated point-to-point interconnects (HyperTransport links) connect the processors, alleviating the traditional limitations of FSB-based SMP systems.
• Each processor has two integrated DDR memory channel controllers, so memory bandwidth scales up with the number of processors.
• A NUMA architecture, since a processor can access its own memory at a lower latency than remote memory directly connected to other processors in the system.
• 16 processor cores in total when dual-core Opteron processors are used (32 cores with quad-core processors).

Distributed Shared-Memory Multiprocessor System Example: Cray T3E (circa 1995-1999)
(A NUMA MPP example; MPP = Massively Parallel Processor system)
[Diagram: each node contains a processor (P), memory, a memory controller with network interface acting as the communication assist (CA), and a switch; nodes are connected in a 3D torus (X, Y, Z) point-to-point network with external I/O.]
• Scales up to 2048 processors; DEC Alpha EV6 microprocessor (COTS).
• Custom 3D torus point-to-point network, 480 MB/s links.
• The memory controller generates communication requests for non-local references.
• No hardware mechanism for cache coherence (SGI Origin and others provide this).
(An example of Non-Uniform Memory Access (NUMA). A more recent Cray MPP example: the Cray X1E supercomputer.)
Message-Passing Multicomputers
• Comprised of multiple autonomous computers (computing nodes) connected via a suitable network: an industry-standard System Area Network (SAN) or a proprietary network.
• Each node consists of one or more processors, local memory, attached storage, I/O peripherals, and a Communication Assist (CA).
• Local memory is accessible only by the local processors in a node (no shared memory among nodes).
• Inter-node communication is carried out explicitly by message passing through the connection network via send/receive operations. (Thus communication is explicit.)
• Process communication is achieved using a message-passing programming environment (e.g. PVM, MPI), which is portable and platform-independent.
  – The programming model is more removed or abstracted from basic hardware operations.
• Include:
  1 – A number of commercial Massively Parallel Processor systems (MPPs).
  2 – Computer clusters that utilize commodity off-the-shelf (COTS) components.
• Also called loosely coupled parallel computers.

Message-Passing Abstraction
[Diagram: sender process P executes Send(X, Q, t), naming its local data X, the recipient process Q, and a tag t; recipient process Q executes Receive(Y, P, t), naming its local address Y, the sender P, and the tag t; the data transfer occurs when the send and the blocking receive match on sender/recipient and tag, and the recipient blocks (waits) until the message is received.]
• Send specifies the buffer to be transmitted and the receiving process. (Communication is explicit via sends/receives.)
• Receive specifies the sending process and the application storage to receive into.
• Memory-to-memory copy is possible, but processes must be named.
• Optional tag on the send and matching rule on the receive.
• User processes name local data and entities in the process/tag space.
• In the simplest form, the send/receive match achieves implicit pairwise synchronization: the blocking receive enforces the data dependency/ordering between sender and recipient, ordering computations according to their dependencies. (Thus synchronization is implicit.)
• Many possible overheads: copying, buffer management, protection, ...
Message-Passing Example: Intel Paragon (circa 1993)
[Diagram: each Paragon node is a 2-way SMP with two Intel i860 processors (each with an L1 cache) on a 64-bit, 50 MHz memory bus, 4-way interleaved DRAM, a DMA engine and driver, and a network interface (NI) serving as the communication assist (CA); nodes attach to a 2D grid point-to-point network (8 bits, 175 MHz, bidirectional) with a processing node at every switch. Shown: Sandia's Intel Paragon XP/S-based supercomputer.]

Message-Passing Example: IBM SP-2 (circa 1994-1998)
(An MPP: Massively Parallel Processor system)
[Diagram: each IBM SP-2 node is essentially a complete RS/6000 workstation — a Power 2 CPU with L2 cache, memory controller, and 4-way interleaved DRAM — with a network interface card (containing an i860, NI, and DMA) on the MicroChannel I/O bus acting as the communication assist (CA); nodes connect through a general multistage interconnection network built from 8-port switches.]
• Made out of essentially complete RS/6000 workstations.
• The network interface is integrated on the I/O bus (bandwidth is limited by the I/O bus).

Message-Passing MPP Example: IBM Blue Gene/L (circa 2005)
• (2 processors/chip) x (2 chips/compute card) x (16 compute cards/node board) x (32 node boards/cabinet) x (64 cabinets) = 128K = 131,072 (0.7 GHz PowerPC 440) processors (64K nodes); 2.8 GFLOPS peak per processor core.
• System location: Lawrence Livermore National Laboratory.
• Networks: a 3D torus point-to-point network plus a global tree 3D point-to-point network (both proprietary).
• Approximate ratings at each packaging level:
  – Chip (2 processors): 2.8/5.6 GF/s, 4 MB.
  – Compute card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR.
  – Node board (32 chips, 4x4x2): 90/180 GF/s, 8 GB DDR.
  – Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR.
  – System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR.
• Design goals: high computational power efficiency, high computational density per volume.
• LINPACK performance: 280,600 GFLOPS = 280.6 TeraFLOPS = 0.2806 PetaFLOPS.
• Top peak FP performance: now about 367,000 GFLOPS = 367 TeraFLOPS = 0.367 PetaFLOPS.

Message-Passing Programming Tools
• Message-passing programming environments include:
  – Message Passing Interface (MPI):
    • Provides a standard for writing concurrent message-passing programs.
    • MPI implementations include parallel libraries used by existing programming languages (C, C++).
  – Parallel Virtual Machine (PVM):
    • Enables a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource.
    • PVM support software executes on each machine in a user-configurable pool and provides a computational environment for concurrent applications.
    • User programs written, for example, in C, Fortran, or Java are given access to PVM through calls to PVM library routines.
• Both MPI and PVM are portable (platform-independent) and allow the user to explicitly specify parallelism; both are examples of the explicit parallelism approach to parallel programming.
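To make the explicit send/receive style concrete, here is a minimal MPI sketch in C (a generic example, not code from the course): process 0 sends a small buffer to process 1, which posts a matching blocking receive, so the matching tag and the blocking receive provide the implicit pairwise synchronization discussed earlier.

    /* Minimal MPI point-to-point example: process 0 sends, process 1 receives.
       Compile with an MPI wrapper (e.g. mpicc) and run with 2 processes,
       e.g. mpirun -np 2 ./a.out */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, data[4] = {1, 2, 3, 4};
        const int TAG = 99;                  /* optional tag used for matching */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Send(X, Q, t): explicit communication naming the recipient (rank 1) */
            MPI_Send(data, 4, MPI_INT, 1, TAG, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int y[4];
            /* Blocking Receive(Y, P, t): waits until a matching message arrives,
               which also orders the two processes (implicit pairwise synchronization). */
            MPI_Recv(y, 4, MPI_INT, 0, TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d %d %d %d\n", y[0], y[1], y[2], y[3]);
        }

        MPI_Finalize();
        return 0;
    }

Note how the program names processes and tags explicitly, in contrast to the shared-address-space sketch earlier, where communication happened through ordinary loads and stores.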
Data Parallel Systems
(SIMD in Flynn's taxonomy)
• Programming model (data parallel):
  – Similar operations are performed in parallel on each element of a data structure.
  – Logically, a single thread of control performs sequential or parallel steps.
  – Conceptually, a processor is associated with each data element.
• Architectural model:
  – An array of many simple processors, each with little memory, attached to a control processor that issues instructions. [Diagram: a control processor broadcasting to a grid of PEs.]
  – The processors do not sequence through instructions themselves; all PEs are synchronized (same instruction or operation in a given cycle).
  – Specialized and general communication, global synchronization.
• Example machines:
  – Thinking Machines CM-1, CM-2 (and CM-5).
  – MasPar MP-1 and MP-2.
• Other data parallel architectures: vector machines.
(PE = processing element)
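A tiny sketch of the data parallel style (illustrative only, not tied to any of the machines above): one logical thread of control applies the same operation to every element of an array; conceptually, each iteration below is what one PE of the array would do to its own element in the same cycle.

    /* Data parallel sketch: the same operation applied to all elements of an array.
       Conceptually, element i is handled by PE i; here a sequential loop plays that role. */
    #include <stdio.h>

    #define N 8

    int main(void) {
        float a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
        float b[N] = {8, 7, 6, 5, 4, 3, 2, 1};
        float c[N];

        /* One global action on the whole data structure: c = 2*a + b */
        for (int i = 0; i < N; i++)
            c[i] = 2.0f * a[i] + b[i];

        for (int i = 0; i < N; i++) printf("%.0f ", c[i]);
        printf("\n");
        return 0;
    }

The same element-wise form is what SIMD arrays and vector machines execute directly in hardware, one instruction driving all elements at once.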
Dataflow Architectures
• Represent computation as a graph of essential data dependencies (a non-von Neumann architecture, i.e. not PC-based):
  – A logical processor sits at each node, activated by the availability of its operands (i.e. data or results).
  – A message (token) carrying the tag of the next instruction is sent to the next processor.
  – The tag is compared with others in a matching store; a match fires execution.
  – Tokens = copies of computation results.
[Diagram: the dataflow graph for a = (b + 1) x (b - c), d = c x e, f = a x d — the dependency graph for an entire computation (program) — and the machine organization of one node: a token distribution network feeding a token queue, token matching, instruction fetch from the program store, execute, and form-token stages, with result tokens sent back through the network.]
• Research dataflow machine prototypes include:
  – The MIT Tagged-Token Dataflow Architecture.
  – The Manchester Dataflow Machine.
• The Tomasulo approach to dynamic instruction execution utilizes a dataflow-driven execution engine:
  – The data dependency graph for a small window of instructions is constructed dynamically as instructions are issued in program order.
  – The execution of an issued instruction is triggered by the availability of its operands (the data it needs) over the CDB.

Example of Flynn's Taxonomy's MISD (Multiple Instruction Streams, Single Data Stream): Systolic Architectures
• Replace a single processor with an array of regular processing elements. [Diagram: memory feeding a linear chain of PEs; PE = processing element, M = memory.]
• Orchestrate data flow for high throughput with less memory access.
• Different from linear pipelining:
  – Non-linear array structure, multi-directional data flow; each PE may have (small) local instruction and data memory.
• Different from SIMD: each PE may do something different.
• Initial motivation: VLSI Application-Specific Integrated Circuits (ASICs).
• Represent algorithms directly by chips connected in a regular pattern.
• A possible example of MISD in Flynn's classification of computer architecture.

Systolic Array Example: 3x3 Systolic Array Matrix Multiplication (C = A x B)
• Processors are arranged in a 2D grid; each processor accumulates one element of the product.
• The rows of A and the columns of B are fed in from the left and the top, skewed in time ("alignments in time") so that the operands that must be multiplied meet at the right processor in the right cycle; each value moves one PE per cycle (A elements to the right, B elements downward).
• [Figure sequence, T = 0 through T = 4: at T = 0 the skewed inputs are staged at the array edges; at each subsequent step the values advance one PE and each PE adds the product of its current operand pair, e.g. a0,0·b0,0 at T = 1, then a0,0·b0,1 and a0,1·b1,0 joining the partial sums for C00 and C01, and so on across the array.]
• [Figure sequence, T = 5 through T = 7: the remaining partial products are accumulated as the last operands drain through the array; by T = 7 all nine elements C00..C22 are complete.]
• The computation finishes in T = 7 steps, versus 27 multiply-accumulate steps on one processor (O(n^3) for n = 3).
• Speedup = 27/7 = 3.85.
Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/
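The same computation can be checked in software: the sketch below simulates the 3x3 systolic array cycle by cycle — skewed operands enter from the left and the top, each PE does one multiply-accumulate per cycle and forwards its operands right and down — and after 3N - 2 = 7 cycles the array holds A x B. The matrices are arbitrary test values, not taken from the slides.

    /* Cycle-by-cycle simulation of an N x N systolic array computing C = A x B. */
    #include <stdio.h>

    #define N 3

    int main(void) {
        int A[N][N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
        int B[N][N] = {{9, 8, 7}, {6, 5, 4}, {3, 2, 1}};
        int C[N][N] = {{0}};

        /* a_reg[i][j] / b_reg[i][j]: operand currently held by PE(i,j), to be
           forwarded right / down on the next cycle. 0 means "no data yet", which
           is safe here because a zero operand contributes nothing to the sum. */
        int a_reg[N][N] = {{0}}, b_reg[N][N] = {{0}};

        for (int t = 0; t < 3 * N - 2; t++) {          /* 3N - 2 = 7 cycles for N = 3 */
            /* Update PEs from bottom-right to top-left so each PE reads the value
               its left/upper neighbor held at the start of this cycle. */
            for (int i = N - 1; i >= 0; i--) {
                for (int j = N - 1; j >= 0; j--) {
                    int a_in = (j == 0)
                        ? ((t - i >= 0 && t - i < N) ? A[i][t - i] : 0)  /* skewed row feed    */
                        : a_reg[i][j - 1];
                    int b_in = (i == 0)
                        ? ((t - j >= 0 && t - j < N) ? B[t - j][j] : 0)  /* skewed column feed */
                        : b_reg[i - 1][j];
                    C[i][j] += a_in * b_in;   /* one multiply-accumulate per PE per cycle */
                    a_reg[i][j] = a_in;       /* pass the A operand to the right next cycle */
                    b_reg[i][j] = b_in;       /* pass the B operand downward next cycle     */
                }
            }
        }

        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) printf("%4d", C[i][j]);
            printf("\n");
        }
        return 0;
    }

The printed matrix equals the result of an ordinary triple-nested-loop multiplication of the same A and B, confirming that the skewed feeding pairs up the correct operands at each PE.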