Introduction to Parallel Processing
EECC756 (Shaaban), lec 1, Spring 2011 (3-8-2011). PCA Chapter 1.1, 1.2.

Introduction to Parallel Processing (Outline)
• Parallel computer architecture: definition and broad issues involved
  – A generic parallel computer architecture
• The need for and feasibility of parallel computing (why?)
  – Scientific supercomputing trends
  – CPU performance and technology trends; parallelism in microprocessor generations
  – Computer system peak FLOP rating history / near future
• The goal of parallel processing
• Elements of parallel computing
• Factors affecting parallel system performance
• Parallel architectures history
  – Parallel programming models
  – Flynn's 1972 classification of computer architecture
• Current trends in parallel architectures
  – Modern parallel architecture layered framework
• Shared address space parallel architectures
• Message-passing multicomputers: message-passing programming tools
• Data parallel systems
• Dataflow architectures
• Systolic architectures: matrix multiplication systolic array example

Parallel Computer Architecture
A parallel computer (or multiple processor system) is a collection of communicating processing elements (processors) that cooperate to solve large computational problems fast by dividing such problems into parallel tasks (i.e., parallel processing), exploiting thread-level parallelism (TLP).
• Terms: a task is the computation done on one processor; a processor (processing element, PE) is a programmable computing element that runs stored programs written using a pre-defined instruction set.
• Broad issues involved:
  – The concurrency and communication characteristics of parallel algorithms for a given computational problem (represented by dependency graphs).
  – Computing resources and computation allocation:
    • The number of processing elements (PEs), the computing power of each element, and the amount/organization of physical memory used.
    • What portions of the computation and data are allocated or mapped to each PE.
  – Data access, communication, and synchronization:
    • How the processing elements cooperate and communicate.
    • How data is shared/transmitted between processors.
    • Abstractions and primitives for cooperation/communication and synchronization.
    • The characteristics and performance of the parallel system network (system interconnects).
  – Parallel processing performance and scalability goals:
    • Maximize the performance enhancement from parallelism (maximize speedup) by balancing workload across processors and minimizing parallelization overheads.
    • Scalability of performance to larger systems/problems.
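To make the notions of tasks, mapping, and synchronization concrete, here is a minimal sketch (not from the lecture) that divides one computation into parallel tasks using POSIX threads; the array size, thread count, and summation workload are arbitrary illustrative choices.

```c
/* Minimal sketch: one computation (summing an array) divided into
 * parallel tasks, one per processing element, using POSIX threads.
 * N and NTHREADS are arbitrary illustrative values.               */
#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define NTHREADS 4

static double a[N];
static double partial[NTHREADS];   /* one result slot per task */

static void *task(void *arg) {
    long id = (long)arg;                         /* task/PE index     */
    long lo = id * (N / NTHREADS);
    long hi = (id == NTHREADS - 1) ? N : lo + N / NTHREADS;
    double s = 0.0;
    for (long i = lo; i < hi; i++)               /* work mapped to PE */
        s += a[i];
    partial[id] = s;      /* private slot: no synchronization needed  */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < N; i++) a[i] = 1.0;

    for (long i = 0; i < NTHREADS; i++)          /* spawn the tasks   */
        pthread_create(&t[i], NULL, task, (void *)i);

    double sum = 0.0;
    for (long i = 0; i < NTHREADS; i++) {        /* join = sync point */
        pthread_join(t[i], NULL);
        sum += partial[i];                       /* combine results   */
    }
    printf("sum = %.0f (expected %d)\n", sum, N);
    return 0;
}
```

Compile with `gcc -pthread`. Speedup here is bounded by how evenly the work is balanced across the threads and by the overhead of creating and joining them, i.e., the two performance goals named on the slide.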
A Generic Parallel Computer Architecture
[Figure: a set of processing nodes, each containing one or more processors (P) with memory (Mem) and a communication assist (CA, the network interface), connected by a parallel machine network (custom or industry standard); the operating system (custom or industry standard) and parallel programming environments are layered above the hardware.]
1. Processing nodes: each node contains one or more processing elements (PEs) or processors (custom or commercial microprocessors; single or multiple processors per chip, 2-8 cores per chip; homogeneous or heterogeneous), plus a memory system and a communication assist (CA): a network interface and communication controller.
2. Parallel machine network (system interconnects): the function of the network is to transfer information (data, results, ...) efficiently (i.e., at low communication cost) from a source node to a destination node as needed, allowing the parallel processing nodes to cooperate in solving large computational problems divided into a number of parallel computational tasks.
Parallel computer = multiple processor system.

The Need and Feasibility of Parallel Computing
• Application demands (the driving force): more computing cycles and more memory are needed.
  – Scientific/engineering computing: CFD, biology, chemistry, physics, ...
  – General-purpose computing: video, graphics, CAD, databases, transaction processing, gaming, ...
  – Mainstream multithreaded programs are similar to parallel programs.
• Technology trends (Moore's Law is still alive):
  – The number of transistors per chip is growing rapidly, while clock rates are expected to rise only slowly; actual performance returns are diminishing due to deeper pipelines.
  – Increased transistor density allows integrating multiple processor cores per chip, creating chip multiprocessors (CMPs) even for mainstream computing applications (desktop/laptop, ...), and supports multi-tasking (multiple independent programs).
• Architecture trends:
  – Instruction-level parallelism (ILP) is valuable (superscalar, VLIW) but limited.
  – Increased clock rates require deeper pipelines with longer latencies and higher CPIs.
  – Coarser-level parallelism (at the task or thread level, TLP), as utilized in multiprocessor systems, is the most viable approach to further improve performance; this is the main motivation for developing chip multiprocessors (CMPs), i.e., multi-core processors.
• Economics:
  – The increased use of commodity off-the-shelf (COTS) components in high-performance parallel computing systems, instead of the costly custom components used in traditional supercomputers, leads to much lower parallel system cost.
  – Today's microprocessors offer high performance and have multiprocessor support, eliminating the need to design expensive custom PEs.
  – Commercial system area networks (SANs) offer an alternative to more costly custom networks.
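Relating this to the generic architecture above: the network's job of moving data from a source node to a destination node is exactly what message-passing tools expose to the programmer. Below is a minimal sketch using MPI (one widely used message-passing library, not one the slides single out); the ranks and payload are illustrative.

```c
/* Minimal sketch of message-passing cooperation between two
 * processing nodes, using MPI. Ranks and payload are illustrative. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double data = 3.14;
    if (rank == 0 && size > 1) {
        /* Source node: the communication assist/network carries
         * this explicit message to the destination node.          */
        MPI_Send(&data, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&data, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("node 1 received %f from node 0\n", data);
    }

    MPI_Finalize();
    return 0;
}
```

Typically built with `mpicc` and launched with something like `mpirun -np 2`, which places one process per processing node (or core).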
Why is Parallel Processing Needed?
Challenging applications in applied science/engineering, the traditional driving force for HPC/parallel processing and for multiple-processor system development:
• Astrophysics
• Atmospheric and ocean modeling
• Bioinformatics
• Biomolecular simulation: protein folding
• Computational chemistry
• Computational fluid dynamics (CFD)
• Computational physics
• Computer vision and image understanding
• Data mining and data-intensive computing
• Engineering analysis (CAD/CAM)
• Global climate modeling and forecasting
• Material sciences
• Military applications
• Quantum chemistry
• VLSI design
• ...
Such applications have very high (1) computational and (2) memory requirements that cannot be met with single-processor architectures, and many contain a large degree of computational parallelism.

Why is Parallel Processing Needed? Scientific Computing Demands
Scientific computing demands are a driving force for HPC and multiple-processor system development.
[Figure: computational and memory requirements of grand-challenge scientific applications, far beyond the 3-5 GFLOPS of a current uniprocessor.]
Units: 1 GFLOP = 10^9 FLOPS; 1 TeraFLOP = 1000 GFLOPS = 10^12 FLOPS; 1 PetaFLOP = 1000 TeraFLOPS = 10^15 FLOPS.

Scientific Supercomputing Trends
• Scientific supercomputing is the proving ground and driver for innovative architecture and advanced high-performance computing (HPC) techniques:
  – Its market is much smaller than the commercial (desktop/server) segment.
  – It was dominated by costly vector machines from the 1970s through the 1980s.
• Microprocessors have made huge gains in the performance needed for such applications, enabled by high transistor density per chip:
  – High clock rates (at the cost of a higher CPI).
  – Multiple pipelined floating-point units.
  – Instruction-level parallelism.
  – Effective use of caches.
  – Multiple processor cores per chip (2 cores in 2002-2005, 4 by the end of 2006, 6-12 cores in 2011).
• However, even the fastest current single-microprocessor systems still cannot meet these computational demands (as shown on the previous slide).
• Currently, large-scale microprocessor-based multiprocessor systems and computer clusters are replacing (have replaced?) vector supercomputers that utilize custom processors.

Uniprocessor Performance Evaluation
• CPU performance benchmarking is heavily program-mix dependent.
• Ideal performance requires a perfect machine/program match.
• Performance measures:
  – Total CPU time (in seconds):
    T = TC / f = TC x C = I x CPI x C = I x (CPI_execution + M x k) x C
    where TC = total program execution clock cycles, f = clock rate, C = CPU clock cycle time = 1/f, I = instructions executed count, CPI = cycles per instruction, CPI_execution = CPI with ideal memory, M = memory stall cycles per memory access, and k = memory accesses per instruction.
  – MIPS rating (in million instructions per second):
    MIPS = I / (T x 10^6) = f / (CPI x 10^6) = f x I / (TC x 10^6)
  – Throughput rate (in programs per second):
    Wp = 1 / T = f / (I x CPI) = MIPS x 10^6 / I
• The performance factors (I, CPI_execution, M, k, C) are influenced by: instruction-set architecture (ISA), compiler design, CPU micro-architecture, implementation and control, cache and memory hierarchy, program access locality, and program instruction mix and instruction dependencies.
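A small worked example of the measures above; all machine/program parameters are hypothetical values chosen for illustration, not figures from the lecture.

```c
/* Worked example of T = I x CPI x C, the MIPS rating, and the
 * throughput rate Wp. All parameter values are hypothetical.    */
#include <stdio.h>

int main(void) {
    double f        = 2.0e9;  /* clock rate: 2 GHz, so C = 1/f    */
    double I        = 5.0e9;  /* instructions executed            */
    double CPI_exec = 1.2;    /* CPI with ideal memory            */
    double M        = 1.0;    /* memory stall cycles per access   */
    double k        = 0.4;    /* memory accesses per instruction  */

    double C    = 1.0 / f;                 /* CPU clock cycle time */
    double CPI  = CPI_exec + M * k;        /* effective CPI        */
    double T    = I * CPI * C;             /* total CPU time (s)   */
    double MIPS = f / (CPI * 1e6);         /* = I / (T x 10^6)     */
    double Wp   = 1.0 / T;                 /* programs per second  */

    printf("T = %.2f s, MIPS = %.0f, Wp = %.3f programs/s\n",
           T, MIPS, Wp);
    return 0;
}
```

With these numbers, CPI = 1.6, T = 4.0 s, and the MIPS rating is 1250; note how memory stalls (M x k) inflate the ideal CPI_execution.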
Single CPU Performance Trends
• The microprocessor is currently the most natural building block for multiprocessor systems in terms of cost and performance.
• This is even more true with the development of cost-effective multi-core microprocessors that support TLP at the chip level.
[Figure: relative performance of supercomputers (custom processors), mainframes, minicomputers, and microprocessors (commodity processors), 1965-1995; microprocessor performance grows the fastest.]

Microprocessor Frequency Trend
[Figure: clock frequency (MHz) and gate delays per clock for Intel, IBM PowerPC, and DEC processors (386, 486, Pentium, Pentium Pro, Pentium II, 601/603/604, MPC750, 21064A, 21066, 21164, 21164A, 21264, 21264S), 1987-2005. Historically, frequency doubled each generation while the number of gate delays per clock fell by about 25%, leading to deeper pipelines with more stages (e.g., the Intel Pentium 4E has 30+ pipeline stages); this is no longer the case.]
• Reality check: clock frequency scaling is slowing down (did silicon finally hit the wall?). Why?
  1. Power leakage.
  2. Clock distribution delays.
• Result: deeper pipelines, longer stalls, higher CPI (which lowers effective performance per cycle).
• Solution: exploit TLP at the chip level with chip multiprocessors (CMPs).

Transistor Count Growth Rate
• Increased transistor density is the enabling technology for chip-level thread-level parallelism (TLP): roughly an 800,000x increase in transistor density in the 38 years since the Intel 4004 (2,300 transistors, circa 1970), currently reaching about two billion transistors per chip.
• Moore's Law (2X transistors per chip every 1.5 years) still holds.
• One billion transistors per chip was reached in 2005, two billion in 2008-9, and now three billion.
• Transistor count grows faster than clock rate: currently about 40% per year.
• Single-threaded uniprocessors do not efficiently utilize the increased transistor count (limited ILP, increased cache size).
• Solution: use the transistor budget for thread-level parallelism (TLP) at the chip level: chip multiprocessors (CMPs) and simultaneous multithreaded (SMT) processors.
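A quick sanity check of the transistor-count figures above: compounding doublings over the cited 38-year span from the 4004's 2,300 transistors. The doubling periods tried below are assumptions; the slide's ~800,000x growth corresponds to an average period just under two years.

```c
/* Sanity check of Moore's-Law-style growth: transistors after t
 * years = start x 2^(t / doubling_period). Periods are assumed. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double start = 2300.0;   /* Intel 4004 transistor count */
    const double years = 38.0;     /* span cited on the slide     */
    const double periods[] = { 1.5, 1.75, 2.0 };

    for (int i = 0; i < 3; i++) {
        double growth = pow(2.0, years / periods[i]);
        printf("doubling every %.2f yr: %.2ex growth -> %.2e transistors\n",
               periods[i], growth, start * growth);
    }
    return 0;
}
```

Compile with `-lm`. A strict 1.5-year period overshoots (about 4 x 10^7 x); the observed ~8 x 10^5 x implies an average doubling period of roughly 1.9-2 years over that span.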
Parallelism in Microprocessor VLSI Generations
[Figure: transistors per chip vs. year of introduction, 1970-2005, annotated with the dominant form of parallelism in each generation.]
• Bit-level parallelism (through the mid-1980s: i4004, i8008, i8080, i8086, i80286, i80386): not pipelined (multi-cycle designs), CPI >> 1.
• Instruction-level parallelism (ILP) (mid-1980s to 2000: R2000, R3000, R10000, Pentium): pipelined (CPI = 1), then superscalar/VLIW issuing multiple micro-operations per cycle (CPI < 1), still a single thread per chip.
• Thread-level parallelism (TLP) (2000 onward): simultaneous multithreading (SMT, e.g., Intel's Hyper-Threading) and chip multiprocessors (CMPs, e.g., IBM POWER4/5, Intel Pentium D and Core Duo, AMD Athlon 64 X2 and dual-core Opteron, Sun UltraSparc T1 (Niagara)).
• Chip-level TLP/parallel processing is even more important due to the slowing rate of clock-frequency increase; each microprocessor generation improves performance by exploiting more levels of parallelism.

Current Dual-Core Chip-Multiprocessor (CMP) Architectures
• Single die, shared L2 cache: cores communicate through the shared cache, giving the lowest communication latency (a short code sketch of this style of communication appears after the LINPACK slides below). Examples: IBM POWER4/5, Intel Pentium Core Duo (Yonah), Conroe (Core 2), i7, Sun UltraSparc T1 (Niagara).
• Single die, private caches, shared system interface (shared L2 or L3, on-chip crossbar/switch): cores communicate using on-chip interconnects. Examples: AMD dual-core Opteron, Athlon 64 X2, AMD Phenom, Intel Itanium 2 (Montecito).
• Two dice in a shared package, private caches, private system interfaces: cores communicate over the external front-side bus (FSB), giving the highest communication latency. Examples: Intel Pentium D, Intel quad-core (two dual-core chips).
Source: Real World Technologies, http://www.realworldtech.com/page.cfm?ArticleID=RWT101405234615

Microprocessors vs. Vector Processors: Uniprocessor Performance (LINPACK)
[Figure: LINPACK performance in MFLOPS, 1975-2000, for vector processors (CRAY, n = 100 and n = 1,000: CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94) and microprocessors (n = 100 and n = 1,000: Sun 4/260, MIPS M/120, MIPS M/2000, MIPS R4400, IBM RS6000/540, IBM Power2/990, HP 9000/735, HP 9000/750, DEC Alpha, DEC Alpha AXP, DEC 8200); microprocessors closed most of the gap by the mid-1990s. 1 GFLOP = 10^9 FLOPS.]
• Now about 5-20 GFLOPS per microprocessor core.

Parallel Performance: LINPACK
[Figure: LINPACK performance in GFLOPS, 1985-1996, for massively parallel processors (MPP peak: Xmp/416(4), Ymp/832(8), nCUBE/2(1024), iPSC/860, Delta, CM-2, CM-200, CM-5, Paragon XP/S, Paragon XP/S MP with 1024 and 6768 processors, T3D, ASCI Red) vs. CRAY peak (C90(16), T932(32)); ASCI Red reached 1 TeraFLOP = 10^12 FLOPS = 1000 GFLOPS.]
• Current top LINPACK performance (since Nov. 2010): about 2,566,000 GFLOPS = 2,566 TeraFLOPS = 2.566 PetaFLOPS, achieved by Tianhe-1A (National Supercomputing Center in Tianjin, China) with 186,368 processor cores: 14,336 Intel Xeon X5670 6-core processors at 2.9 GHz + 7,168 NVIDIA Tesla M2050 GPUs.
• The current ranking of the top 500 parallel supercomputers in the world is found at: www.top500.org
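The code sketch promised on the dual-core CMP slide above: two threads (standing in for two cores) communicating through shared memory; on a single-die CMP this hand-off would typically travel through the shared cache. The flag/payload protocol is an illustrative assumption, and C11 atomics are used to make the hand-off well defined.

```c
/* Minimal sketch of core-to-core communication through shared
 * memory (e.g., a shared L2 on a single-die CMP). The flag and
 * payload are illustrative; C11 release/acquire ordering makes
 * the producer/consumer hand-off well defined.                  */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static _Atomic int ready = 0;   /* synchronization flag          */
static int payload;             /* data passed between "cores"   */

static void *producer(void *arg) {
    (void)arg;
    payload = 42;               /* write the data first...       */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;                /* ...then publish it            */
}

static void *consumer(void *arg) {
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                       /* spin until the flag is set    */
    printf("consumer read %d\n", payload);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```

The same program works on all three CMP organizations; what changes is the latency of the hand-off, lowest through a shared on-die cache and highest over an external FSB.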
Why is Parallel Processing Needed? LINPACK Performance Trends
[Figure: the two previous LINPACK plots side by side, uniprocessor performance (MFLOPS, 1975-2000) and parallel system performance (GFLOPS, 1985-1996), showing parallel systems far outpacing uniprocessors. 1 GFLOP = 10^9 FLOPS; 1 TeraFLOP = 10^12 FLOPS = 1000 GFLOPS.]

Computer System Peak FLOP Rating History
• Current top peak FP performance (since Nov. 2010): about 4,701,000 GFLOPS = 4,701 TeraFLOPS = 4.701 PetaFLOPS, for Tianhe-1A (National Supercomputing Center in Tianjin, China) with 186,368 processor cores: 14,336 Intel Xeon X5670 6-core processors at 2.9 GHz + 7,168 NVIDIA Tesla M2050 GPUs (a back-of-envelope reconstruction of this figure appears at the end of this section).
[Figure: computer system peak FLOP rating history/near future, marking the TeraFLOP (10^12 FLOPS = 1000 GFLOPS) and PetaFLOP (10^15 FLOPS = 1000 TeraFLOPS) milestones, with Tianhe-1A at the top.]
• The current ranking of the top 500 parallel supercomputers in the world is found at: www.top500.org

TOP500 Supercomputers
[Tables: the Top 10 entries of the TOP500 lists of November 2005, November 2008 (32nd list), and November 2009 (34th list), including power consumption in KW. Source (and current list): www.top500.org]
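The back-of-envelope check referenced under the peak FLOP rating slide: combining the slide's unit counts with commonly cited per-unit peak rates. The per-unit rates (4 double-precision FLOPs per cycle for a Westmere-class Xeon core at 2.93 GHz, and ~515 GFLOPS double precision for a Tesla M2050) are assumptions of this sketch, not figures from the slides.

```c
/* Reconstructing Tianhe-1A's ~4.7 PetaFLOPS peak from its parts.
 * Per-unit rates are commonly cited values, assumed here.        */
#include <stdio.h>

int main(void) {
    double cpu_cores = 14336.0 * 6.0;             /* Xeon X5670 cores   */
    double cpu_peak  = cpu_cores * 2.93e9 * 4.0;  /* 4 DP FLOPs/cycle   */
    double gpu_peak  = 7168.0 * 515e9;            /* ~515 GFLOPS DP/GPU */

    printf("CPU: %.2f PFLOPS + GPU: %.2f PFLOPS = %.2f PFLOPS peak\n",
           cpu_peak / 1e15, gpu_peak / 1e15,
           (cpu_peak + gpu_peak) / 1e15);
    return 0;
}
```

This lands at about 4.70 PFLOPS, matching the 4.701 PetaFLOPS quoted on the slide, and also shows why the sustained LINPACK figure (2.566 PetaFLOPS) is well below peak: most of the peak comes from the GPUs, which are harder to keep fully utilized.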