
Introduction to Parallel Computing

COMP 422: Introduction to Parallel Computing
Vivek Sarkar
Department of Computer Science, Rice University
COMP 422 Lecture 1, 8 January 2008

Acknowledgments for today's lecture
• Jack Dongarra (U. Tennessee): CS 594 slides from Spring 2008
• John Mellor-Crummey (Rice): COMP 422 slides from Spring 2007
• Kathy Yelick (UC Berkeley): CS 267 slides from Spring 2007
• Slides accompanying the course textbook
COMP 422, Spring 2008 (V.Sarkar)

Course Information
• Meeting time: TTh 10:50-12:05
• Meeting place: DH 1046
• Instructor: Vivek Sarkar, x5304, DH 3131
—Office hours: Wednesdays 11am-12noon, and by appointment
• TA: Raj Barik, x2738, DH 2070
—Office hours: Tuesdays & Thursdays, 1pm-2pm, and by appointment
• Web site:

Prerequisites
• Programming in C, Fortran, or Java
• Basics of data structures
• Basics of machine architecture
• Required courses: COMP 212 and 320, or equivalent
• See me if you have concerns

Text
Introduction to Parallel Computing, 2nd Edition
Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar
Addison-Wesley, 2003

Topics
• Introduction (Chapter 1): today's lecture
• Parallel Programming Platforms (Chapter 2)
—New material: homogeneous & heterogeneous multicore platforms
• Principles of Parallel Algorithm Design (Chapter 3)
• Analytical Modeling of Parallel Programs (Chapter 5)
—New material: theoretical foundations of task scheduling
• Programming Shared Address Space Platforms (Chapter 7)
—New material: new programming models beyond threads and OpenMP
–Java Concurrency Utilities, Intel Thread Building Blocks, .NET Parallel Extensions (Task Parallel Library & PLINQ), Cilk, X10
• Dense Matrix Operations (Chapter 8)
• Graph Algorithms (Chapter 10)
• Programming Using the Message-Passing Paradigm (Chapter 6)
—New material: Partitioned Global Address Space (PGAS) languages
–Unified Parallel C (UPC), Co-array Fortran (CAF)
• New material: Programming Heterogeneous Processors and Accelerators
• New material: Problem Solving on Large-Scale Clusters using MapReduce
• (Secondary references will be assigned for new material)

Parallel Machines for the Course
• Ada
—316 dual-core Opteron processors
—dual-processor nodes
—InfiniBand interconnection network
—no global shared memory
• Additional nodes on Ada
—Quad-core AMD Barcelona (denali)
—Sun T2000 (yellowstone)
• Google/IBM cluster
—Accounts to be obtained later in the semester

Assignments and Exams
• Coursework will be weighted as follows:
—Homeworks: 5%
—Programming Assignments: 55%
–To be done individually or in 2-person teams; the requirements will be scaled accordingly
—Midterm Exam: 20% (one-hour written exam)
—Final Exam: 20% (one-hour multiple-choice exam)

Why is Parallel Computing Important?

Computing and Science
• "Computational modeling and simulation are among the most significant developments in the practice of scientific inquiry in the 20th century. Within the last two decades, scientific computing has become an important contributor to all scientific disciplines.
• It is particularly important for the solution of research problems that are insoluble by traditional scientific theoretical and experimental approaches, hazardous to study in the laboratory, or time-consuming or expensive to solve by traditional means."
—"Scientific Discovery through Advanced Computing," DOE Office of Science, 2000

Simulation: The Third Pillar of Science
• Traditional scientific and engineering paradigm:
1) Do theory or paper design.
2) Perform experiments or build the system.
• Limitations:
—Too difficult: build large wind tunnels.
—Too expensive: build a throw-away passenger jet.
—Too slow: wait for climate or galactic evolution.
—Too dangerous: weapons, drug design, climate experimentation.
• Computational science paradigm:
3) Use high-performance computer systems to simulate the phenomenon.
–Based on known physical laws and efficient numerical methods.

Some Particularly Challenging Computations
• Science
—Global climate modeling
—Biology: genomics, protein folding, drug design
—Astrophysical modeling
—Computational chemistry
—Computational materials science and nanoscience
• Engineering
—Semiconductor design
—Earthquake and structural modeling
—Computational fluid dynamics (airplane design)
—Combustion (engine design)
—Crash simulation
• Business
—Financial and economic modeling
—Transaction processing, web services, and search engines
• Defense
—Nuclear weapons testing by simulation
—Cryptography

Units of Measure in HPC
• High Performance Computing (HPC) units are:
—Flop: floating-point operation
—Flop/s: floating-point operations per second
—Bytes: size of data (a double-precision floating-point number is 8 bytes)
• Typical sizes are millions, billions, trillions, ...
Mega   Mflop/s = 10^6 flop/sec    Mbyte = 2^20 = 1,048,576 ~ 10^6 bytes
Giga   Gflop/s = 10^9 flop/sec    Gbyte = 2^30 ~ 10^9 bytes
Tera   Tflop/s = 10^12 flop/sec   Tbyte = 2^40 ~ 10^12 bytes
Peta   Pflop/s = 10^15 flop/sec   Pbyte = 2^50 ~ 10^15 bytes
Exa    Eflop/s = 10^18 flop/sec   Ebyte = 2^60 ~ 10^18 bytes
Zetta  Zflop/s = 10^21 flop/sec   Zbyte = 2^70 ~ 10^21 bytes
Yotta  Yflop/s = 10^24 flop/sec   Ybyte = 2^80 ~ 10^24 bytes
• See for current list of fastest machines

Global Climate Modeling Problem
• Problem is to compute: f(latitude, longitude, elevation, time) → temperature, pressure, humidity, wind velocity
• Approach:
—Discretize the domain, e.g., a measurement point every 10 km
—Devise an algorithm to predict the weather at time t+δt given time t
• Uses:
—Predict major events, e.g., El Niño
—Use in setting air-emissions standards
Source:

Global Climate Modeling Computation
• One piece is modeling the fluid flow in the atmosphere
—Solve the Navier-Stokes equations
–Roughly 100 flops per grid point with a 1-minute timestep
• Computational requirements:
—To match real time, need 5 x 10^11 flops in 60 seconds = 8 Gflop/s
—Weather prediction (7 days in 24 hours) → 56 Gflop/s
—Climate prediction (50 years in 30 days) → 4.8 Tflop/s
—To use in policy negotiations (50 years in 12 hours) → 288 Tflop/s
• To double the grid resolution, computation is 8x to 16x
• State-of-the-art models require integration of atmosphere, ocean, sea-ice, and land models, plus possibly carbon cycle, geochemistry, and more
• Current models are coarser than this

Scale of Today's HPC Systems
Figure Credit:, Nov 2007

Historical Concurrency in Top 500 Systems
Figure Credit:, Nov 15, 2005

Technology Trends: Microprocessor Capacity
• Moore's Law: 2X transistors/chip every 1.5 years
• Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months; this is called "Moore's Law"
• Microprocessors have become smaller, denser, and more powerful
Slide source: Jack Dongarra

Why Parallelism is Now Necessary for Mainstream Computing
• Chip density continues to increase 2x every 2 years
—Clock speed is not increasing
—The number of processor cores has to double instead
• There is little or no hidden parallelism (ILP) left to be found
• Parallelism must be exposed to and managed by software
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

Fundamental Limits on Serial Computing: Three "Walls"
• Power Wall
—Increasingly, microprocessor performance is limited by achievable power dissipation rather than by the number of available integrated-circuit resources (transistors and wires).
Thus, the only way to significantly increase the performance of microprocessors is to improve power efficiency at about the same rate as the performance increase.
• Frequency Wall
—Conventional processors require increasingly deep instruction pipelines to achieve higher operating frequencies. This technique has reached a point of diminishing returns, and even negative returns if power is taken into account.
• Memory Wall
—On multi-gigahertz symmetric multiprocessors, even those with integrated memory controllers, latency to DRAM memory is currently approaching 1,000 cycles. As a result, program performance is dominated by the activity of moving data between main storage (the effective-address space that includes main memory) and the processor.