



Introduction to Parallel Computing

COMP 422: Introduction to Parallel Computing
Vivek Sarkar
Department of Computer Science, Rice University
vsarkar@rice.edu
COMP 422 Lecture 1, 8 January 2008

Acknowledgments for today's lecture
• Jack Dongarra (U. Tennessee), CS 594 slides from Spring 2008
—http://www.cs.utk.edu/~dongarra/WEBPAGES/cs5942008.htm
• John Mellor-Crummey (Rice), COMP 422 slides from Spring 2007
• Kathy Yelick (UC Berkeley), CS 267 slides from Spring 2007
—http://www.eecs.berkeley.edu/~yelick/cs267sp07/lectures
• Slides accompanying the course textbook
—http://www-users.cs.umn.edu/~karypis/parbook/

Course Information
• Meeting time: TTh 10:50-12:05
• Meeting place: DH 1046
• Instructor: Vivek Sarkar
—vsarkar@rice.edu, x5304, DH 3131
—office hours: Wednesdays 11am-12noon, and by appointment
• TA: Raj Barik
—rajbarik@rice.edu, x2738, DH 2070
—office hours: Tuesdays and Thursdays, 1pm-2pm, and by appointment
• Web site: http://www.owlnet.rice.edu/~comp422

Prerequisites
• Programming in C, Fortran, or Java
• Basics of data structures
• Basics of machine architecture
• Required courses: COMP 212 and 320, or equivalent
• See me if you have concerns

Text
Introduction to Parallel Computing, 2nd Edition
Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar
Addison-Wesley, 2003
http://www-users.cs.umn.edu/~karypis/parbook/

Topics
• Introduction (Chapter 1), today's lecture
• Parallel Programming Platforms (Chapter 2)
—New material: homogeneous and heterogeneous multicore platforms
• Principles of Parallel Algorithm Design (Chapter 3)
• Analytical Modeling of Parallel Programs (Chapter 5)
—New material: theoretical foundations of task scheduling
• Programming Shared Address Space Platforms (Chapter 7)
—New material: new programming models (beyond threads and OpenMP): Java Concurrency Utilities, Intel Thread Building Blocks, .NET Parallel Extensions (Task Parallel Library, PLINQ), Cilk, X10
• Dense Matrix Operations (Chapter 8)
• Graph Algorithms (Chapter 10)
• Programming Using the Message-Passing Paradigm (Chapter 6)
—New material: Partitioned Global Address Space (PGAS) languages: Unified Parallel C (UPC), Coarray Fortran (CAF)
• New material: Programming Heterogeneous Processors and Accelerators
• New material: Problem Solving on Large Scale Clusters using MapReduce
• (Secondary references will be assigned for new material)

Parallel Machines for the Course
• Ada
—316 dual-core Opteron processors
—dual processor nodes
—Infiniband interconnection network
—no global shared memory
• Additional nodes on Ada
—Quad-core AMD Barcelona (denali)
—Sun T2000 (yellowstone)
• Google/IBM cluster
—Accounts to be obtained later in the semester

Assignments and Exams
• Coursework will be weighted as follows:
—Homeworks: 5%
—Programming Assignments: 55%
– to be done individually or in 2-person teams; the requirements will be scaled accordingly
—Midterm Exam: 20% (one-hour written exam)
—Final Exam: 20% (one-hour multiple-choice exam)

Why is Parallel Computing Important?

Computing and Science
• "Computational modeling and simulation are among the most significant developments in the practice of scientific inquiry in the 20th century. Within the last two decades, scientific computing has become an important contributor to all scientific disciplines.
• It is particularly important for the solution of research problems that are insoluble by traditional scientific theoretical and experimental approaches, hazardous to study in the laboratory, or time consuming or expensive to solve by traditional means."
— "Scientific Discovery through Advanced Computing", DOE Office of Science, 2000

Simulation: The Third Pillar of Science
• Traditional scientific and engineering paradigm:
1) Do theory or paper design.
2) Perform experiments or build system.
• Limitations:
—Too difficult: build large wind tunnels.
—Too expensive: build a throwaway passenger jet.
—Too slow: wait for climate or galactic evolution.
—Too dangerous: weapons, drug design, climate experimentation.
• Computational science paradigm:
3) Use high performance computer systems to simulate the phenomenon
– based on known physical laws and efficient numerical methods.

Some Particularly Challenging Computations
• Science
—Global climate modeling
—Biology: genomics; protein folding; drug design
—Astrophysical modeling
—Computational chemistry
—Computational material sciences and nanosciences
• Engineering
—Semiconductor design
—Earthquake and structural modeling
—Computational fluid dynamics (airplane design)
—Combustion (engine design)
—Crash simulation
• Business
—Financial and economic modeling
—Transaction processing, web services and search engines
• Defense
—Nuclear weapons test by simulations
—Cryptography

Units of Measure in HPC
• High Performance Computing (HPC) units are:
—Flop: floating point operation
—Flop/s: floating point operations per second
—Bytes: size of data (a double precision floating point number is 8 bytes)
• Typical sizes are millions, billions, trillions, ...
Mega    Mflop/s = 10^6 flop/sec     Mbyte = 2^20 = 1,048,576 ≈ 10^6 bytes
Giga    Gflop/s = 10^9 flop/sec     Gbyte = 2^30 ≈ 10^9 bytes
Tera    Tflop/s = 10^12 flop/sec    Tbyte = 2^40 ≈ 10^12 bytes
Peta    Pflop/s = 10^15 flop/sec    Pbyte = 2^50 ≈ 10^15 bytes
Exa     Eflop/s = 10^18 flop/sec    Ebyte = 2^60 ≈ 10^18 bytes
Zetta   Zflop/s = 10^21 flop/sec    Zbyte = 2^70 ≈ 10^21 bytes
Yotta   Yflop/s = 10^24 flop/sec    Ybyte = 2^80 ≈ 10^24 bytes
• See www.top500.org for the current list of fastest machines

Global Climate Modeling Problem
• Problem is to compute: f(latitude, longitude, elevation, time) → temperature, pressure, humidity, wind velocity
• Approach:
—Discretize the domain, e.g., a measurement point every 10 km
—Devise an algorithm to predict weather at time t+δt given t
• Uses:
—Predict major events, e.g., El Nino
—Use in setting air emissions standards
Source: http://www.epm.ornl.gov/chammp/chammp.html

Global Climate Modeling Computation
• One piece is modeling the fluid flow in the atmosphere
—Solve Navier-Stokes equations
– roughly 100 flops per grid point with a 1-minute timestep
• Computational requirements:
—To match real time, need 5 x 10^11 flops in 60 seconds = 8 Gflop/s
—Weather prediction (7 days in 24 hours) → 56 Gflop/s
—Climate prediction (50 years in 30 days) → 4.8 Tflop/s
—To use in policy negotiations (50 years in 12 hours) → 288 Tflop/s
• To double the grid resolution, computation is 8x to 16x
• State of the art models require integration of atmosphere, ocean, sea-ice, and land models, plus possibly carbon cycle, geochemistry and more
• Current models are coarser than this
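The flop-rate targets above follow directly from the slide's figure of roughly 5 x 10^11 flops per simulated minute; everything else is the required speedup over real time. The short C sketch below (added here, not part of the original slides) reproduces the arithmetic. Small differences from the quoted numbers (e.g., 58 vs. 56 Gflop/s) come from rounding the real-time rate to 8 Gflop/s on the slide.

    /* Back-of-the-envelope check of the climate-modeling flop rates.
       Assumption (from the slide): ~5e11 flops per simulated minute. */
    #include <stdio.h>

    int main(void) {
        const double flops_per_sim_minute = 5e11;
        const double realtime = flops_per_sim_minute / 60.0; /* flop/s to keep pace with real time */

        /* required speedups over real time for each use case on the slide */
        const double weather = (7.0 * 24.0) / 24.0;          /* 7-day forecast in 24 hours */
        const double climate = (50.0 * 365.0) / 30.0;        /* 50 years in 30 days        */
        const double policy  = (50.0 * 365.0 * 24.0) / 12.0; /* 50 years in 12 hours       */

        printf("match real time : %6.1f Gflop/s\n", realtime / 1e9);
        printf("weather         : %6.1f Gflop/s\n", realtime * weather / 1e9);
        printf("climate         : %6.2f Tflop/s\n", realtime * climate / 1e12);
        printf("policy          : %6.0f Tflop/s\n", realtime * policy  / 1e12);
        return 0;
    }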
Scale of Today's HPC Systems
(figure) Figure Credit: www.top500.org, Nov 2007

Historical Concurrency in Top 500 Systems
(figure) Figure Credit: www.top500.org, Nov 15, 2005

Technology Trends: Microprocessor Capacity
• Moore's Law: 2X transistors/chip every 1.5 years
• Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months; this became known as "Moore's Law".
• Microprocessors have become smaller, denser, and more powerful.
Slide source: Jack Dongarra

Why Parallelism is now necessary for Mainstream Computing
• Chip density is continuing to increase 2x every 2 years
—Clock speed is not
—The number of processor cores has to double instead
• There is little or no hidden parallelism (ILP) left to be found
• Parallelism must be exposed to and managed by software
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

Fundamental Limits on Serial Computing: Three "Walls"
• Power Wall
—Increasingly, microprocessor performance is limited by achievable power dissipation rather than by the number of available integrated-circuit resources (transistors and wires). Thus, the only way to significantly increase the performance of microprocessors is to improve power efficiency at about the same rate as the performance increase.
• Frequency Wall
—Conventional processors require increasingly deeper instruction pipelines to achieve higher operating frequencies. This technique has reached a point of diminishing returns, and even negative returns if power is taken into account.
• Memory Wall
—On multi-gigahertz symmetric processors, even those with integrated memory controllers, latency to DRAM memory is currently approaching 1,000 cycles. As a result, program performance is dominated by the activity of moving data between main storage (the effective-address space that includes main memory) and the processor.

Illustration of Power Wall
(figure)

Illustration of Frequency Wall: Diminishing Returns from Instruction-level Parallelism
(figure: issue-slot waste)
• Contributing factors
—instruction dependencies
—long-latency operations within a thread

Illustration of Memory Wall
(figure)

Memory Hierarchy
(figure: levels of the memory hierarchy, with capacities ranging from KB to PB)

Memory Hierarchy (contd)
(figure)

Another Limit: Speed of Light
(figure: a 1 Tflop/s, 1 Tbyte sequential machine must fit within r = 0.3 mm)
• Consider a hypothetical 1 Tflop/s sequential machine:
—Data must travel some distance, r, to get from memory to CPU.
—To get 1 data element per cycle, this means 10^12 round trips per second at the speed of light, c = 3 x 10^8 m/s. Thus r < c/10^12 = 0.3 mm.
• Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area:
—Each bit occupies about 1 square Angstrom, or the size of a small atom.
• No choice but parallelism

Why is Parallel Computing Hard?

Impediments to Parallel Computing
• Algorithm development is harder
—complexity of specifying and coordinating concurrent activities
• Software development is much harder
—lack of standardized, effective development tools, programming models, and environments
• Rapid pace of change in computer system architecture
—today's hot parallel algorithm may not be suitable for tomorrow's parallel computer

Principles of Parallel Computing
• Finding enough parallelism (Amdahl's Law)
• Granularity
• Locality
• Load balance
• Coordination and synchronization
• Performance modeling
All of these things make parallel programming even harder than sequential programming.

"Automatic" Parallelism in Modern Machines
• Bit-level parallelism
—within floating point operations, etc.
• Instruction-level parallelism (ILP)
—multiple instructions execute per clock cycle
• Memory system parallelism
—overlap of memory operations with computation
• OS parallelism
—multiple jobs run in parallel on commodity SMPs
There are limits to all of these; for very high performance, the user must identify, schedule and coordinate parallel tasks.

Amdahl's Law
(figure)

Illustration of Amdahl's Law
(figure)
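The two Amdahl's Law slides above were figure-only in the original deck. As a reminder of what they illustrate, here is a minimal C sketch (added here, not from the slides): if a fraction f of the work can be parallelized, the speedup on p processors is bounded by 1 / ((1 - f) + f/p), so the serial fraction caps the achievable speedup at 1/(1 - f) no matter how many processors are used. The value f = 0.95 below is purely illustrative.

    /* Minimal illustration of Amdahl's Law. */
    #include <stdio.h>

    /* Upper bound on speedup when a fraction f of the work is parallelizable */
    static double amdahl_speedup(double f, int p) {
        return 1.0 / ((1.0 - f) + f / (double)p);
    }

    int main(void) {
        const double f = 0.95;                    /* illustrative serial/parallel split */
        const int procs[] = {1, 2, 4, 16, 64, 256};
        const int n = sizeof procs / sizeof procs[0];
        for (int i = 0; i < n; i++)
            printf("p = %3d   speedup <= %6.2f\n", procs[i], amdahl_speedup(f, procs[i]));
        /* As p grows, the bound approaches 1/(1 - f) = 20 for f = 0.95 */
        return 0;
    }

With f = 0.95, even 256 processors give a speedup of at most about 18.6 against the asymptotic limit of 20, which is the point the illustration makes.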
Overhead of Parallelism
• Given enough parallel work, this is the biggest barrier to getting the desired speedup
• Parallelism overheads include:
—cost of starting a thread or process
—cost of communicating shared data
—cost of synchronizing
—extra (redundant) computation
• Each of these can be in the range of milliseconds (= millions of flops) on some systems
• Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work

Load Imbalance
• Load imbalance is the time that some processors in the system are idle due to
—insufficient parallelism (during that phase)
—unequal size tasks
• Examples of the latter
—adapting to "interesting parts of a domain"
—tree-structured computations
—fundamentally unstructured problems
• Algorithm needs to balance load

Reading List for Thursday and Homework 1
• Reading list for next lecture
—Sections 2.1, 2.2, 2.3 and 2.4 of the textbook
• Homework 1 (due Jan 15, 2008)
—Apply for an account on the Ada cluster, if you don't already have one
– Go to https://rcsg.rice.edu/apply
– Click on "Apply for a class user account"
—Send email to the TA (rajbarik@rice.edu) with
– Your userid on Ada
– Your preference on whether to do assignments individually or in two-person teams (in which case you should also include your team partner's name)
– A ranking of C, Fortran, and Java as your language of choice for programming assignments. This is for planning purposes; we cannot guarantee that your top choice will suffice for all programming assignments.