Chapter 1  A Glance at High Performance Computing (HPC)

1.1 What is High Performance Computing (HPC)?

High Performance Computing, or HPC for short, is an area encompassing, among others, the various paradigms of parallel programming, their related programming languages and application programming interfaces (APIs), the dedicated software tools, the international specialized conferences (ACM/IEEE Super-Computing, or SC for short), etc. Loosely speaking, HPC is both the scientific and the technical field of study of "Super-Computers" (SCs).

The list of the top 500 supercomputers (called the Top500, http://www.top500.org/) is regularly updated and available on the Internet. In 2014, it was the supercomputer named Tianhe-2 (translated from Chinese as Milky Way-2 in English) of the National Super Computer Center in Guangzhou (China) that held the top rank position. This supercomputer consists of an impressive 3.12 million cores (a core being a processing unit, PU), and delivers an amazing overall performance of 54.9 Petaflops, where a PFlops is 10^15 floating-point operations per second. This number one supercomputer requires 17.8 MW of electrical power to work properly. Roughly speaking, the cost of 1 MW is about 100 dollars per hour, meaning an electricity bill of about 1 million dollars per year for each megawatt consumed.

Table 1.1 summarizes the main scale orders used when referring to the processing power and memory storage capabilities of (super-)computers. Nowadays, the international community is racing to reach the Exaflop performance (10^18 FLOPS, 1024 PFlops), which we hope to have ready for 2017-2020; then will come the era of zettaFlops (10^21), maybe around 2030 (?), and thereafter the yottaFlops (10^24), and so on.

Table 1.1  Scale orders characterizing the overall computing and memory performance of supercomputers. Supercomputers are rated in FLOPS (floating-point operations per second), and memory sizes in bytes (one byte being eight packed bits).

  Unit         Scale     Computing performance          Memory size (bytes)
  K (kilo)     10^3      KFLOPS                         KB
  M (mega)     10^6      MFLOPS                         MB
  G (giga)     10^9      GFLOPS                         GB
  T (tera)     10^12     TFLOPS                         TB
  P (peta)     10^15     PFLOPS                         PB
  E (exa)      10^18     EFLOPS (around 2017-2020)      EB
  Z (zetta)    10^21     ZFLOPS                         ZB
  Y (yotta)    10^24     YFLOPS                         YB
  ...          ...       ...                            ...
  googol       10^100    googolFLOPS                    googol bytes

When we choose 1024 = 2^10 (a power of two) instead of 1000 = 10^3 as the gap between two consecutive orders, we get the following binary units: Ki (2^10), Mi (2^20), Gi (2^30), Ti (2^40), Pi (2^50), Ei (2^60), Zi (2^70), and Yi (2^80).

This traditional rating of supercomputers is based only on the peak processing performance of arithmetic operations, and completely ignores the energy cost required to obtain this performance. Another twice-yearly, eco-friendly ranking, the Green500 (http://www.green500.org/), instead scores supercomputers according to their performance in MFlops/W. In November 2014, the L-CSC supercomputer from the GSI Helmholtz Center (Darmstadt, Germany) achieved a performance of 5.27 gigaflops per Watt.
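As a back-of-the-envelope check, the FLOPS-per-Watt metric used by the Green500 can be computed directly from the peak-performance and power figures quoted above for Tianhe-2; the following minimal C++ sketch (an illustration, not code from the book's sources) does just that:

#include <cstdio>

int main() {
    // Peak performance and power draw of Tianhe-2 as quoted in the text.
    const double peak_flops  = 54.9e15;  // 54.9 PFlops = 54.9 x 10^15 FLOPS
    const double power_watts = 17.8e6;   // 17.8 MW = 17.8 x 10^6 W

    // Energy efficiency in GFlops per Watt (the Green500 metric).
    const double gflops_per_watt = (peak_flops / power_watts) / 1e9;
    std::printf("Tianhe-2: %.2f GFlops/W\n", gflops_per_watt);
    return 0;
}

This yields roughly 3.1 GFlops/W for the number-one machine, noticeably below the 5.27 GFlops/W reported above for L-CSC.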
For comparison, this L-CSC supercomputer ranks 168 in the Top500, with 10,976 cores, and has a 593.6 TFlops peak performance. Although processing power is definitely an important criterion (and one that motivates us to implement HPC solutions), one also needs to take other parameters into account, like the overall memory size, the bandwidth of the interconnection network, and so on. Finally, let us point out that the cost per GFLOPS decreases exponentially, and was estimated at about 0.08 US dollars in January 2015.

1.2 Why Do We Need HPC?

The first answer that comes to mind is that HPC is useful for being faster, and thus more precise overall (say, for simulation purposes like weather forecasting, numerical mechanics for car crash tests, or the modeling of various other complex phenomena). HPC also allows one to solve larger problems: simulations can be carried out either on more finely-meshed domains or on bigger data sets (the Big Data trend). But what is less widely known is that HPC is also very useful for saving energy: indeed, at constant FLOPS performance, we prefer to use many lower-profile processors that consume less overall than high-profile processors, which require more energy to work. Finally, HPC is well-suited to some kinds of algorithms that are intrinsically parallel in essence. Indeed, algorithms in image or video processing often compute filters, that is, operations that can be carried out independently, in parallel, for each pixel or voxel (in medical imaging); a short code sketch of such a per-pixel operation is given at the end of Sect. 1.3. In the latter case, graphics cards (Graphics Processing Units, GPUs) are hardware cards that have many cores (nowadays, a few thousand). For example, a high-end NVIDIA® card has 1536 cores and delivers a performance of 2.3 TFlops, while the AMD® Radeon Sky 900 GPU has 3584 cores and reaches 1.5 TFlops (in double precision). These GPU cards not only allow one to get stunning images, they are also useful for general computing: this is the GPGPU (General Purpose Graphics Processing Unit) paradigm.

Let us now mention more precisely several case studies of supercomputing:

• Use models to perform simulations. Otherwise, it is too difficult to build (blower or wind tunnel), too expensive to build (car/airplane crash tests), too slow on usual computers (evolution of climate or galaxies), or too dangerous in practice (nuclear weapons, drugs, pollution, epidemics).
• Get results fast, and ideally without delay. That is, allow incremental results delivered by on-line algorithms: some results have a time-limited value, like weather conditions, which are only useful to predict ahead of time. Tomorrow's weather is only useful to know if we can predict it before tomorrow! Similarly, it is interesting to be the first to get the results in order to take decisions as soon as possible (as in the stock market, where high-frequency trading algorithms automatically book orders).
• Process big data sets, like genome analytics on whole families of genomes, or even the search for extra-terrestrial intelligence (see the SETI project, http://www.seti.org/).

1.3 Big Data: The Four Vs (Volume, Variety, Velocity, Value)

Big Data is a buzzword that has been used extensively in the media and encompasses many meanings. We can define it as the technologies for processing massive data sets, or for performing large-scale data processing. We can characterize the processing of big data sets using the four Vs:

• Volume,
• Variety (heterogeneous data),
• Velocity (data obtained continuously from sensors),
• Value (not simulation but valuable insights).
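To make the per-pixel parallelism of Sect. 1.2 concrete, here is a minimal illustrative C++ sketch of a brightness filter on a grayscale image stored as a flat array (the image layout and the optional OpenMP pragma are illustrative choices, not code from this chapter): every pixel is updated independently of the others, so the loop iterations can be distributed over cores or GPU threads.

#include <vector>
#include <algorithm>
#include <cstddef>

// Scale the brightness of a grayscale image stored as a flat vector of
// pixel intensities. Each pixel is updated independently of all others,
// so the loop is embarrassingly parallel.
void brighten(std::vector<unsigned char>& pixels, double factor) {
#pragma omp parallel for  // optional: multi-thread the loop on a shared-memory machine
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(pixels.size()); ++i) {
        double v = pixels[i] * factor;
        pixels[i] = static_cast<unsigned char>(std::min(v, 255.0));
    }
}

Compiling with g++ -fopenmp enables the multi-threaded version (the pragma is simply ignored otherwise); GPGPU implementations of such filters follow the same per-pixel pattern.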
1.4 Parallel Programming Paradigms: MPI and MapReduce

There are several paradigms for programming parallel algorithms on large/big data-sets (a sub-area of HPC). Those models differ in whether they are robust to computer or network errors (like hardware failures). Nowadays, the two main programming models, which are complementary to each other, are:

• programming with MPI (Message Passing Interface), which has zero tolerance for hardware or network errors but offers programmers a flexible framework, and
• programming with MapReduce (or its free open source version, Hadoop, http://hadoop.apache.org/), which includes an infrastructure tolerant to errors and hardware failures, but offers a rather limited model of parallel computing compared with MPI.

1.5 Granularity: Fine-Grained Versus Coarse-Grained Parallelism

One can design and program parallel algorithms using various degrees of granularity of parallelism. Informally speaking, the granularity is the proportion of the code that can be parallelized. Granularity can also be explained as the ratio of computing time to communication time in a parallel algorithm. We classify parallel algorithms into three main categories according to their granularity:

• Fine-grained parallelism: at the variable level, inside the same task. Data are often transferred between the computing units. It is worth noting that the instruction set of common microprocessors based on the x86 architecture (released in 1978) has been extended with various extension sets like MMX, SSE or SSE2, etc. Many of these extensions provide SIMD instructions (Streaming SIMD Extensions, http://fr.wikipedia.org/wiki/Streaming_SIMD_Extensions). Fine-grained parallelism can also rely on GPU code snippets run on the graphics card.
• Mid-grained parallelism: at the level of the tasks of a same program, using threads.
• Coarse-grained parallelism: data transfers are limited, and occur after big chunks of calculations. Coarse-grained parallelism can also be achieved at the application level using the task scheduler, which handles the tasks to be executed on the computer cluster (the "parallel machine").

1.6 Architectures for Supercomputing: Memory and Network

We classify parallel computers either as parallel machines with shared memory or as parallel machines with distributed memory. Usually, on a multi-core processor, all cores use the same (shared) memory banks: this architecture, called symmetric shared memory multiprocessor (SMP), considers all cores as independent computing units. Even when we consider this shared memory model of parallel computation, we need to take into account the various types of memory: the fast registers located inside the processor and the L1, L2 and L3 cache memories ('L' for layer, emphasizing the hierarchical memory structure), hard-disk drives (disk arrays), Solid-State Drives (SSDs), magnetic tapes for backup, etc.

In practice, in order to obtain good performance, one needs to take into account the spatial locality of data when accessing memory. For example, let us compare this C/C++/Java nested loop code:

for (int j = 0; j < n; ++j)
    for (int i = 0; i < n; ++i)
        y[i] += a[i][j] * x[j];

with this other code:

for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
        y[i] += a[i][j] * x[j];

In theory, those two programs have the same complexity: quadratic time, in O(n^2). However, in practice, compilers carry out various optimizations (like caching variables, pipelining instructions, or prefetching) for fast memory access (see the -O option of g++; other options are listed with g++ --help).
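The two snippets above are fragments; a self-contained C++ sketch that times both loop orders on a row-major matrix is given below (the chosen matrix size is an illustrative assumption; compile it, for instance, with and without g++ -O to also observe the effect of compiler optimization).

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    // n = 10,000 as in the experiment reported below needs about 800 MB for the
    // double-precision matrix; n = 4,000 (about 128 MB) keeps the sketch lightweight.
    const int n = 4000;
    std::vector<double> a(static_cast<std::size_t>(n) * n, 1.0); // row-major: a[i*n + j]
    std::vector<double> x(n, 1.0), y(n, 0.0);

    using clock = std::chrono::steady_clock;

    // Variant 1: j outer, i inner -- consecutive accesses jump from row to row
    // (stride n in memory), giving poor spatial locality.
    auto t0 = clock::now();
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            y[i] += a[static_cast<std::size_t>(i) * n + j] * x[j];
    auto t1 = clock::now();

    // Variant 2: i outer, j inner -- each row is scanned contiguously (cache-friendly).
    std::fill(y.begin(), y.end(), 0.0);
    auto t2 = clock::now();
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            y[i] += a[static_cast<std::size_t>(i) * n + j] * x[j];
    auto t3 = clock::now();

    std::printf("j-outer: %.3f s, i-outer: %.3f s\n",
                std::chrono::duration<double>(t1 - t0).count(),
                std::chrono::duration<double>(t3 - t2).count());
    return 0;
}

On typical hardware the i-outer variant is noticeably faster, because it walks the matrix in memory order.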
To report a concrete example of the time difference between these two code snippets, we measured 1.45 s for n = 10,000 with the non-optimized code, and 0.275 s for the optimized binary executable obtained after compiler optimization using the g++ -O command.

On the opposite side of shared memory architectures, we have parallel computers with distributed memory, which consist of independent computers linked by an interconnection network. The chosen topology for the interconnection network affects the efficiency of communications, but also implies a corresponding hardware cost. Figure 1.1 displays common topologies for the interconnection network. We distinguish, for example, between the bus interconnection and the complete graph interconnection network. Messages can be exchanged point-to-point (for example, using the bus or the complete graph) or by being routed through intermediate nodes.

Fig. 1.1  Common topologies met in practice for the interconnection network of a distributed-memory computer cluster: bus, ring, star, complete graph, grid, torus, hypercube, tree, and fat tree. Communication links can be either one-way or bi-directional.

Figure 1.2 depicts schematically the evolution of computer architectures over the last decades: as manufacturing processes like lithography have improved considerably, what was formerly a small network of 4 interconnected computers was first put on a single computer, as a quad-processor machine using one socket per processor to connect the processors to the motherboard; more recently, we have quad-core CPUs attached by a single socket to the motherboard, for communication efficiency. However, this on-chip reduction is still limited to a few cores and cannot (yet?) be scaled.

Fig. 1.2  Evolution of computer architectures: from small interconnected computers (4 computers interconnected with a network), to multi-processor computers (a quad-processor on a single board), and recently to multi-core computers (a quad-core processor).

Distributed memory architectures have a local memory associated with each processor: there is no shared memory in this model. Memory access to another process is explicitly carried out by exchanging messages on the network. Figure 1.3 illustrates this model of parallel machine, which is the one we focus on in this book.

Fig. 1.3  Architecture of a parallel machine with distributed memory: each node has a local processor and a local memory, and nodes communicate with each other over the interconnection network by message passing (that is, by sending and receiving messages using the standard API: MPI).

It is the interconnection network that determines the speed of data access. In general, three features of the network are considered:

• latency: the time required to initiate a communication,
• bandwidth: the rate of data transfer over communication links,
• topology: the physical architecture of the interconnection network (like the star topology or the grid topology).
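As a rough illustration of how latency and bandwidth interact, a standard first-order model (not a formula taken from this chapter) approximates the time to send a message of m bytes as the latency plus m divided by the bandwidth; the latency and bandwidth values below are illustrative assumptions only.

#include <cstdio>
#include <initializer_list>

// First-order model of point-to-point communication time:
//   time(m) = latency + m / bandwidth
double comm_time(double m_bytes, double latency_s, double bandwidth_bytes_per_s) {
    return latency_s + m_bytes / bandwidth_bytes_per_s;
}

int main() {
    const double latency   = 1e-6;        // assumed: 1 microsecond start-up cost
    const double bandwidth = 10e9 / 8.0;  // assumed: 10 Gbit/s link, in bytes per second
    for (double m : {1e2, 1e5, 1e8}) {    // 100 B, 100 KB, 100 MB messages
        std::printf("m = %.0e bytes -> %.6f s\n", m, comm_time(m, latency, bandwidth));
    }
    return 0;
}

For short messages the latency term dominates, while for large messages the transfer time is governed by the bandwidth, which is why aggregating many small messages into fewer large ones is often beneficial.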
There exist several models of parallel programming. On vector supercomputers (like the Cray machines), one can use the paradigm of Single Instruction Multiple Data (or SIMD for short). On multi-core computers with shared memory, one can use multi-threading and its standardized application programming interface called OpenMP (Open Multi-Platform shared-memory parallel programming, http://openmp.org/). We can also consider hybrid models of computation, which use the GPU for some of the computations; GPUs are controlled using several possible APIs like CUDA, OpenCL or HMPP. Finally, we can use a blended style, where each MPI process can use several threads, and both blended and hybrid styles, where each multi-core processor node is programmed using MPI and GPU interfaces. Figure 1.4 displays the architecture of a computer cluster. For the sake of simplicity, we assume in this textbook that each process runs on its own node, on a single processing unit (CPU or core).

Fig. 1.4  Architecture of a computer cluster: computing resources are located in nodes that are interconnected using a network. Computers attached to the cluster (the parallel machine) can either be simple computers (a CPU), multi-processor machines (several CPUs with sockets on the motherboard), or modern multi-core CPUs with several graphics cards (GPUs). In theory, for the sake of simplicity, we assume that each process is run on a distinct node, on a single processing unit (CPU or core).

1.7 Speedup

Let t_seq denote the time spent by a sequential program (also called a serial program), and t_P the time spent by an equivalent parallel program executed on P processors. We denote by t_1 the execution time of the parallel program run on P = 1 processor (for the sake of simplicity, we assume that each process runs on a distinct processor, which is a node of the distributed-memory parallel machine). We define the following three quantities:

• speedup: speedup(P) = t_seq / t_P. Often, we have t_seq ≈ t_1, so that speedup(P) ≈ t_1 / t_P.
• efficiency: e_P = speedup(P) / P = t_seq / (P × t_P). A low efficiency means that there is a large parallel overhead (in contrast with a linear speedup). An optimal linear speedup implies a maximal efficiency of one, and vice versa.
• scalability: scalability(O, P) = t_O / t_P, with O < P.

1.7.1 Scalability and Iso-efficiency Analysis

Most of the time, one is interested in characterizing how parallel algorithms scale in practice with the number of available processors P. Indeed, one would like to use resources on the fly as more nodes of a computer cluster are dynamically allocated to one's project (for instance, once other project runs on the cluster have been completed). A scalable parallel algorithm is an algorithm that is easy to execute on P processors for any arbitrary value of P. For a given problem size n, when we increase the number of processors P, the efficiency usually tends to decrease. Thus, in order to preserve a good speedup when P increases, it is also important to increase the data size n of the input: that is, the parameters n (data size) and P (number of processors) are correlated. This is precisely the point that iso-efficiency analysis focuses on. The key question in iso-efficiency analysis is how to decide the growth rate ρ of the input size, as a function of the number of processors, that keeps the efficiency constant. The smaller the ρ value, the better. Given a problem, one tries to design good iso-efficient algorithms.
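As a minimal illustration of the three quantities defined at the beginning of this section, the following C++ sketch computes them from measured wall-clock times (the timings in main are purely illustrative placeholders, not measurements from this book):

#include <cstdio>

// Speedup, efficiency and scalability as defined above.
double speedup(double t_seq, double t_p)               { return t_seq / t_p; }
double efficiency(double t_seq, double t_p, int p)     { return speedup(t_seq, t_p) / p; }
double scalability(double t_o, double t_p)             { return t_o / t_p; }  // run on O < P processors

int main() {
    // Illustrative timings: 100 s sequentially, 14 s on 8 processors.
    const double t_seq = 100.0, t_8 = 14.0;
    std::printf("speedup(8) = %.2f, efficiency(8) = %.2f\n",
                speedup(t_seq, t_8), efficiency(t_seq, t_8, 8));
    return 0;
}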
1.7.2 Amdahl's Law: Characterizing the Asymptotic Speedup for Fixed Data-Size Problems

Gene M. Amdahl (IBM) first characterized, in 1967, the ideal expected performance speedup [1] as follows: let α_par denote the proportion of the code that can be parallelized, and α_seq the proportion of code that is intrinsically sequential (that is, that cannot be parallelized). We have α_par + α_seq = 1. Now, we can mathematically write the time t_P of a parallel implementation using P processors as

    t_P = α_seq t_1 + (1 − α_seq) t_1 / P = (α_par / P + α_seq) t_1.

Assuming t_seq = t_1, we deduce that the speedup of a parallel implementation using P computers is

    speedup(P) = t_1 / t_P = (α_par + α_seq) t_1 / ((α_par / P + α_seq) t_1) = 1 / (α_seq + α_par / P).

Thus, as the number of processing elements tends to infinity (P → ∞), the speedup is always upper bounded by the inverse of the proportion α_seq = 1 − α_par of non-parallelizable code:

    lim_{P → ∞} speedup(P) = 1 / α_seq = 1 / (1 − α_par).

Figure 1.5 plots the graphs of the functions speedup(P) for 0 ≤ α_seq ≤ 1, using a logarithmic scale on the x-axis. Amdahl's law reflects the sequential bottleneck in parallelism.

Fig. 1.5  Amdahl's law characterizes the speedup as a function of the proportion of parallelizable code (logarithmic scale on the x-axis). Observe that the maximum speedup is always upper-bounded by 1/α_seq, where α_seq denotes the proportion of serial code.

Theorem 1  Amdahl's law gives the optimal asymptotic speedup of a parallel program as speedup = 1/α_seq = 1/(1 − α_par), where α_par denotes the proportion of the program that can be parallelized and α_seq = 1 − α_par the proportion of code that is intrinsically sequential.

The Gnuplot code (Gnuplot is a free graphing utility that can be downloaded from http://www.gnuplot.info/) for plotting the theoretical curves of Fig. 1.5 is given below:

WWW source code: Amdahl.gnuplot

set terminal png
set output 'Amdahl.png'
set encoding iso_8859_1
set logscale x 2
set xrange [1:65536]
set autoscale
set xlabel "number of processors (P)"
set ylabel "speed-up"
set key on right bottom
set pointsize 2
Amdahl(p,s) = 1/(s + ((1-s)/p))
set grid
show grid
plot Amdahl(x,1-0.75) title "0.75" lt -1 lw 3,\
     Amdahl(x,1-0.90) title "0.9" lt -1 lw 5,\
     Amdahl(x,1-0.95) title "0.95" lt -1 lw 7

Fig. 1.6  Amdahl's law considers the speedup for a fixed data size and provides an upper bound on the maximal speedup as the inverse of the proportion of sequential code. Here, for P = 1, 2, 4, 8 and P → ∞, the speedups are S = 1, 5/3, 5/2, 10/3 and 5, respectively: the maximal speedup is asymptotically ×5.

Note that Amdahl's law makes the assumption that the input data size is fixed (this is not a strong assumption, since there are many programs with constant input sizes). In other words, the workload is prescribed once and for all. It states that no matter how many processors P you have, the theoretical maximal speedup will be asymptotically upper bounded by the inverse of the proportion of serial code: speedup = 1/α_seq. This is illustrated in Fig. 1.6. Observe also that the performance-to-price ratio falls quickly as P increases.

Although the speedup is theoretically upper bounded by P, one can sometimes obtain better speedups in practice! That is, one can sometimes observe super-linear speedups. This may seem surprising at first, but the phenomenon is simply explained by the fact that, in practice, there is a hierarchy of memories.
For example, when one uses P computers in parallel for processing n data, we implicitly assume that each computer can store n/P data in its local memory (the RAM). However, a single computer needs to use hard disks to store the remaining n − n/P data that it cannot keep in its RAM, and disk access is far slower than RAM access. Thus super-linear speedups can be observed for large data sets, simply because all the data can then be stored in the local memories (RAM) of the processors.

1.7.3 Gustafson's Law: Scaled Speedup, Increasing Data Size with Resources

Gustafson's law [2] characterizes the impact of data parallelism, challenging Amdahl's law by offering a novel viewpoint. In this setting, we no longer assume that the data size is constant, but rather that it depends on the number of processors P (the available resources). This framework is therefore different from Amdahl's law, which considers a fixed input (with an incompressible running time, the critical time, due to the intrinsically serial code). Indeed, Gustafson observed that, in practice, program users tend to set the data size according to the available resources in order to get a reasonable overall execution time (indeed, we are not interested in running a program on a data set that may take 100 years to complete). Setting the data size according to the computing resources is the case, for example, in video processing or medical imaging, where people tend first to rescale large image data down so that it fits into memory and yields reasonable expected running times. Similarly, in simulations the data size depends on the chosen grid resolution. As faster (multi-core) machines become available to users, the users pour more data into their programs. In this scenario, data sizes are considered to be arbitrarily large (for example, think of 4K videos, 8K videos, etc.).

Again, let α_par denote the proportion of parallelizable code and α_seq = 1 − α_par the proportion of sequential code on all the P processors. Then the speedup is the ratio of the sequential processing time t_seq + P × t_par (corresponding to a workload α_seq n + α_par n P) to the time t_seq + t_par achieved by the parallel algorithm:

    speedup_Gustafson(P) = (t_seq + P × t_par) / (t_seq + t_par).

Since t_seq / (t_seq + t_par) = α_seq by definition, and t_par / (t_seq + t_par) = α_par, we deduce that the speedup is given by

    speedup_Gustafson(P) = α_seq + P × α_par = α_seq + (1 − α_seq) P.

Theorem 2  Gustafson's law states that the optimal speedup is asymptotically speedup(P) = P × α_par, where α_par > 0 denotes the proportion of the code that is parallelizable.

When the problem size increases for a fixed serial portion, the speedup grows as more processors are added. Therefore Gustafson called his speedup analysis scaled speedup. We observe that the workload is scaled up with the number of processors in order to maintain a fixed execution time. Thus Gustafson's philosophy is that the true power of a parallel system is demonstrated when large enough data problems are given as input, as depicted in Fig. 1.7.

Fig. 1.7  Gustafson's speedup law considers the data size being scaled up with the number of processors (or, equivalently, assumes a constant parallel run time): as P grows from 1 to 2, 4 and 8, the data size grows from n to 2n, 4n and 8n.
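To contrast the two laws numerically, the following minimal C++ sketch evaluates both formulas for the same proportion of parallelizable code (the value α_par = 0.9 is an arbitrary illustrative choice):

#include <cstdio>
#include <initializer_list>

// Amdahl's speedup for a fixed workload: 1 / (alpha_seq + alpha_par / P).
double amdahl(double alpha_par, double p)    { return 1.0 / ((1.0 - alpha_par) + alpha_par / p); }

// Gustafson's scaled speedup: alpha_seq + alpha_par * P.
double gustafson(double alpha_par, double p) { return (1.0 - alpha_par) + alpha_par * p; }

int main() {
    const double alpha_par = 0.9;  // illustrative: 90% of the code is parallelizable
    for (double p : {2.0, 8.0, 64.0, 1024.0}) {
        std::printf("P = %6.0f  Amdahl = %6.2f  Gustafson = %8.1f\n",
                    p, amdahl(alpha_par, p), gustafson(alpha_par, p));
    }
    return 0;
}

Amdahl's speedup saturates below 1/α_seq = 10 no matter how many processors are used, whereas Gustafson's scaled speedup keeps growing linearly with P, because the workload grows with the resources.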
1.7.4 Simulating Parallel Machines on Sequential Computers

One can simulate any parallel algorithm on a classic computer by sequentially executing the elementary instructions or pieces of code of the P processors P_1, ..., P_P, one instruction at a time. A portion of code is defined between two synchronization barriers, which are blocking communication primitives. Thus a parallel algorithm always yields a sequential algorithm, but not necessarily the other way around: a parallel algorithm needs to be designed, often from scratch, and one has to think purposely in order to design efficient parallel algorithms.

In theory, several conclusions can be drawn from being able to simulate a parallel program on a sequential computer. First, the maximal speedup is in O(P), where P denotes the number of processors. Second, it yields a lower bound for parallel problems, since the time complexity to solve a problem in parallel is at best in Ω(c_seq / P), where Ω(c_seq) denotes the sequential lower bound. Last, we note that we necessarily have t_seq ≤ t_1: the time of a parallel algorithm executed on a single processor is necessarily more costly than the time of an equivalent sequential program. (Otherwise, this single-node parallel algorithm would be better, and could replace the sequential algorithm.)

1.7.5 Big Data and Parallel Inputs/Outputs (I/O)

To process Big Data, one has to read and save huge files, or an extremely large number of small files. These operations are called Inputs/Outputs, or I/O for short. On a distributed memory architecture, this can be managed by explicitly programmed parallel I/O (say, using the multi-collective MPI-IO, available since MPI-2), or locally by explicit I/O on every node. Fortunately, in order to avoid this time-consuming programming task, there exist several parallel file systems that handle the complexity of those various I/O operations. For example, supercomputers may use the free software Lustre (http://lustre.opensfs.org/), or IBM's GPFS (General Parallel File System, http://www-03.ibm.com/systems/platformcomputing/products/gpfs/) for data-set sizes exceeding the petabyte (10^15 bytes). MapReduce, the other popular system used for processing Big Data, relies on the Google file system called GFS (or HDFS, for Hadoop Distributed File System).

1.8 Eight Common Fallacies on Distributed Systems

One of the very first large-scale distributed systems was the ARPANET network (1969), which yielded the Internet. Another world-wide distributed system is the SWIFT money transfer protocol (SWIFT stands for Society for Worldwide Interbank Financial Telecommunication, http://www.swift.com/about_swift/company_information/swift_history). Even after gaining half a century of experience with distributed systems, it is still difficult nowadays to design and scale large distributed systems. There are many reasons for this, but Peter Deutsch (SUN, 1994) and James Gosling (SUN, 1997) have listed the following eight common fallacies:

1. the network is reliable,
2. the latency is zero,
3. the bandwidth is infinite (or at least large enough),
4. the network is safe,
5. the network topology does not change (meaning it is static over time),
6. there is only one network administrator,
7. the transport cost is free, and
8. the network is homogeneous.

We shall now briefly explain what those fallacies are, and why these assumptions do not hold in practice.

Fallacy 1: The network is reliable. Many critical applications need to be available at any time of the day, and thus should always function properly, with a zero-tolerance policy towards hardware/software breakdowns. However, when, say, a switch router breaks down, there can be an unexpected cascading chain reaction, which can cause an unpredictable catastrophe (think of a nuclear power plant getting out of control).
In order to avoid this scenario (or at least to minimize its risk), one has to introduce redundancy, both in the hardware and in the software. In practice, the risks are evaluated in terms of investment (or, equivalently, in terms of redundancy).

Fallacy 2: The latency is zero. Although latency in LANs (Local Area Networks) is good enough, it becomes worse in WANs (Wide Area Networks). In the past eleven years, network bandwidth has increased by a factor of 1500, but latency has only been reduced by a factor of 10. Latency is intrinsically bounded from below by the speed of light in optical fibers. Indeed, consider the time required to ping, that is, to go to and come back (rtt, the round-trip time) between two antipodal points on the Earth: this requires 0.2 s (20,000 km × 2 × 5 µs/km = 0.2 s, ignoring the time spent in computing), and 40 ms between New York and Los Angeles. Thus one has to be extremely careful with latency problems when deploying applications worldwide, beyond LANs.

Fallacy 3: The bandwidth is infinite. Although bandwidth is steadily increasing over the years, the data size of content is growing too (like 4K video nowadays, and soon 8K video). Also, routing algorithms have to deal in practice with congestion problems and packet loss in WANs, which require data to be resent (and hence increase the network traffic).

Fallacy 4: The network is safe. Well, nobody is fool enough to believe this anymore. Even worse, hacker attacks are increasing exponentially (sometimes, international companies have to face hundreds of attacks per week). Thus one has to plan for backing up systems and data, to design emergency and recovery procedures, and so on.

Fallacy 5: The network topology is static. Although at first one can assume a static topology when developing an application, in practice one clearly does not control the topology of WANs; even worse, the Internet topology is constantly evolving. This means that deployed applications should always consider the network topology to be dynamic, and should be made robust and adaptive to these various kinds of topologies.

Fallacy 6: There is only one network administrator. In the best-case scenario, even if we assume that the network administrator is never ill and can reply instantly to the many user requests, network administration is nowadays far too complex to be mastered by a single expert. Indeed, the variety of architectures and software to take into account requires many competencies that can only be covered by a group of administrators. In other words, each administrator is nowadays necessarily specialized in their own field of expertise, and no single person can cover the full spectrum of knowledge needed to administrate a large-scale IT infrastructure.

Fallacy 7: The transport cost is free. Of course, network cables and other related equipment like routers have a cost. But besides this obvious observation, in practice one has to pay to get a guaranteed Quality of Service (QoS) for network traffic. Also, to go from running an application in single-node mode to having it executed in a distributed fashion on several nodes, one has to transfer the application data over the network, which incurs an implementation cost as well: one also has to implement a procedure for (de)serializing structured data into bits and bytes in order to send and receive them between nodes. This latter point requires an extra software development cost.

Fallacy 8: The network is homogeneous.
In large international companies, employees use many different kinds of computers, operating systems and software solutions, and it then becomes critical to ensure interoperability between platforms and software in order to run the business smoothly.

1.9 Notes and References

This chapter covers the basic notions of High Performance Computing (HPC). The various concepts introduced here can be found in most textbooks dealing with parallelism. Amdahl's law has been revisited and extended to take into account the multicore era, with energy considerations [3, 4]. Both Amdahl's and Gustafson's laws can be generalized using the generic memory-bound model of Sun and Ni [5]. The eight fallacies of distributed systems are detailed in the 2006 paper [6]. With the rise of the Big Data era, one seeks to process data sets that are as big as possible, as fast as possible: Big and Fast Data targets the real-time analysis of data flows (like an incoming flow of tweets) by minimizing the latency with which results are delivered.

1.10 Summary

The main purpose of parallelizing algorithms is to obtain better performance (reduced time, increased output throughput rate, reduced power consumption, etc.). Large-scale (or massive/big) data processing relies on supercomputers, which are ranked according to their performance measured in FLOPS (FLoating-point Operations Per Second): that is, supercomputers are ranked according to the maximum number of elementary floating-point arithmetic operations they can carry out per second (peak performance). In high-performance computing, we distinguish between parallel multi-core computers with shared memory, which are programmed using threads accessing that shared memory, and parallel computers (clusters) with distributed memory. In the latter case, machines are modeled as nodes linked by an interconnection network. On distributed memory architectures, data transfers are explicitly carried out using message passing, via a standard interface named MPI that includes all the basic communication primitives. The efficiency of communication depends on the underlying network topology, the bandwidth, and the latency of the uni-directional/bi-directional links. Amdahl's law rigorously demonstrates that the speedup is upper bounded by the inverse of the proportion of code that cannot be parallelized. Nowadays, we are racing towards the exaflops supercomputer, which we wish to build around 2017–2020.

Notation:

  n                  input size
  P                  number of processors (or processes, nodes)
  t(n)               sequential (or serial) time for data of size n, also denoted t_s(n)
  t_P(n)             parallel time on P processors for data of size n
  t_seq(n)           sequential time for n data
  t_par(n)           parallel time for n data (P implicitly assumed)
  c_P(n)             cost: work performed by all processors, c_P(n) = P × t_P(n)
  t_1(n)             time of the parallel algorithm run on a single node; t_1(n) ≥ t(n), but often t_1(n) ≈ t(n)
  S_P(n)             speedup: S_P(n) = t(n) / t_P(n)
  E_P(n)             efficiency: E_P(n) = t(n) / c_P(n) = S_P(n) / P = t(n) / (P × t_P(n))
  α_par              proportion of parallel code (α_par = 1 − α_seq)
  α_seq              proportion of sequential code (α_seq = 1 − α_par)
  S_A(P)             Amdahl's speedup for a fixed workload: 1 / (α_seq + α_par / P), upper bounded by 1/α_seq
  S_G(P)             Gustafson's scaled speedup: S_G(P) = α_seq + (1 − α_seq) P
  scalability(O, P)  generic scalability: t_O(n) / t_P(n), with O < P

1.11 Exercises

Exercise 1 (Amdahl's law)  A program can be parallelized at 90%. Calculate the asymptotic speedup using Amdahl's law. The serial program runs in 10 hours: what is the critical time that no parallel algorithm will be able to beat? Deduce again the maximal speedup. Same questions when only 1% of the code can be parallelized.
Exercise 2 (Estimating the fraction of parallelizable code for Amdahl's law)  Show that one can estimate the ratio of parallelizable code from the measured speedup S observed when using P processors, via the formula

    α_par = (1/S − 1) / (1/P − 1).

Deduce a formula for the maximal achievable speedup, given the measured speedup S obtained when using P processors.

Exercise 3 (Upper bound for Amdahl's law)  Prove that for an arbitrary number of processors P, the speedup is always upper bounded by 1/α_seq, where α_seq is the relative proportion of serial code.

References

1. Amdahl, G.M.: Validity of the single processor approach to achieving large scale computing capabilities. In: Proceedings of the Spring Joint Computer Conference, AFIPS '67 (Spring), pp. 483–485. ACM, New York (1967)
2. Gustafson, J.L.: Reevaluating Amdahl's law. Commun. ACM 31(5), 532–533 (1988)
3. Hill, M.D., Marty, M.R.: Amdahl's law in the multicore era. Computer 41(7), 33–38 (2008)
4. Woo, D.H., Lee, H.-H.S.: Extending Amdahl's law for energy-efficient computing in the many-core era. Computer 41(12), 24–31 (2008)
5. Hwang, K.: Advanced Computer Architecture: Parallelism, Scalability, Programmability, 1st edn. McGraw-Hill Higher Education, New York (1992)
6. Rotem-Gal-Oz, A.: Fallacies of distributed computing explained (2006); initially discussed by James Gosling and Peter L. Deutsch. See http://www.rgoarchitects.com/Files/fallacies.pdf

Chapter 2  Introduction to MPI: The Message Passing Interface

2.1 MPI for Parallel Programming: Communicating with Messages

Programming parallel algorithms is far more delicate than programming sequential algorithms, and so is debugging parallel programs. Indeed, there exist several abstract models of "parallel machines" (parallel computations, distributed computations) with different kinds of parallel programming paradigms. For example, let us mention:

• vector super-computers, which rely on the programming model called Single Instruction Multiple Data (SIMD), with optimized code based on pipelined operations,
• multi-core machines with shared memory, whose programming model uses multi-threading, with all threads potentially accessing the shared memory; programs can easily crash and are sometimes difficult to debug because of potential conflicts when concurrently accessing a shared resource,
• clusters of computer machines interconnected by a high-speed network, which have distributed memory.

It is precisely this last category of "parallel machines", the clusters of machines, that we focus on in this textbook: namely, the parallel programming paradigm with distributed memory. Each computer executes programs using its own local memory. The executed programs can be the same on all computers or can be different. Cooperation takes place by sending and receiving messages among these interconnected computers in order to accomplish their overall task.

Speaking of the size of these clusters, we can distinguish between:

• small-size to mid-size clusters of computers (say, dozens to hundreds, sometimes thousands, of computer nodes) that communicate with each other by sending and receiving messages, and
• large-size clusters (thousands to hundreds of thousands, sometimes millions, of computers) that execute rather simpler codes targeting Big Data processing.
Usually, these large-size clusters are programmed using the MapReduce/Hadoop programming model.

The Message Passing Interface (or MPI for short) standard is a programming interface: namely, an Application Programming Interface (API) that properly defines the syntax and full semantics of a software library providing standardized basic routines from which complex programs can be built. Thus the MPI interface allows one to code parallel programs that exchange data by sending and receiving messages encapsulating those data. Using an API has the advantage of freeing the programmer from the many details involved in implementing network procedures from scratch, and it allows the ecosystem (academia, industry, programmers) to benefit from the interoperability and portability of source code. It is important to emphasize that the MPI API does not depend on the underlying programming language from which it is used. Thus we can use MPI commands with the most common (sequential) programming languages like C, C++, Java, Fortran, Python, and so on. That is, several language bindings of the MPI API are available.

MPI historically originated from a workshop on distributed memory environments organized in 1991. Nowadays, we use the third version of the standard, MPI-3, whose standardization was completed and published openly in 2012. We shall choose OpenMPI (http://www.open-mpi.org/) to illustrate the programming examples in this book.

Let us emphasize that the MPI interface is the dominant programming interface for parallel algorithms with distributed memory in the HPC community. The strong argument in favor of MPI is the standardization of many (i) global communication routines (like broadcasting, the routine that consists in sending a message to all the other machines) and (ii) primitives for performing global calculations (like computing a cumulative sum of data distributed among all machines, using an aggregation mechanism). In practice, the complexity of these global communication and calculation operations depends on the underlying topology of the interconnection network of the machines of the cluster.

2.2 Parallel Programming Models, Threads and Processes

Modern operating systems are multi-task: from the user's viewpoint, several non-blocking applications seem to be executed ("run") simultaneously. This is merely an illusion, since on a single Central Processing Unit (CPU) only one program instruction can be executed at a given time. In other words, on the CPU there is one current process being executed while the others are blocked (suspended, or waiting to be woken up) and wait their turn to be executed on the CPU. It is the rôle of the task scheduler to dynamically allocate processes to the CPU.

Modern CPUs have several cores, which are independent Processing Units (PUs), and each core can truly execute a thread in parallel. Multi-core architectures yield the multi-threading programming paradigm, which allows for concurrency.
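As a first concrete taste of the MPI API described in Sect. 2.1, here is a minimal "hello world" sketch (compatible with the C and C++ bindings): it only initializes MPI, queries the rank of the current process and the total number of processes, and prints a message. It is given purely as an illustration of the shape of an MPI program.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);                 /* set up the MPI environment          */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* identifier of this process          */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes launched  */

    printf("Hello from process %d out of %d\n", rank, size);

    MPI_Finalize();                         /* clean up before exiting             */
    return 0;
}

With OpenMPI, such a program is typically compiled with mpic++ hello.cpp -o hello (or mpicc for plain C) and launched with mpirun -np 4 ./hello, so that each of the four spawned processes prints its own rank.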
