Lecture Notes on Parallel Computation

Stefan Boeriu, Kai-Ping Wang and John C. Bruch Jr.
Office of Information Technology and Department of Mechanical and Environmental Engineering
University of California, Santa Barbara, CA

CONTENTS

1. INTRODUCTION
   1.1 What is parallel computation?
   1.2 Why use parallel computation?
   1.3 Performance limits of parallel programs
   1.4 Top 500 Supercomputers
2. PARALLEL SYSTEMS
   2.1 Memory Distribution
       2.1.1 Distributed Memory
       2.1.2 Shared Memory
       2.1.3 Hybrid Memory
       2.1.4 Comparison
   2.2 Instruction
       2.2.1 MIMD (Multiple-Instruction Multiple-Data)
       2.2.2 SIMD (Single-Instruction Multiple-Data)
       2.2.3 MISD (Multiple-Instruction Single-Data)
       2.2.4 SISD (Single-Instruction Single-Data)
   2.3 Processes and Granularity
       2.3.1 Fine-grain
       2.3.2 Medium-grain
       2.3.3 Coarse-grain
   2.4 Connection Topology
       2.4.1 Static Interconnects
           Line/Ring
           Mesh
           Torus
           Tree
           Hypercube

2. PARALLEL SYSTEMS

2.1 Memory Distribution

2.1.1 Distributed Memory
• Each processor in a parallel computer has its own memory (local memory); no other processor can access this memory.
• Data can only be shared by message passing.
• Examples: Cray T3E, IBM SP2

2.1.2 Shared Memory
• Global memory that can be accessed by all processors of a parallel computer.
• Data in the global memory can be read/written by any of the processors.
• Examples: Sun HPC, Cray T90

2.1.3 Hybrid (SMP Cluster)
• A distributed-memory parallel system that provides global memory address space management; message passing and data sharing are taken care of by the system.
• Examples: SGI Power Challenge Array

2.1.4 Comparison
• Shared Memory
  o Explicit global data structure
  o Decomposition of work is independent of data layout
  o Communication is implicit
  o Explicit synchronization: need to avoid race conditions and overwriting
• Message Passing
  o Implicit global data structure
  o Decomposition of data determines assignment of work
  o Communication is explicit
  o Synchronization is implicit

2.2 Instruction

Flynn's classification of computer architectures (1966):

2.2.1 MIMD (Multiple-Instruction Multiple-Data)
• All processors in a parallel computer can execute different instructions and operate on different data at the same time.
• Parallelism is achieved by connecting multiple processors together.
• Shared or distributed memory.
• Different programs can be run simultaneously.
• Each processor can perform any operation regardless of what other processors are doing.
• Examples: Cray T90, Cray T3E, IBM SP2

2.2.2 SIMD (Single-Instruction Multiple-Data)
• All processors in a parallel computer execute the same instruction but operate on different data at the same time.
• Only one program can be run at a time.
• Processors run in a synchronous, lockstep fashion.
• Shared or distributed memory.
• Less flexible in expressing parallel algorithms; parallelism is usually exploited on array operations, e.g. in Fortran 90.
• Examples: CM-2, MasPar

2.2.3 MISD (Multiple-Instruction Single-Data)
• Special-purpose computer.

2.2.4 SISD (Single-Instruction Single-Data)
• Serial computer.

2.3 Processes and Granularity

On a parallel computer, user applications are executed as processes, tasks or threads. The traditional definition of a process is a program in execution. To achieve an improvement in speed through the use of parallelism, it is necessary to divide the computation into tasks or processes that can be executed simultaneously. The size of a process can be described by its granularity.

2.3.1 Fine-grain
• In fine granularity, a process might consist of a few instructions, or perhaps even one instruction.

2.3.2 Medium-grain
• Medium granularity describes the middle ground between fine grain and coarse grain.

2.3.3 Coarse-grain
• In coarse granularity, each process contains a large number of sequential instructions and takes a substantial time to execute.

Sometimes granularity is defined as the size of the computation between communication or synchronization points. Generally, we want to increase the granularity to reduce the cost of process creation and interprocess communication, but this will likely reduce the number of concurrent processes and the amount of available parallelism. A suitable compromise has to be made. In general, we would like to design a parallel program in which it is easy to vary the granularity, i.e. a scalable program design.
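As a concrete illustration of this trade-off, the following C sketch (not part of the original notes; the array size, task count and summation problem are illustrative assumptions) contrasts a coarse-grain decomposition, one contiguous block per task, with the fine-grain alternative of one element per task.

    /* Sketch: coarse-grain vs. fine-grain decomposition of a summation.
     * With P tasks and N elements, coarse grain assigns a block of N/P
     * elements per task (few tasks, low creation/communication overhead),
     * while fine grain would treat every single element as its own task
     * (N tiny tasks, maximal parallelism, overhead dominates). */
    #include <stdio.h>

    #define N 1000000
    #define P 8                           /* number of tasks (assumed) */

    static double a[N];

    int main(void) {
        for (int i = 0; i < N; i++)
            a[i] = 1.0;                   /* sample data */

        /* Coarse grain: each p below stands for one task summing N/P elements. */
        double partial[P] = { 0.0 };
        for (int p = 0; p < P; p++) {
            int lo = p * (N / P);
            int hi = (p == P - 1) ? N : lo + N / P;
            for (int i = lo; i < hi; i++)
                partial[p] += a[i];
        }

        double sum = 0.0;
        for (int p = 0; p < P; p++)       /* combine the partial results */
            sum += partial[p];
        printf("sum = %.0f\n", sum);
        return 0;
    }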
2.4 Connection Topology

The best choice would be a fully connected network in which each processor has a direct link to every other processor. Unfortunately, this type of network would be very expensive and difficult to scale. Instead, processors are arranged in some variation of a grid, torus, hypercube, etc. Key issues in network design are the network bandwidth and the network latency. The bandwidth is the number of bits that can be transmitted in unit time, given as bits/sec. The network latency is the time it takes to make a message transfer through the network.

2.4.1 Static Interconnects
• Consist of point-to-point links between processors
• Can make parallel system expansion easy
• Some processors may be "closer" than others
• Examples: Line/Ring, Mesh/Torus, Tree, Hypercube

Line/Ring
o A line consists of a row of processors with connections limited to the adjacent nodes.
o The line can be formed into a ring structure by connecting the free ends.
Fig. 2.4.1.a – Ring

Mesh
o Processors are connected in rows and columns in a two-dimensional mesh.
o Example: Intel Paragon
Fig. 2.4.1.b – 2D Mesh
In a mesh network of dimension D, each nonboundary processor is connected to 2D immediate neighbors (two per dimension). Connections typically consist of two wires, one in each direction.

Torus
This architecture extends the mesh with wraparound connections. The torus is a symmetric topology, whereas a mesh is not. The added wraparound connections help reduce the torus diameter and restore the symmetry.
o one-dimensional torus
o two-dimensional torus
o three-dimensional torus
o Example: Cray T3E
Fig. 2.4.1.c – 2D Torus

Tree
o Binary tree: the first node is called the root; each node has two links connecting it to two nodes below it as the network fans out from the root node. At the first level below the root node there are two nodes, at the next level there are four nodes, and at the j-th level below the root node there are 2^j nodes.
o Fat tree: the number of links is progressively increased toward the root.
  Fig. 2.4.1.d – Fat tree
o Universal fat tree: the number of links between the nodes grows exponentially toward the root, thereby allowing increased traffic toward the root and reducing the communication bottleneck. Examples: Thinking Machines CM-5, Meiko CS-2

Hypercube
• In an n-dimensional hypercube of 2^n processors, each processor connects to n neighbors.
• Examples: iPSC, nCUBE, SGI O2K
Fig. 2.4.1.e – Hypercubes
Hypercubes of dimension zero through four. The processors in the cubes are labeled with integers, here represented as binary numbers. Two processors are neighbors if and only if their binary labels differ in exactly one digit place.
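The neighbor rule in the figure caption can be turned into a short computation: flipping each of the n bits of a node's binary label yields its n neighbors. A minimal C sketch (not from the original notes; the dimension and node label are arbitrary example values):

    /* Sketch: enumerate the neighbors of a node in an n-dimensional hypercube.
     * Two nodes are neighbors iff their binary labels differ in exactly one
     * bit, so flipping each of the n label bits gives the n neighbors. */
    #include <stdio.h>

    int main(void) {
        int n = 3;                              /* hypercube dimension    */
        int node = 5;                           /* label 101 in binary    */
        for (int bit = 0; bit < n; bit++) {
            int neighbor = node ^ (1 << bit);   /* flip exactly one bit   */
            printf("node %d <-> neighbor %d\n", node, neighbor);
        }
        return 0;
    }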
2.4.2 Dynamic Interconnects
• Paths are established as needed between processors
• System expansion is difficult
• Processors are usually equidistant
• Examples: Bus-based, Crossbar, Multistage Networks

Bus-based Networks
• In a bus-based network, processors share a single communication resource: the bus.
• A bus is a highly non-scalable architecture, because only one processor can communicate on the bus at a time.
• Used in shared-memory parallel computers to communicate read and write requests to a shared global memory.
Fig. 2.4.2.a – Bus-based Networks
A bus-based interconnection network, used here to implement a shared-memory parallel computer. Each processor (P) is connected to the bus, which in turn is connected to the global memory. A cache associated with each processor stores recently accessed memory values in an effort to reduce bus traffic.

Crossbar Switching Network
• A crossbar switch avoids competition for bandwidth by using O(N^2) switches to connect N inputs to N outputs.
• Although highly non-scalable, crossbar switches are a popular mechanism for connecting a small number of workstations, typically 20 or fewer.
Fig. 2.4.2.b – Crossbar Network
A 4 x 4 nonblocking crossbar, used here to connect 4 processors. On the right, two switching elements are expanded: the top one is set to pass messages through and the lower one to switch messages. Each processor is depicted twice. Pairs of processors can communicate without preventing other processor pairs from communicating.

Multistage Interconnection Networks
• In a multistage interconnection network (MIN), switching elements are distinct from processors.
• Fewer than O(p^2) switches are used to connect p processors.
• Messages pass through a series of switch stages.
• In a unidirectional MIN, all messages must traverse the same number of wires, so the cost of sending a message is independent of processor location; in effect, all processors are equidistant.
• In a bidirectional MIN, the number of wires traversed depends to some extent on processor location, although to a lesser extent than in a mesh or hypercube.
• Example: IBM SP networks are bidirectional multistage interconnection networks:
  o bidirectional, any-to-any inter-node connection: allows all processors to send messages simultaneously.
  o multistage interconnection: on larger systems (over 80 nodes), additional intermediate switches are added as the system is scaled upward.
Fig. 2.4.2.c – Multistage interconnection network
Shaded circles represent processors and unshaded circles represent crossbar switches.
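The bandwidth and latency figures quoted for the machines in the next section can be combined into the usual first-order estimate of message transfer time, T(m) = latency + m / bandwidth for an m-byte message. A small C sketch of this model (the model itself is standard; the numbers plugged in are the MPI latency and bandwidth listed for the IBM SP2 below):

    /* Sketch: first-order message-cost model T(m) = latency + m / bandwidth,
     * evaluated with the MPI figures quoted for the IBM SP2 in Section 2.5.1
     * (latency 21.7 usec, bandwidth 139 MB/s). */
    #include <stdio.h>

    int main(void) {
        double latency   = 21.7e-6;            /* seconds                 */
        double bandwidth = 139.0e6;            /* bytes per second        */
        double sizes[]   = { 1.0e3, 1.0e6 };   /* 1 KB and 1 MB messages  */
        for (int i = 0; i < 2; i++) {
            double t = latency + sizes[i] / bandwidth;
            printf("%8.0f bytes: %8.1f usec\n", sizes[i], t * 1.0e6);
        }
        return 0;
    }

For small messages the fixed latency dominates, which is one reason the coarser granularity discussed in Section 2.3 usually pays off.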
2.5 Hardware Specifics – Examples

2.5.1 IBM SP2
• Message passing system
• Cluster of workstations
• 200 MHz Power3 CPU
  o Peak 800 MFLOPS
  o 4-16 MB 2nd-level cache
  o Sustained memory bandwidth 1.6 GB/s
• Multistage crossbar switch
• MPI
  o Latency 21.7 usec
  o Bandwidth 139 MB/s
• I/O hardware

2.5.2 IBM PWR3 – SDSC Blue Horizon
• 222 MHz, 888 MFLOPS peak per CPU (1152 CPUs; 144 nodes with 8 CPUs each, SMP)
• 2 pipes, 1 FMA per pipe per clock tick
• MPI & OpenMP programming
• 32 KB L1 cache, 2 MB L2 cache
[Figure: SMP cluster – in each node several CPUs share a bus and memory; the nodes are connected by a network]

2.5.3 Sun HPC
• 400 MHz, 800 MFLOPS peak per CPU (64 CPUs)
• MPI or OpenMP programming
• 16 KB L1 cache, 4 MB L2 cache, 64 GB total main memory
• 2 pipes, 1 FLOP per pipe per cycle
[Figure: CPUs sharing a single bus and memory]

2.5.4 Cray T3E
• Remote memory access system
• Single system image
• 600 MHz DEC Alpha CPU
  o Peak 1200 MFLOPS
  o 96 KB 2nd-level cache
  o Sustained memory bandwidth 600 MB/s
• 3D torus network
• MPI
  o Latency 17 usec
  o Bandwidth 300 MB/s
• Shmem
  o Latency 4 usec
  o Bandwidth 400 MB/s
• SCI-based I/O network

2.5.5 SGI O2K
• ccNUMA system
• Single system image
• 250 MHz MIPS R10000 CPU
  o Peak 500 MFLOPS
  o 2nd-level data cache 4-8 MB
  o Sustained memory bandwidth 670 MB/s
• 4D hypercube
• MPI
  o Latency 16 usec
  o Bandwidth 100 MB/s
• Remote memory access
  o Latency 497 usec
  o Bandwidth 600 MB/s

2.5.6 Cluster of workstations
• Hierarchical architecture: shared memory within a node, message passing across nodes.
• PC-based nodes or workstation-based nodes
• Networks: Myrinet, Scalable Coherent Interface, Gigabit Ethernet

3. PARALLEL PROGRAMMING MODELS

• A parallel computer system should be flexible and easy to use, and should exhibit good programmability in supporting various parallel algorithms.
• Explicit parallelism means that parallelism is explicitly specified in the source code by the programmer, using special language constructs, compiler directives or library function calls.
• If the programmer does not explicitly specify parallelism, but lets the compiler and the run-time support system exploit it automatically, we have implicit parallelism.

3.1 Implicit Parallelism

3.1.1 Parallelizing Compilers
o Automatic parallelization of sequential programs.
o Do not exploit functional parallelism.
o The compiler performs dependence analysis on a sequential program's source code and then, using a suite of program transformation techniques, converts the sequential code into native parallel code.
o Some performance studies indicate, however, that parallelizing compilers are not very effective.

3.2 Explicit Parallelism

Although many explicit programming models have been proposed, three have become dominant: data parallel, message passing and shared variable.

3.2.1 Data parallel
o Executes the same instruction or program segment over different data sets simultaneously on multiple computing nodes.
o Has a single thread of control.
o Parallelism is exploited at the data-set level.
o No functional parallelism is available.

3.2.1.1 Fortran 90
o Uses array syntax to express parallelism.
o Implementations on SIMD and MIMD machines.
o Single-processor versions are available.
o Communication is transparent.

3.2.1.2 High Performance Fortran (HPF)
o Evolved from Fortran 90; allows far more detail in expressing parallelism.
o An attempt to standardize data parallel programming.
o Data distribution and alignment can be defined.
o Allows explicit definition of parallelism.

3.2.2 Message-passing model
o Multithreading – a message-passing program consists of multiple processes, each of which has its own thread of control and may execute different code. Both control parallelism (MPMD – Multiple-Program Multiple-Data) and data parallelism (SPMD – Single-Program Multiple-Data) are supported.
o Asynchronous – the processes of a message-passing program execute asynchronously.
o Separate address spaces – the processes of a parallel program reside in different address spaces.
o Explicit interactions – the programmer must solve all the interaction issues, including data mapping, communication and synchronization.
o Scales well, especially if the data is well distributed.

3.2.2.1 PVM
PVM (Parallel Virtual Machine) is a software package that permits a heterogeneous collection of Unix and/or NT computers hooked together by a network to be used as a single large parallel computer. Large computational problems can thus be solved most cost-effectively by using the aggregate power and memory of many computers. The software is very portable; the source, which is available free through Netlib (www.netlib.org), has been compiled on everything from laptops to Crays.
PVM enables users to exploit their existing computer hardware to solve much larger problems at minimal additional cost. Hundreds of sites around the world use PVM to solve important scientific, industrial and medical problems, in addition to PVM's use as an educational tool for teaching parallel programming.

3.2.2.2 MPI
• MPI (Message Passing Interface) is the standard message-passing programming interface:
  o MPI 1.0 in 1994
  o MPI 2.0 in 1997
• Library interface (Fortran, C, C++)
• It includes:
  o point-to-point communication
  o collective communication
  o barrier synchronization
  o one-sided communication (MPI 2.0)
  o parallel I/O (MPI 2.0)
  o process creation (MPI 2.0)
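As a minimal illustration of the point-to-point communication listed above (a sketch, not taken from the notes; the message value and tag are arbitrary), the following C program sends one integer from process 0 to process 1:

    /* Minimal MPI point-to-point sketch: rank 0 sends an integer to rank 1.
     * Compile with an MPI compiler wrapper (e.g. mpicc) and run with at
     * least 2 processes. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* dest 1, tag 0   */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                          /* source 0, tag 0 */
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }

This is the SPMD style described above: every process runs the same program, and the rank returned by MPI_Comm_rank determines which code path each process takes.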
3.2.3 Shared variable
o Similar to the data-parallel model in that it has a single address space.
o Similar to the message-passing model in that it is multithreaded and asynchronous.
o Data reside in a single, shared address space and do not have to be explicitly allocated.
o Communication is done implicitly through shared reads and writes of variables.
o Synchronization is explicit.

3.2.3.1 SGI Power C Model
o An extension of the sequential C language with compiler directives (pragmas) and library functions.
o Supports shared-variable parallel programming.
o Similar extended constructs are also provided for Fortran.
o It is structured and relatively simple.
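The same shared-variable style can be illustrated with OpenMP, which Section 2.5 lists as a programming option on several of the machines above (this sketch uses OpenMP directives rather than the SGI Power C pragmas themselves, and the array and loop are illustrative assumptions):

    /* Shared-variable sketch using OpenMP: the array lives in the single
     * shared address space, a compiler directive splits the loop among
     * threads, and the reduction clause supplies the synchronization
     * needed to combine the per-thread sums safely. */
    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void) {
        static double a[N];
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            a[i] = 1.0;                        /* sample data             */

        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];                       /* implicit shared reads   */

        printf("sum = %.0f using up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }

Communication happens through the shared array a, with no explicit message passing; the directive and its reduction clause carry the required synchronization.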
