Lecture notes on parallel processing

what is parallel processing in computer architecture and what is parallel processing why and how it is applicable pdf free download
ImogenCameron Profile Pic
Published Date:14-07-2017
Your Website URL(Optional)
1 Introduction to Parallel Processing There is a number of concepts concerning parallel execution whose understanding is crucial in the following chapters, such as the notions of program, process, thread, concurrent and parallel execution, or kinds of concurrency and parallelism. Therefore, this chapter commences with the review of these notions. Subsequently, in the framework of a proposed classification, major parallel architecture classes will be introduced. This categorization can be seen as a major extension of Flynn's classic scheme, described also in the concluding part of this chapter. 1.1 Basic concepts 1.1.1 The concept of program From the programmers perspective a program is an ordered set of instructions. On the other hand, from the point of view of an operating system, it is an executable file stored in the secondary (auxiliary) memory, typically on a disk. For example, Figure 1-1 shows the program P1.EXE as seen by the programmer and also by the operating system. The file, P1.EXE is stored in the secondary memory and can be accessed through the corresponding directory entry and file description. Secondary memory Operating system DIR File description P1.EXE readDataFromFile( fDataFile, cData); while ( cData = EOF ) P1.EXE processData( cData ); readDataFromFile( fDataFile, cData); ; Programmer's interpretation: Operating system's interpretation: Ordered set of instructions Executable file stored on secondary memory Figure 1-1. Dual interpretation of the notion program 1.1.2 The concept of process In operating system terminology, the notion of process is used relating to execution instead of the term program. It designates a commission, or quantum of work dealt with as an entity. Consequently, required resources, like address space etc. will be allocated typically on a process basis. Let's make a comparison. Imagine a car repair shop. Then a car that should be given in commission will be parked first on the parking place of the repair shop. This car resembles a program stored in the auxiliary memory. Subsequently, the owner goes to the counter where he or she asks the service station to repair the car. When the dispatcher accepts this request, a new commission will be created. Therefore, a work sheet will be filled in where all the relevant information pertaining to the commission will be recorded, such as registration number, owner's name etc. Afterwards, the repair commissions are represented by the work sheets and the dispatcher deals solely with the worksheets as far as scheduling of the work is concerned. Evidently, repair commissions correspond to the processes. Processes are created similarly and described by appropriate tables. Remaining by our example, in order to commence repairing a car, a depot and a mechanic should be allocated before. Furthermore, if additional resources, such as a hydraulic car lift, are required these also should be allocated to. Similarly, in case of process creation, memory space and, when needed, extra resources like I/O-devices should be allocated to a process. Lastly, to execute a process, a processor is to be allocated for. This is expressed in operating system terminology by saying that the process is to be scheduled for execution. 2 Earlier operating systems, such as the IBM/360 operating systems like /DOS, /MFT, /MVT etc. used the term task in the same sense as recent operating systems use the notion of a process. Each process has a lifecycle, which consists of creation, an execution phase and termination. In order to execute a program prior a corresponding process is to be created. A process will be brought into existence by using system services (system calls or supervisor macros). For example, in OS/2 processes will be created by means of the system call DosExecPgm, where the name of the executable program file will be given as a parameter like P1.EXE. The creation of a process means commissioning the operating system to execute a program. Process creation involves the following four main actions:  setting up the process description,  allocation of an address space and  loading the program into the allocated address space and  passing on the process description to the scheduler. Usually, Operating systems describe a process by means of a description table which we will call Process Control Block or PCB. Different operating systems are using, however, diverse designations for the process description table, like Process Table in UNIX, Process Control Block in VMS, Process Information Block in OS/2, or Task Control Block in IBM mainframe operating systems. A PCB contains all the information that can be relevant during the whole lifecycle of a process. It holds on the one hand, basic data, such as process identification, owner, process status, description of the allocated address space etc. On the other hand it provides space for all implementation dependent, process specific additional information. Such supplementary information may be required sometimes during process management, in connection with memory management, scheduling etc., like page tables, working set lists, various timers relating to the execution of the process etc.. This may amount to a considerably extent. Memory Operating system Shared memory P1.EXE Secondary memory Process schedule PCB queue P1.EXE P1 P1.EXE Per-process address space Figure 1-2. Description and location of a process after creation A second major component of process creation is the allocation of address space to a process for execution. There are two basic ways of address space allocation, i.e. sharing the address space among the created processes or allocating distinct address spaces to each process (per process address spaces). Subsequently, the executable program file will usually be loaded into the allocated memory space. This 3 will be accomplished corresponding to the memory management, either partly, when virtual memory management is used or, entirely, for real memory management. Finally, a created process is to be passed over to the process scheduler that allocates the processor to the competing processes. The process scheduler manages processes typically by setting up and manipulating queues of PCBs. Thus, after creating a process the scheduler puts the PCB into a queue of ready-to-run processes. Summing up, process creation results essentially in setting up the PCB, allocating a shared or a per process address space to the process, loading the program file and putting the PCB into the ready to run queue for scheduling (see Figure 1-2). Nevertheless, there are two submodels for process creation to be mentioned. In the simpler submodel a process can not generate a new process, whereas in the more advanced one, a created process is allowed to establish new processes. Usually, this latter possibility is called process spawning (e.g. in UNIX or VMS) or subtasking (in IBM terminology ). Earlier operating systems, such as IBM/PCP or earlier versions of IBM/360/DOS, OS/MFT did not allowed process spawning. Thus, in such operating systems a program was represented as one process. In the more advanced submodel it is allowed for a process to create a new process, called child process (for instance in VMS or UNIX) or subtask (by IBM operating systems). Child processes can be created by the programmer using the standard mechanisms for process creation. Spawned processes form a process hierarchy termed also as process tree Figure 1-3 illustrates a program where process A spawned two processes B and C, and B further spawned another two processes, D and E. Processes spawning results in more concurrency during execution. That is, in this case more processes belonging to the same program are competing for the processor and if the running process becomes blocked, another ready-to-run process pertaining to the same program can be scheduled to execution. However, the price of ensuring concurrency is the effort to create multiple processes and to carefully plan their communication and synchronization. The execution phase of a process is under the control of the scheduler. It commences with creation of a process and lasts until the process is terminated. However, as far as scheduling is concerned, there are two basic models used by operating systems, termed as the process model and the process-thread mode. They differ essentially in the granularity of the units of work which are scheduled as one entity. A Parent process BC Child processes DE Figure 1-3. Process spawning In the process model the scheduling is performed on a per process basis, i.e. the smallest unit of work to be scheduled is a process. On the contrary, the process-thread model is a finer grained scheduling model, where smaller units of work, called threads, are scheduled as entities. In the following we will outline the process model, whereas the process-thread model will be briefly discussed in connection with the description of threads. Process scheduling involves three key concepts, the declaration of distinct process states, the specification of the state transition diagram and the statement of a scheduling policy. As far as process states are concerned, there are three basic states connected with scheduling; the ready to run state, the running state as well as the wait (or blocked) state. In the ready to run state processes are able to run when a processor is allocated for them. In this state they are waiting for the processor to be executed. In the running state they are in execution on the allocated processor. In the wait state they are suspended or blocked waiting for the occurrence of some events for getting again ready-to-run. 4 Based on the declared states the possible state transitions as well as their conditions are stated usually in a state transition diagram. For the quite straightforward case of the basic states, the state transition diagram of the scheduler could be specified as follows (see Figure 1-4). When the scheduler selects a process for execution, its state will be changed from the ready-to-run to the running state. The process remains in this state until one of the following three events occur. Either it can happen that in compliance with the scheduling policy the scheduler decides to cease the execution of the particular process (for instance because the allocated time slice is over), and puts this process into the ready to run queue again, changing at the same time its state accordingly. Or the process in execution issues such an instruction that causes this process to wait till an event takes place. In this case the process state will be changed to the waiting state and its PCB will be placed into the queue of the waiting (or blocked) processes. Lastly, if the process arrives at the end of the execution, it terminates. Finally, a particular process, which is in the waiting state, can go over into the ready to run state, if the event it is waiting for has occurred. Real operating systems have a number of additional states, in the order of ten, introduced to cover features like swapping, different execution modes and so on. Scheduling to run Ready Creation Termination to Running run Suspension (time slice is over) Occuring of the Event causing blocking event the process has waiting for Wait Figure 1-4. State transition diagram for the basic states The last component of the scheduling specifies rules for the managing multiple competing processes which is usually termed as scheduling policy. Basic scheduling policies will be overviewed in connection with concurrent execution later in this chapter. By finishing the execution the process will be terminated releasing all the allocated resources. A large number of operating systems, in the first line earlier systems, is based on the process model (such as IBM/360 /DOS, OS/MFT, OS/MVT etc., UNIX, VMS). 1.1.3 The concept of thread The notion of thread was introduced in the framework of the process-thread model in order to express more parallelism in code than in the process model. This will be achieved by declaration of smaller chunks of code, called threads (lightweight processes), within a process as an entity that can be executed concurrently or parallel. A thread, like a process, is a sequence of instructions. Threads are created within and belonging to processes. All the threads created within one process share the resources of the process, above all, the address space. Of course, scheduling will be performed on a per- thread basis. In other words, the process-thread model is a finer grain scheduling model than the process model. Although this model is far more affordable then the process model, it has numerous advantages over and above. Evidently, with finer grained entities more parallelism can be exposed than in the case of the processes. In addition, the creation of threads or the communication, synchronization or switch among threads are far less expensive operations then those for processes, since all threads belonging to the same process are sharing the same resources. Therefore, most operating systems introduced in the 80s or 90s are based on the process-thread model, such as the OS/2, Windows NT or SunOS 5.0. Even recently, a standard is prepared to standardize the thread interface (IEEE POSIX 1003.40). Some operating systems, such as SunOS 5.0 are taking already into consideration this emerging new standard. 5 Threads have a similar lifecycle as the processes and will be managed mainly in the same way as processes are. Initially each process will be created with one single thread. However, threads are usually allowed to create new ones using particular system calls. For instance, in OS/2 new threads can be created by issuing the DoscCreateThread system call. Then, typically for each process a thread tree will be created (Figure 1-5). Process Threads Figure 1-5. Thread tree During creation each thread will be declared by a particular data structure mostly called Thread Control Block (TCB). The scheduling of threads will be performed in a similar way as described above for processes. Correspondingly, threads can be basically in one of three states; running, ready to run or waiting (blocked). Of course, each real operating system maintains, beyond the basic ones, a number of system specific additional states. For instance, Windows NT distinguishes the following additional states: initialized, terminated, standby (selected to execution but not yet in execution and transition state (a distinguished waiting state, marked by unavailability of resources). Again, a state transition diagram describes possible state transfers. Thread management is performed by setting up TCB queues for each state and performing the state transitions according to the state transition diagram and the scheduling policy. The scheduling part of the operating system overtakes the responsibility for managing all these queues in much the same way as it occurs with processes. At the end of thread creation the TCB will be placed into the queue of the ready-to-run threads and is contesting for the processor. 1.1.4 Processes and threads in languages So far, processes and threads has been described as operating system entities that can be managed by system calls. However, there is a large number of programming languages, called concurrent and parallel languages, that enable to express parallelism at the language level by providing language tools to specify the creation of processes and threads. Moreover, these languages also contain language constructs to describe the synchronization and communication of processes and threads. Like in operating systems, different languages use different terms (task, process, module, etc.) for processes and threads. Unfortunately, there is an ambiguity between processes of the operating systems and processes of the concurrent languages. Most of the concurrent and parallel languages (Ada, Concurrent Pascal, Occam-2, etc.) mean threads even if they use the term of tasks or processes. One of the rare exceptions is 3L Parallel C which use the name task to specify processes and the name thread to describe threads in compliance with the operating system terminology. Concerning thread creation and termination there are three basic methods in concurrent languages:  unsynchronized creation and termination,  unsynchronized creation and synchronized termination,  synchronized creation and termination. The first method is typically realised by calling library functions as CREATE_PROCESS, START_PROCESS, CREATE_THREAD, and START_THREAD. As a result of these function calls, a new process or thread is created and started running independently from the parent one. The only connection between them is the possibility of communication and synchronization. However, when the child process or thread reaches its last instruction, it terminates without any synchronization to its parent. The second method relies on the use of two instructions: FORK and JOIN. The FORK instruction serves for spawning a new thread (or process) in order to deliver some computation result to the parent. 6 When the parent needs the result of the child, it performs a JOIN instruction. At this point the two threads should be synchronized. The parent waits until the child reaches its terminating instruction and delivers the requested result. The typical language construct to implement the third method is the COBEGIN/COEND structure: .., COBEGIN, T1, T2, ..., Tn, COEND, ... The threads between the COBEGIN/COEND brackets are executed in parallel. The work of the parent thread is suspended until all the child threads are terminated. At this point the parent can continue its progress by executing the instruction that follows COEND. A semantically equivalent construct called the PAR construct is used in the Occam-2 language. The first two solutions represent an undisciplined programming style. Though they are more efficient than the third method, they can lead to unambigious process graphs, the verification of which is nearly impossible. However, compilers can use them in a disciplined way resulting in efficient parallel code. The situation is similar to the use of the GOTO instruction. Compilers can use it to realise higher level program constructs like loops, but their direct use by programmers led to unmaintainable software products. T2 T0 T1 T2 T0 T1 Tn COBEGIN FORK . . . FORK COEND JOIN JOIN (b) COBEGIN/COEND (a) FORK/JOIN T: Thread Figure 1-6. Thread graphs of language constructs to create and terminate threads The comparison of the FORK/JOIN and COBEGIN/COEND constructs is illustrated by the thread graphs of Figure 1-6. The special characteristics of the COBEGIN/COEND construct is that it can always be included in a black box which has only one input and one output arc. This property enables reasonably easy debugging of individual threads in these programs. On the contrary, overlapped FORK/JOIN constructs prohibit this property rendering the verification of these programs difficult. The FORK/JOIN construct is a lower level one than the COBEGIN/COEND construct since the former can be used to implement the latter but not vice versa. Usually, lower level language constructs are more efficient to implement but more difficult and dangerous to use. 7 1.1.5 The concepts of concurrent and parallel execution Although, the adjectives concurrent and parallel are often used as synonyms, it is often desirable to make a distinction between them. Concurrent execution is the temporal behaviour of the multiple clients (requesters) single server model (Figure 1-7) where one single client is served at any given moment. This model has a dual nature, it is sequential in a diminutive time scale, but simultaneous in a rather large time scale. Clients t Server Clients t Server Sequential nature Simultaneus nature Figure 1-7. Single server client-server model with concurrent execution In this situation the key problem is how the competing clients, let us say processes or threads, should be scheduled for service (execution) by the single server (processor). Scheduling policies may be oriented towards an efficient service in terms of highest throughput (least intervention) or towards short average respond time, etc.. The scheduling policy may be viewed as covering two aspects. The first one responds to the problem whether servicing of a client can be interrupted or not and, on what occasions (preemption rule). The other component states a rule how one of the competing clients will be selected for service (selection rule), see Figure 1-8. Scheduling policy Preemption rule Selection rule Whether servicing a client How clients from the competing can be interrupted or not clients will be selected for service and if yes on what occasions Figure 1-8. Main aspects of the scheduling policy If pre-emption is not allowed, a client will be serviced as long as needed (Figure 1-9). It results often in intolerable long waiting times or in the blocking of important service requests for other clients. Therefore, a pre-emptive scheduling is often used. The pre-emption rule may specify either time- sharing, which restricts continuous service for each client merely for the duration of a time slice, or can be priority based, interrupting servicing one client whenever a higher priority client requests service, as shown in Figure 1-9. The selection rule is typically based on some chosen parameters, such as priority, time of arrival etc.. This rule specifies an algorithm to determine a numeric value, which we will call rank, from the given parameters. During selection the ranks of all competing clients will be computed and the client with the highest rank will be scheduled for service. If more than one client have the same rank, an arbitration is needed to single out one (for example on a FIFO basis). Parallel execution is associated with the multiple clients multiple servers model (Figure 1-10). Having more than one server (let us say processor) allows the servicing of more than one client (processes or threads) at the same time, which is called parallel execution. 8 Preemption rule Non preemptive Preemptive Time-shared Prioritized Priority Clients Server Clients Clients Server Server Figure 1-9. Basic preemption schemes As far as the temporal harmonization of the executions is concerned there are two different schemes to be distinguished. In the lock-step or synchronous scheme each server starts service at the same moment, like in SIMD architectures. On the contrary, by the asynchronous scheme, the servers do not work concertedly, like in MIMD architectures. Multiple servers mode Synchron Asynchron (lock step) Clients Servers Servers Clients Figure 1-10. Multiple servers mode 1.1.6 Concurrent and parallel programming languages Languages can be classified according to the available language constructs. Languages that do not contain any constructs to support the N-client model belong to the class of sequential (or traditional) languages (For example: C, Pascal, FORTRAN, Prolog, Lisp). Concurrent languages employ constructs to specify the N-client model by specifying parallel threads and processes but miss language constructs to describe the N-server model (For example: Ada, Concurrent Pascal, Modula-2, Concurrent Prolog). Data parallel languages introduce special data structures that are processed in parallel, element by element. They also apply special mapping directives to help the compiler in the optimal distribution of parallel data structures among processors. (For example: High Performance FORTRAN, DAP FORTRAN, DAP Prolog, Connection Machine C and Lisp). Finally, parallel languages extend the specification of the N-client model of concurrent languages with processor allocation language constructs that enable the use of the N-server model (For example: Occam-2, 3L 9 Parallel C, Strand-88). Table 1-1 summarizes the relationship between language classes and client- server models. In Section 1.1.4 the constructs of concurrent languages supporting the N-client 1–server model have been described. Additionally to those constructs, parallel languages contain tools to specify the relation between processes and processors, i.e., the programmer can impose processor allocation on the compiler and run-time system. A typical parallel language is Occam-2, which contains PAR constructs to create processes (clients) and channels to enable synchronization and communication among processes. These language features support the N-client model. Occam-2 also has a configuration part that does not affect the logical behaviour of the program but it enables processes to be arranged on the processors to ensure that performance requirements are met INMOS88. PLACED PAR, PROCESSOR, and PLACE AT commands are introduced for this purpose. In 3L Parallel C a separate configuration language is defined for similar purpose. 1-client N-client 1-client N-client Languages 1-server 1-server N-server N-server model model model model Sequential + - - - Concurrent + + - - Data parallel + - + - Parallel + + - + Table 1-1 Classification of programming languages Though parallel languages contain more types of language constructs than concurrent languages, it does not mean that they are superior to concurrent or sequential languages. On the contrary, the configuration part of parallel languages represent an extra burden for the user. It is much more convenient for the programmer if the task of the configuration is performed either by the compiler or by the run-time system. 1.1.7 Concurrent execution models Finally, in order to round up the picture we will below briefly outline high level concurrent execution models such as multithreading, multitasking, multiprogramming and time sharing, see also Figure 1-11). Note that all these concurrent execution models are referring to different granularity of execution termed also as different levels. Typically, they are implemented by means of the operating system on one single processor (that is on a SISD, in Flynn's taxonomy). It is important to note however, that all these notions can also be interpreted in a broader sense, as notions designating the execution of multiple threads, processes or users in a concurrent or parallel way. Nevertheless, the interpretations below are formulated focused on the narrower scope of concurrent execution. Multiprogramming, User level time-sharing Process level Multitasking Thread level Multithreading Level of granularity Figure 1-11. Concurrent execution models 10 Thread level concurrent execution is termed as multithreading. In this case multiple threads can be generated to each process, and these threads will be executed concurrently on a single processor under the control of the operating system. Multithreading will usually be interpreted as concurrent execution at the thread level. Multithreading presumes evidently that multiple threads of a process exist, that is, a process-thread model is used to represent and schedule units of work for the processor. Multithreading is supported by recent operating systems (like OS/2, Windows NT or SunOS 5.0) as well as by multithreaded architectures. Process level concurrent execution is usually called multitasking. Each widely used present operating system supports this concept. Multitasking refers to concurrent execution of processes. Multiple ready to run processes can be created either by a single user if process spawning is feasible, or by multiple users participating in multiprogramming or in time-sharing. Multitasking was introduced in operating systems in the middle 60s, among others, in the IBM operating systems for the System/36O such as /DOS, OS/MFT and OS/MVT. Almost all recently used operating systems provide this feature. MIMD architectures support this model of parallel execution. Finally, user level concurrent or parallel execution can be either multiprogramming or time- sharing. Multiprogramming aims at the effective utilization of the processor by creating multiple ready-to-run processes, each belonging to different users. If the actually running process becomes blocked, because it has to wait for a particular event, such as completition of I/O, the processor will be scheduled to another ready to run process (if any). This mode of operation was supported by multiprogrammed operating systems such as the systems mentioned in the forgoing paragraph. Notice that multiprogramming is internally implemented as multitasking for independent tasks arising from different users. On the other hand, time-sharing has the objective to offer adequate quality computer services to a number of users through terminals. This mode of operation intends to guarantee a short access time to all users instead of striving for the effective utilization of the processor. In this case a number of ready- to-run processes are created again, however, by different users. The scheduling has to be performed in such a way that a proper response time could be guaranteed for each user. Time-sharing evolved as a response to the lengthy turn around times of efficiency oriented and user unfriendly multiprogrammed systems in the 60's (TSS, MAC etc.). 1.2 Types and levels of parallelism 1.2.1 Available and utilised parallelism Parallelism is one of the 'hottest' ideas in computing. Architectures, compilers and operating systems have been striving to extract and utilise as much parallelism as possible for more than two decades in order to speed-up computation,. Related to our subject the notion of parallelism is used in two different contexts. Either it Data Computations Task 1 Task 2 Task 3 Figure 1-12 The principle of data parallelism 11 designates available parallelism in programs (or in a more general sense in problem solutions) or it refers to parallelism occurring during execution, called utilised parallelism. First, let us overview available parallelism. 1.2.2 Types of available parallelism Problem solutions may contain two different kinds of available parallelism, called data parallelism and functional parallelism. TASK 1 a b c do i = 1, 20 1 a(i) = b (i) c (i) enddo 20 TASK 2 do i = 1, 60 21 a (i) = b (i) c (i) do i = 21, 40 enddo a(i) = b (i) c (i) 40 enddo 41 TASK 3 60 do i = 41, 60 a(i) = b (i) c (i) enddo Data parallelism is the use of multiple functional units to apply the same operation simultaneously to different elements of a data set. A k-fold increase in the number of functional units leads to a k-fold increase in the throughput of the system, if there is no overhead associated with the increase in parallelism. In case of data parallelism, the level of achievable parallelism is proportional to the amount of data. Data parallelism is well suited for scientific computation on large and regular data structures (vectors, matrices, etc.). Data decomposition and the distribution of data segments among several processes (tasks) are shown in Figure 1-12. A typical example to exploit data parallelism appears in do-loops as illustrated on a simple example in Σφάλµα Το αρχείο προέλευσης της αναφοράς δεν βρέθηκε.. Another typical form of data parallelism is applied in solving Laplace’s equations on a square based on the Jacobi iterative method. In such neighbourhood-oriented computations we get processes with a regular communication pattern. Figure 1-13 depicts a simple example for the Jacobi method using data parallel execution scheme. Data parallelism originates in using such data structures in problem solutions, which allow parallel operations on their elements, such as vectors or matrices (HillStee86). Data parallelism is inherent only in a restricted set of problems, such as scientific or engineering calculations or image processing. Usually, this kind of parallelism gives rise to a massively parallel execution for the data-parallel part of the computation. Thus, the actual values for the achievable speed-up depend heavily on the characteristics of the application in concern. We term functional that kind of parallelism which arises from the logic of a problem solution. It occurs more or less in all formal descriptions of problem solutions, like program flow diagrams, data flow graphs, programs etc. to a more or less extent. However, in the following we will restrict our attention to available functional parallelism inherent in programs expressed in traditional imperative languages. In contrast to data parallelism, in which parallelism is achieved by decomposing data into pieces and applying the same operation to each element of the data set, function parallelism is achieved by 12 applying different operations (functions) to different data elements simultaneously. Accordingly, it is the computation which is decomposed into pieces (functions) as shown in Figure 1-14.         1 t1 +  t t t t  V = V +V +V +V  ij  i1 − j i1 + j ij−1 ij+1 4                                         Figure 1-13 Jacobi iterative method for solving Laplace’s equation on a square Data Computations Task 1 Task 2 Task 3 Figure 1-14 The principle of function parallelism 13 Figure 1-15 Data-flow graph to represent function parallelism Data are taken to the processes where they are needed. The flow of data among these processes can be arbitrarily complex. The data-flow graph visualises the data dependencies among the different operations. Nodes of the graph represent operations (instructions or functions) while the directed arcs show the direction of data flow among the operations (see Figure 1-15). a = b + 1 TASK 1 a = a + c a = b + 1 uses a,b,c and d d = 5 d a = a + c d = 5 d a = 2 a TASK 2 a = 2 a e = a - 1 uses a,e,d and f e = a - 1 f = d 3 f = d 3 f = f - 1 f = f - 1 TASK 3 f = f 4 f = f 4 uses f Figure 1-16 Function Parallelism: A simple example A simple example for function parallelism is depicted in Figure 1-16. Function parallelism is suitable for non regular computations resulting in non regular communication pattern and the level of parallelism is usually limited by the number of functions. If the data-flow graph forms a simple directed graph (like in the floating point adder of a vector computer as shown in Figure 1-18), then we say the algorithm is pipelined. A pipelined computation is divided into a number of steps called stages. Each stage works at full speed on a particular part of a computation. The output of one stage forms the input of the next stage (see Figure 1-17). If all the stages work at the same speed, once the pipe is full the work rate of the pipeline is equal to the sum of the work rates of the stages. A pipeline is analogous to an assembly line: the flow of results is simple and fixed, precedence constraints must be honoured, and it takes time to fill and drain the pipe. If we assume that each stage of the pipe requires the same amount of time, the multiplicative increase in the throughput is equal to the number of stages in the pipeline which is usually very limited. Pipelining is suitable for regular computations on regular data. Figure 1-18 illustrates a typical pipelining example used in vector computers to add vectors. From another point of view parallelism can be considered as being either regular or irregular. Data parallelism is regular, whereas functional parallelism, with the exception of loop-level parallelism, is usually irregular. Intuitively, parallelism will be called weak, when the extent of the available or exploited parallelism remains in the one digit range. It is the typical case for irregular parallelism. On the contrary, regular parallelism is often massive, offering several orders of magnitude in speed-up. 14 Data Computations Task 1 Task 2 Task 3 Figure 1-17 The principle of pipeline parallelism A B + = Align Significands A B C Add Significands Normalize Task 3 Results C Figure 1-18 Structure of a pipeline adder in a vector processor 15 1.2.3 The Sieve of Eratosthenes In this section we will explore methods to parallelize the Sieve of Eratosthenes, the classic prime- finding algorithm. We will design and analyse both function parallel, data parallel and pipeline implementations of this algorithm. We want to find the number of primes less than or equal to some positive integer n. A prime number has exactly two factors: itself and 1. The Sieve of Eratosthenes begins with a list of natural numbers 2, 3, 4, ..., n, and removes composite numbers from the list by striking multiples of 2, 3, 5, and successive primes. The sieve terminates after multiples of the largest prime less than or equal to n have been struck. A sequential implementation of the Sieve of Eratosthenes manages three key data structures: a boolean array whose elements correspond to the natural numbers being sieved, an integer corresponding to latest prime number found, and an integer used as a loop index incremented as multiples of the current prime are marked as composite numbers. Function parallel approach First let's examine a function parallel algorithm to find the number of primes less than or equal to some positive integer n. In this algorithm every processor repeatedly goes through the two-step process of finding the next prime number and striking from the list multiples of that prime, beginning with its square. The processors continue until a prime is found whose value is greater than n . Using this approach, processors concurrently mark multiples of different primes. For example, one processor will be responsible for marking multiples of 2 beginning with 4. While this processor marks multiples of 2, another may be marking multiples of 3 beginning with 9. P1 P2 Pp Index Index Index Current prime Shared Memory 1 3 4 n-1 n Figure 1-19 Shared-memory model for function parallel Sieve of Erathostenes algorithm Each processor has its own private loop index and shares access to other variables with all the other processors . We will base the function parallel algorithm on the simple model of parallel computation illustrated in Figure 1-19. Every processor shares access to the boolean array representing the natural numbers, as well as the integer containing the value of the latest prime number found. Because processors independently mark multiples of different primes, each processor has its own local loop index. If a group of asynchronously executing processors share access to the same data structure in an unstructured way, inefficiencies or errors may occur. Here are two problems that can occur in the algorithm we just described. First, two processors may end up using the same prime value to sieve through the array. Normally a processor accesses the value of the current prime and begins searching at the next array location until it finds another unmarked cell, which corresponds to the next prime. Then it updates the value of the integer containing the current prime. If a second processor accesses the value of the current prime before the first processor updates it, then both processors will end up finding the same new prime and performing a sieve based on that value. This does not make the algorithm incorrect, but it wastes time. 16 Second, a processor may end up sieving multiples of a composite number. For example, assume processor A is responsible for marking multiples of 2, but before it can mark any cells, processor B finds the next prime to be 3, and processor C searches for the next unmarked cell. Because cell 4 has not yet been marked, processor C returns with the value 4 as the latest "prime" number. As in the previous example, the algorithm is still correct, but a processor sieving multiples of 4 is wasting time. In later chapters we will discuss ways to design parallel algorithms that avoid such problems. For now, let's explore the maximum speedup achievable by this parallel algorithm, assuming that none of the time-wasting problems described earlier happen. To make our analysis easier, we will also ignore the time spent finding the next prime and concentrate on the operation of marking cells. First let's consider the time taken by the sequential algorithm. Assume it takes 1 unit of time for a processor to mark a cell. Suppose there are k primes less than or equal to SQUARE(n). We denote these primes π , π , . . ., π . (For example, π = 2, π = 3, and π = 5.) The total amount of time a single l 2 k l 2 3 processor spends striking out composite numbers is 2 2 2 2     nn +−11 π   +−π  ()nn +−11 π ()+−π () () 3 k 1 2 + + ++ K         π π π π  1   2   3   k  2   nnn − 3 − 8 − 24 () n+− 1 π       k = + + ++ K         2 3 5 π        k  There are (n - 3)/2 multiples of 2 in the range 4 through n, (n - 8)/3 multiples of 3 in the range 9 through n, and so on. For n = 1,000 the sum is 1,411. Now let's think about the time taken by the parallel algorithm. Whenever a processor is unoccupied, it grabs the next prime and marks its multiples. All processors continue in this fashion until the first prime greater than n is found. For example, Figure 1-20 illustrates the time required by one, two, and three processors to find all primes less than or equal to 1,000. With two processors the parallel algorithm has speedup 2 (1,411/706). With three processors the parallel algorithm has speedup 2.83 (1,411/499). It is clear that the parallel execution time will not decrease if more than three processors are used, because with three or more processors the time needed for a single processor to sieve all multiples of 2 determines the parallel execution time. Hence an upper bound on the execution time of the parallel algorithm for n = 1,000 is 2.83. Figure 1-20 Study of how adding processors reduces the execution time of the function parallel Sieve of Eratosthenes algorithm when n = 1,000. The number in the bar represents the prime whose multiples are being marked. The length of the bar is the time needed to strike these multiples. (a) Single processor strikes out all composite numbers in 1,411 units of line. (b) With two processors execution time drops to 706 time units. (c) With three or more processors execution time is 499 time units, the time needed for a processor to strike all multiples of 2. 17 Data parallel approach Let's consider another approach to parallelizing the Sieve of Eratosthenes. In our new algorithm, processors will work together to strike multiples of each newly found prime. Every processor will be responsible for a segment of the array representing the natural numbers. The algorithm is data parallel, because each processor applies the same operation (striking multiples of a particular prime) to its own portion of the data set. Analysing the speedup achievable by the data-parallel algorithm on the shared memory model of Figure 1-19 is straightforward; we have left it as an exercise. Instead, we will consider a different model of parallel computation (Figure 1-21). In this model there is no shared memory, and processor interaction occurs through message passing. P1 P2 Current Index Current Index 2 n/p n/p+1 2n/p Network Pp Current Index (p-1)n/p+1 n Figure 1-21 Private memory model for parallel Sieve of Eratosthenes algorithm. Each processor has its own copy of the variables containing the current prime and the loop index. Processor 1 finds primes and communicates them to the other processors. Each processor iterates through its own portion of the array of natural numbers, marking multiples of the prime Assume we are solving the problem on p processors. Every processor is assigned no more than n/p natural numbers. We will also assume that p is much less than SQUARE(n). In this case all primes less than SQUARE(n), as well as the first prime greater than SQUARE(n), are in the list of natural numbers controlled by the first processor. Processor 1 will find the next prime and broadcast its value to the other processors. Then all processors strike from their lists of composite numbers all multiples of the newly found prime. This process of prime finding and composite number striking continues until the first processor reaches a prime greater than SQUARE(n), at which point the algorithm terminates. Let's estimate the execution time of this parallel algorithm. As in the previous analysis, we ignore time spent finding the next prime and focus on the time spent marking composite numbers. However, since this model does not have a shared memory, we must also consider the time spent communicating the value of the current prime from processor 1 to all other processors. Assume it takes χ time units for a processor to mark a multiple of a prime as being a composite number. Suppose there are k primes less than or equal to SQUARE(n). We denote these primes π1, π2, . . ., πk. The total amount of time a processor spends striking out composite numbers is no greater than           n n n n   p p p p             + + ++ K χ           23 5 π k                     18 Assume every time processor 1 finds a new prime it communicates that value to each of the other p - 1 processors in turn. If processor 1 spends λ time units each time it passes a number to another processor, its total communication time for all k primes is k(p-1)λ. . To bring this discussion down to earth, suppose we want to count the number of primes less than 1,000,000. It turns out that there are 168 primes less than 1,000, the square root of 1,000,000. The largest of these is 997. The maximum possible execution time spent striking out primes is    1,, 000 000   1,, 000 000   10, 00,000    pp p    + ++ K χ         2 3 997                 The total communication time is 168(p-1)λ.. Figure 1-22 Estimated speedup of the data parallel Sieve of Eratosthenes algorithm assuming that n = 1,000,000 and λ = l00χ. Note that speedup is graphed as a function of number of processors used. This is typical If we know the relation between χ and λ, we can plot an estimated speedup curve for the parallel algorithm. Suppose λ = l00χ. Figure 1-22 illustrates the estimated speedup of the data-parallel algorithm. Figure 1-23 Total execution time of the data-parallel Sieve of Eratosthenes algorithm is the sum of the time spent computing and the time spent communicating. Computation time is inversely proportional to the number of processors; communication time is directly proportional to the number of processors. 19 . Notice that speedup is not directly proportional to the number of processors used. In fact, speedup is highest at 11 processors. When more processors are added, speedup declines. Figure 1-23 illustrates the estimated total execution time of the parallel algorithm along with its two components: computation time and communication time. Computation time is inversely proportional to the number of processors used, while communication time increases linearly with the number of processors used. After 11 processors, the increase in communication time is greater than the decrease in computation time, and the total execution time begins to increase. Pipeline approach Figure 1-24 shows the principles of a pipeline algorithm to solve the Sieve of Eratosthenes. The first stage of the pipe generates the series of natural numbers and passes it to the next stage of the pipe. The other stages of the pipe represent the sieves. These sieve stages consider the first element of their input as prime number and use it for striking non-primes from the input stream. As a result the output stream of the stage will not contain multiples of the first prime. Finally, all the stages will contain one prime number. 2,3,4,5,6,7,8,9,... 3,5,7,9,11,13,15,... 5,7,11,13,17,... Generator Sieve 1 Sieve 2 Sieve n ... (prime = 2) (prime = 3) (prime = ?) Figure 1-24 Principles of a pipeline algorithm to solve the Sieve of Eratosthenes 1.2.4 Levels of available functional parallelism Programs written in imperative languages may embody functional parallelism at different levels, that is, at different sizes of granularity. In this respect we can identify the following four levels and corresponding granularity sizes: • parallelism at the instruction-level (fine-grained parallelism), • parallelism at the loop-level (middle-grained parallelism), • parallelism at the procedure-level (middle-grained parallelism), and • parallelism at the program-level (coarse-grained parallelism), as shown in Figure 1-25. Available instruction-level parallelism means that particular instructions of a program may be executed in parallel. To this end, instructions can be either assembly (machine-level) or high-level language instructions. Usually, instruction-level parallelism is understood at the machine-language (assembly-language) level. In addition, while considering instruction-level parallelism we will confine us to instructions expressing more or less elementary-operations, such as an instruction prescribing the addition of two scalar operands, as opposed to multi-operation instructions like instructions implying vector- or matrix-operations. Parallelism may also be available at the loop-level. Here subsequent loop iterations are candidates for parallel execution. However, data dependences between subsequent loop iterations, called recurrences, may restrict their parallel execution. The potential speed-up is proportional to the loop limit or in case of nested loops to the product of the limits of the nested loops. Loop-level parallelism is a promising source of parallelism. Next, there is also parallelism available at the procedure-level in form of parallel executable procedures. The extent of parallelism exposed at this level is subject mainly to the kind of the problem solution considered. 20 In addition, different programs (users) are obviously, independent from each other. Thus, parallelism is also available at the user-level (which we consider to be of coarse-grained parallelism). Multiple, independent users are a key source of parallelism occurring in computing scenarios. Evidently, in a problem solution different levels of parallelism are not exclusive but, they may coexist at the same time. Available levels Utilised levels User (program) level User level 2 Procedure level Process level 1 Loop level Thread level Instruction level Instruction level 1 : Exploited by architectures 2 : Exploited by means of operating systems Figure 1-25. Available and utilised levels of functional parallelism 1.2.5 Utilisation of functional parallelism Available parallelism can be utilised by architectures, compilers and operating systems conjointly for speeding up computation. Let us first discuss the utilisation of functional parallelism. In general, functional parallelism can be utilised at four different levels of granularity, that is at instruction, thread, process and user-level, (see again Figure 1-25). It is quite natural to utilise available functional parallelism, which is inherent in a conventional sequential program, at the instruction-level by executing instructions in parallel. This can be achieved by means of architectures capable of parallel instruction execution. Such architectures are referred to as instruction-level function-parallel or simply instruction-level parallel architectures, commonly abbreviated as ILP-architectures. Since available instruction-level parallelism is typically unrevealed, i.e. implicit in traditional sequential programs, prior to execution it must be detected either by a dedicated compiler (called usually parallel optimising compiler) or by the ILP-architecture itself. Available functional parallelism in a program can also be utilised at the thread and/or at the process-level. As discussed before, threads and processes are self-contained execution entities embodying an executable chunk of code. They are constructs to expose parallel executable pieces of code. Processes are higher-level constructs than threads, that is they expose coarser granular parallelism. Threads and processes can be created either by the programmer using parallel languages or by operating systems that support multithreading or multitasking. They can also automatically be generated by parallel compilers during compilation of high-level language programs. Available loop- and procedure-level parallelism will be often exposed in form of threads and processes. There are two different ways to execute threads and processes. On the one hand, they can be executed in parallel by means of specialised architectures referred to as multithreaded and MIMD architectures, respectively. Multithreaded architectures are typically specialised processors able to perform very fast context switch. The other way to execute threads and processes concurrently is the use of architectures that run threads or processes in sequence, under the supervision of a multithreaded or multitasking operating system. In general, lower levels of available parallelism are utilised more likely directly by parallel architectures in connection with parallel optimising or parallel compilers whereas the utilisation of higher levels of parallelism relies usually more heavily on operating systems supporting concurrent or

Advise: Why You Wasting Money in Costly SEO Tools, Use World's Best Free SEO Tool Ubersuggest.