What is Multiprocessor Communication

Multiprocessors and Clusters and what multiprocessor means pdf free download
Dr.AstonCole Profile Pic
Dr.AstonCole,United Kingdom,Researcher
Published Date:10-07-2017
Your Website URL(Optional)
9 Multiprocessors and Clusters There are finer fish in the sea than have ever been caught. Irish proverb 9.1 Introduction 9-4 9.2 Programming Multiprocessors 9-8 9.3 Multiprocessors Connected by a Single Bus 9-11 9.4 Multiprocessors Connected by a Network 9-20 9.5 Clusters 9-25 9.6 Network Topologies 9-27 9.7 Multiprocessors Inside a Chip and Multithreading 9-30 9.8 Real Stuff: The Google Cluster of PCs 9-34 9.9 Fallacies and Pitfalls 9-39 9.10 Concluding Remarks 9-42 9.11 Historical Perspective and Further Reading 9-47 9.12 Exercises 9-55 The Five Classic Components of a Computer Computer Computer Network Computer Computer 9-4 Chapter 9 Multiprocessors and Clusters “Over the Mountains Of the Moon, 9.1 Introduction 9.1 Down the Valley of the Shadow, Ride, boldly ride” Computer architects have long sought the El Dorado of computer design: to cre- The shade replied,— ate powerful computers simply by connecting many existing smaller ones. This “If you seek for Eldorado” golden vision is the fountainhead of multiprocessors. The customer orders as Edgar Allan Poe, “Eldorado,” many processors as the budget allows and receives a commensurate amount of stanza 4, 1849 performance. Thus, multiprocessors must be scalable: the hardware and software are designed to be sold with a variable number of processors, with some machines multiprocessor Parallel pro- varying by a factor of more than 50. Since software is scalable, some multiproces- cessors with a single shared sors can support operation in the presence of broken hardware; that is, if a single address. processor fails in a multiprocessor with n processors, the system provides contin- ued service with n – 1 processors. Finally, multiprocessors have the highest abso- lute performance—faster than the fastest uniprocessor. The good news is that the multiprocessor has established a beachhead. Keeping in mind that the microprocessor is now the most cost-effective processor, it is gen- erally agreed that if you can’t handle a workload on a microprocessor, then a mul- cluster A set of computers con- tiprocessor or cluster composed of many microprocessors is more effective than nected over a local area network building a high-performance uniprocessor from a more exotic technology. There (LAN) that function as a single are many scientific applications that are too demanding to make progress on them large multiprocessor. with a single microprocessor: weather prediction, protein folding, and even search for extraterrestrial intelligence. Thus, Figure 9.1.1 shows that the high-perfor- mance computing industry depends on multiprocessors and clusters. There are also applications outside the sciences that are demanding: search engines, Web servers, and databases. For example, Figure 9.1.2 illustrates that the database industry has standardized on multiprocessors and clusters. Conse- quently, they now embody a significant market. Commercial multiprocessors and clusters usually define high performance as high throughput for independent tasks. This definition is in contrast to running a parallel processing single task on multiple processors. We use the term parallel processing program program A single program to refer to a single program that runs on multiple processors simultaneously. that runs on multiple processors Here are key questions that drive the designs of multiprocessors and clusters: simultaneously.  How do parallel processors share data?  How do parallel processors coordinate? shared memory A memory  How many processors? for a parallel processor with a sin- The answers to the first question fall in two main camps. Processors with a single address space, implying gle address space, sometimes called shared-memory processors, offer the pro- implicit communication with loads and stores. grammer a single memory address space that all processors share. Processors 9.1 Introduction 9-5 Single Instruction multiple data (SIMD) 500 Cluster (network of workstations) 400 Cluster (network of SMPs) 300 Massively parallel 200 processors (MPPs) 100 Shared- memory multiprocessors 0 (SMPs) 93 93 94 94 95 95 96 96 97 97 98 98 99 99 00 Uniprocessors FIGURE 9.1.1 Plot of top 500 supercomputer sites over a decade. The numbers for 1993/1998/2003 are 93/0/0 for uniprocessor, 251/175/0 for SMP, 121/310/165 for MPP, 35/0/0 for SIMD, 0/14/127 for cluster of SMPs, and 0/1/208 for cluster of workstations. Note that in the last five years unipro- cessors, SMPs, and SIMDs have disappeared while clusters of various kinds grew from 3% to 67%. More- over, most of the MPPs in the list look similar to clusters. Performance is measured as the speed of running Linpack, which solves a dense system of linear equations. This list at www.top500.org is updated twice a year. This site uses the term constellation to mean a network of SMP servers and the term cluster to mean a cluster of PCs or workstations, which can be either uniprocessors or small SMPs. This vague distinction is not used in this text; in this book, a cluster is a collection of computers connected by a standard LAN that is used for a common task. communicate through shared variables in memory, with all processors capable of accessing any memory location via loads and stores. As processors operating in parallel will normally share data, they also need to coordinate when operating on shared data; otherwise, one processor could start working on data before another is finished with it. This coordination is called syn- synchronization The process of coordinating the behavior of chronization. When sharing is supported with a single address space, there must two or more processes, which be a separate mechanism for synchronization. One approach uses a lock: only one may be running on different processor at a time can acquire the lock, and other processors interested in shared processors. data must wait until the original processor unlocks the variable. Locking is lock A synchronization device described in Section 9.3. that allows access to data to only Single address space multiprocessors come in two styles. The first takes the one processor at a time. same time to access main memory no matter which processor requests it and no 9-6 Chapter 9 Multiprocessors and Clusters 1,000,000 Cluster SMP 100,000 symmetric multiprocessor 10,000 (SMP) or uniform memory 1 10 100 1,000 access (UMA) A multiproces- Number processors sor in which accesses to main memory take the same amount FIGURE 9.1.2 Performance versus number of processor for TPC-C on a log-log scale. of time no matter which proces- These plots are for computers running version 5 of the TPC-C benchmark. Note that even the smallest sor requests the access and no computer has 2 processors. Clusters get high performance by scaling. They can sustain 2500–3800 transac- matter which word is asked. tions per minute per processor from 32 to 280 processors. Not only do clusters have the highest tpmC rat- ing, they have better cost-performance (/tpmC) than any SMP with a total cost over 1 million. nonuniform memory access (NUMA) A type of single- address space multiprocessor in which some memory accesses matter which word is requested. Such machines are called uniform memory are faster than others depending access (UMA) multiprocessors or symmetric multiprocessors (SMP). In the sec- which processor asks for which ond style, some memory accesses are faster than others depending on which word. processor asks for which word. Such machines are called nonuniform memory message passing access (NUMA) multiprocessors. As you might expect, the programming chal- Communicating between lenges are different for a NUMA multiprocessor versus a UMA multiprocessor, multiple processors by explic- but NUMA machines can scale to larger sizes and hence are potentially higher itly sending and receiving performance. Figure 9.1.3 shows the number of processors and nonlocal memory information. access times for commercial SMPs and NUMAs. send message routine The alternative model for communicating uses message passing for communi- A routine used by a processor in cating among processors. Message passing is required for machines with private machines with private memo- memories, in contrast to shared memory. One example is a cluster, which proces- ries to pass to another proces- sors in different desktop computers communicate by passing messages over a local sor. area network. Provided the system has routines to send and receive messages, receive message routine A coordination is built in with message passing since one processor knows when a routine used by a processor in message is sent, and the receiving processor knows when a message arrives. The machines with private memo- receiving processor can then send a message back to the sender saying the message ries to accept a message from has arrived if the sender needs that confirmation. another processor. TPC-C transactions per minute 9.1 Introduction 9-7 Year SMP or Maximum Interconnection Typical remote memory access Multiprocessor shipped NUMA processors network time (ns) Sun Starfire servers 1996 SMP 64 multiple address 500 buses, data switch SGI Origin 3000 1999 NUMA 512 fat hypercube 500 Cray T3E 1996 NUMA 2048 2-way 3D torus 300 HP V series 1998 SMP 32 8 × 8 crossbar 1000 Compaq AlphaServer GS 1999 SMP 32 switched buses 400 Sun V880 2002 SMP 8 switched buses 240 HP Superdome 9000 2003 SMP 64 switched buses 275 FIGURE 9.1.3 Typical remote access times to retrieve a word from a remote memory in shared-memory multiprocessors. In addition to two main communication styles, multiprocessors are con- structed in two basic organizations: processors connected by a single bus, and processors connected by a network. The number of processors in the multiproces- sor has a lot to do with this choice. We will examine these two styles in detail in Sections 9.3 and 9.4. Let’s start by looking at the general issues in programming multiprocessors. Figure 9.1.4 shows the relationship between the number of processors in The BIG a multiprocessor and choice of shared address versus message-passing communication and the choice of bus versus network physical connec- Picture tion. Shared address is further divided between uniform and nonuniform memory access. Although there are many choices for some numbers of processors, for other regions there is widespread agreement. 9-8 Chapter 9 Multiprocessors and Clusters Category Choice Number of processors Message passing 8–2048 Communicationmodel Shared NUMA 8–256 address UMA 2–64 Physical connection Network 8–256 Bus 2–36 FIGURE 9.1.4 Options in communication style and physical connec tion for multiprocessors as the number of processors varies. Note that the shared address space is divided into uniform memory access (UMA) and nonuniform memory access (NUMA) machines. A major concern which is frequently voiced in connec- 9.2 Programming Multiprocessors 9.2 tion with very fast comput- ing machines . . . is that they will . . . run out of work. . . . The bad news is that it remains to be seen how many important applications will It must be considered that run faster on multiprocessors via parallel processing. The obstacle is not the price . . . past problem size was of the uniprocessor used to compose multiprocessors, the flaws in topologies of dictated by the speed of the interconnection networks, or the unavailability of appropriate programming lan- computing machines then guages; the difficulty has been that too few important application programs have available. . . . For faster been rewritten to complete tasks sooner on multiprocessors. Because it is even machines, the same auto- harder to find applications that can take advantage of many processors, the chal- matic mechanism will exert lenge is greater for large-scale multiprocessors. pressure towards problems of Because of the programming difficulty, most parallel processing success stories larger size. are a result of software wizards developing a parallel subsystem that presents a John von Neumann, address sequential interface. Examples include databases, file servers, computer-aided presented at IBM seminar on design packages, and multiprocessing operating systems. scientific computation, However, why is this so? Why should parallel processing programs be so much November 1949 harder to develop than sequential programs? The first reason is that you must get good performance and efficiency from the parallel program on a multiprocessor; otherwise, you would use a uniprocessor, as programming is easier. In fact, uniprocessor design techniques such as superscalar and out-of-order execution take advantage of instruction-level parallelism, nor- mally without involvement of the programmer. Such innovation reduces the demand for rewriting programs for multiprocessors. Why is it difficult to write multiprocessor programs that are fast, especially as the number of processors increases? As an analogy, think of the communication overhead for a task done by one person compared to the overhead for a task done by a committee, especially as the size of the committee increases. Although n peo- 9.2 Programming Multiprocessors 9-9 ple may have the potential to finish any task n times faster, the communication overhead for the group may prevent it; n-fold speedup becomes especially unlikely as n increases. (Imagine the change in communication overhead if a committee grows from 10 people to 1000 people to 1,000,000.) Another reason why it is difficult to write parallel processing programs is that the programmer must know a good deal about the hardware. On a uniprocessor, the high-level language programmer writes the program largely ignoring the underlying machine organization—that’s the job of the compiler. Alas, it’s not that simple for multiprocessors. Although this second obstacle is beginning to lessen, our discussion in Chapter 4 reveals a third obstacle: Amdahl’s law. It reminds us that even small parts of a program must be parallelized to reach their full potential; thus coming close to linear speedup involves discovering new algorithms that are inherently parallel. Speedup Challenge Suppose you want to achieve linear speedup with 100 processors. What frac- EXAMPLE tion of the original computation can be sequential? Amdahl’s law (page 267) says, Execution time after improvement = ANSWER Execution time affected by improvement + Execution time unaffected - - Amount of improvement Substituting for the goal of linear speedup with 100 processors means the execution time is reduced by 100: Execution time before improvement -= 100 Execution time affected by improvement - - + Execution time unaffected 100 Since Execution time before improvement = Execution time affected by improvement + Execution time unaffected 9-10 Chapter 9 Multiprocessors and Clusters if we substitute this in the equation above, we get Execution time affected by improvement + Execution time unaffected - 100 Execution time affected by improvement = - - + Execution time unaffected 100 Simplifying, we get Execution time unaffected by improvement - - = Execution time unaffected 100 This can only be true if Execution time unaffected is 0. Accordingly, to achieve linear speedup with 100 processors, none of the original computation can be sequential. Put another way, to get a speedup of 99 from 100 processors means the percentage of the original program that was sequential would have to be 0.01% or less. Yet, there are applications with substantial parallelism. Speedup Challenge, Bigger Problem Suppose you want to perform two sums: one is a sum of two scalar variables EXAMPLE and one is a matrix sum of a pair of two-dimensional arrays, size 1000 by 1000. What speedup do you get with 1000 processors? If we assume performance is a function of the time for an addition, t, then ANSWER there is 1 addition that does not benefit from parallel processors and 1,000,000 additions that do. If the time before is 1,000,001t, Execution time after improvement = Execution time affected by improvement + Execution time unaffected - - Amount of improvement 1,000,000t Execution time after improvement = - + 1t = 1001 1000 Speedup is then 1,000,001 Speedup== - 999 1001 Even if the sequential portion expanded to 100 sums of scalar variables versus one sum of a pair of 1000 by 1000 arrays, the speedup would still be 909.9.3 Multiprocessors Connected by a Single Bus 9-11 Multiprocessors Connected by a Single 9.3 Bus 9.3 The high performance and low cost of the microprocessor inspired renewed inter- est in multiprocessors in the 1980s. Several microprocessors can usefully be placed on a common bus for several reasons:  Each microprocessor is much smaller than a multichip processor, so more processors can be placed on a bus.  Caches can lower bus traffic.  Mechanisms were invented to keep caches and memory consistent for multi- processors, just as caches and memory are kept consistent for I/O, thereby simplifying programming. Figure 9.3.1 is a drawing of a generic single-bus multiprocessor. Traffic per processor and the bus bandwidth determine the useful number of processors in such a multiprocessor. The caches replicate data in their faster mem- ories both to reduce the latency to the data and to reduce the memory traffic on the bus. Parallel Program (Single Bus) Suppose we want to sum 100,000 numbers on a single-bus multiprocessor EXAMPLE computer. Let’s assume we have 100 processors. The first step again would be to split the set of numbers into subsets of the same size. We do not allocate the subsets to a different memory, since there is ANSWER a single memory for this machine; we just give different starting addresses to each processor. Pn is the number of the processor, between 0 and 99. All pro- cessors start the program by running a loop that sums their subset of num- bers: sumPn = 0; for (i = 1000Pn; i 1000(Pn+1); i = i + 1) sumPn = sumPn + Ai; / sum the assigned areas/9-12 Chapter 9 Multiprocessors and Clusters The next step is to add these many partial sums, so we divide to conquer. Half of the processors add pairs of partial sums, then a quarter add pairs of the new partial sums, and so on until we have the single, final sum. We want each processor to have its own version of the loop counter variable i, so we must indicate that it is a “private” variable. In this example, the two processors must synchronize before the “consum- er” processor tries to read the result from the memory location written by the “producer” processor; otherwise, the consumer may read the old value of the data. Here is the code (half is private also): half = 100; / 100 processors in multiprocessor/ repeat synch(); / wait for partial sum completion/ if (half%2 = 0 && Pn == 0) sum0 = sum0 + sumhalf-1; / Conditional sum needed when half is odd; Processor0 gets missing element / half = half/2; / dividing line on who sums / if (Pn half) sumPn = sumPn + sumPn+half; until (half == 1); / exit with final sum in Sum0 / Recall from Chapter 8 that I/O can experience inconsistencies in the value of cache coherency Consistency data between the version in memory and the version in the cache. This cache in the value of data between the coherence problem applies to multiprocessors as well as I/O. Unlike I/O, which versions in the caches of several rarely uses multiple data copies (a situation to be avoided whenever possible), as processors. the second half of the example suggests, multiple processors routinely require copies of the same data in multiple caches. Alternatively, accesses to shared data could be forced always to go around the cache to memory, but that would be too slow and it would require too much bus bandwidth; performance of a multipro- cessor program depends on the performance of the system when sharing data. Processor Processor . . . Processor Cache Cache . . . Cache Single bus Memory I/O FIGURE 9.3.1 A single-bus multiprocessor. Typical size is between 2 and 32 processors.9.3 Multiprocessors Connected by a Single Bus 9-13 The protocols to maintain coherence for multiple processors are called cache coherence protocols. The next few subsections explain cache coherence protocols and methods of synchronizing processors using cache coherency. Multiprocessor Cache Coherence The most popular protocol to maintain cache coherence is called snooping. snooping cache coherency A method for maintaining cache Figure 9.3.2 shows how caches access memory over a common bus. All cache con- coherency in which all cache trollers monitor, or snoop, on the bus to determine whether they have a copy of controllers monitor or snoop on the shared block. the bus to determine whether or Snooping became popular with machines of the 1980s, which used single buses not they have a copy of the to their main memories. These uniprocessors were extended by adding multiple desired block. processors on that bus to give easy access to the shared memory. Caches were then added to improve the performance of each processor, leading to schemes to keep the caches up-to-date by snooping on the information over that shared bus. Maintaining coherence has two components: reads and writes. Multiple copies are not a problem when reading, but a processor must have exclusive access to write a word. Processors must also have the most recent copy when reading an object, so all processors must get new values after a write. Thus, snooping proto- cols must locate all the caches that share an object to be written. The consequence of a write to shared data is either to invalidate all other copies or to update the shared copies with the value being written. The status bits already in a cache block are expanded for snooping protocols, and that information is used in monitoring bus activities. On a read miss, all caches check to see if they have a copy of the requested block and then take the appropriate action, such as supplying the data to the cache that missed. Similarly, on a write, all caches check to see if they have a copy and then act, either invalidat- ing or updating their copy to the new value. . . . Processor Processor Processor Snoop Cache tag Snoop Cache tag Snoop Cache tag . . . tag and data tag and data tag and data . . . Single bus Memory I/O FIGURE 9.3.2 A single-bus multiprocessor using snooping cache coherency. The extra set of tags, shown in color, is used to handle snoop requests. The tags are duplicated to reduce the demands of snooping on the caches.9-14 Chapter 9 Multiprocessors and Clusters Since every bus transaction checks cache address tags, you might assume that it interferes with the processor. It would interfere if not for duplicating the address tag portion of the cache—not the whole cache—to get an extra read port for snooping (see Figure 9.3.2). This way, snooping rarely interferes with the proces- sor’s access to the cache. When there is interference, the processor will likely stall because the cache is unavailable. Commercial cache-based multiprocessors use write-back caches because write- back reduces bus traffic and thereby allows more processors on a single bus. To preserve that precious communications bandwidth, all commercial machines use write-invalidate A type of write-invalidate as the standard coherence protocol: the writing processor causes snooping protocol in which all copies in other caches to be invalidated before changing its local copy; it is then the writing processor causes all free to update the local data until another processor asks for it. The writing pro- copies in other caches to be cessor issues an invalidation signal over the bus, and all caches check to see if they invalidated before changing its have a copy; if so, they must invalidate the block containing the word. Thus, this local copy, which allows it to scheme allows multiple readers but only a single writer. update the local data until Measurements to date indicate that shared data has lower spatial and temporal another processor asks for it. locality than other types of data. Thus, shared data misses often dominate cache behavior, even though they may be just 10% to 40% of the data accesses. Figure 9.3.3 shows the fraction of misses due to coherence as the number of processors varies in an SMP. One insight is that block size plays an important role in cache coherency. For Hardware example, take the case of snooping on a cache with a block size of eight words, Software with a single word alternatively written and read by two processors. A protocol Interface that only broadcasts or sends a single word has an advantage over one that trans- fers the full block. false sharing A sharing Large blocks can also cause what is called false sharing: When two unrelated situation in which two unrelated shared variables are located in the same cache block, the full block is exchanged shared variables are located in the same cache block and the between processors even though the processors are accessing different variables full block is exchanged between (see Exercises 9.5 and 9.6). Compiler research is under way to reduce false sharing processors even though the by allocating highly correlated data to the same cache block and thereby reduce processors are accessing differ- cache miss rates. ent variables. Elaboration: In a multiprocessor using cache coherence over a single bus, what hap- pens if two processors try to write to the same shared data word in the same clock cycle? The bus arbiter decides which processor gets the bus first, and this processor will invalidate or update the other processor’s copy, depending on the protocol. The second processor then does its write. Bus arbitration forces sequential behavior from writes to the same block by different processors, and this explains how writes from dif- ferent processors to different words in the same block will work correctly.9.3 Multiprocessors Connected by a Single Bus 9-15 FFT LU 8% 2% 7% Miss rate 1% 6% 0% 1248 16 5% Miss rate 4% Processor count 3% 2% Ocean 1% 20% 0% 1248 16 18% 16% Processor count 14% 12% Barnes Miss rate 10% 1% Miss rate 8% 6% 0% 1248 16 4% 2% Processor count 0% 1248 16 Processor count Coherence miss rate Capacity miss rate FIGURE 9.3.3 Data miss rates can vary in nonobvious ways as the processor count is increased from 1 to 16. The miss rates include both coherence and capacity miss rates. The compul- sory misses in these benchmarks are all very small and are included in the capacity misses. For all these runs, the cache size is 64 KB, two-way set associative, with 32-byte blocks. Notice that the scale on the y-axis for each benchmark is different, so that the behavior of the individual benchmarks can be seen clearly. (From Figure 6.23 on page 572 in Hennessy and Patterson, Computer Architecture: A Quantitative Approach, third edition, 2003.) The policy of when a processor sees a write from another processor is called the memory consistency model. The most conservative is called sequential consistency: the result of any execution is the same as if the accesses of each processor were kept in order and the accesses among different processors were interleaved. Some machines use more liberal models to achieve higher memory performance. barrier synchronization A Elaboration: Our example used a barrier synchronization primitive; processors wait synchronization scheme in at the barrier until every processor has reached it. Then they proceed. Barrier synchro- which processors wait at the nization allows all processors to rapidly synchronize. This function can be implemented barrier and do not proceed until in software or with the lock synchronization primitive, described shortly. every processor has reached it.9-16 Chapter 9 Multiprocessors and Clusters An Example of a Cache Coherence Protocol To illustrate the intricacies of a cache coherence protocol, Figure 9.3.4 shows a finite state transition diagram for a write-invalidation protocol based on a write- back policy. Each cache block is in one of three states: Shared (read only): This cache block is clean (not written) and may be 1. shared. Modified (read/write): This cache block is dirty (written) and may not be 2. shared. 3. Invalid: This cache block does not have valid data. The three states of the protocol are duplicated in the figure to show transitions based on processor actions as opposed to transitions based on bus operations. This duplication is done only for purposes of illustration; there is really only one finite state machine per cache block, with stimuli coming either from the attached processor or from the bus. This abstraction applies to caches blocks not resident in the case as well; these state machines are obviously all in the invalid state. Transitions in the state of a cache block happen on read misses, write misses, or write hits; read hits do not change cache state. Let’s start with a read miss. Let’s call the block to be replaced the victim. When the processor has a read miss, it will acquire the bus, and write back the victim if it was in the Modified state (dirty). All the caches in the other processors monitor the read miss to see if this block is in their cache. If one has a copy and it is in the Modified state, then the block is written back and its state is changed to the Invalid state. (Some protocols would change the state to Shared.) The read miss is then satisfied by reading from mem- ory, and the state of the block is set to Shared. Note that the block is read from memory whether a copy is in a cache or not in this protocol. Writes are more complex. Let’s try write hits. A write hit to a Modified block cause no protocol action. A write hit to a Shared block causes the cache to acquire the bus, send an invalidate signal to knock out any other copies, modify the por- tion of the block being written, and change the state to Modified. Last is write misses. A write miss to an Invalid block causes the cache to acquire the bus, read the full missing block, modify the portion of the block being written, and change the state to Modified. A write miss to a Shared block in another cache causes the cache to acquire the bus, send an invalidate signal to knock out all cop- ies, read the full missing block, modify the portion of the block being written, and MESI cache coherency change the state to Modified. protocol A write-invalidate As you might imagine, there are many variations on cache coherence that are protocol whose name is an much more complicated than this simple model. The one found on the Pentium 4 acronym for the four states of the protocol: Modified, and many other microprocessors is called MESI, a write-invalidate protocol Exclusive, Shared, Invalid. whose name is an acronym for the four states of the protocol: Modified, Exclusive,9.3 Multiprocessors Connected by a Single Bus 9-17 Invalid Processor read miss Shared (not valid (clean) cache block) Processor Processor miss read hit Processor write miss Processor Processor write hit miss (write dirty block to memory) Modified (dirty) Processor read hit or write hit a. Cache state transitions using signals from the processor Invalid Shared (not valid (clean) Invalidate or cache block) another processor has a write miss for this block Another processor has a read (seen on bus) miss or a write miss for this block (seen on bus); write back old block Modified (dirty) b. Cache state transitions using signals from the bus FIGURE 9.3.4 A write-invalidate cache coherence protocol. The upper part of the diagram shows state transitions based on actions of the processor associated with this cache; the lower part shows transitions based on actions of other processors as seen as operations on the bus. There is really only one state machine in a cache block, although there are two represented here to clarify when a transition occurs. The black arrows and actions specified in black text would be found in caches without coherency; the col- ored arrows and actions are added to achieve cache coherency. (Send invalidate)9-18 Chapter 9 Multiprocessors and Clusters Shared, Invalid. Modified and Invalid are the same as above. The Shared state of Figure 9.3.4 is divided, depending on whether there are multiple copies (Shared state) or there is just one (Exclusive state). In either case, memory has an up-to- date version of the data. This extra Exclusive state means there is only one copy of the block, so a write hit doesn’t need to invalidate. A write hit to a Shared block in Figure 9.3.4 requires an invalidation, since there may be multiple copies. Other variations on coherence protocols include whether the other caches try to supply the block if they have a copy, and whether the block must be invalidated on a read miss. Synchronization Using Coherency One of the major requirements of a single-bus multiprocessor is to be able to coordinate processes that are working on a common task. Typically, a pro- grammer will use lock variables (also known as semaphores) to coordinate or syn- chronize the processes. The challenge for the architect of a multiprocessor is to provide a mechanism to decide which processor gets the lock and to provide the operation that locks a variable. Arbitration is easy for single-bus multiprocessors, since the bus is the only path to memory: the processor that gets the bus locks out atomic swap operation An all other processors from memory. If the processor and bus provide an atomic operation in which the proces- swap operation, programmers can create locks with the proper semantics. Here sor can both read a location and the adjective atomic means indivisible, so an atomic swap means the processor can write it in the same bus opera- both read a location and set it to the locked value in the same bus operation, pre- tion, preventing any other pro- venting any other processor or I/O device from reading or writing memory until cessor or I/O device from the swap completes. reading or writing memory until Figure 9.3.5 shows a typical procedure for locking a variable using an atomic swap instruction. Assume that 0 means unlocked (“go”) and 1 means locked (“stop”). A processor first reads the lock variable to test its state. A processor keeps reading and testing until the value indicates that the lock is unlocked. The proces- sor then races against all other processors that were similarly spin waiting to see who can lock the variable first. All processors use an atomic swap instruction that reads the old value and stores a 1 (“stop”) into the lock variable. The single winner will see the 0 (“go”), and the losers will see a 1 that was placed there by the winner. (The losers will continue to write the variable with the locked value of 1, but that doesn’t change its value.) The winning processor then executes the code that updates the shared data. When the winner exits, it stores a 0 (“go”) into the lock variable, thereby starting the race all over again. MIPS does not include an atomic swap instruction. An alternative is to have a pair of instructions where the second instruction returns a value from which it can be deduced whether the pair of instructions was executed as if the instructions were atomic. The pair of instructions is effectively atomic if it appears as if all other operations executed by any processor occurred before or after the pair. Thus, when an instruction pair is effectively atomic, no other processor can change the value between the instruction pair. 9.3 Multiprocessors Connected by a Single Bus 9-19 Load lock variable No Unlocked? (= 0?) Yes Try to lock variable using swap: read lock variable and then set variable to locked value (1) No Succeed? (= 0?) Yes Begin update of shared data Finish update of shared data Unlock: set lock variable to 0 FIGURE 9.3.5 Steps to acquire a lock or semaphore to synchronize processes and then to release the lock on exit from the key section of code. The MIPS pair of instructions includes a special load called a load linked or load locked (ll) and a special store called a store conditional (sc). These instruc- tions are used in sequence: If the contents of the memory location specified by the load linked are changed before the store conditional to the same address occurs, then the store conditional fails. If the processor does a context switch between the two instructions, then the store conditional also fails. The store conditional is defined to return a value indicating whether the store was successful. Since the load linked returns the initial value and the store conditional returns 1 if it . . .9-20 Chapter 9 Multiprocessors and Clusters succeeds and 0 otherwise, the following sequence implements an atomic exchange on the memory location specified by the contents of t1: try: mov t3,t4 move exchange value ll t2,0(t1) load linked sc t3,0(t1) store conditional changes t3 beqz t3,try branch if store cond fails (=0) nop (delayed branch) mov t4,t2 put load value into t4 At the end of this sequence the contents of t4 and the memory location specified by t1 have been atomically swapped. Any time a processor intervenes and modi- fies the value in memory between the ll and sc instructions, the sc returns 0 in t3, causing the code sequence to try again. Let’s examine how the spin lock scheme of Figure 9.3.5 works with bus-based cache coherency. One advantage of this algorithm is that it allows processors to spin wait on a local copy of the lock in their caches. This reduces the amount of bus traffic; Figure 9.3.6 shows the bus and cache operations for multiple proces- sors trying to lock a variable. Once the processor with the lock stores a 0 into the lock, all other caches see that store and invalidate their copy of the lock variable. Then they try to get the new value for the lock of 0. This new value starts the race to see who can set the lock first. The winner gets the bus and stores a 1 into the lock; the other caches replace their copy of the lock variable containing 0 with a 1. This value indicates the variable is already locked, so they must return to testing and spinning. This scheme has difficulty scaling up to many processors because of the com- munication traffic generated when the lock is released. Multiprocessors Connected by a 9.4 Network 9.4 Single-bus designs are attractive, but limited because the three desirable bus char- acteristics are incompatible: high bandwidth, low latency, and long length. There is also a limit to the bandwidth of a single memory module attached to a bus. Thus, a single bus imposes practical constraints on the number of processors that can be connected to it. To date, the largest number of processors connected to a single bus in a commercial computer is 36, and this number seems to be dropping over time. If the goal is to connect many more processors together, then the computer designer needs to use more than a single bus. Figure 9.4.1 shows how this can be organized. Note that in Figure 9.3.1 on page 9-12, the connection medium—the bus—is between the processors and memory, whereas in Figure 9.4.1, memory is9.4 Multiprocessors Connected by a Network 9-21 Step Processor P0 Processor P1 Processor P2 Bus activity Memory 1 Has lock Spins, testing if lock = 0 Spins, testing if lock = 0 None 2 Sets lock to 0; sends Spins, testing if lock = 0 Spins, testing if lock = 0 Write-invalidate of invalidate over bus lock variable sent from P0 3 Cache miss Cache miss Bus services P2’s cache miss 4 Responds to P2’s (waits for cache miss) (waits for cache miss) Response to P2’s Update memory with cache miss; sends cache miss block from P0 lock = 0 5 (waits for cache miss) Tests lock = 0; succeeds Bus services P1’s cache miss 6 Tests lock = 0; succeeds Attempt swap; needs Response to P1’s Responds to P1’s cache write permission cache miss miss; sends lock variable 7 Attempt swap; needs write Send invalidate to Bus services P2’s permission gain write permission invalidate 8 Cache miss Swap; reads lock = 0 Bus services P1’s and sets to 1 cache miss 9 Swap; read lock = 1 sets to 1; Responds to P1’s cache Response to P2’s go back to spin miss, sends lock = 1 cache miss FIGURE 9.3.6 Cache coherence steps and bus traffic for three processors, P0, P1, and P2. This figure assumes write-invalidate coherency. P0 starts with the lock (step 1). P0 exits and unlocks the lock (step 2). P1 and P2 race to see which reads the unlocked value during the swap (steps 3–5). P2 wins and enters the critical section (steps 6 and 7), while P1 spins and waits (steps 7 and 8). The “critical section” is the name for the code between the lock and the unlock. When P2 exits the critical section, it sets the lock to 0, which will invalidate the copy in P1’s cache, restarting the process. . . . Processor Processor Processor Cache Cache . . . Cache Memory Memory . . . Memory Network FIGURE 9.4.1 The organization of a network-connected multiprocessor. Note that, in con- trast to Figure 9.3.1 on page 9-12, the multiprocessor connection is no longer between memory and the processor. attached to each processor, and the connection medium—the network—is between these combined nodes. For single-bus systems, the medium is used on every memory access, while in the latter case it is used only for interprocessor communication.

Advise: Why You Wasting Money in Costly SEO Tools, Use World's Best Free SEO Tool Ubersuggest.