Lecture notes of Computer architecture and organization

Dr.ShaneMatts,United States,Teacher
Published Date:23-07-2017
1 Fundamentals of Computer Design

And now for something completely different.
Monty Python’s Flying Circus

1.1 Introduction

Computer technology has made incredible progress in the roughly 60 years since the first general-purpose electronic computer was created. Today, less than $500 will purchase a personal computer that has more performance, more main memory, and more disk storage than a computer bought in 1985 for 1 million dollars. This rapid improvement has come both from advances in the technology used to build computers and from innovation in computer design.

Although technological improvements have been fairly steady, progress arising from better computer architectures has been much less consistent. During the first 25 years of electronic computers, both forces made a major contribution, delivering performance improvement of about 25% per year. The late 1970s saw the emergence of the microprocessor. The ability of the microprocessor to ride the improvements in integrated circuit technology led to a higher rate of improvement: roughly 35% growth per year in performance.

This growth rate, combined with the cost advantages of a mass-produced microprocessor, led to an increasing fraction of the computer business being based on microprocessors. In addition, two significant changes in the computer marketplace made it easier than ever before to be commercially successful with a new architecture. First, the virtual elimination of assembly language programming reduced the need for object-code compatibility. Second, the creation of standardized, vendor-independent operating systems, such as UNIX and its clone, Linux, lowered the cost and risk of bringing out a new architecture.

These changes made it possible to develop successfully a new set of architectures with simpler instructions, called RISC (Reduced Instruction Set Computer) architectures, in the early 1980s.
The RISC-based machines focused the attention of designers on two critical performance techniques: the exploitation of instruction-level parallelism (initially through pipelining and later through multiple instruction issue) and the use of caches (initially in simple forms and later using more sophisticated organizations and optimizations).

The RISC-based computers raised the performance bar, forcing prior architectures to keep up or disappear. The Digital Equipment VAX could not, and so it was replaced by a RISC architecture. Intel rose to the challenge, primarily by translating x86 (or IA-32) instructions into RISC-like instructions internally, allowing it to adopt many of the innovations first pioneered in the RISC designs. As transistor counts soared in the late 1990s, the hardware overhead of translating the more complex x86 architecture became negligible.

Figure 1.1 shows that the combination of architectural and organizational enhancements led to 16 years of sustained growth in performance at an annual rate of over 50%, a rate that is unprecedented in the computer industry.

The effect of this dramatic growth rate in the 20th century has been twofold. First, it has significantly enhanced the capability available to computer users. For many applications, the highest-performance microprocessors of today outperform the supercomputer of less than 10 years ago.
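
As a rough check on what these growth rates compound to, the sketch below uses only rates quoted in this chapter: about 25% per year in the early decades, roughly 35% per year for microprocessors riding integrated circuit technology, and about 52% per year over the 16-year period covered by Figure 1.1. The function name `compound_growth` is hypothetical, and treating 35% per year as the technology-only baseline is an assumption made for illustration, not something the text states.

```python
def compound_growth(annual_rate: float, years: int) -> float:
    """Total improvement factor after compounding annual_rate for `years` years."""
    return (1.0 + annual_rate) ** years


YEARS = 16  # the sustained-growth period cited for Figure 1.1

with_architecture = compound_growth(0.52, YEARS)  # about 52% per year
technology_only = compound_growth(0.35, YEARS)    # assumed baseline: about 35% per year
early_rate = compound_growth(0.25, YEARS)         # about 25% per year

# How far ahead the 52% curve ends up relative to each baseline
gap_vs_35 = with_architecture / technology_only
gap_vs_25 = with_architecture / early_rate

print(f"52%/year for {YEARS} years: {with_architecture:.0f}x")
print(f"gap over a 35%/year baseline: {gap_vs_35:.1f}x")
print(f"gap over a 25%/year baseline: {gap_vs_25:.1f}x")
```

Under these assumptions the gap over a 35% baseline comes out a little under seven, which is in line with the factor of seven that the Figure 1.1 caption quotes for 2002. Measured against 25% per year, the gap would be considerably larger.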

[Figure 1.1: plot of performance relative to the VAX-11/780 (logarithmic scale) against year, 1978 to 2006, for processors ranging from the VAX-11/780 to the 64-bit Intel Xeon at 3.6 GHz, with trend lines of 25%/year, 52%/year, and 20%/year.]

Figure 1.1 Growth in processor performance since the mid-1980s. This chart plots performance relative to the VAX 11/780 as measured by the SPECint benchmarks (see Section 1.8). Prior to the mid-1980s, processor performance growth was largely technology driven and averaged about 25% per year. The increase in growth to about 52% since then is attributable to more advanced architectural and organizational ideas. By 2002, this growth led to a difference in performance of about a factor of seven. Performance for floating-point-oriented calculations has increased even faster. Since 2002, the limits of power, available instruction-level parallelism, and long memory latency have slowed uniprocessor performance growth to about 20% per year. Since SPEC has changed over the years, performance of newer machines is estimated by a scaling factor that relates the performance for two different versions of SPEC (e.g., SPEC92, SPEC95, and SPEC2000).

Second, this dramatic rate of improvement has led to the dominance of microprocessor-based computers across the entire range of computer design. PCs and workstations have emerged as major products in the computer industry. Minicomputers, which were traditionally made from off-the-shelf logic or from gate arrays, have been replaced by servers made using microprocessors.
Mainframes have been almost replaced with multiprocessors consisting of small numbers of off-the-shelf microprocessors. Even high-end supercomputers are being built with collections of microprocessors.

These innovations led to a renaissance in computer design, which emphasized both architectural innovation and efficient use of technology improvements. This rate of growth has compounded so that by 2002, high-performance microprocessors are about seven times faster than what would have been obtained by relying solely on technology, including improved circuit design.

However, Figure 1.1 also shows that this 16-year renaissance is over. Since 2002, processor performance improvement has dropped to about 20% per year due to the triple hurdles of maximum power dissipation of air-cooled chips, little instruction-level parallelism left to exploit efficiently, and almost unchanged memory latency. Indeed, in 2004 Intel canceled its high-performance uniprocessor projects and joined IBM and Sun in declaring that the road to higher performance would be via multiple processors per chip rather than via faster uniprocessors.

This signals a historic switch from relying solely on instruction-level parallelism (ILP), the primary focus of the first three editions of this book, to thread-level parallelism (TLP) and data-level parallelism (DLP), which are featured in this edition. Whereas the compiler and hardware conspire to exploit ILP implicitly without the programmer’s attention, TLP and DLP are explicitly parallel, requiring the programmer to write parallel code to gain performance.

This text is about the architectural ideas and accompanying compiler improvements that made the incredible growth rate possible in the last century, the reasons for the dramatic change, and the challenges and initial promising approaches to architectural ideas and compilers for the 21st century.
At the core is a quantitative approach to computer design and analysis that uses empirical observations of programs, experimentation, and simulation as its tools. It is this style and approach to computer design that is reflected in this text. This book was written not only to explain this design style, but also to stimulate you to contribute to this progress. We believe the approach will work for explicitly parallel computers of the future just as it worked for the implicitly parallel computers of the past.

1.2 Classes of Computers

In the 1960s, the dominant form of computing was on large mainframes: computers costing millions of dollars and stored in computer rooms with multiple operators overseeing their support. Typical applications included business data processing and large-scale scientific computing. The 1970s saw the birth of the minicomputer, a smaller-sized computer initially focused on applications in scientific laboratories, but rapidly branching out with the popularity of time-sharing, in which multiple users share a computer interactively through independent terminals. That decade also saw the emergence of supercomputers, which were high-performance computers for scientific computing. Although few in number, they were important historically because they pioneered innovations that later trickled down to less expensive computer classes. The 1980s saw the rise of the desktop computer based on microprocessors, in the form of both personal computers and workstations. The individually owned desktop computer replaced time-sharing and led to the rise of servers, computers that provided larger-scale services such as reliable, long-term file storage and access, larger memory, and more computing power.
The 1990s saw the emergence of the Internet and the World Wide Web, the first successful handheld computing devices (personal digital assistants or PDAs), and the emergence of high-performance digital consumer electronics, from video games to set-top boxes. The extraordinary popularity of cell phones has been obvious since 2000, with rapid improvements in functions and sales that far exceed those of the PC. These more recent applications use embedded computers, where computers are lodged in other devices and their presence is not immediately obvious.

Feature                          Desktop                                   Server                                  Embedded
Price of system                  $500–$5000                                $5000–$5,000,000                        $10–$100,000 (including network routers at the high end)
Price of microprocessor module   $50–$500 (per processor)                  $200–$10,000 (per processor)            $0.01–$100 (per processor)
Critical system design issues    Price-performance, graphics performance   Throughput, availability, scalability   Price, power consumption, application-specific performance

Figure 1.2 A summary of the three mainstream computing classes and their system characteristics. Note the wide range in system price for servers and embedded systems. For servers, this range arises from the need for very large-scale multiprocessor systems for high-end transaction processing and Web server applications. The total number of embedded processors sold in 2005 is estimated to exceed 3 billion if you include 8-bit and 16-bit microprocessors. Perhaps 200 million desktop computers and 10 million servers were sold in 2005.

These changes have set the stage for a dramatic change in how we view computing, computing applications, and the computer markets in this new century. Not since the creation of the personal computer more than 20 years ago have we seen such dramatic changes in the way computers appear and in how they are used.
These changes in computer use have led to three different computing markets, each characterized by different applications, requirements, and computing technologies. Figure 1.2 summarizes these mainstream classes of computing environments and their important characteristics.

Desktop Computing

The first, and still the largest market in dollar terms, is desktop computing. Desktop computing spans from low-end systems that sell for under $500 to high-end, heavily configured workstations that may sell for $5000. Throughout this range in price and capability, the desktop market tends to be driven to optimize price-performance. This combination of performance (measured primarily in terms of compute performance and graphics performance) and price of a system is what matters most to customers in this market, and hence to computer designers. As a result, the newest, highest-performance microprocessors and cost-reduced microprocessors often appear first in desktop systems (see Section 1.6 for a discussion of the issues affecting the cost of computers).

Desktop computing also tends to be reasonably well characterized in terms of applications and benchmarking, though the increasing use of Web-centric, interactive applications poses new challenges in performance evaluation.

Servers

As the shift to desktop computing occurred, the role of servers grew to provide larger-scale and more reliable file and computing services. The World Wide Web accelerated this trend because of the tremendous growth in the demand and sophistication of Web-based services. Such servers have become the backbone of large-scale enterprise computing, replacing the traditional mainframe.

For servers, different characteristics are important. First, dependability is critical. (We discuss dependability in Section 1.7.) Consider the servers running Google, taking orders for Cisco, or running auctions on eBay.
Failure of such server systems is far more catastrophic than failure of a single desktop, since these servers must operate seven days a week, 24 hours a day. Figure 1.3 estimates revenue costs of downtime as of 2000. To bring costs up-to-date, Amazon.com had $2.98 billion in sales in the fall quarter of 2005. As there were about 2200 hours in that quarter, the average revenue per hour was $1.35 million. During a peak hour for Christmas shopping, the potential loss would be many times higher. Hence, the estimated costs of an unavailable system are high, yet Figure 1.3 and the Amazon numbers are purely lost revenue and do not account for lost employee productivity or the cost of unhappy customers.

                                                      Annual losses (millions of $) with downtime of
                              Cost of downtime per    1%              0.5%            0.1%
Application                   hour (thousands of $)   (87.6 hrs/yr)   (43.8 hrs/yr)   (8.8 hrs/yr)
Brokerage operations          6450                    565             283             56.5
Credit card authorization     2600                    228             114             22.8
Package shipping services     150                     13              6.6             1.3
Home shopping channel         113                     9.9             4.9             1.0
Catalog sales center          90                      7.9             3.9             0.8
Airline reservation center    89                      7.9             3.9             0.8
Cellular service activation   41                      3.6             1.8             0.4
Online network fees           25                      2.2             1.1             0.2
ATM service fees              14                      1.2             0.6             0.1

Figure 1.3 The cost of an unavailable system is shown by analyzing the cost of downtime (in terms of immediately lost revenue), assuming three different levels of availability, and that downtime is distributed uniformly. These data are from Kembel 2000 and were collected and analyzed by Contingency Planning Research.

A second key feature of server systems is scalability. Server systems often grow in response to an increasing demand for the services they support or an increase in functional requirements. Thus, the ability to scale up the computing capacity, the memory, the storage, and the I/O bandwidth of a server is crucial.

Lastly, servers are designed for efficient throughput. That is, the overall performance of the server, in terms of transactions per minute or Web pages served per second, is what is crucial. Responsiveness to an individual request remains important, but overall efficiency and cost-effectiveness, as determined by how many requests can be handled in a unit time, are the key metrics for most servers. We return to the issue of assessing performance for different types of computing environments in Section 1.8.

A related category is supercomputers. They are the most expensive computers, costing tens of millions of dollars, and they emphasize floating-point performance. Clusters of desktop computers, which are discussed in Appendix H, have largely overtaken this class of computer. As clusters grow in popularity, the number of conventional supercomputers is shrinking, as is the number of companies that make them.

Embedded Computers

Embedded computers are the fastest growing portion of the computer market. These devices range from everyday machines (most microwaves, most washing machines, most printers, most networking switches, and all cars contain simple embedded microprocessors) to handheld digital devices, such as cell phones and smart cards, to video games and digital set-top boxes.

Embedded computers have the widest spread of processing power and cost. They include 8-bit and 16-bit processors that may cost less than a dime, 32-bit microprocessors that execute 100 million instructions per second and cost under $5, and high-end processors for the newest video games or network switches that cost $100 and can execute a billion instructions per second. Although the range of computing power in the embedded computing market is very large, price is a key factor in the design of computers for this space. Performance requirements do exist, of course, but the primary goal is often meeting the performance need at a minimum price, rather than achieving higher performance at a higher price.
Often, the performance requirement in an embedded application is real-time execution. A real-time performance requirement exists when a segment of the application has an absolute maximum execution time. For example, in a digital set-top box, the time to process each video frame is limited, since the processor must accept and process the next frame shortly. In some applications, a more nuanced requirement exists: the average time for a particular task is constrained as well as the number of instances when some maximum time is exceeded. Such approaches, sometimes called soft real-time, arise when it is possible to occasionally miss the time constraint on an event, as long as not too many are missed. Real-time performance tends to be highly application dependent.

Two other key characteristics exist in many embedded applications: the need to minimize memory and the need to minimize power. In many embedded applications, the memory can be a substantial portion of the system cost, and it is important to optimize memory size in such cases. Sometimes the application is expected to fit totally in the memory on the processor chip; other times the application needs to fit totally in a small off-chip memory. In any event, the importance of memory size translates to an emphasis on code size, since data size is dictated by the application.

Larger memories also mean more power, and optimizing power is often critical in embedded applications. Although the emphasis on low power is frequently driven by the use of batteries, the need to use less expensive packaging (plastic versus ceramic) and the absence of a fan for cooling also limit total power consumption. We examine the issue of power in more detail in Section 1.5.

Most of this book applies to the design, use, and performance of embedded processors, whether they are off-the-shelf microprocessors or microprocessor cores, which will be assembled with other special-purpose hardware.
Indeed, the third edition of this book included examples from embedded computing to illustrate the ideas in every chapter. Alas, most readers found these examples unsatisfactory, as the data that drives the quantitative design and evaluation of desktop and server computers has not yet been extended well to embedded computing (see the challenges with EEMBC, for example, in Section 1.8). Hence, we are left for now with qualitative descriptions, which do not fit well with the rest of the book. As a result, in this edition we consolidated the embedded material into a single appendix. We believe this new appendix (Appendix D) improves the flow of ideas in the text while still allowing readers to see how the differing requirements affect embedded computing.

1.3 Defining Computer Architecture

The task the computer designer faces is a complex one: Determine what attributes are important for a new computer, then design a computer to maximize performance while staying within cost, power, and availability constraints. This task has many aspects, including instruction set design, functional organization, logic design, and implementation. The implementation may encompass integrated circuit design, packaging, power, and cooling. Optimizing the design requires familiarity with a very wide range of technologies, from compilers and operating systems to logic design and packaging.

In the past, the term computer architecture often referred only to instruction set design. Other aspects of computer design were called implementation, often insinuating that implementation is uninteresting or less challenging. We believe this view is incorrect. The architect’s or designer’s job is much more than instruction set design, and the technical hurdles in the other aspects of the project are likely more challenging than those encountered in instruction set design. We’ll quickly review instruction set architecture before describing the larger challenges for the computer architect.

Instruction Set Architecture

We use the term instruction set architecture (ISA) to refer to the actual programmer-visible instruction set in this book. The ISA serves as the boundary between the software and hardware. This quick review of ISA will use examples from MIPS and 80x86 to illustrate the seven dimensions of an ISA. Appendices B and J give more details on MIPS and the 80x86 ISAs.

1. Class of ISA: Nearly all ISAs today are classified as general-purpose register architectures, where the operands are either registers or memory locations. The 80x86 has 16 general-purpose registers and 16 that can hold floating-point data, while MIPS has 32 general-purpose and 32 floating-point registers (see Figure 1.4). The two popular versions of this class are register-memory ISAs such as the 80x86, which can access memory as part of many instructions, and load-store ISAs such as MIPS, which can access memory only with load or store instructions. All recent ISAs are load-store.

2. Memory addressing: Virtually all desktop and server computers, including the 80x86 and MIPS, use byte addressing to access memory operands. Some architectures, like MIPS, require that objects be aligned. An access to an object of size s bytes at byte address A is aligned if A mod s = 0. (See Figure B.5 on page B-9.) The 80x86 does not require alignment, but accesses are generally faster if operands are aligned.

3. Addressing modes: In addition to specifying registers and constant operands, addressing modes specify the address of a memory object. MIPS addressing modes are Register, Immediate (for constants), and Displacement, where a constant offset is added to a register to form the memory address. The 80x86 supports those three plus three variations of displacement: no register (absolute), two registers (based indexed with displacement), and two registers where one register is multiplied by the size of the operand in bytes (based with scaled index and displacement). It also has variants of the last three without the displacement field: register indirect, indexed, and based with scaled index.

4. Types and sizes of operands: Like most ISAs, MIPS and 80x86 support operand sizes of 8-bit (ASCII character), 16-bit (Unicode character or half word), 32-bit (integer or word), 64-bit (double word or long integer), and IEEE 754 floating point in 32-bit (single precision) and 64-bit (double precision). The 80x86 also supports 80-bit floating point (extended double precision).

5. Operations: The general categories of operations are data transfer, arithmetic-logical, control (discussed next), and floating point. MIPS is a simple and easy-to-pipeline instruction set architecture, and it is representative of the RISC architectures being used in 2006. Figure 1.5 summarizes the MIPS ISA. The 80x86 has a much richer and larger set of operations (see Appendix J).

6. Control flow instructions: Virtually all ISAs, including 80x86 and MIPS, support conditional branches, unconditional jumps, procedure calls, and returns. Both use PC-relative addressing, where the branch address is specified by an address field that is added to the PC. There are some small differences. MIPS conditional branches (BEQ, BNE, etc.) test the contents of registers, while the 80x86 branches (JE, JNE, etc.) test condition code bits set as side effects of arithmetic/logic operations. MIPS procedure call (JAL) places the return address in a register, while the 80x86 call (CALLF) places the return address on a stack in memory.

7. Encoding an ISA: There are two basic choices on encoding: fixed length and variable length. All MIPS instructions are 32 bits long, which simplifies instruction decoding. Figure 1.6 shows the MIPS instruction formats. The 80x86 encoding is variable length, ranging from 1 to 18 bytes. Variable-length instructions can take less space than fixed-length instructions, so a program compiled for the 80x86 is usually smaller than the same program compiled for MIPS. Note that choices mentioned above will affect how the instructions are encoded into a binary representation. For example, the number of registers and the number of addressing modes both have a significant impact on the size of instructions, as the register field and addressing mode field can appear many times in a single instruction.

Name     Number   Use                                                      Preserved across a call?
zero     0        The constant value 0                                     N.A.
at       1        Assembler temporary                                      No
v0–v1    2–3      Values for function results and expression evaluation    No
a0–a3    4–7      Arguments                                                No
t0–t7    8–15     Temporaries                                              No
s0–s7    16–23    Saved temporaries                                        Yes
t8–t9    24–25    Temporaries                                              No
k0–k1    26–27    Reserved for OS kernel                                   No
gp       28       Global pointer                                           Yes
sp       29       Stack pointer                                            Yes
fp       30       Frame pointer                                            Yes
ra       31       Return address                                           Yes

Figure 1.4 MIPS registers and usage conventions. In addition to the 32 general-purpose registers (R0–R31), MIPS has 32 floating-point registers (F0–F31) that can hold either a 32-bit single-precision number or a 64-bit double-precision number.

The other challenges facing the computer architect beyond ISA design are particularly acute at the present, when the differences among instruction sets are small and when there are distinct application areas. Therefore, starting with this edition, the bulk of instruction set material beyond this quick review is found in the appendices (see Appendices B and J). We use a subset of MIPS64 as the example ISA in this book.
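
The alignment rule and the displacement-style addressing modes from the list above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `is_aligned` and `effective_address` are hypothetical, registers are modeled as plain integers, and the general form base + index × scale + displacement is used to cover the modes named in the text rather than to model any real instruction encoding.

```python
def is_aligned(address: int, size: int) -> bool:
    """An access of `size` bytes at byte address `address` is aligned if address mod size == 0."""
    return address % size == 0


def effective_address(base: int = 0, index: int = 0, scale: int = 1,
                      displacement: int = 0) -> int:
    """Memory address formed as base + index * scale + displacement.

    Displacement mode: base register plus a constant offset.
    Absolute: displacement only, no register.
    Based indexed with displacement: two registers plus a displacement.
    Based with scaled index and displacement: the index register is
    multiplied by the operand size in bytes.
    """
    return base + index * scale + displacement


# A 4-byte word at address 8 is aligned; the same word at address 6 is not.
print(is_aligned(8, 4), is_aligned(6, 4))

# Displacement addressing: the register holds 1000 and the constant offset is 16.
addr = effective_address(base=1000, displacement=16)
print(addr, is_aligned(addr, 8))

# Scaled index: element 3 of an array of 8-byte values that starts at address 2000.
print(effective_address(base=2000, index=3, scale=8))
```

Leaving `index` and `displacement` at their defaults gives register indirect addressing, which is why one function can stand in for the whole family of modes in this sketch.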

Instruction type/opcode            Instruction meaning

Data transfers                     Move data between registers and memory, or between the integer and FP or special registers; only memory address mode is 16-bit displacement + contents of a GPR
  LB, LBU, SB                      Load byte, load byte unsigned, store byte (to/from integer registers)
  LH, LHU, SH                      Load half word, load half word unsigned, store half word (to/from integer registers)
  LW, LWU, SW                      Load word, load word unsigned, store word (to/from integer registers)
  LD, SD                           Load double word, store double word (to/from integer registers)
  L.S, L.D, S.S, S.D               Load SP float, load DP float, store SP float, store DP float
  MFC0, MTC0                       Copy from/to GPR to/from a special register
  MOV.S, MOV.D                     Copy one SP or DP FP register to another FP register
  MFC1, MTC1                       Copy 32 bits to/from FP registers from/to integer registers

Arithmetic/logical                 Operations on integer or logical data in GPRs; signed arithmetic trap on overflow
  DADD, DADDI, DADDU, DADDIU       Add, add immediate (all immediates are 16 bits); signed and unsigned
  DSUB, DSUBU                      Subtract; signed and unsigned
  DMUL, DMULU, DDIV, DDIVU, MADD   Multiply and divide, signed and unsigned; multiply-add; all operations take and yield 64-bit values
  AND, ANDI                        And, and immediate
  OR, ORI, XOR, XORI               Or, or immediate, exclusive or, exclusive or immediate
  LUI                              Load upper immediate; loads bits 32 to 47 of register with immediate, then sign-extends
  DSLL, DSRL, DSRA, DSLLV,         Shifts: both immediate (DS__) and variable form (DS__V); shifts are shift left logical,
  DSRLV, DSRAV                     right logical, right arithmetic
  SLT, SLTI, SLTU, SLTIU           Set less than, set less than immediate; signed and unsigned

Control                            Conditional branches and jumps; PC-relative or through register
  BEQZ, BNEZ                       Branch GPRs equal/not equal to zero; 16-bit offset from PC + 4
  BEQ, BNE                         Branch GPR equal/not equal; 16-bit offset from PC + 4
  BC1T, BC1F                       Test comparison bit in the FP status register and branch; 16-bit offset from PC + 4
  MOVN, MOVZ                       Copy GPR to another GPR if third GPR is nonzero (MOVN) or zero (MOVZ)
  J, JR                            Jumps: 26-bit offset from PC + 4 (J) or target in register (JR)
  JAL, JALR                        Jump and link: save PC + 4 in R31, target is PC-relative (JAL) or a register (JALR)
  TRAP                             Transfer to operating system at a vectored address
  ERET                             Return to user code from an exception; restore user mode

Floating point                     FP operations on DP and SP formats
  ADD.D, ADD.S, ADD.PS             Add DP, SP numbers, and pairs of SP numbers
  SUB.D, SUB.S, SUB.PS             Subtract DP, SP numbers, and pairs of SP numbers
  MUL.D, MUL.S, MUL.PS             Multiply DP, SP floating point, and pairs of SP numbers
  MADD.D, MADD.S, MADD.PS          Multiply-add DP, SP numbers, and pairs of SP numbers
  DIV.D, DIV.S, DIV.PS             Divide DP, SP floating point, and pairs of SP numbers
  CVT._._                          Convert instructions: CVT.x.y converts from type x to type y, where x and y are L (64-bit integer), W (32-bit integer), D (DP), or S (SP). Both operands are FPRs.
  C.__.D, C.__.S                   DP and SP compares: “__” = LT, GT, LE, GE, EQ, NE; sets bit in FP status register

Figure 1.5 Subset of the instructions in MIPS64. SP = single precision; DP = double precision. Appendix B gives much more detail on MIPS64. For data, the most significant bit number is 0; least is 63.

Basic instruction formats

Format   Fields (bit positions, most significant first)
R        opcode (31–26)   rs (25–21)    rt (20–16)   rd (15–11)   shamt (10–6)   funct (5–0)
I        opcode (31–26)   rs (25–21)    rt (20–16)   immediate (15–0)
J        opcode (31–26)   address (25–0)

Floating-point instruction formats

Format   Fields (bit positions, most significant first)
FR       opcode (31–26)   fmt (25–21)   ft (20–16)   fs (15–11)   fd (10–6)      funct (5–0)
FI       opcode (31–26)   fmt (25–21)   ft (20–16)   immediate (15–0)

Figure 1.6 MIPS64 instruction set architecture formats. All instructions are 32 bits long. The R format is for integer register-to-register operations, such as DADDU, DSUBU, and so on. The I format is for data transfers, branches, and immediate instructions, such as LD, SD, BEQZ, and DADDIs. The J format is for jumps, the FR format for floating point operations, and the FI format for floating point branches.
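
To make the field layout in Figure 1.6 concrete, the sketch below packs and unpacks the R and I formats with shifts and masks, using the bit positions given in the figure. The helper names (`encode_r`, `decode_r`, `encode_i`, `decode_i`, `sign_extend16`) are hypothetical, the field values in the example are arbitrary numbers chosen for illustration rather than the encoding of any particular instruction, and reading the 16-bit immediate as a signed value is an assumption that suits displacements and branch offsets.

```python
def encode_r(opcode: int, rs: int, rt: int, rd: int, shamt: int, funct: int) -> int:
    """Pack R-format fields: opcode 31-26, rs 25-21, rt 20-16, rd 15-11, shamt 10-6, funct 5-0."""
    return ((opcode & 0x3F) << 26 | (rs & 0x1F) << 21 | (rt & 0x1F) << 16
            | (rd & 0x1F) << 11 | (shamt & 0x1F) << 6 | (funct & 0x3F))


def decode_r(word: int) -> dict:
    """Unpack a 32-bit word into its R-format fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }


def encode_i(opcode: int, rs: int, rt: int, immediate: int) -> int:
    """Pack I-format fields: opcode 31-26, rs 25-21, rt 20-16, immediate 15-0."""
    return ((opcode & 0x3F) << 26 | (rs & 0x1F) << 21 | (rt & 0x1F) << 16
            | (immediate & 0xFFFF))


def decode_i(word: int) -> dict:
    """Unpack a 32-bit word into its I-format fields (immediate left as raw 16 bits)."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "immediate": word & 0xFFFF,
    }


def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


# Arbitrary example field values, not a specific MIPS instruction.
r_word = encode_r(opcode=0, rs=1, rt=2, rd=3, shamt=0, funct=0x2D)
print(f"{r_word:08x}", decode_r(r_word))

i_word = encode_i(opcode=0x37, rs=4, rt=5, immediate=-8)
fields = decode_i(i_word)
print(f"{i_word:08x}", fields, sign_extend16(fields["immediate"]))
```

Because every format is exactly 32 bits wide and the opcode always sits in bits 31 to 26, a decoder can read the opcode first and then pick the field layout, which is one way to see why fixed-length encoding simplifies decoding.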

The Rest of Computer Architecture: Designing the Organization and Hardware to Meet Goals and Functional Requirements

The implementation of a computer has two components: organization and hardware. The term organization includes the high-level aspects of a computer’s design, such as the memory system, the memory interconnect, and the design of the internal processor or CPU (central processing unit, where arithmetic, logic, branching, and data transfer are implemented). For example, two processors with the same instruction set architectures but very different organizations are the AMD Opteron 64 and the Intel Pentium 4. Both processors implement the x86 instruction set, but they have very different pipeline and cache organizations.

Hardware refers to the specifics of a computer, including the detailed logic design and the packaging technology of the computer. Often a line of computers contains computers with identical instruction set architectures and nearly identical organizations, but they differ in the detailed hardware implementation. For example, the Pentium 4 and the Mobile Pentium 4 are nearly identical, but offer different clock rates and different memory systems, making the Mobile Pentium 4 more effective for low-end computers.

In this book, the word architecture covers all three aspects of computer design: instruction set architecture, organization, and hardware.

Functional requirements             Typical features required or supported

Application area                    Target of computer
  General-purpose desktop           Balanced performance for a range of tasks, including interactive performance for graphics, video, and audio (Ch. 2, 3, 5, App. B)
  Scientific desktops and servers   High-performance floating point and graphics (App. I)
  Commercial servers                Support for databases and transaction processing; enhancements for reliability and availability; support for scalability (Ch. 4, App. B, E)
  Embedded computing                Often requires special support for graphics or video (or other application-specific extension); power limitations and power control may be required (Ch. 2, 3, 5, App. B)

Level of software compatibility     Determines amount of existing software for computer
  At programming language           Most flexible for designer; need new compiler (Ch. 4, App. B)
  Object code or binary compatible  Instruction set architecture is completely defined (little flexibility), but no investment needed in software or porting programs

Operating system requirements       Necessary features to support chosen OS (Ch. 5, App. E)
  Size of address space             Very important feature (Ch. 5); may limit applications
  Memory management                 Required for modern OS; may be paged or segmented (Ch. 5)
  Protection                        Different OS and application needs: page vs. segment; virtual machines (Ch. 5)

Standards                           Certain standards may be required by marketplace
  Floating point                    Format and arithmetic: IEEE 754 standard (App. I), special arithmetic for graphics or signal processing
  I/O interfaces                    For I/O devices: Serial ATA, Serial Attach SCSI, PCI Express (Ch. 6, App. E)
  Operating systems                 UNIX, Windows, Linux, CISCO IOS
  Networks                          Support required for different networks: Ethernet, Infiniband (App. E)
  Programming languages             Languages (ANSI C, C++, Java, FORTRAN) affect instruction set (App. B)

Figure 1.7 Summary of some of the most important functional requirements an architect faces. The left-hand column describes the class of requirement, while the right-hand column gives specific examples. The right-hand column also contains references to chapters and appendices that deal with the specific issues.

Computer architects must design a computer to meet functional requirements as well as price, power, performance, and availability goals. Figure 1.7 summarizes requirements to consider in designing a new computer. Often, architects also must determine what the functional requirements are, which can be a major task.
The requirements may be specific features inspired by the market. Application software often drives the choice of certain functional requirements by determining how the computer will be used. If a large body of software exists for a certain instruction set architecture, the architect may decide that a new computer should implement an existing instruction set. The presence of a large market for a particular class of applications might encourage the designers to incorporate requirements that would make the computer competitive in that market. Many of these requirements and features are examined in depth in later chapters.

Architects must also be aware of important trends in both the technology and the use of computers, as such trends affect not only future cost but also the longevity of an architecture.

1.4 Trends in Technology

If an instruction set architecture is to be successful, it must be designed to survive rapid changes in computer technology. After all, a successful new instruction set architecture may last decades; for example, the core of the IBM mainframe has been in use for more than 40 years. An architect must plan for technology changes that can increase the lifetime of a successful computer.

To plan for the evolution of a computer, the designer must be aware of rapid changes in implementation technology. Four implementation technologies, which change at a dramatic pace, are critical to modern implementations:

- Integrated circuit logic technology: Transistor density increases by about 35% per year, quadrupling in somewhat over four years. Increases in die size are less predictable and slower, ranging from 10% to 20% per year. The combined effect is a growth rate in transistor count on a chip of about 40% to 55% per year. Device speed scales more slowly, as we discuss below.
- Semiconductor DRAM (dynamic random-access memory): Capacity increases by about 40% per year, doubling roughly every two years.
- Magnetic disk technology: Prior to 1990, density increased by about 30% per year, doubling in three years. It rose to 60% per year thereafter, and increased to 100% per year in 1996. Since 2004, it has dropped back to 30% per year. Despite this roller coaster of rates of improvement, disks are still 50–100 times cheaper per bit than DRAM. This technology is central to Chapter 6, and we discuss the trends in detail there.
- Network technology: Network performance depends both on the performance of switches and on the performance of the transmission system. We discuss the trends in networking in Appendix E.

These rapidly changing technologies shape the design of a computer that, with speed and technology enhancements, may have a lifetime of five or more years. Even within the span of a single product cycle for a computing system (two years of design and two to three years of production), key technologies such as DRAM change sufficiently that the designer must plan for these changes. Indeed, designers often design for the next technology, knowing that when a product begins shipping in volume that next technology may be the most cost-effective or may have performance advantages. Traditionally, cost has decreased at about the rate at which density increases.

Although technology improves continuously, the impact of these improvements can be in discrete leaps, as a threshold that allows a new capability is reached. For example, when MOS technology reached a point in the early 1980s where between 25,000 and 50,000 transistors could fit on a single chip, it became possible to build a single-chip, 32-bit microprocessor. By the late 1980s, first-level caches could go on chip. By eliminating chip crossings within the processor and between the processor and the cache, a dramatic improvement in cost-performance and power-performance was possible. This design was simply infeasible until the technology reached a certain point.
Such technology thresholds are not rare and have a significant impact on a wide variety of design decisions.

Performance Trends: Bandwidth over Latency

As we shall see in Section 1.8, bandwidth or throughput is the total amount of work done in a given time, such as megabytes per second for a disk transfer. In contrast, latency or response time is the time between the start and the completion of an event, such as milliseconds for a disk access. Figure 1.8 plots the relative improvement in bandwidth and latency for technology milestones for microprocessors, memory, networks, and disks. Figure 1.9 describes the examples and milestones in more detail. Clearly, bandwidth improves much more rapidly than latency.

Performance is the primary differentiator for microprocessors and networks, so they have seen the greatest gains: 1000–2000X in bandwidth and 20–40X in latency. Capacity is generally more important than performance for memory and disks, so capacity has improved most, yet their bandwidth advances of 120–140X are still much greater than their gains in latency of 4–8X. Clearly, bandwidth has outpaced latency across these technologies and will likely continue to do so.

A simple rule of thumb is that bandwidth grows by at least the square of the improvement in latency. Computer designers should make plans accordingly.

[Figure 1.8 plot: x-axis, relative latency improvement (1 to 100); y-axis, relative bandwidth improvement (1 to 10,000); series for microprocessor, network, memory, and disk; reference line where latency improvement = bandwidth improvement.]

Figure 1.8 Log-log plot of bandwidth and latency milestones from Figure 1.9 relative to the first milestone. Note that latency improved about 10X while bandwidth improved about 100X to 1000X. From Patterson 2004.
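The rule of thumb above can be checked against the first and last milestones listed in Figure 1.9. The following Python sketch is only an illustration: the dictionary name `MILESTONES` and the helper names are made up for this example, and the numbers are the endpoint bandwidth and latency values from that figure.

```python
# (bandwidth_first, bandwidth_last, latency_first, latency_last), endpoints from Figure 1.9.
MILESTONES = {
    "microprocessor": (2, 4500, 320, 15),      # MIPS, ns
    "memory":         (13, 1600, 225, 52),     # MBit/sec, ns
    "network":        (10, 10000, 3000, 190),  # MBit/sec, microseconds
    "disk":           (0.6, 86, 48.3, 5.7),    # MBit/sec, ms
}


def improvements(bw_first, bw_last, lat_first, lat_last):
    """Return (bandwidth gain, latency gain); latency improves as it shrinks."""
    return bw_last / bw_first, lat_first / lat_last


def rule_of_thumb_holds(bw_first, bw_last, lat_first, lat_last):
    """True if bandwidth grew by at least the square of the latency improvement."""
    bw_gain, lat_gain = improvements(bw_first, bw_last, lat_first, lat_last)
    return bw_gain >= lat_gain ** 2


for name, values in MILESTONES.items():
    bw_gain, lat_gain = improvements(*values)
    print(f"{name:15s} bandwidth {bw_gain:7.0f}X  latency {lat_gain:5.1f}X  "
          f"rule holds: {rule_of_thumb_holds(*values)}")
```

Because the comparison uses only two data points per technology, it shows that the rule is consistent with these milestones, not that it holds in general.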
Microprocessor | 16-bit address/bus, microcoded | 32-bit address/bus, microcoded | 5-stage pipeline, on-chip I & D caches, FPU | 2-way superscalar, 64-bit bus | Out-of-order 3-way superscalar | Out-of-order superpipelined, on-chip L2 cache
Product | Intel 80286 | Intel 80386 | Intel 80486 | Intel Pentium | Intel Pentium Pro | Intel Pentium 4
Year | 1982 | 1985 | 1989 | 1993 | 1997 | 2001
Die size (mm²) | 47 | 43 | 81 | 90 | 308 | 217
Transistors | 134,000 | 275,000 | 1,200,000 | 3,100,000 | 5,500,000 | 42,000,000
Pins | 68 | 132 | 168 | 273 | 387 | 423
Latency (clocks) | 6 | 5 | 5 | 5 | 10 | 22
Bus width (bits) | 16 | 32 | 32 | 64 | 64 | 64
Clock rate (MHz) | 12.5 | 16 | 25 | 66 | 200 | 1500
Bandwidth (MIPS) | 2 | 6 | 25 | 132 | 600 | 4500
Latency (ns) | 320 | 313 | 200 | 76 | 50 | 15

Memory module | DRAM | Page mode DRAM | Fast page mode DRAM | Fast page mode DRAM | Synchronous DRAM | Double data rate SDRAM
Module width (bits) | 16 | 16 | 32 | 64 | 64 | 64
Year | 1980 | 1983 | 1986 | 1993 | 1997 | 2000
Mbits/DRAM chip | 0.06 | 0.25 | 1 | 16 | 64 | 256
Die size (mm²) | 35 | 45 | 70 | 130 | 170 | 204
Pins/DRAM chip | 16 | 16 | 18 | 20 | 54 | 66
Bandwidth (MBit/sec) | 13 | 40 | 160 | 267 | 640 | 1600
Latency (ns) | 225 | 170 | 125 | 75 | 62 | 52

Local area network | Ethernet | Fast Ethernet | Gigabit Ethernet | 10 Gigabit Ethernet
IEEE standard | 802.3 | 802.3u | 802.3ab | 802.3ae
Year | 1978 | 1995 | 1999 | 2003
Bandwidth (MBit/sec) | 10 | 100 | 1000 | 10000
Latency (µsec) | 3000 | 500 | 340 | 190

Hard disk | 3600 RPM | 5400 RPM | 7200 RPM | 10,000 RPM | 15,000 RPM
Product | CDC WrenI 94145-36 | Seagate ST41600 | Seagate ST15150 | Seagate ST39102 | Seagate ST373453
Year | 1983 | 1990 | 1994 | 1998 | 2003
Capacity (GB) | 0.03 | 1.4 | 4.3 | 9.1 | 73.4
Disk form factor | 5.25 inch | 5.25 inch | 3.5 inch | 3.5 inch | 3.5 inch
Media diameter | 5.25 inch | 5.25 inch | 3.5 inch | 3.0 inch | 2.5 inch
Interface | ST-412 | SCSI | SCSI | SCSI | SCSI
Bandwidth (MBit/sec) | 0.6 | 4 | 9 | 24 | 86
Latency (ms) | 48.3 | 17.1 | 12.7 | 8.8 | 5.7

Figure 1.9 Performance milestones over 20 to 25 years for microprocessors, memory, networks, and disks.
The microprocessor milestones are six generations of IA-32 processors, going from a 16-bit bus, microcoded 80286 to a 64-bit bus, superscalar, out-of-order execution, superpipelined Pentium 4. Memory module milestones go from 16-bit-wide, plain DRAM to 64-bit-wide double data rate synchronous DRAM. Ethernet advanced from 10 Mb/sec to 10 Gb/sec. Disk milestones are based on rotation speed, improving from 3600 RPM to 15,000 RPM. Each case is best-case bandwidth, and latency is the time for a simple operation assuming no contention. From Patterson 2004.

Scaling of Transistor Performance and Wires

Integrated circuit processes are characterized by the feature size, which is the minimum size of a transistor or a wire in either the x or y dimension. Feature sizes have decreased from 10 microns in 1971 to 0.09 microns in 2006; in fact, we have switched units, so production in 2006 is now referred to as “90 nanometers,” and 65 nanometer chips are underway. Since the transistor count per square millimeter of silicon is determined by the surface area of a transistor, the density of transistors increases quadratically with a linear decrease in feature size.

The increase in transistor performance, however, is more complex. As feature sizes shrink, devices shrink quadratically in the horizontal dimension and also shrink in the vertical dimension. The shrink in the vertical dimension requires a reduction in operating voltage to maintain correct operation and reliability of the transistors. This combination of scaling factors leads to a complex interrelationship between transistor performance and process feature size. To a first approximation, transistor performance improves linearly with decreasing feature size.
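To see what quadratic versus linear scaling means for the feature sizes quoted above, a small Python sketch can compare the two. The function names are hypothetical, and the linear performance model is only the first-order approximation stated in the text, not a device-level model.

```python
def density_gain(old_feature_um: float, new_feature_um: float) -> float:
    """Transistors per unit area scale with 1 / feature_size**2."""
    return (old_feature_um / new_feature_um) ** 2


def performance_gain(old_feature_um: float, new_feature_um: float) -> float:
    """First-order approximation: performance improves linearly as feature size shrinks."""
    return old_feature_um / new_feature_um


# Feature sizes from the text: 10 microns (1971) and 0.09 microns (2006).
old_um, new_um = 10.0, 0.09
print(f"linear shrink:    {performance_gain(old_um, new_um):.0f}X")
print(f"density increase: {density_gain(old_um, new_um):.0f}X")
```

Under these assumptions the shrink from 10 microns to 0.09 microns is a factor of roughly 111 in linear dimension but more than 12,000 in transistor density, which is the gap between the two growth rates that the text goes on to discuss.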

The fact that transistor count improves quadratically with a linear improvement in transistor performance is both the challenge and the opportunity for which computer architects were created. In the early days of microprocessors, the higher rate of improvement in density was used to move quickly from 4-bit, to 8-bit, to 16-bit, to 32-bit microprocessors. More recently, density improvements have supported the introduction of 64-bit microprocessors as well as many of the innovations in pipelining and caches found in Chapters 2, 3, and 5.

Although transistors generally improve in performance with decreased feature size, wires in an integrated circuit do not. In particular, the signal delay for a wire increases in proportion to the product of its resistance and capacitance. Of course, as feature size shrinks, wires get shorter, but the resistance and capacitance per unit length get worse. This relationship is complex, since both resistance and capacitance depend on detailed aspects of the process, the geometry of a wire, the loading on a wire, and even the adjacency to other structures. There are occasional process enhancements, such as the introduction of copper, which provide one-time improvements in wire delay.

In general, however, wire delay scales poorly compared to transistor performance, creating additional challenges for the designer. In the past few years, wire delay has become a major design limitation for large integrated circuits and is often more critical than transistor switching delay. Larger and larger fractions of the clock cycle have been consumed by the propagation delay of signals on wires. In 2001, the Pentium 4 broke new ground by allocating 2 stages of its 20+-stage pipeline just for propagating signals across the chip.

1.5 Trends in Power in Integrated Circuits

Power also provides challenges as devices are scaled.
First, power must be brought in and distributed around the chip, and modern microprocessors use hundreds of pins and multiple interconnect layers for just power and ground. Second, power is dissipated as heat and must be removed.

For CMOS chips, the traditional dominant energy consumption has been in switching transistors, also called dynamic power. The power required per transistor is proportional to the product of the load capacitance of the transistor, the square of the voltage, and the frequency of switching, with watts being the unit:

$$\text{Power}_{\text{dynamic}} = \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}$$

Mobile devices care about battery life more than power, so energy is the proper metric, measured in joules:

$$\text{Energy}_{\text{dynamic}} = \text{Capacitive load} \times \text{Voltage}^2$$

Hence, dynamic power and energy are greatly reduced by lowering the voltage, and so voltages have dropped from 5V to just over 1V in 20 years. The capacitive load is a function of the number of transistors connected to an output and the technology, which determines the capacitance of the wires and the transistors. For a fixed task, slowing clock rate reduces power, but not energy.

Example: Some microprocessors today are designed to have adjustable voltage, so that a 15% reduction in voltage may result in a 15% reduction in frequency. What would be the impact on dynamic power?

Answer: Since the capacitance is unchanged, the answer is the ratios of the voltages and frequencies:

$$\frac{\text{Power}_{\text{new}}}{\text{Power}_{\text{old}}} = \frac{(\text{Voltage} \times 0.85)^2 \times (\text{Frequency switched} \times 0.85)}{\text{Voltage}^2 \times \text{Frequency switched}} = 0.85^3 = 0.61$$

thereby reducing power to about 60% of the original.

As we move from one process to the next, the increase in the number of transistors switching, and the frequency with which they switch, dominates the decrease in load capacitance and voltage, leading to an overall growth in power consumption and energy.
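The voltage-scaling example above can be reproduced numerically. This Python sketch encodes the two dynamic power and energy formulas directly; the function names and the baseline values for capacitance, voltage, and frequency are arbitrary placeholders chosen for illustration, since only the ratios matter.

```python
def dynamic_power(capacitive_load: float, voltage: float, frequency: float) -> float:
    """Dynamic power in watts: 1/2 x capacitive load x voltage^2 x frequency switched."""
    return 0.5 * capacitive_load * voltage ** 2 * frequency


def dynamic_energy(capacitive_load: float, voltage: float) -> float:
    """Dynamic energy in joules: capacitive load x voltage^2."""
    return capacitive_load * voltage ** 2


# Hypothetical baseline; the capacitance cancels out of both ratios.
C, V, F = 1e-9, 1.2, 3.0e9
scale = 0.85  # 15% reduction in voltage and in frequency

power_ratio = dynamic_power(C, scale * V, scale * F) / dynamic_power(C, V, F)
energy_ratio = dynamic_energy(C, scale * V) / dynamic_energy(C, V)

print(f"power ratio:  {power_ratio:.2f}")   # 0.85 cubed
print(f"energy ratio: {energy_ratio:.2f}")  # 0.85 squared
```

The power ratio depends on both voltage and frequency, while the energy ratio depends only on voltage, which matches the remark that slowing the clock alone reduces power but not energy for a fixed task.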
The first microprocessors consumed tenths of a watt, while a 3.2 GHz Pentium 4 Extreme Edition consumes 135 watts. Given that this heat must be dissipated from a chip that is about 1 cm on a side, we are reaching the limits of what can be cooled by air. Several Intel microprocessors have temperature diodes to reduce activity automatically if the chip gets too hot. For example, they may reduce voltage and clock frequency or the instruction issue rate.

Distributing the power, removing the heat, and preventing hot spots have become increasingly difficult challenges. Power is now the major limitation to using transistors; in the past it was raw silicon area. As a result of this limitation, most microprocessors today turn off the clock of inactive modules to save energy and dynamic power. For example, if no floating-point instructions are executing, the clock of the floating-point unit is disabled.

Although dynamic power is the primary source of power dissipation in CMOS, static power is becoming an important issue because leakage current flows even when a transistor is off:

$$\text{Power}_{\text{static}} = \text{Current}_{\text{static}} \times \text{Voltage}$$

Thus, increasing the number of transistors increases power even if they are turned off, and leakage current increases in processors with smaller transistor sizes. As a result, very low power systems are even gating the voltage to inactive modules to control loss due to leakage. In 2006, the goal for leakage is 25% of the total power consumption, with leakage in high-performance designs sometimes far exceeding that goal. As mentioned before, the limits of air cooling have led to exploration of multiple processors on a chip running at lower voltages and clock rates.

1.6 Trends in Cost

Although there are computer designs where costs tend to be less important (specifically supercomputers), cost-sensitive designs are of growing significance.
Indeed, in the past 20 years, the use of technology improvements to lower cost, as well as increase performance, has been a major theme in the computer industry.

Textbooks often ignore the cost half of cost-performance because costs change, thereby dating books, and because the issues are subtle and differ across industry segments. Yet an understanding of cost and its factors is essential for designers to make intelligent decisions about whether or not a new feature should be included in designs where cost is an issue. (Imagine architects designing skyscrapers without any information on costs of steel beams and concrete.)

This section discusses the major factors that influence the cost of a computer and how these factors are changing over time.

The Impact of Time, Volume, and Commodification

The cost of a manufactured computer component decreases over time even without major improvements in the basic implementation technology. The underlying principle that drives costs down is the learning curve: manufacturing costs decrease over time. The learning curve itself is best measured by change in yield, the percentage of manufactured devices that survives the testing procedure. Whether it is a chip, a board, or a system, designs that have twice the yield will have half the cost.

Understanding how the learning curve improves yield is critical to projecting costs over a product’s life. One example is that the price per megabyte of DRAM has dropped over the long term by 40% per year. Since DRAMs tend to be priced
