Chip multiprocessor architecture

Dr. Alden Cutts, United Kingdom (Teacher)
Published: 23-07-2017
The Stanford Hydra Chip Multiprocessor
Kunle Olukotun and the Hydra Team
Computer Systems Laboratory, Stanford University

Technology ↔ Architecture
■ Transistors are cheap, plentiful and fast
➤ Moore's law
➤ 100 million transistors by 2000
■ Wires are cheap, plentiful and slow
➤ Wires get slower relative to transistors
➤ Long cross-chip wires are especially slow
■ Architectural implications
➤ Plenty of room for innovation
➤ Single-cycle communication requires localized blocks of logic
➤ High communication bandwidth across the chip is easier to achieve than low latency

Exploiting Program Parallelism
[Figure: levels of parallelism (instruction, loop, thread, process) plotted against grain size, from 1 to 1M instructions.]

Hydra Approach
■ A single-chip multiprocessor architecture composed of simple, fast processors
■ Multiple threads of control
➤ Exploits parallelism at all levels
■ Memory renaming and thread-level speculation
➤ Makes it easy to develop parallel programs
■ Keep the design simple by taking advantage of the single-chip implementation

Outline
■ Base Hydra architecture
■ Performance of the base architecture
■ Speculative thread support
■ Speculative thread performance
■ Improving speculative thread performance
■ Hydra prototype design
■ Conclusions

The Base Hydra Design
[Diagram: four CPUs, each with separate L1 instruction and data caches and its own memory controller, connected through centralized bus arbitration to a 64-bit write-through bus and a 256-bit read/replace bus, a shared on-chip L2 cache, a Rambus interface to DRAM main memory, and an I/O bus interface.]
➤ Single-chip multiprocessor
➤ Shared 2nd-level cache
➤ Four processors
➤ Low-latency interprocessor communication (10 cycles)
➤ Separate primary caches
➤ Separate read and write buses
➤ Write-through data caches to maintain coherence

Hydra vs. Superscalar
[Chart: speedup of a 6-way-issue superscalar and a 4 × 2-way-issue Hydra, relative to a single Hydra processor, on compress, eqntott, m88ksim, apsi, MPEG2, applu, swim, tomcatv, OLTP, and pmake.]
➤ ILP only ⇒ the superscalar is 30–50% better than a single Hydra processor
➤ ILP and fine-grained threads ⇒ the superscalar and Hydra are comparable
➤ ILP and coarse-grained threads ⇒ Hydra is 1.5–2× better
➤ "The Case for a CMP," ASPLOS '96

Problem: Parallel Software
■ Parallel software is limited
➤ Hand-parallelized applications
➤ Auto-parallelized dense-matrix FORTRAN applications
■ Traditional auto-parallelization of C programs is very difficult
➤ Threads have data dependencies ⇒ synchronization
➤ Pointer disambiguation is difficult and expensive
➤ Compile-time analysis is too conservative
■ How can hardware help?
➤ Remove the need for pointer disambiguation
➤ Allow the compiler to be aggressive

Solution: Data Speculation
■ Data speculation enables parallelization without regard for data dependencies
➤ Loads and stores follow the original sequential semantics
➤ Speculation hardware ensures correctness
➤ Synchronization is added only for performance
➤ Loop parallelization is now easily automated
■ Other ways to parallelize code
➤ Break code into arbitrary threads (e.g. speculative subroutines)
➤ Parallel execution with sequential commits
■ Data speculation support
➤ Wisconsin Multiscalar
➤ Hydra provides low-overhead support for a CMP

Data Speculation Requirements I
[Diagram: an original sequential loop vs. a speculatively parallelized loop. If iteration i+1 reads X before iteration i writes it, a VIOLATION is flagged; if iteration i writes X before iteration i+1 reads it, the value is FORWARDED between the parallel threads.]
➊ Forward data between parallel threads
➋ Detect violations when reads occur too early

Data Speculation Requirements II
[Diagram: writes after a violation are trashed; writes after successful iterations become permanent state.]
➌ Safely discard bad state after a violation
➍ Correctly retire speculative state

Data Speculation Requirements III
[Diagram: iterations i, i+1, and i+2 each observe a different value of X, so each thread needs its own view of memory.]
➎ Maintain multiple "views" of memory

Hydra Speculation Support
[Diagram: the base Hydra design extended with a CP2 speculation coprocessor per CPU, speculation bits in each L1 data cache, and speculation write buffers (one per CPU plus a retiring buffer) between the write-through bus and the on-chip L2 cache.]
➊ The write bus and L2 buffers provide forwarding
➋ "Read" L1 tag bits detect violations
➌ "Dirty" L1 tag bits and the write buffers provide backup
➍ Write buffers reorder and retire speculative state
➎ Separate L1 caches with pre-invalidation and smart L2 forwarding provide each thread's "view"
➝ Speculation coprocessors control the threads

Speculative Reads
[Diagram: a nonspeculative "head" CPU, speculative "earlier" CPUs i-2 and i-1, CPU i ("me"), and a speculative "later" CPU i+1, each with a write buffer in front of the shared L2 cache.]
➤ L1 hit: the read bits are set
➤ L1 miss: the L2 cache and the write buffers are checked in parallel; the newest bytes written to a line are pulled in by priority encoders on each byte (priority A–D)

Speculative Writes
➊ A CPU writes to its L1 cache and write buffer
➋ "Earlier" CPUs invalidate our L1 and cause RAW hazard checks
➌ "Later" CPUs just pre-invalidate our L1
➍ The non-speculative write buffer drains out into the L2

Speculation Runtime System
■ Software handlers
➤ Control speculative threads through the CP2 interface
➤ Track the order of all speculative threads
➤ Exception routines recover from data-dependency violations
➤ Adds more overhead to speculation than a hardware approach, but is more flexible and simpler to implement
➤ Complete description in "Data Speculation Support for a Chip Multiprocessor" (ASPLOS '98) and "Improving the Performance of Speculatively Parallel Applications on the Hydra CMP" (ICS '99)

Creating Speculative Threads
■ Speculative loops
➤ for and while loop iterations
➤ Typically one speculative thread per iteration
■ Speculative procedures
➤ Execute the code after a procedure call speculatively
➤ Procedure calls generate a speculative thread
■ Compiler support
➤ C source-to-source translator
➤ pfor, pwhile
➤ Analyze the loop body and globalize any local variables that could cause loop-carried dependencies

Base Speculative Thread Performance
[Chart: speedup on four processors for compress, eqntott, grep, m88ksim, wc, ijpeg, mpeg2, alvin, cholesky, ear, simplex, and sparse1.3.]
➤ Entire applications, compiled with GCC 2.7.2 -O2
➤ Four single-issue processors
➤ Accurate modeling of all aspects of the Hydra architecture and the real runtime system

Improving the Speculative Runtime System
■ Procedure support adds overhead to loops
➤ Threads are not created sequentially
➤ Dynamic thread scheduling is necessary
➤ Start and end of loop: 75 cycles
➤ End of iteration: 80 cycles
■ Performance
➤ The best-performing speculative applications use loops
➤ Procedure speculation often lowers performance
➤ Need to optimize the RTS for the common case
■ Lower speculative overheads
➤ Start and end of loop: 25 cycles
➤ End of iteration: 12 cycles (almost a factor of 7)
➤ Limit procedure speculation to specific procedures

Improved Speculative Performance
[Chart: speedup of the optimized RTS vs. the base RTS for the same twelve applications.]
➤ Improves performance of all applications
➤ Most improvement for applications with fine-grained threads
➤ eqntott uses procedure speculation
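The write-through policy in the base design means every store appears on the shared write bus, so the other CPUs can keep their L1 caches coherent simply by snooping and invalidating matching lines. A minimal sketch of that idea, with illustrative class names that are not the actual Hydra hardware structures:

```python
# Toy model of write-through L1 coherence over a shared write bus.
# All names (L1Cache, WriteBus, ...) are illustrative, not Hydra's RTL.

class L1Cache:
    def __init__(self):
        self.lines = {}                  # address -> cached value

    def invalidate(self, addr):
        self.lines.pop(addr, None)       # drop the line if present

class CPU:
    def __init__(self, cid, bus):
        self.cid, self.bus, self.l1 = cid, bus, L1Cache()

    def store(self, addr, value):
        self.l1.lines[addr] = value                 # update own L1
        self.bus.broadcast(self.cid, addr, value)   # write through to L2

    def load(self, addr):
        if addr in self.l1.lines:        # L1 hit
            return self.l1.lines[addr]
        value = self.bus.l2.get(addr, 0) # miss: fill from the shared L2
        self.l1.lines[addr] = value
        return value

class WriteBus:
    def __init__(self):
        self.l2, self.cpus = {}, []

    def broadcast(self, writer_id, addr, value):
        self.l2[addr] = value            # every store reaches the L2
        for cpu in self.cpus:            # other CPUs snoop the write bus
            if cpu.cid != writer_id:
                cpu.l1.invalidate(addr)  # and invalidate stale L1 copies

bus = WriteBus()
cpus = [CPU(i, bus) for i in range(4)]
bus.cpus = cpus

cpus[0].store(0x40, 7)
print(cpus[1].load(0x40))   # 7: CPU 1 misses and gets CPU 0's value from L2
```

The key property: after any store, no L1 can hold a stale copy, because the only copies left are the writer's own line and the freshly updated L2.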
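Requirements ➊ and ➋ above (forwarding between parallel threads and detecting too-early reads) can be modeled in a few lines. The read-set and write-buffer structures below are software stand-ins for the L1 "read" tag bits and per-CPU speculation write buffers:

```python
# Toy model of TLS forwarding and RAW-violation detection.
# "read_set" plays the role of the L1 read bits; "write_buf" plays the
# role of a per-CPU speculation write buffer.

class SpecThread:
    def __init__(self, seq):
        self.seq = seq            # position in original sequential order
        self.read_set = set()     # addresses whose "read" bit is set
        self.write_buf = {}       # speculative stores held until commit
        self.violated = False

def spec_load(threads, mem, t, addr):
    t.read_set.add(addr)          # mark the read so it can be checked later
    # forward the newest value from this thread or any earlier thread
    for earlier in sorted((x for x in threads if x.seq <= t.seq),
                          key=lambda x: -x.seq):
        if addr in earlier.write_buf:
            return earlier.write_buf[addr]
    return mem.get(addr, 0)       # otherwise read committed memory

def spec_store(threads, t, addr, value):
    t.write_buf[addr] = value
    for later in threads:         # any later thread that already read this
        if later.seq > t.seq and addr in later.read_set:
            later.violated = True # address read too early: RAW violation

mem = {"x": 10}
threads = [SpecThread(0), SpecThread(1)]
v = spec_load(threads, mem, threads[1], "x")   # iteration i+1 reads early
spec_store(threads, threads[0], "x", 99)       # iteration i writes later
print(threads[1].violated)   # True: the early read used stale data
```

A violated thread would be restarted, discarding its write buffer (requirement ➌), while unviolated threads retire their buffers in sequential order (requirement ➍).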
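The Speculative Reads slide says that on an L1 miss, each byte of the returned line is selected by a priority encoder from the newest write among the earlier CPUs' buffers and the L2. A toy version of that byte-wise merge, assuming the buffers are presented oldest to newest:

```python
# Toy byte-wise merge for a speculative L1 miss: the committed L2 line is
# overlaid with speculative writes so that, per byte, the newest write wins
# (the role of the per-byte priority encoders in the slide).

def merge_line(l2_line, earlier_buffers):
    """l2_line: list of byte values. earlier_buffers: write buffers ordered
    oldest -> newest, each a dict {byte_offset: value}."""
    line = list(l2_line)                  # start from committed L2 state
    for buf in earlier_buffers:           # apply older writes first,
        for off, val in buf.items():      # so the newest byte lands last
            line[off] = val
    return line

l2 = [0, 0, 0, 0]
buf_i_minus_2 = {0: 0xAA, 1: 0xBB}        # oldest speculative writes
buf_i_minus_1 = {1: 0xCC}                 # newer write to byte 1
print(merge_line(l2, [buf_i_minus_2, buf_i_minus_1]))
# [170, 204, 0, 0]: byte 1 comes from the newest writer (0xCC)
```

In hardware this happens in parallel across all buffers; the sequential loop here is just the simplest way to express the same newest-writer-wins priority.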
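Putting the pieces together, a pfor-style speculative loop launches every iteration optimistically, retires results in the original sequential order, and re-executes any iteration that read a value an earlier iteration subsequently changed. The runner below is a software toy under those assumptions, not the Hydra CP2 runtime interface:

```python
# Toy model of speculative loop execution: run all iterations against a
# snapshot, then commit in loop order, restarting on read-after-write
# violations so sequential semantics are preserved.

def run_iter(body, i, mem):
    snapshot = dict(mem)                  # memory image at thread launch
    reads, writes = set(), {}
    body(i, snapshot, reads, writes)      # execute the iteration speculatively
    return reads, writes, snapshot

def speculative_for(n, body, mem):
    pending = [run_iter(body, i, mem) for i in range(n)]  # all start early
    restarts = 0
    for i in range(n):                    # retire strictly in loop order
        reads, writes, snap = pending[i]
        # violation check: did anything this iteration read change since
        # launch? (in hardware: the L1 "read" bits + write-bus snooping)
        if any(mem.get(a) != snap.get(a) for a in reads):
            restarts += 1                 # discard bad state and re-execute
            reads, writes, snap = run_iter(body, i, mem)
        mem.update(writes)                # retire speculative state in order
    return mem, restarts

def body(i, mem, reads, writes):
    reads.add("sum")                      # loop-carried dependence on "sum"
    writes["sum"] = mem["sum"] + (i + 1)

print(speculative_for(4, body, {"sum": 0}))   # ({'sum': 10}, 3)
```

Every iteration of this loop depends on the previous one, so each commit triggers a restart of the next iteration; a loop with independent iterations would commit all of them with zero restarts, which is where the speedup comes from.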
