Question? Leave a message!




Reliable Architectures

Reliable Architectures
Joel Emer December 7, 2005 6.823, L241 Reliable Architectures Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Joel Emer December 7, 2005 6.823, L242 Strike Changes State of a Single Bit 0 1 Joel Emer December 7, 2005 6.823, L243 Impact of Neutron Strike on a Si Device neutron strike Strikes release electron source drain hole pairs that can be absorbed by source + + + + drain to alter the state of the device Transistor Device • Secondary source of upsets: alpha particles from packaging Joel Emer December 7, 2005 6.823, L244 Cosmic Rays Come From Deep Space p p n n p n n p n p n Earth’s Surface • Neutron flux is higher in higher altitudes 3x 5x increase in Denver at 5,000 feet 100x increase in airplanes at 30,000+ feet Joel Emer December 7, 2005 6.823, L245 Physical Solutions are hard • Shielding – No practical absorbent (e.g., approximately 10 ft of concrete) – unlike Alpha particles • Technology solution: SOI – Partiallydepleted SOI of some help, effect on logic unclear – Fullydepleted SOI may help, but is challenging to manufacture • Circuit level solution – Radiation hardened circuits can provide 10x improvement with significant penalty in performance, area, cost – 24x improvement may be possible with less penalty Joel Emer December 7, 2005 6.823, L246 Triple Modular Redundancy (Von Neumann, 1956) M M V Result M V does a majority vote on the results Joel Emer December 7, 2005 6.823, L247 Dual Modular Redundancy (e.g., Binac, Stratus) Error M Mismatch C M Error • Processing stops on mismatch • Error signal used to decide which processor be used to restore state to other Joel Emer December 7, 2005 6.823, L248 Pair and Spare Lockstep (e.g., Tandem, 1975) Primary M Mismatch C M Backup M Mismatch C M • Primary creates periodic checkpoints • Backup restarts from checkpoint on mismatch Joel Emer December 7, 2005 6.823, L249 Redundant Multithreading (e.g., Reinhardt, Mukherjee, 2000) Leading Thread X W X X W X X W C Fault C Fault C Fault Trailing Thread X W X X W X X W • Writes are checked Joel Emer December 7, 2005 6.823, L2410 Component Protection Parity ECC 1 1 0 1 1 … 0 0 … Parity ECC Error … 1 1 • Fujitsu SPARC in 130 nm technology (ISSCC 2003) – 80 of 200k latches protected with parity –versus very few latches protected in commodity microprocessors Joel Emer December 7, 2005 6.823, L2411 Strike on a bit (e.g., in register file) Bit Read no yes benign fault Bit has error no error protection detection no no error correction detection only affects program affects program outcome outcome yes no yes no yes no benign fault False DUE True DUE SDC no error SDC = Silent Data Corruption, DUE = Detected Unrecoverable Error Joel Emer December 7, 2005 6.823, L2412 Metrics • Intervalbased – MTTF = Mean Time to Failure – MTTR = Mean Time to Repair – MTBF = Mean Time Between Failures = MTTF + MTTR – Availability = MTTF / MTBF • Ratebased – FIT = Failure in Time = 1 failure in a billion hours 9 –1 year MTTF = 10 / (24 365) FIT = 114,155 FIT – SER FIT = SDC FIT + DUE FIT Hypothetical Example Cache: 0 FIT Image removed due to + IQ: 100K FIT copyright restrictions. FU: 58K FIT + Total of 158K FIT Joel Emer December 7, 2005 6.823, L2413 Cosmic Ray Strikes: Evidence Reaction • Publicly disclosed incidence – Error logs in large servers, E. Normand, “Single Event Upset at Ground Level,” IEEE Trans. on Nucl Sci, Vol. 43, No. 6, Dec 1996. – Sun Microsystems found cosmic ray strikes on L2 cache with defective error protection caused Sun’s flagship servers to crash, R. Baumann, IRPS Tutorial on SER, 2000. – Cypress Semiconductor reported in 2004 a single soft error brought a billiondollar automotive factory to a halt once a month, Zielger Puchner, “SER – History, Trends, and Challenges,” Cypress, 2004. Joel Emer December 7, 2005 6.823, L2414 Vulnerable Bits Growing with Moore’s Law 10000 12x GAP 1000 100 10 100 Vulnerable 1 20 Vulnerable 1000 year MTBF Goal Year Typical SDC goal: 1000 year MTBF Typical DUE goal: 1025 year MTBF 200 3 200 4 200 5 200 6 200 7 200 8 200 9 201 0 201 1 201 2 Joel Emer December 7, 2005 6.823, L2415 Architectural Vulnerability Factor (AVF) AVF = Probability Bit Matters bit of Visible Errors = of Bit Flips from Particle Strikes FIT = intrinsic FIT AVF bit bit bit Joel Emer December 7, 2005 6.823, L2416 Architectural Vulnerability Factor Does a bit matter • Branch Predictor – Doesn’t matter at all (AVF = 0) • Program Counter – Almost always matters (AVF 100) Joel Emer December 7, 2005 6.823, L2417 Statistical Fault Injection (SFI) with RTL Simulate Strike on Latch 0 1 output Logic Logic 0 Does Fault Propagate to Architectural State + Naturally characterizes all logical structures Joel Emer December 7, 2005 6.823, L2418 Architecturally Correct Execution (ACE) Program Input Program Outputs • ACE path requires only a subset of values to flow correctly through the program’s data flow graph (and the machine) • Anything else (unACE path) can be derated away Joel Emer December 7, 2005 6.823, L2419 Example of unACE instruction: Dynamically Dead Instruction Dynamically Dead Instruction Most bits of an unACE instruction do not affect program output Joel Emer December 7, 2005 6.823, L2420 Vulnerability of a structure AVF = fraction of cycles a bit contains ACE state T = 1 ACE = 2/4 Joel Emer December 7, 2005 6.823, L2421 Vulnerability of a structure AVF = fraction of cycles a bit contains ACE state T = 2 ACE = 1/4 Joel Emer December 7, 2005 6.823, L2422 Vulnerability of a structure AVF = fraction of cycles a bit contains ACE state T = 3 ACE = 0/4 Joel Emer December 7, 2005 6.823, L2423 Vulnerability of a structure AVF = fraction of cycles a bit contains ACE state T = 4 ACE = 3/4 Joel Emer December 7, 2005 6.823, L2424 Vulnerability of a structure AVF = fraction of cycles a bit contains ACE state ( 2 + 1 + 0 + 3 ) / 4 ( 2 + 1 + 0 + 3 ) / 4 = 4 4 Average number of ACE bits in a cycle Average number of ACE bits in a cycle = Total number of bits in the structure Total number of bits in the structureJoel Emer December 7, 2005 6.823, L2425 Little’s Law for ACEs Nace = Tace × Lace Nace AVF = NtotalJoel Emer December 7, 2005 6.823, L2426 Computing AVF • Approach is conservative – Assume every bit is ACE unless proven otherwise • Data Analysis using a Performance Model – Prove that data held in a structure is unACE • Timing Analysis using a Performance Model – Tracks the time this data spent in the structure Joel Emer December 7, 2005 6.823, L2427 Dynamic Instruction Breakdown DYNAMICALLY DEAD 20 PERFORMANCE INST 1 ACE 46 PREDICATED FALSE 7 NOP 26 Average across Spec2K slices Joel Emer December 7, 2005 6.823, L2428 Mapping ACE unACE Instructions to the Instruction Queue Wrong Ex ACE ACE Idle NOP Prefetch Path ACE Inst Inst Inst Inst Architectural unACE Microarchitectural unACE Joel Emer December 7, 2005 6.823, L2429 ACE Lifetime Analysis (1) (e.g., writethrough data cache) • Idle is unACE Fill Read Read Evict Idle Valid Valid Valid Idle • Assuming all time intervals are equal • For 3/5 of the lifetime the bit is valid • Gives a measure of the structure’s utilization – Number of useful bits – Amount of time useful bits are resident in structure – Valid for a particular trace Joel Emer December 7, 2005 6.823, L2430 ACE Lifetime Analysis (2) (e.g., writethrough data cache) • Valid is not necessarily ACE Fill Read Read Evict Idle Idle Writethrough Data Cache • ACE = AVF = 2/5 = 40 • Example Lifetime Components – ACE: filltoread, readtoread – unACE: idle, readtoevict, writetoevict Joel Emer December 7, 2005 6.823, L2431 ACE Lifetime Analysis (3) (e.g., writethrough data cache) • Data ACEness is a function of instruction ACEness Fill Read Read Evict Idle Idle Writethrough Data Cache • Second Read is by an unACE instruction • AVF = 1/5 = 20 Joel Emer December 7, 2005 6.823, L2432 Instruction Queue IDLE ACE 31 29 ExACE NOP 10 15 PREDICATED WRONG PATH FALSE 3 3 DYNAMICALLY PERFORMANCE DEAD INST 8 1 ACE percentage = AVF = 29 Joel Emer December 7, 2005 6.823, L2433 Strike on a bit (e.g., in register file) Bit Read no yes benign fault Bit has error no error protection detection no no error correction detection only affects program affects program outcome outcome yes no yes no yes no benign fault False DUE True DUE SDC no error SDC = Silent Data Corruption, DUE = Detected Unrecoverable Error Joel Emer December 7, 2005 6.823, L2434 DUE AVF of Instruction Queue with Parity True DUE AVF 29 Idle Misc 38 Uncommitted 6 Neutral Dynamically CPU2000 16 Dead False DUE AVF Asim 11 33 Simpoint Itanium®2like Joel Emer December 7, 2005 6.823, L2435 Sources of False DUE in an Instruction Queue • Instructions with uncommitted results – e.g., wrongpath, predicatedfalse – solution: π (possibly incorrect) bit till commit • Instruction types neutral to errors – e.g., noops, prefetches, branch predict hints – solution: antiπ bit • Dynamically dead instructions – instructions whose results will not be used in future – solution: π bit beyond commit Joel Emer December 7, 2005 6.823, L2436 Coping with WrongPath Instructions (assume parityprotected instruction queue) inst Fetch Decode Commit Execute IQ RR X DECLARE ERROR Instruction ON ISSUE Data Cache Cache (IC) • Problem: not enough information at issue Joel Emer December 7, 2005 6.823, L2437 The π (Possibly Incorrect) Bit (assume parityprotected instruction queue) Fetch Decode Commit Execute IQ RR inst inst (π) inst π) inst (π) inst (π) inst (ππ) ) π)π) POST ERROR IN π BIT ON π Instruction Data Cache Cache (IC) ISSUE At commit point, declare error only if not wrongpath instruction and π bit is set π bit is setJoel Emer December 7, 2005 6.823, L2438 Antiπ bit: coping with Noops (assume parityprotected instruction queue) Fetch Decode Commit IQ RR Execute inst inst inst inst inst inst (antiπ) π) (antiπ) π) antiπ bit π bit neutralizes neutralizes Instruction Data Cache Cache (IC) the π bit theπ bit On issue, if the antiπ bit is set, then do not set the π bit π bit is set, then do not set theπ bitJoel Emer December 7, 2005 6.823, L2439 π bit: avoiding False DUE on Dynamically Dead Instructions write R1(π) Inst i: write R1 write R1 write R1(π) write R1(π) write R1(π) read R1 (π) read R1 read R1 Inst i+n: read R1 Fetch Decode Commit Execute IQ RR Instruction Data Cache Cache (IC) • Declare the error on reading R1, if π bit is set • If R1 isn’t read (i.e., dynamically dead), then no False DUE • π bit can be used in caches main memory … Joel Emer December 7, 2005 6.823, L2440 False DUE AVF Eliminated (PI = π) PI bit till I/O PI bit till register commit commit 12 18 PI bit till store commit 8 PI bit till register read 14 CPU2000 antiPI bit Asim 48 Simpoint Itanium®2like Practical to eliminate most of the False DUE AVF