Question? Leave a message!




Reliable Architectures

Reliable Architectures
Joel Emer December 7, 2005 6.823, L24-1 Reliable Architectures Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Joel Emer December 7, 2005 6.823, L24-2 Strike Changes State of a Single Bit 0 1 Joel Emer December 7, 2005 6.823, L24-3 Impact of Neutron Strike on a Si Device neutron strike Strikes release electron source drain & hole pairs that can be absorbed by source & + + - + + drain to alter the state - - - of the device Transistor Device • Secondary source of upsets: alpha particles from packaging Joel Emer December 7, 2005 6.823, L24-4 Cosmic Rays Come From Deep Space p p n n p n n p n p n Earth’s Surface • Neutron flux is higher in higher altitudes 3x - 5x increase in Denver at 5,000 feet 100x increase in airplanes at 30,000+ feet Joel Emer December 7, 2005 6.823, L24-5 Physical Solutions are hard • Shielding? – No practical absorbent (e.g., approximately 10 ft of concrete) – unlike Alpha particles • Technology solution: SOI? – Partially-depleted SOI of some help, effect on logic unclear – Fully-depleted SOI may help, but is challenging to manufacture • Circuit level solution? – Radiation hardened circuits can provide 10x improvement with significant penalty in performance, area, cost – 2-4x improvement may be possible with less penalty Joel Emer December 7, 2005 6.823, L24-6 Triple Modular Redundancy (Von Neumann, 1956) M M V Result M V does a majority vote on the results Joel Emer December 7, 2005 6.823, L24-7 Dual Modular Redundancy (e.g., Binac, Stratus) Error? M Mismatch? C M Error? • Processing stops on mismatch • Error signal used to decide which processor be used to restore state to other Joel Emer December 7, 2005 6.823, L24-8 Pair and Spare Lockstep (e.g., Tandem, 1975) Primary M Mismatch? C M Backup M Mismatch? C M • Primary creates periodic checkpoints • Backup restarts from checkpoint on mismatch Joel Emer December 7, 2005 6.823, L24-9 Redundant Multithreading (e.g., Reinhardt, Mukherjee, 2000) Leading Thread X W X X W X X W C Fault? C Fault? C Fault? Trailing Thread X W X X W X X W • Writes are checked Joel Emer December 7, 2005 6.823, L24-10 Component Protection Parity ECC 1 1 0 1 1 … 0 0 … Parity ECC Error? … 1 1 • Fujitsu SPARC in 130 nm technology (ISSCC 2003) – 80% of 200k latches protected with parity –versus very few latches protected in commodity microprocessors Joel Emer December 7, 2005 6.823, L24-11 Strike on a bit (e.g., in register file) Bit Read? no yes benign fault Bit has error no error protection? detection & no no error correction detection only affects program affects program outcome? outcome? yes no yes no yes no benign fault False DUE True DUE SDC no error SDC = Silent Data Corruption, DUE = Detected Unrecoverable Error Joel Emer December 7, 2005 6.823, L24-12 Metrics • Interval-based – MTTF = Mean Time to Failure – MTTR = Mean Time to Repair – MTBF = Mean Time Between Failures = MTTF + MTTR – Availability = MTTF / MTBF • Rate-based – FIT = Failure in Time = 1 failure in a billion hours 9 –1 year MTTF = 10 / (24 365) FIT = 114,155 FIT – SER FIT = SDC FIT + DUE FIT Hypothetical Example Cache: 0 FIT Image removed due to + IQ: 100K FIT copyright restrictions. FU: 58K FIT + Total of 158K FIT Joel Emer December 7, 2005 6.823, L24-13 Cosmic Ray Strikes: Evidence & Reaction • Publicly disclosed incidence – Error logs in large servers, E. Normand, “Single Event Upset at Ground Level,” IEEE Trans. on Nucl Sci, Vol. 43, No. 6, Dec 1996. – Sun Microsystems found cosmic ray strikes on L2 cache with defective error protection caused Sun’s flagship servers to crash, R. Baumann, IRPS Tutorial on SER, 2000. – Cypress Semiconductor reported in 2004 a single soft error brought a billion-dollar automotive factory to a halt once a month, Zielger & Puchner, “SER – History, Trends, and Challenges,” Cypress, 2004. Joel Emer December 7, 2005 6.823, L24-14 Vulnerable Bits Growing with Moore’s Law 10000 12x GAP 1000 100 10 100% Vulnerable 1 20% Vulnerable 1000 year MTBF Goal Year Typical SDC goal: 1000 year MTBF Typical DUE goal: 10-25 year MTBF 200 3 200 4 200 5 200 6 200 7 200 8 200 9 201 0 201 1 201 2 Joel Emer December 7, 2005 6.823, L24-15 Architectural Vulnerability Factor (AVF) AVF = Probability Bit Matters bit of Visible Errors = of Bit Flips from Particle Strikes FIT = intrinsic FIT AVF bit bit bit Joel Emer December 7, 2005 6.823, L24-16 Architectural Vulnerability Factor Does a bit matter? • Branch Predictor – Doesn’t matter at all (AVF = 0%) • Program Counter – Almost always matters (AVF 100%) Joel Emer December 7, 2005 6.823, L24-17 Statistical Fault Injection (SFI) with RTL Simulate Strike on Latch 0 1 output Logic Logic 0 Does Fault Propagate to Architectural State + Naturally characterizes all logical structures Joel Emer December 7, 2005 6.823, L24-18 Architecturally Correct Execution (ACE) Program Input Program Outputs • ACE path requires only a subset of values to flow correctly through the program’s data flow graph (and the machine) • Anything else (un-ACE path) can be derated away Joel Emer December 7, 2005 6.823, L24-19 Example of un-ACE instruction: Dynamically Dead Instruction Dynamically Dead Instruction Most bits of an un-ACE instruction do not affect program output Joel Emer December 7, 2005 6.823, L24-20 Vulnerability of a structure AVF = fraction of cycles a bit contains ACE state T = 1 ACE% = 2/4 Joel Emer December 7, 2005 6.823, L24-21 Vulnerability of a structure AVF = fraction of cycles a bit contains ACE state T = 2 ACE% = 1/4 Joel Emer December 7, 2005 6.823, L24-22 Vulnerability of a structure AVF = fraction of cycles a bit contains ACE state T = 3 ACE% = 0/4 Joel Emer December 7, 2005 6.823, L24-23 Vulnerability of a structure AVF = fraction of cycles a bit contains ACE state T = 4 ACE% = 3/4 Joel Emer December 7, 2005 6.823, L24-24 Vulnerability of a structure AVF = fraction of cycles a bit contains ACE state ( 2 + 1 + 0 + 3 ) / 4 ( 2 + 1 + 0 + 3 ) / 4 = 4 4 Average number of ACE bits in a cycle Average number of ACE bits in a cycle = Total number of bits in the structure Total number of bits in the structureJoel Emer December 7, 2005 6.823, L24-25 Little’s Law for ACEs Nace = Tace × Lace Nace AVF = NtotalJoel Emer December 7, 2005 6.823, L24-26 Computing AVF • Approach is conservative – Assume every bit is ACE unless proven otherwise • Data Analysis using a Performance Model – Prove that data held in a structure is un-ACE • Timing Analysis using a Performance Model – Tracks the time this data spent in the structure Joel Emer December 7, 2005 6.823, L24-27 Dynamic Instruction Breakdown DYNAMICALLY DEAD 20% PERFORMANCE INST 1% ACE 46% PREDICATED FALSE 7% NOP 26% Average across Spec2K slices Joel Emer December 7, 2005 6.823, L24-28 Mapping ACE & un-ACE Instructions to the Instruction Queue Wrong- Ex- ACE ACE Idle NOP Prefetch Path ACE Inst Inst Inst Inst Architectural un-ACE Micro-architectural un-ACE Joel Emer December 7, 2005 6.823, L24-29 ACE Lifetime Analysis (1) (e.g., write-through data cache) • Idle is unACE Fill Read Read Evict Idle Valid Valid Valid Idle • Assuming all time intervals are equal • For 3/5 of the lifetime the bit is valid • Gives a measure of the structure’s utilization – Number of useful bits – Amount of time useful bits are resident in structure – Valid for a particular trace Joel Emer December 7, 2005 6.823, L24-30 ACE Lifetime Analysis (2) (e.g., write-through data cache) • Valid is not necessarily ACE Fill Read Read Evict Idle Idle Write-through Data Cache • ACE % = AVF = 2/5 = 40% • Example Lifetime Components – ACE: fill-to-read, read-to-read – unACE: idle, read-to-evict, write-to-evict Joel Emer December 7, 2005 6.823, L24-31 ACE Lifetime Analysis (3) (e.g., write-through data cache) • Data ACEness is a function of instruction ACEness Fill Read Read Evict Idle Idle Write-through Data Cache • Second Read is by an unACE instruction • AVF = 1/5 = 20% Joel Emer December 7, 2005 6.823, L24-32 Instruction Queue IDLE ACE 31% 29% Ex-ACE NOP 10% 15% PREDICATED WRONG PATH FALSE 3% 3% DYNAMICALLY PERFORMANCE DEAD INST 8% 1% ACE percentage = AVF = 29% Joel Emer December 7, 2005 6.823, L24-33 Strike on a bit (e.g., in register file) Bit Read? no yes benign fault Bit has error no error protection? detection & no no error correction detection only affects program affects program outcome? outcome? yes no yes no yes no benign fault False DUE True DUE SDC no error SDC = Silent Data Corruption, DUE = Detected Unrecoverable Error Joel Emer December 7, 2005 6.823, L24-34 DUE AVF of Instruction Queue with Parity True DUE AVF 29% Idle & Misc 38% Uncommitted 6% Neutral Dynamically CPU2000 16% Dead False DUE AVF Asim 11% 33% Simpoint Itanium®2-like Joel Emer December 7, 2005 6.823, L24-35 Sources of False DUE in an Instruction Queue • Instructions with uncommitted results – e.g., wrong-path, predicated-false – solution: π (possibly incorrect) bit till commit • Instruction types neutral to errors – e.g., no-ops, prefetches, branch predict hints – solution: anti-π bit • Dynamically dead instructions – instructions whose results will not be used in future – solution: π bit beyond commit Joel Emer December 7, 2005 6.823, L24-36 Coping with Wrong-Path Instructions (assume parity-protected instruction queue) inst Fetch Decode Commit Execute IQ RR X DECLARE ERROR Instruction ON ISSUE Data Cache Cache (IC) • Problem: not enough information at issue Joel Emer December 7, 2005 6.823, L24-37 The π (Possibly Incorrect) Bit (assume parity-protected instruction queue) Fetch Decode Commit Execute IQ RR inst inst (π) inst π) inst (π) inst (π) inst (ππ) ) π)π) POST ERROR IN π BIT ON π Instruction Data Cache Cache (IC) ISSUE At commit point, declare error only if not wrong-path instruction and π bit is set π bit is setJoel Emer December 7, 2005 6.823, L24-38 Anti-π bit: coping with No-ops (assume parity-protected instruction queue) Fetch Decode Commit IQ RR Execute inst inst inst inst inst inst (anti-π) π) (anti-π) π) anti-π bit π bit neutralizes neutralizes Instruction Data Cache Cache (IC) the π bit theπ bit On issue, if the anti-π bit is set, then do not set the π bit π bit is set, then do not set theπ bitJoel Emer December 7, 2005 6.823, L24-39 π bit: avoiding False DUE on Dynamically Dead Instructions write R1(π) Inst i: write R1 write R1 write R1(π) write R1(π) write R1(π) read R1 (π) read R1 read R1 Inst i+n: read R1 Fetch Decode Commit Execute IQ RR Instruction Data Cache Cache (IC) • Declare the error on reading R1, if π bit is set • If R1 isn’t read (i.e., dynamically dead), then no False DUE • π bit can be used in caches & main memory … Joel Emer December 7, 2005 6.823, L24-40 % False DUE AVF Eliminated (PI = π) PI bit till I/O PI bit till register commit commit 12% 18% PI bit till store commit 8% PI bit till register read 14% CPU2000 anti-PI bit Asim 48% Simpoint Itanium®2-like Practical to eliminate most of the False DUE AVF
Website URL
Comment