Reliable Architectures

Dr.ShaneMatts,United States,Teacher
Published Date:23-07-2017
Joel Emer, December 7, 2005, 6.823, L24-1
Reliable Architectures
Joel Emer
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology

L24-2: Strike Changes State of a Single Bit
[Figure: bit values 0 and 1]

L24-3: Impact of Neutron Strike on a Si Device
[Figure: neutron strike on a transistor device, showing source, drain, and released + and - charges]
Strikes release electron & hole pairs that can be absorbed by source & drain to alter the state of the device
• Secondary source of upsets: alpha particles from packaging

L24-4: Cosmic Rays Come From Deep Space
[Figure: protons (p) and neutrons (n) descending to Earth's surface]
• Neutron flux is higher at higher altitudes
  – 3x - 5x increase in Denver at 5,000 feet
  – 100x increase in airplanes at 30,000+ feet

L24-5: Physical Solutions are hard
• Shielding?
  – No practical absorbent (e.g., approximately 10 ft of concrete)
  – unlike Alpha particles
• Technology solution: SOI?
  – Partially-depleted SOI of some help, effect on logic unclear
  – Fully-depleted SOI may help, but is challenging to manufacture
• Circuit level solution?
  – Radiation hardened circuits can provide 10x improvement with significant penalty in performance, area, cost
  – 2-4x improvement may be possible with less penalty

L24-6: Triple Modular Redundancy (Von Neumann, 1956)
[Figure: three modules M feeding a voter V, which outputs the Result]
V does a majority vote on the results

L24-7: Dual Modular Redundancy (e.g., Binac, Stratus)
[Figure: two modules M, each with an Error? signal, feeding a comparator C that reports Mismatch?]
• Processing stops on mismatch
• Error signal is used to decide which processor is used to restore state to the other

L24-8: Pair and Spare Lockstep (e.g., Tandem, 1975)
[Figure: a Primary pair and a Backup pair, each made of two modules M and a comparator C that reports Mismatch?]
• Primary creates periodic checkpoints
• Backup restarts from checkpoint on mismatch

L24-9: Redundant Multithreading (e.g., Reinhardt, Mukherjee, 2000)
[Figure: a Leading Thread and a Trailing Thread each execute the sequence X W X X W X X W; a comparator C checks Fault? at each W]
• Writes are checked

L24-10: Component Protection
[Figure: data bits protected by Parity and ECC, with an Error? output]
• Fujitsu SPARC in 130 nm technology (ISSCC 2003)
  – 80% of 200k latches protected with parity
  – versus very few latches protected in commodity microprocessors

L24-11: Strike on a bit (e.g., in register file)
Bit read?
  no: benign fault, no error
  yes: Bit has error protection?
    detection & correction: no error
    detection only: affects program outcome?
      yes: True DUE
      no: False DUE
    no: affects program outcome?
      yes: SDC
      no: benign fault, no error
SDC = Silent Data Corruption, DUE = Detected Unrecoverable Error

L24-12: Metrics
• Interval-based
  – MTTF = Mean Time to Failure
  – MTTR = Mean Time to Repair
  – MTBF = Mean Time Between Failures = MTTF + MTTR
  – Availability = MTTF / MTBF
• Rate-based
  – FIT = Failure in Time = 1 failure in a billion hours
  – 1 year MTTF = 10^9 / (24 × 365) FIT = 114,155 FIT
  – SER FIT = SDC FIT + DUE FIT
Hypothetical Example: Cache: 0 FIT + IQ: 100K FIT + FU: 58K FIT = Total of 158K FIT
[Image removed due to copyright restrictions.]

L24-13: Cosmic Ray Strikes: Evidence & Reaction
• Publicly disclosed incidence
  – Error logs in large servers, E. Normand, “Single Event Upset at Ground Level,” IEEE Trans. on Nucl Sci, Vol. 43, No. 6, Dec 1996.
  – Sun Microsystems found that cosmic ray strikes on an L2 cache with defective error protection caused Sun’s flagship servers to crash, R. Baumann, IRPS Tutorial on SER, 2000.
  – Cypress Semiconductor reported in 2004 that a single soft error brought a billion-dollar automotive factory to a halt once a month, Ziegler & Puchner, “SER – History, Trends, and Challenges,” Cypress, 2004.

L24-14: Vulnerable Bits Growing with Moore’s Law
[Figure: log-scale plot (1 to 10000) against Year, 2003 to 2012, with curves for 100% Vulnerable and 20% Vulnerable, a 1000 year MTBF Goal line, and a marked 12x GAP]
Typical SDC goal: 1000 year MTBF
Typical DUE goal: 10-25 year MTBF

L24-15: Architectural Vulnerability Factor (AVF)
AVF_bit = Probability Bit Matters = (# of Visible Errors) / (# of Bit Flips from Particle Strikes)
FIT_bit = intrinsic FIT_bit × AVF_bit

L24-16: Architectural Vulnerability Factor: Does a bit matter?
• Branch Predictor
  – Doesn’t matter at all (AVF = 0%)
• Program Counter
  – Almost always matters (AVF ~ 100%)

L24-17: Statistical Fault Injection (SFI) with RTL
[Figure: Simulate Strike on Latch; the latch value (0, 1) passes through Logic to the output]
Does Fault Propagate to Architectural State?
+ Naturally characterizes all logical structures

L24-18: Architecturally Correct Execution (ACE)
[Figure: data flow from Program Input to Program Outputs]
• ACE path requires only a subset of values to flow correctly through the program’s data flow graph (and the machine)
• Anything else (un-ACE path) can be derated away

L24-19: Example of un-ACE instruction: Dynamically Dead Instruction
[Figure: Dynamically Dead Instruction]
Most bits of an un-ACE instruction do not affect program output

L24-20: Vulnerability of a structure
AVF = fraction of cycles a bit contains ACE state
[Figure: T = 1, ACE% = 2/4]
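
The arithmetic on the Metrics and AVF slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up for this example, the per-cycle ACE counts are hypothetical (only the 2/4 value at T = 1 appears on the slide), and the structure AVF is taken to be the average ACE fraction over the observed cycles.

```python
HOURS_PER_YEAR = 24 * 365      # as on the Metrics slide (leap years ignored)
HOURS_PER_FIT = 10**9          # 1 FIT = 1 failure in a billion hours


def mttf_years_to_fit(mttf_years):
    """Convert a mean time to failure in years into a FIT rate."""
    return HOURS_PER_FIT / (mttf_years * HOURS_PER_YEAR)


def fit_to_mttf_years(fit):
    """Convert a FIT rate back into a mean time to failure in years."""
    return HOURS_PER_FIT / (fit * HOURS_PER_YEAR)


def structure_avf(ace_entries_per_cycle, total_entries):
    """Average fraction of entries holding ACE state over the observed cycles.

    Assumption: every entry is the same size, so counting entries is
    equivalent to counting bits.
    """
    cycles = len(ace_entries_per_cycle)
    return sum(ace_entries_per_cycle) / (total_entries * cycles)


def effective_fit(intrinsic_fit, avf):
    """FIT = intrinsic FIT x AVF, as on the AVF slide."""
    return intrinsic_fit * avf


# 1 year MTTF expressed in FIT, matching the slide's figure.
print(round(mttf_years_to_fit(1)))  # 114155

# Hypothetical example from the Metrics slide: FIT rates of components add.
total_fit = 0 + 100_000 + 58_000    # Cache + IQ + FU
print(total_fit, "FIT is an MTTF of", fit_to_mttf_years(total_fit), "years")

# Hypothetical 4-entry structure: 2 ACE entries at T = 1 (from the slide),
# followed by made-up counts for three more cycles.
avf = structure_avf([2, 1, 3, 2], total_entries=4)
print("AVF:", avf, "effective FIT:", effective_fit(100_000, avf))
```

Because FIT rates are additive while MTTF values are not, a per-component budget is usually easier to reason about in FIT, converting to years only at the end to compare against goals such as the 1000 year SDC target.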