Lecture 01 - Introduction
Fault Tolerance
SPRING 2021
What is Fault-Tolerance?
Key attributes
Examples
General Purpose Systems
PCs: RAMs with parity checks and possibly ECC (Error-Correcting
Code)
(re-execution on failure detection is being investigated)
Workstations/Servers: error detection (HW), occasional corrective
action (SW), even ECC (HW), keeping logs (SW)
Examples
Reliable Systems
Telephone systems
Banking systems e.g. ATM
Stock market
CAE (Cambridge English) - exams/projects
Football games - display/ticketing
Examples
Critical and Life Critical Systems
Manned and unmanned space borne systems
Aircraft control systems
Nuclear reactor control systems
Life support systems
Examples
Reliable -> Critical Systems
911 telephone switching system
Traffic light control system
Automotive control systems (ABS, Fuel injection system)
New initiatives
Goals of fault-tolerance
Applications of fault-tolerance
New initiatives
Higher device density – more failures likely
Power issue – scheduler, on-chip sensors
Failures due to soft-errors, life time degradations
- hardening, re-execution
- on-chip ECC
- reconfiguration
- micro-architectural solutions
- architectural solutions
Intuitive concepts
Reliability – continues to work
Availability – works when I need it
Safety – does not put me in jeopardy
Performability
Maintainability
Testability
Survivability – will the system survive catastrophic events?
Security
Applications
Space borne system
long life system
Airplane control system
critical system
Transaction processing system
high availability system
Switching system
high availability over certain level of performance
New definition: “ability to avoid service failures that are more frequent or
more severe than is acceptable” - deliver service that can justifiably be
trusted
Reason for modification
Security related issues
This recognizes that a system can fail, and usually does, yet can still be called
dependable
This definition also enables a connection with “development failures”
Dependability/Security Attributes
Fault – active or dormant
Error – masked or latent
Failure – incorrect response
Fault -> Error -> Failure
Fault classes
Groups (not exclusive)
Development, Physical (faults that affect hardware), Interaction
Viewpoints:
phase, system boundary, cause, dimension, objective, intent, capability,
persistence
Failure classes
Development failures
Service failures
Security failures
2. Failure detectability
Signal provided by some checking mechanism
Signaled failure
Unsignaled failure
False alarm
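The three detectability outcomes above can be read as combinations of two observations: whether the service actually failed and whether the checking mechanism raised a signal. A minimal sketch of that reading follows; the function name and the "correct service" label for the fourth combination are assumptions for illustration, not terms from the lecture.

```python
def classify_detection(service_failed: bool, alarm_raised: bool) -> str:
    """Map (actual failure, checker signal) to a detectability class."""
    if service_failed and alarm_raised:
        return "signaled failure"      # failure occurred and was reported
    if service_failed:
        return "unsignaled failure"    # failure occurred, checker stayed silent
    if alarm_raised:
        return "false alarm"           # checker signaled, but service was correct
    return "correct service"           # hypothetical label for the remaining case


for failed in (True, False):
    for alarm in (True, False):
        print(failed, alarm, "->", classify_detection(failed, alarm))
```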
3. Consistency
Consistent failure – all services see the same data
Inconsistent – different services see different data (like Byzantine
failure)
4. Consequence of failure
Need to rate the failure and hence develop criteria – examples:
Outage of duration (availability related)
Lives being endangered (safety related)
Extent of corrupted service (integrity related)
Amount of information disclosed (confidentiality related)
• Fault Removal
Remove faults during development phase – extensive simulation and validation
Testing
• Deterministic testing
• Random and statistical testing
• Back to back testing
Test/validation quality: fault injection, design for
test/verification
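As an illustration of back-to-back testing combined with random testing, the sketch below feeds the same randomly generated inputs to two independently written implementations and records any input on which they disagree. The helper names (`back_to_back`, `insertion_sort`, `random_list`) and the choice of sorting as the function under test are assumptions made for this example.

```python
import random


def back_to_back(impl_a, impl_b, make_input, trials=200, seed=0):
    """Run both implementations on identical inputs; return inputs where they differ."""
    rng = random.Random(seed)          # fixed seed keeps the test run repeatable
    mismatches = []
    for _ in range(trials):
        x = make_input(rng)
        # Each implementation gets its own copy so neither can disturb the other.
        if impl_a(list(x)) != impl_b(list(x)):
            mismatches.append(x)
    return mismatches


def insertion_sort(values):
    """Independent implementation to compare against the built-in sorted()."""
    out = []
    for v in values:
        i = len(out)
        while i > 0 and out[i - 1] > v:
            i -= 1
        out.insert(i, v)
    return out


def random_list(rng):
    return [rng.randint(0, 9) for _ in range(rng.randint(0, 8))]


print(len(back_to_back(insertion_sort, sorted, random_list)), "mismatches")
```

Agreement between the two versions does not prove either is correct, since both could share the same mistake; a disagreement, however, shows that at least one of them is wrong.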
Hardware redundancy
Low level
High level
Software Redundancy
Time Redundancy
Information Redundancy
Software Redundancy
Use two different programs/algorithms
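A minimal sketch of this idea: the same function (here, integer square root) is computed by two different algorithms and the results are compared before one is accepted. The function names and the choice to raise an exception on disagreement are assumptions for illustration; a real system might instead vote among three or more versions.

```python
def isqrt_newton(n: int) -> int:
    """Integer square root via Newton's iteration."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 0
    x = n
    y = (x + 1) // 2
    while y < x:
        x = y
        y = (x + n // x) // 2
    return x


def isqrt_bisect(n: int) -> int:
    """Integer square root via binary search (a different algorithm)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def dual_isqrt(n: int) -> int:
    """Accept a result only if both versions agree."""
    a, b = isqrt_newton(n), isqrt_bisect(n)
    if a != b:
        raise RuntimeError(f"versions disagree for n={n}: {a} vs {b}")
    return a


print(dual_isqrt(17))
```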
Time Redundancy
Re-compute or redo the task and compare the results
May or may not use the same hardware/software
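The re-compute-and-compare idea can be sketched as follows. The task is run twice and the result is accepted only when both runs agree; otherwise the pair is retried. The names `run_with_time_redundancy` and `make_flaky_adder`, and the injected bit flip, are hypothetical and exist only to demonstrate the mechanism.

```python
def run_with_time_redundancy(task, args, max_rounds=3):
    """Execute task twice per round; return the result once two runs agree."""
    for _ in range(max_rounds):
        first = task(*args)
        second = task(*args)
        if first == second:
            return first
    raise RuntimeError("results never agreed; fault may not be transient")


def make_flaky_adder(faulty_calls):
    """Build an adder whose result is corrupted on the listed call numbers."""
    calls = {"n": 0}

    def add(a, b):
        calls["n"] += 1
        result = a + b
        if calls["n"] in faulty_calls:
            result ^= 1                # injected transient fault: flip the low bit
        return result

    return add


adder = make_flaky_adder({1})          # only the first call is corrupted
print(run_with_time_redundancy(adder, (2, 3)))
```

Repeating the computation on the same hardware and software catches transient faults only. A permanent fault that corrupts both runs identically would pass the comparison, which is one reason the repetition may use different hardware or software.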
Information Redundancy
backup information
Use of ECC
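As one concrete instance of ECC, the sketch below uses a Hamming(7,4) code: three parity bits are added to four data bits so that any single flipped bit can be located and corrected. The lecture does not say which code is meant; Hamming(7,4) is chosen here only because it is small, and the function names are assumptions.

```python
def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4      # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4      # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4      # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct up to one flipped bit; return (data bits, error position or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4    # 1-based position of the bad bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                           # simulate a single-bit soft error
print(hamming74_decode(word))
```

A plain parity bit, by contrast, can only detect a single-bit error; it cannot say which bit to fix.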
Intuitive definitions
Origins of faults
Methods to break FEF chain
Attribute of faults
Intuitive definitions
Fault -
An anomalous physical condition caused by a manufacturing
problem, fatigue, external disturbance (intentional or
unintentional), design flaw, …
Causes
Error - Effect of activation of a fault
Failure - over-all system effect of an error
Fault -> Error -> Failure
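The chain can be made concrete with a toy circuit, output = (a AND b) OR c, whose AND gate has a stuck-at-0 fault. The fault stays dormant until the inputs activate it, the resulting error may be masked by the OR gate, and only an unmasked error becomes a failure at the output. The circuit, the fault model, and the function names are assumptions chosen for this sketch.

```python
def and_gate(a, b, stuck_at_zero=False):
    """2-input AND gate; with the fault present its output is always 0."""
    return 0 if stuck_at_zero else (a & b)


def circuit(a, b, c, faulty=False):
    """Return (internal signal, system output) for output = (a AND b) OR c."""
    internal = and_gate(a, b, stuck_at_zero=faulty)
    return internal, internal | c


def classify(a, b, c):
    """Compare the faulty circuit against the fault-free one for given inputs."""
    good_internal, good_out = circuit(a, b, c)
    bad_internal, bad_out = circuit(a, b, c, faulty=True)
    if bad_internal == good_internal:
        return "fault dormant"     # fault present but not activated
    if bad_out == good_out:
        return "error masked"      # internal state wrong, output still correct
    return "failure"               # error reached the system output


for inputs in [(0, 1, 0), (1, 1, 1), (1, 1, 0)]:
    print(inputs, "->", classify(*inputs))
```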
Origins of faults
Physical device level (HW)
Logic level (HW)
Chip level (HW)
System level (HW/SW)
interfacing, specifications, …
Why systems fail
Fault-Error-Failure concept
Attribute of faults
Cause
Nature
Duration
Extent
Value