
Lecture 01 - Introduction

This document provides an introduction to fault tolerance by Dr. Tarek Abdul Hamid. It defines key concepts like faults, errors, failures and discusses attributes of dependability like availability, reliability, and security. Examples of systems that require fault tolerance are presented, from general purpose to critical systems. The goals, applications and new initiatives in fault tolerance are outlined.

Uploaded by

Ahmed Hamdy

CSC310

Fault Tolerance
SPRING 2021

Lecture 01- Introduction


Instructor: Dr. Tarek Abdul Hamid
Motivation

What is Fault-Tolerance?

A “fault-tolerant system” is one that continues to perform at a desired level of service in spite of failures in some of the components that constitute the system.

2 Fault Tolerant Computing Dr. Tarek abdul Hamid


Motivation

Key attributes

Fault - Error - Failure


Performance - Availability - Reliability
More recently, the concept of “survivability”
Including these constraints at the design stage is likely to be more cost-effective.



Motivation

• Who is concerned about fault tolerance?
System users, irrespective of the application, though some are far more concerned than others
• Who is concerned at the design stage?
Universities -> R, d, and a (Research, development, applications)
Industry -> r, D, and A (research, Development, Applications)
• Issues
Design, Analysis/Validation, Implementation, Testing/Validation, Evaluation



Motivation

Examples
General Purpose Systems
PCs: RAMs with parity checks and possibly ECC (Error-Correcting Code)
(re-execution on failure detection is being investigated)
Workstations/Servers: error detection (HW), occasional corrective action (SW), even ECC (HW), log keeping (SW)



Motivation

Examples
Reliable Systems
Telephone systems
Banking systems e.g. ATM
Stock market
CAE (Cambridge English) - exams/projects
Football games - display/ticketing



Motivation

Examples
Critical and Life Critical Systems
Manned and unmanned space borne systems
Aircraft control systems
Nuclear reactor control systems
Life support systems



Motivation

Examples
Reliable -> Critical Systems
911 telephone switching system
Traffic light control system
Automotive control systems (ABS, Fuel injection system)



Introduction

New initiatives
Goals of fault-tolerance
Applications of fault-tolerance



Introduction

New initiatives
Higher device density -> more failures likely
Power issues – scheduler, on-chip sensors
Failures due to soft errors and lifetime degradation
- hardening, re-execution
- on-chip ECC
- reconfiguration
- micro-architectural solutions
- architectural solutions



Introduction

New initiatives (contd.)


Deep-submicron technology and time-to-market pressure -> designs not fully verified
Implementation of numerous functionalities on chip/board/system -> possibility of system hang-up
Speculative execution -> results may need to be re-checked
Low cost of HW and SW -> affordable/economical
Hot issues: soft errors, lifetime failures, power and thermal management



Introduction

Goals - different goals for different applications


The key word is “reliability”, which has different meanings for different users and applications
Intuitive explanations
Dependability
Service
Specification



Introduction

Intuitive concepts
Reliability – continues to work
Availability – works when I need it
Safety – does not put me in jeopardy
Performability
Maintainability
Testability
Survivability – will the system survive catastrophic events?
Security



Introduction

Applications
Space borne system -> long life system
Airplane control system -> critical system
Transaction processing system -> high availability system
Switching system -> high availability over a certain level of performance



Terminology and definitions

Reliability and concept of probability


R(t): the conditional probability that a system provides continuous proper service in the interval [0,t], given that it provided the desired service at time 0.
Availability
Performability
An Example
Dependability
Security
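
As a rough numerical illustration of R(t) and availability, the sketch below assumes a constant failure rate (the exponential model R(t) = e^(-λt)) and the common steady-state availability formula MTTF / (MTTF + MTTR). Neither the model nor the numbers come from the slides; the function names are hypothetical.

```python
import math


def reliability(t_hours: float, failure_rate_per_hour: float) -> float:
    """R(t) under the ASSUMED constant-failure-rate (exponential) model."""
    return math.exp(-failure_rate_per_hour * t_hours)


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    lam = 1e-4  # hypothetical failure rate: one failure per 10,000 hours
    print(f"R(1000 h) = {reliability(1000, lam):.4f}")
    print(f"Availability = {steady_state_availability(10_000, 10):.5f}")
```

Under this assumption, reliability only decreases with time, while availability can stay high if repairs are fast, which is why the two attributes are listed separately.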



System Defined (1/4)
• “. . . an entity that interacts with other entities”
- First entity (system) – limited to be “electronic (mostly digital)” or “computer based”
- Second entity – hardware, software, human, other systems, ... (can also be called the “environment”)

• Characterization and fundamental properties
- Functionality
- Performance
- Dependability and security
- Cost
(usability, manageability, adaptability: not directly included in the paper)



System Defined (2/4)
• Function – “what the system is intended to do”
- Functional specification: describes the system in terms of functionality and performance
- Behavior – described as a sequence of states that implement the functionality
- Total states – set of states as the system evolves
  - Internal states
  - External states – as viewed by the environment and users

• Structure – “what enables system behavior (function)”
- Interconnected components – recursively defined down to the “atomic” level



System Defined (3/4)
• System life cycle
- Development phase
- Use phase
• Service – what is delivered by the system to its “environment” (user)
- The environment sees only the “external states”
• Development phase – activities from concept to the decision that the system is ready for the “use phase”
• Use phase – more meaningful; includes service delivery, service outage, service shutdown, and maintenance



System Defined (4/4)
• Development phase environment
- Physical world
- Human developers
- Development tools
- Production and test facilities
• Use phase environment
- Physical world
- Administrators – maintainers
- Users and intruders
- Providers and infrastructure



Dependability/Security Attributes (1/6)

• Original definition: “ability to deliver service that can justifiably be trusted”
• Encompassing the following attributes
- Availability
- Reliability
- Safety
- Integrity
- Maintainability



Dependability/Security Attributes (2/6)

• New definition: “ability to avoid service failures that are more frequent or more severe than is acceptable” (compare the original: deliver service that can justifiably be trusted)
• Reason for modification
- Security related issues
- This recognizes that a system can fail, and usually does, yet can still be called dependable
- This definition also enables a connection with “development failures”



Dependability/Security Attributes (3/6)
Dependability
availability: readiness for correct service.
reliability: continuity of correct service.
safety: absence of catastrophic consequences on the user(s) and the environment.
integrity: absence of improper system alterations.
maintainability: ability to undergo modifications and repairs.

When addressing security, an additional attribute is confidentiality: the absence of unauthorized disclosure.



Dependability/Security Attributes (4/6)

Security is the concurrent existence of a composite of the attributes
1) availability (for authorized actions only),
2) confidentiality, and
3) integrity (with “improper” meaning “unauthorized”)



Dependability/Security Attributes (5/6)



Dependability/Security Attributes (6/6)

Other related concepts are
Dependability
High confidence
Survivability
Trustworthiness
Example: all of these have similar goals, such as
1) ability to deliver service,
2) predictable service,
3) fulfilling the mission,
4) assurance of expected service delivery



Threats and modeling threats (1/12)

• Different phases are open to different types of threats – generally termed “Faults”
• Faults lead to “Errors” – a total state of the system different from the “true total state”
• Errors can lead to “Failure” – the service deviates from the desired service
• This creates an FEF (Fault-Error-Failure) chain – a hierarchical phenomenon



Threats and modeling threats (2/12)
Fault activation – Error manifestation – Failure

Fault – active or dormant
Error – masked or latent
Failure – incorrect response



Threats and modeling threats (3/12)
FEF chain in a hierarchy



Threats and modeling threats (4/12)

Fault classes
Groups (not exclusive)
Development, Physical (those that affect hardware), Interaction
Viewpoints:
phase, system boundary, cause, dimension, objective, intent, capability, persistence



Threats and modeling threats (5/12)

• Fault Taxonomy and Examples


Production defect: physical, hardware, natural
Bug: development, software, human-made
Omission (absence of an action): human-made, system-generated
Malicious (meant to cause harm): human-made, hardware or software



Threats and modeling threats (6/12)

• Fault Taxonomy (contd.)

Permanent faults
Intermittent faults – repeat at some interval
Transient faults – no specific interval
Malicious logic faults – caused by natural faults
Intrusion attempts – caused by humans
Interaction faults – may be development phase or use phase
Configuration faults – incorrect setting of parameters



Threats and modeling threats (7/12)
Error classes
Detected
Latent
An example
An adder gives an incorrect sum for certain operands
The fault is active when those operands appear; otherwise it is dormant
The incorrect sum is a latent error unless it is used or checked for correctness
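
The adder example can be sketched in code. The specific fault below (one flipped bit when both operands end in 0xF) and the inverse-operation check are hypothetical choices made for illustration; they are not taken from the slides.

```python
def faulty_add(a: int, b: int) -> int:
    """Hypothetical adder with a design fault (illustration only)."""
    if (a & 0xF) == 0xF and (b & 0xF) == 0xF:
        # The fault is ACTIVE for these operands: bit 4 of the sum is flipped.
        return (a + b) ^ 0x10
    # The fault is DORMANT for all other operands.
    return a + b


def checked_add(a: int, b: int) -> int:
    """Turn a latent error into a detected one with an inverse-operation check."""
    s = faulty_add(a, b)
    if s - b != a:  # assumes the subtraction itself is fault-free
        raise ArithmeticError(f"adder error detected for {a} + {b}")
    return s


if __name__ == "__main__":
    print(faulty_add(1, 2))    # dormant fault, correct sum
    print(faulty_add(15, 15))  # active fault; the error stays latent, nobody checks
    try:
        checked_add(15, 15)
    except ArithmeticError as exc:
        print(exc)             # the same error, now detected
```

Without the check, the wrong sum propagates silently and becomes a failure only if it reaches the service interface.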



Threats and modeling threats (8/12)

Failure classes
Development failures
Service failures
Security failures



Threats and modeling threats (9/12)

Development failures – introduced during the development phase
Human developers
Tools
Production facility
Budgetary reasons
Scheduling issues (time to market)
(basically, the system delivered is a downgraded system)



Threats and modeling threats (10/12)

Service failures – delivery of incorrect service – four viewpoints

1. Failure domain
Content failure
Timing failure – early or late delivery of the service(s)
- Special cases: silent failure, halt failure, crash failure
- Erratic failure (like Byzantine failure)



Threats and modeling threats (11/12)

2. Failure detectability
Signal provided by some checking mechanism
- Signaled failure
- Unsignaled failure
- False alarm

3. Consistency
Consistent failure – all services see the same data
Inconsistent failure – different services see different data (like Byzantine failure)
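
The detectability classes in item 2 follow from two yes/no observations: did the service actually fail, and did the checking mechanism raise a signal? A minimal sketch, with hypothetical names:

```python
from enum import Enum


class Detectability(Enum):
    NO_FAILURE = "correct service, no signal"
    SIGNALED_FAILURE = "signaled failure"
    UNSIGNALED_FAILURE = "unsignaled failure"
    FALSE_ALARM = "false alarm"


def classify(service_failed: bool, checker_signaled: bool) -> Detectability:
    """Map the two observations onto the detectability classes."""
    if service_failed and checker_signaled:
        return Detectability.SIGNALED_FAILURE
    if service_failed:
        return Detectability.UNSIGNALED_FAILURE
    if checker_signaled:
        return Detectability.FALSE_ALARM
    return Detectability.NO_FAILURE


if __name__ == "__main__":
    for failed in (False, True):
        for signaled in (False, True):
            print(failed, signaled, classify(failed, signaled).value)
```

Unsignaled failures and false alarms are both failures of the checking mechanism itself, which is why the checker has to be counted as part of the system being evaluated.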



Threats and modeling threats (12/12)

4. Consequence of failure
Need to rate the failure and hence develop criteria – examples:
- Duration of outage (availability related)
- Lives being endangered (safety related)
- Extent of corrupted service (integrity related)
- Amount of information disclosed (confidentiality related)



Means to attain dependability (1/6)

• Fault Prevention or Fault Avoidance
Improvement of development process
Elimination of causes that can induce faults
• Fault Tolerance
Techniques and implementations (more later)



Means to attain dependability (2/6)

• Fault Removal
- Remove faults during the development phase – extensive simulation and validation
Testing
• Deterministic testing
• Random and statistical testing
• Back-to-back testing
Test/validation quality: fault injection, design for test/verification



Means to attain dependability (3/6)

• Fault Forecasting – evaluate the system behavior and then use one or more of the methods previously discussed to improve dependability
- Qualitative evaluation
- Quantitative evaluation
- Use of benchmarks
- Use of simulators

Examples: 1) error and failure logs
2) when and where commissioned



Means to attain dependability (4/6)

Fault Tolerance Techniques

• Error detection - need redundancy
- Duplicate execution
- Use of parity
- Checker programs and/or hardware
- More later
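
Two of the detection techniques listed above, parity and duplicate execution, can be sketched in a few lines. The function names are hypothetical, and even parity is assumed; the slides do not specify a parity convention.

```python
def even_parity_bit(data: int) -> int:
    """Parity bit that makes the total number of 1 bits (data + parity) even."""
    return bin(data).count("1") % 2


def parity_ok(data: int, parity: int) -> bool:
    """True if the stored parity bit still matches the data word."""
    return even_parity_bit(data) == parity


def run_duplicated(task, *args):
    """Duplicate execution: run the task twice and compare the results."""
    first = task(*args)
    second = task(*args)
    if first != second:
        raise RuntimeError("duplicate execution mismatch: error detected")
    return first


if __name__ == "__main__":
    word = 0b10110010
    p = even_parity_bit(word)
    print(parity_ok(word, p))          # intact word passes the check
    print(parity_ok(word ^ 0b100, p))  # a single flipped bit is detected
    print(parity_ok(word ^ 0b110, p))  # two flipped bits escape detection
    print(run_duplicated(sum, [1, 2, 3]))
```

Both techniques only detect: a single parity bit cannot locate the flipped bit, and duplicate execution on the same hardware catches transient faults but not a permanent fault that corrupts both runs identically.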



Means to attain dependability (5/6)

• Recovery - Key is redundancy
Error handling
• Masking and compensation
• Rollback
• Rollforward
Fault handling
• Diagnosis
• Isolation
• Reconfiguration
• Initialization
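
Rollback, one of the error-handling options above, restores a previously saved error-free state. The sketch below is a minimal in-memory illustration; the class, the function, and the validity check are all hypothetical.

```python
import copy


class CheckpointedState:
    """Hypothetical state container that supports checkpoint and rollback."""

    def __init__(self, state: dict):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self) -> None:
        """Save a redundant copy of the current (assumed correct) state."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        """Discard the current state and restore the last checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)


def apply_with_rollback(store: CheckpointedState, update, is_valid) -> bool:
    """Apply an update; roll back if it raises or leaves an invalid state."""
    store.checkpoint()
    try:
        update(store.state)
        if not is_valid(store.state):
            raise ValueError("error detected in new state")
    except Exception:
        store.rollback()
        return False
    return True


if __name__ == "__main__":
    account = CheckpointedState({"balance": 100})
    non_negative = lambda s: s["balance"] >= 0
    print(apply_with_rollback(account, lambda s: s.update(balance=s["balance"] - 30), non_negative))
    print(apply_with_rollback(account, lambda s: s.update(balance=s["balance"] - 500), non_negative))
    print(account.state)
```

The checkpoint is itself a form of redundancy (a second copy of the state), which matches the slide's point that redundancy is the key to recovery.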



Means to attain dependability (6/6)

Key to fault tolerance
• Break the FEF chain
• Use “redundancy” to improve “use phase” dependability and security
• See “Fundamental Principles” next



Fundamental Principles

Hardware Redundancy
- Low level
- High level

Software Redundancy
Time Redundancy
Information Redundancy



Fundamental Principles

Hardware Redundancy - Low level
Logic level
- Example 1 - Self-checking circuits
- Example 2 - Arithmetic code
A modular adder using the mathematical principle
(A+B) mod k = ((A mod k) + (B mod k)) mod k
Hardware Redundancy - High level
Triplication, or five copies as in the Space Shuttle
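
Both levels can be sketched in software. The first function applies the residue identity above as a check on an adder; the second is a majority voter over replicated results, as in triplication. The function names, the default modulus k = 3, and the injected faulty adders are assumptions made for illustration.

```python
from collections import Counter


def residue_checked_add(a: int, b: int, k: int = 3, adder=lambda x, y: x + y) -> int:
    """Check (A+B) mod k == ((A mod k) + (B mod k)) mod k on the adder's output."""
    s = adder(a, b)
    if s % k != ((a % k) + (b % k)) % k:
        raise ArithmeticError("residue check failed: adder error detected")
    return s


def majority_vote(results):
    """Return the value produced by a strict majority of the replicas."""
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority among replicas")
    return value


if __name__ == "__main__":
    print(residue_checked_add(5, 7))
    try:
        residue_checked_add(5, 7, adder=lambda x, y: x + y + 1)  # off-by-one fault
    except ArithmeticError as exc:
        print(exc)
    print(majority_vote([7, 7, 9]))  # one faulty replica out of three is masked
```

The residue check misses any error that is a multiple of k, so the choice of k trades checker cost against coverage. The voter masks errors rather than merely detecting them, at the price of running every computation three or more times.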



Fundamental Principles

Software Redundancy
Use two different programs/algorithms
Time Redundancy
Re-compute or redo the task and compare the results
May or may not use the same hardware/software
Information Redundancy
Backup information
Use of ECC
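
As a small illustration of software and information redundancy, the sketch below computes an integer square root with two different algorithms and compares the results, then protects a stored payload with a CRC-32 checksum. The helper names are hypothetical. Note that a CRC only detects corruption, whereas the ECC mentioned above can also correct it.

```python
import math
import zlib


def isqrt_bisect(n: int) -> int:
    """Integer square root by binary search (independent of math.isqrt)."""
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def two_version_isqrt(n: int) -> int:
    """Software redundancy: two different algorithms must agree."""
    a = math.isqrt(n)
    b = isqrt_bisect(n)
    if a != b:
        raise RuntimeError("versions disagree: error detected")
    return a


def store_with_crc(payload: bytes):
    """Information redundancy: keep a CRC-32 checksum next to the data."""
    return payload, zlib.crc32(payload)


def verify(payload: bytes, crc: int) -> bool:
    """Recompute the checksum and compare it with the stored one."""
    return zlib.crc32(payload) == crc


if __name__ == "__main__":
    print(two_version_isqrt(10))
    data, crc = store_with_crc(b"fault tolerance")
    print(verify(data, crc))
    corrupted = bytes([data[0] ^ 1]) + data[1:]  # flip one bit
    print(verify(corrupted, crc))
```

Using two different algorithms guards against a design fault in either one, which running the same program twice (time redundancy) cannot do.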



Fault-Error-Failure

Intuitive definitions
Origins of faults
Methods to break FEF chain
Attributes of faults



Fault-Error-Failure concept

Intuitive definitions
Fault - an anomalous physical condition caused by a manufacturing problem, fatigue, external disturbance (intentional or unintentional), design flaw, …
Causes
Error - effect of the activation of a fault
Failure - overall system effect of an error
Fault -> Error -> Failure



Fault-Error-Failure concept

Origins of faults
Physical device level (HW)
Logic level (HW)
Chip level (HW)
System level (HW/SW)
interfacing, specifications, …
Why systems fail



Fault-Error-Failure concept

Methods to break FEF chain


Flow FEF
Barriers
Fault avoidance
Fault masking
Fault removal
Fault forecasting



Fault-Error-Failure concept

Attributes of faults
Cause
Nature
Duration
Extent
Value
