Lesson 2 - Fault and Error Modelling

Fault and Error Modelling

Uploaded by

Paul Pogba Clive

Lesson 2.

Fault and Error Modelling


1. Introduction
Faults, errors, and failures are terms often used interchangeably in computing systems, but they
represent different stages of a system malfunction. Fault modeling is a critical process in
fault-tolerant computing that helps us understand how faults occur, propagate, and affect system
behavior. Understanding faults and errors, and how to model them, is essential for improving
system reliability and performance.
2. Learning Outcomes
By the end of the lesson, students should be able to:
1. Differentiate between faults, errors, and failures.
2. Model faults and errors in systems.
3. Understand the implications of fault/error modeling on system reliability.
3. Faults, Errors, and Failures
3.1 Fault
A fault is an underlying defect or flaw in a system that could potentially lead to erroneous
behavior. Faults can occur due to hardware malfunctions, software bugs, human errors, or
external disturbances (e.g., radiation). In software, faults often arise from a lack of
resources or from not following proper development steps, so that the logic needed to handle a
condition was never incorporated into the application. This is an undesirable situation, and it
mainly happens because of incorrectly documented steps or missing data definitions.
1. A fault can cause unintended behavior in an application program.
2. It may cause a warning in the program.
3. If a fault is left untreated, it may lead to failure of the deployed code.
4. A minor fault may, in some cases, lead to a severe error.
5. Faults can be prevented in several ways, such as adopting sound programming techniques,
development methodologies, peer review, and code analysis.
3.2 Error
An error is a deviation in the system's internal state caused by a fault. Errors manifest when the
system state deviates from the expected state and can propagate through the system,
potentially leading to failures if not detected and corrected. For instance, in software
development, an error arises when the development team or an individual developer
misunderstands a requirement definition and that misunderstanding gets translated into buggy
code.
1. Errors are caused by wrong logic, syntax, or loops and can impact the end-user
experience.
2. An error is measured as the difference between the expected results and the actual results.
3. Errors arise for several reasons, such as design issues, coding issues, or system specification
issues, and lead to problems in the application.
3.3 Failure
A failure occurs when the system's output deviates from the expected behavior due to an error.
It is the external manifestation of an error, visible to the user or the system environment.
In software, a failure is often the accumulation of several defects that ultimately make the
software fail, resulting in the loss of information in critical modules and leaving the system
unresponsive. Such situations should be rare, because the expected scenarios and test cases
are exercised before a product is released. Failures are detected by end users when they
encounter a particular issue in the software.
1. A failure can result from human error, or can be caused intentionally by an individual.
2. Failure is a term that applies once the software is in production.
3. A failure shows up in the application when the defective part is executed.
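
The fault, error, and failure stages can be illustrated with a small sketch. The functions
`average` and `report` and the hard-coded divisor are hypothetical, invented only for this
illustration: the fault is the defect in the code, the error is the wrong internal value it
produces once activated, and the failure is the wrong output seen by the user.

```python
def average(values):
    # FAULT: the divisor is hard-coded to 10 instead of len(values).
    # The defect stays dormant whenever exactly 10 values are passed in.
    total = sum(values)
    mean = total / 10  # ERROR: wrong internal state when len(values) != 10
    return mean


def report(values):
    # FAILURE: the erroneous value reaches the user-visible output.
    return f"mean = {average(values):.1f}"


dormant = report([5] * 10)     # fault present but not activated
failed = report([5, 5, 5, 5])  # fault activated: the true mean is 5.0

print(dormant)  # mean = 5.0
print(failed)   # mean = 2.0
```

Note that the same fault produces a correct result for one input and a failure for another,
which is why a fault can remain undetected until the defective path is exercised.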

Diagram: Relationship Between Faults, Errors, and Failures

Fault → Error → Failure

The diagram illustrates the sequence from fault to error to failure.
4. Types of Faults
Faults can be classified based on their behavior and duration. The three main types are transient,
intermittent, and permanent faults:
4.1 Transient Faults
These faults occur temporarily and disappear without any corrective action. For example,
power fluctuations or cosmic radiation may cause transient faults in electronic circuits,
making a circuit misbehave temporarily and then recover without intervention.
4.2 Intermittent Faults
Intermittent faults appear and disappear at irregular intervals. They are typically caused by
faulty components that malfunction temporarily, such as loose connections in an electrical
circuit. These faults are more challenging to detect and troubleshoot since they
occur unpredictably.
4.3 Permanent Faults
Permanent faults persist until corrective action is taken. For instance, a burned-out
processor or a failed hard drive would cause permanent faults. These faults require repair or
replacement of the affected component.
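
The three fault types differ mainly in when the fault is active. The sketch below models each
type as a function of discrete time; the onset time (t = 3), the 30% activation probability,
and all function names are assumptions chosen only for illustration.

```python
import random


def transient(t, rng):
    # Active at a single time step, then disappears on its own.
    return t == 3


def intermittent(t, rng):
    # Appears and disappears at irregular (here pseudo-random) times.
    return rng.random() < 0.3


def permanent(t, rng):
    # Active from its onset until a repair takes place (none is modeled here).
    return t >= 3


def observe(fault, steps=10, seed=1):
    """Return the fault's activity (True = active) over `steps` time steps."""
    rng = random.Random(seed)
    return [fault(t, rng) for t in range(steps)]


for fault in (transient, intermittent, permanent):
    timeline = "".join("X" if active else "." for active in observe(fault))
    print(f"{fault.__name__:<12} {timeline}")
```

In the printed timelines, the transient fault shows a single active step, the permanent fault
stays active from its onset onward, and the intermittent fault shows an irregular pattern that
depends on the seed.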
5. Fault and Error Modelling
Fault and error modeling helps simulate and analyze how faults affect a system's reliability and
performance. Fault modeling allows designers to understand the behavior of faults and develop
strategies to handle them.
5.1 Fault Models
A fault model is an abstraction used to describe different types of faults in a system. Some
common fault models include:
• Stuck-at Fault Model: The stuck-at fault model is one of the most commonly used fault
models in digital circuit testing. It assumes that a signal or node in a digital circuit is
"stuck" at a constant logic level, either logic 0 (stuck-at-0) or logic 1 (stuck-at-1),
regardless of the inputs applied to the circuit. The model simplifies the analysis of faults
in combinational and sequential circuits by focusing on these two fault types.

1. Stuck-at-0 (s-a-0):
In this fault type, a signal or a node that is supposed to change its value is stuck at
logic 0. No matter the input combinations that are applied, the output of that
node remains at 0. Mathematically, if a node N is stuck-at-0, we have:
N = 0 (for all input combinations)
2. Stuck-at-1 (s-a-1):
In this case, a signal or node that is supposed to change its value remains stuck at
logic 1, regardless of the input conditions. For a node N stuck-at-1, we have:
N = 1 (for all input combinations)
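
The stuck-at model can be tried out on a small circuit. The sketch below assumes a hypothetical
circuit y = (a AND b) OR c with an internal node n1 = a AND b, forces n1 to a constant value,
and lists the input vectors whose output differs from the fault-free circuit. Those vectors are
the test patterns that detect the fault.

```python
from itertools import product


def circuit(a, b, c, stuck=None):
    """y = (a AND b) OR c, with an optional stuck-at fault on node n1."""
    n1 = a & b
    if stuck is not None:
        n1 = stuck  # the node ignores its inputs and holds a constant level
    return n1 | c


def detecting_vectors(stuck):
    """Input vectors (a, b, c) for which the faulty output differs from the good one."""
    return [v for v in product((0, 1), repeat=3)
            if circuit(*v) != circuit(*v, stuck=stuck)]


sa0_tests = detecting_vectors(0)  # n1 stuck-at-0
sa1_tests = detecting_vectors(1)  # n1 stuck-at-1

print("s-a-0 detected by:", sa0_tests)  # [(1, 1, 0)]
print("s-a-1 detected by:", sa1_tests)  # [(0, 0, 0), (0, 1, 0), (1, 0, 0)]
```

In this circuit, every detecting vector has c = 0, because c = 1 forces the output to 1 and
hides the state of n1 from the output.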

• Byzantine Fault Model: This model describes a system in which components, such as
nodes or processors, can fail in arbitrary or malicious ways, including sending conflicting
or misleading information to different parts of the system. This type of fault is one of the
most challenging to deal with in distributed systems because it assumes that
components may not simply stop working but may actively work against the system.

In the Byzantine Fault Model, nodes in a distributed system can behave unpredictably or
dishonestly. A Byzantine node may send different (possibly incorrect) information to
different nodes, creating inconsistencies. The system must reach a consensus despite
some nodes behaving incorrectly.

This problem was originally formulated as the Byzantine Generals Problem, where
generals in different locations must agree on a common plan of action, even though
some generals may be traitors sending contradictory or false messages to other generals
in the circle.

Byzantine Fault Tolerance (BFT): A system is said to be Byzantine Fault-Tolerant if it can
reach a consensus or continue functioning correctly even in the presence of Byzantine
faults. In a distributed system of n nodes, where up to f nodes can behave in a Byzantine
manner, Byzantine Fault Tolerance is achieved if the system can still function correctly
despite these faulty nodes.

Necessary Condition: 3f+1 Nodes

For Byzantine Fault Tolerance to be possible, the system must have at least 3f+1 total
nodes. This ensures that even if f nodes behave arbitrarily, the correct nodes can still
outvote the faulty ones and reach a consensus. This is often referred to as the Byzantine
Quorum Condition.

Equations

Let:
n = total number of nodes in the system
f = maximum number of Byzantine faulty nodes

The relationship between the total nodes and the maximum Byzantine faults tolerated
is expressed as:

n ≥ 3f + 1

This means that for the system to tolerate f Byzantine faults, the total number of nodes
n must be at least 3f+1. For example, if the system needs to tolerate 1 Byzantine fault
(f=1), there must be at least 3×1+1 = 4 nodes in the system.

Example: You are required to develop a robust system capable of withstanding Byzantine
failures from two sources. Using the Byzantine Fault Tolerance model, compute the number
of correct sources needed to keep the system running.
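
One way to work through this example is to apply n ≥ 3f + 1 directly with f = 2. The helper
names below are hypothetical and serve only to illustrate the arithmetic; they are not part of
any library.

```python
def min_total_nodes(f):
    """Smallest n that satisfies n >= 3f + 1 for f Byzantine faulty nodes."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def max_faults_tolerated(n):
    """Largest f that satisfies n >= 3f + 1 for a system of n nodes."""
    return (n - 1) // 3


f = 2                     # two sources may fail arbitrarily
n = min_total_nodes(f)    # 3 * 2 + 1 = 7 nodes in total
correct = n - f           # 7 - 2 = 5 correct nodes

print(f"f = {f}: total nodes = {n}, correct nodes = {correct}")
```

Under this condition, tolerating two Byzantine sources takes at least 7 nodes in total, of
which at least 5 must be correct. Conversely, `max_faults_tolerated(6)` is still 1, so adding
nodes only helps once the next multiple of three plus one is reached.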

• Transient Fault Model: This model simulates faults that appear temporarily, such as
those caused by radiation or electrical interference. A transient fault (often called a soft
error) is a temporary disturbance that lasts for a short duration and does not result
in permanent damage to the hardware. These faults are typically caused by external
factors like electromagnetic interference, power fluctuations, or cosmic radiation and
are often difficult to reproduce. Unlike permanent faults, transient faults do not indicate
a failure of the system's components, and the system can recover once the disturbance
has passed.
5.2 Error Models
Error models are important for understanding how faults lead to errors in a system and how
these errors propagate through the system's components. By analyzing these models,
engineers can design systems with appropriate fault tolerance and error detection mechanisms.
Some key error models are discussed below:
1. The Fail-Silent Model
In this model, a component that detects a fault stops all operations and provides no
output (silent failure). This model ensures that the faulty component does not propagate
errors to other components. For example, a temperature sensor in an industrial
application may stop sending readings if it detects a fault, ensuring that no erroneous
data is transmitted to the control system. This allows the system to rely on other sensors
or to take corrective actions without being misled by faulty data.
2. The Fail-Stop Model
In a fail-stop model, when a component fails, it stops functioning and signals its failure
to the rest of the system. This allows other components to take appropriate actions
based on the failure. For instance, in a distributed database system, if a server fails, it
may send a notification to other servers, indicating its unavailability. The remaining
servers can then redistribute the workload and maintain the overall system functionality.
3. Byzantine Model
In this model, components may fail and exhibit arbitrary behavior, including sending
conflicting or misleading information. This model is crucial for systems where nodes can
be compromised or may act maliciously. For example, in a blockchain network, some
nodes may attempt to submit fraudulent transactions. A Byzantine Fault Tolerance (BFT)
algorithm helps ensure that even if some nodes behave incorrectly, the majority can still
reach a consensus on the correct state of the blockchain.
4. The Error Propagation Model
This model describes how errors introduced by faulty components can affect other
components. The error can propagate through the system based on its architecture and
the interactions between components. For example, in a digital circuit, a faulty flip-flop
may cause an incorrect signal to be sent to other logic gates. If that signal is used as an
input to a combination of gates, the incorrect output can propagate further, leading to
errors in multiple parts of the circuit.
5. Silent Data Corruption Model
In this model, data is corrupted without being caught by any detection mechanism,
leading to incorrect results with no indication of failure. For example, in a storage
system, data can be silently corrupted by a transient fault such as a power spike.
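
Silent data corruption stays silent only while nothing checks the data. As a minimal sketch,
assuming the storage layer keeps a CRC-32 checksum next to each record (the `store`,
`flip_bit`, and `read_checked` helpers are hypothetical), a single flipped bit can be turned
into a detected error:

```python
import zlib


def store(data: bytes):
    """Return the data together with its CRC-32 checksum."""
    return data, zlib.crc32(data)


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Model a transient fault that flips one bit of the stored data."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def read_checked(data: bytes, checksum: int) -> bytes:
    """Return the data only if it still matches its checksum."""
    if zlib.crc32(data) != checksum:
        raise ValueError("checksum mismatch: data corrupted")
    return data


payload, crc = store(b"balance=1000")
corrupted = flip_bit(payload, 72)  # lowest bit of byte 9: '0' becomes '1'
print(corrupted)                   # b'balance=1100'

try:
    read_checked(corrupted, crc)
    detected = False
except ValueError:
    detected = True
print("corruption detected:", detected)
```

Without the checksum, the corrupted record is still a perfectly valid-looking value, which is
what makes this error model dangerous.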

6. Fault Injection Techniques


Fault injection is a technique used in a simulated environment to assess the reliability of a
system by deliberately introducing faults into it. This allows engineers to observe how the
system responds to faults and whether the fault-tolerant mechanisms function as expected.
There are two basic approaches to this.
1. Software-Based Fault Injection: Here, faults are injected into the software by
manipulating variables, data, or instructions.
2. Hardware-Based Fault Injection: Hardware faults are simulated by manipulating physical
components, such as cutting wires or injecting electrical disturbances.
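
A software-based fault injection experiment can be sketched as follows. The example assumes a
hypothetical computation replicated three times behind a majority voter (triple modular
redundancy) and injects bit flips into the results of selected replicas to see whether the
voter masks them. All names here are illustrative.

```python
from collections import Counter


def compute(x):
    """The fault-free computation performed by every replica."""
    return x * x + 1


def inject_bit_flip(value, bit):
    """Software fault injection: flip one bit of an integer result."""
    return value ^ (1 << bit)


def tmr(x, faulty_replicas=(), bit=0):
    """Run three replicas, inject faults into the chosen ones, then vote."""
    results = []
    for replica in range(3):
        result = compute(x)
        if replica in faulty_replicas:
            result = inject_bit_flip(result, bit)  # injected fault
        results.append(result)
    winner, votes = Counter(results).most_common(1)[0]
    return winner if votes >= 2 else None  # None: no majority


golden = compute(6)                                # fault-free reference: 37
one_fault = tmr(6, faulty_replicas={1}, bit=3)     # masked by the voter
two_faults = tmr(6, faulty_replicas={0, 1}, bit=3) # identical faults outvote

print(golden, one_fault, two_faults)  # 37 37 45
```

The experiment shows both the strength and the limit of the mechanism: one faulty replica is
outvoted, while two replicas with the same injected fault form a wrong majority.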
7. Importance of Fault and Error Modelling on System Reliability
Fault and error modeling plays a crucial role in predicting a system's reliability. By modeling
potential faults and their effects, designers can:
• Design the system to detect and recover from errors easily, for example by using
error-correcting codes such as Hamming codes to detect and correct bit-flip errors.
• Decide where to add redundancy (e.g., N-modular redundancy) so that the system
continues to operate even when a fault occurs.
• Design testing strategies such as test pattern generation for digital circuits, where faults
are systematically introduced to ensure error detection mechanisms work correctly.
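
To make the Hamming code remark concrete, the sketch below implements a Hamming(7,4) code,
which protects 4 data bits with 3 parity bits and corrects any single bit flip. The bit layout
(parity bits at positions 1, 2, and 4) follows the common textbook convention and is an
assumption of this example.

```python
def encode(data):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Correct at most one flipped bit; return (data bits, error position or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # syndrome = 1-based position of the flip
    if position:
        c[position - 1] ^= 1         # correct the flipped bit
    return [c[2], c[4], c[5], c[6]], position


data = [1, 0, 1, 1]
codeword = encode(data)              # [0, 1, 1, 0, 0, 1, 1]

received = list(codeword)
received[4] ^= 1                     # bit flip at position 5
recovered, position = decode(received)
print(recovered, position)           # [1, 0, 1, 1] 5
```

The syndrome directly names the position of the flipped bit, so the decoder both detects and
corrects the error. Two simultaneous bit flips are beyond what this code can correct.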
8. Review Questions
1. Differentiate between transient, intermittent, and permanent faults.
2. Model the three fault scenarios and discuss their effects on system performance.
