Lesson 2 - Fault and Error Modelling
developer fails to understand a requirement definition and hence that misunderstanding gets
translated into buggy code.
1. Errors arise from incorrect logic, syntax, or loops and can degrade the end-user
experience.
2. An error is measured as the difference between the expected results and the actual results.
3. Errors arise for several reasons, such as design issues, coding issues, or system specification
issues, and lead to problems in the application.
3.3 Failure
A failure occurs when the system's output deviates from the expected behavior due to an error.
It is the external manifestation of an error, visible to the user or the system environment.
A failure can result from the accumulation of several defects, which may cause the loss of
information in critical modules and leave the system unresponsive. Such situations should be
rare, because test cases covering the anticipated scenarios are normally run before a product
is released. Failures are typically detected by end-users when they encounter a particular
issue in the software.
1. A failure can result from human error or can be caused intentionally by an individual.
2. The term applies to software that has reached the production stage.
3. A failure is observed in the application when the defective part is executed.
4. Types of Faults
Faults can be classified based on their behavior and duration. The three main types are transient,
intermittent, and permanent faults:
4.1 Transient Faults
These faults occur temporarily and disappear without any corrective action. For example,
power fluctuations or cosmic radiation may cause a transient fault in an electronic circuit,
which misbehaves temporarily and then recovers without intervention.
4.2 Intermittent Faults
Intermittent faults appear and disappear at irregular intervals. They are typically caused by
loose electrical connections or by faulty components that malfunction from time to time.
These faults are more challenging to detect and troubleshoot because they occur
unpredictably.
4.3 Permanent Faults
Permanent faults persist until corrective action is taken. For instance, a burned-out
processor or a failed hard drive would cause permanent faults. These faults require repair or
replacement of the affected component.
5. Fault and Error Modelling
Fault and error modelling helps simulate and analyze how faults affect a system's reliability and
performance. Fault modelling allows designers to understand the behavior of faults and develop
strategies to handle them.
5.1 Fault Models
A fault model is an abstraction used to describe different types of faults in a system. Some
common fault models include:
Stuck-at Fault Model: This is one of the most commonly used fault models in digital
circuit testing. It assumes that a signal or node in a digital circuit is "stuck" at a
constant logic level, either logic 0 (stuck-at-0) or logic 1 (stuck-at-1), regardless of
the inputs applied to the circuit. The model simplifies the analysis of faults in
combinational and sequential circuits by focusing on these two failure modes.
1. Stuck-at-0 (s-a-0):
In this fault type, a signal or a node that is supposed to change its value is stuck at
logic 0. No matter the input combinations that are applied, the output of that
node remains at 0. Mathematically, if a node N is stuck-at-0, we have:
N = 0 (for all input combinations)
2. Stuck-at-1 (s-a-1):
In this case, a signal or node that is supposed to change its value remains stuck at
logic 1, regardless of the input conditions. For a node N stuck-at-1, we have:
N = 1 (for all input combinations)
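The two stuck-at cases can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the three-input circuit out = (a AND b) OR c, the node names ("a", "b", "c", "n1", "out"), and the helper functions are all hypothetical and chosen for this example. It compares the fault-free circuit with a faulty copy and lists the input vectors that expose the fault at the output.

```python
from itertools import product


def circuit(a, b, c, stuck=None):
    """Evaluate out = (a AND b) OR c, optionally forcing one node to a fixed level.

    `stuck` is a hypothetical (node_name, level) pair, e.g. ("n1", 0) for s-a-0.
    """
    def force(name, value):
        # A stuck node ignores its computed value and always reads the stuck level.
        if stuck is not None and stuck[0] == name:
            return stuck[1]
        return value

    a, b, c = force("a", a), force("b", b), force("c", c)
    n1 = force("n1", a & b)       # internal node: output of the AND gate
    return force("out", n1 | c)   # primary output: output of the OR gate


def detecting_vectors(stuck):
    """Return the input vectors whose output differs between good and faulty circuits."""
    return [v for v in product((0, 1), repeat=3)
            if circuit(*v) != circuit(*v, stuck=stuck)]


print(detecting_vectors(("n1", 0)))  # [(1, 1, 0)]
print(detecting_vectors(("n1", 1)))  # [(0, 0, 0), (0, 1, 0), (1, 0, 0)]
```

For node n1 stuck-at-0, only the vector a=1, b=1, c=0 reveals the fault: it drives n1 to 1 in the good circuit and keeps c at 0 so the difference reaches the output.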
Byzantine Fault Model: This model describes a system in which components, such as
nodes or processors, can fail in arbitrary or malicious ways, including sending conflicting
or misleading information to different parts of the system. This type of fault is one of the
most challenging to deal with in distributed systems because it assumes that
components may not simply stop working but may actively work against the system.
In the Byzantine Fault Model, nodes in a distributed system can behave unpredictably or
dishonestly. A Byzantine node may send different (possibly incorrect) information to
different nodes, creating inconsistencies. The system must reach a consensus despite
some nodes behaving incorrectly.
This problem was originally formulated as the Byzantine Generals Problem, where
generals in different locations must agree on a common plan of action, even though
some generals may be traitors sending contradictory or false messages to other generals
in the circle.
Byzantine Fault Tolerance (BFT): A system is said to be Byzantine Fault-Tolerant if it can
reach a consensus or continue functioning correctly even in the presence of Byzantine
faults. In a distributed system of n nodes, where up to f nodes can behave in a Byzantine
manner, Byzantine Fault Tolerance is achieved if the system can still function correctly
despite these faulty nodes.
For Byzantine Fault Tolerance to be possible, the system must have at least 3f+1 total
nodes. This ensures that even if f nodes behave arbitrarily, the correct nodes can still
outvote the faulty ones and reach a consensus. This is often referred to as the Byzantine
Quorum Condition.
Equations
The relationship between the total nodes and the maximum Byzantine faults tolerated
is expressed as: n ≥ 3f + 1
This means that for the system to tolerate f Byzantine faults, the total number of nodes
n must be at least 3f+1. For example, if the system needs to tolerate 1 Byzantine fault
(f=1), there must be at least 3×1+1 = 4 nodes in the system.
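The bound n ≥ 3f + 1 can be read in both directions. The following minimal sketch shows this; the function names min_nodes and max_faults are assumptions made for this illustration and do not come from any library.

```python
def min_nodes(f):
    """Smallest total node count n that satisfies n >= 3f + 1."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def max_faults(n):
    """Largest f that still satisfies n >= 3f + 1 for a given node count n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) // 3


print(min_nodes(1))   # 4
print(max_faults(4))  # 1
print(max_faults(6))  # 1
```

Six nodes tolerate no more Byzantine faults than four do. A seventh node is needed before a second faulty node can be tolerated.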
Example: You are asked to develop a robust system that can withstand failures from two
sources. Using the Byzantine Fault Tolerance model, compute the number of correct
sources needed to keep the system running.
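One possible working of this example, assuming that "failure from two sources" means f = 2 Byzantine sources and that the question asks for the minimum number of correct sources:

```latex
\begin{align*}
n     &\ge 3f + 1 = 3(2) + 1 = 7 \\
n - f &\ge 7 - 2 = 5
\end{align*}
```

Under this reading, the system needs at least 7 sources in total, of which at least 5 must behave correctly.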
Transient Fault Model: This model simulates faults that appear temporarily, such as
errors caused by radiation or electrical interference. A Transient Fault (or soft error)
refers to a temporary error in a system that occurs for a short duration and does not result
in permanent damage to the hardware. These faults are typically caused by external
factors like electromagnetic interference, power fluctuations, or cosmic radiation and
are often difficult to reproduce. Unlike permanent faults, transient faults do not indicate
a failure of the system's components, and the system can recover once the disturbance
has passed.
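A transient fault can be illustrated as a disturbance that affects a single read while the stored value stays intact. The sketch below is a toy model, not a description of real hardware: the NoisyRegister class, its faulty_reads parameter, and the three-read majority vote are all assumptions introduced for this example.

```python
class NoisyRegister:
    """Hypothetical register whose reads are disturbed only on chosen attempts."""

    def __init__(self, value, faulty_reads, bit=0):
        self.value = value                      # stored value, never damaged
        self.faulty_reads = set(faulty_reads)   # read attempts hit by the disturbance
        self.bit = bit                          # bit position flipped by the upset
        self.reads = 0

    def read(self):
        self.reads += 1
        if self.reads in self.faulty_reads:
            # Transient upset: the returned value has one bit flipped,
            # but the stored value is left untouched.
            return self.value ^ (1 << self.bit)
        return self.value


def read_with_vote(reg):
    """Read three times and return the majority value, masking one transient upset."""
    samples = [reg.read() for _ in range(3)]
    return max(set(samples), key=samples.count)


reg = NoisyRegister(0b1010, faulty_reads={2})
print(read_with_vote(reg))  # 10: the second read returned 11 and was outvoted
print(reg.read())           # 10: the disturbance has passed
```

Because the stored value is never modified, reads return the correct value again once the disturbance has passed, which is what distinguishes this from a permanent fault.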
5.2 Error Models
Error models describe how faults lead to errors in a system and how these errors propagate
through the system's components. By analyzing these models, engineers can design systems
with appropriate fault tolerance and error detection mechanisms. Some key error models are
discussed below:
1. The Fail-Silent Model
In this model, a component that detects a fault stops all operations and provides no
output (silent failure). This model ensures that the faulty component does not propagate
errors to other components. For example, a temperature sensor in an industrial
application may stop sending readings if it detects a fault, ensuring that no erroneous
data is transmitted to the control system. This allows the system to rely on other sensors
or to take corrective actions without being misled by faulty data.
2. The Fail-Stop Model
In a fail-stop model, when a component fails, it stops functioning and signals its failure
to the rest of the system. This allows other components to take appropriate actions
based on the failure. For instance, in a distributed database system, if a server fails, it
may send a notification to other servers, indicating its unavailability. The remaining
servers can then redistribute the workload and maintain the overall system functionality.
3. Byzantine Model
In this model, components may fail and exhibit arbitrary behavior, including sending
conflicting or misleading information. This model is crucial for systems where nodes can
be compromised or may act maliciously. For example, in a blockchain network, some
nodes may attempt to submit fraudulent transactions. A Byzantine Fault Tolerance (BFT)
algorithm helps ensure that even if some nodes behave incorrectly, the majority can still
reach a consensus on the correct state of the blockchain.
4. The Error Propagation Model
This model describes how errors introduced by faulty components can affect other
components. The error can propagate through the system based on its architecture and
the interactions between components. For example, in a digital circuit, a faulty flip-flop
may cause an incorrect signal to be sent to other logic gates. If that signal is used as an
input to a combination of gates, the incorrect output can propagate further, leading to
errors in multiple parts of the circuit.
5. Silent Data Corruption Model
In this model, data is corrupted without being caught by any fault detection mechanism,
leading to incorrect results without any indication of failure. For example, data in a
storage system can be silently corrupted by a transient fault, such as a power spike.
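One common way to turn silent corruption into a detectable error is to store a checksum alongside the data and verify it on every read. The sketch below uses CRC-32 from Python's standard zlib module. The helper names (store, flip_bit, is_intact) and the sample payload are assumptions made for this illustration, and the bit flip stands in for the effect of a disturbance such as a power spike.

```python
import zlib


def store(payload: bytes):
    """Return the payload together with its CRC-32 checksum."""
    return payload, zlib.crc32(payload)


def flip_bit(data: bytes, byte_index: int, bit: int) -> bytes:
    """Simulate corruption by flipping one bit of the stored data."""
    corrupted = bytearray(data)
    corrupted[byte_index] ^= 1 << bit
    return bytes(corrupted)


def is_intact(payload: bytes, checksum: int) -> bool:
    """Recompute the CRC-32 and compare it with the stored checksum."""
    return zlib.crc32(payload) == checksum


payload, checksum = store(b"sensor=21.5C")
damaged = flip_bit(payload, byte_index=7, bit=0)

print(is_intact(payload, checksum))  # True
print(is_intact(damaged, checksum))  # False: the corruption is no longer silent
```

Without the stored checksum, the damaged payload would be read back as if it were valid, which is exactly the situation this error model describes.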