Lesson 6 - System-Level Diagnosis
SOE 504
Specific Objectives
By the end of this lecture, students should be able to:
1. Demonstrate an understanding of the process of diagnosing
faults at the system level.
2. Implement techniques to isolate and diagnose faults.
Introduction to System-Level Diagnosis
System-level diagnosis refers to the identification and isolation of faults in
a complex system, which may include hardware and software components.
System-level diagnosis looks at the entire system with the aim of
discovering where the failure is occurring and how it affects overall system
functionality. It is necessary because it:
1. Helps maintain system reliability and performance.
2. Minimizes downtime by enabling quick detection and correction of faults,
and
3. Supports predictive maintenance, reducing the likelihood of system
breakdown.
Overview of System Faults
• Faults can be broadly grouped into three categories:
1. Permanent faults: occur when a component is permanently
damaged and stops functioning.
2. Transient faults: temporary malfunctions that resolve without
intervention but can still disrupt system operation.
3. Intermittent faults: occur unpredictably and are hard to diagnose
since they appear and disappear over time.
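The three categories can be told apart by how a fault shows up over time. The sketch below is only an illustration: the function name classify_fault and the rule it applies (count the separate bursts in which the fault was observed) are assumptions made for this example, not a standard algorithm.

```python
from itertools import groupby

def classify_fault(observations):
    """Classify a fault from a timeline of booleans (True = fault observed).

    Assumed heuristic, for illustration only:
      - one burst that lasts to the end of the timeline -> permanent
      - one burst that clears by itself                 -> transient
      - more than one separate burst                    -> intermittent
    """
    # Collapse the timeline into runs, e.g. [F, T, T, F] -> [False, True, False]
    runs = [state for state, _ in groupby(observations)]
    fault_runs = sum(1 for state in runs if state)
    if fault_runs == 0:
        return "no fault"
    if fault_runs == 1 and runs[-1]:
        return "permanent"
    if fault_runs == 1:
        return "transient"
    return "intermittent"
```

A real diagnosis would need a much longer observation window, since an intermittent fault can look transient if it has not yet reappeared.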
Types of System Faults:
1. Hardware faults are physical failures caused by events such as
power surges, overheating, or component wear.
2. Software faults manifest in the form of bugs, logic errors, memory
leaks, and incorrect configurations.
3. Communication faults include network issues, signal degradation,
or protocol failures.
Process of Diagnosing System-Level Faults
The process has four steps:
1. Monitoring and Data Collection
2. Fault Isolation
3. Fault Identification
4. Systematic Repair and Resolution
Monitoring and Data Collection
The first step is to gather information about the system. This includes:
Error logs: Automated logging of errors during system operation.
Event tracing: Capturing events in the system to track down
abnormal behaviors.
Performance metrics: Monitoring CPU usage, memory, disk I/O,
network throughput, etc., to detect anomalies.
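One simple way to detect anomalies in a collected performance metric is to flag samples that lie far from the mean. The sketch below uses only the Python standard library. The function name find_anomalies and the default threshold of 2.0 standard deviations are assumptions chosen for this example, and the CPU readings are made-up sample data.

```python
from statistics import mean, stdev

def find_anomalies(samples, threshold=2.0):
    """Return (index, value) pairs more than `threshold` standard deviations from the mean."""
    if len(samples) < 2:
        return []          # not enough data to estimate a spread
    mu = mean(samples)
    sigma = stdev(samples)
    if sigma == 0:
        return []          # all samples identical, nothing stands out
    return [(i, x) for i, x in enumerate(samples)
            if abs(x - mu) / sigma > threshold]

# Hypothetical CPU usage readings (percent), one per sampling interval
cpu_usage = [20, 22, 21, 19, 23, 20, 21, 22, 20, 95]
print(find_anomalies(cpu_usage))
```

In a small window a single extreme value inflates the standard deviation, so the threshold has to be tuned to the amount of data available.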
Fault Isolation
Once enough data has been gathered, the next step is to isolate the
fault. This step involves:
Hypothesis generation: Based on the collected data, forming
hypotheses about which part of the system might be failing.
Testing subsystems: Running diagnostics on suspected subsystems
or components to verify if they are the source of the fault.
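The two activities above can be sketched as a small routine: rank the subsystems by how often they appear in the error log (hypothesis generation), then run a diagnostic on each suspect in that order (testing subsystems). The names rank_hypotheses and isolate_fault, the log format, and the diagnostic functions are all hypothetical and exist only for this example.

```python
from collections import Counter

def rank_hypotheses(log_entries):
    """Order subsystems by number of error entries, most suspicious first."""
    counts = Counter(subsystem for subsystem, _message in log_entries)
    return [name for name, _count in counts.most_common()]

def isolate_fault(log_entries, diagnostics):
    """Run each suspect's diagnostic in ranked order; return the first subsystem that fails."""
    for subsystem in rank_hypotheses(log_entries):
        check = diagnostics.get(subsystem)
        # A diagnostic returns True when the subsystem is healthy
        if check is not None and not check():
            return subsystem
    return None  # no tested subsystem failed its diagnostic

# Hypothetical error log: (subsystem, message) pairs
logs = [("network", "timeout"), ("disk", "I/O error"),
        ("network", "retry"), ("network", "connection reset")]
# Hypothetical diagnostics: here the network passes and the disk fails
diagnostics = {"network": lambda: True, "disk": lambda: False}
print(isolate_fault(logs, diagnostics))
```

Here the network is the first suspect because it appears most often in the log, but its diagnostic passes, so the disk is reported as the source. This shows why a hypothesis has to be verified by testing before it is accepted.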
Fault Identification
After isolating the fault, the exact nature of the failure must be
determined. This might involve:
1. Root Cause Analysis (RCA): Finding the underlying cause of the
fault rather than just the symptoms.
2. Testing scenarios: Reproducing the fault in a controlled
environment to better understand its behavior.
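Reproducing a fault in a controlled environment can be as simple as replaying a set of known inputs against the suspect operation and recording which ones fail. The sketch below assumes the fault shows up as a raised exception. The name reproduce_fault and the sample scenario are hypothetical.

```python
def reproduce_fault(scenario, inputs):
    """Run `scenario` on each input and collect the inputs that trigger a failure."""
    failures = []
    for value in inputs:
        try:
            scenario(value)
        except Exception as exc:
            # Record the triggering input and the kind of error it produced
            failures.append((value, type(exc).__name__))
    return failures

# Hypothetical suspect operation: fails when the divisor is zero
def compute_ratio(divisor):
    return 100 / divisor

print(reproduce_fault(compute_ratio, [5, 0, 2]))
```

Knowing exactly which input triggers the failure narrows the search for the root cause, rather than just confirming the symptom.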
Systematic Repair and Resolution
Once the fault is identified, the final step is to apply a fix. It could be:
Hardware repairs or replacements: For physical component
failures.
Software patches or updates: For bugs or configuration issues.
System reboots or resets: For transient issues that clear after a
system reset.
Techniques for Fault Diagnosis
Built-In Self-Test (BIST)
Built-In Self-Test (BIST) A mechanism where the system performs self-
Techniques for fault