Fault Detection and Diagnostics in Equipment Maintenance
Understanding equipment failures and developing strategies to detect and diagnose them is one
of the key elements of equipment maintenance. The purpose of this lesson is to present an
overview of Fault Detection and Diagnostics as they are applied to improve the equipment
maintenance process and boost asset reliability.
Fault
An unpermitted deviation of at least one characteristic property or parameter of the system from
the acceptable, usual or standard condition.
Failure
Malfunction
An intermittent irregularity in the fulfilment of a system’s desired function.
Error
Disturbance
Residual
Symptom
Functions
Fault detection
Fault isolation
Determination of the kind, location and time of detection of a fault. Follows fault
detection.
Fault identification
Determination of the kind, size, location and time of detection of a fault. Follows
fault isolation and completes the fault diagnosis.
Monitoring
Supervision
3. Models
Quantitative model
Use of static and dynamic relations among system variables and parameters in
order to describe a system’s behaviour in quantitative mathematical terms.
Qualitative model
Use of static and dynamic relations among system variables in order to describe a
system’s behaviour in qualitative terms such as causalities and IF–THEN rules.
Diagnostic model
A set of static or dynamic relations which link specific input variables, the
symptoms, to specific output variables, the faults.
Analytical redundancy
Use of more (not necessarily identical) ways to determine a variable, where one
way uses a mathematical process model in analytical form.
4. System properties
Reliability
Safety
Availability
5. Fault types
Abrupt fault
Incipient fault
A fault modelled using ramp signals; it represents a drift of the monitored signal.
Intermittent fault
6. Fault terminology
Additive fault
Influences a variable by the addition of the fault itself. Additive faults may represent,
e.g., offsets of sensors.
Multiplicative fault
Represented by the product of a variable with the fault itself. Multiplicative faults can
appear as parameter changes within a process.
In the early days, equipment maintenance was restricted to repairing faulty assets and performing
basic routine maintenance based on rigid time intervals. Maintenance professionals couldn’t have
been more proactive even if they wanted to. Their capability to collect, store and analyze data on
equipment health and performance was simply too limited.
The objective of Fault Detection and Diagnostics in the context of equipment maintenance is
to optimize maintenance costs while still improving the reliability, availability, maintainability
and safety (RAMS) of the equipment.
FDD works by continuously monitoring and analyzing condition monitoring data and
detecting any anomalies that are present. The equipment condition datasets are then processed by
fault diagnostics algorithms, sometimes embedded within the equipment itself, to produce failure
alerts for the equipment operators and enable timely maintenance intervention.
In some cases, the algorithms are sophisticated enough to even initiate failure containment
actions to auto-correct the failure itself and restore the equipment to its healthy condition.
FDD, as the name implies, covers the detection and diagnosis of equipment failures. The
diagnosis of a failure can be broken down into failure isolation and identification. Failure
evaluation is often added to the scope of FDD as it helps to understand the severity of a failure's
effect on system performance – an important aspect of maintenance management.
Nevertheless, the Fault Detection and Diagnostics process for any equipment should contain at
least four key steps – fault detection, fault isolation, fault identification, and fault evaluation –
and these do not have to run strictly in sequence, since some steps can happen at the same time.
We need to discuss each element in more detail to really understand how fault detection and
diagnostics work.
1. Fault detection
Fault detection is the process of discovering the presence of a fault in any equipment before it
manifests itself in the form of a breakdown. It is the most important stage of FDD as all of the
downstream processes depend on its accuracy.
If the system is unable to detect the right failure mode (or if detection is incorrect and
triggers false alarms), then the subsequent isolation, identification and evaluation will also be ineffective.
One example of model-based fault detection is the use of time-domain reflectometry (TDR) to
detect faults in underground cables. In TDR, a signal is sent along the test cable and is
received back after being reflected from the point of fault. If the cable has a discontinuity or an
impedance change, a portion of the signal is reflected back to the test equipment. By
analyzing how long the reflection takes to return (given the known propagation velocity) and the
polarity of the reflected signal, the test equipment can locate the fault and classify it as either an
open-circuit or a short-circuit fault.
Another example, this time of simple rule-based detection, comes from the series operation of a
bottle filling, capping, and packaging system on a conveyor belt. Simple rules can be established
that capture the hierarchy of processes, such as:
the bottle cannot be capped until the bottles are filled with liquid
the bottles cannot be packaged unless they are filled and capped
In case of a fault in the bottle capping mechanism, the algorithm will detect the incoming
disruption in the packaging system. It will notify the packaging operator well ahead of time. The
necessary preparation can be made to minimize operational losses on the packaging side of the
conveyor belt.
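To make the idea concrete, here is a minimal sketch of how such IF–THEN interlock rules could be encoded. The station names, status flags, and alert messages are hypothetical placeholders, not part of any real controller interface.

```python
# Minimal sketch of rule-based fault detection for the bottling-line example.
# Station names and status flags are hypothetical placeholders.

def check_line(status: dict) -> list[str]:
    """Return a list of alerts based on simple IF-THEN hierarchy rules."""
    alerts = []
    # Rule 1: bottles cannot be capped until they are filled.
    if status["capper_running"] and not status["filler_ok"]:
        alerts.append("Capping blocked: upstream filling fault detected.")
    # Rule 2: bottles cannot be packaged unless they are filled and capped.
    if status["packager_running"] and not (status["filler_ok"] and status["capper_ok"]):
        alerts.append("Packaging disruption expected: notify packaging operator.")
    return alerts

# Example: a fault in the capping mechanism propagates a warning downstream.
print(check_line({
    "filler_ok": True, "capper_ok": False,
    "capper_running": True, "packager_running": True,
}))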
Knowledge-based fault detection
For knowledge-based fault detection to work, we first need to establish a baseline. This is done
by retrieving the parameters of equipment performance such as voltage, current, vibration,
temperature, pressure and other relevant process variables – while the equipment is working
under normal conditions.
The purpose is to develop the equipment's signature under normal operations. After that, the same
parameters are retrieved continuously and compared with the “healthy” signature to capture any
deviation, using statistical analysis or pattern recognition done through machine learning or an
artificial neural network. We can use this technique to predict motor bearing failure from sensor
data collected from the bearing and the motor in general.
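A minimal sketch of this baseline-and-deviation idea follows, using a simple z-score test on synthetic data standing in for vibration, temperature, and current readings; real implementations would use richer features and more sophisticated pattern recognition, and all thresholds here are illustrative assumptions.

```python
import numpy as np

# Sketch of knowledge-based detection: compare live readings against a "healthy"
# baseline signature using a simple z-score test. Channel names, values, and the
# threshold are illustrative assumptions, not data from any specific motor.

def build_baseline(healthy_samples: np.ndarray):
    """healthy_samples: rows = observations, columns = channels (vibration, temp, current)."""
    return healthy_samples.mean(axis=0), healthy_samples.std(axis=0)

def detect_deviation(reading, mean, std, threshold=3.0):
    """Flag channels whose z-score exceeds the threshold."""
    z = np.abs((reading - mean) / std)
    return z > threshold, z

# Usage with synthetic data standing in for bearing vibration / temperature / current.
rng = np.random.default_rng(0)
healthy = rng.normal(loc=[2.0, 60.0, 10.0], scale=[0.1, 1.5, 0.3], size=(500, 3))
mean, std = build_baseline(healthy)
flags, z = detect_deviation(np.array([2.8, 61.0, 10.1]), mean, std)
print(flags)  # vibration channel flagged, temperature and current within baseline
```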
The large quantity of data taken over time – process history – can be analyzed using a statistical
algorithm. This helps us understand the impact of the different conditions the motor is subjected
to, such as thermal rating, mechanical stress, or some other operating conditions that occur in
special circumstances.
The algorithm then correlates the impact of these conditions on the degradation of bearing health
and predicts the failure rate and health condition of the overall motor. Based on these data
signatures, the analysis can be made to predict the future health of the equipment. Moreover, the
necessary alarms can be triggered and fault diagnosis can be conducted, so the
operator/technician can take appropriate action.
The same data can be used to establish a predictive maintenance strategy over the remaining life
of the motor.
2. Fault isolation
The goal of the fault isolation process is to localize the fault to the lowest component that can be
replaced. In some applications, fault detection and isolation go hand in hand because detecting
and localizing the fault happen at essentially the same time, both handled by a single Fault
Detection and Isolation (FDI) algorithm; in other applications, they are separate modules of the
process.
For instance, consider the example of TDR testing for underground cable. The returned pulse
from the cable simultaneously indicates the presence and the location of the fault, through the
return time of the pulse and the known propagation velocity.
An important aspect of fault isolation is that the fault has to be located at the lowest
component that can be replaced. This is done to improve the accuracy of isolation and reduce
the impact of downtime.
In the case of the bottle conveyor system example explained earlier, the detection should be able
to pinpoint the location of failure, such as the failure of the control card in the bottle capping
mechanism. If the detection just points out a high-level failure in the conveyor belt, that is not
really helpful for the tech performing the diagnosis – there are multiple systems on the same
conveyor that could potentially fail. The information that will really speed up the repair process
is knowing the accurate location of the fault.
3. Fault identification
The purpose of fault identification is to understand the underlying failure mode, determine the
size of the fault, and find its root cause. Fault diagnosis methods may differ, but the steps to
follow are generally the same.
Understanding the underlying failure mode
In-depth understanding of the failure mode requires work:
we need to analyze how the fault behaves at different times
so, we can develop the time-variant signature of the failure mode
and classify it into different categories
If the fault magnitude is low, the system just needs to be able to endure the fault for some extra
time until the fault clears by itself. A classic example is permitting temporary switching
overcurrents in electrical appliances, for as long as they don't significantly impact equipment
performance.
Now, if the fault magnitude is really high, a different methodology is required: engineers have to
use active or passive redundancies to enhance fault tolerance on their devices.
Let’s use a high voltage and high power three-phase AC induction motor as an example.
More often than not, the underlying failure modes are mechanical in nature and associated with
the rotary part of the motor: shorted rotor windings, bearing failures, and broken rotor bars. Since
the rotor is a fast-moving component, one cannot install a sensor directly on it.
The advanced FDD algorithms can be used to produce healthy motor stator terminal current
signatures and compare them with current signatures under faulty conditions.
For instance, when rotor bars break, sideband components appear in the stator current, offset
from the supply frequency by twice the slip frequency. In other words, there is an indirect
correlation between the mechanical breaking of rotor bars and fluctuations in the stator current.
Such emerging trends are analyzed by Fault Detection and Diagnostic algorithms and can be used
to find possible root causes, which are derived and displayed in real time on live dashboards.
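A minimal sketch of this kind of current-signature check is shown below. The stator-current signal is synthetic, and the slip value, sideband amplitudes, and detection threshold are all assumptions made purely for illustration.

```python
import numpy as np

# Sketch of a motor current signature analysis (MCSA) check: look for sideband
# components around the supply frequency, offset by twice the slip frequency,
# a commonly cited indicator of broken rotor bars. Signal, slip, and thresholds
# are synthetic assumptions.

fs, f_supply, slip = 5000.0, 50.0, 0.03        # sample rate, supply freq, per-unit slip
t = np.arange(0, 10, 1 / fs)

# Synthetic stator current: fundamental plus small sidebands at f * (1 +/- 2s).
current = (np.sin(2 * np.pi * f_supply * t)
           + 0.02 * np.sin(2 * np.pi * f_supply * (1 - 2 * slip) * t)
           + 0.02 * np.sin(2 * np.pi * f_supply * (1 + 2 * slip) * t))

spectrum = np.abs(np.fft.rfft(current))
freqs = np.fft.rfftfreq(current.size, d=1 / fs)

def band_amplitude(f_target, half_width=0.5):
    """Peak spectral amplitude within a narrow band around f_target."""
    mask = np.abs(freqs - f_target) <= half_width
    return spectrum[mask].max()

fundamental = band_amplitude(f_supply)
sidebands = max(band_amplitude(f_supply * (1 - 2 * slip)),
                band_amplitude(f_supply * (1 + 2 * slip)))

# Flag if the sidebands exceed an assumed fraction of the fundamental.
if sidebands / fundamental > 0.01:
    print("Possible broken rotor bar signature detected")
```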
The usage of such fault identification algorithms has significantly reduced the amount of time
techs need to troubleshoot equipment and reach the root cause of failures. Automatic root
cause diagnostics has tremendously contributed to reducing equipment downtime,
improving mean time to repair, and enhancing the overall reliability of the plant.
4. Fault evaluation
Once the failure modes and the associated root causes are identified, the next step is to evaluate
the impact of that fault type on the overall performance of the system.
We need to consider factors such as:
the impact of the fault on the environment and the rest of the system
the impact of the fault on system safety
the financial loss due to downtime
the need to make capital replacement decisions (in case the severity of failure is enough to
warrant the replacement of equipment as opposed to fixing it)
Fault evaluation is a significant element of the overall process as it aims to understand the
severity of the fault. This helps reliability engineers provide equipment validation and calculate
the risk of failures, which will both have a big impact on maintenance requirements,
recommendations, and optimization.
For example, the results of the FDD for one piece of equipment could imply a rapidly
increasing failure rate. However, the impact of that fault could be minimal on the overall system
performance, keeping the overall risk moderate. In this case, a less stringent
maintenance strategy such as run-to-failure or preventive maintenance could be sufficient to
manage the risk.
Fault Detection and Diagnostics for another piece of equipment might indicate an increasing
failure rate, along with a high impact of failure on overall system performance. In this case, a
more stringent predictive maintenance program should be adopted despite its higher cost. The
increased cost of maintenance is warranted to prevent a major fallout that would be far
more costly.
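To make the risk-based reasoning above concrete, here is a purely illustrative sketch (not a standard method) that maps assumed 1–5 likelihood and impact ratings onto a maintenance strategy; the cut-off values are arbitrary assumptions.

```python
# Illustrative sketch: map FDD evaluation outputs (failure likelihood and impact
# severity) onto a maintenance strategy, mirroring the two examples above.
# Rating scales and cut-offs are assumptions, not an industry standard.

def recommend_strategy(likelihood: int, impact: int) -> str:
    """likelihood and impact rated 1 (low) to 5 (high)."""
    risk = likelihood * impact
    if risk >= 15:
        return "predictive maintenance (condition-based, highest rigour)"
    if risk >= 8:
        return "preventive maintenance (scheduled inspections/replacements)"
    return "run-to-failure (accept the risk, repair on breakdown)"

print(recommend_strategy(likelihood=4, impact=2))  # rising failure rate, low impact
print(recommend_strategy(likelihood=4, impact=5))  # rising failure rate, high impact
```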
In short, fault detection and diagnostics play a decisive role in optimizing the maintenance
regime for any piece of equipment, across its lifecycle.
With the advent of fast computing technologies, big data processing, and advanced learning
algorithms, traditional fault detection has evolved into automatic fault management systems that
not only detect faults, but also identify the root cause and implement corrective actions to avoid
future recurrence.
Such automation of a series of manual processes has enabled reliability and maintenance
engineers to apply predictions on equipment health, derive future equipment performance, and
shape optimal maintenance intervals.
The only thing they have left to do is fire up their computerized maintenance management
software (CMMS), track the condition of their critical assets, and schedule appropriate
maintenance work.
TOP CAUSES OF MACHINE FAILURE AND HOW TO PREVENT
THEM
Machine failure, once an accepted part of life for manufacturers and OEMs, has met its match
with modern technology using IoT devices, the cloud, and edge computing. In order to pre-empt
and prevent machine failure, it’s first important to understand what it is and why it happens in an
industrial environment.
We can also review existing strategies for dealing with equipment failure including reactive
maintenance, diagnostic analytics, and preventive maintenance. In understanding where these
strategies fail, we can learn why manufacturers are moving toward predictive maintenance,
which resolves the issues of each of its three predecessors.
Sudden Failure
This is what most people think of when they hear machine failure. The production line is
humming along when an unexpected (but obvious) breakdown happens. Things like a shattered
tool, snapped band, melted wire, etc. fall into this category.
Intermittent Failure
Think of this like a sputtering engine in your production line. It’ll go a little bit, then quit. You
start it back up, and it keeps working as intended a little longer, but then it starts failing again.
Intermittent failures come and go, usually on their way to a “full” machine failure. These
sporadic or random failures can, by their nature, be difficult to identify. Intermittent failures can
frequently be prevented with maintenance.
Gradual Failure
These are the failures you can see over time as a machine’s usefulness takes a steady decline.
This includes things like a belt that’s slowly shredding, blades that get duller, pipes that
eventually clog with residue buildup. Most gradual failures can be prevented through regular
maintenance, armed with an understanding of the expected lifetime of the parts at hand.
The Most Common Causes of Machine Failure
Failure starts somewhere. The following are some of the most frequent causes of machine failure
and can be used to analyze, prepare, and prevent future instances of malfunction.
Operator Error
Despite extensive training, humans are still prone to making errors: skipping important
principles from their training, laziness, tiredness, and plain old forgetfulness. Sometimes misuse
and abuse of equipment by machine operators is to blame for failure. This can also include simple
accidents, like dropping a piece of equipment.
Reactive Maintenance
This is the traditional maintenance paradigm. When it breaks, we fix it. It doesn’t prevent the
machine from failing so much as it offers a route to resolving the problem once the malfunction
occurs.
Diagnostic Analytics
This requires a little more digging. Within this maintenance structure, machine data and root
cause analysis are deployed to determine why the machine failed in the first place. This
information can then be used within a preventive maintenance strategy.
Preventative Maintenance
Preventive maintenance includes regularly inspecting machines prior to use, establishing and
sticking to a maintenance schedule, regularly replacing components before their average lifespan
is over, and anything that tries to ward off the failure before it happens. Think of it like changing
the oil in your car every few thousand miles. We don’t wait until the oil is muck and has clogged
the rest of our equipment, we just preemptively, preventively, maintain it based on our
expectations of when failure would otherwise occur.
Predictive Maintenance
Predictive maintenance uses past machine performance to model asset behavior. With enough
data, algorithms can work to predict equipment failures based on real-time data off of machines
that are IoT-connected. This means that preventive maintenance tasks don’t happen
unnecessarily—like replacing perfectly good parts—but instead are based on a deeper and more
customized analysis of when failure is impending or most likely to occur.
The real boon of IoT vs. more traditional data-gathering and analytics methods is its real-time
collection capacity. While historical data can offer great insight for preventive maintenance
strategies, IoT-enabled predictive maintenance offers a competitive edge to manufacturers by
increasing uptime, reducing resource waste, and providing strategic insights that can extend
beyond maintenance schedules into process optimization and more. Plus, IoT-connected
machinery has the potential to utilize the cloud for deep, rich analysis as well as edge computing
for lightning-fast insights, even in secure and air-gapped environments.
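As a minimal illustration of the predictive idea, the sketch below fits a linear trend to hypothetical vibration readings and extrapolates the time until an assumed alarm threshold is crossed. Real predictive maintenance models are far more sophisticated; every value here is a made-up example.

```python
import numpy as np

# Sketch of the predictive idea: fit a trend to condition-monitoring data streamed
# from an IoT-connected asset and estimate when it will cross an alarm threshold.
# The vibration values and the threshold are made-up examples.

hours = np.array([0, 100, 200, 300, 400, 500], dtype=float)
vibration_mm_s = np.array([2.1, 2.3, 2.6, 2.8, 3.1, 3.4])   # slowly rising trend
alarm_threshold = 4.5                                        # assumed alarm limit

slope, intercept = np.polyfit(hours, vibration_mm_s, deg=1)
if slope > 0:
    hours_to_alarm = (alarm_threshold - vibration_mm_s[-1]) / slope
    print(f"Estimated time to alarm threshold: {hours_to_alarm:.0f} operating hours")
    # Schedule the predictive maintenance task comfortably before that point.
```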
What is a Time Domain Reflectometer, TDR
Time domain reflectometers are used for testing cables like twisted pairs, coaxial cable, etc.,
where they can locate the position of faults.
Time domain reflectometers, TDRs are used for testing cable systems and other forms of feeder
where they are able to detect and pinpoint issues. As a result, time domain reflectometers, TDRs
are widely used in any area where there may be long or inaccessible lengths of cable that may
need to be tested, or they may have faults. Time domain reflectometry can also be used on
printed circuit boards to locate issues that can arise there as well.
TDR applications
Time domain reflectometers, TDRs are used in a variety of applications, some obvious, but
others less so.
Some of the TDR applications include:
Telecommunications cable landlines: TDRs are an invaluable tool for
telecommunications field engineers who need to repair telephone and broadband
landlines. They can be used for testing of very long cable runs, where it is impractical
to dig up or remove what may be a kilometers-long cable. If a break occurs, a TDR is
able to locate the position of the break with considerable accuracy.
Landline preventative maintenance: TDRs are used for preventive maintenance of
telecommunication lines. They can detect resistance on joints and connectors as they
corrode. TDRs can also detect increasing insulation leakage as it degrades and absorbs
moisture. Ultimately this can lead to catastrophic failure, but the TDR is able to detect
this before this point is reached.
Landline security surveillance: Time domain reflectometers can detect the existence
and location of wire taps. The wiretap introduces a slight change in line impedance
and this can be seen on the TDR when connected to a phone line.
Circuit board testing: Specialized time domain reflectometers can be used for the
failure detection of modern high-frequency printed circuit boards, especially on tracks
designed to emulate transmission lines. The reflections seen by the TDR reveal any
unsoldered pins of a ball grid array device or short-circuited pins, etc.
Industrial applications: Time domain reflectometry is used in a variety of industrial
applications, including the testing of integrated circuit packages where failing areas of
an IC can be detected. TDR technology can even be used for measuring liquid levels,
etc.
Although it is possible to use instruments such as network analyzers and the like to check the
integrity of cables, these test instruments are very expensive and not easy to use. A
much better approach for many applications is to use time domain reflectometry techniques and a
specific test instrument. This considerably simplifies the operation as well as reducing the cost of
the test instrument. Also, many time domain reflectometers are specifically made for portable
operation, enabling them to be used far more easily in the scenarios where they are required, i.e.,
for telecommunications cables that may be running under roads, paths, etc.
The time domain reflectometer operates by sending a short pulse along the line in question. With
the far end terminated in the required impedance, i.e., that of the line, and no problems along the
line, all the energy in the pulse travels along the line at the propagation velocity and is dissipated
in the load, so no reflection is observed.
Basic block diagram of a time domain reflectometer, TDR
From this it can be seen that the time domain reflectometer consists of a pulse generator and a
sampler. The sampler could be an oscilloscope that displays the waveforms on the line; in reality,
a little more signal processing is often included to help locate problems and issues with the line.
However, if there is a discontinuity in the line, energy will be reflected back to the reflectometer
where it is detected.
Within the reflectometer it is possible to analyze the returned pulse assuming that the voltage of
the outgoing pulse level is Ei, and the reflected pulse has a level Er.
A reflection may occur for a variety of reasons, from a break somewhere in the cable to a
poor match at the remote end. The time delay T will be twice the time taken for the wave to travel
to the mismatch point, i.e. the out and return times together.
The sampler will be able to detect not only the level change and be able to calculate the
mismatch, but also the time difference from which the distance along the line where the
discontinuity exists can be calculated.
Distance = Vp × (T / 2)
Where:
Distance = distance to the mismatch in metres
Vp = velocity of propagation in metres per second
T = round-trip time between launching the pulse and receiving the reflection, in seconds.
This is a straightforward calculation to make and is normally made within the time domain
reflectometer, giving the user a good indication of where the fault may be located.
The main issue is the propagation velocity within the cable. This can be determined by testing a
known length of the cable under test and leaving the remote end open.
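As a rough illustration, the snippet below reproduces the distance calculation a TDR performs internally, using assumed values for the cable's velocity factor and the measured round-trip time.

```python
# Sketch of the TDR distance calculation. The velocity factor and timing
# values below are assumed for illustration only.

C = 299_792_458            # speed of light in a vacuum, m/s
velocity_factor = 0.66     # typical of solid-PE coax; calibrate from a known cable length
v_p = velocity_factor * C  # propagation velocity in the cable, m/s

round_trip_time = 1.2e-6   # seconds between launching the pulse and seeing the reflection
distance_to_fault = v_p * round_trip_time / 2
print(f"Fault located roughly {distance_to_fault:.1f} m from the test point")
```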
Nature of mismatch
Not only is it possible for the time domain reflectometer to discover where the fault or problem
has occurred along the cable, it is also possible to discover much about the nature of it as well.
The reflected pulse enables the test instrument to see both the nature and magnitude of the
mismatch.
ρ = Er / Ei = (ZL − Z0) / (ZL + Z0)
Where:
ρ = reflection coefficient
ZL = load impedance in ohms
Z0 = line impedance in ohms
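The reflection coefficient can likewise be turned into a rough open/short/match classification, as sketched below; the numerical thresholds used to label each case are arbitrary illustration values.

```python
# Sketch of interpreting the reflection coefficient rho = Er/Ei = (ZL - Z0)/(ZL + Z0).
# Thresholds used to label "open" and "short" are arbitrary illustration values.

def classify_reflection(Er: float, Ei: float) -> str:
    rho = Er / Ei
    if rho > 0.9:
        return "near open circuit (rho ~ +1): break or unterminated end"
    if rho < -0.9:
        return "near short circuit (rho ~ -1)"
    if abs(rho) < 0.05:
        return "well matched: little or no reflection"
    return f"partial mismatch, rho = {rho:+.2f}"

print(classify_reflection(Er=0.95, Ei=1.0))   # open-circuit style reflection
print(classify_reflection(Er=-0.98, Ei=1.0))  # short-circuit style reflection
```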
Although it may appear to be a specialist test instrument, the TDR is widely used in a variety of
industries, but particularly within the telecommunications industry where it is an invaluable tool.
Without the time domain reflectometer, locating problems with long inaccessible lines would be
very difficult and costly.
Fault Tolerance and Its Impact on System Reliability
Equipment and systems that are designed with no fault tolerance in mind often have poorer
reliability. This is why a fault-tolerant system design is an obvious choice for most reliability and
design engineers – especially when it comes to critical equipment whose failure can compromise
the reliability, availability, maintainability, and safety (RAMS) of the whole system it is a
part of.
Each design stage may adopt combinations of the techniques listed below to develop new designs
or improve current ones and enhance their level of fault tolerance:
1. fault detection and display
2. fault diagnosis and containment
3. fault masking and compensation
For example, a simple air pressure sensor in a car tire pressure monitoring system (TPMS) can
detect overinflation and notify the driver via the car dashboard.
A representation of TPMS activation
In this case, detection and display are the only tolerance level provided for this fault event.
The driver can safely disengage the air hose before rupturing the tire. If the pressure detection
is inaccurate, the driver may disengage the hose too early or too late and experience tire failure
while driving. Since there is no automatic correction of air pressure, the tolerance for this fault is
restricted to just detection and display.
As an example of containment, consider the overpressure of petroleum products in a vessel: the
system is triggered by the relevant pressure sensors, opens the safety pressure valve, and exhausts
the vapors into the flare stack. In this case, containment is carried out by diverting the high-
pressure flammable vapor to the flare stack, protecting the system from fire or explosion.
With such equipment, one of the most significant challenges comes in the form of cybersecurity
threats. These types of threats can attempt to induce the fault by altering the state of the
equipment through the injection of false equipment data into the server.
With incorrect equipment state records, the very control and monitoring system originally
intended to protect the asset can instead cause its failure. Alternatively, it can be “tricked”
into thinking the asset is in good condition when it is actually not – letting the deterioration lead
to failure without triggering any alerts.
By incorporating fault-masking, the system is designed in a way that it can recognize and mask
those incorrect values. For example, in the electricity grids, the circuit breakers are often
controlled and monitored through Supervisory Control and Data Acquisition (SCADA).
Such a system closely monitors the voltage and frequency parameters of the electrical equipment
and causes them to close or open to maintain power network stability.
An incoming cyberattack could alter the voltage and frequency limits on the equipment.
Consequences? The system could cause a power breakdown instead of preventing it.
Fault masking is often carried out through algorithms that detect anomalous data streams and
mask or discard the injected false values – the data that misrepresents the state of the
equipment. This prevents bad data from spreading the fault and further degrading
the grid's reliability.
Improving fault tolerance through redundant designs
One of the simple actions that can be taken to increase fault tolerance is by incorporating
redundancies in the design. Redundancy simply means the presence of an alternate system or
solution that can take over the intended function should the primary system fail.
While redundancy improves fault tolerance, haphazardly adding systems should not be the
objective, as the cost required to add any new system can significantly outweigh the
attainable reliability benefit.
Active redundancies
Active redundancies can be established when multiple pieces of equipment are operated
simultaneously. In this configuration, each piece of equipment contributes its share towards
attaining the intended function while still acting as redundancy for each other.
A simple example of active redundancy is the parallel operation of two pumps at half of their rated
capacities. Both pumps jointly operate to achieve the desired discharge pressure. If one pump fails,
the other can still be boosted to its rated capacity to attain the intended discharge pressure on
its own. To attain economy of design, reliability engineers have come up with various other,
more elaborate ways to achieve active redundancy, such as k-out-of-n redundancy and graceful
degradation (see the sketch below).
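A minimal sketch of the reliability arithmetic behind such arrangements, assuming identical, independent units and a purely illustrative per-pump reliability figure:

```python
from math import comb

# Sketch of why active redundancy helps: reliability of a k-out-of-n arrangement
# of identical, independent units (the two-pump example is a 1-out-of-2 case).
# The per-unit reliability figure is an assumed example value.

def k_out_of_n_reliability(k: int, n: int, r: float) -> float:
    """Probability that at least k of n units survive, each with reliability r."""
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(k, n + 1))

r_pump = 0.90
print(f"Single pump:              {r_pump:.3f}")
print(f"Two pumps, 1-of-2 needed: {k_out_of_n_reliability(1, 2, r_pump):.3f}")  # ~0.990
print(f"Three pumps, 2-of-3:      {k_out_of_n_reliability(2, 3, r_pump):.3f}")  # ~0.972
```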
Passive redundancies
Passive redundancy is the standby redundancy where the alternate equipment is present – but it
can only take over the intended function upon failure of the primary equipment.
We can differentiate two types of passive redundancies:
1. operating passive redundancies
2. non-operating passive redundancies
Operating passive redundancies are the ones where the alternative equipment is present as a
hot spare. The standby equipment is hot because it could be operating under no-load conditions.
In some cases, it may be serving a function that is outside the definition of primary equipment’s
function. Upon failure of the primary equipment, the operating standby equipment can be
automatically transitioned into performing the function of primary equipment.
An example of operating passive redundancies can be a secondary alternator that operates under
no-load conditions and meets all other paralleling conditions such as the same terminal voltage,
frequency, and phase sequence. Upon failure of the primary alternator, the secondary alternator
can be automatically synchronized with the system and take over the load.
In the case of non-operating passive redundancies, the standby equipment is powered down.
Upon failure of primary equipment, the standby equipment can be automatically or manually set
to operating conditions and take over the functionality of primary equipment.
A good example of non-operating passive redundancy is a standby municipal water pump which
can be started and operated manually to deliver water to residents if the primary water pump
malfunctions. Since the restoration of operation is not critical, an operator can go and start the
pump (and synchronize it with the system later, as needed).
Reliability techniques for analyzing fault tolerance
Fault tolerance is a part of reliability engineering efforts and requires careful examination of all
possible failures that can happen within the equipment. The Failure Mode Effect Analysis
(FMEA) and the Fault Tree Analysis (FTA) are two well-known techniques to analyze system
design from bottom-up and top-down approaches respectively.
To better understand tolerance, the failure sequence and dependencies must be analyzed and
investigated. A particularly useful technique for analyzing dependencies and sequence is
the Markov model, where the probability of any failure event depends upon the state left by the
previous event.
Similarly, another powerful technique is Monte Carlo simulation, which can be used to model the
impact of the uncertainties of any failure event on system performance, as sketched below.
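As an illustration of the Monte Carlo idea, the sketch below estimates the mission reliability of a primary unit backed by a cold standby, using assumed MTBF figures and an exponential life model; a real study would use failure distributions and switching logic specific to the equipment.

```python
import numpy as np

# Sketch of a Monte Carlo check on a fault-tolerant design: draw random
# times-to-failure for a primary unit and its standby and estimate the chance
# that the pair survives a mission. Failure-rate figures are assumed examples.

rng = np.random.default_rng(42)
n_trials = 100_000
mission_hours = 8760                          # one year of operation
mtbf_primary, mtbf_standby = 12_000, 15_000   # assumed mean times between failures

t_primary = rng.exponential(mtbf_primary, n_trials)
t_standby = rng.exponential(mtbf_standby, n_trials)

# Cold-standby approximation: the standby only starts accumulating operating time
# after the primary fails, so total endurance is the sum of the two lives.
system_life = t_primary + t_standby
print(f"Estimated mission reliability: {np.mean(system_life > mission_hours):.3f}")
```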
Because of redundancies and other characteristics we discussed earlier, such systems can usually
take on more faults before their functionality is compromised. However, if the issues aren’t
addressed, the accumulation of faults will eventually lead to a system or equipment breakdown.
Therefore, maintenance teams should use a CMMS system to make sure corrective maintenance
actions are taken in due time.
In some sense, fault tolerance gives maintenance and support teams more breathing room. They
still need to deal with the problem, but maybe not right away.
While fault-tolerant designs have their challenges in terms of increased costs and complexity,
they make up for it in the form of improved equipment reliability.
Root Cause Analysis (RCA): Steps, Tools, And Examples
RCA is a reactive process, meaning it’s performed after the event occurs. But once a root cause
analysis is done, it takes the shape of a proactive mechanism since it can predict problems before
they occur.
If you fix a symptom of the problem, but you don’t fix the actual cause of the problem, there’s a
high chance the failure will happen again.
For example, suppose you replace the broken belt but don’t change the misaligned part causing
the belt to overheat and break. In that case, you could bet your paycheck that the belt is going to
fail again. RCA tries to follow the chain of cause and effects to pinpoint the problem that will
make all the other faults disappear when finally eliminated.
Industry applications
Over the years, RCA has evolved to work within various fields, each with its own unique needs
and approach. The most apparent use of RCA is in the medical field. Aside from the healthcare
field, many other industries use root cause analysis regularly. Some of them are:
manufacturing (machine failure analysis)
industrial engineering and robotics
industrial process control and quality control
information technology (software testing, incident management, cybersecurity analysis)
complex event processing
disaster management and accident analysis
pharmaceutical research
change management
risk and safety management
These industries will generally use one specific type of root cause analysis that fits their situation
best. Below are some examples of different types of RCA methodologies used by various fields
and industries.
Keep in mind that RCA requires a significant investment of time, manpower, and money. And it
will likely cause further disruption in the specific production line or the system you’re working
on. So, bearing that in mind, you don’t need to (and you shouldn’t) do RCA for every single
fault.
Unfortunately, there is no cut-and-dried rule for when to run an RCA and when not to. As the
expert and experienced professional, you're generally the best person to determine whether or not
to run a root cause analysis.
Persistent faults
If the same fault occurs over and over, it’s worth investigating. If the same defect is repeatedly
happening, you can assume that it won’t be cleared simply by fixing the visible problem. There
is an underlying reason for the recurring faults. These types of incidents need to be investigated
with RCA.
Critical failure
To determine if a failure is critical, you can look at the cost to the plant or the total downtime due
to the particular failure. When a critical failure occurs, it needs to be investigated to identify the
root cause to help avoid this situation in the future. Explosions at an oil rig and airplane crashes
are examples of critical failures that need to be investigated.
Failure impact
There are critical machines and critical subprocesses in any system. A failure of these types of
machines will halt the entire operation because there may not be a backup or mitigation plan for
that particular machine. In this case, how critical the machine is will determine whether or not to
do RCA.
Recognize
The actual cause of a problem is not always apparent, and simple cosmetic fixes usually don’t do
much to correct the underlying fault. Even though RCA can be an elaborate time-consuming
exercise, we do it to pinpoint the actual cause so we can take corrective actions that will
eliminate future issues. As mentioned earlier, RCA can also be done to identify the reason for an
unexpected positive outcome.
This first step is when you notice something’s not working quite right. The machine is
leaking fluid, making a weird sound, or not running as productively as it usually does. This is
when it’s time to put on your detective cap and find out what’s going on.
Rectify
Once you’ve recognized the root cause, it’s time to start a corrective course of action. If the
root cause is addressed, the same problem should not be cropping up again. If the same problem
reappears, it’s likely because the cause you identified was not actually the root cause. In this
case, you might have to go through the RCA process again to make sure that you get to the actual
root cause.
For example, you notice the machine is leaking fluid, so you patch the hole in the metal. If you
stop seeing fluid on the ground under the machine, you’ve solved the problem, and you’ve taken
care of the root issue. But if a leak crops up again in a week, it’s time to run another RCA to find
out if there are other holes in the metal or if gaskets are failing.
Replicate
Once you’ve identified and rectified the root cause, your next step is to ensure it will not happen
again at any point during the process or system. Sometimes you’ll want to do an RCA to get to
the bottom of an unexpectedly good outcome. In that case, you will test whether the same factors
can be replicated in other scenarios and environments.
Suppose there were issues with faulty parts coming off the line, but you've since fixed the issue.
The next step would be to recreate the conditions under which the problem occurred and test
whether you actually fixed the root issue, ensuring that you got to the bottom of the problem.
RCA is about solving problems. But one of the most significant benefits for you is that being
skilled at RCA makes you look good. When you’re good at what you do, you can get
management on your side (which usually means an easier time getting the budget you need). And
it can even make a big enough impression that it can change your career trajectory for the better.
To do a root cause analysis the right way, you should follow four basic steps.
Inspecting the machine in person also provides information that could be beneficial for root
cause analysis. It will be easy for facilities that run predictive maintenance to collate data
quickly.
From the data collected, you can identify correlations between various events, their timing, and
other data collected. Remember that correlation does not mean causation.
Questions to ask yourself when looking for correlations:
What sequence of events allowed this to happen?
What conditions are present/allowed this to happen?
What other problems surround the occurrence of the main problem?
The next step is to map out a causal graph. These graphs are used to represent the relationship
between events that happened and the data collected.
But it’s important to not stop investigating when you find a correlation between events.
Correlation means there is a link between two events, but it doesn’t automatically mean that one
event caused the other. That’s why it’s essential to continue your sleuthing until you find a
causal relationship. Find out what event caused another event. This will help you find the actual
root cause.
From the data collected, chronological sequencing, and clustering, we should be able to create
a causal graph (or use one of the root cause analysis tools we discuss later). You can use this
graph to represent the relationship between various events that occurred and the data collected.
The different paths are given different probability weights. They can serve as a visual tool to
track down the root cause.
Example of a causal graph. Source: Adam Kelleher on Medium
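As a minimal illustration of the idea, a causal graph can be represented as a simple cause-to-effects mapping and walked backwards from the observed failure to candidate root causes. The event names below are made up (they mirror the blown-fuse example later in this lesson), and real tools would also carry the probability weights mentioned above.

```python
# Sketch of a causal graph as a plain adjacency mapping (cause -> effects).
# Walking the graph backwards from the observed failure surfaces candidate
# root causes. Event names are illustrative.

causes = {
    "no scrap filter on lubrication pump": ["metal scrap enters pump"],
    "metal scrap enters pump": ["pump shaft wears"],
    "pump shaft wears": ["insufficient lubrication"],
    "insufficient lubrication": ["bearing overheats"],
    "bearing overheats": ["machine overload", "blown fuse"],
}

def root_causes(observed: str) -> set[str]:
    """Events with no incoming edges that can reach the observed event."""
    parents = {effect: cause for cause, effects in causes.items() for effect in effects}
    roots, frontier = set(), {observed}
    while frontier:
        event = frontier.pop()
        if event in parents:
            frontier.add(parents[event])
        else:
            roots.add(event)
    return roots

print(root_causes("blown fuse"))  # {'no scrap filter on lubrication pump'}
```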
Once the problem is solved, you will need to take proactive steps to ensure it doesn’t happen
again. There can be multiple solutions applied to solve a single issue.
For example, the root cause could be the wear of a bearing, which happened much earlier than
expected. In this case, the procedure has to be adjusted to change the bearing at an earlier time.
Similar steps to avoid recurrence of the fault can be changes in the maintenance schedule, different
modes of maintenance, changes in design, different OEM vendors, etc.
The implemented solution will have to be in line with the available resources. So, if the root
cause is pushing the machine too hard, the obvious answer is to shorten the machine run time.
However, if the production schedule doesn’t allow for shortened runtimes, another solution
might be scheduling more preventive maintenance.
You and your company should have your own unique protocol when conducting RCA. In
some instances, external consultants might be brought in to conduct RCA. In such cases, the
consultants will generally have their own preferred technique or a combination of techniques
they use. This is one of the reasons why it is hard to create a universal template for RCA that
everyone can follow.
5 Why analysis
5 Whys is the original root cause analysis technique, developed by Sakichi Toyoda and used at
Toyota factories. The idea is to question everything with a ‘why’, just like a curious child, and to
keep asking ‘why’ until you reach a stage where there is no need to ask it again. At that point, you
should have reached the root cause of the problem.
As a rule of thumb, asking and finding answers to 5 subsequent ‘why’s’ should be more than
enough to reveal the root cause of most problems. Hence the name ‘5 why’ analysis.
Benefits of the 5 Whys:
helps identify the root cause of a problem
offers an understanding of how one process can cause a chain of problems
helps determine the relationship between different root causes
highly effective without complicated evaluation techniques
The 5 M framework (shown above) from the Toyota Production System uses RCA with the
Ishikawa method. The 5 Ms are:
man/mind power
machines
measurement
methods
material
The problem or fault is written down at the far right end, where the fish head would be. The
cause of the problem is represented along the horizontal line. Further effects and their respective
causes are written down along the fish bones representing each of the 5 Ms. This process
continues until the team is convinced that the root cause is identified.
A diverse cross-functional team is essential when using FMEA. You will need to clearly define
and communicate the scope of the analysis to your team members. Each subsystem, design, and
process is closely reviewed. The purpose, need, and function of each system are questioned.
Potential failure modes are brainstormed. Failure of similar processes and products in the past
can also be analyzed.
The potential effects and disruptions that could be caused by each of the identified failure modes
are assessed and used to calculate its Risk Priority Number (RPN) – the product of its severity,
occurrence, and detection ratings.
If the failure mode has a higher RPN than a company is comfortable with, you can address this
by changing one or more factors outlined in the image above.
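A minimal sketch of the RPN arithmetic, using made-up failure modes, ratings on an assumed 1–10 scale, and an assumed action threshold:

```python
# Sketch of the Risk Priority Number calculation used in FMEA:
# RPN = Severity x Occurrence x Detection, each typically rated 1-10.
# Failure modes, ratings, and the threshold below are illustrative assumptions.

failure_modes = [
    # (failure mode, severity, occurrence, detection)
    ("bearing seizure",        8, 4, 6),
    ("control card failure",   7, 3, 3),
    ("conveyor belt slippage", 4, 6, 2),
]

rpn_threshold = 150  # assumed action limit set by the team

for mode, sev, occ, det in sorted(failure_modes,
                                  key=lambda m: m[1] * m[2] * m[3],
                                  reverse=True):
    rpn = sev * occ * det
    action = "ACTION REQUIRED" if rpn > rpn_threshold else "monitor"
    print(f"{mode:<24} RPN = {rpn:4d}  -> {action}")
```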
Benefits of FMEA:
enables early identification of a failure point
captures the collective knowledge of a team
improves the quality, reliability, and safety of the process
a logical, structured approach for identifying process areas of concern
reduces process development time, cost
documents and tracks risk reduction activities
Fault tree analysis tries to map the logical relationships between faults and the subsystems
of a machine. The fault you are analyzing is placed at the top of the chart. If either of two causes
can produce the effect on its own, they are combined with a logical OR operator. For
example, if a machine can fail while in operation or while under maintenance, it is a logical OR
relationship.
If two causes need to occur simultaneously for the fault to happen, the relationship is represented
with a logical AND. For example, if a machine only fails when the operator pushes the wrong
button AND the relay fails to activate, it is a logical AND relationship, represented using the
boolean AND symbol. In the image above, AND is the blue symbol, and OR is the purple symbol.
The fault tree created for a failure is analyzed for possible improvements and risk management.
This is an effective tool to conduct RCA for automated machines and systems.
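As a small illustration, the logic of a fault tree can be evaluated directly with boolean AND/OR gates; the basic events below are illustrative assumptions built from the examples in the text, not a complete tree for any real machine.

```python
# Sketch of evaluating a small fault tree with boolean AND/OR gates.
# The basic events and their states are illustrative assumptions.

basic_events = {
    "operator pushes wrong button": True,
    "relay fails to activate":      True,
    "failure during operation":     False,
    "failure during maintenance":   False,
}

def AND(*branches: bool) -> bool:
    return all(branches)

def OR(*branches: bool) -> bool:
    return any(branches)

# Top event: machine failure occurs if either OR-branch fires, where one
# branch itself requires two simultaneous causes (the AND gate).
top_event = OR(
    AND(basic_events["operator pushes wrong button"],
        basic_events["relay fails to activate"]),
    OR(basic_events["failure during operation"],
       basic_events["failure during maintenance"]),
)
print("Top event (machine failure):", top_event)  # True
```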
Pareto charts
A Pareto chart indicates the frequency of defects and their cumulative effects. Italian economist
Vilfredo Pareto recognized a common theme across almost all frequency distributions he
observed: a vast imbalance between causes and the effects they produce. He proposed that in any
system, roughly 80% of the results (or failures) are caused by 20% of all potential causes.
The principle is dubbed the Pareto principle (some know it as the 80-20 rule). This skew between
cause and effect is evident in many different distributions, from wealth distribution among
people to failures in a machine.
Pareto chart for shirt defects. Source: Tulip.co
With the 80-20 principle in mind, you can use Pareto analysis to dig into failures and possible
causes. To start, draw a bar graph that includes the frequency of faults and causes. With this
graph, it’s easier to see the skew between causes and failures. Usually, you’ll see how a small
percentage of factors cause the majority of faults.
Next, you’ll analyze the causes that contribute to the largest number of faults and take corrective
action to eliminate the most common defects.
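A minimal sketch of the underlying Pareto arithmetic follows, with made-up shirt-defect counts: rank the causes by frequency and flag the "vital few" that account for roughly 80% of defects.

```python
# Sketch of a Pareto analysis: rank fault causes by frequency and find the
# "vital few" responsible for ~80% of defects. Counts are made-up examples.

defect_counts = {
    "seam misalignment": 55,
    "loose threads":     30,
    "missing buttons":   15,
    "stained fabric":     7,
    "wrong label":        3,
}

total = sum(defect_counts.values())
cumulative = 0.0
print(f"{'cause':<20}{'count':>6}{'cum %':>8}")
for cause, count in sorted(defect_counts.items(), key=lambda kv: kv[1], reverse=True):
    cumulative += 100.0 * count / total
    marker = "  <- vital few" if cumulative <= 80.0 else ""
    print(f"{cause:<20}{count:>6}{cumulative:>7.1f}%{marker}")
```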
Honorable mentions
Root cause analysis is very open-ended and has a lot of widely used tools in various industries.
We covered the major ones in the sections above, but these systems also deserve some
recognition. A few honorary mentions:
Cause and effect diagrams. The Fishbone diagram is an example of cause and effect
diagrams. Many similar tools try to map the relationship between causes and effects in a
system.
Kaizen is another tool from the stable of Japanese process improvements. It is a continuous
process improvement method. Root cause analysis is embedded within the structure of
Kaizen.
Barrier analysis is an RCA technique commonly used for safety incidents. It is based on
the idea that a barrier between personnel and potential hazards can prevent most safety
incidents.
Change analysis is used when a potential incident occurs due to a single element or factor
change.
A scatter diagram is a statistical tool that plots the relationship between two variables in a two-
dimensional chart. It can also be used as an RCA tool.
RCA example #1: Part distortion in injection molding
Let’s presume that the defect is part distortion. First, write down the problem, including the
number of defects occurring as a percentage. Once that is completed, collect all the available
data: pull any maintenance logs from your CMMS, review manuals from the
injection mold machine manufacturer, etc.
Collect information on each defective product. From this, measure the deviation from
specifications. Take the heat signature of the product once it comes out of the mold, then
measure the temperature of molten plastic in the barrel.
We know that part distortion almost always occurs due to temperature problems. But we cannot
be sure where the temperature problem is…is it in the barrel while heating or in the mold while
cooling?
By analyzing the data you collected, you would be able to identify that. For this example, we’ll
assume the heat signature of the finished product is different from the expected one.
This determines that the problem is in the cooling process. Further investigation concludes that
the root problem is the wrong spatial arrangement of cooling liquid conduits.
Changing the conduit arrangement that best fits the mold currently being produced will solve the
problem of part distortion.
RCA example #2: The mystery of the blown fuse
Next, let’s say a machine stopped because it overloaded and the fuse blew.
Investigation shows that the machine is overloaded because it had a bearing that wasn’t being
sufficiently lubricated.
Your investigation continues, and you find that the automatic lubrication mechanism had a pump
that was not pumping sufficiently. A review of the pump shows that it has a worn shaft.
Investigation of why the shaft was worn discovers that there isn’t an adequate mechanism in
place to prevent metal scraps from getting into the pump. This enabled scraps to get into the
pump and damage it.
The apparent root cause of the problem is metal scrap contaminating the lubrication system.
Fixing this problem should prevent the whole sequence of events from happening again. The real
root cause could be a design issue if no filter prevents the metal scrap from getting into the
system. Or if it has a filter that was blocked due to a lack of routine maintenance, then the actual
root cause is a maintenance issue.
Compare this with an investigation that does not find the causal factor: replacing the fuse, the
bearing, or the lubrication pump will probably allow the machine to go back into operation for a
while. But there is a risk that the problem will simply reoccur until the root cause is dealt with.
5 Steps to Troubleshooting That Will Fix Just About Anything
Everything breaks eventually. When rebooting doesn’t solve the problem, we brainstorm causes
and test them to find the issue. That is troubleshooting in a nutshell.
This lesson looks at:
What troubleshooting is
Some common causes
How to streamline the process using your CMMS (computerized maintenance management
system)
What is troubleshooting?
Troubleshooting is a step-by-step approach to finding the root cause of an issue and deciding the
best way to fix it so the equipment can get back in operation. Troubleshooting is not just for
equipment that has completely broken down. We also use it when a machine is just not working
as expected.
Efficient troubleshooting is an essential part of asset management, diagnosis, and repair.
Machines that are properly operated and regularly maintained are less likely to suffer major
breakdowns. Still, there will never be a zero chance of failure. If you are using equipment, it will,
at some point, need repairing.
The fact is, unplanned downtime is expensive for companies, often costing them hundreds of
thousands of dollars per hour. Suppose you’ve got a capable maintenance team that knows
how to troubleshoot effectively. In that case, you can reduce high-severity outages and save
the company money.
2) Unexpected operation
Every machine has a defined set of functions it can perform. Most devices don’t do things
exactly the same way every time because of limitations in engineering and human error (as hard
as we may try to avoid it). Even with these slight variations in performance, the machine can
operate smoothly. This is considered its normal operation range.
If the machine starts to run outside these ranges, we may have a problem, and it needs to be on
your crew’s radar. These situations are not as urgent as a total failure. Still, unexpected
operations should be reported to fix the problem before a real issue comes up.
Take the cooling fans in your plant, for example. Imagine they are running and pushing out cool
air, but every so often, they stop blowing for a few minutes (or the air isn’t as cold as it should
be). Other equipment might overheat because of that malfunction and eventually start to break
down. Fixing the fan as soon as you know about it will save the company time and a lot of
money.
Getting operational users to log faults when they come up can be a great way to get to the
problems early and avoid total failure. Using your CMMS to log the problem will give you a
written history of what happened and how it was fixed, making troubleshooting time in the future
that much easier.
3) Other anomalies
The machine is working within the ideal operating range and is delivering the expected output.
However, an operator has spotted some anomaly. It could be a strange sound, a weird smell,
visible smoke, excessive vibration, etc. Such anomalies should also be investigated within an
appropriate time window.
The process for reporting problems should never be made into a tedious task – keeping it simple
is the only way to ensure people actually use it.
With detailed asset history logs and troubleshooting experience, users can take care of things
independently. This will free up more time for your team to focus on things that matter more.
What makes these technicians so good at what they do? Many of them have learned through trial
and error what are the best troubleshooting techniques for each piece of equipment. There is
massive value in having those senior technicians running the troubleshooting teams and
creating checklists that hit on the most common issues.
The problem is that when all these experienced technicians retire, they take their knowledge with
them. There is already a big labor shortage in the industry. Suppose we haven’t codified the
information into a central hub (like Limble). In that case, we risk losing valuable historical info
when they leave.
Depending on the complexity of the machine, your maintenance crew can train experienced
machine operators to handle straightforward troubleshooting tasks, such as visual checks, general
troubleshooting, and other basic maintenance work. This approach is known as autonomous
maintenance.
Troubleshooting steps
Troubleshooting is a step-by-step process. Below, we break it down into five simple-to-follow
steps. It doesn’t matter if you are an experienced or inexperienced professional; you will follow
the same systematic approach every time.
Step 1: Define the problem
The first step of solving any problem is to know what type of problem it is and define it well. A
clear definition is fundamental when troubleshooting. When looking at a problem, you need to
know what you are up against and the possible causes. Is it machine failure, an unexpected
operation, user error, or a random anomaly? What happened that alerted you to the problem?
Some equipment will have built-in ways of letting you know; alarms can sound, red lights flash,
or a warning can go off when certain parts overheat. These signals can help with problem-
solving. Other equipment just stops working. Whatever the case may be, you have to identify and
define the problem before you can move forward.
Keep testing until you are sure that you have found the right solution. If nothing works, you will
need to rethink what the actual cause is.
A practical maintenance toolkit holds as much information about an asset as possible. In Limble,
tracking an asset’s history is ridiculously easy. You can see all related Work Orders, Parts, who
did work most recently – you can even manually add notes and images taken with your phone.
By keeping a record of every step, from reporting the fault or failure to the five steps above, you
can create a clear path through the troubleshooting journey to repair or, in some cases, show the
need to replace the asset.
Imagine how easy it will be to fix if the problem happens again!
Ways to make troubleshooting easier
We are here to make your job easier. Troubleshooting can feel overwhelming and disorganized,
but there are many tools available to help you and your crew get to the bottom of any problem.
Below are a few of the commonly used tools and resources for effective troubleshooting.
Troubleshooting checklists
Checklists are a great way to approach common problems methodically and help standardize the
process. They do the heavy lifting for you. When you’ve got a lot going on it can be risky to rely
on your own brain to remember all of the steps. Having a checklist means that you don’t have to.
Maintenance platforms like Limble also let you create and store troubleshooting checklists that
can be accessed on mobile devices and used in the field.
Maintenance engineers can work with experienced technicians to identify problematic assets and
create step-by-step troubleshooting instructions that include warnings and images for specific
assets/issues. When you finish, you can attach each checklist to the corresponding piece of
machinery.
A modern CMMS
Having the right CMMS can streamline, organize, and automate your maintenance
operations. A modern CMMS will save you and your team time and your company a lot of
money.
As a centralized repository of maintenance data, a CMMS keeps a lot of helpful information used
during the troubleshooting process like:
OEM manuals
contact information for machine and parts vendors
maintenance logs and reports
details of the work request sent to report the problem
troubleshooting and other maintenance checklists
past and current machine-condition and performance data gathered through CBM sensors
Limble CMMS uses QR codes to give your users easy access to all the information about the
equipment with a simple scan of their phone. They can scan the code on the side of the
equipment and quickly report faults to your team with the correct asset already attached to the
work order.
Having quick and easy access to this information can significantly speed up the troubleshooting
process and reduce the loss of institutional knowledge when technicians retire or move on. These
are just a few of many reasons why more and more organizations are implementing cloud-based
maintenance solutions.
When it comes to troubleshooting, machine learning is helping us analyze large amounts of data
and identify/predict possible causes of faults and failures.
Some organizations are already taking things a step further and testing something called
prescriptive analytics. In the context of troubleshooting, prescriptive analytics aims to help
machines diagnose themselves and then present possible solutions based on that self-diagnosis.
Enhancing the real world with AR
Augmented reality (AR) combines computer-generated imagery with the actual equipment to
give an additional layer of information. You can overlay parts and look into things that you
ordinarily wouldn’t be able to.
All you need is a phone or tablet loaded with the software. Hold it over the machine, and the
program will pull up all the different layers for you to look at.
If you are in the middle of a diagnosis, this can be a great way to check if everything is where it
should be or make sure that it is in good working order.
AR allows your maintenance team to see all the information about a component on the screen. It
can also show you tips, warnings, and next steps, improving quality and safety during the
troubleshooting process.