Smart Condition Monitoring Using Machine Learning
Dr. Patrick Bangert, algorithmica technologies GmbH
Abstract
Introduction
Any machine will eventually reach a point of poor health. That point is not a point of
shutdown or failure yet but it is a point at which it is apparent that the machine no
longer acts entirely as it should and will be in need of some maintenance activity to
restore its full operating potential. Whether the machine is a rotating machine (pump,
compressor, gas or steam turbine, etc.) or a non-rotating machine (heat exchanger,
distillation column, valve, etc.) we ask: Is it currently healthy? That is the domain of
condition monitoring.
The most common way to perform condition monitoring is to look at each sensor
measurement from the machine and to impose a minimum and maximum value limit
on it. If the current value is within the bounds, then the machine is healthy. If the
current value is outside the bounds, then the machine is unhealthy and an alarm is
sent.
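The limit-based scheme described above can be sketched in a few lines. This is only an illustration: the tag names and limit values below are hypothetical and are not taken from any real machine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    low: float
    high: float


# Hypothetical tags and limits, chosen only for illustration.
LIMITS = {
    "bearing_temp_C": Limit(20.0, 95.0),
    "vibration_mm_s": Limit(0.0, 7.1),
}


def limit_alarms(reading, limits):
    """Return the tags whose current value lies outside its fixed band."""
    return [
        tag
        for tag, lim in limits.items()
        if not (lim.low <= reading[tag] <= lim.high)
    ]


healthy = {"bearing_temp_C": 71.5, "vibration_mm_s": 2.3}
faulty = {"bearing_temp_C": 101.2, "vibration_mm_s": 2.3}

print(limit_alarms(healthy, LIMITS))
print(limit_alarms(faulty, LIMITS))
```

Each tag is judged in isolation here, which is exactly the property that the following paragraphs identify as the weakness of the method.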
This procedure is known to send a large number of false alarms, that is, alarms for
situations that are actually healthy states for the machine. There are also missed
alarms, that is, situations that are problematic but raise no alarm. The first problem
wastes not only time and effort but also equipment availability. The second
problem is more serious as it leads to real damage with the associated collateral
damage, repair costs and lost production.
Both problems result from the same cause. The health of a complex piece of
equipment cannot be reliably judged based on the analysis of each measurement on
its own. We must consider a combination of the various measurements to get a true
indication of the situation.
Modeling
This is the domain of machine learning. These methods take empirical data that has
been measured on this particular machine in the past when it was known to be
healthy. From these data, the machine learning methods automatically and without
human effort construct a mathematical representation of the relationships of all the
parameters around the machine.
The selection of which measurements are important to take into consideration when
modeling a particular measurement can also be done automatically. We use a
combination of correlation modeling and principal component analysis to do this [1].
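As a rough sketch of the correlation half of this selection step, the snippet below ranks candidate measurements by the absolute Pearson correlation with the measurement to be modeled and keeps those above a cut-off. The principal component analysis step is omitted, and the tag names, data values and the 0.8 threshold are assumptions made for illustration; the actual procedure is the one described in [1].

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def select_inputs(history, target, threshold=0.8):
    """Keep the tags whose |correlation| with the target meets the threshold."""
    scores = {
        tag: abs(pearson(values, history[target]))
        for tag, values in history.items()
        if tag != target
    }
    selected = [tag for tag, score in scores.items() if score >= threshold]
    # Strongest correlation first.
    return sorted(selected, key=lambda tag: scores[tag], reverse=True)


# Hypothetical healthy history for four tags (made-up numbers).
history = {
    "shaft_displacement": [1.0, 1.4, 2.1, 2.9, 3.6, 4.0],
    "load": [10, 20, 35, 50, 65, 70],
    "discharge_pressure": [5.1, 5.9, 7.2, 8.8, 10.1, 10.9],
    "ambient_temp": [21, 19, 22, 20, 21, 19],
}

print(select_inputs(history, "shaft_displacement"))
```

With this made-up data, load and discharge pressure are retained as inputs while ambient temperature is dropped because it is only weakly correlated with the target.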
Fig. 1: The displacement of the central axle of a compressor is measured (red) and modeled (green)
with the confidence interval of the model (light green). It is seen that the model accurately models the
behavior of the machine even during load changes, both to increased and decreased load. We observe
a deviation between model and measurement at full load each time that full load is reached. This is an
indication of a machine problem and will be alarmed.
The result is that each measurement on the machine gets a formula that can
compute the expected value for this measurement. As the formula was trained on
data known to be healthy, this formula is the definition of health for this machine.
Unhealthy conditions are then considered deviations from health. See figure 1 for an
illustration.
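The idea of a per-measurement formula trained on healthy data can be illustrated with a deliberately simple stand-in: a straight-line least-squares fit of one measurement against a single input. The real models use more inputs and a non-linear method, so the linear form, the tag names and the data below are assumptions made purely to keep the sketch short.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical history recorded while the machine was known to be healthy.
load = [10, 20, 30, 40, 50, 60, 70, 80]
displacement = [1.02, 1.49, 2.01, 2.48, 3.03, 3.50, 3.98, 4.51]

slope, intercept = fit_line(load, displacement)


def expected_displacement(load_value):
    """Expected healthy displacement at the given load."""
    return slope * load_value + intercept


def deviation(load_value, measured):
    """Measured value minus the value the healthy model expects."""
    return measured - expected_displacement(load_value)


print(deviation(40, 2.50))  # a reading close to the healthy relationship
print(deviation(80, 5.20))  # a reading well above what the model expects
```

The second reading might still lie inside a fixed min/max band, yet it stands out clearly once it is compared with what the model expects at that load.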
It is important to model health and look for deviations from it because health is the
normal condition and much data is available for normal healthy behavior. Rather little
data is available for known unhealthy behavior and this small amount is very diverse
because of a host of different failure modes. Failure modes differ for each make and
model of a machine, which makes a full characterization of possible faults very complex.
Modeling poor health directly is therefore limited not by data analysis but by data
availability. As such, the problem is fundamental and cannot be tackled in a practical and
comprehensive way.
Alarming
At any one time, we can compare the expected healthy value to the sensor value. As
the expected value is computed from a model, we know the probability distribution of
deviations, i.e. how likely it is that the measurement will be away from the
expected value by a given amount.
Fig. 2: The deviation between model and sensor value (horizontal axis) versus the likelihood of that
deviation occurring (vertical axis). We expect a bell-shaped curve for a good model, i.e. many points
having little deviation and few points having a lot of deviation with an overall symmetry between
deviations above and below the model. From this distribution, we can easily read off how likely any
observed deviation is and thus how healthy any measured state is. This diagram directly translates a
sensor measurement into a health index.
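One way to turn the bell-shaped curve into a number is sketched below. It assumes that the deviations of a healthy machine are normally distributed around zero, estimates their spread from past healthy residuals, and reports the two-sided probability of seeing a deviation at least as large as the current one. The Gaussian assumption, the residual values and the 0.001 alarm threshold are illustrative choices, not values from the source.

```python
import math
import statistics


def tail_probability(deviation, sigma):
    """Two-sided probability of a deviation at least this large under N(0, sigma^2)."""
    return math.erfc(abs(deviation) / (sigma * math.sqrt(2.0)))


def is_alarm(deviation, sigma, threshold=0.001):
    """Alarm when a deviation this large would be very unlikely for a healthy machine."""
    return tail_probability(deviation, sigma) < threshold


# Hypothetical model-minus-sensor residuals from a healthy period.
healthy_residuals = [-0.03, 0.01, 0.02, -0.01, 0.04, -0.02, 0.00, -0.01]
sigma = statistics.stdev(healthy_residuals)

for d in (0.01, 0.70):
    print(d, tail_probability(d, sigma), is_alarm(d, sigma))
```

A value near 1 means the observed deviation is entirely ordinary, while a value near 0 means it would almost never occur in a healthy machine, so the same number can serve as a simple health index.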
The alarm can be enriched by the information of how unhealthy the state is by
providing the probability of poor health. Since the expected value has been computed
from a (usually small) number of other machine parameters, it is generally possible to
lay blame on some other measurement. This gives assistance to the human engineer
receiving the alarm in the effort to diagnose the problem and design some action.
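The text does not say how blame is assigned, so the following is only one possible heuristic, not the method actually used. It scores each model input by how far its current value lies from its own healthy history, measured in standard deviations, and lists the most unusual input first. All tag names and numbers are hypothetical.

```python
import statistics


def rank_suspects(healthy_history, current):
    """Sort model inputs by how unusual their current value is (largest first)."""
    scores = {}
    for tag, values in healthy_history.items():
        mu = statistics.mean(values)
        sd = statistics.stdev(values)
        scores[tag] = abs(current[tag] - mu) / sd
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical healthy history of the inputs to one measurement's model.
healthy_history = {
    "lube_oil_temp_C": [60, 61, 59, 60, 62, 58],
    "suction_pressure_bar": [4.0, 4.1, 3.9, 4.0, 4.2, 3.8],
}
current = {"lube_oil_temp_C": 68, "suction_pressure_bar": 4.05}

for tag, score in rank_suspects(healthy_history, current):
    print(tag, round(score, 2))
```

A score of this kind is only meaningful when the healthy history covers the current operating regime; during a load change an input can be far from its historical mean for entirely legitimate reasons.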
With conventional condition monitoring, it is often found in practice that when a machine
transitions from one stable state to another, many false alarms are raised because a
simple analysis approach cannot keep up with the quickly changing conditions. As a
neural network can represent highly non-linear relationships, even a startup or
load change of a machine will be modeled accurately, without alarm, if everything is as
it should be.
Results
This approach has been thoroughly tested on rotating machinery such as
compressors, pumps, and gas and steam turbines from several manufacturers in the
operational context of power generation, chemical production, and oil production and
refining. In particular, MAN Diesel and Turbo uses this approach to
raise alarms on its compressors and gas turbines before a human engineer takes a look at the
data.
Our partners have observed in practice that the total human engineering effort required
to set up and maintain a condition monitoring system decreases by over 50% due
to the automated assistance of machine learning methods. This is mainly because the
upper and lower limits for each measurement no longer have to be set carefully by
hand, as the models are now generated automatically.
The incidence rates of both false alarms (false positives) and missed alarms (false
negatives) have been found to be reduced by over 90%. This reduces the human
engineering effort spent diagnosing machine faults by over 60%, reduces maintenance
budgets and improves machine availability by about 10%.
Fig. 3: An example of a maintenance budget over time in thousands of US dollars. Over time, the
budget size decreases as maintenance becomes less reactive and more proactive.
As the organization adopts the new methods, maintenance becomes less reactive
and more proactive. This saves money in several ways. As we detect problems before
they result in an unscheduled trip of the plant, collateral damage is prevented
entirely and the potential lost production is reduced. The cost of rush orders for both
people and materials is also avoided. The actual issue at hand must be repaired, of
course, but this can now be done in a planned, preemptive manner as opposed to a
firefighting mode. This may reduce a maintenance budget by as much as 50%.
Summary
The two principal problems of the standard approach to condition monitoring, i.e.
false alarms and bad conditions not alarmed, can be solved. This is accomplished by
creating a mathematical representation of each measurement in terms of the others
and thus considering the combination of various measurements around the same
piece of equipment. These models can be generated automatically using machine
learning without human effort. The accuracy of these models distinguishing healthy
Bibliography