
Engineering Applications of Artificial Intelligence 41 (2015) 139–150

Contents lists available at ScienceDirect

Engineering Applications of Artificial Intelligence

journal homepage: www.elsevier.com/locate/engappai

Predicting the need for vehicle compressor repairs using maintenance records and logged vehicle data

Rune Prytz a,*,1, Sławomir Nowaczyk b, Thorsteinn Rögnvaldsson b, Stefan Byttner b

a Volvo Group Trucks Technology, Advanced Technology & Research, Göteborg, Sweden
b Center for Applied Intelligent Systems Research, Halmstad University, Sweden

ARTICLE INFO

Article history:
Received 11 June 2014
Received in revised form 17 January 2015
Accepted 11 February 2015

Keywords:
Machine Learning
Diagnostics
Fault Detection
Automotive Industry
Air Compressor

ABSTRACT

Methods and results are presented for applying supervised machine learning techniques to the task of predicting the need for repairs of air compressors in commercial trucks and buses. Prediction models are derived from logged on-board data that are downloaded during workshop visits and have been collected over three years on a large number of vehicles. A number of issues are identified with the data sources, many of which originate from the fact that the data sources were not designed for data mining. Nevertheless, exploiting this available data is very important for the automotive industry as a means to quickly introduce predictive maintenance solutions. It is shown on a large data set from heavy duty trucks in normal operation how this can be done and generate a profit.

Random forest is used as the classifier algorithm, together with two methods for feature selection whose results are compared to those of a human expert. The machine learning based features outperform the human expert features, which supports the idea of using data mining to improve maintenance operations in this domain.

© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

Today, Original Equipment Manufacturers (OEMs) of commercial transport vehicles typically design maintenance plans based on simple parameters such as calendar time or mileage. However, this is no longer sufficient in the market and there is a need for more advanced approaches that provide predictions of the future maintenance needs of individual trucks. Instead of selling just vehicles, the sector is heading towards selling complete transport services; for example, a fleet of trucks, including maintenance, with a guaranteed level of availability. This moves some of the operational risk from the customer to the OEM but should lower the overall cost of ownership. The OEM has the benefit of scale and can exploit similarities in usage and wear between different vehicle operators.

Predicting future maintenance needs of equipment can be approached in many different ways. One approach is to monitor the equipment and detect patterns that signal an emerging fault, which is reviewed by Hines and Seibert (2006), Hines et al. (2008a,b), and Ma and Jiang (2011). A more challenging one is to predict the Remaining Useful Life (RUL) for key systems, which is reviewed by Peng et al. (2010), Si et al. (2011), Sikorska et al. (2011) and Liao and Köttig (2014). For each of these approaches there are several options on how to do it: use physical models, expert rules, data-driven models, or hybrid combinations of these. The models can look for parameter changes that are linked to actual degradation of components, or they can look at vehicle usage patterns and indirectly infer the wear on the components. Data-driven solutions can be based on real-time data streamed during operation or on collected historical data.

We present a data-driven approach that combines pattern recognition with RUL estimation, by classifying whether the RUL is shorter or longer than the time to the next planned service visit. The model is based on combining collected (i.e. not real-time) data from two sources: data collected on-board the vehicles and service records collected from OEM certified maintenance workshops. This presents a number of challenges, since the data sources have been designed for purposes such as warranty analysis, variant handling and financial follow-up on workshops, not for data mining. The data come from a huge set of real vehicles in normal operation, with different operators. The challenges include, among others, highly unbalanced data sets, noisy class labels, uncertainty in the dates, irregular readouts and an unpredictable number of readouts from individual vehicles. In

* Corresponding author.
E-mail addresses: [email protected] (R. Prytz), [email protected] (S. Nowaczyk), [email protected] (T. Rögnvaldsson), [email protected] (S. Byttner).
1 Rune Prytz holds a licentiate in Signals and Systems engineering. He is active as a research engineer at AB Volvo and holds a position as Uptime Systems & Services Specialist.

http://dx.doi.org/10.1016/j.engappai.2015.02.009
0952-1976/© 2015 Elsevier Ltd. All rights reserved.

addition, multiple readouts from the same truck are highly correlated, which puts constraints on how data for testing and training are selected. We specifically study air compressors on heavy duty trucks and the fault complexity is also a challenge; air compressors face many possible types of failures, but we need to consider them all as one since they are not differentiated in the data sources.

Predictive maintenance in the automotive domain is more challenging than in many other domains, since vehicles are moving machines, often operating in areas with low network coverage or travelling between countries. This means few opportunities for continuous monitoring, due to the cost of wireless communication, bandwidth limitations, etc. In addition, both the sensors and computational units need to fulfil rigorous safety standards, which makes them expensive and not worth adding purely for diagnostic purposes. Those problems are amplified due to a large variety of available truck configurations. Finally, heavy duty vehicles usually operate in diverse and often harsh environments.

The paper is structured as follows. A survey of related works introduces the area of data mining of warranty data. This is followed by an overview of the data sets and then a methodology section where the problem is introduced and the employed methods are described. This is finally followed by a results section and a conclusion section.

1.1. Related work

There are few publications where service records and logged data are used for predicting maintenance needs of equipment, especially in the automotive industry, where wear prediction is almost universally done using models that are constructed before production.

In a survey of artificial intelligence solutions in the automotive industry, Gusikhin et al. (2007) discuss fault prognostics, after-sales service and warranty claims. Two representative examples of work in this area are Buddhakulsomsiri and Zakarian (2009) and Rajpathak (2013). Buddhakulsomsiri and Zakarian (2009) present a data mining algorithm that extracts associative and sequential patterns from a large automotive warranty database, capturing relationships among occurrences of warranty claims over time. Employing a simple IF–THEN rule representation, the algorithm filters out insignificant patterns using a number of rule strength parameters. In their work, however, no information about vehicle usage is available, and the discovered knowledge is of a statistical nature concerning relations between common faults. Rajpathak (2013) presents an ontology based text mining system that clusters repairs with the purpose of identifying best-practice repairs and, perhaps more importantly, automatically identifying when claimed labour codes are inconsistent with the repairs. Related to the latter, but more advanced, is the work by Medina-Oliva et al. (2014) on ship equipment diagnosis. They use an ontology approach applied to mining fleet data bases and convincingly show how to use this to find the causes for observed sensor deviations.

Thus, data mining of maintenance data and logged data has mainly focused on finding relations between repairs and operations and to extract most likely root causes for faults. Few have used them for estimating RUL or to warn for upcoming faults. We presented preliminary results for the work in this paper in an earlier study (Prytz et al., 2013). Furthermore, Frisk et al. (2007) recently published a study where logged on-board vehicle data were used to model RUL for lead-acid batteries. Their approach is similar to ours in the way that they also use random forests and estimate the likelihood that the component survives a certain time after the last data download. Our work is different from theirs in two aspects. First, a compressor failure is more intricate than a battery failure; a compressor can fail in many ways and there are many possible causes. Second, they also attempt to model the full RUL curve whereas we only consider the probability for survival until the next service stop.

Recently Choudhary et al. (2009) presented a survey of 150 papers related to the use of data mining in manufacturing. While their scope was broader than only diagnostics and fault prediction, they covered a large portion of literature related to the topic of this paper. Their general conclusion is that the specifics of the automotive domain make fault prediction and condition based maintenance a more challenging problem than in other domains; almost all research considers the case where continuous monitoring of devices is possible.

Jardine et al. (2006) present an overview of condition-based maintenance (CBM) solutions for mechanical systems, with special focus on models, algorithms and technologies for data processing and maintenance decision-making. They emphasise the need for correct, accurate information (especially event information) and working tools for extracting knowledge from maintenance databases. Peng et al. (2010) also review methods for prognostics in CBM and conclude that methods tend to require extensive historical records that include many failures, even “catastrophic” failures that destroy the equipment, and that few methods have been demonstrated in practical applications. Schwabacher (2005) surveys recent work in data-driven prognostics, fault detection and diagnostics. Si et al. (2011) and Sikorska et al. (2011) present overviews of methods for prognostic modelling of RUL and note that available on-board data are seldom tailored to the needs of making prognosis and that few case studies exist where algorithms are applied to real world problems in realistic operating environments.

When it comes to diagnostics specifically for compressors, it is common to use sensors that continuously monitor the health state, e.g. accelerometers for vibration statistics, see Ahmed et al. (2012), or temperature sensors to measure the compressor working temperature, see Jayanth (2010). The standard off-board tests for checking the health status of compressors require first discharging the compressor and then measuring the time it takes to reach certain pressure limits in a charging test, as described e.g. in a compressor trouble shooting manual Bendix (2004). All these are essentially model-based diagnostic approaches where the normal performance of a compressor has been defined and then compared to the field case. Similarly, there are patents that describe methods for on-board fault detection for air brake systems (compressors, air dryers, wet tanks, etc.) that build on setting reference values at installment or after repair, see e.g. Fogelstrom (2007).

In summary, there exist very few published examples where equipment maintenance needs are estimated from logged vehicle data and maintenance data bases. Yet, given how common these data sources are and how central transportation vehicles are to the society, we claim that it is a very important research field.

2. Presentation of data

Companies that produce high value products necessarily have well-defined processes for product quality follow-up, which usually rely on large quantities of data stored in databases. Although these databases were designed for other purposes, e.g. analysing warranty issues, variant handling and workshop follow-up, it is possible to use them also to model and predict component wear. In this work we use two such databases: the Logged Vehicle Data (LVD) and the Volvo Service Records (VSR). We have used data from approximately 65,000 European Volvo trucks, models FH13 and FM13, produced between 2010 and 2013.

2.1. LVD

The LVD database contains aggregated information about vehicle usage patterns. The values are downloaded each time a

vehicle visits an OEM authorised workshop for service and repair. This happens several times per year, but at intervals that are irregular and difficult to predict a priori.

During operation, a vehicle continuously aggregates and stores a number of parameters, such as average speed or total fuel consumption. In general, those are simple statistics of various kinds, since there are very stringent limitations on memory and computing power, especially for older truck models. Most parameters belong to one of the following three categories: Vehicle Performance and Utilisation, Diagnostics or Debugging. This work mainly focused on the first category, which includes over 2000 different parameters. They are associated with various subsystems and components. The following four example ones have been identified as important for predicting air compressor failures by a domain expert: Pumped air volume since last compressor change, Mean compressed air per distance, Air compressor duty cycle, and Vehicle distance.

The vehicles in our data set visit a workshop, on average, every 15 weeks. This means that the predictive horizon for the prognostic algorithm must be at least that long. The system needs to provide warnings about components with an increased risk of failing before the next expected workshop visit. However, if the closest readout prior to the failure is from 3–4 months earlier, then it is less likely that the wear has had visible effects on the data.

This time sparseness is a considerable problem with the LVD. The readout frequency varies a lot between vehicles and changes with vehicle age, and can be as low as one readout per year. Readouts also become less frequent as the vehicle ages. Fig. 1 illustrates how many vehicles have data in LVD at different ages. Some vehicles (dark blue in the figure) have consecutive data, defined as at least one readout every three months. They are likely to have all their maintenance and repairs done at OEM authorised workshops. Many vehicles, however, only have sporadic readouts (light blue in the figure).

For data mining purposes, the vehicles with consecutive data are most useful. They have readouts in the LVD database and repairs documented in the VSR system. They contribute with sequences of data that can be analysed for trends and patterns. On the other hand, from the business perspective it is important that as many trucks as possible are included in the analysis.

The right panel of Fig. 1 illustrates two different maintenance strategies. The three peaks during the first year correspond to the typical times of scheduled maintenance. Repairs then get less frequent during the second year, with the exception of just before the end of it. This is probably the result of vehicles getting maintenance and repairs before the warranty period ends. In general, all vehicles visit the OEM authorised workshops often during the warranty period. After that, however, some vehicles disappear, while the remaining ones continue to be maintained as before, without any significant drop in visit frequency. This loss of data with time is a problematic issue. Plenty of valuable LVD data is never collected, even after the first year of vehicle operation. A future predictive maintenance solution must address this, either by collecting the logged data and the service information using telematics or by creating incentives for independent workshops to provide data.

Finally, the specification of parameters that are monitored varies from vehicle to vehicle. A core set of parameters, covering basic things like mileage, engine hours or fuel consumption, is available for all vehicles. Beyond that, however, the newer the vehicle is, the more LVD parameters are available, but it also depends on vehicle configuration. For instance, detailed gearbox parameters are only available for vehicles with automatic gearboxes. This makes it hard to get a consistent data set across a large fleet of vehicles and complicates the analysis. One must either select a data set with inconsistencies and deal with missing values, or limit the analysis to only vehicles that have the desired parameters. In this work we follow the latter approach and only consider parameter sets that are present across large enough vehicle fleets. Sometimes this means that we need to exclude individual parameters that most likely would have been useful.

2.2. VSR

The VSR database contains repair information collected from the OEM authorised workshops around the world. Each truck visit is recorded in a structured entry, labelled with date and mileage, detailing the parts exchanged and operations performed. Parts and operations are denoted with standardised identification codes.

The database contains relevant and interesting information with regards to vehicle failures, information that is sometimes exploited by workshop technicians for diagnostics and to predict future failures. However, the work presented here only uses the dates of historic repairs. Those dates form the supervisory signal for training and validating the classifier, i.e. to label individual LVD readouts as having either a faulty or healthy air compressor, based on the time distance to the nearest replacement.

Unfortunately, however, there are no codes for the reasons why operations are done. In some cases those can be deduced from the free text comments from the workshop personnel, but not always. The quality and level of detail of those comments vary greatly. This is a serious limitation since it introduces a lot of noise into the training data classification labels. In the worst case, a perfectly good part can be replaced in the process of diagnosing an unrelated problem.

Undocumented repairs are also a problem. They rarely happen at authorised workshops since the VSR database is tightly coupled

[Fig. 1 appears here: two panels. Left: number of vehicles vs. age in years (legend: Consecutive data, In Total). Right: frequency of readouts vs. age in days.]

Fig. 1. Vehicle age distribution, based on readouts in LVD. The left panel shows the number of vehicles, with data readouts, at different ages: dark blue are vehicles with any data; light blue are vehicles with consecutive data. The right panel shows the age of vehicles at data readouts for the subset of vehicles that have any readouts beyond the age of two years (warranty period). (For interpretation of the references to colour in this figure caption, the reader is referred to the web version of this paper.)

with the invoicing systems. On the other hand, there is seldom any information about repairs done in other workshops. Patterns preceding faults that suddenly disappear are an issue, both when training the classifier and later when evaluating it.

Much of the information in the VSR database is entered manually. This results in various human errors such as typos and missing values. A deeper problem, however, is incorrect dates and mileages, where information in the VSR database can be several weeks away from when the matching LVD data was read out. This is partly due to lack of understanding by workshop technicians; for the main purposes, invoicing and component failure statistics, exact dates are not overly important. In addition, the VSR date is poorly defined. In some cases the date can be thought of as the date of diagnosis, i.e. when the problem was discovered, and it may not be the same as the repair date, i.e. the date when the compressor was replaced.

3. Problem formulation

Much diagnostics research focuses on predicting the Remaining Useful Life (RUL) of a component, and, based on that, deciding when to perform maintenance or component replacement. The RUL is usually modelled as a random variable that depends on the age of the component, the environment in which it operates, and the partially observable health state, which is continuously monitored or occasionally measured. Using the same notation as Si et al. (2011), we define X_t as the random variable of the RUL at time t, and Y_t as the history of operational profiles and condition monitoring information up to that point. The probability density function of X_t conditional on Y_t is denoted f(x_t | Y_t).

The usual RUL approach is to estimate f(x_t | Y_t) or the expectation of the RUL. However, in the setting we consider in this paper, this approach is unnecessarily complicated; we do not need to calculate the perfect time to perform a repair, since it is impractical to do a repair at arbitrary times. Repairs are preferably done during planned maintenance events. The truck is in the workshop on a particular date and the decision is a binary one: either to replace the component now or not. It should be replaced if the risk that it will not survive until the next planned maintenance is sufficiently high (in relation to the cost of repairing it in an unplanned service). That is, we need to estimate the posterior probability:

P(X_t < Δ | Y_t) = ∫_0^Δ f(x_t | Y_t) dx_t, (1)

where Δ is the time horizon. This is the probability that the RUL does not exceed the time horizon Δ, conditioned on the operation history and maintenance information Y_t. Based on this probability, we can make a decision of whether to flag an individual component as faulty or healthy, during every workshop visit of a given vehicle.

In the following sections we make two simplifying assumptions. First, we use the same prediction horizon Δ for all vehicles. Even though some vehicles are in the workshop more often than others, the main driving factor is the cost of wasting component life, and the acceptable level can be defined globally. The other simplification is that Y_t only contains the currently downloaded LVD data. This can be viewed as a form of Markovian condition: we have no memory of previous maintenance events in LVD, VSR or any other database. This latter assumption is for practical reasons; a system like this should be possible to implement in a workshop tool, without the need to access a historical database.

One simplification that we cannot assume is that of a single failure mode. We must determine whether the component (the air compressor in this case) will fail or not, regardless of mode. It is impossible to get information on single failure modes from the maintenance records.

As described in the next section, we do not explicitly estimate the posterior probability P(X_t < Δ | Y_t), but instead use a supervised classifier to predict the faulty/healthy decision directly.

4. Methods

4.1. Machine learning algorithm and software

All experimental results are averages over 10 runs using the Random Forest (Breiman, 2001) classifier, with 10-fold cross validation. We used the R language (R Core Team, 2014), including the caret, unbalanced, DMwR and ggplot2 libraries.2

A Random Forest is a set of decision trees combined by bagging, but with an additional layer of randomness on top of what is added by the bootstrapping of the training data: only a subset of the features is considered at each node split. This subset is randomly selected at each node and is normally small compared to the number of features available in the training data.

4.2. Evaluation criteria

Supervised machine learning algorithms are typically evaluated using measures like accuracy, area under the Receiver Operating Characteristic (ROC) curve or similar. Most of them, however, are suitable only for balanced data sets, i.e. ones with similar numbers of positive and negative examples. Measures that also work well for the unbalanced case include, e.g., the Positive Predictive Value (PPV) and the F1-score:

F1 = 2TP / (2TP + FN + FP), PPV = TP / (TP + FP), (2)

where TP, FP and FN denote true positives, false positives and false negatives, respectively.

However, the prognostic performance must take business aspects into account, where the ultimate goal is to minimise costs and maximise revenue. In this perspective, there are three main components to consider: the initial investment cost, the financial gain from correct predictions, and the cost of false alarms.

The initial investment cost consists of designing and implementing the solution, as well as maintaining the necessary infrastructure. This is a fixed cost, independent of the performance of the method. It needs to be overcome by the profits from the maintenance predictions. In this paper we estimate it to be €150,000, which is approximately one year of full time work.

The financial gains come from correctly predicting failures before they happen and doing something about them. It is reported in a recent white paper by Reimer (2013) that wrench time (the repair time, from estimate approval to work completion) is on average about 16% of the time a truck spends at the workshop. Unexpected failures are one of the reasons for this, since resources for repairs need to be allocated. All component replacements are associated with some cost of repair. However, unexpected breakdowns usually cause additional issues, such as the cost of delay associated with not delivering the cargo in time. In some cases, there are additional costs like towing. Fixed operational costs correspond to the cost of owning and operating a vehicle without using it. This includes drivers' wages, insurance and maintenance. A European long-haul truck costs on average €1000 per day in fixed costs.

False alarms are the most significant cost, since when good components are flagged by the system as being faulty, an action needs to be taken. At best this results in additional work for workshop personnel, and at worst it leads to unnecessary component replacements.

2 http://cran.r-project.org/web/packages/
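As a concrete illustration, the evaluation measures in Eq. (2) and the profit function in Eq. (3) from Section 4.2 can be sketched as below. This is a minimal sketch, not the authors' code; only the €150,000 investment figure comes from the text, while the ECUR and CPR values in the example are hypothetical.

```python
def f1_and_ppv(tp, fp, fn):
    """F1-score and Positive Predictive Value from confusion counts, Eq. (2)."""
    f1 = 2 * tp / (2 * tp + fn + fp)
    ppv = tp / (tp + fp)
    return f1, ppv


def profit(tp, fp, ecur, cpr, investment=150_000):
    """Total profit of the predictive maintenance system, Eq. (3).

    Each true positive saves the extra cost of an unplanned repair (ECUR);
    each false positive incurs the cost of a planned repair done in vain (CPR);
    the fixed investment defaults to the EUR 150,000 estimated in the text.
    """
    return tp * ecur - fp * cpr - investment
```

For example, with 100 true positives, 50 false positives, a hypothetical ECUR of €5000 and CPR of €1000, the profit would be 100 × 5000 − 50 × 1000 − 150,000 = €300,000; the same counts give a PPV of 100/150 ≈ 0.67 regardless of how many failures were missed, which is exactly why Eq. (3) ignores false negatives.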

It is worth noting that the above analysis does not account for false negatives, i.e. for cases where actual component failures were not detected. This is somewhat counter-intuitive, in the sense that one can think of them as missed opportunities, and missing opportunities is bad. In the current analysis, however, we focus on evaluating the feasibility of introducing a predictive maintenance solution in a market where there is none. At this stage, our goal is not so much finding the best possible method, but rather presenting a convincing argument that a predictive maintenance solution can improve upon the existing situation. In comparison to the current maintenance scheme, where the vehicles run until failure, those false negatives maintain the status quo. This is of course a simplification, since there is a certain cost associated with missing a failure. We could elaborate about the value of customer loyalty and quality reputation, but they are very hard to quantify. Therefore, in the Results section, we use both the above-defined profit and more “traditional” evaluation metrics (accuracy and F1-score), and point out some differences between them.

In this respect, the predictive maintenance domain for the automotive industry is quite different from many others. For instance, in the medical domain, false negatives correspond to patients who are not correctly diagnosed even though they carry the disease in question. This can have fatal consequences and be more costly than false positives, where patients get mistakenly diagnosed. It is similar for the aircraft industry.

Among others, Sokolova et al. (2006) analyse a number of evaluation measures for assessing different characteristics of machine learning algorithms, while Saxena et al. (2008) specifically focus on validation of predictions. They note how the lack of appropriate evaluation methods often renders prognostics meaningless in practice.

Ultimately, the criterion for evaluation of the model performance is a cost function based on the three components introduced at the beginning of this section. The function below captures the total cost of the implementation of the predictive maintenance system:

Profit = TP × ECUR − FP × CPR − Investment, (3)

where ECUR stands for the extra cost of an unplanned repair and CPR stands for the cost of a planned repair. Each true positive avoids the additional costs of a breakdown, and each false positive is a repair done in vain, which causes additional costs.

It is interesting to study the ratio between the costs of planned and unplanned repairs. It will vary depending on the component, fleet operator domain, business model, etc. On the other hand, the cost of a breakdown for vehicles with and without predictive maintenance can be used to determine the “break even” ratio required between true positives and false positives.

4.3. Prediction Horizon

We define the Prediction Horizon (PH) as the period of interest for the predictive algorithm. A replacement recommendation should be made for a vehicle for which the air compressor is expected to fail within that time frame into the future. As described earlier, the vehicles visit the workshop on average every 15 weeks and the PH needs to be at least that long. The system should provide warnings

4.4. Independent data sets for training and testing

A central assumption in machine learning (and statistics) is that of independent and identically distributed (IID) data. There are methods that try to lift it to various degrees, and it is well known that most common algorithms work quite well also in cases when this assumption is not fully fulfilled, but it is still important, especially when evaluating and comparing different solutions.

The readouts consist of aggregated data that have been sampled at different times. Subsequent values from any given truck are highly correlated with each other. This is even more profound in the case of cumulative values, such as total mileage, where a single abnormal value will directly affect all subsequent readouts. Even without the aggregation effect, however, there are individual patterns that are specific to each truck, be it particular usage or individual idiosyncrasies of the complex cyber-physical system. This makes all readouts from a single vehicle dependent. This underlying pattern of the data is hard to visualise by analysing the parameter data as such. However, a classifier can learn these patterns and overfit.

A partial way of dealing with the problem is to ensure that the test and training data sets are split on a per-vehicle basis and not randomly among all readouts. It means that if one or more readouts from a given vehicle belong to the test set, no readouts from the same vehicle can be used to train the classifier. The data sets for training and testing must contain unique, non-overlapping sets of vehicles in order to guarantee that patterns that are linked to wear and usage are learned, instead of specific usage patterns of individual vehicles.

4.5. Feature selection

The data set contains 1250 unique features and equally many differentiated features. However, only approximately 500 of them are available for the average vehicle. It is clear that not all features should be used as input to the classifier. It is important to find the subset of features that yields the highest classification performance. Additionally, the small overlap of common features between vehicles makes this a research challenge. It is hard to select large sets of vehicles that each share a common set of parameters. Every new feature that gets added to the data set must be evaluated with respect to the gain in performance and the decrease in the number of examples.

Feature selection is an active area of research, but our setting poses some specific challenges. Guyon and Elisseeff (2003) and Guyon et al. (2006) present a comprehensive and excellent, even if by now somewhat dated, overview of the feature selection concepts. Bolón-Canedo et al. (2013) present a more recent overview of methods. Molina et al. (2002) analyse the performance of several fundamental algorithms found in the literature in a controlled scenario. A scoring measure ranks the algorithms by taking into account the amount of relevance, irrelevance and redundancy in sample data sets. Saeys et al. (2007) provide a basic taxonomy of feature selection techniques, and discuss their use, variety and potential, from the bioinformatics perspective, but many of the issues they discuss are applicable to the data analysed in this paper. We use two feature selection methods: a wrapper approach based on the beam search algorithm, as well as a new filter method based
about components that are at risk of failing before the next expected on the Kolmogorov–Smirnov test to search for the optimal feature
workshop visit. set. The final feature sets are compared against an expert data set,
It is expected that the shorter the PH, the more likely it is that defined by an engineer with domain knowledge. The expert data set
there is information in the data about upcoming faults. It is contains four features, all of which have direct relevance to the age of
generally more difficult to make predictions the further into the the vehicle or the usage of the air compressor.
future they extend, which calls for a short PH. However, from a The beam search feature selection algorithm performs a greedy
business perspective it is desirable to have a good margin for graph search over the powerset of all the features, looking for the
planning, which calls for a long PH. We experiment with setting subset that maximises the classification accuracy. However, at
the PH up to a maximum of 50 weeks. each iteration, we only expand nodes that maintain the data set
size above the given threshold. The threshold is reduced with the number of parameters, as shown in Eq. (4). Each new parameter is allowed to reduce the data set by a small fraction, which ensures a lower bound on the data set size. The top five nodes, with respect to accuracy, are stored for the next iteration; this increases the likelihood of finding the global optimum. The search is stopped when a fixed number of features has been found:

n_dataset = n_all × constraintFactor^n_params. (4)

Many parameters in the LVD data set are highly correlated and contain essentially the same information, which can potentially lower the efficiency of the beam search. Chances are that different beams may select different, but correlated, feature sets. In this way the requirement for diversity on the "syntactic" level is met, while the algorithm is still captured in the same local maximum. We have not found this to be a significant problem in practice.

With the Kolmogorov–Smirnov method, we are interested in features whose distributions vary in relation to oncoming failures of air compressors. Based on the expertise within the OEM company, we know that there are two main reasons for such variations. On the one hand, they may be related to different usage patterns of the vehicle. As an example, air compressors on long-haul trucks typically survive longer than on delivery trucks, due to factors like less abrupt brake usage and less frequent gear changes. On the other hand, early symptoms of component wear may also be visible in some of the monitored parameters. For example, as worn compressors are weaker than new ones, it often takes them a longer time to reach the required air pressure.

To identify these features, we define normal and fault data sets, and compare their distributions using the Kolmogorov–Smirnov (KS) test (Hazewinkel, 2001).

The normal sample is a random sample of fault-free LVD readouts, while the fault sample consists of LVD readouts related to a compressor repair. The fault sample is drawn from vehicles with a compressor change and selected from times up to PH before the repair. This is done in the same way for both usage and wear metrics. The normal sample, on the other hand, is drawn either from all vehicles that have not had a compressor change, or from vehicles with a compressor change but outside of the PH time window before the repair. In the first case, the difference in distributions between normal and fault data corresponds to parameters capturing usage differences relevant for air compressor failures. In the second case, it is the wear difference.

The two samples are compared using a two-sample KS test, and a p-value is computed under the null hypothesis that the two samples are drawn from the same distribution. The p-value quantifies how likely the observed difference is if the null hypothesis is true, and a low p-value indicates that the null hypothesis may not be true. Features with low p-values are therefore considered interesting, since the observed difference may indicate a fundamental underlying effect (wear or usage). The lower the p-value, the more interesting the feature. The KS filter search is terminated when a predetermined number of features has been reached.

Fig. 2 illustrates the case with the feature Compressor Duty Cycle (CUF) when evaluated as relevant from the usage point of view. There is a clear difference 0–5 weeks before the repair, and there is also a difference 0–25 weeks before the repair. Fig. 3 illustrates the same case, but when evaluated from the wear point of view. This is done for illustrative purposes; CUF is a parameter that shows both usage and wear metrics, but there are other interesting parameters that are only discovered in one or the other.

It is important to note that the distribution of vehicle ages in the fault set is different from that in the normal set. We are working with real data, and old vehicles are more likely to fail than newer ones. This difference is clearly important when doing the classification, since the RUL depends on the age of the truck. However, in the feature selection this effect is undesirable: we are interested in identifying parameters that differ between healthy and failing vehicles, not those that differ between new and old vehicles.

We propose a method for reducing the risk of such spurious effects, by re-sampling the normal group so that it has the same mileage or engine hour distribution as the fault group. We call this age normalisation, and present an evaluation of it in the Results section. The sampling is done in two steps. In the first step, the reference distribution is re-sampled uniformly. In the second step, the uniform reference distribution is sampled again, this time weighted according to the distribution of the test set. In cases with a narrow age distribution for the fault set, only a fraction of the normal data will be used. This requires a substantial amount of normal data, which is available in our case, since the data set is highly unbalanced and there is much more normal data than fault data. The effect of age normalisation is illustrated in Fig. 4.

4.6. Balancing the data set

Machine learning methods usually assume a fairly balanced data distribution. If that is not fulfilled, the results tend to be heavily biased towards the majority class. This is a substantial problem in our case, since only a small percentage of the vehicles experience compressor failure and, for any reasonable value of the PH, only a small subset of their readouts is classified as faulty.
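The beam search with the shrinking data set constraint of Eq. (4) can be sketched as below. This is a minimal illustration under assumptions, not the implementation used in the paper: the `evaluate` callback (returning an accuracy and the number of readouts that have all selected features) and all names are stand-ins invented for the sketch.

```python
def beam_search(all_features, evaluate, n_all, beam_width=5,
                constraint_factor=0.9, max_features=14):
    """Greedy beam search over feature subsets (cf. Section 4.5).

    `evaluate(subset)` is assumed to return (accuracy, n_rows), where
    n_rows is the number of readouts having every feature in `subset`.
    A candidate is kept only while n_rows stays above the shrinking
    bound n_all * constraint_factor ** len(subset) from Eq. (4).
    """
    beams = [((), 0.0)]                        # (subset, accuracy) pairs
    for _ in range(max_features):
        candidates = []
        for subset, _acc in beams:
            for f in all_features:
                if f in subset:
                    continue
                new = subset + (f,)
                acc, n_rows = evaluate(new)
                # Eq. (4): each added feature may shrink the data set
                # by at most a small fraction.
                if n_rows >= n_all * constraint_factor ** len(new):
                    candidates.append((new, acc))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]        # keep the top nodes
    return beams[0]                            # best (subset, accuracy)
```

The defaults mirror the figures reported later in the paper: five beams and a 10% per-feature size reduction constraint.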
[Figure 2: two panels of density vs. CUF, each comparing a fault (Test) and a normal (Reference) sample.]

Fig. 2. Differences due to usage. The left panel shows the normal and fault distributions for the feature Compressor Duty Cycle (CUF) when the fault sample (light blue) is drawn from periods 0–5 weeks prior to the compressor repair. The right panel shows the same thing, but when the fault sample (light blue) is drawn from 0 to 25 weeks prior to the repair. The normal sample (grey) is in both cases selected from vehicles that have not had any air compressor repair. (For interpretation of the references to colour in this figure caption, the reader is referred to the web version of this paper.)
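The KS filter behind Figs. 2 and 3 can be sketched as below. It is a simplified stand-in for the real procedure: the two-sample KS statistic is computed exactly, but the p-value uses a crude one-term asymptotic approximation, and the feature dictionaries are hypothetical.

```python
import math

def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    difference between the two empirical distribution functions."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        if xs[i] <= ys[j]:
            i += 1
        else:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

def ks_pvalue(d, n, m):
    # One-term asymptotic approximation of the KS tail probability;
    # adequate for ranking, not for exact inference.
    lam = d * math.sqrt(n * m / (n + m))
    return min(1.0, 2.0 * math.exp(-2.0 * lam * lam))

def rank_features(normal, fault):
    """Rank feature names by the p-value of the KS test between the
    normal and fault samples; a lower p-value means a more interesting
    feature. Assumes both dicts map feature name -> list of values."""
    scored = []
    for name in normal:
        d = ks_statistic(normal[name], fault[name])
        scored.append((ks_pvalue(d, len(normal[name]), len(fault[name])), name))
    return [name for _p, name in sorted(scored)]
```

Drawing the fault sample from inside or outside the PH window, as described above, turns the same ranking routine into a usage filter or a wear filter.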

0.06 0.06
Test Test
Reference Reference

0.04 0.04
density

density
0.02 0.02

0.00 0.00

0 10 20 30 40 50 0 10 20 30 40 50

CUF CUF
Fig. 3. Differences due to wear. The left panel shows the normal and fault distributions for the feature Compressor Duty Cycle (CUF) when the fault sample (light blue) is
drawn from periods 0–5 weeks prior to the compressor repair. The right panel shows the same thing but when the fault sample (light blue) is drawn from 0 to 25 weeks prior
to the repair. The normal sample (grey) is in both cases selected from vehicles that have had an air compressor repair, but times that are before the PH fault data. (For
interpretation of the references to colour in this figure caption, the reader is referred to the web version of this paper.)

5e−04
Test Test
3e−04
Reference Reference
4e−04

3e−04 2e−04
density

density

2e−04
1e−04
1e−04

0e+00 0e+00

0 5000 10000 15000 0 5000 10000 15000

OO OO
Fig. 4. Illustration of the effect of age normalisation. The left panel shows the normal (grey) and the fault (light blue) distributions for a feature without age normalisation.
The right panel shows the result after age normalisation. Here age normalisation removed the difference, which was a spurious effect caused by age. (For interpretation of
the references to colour in this figure caption, the reader is referred to the web version of this paper.)
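The two-step age normalisation illustrated in Fig. 4 might be sketched as follows. The readout records, the `age_bin` key and the rounding scheme are assumptions made for the sketch, not the paper's actual implementation.

```python
import random
from collections import Counter

def age_normalise(reference, fault, key=lambda r: r["age_bin"], seed=0):
    """Re-sample `reference` readouts so that their age distribution
    matches that of the `fault` readouts (the two-step scheme described
    in Section 4.5).

    Step 1: group the reference by age bin, so each bin can be drawn
    from uniformly. Step 2: draw from those bins with weights
    proportional to the fault set's age distribution.
    """
    rng = random.Random(seed)
    by_bin = {}
    for r in reference:
        by_bin.setdefault(key(r), []).append(r)
    fault_bins = Counter(key(r) for r in fault)
    # Only bins present in both sets can be matched; with a narrow
    # fault-age distribution, much of the normal data goes unused.
    usable = {b: n for b, n in fault_bins.items() if b in by_bin}
    total = sum(usable.values())
    sample = []
    for b, n in usable.items():
        k = round(len(reference) * n / total)   # draws for this bin
        sample.extend(rng.choices(by_bin[b], k=k))
    return sample
```
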

Imbalanced data sets require either learning algorithms that handle this, or data preprocessing steps that even out the imbalance. We chose the latter. There are many domains where class imbalance is an issue, and a significant body of research is therefore available. For example, He and Garcia (2009) provide a comprehensive review of the research concerning learning from imbalanced data. They provide a critical review of the nature of the problem and the state-of-the-art techniques. They also highlight the major opportunities and challenges, as well as potentially important research directions, for learning from imbalanced data. Van Hulse et al. (2007) present a comprehensive suite of experiments on the subject of learning from imbalanced data. Sun et al. (2007) investigate meta-techniques applicable to most classifier learning algorithms, with the aim of advancing the classification of imbalanced data, exploring three cost-sensitive boosting algorithms developed by introducing cost items into the learning framework of AdaBoost. Napierala and Stefanowski (2012) propose a comprehensive approach, called BRACID, that combines multiple different techniques for dealing with imbalanced data, and evaluate it experimentally on a number of well-known data sets.

We use the Synthetic Minority Over-sampling TEchnique (SMOTE), introduced by Chawla et al. (2002). It identifies, for any given positive example, the k nearest neighbours belonging to the same class. It then creates new, synthetic examples randomly placed in between the original example and the k neighbours. It uses two design parameters: the number of neighbours to take into consideration (k) and the percentage of synthetic examples to create. The first parameter, intuitively, determines how similar new examples should be to existing ones, and the other how balanced the data should be afterwards. SMOTE can be combined with several preprocessing techniques, e.g. those introduced by Batista et al. (2004), aggregated and implemented in an R library by Dal Pozzolo et al. (2012). We tried and evaluated four of them: the Edited Nearest Neighbour (ENN), the Neighbourhood Cleaning Rule (NCL), the Tomek Links (TL), and the Condensed Nearest Neighbour (CNN).

5. Results

5.1. Cost function

The cost of planned repair CPR, the cost of unplanned repair CUR, and the extra cost of unplanned repair ECUR can be split up into the following terms:

CPR = Cpart + CPwork + CPdowntime, (5)

CUR = Cpart + CUwork + CUdowntime + Cextra, (6)

ECUR = CUR − CPR. (7)

Here, Cpart is the cost of the physical component, the air compressor, that needs to be exchanged. We set this to €1000. It
is the same for both planned and unplanned repairs. Cwork is the labour cost of replacing the air compressor, which takes approximately three hours. We set CPwork to €500 for planned repairs and CUwork to €1000 for unplanned repairs. If the operation is unplanned, one also needs to account for diagnosis, disruptions to the workflow, extra planning, and so on.

Cdowntime is the cost of vehicle downtime. Planned component exchanges can be done together with regular maintenance; CPdowntime is therefore set to zero. It is included in Eq. (5) since it will become significant in the future, once predictive maintenance becomes common and multiple components can be repaired at the same time. The downtime is a crucial issue for unplanned failures, however, especially in roadside breakdown scenarios. Commonly, at least half a day is lost immediately, before the vehicle is transported to the workshop and diagnosed. After that comes the wait for spare parts, and the actual repair may take place only on the third day. The resulting 2–3 days of downtime, plus a possible cost of towing, CUdowntime, is estimated to cost a total of €3500.

The additional costs, Cextra, are things like the delivery delay, the cost of damaged goods, fines for late arrival, and so on. This is hard to estimate, since it is highly dependent on the cargo, as well as on the vehicle operator's business model. The just-in-time principle is becoming more widespread in the logistics industry, and the additional costs are therefore becoming larger. We set Cextra to €11,000.

Inserting those estimates into Eqs. (5)–(7) yields CPR = €1500, CUR = €16,500 and ECUR = €15,000. The final Profit function, Eq. (3), becomes (in Euros)

Profit(TP, FP) = TP × 15,000 − FP × 1500 − 150,000. (8)

Obviously, the Profit function (8) is an estimate, and the numbers have been chosen so that there is a simple relationship between the gain you get from true positives and the loss you take from false positives (here the ratio is 10:1). A more exact ratio is hard to calculate, since it is difficult to get access to the data required to estimate it (this type of information is usually considered confidential). Whether the predictive maintenance solution yields a profit or a loss depends greatly on the extra cost Cextra.

5.2. The importance of data independence

The importance of selecting independent data sets for training and testing cannot be overstated. Using dependent data sets will lead to overly optimistic results that never hold in the real application. Fig. 5 shows the effects of selecting training and test data sets in three different ways.

The random method refers to when samples for training and testing are chosen completely randomly, i.e. when examples from the same vehicle can end up both in the training and in the test data set. These data sets are not independent, and the out-of-sample accuracy is consequently overestimated.

The one sample method refers to when each vehicle provides one positive and one negative example to the training and test data, and there is no overlap of vehicles between the training and test data. This leads to independent data sets that are too limited in size. The out-of-sample performance is correctly estimated, but the data set cannot be made large. The all sample method refers to the case when each vehicle can contribute any number of examples, but there is no overlap of vehicles between the training and test data. This also yields a correct out-of-sample accuracy, while the training data set can be made larger.

[Figure 5: accuracy (0.5–0.7) vs. training data set size (500–1500 readouts) for the Random, One.Sample and All.Sample strategies.]

Fig. 5. Comparison of strategies for selecting training and test data. The Expert feature set was used for all experiments and the data sets were balanced. The x-axis shows the size of the training data set.

5.3. Feature selection

The different feature selection approaches, and the age normalisation of the data, described in the Methods section, produced six different feature sets in addition to the Expert feature set.

The beam search wrapper method was performed with five beams and a size reduction constraint of 10%. The search gave five different results, one from each beam, but four of them were almost identical, differing only by the last included feature. The four almost identical feature sets were therefore reduced to a single one, by including only the 14 common features. The fifth result was significantly different and was kept without any modifications. The two feature sets from the beam search are denoted
[Figure 6: two panels of accuracy vs. training data size for the seven feature sets (Wear, Wear (AgeNorm), Usage, Usage (AgeNorm), Beamsearch 1, Beamsearch 2, Expert).]

Fig. 6. Comparison of feature selection methods when measuring the accuracy of the predictor. The left panel shows the result when training and test data are chosen randomly, i.e. with dependence. The right panel shows the result when the training and test data are chosen with the one sample method, i.e. without dependence.
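The "all sample" strategy of Section 5.2, in which every readout may be used but no vehicle appears on both sides of the split, can be sketched as below; the `vin` key and the 30% test fraction are illustrative assumptions, not values from the paper.

```python
import random

def split_by_vehicle(readouts, test_fraction=0.3,
                     vehicle_id=lambda r: r["vin"], seed=0):
    """'All sample' split: every readout is used, but train and test
    contain disjoint sets of vehicles, so no usage pattern of a single
    vehicle can leak from training into testing."""
    vehicles = sorted({vehicle_id(r) for r in readouts})
    rng = random.Random(seed)
    rng.shuffle(vehicles)
    n_test = max(1, int(len(vehicles) * test_fraction))
    test_vehicles = set(vehicles[:n_test])
    train = [r for r in readouts if vehicle_id(r) not in test_vehicles]
    test = [r for r in readouts if vehicle_id(r) in test_vehicles]
    return train, test
```

The "one sample" variant would additionally keep only one positive and one negative readout per vehicle before splitting.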
[Figure 7: two panels of prediction accuracy (0.60–0.85) vs. prediction horizon (10–50 weeks) for the seven feature sets.]

Fig. 7. Prediction accuracy vs. prediction horizon. The left panel shows how the prediction accuracy decreases with PH when the training data set size is not limited. The right panel shows how the prediction accuracy decreases with PH when the training data set size is limited to 600 samples.

Wear Wear
Wear (AgeNorm) Wear (AgeNorm)
15 15
Usage Usage
Usage (AgeNorm) Usage (AgeNorm)
Profit [M€]

Profit [M€]
Beamsearch 1 Beamsearch 1
Beamsearch 2 Beamsearch 2
10
Expert 10 Expert

5
5

10 20 30 40 50 10 20 30 40 50

Prediction Horizon [Weeks] Prediction Horizon [Weeks]


Fig. 8. Profit vs. prediction horizon. The left panel shows how the profit increases with PH when the training data is not limited. The right panel shows how the profit
increases with PH when the training data is limited to 600 samples. The PH required to achieve 600 samples varies between the data sets, which explains the differences in
starting positions of individual lines.

Beam search set 1 and Beam search set 2; they each had 14 features (the limit set for the method). Three out of the four features selected by the expert were also found by the beam search.

The KS filter method was used four times, with different combinations of wear features, usage features, and age normalisation (the reader is referred to the Methods section for details). This gave four feature sets: Wear, Usage, Wear with age normalisation, and Usage with age normalisation.

Fig. 6 shows the results when using the seven different feature sets. The overly optimistic result from using randomly selected data sets is shown for pedagogical reasons, reiterating the importance of selecting independent data sets. The data sets were balanced in the experiments. The Usage features performed best and the Expert features were second (except when an erroneous method for selecting data was used).

As an example, the following 14 parameters are included in the Beam search 1 feature set:

BIX: Pumped air volume since last compressor change.
CFZ: Timestamp at latest error activation.
CHJ: Engine time at latest error activation (diff).
CUD: Max volume for air dryer cartridge.
KJ: Fuel consumed in Drive.
MT: Fuel consumed in PTO.
OA: Total distance in PTO (diff).
OF: Total time in coasting.
OL: Total time using pedal.
OQ: Fuel consumed in Econ mode (diff).
OR: Fuel consumed in Pedal mode (diff).
NDI: Number of times in idle mode (diff).
NDJ: Total time in idle mode Bumped (diff).
NDP: Total time in idle mode Parked (diff).

Here, (diff) denotes that the parameter has been differentiated and reflects the change since the previous readout.

5.4. Accuracy vs. Prediction Horizon

Two experiments were done to gauge how the Prediction Horizon (PH) affects the classification results. In the first experiment, all available readouts were used, while in the second the training set size was fixed at 600 samples. Balanced data sets were used throughout, which is why the number of available fault readouts was the limiting factor. As the PH increased, more samples could be used, since more readouts were available for inclusion in the fault data set.

Figs. 7 and 8 show the results of the two experiments. The accuracy (Fig. 7) is best at lower PH and decreases as the PH increases. This is probably due to the training labels being more reliable closer to the fault. Accuracy decreases somewhat less rapidly in the first experiment, with unlimited training data set size (left panel of Fig. 7). Fig. 8 shows the result when evaluated with the profit measure. Interestingly, from this point of view, the system performance improves with larger PH. This appears to be, at least partially, caused by a larger number of false negatives. In particular, the further away a data readout is from the compressor replacement, the fewer indications of problems it contains. Thus a classifier will consider them to be negative examples, but if it was trained on data with a sufficiently large prediction horizon, they
[Figure 9: two panels, F-score and Profit (M€) vs. SMOTE percentage (0–1000%), for PH values of 5, 15 and 25 weeks.]

Fig. 9. Evaluation of SMOTE percentage settings, using the Expert data set. The number of SMOTE neighbours is fixed to 20.
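The SMOTE procedure whose settings are evaluated in Figs. 9 and 10 can be sketched in a few lines. This is a simplified brute-force version (exhaustive neighbour search, tuples as feature vectors), not the implementation used in the experiments.

```python
import random

def smote(minority, percentage=900, k=5, seed=0):
    """Sketch of SMOTE (Chawla et al., 2002): for each minority
    example, create percentage/100 synthetic points placed uniformly
    at random on the segments towards its k nearest minority
    neighbours (Euclidean distance)."""
    rng = random.Random(seed)
    n_new = percentage // 100
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest neighbours within the minority class
        neighbours = sorted(
            (y for j, y in enumerate(minority) if j != i),
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, y)),
        )[:k]
        for _ in range(n_new):
            nb = rng.choice(neighbours)
            gap = rng.random()   # random position along the segment
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

With percentage=900 each fault readout yields nine synthetic ones, matching the best average setting found in the experiments.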

[Figure 10: two panels, F-score and Profit (M€) vs. number of SMOTE neighbours (5–20), for PH values of 5, 15 and 25 weeks.]

Fig. 10. Evaluation of the number of SMOTE neighbours (k) using the Expert data set and with the SMOTE% fixed at 900.

will be false negatives. A large number of them will lower the accuracy significantly, but will not affect the profit.

5.5. SMOTE

The SMOTE oversampling method depends on two parameters: the percentage of synthetic examples to create and the number of neighbours to consider when generating new examples.

Fig. 9 shows the F1-score (as defined in Eq. (2)) and the Profit (as defined in Eq. (8)) when the SMOTE percentage is varied but k is kept constant at 20, for three different values of the PH. All results improve significantly with the percentage of synthetic examples, all the way up to a 10-fold oversampling of synthetic examples. A lower PH is better from the F1-score perspective, but worse from the Profit perspective.

Fig. 10 shows the effect of varying k, the number of SMOTE neighbours, when the SMOTE percentage is kept fixed at 900%. The results are not very sensitive to k, although a weak increase in performance comes with higher k.

The four SMOTE preprocessing methods mentioned in Section 4.6 were evaluated using a PH of 25 weeks and a SMOTE percentage of 900% (the best average settings found). Nearly all feature sets benefitted from preprocessing, but there was no single best method.

5.6. Final evaluation

A final experiment was done, using the best settings found for each feature set, in order to evaluate the whole approach. The best SMOTE settings were determined by first keeping k fixed at 20 and finding the best SMOTE%. Then the SMOTE% was kept fixed at the best value, the k value was varied between 1 and 20, and the value that produced the best cross-validation Profit was kept. The best SMOTE preprocessing determined in the previous experiments was used for each feature set. The final best settings for each feature set are summarised in Table 1, together with the basic data for each feature set.

Table 1
The best settings for each of the feature sets (AN denotes age normalised). The total number of samples (second column) depends on the method used for selecting the data. The columns marked % and k show the SMOTE parameter settings, and the column labelled Prepr shows the SMOTE preprocessing method used. The Profit is evaluated for a PH of 15 weeks and the optimal training data set size (see Fig. 11 and the discussion in the text). The Profit (M€) depends on the test data set size, which in turn depends on the method used for selecting the data. The rightmost column, labelled nProfit, shows the per-vehicle Profit (in €), which is the Profit normalised with respect to the number of vehicles in the test sets.

Feature set     Samples  Features  %     k   Prepr  Profit  nProfit
Wear            10,660   20        700   14  TL     1.59    86
Wear AN         10,520   20        1000  12  ENN    0.62    22
Usage           12,440   20        1000  16  TL     1.94    114
Usage AN        12,440   20        1000  20  CNN    1.60    110
Beam search 1   14,500   14        800   20  NCL    1.66    116
Beam search 2   14,500   15        800   16  TL     0.75    54
Expert          14,960   4         900   20  ENN    0.84    64

The left panel of Fig. 11 shows how varying the training data set size affects the Profit. The PH was set to 15 weeks, which is a practical PH, even though many of the feature sets perform better at higher values of PH. From a business perspective, a PH of 30 weeks is considered too long, since it leads to premature warnings when the vehicle is likely to survive one more maintenance period. The ordering of the feature selection algorithms is mostly consistent; Usage is best, with the exception of very small data sizes, where it is beaten by Beam search 1.
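The Profit measure reported in Table 1 follows Eqs. (3) and (8) directly; a transcription, with the cost figures from Section 5.1:

```python
def profit(tp, fp, ecur=15_000, cpr=1_500, investment=150_000):
    """Eq. (3)/(8): each true positive saves the extra cost of an
    unplanned repair (ECUR), each false positive wastes a planned
    repair (CPR), and a one-off investment is subtracted. The default
    figures, in euros, are the estimates from Section 5.1."""
    return tp * ecur - fp * cpr - investment

# Break-even ratio between the two error types: one true positive
# pays for ECUR / CPR = 10 false positives.
BREAK_EVEN = 15_000 / 1_500
```

For example, with these figures ten true positives are needed just to recover the investment, before any false positives are counted.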
[Figure 11: left panel, Profit (M€) vs. training data set size; right panel, Profit (M€) vs. prediction horizon (10–50 weeks); both for the seven feature sets.]

Fig. 11. Final evaluation of all feature sets. The left panel shows how the Profit varies with the training data set size using a prediction horizon of 15 weeks. The right panel shows how the Profit changes with PH. The settings for each feature set are listed in Table 1.

sets Beam search 1 and Usage, with or without age normalisation,


are the best from the perspective of sensitivity and specificity. All
three are better than the Expert feature set. Profit is not uniquely
defined by specificity and sensitivity; it depends on the data set
size and the mix of positive and negative examples. However,
Profit increases from low values of specificity and sensitivity to
high values.

6. Conclusions

Transportation is a low margin business where unplanned stops


quickly turn profit to loss. A properly maintained vehicle reduces
the risk of failures and keeps the vehicle operating and generating
profit. Predictive maintenance introduces dynamic maintenance
recommendations which react to usage and signs of wear.
We have presented a data driven method for predicting
upcoming failures of the air compressor of a commercial vehicle.
The predictive model is derived from currently available warranty
and logged vehicle data. These data sources are in-production data
that are designed for and normally used for other purposes. This
imposes challenges which are presented, discussed and handled in
order to build predictive models. The research contribution is
twofold: a practical demonstration on these practical data, which
Fig. 12. Sensitivity and specificity for the classifiers based on each feature set using are of a type that is abundant in the vehicle industry, and the
the optimal settings in Table 1. Profit increases from lower left towards the techniques developed and tested to handle; feature selection with
upper right. inconsistent data sets, imbalanced and noisy class labels and
multiple examples per vehicle.
The left panel of Fig. 11 also shows an interesting phenomenon The method generalises to repairs of various vehicle compo-
where profit grows and then drops as the data set size increases. nents but it is evaluated on one component: the air compressor.
This is unexpected, and we are unable to explain it. It may be The air compressor is a challenge since a failing air compressor can
related, for example, to the k parameter of the SMOTE algorithm.

The right panel of Fig. 11 illustrates how varying the prediction horizon affects the Profit, using all available data for each feature set. In general, the longer the PH, the better the Profit. The relative ordering among the feature sets is quite consistent, which indicates that none of them focuses solely on patterns of wear. Such features would be expected to perform better at lower PH, when the wear is more prominent.

The performances listed in Table 1 are for one decision threshold. However, the classifiers can be made more or less restrictive in their decision to recommend a repair, which will produce different numbers of true positives, true negatives, false positives and false negatives. Fig. 12 shows the sensitivity–specificity relationship for each feature set (i.e. for each classifier using each feature set). The perfect classifier, which is certainly unachievable in this case, would have both sensitivity and specificity equal to one. It is clear from Fig. 12 that the feature [...]

[...] be due to many things and can be a secondary fault caused by other problems (e.g. oil leaks in the engine that cause coal deposits in the air pipes). Many fault modes are grouped into one label. Components with clearer or fewer fault causes should be easier to predict, given that the information needed to predict them is available in the data sources, and given that the fault progresses slowly enough. We have not tested the approach on other components, but plan to do so in the near future.

The best feature sets are the Beam search 1 and the Usage sets, with or without age normalisation. All three outperform the Expert feature set, which strengthens the arguments for using data-driven machine learning algorithms within this domain. There is an interesting difference between the Wear and Usage feature sets: in the latter, age normalisation has little effect, while in the former it removes a lot of the information. This indicates that important wear patterns are linked to age, which in turn is not particularly interesting, since age is easily
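The sensitivity–specificity trade-off behind curves such as those in Fig. 12 comes from sweeping the classifier's decision threshold. The following is a minimal sketch on synthetic scores and labels (not the paper's vehicle data), where a score at or above the threshold is read as "recommend a repair":

```python
# Sketch of a decision-threshold sweep. Scores and labels are synthetic
# illustrations; 1 = a repair was actually needed (positive class).

def sensitivity_specificity(scores, labels, threshold):
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true-positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true-negative rate
    return sensitivity, specificity

scores = [0.10, 0.30, 0.35, 0.60, 0.70, 0.80, 0.90, 0.95]  # classifier outputs
labels = [0,    0,    1,    0,    1,    1,    0,    1]

for t in (0.2, 0.5, 0.8):
    sens, spec = sensitivity_specificity(scores, labels, t)
    print(f"threshold={t:.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

Raising the threshold trades sensitivity for specificity, tracing out one point per threshold on a curve like those in Fig. 12.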
measured using mileage or engine hours. It is possible that trends due to wear are faster than what is detectable given the readout frequency; this could partly explain the low performance of the wear features.

All feature sets show a positive Profit in the final evaluation. However, this depends on the estimated costs for planned and unplanned repairs. There are large uncertainties in those numbers, and one must view the profits from that perspective. The investment cost can probably be neglected; the important factor is the cost ratio between unplanned and planned repairs.
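The importance of this cost ratio can be made concrete with a toy Profit calculation. All numbers below (the repair counts, the normalised costs, and the simple Profit model itself) are illustrative assumptions, not the estimates used in the paper:

```python
# Toy model: a true positive converts an unplanned repair into a cheaper
# planned one; a false positive triggers an unnecessary planned repair.
# All figures are illustrative assumptions, not the paper's cost estimates.

def profit(tp, fp, cost_unplanned, cost_planned):
    return tp * (cost_unplanned - cost_planned) - fp * cost_planned

tp, fp = 30, 10        # repairs correctly / falsely flagged by the classifier
cost_planned = 1.0     # planned-repair cost, used as the unit of cost

for ratio in (1.5, 3.0, 6.0):  # unplanned repair costs `ratio` times more
    print(f"cost ratio {ratio}: Profit = {profit(tp, fp, ratio, cost_planned):.1f}")
```

With these numbers the Profit grows quickly with the cost ratio, while halving the ratio erases most of the gain, which is why the uncertainty in the cost estimates matters so much for the final evaluation.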
Acknowledgements

The authors thank Vinnova (Swedish Governmental Agency for Innovation Systems), AB Volvo, Halmstad University, and the Swedish Knowledge Foundation for financial support for doing this research.