Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 200 (2022) 1014–1023
www.elsevier.com/locate/procedia

3rd International Conference on Industry 4.0 and Smart Manufacturing

An I4.0 data intensive platform suitable for the deployment of machine learning models: a predictive maintenance service case study

Ricardo Dintén Herrero*, Marta Zorrilla

Grupo de Ingeniería Software y Tiempo Real (Universidad de Cantabria), Avda. Los Castros, 48, Santander 39005, España

Abstract
Artificial Intelligence is one of the key enablers of Industry 4.0. The building of learning models, as well as their deployment in environments where the rate of data generation is high and their analysis must meet real-time requirements, leads to the need of selecting a big data platform suitable for this purpose. The heterogeneous and distributed nature of I4.0 environments, where data becomes highly relevant, requires the use of a data-centric, distributed and scalable platform on which the different applications are deployed as services. In this paper we present an I4.0 digital platform based on the RAI4.0 reference architecture, on which a predictive maintenance service has been built and deployed in the Amazon Web Services cloud. Different strategies to build the predictor are described, as well as the stages carried out for its construction. Finally, the predictor built with the k-nearest neighbours algorithm is chosen because it is the fastest in producing an answer and its accuracy of 99.87% is quite close to that of the best model for our case study.
© 2022 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of the 3rd International Conference on Industry 4.0 and Smart Manufacturing.
Keywords: Big data platform; Data Stream Mining; Predictive Maintenance
1. Introduction
Throughout history, technological development has had an important impact on industrial systems: first, with the steam engine and the mechanisation of processes; next, with mass production, automation and robotics; and currently, with what is known as Industry 4.0 (I4.0), which aims not only to digitise processes but also to optimise, control and automate them by analysing the huge amount of data available in the environment. In an I4.0 scenario, cyber-physical systems are capable of communicating with each other, receiving and transmitting information and executing actions, which allows the implementation of productive processes endowed with intelligence and the achievement of significant improvements in operational efficiency and organisational performance.
The application of intelligent techniques in industrial processes is not new, but it must now be adapted to a new context in which the volume of data to be handled is huge, data are produced continuously (data streams) and must be analysed in real time, which makes it necessary to have a distributed and scalable digital platform that adapts to the computation and storage needs. Use cases are found in different areas of organisations, such as the production chain, logistics and maintenance, among others [12].
Currently, the building of data intensive applications is capturing the attention of industry because it is a field in which a multitude of sensors, actuators and cyber-physical systems, among others, constantly generate data, and both their analysis and the actions derived from this analysis (reactive applications) must be performed in real time. The existence of different data streams that must be processed together or independently leads to the design of a data-centric architecture that is easily scalable to dynamically meet the demand for storage and computational resources. Furthermore, this architecture must also be flexible so that resources can be hired on the edge, fog or cloud depending on the time and cost requirements and the latencies that can be assumed (hybrid deployment environment). Likewise, the vertical and horizontal integration promulgated by I4.0 involves collaboration with third parties; therefore the development of applications under a service-oriented architecture offers the best guarantees of isolation and security in these complex scenarios.
In this work, we describe an instantiation of a digital platform that fulfills the requirements of Industry 4.0 as well
as the detailed steps to build a predictor for failure diagnosis of a water pump to be deployed in the cloud. To this
end, we implemented a big data platform based on the RAI4.0 reference architecture [7]. Both the configuration of the platform and the process of building the predictor are the contributions of this paper.
This paper is organised as follows: Section 2 introduces the data mining field and explains the concepts, methods and techniques necessary to understand the predictor building process. Section 3 describes the platform deployed in order to support the predictive maintenance service. Section 4 addresses the data mining process followed to build the predictor. Section 5 reviews other research works in the fields of predictive maintenance and stream data mining. Finally, Section 6 draws the conclusions of the paper and outlines the next steps in our research.
Data mining (DM) is a discipline from the statistics and computation fields whose objective is to discover unknown patterns in large data repositories. DM is an integral part of Knowledge Discovery in Databases (KDD), which is the overall process of converting raw data into useful information. To achieve this goal, it relies on machine learning, artificial intelligence and database technologies.
DM techniques are generally divided into two major categories, predictive and descriptive tasks, the former being the one that arouses most interest. The prediction task consists of extracting relevant features from labeled training data to build a model that discriminates between classes in order to classify unlabeled observed objects. Prediction methods are, in turn, divided into classification and regression techniques. Classification techniques are used when the predicted variable is a categorical value (e.g. the actuator should be on or off according to the value read from plant sensors) and regression techniques when the predicted variable is a continuous value or a probability density function (e.g. predicting the rain probability based on the value of some meteorological parameters such as pressure, temperature and humidity).
On the other hand, descriptive tasks aim at extracting patterns or rules, or at grouping the instances of the data set, using unsupervised learning techniques, that is, the training process does not require a target variable. These, in turn, are classified into clustering techniques, which consist in identifying and dividing the data instances of a data set into groups with similar features, and association techniques, which consist in identifying relationships between different variables in order to discover rules or patterns that help to understand the causes of an event, e.g. rules that explain failures in a production chain.
Within each of these categories there are a multitude of algorithms implemented according to different approaches. Since our goal is to build a classifier, we describe below one algorithm chosen under each approach in order to compare their performance in our case study.
• Random Forest Classifier. This classifier consists in building a large number of decision trees that work as an ensemble. The algorithm establishes the outcome by combining the outputs of the individual decision trees (majority voting in classification, averaging in regression). In this way, each tree is capable of representing a different set of features and overfitting is avoided. Increasing the number of trees improves the precision of the outcome.
• Support Vector Classifier. This kind of classifier is based on finding a set of hyperplanes of dimension N−1, where N is the number of features the model receives, so as to obtain the best separation between the different classes in the data set. For this purpose, the algorithm represents each data instance as a point in an N-dimensional space and computes distances between points in order to find the hyperplane that maximises the margin, i.e. the distance to the closest instances of each target class.
• KNeighbors Classifier. This is a non-parametric classification method based on the k nearest neighbours of an instance. The output (predicted class) for an unknown instance is the class of its closest training instance if k=1; otherwise, the most common class among its k nearest neighbours is selected. This algorithm requires normalising the input variables because it is based on distance computation; the standard Euclidean distance is the most frequently used.
• Artificial Neural Networks. Artificial neural networks are a type of computational model inspired by the neural networks present in animal brains. They are comprised of a set of artificial neurons connected to each other by links known as edges. Each neuron receives a value (signal), processes it and can signal the neurons connected to it. The "signal" at a connection is a real number, and the output of each neuron is computed by some linear or non-linear function of the sum of its inputs weighted by the edges. Generally, neurons are aggregated into layers. The learning (training) is achieved by a correct adjustment of the weights of each connection.
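To make the comparison concrete, the following minimal sketch (not the authors' exact code) shows how the four classifier families above can be instantiated with scikit-learn; every parameter value is illustrative, and scikit-learn's MLPClassifier stands in for the neural network merely to keep the sketch self-contained, whereas the experiments reported later rely on sklearn and keras.

# Illustrative instantiation of the four classifier families compared in this work.
# Parameter values are placeholders, not the settings used by the authors.
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

candidates = {
    # Ensemble of decision trees combined by majority voting.
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Maximum-margin hyperplane classifier (here with an RBF kernel).
    "svc": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
    # Distance-based classifier; inputs are normalised first.
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    # Multilayer perceptron; shown with scikit-learn instead of keras for brevity.
    "mlp": make_pipeline(StandardScaler(),
                         MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)),
}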
The training and evaluation stages are essential for the building of the classifier. Cross validation [4] is one of the most robust and reliable training and evaluation techniques. It splits the data sample into a certain number of subsamples, taking one of the subsamples as the validation set and the rest as the training set. The process is repeated until each of the subsamples has been used as the validation set and, finally, the results are averaged. 10-fold cross validation is generally used. Leave-one-out is a variation of this technique that consists in splitting the sample into as many subsamples as instances are available in the original sample. It is only interesting when dealing with small data sets, as it requires a lot of computation time.
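As a brief illustration of how such a validation could be run, the sketch below applies 10-fold cross validation and leave-one-out with scikit-learn; the synthetic data and the k-nearest neighbours classifier are placeholders rather than the data and models of the case study.

# Sketch of 10-fold cross validation and leave-one-out with scikit-learn.
# The synthetic data set and the classifier are placeholders only.
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5)

# 10-fold CV: each fold acts once as the validation set and the scores are averaged.
scores = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
print("10-fold accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Leave-one-out: one fold per instance; only practical for small data sets.
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("leave-one-out accuracy: %.3f" % loo_scores.mean())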
Once the validation technique is chosen, the metrics for evaluating the classifier must be selected. The most common are accuracy, F-score, sensitivity and specificity. These are defined from the confusion matrix or error matrix (see Table 1), a table describing the performance of a two-class supervised model on test data:
Table 1. Confusion matrix.

            Predicted +    Predicted -
Real +      TP             FN
Real -      FP             TN

where:

True Positives (TP): number of positive cases that the model predicts correctly.
True Negatives (TN): number of negative cases that the model predicts correctly.
False Positives (FP): number of negative cases that the model incorrectly predicts as positive.
False Negatives (FN): number of positive cases that the model incorrectly predicts as negative.
• Accuracy: this metric represents the percentage of predictions correctly made by the model. It is a good metric when the different classes are balanced in the data set. Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Sensitivity or TPR: this represents the rate of positive cases predicted correctly by the model among the total number of positive cases. Sensitivity = TP / (TP + FN) = Recall
• Specificity or TNR: this represents the rate of negative cases predicted correctly by the model among the total number of negative cases. Specificity = TN / (TN + FP)
• False alarm rate or false positive rate (FAR): rate of instances incorrectly classified as positive among the total number of negative cases. FAR = FP / (FP + TN)
• Precision: represents the rate of true positives among the total number of positives retrieved by the model. Precision = TP / (TP + FP)
• F-score: a metric that combines (as the harmonic mean) precision and recall to compare classifiers. F-score = TP / (TP + ½(FP + FN))
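As a quick reference, the sketch below computes all of the above metrics from the four confusion-matrix counts; the counts passed in the example call are invented purely for illustration.

# The evaluation metrics above, computed from the confusion-matrix counts.
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)            # TPR / recall
    specificity = tn / (tn + fp)            # TNR
    far = fp / (fp + tn)                    # false alarm rate (1 - specificity)
    precision = tp / (tp + fp)
    f_score = tp / (tp + 0.5 * (fp + fn))   # harmonic mean of precision and recall
    return {"accuracy": accuracy, "TPR": sensitivity, "TNR": specificity,
            "FAR": far, "precision": precision, "F-score": f_score}

# Example with made-up counts.
print(classification_metrics(tp=95, tn=880, fp=10, fn=15))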
Finally, since the selection of the parameters that yield the best configuration of each algorithm is time-consuming, techniques like grid search are used. Grid search consists in defining a map with a list of possible values for each of the parameters of the model and performing the training phase with every possible combination of parameters.
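A minimal sketch of such a search with scikit-learn's GridSearchCV is shown below; the parameter map and the k-nearest neighbours estimator are illustrative, not the configuration actually explored in this work.

# Sketch of a grid search: a map of candidate values per parameter is defined and
# every combination is trained and validated. Values here are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

param_grid = {
    "n_neighbors": [1, 3, 5, 7, 11],
    "weights": ["uniform", "distance"],
    "p": [1, 2],                       # Manhattan vs. Euclidean distance
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)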
CRISP-DM is a process model developed by the CRISP-DM consortium [1]. Nowadays it is considered a de facto standard. It divides the data mining process into six stages:
• Business Understanding: in this stage the main goal is analysing risks and opportunities of the project in order
to determine if the project is profitable. As a result of this stage a list of goals, Key Performance Indicators
(KPI) and expected outcomes should be established.
• Data Understanding: in this stage data is analysed to determine if the available data has the quality and quantity required to develop the project. Data should be inspected through visual analysis, searching for outliers, correlations, etc. If the data is not sufficient, the project should be cancelled or the scope replanned to adapt the goals to the possibilities that the data offers.
• Data preparation: this stage consists in preparing data so it can be processed by the mining algorithms. This stage
also includes correcting the issues detected on the data understanding phase such as fixing inconsistencies in
the categorical variables, deleting outliers, managing null values, performing data normalization, among others.
In addition, the process of feature creation and selection is carried out. This last task has a double objective: to
eliminate noise and to use only data that provides value as well as to reduce the necessary computation resources
and the model processing time.
• Modeling: In this phase, data miners choose the modeling tools and algorithms that best fit the problem. There are many alternatives, from algorithms that produce easily interpretable results, usually in the form of rules, to those that work as a black box but sometimes have superior performance. It is convenient to train and validate models built with different strategies and techniques and to assess their performance. This phase is repeated several times until a satisfactory model is achieved.
• Evaluation: In this phase, the results are evaluated, not from a technical point of view but from a business point of view. That means checking if the objectives established in the first stage have been met. If this is not the case, the project should be reviewed and moved back to the first phase.
• Deployment: In this phase, all the necessary components must be deployed to put the developed solution into
production. Other actions that are frequently carried out in this stage are the preparation of reports, or the
deployment of a monitoring system to know how the model performance evolves, among others.
In order to deploy a predictive maintenance system in an I4.0 environment, it is necessary to ensure that the digital platform is capable of managing and processing data streams and answering in real time. Therefore, the platform was designed according to the kappa architecture proposed by Wingerath et al. [14] and implemented based on the RAI4.0 reference architecture [7], with the aim of following a proven template to build the solution.
This reference architecture aims to delimit and establish the strategies with which the digital platform is organized,
and to qualify and relate the information that is managed, the tasks that process this information, the hardware and
software resources that support its transfer, storage and processing as well as the monitoring agents that allow the
configuration and management of the system as a whole. This is designed to meet the following Industry 4.0 require-
ments: i) The architecture must be operated in a reactive and decentralized mode in order to handle data streams that
are generated in the environment. ii) The distribution, heterogeneity and scalability of the computational resources
required to meet the functional and non-functional requirements of the applications deployed in the environment are
conceived as a set of services with the aim of keeping a unique strategy for its management. iii) The monitoring of
the environment is an essential task since its maintenance and management rely on adaptive and dynamic strategies
defined from the levels of use of the resources and the state of operation.
The detailed description of RAI4.0 and the justification of its suitability for I4.0 are given in [7]. Likewise, [7] gathers the comparison of this architecture with others, such as the one proposed by Perez-Palacín et al. (2019) [8]. In this paper, for space reasons, we only describe the instantiation of a digital platform based on the RAI4.0 architecture to support AI models and report the hardware resources hired to deploy the predictive maintenance service built and presented here as a case study. The digital platform comprises four modules (see Fig. 1): data sources, data bus, processing layer and presentation layer.
The data sources module gathers the elements that generate (sensors, ...) or retrieve (actuators, apps, ...) the records that are published on the data bus, a distributed and replicated system that provides data sharing. In our case study, this module hosts the 52 sensors installed on the water pump.

The data bus allows the deployed platform services to use, process, analyse and exchange information. This requires a distribution service that provides secure access to the shared global information and a communication service that registers data in the digital platform and sends it among the agents of the environment following a publisher/subscriber strategy. Zookeeper was chosen as the coordination service and Kafka as the queue manager. Kafka organises data in topics. These represent flows of instances that describe the same type of information (for instance, a topic for each sensor) and are managed with the same criteria of persistence, durability, availability, security, etc. Topics are written by the producer once and read several times by all the consumers subscribed to the topic.
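As an illustration of this publisher/subscriber organisation, the sketch below uses the kafka-python client to publish a reading on a per-sensor topic and to subscribe to several topics from a consumer; the broker address and topic names are assumptions made for the example, not the actual configuration of the platform.

# Sketch of publish/subscribe over Kafka topics with the kafka-python client.
# Broker address and topic names are hypothetical.
import json
from kafka import KafkaProducer, KafkaConsumer

BROKER = "kafka-broker:9092"   # assumed address of a data-bus node

# A data source publishes a record on the topic devoted to its sensor.
producer = KafkaProducer(bootstrap_servers=BROKER,
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))
producer.send("sensor_00", {"timestamp": "2022-01-01T00:00:00", "value": 2.47})
producer.flush()

# A platform service subscribes to one or more sensor topics and consumes the stream.
consumer = KafkaConsumer("sensor_00", "sensor_01",
                         bootstrap_servers=BROKER,
                         value_deserializer=lambda m: json.loads(m.decode("utf-8")),
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.topic, message.value)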
The processing layer is responsible for scheduling the execution of the processing tasks on the computation nodes. In our case study this module hosts two services. The former, implemented by means of Spark, reads topics from Kafka, calls the water pump predictor and publishes the prediction on Kafka again. The latter implements a web service that publishes the predictions through a web socket so that the presentation module can display data in real time. This web service was developed with the Spring framework.
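The following Spark Structured Streaming sketch illustrates the first of these services: it reads sensor records from Kafka, applies a pre-trained predictor and publishes the result back on another topic. It is not the authors' code; the topic names, the JSON layout of the messages, the model path and the use of a scikit-learn model inside a pandas UDF are assumptions made only to show the shape of such a job.

# Sketch of a Spark Structured Streaming job: read from Kafka, predict, write back.
# Topic names, message schema and model path are assumptions for illustration.
import joblib
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("pump-predictor").getOrCreate()
model = joblib.load("/models/knn_pump.pkl")               # hypothetical model artefact
feature_cols = [f"sensor_{i:02d}" for i in range(52)]

@F.pandas_udf(StringType())
def predict(batch: pd.DataFrame) -> pd.Series:
    # Apply the pre-trained classifier to a micro-batch of sensor readings.
    return pd.Series(model.predict(batch[feature_cols]))

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "kafka-broker:9092")
       .option("subscribe", "pump_sensors")
       .load())

# Each Kafka value is assumed to be a JSON object carrying the 52 sensor readings.
schema = ", ".join(f"{c} DOUBLE" for c in feature_cols)
readings = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
               .select("r.*"))

predictions = readings.withColumn("status", predict(F.struct(*feature_cols)))

query = (predictions
         .select(F.to_json(F.struct(*predictions.columns)).alias("value"))
         .writeStream.format("kafka")
         .option("kafka.bootstrap.servers", "kafka-broker:9092")
         .option("topic", "pump_predictions")
         .option("checkpointLocation", "/tmp/pump-predictor-checkpoint")
         .start())
query.awaitTermination()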
Lastly, the presentation layer gathers a GUI developed using the Angular framework. It has been designed ad hoc with the aim of monitoring in real time both the data from the sensors and the predictions given by the system.

The architecture adapts to any configuration of hardware resources, that is, it allows combining on-premise nodes with others hired in the cloud. In our case study, five machines were provisioned in AWS EC2, two of them dedicated to Zookeeper and Kafka to manage the data bus and the remaining ones to perform the distributed predictions with Spark. The GUI was hosted on a separate on-premise server.
The purpose of this section is to explain the methodological process to be followed in the construction of data
mining models according to the CRISP-DM standard through its application to a real time case study. For this purpose,
a public Kaggle data set available at [11] is used.
The data set employed for this task comprises a set of measurements taken by the 52 sensors installed in a water pump responsible for supplying water to a small village. The owner of the system published the data with the purpose of being able to understand or detect failures as quickly and as accurately as possible, so that he could improve the service given to the village population. The previous year there were 7 system failures that caused severe problems to the families living there. We did not have access to the real machine, so in order to simulate the behaviour of the system we streamed a sample of the data set at a constant rate, as sketched below.
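The sketch below shows the kind of replay tool we refer to: it reads a sample of the Kaggle pump data set and publishes one record per second on the data bus. The file name, column layout, topic, broker address and the one-record-per-second rate are assumptions made for the example rather than the exact settings used.

# Sketch of the replay idea: stream a sample of the pump data set onto the data
# bus at a constant rate to simulate the real machine. File name, topic, broker
# and rate are illustrative assumptions.
import json
import time
import pandas as pd
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="kafka-broker:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))

sample = pd.read_csv("sensor.csv")        # assumed export of the Kaggle data set [11]
sensor_cols = [c for c in sample.columns if c.startswith("sensor_")]

for _, row in sample.iterrows():
    record = {c: (None if pd.isna(row[c]) else float(row[c])) for c in sensor_cols}
    record["timestamp"] = str(row["timestamp"])
    producer.send("pump_sensors", record)
    producer.flush()
    time.sleep(1.0)                       # constant generation rate (one record per second)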
Next, we describe the steps carried out according to the model process.
• Business Understanding: The business goal is to know when the water pump may fail, with the aim of avoiding the serious living problems that water outages involve. That means developing a predictive model capable of processing data in real time and determining whether the water pump is operating normally or whether, on the contrary, it has a malfunction that could lead to a pump failure. The model should have an accuracy higher than 75% and a response time of less than 100 milliseconds. Lastly, as a more ambitious goal, the model should be able to predict and notify failures at least 4 hours in advance.

Fig. 1. Instantiation of the digital platform based on the RAI4.0 architecture for our case study. Both the console to control the parallelisation of the prediction process and its visualisation in the client application are zoomed in the images above and below, respectively.
• Data understanding: The data set contains 220,320 records, made up of 52 numeric variables corresponding to the different sensors and the target variable, which gathers the machine status. The latter is a categorical variable which takes 3 possible values: NORMAL, BROKEN and RECOVERING. The values received from the sensors are supposed to lie in a fixed range which should be stable and not vary over time; thus, we consider that the data set is not affected by concept drift (the values changing over time in unforeseen ways). Next, several preprocessing tasks were performed with the aim of selecting the most significant variables. For this purpose, four strategies were programmed (sketched after this list): i) using all variables; ii) using only variables with a correlation coefficient greater than 0.7 with respect to the target variable; iii) applying principal component analysis; and iv) using a k-best feature selection algorithm provided by the sklearn package with chi-square as the evaluation metric. Previously, as a consequence of the low number of instances with the states BROKEN and RECOVERING, both were considered as MALFUNCTION. In this way, the predictor could generalise better and become more useful.
• Modeling: Next, four predictive models capable of detecting water pump failures were built. We used the following algorithms: random forest, support vector machine, k-nearest neighbours and a multilayer perceptron (see Sect. 2). The sklearn and keras software libraries were used for this purpose. The parameter setting was performed by means of a grid search method. The preprocessing pipelines and the models were built following traditional offline techniques, i.e., by sampling data and training the models from a historical data set. Each of the mentioned algorithms was combined with the four preprocessing pipelines proposed in the data understanding phase. In order to validate each of the models generated during the training phase, a 10-fold cross validation method [3] was used. The 16 models built and the metrics used for their evaluation (accuracy, TPR and FAR) are shown in Table 2. These metrics were selected because accuracy provides the best perspective on how well a model is performing, but on its own it can be misleading when the data set classes are not balanced, so two further performance measures were added: TPR (True Positive Rate) lets us know how well the model performs when it detects a failure, and FAR (False Alarm Rate or false positive rate) tells us how many times the model triggers an alarm without any damage existing, which helps to prevent unnecessary downtime and revisions.

Additionally, an ensemble-type estimator [6] was generated with the three best models in an attempt to improve the generalisation capability of the model. This estimator takes the prediction of each of the models that comprise it and selects the final answer by means of a voting strategy. The results obtained by evaluating the models are shown in Table 3.

Next, we selected as the best models (marked in bold in Table 2) those with the highest reliability. However, another important aspect to take into account is the response time of the predictor, since it will be deployed in a data stream environment. Therefore, the response time has to be measured to check whether the model is able to generate predictions at the data generation rate. For this purpose, the time taken by each of the models to make different predictions was measured and the average time and its standard deviation were calculated. The results are gathered in Table 3. Finally, the k-nearest neighbours model was used for the deployment, as it is the fastest of the four alternatives and has a performance similar to that of the other three models.
• Evaluation: Once the model is selected, it is evaluated in relation to the business objectives. The accuracy achieved satisfies the initial business objectives. However, the model is not yet able to predict failures in advance; this task will be addressed as future work.
• Deployment: Finally, the predictor built was deployed. As previously mentioned, five machines were provisioned in AWS EC2. The predictor was hosted and managed as part of the data processing infrastructure, embedded in a Spark task in order to reduce latencies and achieve faster response times. Later, a monitoring service was developed and deployed.
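The sketch referred to in the data understanding step above outlines the four variable-selection strategies. The 0.7 correlation threshold and the chi-square k-best selector come from the text; the number of principal components, the value of k and the assumption that the target is numerically encoded (e.g. 0 for NORMAL, 1 for MALFUNCTION) are illustrative choices.

# Sketch of the four preprocessing/feature-selection strategies of the data
# understanding step. Thresholds taken from the text; k, the number of PCA
# components and the numeric target encoding are illustrative assumptions.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

def strategy_all(df, sensor_cols):
    # i) use all 52 sensor variables as they are
    return df[sensor_cols]

def strategy_correlation(df, sensor_cols, target, threshold=0.7):
    # ii) keep only variables whose absolute correlation with the (numerically
    # encoded) target exceeds the threshold
    corr = df[sensor_cols].corrwith(target).abs()
    return df[corr[corr > threshold].index]

def strategy_pca(df, sensor_cols, n_components=10):
    # iii) project the sensor readings onto their principal components
    return pd.DataFrame(PCA(n_components=n_components).fit_transform(df[sensor_cols]))

def strategy_kbest(df, sensor_cols, target, k=20):
    # iv) k-best selection using the chi-square score (requires non-negative inputs)
    X = MinMaxScaler().fit_transform(df[sensor_cols])
    selector = SelectKBest(score_func=chi2, k=k).fit(X, target)
    return pd.DataFrame(selector.transform(X))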
5. Related Work
Predictive maintenance is one of the areas that has attracted great attention from the research community with
the aim of reducing failures or abnormal machine operations and, consequently, costs. In this section, we summarize
works found in the literature focused on this topic.
Li et al. (2014) [5] developed a predictive maintenance system aimed at preventing the derailment of trains by detecting bad trucks or wheels. They had a data set of 1.5 TB with more than 500 variables, so they performed a Principal Component Analysis to select the most significant features. They built an SVM with different kinds of kernel functions and relied on TPR and FAR to evaluate the model; however, they prioritised FAR due to the high cost of stopping a train. Finally, they reached the goal of FAR < 0.014%, but with a TPR of 38.542%, which is quite low. They also achieved models with higher TPR values, but at the cost of increasing FAR. As a consequence of the fact that SVM is a black-box mining model, the authors built a decision tree to obtain interpretable rules from the features used.
Praveenkumar et al. (2014) [9] developed a system capable of diagnosing failures in a gearbox. They mounted a
gearbox with 4 different gears and some accelerometers installed near the input and output shafts. Then, they took
measurements in different scenarios varying the speed, torque and gear ratio both with healthy and damaged gears.
After that, they labeled the records and made some preliminary analysis of the statistical properties inspecting the
mean, variance, kurtosis and skewness of the features. They trained an SVM that reached over 90% accuracy with these features. The experiment was entirely carried out in laboratory conditions, where the vibrations coming from the road, the engine or bumps do not exist, something that could affect the performance of the predictor.
Prytz et al. (2015) [10] developed a system aimed at predicting the remaining lifetime of the compressors installed
on Volvo trucks. The main goal of the system was to predict if the compressors would last until the next maintenance
or if they should be replaced before to avoid unexpected failures. They faced an unbalanced data set, so they had to
take it into account in the process of algorithm selection, evaluation and comparison of results. They used random
forest with different input features. They tried different prediction horizons to check how those affected the model
performance. In order to evaluate the system they developed a profit measure combining statistical metrics of the
model and the cost associated to solve unexpected failures and time out of service. Unlike our model, they were
capable of predicting failures in advance, but they did not evaluate the system in real time.
Canizo et al. (2017) [2] developed a system aimed at detecting failures in a set of wind turbines based on a set of parameters received every 10 minutes. They used a big data platform comprised of Zookeeper, Kafka, HDFS, Spark and Mesos to perform the predictions in real time, which is similar to ours. They had three kinds of wind turbines with a different number of parameters available, so they picked the features that were common to all the machines and performed PCA to reduce the dimensionality while maximising the explained variance. They tuned the random forest model employed until they found the best values for both the depth and the number of trees. The wind turbine models available had a different number of classes for the target variable: 2, 3 and 5 values. They obtained a mean accuracy of 82%, the highest values being reached in the models with a lower number of classes for the target variable. They built a system capable of performing the training offline and the detection online. However, they did not consider the time required for making a prediction.
Most related works address the prediction of anomalies or the detection of failures using data mining and machine learning techniques, but they do not deal with the difficulties of deploying the model in a big data streaming environment. Canizo et al. did present a platform and claimed it to be up to 100 times faster than traditional technologies, but they did not analyse the response time of the system to verify it. Moreover, most papers built predictors based on only one technique, which is not a good practice in the interest of finding the best predictor. It is well known that the data mining process is an art. The art-based approach is one where you understand a problem more fully over time
by progressively working on alternative solutions, which leads to a greater understanding of the problem itself. In short, the related works can guide us on different strategies to deal with a problem, but the solution is not generalizable; it strongly depends on the intended objective, the available data and the expertise of the data miners. Our contribution is therefore oriented to the architecture and the systematization of the process.
In this new era where data is the fuel of the 21st century, the use of artificial intelligence and data mining models in industrial processes is a necessity to advance their optimisation, control and automation. The huge volume of data to be managed and analysed with the aim of having an answer in real time brings with it the need for a data-centric, distributed, scalable digital platform that enables, in a simple way, the ingestion and integration of new information as well as of processing tasks, compatible with the diversity of legacy technologies found in I4.0.
In this paper we present an instantiation of a big data platform based on the RAI4.0 architecture. This allows the
publication of data on a shared bus to which different processes subscribe. These processes (workloads) can generate
in turn new data to be shared and reused. The case study allows us to validate that this architecture is suitable for
building and deploying data mining models based on data streams. Furthermore, it guarantees us the development of
a modular, extensible and distributed solution that meets the requirements of I4.0 environments with intensive data
stream generation.
Regarding the data mining process, we can point out the following considerations. CRISP-DM is a well-defined process model and following it is considered good practice to systematize the discovery process. However, it is not focused on data streaming, which is why the following issues must be taken into account:
• First, we point out that it is very important to check whether the data source is affected by concept drift [13] during the business understanding and data understanding stages, that is, whether the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This feature will dictate the type of modelling algorithms that should be used and how data should be sampled.
• Secondly, due to its huge volume, it is impossible to handle the whole data set, so a representative but man-
ageable sample has to be selected. In some cases, the use of a single sample will not be enough, so ensemble
methods may be useful to build models trained on different fractions of the original data set.
• During the modelling phase, time and memory efficient algorithms should be selected to cope with the volume
and speed at which data arrives. Furthermore, if possible, distributed versions of these algorithms should be
selected so that the construction of the predictor could be scaled horizontally.
• Lastly, evaluation and deployment phases should be performed at the same time to check if business goals such
as time constraints are reached.
A limitation of our classifier is that it diagnoses the water pump failure when it happens, but the diagnosis is not available some time in advance. This will be addressed by using time series cross-validation in such a way as to ensure that training sets only contain observations that occurred prior to those in the validation sets. In addition, other time series mining techniques such as LSTM will be tested.
Finally, we would like to add that it is essential to provide architectural proposals that orchestrate all the elements that comprise an AI digital platform and provide manufacturers with methodological tools to address their transition towards Industry 4.0. This is also one of the goals pursued in this publication.
Our ongoing work focuses on evolving our reference architecture to support more big data technologies as well as data mining workflows. In addition, we are working on data mining projects in collaboration with a logistics company interested in data stream mining, so we can test the platform in a real environment with the support of a company that provides us with data and feedback about how well it performs.
As near future tasks to carry out we point out: i) implementing synchronisation mechanisms to orchestrate sensor
data published in order to build and update the predictor in real time; ii) applying other strategies to build classifiers
that predict what will happen at different time horizons (short, medium and long term).
Acknowledgements
This work has been funded in part by the Spanish Government and FEDER funds (AEI/FEDER, UE) under grant
TIN2017-86520-C3-3-R (PRECON-I4).
References
[1] Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F., 2010. Guide to Intelligent Data Analysis: How to Intelligently Make Sense of Real Data.
1st ed., Springer Publishing Company, Incorporated.
[2] Canizo, M., Onieva, E., Conde, A., Charramendieta, S., Trujillo, S., 2017. Real-time predictive maintenance for wind turbines using big data
frameworks, in: 2017 IEEE International Conference on Prognostics and Health Management (ICPHM), pp. 70–77. doi:10.1109/ICPHM.
2017.7998308.
[3] Kohavi, R., 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection, in: Proceedings of the 14th Interna-
tional Joint Conference on Artificial Intelligence - Volume 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. p. 1137–1143.
[4] Kohavi, R., 2001. A study of cross-validation and bootstrap for accuracy estimation and model selection 14.
[5] Li, H., Parikh, D., He, Q., Qian, B., Li, Z., Fang, D., Hampapur, A., 2014. Improving rail network velocity: A machine learning approach
to predictive maintenance. Transportation Research Part C: Emerging Technologies 45, 17 – 26. URL: http://www.sciencedirect.
com/science/article/pii/S0968090X14001107, doi:https://doi.org/10.1016/j.trc.2014.04.013. advances in Computing
and Communications and their Impact on Transportation Science and Technologies.
[6] Marcello, B., Davide, C., Marco, F., Roberto, G., Leonardo, M., Luca, P., 2020. An ensemble-learning model for failure rate prediction. Pro-
cedia Manufacturing 42, 41–48. URL: https://www.sciencedirect.com/science/article/pii/S2351978920305618, doi:https:
//doi.org/10.1016/j.promfg.2020.02.022. international Conference on Industry 4.0 and Smart Manufacturing (ISM 2019).
[7] Martínez, P.L., Dintén, R., Drake, J.M., Zorrilla, M., 2021. A big data-centric architecture metamodel for industry 4.0. Future Generation
Computer Systems URL: https://www.sciencedirect.com/science/article/pii/S0167739X21002156, doi:https://doi.org/
10.1016/j.future.2021.06.020.
[8] Perez-Palacin, D., Merseguer, J., Requeno, J.I., Guerriero, M., Di Nitto, E., Tamburri, D.A., 2019. A uml profile for the design, quality
assessment and deployment of data-intensive applications. Software and Systems Modeling 18, 3577–3614. URL: https://doi.org/10.
1007/s10270-019-00730-3, doi:10.1007/s10270-019-00730-3.
[9] Praveenkumar, T., Saimurugan, M., Krishnakumar, P., Ramachandran, K., 2014. Fault diagnosis of automobile gearbox based on ma-
chine learning techniques. Procedia Engineering 97, 2092 – 2098. URL: http://www.sciencedirect.com/science/article/pii/
S187770581403522X, doi:https://doi.org/10.1016/j.proeng.2014.12.452. ”12th Global Congress on Manufacturing and Man-
agement” GCMM - 2014.
[10] Prytz, R., Nowaczyk, S., Rögnvaldsson, T., Byttner, S., 2015. Predicting the need for vehicle compressor repairs using maintenance records and
logged vehicle data. Engineering Applications of Artificial Intelligence 41, 139 – 150. URL: http://www.sciencedirect.com/science/
article/pii/S0952197615000391, doi:https://doi.org/10.1016/j.engappai.2015.02.009.
[11] UnknownClass, 2019. pump sensor data — kaggle. https://www.kaggle.com/nphantawee/pump-sensor-data.
[12] Usuga Cadavid, J.P., Lamouri, S., Grabot, B., Pellerin, R., Fortin, A., 2020. Machine learning applied in production planning and control:
a state-of-the-art in the era of industry 4.0. Journal of Intelligent Manufacturing 31, 1531–1558. URL: https://doi.org/10.1007/
s10845-019-01531-7, doi:10.1007/s10845-019-01531-7.
[13] Wares, S., Isaacs, J., Elyan, E., 2019. Data stream mining: methods and challenges for handling concept drift. SN Applied Sciences 1, 1412.
URL: https://doi.org/10.1007/s42452-019-1433-0, doi:10.1007/s42452-019-1433-0.
[14] Wingerath, W., Gessert, F., Friedrich, S., Ritter, N., 2016. Real-time stream processing for big data. it - Information Technology 58, 186–194.
URL: https://doi.org/10.1515/itit-2016-0002, doi:doi:10.1515/itit-2016-0002.