Khiêm
Hà Nội, 03/07/2024
Anomaly Detection in Time Series
Abstract. This paper shows how to use the PyCaret library in Python to detect anomalous operating points of an industrial water pump based on its sensor measurements, so that the next breakdown can be anticipated. The PyCaret library offers a variety of machine learning models and speeds up the experimentation cycle to just a few lines of code. Final tests show that the KNN and Isolation Forest models can detect the next failure with higher accuracy than the others.
1 Introduction
Water pumps play several critical roles in industrial settings, where they are used for various purposes depending on the specific requirements of the industry. Some of their key functions are transferring fluids, maintaining the flow of water or other liquids in industrial processes, and fire protection. A broken pump can therefore lead to significant consequences such as production downtime, safety hazards, and environmental impact. Beyond the direct financial loss, there is also the cost of repairing or replacing the broken pump, and in the worst case human casualties.
From the above, it is clear that anomaly detection is essential: it helps reduce risk to a minimum and lets us prepare to deal with an issue before it occurs. Given the rapid development of AI in recent years, we believe machine learning is well suited to this goal. We are given the sensor data of a water pump in a small area that suffered seven failures in the last year, failures that caused serious problems for many people and disrupted the daily lives of several families. Our mission is to predict the next failure before it happens.
2 Methodology
It is important to get a good understanding of the data we have to work with. Data-driven maintenance strategy optimization is critical in today's rapidly changing, competitive operational environment: it reduces costs, minimizes risk, and improves performance.
Let's get an overview of the data. The dataset contains 220,320 data points and 55 features, which are the readings of 52 sensors plus the machine's status, measured minute by minute from 1 April to 31 August. The type of quantity each sensor measures is not mentioned, but we can assume they include temperature, pressure, vibration, load, volume, and flow density; this does not affect the results much.
There are three operating statuses for this pump: NORMAL, RECOVERING, and BROKEN. The failure events occurred in April (2 times), May (2 times), June (2 times), and once in July 2018. We find that:
There is an unnamed column that needs to be removed.
The machine status must be encoded numerically, since machine learning models work better with
numbers.
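The two preprocessing steps above can be sketched with pandas. The real dataset is not bundled here, so a small illustrative frame stands in for it; the column names are assumptions based on typical Kaggle sensor datasets.

```python
import pandas as pd

# Tiny stand-in for the real 220,320-row dataset (column names are assumptions).
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "sensor_00": [2.46, 2.47, 2.44, 2.45],
    "machine_status": ["NORMAL", "NORMAL", "BROKEN", "RECOVERING"],
})

# Step 1: drop the unnamed index column.
df = df.drop(columns=["Unnamed: 0"])

# Step 2: encode the machine status numerically.
status_map = {"NORMAL": 0, "BROKEN": 1, "RECOVERING": 2}
df["machine_status"] = df["machine_status"].map(status_map)

print(df["machine_status"].tolist())  # [0, 0, 1, 2]
```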
Here we use the PyCaret library to access the models. PyCaret is an open-source, low-code machine learning library based on Python. It automates the end-to-end machine learning workflow, and by automating tasks and managing ML models it speeds up the experimentation cycle. Because the library is low-code, only a few lines are needed to set up and compare the different models, each of which is briefly explained below.
Histogram-based Outlier Detection (HBOS): a histogram is the most commonly used graph for showing frequency distributions. It looks much like a bar chart, but there are important differences between them. HBOS builds a histogram for each feature; if an observation falls into a low-frequency tail of a histogram, it is likely an anomaly. After aggregating the scores over all features, we can decide which observations are anomalies.
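The idea can be sketched with plain NumPy: score each point by summing the negative log frequency of its histogram bin across features (a simplified version of the scoring used by HBOS implementations, not PyCaret's exact code).

```python
import numpy as np

# Simplified HBOS-style scorer: rare bins => large score.
def hbos_scores(X, bins=10):
    X = np.asarray(X, dtype=float)
    scores = np.zeros(len(X))
    for j in range(X.shape[1]):
        hist, edges = np.histogram(X[:, j], bins=bins, density=True)
        # locate each observation's bin (clip keeps the max value in the last bin)
        idx = np.clip(np.searchsorted(edges, X[:, j], side="right") - 1, 0, bins - 1)
        scores += -np.log(hist[idx] + 1e-12)
    return scores

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[8.0, 8.0]]])  # one planted outlier
s = hbos_scores(X)
print(s[200] == s.max())  # True: the planted outlier attains the maximum score
```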
Isolation Forest: Isolation Forest is an unsupervised learning model. It begins with an isolation tree, which represents a way of splitting the data to isolate a data point: it keeps cutting the dataset until all instances are isolated from one another. Because an anomaly is usually far from the other instances, it becomes isolated in fewer steps than a normal instance on average. Many such trees make a forest, hence the name Isolation Forest. The anomaly score is s(x, m) = 2^(−E(h(x)) / c(m)), where x is the data point, m is the number of points, E(h(x)) is the average path length needed to isolate x across the trees, and c(m) is the average depth of a point in a tree, used as a normalization term. Each observation receives an anomaly score: if the score is close to 1, the observation is marked as an anomaly; observations with a score well below 0.5 are considered normal.
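For illustration, here is Isolation Forest applied directly through scikit-learn on synthetic data with one planted anomaly (the paper itself fits the model through PyCaret rather than calling scikit-learn directly).

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(300, 2)), [[10.0, 10.0]]])  # one planted anomaly

# contamination sets the expected fraction of anomalies in the data.
clf = IsolationForest(n_estimators=100, contamination=0.01, random_state=0).fit(X)
labels = clf.predict(X)  # +1 = normal, -1 = anomaly
print(labels[-1])        # -1: the planted point is isolated in few splits and flagged
```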
K-Nearest Neighbors (KNN): the K-nearest neighbors algorithm is commonly presented as a supervised learning model that uses proximity to classify or predict the grouping of an individual data point; it can be used for both classification and regression. In classification, a new data point is assigned the majority class of its k nearest neighbors in the feature space; in regression, the target value is the average of those neighbors' values. For anomaly detection, KNN is used in an unsupervised way: a point's anomaly score is based on its distance to its nearest neighbors. We use the default k value in the PyCaret library, which is 5. A low value such as k = 1 or k = 2 can be noisy and sensitive to outliers; a large k smooths things over, but k should not be so large that a category with few samples is always outvoted by other categories.
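The distance-based scoring idea behind KNN anomaly detection can be sketched with scikit-learn alone: score each point by its distance to its k-th nearest neighbor, so isolated points score highest. This is a simplified illustration, not PyCaret's internal code; k = 5 matches the default mentioned above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[9.0, 9.0]]])  # one isolated point

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 because each point is its own neighbor
dist, _ = nn.kneighbors(X)
scores = dist[:, -1]                             # distance to the k-th true neighbor
print(int(np.argmax(scores)))                    # 200: the isolated point scores highest
```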
Local Outlier Factor (LOF): LOF is an unsupervised learning model that computes the local density deviation of a given data point with respect to its neighbors. It uses a quantity called density, the inverse of the average reachability distance from an observation to its k nearest neighbors. The ratio R of the average density of an observation's k nearest neighbors to the density of the observation itself is then measured: if R > 1, the observation is an anomaly, and vice versa.
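A short scikit-learn illustration: `LocalOutlierFactor` computes exactly this ratio (exposed, negated, as `negative_outlier_factor_`), so the planted low-density point gets R > 1 and is flagged. Parameter choices here are illustrative.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[8.0, 8.0]]])  # one low-density point

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)           # -1 = outlier, +1 = inlier
R = -lof.negative_outlier_factor_     # scikit-learn stores the negated LOF score
print(labels[-1])                     # -1: the planted point is flagged, with R > 1
```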
Principal Component Analysis (PCA): PCA is often used to reduce the dimensionality of large data sets by transforming a large set of variables into a smaller one that still contains most of the information in the large set. To put this in context, our 52 sensors produce a lot of information; PCA reduces the data to fewer dimensions that are easier to work with, where each new dimension is a linear combination (a mixture) of the original data dimensions.
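A small sketch of this reduction: ten correlated "sensor" columns (synthetic, standing in for the 52 real sensors) are generated from three underlying signals, and PCA recovers a 3-dimensional representation that keeps almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))                          # 3 underlying signals
X = base @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(500, 10))

pca = PCA(n_components=3).fit(X)
Xr = pca.transform(X)
print(Xr.shape)                                           # (500, 3)
print(pca.explained_variance_ratio_.sum())                # close to 1.0
```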
3 Results
Confusion matrix: a confusion matrix can be n×n, but here we use a 2×2 matrix representing the actual and predicted status of the machine. Imagine a cancer diagnosis producing such a matrix. True positives and true negatives are the good cases: the prediction is right both for patients who have cancer and for those who do not. A false positive is predicting cancer for someone who does not have it; this does little harm, since a follow-up test will correct it. The critical failures are the false negatives: we do not want a cancer patient to calmly go home and miss treatment at the right time. The same applies to the Broken and Not Broken statuses of a pump: a large number of true positives is impressive, but mistaking a real failure for a normal state (a false negative) leads to the serious consequences mentioned in the introduction. So the essential numbers to watch are the false negatives and true positives.
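The four cells are easy to extract with scikit-learn. Here "broken" is the positive class (1); the toy labels below are illustrative, not the paper's actual predictions.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 0, 1]  # actual machine status (1 = broken)
y_pred = [0, 1, 1, 0, 0, 1]  # model output

# ravel() on the 2x2 matrix yields (tn, fp, fn, tp) for labels [0, 1].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 1 1 2
```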
o All 7 failures were correctly predicted by the KNN model and the LOF model.
o The number of false positives of KNN and LOF is also the same, so we must compare the two
models further using statistical parameters.
Statistical parameters:
4 Conclusion
We come to the conclusion that KNN is the best model for prediction in this case; in practice, both KNN and LOF can be used. Predictive anomaly detection is presented as an aid to the maintenance and operation of industrial machinery. To that end, we examined five machine learning models on the given industrial pump dataset to draw the results and conclusions shown in this paper.