
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

SCHOOL OF ELECTRICAL AND ELECTRONIC ENGINEERING

COURSE REPORT


TECHNICAL WRITING AND PRESENTATION

Topic: Anomaly Detection in Time Series

Lecturer: Đặng Hoàng Anh


Student name: Phạm Gia Khiêm
Student ID: 20222309
Class code: 149950

Hanoi, 03/07/2024
Anomaly Detection in Time Series

Vu Phuc Thanh¹, Pham Gia Khiem² and Pham Viet Huy³


¹ Hanoi University of Science and Technology, Ha Noi, Viet Nam
  [email protected]
² Hanoi University of Science and Technology, Ha Noi, Viet Nam
  [email protected]
³ Hanoi University of Science and Technology, Ha Noi, Viet Nam
  [email protected]

Abstract. This paper shows how to use the PyCaret library in Python to detect anomalous operating points of an industrial water pump based on its sensor measurements, so that the next pump failure can be anticipated and dealt with. The PyCaret library includes a variety of machine learning models that can be applied to our data, and it speeds up the experimentation cycle with just a few lines of code. Our final tests show that the KNN and Isolation Forest models can detect the next failure with higher accuracy than the other models.

Keywords: Anomaly Detection, Pump Sensors, PyCaret.

1 Introduction

1.1 Water pumps in industry

Water pumps play several critical roles in industrial settings, where they are used for various purposes depending on the specific requirements of the industry. Some of their key functions are transferring fluids, maintaining the flow of water or other liquids in industrial processes, and fire protection. A broken pump can therefore lead to several significant consequences, such as production downtime, safety hazards and environmental impact. Besides the loss of production, we also have to pay for repairing or replacing the broken pump, and in the worst case a failure can cause human casualties, all of which cost a great deal of money.

1.2 The importance of anomaly detection

From the above paragraph, we can easily see that anomaly detection is essential, because it helps reduce the risk to a minimum and prepares us to deal with the issue. Given the rapid development of AI in recent years, we believe that machine learning will excel at helping us achieve this goal. We are given the sensor data of a water pump serving a small area; the pump failed seven times last year, causing serious problems for many people and severe hardship for some families. Our mission is to predict the next failure before it happens.

2 Methodology

2.1 EDA (Exploratory Data Analysis)

It is important to get a good understanding of the data we have to work with. Data-driven optimization of the maintenance strategy is critical: reducing costs, minimizing risk and improving performance are what give a competitive edge in today's rapidly changing operational environment.

Let us get an overview of the data. The dataset contains 220,320 data points and 55 columns: a timestamp, an unnamed index column, the readings of 52 sensors and the machine's status, recorded minute by minute from the first of April to the thirty-first of August. The physical quantity measured by each sensor is not mentioned, but we can assume they include temperature, pressure, vibration, load, volume and flow density of the machine; this missing information does not affect the result much.
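A quick way to reproduce this overview is sketched below with pandas; the file name sensor.csv and the column names are assumptions based on the public pump sensor dataset used by the referenced notebooks, not details stated in this report.

import pandas as pd

# Load the raw sensor data (file name is an assumption)
df = pd.read_csv("sensor.csv", parse_dates=["timestamp"])

print(df.shape)                             # expect (220320, 55)
print(df["machine_status"].value_counts())  # NORMAL / RECOVERING / BROKEN counts
print(df.iloc[:, :4].head())                # timestamp and first sensor columns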

Fig. 1. Timestamp, sensor_00 and sensor_01 columns

Fig. 2. Pie chart of the machine status

We can see that the pump has three working statuses: NORMAL, RECOVERING and BROKEN. The failure events occurred in April (2 times), May (2 times), June (2 times) and once in July 2018. Our findings are:
 There is an unnamed column which needs to be removed.
 The machine status needs to be encoded, because machine learning models work better with numbers.

2.2 Preprocessing the data

Let us make the following changes to the dataset (a code sketch of these steps is given after the list):


 Delete the unnamed column and the sensor_15 column, because they carry no useful information.
 The machine status is encoded as numerical values: NORMAL and RECOVERING = 0, BROKEN = 1.
 Missing data are never easy to deal with, so forward fill ('ffill') is used to fill each missing data point with the previous value.
 We also found that the differences between consecutive values of each sensor are negligible, so resampling the time series from minutely to hourly does not significantly affect the result. Resampling is therefore applied, which reduces the dataset volume.
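The preprocessing steps above can be sketched in pandas roughly as follows; the file name, the column names and the variable names are assumptions rather than details taken from the report.

import pandas as pd

# Load the raw minutely data (reloaded here so the sketch is self-contained)
df = pd.read_csv("sensor.csv", parse_dates=["timestamp"])

# 1. Drop the unnamed index column and sensor_15
df = df.drop(columns=["Unnamed: 0", "sensor_15"])

# 2. Encode the machine status: NORMAL and RECOVERING -> 0, BROKEN -> 1
df["machine_status"] = (df["machine_status"] == "BROKEN").astype(int)

# 3. Forward-fill missing sensor readings with the previous value
df = df.ffill()

# 4. Resample from minutely to hourly to reduce the dataset volume
df = df.set_index("timestamp")
hourly = df.resample("1H").mean()
# an hour is labelled broken if any minute within it was broken
hourly["machine_status"] = df["machine_status"].resample("1H").max()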

Here we use the PyCaret library to get access to the models. PyCaret is an open-source, low-code machine learning library based on Python. It automates the end-to-end machine learning workflow and, by automating tasks and managing the ML models, it speeds up the experimentation cycle. Because the library is low-code, PyCaret needs only a few lines of code to set up and apply our different models, each of which will be briefly explained in the next subsection.
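A minimal sketch of the PyCaret anomaly-detection workflow, assuming hourly is the resampled dataframe from the preprocessing step (the variable names are assumptions):

from pycaret.anomaly import setup, create_model, assign_model

# Initialise the unsupervised experiment on the sensor columns only
setup(data=hourly.drop(columns=["machine_status"]), session_id=42)

# Train one of the models discussed below, e.g. Isolation Forest
iforest = create_model("iforest")

# Attach an anomaly label (0/1) and an anomaly score to every row
results = assign_model(iforest)
print(results[["Anomaly", "Anomaly_Score"]].head())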

Fig. 3. Before resampling

Fig. 4. After resampling

2.3 Apply models to the data


We considered five models for examination, all of which are machine learning models that can be used for anomaly detection; a sketch of how they are created in PyCaret follows this paragraph.
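To the best of our knowledge, the corresponding PyCaret model IDs are 'histogram' (HBOS), 'iforest', 'knn', 'lof' and 'pca'; this mapping is an assumption that can be checked with models(). A sketch, run after the setup() call shown above:

from pycaret.anomaly import create_model, assign_model, models

print(models())  # lists the available anomaly detectors and their IDs

detectors = {
    "HBOS": create_model("histogram"),
    "Isolation Forest": create_model("iforest"),
    "KNN": create_model("knn"),
    "LOF": create_model("lof"),
    "PCA": create_model("pca"),
}
# Each entry gets the data back with an Anomaly label and an Anomaly_Score
labelled = {name: assign_model(model) for name, model in detectors.items()}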

 Histogram-based Outlier Detection (HBOS): a histogram is the most commonly used graph to show frequency distributions; it looks very much like a bar chart, but there are important differences between the two. HBOS builds a histogram for each feature: if an observation falls in the tail of a feature's histogram, it is likely an anomaly for that feature, and after aggregating the scores over all features we can decide which observations are anomalies.

 Isolation Forest: Isolation Forest is an unsupervised learning model. It starts from the isolation tree, a tree that represents a way of splitting the data in order to isolate a data point: it keeps cutting the dataset with random splits until all instances are isolated from one another. Because an anomaly is usually far away from the other instances, it becomes isolated in fewer steps than a normal instance on average; many such trees together make a forest, which is called an isolation forest. In the score function (given after this list), x is the data point and m is the number of points, E(h(x)) represents the average path length needed to isolate x over the trees, and c(m) represents the average depth of a data point in a tree, used to normalize the score. An anomaly score is assigned to each observation and interpreted as follows: if the score is close to 1, the observation is marked as an anomaly, while observations with a score well below 0.5 are considered normal.

 K-Nearest Neighbors (KNN): the K-nearest neighbors algorithm is traditionally a supervised learning model, which uses proximity to make classifications or predictions about the grouping of an individual data point, and it can be used for both classification and regression tasks. In classification, the algorithm assigns a new data point to the majority class of its k nearest neighbors in the feature space; for regression, it predicts the target value by averaging the values of its k nearest neighbors. For anomaly detection, the distance from a point to its k nearest neighbors is typically used as its outlier score, so points lying far from all of their neighbors are flagged as anomalies. We use the default value of k in the PyCaret library, which is 5. A low value of k such as k = 1 or k = 2 can be noisy and sensitive to outliers; a large value of k smooths things out, but k should not be so large that a category with only a few samples is always outvoted by other categories.

 Local Outlier Factor (LOF): LOF is an unsupervised learning model which computes the local density deviation of a given data point with respect to its neighbors. It uses a quantity called density, which is the inverse of the average reachability distance from an observation to its k nearest neighbors. The ratio of the average density of the k nearest neighbors of an observation to the density of the observation itself, called R, is then measured: if R > 1, the observation is an anomaly, and vice versa.

 Principal Component Analysis (PCA): PCA is often used to reduce the dimensionality of large data sets by transforming a large set of variables into a smaller one that still contains most of the information in the original set. To illustrate: here we have 52 sensors and their readings, which is a lot of information; PCA reduces them to a lower-dimensional representation that is easier to work with, where each new dimension is a linear combination (mixture) of the initial dimensions. For anomaly detection, observations that are poorly represented by the retained principal components, i.e. those with a large reconstruction error, are typically flagged as anomalies.
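For reference, the standard Isolation Forest score function referred to above (as defined in the original Isolation Forest method; we assume the implementation underlying PyCaret follows the same definition) is

s(x, m) = 2^( -E(h(x)) / c(m) ),   with   c(m) = 2·H(m-1) - 2·(m-1)/m   and   H(i) ≈ ln(i) + 0.5772,

where h(x) is the path length needed to isolate x in a single tree, E(h(x)) is its average over all trees, and c(m) is the average path length of an unsuccessful search in a binary search tree built on m points, used to normalize the score.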

3 Results and Discussion

3.1 How to evaluate the effectiveness of each model?

 Confusion matrix: a confusion matrix can be of size n×n, but here we only use a 2×2 matrix, which represents the actual and the predicted status of the machine. Imagine running a cancer diagnosis that gives such a confusion matrix. True positives and true negatives are the best cases: the prediction is correct both for patients who actually have cancer and for those who do not. A false positive means predicting cancer for a patient who does not have it; this does not matter much, since the patient will simply be tested again. The critical failures are the false negatives: you do not want a cancer patient to calmly go home and miss treatment at the right time. The same holds for the BROKEN and NORMAL statuses of a pump: a large number of true positives is impressive, but misclassifying an actual failure as normal (a false negative) leads to the serious consequences mentioned in the first part. So the essential numbers here are the false negatives, which must be zero, and the false positives, which should be as low as possible.

Fig. 5. Confusion matrices of the different models

o All 7 failures were correctly predicted by the KNN and LOF models.
o The number of false positives of KNN and LOF is also the same, so we have to compare them more deeply using statistical parameters.

 Statistical parameters:

Fig. 6. Statistical parameters of the different models

Here is the meaning of each parameter; a sketch of how these values can be computed is given at the end of this subsection:


- Accuracy: measures the proportion of correct predictions made by the model across the entire
dataset.
- Precision: measures the proportion of true positive predictions among all positive predictions made
by the model.
- Recall: measures the proportion of true positive predictions among all actual positive instances.
- F1 Score: calculated as the harmonic mean of precision and recall.
We can observe that:
- KNN produced the highest scores on the anomaly detection task compared to the other models (all the broken labels were detected).
- We require more data and information regarding the sensors in order to verify whether the model will produce the same results on unseen data.
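A minimal sketch of how these numbers can be computed with scikit-learn, assuming hourly is the resampled dataframe from Section 2.2 and results is the output of PyCaret's assign_model() as sketched earlier (variable names are assumptions):

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

y_true = hourly["machine_status"]  # 1 = BROKEN, 0 = NORMAL / RECOVERING
y_pred = results["Anomaly"]        # 1 = flagged as an anomaly by the model

# tn, fp, fn, tp follow scikit-learn's ordering for binary labels {0, 1}
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))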

4 Conclusion

We come to the conclusion that KNN is the best model for prediction in this case; in practice, both KNN and LOF can be used. Predictive anomaly detection is presented as an aid to the maintenance and operation of industrial machinery. To this end, we have examined the five machine learning models on the given industrial pump dataset and drawn the results and conclusions shown in this paper.

References
1. Makomane Dorothea L Sekhoto: Pycaret Anomaly Detection Application on Pump, https://www.kaggle.com/code/dorotheantsosheng/pycaret-anomaly-detection-application-on-pump/input
2. Sumit Kr Jha: Anomaly Detection in Time Series, https://www.kaggle.com/code/jhasony/anomaly-detection-in-time-series/notebook
3. Mattison Hineline: Water Pump Classification, https://www.kaggle.com/code/mattison/water-pump-classification
