Khiêm
Hà Nội, 03/07/2024
Anomaly Detection in Time Series
Abstract. This paper shows how to use the PyCaret library in Python to detect anomalous operating points of an industrial water pump based on its sensor measurements, so that the next breakdown can be anticipated. The PyCaret library offers a variety of machine learning models and speeds up the experimentation cycle to just a few lines of code. Final tests show that the KNN and Isolation Forest models can detect the next failure with higher accuracy than the others.
1 Introduction
Water pumps play several critical roles in industrial settings, where they are used for various purposes depending on the specific requirements of the industry. Some of their key functions are transferring fluids, maintaining the flow of water or other liquids in industrial processes, and fire protection. A broken pump can therefore lead to significant consequences such as production downtime, safety hazards, and environmental impact. Beyond the direct financial loss, there is also the cost of repairing or replacing the broken pump, and in the worst case human casualties.
From the above, it is clear that anomaly detection is essential: it helps reduce risk to a minimum and lets us prepare to deal with an issue before it occurs. Given the rapid development of AI in recent years, we believe machine learning is well suited to this goal. We are given the sensor data of a water pump in a small area that suffered seven failures in the last year, failures that caused serious problems for many people and disrupted the daily lives of several families. Our mission is to predict the next failure before it happens.
2 Methodology
It is important to get a good understanding of the data we have to work with. Data-driven maintenance strategy optimization is critical in today's rapidly changing, competitive operational environment: it reduces costs, minimizes risk, and improves performance.
Let's get an overview of the data. The dataset contains 220,320 data points and 55 features, which are the readings of 52 sensors plus the machine's status, measured minute by minute from 1 April to 31 August. The type of quantity each sensor measures is not mentioned, but we can assume they include temperature, pressure, vibration, load, volume, and flow density; this does not affect the results much.
There are three operating statuses for this pump: NORMAL, RECOVERING, and BROKEN. The failure events occurred in April (2 times), May (2 times), June (2 times), and once in July 2018. We find that:
There is an unnamed column that needs to be removed.
The machine status must be encoded numerically, since machine learning models work better with
numbers.
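The two preprocessing steps above can be sketched with pandas. The real dataset is not bundled here, so a small illustrative frame stands in for it; the column names are assumptions based on typical Kaggle sensor datasets.

```python
import pandas as pd

# Tiny stand-in for the real 220,320-row dataset (column names are assumptions).
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "sensor_00": [2.46, 2.47, 2.44, 2.45],
    "machine_status": ["NORMAL", "NORMAL", "BROKEN", "RECOVERING"],
})

# Step 1: drop the unnamed index column.
df = df.drop(columns=["Unnamed: 0"])

# Step 2: encode the machine status numerically.
status_map = {"NORMAL": 0, "BROKEN": 1, "RECOVERING": 2}
df["machine_status"] = df["machine_status"].map(status_map)

print(df["machine_status"].tolist())  # [0, 0, 1, 2]
```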
Here we use the PyCaret library to access the models. PyCaret is an open-source, low-code machine learning library based on Python. It automates the end-to-end machine learning workflow, and by automating tasks and managing ML models it speeds up the experimentation cycle. Because the library is low-code, only a few lines are needed to set up and compare the different models, each of which is briefly explained below.
Histogram-based Outlier Detection (HBOS): a histogram is the most commonly used graph for showing frequency distributions. It looks much like a bar chart, but there are important differences between them. HBOS builds a histogram for each feature; if an observation falls into a low-frequency tail of a histogram, it is likely an anomaly. After aggregating the scores over all features, we can decide which observations are anomalies.
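The idea can be sketched with plain NumPy: score each point by summing the negative log frequency of its histogram bin across features (a simplified version of the scoring used by HBOS implementations, not PyCaret's exact code).

```python
import numpy as np

# Simplified HBOS-style scorer: rare bins => large score.
def hbos_scores(X, bins=10):
    X = np.asarray(X, dtype=float)
    scores = np.zeros(len(X))
    for j in range(X.shape[1]):
        hist, edges = np.histogram(X[:, j], bins=bins, density=True)
        # locate each observation's bin (clip keeps the max value in the last bin)
        idx = np.clip(np.searchsorted(edges, X[:, j], side="right") - 1, 0, bins - 1)
        scores += -np.log(hist[idx] + 1e-12)
    return scores

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[8.0, 8.0]]])  # one planted outlier
s = hbos_scores(X)
print(s[200] == s.max())  # True: the planted outlier attains the maximum score
```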
Isolation Forest: Isolation Forest is an unsupervised learning model. It begins with an isolation tree, which represents a way of splitting the data to isolate a data point: it keeps cutting the dataset until all instances are isolated from one another. Because an anomaly is usually far from the other instances, it becomes isolated in fewer steps than a normal instance on average. Many such trees make a forest, hence the name Isolation Forest. The anomaly score is s(x, m) = 2^(−E(h(x)) / c(m)), where x is the data point, m is the number of points, E(h(x)) is the average path length needed to isolate x across the trees, and c(m) is the average depth of a point in a tree, used as a normalization term. Each observation receives an anomaly score: if the score is close to 1, the observation is marked as an anomaly; observations with a score well below 0.5 are considered normal.
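For illustration, here is Isolation Forest applied directly through scikit-learn on synthetic data with one planted anomaly (the paper itself fits the model through PyCaret rather than calling scikit-learn directly).

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(300, 2)), [[10.0, 10.0]]])  # one planted anomaly

# contamination sets the expected fraction of anomalies in the data.
clf = IsolationForest(n_estimators=100, contamination=0.01, random_state=0).fit(X)
labels = clf.predict(X)  # +1 = normal, -1 = anomaly
print(labels[-1])        # -1: the planted point is isolated in few splits and flagged
```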
K-Nearest Neighbors (KNN): the K-nearest neighbors algorithm is commonly presented as a supervised learning model that uses proximity to classify or predict the grouping of an individual data point; it can be used for both classification and regression. In classification, a new data point is assigned the majority class of its k nearest neighbors in the feature space; in regression, the target value is the average of those neighbors' values. For anomaly detection, KNN is used in an unsupervised way: a point's anomaly score is based on its distance to its nearest neighbors. We use the default k value in the PyCaret library, which is 5. A low value such as k = 1 or k = 2 can be noisy and sensitive to outliers; a large k smooths things over, but k should not be so large that a category with few samples is always outvoted by other categories.
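The distance-based scoring idea behind KNN anomaly detection can be sketched with scikit-learn alone: score each point by its distance to its k-th nearest neighbor, so isolated points score highest. This is a simplified illustration, not PyCaret's internal code; k = 5 matches the default mentioned above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[9.0, 9.0]]])  # one isolated point

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 because each point is its own neighbor
dist, _ = nn.kneighbors(X)
scores = dist[:, -1]                             # distance to the k-th true neighbor
print(int(np.argmax(scores)))                    # 200: the isolated point scores highest
```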
Local Outlier Factor (LOF): LOF is an unsupervised learning model that computes the local density deviation of a given data point with respect to its neighbors. It uses a quantity called density, the inverse of the average reachability distance from an observation to its k nearest neighbors. The ratio R of the average density of an observation's k nearest neighbors to the density of the observation itself is then measured: if R > 1, the observation is an anomaly, and vice versa.
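A short scikit-learn illustration: `LocalOutlierFactor` computes exactly this ratio (exposed, negated, as `negative_outlier_factor_`), so the planted low-density point gets R > 1 and is flagged. Parameter choices here are illustrative.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[8.0, 8.0]]])  # one low-density point

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)           # -1 = outlier, +1 = inlier
R = -lof.negative_outlier_factor_     # scikit-learn stores the negated LOF score
print(labels[-1])                     # -1: the planted point is flagged, with R > 1
```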
Principal Component Analysis (PCA): PCA is often used to reduce the dimensionality of large data sets by transforming a large set of variables into a smaller one that still contains most of the information in the large set. To put this in context, our 52 sensors produce a lot of information; PCA reduces the data to fewer dimensions that are easier to work with, where each new dimension is a linear combination (a mixture) of the original data dimensions.
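A small sketch of this reduction: ten correlated "sensor" columns (synthetic, standing in for the 52 real sensors) are generated from three underlying signals, and PCA recovers a 3-dimensional representation that keeps almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))                          # 3 underlying signals
X = base @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(500, 10))

pca = PCA(n_components=3).fit(X)
Xr = pca.transform(X)
print(Xr.shape)                                           # (500, 3)
print(pca.explained_variance_ratio_.sum())                # close to 1.0
```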
3 Results
Confusion matrix: a confusion matrix can be n×n, but here we use a 2×2 matrix representing the actual and predicted status of the machine. Imagine a cancer diagnosis producing such a matrix. True positives and true negatives are the good cases: the prediction is right both for patients who have cancer and for those who do not. A false positive is predicting cancer for someone who does not have it; this does little harm, since a follow-up test will correct it. The critical failures are the false negatives: we do not want a cancer patient to calmly go home and miss treatment at the right time. The same applies to the Broken and Not Broken statuses of a pump: a large number of true positives is impressive, but mistaking a real failure for a normal state (a false negative) leads to the serious consequences mentioned in the introduction. So the essential numbers to watch are the false negatives and true positives.
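The four cells are easy to extract with scikit-learn. Here "broken" is the positive class (1); the toy labels below are illustrative, not the paper's actual predictions.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 0, 1]  # actual machine status (1 = broken)
y_pred = [0, 1, 1, 0, 0, 1]  # model output

# ravel() on the 2x2 matrix yields (tn, fp, fn, tp) for labels [0, 1].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 1 1 2
```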
o All 7 failures were correctly predicted by the KNN model and the LOF model.
o The number of false positives of KNN and LOF is also the same, so we must compare the two
models further using statistical parameters.
Statistical parameters:
4 Conclusion
We come to the conclusion that KNN is the best model for prediction in this case; in practice, both KNN and LOF can be used. Predictive anomaly detection is presented as an aid to the maintenance and operation of industrial machinery. To that end, we examined five machine learning models on the given industrial pump dataset to draw the results and conclusions shown in this paper.