OFFICIAL (CLOSED) \ NON-SENSITIVE

Multivariate
Anomaly Detection
Mr Hew Ka Kian
[email protected]

Principal Component Analysis

• Multivariate anomaly detection involves two or more dependent time series.
• Examples of dependent time series are:
  • Covid infection cases and hospital bed utilization
  • A student’s lesson attendance and his/her module grade
• Principal Component Analysis (PCA) is a common technique used for multivariate anomaly detection.
• PCA analyzes the relationship between the dependent variables and derives new feature values, called principal components.
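To make "derives new feature values" concrete, here is a minimal NumPy sketch of how principal components can be obtained from two dependent variables. The data is synthetic and the variable names are assumptions for illustration. It shows one common way to compute PCA (eigenvectors of the covariance matrix), not necessarily how any particular library does it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two dependent variables (synthetic): y follows x closely, plus a little noise.
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.1, size=500)
data = np.column_stack([x, y])

# Centre the data, then take the eigenvectors of the covariance matrix.
centred = data - data.mean(axis=0)
cov = np.cov(centred, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; put the largest first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
components = eigenvectors[:, order]

# New feature values: each row projected onto the principal components.
scores = centred @ components

explained = eigenvalues / eigenvalues.sum()
print("explained variance ratio:", explained.round(4))
```

Because the two variables are strongly dependent, almost all of the variance falls on the first component. That is what makes a row that does not fit the usual relationship stand out.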

Workflow

Steps: Source data → Prepare data → Select the algorithm → Train the model → Test the model → Use the model for inference

Source data

As we have learnt, many public datasets are available, or a company may use the metrics generated by its equipment.

One good source of public data that we will continue to use is the Singapore government’s public dataset at https://data.gov.sg .

Workflow: Prepare data

Filter the data
• The rows of interest, like “above 65 years old”
• The columns of interest, basically just the datetime and the feature columns

Transform
• Rename generic column names to more appropriate ones so that names do not conflict when the data is combined. For example, rename the column “cases” to “icu_cases”.
• Combine multiple sources containing the different metrics for multivariate anomaly detection
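The filter, rename and combine steps can be sketched with Pandas as follows. Both DataFrames and all their values are made up for illustration. Only the column names “cases” and “icu_cases” and the row label “above 65 years old” come from the slide.

```python
import pandas as pd

dates = pd.date_range("2022-01-01", periods=3, freq="D")

# Hypothetical source 1: ICU cases by age group, with a generic "cases" column.
icu = pd.DataFrame({
    "date": dates.repeat(2),
    "age_group": ["below 65 years old", "above 65 years old"] * 3,
    "cases": [4, 9, 5, 11, 3, 12],
    "remarks": [""] * 6,
})

# Hypothetical source 2: hospital bed utilisation, one row per day.
beds = pd.DataFrame({"date": dates, "bed_utilisation": [0.71, 0.74, 0.78]})

# Filter: keep the rows of interest, then only the datetime and feature columns.
icu = icu[icu["age_group"] == "above 65 years old"][["date", "cases"]]

# Transform: give the generic column a specific name to avoid clashes.
icu = icu.rename(columns={"cases": "icu_cases"})

# Combine: index both sources by datetime and join them side by side.
combined = icu.set_index("date").join(beds.set_index("date"))
print(combined)
```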

Workflow: Select the algorithm

Use an algorithm for multivariate anomaly detection, such as:
• PCA Reconstruction Error
• PCA Anomaly Detection, which highlights data with a high PCA Reconstruction Error as anomalous
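As a rough illustration of the idea, the sketch below computes a reconstruction error with NumPy and then flags unusually high values. The data is synthetic. The flagging rule (error above Q3 + c × IQR, with c = 5.0) is an assumption chosen for this example and may differ from what a given library does internally.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
index = pd.date_range("2022-01-01", periods=200, freq="D")

# Two dependent series (hypothetical): b tracks a closely.
a = rng.normal(size=200)
df = pd.DataFrame({"a": a, "b": 2 * a + rng.normal(scale=0.1, size=200)}, index=index)
df.iloc[50] = [1.5, -3.0]  # one row that breaks the usual relationship

# Reconstruction error: keep only the first principal component,
# rebuild each row from it, and measure the distance to the original row.
centred = df.to_numpy() - df.to_numpy().mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
pc1 = vt[0]
rebuilt = np.outer(centred @ pc1, pc1)
error = pd.Series(np.linalg.norm(centred - rebuilt, axis=1), index=index,
                  name="PCA Reconstruction Error")

# Flag errors far above the usual range (assumed rule: Q3 + c * IQR).
c = 5.0
q1, q3 = error.quantile(0.25), error.quantile(0.75)
anomalies = error > q3 + c * (q3 - q1)
print(anomalies[anomalies].index.tolist())
```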

Workflow: Train and Test

• To train the model using ADTK, call the fit(df) function.
• To test or use the model to detect anomalies, call the detect(df) function.
• For convenience, call the fit_detect(df) function, which trains and then detects in one step.
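The three calls follow a common pattern: fit learns from training data, detect applies what was learnt to other data, and fit_detect does both on the same data. The class below is a hypothetical stand-in written only to show that pattern with NumPy and Pandas. It is not ADTK's implementation, and its name, parameters and threshold rule are assumptions.

```python
import numpy as np
import pandas as pd

class TinyPcaDetector:
    """Hypothetical stand-in showing the fit / detect / fit_detect pattern."""

    def __init__(self, k: int = 1, c: float = 5.0):
        self.k = k  # number of principal components to keep
        self.c = c  # how far above the training quartiles counts as anomalous

    def _error(self, df: pd.DataFrame) -> pd.Series:
        centred = df.to_numpy(dtype=float) - self.mean_
        projected = centred @ self.components_ @ self.components_.T
        return pd.Series(np.linalg.norm(centred - projected, axis=1), index=df.index)

    def fit(self, df: pd.DataFrame) -> "TinyPcaDetector":
        values = df.to_numpy(dtype=float)
        self.mean_ = values.mean(axis=0)
        # Rows of vt are the principal directions, strongest first.
        _, _, vt = np.linalg.svd(values - self.mean_, full_matrices=False)
        self.components_ = vt[: self.k].T
        err = self._error(df)
        q1, q3 = err.quantile(0.25), err.quantile(0.75)
        self.threshold_ = q3 + self.c * (q3 - q1)
        return self

    def detect(self, df: pd.DataFrame) -> pd.Series:
        return self._error(df) > self.threshold_

    def fit_detect(self, df: pd.DataFrame) -> pd.Series:
        return self.fit(df).detect(df)

# Synthetic dependent series: y is roughly three times x.
rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=200)
df = pd.DataFrame(
    {"x": x, "y": 3 * x + rng.normal(scale=0.2, size=200)},
    index=pd.date_range("2022-01-01", periods=200, freq="D"),
)
train, test = df.iloc[:150], df.iloc[150:].copy()
test.iloc[10, test.columns.get_loc("y")] = 0.0  # inject a fault into the test data

detector = TinyPcaDetector().fit(train)  # train
anomalies = detector.detect(test)        # test / inference
print(anomalies[anomalies].index.tolist())
```

Fitting on one period and detecting on another keeps the threshold from being influenced by the very anomalies we hope to find.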

Student Activity (Exercise A)

Discuss and complete Exercise A in the worksheet to detect multivariate anomalies in dependent time series data.

Use Pandas to read CSV data from a local file into a DataFrame.
index_col is the time column.
parse_dates=True treats the index_col as date-time format.
• import pandas as pd
• df = pd.read_csv('/somefolder/generator.csv', index_col="Time", parse_dates=True)

Check for errors in the DataFrame df using validate_series(df).

• from adtk.data import validate_series
• df = validate_series(df)

Source: Huawei Ascend AI Processors and Applications Elementary Course
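To see what reading with index_col and parse_dates produces, and the kind of problems worth checking before detection, here is a small self-contained sketch. The CSV text is made up and read from memory instead of a file. The manual checks are only an approximation of typical time-index issues (unsorted or duplicated timestamps) and are not a substitute for validate_series.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file such as generator.csv.
csv_text = """Time,Speed (kRPM),Power (kW)
2017-05-02 00:00:00,3.1,9.4
2017-05-01 00:00:00,2.9,8.8
2017-05-02 00:00:00,3.1,9.4
2017-05-03 00:00:00,9.8,29.5
"""

# index_col picks the time column; parse_dates=True converts it to datetimes.
df = pd.read_csv(io.StringIO(csv_text), index_col="Time", parse_dates=True)

# The kind of issues worth checking before anomaly detection:
print("datetime index:", isinstance(df.index, pd.DatetimeIndex))
print("sorted by time:", df.index.is_monotonic_increasing)
print("duplicate timestamps:", int(df.index.duplicated().sum()))

# A manual clean-up: sort by time and keep the first of any duplicated timestamp.
clean = df.sort_index()
clean = clean[~clean.index.duplicated(keep="first")]
print(clean)
```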


Student Activity
Train and return the reconstruction error
• from adtk.transformer import PcaReconstructionError
• s = PcaReconstructionError().fit_transform(df).rename("PCA Reconstruction Error")

pd.concat([df, s], axis=1): combines the data DataFrame and the PCA reconstruction error so that we can call plot() on them.
curve_group=[("Speed (kRPM)", "Power (kW)"), "PCA Reconstruction Error"]: plots 2 subplots, the 1st with the Speed and Power columns, the 2nd with the PCA Reconstruction Error column. We can have more subplots if we need.

• from adtk.visualization import plot
• plot(pd.concat([df, s], axis=1), curve_group=[("Speed (kRPM)", "Power (kW)"), "PCA Reconstruction Error"]);

We have plotted the PCA Reconstruction Error, where a high value means an anomaly. It is more convenient to plot the anomalies directly. Do that with PcaAD.

• from adtk.detector import PcaAD

• pca_ad = PcaAD()
• anomalies = pca_ad.fit_detect(df)
• plot(df, anomaly=anomalies, ts_linewidth=1, anomaly_color='red', curve_group='all');


Student Activity
• Low power corresponds to low speed, and a high speed of around 10 kRPM corresponds to a power output of around 30 kW. However, an anomaly is detected on 13 May 2017, when the high speed of 10 kRPM does not produce the expected power; the generator may be faulty.
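The observation above shows why the variables need to be examined together. In the synthetic sketch below (all readings are made up, loosely following the 10 kRPM and 30 kW figures), the faulty reading looks normal when speed and power are checked one at a time, yet it has the largest reconstruction error when they are examined jointly.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic generator readings: mostly low speed and low power, with
# runs near 10 kRPM that produce about 30 kW.
speed = np.where(rng.random(300) < 0.3, 10.0, 1.0) + rng.normal(scale=0.2, size=300)
power = 3.0 * speed + rng.normal(scale=0.5, size=300)

# One faulty reading: high speed, but the power of a low-speed run.
fault = 100
speed[fault], power[fault] = 10.0, 3.0

# Looked at one variable at a time, the faulty reading is unremarkable:
others = np.arange(300) != fault
speed_in_range = speed[others].min() <= speed[fault] <= speed[others].max()
power_in_range = power[others].min() <= power[fault] <= power[others].max()
print("speed within usual range:", speed_in_range)
print("power within usual range:", power_in_range)

# Looked at together, it sits far from the line the other readings follow.
data = np.column_stack([speed, power])
centred = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
pc1 = vt[0]
error = np.linalg.norm(centred - np.outer(centred @ pc1, pc1), axis=1)
print("reading with the largest reconstruction error:", int(error.argmax()))
```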


Student Activity (Exercise B)

The temperature, humidity, light, CO2, humidity ratio (vapour vs dry air) and the occupancy of a room are
dependent variables.

occupancy1.csv contains the readings of the variables.


Student Activity
Read the CSV. As the date format (e.g. 2/2/2015) is ambiguous, we indicate dayfirst=True to tell Pandas that the first number is the day.

• df_room = pd.read_csv('data/occupancy1.csv', index_col="date", parse_dates=True, dayfirst=True)

Validating the series checks for errors; this is an optional but recommended step.

• df_room = validate_series(df_room)

Get the PCA Reconstruction Error from the data

• s = PcaReconstructionError().fit_transform(df_room).rename("PCA Reconstruction Error")

Concatenate the data and the PCA error to plot them together. curve_group tells plot to create 2 subplots, one with the data in the first bracket group, the other with the PCA error.

• plot(pd.concat([df_room, s], axis=1), curve_group=[("Light","CO2","Temperature","Humidity","HumidityRatio","Occupancy"), "PCA Reconstruction Error"]);

Detect the anomalies and plot the data with them. curve_group='all' means draw everything in a single plot, with no subplots.

• anomalies = pca_ad.fit_detect(df_room)
• plot(df_room, anomaly=anomalies, ts_linewidth=1, anomaly_color='red', curve_group='all');
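The effect of dayfirst=True when reading the CSV can be checked on a tiny example. The two rows below are made up and read from memory instead of from occupancy1.csv. The same text is parsed once with dayfirst=True and once with the default setting.

```python
import io
import pandas as pd

# Hypothetical readings whose dates are written day/month/year.
csv_text = """date,Temperature
3/2/2015 17:00,23.7
4/2/2015 17:00,22.9
"""

day_first = pd.read_csv(io.StringIO(csv_text), index_col="date",
                        parse_dates=True, dayfirst=True)
month_first = pd.read_csv(io.StringIO(csv_text), index_col="date",
                          parse_dates=True)

print(day_first.index[0])    # read as 3 February 2015
print(month_first.index[0])  # read as 2 March 2015
```

Both calls succeed without any error, so a wrong setting silently shifts the readings to other dates.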


Student Activity
• Anomaly detected at 02-03 18hr
• The combination of the variables is flagged as not similar to the rest.
• For multivariate anomalies, it is often not immediately obvious which part of the combination of variables is anomalous. Investigating it further is beyond the scope of the worksheet.


Student Activity

How do we plot data in the same number range in the same subplot so that smaller numbers are more visible?
Indicate in curve_group to plot “Light” and “CO2” in one subplot, “Temperature” and “Humidity” in another subplot, etc.
• plot(pd.concat([df_room, s], axis=1), curve_group=[("Light","CO2"), ("Temperature","Humidity"), ("HumidityRatio"), ("Occupancy"), "PCA Reconstruction Error"]);
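When there are many columns, a grouping by number range can also be built programmatically. The helper below is a hypothetical sketch, not part of ADTK. It groups columns whose largest absolute value has the same power of ten and returns a list in the shape used for curve_group above. The demo values are made up.

```python
import math
import pandas as pd

def group_by_magnitude(df: pd.DataFrame) -> list:
    """Group column names whose largest absolute value has the same power of ten."""
    groups = {}
    for column in df.columns:
        peak = df[column].abs().max()
        exponent = math.floor(math.log10(peak)) if peak > 0 else 0
        groups.setdefault(exponent, []).append(column)
    # Largest scale first; a one-column group is passed as a plain name.
    return [tuple(cols) if len(cols) > 1 else cols[0]
            for _, cols in sorted(groups.items(), reverse=True)]

# Hypothetical room readings with very different scales.
df_demo = pd.DataFrame({
    "Light": [420.0, 435.0, 0.0],
    "CO2": [700.0, 720.0, 450.0],
    "Temperature": [23.1, 23.0, 21.5],
    "Humidity": [27.2, 27.1, 26.0],
    "HumidityRatio": [0.0047, 0.0048, 0.0042],
    "Occupancy": [1, 1, 0],
})

curve_group = group_by_magnitude(df_demo)
print(curve_group)
```

Grouping by power of ten is only a heuristic; columns that sit just either side of a boundary (say 9 and 11) end up in different subplots, so the result is best treated as a starting point.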


Student Activity (Exercise C)

Study the Covid-19 infection numbers of the US, UK, SG and the world

• countries_covid = pd.read_csv('https://raw.githubusercontent.com/datasets/covid-19/main/data/time-series-19-covid-combined.csv', index_col="Date", parse_dates=True, infer_datetime_format=True)

Filter for the countries

• us_covid = countries_covid[countries_covid["Country/Region"] == "US"]

Rename the “Confirmed” column and drop the other columns

• us_covid = us_covid.rename(columns={"Confirmed":"US Confirmed"})
• us_covid = us_covid.drop(["Province/State","Country/Region","Recovered","Deaths"], axis=1)
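Since the same filter, rename and drop steps are repeated for each country, they can be wrapped in a small helper. The function name confirmed_for is hypothetical, the rows are made-up numbers laid out like the combined file, and the label "Singapore" for the country is an assumption about how the dataset names it. The helper selects the one wanted column instead of dropping the others, which gives the same result for this layout.

```python
import pandas as pd

def confirmed_for(countries_covid: pd.DataFrame, country: str, label: str) -> pd.DataFrame:
    """Return one '<label> Confirmed' column, indexed by date, for one country."""
    rows = countries_covid[countries_covid["Country/Region"] == country]
    rows = rows.rename(columns={"Confirmed": f"{label} Confirmed"})
    return rows[[f"{label} Confirmed"]]

# Made-up rows laid out like the combined time-series file.
dates = pd.to_datetime(["2020-03-01", "2020-03-02"])
countries_covid = pd.DataFrame(
    {
        "Country/Region": ["US", "Singapore", "US", "Singapore"],
        "Province/State": [None] * 4,
        "Confirmed": [10, 20, 15, 25],
        "Recovered": [1, 2, 1, 3],
        "Deaths": [0, 0, 1, 0],
    },
    index=pd.Index(dates.repeat(2), name="Date"),
)

us_covid = confirmed_for(countries_covid, "US", "US")
sg_covid = confirmed_for(countries_covid, "Singapore", "SG")
joined = sg_covid.join(us_covid)
print(joined)
```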


Student Activity
Join the US, UK and SG confirmed cases into a DataFrame
• mv_covid = sg_covid.join([uk_covid, us_covid])

Get the world confirmed cases and join with the countries
• world_covid = pd.read_csv('https://raw.githubusercontent.com/datasets/covid-19/main/data/worldwide-aggregate.csv', index_col="Date", parse_dates=True, infer_datetime_format=True)
• world_covid = world_covid.drop(["Recovered","Deaths","Increase rate"], axis=1)
• world_covid = validate_series(world_covid)
• world_covid = world_covid.rename(columns={"Confirmed":"World Confirmed"})
• mv_covid = mv_covid.join([world_covid])

Plot the anomalies
• pca_ad = PcaAD()
• s = PcaReconstructionError().fit_transform(mv_covid).rename("PCA Reconstruction Error")
• plot(pd.concat([mv_covid, s], axis=1), ts_linewidth=1, ts_markersize=3, curve_group=[("UK Confirmed","US Confirmed","SG Confirmed","World Confirmed"), "PCA Reconstruction Error"])
• anomalies = pca_ad.fit_detect(mv_covid)
• plot(mv_covid, anomaly=anomalies, anomaly_color='red', curve_group='all')


Student Activity
• The Covid confirmed case trends of the UK and SG are different from those of the world or the US.


Student Activity (Exercise D)

Detect anomalies in a few traded tickers that are usually closely related.
Plot the time series data and the reconstruction error, and also highlight the anomalies.
• !pip install yfinance --upgrade --no-cache-dir
• import yfinance as yf
• data_spy = yf.download("SPY", start="2018-01-01", end="2022-04-30")
• data_aapl = yf.download("AAPL", start="2018-01-01", end="2022-04-30")
• data_msft = yf.download("MSFT", start="2018-01-01", end="2022-04-30")
• data_amzn = yf.download("AMZN", start="2018-01-01", end="2022-04-30")
• data_spy.to_csv('data/spy.csv')
• data_aapl.to_csv('data/aapl.csv')
• data_msft.to_csv('data/msft.csv')
• data_amzn.to_csv('data/amzn.csv')


Student Activity (Exercise D)

We can write a function to clean the data by dropping the columns other than Volume and renaming Volume to, say, SPY_Volume.
• def clean_finance(data, name):
      data = data.drop(['Open','High','Low','Adj Close','Close'], axis=1)
      data = data.rename(columns={'Volume': name + "_Volume"})
      return data

Call the function to clean the data

• data_spy = clean_finance(data_spy, 'SPY')
• data_aapl = clean_finance(data_aapl, 'AAPL')
• data_msft = clean_finance(data_msft, 'MSFT')
• data_amzn = clean_finance(data_amzn, 'AMZN')


Student Activity (Exercise D)

Join the data

• data = data_spy.join([data_aapl, data_msft, data_amzn])

Plot the anomalies

• pca_ad = PcaAD()
• s = PcaReconstructionError().fit_transform(data).rename("PCA Reconstruction Error")
• anomalies = pca_ad.fit_detect(data)
• plot(pd.concat([data, s], axis=1), anomaly=anomalies, ts_linewidth=1, anomaly_color='red', curve_group=[('SPY_Volume','AAPL_Volume','MSFT_Volume','AMZN_Volume'), "PCA Reconstruction Error"])


Student Activity
• Many anomalies occur where one ticker had a far higher volume than the rest.
