Multivariate Anomaly Detection: MR Hew Ka Kian Hew - Ka - Kian@rp - Edu.sg

This document discusses multivariate anomaly detection using principal component analysis (PCA). It describes how PCA analyzes the relationship between dependent time series variables and derives new principal component feature values. The document provides an example of using COVID infection cases and hospital bed utilization as dependent time series for analysis. It also outlines the general workflow for multivariate anomaly detection, including preparing data, selecting an algorithm like PCA, training and testing a model, and using the model for inference. Finally, it describes student exercises for detecting anomalies in generator performance data and sensor readings in a room.

Uploaded by

Ng Kai Ting

OFFICIAL (CLOSED) \ NON-SENSITIVE

Multivariate Anomaly Detection
Mr Hew Ka Kian
[email protected]

Principal Component Analysis


• Multivariate anomaly detection involves 2 or more dependent time series
• Examples of dependent time series are
  • Covid infection cases and hospital bed utilization
  • A student’s lesson attendance and his/her module grade
• Principal Component Analysis (PCA) is a common technique used for multivariate anomaly detection
• PCA analyzes the relationship between the dependent variables and derives new feature values, the principal components.
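
To make the idea concrete, here is a minimal sketch of how a PCA reconstruction error can be computed with NumPy. The toy data, the variable names and the choice of keeping a single principal component are assumptions made for illustration; this is not the code of any particular library.

```python
import numpy as np

# Toy data (assumed): two dependent series where y is normally 3 * x.
x = np.arange(1.0, 11.0)
y = 3.0 * x
y[6] = 5.0  # break the usual relationship at one point

data = np.column_stack([x, y])

# Centre the data, then find the principal directions with an SVD.
centred = data - data.mean(axis=0)
_, _, components = np.linalg.svd(centred, full_matrices=False)

# Keep only the first principal component.
first_pc = components[:1]  # shape (1, 2)

# Project onto the component (the new feature values), then map back.
scores = centred @ first_pc.T
reconstructed = scores @ first_pc

# Reconstruction error: squared distance between each row and its reconstruction.
recon_error = ((centred - reconstructed) ** 2).sum(axis=1)

most_anomalous = int(np.argmax(recon_error))
print(most_anomalous)
```

The row where the relationship was deliberately broken (index 6) ends up with the largest reconstruction error, because it lies far from the direction that explains the rest of the data.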

Workflow

• Source data
• Prepare data
• Select the algorithm
• Train the model
• Test the model
• Use the model for inference

Source data

As we have learnt, many public datasets are available, or a company may use the metrics generated by its equipment.
One good source of public data that we will continue to use is the Singapore government’s public datasets at https://data.gov.sg .

Workflow

Prepare data

Filter the data
• The rows of interest, like “above 65 years old”
• The columns of interest, basically just the datetime and the feature columns

Transform
• Rename columns to more appropriate names instead of generic ones so that there is no name conflict when the data is combined. For example, replace the column name “cases” with “icu_cases”
• Combine multiple sources containing the different metrics for multivariate anomaly detection
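
The filter, rename and combine steps above can be sketched with pandas as follows. The two small tables, their column names ("age_group", "utilisation", "remarks") and their values are made up for illustration; only the "cases" to "icu_cases" rename comes from the slide.

```python
import pandas as pd

# Hypothetical source 1: cases by age group (column names are assumed).
cases = pd.DataFrame(
    {
        "date": pd.to_datetime(["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-02"]),
        "age_group": ["above 65 years old", "below 65 years old"] * 2,
        "cases": [12, 30, 15, 28],
        "remarks": ["", "", "", ""],
    }
)

# Hypothetical source 2: hospital bed utilisation.
beds = pd.DataFrame(
    {
        "date": pd.to_datetime(["2022-01-01", "2022-01-02"]),
        "utilisation": [0.71, 0.78],
    }
)

# Filter: keep the rows of interest, then only the datetime and feature columns.
icu = cases[cases["age_group"] == "above 65 years old"][["date", "cases"]]

# Transform: give the generic column a specific name to avoid clashes later.
icu = icu.rename(columns={"cases": "icu_cases"}).set_index("date")

# Combine: join the two sources on their datetime index.
combined = icu.join(beds.set_index("date"))
print(combined)
```

The result has one row per date and one column per metric, which is the shape a multivariate detector expects.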

Workflow

Select the algorithm

Use an algorithm for multivariate anomaly detection, such as
• PCA Reconstruction Error
• PCA Anomaly Detection, which highlights the data that has a high PCA Reconstruction Error as anomalous
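
What counts as a "high" reconstruction error has to be decided by some rule. The sketch below uses an interquartile-range bound with a factor c; both the rule and the sample error values are assumptions for illustration, not a statement of how any specific detector sets its threshold.

```python
import numpy as np

# Assumed reconstruction errors for ten time steps; one is clearly larger.
recon_error = np.array([0.2, 0.1, 0.3, 0.2, 0.1, 9.5, 0.2, 0.3, 0.1, 0.2])

# Assumed rule: anything above Q3 + c * IQR counts as "high".
c = 5.0
q1, q3 = np.percentile(recon_error, [25, 75])
upper_bound = q3 + c * (q3 - q1)

# Boolean mask of anomalous time steps.
is_anomaly = recon_error > upper_bound
print(np.flatnonzero(is_anomaly))
```

A larger c makes the rule more tolerant, so fewer points are flagged.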

Workflow

Train and Test

To train the model using ADTK, call the fit(df) function.

To test or use the model to detect anomalies, call the detect(df) function.

As a convenience, the fit_detect(df) function trains the model and then runs detection in a single call.
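
The difference between the three calls is easier to see with a small stand-in. The class below is a hypothetical, simplified detector written with NumPy and pandas that only mimics the fit / detect / fit_detect pattern; it is not ADTK's implementation, and its threshold rule and sample data are assumptions.

```python
import numpy as np
import pandas as pd


class SimplePcaDetector:
    """Hypothetical stand-in that mimics the fit / detect / fit_detect pattern."""

    def __init__(self, c=5.0):
        self.c = c  # assumed multiplier for the error threshold

    def _error(self, df):
        # Squared distance between each row and its one-component reconstruction.
        centred = df.to_numpy(dtype=float) - self.mean_
        reconstructed = (centred @ self.pc_.T) @ self.pc_
        return ((centred - reconstructed) ** 2).sum(axis=1)

    def fit(self, df):
        # Training: learn the mean, the first principal component and a threshold.
        values = df.to_numpy(dtype=float)
        self.mean_ = values.mean(axis=0)
        _, _, components = np.linalg.svd(values - self.mean_, full_matrices=False)
        self.pc_ = components[:1]
        q1, q3 = np.percentile(self._error(df), [25, 75])
        self.bound_ = q3 + self.c * (q3 - q1)
        return self

    def detect(self, df):
        # Testing or inference: apply what was learnt to (possibly new) data.
        return pd.Series(self._error(df) > self.bound_, index=df.index)

    def fit_detect(self, df):
        # Convenience: train and detect on the same data.
        return self.fit(df).detect(df)


train = pd.DataFrame(
    {"speed": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], "power": [3.1, 5.9, 9.2, 11.8, 15.1, 18.0]}
)
new = pd.DataFrame({"speed": [2.5, 5.0], "power": [7.4, 2.0]})

detector = SimplePcaDetector().fit(train)
print(detector.detect(new).tolist())
```

Calling fit on historical data and detect on new data keeps training and inference separate, while fit_detect suits a one-off analysis of a single dataset.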

Student Activity (Exercise A)

Discuss and complete Exercise A in the worksheet to detect multivariate anomalies in dependent time series data

Use Pandas to read CSV data from a local file into a DataFrame
index_col is the time column
parse_dates=True treats the index_col as datetime format
• import pandas as pd
• df = pd.read_csv('/somefolder/generator.csv', index_col="Time", parse_dates=True)

Check for errors in the DataFrame df using validate_series(df)

• from adtk.data import validate_series
• df = validate_series(df)
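
The effect of index_col and parse_dates can be checked without a real file. In this sketch the CSV content is inlined and its values are invented; the column names follow the ones used later in this exercise. The two hand-written checks at the end are only an assumed subset of what a validation step might cover.

```python
import io

import pandas as pd

# Assumed file content, inlined so that the example runs without a real file.
csv_text = """Time,Speed (kRPM),Power (kW)
2017-05-02 00:00:00,1.0,3.0
2017-05-01 00:00:00,0.9,2.8
2017-05-03 00:00:00,10.0,30.1
"""

df = pd.read_csv(io.StringIO(csv_text), index_col="Time", parse_dates=True)

# The index is now made of timestamps rather than plain strings.
print(type(df.index).__name__)

# The rows above are deliberately out of time order.
print(df.index.is_monotonic_increasing)

# Simple sanity checks done by hand: sort by time and look for duplicate timestamps.
df = df.sort_index()
has_duplicates = bool(df.index.duplicated().any())
print(has_duplicates)
```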

Source: Huawei Ascend AI Processors and Applications Elementary Course



Student Activity
Train and return the reconstruction error
• from adtk.transformer import PcaReconstructionError
• s = PcaReconstructionError().fit_transform(df).rename("PCA Reconstruction Error")

pd.concat([df, s], axis=1): combines the data DataFrame and the PCA reconstruction error so that we can call plot() on them
curve_group=[("Speed (kRPM)", "Power (kW)"), "PCA Reconstruction Error"]: draws 2 subplots, the 1st with the Speed and Power columns and the 2nd with the PCA Reconstruction Error column. We can add more subplots if we need.

• from adtk.visualization import plot
• plot(pd.concat([df, s], axis=1), curve_group=[("Speed (kRPM)", "Power (kW)"), "PCA Reconstruction Error"]);

We have plotted the PCA Reconstruction Error, where a high value means an anomaly. It is more convenient to plot the anomalies directly. Do that with PcaAD

• from adtk.detector import PcaAD
• pca_ad = PcaAD()
• anomalies = pca_ad.fit_detect(df)
• plot(df, anomaly=anomalies, ts_linewidth=1, anomaly_color='red', curve_group='all');


Student Activity
• Low power corresponds to low speed, and a high speed of around 10 kRPM corresponds to a power output of around 30 kW. But an anomaly is detected on 13 May 2017, when the high speed of 10 kRPM does not produce the expected power. The generator may be faulty.


Student Activity (Exercise B)


The temperature, humidity, light, CO2, humidity ratio (vapour vs dry air) and the occupancy of a room are
dependent variables.

occupancy1.csv contains the readings of the variables.


Student Activity
Read the CSV. Because a date such as 2/2/2015 is ambiguous in format, we pass dayfirst=True to tell Pandas that the first number is the day

•df_room = pd.read_csv('data/occupancy1.csv', index_col="date", parse_dates=True, dayfirst=True)
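
The effect of dayfirst can be seen on a tiny inlined CSV. The two rows below are invented for illustration; only the "date" column name and the day/month/year style come from the exercise.

```python
import io

import pandas as pd

# Assumed sample: the same text can mean 3 February or 2 March.
csv_text = "date,Temperature\n3/2/2015,23.1\n4/2/2015,23.4\n"

# Default parsing reads the first number as the month.
month_first = pd.read_csv(io.StringIO(csv_text), index_col="date", parse_dates=True)

# dayfirst=True reads the first number as the day.
day_first = pd.read_csv(
    io.StringIO(csv_text), index_col="date", parse_dates=True, dayfirst=True
)

print(month_first.index[0].month, day_first.index[0].month)
```

Without dayfirst=True the readings would silently land in the wrong month, which matters when anomalies are reported by date.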

validate_series checks the series for errors; this step is optional but recommended

•df_room = validate_series(df_room)

Get the PCA Reconstruction Error from the data

•s = PcaReconstructionError().fit_transform(df_room).rename("PCA Reconstruction Error")

Concatenate the data and the PCA error to plot them together. curve_group tells plot to create 2 subplots: one with the data columns in the first bracketed group, the other with the PCA error.
• plot(pd.concat([df_room, s], axis=1), curve_group=[("Light","CO2","Temperature","Humidity","HumidityRatio","Occupancy"), "PCA Reconstruction Error"]);

Detect the anomaly and plot the data with the anomaly. curve_group='all' means draw all in a single plot, no subplot.

•anomalies = pca_ad.fit_detect(df_room)
•plot(df_room, anomaly=anomalies, ts_linewidth=1, anomaly_color='red', curve_group='all');


Student Activity
• Anomaly detected at 02-03 18hr
• The combination of the variables is flagged as not similar to the rest.
• For multivariate anomaly, it is often not immediately obvious which
part of the combination of variables is anomalous. Investigating it
further is not the purpose of the worksheet


Student Activity

How can we plot data of a similar number range in the same subplot so that the smaller numbers are more visible?
Indicate in curve_group to plot “Light” and “CO2” in one subplot, “Temperature” and “Humidity” in another subplot, and so on
• plot(pd.concat([df_room, s], axis=1), curve_group=[("Light","CO2"), ("Temperature","Humidity"), ("HumidityRatio"), ("Occupancy"), "PCA Reconstruction Error"]);
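
Instead of grouping the columns by eye, the grouping could be derived from the data. The helper below is hypothetical (it is not part of ADTK or pandas) and groups column names by the order of magnitude of their largest value; the sample readings are invented.

```python
import math

import pandas as pd


def group_by_magnitude(df):
    """Hypothetical helper: group column names by the order of magnitude of their largest value."""
    groups = {}
    for column in df.columns:
        largest = df[column].abs().max()
        magnitude = math.floor(math.log10(largest)) if largest > 0 else 0
        groups.setdefault(magnitude, []).append(column)
    # A tuple puts several series in one subplot; a single name gets its own subplot.
    return [
        tuple(cols) if len(cols) > 1 else cols[0]
        for _, cols in sorted(groups.items(), reverse=True)
    ]


# Assumed sample readings in the same spirit as the room data.
sample = pd.DataFrame(
    {
        "Light": [420.0, 430.0],
        "CO2": [700.0, 720.0],
        "Temperature": [23.1, 23.4],
        "Humidity": [27.2, 27.3],
        "HumidityRatio": [0.0047, 0.0048],
        "Occupancy": [1, 1],
    }
)

groups = group_by_magnitude(sample)
print(groups)
```

The returned list has the same shape as the curve_group argument used above, so "PCA Reconstruction Error" can be appended to it before plotting.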


Student Activity (Exercise C)


Study the Covid-19 infection numbers of the US, UK, SG and the world

• countries_covid = pd.read_csv('https://raw.githubusercontent.com/datasets/covid-19/main/data/time-series-19-covid-combined.csv', index_col="Date", parse_dates=True)

Filter for the countries

• us_covid = countries_covid[countries_covid["Country/Region"] == "US"]

Rename the “Confirmed” column and drop the other columns

• us_covid = us_covid.rename(columns={"Confirmed": "US Confirmed"})
• us_covid = us_covid.drop(columns=["Province/State", "Country/Region", "Recovered", "Deaths"])


Student Activity
Join the US, UK and SG confirmed cases into one DataFrame (uk_covid and sg_covid are prepared in the same way as us_covid)
• mv_covid = sg_covid.join([uk_covid, us_covid])

Get the world confirmed cases and join them with the countries
• world_covid = pd.read_csv('https://raw.githubusercontent.com/datasets/covid-19/main/data/worldwide-aggregate.csv', index_col="Date", parse_dates=True)
• world_covid = world_covid.drop(columns=["Recovered", "Deaths", "Increase rate"])
• world_covid = validate_series(world_covid)
• world_covid = world_covid.rename(columns={"Confirmed": "World Confirmed"})
• mv_covid = mv_covid.join([world_covid])

Plot the anomalies
• pca_ad = PcaAD()
• s = PcaReconstructionError().fit_transform(mv_covid).rename("PCA Reconstruction Error")
• plot(pd.concat([mv_covid, s], axis=1), ts_linewidth=1, ts_markersize=3, curve_group=[("UK Confirmed", "US Confirmed", "SG Confirmed", "World Confirmed"), "PCA Reconstruction Error"])
• anomalies = pca_ad.fit_detect(mv_covid)
• plot(mv_covid, anomaly=anomalies, anomaly_color='red', curve_group='all')


Student Activity
• The Covid confirmed case trends of the UK and SG differ from those of the world and the US.


Student Activity (Exercise D)


Detect anomalies in a few traded tickers that are usually closely related
Plot the time series data and the reconstruction error, and highlight the anomalies
• !pip install yfinance --upgrade --no-cache-dir
• import yfinance as yf
• data_spy = yf.download("SPY", start="2018-01-01", end="2022-04-30")
• data_aapl = yf.download("AAPL", start="2018-01-01", end="2022-04-30")
• data_msft = yf.download("MSFT", start="2018-01-01", end="2022-04-30")
• data_amzn = yf.download("AMZN", start="2018-01-01", end="2022-04-30")
• data_spy.to_csv('data/spy.csv')
• data_aapl.to_csv('data/aapl.csv')
• data_msft.to_csv('data/msft.csv')
• data_amzn.to_csv('data/amzn.csv')


Student Activity (Exercise D)


We can write a function to clean the data by dropping the columns other than Volume and renaming Volume to, say, SPY_Volume
• def clean_finance(data, name):
      # Drop the price columns so that only Volume remains
      data = data.drop(['Open','High','Low','Adj Close','Close'], axis=1)
      # Prefix the column with the ticker name to avoid clashes when joining
      data = data.rename(columns={'Volume':name + "_Volume"})
      return data

Call the function to clean the data

• data_spy = clean_finance(data_spy, 'SPY')


• data_aapl = clean_finance(data_aapl, 'AAPL')
• data_msft = clean_finance(data_msft, 'MSFT')
• data_amzn = clean_finance(data_amzn, 'AMZN')
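
One caveat with dropping columns by name is that drop raises a KeyError when a listed column is absent, and the set of columns in a downloaded table may differ between data sources or library versions (an assumption worth checking against your own download). A variant that selects Volume instead of dropping everything else avoids that. The function name keep_volume and the sample table are hypothetical.

```python
import pandas as pd


def keep_volume(data, name):
    """Hypothetical variant of clean_finance: select Volume instead of dropping the rest."""
    return data[["Volume"]].rename(columns={"Volume": name + "_Volume"})


# Assumed stand-in for a downloaded price table; note there is no 'Adj Close' column.
prices = pd.DataFrame(
    {"Open": [10.0, 11.0], "Close": [10.5, 11.2], "Volume": [1000, 1500]},
    index=pd.to_datetime(["2018-01-02", "2018-01-03"]),
)

cleaned = keep_volume(prices, "SPY")
print(cleaned.columns.tolist())
```

The result has the same shape as the output of clean_finance, so the join and plotting steps that follow work unchanged.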


Student Activity (Exercise D)


Join the data

• data = data_spy.join([data_aapl,data_msft,data_amzn])

Plot the anomalies

• pca_ad = PcaAD()
• s = PcaReconstructionError().fit_transform(data).rename("PCA Reconstruction Error")
• anomalies = pca_ad.fit_detect(data)
• plot(pd.concat([data, s], axis=1), anomaly=anomalies, ts_linewidth=1, anomaly_color='red', curve_group=[('SPY_Volume','AAPL_Volume','MSFT_Volume','AMZN_Volume'), "PCA Reconstruction Error"])


Student Activity
• Many anomalies occur where one ticker had far more volume than the rest
