Multivariate Anomaly Detection: MR Hew Ka Kian Hew - Ka - Kian@rp - Edu.sg
Multivariate
Anomaly Detection
Mr Hew Ka Kian
OFFICIAL (CLOSED) \ NON-SENSITIVE
Prepare data
As we have learnt, many public datasets are available, or a company may use the metrics generated by its own equipment.
One good source of public data that we will continue to use is the Singapore government’s public dataset portal at https://data.gov.sg.
Select the algorithm
PCA Anomaly Detection highlights data with a high PCA Reconstruction Error as anomalous.
Train the model
Discuss and complete Exercise A in the worksheet to detect multivariate anomalies in dependent time series data.
Use Pandas to read CSV data from a local file into a DataFrame
index_col is the time column
parse_dates=True treats the index column as datetime values
• import pandas as pd
• from adtk.data import validate_series
• df = pd.read_csv('/somefolder/generator.csv', index_col="Time", parse_dates=True)
• df = validate_series(df)
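As a minimal sketch of what index_col and parse_dates do, the following reads a small in-memory CSV instead of a local file. The column names and values are made up for illustration and are not taken from generator.csv.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a local CSV file.
csv_text = """Time,Speed (kRPM),Power (kW)
2017-05-01 00:00:00,1.0,3.1
2017-05-01 01:00:00,10.0,30.2
2017-05-01 02:00:00,9.8,29.5
"""

# index_col selects the time column as the index;
# parse_dates=True converts that index to datetime values.
df = pd.read_csv(io.StringIO(csv_text), index_col="Time", parse_dates=True)

print(type(df.index).__name__)
print(df.columns.tolist())
```

With a datetime index in place, the DataFrame is in the shape that time series tools generally expect.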
Student Activity
Train and return the reconstruction error
•from adtk.transformer import PcaReconstructionError
•s = PcaReconstructionError().fit_transform(df).rename("PCA Reconstruction Error")
pd.concat([df, s], axis=1) : combines the data DataFrame and the PCA reconstruction error so that we can call plot() on both
curve_group=[("Speed (kRPM)", "Power (kW)"), "PCA Reconstruction Error"] : plots 2 subplots, the 1st with the Speed and Power columns, the 2nd with the PCA Reconstruction Error column. We can add more subplots if needed.
•from adtk.visualization import plot
•plot(pd.concat([df, s], axis=1), curve_group=[("Speed (kRPM)", "Power (kW)"), "PCA Reconstruction Error"]);
We have plotted the PCA Reconstruction Error, where a high value indicates an anomaly. It is more convenient, however, to plot the anomalies directly. Do that with PcaAD.
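To make the idea of a reconstruction error concrete, here is a rough NumPy-only sketch. It is not adtk's implementation: the synthetic speed and power values, the single retained component, and the threshold of 5 times the interquartile range are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical generator readings: power is roughly 3 x speed.
speed = np.linspace(0.0, 10.0, 100)
power = 3.0 * speed + rng.normal(0.0, 0.3, size=speed.size)

# Inject one anomaly: high speed without the expected power.
speed[60], power[60] = 10.0, 5.0

X = np.column_stack([speed, power])
X_centered = X - X.mean(axis=0)

# PCA via SVD; keep only the first principal component (k = 1).
_, _, vt = np.linalg.svd(X_centered, full_matrices=False)
components = vt[:1]

# Project onto the component, then map back to the original space.
X_reconstructed = X_centered @ components.T @ components

# Reconstruction error: squared distance between each row and its reconstruction.
errors = np.sum((X_centered - X_reconstructed) ** 2, axis=1)

# Assumed rule for illustration: flag errors far above the typical range.
q1, q3 = np.percentile(errors, [25, 75])
threshold = q3 + 5.0 * (q3 - q1)
is_anomaly = errors > threshold

print(int(errors.argmax()), bool(is_anomaly[60]))
```

Points that follow the usual speed-to-power relationship lie close to the first component and reconstruct well, while the injected point does not, so its error stands out.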
Student Activity
• Low power corresponds to low speed, and a high speed of around 10 kRPM corresponds to a power output of around 30 kW. However, an anomaly is detected on 13 May 2017, when the high speed of 10 kRPM does not produce the expected power; the generator may be faulty.
Student Activity
Read the CSV. Because the format of dates such as 2/2/2015 is ambiguous, we pass dayfirst=True to tell Pandas that the first number is the day.
Validating the series checks it for errors; this step is optional but recommended.
•df_room = validate_series(df_room)
Concatenate the data and the PCA error to plot them together. curve_group tells plot to create 2 subplots: one with the data columns in the first bracketed group, the other with the PCA error.
•s = PcaReconstructionError().fit_transform(df_room).rename("PCA Reconstruction Error")
•plot(pd.concat([df_room, s], axis=1), curve_group=[("Light","CO2","Temperature","Humidity","HumidityRatio","Occupancy"), "PCA Reconstruction Error"]);
Detect the anomalies and plot the data with them. curve_group='all' draws all series in a single plot, with no subplots.
•pca_ad = PcaAD()
•anomalies = pca_ad.fit_detect(df_room)
•plot(df_room, anomaly=anomalies, ts_linewidth=1, anomaly_color='red', curve_group='all');
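The effect of dayfirst=True when reading the CSV can be illustrated with a small sketch. The sample rows and the column name "date" are hypothetical and are not taken from the room dataset.

```python
import io
import pandas as pd

# The same text parses differently depending on dayfirst.
print(pd.to_datetime("3/2/2015", dayfirst=True))   # read as day 3, month 2
print(pd.to_datetime("3/2/2015", dayfirst=False))  # read as month 3, day 2

# Hypothetical sample written in day/month/year order.
csv_text = """date,Temperature
2/2/2015,21.5
3/2/2015,21.7
"""

df_sample = pd.read_csv(
    io.StringIO(csv_text), index_col="date", parse_dates=True, dayfirst=True
)
print(df_sample.index.month.tolist())
```

Without dayfirst=True, the second row would be read as a date in March rather than February.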
Student Activity
• An anomaly is detected at 18:00 on 02-03.
• The combination of the variables is flagged as dissimilar to the rest.
• For multivariate anomalies, it is often not immediately obvious which part of the combination of variables is anomalous. Investigating it further is beyond the scope of the worksheet.
Student Activity
How can we plot data of a similar numeric range in the same subplot so that smaller numbers are more visible?
Indicate in curve_group to plot “Light” and “CO2” in one subplot, “Temperature” and “Humidity” in another subplot, and so on.
• plot(pd.concat([df_room, s], axis=1), curve_group=[("Light","CO2"), ("Temperature","Humidity"), "HumidityRatio", "Occupancy", "PCA Reconstruction Error"]);
• countries_covid = pd.read_csv('https://raw.githubusercontent.com/datasets/covid-19/main/data/time-series-19-covid-combined.csv', index_col="Date", parse_dates=True)
• us_covid = countries_covid[countries_covid["Country/Region"] == "US"]
Rename the “Confirmed” column
Drop the other columns
• us_covid = us_covid.rename(columns={"Confirmed":"US Confirmed"})
• us_covid = us_covid.drop(["Province/State","Country/Region","Recovered","Deaths"], axis=1)
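The filter, rename and drop steps above can be tried on a small hand-made table. The rows below are invented for illustration and only mimic the column layout of the combined dataset.

```python
import pandas as pd

# Hypothetical rows shaped like the combined country dataset.
countries = pd.DataFrame(
    {
        "Country/Region": ["US", "Singapore", "US", "Singapore"],
        "Province/State": [None, None, None, None],
        "Confirmed": [1, 0, 2, 1],
        "Recovered": [0, 0, 0, 0],
        "Deaths": [0, 0, 0, 0],
    },
    index=pd.to_datetime(["2020-01-22", "2020-01-22", "2020-01-23", "2020-01-23"]),
)
countries.index.name = "Date"

# Keep only the US rows.
us = countries[countries["Country/Region"] == "US"]

# Rename the column of interest so it stays distinct after joining.
us = us.rename(columns={"Confirmed": "US Confirmed"})

# Drop everything else, leaving one column indexed by date.
us = us.drop(columns=["Province/State", "Country/Region", "Recovered", "Deaths"])

print(us)
```

Renaming before joining matters because every country frame would otherwise carry the same "Confirmed" column name.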
Student Activity
Join the US, UK and SG confirmed cases into a DataFrame
•mv_covid = sg_covid.join([uk_covid,us_covid])
Get the world confirmed cases and join with the countries
•world_covid = pd.read_csv('https://raw.githubusercontent.com/datasets/covid-19/main/data/worldwide-aggregate.csv', index_col="Date", parse_dates=True)
•world_covid = world_covid.drop(["Recovered","Deaths","Increase rate"], axis=1)
•world_covid = validate_series(world_covid)
•world_covid = world_covid.rename(columns={"Confirmed":"World Confirmed"})
•mv_covid = mv_covid.join([world_covid])
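As a small sketch of how join lines up several single-column frames on their shared date index, consider the toy frames below. The counts are invented and the variable names are shortened for illustration.

```python
import pandas as pd

# Hypothetical per-country frames sharing the same date index.
dates = pd.to_datetime(["2020-01-22", "2020-01-23", "2020-01-24"])
sg = pd.DataFrame({"SG Confirmed": [0, 1, 3]}, index=dates)
uk = pd.DataFrame({"UK Confirmed": [0, 0, 2]}, index=dates)
us = pd.DataFrame({"US Confirmed": [1, 1, 2]}, index=dates)

# join accepts a list of frames and aligns them on the index.
mv = sg.join([uk, us])

print(mv.columns.tolist())
print(mv.shape)
```

The result has one row per date and one column per country, which is the multivariate layout the PCA steps work on.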
Plot the anomaly
•pca_ad = PcaAD()
•s = PcaReconstructionError().fit_transform(mv_covid).rename("PCA Reconstruction Error")
•plot(pd.concat([mv_covid, s], axis=1), ts_linewidth=1, ts_markersize=3, curve_group=[("UK Confirmed","US Confirmed","SG Confirmed", "World Confirmed"), "PCA Reconstruction Error"])
•anomalies = pca_ad.fit_detect(mv_covid)
•plot(mv_covid, anomaly=anomalies, anomaly_color='red', curve_group='all')
Student Activity
• The COVID-19 confirmed case trends of the UK and SG differ from those of the world and the US.
• data = data_spy.join([data_aapl,data_msft,data_amzn])
• pca_ad = PcaAD()
• s = PcaReconstructionError().fit_transform(data).rename("PCA Reconstruction Error")
• anomalies = pca_ad.fit_detect(data)
• plot(pd.concat([data, s], axis=1), anomaly=anomalies, ts_linewidth=1, anomaly_color='red', curve_group=[('SPY_Volume','AAPL_Volume','MSFT_Volume','AMZN_Volume'), "PCA Reconstruction Error"])
Student Activity
• Many anomalies occur where one ticker had far more volume than the rest