Isolation Forest in Python
First off, we quickly import some useful modules that we will be using later on. We generate a
dataset with random data points using the make_blobs() function.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=500, centers=1, cluster_std=2,
                     center_box=(0, 0))
plt.scatter(data[:, 0], data[:, 1])
plt.show()
We can easily eyeball some outliers since this is only a 2-D use case, which makes it a good way to prove that the algorithm works. Note that the algorithm can be used on a data set with many more features without any problem.
The number of trees controls the ensemble size. Path lengths usually converge well before t = 100, so unless otherwise specified we use t = 100 as the default value in our experiment.
Empirically, we find that setting the subsampling size to 256 generally provides enough detail to perform anomaly detection across a wide range of data.
n_estimators here stands for the number of trees, and max_samples stands for the subsampling size used in each round.
The contamination parameter stands for the proportion of outliers in the data set. By default, the anomaly-score threshold follows the original paper. However, we can fix the proportion of outliers manually if we have any prior knowledge about the data. We set it to 0.03 here for demonstration purposes.
We then fit and predict on the entire data set. This returns an array of -1s and 1s, where -1 stands for an anomaly and 1 stands for a normal instance.
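A minimal sketch of this step, using the parameter values discussed above (n_estimators=100, max_samples=256, contamination=0.03); the random_state values are arbitrary and only there for reproducibility:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

# same kind of toy data as above
data, _ = make_blobs(n_samples=500, centers=1, cluster_std=2,
                     center_box=(0, 0), random_state=42)

# n_estimators and max_samples follow the defaults suggested in the paper;
# contamination=0.03 marks roughly 3% of the points as outliers
iforest = IsolationForest(n_estimators=100, max_samples=256,
                          contamination=0.03, random_state=42)
preds = iforest.fit_predict(data)  # array of -1 (anomaly) / 1 (normal)

# highlight the detected anomalies in red
plt.scatter(data[:, 0], data[:, 1], label="normal")
plt.scatter(data[preds == -1, 0], data[preds == -1, 1],
            c="tab:red", label="anomaly")
plt.legend()
plt.show()
```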
We can see that it works pretty well and identifies the data points around the edges.
We can also call decision_function() to calculate the anomaly score of each data point. This way we can see which data points are more abnormal.
score = iforest.decision_function(data)
We pick the top 5 anomalies using the anomaly scores and then plot them again.
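One way to do this (a sketch, shown self-contained with the model refitted): decision_function() returns lower scores for more abnormal points, so the five smallest scores mark the top anomalies.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

data, _ = make_blobs(n_samples=500, centers=1, cluster_std=2,
                     center_box=(0, 0), random_state=42)
iforest = IsolationForest(n_estimators=100, max_samples=256,
                          contamination=0.03, random_state=42).fit(data)

score = iforest.decision_function(data)
# lower score = more abnormal, so take the 5 smallest scores
top5_idx = np.argsort(score)[:5]

plt.scatter(data[:, 0], data[:, 1])
plt.scatter(data[top5_idx, 0], data[top5_idx, 1], c="tab:red")
plt.show()
```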
Take-away
Isolation Forest is a fundamentally different outlier-detection model that can isolate anomalies at great speed. Its linear time complexity makes it one of the best choices for high-volume data sets.
It pivots on the concept that since anomalies are “few and different”, they are easier to isolate than normal points. Its Python implementation can be found at sklearn.ensemble.IsolationForest.
Thank you for taking time out of your busy schedule to sit down with me and enjoy this beautiful algorithm.