
Isolation Forest Python

First off, we quickly import some useful modules that we will be using later on. We generate a
dataset with random data points using the make_blobs() function.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate a single Gaussian blob of 500 points centered at the origin.
data, _ = make_blobs(n_samples=500, centers=1, cluster_std=2,
                     center_box=(0, 0))
plt.scatter(data[:, 0], data[:, 1])
plt.show()

We can easily eyeball some outliers since this is only a 2-D use case, which makes it a good way to verify that the algorithm works. Note that the algorithm can handle data sets with many more features without any problem; a quick sketch of a higher-dimensional variant follows.
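As an illustration (a hypothetical variant, not part of the original walkthrough), the same make_blobs() call accepts an n_features argument, so everything below would work unchanged on, say, 10-dimensional data:

# Hypothetical higher-dimensional variant: make_blobs() accepts
# n_features, and IsolationForest handles the extra columns unchanged.
data_10d, _ = make_blobs(n_samples=500, centers=1, cluster_std=2,
                         center_box=(0, 0), n_features=10)
print(data_10d.shape)  # (500, 10)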

We initialize an isolation forest object by calling IsolationForest().

The hyperparameters used here are mostly left at their defaults.

The number of trees controls the ensemble size. Empirically, we find that path lengths usually converge well before t = 100, so unless otherwise specified we use t = 100 as the default value.

Empirically, we find that setting the subsampling size to 256 generally provides enough detail to perform anomaly detection across a wide range of data.

n_estimators here stands for the number of trees, and max_samples stands for the subsampling size used to build each tree.

max_samples='auto' sets the subsampling size to min(256, n_samples).
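We can sanity-check this (a minimal sketch; the model name demo is ours, not part of the walkthrough): sklearn exposes the resolved subsampling size as the fitted attribute max_samples_.

from sklearn.ensemble import IsolationForest

# 'auto' should resolve to min(256, 500) = 256 for our 500-point data.
demo = IsolationForest(max_samples='auto').fit(data)
print(demo.max_samples_)  # expected: 256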

The contamination parameter stands for the proportion of outliers in the data set. By default, the anomaly score threshold follows the original paper. However, we can fix the proportion of outliers manually if we have any prior knowledge about the data. We set it to 0.03 here for demonstration purposes.
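A minimal sketch of how this plays out in sklearn (the model names below are ours): the fitted attribute offset_ holds the decision threshold, and decision_function(X) equals score_samples(X) - offset_, so points with a negative decision function are flagged as outliers.

from sklearn.ensemble import IsolationForest

# With contamination='auto', offset_ is fixed at -0.5, mirroring the
# original paper's scoring convention.
auto_model = IsolationForest(contamination='auto').fit(data)
print(auto_model.offset_)  # -0.5

# With a float, offset_ is chosen so that roughly that fraction of the
# training data gets a negative decision function.
fixed_model = IsolationForest(contamination=0.03).fit(data)
print(fixed_model.offset_)  # data-dependent threshold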

We then fit and predict on the entire data set. fit_predict() returns an array of -1 and 1 values, where -1 stands for an anomaly and 1 for a normal instance.

from sklearn.ensemble import IsolationForest

iforest = IsolationForest(n_estimators=100, contamination=0.03,
                          max_samples='auto')

prediction = iforest.fit_predict(data)
print(prediction[:20])

# Count the labels rather than summing them: summing the -1 labels
# would yield a negative number.
print("Number of outliers detected: {}".format((prediction < 0).sum()))
print("Number of normal samples detected: {}".format((prediction > 0).sum()))

We will then plot the outliers detected by Isolation Forest.

normal_data = data[np.where(prediction > 0)]
outliers = data[np.where(prediction < 0)]

plt.scatter(normal_data[:, 0], normal_data[:, 1])
plt.scatter(outliers[:, 0], outliers[:, 1])
plt.title("Random data points with outliers identified.")
plt.show()

We can see that it works pretty well and identifies the data points around the edges.

We can also call decision_function() to calculate the anomaly score of each data point. This way we can understand which data points are more abnormal; in sklearn, the lower the score, the more abnormal the point.

score = iforest.decision_function(data)
data_scores = pd.DataFrame(list(zip(data[:, 0], data[:, 1], score)),
                           columns=['X', 'Y', 'Anomaly Score'])
display(data_scores.head())

We pick the top 5 anomalies (the five lowest scores) and plot them again.

top_5_outliers = data_scores.sort_values(by=['Anomaly Score']).head()

plt.scatter(data[:, 0], data[:, 1])
plt.scatter(top_5_outliers['X'], top_5_outliers['Y'])
plt.title("Random data points with only 5 outliers identified.")
plt.show()

Take-away

Isolation Forest is a fundamentally different outlier detection model that can isolate anomalies at great speed. It has linear time complexity in the number of samples, which makes it one of the best choices for high-volume data sets.

It pivots on the concept that since anomalies are “few and different”, they are easier to isolate than normal points. Its Python implementation can be found at sklearn.ensemble.IsolationForest.
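For readers who want the underlying math: the original paper scores a point by its average path length E(h(x)) across trees, normalized by the average path length c(n) of an unsuccessful binary search tree search, giving s(x, n) = 2^(-E(h(x))/c(n)). The sketch below is our own illustration of that formula, not sklearn's internal code.

import numpy as np

def c(n):
    # Average path length of an unsuccessful BST search over n points:
    # c(n) = 2 * H(n - 1) - 2 * (n - 1) / n, with H(i) ~ ln(i) + 0.5772156649.
    if n <= 1:
        return 0.0
    harmonic = np.log(n - 1) + 0.5772156649
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    # s(x, n) = 2 ** (-E(h(x)) / c(n)): values close to 1 indicate
    # anomalies; values well below 0.5 indicate normal points.
    return 2.0 ** (-avg_path_length / c(n))

# A point isolated after ~3 splits in 256-point subsamples scores far
# higher than one that needs ~12 splits.
print(anomaly_score(3, 256), anomaly_score(12, 256))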

Thank you for taking time out of your busy schedule to sit down with me and enjoy this beautiful algorithm.
