Isolation Forest in Python
Data Source
For this, we will be using a subset of a larger dataset that was used as part of a Machine
Learning competition run by Xeek and FORCE 2020 (Bormann et al., 2020).
All of the examples within this article can be used with any dataset.
import pandas as pd
from sklearn.ensemble import IsolationForest
import seaborn as sns
Once these have been imported, we next need to load our data.
df = pd.read_csv('Data/Xeek_Well_15-9-15.csv')
df.describe()
The summary only shows the numeric data present within the file. If we want to take a look at
all of the features within the dataframe, we can call upon df.info(), which will tell us that we have 12 columns of data with varying levels of completeness.
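This can be run at any point with a single call:
df.info()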
As with many machine learning algorithms, we need to deal with missing values. As seen above, we have a few columns with incomplete data, such as NPHI (neutron porosity) with 13,346 values and GR (gamma ray) with 17,717 values.
The simplest way to deal with these missing values is to drop them. Even though this is a quick
method, it should not be done blindly and you should attempt to understand the reason for the
missing values. Removing these rows results in a reduced dataset when it comes to building
machine learning models.
df = df.dropna()
And if we call upon df again, we will see that we are now down to 13,290 values for every
column.
From our dataframe, we need to select the variables we will train our Isolation Forest model
with.
In this example, I am going to use just two variables (NPHI and RHOB). In reality, we would use
more and we will see an example of that later on. Using two variables allows us to visualise
what the algorithm has done.
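We can store these column names in a list; this is the anomaly_inputs list that is passed to the model below:
anomaly_inputs = ['NPHI', 'RHOB']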
Next, we will create an instance of our Isolation Forest model. This is done by creating a variable called model_IF and assigning an IsolationForest() instance to it.
We can then pass in a number of parameters for our model. The ones I have used in the code
below are:
contamination: This is how much of the overall data we expect to be considered as an outlier.
We can pass in a value between 0 and 0.5 or set it to auto.
random_state: This controls the randomness used to build the trees (the selection of features and split values). In other words, if we were to rerun the model with the same data, the same parameters, and a fixed value for random_state, we should get repeatable outputs.
model_IF = IsolationForest(contamination=0.1, random_state=42)
Once our model has been initialised, we can train it on the data. To do this, we call upon the .fit() function and pass our dataframe (df) to it, selecting only the columns (anomaly_inputs) that we defined earlier.
model_IF.fit(df[anomaly_inputs])
After fitting the model, we can generate predictions. We will do this by adding two new columns to our dataframe: an anomaly score (where lower values indicate a more anomalous point) and a predicted label:
df['anomaly_scores'] = model_IF.decision_function(df[anomaly_inputs])
df['anomaly'] = model_IF.predict(df[anomaly_inputs])
Once the anomalies have been identified, we can view our dataframe and see the result. Values of 1 indicate an inlier (a good data point), whilst values of -1 indicate an outlier.
df.loc[:, ['NPHI', 'RHOB', 'anomaly_scores', 'anomaly']]
In the returned values above, we can see the original input features, the generated anomaly
scores and whether that point is an anomaly or not.
Looking at the numeric values and trying to determine if the point has been identified as an
outlier or not can be tedious.
Instead, we can use seaborn to generate a basic figure. We can use the data we used to train
our model and visually split it up into outliers or inliers.
This simple function is designed to generate that plot and provide some additional metrics as
text. The function takes:
outlier_method_name : The name of the method we are using. This is just for display
purposes
xvar , yvar : The variables that we want to plot on the x and y axis respectively
xaxis_limits, yaxis_limits: The ranges to display on the x and y axes, each defaulting to [0, 1]
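A minimal sketch of such a function, assuming the anomaly column created above and using seaborn's FacetGrid to split the data into inliers and outliers, could look like this:

def outlier_plot(data, outlier_method_name, x_var, y_var,
                 xaxis_limits=[0, 1], yaxis_limits=[0, 1]):
    # Report some simple metrics as text
    print(f'Outlier method: {outlier_method_name}')
    print(f"Number of anomalous values: {len(data[data['anomaly'] == -1])}")
    print(f"Number of non-anomalous values: {len(data[data['anomaly'] == 1])}")
    print(f'Total number of values: {len(data)}')

    # One panel per class: inliers (1) and outliers (-1)
    g = sns.FacetGrid(data, col='anomaly', height=4, hue='anomaly', hue_order=[1, -1])
    g.map(sns.scatterplot, x_var, y_var)
    g.fig.suptitle(f'Outlier method: {outlier_method_name}', y=1.10, fontweight='bold')
    g.set(xlim=xaxis_limits, ylim=yaxis_limits)
    return g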
Once our function has been defined, we can then pass in the required parameters.
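For example, using axis ranges that suit neutron porosity and bulk density data (the same ranges as the call shown further below):

outlier_plot(df, 'Isolation Forest', 'NPHI', 'RHOB', [0, 0.8], [3, 1.5]);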
Right away we can tell how many values have been identified as outliers and where they are
located. As we are only using two variables, we can see that we have essentially formed a
separation between the points at the edge of the data and those in the centre.
The previous example used a value of 0.1 (10%) for the contamination parameter; what happens if we increase that to 0.3 (30%)?
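To do this, we create and fit a new model with the higher contamination value, keeping the same random_state as before, and then regenerate the scores and predictions:

model_IF = IsolationForest(contamination=0.3, random_state=42)
model_IF.fit(df[anomaly_inputs])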
df['anomaly_scores'] = model_IF.decision_function(df[anomaly_inputs])
df['anomaly'] = model_IF.predict(df[anomaly_inputs])
outlier_plot(df, 'Isolation Forest', 'NPHI', 'RHOB', [0, 0.8], [3, 1.5]);
We can see that significantly more points have been selected and identified as outliers.
Setting the contamination value allows us to control what percentage of values are flagged as outliers, but choosing that value can be tricky.
There are no hard and fast rules for picking this value, and it should be based on the domain
knowledge surrounding the data and its intended application(s).
For this particular dataset, we should consider other features such as borehole caliper and
delta-rho (DRHO) to help identify potentially poor data.
Now that we have seen the basics of using Isolation Forest with just two variables, let's see
what happens when we use a few more.
Instead of just looking at two of the variables, we can look at all of the variables we have used.
This is done by using the seaborn pairplot.
First, we need to set the palette, which will allow us to control the colours being used in the
plot.
Then, we can call upon sns.pairplot and pass in the required parameters.
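As a sketch, assuming we extend the input list to include some of the other logging curves mentioned in this article (the exact column names will depend on the dataset) and re-fit the model, the code could look like this:

# Re-fit the model on a larger set of features (assumed column names)
anomaly_inputs = ['NPHI', 'RHOB', 'GR', 'CALI', 'PEF', 'DRHO']
model_IF = IsolationForest(contamination=0.1, random_state=42)
model_IF.fit(df[anomaly_inputs])
df['anomaly_scores'] = model_IF.decision_function(df[anomaly_inputs])
df['anomaly'] = model_IF.predict(df[anomaly_inputs])

# Fixed palette so that outliers (-1) appear orange and inliers (1) appear blue
palette = {-1: 'tab:orange', 1: 'tab:blue'}
sns.pairplot(df, vars=anomaly_inputs, hue='anomaly', palette=palette)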
Orange points indicate outliers (-1) and blue points indicate inliers (1). Image by the author.
This provides us with a much better overview of the data, and we can now see some of the outliers clearly highlighted within the other features, especially PEF and GR.
Summary