Isolation Forest in Python
Data Source
For this, we will be using a subset of a larger dataset that was used as part of a Machine
Learning competition run by Xeek and FORCE 2020 (Bormann et al., 2020).
All of the examples within this article can be used with any dataset.
import pandas as pd
from sklearn.ensemble import IsolationForest
import seaborn as sns
Once these have been imported, we next need to load our data.
df = pd.read_csv('Data/Xeek_Well_15-9-15.csv')
df.describe()
The summary only shows the numeric data present within the file. If we want to take a look at
all of the features within the dataframe, we can call upon df.info(), which will tell us that we have 12 columns of data with varying levels of completeness.
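This can be run at any point with a single call:
df.info()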
As with many machine learning algorithms, we need to deal with missing values. As seen above, we have a few columns with incomplete data, such as NPHI (neutron porosity) with 13,346 values and GR (gamma ray) with 17,717 values.
The simplest way to deal with these missing values is to drop them. Even though this is a quick
method, it should not be done blindly and you should attempt to understand the reason for the
missing values. Removing these rows results in a reduced dataset when it comes to building
machine learning models.
df = df.dropna()
And if we call upon df again, we will see that we are now down to 13,290 values for every
column.
From our dataframe, we need to select the variables we will train our Isolation Forest model
with.
In this example, I am going to use just two variables (NPHI and RHOB). In reality, we would use
more and we will see an example of that later on. Using two variables allows us to visualise
what the algorithm has done.
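We can store these column names in a list; this is the anomaly_inputs list that is passed to the model below:
anomaly_inputs = ['NPHI', 'RHOB']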
Next, we will create an instance of our Isolation Forest model. This is done by creating a variable called model_IF and assigning an IsolationForest() instance to it.
We can then pass in a number of parameters for our model. The ones I have used in the code
below are:
contamination: This is how much of the overall data we expect to be considered as an outlier.
We can pass in a value between 0 and 0.5 or set it to auto.
random_state: This controls the randomness used to build the trees (the selection of features and split values). In other words, if we were to rerun the model with the same data, the same parameters, and a fixed value for random_state, we should get repeatable outputs.
model_IF = IsolationForest(contamination=0.1, random_state=42)
Once our model has been initialised, we can train it on the data. To do this, we call upon the .fit() function and pass our dataframe (df) to it, selecting only the columns (anomaly_inputs) that we defined earlier.
model_IF.fit(df[anomaly_inputs])
After fitting the model, we can generate predictions. We will do this by adding two new columns to our dataframe: an anomaly score (where lower values indicate a more anomalous point) and a predicted label:
df['anomaly_scores'] = model_IF.decision_function(df[anomaly_inputs])
df['anomaly'] = model_IF.predict(df[anomaly_inputs])
Once the anomalies have been identified, we can view our dataframe and see the result. Values of 1 indicate an inlier (a good data point), whilst values of -1 indicate an outlier.
df.loc[:, ['NPHI', 'RHOB', 'anomaly_scores', 'anomaly']]
In the returned values above, we can see the original input features, the generated anomaly
scores and whether that point is an anomaly or not.
Looking at the numeric values and trying to determine if the point has been identified as an
outlier or not can be tedious.
Instead, we can use seaborn to generate a basic figure. We can use the data we used to train
our model and visually split it up into outliers or inliers.
This simple function is designed to generate that plot and provide some additional metrics as
text. The function takes:
outlier_method_name : The name of the method we are using. This is just for display
purposes
xvar , yvar : The variables that we want to plot on the x and y axis respectively
xaxis_limits, yaxis_limits: The ranges to display on the x and y axes, each defaulting to [0, 1]
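A minimal sketch of such a function, assuming the anomaly column created above and using seaborn's FacetGrid to split the data into inliers and outliers, could look like this:

def outlier_plot(data, outlier_method_name, x_var, y_var,
                 xaxis_limits=[0, 1], yaxis_limits=[0, 1]):
    # Report some simple metrics as text
    print(f'Outlier method: {outlier_method_name}')
    print(f"Number of anomalous values: {len(data[data['anomaly'] == -1])}")
    print(f"Number of non-anomalous values: {len(data[data['anomaly'] == 1])}")
    print(f'Total number of values: {len(data)}')

    # One panel per class: inliers (1) and outliers (-1)
    g = sns.FacetGrid(data, col='anomaly', height=4, hue='anomaly', hue_order=[1, -1])
    g.map(sns.scatterplot, x_var, y_var)
    g.fig.suptitle(f'Outlier method: {outlier_method_name}', y=1.10, fontweight='bold')
    g.set(xlim=xaxis_limits, ylim=yaxis_limits)
    return g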
Once our function has been defined, we can then pass in the required parameters.
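For example, using axis ranges that suit neutron porosity and bulk density data (the same ranges as the call shown further below):

outlier_plot(df, 'Isolation Forest', 'NPHI', 'RHOB', [0, 0.8], [3, 1.5]);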
Right away we can tell how many values have been identified as outliers and where they are
located. As we are only using two variables, we can see that we have essentially formed a
separation between the points at the edge of the data and those in the centre.
The previous example used a value of 0.1 (10%) for the contamination parameter; what happens if we increase that to 0.3 (30%)?
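To do this, we create and fit a new model with the higher contamination value, keeping the same random_state as before, and then regenerate the scores and predictions:

model_IF = IsolationForest(contamination=0.3, random_state=42)
model_IF.fit(df[anomaly_inputs])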
df['anomaly_scores'] = model_IF.decision_function(df[anomaly_inputs])
df['anomaly'] = model_IF.predict(df[anomaly_inputs])
outlier_plot(df, 'Isolation Forest', 'NPHI', 'RHOB', [0, 0.8], [3, 1.5]);
We can see that significantly more points have been selected and identified as outliers.
Setting the contamination value allows us to control what percentage of values are flagged as outliers, but choosing that value can be tricky.
There are no hard and fast rules for picking this value, and it should be based on the domain
knowledge surrounding the data and its intended application(s).
For this particular dataset, we should consider other features such as borehole caliper and
delta-rho (DRHO) to help identify potentially poor data.
Now that we have seen the basics of using Isolation Forest with just two variables, let's see
what happens when we use a few more.
Instead of just looking at two of the variables, we can look at all of the variables we have used.
This is done by using the seaborn pairplot.
First, we need to set the palette, which will allow us to control the colours being used in the
plot.
Then, we can call upon sns.pairplot and pass in the required parameters.
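As a sketch, assuming we extend the input list to include some of the other logging curves mentioned in this article (the exact column names will depend on the dataset) and re-fit the model, the code could look like this:

# Re-fit the model on a larger set of features (assumed column names)
anomaly_inputs = ['NPHI', 'RHOB', 'GR', 'CALI', 'PEF', 'DRHO']
model_IF = IsolationForest(contamination=0.1, random_state=42)
model_IF.fit(df[anomaly_inputs])
df['anomaly_scores'] = model_IF.decision_function(df[anomaly_inputs])
df['anomaly'] = model_IF.predict(df[anomaly_inputs])

# Fixed palette so that outliers (-1) appear orange and inliers (1) appear blue
palette = {-1: 'tab:orange', 1: 'tab:blue'}
sns.pairplot(df, vars=anomaly_inputs, hue='anomaly', palette=palette)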
Orange points indicate outliers (-1) and blue points indicate inliers (1). Image by the author.
This provides us with a much better overview of the data, and we can now see some of the outliers clearly highlighted within the other features, especially PEF and GR.
Summary