A Step-By-Step Guide To Robust ML Classification
How to avoid common pitfalls and dig deeper into our models
Ryan Burke · Towards Data Science · Mar 3, 2023 · 17 min read
Photo by Luca Bravo on Unsplash
If you would like to see the whole notebook, please check it out → here ←
Libraries
Below, you will find a list of the libraries I used for today’s analyses. They
consist of the standard data science toolkit along with the necessary sklearn
libraries.
import sys
import os
import pandas as pd
import numpy as np
import plotly.offline as py
import plotly.graph_objs as go
import plotly.tools as tls
py.init_notebook_mode(connected=True)
import warnings
warnings.filterwarnings('ignore')
import swifter
# sklearn components used throughout the article
from sklearn import datasets
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
SEED = 42  # fixed random seed used throughout (the exact value isn't shown in the article)
Data
Today’s dataset includes the forest cover data that is ready-to-employ with
sklearn. Here’s a description from sklearn’s site.
The samples in this dataset correspond to 30×30m patches of forest in the US,
collected for the task of predicting each patch’s cover type, i.e. the dominant
species of tree. There are seven cover types, making this a multi-class
classification problem. Each sample has 54 features, described on the dataset’s
homepage. Some of the features are boolean indicators, while others are
discrete or continuous measurements.
Target variable:
Cover_Type (7 types) / integer / 1 to 7 / Forest Cover Type designation
Load dataset
Here’s a simple function to load this data into your notebook as a dataframe.
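A minimal sketch of the sklearn_to_df helper (not reproduced in the article; assuming it simply wraps the returned Bunch in a dataframe and appends the target column):
def sklearn_to_df(sklearn_dataset):
    # build a dataframe from the feature matrix and append the target as the last column
    df = pd.DataFrame(sklearn_dataset.data, columns=sklearn_dataset.feature_names)
    df['target'] = sklearn_dataset.target
    return df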
df = sklearn_to_df(datasets.fetch_covtype())
df_name=df.columns
df.head(3)
Using df.info() and df.describe() to get to know our data better, we see that
there is no missing data and that all of the variables are quantitative. The
dataset is also rather large (> 580,000 rows). I originally tried to run this on
the entire dataset, but it took FOREVER, so I recommend using a fraction of
the data.
Regarding the target variable, which is the forest cover class, using
df.target.value_counts(), we see the following distribution (in descending
order):
Class 2 = 283,301
Class 1 = 211,840
Class 3 = 35,754
Class 7 = 20,510
Class 6 = 17,367
Class 5 = 9,493
Class 4 = 2,747
It is important to note that our classes are imbalanced and we will need to
keep this in mind when selecting a metric to evaluate our models.
Let’s say we plan on scaling our data using the whole dataset. The equations
below are taken from their respective links.
Ex1 StandardScaler()
z = (x - u) / s, where u is the mean and s is the standard deviation of the feature.
Ex2 MinMaxScaler()
X_scaled = (X - X_min) / (X_max - X_min), where X_min and X_max are the feature's minimum and maximum (for the default 0 to 1 range).
The most important thing to notice is that these transformations rely on
statistics computed from the data: the mean, standard deviation, minimum, and
maximum. If we fit these scalers before splitting, the features in our train
set will be computed using information contained in the test set. This is an
example of data leakage.
Data leakage is when information from outside the training dataset is used to
create the model. This additional information can allow the model to learn or
know something that it otherwise would not know, and in turn invalidate the
estimated performance of the model being constructed.
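To make the pitfall concrete, here is a minimal sketch (my illustration, not the article's code) contrasting the leaky approach with the leak-free one; X, X_train, and X_test refer to the frames created in the split below:
# Leaky: the scaler's mean and std are computed over ALL rows, including future test rows
X_scaled = StandardScaler().fit_transform(X)

# Leak-free: fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)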
Therefore, the first step after getting to know our dataset is to split it and
keep the test set unseen until the very end. In the code below, we split the
data into 80% (training set) and 20% (test set). You will also note that I have
only kept 50,000 total samples to reduce the time it takes to train and evaluate
our models. Trust me, you will thank me later!
It is also worth noting that we are stratifying on the target variable. This is
good practice for imbalanced datasets as it maintains the distribution of
classes in the train and test set. If we don’t do this, there’s a chance that some
of the underrepresented classes aren’t even present in our train or test sets.
# here we first separate our df into features (X) and target (y)
X = df[df_name[0:54]]
Y = df[df_name[54]]

# now we separate into training (80%) and test (20%) sets; the test set won't be seen until we test our top model!
X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    train_size=40_000,
                                                    test_size=10_000,
                                                    random_state=SEED,
                                                    stratify=Y)  # stratify to ensure a similar class distribution in the train/test sets
Feature engineering
With our train and test sets ready, we can now work on the fun stuff. The
first step in this project is to generate some features that could add useful
information to train our models.
This step can be a little tricky. In the real world, it requires domain-specific
knowledge of the particular subject you are working on. To be
completely transparent with you, despite being a lover of nature and
everything outdoors, I am no expert in why certain trees grow in specific
areas.
For this reason, I have consulted [1] [2] [3], who have a better understanding
of this domain than I do. I have combined the knowledge from these
references to create the features you will find below.
# excerpt of the feature-engineering function (function name assumed; the full version is in the notebook)
def add_engineered_features(X):
    # straight-line (Euclidean) distance to the nearest surface water
    X['Hydro_Euclidean'] = np.sqrt(X['Horizontal_Distance_To_Hydrology']**2 +
                                   X['Vertical_Distance_To_Hydrology']**2)
    # combined horizontal + vertical distance to surface water (absolute value)
    X['Hydro_Manhattan'] = abs(X['Horizontal_Distance_To_Hydrology'] +
                               X['Vertical_Distance_To_Hydrology'])
    return X
On a side note, when you are working with large datasets, pandas can be
somewhat slow. Using swifter when applying functions to your dataframe (as is
done in the full notebook) can significantly speed things up. The article
→ here compares several methods used to speed this process up.
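As an illustration (not the article's exact code), swifter drops straight into the usual pandas .apply() call; a hypothetical row-wise version of the Euclidean distance feature could be written as:
import swifter  # importing swifter registers the .swifter accessor on pandas objects

# swifter parallelizes .apply() where possible
X['Hydro_Euclidean'] = X.swifter.apply(
    lambda row: np.sqrt(row['Horizontal_Distance_To_Hydrology']**2 +
                        row['Vertical_Distance_To_Hydrology']**2),
    axis=1)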
Feature selection
At this point we have more than 70 features. If the goal is to end up with the
best-performing model, then you could try to use all of these as inputs. With
that said, in business there is often a trade-off between performance and
complexity that needs to be considered.
Keeping that in mind, I will perform feature selection to try to reduce the
complexity right away. Sklearn provides many options worth considering. In
this example, I will use SelectKBest, which keeps a pre-specified number of
features: those with the highest scores according to a scoring function. Below,
I have requested (and listed) the best-performing 15 features. These are the
features that I will use to train the models in the following section.
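The selection code isn't reproduced here; a minimal sketch (assuming f_classif, SelectKBest's default scoring function) might look like this:
from sklearn.feature_selection import SelectKBest, f_classif

# score each feature against the target and keep the 15 highest-scoring ones
selector = SelectKBest(score_func=f_classif, k=15)
selector.fit(X_train, y_train)

# names of the selected columns, plus reduced train/test frames for later use
X_train_reduced_cols = X_train.columns[selector.get_support()]
X_train_reduced = X_train[X_train_reduced_cols]
X_test_reduced = X_test[X_train_reduced_cols]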
X_train_reduced_cols
Baseline models
In this section I will compare three different classifiers:
KNeighboursClassifier
RandomForestClassifier
ExtraTreesClassifier
I have provided links for those who wish to investigate each model further.
They will also be helpful in the section on hyperparameter tuning, where
you can find all modifiable parameters when trying to improve your models.
Below you will find two functions to define and evaluate the baseline
models.
# baseline models
def GetBaseModels():
    baseModels = []
    baseModels.append(('KNN', KNeighborsClassifier()))
    baseModels.append(('RF', RandomForestClassifier()))
    baseModels.append(('ET', ExtraTreesClassifier()))
    return baseModels
# evaluation of the baseline models (function name and signature assumed; the article uses 10-fold CV and the weighted F1 score)
def BaselineEvaluation(models, X_train, y_train, num_folds=10, scoring='f1_weighted'):
    results = []
    names = []
    for name, model in models:
        kfold = StratifiedKFold(n_splits=num_folds, random_state=SEED, shuffle=True)
        cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring, n_jobs=-1)
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)
    return results, names
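Putting the two together, the baseline run then looks roughly like this (again a sketch; X_train_reduced holds the 15 selected features from the earlier step):
models = GetBaseModels()
results, names = BaselineEvaluation(models, X_train_reduced, y_train)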
There are some key elements in the second function that are worth
discussing further. The first of these is StratifiedKFold, which cross-validates
within the training data while preserving the class distribution in each fold.
Recall, we split the original dataset into 80% training and 20% test. The test
set will be reserved for the final evaluation of our top-performing model.
The second point worth discussing is the scoring metric. There are many
metrics available to evaluate the performance of your models, and often
there are several that could suit your project. It’s important to keep in mind
what you are trying to demonstrate with the results. If you work in a
business setting, often the metric that is most easily explained to those
without a data background is preferred.
On the other hand, there are metrics that are unsuitable for your analyses.
For this project, we have imbalanced classes. If you go to the link provided
above, you will find options for this case. I opted to use the weighted F1
score. Let’s briefly discuss why I chose this metric.
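A brief note on the metric itself (my explanation, not a quote from the article): the weighted F1 score computes an F1 score for each class and then averages them, weighting each class by its support, so minority classes count in proportion to how often they occur. The 'f1_weighted' scoring string passed to cross_val_score corresponds to:
from sklearn.metrics import f1_score

# y_true and y_pred are hypothetical arrays of true and predicted class labels
score = f1_score(y_true, y_pred, average='weighted')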
After training the baseline models, I have plotted the results from each
below. The baseline models all performed relatively well. Remember, at this
point I have done nothing to the data (i.e. transform, remove outliers). The
Extra trees classifier had the highest weighted F1 score at 86.9%.
Results from the 10-fold CV. KNN had the lowest at 78.8%, the RF was second with 85.9%, and the ET had the
highest weighted F1 score at 86.9%. Image provided by author
Next, we can test whether scaling the data helps, particularly for the
distance-based KNN. The function below couples the chosen scaler with each of
our classifiers in a pipeline.
def GetScaledModel(nameOfScaler):
    if nameOfScaler == 'standard':
        scaler = StandardScaler()
    elif nameOfScaler == 'minmax':
        scaler = MinMaxScaler()
    else:
        raise ValueError("nameOfScaler must be 'standard' or 'minmax'")

    pipelines = []
    pipelines.append((nameOfScaler+'KNN', Pipeline([('Scaler', scaler), ('KNN', KNeighborsClassifier())])))
    pipelines.append((nameOfScaler+'RF', Pipeline([('Scaler', scaler), ('RF', RandomForestClassifier())])))
    pipelines.append((nameOfScaler+'ET', Pipeline([('Scaler', scaler), ('ET', ExtraTreesClassifier())])))
    return pipelines
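These pipelines can then be pushed through the same cross-validation loop as the baselines, for example (reusing the evaluation helper sketched earlier):
scaled_models = GetScaledModel('standard')
results_std, names_std = BaselineEvaluation(scaled_models, X_train_reduced, y_train)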
The results using the StandardScaler are presented below. We see that our
hypothesis regarding scaling the data appears to hold. The random forest and
extra trees classifiers performed nearly identically to their unscaled
versions, whereas the KNN improved in performance by roughly 4%. Despite this
increase, the two tree-based classifiers still outperform the scaled KNN.
Results from the 10-fold CV using the StandardScaler to transform our data. KNN still had the lowest although
the performance increased to 83.8%. The RF was second with 85.8%, and the ET once again had the highest
weighted F1 score at 86.8%. Image provided by author
Similar results can be seen when the MinMaxScaler is used. The results from
all models are almost identical to those presented using the StandardScaler.
Results from the 10-fold CV using the MinMaxScaler to transform our data. Each performed almost identically
to those using StandardScaler. KNN still had the lowest at 83.9%. The RF was second with 86.0%, and the ET
once again had the highest weighted F1 score at 87.0%. Image provided by author
It is worth noting at this point that I also checked the effect of removing
outliers. For this, I removed values that were beyond +/- 3 SD for each
feature. I am not presenting the results here because there were no values
outside this range. If you are interested in seeing how this was performed,
please feel free to check out the notebook found at the link provided at the
beginning of this article.
Hyperparameter tuning
I chose to use GridSearchCV (the CV stands for cross-validated). Below you will
find a function that runs a grid search with 10-fold cross-validation for the
models we have been using. The one additional detail here is that we need to
provide the list of hyperparameters we want to be evaluated.
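The tuning function itself isn't reproduced here; a minimal sketch of a weighted-F1, 10-fold grid search (the function name is my own) could look like:
def GridSearchModel(model, param_grid, X, y, num_folds=10):
    # evaluate every combination in param_grid with stratified k-fold CV,
    # scoring each candidate with the weighted F1 score
    kfold = StratifiedKFold(n_splits=num_folds, random_state=SEED, shuffle=True)
    grid = GridSearchCV(estimator=model, param_grid=param_grid,
                        scoring='f1_weighted', cv=kfold, n_jobs=-1)
    grid.fit(X, y)
    print("Best: %f using %s" % (grid.best_score_, grid.best_params_))
    return grid.best_params_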
Up to this point, we have not even looked at our test set. Before commencing
the grid search, we will scale our train and test data using the
StandardScaler. We are doing this here because we are going to find the best
hyperparameters for each model and use those as inputs into a
VotingClassifier (as we will discuss in the next section).
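Concretely, the scaling step, followed by a call to the grid-search sketch above with a purely illustrative parameter grid, might look like this (X_train_reduced and X_test_reduced are the 15-feature frames from the selection step):
# fit the scaler on the training features only, then apply the same transformation to both sets
scaler = StandardScaler().fit(X_train_reduced)
X_train_scaled = scaler.transform(X_train_reduced)
X_test_scaled = scaler.transform(X_test_reduced)

# illustrative grid for one model; the best parameters found here later feed the VotingClassifier
et_grid = {'n_estimators': [100, 300], 'max_depth': [None, 20]}
best_et_params = GridSearchModel(ExtraTreesClassifier(), et_grid, X_train_scaled, y_train)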