A step-by-step guide to robust ML classification
How to avoid common pitfalls and dig deeper into our models

Ryan Burke · Towards Data Science · Mar 2023 · 17 min read

Photo by Luca Bravo on Unsplash

In previous articles, I focused mainly on presenting individual algorithms


that I found interesting. Here, I walk through a complete ML classification
project. The goal is to touch on some of the common pitfalls in ML projects
and describe to the readers how to avoid them. I will also demonstrate how
we can go further by analysing our model errors to gain important insights
that normally go unseen.

If you would like to see the whole notebook, please check it out → here ←

Libraries
Below, you will find a list of the libraries I used for today’s analyses. They
consist of the standard data science toolkit along with the necessary sklearn
libraries.

import sys
import os
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
%matplotlib inline

import plotly.offline as py
import plotly.graph_objs as go
import plotly.tools as tls
py.init_notebook_mode(connected=True)

import warnings
warnings.filterwarnings('ignore')

from pandas import set_option
from pandas.plotting import scatter_matrix

from sklearn.preprocessing import StandardScaler, MinMaxScaler, QuantileTransformer, RobustScaler
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.feature_selection import RFECV, SelectFromModel, SelectKBest, f_classif
from sklearn.metrics import classification_report, confusion_matrix, balanced_accuracy_score, ConfusionMatrixDisplay, f1_score
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
from scipy.stats import uniform

from imblearn.over_sampling import ADASYN

import swifter

# Always good to set a seed for reproducibility
SEED = 8
np.random.seed(SEED)

Data
Today’s dataset is the forest cover type data, which can be loaded directly with
sklearn. Here’s a description from sklearn’s site.

Data Set Characteristics:

The samples in this dataset correspond to 30×30m patches of forest in the US,
collected for the task of predicting each patch’s cover type, i.e. the dominant
species of tree. There are seven cover types, making this a multi-class
classification problem. Each sample has 54 features, described on the dataset’s
homepage. Some of the features are boolean indicators, while others are
discrete or continuous measurements.

Number of Instances: 581 012

Feature information (Name / Data Type / Measurement / Description)

Elevation / quantitative / meters / Elevation in meters

Aspect / quantitative / azimuth / Aspect in degrees azimuth

Slope / quantitative / degrees / Slope in degrees

Horizontal_Distance_To_Hydrology / quantitative / meters / Horz Dist to nearest surface water features

Vertical_Distance_To_Hydrology / quantitative / meters / Vert Dist to nearest surface water features

Horizontal_Distance_To_Roadways / quantitative / meters / Horz Dist to nearest roadway

Hillshade_9am / quantitative / 0 to 255 index / Hillshade index at 9am, summer solstice

Hillshade_Noon / quantitative / 0 to 255 index / Hillshade index at noon, summer solstice

Hillshade_3pm / quantitative / 0 to 255 index / Hillshade index at 3pm, summer solstice

Horizontal_Distance_To_Fire_Points / quantitative / meters / Horz Dist to nearest wildfire ignition points

Wilderness_Area (4 binary columns) / qualitative / 0 (absence) or 1 (presence) / Wilderness area designation

Soil_Type (40 binary columns) / qualitative / 0 (absence) or 1 (presence) / Soil Type designation

Number of classes:
Cover_Type (7 types) / integer / 1 to 7 / Forest Cover Type designation

Load dataset
Here’s a simple function to load this data into your notebook as a dataframe.

# the 54 column names, in the order returned by fetch_covtype
columns = ['Elevation', 'Aspect', 'Slope',
           'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology',
           'Horizontal_Distance_To_Roadways',
           'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
           'Horizontal_Distance_To_Fire_Points'] \
          + [f'Wilderness_Area_{i}' for i in range(4)] \
          + [f'Soil_Type_{i}' for i in range(40)]

from sklearn import datasets

def sklearn_to_df(sklearn_dataset):
    df = pd.DataFrame(sklearn_dataset.data, columns=columns)
    df['target'] = pd.Series(sklearn_dataset.target)
    return df

df = sklearn_to_df(datasets.fetch_covtype())
df_name = df.columns
df.head(3)

Using df.info() and df.describe() to get to know our data better, we see that
there are no missing values and that all of the variables are numeric. The
dataset is also rather large (> 580,000 rows). I originally tried to run this on
the entire dataset, but it took FOREVER, so I recommend using a fraction of
the data.
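
For reference, these quick checks are just one-liners (a minimal sketch; output omitted):

# quick sanity checks on the raw dataframe
df.info()               # column dtypes and non-null counts -> no missing values
display(df.describe())  # summary statistics for each numeric column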

Regarding the target variable, which is the forest cover class, using
df.target.value_counts(), we see the following distribution (in descending
order):

Class 2 = 283,301
Class 1 = 211,840
Class 3 = 35,754
Class 7 = 20,510
Class 6 = 17,367
Class 5 = 9,493
Class 4 = 2,747

It is important to note that our classes are imbalanced and we will need to
keep this in mind when selecting a metric to evaluate our models.

Prepare your data


One of the most common mistakes when building ML models is preprocessing
the data before splitting it into train and test sets. Why is this a problem?

Let’s say we plan on scaling our data using the whole dataset. The equations
below are taken from their respective links.

Ex1 StandardScaler()

z = (x - u) / s

Ex2 MinMaxScaler()

X_std = (X - X.min()) / (X.max() - X.min())
X_scaled = X_std * (max - min) + min

The most important thing we should notice is they include information such
as mean, standard deviation, min, max. If we perform these functions prior
to splitting, the features in our train set will be computed based on
information included in the test set. This is an example of data leakage.

Data leakage is when information from outside the training dataset is used to
create the model. This additional information can allow the model to learn or
know something that it otherwise would not know and in turn invalidate the
estimated performance of the model being constructed.
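
To make the leak-free pattern concrete, here is a small sketch of my own (not from the original article): fit the scaler on the training split only, then apply that fitted transform to both splits.

# WRONG: fitting on the full dataset leaks test-set statistics (mean, std) into training
# scaler = StandardScaler().fit(X)

# RIGHT: fit on the training split only, then transform both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # uses the training mean/std only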

Therefore, the first step after getting to know our dataset is to split it and
keep your test set unseen until the very end. In the code below, we split the
data into 80% (training set) and 20% (test set). You will also note that I have
only kept 50,000 total samples to reduce the time it takes to train & evaluate
our models. Trust me, you will thank me later!

It is also worth noting that we are stratifying on the target variable. This is
good practice for imbalanced datasets as it maintains the distribution of
classes in the train and test set. If we don’t do this, there’s a chance that some
of the underrepresented classes aren’t even present in our train or test sets.

# here we are first separating our df into features (X) and target (y)
X = df[df_name[0:54]]
Y = df[df_name[54]]

# now we split into training (80%) and test (20%) sets. The test set won't be seen until we test our top model!
X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    train_size=40_000,
                                                    test_size=10_000,
                                                    random_state=SEED,
                                                    stratify=df['target'])  # stratify to keep a similar class distribution in train/test

Feature engineering
With our train and test sets ready, we can now work on the fun stuff. The
first step in this project is to generate some features that could add useful
information to train our models.

This step can be a little tricky. In the real world, it requires domain-specific
knowledge of the subject you are working on. To be completely transparent
with you, despite being a lover of nature and everything outdoors, I am no
expert in why certain trees grow in specific areas.

For this reason, I have consulted [1] [2] [3] who have a better understanding
of this domain than myself. I have amalgamated the knowledge from these
references to create the features you will find below.

# engineering new columns from our df
def FeatureEngineering(X):

    X['Aspect'] = X['Aspect'] % 360
    X['Aspect_120'] = (X['Aspect'] + 120) % 360

    X['Hydro_Elevation_sum'] = X['Elevation'] + X['Vertical_Distance_To_Hydrology']
    X['Hydro_Elevation_diff'] = abs(X['Elevation'] - X['Vertical_Distance_To_Hydrology'])

    # straight-line distance to water
    X['Hydro_Euclidean'] = np.sqrt(X['Horizontal_Distance_To_Hydrology']**2 +
                                   X['Vertical_Distance_To_Hydrology']**2)
    # Manhattan-style distance: horizontal plus absolute vertical component
    X['Hydro_Manhattan'] = (X['Horizontal_Distance_To_Hydrology'] +
                            abs(X['Vertical_Distance_To_Hydrology']))

    X['Hydro_Distance_sum'] = X['Horizontal_Distance_To_Hydrology'] + X['Vertical_Distance_To_Hydrology']
    X['Hydro_Distance_diff'] = abs(X['Horizontal_Distance_To_Hydrology'] - X['Vertical_Distance_To_Hydrology'])

    X['Hydro_Fire_sum'] = X['Horizontal_Distance_To_Hydrology'] + X['Horizontal_Distance_To_Fire_Points']
    X['Hydro_Fire_diff'] = abs(X['Horizontal_Distance_To_Hydrology'] - X['Horizontal_Distance_To_Fire_Points'])
    X['Hydro_Fire_mean'] = (X['Horizontal_Distance_To_Hydrology'] + X['Horizontal_Distance_To_Fire_Points'])/2

    X['Hydro_Road_sum'] = X['Horizontal_Distance_To_Hydrology'] + X['Horizontal_Distance_To_Roadways']
    X['Hydro_Road_diff'] = abs(X['Horizontal_Distance_To_Hydrology'] - X['Horizontal_Distance_To_Roadways'])
    X['Hydro_Road_mean'] = (X['Horizontal_Distance_To_Hydrology'] + X['Horizontal_Distance_To_Roadways'])/2

    X['Road_Fire_sum'] = X['Horizontal_Distance_To_Roadways'] + X['Horizontal_Distance_To_Fire_Points']
    X['Road_Fire_diff'] = abs(X['Horizontal_Distance_To_Roadways'] - X['Horizontal_Distance_To_Fire_Points'])
    X['Road_Fire_mean'] = (X['Horizontal_Distance_To_Roadways'] + X['Horizontal_Distance_To_Fire_Points'])/2

    X['Hydro_Road_Fire_mean'] = (X['Horizontal_Distance_To_Hydrology'] +
                                 X['Horizontal_Distance_To_Roadways'] +
                                 X['Horizontal_Distance_To_Fire_Points'])/3

    return X

X_train = X_train.swifter.apply(FeatureEngineering, axis = 1)
X_test = X_test.swifter.apply(FeatureEngineering, axis = 1)

On a side note, when you are working with large datasets, pandas can be
somewhat slow. Using swifter, as you can see in the last two lines above, you
can significantly speed up the time it takes to apply a function to your
dataframe. The article → here compares several methods used to speed this
process up.

Feature selection
At this point we have more than 70 features. If the goal is to end up with the
best-performing model, then you could try to use all of these as inputs. That
said, in business there is often a trade-off between performance and
complexity that needs to be considered.

As an example, suppose we have 94% accuracy in our model using all of
these features. Then, imagine we have 89% accuracy with only 4 features.
What is the cost we are willing to pay for a more interpretable model? Always
weigh performance against complexity.

Keeping that in mind, I will perform feature selection to try to reduce the
complexity right away. Sklearn provides many options worth considering. In
this example, I will use SelectKBest, which keeps a pre-specified number of
features, scored here with the ANOVA F-statistic (f_classif). Below, I have
requested (and listed) the top 15 features. These are the features I will use
to train the models in the following section.

selector = SelectKBest(f_classif, k=15)
selector.fit(X_train, y_train)

mask = selector.get_support()
X_train_reduced_cols = X_train.columns[mask]
X_train_reduced_cols

>>> Index(['Elevation', 'Wilderness_Area_3', 'Soil_Type_2', 'Soil_Type_3',
           'Soil_Type_9', 'Soil_Type_37', 'Soil_Type_38', 'Hydro_Elevation_sum',
           'Hydro_Elevation_diff', 'Hydro_Road_sum', 'Hydro_Road_diff',
           'Hydro_Road_mean', 'Road_Fire_sum', 'Road_Fire_mean',
           'Hydro_Road_Fire_mean'],
          dtype='object')
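
To carry this subset forward, one straightforward option (my own small sketch using the mask obtained above) is to index both splits with the selected column names:

# keep only the 15 selected features in both splits
X_train_reduced = X_train[X_train_reduced_cols]
X_test_reduced = X_test[X_train_reduced_cols]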

Baseline models
In this section I will compare three different classifiers:

KNeighborsClassifier

RandomForestClassifier

ExtraTreesClassifier

I have provided links for those who wish to investigate each model further.
They will also be helpful in the section on hyperparameter tuning, where
you can find all modifiable parameters when trying to improve your models.
Below you will find two functions to define and evaluate the baseline
models.

# baseline models
def GetBaseModels():
    baseModels = []
    baseModels.append(('KNN', KNeighborsClassifier()))
    baseModels.append(('RF' , RandomForestClassifier()))
    baseModels.append(('ET' , ExtraTreesClassifier()))
    return baseModels


def ModelEvaluation(X_train, y_train, models):
    # define number of folds and evaluation metric
    num_folds = 10
    scoring = "f1_weighted"  # suitable for imbalanced classes

    results = []
    names = []
    for name, model in models:
        kfold = StratifiedKFold(n_splits=num_folds, random_state=SEED, shuffle=True)
        cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring, n_jobs=-1)
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)

    return names, results
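
Putting the two helpers together, the baseline run looks roughly like this (a sketch; I assume the reduced feature set from the previous section):

models = GetBaseModels()
names, results = ModelEvaluation(X_train[X_train_reduced_cols], y_train, models)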

There are some key elements in the second function that are worth
discussing further. The first is StratifiedKFold. Recall that we split the
original dataset into 80% training and 20% test, and that the test set is
reserved for the final evaluation of our top-performing model.

Using cross-validation will provide us with a better evaluation of our models.
Specifically, I have set up a 10-fold cross-validation. For those not familiar,
the model is trained on k - 1 folds and validated on the remaining fold at
each step. At the end you have the mean and variation of the k models'
scores, which provides better insight than a single train-test evaluation.
Stratified k-fold, as I alluded to earlier, is used to ensure that each fold has an
approximately equal representation of the target classes.

The second point worth discussing is the scoring metric. There are many
metrics available to evaluate the performance of your models, and often
there are several that could suit your project. It’s important to keep in mind
what you are trying to demonstrate with the results. If you work in a
business setting, often the metric that is most easily explained to those
without a data background is preferred.

On the other hand, some metrics are unsuitable for your analyses.
For this project, we have imbalanced classes. If you go to the link provided
above, you will find options for this case. I opted to use the weighted F1
score. Let’s briefly discuss why I chose this metric.

A very common classification metric is accuracy, the percentage of correct
classifications. While this may seem like an excellent option, suppose we
have a binary classification problem where the target classes are uneven
(i.e. group 1 = 90, group 2 = 10). It is possible to have 90% accuracy, which
sounds great, but on closer inspection we may have correctly classified all of
group 1 and failed to classify any of group 2. In this case our model is not
terribly informative.

Had we used the weighted F1 score instead, we would have a result of
42.6%. If you're interested in reading more on the F1 score → here is an
article explaining how it is calculated.

After training the baseline models, I have plotted the results from each
below. The baseline models all performed relatively well. Remember, at this
point I have done nothing to the data (i.e. transform, remove outliers). The
Extra trees classifier had the highest weighted F1 score at 86.9%.

Results from the 10-fold CV. KNN had the lowest at 78.8%, the RF was second with 85.9%, and the ET had the
highest weighted F1 score at 86.9%. Image provided by author

Transforming the data


The next step in this project looks at the effect of data transformation on
model performance. Whereas many decision-tree-based algorithms are not
sensitive to the magnitude of the features, it is reasonable to expect that
models measuring distances between samples, such as KNN, perform
differently when the data are scaled [4] [5]. In this section, we will scale our
data using StandardScaler and MinMaxScaler as described above. Below you
will find a function that defines a pipeline applying the scaler and then
training the model on the scaled data.

def GetScaledModel(nameOfScaler):

if nameOfScaler == 'standard':
scaler = StandardScaler()
elif nameOfScaler =='minmax':
scaler = MinMaxScaler()

pipelines = []
pipelines.append((nameOfScaler+'KNN' , Pipeline([('Scaler', scaler),('KNN' , KNeighborsClassifier())])))
pipelines.append((nameOfScaler+'RF' , Pipeline([('Scaler', scaler),('RF' , RandomForestClassifier())])))
pipelines.append((nameOfScaler+'ET' , Pipeline([('Scaler', scaler),('ET' , ExtraTreesClassifier())])))

return pipelines
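
These pipelines plug straight into the same evaluation helper (a sketch, assuming the variables defined earlier):

scaled_models = GetScaledModel('standard')   # or 'minmax'
names, results = ModelEvaluation(X_train[X_train_reduced_cols], y_train, scaled_models)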

The results using the StandardScaler are presented below. We see that our
hypothesis regarding scaling appears to hold. The random forest and extra
trees classifiers performed nearly identically to their unscaled baselines,
whereas the KNN improved by roughly 4%. Despite this increase, the two
tree-based classifiers still outperform the scaled KNN.

Results from the 10-fold CV using the StandardScaler to transform our data. KNN still had the lowest although
the performance increased to 83.8%. The RF was second with 85.8%, and the ET once again had the highest
weighted F1 score at 86.8%. Image provided by author

Similar results can be seen when the MinMaxScaler is used. The results from
all models are almost identical to those presented using the StandardScaler.

Results from the 10-fold CV using the MinMaxScaler to transform our data. Each performed almost identically
to those using StandardScaler. KNN still had the lowest at 83.9%. The RF was second with 86.0%, and the ET
once again had the highest weighted F1 score at 87.0%. Image provided by author

It is worth noting at this point that I also checked the effect of removing
outliers. For this, I removed values that were beyond +/- 3 SD for each
feature. I am not presenting the results here because there were no values
outside this range. If you are interested in seeing how this was performed,
please feel free to check out the notebook found at the link provided at the
beginning of this article.
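
For completeness, a rough sketch of that +/- 3 SD check (my own minimal version, applied here only to the quantitative columns, which is an assumption on my part):

# z-score each quantitative feature on the training split and flag extreme rows
quant_cols = ['Elevation', 'Aspect', 'Slope',
              'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology',
              'Horizontal_Distance_To_Roadways', 'Hillshade_9am', 'Hillshade_Noon',
              'Hillshade_3pm', 'Horizontal_Distance_To_Fire_Points']
z = (X_train[quant_cols] - X_train[quant_cols].mean()) / X_train[quant_cols].std()
outlier_mask = (z.abs() > 3).any(axis=1)
X_train_no_outliers = X_train[~outlier_mask]  # rows kept after the filter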

Hyperparameter tuning using GridSearchCV


The next step is to try and improve our models by tuning the
hyperparameters. We will do so on the scaled data because it had the best
average performance when considering our three models. Sklearn discusses
this in more detail → here.

I chose to use GridSearchCV (CV stands for cross-validation). Below you will
find a function that performs a 10-fold cross-validation on the models we have
been using. The one additional detail is that we need to provide the list of
hyperparameters we want evaluated.
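
Since the grid-search function itself is not shown in this excerpt, here is a hedged sketch of what such a search looks like for one model (the parameter values are illustrative choices of mine, not the article's):

# example: tune a RandomForestClassifier with a stratified 10-fold grid search
param_grid = {'n_estimators': [100, 300],      # illustrative values only
              'max_depth': [None, 10, 20]}

kfold = StratifiedKFold(n_splits=10, random_state=SEED, shuffle=True)
grid = GridSearchCV(RandomForestClassifier(random_state=SEED),
                    param_grid=param_grid,
                    scoring='f1_weighted',
                    cv=kfold,
                    n_jobs=-1)
grid.fit(X_train_scaled, y_train)              # X_train_scaled: the scaled training features
print(grid.best_params_, grid.best_score_)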

Up to this point, we have not even looked at our test set. Before commencing
the grid search, we will scale our train and test data using the
StandardScaler. We are doing this here because we are going to find the best
hyperparameters for each model and use those as inputs into a
VotingClassifier (as we will discuss in the next section).

To properly scale our full dataset, we have to follow the procedure below.
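
That procedure is the same fit-on-train-only pattern from earlier; presumably it looks something like the following sketch (restricting to the selected columns is my assumption):

# scale the selected features: fit on the training set, transform both splits
scaler = StandardScaler().fit(X_train[X_train_reduced_cols])
X_train_scaled = scaler.transform(X_train[X_train_reduced_cols])
X_test_scaled = scaler.transform(X_test[X_train_reduced_cols])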
