Data Preprocessing
COMP3314
Machine Learning
Introduction
● Preprocessing a dataset is a crucial step
○ Garbage in, garbage out
○ Quality of data and amount of useful information it contains are
key factors
● Data-gathering methods are often loosely controlled, resulting in
out-of-range values (e.g., Income: −100), impossible data
combinations (e.g., Sex: Male, Pregnant: Yes), missing values, etc.
● Preprocessing is often the most important phase of a machine
learning project
Outline
● In this chapter you will learn how to …
○ Remove and impute missing values from the dataset
○ Get categorical data into shape
○ Select relevant features
● Specifically, we will be looking at the following topics
○ Dealing with missing data
○ Nominal and ordinal features
○ Partitioning a dataset into training and testing sets
○ Bringing features onto the same scale
○ Selecting meaningful features
○ Sequential feature selection algorithms
○ Random forests
Code - DataPreprocessing.ipynb
● Available here on Colab
● Columns with missing values can be dropped via the dropna method with
argument axis=1
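A minimal sketch of both axes, assuming a small pandas DataFrame df with missing entries (the column names and values are illustrative):

import numpy as np
import pandas as pd

# illustrative frame with a missing entry in column 'B'
df = pd.DataFrame({'A': [1.0, 2.0], 'B': [np.nan, 4.0], 'C': [5.0, 6.0]})

# drop rows that contain at least one NaN (axis=0, the default)
df.dropna(axis=0)

# drop columns that contain at least one NaN
df.dropna(axis=1)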
Dropna
● The dropna method supports several additional parameters that can come in handy, for example (see the sketch below)
○ how='all': only drop rows where all columns are NaN
○ thresh=4: drop rows that have fewer than 4 real (non-NaN) values
○ subset=['C']: only drop rows where NaN appears in specific columns (here: 'C')
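A minimal sketch of these variants, reusing a DataFrame like df from above (the exact values are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 2.0, np.nan, 4.0],
                   [5.0, 6.0, np.nan, 8.0],
                   [np.nan, np.nan, np.nan, np.nan]],
                  columns=['A', 'B', 'C', 'D'])

# only drop rows where all columns are NaN
df.dropna(how='all')

# drop rows that have fewer than 4 real (non-NaN) values
df.dropna(thresh=4)

# only drop rows where NaN appears in specific columns (here: 'C')
df.dropna(subset=['C'])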
Interpolation
● Estimate missing values from the other training samples in our dataset
● Example: Mean imputation
○ Replace missing value with the mean value of the entire feature column
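One way to do this is scikit-learn's SimpleImputer; a minimal sketch, assuming a numeric feature matrix with NaN entries:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])

# replace each NaN with the mean of its feature column
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
X_imputed = imr.fit_transform(X)

Other strategies such as 'median' or 'most_frequent' are also supported.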
Categorical Data
● It is common that real-world datasets contain categorical features
○ How to deal with this type of data?
● Nominal features vs ordinal features
○ Ordinal features can be sorted / ordered
■ E.g., t-shirt size, because we can define an order XL>L>M
○ Nominal features don't imply any order
■ E.g., t-shirt color
Example Dataset
● Ordinal feature values such as size can be mapped to integers with a mapping dictionary
● A reverse-mapping dictionary lets us go back to the original string values (see the sketch below)
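A minimal sketch of this mapping, assuming a DataFrame along the lines of the example dataset with color, size, and price columns (the values are illustrative):

import pandas as pd

df = pd.DataFrame([['green', 'M', 10.1],
                   ['red', 'L', 13.5],
                   ['blue', 'XL', 15.3]],
                  columns=['color', 'size', 'price'])

# map the ordinal size values to integers, using the order XL > L > M
size_mapping = {'XL': 3, 'L': 2, 'M': 1}
df['size'] = df['size'].map(size_mapping)

# reverse-mapping to go back to the original string values
inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'] = df['size'].map(inv_size_mapping)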
LabelEncoder
● Alternatively, there is a convenient LabelEncoder class directly
implemented in scikit-learn to achieve this
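A minimal sketch, here applied to a hypothetical class-label column:

from sklearn.preprocessing import LabelEncoder

class_le = LabelEncoder()
y = class_le.fit_transform(['class1', 'class2', 'class1'])  # -> array([0, 1, 0])

# reverse-mapping to recover the original string labels
class_le.inverse_transform(y)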
One-Hot Encoding
● We could use a similar approach to transform the nominal color column
of our dataset, as follows
○ Problem:
■ Model may assume that green > blue, and red > green
■ This could result in a suboptimal model
● Workaround: Use one-hot encoding
○ Create a dummy feature for each unique value of nominal features
■ E.g., a blue sample is encoded as blue = 1 , green = 0 , red = 0
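A minimal sketch using pandas' get_dummies, which creates one dummy column per unique value of the string-valued color feature (a convenient alternative to scikit-learn's OneHotEncoder shown next; the data are illustrative):

import pandas as pd

df = pd.DataFrame({'color': ['green', 'red', 'blue'],
                   'size': [2, 3, 1],
                   'price': [10.1, 13.5, 15.3]})

# only the string-valued 'color' column is converted into dummy features
pd.get_dummies(df[['price', 'color', 'size']])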
One-Hot Encoding
● Use the OneHotEncoder available in scikit-learn’s preprocessing
module
○ reshape(-1, 1) turns the selected column into a 2D array; -1 means the dimension is unknown and we want NumPy to figure it out
○ The encoder is applied to only a single column (the nominal color column), as in the sketch below
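A minimal sketch, assuming a feature array X whose first column holds the nominal color values:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([['green', 1, 10.1],
              ['red', 2, 13.5],
              ['blue', 3, 15.3]], dtype=object)

color_ohe = OneHotEncoder()
# apply the encoder to only a single column;
# reshape(-1, 1) lets NumPy figure out the unknown row dimension
color_ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray()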
Feature Scaling
● The majority of ML algorithms require feature scaling
○ Decision trees and random forests are two of the few ML algorithms that don't require feature scaling
● Importance
○ Consider the squared error function in Adaline with two features, where one feature is measured on a scale from 1 to 10 and the second feature is measured on a scale from 1 to 100,000
■ The second feature would contribute to the error with a much higher
significance
● Two common approaches to bring different features onto the same scale
○ Normalization
■ E.g., rescaling features to a range of [0, 1]
○ Standardization
■ E.g., center features at mean 0 with standard deviation 1
Normalization
● Min-max scaling rescales each feature column: x_norm = (x - x_min) / (x_max - x_min)
○ Here x_min is the smallest value in a feature column and x_max the largest
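A minimal sketch with scikit-learn's MinMaxScaler on illustrative training and test data; the scaler is fit on the training data only and then reused for the test data:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0, 10000.0], [5.0, 50000.0], [10.0, 100000.0]])
X_test = np.array([[3.0, 30000.0]])

mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)   # each column rescaled to [0, 1]
X_test_norm = mms.transform(X_test)         # reuse the training-set min/max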
Standardization
● Standardization centers each feature column: x_std = (x - μ_x) / σ_x
● Here μ_x is the sample mean of a feature column and σ_x the corresponding standard deviation
● Similar to the MinMaxScaler class, scikit-learn also implements a class for standardization (StandardScaler)
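A minimal sketch with StandardScaler, under the same assumptions as the normalization example above:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10000.0], [5.0, 50000.0], [10.0, 100000.0]])
X_test = np.array([[3.0, 30000.0]])

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)  # per-column mean 0, std 1
X_test_std = stdsc.transform(X_test)        # reuse the training-set mean/std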
Robust Scaler
● More advanced methods for feature scaling are available in sklearn
● The RobustScaler is especially helpful and recommended if
working with small datasets that contain many outliers
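A minimal sketch on an illustrative column that contains an outlier:

import numpy as np
from sklearn.preprocessing import RobustScaler

X_train = np.array([[1.0], [2.0], [3.0], [1000.0]])  # 1000.0 is an outlier

# RobustScaler centers on the median and scales by the interquartile range,
# so the single outlier barely influences the scaling of the other values
rbs = RobustScaler()
X_train_robust = rbs.fit_transform(X_train)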
Feature Selection
● Selects a subset of relevant features
○ Simplify model for easier interpretation
○ Shorten training time
○ Avoid curse of dimensionality
○ Reduce overfitting
● Feature selection ≠ feature extraction (covered in next chapter)
○ Selecting subset of the features ≠ creating new features
● We are going to look at two techniques for feature selection
○ L1 Regularization
○ Sequential Backward Selection (SBS)
L1 vs. L2 Regularization
● L2 regularization (penalty) used in chapter 3: ||w||_2^2 = Σ_j w_j^2
● L1 regularization replaces the sum of squared weights by the sum of their absolute values: ||w||_1 = Σ_j |w_j|
L1 Regularization
● Why is L1 regularization a technique for feature selection?
● Figure: take the L1 penalty alone as the cost (0 cost at the origin). Let's initialize our model with (2.0, 0.5) and run gradient descent; the cost decreases linearly with the distance to the origin
Initialization
● Figure: let's initialize our model with (2.0, 0.5) and run gradient descent on the L1 penalty; when one of the weights reaches 0, it will then decrease the other one only
Sparse Solution
● We can simply set the penalty parameter to ‘l1’ for models in scikit-learn that
support L1 regularization
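A minimal sketch with logistic regression on hypothetical standardized data (the data and the C value are illustrative); note that penalty='l1' needs a solver that supports it, such as 'liblinear':

import numpy as np
from sklearn.linear_model import LogisticRegression

# hypothetical standardized training data with 5 features, binary labels
rng = np.random.RandomState(1)
X_train_std = rng.randn(100, 5)
y_train = (X_train_std[:, 0] + X_train_std[:, 1] > 0).astype(int)

# smaller C means stronger regularization and hence more zero weights
lr = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
lr.fit(X_train_std, y_train)
print(lr.coef_)  # weights of irrelevant features tend to be exactly 0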
SBS (Sequential Backward Selection)
Steps:
1. Initialize the algorithm with k = d, where d is the dimensionality of the full feature space Xd
2. Determine the feature x⁻ = argmax J(Xk - x) over x ∈ Xk that maximizes the criterion function J
3. Remove the feature x⁻ from the feature set: X(k-1) = Xk - x⁻, and set k = k - 1
4. Terminate if k equals the number of desired features; otherwise, go to step 2
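As a hedged sketch (not the course's own implementation), scikit-learn's SequentialFeatureSelector with direction='backward' performs a similar greedy backward search, shown here with a KNN classifier on hypothetical standardized data:

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

# hypothetical standardized training data with 8 features
rng = np.random.RandomState(1)
X_train_std = rng.randn(120, 8)
y_train = (X_train_std[:, 0] - X_train_std[:, 3] > 0).astype(int)

knn = KNeighborsClassifier(n_neighbors=5)
# greedily remove one feature at a time until 3 remain,
# scoring each candidate subset by cross-validated accuracy
sfs = SequentialFeatureSelector(knn, n_features_to_select=3,
                                direction='backward', cv=5)
sfs.fit(X_train_std, y_train)
print(sfs.get_support())  # boolean mask of the selected features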
● The accuracy of the KNN classifier on the original test set is as follows
SelectFromModel
● scikit-learn implements a SelectFromModel object that selects features based on a
user-specified threshold after model fitting
● Use the RandomForestClassifier as a feature selector and intermediate step in a
scikit-learn Pipeline object, which allows us to connect different preprocessing
steps with an estimator
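A minimal sketch on hypothetical data; the threshold='median' value and the downstream classifier are illustrative choices:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# hypothetical training data with 10 features, binary labels
rng = np.random.RandomState(1)
X_train = rng.randn(150, 10)
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

pipe = Pipeline([
    # keep only features whose random-forest importance is above the threshold
    ('select', SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=1),
                               threshold='median')),
    # fit the downstream estimator on the reduced feature set
    ('clf', LogisticRegression())
])
pipe.fit(X_train, y_train)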
Feature Extraction
● Alternative way to reduce the model complexity
○ Feature selection
■ Select a subset of original features
○ Feature extraction
■ Technique to compress a dataset onto a lower-dimensional
feature space (dimensionality reduction)
■ Covered in the next chapter
Conclusion
● Handle missing data correctly
● Encode categorical variables correctly
● Map ordinal and nominal feature values to integer representations
● L1 regularization can help us to avoid overfitting by reducing the
complexity of a model
● Use sequential feature selection algorithms to select meaningful features from a dataset
References
● The Elements of Statistical Learning: Data Mining, Inference, and
Prediction, Second Edition
○ Trevor Hastie, Robert Tibshirani, Jerome Friedman
○ https://web.stanford.edu/~hastie/ElemStatLearn/
● Pandas User Guide: Working with missing data