COMP3314 Machine Learning
5. Data Preprocessing
Introduction
● Preprocessing a dataset is a crucial step
○ Garbage in, garbage out
○ Quality of data and amount of useful information it contains are
key factors
● Data-gathering methods are often loosely controlled, resulting in
out-of-range values (e.g., Income: −100), impossible data
combinations (e.g., Sex: Male, Pregnant: Yes), missing values, etc.
● Preprocessing is often the most important phase of a machine
learning project
Outline
● In this chapter you will learn how to …
○ Remove and impute missing values from the dataset
○ Get categorical data into shape
○ Select relevant features
● Specifically, we will be looking at the following topics
○ Dealing with missing data
○ Nominal and ordinal features
○ Partitioning a dataset into training and testing sets
○ Bringing features onto the same scale
○ Selecting meaningful features
○ Sequential feature selection algorithms
○ Random forests
Dealing with Missing Data
● Columns with missing values can be dropped via the dropna method with
argument axis=1 (similarly, rows can be dropped with axis=0)
Dropna
● The dropna method supports several additional parameters that can
come in handy
○ how='all': only drop rows where all columns are NaN
○ thresh=4: drop rows that have fewer than 4 real values
○ subset=['C']: only drop rows where NaN appears in specific columns (here: 'C')
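A minimal sketch of these dropna variants on a small illustrative DataFrame (the values below are made up for illustration):

import numpy as np
import pandas as pd

# Small illustrative DataFrame containing missing values
df = pd.DataFrame({'A': [1.0, 5.0, 10.0],
                   'B': [2.0, 6.0, 11.0],
                   'C': [3.0, np.nan, 12.0],
                   'D': [4.0, 8.0, np.nan]})

df.dropna(axis=0)        # drop rows that contain any NaN
df.dropna(axis=1)        # drop columns that contain any NaN
df.dropna(how='all')     # only drop rows where all columns are NaN
df.dropna(thresh=4)      # drop rows that have fewer than 4 real values
df.dropna(subset=['C'])  # only drop rows where NaN appears in column 'C'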
Interpolation
● Estimate missing values from the other training samples in our dataset
● Example: Mean imputation
○ Replace missing value with the mean value of the entire feature column
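A minimal sketch of mean imputation with scikit-learn's SimpleImputer (df is assumed to be the DataFrame with missing values from the previous slides):

import numpy as np
from sklearn.impute import SimpleImputer

# Replace each NaN with the mean of its feature column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_data = imputer.fit_transform(df.values)

# Equivalent pandas one-liner
df.fillna(df.mean())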
Categorical Data
● It is common that real-world datasets contain categorical features
○ How to deal with this type of data?
● Nominal features vs ordinal features
○ Ordinal features can be sorted / ordered
■ E.g., t-shirt size, because we can define an order XL>L>M
○ Nominal features don't imply any order
■ E.g., t-shirt color
Example Dataset
● Ordinal features (e.g., size) can be mapped to integers with a mapping dictionary
● A reverse-mapping dictionary lets us go back to the original string values
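A minimal sketch of such a mapping; the example DataFrame below (colors, sizes, prices, class labels) is only illustrative and its exact values are an assumption:

import pandas as pd

# Illustrative dataset with a nominal (color), an ordinal (size),
# a numerical (price) feature, and a class label
df = pd.DataFrame([['green', 'M', 10.1, 'class2'],
                   ['red', 'L', 13.5, 'class1'],
                   ['blue', 'XL', 15.3, 'class2']],
                  columns=['color', 'size', 'price', 'classlabel'])

# Map the ordinal size feature to integers so that XL > L > M
size_mapping = {'XL': 3, 'L': 2, 'M': 1}
df['size'] = df['size'].map(size_mapping)

# Reverse-mapping to go back to the original string values
inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)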
LabelEncoder
● Alternatively, there is a convenient LabelEncoder class directly
implemented in scikit-learn to achieve this
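For example, encoding the class label column (continuing with the illustrative df above):

from sklearn.preprocessing import LabelEncoder

# Encode string class labels as integers
class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)

# Reverse-mapping via inverse_transform
class_le.inverse_transform(y)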
One-Hot Encoding
● We could use a similar integer-encoding approach to transform the nominal
color column of our dataset
○ Problem:
■ Model may assume that green > blue, and red > green
■ This could result in suboptimal model
● Workaround: Use one-hot encoding
○ Create a dummy feature for each unique value of nominal features
■ E.g., a blue sample is encoded as blue = 1 , green = 0 , red = 0
One-Hot Encoding
● Use the OneHotEncoder available in scikit-learn’s preprocessing
module
○ reshape(-1, 1) turns the color values into a single column; -1 means the
dimension is unknown and we want NumPy to figure it out
○ The encoder is applied to only a single column (the nominal color column)
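A minimal sketch, continuing with the illustrative df from above; pandas' get_dummies is shown as a convenient alternative:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = df[['color', 'size', 'price']].values

# One-hot encode only the first column (color);
# reshape(-1, 1) lets NumPy infer the number of rows
color_ohe = OneHotEncoder()
color_ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray()

# Alternative: get_dummies creates dummy columns for all
# string-valued columns of a DataFrame
pd.get_dummies(df[['color', 'size', 'price']])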
Partitioning a Dataset into Training and Test Sets
● Instead of discarding the allocated test data after model training and
evaluation, it is common practice to retrain the classifier on the entire
dataset, since more training data usually improves the final model
● The bigger the dataset, the smaller the test-set ratio can be
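A minimal sketch with scikit-learn's train_test_split; the Wine data is used here only as an illustrative 13-feature dataset, and the 30% test ratio is just one common choice:

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Load an example dataset with 13 features
X, y = load_wine(return_X_y=True)

# Hold out 30% of the samples as a test set;
# stratify keeps the class proportions equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)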
Feature Scaling
● The majority of ML algorithms require feature scaling
○ Decision trees and random forests are among the few ML algorithms that don't
require feature scaling
● Importance
○ Consider the squared error function in Adaline for two-dimensional features,
where one feature is measured on a scale from 1 to 10 and the second feature
on a scale from 1 to 100,000
■ The second feature would then dominate the error
● Two common approaches to bring different features onto the same scale
○ Normalization
■ E.g., rescaling features to a range of [0, 1]
○ Standardization
■ E.g., center features at mean 0 with standard deviation 1
Normalization
● Min-max scaling rescales each feature to the range [0, 1]:
x_norm^(i) = (x^(i) − x_min) / (x_max − x_min)
○ Here x_min is the smallest value in a feature column and x_max the largest
Standardization
● Standardization centers each feature at mean 0 with standard deviation 1:
x_std^(i) = (x^(i) − μ_x) / σ_x
○ Here μ_x is the sample mean of the feature column and σ_x the corresponding standard deviation
● Similar to the MinMaxScaler class, scikit-learn also implements a StandardScaler class for standardization
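A minimal sketch of both scalers (X_train and X_test are assumed from the earlier split); note that the test set is transformed with the parameters estimated on the training set:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Normalization: rescale each feature to [0, 1]
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)    # reuse training-set min/max

# Standardization: zero mean, unit variance per feature
stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)   # reuse training-set mean/std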
Robust Scaler
● More advanced methods for feature scaling are available in sklearn
● The RobustScaler is especially helpful and recommended if
working with small datasets that contain many outliers
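A minimal usage sketch (same X_train and X_test as before):

from sklearn.preprocessing import RobustScaler

# RobustScaler centers each feature on its median and scales by the
# interquartile range, so extreme outliers have much less influence
rbs = RobustScaler()
X_train_robust = rbs.fit_transform(X_train)
X_test_robust = rbs.transform(X_test)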
Feature Selection
● Selects a subset of relevant features
○ Simplify model for easier interpretation
○ Shorten training time
○ Avoid curse of dimensionality
○ Reduce overfitting
● Feature selection ≠ feature extraction (covered in next chapter)
○ Selecting subset of the features ≠ creating new features
● We are going to look at two techniques for feature selection
○ L1 Regularization
○ Sequential Backward Selection (SBS)
L1 vs. L2 Regularization
● L2 regularization (penalty), used in chapter 3, penalizes the sum of squared weights:
L2: ||w||_2^2 = Σ_j w_j^2
● L1 regularization penalizes the sum of absolute weight values:
L1: ||w||_1 = Σ_j |w_j|
○ L1 regularization typically yields sparse weight vectors, i.e., most feature
weights become zero
Geometric Interpretation
● To better understand how L1 regularization encourages sparsity, let’s take a look
at a geometric interpretation of regularization
● Consider the sum of squared errors cost function used for Adaline
● Plot: contours of a convex cost function for two coefficients w1 and w2,
together with the diamond-shaped L1 constraint region
○ Contours closer to the center correspond to lower cost
○ The minimum of the L1-penalized cost is very likely to be located at a sharp
corner of the diamond, where one of the weights is exactly zero, which is why
L1 solutions tend to be sparse
Sparse Solution
● We can simply set the penalty parameter to ‘l1’ for models in scikit-learn that
support L1 regularization
○ In scikit-learn, the parameter C is the inverse of the regularization strength
■ If C is too small (very strong regularization), all weights converge to zero
■ As C increases (weaker regularization), more weights take non-zero values
■ We should search for a suitable C, neither too large nor too small
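A minimal sketch with L1-penalized logistic regression (C=1.0 is just an example value; X_train_std and y_train are assumed from earlier):

from sklearn.linear_model import LogisticRegression

# L1-penalized logistic regression; C is the inverse regularization
# strength (small C = strong regularization = more weights forced to zero)
lr = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
lr.fit(X_train_std, y_train)
print(lr.coef_)   # sparse weight matrix: many entries are exactly zero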
Sequential Feature Selection
● Idea: starting from the full feature set, tentatively remove one feature,
evaluate the model, put the feature back, and repeat for every candidate;
then permanently remove the feature whose removal gives the maximum
performance, and continue until the desired number of features remains
SBS
Steps:
1. Initialize the algorithm with k = d, where d is the dimensionality of the
   full feature space X_d
2. Determine the feature x⁻ = argmax J(X_k − x), x ∈ X_k, that maximizes the
   criterion function J, i.e., the feature whose removal causes the least loss
   of performance
3. Remove the feature x⁻ from the feature set:
   X_{k−1} = X_k − x⁻; k = k − 1
4. Terminate if k equals the number of desired features;
   otherwise, go to step 2
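scikit-learn ships a related greedy selector, SequentialFeatureSelector; it scores candidate subsets by cross-validation rather than on a separate validation set, so it is not exactly the SBS above, but the idea is the same. A minimal sketch with a KNN estimator and 3 target features (X_train_std and y_train assumed from earlier):

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

# Greedy backward elimination: repeatedly drop the feature whose
# removal hurts the cross-validated score the least
knn = KNeighborsClassifier(n_neighbors=5)
sbs = SequentialFeatureSelector(knn, n_features_to_select=3,
                                direction='backward', cv=5)
sbs.fit(X_train_std, y_train)
print(sbs.get_support())   # boolean mask of the selected features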
● Plot: classification accuracy vs. number of features as SBS removes features
one at a time (starting from all 13)
○ Best choice here: 3 features
● Finally, the accuracy of the KNN classifier is evaluated on the original test set
Conclusion
● Handle missing data correctly
● Encode categorical variables correctly
● Map ordinal and nominal feature values to integer representations
● L1 regularization can help us to avoid overfitting by reducing the
complexity of a model
● Use a sequential feature selection algorithm to select meaningful features
from a dataset
References
● Most materials in this chapter are
based on
○ Book
○ Code
References
● Some materials in this chapter
are based on
○ Book
○ Code
References
● The Elements of Statistical Learning: Data Mining, Inference, and
Prediction, Second Edition
○ Trevor Hastie, Robert Tibshirani, Jerome Friedman
● https://web.stanford.edu/~hastie/ElemStatLearn/
● Pandas User Guide: Working with missing data