
4. Data Preprocessing
COMP3314 Machine Learning
COMP 3314 2

Introduction
● Preprocessing a dataset is a crucial step
○ Garbage in, garbage out
○ Quality of data and amount of useful information it contains are
key factors
● Data-gathering methods are often loosely controlled, resulting in
out-of-range values (e.g., Income: −100), impossible data
combinations (e.g., Sex: Male, Pregnant: Yes), missing values, etc.
● Preprocessing is often the most important phase of a machine
learning project
COMP 3314 3

Outline
● In this chapter you will learn how to …
○ Remove and impute missing values from the dataset
○ Get categorical data into shape
○ Select relevant features
● Specifically, we will be looking at the following topics
○ Dealing with missing data
○ Nominal and ordinal features
○ Partitioning a dataset into training and testing sets
○ Bringing features onto the same scale
○ Selecting meaningful features
○ Sequential feature selection algorithms
○ Random forests
COMP 3314 4

Dealing with Missing Data


● Missing data is common in real-world applications
○ Samples might be missing one or more values
● Most ML models are unable to handle missing values
● Two ways to handle this
○ Remove the affected entries
○ Impute missing values from other samples and features
COMP 3314 5

Code - DataPreprocessing.ipynb
● Available here on CoLab
COMP 3314 6

Identifying Missing Values


● Consider the following simple example generated from CSV
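A minimal sketch of the kind of example meant here (the column names A to D and the values are illustrative, not taken from the original slide):

```python
import pandas as pd
from io import StringIO

# A small CSV string with empty fields; pandas parses empty fields as NaN
csv_data = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,"""

df = pd.read_csv(StringIO(csv_data))
print(df)
```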
COMP 3314 7

Identifying Missing Values


● For larger data, it can be tedious to look for missing values
○ Use the isnull method to return a DataFrame with Boolean
values that indicate whether a cell
■ contains a numeric value (False), or if
■ data is missing (True)
● Use sum() to count the number of missing values per column
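For instance, continuing with the hypothetical df from the sketch above:

```python
df.isnull()        # DataFrame of Booleans: True where a value is missing
df.isnull().sum()  # number of missing values per column
```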
COMP 3314 8

Remove Missing Data


● One option is to simply remove the corresponding features (columns) or
samples (rows)
● Rows with missing values can be dropped via the dropna method with
argument axis=0

● Columns with missing values can be dropped via the dropna method with
argument axis=1
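A short sketch of both options, using the same hypothetical df:

```python
df.dropna(axis=0)  # drop rows (samples) that contain at least one NaN
df.dropna(axis=1)  # drop columns (features) that contain at least one NaN
```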
COMP 3314 9

Dropna
● The dropna method supports several additional parameters that can
come in handy
○ only drop rows where all columns are NaN
○ drop rows that have fewer than 4 real values
○ only drop rows where NaN appears in specific columns (here: 'C')
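These three cases correspond to the how, thresh, and subset parameters of dropna; a sketch with the hypothetical df from before:

```python
df.dropna(how='all')     # only drop rows where all columns are NaN
df.dropna(thresh=4)      # drop rows that have fewer than 4 real (non-NaN) values
df.dropna(subset=['C'])  # only drop rows where NaN appears in column 'C'
```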
COMP 3314 10

Remove Missing Data


● Convenient approach
● Disadvantage
○ May remove too many samples
■ Risk losing valuable information
■ Our classifier may need them to discriminate between
classes
● Could make a reliable analysis impossible
● Alternative approach: Interpolation
COMP 3314 11

Interpolation
● Estimate missing values from the other training samples in our dataset
● Example: Mean imputation
○ Replace missing value with the mean value of the entire feature column

○ Try changing the strategy to: median, most_frequent, or constant (with fill_value=42)
○ Note: mean and median are for numerical data only; most_frequent and constant can be used for numerical data or strings
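A sketch of mean imputation with scikit-learn's SimpleImputer, assuming the df with missing values from earlier:

```python
import numpy as np
from sklearn.impute import SimpleImputer

imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
print(imputed_data)

# Try changing strategy to 'median', 'most_frequent',
# or 'constant' with fill_value=42
```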
COMP 3314 12

Scikit-Learn Estimator API


● SimpleImputer is a Transformer class
○ Used for data transformation
○ Two essential methods
■ fit
■ transform
● Estimator class
○ Very similar to transformer class
○ Two essential methods
■ fit
■ predict
■ transform (optional)
COMP 3314 13

Transformer - Fit and Transform


● fit method
○ Used to learn the
parameters from the
training data
● transform method
○ Uses those parameters
to transform the data

Note: The number of features needs to be identical for the data passed to fit and to transform
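A small sketch of this pattern (the arrays are made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0]])
X_test = np.array([[np.nan, 1.0], [7.0, np.nan]])

imputer = SimpleImputer(strategy='mean')
imputer.fit(X_train)                    # learn the column means from the training data only
X_train_imp = imputer.transform(X_train)
X_test_imp = imputer.transform(X_test)  # reuse the learned means; feature count must match
```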
COMP 3314 14

Estimator - Fit and Predict


● Use fit method to learn parameters
○ Additionally provide class labels
● Use predict method to make predictions
about unlabeled data
COMP 3314 15

Handling Categorical Data


● We have been exclusively working with numerical data
● How to handle categorical data?
○ A categorical feature can take on one of a limited, and usually fixed, number of possible values
● Example of categorical data: t-shirt size (XL, L, M)
COMP 3314 16

Categorical Data
● It is common that real-world datasets contain categorical features
○ How to deal with this type of data?
● Nominal features vs ordinal features
○ Ordinal features can be sorted / ordered
■ E.g., t-shirt size, because we can define an order XL>L>M
○ Nominal features don't imply any order
■ E.g., t-shirt color
COMP 3314 17

Example Dataset

(example table with a nominal, an ordinal, and a numerical column)
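Since the table itself is not reproduced here, a hypothetical DataFrame with this structure might look as follows (the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame([
    ['green', 'M', 10.1, 'class2'],
    ['red', 'L', 13.5, 'class1'],
    ['blue', 'XL', 15.3, 'class2']],
    columns=['color', 'size', 'price', 'classlabel'])
```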


COMP 3314 18

Mapping Ordinal Features


● To ensure correct interpretation of ordinal features, convert string values
to integers

● Reverse-mapping to go back
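A sketch of such a mapping, assuming the hypothetical df above and the ordering XL > L > M:

```python
size_mapping = {'XL': 3, 'L': 2, 'M': 1}
df['size'] = df['size'].map(size_mapping)    # strings -> integers

inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)             # reverse-mapping back to strings
```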
COMP 3314 19

Encoding Class Labels


● Most models require integer encoding for class labels
○ Note: class labels are not ordinal, and it doesn't matter which integer number
we assign to a particular string label
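One possible sketch, enumerating the unique class labels of the hypothetical df:

```python
import numpy as np

class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
df['classlabel'] = df['classlabel'].map(class_mapping)   # e.g., class1 -> 0, class2 -> 1
```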
COMP 3314 20

LabelEncoder
● Alternatively, there is a convenient LabelEncoder class directly
implemented in scikit-learn to achieve this

○ Note: fit_transform is a shortcut for calling fit and transform separately
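A sketch using LabelEncoder on a string-valued class label column:

```python
from sklearn.preprocessing import LabelEncoder

class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)  # fit_transform = fit + transform
class_le.inverse_transform(y)                        # map the integers back to strings
```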
COMP 3314 21

One-Hot Encoding
● We could use a similar approach to transform the nominal color column
of our dataset, as follows

○ Problem:
■ Model may assume that green > blue, and red > green
■ This could result in a suboptimal model
● Workaround: Use one-hot encoding
○ Create a dummy feature for each unique value of nominal features
■ E.g., a blue sample is encoded as blue = 1 , green = 0 , red = 0
COMP 3314 22

One-Hot Encoding
● Use the OneHotEncoder available in scikit-learn’s preprocessing
module
○ In the code, the column is passed as reshape(-1, 1): -1 stands for an unknown dimension that we want NumPy to figure out
○ The encoder is applied to only a single column (see the sketch below)
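A sketch of this, assuming df still has the string-valued color column:

```python
from sklearn.preprocessing import OneHotEncoder

X = df[['color', 'size', 'price']].values
color_ohe = OneHotEncoder()
# reshape(-1, 1): -1 lets NumPy infer the number of rows; encode only column 0 (color)
color_ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray()
```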
COMP 3314 23

One-Hot Encoding via ColumnTransformer


● To selectively transform columns in a multi-feature array, use
ColumnTransformer
○ Accepts a list of (name, transformer, column(s)) tuples
○ In the sketch below, only the first column is modified
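A sketch of the same transformation with ColumnTransformer, using the X array from the previous sketch and assuming size has already been mapped to integers:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

c_transf = ColumnTransformer([
    ('onehot', OneHotEncoder(), [0]),    # one-hot encode only the first column (color)
    ('nothing', 'passthrough', [1, 2])   # pass size and price through unchanged
])
c_transf.fit_transform(X).astype(float)
```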
COMP 3314 24

One-Hot Encoding - Via Pandas


● An even more convenient way to create those dummy features via
one-hot encoding is to use the get_dummies method implemented
in pandas
○ get_dummies will only convert string columns
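For example, with the hypothetical df used above:

```python
import pandas as pd

# Only the string-valued color column is converted into dummy columns
pd.get_dummies(df[['price', 'color', 'size']])
```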
COMP 3314 25

One-Hot Encoding - Dropping First Feature


● Note that we do not lose any information by removing one dummy column
○ E.g., if we remove the column color_blue, the feature information is still
preserved since if we observe color_green=0 and color_red=0, it implies that
the observation must be blue
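With get_dummies, dropping the first dummy column of each nominal feature can be done via drop_first:

```python
import pandas as pd

pd.get_dummies(df[['price', 'color', 'size']], drop_first=True)
```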
COMP 3314 26

UCI Wine Dataset


● The UCI wine dataset consists of 178 wine samples with 13 features describing
their different chemical properties
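One convenient way to load it is via scikit-learn's built-in copy of the dataset (an assumption; the slides may read it directly from the UCI repository instead):

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
df_wine = pd.DataFrame(wine.data, columns=wine.feature_names)
df_wine['class_label'] = wine.target
print(df_wine.shape)   # (178, 14): 178 samples, 13 features plus the class label
```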
COMP 3314 27

UCI Wine Dataset: Training-Testing


● Let’s first divide the dataset into separate training and testing sets
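A sketch of a 70:30 split, assuming the df_wine DataFrame from the previous sketch:

```python
from sklearn.model_selection import train_test_split

X = df_wine.drop('class_label', axis=1).values
y = df_wine['class_label'].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)   # stratify keeps class proportions
```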
COMP 3314 28

UCI Wine Dataset: Training-Testing


● It is important to balance the trade-off between inaccurate estimation of
generalization error and withholding too much information from the
learning algorithm
● In practice, the most commonly used splits are 60:40, 70:30, or 80:20,
depending on the size of the initial dataset
○ For large datasets, 90:10 or 99:1 splits are also common and
appropriate
● Instead of discarding the allocated test data after model training and evaluation, we can retrain the classifier on the entire dataset, as this can improve the predictive performance of the model
○ While this approach is generally recommended, it could lead to worse generalization performance if the dataset is small and contains outliers
COMP 3314 29

Feature Scaling
● The majority of ML algorithms require feature scaling
○ Decision trees and random forests are two of the few ML algorithms that don't require feature scaling
● Importance
○ Consider the squared error function in Adaline for two-dimensional features
where one feature is measured on a scale from 1 to 10 and the second feature is
measured on a scale from 1 to 100,000
■ The second feature would contribute to the error with a much higher
significance
● Two common approaches to bring different features onto the same scale
○ Normalization
■ E.g., rescaling features to a range of [0, 1]
○ Standardization
■ E.g., center features at mean 0 with standard deviation 1
COMP 3314 30

Feature Scaling - Normalization


● Most often, normalization refers to the rescaling of features to a range of [0, 1]
● To normalize our data, we can simply apply a min-max scaling to each feature column
○ A new value x_norm^(i) of a sample x^(i) is calculated as follows
○ Here x_min is the smallest value in a feature column and x_max the largest
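Written out (the standard min-max scaling formula):

$$x_{\text{norm}}^{(i)} = \frac{x^{(i)} - x_{\min}}{x_{\max} - x_{\min}}$$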
COMP 3314 31

Feature Scaling - Standardization


● Standardization is more practical for various reasons including retaining useful
information about outliers
● A new value x_std^(i) of a sample x^(i) is calculated as follows
● Here μ_x is the sample mean of a feature column and σ_x the corresponding standard deviation
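Written out (the standard standardization formula):

$$x_{\text{std}}^{(i)} = \frac{x^{(i)} - \mu_x}{\sigma_x}$$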
● Similar to the MinMaxScaler class, scikit-learn also implements a class for standardization
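A sketch of both scalers, assuming the Wine training and test arrays from earlier; note that both are fit on the training data only:

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)       # reuse the min/max learned from the training data

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)      # reuse the training mean and standard deviation
```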
COMP 3314 32

Normalization vs. Standardization


● The following example illustrates the difference between
standardization and normalization
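Since the slide's table is not reproduced, a small sketch of the same comparison on the values 0 to 5:

```python
import numpy as np

ex = np.array([0, 1, 2, 3, 4, 5])
print('standardized:', (ex - ex.mean()) / ex.std())
print('normalized:  ', (ex - ex.min()) / (ex.max() - ex.min()))
```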
COMP 3314 33

Robust Scaler
● More advanced methods for feature scaling are available in sklearn
● The RobustScaler is especially helpful and recommended if
working with small datasets that contain many outliers
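A sketch of its use (same fit-on-training-data pattern as above):

```python
from sklearn.preprocessing import RobustScaler

# Centers each feature on its median and scales by the interquartile range,
# so extreme outliers have less influence than with min-max scaling
rbs = RobustScaler()
X_train_robust = rbs.fit_transform(X_train)
X_test_robust = rbs.transform(X_test)
```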
COMP 3314 34

Feature Selection
● Selects a subset of relevant features
○ Simplify model for easier interpretation
○ Shorten training time
○ Avoid curse of dimensionality
○ Reduce overfitting
● Feature selection ≠ feature extraction (covered in next chapter)
○ Selecting subset of the features ≠ creating new features
● We are going to look at two techniques for feature selection
○ L1 Regularization
○ Sequential Backward Selection (SBS)
COMP 3314 35

L1 vs. L2 Regularization
● L2 regularization (penalty) used in chapter 3

● Another approach: L1 regularization (penalty)
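The two penalty terms are shown as images on the slide; their standard definitions are:

$$\text{L2:}\quad \lambda \lVert \mathbf{w} \rVert_2^2 = \lambda \sum_{j=1}^{m} w_j^2 \qquad\qquad \text{L1:}\quad \lambda \lVert \mathbf{w} \rVert_1 = \lambda \sum_{j=1}^{m} \lvert w_j \rvert$$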


COMP 3314 36

L1 Regularization
● Why is L1 regularization a technique for feature selection?
COMP 3314 37

The two axes represent the model parameters; previously we used w1 and w2.
COMP 3314 38

Points on the contour have equal cost. The background contour represents the L2 regularization term.
COMP 3314 39

Why are the contours circular? Make sure you can answer this question before you continue.
COMP 3314 40

Let's initialize our model with (2.0, 0.5) and run gradient descent. The cost decreases linearly with the distance to the origin (0 cost at the origin).
COMP 3314 41

The background contour represents Adaline's cost function + the L2 regularization term (i.e., a combination of both).
COMP 3314 42

Starting from the initialization, this is the path gradient descent takes.
COMP 3314 43

Points on the contour have equal cost. The background contour represents the L1 regularization term.
COMP 3314 44

Why are the contours diamond-shaped? Make sure you can answer this question before you continue.
COMP 3314 45

Let's initialize our model with (2.0, 0.5) and run gradient descent (0 cost at the origin). First the cost decreases equally for both parameters; when one of them is 0, it will then decrease the other one only.
COMP 3314 46

The background contour represents Adaline's cost function + the L1 regularization term (i.e., a combination of both).
COMP 3314 47

Starting from the initialization, this is the path gradient descent takes. Notice how the path quickly reaches 0 for one of the parameters.
COMP 3314 48

To avoid the bouncing around, you should gradually reduce the learning rate.
COMP 3314 49

Sparse Solution
● We can simply set the penalty parameter to ‘l1’ for models in scikit-learn that
support L1 regularization

● In scikit-learn, w0 corresponds to intercept_ and wj (for j > 0) corresponds to the values in coef_
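A sketch with logistic regression on the standardized Wine data from earlier (the solver choice is an assumption; liblinear and saga support the L1 penalty):

```python
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
lr.fit(X_train_std, y_train)
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy:', lr.score(X_test_std, y_test))
print(lr.intercept_)   # w0 per class
print(lr.coef_)        # many weights are exactly zero: a sparse solution
```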
COMP 3314 50

Sparse Solution - Regularization Strength


COMP 3314 51

Sparse Solution - Regularization Strength
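The plots on these two slides are not reproduced; a minimal sketch of how the effect of the regularization strength could be explored on the standardized Wine data (the range of C values is an assumption):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Smaller C = stronger regularization = more weights driven to exactly zero
for c in 10.0 ** np.arange(-4, 5):
    lr = LogisticRegression(penalty='l1', C=c, solver='liblinear')
    lr.fit(X_train_std, y_train)
    print(f'C={c:g}: non-zero weights = {np.count_nonzero(lr.coef_)}')
```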


COMP 3314 52

Sequential Backward Selection (SBS)


● Reduces an initial d-dimensional space to a k-dimensional subspace (k < d)
by automatically selecting features that are most relevant
● Idea:
○ Sequentially remove features until desired feature number is reached
○ Define a criterion function J to be maximized
■ E.g., performance of the classifier after removal
○ Eliminate the feature that causes the least performance loss
COMP 3314 53

SBS
Steps:
1. Initialize the algorithm with k = d, where d is the dimensionality of the full feature space X_d
2. Determine the feature x^- = argmax J(X_k - x) that maximizes the criterion function J
3. Remove the feature x^- from the feature set: X_{k-1} = X_k - x^-, k = k - 1
4. Terminate if k equals the number of desired features; otherwise, go to step 2

● In the following we will implement SBS in Python from scratch
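The full implementation is shown on the following slides as images; below is a minimal from-scratch sketch that follows the steps above, scoring each candidate subset on an internal validation split:

```python
from itertools import combinations
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

class SBS:
    """Sequential Backward Selection (sketch)."""
    def __init__(self, estimator, k_features, scoring=accuracy_score,
                 test_size=0.25, random_state=1):
        self.estimator = clone(estimator)
        self.k_features = k_features
        self.scoring = scoring
        self.test_size = test_size
        self.random_state = random_state

    def fit(self, X, y):
        # internal validation split used as the criterion function J
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, test_size=self.test_size, random_state=self.random_state)
        dim = X_tr.shape[1]
        self.indices_ = tuple(range(dim))
        self.subsets_ = [self.indices_]
        self.scores_ = [self._calc_score(X_tr, y_tr, X_val, y_val, self.indices_)]
        while dim > self.k_features:
            scores, subsets = [], []
            # evaluate every subset obtained by removing exactly one feature
            for p in combinations(self.indices_, r=dim - 1):
                scores.append(self._calc_score(X_tr, y_tr, X_val, y_val, p))
                subsets.append(p)
            best = int(np.argmax(scores))    # keep the subset whose removal hurts least
            self.indices_ = subsets[best]
            self.subsets_.append(self.indices_)
            self.scores_.append(scores[best])
            dim -= 1
        return self

    def transform(self, X):
        return X[:, self.indices_]

    def _calc_score(self, X_tr, y_tr, X_val, y_val, indices):
        self.estimator.fit(X_tr[:, indices], y_tr)
        return self.scoring(y_val, self.estimator.predict(X_val[:, indices]))
```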


COMP 3314 54
COMP 3314 55
COMP 3314 56

SBS - Analyzing the Result


● The smallest feature subset (k = 3) that yielded such a good performance on the
validation dataset has the following features

● The accuracy of the KNN classifier on the original test set is as follows

● The three-feature subset has the following accuracy
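The specific features and accuracy values are shown on the slide images; a sketch of how they could be obtained with the SBS class above and a KNN classifier (the index 10 assumes 13 starting features, so subsets_[10] holds the 3-feature subset):

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
sbs = SBS(knn, k_features=1)
sbs.fit(X_train_std, y_train)

k3 = list(sbs.subsets_[10])             # the 3-feature subset found by SBS
print(df_wine.columns[k3].tolist())     # names of the selected features

knn.fit(X_train_std, y_train)
print('Test accuracy (all 13 features):', knn.score(X_test_std, y_test))
knn.fit(X_train_std[:, k3], y_train)
print('Test accuracy (3 features):', knn.score(X_test_std[:, k3], y_test))
```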


COMP 3314 57

Feature Selection Algorithms in scikit-learn


● There are many more feature selection algorithms available via
scikit-learn
● A comprehensive discussion of the different feature selection
methods is beyond the scope of this lecture
○ A good summary with illustrative examples can be found here
COMP 3314 58

Assessing Feature Importance


● We can determine relevant features using random forest
○ Measure the feature importance as the averaged impurity decrease
● The random forest implementation in scikit-learn already collects the
feature importance values for us
○ Access them via the feature_importances_ attribute after fitting a
RandomForestClassifier
● In the following we will train a forest of 500 trees on the Wine dataset
and rank the 13 features by their respective importance measures
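A sketch of such a ranking, assuming the Wine split from earlier (feature names taken from the hypothetical df_wine; n_estimators=500 as stated above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feat_labels = df_wine.columns[:-1]

forest = RandomForestClassifier(n_estimators=500, random_state=1)
forest.fit(X_train, y_train)

importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]    # sort from most to least important
for rank, idx in enumerate(indices, start=1):
    print(f'{rank:2d}) {feat_labels[idx]:<30} {importances[idx]:.4f}')
```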
COMP 3314 59
COMP 3314 60

SelectFromModel
● scikit-learn implements a SelectFromModel object that selects features based on a
user-specified threshold after model fitting
● Use the RandomForestClassifier as a feature selector and intermediate step in a
scikit-learn Pipeline object, which allows us to connect different preprocessing
steps with an estimator
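A sketch using the fitted forest from the previous example and a threshold of 0.1 (the threshold value is an assumption):

```python
from sklearn.feature_selection import SelectFromModel

sfm = SelectFromModel(forest, threshold=0.1, prefit=True)
X_selected = sfm.transform(X_train)
print('Features meeting the threshold:', X_selected.shape[1])
```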
COMP 3314 61

Feature Extraction
● Alternative way to reduce the model complexity
○ Feature selection
■ Select a subset of original features
○ Feature extraction
■ Technique to compress a dataset onto a lower-dimensional
feature space (dimensionality reduction)
■ Covered in the next chapter
COMP 3314 62

Conclusion
● Handle missing data correctly
● Encode categorical variables correctly
● Map ordinal and nominal feature values to integer representations
● L1 regularization can help us to avoid overfitting by reducing the
complexity of a model
● Use sequential feature selection algorithms to select meaningful features from a dataset
COMP 3314 63

References
● Most materials in this chapter are
based on
○ Book
○ Code
COMP 3314 64

References
● Some materials in this chapter
are based on
○ Book
○ Code
COMP 3314 65

References
● The Elements of Statistical Learning: Data Mining, Inference, and
Prediction, Second Edition
○ Trevor Hastie, Robert Tibshirani, Jerome Friedman
● https://web.stanford.edu/~hastie/ElemStatLearn/
● Pandas User Guide: Working with missing data
