Comp3314 5. Data Preprocessing

The document discusses the importance of data preprocessing in machine learning, highlighting techniques for handling missing values, encoding categorical data, and feature selection. Key methods include removing or imputing missing data, using one-hot encoding for nominal features, and applying L1 regularization for feature selection. The document also covers feature scaling and the use of algorithms like Sequential Backward Selection and Random Forests to assess feature importance.


Data Preprocessing

COMP3314
Machine Learning
COMP 3314 2

Introduction
● Preprocessing a dataset is a crucial step
○ Garbage in, garbage out
○ The quality of the data and the amount of useful information it contains are key factors
● Data-gathering methods are often loosely controlled, resulting in
out-of-range values (e.g., Income: −100), impossible data
combinations (e.g., Sex: Male, Pregnant: Yes), missing values, etc.
● Preprocessing is often the most important phase of a machine
learning project
COMP 3314 3

Outline
● In this chapter you will learn how to …
○ Remove and impute missing values from the dataset
○ Get categorical data into shape
○ Select relevant features
● Specifically, we will be looking at the following topics
○ Dealing with missing data
○ Nominal and ordinal features
○ Partitioning a dataset into training and testing sets
○ Bringing features onto the same scale
○ Selecting meaningful features
○ Sequential feature selection algorithms
○ Random forests
COMP 3314 4

Dealing with Missing Data


● Missing data is common in real-world applications
○ Samples might be missing one or more values
● Most ML models are unable to handle missing values
● Two ways to handle this
○ Remove entries
○ Impute missing values from other samples and features (repair)
COMP 3314 5

Identifying Missing Values


● Consider the following simple example generated from CSV
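A minimal sketch of such an example (the original slide's code screenshot is not reproduced here; the CSV values below are illustrative):

```python
import pandas as pd
from io import StringIO

# Illustrative CSV data with two missing cells
csv_data = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,"""

df = pd.read_csv(StringIO(csv_data))
print(df)   # the empty cells show up as NaN
```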
COMP 3314 6

Identifying Missing Values


● For larger data, it can be tedious to look for missing values
○ Use the isnull method to return a DataFrame with Boolean
values that indicate whether a cell
■ contains a numeric value (False), or if
■ data is missing (True)
● Use sum() to count the number of missing values per column
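A sketch of the isnull/sum pattern, assuming the df DataFrame from the previous sketch:

```python
# Boolean DataFrame: True marks a missing cell, False a cell with a value
print(df.isnull())

# Number of missing values per column
print(df.isnull().sum())
```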
COMP 3314 7

Remove Missing Data


● One option is to simply remove the corresponding features (columns) or
samples (rows)
● Rows with missing values can be dropped via the dropna method with
argument axis=0

● Columns with missing values can be dropped via the dropna method with
argument axis=1
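A sketch of both variants, again assuming the df from the earlier sketch:

```python
# Drop rows (samples) that contain at least one NaN
print(df.dropna(axis=0))

# Drop columns (features) that contain at least one NaN
print(df.dropna(axis=1))
```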
COMP 3314 8

Dropna
● The dropna method supports several additional parameters that can come in handy, for example (see the sketch below):
○ only drop rows where all columns are NaN
○ drop rows that have fewer than 4 real values
○ only drop rows where NaN appears in specific columns (here: 'C')
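A sketch of these three dropna calls (how, thresh, and subset are standard pandas arguments; the data is the df from the earlier sketch):

```python
# Only drop rows where all columns are NaN
print(df.dropna(how='all'))

# Drop rows that have fewer than 4 real (non-NaN) values
print(df.dropna(thresh=4))

# Only drop rows where NaN appears in specific columns (here: 'C')
print(df.dropna(subset=['C']))
```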
COMP 3314 9

Remove Missing Data


● Convenient approach
● Disadvantage
○ May remove too many samples
■ Risk losing valuable information
■ Our classifier may need them to discriminate between
classes
● Could make a reliable analysis impossible
● Alternative approach: Interpolation
COMP 3314 10

Interpolation
● Estimate missing values from the other training samples in our dataset
● Example: Mean imputation
○ Replace the missing value with the mean value of the entire feature column (see the sketch below)
○ Other strategies to try: median, most_frequent, or constant (with fill_value, e.g., 42)
■ mean and median work for numerical data only; most_frequent and constant can be used for numerical data or strings
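A sketch of mean imputation with scikit-learn's SimpleImputer, assuming the df with missing values from the earlier sketches:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Mean imputation: replace each NaN with the mean of its feature column
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_data = imr.fit_transform(df.values)
print(imputed_data)

# Other strategies to try:
#   strategy='median'
#   strategy='most_frequent'
#   strategy='constant', fill_value=42
```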
COMP 3314 11

Scikit-Learn Estimator API


● SimpleImputer is a Transformer class
○ Used for data transformation
○ Two essential methods
■ fit
■ transform
● Estimator class
○ Very similar to the Transformer class
○ Two essential methods
■ fit
■ predict
■ transform (optional)
COMP 3314 12

Transformer - Fit and Transform


● fit method
○ Used to learn the parameters from the training data
● transform method
○ Uses those parameters to transform the data
● Note: the number of features needs to be identical for the data used in fit and the data passed to transform
COMP 3314 13

Estimator - Fit and Predict


● Use fit method to learn parameters
○ Additionally provide class labels
● Use predict method to make predictions
about unlabeled data
COMP 3314 14

Handling Categorical Data
● So far we have been working exclusively with numerical data (values ranging from −infinity to infinity)
● How do we handle categorical data?
○ A categorical feature can take on one of a limited, and usually fixed, number of possible values; there is a fixed number of distinct values
● Example of categorical data: t-shirt size (XL, L, M)
COMP 3314 15

Categorical Data
● It is common that real-world datasets contain categorical features
○ How to deal with this type of data?
● Nominal features vs ordinal features
○ Ordinal features can be sorted / ordered
■ E.g., t-shirt size, because we can define an order XL>L>M
○ Nominal features don't imply any order
■ E.g., t-shirt color
COMP 3314 16

Example Dataset
● A toy dataset with a nominal feature (color), an ordinal feature (size), and a numerical feature (price), plus a class label (see the sketch below)
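A minimal sketch of such a dataset (the values are illustrative, not the original slide's screenshot):

```python
import pandas as pd

# Toy dataset: nominal color, ordinal size, numerical price, plus a class label
df = pd.DataFrame([
    ['green', 'M',  10.1, 'class2'],
    ['red',   'L',  13.5, 'class1'],
    ['blue',  'XL', 15.3, 'class2']],
    columns=['color', 'size', 'price', 'classlabel'])
print(df)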


COMP 3314 17

Mapping Ordinal Features


● To ensure correct interpretation of ordinal features, convert string values
to integers

● A reverse mapping can be used to convert the integers back to the original string values (see the sketch below)
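A sketch of the mapping and reverse mapping, assuming the df defined above (size_mapping is an illustrative name):

```python
# Explicit ordering for the ordinal feature
size_mapping = {'XL': 3, 'L': 2, 'M': 1}
df['size'] = df['size'].map(size_mapping)
print(df)

# Reverse mapping to recover the original string values
inv_size_mapping = {v: k for k, v in size_mapping.items()}
print(df['size'].map(inv_size_mapping))
```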
COMP 3314 18

Encoding Class Labels


● Most models require integer encoding for class labels
○ Note: class labels are not ordinal, and it doesn't matter which integer number
we assign to a particular string label
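A sketch of one way to build such an integer mapping, assuming the df defined above:

```python
import numpy as np

# Enumerate the distinct class labels; which integer goes with which label
# does not matter
class_mapping = {label: idx
                 for idx, label in enumerate(np.unique(df['classlabel']))}
print(class_mapping)
print(df['classlabel'].map(class_mapping))
```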
COMP 3314 19

LabelEncoder
● Alternatively, there is a convenient LabelEncoder class directly
implemented in scikit-learn to achieve this

● The fit_transform method is a shortcut for calling fit and transform separately (see the sketch below)
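A sketch using LabelEncoder, assuming the df defined above:

```python
from sklearn.preprocessing import LabelEncoder

class_le = LabelEncoder()
# fit_transform is a shortcut for calling fit and transform separately
y = class_le.fit_transform(df['classlabel'].values)
print(y)

# Map the integer labels back to the original strings
print(class_le.inverse_transform(y))
```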
COMP 3314 20

One-Hot Encoding
● We could use a similar approach to transform the nominal color column
of our dataset, as follows

○ Problem:
■ Model may assume that green > blue, and red > green
■ This could result in a suboptimal model
● Workaround: Use one-hot encoding
○ Create a dummy feature for each unique value of nominal features
■ E.g., a blue sample is encoded as blue = 1 , green = 0 , red = 0
COMP 3314 21

One-Hot Encoding
● Use the OneHotEncoder available in scikit-learn’s preprocessing module (see the sketch below)
○ Apply it to only a single column (here, the color column)
○ reshape(-1, 1) turns the column into the 2-D array the encoder expects; -1 means an unknown dimension that NumPy figures out
○ Each distinct value (blue, green, red) becomes its own dummy column
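A sketch, assuming the df defined above (with size already mapped to integers):

```python
from sklearn.preprocessing import OneHotEncoder

X = df[['color', 'size', 'price']].values
color_ohe = OneHotEncoder()
# Encode only the color column; reshape(-1, 1) turns the 1-D column into the
# 2-D array the encoder expects (-1 lets NumPy infer that dimension)
print(color_ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray())
```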
COMP 3314 22

One-Hot Encoding via ColumnTransformer


● To selectively transform columns in a multi-feature array, use ColumnTransformer
○ Accepts a list of (name, transformer, column(s)) tuples
○ In the sketch below, only the first column (color) is turned into dummy features; the remaining columns are passed through unchanged
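A sketch, assuming the same feature array X as above:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = df[['color', 'size', 'price']].values
c_transf = ColumnTransformer([
    ('onehot', OneHotEncoder(), [0]),     # one-hot encode the color column
    ('nothing', 'passthrough', [1, 2]),   # leave size and price untouched
])
print(c_transf.fit_transform(X).astype(float))
```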


COMP 3314 23

One-Hot Encoding - Via Pandas


● An even more convenient way to create those dummy features via
one-hot encoding is to use the get_dummies method implemented
in pandas
○ get_dummies will only convert string columns
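A sketch, assuming the df defined above:

```python
import pandas as pd

# get_dummies converts only the string columns and leaves numeric ones as-is
print(pd.get_dummies(df[['price', 'color', 'size']]))
```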
COMP 3314 24

One-Hot Encoding - Dropping First Feature


● Note that we do not lose any information by removing one dummy column
○ E.g., if we remove the column color_blue, the feature information is still
preserved since if we observe color_green=0 and color_red=0, it implies that
the observation must be blue
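A sketch of dropping the first dummy column via get_dummies' drop_first argument:

```python
import pandas as pd

# Dropping the first dummy column per feature removes the redundancy
# without losing information
print(pd.get_dummies(df[['price', 'color', 'size']], drop_first=True))
```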
COMP 3314 25

UCI Wine Dataset


● The UCI wine dataset consists of 178 wine samples with 13 features describing
their different chemical properties
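The original slides presumably load the CSV from the UCI repository; as a sketch, scikit-learn's bundled copy of the same data can be used instead:

```python
import pandas as pd
from sklearn.datasets import load_wine

# scikit-learn ships a copy of the UCI Wine data: 178 samples, 13 features
wine = load_wine()
df_wine = pd.DataFrame(wine.data, columns=wine.feature_names)
df_wine['class'] = wine.target
print(df_wine.shape)                  # (178, 14)
print(df_wine['class'].value_counts())
```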
COMP 3314 26

UCI Wine Dataset: Training-Testing


● Let’s first divide the dataset into separate training and testing sets, using 30% of the samples for testing (see the sketch below)
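A sketch of the split with scikit-learn's train_test_split, assuming X and y from the Wine data above:

```python
from sklearn.model_selection import train_test_split

X, y = wine.data, wine.target
# 70% training, 30% testing; stratify keeps the class proportions equal
# in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
```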


COMP 3314 27

UCI Wine Dataset: Training-Testing


● It is important to balance the trade-off between inaccurate estimation of
generalization error and withholding too much information from the
learning algorithm
● In practice, the most commonly used splits are 60:40, 70:30, or 80:20,
depending on the size of the initial dataset
○ For large datasets, 90:10 or 99:1 splits are also common and appropriate
■ Intuition: if we need 50 test samples, a dataset of 100 requires a 50:50 split, whereas a dataset of 500 only needs 90:10; the bigger the dataset, the smaller the testing ratio can be (the testing set should not exceed 50%)
● Instead of discarding the allocated test data after model training and evaluation, we can retrain the classifier on the entire dataset, as this could improve the predictive performance of the model
○ While this approach is generally recommended, it could lead to worse generalization performance
COMP 3314 28

Feature Scaling
● The majority of ML algorithms require feature scaling
○ Decision trees and random forests are two of the few ML algorithms that don’t require feature scaling
● Importance
○ Consider the squared error function in Adaline for two dimensional features
where one feature is measured on a scale from 1 to 10 and the second feature is
measured on a scale from 1 to 100,000
■ The second feature would contribute to the error with a much higher
significance
● Two common approaches to bring different features onto the same scale
○ Normalization
■ E.g., rescaling features to a range of [0, 1]
○ Standardization
■ E.g., center features at mean 0 with standard deviation 1
COMP 3314 29

Feature Scaling - Normalization
● Most often, normalization refers to rescaling features to the range [0, 1]
● To normalize our data, we can simply apply min-max scaling to each feature column: find the minimum and maximum value of the column, then compute a new value x(i)_norm for each sample x(i) as follows
○ x(i)_norm = (x(i) − x_min) / (x_max − x_min)
○ Here x_min is the smallest value in the feature column and x_max the largest
● scikit-learn implements this as the MinMaxScaler class (sketched below)
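A sketch using MinMaxScaler, assuming the X_train/X_test split from above:

```python
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
# Fit (i.e., learn min and max per column) on the training data only,
# then apply the same scaling to the test data
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)
```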
COMP 3314 30

Feature Scaling - Standardization


● Standardization is more practical for various reasons, including retaining useful information about outliers
● A new value x(i)_std of a sample x(i) is calculated as follows
○ x(i)_std = (x(i) − μ_x) / σ_x
● Here μ_x is the sample mean of the feature column and σ_x the corresponding standard deviation
● Similar to the MinMaxScaler class, scikit-learn also implements a class for standardization (StandardScaler, sketched below)
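A sketch using StandardScaler, assuming the same split:

```python
from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
# Mean and standard deviation are estimated from the training data only
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)
```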
COMP 3314 31

Normalization vs. Standardization


● The following example illustrates the difference between
standardization and normalization
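A sketch of the comparison on a small illustrative array:

```python
import numpy as np

ex = np.array([0, 1, 2, 3, 4, 5], dtype=float)

# Standardization: zero mean, unit standard deviation
print('standardized:', (ex - ex.mean()) / ex.std())

# Min-max normalization: rescale to the range [0, 1]
print('normalized:', (ex - ex.min()) / (ex.max() - ex.min()))
```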
COMP 3314 32

Robust Scaler
● More advanced methods for feature scaling are available in sklearn
● The RobustScaler is especially helpful and recommended if
working with small datasets that contain many outliers
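A sketch, assuming the same training/testing split as above:

```python
from sklearn.preprocessing import RobustScaler

# RobustScaler centers on the median and scales by the interquartile range,
# so a few extreme outliers have little influence on the result
rbs = RobustScaler()
X_train_robust = rbs.fit_transform(X_train)
X_test_robust = rbs.transform(X_test)
```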
COMP 3314 33

Feature Selection
● Selects a subset of relevant features
○ Simplify model for easier interpretation
○ Shorten training time
○ Avoid curse of dimensionality
○ Reduce overfitting
● Feature selection ≠ feature extraction (covered in next chapter)
○ Selecting subset of the features ≠ creating new features
● We are going to look at two techniques for feature selection
○ L1 Regularization
○ Sequential Backward Selection (SBS)
COMP 3314 34

L1 vs. L2 Regularization
● L2 regularization (penalty), used in chapter 3:
○ ||w||_2^2 = sum_j (w_j)^2
● Another approach: L1 regularization (penalty):
○ ||w||_1 = sum_j |w_j|
● This will usually yield sparse feature weights
○ Most feature weights will be zero; a zero weight means the feature is not selected (discarded)
● Sparsity can be useful in practice if we have a high-dimensional dataset with many features that are irrelevant
● L1 regularization can therefore be used as a technique for feature selection
COMP 3314 35

Geometric Interpretation
● To better understand how L1 regularization encourages sparsity, let’s take a look
at a geometric interpretation of regularization
● Consider the sum of squared errors cost function used for Adaline
● Plot of the contours of a convex cost function for two weight coefficients w1 and w2
○ The cost increases as we move outward from the minimum at the center
COMP 3314 36

Geometric Interpretation: L2 Regularization


● Regularization adds a penalty to the cost function to encourage smaller weights
○ By increasing the regularization strength λ we shrink the weights towards zero and decrease the dependency of our model on the training data
○ We cannot simply minimize the cost alone, because the penalty would then be huge; we need to balance the cost against the penalty
○ For the L2 penalty, all weight vectors at the same distance from the origin (i.e., on the same circle) have the same penalty value
COMP 3314 37

Geometric Interpretation: L1 Regularization


● Since the L1 penalty is the sum of the absolute weight coefficients, we can represent it as a diamond shape
● The point of the diamond closest to the minimum of the unpenalized cost is very likely to be one of the sharp corners, which lie on the axes; this is why L1 regularization encourages sparsity

Mathematical details can be found in


Section 3.4 of
The Elements of Statistical Learning
COMP 3314 38

Sparse Solution
● We can simply set the penalty parameter to ‘l1’ for models in scikit-learn that
support L1 regularization

● In scikit-learn, w0 corresponds to intercept_ and wj (for j > 0) corresponds to the


values in coef_
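A sketch, assuming the standardized Wine split from above (liblinear is one of the solvers that supports the L1 penalty):

```python
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
lr.fit(X_train_std, y_train)
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy:', lr.score(X_test_std, y_test))

print(lr.intercept_)   # w0 (one per class)
print(lr.coef_)        # wj for j > 0; many entries are exactly zero
```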
COMP 3314 39

Sparse Solution - Regularization Strength
● Plotting the weight coefficients for different regularization strengths shows how the weights converge to zero (see the sketch below)
○ If C (the inverse of the regularization strength) is too small, i.e., regularization is too strong, all weights are driven to zero; larger C leaves more weights non-zero
○ We should find an appropriate C, neither too large nor too small
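A sketch of computing such a regularization path, assuming the standardized Wine split from above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit models over a range of C values (C is the inverse regularization
# strength) and record the weights of one of the classes
weights, params = [], []
for c in np.arange(-4.0, 6.0):
    lr = LogisticRegression(penalty='l1', C=10.0**c, solver='liblinear')
    lr.fit(X_train_std, y_train)
    weights.append(lr.coef_[0])
    params.append(10.0**c)

# Small C (strong regularization): all weights driven to zero.
# Large C (weak regularization): most weights become non-zero.
print(np.round(np.array(weights), 3))
```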
COMP 3314 40

Sequential Backward Selection (SBS)


● Reduces an initial d-dimensional space to a k-dimensional subspace (k < d)
by automatically selecting features that are most relevant
● Idea:
○ Sequentially remove features until the desired number of features is reached
○ Define a criterion function J to be maximized
■ E.g., performance of the classifier after removal
■ Use a validation subset of the training set for performance
evaluation
○ Eliminate the feature that causes the least performance loss

■ In each round: remove each feature in turn, evaluate the classifier without it, and put it back; keep the removal whose remaining subset gives the maximum performance
COMP 3314 41

SBS
Steps:
1. Initialize the algorithm with k = d, where d is the dimensionality of the full feature space X_d
2. Determine the feature x⁻ = argmax_x J(X_k − x) that maximizes the criterion function J when removed
3. Remove the feature x⁻ from the feature set:
   X_(k−1) = X_k − x⁻,  k = k − 1
4. Terminate if k equals the number of desired features; otherwise, go to step 2

● In the following we will implement SBS in Python from scratch


COMP 3314 42
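A from-scratch sketch of SBS along the lines described above (the class and attribute names SBS, subsets_, and scores_ are illustrative choices, not a scikit-learn API):

```python
from itertools import combinations
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


class SBS:
    """Sequential Backward Selection (from-scratch sketch)."""

    def __init__(self, estimator, k_features, scoring=accuracy_score,
                 test_size=0.25, random_state=1):
        self.estimator = clone(estimator)
        self.k_features = k_features   # desired number of features
        self.scoring = scoring         # criterion function J
        self.test_size = test_size
        self.random_state = random_state

    def fit(self, X, y):
        # Internal validation split used to evaluate the criterion J
        X_train, X_valid, y_train, y_valid = train_test_split(
            X, y, test_size=self.test_size, random_state=self.random_state)

        dim = X_train.shape[1]
        self.indices_ = tuple(range(dim))
        self.subsets_ = [self.indices_]
        self.scores_ = [self._calc_score(X_train, y_train,
                                         X_valid, y_valid, self.indices_)]

        while dim > self.k_features:
            scores, subsets = [], []
            # Try removing each feature in turn
            for p in combinations(self.indices_, r=dim - 1):
                scores.append(self._calc_score(X_train, y_train,
                                               X_valid, y_valid, p))
                subsets.append(p)

            # Keep the subset (i.e., the removal) with the best score
            best = int(np.argmax(scores))
            self.indices_ = subsets[best]
            self.subsets_.append(self.indices_)
            self.scores_.append(scores[best])
            dim -= 1
        return self

    def transform(self, X):
        return X[:, self.indices_]

    def _calc_score(self, X_train, y_train, X_valid, y_valid, indices):
        self.estimator.fit(X_train[:, indices], y_train)
        y_pred = self.estimator.predict(X_valid[:, indices])
        return self.scoring(y_valid, y_pred)


# Example usage on the standardized Wine training data from earlier sketches
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
sbs = SBS(knn, k_features=1)
sbs.fit(X_train_std, y_train)
# sbs.scores_ holds the validation accuracy for 13, 12, ..., 1 features
```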
COMP 3314 43

● Plotting the validation accuracy against the number of features (removing one feature at a time, from 13 down to 1) shows that the smallest subset that still performs well contains 3 features
COMP 3314 44

SBS - Analyzing the Result


● The smallest feature subset (k = 3) that yielded such a good performance on the validation dataset can be read off from the column indices recorded by SBS
● We can then compare the accuracy of the KNN classifier on the original test set using all 13 features with the accuracy obtained using only the three-feature subset (see the sketch below)
○ In this example, the three-feature subset gives a slightly lower testing accuracy
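A sketch of this comparison, assuming the sbs, knn, and standardized Wine split from the sketches above (the subset index below assumes SBS started from 13 features):

```python
# Pick the recorded subset that has 3 features left
# (index 10, since subsets_ runs from 13 features down to 1)
k3 = list(sbs.subsets_[10])

knn.fit(X_train_std, y_train)
print('Test accuracy, all 13 features:', knn.score(X_test_std, y_test))

knn.fit(X_train_std[:, k3], y_train)
print('Test accuracy, 3 features:', knn.score(X_test_std[:, k3], y_test))
```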


COMP 3314 45

Feature Selection Algorithms in scikit-learn


● There are many more feature selection algorithms available via
scikit-learn
● A comprehensive discussion of the different feature selection
methods is beyond the scope of this lecture
○ A good summary with illustrative examples can be found here
COMP 3314 46

Assessing Feature Importance


● We can determine relevant features using random forest
○ Measure the feature importance as the averaged information gain
● The random forest implementation in scikit-learn already collects the
feature importance values for us
○ Access them via the feature_importances_ attribute after fitting a
RandomForestClassifier
● In the following we will train a forest of 500 trees on the Wine dataset
and rank the 13 features by their respective importance measures
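A sketch, assuming the Wine data and split from the earlier sketches:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feat_labels = wine.feature_names
forest = RandomForestClassifier(n_estimators=500, random_state=1)
forest.fit(X_train, y_train)

importances = forest.feature_importances_
# Rank the 13 features from most to least important
for rank, idx in enumerate(np.argsort(importances)[::-1], start=1):
    print(f'{rank:2d}) {feat_labels[idx]:30s} {importances[idx]:.4f}')
```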
COMP 3314 47

● Note: different ML algorithms may choose different features and therefore give different results
COMP 3314 48

Conclusion
● Handle missing data correctly
● Encode categorical variables correctly
● Map ordinal and nominal feature values to integer representations
● L1 regularization can help us to avoid overfitting by reducing the
complexity of a model
● Use sequential feature selection algorithms to select meaningful features from a dataset
COMP 3314 49

References
● Most materials in this chapter are
based on
○ Book
○ Code
COMP 3314 50

References
● Some materials in this chapter
are based on
○ Book
○ Code
COMP 3314 51

References
● The Elements of Statistical Learning: Data Mining, Inference, and
Prediction, Second Edition
○ Trevor Hastie, Robert Tibshirani, Jerome Friedman
● https://web.stanford.edu/~hastie/ElemStatLearn/
● Pandas User Guide: Working with missing data
