
4. Data Preprocessing
COMP3314 Machine Learning
COMP 3314 2

Introduction
● Preprocessing a dataset is a crucial step
○ Garbage in, garbage out
○ Quality of data and amount of useful information it contains are
key factors
● Data-gathering methods are often loosely controlled, resulting in
out-of-range values (e.g., Income: −100), impossible data
combinations (e.g., Sex: Male, Pregnant: Yes), missing values, etc.
● Preprocessing is often the most important phase of a machine
learning project
COMP 3314 3

Outline
● In this chapter you will learn how to …
○ Remove and impute missing values from the dataset
○ Get categorical data into shape
○ Select relevant features
● Specifically, we will be looking at the following topics
○ Dealing with missing data
○ Nominal and ordinal features
○ Partitioning a dataset into training and testing sets
○ Bringing features onto the same scale
○ Selecting meaningful features
○ Sequential feature selection algorithms
○ Random forests
COMP 3314 4

Dealing with Missing Data


● Missing data is common in real-world applications
○ Samples might be missing one or more values
● Most ML models are unable to handle missing values
● Two ways to handle this
○ Remove the affected entries
○ Impute missing values from other samples and features
COMP 3314 5

Code - DataPreprocessing.ipynb
● Available here on CoLab
COMP 3314 6

Identifying Missing Values


● Consider the following simple example generated from CSV
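A minimal sketch of the kind of example meant here (the column names A to D and the values are illustrative, not taken from the original slide):

```python
import pandas as pd
from io import StringIO

# A small CSV string with empty fields; pandas parses empty fields as NaN
csv_data = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,"""

df = pd.read_csv(StringIO(csv_data))
print(df)
```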
COMP 3314 7

Identifying Missing Values


● For larger data, it can be tedious to look for missing values
○ Use the isnull method to return a DataFrame with Boolean
values that indicate whether a cell
■ contains a numeric value (False), or if
■ data is missing (True)
● Use sum() to count the number of missing values per column
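For instance, continuing with the hypothetical df from the sketch above:

```python
df.isnull()        # DataFrame of Booleans: True where a value is missing
df.isnull().sum()  # number of missing values per column
```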
COMP 3314 8

Remove Missing Data


● One option is to simply remove the corresponding features (columns) or
samples (rows)
● Rows with missing values can be dropped via the dropna method with
argument axis=0

● Columns with missing values can be dropped via the dropna method with
argument axis=1
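A short sketch of both options, using the same hypothetical df:

```python
df.dropna(axis=0)  # drop rows (samples) that contain at least one NaN
df.dropna(axis=1)  # drop columns (features) that contain at least one NaN
```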
COMP 3314 9

Dropna
● The dropna method supports several additional parameters that can
come in handy
○ only drop rows where all columns are NaN
○ drop rows that have fewer than 4 real values
○ only drop rows where NaN appears in specific columns (here: 'C')
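These three cases correspond to the how, thresh, and subset parameters of dropna; a sketch with the hypothetical df from before:

```python
df.dropna(how='all')     # only drop rows where all columns are NaN
df.dropna(thresh=4)      # drop rows that have fewer than 4 real (non-NaN) values
df.dropna(subset=['C'])  # only drop rows where NaN appears in column 'C'
```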
COMP 3314 10

Remove Missing Data


● Convenient approach
● Disadvantage
○ May remove too many samples
■ Risk losing valuable information
■ Our classifier may need them to discriminate between
classes
● Could make a reliable analysis impossible
● Alternative approach: Interpolation
COMP 3314 11

Interpolation
● Estimate missing values from the other training samples in our dataset
● Example: Mean imputation
○ Replace missing value with the mean value of the entire feature column

○ Try changing the strategy to: median, most_frequent, or constant (with fill_value=42)
○ Note: mean and median are for numerical data only; most_frequent and constant can be used for numerical data or strings
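A sketch of mean imputation with scikit-learn's SimpleImputer, assuming the df with missing values from earlier:

```python
import numpy as np
from sklearn.impute import SimpleImputer

imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
print(imputed_data)

# Try changing strategy to 'median', 'most_frequent',
# or 'constant' with fill_value=42
```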
COMP 3314 12

Scikit-Learn Estimator API


● SimpleImputer is a Transformer class
○ Used for data transformation
○ Two essential methods
■ fit
■ transform
● Estimator class
○ Very similar to transformer class
○ Two essential methods
■ fit
■ predict
■ transform (optional)
COMP 3314 13

Transformer - Fit and Transform


● fit method
○ Used to learn the
parameters from the
training data
● transform method
○ Uses those parameters
to transform the data

Note: The number of features needs to be identical for the data passed to fit and to transform
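A small sketch of this pattern (the arrays are made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0]])
X_test = np.array([[np.nan, 1.0], [7.0, np.nan]])

imputer = SimpleImputer(strategy='mean')
imputer.fit(X_train)                    # learn the column means from the training data only
X_train_imp = imputer.transform(X_train)
X_test_imp = imputer.transform(X_test)  # reuse the learned means; feature count must match
```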
COMP 3314 14

Estimator - Fit and Predict


● Use fit method to learn parameters
○ Additionally provide class labels
● Use predict method to make predictions
about unlabeled data
COMP 3314 15

Handling Categorical Data


● We have been exclusively working with numerical data
● How to handle categorical data?
○ A categorical feature can take on one of a limited, and usually fixed, number of possible values
● Example of categorical data: t-shirt size (XL, L, M)
COMP 3314 16

Categorical Data
● It is common that real-world datasets contain categorical features
○ How to deal with this type of data?
● Nominal features vs ordinal features
○ Ordinal features can be sorted / ordered
■ E.g., t-shirt size, because we can define an order XL>L>M
○ Nominal features don't imply any order
■ E.g., t-shirt color
COMP 3314 17

Example Dataset

(example table with a nominal, an ordinal, and a numerical column)
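Since the table itself is not reproduced here, a hypothetical DataFrame with this structure might look as follows (the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame([
    ['green', 'M', 10.1, 'class2'],
    ['red', 'L', 13.5, 'class1'],
    ['blue', 'XL', 15.3, 'class2']],
    columns=['color', 'size', 'price', 'classlabel'])
```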


COMP 3314 18

Mapping Ordinal Features


● To ensure correct interpretation of ordinal features, convert string values
to integers

● Reverse-mapping to go back
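A sketch of such a mapping, assuming the hypothetical df above and the ordering XL > L > M:

```python
size_mapping = {'XL': 3, 'L': 2, 'M': 1}
df['size'] = df['size'].map(size_mapping)    # strings -> integers

inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)             # reverse-mapping back to strings
```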
COMP 3314 19

Encoding Class Labels


● Most models require integer encoding for class labels
○ Note: class labels are not ordinal, and it doesn't matter which integer number
we assign to a particular string label
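One possible sketch, enumerating the unique class labels of the hypothetical df:

```python
import numpy as np

class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
df['classlabel'] = df['classlabel'].map(class_mapping)   # e.g., class1 -> 0, class2 -> 1
```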
COMP 3314 20

LabelEncoder
● Alternatively, there is a convenient LabelEncoder class directly
implemented in scikit-learn to achieve this

○ Note: fit_transform is a shortcut for calling fit and transform separately
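A sketch using LabelEncoder on a string-valued class label column:

```python
from sklearn.preprocessing import LabelEncoder

class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)  # fit_transform = fit + transform
class_le.inverse_transform(y)                        # map the integers back to strings
```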
COMP 3314 21

One-Hot Encoding
● We could use a similar approach to transform the nominal color column
of our dataset, as follows

○ Problem:
■ Model may assume that green > blue, and red > green
■ This could result in a suboptimal model
● Workaround: Use one-hot encoding
○ Create a dummy feature for each unique value of nominal features
■ E.g., a blue sample is encoded as blue = 1 , green = 0 , red = 0
COMP 3314 22

One-Hot Encoding
● Use the OneHotEncoder available in scikit-learn’s preprocessing
module
○ In the code, the column is passed as reshape(-1, 1): -1 stands for an unknown dimension that we want NumPy to figure out
○ The encoder is applied to only a single column (see the sketch below)
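A sketch of this, assuming df still has the string-valued color column:

```python
from sklearn.preprocessing import OneHotEncoder

X = df[['color', 'size', 'price']].values
color_ohe = OneHotEncoder()
# reshape(-1, 1): -1 lets NumPy infer the number of rows; encode only column 0 (color)
color_ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray()
```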
COMP 3314 23

One-Hot Encoding via ColumnTransformer


● To selectively transform columns in a multi-feature array, use
ColumnTransformer
○ Accepts a list of (name, transformer, column(s)) tuples
○ In the sketch below, only the first column is modified
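A sketch of the same transformation with ColumnTransformer, using the X array from the previous sketch and assuming size has already been mapped to integers:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

c_transf = ColumnTransformer([
    ('onehot', OneHotEncoder(), [0]),    # one-hot encode only the first column (color)
    ('nothing', 'passthrough', [1, 2])   # pass size and price through unchanged
])
c_transf.fit_transform(X).astype(float)
```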
COMP 3314 24

One-Hot Encoding - Via Pandas


● An even more convenient way to create those dummy features via
one-hot encoding is to use the get_dummies method implemented
in pandas
○ get_dummies will only convert string columns
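For example, with the hypothetical df used above:

```python
import pandas as pd

# Only the string-valued color column is converted into dummy columns
pd.get_dummies(df[['price', 'color', 'size']])
```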
COMP 3314 25

One-Hot Encoding - Dropping First Feature


● Note that we do not lose any information by removing one dummy column
○ E.g., if we remove the column color_blue, the feature information is still
preserved since if we observe color_green=0 and color_red=0, it implies that
the observation must be blue
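With get_dummies, dropping the first dummy column of each nominal feature can be done via drop_first:

```python
import pandas as pd

pd.get_dummies(df[['price', 'color', 'size']], drop_first=True)
```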
COMP 3314 26

UCI Wine Dataset


● The UCI wine dataset consists of 178 wine samples with 13 features describing
their different chemical properties
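One convenient way to load it is via scikit-learn's built-in copy of the dataset (an assumption; the slides may read it directly from the UCI repository instead):

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
df_wine = pd.DataFrame(wine.data, columns=wine.feature_names)
df_wine['class_label'] = wine.target
print(df_wine.shape)   # (178, 14): 178 samples, 13 features plus the class label
```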
COMP 3314 27

UCI Wine Dataset: Training-Testing


● Let’s first divide the dataset into separate training and testing sets
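A sketch of a 70:30 split, assuming the df_wine DataFrame from the previous sketch:

```python
from sklearn.model_selection import train_test_split

X = df_wine.drop('class_label', axis=1).values
y = df_wine['class_label'].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)   # stratify keeps class proportions
```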
COMP 3314 28

UCI Wine Dataset: Training-Testing


● It is important to balance the trade-off between inaccurate estimation of
generalization error and withholding too much information from the
learning algorithm
● In practice, the most commonly used splits are 60:40, 70:30, or 80:20,
depending on the size of the initial dataset
○ For large datasets, 90:10 or 99:1 splits are also common and
appropriate
● Instead of discarding the allocated test data after model training and evaluation, we can retrain the classifier on the entire dataset, as this can improve the predictive performance of the model
○ While this approach is generally recommended, it could lead to worse generalization performance if the dataset is small and contains outliers
COMP 3314 29

Feature Scaling
● The majority of ML algorithms require feature scaling
○ Decision trees and random forests are two of the few ML algorithms that don't require feature scaling
● Importance
○ Consider the squared error function in Adaline for two-dimensional features
where one feature is measured on a scale from 1 to 10 and the second feature is
measured on a scale from 1 to 100,000
■ The second feature would contribute to the error with a much higher
significance
● Two common approaches to bring different features onto the same scale
○ Normalization
■ E.g., rescaling features to a range of [0, 1]
○ Standardization
■ E.g., center features at mean 0 with standard deviation 1
COMP 3314 30

Feature Scaling - Normalization


● Most often, normalization refers to the rescaling of features to a range of [0, 1]
● To normalize our data, we can simply apply a min-max scaling to each feature column
○ A new value x_norm^(i) of a sample x^(i) is calculated as follows
○ Here x_min is the smallest value in a feature column and x_max the largest
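Written out (the standard min-max scaling formula):

$$x_{\text{norm}}^{(i)} = \frac{x^{(i)} - x_{\min}}{x_{\max} - x_{\min}}$$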
COMP 3314 31

Feature Scaling - Standardization


● Standardization is more practical for various reasons including retaining useful
information about outliers
● A new value x_std^(i) of a sample x^(i) is calculated as follows
● Here μ_x is the sample mean of a feature column and σ_x the corresponding standard deviation
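Written out (the standard standardization formula):

$$x_{\text{std}}^{(i)} = \frac{x^{(i)} - \mu_x}{\sigma_x}$$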
● Similar to the MinMaxScaler class, scikit-learn also implements a class for standardization
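A sketch of both scalers, assuming the Wine training and test arrays from earlier; note that both are fit on the training data only:

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)       # reuse the min/max learned from the training data

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)      # reuse the training mean and standard deviation
```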
COMP 3314 32

Normalization vs. Standardization


● The following example illustrates the difference between
standardization and normalization
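Since the slide's table is not reproduced, a small sketch of the same comparison on the values 0 to 5:

```python
import numpy as np

ex = np.array([0, 1, 2, 3, 4, 5])
print('standardized:', (ex - ex.mean()) / ex.std())
print('normalized:  ', (ex - ex.min()) / (ex.max() - ex.min()))
```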
COMP 3314 33

Robust Scaler
● More advanced methods for feature scaling are available in sklearn
● The RobustScaler is especially helpful and recommended if
working with small datasets that contain many outliers
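A sketch of its use (same fit-on-training-data pattern as above):

```python
from sklearn.preprocessing import RobustScaler

# Centers each feature on its median and scales by the interquartile range,
# so extreme outliers have less influence than with min-max scaling
rbs = RobustScaler()
X_train_robust = rbs.fit_transform(X_train)
X_test_robust = rbs.transform(X_test)
```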
COMP 3314 34

Feature Selection
● Selects a subset of relevant features
○ Simplify model for easier interpretation
○ Shorten training time
○ Avoid curse of dimensionality
○ Reduce overfitting
● Feature selection ≠ feature extraction (covered in next chapter)
○ Selecting subset of the features ≠ creating new features
● We are going to look at two techniques for feature selection
○ L1 Regularization
○ Sequential Backward Selection (SBS)
COMP 3314 35

L1 vs. L2 Regularization
● L2 regularization (penalty) used in chapter 3

● Another approach: L1 regularization (penalty)
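The two penalty terms are shown as images on the slide; their standard definitions are:

$$\text{L2:}\quad \lambda \lVert \mathbf{w} \rVert_2^2 = \lambda \sum_{j=1}^{m} w_j^2 \qquad\qquad \text{L1:}\quad \lambda \lVert \mathbf{w} \rVert_1 = \lambda \sum_{j=1}^{m} \lvert w_j \rvert$$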


COMP 3314 36

L1 Regularization
● Why is L1 regularization a technique for feature selection?
COMP 3314 37

The two axes represent the model parameters; previously we used w1 and w2.
COMP 3314 38

Points on the contour have equal cost. The background contour represents the L2 regularization term.
COMP 3314 39

Why are the contours circular? Make sure you can answer this question before you continue.
COMP 3314 40

Let's initialize our model with (2.0, 0.5) and run gradient descent. The cost decreases linearly with the distance to the origin (0 cost at the origin).
COMP 3314 41

The background contour represents Adaline's cost function + the L2 regularization term (i.e., a combination of both).
COMP 3314 42

Starting from the initialization, this is the path gradient descent takes.
COMP 3314 43

Points on the contour have equal cost. The background contour represents the L1 regularization term.
COMP 3314 44

Why are the contours diamond-shaped? Make sure you can answer this question before you continue.
COMP 3314 45

Let's initialize our model with (2.0, 0.5) and run gradient descent (0 cost at the origin). First the cost decreases equally for both parameters; when one of them is 0, it will then decrease the other one only.
COMP 3314 46

The background contour represents Adaline's cost function + the L1 regularization term (i.e., a combination of both).
COMP 3314 47

Starting from the initialization, this is the path gradient descent takes. Notice how the path quickly reaches 0 for one of the parameters.
COMP 3314 48

To avoid the bouncing around, you should gradually reduce the learning rate.
COMP 3314 49

Sparse Solution
● We can simply set the penalty parameter to ‘l1’ for models in scikit-learn that
support L1 regularization

● In scikit-learn, w0 corresponds to intercept_ and wj (for j > 0) corresponds to the values in coef_
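A sketch with logistic regression on the standardized Wine data from earlier (the solver choice is an assumption; liblinear and saga support the L1 penalty):

```python
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
lr.fit(X_train_std, y_train)
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy:', lr.score(X_test_std, y_test))
print(lr.intercept_)   # w0 per class
print(lr.coef_)        # many weights are exactly zero: a sparse solution
```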
COMP 3314 50

Sparse Solution - Regularization Strength


COMP 3314 51

Sparse Solution - Regularization Strength
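The plots on these two slides are not reproduced; a minimal sketch of how the effect of the regularization strength could be explored on the standardized Wine data (the range of C values is an assumption):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Smaller C = stronger regularization = more weights driven to exactly zero
for c in 10.0 ** np.arange(-4, 5):
    lr = LogisticRegression(penalty='l1', C=c, solver='liblinear')
    lr.fit(X_train_std, y_train)
    print(f'C={c:g}: non-zero weights = {np.count_nonzero(lr.coef_)}')
```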


COMP 3314 52

Sequential Backward Selection (SBS)


● Reduces an initial d-dimensional space to a k-dimensional subspace (k < d)
by automatically selecting features that are most relevant
● Idea:
○ Sequentially remove features until desired feature number is reached
○ Define a criterion function J to be maximized
■ E.g., performance of the classifier after removal
○ Eliminate the feature that causes the least performance loss
COMP 3314 53

SBS
Steps:
1. Initialize the algorithm with k = d, where d is the dimensionality of the full feature space X_d
2. Determine the feature x^- = argmax J(X_k - x) that maximizes the criterion function J
3. Remove the feature x^- from the feature set: X_{k-1} = X_k - x^-, k = k - 1
4. Terminate if k equals the number of desired features; otherwise, go to step 2

● In the following we will implement SBS in Python from scratch
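The full implementation is shown on the following slides as images; below is a minimal from-scratch sketch that follows the steps above, scoring each candidate subset on an internal validation split:

```python
from itertools import combinations
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

class SBS:
    """Sequential Backward Selection (sketch)."""
    def __init__(self, estimator, k_features, scoring=accuracy_score,
                 test_size=0.25, random_state=1):
        self.estimator = clone(estimator)
        self.k_features = k_features
        self.scoring = scoring
        self.test_size = test_size
        self.random_state = random_state

    def fit(self, X, y):
        # internal validation split used as the criterion function J
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, test_size=self.test_size, random_state=self.random_state)
        dim = X_tr.shape[1]
        self.indices_ = tuple(range(dim))
        self.subsets_ = [self.indices_]
        self.scores_ = [self._calc_score(X_tr, y_tr, X_val, y_val, self.indices_)]
        while dim > self.k_features:
            scores, subsets = [], []
            # evaluate every subset obtained by removing exactly one feature
            for p in combinations(self.indices_, r=dim - 1):
                scores.append(self._calc_score(X_tr, y_tr, X_val, y_val, p))
                subsets.append(p)
            best = int(np.argmax(scores))    # keep the subset whose removal hurts least
            self.indices_ = subsets[best]
            self.subsets_.append(self.indices_)
            self.scores_.append(scores[best])
            dim -= 1
        return self

    def transform(self, X):
        return X[:, self.indices_]

    def _calc_score(self, X_tr, y_tr, X_val, y_val, indices):
        self.estimator.fit(X_tr[:, indices], y_tr)
        return self.scoring(y_val, self.estimator.predict(X_val[:, indices]))
```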


COMP 3314 54
COMP 3314 55
COMP 3314 56

SBS - Analyzing the Result


● The smallest feature subset (k = 3) that yielded such a good performance on the
validation dataset has the following features

● The accuracy of the KNN classifier on the original test set is as follows

● The three-feature subset has the following accuracy
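The specific features and accuracy values are shown on the slide images; a sketch of how they could be obtained with the SBS class above and a KNN classifier (the index 10 assumes 13 starting features, so subsets_[10] holds the 3-feature subset):

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
sbs = SBS(knn, k_features=1)
sbs.fit(X_train_std, y_train)

k3 = list(sbs.subsets_[10])             # the 3-feature subset found by SBS
print(df_wine.columns[k3].tolist())     # names of the selected features

knn.fit(X_train_std, y_train)
print('Test accuracy (all 13 features):', knn.score(X_test_std, y_test))
knn.fit(X_train_std[:, k3], y_train)
print('Test accuracy (3 features):', knn.score(X_test_std[:, k3], y_test))
```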


COMP 3314 57

Feature Selection Algorithms in scikit-learn


● There are many more feature selection algorithms available via
scikit-learn
● A comprehensive discussion of the different feature selection
methods is beyond the scope of this lecture
○ A good summary with illustrative examples can be found here
COMP 3314 58

Assessing Feature Importance


● We can determine relevant features using random forest
○ Measure the feature importance as the averaged impurity decrease
● The random forest implementation in scikit-learn already collects the
feature importance values for us
○ Access them via the feature_importances_ attribute after fitting a
RandomForestClassifier
● In the following we will train a forest of 500 trees on the Wine dataset
and rank the 13 features by their respective importance measures
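A sketch of such a ranking, assuming the Wine split from earlier (feature names taken from the hypothetical df_wine; n_estimators=500 as stated above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feat_labels = df_wine.columns[:-1]

forest = RandomForestClassifier(n_estimators=500, random_state=1)
forest.fit(X_train, y_train)

importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]    # sort from most to least important
for rank, idx in enumerate(indices, start=1):
    print(f'{rank:2d}) {feat_labels[idx]:<30} {importances[idx]:.4f}')
```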
COMP 3314 59
COMP 3314 60

SelectFromModel
● scikit-learn implements a SelectFromModel object that selects features based on a
user-specified threshold after model fitting
● Use the RandomForestClassifier as a feature selector and intermediate step in a
scikit-learn Pipeline object, which allows us to connect different preprocessing
steps with an estimator
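A sketch using the fitted forest from the previous example and a threshold of 0.1 (the threshold value is an assumption):

```python
from sklearn.feature_selection import SelectFromModel

sfm = SelectFromModel(forest, threshold=0.1, prefit=True)
X_selected = sfm.transform(X_train)
print('Features meeting the threshold:', X_selected.shape[1])
```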
COMP 3314 61

Feature Extraction
● Alternative way to reduce the model complexity
○ Feature selection
■ Select a subset of original features
○ Feature extraction
■ Technique to compress a dataset onto a lower-dimensional
feature space (dimensionality reduction)
■ Covered in the next chapter
COMP 3314 62

Conclusion
● Handle missing data correctly
● Encode categorical variables correctly
● Map ordinal and nominal feature values to integer representations
● L1 regularization can help us to avoid overfitting by reducing the
complexity of a model
● Use sequential feature selection algorithms to select meaningful features from a dataset
COMP 3314 63

References
● Most materials in this chapter are
based on
○ Book
○ Code
COMP 3314 64

References
● Some materials in this chapter
are based on
○ Book
○ Code
COMP 3314 65

References
● The Elements of Statistical Learning: Data Mining, Inference, and
Prediction, Second Edition
○ Trevor Hastie, Robert Tibshirani, Jerome Friedman
● https://web.stanford.edu/~hastie/ElemStatLearn/
● Pandas User Guide: Working with missing data
