Lect 04 Preprocessing Structured
Large part of this lecture is based on: “Data Preparation for Machine Learning” by Jason Brownlee
2
Machine Learning Pipeline
What is Data Preprocessing
4
Why do we need to preprocess the data?
5
ML Algorithms Expect Numbers
Model.fit(X, y)
[Figure: raw text instances such as "1. Machine learning seems cool, but I hate programming." and "2. This is a bad investment." are mapped to a feature table. Rows are instances X1, X2, ...; columns are features F1, F2, ..., Fm holding numeric values Vi,1 ... Vi,m, plus a label column y holding L1, L2, ...]
6
ML Algorithms Requirements
• Different ML algorithms have different requirements & assumptions
• Some linear models expect numeric input variables with
Normal/Gaussian probability distribution
• Some algorithms do not perform well if input variables are irrelevant
or correlated
• For instance, tree-based models are insensitive to data
characteristics, while linear regression models are sensitive
7
Model Performance Depends on Data
• The performance of an ML algorithm is only as good as its training data
• The data may not be very representative of the problem at hand
• In practice, we are often simply given data and have to do the best we
can with what is available
• Real world data are typically:
• Incomplete: missing values, lacking certain attributes of interest, mistyped, or
containing only aggregate data
• Noisy: containing errors or outliers
• Inconsistent: containing discrepancies, conflicting examples
• Complex nonlinear relationships may be compressed in the raw data
and need data preprocessing to be exposed
8
Data Preparation without Data Leakage
• Data leakage happens when a model is given information during training
that it will not have access to when making predictions in production
• Test data is leaked into the training set
• Data from the future is leaked to the past
• Leakage leads to overestimated performance during development
and disappointing performance in operation
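A minimal sketch of the leakage-safe workflow, using scikit-learn on made-up toy data: split first, then fit any preprocessing on the training portion only and reuse it unchanged on the test portion.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data standing in for a real dataset (hypothetical values).
X = np.random.rand(100, 3) * 100
y = np.random.randint(0, 2, size=100)

# Split first, then fit preprocessing on the training portion only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics estimated from training data only
X_test_scaled = scaler.transform(X_test)        # same statistics reused; no test information leaks in
```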
9
Data Preparation without Data Leakage
10
Tasks in Data Preprocessing
1. Data Cleaning
• Identifying and correcting mistakes or errors in the data that may negatively
impact a predictive model
2. Data Transformation
• Changing the scale or distribution of variables
11
Overview of Data Cleaning
Data cleaning is typically the first data preprocessing step performed after data
collection and integration
12
Basic Data Cleaning
• Identify and remove features (column variables) that only have a single
value (zero-variance features)
• They add no information
• Identify and consider carefully features with very few unique values (near zero-
variance features)
• Might be useful when dealing with categorical data
• For numerical data, they can cause errors or unexpected results for some algorithms (e.g.,
linear models)
• Identify and remove duplicate samples (rows with same observations)
• Typically, ML algorithms perform better after removing duplicate instances
• Duplicate instances will result in misleading performance evaluation
• If you think otherwise for your model/data, evaluate the model trained with
and without duplicate instances
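A small pandas sketch of these basic cleaning steps (dropping single-valued columns and duplicate rows); the DataFrame below is a made-up example.

```python
import pandas as pd

df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 3.0],
    "f2": [7.0, 7.0, 7.0, 7.0],   # single-valued (zero-variance) column
    "f3": [0.1, 0.5, 0.9, 0.9],
})

# Drop columns that contain only a single unique value.
single_valued = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=single_valued)

# Drop duplicate rows (the last row duplicates the third one).
df = df.drop_duplicates()
print(df)
```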
13
Outlier Identification and Removal
• An outlier is an observation that is unlike the other observations
• They are rare, distinct, or do not fit in some way
• They are samples that are exceptionally far from the mainstream of the data
• Outliers may be caused by:
• Measurement or input error
• Data corruption
• True outlier observation
• In general, there is no precise way to define and identify outliers, as this
depends on the specifics of the data
• Domain expert must interpret the raw observations and decide whether a
value is an outlier or not
• Even with a good understanding of the data, outliers can be hard to define
• Be careful before removing or changing values (for small dataset size)
14
Outlier Identification and Removal
• We can use statistical methods to identify observations that appear
to be rare or unlikely given the available data
• For normally distributed data
• Observations that fall more than 3 standard deviations from the mean can
be considered outliers ($x_i > \mu + 3\sigma$ or $x_i < \mu - 3\sigma$)
• This threshold can vary with the data size:
• $4\sigma$ for large datasets and $2\sigma$ for small datasets
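A possible NumPy sketch of the 3-sigma rule on synthetic Gaussian data; the injected outlier values are hypothetical.

```python
import numpy as np

np.random.seed(0)
x = np.random.normal(loc=50, scale=5, size=1000)
x[::200] = 120  # inject a few artificial outliers

mu, sigma = x.mean(), x.std()
cutoff = 3 * sigma  # could be 4*sigma for large data, 2*sigma for small data

mask = np.abs(x - mu) > cutoff  # True for values more than 3 std devs from the mean
outliers = x[mask]
x_clean = x[~mask]
print(len(outliers), "outliers removed")
```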
15
Outlier Identification and Removal
• We can use statistical methods to identify observations that appear
to be rare or unlikely given the available data
• For non-normally distributed data
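The slide text above does not spell out the method, but a common choice for non-Gaussian data is the interquartile range (IQR) rule: flag values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR as outliers. A minimal sketch on synthetic skewed data:

```python
import numpy as np

np.random.seed(1)
x = np.random.exponential(scale=10, size=1000)  # skewed, non-Gaussian data

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the usual 1.5*IQR fences

outliers = x[(x < lower) | (x > upper)]
x_clean = x[(x >= lower) & (x <= upper)]
print(len(outliers), "values flagged as outliers")
```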
17
Missing Values
• Real-world data often has missing values
• The chance of having missing values increases with the size of the dataset
• Data can have missing values for a number of reasons such as
• Observations were not recorded or data corruption
• Handling missing data is important as many ML algorithms do not support
data with missing values
• Missing values are frequently indicated by out-of-range entries
• E.g., negative number (-1) in a numeric field that is normally only positive
• Or a 0 in a numeric field that can never normally be 0
• Special character or value, such as a question mark “?"
• In Python (Pandas, NumPy, and Scikit-Learn) it is recommended to mark
missing values as NaN
• NaN values are ignored by operations like sum, count, etc.
• You can detect (and count) NaNs using the isnull() method
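A small pandas sketch of marking out-of-range entries as NaN and counting them with isnull(); the column names and sentinel values are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, -1, 40, 33], "income": [50000, 62000, 0, 58000]})

# Mark out-of-range entries as missing (NaN).
df["age"] = df["age"].replace(-1, np.nan)
df["income"] = df["income"].replace(0, np.nan)

# Count missing values per column.
print(df.isnull().sum())
```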
18
Missing Values – Dropping
• Mark invalid or corrupt values as missing in your dataset
• Good practice to compute the number or percentage of instances with
missing values for each feature
• Confirm that the presence of marked missing values causes problems
for learning algorithms
• Remove instances with missing data from your dataset and evaluate a
learning algorithm on the transformed dataset
• Removing all instances with missing values can be too limiting for
some predictive modeling problems and small datasets
• An alternative is to impute missing values as described next
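A minimal sketch of inspecting and dropping instances with missing values using pandas dropna(); the toy DataFrame is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"f1": [1.0, np.nan, 3.0], "f2": [4.0, 5.0, np.nan], "y": [0, 1, 0]})

# Percentage of missing values per column.
print(df.isnull().mean() * 100)

# Drop all rows that contain at least one missing value.
df_complete = df.dropna()
print(df_complete.shape)
```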
19
Missing Values – Statistical Imputation
• A popular approach for data imputation is to calculate a statistical
value for each column and replace all missing values for that column
with that value
• Statistics are easy to calculate and often result in good performance
• Commonly used statistics:
• Feature mean (the column mean value)
• Feature median (the column median value)
• Feature mode (the most frequent value in the column)
• A constant value
• You can evaluate these strategies on a validation set or with k-fold
cross-validation (k-FCV) and choose the best
• To avoid data leakage, the statistics should be calculated on the training
data and then applied to the training, validation, and test sets
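A minimal sketch of leakage-safe imputation with scikit-learn's SimpleImputer, fitted on the training split only; the toy data is made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan],
              [4.0, 6.0], [5.0, 8.0], [np.nan, 1.0]])
y = np.array([0, 1, 0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Fit the imputation statistic (here the column mean) on the training data only.
imputer = SimpleImputer(strategy="mean")   # try "median", "most_frequent", or "constant" too
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)     # training-set statistics reused on the test set
```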
20
Tasks in Data Preprocessing
1. Data Cleaning
• Identifying and correcting mistakes or errors in the data that may negatively
impact a predictive model
2. Data Transformation
• Changing the scale or distribution of variables
21
Data Transformation
• Data transforms are used to change the type or distribution of data
variables
• Remember the data types:
• Numerical Data Type: Number values
• Integer: Integers with no fractional part
• Float: Floating point values
• Categorical Data Type: Label values
• Nominal: Labels with no rank ordering
• Ordinal: Labels with a rank ordering
• Boolean: Values True and False
22
Data Transforms
23
Scaling Numerical Data
• Input variables may have different units (e.g. feet, kilometers, and
hours) and hence variables can have different scales
• Differences in the scales across input variables may increase the
difficulty of the problem being modeled
• For example, large input values (e.g. a spread of hundreds or
thousands of units) can result in a model that learns large weight
values
• A model with large weight values is often unstable:
• It may suffer from poor performance during learning
• It may be overly sensitive to small changes in input values
• It may have a higher generalization error
24
Scaling Numerical Data
• Many ML algorithms perform better when numerical input variables
are scaled to a standard range
• Algorithms that use a weighted sum of the input like linear regression
• Algorithms that use distance measures like k-nearest neighbors and support
vector machines
• Some ML algorithms are robust to the scale of numerical input
variables, e.g., decision trees and ensembles of trees (random forest)
• A good idea to scale the target variable for regression problems to
make the problem easier to learn, e.g., for neural networks (NN)
• A target variable with a large spread may result in large error gradient values
• Weight values will change dramatically, making the learning process unstable
• Scaling input and output variables is a key step for NN models
• Main techniques for scaling: normalization and standardization
25
Scaling Numerical Data - Normalization
• Normalization scales each input variable separately to the range 0-1
• The range for floating-point values
$x_{normalized} = \dfrac{x - min}{max - min}$
• Normalization requires that you know or are able to accurately
estimate the minimum and maximum observable values
• You may be able to estimate these values from your available data
• New values may fall outside the observed [min, max] range, in which case
their normalized values will not lie between 0 and 1
• You could check for these observations prior to making predictions
• Remove them from the dataset or
• Limit them to the pre-defined maximum or minimum values
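A minimal sketch of normalization with scikit-learn's MinMaxScaler, including clipping a new out-of-range value; the numbers are made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[100.0], [20.0], [60.0]])
X_new = np.array([[150.0]])  # outside the observed [20, 100] range

scaler = MinMaxScaler()                 # scales each column to [0, 1]
X_train_norm = scaler.fit_transform(X_train)

# A new out-of-range value maps outside [0, 1]; one option is to clip it.
X_new_norm = scaler.transform(X_new)
X_new_clipped = np.clip(X_new_norm, 0.0, 1.0)
print(X_new_norm, X_new_clipped)
```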
26
Scaling Numerical Data - Standardization
• Standardization involves rescaling the distribution of observed values
so that the mean becomes 𝜇 = 0 and the standard deviation 𝜎 = 1
• This is done for each input variable separately by subtracting the
mean (centering) and dividing by the standard deviation (scaling)
$x_{standardized} = \dfrac{x - \mu}{\sigma}$
• Standardization assumes that your observations fit a Gaussian
distribution with a well-behaved mean and standard deviation
• Standardization requires that you know or are able to accurately
estimate the mean and standard deviation of observable values
• You can estimate these values from your training data, not the entire
dataset
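A minimal sketch of standardization with scikit-learn's StandardScaler, fitted on training data only; the toy values are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[50.0, 1.2], [60.0, 0.8], [55.0, 1.0], [65.0, 1.4]])
X_test = np.array([[58.0, 1.1]])

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)   # mean and std estimated from training data
X_test_std = scaler.transform(X_test)

print(scaler.mean_, scaler.scale_)            # per-column mean and standard deviation
print(X_train_std.mean(axis=0), X_train_std.std(axis=0))  # ~0 and ~1 after scaling
```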
27
Scaling Numerical Data – Normalize or Standardize?
• Depends on the data and problem
• If the distribution of variables is known to be normal (e.g., heights,
blood pressure) it should be standardized
• If the range of quantity values is large (10s, 100s, etc.) or small (0.01,
0.0001, etc.) it should be normalized
• If in doubt, normalize the input (normalization makes no distributional assumptions)
• Standardization gives positive and negative values (centered around
zero); it may be desirable to normalize data after standardization
• Best evaluate the model performance (on validation) using the raw
data, standardized data and/or normalized data and choose the best
28
Scaling Numerical Data – Robust Scaling (Med/IQR)
• Standardization can become skewed or biased if the input variable
contains outlier values
• If there are input variables that have very large values relative to the
other input variables
• These large values can dominate or skew some machine learning algorithms
• The algorithms pay most of their attention to the large values and ignore the
variables with smaller values
• To overcome this, the median and interquartile range can be used
when standardizing numerical input variables (to ignore outliers)
$x_{robust} = \dfrac{x - median}{IQR}$, where $IQR = Q_3 - Q_1$
• This is generally referred to as robust scaling
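A minimal sketch of robust scaling with scikit-learn's RobustScaler, which uses the per-column median and IQR; the toy data is made up.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # one extreme outlier

scaler = RobustScaler()        # (x - median) / IQR, per column
X_robust = scaler.fit_transform(X)

print(scaler.center_, scaler.scale_)  # per-column median and IQR
print(X_robust.ravel())
```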
29
Data Transforms
30
Changing Distributions – Power Transforms
• Many models, like linear regression and logistic regression, work best when
there is a linear relationship between features and the target
• A highly skewed feature can introduce non-linearity, making the model
struggle to fit the data and causing systematic prediction errors
• Other nonlinear algorithms may also benefit from normally distributed variables
• Power transforms use mathematical functions (like a logarithm or
exponent) to make the probability distribution of a variable more Gaussian
• Help stabilize the variance
• Help to remove the skew
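A minimal sketch of a power transform using scikit-learn's PowerTransformer (Yeo-Johnson) on synthetic skewed data; the log transform shown as an alternative is a common manual option, not something specific to this lecture.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

np.random.seed(0)
x = np.random.exponential(scale=2.0, size=(1000, 1))  # right-skewed feature

pt = PowerTransformer(method="yeo-johnson")  # "box-cox" also works for strictly positive data
x_gaussian_like = pt.fit_transform(x)

# A simple log transform is another common option for positive, right-skewed data.
x_log = np.log1p(x)
```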
34
Changing Distributions – Discretization Approaches
• Uniform (Equal-Width) Discretization: Each bin has the same width in
the span of possible values for the observation
• Preserve the probability distribution of each input (doesn’t improve spread)
• Can handle outliers
• Quantile (Equal-Frequency) Discretization: Each bin has the same
number of observations, split based on percentiles
• Attempt to split the observations for each input variable into k groups, where
the number of observations assigned to each group is approximately equal
• Clustering Discretization: Examples are assigned to clusters
• K-means clustering attempts to fit k clusters for each input variable and then
assigns each observation to a cluster
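A minimal sketch comparing the three discretization strategies via scikit-learn's KBinsDiscretizer on synthetic data:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

np.random.seed(0)
X = np.random.normal(size=(100, 1))

for strategy in ["uniform", "quantile", "kmeans"]:   # equal-width, equal-frequency, clustering
    disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy=strategy)
    X_binned = disc.fit_transform(X)
    print(strategy, np.bincount(X_binned.ravel().astype(int)))  # observations per bin
```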
35
Data Transforms
36
Encoding Categorical Data
• In general, if your data contains categorical data, you must encode it
to numbers before training and evaluating a model
• Some algorithms can work with categorical data directly
• For example, a decision tree can be trained directly on categorical data
• Some ML libraries require all data to be numerical
• For example, scikit-learn has this requirement
• This means that categorical data must be converted to a numerical form
• Common approaches to convert categorical to numerical variables:
1. Ordinal Encoding
2. One Hot Encoding
3. Dummy Variable Encoding
37
Encoding Categorical Data
1. Ordinal Encoding
• Each unique category value is assigned an integer value
• Low = 0, Medium = 1, High = 2
• It is a natural encoding for ordinal variables
• It can cause problems for nominal variables (impose arbitrary ordering)
2. One Hot Encoding
• A new binary variable is added for each unique category, where each bit
represents a possible category
• Red -> [0,0,1], Green -> [0,1,0], Blue -> [1,0,0]
3. Dummy Variable Encoding
• Removes the redundancy of one hot encoding (which might hurt some algorithms)
• K categories can be represented by K-1 binary variables
• Red -> [0,0], Green -> [0,1], Blue -> [1,0]
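A minimal sketch of the three encodings with scikit-learn's OrdinalEncoder and OneHotEncoder (drop='first' gives dummy variable encoding); the color values are a toy example.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

colors = np.array([["Red"], ["Green"], ["Blue"], ["Green"]])

# Ordinal encoding: one integer per category (ordering is alphabetical here, i.e. arbitrary).
ordinal = OrdinalEncoder()
print(ordinal.fit_transform(colors).ravel())

# One hot encoding: one binary column per category.
onehot = OneHotEncoder()
print(onehot.fit_transform(colors).toarray())

# Dummy variable encoding: drop one column so K categories use K-1 binary variables.
dummy = OneHotEncoder(drop="first")
print(dummy.fit_transform(colors).toarray())
```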
38
Next Activities
• Practice with the lab materials
• Read additional material on Moodle
• Expect an assignment
39