Lect 04 Preprocessing Structured
Large part of this lecture is based on: “Data Preparation for Machine Learning” by Jason Brownlee
2
Machine Learning Pipeline
What is Data Preprocessing
4
Why do we need to preprocess the data?
5
ML Algorithms Expect Numbers
Model.fit(X, y)
[Figure: raw text instances such as "1. Machine learning seems cool, but I hate programming." and "2. This is a bad investment." are mapped to a feature table. Rows are instances X1, X2, ...; columns are features F1, F2, ..., Fm holding numeric values Vi,1 ... Vi,m, plus a label column y holding L1, L2, ...]
6
ML Algorithms Requirements
• Different ML algorithms have different requirements & assumptions
• Some linear models expect numeric input variables with
Normal/Gaussian probability distribution
• Some algorithms do not perform well if input variables are irrelevant
or correlated
• For instance, tree-based models are insensitive to data
characteristics, while linear regression models are sensitive
7
Model Performance Depends on Data
• The performance of an ML algorithm is only as good as its training data
• The data may not be very representative of the problem at hand
• In practice, we are often simply given data and have to do the best we
can with what is available
• Real world data are typically:
• Incomplete: missing values, lacking certain attributes of interest, mistyped, or
containing only aggregate data
• Noisy: containing errors or outliers
• Inconsistent: containing discrepancies, conflicting examples
• Complex nonlinear relationships may be compressed in the raw data
and need data preprocessing to be exposed
8
Data Preparation without Data Leakage
• Data leakage happens when a model is given information during training
that it will not have access to when making predictions in production
• Test data is leaked into the training set
• Data from the future is leaked to the past
• Leakage leads to overestimated performance during development
and disappointing performance in operation
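A minimal sketch of the leakage-safe workflow, using scikit-learn on made-up toy data: split first, then fit any preprocessing on the training portion only and reuse it unchanged on the test portion.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data standing in for a real dataset (hypothetical values).
X = np.random.rand(100, 3) * 100
y = np.random.randint(0, 2, size=100)

# Split first, then fit preprocessing on the training portion only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics estimated from training data only
X_test_scaled = scaler.transform(X_test)        # same statistics reused; no test information leaks in
```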
9
Data Preparation without Data Leakage
10
Tasks in Data Preprocessing
1. Data Cleaning
• Identifying and correcting mistakes or errors in the data that may negatively
impact a predictive model
2. Data Transformation
• Changing the scale or distribution of variables
11
Overview of Data Cleaning
Data cleaning is typically the first data preprocessing step performed after data
collection and integration
12
Basic Data Cleaning
• Identify and remove features (column variables) that only have a single
value (zero-variance features)
• They add no information
• Identify and consider carefully features with very few unique values (near zero-
variance features)
• Might be useful when dealing with categorical data
• For numerical data, they can cause errors or unexpected results for some algorithms (e.g.,
linear models)
• Identify and remove duplicate samples (rows with same observations)
• Typically, ML algorithms perform better after removing duplicate instances
• Duplicate instances will result in misleading performance evaluation
• If you think otherwise for your model/data, evaluate the model trained with
and without duplicate instances
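A small pandas sketch of these basic cleaning steps (dropping single-valued columns and duplicate rows); the DataFrame below is a made-up example.

```python
import pandas as pd

df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 3.0],
    "f2": [7.0, 7.0, 7.0, 7.0],   # single-valued (zero-variance) column
    "f3": [0.1, 0.5, 0.9, 0.9],
})

# Drop columns that contain only a single unique value.
single_valued = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=single_valued)

# Drop duplicate rows (the last row duplicates the third one).
df = df.drop_duplicates()
print(df)
```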
13
Outlier Identification and Removal
• An outlier is an observation that is unlike the other observations
• They are rare, distinct, or do not fit in some way
• They are samples that are exceptionally far from the mainstream of the data
• Outliers may be caused by:
• Measurement or input error
• Data corruption
• True outlier observation
• In general, there is no precise way to define and identify outliers, as this
depends on the specifics of the data
• Domain expert must interpret the raw observations and decide whether a
value is an outlier or not
• Even with a good understanding of the data, outliers can be hard to define
• Be careful before removing or changing values (for small dataset size)
14
Outlier Identification and Removal
• We can use statistical methods to identify observations that appear
to be rare or unlikely given the available data
• For normally distributed data
• Observations that fall more than 3 standard deviations from the mean can
be considered outliers ($x_i > \mu + 3\sigma$ or $x_i < \mu - 3\sigma$)
• This threshold can vary with the data size:
• $4\sigma$ for large datasets and $2\sigma$ for small datasets
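A possible NumPy sketch of the 3-sigma rule on synthetic Gaussian data; the injected outlier values are hypothetical.

```python
import numpy as np

np.random.seed(0)
x = np.random.normal(loc=50, scale=5, size=1000)
x[::200] = 120  # inject a few artificial outliers

mu, sigma = x.mean(), x.std()
cutoff = 3 * sigma  # could be 4*sigma for large data, 2*sigma for small data

mask = np.abs(x - mu) > cutoff  # True for values more than 3 std devs from the mean
outliers = x[mask]
x_clean = x[~mask]
print(len(outliers), "outliers removed")
```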
15
Outlier Identification and Removal
• We can use statistical methods to identify observations that appear
to be rare or unlikely given the available data
• For non-normally distributed data
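The slide text above does not spell out the method, but a common choice for non-Gaussian data is the interquartile range (IQR) rule: flag values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR as outliers. A minimal sketch on synthetic skewed data:

```python
import numpy as np

np.random.seed(1)
x = np.random.exponential(scale=10, size=1000)  # skewed, non-Gaussian data

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the usual 1.5*IQR fences

outliers = x[(x < lower) | (x > upper)]
x_clean = x[(x >= lower) & (x <= upper)]
print(len(outliers), "values flagged as outliers")
```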
17
Missing Values
• Real-world data often has missing values
• The chance of having missing values increases with the size of the dataset
• Data can have missing values for a number of reasons such as
• Observations were not recorded or data corruption
• Handling missing data is important as many ML algorithms do not support
data with missing values
• Missing values are frequently indicated by out-of-range entries
• E.g., negative number (-1) in a numeric field that is normally only positive
• Or a 0 in a numeric field that can never normally be 0
• Special character or value, such as a question mark “?"
• In Python (Pandas, NumPy, and Scikit-Learn) it is recommended to mark
missing values as NaN
• NaN values are ignored by operations like sum, count, etc.
• You can detect (and count) NaNs using the isnull() method
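A small pandas sketch of marking out-of-range entries as NaN and counting them with isnull(); the column names and sentinel values are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, -1, 40, 33], "income": [50000, 62000, 0, 58000]})

# Mark out-of-range entries as missing (NaN).
df["age"] = df["age"].replace(-1, np.nan)
df["income"] = df["income"].replace(0, np.nan)

# Count missing values per column.
print(df.isnull().sum())
```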
18
Missing Values – Dropping
• Mark invalid or corrupt values as missing in your dataset
• Good practice to compute the number or percentage of instances with
missing values for each feature
• Confirm that the presence of marked missing values causes problems
for learning algorithms
• Remove instances with missing data from your dataset and evaluate a
learning algorithm on the transformed dataset
• Removing all instances with missing values can be too limiting for
some predictive modeling problems and small datasets
• An alternative is to impute missing values as described next
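A minimal sketch of inspecting and dropping instances with missing values using pandas dropna(); the toy DataFrame is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"f1": [1.0, np.nan, 3.0], "f2": [4.0, 5.0, np.nan], "y": [0, 1, 0]})

# Percentage of missing values per column.
print(df.isnull().mean() * 100)

# Drop all rows that contain at least one missing value.
df_complete = df.dropna()
print(df_complete.shape)
```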
19
Missing Values – Statistical Imputation
• A popular approach for data imputation is to calculate a statistical
value for each column and replace all missing values for that column
with that value
• Statistics are easy to calculate and often result in good performance
• Commonly used statistics:
• Feature mean (the column mean value)
• Feature median (the column median value)
• Feature mode (the most frequent value in the column)
• A constant value
• You can evaluate these strategies on a validation set or with k-fold
cross-validation (k-FCV) and choose the best
• To avoid data leakage, the statistics should be calculated on the training
data and then applied to the training, validation, and test sets
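A minimal sketch of leakage-safe imputation with scikit-learn's SimpleImputer, fitted on the training split only; the toy data is made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan],
              [4.0, 6.0], [5.0, 8.0], [np.nan, 1.0]])
y = np.array([0, 1, 0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Fit the imputation statistic (here the column mean) on the training data only.
imputer = SimpleImputer(strategy="mean")   # try "median", "most_frequent", or "constant" too
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)     # training-set statistics reused on the test set
```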
20
Tasks in Data Preprocessing
1. Data Cleaning
• Identifying and correcting mistakes or errors in the data that may negatively
impact a predictive model
2. Data Transformation
• Changing the scale or distribution of variables
21
Data Transformation
• Data transforms are used to change the type or distribution of data
variables
• Remember the data types:
• Numerical Data Type: Number values
• Integer: Integers with no fractional part
• Float: Floating point values
• Categorical Data Type: Label values
• Nominal: Labels with no rank ordering
• Ordinal: Labels with a rank ordering
• Boolean: Values True and False
22
Data Transforms
23
Scaling Numerical Data
• Input variables may have different units (e.g. feet, kilometers, and
hours) and hence variables can have different scales
• Differences in the scales across input variables may increase the
difficulty of the problem being modeled
• For example, large input values (e.g. a spread of hundreds or
thousands of units) can result in a model that learns large weight
values
• A model with large weight values is often unstable:
• It may suffer from poor performance during learning
• It may be overly sensitive to small changes in input values
• It may have a higher generalization error
24
Scaling Numerical Data
• Many ML algorithms perform better when numerical input variables
are scaled to a standard range
• Algorithms that use a weighted sum of the input like linear regression
• Algorithms that use distance measures like k-nearest neighbors and support
vector machines
• Some ML algorithms are robust to the scale of numerical input
variables, e.g., decision trees and ensembles of trees (random forest)
• A good idea to scale the target variable for regression problems to
make the problem easier to learn, e.g., for neural networks (NN)
• A target variable with a large spread may result in large error gradient values
• Weight values will change dramatically, making the learning process unstable
• Scaling input and output variables is a key step for NN models
• Main techniques for scaling: normalization and standardization
25
Scaling Numerical Data - Normalization
• Normalization scales each input variable separately to the range 0-1
• The range for floating-point values
$x_{normalized} = \dfrac{x - min}{max - min}$
• Normalization requires that you know or are able to accurately
estimate the minimum and maximum observable values
• You may be able to estimate these values from your available data
• New values may fall outside the observed [min, max] range, in which case
their normalized values will not lie between 0 and 1
• You could check for these observations prior to making predictions
• Remove them from the dataset or
• Limit them to the pre-defined maximum or minimum values
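A minimal sketch of normalization with scikit-learn's MinMaxScaler, including clipping a new out-of-range value; the numbers are made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[100.0], [20.0], [60.0]])
X_new = np.array([[150.0]])  # outside the observed [20, 100] range

scaler = MinMaxScaler()                 # scales each column to [0, 1]
X_train_norm = scaler.fit_transform(X_train)

# A new out-of-range value maps outside [0, 1]; one option is to clip it.
X_new_norm = scaler.transform(X_new)
X_new_clipped = np.clip(X_new_norm, 0.0, 1.0)
print(X_new_norm, X_new_clipped)
```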
26
Scaling Numerical Data - Standardization
• Standardization involves rescaling the distribution of observed values
so that the mean becomes 𝜇 = 0 and the standard deviation 𝜎 = 1
• This is done for each input variable separately by subtracting the
mean (centering) and dividing by the standard deviation (scaling)
$x_{standardized} = \dfrac{x - \mu}{\sigma}$
• Standardization assumes that your observations fit a Gaussian
distribution with a well-behaved mean and standard deviation
• Standardization requires that you know or are able to accurately
estimate the mean and standard deviation of observable values
• You can estimate these values from your training data, not the entire
dataset
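A minimal sketch of standardization with scikit-learn's StandardScaler, fitted on training data only; the toy values are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[50.0, 1.2], [60.0, 0.8], [55.0, 1.0], [65.0, 1.4]])
X_test = np.array([[58.0, 1.1]])

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)   # mean and std estimated from training data
X_test_std = scaler.transform(X_test)

print(scaler.mean_, scaler.scale_)            # per-column mean and standard deviation
print(X_train_std.mean(axis=0), X_train_std.std(axis=0))  # ~0 and ~1 after scaling
```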
27
Scaling Numerical Data – Normalize or Standardize?
• Depends on the data and problem
• If the distribution of variables is known to be normal (e.g., heights,
blood pressure) it should be standardized
• If the range of quantity values is large (10s, 100s, etc.) or small (0.01,
0.0001, etc.) it should be normalized
• If in doubt, normalize the input (normalization makes no distributional assumptions)
• Standardization gives positive and negative values (centered around
zero); it may be desirable to normalize data after standardization
• Best evaluate the model performance (on validation) using the raw
data, standardized data and/or normalized data and choose the best
28
Scaling Numerical Data – Robust Scaling (Med/IQR)
• Standardization can become skewed or biased if the input variable
contains outlier values
• If there are input variables that have very large values relative to the
other input variables
• These large values can dominate or skew some machine learning algorithms
• The algorithms pay most of their attention to the large values and ignore the
variables with smaller values
• To overcome this, the median and interquartile range can be used
when standardizing numerical input variables (to ignore outliers)
$x_{robust} = \dfrac{x - median}{IQR}$, where $IQR = Q_3 - Q_1$
• This is generally referred to as robust scaling
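A minimal sketch of robust scaling with scikit-learn's RobustScaler, which uses the per-column median and IQR; the toy data is made up.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # one extreme outlier

scaler = RobustScaler()        # (x - median) / IQR, per column
X_robust = scaler.fit_transform(X)

print(scaler.center_, scaler.scale_)  # per-column median and IQR
print(X_robust.ravel())
```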
29
Data Transforms
30
Changing Distributions – Power Transforms
• Many models, like linear regression and logistic regression, work best when
there is a linear relationship between features and the target
• A highly skewed feature can introduce non-linearity, making the model
struggle to fit the data and causing systematic prediction errors
• Other nonlinear algorithms may also benefit from normally distributed variables
• Power transforms use mathematical functions (like a logarithm or
exponent) to make the probability distribution of a variable more Gaussian
• Help stabilize the variance
• Help to remove the skew
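A minimal sketch of a power transform using scikit-learn's PowerTransformer (Yeo-Johnson) on synthetic skewed data; the log transform shown as an alternative is a common manual option, not something specific to this lecture.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

np.random.seed(0)
x = np.random.exponential(scale=2.0, size=(1000, 1))  # right-skewed feature

pt = PowerTransformer(method="yeo-johnson")  # "box-cox" also works for strictly positive data
x_gaussian_like = pt.fit_transform(x)

# A simple log transform is another common option for positive, right-skewed data.
x_log = np.log1p(x)
```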
34
Changing Distributions – Discretization Approaches
• Uniform (Equal-Width) Discretization: Each bin has the same width in
the span of possible values for the observation
• Preserve the probability distribution of each input (doesn’t improve spread)
• Can handle outliers
• Quantile (Equal-Frequency) Discretization: Each bin has the same
number of observations, split based on percentiles
• Attempt to split the observations for each input variable into k groups, where
the number of observations assigned to each group is approximately equal
• Clustering Discretization: Examples are assigned to clusters
• K-means clustering attempts to fit k clusters for each input variable and then
assigns each observation to a cluster
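A minimal sketch comparing the three discretization strategies via scikit-learn's KBinsDiscretizer on synthetic data:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

np.random.seed(0)
X = np.random.normal(size=(100, 1))

for strategy in ["uniform", "quantile", "kmeans"]:   # equal-width, equal-frequency, clustering
    disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy=strategy)
    X_binned = disc.fit_transform(X)
    print(strategy, np.bincount(X_binned.ravel().astype(int)))  # observations per bin
```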
35
Data Transforms
36
Encoding Categorical Data
• In general, if your data contains categorical data, you must encode it
to numbers before training and evaluating a model
• Some algorithms can work with categorical data directly
• For example, a decision tree can be trained directly on categorical data
• Some ML libraries require all data to be numerical
• For example, scikit-learn has this requirement
• This means that categorical data must be converted to a numerical form
• Common approaches to convert categorical to numerical variables:
1. Ordinal Encoding
2. One Hot Encoding
3. Dummy Variable Encoding
37
Encoding Categorical Data
1. Ordinal Encoding
• Each unique category value is assigned an integer value
• Low = 0, Medium = 1, High = 2
• It is a natural encoding for ordinal variables
• It can cause problems for nominal variables (impose arbitrary ordering)
2. One Hot Encoding
• A new binary variable is added for each unique category, where each bit
represents a possible category
• Red -> [0,0,1], Green -> [0,1,0], Blue -> [1,0,0]
3. Dummy Variable Encoding
• Removes the redundancy of one hot encoding (which might hurt some algorithms)
• K categories can be represented by K-1 binary variables
• Red -> [0,0], Green -> [0,1], Blue -> [1,0]
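A minimal sketch of the three encodings with scikit-learn's OrdinalEncoder and OneHotEncoder (drop='first' gives dummy variable encoding); the color values are a toy example.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

colors = np.array([["Red"], ["Green"], ["Blue"], ["Green"]])

# Ordinal encoding: one integer per category (ordering is alphabetical here, i.e. arbitrary).
ordinal = OrdinalEncoder()
print(ordinal.fit_transform(colors).ravel())

# One hot encoding: one binary column per category.
onehot = OneHotEncoder()
print(onehot.fit_transform(colors).toarray())

# Dummy variable encoding: drop one column so K categories use K-1 binary variables.
dummy = OneHotEncoder(drop="first")
print(dummy.fit_transform(colors).toarray())
```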
38
Next Activities
• Practice with the lab materials
• Read additional material on Moodle
• Expect an assignment
39