
Week 6. Data Preparation and Transformation

The document provides an overview of data preparation and transformation in machine learning, covering topics such as feature types, dealing with categorical and numerical features, and the importance of data normalization and standardization. It outlines the CRISP-DM process and various techniques for handling outliers and transforming data for better model performance. Key concepts include label encoding, one-hot encoding, and the significance of consistent transformations across training and testing datasets.

Data Preparation and Transformation

Instructor: Sabina Mammadova


Agenda
• General information about ML

• Identifying types of features

• Dealing with categorical features

• Dealing with numerical features

• Dealing with outliers


Machine Learning (ML) and CRISP-DM
CRoss-Industry Standard Process for Data Mining (CRISP-DM)
1. Business Understanding
Determine business objectives → Assess situation → Determine data mining goals → Produce project plan

2. Data Understanding
Collect initial data → Describe data → Explore data → Verify data quality

3. Data Preparation
Select data → Clean data → Construct data → Integrate data → Format data

4. Modeling
Select modeling techniques → Generate test design → Build model → Assess model

5. Evaluation
Evaluate results → Review process → Determine next steps

6. Deployment
Plan deployment → Plan monitoring and maintenance → Produce final report → Review project
https://aws.amazon.com/what-is/data-mining/
What is Machine Learning (ML)?
Machine learning (ML) is a branch of artificial intelligence that
enables computers to learn patterns from data and make
predictions without explicit programming.
Machine Learning Algorithms
• Supervised Learning
  • Regression: Linear Regression, Polynomial Regression, Support Vector Regression, Decision Tree Regression, Random Forest Regression
  • Classification: Logistic Regression, K-Nearest Neighbors, Support Vector Machines, Decision Tree, Random Forest, Naïve Bayes
• Unsupervised Learning
  • Clustering: K-Means, Hierarchical, DBSCAN
  • Association Analysis: Apriori, FP-Growth
  • Dimensionality Reduction: PCA, LDA
• Reinforcement Learning: Q-Learning, Deep Q-Networks…
Difference between Supervised and Unsupervised Learning
Supervised learning:
• Input data is labelled
• There is a training phase
• Data is modelled based on the training dataset
• Known number of classes (for classification)
Unsupervised learning:
• Input data is unlabeled
• There is no training phase
• Uses properties of the given data for clustering
• Unknown number of classes
Machine Learning Process
Identifying types of features
Types of Features
A feature is either:
• Qualitative (categorical): Nominal, Ordinal, Binary
• Quantitative (numerical): Discrete, Continuous
Types of Features
• Categorical / Nominal: labelled variables with no quantitative value. Example: cloud provider (AWS, MS, Google)
• Categorical / Ordinal: adds the sense of order to the labelled variable. Example: job title (junior data scientist, senior data scientist, chief data scientist)
• Categorical / Binary: a variable with only two allowed values. Example: fraud classification (fraud, not fraud)
• Numerical / Discrete: individual and countable items. Example: number of students (100)
• Numerical / Continuous: infinite number of possible measurements, often carrying decimal points. Example: total amount ($150.35)
Types of Features
• Although looking at the values of a variable may help you find its type, you should never rely only on this approach. The nature of the variable is also very important for making such decisions. For example, someone could encode the cloud provider variable shown in the table above as follows: 1 (AWS), 2 (MS), 3 (Google). In that case, the variable is still a nominal feature, even if it is now represented by discrete numbers.
• If you are building an ML model and you don't tell your algorithm that this variable is not a discrete number but is instead a nominal variable, the algorithm will treat it as a number and the model won't be interpretable anymore.
• Before feeding any ML algorithm with data, make sure your feature types have been properly identified.
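As a quick illustration (a sketch assuming pandas; the column name is hypothetical), an integer-coded nominal variable like the cloud provider above can be explicitly marked as categorical so it is not treated as a discrete number:

```python
import pandas as pd

# Hypothetical integer-coded nominal column: 1 = AWS, 2 = MS, 3 = Google
df = pd.DataFrame({"cloud_provider": [1, 2, 3, 1]})

# Without this step, any algorithm would see an ordinary discrete number.
df["cloud_provider"] = df["cloud_provider"].astype("category")
```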
Why are data preparation and transformation important?

• Some ML libraries, such as scikit-learn, may not accept string values on your
categorical features.
• The data distribution of your variable may not be the most optimal distribution
for your algorithm.
• Your ML algorithm may be impacted by the scale of your data.
• Some observations of your variable may be missing information, and you will
have to fix it. These are also known as missing values.
• You may find outlier values of your variable that can potentially add bias to
your model.
• Your variable may be storing different types of information, and you may only
be interested in a few of them (for example, a date variable can store the day
of the week or the week of the month).
• You might want to find a mathematical representation for a text variable.
• …
Dealing with categorical features
Transforming nominal features: Label Encoding
• A label encoder is suitable for categorical/nominal variables; it simply associates a number with each distinct label of your variable.
• A label encoder will always ensure that a unique number is associated with each distinct label. In the table below, although "India" appears twice, the same number was assigned to it.
• Because each class receives a unique number, label encoding may lead to priority issues during model training: a label with a higher value may be treated as having higher priority than a label with a lower value.

Country → Label encoding
India → 1
Canada → 2
Brazil → 3
Australia → 4
India → 1
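A minimal sketch with scikit-learn (assumed available). Note that scikit-learn's LabelEncoder assigns codes starting from 0 in alphabetical order, so the exact numbers differ from the 1-based table above, but the behaviour is the same: every occurrence of a label gets the same code.

```python
from sklearn.preprocessing import LabelEncoder

countries = ["India", "Canada", "Brazil", "Australia", "India"]

encoder = LabelEncoder()
# Alphabetical codes: Australia=0, Brazil=1, Canada=2, India=3
codes = encoder.fit_transform(countries)

# Both occurrences of "India" receive the same code.
```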
Transforming nominal features: One-hot Encoding
• One-hot encoding creates one binary column per distinct label of your variable. For each observation, the column matching its label is set to 1 and all other columns are set to 0.
• Unlike label encoding, one-hot encoding introduces no artificial order or priority between labels, which makes it a safer choice for nominal features.
• In the table below, although "India" appears twice, both rows receive the same pattern of 0s and 1s.

Country → India, Canada, Brazil, Australia
India → 1, 0, 0, 0
Canada → 0, 1, 0, 0
Brazil → 0, 0, 1, 0
Australia → 0, 0, 0, 1
India → 1, 0, 0, 0
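A minimal sketch of one-hot encoding the same column with pandas (assumed available); each distinct country becomes its own binary column:

```python
import pandas as pd

df = pd.DataFrame({"Country": ["India", "Canada", "Brazil", "Australia", "India"]})
one_hot = pd.get_dummies(df["Country"])  # one binary column per distinct label

# Every row has exactly one column set; both "India" rows share the same pattern.
```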
Transforming ordinal features: Ordinal Encoding
• Ordinal features have a very specific characteristic: they have an order. Because they have this quality, it does not make sense to apply one-hot encoding to them; if you do so, the underlying algorithm used to train your model will not be able to differentiate the implicit order of the data points associated with this feature.
• The most common transformation for this type of variable is known as ordinal encoding. An ordinal encoder will associate a number with each distinct label of your variable, just like a label encoder does, but this time it will respect the order of each category.

Education → Ordinal encoding
Trainee → 1
Junior data analyst → 2
Senior data analyst → 3
Chief data scientist → 4
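A minimal sketch with scikit-learn's OrdinalEncoder; passing the category order explicitly is what makes the encoder respect it (codes start from 0 here, whereas the table above starts from 1):

```python
from sklearn.preprocessing import OrdinalEncoder

order = ["Trainee", "Junior data analyst", "Senior data analyst", "Chief data scientist"]
encoder = OrdinalEncoder(categories=[order])  # explicit order, lowest to highest

X = [["Junior data analyst"], ["Trainee"], ["Chief data scientist"]]
codes = encoder.fit_transform(X)  # codes follow the given order, not alphabetical order
```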
Avoiding Confusion in Train and Test Datasets
• When working with machine learning models, it is crucial to apply transformations consistently across training, testing, and production data. Encoders should be fitted only on training data and then used to transform test and production data. Never refit on test data, as this would bias performance metrics.
• A key challenge arises when test data contains new categories that were not present in training. Most ML libraries handle this by either raising an error or setting all zeros in one-hot encoding. This is a common issue, but it raises concerns about whether the model can generalize to unseen data.
• For reliable models, training and testing data should follow the same distribution. If many unknown categories appear in the test set, it may indicate a data distribution mismatch, leading to overfitting. Careful investigation and proper handling of categorical and numerical transformations help ensure model reliability and fairness in real-world applications.
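A sketch of the fit-on-train-only rule with scikit-learn's OneHotEncoder; handle_unknown="ignore" is the library's option for encoding unseen categories as all zeros (the provider names are illustrative):

```python
from sklearn.preprocessing import OneHotEncoder

train = [["AWS"], ["Google"], ["MS"]]
test = [["AWS"], ["Oracle"]]  # "Oracle" never appeared in training

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)  # fit on training data only; never refit on test data

test_enc = encoder.transform(test).toarray()
# The unseen "Oracle" row encodes as all zeros instead of raising an error.
```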
Dealing with Numerical Features
Dealing with Numerical Features
• Numerical feature transformations can be classified into two types:
  • Transformations that rely on training data: these learn parameters (e.g., mean, standard deviation) from the training set and apply them to test and new data, similar to categorical encoding.
  • Transformations that rely only on individual observations: these apply direct mathematical computations (e.g., squaring a value) without depending on learned parameters.
• There are many transformation techniques available, and while you don't need to know all of them, understanding the most important ones is key. Additionally, creating custom transformations based on specific use cases can enhance model performance.
Data Normalization
• Data normalization is a technique used to scale numerical features to a specific
range, such as 0 to 1, ensuring that all data points have the same magnitude.
This is particularly useful when different features have varying ranges, which can
negatively impact certain machine learning algorithms. For example, if employee
salaries range from 20,000 to 200,000, normalization transforms 20,000 into 0
and 200,000 into 1, keeping all values within a fixed scale.

(Figure: how different scales of a variable can change the hyperplane projection of k-means clustering.)
Why is Normalization Important and When is it Unnecessary?
• Why is Normalization Important?
• Normalization is crucial for algorithms that rely on numerical calculations, such as:
  • Neural networks and linear regression, which use weighted sums of input variables. Without normalization, large feature values can dominate smaller ones, leading to unstable optimizations.
  • Distance-based algorithms like K-nearest neighbors (KNN) and k-means clustering, where different feature scales can distort distance calculations and clustering results.
• When is Normalization Unnecessary?
• Some machine learning models, like decision trees, do not rely on feature magnitudes but instead evaluate the predictive power of each feature (e.g., through entropy or information gain). In such cases, normalization does not impact model performance.
How is Normalization Applied?
• A common approach is the Min-Max Scaler, which scales values between 0 and 1, or any other specified range. The formula for Min-Max normalization is:

Xnorm = (X - Xmin) / (Xmax - Xmin)

where Xmin and Xmax represent the minimum and maximum values in the dataset, respectively.
• Normalization ensures that machine learning models work efficiently, improving training stability and accuracy in algorithms that depend on numerical relationships.
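A minimal sketch using the salary range from the earlier example; scikit-learn's MinMaxScaler applies the formula above with the min and max learned during fit:

```python
from sklearn.preprocessing import MinMaxScaler

salaries = [[20_000], [110_000], [200_000]]

scaler = MinMaxScaler()  # default output range: 0 to 1
scaled = scaler.fit_transform(salaries)  # 20,000 maps to 0 and 200,000 maps to 1
```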
Data Standardization
• Data standardization is a scaling technique that transforms numerical features so that they have a mean (µ) of 0 and a standard deviation (σ) of 1. Unlike normalization, which scales data to a fixed range (e.g., 0 to 1), standardization adjusts the data's spread while keeping the overall shape of the distribution intact.
• The formula for standardization is:

Z = (X - µ) / σ

• where:
  • X is the original value,
  • µ is the mean of the dataset,
  • σ is the standard deviation of the dataset.
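A minimal sketch with illustrative numbers; after fitting, the transformed column has mean 0 and standard deviation 1:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])

scaler = StandardScaler()
Z = scaler.fit_transform(X)  # (X - mean) / std, computed per column
```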
Why Use Standardization?
• Standardization is beneficial when:
• Data follows a normal distribution, as many machine learning models
(such as logistic regression, support vector machines, and PCA) perform
better when features are normally distributed.
• Identifying outliers, since standardized values (z-scores) indicate how far
a data point is from the mean in terms of standard deviations.

Normalization vs. Standardization
• Output range: Normalization (Min-Max scaling) gives a fixed range (e.g., 0 to 1); Standardization (Z-score) gives mean = 0, std dev = 1.
• Preserves distribution: Normalization, no; Standardization, yes.
• When to use: Normalization when algorithms are sensitive to feature magnitude (e.g., KNN, k-means, neural networks); Standardization when data follows a normal distribution or for outlier detection.
Binning and Discretization
• Binning is a technique that groups continuous values into categories (bins). For example, age groups like "children" (0-14), "teenager" (15-18), "adult" (19+). It simplifies data and helps in analysis.
• Discretization converts a continuous variable into discrete or categorical values. This can be done using different strategies:
Binning and Discretization
Equal-Width Binning:
• Splits data into bins of equal range (e.g., 20 units each).
• Example (values 10-90, 4 bins):
  • Bin 1: 10-30 → {10, 11, ..., 24}
  • Bin 2: 31-50 → (empty)
  • Bin 3: 51-70 → (empty)
  • Bin 4: 71-90 → {90}
• Issue: uneven distribution of values.
Equal-Frequency Binning:
• Ensures each bin contains the same number of values.
• Example (same dataset, 4 bins):
  • Bin 1: {10, 11, 12, 13}
  • Bin 2: {14, 15, 16, 17}
  • Bin 3: {18, 19, 20, 21}
  • Bin 4: {22, 23, 24, 90}
• Issue: bin widths vary to maintain equal frequency.
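A sketch of both strategies on the slide's dataset using pandas (assumed): pd.cut produces equal-width bins, pd.qcut equal-frequency bins.

```python
import pandas as pd

values = pd.Series([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 90])

equal_width = pd.cut(values, bins=4)  # 4 bins of equal range; the middle bins stay empty
equal_freq = pd.qcut(values, q=4)     # 4 bins with 4 values each; widths differ
```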
What to Do After Binning?
• Use bins as categories in your model (e.g., "low," "medium,"
"high").
• Apply one-hot encoding if using bins as nominal variables.
• Averaging bin values (smoothing) can reduce noise.
• Test different binning strategies—there is no universal rule; it
depends on the dataset.
• Binning helps in data simplification, handling outliers, and
improving model interpretability, but choosing the right
approach requires experimentation and analysis.
Applying Numerical Transformations
• Numerical transformations help preprocess data for machine
learning (ML) by making distributions more suitable for models.
Some transformations require parameters from training data (like
normalization and standardization), while others are purely
mathematical and can be applied universally.
• Standardization & Normalization
• Normalization (Min-Max Scaling): Rescales data between 0 and 1 using
the min and max of the training set.
• Standardization (Z-score Scaling): Centers data around zero mean with
unit variance, using mean and standard deviation of the training set.
• Key Rule: Always fit parameters on training data only, never on test
data.
Power Transformations (Handling Skewed Data)
Skewed distributions contain extreme values that pull the mean and median toward one side. Example: salary distributions, where a few high salaries distort averages.
a) Logarithmic Transformation
• Converts skewed data into a more Gaussian (normal) shape.
• Formula: log(x), where x is the feature value.
• Useful for reducing extreme values (e.g., salaries, population growth).
b) Square Root Transformation
• Similar to logarithmic transformation but less aggressive.
• Formula: √x
• Also reduces skewness but keeps larger differences between values.
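A small sketch (numpy assumed, illustrative salaries) showing how both transforms compress a right-skewed feature, with log compressing far more aggressively than square root:

```python
import numpy as np

salaries = np.array([30_000, 40_000, 50_000, 60_000, 1_000_000])  # one extreme value

log_t = np.log(salaries)    # strong compression of large values
sqrt_t = np.sqrt(salaries)  # milder compression, keeps larger gaps

# The ratio of largest to smallest value shrinks: raw >> sqrt > log.
```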
Exponential Transformation
Squaring or exponentiating values can separate nonlinear data into
higher dimensions.
Example: If data points in 2D are not separable by a straight line,
squaring values can make them separable in a transformed space.
Used in advanced ML techniques, such as polynomial regression and
kernel methods in SVMs.
Dealing with Outliers
Dealing with Outliers
• Outliers: Atypical data points that significantly differ from other
observations. Example: A red point on a 2D plot that deviates from
the trend.
• Impact of outliers can distort statistical analyses and machine
learning models. Example: A regression line shifts when an outlier
is included.
• Methods to Detect Outliers
  • Z-score Method: measures how many standard deviations a value is from the mean. Values beyond ±2 or ±3 are often considered outliers.
  • Box Plot Method (IQR, Interquartile Range): uses quartiles to determine "whiskers" (limits). Data points outside Q1 - 1.5×IQR or Q3 + 1.5×IQR are flagged as outliers.
• To handle outliers, remove them if they are errors or anomalies; otherwise consider capping them or applying a transformation (such as a log transform) to reduce their influence.
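Both detection methods can be sketched in a few lines (numpy assumed; the data and the ±2 threshold are illustrative):

```python
import numpy as np

data = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])  # 95 is the atypical point

# Z-score method: flag values more than 2 standard deviations from the mean
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 2]

# IQR (box plot) method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
```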
Thank you!
