
Week 6. Data Preparation and Transformation

The document provides an overview of data preparation and transformation in machine learning, covering topics such as feature types, dealing with categorical and numerical features, and the importance of data normalization and standardization. It outlines the CRISP-DM process and various techniques for handling outliers and transforming data for better model performance. Key concepts include label encoding, one-hot encoding, and the significance of consistent transformations across training and testing datasets.

Data Preparation and Transformation

Instructor: Sabina Mammadova


Agenda
• General information about ML

• Identifying types of features

• Dealing with categorical features

• Dealing with numerical features

• Dealing with outliers


Machine Learning (ML) and CRISP-DM
CRoss-Industry Standard Process for Data Mining (CRISP-DM)
1. Business Understanding
Determine business objectives → Assess situation → Determine data mining goals → Produce project plan

2. Data Understanding
Collect initial data → Describe data → Explore data → Verify data quality

3. Data Preparation
Select data → Clean data → Construct data → Integrate data → Format data

4. Modeling
Select modeling techniques → Generate test design → Build model → Assess model

5. Evaluation
Evaluate results → Review process → Determine next steps

6. Deployment
Plan deployment → Plan monitoring and maintenance → Produce final report → Review project
https://aws.amazon.com/what-is/data-mining/
What is Machine Learning (ML)?
Machine learning (ML) is a branch of artificial intelligence that
enables computers to learn patterns from data and make
predictions without explicit programming.
Machine Learning Algorithms
• Supervised Learning
  • Regression: Linear Regression, Polynomial Regression, Support Vector Regression, Decision Tree Regression, Random Forest Regression
  • Classification: Logistic Regression, K-Nearest Neighbors, Support Vector Machines, Decision Tree, Random Forest, Naïve Bayes
• Unsupervised Learning
  • Clustering: K-Means, Hierarchical, DBSCAN
  • Association Analysis: Apriori, FP-Growth
  • Dimensionality Reduction: PCA, LDA
• Reinforcement Learning: Q-Learning, Deep Q-Networks…
Difference between Supervised and Unsupervised Learning
Supervised learning:
• Input data is labelled
• There is a training phase
• Data is modelled based on the training dataset
• Known number of classes (for classification)
Unsupervised learning:
• Input data is unlabeled
• There is no training phase
• Uses properties of the given data for clustering
• Unknown number of classes
Machine Learning Process
Identifying types of features
Types of Features
A feature is either:
• Qualitative (categorical): Nominal, Ordinal, Binary
• Quantitative (numerical): Discrete, Continuous
Types of Features
• Categorical / Nominal: labelled variables with no quantitative value. Example: cloud provider (AWS, MS, Google)
• Categorical / Ordinal: adds the sense of order to the labelled variable. Example: job title (junior data scientist, senior data scientist, chief data scientist)
• Categorical / Binary: a variable with only two allowed values. Example: fraud classification (fraud, not fraud)
• Numerical / Discrete: individual and countable items. Example: number of students (100)
• Numerical / Continuous: infinite number of possible measurements, often carrying decimal points. Example: total amount ($150.35)
Types of Features
• Although looking at the values of a variable may help you find its type, you should never rely only on this approach. The nature of the variable is also very important for making such decisions. For example, someone could encode the cloud provider variable shown in the table above as follows: 1 (AWS), 2 (MS), 3 (Google). In that case, the variable is still a nominal feature, even if it is now represented by discrete numbers.
• If you are building an ML model and you don't tell your algorithm that this variable is not a discrete number but is instead a nominal variable, the algorithm will treat it as a number and the model won't be interpretable anymore.
• Before feeding any ML algorithm with data, make sure your feature types have been properly identified.
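As a quick illustration (a sketch assuming pandas; the column name is hypothetical), an integer-coded nominal variable like the cloud provider above can be explicitly marked as categorical so it is not treated as a discrete number:

```python
import pandas as pd

# Hypothetical integer-coded nominal column: 1 = AWS, 2 = MS, 3 = Google
df = pd.DataFrame({"cloud_provider": [1, 2, 3, 1]})

# Without this step, any algorithm would see an ordinary discrete number.
df["cloud_provider"] = df["cloud_provider"].astype("category")
```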
Why are data preparation and transformation important?

• Some ML libraries, such as scikit-learn, may not accept string values on your
categorical features.
• The data distribution of your variable may not be the most optimal distribution
for your algorithm.
• Your ML algorithm may be impacted by the scale of your data.
• Some observations of your variable may be missing information, and you will
have to fix it. These are also known as missing values.
• You may find outlier values of your variable that can potentially add bias to
your model.
• Your variable may be storing different types of information, and you may only
be interested in a few of them (for example, a date variable can store the day
of the week or the week of the month).
• You might want to find a mathematical representation for a text variable.
• …
Dealing with categorical features
Transforming nominal features: Label Encoding
• A label encoder is suitable for categorical/nominal variables; it simply associates a number with each distinct label of your variable.
• A label encoder will always ensure that a unique number is associated with each distinct label. In the table below, although "India" appears twice, the same number was assigned to it.
• Because each class receives a unique number, label encoding may lead to priority issues during model training: a label with a higher value may be treated as having higher priority than a label with a lower value.

Country → Label encoding
India → 1
Canada → 2
Brazil → 3
Australia → 4
India → 1
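A minimal sketch with scikit-learn (assumed available). Note that scikit-learn's LabelEncoder assigns codes starting from 0 in alphabetical order, so the exact numbers differ from the 1-based table above, but the behaviour is the same: every occurrence of a label gets the same code.

```python
from sklearn.preprocessing import LabelEncoder

countries = ["India", "Canada", "Brazil", "Australia", "India"]

encoder = LabelEncoder()
# Alphabetical codes: Australia=0, Brazil=1, Canada=2, India=3
codes = encoder.fit_transform(countries)

# Both occurrences of "India" receive the same code.
```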
Transforming nominal features: One-hot Encoding
• One-hot encoding creates one binary column per distinct label of your variable. For each observation, the column matching its label is set to 1 and all other columns are set to 0.
• Unlike label encoding, one-hot encoding introduces no artificial order or priority between labels, which makes it a safer choice for nominal features.
• In the table below, although "India" appears twice, both rows receive the same pattern of 0s and 1s.

Country → India, Canada, Brazil, Australia
India → 1, 0, 0, 0
Canada → 0, 1, 0, 0
Brazil → 0, 0, 1, 0
Australia → 0, 0, 0, 1
India → 1, 0, 0, 0
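A minimal sketch of one-hot encoding the same column with pandas (assumed available); each distinct country becomes its own binary column:

```python
import pandas as pd

df = pd.DataFrame({"Country": ["India", "Canada", "Brazil", "Australia", "India"]})
one_hot = pd.get_dummies(df["Country"])  # one binary column per distinct label

# Every row has exactly one column set; both "India" rows share the same pattern.
```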
Transforming ordinal features: Ordinal Encoding
• Ordinal features have a very specific characteristic: they have an order. Because they have this quality, it does not make sense to apply one-hot encoding to them; if you do so, the underlying algorithm used to train your model will not be able to differentiate the implicit order of the data points associated with this feature.
• The most common transformation for this type of variable is known as ordinal encoding. An ordinal encoder will associate a number with each distinct label of your variable, just like a label encoder does, but this time it will respect the order of each category.

Education → Ordinal encoding
Trainee → 1
Junior data analyst → 2
Senior data analyst → 3
Chief data scientist → 4
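A minimal sketch with scikit-learn's OrdinalEncoder; passing the category order explicitly is what makes the encoder respect it (codes start from 0 here, whereas the table above starts from 1):

```python
from sklearn.preprocessing import OrdinalEncoder

order = ["Trainee", "Junior data analyst", "Senior data analyst", "Chief data scientist"]
encoder = OrdinalEncoder(categories=[order])  # explicit order, lowest to highest

X = [["Junior data analyst"], ["Trainee"], ["Chief data scientist"]]
codes = encoder.fit_transform(X)  # codes follow the given order, not alphabetical order
```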
Avoiding Confusion in Train and Test Datasets
• When working with machine learning models, it is crucial to apply transformations consistently across training, testing, and production data. Encoders should be fitted only on training data and then used to transform test and production data. Never refit on test data, as this would bias performance metrics.
• A key challenge arises when test data contains new categories that were not present in training. Most ML libraries handle this by either raising an error or setting all zeros in one-hot encoding. This is a common issue, but it raises concerns about whether the model can generalize to unseen data.
• For reliable models, training and testing data should follow the same distribution. If many unknown categories appear in the test set, it may indicate a data distribution mismatch, leading to overfitting. Careful investigation and proper handling of categorical and numerical transformations help ensure model reliability and fairness in real-world applications.
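A sketch of the fit-on-train-only rule with scikit-learn's OneHotEncoder; handle_unknown="ignore" is the library's option for encoding unseen categories as all zeros (the provider names are illustrative):

```python
from sklearn.preprocessing import OneHotEncoder

train = [["AWS"], ["Google"], ["MS"]]
test = [["AWS"], ["Oracle"]]  # "Oracle" never appeared in training

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)  # fit on training data only; never refit on test data

test_enc = encoder.transform(test).toarray()
# The unseen "Oracle" row encodes as all zeros instead of raising an error.
```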
Dealing with Numerical Features
Dealing with Numerical Features
• Numerical feature transformations can be classified into two types:
  • Transformations that rely on training data: these learn parameters (e.g., mean, standard deviation) from the training set and apply them to test and new data, similar to categorical encoding.
  • Transformations that rely only on individual observations: these apply direct mathematical computations (e.g., squaring a value) without depending on learned parameters.
• There are many transformation techniques available, and while you don't need to know all of them, understanding the most important ones is key. Additionally, creating custom transformations based on specific use cases can enhance model performance.
Data Normalization
• Data normalization is a technique used to scale numerical features to a specific
range, such as 0 to 1, ensuring that all data points have the same magnitude.
This is particularly useful when different features have varying ranges, which can
negatively impact certain machine learning algorithms. For example, if employee
salaries range from 20,000 to 200,000, normalization transforms 20,000 into 0
and 200,000 into 1, keeping all values within a fixed scale.

(Figure: how different scales of a variable can change the hyperplane projection of k-means clustering.)
Why is Normalization Important and When is it Unnecessary?
• Why is Normalization Important?
• Normalization is crucial for algorithms that rely on numerical calculations, such as:
  • Neural networks and linear regression, which use weighted sums of input variables. Without normalization, large feature values can dominate smaller ones, leading to unstable optimizations.
  • Distance-based algorithms like K-nearest neighbors (KNN) and k-means clustering, where different feature scales can distort distance calculations and clustering results.
• When is Normalization Unnecessary?
• Some machine learning models, like decision trees, do not rely on feature magnitudes but instead evaluate the predictive power of each feature (e.g., through entropy or information gain). In such cases, normalization does not impact model performance.
How is Normalization Applied?
• A common approach is the Min-Max Scaler, which scales values between 0 and 1, or any other specified range. The formula for Min-Max normalization is:

Xnorm = (X - Xmin) / (Xmax - Xmin)

where Xmin and Xmax represent the minimum and maximum values in the dataset, respectively.
• Normalization ensures that machine learning models work efficiently, improving training stability and accuracy in algorithms that depend on numerical relationships.
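A minimal sketch using the salary range from the earlier example; scikit-learn's MinMaxScaler applies the formula above with the min and max learned during fit:

```python
from sklearn.preprocessing import MinMaxScaler

salaries = [[20_000], [110_000], [200_000]]

scaler = MinMaxScaler()  # default output range: 0 to 1
scaled = scaler.fit_transform(salaries)  # 20,000 maps to 0 and 200,000 maps to 1
```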
Data Standardization
• Data standardization is a scaling technique that transforms numerical features so that they have a mean (µ) of 0 and a standard deviation (σ) of 1. Unlike normalization, which scales data to a fixed range (e.g., 0 to 1), standardization adjusts the data's spread while keeping the overall shape of the distribution intact.
• The formula for standardization is:

Z = (X - µ) / σ

• where:
  • X is the original value,
  • µ is the mean of the dataset,
  • σ is the standard deviation of the dataset.
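A minimal sketch with illustrative numbers; after fitting, the transformed column has mean 0 and standard deviation 1:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])

scaler = StandardScaler()
Z = scaler.fit_transform(X)  # (X - mean) / std, computed per column
```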
Why Use Standardization?
• Standardization is beneficial when:
• Data follows a normal distribution, as many machine learning models
(such as logistic regression, support vector machines, and PCA) perform
better when features are normally distributed.
• Identifying outliers, since standardized values (z-scores) indicate how far
a data point is from the mean in terms of standard deviations.

Normalization vs. Standardization
• Output range: Normalization (Min-Max scaling) gives a fixed range (e.g., 0 to 1); Standardization (Z-score) gives mean = 0, std dev = 1.
• Preserves distribution: Normalization, no; Standardization, yes.
• When to use: Normalization when algorithms are sensitive to feature magnitude (e.g., KNN, k-means, neural networks); Standardization when data follows a normal distribution or for outlier detection.
Binning and Discretization
• Binning is a technique that groups continuous values into categories (bins). For example, age groups like "children" (0-14), "teenager" (15-18), "adult" (19+). It simplifies data and helps in analysis.
• Discretization converts a continuous variable into discrete or categorical values. This can be done using different strategies:
Binning and Discretization
Equal-Width Binning:
• Splits data into bins of equal range (e.g., 20 units each).
• Example (values 10-90, 4 bins):
  • Bin 1: 10-30 → {10, 11, ..., 24}
  • Bin 2: 31-50 → (empty)
  • Bin 3: 51-70 → (empty)
  • Bin 4: 71-90 → {90}
• Issue: uneven distribution of values.
Equal-Frequency Binning:
• Ensures each bin contains the same number of values.
• Example (same dataset, 4 bins):
  • Bin 1: {10, 11, 12, 13}
  • Bin 2: {14, 15, 16, 17}
  • Bin 3: {18, 19, 20, 21}
  • Bin 4: {22, 23, 24, 90}
• Issue: bin widths vary to maintain equal frequency.
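A sketch of both strategies on the slide's dataset using pandas (assumed): pd.cut produces equal-width bins, pd.qcut equal-frequency bins.

```python
import pandas as pd

values = pd.Series([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 90])

equal_width = pd.cut(values, bins=4)  # 4 bins of equal range; the middle bins stay empty
equal_freq = pd.qcut(values, q=4)     # 4 bins with 4 values each; widths differ
```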
What to Do After Binning?
• Use bins as categories in your model (e.g., "low," "medium,"
"high").
• Apply one-hot encoding if using bins as nominal variables.
• Averaging bin values (smoothing) can reduce noise.
• Test different binning strategies—there is no universal rule; it
depends on the dataset.
• Binning helps in data simplification, handling outliers, and
improving model interpretability, but choosing the right
approach requires experimentation and analysis.
Applying Numerical Transformations
• Numerical transformations help preprocess data for machine
learning (ML) by making distributions more suitable for models.
Some transformations require parameters from training data (like
normalization and standardization), while others are purely
mathematical and can be applied universally.
• Standardization & Normalization
• Normalization (Min-Max Scaling): Rescales data between 0 and 1 using
the min and max of the training set.
• Standardization (Z-score Scaling): Centers data around zero mean with
unit variance, using mean and standard deviation of the training set.
• Key Rule: Always fit parameters on training data only, never on test
data.
Power Transformations (Handling Skewed Data)
Skewed distributions contain extreme values that pull the mean and median toward one side. Example: salary distributions, where a few high salaries distort averages.
a) Logarithmic Transformation
• Converts skewed data into a more Gaussian (normal) shape.
• Formula: log(x), where x is the feature value.
• Useful for reducing extreme values (e.g., salaries, population growth).
b) Square Root Transformation
• Similar to logarithmic transformation but less aggressive.
• Formula: √x
• Also reduces skewness but keeps larger differences between values.
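A small sketch (numpy assumed, illustrative salaries) showing how both transforms compress a right-skewed feature, with log compressing far more aggressively than square root:

```python
import numpy as np

salaries = np.array([30_000, 40_000, 50_000, 60_000, 1_000_000])  # one extreme value

log_t = np.log(salaries)    # strong compression of large values
sqrt_t = np.sqrt(salaries)  # milder compression, keeps larger gaps

# The ratio of largest to smallest value shrinks: raw >> sqrt > log.
```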
Exponential Transformation
Squaring or exponentiating values can separate nonlinear data into
higher dimensions.
Example: If data points in 2D are not separable by a straight line,
squaring values can make them separable in a transformed space.
Used in advanced ML techniques, such as polynomial regression and
kernel methods in SVMs.
Dealing with Outliers
Dealing with Outliers
• Outliers: Atypical data points that significantly differ from other
observations. Example: A red point on a 2D plot that deviates from
the trend.
• Impact of outliers can distort statistical analyses and machine
learning models. Example: A regression line shifts when an outlier
is included.
• Methods to Detect Outliers
  • Z-score Method: measures how many standard deviations a value is from the mean. Values beyond ±2 or ±3 are often considered outliers.
  • Box Plot Method (IQR, Interquartile Range): uses quartiles to determine "whiskers" (limits). Data points outside Q1 - 1.5×IQR or Q3 + 1.5×IQR are flagged as outliers.
• To handle outliers, remove them if they are errors or anomalies; otherwise consider capping them or applying a transformation (such as a log transform) to reduce their influence.
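Both detection methods can be sketched in a few lines (numpy assumed; the data and the ±2 threshold are illustrative):

```python
import numpy as np

data = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])  # 95 is the atypical point

# Z-score method: flag values more than 2 standard deviations from the mean
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 2]

# IQR (box plot) method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
```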
Thank you!
