Data Preprocessing
Large datasets often contain various types of data, including structured tables, images, audio
files, and videos.
However, machine learning algorithms cannot directly process raw text, images, or videos, as
they only understand numerical representations (1s and 0s).
Therefore, it is essential to transform or encode the dataset into a suitable format before
applying machine learning techniques.
By converting the data into meaningful numerical features, the algorithm can effectively
interpret and learn patterns, enabling accurate predictions and analysis. Common data preprocessing techniques include:
• Feature Aggregation
• Feature Discretization
• Feature Sampling
• Dimensionality Reduction
• Feature Encoding
• Feature Scaling
The first step when working with a dataset is to check the quality of the data. We may face several challenges with these datasets:
• Missing values
• Outliers
• Inconsistent values
• Duplicate values
We need to address these issues programmatically to ensure data quality and improve the performance of machine learning models.
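A minimal sketch of how these checks might be done with pandas is shown below; the file name patients.csv and the column names are placeholders, not part of the original notes.

```python
import pandas as pd

# Hypothetical dataset; the file and column names are only for illustration.
df = pd.read_csv("patients.csv")

print(df.isnull().sum())        # missing values per column
print(df.duplicated().sum())    # number of duplicate rows

df = df.drop_duplicates()                          # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())   # fill missing ages with the median

# Flag outliers in 'age' using the 1.5 * IQR rule
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
```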
Feature Aggregation
We need to aggregate values to organize the data and present it from a better perspective.
For example:
• A dataset may record the day-to-day transactions of a product, i.e. the daily sales of that product in various store locations over the year.
• Aggregating these transactions into store-wise monthly or yearly totals helps us reduce the number of data objects.
Aggregating feature values has several advantages: it compresses the dataset, so it takes less memory and requires less computation power.
Aggregation cannot be applied to every dataset; we need to decide whether aggregation is required. If we need the day-to-day transactional detail, aggregation is not a good step.
Feature aggregation provides a high-level representation of the original dataset, making it
easier
to analyze.
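The sketch below, assuming a pandas DataFrame with store, date, and amount columns, shows how daily transactions could be aggregated into store-wise monthly totals.

```python
import pandas as pd

# Assumed day-to-day transaction data; columns are illustrative only.
sales = pd.DataFrame({
    "store":  ["A", "A", "B", "B"],
    "date":   pd.to_datetime(["2024-01-03", "2024-01-15", "2024-01-07", "2024-02-02"]),
    "amount": [120.0, 80.0, 200.0, 150.0],
})

# Aggregate transactions into store-wise monthly totals
monthly = (
    sales.groupby(["store", sales["date"].dt.to_period("M")])["amount"]
         .sum()
         .reset_index()
)
print(monthly)
```

Each row of monthly now represents one store-month instead of one transaction, so the number of data objects shrinks.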
Feature Discretization
Sometimes features have continuous values, which can be converted into discrete values.
For example, if age is a feature in our dataset, it is usually represented in years or months.
However, in many cases, we may not need the exact number to represent age. Instead, it can
be categorized into groups such as "Young," "Middle-aged," and "Old".
Discretizing age into these categories improves efficiency by eliminating the need to handle
continuous values for that feature.
We must check whether a continuous value is necessary or if it can be discretized. While
converting continuous values to discrete ones often improves efficiency, it is not always
applicable to all features.
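As a sketch, pandas.cut can perform this kind of binning; the bin edges and labels below are assumptions chosen for illustration.

```python
import pandas as pd

ages = pd.Series([12, 25, 37, 45, 61, 78])

# Discretize continuous ages into three categories (assumed cut points)
age_group = pd.cut(ages, bins=[0, 30, 60, 120], labels=["Young", "Middle-aged", "Old"])
print(age_group)
```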
Feature Sampling
Sampling is a very common method for selecting a subset of the dataset that we are
analyzing.
In most cases, working with the complete dataset can turn out to be too expensive
considering the memory and time constraints.
Sampling should be done in such a manner that the generated sample has approximately the same properties as the original dataset, meaning that the sample is representative.
We have different techniques for sampling:
• Simple Random Sampling
• Sampling without Replacement: once an instance is selected, it is removed from the dataset, so it cannot be chosen again.
• Sampling with Replacement: selected instances are not removed, so the same instance can be chosen multiple times.
If the dataset is too large to work with as a whole, we use sampling. Depending on the problem, we need to choose a suitable sampling technique.
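A minimal sketch of simple random sampling with pandas, with and without replacement; the toy DataFrame and sample sizes are assumptions.

```python
import pandas as pd

df = pd.DataFrame({"x": range(1000)})   # stand-in for a large dataset

# Sampling without replacement: each row can appear at most once
sample_without = df.sample(n=100, replace=False, random_state=42)

# Sampling with replacement: the same row can be chosen multiple times
sample_with = df.sample(n=100, replace=True, random_state=42)
```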
In the case of simple random sampling, there is a chance of producing an imbalanced dataset. An imbalanced dataset is one where the number of instances of one class (or classes) is significantly higher than that of another, leading to an imbalance. For example, in a patient dataset there may be many instances of healthy people but only a few of actual patients.
Stratified Sampling
Stratified sampling ensures that each class is proportionally represented in both the training and testing sets, reducing the risk of an imbalanced dataset. With large data, we can use stratified sampling to obtain a balanced split.
For instance, suppose a patient dataset contains several disease classes. With simple random sampling, the class proportions are not maintained, and Lung Disease patients may be underrepresented, which can lead to biased predictions. With stratified sampling, the original proportions are preserved, ensuring fair representation of all disease classes in both the training and testing sets.
Stratified sampling significantly reduces the chance of an imbalanced dataset, leading to a well-
trained machine learning model that performs better across all classes.
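One way to apply stratified sampling is scikit-learn's train_test_split with the stratify argument; the small patient dataset below is made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical patient data with an imbalanced 'disease' column
df = pd.DataFrame({
    "age":     [34, 51, 29, 62, 45, 38, 70, 55, 41, 66, 23, 58],
    "disease": ["Healthy"] * 9 + ["Lung Disease"] * 3,
})
X, y = df[["age"]], df["disease"]

# stratify=y keeps the class proportions roughly the same in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```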
Dimensionality Reduction
Most real-world datasets have a large number of features, and we often do not know whether each feature is important. In such cases, we can reduce the number of features; this is called dimensionality reduction.
For example, in an image processing problem, we might have to deal with thousands of features, also called dimensions.
Two widely accepted techniques are Principal Component Analysis (PCA) and Singular Value Decomposition (SVD).
Principal Component Analysis (PCA) is a technique used to reduce the number of features in a dataset while keeping the most important information. Unlike feature subset selection, where we directly keep a few of the original features and drop the rest, PCA finds new features, called principal components, that capture the most variation in the data. This helps in visualization, noise reduction, and improving model performance.
Singular Value Decomposition (SVD) is a method of breaking a matrix down into three smaller matrices, making it useful for data compression, noise removal, and recommendation systems. PCA often uses SVD to find principal components efficiently, especially for large datasets.
Both methods help in handling high-dimensional data in machine learning.
In this technique, we do not know which features are important or which ones to select or
remove. Instead, we create a new dataset with new features, where each new feature is a
combination of some original features.
It is also possible that the new features are formed by combining multiple original features.
As a result, the number of features in the new dataset will be less than in the original dataset.
These are the dimensionality reduction techniques. Features represent dimensions in a dataset. If
there is one feature, it is one-dimensional; if there are two features, it is two-dimensional; and
if there are five features, it is five-dimensional. More features mean more dimensions in the
data.
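A minimal PCA sketch with scikit-learn (whose PCA implementation is based on SVD); the random 5-feature data and the choice of 2 components are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # 100 samples with 5 original features

pca = PCA(n_components=2)          # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)   # new features = principal components

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component
```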
Feature Encoding
Feature encoding is the process of converting categorical or numerical features into a form that
machine learning models can understand. Different types of features require different encoding
techniques.
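For example, one-hot encoding and label encoding are two commonly used techniques; the sketch below assumes hypothetical 'color' and 'size' columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["red", "green", "blue"], "size": ["S", "M", "L"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df, columns=["color"])

# Label encoding: each category mapped to an integer
df["size_encoded"] = LabelEncoder().fit_transform(df["size"])
print(one_hot)
print(df)
```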
Feature Scaling
Feature scaling is the process of transforming numerical features into a specific range to ensure that
no feature dominates others due to differences in scale. Because sometimes features with larger
ranges can dominate those with smaller ranges.
For example, if one height value is recorded as 140 cm and another as 8.2 feet, the algorithm will give more weight to 140 simply because it is the larger number, even though 8.2 feet is actually the greater height.
Feature scaling ensures all numerical features are on the same scale, which improves model performance.
Many ML algorithms, such as K-means, SVM, and KNN, are sensitive to the scale of the data.
Feature scaling speeds up training and improves model accuracy.
Types of Scaling
1. Min-Max Scaling (Normalization)
X' = (X - Xmin) / (Xmax - Xmin)
Example: If Height ranges from 160 to 190 cm, after Min-Max Scaling:
o 160 cm → 0.0
o 190 cm → 1.0
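A sketch of Min-Max Scaling with scikit-learn's MinMaxScaler, using height values consistent with the example above (the 175 cm value is added for illustration).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

heights = np.array([[160.0], [175.0], [190.0]])   # heights in cm
scaled = MinMaxScaler().fit_transform(heights)
print(scaled.ravel())   # [0.  0.5 1. ]
```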
2. Standardization (Z-Score Scaling)
X' = (X - μ) / σ
Example: If the average Salary is 65,000 with a standard deviation of 35,000, then:
o Salary of 30,000 → -1.0 (below average)
o Salary of 120,000 → +1.57 (above average)
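A sketch of standardization with scikit-learn's StandardScaler; note that the scaler computes μ and σ from the data it is fitted on, not from the 65,000 / 35,000 figures quoted above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

salaries = np.array([[30_000.0], [65_000.0], [120_000.0]])   # assumed salary values
standardized = StandardScaler().fit_transform(salaries)      # (X - mean) / std
print(standardized.ravel())
```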
3. Robust Scaling (Using Median & IQR - Handles Outliers)
Robust Scaling is a feature scaling method that handles outliers by using the Interquartile Range
(IQR) instead of mean and standard deviation.
X' = (X - median(X)) / IQR(X)
The Interquartile Range (IQR) is a measure of statistical dispersion, which tells us how spread out
the middle 50% of a dataset is. It is useful for detecting outliers and understanding data distribution.
IQR=Q3−Q1
Where:
Q1 (1st Quartile / 25th Percentile): The value below which 25% of the data falls.
Q2 (Median / 50th Percentile): The middle value (not directly used in IQR calculation).
Q3 (3rd Quartile / 75th Percentile): The value below which 75% of the data falls.
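A sketch of robust scaling with scikit-learn's RobustScaler, which centers on the median and divides by the IQR; the salary values (including one extreme outlier) are assumptions.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

salaries = np.array([[40_000.0], [50_000.0], [60_000.0], [70_000.0], [500_000.0]])
robust = RobustScaler().fit_transform(salaries)   # (X - median) / IQR
print(robust.ravel())
```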
4. Log Transformation (Reduces Skewness)
X' = log(X + 1)
Example: If Salary varies from 25,000 to 120,000, log transformation reduces the gap
between small and large values.
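A sketch of the log transformation using NumPy's log1p, which computes log(X + 1); the salary values follow the example above (with one value added for illustration).

```python
import numpy as np

salaries = np.array([25_000.0, 60_000.0, 120_000.0])
log_scaled = np.log1p(salaries)   # log(X + 1)
print(log_scaled)                 # roughly [10.13, 11.0, 11.7]
```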