
Data Pre-processing

What is Data Pre-Processing?
Manipulation or dropping of data before it is used, in order to ensure or enhance performance.

DATA CLEANING

The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It involves handling of missing data, noisy data, etc. You can do this in two ways: removal of entries, or filling in missing values (a sketch of both follows below).
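As a minimal sketch (not from the slides), both options might look like this in pandas, using a small hypothetical DataFrame with missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing entries
df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "city": ["A", "B", None, "A"]})

# Option 1: remove rows that contain missing values
dropped = df.dropna()

# Option 2: fill in missing values (mean for numeric, mode for categorical)
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])
```
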
HANDLING NOISY DATA

Noisy data is meaningless data that can't be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc.

Binning
The whole data is divided into segments of equal size called bins. Each segment is handled separately (a binning sketch follows after this list).

Regression
This is used to smooth the data and helps handle data when unnecessary data is present.

Clustering
This is used for finding the outliers and also for grouping the data. Clustering is generally used in unsupervised learning.
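As an illustration of binning-based smoothing (a minimal sketch, not from the slides; the sample values and bin count are arbitrary), each value can be replaced by the mean of its equal-width bin:

```python
import pandas as pd

values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Divide the value range into 3 equal-width bins
bins = pd.cut(values, bins=3)

# Smooth the data by replacing each value with the mean of its bin
smoothed = values.groupby(bins, observed=True).transform("mean")
print(smoothed)
```
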
DATA INTEGRATION

This is usually used when compiling data from multiple sources, each of which would have different formats of storage and sources of information. This commonly includes matching different names for the same values, and removal of unnecessary attributes.
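A minimal pandas sketch of both steps (the column names and sources are hypothetical): rename a column so the two sources agree, drop an unnecessary attribute, then merge.

```python
import pandas as pd

# Two hypothetical sources that store the same key under different names
customers = pd.DataFrame({"cust_id": [1, 2], "name": ["Ann", "Bob"]})
orders = pd.DataFrame({"customer": [1, 1, 2],
                       "amount": [10.0, 5.5, 7.25],
                       "internal_flag": [0, 0, 1]})

# Match different names for the same attribute
orders = orders.rename(columns={"customer": "cust_id"})

# Remove an unnecessary attribute, then integrate the sources
orders = orders.drop(columns=["internal_flag"])
combined = customers.merge(orders, on="cust_id", how="left")
```
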
DATA TRANSFORMATION

Once data cleaning has been done, we need to consolidate the quality data into alternate forms by changing the value, structure, or format of the data. This helps the data be better analysed by the developed models. It also sets the format in which the model receives data.
NORMALIZATION

It involves scaling of numerical attributes, so that each attribute has nearly equal significance. Normalization is one of the most widely used techniques to transform data.

A few ways to normalise the data:

Min-Max Normalization
Used for data having a range. It transforms the data to a range of 0 to 1 or -1 to 1.

Standardization (Z-Score)
The data is rescaled such that the mean is 0 and the variance is 1.

Decimal Scaling
Scaling values by a power of 10, so as to eliminate the need for decimals. It is rarely used.

Clipping
Outlier values that are greater/lesser than the maximum/minimum value are set to the maximum/minimum respectively.
MIN-MAX NORMALIZATION

It is used for data having a range. The formula

$x' = \dfrac{x - \min(x)}{\max(x) - \min(x)}$

transforms the data to a range of 0 to 1.

For transforming to the range -1 to 1, we can use

$x' = 2\,\dfrac{x - \min(x)}{\max(x) - \min(x)} - 1$
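A minimal NumPy sketch of both rescalings (the function name and sample values are illustrative, not from the slides):

```python
import numpy as np

def min_max_normalize(x, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    x = np.asarray(x, dtype=float)
    scaled = (x - x.min()) / (x.max() - x.min())
    return scaled * (new_max - new_min) + new_min

data = np.array([10.0, 20.0, 30.0, 40.0])
print(min_max_normalize(data))             # range 0 to 1
print(min_max_normalize(data, -1.0, 1.0))  # range -1 to 1
```
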


STANDARDIZATION

Scales the mean to zero and the variance (as well as the standard deviation) to 1:

$z = \dfrac{x - \mu}{\sigma + \epsilon}$

Epsilon is an extremely small number to ensure that when the variance is 0, there isn't a division error.
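A minimal NumPy sketch (the epsilon value and sample data are illustrative):

```python
import numpy as np

def standardize(x, eps=1e-8):
    """Z-score standardization: zero mean, unit variance.
    eps guards against division by zero when the variance is 0."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + eps)

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
z = standardize(data)
print(z.mean(), z.std())  # approximately 0 and 1
```
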
ATTRIBUTE SELECTION

New attributes are introduced in the data based on evaluation of earlier attribute(s).
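For example (a hypothetical construction, not from the slides), a new attribute can be derived from two existing ones:

```python
import pandas as pd

people = pd.DataFrame({"height_m": [1.70, 1.85], "weight_kg": [65.0, 90.0]})

# Introduce a new attribute based on evaluation of earlier attributes
people["bmi"] = people["weight_kg"] / people["height_m"] ** 2
```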

AGGREGATION

Presenting the data in summary format. Used mainly to check operations done on previous data and their overall effect.
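A minimal pandas sketch of presenting data in summary format (the column names are hypothetical):

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "revenue": [120.0, 150.0, 90.0, 110.0],
})

# Summarise: total and mean revenue per region
summary = sales.groupby("region")["revenue"].agg(["sum", "mean"])
print(summary)
```
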
Regularization

TYPES OF REGULARIZATION

Modifying the loss function

Modifying the sampling method

Modifying the training algorithm


MODIFYING THE LOSS FUNCTION

L1 (Lasso) Regularisation
A penalty equal to the sum of the absolute weights, scaled by a hyperparameter, is added to the loss function.

L2 (Ridge) Regularisation
A penalty equal to the sum of the squares of the weights, again scaled by a hyperparameter, is added to the loss function. A sketch of both penalties follows below.
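A minimal NumPy sketch of both penalty terms added to a squared-error loss (the function name, the hyperparameter lam, and the use of mean squared error are assumptions for illustration):

```python
import numpy as np

def regularized_loss(w, X, y, lam=0.1, kind="l2"):
    """Mean squared error plus an L1 (lasso) or L2 (ridge) penalty,
    scaled by the hyperparameter lam."""
    mse = np.mean((X @ w - y) ** 2)
    if kind == "l1":
        penalty = lam * np.sum(np.abs(w))   # sum of absolute weights
    else:
        penalty = lam * np.sum(w ** 2)      # sum of squared weights
    return mse + penalty
```
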
L1 REGULARIZATION

Promotes Sparsity
L1 regularization promotes sparsity in the model by encouraging some coefficients to become exactly zero, effectively performing feature selection.

Feature Importance Ranking
It can provide a feature importance ranking based on the magnitude of the non-zero coefficients. Features with larger non-zero coefficients are considered more important.

When should it be used?
It works much better when your data has many correlated features. It also helps when you have a low amount of data or a high number of features.
L2 REGULARIZATION

Encourages Non-Zero Values
L2 regularization encourages small but non-zero coefficient values, distributing the impact of features across all variables.

Feature Importance Ranking
It can provide a feature importance ranking based on the magnitude of the non-zero coefficients. Features with larger non-zero coefficients are considered more important.

When is it more useful?
It works much better when your data has many correlated features.
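A minimal scikit-learn sketch (not from the slides; the synthetic data, alpha values, and estimator choices are assumptions) contrasting the two behaviours: Lasso drives most coefficients to exactly zero, while Ridge keeps them small but non-zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features actually matter
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

print(lasso.coef_)  # most coefficients are exactly zero (sparsity)
print(ridge.coef_)  # small but non-zero coefficients across all features
```
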
MODIFYING THE TRAINING ALGORITHM

Dropout
In each training iteration, some connections are randomly dropped and the resultant output is rescaled.

Injecting Noise
Introducing random variation while updating the weights.
Dropout

Some nodes are randomly dropped and the resulting output is rescaled to compensate for the dropped values.

By applying dropout during training, the network effectively trains multiple sub-networks, as different subsets of neurons are dropped out at each update step. This ensemble of sub-networks helps in reducing overfitting, as the network learns to generalize from a variety of different architectures. Dropout also acts as a form of regularization, as it discourages complex co-adaptations of neurons and encourages the learning of more robust features.
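A minimal NumPy sketch of inverted dropout (the function name and conventions are illustrative): units are zeroed with probability p during training, and the survivors are rescaled by 1/(1-p) to compensate.

```python
import numpy as np

def dropout(activations, p=0.5, training=True, rng=None):
    """Randomly zero units with probability p and rescale the rest."""
    if not training or p == 0.0:
        return activations
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(activations.shape) >= p   # keep with probability 1 - p
    return activations * mask / (1.0 - p)

h = np.ones((2, 4))
print(dropout(h, p=0.5))
```
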
MODIFYING THE SAMPLING METHOD

Data Augmentation
Introduction of more synthetic data with noise, which makes the model more resistant to variations.

K-Fold Cross Validation
The dataset is divided into k equally sized subsets, and the model is trained and evaluated k times, each time using a different subset as the validation set and the remaining subsets as the training set.

Data Augmentation

Introduction of more synthetic data with noise, which makes the model more resistant to variations. Hence, to smooth out the entire feature space, we can generate artificial data based on the original data. With images, for example, we can flip and rotate, convert to greyscale, add noise, crop, resize, change contrast or brightness, or introduce deformations.
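A minimal NumPy sketch of such image-level augmentations (the helper name and the specific variants are illustrative; real pipelines often use a library such as torchvision):

```python
import numpy as np

def augment_image(img, rng):
    """Return simple augmented variants of an image array of shape (H, W, C)."""
    return [
        np.fliplr(img),                                        # horizontal flip
        np.rot90(img),                                         # 90-degree rotation
        np.clip(img + rng.normal(0, 10, img.shape), 0, 255),   # add noise
        np.clip(img * 1.2, 0, 255),                            # increase brightness
    ]

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3)).astype(float)
augmented = augment_image(image, rng)
```
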
K-Fold Cross Validation

The dataset is divided into k equally sized subsets, and the model is trained and evaluated k times, each time using a different subset as the validation set and the remaining subsets as the training set.

The purpose of training multiple models in k-fold cross-validation is to obtain a more reliable estimate of the model's performance by evaluating it on different subsets of the data. It helps in assessing the model's generalization ability and reducing the impact of data variability.
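A minimal scikit-learn sketch (the synthetic data, the Ridge estimator, and k = 5 are assumptions for illustration): each fold serves once as the validation set, and the scores are averaged for a more reliable estimate.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))  # R^2 on the held-out fold

print(np.mean(scores))  # average validation score across the k folds
```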
