
Data Pre-processing

What is Data Pre-Processing?
Manipulation or dropping of data before it is used, in order to ensure or enhance performance.

DATA CLEANING

The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It involves handling of missing data, noisy data, etc. You can do this in two ways: removal of entries, or filling in missing values (a sketch of both follows below).
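As a minimal sketch (not from the slides), both options might look like this in pandas, using a small hypothetical DataFrame with missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing entries
df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "city": ["A", "B", None, "A"]})

# Option 1: remove rows that contain missing values
dropped = df.dropna()

# Option 2: fill in missing values (mean for numeric, mode for categorical)
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])
```
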
HANDLING NOISY DATA

Noisy data is meaningless data that can't be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc.

Binning
The whole data is divided into segments of equal size called bins. Each segment is handled separately (a binning sketch follows after this list).

Regression
This is used to smooth the data and helps handle data when unnecessary data is present.

Clustering
This is used for finding the outliers and also for grouping the data. Clustering is generally used in unsupervised learning.
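As an illustration of binning-based smoothing (a minimal sketch, not from the slides; the sample values and bin count are arbitrary), each value can be replaced by the mean of its equal-width bin:

```python
import pandas as pd

values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Divide the value range into 3 equal-width bins
bins = pd.cut(values, bins=3)

# Smooth the data by replacing each value with the mean of its bin
smoothed = values.groupby(bins, observed=True).transform("mean")
print(smoothed)
```
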
DATA INTEGRATION

This is usually used when compiling data from multiple sources, each of which would have different formats of storage and sources of information. This commonly includes matching different names for the same values, and removal of unnecessary attributes.
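A minimal pandas sketch of both steps (the column names and sources are hypothetical): rename a column so the two sources agree, drop an unnecessary attribute, then merge.

```python
import pandas as pd

# Two hypothetical sources that store the same key under different names
customers = pd.DataFrame({"cust_id": [1, 2], "name": ["Ann", "Bob"]})
orders = pd.DataFrame({"customer": [1, 1, 2],
                       "amount": [10.0, 5.5, 7.25],
                       "internal_flag": [0, 0, 1]})

# Match different names for the same attribute
orders = orders.rename(columns={"customer": "cust_id"})

# Remove an unnecessary attribute, then integrate the sources
orders = orders.drop(columns=["internal_flag"])
combined = customers.merge(orders, on="cust_id", how="left")
```
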
DATA TRANSFORMATION

Once data cleaning has been done, we need to consolidate the quality data into alternate forms by changing the value, structure, or format of the data. This helps the data be better analysed by the developed models. It also sets the format in which the model receives data.
NORMALIZATION

It involves scaling of numerical attributes, so that each attribute has nearly equal significance. Normalization is one of the most widely used techniques to transform data.

A few ways to normalise the data:

Min-Max Normalization
Used for data having a range. It transforms the data to a range of 0 to 1 or -1 to 1.

Standardization (Z-Score)
The data is rescaled such that the mean is 0 and the variance is 1.

Decimal Scaling
Scaling values by a power of 10, so as to eliminate the need for decimals. It is rarely used.

Clipping
Outlier values that are greater/lesser than the maximum/minimum value are set to the maximum/minimum respectively.
MIN-MAX NORMALIZATION

It is used for data having a range. The formula

$x' = \dfrac{x - \min(x)}{\max(x) - \min(x)}$

transforms the data to a range of 0 to 1.

For transforming to the range -1 to 1, we can use

$x' = 2\,\dfrac{x - \min(x)}{\max(x) - \min(x)} - 1$
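A minimal NumPy sketch of both rescalings (the function name and sample values are illustrative, not from the slides):

```python
import numpy as np

def min_max_normalize(x, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    x = np.asarray(x, dtype=float)
    scaled = (x - x.min()) / (x.max() - x.min())
    return scaled * (new_max - new_min) + new_min

data = np.array([10.0, 20.0, 30.0, 40.0])
print(min_max_normalize(data))             # range 0 to 1
print(min_max_normalize(data, -1.0, 1.0))  # range -1 to 1
```
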


STANDARDIZATION

Scales the mean to zero and the variance (as well as the standard deviation) to 1:

$z = \dfrac{x - \mu}{\sigma + \epsilon}$

Epsilon is an extremely small number to ensure that when the variance is 0, there isn't a division error.
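A minimal NumPy sketch (the epsilon value and sample data are illustrative):

```python
import numpy as np

def standardize(x, eps=1e-8):
    """Z-score standardization: zero mean, unit variance.
    eps guards against division by zero when the variance is 0."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + eps)

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
z = standardize(data)
print(z.mean(), z.std())  # approximately 0 and 1
```
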
ATTRIBUTE SELECTION

New attributes are introduced in the data based on evaluation of earlier attribute(s).
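For example (a hypothetical construction, not from the slides), a new attribute can be derived from two existing ones:

```python
import pandas as pd

people = pd.DataFrame({"height_m": [1.70, 1.85], "weight_kg": [65.0, 90.0]})

# Introduce a new attribute based on evaluation of earlier attributes
people["bmi"] = people["weight_kg"] / people["height_m"] ** 2
```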

AGGREGATION

Presenting the data in summary format. Used mainly to check operations done on previous data and their overall effect.
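A minimal pandas sketch of presenting data in summary format (the column names are hypothetical):

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "revenue": [120.0, 150.0, 90.0, 110.0],
})

# Summarise: total and mean revenue per region
summary = sales.groupby("region")["revenue"].agg(["sum", "mean"])
print(summary)
```
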
Regularization

TYPES OF REGULARIZATION

Modifying the loss function

Modifying the sampling method

Modifying the training algorithm


MODIFYING THE LOSS FUNCTION

L1 (Lasso) Regularisation
A penalty equal to the sum of the absolute weights, scaled by a hyperparameter, is added to the loss function.

L2 (Ridge) Regularisation
A penalty equal to the sum of the squares of the weights, again scaled by a hyperparameter, is added to the loss function. A sketch of both penalties follows below.
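A minimal NumPy sketch of both penalty terms added to a squared-error loss (the function name, the hyperparameter lam, and the use of mean squared error are assumptions for illustration):

```python
import numpy as np

def regularized_loss(w, X, y, lam=0.1, kind="l2"):
    """Mean squared error plus an L1 (lasso) or L2 (ridge) penalty,
    scaled by the hyperparameter lam."""
    mse = np.mean((X @ w - y) ** 2)
    if kind == "l1":
        penalty = lam * np.sum(np.abs(w))   # sum of absolute weights
    else:
        penalty = lam * np.sum(w ** 2)      # sum of squared weights
    return mse + penalty
```
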
L1 REGULARIZATION

Promotes Sparsity
L1 regularization promotes sparsity in the model by encouraging some coefficients to become exactly zero, effectively performing feature selection.

Feature Importance Ranking
It can provide a feature importance ranking based on the magnitude of the non-zero coefficients. Features with larger non-zero coefficients are considered more important.

When should it be used?
It works much better when your data has many correlated features. It also helps when you have a low amount of data or a high number of features.
L2 REGULARIZATION

Encourages Non-Zero Values
L2 regularization encourages small but non-zero coefficient values, distributing the impact of features across all variables.

Feature Importance Ranking
It can provide a feature importance ranking based on the magnitude of the non-zero coefficients. Features with larger non-zero coefficients are considered more important.

When is it more useful?
It works much better when your data has many correlated features.
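A minimal scikit-learn sketch (not from the slides; the synthetic data, alpha values, and estimator choices are assumptions) contrasting the two behaviours: Lasso drives most coefficients to exactly zero, while Ridge keeps them small but non-zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features actually matter
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

print(lasso.coef_)  # most coefficients are exactly zero (sparsity)
print(ridge.coef_)  # small but non-zero coefficients across all features
```
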
MODIFYING THE TRAINING ALGORITHM

Dropout
In each training iteration, some connections are randomly dropped and the resultant output is rescaled.

Injecting Noise
Introducing random variation while updating the weights.
Dropout

Some nodes are randomly dropped and the resulting output is rescaled to compensate for the dropped values.

By applying dropout during training, the network effectively trains multiple sub-networks, as different subsets of neurons are dropped out at each update step. This ensemble of sub-networks helps in reducing overfitting, as the network learns to generalize from a variety of different architectures. Dropout also acts as a form of regularization, as it discourages complex co-adaptations of neurons and encourages the learning of more robust features.
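A minimal NumPy sketch of inverted dropout (the function name and conventions are illustrative): units are zeroed with probability p during training, and the survivors are rescaled by 1/(1-p) to compensate.

```python
import numpy as np

def dropout(activations, p=0.5, training=True, rng=None):
    """Randomly zero units with probability p and rescale the rest."""
    if not training or p == 0.0:
        return activations
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(activations.shape) >= p   # keep with probability 1 - p
    return activations * mask / (1.0 - p)

h = np.ones((2, 4))
print(dropout(h, p=0.5))
```
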
MODIFYING THE SAMPLING METHOD

Data Augmentation
Introduction of more synthetic data with noise, which makes the model more resistant to variations.

K-Fold Cross Validation
The dataset is divided into k equally sized subsets, and the model is trained and evaluated k times, each time using a different subset as the validation set and the remaining subsets as the training set.

Data Augmentation

Introduction of more synthetic data with noise, which makes the model more resistant to variations. Hence, to smooth out the entire feature space, we can generate artificial data based on the original data. With images, for example, we can flip and rotate, convert to greyscale, add noise, crop, resize, change contrast or brightness, or introduce deformations.
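A minimal NumPy sketch of such image-level augmentations (the helper name and the specific variants are illustrative; real pipelines often use a library such as torchvision):

```python
import numpy as np

def augment_image(img, rng):
    """Return simple augmented variants of an image array of shape (H, W, C)."""
    return [
        np.fliplr(img),                                        # horizontal flip
        np.rot90(img),                                         # 90-degree rotation
        np.clip(img + rng.normal(0, 10, img.shape), 0, 255),   # add noise
        np.clip(img * 1.2, 0, 255),                            # increase brightness
    ]

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3)).astype(float)
augmented = augment_image(image, rng)
```
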
K-Fold Cross Validation

The dataset is divided into k equally sized subsets, and the model is trained and evaluated k times, each time using a different subset as the validation set and the remaining subsets as the training set.

The purpose of training multiple models in k-fold cross-validation is to obtain a more reliable estimate of the model's performance by evaluating it on different subsets of the data. It helps in assessing the model's generalization ability and reducing the impact of data variability.
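A minimal scikit-learn sketch (the synthetic data, the Ridge estimator, and k = 5 are assumptions for illustration): each fold serves once as the validation set, and the scores are averaged for a more reliable estimate.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))  # R^2 on the held-out fold

print(np.mean(scores))  # average validation score across the k folds
```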
