Data Pre-Processing
What is Data Pre-Processing?
The manipulation or dropping of data before it is used, in order to ensure or enhance performance.
DATA CLEANING
The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It involves handling missing data, noisy data, etc.
You can do this in two ways: removal of entries, or filling in missing values.
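The two approaches above can be sketched in plain Python; the records and field names here are made-up illustrative data, and filling with the column mean is just one common choice:

```python
# Two ways to handle missing values: drop the entry, or fill it in.
# Records are plain dicts; None marks a missing value (illustrative data).
records = [
    {"age": 25, "salary": 50000},
    {"age": None, "salary": 60000},
    {"age": 32, "salary": None},
]

# Option 1: removal of entries with any missing field.
complete = [r for r in records if None not in r.values()]

# Option 2: fill in missing values with the mean of the observed ones.
def fill_with_mean(rows, key):
    observed = [r[key] for r in rows if r[key] is not None]
    mean = sum(observed) / len(observed)
    return [dict(r, **{key: r[key] if r[key] is not None else mean})
            for r in rows]

filled = fill_with_mean(fill_with_mean(records, "age"), "salary")
print(complete)  # only the fully observed record survives
print(filled)    # every record now has both fields
```

Removal is safest when few rows are affected; filling keeps the sample size but injects an assumption about the missing values.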
HANDLING NOISY DATA
Noisy data is meaningless data that can't be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc.
INTEGRATION
This commonly includes matching different names for the same values, and removing unnecessary attributes.
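A minimal sketch of both integration steps, assuming hypothetical column names and alias mappings:

```python
# Integration sketch: map different names for the same value to one
# canonical form, and drop unnecessary attributes (all names hypothetical).
rows_from_a = [{"country": "USA", "revenue": 10, "debug_id": 1}]
rows_from_b = [{"country": "United States", "revenue": 7, "debug_id": 2}]

ALIASES = {"United States": "USA", "U.S.A.": "USA"}  # variants -> canonical name
DROP = {"debug_id"}                                   # attributes not needed downstream

def integrate(rows):
    out = []
    for r in rows:
        r = {k: v for k, v in r.items() if k not in DROP}   # remove attributes
        r["country"] = ALIASES.get(r["country"], r["country"])  # match names
        out.append(r)
    return out

merged = integrate(rows_from_a + rows_from_b)
print(merged)
```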
DATA TRANSFORMATION
Once data cleaning has been done, we need to consolidate the quality data into alternate forms by changing the value, structure, or format of the data.
This helps the data to be better analysed by the developed models. It also sets the format in which the model receives data.
NORMALIZATION
It involves scaling numerical attributes so that each attribute has nearly equal significance. Normalization is one of the most widely used techniques to transform data.

Decimal Scaling
Scaling values by a power of 10, so as to eliminate the need for decimals. It is rarely used.

Min-Max Normalization
Used for data having a known range. It transforms the data to a range of 0 to 1 or -1 to 1.

Standardization (Z-Score)
The data is rescaled such that the mean is 0 and the variance is 1.

Clipping
Outlier values greater/lesser than the maximum/minimum value are set to the maximum/minimum respectively.
MIN-MAX NORMALIZATION
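Min-max normalization rescales each value with x' = (x - min) / (max - min), optionally shifted into another target range. A minimal sketch with made-up values:

```python
# Min-max normalization: rescale values into [new_min, new_max] via
#   x' = new_min + (x - min) * (new_max - new_min) / (max - min)
# (the data below is made up for illustration).
def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

data = [10, 20, 30, 50]
print(min_max(data))          # scaled into [0, 1]
print(min_max(data, -1, 1))   # or into [-1, 1], as the text notes
```

Note that min-max scaling is sensitive to outliers, which is why clipping is often applied first.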
AGGREGATION
Presenting the data in a summary format. Used mainly to check operations done on previous data and their overall effect.
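A tiny sketch of aggregation into summary format, using hypothetical monthly sales rows:

```python
from collections import defaultdict

# Aggregation sketch: summarize raw rows per group (sales data is made up).
sales = [
    ("2023-01", 100), ("2023-01", 150),
    ("2023-02", 200),
]

totals = defaultdict(int)
for month, amount in sales:
    totals[month] += amount   # summary: total per month

print(dict(totals))  # one summary value per month instead of raw rows
```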
REGULARIZATION
L2 (Ridge) Regularization
A penalty of the sum of squares of the weights, scaled by a hyperparameter, is added to the loss function.
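The penalty described above can be sketched in a few lines; the weights, targets, and the hyperparameter value lam are all illustrative:

```python
# L2 (ridge) penalty: lam * (sum of squared weights) is added to the loss;
# lam is the scaling hyperparameter (all numbers here are illustrative).
def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

def ridge_loss(preds, targets, weights, lam):
    penalty = lam * sum(w ** 2 for w in weights)  # sum of squares, scaled
    return mse(preds, targets) + penalty

weights = [1.0, 2.0]
loss = ridge_loss([1.0, 2.0], [1.0, 2.0], weights, lam=0.1)
print(loss)  # 0 data loss + 0.1 * (1 + 4) = 0.5
```

Because the penalty grows with the squared magnitude of the weights, gradient descent on this loss shrinks large weights toward zero without usually making them exactly zero.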
L1 REGULARIZATION

Promotes Sparsity
L1 regularization promotes sparsity in the model by encouraging some coefficients to become exactly zero, effectively performing feature selection.

When should it be used?
It works much better when your data has many correlated features. It also helps when you have a low amount of data or a high number of features.

Feature Importance Ranking
It can provide a feature importance ranking based on the magnitude of the non-zero coefficients. Features with larger non-zero coefficients are considered more important.
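One way to see why L1 produces exact zeros is its soft-thresholding (proximal) update, which is not spelled out in the text above; the coefficients and threshold here are made up for illustration:

```python
# Soft-thresholding, the update L1 regularization induces on each weight:
# coefficients whose magnitude is below the penalty become exactly zero.
# (coefficient values and lam are made up for illustration)
def soft_threshold(w, lam):
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0  # small coefficients collapse to exactly zero -> feature selection

coeffs = [0.9, -0.05, 0.2, -1.5, 0.01]
sparse = [soft_threshold(w, lam=0.1) for w in coeffs]
print(sparse)  # small entries become 0.0, large ones shrink by lam
```

The surviving non-zero coefficients can then be ranked by magnitude, as the feature-importance point above describes.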
L2 REGULARIZATION

Injecting Noise
Introducing random variation while updating the weights.

Dropout
Some nodes are randomly dropped, and the remaining outputs are rescaled to compensate for the dropped values.
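The dropout step above can be sketched as inverted dropout, where survivors are rescaled by 1/(1-p); the activation values and drop probability are toy numbers:

```python
import random

# Inverted dropout sketch: zero each activation with probability p and
# rescale survivors by 1/(1-p) to compensate (toy values).
def dropout(activations, p, rng):
    keep = 1.0 - p
    return [a / keep if rng.random() > p else 0.0 for a in activations]

rng = random.Random(0)  # seeded only so the sketch is reproducible
out = dropout([1.0, 2.0, 3.0, 4.0], p=0.5, rng=rng)
print(out)  # each value is either zeroed or doubled (1 / (1 - 0.5) = 2)
```

Rescaling at training time keeps the expected activation unchanged, so no adjustment is needed at inference time.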