
Lecture 2

Data Preprocessing

Aswathy P.



Data Pre-Processing

Data preprocessing is about preparing raw data and making it suitable for a machine learning model.

Improving Data Quality:
handling inconsistencies, inaccuracies, and errors is critical for
ensuring reliable and robust analytics.
Dealing with Missing Values:
techniques like imputation are critical for dealing with missing data
effectively; datasets often have missing values, which can significantly
hinder the performance of machine learning models.



Data Pre-Processing

Normalizing and Scaling:
normalizing or scaling features is important for algorithms sensitive
to the scale of their inputs. This ensures that all features are on a
comparable scale, which is crucial for the accurate performance of many
machine learning algorithms.
Handling Outliers:
outliers are data points that fall far outside the predominant data
pattern. They can have a disproportionate effect on the modeling
process and can lead to misleading results.
Dimensionality Reduction:
techniques such as Principal Component Analysis (PCA) for
reducing the number of input features, which not only helps in
improving the performance of models but also makes the dataset
more manageable and computationally efficient.



Data Pre-Processing

Feature Scaling
Missing Data
Outliers
Feature encoding
Data Imbalance
Train-Test dataset split



Feature Scaling
to transform the values of features or variables in a dataset to a
similar scale.
adjust the feature values while preserving their relative
relationships and distributions.
the purpose is to ensure that all features contribute equally to the
model and to prevent features with larger values from dominating.
facilitates meaningful comparisons between features, improves
model convergence, and prevents certain features from
overshadowing others based solely on their magnitude.
It helps optimization functions converge faster.



Feature Scaling

standardization
normalization (min-max scaling)
MaxAbs scaler, Robust scaler, Power transformer scaler, ...



Feature Scaling

Standardization
the values are centered around the mean with unit standard deviation.
This approach works best with data that follows a normal distribution,
and it is less sensitive to outliers than min-max scaling.

x′ = (x − µ) / σ

x′ - standardized value
µ - mean
σ - standard deviation
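As a minimal sketch, this formula is what scikit-learn's StandardScaler applies column by column (the data below is hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = StandardScaler()          # centers each column, scales to unit std
X_std = scaler.fit_transform(X)    # applies (x - mean) / std per column

print(X_std.mean(axis=0))  # ~[0. 0.]
print(X_std.std(axis=0))   # ~[1. 1.]
```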



Feature Scaling

Normalization
rescaling features so that their values fall in the range [0, 1].
The issue with this technique is that it's sensitive to outliers, but
it's worth using when the data doesn't follow a normal distribution.
This method is beneficial for algorithms like KNN and Neural Networks,
since they don't assume any data distribution.

x′ = (x − x_min) / (x_max − x_min)

x′ - normalized value
x_max - maximum value in x
x_min - minimum value in x
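A minimal sketch with scikit-learn's MinMaxScaler, which implements this formula per column (same hypothetical data as above):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = MinMaxScaler()            # default feature_range=(0, 1)
X_norm = scaler.fit_transform(X)   # applies (x - x_min) / (x_max - x_min)

print(X_norm)  # each column now lies in [0, 1]
```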



Feature Scaling

maxAbs scaler:
takes the absolute maximum value of the feature and divides each record
by it, scaling the data to the range [-1, 1].
robust scaler:
removes the median from the data and scales it using the interquartile
range (IQR). It's robust to outliers.
power transformer scaler:
changes the data distribution to make it more like a normal
distribution. It's most useful for heteroscedastic data, i.e. when the
variables don't all have the same variance.
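All three are available in scikit-learn; a minimal sketch on a hypothetical column containing one outlier:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, RobustScaler, PowerTransformer

X = np.array([[1.0], [2.0], [3.0], [100.0]])   # 100.0 acts as an outlier

X_maxabs = MaxAbsScaler().fit_transform(X)     # divides by max |x|; range [-1, 1]
X_robust = RobustScaler().fit_transform(X)     # (x - median) / IQR; outlier-robust
X_power  = PowerTransformer().fit_transform(X) # Yeo-Johnson by default; more Gaussian-like
```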



Imputing Missing Values

Missing values appear in data for several reasons, including data lost
in the channel, or customers declining to provide information (people
are often reluctant to share their earnings in a survey). These missing
values must be handled to make optimal use of the available data.
Drop samples with missing values:
this is useful when both the number of samples is high and the count of
missing values in a row/sample is high. It is not recommended in other
cases, since it leads to heavy data loss.
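A minimal pandas sketch of this option (the columns are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age":    [25, np.nan, 40, 31],
                   "income": [50000, 60000, np.nan, np.nan]})

df_dropped = df.dropna()        # drops every row containing a missing value
# df.dropna(thresh=2) would instead keep rows with at least 2 non-missing values
```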



Imputing Missing Values

Replace missing values with zero:


sometimes this technique works for basic datasets, where the data in
question treats zero as a base value signifying that a value is absent.
In most cases, however, zero can signify a value in itself: if a sensor
reports temperatures for a tropical region, a 0 would be read as a real
(and anomalous) measurement, so populating missing values with it would
mislead the model. Zero should be used as a replacement only when the
dataset is independent of its effect. For example, in phone bill data,
a missing value in the billed-amount column can be replaced by zero,
since it might indicate that the user didn't subscribe to the plan that
month.
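A small sketch of the safe case described above, assuming a hypothetical billed_amount column where a missing entry plausibly means no charge:

```python
import numpy as np
import pandas as pd

bills = pd.DataFrame({"billed_amount": [320.0, np.nan, 450.0]})

# Safe here: a missing bill plausibly means no subscription that month
bills["billed_amount"] = bills["billed_amount"].fillna(0)
```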
Replace missing values with mean, median or mode:
you can deal with the above problem, caused by using 0 incorrectly, by
using statistical summaries like the mean, median or mode as
replacements for missing values. Even though these are also
assumptions, they make more sense and are closer approximations than a
single fixed value like 0.
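A minimal sketch with scikit-learn's SimpleImputer, which supports all three strategies:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0], [np.nan], [40.0], [31.0]])

imp_mean = SimpleImputer(strategy="mean")            # or "median"
imp_mode = SimpleImputer(strategy="most_frequent")   # mode

X_mean = imp_mean.fit_transform(X)   # NaN -> 32.0, the mean of 25, 40 and 31
```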



Imputing Missing Values

Interpolate the missing values:


interpolation helps to generate values inside a range based on a
given step size. For instance, if there are 9 missing values in a
column between cells with values 0 and 10, interpolation will populate
the missing cells with numbers from 1 to 9. Understandably, the
dataset needs to be sorted according to a more reliable variable (like
the serial number) before interpolation.
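A minimal pandas sketch of exactly this example, with 9 missing cells between 0 and 10:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0] + [np.nan] * 9 + [10.0])

s_filled = s.interpolate()   # linear by default: fills 1.0, 2.0, ..., 9.0
```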
Extrapolate missing values:


extrapolation helps to populate values that lie beyond a given range,
like the extreme values of a feature. It uses another variable (usually
the target variable) as a guided reference for the variable in question.
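The slide doesn't name a specific method; one way to realize the idea is regression-based imputation, sketched here under that assumption with a hypothetical reference column:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"reference": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "feature":   [10.0, 20.0, 30.0, np.nan, np.nan]})

known = df["feature"].notna()
# Fit on the observed rows, then extrapolate to the missing ones
model = LinearRegression().fit(df.loc[known, ["reference"]],
                               df.loc[known, "feature"])
df.loc[~known, "feature"] = model.predict(df.loc[~known, ["reference"]])
# The last two rows are extrapolated to 40.0 and 50.0
```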



Outlier treatment
data points that do not conform to the predominant pattern observed in
the data.
they cause disruptions in the predictions by pulling the calculations
away from the actual pattern.
outliers can be detected and treated with the help of box plots, which
identify the median, the interquartile range, and the outliers.
to remove the outliers, note the whisker limits of the acceptable range
and filter the variable accordingly.
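A minimal sketch of this filtering, using the same 1.5 × IQR whisker rule a box plot draws (the series is hypothetical):

```python
import pandas as pd

s = pd.Series([12, 14, 13, 15, 14, 13, 98])   # 98 is an outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # box plot whisker limits

s_clean = s[(s >= lower) & (s <= upper)]        # drops 98
```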



Feature encoding: Encoding of Categorical Variables

Categorical variables are of two types:


Ordinal categorical variables:
These variables can be ordered (e.g. grades in an exam, where one can
say that C < B < A).
For ordinal categorical features, label encoding is preferred.
Nominal categorical variables:
These variables can’t be ordered (colors of a car).
One Hot Encoding(OHE) is preferred in such a situation.



Feature encoding: Encoding of Categorical Variables

Label Encoding
Label Encoding converts labels into a numeric, machine-readable form.
it assigns values 1 to n in an ordinal (sequential) manner, where 'n'
is the number of categories in the column (e.g. if a column has 3 city
names, label encoding assigns the values 1, 2 and 3 to the different
cities; scikit-learn's LabelEncoder uses 0 to n − 1 instead). This
method is not recommended when the categorical values have no inherent
order, like cities, but it works well with ordered categories, like
student grades.
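A minimal sketch with scikit-learn's LabelEncoder (which assigns codes in sorted order; for an explicit ordinal ordering, OrdinalEncoder with a categories argument gives control):

```python
from sklearn.preprocessing import LabelEncoder

grades = ["C", "A", "B", "A", "C"]

le = LabelEncoder()
codes = le.fit_transform(grades)   # sorted order: A -> 0, B -> 1, C -> 2

print(list(codes))        # [2, 0, 1, 0, 2]
print(list(le.classes_))  # ['A', 'B', 'C']
```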



Feature encoding: Encoding of Categorical Variables

One hot encoding (OHE):


each categorical value is converted into a new column and assigned a
binary value of 1 or 0.
One hot encoding generates one column for every category and assigns a
positive value (1) in whichever row that category is present, and 0
when it's absent. The disadvantage is that multiple features get
generated from one feature, making the data bulky.
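A minimal sketch with pandas (scikit-learn's OneHotEncoder is the pipeline-friendly alternative):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

encoded = pd.get_dummies(df, columns=["color"])
# Produces 0/1 columns color_blue, color_green and color_red
```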



Feature encoding: Encoding of Categorical Variables

Binary Encoding:
this addresses the bulkiness of one hot encoding. Every categorical
value is converted to its binary representation, and for each binary
digit a new column is created. This compresses the number of columns
compared to one hot encoding: with 100 values in a categorical column,
one hot encoding creates 100 (or 99) new columns, whereas binary
encoding needs only about ⌈log2(100)⌉ = 7.
BaseN Encoding:
this is similar to binary encoding, differing only in the base. Instead
of base 2, any other base can be used. The higher the base, the higher
the information loss, but the encoder's compression power also keeps
increasing; a fair trade-off. Both are sketched below.
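Both encoders live in the third-party category_encoders package; a sketch assuming that dependency (exact column counts can vary slightly with the library's internal ordinal mapping):

```python
# pip install category_encoders
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"city": [f"city_{i}" for i in range(100)]})

binary = ce.BinaryEncoder(cols=["city"]).fit_transform(df)
print(binary.shape[1])   # ~7 columns: roughly ceil(log2(100)) bits per category

base4 = ce.BaseNEncoder(cols=["city"], base=4).fit_transform(df)
print(base4.shape[1])    # fewer columns still, per the trade-off above
```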



Feature encoding: Encoding of Categorical Variables

Hashing:
hashing means generating values from a category with a mathematical
(hash) function. It's like one hot encoding (with a true/false
function), but with a more complex function and fewer dimensions. There
is some information loss in hashing due to collisions of resulting
values.
Bayesian encoders are another family of categorical encoders.
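A minimal sketch of the hashing trick with scikit-learn's FeatureHasher (n_features is a free parameter; smaller values mean more collisions and hence more information loss):

```python
from sklearn.feature_extraction import FeatureHasher

cities = [["london"], ["paris"], ["london"], ["tokyo"]]

hasher = FeatureHasher(n_features=8, input_type="string")
X = hasher.transform(cities).toarray()   # 4 x 8 matrix; collisions are possible
```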



Data Imbalance
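The slide leaves the details to the lecture. As one standard remedy, sketched here as an assumption rather than the lecture's method, many scikit-learn estimators accept class_weight="balanced", which reweights classes inversely to their frequency; resampling (e.g. with the imbalanced-learn package) is a common alternative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical imbalanced data: roughly 95% class 0, 5% class 1
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.05).astype(int)

print(np.bincount(y))  # class counts reveal the imbalance

# class_weight="balanced" reweights samples inversely to class frequency
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```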



Train, Validation and Test Split:

training data is used to build the model. The model identifies the
hidden patterns in this dataset and learns the model parameters.
the model is validated on validation data, which helps to determine how
the model is performing. Comparing validation and training accuracy
helps to identify overfitting or underfitting. Validation data is also
used to tune the model's hyper-parameters.
test data is the unseen data on which the model's final predictions are
evaluated.
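A minimal sketch with scikit-learn's train_test_split, applied twice to obtain a 60/20/20 split (the ratios are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(50, 2), np.arange(50)

# First carve out the test set, then split the remainder into train/validation
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2,
                                                random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25,
                                                  random_state=42)
# Result: 60% train, 20% validation, 20% test
```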

