Lec 2 ML S4 Data Preprocessing
Data Preprocessing
Aswathy P.
Feature Scaling
Missing Data
Outliers
Feature Encoding
Data Imbalance
Train-Test dataset split
Standardization
Normalization (min-max scaling)
MaxAbs scaler, Robust scaler, Power Transformer scaler, ...
Standardization
The values are centered around the mean and scaled to unit standard
deviation.
This approach works best with data that follows a normal distribution,
and it is less sensitive to outliers than min-max scaling.
x′ = (x − µ) / σ
x′ - standardized value
µ - mean
σ - standard deviation
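A minimal sketch of standardization using scikit-learn's StandardScaler; the toy data is illustrative, not from the slides:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # toy feature column

scaler = StandardScaler()
X_std = scaler.fit_transform(X)   # applies (x - mean) / std per column

print(X_std.ravel())              # values centered around 0
print(X_std.mean(), X_std.std())  # approximately 0 and exactly 1
```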
Normalization
Rescales the range of features so that values fall in [0, 1].
The issue with this technique is that it's sensitive to outliers, but it's
worth using when the data doesn't follow a normal distribution.
This method is beneficial for algorithms like KNN and Neural
Networks, since they don't assume any particular data distribution.
x′ = (x − xmin) / (xmax − xmin)
x′ - normalized value
xmax - maximum value in x
xmin - minimum value in x
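A minimal sketch of min-max scaling with scikit-learn's MinMaxScaler, on the same kind of illustrative toy data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # toy feature column

scaler = MinMaxScaler()            # default feature_range is (0, 1)
X_norm = scaler.fit_transform(X)   # applies (x - x_min) / (x_max - x_min)

print(X_norm.ravel())              # [0.0, 0.333..., 0.666..., 1.0]
```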
MaxAbs scaler:
divides each value by the maximum absolute value of the feature,
scaling the data into the range [-1, 1].
Robust scaler:
removes the median from the data and scales it using the
interquartile range (IQR). It's robust to outliers.
Power Transformer scaler:
changes the data distribution to make it more like a normal
distribution. It's most often used with heteroscedastic data, i.e.,
when the variables do not all have the same variance.
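These three scalers correspond directly to scikit-learn classes; a minimal sketch on an illustrative toy column that contains an outlier:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, RobustScaler, PowerTransformer

X = np.array([[-4.0], [-2.0], [1.0], [2.0], [100.0]])  # toy column with an outlier

print(MaxAbsScaler().fit_transform(X).ravel())      # divides by max |x| = 100 -> range [-1, 1]
print(RobustScaler().fit_transform(X).ravel())      # (x - median) / IQR, outlier-resistant
print(PowerTransformer().fit_transform(X).ravel())  # Yeo-Johnson transform, then standardizes
```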
Label Encoding
Label Encoding converts labels into a numeric form to make them
machine-readable.
It assigns values from 1 to n in an ordinal (sequential) manner, where
'n' is the number of categories in the column. (e.g.: if a column has 3
city names, label encoding will assign the values 1, 2, and 3 to the
different cities. This method is not recommended when the categorical
values have no inherent order, like cities, but it works well with
ordered categories, like student grades.)
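A minimal sketch with scikit-learn's LabelEncoder; note that, unlike the 1-to-n convention above, scikit-learn numbers the classes from 0 to n-1. The city names are illustrative:

```python
from sklearn.preprocessing import LabelEncoder

cities = ["kochi", "delhi", "mumbai", "delhi"]  # illustrative category column

le = LabelEncoder()
codes = le.fit_transform(cities)

print(list(le.classes_))  # ['delhi', 'kochi', 'mumbai'] (sorted alphabetically)
print(codes)              # [1 0 2 0] -- integers from 0 to n-1
```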
Binary Encoding:
this solves the bulkiness of one-hot encoding. Every categorical
value is converted to its binary representation, and a new column is
created for each binary digit. This compresses the number of columns
compared to one-hot encoding: with 100 values in a categorical
column, one-hot encoding creates 100 (or 99) new columns, whereas
binary encoding needs only 7, the number of binary digits in 100.
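A minimal sketch, assuming the third-party category_encoders package is installed; the column and city names are illustrative:

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"city": ["kochi", "delhi", "mumbai", "delhi"]})

encoder = ce.BinaryEncoder(cols=["city"])
encoded = encoder.fit_transform(df)

print(encoded)  # two columns (city_0, city_1): 3 categories need only 2 binary digits
```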
BaseN Encoding:
This is similar to binary encoding, the only difference being the base.
Instead of base 2, as with binary, any other base can be used for
BaseN encoding. The higher the base, the greater the information
loss, but the encoder's compression power also keeps increasing. A
fair trade-off.
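The same category_encoders package provides a BaseNEncoder whose base is a parameter; base=4 below is an arbitrary illustrative choice:

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"city": ["kochi", "delhi", "mumbai", "delhi"]})

encoded = ce.BaseNEncoder(cols=["city"], base=4).fit_transform(df)
print(encoded)  # with base 4, these 3 categories fit in a single column
```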
Hashing:
hashing means generating values from a category using a
mathematical (hash) function. It's like one-hot encoding (with a
true/false function), but with a more complex function and fewer
dimensions. There is some information loss in hashing due to
collisions between resulting values.
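A minimal sketch with scikit-learn's FeatureHasher; n_features=4 is an illustrative choice, and such a small number of buckets makes collisions (and hence information loss) more likely:

```python
from sklearn.feature_extraction import FeatureHasher

# Each sample is an iterable of strings; here, one city name per row.
hasher = FeatureHasher(n_features=4, input_type="string")
X = hasher.transform([["london"], ["paris"], ["kochi"]])

print(X.toarray())  # 3 rows x 4 columns; one signed (+1/-1) entry per row
```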
Bayesian encoders:
these encode categories using statistics of the target variable (e.g.,
target/mean encoding), keeping the number of new columns low.
Train-Test dataset split
Training data is used to build the model. The model identifies the
hidden patterns in this dataset and learns the model parameters.
The model is validated on validation data, which helps to determine
how the model is performing. Comparing validation and training
accuracy helps to identify any overfitting or underfitting. Validation
data also helps to tune the model hyper-parameters.
Test data is the unseen data on which the model predicts the output.
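A minimal sketch of a train/validation/test split with scikit-learn's train_test_split; the 60/20/20 ratios and toy data are illustrative assumptions, not from the slides:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # toy feature matrix
y = np.arange(50) % 2              # toy binary labels

# First carve out 20% as the unseen test set, then split the rest 75/25
# so the final ratio is 60% train / 20% validation / 20% test.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```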