Lec 2 ML S4 Data Preprocessing
Data Preprocessing
Aswathy P.
Feature Scaling
Missing Data
Outliers
Feature Encoding
Data Imbalance
Train-Test dataset split
Standardization
Normalization (min-max scaling)
MaxAbs scaler, Robust scaler, Power Transformer scaler, ...
Standardization
The values are centered around the mean and scaled to unit standard
deviation.
This approach works best with data that follows a normal distribution,
and it is less sensitive to outliers than min-max scaling.
x′ = (x − µ) / σ
x′ - standardized value
µ - mean
σ - standard deviation
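A minimal sketch of standardization using scikit-learn's StandardScaler; the toy data is illustrative, not from the slides:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # toy feature column

scaler = StandardScaler()
X_std = scaler.fit_transform(X)   # applies (x - mean) / std per column

print(X_std.ravel())              # values centered around 0
print(X_std.mean(), X_std.std())  # approximately 0 and exactly 1
```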
Normalization
Rescales the range of features so that values fall in [0, 1].
The issue with this technique is that it's sensitive to outliers, but it's
worth using when the data doesn't follow a normal distribution.
This method is beneficial for algorithms like KNN and Neural
Networks, since they don't assume any particular data distribution.
x′ = (x − xmin) / (xmax − xmin)
x′ - normalized value
xmax - maximum value in x
xmin - minimum value in x
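A minimal sketch of min-max scaling with scikit-learn's MinMaxScaler, on the same kind of illustrative toy data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # toy feature column

scaler = MinMaxScaler()            # default feature_range is (0, 1)
X_norm = scaler.fit_transform(X)   # applies (x - x_min) / (x_max - x_min)

print(X_norm.ravel())              # [0.0, 0.333..., 0.666..., 1.0]
```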
MaxAbs scaler:
divides each value by the maximum absolute value of the feature,
scaling the data into the range [-1, 1].
Robust scaler:
removes the median from the data and scales it using the
interquartile range (IQR). It's robust to outliers.
Power Transformer scaler:
changes the data distribution to make it more like a normal
distribution. It's most often used with heteroscedastic data, i.e.,
when the variables do not all have the same variance.
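These three scalers correspond directly to scikit-learn classes; a minimal sketch on an illustrative toy column that contains an outlier:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, RobustScaler, PowerTransformer

X = np.array([[-4.0], [-2.0], [1.0], [2.0], [100.0]])  # toy column with an outlier

print(MaxAbsScaler().fit_transform(X).ravel())      # divides by max |x| = 100 -> range [-1, 1]
print(RobustScaler().fit_transform(X).ravel())      # (x - median) / IQR, outlier-resistant
print(PowerTransformer().fit_transform(X).ravel())  # Yeo-Johnson transform, then standardizes
```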
Label Encoding
Label Encoding converts labels into a numeric form to make them
machine-readable.
It assigns values from 1 to n in an ordinal (sequential) manner, where
'n' is the number of categories in the column. (e.g.: if a column has 3
city names, label encoding will assign the values 1, 2, and 3 to the
different cities. This method is not recommended when the categorical
values have no inherent order, like cities, but it works well with
ordered categories, like student grades.)
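A minimal sketch with scikit-learn's LabelEncoder; note that, unlike the 1-to-n convention above, scikit-learn numbers the classes from 0 to n-1. The city names are illustrative:

```python
from sklearn.preprocessing import LabelEncoder

cities = ["kochi", "delhi", "mumbai", "delhi"]  # illustrative category column

le = LabelEncoder()
codes = le.fit_transform(cities)

print(list(le.classes_))  # ['delhi', 'kochi', 'mumbai'] (sorted alphabetically)
print(codes)              # [1 0 2 0] -- integers from 0 to n-1
```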
Binary Encoding:
this solves the bulkiness of one-hot encoding. Every categorical
value is converted to its binary representation, and a new column is
created for each binary digit. This compresses the number of columns
compared to one-hot encoding: with 100 values in a categorical
column, one-hot encoding creates 100 (or 99) new columns, whereas
binary encoding needs only 7, the number of binary digits in 100.
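A minimal sketch, assuming the third-party category_encoders package is installed; the column and city names are illustrative:

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"city": ["kochi", "delhi", "mumbai", "delhi"]})

encoder = ce.BinaryEncoder(cols=["city"])
encoded = encoder.fit_transform(df)

print(encoded)  # two columns (city_0, city_1): 3 categories need only 2 binary digits
```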
BaseN Encoding:
This is similar to binary encoding, the only difference being the base.
Instead of base 2, as with binary, any other base can be used for
BaseN encoding. The higher the base, the greater the information
loss, but the encoder's compression power also keeps increasing. A
fair trade-off.
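The same category_encoders package provides a BaseNEncoder whose base is a parameter; base=4 below is an arbitrary illustrative choice:

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"city": ["kochi", "delhi", "mumbai", "delhi"]})

encoded = ce.BaseNEncoder(cols=["city"], base=4).fit_transform(df)
print(encoded)  # with base 4, these 3 categories fit in a single column
```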
Hashing:
hashing means generating values from a category using a
mathematical (hash) function. It's like one-hot encoding (with a
true/false function), but with a more complex function and fewer
dimensions. There is some information loss in hashing due to
collisions between resulting values.
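A minimal sketch with scikit-learn's FeatureHasher; n_features=4 is an illustrative choice, and such a small number of buckets makes collisions (and hence information loss) more likely:

```python
from sklearn.feature_extraction import FeatureHasher

# Each sample is an iterable of strings; here, one city name per row.
hasher = FeatureHasher(n_features=4, input_type="string")
X = hasher.transform([["london"], ["paris"], ["kochi"]])

print(X.toarray())  # 3 rows x 4 columns; one signed (+1/-1) entry per row
```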
Bayesian encoders:
these encode categories using statistics of the target variable (e.g.,
target/mean encoding), keeping the number of new columns low.
Train-Test dataset split
Training data is used to build the model. The model identifies the
hidden patterns in this dataset and learns the model parameters.
The model is validated on validation data, which helps to determine
how the model is performing. Comparing validation and training
accuracy helps to identify any overfitting or underfitting. Validation
data also helps to tune the model hyper-parameters.
Test data is the unseen data on which the model predicts the output.
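A minimal sketch of a train/validation/test split with scikit-learn's train_test_split; the 60/20/20 ratios and toy data are illustrative assumptions, not from the slides:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # toy feature matrix
y = np.arange(50) % 2              # toy binary labels

# First carve out 20% as the unseen test set, then split the rest 75/25
# so the final ratio is 60% train / 20% validation / 20% test.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```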