CSC407 - Chapter 4
Taiwo Kolajo (PhD)
Department of Computer Science
Federal University Lokoja
+2348031805049
4: Feature Engineering
• What is Feature Engineering?
• Feature engineering is the process of transforming raw data into features
that are suitable for machine learning models.
• In other words, it is the process of selecting, extracting, and
transforming the most relevant features from the available data to build
more accurate and efficient machine learning models.
• The success of machine learning models heavily depends on the quality of
the features used to train them.
• Feature engineering involves a set of techniques that enable us to create
new features by combining or transforming the existing ones.
• These techniques help to highlight the most important patterns and
relationships in the data, which in turn helps the machine learning model
to learn from the data more effectively.
• What is a Feature?
• In the context of machine learning, a feature (also known as a variable or
attribute) is an individual measurable property or characteristic of a data
point that is used as input for a machine learning algorithm.
4: Feature Engineering
• Features can be numerical, categorical, or text-based, and they represent
different aspects of the data that are relevant to the problem at hand.
• Now we subtract these maximum values from the data and then divide the result by the maximum values as well.
# max_vals holds the column-wise maximum values of df (computed in the previous step)
print((df - max_vals) / max_vals)
LotArea MSSubClass
0 -0.960742 -0.684211
1 -0.955400 -0.894737
2 -0.947734 -0.684211
3 -0.955632 -0.631579
4 -0.933750 -0.684211
... ... ...
1459 -0.953834 -0.894737
[1460 rows x 2 columns]
4: Feature Engineering
• Min-Max Scaling
• This method of scaling requires two steps:
• First, find the minimum and the maximum value of the column.
• Then subtract the minimum value from each entry and divide the result by the difference between the maximum and the minimum value.

X_scaled = (X_i − X_min) / (X_max − X_min)

Because we are using the maximum and the minimum value, this method is also prone to outliers, but after performing the above two steps the data will lie in the range 0 to 1.
4: Feature Engineering
• Min-Max Scaling
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Scale each column of df (assumed loaded earlier) to the range [0, 1]
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
scaled_df.head()
Output:
LotArea MSSubClass
0 0.033420 0.235294
1 0.038795 0.000000
2 0.046507 0.235294
3 0.038561 0.294118
4 0.060576 0.235294
4: Feature Engineering
• Normalization
This method is more or less the same as the previous one, but here, instead of the minimum value, we subtract the mean of the whole column from each entry and then divide the result by the difference between the maximum and the minimum value.

X_scaled = (X_i − X_mean) / (X_max − X_min)

Note that sklearn's Normalizer, used below, behaves differently from this formula: it rescales each row (sample) to unit norm rather than scaling column by column, which is why the output values differ from what the formula would give.
import pandas as pd
from sklearn.preprocessing import Normalizer

# Normalizer rescales each row of df to unit (L2) norm
scaler = Normalizer()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print(scaled_df.head())
Output:
LotArea MSSubClass
0 0.999975 0.007100
1 0.999998 0.002083
2 0.999986 0.005333
3 0.999973 0.007330
4 0.999991 0.004208
4: Feature Engineering
• Standardization
First, calculate the mean and standard deviation of the data we would like to standardize.
Then subtract the mean from each entry and divide the result by the standard deviation.
This rescales the data so that it has a mean equal to zero and a standard deviation equal to 1.

X_scaled = (X_i − X_mean) / σ
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Standardize each column of df to zero mean and unit standard deviation
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print(scaled_df.head())
Output:
LotArea MSSubClass
0 -0.207142 0.073375
1 -0.091886 -0.872563
2 0.073480 0.073375
3 -0.096897 0.309859
4 0.375148 0.073375
4: Feature Engineering
• Robust Scaling
In this method of scaling, we use two main statistical measures of the data.
• Median
• Inter-Quartile Range (the difference between the upper and lower quartiles, that is Q3 − Q1)
After calculating these two values, we subtract the median from each entry and then divide the result by the interquartile range.

X_scaled = (X_i − X_median) / IQR
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Robust scaling: subtract the column median and divide by the column IQR
scaler = RobustScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print(scaled_df.head())
Output:
LotArea MSSubClass
0 -0.254076 0.2
1 0.030015 -0.6
2 0.437624 0.2
3 0.017663 0.4
4: Feature Engineering
• Encoding
• Most real-life datasets we encounter during our data science project development have
columns of mixed data type: both categorical and numerical columns.
• However, various Machine Learning models do not work with categorical data and to fit
this data into the machine learning model it needs to be converted into numerical data.
• For example, suppose a dataset has a Gender column with categorical elements like Male
and Female.
• These labels have no specific order of preference, and since the data consists of string labels, a machine learning model may misinterpret that there is some sort of hierarchy in them.
• One approach to solve this problem is label encoding, where we assign a numerical value to these labels, for example mapping Male and Female to 0 and 1.
• But this can add bias to our model, as it may start giving higher preference to the Female category since 1 > 0.
• To deal with this issue we will use the One Hot Encoding technique.
• One hot encoding is a technique that we use to represent categorical variables as numerical
values in a machine learning model.
4: Feature Engineering
• Encoding
• One Hot Encoding Example
• Imagine we have a dataset with fruits, their categorical values, and corresponding prices.
Using one-hot encoding, we can transform these categorical values into numerical form.
For instance:
• Wherever the fruit is “Apple,” the Apple column will have a value of 1, while the other fruit
columns (like Mango or Orange) will contain 0.
• This pattern ensures that each categorical value gets its own column, represented with
binary values (1 or 0), making it usable for machine learning models.
• Consider the data where fruits, their corresponding categorical values, and prices are
given.
Fruit Categorical values of fruit Price
apple 1 5
mango 2 10
apple 1 15
orange 3 20
• The output after applying one-hot encoding on the data is given as follows,
apple mango orange Price
1 0 0 5
0 1 0 10
1 0 0 15
0 0 1 20
4: Feature Engineering
• Encoding
• Implementing One Hot Encoding
• To implement one-hot encoding in Python, we can use either the Pandas library or the
Scikit-learn library, both of which provide efficient and convenient methods for this
task.
• 1. Using Pandas
• Pandas offers the get_dummies function, which is a simple and effective way to
perform one-hot encoding. This method converts categorical variables into multiple binary
columns.
• For example, the Gender column with values 'M' and 'F' becomes two binary columns:
Gender_F and Gender_M.
• drop_first=True in pandas drops one redundant column (e.g., keeps only Gender_F to avoid multicollinearity), as shown in the sketch below.
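• A minimal sketch of the get_dummies approach, assuming a small hypothetical DataFrame with a Gender column as in the example above:
import pandas as pd

# Hypothetical example data with a categorical Gender column
df = pd.DataFrame({'Gender': ['M', 'F', 'F', 'M']})

# get_dummies creates one binary column per category;
# drop_first=True drops one redundant dummy column, keeping n-1 columns
encoded_df = pd.get_dummies(df, columns=['Gender'], drop_first=True)
print(encoded_df)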
4: Feature Engineering
• Encoding
• Implementing One Hot Encoding
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Initialize OneHotEncoder; sparse_output=False returns a dense array
encoder = OneHotEncoder(sparse_output=False)
# Fit and transform the categorical columns (df is assumed to contain Gender and Remarks)
encoded = encoder.fit_transform(df[['Gender', 'Remarks']])
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())
We can observe that we have 3 Remarks columns and 2 Gender columns in the encoded data. However, a column with n unique labels can be represented with only n-1 binary columns. For example, if we keep only the Gender_Female column and drop the Gender_Male column, we can still convey the entire information: when the label is 1 it means female, and when the label is 0 it means male. This way we can encode the categorical data and reduce the number of parameters as well (see the sketch below).
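• scikit-learn's OneHotEncoder supports dropping the first category directly through its drop parameter; a minimal sketch, assuming the same df as above:
from sklearn.preprocessing import OneHotEncoder

# drop='first' removes the first category of each feature, keeping n-1 columns per feature
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded = encoder.fit_transform(df[['Gender', 'Remarks']])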
4: Feature Engineering
• Encoding
2. One Hot Encoding using Scikit Learn Library
• Scikit-learn (sklearn) is a popular machine-learning library that provides numerous tools for data preprocessing. It provides a OneHotEncoder class that we can use to encode categorical variables into binary vectors. We can use pandas' df.select_dtypes(include=['object']) to pick out the categorical columns first (a short sketch follows the imports below):
• This selects only the columns with categorical data (data type object).
• In this case, ['Gender', 'Remarks'] are identified as categorical columns.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
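• A minimal sketch of this workflow, assuming df contains the categorical Gender and Remarks columns described above:
# Select only the categorical (object-dtype) columns
categorical_columns = df.select_dtypes(include=['object']).columns

# Encode them into binary columns
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[categorical_columns])
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(categorical_columns))

# Replace the original categorical columns with their encoded versions
df_encoded = pd.concat([df.drop(columns=categorical_columns), encoded_df], axis=1)
print(df_encoded.head())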
• Use Pandas get_dummies() when you need quick and simple encoding.
• Use Scikit-Learn OneHotEncoder when working within a machine learning pipeline, or when you need finer control over encoding behavior.
4: Feature Engineering
• Encoding
Alternatives to One Hot Encoding
• While One Hot Encoding is a popular choice for handling categorical data, there are several
alternatives that may be more suitable depending on the context:
• Label Encoding: In cases where categorical variables have a natural order (e.g., “Low,”
“Medium,” “High”), label encoding can be a better option. This method assigns a unique
integer to each category without introducing the same risks of hierarchy
misinterpretation as with nominal data.
• Binary Encoding: This technique combines the benefits of One Hot Encoding and label encoding. It converts categories into binary numbers and then creates binary columns. This method can reduce dimensionality while preserving information.
• Target Encoding: In target encoding, we replace each category with the mean of the target variable for that category. This method can be particularly useful for categorical variables with a high number of unique values, but it also carries a risk of leakage if not handled properly (a small pandas sketch follows this list).
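• A minimal sketch of target encoding with pandas, reusing the fruit/price example from earlier:
import pandas as pd

# Categorical feature and numeric target (same values as the fruit/price table above)
df = pd.DataFrame({'Fruit': ['apple', 'mango', 'apple', 'orange'],
                   'Price': [5, 10, 15, 20]})

# Target encoding: replace each category with the mean target value for that category
df['Fruit_encoded'] = df.groupby('Fruit')['Price'].transform('mean')
print(df)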
4: Feature Engineering
• Feature Selection
• Feature selection is a crucial step in the machine learning pipeline. It involves selecting the most important features from your dataset to improve model performance and reduce computational cost. Here we explore various techniques for feature selection in Python using the Scikit-Learn library. Its benefits include:
• Improved Model Performance: By removing irrelevant or redundant features, we can improve the accuracy of the model.
• Reduced Overfitting: With fewer features, the model is less likely to learn noise from the training data.
• Faster Computation: Reducing the number of features decreases the computational cost and training time.
• Types of Feature Selection Methods
• Feature selection methods can be broadly classified into three categories:
• Filter Methods: Filter methods use statistical techniques to evaluate the relevance of features independently of the model. Common techniques include correlation coefficients, chi-square tests, and mutual information (see the sketch after this list).
• Wrapper Methods: Wrapper methods use a predictive model to evaluate feature subsets and select the best-
performing combination. Techniques include recursive feature elimination (RFE) and forward/backward feature
selection.
• Embedded Methods: Embedded methods perform feature selection during the model training process. Examples
include Lasso (L1 regularization) and feature importance from tree-based models.
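• A minimal sketch of a filter method, using SelectKBest with a chi-square test on the iris data (this particular combination is chosen here for illustration):
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Load the iris data
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Keep the 2 features with the highest chi-square score relative to the target
selector = SelectKBest(chi2, k=2)
X_selected = selector.fit_transform(X, y)
print(X.columns[selector.get_support()])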
4: Feature Engineering
• Feature Selection
• Feature Selection Techniques with Scikit-Learn
• Scikit-Learn provides several tools for feature selection. As an example, we can load the iris dataset and rank its features by importance:
import pandas as pd
from sklearn.datasets import load_iris

# Load dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
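• The importance scores below appear to come from a tree-based model; a minimal sketch, assuming a RandomForestClassifier (the model is not shown on the slide, and the exact values depend on the random seed):
from sklearn.ensemble import RandomForestClassifier

# Fit a random forest and rank the features by their importance scores
model = RandomForestClassifier(random_state=42)
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))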
Output
petal length (cm) 0.480141
petal width (cm) 0.378693
sepal length (cm) 0.092960
sepal width (cm) 0.048206