
Machine Learning/Data Science
(CSC 407)
Chapter 4
Taiwo Kolajo (PhD)
Department of Computer Science
Federal University Lokoja
+2348031805049
4: Feature Engineering
• What is Feature Engineering?
• Feature engineering is the process of transforming raw data into features
that are suitable for machine learning models.
• In other words, it is the process of selecting, extracting, and
transforming the most relevant features from the available data to build
more accurate and efficient machine learning models.
• The success of machine learning models heavily depends on the quality of
the features used to train them.
• Feature engineering involves a set of techniques that enable us to create
new features by combining or transforming the existing ones.
• These techniques help to highlight the most important patterns and
relationships in the data, which in turn helps the machine learning model
to learn from the data more effectively.
• What is a Feature?
• In the context of machine learning, a feature (also known as a variable or
attribute) is an individual measurable property or characteristic of a data
point that is used as input for a machine learning algorithm.
4: Feature Engineering
• Features can be numerical, categorical, or text-based, and they represent
different aspects of the data that are relevant to the problem at hand.
• For example, in a dataset of housing prices, features could include the
number of bedrooms, the square footage, the location, and the age of the
property.
• In a dataset of customer demographics, features could include age,
gender, income level, and occupation.
• The choice and quality of features are critical in machine learning, as
they can greatly impact the accuracy and performance of the model.
4: Feature Engineering
• Importance of Feature Engineering
• Improve User Experience: The primary reason we engineer features is to
enhance the user experience of a product or service. By adding new features,
we can make the product more intuitive, efficient, and user-friendly, which can
increase user satisfaction and engagement.
• Competitive Advantage: Another reason we engineer features is to gain a
competitive advantage in the marketplace. By offering unique and innovative
features, we can differentiate our product from competitors and attract more
customers.
• Meet Customer Needs: We engineer features to meet the evolving needs of
customers. By analyzing user feedback, market trends, and customer behavior,
we can identify areas where new features could enhance the product’s value and
meet customer needs.
• Increase Revenue: Features can also be engineered to generate more
revenue. For example, a new feature that streamlines the checkout process can
increase sales, or a feature that provides additional functionality could lead to
more upsells or cross-sells.
• Future-Proofing: Engineering features can also be done to future-proof a
product or service. By anticipating future trends and potential customer needs,
we can develop features that ensure the product remains relevant and useful in
the long term.
4: Feature Engineering
• Feature Engineering Process
• Feature engineering in machine learning mainly consists of five processes:
• Feature Creation,
• Feature Transformation,
• Feature Extraction,
• Feature Selection, and
• Feature Scaling.
• It is an iterative process that requires experimentation and testing to find the
best combination of features for a given problem. The success of a machine
learning model largely depends on the quality of the features used in the model.
1. Feature Creation
• Feature Creation is the process of generating new features based on domain
knowledge or by observing patterns in the data. It is a form of feature engineering
that can significantly improve the performance of a machine-learning model.
• Types of Feature Creation:
• Domain-Specific: Creating new features based on domain knowledge, such
as creating features based on business rules or industry standards.
• Data-Driven: Creating new features by observing patterns in the data, such as
calculating aggregations or creating interaction features.
• Synthetic: Generating new features by combining existing features or
synthesizing new data points.
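• To make these three types concrete, here is a small illustrative sketch in pandas (the housing-style columns price, area, bedrooms and sale_date are made up for this example and are not part of any course dataset):

import pandas as pd

# Hypothetical housing data (column names are assumed purely for illustration)
df = pd.DataFrame({
    'price': [250000, 320000, 180000],
    'area': [1200, 1600, 900],
    'bedrooms': [3, 4, 2],
    'sale_date': pd.to_datetime(['2021-03-01', '2021-07-15', '2022-01-10']),
})

# Domain-specific: price per square foot, based on real-estate domain knowledge
df['price_per_sqft'] = df['price'] / df['area']

# Data-driven: an interaction feature combining two existing columns
df['area_x_bedrooms'] = df['area'] * df['bedrooms']

# Derived/synthetic-style feature: extract the sale month from the date column
df['sale_month'] = df['sale_date'].dt.month

print(df)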
4: Feature Engineering
2. Feature Transformation
• Feature Transformation is the process of transforming the features into a more
suitable representation for the machine learning model. This is done to ensure that
the model can effectively learn from the data.

• Types of Feature Transformation
• Scaling: Rescaling numerical features to a similar scale (for example, to unit
standard deviation) so that features can be compared more easily and no single
feature dominates the model.
• Normalization: Rescaling the features to have a similar range, such as
between 0 and 1, to prevent some features from dominating others.
• Encoding: Transforming categorical features into a numerical representation.
Examples are one-hot encoding and label encoding.
• Transformation: Transforming the features using mathematical operations to
change the distribution or scale of the features. Examples are logarithmic,
square root, and reciprocal transformations.
4: Feature Engineering
• Scaling
• Absolute Maximum Scaling: This method of scaling requires two steps:
• First, find the maximum absolute value among all the entries of a particular column.
• Then subtract this maximum value from each entry and divide the result by the
same maximum value, as in the formula and the code that follows.
X_scaled = (X_i - max(|X|)) / max(|X|)
• Note that absolute maximum scaling is often defined as simply dividing each entry
by max(|X|), which maps the entries into the range -1 to 1; the variant used here
also subtracts the maximum first, so the non-negative data below ends up between
-1 and 0.
• Either way, this method is not used very often because the maximum is highly
sensitive to outliers, and outliers are very common in real-world data.
4: Feature Engineering
• Scaling
• For demonstration purposes we will use the 'SampleFile'
dataset, a simplified version of the original house price
prediction dataset that keeps only two of the original
columns. The first five rows of the data are shown below:
import pandas as pd
df = pd.read_csv('SampleFile.csv')
print(df.head())
Output:
LotArea MSSubClass
0 8450 60
1 9600 20
2 11250 60
3 9550 70
4 14260 60
4: Feature Engineering
• Scaling
• Now let's apply the first method, absolute maximum scaling. First,
we compute the maximum absolute value of each column.
import numpy as np
max_vals = np.abs(df).max()   # column-wise maximum of the absolute values
max_vals
Output:
LotArea 215245
MSSubClass 190
dtype: int64

• Now we subtract these maximum values from the data and then divide the result by
the same maximum values.
print((df - max_vals) / max_vals)
LotArea MSSubClass
0 -0.960742 -0.684211
1 -0.955400 -0.894737
2 -0.947734 -0.684211
3 -0.955632 -0.631579
4 -0.933750 -0.684211
... ... ...
1459 -0.953834 -0.894737
[1460 rows x 2 columns]
4: Feature Engineering
• Min-Max Scaling
• This method of scaling requires the following two steps:
• First, find the minimum and the maximum value of the column.
• Then subtract the minimum value from each entry and divide the result by the
difference between the maximum and the minimum value.
X_scaled = (X_i - X_min) / (X_max - X_min)
Because it uses the maximum and the minimum values, this method is also sensitive
to outliers, but after the two steps above the scaled data always lies between 0 and 1.
4: Feature Engineering
• Min-Max Scaling
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data,
columns=df.columns)
scaled_df.head()

Output:
LotArea MSSubClass
0 0.033420 0.235294
1 0.038795 0.000000
2 0.046507 0.235294
3 0.038561 0.294118
4 0.060576 0.235294
4: Feature Engineering
• Normalization
This method is similar to the previous one, but instead of the minimum value we
subtract the mean of the column from each entry and then divide the result by the
difference between the maximum and the minimum value (this is often called mean
normalization).
X_scaled = (X_i - X_mean) / (X_max - X_min)
Note that scikit-learn's Normalizer, used below, works differently: it rescales each
sample (row) to unit norm rather than applying the column-wise formula above, which
is why the values in the output are all close to 1.

from sklearn.preprocessing import Normalizer

scaler = Normalizer()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data,
columns=df.columns)
print(scaled_df.head())

Output:
LotArea MSSubClass
0 0.999975 0.007100
1 0.999998 0.002083
2 0.999986 0.005333
3 0.999973 0.007330
4 0.999991 0.004208
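• For comparison, here is a minimal sketch of the column-wise mean-normalization formula itself, assuming df is still the two-column 'SampleFile' DataFrame loaded earlier:

# Mean normalization applied column-wise, directly from the formula above
mean_normalized_df = (df - df.mean()) / (df.max() - df.min())
print(mean_normalized_df.head())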
4: Feature Engineering
• Standardization
First, calculate the mean and standard deviation of the column we would like to standardize.
Then subtract the mean from each entry and divide the result by the standard deviation.
This gives the data a mean of zero and a standard deviation equal to 1 (it rescales and
centres the data, but does not by itself make the distribution normal).
X_scaled = (X_i - X_mean) / σ
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data,
columns=df.columns)
print(scaled_df.head())
Output:
LotArea MSSubClass
0 -0.207142 0.073375
1 -0.091886 -0.872563
2 0.073480 0.073375
3 -0.096897 0.309859
4 0.375148 0.073375
4: Feature Engineering
• Robust Scaling
In this method of scaling, we use two robust statistical measures of the data:
• the Median, and
• the Inter-Quartile Range (the difference between the upper and lower quartiles,
that is Q3 - Q1).
After calculating these two values, we subtract the median from each entry and then
divide the result by the interquartile range.
X_scaled = (X_i - X_median) / IQR
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data,
columns=df.columns)
print(scaled_df.head())
Output:
LotArea MSSubClass
0 -0.254076 0.2
1 0.030015 -0.6
2 0.437624 0.2
3 0.017663 0.4
4: Feature Engineering
• Encoding
• Most real-life datasets we encounter during our data science project development have
columns of mixed data type: both categorical and numerical columns.

• However, various Machine Learning models do not work with categorical data and to fit
this data into the machine learning model it needs to be converted into numerical data.

• For example, suppose a dataset has a Gender column with categorical elements like Male
and Female.
• These labels have no inherent order, but because they are stored as strings, machine
learning models cannot use them directly and may infer a hierarchy that does not exist.
• One approach to solve this problem is label encoding, where we assign a numerical
value to each label, for example mapping Male to 0 and Female to 1.

• But this can introduce bias into our model, because it may treat Female as "greater
than" Male simply because 1 > 0.

• Ideally, both labels are equally important in the dataset.

• To deal with this issue we will use the One Hot Encoding technique.
• One hot encoding is a technique that we use to represent categorical variables as numerical
values in a machine learning model.
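• To make the ordering problem concrete, here is a minimal sketch of the label-encoding approach described above (the Gender values are made up for illustration):

import pandas as pd

# Made-up Gender column, used only to illustrate label encoding
gender = pd.Series(['Male', 'Female', 'Female', 'Male'])

# pd.factorize assigns an integer code to each label in order of appearance
codes, labels = pd.factorize(gender)
print(codes)   # [0 1 1 0]  -> Male = 0, Female = 1
print(labels)  # the labels corresponding to the codes
# A model may read 1 > 0 as a real ordering, which is exactly what one-hot encoding avoids.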
4: Feature Engineering
• Encoding
• One Hot Encoding Example
• Imagine we have a dataset with fruits, their categorical values, and corresponding prices.
Using one-hot encoding, we can transform these categorical values into numerical form.
For instance:
• Wherever the fruit is “Apple,” the Apple column will have a value of 1, while the other fruit
columns (like Mango or Orange) will contain 0.
• This pattern ensures that each categorical value gets its own column, represented with
binary values (1 or 0), making it usable for machine learning models.
• Consider the data where fruits, their corresponding categorical values, and prices are
given.
Fruit     Categorical values of fruit     Price
apple     1                               5
mango     2                               10
apple     1                               15
orange    3                               20
• The output after applying one-hot encoding on the data is given as follows,
apple    mango    orange    Price
1        0        0         5
0        1        0         10
1        0        0         15
0        0        1         20
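• The table above can be reproduced in a couple of lines with pandas get_dummies (a minimal sketch; the small DataFrame mirrors the fruit data, and get_dummies prefixes the new columns with the original column name, e.g. Fruit_apple):

import pandas as pd

fruit_df = pd.DataFrame({
    'Fruit': ['apple', 'mango', 'apple', 'orange'],
    'Price': [5, 10, 15, 20],
})

# One-hot encode the Fruit column; each fruit gets its own 0/1 column
encoded = pd.get_dummies(fruit_df, columns=['Fruit'], dtype=int)
print(encoded)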
4: Feature Engineering
• Encoding
• Implementing One Hot Encoding
• To implement one-hot encoding in Python, we can use either the Pandas library or the
Scikit-learn library, both of which provide efficient and convenient methods for this
task.
• 1. Using Pandas
• Pandas offers the get_dummies function, which is a simple and effective way to
perform one-hot encoding. This method converts categorical variables into multiple binary
columns.
• For example, the Gender column with values 'M' and 'F' becomes two binary columns:
Gender_F and Gender_M.
• drop_first=True in pandas drops one redundant column. In the output below it drops
Gender_F and keeps only Gender_M, since either column alone carries the full
information; dropping one also helps avoid multicollinearity.
4: Feature Engineering
• Encoding
• Implementing One Hot Encoding
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Create a dummy employee dataset
data = {
    'Employee id': [10, 20, 15, 25, 30],
    'Gender': ['M', 'F', 'F', 'M', 'F'],
    'Remarks': ['Good', 'Nice', 'Good', 'Great', 'Nice']
}

# Convert the data into a pandas DataFrame
df = pd.DataFrame(data)
print(f"Original Employee Data:\n{df}\n")

# Use pd.get_dummies() to one-hot encode the categorical columns
df_pandas_encoded = pd.get_dummies(df, columns=['Gender', 'Remarks'], drop_first=True)
print(f"One-Hot Encoded Data using Pandas:\n{df_pandas_encoded}\n")

# List the categorical columns to be encoded with scikit-learn
categorical_columns = ['Gender', 'Remarks']

# Initialize OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the categorical columns
one_hot_encoded = encoder.fit_transform(df[categorical_columns])

# Create a DataFrame with the encoded columns
one_hot_df = pd.DataFrame(one_hot_encoded,
                          columns=encoder.get_feature_names_out(categorical_columns))

# Concatenate the one-hot encoded columns with the original DataFrame
df_sklearn_encoded = pd.concat([df.drop(categorical_columns, axis=1), one_hot_df], axis=1)

print(f"One-Hot Encoded Data using Scikit-Learn:\n{df_sklearn_encoded}\n")
4: Feature Engineering
• Encoding
• Output
Original Employee Data:
Employee id Gender Remarks
0 10 M Good
1 20 F Nice
2 15 F Good
3 25 M Great
4 30 F Nice

One-Hot Encoded Data using Pandas:


Employee id Gender_M Remarks_Great Remarks_Nice
0 10 True False False
1 20 False False True
2 15 False False False
3 25 True True False
4 30 False False True

We can observe that the original data has 3 unique Remarks values and 2 unique Gender values. However,
you only need n-1 columns to encode a feature with n unique labels. For example, if we keep only the
Gender_M column and drop Gender_F, we still convey the full information: when the value is 1 (True) the
employee is male, and when it is 0 (False) the employee is female. This way we can encode the categorical
data and reduce the number of parameters as well.
4: Feature Engineering
• Encoding
2. One Hot Encoding using Scikit Learn Library
• Scikit-learn (sklearn) is a popular machine-learning library that provides numerous tools for data
preprocessing, including a OneHotEncoder class for encoding categorical variables into binary vectors.
• We use the pandas method df.select_dtypes(include=['object']) to find the categorical columns:
• It selects only the columns with categorical data (data type object).
• In this case, ['Gender', 'Remarks'] are identified as the categorical columns.

# One-hot encoding using OneHotEncoder from Scikit-Learn
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Build a dummy employee dataset for the example
data = {'Employee id': [10, 20, 15, 25, 30],
        'Gender': ['M', 'F', 'F', 'M', 'F'],
        'Remarks': ['Good', 'Nice', 'Good', 'Great', 'Nice'],
}
df = pd.DataFrame(data)
print(f"Employee data : \n{df}")

# Extract categorical columns from the dataframe
# (the columns with object datatype are the categorical columns)
categorical_columns = df.select_dtypes(include=['object']).columns.tolist()
encoder = OneHotEncoder(sparse_output=False)

# Apply one-hot encoding to the categorical columns
one_hot_encoded = encoder.fit_transform(df[categorical_columns])

# Create a DataFrame with the one-hot encoded columns
# (get_feature_names_out() provides the column names for the encoded data)
one_hot_df = pd.DataFrame(one_hot_encoded, columns=encoder.get_feature_names_out(categorical_columns))

# Concatenate the one-hot encoded dataframe with the original dataframe
df_encoded = pd.concat([df, one_hot_df], axis=1)

# Drop the original categorical columns
df_encoded = df_encoded.drop(categorical_columns, axis=1)
print(f"Encoded Employee data : \n{df_encoded}")
4: Feature Engineering
• Encoding
Output
Employee data :
Employee id Gender Remarks
0 10 M Good
1 20 F Nice
2 15 F Good
3 25 M Great
4 30 F Nice
Encoded Employee data :
Employee id Gender_F Gender_M Remarks_Good Remarks_Great Remarks_Nice
0 10 0.0 1.0 1.0 0.0 0.0
1 20 1.0 0.0 0.0 0.0 1.0
2 15 1.0 0.0 1.0 0.0 0.0
3 25 0.0 1.0 0.0 1.0 0.0
4 30 1.0 0.0 0.0 0.0 1.0
Both Pandas and Scikit-Learn offer robust solutions for one-hot encoding.

Use Pandas get_dummies() when you need quick and simple encoding.
Use Scikit-Learn OneHotEncoder when working within a machine learning pipeline, or when you
need finer control over encoding behavior.
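As an illustration of the pipeline use case, here is a hedged sketch that wraps OneHotEncoder in a ColumnTransformer, reusing df and categorical_columns from the previous slide; the resulting preprocessor can be dropped into a Pipeline in front of any estimator:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the categorical columns and pass the numeric ones through unchanged
preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), categorical_columns)
    ],
    remainder='passthrough'   # keeps the numeric 'Employee id' column as-is
)

encoded = preprocessor.fit_transform(df)
print(preprocessor.get_feature_names_out())
print(encoded)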
4: Feature Engineering
• Encoding
Alternatives to One Hot Encoding
• While One Hot Encoding is a popular choice for handling categorical data, there are several
alternatives that may be more suitable depending on the context:

• Label Encoding: In cases where categorical variables have a natural order (e.g., "Low,"
"Medium," "High"), label encoding can be a better option. It assigns a unique integer to
each category, and because the categories really are ordered, the integers do not
introduce the spurious hierarchy they would for nominal data.

• Binary Encoding: This technique combines the benefits of One Hot Encoding and
label encoding. It converts the categories into binary numbers and then spreads the
binary digits across columns. This can reduce dimensionality while preserving information.

• Target Encoding: In target encoding, we replace each category with the mean of the
target variable for that category. This method can be particularly useful for categorical
variables with a high number of unique values, but it also carries a risk of leakage if not
handled properly.
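• A minimal sketch of the target-encoding idea using plain pandas (the city and price columns are made up for illustration; in practice the per-category means should be computed on the training split only, to avoid leakage):

import pandas as pd

# Made-up example: encode 'city' by the mean of the target 'price' for that city
df = pd.DataFrame({
    'city':  ['Lagos', 'Abuja', 'Lagos', 'Kano', 'Abuja'],
    'price': [100, 80, 120, 60, 90],
})

city_means = df.groupby('city')['price'].mean()
df['city_target_enc'] = df['city'].map(city_means)
print(df)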
4: Feature Engineering
• Feature Selection
• Feature selection is a crucial step in the machine learning pipeline. It involves selecting the most important
features from your dataset to improve model performance and reduce computational cost. In this section, we
will explore various techniques for feature selection in Python using the Scikit-Learn library.

• What is feature selection?


• Feature selection is the process of identifying and selecting a subset of relevant features for use in model
construction. The goal is to enhance the model's performance by reducing overfitting, improving accuracy, and
reducing training time.

• Why is Feature Selection Important?


• Feature selection offers several benefits:

• Improved Model Performance: By removing irrelevant or redundant features, we can improve the accuracy of the
model.
• Reduced Overfitting: With fewer features, the model is less likely to learn noise from the training data.
• Faster Computation: Reducing the number of features decreases the computational cost and training time.
• Types of Feature Selection Methods
• Feature selection methods can be broadly classified into three categories:

• Filter Methods: Filter methods use statistical techniques to evaluate the relevance of features independently of the
model. Common techniques include correlation coefficients, chi-square tests, and mutual information.
• Wrapper Methods: Wrapper methods use a predictive model to evaluate feature subsets and select the best-
performing combination. Techniques include recursive feature elimination (RFE) and forward/backward feature
selection.
• Embedded Methods: Embedded methods perform feature selection during the model training process. Examples
include Lasso (L1 regularization) and feature importance from tree-based models.
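• As a quick sketch of the embedded category, here is one possible example using SelectFromModel with an L1-penalised logistic regression (the classification analogue of Lasso) on the iris data; it complements the filter, wrapper, and tree-based examples implemented on the next slides:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Embedded method: the L1 penalty drives weak coefficients to zero, and
# SelectFromModel keeps only the features whose coefficients survive.
data = load_iris()
X, y = data.data, data.target

l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.5)
selector = SelectFromModel(l1_model).fit(X, y)

print("Selected features:", np.array(data.feature_names)[selector.get_support()])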
4: Feature Engineering
• Feature Selection
• Feature Selection Techniques with Scikit-Learn
• Scikit-Learn provides several tools for feature selection, including:

• Univariate Selection: Univariate selection evaluates each feature individually to determine
its importance. Techniques like SelectKBest and SelectPercentile can be used to select the
top features based on statistical tests.
• Recursive Feature Elimination (RFE): RFE is a wrapper method that recursively removes
the least important features based on a model's performance. It repeatedly builds a model
and eliminates the weakest features until the desired number of features is reached.
• Feature Importance from Tree-based Models: Tree-based models like decision trees and
random forests can provide feature importance scores, indicating the importance of each
feature in making predictions.
• Practical Implementation of Feature Selection with Scikit-Learn
• Let's implement these feature selection techniques using Scikit-Learn.
4: Feature Engineering
• Feature Selection
• Data Preparation:
• First, let's load a dataset and split it into features and target variables.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Method 1: Univariate Selection in Python with Scikit-Learn
We'll use SelectKBest with the chi-square test to select the top 2 features.

from sklearn.feature_selection import SelectKBest, chi2

# Apply SelectKBest with chi2
select_k_best = SelectKBest(score_func=chi2, k=2)
X_train_k_best = select_k_best.fit_transform(X_train, y_train)

print("Selected features:", X_train.columns[select_k_best.get_support()])


Output
Selected features: Index(['petal length (cm)', 'petal width (cm)'], dtype='object')
4: Feature Engineering
• Feature Selection
Method 2: Recursive Feature Elimination
• Next, we'll use RFE with a logistic regression model to select the top 2
features.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Apply RFE with logistic regression
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=2)
X_train_rfe = rfe.fit_transform(X_train, y_train)

print("Selected features:", X_train.columns[rfe.get_support()])


Output
Selected features: Index(['petal length (cm)', 'petal width (cm)'], dtype='object')
4: Feature Engineering
• Feature Selection
Method 3: Tree-Based Feature Importance
• Finally, we'll use a random forest classifier to determine feature importance.

from sklearn.ensemble import RandomForestClassifier

# Train random forest and get feature importances
model = RandomForestClassifier()
model.fit(X_train, y_train)
importances = model.feature_importances_

# Display feature importances
feature_importances = pd.Series(importances, index=X_train.columns)
print(feature_importances.sort_values(ascending=False))

Output
petal length (cm) 0.480141
petal width (cm) 0.378693
sepal length (cm) 0.092960
sepal width (cm) 0.048206
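
One possible follow-up (a sketch, reusing the feature_importances Series computed above): turn the importance scores into an actual selection by keeping only the features above the average importance. Scikit-Learn's SelectFromModel offers the same idea as a ready-made transformer.

# Keep only the features whose importance is above the mean importance
selected = feature_importances[feature_importances > feature_importances.mean()]
print("Selected features:", list(selected.index))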
