
Unit-2

Feature Engineering:
• Introduction
• Feature Transformation
• Subset Selection
Modelling and Evaluation:
• Selecting a model
• Training model
• Model representation
• Evaluating and Improving model performance
Feature Engineering for Machine Learning

What is a Feature
• In the context of machine learning, a feature (also known as a variable or
attribute) is an individual measurable property or characteristic of a data point
that is used as input for a machine learning algorithm.
• Features can be numerical, categorical or text-based, and they represent
different aspects of the data that are relevant to the problem at hand.
• For example, in a dataset of housing prices, features could include the number
of bedrooms, the square footage, the location, and the age of the property.
• The choice and quality of features are critical in machine learning, as they can
greatly impact the accuracy and performance of the model.
Dataset features-IRIS

Petal_width  Petal_length  Sepal_width  Sepal_length  Species_name
0.2          1.4           3.5          5.1           Setosa
1.5          1.4           3.0          4.9           Versicolor
2.2          1.3           3.2          4.7           Setosa
1.2          1.5           3.1          4.6           Versicolor
0.2          1.4           3.6          5.0           Setosa
0.4          1.7           3.9          5.4           Setosa
0.3          1.4           3.4          4.6           Setosa
2.3          1.5           3.4          5.0           Versicolor
Contd..
• Feature engineering is a pre-processing step in machine learning.
• Feature engineering is the process of transforming raw data into features that
are suitable for machine learning models.
• It involves selecting relevant information from raw data and transforming it into
a format that can be easily understood by a model.
• The goal is to improve model accuracy by providing more meaningful and
relevant information.
• The success of machine learning models heavily depends on the quality of the
features used to train them.
• Feature engineering involves a set of techniques that enable us to create new
features by combining or transforming the existing ones.
• These techniques help to highlight the most important patterns and relationships
in the data, which in turn helps the machine learning model to learn from the data
more effectively.
• Features need to be engineered to improve the performance of machine
learning models by providing them with relevant and informative
input data.
• Raw data may contain noise, irrelevant information, or missing
values, which can lead to inaccurate or biased model predictions.
• By engineering features, we can extract meaningful information from
the raw data.
• Feature engineering is a crucial step in preparing data for analysis and
decision-making in various fields, such as finance, healthcare,
marketing, and social sciences.
Processes Involved in Feature Engineering

• Feature engineering in ML contains two major elements:


• Feature Transformation: transforms the data, structured or unstructured,
into a new set of features that can represent the underlying problem the ML
algorithm is trying to solve.
• Feature Subset Selection: derives a subset of features from the full feature
set that is most meaningful in the context of the ML problem.
Processes Involved in Feature Engineering

• Feature engineering in ML contains two major elements:


• Feature Transformation
• Variants:
1. Feature Construction: This process discovers missing information
about the relationships between features and augments the feature
space by creating additional features
2. Feature Extraction: It is the process of extracting a new set of
features from the original set of features using some functional
mapping
Feature Construction
Feature Creation:
• Feature Creation is the process of generating new features based on
domain knowledge or by observing patterns in the data.
• It is a form of feature engineering that can significantly improve the
performance of a machine-learning model.
Types of Feature Creation:
1.Domain-Specific: Creating new features based on domain knowledge,
such as creating features based on business rules or industry standards.
2.Data-Driven: Creating new features by observing patterns in the data,
such as calculating aggregations or creating interaction features.
3.Synthetic: Generating new features by combining existing features or
synthesizing new data points.
• Creating features means deriving new variables that will be most
helpful for our model; this can involve adding or removing some features.
Example: adding a cost per sq. ft column is a feature creation.
• Below are the prices of properties in x city. It shows the area of the
house and total price.
Contd..
• The data may have some errors or may be incorrect, since not all sources on the
internet are reliable. To begin, we'll add a new column to display the cost per
square foot.

• This new feature will help us understand a lot about our data. So, we have a new
column which shows cost per square ft.
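A minimal sketch of this feature creation step using pandas is given below. The column names (area_sqft, total_price) and the values are assumed for illustration, since the slide's original table is not reproduced here.

import pandas as pd

# Hypothetical property data; column names and values are assumed for illustration
df = pd.DataFrame({
    "area_sqft": [1000, 1500, 1200, 2000],
    "total_price": [5000000, 7200000, 6100000, 9500000],
})

# Feature creation: derive cost per square foot from the two existing columns
df["cost_per_sqft"] = df["total_price"] / df["area_sqft"]
print(df)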
Contd..
Benefits of Feature Creation:
1.Improves Model Performance: By providing additional and more
relevant information to the model, feature creation can increase the
accuracy and precision of the model.
2.Increases Model Robustness: By adding additional features, the model
can become more robust to outliers and other anomalies.
3.Improves Model Interpretability: By creating new features, it can be
easier to understand the model’s predictions.
4.Increases Model Flexibility: By adding new features, the model can be
made more flexible to handle different types of data.
Feature Construction is an essential activity
Feature Construction: Encoding nominal variables
Feature Construction: Encoding categorical(ordinal) variables
Feature Construction: Encoding categorical variables
One-Hot Encoding:
• One-hot encoding is a technique used to transform categorical variables
into numerical values that can be used by machine learning models.
• In this technique, each category is transformed into a binary value
indicating its presence or absence.
• For example, consider a categorical variable “Colour” with three
categories: Red, Green, and Blue.
• One-hot encoding would transform this variable into three binary
variables: Colour_Red, Colour_Green, and Colour_Blue, where the value
of each variable would be 1 if the corresponding category is present and
0 otherwise.
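A minimal sketch of one-hot encoding with pandas, using the Colour example above (data values assumed), is shown below.

import pandas as pd

# Sample data for the "Colour" variable from the example above
df = pd.DataFrame({"Colour": ["Red", "Green", "Blue", "Green"]})

# get_dummies creates one binary column per category:
# Colour_Blue, Colour_Green, Colour_Red
encoded = pd.get_dummies(df, columns=["Colour"], prefix="Colour")
print(encoded)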
Feature Construction: Encoding numeric to categorical (ordinal) variables
Feature Construction: Text-specific data (Bag-of-Words)

Vectorization Process for Text Corpus


1. Tokenize
2. Count
3. Normalize

Document-Term Matrix
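A minimal sketch of the tokenize and count steps with scikit-learn's CountVectorizer is given below; the two-sentence corpus is assumed for illustration. The normalize step (e.g. TF-IDF weighting) can be obtained with TfidfVectorizer instead.

from sklearn.feature_extraction.text import CountVectorizer

# Tiny illustrative corpus (not from the slides)
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Tokenize each document and count term occurrences, producing a
# document-term matrix (rows = documents, columns = vocabulary terms)
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(dtm.toarray())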
General Types of Feature Transformation:
1.Normalization: Rescaling the features to have a similar range, such as
between 0 and 1, to prevent some features from dominating others.
2.Scaling: Rescaling the features to have a similar scale, such as having a
standard deviation of 1, to make sure the model considers all features
equally.
3.Encoding: Transforming categorical features into a numerical
representation. Examples are one-hot encoding and label encoding.
4.Transformation: Transforming the features using mathematical operations
to change the distribution or scale of the features. Examples are
logarithmic, square root, and reciprocal transformations.
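A minimal sketch of normalization, scaling and a log transformation with scikit-learn and NumPy is given below (encoding was illustrated earlier); the toy feature matrix is assumed for illustration.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature matrix with two features on very different scales (assumed values)
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Normalization: rescale each feature to the [0, 1] range
X_minmax = MinMaxScaler().fit_transform(X)

# Scaling: standardize each feature to mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)

# Transformation: apply a mathematical function, e.g. a log transform
X_log = np.log1p(X)

print(X_minmax, X_std, X_log, sep="\n\n")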
Contd..
Benefits of Feature Transformation:

1.Improves Model Performance: By transforming the features into a


more suitable representation, the model can learn more meaningful
patterns in the data.
2.Increases Model Robustness: Transforming the features can make the
model more robust to outliers and other anomalies.
3.Improves Computational Efficiency: The transformed features often
require fewer computational resources.
4.Improves Model Interpretability: By transforming the features, it can
be easier to understand the model’s predictions.
Feature Extraction in ML
Feature Extraction Examples used in ML
Feature Extraction Algorithms used in ML

• Principal Component Analysis (PCA)


• Singular Value Decomposition (SVD)
• Linear Discriminant Analysis (LDA)
Principal Component Analysis
• In PCA, a new set of features is extracted from the original features; these new features are quite dissimilar
in nature.
• So, an n-dimensional feature space gets transformed to an m-dimensional feature space, where
the dimensions are orthogonal to each other, i.e. completely independent of each other.

• A vector is a quantity having both magnitude and direction and hence can determine the position
of a point relative to another point in the Euclidean space.
• A vector space is a set of vectors.
• Vector spaces have a property that they can be represented as a linear combination of smaller set
of vectors, called basis vectors.
• So, any vector 'v' in a vector space can be represented as a linear combination of the basis vectors:
v = a1·u1 + a2·u2 + … + an·un, where a1, …, an are 'n' scalars and u1, …, un are the basis vectors.
Principal Component Analysis

• Let us extend this notion to the feature space of a data set


• The feature vector can be transformed to a vector space of the basis vectors
which are termed as principal components.
• A set of feature vectors that have similarity with each other is transformed to a
set of principal components that are completely unrelated
• The principal components capture the variability of the original feature space
• The number of components derived is much smaller than the number of original
features.
Principal Component Analysis
Principal Component Analysis: Steps

https://www.kdnuggets.com/2023/05/principal-component-analysis-pca-scikitlearn.html
Principal Component Analysis

https://www.geeksforgeeks.org/covariance-matrix/
Principal Component Analysis: Steps
*Note: Standardize the features of the dataset by removing the mean and
scaling to unit variance, so that each feature has μ = 0 and σ = 1.
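Following the steps in the linked article, a minimal PCA sketch with scikit-learn on the Iris dataset is given below; the choice of 2 components is an assumption for illustration.

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize the Iris features (mean 0, standard deviation 1), as in the note above
X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Project the 4-dimensional feature space onto 2 orthogonal principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print(X_pca.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)  # variability captured by each component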
Singular Value Decomposition (SVD)

• SVD is a matrix factorization technique commonly used in linear


algebra.
• SVD of a matrix A (m × n) is a factorization of the form A = U Σ V^T, where U is an m × m orthogonal matrix of left-singular vectors, Σ is an m × n diagonal matrix of singular values, and V is an n × n orthogonal matrix of right-singular vectors.
Singular Value Decomposition (SVD)

• When the dataset is sparse (as in case of text data), it is not advisable to
remove the mean of a data attribute.
• SVD is a good choice for dimensionality reduction in those situations
than PCA.

https://machinelearningmastery.com/singular-value-decomposition-for-machine-learning/
Singular Value Decomposition (SVD)

SVD on text Source Code
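The slide's original source code is not reproduced here; a minimal sketch with scikit-learn's TruncatedSVD on a small assumed text corpus is shown instead.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Tiny assumed corpus for illustration
corpus = [
    "machine learning extracts features from data",
    "feature engineering improves machine learning models",
    "singular value decomposition reduces dimensionality",
]

# Build a sparse document-term matrix; TruncatedSVD works on it directly
# without removing the mean, so the matrix stays sparse
dtm = CountVectorizer().fit_transform(corpus)

svd = TruncatedSVD(n_components=2)
reduced = svd.fit_transform(dtm)
print(reduced.shape)  # (3, 2): each document represented by 2 latent components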


Linear Discriminant Analysis (LDA)

• LDA is another commonly used feature extraction technique like PCA or SVD.
• The objective of LDA is to transform a dataset into a lower dimensional feature
space
• The focus of LDA is not to capture the dataset variability.
• Instead, LDA focuses on class separability, i.e. separating the samples based on
their class labels, so as to avoid overfitting of the machine learning model.
• LDA calculates eigenvalues and eigenvectors of the intra-class (within-class) and inter-class
(between-class) scatter matrices.

https://www.statology.org/scree-plot-python/
Linear Discriminant Analysis (LDA)
Steps to be followed are given below:
1. Calculate the mean vectors for the individual classes
2. Calculate intra-class and inter-class scatter matrices
3. Calculate eigenvalues and eigenvectors for Sw and SB, where Sw is the intra-class scatter matrix and SB is
the inter-class scatter matrix
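A minimal LDA sketch with scikit-learn on the Iris dataset is given below; with 3 classes, at most 2 discriminant components can be extracted.

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Iris data: 4 features, 3 classes
X, y = load_iris(return_X_y=True)

# LDA uses the class labels to find directions that maximize class separability
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print(X_lda.shape)                    # (150, 2)
print(lda.explained_variance_ratio_)  # separability captured by each component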
Feature Selection

Task: predicting weights of students


Issues in High-dimensional data
Objective of Feature Selection
Key drivers of Feature Selection
• Feature Relevance
• Redundancy in Features
Measures of Feature Relevance
• Mutual Information
• Entropy of feature
Measures of Feature Relevance
• Entropy of Feature: Shannon's formula: H(X) = − Σ p(x) · log2 p(x), where the sum runs over the possible values x of the feature and p(x) is the probability of value x.
Measures of Feature Relevance
• Mutual Information: the higher the mutual information of a feature with the class variable, the more relevant that
feature is.
Measures of Feature Relevance
• For supervised learning, mutual information is considered
a good measure
• For unsupervised learning, there is no class variable, hence
feature-to-class mutual information cannot be used as a measure.
• Instead, the entropy of the set of features is calculated with one feature
left out at a time, for each feature in turn.
• Features are ranked in descending order of the information gain*
contributed by each feature, and the top 'p%' are considered relevant features.

https://medium.com/@ompramod9921/decision-trees-6a3c05e9cb82
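A minimal sketch of ranking features by their mutual information with the class variable, using scikit-learn on the Iris dataset (used here only as an example), is given below.

from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

# Mutual information between each feature and the class variable:
# higher values indicate more relevant features (supervised setting)
data = load_iris()
mi = mutual_info_classif(data.data, data.target, random_state=0)

for name, score in sorted(zip(data.feature_names, mi), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")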
Measures of Feature redundancy

• Feature redundancy is based on similar information contribution by


multiple features.
• Measures of similarity of information contribution:
1. Correlation-based measures
2. Distance-based measures
3. Other coefficient-based measure
Measures of Feature redundancy
• Correlation-based measure: It is a measure of linear dependency between two
random variables.
• Pearson’s product moment correlation coefficient for two feature variables F1
and F2 is defined as:
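A minimal sketch of checking redundancy between two feature columns via Pearson's correlation with pandas is shown below; the column names and values are assumed for illustration.

import pandas as pd

# Two hypothetical feature columns (values assumed for illustration)
df = pd.DataFrame({
    "F1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "F2": [2.1, 3.9, 6.2, 8.0, 9.8],
})

# Values of r near +1 or -1 indicate a strong linear dependency,
# i.e. the two features carry largely redundant information
r = df["F1"].corr(df["F2"], method="pearson")
print(round(r, 3))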
Measures of Feature redundancy
• Distance-based measures:
• Euclidean distance
• Minkowski distance
• Manhattan distance
• Hamming distance
Measures of Feature redundancy
• Distance-based measures:
• Euclidean distance is the most common distance measure between two feature
variables F1 and F2 (for n rows), defined as: d(F1, F2) = sqrt( Σ i=1..n (F1i − F2i)² )
Measures of Feature redundancy
• Euclidean Distance: Example
Measures of Feature redundancy
• Euclidean Distance: Example
Measures of Feature redundancy
• Distance-based measures:
Measures of Feature redundancy
• Hamming Distance: A special case of Manhattan distance is the Hamming
distance which measures the distance between binary vectors.
• Example: Hamming distance between 01101011 and 11001001 is 3
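A minimal sketch of these distance measures with SciPy is given below; the two feature vectors are assumed for illustration, and the binary vectors are taken from the Hamming example above.

from scipy.spatial import distance

# Two hypothetical feature vectors (assumed values, n = 4 rows)
f1 = [2.0, 4.0, 6.0, 8.0]
f2 = [1.0, 5.0, 7.0, 6.0]

print(distance.euclidean(f1, f2))       # Euclidean distance
print(distance.cityblock(f1, f2))       # Manhattan distance
print(distance.minkowski(f1, f2, p=3))  # Minkowski distance of order 3

# Hamming distance between the binary vectors from the example above;
# SciPy returns the fraction of differing positions, so multiply by the length
b1 = [0, 1, 1, 0, 1, 0, 1, 1]
b2 = [1, 1, 0, 0, 1, 0, 0, 1]
print(distance.hamming(b1, b2) * len(b1))  # 3.0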
Measures of Feature redundancy
• Other Distance based measures:
• Jaccard index/coefficient is a measure of similarity between two features: J = n11 / (n01 + n10 + n11),
where n11 counts the attributes where both features are 1 and n01, n10 count the disagreements. Jaccard distance is
a measure of dissimilarity between two features, the complement of the Jaccard index: Jaccard distance = 1 − J.
Measures of Feature redundancy
• Jaccard index/coefficient: Example
Measures of Feature redundancy
• Simple Matching Coefficient (SMC): SMC = (n11 + n00) / (n00 + n01 + n10 + n11), i.e. the proportion of matching attribute values (both 1 or both 0) out of all attributes.
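A minimal sketch computing the Jaccard index and the Simple Matching Coefficient for two binary features with NumPy is shown below; the binary vectors are assumed for illustration.

import numpy as np

# Two hypothetical binary feature vectors
a = np.array([1, 0, 1, 1, 0, 0])
b = np.array([1, 1, 1, 0, 0, 0])

n11 = np.sum((a == 1) & (b == 1))  # both 1
n00 = np.sum((a == 0) & (b == 0))  # both 0

jaccard = n11 / (len(a) - n00)   # ignores 0-0 matches
smc = (n11 + n00) / len(a)       # counts both 1-1 and 0-0 matches

print(jaccard, 1 - jaccard)  # Jaccard index and Jaccard distance
print(smc)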
Measures of Feature redundancy
• Cosine Similarity: It is the most popular measure in text classification.
• It measures the angle between two vectors.
• Cosine similarity = 1 => x and y are completely similar (the vectors point in the same direction).
• Cosine similarity = 0 => x and y do not share any similarity (the vectors are orthogonal).

Cosine similarity between two features x and y is given by:
cos(θ) = (x · y) / (||x|| · ||y||), where x · y is the dot product and ||x||, ||y|| are the magnitudes of the vectors.

Measures of Feature redundancy
• Cosine Similarity:
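A minimal cosine similarity sketch with scikit-learn is given below; the two count vectors are assumed for illustration (e.g. word counts of two short documents).

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two hypothetical term-count vectors (assumed values)
x = np.array([[3, 0, 2, 1]])
y = np.array([[1, 1, 2, 0]])

# 1 => the vectors point in the same direction; 0 => they are orthogonal
print(cosine_similarity(x, y)[0, 0])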
Feature Selection process
Types of approaches for Feature Selection
Filter approach for Feature Selection:

• It is based on statistical measures like Pearson’s correlation, ANOVA,


Information Gain, Fisher score, Chi-square etc.
• No learning algorithm is employed to evaluate the goodness of the selected
features.
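A minimal sketch of the filter approach with scikit-learn's SelectKBest (chi-square test, k = 2 assumed) on the Iris dataset is given below.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Rank features with a statistical test (chi-square) and keep the top k;
# no learning algorithm is involved in the selection itself
X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)  # chi-square score per feature
print(X_selected.shape)  # (150, 2): only the 2 highest-scoring features kept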
Wrapper approach for Feature Selection:
• An inductive learning algorithm is employed to evaluate the goodness of the
selected features.
• In this approach, for every candidate subset, the learning model is trained and the
result is evaluated by running the learning algorithm.
• It is computationally expensive but generally has better performance.
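A minimal sketch of the wrapper approach using recursive feature elimination (RFE) with a logistic regression estimator is given below; the choice of estimator and of the number of features to keep are assumptions for illustration.

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# A learning algorithm is trained repeatedly to evaluate candidate feature
# subsets; RFE drops the weakest feature at each step until 2 remain
X, y = load_iris(return_X_y=True)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)

print(rfe.support_)  # boolean mask of selected features
print(rfe.ranking_)  # 1 = selected; larger numbers were eliminated earlier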
Hybrid approach for Feature Selection:
Embedded approach for Feature Selection:
