Unit 2: Feature Engineering
Feature Engineering:
• Introduction
• Feature Transformation
• Subset Selection
Modelling and Evaluation:
• Selecting a model
• Training model
• Model representation
• Evaluating and Improving model performance
Feature Engineering for Machine Learning
What is a Feature
• In the context of machine learning, a feature (also known as a variable or
attribute) is an individual measurable property or characteristic of a data point
that is used as input for a machine learning algorithm.
• Features can be numerical, categorical or text-based, and they represent
different aspects of the data that are relevant to the problem at hand.
• For example, in a dataset of housing prices, features could include the number
of bedrooms, the square footage, the location, and the age of the property.
• The choice and quality of features are critical in machine learning, as they can
greatly impact the accuracy and performance of the model.
Dataset Features: IRIS
Feature Creation
• Example (housing data): dividing price by square footage gives a new column showing cost per square foot. This new feature will help us understand a lot about our data.
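A minimal pandas sketch of this feature creation step; the toy values and the column names price and sqft are assumed for illustration:

```python
import pandas as pd

# Hypothetical housing data (columns assumed for illustration)
df = pd.DataFrame({
    "price": [300000, 450000, 150000],
    "sqft": [1500, 2000, 900],
})

# Feature creation: derive a new column from existing ones
df["cost_per_sqft"] = df["price"] / df["sqft"]
print(df)
```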
Benefits of Feature Creation:
1. Improves Model Performance: By providing additional and more relevant information to the model, feature creation can increase the accuracy and precision of the model.
2. Increases Model Robustness: By adding additional features, the model can become more robust to outliers and other anomalies.
3. Improves Model Interpretability: By creating new features, it can be easier to understand the model's predictions.
4. Increases Model Flexibility: By adding new features, the model can be made more flexible to handle different types of data.
Feature Construction is an essential activity
Feature Construction: Encoding nominal variables
Feature Construction: Encoding categorical(ordinal) variables
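A minimal sketch of ordinal encoding with pandas; the size categories and their ranking are assumed for illustration:

```python
import pandas as pd

# Toy data; the "size" column and its ordering are assumed examples
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Ordinal categories carry an order, so map them to ranked integers
order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(order)
print(df)
```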
Feature Construction: Encoding categorical variables
One-Hot Encoding:
• One-hot encoding is a technique used to transform categorical variables
into numerical values that can be used by machine learning models.
• In this technique, each category becomes a separate binary variable indicating its presence or absence.
• For example, consider a categorical variable “Colour” with three
categories: Red, Green, and Blue.
• One-hot encoding would transform this variable into three binary
variables: Colour_Red, Colour_Green, and Colour_Blue, where the value
of each variable would be 1 if the corresponding category is present and
0 otherwise.
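A minimal sketch of one-hot encoding the Colour example with pandas (scikit-learn's OneHotEncoder would work similarly):

```python
import pandas as pd

df = pd.DataFrame({"Colour": ["Red", "Green", "Blue", "Green"]})

# Each category becomes its own 0/1 indicator column
encoded = pd.get_dummies(df, columns=["Colour"], dtype=int)
print(encoded)  # columns: Colour_Blue, Colour_Green, Colour_Red
```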
Feature Construction: Encoding numeric to categorical (ordinal) variables
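A minimal sketch of converting a numeric feature to ordinal categories (binning) with pandas; the bin edges and labels are assumed for illustration:

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 42, 70])

# Discretize the numeric feature into ordered (ordinal) categories
age_group = pd.cut(ages,
                   bins=[0, 18, 40, 65, 120],
                   labels=["child", "young adult", "middle-aged", "senior"])
print(age_group)
```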
Feature Construction: Text-specific data (Bag-of-Words)
Document-Term Matrix
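A minimal bag-of-words sketch producing a document-term matrix with scikit-learn's CountVectorizer; the two toy documents are assumed for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log"]

vec = CountVectorizer()
dtm = vec.fit_transform(docs)        # sparse document-term matrix

print(vec.get_feature_names_out())   # terms = columns
print(dtm.toarray())                 # one row of word counts per document
```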
General Types of Feature Transformation:
1. Normalization: Rescaling the features to have a similar range, such as between 0 and 1, to prevent some features from dominating others.
2. Scaling: Rescaling the features to have a similar scale, such as having a standard deviation of 1, to make sure the model considers all features equally.
3. Encoding: Transforming categorical features into a numerical representation. Examples are one-hot encoding and label encoding.
4. Transformation: Transforming the features using mathematical operations to change the distribution or scale of the features. Examples are logarithmic, square root, and reciprocal transformations.
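A minimal sketch of the normalization, scaling, and log transformations above, with NumPy and scikit-learn; the toy values are assumed for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [10.0], [100.0], [1000.0]])

# Normalization: rescale to the [0, 1] range
print(MinMaxScaler().fit_transform(X).ravel())

# Scaling: zero mean, unit standard deviation
print(StandardScaler().fit_transform(X).ravel())

# Transformation: a log transform compresses the skewed scale
print(np.log10(X).ravel())
```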
Feature Extraction: Vectors and Vector Spaces
• A vector is a quantity having both magnitude and direction, and hence can determine the position of a point relative to another point in Euclidean space.
• A vector space is a set of vectors.
• Vector spaces have the property that every vector in the space can be represented as a linear combination of a smaller set of vectors, called basis vectors.
• So, any vector v in a vector space can be written as v = a₁u₁ + a₂u₂ + … + aₙuₙ, where a₁, …, aₙ are n scalars and u₁, …, uₙ are the basis vectors. For example, in two dimensions, (3, 5) = 3·(1, 0) + 5·(0, 1).
Principal Component Analysis
https://www.kdnuggets.com/2023/05/principal-component-analysis-pca-scikitlearn.html
Principal Component Analysis
https://www.geeksforgeeks.org/covariance-matrix/
Principal Component Analysis: Steps
*Note: Standardize the features of the dataset by removing the mean and scaling to unit variance, so that each feature has μ = 0 and σ = 1.
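A minimal PCA sketch on the Iris dataset with scikit-learn, standardizing first as the note above requires:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Standardize so each feature has mean 0 and std 1 (see note above)
X_std = StandardScaler().fit_transform(X)

# Project onto the top 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)  # variance captured per component
```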
Singular Value Decomposition (SVD)
• When the dataset is sparse (as in the case of text data), it is not advisable to remove the mean of a data attribute.
• In those situations, SVD is a better choice for dimensionality reduction than PCA.
https://machinelearningmastery.com/singular-value-decomposition-for-machine-learning/
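A minimal sketch of SVD-based reduction on sparse text features; scikit-learn's TruncatedSVD performs the reduction without centering the data (the toy documents are assumed for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are pets"]
dtm = CountVectorizer().fit_transform(docs)  # sparse matrix

# TruncatedSVD reduces sparse data without removing the mean
svd = TruncatedSVD(n_components=2)
reduced = svd.fit_transform(dtm)
print(reduced.shape)  # (3, 2)
```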
Linear Discriminant Analysis (LDA)
• LDA is another commonly used feature extraction technique, like PCA or SVD.
• The objective of LDA is to transform a dataset into a lower-dimensional feature space.
• Unlike PCA, the focus of LDA is not to capture the overall variability of the dataset.
• Instead, LDA focuses on class separability, i.e. separating the features based on class, so as to avoid overfitting of the machine learning model.
• LDA calculates eigenvalues and eigenvectors from the within-class (intra-class) and between-class (inter-class) scatter matrices.
https://www.statology.org/scree-plot-python/
Linear Discriminant Analysis (LDA)
Steps to be followed are given below:
1. Calculate the mean vectors for the individual classes.
2. Calculate the intra-class and inter-class scatter matrices.
3. Calculate the eigenvalues and eigenvectors for S_W⁻¹S_B, where S_W is the intra-class scatter matrix and S_B is the inter-class scatter matrix.
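A minimal LDA sketch on Iris with scikit-learn, which carries out these steps internally:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Unlike PCA/SVD, LDA uses the class labels y; it allows at most
# (number of classes - 1) components, i.e. 2 for the 3 Iris classes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)  # (150, 2)
```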
Feature Selection
https://medium.com/@ompramod9921/decision-trees-6a3c05e9cb82
Measures of Feature Redundancy
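One common redundancy measure is the pairwise correlation between features; a minimal sketch on the Iris features (the 0.9 threshold is an arbitrary choice for illustration):

```python
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).data

# Absolute pairwise Pearson correlation between features
corr = df.corr().abs()

# Flag highly correlated (potentially redundant) feature pairs
cols = corr.columns
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if corr.iloc[i, j] > 0.9:
            print(cols[i], "<->", cols[j], round(corr.iloc[i, j], 2))
```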