
UNIT-I, PART-2

Feature Engineering

Contents

• Introduction to Features
• Need for Feature Engineering
• Feature Selection
• Feature Extraction
• Discriminant Analysis (PCA, LDA)
Features / Dimensions

• ML algorithms work on data, and that data consists of samples.
• Each sample can have a large number of variables called features.
• Features are the predictor variables and are also referred to as the
dimensions of the data.
• Features are the measured values that describe the data; they directly
impact the predictive models we build and the results they produce.
What is Feature Engineering?
• Feature engineering is the process of extracting useful features from
raw data.
• The useful features are nothing but the features that contribute to
improving the performance of machine learning algorithms.
Steps in feature engineering process
• Feature engineering is not just an ad hoc practice. It includes the following steps:
Understanding the features – It is vital to understand what the data is and
what we aim to achieve from it. This means understanding the features,
identifying the target variable, and examining data types, missing/incorrect
values, data distribution, etc.

Feature improvement – The objective of this step is to handle missing
values and categorical features before feeding the data into a machine
learning algorithm, since most algorithms require the data to be numerical
and to contain no missing values. A sketch of this step is shown below.
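A minimal sketch of the feature improvement step, assuming a toy DataFrame whose column names and values are illustrative (not from the slides) and scikit-learn ≥ 1.2 for the `sparse_output` argument:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy data: a numeric column and a categorical column, each with a gap.
df = pd.DataFrame({
    "age":  [25.0, None, 47.0, 33.0],
    "city": ["Delhi", "Mumbai", None, "Delhi"],
})

# Fill missing numeric values with the column mean.
df["age"] = SimpleImputer(strategy="mean").fit_transform(df[["age"]]).ravel()

# Fill missing categories with the most frequent value, then one-hot encode
# so the algorithm receives purely numerical, gap-free input.
df["city"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()
city_encoded = OneHotEncoder(sparse_output=False).fit_transform(df[["city"]])

print(df["age"].tolist())   # no missing values remain
print(city_encoded)         # numeric 0/1 columns, one per city
```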
Steps in feature engineering process contd.
Feature selection – Not all the features collected may be useful for model
building.
Some features may be irrelevant or provide little information for designing a
data-driven solution; such irrelevant features have to be removed.
Often, the training data may have a few features that are redundant in the
context of other features, while only some of the features are significant in
improving model accuracy.
The feature selection process addresses these problems by automatically
selecting the subset of features that is most useful for the given problem.
Some of the feature selection techniques are (see the sketch after this list):
❑ Filter methods
❑ Wrapper methods
❑ Embedded methods
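A sketch of a filter method using scikit-learn's SelectKBest on a synthetic dataset (the dataset and parameter values are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 10 features, of which only 3 are informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Filter method: score each feature against the target with the ANOVA
# F-test and keep the k highest-scoring features.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)               # (200, 10) -> (200, 3)
print("kept features:", selector.get_support(indices=True))
```

By contrast, a wrapper method searches feature subsets with a model in the loop (e.g. sklearn.feature_selection.RFE), while an embedded method lets the model select features as part of training (e.g. L1-regularized models).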
Steps in feature engineering process contd.

• Feature transformation – Sometimes the data, in its original form, is not suitable
for training a machine learning algorithm because patterns cannot be recognized.
• By transforming the data from its original form into another form, better insights
can be obtained.
• The cleaned raw data is transformed to segregate the useful information from the
data, often leading to dimensionality reduction.
• The most commonly used feature transformation techniques, both applied in the
sketch below, are:
❑ Principal Component Analysis (PCA) and
❑ Linear Discriminant Analysis (LDA)
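A sketch of both techniques applied to the Iris dataset (the dataset choice is an illustrative assumption):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)   # 150 samples, 4 original features

# PCA is unsupervised: it projects onto directions of maximum variance.
X_pca = PCA(n_components=2).fit_transform(X)

# LDA is supervised: it projects onto directions that best separate classes.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X.shape, X_pca.shape, X_lda.shape)   # (150, 4) (150, 2) (150, 2)
```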



High-dimensional data in machine learning
• When the number of dimensions is high, the data is called high-dimensional
data.
• The performance of ML algorithms depends on the quality and quantity of
the data.
• It is important to provide an adequate amount of data to train the machine.
• The machine will have more patterns to learn from as the number of samples
increases.
Challenges with High-dimensional data
• The feature space becomes sparse as the number of dimensions grows.
• The figure below shows the feature space with five data points when
represented in 1-D, 2-D, and 3-D space: the same five points fill the 1-D
line far more densely than the 3-D volume.
Impact of sparsity on ML algorithms
• The sparsity in the data can be effectively analyzed using a proximity matrix.
• The proximity between two samples (X1, X2, …, XD) and (Y1, Y2, …, YD) in
D-dimensional space is calculated using the Euclidean distance:

d(X, Y) = \sqrt{\sum_{i=1}^{D} (X_i - Y_i)^2}

• It can be observed that when the dimensionality of the data increases, each
dimension adds a non-negative term to the sum in the above proximity equation.
Consequently, the distance between the samples increases with the number of
dimensions, as the small demo below illustrates.
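A small numeric demo of the proximity equation (the random data is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
for D in (1, 10, 100, 1000):
    X, Y = rng.random(D), rng.random(D)     # two random samples in D dimensions
    dist = np.sqrt(np.sum((X - Y) ** 2))    # the proximity equation above
    print(f"D = {D:4d}   distance = {dist:.2f}")
```

Each added dimension contributes another non-negative term under the square root, so the printed distances grow steadily with D.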
Impact of sparsity on ML algorithms contd.
• If sparse data is used to train the ML model, there is a risk of producing a
model that is very good at predicting the target variable on the training data
but fails miserably on new data.
• With sparse data, a model may learn patterns where none exist. This leads to
overfitting.
How to overcome the curse of dimensionality?
Increasing the number of samples to make the dataset denser is one way to
overcome sparsity.
By increasing the amount of training data, the sparsity is reduced and the data
becomes denser, as shown in the figure.

Fig: Data points in different dimensional space



How to overcome the curse of dimensionality? contd.
• Making the dataset denser is practically impossible, as we would need to add
data samples exponentially to keep sparsity in check.
• For example, to maintain the uniform average spacing that 10 data points give
in one dimension, the number of samples required grows exponentially as
10^1, 10^2, ..., 10^D.
• Instead of increasing the sample size, an alternative solution is to reduce the
dimensionality of the data.
• The process of transforming data from a high-dimensional space into a
low-dimensional space, such that the low-dimensional feature space provides
approximately the same information as the original data, is called
dimensionality reduction (see the sketch below).
• Hence, dimensionality reduction is a key aspect of feature engineering.
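A sketch of the idea that a low-dimensional space can retain most of the original information, using PCA's explained variance ratio (the digits dataset is an illustrative assumption):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 1797 samples, 64 dimensions each

# explained_variance_ratio_ reports the share of the original variance
# (information, roughly speaking) that each retained component keeps.
pca = PCA(n_components=10).fit(X)
kept = pca.explained_variance_ratio_.sum()
print(f"variance retained by 10 of 64 dimensions: {kept:.0%}")
```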
Why is dimensionality reduction important in ML?
• Dimensionality reduction is important as it:
 improves model performance, and
 reduces the time required for model building.
• For example, suppose an organization wants to use a machine learning
technique to predict employees' salaries from a dataset of employee attributes.
Types and methods of dimensionality reduction
• Feature selection: A dimensionality reduction technique where useful
features are selected and irrelevant/redundant features are removed.
• Feature extraction: A technique that transforms a high-dimensional
space into a lower-dimensional space, often by combining the original
features to create new ones. The sketch below contrasts the two.
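A brief contrast of the two types on toy data (the array values are illustrative assumptions):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA

X = np.array([[1.0, 0.0, 2.0],
              [2.0, 0.0, 1.0],
              [3.0, 0.0, 4.0]])

# Selection keeps a subset of the original columns unchanged
# (here, the zero-variance middle column is dropped).
X_sel = VarianceThreshold().fit_transform(X)

# Extraction builds new features as combinations of all the originals.
X_ext = PCA(n_components=2).fit_transform(X)

print(X_sel)   # original columns 0 and 2, values unchanged
print(X_ext)   # new transformed values not present in the original data
```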
QUIZ
Which of the following techniques are part of feature engineering?
A. Exclusion of the missing values
B. Exclusion of the noisy data
C. Removal of the redundant features
D. All of the above
QUIZ
• Which of the following factors are influenced by dimensionality
reduction?
A. Increase in the classification accuracy
B. Reduction in the computation time
C. Minimization of the space requirement
D. All of the above
QUIZ
• From the options given below, choose the one that identifies a
characteristic of the curse of dimensionality phenomenon.

A. Increase in the accuracy of the prediction results with an increase in the
number of features
B. Decrease in the computational power needed as the dimensionality of the data
increases
C. Decrease in the prediction accuracy due to an increase in the dimensionality
of the data
D. Better visualization of the data due to its high dimensionality
QUIZ
• Which of the following statements about dimensionality reduction is TRUE?

A. The feature extraction technique modifies the original feature space
B. Since some of the features are removed in the feature selection technique,
the prediction performance of the machine is reduced
C. As the feature extraction technique transforms the original feature space
into a new feature space, insights cannot be extracted from the transformed
data
