
UNIT-I, PART-2

Feature Engineering

Contents

• Introduction to Features
• Need for Feature Engineering
• Feature Selection
• Feature Extraction
• Discriminant Analysis (PCA, LDA)
Features / Dimensions

• ML algorithms work on data, and that data consists of samples.
• Each sample can have a large number of variables called features.
• Features are the predictor variables and are also referred to as the
dimensions of the data.
• Features are the measured values that describe the data; they directly
impact the predictive models we build and the results they produce.
What is Feature Engineering?
• Feature engineering is the process of extracting useful features from
raw data.
• The useful features are nothing but the features that contribute to
improving the performance of machine learning algorithms.
Steps in feature engineering process
• Feature engineering is not just an ad hoc practice. It includes the following steps:
Understanding the features – It is vital to understand what the data is and
what we aim to achieve from it. This means understanding the features,
identifying the target variable, and examining data types, missing/incorrect
values, data distribution, etc.

Feature improvement – The objective of this step is to handle missing
values and categorical features before feeding the data into a machine
learning algorithm, since most algorithms require the data to be numerical
and to contain no missing values. A sketch of this step is shown below.
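A minimal sketch of the feature improvement step, assuming a toy DataFrame whose column names and values are illustrative (not from the slides) and scikit-learn ≥ 1.2 for the `sparse_output` argument:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy data: a numeric column and a categorical column, each with a gap.
df = pd.DataFrame({
    "age":  [25.0, None, 47.0, 33.0],
    "city": ["Delhi", "Mumbai", None, "Delhi"],
})

# Fill missing numeric values with the column mean.
df["age"] = SimpleImputer(strategy="mean").fit_transform(df[["age"]]).ravel()

# Fill missing categories with the most frequent value, then one-hot encode
# so the algorithm receives purely numerical, gap-free input.
df["city"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()
city_encoded = OneHotEncoder(sparse_output=False).fit_transform(df[["city"]])

print(df["age"].tolist())   # no missing values remain
print(city_encoded)         # numeric 0/1 columns, one per city
```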
Steps in feature engineering process contd.
Feature selection – Not all the features collected may be useful for model
building.
Some features may be irrelevant or provide little information for designing a
data-driven solution; such irrelevant features have to be removed.
Often, the training data may have a few features that are redundant in the
context of other features, while only some of the features are significant in
improving model accuracy.
The feature selection process addresses these problems by automatically
selecting the subset of features that is most useful for the given problem.
Some of the feature selection techniques are (see the sketch after this list):
❑ Filter methods
❑ Wrapper methods
❑ Embedded methods
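A sketch of a filter method using scikit-learn's SelectKBest on a synthetic dataset (the dataset and parameter values are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 10 features, of which only 3 are informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Filter method: score each feature against the target with the ANOVA
# F-test and keep the k highest-scoring features.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)               # (200, 10) -> (200, 3)
print("kept features:", selector.get_support(indices=True))
```

By contrast, a wrapper method searches feature subsets with a model in the loop (e.g. sklearn.feature_selection.RFE), while an embedded method lets the model select features as part of training (e.g. L1-regularized models).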
Steps in feature engineering process contd.

• Feature transformation – Sometimes the data, in its original form, is not suitable
for training a machine learning algorithm because patterns cannot be recognized.
• By transforming the data from its original form into another form, better insights
can be obtained.
• The cleaned raw data is transformed to segregate the useful information from the
data, often leading to dimensionality reduction.
• The most commonly used feature transformation techniques, both applied in the
sketch below, are:
❑ Principal Component Analysis (PCA) and
❑ Linear Discriminant Analysis (LDA)
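A sketch of both techniques applied to the Iris dataset (the dataset choice is an illustrative assumption):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)   # 150 samples, 4 original features

# PCA is unsupervised: it projects onto directions of maximum variance.
X_pca = PCA(n_components=2).fit_transform(X)

# LDA is supervised: it projects onto directions that best separate classes.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X.shape, X_pca.shape, X_lda.shape)   # (150, 4) (150, 2) (150, 2)
```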



High-dimensional data in machine learning
• When the number of dimensions is high, the data is called high-dimensional
data.
• The performance of ML algorithms depends on the quality and quantity of
the data.
• It is important to provide an adequate amount of data to train the machine.
• The machine will have more patterns to learn from as the number of samples
increases.
Challenges with High-dimensional data
• The feature space becomes sparse as the number of dimensions grows.
• The figure below shows the feature space with five data points when
represented in 1-D, 2-D, and 3-D space: the same five points fill the 1-D
line far more densely than the 3-D volume.
Impact of sparsity on ML algorithms
• The sparsity in the data can be effectively analyzed using a proximity matrix.
• The proximity between two samples (X1, X2, …, XD) and (Y1, Y2, …, YD) in
D-dimensional space is calculated using the Euclidean distance:

d(X, Y) = \sqrt{\sum_{i=1}^{D} (X_i - Y_i)^2}

• It can be observed that when the dimensionality of the data increases, each
dimension adds a non-negative term to the sum in the above proximity equation.
Consequently, the distance between the samples increases with the number of
dimensions, as the small demo below illustrates.
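A small numeric demo of the proximity equation (the random data is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
for D in (1, 10, 100, 1000):
    X, Y = rng.random(D), rng.random(D)     # two random samples in D dimensions
    dist = np.sqrt(np.sum((X - Y) ** 2))    # the proximity equation above
    print(f"D = {D:4d}   distance = {dist:.2f}")
```

Each added dimension contributes another non-negative term under the square root, so the printed distances grow steadily with D.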
Impact of sparsity on ML algorithms contd.
• If sparse data is used to train the ML model, there is a risk of producing a
model that is very good at predicting the target variable on the training data
but fails miserably on new data.
• With sparse data, a model may learn patterns where none exist. This leads to
overfitting.
How to overcome the curse of dimensionality?
Increasing the number of samples to make the dataset denser is one way to
overcome sparsity.
By increasing the amount of training data, the sparsity is reduced and the data
becomes denser, as shown in the figure.

Fig: Data points in different dimensional space



How to overcome the curse of dimensionality? contd.
• Making the dataset denser is practically impossible, as we would need to add
data samples exponentially to keep sparsity in check.
• For example, to maintain the uniform average spacing that 10 data points give
in one dimension, the number of samples required grows exponentially as
10^1, 10^2, ..., 10^D.
• Instead of increasing the sample size, an alternative solution is to reduce the
dimensionality of the data.
• The process of transforming data from a high-dimensional space into a
low-dimensional space, such that the low-dimensional feature space provides
approximately the same information as the original data, is called
dimensionality reduction (see the sketch below).
• Hence, dimensionality reduction is a key aspect of feature engineering.
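A sketch of the idea that a low-dimensional space can retain most of the original information, using PCA's explained variance ratio (the digits dataset is an illustrative assumption):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 1797 samples, 64 dimensions each

# explained_variance_ratio_ reports the share of the original variance
# (information, roughly speaking) that each retained component keeps.
pca = PCA(n_components=10).fit(X)
kept = pca.explained_variance_ratio_.sum()
print(f"variance retained by 10 of 64 dimensions: {kept:.0%}")
```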
Why is dimensionality reduction important in ML?
• Dimensionality reduction is important as it:
 improves model performance, and
 reduces the time required for model building.
• For example, suppose an organization wants to use a machine learning
technique to predict employees' salaries from a dataset of employee attributes.
Types and methods of dimensionality reduction
• Feature selection: A dimensionality reduction technique where useful
features are selected and irrelevant/redundant features are removed.
• Feature extraction: A technique that transforms a high-dimensional
space into a lower-dimensional space, often by combining the original
features to create new ones. The sketch below contrasts the two.
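A brief contrast of the two types on toy data (the array values are illustrative assumptions):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA

X = np.array([[1.0, 0.0, 2.0],
              [2.0, 0.0, 1.0],
              [3.0, 0.0, 4.0]])

# Selection keeps a subset of the original columns unchanged
# (here, the zero-variance middle column is dropped).
X_sel = VarianceThreshold().fit_transform(X)

# Extraction builds new features as combinations of all the originals.
X_ext = PCA(n_components=2).fit_transform(X)

print(X_sel)   # original columns 0 and 2, values unchanged
print(X_ext)   # new transformed values not present in the original data
```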
QUIZ
Which of the following techniques are part of feature engineering?
A. Exclusion of the missing values
B. Exclusion of the noisy data
C. Removal of the redundant features
D. All of the above
QUIZ
• Which of the following factors are influenced by dimensionality
reduction?
A. Increase in the classification accuracy
B. Reduction in the computation time
C. Minimization of the space requirement
D. All of the above
QUIZ
• From the options given below, choose the one that identifies a
characteristic of the curse of dimensionality phenomenon.

A. Increase in the accuracy of the prediction results with an increase in the
number of features
B. Decrease in the computational power needed as the dimensionality of the data
increases
C. Decrease in the prediction accuracy due to an increase in the dimensionality
of the data
D. Better visualization of the data due to its high dimensionality
QUIZ
• Which of the following statements about dimensionality reduction is TRUE?

A. The feature extraction technique modifies the original feature space
B. Since some of the features are removed in the feature selection technique,
the prediction performance of the machine is reduced
C. As the feature extraction technique transforms the original feature space
into a new feature space, insights cannot be extracted from the transformed
data
