
MODULE 4



Feature Engineering
• Feature engineering is the pre-processing step of machine learning that transforms raw data into features suitable for building a predictive model.
• All machine learning algorithms take input data to generate an output.
• The input data is typically in tabular form, consisting of rows (instances or observations) and columns (variables or attributes); these attributes are often known as features.
• Example: an image is an instance in computer vision, but a line in the image could be the feature. In NLP, a document can be an observation, and the word count could be the feature.
• A feature is an attribute: an individual measurable property or characteristic of a phenomenon.



• The feature engineering process selects the most useful predictor variables for the model.

• Feature engineering in ML mainly comprises four processes: Feature Creation, Transformations, Feature Extraction, and Feature Selection.
1. Feature Creation:
• Feature creation is finding the most useful variables to be used in a predictive model.
• The process is subjective and requires human creativity and intervention.
• New features are created by combining existing features through operations such as addition, subtraction, and ratios, and these new features offer great flexibility.

2. Transformation:
• Transformation involves adjusting predictor variables to improve the accuracy and performance of the model.
• It ensures that all the variables are on the same scale, making the model easier to understand.
• It ensures that all the features are within an acceptable range, avoiding computational errors.
3. Feature Extraction:
• Feature extraction is an automated feature engineering process that
generates new variables by extracting them from the raw data.
• The aim is to reduce the volume of data so that it can be easily used and
managed for data modelling.
• Feature extraction methods include cluster analysis, text analytics, edge detection algorithms, and principal component analysis (PCA).

4. Feature Selection:
• Feature selection is a way of selecting the subset of the most relevant features from the original feature set by removing redundant, irrelevant, or noisy features.
• This is done in order to reduce overfitting in the model and improve its performance.
Feature Engineering Techniques
1. Imputation:
• Imputation deals with handling missing values in data.
• Deleting records with missing values is one way of dealing with the missing-data issue, but it can mean losing out on a chunk of valuable data. This is where imputation helps.
• Data imputation can be classified into two types:
 Categorical Imputation: Missing categorical values are generally replaced by the most commonly occurring value (mode) of the feature.
 Numerical Imputation: Missing numerical values are generally replaced by the mean or median of the corresponding feature.



• Example: Categorical Imputation



• Example: Numerical Imputation
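
As an illustration, a minimal pandas sketch of both imputation types (the column names and values here are hypothetical):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "color": ["red", "blue", None, "red", "red"],  # categorical, one missing value
    "age": [25, np.nan, 31, 22, np.nan],           # numerical, two missing values
})

# Categorical imputation: fill with the mode (most frequent category)
df["color"] = df["color"].fillna(df["color"].mode()[0])

# Numerical imputation: fill with the median (the mean works similarly)
df["age"] = df["age"].fillna(df["age"].median())

print(df)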



2. Discretization:
• Discretization involves taking a set of data values and grouping them in some logical fashion into bins (or buckets).
• Binning can apply to numerical values as well as to categorical values.



• The grouping of data can be done as follows:
 Grouping of equal intervals (equal width)
 Grouping based on equal frequencies (of observations in the bin)
• Example:
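A minimal pandas sketch of both grouping strategies (the ages below are made up for illustration):

import pandas as pd

ages = pd.Series([5, 17, 24, 31, 45, 52, 68, 79])

# Equal-width binning: each bin spans the same range of values
equal_width = pd.cut(ages, bins=4)

# Equal-frequency binning: each bin holds (roughly) the same number of observations
equal_freq = pd.qcut(ages, q=4)

print(pd.DataFrame({"age": ages, "equal_width": equal_width, "equal_freq": equal_freq}))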



3. Categorical encoding:
• Categorical encoding is a technique used to encode categorical features into numerical values, which are usually simpler for an algorithm to understand.
• This can be done by:
(i) Integer Encoding
(ii) One-Hot Encoding



(i) Integer Encoding:
• Integer encoding consists of replacing the categories with digits from 1 to n (or 0 to n-1), where n is the number of distinct categories of the variable.
• Each unique category is assigned an integer value.
• This method is also called label encoding.
• It is used when an ordinal relationship exists among the categories of the variable.



(ii) One-Hot Encoding:
• For categorical variables where no ordinal relationship exists, one-hot encoding (OHE) can be applied.
• Here a new binary variable is added for each unique category.
• In the “color” variable example, there are 3 categories: red, green and blue.
• Therefore 3 binary variables are needed: ‘color_red’, ‘color_green’ and ‘color_blue’.
• A “1” value is placed in the binary variable for the observation’s color and “0” values in the other binary variables.
• The binary variables are often called “dummy variables” or “indicator variables”. Both encodings are sketched below.
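
A minimal pandas sketch of both encodings on the “color” example:

import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green", "red"]})

# Integer (label) encoding: map each category to an integer code
df["color_int"] = df["color"].astype("category").cat.codes

# One-hot encoding: one binary indicator column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

print(pd.concat([df, one_hot], axis=1))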
4. Feature Splitting:
• Feature splitting is the process of separating a feature into two or more parts to make new features.
• This technique helps algorithms better understand and learn the patterns in the dataset.
• Example 1: Sale Date is split into year, month and day.

• Example 2: Time stamp is split into 6 different attributes.
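
A plausible pandas sketch of splitting a timestamp into six attributes (the specific six chosen here are an assumption):

import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime(["2023-07-14 09:30:00", "2024-01-02 18:05:00"])})

# Split one timestamp feature into several simpler features
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
df["day"] = df["timestamp"].dt.day
df["hour"] = df["timestamp"].dt.hour
df["minute"] = df["timestamp"].dt.minute
df["weekday"] = df["timestamp"].dt.dayofweek

print(df)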



5. Handling outliers:
• Outliers are unusually high or low values in the dataset that are unlikely to occur in normal scenarios.
• Since outliers can adversely affect model predictions, they must be handled appropriately.
• Methods of handling outliers include (see the sketch after this list):
 Removal: The records containing outliers are removed from the dataset. However, the presence of outliers across multiple variables could result in losing out on a large portion of the data.
 Replacing values: The outliers can alternatively be treated as missing values and replaced using appropriate imputation.
 Capping: The maximum and minimum values are capped and replaced with a chosen boundary value.
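
A minimal pandas sketch of the three methods, flagging outliers with the common IQR rule (the rule choice is an assumption; the slides do not specify one):

import pandas as pd

s = pd.Series([12, 15, 14, 13, 110, 16, 15, -40])

# Flag outliers as points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

removed = s[(s >= lower) & (s <= upper)]                         # removal
replaced = s.mask((s < lower) | (s > upper)).fillna(s.median())  # replace via imputation
capped = s.clip(lower=lower, upper=upper)                        # capping

print(removed, replaced, capped, sep="\n\n")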
6. Variable transformations:
• Variable transformation techniques can help normalize skewed data.
• Skewness is a measure of the asymmetry of a distribution.
• A distribution is asymmetrical when its left and right sides are not mirror images.
• Common variable transformations are the logarithmic transformation, square root transformation and Box-Cox transformation, which, when applied to heavy-tailed distributions, produce less skewed values.
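
A minimal sketch comparing the three transformations on synthetic right-skewed data (the log-normal sample is an arbitrary choice for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # heavily right-skewed sample

log_x = np.log(x)                # logarithmic transformation
sqrt_x = np.sqrt(x)              # square root transformation
boxcox_x, lam = stats.boxcox(x)  # Box-Cox estimates the best power parameter

for name, arr in [("raw", x), ("log", log_x), ("sqrt", sqrt_x), ("box-cox", boxcox_x)]:
    print(f"{name:8s} skewness = {stats.skew(arr):+.2f}")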
7. Scaling:
• Feature scaling is a method used to normalize the range of independent
variables or features of data.
• The commonly used processes of scaling include:
 Min-Max Scaling/Normalization: This process involves the rescaling of
all values in a feature in the range 0 to 1. In other words, the minimum
value in the original range will take the value 0, the maximum value will
take 1 and the rest of the values in between the two extremes will be
appropriately scaled.

 Standardization/Variance scaling: The mean is subtracted from every data point and the result is divided by the standard deviation, giving a distribution with a mean of 0 and a variance of 1.
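
A minimal scikit-learn sketch of both scaling processes on a toy column:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

# Min-max scaling: (x - min) / (max - min), rescaling values into [0, 1]
minmax = MinMaxScaler().fit_transform(X)

# Standardization: (x - mean) / std, giving mean 0 and variance 1
standard = StandardScaler().fit_transform(X)

print(np.hstack([X, minmax, standard]))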
8. Creating features:
• Feature creation involves deriving new features from existing ones.
• This can be done by simple mathematical operations such as aggregations to obtain the mean, median, mode, sum, or difference, and even the product of two values.
• Although derived directly from the given data, these features, when carefully chosen to relate to the target, can have an impact on model performance.
• Example:
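A minimal pandas sketch with a hypothetical table; the derived columns are made up to show the idea:

import pandas as pd

df = pd.DataFrame({
    "length_m": [2.0, 3.5, 4.0],
    "width_m": [1.0, 1.5, 2.0],
    "price": [100.0, 260.0, 400.0],
})

# Derive new features from existing ones
df["area_m2"] = df["length_m"] * df["width_m"]    # product of two values
df["price_per_m2"] = df["price"] / df["area_m2"]  # ratio that may relate to the target

print(df)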

Introduction to ML



a) Evolution of Machine Learning
• The term Machine Learning (ML) was first used by Arthur Samuel, one of
the pioneers of Artificial Intelligence at IBM, in 1959.
• Machine learning (ML) is an important tool for leveraging technologies built around artificial intelligence.
• Because of its learning and decision-making abilities, machine learning is often
referred to as AI, though, in reality, it is a subdivision of AI.
• Until the late 1970s, it was a part of AI’s evolution. Then, it branched off to
evolve on its own.
• Machine learning is now responsible for some of the most significant
advancements in technology.



b) What is Machine Learning (ML)?

• Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy.

• Machine learning is an application of AI that provides systems the ability to learn on their own and improve from experience without being programmed externally.

• Machine learning was defined by Stanford University as “the science of getting computers to act without being explicitly programmed.”
• Traditional programming is a manual process: the programmer creates the program. Programming aims to answer a problem using a predefined set of rules or logic.

• In machine learning, the algorithm automatically formulates the rules from the data. Machine learning seeks to construct a model or logic for the problem by analyzing its input data and answers.
c) Types of ML
• Based on the methods and way of learning, machine learning is divided into four types.

1. Supervised Machine Learning Algorithms:
• The primary purpose of supervised learning is to learn from labeled sample data in order to make predictions about unavailable, future or unseen data.
• In supervised learning there are input variables (x) and an output variable (Y), and an algorithm is used to learn the mapping function from the input to the output: Y = f(x).
• The goal is to approximate the mapping function so well that when new input data (x) arrives, the machine can predict the output variable (Y) for that data.
• Supervised machine learning includes two major processes: classification and regression.
 Classification is the process of categorizing a set of data into classes (yes/no, true/false, 0/1, yes/no/maybe). There are various types of classification problems, such as binary classification, multi-class classification and multi-label classification. Examples of classification problems are: spam filtering, image classification, sentiment analysis, classifying cancerous and non-cancerous tumors, customer churn prediction etc.

 Regression is the process of identifying patterns and calculating predictions of continuous outcomes. The different regression analysis techniques are used when the target and independent variables show a linear or non-linear relationship with each other and the target variable contains continuous values. Examples of regression problems are: predicting house prices, predicting a month’s sales, predicting a person’s age, predicting rainfall, determining market trends etc. A minimal sketch of both processes follows.
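
A minimal scikit-learn sketch of the two supervised processes on synthetic data (the datasets and models here are arbitrary illustrative choices):

from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: learn a mapping from features to discrete classes (0/1)
Xc, yc = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(Xc, yc)
print("predicted class:", clf.predict(Xc[:1]))

# Regression: learn a mapping from features to a continuous target
Xr, yr = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
reg = LinearRegression().fit(Xr, yr)
print("predicted value:", reg.predict(Xr[:1]))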
• The most widely used supervised algorithms are:
 Linear Regression
 Logistic Regression
 Random Forest
 Boosting algorithms
 Support Vector Machines
 Decision Trees
 Naive Bayes
 Nearest Neighbor.



2. Unsupervised Machine Learning Algorithms:
• Unsupervised learning feeds on unlabeled data.
• In unsupervised machine learning algorithms, the desired results are unknown and yet to be defined.
• Unsupervised learning algorithms apply the following techniques to describe the data (a sketch follows the algorithm list below):
 Clustering: An exploration of the data used to segment it into meaningful groups (i.e., clusters) based on their internal patterns, without any prior knowledge of group membership. Groups are defined by the similarity of individual data objects to one another and by their dissimilarity from the rest. Examples: identifying fraudulent or criminal activity, classifying network traffic, identifying fake news etc.

 Dimensionality reduction: Most of the time, there is a lot of noise in the incoming data. Machine learning algorithms use dimensionality reduction to remove this noise while distilling the relevant information. Examples: image compression, classifying a database full of emails into “spam” and “not spam”.
• The most widely used unsupervised algorithms are:
 K-means clustering
 PCA (Principal Component Analysis)
 Association rule.
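
A minimal scikit-learn sketch of clustering and dimensionality reduction on synthetic data (the data and parameters are arbitrary illustrative choices):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # 100 unlabeled points with 5 features

# Clustering: group the unlabeled points into 3 clusters
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Dimensionality reduction: project 5 features down to 2 principal components
X2 = PCA(n_components=2).fit_transform(X)

print(labels[:10], X2.shape)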

3. Semi-supervised Machine Learning Algorithms:
• Semi-supervised learning algorithms represent a middle ground between supervised and unsupervised algorithms.
• In this type of learning, the algorithm is trained on a combination of labeled and unlabeled data.
• This combination will contain a very small amount of labeled data and a very large amount of unlabeled data.
• The basic procedure is that the programmer first clusters similar data using an unsupervised learning algorithm and then uses the existing labeled data to label the rest of the unlabeled data.
• Examples: text document classification, speech analysis etc.
• One popular semi-supervised ML algorithm is the Label Propagation algorithm, sketched below.
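
A minimal scikit-learn sketch of Label Propagation, masking most labels to mimic the semi-supervised setting (the data and masking ratio are illustrative assumptions):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelPropagation

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Hide ~90% of the labels: unlabeled points are marked with -1
y_partial = y.copy()
rng = np.random.default_rng(0)
y_partial[rng.random(len(y)) < 0.9] = -1

# Label Propagation spreads the few known labels to the unlabeled points
model = LabelPropagation().fit(X, y_partial)
print("accuracy on true labels:", model.score(X, y))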
4. Reinforcement Machine Learning Algorithms:
• Reinforcement ML employs a technique called exploration/exploitation.
• It is an iterative process: an action takes place, the consequences are observed, and the next action considers the results of the first action.
• Using this approach, the machine is trained to make specific decisions.
• It works this way: the machine is exposed to an environment where it trains itself continually using trial and error. The machine learns from past experience and tries to capture the best possible knowledge to make accurate business decisions.
• Examples: video games, self-driving cars etc. (A minimal Q-learning sketch follows the algorithm list below.)
• Most common reinforcement learning algorithms include:
 Q-Learning
 Temporal Difference (TD)
 Monte-Carlo Tree Search (MCTS)
 Asynchronous Actor-Critic Agents (A3C).
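
A minimal Q-learning sketch on a toy, hypothetical environment: states 0 to 4 on a line, where reaching state 4 yields a reward of 1 (the environment and hyperparameters are illustrative assumptions):

import numpy as np

n_states, n_actions = 5, 2  # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.3  # learning rate, discount, exploration rate
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    while s != 4:
        # Exploration/exploitation: random action with probability epsilon
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else min(4, s + 1)
        r = 1.0 if s_next == 4 else 0.0
        # Q-learning update: move Q(s, a) toward reward + discounted best future value
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q)  # learned action values; "move right" should dominate in every state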
