
UNIT – II:

Feature Engineering: Feature, feature engineering, feature transformation: feature construction, feature extraction; feature subset selection: issues in high dimensional data, key drivers of feature selection – feature relevance and redundancy, measures of feature relevance and redundancy, overall feature selection process, feature selection approaches.
Objectives

• Feature engineering is a critical allied task that we need to perform to make learning more effective.

• It has three key components – feature construction, feature selection, and feature transformation.
INTRODUCTION

• Modelling alone doesn't help us to realize the effectiveness of machine learning as a problem-solving tool.

• If a specific model is not effective, we can work at different levels to boost its effectiveness.

• Before applying machine learning to solve a problem, there are certain preparatory steps.

• The key aspect which plays a critical role in solving any machine learning problem is feature engineering.

• It is a critical preparatory step and deals with the features of the data set, which form an important input of any machine learning problem – be it supervised or unsupervised learning.

• It is responsible for taking raw input data and converting it into well-aligned features which are ready to be used by the machine learning models.
Structured vs Unstructured data
Structured data is highly specific and is stored in a predefined format.
Ex: tabular data

Unstructured data is raw, unorganized data which doesn't follow a specific format or hierarchy.
Examples: text data from social networks, e.g. Twitter, Facebook, etc., or data from server logs, etc.
Feature:

• A feature is an attribute of a data set that is used in a machine learning process.

• Only the attributes which are meaningful to the machine learning problem are called features.

• Selection of the subset of features which are meaningful for machine learning is a sub-area of feature engineering which draws a lot of research interest.

• The features in a data set are also called its dimensions. So a data set having 'n' features is called an n-dimensional data set.

Example: Iris dataset

• It has five attributes or features, namely Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species.

• The 'Species' feature represents the class variable, while the remaining four are the predictor features.
Feature engineering

• Feature engineering is a process of translating a data set into features such that these features are able to represent the data set more effectively and result in better learning performance.

It has two major elements:
1. Feature transformation
2. Feature subset selection
1. Feature transformation

• It transforms the data (structured or unstructured) into a new set of features.

• There are two variants of feature transformation:
a. feature construction
b. feature extraction

Both are sometimes known as feature discovery.

a. The feature construction process discovers missing information about the relationships between features and augments the feature space by creating additional features.

• If there are 'n' features or dimensions in a data set, after feature construction 'm' more features or dimensions may get added.

b. Feature extraction is the process of extracting or creating a new set of features from the original set of features using some functional mapping.

2. Feature subset selection

• No new feature is generated.

• The objective of feature selection is to derive a subset of features from the full feature set which is most meaningful in the context of a specific machine learning problem.

• So, essentially the job of feature selection is to derive a subset Fj = (F1, F2, …, Fm) of the full feature set Fi = (F1, F2, …, Fn), where m < n, such that Fj is most meaningful and gets the best result for a machine learning problem.
4.2 FEATURE TRANSFORMATION

• It is an important data preprocessing prerequisite for the success of any machine learning model.

• All available attributes of the dataset are used as features, and then the important features are identified based on the model.

• There will be data with different magnitudes, and we have to scale different features down to the same range of magnitude.

• Most algorithms will give more importance to the features with higher magnitude rather than giving the same importance to all features.

• This leads to wrong predictions and faulty models.

• In case a model has to be trained to classify a document as spam or not spam, the document is represented as a bag of words.

• The feature space then contains all unique words occurring across all documents, which leads to a feature space of a few hundred thousand features.

• If we start including bigrams or trigrams along with words, the count of features will run into millions.

(An 'n-gram' is a contiguous set of n items, for example words in a text block or document, used in Natural Language Processing.)
• To deal with this problem, feature transformation is used as an effective tool for dimensionality reduction.

• Broadly, there are two distinct goals of feature transformation:

• Achieving the best reconstruction of the original features in the data set
• Achieving the highest efficiency in the learning task
Feature construction

Transforms a given set of input features to generate a new set of more powerful features.

Example:
Data set: Real Estate dataset
Attributes: apartment length, apartment breadth, and price of the apartment.

It is convenient and makes more sense to use the area of the apartment instead of length and breadth.
Situations where feature construction is an essential activity:

1. when features have categorical values and machine learning needs numeric value inputs

2. when features have numeric (continuous) values and need to be converted to ordinal values

3. when text-specific feature construction needs to be done.
1. Encoding categorical (nominal) variables

Athletes dataset:

• Any machine learning algorithm requires numerical figures to learn from.

• There are three features – City of origin, Parents athlete, and Chance of win – which are categorical in nature and cannot be used directly by any machine learning task.

• Feature construction can be used to create new dummy features.
One-Hot Encoding
One-Hot Encoding is used to convert categorical variables into binary columns, where each category is represented by a separate column with a 1 (indicating presence) or 0 (indicating absence). This method works well when there are a few unique categories in the variable.
This method is not preferred if the input data is of ordinal type.
import pandas as pd

# Sample data
data = {'Product Category': ['Electronics', 'Clothing', 'Furniture', 'Electronics',
                             'Furniture', 'Clothing'],
        'Sales': [500, 300, 700, 600, 800, 350]}

df = pd.DataFrame(data)

# One-Hot Encoding using pandas get_dummies
one_hot_encoded_df = pd.get_dummies(df, columns=['Product Category'])

print("One-Hot Encoding Result:")
print(one_hot_encoded_df)
Target Encoding is generally used for nominal categorical features (features with no inherent order). When applied to ordinal features, it may not preserve the natural order of the categories, which could lead to suboptimal performance for some models.

Target Encoding Process:
1. Calculate the mean of the target for each category. For example:
   a. The mean target (price) for Electronics = (500 + 600) / 2 = 550
   b. The mean target (price) for Clothing = (300 + 350) / 2 = 325
   c. The mean target (price) for Furniture = (700 + 800) / 2 = 750
2. Replace the categories with the corresponding mean values.
import pandas as pd

# Sample data
data = {'Product Category': ['Electronics', 'Clothing', 'Furniture',
                             'Electronics', 'Furniture', 'Clothing'],
        'Price': [500, 300, 700, 600, 800, 350]}
df = pd.DataFrame(data)

# Calculate the mean of the target for each category
mean_target = df.groupby('Product Category')['Price'].mean()

# Map the mean target to the original dataset
df['Encoded Product Category'] = df['Product Category'].map(mean_target)

# Display the encoded data
print(df[['Product Category', 'Price', 'Encoded Product Category']])
Encoding categorical (ordinal) variables

The grade is an ordinal variable with values A, B, C, and D.
Label Encoding
Label Encoding assigns each unique category a unique numeric value. This method works well when the categories have a natural order (ordinal data).
Example:
Suppose you have a Size feature with the values:
• Small
• Medium
• Large

Here:
• The categories are replaced by numeric labels (0 for Small, 1 for Medium, 2 for Large).
• This approach is suitable when the categorical values have a natural ordinal relationship, such as sizes or rankings.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data
data = {'Employee Name': ['John', 'Alice', 'Bob', 'Mary', 'Steve', 'Anna'],
        'Education Level': ['Bachelor\'s', 'Master\'s', 'High School', 'PhD', 'Master\'s', 'Bachelor\'s']}

df = pd.DataFrame(data)

# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Apply Label Encoding to the Education Level column
df['Education Level Encoded'] = label_encoder.fit_transform(df['Education Level'])

# Display the original and encoded data
print(df[['Employee Name', 'Education Level', 'Education Level Encoded']])
2. Transforming numeric (continuous) features to categorical features

Transforming numeric (continuous) features into categorical features is a common technique in feature engineering, especially when you want to apply models that work better with categorical data or when you want to discretize continuous variables for improved interpretability. This process involves binning or discretization, where the continuous values are grouped into a few discrete categories or intervals.

Common Techniques for Transforming Numeric to Categorical Data:
1. Equal Width Binning (or Equal Range Binning)
2. Equal Frequency Binning (Quantile Binning)
3. Custom Binning (Manual Binning)
4. Clustering-Based Binning (e.g., K-Means Clustering)
1. Equal Width Binning (Range Binning)
Divides the range of the continuous data into equal intervals (bins). Each interval represents a category. Suitable when the distribution of the data is relatively uniform.

Example: If the range of data is from 0 to 100, you might create bins such as:
• Bin 1: 0–20
• Bin 2: 21–40
• Bin 3: 41–60
• Bin 4: 61–80
• Bin 5: 81–100

2. Equal Frequency Binning (Quantile Binning)
Divides the data into bins such that each bin contains approximately the same number of data points. Particularly useful for skewed data or when you want each category to have the same weight.
Example: If you have 100 data points and 5 bins, each bin will contain 20 data points.
3. Custom Binning (Manual Binning)
Allows you to define specific bins manually based on domain knowledge or business requirements. Can create bins that are meaningful for the particular context of the dataset.
Example: In a dataset of income, you might manually define:
• Low: Income < 30,000
• Medium: 30,000 ≤ Income < 70,000
• High: Income ≥ 70,000

4. Clustering-Based Binning (e.g., K-Means Clustering)
Uses clustering algorithms, like K-Means, to cluster the continuous data into distinct groups (clusters). Each cluster is then treated as a category. Useful when the data has complex, non-linear patterns that need to be grouped.

Example: Use K-Means to group income data into 3 clusters: "Low", "Medium", and "High".
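A minimal pandas sketch of the first three binning techniques; the income values and bin edges below are purely illustrative and are not taken from the slides:

import pandas as pd

# Hypothetical income values used only for illustration
income = pd.Series([12000, 25000, 40000, 52000, 68000, 75000, 90000, 120000])

# Equal width binning: 3 bins covering equal ranges of the data
equal_width = pd.cut(income, bins=3, labels=['Low', 'Medium', 'High'])

# Equal frequency (quantile) binning: 3 bins with roughly equal counts
equal_freq = pd.qcut(income, q=3, labels=['Low', 'Medium', 'High'])

# Custom (manual) binning using domain-driven edges
custom = pd.cut(income, bins=[0, 30000, 70000, float('inf')],
                labels=['Low', 'Medium', 'High'])

print(pd.DataFrame({'Income': income, 'EqualWidth': equal_width,
                    'EqualFreq': equal_freq, 'Custom': custom}))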
• If we want to convert the real estate price prediction problem, which is a regression problem, into a real estate price category prediction, which is a classification problem, we 'bin' the numerical data into multiple categories based on the data range.
3. Text-specific feature construction

• In the current world, text is arguably the most predominant medium of communication.

• In social networks like Facebook, micro-blogging channels like Twitter, Emails, and Short Messaging Services such as WhatsApp, text plays a major role in the flow of information.

• Hence, text mining is an important area of research – not only for technology practitioners but also for industry practitioners.

• Text data is unstructured and does not have readily available (straightforward) features like structured data.
All ML algorithms need numerical data as input.

• So the text data in the dataset needs to be transformed.

Vectorization
In natural language processing (NLP) this refers to the process of converting text data into numerical vectors that machine learning models can understand and process. There are many such techniques, but the simplest one is BoW.

Bag of Words (BoW):
BoW transforms text into a set of word counts. It creates a vocabulary of all unique words in the text corpus and represents each document as a vector where each dimension corresponds to a word in the vocabulary. The value of each dimension is typically the count of that word in the document or its binary presence/absence.

Steps in the Bag of Words Model:
1. Tokenization: Split the document into individual words or tokens.
2. Building the Vocabulary: List all unique words in the entire corpus (collection of documents).
3. Creating the Vector: Represent each document as a vector, where each element corresponds to the frequency of a word in the vocabulary.
Step 1: Define the Corpus
Consider the following three documents in our corpus:
Document 1: "I love programming"
Document 2: "Programming is fun"
Document 3: "I love coding"

Step 2: Tokenization and Building the Vocabulary
After tokenizing the documents, we get the following words:
Document 1: ["I", "love", "programming"]
Document 2: ["Programming", "is", "fun"]
Document 3: ["I", "love", "coding"]

Vocabulary: The unique words from all three documents are:
• ["I", "love", "programming", "is", "fun", "coding"]
Step 3: Creating the Vectors
Each document is now represented as a vector, with each position corresponding to the frequency of a word from the vocabulary:

• Vocabulary Order: ["I", "love", "programming", "is", "fun", "coding"]

1. Document 1 ("I love programming"):
• "I" appears 1 time
• "love" appears 1 time
• "programming" appears 1 time
• "is" appears 0 times
• "fun" appears 0 times
• "coding" appears 0 times
Vector for Document 1: [1, 1, 1, 0, 0, 0]
Document 2 ("Programming is fun"):
•"I" appears 0 times
•"love" appears 0 times
•"programming" appears 1 time
•"is" appears 1 time
•"fun" appears 1 time
•"coding" appears 0 times
Vector for Document 2: [0, 0, 1, 1, 1, 0]

Document 3 ("I love coding"):


•"I" appears 1 time
•"love" appears 1 time
•"programming" appears 0 times
•"is" appears 0 times
•"fun" appears 0 times
•"coding" appears 1 time
Vector for Document 3: [1, 1, 0, 0, 0, 1]
Dr. mohammed Alahmed
document term matrix/ term document matrix

36
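A similar document-term matrix can be produced with scikit-learn's CountVectorizer; a small sketch for the three example documents. Note that scikit-learn lowercases the text and orders the vocabulary alphabetically, so the column order differs from the manually built vocabulary above, and the token pattern is widened so that the single-character token "I" is kept:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love programming",
          "Programming is fun",
          "I love coding"]

# Widened token_pattern keeps one-character tokens such as "I" (lowercased to "i")
vectorizer = CountVectorizer(lowercase=True, token_pattern=r"(?u)\b\w+\b")
dtm = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())   # vocabulary (alphabetical order)
print(dtm.toarray())                        # document-term matrix, one row per document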
Feature extraction

• New features are created from a combination of the original features.

• Some of the commonly used operators for combining the original features include:

1. Boolean features: Conjunctions, Disjunctions, Negation, etc.
2. Nominal features: Cartesian product, M of N, etc.
3. Numerical features: Min, Max, Addition, Subtraction, Multiplication, Division, Average, Equivalence, Inequality, etc.

Definition:
• Consider a dataset with a feature set Fi = (F1, F2, …, Fn).
• After feature extraction using a mapping function f(F1, F2, …, Fn), we obtain a new set of features.

Example: new features built as weighted combinations of the original feature values:
Feat1 = 0.3 * 34 + 0.9 * 34.5 = 41.25
Feat2 = 34.5 + 0.5 * 23 + 0.6 * 233 = 185.80
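A small numpy sketch of this idea. It reads the numbers above as one record with four original features and expresses the mapping f as a matrix of weights; this grouping of the values into F1..F4 is an assumption made only for illustration:

import numpy as np

# One record with four assumed original features F1..F4
original = np.array([34, 34.5, 23, 233])

# Mapping function f as a weight matrix:
# Feat1 = 0.3*F1 + 0.9*F2
# Feat2 = 1.0*F2 + 0.5*F3 + 0.6*F4
W = np.array([[0.3, 0.9, 0.0, 0.0],
              [0.0, 1.0, 0.5, 0.6]])

extracted = W @ original
print(extracted)   # [ 41.25 185.8 ]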
The most popular feature extraction algorithms used in machine learning are:

• Principal Component Analysis (PCA)
• Singular Value Decomposition (SVD)
• Linear Discriminant Analysis (LDA)
Principal Component Analysis(PCA)

What is PCA?
Principal Component Analysis (PCA) is a dimensionality reduction technique used in machine learning and statistics to simplify large datasets while preserving as much information as possible. It transforms correlated variables into a smaller set of uncorrelated variables called principal components, which capture the most significant patterns in the data.

Why Use PCA?
• Reduces the number of features (dimensions) in a dataset, making computations faster.
• Helps in visualizing high-dimensional data in 2D or 3D.
• Removes noise and redundancy in data.
• Improves performance of machine learning models by reducing overfitting.

In Principal Component Analysis (PCA), the eigenvectors and eigenvalues of the covariance matrix (or correlation matrix) determine the direction and magnitude of the principal components (PCs), respectively.
The objective of PCA is to make the transformation in such a way that:

1. The new features are distinct, i.e. the covariance between the new features (the principal components) is 0.

2. The principal components are generated in order of the variability in the data that they capture:
• The first principal component should capture the maximum variability.
• The second principal component should capture the next highest variability, and so on.

3. The sum of the variance of the new features (the principal components) should be equal to the sum of the variance of the original features.
• PCA works based on a process called eigenvalue decomposition of the covariance matrix of a data set. Below are the steps to be followed:

1. Calculate the covariance matrix of the data set.
2. Calculate the eigenvalues of the covariance matrix.
3. The eigenvector having the highest eigenvalue represents the direction of highest variance. That is PC1.
4. The eigenvector having the next highest eigenvalue represents PC2.
5. Like this, identify the top 'k' eigenvectors having the top 'k' eigenvalues so as to get the 'k' principal components.
6. Derive the new dataset.
PCA Algorithm

The steps involved in the PCA algorithm are as follows:

Step-01: Get data.
Step-02: Compute the mean vector (µ).
Step-03: Subtract the mean from the given data.
Step-04: Calculate the covariance matrix.
Step-05: Calculate the eigenvectors and eigenvalues of the covariance matrix.
Step-06: Choose components and form a feature vector.
Step-07: Derive the new data set.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Step 1: Define the dataset
data = np.array([
    [4, 2],  # Student A (Math, Science)
    [3, 6],  # Student B
    [2, 4]   # Student C
])

# Step 2: Standardize the data (subtract mean, divide by std deviation)
scaler = StandardScaler()
data_standardized = scaler.fit_transform(data)

# Step 3: Compute the covariance matrix
cov_matrix = np.cov(data_standardized, rowvar=False)

# Step 4: Compute eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

# Step 5: Sort eigenvalues in descending order and reorder the eigenvectors accordingly
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

# Step 6: Transform the original data
transformed_data = np.dot(data_standardized, eigenvectors)

# Print results
print("Standardized Data:\n", data_standardized)
print("\nCovariance Matrix:\n", cov_matrix)
print("\nEigenvalues:\n", eigenvalues)
print("\nEigenvectors:\n", eigenvectors)
print("\nTransformed Data (Principal Components):\n", transformed_data)

# Step 7: Using sklearn PCA for verification
pca = PCA(n_components=1)  # Reduce to 1D
pca_data = pca.fit_transform(data_standardized)
print("\nPCA Transformed Data (sklearn):\n", pca_data)
Singular value decomposition

• Singular Value Decomposition (SVD) is a matrix factorization technique commonly used in linear algebra.

• The SVD of a matrix A (m × n) is a factorization of the form:

A = U Σ V^T

where U is an m × m orthogonal matrix of left-singular vectors, Σ is an m × n diagonal matrix of singular values, and V is an n × n orthogonal matrix of right-singular vectors.
The SVD of a data matrix is expected to have the properties highlighted below:

1. Patterns in the attributes are captured by the right-singular vectors, i.e. the columns of V.

2. Patterns among the instances are captured by the left-singular vectors, i.e. the columns of U.

3. The larger a singular value, the larger the part of the matrix A that it and its associated vectors account for.

4. A new data matrix with 'k' attributes is obtained using the equation:

D' = D × [v1, v2, …, vk]

Thus, the dimensionality gets reduced to k. SVD is often used in the context of text data.
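A minimal numpy sketch of this reduction on a small hypothetical data matrix (the values are illustrative only):

import numpy as np

# Hypothetical 4 x 3 data matrix D (rows = instances, columns = attributes)
D = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0],
              [0.0, 2.0, 4.0],
              [3.0, 1.0, 1.0]])

# Full SVD: D = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(D, full_matrices=False)

# Keep the top k right-singular vectors and project the data: D' = D x [v1 ... vk]
k = 2
V_k = Vt[:k].T          # columns v1..vk
D_reduced = D @ V_k     # reduced data set with k attributes

print(D_reduced.shape)  # (4, 2)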
Linear Discriminant Analysis

• Linear Discriminant Analysis (LDA) is one of the commonly used dimensionality reduction techniques in machine learning, and is used to solve two-class and multi-class classification problems.

• It is also known as Normal Discriminant Analysis (NDA) or Discriminant Function Analysis (DFA).

• LDA focuses on class separability, i.e. separating the features based on class separability so as to avoid over-fitting of the machine learning model.

• PCA calculates the eigenvalues of the covariance matrix of the data set, whereas LDA calculates eigenvalues and eigenvectors from the intra-class and inter-class scatter matrices.
Below are the steps to be followed:

1. Calculate the mean vectors for the individual classes.
2. Calculate the intra-class and inter-class scatter matrices.
3. Calculate the eigenvalues and eigenvectors of Sw^-1 SB, where Sw is the intra-class scatter matrix and SB is the inter-class scatter matrix.

The intra-class scatter matrix is

Sw = Σi Σ(x in class i) (x − mi)(x − mi)^T

where mi is the mean vector of the i-th class, and the inter-class scatter matrix is

SB = Σi Ni (mi − m)(mi − m)^T

where mi is the sample mean for each class, m is the overall mean of the data set, and Ni is the sample size of each class.
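A short scikit-learn sketch on the Iris data set; the library computes the scatter matrices and the eigen decomposition internally, so this only illustrates the end-to-end use of LDA as a dimensionality reduction step:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_iris

# Iris has 4 features and 3 classes, so LDA can produce at most 2 discriminant axes
X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print(X_lda.shape)                    # (150, 2)
print(lda.explained_variance_ratio_)  # variability captured by each discriminant axis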
4.3 FEATURE SUBSET SELECTION

Feature subset selection selects a subset of system attributes or features which make the most meaningful contribution to a machine learning activity.

Example:
• The student weight data set has features such as Roll Number, Age, Height, and Weight.

• We can well understand that the roll number can have no bearing, whatsoever, in predicting student weight.

• So we can eliminate the feature Roll Number and build a feature subset to be considered in this machine learning problem.

• The subset of features is expected to give better results than the full set.
Issues in high-dimensional data

• With the rapid innovations in the digital space, the volume of data generated has increased to an unbelievable extent.

• At the same time, breakthroughs in storage technology have made the storage of large quantities of data quite cheap.

• This has further motivated the storage and mining of very large and high-dimensionality data sets.

Examples: DNA analysis, geographic information systems (GIS), social networking, etc.

• Two new application domains in particular have seen a drastic growth in data dimensionality:

• Biomedical research, which includes gene selection from microarray data; it generates data sets having a number of features in the range of a few tens of thousands.

• Text data generated from different sources, such as social networking sites, emails, messages, articles, etc., which also has extremely high dimensionality.

• In a large document corpus having a few thousand documents embedded, the number of unique word tokens which represent the features of the text data set can also be in the range of a few tens of thousands.

• This high-dimensional data may be a big challenge for any machine learning algorithm.
Problems in high-dimensionality data:

• A very high quantity of computational resources and a high amount of time will be required.

• The performance of the model – for both supervised and unsupervised machine learning tasks – also degrades sharply due to unnecessary noise in the data.

• A model built on an extremely high number of features may be very difficult to understand.

• Hence, it is necessary to take a subset of the features instead of the full set.
The objective of feature selection is three-fold:

• Having a faster and more cost-effective (i.e. less need for computational resources) learning model
• Improving the efficiency of the learning model
• Having a better understanding of the underlying model that generated the data
4.3.2 Key drivers of feature selection – feature relevance and redundancy

Feature relevance:
• In supervised learning, the input dataset (the training dataset) has a class label attached.

• The model has to assign class labels to new, unlabelled data.

• Each of the predictor variables is expected to contribute information to decide the value of the class label.

• If a variable is not contributing any information, it is said to be irrelevant.

• In case the information contribution for prediction is very little, the variable is said to be weakly relevant.

• In unsupervised learning, there is no training data set or labelled data.

• Grouping of similar data instances is done, and the similarity of data instances is evaluated based on the values of different variables.

• Certain variables do not contribute any useful information for deciding the similarity or dissimilarity of data instances.

• So, those variables make no significant information contribution to the grouping process.

• These variables are marked as irrelevant variables in the context of the unsupervised machine learning task.
Example

• Student data set: to predict the weight of a student, the Roll Number doesn't contribute any significant information (supervised learning).

• To group the students with similar academic capabilities, the Roll Number really cannot contribute any information whatsoever (unsupervised learning).

• The irrelevant candidates are rejected when selecting a subset of features.

• Whether the weakly relevant features are to be rejected or not is decided on a case-to-case basis.
Feature Redundancy…

• A feature may contribute information which is similar to the information contributed by one or more other features.

Example: In the weight prediction of a student, both the features Age and Height contribute similar information.

• With an increase in Age, Weight is expected to increase. Similarly, with an increase in Height, Weight is expected to increase. Also, Age and Height increase with each other.

• So, in the context of the Weight prediction problem, Age and Height contribute similar information.

• When one feature is similar to another feature, that feature is said to be potentially redundant in the context of the learning problem.

• All features having potential redundancy are candidates for rejection in the final feature subset.

• Only a small number of representative features are made a part of the final feature subset.

• The objective of feature selection is to remove all features which are redundant and irrelevant.
Measures of feature relevance

• Feature relevance is measured based on the amount of information contributed by a feature.

• For supervised learning, mutual information is considered a good measure of the information contribution of a feature in deciding the value of the class label.

• The higher the value of the mutual information of a feature, the more relevant that feature is. Mutual information can be calculated as follows:

MI(C, f) = H(C) + H(f) - H(C, f)

where the marginal entropy of the class is

H(C) = - Σ (i = 1 to k) p(Ci) log2 p(Ci)

the marginal entropy of the feature 'f' is

H(f) = - Σx p(f = x) log2 p(f = x)

and k = number of classes, C = class variable, f = feature, H(C, f) = joint entropy of C and f.

• In the case of unsupervised learning, there is no class variable.

• Instead, the entropy of the set of features without one feature at a time is calculated for all the features.

• Then, the features are ranked in descending order of information gain from a feature, and the top 'β' percentage (the value of 'β' is a design parameter of the algorithm) of features are selected as relevant features.

• The entropy of a feature f is calculated using Shannon's formula:

H(f) = - Σx p(f = x) log2 p(f = x)

• This formula can be used only for features that take discrete values. For continuous features, discretization should be performed first to estimate the probabilities p(f = x).
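A small Python sketch of the MI(C, f) = H(C) + H(f) - H(C, f) calculation for a single discrete feature, using a made-up weather-style feature and class (the values are illustrative and are not the slide's data set):

import numpy as np
from collections import Counter

def entropy(values):
    # Shannon entropy of a discrete variable
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Hypothetical categorical feature f and class labels C
f = ['sunny', 'sunny', 'rain', 'rain', 'overcast', 'rain', 'sunny', 'overcast']
C = ['no',    'no',    'yes',  'yes',  'yes',      'no',   'yes',   'yes']

H_C  = entropy(C)
H_f  = entropy(f)
H_Cf = entropy(list(zip(C, f)))   # joint entropy H(C, f)

MI = H_C + H_f - H_Cf
print(round(MI, 3))               # higher value => more relevant feature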
Worked example (shown on the slides): weather data for playing cricket – compute H(C), H(f) and the joint entropy H(C, f) for each feature to obtain its mutual information.
4.3.3.2 Measures of Feature redundancy

Feature redundancy is based on similar information contribution by multiple features.

Three types of measures are:
1. Correlation-based measures
2. Distance-based measures
3. Other coefficient-based measures
1. Correlation-based similarity measure
• Correlation is a measure of linear dependency between two random variables.

• Pearson's product moment correlation coefficient is one of the most popular and accepted measures of correlation between two random variables.

• For two random feature variables F1 and F2, the Pearson correlation coefficient is defined as:

r(F1, F2) = cov(F1, F2) / (σF1 · σF2)

where cov(F1, F2) is the covariance of F1 and F2, and σF1, σF2 are their standard deviations.

• Correlation values range between +1 and –1.

• A correlation of 1 (+/–) indicates perfect correlation, i.e. the two features have a perfect linear relationship.

• In case the correlation is 0, the features seem to have no linear relationship.

• Generally, for all feature selection problems, a threshold value is adopted to decide whether two features have adequate similarity or not.
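A minimal numpy sketch of this check for two hypothetical feature columns (the feature values and the threshold of 0.9 are illustrative choices, not prescribed values):

import numpy as np

# Two hypothetical feature columns
F1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
F2 = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(F1, F2)[0, 1]    # Pearson correlation coefficient
print(round(r, 3))

# A simple redundancy rule: treat the pair as redundant if |r| exceeds a threshold
threshold = 0.9
print(abs(r) > threshold)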
2. Distance-based similarity measure
• The most common distance measure is the Euclidean distance, which, between two features F1 and F2, is calculated as:

d(F1, F2) = sqrt( Σ (i = 1 to n) (F1i - F2i)^2 )

where F1 and F2 are features of an n-dimensional data set.

• Suppose the data set has two features, aptitude (F1) and communication (F2), under consideration. The Euclidean distance between the features can be calculated using the formula provided above.

• A more generalized form of the Euclidean distance is the Minkowski distance, measured as

d(F1, F2) = ( Σ (i = 1 to n) |F1i - F2i|^r )^(1/r)

• The Minkowski distance takes the form of the Euclidean distance (L2 norm) when r = 2.

• At r = 1, it takes the form of the Manhattan distance (L1 norm), as shown below:

d(F1, F2) = Σ (i = 1 to n) |F1i - F2i|
Distance between binary vectors
• The distance measure used between binary vectors is the Hamming distance.

Example: The Hamming distance between the two vectors 01101011 and 11001001 is 3.
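A short scipy sketch of these distance measures; the feature vectors used for the Euclidean/Minkowski part are made up, while the Hamming example reuses the two binary vectors above:

from scipy.spatial import distance

F1 = [2, 3, 4, 5, 7]
F2 = [1, 5, 4, 3, 6]

print(distance.euclidean(F1, F2))        # L2 norm (Minkowski with r = 2)
print(distance.cityblock(F1, F2))        # Manhattan / L1 norm (Minkowski with r = 1)
print(distance.minkowski(F1, F2, p=3))   # general Minkowski distance with r = 3

# Hamming distance between the two binary vectors from the example
a = [0, 1, 1, 0, 1, 0, 1, 1]
b = [1, 1, 0, 0, 1, 0, 0, 1]
print(int(distance.hamming(a, b) * len(a)))  # scipy returns the fraction of mismatches -> 3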
3. Other similarity measures

a) The Jaccard index/coefficient is used as a measure of similarity between two features. For two features having binary values, the Jaccard index is measured as

J = n11 / (n11 + n01 + n10)

where
n11 = number of cases where both the features have value 1
n01 = number of cases where feature 1 has value 0 and feature 2 has value 1
n10 = number of cases where feature 1 has value 1 and feature 2 has value 0

b) The Jaccard distance, a measure of dissimilarity between two features, is complementary to the Jaccard index:

Jaccard distance d = 1 - J

Example:
Let's consider two features F1 and F2 having values (0, 1, 1, 0, 1, 0, 1, 0) and (1, 1, 0, 0, 1, 0, 0, 0).

Identifying the values of n11, n01 and n10: the cases where both values are 0 are left out, as an indication of the fact that they are excluded in the calculation of the Jaccard coefficient. Here n11 = 2, n01 = 1 and n10 = 2.

Jaccard coefficient of F1 and F2: J = 2 / (2 + 1 + 2) = 0.4

Jaccard distance between F1 and F2: d = 1 – J = 0.6


c) The Simple Matching Coefficient (SMC) is almost the same as the Jaccard coefficient except for the fact that it also includes the number of cases where both the features have a value of 0:

SMC = (n11 + n00) / (n00 + n01 + n10 + n11)

where
n11 = number of cases where both the features have value 1
n01 = number of cases where feature 1 has value 0 and feature 2 has value 1
n10 = number of cases where feature 1 has value 1 and feature 2 has value 0
n00 = number of cases where both the features have value 0

Quite understandably, the total count of rows n = n00 + n01 + n10 + n11, and all values are included in the calculation of SMC. For the F1 and F2 above, n00 = 3, so SMC = (2 + 3) / 8 = 0.625.
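A small numpy sketch that reproduces the Jaccard and SMC calculations for the F1 and F2 vectors used above:

import numpy as np

F1 = np.array([0, 1, 1, 0, 1, 0, 1, 0])
F2 = np.array([1, 1, 0, 0, 1, 0, 0, 0])

n11 = np.sum((F1 == 1) & (F2 == 1))
n00 = np.sum((F1 == 0) & (F2 == 0))
n01 = np.sum((F1 == 0) & (F2 == 1))
n10 = np.sum((F1 == 1) & (F2 == 0))

J = n11 / (n11 + n01 + n10)                  # Jaccard index: 0-0 matches ignored
SMC = (n11 + n00) / (n11 + n00 + n01 + n10)  # Simple matching coefficient

print(J, 1 - J, SMC)   # 0.4, 0.6 (Jaccard distance), 0.625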
d) Cosine similarity measure
Cosine similarity is one of the most popular measures in text classification and is calculated as

cos(x, y) = (x . y) / (||x|| * ||y||)

where x . y is the vector dot product of x and y, i.e. Σ (i = 1 to n) xi * yi, and ||x||, ||y|| are the Euclidean norms of x and y.

• The text corpus needs to be first transformed into features, with a word token being a feature and the number of times the word occurs in a document coming as a value in each row.

• The data set is sparse in nature, as only a few words appear in a document and hence in a row of the data set.

• So each row has very few non-zero values. However, the non-zero values can be any integer value, as the same word may occur any number of times.

• Also, considering the sparsity of the data set, the 0-0 matches (which obviously are going to be pretty high) need to be ignored.

• It is most commonly used in text classification.

Example: calculate the cosine similarity of x and y, where x = (2, 4, 0, 0, 2, 1, 3, 0, 0) and y = (2, 1, 0, 0, 3, 2, 1, 0, 1).

x . y = 2*2 + 4*1 + 0*0 + 0*0 + 2*3 + 1*2 + 3*1 + 0*0 + 0*1 = 19
||x|| = sqrt(2^2 + 4^2 + 2^2 + 1^2 + 3^2) = sqrt(34) ≈ 5.83
||y|| = sqrt(2^2 + 1^2 + 3^2 + 2^2 + 1^2 + 1^2) = sqrt(20) ≈ 4.47
cos(x, y) = 19 / (5.83 * 4.47) ≈ 0.729

• Cosine similarity actually measures the angle between the x and y vectors. Hence, if the cosine similarity has a value of 1, the angle between x and y is 0°, which means x and y are the same except for magnitude. If the cosine similarity is 0, the angle between x and y is 90°.

• In the above example, the angle comes to be about 43.2°.
Exercise: Two rows in a document-term matrix have the values (2, 3, 2, 0, 2, 3, 3, 0, 1) and (2, 1, 0, 0, 3, 2, 1, 3, 1). Find the cosine similarity.
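A minimal numpy sketch of the cosine similarity calculation; it reproduces the worked example above and can also be used to check the exercise vectors:

import numpy as np

def cosine_similarity(x, y):
    # cos(x, y) = (x . y) / (||x|| * ||y||)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Worked example from the slides
x = [2, 4, 0, 0, 2, 1, 3, 0, 0]
y = [2, 1, 0, 0, 3, 2, 1, 0, 1]
print(round(cosine_similarity(x, y), 3))   # ~0.729, i.e. an angle of about 43.2 degrees

# The same function can be applied to the exercise vectors
a = [2, 3, 2, 0, 2, 3, 3, 0, 1]
b = [2, 1, 0, 0, 3, 2, 1, 3, 1]
print(round(cosine_similarity(a, b), 3))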
4.3.4 Overall feature selection process
Feature selection is the process of selecting a subset of features in a data set. It consists of four steps:

1. Generation of possible subsets
2. Subset evaluation
3. Stop searching based on some stopping criterion
4. Validation of the result
Subset generation

• It is a search procedure which ideally should produce all possible candidate subsets.

• In practice, different approximate search strategies are employed to find candidate subsets for evaluation:

• The search may start with an empty set and keep adding features – sequential forward selection.

• The search may start with the full set and successively remove features – backward elimination.

• In certain cases, the search starts from both ends and adds and removes features simultaneously – bi-directional selection.
• Each candidate subset is then evaluated and compared with the previous best performing subset. If the new subset performs better, it replaces the previous one.

• This cycle of subset generation and evaluation continues till a pre-defined stopping criterion is fulfilled.

Some commonly used stopping criteria are:
1. the search completes
2. some given bound (e.g. a specified number of iterations) is reached
3. subsequent addition (or deletion) of a feature is not producing a better subset
4. a sufficiently good subset (e.g. a subset having better classification accuracy than the existing benchmark) is selected
Validation

• The selected best subset is validated either against prior benchmarks or by experiments using real-life or synthetic but authentic data sets.

• In the case of supervised learning, the accuracy of the learning model may be the performance parameter considered for validation.

• The accuracy of the model using the derived subset is compared against the model accuracy of the subset derived using some other benchmark algorithm.

• In the case of unsupervised learning, the cluster quality may be the parameter for validation.
4.3.5 Feature selection approaches

There are four types of approaches for feature selection:

1. Filter approach
2. Wrapper approach
3. Hybrid approach
4. Embedded approach
Filter Approach

• In the filter approach, the feature subset is selected based on statistical measures applied to assess the merits of the features from the data perspective.

• No learning algorithm is employed to evaluate the goodness of the selected features.

• Some of the common statistical tests conducted on features as a part of the filter approach are Pearson's correlation, Information Gain, Fisher Score, Analysis of Variance (ANOVA), Chi-Square, etc.
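A short scikit-learn sketch of the filter approach: each feature is scored independently with a statistical measure (mutual information here) and the top-scoring ones are kept, with no learning model involved in the selection itself:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Score each feature and keep the top 2
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)        # per-feature relevance scores
print(selector.get_support())  # boolean mask of the selected features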
Wrapper approach

• Identification of the best feature subset is done using the induction algorithm (an ML or greedy algorithm) as a black box, e.g. forward feature selection or the backward elimination method.

• The feature selection algorithm searches for a good feature subset using the induction algorithm itself as a part of the evaluation function.

• For every candidate subset, the learning model is trained and the result is evaluated by running the learning algorithm.

• The wrapper approach is therefore computationally very expensive.
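A short scikit-learn sketch of the wrapper idea using SequentialFeatureSelector (available in recent scikit-learn versions), with logistic regression as the black-box induction algorithm; every candidate subset is scored by cross-validated model performance, which is why the approach is expensive:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Forward selection: start empty and add the feature that improves CV score the most
estimator = LogisticRegression(max_iter=1000)
sfs = SequentialFeatureSelector(estimator, n_features_to_select=2,
                                direction='forward', cv=5)
sfs.fit(X, y)

print(sfs.get_support())   # features chosen by forward selection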
Hybrid approach

• The hybrid approach takes advantage of both the filter and wrapper approaches.

• A typical hybrid algorithm makes use of the statistical tests used in the filter approach to decide the best subsets for a given cardinality,

• and a learning algorithm to select the final best subset among the best subsets across different cardinalities.
Embedded approach

• It is quite similar to the wrapper approach; however, the difference is that it performs feature selection (over different feature combinations) and classification simultaneously.

• Example: the Random Forest algorithm
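A minimal scikit-learn sketch of the embedded idea with Random Forest: the model is trained once and the feature importances computed during training can then be used to select features:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Feature ranking happens as a by-product of training the model itself
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

print(model.feature_importances_)  # importance of each feature learned during training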