Unit-II
Objectives
INTRODUCTION
Feature: an attribute of a data set that is used as an input in a machine learning process.
Feature engineering
1. Feature transformation: transforms the data into a form that gets the best result for a machine learning problem.
4.2 FEATURE TRANSFORMATION
Feature construction
Example:
Data set: Real Estate dataset
Attributes: apartment length, apartment breadth, and
price of the apartment.
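A minimal sketch of this kind of feature construction, assuming a hypothetical pandas DataFrame with the three attributes above: a new 'apartment_area' feature is derived from length and breadth, which is often more useful for predicting price than the two raw dimensions.

import pandas as pd

# Hypothetical real-estate data (dimensions in metres, price in thousands)
df = pd.DataFrame({'apartment_length': [10, 12, 9],
                   'apartment_breadth': [8, 10, 7],
                   'price': [550, 720, 480]})

# Constructed feature: apartment area = length * breadth
df['apartment_area'] = df['apartment_length'] * df['apartment_breadth']
print(df)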
Situations where feature construction is an essential
activity:
1. Encoding categorical (nominal) variables
Example (a product sales data set):
import pandas as pd

# Sample data: a nominal categorical feature ('Product Category') and a numeric column
data = {'Product Category': ['Electronics', 'Clothing', 'Furniture', 'Electronics',
                             'Furniture', 'Clothing'],
        'Sales': [500, 300, 700, 600, 800, 350]}
df = pd.DataFrame(data)
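A minimal sketch of one common way to encode such a nominal feature, one-hot (dummy) encoding with pandas; pd.get_dummies and the resulting column names are assumptions beyond the sample data above.

# One-hot encode the nominal 'Product Category' column: one 0/1 column per category
df_encoded = pd.get_dummies(df, columns=['Product Category'])
print(df_encoded)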
Target Encoding is generally used for nominal categorical features
(features with no inherent order). When applied to ordinal features, it
may not preserve the natural order of the categories, which could lead to
suboptimal performance for some models.
import pandas as pd
# Sample data
data = {'Product Category': ['Electronics', 'Clothing', 'Furniture',
'Electronics', 'Furniture', 'Clothing'],
'Price': [500, 300, 700, 600, 800, 350]}
df = pd.DataFrame(data)
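A minimal sketch of target encoding for this data, assuming 'Price' is treated as the target: each category is replaced by the mean target value observed for that category (the column name 'Category_Encoded' is made up for illustration).

# Target (mean) encoding: replace each category with the mean Price of that category
df['Category_Encoded'] = df.groupby('Product Category')['Price'].transform('mean')
print(df)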
Encoding categorical (ordinal) variables
Label Encoding
Label Encoding assigns each unique category a unique numeric value. This
method works well when the categories have a natural order (ordinal data).
Example:
Suppose you have a Size feature with the values:
• Small
• Medium
• Large
import pandas as pd

# Sample data: an ordinal categorical feature ('Education Level')
data = {'Employee Name': ['John', 'Alice', 'Bob', 'Mary', 'Steve', 'Anna'],
        'Education Level': ['Bachelor\'s', 'Master\'s', 'High School', 'PhD', 'Master\'s', 'Bachelor\'s']}
df = pd.DataFrame(data)
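A minimal sketch of encoding this ordinal feature so that the natural order High School < Bachelor's < Master's < PhD is preserved; the explicit mapping dictionary and the 'Education_Encoded' column name are assumptions for illustration.

# Map each education level to an integer that respects the natural order
education_order = {'High School': 0, 'Bachelor\'s': 1, 'Master\'s': 2, 'PhD': 3}
df['Education_Encoded'] = df['Education Level'].map(education_order)
print(df)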
2. Transforming numeric (continuous)
features to categorical features
Example: If the range of data is from 0 to 100, you might create bins such
as:
• Bin 1: 0–20
• Bin 2: 21–40
• Bin 3: 41–60
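A minimal sketch of this binning (discretization) with pandas.cut, assuming a hypothetical numeric column in the 0–100 range and bin edges matching the list above, extended to 100:

import pandas as pd

# Hypothetical continuous values in the range 0-100
scores = pd.Series([5, 17, 34, 52, 61, 88, 95])

# Discretize into categorical bins; edges chosen to match the example bins above
bins = [0, 20, 40, 60, 80, 100]
labels = ['0-20', '21-40', '41-60', '61-80', '81-100']
score_bins = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)
print(score_bins)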
3. Text-specific feature construction
Step 3: Creating the Vectors
Each document is now represented as a vector, with each position corresponding to
the frequency of a word from the vocabulary:
Document 2 ("Programming is fun"):
•"I" appears 0 times
•"love" appears 0 times
•"programming" appears 1 time
•"is" appears 1 time
•"fun" appears 1 time
•"coding" appears 0 times
Vector for Document 2: [0, 0, 1, 1, 1, 0]
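A minimal sketch of this bag-of-words construction with scikit-learn's CountVectorizer. The three documents are assumed for illustration ('I love programming', 'Programming is fun', 'I love coding' give the six-word vocabulary used above); note that the vectorizer orders its columns alphabetically, so the vector layout differs from the order shown above.

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus consistent with the six-word vocabulary above
docs = ['I love programming', 'Programming is fun', 'I love coding']

# token_pattern keeps one-letter words such as "I"; columns come out alphabetically
vectorizer = CountVectorizer(lowercase=True, token_pattern=r'(?u)\b\w+\b')
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # ['coding' 'fun' 'i' 'is' 'love' 'programming']
print(X.toarray())                         # second row (Document 2) -> [0 1 0 1 0 1]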
Feature extraction
Definition:
Consider a data set with a feature set Fi = (F1, F2, …, Fn).
After feature extraction using a mapping function f(F1, F2, …, Fn), we will have a new set of features, each of which is a combination of the original features. For example:
Feat1 = 0.3 * 34 + 0.9 * 34.5 = 41.25
Feat2 = 34.5 + 0.5 * 23 + 0.6 * 233 = 185.80
The most popular feature extraction algorithms used in machine
learning:
Principal Component Analysis(PCA)
What is PCA?
Principal Component Analysis (PCA) is a dimensionality reduction technique
used in machine learning and statistics to simplify large datasets while
preserving as much information as possible. It transforms correlated variables
into a smaller set of uncorrelated variables called principal components,
which capture the most significant patterns in the data.
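The print statements below display four intermediate results of PCA. A minimal sketch of the computation they would come from, assuming NumPy and a hypothetical two-feature toy array (the variable names are chosen to match the prints):

import numpy as np

# Hypothetical toy data: 5 samples, 2 correlated features
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

# Step 1: standardize each feature to zero mean and unit variance
data_standardized = (data - data.mean(axis=0)) / data.std(axis=0)

# Step 2: covariance matrix of the standardized features
cov_matrix = np.cov(data_standardized, rowvar=False)

# Step 3: eigen decomposition; eigenvectors are the principal components,
# eigenvalues give the variance captured by each component
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)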
# Print results
print("Standardized Data:\n", data_standardized)
print("\nCovariance Matrix:\n", cov_matrix)
print("\nEigenvalues:\n", eigenvalues)
print("\nEigenvectors:\n", eigenvectors)
Singular value decomposition
SVD of a data matrix is expected to have the properties highlighted below: the singular vectors capture the dominant patterns in the data, and larger singular values account for larger parts of the variation in D. For dimensionality reduction, the data set is projected onto the first k right-singular vectors to obtain the reduced data set:
Dk = D × [v1 , v2 , … , vk ]
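A minimal sketch of this projection with NumPy, assuming a hypothetical data matrix D and k = 2; numpy.linalg.svd returns the right-singular vectors as the rows of Vt, so the first k of them are transposed before the multiplication.

import numpy as np

# Hypothetical data matrix D: 6 samples, 4 features
rng = np.random.default_rng(0)
D = rng.normal(size=(6, 4))

# Singular value decomposition: D = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(D, full_matrices=False)

# Keep the first k right-singular vectors and project the data onto them
k = 2
Dk = D @ Vt[:k].T          # reduced data set: 6 samples, k features
print(Dk.shape)            # (6, 2)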
Linear discriminant analysis (LDA)
The between-class scatter matrix is computed as
SB = Σ Ni (mi − m)(mi − m)^T   (summed over all classes i)
where mi is the sample mean for each class, m is the overall mean of the data set, and Ni is the sample size of each class.
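A minimal sketch of this computation with NumPy, assuming a hypothetical two-class, two-feature data set; the variable names mirror the symbols above (mi for the class means, overall_mean for m, Ni for the class sizes).

import numpy as np

# Hypothetical data: two features per sample, with a class label for each sample
X = np.array([[4.0, 2.0], [2.0, 4.0], [2.0, 3.0],   # class 0
              [9.0, 10.0], [6.0, 8.0], [9.0, 5.0]]) # class 1
y = np.array([0, 0, 0, 1, 1, 1])

overall_mean = X.mean(axis=0)                        # m
S_B = np.zeros((X.shape[1], X.shape[1]))
for c in np.unique(y):
    Xc = X[y == c]
    Ni = Xc.shape[0]                                 # sample size of class c
    mi = Xc.mean(axis=0)                             # class mean
    diff = (mi - overall_mean).reshape(-1, 1)
    S_B += Ni * diff @ diff.T                        # Ni (mi - m)(mi - m)^T

print(S_B)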
4.3 FEATURE SUBSET SELECTION
Example:
The student weight data set has features such as Roll Number, Age, Height, and Weight. Roll Number has no bearing on a student's weight, so only a subset of the features (Age, Height) is relevant and needs to be retained.
4.3.2 Key drivers of feature selection – feature
relevance and redundancy
Feature relevance:
In supervised learning, the input data set (the training data set) has a class label attached to each record. Every predictor feature is expected to contribute information that helps determine the class label; a feature that contributes no such information is said to be irrelevant.
Feature Redundancy…
Measures of feature redundancy (feature-to-feature similarity)
1. Correlation-based measures
2. Distance-based measures
3. Other coefficient-based measures
1. Correlation-based similarity measure
Correlation is a measure of linear dependency between
two random variables.
Correlation values range between +1 and –1.
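A minimal sketch of measuring this with NumPy, using two hypothetical feature columns; a correlation close to +1 or −1 suggests the two features are largely redundant.

import numpy as np

# Hypothetical feature columns measured on the same 6 instances
F1 = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
F2 = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 5.9])

# Pearson correlation coefficient between F1 and F2 (off-diagonal entry)
r = np.corrcoef(F1, F2)[0, 1]
print(r)    # close to +1 -> strong linear dependency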
2. Distance-based similarity measure
The most common distance measure is the Euclidean distance which, between two features F1 and F2, is calculated as:
d(F1, F2) = √( Σ (F1i − F2i)² )   (summed over the n rows of the data set)
A more generalized form of the Euclidean distance is the Minkowski distance, measured as
d(F1, F2) = ( Σ |F1i − F2i|^r )^(1/r)
which reduces to the Euclidean distance for r = 2 and to the Manhattan distance for r = 1.
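A minimal sketch of both distances with SciPy, using two hypothetical feature vectors; scipy.spatial.distance.minkowski with p = 2 reproduces the Euclidean case.

import numpy as np
from scipy.spatial.distance import euclidean, minkowski

# Hypothetical feature vectors over the same instances
F1 = np.array([2.0, 4.0, 6.0, 8.0])
F2 = np.array([1.0, 5.0, 5.0, 9.0])

print(euclidean(F1, F2))          # Euclidean distance
print(minkowski(F1, F2, p=2))     # same value: Minkowski with r = 2
print(minkowski(F1, F2, p=1))     # Manhattan distance (r = 1)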
2. Distance between Binary Vectors
3. Other similarity measures
Jaccard index/coefficient:
J = n11 / (n01 + n10 + n11)
where
n11 = number of cases where both the features have value 1
n01 = number of cases where feature 1 has value 0 and feature 2 has value 1
n10 = number of cases where feature 1 has value 1 and feature 2 has value 0
Jaccard distance d = 1 - J
Example :
Let’s consider two features F1 and F2 having values (0, 1,
1, 0, 1, 0, 1, 0) and (1, 1, 0, 0, 1, 0, 0, 0).
Here n11 = 2, n01 = 1, n10 = 2, so
J = n11 / (n01 + n10 + n11) = 2 / (1 + 2 + 2) = 0.4, and the Jaccard distance d = 1 − 0.4 = 0.6.
Simple matching coefficient (SMC):
SMC = (n11 + n00) / (n00 + n01 + n10 + n11)
where,
n11 = number of cases where both the features have value 1
n01 = number of cases where feature 1 has value 0 and feature 2 has value 1
n10 = number of cases where feature 1 has value 1 and feature 2 has value 0
n00 = number of cases where both the features have value 0
Quite understandably, the total count of rows is n = n00 + n01 + n10 + n11. All four counts are included in the calculation of SMC, unlike the Jaccard coefficient, which ignores the 0-0 matches.
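A minimal sketch computing both coefficients for the example feature vectors above with NumPy; the counts n11, n10, n01, n00 are derived directly from the two binary features.

import numpy as np

F1 = np.array([0, 1, 1, 0, 1, 0, 1, 0])
F2 = np.array([1, 1, 0, 0, 1, 0, 0, 0])

n11 = np.sum((F1 == 1) & (F2 == 1))   # both 1
n10 = np.sum((F1 == 1) & (F2 == 0))
n01 = np.sum((F1 == 0) & (F2 == 1))
n00 = np.sum((F1 == 0) & (F2 == 0))   # both 0

jaccard = n11 / (n01 + n10 + n11)               # 2 / 5 = 0.4
smc = (n11 + n00) / (n00 + n01 + n10 + n11)     # 5 / 8 = 0.625
print(jaccard, 1 - jaccard, smc)                # 0.4 0.6 0.625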
d) Cosine similarity measure
Cosine similarity is one of the most popular measures in text classification.
• In a document-term matrix, each row has very few non-zero values. However, a non-zero value can be any integer, as the same word may occur any number of times.
• Also, considering the sparsity of the data set, the 0-0 matches (which will obviously be very frequent) need to be ignored; cosine similarity does ignore them, since such positions contribute nothing to the dot product.
Cosine similarity between two vectors x and y is calculated as
cos(x, y) = (x . y) / (||x|| * ||y||)
where x . y = vector dot product of x and y = Σ xi yi , and ||x|| = √( Σ xi² ) is the length (norm) of the vector.
Two rows in a document-term matrix have values -
(2, 3, 2, 0, 2, 3, 3, 0, 1) and (2, 1, 0, 0, 3, 2, 1, 3,
1). Find the cosine similarity
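A minimal sketch of the calculation with NumPy: the dot product is 23, the norms are √40 and √29, giving a cosine similarity of roughly 0.675.

import numpy as np

x = np.array([2, 3, 2, 0, 2, 3, 3, 0, 1])
y = np.array([2, 1, 0, 0, 3, 2, 1, 3, 1])

# cos(x, y) = (x . y) / (||x|| * ||y||)
cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos_sim)    # ~0.675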
4.3.4 Overall feature selection process
Feature selection is the process of selecting a subset of features in a data set. It consists of four steps: subset generation, subset evaluation, stopping criterion, and result validation.
Subset generation
Validation

Feature selection approaches:
1. Filter approach
2. Wrapper approach
3. Hybrid approach
4. Embedded approach
Filter Approach
Wrapper approach
Embedded approach