Module 2.1 Feature Selection
• model.fit(X_train_selected, y_train)
• Step 8: Make Predictions and Evaluate the Model
• Make predictions on the test set and compute the mean squared error to evaluate the model's performance (this assumes mean_squared_error has been imported from sklearn.metrics).
• y_pred = model.predict(X_test_selected)
• mse = mean_squared_error(y_test, y_pred)
• print("Mean Squared Error:", mse)
• Forward Selection
• Forward Selection starts with an empty set of features and adds the most promising feature at each step.
• The model's performance is evaluated after each addition, and the process continues until a specified number of features has been selected.
Step-by-step process
• Step 1: Empty Feature Set
• Begin with an empty set of features. This set will gradually grow as the algorithm progresses.
• Step 2: Model Training and Evaluation
• Train a machine learning model (e.g., linear regression, decision tree, etc.) using the dataset with the currently selected features. In the first iteration this set is empty, so the search effectively begins by evaluating each single-feature candidate in Step 3.
• Evaluate the model's performance using a suitable metric, such as mean squared error (MSE) for regression tasks or accuracy for classification tasks.
• Step 3: Feature Selection
• In each iteration of forward selection, consider adding one of the remaining candidate
features to the set of selected features.
• Train a new model with the current set of selected features plus the candidate feature.
• Evaluate the performance of the new model using the same metric as in Step 2.
• Step 4: Select the Best Feature
• Among all the candidate features considered in the current iteration, choose the one that leads to the best
improvement in model performance. This is typically determined by comparing the model's performance
metrics.
• Add the selected feature to the set of selected features.
• Step 5: Stopping Criterion
• Decide on a stopping criterion. This could be a predefined number of features to select, a specific
performance threshold, or any other relevant criterion.
• Check if the stopping criterion is met. If it is, stop the forward selection process. Otherwise, continue to the
next iteration.
• Step 6: Final Model
• Once the stopping criterion is met, the selected features form the final set of features to be used in your
model.
• Train a final model using all the selected features.
• Evaluate the final model on a separate test dataset to assess its performance in a more realistic scenario.
• Suppose you're working on a predictive modeling task to predict house prices.
• You have a dataset with features like the number of bedrooms, square footage,
presence of a garage, distance to the nearest school, and age of the house.
• You start with an empty set of features.
• In each iteration, you consider adding one feature to the set and measure how
much it improves the model's ability to predict house prices.
• You continue this process until you've added a predefined number of features or
until you're satisfied with the model's performance.
• The selected features form the final set used in your model to predict house prices. A minimal code sketch of this loop follows below.
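• Below is a minimal forward-selection sketch, not taken from the original slides: the synthetic dataset, the LinearRegression model, and the max_features stopping criterion are all illustrative assumptions.

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Illustrative synthetic regression data (a stand-in for the house-price dataset)
X, y = make_regression(n_samples=200, n_features=10, n_informative=4, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

selected = []                        # Step 1: start with an empty feature set
remaining = list(range(X.shape[1]))
max_features = 4                     # Step 5: assumed stopping criterion

while remaining and len(selected) < max_features:
    best_feature, best_mse = None, float("inf")
    for f in remaining:              # Step 3: try each remaining candidate feature
        trial = selected + [f]
        model = LinearRegression().fit(X_train[:, trial], y_train)
        mse = mean_squared_error(y_val, model.predict(X_val[:, trial]))
        if mse < best_mse:
            best_feature, best_mse = f, mse
    selected.append(best_feature)    # Step 4: keep the best-performing candidate
    remaining.remove(best_feature)
    print(f"Added feature {best_feature}, validation MSE = {best_mse:.3f}")

print("Selected feature indices:", selected)  # Step 6: final feature set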
Embedded Methods
• Embedded methods incorporate feature selection as part of the model
training process.
• These techniques automatically select relevant features during model
training.
• Lasso Regression and Random Forest Importance are two widely used embedded methods.
• Lasso Regression
• Lasso Regression introduces a regularization term that penalizes the absolute values of the feature coefficients. As a result, some coefficients become exactly zero, effectively removing the corresponding features from the model. This technique encourages sparsity and performs feature selection simultaneously.
• Lasso is a modification of linear regression in which the model is penalized for the sum of the absolute values of its weights. The absolute values of the weights are therefore (in general) shrunk, and many tend to become exactly zero.
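• Concretely, scikit-learn's Lasso minimizes the objective (1 / (2 * n_samples)) * ||y - Xw||_2^2 + alpha * ||w||_1, where the L1 term alpha * ||w||_1 is what drives individual weights to exactly zero.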
• Initialize and Train the Lasso Regression Model: Create a Lasso Regression model and specify the strength of the regularization penalty, typically denoted alpha. Larger alpha values lead to stronger regularization and hence more aggressive feature selection. You can use techniques like cross-validation to choose an appropriate alpha value.
• from sklearn.linear_model import Lasso
• alpha = 0.01 # Adjust the value of alpha based on your data and requirements
• lasso_model = Lasso(alpha=alpha)
• lasso_model.fit(X_train, y_train)
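• If you prefer to choose alpha by cross-validation, as mentioned above, scikit-learn's LassoCV automates this. A brief sketch (the cv=5 setting is an illustrative assumption):

from sklearn.linear_model import LassoCV

# Fit Lasso with the regularization strength chosen by 5-fold cross-validation
lasso_model = LassoCV(cv=5).fit(X_train, y_train)
print("Chosen alpha:", lasso_model.alpha_)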
• Feature Selection: Lasso Regression will automatically perform feature selection by shrinking the coefficients of less important features towards zero. After training the model, you can examine the coefficients to identify which features were selected.
• selected_features = X.columns[lasso_model.coef_ != 0]
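• Putting these steps together, here is a minimal end-to-end sketch; the diabetes dataset and alpha = 0.01 are illustrative assumptions, not part of the original example:

import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Illustrative dataset: 10 standardized numeric features
data = load_diabetes()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Assumed alpha; in practice, tune it (e.g., with LassoCV as above)
lasso_model = Lasso(alpha=0.01, max_iter=10_000)
lasso_model.fit(X_train, y_train)

# Features whose coefficients were not shrunk to zero
selected_features = X.columns[lasso_model.coef_ != 0]
print("Selected features:", list(selected_features))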
• Random Forest Importance
• Random Forest Importance (RFI) is a technique used to perform
feature selection by leveraging the capabilities of a Random Forest
classifier or regressor. Random Forest is an ensemble learning method
that combines multiple decision trees to make predictions. RFI
measures the importance of each feature in the Random Forest
model and ranks them based on their contribution to the model's
predictive performance. Features that contribute the most to
reducing impurity or error are considered more important.
• Train a Random Forest Model: Create and train a Random Forest classifier using your training data (this assumes RandomForestClassifier has been imported from sklearn.ensemble).
• rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
• rf_classifier.fit(X_train, y_train)
• Feature Importance Calculation: Retrieve the feature importances from the trained Random Forest model.
• feature_importances = rf_classifier.feature_importances_
• Rank Features: Rank the features by their importance scores in descending order (this assumes pandas has been imported as pd). You can use this ranking to select the top features for your model.
• feature_ranking = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importances})
• feature_ranking = feature_ranking.sort_values(by='Importance', ascending=False)
• Select Top Features: Choose the top N features based on your requirements. You can select a fixed number of features or use a threshold on the importance score.
• top_n_features = feature_ranking.head(N) # Replace N with the desired number of features
• selected_features = top_n_features['Feature'].tolist()
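• Alternatively, to select by a threshold rather than a fixed N (the cutoff value below is an illustrative assumption):

# Keep only features whose importance exceeds an assumed cutoff
threshold = 0.01
selected_features = feature_ranking.loc[feature_ranking['Importance'] > threshold, 'Feature'].tolist()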
• Train and Evaluate the Model with Selected Features: Train a Random Forest model using only the selected features and evaluate its performance.
• X_train_selected = X_train[selected_features]
• X_test_selected = X_test[selected_features]
• rf_classifier.fit(X_train_selected, y_train)
• y_pred = rf_classifier.predict(X_test_selected)
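• To finish the evaluation, compare the predictions against the held-out labels; accuracy is used here as one reasonable metric for a classifier:

from sklearn.metrics import accuracy_score

# Accuracy of the model trained on the reduced feature set
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy with selected features:", accuracy)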