Lecture #10
Some topics related to NLP

Feature selection
• Feature selection is one of the core concepts in machine learning and it strongly impacts the performance of your model: the features that you use to train your machine learning models have a huge influence on the performance you can achieve.
• In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are used for several reasons:
• Simplification of models to make them easier to interpret by researchers/users
• Shorter training times
• To avoid the curse of dimensionality
• To improve the data's compatibility with a learning model class
Feature selection
• Feature selection techniques should be distinguished from feature
extraction. Feature extraction creates new features from functions of the
original features, whereas feature selection returns a subset of the
features. Feature selection techniques are often used in domains where
there are many features and comparatively few samples (or data points).
• A feature selection algorithm can be seen as the combination of a search
technique for proposing new feature subsets, along with an evaluation
measure which scores the different feature subsets. The simplest
algorithm is to test each possible subset of features finding the one which
minimizes the error rate.
• Irrelevant or partially relevant features can negatively impact model
performance.
• Feature selection and data cleaning should be the first and most important steps when designing your model.
Why feature selection
• What is / why feature selection?
• A procedure in machine learning for finding a subset of features that produces a ‘better’ model for a given dataset
– Avoid overfitting and achieve better generalization ability
– Reduce the storage requirement and training time
– Interpretability
When feature selection is important

• Noisy data
• Lots of low-frequency features
• Multiple types of features are used
• Too many features compared to the number of samples
• Complex models
• Samples in the real scenario are not homogeneous with the training & test samples
How to Select Features
• How to select features, and what are the benefits of performing feature selection before modeling your data?
• Reduces Overfitting: less redundant data means less opportunity to make decisions based on noise.
• Improves Accuracy: less misleading data means modeling accuracy improves.
• Reduces Training Time: fewer data points reduce algorithm complexity and algorithms train faster.
Goal of feature selection
• The goal of feature selection in machine learning is to find the best set of features that allows one to build useful models of the studied phenomena.
• The techniques for feature selection in machine learning can be broadly classified into the following categories:
• Supervised Techniques: these can be used for labeled data, and are used to identify the relevant features for increasing the efficiency of supervised models such as classification and regression.
• Unsupervised Techniques: these can be used for unlabeled data, for example in clustering.
Types of Feature Selection
From a taxonomic point of view, these techniques are classified as follows:
A. Filter methods
B. Wrapper methods
C. Embedded methods
D. Hybrid methods
• Filter methods perform the feature selection independently of the construction of the classification model.
• Wrapper methods iteratively select or eliminate a set of features using the prediction accuracy of the classification model.
• In embedded methods the feature selection is an integral part of the classification model.
Filter methods
Filter type methods select variables regardless of the model.
They are based only on general features like the correlation with
the variable to predict. Filter methods suppress the least
interesting variables. The other variables will be part of a
classification or a regression model used to classify or to predict
data. These methods are particularly effective in computation
time and robust to overfitting.
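As an illustration (not part of the original slide), here is a minimal filter-style sketch: each feature is scored against the label with mutual information and the top-scoring columns are kept, without ever training the downstream classifier. The iris dataset and the choice of keeping two features are arbitrary assumptions for the example.

```python
# Filter method sketch: score every feature independently of any model,
# then keep the k highest-scoring columns.
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)
scores = mutual_info_classif(X, y, random_state=0)

top2 = scores.argsort()[::-1][:2]      # indices of the two best features
X_filtered = X[:, top2]
print(top2, scores.round(3), X_filtered.shape)
```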
Types of Feature Selection
Wrapper methods: Wrapper methods evaluate subsets of variables, which, unlike filter approaches, makes it possible to detect interactions amongst variables. The two main disadvantages of these methods are:
•The increasing overfitting risk when the number of observations
is insufficient.
•The significant computation time when the number of variables
is large.
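As a sketch of the wrapper idea (an illustration, not the slide's own example), scikit-learn's RFE repeatedly refits a chosen classifier and discards the weakest feature each round; the breast-cancer dataset and logistic regression are arbitrary choices here.

```python
# Recursive feature elimination: the classifier itself guides which features
# are kept, so the model is refit many times (hence the higher cost).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
selector = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5)
selector.fit(X, y)
print(selector.support_)               # boolean mask over the 30 input features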
Types of Feature Selection
Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process. The exemplar of this approach is the LASSO method.
Embedded methods have recently been proposed that try to combine the advantages of both previous methods: a learning algorithm takes advantage of its own variable selection process and performs feature selection and classification simultaneously, such as the FRMT algorithm.
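A minimal embedded-method sketch using scikit-learn's Lasso together with SelectFromModel; the diabetes dataset and the alpha value are arbitrary assumptions for illustration.

```python
# The L1 penalty drives some coefficients exactly to zero while the model is
# being fit, so feature selection is built into training itself.
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
lasso = Lasso(alpha=0.5).fit(X, y)

selector = SelectFromModel(lasso, prefit=True)
print(selector.get_support())          # True where the coefficient is non-zero
X_selected = selector.transform(X)
```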
Classifying Features

Relevance: features that have an influence on the output and whose role cannot be assumed by the rest.
Irrelevance: features that do not have any influence on the output, and whose values are generated at random for each example.
Redundancy: a redundancy exists whenever a feature can take the role of another.
Typical methods for feature selection

Categories
• Single feature evaluation
– Frequency based, mutual information, KL divergence, Gini-indexing, information gain, Chi-square statistic
• Subset selection method
– Sequential forward selection
– Sequential backward selection
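A sketch of both subset-selection directions using scikit-learn's SequentialFeatureSelector (available from scikit-learn 0.24 onwards); the dataset and the k-NN estimator are arbitrary choices for the example.

```python
# Sequential forward selection starts from an empty set and greedily adds the
# feature that helps cross-validated accuracy most; backward selection starts
# from the full set and greedily removes features.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

forward = SequentialFeatureSelector(knn, n_features_to_select=2,
                                    direction="forward").fit(X, y)
backward = SequentialFeatureSelector(knn, n_features_to_select=2,
                                     direction="backward").fit(X, y)
print(forward.get_support(), backward.get_support())
```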
Single feature evaluation
• Measure the quality of features by various kinds of metrics:
– Frequency based
– Dependence of feature and label (co-occurrence): mutual information, Chi-square statistic
– Information theory: KL divergence, information gain
– Gini index
Frequency based

– Remove features according to the frequency of the feature, or the number of instances that contain the feature
Typical scenario
– Text mining
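In a text-mining setting this is typically done with document-frequency cut-offs; a minimal sketch with scikit-learn's CountVectorizer over a made-up corpus (the min_df/max_df values are arbitrary assumptions):

```python
# min_df drops terms that appear in fewer than 2 documents; max_df drops terms
# that appear in more than 90% of documents (very common, uninformative terms).
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran", "one rare token here"]
vectorizer = CountVectorizer(min_df=2, max_df=0.9)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())   # only terms passing both cut-offs
```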
Correlation criteria

• One of the simplest criteria is the Pearson correlation coefficient between a feature X_i and the target Y, defined as:
R(i) = cov(X_i, Y) / sqrt(var(X_i) · var(Y))
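A minimal sketch (on synthetic, assumed data) of computing this criterion per feature with NumPy:

```python
# Pearson correlation of each feature column with the target y.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                      # 4 candidate features
y = 2.0 * X[:, 0] - X[:, 2] + rng.normal(scale=0.1, size=100)

# np.corrcoef(a, b)[0, 1] is the Pearson correlation between a and b.
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
print(np.round(r, 3))   # columns 0 and 2 correlate strongly with y
```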
Removing features with low variance
VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn't meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.
As an example, suppose that we have a dataset with boolean features, and we want to remove all features that are either one or zero (on or off) in more than 80% of the samples. Boolean features are Bernoulli random variables, and the variance of such variables is given by Var[X] = p(1 - p), so we can select using the threshold .8 * (1 - .8):
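A minimal sketch of that 80% example with scikit-learn's VarianceThreshold (the small Boolean matrix below is a toy illustration):

```python
from sklearn.feature_selection import VarianceThreshold

# Toy Boolean data: the first column is 0 in 5 of 6 samples, so its variance
# p(1 - p) = (1/6)(5/6) ~= 0.14 is below the 0.8 * (1 - 0.8) = 0.16 threshold.
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]

sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
print(sel.fit_transform(X))            # the first column is removed
```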
Univariate feature selection
• Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator. Scikit-learn exposes feature selection routines as objects that implement the transform method:
• SelectKBest removes all but the k highest scoring features
• SelectPercentile removes all but a user-specified highest scoring percentage of features
• SelectFpr, SelectFdr and SelectFwe use common univariate statistical tests for each feature: false positive rate, false discovery rate, or family-wise error, respectively.
• GenericUnivariateSelect allows performing univariate feature selection with a configurable strategy. This allows selecting the best univariate selection strategy with a hyper-parameter search estimator.
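A minimal sketch of the first two routines on the bundled iris data; the ANOVA F-score f_classif is an arbitrary choice of scoring test.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 best-scoring features, or the top 50% of features, respectively.
X_k = SelectKBest(f_classif, k=2).fit_transform(X, y)
X_p = SelectPercentile(f_classif, percentile=50).fit_transform(X, y)
print(X_k.shape, X_p.shape)            # (150, 2) (150, 2)
```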
Chi-square (χ²)

Chi-square (χ²) is a popular feature selection method in which a predefined, finite number of words is used to represent the documents (Forman, 2003). In this method, a positive or negative value (word) in the documents represents a particular category. In Chi-square, the frequency of the feature is totally ignored.
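A minimal sketch of chi-square term scoring on a tiny, made-up labelled corpus; scikit-learn's chi2 returns a score and a p-value per term.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

docs = ["good great film", "great acting", "bad boring film", "boring plot"]
labels = [1, 1, 0, 0]                  # 1 = positive review, 0 = negative

vec = CountVectorizer()
X = vec.fit_transform(docs)
scores, p_values = chi2(X, labels)     # dependence of each term on the label
print(dict(zip(vec.get_feature_names_out(), scores.round(2))))
```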
TF/IDF

TF/IDF is a popular technique in the field of natural language processing (NLP) that is used to determine the relative frequency of words in a specific document through an inverse proportion of the word over the entire corpus. Among the many weighting techniques available, TF/IDF is a popular one (Dsouza & Ansari, 2015). It is an unsupervised technique and does not use class information.
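A minimal TF/IDF sketch over a made-up corpus with scikit-learn's TfidfVectorizer; note that no labels are involved, matching the unsupervised nature described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)          # rows = documents, columns = terms
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))            # rare-in-corpus terms get higher weight
```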
Distinguishing Feature Selector

Uysal and Gunal (2012) introduced a probabilistic feature ranking method that assigns a high rank to a term which appears frequently in one class, irrespective of the class size. The Distinguishing Feature Selector is an improved version of Mutual Information that normalizes the weight on n-grams and assigns weights in the range [0.5, 1].
Relative Discrimination Criterion
Rehman et al. (2015) proposed a new feature ranking method, namely the "Relative Discrimination Criterion (RDC)", for text data. The RDC enhances the rank of frequently occurring terms present in one class. The RDC estimates the rank of a term by taking a weighted difference of the tpr and fpr for each term count. By incorporating (tpr, fpr) while determining the rank of a term, this method selects more useful terms.
Thanks for Listening
