6CS4-02 ML PPT Unit-3
Machine Learning
(6CS4-02)
Unit-III
AIET, Jaipur MACHINE LEARNING CS-VI Sem
Reference Books
T1. Pattern Recognition and Machine Learning — Christopher M. Bishop, Springer
Introduction to Statistical Learning Theory
1. Feature Extraction
   1. Principal component analysis
   2. Singular value decomposition
2. Feature selection
3. Feature ranking
   1. Subset selection
   2. Filter
   3. Wrapper
   4. Embedded methods
4. Evaluating Machine Learning algorithms
5. Model Selection
Dimensionality Reduction
Reducing the dimension of the feature space is called “dimensionality reduction.”
There are many ways to achieve dimensionality reduction, but most of these
techniques fall into one of two classes:
Feature Selection
Feature Extraction
Feature Extraction
Feature extraction is a process of dimensionality
reduction by which an initial set of raw data is reduced to
more manageable groups for processing. A characteristic
of these large data sets is a large number of variables that
require a lot of computing resources to process.
1. Do you want to reduce the number of variables, but aren’t able to identify variables
to completely remove from consideration?
2. Do you want to ensure your variables are independent of one another?
3. Are you comfortable making your independent variables less interpretable?
If you answered “yes” to all three questions, then PCA is a good method to use. If you
answered “no” to question 3, you should not use PCA.
By projecting our data into a smaller space, we reduce the dimensionality of our feature
space… but because each new "direction" is a linear combination of the original variables,
information from all of the original variables is retained in the model.
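The projection described above can be sketched in pure Python for the two-dimensional case. This is a minimal illustration, not production PCA (in practice one would use a library routine such as sklearn.decomposition.PCA); the function name and closed-form 2x2 eigen-decomposition are choices made here for clarity:

```python
import math

def pca_2d(points):
    """Project 2-D points onto their first principal component.

    Steps: center the data, form the 2x2 sample covariance matrix,
    take its leading eigenvector in closed form, and project.
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    xs = [p[0] - mx for p in points]
    ys = [p[1] - my for p in points]
    # Sample covariance entries (divisor n - 1)
    sxx = sum(x * x for x in xs) / (n - 1)
    syy = sum(y * y for y in ys) / (n - 1)
    sxy = sum(x * y for x, y in zip(xs, ys)) / (n - 1)
    # Leading eigenvalue of [[sxx, sxy], [sxy, syy]]
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    lam = tr / 2 + math.sqrt(tr * tr / 4 - det)
    # Corresponding eigenvector (handle the axis-aligned case)
    if abs(sxy) > 1e-12:
        v = (lam - syy, sxy)
    else:
        v = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    v = (v[0] / norm, v[1] / norm)
    # 1-D scores: projection of each centered point onto v
    return [x * v[0] + y * v[1] for x, y in zip(xs, ys)]
```

For points that already lie on a line, the first component captures all of the variance and the scores are simply signed distances along that line.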
“The mean value of the product of the deviations of two variates from their
respective means.”
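The quoted definition translates directly into code. A sketch (the quote averages over n; the version below uses the unbiased n - 1 divisor commonly applied to samples):

```python
def covariance(xs, ys):
    """Sample covariance: the product of each pair's deviations from
    their respective means, averaged (here with divisor n - 1)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
```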
References
PCA:
https://medium.com/analytics-vidhya/understanding-principle-component-analysis-pca-step-by-step-e7a4bb4031d9
SVD:
https://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm
Feature Selection
Overview
• Why we need FS:
1. To improve performance (in terms of speed, predictive power, simplicity of the model).
2. To visualize the data for model selection.
3. To reduce dimensionality and remove noise.
Perspectives
1. searching for the best subset of features.
Perspectives:
Search of a Subset of Features
• FS can be considered a search problem, where
each state of the search space corresponds to a
concrete subset of selected features.
• The selection can be represented as a binary
array, with each element set to 1 if the
corresponding feature is currently selected by the
algorithm, and 0 if it is not.
• There are 2^M possible subsets in total, where M is
the number of features in the data set.
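The binary-mask view can be sketched with a small, illustrative helper (the function name is hypothetical) that enumerates every candidate subset, making the 2^M blow-up concrete:

```python
from itertools import product

def all_feature_subsets(features):
    """Enumerate every subset of a feature list as a (mask, subset) pair.

    Each mask is a binary tuple: bit j is 1 if feature j is selected.
    A data set with M features yields 2**M candidate subsets, which is
    why exhaustive search quickly becomes infeasible.
    """
    subsets = []
    for mask in product([0, 1], repeat=len(features)):
        chosen = [f for f, bit in zip(features, mask) if bit == 1]
        subsets.append((mask, chosen))
    return subsets
```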
Perspectives:
Search of a Subset of Features
Search Space: [figure]
Perspectives:
Search of a Subset of Features
• Search Directions:
– Sequential Forward Generation (SFG): It starts with an empty set of features
S. As the search proceeds, features are added to S according to some criterion
that distinguishes the best feature from the others. S grows until it reaches the
full set of original features. The stopping criterion can be a threshold on the
number of relevant features m, or simply the generation of all possible
subsets in brute-force mode.
Perspectives:
Search of a Subset of Features
• Search Directions:
– Bidirectional Generation (BG): Begins the search in both directions,
performing SFG and Sequential Backward Generation (SBG, which starts
from the full feature set and removes features) concurrently. The searches
stop in two cases: (1) when one search finds the best subset of m features
before reaching the exact middle of the search space, or (2) when both
searches reach the middle. BG takes advantage of both SFG and SBG.
Perspectives:
Selection Criteria
– Information Measures.
• Information serves to measure the uncertainty of the
receiver when she/he receives a message.
• Shannon’s Entropy:
• Information gain:
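The two formulas referenced above did not survive the slide export; they are standard and can be reconstructed as:

```latex
% Shannon's entropy of a class variable Y
H(Y) = -\sum_{y} p(y) \log_2 p(y)

% Information gain of a feature X with respect to Y
IG(Y; X) = H(Y) - H(Y \mid X)
         = H(Y) - \sum_{x} p(x)\, H(Y \mid X = x)
```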
Perspectives:
Selection Criteria
– Dependence Measures.
• Also known as measures of association or correlation.
• Their main goal is to quantify how strongly two variables
are correlated, or present some association with each
other, such that knowing the value of one of them,
we can predict the value of the other.
• Pearson correlation coefficient:
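The coefficient's formula was lost in the export; it is the covariance of the two variables divided by the product of their standard deviations. A pure-Python sketch:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient in [-1, 1]:
    covariance(x, y) / (std(x) * std(y))."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```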
Perspectives:
Selection Criteria
– Consistency Measures.
• They attempt to find a minimum number of features that
separate the classes as well as the full set of features can.
Perspectives
• Filters:
– Measuring uncertainty, distances, dependence or
consistency is usually cheaper than measuring the
accuracy of a learning process, so filter methods are
usually faster.
– They do not rely on a particular learning bias, so the
selected features can be used to learn different
models with different DM techniques.
– They can handle larger data sets, due to the simplicity and
low time complexity of the evaluation measures.
Perspectives
• Wrappers:
– Can achieve the purpose of improving the particular
learner's predictive performance.
– Use internal statistical validation to control
overfitting, ensembles of learners, and
hybridizations with heuristic learning such as Bayesian
classifiers or Decision Tree induction.
– Filter models cannot allow a learning algorithm to
fully exploit its bias, whereas wrapper methods can.
Perspectives
• Embedded FS:
– Similar to the wrapper approach in that the
features are specifically selected for a certain
learning algorithm, but here the features
are selected during the learning process itself.
– They can take better advantage of the available data by
not requiring the training data to be split into
training and validation sets, and can reach a
solution faster by avoiding the re-training of a
predictor for each feature subset explored.
Comparison
• Filters: a much faster alternative. Filters do not test any particular
algorithm, but rank the original features according to their
relationship with the problem (the labels) and simply select the top of
them. Correlation and mutual information are the most widespread
criteria. There are many easy-to-use tools, such as scikit-learn's
feature_selection module.
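As a minimal illustration of the filter idea (scikit-learn's feature_selection module, e.g. SelectKBest, provides production-ready versions), a pure-Python correlation filter might look like this; the function names are choices made here, not library APIs:

```python
import math

def correlation_filter(X, y, k):
    """Rank features by |Pearson correlation| with the labels and
    return the indices of the top k. X is a list of sample rows."""
    def pearson(col, y):
        n = len(y)
        mc, my = sum(col) / n, sum(y) / n
        cov = sum((c - mc) * (t - my) for c, t in zip(col, y))
        sc = math.sqrt(sum((c - mc) ** 2 for c in col))
        sy = math.sqrt(sum((t - my) ** 2 for t in y))
        return cov / (sc * sy) if sc and sy else 0.0  # constant column -> 0

    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        col = [row[j] for row in X]
        scores.append((abs(pearson(col, y)), j))
    scores.sort(reverse=True)          # highest |correlation| first
    return [j for _, j in scores[:k]]
```

Note that this evaluates each feature independently of the learner, which is exactly what makes filters fast but also blind to feature interactions.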
Aspects:
Output of Feature Selection
• Feature Ranking Techniques:
– We expect as output a ranked list of features,
ordered according to an evaluation measure.
– They return the relevance of each feature.
– To perform actual FS, the simplest way is to
choose the first m features for the task at hand,
whenever an appropriate value of m is known.
Aspects:
Output of Feature Selection
• Minimum Subset Techniques:
– The number of relevant features is a parameter
that is often not known to the practitioner.
– Hence a second category of techniques focuses
on obtaining the minimum possible subset,
without ordering the features.
– Whatever falls within the subset is relevant;
everything outside it is irrelevant.
Aspects:
Evaluation
• Goals:
– Inferability: For predictive tasks, an improvement in
the prediction of unseen examples with respect to
direct usage of the raw training data.
– Interpretability: Since raw data is hard for humans to
comprehend, DM is also used to generate a more
understandable representation of structure that can explain
the behavior of the data.
– Data Reduction: It is better and simpler to handle data
with lower dimensions, in terms of both efficiency and
interpretability.
Aspects:
Evaluation
• We can derive assessment measures
from these three goals:
– Accuracy
– Complexity
Aspects:
Drawbacks
• The resulting subsets of many FS models are strongly dependent
on the training set size.
• It is not always true that a high-dimensional input can be
reduced to a small subset of features, because the target
feature may be related to many input features, and the
removal of any of them would seriously affect learning
performance.
• A backward-removal strategy is very slow when working with
large-scale data sets, because in the first stages of the
algorithm it has to make decisions based on huge quantities of
data.
• In some cases, the FS outcome is still left with a relatively
large number of relevant features, which can inhibit the use of
complex learning methods.
Aspects:
Using Decision Trees for FS
• Decision trees can be used to implement a
trade-off between the performance of the
selected features and the computation time
which is required to find a subset.
Loss/Objective function
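The slide's formula did not survive the export. As one representative example (an assumption here, not necessarily the slide's original), the mean squared error objective used in regression:

```python
def mse_loss(y_true, y_pred):
    """Mean squared error: the average of squared differences
    between targets and predictions. Lower is better; minimizing
    this objective drives the model's parameter updates."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
```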
High bias implies our estimate based on the observed data is not close to the true parameter
(aka underfitting).
High variance implies our estimates are sensitive to sampling: they will vary a lot if computed
on a different sample of data (aka overfitting).
Holdout validation
Within holdout validation we have 2 choices: Single holdout and repeated holdout.
a) Single Holdout
Implementation
The basic idea is to split our data into a training set and a holdout test set. Train the model on the
training set and then evaluate model performance on the test set. We take only a single holdout,
hence the name. Let's walk through the steps:
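The step-by-step figures did not survive the export; a minimal pure-Python sketch of the single-holdout split (the 80/20 ratio and fixed seed are illustrative choices, not prescribed by the slides):

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Single holdout: shuffle once, carve off a test set,
    and keep the rest for training."""
    rng = random.Random(seed)   # fixed seed -> reproducible split
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

# Usage: fit a model on `train`, then evaluate it once on `test`.
train, test = train_test_split(list(range(100)), test_ratio=0.2)
```

Because the test set is touched only once, its score is an unbiased estimate of performance on unseen data; the drawback is that the estimate depends on which examples happened to land in the holdout.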
Resources:
https://heartbeat.fritz.ai/model-evaluation-selection-i-30d803a44ee
https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-machine-learning-tips-and-tricks
https://docs.microsoft.com/en-us/azure/machine-learning/media/algorithm-cheat-sheet/machine-learning-algorithm-cheat-sheet.svg