Week_7_Notes[1]
4. Hypothesis Space (ℋ): This is the set of all possible functions that the
learning algorithm can choose from to approximate the target function.
Hypotheses are often represented as conjunctions of literals (e.g., specific
values or “don’t care” conditions for each attribute).
5. Training Examples (S): These are the specific instances and their
corresponding labels used to train the learning algorithm.
Computational Learning Theory (CoLT) focuses on the design and analysis of learning
algorithms using formal mathematical methods.
Key Concepts
CoLT uses theoretical computer science tools to quantify and analyze learning
problems. This includes characterizing the difficulty of learning specific tasks and
understanding the computational complexity involved.
One of the central frameworks in CoLT is Probably Approximately Correct (PAC)
learning. It provides a way to quantify the computational difficulty of a learning task by
defining how well a learning algorithm can perform given a certain amount of data and
computational resources.
Error of a hypothesis
The true error of a hypothesis h, with respect to a target concept c and an instance
distribution 𝒟, is the probability that h disagrees with c on an instance drawn at
random from 𝒟:
error_𝒟(h) = Pr_{x∼𝒟}[ h(x) ≠ c(x) ]
In a perfect world, we’d like the true error to be 0.
Bias: Fix a hypothesis space H. The target concept c may not be in H => find an h in
H that is close to c.
A hypothesis h is approximately correct if error_𝒟(h) ≤ 𝜀.
PAC
The PAC model is a framework in computational learning theory that aims to
understand how well a learning algorithm can generalize from a finite set of training
examples to unseen instances.
Goal: h has small error over 𝒟
- This means h should correctly classify most instances drawn from 𝒟.
•PAC Learning concerns efficient learning.
•We would like to prove that:
–With high probability, an (efficient) learning algorithm will find a hypothesis that is
approximately identical to the hidden target concept.
•We specify two parameters, 𝜀 and 𝛿, and require that with probability at least (1−𝛿)
the learner outputs a hypothesis with error at most 𝜀.
Realizable Case
In the realizable case, we assume that there exists a hypothesis h in the hypothesis
space that perfectly classifies all training examples. The goal is to determine how
many training examples are needed to ensure that the hypothesis learned from the
training data will have small error on new, unseen data.
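For a finite hypothesis space, the "how many training examples" question has a
standard answer in the realizable case: m ≥ (1/𝜀)(ln|H| + ln(1/𝛿)) examples suffice
for any consistent learner. A minimal sketch (the |H|, 𝜀, and 𝛿 values below are
illustrative assumptions, not from the notes):

```python
import math

def pac_sample_bound(h_size, eps, delta):
    """Training examples sufficient so that, with probability >= 1 - delta,
    any hypothesis consistent with the sample has true error <= eps
    (finite hypothesis space, realizable case)."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# Illustrative numbers: |H| = 2**10, eps = 0.1, delta = 0.05.
m = pac_sample_bound(h_size=2 ** 10, eps=0.1, delta=0.05)   # 100 examples
```

Note how the bound grows only logarithmically in |H| and 1/𝛿, but linearly in 1/𝜀:
halving the target error roughly doubles the required sample size.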
Find-S Algorithm is a simple and intuitive algorithm used in concept learning to find
the most specific hypothesis that fits all the positive examples in a given dataset.
Here’s a detailed explanation:
Key Concepts
1. Hypothesis: A representation of the concept being learned.
2. Positive Examples: Instances that belong to the target concept.
3. Negative Examples: Instances that do not belong to the target concept.
Find-S Algorithm steps:
1. Initialize: Start with the most specific hypothesis, which means all attributes
are set to their most specific values.
2. Iterate through Examples:
o For each positive example, update the hypothesis to be more general if
necessary.
o Ignore negative examples.
3. Update Hypothesis: For each attribute in the positive example that differs
from the current hypothesis, replace the specific value with a question mark (?),
indicating that any value is acceptable for that attribute.
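The three steps above can be sketched in a few lines of Python. The attribute
values and examples below are hypothetical (in the spirit of the usual
weather/EnjoySport illustration); 'Ø' marks the most specific "no value yet"
placeholder and '?' is the "don't care" wildcard:

```python
def find_s(examples):
    """Return the most specific hypothesis consistent with the positives."""
    n_attrs = len(examples[0][0])
    h = ['Ø'] * n_attrs                  # 1. start maximally specific
    for attrs, label in examples:
        if label != 'yes':               # 2. negatives are ignored
            continue
        for i, value in enumerate(attrs):
            if h[i] == 'Ø':              # first positive pins the value down
                h[i] = value
            elif h[i] != value:          # 3. generalize mismatches to '?'
                h[i] = '?'
    return h

data = [
    (('sunny', 'warm', 'normal'), 'yes'),
    (('sunny', 'warm', 'high'),   'yes'),
    (('rainy', 'cold', 'high'),   'no'),
]
hypothesis = find_s(data)                # ['sunny', 'warm', '?']
```

The two positive examples agree on the first two attributes but differ on the third,
so only the third is generalized to '?'; the negative example never touches h.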
VC Dimension:
The Vapnik-Chervonenkis (VC) dimension is a measure of the capacity of a hypothesis
class. It is defined as the largest number of points that can be shattered by the
hypothesis class.
For instance, a linear classifier in a two-dimensional space has a VC dimension of 3:
it can shatter some set of 3 points (any 3 points not on a common line), but it cannot
shatter any set of 4 points.
•If for every such labeling a function can be found in ℋ consistent with this labeling,
we say that the set of instances is shattered by ℋ.
VC Dimension
Question:
Suppose the VC dimension of a hypothesis space is 6. Which of the following are true?
A) At least one set of 6 points can be shattered by the hypothesis space.
B) Two sets of 6 points can be shattered by the hypothesis space.
C) All sets of 6 points can be shattered by the hypothesis space.
D) No set of 7 points can be shattered by the hypothesis space.
Ans: A and D
A) At least one set of 6 points can be shattered by the hypothesis space.
D) No set of 7 points can be shattered by the hypothesis space.
Here's the explanation:
A) At least one set of 6 points can be shattered by the hypothesis
space: This is true. The definition of VC dimension (Vapnik-Chervonenkis
dimension) tells us that the hypothesis space can shatter a set of size equal to
the VC dimension. Therefore, since the VC dimension is 6, there exists at least
one set of 6 points that can be shattered (i.e., classified in all possible ways
by the hypothesis space).
B) Two sets of 6 points can be shattered by the hypothesis space: This
is not necessarily true. The VC dimension only guarantees that one set of 6
points can be shattered. It does not imply that multiple sets of 6 points can be
shattered.
C) All sets of 6 points can be shattered by the hypothesis space: This is
false. The VC dimension only guarantees that at least one set of 6 points can
be shattered, but not necessarily all sets of 6 points.
D) No set of 7 points can be shattered by the hypothesis space: This is
true. Since the VC dimension is 6, no set of 7 points can be shattered by the
hypothesis space, meaning the hypothesis space cannot correctly classify every
possible labeling of 7 points.
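This "at least one set, but not all labelings of a larger set" logic can be checked
mechanically for a tiny hypothesis class. The sketch below uses an assumed toy class,
1-D threshold classifiers h_t(x) = 1 iff x ≥ t, whose VC dimension is 1: it
enumerates every 0/1 labeling of a point set and asks whether some threshold
realizes it:

```python
from itertools import product

# Toy hypothesis class: thresholds t drawn from a grid of candidates.
thresholds = [i / 10 for i in range(-5, 16)]

def shatters(points):
    """True iff every 0/1 labeling of `points` is realized by some h_t."""
    for labeling in product([0, 1], repeat=len(points)):
        if not any(all((x >= t) == bool(lab) for x, lab in zip(points, labeling))
                   for t in thresholds):
            return False
    return True

one_point = shatters([0.5])        # a single point can be shattered
two_points = shatters([0.2, 0.8])  # but no threshold labels 0.2 -> 1, 0.8 -> 0
```

Exactly as in the question above: one shattered set of size 1 is enough to make the
VC dimension at least 1, and the single impossible labeling of {0.2, 0.8} is enough
to keep it below 2.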
Ensemble Learning:
Question:
Which of the following options are correct regarding the benefits of an ensemble
model?
1. Better performance
2. More generalized model
3. Better interpretability
A) 1 and 3
B) 2 and 3
C) 1 and 2
D) 1, 2 and 3
Ans: C
Better performance: Ensemble models, such as Random Forest or Gradient
Boosting, combine the predictions of multiple models, which often leads to better
performance compared to a single model. This is because they reduce overfitting and
variance by averaging the predictions or aggregating them in some manner.
More generalized model: Since ensemble models take advantage of multiple weak
learners or different models, they tend to generalize better on unseen data, reducing
the chances of overfitting on the training data.
Better interpretability: This is generally not true for ensemble models. In fact,
ensemble methods often result in complex models that are harder to interpret than
simpler models like decision trees or linear regression. Therefore, this option is
incorrect.
Generally, an ensemble method works better if the individual base models have
--------- ?
(Note: Individual models have accuracy greater than 50%)
A) Less correlation among predictions (means should predict different type of errors)
B) High correlation among predictions
C) Correlation does not have an impact on the ensemble output
D) None of the above.
Ans:
A) Less correlation among predictions.
Ensemble methods typically perform better when the individual base models make
diverse predictions. This diversity, often achieved through less correlation among the
predictions of the models, helps the ensemble to correct the errors of individual
models and improve overall performance.
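A quick simulation shows why diversity helps: majority-voting 11 classifiers that are
each right 60% of the time is noticeably more accurate than any one of them. The
setup is a hypothetical idealization, since it assumes the models' errors are fully
independent, which real base models rarely achieve:

```python
import random

random.seed(0)

def majority_vote_accuracy(n_models, p_correct, n_trials=10_000):
    """Monte-Carlo estimate of majority-vote accuracy over n_models
    base classifiers whose errors are independent."""
    wins = 0
    for _ in range(n_trials):
        n_right = sum(random.random() < p_correct for _ in range(n_models))
        if n_right > n_models // 2:      # majority of models correct
            wins += 1
    return wins / n_trials

single = 0.6                                   # each base model: 60% accurate
ensemble = majority_vote_accuracy(11, single)  # ~0.75 under independence
```

If the models were perfectly correlated (option B's scenario), they would all make
the same mistakes and the vote would stay at 60%.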
Question:
Which of the following algorithms is not an ensemble learning algorithm?
A) Random Forest
B) Adaboost
C) Decision Trees
The correct answer is C) Decision Trees.
Random Forest and AdaBoost are both ensemble learning algorithms.
Random Forest combines multiple decision trees to improve accuracy and
reduce overfitting. AdaBoost combines multiple weak classifiers to create a
strong classifier.
Decision Trees, on the other hand, are a single model and not an ensemble
method.
Question:
Which of the following is FALSE about bagging?
A) Bagging increases the variance of the classifier
B) Bagging can help make robust classifiers from unstable classifiers.
C) Majority Voting is one way of combining outputs from various classifiers which are
being bagged.
Ans: A
A) Bagging increases the variance of the classifier: This statement is false. In
fact, bagging reduces the variance of the classifier. The primary purpose of
bagging (Bootstrap Aggregating) is to reduce the variance by combining predictions
from multiple models trained on different subsets of the data. This leads to more
stable and generalized predictions, especially for high-variance models like decision
trees.
B) Bagging can help make robust classifiers from unstable classifiers: This is
true. Bagging is commonly used to improve the performance of unstable models, like
decision trees, by reducing overfitting and increasing robustness.
C) Majority Voting is one way of combining outputs from various classifiers
which are being bagged: This is true. In bagging, the outputs from individual
classifiers are often combined using methods like majority voting (for classification) or
averaging (for regression).
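A minimal bagging sketch, assuming a toy 1-D dataset and threshold "stumps" as the
unstable base learner: each stump is trained on its own bootstrap sample (drawn with
replacement), and the bagged prediction is a majority vote over the stumps:

```python
import random

random.seed(1)

# Toy dataset (assumed for illustration): label is 1 exactly when x > 0.5.
xs = [random.random() for _ in range(200)]
data = [(x, 1 if x > 0.5 else 0) for x in xs]

def train_stump(sample):
    """Fit a threshold classifier h(x) = [x > t] by trying each sample
    point as a candidate split -- a crude, unstable base learner."""
    best_t, best_acc = 0.0, -1.0
    for t, _ in sample:
        acc = sum((x > t) == (y == 1) for x, y in sample) / len(sample)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def bagged_predict(stumps, x):
    votes = sum(x > t for t in stumps)          # each stump casts a 0/1 vote
    return 1 if votes > len(stumps) / 2 else 0  # majority vote

# Bagging: every stump sees a different bootstrap resample of the data.
stumps = [train_stump(random.choices(data, k=len(data))) for _ in range(25)]
acc = sum(bagged_predict(stumps, x) == y for x, y in data) / len(data)
```

Each individual stump's threshold jumps around with its bootstrap sample; averaging
the votes smooths that variance out, which is exactly the effect described above.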
Boosting
•An iterative procedure. Adaptively change distribution of training data.
–Initially, all N records are assigned equal weights
–Weights change at the end of boosting round
•“Weak” learners
–P(correct) > 50%, but not necessarily much better.
Question:
Identify whether the following statement is true or false:
"Boosting is easy to parallelize whereas bagging is inherently a sequential process."
A) True
B) False
B) False
Here's why:
Bagging (Bootstrap Aggregating): Bagging is easier to parallelize because
the individual models (such as decision trees) are trained independently on
different subsets of the data. Since there is no dependency between the models
during training, it can be easily parallelized.
Boosting: Boosting, on the other hand, is an inherently sequential process
because each model is built based on the errors of the previous model. This
dependency between models means that boosting is harder to parallelize.
Therefore, the statement is false because boosting is sequential and hard to
parallelize, while bagging is easier to parallelize.
Adaboost
•Boosting can turn a weak algorithm into a strong learner.
Question:
In AdaBoost, we give more weights to data points having been misclassified in
previous iterations.
Now, if we introduce a limit or cap on the weight that any point can take (for example,
say we
introduce a restriction that prevents any point's weight from exceeding a value of 10).
Which
among the following would be the effect of such a modification?
A) It will have no effect on the performance of the Adaboost method.
B) It makes the final classifier robust to outliers.
C) It may result in lower overall performance.
D) None of these.
Ans: B & C
In AdaBoost, points that are misclassified in previous iterations are given higher
weights in subsequent iterations, meaning the model focuses more on these points.
However, if there are outliers (points that are very difficult or impossible to classify
correctly), the algorithm can end up putting too much weight on these points, causing
the overall performance to degrade. Capping the weights limits the influence of such
outliers, which makes the final classifier more robust to them (option B); at the same
time, it can prevent the algorithm from focusing enough on genuinely hard but
informative points, which may lower overall performance (option C).
Considering the AdaBoost algorithm, which among the following statements is true?
A) In each stage, we try to train a classifier which makes accurate predictions on a
subset
of the data points where the subset contains more of the data points which were
misclassified in earlier stages.
B) The weight assigned to an individual classifier depends upon the weighted sum
error of
misclassified points for that classifier.
C) Both option A and B are true
D) None of them are true
C) Both option A and B are true
Here's why:
A) In each stage, we try to train a classifier which makes accurate
predictions on a subset of the data points where the subset contains
more of the data points which were misclassified in earlier stages: This
is true. In AdaBoost, more weight is given to the data points that were
misclassified in previous iterations, so subsequent classifiers focus more on
those harder-to-classify points.
B) The weight assigned to an individual classifier depends upon the
weighted sum error of misclassified points for that classifier: This is also
true. The performance of each classifier is evaluated based on the weighted
sum of its errors, and classifiers with lower weighted errors are given higher
weight in the final model.
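Both statements can be seen in one round of AdaBoost's arithmetic. The sketch below
uses a hypothetical four-point sample and a fixed stump as the weak learner; it
computes the weighted error 𝜀, the classifier weight 𝛼 = ½·ln((1−𝜀)/𝜀), and the
reweighted example weights:

```python
import math

# Hypothetical four-point sample; labels are in {-1, +1}.
X = [0.1, 0.3, 0.6, 0.8]
y = [-1, -1, 1, -1]          # the point at x = 0.8 is the "hard" one

def h(x):                    # the weak learner: a fixed stump
    return 1 if x > 0.5 else -1

w = [0.25] * 4               # initially, all points get equal weight

# Statement B: the classifier's weight comes from its *weighted* error.
eps = sum(wi for xi, yi, wi in zip(X, y, w) if h(xi) != yi)   # 0.25 here
alpha = 0.5 * math.log((1 - eps) / eps)                       # > 0 since eps < 0.5

# Statement A: misclassified points are up-weighted, correct ones down-weighted.
w = [wi * math.exp(-alpha * yi * h(xi)) for xi, yi, wi in zip(X, y, w)]
z = sum(w)
w = [wi / z for wi in w]     # renormalize to a distribution
```

After the update, the single misclassified point carries weight 0.5, i.e. half of the
total: the next weak learner is forced to pay attention to it, which is exactly the
behavior described in statement A.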