
Week_7_Notes[1]

Uploaded by

Medha Harini
Copyright
© © All Rights Reserved

Computational Learning Theory:

Part A: Finite Hypothesis Space


Goal of Learning Theory
•To understand
–What kinds of tasks are learnable?
–What kind of data is required for learnability?
–What are the (space, time) requirements of the learning algorithm?
•To develop and analyze models
–Develop algorithms that provably meet desired criteria
–Prove guarantees for successful algorithms
•Two core aspects of ML
–Algorithm Design. How to optimize?
–Confidence for rule effectiveness on future data.
•We need particular settings (models)
–Probably Approximately Correct (PAC)

A prototypical concept learning task is a fundamental type of machine learning task
where the goal is to learn a target concept or target function that maps inputs to
outputs.
1. Instances (X): These are the individual data points described by various
attributes. For example, instances could be days described by attributes like
Temp, Humidity, Wind, Water, and Forecast.

2. Distribution (𝒟): This is the probability distribution over the instances,
indicating how likely different instances are to appear.
3. Target Function (c): This is the true function that maps each instance to its
correct label or output. For example, it could be a function that determines
whether a day is suitable for playing sports based on the weather attributes.

4. Hypothesis Space (ℋ): This is the set of all possible functions that the
learning algorithm can choose from to approximate the target function.
Hypotheses are often represented as conjunctions of literals (e.g., specific
values or “don’t care” conditions for each attribute).
5. Training Examples (S): These are the specific instances and their
corresponding labels used to train the learning algorithm.

• Goal: Find h which has small error over 𝒟
• An algorithm optimizes over S to find a hypothesis h

Computational Learning Theory (CoLT) focuses on the design and analysis of learning
algorithms using formal mathematical methods.
Key Concepts
CoLT uses theoretical computer science tools to quantify and analyze learning
problems. This includes characterizing the difficulty of learning specific tasks and
understanding the computational complexity involved.
One of the central frameworks in CoLT is Probably Approximately Correct (PAC)
learning. It provides a way to quantify the computational difficulty of a learning task by
defining how well a learning algorithm can perform given a certain amount of data and
computational resources.

Function Approximation: Finding the Best Function


Function approximation is a technique used to estimate an unknown or complex
function using a simpler, more easily computable model.
Function approximation is crucial when theoretical models are unavailable or hard to
compute. It allows for practical solutions in real-world applications where exact
solutions are not feasible.

Error of a hypothesis

The true error of hypothesis h, with respect to the target concept c and observation
distribution 𝒟, is the probability that h will misclassify an instance drawn according to
𝒟:
error_𝒟(h) = Pr_{x ~ 𝒟}[c(x) ≠ h(x)]
In a perfect world, we’d like the true error to be 0.
Bias: Fix the hypothesis space H.
c may not be in H => find h close to c.
A hypothesis h is approximately correct if its true error over 𝒟 is at most 𝜀.

PAC
The PAC model is a framework in computational learning theory that aims to
understand how well a learning algorithm can generalize from a finite set of training
examples to unseen instances.
Goal: h has small error over 𝒟
- This means h should correctly classify most instances drawn from 𝒟.
•PAC Learning concerns efficient learning
•We would like to prove that
–With high probability an (efficient) learning algorithm will find a hypothesis that is
approximately identical to the hidden target concept.
•We specify two parameters, 𝜀 and 𝛿, and require that with probability at least (1−𝛿) a
system learns a concept with error at most 𝜀.

𝛿: confidence parameter.
This parameter is the probability that the learning algorithm may fail to
produce a hypothesis (or learned concept) with the desired error bound. More
precisely, 1−𝛿 is a lower bound on the probability that the learning algorithm succeeds. So, if 𝛿=0.05,
the system will produce a hypothesis with error at most 𝜀 with probability at least 95%
(since 1−𝛿=0.95).

Sample Complexity for Supervised Learning


Sample Complexity: Determines how many training examples are needed to ensure
that the training error is a good estimate of the true error.
True Error: Measures the performance of the hypothesis on future, unseen instances.
Training Error: Measures the performance of the hypothesis on the training data.

Sample Complexity: Finite Hypothesis Spaces

Two cases are covered: the realizable case and the inconsistent case for a finite H.

General Finite Hypothesis Spaces


For any finite hypothesis space, the sample complexity can be derived using similar
principles. The key is to balance the size of the hypothesis space, the desired
accuracy, and the confidence level.

Inconsistent Finite Hypothesis Spaces


In the inconsistent case, we do not assume that there is a perfect hypothesis in the
hypothesis space. Instead, we aim to find a hypothesis that minimizes the error on the
training data, even if it cannot perfectly classify all examples.

Realizable Case
In the realizable case, we assume that there exists a hypothesis h in the hypothesis
space that perfectly classifies all training examples. The goal is to determine how
many training examples are needed to ensure that the hypothesis learned from the
training data will have a small error on new, unseen data.
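
The two cases can be made concrete with the standard textbook bounds for a finite hypothesis space. The sketch below assumes the usual forms m ≥ (1/𝜀)(ln|H| + ln(1/𝛿)) for the realizable case and m ≥ (1/(2𝜀²))(ln|H| + ln(1/𝛿)) for the inconsistent case (the latter from the Hoeffding bound). The function names and the example values |H| = 973, 𝜀 = 0.1, 𝛿 = 0.05 are illustrative assumptions, not taken from these notes.

```python
import math


def m_realizable(h_size, epsilon, delta):
    """Examples sufficient when some hypothesis in H is consistent with the data."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)


def m_inconsistent(h_size, epsilon, delta):
    """Examples sufficient when no hypothesis in H needs to be perfect.

    Guarantees (with probability at least 1 - delta) that the true error of
    every h in H exceeds its training error by at most epsilon.
    """
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / (2 * epsilon ** 2))


# Hypothetical setting: 973 hypotheses, epsilon = 0.1, delta = 0.05.
print(m_realizable(973, 0.1, 0.05))
print(m_inconsistent(973, 0.1, 0.05))
```

Both bounds grow only logarithmically in |H| and 1/𝛿, but the inconsistent case pays 1/𝜀² instead of 1/𝜀, so it needs noticeably more data for the same accuracy.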

Concept Learning Task


A concept learning task involves training a model to identify and classify objects or
instances based on their attributes.
• Learning from Examples: The model is shown a set of training examples and
learns to generalize from these examples to classify new, unseen instances.
• Simplification: The model simplifies what it has learned by forming a
hypothesis that can be applied to future examples. This hypothesis is a general
rule derived from the training data.

Find-S Algorithm is a simple and intuitive algorithm used in concept learning to find
the most specific hypothesis that fits all the positive examples in a given dataset.
Here’s a detailed explanation:
Key Concepts
1. Hypothesis: A representation of the concept being learned.
2. Positive Examples: Instances that belong to the target concept.
3. Negative Examples: Instances that do not belong to the target concept.
Find-S Algorithm steps:
1. Initialize: Start with the most specific hypothesis, which accepts no instance
(every attribute is constrained to the empty value).
2. Iterate through Examples:
–For each positive example, update the hypothesis to be more general if
necessary.
–Ignore negative examples.
3. Update Hypothesis: The first positive example is copied into the hypothesis.
After that, for each attribute in a positive example that differs
from the current hypothesis, replace the specific value with a question mark (?),
indicating that any value is acceptable for that attribute.
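
The steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming hypotheses are conjunctions over nominal attributes; the function name `find_s` and the four toy days (described by Temp, Humidity, Wind, Water, Forecast) are made up for the example.

```python
def find_s(examples):
    """Return the most specific conjunctive hypothesis covering all positives.

    examples: list of (attribute_tuple, is_positive) pairs.
    Returns None if there are no positive examples (hypothesis accepts nothing).
    """
    hypothesis = None  # None stands for the most specific hypothesis
    for attributes, is_positive in examples:
        if not is_positive:
            continue  # Find-S ignores negative examples
        if hypothesis is None:
            hypothesis = list(attributes)  # first positive example is copied as-is
            continue
        for i, value in enumerate(attributes):
            if hypothesis[i] != value:
                hypothesis[i] = "?"  # generalize: any value is acceptable here
    return hypothesis


# Hypothetical training data: (Temp, Humidity, Wind, Water, Forecast), label
training = [
    (("Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Warm", "High", "Strong", "Warm", "Same"), True),
    (("Cold", "High", "Strong", "Warm", "Change"), False),
    (("Warm", "High", "Strong", "Cool", "Change"), True),
]
print(find_s(training))  # ['Warm', '?', 'Strong', '?', '?']
```

Because negative examples are never consulted, the result is only guaranteed to be consistent with them when the target concept is actually in the hypothesis space and the data is noise-free.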

Shattering and VC dimension


Understanding shattering and VC dimension helps in:

Model Selection: Choosing models with appropriate complexity to avoid overfitting or
underfitting.
Sample Complexity: Determining the number of training samples needed for effective
learning.

Shattering a Set of Points:


A set of points is said to be shattered if, for every possible way of labeling the points,
there exists a hypothesis that correctly classifies the points according to those labels.
For example, if you have a set of points in a two-dimensional space, and you can draw
a line (using a linear classifier) that separates the points into any possible combination
of two classes, then that set of points is shattered by the hypothesis class of linear
classifiers.

VC Dimension:
The Vapnik-Chervonenkis (VC) dimension is a measure of the capacity of a hypothesis
class. It is defined as the largest number of points that can be shattered by the
hypothesis class.
For instance, a linear classifier in a two-dimensional space has a VC dimension of 3,
meaning it can shatter some set of 3 points (any 3 points that are not collinear), but no set of 4 points.

•A set of 𝑁 points (instances) can be labeled as + or − in 2^𝑁 ways.
•Consider a hypothesis class ℋ for the 2-class problem.
•If for every such labeling a function can be found in ℋ consistent with this labeling,
we say that the set of instances is shattered by ℋ.
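
The definition can be checked by brute force on a small example. The sketch below uses a hypothetical hypothesis class, closed intervals [a, b] with integer endpoints between 0 and 10, which labels a point + exactly when it lies inside the interval. The class and the helper name `is_shattered` are assumptions chosen for illustration.

```python
from itertools import product


def is_shattered(points, hypotheses):
    """True if every +/- labeling of `points` is produced by some hypothesis."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    all_labelings = product([False, True], repeat=len(points))
    return all(labeling in realized for labeling in all_labelings)


# Hypothetical class: intervals [a, b] with integer endpoints in 0..10.
# Default arguments bind a and b at definition time for each lambda.
intervals = [
    (lambda x, a=a, b=b: a <= x <= b)
    for a in range(11)
    for b in range(a, 11)
]

print(is_shattered([3, 6], intervals))     # True: all 4 labelings are realizable
print(is_shattered([2, 5, 8], intervals))  # False: "+ - +" needs two intervals
```

Two points can be shattered but three cannot (the middle point cannot be excluded while both outer points are included), so the VC dimension of this interval class is 2.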

VC Dimension
Question:
Suppose the VC dimension of a hypothesis space is 6. Which of the following are true?
A) At least one set of 6 points can be shattered by the hypothesis space.
B) Two sets of 6 points can be shattered by the hypothesis space.
C) All sets of 6 points can be shattered by the hypothesis space.
D) No set of 7 points can be shattered by the hypothesis space.
Ans: A and D
A) At least one set of 6 points can be shattered by the hypothesis space.
D) No set of 7 points can be shattered by the hypothesis space.
Here's the explanation:
• A) At least one set of 6 points can be shattered by the hypothesis
space: This is true. The definition of VC dimension (Vapnik-Chervonenkis
dimension) tells us that the hypothesis space can shatter a set of size equal to
the VC dimension. Therefore, since the VC dimension is 6, there exists at least
one set of 6 points that can be shattered (i.e., labeled in all possible ways
by hypotheses in the space).
• B) Two sets of 6 points can be shattered by the hypothesis space: This
is not necessarily true. The VC dimension only guarantees that one set of 6
points can be shattered. It does not imply that multiple sets of 6 points can be
shattered.
• C) All sets of 6 points can be shattered by the hypothesis space: This is
false. The VC dimension only guarantees that at least one set of 6 points can
be shattered, but not necessarily all sets of 6 points.
• D) No set of 7 points can be shattered by the hypothesis space: This is
true. Since the VC dimension is 6, no set of 7 points can be shattered: for every
set of 7 points there is at least one labeling that no hypothesis in the space
realizes.

Ensemble Learning:

What is Ensemble Classification?


•Use multiple learning algorithms (classifiers)
•Combine the decisions
•Can be more accurate than the individual classifiers
•Generate a group of base-learners
•Different learners use different
–Algorithms
–Hyperparameters
–Representations (Modalities)
–Training sets

Question:
Which of the following options are correct regarding the benefits of an ensemble
model?
1. Better performance
2. More generalized model
3. Better interpretability
A) 1 and 3
B) 2 and 3
C) 1 and 2
D) 1, 2 and 3
Ans: C
Better performance: Ensemble models, such as Random Forest or Gradient
Boosting, combine the predictions of multiple models, which often leads to better
performance compared to a single model. This is because they reduce overfitting and
variance by averaging the predictions or aggregating them in some manner.
More generalized model: Since ensemble models take advantage of multiple weak
learners or different models, they tend to generalize better on unseen data, reducing
the chances of overfitting on the training data.
Better interpretability: This is generally not true for ensemble models. In fact,
ensemble methods often result in complex models that are harder to interpret than
simpler models like decision trees or linear regression. Therefore, this option is
incorrect.

Identify whether the following statement is true or false:


"Ensembles will yield bad results when there is a significant diversity among the
models. "
A) True
B) False
Ans: B) False
Explanation:
In fact, diversity among models is a key factor that contributes to the success of
ensemble methods. Ensemble methods like bagging, boosting, and stacking work
well when the individual models are diverse (i.e., they make different kinds of errors).
By combining these diverse models, the ensemble can reduce overall error and
improve generalization because the mistakes of one model can be compensated by
the correct predictions of another.
If all the models in an ensemble were very similar (i.e., not diverse), they would make
similar errors, and the ensemble wouldn't perform much better than an individual
model. Therefore, significant diversity among models typically leads to better
results, not worse.

Why should it work?


•Works well only if the individual classifiers disagree (diverse models should be
grouped into the ensemble)
–Each base error rate < 0.5 and errors are independent (different models make different errors)
–The ensemble error rate is highly correlated with the correlations of the errors made by the different
learners
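
A short calculation shows why the two conditions matter. Assuming the base classifiers make fully independent errors with the same error rate (an idealization), a majority vote is wrong only when more than half of them are wrong, which is a binomial tail probability. The function name and the example numbers (25 models, error rate 0.35) are illustrative assumptions.

```python
from math import comb


def majority_vote_error(n_models, p_error):
    """Probability that more than half of n independent models are wrong.

    Assumes an odd number of models, identical error rates, and fully
    independent errors; real ensembles only approximate this.
    """
    k_min = n_models // 2 + 1  # smallest number of wrong votes that loses
    return sum(
        comb(n_models, k) * p_error**k * (1 - p_error) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


print(majority_vote_error(25, 0.35))  # far below the base error rate of 0.35
print(majority_vote_error(25, 0.50))  # no gain when base models are at chance
print(majority_vote_error(25, 0.60))  # worse than a single model
```

When the errors are correlated, the models tend to be wrong on the same instances and the binomial calculation no longer applies, so the gain shrinks toward that of a single model.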
Bias vs. Variance
•We would like low bias error and low variance error
•Ensembles using multiple trained (high variance/low bias) models can average out
the variance, leaving just the bias
–Less worry about overfit(stopping criteria, etc.) with the base models

Generally, an ensemble method works better, if the individual base models have
--------- ?
(Note: Individual models have accuracy greater than 50%)
A) Less correlation among predictions (the models should make different kinds of errors)
B) High correlation among predictions
C) Correlation does not have an impact on the ensemble output
D) None of the above.
Ans: A) Less correlation among predictions.
Ensemble methods typically perform better when the individual base models make
diverse predictions. This diversity, often achieved through less correlation among the
predictions of the models, helps the ensemble to correct the errors of individual
models and improve overall performance.

Ensemble Creation Approaches


•Get less correlated (different) errors between models
–Injecting randomness
•initial weights (e.g., NN), different learning parameters, different splits (e.g., DT), etc.
–Different Training sets
•Bagging, Boosting, different features, etc.
–Forcing differences
•different objective functions
–Different machine learning models

Ensemble Combining Approaches


•Unweighted Voting (e.g. Bagging)
•Weighted voting, based on accuracy (e.g. Boosting), expertise, etc.
•Stacking: learn the combination function
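
The first two combining approaches can be sketched as follows. The function names, labels, and weights are hypothetical; stacking is not shown because it requires training a second-level model on the base predictions.

```python
from collections import Counter, defaultdict


def unweighted_vote(predictions):
    """Return the label predicted by the most base classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight behind it."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical predictions of three base classifiers for one instance.
predictions = ["spam", "ham", "ham"]
weights = [0.9, 0.3, 0.4]  # e.g. derived from each classifier's accuracy

print(unweighted_vote(predictions))         # ham  (2 votes against 1)
print(weighted_vote(predictions, weights))  # spam (0.9 against 0.7)
```

The two rules can disagree, as here, when one accurate classifier is outvoted by several weaker ones.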

Question:
Which of the following algorithms is not an ensemble learning algorithm?
A) Random Forest
B) Adaboost
C) Decision Trees
The correct answer is C) Decision Trees.
• Random Forest and AdaBoost are both ensemble learning algorithms.
Random Forest combines multiple decision trees to improve accuracy and
reduce overfitting. AdaBoost combines multiple weak classifiers to create a
strong classifier.
• Decision Trees, on the other hand, are a single model and not an ensemble
method.

Bayes Optimal Classifier


•The Bayes Optimal Classifier is an ensemble of all the hypotheses in the hypothesis
space.
•On average, no other ensemble can outperform it.
•The vote for each hypothesis
–is proportional to the likelihood that the training dataset would be sampled from a
system if that hypothesis were true.
–is multiplied by the prior probability of that hypothesis.
The Bayes Optimal Classifier represents a hypothesis h that is not necessarily in H.
But it is the optimal hypothesis in the ensemble space.
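
A small numeric sketch of this weighted vote, assuming the posterior P(h | D) of each hypothesis (likelihood times prior, normalized) is already available. The three hypotheses, their posteriors, and the function name are made-up illustration values.

```python
def bayes_optimal(posteriors, class_probs, classes):
    """Pick the class v maximizing sum over h of P(v | h) * P(h | D).

    posteriors[i]  : P(h_i | D), proportional to likelihood times prior
    class_probs[i] : dict mapping each class v to P(v | h_i)
    """
    def score(v):
        return sum(p * probs[v] for p, probs in zip(posteriors, class_probs))

    return max(classes, key=score)


# Hypothetical posteriors for three hypotheses, and their votes on a new instance.
posteriors = [0.4, 0.3, 0.3]
class_probs = [
    {"+": 1.0, "-": 0.0},  # h1 (the single most probable hypothesis) says +
    {"+": 0.0, "-": 1.0},  # h2 says -
    {"+": 0.0, "-": 1.0},  # h3 says -
]

print(bayes_optimal(posteriors, class_probs, ["+", "-"]))  # "-" (0.6 against 0.4)
```

The combined prediction differs from that of the single most probable hypothesis h1, which illustrates why the resulting classifier need not correspond to any one hypothesis in H.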

Practicality of Bayes Optimal Classifier


•Cannot be practically implemented.
•Most hypothesis spaces are too large
•Many hypotheses output a class or a value, and not probability
•Estimating the prior probability for each hypothesis is not always possible.

BMA - Bayesian Model Averaging


•All possible models in the model space are used, weighted by their probability of being
the “correct” model
•Optimal given the correct model space and priors

Challenge for developing Ensemble Models


•The main challenge is to obtain base models/learners which are independent and
make independent kinds of errors.
•Independence between two base classifiers can be assessed in this case by
measuring the degree of overlap in training examples they misclassify.

Part B: Bagging and Boosting


Bagging is done to decrease variance.

Bagging (Bootstrap Aggregating) and Boosting are ensemble learning
techniques aimed at improving the performance of machine learning models by
combining multiple models (typically decision trees). However, they differ significantly
in their approach. Here's a comparison:
1. Purpose
• Bagging: Aims to reduce variance by averaging multiple models trained on
different subsets of the data. It helps prevent overfitting.
• Boosting: Aims to reduce bias by sequentially training models, where each
new model corrects the errors of the previous one. It builds a strong classifier
from weak learners.
2. Model Training Process
• Bagging: Each model is trained independently on a different random sample
(with replacement) from the dataset. The results of the models are averaged
(for regression) or a majority vote is taken (for classification).
• Boosting: Models are trained sequentially. Each new model focuses on
correcting the errors of the previous models by giving more weight to
misclassified instances.
3. Data Sampling
• Bagging: Uses bootstrap sampling (sampling with replacement), meaning
each model is trained on a different subset of the data.
• Boosting: Uses the entire dataset, but misclassified instances are given
higher weights so that subsequent models focus more on those errors.
4. Parallelization
• Bagging: Can be easily parallelized since each model is trained independently
on different subsets of data.
• Boosting: Is inherently sequential, so it is difficult to parallelize. Each model
depends on the performance of the previous model.
5. Handling Bias and Variance
• Bagging: Primarily helps reduce variance by combining models and averaging
out predictions.
• Boosting: Helps reduce both bias and variance by creating a strong model
that corrects errors iteratively.
6. Overfitting Risk
• Bagging: Less prone to overfitting since the models are trained independently
on different samples, reducing variance.
• Boosting: More prone to overfitting, especially if the models are too complex
or if the data is noisy, since it tries to fit the data closely.
7. Performance
• Bagging: Works well when the base model is prone to high variance, such as
decision trees (leading to Random Forests).
• Boosting: Generally performs better than bagging on complex tasks but
requires careful tuning to avoid overfitting. Examples include AdaBoost,
Gradient Boosting, and XGBoost.
8. Examples
• Bagging: Random Forest
• Boosting: AdaBoost, Gradient Boosting, XGBoost

•Bagging = “bootstrap aggregation”
–Draw N items from X with replacement (sampling with replacement)
•Desirable base learners have high variance (are unstable)
–Decision trees and ANNs are unstable models
–K-NN is stable
•Use bootstrapping to generate L training sets and train one base-learner with each
training set
•Build a classifier on each bootstrap sample
•Use voting to combine the classifiers
•Each instance has probability 1 − (1 − 1/N)^N (about 0.632 for large N) of being selected in a given bootstrap sample.
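
The procedure can be sketched with the standard library only. The base learner here is a deliberately simple one-dimensional threshold rule standing in for an unstable model such as a decision tree; it, the function names, and the toy data are assumptions made for the illustration.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def train_stump(sample):
    """Toy base learner: threshold halfway between the two class means.

    Assumes 1-D inputs where class 1 tends to have larger values than class 0.
    """
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    if not pos or not neg:  # degenerate sample: predict the only class seen
        only = 1 if pos else 0
        return lambda x: only
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x >= threshold else 0


def bagging(data, n_learners, seed=0):
    """Train one base learner per bootstrap sample; predict by majority vote."""
    rng = random.Random(seed)
    learners = [train_stump(bootstrap_sample(data, rng)) for _ in range(n_learners)]

    def predict(x):
        votes = Counter(h(x) for h in learners)
        return votes.most_common(1)[0][0]

    return predict


# Hypothetical data: values 0..9 belong to class 0, values 10..19 to class 1.
data = [(x, 0) for x in range(10)] + [(x, 1) for x in range(10, 20)]
predict = bagging(data, n_learners=11)
print(predict(2), predict(17))
```

Using an odd number of learners avoids ties in the two-class vote. Each bootstrap sample leaves out roughly a third of the original instances, which is what makes the base learners differ from one another.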

Question:
Which of the following is FALSE about bagging?
A) Bagging increases the variance of the classifier
B) Bagging can help make robust classifiers from unstable classifiers.
C) Majority Voting is one way of combining outputs from various classifiers which are
being bagged.
Ans: A
• A) Bagging increases the variance of the classifier: This statement is false. In
fact, bagging reduces the variance of the classifier. The primary purpose of
bagging (Bootstrap Aggregating) is to reduce the variance by combining predictions
from multiple models trained on different subsets of the data. This leads to more
stable and generalized predictions, especially for high-variance models like decision
trees.
• B) Bagging can help make robust classifiers from unstable classifiers: This is
true. Bagging is commonly used to improve the performance of unstable models, like
decision trees, by reducing overfitting and increasing robustness.
• C) Majority Voting is one way of combining outputs from various classifiers
which are being bagged: This is true. In bagging, the outputs from individual
classifiers are often combined using methods like majority voting (for classification) or
averaging (for regression).

Boosting
•An iterative procedure that adaptively changes the distribution of the training data.
–Initially, all N records are assigned equal weights
–Weights change at the end of each boosting round
•“Weak” learners
–P(correct) > 50%, but not necessarily much better.

Question:
Identify whether the following statement is true or false:
"Boosting is easy to parallelize whereas bagging is inherently a sequential process."
A) True
B) False

Ans: B) False
Here's why:
• Bagging (Bootstrap Aggregating): Bagging is easier to parallelize because
the individual models (such as decision trees) are trained independently on
different subsets of the data. Since there is no dependency between the models
during training, it can be easily parallelized.
• Boosting: Boosting, on the other hand, is an inherently sequential process
because each model is built based on the errors of the previous model. This
dependency between models means that boosting is harder to parallelize.
Therefore, the statement is false because boosting is sequential and hard to
parallelize, while bagging is easier to parallelize.

Adaboost
•Boosting can turn a weak algorithm into a strong learner.

Question:
In AdaBoost, we give more weight to data points that were misclassified in
previous iterations.
Now, suppose we introduce a limit or cap on the weight that any point can take (for example,
a restriction that prevents any point's weight from exceeding a value of 10).
Which among the following would be the effect of such a modification?
A) It will have no effect on the performance of the Adaboost method.
B) It makes the final classifier robust to outliers.
C) It may result in lower overall performance.
D) None of these.
Ans: B & C
In AdaBoost, points that are misclassified in previous iterations are given higher
weights in subsequent iterations, meaning the model focuses more on these points.
However, if there are outliers (points that are very difficult or impossible to classify
correctly), the algorithm can end up putting too much weight on these points, causing
the overall performance to degrade.
Capping the weights limits the influence of such points, which makes the final
classifier more robust to outliers (B). The same cap also limits how strongly the
algorithm can focus on hard but legitimate points, so it may result in lower overall
performance (C).

Considering the AdaBoost algorithm, which among the following statements is true?
A) In each stage, we try to train a classifier which makes accurate predictions on a
subset of the data points where the subset contains more of the data points which were
misclassified in earlier stages.
B) The weight assigned to an individual classifier depends upon the weighted sum
error of misclassified points for that classifier.
C) Both option A and B are true
D) None of them are true
Ans: C) Both option A and B are true
Here's why:
• A) In each stage, we try to train a classifier which makes accurate
predictions on a subset of the data points where the subset contains
more of the data points which were misclassified in earlier stages: This
is true. In AdaBoost, more weight is given to the data points that were
misclassified in previous iterations, so subsequent classifiers focus more on
those harder-to-classify points.
• B) The weight assigned to an individual classifier depends upon the
weighted sum error of misclassified points for that classifier: This is also
true. The performance of each classifier is evaluated based on the weighted
sum of its errors, and classifiers with lower weighted errors are given higher
weight in the final model.
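
Both statements can be seen in a single round of the standard AdaBoost update, sketched below under the usual assumptions: labels and predictions are in {-1, +1}, the instance weights sum to 1, and the weighted error is strictly between 0 and 1. The function name and the five-point example are hypothetical.

```python
import math


def adaboost_round(weights, labels, predictions):
    """One AdaBoost round: return (classifier weight, updated instance weights)."""
    # Weighted error: total weight of the misclassified points.
    error = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
    # Classifier weight: larger when the weighted error is smaller.
    alpha = 0.5 * math.log((1 - error) / error)
    # y * p is +1 for correct and -1 for wrong predictions, so correct points
    # are scaled down and misclassified points are scaled up.
    updated = [
        w * math.exp(-alpha * y * p)
        for w, y, p in zip(weights, labels, predictions)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Hypothetical round: five equally weighted points, the last one misclassified.
weights = [0.2] * 5
labels = [+1, +1, -1, -1, +1]
predictions = [+1, +1, -1, -1, -1]

alpha, new_weights = adaboost_round(weights, labels, predictions)
print(alpha)        # about 0.693 (= ln 2) for a weighted error of 0.2
print(new_weights)  # the misclassified point now carries half the total weight
```

After normalization the misclassified points always hold half of the total weight, so the next weak learner cannot do well by repeating the same mistakes.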

Question:

ANS: 1
