ML Unit-4 Prob Learning

The document discusses various concepts in computational learning, focusing on probability learning, hypothesis formulation, sample complexity, and rule learning. It explains the differences between frequentist and Bayesian approaches, the significance of hypothesis spaces, and the implications of finite and infinite hypothesis spaces. Additionally, it covers the mistake bound model and sequential covering algorithm for rule learning in machine learning.


UNIT-IV: Computational Learning
Probability Learning

 Probability learning is a machine learning technique that uses probability theory to make predictions and decisions.
 It is a statistical approach that models uncertainty in data by using probability distributions. A probability distribution describes the possible values a random variable can take and their corresponding likelihoods. For example, the probabilities of observing 0, 1, 2, …, 100 heads in 100 coin tosses.
 A Frequentist calculates probabilities from the relative frequency of a specific event over the total number of trials. For example, P(Head) = 56/100 = 0.56 after observing 56 heads in 100 tosses.
Bayesian

 A Bayesian modifies a prior belief with current experiment data using Bayesian inference. For example, a Bayesian can combine a prior belief that the coin is fair with the current experiment data (56 heads out of 100) to form a new belief.
 A Frequentist estimates the most likely value for P(Head) (a point estimate). A Bayesian, in contrast, tracks all possible values with their corresponding certainties. This calculation is complex but contains richer information for further computation.
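The contrast above can be sketched in a few lines of Python. This is a minimal sketch, not part of the slides: the Beta(1, 1) prior and the printed summaries are illustrative assumptions.

```python
# Coin example from above: 56 heads observed in 100 tosses.
heads, tosses = 56, 100

# Frequentist view: a single point estimate, the relative frequency.
p_hat = heads / tosses
print(f"Frequentist point estimate: P(Head) = {p_hat:.2f}")

# Bayesian view: a whole posterior distribution over P(Head).
# With a Beta(a, b) prior and binomial data, the posterior is
# Beta(a + heads, b + tails) by conjugacy.
a, b = 1, 1                          # assumed uniform prior Beta(1, 1)
a_post = a + heads
b_post = b + (tosses - heads)
posterior_mean = a_post / (a_post + b_post)
print(f"Bayesian posterior: Beta({a_post}, {b_post}), mean = {posterior_mean:.3f}")
```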
Hypothesis

 The hypothesis is one of the commonly used concepts of statistics in Machine Learning.
 It is specifically used in supervised machine learning, where an ML model learns a function that best maps the inputs to the corresponding outputs with the help of an available dataset.
Hypothesis
 There are some common methods for finding a possible hypothesis from the hypothesis space, where the hypothesis space is represented by uppercase H and a hypothesis by lowercase h. These are defined as follows:
 The hypothesis (h) can be formulated in machine learning as follows:
 y = mx + c
Where,
 y: range (the predicted output)
 m: slope of the line, i.e. the change in y divided by the change in x
 x: domain (the input)
 c: intercept (constant)
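As a small illustration (not from the slides), one such linear hypothesis can be written as an ordinary function; the slope and intercept values below are arbitrary assumptions.

```python
# A single linear hypothesis h drawn from the hypothesis space H of all lines.
# m (slope) and c (intercept) are arbitrary example values.
def h(x, m=2.0, c=1.0):
    """Predict y for input x using the line y = m*x + c."""
    return m * x + c

print(h(3.0))   # 2.0 * 3.0 + 1.0 = 7.0
```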
Hypothesis space (H):
 Hypothesis space is defined as a
set of all possible legal
hypotheses; hence it is also
known as a hypothesis set.
Sample complexity
 Sample complexity is a concept
in machine learning that determines the
number of data samples required to
achieve a certain level of learning
performance.
 Its importance lies in its ability to assess
the efficiency of a learning algorithm.
 A more efficient algorithm needs fewer
samples to learn effectively, reducing the
resources required for data acquisition
and storage.
 There are two types of sample complexities
that are often referenced: worst-case sample
complexity and average-case sample
complexity.
 Worst-case sample complexity refers to the
maximum number of samples required to
reach a specific learning goal, irrespective of
the data distribution.
 Average-case sample complexity, on the
other hand, considers the average number of
samples needed, assuming the data follows a
certain distribution.
Mathematical backbone of
sample complexity
 Probably Approximately Correct
(PAC) learning theory provides a
framework to relate VC dimension to
sample complexity. PAC learning
seeks to identify the minimum
sample size that will, with high
probability, produce a hypothesis
within a specified error tolerance of
the best possible hypothesis.
 The PAC learning bound is given by:
 N >= (1/ε) * (ln|H| + ln(1/δ)),
Where,
 N is the sample size,
 ε is the maximum acceptable error
(the 'approximately correct' part),
 |H| is the size of the hypothesis
space (related to VC dimension),
 δ is the acceptable failure probability
(the 'probably' part).
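As a quick illustration, the bound can be evaluated directly. The numbers below are assumptions chosen for the example, not values from the slides.

```python
import math

# PAC sample-size bound for a finite hypothesis space:
#   N >= (1/eps) * (ln|H| + ln(1/delta))
H_size = 1000      # |H|: number of hypotheses (assumed)
eps = 0.1          # maximum acceptable error (assumed)
delta = 0.05       # acceptable failure probability (assumed)

N = (1 / eps) * (math.log(H_size) + math.log(1 / delta))
print(math.ceil(N))   # about 100 samples suffice under this bound
```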
Finite hypothesis space
 A finite hypothesis space consists of a limited, countable set of hypotheses (models or functions) that can be selected to explain or predict outcomes based on the input data.
 Example: decision trees with a limited depth, a fixed number of linear classifiers, or a specific set of rules in rule-based learning.
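The size of a finite hypothesis space can often be counted directly, which is exactly the |H| that enters the ln|H| term in the PAC bound above. A small sketch (not from the slides), using one common way of counting boolean conjunctions:

```python
# For conjunctions over n boolean variables, each variable may appear
# positive, appear negated, or be absent: 3**n syntactically distinct
# conjunctions (including the empty conjunction, which predicts True
# everywhere). The value of n is an arbitrary example.
n = 4
H_size = 3 ** n
print(H_size)   # 81 hypotheses for n = 4
```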
Implications

 Easier to manage and evaluate, since the number of hypotheses is small.
 Risk of underfitting if the hypothesis space is too constrained and does not capture the underlying data distribution.
 It can be easier to ensure generalization, since overfitting can be less of a concern due to the limited complexity.
Infinite hypothesis space
 An infinite hypothesis space contains
an unbounded or uncountable set of
hypotheses. This means there are
potentially limitless models that could
be considered for fitting the data.
 Example: Linear regression with any
real-valued coefficients, neural
networks with varying architectures
and parameters, or kernel methods in
support vector machines (SVMs).
Implications

 Greater flexibility and the ability to capture complex patterns in the data.
 Higher risk of overfitting, as the model can become too complex and fit noise in the data rather than the underlying distribution.
 Requires more advanced techniques for regularization and validation to ensure that the model generalizes well.
Mistake bound model
 The mistake bound (MB) model in machine
learning is a model that evaluates a learner
based on the total number of mistakes it
makes before reaching the correct
hypothesis.
 The model is used in online learning scenarios, where the learning process is made up of rounds.
 In each round, the learner is asked about an aspect of the learned phenomenon, makes a prediction, and is told whether it was correct.
 The goal of the learner is to make the minimum number of mistakes possible over the learning process.
Mistake bound model
algorithm
 An algorithm A is said to learn C in the
mistake bound model if for any concept
c ∈ C, and for any ordering of
examples consistent with c, the total
number of mistakes ever made by A is
bounded by p(n, size(c)), where p is a
polynomial. We say that A is a
polynomial time learning algorithm if
its running time per stage is also
polynomial in n and size(c).
 Conjunctions
 Let us assume that we know that the
target concept c will be a
conjunction of a set of (possibly
negated) variables, with an example
space of n-bit strings. Consider the
following algorithm:
Algorithm for MB
 1. Initialize the hypothesis h to the conjunction of all 2n literals: x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ … ∧ xn ∧ ¬xn.
 2. Predict using h(x).
 3. If the prediction is False but the label is actually True, remove all the literals in h which are False in x. (So if the first mistake is on 1001, the new h will be x1 ∧ ¬x2 ∧ ¬x3 ∧ x4.)
 4. If the prediction is True but the label is actually False, then output “no consistent conjunction”.
 5. Return to step 2.
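The elimination procedure above can be sketched in a few lines of Python. This is a minimal sketch, not from the slides: the bit-string examples, the (index, value) encoding of literals, and the sample sequence are illustrative assumptions.

```python
def learn_conjunction(examples):
    """examples: list of (bits, label) pairs, e.g. ("1001", True)."""
    n = len(examples[0][0])
    # Step 1: start with all 2n literals x_i and not-x_i,
    # encoded as (bit index, required value) pairs.
    h = {(i, "1") for i in range(n)} | {(i, "0") for i in range(n)}
    mistakes = 0
    for bits, label in examples:
        # Step 2: predict True only if every remaining literal is satisfied.
        prediction = all(bits[i] == v for (i, v) in h)
        if prediction == label:
            continue
        mistakes += 1
        if label:   # Step 3: predicted False, label True -> drop false literals
            h = {(i, v) for (i, v) in h if bits[i] == v}
        else:       # Step 4: predicted True, label False -> no consistent conjunction
            raise ValueError("no consistent conjunction")
    return h, mistakes

# Example sequence (made up here) consistent with the target x1 AND NOT x3.
data = [("1001", True), ("1101", True), ("1010", False), ("0001", False)]
print(learn_conjunction(data))
```

On the first example the sketch reproduces the slide's trace: after a mistake on 1001, h shrinks to the literals x1, ¬x2, ¬x3, x4.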
 Lower Bound: In fact no deterministic
algorithm can achieve a mistake
bound better than n in the worst
case.
 This can be seen by considering the sequence of n examples in which the ith example has all bits except the ith bit set to 1.
 The target concept c will be a monotone
conjunction constructed by including xi only
if the algorithm predicts the ith example to
be True (in which case the ith example’s
label will be False).
 (If the algorithm predicts the ith example to
be False, then the target concept will not
include xi , and so the true label will be
True.) The algorithm will have made n
mistakes by the time all of these n examples
are processed.
Learning set of rules
A rule set in machine learning is a collection of rules that describes a dataset. The process of creating these rules from data is called rule learning.
Rule types
 The most common type of rule learning
is inductive rule learning, also known as
rule induction. Other types of rules
include association rules, which are used
to express relationships in large datasets
Rule form

 The basic form of a rule is "IF PREMISE THEN CONSEQUENT". This means that the consequent is true whenever the premise is true.
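As a small sketch (not one of the original slides' examples), such a rule can be represented directly in code; the attribute names and values below are illustrative assumptions.

```python
# One rule in "IF PREMISE THEN CONSEQUENT" form, with an illustrative premise.
rule = {
    "premise": {"outlook": "sunny", "humidity": "normal"},   # IF part
    "consequent": ("play_tennis", "yes"),                    # THEN part
}

def rule_covers(rule, example):
    """The consequent is asserted whenever every premise condition holds."""
    return all(example.get(a) == v for a, v in rule["premise"].items())

print(rule_covers(rule, {"outlook": "sunny", "humidity": "normal", "wind": "weak"}))  # True
```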
Learning strategies

 Some strategies for learning rule sets include:
 Learn-One-Rule: searches from general to specific.
 Find-S: searches from specific to general.
 FOIL: learns one rule at a time, removing the positive examples covered by the learned rule before attempting to learn another rule.
Sequential Covering
Algorithm
 Sequential Covering is a popular
algorithm based on Rule-Based
Classification used for learning a
disjunctive set of rules.
 The basic idea here is to learn one
rule, remove the data that it covers,
then repeat the same process.
 In this way, the algorithm builds up the complete set of rules sequentially during the training phase.
Sequential Covering
Algorithm
 The algorithm produces a set of ‘ordered rules’, also called a ‘decision list’.
 Step 1 – Create an empty decision list, ‘R’.
 Step 2 – Apply the ‘Learn-One-Rule’ algorithm, which extracts the best rule for a particular class ‘y’.
In the beginning,
 Step 2.a – training examples that belong to class ‘y’ (∈ y) are treated as positive examples.
 Step 2.b – training examples that do not belong to class ‘y’ (∉ y) are treated as negative examples.
 Step 3 – The rule becomes ‘desirable’ when it covers a majority of the positive examples.
 Step 4 – When this rule is obtained, delete the training examples covered by it (i.e. when the rule is applied to the dataset, the examples it covers are removed).
 Step 5 – The new rule is added to the bottom of the decision list, ‘R’, and the process repeats from Step 2 until no positive examples remain (a sketch of this loop follows below).
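A minimal sketch of this loop in Python, not taken from the slides: the example encoding (dicts with attribute values and a "label" key, rules with a "premise" mapping as in the earlier rule sketch) and the learn_one_rule function are assumptions, with learn_one_rule standing in for the Learn-One-Rule search whose internals the slides do not spell out.

```python
# Sequential covering: learn one rule, remove the covered examples, repeat.

def covers(rule, example):
    """A rule covers an example if every premise condition holds."""
    return all(example.get(a) == v for a, v in rule["premise"].items())

def sequential_covering(examples, target_class, learn_one_rule):
    R = []                                   # Step 1: empty decision list
    positives = [e for e in examples if e["label"] == target_class]
    while positives:                         # repeat while positives remain
        rule = learn_one_rule(examples, target_class)   # Step 2
        if rule is None:                     # no further desirable rule found
            break
        # Step 4: delete the training examples covered by the rule.
        examples = [e for e in examples if not covers(rule, e)]
        positives = [e for e in positives if not covers(rule, e)]
        R.append(rule)                       # Step 5: append to the decision list
    return R
```

At prediction time the decision list R is applied in order: the first rule whose premise matches an example determines its class.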
Decision List format
Working of the algorithm
