Machine Learning - UNIT III
UNIT III:
Computational Learning Theory: Models of learnability: learning in the limit; probably
approximately correct (PAC) learning. Sample complexity for infinite hypothesis spaces, Vapnik-
Chervonenkis dimension. Rule Learning: Propositional and First-Order, Translating decision trees
into rules, Heuristic rule induction using separate and conquer and information gain, First-order
Horn-clause induction (Inductive Logic Programming) and Foil, Learning recursive rules, Inverse
resolution, Golem, and Progol.
B. Sample Complexity
• Sample complexity refers to the number of training examples required for a model to
generalize well.
• A function class is learnable if the sample complexity grows reasonably with the size of the
problem.
• PAC learning provides bounds on how much data is required to achieve a certain level of
accuracy.
Factors Affecting Sample Complexity
1. Size of Hypothesis Space: A larger hypothesis space requires more samples.
2. Complexity of the Concept: More complex concepts need more data.
3. Desired Accuracy (ε) and Confidence (δ): Lower ε (higher accuracy) and lower δ (higher
confidence) require more samples.
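For a finite hypothesis space H and a learner that outputs a hypothesis consistent with the training data, a standard PAC bound states that m ≥ (1/ε)(ln|H| + ln(1/δ)) examples suffice. A minimal Python sketch of this calculation (the function name is our own, for illustration):
import math

def pac_sample_bound(h_size, epsilon, delta):
    # m >= (1/epsilon) * (ln|H| + ln(1/delta)) examples suffice for a
    # consistent learner over a finite hypothesis space of size h_size.
    return math.ceil((1.0 / epsilon) * (math.log(h_size) + math.log(1.0 / delta)))

# e.g., |H| = 2**10, 90% accuracy (epsilon = 0.1), 95% confidence (delta = 0.05)
print(pac_sample_bound(2**10, 0.1, 0.05))   # -> 100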
C. VC (Vapnik-Chervonenkis) Dimension
• VC Dimension is a measure of the complexity or capacity of a hypothesis class.
• It represents the largest set of points that can be shattered (correctly classified in all possible
ways) by a hypothesis class.
• A higher VC dimension means that a model is more complex and flexible, but it may also
overfit.
Example:
• A linear classifier in a 2D plane has a VC dimension of 3: it can shatter three points in
general position (not all on one line), but no arrangement of four points can be shattered.
• A decision tree with unlimited depth has a very high VC dimension, leading to potential
overfitting.
3. Query Learning Model
• The learner can actively ask queries (e.g., membership queries) to a teacher or oracle,
rather than passively receiving examples.
Example:
• A medical diagnosis system might ask an expert whether a new symptom pattern belongs to
a specific disease class.
4. Mistake Bound Model
• Measures the maximum number of mistakes a learning algorithm can make before it
converges to the correct hypothesis.
• A hypothesis class is learnable if there is an algorithm with a finite mistake bound.
• Used in online learning, where data is received sequentially instead of all at once.
Example:
• A spam filter adjusts its decision rules after each new email it classifies incorrectly.
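As an illustration of a finite mistake bound, consider the Halving algorithm, which predicts by majority vote over all hypotheses still consistent with the data seen so far; it makes at most log2|H| mistakes when the target is in H. A minimal sketch (the hypothesis callables and the example stream are assumed inputs, not a specific library API):
def halving_algorithm(hypotheses, stream):
    # Predict by majority vote of the surviving hypotheses, then discard
    # every hypothesis that disagrees with the revealed label.  Each
    # mistake at least halves the version space, so at most
    # log2(len(hypotheses)) mistakes occur if the target is present.
    version_space = list(hypotheses)
    mistakes = 0
    for x, label in stream:
        votes = [h(x) for h in version_space]
        prediction = max(set(votes), key=votes.count)   # majority vote
        if prediction != label:
            mistakes += 1
        version_space = [h for h in version_space if h(x) == label]
    return mistakes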
5. Bayesian Learning Model
• Treats learning as probabilistic inference: a prior distribution over hypotheses is updated
as evidence arrives.
• Works well when we have prior knowledge, but requires defining meaningful priors.
Example:
• In medical diagnosis, Bayesian models can incorporate prior knowledge about disease
probabilities.
6. Statistical Query (SQ) Model
• A modification of PAC learning that allows the algorithm to access only statistical
properties of data instead of individual samples.
• Useful for learning in the presence of noise.
• Many deep learning algorithms follow SQ-like approaches by working with aggregate
statistics instead of individual labels.
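A minimal sketch of the idea, assuming a fixed labeled sample: the learner interacts only with a query function that returns estimated expectations of boolean properties, never individual examples (the names sq_oracle and chi are our own):
def sq_oracle(samples):
    # The learner may only ask for the expectation of a boolean property
    # chi(x, y) over the data, estimated here by the empirical average;
    # it never inspects individual examples.
    def query(chi):
        return sum(chi(x, y) for x, y in samples) / len(samples)
    return query

# Example query: how often does the first feature agree with the label?
# oracle = sq_oracle(data); p = oracle(lambda x, y: x[0] == y)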
Conclusion
Different models of learnability provide theoretical foundations for understanding how and when
machine learning algorithms can learn effectively.
• PAC learning is widely used for defining sample complexity and generalization
guarantees.
• VC dimension measures the capacity of a hypothesis space.
• Bayesian learning incorporates prior knowledge for probabilistic inference.
• Query models and mistake-bound learning are useful for interactive and online learning
scenarios.
LEARNING IN THE LIMIT
Learning in the Limit is a formal model of learning introduced by E. Mark Gold (1967) that
describes how a learner can eventually identify a correct hypothesis given an infinite amount of
data over time. This model is particularly used in the study of inductive inference and formal
language learning.
1. Basic Idea of Learning in the Limit
• The learner is presented with an infinite sequence of training examples.
• The goal is to eventually converge to a correct hypothesis after seeing enough data.
• The learner is allowed to make mistakes and change hypotheses but must eventually settle
on the correct one forever.
Example:
Imagine a child learning a language. The child initially makes incorrect guesses about grammar
rules, but as they receive more examples (sentences from parents, books, or school), they eventually
settle on the correct grammar and never change it again.
2. Formal Definition
A learning algorithm A is said to learn a target function f in the limit if:
1. A produces a sequence of hypotheses h1, h2, h3, … based on training examples.
2. There exists a point n such that for all m ≥ n, the hypothesis hm is correct and does not
change afterward.
This means that after seeing enough data, the learner stabilizes on the correct hypothesis and does
not revise it anymore.
3. Key Features of Learning in the Limit
Allows mistakes early on: The learner can initially make incorrect guesses.
Does not require a bound on time or samples: It only guarantees eventual convergence.
Considers infinite data streams: Learning happens over an infinite sequence of examples.
No requirement for efficiency: The time it takes to reach the final hypothesis is not constrained.
Limitations:
Does not guarantee efficiency: The learner might take an arbitrarily long time to find the
correct hypothesis.
Requires infinite data: Real-world learning often happens with finite data, making this model
impractical for some tasks.
No restrictions on hypothesis complexity: The model does not account for computational
limitations.
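A classic learner for this model is identification by enumeration: output the first hypothesis in a fixed enumeration that is consistent with everything seen so far. A minimal sketch, assuming the hypotheses can be given as a list of callables and that the target appears among them:
def learn_in_the_limit(hypothesis_list, stream):
    # After each example, guess the first hypothesis consistent with all
    # data seen so far.  Once the guesses reach the target (which never
    # errs on any example), they stay there forever: convergence in the
    # limit.
    seen = []
    for x, label in stream:
        seen.append((x, label))
        for h in hypothesis_list:
            if all(h(xi) == yi for xi, yi in seen):
                yield h   # the current guess after this example
                break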
Comparison with PAC Learning:
While Learning in the Limit guarantees eventual correctness, PAC learning focuses on finding a
good approximation efficiently with finite data.
"Within this setting, we will be chiefly concerned with questions such as how many training
examples are sufficient to successfully learn the target function, and how many mistakes will the
learner make before succeeding."
Our goal is to answer questions such as:
Sample complexity. How many training examples are needed for a learner to converge (with high
probability) to a successful hypothesis?
Computational complexity. How much computational effort is needed for a learner to converge (with
high probability) to a successful hypothesis?
Mistake bound. How many training examples will the learner misclassify before converging to a
successful hypothesis?
The Problem Setting:
➢ Let X refer to the set of all possible instances over which target functions may be defined. For
example, X might represent the set of all people, each described by the attributes age (e.g.,
young or old) and height (short or tall).
➢ Let C refer to some set of target concepts that our learner might be called upon to learn. Each
target concept c in C corresponds to some subset of X, or equivalently to some boolean-
valued function c : X -> {0, 1}. For example, one target concept c in C might be the concept
"people who are skiers."
➢ If x is a positive example of c, then we will write c(x) = 1; if x is a negative example, c(x) =
0.
➢ We assume instances are generated at random from X according to some probability
distribution D.
➢ In general, D may be any distribution, and it will not generally be known to the learner.
➢ All that we require of D is that it be stationary; that is, that the distribution not change over
time.
➢ Training examples are generated by drawing an instance x at random according to D, then
presenting x along with its target value, c(x), to the learner.
➢ The learner L considers some set H of possible hypotheses when attempting to learn the
target concept.
➢ After observing a sequence of training examples of the target concept c, L must output some
hypothesis h from H, which is its estimate of c.
➢ The true error of hypothesis h, with respect to the target concept c and observation
distribution 𝒟 is the probability that h will misclassify an instance drawn according to 𝒟.
➢ error𝒟(h) = Pr_{x~𝒟}[ c(x) ≠ h(x) ]
➢ In a perfect world, we’d like the true error to be 0.
➢ Bias: we fix a hypothesis space H in advance.
➢ The target c may not be in H, so we instead look for an h close to c.
➢ A hypothesis h is approximately correct if
➢ error𝒟(h) ≤ ε
➢ First, we will not require that the learner output a zero-error hypothesis; we will require only
that its error be bounded by some constant, ε, that can be made arbitrarily small.
➢ Second, we will not require that the learner succeed for every sequence of randomly drawn
training examples; we will require only that its probability of failure be bounded by some
constant, δ, that can be made arbitrarily small.
➢ In short, we require only that the learner probably learn a hypothesis that is approximately
correct; hence the term probably approximately correct learning.
Consider some class C of possible target concepts and a learner L using hypothesis space H. Loosely
speaking, we will say that
"the concept class C is PAC-learnable by L using H if, for any target concept c in C, L will with
probability (1 - δ) output a hypothesis h with errorD(h) < ε, after observing a reasonable number of
training examples and performing a reasonable amount of computation".
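As a worked example (our own numbers, using the standard bound m ≥ (1/ε)(ln|H| + ln(1/δ)) for consistent learners over finite H): for conjunctions of up to n boolean literals, each literal can appear positive, negated, or not at all, so |H| = 3^n. With n = 10, ε = 0.1 and δ = 0.05 this gives m ≥ 10 · (10 ln 3 + ln 20) ≈ 10 · (10.99 + 3.00) ≈ 140 training examples.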
VAPNIK-CHERVONENKIS DIMENSION:
➢ The Vapnik-Chervonenkis (VC) dimension is a fundamental concept in machine learning and
statistical learning theory that measures the capacity or complexity of a set of functions (often
referred to as a hypothesis class). It helps to understand the learning ability of a machine
learning model in terms of its generalization power.
➢ VC Dimension Definition: The VC dimension of a hypothesis class is the largest number of
points that can be shattered by that class. To say that a set of points is shattered means that,
for every possible labeling (classification) of the points, there exists a hypothesis in the class
that correctly classifies those points according to that labeling.
➢ Formally, if a hypothesis class H can perfectly classify every possible arrangement of labels
for a set of n points, then those points are said to be shattered, and the VC dimension is at
least n.
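A minimal, dependency-free Python sketch that checks the "VC dimension ≥ 3" half of the linear-classifier example: it enumerates all 2^3 labelings of three non-collinear points and verifies that a perceptron (which converges whenever a linear separator exists) realizes each one. All names here are our own, for illustration.
from itertools import product

def perceptron_separable(points, labels, max_iter=1000):
    # Try to find weights (w, b) with sign(w.x + b) matching every label;
    # the perceptron update converges iff the labeling is linearly separable.
    w, b = [0.0, 0.0], 0.0
    for _ in range(max_iter):
        errors = 0
        for (x1, x2), y in zip(points, labels):
            s = 1 if w[0] * x1 + w[1] * x2 + b > 0 else -1
            if s != y:
                w[0] += y * x1; w[1] += y * x2; b += y
                errors += 1
        if errors == 0:
            return True
    return False

points = [(0, 0), (1, 0), (0, 1)]    # three points in general position
print(all(perceptron_separable(points, labels)
          for labels in product([-1, 1], repeat=3)))   # -> True (shattered)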
RULE LEARNING: Propositional and First-Order
➢ One of the most expressive and human readable representations for learned hypotheses is sets
of if-then rules.
➢ One way to learn sets of rules is to first learn a decision tree, then translate the tree into an
equivalent set of rules-one rule for each leaf node in the tree.
➢ One important special case involves learning sets of rules containing variables, called first-
order Horn clauses.
➢ Because sets of first-order Horn clauses can be interpreted as programs in the logic
programming language PROLOG, learning them is often called inductive logic programming
(ILP).
➢ As an example of first-order rule sets, consider the following two rules that jointly describe
the target concept Ancestor. Here we use the predicate Parent(x, y) to indicate that y is the
mother or father of x, and the predicate Ancestor(x, y) to indicate that y is an ancestor of x
related by an arbitrary number of family generations.
IF Parent(x, y) THEN Ancestor(x, y)
IF Parent(x, z) ∧ Ancestor(z, y) THEN Ancestor(x, y)
For any literals A and B, the expression (A ← B) is equivalent to (A ∨ ¬B), and the expression
¬(A ∧ B) is equivalent to (¬A ∨ ¬B). Therefore, a Horn clause H ← (L1 ∧ … ∧ Ln) can equivalently
be written as the disjunction H ∨ ¬L1 ∨ … ∨ ¬Ln.
➢ A substitution is any function that replaces variables by terms. For example, the substitution
{x/3, y/z} replaces the variable x by the term 3 and replaces the variable y by the term z.
➢ Given a substitution θ and a literal L we write Lθ to denote the result of applying substitution
θ to L.
➢ A unifying substitution for two literals L1 and L2 is any substitution θ such that L1 θ = L2 θ.
➢ Formally, the hypotheses learned by FOIL are sets of first-order rules, where each rule is
similar to a Horn clause with two exceptions.
➢ First, the rules learned by FOIL are more restricted than general Horn clauses, because the
literals are not permitted to contain function symbols (this reduces the complexity of the
hypothesis space search).
➢ Second, FOIL rules are more expressive than Horn clauses, because the literals appearing in
the body of the rule may be negated.
➢ FOIL has been applied to a variety of problem domains. For example, it has been
demonstrated to learn a recursive definition of the QUICKSORT algorithm and to learn to
discriminate legal from illegal chess positions.
➢ In particular, FOIL seeks only rules that predict when the target literal is True. Also, FOIL
performs a simple hill climbing search rather than a beam search.
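FOIL chooses among candidate literals with an information-based measure. A minimal sketch of FOIL's gain function (variable names are our own): p0, n0 count the positive and negative bindings of the rule before adding the literal, p1, n1 after, and t is the number of positive bindings still covered.
import math

def foil_gain(p0, n0, p1, n1, t):
    # Gain = t * (log2(p1/(p1+n1)) - log2(p0/(p0+n0))):
    # the reduction in bits needed to encode a positive binding,
    # weighted by how many positives the specialized rule retains.
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# e.g., a literal keeping 8 of 10 positives while cutting negatives 10 -> 2:
print(foil_gain(10, 10, 8, 2, 8))    # -> 8 * (log2(0.8) - log2(0.5)) = ~5.42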
TRANSLATING DECISION TREES INTO RULES
Translating a decision tree into rules means reading each path from the root node to a leaf node as
an "if-then" statement: each decision node along the path contributes a condition on a feature, and
the leaf node supplies the final prediction. The resulting rule set exactly mirrors the
decision-making logic of the tree.
Structure: Each path through the tree becomes a single rule, with each decision node along the path
contributing a condition to the rule.
Conditions: These conditions are typically comparisons between a feature value and a threshold
determined during tree training.
Output: The final prediction at the leaf node is the "then" part of the rule.
Example:
Imagine a decision tree predicting whether someone should be approved for a loan based on their
income and credit score:
Root node: "Credit score > 650?"
  Yes branch: "Income > 50k?"
    Yes leaf: "Approved"
    No leaf: "Approved (with conditions)"
  No branch: "Debt-to-income ratio < 0.3?"
    Yes leaf: "Approved"
    No leaf: "Denied"
Translated rules:
➢ If credit score > 650 and income > 50k, then approve.
➢ If credit score > 650 and income <= 50k, then approve (with conditions).
➢ If credit score <= 650 and debt-to-income ratio < 0.3, then approve.
➢ If credit score <= 650 and debt-to-income ratio >= 0.3, then deny.
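A minimal Python sketch of this translation, representing the tree above as nested dictionaries (the dictionary layout and function name are our own):
def tree_to_rules(node, conditions=()):
    # A node is either a leaf label (a string) or a dict of the form
    # {"test": <condition>, "yes": <subtree>, "no": <subtree>}.
    # Each root-to-leaf path yields one IF-THEN rule.
    if isinstance(node, str):
        return ["IF " + " AND ".join(conditions) + " THEN " + node]
    rules = tree_to_rules(node["yes"], conditions + (node["test"],))
    rules += tree_to_rules(node["no"], conditions + ("NOT (" + node["test"] + ")",))
    return rules

loan_tree = {"test": "credit score > 650",
             "yes": {"test": "income > 50k",
                     "yes": "Approved", "no": "Approved (with conditions)"},
             "no": {"test": "debt-to-income ratio < 0.3",
                    "yes": "Approved", "no": "Denied"}}
for rule in tree_to_rules(loan_tree):
    print(rule)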
Example: If we want to classify whether a person buys a computer based on age, income, and
credit rating, the algorithm might first learn:
IF Age < 30 AND Income = High THEN BuysComputer = No
It then removes covered examples and searches for new rules.
To score candidate rule conditions, the algorithm uses entropy and information gain.
Entropy:
H(S) = − Σi Pi log2 Pi
where:
• S is the set of examples.
• Pi is the probability of class i in S.
Information gain:
IG(S, A) = H(S) − Σv∈Values(A) (|Sv| / |S|) H(Sv)
where:
• S is the original dataset.
• Sv is the subset of examples where attribute A has value v.
• H(S) is the entropy before splitting.
• H(Sv) is the entropy after splitting.
The attribute with the highest information gain is selected for rule construction.
Example:
Consider a dataset where we classify if a person buys a computer. We compute entropy and
information gain for "Age" and "Income" and select the attribute with the highest IG for rule
construction.
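A minimal sketch of both computations, with examples represented as dictionaries of attribute values (all names are our own):
import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum_i P_i * log2(P_i) over the class labels in S.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    # IG(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v).
    total = len(labels)
    remainder = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [y for ex, y in zip(examples, labels) if ex[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder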
3. Example: Applying Separate-and-Conquer with Information Gain
Let’s say we have the following dataset:
Step 2: Compute Information Gain for Each Attribute
For "Age", partition the dataset and compute IG. Suppose we find that:
IG(S, Age) = 0.246
IG(S, Income) = 0.151
Since "Age" has the highest IG, we select "Age" for the first rule.
Step 3: Create and Apply Rule
The first rule might be:
IF Age > 40 THEN BuysComputer = Yes
This removes some examples from the dataset, and we repeat the process.
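The overall loop can be sketched as follows (learn_one_rule and covers are assumed, caller-supplied helpers rather than a specific library API, and each example is assumed to be a dict with a "label" key):
def separate_and_conquer(examples, learn_one_rule, covers):
    # Sequential covering: learn one rule at a time, remove the examples
    # that rule covers, and repeat until no positive examples remain.
    rules, remaining = [], list(examples)
    while any(ex["label"] == "Yes" for ex in remaining):
        rule = learn_one_rule(remaining)
        if rule is None:            # no further useful rule can be found
            break
        rules.append(rule)
        remaining = [ex for ex in remaining if not covers(rule, ex)]
    return rules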
Limitations of separate-and-conquer:
Greedy approach – Might not find the globally optimal rule set.
Sensitive to noise – If a bad rule is learned, it affects later rules.
Can generate too many rules – May result in overfitting.
Example:
If given parent(X, Y) data, ILP learns the recursive ancestor(X, Y) rule.
Example:
Given data:
parent(alice, bob).
parent(bob, charlie).
The learner generalizes these facts (using least general generalization and further induction
for the recursive clause) into:
ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).
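A minimal sketch of Plotkin-style least general generalization over atoms represented as tuples (the representation and names are our own; a full LGG also handles whole clauses, not just single atoms):
def lgg(t1, t2, bindings=None):
    # Identical terms stay; differing subterm pairs are replaced by a
    # shared variable (the same pair always maps to the same variable).
    if bindings is None:
        bindings = {}
    if t1 == t2:
        return t1
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and t1[0] == t2[0] and len(t1) == len(t2)):
        return (t1[0],) + tuple(lgg(a, b, bindings)
                                for a, b in zip(t1[1:], t2[1:]))
    if (t1, t2) not in bindings:
        bindings[(t1, t2)] = "X" + str(len(bindings))
    return bindings[(t1, t2)]

print(lgg(("parent", "alice", "bob"), ("parent", "bob", "charlie")))
# -> ('parent', 'X0', 'X1'), i.e. the generalized atom parent(X, Y)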
Example:
A decision tree for family relationships would first split on "Is X a parent of Y?", then refine to
include recursion.
Challenges in learning recursive rules:
Computational Complexity – Recursive rules require deeper search and more data.
Noise Sensitivity – Incorrect or missing data can break recursion.
Over-Generalization – Learning overly broad rules can lead to incorrect inferences.
Search Space Explosion – Many possible recursive rules exist, making optimization difficult.
INVERSE RESOLUTION:
1. What is Inverse Resolution?
Inverse Resolution is a technique in Inductive Logic Programming (ILP) that helps in learning new
knowledge by reasoning backward from observations to general rules. It is an inverse process of
resolution, which is used in logic programming for deduction.
• Forward Resolution: Given known facts and rules, derive new conclusions (deductive
reasoning).
• Inverse Resolution: Given observations and conclusions, infer missing facts or rules
(inductive reasoning).
Example: If we observe that "Alice is an ancestor of David," but only have data that Alice is
Bob's parent and Bob is David's parent, we might learn the general rule:
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).
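As a concrete single-step illustration (our own small example, using the standard absorption-style operator): given the conclusion C = ancestor(bob, david) and the clause C1 = parent(bob, david), inverse resolution may hypothesize C2 = ancestor(X, Y) :- parent(X, Y), because resolving C2 with C1 under the substitution {X/bob, Y/david} re-derives C. Chaining such inverted steps is what allows the recursive ancestor rule above to be induced.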