
MACHINE LEARNING

UNIT III:
Computational Learning Theory: Models of learnability: learning in the limit; probably
approximately correct (PAC) learning. Sample complexity for infinite hypothesis spaces, Vapnik-
Chervonenkis dimension. Rule Learning: Propositional and First-Order, Translating decision trees
into rules, Heuristic rule induction using separate and conquer and information gain, First-order
Horn-clause induction (Inductive Logic Programming) and Foil, Learning recursive rules, Inverse
resolution, Golem, and Progol.

Computational Learning Theory


Computational Learning Theory (CLT) is a subfield of machine learning that focuses on
understanding the theoretical foundations of learning algorithms. It aims to answer fundamental
questions about learnability, sample complexity, computational complexity, and generalization
performance of learning models.

1. Key Questions in Computational Learning Theory


CLT seeks to answer questions such as:
1. Can a given concept be learned efficiently?
2. How many training examples are needed to achieve a good approximation of the target
function?
3. How much computation is required for learning?
4. What guarantees can we provide on the generalization ability of a model?

2. Key Concepts in Computational Learning Theory


A. PAC (Probably Approximately Correct) Learning
• Introduced by Leslie Valiant in 1984.
• Defines a formal framework to analyze whether a hypothesis can be learned with high
probability and within a certain error tolerance.
• A hypothesis h is PAC-learnable if, with high probability (probably), it is an approximately
correct representation of the target function.
Key Aspects of PAC Learning
1. ε (Epsilon) - Accuracy Parameter: The allowed error in the hypothesis.
2. δ (Delta) - Confidence Parameter: The probability that the learned hypothesis is within the
error bound.
3. Sample Complexity: Number of training examples needed to achieve the desired accuracy
and confidence.
4. Computational Complexity: The time required to find a good hypothesis.
Example: If a PAC-learning algorithm guarantees an error rate of at most 5% with 95%
confidence, it means that in at least 95% of the cases, the learned model will have an error of no
more than 5%.
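A quick way to see the sample-complexity idea in numbers is the classic bound for a finite hypothesis space, m ≥ (1/ε)(ln|H| + ln(1/δ)). The sketch below simply evaluates it; the hypothesis-space size is an arbitrary illustrative value, not from the original notes.

import math

def pac_sample_bound(h_size, eps, delta):
    # Classic PAC bound for a consistent learner over a finite
    # hypothesis space H: m >= (1/eps) * (ln|H| + ln(1/delta)).
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# Hypothetical: |H| = 2**10 hypotheses, 5% error (eps), 95% confidence (delta = 0.05)
print(pac_sample_bound(2 ** 10, eps=0.05, delta=0.05))  # -> 199 examples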

B. Sample Complexity
• Sample complexity refers to the number of training examples required for a model to
generalize well.
• A function class is learnable if the sample complexity grows reasonably with the size of the
problem.
• PAC learning provides bounds on how much data is required to achieve a certain level of
accuracy.
Factors Affecting Sample Complexity
1. Size of Hypothesis Space: A larger hypothesis space requires more samples.
2. Complexity of the Concept: More complex concepts need more data.
3. Desired Accuracy (ε) and Confidence (δ): Lower ε (higher accuracy) and lower δ (higher
confidence) require more samples.

C. VC (Vapnik-Chervonenkis) Dimension
• VC Dimension is a measure of the complexity or capacity of a hypothesis class.
• It represents the largest set of points that can be shattered (correctly classified in all possible
ways) by a hypothesis class.
• A higher VC dimension means that a model is more complex and flexible, but it may also
overfit.
Example:
• A linear classifier in a 2D plane has a VC dimension of 3, because it can shatter any three
points in general position (not collinear), but no set of four points.
• A decision tree with unlimited depth has a very high VC dimension, leading to potential
overfitting.
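Both claims can be checked by brute force. The sketch below is an added illustration (not from the original notes): it trains a simple perceptron, which provably converges on linearly separable data, on every labeling of three non-collinear points, then shows a four-point configuration where the XOR labeling cannot be separated.

import itertools
import numpy as np

def separable(points, labels, epochs=1000):
    # Perceptron with a bias term; returns True iff it reaches zero
    # training mistakes (i.e., the labeling is linearly separable).
    X = np.hstack([points, np.ones((len(points), 1))])
    y = np.where(np.array(labels) == 1, 1, -1)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
                mistakes += 1
        if mistakes == 0:
            return True
    return False

pts3 = np.array([[0, 0], [1, 0], [0, 1]])            # general position
print(all(separable(pts3, lab)                        # True: all 8 labelings work,
          for lab in itertools.product([0, 1], repeat=3)))  # so VC dimension >= 3

pts4 = np.array([[0, 0], [1, 1], [1, 0], [0, 1]])    # XOR labeling is not separable
print(all(separable(pts4, lab)
          for lab in itertools.product([0, 1], repeat=4)))  # False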

D. No Free Lunch Theorem


• States that no single learning algorithm is best for all problems.
• Averaged over all possible problems, every algorithm performs equally well; an algorithm that
does well on some tasks must do correspondingly worse on others.
• Implication: Choosing the right bias and algorithm for a given problem is crucial!

3. Important Theoretical Results


1. Hoeffding’s Inequality: Provides a probabilistic bound on the difference between the
empirical error and the true error of a hypothesis (a worked bound follows this list).
2. Fundamental Theorem of Statistical Learning: Relates VC dimension to sample complexity
and generalization ability.
3. Occam’s Razor Principle: Simpler hypotheses are preferred as they generalize better.
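As a worked illustration of Hoeffding’s inequality (an added example with arbitrarily chosen parameters): the bound Pr(|error_train(h) − error_D(h)| > ε) ≤ 2e^(−2mε²) implies that m ≥ ln(2/δ) / (2ε²) examples suffice to keep the empirical error within ε of the true error with probability at least 1 − δ. For ε = 0.1 and δ = 0.05, this gives m ≥ ln(40)/0.02 ≈ 185 examples.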

4. Computational Complexity in Learning


• Some learning problems are NP-hard, meaning they require exponential time to solve.
• Efficient learning is often achieved through heuristics, approximation algorithms, or
restricting the hypothesis space.

5. Practical Implications of Computational Learning Theory


• Ensures that machine learning models generalize well.
• Helps in selecting the right model complexity and training data size.
• Provides mathematical guarantees on learning performance.
• Guides the development of better learning algorithms.
MODELS OF LEARNABILITY
In computational learning theory, different models of learnability define the conditions under which a
machine learning algorithm can efficiently learn a target function. These models describe the amount
of data, computational effort, and guarantees on accuracy needed for learning.
1. PAC (Probably Approximately Correct) Learning Model
Introduced by Leslie Valiant (1984), PAC learning provides a formal framework for understanding
how efficiently a hypothesis can be learned with high probability.
Key Concepts of PAC Learning:
• A hypothesis h is considered PAC-learnable if, given sufficient training examples, it can
approximate the target function f with high probability.
• The goal is to learn a hypothesis h such that:
o The probability of error is at most some small value ε (epsilon).
o The probability of successful learning is at least 1 - δ (delta).
Mathematical Definition:
A concept class C is PAC-learnable if there exists an algorithm that, for any ε, δ > 0, can find a
hypothesis h such that:

Pr[ error_D(h) ≤ ε ] ≥ 1 − δ

using a polynomial number of training samples and computational steps.


Example:
If a PAC-learning algorithm guarantees ε = 0.05 and δ = 0.05, it means that in at least 95% of cases,
the learned model will have an error of no more than 5%.
Key Factors Affecting PAC Learnability:
1. Sample Complexity: The number of training examples needed for learning.
2. Computational Complexity: The time required to learn the hypothesis.
3. Representation of Hypothesis Space: The type of functions the model considers.
4. Noise in Data: PAC assumes noise-free data, but modifications exist for noisy cases.
Limitations of PAC Learning:
• Assumes that data is drawn from a fixed probability distribution.
• It does not handle adversarial noise well.
• Not all concept classes are PAC-learnable.

2. VC (Vapnik-Chervonenkis) Dimension and Learnability


• The VC Dimension measures the capacity of a hypothesis class H by determining how many
points it can shatter (classify correctly in all possible ways).
• If a hypothesis class has a finite VC dimension, then it is PAC-learnable.
• The higher the VC dimension, the more training examples are needed to generalize well.

Example:
• A linear classifier in 2D can shatter at most 3 points → VC dimension = 3.
• A decision tree with unlimited depth has a very high VC dimension, leading to potential
overfitting.

3. Exact Learning Model (Query Learning)


• The learner can query an oracle (teacher) for information about the target function.
• Two types of queries:
1. Membership Queries: Learner provides an instance and asks for its label.
2. Equivalence Queries: Learner proposes a hypothesis, and the oracle either accepts it
or provides a counterexample.
• This model is useful for interactive learning but is not always practical.

Example:
• A medical diagnosis system might ask an expert whether a new symptom pattern belongs to
a specific disease class.
4. Mistake Bound Model
• Measures the maximum number of mistakes a learning algorithm can make before it
converges to the correct hypothesis.
• A hypothesis class is learnable if there is an algorithm with a finite mistake bound.
• Used in online learning, where data is received sequentially instead of all at once.

Example:
• A spam filter adjusts its decision rules after each new email it classifies incorrectly.
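A minimal sketch of mistake-driven online learning (an added illustration: the perceptron here stands in for the spam filter, and the four two-feature "emails" are invented):

import numpy as np

def online_perceptron(stream):
    # Mistake-bound style learner: the hypothesis (weight vector) changes
    # only on misclassified examples, and we count those mistakes.
    w, mistakes = None, 0
    for x, y in stream:                      # y in {-1, +1}
        x = np.asarray(x, dtype=float)
        if w is None:
            w = np.zeros_like(x)
        if y * (w @ x) <= 0:                 # mistake (wrong sign or undecided)
            w += y * x                       # update fires only on mistakes
            mistakes += 1
    return w, mistakes

stream = [([1.0, 0.0], +1), ([0.0, 1.0], -1),   # hypothetical labeled emails
          ([1.0, 1.0], +1), ([0.2, 1.0], -1)]
w, m = online_perceptron(stream)
print("weights:", w, "mistakes:", m)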

5. Bayesian Learning Model


• Assumes that prior probabilities exist for different hypotheses.
• Learns by updating probabilities using Bayes' theorem based on observed data.
• The best hypothesis h is the one with the highest posterior probability:

h_MAP = argmax_{h ∈ H} P(h | D) = argmax_{h ∈ H} P(D | h) P(h)
• Works well when we have prior knowledge, but requires defining meaningful priors.

Example:
• In medical diagnosis, Bayesian models can incorporate prior knowledge about disease
probabilities.
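A tiny numeric sketch of such a Bayesian update (the prevalence, sensitivity, and false-positive rate are assumed values chosen for illustration):

p_d = 0.01                   # assumed prior: 1% of patients have the disease
p_pos_given_d = 0.90         # assumed test sensitivity
p_pos_given_not_d = 0.05     # assumed false-positive rate

# Bayes' theorem: P(d | +) = P(+ | d) P(d) / P(+)
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 3))   # ~0.154: a positive test raises the 1% prior to ~15%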
6. Statistical Query (SQ) Model
• A modification of PAC learning that allows the algorithm to access only statistical
properties of data instead of individual samples.
• Useful for learning in the presence of noise.
• Many deep learning algorithms follow SQ-like approaches by working with aggregate
statistics instead of individual labels.

Conclusion
Different models of learnability provide theoretical foundations for understanding how and when
machine learning algorithms can learn effectively.
• PAC learning is widely used for defining sample complexity and generalization
guarantees.
• VC dimension measures the capacity of a hypothesis space.
• Bayesian learning incorporates prior knowledge for probabilistic inference.
• Query models and mistake-bound learning are useful for interactive and online learning
scenarios.
LEARNING IN THE LIMIT
Learning in the Limit is a formal model of learning introduced by E. Mark Gold (1967) that
describes how a learner can eventually identify a correct hypothesis given an infinite amount of
data over time. This model is particularly used in the study of inductive inference and formal
language learning.
1. Basic Idea of Learning in the Limit
• The learner is presented with an infinite sequence of training examples.
• The goal is to eventually converge to a correct hypothesis after seeing enough data.
• The learner is allowed to make mistakes and change hypotheses but must eventually settle
on the correct one forever.

Example:
Imagine a child learning a language. The child initially makes incorrect guesses about grammar
rules, but as they receive more examples (sentences from parents, books, or school), they eventually
settle on the correct grammar and never change it again.
2. Formal Definition
A learning algorithm A is said to learn a target function f in the limit if:
1. A produces a sequence of hypotheses h1, h2, h3, … based on training examples.
2. There exists a point n such that for all m ≥ n, the hypothesis h_m is correct and does not
change afterward.
This means that after seeing enough data, the learner stabilizes on the correct hypothesis and does
not revise it anymore.
3. Key Features of Learning in the Limit

• Allows mistakes early on: The learner can initially make incorrect guesses.
• Does not require a bound on time or samples: It only guarantees eventual convergence.
• Considers infinite data streams: Learning happens over an infinite sequence of examples.
• No requirement for efficiency: The time it takes to reach the final hypothesis is not constrained.

4. Example: Learning a Language in the Limit


Imagine a learner trying to identify the correct grammar rules of a language from example sentences.
• The learner initially hypothesizes wrong grammar rules.
• As they encounter more sentences, they refine their hypothesis.
• After seeing enough examples, the learner settles on the correct grammar.
• Once the correct grammar is learned, they never change it again.
This is an example of learning in the limit, as the learner eventually stabilizes on the correct
concept.
5. Limitations of Learning in the Limit

• Does not guarantee efficiency: The learner might take an arbitrarily long time to find the
correct hypothesis.
• Requires infinite data: Real-world learning often happens with finite data, making this model
impractical for some tasks.
• No restrictions on hypothesis complexity: The model does not account for computational
limitations.
Comparison with PAC Learning:

While Learning in the Limit guarantees eventual correctness, PAC learning focuses on finding a
good approximation efficiently with finite data.

PROBABLY APPROXIMATELY CORRECT (PAC) LEARNING


We focus here on the problem of inductively learning an unknown target function, given only
training examples of this target function and a space of candidate hypotheses.

“Within this setting, we will be chiefly concerned with questions such as how many training
examples are sufficient to successfully learn the target function, and how many mistakes will the
learner make before succeeding”.
Our goal is to answer questions such as:

Sample complexity. How many training examples are needed for a learner to converge (with high
probability) to a successful hypothesis?
Computational complexity. How much computational effort is needed for a learner to converge (with
high probability) to a successful hypothesis?
Mistake bound. How many training examples will the learner misclassify before converging to a
successful hypothesis?
The Problem Setting:
➢ Let X refer to the set of all possible instances over which target functions may be defined. For
example, X might represent the set of all people, each described by the attributes age (e.g.,
young or old) and height (short or tall).
➢ Let C refer to some set of target concepts that our learner might be called upon to learn. Each
target concept c in C corresponds to some subset of X, or equivalently to some boolean-
valued function c : X -> {0, 1}. For example, one target concept c in C might be the concept
"people who are skiers."
➢ If x is a positive example of c, then we will write c(x) = 1; if x is a negative example, c(x) =
0.
➢ We assume instances are generated at random from X according to some probability
distribution D.
➢ In general, D may be any distribution, and it will not generally be known to the learner.
➢ All that we require of D is that it be stationary; that is, that the distribution not change over
time.
➢ Training examples are generated by drawing an instance x at random according to D, then
presenting x along with its target value, c(x), to the learner.
➢ The learner L considers some set H of possible hypotheses when attempting to learn the
target concept.
➢ After observing a sequence of training examples of the target concept c, L must output some
hypothesis h from H, which is its estimate of c.
➢ The true error of hypothesis h, with respect to the target concept c and observation
distribution 𝒟 is the probability that h will misclassify an instance drawn according to 𝒟.
➢ error_D(h) = Pr_{x~D}[ c(x) ≠ h(x) ]
➢ In a perfect world, we’d like the true error to be 0.
➢ Bias: Fix hypothesis space H
➢ c may not be in H => Find h close to c
➢ A hypothesis h is approximately correct if error_D(h) ≤ ε

➢ First, we will not require that the learner output a zero-error hypothesis; we will require only
that its error be bounded by some constant ε that can be made arbitrarily small.
➢ Second, we will not require that the learner succeed for every sequence of randomly drawn
training examples; we will require only that its probability of failure be bounded by some
constant δ that can be made arbitrarily small.
➢ In short, we require only that the learner probably learn a hypothesis that is approximately
correct; hence the term probably approximately correct learning.
Consider some class C of possible target concepts and a learner L using hypothesis space H. Loosely
speaking, we will say that
“the concept class C is PAC-learnable by L using H if, for any target concept c in C, L will with
probability (1 - δ) output a hypothesis h with error_D(h) < ε, after observing a reasonable number of
training examples and performing a reasonable amount of computation”.

SAMPLE COMPLEXITY FOR INFINITE HYPOTHESIS SPACES


When dealing with infinite hypothesis spaces, determining how many training samples are needed
to achieve good generalization is a fundamental question in machine learning. Sample complexity
refers to the number of training examples required to learn a hypothesis with high accuracy and
confidence.
1. Why is Sample Complexity Important?
• Ensures that the learned hypothesis generalizes well to unseen data.
• Helps in understanding the trade-off between data size and learning performance.
• Avoids overfitting by providing bounds on how much data is necessary.
When the hypothesis space is infinite, we must use formal measures like VC (Vapnik-
Chervonenkis) dimension to determine sample complexity.

2. Sample Complexity in PAC Learning for Infinite Hypothesis Spaces


In Probably Approximately Correct (PAC) learning, we want to find a hypothesis h that is
approximately correct with high probability.
Given:
• ε (epsilon): Maximum allowed error.
• δ (delta): Confidence level (1 - δ is the probability that the hypothesis is within ε error).
• VC(H): The Vapnik-Chervonenkis (VC) dimension of the hypothesis space H.
The number of training examples m needed to guarantee that a hypothesis has error at most ε with
probability at least 1 - δ is given by the following standard bound (due to Blumer et al., as presented
in Mitchell's Machine Learning):

m ≥ (1/ε) · (4 log₂(2/δ) + 8 · VC(H) · log₂(13/ε))

Interpretation of the Formula:
1. Higher VC Dimension → More Samples Needed
o If the hypothesis space H is very expressive (e.g., deep neural networks), VC(H) is
high, requiring more data to generalize.
2. Stronger Guarantees (Lower ε, Lower δ) → More Samples Needed
o If we want very high accuracy (small ε) or high confidence (small δ), we need
more data.
3. Logarithmic Dependence on Confidence (δ), Polynomial Dependence on Accuracy (ε)
o Sample complexity scales logarithmically with confidence and polynomially with
accuracy.
3. Example: Sample Complexity Calculation
Case 1: Learning a Linear Classifier in 2D
• Hypothesis space: Linear classifiers in 2D.
• VC dimension: 3 (since a linear separator can shatter at most 3 points).
• Desired accuracy: ε = 0.05 (5% error).
• Confidence level: δ = 0.01 (99% confidence).
Using the formula with VC(H) = 3, ε = 0.05, and δ = 0.01:

m ≥ (1/0.05) · (4 log₂(200) + 8 · 3 · log₂(260)) ≈ 20 · (30.6 + 192.5) ≈ 4,463 training examples.
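The same arithmetic as a runnable sketch (assuming the Blumer et al. bound quoted above):

import math

def vc_sample_bound(vc_dim, eps, delta):
    # m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))
    return math.ceil((4 * math.log2(2 / delta)
                      + 8 * vc_dim * math.log2(13 / eps)) / eps)

print(vc_sample_bound(vc_dim=3, eps=0.05, delta=0.01))  # -> 4463 examples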

4. Growth of Sample Complexity with VC Dimension


• If VC(H) is small, fewer samples are needed.
• If VC(H) is large, sample complexity grows polynomially in VC(H).
• If VC(H) is infinite, learning is impossible without additional constraints.

5. Special Case: Infinite Hypothesis Spaces with Bounded VC Dimension


Even if H is infinite, learning is possible if it has a finite VC dimension.
• Example: The space of all decision trees with depth ≤ d has VC dimension ≤ 2^d.
• Deep Neural Networks: The VC dimension depends on the architecture and activation
functions.
6. Conclusion
• Finite VC Dimension → Learning is possible with polynomially bounded sample
complexity.
• Infinite VC Dimension → Sample complexity is too high, and generalization may fail.
• Trade-off: A more complex hypothesis class needs more data to generalize.

VAPNIK-CHERVONENKIS DIMENSION:
➢ The Vapnik-Chervonenkis (VC) dimension is a fundamental concept in machine learning and
statistical learning theory that measures the capacity or complexity of a set of functions (often
referred to as a hypothesis class). It helps to understand the learning ability of a machine
learning model in terms of its generalization power.
➢ VC Dimension Definition: The VC dimension of a hypothesis class is the largest number of
points that can be shattered by that class. To say that a set of points is shattered means that,
for every possible labeling (classification) of the points, there exists a hypothesis in the class
that correctly classifies those points according to that labeling.
➢ Formally, if a hypothesis class H can perfectly classify every possible arrangement of labels
for a set of n points, then those points are said to be shattered, and the VC dimension is at
least n.
RULE LEARNING: Propositional and First-Order

➢ One of the most expressive and human readable representations for learned hypotheses is sets
of if-then rules.
➢ One way to learn sets of rules is to first learn a decision tree, then translate the tree into an
equivalent set of rules-one rule for each leaf node in the tree.
➢ One important special case involves learning sets of rules containing variables, called first-
order Horn clauses.
➢ Because sets of first-order Horn clauses can be interpreted as programs in the logic
programming language PROLOG, learning them is often called inductive logic programming
(ILP).
➢ As an example of first-order rule sets, consider the following two rules that jointly describe
the target concept Ancestor. Here we use the predicate Parent(x, y) to indicate that y is the
mother or father of x, and the predicate Ancestor(x, y) to indicate that y is an ancestor of x
related by an arbitrary number of family generations.

IF Parent(x, y) THEN Ancestor(x, y)
IF Parent(x, z) ∧ Ancestor(z, y) THEN Ancestor(x, y)
➢ Inductive learning of first-order rules or theories is often referred to as inductive logic


programming (or ILP for short), because this process can be viewed as automatically inferring
PROLOG programs from examples.
➢ PROLOG is a general purpose, Turing-equivalent programming language in which programs
are expressed as collections of Horn clauses.
Example: Consider the task of learning the simple target concept Daughter (x, y) defined over pairs
of people x and y.
➢ The value of Daughter(x, y) is True when x is the daughter of y, and False otherwise.
Suppose each person in the data is described by the attributes Name, Mother, Father, Male,
Female.
➢ Hence, each training example will consist of the description of two people in terms of these
attributes, along with the value of the target attribute Daughter. For example, the following is
a positive example in which Sharon is the daughter of Bob:
(Name1 = Sharon, Mother1 = Louise, Father1 = Bob,
Male1 = False, Female1 = True,
Name2 = Bob, Mother2 = Nora, Father2 = Victor,
Male2 = True, Female2 = False, Daughter1,2 = True)
A program using first-order representations could learn the following general rule:
IF Father(y, x) ∧ Female(y), THEN Daughter(x, y)
where x and y are variables that can be bound to any person.
➢ constants (e.g., Mary, 23, or Joe),
➢ variables (e.g., x),
➢ predicates (e.g., Female, as in Female(Mary)),
➢ functions (e.g., age, as in age(Mary)).
➢ A term is any constant, any variable, or any function applied to any term. Examples include
Mary, x, age(Mary), age(x).
➢ A literal is any predicate (or its negation) applied to any set of terms. Examples include
Female(Mary), ¬Female(x), Greater_than (age(Mary), 20).
➢ A ground literal is a literal that does not contain any variables (e.g., ¬Female(Joe)).
➢ A negative literal is a literal containing a negated predicate (e.g., ¬ Female(Joe)).
➢ A positive literal is a literal with no negation sign (e.g., Female(Mary)).
➢ A clause is any disjunction of literals M1 ∨ M2 ∨ … ∨ Mn whose variables are universally
quantified.
➢ A Horn clause is an expression that contains at most one positive literal; that is, a clause of
the form H ∨ ¬L1 ∨ ¬L2 ∨ … ∨ ¬Ln.

For any literals A and B, the expression (A ← B) is equivalent to (A ∨ ¬B), and the expression ¬(A
∧ B) is equivalent to (¬A ∨ ¬B). Therefore, the Horn clause above can equivalently be written as the
rule

H ← (L1 ∧ L2 ∧ … ∧ Ln)
➢ A substitution is any function that replaces variables by terms. For example, the substitution
{x/3, y/z} replaces the variable x by the term 3 and replaces the variable y by the term z.
➢ Given a substitution θ and a literal L we write Lθ to denote the result of applying substitution
θ to L.
➢ A unifying substitution for two literals L1 and L2 is any substitution θ such that L1 θ = L2 θ.
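A small sketch of applying a substitution (the tuple encoding of literals and the dict encoding of θ are illustrative choices, not a standard API):

def apply_subst(term, theta):
    # Terms: a string is a constant or variable; a tuple is
    # (predicate_or_function, arg1, ..., argn). theta maps variables to terms.
    if isinstance(term, tuple):
        return (term[0],) + tuple(apply_subst(a, theta) for a in term[1:])
    return theta.get(term, term)

lit = ('Father', 'y', 'x')
print(apply_subst(lit, {'x': 'sharon', 'y': 'bob'}))  # ('Father', 'bob', 'sharon')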
➢ Formally, the hypotheses learned by FOIL are sets of first-order rules, where each rule is
similar to a Horn clause with two exceptions.
➢ First, the rules learned by FOIL are more restricted than general Horn clauses, because the
literals are not permitted to contain function symbols (this reduces the complexity of the
hypothesis space search).
➢ Second, FOIL rules are more expressive than Horn clauses, because the literals appearing in
the body of the rule may be negated
➢ FOIL has been applied to a variety of problem domains. For example, it has been
demonstrated to learn a recursive definition of the QUICKSORT algorithm and to learn to
discriminate legal from illegal chess positions.
➢ In particular, FOIL seeks only rules that predict when the target literal is True. Also, FOIL
performs a simple hill climbing search rather than a beam search.
TRANSLATING DECISION TREES INTO RULES
Translating a decision tree into rules in machine learning involves essentially reading each path from
the root node to a leaf node as a set of "if-then" statements, where each decision node along the path
represents a condition on a feature, and the leaf node represents the final prediction based on those
conditions, effectively creating a rule set that mirrors the decision-making logic of the tree.
Structure: Each path through the tree becomes a single rule, with each decision node along the path
contributing a condition to the rule.
Conditions: These conditions are typically comparisons between a feature value and a threshold
determined during tree training.
Output: The final prediction at the leaf node is the "then" part of the rule.
Example:
Imagine a decision tree predicting whether someone should be approved for a loan based on their
income and credit score:
Root node: "Credit score > 650?"
Yes branch: "Income > 50k?"
Yes leaf: "Approved"
No leaf: "Approved (with conditions)"
No branch: "Debt-to-income ratio < 0.3?"
Yes leaf: "Approved"
No leaf: "Denied“
Translated rules:
➢ If credit score > 650 and income > 50k, then approve.
➢ If credit score > 650 and income <= 50k, then approve
(with conditions).
➢ If credit score <= 650 and debt-to-income ratio < 0.3,
then approve.
➢ If credit score <= 650 and debt-to-income ratio >= 0.3,
then deny.

[Age <= 30?]
  Yes -> [Income > 50k?]
           Yes -> [Buy = Yes]
           No  -> [Buy = No]
  No  -> [Buy = No]

This decision tree can be translated into the following if-then rules:
Rule 1: If Age <= 30 and Income > 50k, then Buy = Yes.
Rule 2: If Age <= 30 and Income <= 50k, then Buy = No.
Rule 3: If Age > 30, then Buy = No.
Explanation of the Rules:
Rule 1 covers the path where the customer's age is 30 or younger, and they have an income greater
than 50k, predicting that they will buy the product.
Rule 2 handles the case where the customer's age is 30 or younger, but their income is less than or
equal to 50k, predicting that they will not buy the product.
Rule 3 deals with all customers older than 30, predicting that they will not buy the product.
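The translation itself is a simple tree walk. The sketch below uses a hypothetical nested-dict encoding of the Age/Income tree above; each root-to-leaf path becomes one rule.

def tree_to_rules(node, conditions=()):
    # Leaves are strings; internal nodes are {"test": ..., "yes": ..., "no": ...}.
    if isinstance(node, str):
        yield (list(conditions), node)        # finished rule for this path
        return
    test = node["test"]
    yield from tree_to_rules(node["yes"], conditions + (test,))
    yield from tree_to_rules(node["no"], conditions + (f"NOT({test})",))

tree = {"test": "Age <= 30",
        "yes": {"test": "Income > 50k", "yes": "Buy = Yes", "no": "Buy = No"},
        "no": "Buy = No"}
for conds, outcome in tree_to_rules(tree):
    print("IF", " AND ".join(conds), "THEN", outcome)

This prints Rules 1-3 above, with "Age > 30" appearing as NOT(Age <= 30).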
Advantages of Decision Tree Rules
Simplicity: The rules are easy to follow and explain, especially in decision-making contexts.
Flexibility: Each rule operates independently. You can easily add, remove, or modify individual rules
without affecting the others.
Transparency: The decision-making process is clear, as each rule explicitly shows which feature
conditions lead to the decision.
Pruning Rules
Once the rules are generated, they can be pruned by removing unnecessary conditions. For example,
if a rule contains conditions that don't impact the prediction, they can be removed to simplify the
rule.
Example of Rule Pruning:
If a rule initially states:
"If Age <= 30 and Income > 50k and Gender = Male, then Buy = Yes",
but Gender does not influence the decision, it can be simplified to:
"If Age <= 30 and Income > 50k, then Buy = Yes."

HEURISTIC RULE INDUCTION USING SEPARATE-AND-CONQUER AND INFORMATION GAIN
Introduction
Heuristic rule induction is a greedy approach to learning rules from data. One popular strategy for
this is Separate-and-Conquer, which iteratively finds rules that cover part of the data, removes
covered examples, and repeats the process. A common heuristic for selecting rules is Information
Gain, which helps identify the most informative attributes for rule construction.
1. Separate-and-Conquer Strategy
The Separate-and-Conquer strategy is an alternative to divide-and-conquer (used in decision
trees). Instead of splitting data recursively, it follows these steps:
Steps in Separate-and-Conquer Rule Induction:
1. Find the best rule (covering some positive examples).
2. Remove covered examples (so they are not learned again).
3. Repeat until all positive examples are covered or a stopping condition is met.

Example: If we want to classify whether a person buys a computer based on age, income, and
credit rating, the algorithm might first learn:
IF Age < 30 AND Income = High THEN BuysComputer = No
It then removes covered examples and searches for new rules.

2. Information Gain as a Heuristic for Rule Selection


To select the best rule at each step, we use Information Gain, a concept from entropy-based
learning. It measures how much a particular attribute reduces uncertainty in classification.
Entropy (Measure of Uncertainty)
Entropy is given by:

H(S) = − Σ_i p_i log2(p_i)

where:
• S is the set of examples.
• p_i is the proportion of examples in S belonging to class i.

Information Gain Formula
Information Gain for an attribute A is:

IG(S, A) = H(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · H(S_v)

where:
• S is the original dataset.
• S_v is the subset of examples where attribute A has value v.
• H(S) is the entropy before splitting.
• H(S_v) is the entropy of the subset S_v after splitting.
The attribute with the highest information gain is selected for rule construction.

Example:
Consider a dataset where we classify if a person buys a computer. We compute entropy and
information gain for "Age" and "Income" and select the attribute with the highest IG for rule
construction.
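A runnable sketch of these computations (the four-row dataset is invented for illustration; on it, Age splits the classes perfectly while Income carries no information):

import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum_i p_i log2 p_i over the class proportions in S
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, labels):
    # IG(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)
    n = len(rows)
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[attr] == v]
        remainder += len(idx) / n * entropy([labels[i] for i in idx])
    return entropy(labels) - remainder

rows = [{"Age": "young", "Income": "high"}, {"Age": "young", "Income": "low"},
        {"Age": "old",   "Income": "high"}, {"Age": "old",   "Income": "low"}]
labels = ["no", "no", "yes", "yes"]
print(info_gain(rows, "Age", labels), info_gain(rows, "Income", labels))  # 1.0 0.0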
3. Example: Applying Separate-and-Conquer with Information Gain
Suppose we have a small customer dataset with attributes Age and Income and a target attribute
BuysComputer.
Step 1: Compute the entropy H(S) of the full dataset.
Step 2: Compute Information Gain for Each Attribute
For "Age", partition the dataset and compute IG. Suppose we find that:
IG(S, Age) = 0.246 and IG(S, Income) = 0.151
Since "Age" has the highest IG, we select "Age" for the first rule.
Step 3: Create and Apply Rule
The first rule might be:
IF Age > 40 THEN BuysComputer = Yes
This removes some examples from the dataset, and we repeat the process.
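The covering loop itself is compact. The sketch below is a generic skeleton, assuming a caller-supplied find_best_rule search (e.g., greedy and information-gain-guided) that returns a rule together with the indices of the examples it covers:

def separate_and_conquer(examples, find_best_rule):
    # examples: list of (features, label) pairs with label 1 for positives.
    # Loop: learn a rule, remove what it covers, repeat until no positives remain.
    rules, remaining = [], list(examples)
    while any(label == 1 for _, label in remaining):
        rule, covered = find_best_rule(remaining)
        if not covered:              # stopping condition: no useful rule found
            break
        rules.append(rule)
        remaining = [ex for i, ex in enumerate(remaining) if i not in covered]
    return rules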

4. Advantages and Disadvantages of Separate-and-Conquer


Advantages
• Interpretable – Rules are easy to understand.
• Handles missing values well – Since rules are independent, missing attributes do not affect
classification.
• Works with imbalanced data – Focuses on covering positive examples first.
Disadvantages
• Greedy approach – Might not find the globally optimal rule set.
• Sensitive to noise – If a bad rule is learned, it affects later rules.
• Can generate too many rules – May result in overfitting.

LEARNING RECURSIVE RULES:

1. What are Recursive Rules?


Recursive rules are if-then rules in which the predicate being defined appears in the rule's own body,
allowing the representation of hierarchical or repetitive relationships. These rules are common in
Inductive Logic Programming (ILP) and relational learning.
Example of a Recursive Rule (Prolog Style)
Consider a family tree where we define an ancestor relationship:
ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).
Here:
• The base case (ancestor(X, Y) :- parent(X, Y).) states that if X is a parent of Y, then X is an
ancestor.
• The recursive case (ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).) means that if X is a
parent of Z, and Z is an ancestor of Y, then X is an ancestor of Y.

2. Why is Learning Recursive Rules Important?
• Captures hierarchical relationships (e.g., family trees, organization hierarchies).
• Generalizes beyond fixed-depth patterns (e.g., "is a part of" relationships in object
hierarchies).
• Useful in natural language processing (NLP) (e.g., parsing sentences with nested structures).
• Efficient representation of transitive relations (e.g., graph traversal, web links, transport
networks).

3. Methods for Learning Recursive Rules


Learning recursive rules requires a relational learning approach. The most common methods are:
A. Inductive Logic Programming (ILP)
ILP systems like FOIL, Progol, and TILDE use first-order logic (FOL) to induce recursive rules
from data.
• FOIL (First Order Inductive Learner): Learns rules incrementally using information gain.
• Progol: Uses inverse entailment to find the most specific hypothesis.
• TILDE: Induces decision trees with logical predicates.

Example:
If given parent(X, Y) data, ILP learns the recursive ancestor(X, Y) rule.

B. Bottom-Up Approaches (Least General Generalization - LGG)


• Finds most specific rules that explain observed examples.
• Generalizes rules to create recursive definitions.

Example:
Given data:
parent(alice, bob).
parent(bob, charlie).
The LGG algorithm generalizes:
ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).

C. Top-Down Approaches (Divide and Conquer)


• Starts with general rules and specializes them to fit data.
• Used in decision tree learning with first-order logic.

Example:
A decision tree for family relationships would first split on "Is X a parent of Y?", then refine to
include recursion.

4. Example: Learning Recursive Rules from Data (Prolog & ILP)


Given training examples:
parent(alice, bob).
parent(bob, charlie).
parent(charlie, david).
An ILP system might induce:
ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).
Now, if we ask:
?- ancestor(alice, david).
The system will correctly infer true using recursion.
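What the recursive rule computes can be reproduced procedurally. The sketch below (an added illustration, not how a Prolog engine actually evaluates) derives the same ancestor facts by naive bottom-up iteration to a fixpoint:

def ancestors(parent_facts):
    # Base case: every parent pair is an ancestor pair.
    anc = set(parent_facts)
    changed = True
    while changed:                           # apply the recursive case until
        changed = False                      # no new facts can be derived
        for (x, z) in parent_facts:
            for (z2, y) in list(anc):
                if z == z2 and (x, y) not in anc:
                    anc.add((x, y))
                    changed = True
    return anc

parents = {("alice", "bob"), ("bob", "charlie"), ("charlie", "david")}
print(("alice", "david") in ancestors(parents))  # True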

5. Challenges in Learning Recursive Rules

• Computational Complexity – Recursive rules require deeper search and more data.
• Noise Sensitivity – Incorrect or missing data can break recursion.
• Over-Generalization – Learning overly broad rules can lead to incorrect inferences.
• Search Space Explosion – Many possible recursive rules exist, making optimization difficult.

6. Applications of Recursive Rule Learning
• Natural Language Processing (NLP) – Parsing complex sentence structures.
• Biological Systems – Modeling protein interaction networks.
• Graph-Based Learning – Learning relationships in social networks and knowledge graphs.
• Planning and Robotics – Learning recursive movement rules (e.g., navigating mazes).

INVERSE RESOLUTION:
1. What is Inverse Resolution?
Inverse Resolution is a technique in Inductive Logic Programming (ILP) that helps in learning new
knowledge by reasoning backward from observations to general rules. It is an inverse process of
resolution, which is used in logic programming for deduction.
• Forward Resolution: Given known facts and rules, derive new conclusions (deductive
reasoning).
• Inverse Resolution: Given observations and conclusions, infer missing facts or rules
(inductive reasoning).
👉 Example: If we observe that "Alice is an ancestor of David," but only have data that Alice is
Bob's parent and Bob is David’s parent, we might learn the general rule:
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).

2. Why is Inverse Resolution Important?
• Discovers new knowledge by inferring general rules.
• Handles missing information in logical reasoning.
• Enables learning recursive rules (e.g., family relationships, hierarchical data).
• Used in Inductive Logic Programming (ILP) for AI and knowledge representation.

3. How Does Inverse Resolution Work?


Inverse resolution operates by reversing the steps of logical resolution to generate hypotheses. The
key steps include:
1. Absorption: Generalizing specific facts into more abstract rules.
2. Identification: Inferring missing predicates or relationships.
3. Intra-Construction: Combining facts to learn new relations.
4. Inter-Construction: Merging unrelated facts into a general rule.

4. Example of Inverse Resolution in Action


Given Facts (Observations in Prolog-style Logic)
parent(alice, bob).
parent(bob, charlie).
parent(charlie, david).
ancestor(alice, david). % Observed fact
We do inverse resolution to learn the missing rule:
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).
Now, we can prove new facts, such as:
?- ancestor(alice, charlie).
% Expected Output: True
This means Alice is an ancestor of Charlie, inferred by our learned rule.

5. Inverse Resolution Techniques


There are multiple approaches to performing inverse resolution:
A. Absorption
• Generalizes a specific example by introducing variables.
• Example: Given parent(alice, bob)., we generalize to parent(X, Y).
B. Identification
• Identifies missing predicates from given observations.
• Example: Given parent(alice, bob) and ancestor(alice, bob), infer ancestor(X, Y) :- parent(X,
Y).
C. Intra-Construction
• Combines related facts into a new rule.
• Example:
ancestor(alice, bob). ancestor(bob, charlie).
→ Learn: ancestor(X, Y) :- ancestor(X, Z), ancestor(Z, Y).
D. Inter-Construction
• Merges unrelated facts into a new rule.
• Example:
flies(bird). has_wings(bird).
→ Learn: flies(X) :- has_wings(X).

6. Applications of Inverse Resolution
• Inductive Logic Programming (ILP) – Learning rules from data.
• Automated Knowledge Discovery – Extracting rules from databases.
• Natural Language Understanding (NLU) – Inferring missing relationships in text.
• AI Planning – Learning rules for robotic actions.

7. Challenges of Inverse Resolution
• Computational Complexity – Requires searching large spaces of possible rules.
• Handling Noisy Data – Incorrect observations lead to incorrect rules.
• Overgeneralization – Risk of learning overly broad rules that don't fit all cases.
GOLEM AND PROGOL
1. Introduction
Inductive Logic Programming (ILP) is a subfield of machine learning that learns logical rules from
examples and background knowledge using first-order logic (FOL). It is widely used in knowledge
discovery, natural language processing, and bioinformatics.
Two major ILP algorithms are:
• Golem (A bottom-up ILP system)
• Progol (A top-down ILP system using inverse entailment)
2. Golem: Bottom-Up ILP Algorithm
Golem is a bottom-up ILP system that uses Least General Generalization (LGG) to
learn rules from positive examples.
How Golem Works
1. Start with positive examples from the dataset.
2. Compute Least General Generalization (LGG) to find common structures in
examples.
3. Generalize rules iteratively while keeping consistency with background
knowledge.
4. Output the most specific hypothesis that covers all positive examples.
Example of Golem
Given facts:
parent(alice, bob).
parent(bob, charlie).
parent(charlie, david).
Golem generalizes:
ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).
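A minimal sketch of LGG for function-free atoms (the tuple encoding and variable-naming scheme are illustrative choices, not Golem's actual implementation):

def lgg(a, b, bindings=None):
    # Least general generalization of two atoms, e.g.
    # lgg(('parent','alice','bob'), ('parent','bob','charlie')) -> ('parent','X0','X1')
    if bindings is None:
        bindings = {}
    if a[0] != b[0] or len(a) != len(b):
        return None                       # different predicates: no common atom
    out = [a[0]]
    for s, t in zip(a[1:], b[1:]):
        if s == t:
            out.append(s)                 # identical arguments are kept
        else:
            if (s, t) not in bindings:    # the same mismatched pair always
                bindings[(s, t)] = f"X{len(bindings)}"  # maps to the same variable
            out.append(bindings[(s, t)])
    return tuple(out)

print(lgg(("parent", "alice", "bob"), ("parent", "bob", "charlie")))
# -> ('parent', 'X0', 'X1'), i.e. the generalized literal parent(X, Y)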
Advantages of Golem
✅ Efficient for large datasets.
✅ Avoids overfitting by computing least general rules.
✅ Works well when background knowledge is limited.
Disadvantages of Golem
❌ Cannot handle negative examples well.
❌ Limited expressiveness compared to Progol.

3. Progol: A Top-Down ILP Algorithm Using Inverse Entailment


Progol is a top-down ILP system that learns rules using inverse entailment. Unlike
Golem, it can handle both positive and negative examples.
How Progol Works
1. Select a positive example to learn a rule for.
2. Find the most specific hypothesis (bottom clause) covering that example.
3. Generalize it using inverse entailment to find a more general rule.
4. Test the rule on negative examples and refine it.
5. Repeat until a consistent set of rules is found.
Example of Progol
Given data:
parent(alice, bob).
parent(bob, charlie).
parent(charlie, david).
ancestor(alice, david).
Progol generates:
ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).
Unlike Golem, Progol refines rules by ensuring they do not cover negative examples.
Advantages of Progol
✅ More expressive than Golem (handles both positive and negative examples).
✅ Uses inverse entailment, making it more logical and structured.
✅ Handles recursive rules well (e.g., ancestor relationships).
Disadvantages of Progol
❌ Computationally expensive for large datasets.
❌ More complex than Golem.

5. Applications of Golem and Progol
• Bioinformatics – Learning protein structures and gene interactions.
• Natural Language Processing (NLP) – Learning grammar rules.
• Knowledge Graphs – Extracting relationships between entities.
• Fraud Detection – Identifying fraud patterns using logical rules.
6. Conclusion
• Golem is a bottom-up ILP system that generalizes examples using LGG.
• Progol is a top-down ILP system that uses inverse entailment to learn rules.
• Progol is more powerful but computationally expensive, while Golem is simpler and
efficient for large datasets.
Important Questions:
What is Computational Learning Theory, and why is it important in machine learning?
What are the different models of learnability in Computational Learning Theory?
Explain the concept of learning in the limit. How does it define learnability?
What is Probably Approximately Correct (PAC) Learning, and how does it ensure generalization?
What conditions must be satisfied for a hypothesis to be PAC-learnable?
What is sample complexity, and how does it affect learning in infinite hypothesis spaces?
Explain the role of the Vapnik-Chervonenkis (VC) dimension in measuring the capacity of a
hypothesis space.
How does the VC dimension relate to generalization in machine learning?
What is the relationship between PAC learning and the VC dimension?
How do sample complexity bounds ensure efficient learning in infinite hypothesis spaces?
What is rule learning, and how does it differ from other machine learning techniques?
Differentiate between propositional rule learning and first-order rule learning.
How can decision trees be translated into rules? Provide an example.
Explain the separate-and-conquer strategy in heuristic rule induction.
How does information gain help in heuristic rule learning?
What is the significance of First-Order Horn-Clause Induction in learning logical rules?
Compare rule-based learning with decision tree learning in terms of interpretability and
efficiency.
How does First-Order Logic (FOL) improve rule learning compared to propositional logic?
Discuss the challenges associated with learning rules from noisy data.
How does heuristic search influence the effectiveness of rule learning?
