MACHINE LEARNING: UNIT 3
TREE MODELS:
1. A tree model is a hierarchical structure of conditions, in which the leaves contain the outcome of the tree.
2. Tree models represent recursive divide-and-conquer strategies.
3. They are among the most popular models in machine learning because they are easy to understand and interpret.
4. For example, Microsoft's Kinect uses tree models to detect body pose.
For instance, the feature tree in Figure 5.1 (left) is equivalent to the path in Figure 4.6 (left). This
equivalence is best seen by tracing the path and the tree from the bottom upward.
1. The left-most leaf of the feature tree represents the concept at the bottom of the path, covering a
single positive example.
2. The next concept up in the path generalizes the literal Length = 3 into Length = [3,5] by means
of internal disjunction; the added coverage (one positive example) is represented by the second
leaf from the left in the feature tree.
3. By dropping the condition Teeth = few we add another two covered positives.
4. Dropping the ‘Length’ condition altogether (or extending the internal disjunction with the one
remaining value ‘4’) adds the last positive and also a negative.
5. Dropping Beak = yes covers no additional examples (this corresponds to the closed concepts of the previous chapter).
6. Finally, dropping Gills = no covers the four remaining negatives.
Figure 5.1. (left) The path from Figure 4.6, redrawn in the form of a tree. The coverage numbers in the
leaves are obtained from the data in Example 4.4. (right) A decision tree learned on the same data. This
tree separates the positives and negatives perfectly.
• A path through the hypothesis space can be turned into an equivalent feature tree. To obtain
a tree that is equivalent to the i-th concept from the bottom of the path, either truncate the tree by
combining the left-most i leaves into a single leaf representing that concept, or label the left-most
i leaves positive and the remaining leaves negative, turning the feature tree into a decision tree.
• Decision trees do not employ internal disjunction for features with more than two values, but
instead branch on each separate value, as shown in Figure 5.1 (right). This tree can be
turned into a logical expression in several equivalent ways:
(Gills = no ∧ Length = 3) ∨ (Gills = no ∧ Length = 4 ∧ Teeth = many) ∨ (Gills = no ∧ Length = 5)
Gills = no ∧ [Length = 3 ∨ (Length = 4 ∧ Teeth = many) ∨ Length = 5]
¬[(Gills = no ∧ Length = 4 ∧ Teeth = few) ∨ Gills = yes]
(Gills = yes ∨ Length = [3,5] ∨ Teeth = many) ∧ Gills = no
The first expression is in disjunctive normal form (DNF) and is obtained by forming a
disjunction of all paths from the root of the tree to leaves labelled positive, where each path
gives a conjunction of literals.
• Decision trees are strictly more expressive than conjunctive concepts.
• One way to avoid overfitting is to deliberately choose a restrictive hypothesis language.
• A FEATURE TREE is a tree such that each internal node (the nodes that are not leaves) is
labelled with a feature, and each edge emanating from an internal node is labelled with a literal.
The set of literals at a node is called a SPLIT. Each leaf of the tree represents a logical
expression, which is the conjunction of literals encountered on the path from the root of the tree
to the leaf. The extension of that conjunction (the set of instances covered by it) is called the
INSTANCE SPACE SEGMENT associated with the leaf.
• Algorithm 5.1 is a divide-and-conquer algorithm: it divides the data into subsets, builds a tree
for each of those subsets, and then combines the subtrees into a single tree.
• Divide-and-conquer algorithms are usually implemented recursively, because each subproblem (building a
tree for a subset of the data) has the same form as the original problem. These algorithms are greedy:
whenever there is a choice (such as choosing the best split), the best alternative is selected on
the basis of the information available at that point, and this choice is never reconsidered. This may lead to
sub-optimal choices.
• An alternative would be a backtracking search algorithm, which can return an optimal
solution at the expense of increased computation time.
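The following Python sketch mirrors this greedy, recursive divide-and-conquer structure (in the spirit of Algorithm 5.1). The data representation (a list of (instance, class) pairs with instances as dictionaries) and the helper names are assumptions made here for illustration; BestSplit is passed in as a parameter, since it is only defined later.

```python
from collections import Counter

class Leaf:
    def __init__(self, label):
        self.label = label              # class predicted in this leaf

class Node:
    def __init__(self, feature, children):
        self.feature = feature          # feature used for the split
        self.children = children        # dict: feature value -> subtree

def homogeneous(D):
    """True if all instances in D are of the same class."""
    return len({c for _, c in D}) <= 1

def label(D):
    """Majority class of the instances in D."""
    return Counter(c for _, c in D).most_common(1)[0][0]

def grow_tree(D, features, best_split):
    """Greedy, recursive divide-and-conquer tree construction.
    D: list of (instance, class) pairs; features: a set of feature names."""
    if homogeneous(D) or not features:
        return Leaf(label(D))
    f = best_split(D, features)          # greedy choice, never reconsidered
    partitions = {}
    for x, c in D:                       # divide the data on the values of f
        partitions.setdefault(x[f], []).append((x, c))
    # children are only created for feature values actually observed in D
    children = {v: grow_tree(Di, features - {f}, best_split)
                for v, Di in partitions.items()}
    return Node(f, children)             # combine the subtrees into one tree
```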
5.1 DECISION TREES:
• Decision Trees are a non-parametric supervised learning method used for both classification
and regression tasks.
• A decision tree is a flowchart-like structure in which each internal node represents a test on
an attribute (e.g., a yes/no question), each branch represents an outcome of that test, and each leaf node
represents a class label (the decision taken after evaluating all attributes on the path). The paths from root to
leaf represent classification rules.
• In decision analysis, a decision tree and the closely related influence diagram are used as visual
and analytical decision support tools, in which the expected values (or expected utilities) of
competing alternatives are calculated.
• A decision tree consists of three types of nodes:
1. Decision nodes – typically represented by squares
2. Chance nodes – typically represented by circles
3. End nodes – typically represented by triangles
• Decision trees are commonly used in operations research and operations management, and for
calculating conditional probabilities.
• For a CLASSIFICATION TASK we can simply define a set of instances D to be homogeneous if they
are all from the same class; the function Label(D) will then obviously return that class.
• In line 5 of Algorithm 5.1 we may be calling Label(D) with a non-homogeneous set of
instances, in case one of the Di is empty; the general definition of Label(D) is therefore that it returns
the majority class of the instances in D. This leaves us to decide how to define the function
BestSplit(D, F).
• We are dealing with Boolean features, so D is split into D1 and D2. Let's assume we have two
classes, and denote by D+ and D− the positives and negatives in D (and likewise D1+ etc.). The
question is how to assess the utility of a feature in terms of splitting the examples into positives
and negatives. Clearly, the best situation is where D1+ = D+ and D1− = ∅, or where D1+ = ∅ and
D1− = D−; in that case, the two children of the split are said to be pure. So we need to measure the
impurity of a set of n+ positives and n− negatives. Impurity can be defined in terms of the
proportion p˙ = n+/(n+ + n−), which is called the empirical probability of the positive class. Impurity
should not change if we swap the positive and negative class, which means that it should stay the same
if we replace p˙ with 1−p˙.
Figure 5.2. (left) Impurity functions plotted against the empirical probability of the positive
class. From the bottom: the relative size of the minority class, min(p˙, 1−p˙); the Gini index,
2p˙(1−p˙); entropy, −p˙ log2 p˙ − (1−p˙) log2 (1−p˙) (divided by 2 so that it reaches its maximum at
the same point as the others); and the square root of the Gini index, √(p˙(1−p˙)) – notice that
this last function describes a semi-circle. (right) Geometric construction to determine the
impurity of a split (Teeth = [many, few]) from Example 5.1: p˙ is the empirical probability of the
parent, and p˙1 and p˙2 are the empirical probabilities of the children.
1. MINORITY CLASS min(p˙, 1−p˙) – this is sometimes referred to as the error rate, as it
measures the proportion of misclassified examples if the leaf were labelled with the majority
class; the purer the set of examples, the fewer errors this makes.
2. GINI INDEX 2p˙(1−p˙) – this is the expected error if we label the examples in the leaf randomly:
positive with probability p˙ and negative with probability 1−p˙. The probability of a false positive
is then p˙(1−p˙) and the probability of a false negative is (1−p˙)p˙.
3. ENTROPY −p˙ log2 p˙ − (1−p˙) log2 (1−p˙) – this is the expected information, in bits, conveyed by
the class of a randomly drawn example; the purer the set of examples, the more predictable
this message becomes and the smaller the expected information.
A plot of these three impurity measures can be seen in Figure 5.2 (left).
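These measures (plus the square root of the Gini index from Figure 5.2) can be written directly as functions of the empirical probability p˙. A minimal sketch in Python, with function names chosen here for illustration:

```python
import math

def minority_class(p):
    """Error rate when the leaf is labelled with the majority class."""
    return min(p, 1 - p)

def gini_index(p):
    """Expected error when labelling the leaf randomly: positive with probability p."""
    return 2 * p * (1 - p)

def entropy(p):
    """Expected information (in bits) in the class of a randomly drawn example."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def sqrt_gini(p):
    """Square root of p(1-p); as a function of p it traces a semi-circle."""
    return math.sqrt(p * (1 - p))

# All four are zero for a pure leaf and maximal at p = 0.5:
print([round(f(0.5), 3) for f in (minority_class, gini_index, entropy, sqrt_gini)])
```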
Denoting the impurity of a single leaf Dj as Imp(Dj), the impurity of a set of mutually exclusive leaves
{D1, . . . , Dl} is defined as the WEIGHTED AVERAGE
Imp({D1, . . . , Dl}) = Σj (|Dj|/|D|) · Imp(Dj)
where D = D1 ∪ . . . ∪ Dl. For a binary split there is a nice geometric construction to find Imp({D1, D2})
given the empirical probabilities of the parent and the children, which is illustrated in Figure 5.2
(right):
1. We first find the impurity values Imp(D1) and Imp(D2) of the two children on the impurity
curve (here the Gini index).
2. We then connect these two values by a straight line, as any weighted average of the two must
be on that line.
3. Since the empirical probability of the parent is also a weighted average of the empirical
probabilities of the children, with the same weights (i.e., p˙ = (|D1|/|D|) · p˙1 + (|D2|/|D|) · p˙2 – the
derivation is given in Equation 5.2), p˙ gives us the correct interpolation point.
EXAMPLE 5.1 (CALCULATING IMPURITY): Find the best feature to put at the root of the decision tree.
The four features available result in the following splits:
Length = [3, 4, 5] [2+, 0−] [1+, 3−] [2+, 2−]
Gills = [yes, no] [0+, 4−] [5+, 1−]
Beak = [yes, no] [5+, 3−] [0+, 2−]
Teeth = [many, few] [3+, 4−] [2+, 1−]
Let us calculate the impurity of the first split, which has three segments:
the first segment is pure and so has entropy 0;
the second has entropy −(1/4)log2(1/4) − (3/4)log2(3/4) = 0.50 + 0.31 = 0.81;
the third has entropy 1.
The total entropy is then the weighted average of these: 2/10 · 0 + 4/10 · 0.81 + 4/10 · 1 = 0.72.
Similar calculations for the other three features give the following entropies:
Gills: 4/10 · 0 + 6/10 · (−(5/6)log2(5/6) − (1/6)log2(1/6)) = 0.39;
Beak: 8/10 · (−(5/8)log2(5/8) − (3/8)log2(3/8)) + 2/10 · 0 = 0.76;
Teeth: 7/10 · (−(3/7)log2(3/7) − (4/7)log2(4/7)) + 3/10 · (−(2/3)log2(2/3) − (1/3)log2(1/3)) = 0.97.
‘Gills’ is clearly an excellent feature to split on; ‘Teeth’ is poor; and the other two are somewhere in
between.
The calculations for the Gini index are as follows (notice that these are on a scale from 0 to 0.5):
Length: 2/10 · 2 · (2/2 · 0/2) + 4/10 · 2 · (1/4 · 3/4) + 4/10 · 2 · (2/4 · 2/4) = 0.35;
Gills: 4/10 · 0 + 6/10 · 2 · (5/6 · 1/6) = 0.17;
Beak: 8/10 · 2 · (5/8 · 3/8) + 2/10 · 0 = 0.38;
Teeth: 7/10 · 2 · (3/7 · 4/7) + 3/10 · 2 · (2/3 · 1/3) = 0.48.
As expected, the two impurity measures are in close agreement. See Figure 5.2 (right) for a
geometric illustration of the last calculation concerning ‘Teeth’.
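The numbers in Example 5.1 can be reproduced with a short script; the child counts below are copied from the splits listed in the example, and the weighted average follows the definition of Imp({D1, . . . , Dl}) given earlier.

```python
import math

def entropy(pos, neg):
    p = pos / (pos + neg)
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini(pos, neg):
    p = pos / (pos + neg)
    return 2 * p * (1 - p)

def weighted_impurity(children, imp):
    """Weighted average impurity of a list of (pos, neg) child counts."""
    total = sum(p + n for p, n in children)
    return sum((p + n) / total * imp(p, n) for p, n in children)

splits = {
    'Length': [(2, 0), (1, 3), (2, 2)],
    'Gills':  [(0, 4), (5, 1)],
    'Beak':   [(5, 3), (0, 2)],
    'Teeth':  [(3, 4), (2, 1)],
}
for name, children in splits.items():
    print(name,
          round(weighted_impurity(children, entropy), 2),  # 0.72, 0.39, 0.76, 0.97
          round(weighted_impurity(children, gini), 2))     # 0.35, 0.17, 0.38, 0.48
```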
If impurity is measured by entropy, the resulting splitting criterion is known as INFORMATION GAIN.
Figure 5.3. (left) Decision tree learned from the data in Example 4.4 on p.115. (right) Each internal
and leaf node of the tree corresponds to a line segment in coverage space: vertical segments for pure
positive nodes, horizontal segments for pure negative nodes, and diagonal segments for impure nodes.
Leaf C leaves three feature values unspecified, with a total of 3·2·2 = 12 possible combinations of
values; four of these were supplied as training examples, so leaf C covers eight unlabelled instances
and classifies them as negative. Similarly, two unlabelled instances are classified as positive by leaf D,
and a further two by leaf F; one is classified as negative by leaf G, and the remaining one as positive by
leaf H. The fact that more unlabelled instances are classified as negative (9) than as positive (5) is thus
mostly due to leaf C: because it is a leaf high up in the tree, it covers many instances.
Every node of the tree, internal or leaf, covers a certain number of positives and negatives, and can hence be
plotted as a line segment in coverage space. For instance, the root of the tree covers all positives and all
negatives, and hence is represented by the ascending diagonal A. Once we add our first split, segment A
is replaced by segment B (an impure node and hence diagonal) and segment C, which is pure and not
split any further. Segment B is further split into D (pure and positive), E (impure) and F (pure and
positive). Finally, E is split into two pure nodes.
ALGORITHM 5.1 only compares splits with the same parent, and so we can ignore the impurity of the
parent and search for the feature which results in the lowest weighted average impurity of its children
(ALGORITHM 5.2).
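A sketch of BestSplit(D, F) in this spirit: for every candidate feature, partition the data on its values and keep the feature with the lowest weighted average impurity of the children. The (instance, class) representation and the choice of entropy as the impurity measure are assumptions for illustration; the data in the usage example is the set of ten labelled examples listed later in Example 6.1.

```python
import math
from collections import Counter

def entropy_of(D):
    """Entropy of the class distribution in D (a list of (instance, class) pairs)."""
    counts, n = Counter(c for _, c in D), len(D)
    return -sum(k / n * math.log2(k / n) for k in counts.values())

def best_split(D, features):
    """Return the feature whose split yields the lowest weighted average impurity."""
    best_feature, best_impurity = None, float('inf')
    for f in features:
        partitions = {}
        for x, c in D:
            partitions.setdefault(x[f], []).append((x, c))
        imp = sum(len(Di) / len(D) * entropy_of(Di) for Di in partitions.values())
        if imp < best_impurity:
            best_feature, best_impurity = f, imp
    return best_feature

# Usage on the ten examples of Example 6.1 (positives p1-p5, negatives n1-n5):
D = [({'Length': 3, 'Gills': 'no',  'Beak': 'yes', 'Teeth': 'many'}, '+'),
     ({'Length': 4, 'Gills': 'no',  'Beak': 'yes', 'Teeth': 'many'}, '+'),
     ({'Length': 3, 'Gills': 'no',  'Beak': 'yes', 'Teeth': 'few'},  '+'),
     ({'Length': 5, 'Gills': 'no',  'Beak': 'yes', 'Teeth': 'many'}, '+'),
     ({'Length': 5, 'Gills': 'no',  'Beak': 'yes', 'Teeth': 'few'},  '+'),
     ({'Length': 5, 'Gills': 'yes', 'Beak': 'yes', 'Teeth': 'many'}, '-'),
     ({'Length': 4, 'Gills': 'yes', 'Beak': 'yes', 'Teeth': 'many'}, '-'),
     ({'Length': 5, 'Gills': 'yes', 'Beak': 'no',  'Teeth': 'many'}, '-'),
     ({'Length': 4, 'Gills': 'yes', 'Beak': 'no',  'Teeth': 'many'}, '-'),
     ({'Length': 4, 'Gills': 'no',  'Beak': 'yes', 'Teeth': 'few'},  '-')]
print(best_split(D, ['Length', 'Gills', 'Beak', 'Teeth']))   # 'Gills', as in Example 5.1
```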
Figure 5.4. (top) Abstract representation of a tree with numbers of positive and negative
examples covered in each node. Binary splits are added to the tree in the order indicated.
(bottom) Adding a split to the tree will add new segments to the coverage curve as indicated by
the arrows. After a split is added the segments may need reordering, and so only the solid lines
represent actual coverage curves.
So, adding a split to a decision tree can be interpreted in terms of coverage curves as the following two-
step process:
1. Split the corresponding curve segment into two or more segments;
2. Reorder the segments on decreasing slope.
If a feature tree has l leaves and we have c classes, then the number of possible labellings of leaves with
classes is c^l; in the example of Figure 5.4 this is 2^4 = 16. Figure 5.5 depicts these 16 labellings in
coverage space.
Figure 5.5: Graphical depiction of all possible labellings and all possible rankings that can
be obtained with the four-leaf decision tree in Figure 5.4. There are 2^4 = 16 possible leaf
labellings; for example, ‘+−+−’ denotes labelling the first and third leaf from the left as + and the
second and fourth leaf as −. Also indicated are some pairwise symmetries (dotted lines): e.g., +−
+− and −+−+ are each other’s inverse and end up at opposite ends of the plot. There are 4! = 24
possible blue-violet-red-orange paths through these points which start in −−−− and switch
each leaf to + in some order; these represent all possible four-segment coverage curves or
rankings.
Figure 5.6. (top) To achieve the labelling +−++ we don’t need the right-most split, which can
therefore be pruned away. (bottom) Pruning doesn’t affect the chosen operating point, but it
does decrease the ranking performance of the tree.
• The operation of merging all leaves in a subtree is called PRUNING THE SUBTREE. The
process is illustrated in Figure 5.6. The advantage of pruning is that we can simplify the tree
without affecting the chosen operating point, which is sometimes useful if we want to
communicate the tree model to somebody else. The disadvantage is that we lose ranking
performance, as illustrated in Figure 5.6 (bottom). Pruning is therefore not recommended
unless (i) you only intend to use the tree for classification, not for ranking or probability
estimation; and (ii) you can define the expected operating conditions with sufficient precision.
One popular algorithm for pruning decision trees is called reduced-error pruning, and is given
in Algorithm 5.3. The algorithm employs a separate pruning set of labelled data not seen during
training, as pruning will never improve accuracy over the training data.
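A minimal sketch of reduced-error pruning in the spirit of Algorithm 5.3, under the following assumptions made here for illustration: nodes store the majority class of the training examples that reached them, the pruning set is a held-out list of (instance, class) pairs, and a subtree is replaced by a majority-class leaf whenever this does not increase the number of errors on the pruning examples routed to it.

```python
class Leaf:
    def __init__(self, label):
        self.label = label

class Node:
    def __init__(self, feature, children, majority):
        self.feature = feature      # feature tested at this node
        self.children = children    # dict: feature value -> subtree
        self.majority = majority    # majority training class at this node

def classify(tree, x):
    while isinstance(tree, Node):
        tree = tree.children.get(x[tree.feature], Leaf(tree.majority))
    return tree.label

def reduced_error_prune(tree, prune_set):
    """Bottom-up pruning on a held-out pruning set (never the training data)."""
    if isinstance(tree, Leaf) or not prune_set:
        return tree
    # route the pruning examples to the child they fall into
    routed = {v: [] for v in tree.children}
    for x, c in prune_set:
        if x[tree.feature] in routed:
            routed[x[tree.feature]].append((x, c))
    # prune the subtrees first, then consider pruning this node
    tree.children = {v: reduced_error_prune(t, routed[v])
                     for v, t in tree.children.items()}
    errors_subtree = sum(classify(tree, x) != c for x, c in prune_set)
    errors_leaf = sum(tree.majority != c for _, c in prune_set)
    return Leaf(tree.majority) if errors_leaf <= errors_subtree else tree
```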
Example 5.3 (Cost-sensitivity of splitting criteria). Suppose you have 10 positives and 10
negatives, and you need to choose between the two splits [8+,2−][2+,8−] and [10+,6−][0+,4−].
Calculate the weighted average entropy of both splits, and you conclude that the first split is the
better one. Calculate the weighted average Gini index, and again the first split wins. With the
square root of the Gini index, however, the second split comes out (slightly) ahead.
You then remember that mistakes on the positives are about ten times as costly as mistakes on
the negatives. You're not quite sure how to work out the maths, and so you decide to simply take
ten copies of every positive: the splits are now [80+,2−][20+,8−] and [100+,6−][0+,4−].
You recalculate the three splitting criteria and now all three favour the second split.
Even though you're slightly bemused by all this, you settle for the second split, since all three
splitting criteria are now unanimous in their recommendation.
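The effect in Example 5.3 can be checked numerically. The sketch below scores a split by the weighted average impurity of its (pos, neg) children (lower is better) and compares the two splits before and after inflating the positives; the function names are chosen here for illustration.

```python
import math

def entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini(p):
    return 2 * p * (1 - p)

def sqrt_gini(p):
    return math.sqrt(p * (1 - p))

def split_impurity(children, imp):
    """Weighted average impurity of a list of (pos, neg) children; lower is better."""
    total = sum(p + n for p, n in children)
    return sum((p + n) / total * imp(p / (p + n)) for p, n in children)

def preferred(split1, split2, imp):
    return 1 if split_impurity(split1, imp) < split_impurity(split2, imp) else 2

s1, s2 = [(8, 2), (2, 8)], [(10, 6), (0, 4)]
s1x, s2x = [(80, 2), (20, 8)], [(100, 6), (0, 4)]      # ten copies of every positive
for imp in (entropy, gini, sqrt_gini):
    print(imp.__name__, preferred(s1, s2, imp), preferred(s1x, s2x, imp))
# entropy 1 2, gini 1 2, sqrt_gini 2 2 -- only the sqrt-Gini preference is unaffected
```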
Figure 5.7. (left) ROC isometrics for entropy in blue, Gini index in violet and √Gini in red through the
splits [8+,2−][2+,8−] (solid lines) and [10+,6−][0+,4−] (dotted lines). Only √Gini prefers the second split.
(right) The same isometrics after inflating the positives with a factor 10. All splitting criteria now
favour the second split; the √Gini isometrics are the only ones that haven't moved.
Consider Figure 5.7 (left). The two splits between which you needed to choose in Example 5.3 (before
inflating the positives) are indicated as points in this plot. Six isometrics are drawn in the top-left of
the plot: two splits times three splitting criteria. A particular splitting criterion prefers the split whose
isometric is the higher of the two (closest to ROC heaven): you can see that only one of the three (√Gini)
prefers the split towards the top-right. Figure 5.7 (right) demonstrates how this changes when the
positives are inflated by a factor 10 (the grid in ROC space indicates how the class distribution has changed).
Entropy and the Gini index are sensitive to such fluctuations in the class distribution; √Gini isn't.
REGRESSION TREES:
The variance of a Boolean (i.e., Bernoulli) variable with success probability p is p(1−p), which is half the
Gini index. So we could interpret the goal of tree learning as minimizing the class variance (or the standard
deviation, in the case of √Gini) in the leaves. Regression trees are susceptible to overfitting.
In regression, the variance of a set Y of target values is defined as the average squared distance from the mean ȳ:
Var(Y) = (1/|Y|) Σy∈Y (y − ȳ)²
If a split partitions the set of target values Y into mutually exclusive sets {Y1, . . . , Yl}, the weighted
average variance is then
Σj (|Yj|/|Y|) Var(Yj) = (1/|Y|) Σy∈Y y² − Σj (|Yj|/|Y|) ȳj²
where ȳj is the mean of Yj. The first term is constant for a given set Y, and so we want to maximize the
weighted average of squared means in the children.
EXAMPLE 5.4 (LEARNING A REGRESSION TREE). Imagine you are a collector of vintage Hammond
tone wheel organs. You have been monitoring an online auction site, from which you collected some
data about interesting transactions, each recording the model, its condition, whether a Leslie speaker is
included, and the price paid:
1. B3, excellent, no Leslie, 4513
2. T202, fair, Leslie, 625
3. A100, good, no Leslie, 1051
4. T202, good, no Leslie, 270
5. M102, good, Leslie, 870
6. A100, excellent, no Leslie, 1770
7. T202, fair, no Leslie, 99
8. A100, good, Leslie, 1900
9. E112, fair, no Leslie, 77
From this data you want to construct a regression tree that will help you determine a reasonable
price for your next purchase. There are three features, hence three possible splits:
Model = [A100, B3, E112, M102, T202] [1051, 1770, 1900] [4513] [77] [870] [99, 270, 625]
Condition = [excellent, good, fair] [1770, 4513] [270, 870, 1051, 1900] [77, 99, 625]
Leslie = [yes, no] [625, 870, 1900] [77, 99, 270, 1051, 1770, 4513]
The means of the first split are 1574, 4513, 77, 870 and 331, and the weighted average of squared
means is 3.21 · 10⁶. The means of the second split are 3142, 1023 and 267, with weighted average of
squared means 2.68 · 10⁶; for the third split the means are 1132 and 1297, with weighted average of
squared means 1.55 · 10⁶. We therefore branch on Model at the top level. This gives us three
single-instance leaves, as well as three A100s and three T202s. For the A100s we obtain the following splits:
Condition = [excellent,good, fair] [1770][1051,1900][]
Leslie = [yes,no] [1900][1051,1770]
The second split results in less variance. For the T202s the splits are as follows:
Condition = [excellent,good, fair] [][270][99,625]
Leslie = [yes,no] [625][99,270]
Again we see that splitting on Leslie gives tighter clusters of values. The learned regression tree is
depicted in Figure 5.8.
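The split scores of Example 5.4 can be reproduced by computing the weighted average of squared means for each candidate split; the price groupings below are copied from the splits listed in the example.

```python
def weighted_squared_means(groups):
    """Weighted average of squared means of the target values in each child."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * (sum(g) / len(g)) ** 2 for g in groups if g)

splits = {
    'Model':     [[1051, 1770, 1900], [4513], [77], [870], [99, 270, 625]],
    'Condition': [[1770, 4513], [270, 870, 1051, 1900], [77, 99, 625]],
    'Leslie':    [[625, 870, 1900], [77, 99, 270, 1051, 1770, 4513]],
}
for name, groups in splits.items():
    print(name, round(weighted_squared_means(groups) / 1e6, 2))
# Model 3.21, Condition 2.68, Leslie 1.55 -- higher is better, so the root split is on Model
```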
CLUSTERING TREES:
Regression is a supervised learning problem, whereas clustering is unsupervised; clustering trees adapt
the tree-learning approach to the unsupervised setting.
Assume an abstract function Dis: X × X → R that measures the distance or dissimilarity of any two
instances x, x′ ∈ X, such that the higher Dis(x, x′) is, the less similar x and x′ are. The cluster
dissimilarity of a set of instances D is then calculated as
Dis(D) = (1/|D|²) Σx∈D Σx′∈D Dis(x, x′)
The weighted average cluster dissimilarity over all children of a split then gives the split
dissimilarity.
Computing the pairwise dissimilarities between the nine transactions shows, for instance, that the first
transaction is very different from the other eight. The average pairwise dissimilarity over all nine
transactions is 2.94. Using the same features from Example 5.4, the three possible splits are (now with
transaction numbers rather than prices):
Model = [A100, B3, E112, M102, T202] [3, 6, 8] [1] [9] [5] [2, 4, 7]
Condition = [excellent, good, fair] [1, 6] [3, 4, 5, 8] [2, 7, 9]
Leslie = [yes, no] [2, 5, 8] [1, 3, 4, 6, 7, 9]
The cluster dissimilarity among transactions 3, 6 and 8 is 1/3² · (0+1+2+1+0+1+2+1+0) = 0.89,
and among transactions 2, 4 and 7 it is 1/3² · (0+1+0+1+0+0+0+0+0) = 0.22. The other three
children of the first split contain only a single element and so have zero cluster dissimilarity.
The weighted average cluster dissimilarity of the split is then
3/9 · 0.89 + 1/9 · 0 + 1/9 · 0 + 1/9 · 0 + 3/9 · 0.22 = 0.37. For the second split, similar calculations
result in a split dissimilarity of 2/9 · 1.5 + 4/9 · 1.19 + 3/9 · 0 = 0.86, and the third split yields
3/9 · 1.56 + 6/9 · 3.56 = 2.89. The Model feature thus captures most of the given dissimilarities, while
the Leslie feature is virtually unrelated.
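The split dissimilarities above can be recomputed once the pairwise dissimilarities Dis(x, x′) are known; the sketch below implements the definitions given earlier and takes the pairwise dissimilarity function as a parameter. The toy usage at the end uses a simple "number of differing feature values" dissimilarity, which is an assumption for illustration and not the dissimilarity table used in the example.

```python
def cluster_dissimilarity(D, dis):
    """Dis(D) = (1/|D|^2) * sum of dis(x, x') over all ordered pairs in D."""
    if not D:
        return 0.0
    return sum(dis(x, y) for x in D for y in D) / len(D) ** 2

def split_dissimilarity(children, dis):
    """Weighted average cluster dissimilarity over the children of a split."""
    total = sum(len(D) for D in children)
    return sum(len(D) / total * cluster_dissimilarity(D, dis) for D in children)

# Toy usage: instances as tuples of feature values, number of differing values as dissimilarity.
def differing_values(x, y):
    return sum(a != b for a, b in zip(x, y))

child1 = [('good', 'no'), ('excellent', 'no')]   # the two instances differ in one value
child2 = [('fair', 'yes')]                       # a singleton child has dissimilarity 0
print(split_dissimilarity([child1, child2], differing_values))   # 2/3 * 0.5 + 1/3 * 0 = 0.33...
```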
Smaller clusters tend to have lower dissimilarity, and so it is easy to overfit. It is therefore
recommended to set aside a pruning set and to remove the lower splits if they don't improve the cluster
coherence on that pruning set.
In the above example, removing the first transaction reduces the overall pairwise dissimilarity
from 2.94 to 1.5, confirming that this single transaction accounts for much of the overall dissimilarity.
Figure 5.9: A clustering tree learned from the data in Example 5.6 using Euclidean distance on the
numerical features.
RULE MODELS:
Rule models are the second major type of logical machine learning models. Particularly in
supervised learning, a rule model is more than just a set of rules: the specification of how the rules are
to be combined to form predictions is a crucial part of the model.
There are four types of rule models:
1. Learning ordered rule lists
2. Learning unordered rule sets
3. Descriptive rule learning
4. First-order rule learning
Example 6.1 (Learning a rule list): Consider the following positive and negative examples:
p1: Length = 3 ∧ Gills = no ∧ Beak = yes ∧ Teeth = many
p2: Length = 4 ∧ Gills = no ∧ Beak = yes ∧ Teeth = many
p3: Length = 3 ∧ Gills = no ∧ Beak = yes ∧ Teeth = few
p4: Length = 5 ∧ Gills = no ∧ Beak = yes ∧ Teeth = many
p5: Length = 5 ∧ Gills = no ∧ Beak = yes ∧ Teeth = few
n1: Length = 5 ∧ Gills = yes ∧ Beak = yes ∧ Teeth = many
n2: Length = 4 ∧ Gills = yes ∧ Beak = yes ∧ Teeth = many
n3: Length = 5 ∧ Gills = yes ∧ Beak = no ∧ Teeth = many
n4: Length = 4 ∧ Gills = yes ∧ Beak = no ∧ Teeth = many
n5: Length = 4 ∧ Gills = no ∧ Beak = yes ∧ Teeth = few
The nine possible literals are shown with their coverage counts in Figure 6.2 (top). Three of these
are pure; in the impurity isometrics plot in Figure 6.2 (bottom) they end up on the x-axis and y-axis.
One of the literals covers two positives and two negatives, and therefore has the same impurity as the
overall data set; this literal ends up on the ascending diagonal in the coverage plot.
Gills = yes is the best of the three as it covers more examples, so let’s formulate our first rule as:
·if Gills = yes then Class = ⊝
Figure 6.2. (top) All literals with their coverage counts on the data in Example 6.1. The ones in
green (red) are pure for the positive (negative) class. (bottom) The nine literals plotted as points in
coverage space, with their impurity values indicated by impurity isometrics. Impurity values are colour-
coded: towards green if p˙ > 1/2, towards red if p˙ < 1/2, and orange if p˙ = 1/2 (on a 45 degree
isometric). The violet arrow indicates the selected literal, which excludes all five positives and one
negative.
Figure 6.3. (left) Revised coverage counts after removing the four negative examples covered by the
first rule found (literals not covering any examples are omitted). (right) We are now operating in the
right-most ‘slice’ of Figure 6.2.
Most rule learning algorithms now proceed as follows: they remove the examples covered by the rule
just learned from consideration, and proceed with the remaining examples. This strategy is called
SEPARATE-AND-CONQUER.
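A minimal sketch of separate-and-conquer rule list learning, under simplifying assumptions made here for illustration: rule bodies consist of a single literal, the best literal is chosen by purity (ties broken by coverage), and a default rule labels whatever remains. The general algorithm grows conjunctive rule bodies; on the data of Example 6.1, however, the first rule found by this sketch is ·if Gills = yes then Class = ⊝·, as in the text.

```python
from collections import Counter

def literals(D):
    """All (feature, value) literals that occur in the data."""
    return {(f, x[f]) for x, _ in D for f in x}

def evaluate(D, lit):
    """Purity, majority class and coverage of a literal on D."""
    f, v = lit
    covered = [c for x, c in D if x[f] == v]
    cls, cnt = Counter(covered).most_common(1)[0]
    return cnt / len(covered), cls, len(covered)

def learn_rule_list(D):
    """Separate-and-conquer: learn a rule, remove the examples it covers, repeat."""
    rules = []
    while len({c for _, c in D}) > 1:                      # remaining data still impure
        lit = max(literals(D), key=lambda l: (evaluate(D, l)[0], evaluate(D, l)[2]))
        _, cls, _ = evaluate(D, lit)
        rules.append((lit, cls))
        D = [(x, c) for x, c in D if x[lit[0]] != lit[1]]  # 'separate' the covered examples
    if D:
        rules.append((None, D[0][1]))                      # default rule for the remainder
    return rules
```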
Figure 6.4: (left) The third rule covers the one remaining negative example, so that the remaining
positives can be swept up by a default rule. (right) This will collapse the coverage space.
Figure 6.5: (left) A right-branching feature tree corresponding to a list of single-literal rules. (right)
The construction of this feature tree depicted in coverage space. The leaves of the tree are either purely
positive (in green) or purely negative (in red). Reordering these leaves on their empirical probability
results in the blue coverage curve. As the rule list separates the classes this is a perfect coverage curve.
RULE LISTS FOR RANKING AND PROBABILITY ESTIMATION: Turning a rule list into a ranker or probability
estimator is as easy as it was for decision trees.
Figure 6.6: Coverage curves of two rule lists consisting of the rules from Example 6.2, in different
order (AB in blue and BA in violet). B\A corresponds to the coverage of rule B once the coverage of
rule A is taken away, and ‘-’ denotes the default rule. Neither curve dominates the other, and thus each
has operating conditions under which it is superior. The dotted segment in red connecting the two
curves corresponds to the overlap of the two rules A∧B, which is not accessible by either rule list.
If accuracy is our performance criterion, AB would be optimal, achieving 0.80 accuracy (tpr =
0.80 and tnr = 0.80) where BA only manages 0.70 (tpr = 1 and tnr = 0.40).
The segment A∧B – the overlap of the two rules – is not accessible by either rule list. In Figure
6.6 this is indicated by the dotted segment connecting the segment B from rule list BA and the
segment B\A from rule list AB. It follows that this segment contains exactly those examples that
are in B but not in B\A, hence in A∧B. In order to access the rule overlap, we need to either
combine the two rule lists or go beyond the power of rule lists.
Rule lists are similar to decision trees in that the empirical probabilities associated with each
rule yield convex ROC and coverage curves on the training data.
6.2 LEARNING UNORDERED RULE SETS:
Table 6.1: Detailed evaluation of the top subgroups. Using Laplace-corrected precision we can evaluate
the quality of a subgroup as |precL −pos|. Alternatively, we can use average recall to define the quality of
a subgroup as |avg-rec−0.5|. These two quality measures result in slightly different rankings.
Each transaction in this table involves a set of items; conversely, for each item we can list the
transactions in which it was involved: transactions 1, 3, 4 and 6 for nappies, transactions 3, 5, 6 and 7
for apples, and so on.
Figure 6.17: An item set lattice. Item sets in dotted ovals cover a single transaction; in dashed ovals,
two transactions; in triangles, three transactions; and in polygons with n sides, n transactions. The
maximal item sets with support 3 or more are indicated in green.
RULE 1:
Frequent item sets can be used to build association rules, which are rules of the form ·if B then H·
where both body B and head H are item sets that frequently appear in transactions together.
Pick any edge in Figure 6.17, say the edge between {beer} and {nappies, beer}. We know that
the support of the former is 3 and of the latter, 2: that is, three transactions involve beer and
two of those involve nappies as well. We say that the confidence of the association rule ·if beer
then nappies· is 2/3.
Likewise, the edge between {nappies} and {nappies, beer} demonstrates that the confidence
of the rule ·if nappies then beer· is 2/4.
There are also rules with confidence 1, such as ·if beer then crisps·; and rules with empty
bodies, such as ·if true then crisps·, which has confidence 5/8 (i.e., five out of eight transactions
involve crisps).
RULE 2:
However, we only want to construct association rules that involve frequent item sets.
The rule ·if beer ∧ apples then crisps· has confidence 1, but there is only one transaction
involving all three items, and so this rule is not strongly supported by the data.
So we first use Algorithm 6.6 to mine for frequent item sets; we then select bodies B and
heads H from each frequent item set m, discarding rules whose confidence is below a given
confidence threshold. Algorithm 6.7 gives the basic algorithm.
Notice that we are free to discard some of the items in the maximal frequent sets (i.e., H ∪B may
be smaller than m), because any subset of a frequent item set is frequent as well.
A run of the algorithm with support threshold 3 and confidence threshold 0.6 gives the following
association rules:
·if beer then crisps· support 3, confidence 3/3
·if crisps then beer· support 3, confidence 3/5
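A brute-force sketch of frequent item set mining and association rule construction. This is not the efficient level-wise search of Algorithm 6.6; it simply enumerates candidate item sets, keeps those that meet the support threshold, and then forms every body/head combination whose confidence meets the confidence threshold. The small transaction list at the end is a hypothetical example, not the full table from the text.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions that contain every item of the item set."""
    return sum(itemset <= t for t in transactions)

def frequent_itemsets(transactions, min_support):
    items = sorted(set().union(*transactions))
    return [frozenset(cand)
            for r in range(1, len(items) + 1)
            for cand in combinations(items, r)
            if support(frozenset(cand), transactions) >= min_support]

def association_rules(transactions, min_support, min_conf):
    rules = []
    for m in frequent_itemsets(transactions, min_support):
        for r in range(len(m)):                    # pick a body B that is a proper subset of m
            for body in map(frozenset, combinations(sorted(m), r)):
                head = m - body
                conf = support(m, transactions) / support(body, transactions)
                if conf >= min_conf:
                    rules.append((set(body), set(head), support(m, transactions), conf))
    return rules

# Hypothetical transactions (chosen for illustration only):
T = [frozenset(t) for t in ({'crisps', 'beer'}, {'crisps', 'beer', 'nappies'}, {'crisps'},
                            {'crisps', 'apples'}, {'nappies'}, {'crisps', 'beer'})]
for body, head, sup, conf in association_rules(T, min_support=3, min_conf=0.6):
    print(f"if {sorted(body) or ['true']} then {sorted(head)}  support {sup}, confidence {conf:.2f}")
```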
FIRST-ORDER RULE LEARNING:
Many approaches to learning in first-order logic are based on the logic programming language Prolog,
and learning first-order rules is often called inductive logic programming (ILP). Logically speaking,
Prolog rules are Horn clauses with a single literal in the head.