MACHINE LEARNING: UNIT 3
TREE MODELS:
1. A tree model is a hierarchical structure of conditions, in which the leaves contain the outcome of the tree.
2. Tree models represent recursive divide-and-conquer strategies.
3. They are among the most popular models in machine learning because they are easy to understand and interpret.
4. For example, Microsoft's Kinect uses tree models to detect body pose.
For instance, the feature tree in Figure 5.1 (left) is equivalent to the path in Figure 4.6 (left). This
equivalence is best seen by tracing the path and the tree from the bottom upward.
1. The left-most leaf of the feature tree represents the concept at the bottom of the path, covering a
single positive example.
2. The next concept up in the path generalizes the literal Length = 3 into Length = [3,5] by means
of internal disjunction; the added coverage (one positive example) is represented by the second
leaf from the left in the feature tree.
3. By dropping the condition Teeth = few we add another two covered positives.
4. Dropping the ‘Length’ condition altogether (or extending the internal disjunction with the one
remaining value ‘4’) adds the last positive and also a negative.
5. Dropping Beak = yes covers no additional examples (this corresponds to the closed concepts of the previous chapter).
6. Finally, dropping Gills = no covers the four remaining negatives.
Figure 5.1. (left) The path from Figure 4.6, redrawn in the form of a tree. The coverage numbers in the
leaves are obtained from the data in Example 4.4. (right) A decision tree learned on the same data. This
tree separates the positives and negatives perfectly.
• A path through the hypothesis space can be turned into an equivalent feature tree. To obtain
a tree that is equivalent to the i-th concept from the bottom of the path, either truncate the tree by
combining the left-most i leaves into a single leaf representing that concept, or label the left-most
i leaves positive and the remaining leaves negative, turning the feature tree into a decision tree.
• Decision trees do not employ internal disjunction for features with more than two values, but
instead branch on each separate value, as shown in Figure 5.1 (right). This tree can be
turned into a logical expression in several equivalent ways:
(Gills = no ∧ Length = 3) ∨ (Gills = no ∧ Length = 4 ∧ Teeth = many) ∨ (Gills = no ∧ Length = 5)
Gills = no ∧ [Length = 3 ∨ (Length = 4 ∧ Teeth = many) ∨ Length = 5]
¬[(Gills = no ∧ Length = 4 ∧ Teeth = few) ∨ Gills = yes]
(Gills = yes ∨ Length = [3,5] ∨ Teeth = many) ∧ Gills = no
The first expression is in disjunctive normal form (DNF) and is obtained by forming a
disjunction of all paths from the root of the tree to leaves labelled positive, where each path
gives a conjunction of literals.
• Decision trees are strictly more expressive than conjunctive concepts.
• One way to avoid overfitting is to deliberately choose a restrictive hypothesis language.
• A FEATURE TREE is a tree such that each internal node (the nodes that are not leaves) is
labelled with a feature, and each edge emanating from an internal node is labelled with a literal.
The set of literals at a node is called a SPLIT. Each leaf of the tree represents a logical
expression, which is the conjunction of literals encountered on the path from the root of the tree
to the leaf. The extension of that conjunction (the set of instances covered by it) is called the
INSTANCE SPACE SEGMENT associated with the leaf.
• Algorithm 5.1 is a divide-and-conquer algorithm: it divides the data into subsets, builds a tree
for each of those subsets, and then combines the subtrees into a single tree.
• Divide-and-conquer algorithms are usually implemented recursively, because each subproblem (building a
tree for a subset of the data) has the same form as the original problem. These algorithms are greedy:
whenever there is a choice (such as choosing the best split), the best alternative is selected on
the basis of the information available at that point, and this choice is never reconsidered. This may lead to
sub-optimal choices.
• An alternative would be a backtracking search algorithm, which can return an optimal
solution at the expense of increased computation time.
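The following Python sketch mirrors this greedy, recursive divide-and-conquer structure (in the spirit of Algorithm 5.1). The data representation (a list of (instance, class) pairs with instances as dictionaries) and the helper names are assumptions made here for illustration; BestSplit is passed in as a parameter, since it is only defined later.

```python
from collections import Counter

class Leaf:
    def __init__(self, label):
        self.label = label              # class predicted in this leaf

class Node:
    def __init__(self, feature, children):
        self.feature = feature          # feature used for the split
        self.children = children        # dict: feature value -> subtree

def homogeneous(D):
    """True if all instances in D are of the same class."""
    return len({c for _, c in D}) <= 1

def label(D):
    """Majority class of the instances in D."""
    return Counter(c for _, c in D).most_common(1)[0][0]

def grow_tree(D, features, best_split):
    """Greedy, recursive divide-and-conquer tree construction.
    D: list of (instance, class) pairs; features: a set of feature names."""
    if homogeneous(D) or not features:
        return Leaf(label(D))
    f = best_split(D, features)          # greedy choice, never reconsidered
    partitions = {}
    for x, c in D:                       # divide the data on the values of f
        partitions.setdefault(x[f], []).append((x, c))
    # children are only created for feature values actually observed in D
    children = {v: grow_tree(Di, features - {f}, best_split)
                for v, Di in partitions.items()}
    return Node(f, children)             # combine the subtrees into one tree
```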
5.1 DECISION TREES:
• Decision Trees are a non-parametric supervised learning method used for both classification
and regression tasks.
• A decision tree is a flowchart-like structure in which each internal node represents a test on
an attribute (e.g., a yes/no question), each branch represents an outcome of that test, and each leaf node
represents a class label (the decision taken after evaluating all attributes on the path). The paths from root to
leaf represent classification rules.
• In decision analysis, a decision tree and the closely related influence diagram are used as visual
and analytical decision support tools, in which the expected values (or expected utilities) of
competing alternatives are calculated.
• A decision tree consists of three types of nodes:
1. Decision nodes – typically represented by squares
2. Chance nodes – typically represented by circles
3. End nodes – typically represented by triangles
• Decision trees are commonly used in operations research and operations management, and for
calculating conditional probabilities.
• For a CLASSIFICATION TASK we can simply define a set of instances D to be homogeneous if they
are all from the same class; the function Label(D) will then obviously return that class.
• In line 5 of Algorithm 5.1 we may be calling Label(D) with a non-homogeneous set of
instances, in case one of the Di is empty; the general definition of Label(D) is therefore that it returns
the majority class of the instances in D. This leaves us to decide how to define the function
BestSplit(D, F).
• We are dealing with Boolean features, so D is split into D1 and D2. Let's assume we have two
classes, and denote by D+ and D− the positives and negatives in D (and likewise D1+ etc.). The
question is how to assess the utility of a feature in terms of splitting the examples into positives
and negatives. Clearly, the best situation is where D1+ = D+ and D1− = ∅, or where D1+ = ∅ and
D1− = D−; in that case, the two children of the split are said to be pure. So we need to measure the
impurity of a set of n+ positives and n− negatives. Impurity can be defined in terms of the
proportion p˙ = n+/(n+ + n−), which is called the empirical probability of the positive class. Impurity
should not change if we swap the positive and negative class, which means that it should stay the same
if we replace p˙ with 1−p˙.
Figure 5.2. (left) Impurity functions plotted against the empirical probability of the positive
class. From the bottom: the relative size of the minority class, min(p˙, 1−p˙); the Gini index,
2p˙(1−p˙); entropy, −p˙ log2 p˙ − (1−p˙) log2 (1−p˙) (divided by 2 so that it reaches its maximum at
the same point as the others); and the square root of the Gini index, √(p˙(1−p˙)) – notice that
this last function describes a semi-circle. (right) Geometric construction to determine the
impurity of a split (Teeth = [many, few]) from Example 5.1: p˙ is the empirical probability of the
parent, and p˙1 and p˙2 are the empirical probabilities of the children.
1. MINORITY CLASS min(p˙, 1−p˙) – this is sometimes referred to as the error rate, as it
measures the proportion of misclassified examples if the leaf were labelled with the majority
class; the purer the set of examples, the fewer errors this makes.
2. GINI INDEX 2p˙(1−p˙) – this is the expected error if we label the examples in the leaf randomly:
positive with probability p˙ and negative with probability 1−p˙. The probability of a false positive
is then p˙(1−p˙) and the probability of a false negative is (1−p˙)p˙.
3. ENTROPY −p˙ log2 p˙ − (1−p˙) log2 (1−p˙) – this is the expected information, in bits, conveyed by
the class of a randomly drawn example; the purer the set of examples, the more predictable
this message becomes and the smaller the expected information.
A plot of these three impurity measures can be seen in Figure 5.2 (left).
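These measures (plus the square root of the Gini index from Figure 5.2) can be written directly as functions of the empirical probability p˙. A minimal sketch in Python, with function names chosen here for illustration:

```python
import math

def minority_class(p):
    """Error rate when the leaf is labelled with the majority class."""
    return min(p, 1 - p)

def gini_index(p):
    """Expected error when labelling the leaf randomly: positive with probability p."""
    return 2 * p * (1 - p)

def entropy(p):
    """Expected information (in bits) in the class of a randomly drawn example."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def sqrt_gini(p):
    """Square root of p(1-p); as a function of p it traces a semi-circle."""
    return math.sqrt(p * (1 - p))

# All four are zero for a pure leaf and maximal at p = 0.5:
print([round(f(0.5), 3) for f in (minority_class, gini_index, entropy, sqrt_gini)])
```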
Denoting the impurity of a single leaf Dj as Imp(Dj), the impurity of a set of mutually exclusive leaves
{D1, . . . , Dl} is defined as the WEIGHTED AVERAGE
Imp({D1, . . . , Dl}) = Σj (|Dj|/|D|) · Imp(Dj)
where D = D1 ∪ . . . ∪ Dl. For a binary split there is a nice geometric construction to find Imp({D1, D2})
given the empirical probabilities of the parent and the children, which is illustrated in Figure 5.2
(right):
1. We first find the impurity values Imp(D1) and Imp(D2) of the two children on the impurity
curve (here the Gini index).
2. We then connect these two values by a straight line, as any weighted average of the two must
be on that line.
3. Since the empirical probability of the parent is also a weighted average of the empirical
probabilities of the children, with the same weights (i.e., p˙ = (|D1|/|D|) · p˙1 + (|D2|/|D|) · p˙2 – the
derivation is given in Equation 5.2), p˙ gives us the correct interpolation point.
EXAMPLE 5.1 (CALCULATING IMPURITY): Find the best feature to put at the root of the decision tree.
The four features available result in the following splits:
Length = [3, 4, 5] [2+, 0−] [1+, 3−] [2+, 2−]
Gills = [yes, no] [0+, 4−] [5+, 1−]
Beak = [yes, no] [5+, 3−] [0+, 2−]
Teeth = [many, few] [3+, 4−] [2+, 1−]
Let us calculate the impurity of the first split, which has three segments:
the first segment is pure and so has entropy 0;
the second has entropy −(1/4)log2(1/4) − (3/4)log2(3/4) = 0.50 + 0.31 = 0.81;
the third has entropy 1.
The total entropy is then the weighted average of these: 2/10 · 0 + 4/10 · 0.81 + 4/10 · 1 = 0.72.
Similar calculations for the other three features give the following entropies:
Gills: 4/10 · 0 + 6/10 · (−(5/6)log2(5/6) − (1/6)log2(1/6)) = 0.39;
Beak: 8/10 · (−(5/8)log2(5/8) − (3/8)log2(3/8)) + 2/10 · 0 = 0.76;
Teeth: 7/10 · (−(3/7)log2(3/7) − (4/7)log2(4/7)) + 3/10 · (−(2/3)log2(2/3) − (1/3)log2(1/3)) = 0.97.
‘Gills’ is clearly an excellent feature to split on; ‘Teeth’ is poor; and the other two are somewhere in
between.
The calculations for the Gini index are as follows (notice that these are on a scale from 0 to 0.5):
Length: 2/10 · 2 · (2/2 · 0/2) + 4/10 · 2 · (1/4 · 3/4) + 4/10 · 2 · (2/4 · 2/4) = 0.35;
Gills: 4/10 · 0 + 6/10 · 2 · (5/6 · 1/6) = 0.17;
Beak: 8/10 · 2 · (5/8 · 3/8) + 2/10 · 0 = 0.38;
Teeth: 7/10 · 2 · (3/7 · 4/7) + 3/10 · 2 · (2/3 · 1/3) = 0.48.
As expected, the two impurity measures are in close agreement. See Figure 5.2 (right) for a
geometric illustration of the last calculation concerning ‘Teeth’.
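The numbers in Example 5.1 can be reproduced with a short script; the child counts below are copied from the splits listed in the example, and the weighted average follows the definition of Imp({D1, . . . , Dl}) given earlier.

```python
import math

def entropy(pos, neg):
    p = pos / (pos + neg)
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini(pos, neg):
    p = pos / (pos + neg)
    return 2 * p * (1 - p)

def weighted_impurity(children, imp):
    """Weighted average impurity of a list of (pos, neg) child counts."""
    total = sum(p + n for p, n in children)
    return sum((p + n) / total * imp(p, n) for p, n in children)

splits = {
    'Length': [(2, 0), (1, 3), (2, 2)],
    'Gills':  [(0, 4), (5, 1)],
    'Beak':   [(5, 3), (0, 2)],
    'Teeth':  [(3, 4), (2, 1)],
}
for name, children in splits.items():
    print(name,
          round(weighted_impurity(children, entropy), 2),  # 0.72, 0.39, 0.76, 0.97
          round(weighted_impurity(children, gini), 2))     # 0.35, 0.17, 0.38, 0.48
```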
If impurity is measured by entropy, the resulting splitting criterion is known as INFORMATION GAIN.
Figure 5.3. (left) Decision tree learned from the data in Example 4.4 on p.115. (right) Each internal
and leaf node of the tree corresponds to a line segment in coverage space: vertical segments for pure
positive nodes, horizontal segments for pure negative nodes, and diagonal segments for impure nodes.
Leaf C leaves three feature values unspecified, with a total of 3·2·2 = 12 possible combinations of
values; four of these were supplied as training examples, so leaf C covers eight unlabelled instances
and classifies them as negative. Similarly, two unlabelled instances are classified as positive by leaf D,
and a further two by leaf F; one is classified as negative by leaf G, and the remaining one as positive by
leaf H. The fact that more unlabelled instances are classified as negative (9) than as positive (5) is thus
mostly due to leaf C: because it is a leaf high up in the tree, it covers many instances.
Every node of the tree, internal or leaf, covers a certain number of positives and negatives, and can hence be
plotted as a line segment in coverage space. For instance, the root of the tree covers all positives and all
negatives, and hence is represented by the ascending diagonal A. Once we add our first split, segment A
is replaced by segment B (an impure node and hence diagonal) and segment C, which is pure and not
split any further. Segment B is further split into D (pure and positive), E (impure) and F (pure and
positive). Finally, E is split into two pure nodes.
ALGORITHM 5.1 only compares splits with the same parent, and so we can ignore the impurity of the
parent and search for the feature which results in the lowest weighted average impurity of its children
(ALGORITHM 5.2).
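A sketch of BestSplit(D, F) in this spirit: for every candidate feature, partition the data on its values and keep the feature with the lowest weighted average impurity of the children. The (instance, class) representation and the choice of entropy as the impurity measure are assumptions for illustration; the data in the usage example is the set of ten labelled examples listed later in Example 6.1.

```python
import math
from collections import Counter

def entropy_of(D):
    """Entropy of the class distribution in D (a list of (instance, class) pairs)."""
    counts, n = Counter(c for _, c in D), len(D)
    return -sum(k / n * math.log2(k / n) for k in counts.values())

def best_split(D, features):
    """Return the feature whose split yields the lowest weighted average impurity."""
    best_feature, best_impurity = None, float('inf')
    for f in features:
        partitions = {}
        for x, c in D:
            partitions.setdefault(x[f], []).append((x, c))
        imp = sum(len(Di) / len(D) * entropy_of(Di) for Di in partitions.values())
        if imp < best_impurity:
            best_feature, best_impurity = f, imp
    return best_feature

# Usage on the ten examples of Example 6.1 (positives p1-p5, negatives n1-n5):
D = [({'Length': 3, 'Gills': 'no',  'Beak': 'yes', 'Teeth': 'many'}, '+'),
     ({'Length': 4, 'Gills': 'no',  'Beak': 'yes', 'Teeth': 'many'}, '+'),
     ({'Length': 3, 'Gills': 'no',  'Beak': 'yes', 'Teeth': 'few'},  '+'),
     ({'Length': 5, 'Gills': 'no',  'Beak': 'yes', 'Teeth': 'many'}, '+'),
     ({'Length': 5, 'Gills': 'no',  'Beak': 'yes', 'Teeth': 'few'},  '+'),
     ({'Length': 5, 'Gills': 'yes', 'Beak': 'yes', 'Teeth': 'many'}, '-'),
     ({'Length': 4, 'Gills': 'yes', 'Beak': 'yes', 'Teeth': 'many'}, '-'),
     ({'Length': 5, 'Gills': 'yes', 'Beak': 'no',  'Teeth': 'many'}, '-'),
     ({'Length': 4, 'Gills': 'yes', 'Beak': 'no',  'Teeth': 'many'}, '-'),
     ({'Length': 4, 'Gills': 'no',  'Beak': 'yes', 'Teeth': 'few'},  '-')]
print(best_split(D, ['Length', 'Gills', 'Beak', 'Teeth']))   # 'Gills', as in Example 5.1
```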
Figure 5.4. (top) Abstract representation of a tree with numbers of positive and negative
examples covered in each node. Binary splits are added to the tree in the order indicated.
(bottom) Adding a split to the tree will add new segments to the coverage curve as indicated by
the arrows. After a split is added the segments may need reordering, and so only the solid lines
represent actual coverage curves.
So, adding a split to a decision tree can be interpreted in terms of coverage curves as the following two-
step process:
1. Split the corresponding curve segment into two or more segments;
2. Reorder the segments on decreasing slope.
If a feature tree has l leaves and we have c classes, then the number of possible labellings of leaves with
classes is c^l; in the example of Figure 5.4 this is 2^4 = 16. Figure 5.5 depicts these 16 labellings in
coverage space.
Figure 5.5: Graphical depiction of all possible labellings and all possible rankings that can
be obtained with the four-leaf decision tree in Figure 5.4. There are 2^4 = 16 possible leaf
labellings; for example, ‘+−+−’ denotes labelling the first and third leaf from the left as + and the
second and fourth leaf as −. Also indicated are some pairwise symmetries (dotted lines): e.g., +−
+− and −+−+ are each other’s inverse and end up at opposite ends of the plot. There are 4! = 24
possible blue-violet-red-orange paths through these points which start in −−−− and switch
each leaf to + in some order; these represent all possible four-segment coverage curves or
rankings.
Figure 5.6. (top) To achieve the labelling +−++ we don’t need the right-most split, which can
therefore be pruned away. (bottom) Pruning doesn’t affect the chosen operating point, but it
does decrease the ranking performance of the tree.
• The operation of merging all leaves in a subtree is called PRUNING THE SUBTREE. The
process is illustrated in Figure 5.6. The advantage of pruning is that we can simplify the tree
without affecting the chosen operating point, which is sometimes useful if we want to
communicate the tree model to somebody else. The disadvantage is that we lose ranking
performance, as illustrated in Figure 5.6 (bottom). Pruning is therefore not recommended
unless (i) you only intend to use the tree for classification, not for ranking or probability
estimation; and (ii) you can define the expected operating conditions with sufficient precision.
One popular algorithm for pruning decision trees is called reduced-error pruning, and is given
in Algorithm 5.3. The algorithm employs a separate pruning set of labelled data not seen during
training, as pruning will never improve accuracy over the training data.
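A minimal sketch of reduced-error pruning in the spirit of Algorithm 5.3, under the following assumptions made here for illustration: nodes store the majority class of the training examples that reached them, the pruning set is a held-out list of (instance, class) pairs, and a subtree is replaced by a majority-class leaf whenever this does not increase the number of errors on the pruning examples routed to it.

```python
class Leaf:
    def __init__(self, label):
        self.label = label

class Node:
    def __init__(self, feature, children, majority):
        self.feature = feature      # feature tested at this node
        self.children = children    # dict: feature value -> subtree
        self.majority = majority    # majority training class at this node

def classify(tree, x):
    while isinstance(tree, Node):
        tree = tree.children.get(x[tree.feature], Leaf(tree.majority))
    return tree.label

def reduced_error_prune(tree, prune_set):
    """Bottom-up pruning on a held-out pruning set (never the training data)."""
    if isinstance(tree, Leaf) or not prune_set:
        return tree
    # route the pruning examples to the child they fall into
    routed = {v: [] for v in tree.children}
    for x, c in prune_set:
        if x[tree.feature] in routed:
            routed[x[tree.feature]].append((x, c))
    # prune the subtrees first, then consider pruning this node
    tree.children = {v: reduced_error_prune(t, routed[v])
                     for v, t in tree.children.items()}
    errors_subtree = sum(classify(tree, x) != c for x, c in prune_set)
    errors_leaf = sum(tree.majority != c for _, c in prune_set)
    return Leaf(tree.majority) if errors_leaf <= errors_subtree else tree
```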
Example 5.3 (Cost-sensitivity of splitting criteria). Suppose you have 10 positives and 10
negatives, and you need to choose between the two splits [8+,2−][2+,8−] and [10+,6−][0+,4−].
Calculate the weighted average entropy of both splits, and you conclude that the first split is the
better one. Calculate the weighted average Gini index, and again the first split wins. With the
square root of the Gini index, however, the second split comes out (slightly) ahead.
You then remember that mistakes on the positives are about ten times as costly as mistakes on
the negatives. You're not quite sure how to work out the maths, and so you decide to simply take
ten copies of every positive: the splits are now [80+,2−][20+,8−] and [100+,6−][0+,4−].
You recalculate the three splitting criteria and now all three favour the second split.
Even though you're slightly bemused by all this, you settle for the second split, since all three
splitting criteria are now unanimous in their recommendation.
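The effect in Example 5.3 can be checked numerically. The sketch below scores a split by the weighted average impurity of its (pos, neg) children (lower is better) and compares the two splits before and after inflating the positives; the function names are chosen here for illustration.

```python
import math

def entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini(p):
    return 2 * p * (1 - p)

def sqrt_gini(p):
    return math.sqrt(p * (1 - p))

def split_impurity(children, imp):
    """Weighted average impurity of a list of (pos, neg) children; lower is better."""
    total = sum(p + n for p, n in children)
    return sum((p + n) / total * imp(p / (p + n)) for p, n in children)

def preferred(split1, split2, imp):
    return 1 if split_impurity(split1, imp) < split_impurity(split2, imp) else 2

s1, s2 = [(8, 2), (2, 8)], [(10, 6), (0, 4)]
s1x, s2x = [(80, 2), (20, 8)], [(100, 6), (0, 4)]      # ten copies of every positive
for imp in (entropy, gini, sqrt_gini):
    print(imp.__name__, preferred(s1, s2, imp), preferred(s1x, s2x, imp))
# entropy 1 2, gini 1 2, sqrt_gini 2 2 -- only the sqrt-Gini preference is unaffected
```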
Figure 5.7. (left) ROC isometrics for entropy in blue, Gini index in violet and √Gini in red through the
splits [8+,2−][2+,8−] (solid lines) and [10+,6−][0+,4−] (dotted lines). Only √Gini prefers the second split.
(right) The same isometrics after inflating the positives with a factor 10. All splitting criteria now
favour the second split; the √Gini isometrics are the only ones that haven't moved.
Consider Figure 5.7 (left). The two splits between which you needed to choose in Example 5.3 (before
inflating the positives) are indicated as points in this plot. Six isometrics are drawn in the top-left of
the plot: two splits times three splitting criteria. A particular splitting criterion prefers the split whose
isometric is the higher of the two (closest to ROC heaven): you can see that only one of the three (√Gini)
prefers the split towards the top-right. Figure 5.7 (right) demonstrates how this changes when the
positives are inflated by a factor 10 (the grid in ROC space indicates how the class distribution has changed).
Entropy and the Gini index are sensitive to such fluctuations in the class distribution; √Gini isn't.
REGRESSION TREES:
The variance of a Boolean (i.e., Bernoulli) variable with success probability p is p(1−p), which is half the
Gini index. So we could interpret the goal of tree learning as minimizing the class variance (or the standard
deviation, in the case of √Gini) in the leaves. Regression trees are susceptible to overfitting.
In regression, the variance of a set Y of target values is defined as the average squared distance from the mean ȳ:
Var(Y) = (1/|Y|) Σy∈Y (y − ȳ)²
If a split partitions the set of target values Y into mutually exclusive sets {Y1, . . . , Yl}, the weighted
average variance is then
Σj (|Yj|/|Y|) Var(Yj) = (1/|Y|) Σy∈Y y² − Σj (|Yj|/|Y|) ȳj²
where ȳj is the mean of Yj. The first term is constant for a given set Y, and so we want to maximize the
weighted average of squared means in the children.
EXAMPLE 5.4 (LEARNING A REGRESSION TREE). Imagine you are a collector of vintage Hammond
tone wheel organs. You have been monitoring an online auction site, from which you collected some
data about interesting transactions, each recording the model, its condition, whether a Leslie speaker is
included, and the price paid:
1. B3, excellent, no Leslie, 4513
2. T202, fair, Leslie, 625
3. A100, good, no Leslie, 1051
4. T202, good, no Leslie, 270
5. M102, good, Leslie, 870
6. A100, excellent, no Leslie, 1770
7. T202, fair, no Leslie, 99
8. A100, good, Leslie, 1900
9. E112, fair, no Leslie, 77
From this data you want to construct a regression tree that will help you determine a reasonable
price for your next purchase. There are three features, hence three possible splits:
Model = [A100, B3, E112, M102, T202] [1051, 1770, 1900] [4513] [77] [870] [99, 270, 625]
Condition = [excellent, good, fair] [1770, 4513] [270, 870, 1051, 1900] [77, 99, 625]
Leslie = [yes, no] [625, 870, 1900] [77, 99, 270, 1051, 1770, 4513]
The means of the first split are 1574, 4513, 77, 870 and 331, and the weighted average of squared
means is 3.21 · 10⁶. The means of the second split are 3142, 1023 and 267, with weighted average of
squared means 2.68 · 10⁶; for the third split the means are 1132 and 1297, with weighted average of
squared means 1.55 · 10⁶. We therefore branch on Model at the top level. This gives us three
single-instance leaves, as well as three A100s and three T202s. For the A100s we obtain the following splits:
Condition = [excellent,good, fair] [1770][1051,1900][]
Leslie = [yes,no] [1900][1051,1770]
The second split results in less variance. For the T202s the splits are as follows:
Condition = [excellent,good, fair] [][270][99,625]
Leslie = [yes,no] [625][99,270]
Again we see that splitting on Leslie gives tighter clusters of values. The learned regression tree is
depicted in Figure 5.8.
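The split scores of Example 5.4 can be reproduced by computing the weighted average of squared means for each candidate split; the price groupings below are copied from the splits listed in the example.

```python
def weighted_squared_means(groups):
    """Weighted average of squared means of the target values in each child."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * (sum(g) / len(g)) ** 2 for g in groups if g)

splits = {
    'Model':     [[1051, 1770, 1900], [4513], [77], [870], [99, 270, 625]],
    'Condition': [[1770, 4513], [270, 870, 1051, 1900], [77, 99, 625]],
    'Leslie':    [[625, 870, 1900], [77, 99, 270, 1051, 1770, 4513]],
}
for name, groups in splits.items():
    print(name, round(weighted_squared_means(groups) / 1e6, 2))
# Model 3.21, Condition 2.68, Leslie 1.55 -- higher is better, so the root split is on Model
```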
CLUSTERING TREES:
Regression is a supervised learning problem, whereas clustering is unsupervised; clustering trees adapt
the tree-learning approach to the unsupervised setting.
Assume an abstract function Dis: X × X → R that measures the distance or dissimilarity of any two
instances x, x′ ∈ X, such that the higher Dis(x, x′) is, the less similar x and x′ are. The cluster
dissimilarity of a set of instances D is then calculated as
Dis(D) = (1/|D|²) Σx∈D Σx′∈D Dis(x, x′)
The weighted average cluster dissimilarity over all children of a split then gives the split
dissimilarity.
Computing the pairwise dissimilarities between the nine transactions shows, for instance, that the first
transaction is very different from the other eight. The average pairwise dissimilarity over all nine
transactions is 2.94. Using the same features from Example 5.4, the three possible splits are (now with
transaction numbers rather than prices):
Model = [A100, B3, E112, M102, T202] [3, 6, 8] [1] [9] [5] [2, 4, 7]
Condition = [excellent, good, fair] [1, 6] [3, 4, 5, 8] [2, 7, 9]
Leslie = [yes, no] [2, 5, 8] [1, 3, 4, 6, 7, 9]
The cluster dissimilarity among transactions 3, 6 and 8 is 1/3² · (0+1+2+1+0+1+2+1+0) = 0.89,
and among transactions 2, 4 and 7 it is 1/3² · (0+1+0+1+0+0+0+0+0) = 0.22. The other three
children of the first split contain only a single element and so have zero cluster dissimilarity.
The weighted average cluster dissimilarity of the split is then
3/9 · 0.89 + 1/9 · 0 + 1/9 · 0 + 1/9 · 0 + 3/9 · 0.22 = 0.37. For the second split, similar calculations
result in a split dissimilarity of 2/9 · 1.5 + 4/9 · 1.19 + 3/9 · 0 = 0.86, and the third split yields
3/9 · 1.56 + 6/9 · 3.56 = 2.89. The Model feature thus captures most of the given dissimilarities, while
the Leslie feature is virtually unrelated.
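The split dissimilarities above can be recomputed once the pairwise dissimilarities Dis(x, x′) are known; the sketch below implements the definitions given earlier and takes the pairwise dissimilarity function as a parameter. The toy usage at the end uses a simple "number of differing feature values" dissimilarity, which is an assumption for illustration and not the dissimilarity table used in the example.

```python
def cluster_dissimilarity(D, dis):
    """Dis(D) = (1/|D|^2) * sum of dis(x, x') over all ordered pairs in D."""
    if not D:
        return 0.0
    return sum(dis(x, y) for x in D for y in D) / len(D) ** 2

def split_dissimilarity(children, dis):
    """Weighted average cluster dissimilarity over the children of a split."""
    total = sum(len(D) for D in children)
    return sum(len(D) / total * cluster_dissimilarity(D, dis) for D in children)

# Toy usage: instances as tuples of feature values, number of differing values as dissimilarity.
def differing_values(x, y):
    return sum(a != b for a, b in zip(x, y))

child1 = [('good', 'no'), ('excellent', 'no')]   # the two instances differ in one value
child2 = [('fair', 'yes')]                       # a singleton child has dissimilarity 0
print(split_dissimilarity([child1, child2], differing_values))   # 2/3 * 0.5 + 1/3 * 0 = 0.33...
```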
Smaller clusters tend to have lower dissimilarity, and so it is easy to overfit. It is therefore
recommended to set aside a pruning set and to remove the lower splits if they don't improve the cluster
coherence on that pruning set.
In the above example, removing the first transaction reduces the overall pairwise dissimilarity
from 2.94 to 1.5, confirming that this single transaction accounts for much of the overall dissimilarity.
Figure 5.9: A clustering tree learned from the data in Example 5.6 using Euclidean distance on the
numerical features.
RULE MODELS:
Rule models are the second major type of logical machine learning models. Particularly in
supervised learning, a rule model is more than just a set of rules: the specification of how the rules are
to be combined to form predictions is a crucial part of the model.
There are four types of rule models:
1. Learning ordered rule lists
2. Learning unordered rule sets
3. Descriptive rule learning
4. First-order rule learning
Example 6.1 (Learning a rule list): Consider the following positive and negative examples:
p1: Length = 3 ∧ Gills = no ∧ Beak = yes ∧ Teeth = many
p2: Length = 4 ∧ Gills = no ∧ Beak = yes ∧ Teeth = many
p3: Length = 3 ∧ Gills = no ∧ Beak = yes ∧ Teeth = few
p4: Length = 5 ∧ Gills = no ∧ Beak = yes ∧ Teeth = many
p5: Length = 5 ∧ Gills = no ∧ Beak = yes ∧ Teeth = few
n1: Length = 5 ∧ Gills = yes ∧ Beak = yes ∧ Teeth = many
n2: Length = 4 ∧ Gills = yes ∧ Beak = yes ∧ Teeth = many
n3: Length = 5 ∧ Gills = yes ∧ Beak = no ∧ Teeth = many
n4: Length = 4 ∧ Gills = yes ∧ Beak = no ∧ Teeth = many
n5: Length = 4 ∧ Gills = no ∧ Beak = yes ∧ Teeth = few
The nine possible literals are shown with their coverage counts in Figure 6.2 (top). Three of these
are pure; in the impurity isometrics plot in Figure 6.2 (bottom) they end up on the x-axis and y-axis.
One of the literals covers two positives and two negatives, and therefore has the same impurity as the
overall data set; this literal ends up on the ascending diagonal in the coverage plot.
Gills = yes is the best of the three as it covers more examples, so let’s formulate our first rule as:
·if Gills = yes then Class = ⊝
Figure 6.2. (top) All literals with their coverage counts on the data in Example 6.1. The ones in
green (red) are pure for the positive (negative) class. (bottom) The nine literals plotted as points in
coverage space, with their impurity values indicated by impurity isometrics. Impurity values are colour-
coded: towards green if p˙ > 1/2, towards red if p˙ < 1/2, and orange if p˙ = 1/2 (on a 45 degree
isometric). The violet arrow indicates the selected literal, which excludes all five positives and one
negative.
Figure 6.3. (left) Revised coverage counts after removing the four negative examples covered by the
first rule found (literals not covering any examples are omitted). (right) We are now operating in the
right-most ‘slice’ of Figure 6.2.
Most rule learning algorithms now proceed as follows: they remove the examples covered by the rule
just learned from consideration, and proceed with the remaining examples. This strategy is called
SEPARATE-AND-CONQUER.
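A minimal sketch of separate-and-conquer rule list learning, under simplifying assumptions made here for illustration: rule bodies consist of a single literal, the best literal is chosen by purity (ties broken by coverage), and a default rule labels whatever remains. The general algorithm grows conjunctive rule bodies; on the data of Example 6.1, however, the first rule found by this sketch is ·if Gills = yes then Class = ⊝·, as in the text.

```python
from collections import Counter

def literals(D):
    """All (feature, value) literals that occur in the data."""
    return {(f, x[f]) for x, _ in D for f in x}

def evaluate(D, lit):
    """Purity, majority class and coverage of a literal on D."""
    f, v = lit
    covered = [c for x, c in D if x[f] == v]
    cls, cnt = Counter(covered).most_common(1)[0]
    return cnt / len(covered), cls, len(covered)

def learn_rule_list(D):
    """Separate-and-conquer: learn a rule, remove the examples it covers, repeat."""
    rules = []
    while len({c for _, c in D}) > 1:                      # remaining data still impure
        lit = max(literals(D), key=lambda l: (evaluate(D, l)[0], evaluate(D, l)[2]))
        _, cls, _ = evaluate(D, lit)
        rules.append((lit, cls))
        D = [(x, c) for x, c in D if x[lit[0]] != lit[1]]  # 'separate' the covered examples
    if D:
        rules.append((None, D[0][1]))                      # default rule for the remainder
    return rules
```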
Figure 6.4: (left) The third rule covers the one remaining negative example, so that the remaining
positives can be swept up by a default rule. (right) This will collapse the coverage space.
Figure 6.5: (left) A right-branching feature tree corresponding to a list of single-literal rules. (right)
The construction of this feature tree depicted in coverage space. The leaves of the tree are either purely
positive (in green) or purely negative (in red). Reordering these leaves on their empirical probability
results in the blue coverage curve. As the rule list separates the classes this is a perfect coverage curve.
RULE LISTS FOR RANKING AND PROBABILITY ESTIMATION: Turning a rule list into a ranker or probability
estimator is as easy as it was for decision trees.
Figure 6.6: Coverage curves of two rule lists consisting of the rules from Example 6.2, in different
order (AB in blue and BA in violet). B\A corresponds to the coverage of rule B once the coverage of
rule A is taken away, and ‘-’ denotes the default rule. Neither curve dominates the other, and thus each
has operating conditions under which it is superior. The dotted segment in red connecting the two
curves corresponds to the overlap of the two rules A∧B, which is not accessible by either rule list.
If accuracy is our performance criterion, AB would be optimal, achieving 0.80 accuracy (tpr =
0.80 and tnr = 0.80) where BA only manages 0.70 (tpr = 1 and tnr = 0.40).
The segment A∧B – the overlap of the two rules – is not accessible by either rule list. In Figure
6.6 this is indicated by the dotted segment connecting the segment B from rule list BA and the
segment B\A from rule list AB. It follows that this segment contains exactly those examples that
are in B but not in B\A, hence in A∧B. In order to access the rule overlap, we need to either
combine the two rule lists or go beyond the power of rule lists.
Rule lists are similar to decision trees in that the empirical probabilities associated with each
rule yield convex ROC and coverage curves on the training data.
6.2 LEARNING UNORDERED RULE SETS:
Table 6.1: Detailed evaluation of the top subgroups. Using Laplace-corrected precision we can evaluate
the quality of a subgroup as |precL −pos|. Alternatively, we can use average recall to define the quality of
a subgroup as |avg-rec−0.5|. These two quality measures result in slightly different rankings.
Each transaction in this table involves a set of items; conversely, for each item we can list the
transactions in which it was involved: transactions 1, 3, 4 and 6 for nappies, transactions 3, 5, 6 and 7
for apples, and so on.
Figure 6.17: An item set lattice. Item sets in dotted ovals cover a single transaction; in dashed ovals,
two transactions; in triangles, three transactions; and in polygons with n sides, n transactions. The
maximal item sets with support 3 or more are indicated in green.
RULE 1:
Frequent item sets can be used to build association rules, which are rules of the form ·if B then H·
where both body B and head H are item sets that frequently appear in transactions together.
Pick any edge in Figure 6.17, say the edge between {beer} and {nappies, beer}. We know that
the support of the former is 3 and of the latter, 2: that is, three transactions involve beer and
two of those involve nappies as well. We say that the confidence of the association rule ·if beer
then nappies· is 2/3.
Likewise, the edge between {nappies} and {nappies, beer} demonstrates that the confidence
of the rule ·if nappies then beer· is 2/4.
There are also rules with confidence 1, such as ·if beer then crisps·; and rules with empty
bodies, such as ·if true then crisps·, which has confidence 5/8 (i.e., five out of eight transactions
involve crisps).
RULE 2:
However, we only want to construct association rules that involve frequent item sets.
The rule ·if beer ∧ apples then crisps· has confidence 1, but there is only one transaction
involving all three items, and so this rule is not strongly supported by the data.
So we first use Algorithm 6.6 to mine for frequent item sets; we then select bodies B and
heads H from each frequent item set m, discarding rules whose confidence is below a given
confidence threshold. Algorithm 6.7 gives the basic algorithm.
Notice that we are free to discard some of the items in the maximal frequent sets (i.e., H ∪B may
be smaller than m), because any subset of a frequent item set is frequent as well.
A run of the algorithm with support threshold 3 and confidence threshold 0.6 gives the following
association rules:
·if beer then crisps· support 3, confidence 3/3
·if crisps then beer· support 3, confidence 3/5
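A brute-force sketch of frequent item set mining and association rule construction. This is not the efficient level-wise search of Algorithm 6.6; it simply enumerates candidate item sets, keeps those that meet the support threshold, and then forms every body/head combination whose confidence meets the confidence threshold. The small transaction list at the end is a hypothetical example, not the full table from the text.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions that contain every item of the item set."""
    return sum(itemset <= t for t in transactions)

def frequent_itemsets(transactions, min_support):
    items = sorted(set().union(*transactions))
    return [frozenset(cand)
            for r in range(1, len(items) + 1)
            for cand in combinations(items, r)
            if support(frozenset(cand), transactions) >= min_support]

def association_rules(transactions, min_support, min_conf):
    rules = []
    for m in frequent_itemsets(transactions, min_support):
        for r in range(len(m)):                    # pick a body B that is a proper subset of m
            for body in map(frozenset, combinations(sorted(m), r)):
                head = m - body
                conf = support(m, transactions) / support(body, transactions)
                if conf >= min_conf:
                    rules.append((set(body), set(head), support(m, transactions), conf))
    return rules

# Hypothetical transactions (chosen for illustration only):
T = [frozenset(t) for t in ({'crisps', 'beer'}, {'crisps', 'beer', 'nappies'}, {'crisps'},
                            {'crisps', 'apples'}, {'nappies'}, {'crisps', 'beer'})]
for body, head, sup, conf in association_rules(T, min_support=3, min_conf=0.6):
    print(f"if {sorted(body) or ['true']} then {sorted(head)}  support {sup}, confidence {conf:.2f}")
```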
FIRST-ORDER RULE LEARNING:
Many approaches to learning in first-order logic are based on the logic programming language Prolog,
and learning first-order rules is often called inductive logic programming (ILP). Logically speaking,
Prolog rules are Horn clauses with a single literal in the head.