Module 4 QB
How can we tell whether it is a mammal or a non-mammal? One approach is to pose a series of
questions about the characteristics of the species.
The first question we may ask is whether the species is cold- or warm-blooded. If it is cold-blooded, then it is definitely not a mammal. Otherwise, it is either a bird or a mammal. In the latter case, we need to ask a follow-up question: do the females of the species give birth to their young? Those that do give birth are definitely mammals, while those that do not are likely to be non-mammals.
The previous example illustrates how we can solve a classification problem by asking a series
of questions about the attributes of the test record. Each time we receive an answer, a follow-up
question is asked until we reach a conclusion about the class label of the record.
The series of questions and their possible answers can be organized in the form of a decision
tree, which is a hierarchical structure consisting of nodes and directed edges.
Figure 4.4 shows the decision tree for the mammal classification problem. The tree has three types of
nodes:
• A root node, which has no incoming edges and zero or more outgoing edges.
• Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.
• Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges. In a decision tree, each leaf node is assigned a class label.
The non-terminal nodes, which include the root and other internal nodes, contain attribute test
conditions to separate records that have different characteristics.
For example, the root node shown in Figure 4.4 uses the attribute Body Temperature to separate warm-blooded from cold-blooded vertebrates. Since all cold-blooded vertebrates are non-mammals, a leaf node labeled Non-mammals is created as the right child of the root node.
If the vertebrate is warm-blooded, a subsequent attribute, Gives Birth, is used to distinguish
mammals from other warm-blooded creatures, which are mostly birds. Classifying a test record
is straightforward once a decision tree has been constructed.
Starting from the root node, we apply the test condition to the record and follow the appropriate
branch based on the outcome of the test. This will lead us either to another internal node, for
which a new test condition is applied, or to a leaf node. The class label associated with the leaf
node is then assigned to the record.
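To make this traversal concrete, here is a minimal Python sketch of classifying a record against the tree of Figure 4.4; the Node structure and attribute names are illustrative assumptions, not taken from the text.

```python
# A minimal sketch of classifying a record with the mammal decision tree.
# The Node structure and attribute names are illustrative assumptions.

class Node:
    def __init__(self, label=None, attribute=None, children=None):
        self.label = label              # class label (leaf nodes only)
        self.attribute = attribute      # attribute tested at this node
        self.children = children or {}  # test outcome -> child Node

def classify(record, node):
    """Follow test outcomes from the root until a leaf is reached."""
    while node.label is None:
        outcome = record[node.attribute]
        node = node.children[outcome]
    return node.label

# Decision tree from Figure 4.4: Body Temperature first, then Gives Birth.
tree = Node(attribute="Body Temperature", children={
    "cold-blooded": Node(label="Non-mammal"),
    "warm-blooded": Node(attribute="Gives Birth", children={
        "yes": Node(label="Mammal"),
        "no":  Node(label="Non-mammal"),
    }),
})

print(classify({"Body Temperature": "warm-blooded", "Gives Birth": "yes"}, tree))
# -> Mammal
```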
3 Define classification. Draw a neat figure and explain the general approach for building a classification model. 8
Classification is the task of learning a target function f that maps each attribute set x to one of the
predefined class labels y. The target function is also known informally as a classification model.
Each technique employs a learning algorithm to identify a model that best fits the relationship
between the attribute set and class label of the input data.
The model generated by a learning algorithm should both fit the input data well and correctly predict
the class labels of records it has never seen before.
Therefore, a key objective of the learning algorithm is to build models with good generalization
capability; i.e., models that accurately predict the class labels of previously unknown records.
Figure 3.3. General approach for building a classification model.
Figure 3.3 shows a general approach for solving classification problems. First, a training set
consisting of records whose class labels are known must be provided. The training set is used to
build a classification model, which is subsequently applied to the test set, which consists of
records with unknown class labels.
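As an illustration of this workflow, the following sketch uses scikit-learn (an assumed dependency; the synthetic data and the choice of a decision tree model are purely illustrative) to build a model on a training set and then apply it to a held-out test set:

```python
# A sketch of the general approach in Figure 3.3 using scikit-learn
# (assumed available); the data here is synthetic, for illustration only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Training set: records with known class labels, used to build the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # induction step

# The model is then applied to the test set to estimate generalization.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```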
Evaluation of the performance of a classification model is based on the counts of test records
correctly and incorrectly predicted by the model. These counts are tabulated in a table known as
a confusion matrix.
Table 4.2 depicts the confusion matrix for a binary classification problem. Each entry f_ij in this table denotes the number of records from class i predicted to be of class j. For instance, f_01 is the number of records from class 0 incorrectly predicted as class 1. Based on the entries in the confusion matrix, the total number of correct predictions made by the model is (f_11 + f_00) and the total number of incorrect predictions is (f_10 + f_01).
Although a confusion matrix provides the information needed to determine how well a classification model performs, summarizing this information with a single number makes it more convenient to compare the performance of different models. This can be done using a performance metric such as accuracy, which is defined as follows:

Accuracy = (f_11 + f_00) / (f_11 + f_10 + f_01 + f_00)

Equivalently, the performance of a model can be expressed in terms of its error rate, which is given by the following equation:

Error rate = (f_10 + f_01) / (f_11 + f_10 + f_01 + f_00)
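A small sketch of both metrics, using hypothetical confusion matrix counts:

```python
# Computing accuracy and error rate from the confusion matrix counts
# (f11, f10, f01, f00 as defined above); the counts are made up.
f11, f10, f01, f00 = 40, 10, 5, 45   # hypothetical counts

total = f11 + f10 + f01 + f00
accuracy = (f11 + f00) / total
error_rate = (f10 + f01) / total

print(f"accuracy = {accuracy:.3f}, error rate = {error_rate:.3f}")
assert abs(accuracy + error_rate - 1.0) < 1e-12  # the two metrics sum to 1
```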
Figure 3.13 compares the values of the impurity measures for binary classification problems.
Figure 3.13 Comparison among the impurity measures for binary classification problems.
Here p refers to the fraction of records that belong to one of the two classes. Observe that all three measures attain their maximum value when the class distribution is uniform (i.e., when p = 0.5). The minimum values for the measures are attained when all the records belong to the same class (i.e., when p equals 0 or 1).
Examples of computing the different impurity measures are given below.
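As a sketch of such computations, the following code evaluates the three impurity measures (entropy, the Gini index, and classification error, in their standard binary forms) at a few values of p:

```python
# A sketch of the three impurity measures for a binary class distribution,
# where p is the fraction of records belonging to class 1.
import math

def entropy(p):
    terms = [q * math.log2(q) for q in (p, 1 - p) if q > 0]
    return -sum(terms)

def gini(p):
    return 1 - (p**2 + (1 - p)**2)

def classification_error(p):
    return 1 - max(p, 1 - p)

for p in (0.0, 0.3, 0.5, 1.0):
    print(f"p={p:.1f}: entropy={entropy(p):.3f}, "
          f"gini={gini(p):.3f}, error={classification_error(p):.3f}")
# All three measures peak at p = 0.5 and vanish at p = 0 or p = 1.
```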
To determine how well a test condition performs, we need to compare the degree of impurity of the parent node (before splitting) with the degree of impurity of the child nodes (after splitting). The larger their difference, the better the test condition.
The gain, Δ, is a criterion that can be used to determine the goodness of a split:

Δ = I(parent) − Σ_{j=1}^{k} [N(vj)/N] × I(vj)

where I(·) is the impurity measure of a given node, N is the total number of records at the parent node, k is the number of attribute values, and N(vj) is the number of records associated with the child node vj.
When entropy is used as the impurity measure in the above equation, the difference in entropy is known as the information gain, Δ_info.
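A short sketch of computing the gain for one candidate split, with hypothetical class counts and entropy as the impurity measure:

```python
# A sketch of computing the gain Δ for a candidate binary split, using
# entropy as the impurity measure (i.e., information gain). The class
# counts at the parent and child nodes below are hypothetical.
import math

def entropy_from_counts(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

parent = [30, 30]              # class counts at the parent node
children = [[25, 5], [5, 25]]  # class counts at each child node

n_parent = sum(parent)
weighted_child_impurity = sum(
    (sum(child) / n_parent) * entropy_from_counts(child) for child in children
)
gain = entropy_from_counts(parent) - weighted_child_impurity
print(f"information gain = {gain:.3f}")   # -> 0.350
```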
5 Write an algorithm for decision tree induction and explain the same. 8
A skeleton decision tree induction algorithm called TreeGrowth is shown in Algorithm 4.1. The input to this algorithm consists of the training records E and the attribute set F. The algorithm works by recursively selecting the best attribute to split the data (Step 7) and expanding the leaf nodes of the tree (Steps 11 and 12) until the stopping criterion is met (Step 1).
Explanation
The details of this algorithm are explained below:
The createNode() function extends the decision tree by creating a new node. A node in the decision tree has either a test condition, denoted as node.test_cond, or a class label, denoted as node.label.
The find_best_split() function determines which attribute should be selected as the test condition for splitting the training records. As previously noted, the choice of test condition depends on which impurity measure is used to determine the goodness of a split. Some widely used measures include entropy, the Gini index, and the χ² statistic.
The Classify() function determines the class label to be assigned to a leaf node. For each leaf node t, let p(i|t) denote the fraction of training records from class i associated with the node t. In most cases, the leaf node is assigned to the class that has the majority of training records:

leaf.label = argmax_i p(i|t)

where the argmax operator returns the argument i that maximizes the expression p(i|t).
The stopping_cond() function is used to terminate the tree-growing process by testing whether all the records have either the same class label or the same attribute values. Another way to terminate the recursive function is to test whether the number of records has fallen below some minimum threshold.
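A runnable Python sketch of this skeleton is given below. The helper implementations (a majority-class Classify(), a Gini-based find_best_split(), and a simple stopping_cond()) are stand-ins for the functions named above, not the book's exact procedures:

```python
# A runnable sketch of the TreeGrowth skeleton described above, with
# simple stand-in implementations of the helper functions.
from collections import Counter

class Node:
    def __init__(self):
        self.test_cond = None   # attribute index tested at this node
        self.label = None       # class label (leaf nodes only)
        self.children = {}      # attribute value -> child Node

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def stopping_cond(E, F):
    labels = [y for _, y in E]
    return len(set(labels)) == 1 or not F

def classify(E):
    """Majority class among the records at this node."""
    return Counter(y for _, y in E).most_common(1)[0][0]

def find_best_split(E, F):
    """Pick the attribute whose split gives the lowest weighted Gini."""
    def weighted_gini(attr):
        parts = {}
        for x, y in E:
            parts.setdefault(x[attr], []).append(y)
        return sum(len(p) / len(E) * gini(p) for p in parts.values())
    return min(F, key=weighted_gini)

def tree_growth(E, F):
    node = Node()                       # createNode()
    if stopping_cond(E, F):
        node.label = classify(E)
        return node
    node.test_cond = find_best_split(E, F)
    for v in set(x[node.test_cond] for x, _ in E):
        Ev = [(x, y) for x, y in E if x[node.test_cond] == v]
        node.children[v] = tree_growth(Ev, [a for a in F if a != node.test_cond])
    return node

# Tiny example: attributes are (Body Temperature, Gives Birth).
E = [(("warm", "yes"), "mammal"), (("warm", "no"), "non-mammal"),
     (("cold", "no"), "non-mammal"), (("cold", "no"), "non-mammal")]
root = tree_growth(E, [0, 1])
print(root.test_cond)  # attribute chosen at the root (here: Gives Birth)
```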
6 Explain the important characteristics of decision tree induction algorithm. 8
The following is a summary of the important characteristics of decision tree induction algorithms.
1. Decision tree induction is a nonparametric approach for building classification models. In other
words, it does not require any prior assumptions regarding the type of probability distributions
satisfied by the class and other attributes.
2. Finding an optimal decision tree is an NP-complete problem. Many decision tree algorithms therefore employ a heuristic-based approach to guide their search in the vast hypothesis space; most use a greedy, top-down, recursive partitioning strategy for growing a decision tree.
3. Techniques developed for constructing decision trees are computationally inexpensive, making it
possible to quickly construct models even when the training set size is very large.
4. Decision trees, especially smaller-sized trees, are relatively easy to interpret. The accuracies of the trees are also comparable to those of other classification techniques for many simple data sets.
5. Decision trees provide an expressive representation for learning discrete-valued functions. However, they do not generalize well to certain types of Boolean problems.
6. Decision tree algorithms are quite robust to the presence of noise, especially when methods for avoiding overfitting are employed.
7. The presence of redundant attributes does not adversely affect the accuracy of decision trees. An attribute is redundant if it is strongly correlated with another attribute in the data. One of the two redundant attributes will not be used for splitting once the other attribute has been chosen. However, if the data set contains many irrelevant attributes, i.e., attributes that are not useful for the classification task, then some of the irrelevant attributes may be accidentally chosen during the tree-growing process, which results in a decision tree that is larger than necessary.
8. Since most decision tree algorithms employ a top-down, recursive partitioning approach, the number
of records becomes smaller as we traverse down the tree. At the leaf nodes, the number of records
may be too small to make a statistically significant decision about the class representation of the
nodes. This is known as the data fragmentation problem. One possible solution is to disallow
further splitting when the number of records falls below a certain threshold.
9. A subtree can be replicated multiple times in a decision tree. This makes the decision tree more complex than necessary and perhaps more difficult to interpret. Such a situation can arise from decision tree implementations that rely on a single attribute test condition at each internal node. Since most decision tree algorithms use a divide-and-conquer partitioning strategy, the same test condition can be applied to different parts of the attribute space, thus leading to the subtree replication problem.
10. The test conditions described so far in this chapter involve using only a single attribute at a time. As
a consequence, the tree-growing procedure can be viewed as the process of partitioning the attribute
space into disjoint regions until each region contains records of the same class. The border between
two neighboring regions of different classes is known as a decision boundary. Constructive
induction provides another way to partition the data into homogeneous, nonrectangular regions.
11. Studies have shown that the choice of impurity measure has little effect on the performance of decision tree induction algorithms. The strategy used to prune the tree has a greater impact on the final tree than the choice of impurity measure.
Figure 5.2 demonstrates how the sequential covering algorithm works for a data set that contains a
collection of positive and negative examples. The rule R1, whose coverage is shown in Figure 5.2(b), is
extracted first because it covers the largest fraction of positive examples. All the training records
covered by R1 are subsequently removed and the algorithm proceeds to look for the next best rule,
which is R2.
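A skeleton of this loop in Python is sketched below; extract_best_rule() and the rule's covers() method are hypothetical stand-ins for a rule-growing procedure such as the one in Figure 5.3.

```python
# A sketch of the sequential covering loop described above: repeatedly
# extract the rule that covers the most positive examples, then remove
# the training records it covers. extract_best_rule() is a hypothetical
# stand-in for a rule-growing procedure.
def sequential_covering(records, target, extract_best_rule):
    rules = []
    remaining = list(records)
    while any(y == target for _, y in remaining):
        rule = extract_best_rule(remaining, target)
        if rule is None:          # no acceptable rule can be grown
            break
        rules.append(rule)
        # Remove every training record covered by the new rule.
        remaining = [(x, y) for x, y in remaining if not rule.covers(x)]
    return rules
```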
8 Give the recursive definition of Hunt's algorithm. 8
In Hunt's algorithm, a decision tree is grown in a recursive fashion by partitioning the training records into successively purer subsets. Let D_t be the set of training records that are associated with node t and y = {y1, y2, ..., yc} be the class labels. The following is a recursive definition of Hunt's algorithm.
Step 1: If all the records in D_t belong to the same class y_t, then t is a leaf node labeled as y_t.
Step 2: If D_t contains records that belong to more than one class, an attribute test condition is selected to partition the records into smaller subsets. A child node is created for each outcome of the test condition, and the records in D_t are distributed to the children based on the outcomes. The algorithm is then recursively applied to each child node.
To illustrate how the algorithm works, consider the problem of predicting whether a loan
applicant will repay her loan obligations or become delinquent, subsequently defaulting on her
loan. A training set for this problem can be constructed by examining the records of previous
borrowers. In the example shown in Figure 3.6, each record contains the personal information
of a borrower along with a class label indicating whether the borrower has defaulted on loan
payments.
Figure 3.6. Training set for predicting borrowers who will default on loan payments.
The initial tree for the classification problem contains a single node with class label Defaulted = No (see Figure 3.7(a)), which means that most of the borrowers successfully repaid their loans. The tree, however, needs to be refined since the root node contains records from both classes. The records are subsequently divided into smaller subsets based on the outcomes of the Home Owner test condition, as shown in Figure 3.7(b).
For now, we will assume that this is the best criterion for splitting the data at this point. Hunt's algorithm is then applied recursively to each child of the root node.
From the training set given in Figure 3.6, notice that all borrowers who are home owners successfully repaid their loans. The left child of the root is therefore a leaf node labeled Defaulted = No (see Figure 3.7(b)). For the right child, we need to continue applying the recursive step of Hunt's algorithm until all the records belong to the same class. The trees resulting from each recursive step are shown in Figures 3.7(c) and (d).
Figure 3.7 Hunt's algorithm for inducing decision trees.
9 Illustrate Hunt's algorithm to develop a decision tree. Consider the following training set and derive the decision tree. 8
The recursive definition of Hunt's algorithm and the loan-default example of Figures 3.6 and 3.7 are the same as in Question 8 above.
Binary Attributes: A binary attribute generates two potential outcomes, as shown in Figure 3.8.
Nominal Attributes:
Since a nominal attribute can have many values, its test condition can be expressed in two
ways: Multiway split and binary split.
For a multiway split (Figure 3.9(a)), the number of outcomes depends on the number of distinct values for the corresponding attribute. For example, if an attribute such as marital status has three distinct values (single, married, or divorced), its test condition will produce a three-way split.
On the other hand, some decision tree algorithms, such as CART, produce only binary splits by considering all (2^(k−1) − 1) ways of creating a binary partition of k attribute values. Figure 3.9(b) illustrates three different ways of grouping the attribute values for marital status into two subsets.
Figure 3.9 Test conditions for nominal attributes.
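The count (2^(k−1) − 1) can be checked with a short sketch that enumerates the binary partitions for the k = 3 marital status values:

```python
# A sketch of enumerating the (2^(k-1) - 1) binary partitions of a nominal
# attribute, here marital status with k = 3 values.
from itertools import combinations

values = ["single", "married", "divorced"]
k = len(values)

partitions = []
# Take every proper subset containing the first value, so each two-way
# grouping is counted exactly once.
rest = values[1:]
for r in range(len(rest) + 1):
    for combo in combinations(rest, r):
        left = {values[0], *combo}
        right = set(values) - left
        if right:
            partitions.append((left, right))

print(len(partitions), "==", 2 ** (k - 1) - 1)   # -> 3 == 3
for left, right in partitions:
    print(left, "|", right)
```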
Ordinal Attributes:
Ordinal attributes can also produce binary or multiway splits.
Ordinal attribute values can be grouped as long as the grouping does not violate the order property of the attribute values. Figure 3.10 illustrates various ways of splitting training records based on the Shirt Size attribute. The groupings shown in Figures 3.10(a) and (b) preserve the order among the attribute values, whereas the grouping shown in Figure 3.10(c) violates this property because it combines the attribute values Small and Large into the same partition while Medium and Extra Large are combined into another partition.
• The first vertebrate, which is a lemur, is warm-blooded and gives birth to its young. It triggers the
rule r3, and thus, is classified as a mammal.
• The second vertebrate, which is a turtle, triggers the rules r4 and r5. Since the classes predicted by
the rules are contradictory (reptiles versus amphibians), their conflicting classes must be resolved.
• None of the rules are applicable to a dogfish shark. In this case, we need to ensure that the classifier
can still make a reliable prediction even though a test record is not covered by any rule.
The previous example illustrates two important properties of the rule set generated by a rule-based
classifier.
Mutually Exclusive Rules: The rules in a rule set R are mutually exclusive if no two rules in R are triggered by the same record. This property ensures that every record is covered by at most one rule in R. An example of a mutually exclusive rule set is shown in Table 5.3.
Exhaustive Rules: A rule set R has exhaustive coverage if there is a rule for each combination of attribute values. This property ensures that every record is covered by at least one rule in R. Assuming that Body Temperature and Gives Birth are binary variables, the rule set shown in Table 5.3 has exhaustive coverage.
If the rule set is not exhaustive, then a default rule, rd: {} → yd, must be added to cover the remaining cases. A default rule has an empty antecedent and is triggered when all other rules have failed.
If the rule set is not mutually exclusive, then a record can be covered by several rules, some of
which may predict conflicting classes. There are two ways to overcome this problem.
1) Ordered Rules:
In this approach, the rules in a rule set are ordered in decreasing order of their priority,
which can be defined in many ways (e.g., based on accuracy, coverage, total description length,
or the order in which the rules are generated). An ordered rule set is also known as a decision
list. When a test record is presented, it is classified by the highest-ranked rule that covers the
record. This avoids the problem of having conflicting classes predicted by multiple
classification rules.
2) Unordered Rules:
This approach allows a test record to trigger multiple classification rules and considers the
consequent of each rule as a vote for a particular class. The votes are then tallied to determine
the class label of the test record. The record is usually assigned to the class that receives the
highest number of votes.
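A minimal sketch of an ordered rule set (decision list) classifier; the rules and the default class are illustrative, loosely following the vertebrate example:

```python
# A sketch of classifying with an ordered rule set (decision list). Each
# rule is (antecedent, class), where the antecedent is a dict of required
# attribute values; the rules below are illustrative.
rules = [
    ({"Body Temperature": "cold-blooded"}, "non-mammal"),
    ({"Body Temperature": "warm-blooded", "Gives Birth": "yes"}, "mammal"),
    ({"Body Temperature": "warm-blooded", "Gives Birth": "no"}, "non-mammal"),
]
DEFAULT_CLASS = "non-mammal"  # default rule rd: {} -> yd

def classify(record):
    # The highest-ranked rule whose antecedent the record satisfies fires.
    for antecedent, label in rules:
        if all(record.get(a) == v for a, v in antecedent.items()):
            return label
    return DEFAULT_CLASS  # fires only when no other rule covers the record

print(classify({"Body Temperature": "warm-blooded", "Gives Birth": "yes"}))
# -> mammal
```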
13 Consider a training set that contains 60 positive examples and 100 negative examples. For each of the following candidate rules, 8
Rule R1: (covers 50 positive examples and 5 negative examples),
Rule R2: (covers 2 positive examples and no negative examples),
determine which is the best and worst candidate rule according to:
a) Rule accuracy.
b) The likelihood ratio statistic.
c) The Laplace measure.
d) FOIL's information gain.
a) The accuracy of R1 is 50/55 = 90.9%, and the accuracy of R2 is 2/2 = 100%. However, R1 is the better rule despite its lower accuracy: the high accuracy of R2 is potentially spurious because the coverage of the rule is too low.
b) The likelihood ratio statistic is

R = 2 Σ_{i=1}^{k} f_i log2(f_i / e_i)

where k is the number of classes, f_i is the observed frequency of class i examples that are covered by the rule, and e_i is the expected frequency of a rule that makes random predictions. For example, since R1 covers 55 examples, the expected frequency for the positive class is e+ = 55 × 60/160 = 20.625, while the expected frequency for the negative class is e− = 55 × 100/160 = 34.375. Thus, the likelihood ratio for R1 is

R(R1) = 2 × [50 × log2(50/20.625) + 5 × log2(5/34.375)] = 99.9.

Similarly, the expected frequencies for R2 are e+ = 2 × 60/160 = 0.75 and e− = 2 × 100/160 = 1.25, so the likelihood ratio statistic for R2 is

R(R2) = 2 × [2 × log2(2/0.75) + 0 × log2(0/1.25)] = 5.66

(taking 0 × log2 0 = 0). This statistic therefore suggests that R1 is a better rule than R2.
c) An evaluation metric that takes into account the rule coverage can be used. Consider the following evaluation metrics:

Laplace = (f+ + 1) / (n + k),    m-estimate = (f+ + k·p+) / (n + k)

where n is the number of examples covered by the rule, f+ is the number of positive examples covered by the rule, k is the total number of classes, and p+ is the prior probability for the positive class. Note that the m-estimate is equivalent to the Laplace measure when p+ = 1/k.
The Laplace measure for R1 is 51/57 = 89.47%, which is quite close to its accuracy. Conversely, the Laplace measure for R2 (75%) is significantly lower than its accuracy because R2 has a much lower coverage.
d) An evaluation metric that takes into account the support count of the rule can be used. One such metric is FOIL's information gain. The support count of a rule corresponds to the number of positive examples covered by the rule. Suppose the rule r: A → + covers p0 positive examples and n0 negative examples. After adding a new conjunct B, the extended rule r′: A ∧ B → + covers p1 positive examples and n1 negative examples. Given this information, the FOIL's information gain of the extended rule is defined as follows:

FOIL's information gain = p1 × [log2(p1 / (p1 + n1)) − log2(p0 / (p0 + n0))]

Since the measure is proportional to p1 and p1/(p1 + n1), it prefers rules that have a high support count and accuracy. The FOIL's information gains for rules R1 and R2 given in the preceding example are 43.12 and 2, respectively. Therefore, R1 is a better rule than R2.
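The accuracy, likelihood ratio, and Laplace computations above can be reproduced with a short sketch (values match up to rounding; FOIL's gain is omitted here):

```python
# A sketch reproducing the rule-quality computations above for R1 and R2
# (60 positive and 100 negative training examples).
import math

P, N = 60, 100                          # class totals in the training set
rules = {"R1": (50, 5), "R2": (2, 0)}   # (positives, negatives) covered

for name, (pos, neg) in rules.items():
    n = pos + neg
    accuracy = pos / n
    # Likelihood ratio: R = 2 * sum_i f_i * log2(f_i / e_i), skipping f_i = 0.
    e_pos, e_neg = n * P / (P + N), n * N / (P + N)
    R = 2 * sum(f * math.log2(f / e)
                for f, e in ((pos, e_pos), (neg, e_neg)) if f > 0)
    laplace = (pos + 1) / (n + 2)       # k = 2 classes
    print(f"{name}: accuracy={accuracy:.3f}, "
          f"likelihood ratio={R:.2f}, Laplace={laplace:.4f}")
```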
14 Consider a training set that contains 100 positive examples and 400 negative examples. For each of the following candidate rules, 8
Lazy learners such as nearest-neighbor classifiers do not require model building. However,
classifying a test example can be quite expensive because we need to compute the proximity
values individually between the test and training examples. In contrast, eager learners often
spend the bulk of their computing resources for model building. Once a model has been built,
classifying a test example is extremely fast.
Nearest-neighbor classifiers can produce wrong predictions unless the appropriate proximity measure and data preprocessing steps are taken. For example, suppose we want to classify a group of people based on attributes such as height (measured in meters) and weight (measured in pounds). The height attribute has low variability, ranging from 1.5 m to 1.85 m, whereas the weight attribute may vary from 90 lb to 250 lb. If the scales of the attributes are not taken into consideration, the proximity measure may be dominated by differences in a person's weight.
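A sketch of the standard remedy, standardizing each attribute before applying a nearest-neighbor classifier (scikit-learn is an assumed dependency; the data is made up):

```python
# A sketch of why attribute scaling matters for nearest-neighbor
# classifiers. Heights in meters and weights in pounds are on very
# different scales, so an unscaled Euclidean distance is dominated
# by weight. The values below are made up.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X = [[1.55, 120], [1.60, 130], [1.80, 210], [1.85, 230]]  # (height m, weight lb)
y = [0, 0, 1, 1]

# Standardizing each attribute first keeps one attribute from dominating.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
print(model.predict([[1.75, 140]]))
```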
Rule-based classifiers are generally used to produce descriptive models that are easier to interpret, but give comparable performance to the decision tree classifier.
General-to-specific strategy:
In this strategy, an initial rule r: {} → y is created, where the left-hand side is an empty set and the right-hand side contains the target class. The rule has poor quality because it covers all the examples in the training set.
New conjuncts are subsequently added to improve the rule's quality. Figure 5.3(a) shows the general-to-specific rule-growing strategy for the vertebrate classification problem. The conjunct Body Temperature=warm-blooded is initially chosen to form the rule antecedent. The algorithm then explores all the possible candidates and greedily chooses the next conjunct, Gives Birth=yes, to be added into the rule antecedent. This process continues until the stopping criterion is met (e.g., when the added conjunct does not improve the quality of the rule).
Specific-to-general strategy:
One of the positive examples is randomly chosen as the initial seed for the rule-growing process. During the refinement step, the rule is generalized by removing one of its conjuncts so that it can cover more positive examples.
Figure 5.3(b) shows the specific-to-general approach for the vertebrate classification problem. Suppose a positive example for mammals is chosen as the initial seed. The initial rule contains the same conjuncts as the attribute values of the seed.
To improve its coverage, the rule is generalized by removing the conjunct Hibernate=no. The refinement step is repeated until the stopping criterion is met, e.g., when the rule starts covering negative examples.
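A toy sketch of the general-to-specific strategy described above, greedily adding the conjunct that most improves rule accuracy; the training records and target class are hypothetical:

```python
# A sketch of general-to-specific rule growing under simple assumptions:
# start from the empty antecedent {} -> mammal and greedily add the
# conjunct that most improves rule accuracy on the training records.
records = [
    ({"Body Temperature": "warm", "Gives Birth": "yes"}, "mammal"),
    ({"Body Temperature": "warm", "Gives Birth": "no"},  "non-mammal"),
    ({"Body Temperature": "cold", "Gives Birth": "no"},  "non-mammal"),
    ({"Body Temperature": "warm", "Gives Birth": "yes"}, "mammal"),
]
target = "mammal"

def accuracy(antecedent):
    covered = [y for x, y in records
               if all(x.get(a) == v for a, v in antecedent.items())]
    return (sum(y == target for y in covered) / len(covered)) if covered else 0.0

rule = {}  # initial rule {} -> mammal covers every training example
candidates = {(a, v) for x, _ in records for a, v in x.items()}
while accuracy(rule) < 1.0 and candidates:
    # Greedily pick the conjunct that yields the best refined rule.
    best = max(candidates, key=lambda c: accuracy({**rule, c[0]: c[1]}))
    rule[best[0]] = best[1]
    candidates.discard(best)

print(rule, "->", target)   # -> {'Gives Birth': 'yes'} -> mammal
```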
18 What are Bayesian classifiers? Explain Bayes' theorem for classification. 7
Bayesian classifiers provide an approach for modeling probabilistic relationships between the attribute set and the class variable.
Bayes theorem
Let X and Y be a pair of random variables. Their joint probability, P(X =x,Y =y), refers to the
probability that variable X will take on the value x and variable Y will take on the value y. A
conditional probability is the probability that a random variable will take on a particular value given
that the outcome for another random variable is known. For example, the conditional probability P(Y = y | X = x) refers to the probability that the variable Y will take on the value y, given that the variable X is observed to have the value x. The joint and conditional probabilities for X and Y are related in the following way:
P(X, Y) = P(Y|X) × P(X) = P(X|Y) × P(Y)
Rearranging the last two expressions in the above equation leads to the following formula, known as the Bayes theorem:

P(Y|X) = P(X|Y) × P(Y) / P(X)
Estimating the posterior probabilities accurately for every possible combination of class label and attribute value is a difficult problem because it requires a very large training set, even for a moderate number of attributes.
The Bayes theorem is useful because it allows us to express the posterior probability in terms of the prior probability P(Y), the class-conditional probability P(X|Y), and the evidence P(X).
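A numerical sketch of the theorem with hypothetical probabilities:

```python
# A sketch of applying Bayes' theorem with hypothetical probabilities:
# posterior P(Y=1 | X=x) from the prior P(Y), the class-conditional
# probability P(X|Y), and the evidence P(X).
p_y1 = 0.3                  # prior P(Y = 1)
p_x_given_y1 = 0.8          # class-conditional P(X = x | Y = 1)
p_x_given_y0 = 0.2          # class-conditional P(X = x | Y = 0)

# Evidence P(X = x), expanded over both classes.
p_x = p_x_given_y1 * p_y1 + p_x_given_y0 * (1 - p_y1)

posterior = p_x_given_y1 * p_y1 / p_x
print(f"P(Y=1 | X=x) = {posterior:.3f}")   # -> 0.632
```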
Besides the conditional independence conditions imposed by the network topology, each node of a Bayesian network is also associated with a probability table.
1. If a node X does not have any parents, then the table contains only the prior probability P(X).
2. If a node X has only one parent, Y, then the table contains the conditional probability P(X|Y).
3. If a node X has multiple parents, {Y1, Y2, ..., Yk}, then the table contains the conditional probability P(X | Y1, Y2, ..., Yk).
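A small sketch of such probability tables for hypothetical binary variables:

```python
# A sketch of the probability tables attached to Bayesian network nodes,
# using hypothetical binary variables. A node with no parents stores its
# prior; a node with a parent stores one conditional distribution per
# parent value.
prior_Y = {True: 0.3, False: 0.7}            # node Y has no parents: P(Y)

# Node X has one parent Y: table of P(X | Y).
p_X_given_Y = {
    True:  {True: 0.8, False: 0.2},          # P(X | Y=True)
    False: {True: 0.1, False: 0.9},          # P(X | Y=False)
}

# Joint probability via the chain rule: P(X, Y) = P(X | Y) * P(Y).
p_joint = p_X_given_Y[True][True] * prior_Y[True]
print(f"P(X=True, Y=True) = {p_joint:.2f}")  # -> 0.24
```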