Module 4 QB
How can we tell whether it is a mammal or a non-mammal? One approach is to pose a series of
questions about the characteristics of the species.
The first question we may ask is whether the species is cold- or warm-blooded. If it is cold-blooded, then it is definitely not a mammal. Otherwise, it is either a bird or a mammal. In the latter case, we need to ask a follow-up question: do the females of the species give birth to their young? Those that do give birth are definitely mammals, while those that do not are likely to be non-mammals.
The previous example illustrates how we can solve a classification problem by asking a series
of questions about the attributes of the test record. Each time we receive an answer, a follow-up
question is asked until we reach a conclusion about the class label of the record.
The series of questions and their possible answers can be organized in the form of a decision
tree, which is a hierarchical structure consisting of nodes and directed edges.
Figure 4.4 shows the decision tree for the mammal classification problem. The tree has three types of
nodes:
• A root node, which has no incoming edges and zero or more outgoing edges.
• Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.
• Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges. In a decision tree, each leaf node is assigned a class label.
The non-terminal nodes, which include the root and other internal nodes, contain attribute test
conditions to separate records that have different characteristics.
For example, the root node shown in Figure 4.4 uses the attribute Body Temperature to separate warm-blooded from cold-blooded vertebrates. Since all cold-blooded vertebrates are non-mammals, a leaf node labeled Non-mammals is created as the right child of the root node.
If the vertebrate is warm-blooded, a subsequent attribute, Gives Birth, is used to distinguish
mammals from other warm-blooded creatures, which are mostly birds. Classifying a test record
is straightforward once a decision tree has been constructed.
Starting from the root node, we apply the test condition to the record and follow the appropriate
branch based on the outcome of the test. This will lead us either to another internal node, for
which a new test condition is applied, or to a leaf node. The class label associated with the leaf
node is then assigned to the record.
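To make this traversal concrete, here is a minimal Python sketch of classifying a record against the tree of Figure 4.4; the Node structure and attribute names are illustrative assumptions, not taken from the text.

```python
# A minimal sketch of classifying a record with the mammal decision tree.
# The Node structure and attribute names are illustrative assumptions.

class Node:
    def __init__(self, label=None, attribute=None, children=None):
        self.label = label              # class label (leaf nodes only)
        self.attribute = attribute      # attribute tested at this node
        self.children = children or {}  # test outcome -> child Node

def classify(record, node):
    """Follow test outcomes from the root until a leaf is reached."""
    while node.label is None:
        outcome = record[node.attribute]
        node = node.children[outcome]
    return node.label

# Decision tree from Figure 4.4: Body Temperature first, then Gives Birth.
tree = Node(attribute="Body Temperature", children={
    "cold-blooded": Node(label="Non-mammal"),
    "warm-blooded": Node(attribute="Gives Birth", children={
        "yes": Node(label="Mammal"),
        "no":  Node(label="Non-mammal"),
    }),
})

print(classify({"Body Temperature": "warm-blooded", "Gives Birth": "yes"}, tree))
# -> Mammal
```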
3 Define classification. Draw a neat figure and explain the general approach for building a classification model. 8
Classification is the task of learning a target function f that maps each attribute set x to one of the
predefined class labels y. The target function is also known informally as a classification model.
Each technique employs a learning algorithm to identify a model that best fits the relationship
between the attribute set and class label of the input data.
The model generated by a learning algorithm should both fit the input data well and correctly predict
the class labels of records it has never seen before.
Therefore, a key objective of the learning algorithm is to build models with good generalization
capability; i.e., models that accurately predict the class labels of previously unknown records.
Figure 3.3. General approach for building a classification model.
Figure 3.3 shows a general approach for solving classification problems. First, a training set
consisting of records whose class labels are known must be provided. The training set is used to
build a classification model, which is subsequently applied to the test set, which consists of
records with unknown class labels.
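As an illustration of this workflow, the following sketch uses scikit-learn (an assumed dependency; the synthetic data and the choice of a decision tree model are purely illustrative) to build a model on a training set and then apply it to a held-out test set:

```python
# A sketch of the general approach in Figure 3.3 using scikit-learn
# (assumed available); the data here is synthetic, for illustration only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Training set: records with known class labels, used to build the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # induction step

# The model is then applied to the test set to estimate generalization.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```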
Evaluation of the performance of a classification model is based on the counts of test records
correctly and incorrectly predicted by the model. These counts are tabulated in a table known as
a confusion matrix.
Table 4.2 depicts the confusion matrix for a binary classification problem. Each entry f_ij in this table denotes the number of records from class i predicted to be of class j. For instance, f_01 is the number of records from class 0 incorrectly predicted as class 1. Based on the entries in the confusion matrix, the total number of correct predictions made by the model is (f_11 + f_00) and the total number of incorrect predictions is (f_10 + f_01).
Although a confusion matrix provides the information needed to determine how well a classification model performs, summarizing this information with a single number makes it more convenient to compare the performance of different models. This can be done using a performance metric such as accuracy, which is defined as follows:

Accuracy = (f_11 + f_00) / (f_11 + f_10 + f_01 + f_00)

Equivalently, the performance of a model can be expressed in terms of its error rate, which is given by the following equation:

Error rate = (f_10 + f_01) / (f_11 + f_10 + f_01 + f_00)
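A small sketch of both metrics, using hypothetical confusion matrix counts:

```python
# Computing accuracy and error rate from the confusion matrix counts
# (f11, f10, f01, f00 as defined above); the counts are made up.
f11, f10, f01, f00 = 40, 10, 5, 45   # hypothetical counts

total = f11 + f10 + f01 + f00
accuracy = (f11 + f00) / total
error_rate = (f10 + f01) / total

print(f"accuracy = {accuracy:.3f}, error rate = {error_rate:.3f}")
assert abs(accuracy + error_rate - 1.0) < 1e-12  # the two metrics sum to 1
```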
Figure 3.13 compares the values of the impurity measures for binary classification problems.
Figure 3.13 Comparison among the impurity measures for binary classification problems.
Here p refers to the fraction of records that belong to one of the two classes. Observe that all three measures attain their maximum value when the class distribution is uniform (i.e., when p = 0.5). The minimum values for the measures are attained when all the records belong to the same class (i.e., when p equals 0 or 1).
Examples of computing the different impurity measures are given below.
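As a sketch of such computations, the following code evaluates the three impurity measures (entropy, the Gini index, and classification error, in their standard binary forms) at a few values of p:

```python
# A sketch of the three impurity measures for a binary class distribution,
# where p is the fraction of records belonging to class 1.
import math

def entropy(p):
    terms = [q * math.log2(q) for q in (p, 1 - p) if q > 0]
    return -sum(terms)

def gini(p):
    return 1 - (p**2 + (1 - p)**2)

def classification_error(p):
    return 1 - max(p, 1 - p)

for p in (0.0, 0.3, 0.5, 1.0):
    print(f"p={p:.1f}: entropy={entropy(p):.3f}, "
          f"gini={gini(p):.3f}, error={classification_error(p):.3f}")
# All three measures peak at p = 0.5 and vanish at p = 0 or p = 1.
```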
To determine how well a test condition performs, we need to compare the degree of impurity of the parent node (before splitting) with the degree of impurity of the child nodes (after splitting). The larger their difference, the better the test condition.
The gain, Δ, is a criterion that can be used to determine the goodness of a split:

Δ = I(parent) − Σ_{j=1}^{k} [N(vj)/N] × I(vj)

where I(·) is the impurity measure of a given node, N is the total number of records at the parent node, k is the number of attribute values, and N(vj) is the number of records associated with the child node vj.
When entropy is used as the impurity measure in the above equation, the difference in entropy is known as the information gain, Δ_info.
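A short sketch of computing the gain for one candidate split, with hypothetical class counts and entropy as the impurity measure:

```python
# A sketch of computing the gain Δ for a candidate binary split, using
# entropy as the impurity measure (i.e., information gain). The class
# counts at the parent and child nodes below are hypothetical.
import math

def entropy_from_counts(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

parent = [30, 30]              # class counts at the parent node
children = [[25, 5], [5, 25]]  # class counts at each child node

n_parent = sum(parent)
weighted_child_impurity = sum(
    (sum(child) / n_parent) * entropy_from_counts(child) for child in children
)
gain = entropy_from_counts(parent) - weighted_child_impurity
print(f"information gain = {gain:.3f}")   # -> 0.350
```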
5 Write an algorithm for decision tree induction and explain the same. 8
A skeleton decision tree induction algorithm called TreeGrowth is shown in Algorithm 4.1. The input to this algorithm consists of the training records E and the attribute set F. The algorithm works by recursively selecting the best attribute to split the data (Step 7) and expanding the leaf nodes of the tree (Steps 11 and 12) until the stopping criterion is met (Step 1).
Explanation
The details of this algorithm are explained below:
The createNode() function extends the decision tree by creating a new node. A node in the decision tree has either a test condition, denoted as node.test_cond, or a class label, denoted as node.label.
The find_best_split() function determines which attribute should be selected as the test condition for splitting the training records. As previously noted, the choice of test condition depends on which impurity measure is used to determine the goodness of a split. Some widely used measures include entropy, the Gini index, and the χ² statistic.
The Classify() function determines the class label to be assigned to a leaf node. For each leaf node t, let p(i|t) denote the fraction of training records from class i associated with the node t. In most cases, the leaf node is assigned to the class that has the majority of training records:

leaf.label = argmax_i p(i|t)

where the argmax operator returns the argument i that maximizes the expression p(i|t).
The stopping_cond() function is used to terminate the tree-growing process by testing whether all the records have either the same class label or the same attribute values. Another way to terminate the recursive function is to test whether the number of records has fallen below some minimum threshold.
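A runnable Python sketch of this skeleton is given below. The helper implementations (a majority-class Classify(), a Gini-based find_best_split(), and a simple stopping_cond()) are stand-ins for the functions named above, not the book's exact procedures:

```python
# A runnable sketch of the TreeGrowth skeleton described above, with
# simple stand-in implementations of the helper functions.
from collections import Counter

class Node:
    def __init__(self):
        self.test_cond = None   # attribute index tested at this node
        self.label = None       # class label (leaf nodes only)
        self.children = {}      # attribute value -> child Node

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def stopping_cond(E, F):
    labels = [y for _, y in E]
    return len(set(labels)) == 1 or not F

def classify(E):
    """Majority class among the records at this node."""
    return Counter(y for _, y in E).most_common(1)[0][0]

def find_best_split(E, F):
    """Pick the attribute whose split gives the lowest weighted Gini."""
    def weighted_gini(attr):
        parts = {}
        for x, y in E:
            parts.setdefault(x[attr], []).append(y)
        return sum(len(p) / len(E) * gini(p) for p in parts.values())
    return min(F, key=weighted_gini)

def tree_growth(E, F):
    node = Node()                       # createNode()
    if stopping_cond(E, F):
        node.label = classify(E)
        return node
    node.test_cond = find_best_split(E, F)
    for v in set(x[node.test_cond] for x, _ in E):
        Ev = [(x, y) for x, y in E if x[node.test_cond] == v]
        node.children[v] = tree_growth(Ev, [a for a in F if a != node.test_cond])
    return node

# Tiny example: attributes are (Body Temperature, Gives Birth).
E = [(("warm", "yes"), "mammal"), (("warm", "no"), "non-mammal"),
     (("cold", "no"), "non-mammal"), (("cold", "no"), "non-mammal")]
root = tree_growth(E, [0, 1])
print(root.test_cond)  # attribute chosen at the root (here: Gives Birth)
```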
6 Explain the important characteristics of decision tree induction algorithm. 8
The following is a summary of the important characteristics of decision tree induction algorithms.
1. Decision tree induction is a nonparametric approach for building classification models. In other
words, it does not require any prior assumptions regarding the type of probability distributions
satisfied by the class and other attributes.
2. Finding an optimal decision tree is an NP-complete problem. Many decision tree algorithms therefore employ a heuristic-based approach to guide their search in the vast hypothesis space; most use a greedy, top-down, recursive partitioning strategy for growing a decision tree.
3. Techniques developed for constructing decision trees are computationally inexpensive, making it
possible to quickly construct models even when the training set size is very large.
4. Decision trees, especially smaller-sized trees, are relatively easy to interpret. The accuracies of the trees are also comparable to those of other classification techniques for many simple data sets.
5. Decision trees provide an expressive representation for learning discrete-valued functions. However, they do not generalize well to certain types of Boolean problems.
6. Decision tree algorithms are quite robust to the presence of noise, especially when methods for avoiding overfitting are employed.
7. The presence of redundant attributes does not adversely affect the accuracy of decision trees. An attribute is redundant if it is strongly correlated with another attribute in the data. One of the two redundant attributes will not be used for splitting once the other attribute has been chosen. However, if the data set contains many irrelevant attributes, i.e., attributes that are not useful for the classification task, then some of the irrelevant attributes may be accidentally chosen during the tree-growing process, which results in a decision tree that is larger than necessary.
8. Since most decision tree algorithms employ a top-down, recursive partitioning approach, the number
of records becomes smaller as we traverse down the tree. At the leaf nodes, the number of records
may be too small to make a statistically significant decision about the class representation of the
nodes. This is known as the data fragmentation problem. One possible solution is to disallow
further splitting when the number of records falls below a certain threshold.
9. A subtree can be replicated multiple times in a decision tree. This makes the decision tree more complex than necessary and perhaps more difficult to interpret. Such a situation can arise from decision tree implementations that rely on a single attribute test condition at each internal node. Since most decision tree algorithms use a divide-and-conquer partitioning strategy, the same test condition can be applied to different parts of the attribute space, thus leading to the subtree replication problem.
10. The test conditions described so far in this chapter involve using only a single attribute at a time. As
a consequence, the tree-growing procedure can be viewed as the process of partitioning the attribute
space into disjoint regions until each region contains records of the same class. The border between
two neighboring regions of different classes is known as a decision boundary. Constructive
induction provides another way to partition the data into homogeneous, nonrectangular regions.
11. Studies have shown that the choice of impurity measure has little effect on the performance of decision tree induction algorithms. The strategy used to prune the tree has a greater impact on the final tree than the choice of impurity measure.
Figure 5.2 demonstrates how the sequential covering algorithm works for a data set that contains a
collection of positive and negative examples. The rule R1, whose coverage is shown in Figure 5.2(b), is
extracted first because it covers the largest fraction of positive examples. All the training records
covered by R1 are subsequently removed and the algorithm proceeds to look for the next best rule,
which is R2.
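A skeleton of this loop in Python is sketched below; extract_best_rule() and the rule's covers() method are hypothetical stand-ins for a rule-growing procedure such as the one in Figure 5.3.

```python
# A sketch of the sequential covering loop described above: repeatedly
# extract the rule that covers the most positive examples, then remove
# the training records it covers. extract_best_rule() is a hypothetical
# stand-in for a rule-growing procedure.
def sequential_covering(records, target, extract_best_rule):
    rules = []
    remaining = list(records)
    while any(y == target for _, y in remaining):
        rule = extract_best_rule(remaining, target)
        if rule is None:          # no acceptable rule can be grown
            break
        rules.append(rule)
        # Remove every training record covered by the new rule.
        remaining = [(x, y) for x, y in remaining if not rule.covers(x)]
    return rules
```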
8 Give the recursive definition of Hunt's algorithm. 8
In Hunt's algorithm, a decision tree is grown in a recursive fashion by partitioning the training records into successively purer subsets. Let D_t be the set of training records that are associated with node t and y = {y1, y2, ..., yc} be the class labels. The following is a recursive definition of Hunt's algorithm.
Step 1: If all the records in D_t belong to the same class y_t, then t is a leaf node labeled as y_t.
Step 2: If D_t contains records that belong to more than one class, an attribute test condition is selected to partition the records into smaller subsets. A child node is created for each outcome of the test condition, and the records in D_t are distributed to the children based on the outcomes. The algorithm is then recursively applied to each child node.
To illustrate how the algorithm works, consider the problem of predicting whether a loan
applicant will repay her loan obligations or become delinquent, subsequently defaulting on her
loan. A training set for this problem can be constructed by examining the records of previous
borrowers. In the example shown in Figure 3.6, each record contains the personal information
of a borrower along with a class label indicating whether the borrower has defaulted on loan
payments.
Figure 3.6. Training set for predicting borrowers who will default on loan payments.
The initial tree for the classification problem contains a single node with class label Defaulted = No (see Figure 3.7(a)), which means that most of the borrowers successfully repaid their loans. The tree, however, needs to be refined since the root node contains records from both classes. The records are subsequently divided into smaller subsets based on the outcomes of the Home Owner test condition, as shown in Figure 3.7(b).
For now, we will assume that this is the best criterion for splitting the data at this point. Hunt's algorithm is then applied recursively to each child of the root node.
From the training set given in Figure 3.6, notice that all borrowers who are home owners successfully repaid their loans. The left child of the root is therefore a leaf node labeled Defaulted = No (see Figure 3.7(b)). For the right child, we need to continue applying the recursive step of Hunt's algorithm until all the records belong to the same class. The trees resulting from each recursive step are shown in Figures 3.7(c) and (d).
Figure 3.7 Hunt's algorithm for inducing decision trees.
9 Illustrate Hunt's algorithm to develop a decision tree. Consider the following training set and derive the decision tree. 8
The recursive definition of Hunt's algorithm and the loan-default example of Figures 3.6 and 3.7 are the same as in Question 8 above.
Binary Attributes: A binary attribute generates two potential outcomes, as shown in Figure 3.8.
Nominal Attributes:
Since a nominal attribute can have many values, its test condition can be expressed in two
ways: Multiway split and binary split.
For a multiway split (Figure 3.9(a)), the number of outcomes depends on the number of distinct values for the corresponding attribute. For example, if an attribute such as marital status has three distinct values (single, married, or divorced), its test condition will produce a three-way split.
On the other hand, some decision tree algorithms, such as CART, produce only binary splits by considering all (2^(k−1) − 1) ways of creating a binary partition of k attribute values. Figure 3.9(b) illustrates three different ways of grouping the attribute values for marital status into two subsets.
Figure 3.9 Test conditions for nominal attributes.
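The count (2^(k−1) − 1) can be checked with a short sketch that enumerates the binary partitions for the k = 3 marital status values:

```python
# A sketch of enumerating the (2^(k-1) - 1) binary partitions of a nominal
# attribute, here marital status with k = 3 values.
from itertools import combinations

values = ["single", "married", "divorced"]
k = len(values)

partitions = []
# Take every proper subset containing the first value, so each two-way
# grouping is counted exactly once.
rest = values[1:]
for r in range(len(rest) + 1):
    for combo in combinations(rest, r):
        left = {values[0], *combo}
        right = set(values) - left
        if right:
            partitions.append((left, right))

print(len(partitions), "==", 2 ** (k - 1) - 1)   # -> 3 == 3
for left, right in partitions:
    print(left, "|", right)
```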
Ordinal Attributes:
Ordinal attributes can also produce binary or multiway splits.
Ordinal attribute values can be grouped as long as the grouping does not violate the order property of the attribute values. Figure 3.10 illustrates various ways of splitting training records based on the Shirt Size attribute. The groupings shown in Figures 3.10(a) and (b) preserve the order among the attribute values, whereas the grouping shown in Figure 3.10(c) violates this property because it combines the attribute values Small and Large into the same partition while Medium and Extra Large are combined into another partition.
• The first vertebrate, which is a lemur, is warm-blooded and gives birth to its young. It triggers the
rule r3, and thus, is classified as a mammal.
• The second vertebrate, which is a turtle, triggers the rules r4 and r5. Since the classes predicted by
the rules are contradictory (reptiles versus amphibians), their conflicting classes must be resolved.
• None of the rules are applicable to a dogfish shark. In this case, we need to ensure that the classifier
can still make a reliable prediction even though a test record is not covered by any rule.
The previous example illustrates two important properties of the rule set generated by a rule-based
classifier.
Mutually Exclusive Rules: The rules in a rule set R are mutually exclusive if no two rules in R are triggered by the same record. This property ensures that every record is covered by at most one rule in R. An example of a mutually exclusive rule set is shown in Table 5.3.
Exhaustive Rules: A rule set R has exhaustive coverage if there is a rule for each combination of attribute values. This property ensures that every record is covered by at least one rule in R. Assuming that Body Temperature and Gives Birth are binary variables, the rule set shown in Table 5.3 has exhaustive coverage.
If the rule set is not exhaustive, then a default rule, rd: {} → yd, must be added to cover the remaining cases. A default rule has an empty antecedent and is triggered when all other rules have failed.
If the rule set is not mutually exclusive, then a record can be covered by several rules, some of
which may predict conflicting classes. There are two ways to overcome this problem.
1) Ordered Rules:
In this approach, the rules in a rule set are ordered in decreasing order of their priority,
which can be defined in many ways (e.g., based on accuracy, coverage, total description length,
or the order in which the rules are generated). An ordered rule set is also known as a decision
list. When a test record is presented, it is classified by the highest-ranked rule that covers the
record. This avoids the problem of having conflicting classes predicted by multiple
classification rules.
2) Unordered Rules:
This approach allows a test record to trigger multiple classification rules and considers the
consequent of each rule as a vote for a particular class. The votes are then tallied to determine
the class label of the test record. The record is usually assigned to the class that receives the
highest number of votes.
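A minimal sketch of an ordered rule set (decision list) classifier; the rules and the default class are illustrative, loosely following the vertebrate example:

```python
# A sketch of classifying with an ordered rule set (decision list). Each
# rule is (antecedent, class), where the antecedent is a dict of required
# attribute values; the rules below are illustrative.
rules = [
    ({"Body Temperature": "cold-blooded"}, "non-mammal"),
    ({"Body Temperature": "warm-blooded", "Gives Birth": "yes"}, "mammal"),
    ({"Body Temperature": "warm-blooded", "Gives Birth": "no"}, "non-mammal"),
]
DEFAULT_CLASS = "non-mammal"  # default rule rd: {} -> yd

def classify(record):
    # The highest-ranked rule whose antecedent the record satisfies fires.
    for antecedent, label in rules:
        if all(record.get(a) == v for a, v in antecedent.items()):
            return label
    return DEFAULT_CLASS  # fires only when no other rule covers the record

print(classify({"Body Temperature": "warm-blooded", "Gives Birth": "yes"}))
# -> mammal
```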
13 Consider a training set that contains 60 positive examples and 100 negative examples. For each of the following candidate rules, 8
Rule R1: (covers 50 positive examples and 5 negative examples),
Rule R2: (covers 2 positive examples and no negative examples),
determine which is the best and worst candidate rule according to:
a) Rule accuracy.
b) The likelihood ratio statistic.
c) The Laplace measure.
d) FOIL's information gain.
a) The accuracy of R1 is 50/55 = 90.9%, and the accuracy of R2 is 2/2 = 100%. However, R1 is the better rule despite its lower accuracy: the high accuracy of R2 is potentially spurious because the coverage of the rule is too low.
b) The likelihood ratio statistic is

R = 2 Σ_{i=1}^{k} f_i log2(f_i / e_i)

where k is the number of classes, f_i is the observed frequency of class i examples that are covered by the rule, and e_i is the expected frequency of a rule that makes random predictions. For example, since R1 covers 55 examples, the expected frequency for the positive class is e+ = 55 × 60/160 = 20.625, while the expected frequency for the negative class is e− = 55 × 100/160 = 34.375. Thus, the likelihood ratio for R1 is

R(R1) = 2 × [50 × log2(50/20.625) + 5 × log2(5/34.375)] = 99.9.

Similarly, the expected frequencies for R2 are e+ = 2 × 60/160 = 0.75 and e− = 2 × 100/160 = 1.25, so the likelihood ratio statistic for R2 is

R(R2) = 2 × [2 × log2(2/0.75) + 0 × log2(0/1.25)] = 5.66

(taking 0 × log2 0 = 0). This statistic therefore suggests that R1 is a better rule than R2.
c) An evaluation metric that takes into account the rule coverage can be used. Consider the following evaluation metrics:

Laplace = (f+ + 1) / (n + k),    m-estimate = (f+ + k·p+) / (n + k)

where n is the number of examples covered by the rule, f+ is the number of positive examples covered by the rule, k is the total number of classes, and p+ is the prior probability for the positive class. Note that the m-estimate is equivalent to the Laplace measure when p+ = 1/k.
The Laplace measure for R1 is 51/57 = 89.47%, which is quite close to its accuracy. Conversely, the Laplace measure for R2 (75%) is significantly lower than its accuracy because R2 has a much lower coverage.
d) An evaluation metric that takes into account the support count of the rule can be used. One such metric is FOIL's information gain. The support count of a rule corresponds to the number of positive examples covered by the rule. Suppose the rule r: A → + covers p0 positive examples and n0 negative examples. After adding a new conjunct B, the extended rule r′: A ∧ B → + covers p1 positive examples and n1 negative examples. Given this information, the FOIL's information gain of the extended rule is defined as follows:

FOIL's information gain = p1 × [log2(p1 / (p1 + n1)) − log2(p0 / (p0 + n0))]

Since the measure is proportional to p1 and p1/(p1 + n1), it prefers rules that have a high support count and accuracy. The FOIL's information gains for rules R1 and R2 given in the preceding example are 43.12 and 2, respectively. Therefore, R1 is a better rule than R2.
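The accuracy, likelihood ratio, and Laplace computations above can be reproduced with a short sketch (values match up to rounding; FOIL's gain is omitted here):

```python
# A sketch reproducing the rule-quality computations above for R1 and R2
# (60 positive and 100 negative training examples).
import math

P, N = 60, 100                          # class totals in the training set
rules = {"R1": (50, 5), "R2": (2, 0)}   # (positives, negatives) covered

for name, (pos, neg) in rules.items():
    n = pos + neg
    accuracy = pos / n
    # Likelihood ratio: R = 2 * sum_i f_i * log2(f_i / e_i), skipping f_i = 0.
    e_pos, e_neg = n * P / (P + N), n * N / (P + N)
    R = 2 * sum(f * math.log2(f / e)
                for f, e in ((pos, e_pos), (neg, e_neg)) if f > 0)
    laplace = (pos + 1) / (n + 2)       # k = 2 classes
    print(f"{name}: accuracy={accuracy:.3f}, "
          f"likelihood ratio={R:.2f}, Laplace={laplace:.4f}")
```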
14 Consider a training set that contains 100 positive examples and 400 negative examples. For each of the following candidate rules, 8
Lazy learners such as nearest-neighbor classifiers do not require model building. However,
classifying a test example can be quite expensive because we need to compute the proximity
values individually between the test and training examples. In contrast, eager learners often
spend the bulk of their computing resources for model building. Once a model has been built,
classifying a test example is extremely fast.
Nearest-neighbor classifiers can produce wrong predictions unless the appropriate proximity measure and data preprocessing steps are taken. For example, suppose we want to classify a group of people based on attributes such as height (measured in meters) and weight (measured in pounds). The height attribute has low variability, ranging from 1.5 m to 1.85 m, whereas the weight attribute may vary from 90 lb to 250 lb. If the scales of the attributes are not taken into consideration, the proximity measure may be dominated by differences in a person's weight.
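A sketch of the standard remedy, standardizing each attribute before applying a nearest-neighbor classifier (scikit-learn is an assumed dependency; the data is made up):

```python
# A sketch of why attribute scaling matters for nearest-neighbor
# classifiers. Heights in meters and weights in pounds are on very
# different scales, so an unscaled Euclidean distance is dominated
# by weight. The values below are made up.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X = [[1.55, 120], [1.60, 130], [1.80, 210], [1.85, 230]]  # (height m, weight lb)
y = [0, 0, 1, 1]

# Standardizing each attribute first keeps one attribute from dominating.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
print(model.predict([[1.75, 140]]))
```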
Rule-based classifiers are generally used to produce descriptive models that are easier to interpret, but give comparable performance to the decision tree classifier.
General-to-specific strategy:
In this strategy, an initial rule r: {} → y is created, where the left-hand side is an empty set and the right-hand side contains the target class. The rule has poor quality because it covers all the examples in the training set.
New conjuncts are subsequently added to improve the rule's quality. Figure 5.3(a) shows the general-to-specific rule-growing strategy for the vertebrate classification problem. The conjunct Body Temperature=warm-blooded is initially chosen to form the rule antecedent. The algorithm then explores all the possible candidates and greedily chooses the next conjunct, Gives Birth=yes, to be added into the rule antecedent. This process continues until the stopping criterion is met (e.g., when the added conjunct does not improve the quality of the rule).
Specific-to-general strategy:
One of the positive examples is randomly chosen as the initial seed for the rule-growing process. During the refinement step, the rule is generalized by removing one of its conjuncts so that it can cover more positive examples.
Figure 5.3(b) shows the specific-to-general approach for the vertebrate classification problem. Suppose a positive example for mammals is chosen as the initial seed. The initial rule contains the same conjuncts as the attribute values of the seed.
To improve its coverage, the rule is generalized by removing the conjunct Hibernate=no. The refinement step is repeated until the stopping criterion is met, e.g., when the rule starts covering negative examples.
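A toy sketch of the general-to-specific strategy described above, greedily adding the conjunct that most improves rule accuracy; the training records and target class are hypothetical:

```python
# A sketch of general-to-specific rule growing under simple assumptions:
# start from the empty antecedent {} -> mammal and greedily add the
# conjunct that most improves rule accuracy on the training records.
records = [
    ({"Body Temperature": "warm", "Gives Birth": "yes"}, "mammal"),
    ({"Body Temperature": "warm", "Gives Birth": "no"},  "non-mammal"),
    ({"Body Temperature": "cold", "Gives Birth": "no"},  "non-mammal"),
    ({"Body Temperature": "warm", "Gives Birth": "yes"}, "mammal"),
]
target = "mammal"

def accuracy(antecedent):
    covered = [y for x, y in records
               if all(x.get(a) == v for a, v in antecedent.items())]
    return (sum(y == target for y in covered) / len(covered)) if covered else 0.0

rule = {}  # initial rule {} -> mammal covers every training example
candidates = {(a, v) for x, _ in records for a, v in x.items()}
while accuracy(rule) < 1.0 and candidates:
    # Greedily pick the conjunct that yields the best refined rule.
    best = max(candidates, key=lambda c: accuracy({**rule, c[0]: c[1]}))
    rule[best[0]] = best[1]
    candidates.discard(best)

print(rule, "->", target)   # -> {'Gives Birth': 'yes'} -> mammal
```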
18 What are Bayesian classifiers? Explain Bayes' theorem for classification. 7
Bayesian classifiers provide an approach for modeling probabilistic relationships between the attribute set and the class variable.
Bayes theorem
Let X and Y be a pair of random variables. Their joint probability, P(X =x,Y =y), refers to the
probability that variable X will take on the value x and variable Y will take on the value y. A
conditional probability is the probability that a random variable will take on a particular value given
that the outcome for another random variable is known. For example, the conditional probability P(Y = y | X = x) refers to the probability that the variable Y will take on the value y, given that the variable X is observed to have the value x. The joint and conditional probabilities for X and Y are related in the following way:
P(X, Y) = P(Y|X) × P(X) = P(X|Y) × P(Y)
Rearranging the last two expressions in the above equation leads to the following formula, known as the Bayes theorem:

P(Y|X) = P(X|Y) × P(Y) / P(X)
Estimating the posterior probabilities accurately for every possible combination of class label and attribute value is a difficult problem because it requires a very large training set, even for a moderate number of attributes.
The Bayes theorem is useful because it allows us to express the posterior probability in terms of the prior probability P(Y), the class-conditional probability P(X|Y), and the evidence P(X).
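A numerical sketch of the theorem with hypothetical probabilities:

```python
# A sketch of applying Bayes' theorem with hypothetical probabilities:
# posterior P(Y=1 | X=x) from the prior P(Y), the class-conditional
# probability P(X|Y), and the evidence P(X).
p_y1 = 0.3                  # prior P(Y = 1)
p_x_given_y1 = 0.8          # class-conditional P(X = x | Y = 1)
p_x_given_y0 = 0.2          # class-conditional P(X = x | Y = 0)

# Evidence P(X = x), expanded over both classes.
p_x = p_x_given_y1 * p_y1 + p_x_given_y0 * (1 - p_y1)

posterior = p_x_given_y1 * p_y1 / p_x
print(f"P(Y=1 | X=x) = {posterior:.3f}")   # -> 0.632
```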
Besides the conditional independence conditions imposed by the network topology, each node of a Bayesian network is also associated with a probability table.
1. If a node X does not have any parents, then the table contains only the prior probability P(X).
2. If a node X has only one parent, Y, then the table contains the conditional probability P(X|Y).
3. If a node X has multiple parents, {Y1, Y2, ..., Yk}, then the table contains the conditional probability P(X | Y1, Y2, ..., Yk).
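A small sketch of such probability tables for hypothetical binary variables:

```python
# A sketch of the probability tables attached to Bayesian network nodes,
# using hypothetical binary variables. A node with no parents stores its
# prior; a node with a parent stores one conditional distribution per
# parent value.
prior_Y = {True: 0.3, False: 0.7}            # node Y has no parents: P(Y)

# Node X has one parent Y: table of P(X | Y).
p_X_given_Y = {
    True:  {True: 0.8, False: 0.2},          # P(X | Y=True)
    False: {True: 0.1, False: 0.9},          # P(X | Y=False)
}

# Joint probability via the chain rule: P(X, Y) = P(X | Y) * P(Y).
p_joint = p_X_given_Y[True][True] * prior_Y[True]
print(f"P(X=True, Y=True) = {p_joint:.2f}")  # -> 0.24
```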