MODULE-4 CLASSIFICATION
Introduction
Decision Tree Induction
Methods for Comparing Classifiers
Rule Based Classifiers
Nearest Neighbor Classifiers
Bayesian Classifiers
Introduction
Classification: Definition
Classification is the task of assigning objects to one of several predefined categories. The input data for a classification task is a collection of records (the training set). Each record contains a set of attributes, one of which is the class. The goal is to find a model for the class attribute as a function of the values of the other attributes.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
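A minimal sketch of such a split, assuming the scikit-learn library is available (the records shown are hypothetical):

# Dividing a data set into training and test sets (hypothetical records).
from sklearn.model_selection import train_test_split

records = [[0, 125], [1, 100], [0, 70], [1, 120]]   # attribute sets
labels  = ["No", "No", "No", "Yes"]                 # class attribute

# Hold out 25% of the records as the test set used to validate the model.
X_train, X_test, y_train, y_test = train_test_split(
    records, labels, test_size=0.25, random_state=42)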
Applications:
• Detecting spam email messages based upon the message header and content.
• Categorizing cells as malignant or benign based upon the results of MRI scans.
• Classifying galaxies based upon their shapes.
• Categorizing news stories as finance, weather, entertainment, sports, etc.
• Classifying credit card transactions as legitimate or fraudulent.
General Approach to Solving a Classification Problem
Each technique employs a learning algorithm to identify a model that best fits the relationship
between the attribute set and class label of the input data.
The model generated by a learning algorithm should both fit the input data well and correctly
predict the class labels of records it has never seen before.
Therefore, a key objective of the learning algorithm is to build models with good generalization capability, i.e., models that accurately predict the class labels of previously unknown records. Most classification algorithms seek models that attain the highest accuracy, or equivalently, the lowest error rate, when applied to the test set.
Classification Techniques:
• Decision Tree based Methods
• Rule-based Methods
• Memory-based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
Hunt's Algorithm
In Hunt's algorithm, a decision tree is grown in a recursive fashion by partitioning the training records into successively purer subsets. Let Dt be the set of training records that are associated with node t and y = {y1, y2, ..., yc} be the class labels. The following is a recursive definition of Hunt's algorithm.
Step 1: If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt.
Step 2: If Dt contains records that belong to more than one class, an attribute test condition is selected to partition the records into smaller subsets. A child node is created for each outcome of the test condition and the records in Dt are distributed to the children based on the outcomes. The algorithm is then recursively applied to each child node.
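A minimal runnable Python sketch of this recursion (an illustration, not the textbook's exact procedure: for simplicity it splits on the first remaining categorical attribute, since how to choose the best test condition is discussed later):

from collections import Counter

def hunt(records, labels, attrs):
    """Grow a decision tree by recursively partitioning the training
    records (dicts mapping attribute -> value) into purer subsets."""
    # Step 1: all records in Dt belong to the same class yt -> leaf labeled yt.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # No attribute tests left: label the leaf with the majority class.
    if not attrs:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # Step 2: select an attribute test condition and partition the records.
    test = attrs[0]
    partitions = {}
    for rec, y in zip(records, labels):
        partitions.setdefault(rec[test], ([], []))
        partitions[rec[test]][0].append(rec)
        partitions[rec[test]][1].append(y)
    # Create a child node for each outcome and recurse on each child.
    return {"test": test,
            "children": {v: hunt(rs, ys, attrs[1:])
                         for v, (rs, ys) in partitions.items()}}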
To illustrate how the algorithm works, consider the problem of predicting whether a loan applicant will
repay her loan obligations or become delinquent, subsequently defaulting on her loan.
The initial tree for the classification problem contains a single node with class label Defaulted = No (see Figure 4.7(a)), which means that most of the borrowers successfully repaid their loans. The tree, however, needs to be refined since the root node contains records from both classes.
The records are subsequently divided into smaller subsets based on the outcomes of the Home Owner test
condition as shown in Figure 4.7(b). The justification for choosing this attribute test condition will be
discussed later. For now, we will assume that this is the best criterion for splitting the data at this point.
Hunt's algorithm is then applied recursively to each child of the root node. From the training set given in
Figure 4.6, notice that all borrowers who are home owners successfully repaid their loans. The left child
of the root is therefore a leaf node labeled Defaulted = No (see Figure 4.7(b)).
For the right child, we need to continue applying the recursive step of Hunt's algorithm until all the
records belong to the same class. The trees resulting from each recursive step are shown in Figures 4.7(c)
and (d).
Nominal Attributes: Since a nominal attribute can have many values, its test condition can be expressed in two ways, as shown in Figure 4.9. For a multiway split (Figure 4.9(a)), the number of outcomes depends on the number of distinct values for the corresponding attribute. For example, if an attribute such as marital status has three distinct values (single, married, or divorced), its test condition will produce a three-way split.
Figure 4.9(b) illustrates three different ways of grouping the attribute values for marital status into two
subsets.
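For a k-valued nominal attribute there are 2^(k-1) - 1 such binary groupings; a small Python sketch enumerates them for the marital status example:

from itertools import combinations

def binary_groupings(values):
    """Yield the 2**(k-1) - 1 ways of grouping k nominal values
    into two non-empty subsets (mirror pairs counted once)."""
    first, rest = values[0], values[1:]
    for size in range(len(rest)):      # stop before one side swallows everything
        for combo in combinations(rest, size):
            left = {first, *combo}
            yield left, set(values) - left

for left, right in binary_groupings(["Single", "Married", "Divorced"]):
    print(left, "vs", right)
# -> {'Single'} vs {'Married', 'Divorced'}
#    {'Single', 'Married'} vs {'Divorced'}
#    {'Single', 'Divorced'} vs {'Married'}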
Ordinal Attributes: Ordinal attributes can also produce binary or multiway splits. Ordinal attribute
values can be grouped as long as the grouping does not violate the order property of the attribute values.
Figure 4.10 illustrates various ways of splitting training records based on the Shirt Size attribute.
The groupings shown in Figures 4.10(a) and (b) preserve the order among the attribute values, whereas the grouping shown in Figure 4.10(c) violates this property because it combines the attribute values Small and Large into the same partition while Medium and Extra Large are combined into another partition.
Continuous Attributes: For continuous attributes, the test condition can be expressed as a comparison test (A < v) or (A >= v) with binary outcomes, or a range query with outcomes of the form v_i <= A < v_{i+1}, for i = 1, 2, ..., k. The difference between these approaches is shown in Figure 4.11.
The measures developed for selecting the best split are often based on the degree of impurity of the child nodes. The smaller the degree of impurity, the more skewed the class distribution. Commonly used impurity measures include:

Entropy(t) = - Σ_i p(i|t) log2 p(i|t)
Gini(t) = 1 - Σ_i [p(i|t)]^2
Classification error(t) = 1 - max_i [p(i|t)]

where p(i|t) denotes the fraction of records belonging to class i at a given node t, c is the number of classes, and the sums and maximum run over the c classes.
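These measures translate directly into code. The sketch below (plain Python; the income values are hypothetical) computes each impurity measure from a node's class labels, then scores candidate binary splits A < v on a continuous attribute by the weighted impurity of the resulting children:

import math
from collections import Counter

def fractions(labels):
    """p(i|t): fraction of records of each class i at node t."""
    n = len(labels)
    return [c / n for c in Counter(labels).values()]

def entropy(labels):
    return -sum(p * math.log2(p) for p in fractions(labels) if p > 0)

def gini(labels):
    return 1.0 - sum(p * p for p in fractions(labels))

def classification_error(labels):
    return 1.0 - max(fractions(labels))

def split_impurity(values, labels, v, impurity=gini):
    """Weighted impurity of the children produced by the test A < v."""
    left  = [y for a, y in zip(values, labels) if a < v]
    right = [y for a, y in zip(values, labels) if a >= v]
    n = len(labels)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)

incomes = [60, 70, 85, 90, 120, 125]             # sorted continuous attribute
classes = ["No", "No", "Yes", "Yes", "No", "No"]
candidates = [(a + b) / 2 for a, b in zip(incomes, incomes[1:])]
best_v = min(candidates, key=lambda v: split_impurity(incomes, classes, v))
print(best_v)   # -> 77.5 (tied with 105.0; min keeps the first candidate)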
Disadvantages:
Since most decision tree algorithms employ a top-down, recursive partitioning approach, the number of
records becomes smaller as we traverse down the tree. At the leaf nodes, the number of records may be
too small to make a statistically significant decision about the class representation of the nodes.
A subtree can be replicated multiple times in a decision tree, as illustrated in Figure 4.19. This makes
the decision tree more complex than necessary and perhaps more difficult to interpret. Such a situation
can arise from decision tree implementations that rely on a single attribute test condition at each internal
node.
Underfitting: The training and test error rates of the model are both large when the size of the tree is very small. This situation is known as model underfitting. Underfitting occurs because the model has yet to learn the true structure of the data. As a result, it performs poorly on both the training and the test sets.
Overfitting: As the number of nodes in the decision tree increases, its training and test errors decrease. However, once the tree becomes too large, its test error rate begins to increase even though its training error rate continues to decrease. This phenomenon is known as model overfitting. The figure shows the training and test error rates of the decision tree as its size grows.
2. Post-pruning
• Grow the decision tree to its entirety.
• Trim the nodes of the decision tree in a bottom-up fashion.
• If the generalization error improves after trimming, replace the sub-tree by a leaf node (see the sketch below).
• The class label of the leaf node is determined from the majority class of instances in the sub-tree.
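The generalization-error check can be realized with scikit-learn's minimal cost-complexity pruning, one concrete post-pruning scheme (a sketch; it assumes training data as before plus a held-out validation set X_val, y_val for estimating generalization error):

# Post-pruning sketch: grow the full tree, then keep the pruned version
# with the best validation-set (generalization) accuracy.
from sklearn.tree import DecisionTreeClassifier

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Candidate pruning strengths, from no pruning to pruning to a single leaf.
alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in alphas),
    key=lambda tree: tree.score(X_val, y_val))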
Rule-Based Classifier
A rule-based classifier is a technique for classifying records using a collection of "if ... then ..." rules. The rules for the model are represented in a disjunctive normal form,

R = (r1 ∨ r2 ∨ ... ∨ rk),

where R is known as the rule set and the ri's are the classification rules or disjuncts. Each classification rule has the form

ri: (Condition_i) → yi.

The left-hand side of the rule is called the rule antecedent or precondition. The right-hand side of the rule is called the rule consequent, which contains the predicted class yi.
For example, in the vertebrate classification rule set: a lemur triggers rule R3, so it is classified as a mammal; a turtle triggers both R4 and R5; a dogfish shark triggers none of the rules.
Mutually Exclusive Rules: The rules in a rule set R are mutually exclusive if no two rules in R are triggered by the same record. This property ensures that every record is covered by at most one rule in R.
Exhaustive Rules: A rule set R has exhaustive coverage if there is a rule for each combination of attribute values. This property ensures that every record is covered by at least one rule in R.
Ordered Rules: In this approach, the rules in a rule set are ordered in decreasing order of their priority, which can be defined in many ways (e.g., based on accuracy, coverage, total description length, or the order in which the rules are generated). An ordered rule set is also known as a decision list. When a test record is presented, it is classified by the highest-ranked rule that covers the record. This avoids the problem of having conflicting classes predicted by multiple classification rules, as the sketch below illustrates.
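A decision list is easy to sketch as a first-match scan over ranked rules. The encoding below is an illustrative assumption modeled on the vertebrate rules (R1-R5) referenced above:

# Ordered rule set (decision list): each rule is (predicate, class label),
# listed in decreasing order of priority.
rules = [
    (lambda r: r["gives_birth"] == "no"  and r["aerial"]  == "yes", "bird"),
    (lambda r: r["gives_birth"] == "no"  and r["aquatic"] == "yes", "fish"),
    (lambda r: r["gives_birth"] == "yes" and r["warm_blooded"],     "mammal"),
    (lambda r: r["gives_birth"] == "no"  and r["aerial"]  == "no",  "reptile"),
    (lambda r: r["aquatic"] == "semi",                              "amphibian"),
]

def classify(record, default="unknown"):
    """Classify with the highest-ranked rule that covers the record."""
    for condition, label in rules:
        if condition(record):
            return label
    return default          # no rule triggered (rule set not exhaustive)

turtle = {"gives_birth": "no", "warm_blooded": False,
          "aerial": "no", "aquatic": "semi"}
print(classify(turtle))     # -> "reptile": R4 outranks R5 in the ordered list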
Rule-Ordering Schemes
Rule-based ordering: Individual rules are ranked by some measure of rule quality. This ordering scheme ensures that every test record is classified by the "best" rule covering it.
Class-based ordering: Rules that belong to the same class appear together in the rule set R. The rules are then collectively sorted on the basis of their class information.
Rule Evaluation: The quality of a classification rule r: A → y can be evaluated by measures such as coverage and accuracy:

Coverage(r) = |A| / |D|
Accuracy(r) = |A ∩ y| / |A|

where |A| is the number of records that satisfy the rule antecedent, |A ∩ y| is the number of records that satisfy both the antecedent and the consequent, and |D| is the total number of records.
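A direct translation of these two measures into Python (the records and rule are hypothetical):

def coverage_and_accuracy(records, labels, antecedent, consequent):
    """Coverage(r) = |A| / |D|; Accuracy(r) = |A and y| / |A|
    for a rule r: A -> y, with the antecedent given as a predicate."""
    covered = [y for rec, y in zip(records, labels) if antecedent(rec)]
    if not covered:
        return 0.0, 0.0
    correct = sum(1 for y in covered if y == consequent)
    return len(covered) / len(records), correct / len(covered)

# E.g., a hypothetical rule "Home Owner = yes -> Defaulted = no":
records = [{"home_owner": "yes"}, {"home_owner": "yes"}, {"home_owner": "no"}]
labels  = ["no", "no", "yes"]
print(coverage_and_accuracy(records, labels,
                            lambda r: r["home_owner"] == "yes", "no"))
# -> (0.666..., 1.0)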
Rule-based classifiers are generally used to produce descriptive models that are easier to interpret, but give comparable performance to the decision tree classifier.
Nearest-Neighbor Classifiers
Requires three things:
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x. To compute the distance between two points, a metric such as the Euclidean distance is used:

d(p, q) = sqrt( Σ_i (p_i - q_i)^2 )
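Putting the three ingredients together gives a minimal k-nearest-neighbor classifier (an illustrative sketch with hypothetical two-dimensional records):

import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two records."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(x, stored_X, stored_y, k=3):
    """Majority class among the k stored records nearest to x."""
    nearest = sorted(zip(stored_X, stored_y),
                     key=lambda pair: euclidean(x, pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

X = [(1, 1), (2, 1), (8, 9), (9, 8)]     # stored records
y = ["A", "A", "B", "B"]                 # their class labels
print(knn_classify((2, 2), X, y, k=3))   # -> "A"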
Lazy learners such as nearest-neighbor classifiers do not require model building. However,
classifying a test example can be quite expensive because we need to compute the proximity
values individually between the test and training examples.
Nearest-neighbor classifiers can produce arbitrarily shaped decision boundaries. Such boundaries
provide a more flexible model representation compared to decision tree and rule-based classifiers
that are often constrained to rectilinear decision boundaries.
Nearest-neighbor classifiers can produce wrong predictions unless the appropriate proximity
measure and data preprocessing steps are taken.
Bayesian Classifiers
Bayes' Theorem:
Bayes' theorem is a way to compute a conditional probability: the probability of an event happening, given that it has some relationship to one or more other events. For events X and Y with P(X) > 0,

P(Y|X) = P(X|Y) P(Y) / P(X)

Example:
– A doctor knows that meningitis causes a stiff neck 50% of the time.
– The prior probability of any patient having meningitis is 1/50,000.
– The prior probability of any patient having a stiff neck is 1/20.
If a patient has a stiff neck, what is the probability he/she has meningitis?
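Applying Bayes' theorem with the numbers above (M = meningitis, S = stiff neck):

P(M|S) = P(S|M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002

so even given a stiff neck, the probability of meningitis is only 0.02%.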
Suppose we are given a test record with the following attribute set:
X = (Home Owner = No, Marital Status = Married, Annual Income = $120K).
To classify the record, we need to compute the posterior probabilities P(Yes|X) and P(No|X) based on information available in the training data. If P(Yes|X) > P(No|X), then the record is classified as Yes; otherwise, it is classified as No.
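Under the naive Bayes assumption the class-conditional probability factors into a product, P(X|Y) = Π_i P(Xi|Y), so the comparison can be sketched as below. All probability values here are hypothetical placeholders, not estimates from the loan training set:

# Naive Bayes posterior comparison (hypothetical probabilities).
priors = {"Yes": 0.3, "No": 0.7}
cond = {   # P(attribute value | class)
    "Yes": {"HomeOwner=No": 1.00, "Married": 0.00, "Income=120K": 0.20},
    "No":  {"HomeOwner=No": 0.57, "Married": 0.57, "Income=120K": 0.10},
}
x = ["HomeOwner=No", "Married", "Income=120K"]

def score(y):
    """P(y) times the product of P(xi | y); proportional to P(y|X)."""
    s = priors[y]
    for v in x:
        s *= cond[y][v]
    return s

print(max(priors, key=score))   # -> "No", since P(Married|Yes) = 0

A zero conditional probability such as P(Married|Yes) = 0 nullifies the entire product; the m-estimate below is one standard remedy.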
M-estimate of Conditional Probability:

P(xi|yj) = (nc + m·p) / (n + m)

where n is the total number of instances from class yj, nc is the number of training examples from class yj that take on the value xi, m is a parameter known as the equivalent sample size, and p is a user-specified parameter (a prior estimate of the probability).
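A one-function sketch (the example numbers are illustrative):

def m_estimate(nc, n, p, m):
    """M-estimate of P(xi | yj): (nc + m*p) / (n + m)."""
    return (nc + m * p) / (n + m)

# With no matching training examples (nc = 0) out of n = 3, prior
# p = 1/3 and equivalent sample size m = 3, the estimate is 1/6, not 0:
print(m_estimate(nc=0, n=3, p=1/3, m=3))   # -> 0.1666...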
A Bayesian belief network (BBN), or simply, Bayesian network, provides a graphical representation of
the probabilistic relationships among a set of random variables. There are two key elements of a Bayesian
network:
1. A directed acyclic graph (dag) encoding the dependence relationships among a set of variables.
2. A probability table associating each node to its immediate parent nodes.
Consider three random variables, A, B, and C, in which A and B are independent variables and each has a
direct influence on a third variable, C.
The relationships among the variables can be summarized into the directed acyclic graph shown in Figure
5.12(a).
Each node in the graph represents a variable, and each arc asserts the dependence relationship between the
pair of variables. If there is a directed arc from X to Y, then X is the parent of Y and Y is the child of X.
Furthermore, if there is a directed path in the network from X to Z, then X is an ancestor of Z, while Z is a descendant of X. For example, in the diagram shown in Figure 5.12(b), A is a descendant of D and D is an ancestor of B. Both B and D are also non-descendants of A.
In the diagram shown in Figure 5.12(b), A is conditionally independent of both B and D given C because
the nodes for B and D are non-descendants of node A.
The conditional independence assumption made by a naive Bayes classifier can also be represented using a Bayesian network, as shown in Figure 5.12(c), where y is the target class and {X1, X2, ..., Xd} is the attribute set.
Besides the conditional independence conditions imposed by the network topology, each node is also
associated with a probability table.
1. If a node X does not have any parents, then the table contains only the prior probability P(X).
2. If a node X has only one parent, Y, then the table contains the conditional probability P(X|Y).
3. If a node X has multiple parents, {Y1, Y2, ..., Yk}, then the table contains the conditional probability P(X|Y1, Y2, ..., Yk).
EXAMPLE:
You have a new burglar alarm installed at home.
It is fairly reliable at detecting burglary, but also sometimes responds to minor earthquakes.
You have two neighbors, Ali and Veli, who promised to call you at work when they hear the alarm.
Ali always calls when he hears the alarm, but sometimes confuses telephone ringing with the alarm
and calls too.
Veli likes loud music and sometimes misses the alarm.
Given the evidence of who has or has not called, we would like to estimate the probability of a
burglary.
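The notes do not reproduce this network's probability tables, so the sketch below fills them in with the standard illustrative values commonly used for the alarm example (an assumption, not figures from the source) and estimates P(Burglary | both neighbors call) by enumeration:

from itertools import product

P_B = {True: 0.001, False: 0.999}                  # P(Burglary)
P_E = {True: 0.002, False: 0.998}                  # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,    # P(Alarm = true | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_ALI  = {True: 0.90, False: 0.05}                 # P(Ali calls  | Alarm)
P_VELI = {True: 0.70, False: 0.01}                 # P(Veli calls | Alarm)

def joint(b, e, a, ali, veli):
    """Joint probability factored along the directed acyclic graph."""
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p_ali = P_ALI[a] if ali else 1 - P_ALI[a]
    p_veli = P_VELI[a] if veli else 1 - P_VELI[a]
    return P_B[b] * P_E[e] * pa * p_ali * p_veli

# Sum out the hidden variables (Earthquake, Alarm) to get the posterior.
num = sum(joint(True, e, a, True, True) for e, a in product([True, False], repeat=2))
den = sum(joint(b, e, a, True, True) for b, e, a in product([True, False], repeat=3))
print(round(num / den, 3))   # P(Burglary | Ali and Veli both call) ~ 0.284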
Characteristics of BBN
Following are some of the general characteristics of the BBN method:
• BBN provides an approach for capturing the prior knowledge of a particular domain using a graphical model. The network can also be used to encode causal dependencies among variables.
• Constructing the network can be time consuming and requires a large amount of effort. However, once the structure of the network has been determined, adding a new variable is quite straightforward.
• Bayesian networks are well suited to dealing with incomplete data. Instances with missing attributes can be handled by summing or integrating the probabilities over all possible values of the attribute.
• Because the data is combined probabilistically with prior knowledge, the method is quite robust to model overfitting.