Classification & Decision Trees
Classification
Classification is the task of assigning objects to one of several predefined categories. It is a pervasive
problem that encompasses many diverse applications. Examples of classification are given below.
1. Detecting spam e-mail messages based on the header and content
Descriptive Modeling
A classification model can serve as an explanatory tool to distinguish between objects of different classes. For example,
it would be useful for both biologists and others to have a descriptive model that can summarize
the data given below.
Predictive Modeling
A classification model can also be used to predict the class label of unknown records. It can be
treated as a black box that automatically assigns a class label when presented with the attribute set of an
unknown record.
Classification techniques are most suited for predicting or describing datasets with binary or nominal
categories. They are less effective for ordinal categories because they do not consider the implicit
ordering among the categories.
3. Nearest Neighbor Classifier
5. Bayesian Classifier
Each of the above techniques applies a learning algorithm to identify the model that best fits the rela-
tionship between the attribute set and the class label of the input data. A key objective of the learning algorithm is to
build models with good generalization capability, i.e. models that accurately predict the class labels of
previously unseen records.
Figure: a learning algorithm builds a model from the training set (induction); the model is then applied to the test set (deduction).
A training set consists of records whose class labels are known; it is used to build the classification
model. The model is then applied to a test set, which consists of records with unknown class labels.
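This train/apply workflow can be sketched in a few lines; the example below assumes scikit-learn is available and uses a built-in dataset purely as a stand-in for the labeled records described above.

# A minimal sketch of the train/apply workflow (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                 # stand-in for a labeled dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # build the model from the training set
y_pred = model.predict(X_test)                            # apply the model to the test set
print("test accuracy:", (y_pred == y_test).mean())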
Table 2: Confusion matrix

                              Predicted Class
                              Class 1    Class 0
Actual Class      Class 1     f_{11}     f_{10}
                  Class 0     f_{01}     f_{00}
In the above table, f_{ij} indicates the number of records from class i predicted to be of class j. Based on
the entries of the confusion matrix, the total number of correct predictions made by the model is
f_{11} + f_{00}, and the total number of wrong predictions is f_{10} + f_{01}.
The confusion matrix provides the information needed to determine how well a classification model performs.
Based on this information, we can define performance measures to compare
the performance of different classification models.
\[
\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}
\]
\[
\text{Error rate} = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}}
\]
\[
\text{Sensitivity (true positive rate)} = \frac{\text{Number of true positives}}{\text{Total number of actual positives}} = \frac{f_{11}}{f_{11} + f_{10}}
\]
\[
\text{Specificity (true negative rate)} = \frac{\text{Number of true negatives}}{\text{Total number of actual negatives}} = \frac{f_{00}}{f_{01} + f_{00}}
\]
Most classification algorithms seek models that attain the highest accuracy or, equivalently, the lowest error
rate when applied to the test set.
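As a concrete illustration, the short sketch below computes these measures from the four counts of a 2 x 2 confusion matrix; the counts themselves are hypothetical and serve only to show the arithmetic.

# Hypothetical counts of a 2x2 confusion matrix:
# f11, f10 = records of actual class 1 predicted as 1 / as 0
# f01, f00 = records of actual class 0 predicted as 1 / as 0
f11, f10 = 40, 10
f01, f00 = 5, 45

total = f11 + f10 + f01 + f00
accuracy = (f11 + f00) / total
error_rate = (f10 + f01) / total
sensitivity = f11 / (f11 + f10)   # true positive rate
specificity = f00 / (f01 + f00)   # true negative rate

print(f"accuracy = {accuracy:.3f}, error rate = {error_rate:.3f}")
print(f"sensitivity = {sensitivity:.3f}, specificity = {specificity:.3f}")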
1. Is the species cold-blooded or warm-blooded?
2. Do the females of the species give birth?
A series of questions and their possible answers can be organized in the form of a hierarchical structure
consisting of nodes and directed edges. This hierarchical structure is known as a decision tree.
Figure: a decision tree whose root node tests Body Temperature (Cold / Warm); the Warm branch leads to a further test with outcomes No / Yes. The nodes are numbered (1) to (5) and referred to below.
1. Root Node: This node has no incoming edges and zero or more outgoing edges. In the
above tree, node (1) is the root node.
2. Internal Node: This node has exactly one incoming edge and two or more outgoing edges. In
the above tree, node (2) is an internal node.
3. Leaf/Terminal Node: This node has exactly one incoming edge and no outgoing edges. In the
above tree, nodes (3), (4) and (5) are leaf nodes. Each leaf node is assigned a class label.
The non-terminal nodes, which include the root and other internal nodes, contain attribute test conditions
to separate records that have different characteristics.
Hunt's Algorithm
In Hunt's algorithm, a decision tree is grown in a recursive fashion by partitioning the training records into
successively purer subsets. Let Dt denote the set of training records associated with node t and let
y = {y1 , . . . , yc } be the class labels. The algorithm consists of the two steps given below.
1. If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt.
2. If Dt contains records that belong to more than one class, an attribute test condition is selected
to partition the records into smaller subsets. A child node is created for each outcome of the test
condition, and the records in Dt are distributed to the children based on the outcomes. The algorithm
is then recursively applied to each child node.
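A minimal sketch of this two-step recursion on categorical attributes is given below; the attribute-selection rule here is deliberately naive (it simply takes the next unused attribute), since the criteria for choosing good test conditions are discussed later.

from collections import Counter

def hunt(records, labels, attributes):
    """Grow a decision tree following Hunt's two steps.
    records: list of dicts mapping attribute name -> categorical value."""
    # Step 1: all records belong to one class -> leaf node with that label.
    if len(set(labels)) == 1:
        return labels[0]
    # No attribute left to test -> leaf node with the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 2: select a test attribute (naively, the next one in the list),
    # create a child for each outcome and recurse on each subset.
    attr = attributes[0]
    tree = {attr: {}}
    for value in set(r[attr] for r in records):
        pairs = [(r, l) for r, l in zip(records, labels) if r[attr] == value]
        tree[attr][value] = hunt([r for r, _ in pairs],
                                 [l for _, l in pairs],
                                 attributes[1:])
    return tree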
Example 1. Let us consider the following loan defaulter dataset. Based on this training dataset, construct
a decision tree for predicting borrowers who will default on loan payments.
Solution. Based on the above dataset, we can grow a decision tree model in the following way.
Figure 2 (Step-I): a single leaf node labeled Defaulted = No.
Figure 3 (Step-II): a split on Home Owner (Yes / No); both child nodes are labeled Defaulted = No.
Figure 4 (Step-III): the Home Owner = No branch is further split on Marital Status (Married vs. Single/Divorced), with leaves Defaulted = No and Defaulted = Yes respectively.
Figure 5 (Step-IV): the Single/Divorced branch is further split on Annual Income (< 80K / > 80K), with leaves Defaulted = No and Defaulted = Yes respectively.
2. It is possible for some of the child nodes created in the second step to be empty, i.e., no records are
associated with these nodes. This can happen if none of the training records has the combination
of attribute values associated with such a node. In this case, the node is declared a leaf node with
the same class label as the majority class of training records associated with its parent node.
3. In the second step, if all records associated with Dt have identical attribute values (except for the class label), then it is not possible to
split these records any further. In this case, the node is declared a leaf node with the same class
label as the majority class of training records associated with this node.
Two design issues arise in decision tree induction: how the training records should be split, and when the splitting procedure should stop. To deal with the first issue, we need to specify a test condition for each attribute type, while to deal
with the second issue, we need a condition to stop the tree-growing process. A possible strategy is to continue the
tree-growing process until either all records belong to the same class or all records have identical attribute
values.
Binary Attribute
Figure: the test condition for a binary attribute has two outcomes (Yes / No), each leading to its own decision.
Nominal Attribute
Nominal attribute can have more than one split. An example of nominal attribute split is given in the
figure. Nominal attribute can also be converted to binary attribute by two way split. In example given
below, we can keep single and divorcee in one category and Married in other category.
Figure: a split on Marital Status involving the values Single, Divorced, and Married.
Ordinal Attribute
Like nominal attribute, ordinal attribute can also have more than one split. An example of ordinal
attribute is given in the figure. Ordinal attributes have inherent ordering between categories.
Figure: a multiway split on Shirt Size with outcomes Small, Medium, Large, X Large, and XX Large.
Ordinal attributes can also produce binary splits: the attribute values can be grouped into two subsets as long as the grouping does not violate
the order property of the attribute values. An example of a binary split of the above attribute is given below.
Figure: a binary split on Shirt Size obtained by grouping the ordered sizes into two subsets.
Continuous Attribute
A continuous attribute can have binary or multi-way split. For continuous attributes, the test condition
can be expressed by comparison test A < υ or A > υ with binary outcome. It can also be splitted by
comparing a range query with outcome υi ≤ A < υi + l for i = 1, . . . , k. For multi-way split algorithm,
one must consider all possible ranges of continuous variable.
Figure: a multiway split on Annual Income into ranges such as 10K-20K, 20K-50K, and more than 80K.
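The two styles of test condition for a continuous attribute can be written down directly; in the sketch below the 80K threshold and the range boundaries are illustrative choices rather than values fixed by the text.

# Binary split: compare against a single threshold v.
def binary_outcome(income_k, v=80):
    return "A < v" if income_k < v else "A >= v"

# Multiway split: range queries of the form v_i <= A < v_{i+1}.
def range_outcome(income_k, boundaries=(10, 20, 50, 80)):
    if income_k < boundaries[0]:
        return f"< {boundaries[0]}K"
    for low, high in zip(boundaries, boundaries[1:]):
        if low <= income_k < high:
            return f"{low}K-{high}K"
    return f">= {boundaries[-1]}K"

print(binary_outcome(95))   # A >= v
print(range_outcome(35))    # 20K-50K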
The attribute test condition at each node is usually chosen using a measure of the impurity of the resulting child nodes. Widely used impurity measures for a node t include
\[
\text{Entropy}(t) = -\sum_{i=0}^{c-1} p(i \mid t)\, \log_2 p(i \mid t),
\]
\[
\text{Gini}(t) = 1 - \sum_{i=0}^{c-1} \bigl[p(i \mid t)\bigr]^2,
\]
\[
\text{Classification Error}(t) = 1 - \max_i \bigl[p(i \mid t)\bigr],
\]
where c is the number of classes and p(i | t) denotes the fraction of records belonging to class i at node t.
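All three measures can be computed directly from the class proportions at a node; the sketch below does so for an illustrative class distribution.

import math

def impurity(class_counts):
    """Return (entropy, gini, classification_error) for one node,
    where class_counts[i] is the number of records of class i at the node."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    gini = 1 - sum(p * p for p in probs)
    error = 1 - max(probs)
    return entropy, gini, error

# Illustrative node with 3 records of class 0 and 7 records of class 1.
print(impurity([3, 7]))   # approximately (0.881, 0.420, 0.300)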
1. The createNode() function extends the decision tree by creating a new node. A node in the
decision tree has either a test condition, denoted as node.test_cond, or a class label, denoted as
node.label.
Algorithm 1 Algorithm for decision tree induction
TreeGrowth(E, F)   {E: set of training records, F: set of attributes}
1: if stopping_cond(E, F) = true then
2:   leaf = createNode()
3:   leaf.label = classify(E)
4:   return leaf
5: else
6:   root = createNode()
7:   root.test_cond = find_best_split(E, F)
8:   let V = {v | v is a possible outcome of root.test_cond}
9:   for each v ∈ V do
10:    E_v = {e | root.test_cond(e) = v and e ∈ E}
11:    child = TreeGrowth(E_v, F)
12:    add child as descendant of root and label the edge (root → child) as v
13:  end for
14: end if
15: return root
2. The find_best_split() function determines which attribute should be selected as the test condition
for splitting the training records. As previously noted, the choice of test condition depends on which
impurity measure is used to determine the goodness of a split; some widely used measures are entropy,
the Gini index and the χ2 statistic. (A small sketch of such a selection is given after this list.)
3. The classify() function determines the class label to be assigned to a leaf node. For each leaf node t,
let p(i|t) denote the fraction of training records from class i associated with node t. In most
cases, the leaf node is assigned to the class that has the majority of training records:
\[
\text{leaf.label} = \operatorname*{argmax}_{i}\; p(i \mid t),
\]
where the argmax operator returns the argument i that maximizes p(i|t).
4. The stopping_cond() function is used to terminate the tree-growing process by testing whether
all the records have either the same class label or the same attribute values.
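To make the role of find_best_split() concrete, the sketch below selects the attribute whose multiway split yields the largest decrease in Gini impurity; the helper names and the tiny dataset are illustrative and not part of the pseudocode above.

from collections import Counter

def gini(labels):
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

def find_best_split(records, labels, attributes):
    """Pick the attribute whose split gives the lowest weighted child impurity."""
    parent = gini(labels)
    best_attr, best_gain = None, -1.0
    for attr in attributes:
        children = {}
        for r, l in zip(records, labels):
            children.setdefault(r[attr], []).append(l)
        weighted = sum(len(ls) / len(labels) * gini(ls) for ls in children.values())
        if parent - weighted > best_gain:
            best_attr, best_gain = attr, parent - weighted
    return best_attr, best_gain

# Tiny illustrative dataset with a single candidate attribute.
records = [{"HomeOwner": "Yes"}, {"HomeOwner": "No"}, {"HomeOwner": "No"}]
labels = ["No", "Yes", "No"]
print(find_best_split(records, labels, ["HomeOwner"]))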
After building the decision tree, a tree-pruning step can be performed to reduce the size of the decision
tree. Decision trees that are too large are susceptible to a phenomenon known as overfitting.
Example 2. Suppose we are given the following probabilities:
\[
P(Y = 0) = 0.65, \qquad P(Y = 1) = 0.35,
\]
\[
P(X = 1 \mid Y = 1) = 0.75, \qquad P(X = 1 \mid Y = 0) = 0.30.
\]
By Bayes' theorem, the posterior probability of Y = 1 given X = 1 is
\[
P(Y = 1 \mid X = 1) = \frac{P(X = 1 \mid Y = 1)\, P(Y = 1)}{P(X = 1)}.
\]
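Substituting these values, with the denominator P(X = 1) expanded by the law of total probability, gives
\[
P(X = 1) = P(X = 1 \mid Y = 1)P(Y = 1) + P(X = 1 \mid Y = 0)P(Y = 0) = 0.75 \times 0.35 + 0.30 \times 0.65 = 0.4575,
\]
\[
P(Y = 1 \mid X = 1) = \frac{0.2625}{0.4575} \approx 0.574.
\]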
Using Bayes Theorem for Classification
Let us consider X as the attribute set and Y as the class variable. If the class variable has a non-deterministic
relationship with the attributes, then we can treat X and Y as random variables and capture their relation-
ship probabilistically using Bayes' theorem. The conditional probability P(Y | X) is known as the posterior
probability of Y given X, as opposed to its prior probability P(Y).
During the training phase, we need to learn the posterior probability P(Y | X) for every combination of
X and Y based on the information gathered from the training data.
Knowing these probabilities, a test record X can be classified by finding the class Y' that maximizes
the posterior probability P(Y' | X).
Now let us consider the loan default data of Example 1 and let
X = (Home Owner = No, Marital Status = Married, Annual Income = 120K).
To classify the record, we need to compute the posterior probabilities P(Yes | X) and P(No | X) based
on the information available in the training data. If P(Yes | X) > P(No | X), then the record is classified
as Yes; otherwise it is classified as No.
Estimating the posterior probabilities accurately for every combination of class label and attribute
values is a difficult task because it requires a very large training set even for a moderate number of attributes.
Bayes' theorem is useful because it allows us to express the posterior probability in terms of the prior
probability P(Y) and the class-conditional probability P(X | Y). It can be written as
\[
P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}.
\]
When comparing the posterior probabilities for different values of Y, the denominator P(X) is constant
and can therefore be ignored. P(Y) can be easily estimated from the training set by computing the fraction
of training records that belong to each class. Further, P(X | Y) can be calculated using the two methods
given below.
The vector X consists of d attributes X1, . . . , Xd. Let X, Y and Z be three random variables. X
is said to be conditionally independent of Y given Z if the following condition holds:
\[
P(X \mid Y, Z) = P(X \mid Z).
\]
This condition also implies that
\[
P(X, Y \mid Z) = \frac{P(X, Y, Z)}{P(Z)}
= \frac{P(X, Y, Z)}{P(Y, Z)} \times \frac{P(Y, Z)}{P(Z)}
= P(X \mid Y, Z)\, P(Y \mid Z)
= P(X \mid Z)\, P(Y \mid Z).
\]
Assuming the attributes X1, . . . , Xd are conditionally independent given the class label Y, the posterior probability for each class can be written as
\[
P(Y \mid X) = \frac{P(Y) \prod_{i=1}^{d} P(X_i \mid Y)}{P(X)}.
\]
Note that the denominator P(X) is fixed for every class Y. Hence, we only need to calculate the numerator for
each class label.
Example 3. In Example 1, assume that Y = Defaulted Borrower and the remaining variables are the features
X. Based on this information, find P(Y = Yes | X) and P(Y = No | X) if
X = (Home Owner = No, Marital Status = Married, Annual Income = 120K).
Solution. Using the table and the assumptions [AI | Yes] ∼ N(90, 25) and [AI | No] ∼ N(110, 2975), i.e. normal
distributions with the stated mean and variance of Annual Income within each class, we can obtain the following probabilities:
P(Y = Yes) = 3/10
P(Y = No) = 7/10
P(HO = No | Y = No) = 4/7
P(HO = No | Y = Yes) = 1
P(MS = Married | Y = Yes) = 0
P(MS = Married | Y = No) = 4/7
P(AI = 120K | Y = No) = 0.0072
P(AI = 120K | Y = Yes) = 1.2 × 10^-9
Based on the above information, we can write P(X | Y = Yes) and P(X | Y = No) as products of the individual conditional probabilities and compare the resulting posteriors.
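As a rough sketch of that arithmetic (using exactly the probabilities listed above, so the result is only as reliable as those estimates), the snippet below computes the unnormalized posterior for each class:

# Probabilities taken from the solution above.
prior = {"Yes": 3 / 10, "No": 7 / 10}
likelihood = {
    "Yes": {"HO=No": 1.0, "MS=Married": 0.0, "AI=120K": 1.2e-9},
    "No":  {"HO=No": 4 / 7, "MS=Married": 4 / 7, "AI=120K": 0.0072},
}

score = {}
for y in prior:
    p_x_given_y = 1.0
    for value in likelihood[y].values():
        p_x_given_y *= value              # naive Bayes: P(X|Y) is a product of per-attribute terms
    score[y] = prior[y] * p_x_given_y     # numerator of P(Y|X); P(X) is the same for both classes

print(score)                                            # {'Yes': 0.0, 'No': ~0.0016}
print("predicted class:", max(score, key=score.get))    # 'No'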