Classification & Decision Tree (2)
Classification
Classification, the task of assigning objects to one of several predefined categories, is a pervasive
problem that encompasses many diverse applications. Examples of classification are given below.
Definition 1. Classification is the task of learning a target function $f$ that maps each attribute set $X$
to one of the predefined class labels $Y$.
The target function is also known as a classification model, and it is useful for descriptive as well as
predictive modeling.
Descriptive Modeling
A classification model can serve as an explanatory tool to distinguish between different classes. For example,
it would be useful for both biologists and others to have a descriptive model that can summarize
the data given below.
Predictive Modeling
A classification model can also be used to predict the class label of unknown records. It can be
treated as a black box that automatically assigns a class label when presented with the attribute set of
an unknown record.
Classification techniques are most suited for predicting or describing datasets with binary or nominal
categories. They are less effective for ordinal categories because they do not consider the implicit
ordering among the categories.
Commonly used classification techniques include:
1. Decision Tree Classifier
2. Rule Based Classifier
3. Nearest Neighbor Classifier
4. Support vector Machine (SVM) Classifier
5. Bayesian Classifier
6. Neural Network Based Classifier
Each of the above techniques applies a learning algorithm to identify the model that best fits the
relationship between the attribute set and the class label of the input data. A key objective of the learning
algorithm is to build models with good generalization capability, i.e., models that accurately predict the
class labels of previously unseen records.
A training set consists of records whose class labels are known; it is used to build the classification
model. The model is then applied to a test set, which consists of records with unknown class labels.
In the above table, $f_{ij}$ indicates the number of records from class $i$ predicted to be of class $j$. Based on
the entries of the confusion matrix, the total number of correct predictions made by the model is
$f_{11} + f_{00}$.
Similarly, the total number of wrong predictions is
$f_{10} + f_{01}$.
The confusion matrix provides the information needed to determine how well the classification model performs.
Based on the information provided by the confusion matrix, we can define performance measures to compare
the performance of different classification models.
\[
\text{Accuracy} = \frac{\text{No. of correct predictions}}{\text{Total no. of predictions}} = \frac{f_{11} + f_{00}}{f_{11} + f_{01} + f_{10} + f_{00}}
\]
\[
\text{Error Rate} = \frac{\text{No. of wrong predictions}}{\text{Total no. of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{01} + f_{10} + f_{00}}
\]
\[
\text{Positive Predictive Value} = \frac{\text{No. of true positives}}{\text{Total no. of positives}} = \frac{f_{11}}{f_{11} + f_{10}}
\]
\[
\text{Negative Predictive Value} = \frac{\text{No. of true negatives}}{\text{Total no. of negatives}} = \frac{f_{00}}{f_{01} + f_{00}}
\]
Most classification algorithms seek models that attain the highest accuracy or, equivalently, the lowest error
rate.
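To make these measures concrete, the following short Python sketch evaluates them from the four counts of a two-class confusion matrix. The function name and the example counts are illustrative choices, not part of the original text; the formulas follow the definitions above.

```python
# A minimal sketch: performance measures from a 2x2 confusion matrix.
# f11, f10, f01, f00 follow the notation above (f_ij = records of class i predicted as class j).

def confusion_metrics(f11, f10, f01, f00):
    total = f11 + f01 + f10 + f00
    accuracy = (f11 + f00) / total        # correct predictions / total predictions
    error_rate = (f10 + f01) / total      # wrong predictions / total predictions
    ppv = f11 / (f11 + f10)               # positive predictive value, as defined above
    npv = f00 / (f01 + f00)               # negative predictive value, as defined above
    return accuracy, error_rate, ppv, npv

# Hypothetical counts: 40 of class 1 predicted as 1, 5 as 0; 10 of class 0 predicted as 1, 45 as 0.
print(confusion_metrics(f11=40, f10=5, f01=10, f00=45))
```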
A series of questions and their possible answers can be organized in the form of a hierarchical structure
consisting of nodes and edges. This hierarchical structure is known as a decision tree.
Figure 2: A decision tree
1. Root Node: This node has no incoming edges and zero or more outgoing edges. In the
above tree, (1) is the root node.
2. Internal Node: This node has exactly one incoming edge and two or more outgoing edges. In
the above tree, (2) is an internal node.
3. Leaf/Terminal Node: This node has exactly one incoming edge and no outgoing edges. In the
above tree, (3), (4) and (5) are leaf nodes. A leaf node is always assigned a class label.
The non-terminal nodes, which include the root and other internal nodes, contain attribute test conditions
to separate records that have different characteristics.
Hunt's Algorithm
In this algorithm, a decision tree is grown in a recursive fashion by partitioning the training records into
successively purer subsets. Let $D_t$ be the set of training records associated with node $t$ and
$y = \{y_1, \ldots, y_c\}$ be the class labels. The algorithm proceeds in the two steps given below.
1. If all records in $D_t$ belong to the same class $y_t$, then $t$ is declared a leaf node labeled as $y_t$.
2. If $D_t$ contains records that belong to more than one class, an attribute test condition is selected
to partition the records into smaller subsets. A child node is created for each outcome of the test
condition, and the records in $D_t$ are distributed to the children based on the outcomes. The algorithm
is then recursively applied to each child node.
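As an illustration of the two steps, here is a minimal, self-contained Python sketch of the recursion on a toy dataset. The record format (dictionaries with a "class" key) and the attribute-selection rule (simply taking the first unused attribute) are simplifying assumptions made for this example; a real implementation would choose the test condition using an impurity measure, as discussed later in this section.

```python
# A toy sketch of Hunt's two-step recursion (the attribute choice here is a placeholder).

def hunt(records, attributes):
    labels = {r["class"] for r in records}
    # Step 1: if all records in Dt belong to one class (or no attributes remain), make a leaf.
    if len(labels) == 1 or not attributes:
        return max(labels, key=lambda y: sum(r["class"] == y for r in records))
    # Step 2: select an attribute test condition, partition Dt by its outcomes, and recurse.
    attr, rest = attributes[0], attributes[1:]
    partitions = {}
    for r in records:
        partitions.setdefault(r[attr], []).append(r)
    return {attr: {value: hunt(subset, rest) for value, subset in partitions.items()}}

toy = [
    {"home_owner": "yes", "marital_status": "single",  "class": "no"},
    {"home_owner": "no",  "marital_status": "married", "class": "no"},
    {"home_owner": "no",  "marital_status": "single",  "class": "yes"},
]
print(hunt(toy, ["home_owner", "marital_status"]))
```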
Example 1. Let us consider the following loan defaulter dataset. Based on this training dataset, construct
a decision tree for predicting which borrowers will default on their loan payments.
Solution. Based on the above dataset, we can grow a decision tree model in the following way.
Table 4: Steps of the decision tree growth algorithm on the training dataset
2. It is possible for some of the child nodes created in the second step to be empty, i.e., no records are
associated with these nodes. This can happen if none of the training records have the combination
of attribute values associated with such nodes. In this case, the node is declared a leaf node with
the same class label as the majority class of training records associated with its parent node.
3. In the second step, if all records associated with $D_t$ have identical attribute values (except for the
class label), then it is not possible to split these records any further. In this case, the node is declared
a leaf node with the same class label as the majority class of training records associated with this node.
A learning algorithm for inducing decision trees must address the following two issues.
1. How should the training records be split?
2. How should the splitting procedure stop?
To deal with the first issue, we need a test condition for each attribute type, while to deal
with the second issue, we need a condition to stop the tree-growing process. A possible strategy is to continue
growing the tree until either all records belong to the same class or all records have identical attribute
values.
Nominal Attribute
A nominal attribute can produce a multi-way split, with one outcome for each distinct value. An example of a
nominal attribute split is given in the figure. A nominal attribute can also be converted to a binary split by
grouping its values into two subsets. In the example given below, we can keep Single and Divorced in one
category and Married in the other category.
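As a small illustration of this grouping, the sketch below encodes the two-way split on the marital status attribute; the function name and category labels mirror the example above and are purely illustrative.

```python
# A minimal sketch: binary (two-way) grouping of a nominal attribute.

def marital_status_branch(value):
    # One branch holds {Single, Divorced}; the other holds {Married}.
    return "left" if value in {"Single", "Divorced"} else "right"

for v in ("Single", "Married", "Divorced"):
    print(v, "->", marital_status_branch(v))
```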
Ordinal Attribute
Like nominal attributes, ordinal attributes can also produce multi-way splits. An example of an ordinal
attribute split is given in the figure. Ordinal attributes have an inherent ordering among their categories.
Ordinal attributes can also produce binary splits. The attribute values can be grouped as long as the grouping
does not violate the order property of the attribute values. An example of a binary split of the above attribute is given below.
Continuous Attribute
A continuous attribute can have a binary or a multi-way split. For a binary split, the test condition
can be expressed as a comparison test $(A < v)$ or $(A \ge v)$ with two outcomes. For a multi-way split, the test
condition can be expressed as a range query with outcomes of the form $v_i \le A < v_{i+1}$ for $i = 1, \ldots, k$;
in this case, the algorithm must consider all possible ranges of the continuous variable.
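The two forms of test condition for a continuous attribute can be sketched as follows; the threshold v and the bin edges are arbitrary values chosen for illustration.

```python
# A minimal sketch of test conditions on a continuous attribute A (e.g. annual income).

def binary_test(a, v=80.0):
    """Comparison test with two outcomes: A < v or A >= v."""
    return "A < v" if a < v else "A >= v"

def range_test(a, edges=(0, 25, 50, 75, 100)):
    """Range query: outcome i when edges[i] <= A < edges[i+1]."""
    for i in range(len(edges) - 1):
        if edges[i] <= a < edges[i + 1]:
            return i
    return len(edges) - 2  # values at or beyond the last edge fall in the final bin

print(binary_test(60.0), range_test(60.0))
```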
For a two-class problem, the class distribution at any node can be written as $(p_0, p_1)$, where $p_0 = 1 - p_1$.
The measures developed for selecting the best split are often based on the degree of impurity of the child
nodes. The smaller the degree of impurity, the more skewed the class distribution. Entropy, the Gini index, and
classification error are a few important measures of impurity. These can be expressed as
\[
\text{Entropy}(t) = -\sum_{i=0}^{c-1} p(i|t)\,\log p(i|t)
\]
\[
\text{Gini}(t) = 1 - \sum_{i=0}^{c-1} \big[p(i|t)\big]^2
\]
\[
\text{Classification Error}(t) = 1 - \max_{i}\,\big[p(i|t)\big]
\]
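As a quick numerical check of these formulas, the short sketch below evaluates all three impurity measures for a node's class distribution; the example distribution (0.4, 0.6) is an arbitrary choice, and a base-2 logarithm is assumed for the entropy.

```python
import math

# Impurity measures for a node t, given p = [p(0|t), ..., p(c-1|t)].

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)  # 0*log(0) is taken as 0

def gini(p):
    return 1 - sum(pi ** 2 for pi in p)

def classification_error(p):
    return 1 - max(p)

p = [0.4, 0.6]  # example two-class distribution (p0, p1)
print(entropy(p), gini(p), classification_error(p))  # ~0.971, 0.48, 0.4
```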
A skeleton decision tree induction algorithm, TreeGrowth(E, F), is given below; its inputs are the set of training records E and the attribute set F.
1. if stopping_cond(E, F) = true then
2.     leaf = createNode()
3.     leaf.label = Classify(E)
4.     return leaf
5. else
6.     root = createNode()
7.     root.test_cond = find_best_split(E, F)
8.     let V = {v | v is a possible outcome of root.test_cond}
9.     for each v ∈ V do
10.        E_v = {e | root.test_cond(e) = v and e ∈ E}
11.        child = TreeGrowth(E_v, F)
12.        add child as descendant of root and label the edge (root → child) as v
13.    end for
14. end if
15. return root
1. The createNode() function extends the decision tree by creating a new node. A node in the
decision tree has either a test condition, denoted as node.test_cond, or a class label, denoted as
node.label.
2. The find_best_split() function determines which attribute should be selected as the test condition
for splitting the training records. As previously noted, the choice of test condition depends on which
impurity measure is used to determine the goodness of a split. Some widely used measures include entropy,
the Gini index, and the $\chi^2$ statistic.
3. The Classify() function determines the class label to be assigned to a leaf node. For each leaf node $t$,
let $p(i|t)$ denote the fraction of training records from class $i$ associated with node $t$. In most
cases, the leaf node is assigned to the class that has the majority of training records:
\[
\text{leaf.label} = \operatorname{argmax}_{i}\; p(i|t),
\]
where the argmax operator returns the argument $i$ that maximizes the expression $p(i|t)$.
4. The stopping_cond() function is used to terminate the tree-growing process by testing whether
all records have either the same class label or the same attribute values.
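To make the roles of these four functions concrete, here is one possible Python rendering. The record format (dictionaries with a "class" key) and the use of the Gini index inside find_best_split are assumptions made for this sketch, not prescriptions from the original text.

```python
from collections import Counter

# A sketch of the four subroutines used by the skeleton algorithm above.
# Records are assumed to be dicts of attribute values with a "class" key.

def createNode():
    # A node holds either a test condition (test_cond) or a class label (label).
    return {"test_cond": None, "label": None, "children": {}}

def Classify(records):
    # Majority vote: label = argmax_i p(i|t).
    return Counter(r["class"] for r in records).most_common(1)[0][0]

def gini(records):
    counts = Counter(r["class"] for r in records)
    return 1 - sum((c / len(records)) ** 2 for c in counts.values())

def find_best_split(records, attributes):
    # Pick the attribute whose split yields the lowest weighted Gini impurity.
    def weighted_gini(attr):
        groups = {}
        for r in records:
            groups.setdefault(r[attr], []).append(r)
        return sum(len(g) / len(records) * gini(g) for g in groups.values())
    return min(attributes, key=weighted_gini)

def stopping_cond(records, attributes):
    # Stop when all records share one class label or no attributes remain to split on.
    return len({r["class"] for r in records}) == 1 or not attributes
```

With these in place, the recursion in the skeleton above either builds a leaf via Classify() when stopping_cond() holds, or splits the records according to find_best_split() and recurses on each outcome.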
After building the decision tree, a tree-pruning step can be performed to reduce the size of the decision
tree. Decision trees that are too large are susceptible to a phenomenon known as overfitting.