
Classification
Classification is the task of assigning objects to one of several predefined categories. It is a pervasive problem that encompasses many diverse applications. Examples of classification are given below.

1. Detecting spam e-mail messages based on the header and content

2. Categorizing cells as benign or malignant based upon the results of an MRI scan


Definition 1. Classification is the task of learning a target function f that maps each attribute set X to one of the predefined class labels Y.
The target function is also known as a classification model, and it is useful for descriptive as well as predictive modeling.

Descriptive Modeling
A classification model can serve as an explanatory tool to distinguish between objects of different classes. For example, it would be useful for both biologists and others to have a descriptive model that summarizes the data given below.

Table 1: Vertebrate Data Set

Name   | Body Temp | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates | Class Label
Human  | Warm      | Hair       | Yes         | No               | No              | Yes      | No         | Mammal
Python | Cold      | Scales     | No          | No               | No              | No       | Yes        | Reptile

Predictive Modeling
A classification model can also be used to predict the class label of unknown records. It can be treated as a black box that automatically assigns a class label when presented with the attribute set of an unknown record.
Classification techniques are most suited for predicting or describing data sets with binary or nominal categories. They are less effective for ordinal categories because they do not consider the implicit ordering among the categories.

General approach to solving a classification problem


A classification technique is a systematic approach for building a classification model from a training data set. Some examples of classification techniques are

1. Decision Tree Classifier

2. Rule Based Classifier

3. Nearest Neighbor Classifier

4. Support vector Machine (SVM) Classifier

5. Bayesian Classifier

6. Neural Network Based Classifier

Each of the above techniques applies a learning algorithm to identify the model that best fits the relationship between the attribute set and the class label of the input data. A key objective of the learning algorithm is to build models with good generalization capability, i.e., models that accurately predict the class labels of previously unseen records.

Figure 1: General approach to solving a classification problem. A learning algorithm is applied to the training set to induce (learn) a model; the model is then applied (deduction) to the test set.

A training set consists of records whose class labels are known; it is used to build the classification model. The model is then applied to a test set, which consists of records with unknown class labels.

Evaluation of a Classification Model

The performance of a classification model is evaluated based on the counts of test records correctly and incorrectly predicted by the model. These counts are tabulated in a table known as a confusion matrix.

Table 2: Confusion matrix

                          Predicted Class
                          Class 1   Class 0
Actual Class   Class 1    f11       f10
               Class 0    f01       f00

In the above table, fij denotes the number of records from class i predicted to be of class j. Based on the entries of the confusion matrix, the total number of correct predictions made by the model is

f11 + f00.

Similarly, the total number of wrong predictions is

f10 + f01.

The confusion matrix provides the information needed to determine how well a classification model performs. Based on this information, we can define performance measures to compare different classification models:

Accuracy = (number of correct predictions) / (total number of predictions) = (f11 + f00) / (f11 + f10 + f01 + f00)

Error rate = (number of wrong predictions) / (total number of predictions) = (f10 + f01) / (f11 + f10 + f01 + f00)

Positive predictive value = (number of true positives) / (total number of actual positives) = f11 / (f11 + f10)

Negative predictive value = (number of true negatives) / (total number of actual negatives) = f00 / (f01 + f00)

Most classification algorithms seek models that attain the highest accuracy or, equivalently, the lowest error rate.
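To make these formulas concrete, here is a minimal Python sketch (the helper function and the example counts are illustrative and not from the original text) that computes the four measures from the entries of a 2 x 2 confusion matrix.

    # Minimal sketch: performance measures from a 2x2 confusion matrix.
    # Variable names (f11, f10, f01, f00) follow the notation of Table 2.

    def confusion_matrix_measures(f11, f10, f01, f00):
        total = f11 + f10 + f01 + f00
        accuracy = (f11 + f00) / total
        error_rate = (f10 + f01) / total
        ppv = f11 / (f11 + f10)          # as defined above
        npv = f00 / (f01 + f00)          # as defined above
        return {"accuracy": accuracy, "error_rate": error_rate,
                "positive_predictive_value": ppv,
                "negative_predictive_value": npv}

    # Example: 40 + 45 correct predictions out of 100 test records
    print(confusion_matrix_measures(f11=40, f10=10, f01=5, f00=45))
    # accuracy = 0.85, error_rate = 0.15, ppv = 0.8, npv = 0.9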

Decision Tree Classifier


Suppose a new species is discovered by scientists and we have to classify whether it is a mammal or a non-mammal. One approach is to pose a series of questions about the characteristics of the species, such as those given below.

1. Is the species cold-blooded or warm-blooded?

2. Do the females of the species give birth to their young?

A series of questions and their possible answers can be organized in the form of a hierarchical structure consisting of nodes and directed edges. This hierarchical structure is known as a decision tree.

(1) Body Temperature
        Cold → (3) Non-mammal
        Warm → (2) Gives Birth?
                   No  → (4) Non-mammal
                   Yes → (5) Mammal

The decision tree has three types of nodes:

1. Root node: a node with no incoming edges and zero or more outgoing edges. In the above tree, node (1) is the root node.

2. Internal node: a node with exactly one incoming edge and two or more outgoing edges. In the above tree, node (2) is an internal node.

3. Leaf/terminal node: a node with exactly one incoming edge and no outgoing edges. In the above tree, nodes (3), (4) and (5) are leaf nodes. Each leaf node is assigned a class label.

The non-terminal nodes, which include the root and the other internal nodes, contain attribute test conditions that separate records with different characteristics.

Construction of a decision tree


Many decision trees can be constructed from a given set of attributes. Efficient algorithms have been developed to induce reasonably accurate, albeit suboptimal, decision trees in a reasonable amount of time. Hunt's algorithm is one such algorithm.

Haunt’s Algorithm
In this algorithm, a decision tree is grown in iterative fashion to partition training records that are
associated with node t and y = {y1 , . . . , yc } be the class labels. This algorithm is a two step algorithm
as given below

1. If all records in Dt belong to same class yt , then t is a leaf node labeled as yt .

2. If Dt contains records that belong to more than one class, an attribute test condition is selected to partition the records into smaller subsets. A child node is created for each outcome of the test condition, and the records in Dt are distributed to the children based on the outcomes. The algorithm is then applied recursively to each child node.

Example 1. Consider the following loan defaulter data set. Based on this training data set, construct a decision tree for predicting borrowers who will default on loan payments.

Table 3: Loan Defaulter Data

TId | Home Owner (HO) | Marital Status (MS) | Defaulted Borrower | Annual Income (AI)
  1 | y               | s                   | n                  | 125K
  2 | n               | m                   | n                  | 100K
  3 | n               | s                   | n                  | 70K
  4 | y               | m                   | n                  | 120K
  5 | n               | d                   | y                  | 95K
  6 | n               | m                   | n                  | 60K
  7 | y               | d                   | n                  | 220K
  8 | n               | s                   | y                  | 85K
  9 | n               | m                   | n                  | 75K
 10 | n               | s                   | y                  | 90K

y = yes, n = no, m = married, s = single, d = divorced

Solution. Based on the above data set, Hunt's algorithm grows a decision tree in the following steps.

Step I (Figure 2): a single leaf node labeled Defaulted = No, the majority class of the training records.

Step II (Figure 3): split on Home Owner.
    Home Owner
        Yes → Defaulted = No
        No  → Defaulted = No

Step III (Figure 4): refine the right branch by splitting on Marital Status.
    Home Owner
        Yes → Defaulted = No
        No  → Marital Status
                  Married         → Defaulted = No
                  Single/Divorced → Defaulted = Yes

Step IV (Figure 5): refine further by splitting on Annual Income.
    Home Owner
        Yes → Defaulted = No
        No  → Marital Status
                  Married         → Defaulted = No
                  Single/Divorced → Annual Income
                                        < 80K → Defaulted = No
                                        ≥ 80K → Defaulted = Yes
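As a practical illustration, and not part of the original solution, the following sketch fits a decision tree to the same ten records using scikit-learn. The column encoding, one-hot conversion and parameter choices (e.g. criterion="gini") are assumptions made for the example.

    # Illustrative sketch: fitting a decision tree to the loan defaulter data.
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    data = pd.DataFrame({
        "HO": ["y", "n", "n", "y", "n", "n", "y", "n", "n", "n"],
        "MS": ["s", "m", "s", "m", "d", "m", "d", "s", "m", "s"],
        "AI": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],   # in thousands
        "Defaulted": ["n", "n", "n", "n", "y", "n", "n", "y", "n", "y"],
    })

    X = pd.get_dummies(data[["HO", "MS", "AI"]])   # one-hot encode HO and MS
    y = data["Defaulted"]

    tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
    print(export_text(tree, feature_names=list(X.columns)))

    # Predict for a new applicant: HO = No, MS = Married, AI = 120K
    new_record = pd.DataFrame([{"HO": "n", "MS": "m", "AI": 120}])
    new_X = pd.get_dummies(new_record).reindex(columns=X.columns, fill_value=0)
    print(tree.predict(new_X))   # expected: ['n'] (Defaulted = No)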

Some Important Notes


1. Hunt’s Algorithm will work if every combination of attribute value is present in the training data
and each combination has unique class label.

2. It is possible for some of the child nodes created in the second step to be empty, i.e., there are no records associated with these nodes. This can happen if none of the training records have the combination of attribute values associated with such nodes. In this case, the node is declared a leaf node with the same class label as the majority class of the training records associated with its parent node.

3. In the second step, if all records associated with Dt have identical attribute values (except for the class label), then it is not possible to split these records any further. In this case, the node is declared a leaf node with the same class label as the majority class of the training records associated with this node.

Design Issues in Decision Tree Induction


There are two major issues in decision tree induction.

1. How should the training records be split?

2. When should the splitting procedure stop?

To deal with the first issue, we need a test condition for each attribute type and a measure for evaluating the goodness of each split. To deal with the second issue, we need a condition to stop the tree-growing process. A possible strategy is to continue growing the tree until either all records belong to the same class or all records have identical attribute values.

Different types of attributes


Binary Attribute
A binary attribute generates only two outcomes. The sketch below shows a test condition on a binary attribute.

    Binary Attribute
        Yes → Decision 1
        No  → Decision 2

Figure 6: Binary attribute split

Nominal Attribute
A nominal attribute can have many values and can therefore produce a multi-way split, as in the example below. A nominal attribute can also be converted into a binary split by grouping its values into two subsets; in the example below, Single and Divorced could be kept in one group and Married in the other.

    Marital Status
        Single / Married / Divorced   (one branch per value)

Figure 7: Nominal attribute (multi-way split)

Ordinal Attribute
Like nominal attributes, ordinal attributes can also produce multi-way splits, as in the example below. Unlike nominal attributes, however, ordinal attributes have an inherent ordering among their values.

    Shirt Size
        Small / Medium / Large / X Large / XX Large   (one branch per value)

Figure 8: Ordinal attribute (multi-way split)

Ordinal attributes can also produce binary splits. The attribute values can be grouped as long as the grouping does not violate the order property of the values. An example of a binary split of the above attribute is given below.

    Shirt Size
        {Small, Medium} / {Large, X Large, XX Large}

Figure 9: Ordinal attribute (binary split)

Continuous Attribute
A continuous attribute can produce a binary or a multi-way split. For a binary split, the test condition can be expressed as a comparison test (A < v) or (A ≥ v). For a multi-way split, the test condition can be expressed as a set of range queries of the form v_i ≤ A < v_{i+1}, for i = 1, . . . , k; the algorithm must then consider all possible ranges of the continuous variable. An example of a multi-way split is shown below; a short sketch of both split styles in code follows the figure.

    Annual Income
        < 10K / 10K–20K / 20K–50K / 50K–80K / > 80K

Figure 10: Continuous attribute (multi-way split)
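The following small sketch (illustrative only; the 80K threshold and the bin edges are arbitrary choices) applies both split styles to the Annual Income values of Table 3 using pandas.

    # Illustrative sketch: binary vs. multi-way splits on a continuous attribute.
    import pandas as pd

    annual_income = pd.Series([125, 100, 70, 120, 95, 60, 220, 85, 75, 90])  # in K

    # Binary split: test condition A < v with v = 80K
    binary_outcome = annual_income < 80              # True/False, two branches

    # Multi-way split: range queries v_i <= A < v_{i+1}
    bins = [0, 10, 20, 50, 80, float("inf")]
    labels = ["<10K", "10K-20K", "20K-50K", "50K-80K", ">80K"]
    multiway_outcome = pd.cut(annual_income, bins=bins, labels=labels, right=False)

    print(binary_outcome.value_counts())
    print(multiway_outcome.value_counts())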

Measures of Selecting Best Split


The measures for selecting the best split are defined in terms of the class distribution of the records before and after splitting. Let p(i|t) denote the fraction of records belonging to class i at a given node t. In a two-class problem, the class distribution at any node can be written as (p0, p1), where p0 = 1 − p1. The measures developed for selecting the best split are often based on the degree of impurity of the child nodes: the smaller the degree of impurity, the more skewed the class distribution. Entropy, Gini and classification error are three important measures of impurity. They can be expressed as

Entropy(t) = − Σ_{i=0}^{c−1} p(i|t) log₂ p(i|t)

Gini(t) = 1 − Σ_{i=0}^{c−1} [p(i|t)]²

Classification Error(t) = 1 − max_i [p(i|t)],

where c is the number of classes and 0 log₂ 0 = 0 in entropy calculations.
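These measures are straightforward to compute from a node's class distribution; the sketch below (the example distribution is made up) evaluates all three for a two-class node.

    # Illustrative sketch: impurity measures for a node's class distribution.
    import math

    def entropy(p):
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)   # 0*log2(0) := 0

    def gini(p):
        return 1 - sum(pi ** 2 for pi in p)

    def classification_error(p):
        return 1 - max(p)

    p = [6 / 10, 4 / 10]   # class distribution (p0, p1) at node t
    print(entropy(p), gini(p), classification_error(p))
    # approx. 0.971, 0.48, 0.4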

Algorithm for decision tree induction


The input to the algorithm consists of the training records E and the attribute set F. The skeleton of the algorithm, known as TreeGrowth, is given below; its main functions are explained after the listing.

Algorithm 1 Decision tree induction
TreeGrowth(E, F)

1. if stopping_cond(E, F) = TRUE then
2.     leaf = createNode()
3.     leaf.label = Classify(E)
4.     return leaf
5. else
6.     root = createNode()
7.     root.test_cond = find_best_split(E, F)
8.     let V = {v : v is a possible outcome of root.test_cond}
9.     for each v ∈ V do
10.        E_v = {e ∈ E : root.test_cond(e) = v}
11.        child = TreeGrowth(E_v, F)
12.        add child as a descendant of root and label the edge (root → child) as v
13.    end for
14. end if
15. return root

1. The createNode() function extends the decision tree by creating a new node. A node in the decision tree has either a test condition, denoted node.test_cond, or a class label, denoted node.label.

2. The find_best_split() function determines which attribute should be selected as the test condition for splitting the training records. As previously noted, the choice of test condition depends on which impurity measure is used to determine the goodness of a split. Some widely used measures are entropy, the Gini index and the χ² statistic.

3. The Classify() function determines the class label to be assigned to a leaf node. For each leaf node t, let p(i|t) denote the fraction of training records from class i associated with the node t. In most cases, the leaf node is assigned to the class that has the majority of training records:

   leaf.label = argmax_i p(i|t),

   where the argmax operator returns the class i that maximizes p(i|t).

4. The stopping_cond() function is used to terminate the tree-growing process by testing whether all the records have either the same class label or the same attribute values.

After building the decision tree, a tree-pruning step can be performed to reduce the size of the decision tree. Decision trees that are too large are susceptible to a phenomenon known as overfitting. A minimal runnable sketch of TreeGrowth is given below.
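The following is a minimal sketch of the TreeGrowth skeleton above, assuming purely categorical attributes and the Gini index as the impurity measure. The function names mirror the pseudocode, but the data layout, tie-breaking and the absence of pruning are illustrative choices.

    # Minimal sketch of TreeGrowth for categorical attributes, using Gini.
    from collections import Counter

    def gini(records):
        counts = Counter(r["class"] for r in records)
        n = len(records)
        return 1 - sum((c / n) ** 2 for c in counts.values())

    def stopping_cond(E, F):
        # stop if the node is pure or no attributes remain
        return len({r["class"] for r in E}) == 1 or not F

    def classify(E):
        # majority class among the records at the node
        return Counter(r["class"] for r in E).most_common(1)[0][0]

    def find_best_split(E, F):
        def weighted_gini(attr):
            total = 0.0
            for v in {r[attr] for r in E}:
                Ev = [r for r in E if r[attr] == v]
                total += len(Ev) / len(E) * gini(Ev)
            return total
        return min(F, key=weighted_gini)   # attribute with lowest child impurity

    def tree_growth(E, F):
        if stopping_cond(E, F):
            return {"label": classify(E)}                 # leaf node
        test_cond = find_best_split(E, F)
        root = {"test_cond": test_cond, "children": {}}
        for v in {r[test_cond] for r in E}:               # one child per outcome
            Ev = [r for r in E if r[test_cond] == v]
            remaining = [a for a in F if a != test_cond]
            root["children"][v] = tree_growth(Ev, remaining)
        return root

    # Example usage with a few vertebrate-style records (illustrative data)
    records = [
        {"BodyTemp": "Warm", "GivesBirth": "Yes", "class": "Mammal"},
        {"BodyTemp": "Warm", "GivesBirth": "No",  "class": "Non-mammal"},
        {"BodyTemp": "Cold", "GivesBirth": "No",  "class": "Non-mammal"},
    ]
    print(tree_growth(records, ["BodyTemp", "GivesBirth"]))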

Bayes Theorem for Classification


We can express the joint probability of two events X and Y as

P(X, Y) = P(Y|X) P(X) = P(X|Y) P(Y),

which implies

P(Y|X) = P(X|Y) P(Y) / P(X).

The second equation is known as Bayes' theorem. The following example illustrates its use.

Example 2. Consider a football game between two rival teams, Team-0 and Team-1. Suppose Team-0 wins 65% of the time and Team-1 wins the remaining matches. Among the games won by Team-0, only 30% come from playing on Team-1's field. On the other hand, 75% of Team-1's wins come while playing at its home field. If Team-1 is going to host the next match, which team is most likely to win?
Solution. Let X denote the team hosting the match and Y denote the winner of the match. Then

P(Y = 0) = 0.65
P(Y = 1) = 0.35
P(X = 1 | Y = 1) = 0.75
P(X = 1 | Y = 0) = 0.30

By Bayes' theorem,

P(Y = 1 | X = 1) = P(X = 1 | Y = 1) P(Y = 1) / P(X = 1),

where P(X = 1) = P(X = 1 | Y = 1) P(Y = 1) + P(X = 1 | Y = 0) P(Y = 0). Using the values above,

P(Y = 1 | X = 1) = (0.75 × 0.35) / (0.75 × 0.35 + 0.30 × 0.65) ≈ 0.5738.

Since P(Y = 1 | X = 1) ≈ 0.57 is greater than P(Y = 0 | X = 1) ≈ 0.43, Team-1 is more likely to win the next match.
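The same calculation can be checked numerically in a few lines of Python (purely a verification of the arithmetic above):

    # Numerical check of Example 2 using Bayes' theorem.
    p_y1, p_y0 = 0.35, 0.65                    # prior win probabilities
    p_x1_given_y1, p_x1_given_y0 = 0.75, 0.30
    p_x1 = p_x1_given_y1 * p_y1 + p_x1_given_y0 * p_y0   # total probability
    posterior = p_x1_given_y1 * p_y1 / p_x1
    print(round(posterior, 4))   # 0.5738 -> Team-1 is the more likely winner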

Using Bayes Theorem for Classification
Let X denote the attribute set and Y the class variable. If the class variable has a non-deterministic relationship with the attributes, then we can treat X and Y as random variables and capture their relationship statistically using P(Y|X). This conditional probability is known as the posterior probability of Y given X, as opposed to its prior probability P(Y).
During the training phase, we need to learn the posterior probability P(Y|X) for every combination of X and Y based on the information gathered from the training data. Knowing these probabilities, a test record X can be classified by finding the class Y' that maximizes the posterior probability P(Y'|X).
Now consider the loan defaulter data and let

X = (HO = No, MS = Married, AI = 120K).

To classify this record, we need to compute the posterior probabilities P(Yes|X) and P(No|X) based on the information available in the training data. If P(Yes|X) > P(No|X), then the record is classified as Yes; otherwise, it is classified as No.
Estimating the posterior probabilities accurately for every combination of class label and attribute values is a difficult task because it requires a very large data set, even for a moderate number of attributes. Bayes' theorem is useful because it allows us to express the posterior probability in terms of the prior probability P(Y) and the class-conditional probability P(X|Y):

P(Y|X) = P(X|Y) P(Y) / P(X).

When comparing the posterior probabilities for different values of Y, the denominator P(X) remains constant and can therefore be ignored. P(Y) can be easily estimated from the training set by computing the fraction of training records that belong to each class. P(X|Y) can be estimated using one of the two methods given below.

1. Naïve Bayes classifier

2. Bayesian Belief Network

Naïve Bayes Classifier


The naïve Bayes classifier estimates the class-conditional probability by assuming that the attributes are conditionally independent given the class label Y. The conditional independence assumption can be expressed mathematically as

P(X | Y = y) = ∏_{i=1}^{d} P(X_i | Y = y),

where the attribute set X consists of d attributes X_1, . . . , X_d.

More generally, let X, Y and Z be three random variables. X is said to be conditionally independent of Y given Z if the following condition holds:

P(X | Y, Z) = P(X | Z).    (1)


The conditional independence relation can also be written in the form

P(X, Y | Z) = P(X, Y, Z) / P(Z)
            = [P(X, Y, Z) / P(Y, Z)] × [P(Y, Z) / P(Z)]
            = P(X | Y, Z) P(Y | Z)
            = P(X | Z) P(Y | Z),

where the last step follows from equation (1).

How the Naïve Bayes Classifier Works


With the conditional independence assumption, instead of computing the class-conditional probability for every combination of X, we only have to estimate the conditional probability of each X_i given Y. The latter approach is more practical because it does not require a very large training set to obtain good estimates of the probabilities. To classify a test record, the classifier computes the posterior probability for each class Y:

P(Y | X) = P(Y) ∏_{i=1}^{d} P(X_i | Y) / P(X).

Note that the denominator P(X) is the same for every class Y; hence, we only need to compute the numerator for each class label.

Example 3. In Example 1, let Y = Defaulted Borrower and treat the remaining variables as the attribute set X. Based on this information, find P(Y = Yes | X) and P(Y = No | X) for

X = (HO = No, MS = Married, AI = 120K).
Solution. Using Table 3 and assuming that the annual income is normally distributed within each class, with [AI | Yes] ∼ N(90, 25) and [AI | No] ∼ N(110, 2975), we obtain the following probabilities:

P(Y = Yes) = 3/10
P(Y = No) = 7/10
P(HO = No | Y = No) = 4/7
P(HO = No | Y = Yes) = 1
P(MS = Married | Y = Yes) = 0
P(MS = Married | Y = No) = 4/7
P(AI = 120K | Y = No) ≈ 0.0072
P(AI = 120K | Y = Yes) ≈ 1.2 × 10⁻⁹

Based on the above information, the class-conditional probabilities are

P(X | Y = No) = (4/7) × (4/7) × 0.0072 ≈ 0.0024
P(X | Y = Yes) = 1 × 0 × 1.2 × 10⁻⁹ = 0.

Hence P(Y = No | X) ∝ P(Y = No) P(X | Y = No) = 0.7 × 0.0024 ≈ 0.0017, while P(Y = Yes | X) ∝ 0.3 × 0 = 0, so the record is classified as Defaulted = No.
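As a numerical check of this example (illustrative only; it simply recomputes the quantities above from the stated priors and Gaussian parameters):

    # Numerical check of Example 3: naive Bayes posterior numerators for the
    # record X = (HO=No, MS=Married, AI=120K), using the probabilities above.
    import math

    def gaussian_density(x, mean, var):
        return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    prior = {"Yes": 3 / 10, "No": 7 / 10}
    p_ho_no = {"Yes": 1.0, "No": 4 / 7}          # P(HO = No | Y)
    p_ms_married = {"Yes": 0.0, "No": 4 / 7}     # P(MS = Married | Y)
    ai_params = {"Yes": (90, 25), "No": (110, 2975)}   # (mean, variance) of AI | Y

    for label in ("Yes", "No"):
        mean, var = ai_params[label]
        likelihood = p_ho_no[label] * p_ms_married[label] * gaussian_density(120, mean, var)
        print(label, prior[label] * likelihood)
    # Yes -> 0.0; No -> a small but nonzero value (~1.6e-3), so class = No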
