Unit 4 Learning
Classification is the task of mapping an input attribute set (x), through a model, into its class label (y).
Example applications: speech recognition, pattern recognition.
Classification Example: Grading
If x >= 90 then grade = A.
If 80 <= x < 90 then grade = B.
If 70 <= x < 80 then grade = C.
If 60 <= x < 70 then grade = D.
If x < 60 then grade = F.
[Figure: the same rules drawn as a decision tree that repeatedly tests x against the thresholds 90, 80, 70, and 60.]
Classify the following marks: 78, 56, 99.
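The grading rules above can be written directly as a small rule-based classifier. Below is a minimal sketch in Python; the function name classify_grade is illustrative, and it assumes the last rule covers every mark below 60, so that 56 also receives a grade.

```python
def classify_grade(x: float) -> str:
    """Map a mark x to a letter grade using the if-then rules on the slide."""
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    else:
        return "F"

# Classify the marks from the slide: 78, 56, 99
print([classify_grade(x) for x in (78, 56, 99)])   # ['C', 'F', 'A']
```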
Topics Covered
What is Classification
General Approach to Classification
Issues in Classification
Classification Algorithms
⚫ Statistical Based
• Bayesian Classification
⚫ Distance Based
• KNN
⚫ Decision Tree Based
• ID3
⚫ Neural Network Based
⚫ Rule Based
General approach to Classification
[Figure: a classification algorithm is applied to the Training Data to build a Classifier (learning phase); the Classifier is then applied to Testing Data and to Unseen Data (classification phase).]

Training data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) -> Tenured?
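As an illustration of the two phases, here is a minimal sketch that learns a classifier from the tenure table and then applies it to the unseen tuple (Jeff, Professor, 4). The use of scikit-learn's DecisionTreeClassifier and the integer encoding of RANK are assumptions made for the example, not part of the slide.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical integer encoding of the RANK attribute.
rank_code = {"Assistant Prof": 0, "Associate Prof": 1, "Professor": 2}

# Training data from the table above: (RANK, YEARS) -> TENURED.
# NAME is dropped because it cannot generalize to unseen people.
X_train = [[rank_code["Assistant Prof"], 2],   # Tom
           [rank_code["Associate Prof"], 7],   # Merlisa
           [rank_code["Professor"],      5],   # George
           [rank_code["Assistant Prof"], 7]]   # Joseph
y_train = ["no", "no", "yes", "yes"]

# Phase 1 (learning): the classification algorithm builds the classifier.
classifier = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Phase 2 (classification): apply the classifier to the unseen tuple.
jeff = [[rank_code["Professor"], 4]]
print("Tenured?", classifier.predict(jeff)[0])
```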
Issues in Classification
Defining Classes
Missing Data
⚫ Ignore missing value
⚫ Replace with assumed value
Measuring Performance
⚫ Classification accuracy on test data
⚫ Confusion matrix
• provides the information needed to determine how
well a classification model performs
Confusion Matrix

                    Predicted Class = 1   Predicted Class = 0
Actual Class = 1    f11                   f10
Actual Class = 0    f01                   f00
• Each entry fij in this table denotes the number of records from
class i predicted to be of class j.
• For instance, f01 is the number of records from class 0 incorrectly
predicted as class 1.
• The total number of correct predictions: (f11+ f00)
• The total number of incorrect predictions: (f01 + f10)
Classification Performance
Accuracy: Overall, how often is the classifier correct?
(TP+TN)/(TP+TN+FP+FN)
Error Rate: Overall, how often is it wrong?
(FP+FN)/(TP+TN+FP+FN)
Error rate is equivalent to 1 minus Accuracy.
Specificity: measures how well the classifier identifies negative examples (the true negative rate).
TN/(FP+TN)
Sensitivity/Recall: the ratio of the number of correctly classified positive examples to the total number of positive examples.
TP/(TP+FN)
High Recall indicates the class is correctly recognized.
Class Statistics Measures
Precision: is a measure of how accurate a model’s positive
predictions are. TP/(TP+FP)
⚫ High Precision indicates an example labeled as positive is indeed positive
F Score = 2*Precision*Recall/(Precision+Recall), the harmonic mean of precision and recall.
High recall, low precision: This means that most of the positive examples are
correctly recognized (low FN) but there are a lot of false positives.
Low recall, high precision: This shows that we miss a lot of positive examples
(high FN) but those we predict as positive are indeed positive (low FP)
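These class statistics can all be computed from the four confusion-matrix counts. The sketch below is illustrative; the function name class_statistics and the example counts are made up.

```python
def class_statistics(tp, tn, fp, fn):
    """Compute the measures defined above from confusion-matrix counts."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    error_rate  = (fp + fn) / (tp + tn + fp + fn)   # equals 1 - accuracy
    specificity = tn / (fp + tn)                    # true negative rate
    recall      = tp / (tp + fn)                    # sensitivity
    precision   = tp / (tp + fp)
    f_score     = 2 * precision * recall / (precision + recall)
    return dict(accuracy=accuracy, error_rate=error_rate,
                specificity=specificity, recall=recall,
                precision=precision, f_score=f_score)

# Made-up example counts, just to show the calculation.
print(class_statistics(tp=70, tn=20, fp=5, fn=5))
```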
Example: Interpreting a Confusion Matrix
Height Example Data
Name       Gender  Height  Output1 (Correct)  Output2 (Actual Assignment)
Kristina F 1.6m Short Medium
Jim M 2m Tall Medium
Maggie F 1.9m Medium Tall
Martha F 1.88m Medium Tall
Stephanie F 1.7m Short Medium
Bob M 1.85m Medium Medium
Kathy F 1.6m Short Medium
Dave M 1.7m Short Medium
Worth M 2.2m Tall Tall
Steven M 2.1m Tall Tall
Debbie F 1.8m Medium Medium
Todd M 1.95m Medium Medium
Kim F 1.9m Medium Tall
Amy F 1.8m Medium Medium
Wynette F 1.75m Medium Medium
Confusion Matrix Example
Accuracy is used when the True Positives and True Negatives are
more important. Accuracy is a better metric for Balanced Data.
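The confusion matrix for the height data above can be tabulated by counting (correct class, assigned class) pairs. This is a small illustrative sketch; the two lists simply transcribe the Output1 and Output2 columns of the table.

```python
from collections import Counter

correct  = ["Short", "Tall", "Medium", "Medium", "Short", "Medium", "Short",
            "Short", "Tall", "Tall", "Medium", "Medium", "Medium", "Medium", "Medium"]
assigned = ["Medium", "Medium", "Tall", "Tall", "Medium", "Medium", "Medium",
            "Medium", "Tall", "Tall", "Medium", "Medium", "Tall", "Medium", "Medium"]

# Count how many records of each actual class were assigned to each class.
counts = Counter(zip(correct, assigned))
classes = ["Short", "Medium", "Tall"]
print("rows = actual, columns = predicted:", classes)
for actual in classes:
    print(actual, [counts[(actual, predicted)] for predicted in classes])

# Accuracy = correctly assigned / total = 7 / 15 for this data.
accuracy = sum(a == b for a, b in zip(correct, assigned)) / len(correct)
print("accuracy:", round(accuracy, 3))
```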
Bayes Theorem
P(c|x) = P(x|c) · P(c) / P(x)
• P(c|x) is the posterior probability of the class (c, target) given the predictor (x, attributes).
• P(c) is the prior probability of the class.
• P(x|c) is the likelihood, the probability of the predictor given the class.
• P(x) is the prior probability of the predictor.
Data tuple (A): 35-year-old customer with an income of $40,000
Hypothesis (B): customer will buy a computer
Likelihood: P(A|B),
⚫ P(A|B) is the likelihood. It represents the probability of observing the data
(35-year-old with an income of $40,000) given that the hypothesis
(customer will buy a computer) is true.
Prior Probability: P(A),
⚫ is the prior probability of a customer being a 35-year-old with an income
of $40,000.
Posterior Probability: P(B|A),
⚫ probability of the customer buying a computer given their age and income.
For all entries in the dataset, the denominator does not change; it remains constant. Therefore, the denominator can be removed and a proportionality can be introduced.
Prior probabilities: P(p) = 9/14, P(n) = 5/14
The conditional (derived) probabilities P(attribute value | class) are estimated from the training data.
Play-Tennis example: classifying X = (rain, hot, high, false)
Posterior (unnormalized) probabilities:
P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p)
            = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n)
            = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
Since 0.018286 > 0.010582, X is classified as class n (do not play).
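The same computation can be scripted. The sketch below hard-codes the conditional probabilities quoted above; the dictionary and variable names are illustrative.

```python
# Class-conditional probabilities for X = (rain, hot, high, false), as quoted above.
p_given_play    = {"rain": 3/9, "hot": 2/9, "high": 3/9, "false": 6/9}
p_given_no_play = {"rain": 2/5, "hot": 2/5, "high": 4/5, "false": 2/5}
prior = {"p": 9/14, "n": 5/14}

x = ["rain", "hot", "high", "false"]

score_p = prior["p"]
score_n = prior["n"]
for value in x:                       # multiply the class-conditional probabilities
    score_p *= p_given_play[value]
    score_n *= p_given_no_play[value]

print(round(score_p, 6), round(score_n, 6))   # 0.010582  0.018286
print("classified as:", "p (play)" if score_p > score_n else "n (do not play)")
```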
Example: Height Classification
Classify the tuple t = (Adam, M, 1.95 m).
P(short) = 4/15 = 0.267
P(medium) = 8/15 = 0.533
P(tall) = 3/15 = 0.2
Finally, we obtain the actual probabilities of each event (Using Bayes Theorem):
Therefore, based on these probabilities, we classify the new tuple as tall because
it has the highest probability.
Advantages of Naïve Bayes
It is easy to use.
Unlike other classification approaches, only one
scan of the training data is required.
The naive Bayes approach can easily handle missing
values by simply omitting that probability when
calculating the likelihoods of membership in each
class.
In cases where there are simple relationships, the
technique often does yield good results.
Disadvantages of Naïve Bayes
It assumes that the attributes are independent of each other, which is often not true in practice.
It does not handle continuous data directly; continuous attributes must first be divided into ranges (discretized).
Splitting Attributes

Training data:
ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes

Tree 1: the root tests Home Owner (Yes -> NO); for Home Owner = No, Marital Status (MarSt) is tested (Married -> NO); for Single or Divorced, Income is tested (< 80K -> NO, > 80K -> YES).
Tree 2 (built from the same training data): the root tests MarSt (Married -> NO); for Single or Divorced, Home Owner is tested (Yes -> NO); for Home Owner = No, Income is tested (< 80K -> NO, > 80K -> YES).

There could be more than one tree that fits the same data!
Apply Model to Test Data

Test data:
Home Owner  Marital Status  Annual Income  Defaulted Borrower
No          Married         80K            ?

Start from the root of the tree and follow the branch that matches the record at each internal node:
⚫ Home Owner = No, so take the No branch to the MarSt node.
⚫ Marital Status = Married, so take the Married branch, which leads to the leaf NO.
The model therefore predicts Defaulted Borrower = No for this record.
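The traversal can be written as a few nested tests. The sketch below encodes Tree 1 from above (the function name predict_default is illustrative) and applies it to the test record.

```python
def predict_default(home_owner: str, marital_status: str, income_k: float) -> str:
    """Walk the decision tree and return the predicted 'Defaulted Borrower' label."""
    if home_owner == "Yes":
        return "No"                        # Home Owner = Yes -> leaf NO
    if marital_status == "Married":
        return "No"                        # Married -> leaf NO
    # Single or Divorced: test Annual Income
    return "No" if income_k < 80 else "Yes"

# Test record: Home Owner = No, Marital Status = Married, Annual Income = 80K
print(predict_default("No", "Married", 80))   # -> No
```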
Decision Tree
Given:
⚫ D = {t1, …, tn} where ti=<ti1, …, tih>
⚫ Attributes {A1, A2, …, Ah}
⚫ Classes C={C1, …., Cm}
A Decision or Classification Tree is a tree associated with D such that:
⚫ Each internal node is labeled with an attribute Ai
⚫ Each arc is labeled with a predicate that can be applied to the attribute at its parent node
⚫ Each leaf node is labeled with a class Cj
Decision Tree based Algorithms
Pruning
⚫ Once a tree is constructed, some modifications to the tree
might be needed to improve the performance of the tree
during the classification phase.
⚫ The pruning phase might remove redundant comparisons
or remove subtrees to achieve better performance.
Comparing Decision Trees
ID3
ID3 stands for Iterative Dichotomiser 3
ID3 creates a tree using information theory concepts and tries to reduce the expected number of comparisons.
ID3 chooses the split attribute with the highest information gain:
Information gain = (entropy of the distribution before the split) - (entropy of the distribution after it)
Entropy
⚫ Entropy is used to measure the amount of uncertainty, surprise, or randomness in a set of data.
⚫ When all data belongs to a single class, entropy is zero, as there is no uncertainty.
⚫ An equally divided sample has an entropy of 1.
⚫ The mathematical formula for entropy is E(S) = -Σ pi·log2(pi), where pi is the proportion of examples in S that belong to class i.
[Figure: two candidate nodes; the left node has lower entropy (more purity) than the right node, since the left node has a greater number of "yes" examples and the decision there is easy.]
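A short sketch of the entropy formula above, computed from the class counts at a node (the helper name entropy is illustrative):

```python
import math

def entropy(class_counts):
    """E(S) = -sum(p_i * log2(p_i)) over the classes present in S."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

print(entropy([5, 0]))    # pure node          -> 0.0
print(entropy([5, 5]))    # equally divided    -> 1.0
print(entropy([9, 5]))    # 9 "yes" / 5 "no"   -> about 0.940
```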
Information Gain
Let's see how the decision tree will be built using these two features. We'll use information gain to decide which feature should be the root node and which feature should be placed after the split.
Information Gain
Now that we have E(Parent) and E(Parent|Energy), the information gain is:
Information Gain = E(Parent) - E(Parent|Energy)
Our parent entropy was near 0.99, and given this value of information gain, we can say that the entropy of the dataset will decrease by 0.37 if we make "Energy" the root node.
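The same calculation in code: information gain is the parent's entropy minus the weighted entropy of the child nodes produced by the split. The counts below are made up purely to illustrate the formula; they are not the slide's "Energy" example.

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """IG = E(parent) - sum over children of (|child| / |parent|) * E(child)."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child)
                   for child in children_counts)
    return entropy(parent_counts) - weighted

# Hypothetical split of a 9-positive / 5-negative parent into two children.
parent = [9, 5]
children = [[6, 1], [3, 4]]
print(round(information_gain(parent, children), 3))   # about 0.152
```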
Information Gain
[Figure: entropy reduction by partitioning the data set on the Color attribute, which splits the examples into red, yellow, and green subsets.]
Information Gain of the Attributes
⚫ Gain(Color) = 0.246
⚫ Gain(Outline) = 0.151
⚫ Gain(Dot) = 0.048
Heuristic: the attribute with the highest gain is chosen, so Color is chosen as the root node.
Splitting a Color branch (5 examples, entropy = 0.971):

Gain(Outline):
P(dashed) = 3/5, P(solid) = 2/5
I(dashed) = -(3/3)·log2(3/3) - 0 = 0
I(solid) = -0 - (2/2)·log2(2/2) = 0
I(Outline) = (3/5)·0 + (2/5)·0 = 0
Gain(Outline) = 0.971 - 0 = 0.971

Gain(Dot):
P(yes) = 2/5, P(no) = 3/5
I(yes) = -(1/2)·log2(1/2) - (1/2)·log2(1/2) = 1
I(no) = -(1/3)·log2(1/3) - (2/3)·log2(2/3) = 0.917
I(Dot) = (2/5)·1 + (3/5)·0.917 = 0.950
Gain(Dot) = 0.971 - 0.950 = 0.020

Outline has the higher gain, so it is chosen to split this branch.
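These branch gains can be checked numerically. The sketch below recomputes them from the class counts implied above: a [2, 3] parent split either into two pure subsets by Outline, or into a [1, 1] and a [1, 2] subset by Dot.

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain(parent, children):
    total = sum(parent)
    return entropy(parent) - sum(sum(c) / total * entropy(c) for c in children)

parent = [2, 3]                                    # 5 examples, entropy ~ 0.971
print(round(gain(parent, [[3, 0], [2, 0]]), 3))    # Gain(Outline) -> 0.971
print(round(gain(parent, [[1, 1], [1, 2]]), 3))    # Gain(Dot)     -> about 0.020
```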
At the other impure Color branch (also with entropy 0.971):
Gain(Outline) = 0.971 - 0.951 = 0.020 bits
Gain(Dot) = 0.971 - 0 = 0.971 bits
Dot has the higher gain, so it is chosen to split this branch; the Outline test (solid / dashed) chosen in the previous step is already part of the tree.
[Figure: the resulting decision tree, with Color at the root (red / yellow / green branches) and further tests on Dot (yes / no) and Outline (solid / dashed) at the impure branches.]
Decision Tree
[Figure: the final decision tree, with Color as the root attribute and branches for red, yellow, and green.]
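Putting the pieces of this unit together, here is a compact, illustrative sketch of the ID3 procedure: at every node it picks the attribute with the highest information gain and recurses until the subset is pure. The attribute names and toy records are made up for illustration and do not reproduce the slide's full dataset.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(n / total * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Information gain of splitting (rows, labels) on attribute attr."""
    total = len(labels)
    split = {}
    for row, label in zip(rows, labels):
        split.setdefault(row[attr], []).append(label)
    remainder = sum(len(subset) / total * entropy(subset)
                    for subset in split.values())
    return entropy(labels) - remainder

def id3(rows, labels, attributes):
    if len(set(labels)) == 1:                 # pure node -> leaf
        return labels[0]
    if not attributes:                        # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [a for a in attributes if a != best])
    return tree

# Tiny made-up run on records with Color / Outline / Dot attributes.
rows = [{"Color": "red",    "Outline": "dashed", "Dot": "yes"},
        {"Color": "red",    "Outline": "solid",  "Dot": "no"},
        {"Color": "green",  "Outline": "solid",  "Dot": "no"},
        {"Color": "yellow", "Outline": "dashed", "Dot": "yes"}]
labels = ["+", "-", "-", "+"]
print(id3(rows, labels, ["Color", "Outline", "Dot"]))
```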