Classification Ppts 2021
• Credit approval
– A bank wants to classify its customers based on whether
they are expected to pay back their approved loans
– The history of past customers is used to train the classifier
– The classifier provides rules, which identify potentially
reliable future customers
– Classification rule:
• If age = “31...40” and income = high then credit_rating =
excellent
– Future customers
• Paul: age = 35, income = high → excellent credit rating
• John: age = 20, income = medium → fair credit rating
Supervised Classification
The input data, also called the training set,
consists of multiple records each having multiple
attributes or features.
Each record is tagged with a class label.
The objective of classification is to analyze the
input data and to develop an accurate description
or model for each class using the features present
in the data.
This model is used to classify test data for which
the class descriptions are not known.
Classification and Prediction
• Classification is the process of
– finding a model that describes data classes
– so that the model can be used to predict the class of
objects whose class label is unknown.
• Prediction:
– predicts unknown or missing values
• Typical applications
Credit approval
Target marketing
Medical diagnosis
Fraud detection
• Model usage:
– for classifying future or unknown objects
– test sample is compared with the classified result
from the model
(Figure) Model construction and usage: Training Data → Classification
Algorithms → Classifier; the classifier is then applied to Testing Data and
Unseen Data, e.g., (Jeff, Professor, 4) → Tenured?

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      yes
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

(Source: Data Mining: Concepts and Techniques)
Classification Model
• Classification model can be represented in
various forms such as
» IF-THEN Rules
» A decision tree
» Neural network
Decision Tree - Classification
• Decision tree builds classification models in the form of a
tree structure.
• It breaks down a dataset into smaller and smaller subsets
while at the same time an associated decision tree is
incrementally developed.
• The final result is a tree with decision nodes and leaf nodes.
• A decision node has two or more branches.
• A leaf node represents a classification or decision.
• The topmost decision node in a tree, which corresponds to
the best predictor, is called the root node.
• Decision trees can handle both categorical and numerical
data.
Example 1 : using given training data set, create classification model using decision tree
(Figure: decision tree branches for Outlook = Overcast and Outlook = Sunny)
Example 2: using given training data set, create classification model
using decision tree
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Output: A Decision Tree for “buys_computer”
(Figure) The root node tests age?: branch "<=30" leads to a student? test
(no → no, yes → yes); branch "31..40" is a leaf "yes"; branch ">40" leads to
a credit_rating? test (excellent → no, fair → yes).
Income     Yes  Rented  Total
Very high  2    0       2
High       4    0       4
Low        0    3       3
Medium     1    2       3

E(Own house, Income) = p(vh)*E(vh) + p(h)*E(h) +
p(l)*E(l) + p(m)*E(m)
• Information Gain
The information gain is based on the decrease in
entropy after a dataset is split on an attribute.
Constructing a decision tree is all about finding
the attribute that returns the highest information gain
(i.e., the most homogeneous branches).
Step 2: For E(Age) we have,

Age     Yes  Rented  Total
Young   3    1       4
Medium  3    2       5
Old     1    2       3
Total                12

E(Age) = [(4/12)*E(0.75,0.25)] + [(5/12)*E(0.6,0.4)]
       + [(3/12)*E(0.33,0.67)]
       = 0.90
• G(O,A)= E(O) – E(O,A)
= 0.98 – 0.90
= 0.08
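The entropy and gain arithmetic above can be checked with a short Python sketch (the counts come from the Age and Income tables of this Own house example; `entropy` plays the role of E):

```python
import math

def entropy(*probs):
    """Shannon entropy E of a class distribution (probabilities sum to 1)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Overall distribution: 7 "yes" and 5 "rented" out of 12 records.
E_O = entropy(7/12, 5/12)                      # ≈ 0.98

# Weighted entropy after splitting on Age, using the counts in the table.
E_age = (4/12) * entropy(3/4, 1/4) \
      + (5/12) * entropy(3/5, 2/5) \
      + (3/12) * entropy(1/3, 2/3)             # ≈ 0.90

gain_age = E_O - E_age                         # G(O, Age) ≈ 0.08
print(round(E_O, 2), round(E_age, 2), round(gain_age, 2))
```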
• The Income attribute has the highest gain, so it is used as the
decision attribute at the root node
• R1: If (Income = VH) then Own house = yes
[P̂(a1|c*) … P̂(an|c*)] P̂(c*) > [P̂(a1|c) … P̂(an|c)] P̂(c),
for c ≠ c*, c = c1, …, cL
Example 1: Naïve Bayes Classifier Example
Learning phase:
• Test Phase
– Given a new instance,
x’ = (Outlook=Sunny, Temperature=Cool,
Humidity=High, Wind=Strong)
• MAP rule
• P(x’|No):
[P(Sunny|No) P(Cool|No)P(High|No)P(Strong|No)]P(Play=No)
= 0.0206
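This test-phase computation can be reproduced numerically. The priors and conditional probabilities below are the usual learning-phase counts for the play-tennis data set (stated here as an assumption, since the learning-phase tables are not reproduced in the text):

```python
# Priors and conditionals from the play-tennis training set (assumed values).
p_yes, p_no = 9/14, 5/14
cond_yes = {"Sunny": 2/9, "Cool": 3/9, "High": 3/9, "Strong": 3/9}
cond_no  = {"Sunny": 3/5, "Cool": 1/5, "High": 4/5, "Strong": 3/5}

x_new = ["Sunny", "Cool", "High", "Strong"]

score_yes, score_no = p_yes, p_no
for a in x_new:                     # naive (conditional independence) product
    score_yes *= cond_yes[a]
    score_no *= cond_no[a]

print(round(score_no, 4))           # 0.0206, matching the slide
print(round(score_yes, 4))          # 0.0053
# MAP rule: pick the class with the larger score
print("Play=No" if score_no > score_yes else "Play=Yes")
```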
https://www.youtube.com/watch?v=UzT4W1tOKD4
Surprise Test (20 marks)
Illustrate Decision tree and Naive Bayesian Classification techniques for the above
data set.
Show how we can classify a new tuple, with (Homeowner=yes; Status=Employed;
Income= Average)
Classification by Decision Tree Induction
• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Samples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure
(e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority voting is
employed for classifying the leaf
– There are no samples left
Algorithm for Decision Tree Induction
(pseudocode)
Algorithm GenDecTree(Sample S, Attlist A)
1. create a node N
2. If all samples are of the same class C then label N with C; terminate;
3. If A is empty then label N with the most common class C in S (majority
voting); terminate;
4. Select a ∈ A with the highest information gain; label N with a;
5. For each value v of a:
a. Grow a branch from N with condition a=v;
b. Let Sv be the subset of samples in S with a=v;
c. If Sv is empty then attach a leaf labeled with the most common class in S;
d. Else attach the node generated by GenDecTree(Sv, A − {a})
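The pseudocode above can be sketched in Python; the dict-based record format, the `"class"` label key, and the toy data set are illustrative assumptions (the empty-Sv case of step 5c does not arise here, because branches are grown only for attribute values observed in S):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c/n * math.log2(c/n) for c in Counter(labels).values())

def gen_dec_tree(samples, attrs, target="class"):
    """samples: list of dicts; attrs: attribute names. Returns a nested-dict tree."""
    labels = [s[target] for s in samples]
    if len(set(labels)) == 1:                       # step 2: pure node
        return labels[0]
    if not attrs:                                   # step 3: majority voting
        return Counter(labels).most_common(1)[0][0]
    def gain(a):                                    # step 4: information gain of a
        rem = 0.0
        for v in {s[a] for s in samples}:
            sv = [s[target] for s in samples if s[a] == v]
            rem += len(sv) / len(samples) * entropy(sv)
        return entropy(labels) - rem
    best = max(attrs, key=gain)
    node = {}
    for v in {s[best] for s in samples}:            # step 5: one branch per value
        sv = [s for s in samples if s[best] == v]
        node[v] = gen_dec_tree(sv, [a for a in attrs if a != best], target)
    return {best: node}

data = [{"income": "VH", "class": "yes"}, {"income": "L", "class": "no"},
        {"income": "VH", "class": "yes"}, {"income": "M", "class": "no"}]
print(gen_dec_tree(data, ["income"]))
```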
Attribute Selection Measure
• Information gain (ID3/C4.5)
– All attributes are assumed to be categorical
– Can be modified for continuous-valued attributes
• Gini index (IBM IntelligentMiner)
– All attributes are assumed continuous-valued
– Assume there exist several possible split values for each attribute
– May need other tools, such as clustering, to get the possible split values
– Can be modified for categorical attributes
Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Assume there are two classes, P and N
– Let the set of examples S contain p elements of class P and n elements of
class N
– The amount of information, needed to decide if an arbitrary example in S
belongs to P or N is defined as
I(p, n) = −(p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n))
Prediction
• Linear (simple) regression fits y = α + βx, where
β = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²  and  α = ȳ − βx̄
(x̄ is the mean value of x, ȳ is the mean value of y)
Example
The table below shows the marks obtained by students in the
midterm and final exams.

Midterm (x)  Final (y)
45           60
70           70
60           54
84           82
75           68
84           76

Find: the regression coefficients
α = 28.84
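A least-squares sketch over the marks table, assuming all six rows (including the last, 84/76) are training data; the slide's α = 28.84 may reflect different rounding or a slightly different training subset:

```python
# Fit y = alpha + beta * x by least squares on the midterm/final marks.
xs = [45, 70, 60, 84, 75, 84]
ys = [60, 70, 54, 82, 68, 76]

n = len(xs)
x_bar = sum(xs) / n                 # mean value of x
y_bar = sum(ys) / n                 # mean value of y

beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)
alpha = y_bar - beta * x_bar
print(round(beta, 3), round(alpha, 1))
```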
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:
Actual class \ Predicted class   C1                    ¬C1
C1                               True Positives (TP)   False Negatives (FN)
¬C1                              False Positives (FP)  True Negatives (TN)
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
• Precision (exactness): what % of tuples that the classifier
labeled as positive are actually positive:
Precision = TP / (TP + FP)
• Recall (completeness): what % of positive tuples the classifier
labeled as positive:
Recall = TP / (TP + FN)
• F-measure: harmonic mean of precision and recall:
F = 2 × Precision × Recall / (Precision + Recall)
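These metrics can be computed directly from confusion-matrix counts; the numbers below are hypothetical, chosen only to exercise the formulas:

```python
# Hypothetical confusion-matrix counts.
TP, FP, FN, TN = 90, 10, 30, 70

precision = TP / (TP + FP)                      # exactness: 90/100
recall = TP / (TP + FN)                         # completeness: 90/120
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + FP + FN + TN)
print(precision, recall, round(f1, 3), accuracy)
```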
Classifier Evaluation Metrics: Example
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
• Holdout method
– Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
– Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies
obtained
• Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets,
each approximately equal size
– At i-th iteration, use Di as test set and others as training set
– Leave-one-out: k folds where k = # of tuples, for small sized
data
– *Stratified cross-validation*: folds are stratified so that class
dist. in each fold is approx. the same as that in the initial data
5 fold cross validation
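The partitioning step of k-fold cross-validation (random, mutually exclusive folds of approximately equal size) can be sketched as:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly partition record indices 0..n-1 into k mutually exclusive folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(n=20, k=5)
for test_fold in folds:
    # At each iteration one fold is the test set; the rest form the training set.
    train = [j for f in folds if f is not test_fold for j in f]
    assert set(train).isdisjoint(test_fold)
print([len(f) for f in folds])      # folds of approximately equal size
```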
Evaluating Classifier Accuracy: Bootstrap
• Bootstrap
– Works well with small data sets
– Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
• Several bootstrap methods exist; a common one is the .632 bootstrap
– A data set with d tuples is sampled d times, with replacement, resulting in
a training set of d samples. The data tuples that did not make it into the
training set end up forming the test set. About 63.2% of the original data
end up in the bootstrap sample, and the remaining 36.8% form the test set,
since (1 − 1/d)^d ≈ e⁻¹ = 0.368 for large d
– Repeat the sampling procedure k times; the overall accuracy of the model
is combined over the k iterations as 0.632 × Acc(Mi)test + 0.368 × Acc(Mi)train
The bootstrap method involves iteratively resampling a dataset with replacement.
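The .632 sampling step can be sketched as follows; with d = 1000 tuples, roughly 63.2% of distinct tuples land in the training sample and the rest form the test set:

```python
import random

def bootstrap_sample(data, seed=0):
    """Sample len(data) tuples uniformly with replacement; the tuples that
    were never selected form the test set."""
    rng = random.Random(seed)
    train = [rng.choice(data) for _ in data]
    chosen = set(train)
    test = [t for t in data if t not in chosen]
    return train, test

data = list(range(1000))
train, test = bootstrap_sample(data)
print(len(set(train)) / len(data))  # close to 0.632
print(len(test) / len(data))        # close to 0.368
```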
Estimating Confidence Intervals:
Table for t-distribution
• Symmetric
• Significance level: e.g., sig = 0.05 (5%) means M1 & M2 are
significantly different for 95% of the population
• Confidence limit, z = sig/2
Model Selection: ROC Curves
• ROC (Receiver Operating Characteristic) curves: for visual
comparison of classification models
• Originated from signal detection theory
• Shows the trade-off between the true positive rate and the
false positive rate
• Vertical axis represents the true positive rate; horizontal
axis represents the false positive rate
• The area under the ROC curve is a measure of the accuracy of
the model; a model with perfect accuracy has an area of 1.0
• Rank the test tuples in decreasing order: the one that is most
likely to belong to the positive class appears at the top of the list
• The plot also shows a diagonal line; the closer the curve is to
the diagonal (i.e., the closer the area is to 0.5), the less
accurate the model
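The ranking procedure behind an ROC curve can be sketched as follows; the scores and labels are hypothetical (1 = positive class), and each step down the ranked list adds one (FPR, TPR) point:

```python
# Rank test tuples by predicted score, most likely positive first,
# then sweep the threshold down the list to trace the ROC curve.
scored = [(0.95, 1), (0.85, 1), (0.70, 0), (0.60, 1), (0.40, 0), (0.20, 0)]
scored.sort(key=lambda t: t[0], reverse=True)

P = sum(label for _, label in scored)   # number of positive tuples
N = len(scored) - P                     # number of negative tuples

tp = fp = 0
points = [(0.0, 0.0)]                   # (FPR, TPR), starting at the origin
for _, label in scored:
    if label == 1:
        tp += 1
    else:
        fp += 1
    points.append((fp / N, tp / P))
print(points)
```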
Issues Affecting Model Selection
• Accuracy
– classifier accuracy: predicting class label
• Speed
– time to construct the model (training time)
– time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
– understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Ensemble Methods: Increasing the Accuracy
• Ensemble methods
– Use a combination of models to increase accuracy
– Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
• Popular ensemble methods
– Bagging: averaging the prediction over a collection of
classifiers, e.g., Random Forest
– Boosting: weighted vote with a collection of classifiers,
e.g., AdaBoost
Bagging: Bootstrap Aggregation
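A minimal sketch of bagging: train k classifiers on bootstrap replicates of the training set and combine their predictions by majority vote. The 1-nearest-neighbour base learner and the toy data are illustrative assumptions:

```python
import random
from collections import Counter

def bagging_predict(train, x, learner, k=11, seed=0):
    """Train k classifiers on bootstrap replicates of `train`; majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(k):
        replicate = [rng.choice(train) for _ in train]   # sample with replacement
        model = learner(replicate)
        votes.append(model(x))
    return Counter(votes).most_common(1)[0][0]           # majority vote

# Toy base learner: 1-nearest neighbour over (value, label) pairs.
def one_nn(replicate):
    return lambda x: min(replicate, key=lambda t: abs(t[0] - x))[1]

train = [(1, "no"), (2, "no"), (3, "no"), (8, "yes"), (9, "yes"), (10, "yes")]
print(bagging_predict(train, 8.5, one_nn))
```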