7 - Classification - Concept - DecisionTree - Evaluation

This document discusses classification techniques for machine learning models. It provides definitions and examples of classification tasks, describes how to build decision trees for classification, and explains how to evaluate model performance on test data. Key points covered: defining classification and illustrating it with examples; constructing decision trees from training data; and applying the decision tree model to make predictions on new test data.


Classification: Basic Concepts, Decision Tree, Model Evaluation
WEEK 9 BIG DATA AND DATA ANALYTICS
OUTLINE

o Classification Definition and Example


o Classification Basic Concept
o Decision Tree Construction
o Model Performance and Evaluation
o Rule Based Classifier
o Nearest Neighbors Classifier
o Naïve Bayes Classifier
o Artificial Neural Network
o Support Vector Machine
Classification Definition
• Given a collection of records (the training set), where each record contains a set of attributes, one of which is the class.
• Find a model that expresses the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set to validate it.
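The train/build/validate workflow above can be sketched in a few lines. This is a minimal illustration using scikit-learn and its bundled Iris data as a stand-in dataset; both the library and the dataset are assumptions for the sketch, not part of the slides.

```python
# Minimal sketch of the methodology above: divide the records into a
# training set (to build the model) and a test set (to validate it).
# scikit-learn and the Iris data are stand-ins, not from the slides.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 30% of the records as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)             # build the model on the training set
accuracy = model.score(X_test, y_test)  # accuracy on previously unseen records
print(f"Test accuracy: {accuracy:.2f}")
```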
Illustration of a Classification Task

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

A learning algorithm performs induction on the training set to learn a model.

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

The model is then applied to the test set (deduction) to predict the unknown class labels.
Example of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
Example of a Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (Decision Tree), with Refund as the root splitting attribute:

Refund?
├─ Yes -> NO
└─ No  -> MarSt?
          ├─ Married          -> NO
          └─ Single, Divorced -> TaxInc?
                                 ├─ < 80K -> NO
                                 └─ > 80K -> YES
Another Example of Decision Tree

On the same training data, a different tree with MarSt as the root splitting attribute:

MarSt?
├─ Married          -> NO
└─ Single, Divorced -> Refund?
                       ├─ Yes -> NO
                       └─ No  -> TaxInc?
                                 ├─ < 80K -> NO
                                 └─ > 80K -> YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

The same framework as before, instantiated with decision trees: a Tree Induction algorithm learns a Decision Tree model from the Training Set (Tids 1-10), and the tree is then applied to the Test Set (Tids 11-15) to deduce the unknown class labels.
Apply Model to Test Data

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branch that matches the record at each node:
1. Refund = No, so take the "No" branch to the MarSt node.
2. Marital Status = Married, so take the "Married" branch, which leads to a leaf labeled NO.
3. Assign Cheat = "No" to the record.
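The traversal above can be written out as nested conditionals. This is a hand-coded sketch of the example tree from the slides, not generated code; the attribute names follow the example (Refund, MarSt, TaxInc).

```python
# A minimal sketch of the slide's decision tree as nested conditionals.
def classify(refund, marital_status, taxable_income):
    """Return the predicted Cheat label for one record."""
    if refund == "Yes":
        return "No"                  # Refund = Yes -> leaf NO
    if marital_status == "Married":
        return "No"                  # MarSt = Married -> leaf NO
    # Single or Divorced: decide on taxable income
    return "No" if taxable_income < 80_000 else "Yes"

# The test record from the slides: Refund=No, MarSt=Married, TaxInc=80K
print(classify("No", "Married", 80_000))   # prints: No
```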
Decision Tree Induction
• Many Algorithms:
• Hunt’s Algorithm (one of the earliest)
• CART
• ID3, C4.5
• SLIQ, SPRINT
Tree Induction
• Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.

• Issues
• Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
• Determine when to stop splitting
How to Specify Test Condition?
• Depends on attribute types
• Nominal
• Ordinal
• Continuous

• Depends on number of ways to split


• 2-way split
• Multi-way split
Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as distinct values, e.g. CarType -> {Family}, {Sports}, {Luxury}.
• Binary split: divides values into two subsets; need to find the optimal partitioning, e.g. CarType -> {Sports, Luxury} vs {Family}, OR {Family, Luxury} vs {Sports}.
Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as distinct values, e.g. Size -> {Small}, {Medium}, {Large}.
• Binary split: divides values into two subsets; need to find the optimal partitioning, e.g. Size -> {Small, Medium} vs {Large}, OR {Small} vs {Medium, Large}.
• What about the split Size -> {Small, Large} vs {Medium}? It violates the order of the ordinal attribute.
Splitting Based on Continuous
Attributes
• Different ways of handling
• Discretization to form an ordinal categorical attribute
• Static – discretize once at the beginning
• Dynamic – ranges can be found by equal interval bucketing,
equal frequency bucketing
(percentiles), or clustering.

• Binary decision: (A < v) or (A ≥ v)
• consider all possible splits and find the best cut
• can be more compute-intensive
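The "consider all possible splits" step can be sketched as a brute-force search. This is a hypothetical illustration, not an algorithm from the slides: each candidate cut point v for a binary split (A < v) vs (A ≥ v) is scored by the number of records that disagree with the majority class of their side.

```python
# Hypothetical sketch: exhaustive search for the best binary cut point
# on a continuous attribute, scored by misclassification count.
def side_errors(side):
    """Misclassifications if the side is labeled with its majority class."""
    if not side:
        return 0
    majority = max(set(side), key=side.count)
    return sum(1 for label in side if label != majority)

def best_cut(values, labels):
    """Return (v, errors) for the candidate cut with the fewest errors."""
    best_v, best_err = None, len(labels) + 1
    for v in sorted(set(values))[1:]:     # skip the smallest value (empty split)
        left  = [l for a, l in zip(values, labels) if a < v]
        right = [l for a, l in zip(values, labels) if a >= v]
        err = side_errors(left) + side_errors(right)
        if err < best_err:
            best_v, best_err = v, err
    return best_v, best_err

# Taxable Income (in K) and Cheat labels from the training data above
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_cut(income, cheat))
```

In practice, impurity measures such as Gini or entropy are used in place of the raw error count, but the candidate-enumeration loop is the same.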
Splitting Based on Continuous Attributes

(i) Binary split: Taxable Income > 80K? -> Yes / No
(ii) Multi-way split: Taxable Income? -> < 10K, [10K,25K), [25K,50K), [50K,80K), > 80K
Classification Metrics / Measures of Node Impurity
• Gini impurity / index: used by the CART (Classification And Regression Tree) algorithm. Gini impurity measures how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset.
• Information gain / entropy: used by the ID3, C4.5, and C5.0 tree-generation algorithms. Information gain is based on the concept of entropy from information theory.
• Variance reduction / misclassification reduction: introduced in CART, variance reduction is often employed when the target variable is continuous (regression tree), meaning that many other metrics would first require discretization before being applied.
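The first two impurity measures follow standard formulas: Gini = 1 - Σ pᵢ² and entropy = -Σ pᵢ log₂ pᵢ over the class proportions pᵢ at a node. A minimal sketch, computed from a list of class labels:

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Entropy: -sum(p * log2(p)) over class proportions p."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(gini(["Yes"] * 5 + ["No"] * 5))     # maximally impure 2-class node: 0.5
print(entropy(["Yes"] * 5 + ["No"] * 5))  # maximum entropy for 2 classes: 1.0
print(gini(["No"] * 4))                   # pure node: 0.0
```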
Stopping Criteria for Tree Induction
• Stop expanding a node when all the records belong to the same class
• Stop expanding a node when all the records have similar attribute values
• Early termination (to be discussed later)


Decision Tree Based Classification
• Advantages:
• Inexpensive to construct
• Extremely fast at classifying unknown records
• Easy to interpret for small-sized trees
• Accuracy is comparable to other classification
techniques for many simple data sets
Example: C4.5
• Simple depth-first construction.
• Uses Information Gain
• Sorts Continuous Attributes at each node.
• Needs entire data to fit in memory.
• Unsuitable for Large Datasets.
• Needs out-of-core sorting.

• You can download the software from:
  http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
Practical Issues of Classification
• Underfitting and Overfitting

• Missing Values

• Costs of Classification
Underfitting and Overfitting (Example)

500 circular and 500 triangular data points.

Circular points: 0.5 ≤ sqrt(x1² + x2²) ≤ 1

Triangular points: sqrt(x1² + x2²) < 0.5 or sqrt(x1² + x2²) > 1
Underfitting and Overfitting

Underfitting: when the model is too simple, both training and test errors are large.
Overfitting: when the model is too complex, training error keeps decreasing while test error increases.
Overfitting due to Noise

The decision boundary is distorted by noise points.

Overfitting due to Insufficient Examples

Lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region:
- An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.
Notes on Overfitting
• Overfitting results in decision trees that are more complex
than necessary

• Training error no longer provides a good estimate of how


well the tree will perform on previously unseen records

• Need new ways for estimating errors


Other Issues
• Data Fragmentation
• Search Strategy
• Expressiveness
• Tree Replication
Tree Replication

P
├─ Q
│  ├─ S
│  │  ├─ 0
│  │  └─ 1
│  └─ 0
└─ R
   ├─ Q
   │  ├─ S
   │  │  ├─ 0
   │  │  └─ 1
   │  └─ 0
   └─ 1

• The same subtree appears in multiple branches
Model Evaluation
• Metrics for Performance Evaluation: how to evaluate the performance of a model?
• Methods for Performance Evaluation: how to obtain reliable estimates?
• Methods for Model Comparison: how to compare the relative performance among competing models?
Metrics for Performance Evaluation
• Focus on the predictive capability of a model, rather than how fast it classifies or builds models, scalability, etc.
• Confusion Matrix:

                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a (TP)      b (FN)
CLASS     Class=No     c (FP)      d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Metrics for Performance Evaluation…
• Most widely-used metric, computed from the confusion matrix above:

  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Methods for Performance Evaluation
• How to obtain a reliable estimate of performance?
• Performance of a model may depend on other factors besides the learning algorithm:
  • Class distribution
  • Cost of misclassification
  • Size of training and test sets
Exercise (Latihan)
• Carry out the experiment following Matthew North's book (Data Mining for the Masses), Chapter 10 (Decision Tree), pp. 157-174.
• Datasets: eReader-Training.csv and eReader-Scoring.csv
• Analyze which types of decision tree are used and why they need to be applied to that dataset.
UTS (Midterm Exam)
• Friday, 13-OCT-17, 08:30
