Datamining Lect10a Classsification Basics DT

This document discusses classification in data mining, focusing on decision trees as a method for distinguishing between different classes, such as tax cheaters and non-cheaters. It outlines the process of building a classification model using a training set and evaluating it with a test set, along with various classification techniques and algorithms. Additionally, it provides examples of classification tasks and explains the structure and functioning of decision trees.

Uploaded by

studytutor2022

DATA MINING

LECTURE 10
Classification
Basic Concepts
Decision Trees
Catching tax-evasion
Tax-return data for year 2011

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

A new tax return for 2012. Is this a cheating tax return?

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

An instance of the classification problem: learn a method for discriminating between records of different classes (cheaters vs non-cheaters)
What is classification?
• Classification is the task of learning a target function f that maps attribute set x to one of the predefined class labels y

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

One of the attributes is the class attribute. In this case: Cheat
Two class labels (or classes): Yes (1), No (0)
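As a minimal sketch of this definition, the snippet below models an attribute set x as a Python dictionary and a classification model f as any function from such a record to a class label. The names `Record`, `Classifier`, and `always_no` are hypothetical, introduced only for illustration; `always_no` is a deliberately trivial stand-in for f, not a learned model.

```python
from typing import Callable, Dict, Union

# An attribute set x: attribute name -> value (names follow the tax-return table).
Record = Dict[str, Union[str, int]]

# A classification model f: maps an attribute set x to a class label y.
Classifier = Callable[[Record], str]


def always_no(x: Record) -> str:
    """Trivial placeholder model: predicts the class label "No" for every record."""
    return "No"


# The new 2012 tax return; its class label (Cheat) is unknown.
new_return: Record = {
    "Refund": "No",
    "Marital Status": "Married",
    "Taxable Income": 80_000,
}

f: Classifier = always_no
print(f(new_return))  # prints "No"
```

Any real classifier discussed later in the lecture, such as a decision tree, fits the same shape: it takes a record and returns one of the predefined labels.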
Why classification?
• The target function f is known as a classification model

• Descriptive modeling: Explanatory tool to distinguish between objects of different classes (e.g., understand why people cheat on their taxes)

• Predictive modeling: Predict the class of a previously unseen record
Examples of Classification Tasks
• Predicting tumor cells as benign or malignant

• Classifying credit card transactions as legitimate or fraudulent

• Categorizing news stories as finance, weather, entertainment, sports, etc.

• Identifying spam email, spam web pages, adult content

• Understanding if a web query has commercial intent or not
General approach to classification
• Training set consists of records with known class labels

• Training set is used to build a classification model

• A labeled test set of previously unseen data records is used to evaluate the quality of the model.

• The classification model is applied to new records with unknown class labels
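The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 7/3 split of the ten tax-return records is arbitrary (the slides do not prescribe one), and the "model" is a majority-class baseline chosen for brevity, not a technique from this lecture. The function names `learn_majority` and `accuracy` are hypothetical.

```python
from collections import Counter

# (Refund, Marital Status, Taxable Income in K, Cheat) for Tid 1..10.
records = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Assumption: an arbitrary holdout split, first 7 records train, last 3 test.
train, test = records[:7], records[7:]


def learn_majority(training_set):
    """Build a (very weak) model: always predict the most common training label."""
    majority = Counter(label for *_, label in training_set).most_common(1)[0][0]
    return lambda attributes: majority


def accuracy(model, labeled_set):
    """Fraction of labeled records whose predicted class equals the known class."""
    correct = sum(model(attrs) == label for *attrs, label in labeled_set)
    return correct / len(labeled_set)


model = learn_majority(train)               # 1. build the model from the training set
test_accuracy = accuracy(model, test)       # 2. evaluate it on the labeled test set
prediction = model(("No", "Married", 80))   # 3. apply it to a record with unknown label
print(test_accuracy, prediction)
```

The baseline scores poorly on the held-out records, which is exactly what the test set is for: it exposes a model that has not learned anything useful about the attributes.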
Illustrating Classification Task

Training Set

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Induction: Training Set -> Learning algorithm -> Learn Model -> Model

Test Set

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Deduction: Model -> Apply Model -> Test Set
Evaluation of classification models
• Counts of test records that are correctly (or incorrectly) predicted by the classification model
• Confusion matrix

                          Predicted Class
                          Class = 1   Class = 0
Actual Class  Class = 1   f11         f10
              Class = 0   f01         f00

$$\text{Accuracy} = \frac{\#\ \text{correct predictions}}{\text{total}\ \#\ \text{of predictions}} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}$$

$$\text{Error rate} = \frac{\#\ \text{wrong predictions}}{\text{total}\ \#\ \text{of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}}$$
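To make the two formulas concrete, here is a small sketch that tallies the four confusion-matrix counts and derives accuracy and error rate from them. The `actual` and `predicted` lists are made-up values for illustration only, not results from any model in this lecture, and the function names are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Return (f11, f10, f01, f00): f_ij counts records of actual class i predicted as class j."""
    f11 = f10 = f01 = f00 = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            f11 += 1
        elif a == 1 and p == 0:
            f10 += 1
        elif a == 0 and p == 1:
            f01 += 1
        else:
            f00 += 1
    return f11, f10, f01, f00


def accuracy(f11, f10, f01, f00):
    """Correct predictions (the diagonal) divided by all predictions."""
    return (f11 + f00) / (f11 + f10 + f01 + f00)


def error_rate(f11, f10, f01, f00):
    """Wrong predictions (off the diagonal) divided by all predictions."""
    return (f10 + f01) / (f11 + f10 + f01 + f00)


# Hypothetical labels, for illustration only (1 = Yes, 0 = No).
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(actual, predicted)
print(counts)               # prints (3, 1, 1, 3)
print(accuracy(*counts))    # prints 0.75
print(error_rate(*counts))  # prints 0.25
```

Accuracy and error rate always sum to 1, since every prediction is counted as either correct or wrong.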
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Memory-based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
Decision Trees
• Decision tree
  • A flow-chart-like tree structure
  • Each internal node denotes a test on an attribute
  • Each branch represents an outcome of the test
  • Leaf nodes represent class labels or class distributions
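One possible way to represent this structure in code is sketched below. The `Node` class and `classify` function are hypothetical names, and the small tree built here is a deliberately simplified illustration with categorical tests only; it is not one of the trees shown in this lecture. For brevity a leaf stores a single class label rather than a class distribution.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    # Internal node: the attribute being tested, and one child per test outcome.
    attribute: Optional[str] = None
    children: Dict[str, "Node"] = field(default_factory=dict)
    # Leaf node: the class label it assigns.
    label: Optional[str] = None

    def is_leaf(self) -> bool:
        return self.label is not None


def classify(node: Node, record: Dict[str, str]) -> str:
    """Walk from the given node down to a leaf, following the branch for each test outcome."""
    while not node.is_leaf():
        node = node.children[record[node.attribute]]
    return node.label


# Hypothetical two-test tree, for illustration only.
tree = Node(attribute="Refund", children={
    "Yes": Node(label="No"),
    "No": Node(attribute="MarSt", children={
        "Married": Node(label="No"),
        "Single": Node(label="Yes"),
        "Divorced": Node(label="Yes"),
    }),
})

print(classify(tree, {"Refund": "No", "MarSt": "Married"}))  # prints "No"
```

Keeping the children in a dictionary keyed by test outcome mirrors the slide's wording directly: each key is a branch, each value is the subtree that branch leads to.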
Example of a Decision Tree

Training Data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree

Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married -> NO

Splitting attributes: Refund, MarSt, TaxInc
Test outcomes: the branch labels (Yes / No, Single, Divorced / Married, < 80K / > 80K)
Class labels: the leaves (NO, YES)
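As a quick check that this tree fits the training data, the sketch below hand-codes the tree as nested conditions and runs all ten records through it. One assumption is needed: the slide labels the income branches "< 80K" and "> 80K" and does not say where exactly 80K goes, so this sketch sends it down the "> 80K" branch. The function name `classify` is hypothetical.

```python
# (Tid, Refund, Marital Status, Taxable Income in K, Cheat)
training = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]


def classify(refund, marst, taxinc_k):
    """Follow the tree from the root (Refund) down to a leaf and return its class label."""
    if refund == "Yes":
        return "No"
    if marst == "Married":
        return "No"
    # Single or Divorced: test taxable income.
    # Assumption: exactly 80K is treated as the "> 80K" branch.
    return "No" if taxinc_k < 80 else "Yes"


misclassified = [tid for tid, r, m, t, cheat in training if classify(r, m, t) != cheat]
print(misclassified)                   # prints []
print(classify("No", "Married", 80))   # the new 2012 return: prints "No"
```

The new 2012 return never reaches the income test, because the Married branch ends in a leaf first, so the boundary assumption does not affect its prediction.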
Another Example of Decision Tree

Training data: the same ten tax-return records (Tid 1 to 10) as in the previous example.

MarSt?
  Married -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

There could be more than one tree that fits the same data!
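The two example trees test the attributes in a different order, which can be compared directly in code. The sketch below hand-codes both trees (the function names `refund_first` and `marst_first` are hypothetical) and checks them against each other on every combination of categorical values and a handful of incomes around the 80K threshold. It assumes, as the slides leave this open, that exactly 80K follows the "> 80K" branch in both trees.

```python
from itertools import product


def refund_first(refund, marst, taxinc_k):
    """First example tree: Refund at the root, then MarSt, then TaxInc."""
    if refund == "Yes":
        return "No"
    if marst == "Married":
        return "No"
    return "Yes" if taxinc_k >= 80 else "No"  # assumption: 80K goes to "> 80K"


def marst_first(refund, marst, taxinc_k):
    """Second example tree: MarSt at the root, then Refund, then TaxInc."""
    if marst == "Married":
        return "No"
    if refund == "Yes":
        return "No"
    return "Yes" if taxinc_k >= 80 else "No"  # assumption: 80K goes to "> 80K"


# Every categorical combination, with incomes chosen around the threshold.
combos = product(["Yes", "No"], ["Single", "Married", "Divorced"], [60, 79, 80, 81, 220])
disagreements = [c for c in combos if refund_first(*c) != marst_first(*c)]
print(disagreements)  # prints []
```

Under that assumption these two particular trees make the same prediction for every record tried, so they differ in shape rather than in behavior. In general, different trees that fit the same training data can disagree on unseen records, which is why the choice of tree matters.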
Decision Tree Classification Task

Training Set

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Induction: Training Set -> Tree Induction algorithm -> Learn Model -> Model (Decision Tree)

Test Set

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Deduction: Model (Decision Tree) -> Apply Model -> Test Set
Apply Model to Test Data

Test Data

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of tree.

Refund?
  Yes -> NO
  No  -> MarSt?
           Married -> NO
           Single, Divorced -> TaxInc?
                                 < 80K    > 80K