Lecture Notes For Chapter 4 Introduction To Data Mining: by Tan, Steinbach, Kumar
Classification: Definition
Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class, the task is to find a model for the class attribute as a function of the values of the other attributes, so that previously unseen records can be assigned a class as accurately as possible. A test set is used to estimate the accuracy of the model.

Examples of Classification Task
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques
Decision Tree based methods (the focus of these notes); other techniques include rule-based methods, memory-based reasoning, neural networks, naive Bayes and Bayesian belief networks, and support vector machines.
Example of a Decision Tree
Training data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class):

  Tid  Refund  Marital Status  Taxable Income  Cheat
   1   Yes     Single          125K            No
   2   No      Married         100K            No
   3   No      Single           70K            No
   4   Yes     Married         120K            No
   5   No      Divorced         95K            Yes
   6   No      Married          60K            No
   7   Yes     Divorced        220K            No
   8   No      Single           85K            Yes
   9   No      Married          75K            No
  10   No      Single           90K            Yes
Model: a decision tree. The splitting attributes are Refund, MarSt, and TaxInc:
  Refund = Yes -> leaf NO
  Refund = No -> MarSt:
    Married -> leaf NO
    Single, Divorced -> TaxInc:
      < 80K -> leaf NO
      > 80K -> leaf YES
Training Data
Another Example of Decision Tree
Using the same training data, a different tree fits equally well:
  MarSt = Married -> leaf NO
  MarSt = Single, Divorced -> Refund:
    Yes -> leaf NO
    No -> TaxInc:
      < 80K -> leaf NO
      > 80K -> leaf YES
There can be more than one tree that fits the same data.
Decision Tree Classification Task
[Figure: a tree-induction algorithm learns a decision tree (the model) from the training set; the model is then applied to the test set to deduce class labels.]
Apply Model to Test Data
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Starting at the root, the record is routed down the tree: Refund = No takes the right branch to MarSt; MarSt = Married reaches the leaf NO. Assign Cheat to "No".
Decision Tree Induction
Many algorithms:
- Hunt's Algorithm (one of the earliest)
- CART
- ID3 (Iterative Dichotomiser), C4.5 (Quinlan 1986, 1993); C5.0 / See5 is the commercial successor
- SLIQ, SPRINT (IBM, 1996)
General Structure of Hunt's Algorithm
Let Dt be the set of training records that reach a node t.
General procedure:
- If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
- If Dt is an empty set, then t is a leaf node labeled by the default class yd.
- If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
Hunt's Algorithm (illustrated on the tax-cheat data)
Step 1: a single leaf predicting the majority class, Don't Cheat.
Step 2: split on Refund — Yes -> Don't Cheat; No -> Don't Cheat.
Step 3: under Refund = No, split on Marital Status — Married -> Don't Cheat; Single, Divorced -> Cheat.
Step 4: under Single, Divorced, split on Taxable Income — < 80K -> Don't Cheat; >= 80K -> Cheat.
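The recursive procedure above can be captured in a few lines. The sketch below is a minimal, illustrative implementation of Hunt's procedure for categorical attributes, not the authors' code; it uses the Gini index (introduced later) as the impurity measure for concreteness.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def hunt(records, labels, attributes, default=None):
    """Sketch of Hunt's recursive tree growing on categorical attributes.
    records: list of dicts (attribute name -> value); labels: parallel class labels."""
    if not records:                          # Dt is empty -> leaf labelled with default class
        return {"leaf": default}
    if len(set(labels)) == 1:                # all records in one class -> leaf
        return {"leaf": labels[0]}
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:                       # no tests left -> majority-class leaf
        return {"leaf": majority}

    def weighted_gini(attr):                 # impurity of the split induced by attr
        n = len(labels)
        groups = {}
        for r, y in zip(records, labels):
            groups.setdefault(r[attr], []).append(y)
        return sum(len(g) / n * gini(g) for g in groups.values())

    attr = min(attributes, key=weighted_gini)          # greedy choice of splitting attribute
    node = {"split": attr, "children": {}}
    for value in {r[attr] for r in records}:
        subset = [(r, y) for r, y in zip(records, labels) if r[attr] == value]
        node["children"][value] = hunt([r for r, _ in subset],
                                       [y for _, y in subset],
                                       [a for a in attributes if a != attr],
                                       majority)
    return node
```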
Tree Induction
Greedy strategy: a greedy search through the space of possible decision trees, splitting the records based on an attribute test that optimizes a certain criterion.
Issues:
- Determine how to split the records: how to specify the attribute test condition, and how to determine the best split.
- Determine when to stop splitting.
How to Specify the Test Condition?
Depends on attribute types: nominal, ordinal, continuous.
Splitting Based on Nominal Attributes
Multi-way split: use as many partitions as distinct values, e.g. CarType -> {Family}, {Sports}, {Luxury}.
Binary split: divide the values into two subsets, e.g. CarType -> {Sports, Luxury} vs {Family}, or {Family, Luxury} vs {Sports}.
Splitting Based on Ordinal Attributes
Multi-way split: Size -> {Small}, {Medium}, {Large}.
Binary split: e.g. Size -> {Small, Medium} vs {Large}, or {Small} vs {Medium, Large}. A grouping such as {Small, Large} vs {Medium} violates the order and is usually not allowed.
Splitting Based on Continuous Attributes
Different ways of handling:
- Discretization to form an ordinal categorical attribute (static: discretize once at the start; dynamic: at each node).
- Binary decision: (A < v) or (A >= v); consider all possible split points v and pick the best one (can be compute intensive).
How to Determine the Best Split
Greedy approach: nodes with a homogeneous class distribution are preferred.
Need a measure of node impurity: a non-homogeneous node has a high degree of impurity; a homogeneous node has a low degree of impurity.
Measures of Node Impurity
- Gini index
- Entropy
- Misclassification error
How to Find the Best Split
Before splitting, the parent node has impurity M0. A candidate split on attribute A produces nodes N1 and N2 with impurities M1 and M2 (weighted combination M12); a split on B produces nodes N3 and N4 with impurities M3 and M4 (weighted combination M34).
Gain = M0 − M12 vs. M0 − M34: choose the attribute test with the larger gain, i.e. the lower weighted child impurity.
Measure of Impurity: GINI
Gini index for a node t:

  GINI(t) = 1 − Σ_j [ p(j | t) ]²

where p(j | t) is the relative frequency of class j at node t.

  C1 = 0, C2 = 6:  Gini = 0.000
  C1 = 1, C2 = 5:  Gini = 0.278
  C1 = 2, C2 = 4:  Gini = 0.444
  C1 = 3, C2 = 3:  Gini = 0.500
Examples for Computing GINI
  C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1  ->  Gini = 1 − 0² − 1² = 0.000
  C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6          ->  Gini = 1 − (1/6)² − (5/6)² = 0.278
  C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6          ->  Gini = 1 − (2/6)² − (4/6)² = 0.444
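A quick way to verify the Gini values above is a small helper like this sketch:

```python
def gini_index(counts):
    """Gini index of a node given per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Reproduces the example nodes above:
print(gini_index([0, 6]))   # 0.0
print(gini_index([1, 5]))   # 0.2777...
print(gini_index([2, 4]))   # 0.4444...
print(gini_index([3, 3]))   # 0.5
```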
Splitting Based on GINI
When a parent node is split into k partitions (children), the quality of the split is computed as

  GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

where n_i is the number of records at child i and n is the number of records at the parent node.
Binary Attributes: Computing the GINI Index
Splitting into two partitions; weighting the partitions favours larger and purer ones.
Parent: C1 = 6, C2 = 6, Gini = 0.500.
Split B? into Node N1 (C1 = 5, C2 = 2) and Node N2 (C1 = 1, C2 = 4):

  Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
  Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
  Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
Categorical Attributes: Computing the Gini Index
For each distinct value, gather the counts for each class, then use the count matrix to evaluate candidate splits.

Multi-way split (CarType):
             Family  Sports  Luxury
  C1            1       2       1
  C2            4       1       1
  Gini = 0.393

Two-way split ({Sports, Luxury} vs {Family}):
             {Sports, Luxury}  {Family}
  C1                3              1
  C2                2              4
  Gini = 0.400

Two-way split ({Sports} vs {Family, Luxury}):
             {Sports}  {Family, Luxury}
  C1             2              2
  C2             1              5
  Gini = 0.419
Continuous Attributes: Computing the Gini Index
For efficient computation: sort the attribute values, scan them linearly, updating the count matrix and computing the Gini index at each candidate position, then choose the split position with the smallest Gini.

  Cheat labels (sorted by Taxable Income): No, No, No, Yes, Yes, Yes, No, No, No, No
  Sorted Taxable Income values:            60, 70, 75, 85, 90, 95, 100, 120, 125, 220
  Candidate split positions:               55, 65, 72, 80, 87, 92, 97, 110, 122, 172, 230
  Gini at each split position:             0.420, 0.400, 0.375, 0.343, 0.417, 0.400, 0.300, 0.343, 0.375, 0.400, 0.420

The best split is Taxable Income <= 97 (Gini = 0.300).
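The sorted scan described above can be sketched as follows; this simplified illustration (not the book's implementation) returns the candidate threshold with the lowest weighted Gini, taking midpoints between distinct sorted values as candidates.

```python
def gini_from_counts(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_threshold(values, labels):
    """Single sorted pass over a continuous attribute.
    values: numeric attribute values; labels: parallel class labels."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    right = {c: 0 for c in classes}
    for _, y in pairs:
        right[y] += 1
    left = {c: 0 for c in classes}
    n = len(pairs)
    best = (float("inf"), None)
    for i in range(1, n):
        v_prev, y_prev = pairs[i - 1]
        left[y_prev] += 1                        # move one record from right to left
        right[y_prev] -= 1
        if pairs[i][0] == v_prev:                # only split between distinct values
            continue
        threshold = (v_prev + pairs[i][0]) / 2   # candidate split = midpoint
        g = (i / n) * gini_from_counts(list(left.values())) + \
            ((n - i) / n) * gini_from_counts(list(right.values()))
        best = min(best, (g, threshold))
    return best                                  # (weighted Gini, threshold)

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_threshold(income, cheat))             # ~ (0.30, 97.5)
```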
Alternative Splitting Criterion Based on INFO (Entropy)
Entropy at a node t:

  Entropy(t) = − Σ_j p(j | t) log₂ p(j | t)

where p(j | t) is the relative frequency of class j at node t. It measures the homogeneity of a node: maximum (log n_c) when records are equally distributed among all classes, minimum (0.0) when all records belong to one class.
Examples for Computing Entropy
  C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1  ->  Entropy = − 0 log 0 − 1 log 1 = 0
  C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6          ->  Entropy = − (1/6) log₂(1/6) − (5/6) log₂(5/6) = 0.65
  C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6          ->  Entropy = − (2/6) log₂(2/6) − (4/6) log₂(4/6) = 0.92
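The same check works for entropy; a minimal sketch using the base-2 logarithm, as in the examples above:

```python
from math import log2

def entropy(counts):
    """Entropy of a node given per-class record counts (log base 2)."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

print(entropy([0, 6]))   # 0.0
print(entropy([1, 5]))   # 0.650...
print(entropy([2, 4]))   # 0.918...
```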
Splitting Based on INFO: Information Gain

  GAIN_split = Entropy(p) − Σ_{i=1..k} (n_i / n) Entropy(i)

The parent node p is split into k partitions; n_i is the number of records in partition i. The gain measures the reduction in entropy achieved by the split; choose the split that maximizes GAIN (used in ID3 and C4.5). Disadvantage: it tends to prefer splits that result in a large number of partitions, each small but pure.
Splitting Based on INFO: Gain Ratio

  GainRATIO_split = GAIN_split / SplitINFO,   where  SplitINFO = − Σ_{i=1..k} (n_i / n) log₂(n_i / n)

The parent node is split into k partitions; n_i is the number of records in partition i. SplitINFO penalizes partitionings into many small partitions (higher entropy of the partitioning), adjusting the information gain accordingly.
Used in C4.5; designed to overcome the disadvantage of Information Gain.
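Putting the two formulas together, a sketch of information gain and gain ratio for a candidate split (child nodes given as lists of per-class counts; the function names are illustrative, not C4.5's actual code):

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def gain_and_gain_ratio(parent_counts, children_counts):
    """parent_counts: per-class counts at the parent node.
    children_counts: list of per-class counts, one entry per partition."""
    n = sum(parent_counts)
    weighted = sum(sum(ch) / n * entropy(ch) for ch in children_counts)
    gain = entropy(parent_counts) - weighted
    split_info = -sum(sum(ch) / n * log2(sum(ch) / n) for ch in children_counts)
    return gain, (gain / split_info if split_info > 0 else 0.0)

# Two-way split of a 10-record node (7 of class C1, 3 of class C2):
print(gain_and_gain_ratio([7, 3], [[3, 0], [4, 3]]))
```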
Splitting Criterion Based on Classification Error
Classification error at a node t:

  Error(t) = 1 − max_i P(i | t)

Maximum (1 − 1/n_c) when records are equally distributed among all classes; minimum (0.0) when all records belong to one class.
Examples for Computing Error
  C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1  ->  Error = 1 − max(0, 1) = 0
  C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6          ->  Error = 1 − max(1/6, 5/6) = 1/6
  C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6          ->  Error = 1 − max(2/6, 4/6) = 1/3
Misclassification Error vs Gini
Parent: C1 = 7, C2 = 3, Gini = 0.42.
Split A? into Node N1 (C1 = 3, C2 = 0) and Node N2 (C1 = 4, C2 = 3):

  Gini(N1) = 1 − (3/3)² − (0/3)² = 0
  Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
  Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342

Gini improves, even though the misclassification error stays the same (3/10 both before and after the split).
Determine When to Stop Splitting
Stop expanding a node when all of its records belong to the same class, or when all records have similar attribute values; early-termination criteria can also be used (discussed with overfitting below).
Decision Tree Based Classification
Advantages:
- Inexpensive to construct
- Extremely fast at classifying unknown records
- Easy to interpret for small-sized trees
- Accuracy is comparable to other classification techniques for many simple data sets
Example: C4.5
- Simple depth-first construction
- Uses Information Gain
- Sorts continuous attributes at each node
- Needs the entire data set to fit in memory
- Unsuitable for large data sets; needs out-of-core sorting
- You can download the software from http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
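For quick experimentation, the same ideas are available in standard libraries. A minimal sketch with scikit-learn (assuming scikit-learn is installed): this is not C4.5 itself but sklearn's CART-style learner with the entropy criterion, and the integer coding of the categorical attributes is a simplification since sklearn treats them as ordered numbers.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Tax-cheat training data: Refund (1=Yes, 0=No), MarSt (0=Single, 1=Married, 2=Divorced), Income (K)
X = [[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
     [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

clf = DecisionTreeClassifier(criterion="entropy")
clf.fit(X, y)
print(export_text(clf, feature_names=["Refund", "MarSt", "TaxInc"]))
print(clf.predict([[0, 1, 80]]))   # the earlier test record: Refund=No, Married, 80K
```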
Practical Issues of Classification
- Underfitting and Overfitting
- Missing Values
- Costs of Classification
Overfitting Example
Circular points: 0.5 <= sqrt(x1² + x2²) <= 1
Triangular points: sqrt(x1² + x2²) > 1 or sqrt(x1² + x2²) < 0.5
Underfitting and Overfitting
Underfitting: when the model is too simple, both training and test errors are large.
Overfitting: when the model becomes too complex, the training error keeps decreasing while the test error begins to increase.
Overfitting due to Insufficient Examples
Lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region: with insufficient training records in the region, the decision tree predicts the test examples using other training records that are irrelevant to the classification task.
Notes on Overfitting
- Overfitting results in decision trees that are more complex than necessary.
- Training error no longer provides a good estimate of how well the tree will perform on previously unseen records.
- Need new ways of estimating errors.
Occam's Razor
- Given two models with similar generalization errors, one should prefer the simpler model over the more complex one.
- For complex models, there is a greater chance that the model was fitted accidentally by errors in the data.
- Therefore, model complexity should be taken into account when evaluating a model.
Minimum Description Length (MDL)
[Figure: records X1 … Xn with known class labels y on one side and the same records with unknown labels on the other; a compact model (a small decision tree) is preferred as the cheapest way to transmit the labels.]
How to Address Overfitting: Post-pruning
- Grow the decision tree to its entirety.
- Trim the nodes of the decision tree in a bottom-up fashion.
- If the generalization error improves after trimming, replace the sub-tree by a leaf node.
- The class label of the leaf node is determined from the majority class of instances in the sub-tree.
Example of Post-Pruning
Before splitting (node A with Class = Yes: 20, Class = No: 10):
  Training error = 10/30
  Pessimistic error = (10 + 0.5)/30 = 10.5/30
After splitting on A into four children A1 … A4:
  Training error = 9/30
  Pessimistic error = (9 + 4 × 0.5)/30 = 11/30
Since the pessimistic error increases (11/30 > 10.5/30), PRUNE the split and keep the single leaf.
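The pruning decision above is just an arithmetic comparison; a sketch, with the 0.5 penalty per leaf following the example:

```python
def pessimistic_error(errors, leaves, n, penalty=0.5):
    """Pessimistic error estimate: (training errors + penalty per leaf) / n."""
    return (errors + penalty * leaves) / n

before = pessimistic_error(errors=10, leaves=1, n=30)   # node kept as a single leaf
after  = pessimistic_error(errors=9,  leaves=4, n=30)   # after splitting into A1..A4
print(before, after)                                    # 0.35  0.3666...
print("PRUNE" if after >= before else "KEEP SPLIT")     # PRUNE
```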
Examples of Post-Pruning
For each case, compare the optimistic (training) error and the pessimistic error of the sub-tree against the parent node to decide whether to prune.
  Case 1: left child C0: 11, C1: 3; right child C0: 2, C1: 4.
  Case 2: left child C0: 14, C1: 3; right child C0: 2, C1: 2.
Optimistic error? Pessimistic error?
Handling Missing Attribute Values
Missing values affect decision tree construction in three ways: how impurity measures are computed, how an instance with missing values is distributed to child nodes, and how a test instance with missing values is classified.
Computing the Impurity Measure with a Missing Value
Class counts by Refund (one record has Refund = ?, Class = Yes):
  Refund = Yes:  Class = Yes: 0,  Class = No: 3
  Refund = No:   Class = Yes: 2,  Class = No: 4
  Refund = ?:    Class = Yes: 1,  Class = No: 0
Before splitting: Entropy(Parent) = −0.3 log(0.3) − 0.7 log(0.7) = 0.8813
Split on Refund (using only the 9 records with a known Refund value):
  Entropy(Refund = Yes) = 0
  Entropy(Refund = No) = −(2/6) log(2/6) − (4/6) log(4/6) = 0.9183
  Entropy(Children) = 0.3 × 0 + 0.6 × 0.9183 = 0.551
  Gain = 0.9 × (0.8813 − 0.551) = 0.3303
Distribute Instances
[Figure: the training record with Refund = ? is sent down both branches of the Refund node — to Refund = Yes with weight 3/9 and to Refund = No with weight 6/9, in proportion to the records with known Refund values.]
Classify Instances
New record: Refund = ?, Marital Status = Married, Cheat = ?
Class counts at the MarSt node (the missing-Refund training record contributes fractionally):

             Married   Single   Divorced   Total
  Class=No      3         1        0          4
  Class=Yes    6/9        1        1         2.67
  Total        3.67       2        1         6.67

Probability that Marital Status = Married is 3.67/6.67; probability that Marital Status = {Single, Divorced} is 3/6.67. Following the Married branch of the tree, the new record is assigned the label NO.
Other Issues
- Data Fragmentation
- Search Strategy
- Expressiveness
- Tree Replication
Data Fragmentation
- The number of instances gets smaller as you traverse down the tree.
- The number of instances at a leaf node could be too small to make any statistically significant decision.
Search Strategy
- Finding an optimal decision tree is NP-hard.
- The algorithms presented so far use a greedy, top-down, recursive partitioning strategy to induce a reasonable solution.
- Other strategies? Bottom-up, bi-directional.
Expressiveness
Decision Boundary
The border between two neighboring regions of different classes is known as the decision boundary. For the trees considered so far, the decision boundary is parallel to the axes because each test condition involves a single attribute at a time.
Oblique Decision Trees
A test condition may involve multiple attributes, e.g. x + y < 1: Class = + on one side of the line, Class = − on the other. Such trees are more expressive, but finding good oblique test conditions is computationally expensive.
Tree Replication
[Figure: the same subtree (rooted at a test P) appears in multiple branches of the decision tree.]
Model Evaluation
- Metrics for Performance Evaluation: how to evaluate the performance of a model?
- Methods for Performance Evaluation: how to obtain reliable estimates?
- Methods for Model Comparison: how to compare the relative performance of competing models?
Metrics for Performance Evaluation: Confusion Matrix

                        PREDICTED CLASS
                        Class=Yes   Class=No
  ACTUAL   Class=Yes     a (TP)      b (FN)
  CLASS    Class=No      c (FP)      d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Most widely-used metric:

  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

using the confusion-matrix counts above.
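As a sketch, computing the confusion-matrix cells and accuracy from predictions:

```python
def confusion_counts(y_true, y_pred, positive="Yes"):
    """Return (TP, FN, FP, TN) for a two-class problem."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fn, fp, tn

def accuracy(y_true, y_pred, positive="Yes"):
    tp, fn, fp, tn = confusion_counts(y_true, y_pred, positive)
    return (tp + tn) / (tp + tn + fp + fn)

y_true = ["Yes", "Yes", "No", "No", "No"]
y_pred = ["Yes", "No", "No", "No", "Yes"]
print(confusion_counts(y_true, y_pred))   # (1, 1, 1, 2)
print(accuracy(y_true, y_pred))           # 0.6
```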
Limitation of Accuracy
Consider a 2-class problem:
  Number of Class 0 examples = 9990
  Number of Class 1 examples = 10
If a model predicts everything to be Class 0, its accuracy is 9990/10000 = 99.9%. Accuracy is misleading here because the model does not detect any Class 1 example.
Cost Matrix

                        PREDICTED CLASS
  C(i|j)                Class=Yes    Class=No
  ACTUAL   Class=Yes    C(Yes|Yes)   C(No|Yes)
  CLASS    Class=No     C(Yes|No)    C(No|No)

C(i|j): the cost of misclassifying a class j example as class i.
Computing the Cost of Classification

Cost matrix C(i|j):        PREDICTED CLASS
                             +       −
  ACTUAL        +           −1      100
  CLASS         −            1        0

Model M1:                  PREDICTED CLASS
                             +       −
  ACTUAL        +           150      40
  CLASS         −            60     250
  Accuracy = 80%, Cost = 3910

Model M2:                  PREDICTED CLASS
                             +       −
  ACTUAL        +           250      45
  CLASS         −             5     200
  Accuracy = 90%, Cost = 4255
Cost vs Accuracy

Count:                     PREDICTED CLASS
                           Class=Yes   Class=No
  ACTUAL   Class=Yes          a           b
  CLASS    Class=No           c           d
  N = a + b + c + d,   Accuracy = (a + d)/N

Cost:                      PREDICTED CLASS
                           Class=Yes   Class=No
  ACTUAL   Class=Yes          p           q
  CLASS    Class=No           q           p

Accuracy is proportional to cost if C(Yes|Yes) = C(No|No) = p and C(Yes|No) = C(No|Yes) = q:
  Cost = p (a + d) + q (b + c)
       = p (a + d) + q (N − a − d)
       = q N − (q − p)(a + d)
       = N [ q − (q − p) × Accuracy ]
Cost-Sensitive Measures

  Precision (p) = a / (a + c)
  Recall (r)    = a / (a + b)
  F-measure (F) = 2 r p / (r + p) = 2a / (2a + b + c)

  Weighted Accuracy = (w1 a + w4 d) / (w1 a + w2 b + w3 c + w4 d)
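A sketch of these measures in terms of the confusion-matrix cells a, b, c, d:

```python
def cost_sensitive_measures(a, b, c, d, w1=1, w2=1, w3=1, w4=1):
    """a=TP, b=FN, c=FP, d=TN; w1..w4 weight the four cells."""
    precision = a / (a + c)
    recall = a / (a + b)
    f_measure = 2 * recall * precision / (recall + precision)
    weighted_acc = (w1 * a + w4 * d) / (w1 * a + w2 * b + w3 * c + w4 * d)
    return precision, recall, f_measure, weighted_acc

print(cost_sensitive_measures(a=150, b=40, c=60, d=250))
```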
Methods for Performance Evaluation
How to obtain a reliable estimate of performance? The performance of a model may depend on factors other than the learning algorithm: the class distribution, the cost of misclassification, and the size of the training and test sets.
Learning Curve
A learning curve shows how accuracy changes with varying sample size. It requires a sampling schedule for creating the curve:
- Arithmetic sampling (Langley et al.)
- Geometric sampling (Provost et al.)
Effect of small sample size: bias in the estimate and variance of the estimate.
Methods of Estimation
- Holdout: reserve 2/3 for training and 1/3 for testing
- Random subsampling: repeated holdout
- Cross validation: partition the data into k disjoint subsets; k-fold: train on k−1 partitions, test on the remaining one; leave-one-out: k = n
- Stratified sampling: oversampling vs undersampling
- Bootstrap: sampling with replacement
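A minimal sketch of k-fold cross-validation on top of any fit/predict-style model; the `model_factory` argument is a hypothetical callable returning a fresh, untrained model (e.g. `lambda: DecisionTreeClassifier()`):

```python
import random

def k_fold_accuracy(X, y, model_factory, k=10, seed=0):
    """Estimate accuracy with k-fold cross-validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]             # k disjoint partitions
    scores = []
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        model = model_factory()
        model.fit([X[j] for j in train], [y[j] for j in train])
        preds = model.predict([X[j] for j in test])
        correct = sum(p == y[j] for p, j in zip(preds, test))
        scores.append(correct / len(test))
    return sum(scores) / k                             # average over the k test folds
```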
Methods for Model Comparison: ROC (Receiver Operating Characteristic)
An ROC curve characterizes the trade-off between positive hits and false alarms by plotting the true positive rate (y-axis) against the false positive rate (x-axis).
ROC Curve
Example: a 1-dimensional data set containing 2 classes (positive and negative); any point located at x > t is classified as positive.
At threshold t: TP = 0.5, FN = 0.5, FP = 0.12, TN = 0.88.
ROC Curve
(TP, FP):
- (0, 0): declare everything to be the negative class
- (1, 1): declare everything to be the positive class
- (1, 0): ideal
Diagonal line: random guessing. Below the diagonal line: the prediction is the opposite of the true class.
Using ROC for Model Comparison
Neither model consistently outperforms the other: M1 is better for small FPR, M2 is better for large FPR.
Area under the ROC curve: ideal area = 1; random guess area = 0.5.
How to Construct an ROC Curve
- Use a classifier that produces a posterior probability P(+|A) for each test instance A.
- Sort the instances in decreasing order of P(+|A): 0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25 (each instance also has a known true class, + or −).
- Apply a threshold at each unique value of P(+|A); at each threshold count TP, FP, TN, FN and compute TPR = TP/(TP+FN) and FPR = FP/(FP+TN).
- Plotting the (FPR, TPR) pairs as the threshold drops from 1.00 to 0.25 yields the ROC curve.
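A sketch of the construction: sweep the threshold over the sorted scores and collect the (FPR, TPR) pairs. The label list below is an illustrative placeholder, not the slide's actual labels.

```python
def roc_points(scores, labels, positive="+"):
    """Return (FPR, TPR) pairs obtained by thresholding at each unique score."""
    P = sum(l == positive for l in labels)
    N = len(labels) - P
    points = [(0.0, 0.0)]                      # threshold above the highest score
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    for i, (s, l) in enumerate(ranked):
        if l == positive:
            tp += 1
        else:
            fp += 1
        # emit a point only when the next score differs (tied scores share a threshold)
        if i == len(ranked) - 1 or ranked[i + 1][0] != s:
            points.append((fp / N, tp / P))
    return points

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]   # illustrative labels
print(roc_points(scores, labels))
```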
Test of Significance
Given two models M1 and M2: can we say one is better than the other? How much confidence can we place on their measured accuracies? Can the difference in performance be explained as a result of random fluctuations in the test set?
Confidence Interval for Accuracy
Each prediction can be regarded as a Bernoulli trial (correct or wrong), so the number of correct predictions x follows a binomial distribution: x ~ Bin(N, p).
Example: toss a fair coin 50 times — how many heads would turn up? Expected number of heads = N × p = 50 × 0.5 = 25.
For large N, the observed accuracy acc = x/N is approximately normal with mean p and variance p(1 − p)/N. With confidence 1 − α (the area between the critical values Z_{α/2} and Z_{1−α/2}):

  P( Z_{α/2} ≤ (acc − p) / sqrt( p(1 − p)/N ) ≤ Z_{1−α/2} ) = 1 − α

Solving this inequality for p gives a confidence interval for the true accuracy.
Confidence Interval for Accuracy: Example
Critical values of the standard normal distribution:

  1 − α :  0.99   0.98   0.95   0.90
  Z     :  2.58   2.33   1.96   1.65

For a model with observed accuracy acc = 0.8, the 95% confidence interval for its true accuracy depends on the number of test instances N:

  N        :   50     100    500    1000   5000
  p(lower) :  0.670  0.711  0.763  0.774  0.789
  p(upper) :  0.888  0.866  0.833  0.824  0.811
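The interval endpoints in the table can be reproduced by inverting the normal approximation, i.e. solving the quadratic in p. A sketch, assuming that is how the table was computed:

```python
from math import sqrt

def accuracy_interval(acc, n, z=1.96):
    """Confidence interval for the true accuracy p, given observed acc on n test records
    (solves the normal-approximation inequality for p)."""
    z2 = z * z
    centre = 2 * n * acc + z2
    spread = z * sqrt(z2 + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z2)
    return (centre - spread) / denom, (centre + spread) / denom

for n in (50, 100, 500, 1000, 5000):
    lo, hi = accuracy_interval(0.8, n)
    print(n, round(lo, 3), round(hi, 3))   # reproduces the p(lower)/p(upper) table above
```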
Comparing the Performance of Two Models
Given two models M1 and M2, tested on independent test sets D1 (size n1) and D2 (size n2) with observed error rates e1 and e2. If n1 and n2 are sufficiently large, the error rates are approximately normal:

  e1 ~ N(μ1, σ1),   e2 ~ N(μ2, σ2)

with the variances approximated by

  σ̂_i² = e_i (1 − e_i) / n_i
To test whether the performance difference d = e1 − e2 is statistically significant, note that d is approximately normal with variance (since D1 and D2 are independent)

  σ̂_d² = e1 (1 − e1)/n1 + e2 (1 − e2)/n2

At the (1 − α) confidence level, the true difference lies in the interval

  d_t = d ± Z_{α/2} σ̂_d
An Illustrative Example
Given: M1 with n1 = 30, e1 = 0.15; M2 with n2 = 5000, e2 = 0.25.
d = |e2 − e1| = 0.1 (2-sided test)

  σ̂_d² = 0.15(1 − 0.15)/30 + 0.25(1 − 0.25)/5000 = 0.0043

At the 95% confidence level, Z_{α/2} = 1.96:

  d_t = 0.100 ± 1.96 × sqrt(0.0043) = 0.100 ± 0.128

The interval contains 0, so the difference may not be statistically significant.
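The same computation as a sketch:

```python
from math import sqrt

def difference_interval(e1, n1, e2, n2, z=1.96):
    """Confidence interval for the true difference in error rates of two models
    tested on independent test sets."""
    d = abs(e1 - e2)
    var = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    half = z * sqrt(var)
    return d - half, d + half

lo, hi = difference_interval(0.15, 30, 0.25, 5000)
print(round(lo, 3), round(hi, 3))                               # about -0.028 .. 0.228
print("significant" if lo > 0 else "not significant at 95%")    # interval contains 0
```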
Comparing the Performance of Two Algorithms
Each learning algorithm may produce k models, for example one per fold when the algorithms are trained and tested on the same k test sets. For each test set j, compute the difference in error rates d_j = e_{1j} − e_{2j}. The d_j have mean d_t and variance σ_t², estimated by

  σ̂_t² = Σ_{j=1..k} (d_j − d̄)² / ( k (k − 1) )

At the (1 − α) confidence level, using the t-distribution with k − 1 degrees of freedom:

  d_t = d̄ ± t_{1−α, k−1} σ̂_t
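For two algorithms compared over the same k folds, a sketch of the paired interval; the per-fold error rates are hypothetical, and the t critical value is passed in rather than looked up to keep the example dependency-free.

```python
from math import sqrt

def paired_difference_interval(errors1, errors2, t_crit):
    """errors1, errors2: per-fold error rates of two algorithms on the same k folds.
    t_crit: critical value of the t-distribution with k-1 degrees of freedom."""
    k = len(errors1)
    diffs = [a - b for a, b in zip(errors1, errors2)]
    d_bar = sum(diffs) / k
    var = sum((d - d_bar) ** 2 for d in diffs) / (k * (k - 1))
    half = t_crit * sqrt(var)
    return d_bar - half, d_bar + half

# Hypothetical per-fold error rates for two algorithms over k = 5 folds:
e1 = [0.20, 0.22, 0.18, 0.25, 0.21]
e2 = [0.24, 0.26, 0.23, 0.27, 0.25]
print(paired_difference_interval(e1, e2, t_crit=2.776))   # 95% interval, 4 degrees of freedom
```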