04 Classification

The document discusses data modeling techniques, particularly focusing on decision trees for customer profiling and classification tasks. It outlines the process of building decision trees, including training and test sets, tree construction methods, and evaluation criteria like information gain. Additionally, it provides examples of data and explains the structure of decision trees, including nodes and attributes.


DATA MODELING

▪ sample data to get a training set and a test set
▪ utilize decision trees to generate rules for profiling customers using the training set
▪ utilize validation techniques on the test set to determine the accuracy of predictions
▪ perform sequential analysis to determine the sequence of call transactions made
▪ Classification is a data mining task of predicting the value of a categorical variable by building a model based on one or more numerical and/or categorical variables. The weather data below is the classic example: the categorical target Play? is predicted from four attributes.
Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No
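To make the task concrete, a decision tree classifier can be fit to this weather data. The following is a minimal sketch, assuming pandas and scikit-learn are available (neither library is named in the slides):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The 14 weather examples from the table above.
data = pd.DataFrame({
    "Outlook":     ["sunny", "sunny", "overcast", "rain", "rain", "rain", "overcast",
                    "sunny", "sunny", "rain", "sunny", "overcast", "overcast", "rain"],
    "Temperature": ["hot", "hot", "hot", "mild", "cool", "cool", "cool",
                    "mild", "cool", "mild", "mild", "mild", "hot", "mild"],
    "Humidity":    ["high", "high", "high", "high", "normal", "normal", "normal",
                    "high", "normal", "normal", "normal", "high", "normal", "high"],
    "Windy":       [False, True, False, False, False, True, True,
                    False, False, False, True, True, False, True],
    "Play":        ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                    "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

# One-hot encode the categorical attributes; scikit-learn needs numeric inputs.
X = pd.get_dummies(data.drop(columns="Play"))
y = data["Play"]

# criterion="entropy" selects splits by information gain, discussed below.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(clf, feature_names=list(X.columns)))
```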
▪ Decision Tree Induction
▪ Bayesian Classification
▪ Backpropagation
▪ Association Rule Mining
▪ Decision Trees are one of the most common methods to build models
▪ Intuitive appeal for users
▪ Presentation Forms
▪ “if, then” statements (decision rules)
▪ graphically - decision trees
▪ Works like a flow chart
▪ Looks like an upside down tree
▪ Nodes
▪ appear as rectangles or circles
▪ represent a test or decision
▪ Branches
▪ represent the outcome of a test
▪ terminal (leaf) nodes
▪ root node
▪ internal nodes
▪ An internal node is a test on an attribute.
▪ A branch represents an outcome of the test, e.g., Color=red.
▪ A leaf node represents a class label or class label distribution.
▪ At each node, one attribute is chosen to split the training examples into classes that are as distinct as possible.
▪ A new case is classified by following a matching path to a leaf node.
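As a sketch of how a tree reads as "if, then" decision rules, the well-known tree for the weather data (Outlook at the root, Humidity tested under sunny, Windy tested under rain) can be written as nested tests. The function name classify is illustrative, not from the slides:

```python
def classify(outlook, humidity, windy):
    """Classify one new case by following a matching path to a leaf node."""
    if outlook == "overcast":          # leaf: every overcast example plays
        return "Yes"
    if outlook == "sunny":             # internal node: test Humidity
        return "No" if humidity == "high" else "Yes"
    return "No" if windy else "Yes"    # outlook == "rain": test Windy

print(classify("sunny", "normal", False))  # -> Yes
```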
The overall workflow: a tree induction algorithm learns a model from the training set (Learn Model); the model is then applied to the test set to deduce the unknown class labels (Apply Model).

Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
▪ Top-down tree construction
▪ At start, all training examples are at the root.
▪ Partition the examples recursively by choosing one
attribute each time.
▪ Bottom-up tree pruning
▪ Remove sub-trees or branches, in a bottom-up
manner, to improve the estimated accuracy on new
cases.
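A minimal sketch of this top-down recursion, assuming examples are dicts, the class label sits under a target key, and choose_attribute is a goodness function like those listed below (all names are illustrative):

```python
from collections import Counter

def build_tree(examples, attributes, choose_attribute, target="Play"):
    """Top-down induction: start with all examples at the root, then
    partition recursively on one chosen attribute at a time."""
    labels = [e[target] for e in examples]
    # Stop when the node is pure or there is nothing left to split on.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority label
    best = choose_attribute(examples, attributes)     # goodness function
    node = {"test": best, "branches": {}}
    for value in {e[best] for e in examples}:         # one branch per outcome
        subset = [e for e in examples if e[best] == value]
        remaining = [a for a in attributes if a != best]
        node["branches"][value] = build_tree(subset, remaining,
                                             choose_attribute, target)
    return node
```

Bottom-up pruning would then walk this nested dictionary from the leaves upward, replacing any sub-tree with a leaf whenever doing so improves the estimated accuracy on new cases.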
▪ At each node, available attributes are evaluated on
the basis of separating the classes of the training
examples. A goodness function is used for this
purpose.
▪ Typical goodness functions:
▪ information gain (ID3/C4.5)
▪ information gain ratio
▪ gini index
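Sketches of the first and third of these goodness functions (entropy-based information gain and the gini index), assuming the same dict-of-examples representation as above:

```python
import math
from collections import Counter

def entropy(labels):
    """Bits needed to predict a label drawn from this distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index: chance of mislabeling under the node's distribution."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(examples, attribute, target="Play"):
    """Parent entropy minus the weighted entropy of the attribute's splits."""
    labels = [e[target] for e in examples]
    n = len(labels)
    split_entropy = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e[target] for e in examples if e[attribute] == value]
        split_entropy += len(subset) / n * entropy(subset)
    return entropy(labels) - split_entropy
```

With these, the build_tree sketch above can be driven by, e.g., choose = lambda ex, attrs: max(attrs, key=lambda a: information_gain(ex, a)).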
▪ Which is the best attribute?
▪ The one that will result in the smallest tree
▪ Heuristic: choose the attribute that produces the “purest” nodes
▪ Popular impurity criterion: information gain
▪ Information gain increases with the average purity of the subsets that an attribute produces
▪ Strategy: choose the attribute that results in the greatest information gain
▪ Information is measured in bits
▪ Given a probability distribution, the info required to predict an event is the distribution’s entropy
▪ Entropy gives the information required in bits (this can involve fractions of bits!)
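▪ Worked example (weather data above): the class distribution is 9 Yes / 5 No, so entropy = −(9/14)·log2(9/14) − (5/14)·log2(5/14) ≈ 0.940 bits. The gains are gain(Outlook) ≈ 0.247, gain(Humidity) ≈ 0.152, gain(Windy) ≈ 0.048 and gain(Temperature) ≈ 0.029 bits, so Outlook is chosen for the root split.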
▪ Watch this video: https://www.youtube.com/watch?v=_L39rN6gz7Y&t=722s
▪ Create the decision tree of the weather data shown above.