
Decision Trees

▪ Decision Tree is one of the most widely used and practical methods of inductive inference
▪ Features
▪ Method for approximating discrete-valued functions (including Boolean)
▪ Learned functions are represented as decision trees (or if-then-else rules)
▪ Expressive hypothesis space, including disjunction
▪ Robust to noisy data

In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. As the name suggests, it uses a tree-like model of decisions.

When to use Decision Trees
▪ Problem characteristics:
▪ Instances can be described by attribute-value pairs
▪ Target function is discrete valued
▪ Disjunctive hypothesis may be required
▪ Possibly noisy training data samples
▪ Robust to errors in training data
▪ Missing attribute values
▪ Typical classification problems:
▪ Equipment or medical diagnosis
▪ Credit risk analysis
▪ Several tasks in natural language processing

Top-down induction of Decision Trees
▪ ID3 (Quinlan, 1986) is a basic algorithm for learning DTs
▪ Given a training set of examples, the algorithm for building a DT performs a search in the space of decision trees
▪ The construction of the tree is top-down. The algorithm is greedy.
▪ The fundamental question is "which attribute should be tested next? Which question gives us more information?"
▪ Select the best attribute
▪ A descendant node is then created for each possible value of this attribute, and the examples are partitioned according to this value
▪ The process is repeated for each successor node until all the examples are classified correctly or there are no attributes left (a minimal sketch of the procedure follows)
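A rough sketch of this greedy, top-down procedure in Python. The Node class, the representation of examples as dictionaries, and the externally supplied best_attribute scorer (e.g., information gain, defined later) are illustrative assumptions, not part of the slides:

from collections import Counter

class Node:
    def __init__(self, attribute=None, label=None):
        self.attribute = attribute    # attribute tested at this node (None for a leaf)
        self.label = label            # class label (only meaningful for leaves)
        self.children = {}            # attribute value -> child Node

def id3(examples, target, attributes, best_attribute):
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                    # all examples have the same class
        return Node(label=labels[0])
    if not attributes:                           # no attributes left: majority vote
        return Node(label=Counter(labels).most_common(1)[0][0])
    best = best_attribute(examples, target, attributes)   # greedy choice of the test
    node = Node(attribute=best)
    for value in {ex[best] for ex in examples}:  # one branch per observed value
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        node.children[value] = id3(subset, target, remaining, best_attribute)
    return node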

Which attribute is the best classifier?
▪ A statistical property called information gain measures how well a given attribute separates the training examples
▪ Information gain uses the notion of entropy, commonly used in information theory
▪ Information gain = expected reduction of entropy (definitions below)
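For reference, the standard definitions behind these two quantities, in the notation used in the example that follows:

Entropy(S) = − Σ_i p_i log₂ p_i,   where p_i is the proportion of examples of class i in S
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)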

Example: expected information gain
▪ Let
▪ Values(Wind) = {Weak, Strong}
▪ S = [9+, 5−]
▪ S_Weak = [6+, 2−]
▪ S_Strong = [3+, 3−]
▪ Information gain due to knowing Wind:
Gain(S, Wind) = Entropy(S) − 8/14 Entropy(S_Weak) − 6/14 Entropy(S_Strong)
= 0.94 − 8/14 × 0.811 − 6/14 × 1.00
= 0.048 (checked numerically below)
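A quick check of this arithmetic in Python (the helper is a straightforward two-class entropy; values agree with the slide up to rounding):

from math import log2

def entropy(pos, neg):
    # Entropy of a node with pos positive and neg negative examples.
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * log2(p)
    return result

e_s = entropy(9, 5)        # ≈ 0.940
e_weak = entropy(6, 2)     # ≈ 0.811
e_strong = entropy(3, 3)   # = 1.000
gain = e_s - 8/14 * e_weak - 6/14 * e_strong
print(round(gain, 3))      # 0.048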

−(2/5) log₂(2/5) − (3/5) log₂(3/5)   and   −(3/5) log₂(3/5) − (2/5) log₂(2/5)   (entropy of a [2+, 3−] or [3+, 2−] subset, ≈ 0.971 in both cases)

First step: which attribute to test at the root?
▪ Which attribute should be tested at the root?
▪ Gain(S, Outlook) = 0.246
▪ Gain(S, Humidity) = 0.151
▪ Gain(S, Wind) = 0.048
▪ Gain(S, Temperature) = 0.029
▪ Outlook provides the best prediction for the target
▪ Let's grow the tree:
▪ add to the tree a successor for each possible value of Outlook
▪ partition the training samples according to the value of Outlook

After first step

Second step
▪ Working on the Outlook = Sunny node:
Gain(S_Sunny, Humidity) = 0.970 − 3/5 × 0.0 − 2/5 × 0.0 = 0.970
Gain(S_Sunny, Wind) = 0.970 − 2/5 × 1.0 − 3/5 × 0.918 = 0.019
Gain(S_Sunny, Temp.) = 0.970 − 2/5 × 0.0 − 2/5 × 1.0 − 1/5 × 0.0 = 0.570
▪ Humidity provides the best prediction for the target
▪ Let's grow the tree:
▪ add to the tree a successor for each possible value of Humidity
▪ partition the training samples according to the value of Humidity

Second and third steps

[Figure: the tree after the second and third steps; leaf example sets and labels: {D1, D2, D8} → No, {D9, D11} → Yes, {D4, D5, D10} → Yes, {D6, D14} → No]

Other Splitting Criterion: GINI Index
The Gini index (Gini impurity) measures the probability that a randomly selected element of a node would be classified incorrectly if it were labeled according to the class distribution in that node. If all the elements belong to a single class, the node is called pure. GINI index for a given node t:

GINI(t) = 1 − Σ_j [p(j | t)]²

For a split of a node with n instances into k children, where child i contains n_i instances:

GINI_split = Σ_{i=1}^{k} (n_i / n) · GINI(i)
Note: the "information gain" reported on this slide is computed as the weighted GINI index of the split.

The Gini index measures how often a randomly chosen element would be incorrectly identified, so an attribute with a lower Gini index should be preferred (see the sketch below).
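A short sketch of both quantities in Python (the per-class count representation and the example numbers are illustrative):

def gini(counts):
    # Gini impurity of a node from its per-class example counts.
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(children_counts):
    # Weighted Gini of a split; children_counts is a list of per-class count lists.
    n = sum(sum(c) for c in children_counts)
    return sum(sum(c) / n * gini(c) for c in children_counts)

print(gini([9, 5]))                    # ≈ 0.459  (impurity before the split)
print(gini_split([[6, 2], [3, 3]]))    # ≈ 0.429  (weighted impurity after the split)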
How to Specify Test Condition?
▪ Depends on attribute types
– Nominal
– Ordinal
– Continuous
▪ Depends on number of ways to split
– Binary split
– Multi-way split
Splitting Based on Nominal Attributes
▪ Multi-way split: use as many partitions as there are distinct values
[Example: CarType splits into Family, Sports, and Luxury branches]
▪ Binary split: divide the values into two subsets
[Example: CarType splits into {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}]

Need to find the optimal partitioning! (one way to enumerate the candidate binary partitions is sketched below)
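A small sketch of how the candidate binary partitions of a nominal attribute could be enumerated (the function name and example values are illustrative; a k-valued attribute has 2^(k−1) − 1 such partitions):

from itertools import combinations

def binary_partitions(values):
    # Enumerate the distinct two-subset partitions of a nominal attribute's values.
    values = list(values)
    rest = values[1:]
    parts = []
    # Fix the first value in the left subset so each partition is counted once.
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {values[0], *combo}
            right = set(values) - left
            if right:                      # skip the trivial partition with an empty side
                parts.append((left, right))
    return parts

print(binary_partitions(["Family", "Sports", "Luxury"]))
# 3 partitions: {Family}|{Sports, Luxury}, {Family, Sports}|{Luxury}, {Family, Luxury}|{Sports}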


Splitting Based on Continuous Attributes

• Different ways of handling
– Multi-way split: form an ordinal categorical attribute
• Static – discretize once at the beginning
• Dynamic – repeat discretization on each new partition
– Binary split: (A < v) or (A ≥ v)
• How to choose v? (see the sketch below)

Need to find the optimal partitioning!

Can use GAIN or GINI!
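One common way to choose v is to sort the observed values and score each midpoint between consecutive distinct values; a rough sketch using weighted entropy (the helper names and the example data are illustrative, not from the slides):

from collections import Counter
from math import log2

def entropy_of(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    # Candidate thresholds are midpoints between consecutive distinct sorted values.
    pairs = sorted(zip(values, labels))
    best_v, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        v = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for val, lab in pairs if val < v]
        right = [lab for val, lab in pairs if val >= v]
        # Weighted entropy of the binary split (A < v) vs (A >= v); lower is better.
        score = (len(left) * entropy_of(left) + len(right) * entropy_of(right)) / len(pairs)
        if score < best_score:
            best_v, best_score = v, score
    return best_v

# Example with made-up humidity readings and Yes/No labels
print(best_threshold([65, 70, 75, 80, 85, 90], ["Yes", "Yes", "Yes", "No", "No", "No"]))  # 77.5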


Overfitting and Underfitting

• Overfitting:
– Given a model space H, a specific model h ∈ H is said to overfit the training data if there exists some alternative model h′ ∈ H such that h has smaller error than h′ over the training examples, but h′ has smaller error than h over the entire distribution of instances
• Underfitting:
– The model is too simple, so both training and test errors are large
Detecting Overfitting
[Figure: training and test error as model complexity grows, with the underfitting and overfitting regions marked]
Overfitting in Decision Tree Learning
▪ Overfitting results in decision trees that are more complex than necessary
– Tree growth went too far
– The number of instances gets smaller as we build the tree (e.g., several leaves match a single example)
▪ Training error no longer provides a good estimate of how well the tree will perform on previously unseen records
Avoiding Tree Overfitting – Solution 1
▪ Pre-Pruning (Early Stopping Rule)
– Stop the algorithm before it becomes a fully-grown tree
– Typical stopping conditions for a node:
▪ Stop if all instances belong to the same class
▪ Stop if all the attribute values are the same
– More restrictive conditions (commonly exposed as hyperparameters, as illustrated below):
▪ Stop if the number of instances is less than some user-specified threshold
▪ Stop if the class distribution of the instances is independent of the available features
▪ Stop if expanding the current node does not improve the impurity measure (e.g., GINI or GAIN)
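For illustration only, a hedged example of how such early-stopping thresholds appear as hyperparameters in scikit-learn (not part of these slides; the parameter values are arbitrary):

from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: tree growth stops as soon as any of these thresholds is hit.
clf = DecisionTreeClassifier(
    criterion="gini",           # or "entropy" for information gain
    max_depth=4,                # hard limit on tree depth
    min_samples_split=20,       # do not split nodes with fewer instances than this
    min_impurity_decrease=0.01  # do not split if impurity barely improves
)
# clf.fit(X_train, y_train)     # X_train, y_train assumed to be available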
Avoiding Tree Overfitting – Solution 2
▪ Post-pruning
– Split the dataset into training and validation sets
– Grow the full decision tree on the training set
– While the accuracy on the validation set increases:
▪ Evaluate the impact of pruning each subtree, replacing its root by a leaf labeled with the majority class for that subtree
▪ Replace the subtree whose removal most increases validation-set accuracy (greedy approach; a sketch follows)
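A rough sketch of this reduced-error pruning loop, assuming the Node class from the ID3 sketch earlier and validation examples stored as dictionaries (all helper names here are illustrative, not part of the slides):

from collections import Counter

def classify(node, example):
    # Follow branches until a leaf; an unseen attribute value yields None (counted as an error).
    while node.attribute is not None:
        node = node.children.get(example[node.attribute])
        if node is None:
            return None
    return node.label

def accuracy(tree, examples, target):
    return sum(classify(tree, ex) == ex[target] for ex in examples) / len(examples)

def leaf_labels(node):
    if node.attribute is None:
        return [node.label]
    return [lab for child in node.children.values() for lab in leaf_labels(child)]

def internal_nodes(node):
    if node.attribute is None:
        return []
    return [node] + [n for child in node.children.values() for n in internal_nodes(child)]

def reduced_error_prune(tree, validation, target):
    # Greedily replace subtrees by majority-class leaves while validation accuracy improves.
    while True:
        best_node, best_acc, best_label = None, accuracy(tree, validation, target), None
        for node in internal_nodes(tree):
            majority = Counter(leaf_labels(node)).most_common(1)[0][0]
            saved = (node.attribute, node.children, node.label)
            node.attribute, node.children, node.label = None, {}, majority   # tentative prune
            acc = accuracy(tree, validation, target)
            node.attribute, node.children, node.label = saved                # undo
            if acc > best_acc:
                best_node, best_acc, best_label = node, acc, majority
        if best_node is None:
            return tree
        best_node.attribute, best_node.children, best_node.label = None, {}, best_label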
Decision Tree Based Classification
▪ Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Good accuracy
▪ Disadvantages:
– Axis-parallel decision boundaries
– Redundancy
– Need data to fit in memory
– Need to retrain with new data
Assignment
