Session 5b: Classification by Decision Tree Induction

DECISION TREE CLASSIFICATION

1
INTRODUCTION

Welcome back. In this session, we shall learn the basic algorithm for learning
decision trees; attribute selection measures (information gain, gain ratio, and
Gini index); and tree pruning.

2
LEARNING OUTCOMES

By the end of this session, you should be able to:

• describe a basic algorithm for learning decision trees.
• apply attribute selection measures to select the attribute that best
  partitions the tuples into distinct classes.
• apply pruning algorithms to improve accuracy by removing tree branches that
  reflect noise (or outliers) in the training data.

3
DECISION TREES

• A decision tree is a tree-structured classifier where:
  – each internal node (non-leaf node) denotes a test on an attribute;
  – each branch represents an outcome of the test;
  – each leaf node (or terminal node) holds a class label;
  – the topmost node in the tree is the root node.

• It is called a decision tree because, like a tree, it starts from the root
  node and expands into further branches, forming a tree-like structure.

4
DECISION TREES

Figure 1. The general form of a decision tree.

5
DECISION TREES

• All the internal nodes represent features of the dataset (features are
  attributes).

• A decision tree simply asks a question and, based on the answer (Yes/No),
  splits further into subtrees.

6
DECISION TREES

Figure 2. An example of a decision tree.

7
DECISION TREES

• A decision tree is a supervised learning technique.

• It can be used for both classification and regression problems, but it is
  mostly preferred for solving classification problems.

• To build such a tree, we can use the CART algorithm, which stands for
  Classification and Regression Trees.

8
HOW DOES THE DECISION TREE
ALGORITHM WORK?
• Given a tuple X (a set of attribute values whose class label is unknown),
  the attribute values of X are tested against the decision tree.
1. The decision tree algorithm compares the value of the root attribute with
   the corresponding attribute of the record (from the real dataset) and,
   based on the comparison, follows the branch and jumps to the next node.
2. At the next node, the algorithm again compares the attribute value with
   the node's branches and moves further.
3. The process continues until a leaf node of the tree is reached (a minimal
   code sketch of this traversal follows below).
9
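To make the traversal concrete, here is a minimal, hypothetical sketch (not taken from the slides): the tree is stored as nested Python dicts, internal nodes test one attribute, branches are keyed by attribute values, and leaves hold class labels. The attribute names mirror the job-offer example and are assumptions for illustration only.

# Toy decision tree: internal nodes test an attribute, leaves are class labels.
# The structure and attribute names are illustrative assumptions.
tree = {
    "attribute": "salary",
    "branches": {
        "low": "decline",                      # leaf node
        "high": {
            "attribute": "distance",
            "branches": {
                "near": "accept",
                "far": {
                    "attribute": "cab_facility",
                    "branches": {"yes": "accept", "no": "decline"},
                },
            },
        },
    },
}

def classify(node, record):
    """Walk the tree: test the node's attribute against the record's value,
    follow the matching branch, and repeat until a leaf (a class label)."""
    while isinstance(node, dict):              # still at an internal node
        value = record[node["attribute"]]      # compare the attribute value
        node = node["branches"][value]         # jump to the next node
    return node                                # leaf: the predicted class

print(classify(tree, {"salary": "high", "distance": "far", "cab_facility": "yes"}))
# -> accept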
HOW DOES THE DECISION TREE
ALGORITHM WORK?
Example: Suppose a candidate has a job offer and wants to decide whether to
accept it or not. To solve this problem, the decision tree starts with the
root node (the Salary attribute, chosen by an attribute selection measure).
The root node splits into a decision node (Distance from the office) and a
leaf node, based on the corresponding labels. That decision node splits
further into another decision node (Cab facility) and a leaf node. Finally,
the last decision node splits into two leaf nodes (Accepted offer and
Declined offer).
10
DECISION TREES

Figure 3. An example of the decision tree process.

11
DECISION TREES
History

• In the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine
  learning, developed a decision tree algorithm known as ID3 (Iterative
  Dichotomiser).
• ID3 is so named because the algorithm iteratively (repeatedly) dichotomizes
  (divides) the features into two or more groups at each step.
• ID3 uses a top-down greedy approach to build a decision tree: we start
  building the tree from the top, and the greedy approach means that at each
  iteration we select the feature that looks best at the present moment to
  create a node.
12
DECISION TREES
History

• Quinlan later presented C4.5 (a successor of ID3), which became a benchmark
  to which newer supervised learning algorithms are often compared.
• C4.5 is an extension of Quinlan's earlier ID3 algorithm.
• The decision trees generated by C4.5 can be used for classification, and
  for this reason C4.5 is often referred to as a statistical classifier.

13
DECISION TREES
History

• In 1984, statisticians L. Breiman, J. Friedman, R. Olshen, and C. Stone
  published the book Classification and Regression Trees (CART), which
  described the generation of binary decision trees.
• Classification and Regression Trees (CART) refers to decision tree
  algorithms that can be used for classification or regression predictive
  modeling problems.
• Classically, this algorithm is referred to as "decision trees", but on some
  platforms, such as R, it is referred to by the more modern term CART.
14
DECISION TREES
History

• ID3, C4.5, and CART adopt a greedy (i.e., non-backtracking) approach in
  which decision trees are constructed in a top-down, recursive,
  divide-and-conquer manner.
• Most algorithms for decision tree induction follow such a top-down
  approach, which starts with a training set of tuples and their associated
  class labels.
• The training set is recursively partitioned into smaller subsets as the
  tree is built.
15
THE DECISION TREE ALGORITHM
How to generate a decision tree
• Algorithm: Generate a decision tree from the training tuples of data
  partition D.
Where:
• D is a set of training tuples and their associated class labels;
• attribute list is the set of candidate attributes;
• Attribute selection method is a procedure to determine the splitting
  criterion that "best" partitions the data tuples into individual classes.
Note: This criterion consists of a splitting attribute and, possibly, either
a split-point or a splitting subset.
16
THE DECISION TREE ALGORITHM

17
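The full pseudocode for this slide is not reproduced here; the following is a minimal Python sketch of the same top-down, greedy, divide-and-conquer idea, using the inputs listed on the previous slide. The function attribute_selection_method is assumed to be supplied by the caller (for example, one that scores attributes by information gain); all names are illustrative, not the textbook pseudocode.

from collections import Counter

def build_tree(data, attribute_list, attribute_selection_method):
    """data: list of (attribute_value_dict, class_label) pairs (partition D).
    Returns a nested-dict tree like the one used in the traversal sketch."""
    labels = [label for _, label in data]

    # Stopping criteria: the partition is pure, or no candidate attributes remain.
    if len(set(labels)) == 1:
        return labels[0]                                  # leaf: single class
    if not attribute_list:
        return Counter(labels).most_common(1)[0][0]       # leaf: majority class

    # Greedy step: choose the attribute that "best" partitions the data now.
    best = attribute_selection_method(data, attribute_list)
    remaining = [a for a in attribute_list if a != best]

    # Recursive divide-and-conquer: grow one subtree per outcome of the test.
    branches = {}
    for value in {row[best] for row, _ in data}:
        subset = [(row, label) for row, label in data if row[best] == value]
        branches[value] = build_tree(subset, remaining, attribute_selection_method)
    return {"attribute": best, "branches": branches}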
ATTRIBUTE SELECTION MEASURES

• While implementing a decision tree, the main issue that arises is how to
  select the best attribute for the root node and for the sub-nodes.

• To solve this problem, we use a technique called an attribute selection
  measure (ASM).

• An attribute selection measure is a heuristic for selecting the splitting
  criterion that "best" separates a given data partition, D, of class-labeled
  training tuples into individual classes.

18
ATTRIBUTE SELECTION MEASURES

• If we were to split D into smaller partitions according to the outcomes of
  the splitting criterion, ideally each partition would be pure (i.e., all the
  tuples that fall into a given partition would belong to the same class).

• Conceptually, the "best" splitting criterion is the one that most closely
  results in such a scenario.

• Attribute selection measures are also known as splitting rules because they
  determine how the tuples at a given node are to be split.

19
ATTRIBUTE SELECTION MEASURES

• The attribute selection measure provides a ranking for each attribute
  describing the given training tuples.

• The attribute having the best score for the measure is chosen as the
  splitting attribute for the given tuples.

• If the splitting attribute is continuous-valued or if we are restricted to
  binary trees, then, respectively, either a split point or a splitting
  subset must also be determined as part of the splitting criterion.

20
ATTRIBUTE SELECTION MEASURES

• There are three popular attribute selection measures:
  – information gain
  – gain ratio
  – Gini index

21
ATTRIBUTE SELECTION MEASURES

Information Gain
• Information gain measures the change in entropy after a dataset is
  segmented on an attribute.
• It calculates how much information a feature provides about a class.
• Based on the value of information gain, we split the node and build the
  decision tree.
• A decision tree algorithm always tries to maximize information gain, and
  the node/attribute with the highest information gain is split first.

22
ATTRIBUTE SELECTION MEASURES

Information Gain
• ID3 uses information gain as its attribute selection
measure.
• This measure is based on pioneering work by Claude
Shannon on information theory, which studied the
value or “information content” of messages.
• Let node N represent (hold) the tuples of partition D. The attribute with
  the highest information gain is chosen as the splitting attribute for
  node N.

23
ATTRIBUTE SELECTION MEASURES

Information Gain
• The expected information (entropy) needed to classify a tuple in D is given
  by:

  Info(D) = − Σᵢ pᵢ log₂(pᵢ),   summing over the m classes C₁, …, Cₘ

• where
• pᵢ is the nonzero probability that an arbitrary tuple in D belongs to
  class Cᵢ
• Info(D) is just the average amount of information needed to identify the
  class label of a tuple in D.
24
ATTRIBUTE SELECTION MEASURES

Information Gain
• Information gain is the decrease in entropy.
• It computes the difference between the entropy before the split and the
  weighted average entropy after the split of the dataset on the given
  attribute's values.

• In the following example, we will calculate the entropy before and after
  the split and then compute the information gain.

25
ATTRIBUTE SELECTION MEASURES
Information Gain
Example: The entropy for one attribute is calculated as follows:

  E(S) = − Σᵢ pᵢ log₂(pᵢ)

where S is the current state, and pᵢ is the probability of an event i of
state S, i.e., the percentage of class i in a node of state S.
26
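As a quick illustration (the class counts here are assumed, not from the slides), the entropy of a node can be computed directly from its class counts:

from math import log2

def entropy(class_counts):
    """E(S) = -sum(p_i * log2(p_i)) over the classes present in S."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

# Example: a node holding 9 tuples of one class and 5 of another.
print(round(entropy([9, 5]), 3))   # ~0.940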
ATTRIBUTE SELECTION MEASURES
Information Gain
The entropy for multiple attributes (i.e., the weighted entropy of the
dataset after splitting on a selected attribute) is calculated as:

  E(T, X) = Σ_c P(c) · E(c),   summing over the values c of attribute X

where T is the current state, X is the selected attribute, P(c) is the
proportion of tuples taking value c, and E(c) is the entropy of the
corresponding subset.

27
ATTRIBUTE SELECTION MEASURES
Information Gain
Example - Entropy for multiple attributes is calculated
as:

28
ATTRIBUTE SELECTION MEASURES
Information Gain

Therefore,

29
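As a stand-in for the worked example above, here is a minimal sketch that computes the entropy before the split, the weighted entropy after the split, and the resulting information gain. The dataset (14 tuples, 9 positive and 5 negative, split by one attribute into three subsets) is a hypothetical illustration, not the figure from the original slides.

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Hypothetical split: 14 tuples (9 positive, 5 negative) partitioned by one
# attribute into three subsets with these (positive, negative) counts.
subsets = [(2, 3), (4, 0), (3, 2)]

info_before = entropy([9, 5])                                       # Info(D)
total = sum(sum(s) for s in subsets)
info_after = sum((sum(s) / total) * entropy(s) for s in subsets)    # E(T, X)
gain = info_before - info_after                                     # information gain

print(round(info_before, 3), round(info_after, 3), round(gain, 3))
# ~0.940  ~0.694  ~0.247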
ATTRIBUTE SELECTION MEASURES

Gain Ratio
• Information gain is biased towards choosing attributes with a large number
  of values as root nodes.
• That is, it prefers attributes with many distinct values.
• C4.5, an improvement of ID3, uses the gain ratio, a modification of
  information gain that reduces this bias and is usually the better option.

30
ATTRIBUTE SELECTION MEASURES
Gain Ratio
• Gain ratio overcomes the problem with information gain by taking into
  account the number of branches that would result before making the split.
• It corrects information gain by dividing it by the intrinsic (split)
  information of the split:

  Gain Ratio = Gain(before, after) / SplitInfo
  SplitInfo = − Σⱼ (|j, after| / |before|) · log₂(|j, after| / |before|),
              summing over j = 1, …, K

  – where "before" is the dataset before the split, K is the number of
    subsets generated by the split, and (j, after) is subset j after the
    split.
31
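A small sketch of this correction, continuing the same hypothetical split used earlier (subset sizes and gain are assumptions carried over from that illustration):

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Same hypothetical split as before: subset sizes after splitting D (|D| = 14).
subset_sizes = [5, 4, 5]
gain = 0.247                     # information gain of this split (from earlier)

# Intrinsic (split) information: the entropy of the partition sizes themselves.
split_info = entropy(subset_sizes)
gain_ratio = gain / split_info

print(round(split_info, 3), round(gain_ratio, 3))   # ~1.577  ~0.157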
ATTRIBUTE SELECTION MEASURES

Gini Index
• The Gini index is a cost function used to evaluate splits in the dataset.
• It is calculated by subtracting the sum of the squared class probabilities
  from one:

  Gini(D) = 1 − Σᵢ pᵢ²

• It favors larger partitions and is easy to implement, whereas information
  gain favors smaller partitions with many distinct values.

32
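A minimal sketch of the calculation described above (the class counts are assumed for illustration):

def gini(class_counts):
    """Gini(D) = 1 - sum(p_i^2) over the classes in partition D."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

# A pure node has Gini 0; a balanced two-class node has the maximum, 0.5.
print(gini([10, 0]), gini([5, 5]), round(gini([9, 5]), 3))   # 0.0  0.5  ~0.459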
ATTRIBUTE SELECTION MEASURES

Other attribute selection measures include:
• Reduction in Variance
• Chi-Square

33
Overfitting Problem in Decision Trees

• A common problem with decision trees, especially when the dataset has many
  columns (features), is that they tend to overfit.
• Sometimes it looks as if the tree has memorized the training data set.
• If no limit is set on a decision tree, it can reach 100% accuracy on the
  training data set because, in the worst case, it will end up creating one
  leaf for each observation.

34
Overfitting Problem in Decision Trees
• Overfitting hurts accuracy when predicting samples that are not part of the
  training set.
• Because a problem usually has a large set of features, it results in a
  large number of splits, which in turn gives a huge tree.
• Such trees are complex and can lead to overfitting. But when do we stop
  splitting/growing the tree?

• We will look at two ways to reduce overfitting:
  – pruning decision trees
  – random forests
35
Pruning Decision Trees

• Pruning refers to the removal of those branches of the decision tree that
  do not contribute significantly to the decision process.
• Pruning helps us avoid overfitting the regression or classification model,
  so that measurement errors in a small sample of data are not built into the
  model.
• This can be done using any of the attribute selection measures discussed
  above, such as information gain, where the branch with the least
  information gain is the least significant.
36
Random Forest
• Random forest is an example of ensemble learning, in which we combine
  multiple machine learning models to obtain better predictive performance.
• A technique known as bagging is used to create an ensemble of trees, where
  multiple training sets are generated by sampling with replacement.
• In the bagging technique, the data set is divided into N samples using
  randomized sampling with replacement.
• Then, using a single learning algorithm, a model is built on each sample.
• Finally, the resulting predictions are combined, in parallel, using voting
  or averaging.
37
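A short sketch of the ensemble idea using scikit-learn's RandomForestClassifier, which bags decision trees (and additionally samples random feature subsets at each split). The synthetic dataset is purely illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single, fully grown tree versus an ensemble of 100 bagged trees.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("single tree  :", round(tree.score(X_test, y_test), 3))
print("random forest:", round(forest.score(X_test, y_test), 3))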
DECISION TREES
Application Areas

• Decision tree algorithms have been used for classification in many
  application areas, such as medicine, manufacturing and production,
  financial analysis, astronomy, and molecular biology.

• Decision trees are the basis of several commercial rule induction systems.

38
DECISION TREES
Why are decision tree classifiers so popular?
• The construction of decision tree classifiers does not
require any domain knowledge or parameter setting,
and therefore is appropriate for exploratory
knowledge discovery.
• Decision trees can handle multidimensional data.
• Their representation of acquired knowledge in tree
form is intuitive and generally easy to assimilate by
humans.
• The learning and classification steps of decision tree
induction are simple and fast.
• Decision tree classifiers have good accuracy.
39
SUMMARY

You have come to the end of this session on decision trees. In this session,
you learnt the basic algorithm for learning decision trees; attribute
selection measures (information gain, gain ratio, and Gini index); and tree
pruning. In our next session, you will learn about Bayesian classification.

40
THANK YOU

41
REFERENCES

1. Han, J., Kamber, M. and Pei, J. (2011). Data Mining: Concepts and
   Techniques, Third Edition. The Morgan Kaufmann Series in Data Management
   Systems. Morgan Kaufmann.

42
