Lecture - 3 Classification (Decision Tree)

The document discusses decision trees (DT): what they are, why they are useful, and how they work. It provides examples to illustrate key concepts, such as: DTs use splitting algorithms that recursively separate a dataset based on the values of features; information theory concepts like entropy and information gain are used to determine the optimal features to split on; and DTs output easy-to-understand classification rules and can handle both numeric and nominal data.


Classification: Decision Tree (DT)
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2020)
Outline

 What is the decision tree (DT) algorithm
 Why we need DT
 The Pros and Cons of DT
 Information Theory
 Some Issues in DT
 Assignment II

Decision Tree (DT)

 Decision trees: splitting a dataset one feature at a time.

 The decision tree is one of the most commonly used classification techniques.
 It has decision blocks (rectangles) and terminating blocks (ovals).
 The right and left arrows are called branches.

 The kNN algorithm can do a great job of classification, but it doesn't give
any major insight about the data.

Decision Tree (DT)

Figure 1: A decision Tree

Decision Tree (DT)

 The best part of the DT (decision tree) algorithm is that humans can easily
understand the data:
 The DT algorithm:
 Takes a set of data (training examples),
 Builds a decision tree (model), and draws it.

 The tree can also be re-represented as a set of if-then rules to improve human readability.

 The DT does a great job of distilling data into knowledge.
 It takes a set of unfamiliar data and extracts a set of rules.
 DT is often used in expert system development.

Decision Tree (DT)

 The DT can be expressed using the following expression:

 (Outlook = Sunny ˄ Humidity = Normal)
 ˅ (Outlook = Overcast)
 ˅ (Outlook = Rain ˄ Wind = Weak) → Yes

Decision Tree (DT)

 The pros and cons of DT:

 Pros of DT:
 Computationally cheap to use,
 Easy for humans to understand the learned results,
 Missing values OK (robust to errors),
 Can deal with irrelevant features.

 Cons of DT:
 Prone to overfitting.
 Works with: numeric values, nominal values.

Decision Tree (DT)

 Appropriate problems for DT learning:

 Instances are represented by attribute-value pairs (a fixed set of attributes and
their values),
 The target function has discrete output values,
 Disjunctive descriptions may be required,
 The training data may contain errors,
 The training data may contain missing attribute values.

Decision Tree (DT)

 The mathematics used by a DT to split the dataset is called
information theory:
 The first decision you need to make is:
 Which feature should be used to split the data?
 You need to try every feature and measure which split will give the
best result.
 Then split the dataset into subsets.
 The subsets will then traverse down the branches of the decision
node.
 If the data on a branch is all of the same class, stop; else repeat the
splitting.

Decision Tree (DT)

Figure 2: Pseudo-code for the splitting function
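The pseudo-code figure itself is not reproduced here, so below is a minimal Python sketch of a recursive splitting routine in the same spirit. The helpers choose_best_feature and split_dataset are assumed here (concrete versions are sketched later in these notes), and the dictionary-of-dictionaries tree representation is just one convenient choice, not the only one.

    from collections import Counter

    def majority_class(class_list):
        # Most common class label; used when no features are left to split on.
        return Counter(class_list).most_common(1)[0][0]

    def create_branch(dataset, feature_names, choose_best_feature, split_dataset):
        # dataset: list of rows, with the class label in the last column.
        class_list = [row[-1] for row in dataset]
        if class_list.count(class_list[0]) == len(class_list):
            return class_list[0]                    # branch is pure: stop splitting
        if len(dataset[0]) == 1:                    # only the class column remains
            return majority_class(class_list)
        best = choose_best_feature(dataset)         # index of the best feature to split on
        name = feature_names[best]
        tree = {name: {}}
        remaining = feature_names[:best] + feature_names[best + 1:]
        for value in set(row[best] for row in dataset):
            subset = split_dataset(dataset, best, value)
            tree[name][value] = create_branch(subset, remaining,
                                              choose_best_feature, split_dataset)
        return tree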


Decision Tree (DT)

 General approach to decision trees:

 Collect: Any method.
 Prepare: The ID3 algorithm works only on nominal values, so any
continuous values will need to be quantized.
 Analyze: Any method. You should visually inspect the tree after it
is built.
 Train: Construct a tree data structure. (the DT)
 Test: Calculate the error rate with the learned tree.
 Use: This can be used in any supervised learning task, often to
better understand the data.

Decision Tree (DT)

 We would like to classify the following animals into two
classes:
 Fish and not fish.

Table 1: Marine animal data


Decision Tree (DT)

 We need to decide whether to split the data based on the
first feature or the second feature:
 To bring some organization to the unorganized data.
 One way to do this is to measure the information.
 Measure the information before and after the split.

 Information theory is a branch of science that is concerned with
quantifying information.
 The change in information before and after the split is known
as the information gain.
Decision Tree (DT)

 The split with the highest information gain is the best option.
 The measure of information of a set is known as the Shannon
entropy, or simply entropy.

 The change in information before and after the split is known
as the information gain.

Decision Tree (DT)

 To calculate entropy, you need the expected value of the
information of all possible values of our class.
 This is given by:

H = - Σ (i = 1 to n) p(xi) · log2 p(xi)

 where n is the number of classes and p(xi) is the proportion of
items belonging to class i.

Decision Tree (DT)

 The higher the entropy, the more mixed up the data.

 Another common measure of disorder in a set is the Gini impurity:
 The probability of choosing an item from the set times the
probability of that item being misclassified.

 Calculate the Shannon entropy of a dataset. (see the sketch below)
 Split the dataset on a given feature.
 Choose the best feature to split on.
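A minimal Python sketch of these three steps. Function names are illustrative, and the example rows only stand in for the marine-animal data in Table 1, whose exact values are not shown here.

    from math import log2

    def calc_entropy(dataset):
        # Shannon entropy of the class labels (last column of each row).
        counts = {}
        for row in dataset:
            counts[row[-1]] = counts.get(row[-1], 0) + 1
        total = len(dataset)
        return -sum((c / total) * log2(c / total) for c in counts.values())

    def split_dataset(dataset, feature, value):
        # Rows where `feature` equals `value`, with that feature column removed.
        return [row[:feature] + row[feature + 1:]
                for row in dataset if row[feature] == value]

    def choose_best_feature(dataset):
        # Index of the feature whose split gives the highest information gain.
        base_entropy = calc_entropy(dataset)
        n_features = len(dataset[0]) - 1            # last column is the class label
        best_gain, best_feature = 0.0, -1
        for i in range(n_features):
            new_entropy = 0.0
            for value in set(row[i] for row in dataset):
                subset = split_dataset(dataset, i, value)
                new_entropy += len(subset) / len(dataset) * calc_entropy(subset)
            gain = base_entropy - new_entropy       # information gain of splitting on i
            if gain > best_gain:
                best_gain, best_feature = gain, i
        return best_feature

    # Toy example (values assumed, not the actual Table 1 data):
    data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
    print(calc_entropy(data))         # about 0.971
    print(choose_best_feature(data))  # 0 -- the first feature gives the larger gain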

Decision Tree (DT)

 Recursively building the tree:

 Start with the dataset and split it based on the best attribute.
 The data will traverse down the branches of the tree to another
node.
 This node will then split the data again (recursively).
 Stop under the following conditions: we run out of attributes, or all the
instances in a branch are of the same class.

Decision Tree (DT)

Table 2: Example training sets

Decision Tree (DT)

Figure 3: Data path while splitting


Decision Tree (DT)

 ID3 uses the information gain measure to select among the
candidate attributes.
 Start with the dataset and split it based on the best attribute.
 Given a collection S, containing positive and negative examples
of some target concept:

 The entropy of S relative to this Boolean classification is:

Entropy(S) = - p+ log2(p+) - p- log2(p-)

where p+ and p- are the proportions of positive and negative examples in S.

Decision Tree (DT)

 Example:
 The target attribute is PlayTennis. (yes/no)

Table 3: Example training sets

Decision Tree (DT)

 Suppose S is a collection of 14 examples of some Boolean
concept, including 9 positive and 5 negative examples.
 Then the entropy of S relative to this Boolean classification is:
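For this 9-positive, 5-negative collection (the computation appears as a figure in the original slides), the value works out to:

Entropy(S) = Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940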

Decision Tree (DT)

 Note that the entropy is 0 if all members of S belong to the
same class.
 For example, if all the members are positive (p+ = 1), then p- = 0:
 Entropy(S) = -1·log2(1) - 0·log2(0) = 0 (taking 0·log2(0) to be 0)

 Note that the entropy is 1 when the collection contains an
equal number of positive and negative examples.
 If the collection contains an unequal number of positive and
negative examples, the entropy is between 0 and 1.
Decision Tree (DT)

 Suppose S is a collection of training-example days described by
the attribute Wind (values Weak and Strong).
 Information gain is the measure used by ID3 to select the
best attribute at each step in growing the tree.
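In symbols, the information gain of an attribute A is the expected reduction in entropy from partitioning S on A (this is the standard ID3 definition, not shown explicitly on the slide):

Gain(S, A) = Entropy(S) - Σ (over v in Values(A)) (|Sv| / |S|) · Entropy(Sv)

As a worked example, assuming the usual 14-day PlayTennis data in which Wind = Weak covers [6+, 2-] and Wind = Strong covers [3+, 3-]:

Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1.000) ≈ 0.048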

Decision Tree (DT)

 Information gain of the two attributes: Humidity and Wind.

Decision Tree (DT)

 Example:
 ID3 determines the information gain for each attribute
(Outlook, Temperature, Humidity, and Wind).
 Then it selects the one with the highest information gain.
 The information gain values for all four attributes are:
 Gain (S, Outlook) = 0.246
 Gain (S, Humidity) = 0.151
 Gain (S, Wind) = 0.048
 Gain (S, Temperature) = 0.029

 Outlook provides greater information gain than the others.
Decision Tree (DT)

 Example:
 According to the information gain measure, the Outlook attribute is
selected as the root node.
 Branches are created below the root for each of its possible
values (Sunny, Overcast, and Rain).

Decision Tree (DT)

 The partially learned decision tree resulting from the first step of ID3

Decision Tree (DT)

 The Overcast descendant has only positive examples and
therefore becomes a leaf node with classification Yes.

 The other two nodes will be expanded by selecting the attribute
with the highest information gain relative to the new subsets.

Decision Tree (DT)

 Decision Tree learning can be:
 Classification tree: the target variable takes a finite set of values.
 Regression tree: the target variable takes continuous values.

 There are many specific Decision Tree algorithms:
 ID3 (Iterative Dichotomiser 3)
 C4.5 (successor of ID3)
 CART (Classification and Regression Tree)
 CHAID (Chi-squared Automatic Interaction Detector)
 MARS: extends DT to handle numerical data better
Decision Tree (DT)

 Different Decision Tree algorithms use different metrics for
measuring the "best attribute":
 Information gain: used by ID3, C4.5 and C5.0
 Gini impurity: used by CART

Decision Tree (DT)

 ID3 in terms of its search space and search strategy:

 ID3's hypothesis space of all decision trees is a complete space of
finite discrete-valued functions.
 ID3 maintains only a single current hypothesis as it searches
through the space of decision trees.

 ID3 in its pure form performs no backtracking in its search.
(post-pruning of the decision tree addresses this)
 ID3 uses all training examples at each step in the search to make
statistically based decisions regarding how to refine its current
hypothesis. (much less sensitive to errors)
Decision Tree (DT)

 Inductive bias in Decision Tree learning (ID3):

 The inductive bias is the set of assumptions the learner makes.
 ID3 selects in favor of shorter trees over longer ones. (a breadth-first
preference)
 It selects trees that place the attributes with the highest information
gain closest to the root.

Decision Tree (DT)

 Issues in Decision Tree learning:

 How deeply to grow the decision tree
 Handling continuous attributes
 Choosing an appropriate attribute selection measure
 Handling training data with missing attribute values
 Handling attributes with differing costs, and
 Improving computational efficiency

 ID3 has been extended to address most of these issues, producing C4.5.

Decision Tree (DT)

 Avoiding overfitting the data:

 Noisy data and too few training examples are problems.
 Overfitting is a practical problem for Decision Trees and many
other learning algorithms.
 Overfitting has been found to decrease the accuracy of the learned tree
by 10-25%.

 Approaches to avoid overfitting:

 Stop growing the tree before it overfits. (direct but less practical)
 Allow the tree to overfit, and then post-prune it. (the most successful
approach in practice)

Decision Tree (DT)

 Incorporating continuous-valued attributes:

 In the initial definition of ID3, the attributes and the target value must
take on a discrete set of values.

 The attributes tested in the decision nodes of the tree must be
discrete-valued:
 Create a new Boolean attribute by thresholding the continuous value
(see the sketch below), or
 Use multiple intervals rather than just two.
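A minimal Python sketch of the first option: pick the threshold for a continuous attribute by trying the midpoints between adjacent distinct sorted values and keeping the one with the highest information gain. All names and example values here are illustrative.

    from math import log2

    def entropy(labels):
        # Shannon entropy of a list of class labels.
        total = len(labels)
        return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                    for c in set(labels))

    def best_threshold(values, labels):
        # Candidate cuts are midpoints between adjacent distinct sorted values;
        # the chosen cut defines a new Boolean attribute `value > threshold`.
        pairs = sorted(zip(values, labels))
        base = entropy(labels)
        best_gain, best_cut = 0.0, None
        for i in range(1, len(pairs)):
            if pairs[i - 1][0] == pairs[i][0]:
                continue                            # no cut between equal values
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            left = [lab for v, lab in pairs if v <= cut]
            right = [lab for v, lab in pairs if v > cut]
            gain = (base
                    - (len(left) / len(pairs)) * entropy(left)
                    - (len(right) / len(pairs)) * entropy(right))
            if gain > best_gain:
                best_gain, best_cut = gain, cut
        return best_cut, best_gain

    # Illustrative values only:
    temps = [40, 48, 60, 72, 80, 90]
    plays = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No']
    print(best_threshold(temps, plays))   # picks the cut at 54.0 for this toy data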

Decision Tree (DT)

 Alternative measures for selecting attributes:

 Information gain favors attributes with many values.
 One alternative measure that has been used successfully is the
gain ratio.
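For reference, the gain ratio used by C4.5 (this is the standard definition, not shown on the slide) divides the gain by how broadly and evenly the attribute splits the data:

SplitInformation(S, A) = - Σ (i = 1 to c) (|Si| / |S|) · log2(|Si| / |S|)

GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

where S1 ... Sc are the subsets produced by partitioning S on the c values of attribute A.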

Decision Tree (DT)

 Handling training examples with missing attribute values:

 Assign the missing attribute the most common value among the training
examples at node n. (sketched below)
 Assign a probability to each of the possible values of the attribute.
 The second approach is used in C4.5.
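A minimal Python sketch of the first strategy, filling a missing value with the most common value observed among the rows at the current node. The names and the missing-value marker are illustrative.

    from collections import Counter

    def fill_missing(rows, feature, missing='?'):
        # Replace missing entries of column `feature` with the most common
        # observed value among the rows at this node.
        observed = [row[feature] for row in rows if row[feature] != missing]
        most_common = Counter(observed).most_common(1)[0][0]
        return [row[:feature] + [most_common] + row[feature + 1:]
                if row[feature] == missing else row
                for row in rows]

    # Example: fill the missing Wind value with the commonest one at this node.
    rows = [['Sunny', 'Weak', 'No'], ['Sunny', '?', 'No'], ['Rain', 'Weak', 'Yes']]
    print(fill_missing(rows, 1))   # the '?' becomes 'Weak'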

Decision Tree (DT)

 Handling attributes with different costs:

 Prefer low-cost attributes to high-cost attributes.
 ID3 can be modified to take costs into account by introducing a cost
term into the attribute selection measure:
 Divide the gain by the cost of the attribute. (see below)
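In symbols, one simple cost-sensitive selection measure of this kind (an illustration, not a formula from the slides) is:

Gain(S, A) / Cost(A)

Variants that weight the gain more heavily, such as Gain(S, A)^2 / Cost(A), have also been used in the literature.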

Question & Answer

Thank You !!!

Assignment II

 Answer the given questions by considering the following set of training examples.

Assignment II

 (a) What is the entropy of this collection of training examples with respect to the target function classification?
 (b) What is the information gain of a2 relative to these training examples?

Decision Tree (DT)

 Do some research on the following Decision Tree algorithms:

 ID3 (Iterative Dichotomiser 3)
 C4.5 (Successor of ID3)
 CART (Classification and Regression Tree)
 CHAID (Chi – squared Automatic Interaction Detector)
 MARS: extends DT to handle numerical data better

