Deep Learning

Decision Trees I

Dr M. Sultan Zia
Associate Professor
The University of Lahore, Chenab Campus, Gujrat, Pakistan
DECISION TREES

Introduction

It is a method that induces concepts from examples (inductive learning)

Most widely used & practical learning method

The learning is supervised: i.e. the classes or categories of the data instances are known

It represents concepts as decision trees (which can be rewritten as if-then rules)

2
DECISION TREES

Introduction

The target function can be Boolean or discrete valued

3
DECISION TREES

Decision Tree Representation

1. Each node corresponds to an attribute

2. Each branch corresponds to an attribute value

3. Each leaf node assigns a classification

4
DECISION TREES

Example

5
DECISION TREES

Example

[Figure: A Decision Tree for the concept PlayTennis — the root tests Outlook (Sunny, Overcast, Rain); the Sunny branch tests Humidity (High, Normal) and the Rain branch tests Wind (Strong, Weak)]

An unknown observation is classified by testing its attributes and reaching a leaf node
6
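Below is a minimal Python sketch (not from the slides) of how such a tree could be represented and used to classify an observation. The nested-dict layout and the leaf labels on the Overcast and Rain branches are illustrative assumptions; the Sunny-branch labels match the rules listed later in the deck.

```python
# Hypothetical nested-dict encoding of the PlayTennis tree shown above.
# Overcast -> Yes and the Rain-branch leaves are assumed, not stated on this slide.
tree = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(node, example):
    """Sort an example down the tree: test one attribute per node until a leaf is reached."""
    while isinstance(node, dict):
        attribute = next(iter(node))                    # attribute tested at this node
        node = node[attribute][example[attribute]]      # follow the branch for its value
    return node                                         # leaf = class label

print(classify(tree, {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}))  # -> No
```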
DECISION TREES

Decision Tree Representation

Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances

Each path from the tree root to a leaf corresponds to a conjunction of attribute tests (one rule for classification)

The tree itself corresponds to a disjunction of these conjunctions (a set of rules for classification)

7
DECISION TREES

Decision Tree Representation

8
DECISION TREES

Basic Decision Tree Learning Algorithm

Most algorithms for growing decision trees are variants of a basic algorithm

An example of this core algorithm is the ID3 algorithm developed by Quinlan (1986)

It employs a top-down, greedy search through the space of possible decision trees

9
DECISION TREES

Basic Decision Tree Learning Algorithm

First of all we select the best attribute to be tested at the root of the tree

For making this selection, each attribute is evaluated using a statistical test to determine how well it alone classifies the training examples

10
DECISION TREES

Basic Decision Tree Learning Algorithm


We have:

[Figure: the 14 training examples D1–D14 shown as a single unpartitioned collection]

- 14 observations (D1–D14)
- 4 attributes: Outlook, Temperature, Humidity, Wind
- 2 classes (Yes, No)

11
DECISION TREES

Basic Decision Tree Learning Algorithm

[Figure: the same 14 examples partitioned by the attribute Outlook into the branches Sunny, Overcast, and Rain]
12
DECISION TREES

Basic Decision Tree Learning Algorithm

The selection process is then repeated using the training examples associated with each descendant node to select the best attribute to test at that point in the tree

13
DECISION TREES

[Figure: the partition of the examples by Outlook, as on the previous slide]

What is the "best" attribute to test at this point? The possible choices are Temperature, Wind and Humidity
14
DECISION TREES

Basic Decision Tree Learning Algorithm

This forms a greedy search for an acceptable decision tree, in which the algorithm never backtracks to reconsider earlier choices

15
DECISION TREES

Which Attribute is the Best Classifier?

The central choice in the ID3 algorithm is selecting which attribute to test at each node in the tree

We would like to select the attribute which is most useful for classifying examples

For this we need a good quantitative measure

For this purpose a statistical property, called information gain, is used

16
DECISION TREES

Which Attribute is the Best Classifier?: Definition of Entropy

In order to define information gain precisely, we begin by defining entropy

Entropy is a measure commonly used in information theory

Entropy characterizes the impurity of an arbitrary collection of examples

17
DECISION TREES

Which Attribute is the Best Classifier?: Definition of Entropy

Suppose we have four independent values of a variable X: A, B, C, D

These values are independent and occur randomly

You might transmit these values over a binary serial link by encoding each reading with two bits:

A = 00    B = 01    C = 10    D = 11

18
DECISION TREES

Which Attribute is the Best Classifier?: Definition of Entropy


Someone tells you that their probabilities of occurrence are not equal:

p(A) = 1/2
p(B) = 1/4
p(C) = 1/8
p(D) = 1/8

It is now possible to invent a coding that uses only 1.75 bits on average per symbol for the transmission, e.g.

A = 0    B = 10    C = 110    D = 111


19
DECISION TREES

Which Attribute is the Best Classifier?: Definition of Entropy

Suppose X can have m values, V1, V2, …, Vm, with probabilities p1, p2, …, pm

The smallest number of bits, on average, per value, needed to transmit a stream of values of X is

−(p1 log2 p1 + p2 log2 p2 + … + pm log2 pm)

If one p = 1 and all other p's are 0, then we need 0 bits (i.e. we don't need to transmit anything)

20
DECISION TREES

Which Attribute is the Best Classifier?: Definition of Entropy


If all p’s are equal for a given m, we need the highest number of bits for transmission

If there are m possible values of an attribute, then the entropy can be as large as log2 m

21
DECISION TREES

Which Attribute is the Best Classifier?: Definition of Entropy

This formula is called Entropy H

H(X) = −Σi pi log2 pi

High Entropy means that the examples have more nearly equal probabilities of occurrence (and are therefore not easily predictable)

Low Entropy means easy predictability

22
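A short Python check of these claims, assuming nothing beyond the entropy formula above: the skewed distribution from the transmission example needs 1.75 bits on average, a uniform distribution over m values needs log2(m), and a certain outcome needs nothing.

```python
import math

def entropy(probs):
    """H(X) = -sum(p * log2(p)), with 0 * log 0 treated as 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([1/2, 1/4, 1/8, 1/8]))   # 1.75  (the A/B/C/D example above)
print(entropy([1/4, 1/4, 1/4, 1/4]))   # 2.0   (the maximum, log2(4), for equal p's)
print(entropy([1.0, 0, 0, 0]))         # 0.0   (one p = 1: nothing needs transmitting)
```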
DECISION TREES

Which Attribute is the Best Classifier?: Information Gain

Suppose we are trying to predict an output Y (Likes Film "PK") and we have an input X (College Major = v)

[Figure: the records split on Major into the branches Math, History, and CS]

23
DECISION TREES

Which Attribute is the Best Classifier?: Information Gain

We have H(X) = 1.5 and H(Y) = 1.0

Conditional Entropy H(Y | X = v): the Entropy of Y among only those records in which X = v

24
DECISION TREES

Which Attribute is the Best Classifier?: Information Gain


Conditional Entropy of Y
H(Y | X = Math) = 1.0
H(Y | X = History) = 0
H(Y | X = CS) = 0

Major
Math CS
History

25
DECISION TREES

Which Attribute is the Best Classifier?: Information Gain


Average Conditional Entropy of Y

H(Y | X) = Σv P(X = v) · H(Y | X = v)

For this example, H(Y | X) = 0.5

26
DECISION TREES

Which Attribute is the Best Classifier?: Information Gain

Information Gain is the expected reduction in entropy caused by partitioning the examples according to an attribute’s value

Info Gain (Y | X) = H(Y) – H(Y | X) = 1.0 – 0.5 = 0.5

It tells us, for transmitting Y, how many bits would be saved if both sides of the line knew X

In general, we write Gain (S, A), where S is the collection of examples and A is an attribute
27
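The following sketch reproduces the Major/Film example numerically. The eight records are hypothetical, chosen only so that they reproduce the figures quoted on the slides (H(Y) = 1.0, H(Y | X) = 0.5, gain = 0.5); they are not the original dataset.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(xs, ys):
    """Gain(Y | X) = H(Y) - H(Y | X), with H(Y|X) the probability-weighted average."""
    n = len(ys)
    h_cond = 0.0
    for v in set(xs):
        subset = [y for x, y in zip(xs, ys) if x == v]
        h_cond += len(subset) / n * entropy(subset)
    return entropy(ys) - h_cond

# Hypothetical records consistent with the slide's numbers: Math is mixed,
# History and CS are pure, and the two classes are balanced overall.
major = ["Math", "Math", "Math", "Math", "History", "History", "CS", "CS"]
likes = ["Yes",  "No",   "Yes",  "No",   "No",      "No",      "Yes", "Yes"]

print(info_gain(major, likes))   # 1.0 - 0.5 = 0.5
```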
DECISION TREES

Which Attribute is the Best Classifier?: Information Gain

Let’s investigate the attribute Wind

28
DECISION TREES

Which Attribute is the Best Classifier?: Information Gain

The collection of examples has 9 positive values and 5 negative ones

Eight of these examples (6 positive and 2 negative ones) have the attribute value Wind = Weak

Six of these examples (3 positive and 3 negative ones) have the attribute value Wind = Strong

29
DECISION TREES

Which Attribute is the Best Classifier?: Information Gain

The information gain obtained by separating the examples according to the attribute Wind is calculated as:

Gain(S, Wind) = Entropy(S) − (8/14)·Entropy(SWeak) − (6/14)·Entropy(SStrong)
             = 0.940 − (8/14)(0.811) − (6/14)(1.000) ≈ 0.048

30
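A quick numerical check of this calculation, using only the class counts stated above (9+/5− overall, 6+/2− for Weak, 3+/3− for Strong):

```python
import math

def entropy_counts(pos, neg):
    """Entropy of a collection containing pos positive and neg negative examples."""
    total = pos + neg
    return sum(-(c / total) * math.log2(c / total) for c in (pos, neg) if c > 0)

total = 14                                   # 9 positive, 5 negative examples
h_s      = entropy_counts(9, 5)              # ~0.940
h_weak   = entropy_counts(6, 2)              # ~0.811  (8 examples with Wind = Weak)
h_strong = entropy_counts(3, 3)              # = 1.000 (6 examples with Wind = Strong)

gain_wind = h_s - (8 / total) * h_weak - (6 / total) * h_strong
print(round(gain_wind, 3))                   # ~0.048
```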
DECISION TREES

Which Attribute is the Best Classifier?: Information Gain


We calculate the Info Gain for each attribute and select
the attribute having the highest Info Gain

31
DECISION TREES

Select Attributes which Minimize Disorder

The formula can be converted from log2 to log10:

logx(M) = log10(M) · logx(10) = log10(M) / log10(x)

Hence log2(Y) = log10(Y) / log10(2)

32
DECISION TREES

Example

33
DECISION TREES

Example
Which attribute should be selected as the first test?

“Outlook” provides the most information

34
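A sketch that compares all four attributes for the first test. Only the Wind counts appear explicitly on the slides; the per-value (Yes, No) counts used below for Outlook, Temperature and Humidity are the commonly quoted PlayTennis figures and should be treated as assumptions here.

```python
import math

def entropy_counts(counts):
    """Entropy of a node given its class counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(total_counts, splits):
    """Info gain of a split, where splits is a list of (Yes, No) counts per value."""
    total = sum(total_counts)
    remainder = sum(sum(s) / total * entropy_counts(s) for s in splits)
    return entropy_counts(total_counts) - remainder

S = (9, 5)   # 9 Yes, 5 No over all 14 examples
splits = {   # assumed per-value class counts for the PlayTennis data
    "Outlook":     [(2, 3), (4, 0), (3, 2)],   # Sunny, Overcast, Rain
    "Temperature": [(2, 2), (4, 2), (3, 1)],   # Hot, Mild, Cool
    "Humidity":    [(3, 4), (6, 1)],           # High, Normal
    "Wind":        [(6, 2), (3, 3)],           # Weak, Strong
}
for name, split in splits.items():
    print(name, round(gain(S, split), 3))
# Outlook 0.246, Temperature 0.029, Humidity 0.151, Wind 0.048 -> Outlook wins
```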
DECISION TREES

35
DECISION TREES

Example
The process of selecting a new attribute is now repeated for each (non-terminal) descendant node, this time using only the training examples associated with that node

Attributes that have been incorporated higher in the tree are excluded, so that any given attribute can appear at most once along any path through the tree

36
DECISION TREES

Example
This process continues for each new leaf node until either:

1. Every attribute has already been included along this path through the tree, or

2. The training examples associated with a leaf node have zero entropy

A minimal sketch of this recursive procedure is given below.

37
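The sketch below outlines the top-down procedure described on the last few slides, under the slides' assumptions (discrete attributes, a single target column). It is not Quinlan's original ID3 code; ties and missing values are ignored.

```python
import math
from collections import Counter

def entropy(examples, target):
    """Entropy of the target labels of a list of example dicts."""
    counts = Counter(e[target] for e in examples)
    n = len(examples)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

def info_gain(examples, attr, target):
    """Expected reduction in entropy from splitting the examples on attr."""
    n = len(examples)
    remainder = 0.0
    for v in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == v]
        remainder += len(subset) / n * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attributes, target):
    """Grow a tree top-down, greedily picking the highest-gain attribute at each node."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:            # stopping rule 2: zero entropy at this node
        return labels[0]
    if not attributes:                   # stopping rule 1: every attribute already used
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(examples, a, target))
    tree = {best: {}}
    for v in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == v]
        remaining = [a for a in attributes if a != best]   # exclude attributes used higher up
        tree[best][v] = id3(subset, remaining, target)
    return tree
```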
DECISION TREES

Example

38
DECISION TREES

From Decision Trees to Rules

Next Step: Make rules from the decision tree

After making the identification tree, we trace each path from the root node to a leaf node, recording the test outcomes as antecedents and the leaf node classification as the consequent

For our example we have:

If the Outlook is Sunny and the Humidity is High then No
If the Outlook is Sunny and the Humidity is Normal then Yes
...
39
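A small sketch of this tracing step, assuming the nested-dict tree representation used in the earlier classification sketch:

```python
def tree_to_rules(node, conditions=()):
    """Trace every root-to-leaf path; the tests become antecedents, the leaf the consequent."""
    if not isinstance(node, dict):                      # leaf: emit one rule
        ifs = " and ".join(f"the {a} is {v}" for a, v in conditions)
        return [f"If {ifs} then {node}"]
    attribute = next(iter(node))
    rules = []
    for value, child in node[attribute].items():
        rules += tree_to_rules(child, conditions + ((attribute, value),))
    return rules

for rule in tree_to_rules(tree):    # 'tree' as defined in the earlier sketch
    print(rule)
# If the Outlook is Sunny and the Humidity is High then No
# If the Outlook is Sunny and the Humidity is Normal then Yes
# ...
```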
DECISION TREES

Hypothesis Space Search


ID3 can be characterized as searching a space of hypotheses for one that fits the training examples

The space searched is the set of possible decision trees

ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space
40
DECISION TREES

Hypothesis Space Search


It begins with an empty tree, then considers more and more elaborate hypotheses in search of a decision tree that correctly classifies the training data

The evaluation function that guides this hill-climbing search is the information gain measure

41
DECISION TREES

Hypothesis Space Search

Some points to note:

• The hypothesis space of all decision trees is a complete space. Hence the target function is guaranteed to be present in it.

42
DECISION TREES

Hypothesis Space Search


• ID3 maintains only a single current hypothesis as it searches through the space of decision trees.

By determining only a single hypothesis, ID3 loses the capabilities that follow from explicitly representing all consistent hypotheses.

For example, it does not have the ability to determine how many alternative decision trees are consistent with the training data, or to pose new instance queries that optimally resolve among these competing hypotheses.
43
DECISION TREES

Hypothesis Space Search

• ID3 performs no backtracking; therefore it is susceptible to converging to locally optimal solutions

• ID3 uses all training examples at each step to refine its current hypothesis. This makes it less sensitive to errors in individual training examples.
However, this requires that all the training examples are present right from the beginning, and the learning cannot be done incrementally over time

44
DECISION TREES

Reference
Sections 3.1 – 3.4.1 of T. Mitchell

Sections 3.4.2 – 3.5 of T. Mitchell

45
