
Decision Tree Learning

By Dr. Osama Alia


Outline
• Decision tree

• Ensemble learning

o Bagging

o Boosting
Decision Trees
▪ The decision tree algorithm belongs to the family of supervised learning algorithms.

▪ A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision.

▪ Decision trees are powerful and popular tools for classification and prediction.

▪ Learned functions are represented as decision trees (or if-then-else rules).
Graphical intuition on decision tree

Split is based on age. The root node tests age, with branches < 30, 30−40 and > 40:
• age < 30 → test sal: sal < 30,000 → N, sal > 30,000 → Y
• age 30−40 → Y
• age > 40 → test sal: sal < 20,000 → N, sal > 20,000 → Y

Graphical intuition on decision tree

Split is based on salary. The root node tests sal, with branches < 30,000 and > 30,000:
• sal > 30,000 → Y
• sal < 30,000 → test age: age < 30 → N, age 30−40 → Y, age > 40 → Y/N

Which is the best attribute?


Attribute Selection Measures
▪ The attribute/feature selection measure provides a ranking for each attribute describing the given training tuples.

▪ The attribute having the best score for the measure is chosen as the splitting attribute for the given tuples.

▪ There are three attribute selection measures: Entropy, Information Gain, and the Gini Index.
Decision Trees
Example

We have
- 14 observations/examples
- 4 attributes/features: outlook, temperature, humidity, wind
- 2 classes (Yes, No)

https://www.saedsayad.com/decision_tree.htm
Decision Tree Representation
1. Each node corresponds to an attribute

2. Each branch corresponds to an attribute value

3. Each leaf node assigns a classification


Types of decision trees
1. Categorical variable decision tree
   • A categorical variable decision tree includes categorical target variables that are divided into categories.
   • For example, the categories can be yes or no. The categories mean that every stage of the decision process falls into one of the categories, and there are no in-betweens.

2. Continuous variable decision tree
   • A continuous variable decision tree is a decision tree with a continuous target variable.
   • For example, the income of an individual whose income is unknown can be predicted based on available information such as their occupation, age, and other continuous variables.
Attribute Selection Measures: Entropy
• Entropy is a measure of disorder or uncertainty in given examples.
• You can think of it as a measure of the level of impurity.
• In general, our goal is to reduce uncertainty (minimization).

H(y) = − Σ_{i=1}^{c} p_i log2(p_i),   where c is the number of classes

For the 14 examples (9 yes, 5 no):
H(y) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.94

✓ Suppose 0 yes and 14 no:  H(y) = −(0/14) log2(0/14) − (14/14) log2(14/14) = 0   (taking 0 · log2(0) = 0)
✓ Suppose 14 yes and 0 no:  H(y) = −(14/14) log2(14/14) − (0/14) log2(0/14) = 0
✓ Suppose 7 yes and 7 no:   H(y) = −(7/14) log2(7/14) − (7/14) log2(7/14) = 1
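As a quick aside (not on the original slides), the entropy values above can be reproduced with a few lines of Python; the function name entropy and its list-of-class-counts interface are simply choices made for this sketch:

```python
import math

def entropy(counts):
    """H(y) = -sum(p_i * log2(p_i)) over a list of class counts."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:                    # the 0 * log2(0) term is taken as 0
            p = c / total
            h -= p * math.log2(p)
    return h

print(entropy([9, 5]))    # ~0.940 (9 yes, 5 no)
print(entropy([14, 0]))   # 0.0    (pure node)
print(entropy([7, 7]))    # 1.0    (perfectly mixed)
```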
Entropy

• Entropy is 0 if all the members of S belong to the same class.

• Entropy is 1 when the collection contains an equal number of +ve and −ve examples.

• Entropy is between 0 and 1 if the collection contains unequal numbers of +ve and −ve examples.
Information Gain
• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.

• Information gain tells us how important a given attribute of the feature vectors is.

• To minimize the decision tree depth, the attribute with the largest entropy reduction is the best choice.

• The higher the information gain, the more effective the attribute is in classifying the training data.

• We will use it to decide the ordering of attributes in the nodes of a decision tree.
Information Gain (IG)
IG(y, attribute) = H(y) − weighted entropy(attribute)
IG(y, outlook) = H(y) − weighted entropy(outlook)

H(y) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.94

outlook  | Total | Yes | No | Entropy
Sunny    |   5   |  2  |  3 | 0.97
Overcast |   4   |  4  |  0 | 0
Rain     |   5   |  3  |  2 | 0.97

H(Sunny) = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.97

Weighted entropy(outlook) = (5/14) × 0.97 + (4/14) × 0 + (5/14) × 0.97 = 0.692

IG(y, outlook) = 0.94 − 0.692 = 0.246
IG(y, Temperature) = 0.029
IG(y, windy) = 0.94 − 0.892 = 0.048
IG(y, humidity) = 0.94 − 0.789 = 0.151
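The outlook numbers above can be checked with a short Python sketch (again not part of the slides; the helper names and the nested-list representation of the split are assumptions made here):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """IG = H(parent) - weighted average of the children's entropies."""
    n = sum(parent_counts)
    weighted = sum((sum(child) / n) * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted

# outlook splits the 14 examples into Sunny (2 yes, 3 no),
# Overcast (4 yes, 0 no) and Rain (3 yes, 2 no)
print(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))   # ~0.246
```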
First step: which attribute to test at the root?
▪ Which attribute should be tested at the root?
▪ Gain(S, Outlook) = 0.246
▪ Gain(S, Humidity) = 0.151
▪ Gain(S, Wind) = 0.048
▪ Gain(S, Temperature) = 0.029
▪ Outlook provides the best prediction for the target.
▪ Let's grow the tree:
▪ add to the tree a successor for each possible value of Outlook
▪ partition the training samples according to the value of Outlook
After first step
Constructing decision tree
You need to calculate the information gain for the Sunny subset to choose the best attribute among temperature, humidity, and windy.

IG(Sunny, temperature) = H(Sunny) − weighted entropy(Temperature)
IG(Sunny, Humidity)
IG(Sunny, Windy)

Total Sunny = 5

Sunny | Total | Yes | No | Entropy
Hot   |   2   |  0  |  2 | 0
Mild  |   2   |  1  |  1 | 1
Cool  |   1   |  1  |  0 | 0

Weighted entropy(Temp) = (2/5) × 0 + (2/5) × 1 + (1/5) × 0 = 0.4

IG(Sunny, temperature) = 0.97 − 0.4 = 0.57
Second step
▪ Working on the Outlook = Sunny node:
Gain(S_Sunny, Humidity) = 0.970 − (3/5) × 0.0 − (2/5) × 0.0 = 0.970
Gain(S_Sunny, Wind) = 0.970 − (2/5) × 1.0 − (3/5) × 0.918 = 0.019
Gain(S_Sunny, Temp.) = 0.970 − (2/5) × 0.0 − (2/5) × 1.0 − (1/5) × 0.0 = 0.570
▪ Humidity provides the best prediction for the target.
▪ Let's grow the tree:
▪ add to the tree a successor for each possible value of Humidity
▪ partition the training samples according to the value of Humidity
Second and third steps

The partially built tree after the second and third steps (training examples partitioned by Outlook):
• Outlook = Sunny → split on Humidity: {D1, D2, D8} → No, {D9, D11} → Yes
• Outlook = Overcast → all Yes
• Outlook = Rain → split on Wind: {D4, D5, D10} → Yes, {D6, D14} → No

Constructing decision tree

The tree so far has outlook at the root, humidity under the Sunny branch (2 yes, 3 No before the split), all Yes under Overcast, and wind under the Rain branch (3 Yes, 2 No before the split). The same process will be repeated on the Rain branch to select its best attribute. For example:

IG(Rainy, temperature)
IG(Rainy, Humidity)
IG(Rainy, Windy)
Gini index/ impurity
• Gini impurity is a method for splitting the nodes when the target variable is categorical.

G(y) = 1 − Σ_{i=1}^{c} (p_i)²

For the 14 examples (9 yes, 5 no):
G(y) = 1 − (9/14)² − (5/14)² = 0.4598

Yes | No | Gini
 0  | 14 | 0
 1  | 13 | 0.1328
 7  |  7 | 0.5
14  |  0 | 0

• The Gini index is maximal if the classes are perfectly mixed.
• Gini impurity is preferred to Information Gain because it does not contain logarithms, which are computationally intensive.
• The lower the Gini index value, the more effective the attribute is in classifying the training data.
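For illustration (not from the slides), the Gini values in the table can be reproduced with a small Python function; the name gini and the count-list interface are assumptions of this sketch:

```python
def gini(counts):
    """G(y) = 1 - sum(p_i^2) over a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([9, 5]))    # ~0.4598
print(gini([1, 13]))   # ~0.133 (1 yes, 13 no)
print(gini([7, 7]))    # 0.5    (perfectly mixed)
print(gini([14, 0]))   # 0.0    (pure node)
```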
▪ What is the best split (between Outlook, Temperature, Humidity, Wind) according to the Gini index?

▪ For attribute Outlook, the Gini index is:
▪ P(Y)[1 − P(Overcast|Y)² − P(Sunny|Y)² − P(Rainy|Y)²] + P(N)[1 − P(Overcast|N)² − P(Sunny|N)² − P(Rainy|N)²]
▪ (9/14)[1 − (4/9)² − (2/9)² − (3/9)²] + (5/14)[1 − (2/5)² − (3/5)² − (0/5)²] = 0.584

▪ For attribute Temperature, the Gini index is:
▪ P(Y)[1 − P(Hot|Y)² − P(Mild|Y)² − P(Cool|Y)²] + P(N)[1 − P(Hot|N)² − P(Mild|N)² − P(Cool|N)²]
▪ (9/14)[1 − (2/9)² − (4/9)² − (3/9)²] + (5/14)[1 − (2/5)² − (2/5)² − (1/5)²] = 0.641

▪ For attribute Humidity, the Gini index is:
▪ P(Y)[1 − P(High|Y)² − P(Normal|Y)²] + P(N)[1 − P(High|N)² − P(Normal|N)²]
▪ (9/14)[1 − (3/9)² − (6/9)²] + (5/14)[1 − (4/5)² − (1/5)²] = 0.4

▪ For attribute Windy, the Gini index is:
▪ P(Y)[1 − P(True|Y)² − P(False|Y)²] + P(N)[1 − P(True|N)² − P(False|N)²]
▪ (9/14)[1 − (3/9)² − (6/9)²] + (5/14)[1 − (3/5)² − (2/5)²] = 0.457

▪ Which attribute should be chosen?
▪ Answer: the one with the smallest Gini index, which is Humidity.
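The per-attribute values above follow the slide's formulation: the Gini impurity of the attribute-value distribution within each class, weighted by the class probabilities P(Y) and P(N). A hedged Python sketch of that calculation (the function names and data layout are assumptions made here) is:

```python
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def attribute_gini(class_totals, value_counts_per_class):
    """Slide's formulation: sum over classes of
    P(class) * Gini(attribute-value distribution within that class)."""
    n = sum(class_totals)
    return sum((t / n) * gini(counts)
               for t, counts in zip(class_totals, value_counts_per_class))

# Outlook: among the 9 Yes -> Overcast 4, Sunny 2, Rain 3; among the 5 No -> Overcast 0, Sunny 3, Rain 2
print(attribute_gini([9, 5], [[4, 2, 3], [0, 3, 2]]))   # ~0.584
# Humidity: among the 9 Yes -> High 3, Normal 6; among the 5 No -> High 4, Normal 1
print(attribute_gini([9, 5], [[3, 6], [4, 1]]))          # ~0.400
```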
Decision Tree Pseudocode
1. Create the feature list and the attribute (value) list for each feature.
   Example: feature list: Outlook, Windy, Temperature and Humidity.
   Attributes for Outlook are Sunny, Overcast and Rainy.

2. Find the feature with the maximum information gain among all the features and assign it to the root node.
   Outlook in our example; it has three branches: Sunny, Overcast and Rainy.

3. Remove the feature assigned to the root node from the feature list, and for each branch again find the feature with the maximum information gain. Assign that feature as the child node of the branch and remove it from the feature list for that branch.
   The Sunny branch of the Outlook root node has Humidity as its child node.

4. Repeat step 3 until every branch ends in a pure leaf. In our example, either Yes or No.
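A minimal, runnable sketch of this pseudocode in Python is shown below; it assumes the training data is a list of feature dictionaries with a parallel list of labels, and it omits refinements such as handling unseen attribute values or pruning:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Information gain of splitting rows (list of dicts) on one feature."""
    gain = entropy(labels)
    n = len(labels)
    for value in set(r[feature] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[feature] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

def build_tree(rows, labels, features):
    # Pure leaf: every remaining example has the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # No features left to split on: predict the majority class.
    if not features:
        return Counter(labels).most_common(1)[0][0]
    # Choose the feature with the maximum information gain as this node.
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        branch_rows = [r for r in rows if r[best] == value]
        branch_labels = [lab for r, lab in zip(rows, labels) if r[best] == value]
        tree[best][value] = build_tree(branch_rows, branch_labels, remaining)
    return tree

# Example call on hypothetical toy data:
# build_tree([{"outlook": "sunny"}, {"outlook": "overcast"}], ["no", "yes"], ["outlook"])
# -> {'outlook': {'sunny': 'no', 'overcast': 'yes'}}
```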
Decision tree in sklearn
Visualizing decision tree
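The two slides above presumably showed code screenshots; a minimal scikit-learn example along those lines (the iris dataset, the entropy criterion and the depth limit are choices made here, not taken from the slides) could look like:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

X, y = load_iris(return_X_y=True)

# criterion can be "entropy" (information gain) or "gini"
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X, y)

plot_tree(clf, filled=True)   # visualize the learned tree
plt.show()
```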
Decision Tree Regression
• Decision tree regression predicts continuous output.
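As a hedged illustration of regression with a decision tree (the toy sine data and the depth limit are assumptions, not from the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem: y = sin(x) plus a little noise
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

reg = DecisionTreeRegressor(max_depth=3)
reg.fit(X, y)

print(reg.predict([[2.5]]))   # piecewise-constant prediction for x = 2.5
```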
Overfitting & Underfitting in Decision tree

• Overfitting is a significant practical difficulty for decision tree models and many other predictive models (high variance).

• The larger the depth of the tree, the greater the chance of variance (overfitting).

• The smaller the depth of the tree, the greater the chance of bias (underfitting).
Causes of overfitting
• Overfitting due to the presence of noise: mislabeled instances may contradict the class labels of other similar records.

• Overfitting due to a lack of representative instances: a lack of representative instances in the training data can prevent refinement of the learning algorithm.
Overfitting in Decision tree
There are several approaches to avoiding overfitting when building decision trees:

• Pre-pruning stops growing the tree early, before it perfectly classifies the training set.
  o We stop splitting the tree at some point.

• Post-pruning takes a fully grown decision tree and discards unreliable parts.
  o We generate a complete tree first, and then get rid of some branches.
  o Minimum error: the tree is pruned back to the point where the cross-validated error is a minimum.

https://www.saedsayad.com/decision_tree_overfitting.htm
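As an illustrative aside (not from the slides), scikit-learn supports post-pruning through cost-complexity pruning: a pruned tree is obtained by setting ccp_alpha, and in practice one would sweep candidate values (e.g. from cost_complexity_pruning_path) and keep the one with the lowest cross-validated error. A minimal sketch, with the dataset and the single alpha value chosen here as assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Fully grown tree (prone to overfitting) vs. a cost-complexity-pruned tree
full_tree = DecisionTreeClassifier(random_state=0)
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01)

print(cross_val_score(full_tree, X, y, cv=5).mean())     # unpruned
print(cross_val_score(pruned_tree, X, y, cv=5).mean())   # pruned back
```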
Ensemble learning
Ensemble learning
• Ensemble learning is a model that makes predictions based on a number of different models.
• By combining individual models, the ensemble model tends to be more flexible (less bias) and less data-sensitive (less variance).

Two popular ensemble methods:


o bagging
o boosting
Types of ensemble methods
▪ Bagging: training a bunch of individual models in parallel. Each model is trained on a random subset of the data.

▪ Boosting: training a bunch of individual models sequentially. Each individual model learns from the mistakes made by the previous model.

o Boosting is an ensemble technique which aims at combining multiple weak classifiers to build one strong classifier.
What is a "weak" classifier?
• A weak classifier is one that performs better than random guessing, but still performs poorly at designating classes to objects.

• For example, a weak classifier may predict that everyone above the age of 40 could not run a marathon but people falling below that age could.

• Now, you might get above 60% accuracy, but you would still
be misclassifying a lot of data points!
(Illustration: a one-node decision stump that splits on "Age > 40" with branches Yes and No.)
Bagging and boosting
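A minimal scikit-learn sketch of both ideas (the dataset, the base estimators and the estimator counts are choices made here, not taken from the slides):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: many full trees trained in parallel on bootstrap samples of the data
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Boosting: shallow "weak" stumps trained sequentially, each focusing on previous mistakes
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0)

print(cross_val_score(bagging, X, y, cv=5).mean())
print(cross_val_score(boosting, X, y, cv=5).mean())
```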
The End
References

▪ Tom Mitchell, Machine Learning, McGraw-Hill International Editions, 1997 (Chapter 3).
