
Machine Learning

19CSE305
L-T-P-C: 3-0-3-4
Lecture 8

Decision Trees
DECISION TREES

• Classic and natural model of learning


• Based on the fundamental computer science notion of “divide and conquer”
• It creates a model that predicts the value of a target variable based on several
input variables
• A tree can be learned by splitting the source set into subsets based on an
attribute value test using recursive partitioning.
• The recursion is completed when the subset at a node has all the same value
of the target variable, or when splitting no longer adds value to the
predictions.
• Although decision trees can be applied to many learning problems, we will
begin with the simplest case: binary classification
What is a Decision Tree?
● Decision Tree is a Supervised learning technique that can be used
for both classification and Regression problems, but mostly it is
preferred for solving Classification problems
● It is a tree-structured classifier, where
○ internal nodes represent the features of a dataset
○ branches represent the decision rules
○ each leaf node represents the outcome

Definition
The decision tree model of learning
The goal of the learner: figure out which questions to ask, in what order,
and what to predict once enough questions have been answered
What is a Decision Tree?

Two Main Types of Decision Trees

● Classification Trees
○ Here the decision variable is categorical/discrete
○ Binary recursive partitioning – a process of splitting the data into
partitions, and then splitting it up further on each of the branches.

Two Main Types of Decision Trees

● Regression Trees
○ Decision trees where the target variable can take continuous values
are called regression trees

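The two tree types above map directly onto two scikit-learn estimators. A minimal sketch, assuming scikit-learn is available; the tiny arrays are invented purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: categorical/discrete target (0 = "No", 1 = "Yes")
X_cls = [[25, 0], [40, 1], [35, 0], [50, 1]]   # e.g. [age, owns_house]
y_cls = [0, 1, 0, 1]
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)
clf.fit(X_cls, y_cls)
print(clf.predict([[30, 1]]))                  # -> a class label

# Regression tree: continuous target (e.g. a price)
X_reg = [[1.0], [2.0], [3.0], [4.0]]
y_reg = [1.1, 1.9, 3.2, 3.9]
reg = DecisionTreeRegressor(max_depth=2)
reg.fit(X_reg, y_reg)
print(reg.predict([[2.5]]))                    # -> a real-valued prediction
```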
Applications
The decision tree model of learning
Do people like movies by an Actor “x”?
The decision tree model of learning
Do people like movies by an Actor “x”?

Q: Is the genre of the movie “Action”?
‣ Yes
Q: Is the lead actor “x”?
‣ Yes
Q: Did people like the previous Action movies of “x”?
‣ Yes
People do like Action movies of “x”
The decision tree model of learning
Do people like movies by an Actor “x”?

Q: Is the genre of the movie “Comedy”?
‣ Yes
Q: Is the lead actor “x”?
‣ Yes
Q: Did people like the previous Comedy movies of “x”?
‣ Yes
People don’t like Comedy movies of “x”
The decision tree model of learning
Lead Actor Genre Hit(Y/N)
x Action Yes
x Fiction Yes
x Romance No
x Action Yes
y Action No
y Fiction No
y Romance Yes
The decision tree model of learning

Lead Actor   Genre     Hit (Y/N)
x            Action    Yes
x            Fiction   Yes
x            Romance   No
x            Action    Yes
y            Action    No
y            Fiction   No
y            Romance   Yes

‣ Questions are features
‣ Responses are feature values
‣ Decision is the label
Why is it a Decision Tree?

Lead Actor   Genre     Hit (Y/N)
x            Action    Yes
x            Fiction   Yes
x            Romance   No
x            Action    Yes
y            Action    No
y            Fiction   No
y            Romance   Yes

‣ We can represent the set of questions and guesses in a tree format
Decision Trees
Example 2

If we are classifying a bank loan application for a customer, the decision tree may
look like this:
Decision Trees
What is a Decision tree?

A decision tree is a tree where each node represents a feature (attribute), each
link (branch) represents a decision (rule), and each leaf represents an
outcome (a categorical or continuous value).
The tree is built over the entire data so that each leaf produces a single outcome
(or minimizes the error at every leaf).
Terms and Process
of building a
Decision Tree
Terminologies

3 types of nodes:
1. Root Node
2. Branch Node
3. Leaf Node

[Figure: an example tree with a root node A whose F/T branches lead to branch nodes B, which split again until leaf nodes are reached.]

Benefits
• It does not require any domain knowledge.
• It is easy to comprehend.
• The learning and classification steps of a decision tree are simple and fast.
Classification—A Two-Step Process [RECAP]

Model construction: describing a set of predetermined classes


◦ Each tuple/sample is assumed to belong to a predefined class, as determined by
the class label attribute
◦ The set of tuples used for model construction is the training set
◦ The model is represented as classification rules, decision trees, or mathematical
formulae
Model usage: for classifying future or unknown objects
◦ Estimate accuracy of the model
◦ The known label of each test sample is compared with the classified result from the model
◦ Accuracy rate is the percentage of test set samples that are correctly classified by the model
◦ Test set is independent of training set (otherwise overfitting)
◦ If the accuracy is acceptable, use the model to classify new data

Note: If the test set is used to select models, it is called validation (test) set
Steps in Decision Tree Construction

1. Begin the tree with the root node, say S, which contains the complete dataset.

2. Find the best attribute in the dataset using an Attribute Selection Measure (ASM).

3. Divide S into subsets that contain the possible values of the best attribute.

4. Generate the decision tree node that contains the best attribute.

5. Recursively make new decision trees using the subsets of the dataset created in step 3.
How does Decision Tree Classification Work?

Examples
Illustrating Classification Task

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

A learning algorithm performs induction on the training set to learn a model.

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

The model is then applied to the test set (deduction) to predict the unknown class labels.
Decision Tree Classification Task

Training Data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class label):
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (decision tree) with splitting attributes:
- Refund = Yes -> Cheat = NO
- Refund = No  -> split on MarSt:
    - MarSt = Married            -> Cheat = NO
    - MarSt = Single or Divorced -> split on TaxInc:
        - TaxInc < 80K -> Cheat = NO
        - TaxInc > 80K -> Cheat = YES


Another Example of a Decision Tree

The same training data also fits this tree, which splits on MarSt first:
- MarSt = Married            -> Cheat = NO
- MarSt = Single or Divorced -> split on Refund:
    - Refund = Yes -> Cheat = NO
    - Refund = No  -> split on TaxInc:
        - TaxInc < 80K -> Cheat = NO
        - TaxInc > 80K -> Cheat = YES

There could be more than one tree that fits the same data!
Apply Model to Test Data

Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branches that match the record:
1. Refund = No -> take the “No” branch to MarSt.
2. MarSt = Married -> take the “Married” branch, which is a leaf labelled NO.

Assign Cheat to “No”.
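The traversal above can also be written as a tiny hand-coded function; this is only a sketch mirroring the tree on the slide, with made-up function and value names:

```python
def predict_cheat(refund, marital_status, taxable_income):
    """Hand-coded version of the decision tree shown above."""
    if refund == "Yes":
        return "No"
    # Refund == "No": split on marital status
    if marital_status in ("Single", "Divorced"):
        return "No" if taxable_income < 80_000 else "Yes"
    return "No"   # Married

# Test record: Refund = No, Marital Status = Married, Taxable Income = 80K
print(predict_cheat("No", "Married", 80_000))   # -> "No"
```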
Decision Tree Classification Task

The same workflow, instantiated with a tree induction algorithm as the learning algorithm: the training set (Tid 1-10 above) is passed to the tree induction algorithm, which learns the decision tree model (induction); the learned tree is then applied to the test set (Tid 11-15) to deduce the missing class labels (deduction).
ID3 Decision Tree Model

ID3 (Iterative Dichotomiser 3)
• Assumes that the attributes are categorical
• The attribute with the highest information gain is placed at the root
• Developed by J. Ross Quinlan in 1979
  (Quinlan was a computer science researcher in data mining and decision theory; he received his doctorate in computer science from the University of Washington in 1968.)
• Based on entropy
• Only works for discrete data
• Cannot work with defective (e.g. missing) data


Entropy
Entropy is a measure of the randomness in the information being
processed. The higher the entropy, the harder it is to draw any
conclusions from that information.

Flipping a coin (binary case): H(X) = −p·log₂(p) − (1−p)·log₂(1−p)
• The entropy H(X) is zero when the probability p is either 0 or 1.
• The entropy is maximum (1 bit) when the probability is 0.5.

A branch with an entropy of zero is a leaf node; a branch with entropy
greater than zero needs further splitting.
ENTROPY

Entropy(PlayGolf) = −P(yes)·log₂ P(yes) − P(no)·log₂ P(no)

For 9 “Yes” and 5 “No” examples out of 14:

−P(yes)·log₂ P(yes) = −(9/14)·log₂(9/14) = −0.64·log₂(0.64) ≈ 0.41
−P(no)·log₂ P(no)  = −(5/14)·log₂(5/14) = −0.36·log₂(0.36) ≈ 0.53

Entropy(PlayGolf) ≈ 0.94
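The 0.94 above can be checked with a few lines of Python; a minimal sketch, assuming only NumPy:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Play Golf column: 9 "Yes" and 5 "No" examples
play_golf = ["Yes"] * 9 + ["No"] * 5
print(round(entropy(play_golf), 2))   # -> 0.94
```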


Entropy of multiple attributes
Information Gain
Information gain (IG) is a statistical property that measures how well a given
attribute separates the training examples according to their target classification.

Constructing a decision tree is all about finding the attribute that returns the
highest information gain and the smallest entropy.
Information Gain
IG is the difference between the entropy before the split and the weighted entropy
after the split of the dataset on a given attribute:

Gain(S, A) = Entropy(S) − Σ_v (|S_v| / |S|) · Entropy(S_v)

where S_v is the subset of S for which attribute A has value v.
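A sketch of this computation, assuming pandas/NumPy; the column names in the usage comment (“Outlook”, “PlayGolf”) are placeholders for a play-golf style dataset:

```python
import numpy as np
import pandas as pd

def entropy(labels: pd.Series) -> float:
    p = labels.value_counts(normalize=True)
    return float(-np.sum(p * np.log2(p)))

def information_gain(df: pd.DataFrame, attribute: str, target: str) -> float:
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    total = entropy(df[target])
    weighted = sum((len(subset) / len(df)) * entropy(subset[target])
                   for _, subset in df.groupby(attribute))
    return total - weighted

# Usage (assumed column names): information_gain(data, "Outlook", "PlayGolf")
```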
Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm)


◦ Tree is constructed in a top-down recursive divide-and-conquer manner

◦ At start, all the training examples are at the root

◦ Attributes are categorical (if continuous-valued, they are discretized in advance)

◦ Examples are partitioned recursively based on selected attributes

◦ Test attributes are selected on the basis of a heuristic or statistical measure (e.g.,
information gain)

Conditions for stopping partitioning


◦ All samples for a given node belong to the same class

◦ There are no remaining attributes for further partitioning – majority voting is employed
for classifying the leaf
◦ There are no samples left
ID3 Algorithm
1. Compute the entropy for the dataset.
2. For every attribute/feature:
   2.1 Calculate the entropy for all of its categorical values.
   2.2 Take the weighted average information entropy for the current attribute.
   2.3 Calculate the information gain for the current attribute.
3. Pick the attribute with the highest gain.
4. Repeat until we get the desired tree.
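A compact sketch of this recursion (categorical attributes only, no handling of missing values); pandas is assumed and the helper names are illustrative:

```python
import numpy as np
import pandas as pd

def entropy(labels):
    p = labels.value_counts(normalize=True)
    return float(-np.sum(p * np.log2(p)))

def information_gain(df, attribute, target):
    total = entropy(df[target])
    weighted = sum((len(s) / len(df)) * entropy(s[target])
                   for _, s in df.groupby(attribute))
    return total - weighted

def id3(df, target, attributes):
    # Stop: every example has the same class -> leaf with that class
    if df[target].nunique() == 1:
        return df[target].iloc[0]
    # Stop: no attributes left -> leaf with the majority class
    if not attributes:
        return df[target].mode()[0]
    # Pick the attribute with the highest information gain
    best = max(attributes, key=lambda a: information_gain(df, a, target))
    tree = {best: {}}
    for value, subset in df.groupby(best):
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, target, remaining)
    return tree
```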
Pros of the ID3 Algorithm
• Builds the decision tree in a small number of steps (greedy, top-down).
• Each level benefits from the choices made at the previous level.
• The whole dataset is scanned to create the tree.

Cons of the ID3 Algorithm
• The tree cannot be updated when new data is classified incorrectly; instead, a new tree must be generated.
• Only one attribute at a time is tested for making a decision.
• Cannot work with defective (e.g. missing) data.
• Cannot work with numerical (continuous) attributes.


Decision Tree Boundary
Decision Tree Boundary
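Boundary pictures like the ones on these slides can be reproduced with a short script; a sketch assuming scikit-learn and matplotlib, using synthetic blob data rather than the lecture's dataset:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

X, y = make_blobs(n_samples=200, centers=2, random_state=0)
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Evaluate the tree on a grid to visualise the axis-aligned regions it carves out
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
zz = tree.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
plt.title("Decision tree boundary (axis-aligned splits)")
plt.show()
```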
Full Problem
using Decision
Tree
Decision Trees
Consider a dataset based on which we will determine whether to play
football or not.
Contd… (the worked example is developed step by step over the following slides)
Example 2
Example 3
Example 4
Example 5
Lead Actor         Genre     Hit (Y/N)
Amitabh Bacchan    Action    Yes
Amitabh Bacchan    Fiction   Yes
Amitabh Bacchan    Romance   No
Amitabh Bacchan    Action    Yes
Abhishek Bacchan   Action    No
Abhishek Bacchan   Fiction   No
Abhishek Bacchan   Romance   Yes


Decision trees
Advantages:
• Easy to learn
• Easy to implement

Disadvantages:
• Decision trees tend to overfit and may have high variance, so they often do not
perform well on test data.
Ensemble learning
Ensemble methods combine several decision trees (or other models) to
produce better predictive performance than a single decision tree.
The main principle behind the ensemble model is that a group of weak
learners come together to form a strong learner.
Ensemble learning refers to machine learning techniques that use the combined
output of two or more models/weak learners to solve a particular
computational intelligence problem. E.g., the Random Forest
algorithm is an ensemble of many decision trees combined.
Simple Ensemble techniques
1. Max voting
2. Averaging
3. Weighted Averaging
Max Voting
The max voting method is generally used for classification problems.
In this technique, multiple models are used to make predictions for each
data point.
The predictions by each model are considered as a ‘vote’.
The predictions from the majority of the models are used as the final
prediction.

Rating 1  Rating 2  Rating 3  Rating 4  Rating 5  Final value
5         4         5         4         4         4
Averaging
Multiple predictions are made for each data point in averaging.
In this method, we take an average of predictions from all the models
and use it to make the final prediction.
Averaging can be used for making predictions in regression problems or
while calculating probabilities for classification problems.

Rating 1  Rating 2  Rating 3  Rating 4  Rating 5  Final value
5         4         5         4         4         4.4
Weighted Averaging
All models are assigned different weights defining the importance of
each model for prediction.

          Rating 1  Rating 2  Rating 3  Rating 4  Rating 5  Final value
Weight    0.23      0.23      0.18      0.18      0.18
Rating    5         4         5         4         4         4.41

[(5×0.23) + (4×0.23) + (5×0.18) + (4×0.18) + (4×0.18)] = 4.41
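All three simple techniques fit in a few lines; a sketch using the rating numbers from the tables above (NumPy assumed):

```python
import numpy as np
from collections import Counter

ratings = [5, 4, 5, 4, 4]
weights = [0.23, 0.23, 0.18, 0.18, 0.18]

# 1. Max voting: the most common prediction wins
max_vote = Counter(ratings).most_common(1)[0][0]        # -> 4

# 2. Averaging: plain mean of the predictions
average = np.mean(ratings)                              # -> 4.4

# 3. Weighted averaging: each model's vote is scaled by its weight
weighted_average = np.dot(ratings, weights)             # -> 4.41

print(max_vote, average, round(weighted_average, 2))
```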


Advanced techniques
Stacking, Blending, Boosting and Bagging
Stacking
Stacking is an ensemble learning technique that uses the predictions from
multiple models (for example a decision tree, kNN and SVM) to build a new
model, which is then used for making predictions on the test set.
Individual learners are heterogeneous – different ML models.
Stacking uses a meta-model.
• Original data: the data is divided into n folds and used as training and test data.
• Base models: also referred to as level-0 models; they are trained on the training
data and produce the level-0 predictions as output.
• Level-0 predictions: each base model is trained on part of the training data and
produces its own predictions, known as level-0 predictions.
• Meta-model: the stacking architecture contains one meta-model, which learns to
best combine the predictions of the base models; it is also known as the level-1 model.
• Level-1 prediction: the meta-model is trained on the predictions made by the
individual base models, i.e., data not used to train the base models is fed to them,
their predictions are collected, and these predictions, together with the expected
outputs, provide the input/output pairs of the training set used to fit the meta-model.
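As a sketch, the same idea with scikit-learn's StackingClassifier; the particular base learners, meta-model and dataset are illustrative choices, not prescribed by the slides:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Level-0 (base) models: different algorithm families
base_models = [
    ("tree", DecisionTreeClassifier(max_depth=3)),
    ("knn", KNeighborsClassifier()),
    ("svm", SVC(probability=True)),
]
# Level-1 meta-model learns how to combine the base predictions
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(),
                           cv=5)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```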
Boosting
Boosting is an ensemble method in which each predictor learns from the
preceding predictor's mistakes to make better predictions in the future.
The technique combines several weak base learners arranged sequentially,
so that each weak learner learns from the previous learner's errors,
producing a better predictive model.
In this way one strong learner is formed, significantly improving the
predictive power of the model. Examples: XGBoost, AdaBoost.
Boosting
Final model: not all models have an equal say in the final prediction.
Steps:
1. A subset is created from the original dataset.
2. Initially, all data points are given equal weights.
3. A base model is created on this subset.
4. This model is used to make predictions on the whole dataset.
5. Errors are calculated from the actual values and the predicted values.
6. The observations that are incorrectly predicted are given higher weights.
   (In the accompanying figure, the three misclassified blue plus points are given higher weights.)
7. Another model is created and predictions are made on the dataset.
   (This model tries to correct the errors of the previous model.)
8. Similarly, multiple models are created, each correcting the errors of the previous model.
9. The final model (strong learner) is the weighted mean of all the models (weak learners).
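A sketch of boosting with scikit-learn's AdaBoostClassifier, using depth-1 trees (decision stumps) as the weak learners; the dataset and parameter values are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each round re-weights the misclassified points and fits the next weak learner
# (the keyword is named base_estimator in older scikit-learn releases)
boost = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                           n_estimators=100, learning_rate=0.5,
                           random_state=0)
boost.fit(X_train, y_train)
print(boost.score(X_test, y_test))
```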
Bagging
Bagging is mainly applied in supervised learning problems.
It involves two steps: bootstrapping and aggregation.
Bootstrapping is a random sampling method in which samples are drawn
from the data with replacement.
The first step in bagging is bootstrapping, where random data samples are
fed to each base learner; the base learning algorithm is then run on these samples.
In aggregation, the outputs from the base learners are combined.
The goal is to increase accuracy while reducing variance to a large extent,
e.g. Random Forest, where the predictions from the decision trees (base
learners) are obtained in parallel.
For regression problems these predictions are averaged to give the final
prediction; for classification problems the mode (majority class) is selected
as the predicted class.
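A sketch of bagging with scikit-learn's BaggingClassifier over decision trees; the synthetic dataset and parameter values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bootstrapped samples feed identical base learners; their votes are aggregated
# (the keyword is named base_estimator in older scikit-learn releases)
bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=50,
                        bootstrap=True,      # sampling with replacement
                        random_state=0)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))   # classification: majority vote of the trees
```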
Random Forest classifier
Random Forest is a classifier that builds a number of decision trees on
various subsets of the given dataset and combines them (by averaging or
majority vote) to improve the predictive accuracy on that dataset.

1. Random subsets are created from the original dataset (bootstrapping).
2. At each node in a decision tree, only a random subset of the features is
considered to decide the best split.
3. A decision tree model is fitted on each of the subsets.
4. The final prediction is calculated by averaging the predictions from
all decision trees (majority vote for classification).
Random forest
Features of Random Forest
1. Diversity- Not all attributes/variables/features are considered while
making an individual tree, each tree is different.
2. Immune to the curse of dimensionality- Since each tree does not
consider all the features, the feature space is reduced.
3. Parallelization - Each tree is created independently out of different
data and attributes. This means that we can make full use of the CPU to
build random forests.
4. Train-Test split – in a random forest we don’t have to segregate the
data into train and test sets, as roughly a third of the data (the
out-of-bag samples) is never seen by a given tree.
5. Stability- Stability arises because the result is based on majority
voting/ averaging.
Comparison

Decision trees:
1. Decision trees normally suffer from the problem of overfitting if they are
allowed to grow without any control.
2. A single decision tree is faster in computation.
3. When a dataset with features is taken as input, a decision tree formulates
a set of rules to do the prediction.

Random Forest:
1. Random forests are created from subsets of the data, and the final output is
based on averaging or majority ranking, so the problem of overfitting is taken
care of.
2. It is comparatively slower.
3. A random forest randomly selects observations, builds decision trees, and
takes the average result; it doesn’t use any single set of formulas.
Hyperparameters
The following hyperparameters increase the predictive power:
1. n_estimators – the number of trees the algorithm builds before averaging
the predictions.
2. max_features – the maximum number of features the random forest considers
when splitting a node.
3. min_samples_leaf – the minimum number of samples required to be at a leaf node.
The following hyperparameters increase the speed:
1. n_jobs – tells the engine how many processors it is allowed to use. If
the value is 1 it can use only one processor; if the value is -1 there is
no limit.
2. random_state – controls the randomness of the sampling. The model will
always produce the same results if it has a definite value of random_state
and is given the same hyperparameters and the same training data.
3. oob_score – OOB means “out of bag”. It is a random-forest cross-validation
method: roughly one-third of the samples are not used to train a given tree
and are instead used to evaluate its performance. These samples are called
out-of-bag samples.
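A sketch showing these hyperparameters on scikit-learn's RandomForestClassifier; the dataset and the particular values are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(
    n_estimators=200,       # number of trees built before averaging/voting
    max_features="sqrt",    # features considered when splitting a node
    min_samples_leaf=2,     # minimum samples required at a leaf node
    n_jobs=-1,              # use all available processors
    random_state=42,        # make the result reproducible
    oob_score=True,         # evaluate on the out-of-bag samples
)
forest.fit(X_train, y_train)
print("OOB score:", forest.oob_score_)
print("Test accuracy:", forest.score(X_test, y_test))
```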
