
Unit 4 – Decision Trees

Prepared By: Nivetha Raju


Department of Computer Science and
Engineering
Decision Tree

• A decision tree is a graphical representation for obtaining all the possible solutions to a problem/decision based on given conditions.
• It is called a decision tree because, like a tree, it starts from the root node, which expands into further branches and builds a tree-like structure.
• To build the tree, the CART algorithm (Classification and Regression Tree) is used.
• A decision tree simply asks a question and, based on the answer (Yes/No), splits further into subtrees.
Decision Tree Terminologies
• Root Node: Root node is from where the decision tree
starts. It represents the entire dataset, which further
gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes; the tree cannot be segregated further once a leaf node is reached.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given conditions.
• Branch/Sub-tree: A subtree formed by splitting a node of the tree.
• Pruning: Pruning is the process of removing unwanted branches from the tree.
• Parent/Child node: A node that is divided into sub-nodes is called the parent node, and the sub-nodes are called its child nodes.
Decision Tree

• It is a hierarchical data structure implementing the divide-and-conquer strategy.
• It is an efficient nonparametric method used for classification and regression.
Decision Tree
• A decision tree is a hierarchical model for supervised learning whereby the local region is
identified in a sequence of recursive splits in a smaller number of steps.
• It is a hierarchical data structure implementing the divide-and-conquer strategy.
• It is composed of internal decision nodes and terminal leaves
• It is also a nonparametric model in the sense that we do not assume any parametric form for the class densities; the tree structure is not fixed a priori, but the tree grows, and branches and leaves are added, during learning.
• Each decision node m implements a test function fm(x) with discrete outcomes labeling the
branches.
• Given an input, at each node, a test is applied and one of the branches is taken depending
on the outcome.
• This process starts at the root and is repeated recursively until a leaf node is hit, at which
point the value written in the leaf constitutes the output
• Each leaf node has an output label,
• classification - class code
• regression - numeric value.
• A leaf node defines a localized region in the input space where instances falling in this
region have the same labels (in classification), or very similar numeric outputs (in
regression).
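
As a minimal sketch of this recursive test-and-branch process (not the author's code; the Node layout and field names are assumptions), prediction with a univariate binary tree can be written as:

```python
# A minimal sketch of recursive decision-tree prediction.
# The node layout (feature index, threshold, children, leaf output) is assumed.

class Node:
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, output=None):
        self.feature = feature      # index of the input dimension tested at this node
        self.threshold = threshold  # split threshold w_m for a numeric attribute
        self.left = left            # branch taken when x[feature] > threshold
        self.right = right          # branch taken when x[feature] <= threshold
        self.output = output        # class label (classification) or value (regression) at a leaf

def predict(node, x):
    """Start at the root and apply the test f_m(x) at each decision node
    until a leaf is hit; the value stored in the leaf is the output."""
    if node.output is not None:           # leaf node: return its label/value
        return node.output
    if x[node.feature] > node.threshold:  # test function f_m(x)
        return predict(node.left, x)
    return predict(node.right, x)

# Example: a tiny tree on one numeric feature x0 with threshold 50.
tree = Node(feature=0, threshold=50,
            left=Node(output="yes"), right=Node(output="no"))
print(predict(tree, [72]))  # -> "yes"
```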
Example
Univariate Trees

• The test uses only one of the input dimensions.
• If the used input dimension is discrete, the decision node checks its value and takes the corresponding branch, implementing an n-way split.
Binary Split
• Numeric xi: binary split: xi > wm,
  where wm is a suitably chosen threshold value.
• The decision node divides the input space into two:
  Lm = {x | xi > wm} and Rm = {x | xi ≤ wm}
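
A small illustrative sketch of such a binary split on a numeric attribute (the function name binary_split and the NumPy-based data layout are assumptions):

```python
import numpy as np

def binary_split(X, feature, threshold):
    """Partition samples by the univariate test x[feature] > threshold.

    Returns (L_m, R_m): rows taking the 'greater than' branch and the rest."""
    mask = X[:, feature] > threshold
    return X[mask], X[~mask]

X = np.array([[30.0], [45.0], [60.0], [80.0]])
L, R = binary_split(X, feature=0, threshold=50.0)
# L contains the rows with x0 > 50, R the rows with x0 <= 50
```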
Decision Trees

• Tree induction is the construction of the tree from a training sample.
• There may be many trees that fit a given sample; we are interested in finding the smallest one (in terms of the number of nodes and the complexity of the decision nodes), but finding the smallest tree is NP-complete.
• So we are forced to use local search procedures based on heuristics that give reasonable trees in reasonable time.
• Tree learning algorithms are greedy: at each step, starting at the root with the complete training data, we look for the best split.
Classification Trees
• For node m, Nm instances reach m, and N_m^i of them belong to class Ci:

  $\hat{P}(C_i \mid x, m) \equiv p_m^i = \dfrac{N_m^i}{N_m}$

• Node m is pure if $p_m^i$ is 0 or 1 for all i.
• One measure of impurity is entropy:

  $I_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i$

Entropy is a measure of impurity: high entropy means a high level of disorder (i.e., a low level of purity).
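
A minimal sketch of this impurity measure, computing the entropy of a node from its per-class counts (the function name is an assumption):

```python
import numpy as np

def node_entropy(class_counts):
    """Entropy I_m = -sum_i p_m^i log2 p_m^i, from per-class counts N_m^i."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()      # p_m^i = N_m^i / N_m
    p = p[p > 0]                   # terms with p = 0 contribute 0
    return float(np.sum(p * np.log2(1.0 / p)))   # same as -sum p * log2(p)

print(node_entropy([8, 8]))    # maximally impure two-class node -> 1.0
print(node_entropy([16, 0]))   # pure node -> 0.0
```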
Best Split

• If node m is pure, generate a leaf and stop; otherwise split and continue recursively.
• Impurity after the split: of the Nm instances, Nmj take branch j, and N_mj^i of them belong to class Ci:

  $\hat{P}(C_i \mid x, m, j) \equiv p_{mj}^i = \dfrac{N_{mj}^i}{N_{mj}}$

• Find the variable and split that minimize the impurity after the split (among all variables, and among all split positions for numeric variables):

  $I'_m = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i$
Information Gain:
• Information gain is the measurement of changes in
entropy after the segmentation of a dataset based on
an attribute.
• It calculates how much information a feature provides
us about a class.
• According to the value of information gain, we split the
node and build the decision tree.
• A decision tree algorithm always tries to maximize the
value of information gain, and a node/attribute having
the highest information gain is split first. It can be
calculated using the below formula:

• Information Gain = Entropy(S) − [Weighted Avg. × Entropy(each feature)]
Information Gain:
• Entropy: Entropy is a metric to measure the impurity in
a given attribute. It specifies randomness in data.
Entropy can be calculated as:

• Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
• Where,
• S = the set of all samples
• P(yes) = probability of yes
• P(no) = probability of no
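
Putting the two formulas together, a minimal sketch of information gain for a binary (yes/no) target; the function names and the example counts are assumptions:

```python
import math

def entropy_yes_no(n_yes, n_no):
    """Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)."""
    total = n_yes + n_no
    e = 0.0
    for n in (n_yes, n_no):
        if n > 0:
            p = n / total
            e -= p * math.log2(p)
    return e

def information_gain(parent_counts, child_counts_per_branch):
    """Information Gain = Entropy(S) - weighted average of the child entropies."""
    parent_total = sum(parent_counts)
    weighted_child = sum(sum(c) / parent_total * entropy_yes_no(*c)
                         for c in child_counts_per_branch)
    return entropy_yes_no(*parent_counts) - weighted_child

# Illustrative counts only: 16 yes / 14 no, split into two branches.
print(information_gain((16, 14), [(12, 1), (4, 13)]))  # about 0.38
```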
Gini Index:
• The Gini index is a measure of impurity (or purity) used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.
• An attribute with a low Gini index should be preferred over one with a high Gini index.
• CART creates only binary splits, and it uses the Gini index to choose them.
• The Gini index can be calculated using the formula below:

  $\text{Gini Index} = 1 - \sum_{j} P_j^{2}$
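
A minimal sketch of the Gini computation (the function name and example counts are assumptions):

```python
def gini_index(class_counts):
    """Gini Index = 1 - sum_j P_j^2, where P_j is the proportion of class j."""
    total = sum(class_counts)
    return 1.0 - sum((n / total) ** 2 for n in class_counts)

print(gini_index([16, 14]))  # mixed node -> about 0.498
print(gini_index([30, 0]))   # pure node  -> 0.0
```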


Classification tree construction
1. GenerateTree(X):
Step 1: Check whether the entropy of the current node is below a threshold (θ). If yes, create a leaf node labeled with the majority class.
Step 2: If not, select the best attribute to split on using SplitAttribute(X).
Step 3: For each branch of the selected attribute, generate the tree recursively for the subset of data that falls into that branch.
2. SplitAttribute(X):
Step 1: Initialize the minimum entropy (MinEnt) to a maximum value.
Step 2: For each attribute, check whether it is discrete or numeric:
Discrete attributes: split the data into subsets based on the values of the attribute and compute the entropy of the split.
Numeric attributes: consider all possible split points for the attribute and calculate the entropy of each.
Step 3: Choose the attribute and split that minimize the entropy.
Step 4: Return the best attribute (bestf) with the lowest entropy.
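
Below is a hedged, self-contained sketch of GenerateTree/SplitAttribute for discrete attributes only; the data layout, names, and the value of the threshold θ are assumptions, not the author's implementation:

```python
import math
from collections import Counter

THETA = 0.3  # assumed entropy threshold for creating a leaf

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_attribute(rows, labels, attributes):
    """Return the attribute whose weighted split entropy is minimal."""
    best_f, min_ent = None, float("inf")
    for f in attributes:
        ent = 0.0
        for v in set(r[f] for r in rows):
            subset = [l for r, l in zip(rows, labels) if r[f] == v]
            ent += len(subset) / len(labels) * entropy(subset)
        if ent < min_ent:
            best_f, min_ent = f, ent
    return best_f

def generate_tree(rows, labels, attributes):
    """Recursively grow a classification tree, represented as nested dicts."""
    if entropy(labels) < THETA or not attributes:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    f = split_attribute(rows, labels, attributes)
    tree = {"attribute": f, "branches": {}}
    for v in set(r[f] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[f] == v]
        tree["branches"][v] = generate_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            [a for a in attributes if a != f])
    return tree

# Toy data loosely modeled on the loan example in the slides (values illustrative).
rows = [{"Balance": "<50K", "Residence": "OWN"},
        {"Balance": "<50K", "Residence": "RENT"},
        {"Balance": ">50K", "Residence": "OWN"},
        {"Balance": ">50K", "Residence": "OTHER"}]
labels = ["write-off", "write-off", "non-write-off", "non-write-off"]
print(generate_tree(rows, labels, ["Balance", "Residence"]))
```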
Example-Entropy
• Consider an example where we are building a decision tree to
predict whether a loan given to a person would result in a write-
off or not. Our entire population consists of 30 instances. 16
belong to the write-off class and the other 14 belong to the non-
write-off class. We have two features, namely “Balance” that can
take on two values -> “< 50K” or “>50K” and “Residence” that
can take on three values -> “OWN”, “RENT” or “OTHER”.
• We want to see how a decision tree algorithm would decide which attribute to split on first, i.e., which of the two features provides more information about (or reduces more uncertainty in) our target variable, using the concept of entropy.
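
As a quick numeric check (not part of the original slides), the entropy of the parent node follows directly from the class counts given above:

```python
import math

# Parent node of the loan example: 16 write-off, 14 non-write-off out of 30.
p_wo, p_nwo = 16 / 30, 14 / 30
parent_entropy = -p_wo * math.log2(p_wo) - p_nwo * math.log2(p_nwo)
print(round(parent_entropy, 3))  # about 0.997 -> an almost maximally impure node
```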
Solution – step 1
• Feature 1 – Balance
Step 2
Feature 2 - Residence
Solution - Inference

• E(Balance) < E(Residence) => the child nodes from splitting on Balance are purer than those from splitting on Residence.
• If you look at the graph, however, the left-most node for Residence is also very pure, but this is where the weighted averages come into play.
• Even though that node is very pure, it contains the smallest share of the total observations, and as a result it contributes only a small portion of its purity when we calculate the total entropy from splitting on Residence.
• Conclusion: a decision tree algorithm would use this result to make the first split on our data using Balance.
Regression Trees
A regression tree is a decision tree where the values at the endpoint
nodes are continuous rather than discrete. That is, the regression tree
predicts a real-valued output rather than a class label
Ex: Predicting a person’s salary based on their age and occupation

Let us say that, for node m, Xm is the subset of X reaching node m; namely, it is the set of all x ∈ X satisfying all the conditions in the decision nodes on the path from the root until node m.

In regression, the goodness of a split is measured by the mean square error from the estimated value.
Regression Trees
• If at a node, the error is acceptable, that is, Em < θr , then a
leaf node is created and it stores the gm value.
• If the error is not acceptable, data reaching node m is split
further such that the sum of the errors in the branches is
minimum.
• As in classification, at each node, we look for the attribute
(and split threshold for a numeric attribute) that minimizes
the error, and then we continue recursively.
$b_{mj}(x) = \begin{cases} 1 & \text{if } x \in X_{mj}: x \text{ reaches node } m \text{ and takes branch } j \\ 0 & \text{otherwise} \end{cases}$

where $g_{mj}$, the estimated value in branch j, is the mean of the required outputs $r^t$ of the instances taking that branch, and $E'_m$ is the total error after the split:

$g_{mj} = \dfrac{\sum_t b_{mj}(x^t)\, r^t}{\sum_t b_{mj}(x^t)}$

$E'_m = \dfrac{1}{N_m} \sum_j \sum_t \left(r^t - g_{mj}\right)^2\, b_{mj}(x^t)$
Model Selection in Trees
Pruning Trees
• Any decision based on too few instances causes high variance and hence generalization error.
• Removing subtrees gives better generalization (decreases variance).
• Prepruning: Stopping tree construction early on before it is
full is called prepruning the tree.
• Postpruning: Grow the tree full until all leaves are pure and
we have no training error. We then find subtrees that cause
overfitting and we prune them.
• Prepruning is faster, postpruning is more accurate (requires a
separate pruning set)
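
As a practical illustration (not from the slides), postpruning can be approximated with cost-complexity pruning in scikit-learn, using a separate validation set as the pruning set to pick the pruning strength:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Grow the tree fully, then evaluate increasingly pruned versions on a
# separate validation (pruning) set and keep the best one.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best_tree, best_score = None, -1.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score >= best_score:
        best_tree, best_score = tree, score
print(best_score, best_tree.get_n_leaves())
```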
Rule Extraction from Trees
• A decision tree does its own feature extraction.
• We can build a tree and then take only those features used by the tree as inputs to another learning method.
• Another main advantage of decision trees is interpretability: the decision nodes carry conditions that are simple to understand.
• Rule base: set of IF-THEN rules
• Knowledge Extraction: rule base allows knowledge extraction. It allows experts to verify the
model learned from data.
• Rule support - the percentage of training data covered by the rule
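
For instance (a tooling choice assumed here, not taken from the slides), scikit-learn can print the decision nodes of a fitted tree as readable IF-THEN-style conditions:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
# Print the tree's conditions in a human-readable, rule-like form.
print(export_text(tree, feature_names=load_iris().feature_names))
```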
Learning Rules
• Rule induction is similar to tree induction but
• tree induction is breadth-first,
• rule induction is depth-first; one rule at a time
• Rule set contains rules; rules are conjunctions of terms
• A rule is said to cover an example if the example satisfies all the conditions of
the rule.
• Sequential covering: Generate rules one at a time until all positive examples
are covered
• Example: IREP (Fürnkranz and Widmer, 1994), Ripper (Cohen, 1995)
IREP
• IREP (Incremental Reduced Error Pruning) is an algorithm used for rule induction in machine learning. It builds a set of rules (a rule set) to classify data, learning one rule at a time, to capture patterns in the data.
• Steps of IREP:
1.Generate a rule: It starts by finding a rule that covers many
positive examples (examples belonging to the class of interest).
2.Prune the rule: It simplifies the rule by removing unnecessary parts
to avoid overfitting to the training data.
3.Remove covered examples: Once a rule is created, the examples
covered by that rule are removed from the dataset.
4.Repeat: The algorithm repeats this process, generating more rules,
until all the positive examples are covered.
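
A hedged skeleton of this sequential-covering loop; rule growing and pruning are left as abstract callables, and all names are assumptions:

```python
def sequential_covering(examples, learn_one_rule, prune_rule, covers):
    """Greedily learn rules one at a time until no positive examples remain.

    learn_one_rule(examples) -> a rule covering many positive examples
    prune_rule(rule, examples) -> a simplified rule (reduced-error pruning)
    covers(rule, example) -> True if the example satisfies all rule conditions
    """
    rule_set = []
    remaining = list(examples)
    while any(ex["label"] == "positive" for ex in remaining):
        rule = prune_rule(learn_one_rule(remaining), remaining)
        still_uncovered = [ex for ex in remaining if not covers(rule, ex)]
        if len(still_uncovered) == len(remaining):
            break                       # guard: the new rule covers nothing
        rule_set.append(rule)
        remaining = still_uncovered     # remove covered examples and repeat
    return rule_set
```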
Multivariate Trees
In a multivariate tree, at a decision node, all input dimensions can be used and thus it is
more general.
Example Problems
1. https://www.youtube.com/watch?v=zNYdkpAcP-g&t=595s

2. https://www.youtube.com/watch?v=wefc_36d5mU&t=52s

3. https://www.youtube.com/watch?v=y6VwIcZAUkI&t=113s
