Trees

Decision Trees are non-parametric models used for classification and regression, constructed using nodes and branches to evaluate features and make predictions. They are easy to interpret and require less data to train, but are prone to overfitting and can be biased towards features with many levels. Techniques like ID3 and CART utilize measures such as Information Gain and Gini Index to determine the best attributes for splitting data.


Decision Trees

• In the machine learning world, decision trees are a kind of non-parametric model that can be used for both classification and regression.

• Decision trees are flexible models that do not increase their number of parameters as we add more features (if we build them correctly), and they can output either
• a categorical prediction (e.g. whether a plant is of a certain kind or not), or
• a numerical prediction (e.g. the price of a house).
Decision Trees
• They are constructed using two kinds of elements: nodes and branches.
• At each node, one of the features of the data is evaluated, either to split the observations during training or to make a specific data point follow a certain path when making a prediction.
• Decision trees are built by recursively evaluating different features and using, at each node, the feature that best splits the data.
Decision Trees
• The root node: the node that starts the tree. In a normal decision tree it evaluates the variable that best splits the data.
• Intermediate nodes: nodes where variables are evaluated but which are not the final nodes where predictions are made.
• Leaf nodes: the final nodes of the tree, where the prediction of a category or a numerical value is made.
Decision Trees
• A decision tree uses an if-then rule set which is mutually exclusive and exhaustive for classification.
• The rules are learned sequentially from the training data, one rule at a time.
• Each time a rule is learned, the tuples covered by the rule are removed.
• This process continues on the training set until a termination condition is met.
Decision Trees
• Advantages:
• The main advantage of decision trees is how easy they are to interpret. While other machine learning models are close to black boxes, decision trees provide a graphical and intuitive way to understand what the algorithm does.
• Compared to other machine learning algorithms, decision trees require less data to train.
• They can be used for classification and regression.
• They are simple.

• Disadvantages:
• They are quite prone to overfitting the training data and can be sensitive to outliers.
• They are weak learners: a single decision tree normally does not make great predictions, so multiple trees are often combined into 'forests' to give birth to stronger ensemble models.
• Bias towards features with many levels: decision trees tend to favour features with many levels or categories because they can create more splits, potentially leading to biased predictions.
Example 1: Decision Tree
• We need to classify the fruits in the basket
Example 2: Decision Tree
Quantifying impurity
• To quantify the uncertainty at each node, we first calculate its impurity.
• Impurity ranges from 0 to 1:
• low impurity means low uncertainty,
• high impurity means high uncertainty.
Quantifying impurity

Impurity = 1 − p, where p is the fraction of observations at the node that belong to the majority class (a pure node therefore has impurity 0).
Decision Tree Induction Techniques
• Decision tree induction is a top-down, recursive, divide-and-conquer approach.
• Choose an attribute and split the larger training set into smaller training sets.
• Different algorithms have been proposed to control
1. the choice of the best attribute to split on – the attribute selection measure
2. the splitting criteria
• Some of the algorithms are
• ID3
• CART
• Issues
• Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
• Determine when to stop splitting
Attribute Selection Measures
• An attribute selection measure is a heuristic for selecting the splitting criterion that “best” separates a given data partition, D, of class-labeled training tuples into individual classes.

• If we were to split D into smaller partitions according to the outcomes of the splitting criterion, ideally each partition would be pure (i.e., all the tuples that fall into a given partition would belong to the same class).

• Conceptually, the “best” splitting criterion is the one that most closely results in such a scenario.
Measures of Node Impurity
• Entropy

• Information Gain – used by ID3 algorithm

• Gini Index – used by CART


Entropy
Entropy is an information-theory metric that measures the impurity or uncertainty in a group of observations. It determines how a decision tree chooses to split data and is used to measure how informative a node is.

In physics and mathematics, entropy refers to the randomness or uncertainty of a random variable X; in information theory, it refers to the impurity in a group of examples. Splitting on any attribute has the property that the average entropy of the resulting training subsets is less than or equal to that of the previous training set.

Information gain is the decrease in entropy: it is the difference between the entropy before the split and the weighted average entropy after the dataset is split on the given attribute values.

The attribute with the largest information gain is chosen as the splitting attribute.
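To make these definitions concrete, here is a minimal Python sketch (the function names and the example labels are illustrative, not from the slides) that computes the entropy of a labelled set and the information gain of a candidate split:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy before the split minus the weighted average entropy after it."""
    n = len(labels)
    after = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - after

# Example: 9 'yes' and 5 'no' labels give entropy of about 0.940 bits.
labels = ['yes'] * 9 + ['no'] * 5
print(round(entropy(labels), 3))   # 0.94
```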
Attribute Selection Measures – Information Gain
• ID3 uses information gain as its attribute selection measure.
• Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).
• Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N.
Entropy
• A = {1, 1, 1, 1, 1, 2, 2, 2, 3}
• There are 3 distinct classes.
• p1 = 5/9, p2 = 3/9, p3 = 1/9
• E = p1·log2(1/p1) + p2·log2(1/p2) + p3·log2(1/p3)

• Entropy of a training set: if there are k classes c1, …, ck, and pi denotes the number of occurrences of class ci divided by the total number of instances, then
E(D) = −Σ_{i=1}^{k} pi·log2(pi)
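As a quick numerical check of the example above (a small sketch; the rounding is ours):

```python
from math import log2

A = [1, 1, 1, 1, 1, 2, 2, 2, 3]
counts = {c: A.count(c) for c in set(A)}       # {1: 5, 2: 3, 3: 1}
probs = [n / len(A) for n in counts.values()]  # [5/9, 3/9, 1/9]
E = sum(p * log2(1 / p) for p in probs)
print(round(E, 3))   # about 1.352 bits
```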
Entropy
What happens if all the observations belong to the same class? The entropy would be 0:

E = −(1·log2 1) = 0

Such a dataset does not have any impurity, and it would not be useful for learning.

However, if we have a dataset with, say, two classes, half yellow and half purple, the entropy will be 1:

E = −(0.5·log2 0.5 + 0.5·log2 0.5) = 1

This kind of dataset is good for learning.

Attribute Selection Measures –
Information Gain
• This attribute minimizes the information needed to classify the tuples in the resulting partitions.
• Such an approach minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple (but not necessarily the simplest) tree is found.

Info(D) = −Σ_{i=1}^{m} pi·log2(pi) is known as the entropy of D,

where pi is the nonzero probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D| / |D|.

Info(D) is just the average amount of information needed to identify the class label of a tuple in D.
Attribute Selection Measures – Information Gain

Info(D) is the entropy of D before splitting. The expected information required after splitting D on attribute A (with v distinct values) is the weighted average entropy of the resulting partitions:

Info_A(D) = Σ_{j=1}^{v} (|Dj| / |D|) × Info(Dj)

Gain(A) = Info(D) − Info_A(D)

The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N. This is equivalent to saying that we want to partition on the attribute A that would do the “best classification,” so that the amount of information still required to finish classifying the tuples is minimal.
Example

Training set, D, of class-labeled tuples randomly selected from the


AllElectronics customer database.
Example

• Each attribute is discrete valued.

• Continuous-valued attributes have been


generalized.)

• The class label attribute, buys computer, has


two distinct values (namely, {yes, no});
therefore, there are two distinct classes (i.e., m
=2).

• Let class C1 correspond to yes and class C2


correspond to no.

• There are nine tuples of class yes and five tuples


Training set, D, of class-labeled tuples randomly selected fromno.
of class the A (root) node N is created for the
AllElectronics customer database. tuples in D.
Example

1. To find the splitting criterion for these tuples, we must


compute the information gain of each attribute using below
formula.
Example

2. we need to compute the expected information requirement


for each attribute using formula:

Let us compute for age attribute- with values youth, middle


aged, senior

Info(buys_computer|age=youth) = -2/5 log2(2/5) – 3/5log2(3/5)


= 0.97095
Info(buys_computer|age=middle_aged)=-4/4log2(4/4)=0
Info(buys_computer|age=senior)=-3/5log2(3/5)-2/5log2(2/5)=
0.097095

Infoage(D) = (5/14)*(0.97095) + (4/14)*(0) + (5/14)*(0.97095)


=0.694

Gain(age) = Info(D) − Info_age(D) = 0.940 − 0.694 = 0.246 bits


Example

Calculate the gain for the income, student and credit_rating attributes.


Example

Gain(income)= 0.029 bits

Gain(student)=0.151 bits

Gain(credit_rating)= 0.048 bits

Gain (age)= 0.246 bits

As age has the highest information gain, it is selected as the splitting attribute.
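The same arithmetic can be verified in a few lines. The per-value class counts for age (youth: 2 yes / 3 no, middle_aged: 4 yes / 0 no, senior: 3 yes / 2 no) come from the slides; the helper function is an illustrative sketch:

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as a list/tuple of counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

# Class counts (yes, no) in each partition induced by 'age'.
partitions = {"youth": (2, 3), "middle_aged": (4, 0), "senior": (3, 2)}

info_D = entropy((9, 5))                                  # 0.940 bits for D
total = sum(sum(c) for c in partitions.values())          # 14 tuples
info_age = sum(sum(c) / total * entropy(c) for c in partitions.values())

print(round(info_D, 3), round(info_age, 3))               # 0.94 0.694
print("Gain(age) =", round(info_D - info_age, 3), "bits") # 0.247 (0.246 with the rounded values used on the slides)
```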
Example: Decision Tree

1. Compute the entropy for the dataset (the weather/play dataset; the step-by-step calculations are shown on the slides).
Example: Decision Tree
• Selection of the root node: outlook is chosen (it has the highest information gain).
Example: Decision Tree (rainy branch)

The rainy subset is split further using the same entropy and information-gain calculations (worked out step by step on the slides; one of the resulting subsets has entropy E = 0.982).

Person will go to play?


sunny
rain
overcast Windy: Strong
Humidity: High
Outlook: Sunny

Person will not go to play


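Read as the mutually exclusive if-then rules mentioned earlier, the finished tree can be sketched as plain code. This is a minimal sketch assuming the classic play-tennis tree (outlook at the root, humidity under sunny, windy under rainy), which is what the slides appear to use:

```python
def will_play(outlook, humidity, windy):
    """Classic play-tennis tree read as if-then rules (assumed structure)."""
    if outlook == "overcast":
        return "yes"
    if outlook == "sunny":
        return "yes" if humidity == "normal" else "no"
    if outlook == "rainy":
        return "no" if windy == "strong" else "yes"
    raise ValueError("unknown outlook value")

# The instance from the slide: sunny outlook, high humidity, strong wind.
print(will_play("sunny", "high", "strong"))   # 'no' -> person will not go to play
```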
CART Algorithm
• The information gain measure used in ID3 is biased towards tests with many outcomes: ID3 prefers to select attributes (e.g. outlook, humidity, windy) having a large number of values.
• L. Breiman, J. Friedman, R. Olshen and C. Stone proposed an algorithm in 1984 to build a binary decision tree, called the CART decision tree.
• CART stands for Classification And Regression Tree.
• CART is a technique that generates a binary decision tree: unlike ID3, in CART only two branches are created at each node.
• ID3 uses information gain as the measure to select the best attribute to split on; CART uses another measure called the Gini index, also known as the Gini Index of Diversity.
Gini Index
• The Gini index (Gini impurity) measures the degree or probability of a particular element being wrongly classified when it is randomly chosen.

• If all the elements belong to a single class, the node can be called pure. The Gini index varies between 0 and 1:
• 0 denotes that all elements belong to a certain class or there exists only one class (pure), and
• 1 denotes that the elements are randomly distributed across various classes (impure).

• A Gini index of 0.5 denotes elements equally distributed between two classes.
Formula for Gini Index
• The formula for the Gini index is:

Gini(D) = 1 − Σ_{i=1}^{m} pi²

where pi is the probability of an object being classified to a particular class.

While building the decision tree, we prefer to choose the attribute/feature with the least Gini index as the root node.
Gini Index of Diversity

Suppose a binary partition on A splits D into D1 and D2. Then the weighted-average Gini index of the split, denoted Gini_A(D), is given by

Gini_A(D) = (|D1| / |D|)·Gini(D1) + (|D2| / |D|)·Gini(D2)

This binary partition of D reduces the impurity, and the reduction in impurity is measured by

ΔGini(A) = Gini(D) − Gini_A(D)
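These formulas translate directly into code; a minimal sketch with illustrative names:

```python
def gini(counts):
    """Gini(D) = 1 - sum(p_i^2) for a class distribution given as counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(left_counts, right_counts):
    """Weighted-average Gini index of a binary partition D1, D2."""
    n1, n2 = sum(left_counts), sum(right_counts)
    n = n1 + n2
    return n1 / n * gini(left_counts) + n2 / n * gini(right_counts)

def gini_reduction(parent_counts, left_counts, right_counts):
    """Reduction in impurity achieved by the binary split."""
    return gini(parent_counts) - gini_split(left_counts, right_counts)
```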
Example: CART Algorithm
Build a decision tree for the data set EMP given in the table below.

Legend: Age: Y = young, M = middle-aged, O = old; Salary: L = low, M = medium, H = high; Job: G = government, P = private; Performance: A = average, E = excellent; Class (Select): Y = yes, N = no.

Tuple#  Age  Salary  Job  Performance  Select
1       Y    H       P    A            N
2       Y    H       P    E            N
3       M    H       P    A            Y
4       O    M       P    A            Y
5       O    L       G    A            Y
6       O    L       G    E            N
7       M    L       G    E            Y
8       Y    M       P    A            N
9       Y    L       G    A            Y
10      O    M       G    A            Y
11      Y    M       G    E            Y
12      M    M       P    E            Y
13      M    H       G    A            Y
14      O    M       P    E            N
Example: CART Algorithm
For the EMP data set: total samples = 14, Y = 9, N = 5, so

Gini(EMP) = 1 − (9/14)² − (5/14)² ≈ 0.459

Consider the calculation of the weighted Gini index, Gini_A(EMP), for Age, Salary, Job and Performance.
Example: CART Algorithm
Attribute of splitting: Age
• Age has three values, namely Y, M and O, so there are 6 subsets that could be considered for splitting (three distinct binary partitions).
• For the first partition, {Y | M,O}, Age is divided into two branches, Y and (M, O).
• For Y there are 5 instances (2 Yes and 3 No); for (M, O) there are 9 instances (7 Yes and 2 No).

Gini_{Y|M,O}(EMP) = (5/14)·[1 − (2/5)² − (3/5)²] + (9/14)·[1 − (7/9)² − (2/9)²] ≈ 0.3936
Example: CART Algorithm
Attribute of splitting: Age
• For the partition {M | Y,O}: M has 4 instances (4 Yes); Y and O together have 10 instances (5 Yes and 5 No).

Gini_{M|Y,O}(EMP) = (4/14)·[1 − (4/4)²] + (10/14)·[1 − (5/10)² − (5/10)²] ≈ 0.3571
Example: CART Algorithm
Attribute of splitting: Age
• For the partition {O | Y,M}: O has 5 instances (3 Yes and 2 No); Y and M together have 9 instances (6 Yes and 3 No).

Gini_{O|Y,M}(EMP) = (5/14)·[1 − (3/5)² − (2/5)²] + (9/14)·[1 − (6/9)² − (3/9)²] ≈ 0.4571
Example: CART Algorithm
Attribute of splitting: Age
• Gini_{Y|M,O}(EMP) ≈ 0.3936
• Gini_{M|Y,O}(EMP) ≈ 0.3571
• Gini_{O|Y,M}(EMP) ≈ 0.4571

Note that {Y,M | O} is the same partition as {O | Y,M}, and {Y,O | M} is the same as {M | Y,O}, so only three distinct binary splits need to be evaluated.

The best split on attribute Age is the one with the minimum weighted Gini index, i.e. {M | Y,O} with ≈ 0.3571.
Example: CART Algorithm
Attribute of splitting: Salary
The attribute Salary has three values, namely L, M and H, so again there are 6 candidate subsets for splitting, e.g. {L | M,H}, {M | L,H} and {H | L,M}. The weighted Gini index of each candidate split is computed and compared with the overall Gini index, Gini(EMP).
Example: CART Algorithm
Attribute of splitting: Job
Job is a binary attribute ({G | P}), so only one split needs to be evaluated.
Example: CART Algorithm
Attribute of splitting: Performance
Performance is also a binary attribute ({A | E}), so only one split needs to be evaluated.

Out of all these attributes, the one whose best split gives the maximum reduction in impurity (equivalently, the minimum weighted Gini index) is chosen to split the current subset. A sketch of this search is given below.
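Using Gini helpers like the ones sketched earlier, the search over candidate binary splits of the EMP table can be written as a short loop. The rows are transcribed from the table above; everything else (names, output format) is illustrative:

```python
from itertools import combinations

def gini(counts):
    """Gini(D) = 1 - sum(p_i^2) for a class distribution given as counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(left, right):
    """Weighted-average Gini index of a binary partition."""
    n = sum(left) + sum(right)
    return sum(left) / n * gini(left) + sum(right) / n * gini(right)

# EMP table rows: (Age, Salary, Job, Performance, Select)
rows = [
    ("Y","H","P","A","N"), ("Y","H","P","E","N"), ("M","H","P","A","Y"),
    ("O","M","P","A","Y"), ("O","L","G","A","Y"), ("O","L","G","E","N"),
    ("M","L","G","E","Y"), ("Y","M","P","A","N"), ("Y","L","G","A","Y"),
    ("O","M","G","A","Y"), ("Y","M","G","E","Y"), ("M","M","P","E","Y"),
    ("M","H","G","A","Y"), ("O","M","P","E","N"),
]
attrs = {"Age": 0, "Salary": 1, "Job": 2, "Performance": 3}

def class_counts(subset):
    yes = sum(1 for r in subset if r[4] == "Y")
    return (yes, len(subset) - yes)

best = None
for name, col in attrs.items():
    values = sorted({r[col] for r in rows})
    # every non-empty proper subset of values defines one binary split
    for k in range(1, len(values)):
        for left_vals in combinations(values, k):
            left = [r for r in rows if r[col] in left_vals]
            right = [r for r in rows if r[col] not in left_vals]
            g = gini_split(class_counts(left), class_counts(right))
            if best is None or g < best[0]:
                best = (round(g, 4), name, left_vals)

print(best)   # (0.3571, 'Age', ('M',)) -> split {M} vs {Y, O}
```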
Trees Vs Linear Model
• Comparing tree-based models with linear models is essential in machine learning, as they represent two different approaches to modeling data. Here's a
comparison between the two:

• Model Complexity:
Tree-Based Models: Decision trees can capture complex nonlinear relationships between features and the target variable. Ensemble methods like random forests and gradient
boosting machines further increase the model complexity by combining multiple decision trees.
Linear Models: Linear models assume a linear relationship between the features and the target variable. They are relatively simple and may struggle to capture complex nonlinear
relationships.

• Interpretability:
Tree-Based Models: Decision trees offer interpretability as they represent a series of simple if-else conditions. It's easy to understand the decision-making process within a tree
model.
Linear Models: Linear models are also interpretable, as coefficients associated with each feature indicate the direction and strength of the relationship between the feature and the
target variable.

• Handling of Features:
Tree-Based Models: Tree models can handle both numerical and categorical features without the need for preprocessing such as one-hot encoding.
Linear Models: Linear models typically require preprocessing for categorical variables, such as one-hot encoding, to represent them effectively.

• Robustness to Outliers:
Tree-Based Models: Decision trees are generally robust to outliers since they partition the feature space based on thresholds.
Linear Models: Linear models can be sensitive to outliers, especially in models like linear regression, where outliers can disproportionately influence the estimated coefficients.

• Overfitting:
Tree-Based Models: Decision trees are prone to overfitting, especially when the tree depth is not limited. Ensemble methods like random forests and gradient boosting machines
mitigate overfitting to some extent.
Linear Models: Linear models are less prone to overfitting, especially when the number of features is small compared to the number of observations.

• Computational Complexity:
Tree-Based Models: Building decision trees can be computationally expensive, especially for large datasets, although predicting with a trained tree is typically fast.
Linear Models: Linear models are computationally efficient both during training and prediction, making them suitable for large datasets.
Ensembles
• An ensemble is simply a collection of models that are all trained to perform the same task.
• An ensemble can consist of many different versions of the same model, or of many different types of models.
• The final output of an ensemble of classifiers is typically obtained through a (weighted) average or vote of the predictions of the different models in the ensemble.
• An ensemble of different models that all achieve similar generalization performance often outperforms any of the individual models.
What is Ensemble Learning?
Ensemble learning is a powerful method to improve the performance of a model: it combines the predictions from multiple models, and the results improve as a consequence. To make the ensemble useful, the base models should differ:

• use different base models, or
• use the same model but provide it with different data, or
• a combination of the two.
Ensemble Prediction
For a classification problem, the ensemble counts the 1s and 0s predicted by the base models and outputs the majority class.
Example: Pred1 = 0, Pred2 = 1, Pred3 = 1, so count of ones = 2 and count of zeros = 1; the majority is 1, so the ensemble prediction = 1.

For a regression problem, the ensemble takes the mean of all predictions and outputs that mean value.
Example: Pred1 = 1.5, Pred2 = 2.5, Pred3 = 2.0, so the average = (1.5 + 2.5 + 2.0) / 3 and the ensemble prediction = 2.
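The two combination rules above can be sketched in a few lines of Python (illustrative names; the prediction values are the ones from the example):

```python
from statistics import mean, mode

# Classification: majority vote over the base-model predictions.
class_preds = [0, 1, 1]             # Pred1, Pred2, Pred3
ensemble_class = mode(class_preds)  # -> 1 (two ones vs one zero)

# Regression: mean of the base-model predictions.
reg_preds = [1.5, 2.5, 2.0]
ensemble_reg = mean(reg_preds)      # -> 2.0

print(ensemble_class, ensemble_reg)
```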
Types of Ensemble Learning
• Voting Ensemble

• Bagging or Bootstrap Aggregation — Random Forest

• Boosting — AdaBoost, Gradient Boost and XGBoost

• Stacking Ensemble
Some Simple Ensembles
• Voting or averaging of the predictions of multiple pre-trained models.
• "Stacking": use the predictions of multiple models as features to train a new model, and use the new model to make predictions on test data.
Ensemble : Another Approach
• Instead of training different models on same data, train same model
multiple times on different data sets, and combine these different
models.
• How do we get multiple training data sets (in practice, we have only
one data set at training time)?
Bagging
"Bagging" comes from Bootstrap AGGregating.
• Each base model is provided with a different sample of the full data set D.
• Sampling can be of three types:
a) row sampling
b) column sampling
c) a combination of row and column sampling
• The samples D1, D2 and D3 should always be smaller than D.
• The same machine learning model is used for each sample, and all of them run in parallel.
• Bootstrap sampling: in bagging, a random sample of the training set is drawn "with replacement", meaning that individual data points can be chosen more than once.
• Because all the machine learning models are the same but the input data differs, the predictions have variety.
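A minimal bagging sketch, assuming scikit-learn is available and using the same decision-tree model on each bootstrap sample; X, y and X_new are placeholder NumPy arrays:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X, y, X_new, n_models=3, sample_frac=0.8, seed=0):
    """Train the same model on bootstrap samples (rows drawn with replacement)
    and combine the per-model predictions by majority vote (0/1 labels)."""
    rng = np.random.default_rng(seed)
    all_preds = []
    for _ in range(n_models):
        # row sampling with replacement (bootstrap sample smaller than D)
        idx = rng.choice(len(X), size=int(sample_frac * len(X)), replace=True)
        model = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
        all_preds.append(model.predict(X_new))
    all_preds = np.array(all_preds)               # shape: (n_models, n_new)
    return (all_preds.sum(axis=0) > n_models / 2).astype(int)

# Usage (placeholders): preds = bagging_predict(X_train, y_train, X_test)
```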
Random Forest
Random forest is a commonly-used machine learning algorithm,
which combines the output of multiple decision trees to reach a single result.
Its ease of use and flexibility have fuelled its adoption, as it handles both classification and regression problems.
Random Forest Hyperparameter
• max_depth
• min_sample_split
• max_leaf_nodes
• min_samples_leaf
• n_estimators
• max_sample (bootstrap sample)
• max_features
Random Forest Hyperparameter
• max_depth: the depth of a tree in a random forest, defined as the longest path between the root node and a leaf node.
• min_sample_split: tells each decision tree in the random forest the minimum number of observations required in a node in order to split it. The default value of 2 means a tree often keeps splitting until the nodes are completely pure; as a result the tree grows large and overfits the data. Increasing min_sample_split reduces the number of splits and therefore helps prevent overfitting (the slide illustrates changing the value from 2 to 6).
• min_sample_leaf: the minimum number of samples that must be present in a leaf node after splitting a node.
Random Forest Hyperparameter
• n_estimators: a random forest is nothing but a group of trees, so this sets how many trees to build.
• max_samples: the fraction of the original dataset given to any individual tree. (In the example shown on the slide, model performance reaches its maximum when each tree receives less than a 0.2 fraction of the original dataset.)
• max_features: the maximum number of features provided to each tree in the random forest.
• max_terminal_nodes / max_leaf_nodes: sets a condition on the splitting of nodes in the tree and hence restricts the growth of the tree.
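The hyperparameters listed above correspond to scikit-learn's RandomForestClassifier arguments (with slightly different spellings, e.g. min_samples_split). The values below are purely illustrative, not recommendations:

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative settings for the hyperparameters discussed above.
rf = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_depth=6,           # longest allowed path from root to leaf
    min_samples_split=6,   # minimum observations in a node to split it
    min_samples_leaf=2,    # minimum observations left in each leaf
    max_leaf_nodes=20,     # cap on the number of leaf nodes per tree
    max_features="sqrt",   # features offered to each split
    max_samples=0.8,       # fraction of the data bootstrapped for each tree
    bootstrap=True,
    random_state=0,
)
# Usage (placeholders): rf.fit(X_train, y_train); rf.predict(X_test)
```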
Boosting
• Boosting is a method used in machine learning to reduce errors in predictive data analysis.
• A single machine learning model might make prediction errors depending on the accuracy of the training dataset.
• Boosting tries to overcome this issue by training multiple models sequentially to improve the accuracy of the overall system.

How does boosting work? (A sketch follows the steps below.)
• Create the initial weak learner.
• Use the weak learner to make predictions on the entire dataset.
• Compute the prediction errors.
• Incorrect predictions are assigned more weight.
• Build another weak learner aimed at fixing the errors of the previous learner.
• Make predictions on the whole dataset using the new learner.
• Repeat this process until optimal results are obtained.
• The final model is obtained as a weighted combination of all the weak learners.
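The sequential procedure above is essentially what AdaBoost implements. A minimal scikit-learn sketch (dataset names are placeholders; by default the weak learners are depth-1 decision trees):

```python
from sklearn.ensemble import AdaBoostClassifier

# Each round gives more weight to the observations the previous round got wrong,
# and the final prediction is a weighted vote of all the weak learners.
boosted = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)

# Usage (placeholders): boosted.fit(X_train, y_train); boosted.predict(X_test)
```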
Boosting
Why is boosting important?
Boosting is used when you have high bias and low variance.

• Boosting improves machine learning models' predictive accuracy and performance by converting multiple weak learners into a single strong learning model. Machine learning models can be weak learners or strong learners:
• Weak learners
• Weak learners have low prediction accuracy, similar to random guessing. They are prone to overfitting—
that is, they can't classify data that varies too much from their original dataset. For example, if you train
the model to identify cats as animals with pointed ears, it might fail to recognize a cat whose ears are
curled.
• Strong learners
• Strong learners have higher prediction accuracy. Boosting converts a system of weak learners into a single
strong learning system. For example, to identify the cat image, it combines a weak learner that guesses for
pointy ears and another learner that guesses for cat-shaped eyes. After analysing the animal image for
pointy ears, the system analyzes it once again for cat-shaped eyes. This improves the system's overall
accuracy.
Boosting
• The same model is used at each stage, and the training data for each stage emphasizes the observations wrongly predicted by the prior model; the models are trained in sequential order.
• The data is given to the first model.
• During training, all data that is incorrectly predicted is identified.
• A prediction is made and the wrong predictions are passed on to the next predictor.
• At each iteration, more and more attention is paid to the observations with prediction errors.
• New predictions are made until the limit is reached or a higher accuracy is achieved.
• Models that provide more accuracy are assigned larger weights.
Bagging Vs Boosting
