Trees
• In the Machine Learning world, Decision Trees are a kind of non-parametric
model that can be used for both classification and regression.
• Decision trees are flexible models whose number of parameters does not grow
as we add more features (if we build them correctly), and they can output
either
• a categorical prediction (for example, whether a plant is of a certain kind or not) or
• a numerical prediction (for example, the price of a house).
Decision Trees
• They are constructed using two kinds of
elements: nodes and branches.
• At each node, one of the features of our
data is evaluated, either to split the
observations during training or to make
a specific data point follow a certain
path when making a prediction.
• Decision trees are built by recursively
evaluating different features and using
at each node the feature that best
splits the data.
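A minimal sketch of this recursive, greedy construction in plain Python, assuming categorical features and Gini impurity as the splitting criterion; the names gini, best_split and build_tree are illustrative, not a standard API:

```python
# Sketch only: greedy, recursive tree construction with categorical features.
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Find the (feature, value) whose equals / not-equals split has the lowest weighted impurity."""
    best = None
    for f in range(len(rows[0])):
        for v in set(row[f] for row in rows):
            left = [y for row, y in zip(rows, labels) if row[f] == v]
            right = [y for row, y in zip(rows, labels) if row[f] != v]
            if not left or not right:
                continue  # this split does not separate anything
            w = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or w < best[2]:
                best = (f, v, w)
    return best

def build_tree(rows, labels):
    """Recursively build the tree; leaves store the majority class."""
    split = best_split(rows, labels)
    if len(set(labels)) == 1 or split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf node
    f, v, _ = split
    yes = [(row, y) for row, y in zip(rows, labels) if row[f] == v]
    no = [(row, y) for row, y in zip(rows, labels) if row[f] != v]
    return {"feature": f, "value": v,
            "yes": build_tree([r for r, _ in yes], [y for _, y in yes]),
            "no": build_tree([r for r, _ in no], [y for _, y in no])}

# Tiny hypothetical usage: (colour, size) -> fruit
rows = [("green", "big"), ("yellow", "big"), ("red", "small"), ("green", "small")]
labels = ["watermelon", "banana", "cherry", "grape"]
print(build_tree(rows, labels))
```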
Decision Trees
• The Root Node: Is the node that starts
the graph. In a normal decision tree it
evaluates the variable that best splits
the data.
• Intermediate nodes: These are nodes
where variables are evaluated but which
are not the final nodes where
predictions are made.
• Leave nodes: These are the final nodes
of the tree, where the predictions of a
category or a numerical value are made.
Decision Trees
• A decision tree can be read as an if-then
rule set that is mutually exclusive and
exhaustive for classification.
• The rules are learned sequentially from
the training data, one at a time.
• Each time a rule is learned, the tuples
covered by that rule are removed.
• This process continues on the training
set until a termination condition is met.
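One way to see this rule-set view in practice, assuming scikit-learn is available (the slides do not prescribe a library): fit a small tree and print the if-then rules it encodes. Each root-to-leaf path is one rule, and together the rules are mutually exclusive and exhaustive.

```python
# Illustrative only: print the if-then rules of a small fitted tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# Each printed path from the root to a leaf is one if-then rule.
print(export_text(clf, feature_names=list(data.feature_names)))
```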
Decision Trees
• Advantages:
• The main advantage of decision trees is how easy they are to interpret. While other machine
learning models are close to black boxes, decision trees provide a graphical and intuitive way
to understand what our algorithm does.
• Compared to other Machine Learning algorithms, Decision Trees require less data to train.
• They can be used for Classification and Regression.
• They are simple.
• Disadvantages
• They are quite prone to overfitting the training data and can be sensitive to outliers.
• They are weak learners: a single decision tree normally does not make great predictions, so
multiple trees are often combined into ‘forests’ to build stronger ensemble models.
• Bias towards features with many levels: Decision trees tend to favor features with many
levels or categories because they can create more splits, potentially leading to biased
predictions.
Example 1: Decision Tree
• We need to classify the fruits in the basket
Example 2: Decision Tree
Quantifying impurity
• Quantify the uncertainty at each node by first calculating its impurity.
• Impurity ranges from 0 to 1.
• Impurity = 1 - p; for example, a node whose samples all belong to a single class has p = 1 and impurity 0.
Decision Tree Induction Techniques
• Decision tree induction is a top-down, recursive, divide-and-conquer approach.
• Choose an attribute and split the larger training set into smaller training sets.
• Different algorithms have been proposed to control
1. the choice of the best attribute to split on (the attribute selection measure)
2. the splitting criteria
• Some of the algorithms are
• ID3
• CART
• Issues
• Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
• Determine when to stop splitting
Attribute Selection Measures
• An attribute selection measure is a heuristic for selecting the splitting
criterion that “best” separates a given data partition, D, of class-labeled
training tuples into individual classes.
• Conceptually, the “best” splitting criterion is the one that most closely
results in a pure partitioning, i.e., each resulting partition contains tuples of a single class.
Measures of Node Impurity
• Entropy
• Entropy measures the impurity of a given dataset. In physics and mathematics, entropy refers to the
randomness or uncertainty of a random variable X; in information theory, it refers to the impurity in a
group of examples.
• Splitting on any attribute has the property that the average entropy of the resulting training subsets is
less than or equal to that of the previous training set.
• Information gain is the decrease in entropy: it is the difference between the entropy before the split and
the average entropy after the split of the dataset on the given attribute's values.
• The attribute with the largest value of information gain is chosen as the splitting attribute.
Attribute Selection Measures – Information Gain
• ID3 uses information gain as its attribute selection measure.
• The expected information (entropy) needed to classify a tuple in D is
Info(D) = -Σ pi log2(pi), summing over the m classes,
where pi is the nonzero probability that an arbitrary tuple in D belongs to class Ci and is
estimated by |Ci,D|/|D|.
• Info(D) is just the average amount of information needed to identify the class label of a tuple in D.
• If all tuples belong to a single class, E = -(1 · log2 1) = 0. Such a dataset does not have any
impurity, and would not be useful for learning.
• However, if we have a dataset with, say, two classes, half made up of yellow and the other half
being purple, the entropy will be one: E = -((0.5 · log2 0.5) + (0.5 · log2 0.5)) = 1.
• Info(D) is the entropy of D before splitting.
• If D is split on attribute A into v partitions D1, …, Dv, the expected information still needed to
classify the tuples is the weighted average of the partition entropies:
InfoA(D) = Σ (|Dj| / |D|) × Info(Dj), summing over j = 1, …, v.
• Gain(A) = Info(D) - InfoA(D)
• The attribute A with the highest information gain, Gain(A), is chosen as the splitting
attribute at node N.
• This is equivalent to saying that we want to partition on the attribute A that would do the
“best classification,” so that the amount of information still required to finish classifying
the tuples is minimal.
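A small sketch of this ID3-style computation in plain Python (illustrative function names, not library code): Info(D) for a list of class labels and Gain(A) for a single attribute column.

```python
# Sketch only: entropy and information gain for categorical data.
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the classes present in D."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D), where `values` is the column of attribute A."""
    n = len(labels)
    info_a = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        info_a += len(subset) / n * entropy(subset)
    return entropy(labels) - info_a

print(entropy(["yes"] * 5))                 # 0.0  (pure set)
print(entropy(["yes", "no", "yes", "no"]))  # 1.0  (even two-class split)

# Hypothetical attribute column and labels, just to exercise the function:
outlook = ["sunny", "sunny", "overcast", "rain"]
play    = ["no",    "no",    "yes",      "yes"]
print(information_gain(outlook, play))      # 1.0  (the split is perfect)
```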
Example
• Gain(student) = 0.151 bits
Example: Decision Tree
• Worked example on the “rainy” branch (the step-by-step figures are not reproduced here);
one of the intermediate entropies evaluates to E = 0.982.
Gini Index
• If all the elements belong to a single class, the node can be called pure. The Gini Index
varies between 0 and 1,
• where 0 denotes that all elements belong to a certain class or there exists only
one class (pure), and 1 denotes that the elements are randomly distributed
across various classes (impure).
• A Gini Index of 0.5 denotes elements equally distributed across some classes.
Formula for Gini Index
• The formula of the Gini Index is as follows:
Gini(D) = 1 - Σ pi², summing over the m classes, where pi is the probability that a tuple in D
belongs to class Ci, estimated by |Ci,D|/|D|.
• A binary split on attribute A partitions D into D1 and D2, and the Gini index of the split is the
weighted average GiniA(D) = (|D1|/|D|) Gini(D1) + (|D2|/|D|) Gini(D2).
• This binary partition of D reduces the impurity, and the reduction in impurity
is measured by
ΔGini(A) = Gini(D) - GiniA(D)
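A minimal sketch of these Gini formulas in plain Python (illustrative names, not a library API):

```python
# Sketch only: Gini(D), the weighted Gini of a binary split, and the reduction in impurity.
def gini(labels):
    """Gini(D) = 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(left, right):
    """Gini_A(D) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2) for a binary split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def gini_reduction(left, right):
    """Delta Gini(A) = Gini(D) - Gini_A(D)."""
    return gini(left + right) - gini_split(left, right)

print(gini(["Y", "Y", "Y"]))                   # 0.0: pure node
print(gini(["Y", "Y", "N", "N"]))              # 0.5: evenly mixed node
print(gini_reduction(["Y", "Y"], ["N", "N"]))  # 0.5: a perfect binary split
```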
Example: CART Algorithm
Build a decision tree for the data set EMP given in the table below.

Attribute values:
Age: Y = young, M = middle-aged, O = old
Salary: L = low, M = medium, H = high
Job: G = government, P = private
Performance: A = Average, E = Excellent
Class (Select): Y = yes, N = no

Tuple# | Age | Salary | Job | Performance | Select
1      | Y   | H      | P   | A           | N
2      | Y   | H      | P   | E           | N
3      | M   | H      | P   | A           | Y
4      | O   | M      | P   | A           | Y
5      | O   | L      | G   | A           | Y
6      | O   | L      | G   | E           | N
7      | M   | L      | G   | E           | Y
8      | Y   | M      | P   | A           | N
9      | Y   | L      | G   | A           | Y
10     | O   | M      | G   | A           | Y
11     | Y   | M      | G   | E           | Y
12     | M   | M      | P   | E           | Y
13     | M   | H      | G   | A           | Y
14     | O   | M      | P   | E           | N
Example: CART Algorithm
Attribute of splitting: Age
• The attribute Age has three values, namely Y, M and O, so there are 6 subsets that should be
considered for splitting: {Y}, {M}, {O}, {M,O}, {Y,O} and {Y,M}. A subset and its complement
describe the same binary split (for example, {Y,M | O} and {O | Y,M}), so only three distinct
splits of the EMP data set need to be evaluated.
• Gini for the split {Y} | {M,O} = 0.3936
• Gini for the split {M} | {Y,O} = 0.3571
• Gini for the split {O} | {Y,M} = 0.4571
• The best value of the Gini Index to split attribute Age is the minimum of all of these:
the split {M} | {Y,O}, with Gini = 0.3571.
Example: CART Algorithm
Attribute of Splitting: Salary
• The attribute Salary also has three values, namely L, M and H, so there are again 6 subsets
(three distinct binary splits) that should be considered: {L} | {M,H}, {M} | {L,H} and {H} | {L,M}.
• The weighted Gini index of each split is computed exactly as it was for Age.
• The overall Gini index of the data set is G(EMP) = 1 - (9/14)² - (5/14)² ≈ 0.459, since 9 of the
14 tuples have Select = Y and 5 have Select = N.
Example: CART Algorithm
Attribute of Splitting: Job
• Job being a binary attribute (G or P), there is only one possible split, {G} | {P}, and we have
GiniJob(EMP) = (|DG|/|D|) Gini(DG) + (|DP|/|D|) Gini(DP) = ?
Example: CART Algorithm
Attribute of Splitting: Performance
• Performance is likewise a binary attribute (A or E), giving the single split {A} | {E}.
• Out of these attributes, the attribute that gives the maximum reduction in impurity
(equivalently, the minimum weighted Gini index) is chosen for splitting the subset.
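A short plain-Python script (illustrative names, not part of the slides) that re-computes the weighted Gini indices for the EMP data set: it reproduces the Age values quoted above (0.3936, 0.3571, 0.4571, up to rounding) and also evaluates the Salary, Job and Performance splits left open. The encodings follow the legend of the EMP table.

```python
# Sketch only: weighted Gini index of every candidate binary split of EMP.
from itertools import combinations

# (Age, Salary, Job, Performance, Select) for tuples 1..14
EMP = [
    ("Y", "H", "P", "A", "N"), ("Y", "H", "P", "E", "N"),
    ("M", "H", "P", "A", "Y"), ("O", "M", "P", "A", "Y"),
    ("O", "L", "G", "A", "Y"), ("O", "L", "G", "E", "N"),
    ("M", "L", "G", "E", "Y"), ("Y", "M", "P", "A", "N"),
    ("Y", "L", "G", "A", "Y"), ("O", "M", "G", "A", "Y"),
    ("Y", "M", "G", "E", "Y"), ("M", "M", "P", "E", "Y"),
    ("M", "H", "G", "A", "Y"), ("O", "M", "P", "E", "N"),
]
ATTRS = ["Age", "Salary", "Job", "Performance"]

def gini(labels):
    """Gini(D) = 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(col, subset):
    """Weighted Gini of the binary split: attribute value in `subset` vs not."""
    left = [r[-1] for r in EMP if r[col] in subset]
    right = [r[-1] for r in EMP if r[col] not in subset]
    n = len(EMP)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print("Gini(EMP) =", round(gini([r[-1] for r in EMP]), 4))  # 0.4592

for col, name in enumerate(ATTRS):
    values = sorted(set(r[col] for r in EMP))
    # A subset and its complement describe the same binary split, which is
    # why a 3-valued attribute such as Age lists 6 candidate subsets.
    splits = [(s, gini_split(col, set(s)))
              for k in range(1, len(values))
              for s in combinations(values, k)]
    for s, g in splits:
        print(f"  Gini({name} in {set(s)}) = {g:.4f}")
    best_subset, best_g = min(splits, key=lambda t: t[1])
    print(f"Best split on {name}: {set(best_subset)} vs the rest, Gini = {best_g:.4f}")
```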
Trees Vs Linear Model
• Comparing tree-based models with linear models is essential in machine learning, as they represent two different approaches to modeling data. Here's a
comparison between the two:
• Model Complexity:
Tree-Based Models: Decision trees can capture complex nonlinear relationships between features and the target variable. Ensemble methods like random forests and gradient
boosting machines further increase the model complexity by combining multiple decision trees.
Linear Models: Linear models assume a linear relationship between the features and the target variable. They are relatively simple and may struggle to capture complex nonlinear
relationships.
• Interpretability:
Tree-Based Models: Decision trees offer interpretability as they represent a series of simple if-else conditions. It's easy to understand the decision-making process within a tree
model.
Linear Models: Linear models are also interpretable, as coefficients associated with each feature indicate the direction and strength of the relationship between the feature and the
target variable.
• Handling of Features:
Tree-Based Models: Tree models can handle both numerical and categorical features without the need for preprocessing such as one-hot encoding.
Linear Models: Linear models typically require preprocessing for categorical variables, such as one-hot encoding, to represent them effectively.
• Robustness to Outliers:
Tree-Based Models: Decision trees are generally robust to outliers since they partition the feature space based on thresholds.
Linear Models: Linear models can be sensitive to outliers, especially in models like linear regression, where outliers can disproportionately influence the estimated coefficients.
• Overfitting:
Tree-Based Models: Decision trees are prone to overfitting, especially when the tree depth is not limited. Ensemble methods like random forests and gradient boosting machines
mitigate overfitting to some extent.
Linear Models: Linear models are less prone to overfitting, especially when the number of features is small compared to the number of observations.
• Computational Complexity:
Tree-Based Models: Building decision trees can be computationally expensive, especially for large datasets, although predicting with a trained tree is typically fast.
Linear Models: Linear models are computationally efficient both during training and prediction, making them suitable for large datasets.
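An illustrative sketch of the model-complexity point above, assuming scikit-learn and NumPy are available and using a synthetic, hypothetical dataset: a decision tree captures a nonlinear relationship that a plain linear model cannot.

```python
# Sketch only: compare a linear regression and a regression tree on a quadratic target.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=300)  # nonlinear (quadratic) target

linear = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

print("Linear model R^2:", round(linear.score(X, y), 3))  # close to 0
print("Decision tree R^2:", round(tree.score(X, y), 3))   # close to 1
```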
Ensembles
• An ensemble is simply a collection of models that are all trained to
perform the same task
• An ensemble consists of many different versions of one model, or many
different types of models
• The final output of an ensemble of classifiers is typically obtained
through a (weighted) average or vote of the predictions of the
different models in the ensemble
• An ensemble of different models that all achieve similar
generalization performance often outperforms any of the individual
models
What is Ensemble Learning?
It is a powerful method to improve the performance of a model: it combines the predictions
from multiple models, and the combined prediction is usually more accurate than any single model's.
• One common example is a Stacking Ensemble, in which a meta-model learns how to combine the base models' predictions.
Some Simple Ensembles
• Voting or averaging of the predictions of multiple pre-trained models (see the sketch after this list)
• Boosting improves machine learning models' predictive accuracy and performance by converting multiple weak
learners into a single strong learning model. Machine learning models can be weak learners or strong
learners:
• Weak learners
• Weak learners have low prediction accuracy, similar to random guessing. They are prone to overfitting—
that is, they can't classify data that varies too much from their original dataset. For example, if you train
the model to identify cats as animals with pointed ears, it might fail to recognize a cat whose ears are
curled.
• Strong learners
• Strong learners have higher prediction accuracy. Boosting converts a system of weak learners into a single
strong learning system. For example, to identify the cat image, it combines a weak learner that checks for
pointy ears with another learner that checks for cat-shaped eyes. After analyzing the animal image for
pointy ears, the system analyzes it once again for cat-shaped eyes. This improves the system's overall
accuracy.
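A sketch of the voting/averaging idea from the list above, assuming scikit-learn is available (the dataset and model choices are illustrative): three different classifiers are trained on the same data and their class predictions are combined by majority vote.

```python
# Sketch only: a simple hard-voting ensemble of three different classifiers.
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("logreg", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
    ],
    voting="hard",  # majority vote over the predicted classes
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))  # predictions of the combined model
```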
Boosting
• The same type of model is trained on the
same training data (which includes the
wrong predictions from the prior model) in
sequential order.
• The data is given to the first model.
• During training, it identifies all the data
points that are incorrectly predicted.
• It makes predictions and passes the
wrongly predicted observations to the next
predictor.
• At each iteration, more and more attention
is paid to the observations with prediction
errors.
• New predictions are made until the limit is
reached or a higher accuracy is achieved.
• Models that provide higher accuracy are
assigned larger weights in the final combination.
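A sketch of this sequential boosting procedure using AdaBoost from scikit-learn (an assumption; the slides do not name a specific library or dataset). The default weak learner is a depth-1 decision tree (a "stump"); each new stump pays more attention to the previously misclassified examples, and more accurate stumps receive larger weights in the final vote.

```python
# Sketch only: AdaBoost as an example of boosting weak learners into a strong one.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 sequentially fitted weak learners (depth-1 trees by default).
booster = AdaBoostClassifier(n_estimators=100, random_state=0)
booster.fit(X_train, y_train)
print("Test accuracy:", round(booster.score(X_test, y_test), 3))
```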
Bagging Vs Boosting