ML4 - Decision Trees & Random Forest

Decision trees use a tree-like graph or model of decisions and their possible consequences to help determine an outcome. Random forest is an ensemble learning method that fits multiple decision trees on various sub-samples of a dataset and uses averaging to improve the predictive accuracy and control over-fitting. Ensemble methods like bagging, boosting, and stacking combine multiple machine learning models to reduce variance and bias to improve overall accuracy.


Decision Trees

Random Forest
REFERENCES
Decision Trees
Python Decision Tree Classification Tutorial: Scikit-Learn DecisionTreeClassifier | DataCamp
Ensemble Learning & Random Forest
Ensemble Learning Methods: Bagging, Boosting and Stacking (analyticsvidhya.com)
Basic Ensemble Learning (Random Forest, AdaBoost, Gradient Boosting) - Step by Step Explained | by Lilly Chen | Towards Data Science
Decision Trees
• Flowchart-like tree structure where
• each internal node represents a feature (or attribute),
• each branch represents a decision rule, and
• each leaf node represents the outcome
• Root Node: the topmost node in a decision tree
• The tree learns to partition on the basis of feature (attribute) values
• The tree is partitioned in a recursive manner, called recursive partitioning
• Helps you in decision-making
• Its flowchart-like visualization mimics human-level thinking
• That is why decision trees are easy to understand and interpret
Decision Trees
How does the Decision Tree Algorithm work?
• Select the best feature (attribute) using an Attribute Selection Measure (ASM) to split the examples (records)
• Make that attribute a decision node and break the dataset into smaller subsets
• Build the tree by repeating this process recursively for each child until one of the following conditions is met (a sketch of the procedure follows this list):
• All the tuples belong to the same class (target value)
• There are no more remaining attributes
• There are no more instances
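A minimal, self-contained sketch of this recursive procedure, assuming the records are Python dictionaries of categorical feature values; Gini impurity (one of the selection measures introduced below) stands in for the ASM:

```python
from collections import Counter

def gini(labels):
    # Gini impurity of a set of target labels: 1 - sum of squared class probabilities
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, attributes, target):
    labels = [r[target] for r in rows]
    # Stopping conditions: all rows share the same class, or no attributes remain
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]   # leaf node = majority class

    # Attribute Selection Measure: pick the attribute whose split has the lowest weighted impurity
    def weighted_impurity(attr):
        groups = {}
        for r in rows:
            groups.setdefault(r[attr], []).append(r[target])
        return sum(len(g) / len(rows) * gini(g) for g in groups.values())

    best = min(attributes, key=weighted_impurity)
    remaining = [a for a in attributes if a != best]

    # Recursive partitioning: one subtree per value of the chosen attribute
    node = {best: {}}
    for value in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == value]
        node[best][value] = build_tree(subset, remaining, target)
    return node
```

With the COVID-19 example used later in these slides, a call would look like build_tree(rows, ['Fever', 'Cough', 'Breathing issues'], target='Infected').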
Attribute Selection Measures
• Heuristics for selecting the splitting criterion that partitions data in the
best possible manner
• They assign a score (rank) to each feature (attribute) based on how well it explains the given dataset
• The attribute with the best score is selected as the splitting attribute
• Most popular selection measures are
• Information Gain
• Gain Ratio
• Gini Index
Information Gain
• Information Gain ≡ Reduction in Entropy
• How well the given feature (attribute) separates the target classes
• Entropy ≡ Measure of disorder (in target feature)
• For Binary Classification
• Entropy = 0, if all values of target feature are homogeneous (same)
• Entropy = 1, if target feature has equal number of values for both classes
• Entropy of a dataset S: Entropy(S) = - ∑ pᵢ * log₂(pᵢ) ; i = 1 to n
• n is the total number of classes in the target column (in our case n = 2, i.e. YES and NO)
• pᵢ is the probability of class i, i.e. the ratio of “number of rows (examples) with class i in the target column” to the “total number of rows (examples)” in the dataset
• Information Gain for a feature (attribute/column) A:
• IG(S, A) = Entropy(S) - ∑ᵥ ((|Sᵥ| / |S|) * Entropy(Sᵥ)), summed over the values v of A
• Sᵥ is the set of rows (examples) in S for which the feature (column) A has value v
• |Sᵥ| is the number of rows (examples) in Sᵥ
• |S| is the number of rows (examples) in S
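The two formulas translate directly into Python; a small sketch like the following can be used to check the hand calculations on the next slides (the dictionary-of-rows representation is an assumption about how the data is stored):

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = - sum(p_i * log2(p_i)) over the classes present in the target column
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    # IG(S, A) = Entropy(S) - sum(|S_v| / |S| * Entropy(S_v)) over the values v of attribute A
    labels = [r[target] for r in rows]
    total = entropy(labels)
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        total -= len(subset) / len(rows) * entropy(subset)
    return total
```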
• Sample dataset of COVID-19
infection
• Features (attributes) for decision nodes
• Fever
• Cough
• Breathing issues
• Target
• Infected
• Classes / Values (Two : Y & N)
• From the total of 14 rows in our dataset S,
there are 8 rows with the target
value YES and 6 rows with the target
value NO. The entropy of S is calculated as:
• Entropy(S) = - ∑ pᵢ * log₂(pᵢ) ; i = 1 to n
• Entropy(S) = - (8/14) * log₂(8/14) - (6/14) * log₂(6/14) = 0.99
• IG Calculation for Fever
• The Fever feature has 8 rows with value YES and 6 rows with value NO
• As shown at left, in the 8 rows where Fever is YES, there are 6 rows with target value YES and 2 rows with target value NO
• In the 6 rows where Fever is NO, there are 2 rows with target value YES and 4 rows with target value NO (the resulting IG is worked out below)
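Plugging the counts above into the formulas gives the Information Gain of Fever; a quick check using only the numbers stated on this slide:

```python
import math

def entropy_from_counts(*counts):
    # Entropy from raw class counts: skip zero counts to avoid log2(0)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

e_s       = entropy_from_counts(8, 6)   # whole dataset: 8 YES, 6 NO          -> ~0.99
e_fever_y = entropy_from_counts(6, 2)   # rows with Fever = YES: 6 YES, 2 NO  -> ~0.81
e_fever_n = entropy_from_counts(2, 4)   # rows with Fever = NO:  2 YES, 4 NO  -> ~0.92

ig_fever = e_s - (8/14) * e_fever_y - (6/14) * e_fever_n
print(round(ig_fever, 2))               # ~0.13
```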
First Step
Since the feature Breathing issues has the highest Information Gain, it is used to create the root node
Second Step
The IG of Fever is greater than that of Cough, so we select Fever as the left branch of Breathing Issues
Third Step
Only one unused feature is left, so we have no choice but to make it (Cough) the right branch of the root node
Final Step (Creating Leaf Nodes)
- For the left leaf node of Fever, we look at the subset of rows from the original data set that have both Breathing Issues and Fever equal to YES
- Similarly, for the right node of Fever we look at the subset of rows from the original data set that have Breathing Issues equal to YES and Fever equal to NO
- We repeat the same process for the node Cough; however, here both the left and right leaves turn out to be the same, i.e. NO
Attribute Selection Measures
• Information Gain
• ID3 (Iterative Dichotomiser) decision tree algorithm uses information gain
• Gain Ratio
• C4.5, an improvement of ID3, uses an extension to information gain known as
the gain ratio
• Gini Index (Impurity)
• CART (Classification and Regression Tree) uses the Gini method to create
split points
• Attribute with the minimum Gini index is chosen as the splitting attribute
Decision Tree Classifier Building in Scikit-learn
Example Code
• Import Required Libraries
• Load Data
• Load the required Pima Indian Diabetes dataset using pandas' read_csv function
• Ensure diabetes.csv is in the current folder
• Split the dataset into features and target variable
• Split the dataset into training and test sets
• Create the Decision Tree Classifier model, train it and predict
• Evaluate the model
• Visualising Decision Trees
• pip install graphviz
• pip install pydotplus
• The export_graphviz function converts the decision tree classifier into a dot file, and pydotplus converts this dot file to a PNG or a displayable form in Jupyter (a condensed sketch of these steps follows)
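A condensed, hedged sketch of the steps above; the column names and the exact file layout follow the DataCamp tutorial this slide is based on and are assumptions about the local copy of diabetes.csv:

```python
import pandas as pd
import pydotplus
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Load data (adjust header= if your copy of the file has no header row)
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv("diabetes.csv", header=0, names=col_names)

# Split dataset into features and target variable
feature_cols = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age']
X = pima[feature_cols]
y = pima.label

# Split dataset into training and test sets (70% training, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Create the Decision Tree classifier, train it and predict on the test set
clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Evaluate the model
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

# Visualise the tree: export_graphviz produces a dot description, pydotplus renders it to PNG
dot_data = export_graphviz(clf, out_file=None, feature_names=feature_cols,
                           class_names=['0', '1'], filled=True, rounded=True)
pydotplus.graph_from_dot_data(dot_data).write_png("diabetes_tree.png")
```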
Decision Tree Classifier Building in Scikit-learn
Example Code
• Optimization of the decision tree classifier
• Pruning - the maximum depth of the tree can be used as a control variable for pre-pruning (e.g. max_depth=3)
• A different attribute selection measure, such as entropy, can also be used (see the sketch below)
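Continuing the sketch above (re-using X_train, X_test, y_train, y_test and the metrics import), the optimized classifier from this slide would look roughly like:

```python
# Entropy as the attribute selection measure, max_depth=3 for pre-pruning
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
```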
Decision Trees - Advantages
• Easy to interpret and visualize
• Can easily capture Non-linear patterns (can create complex decision
boundaries)
• Requires less data preprocessing from the user; for example,
• there is no need to normalize / scale columns
• Missing values do not affect the process of building the tree
• Has no assumptions about distribution because of the non-parametric nature
of the algorithm
• Parametric ML Models (Linear Regression, Naïve Bayes, NN, Logistic Regression)
• Non-Parametric ML Models (k-NN, DT, SVM)
Decision Trees - Disadvantages
• Sensitive to noisy data (can overfit noisy data)
• A small variation in the data can result in a completely different decision tree
• Decision trees are biased with imbalanced datasets, so it is recommended to balance out the dataset before creating the decision tree
• Can be highly time-consuming in the training phase
RANDOM FOREST
ENSEMBLE LEARNING
Ensemble Learning Methods: Bagging, Boosting and Stacking (analyticsvidhya.com)
Basic Ensemble Learning (Random Forest, AdaBoost, Gradient Boosting) - Step by Step Explained | by Lilly Chen | Towards Data Science
RANDOM FOREST / ENSEMBLE
LEARNING
• In Real Life
• Before taking Big Decisions, we ask opinions (friends / family / colleagues)
• Prevention against being BIASED and IRRATIONAL
• For ML Models too
• Individual models may suffer from BIAS and VARIANCE
• Ensemble Learning may prevent this
• Ensemble Learning
• Making predictions based on a number of different models
• By combining individual models, ensemble model tends to be
• More flexible (less BIAS)
• Less data-sensitive (less VARIANCE)
• Ensemble ≡ Crowd opinion!!!
Ensemble Methods
• Meta-algorithms that combine several machine learning techniques
into one predictive model in order to
• Decrease variance (bagging)
• Decrease bias (boosting)
• Improve predictions (stacking)
• Individual Models
• Tend to perform poorly (low prediction accuracy)
• Are weak learners (either high bias or high variance)
• Ensemble Learning
• Combine multiple models (learners) to get one with better performance
(accuracy)
Problems in Individual Models
• High bias model (not learning data well enough)
• High variance model (learning the data too well)
Ensemble Learning aims to
• Reduce the bias if we have a weak model with high bias and low
variance
• Reduce the variance if we have a weak model with high variance and
low bias
• Produce a resulting model that is much more balanced, with low bias and low variance. The resulting model will be
• Known as a strong learner
• More generalized than the weak learners
• Able to make accurate predictions
Ensemble Learning
• Improves a model’s performance in mainly three ways:
• By reducing the variance of weak learners (BAGGING)
• By reducing the bias of weak learners (BOOSTING)
• By improving the overall accuracy of strong learners (STACKING)
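For orientation, a hedged sketch of how the three strategies map onto scikit-learn's ensemble module (the base-model choices and parameters are illustrative, not prescribed by the slides):

```python
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Bagging: many models on bootstrap samples, combined by voting -> reduces variance
bagging = BaggingClassifier(n_estimators=100)      # default base learner is a decision tree

# Boosting: learners trained sequentially, each focusing on previous mistakes -> reduces bias
boosting = AdaBoostClassifier(n_estimators=100)    # default base learner is a decision stump

# Stacking: a meta-model learns how to combine the base models' predictions -> improves accuracy
stacking = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                ("logreg", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression())

# Each of these is then trained like any other estimator, e.g. bagging.fit(X_train, y_train)
```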
Random Forest
• Ensemble Model
• Using Bagging as Ensemble Method
• Decision Tree as individual model
• Step 1: Select n (e.g. 1000) random subsets from the training set
• Step 2: Train n (e.g. 1000) decision trees
• one random subset is used to train one decision tree
• the optimal splits for each decision tree are based on a random subset of
features (e.g. 10 features in total, randomly select 5 out of 10 features to split)
• Step 3: Each individual tree predicts the records/candidates in the
test set, independently.
• Step 4: Make the final prediction (voting / averaging)
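A minimal sketch of these four steps done by hand, assuming X_train, y_train, X_test are NumPy arrays; in practice sklearn's RandomForestClassifier performs all of this internally:

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

n_trees, trees = 1000, []          # Step 1/2: e.g. 1000 random subsets, 1000 trees
rng = np.random.default_rng(0)

for _ in range(n_trees):
    # Step 1: draw a bootstrap sample (random subset) of the training rows
    idx = rng.choice(len(X_train), size=len(X_train), replace=True)
    # Step 2: train one tree on it; max_features="sqrt" makes each split
    # consider only a random subset of the features
    tree = DecisionTreeClassifier(max_features="sqrt")
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Step 3: every tree predicts the test records independently
all_preds = np.array([t.predict(X_test) for t in trees])

# Step 4: final prediction by majority vote across the trees
final = [Counter(col).most_common(1)[0][0] for col in all_preds.T]
```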
AdaBoosting (Adaptive Boosting)
• Ensemble Model
• Using Boosting as Ensemble method
• Using Decision Tree as individual model
• The key is learning from previous mistakes, e.g. misclassified data points (the weights of misclassified points are increased)
• Step 0: Initialize the weights of the data points. If the training set has 100 data points, then each point’s initial weight should be 1/100 = 0.01.
• Step 1: Train a decision tree
• Step 2: Calculate the weighted error rate (e) of the decision tree. The weighted error rate (e) is the fraction of wrong predictions out of the total, where each wrong prediction counts according to its data point’s weight; the higher the weight, the more the corresponding error will be weighted during the calculation of (e).
• Step 3: Calculate this decision tree’s weight in the ensemble
• the weight of this tree = learning rate * log((1 - e) / e)
• the higher the weighted error rate of a tree, 😫, the less decision power the tree will be given during the later voting
• the lower the weighted error rate of a tree, 😃, the more decision power the tree will be given during the later voting
• Step 4: Update the weights of wrongly classified points
• the new weight of each data point:
• if the model got this data point correct, the weight stays the same
• if the model got this data point wrong, the new weight of this point = old weight * e^(weight of this tree)
• Note: The higher the weight of the tree (the more accurately this tree performs), the more boost (importance) the data points misclassified by this tree will get. The weights of the data points are normalized after all the misclassified points are updated.
• Step 5: Repeat from Step 1 (until the number of trees we set to train is reached)
• Step 6: Make the final prediction
• AdaBoost makes the final prediction by adding up the weight of each tree multiplied by the prediction of each tree (a sketch of all the steps follows). The tree with the higher weight will have more influence on the final decision.
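A compact sketch of Steps 0-6 above, assuming the two classes are coded as -1 and +1; real implementations such as sklearn's AdaBoostClassifier handle the details more carefully:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_trees=50, learning_rate=1.0):
    w = np.full(len(X), 1 / len(X))                    # Step 0: equal initial weights
    trees, alphas = [], []
    for _ in range(n_trees):
        stump = DecisionTreeClassifier(max_depth=1)    # Step 1: train a weighted weak tree
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        e = np.sum(w[pred != y]) / np.sum(w)           # Step 2: weighted error rate
        e = np.clip(e, 1e-10, 1 - 1e-10)
        alpha = learning_rate * np.log((1 - e) / e)    # Step 3: this tree's weight
        w[pred != y] *= np.exp(alpha)                  # Step 4: boost misclassified points...
        w /= w.sum()                                   # ...and normalize the weights
        trees.append(stump)
        alphas.append(alpha)                           # Step 5: repeat for the next tree
    return trees, alphas

def adaboost_predict(trees, alphas, X):
    # Step 6: weighted vote -- sum of (tree weight * tree prediction), then take the sign
    scores = sum(a * t.predict(X) for t, a in zip(trees, alphas))
    return np.sign(scores)
```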
Example Code
• Load Library
• Create Dataset
• Split Training – Test Set
• Fit a Decision Tree model
• Fit a Random Forest model
• Fit an AdaBoost model (a hedged sketch of these steps follows)
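A hedged sketch of the listed steps, using a synthetic dataset since the slides do not specify one:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Create a dataset and split it into training and test sets
X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit one model of each kind and compare test accuracy
models = [("Decision Tree", DecisionTreeClassifier()),
          ("Random Forest", RandomForestClassifier(n_estimators=100)),
          ("AdaBoost", AdaBoostClassifier(n_estimators=100))]
for name, model in models:
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```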
Random Forest
• One of the most popular and commonly used algorithms by Data
Scientists
• Supervised Machine Learning Algorithm that is used widely in
Classification and Regression problems
• Builds decision trees on different samples and takes their majority vote
for classification and average in case of regression
• Can handle the data set containing continuous variables, as in the case
of regression, and categorical variables, as in the case of classification
Features of Random Forest
• Diversity: Not all attributes/variables/features are considered while making
an individual tree; each tree is different.
• Immune to the curse of dimensionality: Since each tree does not consider
all the features, the feature space is reduced.
• Parallelization: Each tree is created independently out of different data and
attributes. This means we can fully use the CPU to build random forests.
• Train-Test split: In a random forest, we don’t have to segregate the data for train and test, since roughly a third of the data (the out-of-bag samples) is never seen by each individual decision tree (see the sketch below)
• Stability: Stability arises because the result is based on majority voting/
averaging
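The "Train-Test split" point corresponds to the out-of-bag estimate in scikit-learn, which can be sketched like this (synthetic data and illustrative parameters):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Each tree never sees its own out-of-bag rows, so those rows act as a built-in validation set
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print("Out-of-bag accuracy estimate:", rf.oob_score_)
```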
Difference between DT & RF
• Overfitting: Decision trees normally suffer from overfitting if allowed to grow without any control; random forests are created from subsets of data and the final output is based on average or majority ranking, so the problem of overfitting is taken care of
• Speed: A single decision tree is faster in computation; a random forest is comparatively slower
• Approach: A decision tree takes a data set with features as input and formulates rules to make predictions; a random forest randomly selects observations, builds decision trees, and takes the average result without relying on a fixed set of rules
