Konsep Ensemble (Ensemble Concepts)

Decision Tree

In my previous blogs on Data Science and Machine Learning, I have briefly covered an introduction to Data Science, Python, Statistics, Machine Learning, Regression, and Linear and Logistic Regression. In this fifth post of the series, I shall cover Decision Trees.

Introduction to Decision Trees :

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and
their possible consequences, including chance event outcomes, resource costs, and utility. It is
one way to display an algorithm that only contains conditional control statements.

A decision tree is a flowchart-like structure in which each internal node represents a “test” on
an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the
outcome of the test, and each leaf node represents a class label (decision taken after computing
all attributes). The paths from root to leaf represent classification rules.

Tree-based learning algorithms are among the best and most widely used supervised learning methods. Tree-based methods empower predictive models with high accuracy, stability and ease of interpretation. Unlike linear models, they map non-linear relationships quite well, and they can be adapted to either kind of problem at hand (classification or regression). Decision tree algorithms are often referred to as CART (Classification and Regression Trees).

“The possible solutions to a given problem emerge as the leaves of a tree, each node
representing a point of deliberation and decision.”

- Niklaus Wirth (1934 — ), Programming language designer

Methods like decision trees, random forest, gradient boosting are being popularly used in all
kinds of data science problems.
Common terms used with Decision trees:

1. Root Node: It represents the entire population or sample, which further gets divided into two or more homogeneous sets.

2. Splitting: It is a process of dividing a node into two or more sub-nodes.

3. Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.

4. Leaf/ Terminal Node: Nodes that do not split are called leaf or terminal nodes.

5. Pruning: When we remove sub-nodes of a decision node, the process is called pruning. It can be seen as the opposite of splitting.

6. Branch / Sub-Tree: A sub-section of the entire tree is called a branch or sub-tree.

7. Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are its children.
Applications for Decision Tree :

Decision trees have a natural “if … then … else …” construction that makes it fit easily into a
programmatic structure. They also are well suited to categorization problems where attributes
or features are systematically checked to determine a final category. For example, a decision
tree could be used effectively to determine the species of an animal.
As a result, the decision tree is one of the more popular classification algorithms used in Data Mining and Machine Learning. Example applications include:

· Evaluation of brand expansion opportunities for a business using historical sales data

· Determination of likely buyers of a product using demographic data to enable targeting of a limited advertisement budget

· Prediction of likelihood of default for applicant borrowers using predictive models generated from historical data
· Help with prioritization of emergency room patient treatment using a predictive model based
on factors such as age, blood pressure, gender, location and severity of pain, and other
measurements

· Decision trees are commonly used in operations research, specifically in decision analysis,
to help identify a strategy most likely to reach a goal.

Because of their simplicity, tree diagrams have been used in a broad range of industries and
disciplines including civil planning, energy, financial, engineering, healthcare, pharmaceutical,
education, law, and business.

How does a Decision Tree work?


Decision tree is a type of supervised learning algorithm (having a pre-defined target variable)
that is mostly used in classification problems. It works for both categorical and continuous
input and output variables. In this technique, we split the population or sample into two or
more homogeneous sets (or sub-populations) based on the most significant splitter/differentiator among the input variables.

Example:-

Let's say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X) and Height (5 to 6 ft). 15 out of these 30 play cricket in their leisure time. Now, we want to create a model to predict who will play cricket during leisure time. In this problem, we need to segregate students who play cricket in their leisure time based on the most significant input variable among the three.

This is where a decision tree helps: it segregates the students based on all values of the three variables and identifies the variable that creates the most homogeneous sets of students (which are heterogeneous to each other). In this example, the variable Gender turns out to identify the most homogeneous sets compared to the other two variables.

The decision tree identifies the most significant variable, and the value of that variable, that gives the most homogeneous sets of the population. To identify the variable and the split, decision trees use various algorithms.

Types of Decision Trees

The type of decision tree depends on the type of target variable we have. There are two types:

1. Categorical Variable Decision Tree: A decision tree with a categorical target variable is called a categorical variable decision tree. E.g., in the student problem above, the target variable was "Student will play cricket or not", i.e. YES or NO.

2. Continuous Variable Decision Tree: A decision tree with a continuous target variable is called a continuous variable decision tree.

E.g., suppose we have a problem of predicting whether a customer will pay the renewal premium with an insurance company (yes/no). We know that the customer's income is a significant variable, but the insurance company does not have income details for all customers. Since this is an important variable, we can build a decision tree to predict customer income based on occupation, product, and various other variables. In this case, we are predicting values for a continuous variable.

Decision Tree Algorithm Pseudocode


The decision tree algorithm tries to solve the problem, by using tree representation. Each
internal node of the tree corresponds to an attribute, and each leaf node corresponds to a
class label.
1. Place the best attribute of the dataset at the root of the tree.

2. Split the training set into subsets. Subsets should be made in such a way that each
subset contains data with the same value for an attribute.

3. Repeat step 1 and step 2 on each subset until you find leaf nodes in all the branches
of the tree.

To predict a class label for a record, we start from the root of the tree. We compare the value of the root attribute with the record's attribute. Based on this comparison, we follow the branch corresponding to that value and jump to the next node.

We continue comparing our record’s attribute values with other internal nodes of the tree
until we reach a leaf node with predicted class value. The modeled decision tree can be used
to predict the target class or the value.
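The recursive procedure above can be sketched in a few lines of Python. This is only a minimal illustration of the idea, not a production implementation; best_attribute() is a hypothetical helper that would score candidate attributes using one of the splitting criteria (Gini, information gain, etc.) discussed later.

# A minimal sketch of the recursive build/predict procedure described above.
# best_attribute(rows, attributes) is a hypothetical helper that would score
# candidate attributes using Gini, information gain, etc. (covered below).

def build_tree(rows, attributes):
    labels = [row["label"] for row in rows]
    if len(set(labels)) == 1 or not attributes:  # pure node or nothing left to split on
        return {"leaf": max(set(labels), key=labels.count)}
    attr = best_attribute(rows, attributes)  # Step 1: place the best attribute at this node
    children = {}
    for value in set(row[attr] for row in rows):  # Step 2: split into subsets by attribute value
        subset = [row for row in rows if row[attr] == value]
        children[value] = build_tree(subset, [a for a in attributes if a != attr])  # Step 3: recurse
    return {"attr": attr, "children": children}

def predict(tree, record):
    while "leaf" not in tree:  # walk from the root, following the branch matching the record
        tree = tree["children"][record[tree["attr"]]]
    return tree["leaf"]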

Assumptions while creating Decision Tree


Some of the assumptions we make while using Decision tree:

 At the beginning, the whole training set is considered as the root.

 Feature values are preferred to be categorical. If the values are continuous then they
are discretized prior to building the model.

 Records are distributed recursively on the basis of attribute values.

 The order in which attributes are placed as the root or as internal nodes of the tree is decided using a statistical approach.
Advantages of Decision Tree:

1. Easy to Understand: Decision tree output is very easy to understand, even for people from a non-analytical background. No statistical knowledge is required to read and interpret it. Its graphical representation is very intuitive, and users can easily relate it to their hypotheses.

2. Useful in Data exploration: A decision tree is one of the fastest ways to identify the most significant variables and the relations between two or more variables. With the help of decision trees, we can create new variables/features with better power to predict the target variable. For example, if we are working on a problem with information spread across hundreds of variables, a decision tree will help identify the most significant ones.

3. Decision trees implicitly perform variable screening or feature selection.


4. Decision trees require relatively little effort from users for data preparation.

5. Less data cleaning required: It requires less data cleaning compared to some other modeling techniques, and it is fairly robust to outliers and missing values.

6. Data type is not a constraint: It can handle both numerical and categorical variables.
Can also handle multi-output problems.

7. Non-Parametric Method: The decision tree is considered a non-parametric method. This means that decision trees make no assumptions about the space distribution or the classifier structure.

8. Non-linear relationships between parameters do not affect tree performance.

9. Decision trees have relatively few hyper-parameters to tune.

Disadvantages of Decision Tree:

1. Overfitting: Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting, and it is one of the most significant practical difficulties for decision tree models. The problem can be addressed by setting constraints on model parameters and by pruning.

2. Not ideal for continuous variables: While working with continuous numerical variables, a decision tree loses information when it discretizes the variables into categories.

3. Decision trees can be unstable because small variations in the data might result in a
completely different tree being generated. This is called variance, which needs to be
lowered by methods like bagging and boosting.

4. Greedy algorithms cannot guarantee to return the globally optimal decision tree. This
can be mitigated by training multiple trees, where the features and samples are randomly
sampled with replacement.
5. Decision tree learners create biased trees if some classes dominate. It is therefore
recommended to balance the data set prior to fitting with the decision tree.

6. Information gain in a decision tree with categorical variables gives a biased response for attributes with a greater number of categories.

7. Generally, it gives lower prediction accuracy on a dataset compared to other machine learning algorithms.

8. Calculations can become complex when there are many class labels.

Regression Trees vs Classification Trees

The terminal nodes (or leaves) lie at the bottom of the decision tree. This means that decision trees are typically drawn upside down, such that the leaves are at the bottom and the root is at the top.

Both the trees work almost similar to each other. The primary differences and similarities
between Classification and Regression Trees are:

1. Regression trees are used when dependent variable is continuous. Classification Trees
are used when dependent variable is categorical.

2. In the case of a Regression Tree, the value assigned to a terminal node is the mean response of the training observations falling in that region. Thus, if an unseen observation falls in that region, its prediction is the mean value.

3. In the case of a Classification Tree, the value (class) assigned to a terminal node is the mode of the training observations falling in that region. Thus, if an unseen observation falls in that region, its prediction is the mode value.

4. Both the trees divide the predictor space (independent variables) into distinct and
non-overlapping regions.

5. Both trees follow a top-down greedy approach known as recursive binary splitting. We call it 'top-down' because it begins at the top of the tree, where all observations fall in a single region, and successively splits the predictor space into two new branches down the tree. It is 'greedy' because the algorithm only cares about the current split (looking for the best variable available), not about future splits that could lead to a better tree.

6. This splitting process is continued until a user defined stopping criteria is reached.
For e.g.: we can tell the algorithm to stop once the number of observations per node
becomes less than 50.

7. In both cases, the splitting process results in a fully grown tree unless stopping criteria are reached. But a fully grown tree is likely to overfit the data, leading to poor accuracy on unseen data. This brings us to 'pruning', one of the techniques used to tackle overfitting.

How does a tree decide where to split?

The decision of where to make strategic splits heavily affects a tree's accuracy. The decision criteria are different for classification and regression trees.

Decision trees use multiple algorithms to decide how to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes. In other words, the purity of the node increases with respect to the target variable. The decision tree splits the node on all available variables and then selects the split that results in the most homogeneous sub-nodes.

The algorithm selection also depends on the type of target variable. The four most commonly used splitting criteria for decision trees are:
Gini Index

The Gini index is based on the idea that if we select two items from a population at random, they should be of the same class; the probability of this is 1 if the population is pure.

1. It works with categorical target variable “Success” or “Failure”.

2. It performs only Binary splits

3. The higher the value of Gini, the higher the homogeneity.

4. CART (Classification and Regression Tree) uses Gini method to create binary splits.

Steps to Calculate Gini for a split

1. Calculate Gini for sub-nodes, using formula sum of square of probability for success
and failure (p²+q²).

2. Calculate Gini for split using weighted Gini score of each node of that split
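As a small illustration, the two steps above can be written directly in Python. The counts used in the example call are hypothetical (a gender split in the 30-student problem, assuming 2 of 10 girls and 13 of 20 boys play cricket).

# Gini of a node, as defined above: p^2 + q^2 (higher = more homogeneous).
def gini_node(successes, failures):
    total = successes + failures
    p, q = successes / total, failures / total
    return p ** 2 + q ** 2

# Gini of a split: weighted Gini score of each sub-node.
def gini_split(nodes):
    # nodes: list of (successes, failures) tuples, one per sub-node
    total = sum(s + f for s, f in nodes)
    return sum((s + f) / total * gini_node(s, f) for s, f in nodes)

# Hypothetical gender split of the 30 students: girls (2 play, 8 do not),
# boys (13 play, 7 do not).
print(gini_split([(2, 8), (13, 7)]))  # ~0.59; compare against other candidate splits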

Chi-Square

It is an algorithm to find the statistical significance of the differences between sub-nodes and the parent node. We measure it by the sum of squares of the standardized differences between the observed and expected frequencies of the target variable.

1. It works with categorical target variable “Success” or “Failure”.

2. It can perform two or more splits.

3. The higher the value of Chi-Square, the higher the statistical significance of the differences between the sub-node and the parent node.

4. The Chi-Square of each node is calculated using the formula: Chi-square = √((Actual − Expected)² / Expected)

5. It generates a tree called CHAID (Chi-square Automatic Interaction Detector)

Steps to Calculate Chi-square for a split:

1. Calculate the Chi-square for each individual node by calculating the deviation for both Success and Failure.

2. Calculate the Chi-square of the split as the sum of the Chi-square values for Success and Failure of each node of the split.
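A minimal sketch of these two steps in Python, assuming a binary Success/Failure target. The expected counts for each sub-node come from applying the parent node's success rate to that sub-node's size, and the counts in the example are hypothetical.

import math

def chi_square_split(nodes, parent_success_rate):
    # nodes: list of (successes, failures) tuples, one per sub-node
    total_chi = 0.0
    for successes, failures in nodes:
        n = successes + failures
        expected_s = n * parent_success_rate          # expected Successes in this sub-node
        expected_f = n * (1 - parent_success_rate)    # expected Failures in this sub-node
        total_chi += math.sqrt((successes - expected_s) ** 2 / expected_s)  # deviation for Success
        total_chi += math.sqrt((failures - expected_f) ** 2 / expected_f)   # deviation for Failure
    return total_chi

# Hypothetical split: parent has 15 of 30 playing (rate 0.5), sub-nodes (2, 8) and (13, 7).
print(chi_square_split([(2, 8), (13, 7)], parent_success_rate=0.5))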

Information Gain:
A less impure node requires less information to describe it, and a more impure node requires more information. Information theory defines a measure of this degree of disorganization in a system, known as entropy. If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided (50%-50%), it has an entropy of one.

Entropy can be calculated using the formula: Entropy = −p log₂(p) − q log₂(q)

Here p and q are the probabilities of success and failure, respectively, in that node. Entropy is also used with categorical target variables. We choose the split that has the lowest entropy compared to the parent node and the other splits. The lower the entropy, the better.

Steps to calculate entropy for a split:

1. Calculate entropy of parent node

2. Calculate entropy of each individual node of split and calculate weighted average of
all sub-nodes available in split.

Information gain can then be derived as the entropy of the parent node minus the weighted entropy of the split (for a parent node with entropy 1, this is simply 1 − entropy of the split); the split with the highest information gain is chosen.
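These steps can be sketched as follows; the split counts in the example call are hypothetical and mirror the earlier student example.

import math

def entropy(p):
    # Entropy of a node with success probability p (binary target): -p*log2(p) - q*log2(q)
    if p in (0.0, 1.0):
        return 0.0  # a completely homogeneous node has zero entropy
    q = 1 - p
    return -p * math.log2(p) - q * math.log2(q)

def information_gain(parent, children):
    # parent: (successes, failures); children: list of (successes, failures) per sub-node
    total = sum(s + f for s, f in children)
    parent_entropy = entropy(parent[0] / (parent[0] + parent[1]))
    weighted_child_entropy = sum((s + f) / total * entropy(s / (s + f)) for s, f in children)
    return parent_entropy - weighted_child_entropy

# Hypothetical split: parent (15, 15), sub-nodes (2, 8) and (13, 7).
print(information_gain((15, 15), [(2, 8), (13, 7)]))  # choose the split with the highest gain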

Reduction in Variance

Reduction in variance is an algorithm used for continuous target variables (regression problems). It uses the standard formula of variance to choose the best split; the split with the lower variance is selected as the criterion to split the population:

Variance = Σ(X − X̄)² / n

where X̄ is the mean of the values, X is an actual value, and n is the number of values.

Steps to calculate Variance:

1. Calculate variance for each node.

2. Calculate variance for each split as weighted average of each node variance.
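A small sketch of the same calculation in Python, using made-up target values for the two sub-nodes of a candidate split.

def variance(values):
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def weighted_split_variance(groups):
    # groups: one list of target values per sub-node
    n = sum(len(g) for g in groups)
    return sum(len(g) / n * variance(g) for g in groups)

# Hypothetical heights of students in the two sub-nodes of a candidate split:
left, right = [5.0, 5.2, 5.1], [5.9, 6.0, 5.8]
print(weighted_split_variance([left, right]))  # the split with the lowest value is preferred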

Key parameters of tree modelling and how we can avoid over-fitting in decision trees:
Overfitting is one of the key practical challenges faced while modeling decision trees. If no limit is set on the size of a decision tree, it will give you 100% accuracy on the training set because, in the worst case, it will end up making one leaf for each observation. A model is considered to be overfitting when the algorithm continues to go deeper and deeper to reduce the training set error but ends up with an increased test set error, i.e. the accuracy of prediction on unseen data goes down. This generally happens when the tree builds many branches due to outliers and irregularities in the data. Preventing overfitting is therefore pivotal while modeling a decision tree, and it can be done in 2 ways:

1. Setting constraints on tree size

2. Tree pruning

Setting Constraints on Tree Size


This can be done by using various parameters which are used to define a tree. The parameters
used for defining a tree are:

1. Minimum samples for a node split

 Defines the minimum number of samples (or observations) which are required in a
node to be considered for splitting.

 Used to control over-fitting. Higher values prevent a model from learning relations
which might be highly specific to the particular sample selected for a tree.

 Values that are too high can lead to under-fitting; hence, this parameter should be tuned using CV.

2. Minimum samples for a terminal node (leaf)

 Defines the minimum samples (or observations) required in a terminal node or leaf.
 Used to control over-fitting similar to min_samples_split.

 Generally lower values should be chosen for imbalanced class problems because the
regions in which the minority class will be in majority will be very small.

3. Maximum depth of tree (vertical depth)

 The maximum depth of a tree.

 Used to control over-fitting as higher depth will allow model to learn relations very
specific to a particular sample.

 Should be tuned using CV.

4. Maximum number of terminal nodes

 The maximum number of terminal nodes or leaves in a tree.

 Can be defined in place of max_depth. Since binary trees are created, a depth of ’n’
would produce a maximum of 2^n leaves.

5. Maximum features to consider for split

 The number of features to consider while searching for a best split. These will be
randomly selected.

 As a rule of thumb, the square root of the total number of features works well, but we should check up to 30-40% of the total number of features.

 Higher values can lead to over-fitting, but this depends on the case.
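For reference, here is a sketch of how these five constraints map onto scikit-learn's DecisionTreeClassifier parameters. X_train and y_train are assumed to be an existing training set, and the values shown are arbitrary starting points to be tuned with CV.

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    min_samples_split=20,   # minimum samples for a node split
    min_samples_leaf=5,     # minimum samples for a terminal node (leaf)
    max_depth=6,            # maximum depth of the tree
    max_leaf_nodes=32,      # maximum number of terminal nodes
    max_features="sqrt",    # maximum features to consider for a split
    random_state=42,
)
model.fit(X_train, y_train)  # X_train, y_train assumed to exist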

Tree Pruning

The technique of setting constraints is a greedy approach. In other words, it will check for the best split at each step and move forward until one of the specified stopping conditions is reached. For example, consider the following case when you're driving:

There are 2 lanes:

1. A lane with cars moving at 80 km/h

2. A lane with trucks moving at 30 km/h

At this instant, you are the yellow car and you have 2 choices:

1. Take a left and overtake the other 2 cars quickly

2. Keep moving in the present lane


Analyzing these choices: with the former choice, you'll immediately overtake the car ahead, end up behind the truck, and start moving at 30 km/h while looking for an opportunity to move back right; all the cars originally behind you move ahead in the meanwhile. This would be the optimal choice if your objective were to maximize the distance covered in, say, the next 10 seconds. With the latter choice, you keep driving at the same speed, pass the trucks, and then overtake depending on the situation ahead.

This is exactly the difference between a normal (constrained) decision tree and pruning. A decision tree with constraints won't see the truck ahead and will adopt the greedy approach of taking a left. If we use pruning, on the other hand, we in effect look a few steps ahead before making a choice.

So we know that pruning is better. To implement it in a decision tree:

1. We first grow the decision tree to a large depth.

2. Then we start at the bottom and remove leaves that give us negative returns when compared from the top.

3. Suppose a split gives us a gain of, say, -10 (a loss of 10) and the next split on that node gives us a gain of 20. A simple decision tree will stop at the first split, but with pruning we see that the overall gain is +10 and keep both.
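The article describes post-pruning only in general terms. As one concrete option, scikit-learn exposes cost-complexity pruning through the ccp_alpha parameter: grow the full tree, compute the pruning path, and pick the alpha that performs best on held-out data. A minimal sketch (X_train, y_train, X_test, y_test are assumed to exist):

from sklearn.tree import DecisionTreeClassifier

# 1. Grow a large tree and compute the cost-complexity pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# 2. Refit with increasing ccp_alpha (stronger pruning) and compare on held-out data.
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"test accuracy={pruned.score(X_test, y_test):.3f}")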

Are tree based models better than linear models?


If one can use logistic regression for classification problems and linear regression for
regression problems, why is there a need to use trees? Actually, we can use any algorithm. It
is dependent on the type of problem we are solving. Some key factors which will help us to
decide which algorithm to use:

1. If the relationship between the dependent and independent variables is well approximated by a linear model, linear regression will outperform a tree-based model.

2. If there is high non-linearity and a complex relationship between the dependent and independent variables, a tree model will outperform a classical regression method.

3. If you need to build a model that is easy to explain to people, a decision tree model will always do better than a linear model; decision tree models are even simpler to interpret than linear regression.
From Tree to Rules :

How Decision Trees work: Algorithm

· Build tree

· Start with data at root node

· Select an attribute and formulate a logical test on attribute

· Branch on each outcome of the test, and move subset of examples satisfying that outcome to
corresponding child node
· Recurse on each child node

· Repeat until leaves are “pure”, i.e., have example from a single class, or “nearly pure”, i.e.,
majority of examples are from the same class

· Prune tree

· Remove subtrees that do not improve classification accuracy

· Avoid over-fitting, i.e., training set specific artifacts

· Build tree
· Evaluate split-points for all attributes

· Select the “best” point and the “winning” attribute

· Split the data into two

· Breadth/depth-first construction

· CRITICAL STEPS:

· Formulation of good split tests

· Selection measure for attributes

 How to capture good splits?

· Prefer the simplest hypothesis that fits the data

· Minimum message/description length

· Dataset D

· Hypotheses H1, H2, …, Hx describing D

· MML(Hi) = Mlength(Hi)+Mlength(D|Hi)

· Pick Hk with minimum MML

· Mlength given by Gini index, Gain, etc.


Tree pruning

 Data encoding: sum classification errors

 Model encoding:

· Encode the tree structure

· Encode the split points

 Pruning: choose smallest length option

· Convert to leaf

· Prune left or right child

· Do nothing

 Hunt’s Method

· Attributes: Refund (Yes, No), Marital Status (Single, Married, Divorced), Taxable Income

· Class: Cheat, Don’t Cheat

Finding good split points

· Use Gini index for partition purity

· If S is pure, Gini(S) = 0, Gini is a kind of entropy calculation

· Find split-point with minimum Gini

· Only need class distributions

How informative is an attribute ? :

· A statistical measure of informativity, measuring how well an attribute distinguishes between examples of different classes.

· Informativity is measured as the decrease in entropy of the training set of examples.

· Entropy is the measure of impurity of the sample set: E(S) = −p₊ log₂ p₊ − p₋ log₂ p₋
Working with Decision Trees in Python:
#Import Library
# Import other necessary libraries like pandas, numpy, etc.
from sklearn import tree

# Assumed you have X (predictor) and y (target) for the training data set
# and x_test (predictor) for the test data set

# Create tree object
model = tree.DecisionTreeClassifier(criterion='gini')
# For classification; the criterion can be 'gini' or 'entropy' (information gain).
# By default it is 'gini'.
# model = tree.DecisionTreeRegressor() for regression

# Train the model using the training set and check the score
model.fit(X, y)
model.score(X, y)

# Predict output
predicted = model.predict(x_test)

Summary :

Not all problems can be solved with linear methods. The world is non-linear. It has been
observed that tree based models have been able to map non-linearity effectively. Methods like
decision trees, random forest, gradient boosting are being popularly used in all kinds of data
science problems.

The decision tree algorithm belongs to the family of supervised learning algorithms. Unlike many other supervised learning algorithms, the decision tree algorithm can be used for solving both regression and classification problems. The general motive of using a decision tree is to create a training model that can be used to predict the class or value of the target variable by learning decision rules inferred from prior data (training data). The primary challenge in a decision tree implementation is to identify which attributes to consider at the root node and at each level. Decision trees often mimic human-level thinking, so it is simple to understand the data and make good interpretations.

Dividing efficiently based on maximum information gain is key to a decision tree classifier. However, in the real world, with millions of data points, dividing into perfectly pure classes is not practically feasible (it may take a very long training time), so we stop splitting a node once certain parameters are fulfilled (for example, an impurity percentage). A decision tree is a classification strategy rather than a single algorithm for classification. It takes a top-down approach and uses a divide-and-conquer method to arrive at a decision. We can have multiple leaves predicting the same class with this approach.

“When your values are clear to you, making decisions become easier.” — Roy E. Disney

Random Forest
The Random Forest Classifier
Random forest, like its name implies, consists of a large number of individual
decision trees that operate as an ensemble. Each individual tree in the random
forest spits out a class prediction and the class with the most votes becomes
our model’s prediction (see figure below).

Visualization of a Random Forest Model Making a Prediction

The fundamental concept behind random forest is a simple but powerful one: the wisdom of crowds. In data science speak, the reason that the random forest model works so well is:

A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models.
The low correlation between models is the key. Just like how investments with
low correlations (like stocks and bonds) come together to form a portfolio that
is greater than the sum of its parts, uncorrelated models can produce
ensemble predictions that are more accurate than any of the individual
predictions. The reason for this wonderful effect is that the trees protect
each other from their individual errors (as long as they don’t constantly all
err in the same direction). While some trees may be wrong, many other trees
will be right, so as a group the trees are able to move in the correct direction.
So the prerequisites for random forest to perform well are:

1. There needs to be some actual signal in our features so that models built using those features do better than random guessing.

2. The predictions (and therefore the errors) made by the individual trees
need to have low correlations with each other.

An Example of Why Uncorrelated Outcomes are So Great


The wonderful effect of having many uncorrelated models is such a critical
concept that I want to show you an example to help it really sink in. Imagine
that we are playing the following game:

 I use a uniformly distributed random number generator to produce a number.
 If the number I generate is greater than or equal to 40, you win (so you
have a 60% chance of victory) and I pay you some money. If it is below 40,
I win and you pay me the same amount.

 Now I offer you the following choices. We can either:

1. Game 1: play 100 times, betting $1 each time.

2. Game 2: play 10 times, betting $10 each time.

3. Game 3: play one time, betting $100.

Which would you pick? The expected value of each game is the same:

Expected Value Game 1 = (0.60*1 + 0.40*-1)*100 = 20

Expected Value Game 2 = (0.60*10 + 0.40*-10)*10 = 20

Expected Value Game 3 = 0.60*100 + 0.40*-100 = 20

Outcome Distribution of 10,000 Simulations for each Game

What about the distributions? Let's visualize the results with a Monte Carlo simulation (we will run 10,000 simulations of each game type; for example, we will simulate 10,000 times the 100 plays of Game 1). Looking at the resulting distributions, which game would you pick now? Even though the expected values are the same, the outcome distributions are vastly different, going from positive and narrow to essentially binary.

Game 1 (where we play 100 times) offers the best chance of making some money: out of the 10,000 simulations that I ran, you make money in 97% of them! For Game 2 (where we play 10 times) you make money in 63% of the simulations, a drastic decline (and a drastic increase in your probability of losing money). And in Game 3, which we only play once, you make money in 60% of the simulations, as expected.
Probability of Making Money for Each Game

So even though the games share the same expected value, their outcome
distributions are completely different. The more we split up our $100 bet into
different plays, the more confident we can be that we will make money. As
mentioned previously, this works because each play is independent of the
other ones.

Random forest is the same — each tree is like one play in our game earlier. We
just saw how our chances of making money increased the more times we
played. Similarly, with a random forest model, our chances of making correct
predictions increase with the number of uncorrelated trees in our model.

If you would like to run the code for simulating the game yourself you can find
it on my GitHub here.
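For readers who just want the gist, here is a minimal sketch of such a simulation (not the author's original code): each bet wins with probability 0.6, and we count the fraction of trials in which the player ends up ahead.

import random

def fraction_profitable(n_bets, stake, trials=10_000):
    # Fraction of simulated trials in which total profit is positive.
    wins = 0
    for _ in range(trials):
        profit = sum(stake if random.random() < 0.6 else -stake for _ in range(n_bets))
        if profit > 0:
            wins += 1
    return wins / trials

print(fraction_profitable(100, 1))   # Game 1: roughly 0.97
print(fraction_profitable(10, 10))   # Game 2: roughly 0.63
print(fraction_profitable(1, 100))   # Game 3: roughly 0.60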

Ensuring that the Models Diversify Each Other


So how does random forest ensure that the behavior of each individual tree is
not too correlated with the behavior of any of the other trees in the model? It
uses the following two methods:

Bagging (Bootstrap Aggregation): Decision trees are very sensitive to the data they are trained on; small changes to the training set can result in significantly different tree structures. Random forest takes advantage of this by allowing each individual tree to randomly sample from the dataset with replacement, resulting in different trees. This process is known as bagging.

Notice that with bagging we are not subsetting the training data into smaller
chunks and training each tree on a different chunk. Rather, if we have a sample
of size N, we are still feeding each tree a training set of size N (unless specified
otherwise). But instead of the original training data, we take a random sample
of size N with replacement. For example, if our training data was [1, 2, 3, 4, 5, 6]
then we might give one of our trees the following list [1, 2, 2, 3, 6, 6]. Notice
that both lists are of length six and that “2” and “6” are both repeated in the
randomly selected training data we give to our tree (because we sample with
replacement).
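A one-line illustration of such a bootstrap sample in Python (the exact values drawn will of course differ from run to run):

import random

training_data = [1, 2, 3, 4, 5, 6]
# Sample with replacement to the same size as the original training data.
bootstrap_sample = [random.choice(training_data) for _ in range(len(training_data))]
print(bootstrap_sample)  # e.g. [1, 2, 2, 3, 6, 6]: same length, some values repeated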

Node splitting in a random forest model is based on a random subset of features for each tree.

Feature Randomness: In a normal decision tree, when it is time to split a node, we consider every possible feature and pick the one that produces the most separation between the observations in the left node vs. those in the right node. In contrast, each tree in a random forest can pick only from a random subset of features. This forces even more variation amongst the trees in the model and ultimately results in lower correlation across trees and more diversification.
Let’s go through a visual example — in the picture above, the traditional
decision tree (in blue) can select from all four features when deciding how to
split the node. It decides to go with Feature 1 (black and underlined) as it splits
the data into groups that are as separated as possible.

Now let’s take a look at our random forest. We will just examine two of the
forest’s trees in this example. When we check out random forest Tree 1, we find
that it can only consider Features 2 and 3 (selected randomly) for its node
splitting decision. We know from our traditional decision tree (in blue) that
Feature 1 is the best feature for splitting, but Tree 1 cannot see Feature 1 so it
is forced to go with Feature 2 (black and underlined). Tree 2, on the other hand,
can only see Features 1 and 3 so it is able to pick Feature 1.

So in our random forest, we end up with trees that are not only trained on different sets of data (thanks to bagging) but also use different features to make decisions.

And that, my dear reader, creates uncorrelated trees that buffer and protect
each other from their errors.
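In scikit-learn, both ideas are exposed directly: n_estimators sets the number of bagged trees and max_features the size of the random feature subset considered at each split. A minimal sketch (X_train, y_train and X_test are assumed to exist):

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=500,      # number of trees voting in the ensemble
    max_features="sqrt",   # random subset of features considered at each split
    bootstrap=True,        # each tree trains on a bootstrap sample of the data
    random_state=42,
)
forest.fit(X_train, y_train)
predictions = forest.predict(X_test)  # majority vote across the trees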

Conclusion
Random forests are a personal favorite of mine. Coming from the world of
finance and investments, the holy grail was always to build a bunch of
uncorrelated models, each with a positive expected return, and then put them
together in a portfolio to earn massive alpha (alpha = market beating returns).
Much easier said than done!

Random forest is the data science equivalent of that. Let’s review one last time.
What’s a random forest classifier?

The random forest is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree.

What do we need in order for our random forest to make accurate class
predictions?

1. We need features that have at least some predictive power. After all,
if we put garbage in then we will get garbage out.
2. The trees of the forest and more importantly their predictions
need to be uncorrelated (or at least have low correlations with each
other). While the algorithm itself via feature randomness tries to engineer
these low correlations for us, the features we select and the
hyper-parameters we choose will impact the ultimate correlations as well.

In my previous blogs on Data Science and Machine Learning, I have briefly covered an introduction to Data Science, Python, Statistics, Machine Learning, Regression, Linear Regression, Logistic Regression and Decision Trees. In this sixth post of the series, I shall cover Boosting, an ensemble method.

Introduction to Boosting

“Alone we can do so little and together we can do much” — Helen Keller

A road to success is incomplete without any failures in life. Each failure teaches
you something new and makes you stronger at each phase. Each time you
make a mistake, it’s important to learn from it and try not to repeat it again.
Just as we sometimes develop life skills by learning from our mistakes, we can
train our model to learn from its prediction errors and improve the model's predictions and overall performance. This is the most basic intuition behind Boosting algorithms in Machine Learning.

Bagging (stands for Bootstrap Aggregating): It is an approach where you take random samples of the data, build a learner on each sample, and take simple averages of their predictions to find the bagged probabilities.

Boosting: Boosting is similar; however, the selection of samples is made more intelligently. We subsequently give more and more weight to hard-to-classify observations.

Ensemble learning is a machine learning concept in which multiple models are trained using the same learning algorithm. Bagging is a way to decrease the variance of the prediction by generating additional training data from the dataset, using combinations with repetitions to produce multi-sets of the original data.
Boosting is an iterative technique which adjusts the weight of an observation
based on the last classification. If an observation was classified incorrectly, it
tries to increase the weight of this observation. Boosting in general builds
strong predictive models. Ensemble methods combine several decision trees
classifiers to produce better predictive performance than a single decision tree
classifier. The main principle behind the ensemble model is that a group of
weak learners come together to form a strong learner, thus increasing the
accuracy of the model.
The term Boosting refers to a family of algorithms that convert weak learners into strong learners. Boosting is an ensemble method for improving the
model predictions of any given learning algorithm. The idea of boosting is to
train weak learners sequentially, each trying to correct its predecessor. A weak
learner is defined to be a classifier that is only slightly correlated with the true
classification. In contrast, a strong learner is a classifier that is arbitrarily
well-correlated with the true classification.

To understand this definition, consider the following example of solving the problem of spam email identification:

How would you classify an email as SPAM or not? Our initial approach would
be to identify ‘spam’ and ‘not spam’ emails using following criteria. If:

1. Email has only one image file (promotional image), It’s a SPAM

2. Email has only link(s), It’s a SPAM

3. Email body consists of a sentence like "You won a prize money of $ xxxxxx", It's a SPAM

4. Email from any official domain say ril.com, Not a SPAM

5. Email from known source, Not a SPAM

Above, we've defined multiple rules to classify an email as 'spam' or 'not spam'. But these rules individually are not strong enough to successfully classify an email. Therefore, these rules are called weak learners.

To convert weak learners into a strong learner, we combine the predictions of the weak learners using methods like:

· Using an average/weighted average

· Considering the prediction that has the higher vote

For example: Above, we have defined 5 weak learners. Out of these 5, 3 are
voted as ‘SPAM’ and 2 are voted as ‘Not a SPAM’. In this case, by default, we’ll
consider an email as SPAM because we have higher (3) vote for ‘SPAM’.
How do Boosting Algorithms work?
Boosting combines weak learners (base learners) to form a strong rule. To find a weak rule, we apply a base learning (ML) algorithm with a different data distribution each time. Each time the base learning algorithm is applied, it generates a new weak prediction rule. This is an iterative process. After many iterations, the boosting algorithm combines these weak rules into a single strong prediction rule.

For choosing the right distribution, the steps are as follows:

Step 1: The base learner takes all the observations and assigns equal weight or attention to each of them.

Step 2: If there are any prediction errors caused by the first base learning algorithm, we pay higher attention to the observations with prediction errors. Then we apply the next base learning algorithm.

Step 3: Iterate Step 2 until the limit on the number of base learners is reached or a higher accuracy is achieved.

Finally, it combines the outputs of the weak learners and creates a strong learner, which eventually improves the prediction power of the model. Boosting pays more attention to examples that are misclassified or have higher errors under the preceding weak rules.

Suppose we have a binary classification task. A weak learner has an error rate slightly less than 0.5, i.e. it is only slightly better than deciding by a coin toss. A strong learner has an error rate close to 0. To convert weak learners into a strong learner, we take a family of weak learners, combine them, and vote. This turns the family of weak learners into a strong learner. The idea here is that the weak learners should have minimal correlation between them.
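To make the reweighting idea concrete, here is a schematic AdaBoost-style loop (AdaBoost itself is covered in the next section). It assumes X is a NumPy feature matrix and y a NumPy array of labels coded as -1/+1, and it uses decision stumps as the weak learners.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

n_samples = len(X)                              # X, y (labels in {-1, +1}) assumed to exist
weights = np.full(n_samples, 1 / n_samples)     # Step 1: equal weight for every observation
learners, alphas = [], []

for _ in range(10):                             # Step 3: iterate
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=weights)
    pred = stump.predict(X)
    err = np.sum(weights[pred != y]) / np.sum(weights)
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))   # this learner's say in the final vote
    weights *= np.exp(-alpha * y * pred)        # Step 2: up-weight misclassified observations
    weights /= weights.sum()
    learners.append(stump)
    alphas.append(alpha)

# The strong learner: sign of the weighted vote of all weak learners.
strong_prediction = np.sign(sum(a * l.predict(X) for a, l in zip(alphas, learners)))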

Types of Boosting
The accuracy of a predictive model can be boosted in two ways: Either by
embracing feature engineering or by applying boosting algorithms straight
away. It is preferred to work with boosting algorithms as it takes less time and
produces similar results.

There are multiple boosting algorithms like AdaBoost, Gradient Boosting, XGBoost, etc. Every algorithm has its own underlying mathematics, and a slight variation is observed while applying them.
 AdaBoost (Adaptive Boosting)

AdaBoost combines multiple weak learners into a single strong learner. The
weak learners in AdaBoost are decision trees with a single split, called decision
stumps. When AdaBoost creates its first decision stump, all observations are
weighted equally. To correct the previous error, the observations that were
incorrectly classified now carry more weight than the observations that were
correctly classified. AdaBoost algorithms can be used for both classification
and regression problems.

The following boxes (Box 1 to Box 4, from the diagram in the original post) explain AdaBoost step by step.


Box 1: We assign equal weights to each data point and apply a decision stump to classify them as + (plus) or - (minus). The decision stump (D1) has generated a vertical line on the left side to classify the data points. We see that this vertical line has incorrectly predicted three + (plus) as - (minus). In such a case, we assign higher weights to these three + (plus) and apply another decision stump.

Box 2: Here, you can see that the size of three incorrectly predicted + (plus) is
bigger as compared to rest of the data points. In this case, the second decision
stump (D2) will try to predict them correctly. Now, a vertical line (D2) at right
side of this box has classified three mis-classified + (plus) correctly. But again, it
has caused mis-classification errors. This time with three -(minus). Again, we
will assign higher weight to three — (minus) and apply another decision
stump.

Box 3: Here, three — (minus) are given higher weights. A decision stump (D3) is
applied to predict these mis-classified observation correctly. This time a
horizontal line is generated to classify + (plus) and — (minus) based on higher
weight of mis-classified observation.

Box 4: Here, we have combined D1, D2 and D3 to form a strong prediction with a more complex rule than any individual weak learner. You can see that this combination has classified the observations quite well compared to any of the individual weak learners.

AdaBoost works by the method discussed above: it fits a sequence of weak learners on differently weighted training data. It starts by predicting on the original data set and gives equal weight to each observation. If the prediction using the first learner is incorrect, it gives higher weight to the observations that have been predicted incorrectly. Being an iterative process, it continues to add learners until a limit on the number of models or on accuracy is reached.

Mostly, we use decision stumps with AdaBoost, but we can use any machine learning algorithm as the base learner if it accepts weights on the training data set. We can use AdaBoost algorithms for both classification and regression problems.

Advantages of AdaBoost :

 Can be used with many different classifiers

 Improves classification accuracy

 Commonly used in many areas

 Simple to implement

 Does feature selection resulting in relatively simple classifier

 Not prone to overfitting

 Fairly good generalization

Disadvantages of AdaBoost :

Ø The drawback of AdaBoost is that it is easily defeated by noisy data; the efficiency of the algorithm is highly affected by outliers, as the algorithm tries to fit every point perfectly.

Ø Even though the algorithm tries to fit every point, in practice it rarely overfits.

Ø It may converge to a suboptimal solution.

Python code:

from sklearn.ensemble import AdaBoostClassifier # For Classification

from sklearn.ensemble import AdaBoostRegressor # For Regression

from sklearn.tree import DecisionTreeClassifier


dt = DecisionTreeClassifier()
clf = AdaBoostClassifier(n_estimators=100, base_estimator=dt, learning_rate=1)
# Note: in recent scikit-learn versions this parameter is named 'estimator'
# rather than 'base_estimator'.

# We have used a decision tree as the base estimator. We can use any ML learner
# as the base estimator if it accepts sample weights.

clf.fit(x_train, y_train)

You can tune the parameters to optimize the performance of the algorithm. The key parameters for tuning are:

 n_estimators: controls the number of weak learners.

 learning_rate: controls the contribution of the weak learners in the final combination. There is a trade-off between learning_rate and n_estimators.

 base_estimator: specifies the base ML algorithm.

You can also tune the parameters of base learners to optimize its performance.
 Gradient Boosting

Just like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of changing the weights of incorrectly classified observations at every iteration like AdaBoost, the Gradient Boosting method tries to fit the new predictor to the residual errors made by the previous predictor. Gradient boosting is a machine learning technique for regression and classification problems which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. The objective of any supervised learning algorithm is to define a loss function and minimize it.

The Gradient Boosting Machine (GBM) uses gradient descent to find the shortcomings in the previous learner's predictions and is an extremely popular machine learning algorithm. In GBM, we take a weak learner and, at each step, add another weak learner to increase the performance and build a strong learner. This reduces the value of the loss function. We iteratively add each model and compute the loss. The loss represents the error residuals (the difference between the actual and predicted values), and using this loss value the predictions are updated to minimise the residuals.

GBM algorithm can be given by following steps.

 Fit a model to the data, F1(x) = y

 Fit a model to the residuals, h1(x) = y−F1(x)

 Create a new model, F2(x) = F1(x) + h1(x)

 By combining weak learner after weak learner, our final model is able to
account for a lot of the error from the original model and reduces this
error over time.

GBM is one of the most widely used algorithms.

The intuition behind the gradient boosting algorithm is to repeatedly leverage the patterns in the residuals to strengthen a model with weak predictions and make it better. Once we reach a stage where the residuals no longer have any pattern that could be modeled, we can stop modeling residuals (otherwise it might lead to overfitting). Algorithmically, we are minimizing our loss function such that the test loss reaches its minimum.
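The residual-fitting loop can be sketched in a few lines with scikit-learn regression trees. This is a toy illustration of the idea with a squared-error loss, not a full GBM; X and y are assumed to be an existing regression dataset as NumPy arrays.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

learning_rate = 0.1
prediction = np.full(len(y), y.mean())              # F1(x): a simple initial model (the mean)
trees = []

for _ in range(100):
    residuals = y - prediction                      # each h_i(x) targets y - F_i(x)
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # F_{i+1}(x) = F_i(x) + lr * h_i(x)
    trees.append(tree)

def predict(X_new):
    # Final model: start from the mean and add each tree's (scaled) contribution.
    return y.mean() + learning_rate * sum(t.predict(X_new) for t in trees)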
Advantages of GBM:

 Often provides predictive accuracy that cannot be beat.

 Lots of flexibility: can optimize on different loss functions and provides several hyper-parameter tuning options that make the function fit very flexible.

 No data pre-processing required: often works great with categorical and numerical values as is.

 Handles missing data — imputation not required.

Disadvantages of GBM:

 GBMs will continue improving to minimize all errors. This can overemphasize outliers and cause overfitting; cross-validation must be used to neutralize this.

 Computationally expensive: GBMs often require many trees (>1000), which can be exhausting in time and memory.

 The high flexibility results in many parameters that interact and heavily influence the behavior of the approach (number of iterations, tree depth, regularization parameters, etc.). This requires a large grid search during tuning.

 Less interpretable, although this is easily addressed with various tools (variable importance, partial dependence plots, LIME, etc.).
Python code:

from sklearn.ensemble import GradientBoostingClassifier # For Classification

from sklearn.ensemble import GradientBoostingRegressor # For Regression

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1)

clf.fit(X_train, y_train)

 n_estimators: It controls the number of weak learners.

 learning_rate: controls the contribution of the weak learners in the final combination. There is a trade-off between learning_rate and n_estimators.

 max_depth: maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. Tune this parameter for best performance; the best value depends on the interaction of the input variables.

You can tune loss function for better performance.

 XGBoost

XGBoost stands for eXtreme Gradient Boosting and is another, faster version of a boosting learner. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. Gradient boosting machines are generally very slow to train because of sequential model training; hence, they are not very scalable. Thus, XGBoost is focused on computational speed and model performance. XGBoost provides:

 Parallelization of tree construction using all of your CPU cores during training.

 Distributed Computing for training very large models using a cluster of machines.

 Out-of-Core Computing for very large datasets that don’t fit into
memory.
 Cache Optimization of data structures and algorithm to make the best
use of hardware.

XGBoost is similar to gradient boosting algorithm but it has a few tricks up its
sleeve which makes it stand out from the rest.
Features of XGBoost are:

 Clever Penalisation of Trees

 A Proportional shrinking of leaf nodes

 Newton Boosting

 Extra Randomisation Parameter

In XGBoost, the trees can have a varying number of terminal nodes, and the leaf weights of trees that are calculated with less evidence are shrunk more heavily. Newton boosting uses the Newton-Raphson method of approximation, which provides a more direct route to the minimum than gradient descent. The extra randomisation parameter can be used to reduce the correlation between the trees; as seen earlier, the lower the correlation among classifiers, the better our ensemble of classifiers will turn out. Generally, XGBoost is faster than plain gradient boosting, but gradient boosting has a wide range of applications.
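As a brief, illustrative sketch of the library's scikit-learn-style API (assuming the xgboost package is installed and X_train, y_train, X_test, y_test exist; the parameter values are arbitrary starting points):

from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    reg_lambda=1.0,        # L2 regularization on leaf weights
    subsample=0.8,         # extra randomisation: row subsampling per tree
    colsample_bytree=0.8,  # extra randomisation: feature subsampling per tree
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # mean accuracy on the test set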
Advantages of XGBoost :

1. Regularization:

 The standard GBM implementation has no regularization like XGBoost; therefore, XGBoost also helps to reduce overfitting.

 In fact, XGBoost is also known as a 'regularized boosting' technique.

2. Parallel Processing:

 XGBoost implements parallel processing and is blazingly fast compared to GBM.

 But hang on, we know that boosting is sequential process so how can it
be parallelized? We know that each tree can be built only after the
previous one, so what stops us from making a tree using all cores? I hope
you get where I’m coming from. Check this link out to explore further.

 XGBoost also supports implementation on Hadoop.

3. High Flexibility

 XGBoost allows users to define custom optimization objectives and evaluation criteria.

 This adds a whole new dimension to the model and there is no limit to
what we can do.

4. Handling Missing Values

 XGBoost has an in-built routine to handle missing values.

 The user is required to supply a value different from the other observations and pass it as a parameter. XGBoost tries different things when it encounters a missing value at each node and learns which path to take for missing values in the future.

5. Tree Pruning:

 A GBM would stop splitting a node when it encounters a negative loss in the split; thus it is more of a greedy algorithm.
 XGBoost, on the other hand, makes splits up to the max_depth specified and then starts pruning the tree backwards, removing splits beyond which there is no positive gain.

 Another advantage is that sometimes a split with a negative loss of, say, -2 may be followed by a split with a positive loss of +10. GBM would stop when it encounters the -2, but XGBoost will go deeper, see the combined effect of +8, and keep both splits.

6. Built-in Cross-Validation

 XGBoost allows the user to run a cross-validation at each iteration of the boosting process, and thus it is easy to get the exact optimum number of boosting iterations in a single run.

 This is unlike GBM, where we have to run a grid search and only a limited set of values can be tested.

7. Continue on Existing Model

 User can start training an XGBoost model from its last iteration of
previous run. This can be of significant advantage in certain specific
applications.

 GBM implementation of sklearn also has this feature so they are even
on this point.

Summary :

Ensemble methods are learning models that achieve performance by combining the opinions of multiple learners. Typically, an ensemble model is a supervised learning technique that combines multiple weak learners or models to produce a strong learner, using the concepts of Bagging and Boosting for data sampling. An ensemble method is a combination of multiple models that helps to reduce the generalization errors that might not be handled by a single modeling approach.
The term Boosting refers to a family of algorithms that convert weak learners into strong learners. The main idea of boosting is to modify a weak
learner to become better. Boosting is an ensemble technique in which the
predictors are not made independently, but sequentially. This technique
employs the logic in which the subsequent predictors learn from the mistakes
of the previous predictors. Therefore, the observations have an unequal
probability of appearing in subsequent models and ones with the highest error
appear most. (So the observations are not chosen based on the bootstrap
process, but based on the error). The predictors can be chosen from a range of
models like decision trees, regressors, classifiers etc. Because new predictors
are learning from mistakes committed by previous predictors, it takes less
time/iterations to reach close to actual predictions. But we have to choose the
stopping criteria carefully or it could lead to overfitting on training data.

Boosting is used primarily for reducing bias, and also variance. It is a weighted-average approach: we iteratively learn and form multiple weak learners and combine them to make a final strong learner. The weight given to each weak learner in the strong learner is related to its accuracy. The observation weights are recalculated after each weak learner is added, so that misclassified observations gain weight.
Multiple boosting algorithms are available to use, such as AdaBoost, Gradient Boosting, XGBoost, etc. Each boosting algorithm has its own advantages on different types of datasets. By tuning multiple algorithms over a wider range of input data, good classification or predictive models can be built.

So in this series, we looked at Boosting, one of the methods of ensemble modeling used to enhance prediction power.

"Failure is the key to success; each mistake teaches us something." - Morihei Ueshiba
