Ensemble Concepts
Continuing my series of blogs on Data Science and Machine Learning, I have so far briefly covered an introduction to Data Science, Python, Statistics, Machine Learning, Regression, and Linear and Logistic Regression. In this fifth post of the series, I shall cover Decision Trees.
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and
their possible consequences, including chance event outcomes, resource costs, and utility. It is
one way to display an algorithm that only contains conditional control statements.
A decision tree is a flowchart-like structure in which each internal node represents a “test” on
an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the
outcome of the test, and each leaf node represents a class label (decision taken after computing
all attributes). The paths from root to leaf represent classification rules.
Tree-based learning algorithms are considered to be among the best and most widely used supervised learning methods. Tree-based methods empower predictive models with high accuracy, stability and ease of interpretation. Unlike linear models, they map non-linear relationships quite well. They are adaptable to solving any kind of problem at hand (classification or regression). Decision tree algorithms are often referred to as CART (Classification and Regression Trees).
“The possible solutions to a given problem emerge as the leaves of a tree, each node
representing a point of deliberation and decision.”
Methods like decision trees, random forests and gradient boosting are popularly used in all kinds of data science problems.
Common terms used with decision trees:
1. Root Node: Represents the entire population or sample; this further gets divided into two or more homogeneous sets.
2. Splitting: The process of dividing a node into two or more sub-nodes.
3. Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
4. Leaf/Terminal Node: Nodes that do not split are called leaf or terminal nodes.
5. Pruning: Removing sub-nodes of a decision node; the opposite process of splitting.
6. Branch/Sub-Tree: A sub-section of the entire tree.
7. Parent and Child Node: A node which is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
Applications of Decision Trees:
Decision trees have a natural "if … then … else …" construction that makes them fit easily into a programmatic structure. They are also well suited to categorization problems where attributes or features are systematically checked to determine a final category. For example, a decision tree could be used effectively to determine the species of an animal.
As a result, the decision tree is one of the more popular classification algorithms used in Data Mining and Machine Learning. Example applications include:
· Evaluation of brand expansion opportunities for a business using historical sales data
· Operations research, specifically decision analysis, where decision trees help identify the strategy most likely to reach a goal
Because of their simplicity, tree diagrams have been used in a broad range of industries and disciplines including civil planning, energy, finance, engineering, healthcare, pharmaceuticals, education, law, and business.
Example:
Let's say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X) and Height (5 to 6 ft). 15 out of these 30 play cricket in their leisure time. Now, we want to create a model to predict who will play cricket during leisure time. In this problem, we need to segregate the students who play cricket in their leisure time based on the most significant input variable among the three.
This is where a decision tree helps: it will segregate the students based on all values of the three variables and identify the variable which creates the best homogeneous sets of students (sets which are heterogeneous to each other). In the snapshot below, you can see that the variable Gender is able to identify the best homogeneous sets compared to the other two variables.
A decision tree identifies the most significant variable, and the value of that variable, which gives the best homogeneous sets of the population. To identify the variable and the split, decision trees use various algorithms.
The type of decision tree is based on the type of target variable we have. It can be of two types:
1. Categorical Variable Decision Tree: A decision tree which has a categorical target variable is called a categorical variable decision tree. E.g., in the above student problem, the target variable was "Student will play cricket or not", i.e. YES or NO.
2. Continuous Variable Decision Tree: A decision tree which has a continuous target variable is called a continuous variable decision tree.
E.g., let's say we have a problem of predicting whether a customer will pay his renewal premium with an insurance company (yes/no). Here we know that the income of the customer is a significant variable, but the insurance company does not have income details for all customers. Since we know this is an important variable, we can build a decision tree to predict customer income based on occupation, product and various other variables. In this case, we are predicting values of a continuous variable.
The basic steps for building a decision tree are:
1. Place the best attribute of the dataset at the root of the tree.
2. Split the training set into subsets. Subsets should be made in such a way that each subset contains data with the same value for an attribute.
3. Repeat step 1 and step 2 on each subset until you find leaf nodes in all the branches of the tree.
In decision trees, to predict a class label for a record we start from the root of the tree. We compare the value of the root attribute with the record's attribute. On the basis of the comparison, we follow the branch corresponding to that value and jump to the next node. We continue comparing our record's attribute values with the other internal nodes of the tree until we reach a leaf node with the predicted class value. The modelled decision tree can then be used to predict the target class or value.
Feature values are preferably categorical. If the values are continuous, they are discretized prior to building the model.
The order in which attributes are placed as the root or as internal nodes of the tree is decided using a statistical approach.
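As a small illustration of this root-to-leaf walk, here is a minimal sketch in Python using a tiny hand-built tree; the attributes and branch values are assumptions made up purely for the example.
tree = {
    "attribute": "Gender",
    "branches": {
        "Girl": {"label": "No"},  # leaf node
        "Boy": {
            "attribute": "Class",
            "branches": {"IX": {"label": "No"}, "X": {"label": "Yes"}},
        },
    },
}

def predict(node, record):
    # Stop at a leaf; otherwise follow the branch matching the record's value
    # for the node's test attribute.
    if "label" in node:
        return node["label"]
    value = record[node["attribute"]]
    return predict(node["branches"][value], record)

print(predict(tree, {"Gender": "Boy", "Class": "X"}))  # -> "Yes"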
Advantages of Decision Trees:
1. Easy to understand: Decision tree output is very easy to understand, even for people from a non-analytical background. It does not require any statistical knowledge to read and interpret. Its graphical representation is very intuitive and users can easily relate it to their hypotheses.
2. Useful in data exploration: A decision tree is one of the fastest ways to identify the most significant variables and the relations between two or more variables. With the help of decision trees, we can create new variables/features that have better power to predict the target variable. They can also be used in the data exploration stage. For example, when we are working on a problem where information is available in hundreds of variables, a decision tree will help to identify the most significant ones.
3. Less data cleaning required: It requires less data cleaning compared to some other modeling techniques. It is not influenced by outliers and missing values to a fair degree.
4. Data type is not a constraint: It can handle both numerical and categorical variables, and can also handle multi-output problems.
Disadvantages of Decision Trees:
1. Overfitting: Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting, and it is one of the most practical difficulties for decision tree models. The problem is addressed by setting constraints on model parameters and by pruning.
2. Not ideal for continuous variables: While working with continuous numerical variables, a decision tree loses information when it categorizes the variables into different bins.
3. Decision trees can be unstable, because small variations in the data might result in a completely different tree being generated. This is called variance, and it can be lowered by methods like bagging and boosting.
4. Greedy algorithms cannot guarantee returning the globally optimal decision tree. This can be mitigated by training multiple trees, where the features and samples are randomly sampled with replacement.
5. Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the data set prior to fitting the decision tree.
6. Information gain in a decision tree with categorical variables gives a biased response for attributes with a greater number of categories.
7. Calculations can become complex when there are many class labels.
The terminal nodes (or leaves) lie at the bottom of the decision tree. This means that decision trees are typically drawn upside down, such that the leaves are at the bottom and the root is at the top.
Classification trees and regression trees work in a broadly similar way. The primary differences and similarities between them are:
1. Regression trees are used when the dependent variable is continuous. Classification trees are used when the dependent variable is categorical.
2. In the case of a regression tree, the value obtained at a terminal node in the training data is the mean response of the observations falling in that region. Thus, if an unseen data observation falls in that region, we make its prediction using the mean value.
3. In the case of a classification tree, the value (class) obtained at a terminal node in the training data is the mode of the observations falling in that region. Thus, if an unseen data observation falls in that region, we make its prediction using the mode value.
4. Both types of tree divide the predictor space (the independent variables) into distinct and non-overlapping regions.
5. Both types of tree follow a top-down greedy approach known as recursive binary splitting. We call it 'top-down' because it begins at the top of the tree, when all the observations are in a single region, and successively splits the predictor space into two new branches down the tree. It is known as 'greedy' because the algorithm cares only about the current split (it looks for the best variable available), and not about future splits that might lead to a better tree.
6. This splitting process is continued until a user-defined stopping criterion is reached. For example, we can tell the algorithm to stop once the number of observations per node becomes less than 50.
7. In both cases, the splitting process results in fully grown trees until the stopping criterion is reached. But a fully grown tree is likely to overfit the data, leading to poor accuracy on unseen data. This brings in 'pruning', one of the techniques used to tackle overfitting.
The tree below summarizes, at a high level, the types of decision trees available.
The decision of where to make strategic splits heavily affects a tree's accuracy. The decision criteria are different for classification and regression trees.
Decision trees use multiple algorithms to decide to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes. In other words, the purity of a node increases with respect to the target variable. A decision tree splits the nodes on all available variables and then selects the split which results in the most homogeneous sub-nodes.
The algorithm selection also depends on the type of target variable. The four most commonly used splitting criteria in decision trees are:
Gini Index
The Gini index says that if we select two items from a population at random, then they must be of the same class, and the probability of this is 1 if the population is pure. CART (Classification and Regression Trees) uses the Gini method to create binary splits.
Steps to calculate Gini for a split (a small Python sketch follows):
1. Calculate Gini for the sub-nodes, using the formula sum of squares of the probabilities of success and failure (p² + q²).
2. Calculate Gini for the split using the weighted Gini score of each node of that split.
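Here is a minimal sketch of that weighted Gini calculation for the 30-student cricket example; the per-node counts are illustrative assumptions, not figures given in the text.
def gini(p_success, p_failure):
    # Gini score of a node: p^2 + q^2 (higher means purer)
    return p_success ** 2 + p_failure ** 2

# Assumed split on Gender: 10 girls of whom 2 play, 20 boys of whom 13 play
gini_female = gini(2 / 10, 8 / 10)
gini_male = gini(13 / 20, 7 / 20)
weighted_gini_gender = (10 / 30) * gini_female + (20 / 30) * gini_male

# Assumed split on Class: 14 in IX of whom 6 play, 16 in X of whom 9 play
gini_ix = gini(6 / 14, 8 / 14)
gini_x = gini(9 / 16, 7 / 16)
weighted_gini_class = (14 / 30) * gini_ix + (16 / 30) * gini_x

# The split with the higher weighted Gini score (purer sub-nodes) is preferred
print(weighted_gini_gender, weighted_gini_class)  # Gender wins here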
Chi-Square
This is an algorithm to find the statistical significance of the differences between the sub-nodes and the parent node. We measure it by the sum of squares of the standardized differences between the observed and expected frequencies of the target variable.
Steps to calculate Chi-square for a split (a small Python sketch follows):
1. Calculate the Chi-square for each individual node by calculating the deviation for both Success and Failure.
2. Calculate the Chi-square of the split as the sum of the Chi-squares of Success and Failure of each node of the split.
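A minimal sketch of that split test, using the same assumed Gender split as before and the standard chi-square form, sum of (observed - expected)^2 / expected; the counts remain illustrative assumptions.
def chi_square_node(observed_success, observed_failure):
    # With a 50/50 play rate in the parent, the expected count of each class
    # in a node is half of the node's size.
    total = observed_success + observed_failure
    expected = total / 2
    return ((observed_success - expected) ** 2 / expected
            + (observed_failure - expected) ** 2 / expected)

# Assumed split on Gender: 10 girls (2 play), 20 boys (13 play)
chi_gender = chi_square_node(2, 8) + chi_square_node(13, 7)
print(chi_gender)  # larger values indicate a more significant split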
Information Gain
A less impure node requires less information to describe it, and a more impure node requires more information. Information theory defines this degree of disorganization in a system as entropy. If the sample is completely homogeneous, the entropy is zero, and if the sample is equally divided (50%-50%), it has an entropy of one. Entropy is calculated as Entropy = -p log2(p) - q log2(q), where p and q are the probabilities of success and failure respectively in that node. Entropy is used with a categorical target variable. We choose the split which has the lowest entropy compared to the parent node and the other candidate splits; the lower the entropy, the better.
Steps to calculate entropy for a split (a small Python sketch follows):
1. Calculate the entropy of the parent node.
2. Calculate the entropy of each individual node of the split, and take the weighted average over all sub-nodes in the split.
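The sketch below works through that calculation and the resulting information gain (parent entropy minus weighted child entropy), again with the assumed Gender split.
import math

def entropy(p, q):
    # Entropy = -p*log2(p) - q*log2(q); zero for a pure node, one for a 50/50 node
    if p == 0 or q == 0:
        return 0.0
    return -p * math.log2(p) - q * math.log2(q)

parent_entropy = entropy(15 / 30, 15 / 30)  # 1.0, since 15 of 30 play cricket
entropy_female = entropy(2 / 10, 8 / 10)
entropy_male = entropy(13 / 20, 7 / 20)
weighted_child_entropy = (10 / 30) * entropy_female + (20 / 30) * entropy_male
information_gain = parent_entropy - weighted_child_entropy
print(information_gain)  # the split with the highest gain (lowest entropy) wins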
Reduction in Variance
Reduction in variance is used when the target variable is continuous (regression problems); the split with the lower weighted variance is selected.
Steps to calculate variance for a split (a small Python sketch follows):
1. Calculate the variance of each node.
2. Calculate the variance of each split as the weighted average of the variances of its nodes.
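A minimal sketch of this criterion for a continuous target; the target values below are made-up assumptions purely for illustration.
import statistics

def node_variance(values):
    return statistics.pvariance(values) if len(values) > 1 else 0.0

parent = [4.8, 5.2, 5.5, 5.9, 6.1, 6.4]      # e.g. target values in a node
left, right = parent[:3], parent[3:]          # one candidate binary split

weighted_child_variance = (
    len(left) / len(parent) * node_variance(left)
    + len(right) / len(parent) * node_variance(right)
)
reduction = node_variance(parent) - weighted_child_variance
print(reduction)  # the split with the largest reduction in variance is chosen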
Key parameters of tree modelling, and how we can avoid overfitting in decision trees:
Overfitting is one of the key practical challenges faced while modeling decision trees. If no limit is set on the size of a decision tree, it will give you 100% accuracy on the training set, because in the worst case it will end up making one leaf for each observation. A model is considered to be overfitting when the algorithm continues to go deeper and deeper to reduce the training set error but ends up with an increased test set error, i.e. the accuracy of prediction for our model goes down. This generally happens when the tree builds many branches due to outliers and irregularities in the data. Thus, preventing overfitting is pivotal while modeling a decision tree, and it can be done in two ways:
1. Setting constraints on tree size
2. Tree pruning
Constraints on tree size are set through parameters such as the following (names as used in scikit-learn); a usage sketch follows the list.
min_samples_split
· Defines the minimum number of samples (or observations) required in a node for it to be considered for splitting.
· Used to control overfitting. Higher values prevent the model from learning relations which might be highly specific to the particular sample selected for a tree.
· Values that are too high can lead to underfitting, hence it should be tuned using cross-validation.
min_samples_leaf
· Defines the minimum number of samples (or observations) required in a terminal node or leaf.
· Used to control overfitting, similar to min_samples_split.
· Generally, lower values should be chosen for imbalanced class problems, because the regions in which the minority class is in the majority will be very small.
max_depth
· The maximum depth of a tree.
· Used to control overfitting, as a higher depth will allow the model to learn relations very specific to a particular sample.
max_leaf_nodes
· Can be defined in place of max_depth. Since binary trees are created, a depth of n would produce a maximum of 2^n leaves.
max_features
· The number of features to consider while searching for the best split. These will be randomly selected.
· As a rule of thumb, the square root of the total number of features works well, but we should check up to 30-40% of the total number of features.
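A minimal sketch of setting these constraints with scikit-learn's DecisionTreeClassifier; the dataset and the particular values are illustrative assumptions, not tuned recommendations.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = DecisionTreeClassifier(
    min_samples_split=20,   # a node needs at least 20 samples to be split
    min_samples_leaf=5,     # every leaf keeps at least 5 samples
    max_depth=4,            # cap depth (max_leaf_nodes could be used instead)
    max_features="sqrt",    # consider sqrt(n_features) features per split
    random_state=42,
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))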
Tree Pruning
The technique of setting constraints is a greedy approach. In other words, it checks for the best split instantaneously and moves forward until one of the specified stopping conditions is reached. For example, consider the following case when you're driving:
At this instant, you are the yellow car and you have two choices: take a left and overtake quickly, or keep moving in your present lane.
This is exactly the difference between a normal decision tree and pruning. A decision tree with constraints won't see the truck ahead and will adopt a greedy approach by taking a left. On the other hand, if we use pruning, we in effect look a few steps ahead and make a choice. The idea of pruning is:
1. We first grow the decision tree to a large depth.
2. Then we start at the bottom and start removing leaves which give us negative returns when compared from the top.
3. Suppose a split gives us a gain of, say, -10 (a loss of 10) and the next split on that node gives us a gain of 20. A simple decision tree will stop at the first split, but with pruning we will see that the overall gain is +10 and keep both splits.
When we need to build a model which is easy to explain to people, a decision tree model will always do better than a linear model; decision tree models are even simpler to interpret than linear regression.
From Tree to Rules:
· Build the tree
· Branch on each outcome of the test, and move the subset of examples satisfying that outcome to the corresponding child node
· Recurse on each child node
· Repeat until the leaves are "pure", i.e. contain examples from a single class, or "nearly pure", i.e. the majority of examples are from the same class
· Prune the tree
· Each path from the root to a leaf then yields one classification rule
Tree construction, at a high level:
· Build the tree breadth-first or depth-first, evaluating split points for all attributes at each node
· The critical steps are evaluating candidate splits on the dataset D and deciding when to stop
Pruning can be guided by a minimum message length criterion, MML(Hi) = Mlength(Hi) + Mlength(D|Hi), i.e. the cost of encoding the model Hi plus the cost of encoding the data D given that model; for each decision node, the options considered include converting it to a leaf or doing nothing (keeping the subtree).
Hunt's Method
· Example attributes: Refund (Yes, No), Marital Status (Single, Married, Divorced), Taxable Income
· Entropy is the measure of impurity of the sample set S: E(S) = -p+ log2(p+) - p- log2(p-), where p+ and p- are the proportions of positive and negative examples in S
Working with Decision Trees in Python:
#Import Library
from sklearn.tree import DecisionTreeClassifier  # use DecisionTreeRegressor for regression
# Assumed you have X (predictors) and y (target) for the training data set
# and x_test (predictors) for the test data set
model = DecisionTreeClassifier(criterion='gini')  # or criterion='entropy' for information gain
# Train the model using the training set and check the score
model.fit(X, y)
model.score(X, y)
#Predict output
predicted = model.predict(x_test)
Summary:
Not all problems can be solved with linear methods; the world is non-linear. It has been observed that tree-based models are able to map non-linearity effectively. Methods like decision trees, random forests and gradient boosting are popularly used in all kinds of data science problems.
The decision tree algorithm belongs to the family of supervised learning algorithms, and it can be used for solving both regression and classification problems. The general motive of using a decision tree is to create a training model which can be used to predict the class or value of the target variable by learning decision rules inferred from prior data (the training data). The primary challenge in a decision tree implementation is to identify which attributes to consider at the root node and at each level. Decision trees often mimic human-level thinking, so it is simple to understand the data and make good interpretations.
Dividing efficiently based on maximum information gain is key to a decision tree classifier. However, in the real world, with millions of rows of data, dividing into perfectly pure classes is practically not feasible (it may take a very long training time), so we stop splitting a node once certain parameters are fulfilled (for example, an impurity percentage). A decision tree is a classification strategy as opposed to a single fixed algorithm for classification. It takes a top-down approach and uses a divide-and-conquer method to arrive at a decision. We can have multiple leaf classes with this approach.
“When your values are clear to you, making decisions become easier.” — Roy E. Disney
Random Forest
The Random Forest Classifier
Random forest, like its name implies, consists of a large number of individual
decision trees that operate as an ensemble. Each individual tree in the random
forest spits out a class prediction and the class with the most votes becomes
our model’s prediction (see figure below).
The fundamental concept behind random forest is a simple but powerful one
— the wisdom of crowds. In data science speak, the reason that the random
forest model works so well is:
1. There needs to be some actual signal in our features, so that models built using those features do better than random guessing.
2. The predictions (and therefore the errors) made by the individual trees need to have low correlations with each other.
Imagine a game of chance with a positive expected value per dollar staked, where we bet a total of $100: in Game 1 we split the $100 across 100 plays, in Game 2 across 10 plays, and in Game 3 we bet the entire $100 on a single play. Which would you pick? The expected value of each game is the same.
What about the distributions? Let's visualize the results with a Monte Carlo simulation (we will run 10,000 simulations of each game type; for example, we will simulate 10,000 times the 100 plays of Game 1). Take a look at the chart on the left - now which game would you pick? Even though the expected values are the same, the outcome distributions are vastly different, going from positive and narrow (blue) to binary (pink).
Game 1 (where we play 100 times) offers up the best chance of making some
money — out of the 10,000 simulations that I ran, you make money in
97% of them! For Game 2 (where we play 10 times) you make money in 63%
of the simulations, a drastic decline (and a drastic increase in your probability
of losing money). And in Game 3, which we only play once, you make money in 60% of the simulations, as expected.
Probability of Making Money for Each Game
So even though the games share the same expected value, their outcome
distributions are completely different. The more we split up our $100 bet into
different plays, the more confident we can be that we will make money. As
mentioned previously, this works because each play is independent of the
other ones.
Random forest is the same — each tree is like one play in our game earlier. We
just saw how our chances of making money increased the more times we
played. Similarly, with a random forest model, our chances of making correct
predictions increase with the number of uncorrelated trees in our model.
If you would like to run the code for simulating the game yourself you can find
it on my GitHub here.
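In the meantime, here is a rough sketch of such a simulation. The payoff rule is my assumption purely for illustration: each play wins the staked amount with probability 0.6 and loses it with probability 0.4, with the $100 split evenly across the plays of each game.
import random

random.seed(42)

def simulate(n_plays, n_sims=10_000, bankroll=100):
    stake = bankroll / n_plays
    made_money = 0
    for _ in range(n_sims):
        profit = sum(stake if random.random() < 0.6 else -stake
                     for _ in range(n_plays))
        made_money += profit > 0
    return made_money / n_sims

for n_plays in (100, 10, 1):  # Game 1, Game 2, Game 3
    print(n_plays, simulate(n_plays))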
Notice that with bagging we are not subsetting the training data into smaller
chunks and training each tree on a different chunk. Rather, if we have a sample
of size N, we are still feeding each tree a training set of size N (unless specified
otherwise). But instead of the original training data, we take a random sample
of size N with replacement. For example, if our training data was [1, 2, 3, 4, 5, 6]
then we might give one of our trees the following list [1, 2, 2, 3, 6, 6]. Notice
that both lists are of length six and that “2” and “6” are both repeated in the
randomly selected training data we give to our tree (because we sample with
replacement).
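A minimal sketch of that bootstrap sampling step; drawing with replacement is the key detail, and the seed and toy data are arbitrary.
import random

random.seed(0)
training_data = [1, 2, 3, 4, 5, 6]
# Each tree gets a sample of the original size, drawn with replacement
bootstrap_sample = random.choices(training_data, k=len(training_data))
print(bootstrap_sample)  # duplicates (and omissions) are expected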
Now let’s take a look at our random forest. We will just examine two of the
forest’s trees in this example. When we check out random forest Tree 1, we find
that it can only consider Features 2 and 3 (selected randomly) for its node
splitting decision. We know from our traditional decision tree (in blue) that
Feature 1 is the best feature for splitting, but Tree 1 cannot see Feature 1 so it
is forced to go with Feature 2 (black and underlined). Tree 2, on the other hand,
can only see Features 1 and 3 so it is able to pick Feature 1.
And that, my dear reader, creates uncorrelated trees that buffer and protect
each other from their errors.
Conclusion
Random forests are a personal favorite of mine. Coming from the world of
finance and investments, the holy grail was always to build a bunch of
uncorrelated models, each with a positive expected return, and then put them
together in a portfolio to earn massive alpha (alpha = market beating returns).
Much easier said than done!
Random forest is the data science equivalent of that. Let's review one last time.
What's a random forest classifier? A random forest is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree, to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree.
What do we need in order for our random forest to make accurate class predictions?
1. We need features that have at least some predictive power. After all,
if we put garbage in then we will get garbage out.
2. The trees of the forest and more importantly their predictions
need to be uncorrelated (or at least have low correlations with each
other). While the algorithm itself via feature randomness tries to engineer
these low correlations for us, the features we select and the
hyper-parameters we choose will impact the ultimate correlations as well.
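The sketch below ties these two requirements to scikit-learn's RandomForestClassifier, where bagging and feature randomness map to the bootstrap and max_features parameters; the dataset and parameter values are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(
    n_estimators=200,     # more (hopefully uncorrelated) trees -> a more stable vote
    max_features="sqrt",  # feature randomness: each split sees a random subset of features
    bootstrap=True,       # bagging: each tree trains on a sample drawn with replacement
    random_state=42,
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))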
Continuing my series of blogs on Data Science and Machine Learning, I have so far briefly covered an introduction to Data Science, Python, Statistics, Machine Learning, Regression, Linear Regression, Logistic Regression and Decision Trees. In this sixth post of the series, I shall cover Boosting, an ensemble method.
Introduction to Boosting
A road to success is incomplete without any failures in life. Each failure teaches
you something new and makes you stronger at each phase. Each time you
make a mistake, it’s important to learn from it and try not to repeat it again.
Just as we sometimes develop life skills by learning from our mistakes, we can train our model to learn from its prediction errors and improve the model's predictions and overall performance. This is the most basic intuition behind the Boosting algorithm in Machine Learning.
How would you classify an email as SPAM or not? An initial approach would be to identify 'spam' and 'not spam' emails using rules such as the following. If:
1. The email has only one image file (a promotional image), it's SPAM
2. The email has only link(s), it's SPAM
3. The email body consists of a sentence like "You won prize money of $ xxxxxx", it's SPAM
4. The email is from our official domain, it's not SPAM
5. The email is from a known source, it's not SPAM
Above, we have defined multiple rules to classify an email as 'spam' or 'not spam'. But these rules individually are not strong enough to successfully classify an email, i.e. individually they are not powerful enough to label an email as 'spam' or 'not spam'. Therefore, these rules are called weak learners.
To convert these weak learners into a strong learner, we combine the prediction of each weak learner using methods like:
· Using an average or weighted average
· Considering the prediction which has the higher vote
For example, above we defined 5 weak learners. Out of these 5, 3 vote 'SPAM' and 2 vote 'Not SPAM'. In this case, by default, we consider the email to be SPAM because 'SPAM' has the higher vote count (3). A small sketch of this majority vote follows.
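A minimal sketch of that majority-vote combination, with five hypothetical weak-learner predictions for one email.
from collections import Counter

weak_predictions = ["SPAM", "SPAM", "SPAM", "Not SPAM", "Not SPAM"]
final_prediction = Counter(weak_predictions).most_common(1)[0][0]
print(final_prediction)  # -> "SPAM" (3 votes against 2)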
How do Boosting Algorithms work?
Boosting combines weak learners (base learners) to form a strong rule. To find a weak rule, we apply a base learning (ML) algorithm with a different distribution over the data. Each time the base learning algorithm is applied, it generates a new weak prediction rule. This is an iterative process, and after many iterations the boosting algorithm combines these weak rules into a single strong prediction rule.
For choosing the right distribution, the steps are:
Step 1: The base learner takes the initial distribution and assigns equal weight, or attention, to each observation.
Step 2: If there are prediction errors from the first base learning algorithm, we pay higher attention to the observations with prediction errors, and then apply the next base learning algorithm.
Step 3: Iterate Step 2 until the limit of the base learning algorithm is reached or a sufficiently high accuracy is achieved.
Finally, boosting combines the outputs of the weak learners to create a strong learner, which eventually improves the predictive power of the model. Boosting puts higher focus on examples which are mis-classified or have higher errors under the preceding weak rules.
Suppose we have a binary classification task. A weak learner has an error rate slightly less than 0.5 when classifying an object, i.e. the weak learner is only slightly better than a coin toss. A strong learner has an error rate close to 0. To convert weak learners into a strong learner, we take a family of weak learners, combine them and vote. This turns the family of weak learners into a strong learner. The idea here is that the family of weak learners should have minimal correlation between them.
Types of Boosting
The accuracy of a predictive model can be boosted in two ways: either by embracing feature engineering or by applying boosting algorithms straight away. Working with boosting algorithms is often preferred, as it takes less time and produces similar results.
AdaBoost (Adaptive Boosting)
AdaBoost combines multiple weak learners into a single strong learner. The weak learners in AdaBoost are decision trees with a single split, called decision stumps. When AdaBoost creates its first decision stump, all observations are weighted equally. To correct the previous errors, the observations that were incorrectly classified are then given more weight than the observations that were correctly classified. AdaBoost algorithms can be used for both classification and regression problems.
Box 2: Here, you can see that the three incorrectly predicted + (plus) points are drawn bigger compared to the rest of the data points. In this case, the second decision stump (D2) will try to predict them correctly. Now, a vertical line (D2) on the right side of this box has classified the three mis-classified + (plus) points correctly. But again, it has caused new mis-classification errors, this time for three - (minus) points. So again, we assign higher weights to the three - (minus) points and apply another decision stump.
Box 3: Here, the three - (minus) points are given higher weights. A decision stump (D3) is applied to predict these mis-classified observations correctly. This time a horizontal line is generated to classify the + (plus) and - (minus) points, based on the higher weights of the mis-classified observations.
Mostly, we use decision stumps with AdaBoost, but we can use any machine learning algorithm as the base learner if it accepts weights on the training data set. We can use AdaBoost algorithms for both classification and regression problems.
Advantages of AdaBoost:
· Simple to implement
· Even though the algorithm tries to fit every point, it does not overfit easily
Disadvantages of AdaBoost:
· Sensitive to noisy data and outliers
· May settle on a suboptimal solution, since the ensemble is built greedily
Python code:
# We have used a decision tree as the base estimator. We can use any ML
# learner as the base estimator, if it accepts sample weights
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100)
clf.fit(x_train, y_train)
You can tune the parameters to optimize the performance of the algorithm. The key parameters for tuning are the number of weak learners (n_estimators) and the contribution of each weak learner (learning_rate). You can also tune the parameters of the base learner to optimize its performance.
Gradient Boosting
In gradient boosting, each new weak learner is fitted to the residual errors of the ensemble built so far. By combining weak learner after weak learner, our final model is able to account for a lot of the error of the original model and reduces this error over time.
Disadvantages of GBM: the trees are trained sequentially, so training can be slow on large datasets, and the model can overfit if the number of trees and the learning rate are not tuned carefully.
Python code:
# Assuming scikit-learn's GradientBoostingClassifier and the same X_train, y_train
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
clf.fit(X_train, y_train)
XGBoost
XGBoost stands for eXtreme Gradient Boosting and is another, faster flavour of boosting. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. Gradient boosting machines are generally very slow to train because of sequential model training, and hence are not very scalable. XGBoost therefore focuses on computational speed and model performance. XGBoost provides:
· Out-of-core computing for very large datasets that don't fit into memory.
· Cache optimization of data structures and algorithms to make the best use of hardware.
XGBoost is similar to the gradient boosting algorithm, but it has a few tricks up its sleeve which make it stand out from the rest.
Features of XGBoost:
Newton Boosting
In XGBoost the trees can have a varying number of terminal nodes, and the leaf weights of trees that are calculated with less evidence are shrunk more heavily. Newton boosting uses the Newton-Raphson method of approximation, which provides a more direct route to the minimum than gradient descent. An extra randomisation parameter can be used to reduce the correlation between the trees; as seen previously, the lower the correlation among classifiers, the better our ensemble of classifiers will turn out. Generally, XGBoost is faster than plain gradient boosting, but gradient boosting has a wider range of applications.
Advantages of XGBoost (a usage sketch follows this list):
1. Regularization: XGBoost adds regularization terms to its objective, which helps to reduce overfitting.
2. Parallel Processing: But hang on, we know that boosting is a sequential process, so how can it be parallelized? We know that each tree can be built only after the previous one, so what stops us from making a single tree using all cores? I hope you get where I'm coming from. Check this link out to explore further.
3. High Flexibility: XGBoost allows users to define custom optimization objectives and evaluation criteria. This adds a whole new dimension to the model, and there is no limit to what we can do.
4. Tree Pruning: XGBoost grows trees up to a maximum depth and then prunes back splits beyond which there is no positive gain.
5. Built-in Cross-Validation: XGBoost allows the user to run a cross-validation at each iteration of the boosting process.
6. Continue on Existing Model: The user can start training an XGBoost model from the last iteration of a previous run. This can be of significant advantage in certain specific applications. The GBM implementation of sklearn also has this feature, so the two are even on this point.
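As a minimal usage sketch, the snippet below assumes the xgboost package's scikit-learn wrapper; the dataset and parameter values are illustrative assumptions, not tuned recommendations.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = XGBClassifier(
    n_estimators=200,    # number of boosted trees
    learning_rate=0.1,   # shrinkage applied to each tree's contribution
    max_depth=3,         # depth limit; deeper splits without gain are pruned back
    reg_lambda=1.0,      # L2 regularization on leaf weights
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))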
Summary:
Boosting is used primarily for reducing bias, and it can also reduce variance. It is a weighted approach: it iteratively learns multiple weak learners and combines them to form a final strong learner. The contribution of each weak learner to the strong learner is related to that learner's accuracy, and the observation weights are recalculated after each weak learner is added, so that misclassified observations gain weight.
Multiple boosting algorithms are available, such as AdaBoost, Gradient Boosting and XGBoost. Different boosting algorithms have their own advantages on different types of datasets. By tuning multiple algorithms over a wider range of input data, good classification or predictive models can be built.