
Trinh Khanh Ly 20213676


Table of Contents

CHAPTER 1. COMPARISON OF DIFFERENT TREES LIKE DECISION TREE AND EXTRA TREE
1.1. Decision tree
1.1.1. Overview
1.1.2. Attribute Selection Measures
1.1.3. How does the Decision Tree algorithm Work?
1.1.4. Advantages
1.1.5. Disadvantages
1.2. Extra tree
1.2.1. Overview
1.2.2. How does the Extra Tree algorithm Work?
1.3. Summary of differences between trees like Decision Tree and Extra Tree
CHAPTER 2. COMPARISON OF DIFFERENT BOOSTING TECHNIQUES LIKE ADABOOST AND XGBOOST
2.1. AdaBoost
2.1.1. Overview
2.1.2. The Working of the AdaBoost Algorithm
2.1.3. Advantages of AdaBoost
2.1.4. Disadvantages of AdaBoost
2.2. XGBoost
2.2.1. Overview
2.2.2. Additive Training
2.3. Summary of differences between boosting techniques like AdaBoost and XGBoost

Table of Figures

Figure 1. Simple Decision tree
Figure 2. Flow chart of the Extra-Trees model
Figure 3. Flowchart of AdaBoost
Figure 4. XGBoost algorithm structure
Figure 5. Example of XGBoost

CHAPTER 1. COMPARISON OF DIFFERENT TREES LIKE DECISION TREE
AND EXTRA TREE
1.1. Decision tree
1.1.1. Overview
A decision tree is one of the most powerful supervised learning algorithms, used for both classification and regression tasks. It builds a flowchart-like tree structure
where each internal node denotes a test on an attribute, each branch represents an
outcome of the test, and each leaf node (terminal node) holds a class label. It is
constructed by recursively splitting the training data into subsets based on the values of
the attributes until a stopping criterion is met, such as the maximum depth of the tree
or the minimum number of samples required to split a node.
During training, the Decision Tree algorithm selects the best attribute to split the data
based on a metric such as entropy or Gini impurity, which measures the level of
impurity or randomness in the subsets. The goal is to find the attribute that maximizes
the information gain or the reduction in impurity after the split.
As a simple example, suppose that, based on the weather, a group of boys decides whether or not to play football.
The initial characteristics are:
- Weather
- Humidity
- Wind
Based on the above information, you can build the model as follows:

Figure 1. Simple Decision tree


Based on the above model, we can see the following: if the weather is sunny and the humidity is normal, the boys are likely to play football; if it is sunny and the humidity is high, they are likely not to play.
1.1.2. Attribute Selection Measures
Construction of Decision Tree: A tree can be “learned” by splitting the source set into
subsets based on Attribute Selection Measures. Attribute selection measure (ASM) is a
criterion used in decision tree algorithms to evaluate the usefulness of different
attributes for splitting a dataset. The goal of ASM is to identify the attribute that will
create the most homogeneous subsets of data after the split, thereby maximizing the
information gain. This process is repeated on each derived subset in a recursive manner
called recursive partitioning. The recursion is complete when all instances in the subset at a node have the same value of the target variable, or when splitting no longer adds value to the predictions. The construction of a decision tree classifier does not require any domain
knowledge or parameter setting and therefore is appropriate for exploratory knowledge
discovery. Decision trees can handle high-dimensional data.
Entropy:
Entropy is a measure of the degree of randomness or uncertainty in a dataset. In the case of classification, it measures the randomness based on the distribution of class labels in the dataset. The entropy of a subset S of the original dataset, with set of classes C, is defined as:
\mathrm{Entropy}(S) = -\sum_{c \in C} p(c)\,\log_2 p(c)

• S is the data set for which entropy is calculated
• c represents a class in the set of classes C
• p(c) is the proportion of data points in S that belong to class c
- The entropy is 0 when the dataset is completely homogeneous, meaning that every instance belongs to the same class. This is the lowest possible entropy, indicating no uncertainty in the dataset sample.
- When the dataset is equally divided between multiple classes, the entropy is at its
maximum value. Therefore, entropy is highest when the distribution of class labels
is even, indicating maximum uncertainty in the dataset sample.
- Entropy is used to evaluate the quality of a split. The goal is to select the attribute that minimizes the entropy of the resulting subsets, splitting the dataset into subsets that are more homogeneous with respect to the class labels.

- The attribute with the highest information gain (i.e., the largest reduction in entropy after splitting on that attribute) is chosen as the splitting criterion, and the process is repeated recursively to build the decision tree.
Gini Impurity or Gini Index:
Gini Impurity is a score that evaluates how good a split is among the classified groups. It takes values between 0 and 1, where 0 means that all observations belong to a single class and values close to 1 indicate that the elements are randomly distributed across the classes. A lower Gini Index is therefore better, and it is a common criterion for evaluating splits in a decision tree model.

\mathrm{Gini\ Impurity} = 1 - \sum_i p_i^2

• p_i is the proportion of elements in the set that belong to the i-th class.
Information Gain:
Information gain measures the reduction in entropy or variance that results from
splitting a dataset based on a specific property. It is used in decision tree algorithms to
determine the usefulness of a feature by partitioning the dataset into more homogeneous
subsets with respect to the class labels or target variable. The higher the information
gain, the more valuable the feature is in predicting the target variable.
\mathrm{Information\ gain}(H, A) = \mathrm{Entropy}(H) - \sum_{v \in \mathrm{Values}(A)} \frac{|H_v|}{|H|}\,\mathrm{Entropy}(H_v)

• A is the attribute being evaluated.
• H is the dataset sample and |H| is its number of instances.
• H_v is the subset of H in which attribute A takes the value v, and |H_v| is its number of instances.
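As an illustration, the following minimal Python sketch computes the three measures defined above. The function names and the toy weather data are placeholders introduced here for illustration, not taken from the report.

# Minimal sketch of the three attribute selection measures discussed above.
import numpy as np

def entropy(labels):
    """Entropy(S) = -sum_c p(c) * log2 p(c)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini_impurity(labels):
    """Gini = 1 - sum_i p_i^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(labels, attribute_values):
    """IG(H, A) = Entropy(H) - sum_v |H_v|/|H| * Entropy(H_v)."""
    total = len(labels)
    weighted = 0.0
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]
        weighted += len(subset) / total * entropy(subset)
    return entropy(labels) - weighted

# Toy example: "play" decision split by the "weather" attribute.
play = np.array(["yes", "yes", "no", "no", "yes", "no"])
weather = np.array(["sunny", "overcast", "rain", "sunny", "overcast", "rain"])
print(entropy(play), gini_impurity(play), information_gain(play, weather))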
1.1.3. How does the Decision Tree algorithm Work?
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets corresponding to the possible values of the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; these final nodes are called leaf nodes. This is the procedure followed by the Classification and Regression Tree (CART) algorithm.
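A minimal sketch of these steps using scikit-learn's DecisionTreeClassifier is shown below; the Iris dataset and the hyperparameter values are placeholders chosen only for illustration.

# Sketch of the training procedure above using scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="entropy" selects splits by information gain; "gini" is the default.
# max_depth and min_samples_split act as the stopping criteria mentioned in 1.1.1.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                              min_samples_split=5, random_state=0)
tree.fit(X_train, y_train)

print(export_text(tree))                 # the learned flowchart-like structure
print("accuracy:", tree.score(X_test, y_test))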
1.1.4. Advantages
- Easy to interpret
- Little to no data preparation required
- Flexible: can handle both classification and regression tasks
1.1.5. Disadvantages
- Prone to overfitting, which can be resolved using the Random Forest algorithm.
- High-variance estimators: small changes in the data can produce a very different tree
- More costly to train than simpler models, since the tree is grown by a greedy search over candidate splits
1.2. Extra tree
1.2.1. Overview
Extremely Randomized Trees Classifier (Extra Trees Classifier) is an ensemble learning technique that aggregates the results of multiple de-correlated decision trees collected in a "forest" to output its classification result. In concept, it is very similar to a Random Forest Classifier and only differs from it in the manner in which the decision trees in the forest are constructed.

Figure 2. Flow chart of the Extra-Trees model.


1.2.2. How does the Extra Tree algorithm Work?
Each tree is constructed by evaluating a random subset of the features at each partition. This randomness introduces variability among the individual trees, reducing the risk of overfitting and improving overall prediction performance. At prediction time, the algorithm aggregates the results of all trees, either by voting (for classification tasks) or by averaging (for regression tasks). This collaborative decision-making process, supported by multiple trees, produces stable and accurate results.
Each decision tree in the Extra Trees forest is constructed from the original training sample. Then, at each test node, each tree is provided with a random sample of k features from the feature set, from which it must select the best feature to split the data according to some mathematical criterion (typically the Gini Index). This random sampling of features leads to the creation of multiple de-correlated decision trees.
To perform feature selection using the above forest structure, the normalized total reduction in the splitting criterion contributed by each feature (the Gini Index, if the Gini Index is used in the construction of the forest) is computed during the construction of the forest. This value is called the Gini Importance of the feature. The features are then ordered in descending order of Gini Importance and the user selects the top k features. The Information Gain formula from Section 1.1.2 can be used as the splitting criterion instead of the Gini Index. A code sketch of this feature-selection procedure is given below.
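The sketch assumes scikit-learn's ExtraTreesClassifier and its feature_importances_ attribute, which stores the normalized impurity reduction described above; the dataset and the choice of top_k are placeholders.

# Sketch of Extra Trees training and Gini-Importance-based feature selection.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier

X, y = load_iris(return_X_y=True)

# n_estimators trees; each split is chosen among random thresholds of a random
# subset of candidate features (criterion="gini" by default).
forest = ExtraTreesClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)

# feature_importances_ holds the normalized total impurity reduction
# (the "Gini Importance") accumulated by each feature across the forest.
top_k = 2
ranked = np.argsort(forest.feature_importances_)[::-1]
print("selected feature indices:", ranked[:top_k])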

1.3. Summary of differences between trees like Decision Tree and Extra Tree
Decision Trees and Extra Trees are two popular algorithms in the field of machine
learning, particularly in classification and prediction tasks. In this report, we conducted
a comparative analysis of these two algorithms based on several key factors.
Firstly, consider the tree construction algorithm employed by each method. Decision Trees construct trees by selecting, at each node, the attribute that optimizes classification or prediction, typically using algorithms like ID3, C4.5, or CART. Extra Trees, on the other hand, build their trees by randomly selecting attributes and splitting thresholds without this optimization.
Secondly, the process of creating the individual trees highlights the difference in approach between the two. While Decision Trees select the best attribute for splitting at each node based on specific criteria, Extra Trees create their trees by randomly selecting attributes and thresholds, resulting in faster tree creation.
Next, consider the ability of both algorithms to combat overfitting. Decision Trees are susceptible to overfitting, especially when grown deep, whereas Extra Trees tend to overfit less due to the use of many randomized trees and the elimination of optimized splitting rules.
Additionally, we considered the training time of each algorithm. Decision Trees may
have longer training times due to the need for optimization, whereas Extra Trees
typically have faster training times as a result of random tree creation.
Finally, the accuracy of both algorithms depends on the specific dataset and training approach. While Extra Trees may offer better accuracy in some cases due to their ability to reduce overfitting, this is not always guaranteed and varies with the data.
In conclusion, both Decision Trees and Extra Trees have their strengths and
weaknesses, and the choice between them depends on the specific requirements of the
problem at hand. Further experimentation and evaluation on specific datasets would be
necessary to determine which algorithm is more suitable for a given task.

CHAPTER 2. COMPARISON OF DIFFERENT BOOSTING TECHNIQUES LIKE ADABOOST AND XGBOOST
Boosting is an ensemble modeling technique that attempts to build a strong classifier from a number of weak classifiers by combining weak models in series. First, a model is built from the training data. Then a second model is built that tries to correct the errors of the first model. This procedure continues, adding models until either the complete training data set is predicted correctly or the maximum number of models is reached. A toy sketch of this idea is shown below.
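The following toy sketch illustrates the "models in series" idea by fitting each new weak model to the residual errors left by the ensemble built so far. It is a conceptual, gradient-boosting-style illustration only, not AdaBoost or XGBoost themselves; the synthetic data and hyperparameters are arbitrary.

# Toy illustration of boosting "in series": each new weak model is fitted to
# the errors (residuals) of the ensemble built so far.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

n_models, learning_rate = 50, 0.1
prediction = np.zeros_like(y)
models = []
for _ in range(n_models):
    residual = y - prediction                  # errors left by the current ensemble
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    models.append(stump)
    prediction += learning_rate * stump.predict(X)

print("mean squared error:", np.mean((y - prediction) ** 2))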
2.1. AdaBoost
2.1.1. Overview
The AdaBoost algorithm, short for Adaptive Boosting, is a boosting technique used as an ensemble method in machine learning. It is called Adaptive Boosting because the weights are re-assigned to each instance, with higher weights assigned to incorrectly classified instances. It is mainly used for classification, and the base learner (the machine learning algorithm that is boosted) is usually a decision tree with only one level, also called a "stump".
The algorithm first builds a model giving equal weight to all the data points. It then assigns higher weights to the points that were wrongly classified, so that these points receive more importance in the next model. It keeps training models in this way until the error becomes sufficiently low or a maximum number of models is reached.

Figure 3. Flowchart of AdaBoost

2.1.2. The Working of the AdaBoost Algorithm


- Step 1: Assigning Weights
These data points will be assigned some weights. Initially, all the weights will be equal.
The formula to calculate the sample weights is:
w(x_i, y_i) = \frac{1}{N}, \quad i = 1, 2, \dots, N

where N is the total number of data points.
- Step 2: Classify the Samples
Create a decision stump for each of the features and then calculate the Gini Index of
each tree. The tree with the lowest Gini Index will be the first stump.
- Step 3: Calculate the Influence
Calculate the “Amount of Say” or “Importance” or “Influence” for this classifier in
classifying the data points using this formula:
\alpha = \frac{1}{2} \log\left(\frac{1 - \mathrm{Total\ Error}}{\mathrm{Total\ Error}}\right)
The total error is nothing but the summation of all the sample weights of misclassified
data points.
Note: Total error will always be between 0 (Indicates perfect stump) and 1 (indicates
horrible stump).
- Step 4: Calculate TE and Performance
We now calculate the Total Error (TE) and performance of the AdaBoost stump and use them to update the weights. This step is crucial: if identical weights were kept for the subsequent model, its output would simply mirror that of the initial model. The wrong predictions are given more weight, whereas the weights of the correct predictions are decreased.
After finding the importance of the classifier and total error, we need to finally update
the weights, and for this, we use the following formula:
\mathrm{New\ sample\ weight} = \mathrm{old\ weight} \times e^{\pm \alpha}

The amount of say (α) is negative when the sample is correctly classified, and positive when the sample is misclassified.
- Step 5: Decrease Errors
Make a new dataset to see if the errors decreased or not. For this, we will remove the
“sample weights” and “new sample weights” columns and then, based on the “new
sample weights,” divide our data points into buckets.
- Step 6: New Dataset
The algorithm then selects random numbers between 0 and 1. Since incorrectly classified records have higher sample weights, the probability of selecting those records is very high, so they are more likely to appear in the new dataset.
- Step 7: Repeat Previous Steps
This new dataset now acts as our training data, and we repeat all the above steps. A compact reweighting sketch of this loop is given below.
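The following minimal sketch implements the steps above in their equivalent reweighting form (rather than the bucket-resampling of Steps 5-7), using one-level scikit-learn trees as stumps. Labels are assumed to be in {-1, +1} and the hyperparameters are arbitrary.

# Minimal AdaBoost reweighting sketch of Steps 1-7.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
y = np.where(y == 0, -1, 1)

n_rounds = 20
N = len(y)
weights = np.full(N, 1.0 / N)          # Step 1: equal initial weights w_i = 1/N
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1)      # Step 2: one decision stump
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    total_error = weights[pred != y].sum()           # sum of misclassified weights
    alpha = 0.5 * np.log((1 - total_error) / (total_error + 1e-10))  # Step 3: amount of say

    # Step 4: increase weights of wrong predictions, decrease correct ones,
    # then renormalize (this replaces the bucket resampling of Steps 5-7).
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: weighted vote of all stumps.
ensemble = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("training accuracy:", np.mean(ensemble == y))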
2.1.3. Advantages of AdaBoost
- It is easy to use, with less need for parameter tuning than algorithms such as SVM. Theoretically, AdaBoost is not prone to overfitting, though there is no concrete proof of this. This may be because its parameters are not jointly optimized: stage-wise estimation slows down the learning process.
- AdaBoost can be used to improve the accuracy of weak classifiers, which makes it flexible. It has now been extended beyond binary classification and has found use cases in text and image classification as well.
2.1.4. Disadvantages of AdaBoost
- Since the boosting technique learns progressively, it is important to ensure that you have quality data. AdaBoost is also extremely sensitive to noisy data and outliers, so if you plan to use AdaBoost it is highly recommended to eliminate them first.
- AdaBoost has also been proven to be slower than XGBoost.
2.2. XGBoost
2.2.1. Overview
XGBoost stands for “Extreme Gradient Boosting”, where the term “Gradient
Boosting” originates from the paper Greedy Function Approximation: A Gradient
Boosting Machine, by Friedman. It is the gold standard in ensemble learning, especially
when it comes to gradient-boosting algorithms. It develops a series of weak learners
one after the other to produce a reliable and accurate predictive model. Fundamentally,
XGBoost builds a strong predictive model by aggregating the predictions of several
weak learners, usually decision trees. It uses a boosting technique to create an extremely
accurate ensemble model by having each weak learner after it correct the mistakes of
its predecessors.
The optimization method (gradient) minimizes a cost function by repeatedly changing
the model’s parameters in response to the gradients of the errors. The algorithm also
presents the idea of “gradient boosting with decision trees,” in which the objective
function is reduced by calculating the importance of each decision tree that is added to
the ensemble in turn. By adding a regularization term and utilizing a more advanced optimization algorithm, XGBoost goes one step further and improves accuracy and efficiency.

Figure 4. XGBoost algorithm structure


Mathematically, we can write our model in the form:
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}

where K is the number of trees, f_k is a function in the functional space \mathcal{F}, and \mathcal{F} is the set of all possible CARTs. The objective function to be optimized is given by:

\mathrm{obj}(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \omega(f_k)

where \omega(f_k) is the complexity of the tree f_k, which acts as a regularization term.
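As an illustration of this objective, the following minimal sketch (assuming the xgboost Python package and a placeholder scikit-learn dataset) fits an ensemble of K trees: n_estimators plays the role of K, while reg_lambda and reg_alpha contribute to the complexity penalty ω(f_k). The hyperparameter values are arbitrary.

# Sketch of fitting the K-tree ensemble with regularization using xgboost.
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=100,      # K: the number of trees f_k in the ensemble
    max_depth=3,
    learning_rate=0.1,
    reg_lambda=1.0,        # contributes to the complexity penalty omega(f_k)
    reg_alpha=0.0,
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))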
2.2.2. Additive Training
Learning the tree structure is much harder than a traditional optimization problem where you can simply take the gradient, and it is intractable to learn all the trees at once. Instead, we use an additive strategy: fix what we have learned and add one new tree at a time. Writing the prediction value at step t as \hat{y}_i^{(t)}, we have:

\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)

Using mean squared error (MSE) as our loss function, the objective at step t becomes:

\mathrm{obj}^{(t)} = \sum_{i=1}^{n} \left(y_i - (\hat{y}_i^{(t-1)} + f_t(x_i))\right)^2 + \sum_{k=1}^{t} \omega(f_k)
            = \sum_{i=1}^{n} \left[2(\hat{y}_i^{(t-1)} - y_i)\, f_t(x_i) + f_t(x_i)^2\right] + \omega(f_t) + \mathrm{constant}
The form of MSE is friendly, with a first-order term (usually called the residual) and a quadratic term. For other losses of interest (for example, logistic loss), it is not so easy to get such a nice form. So in the general case, we take the second-order Taylor expansion of the loss:

\mathrm{obj}^{(t)} \approx \sum_{i=1}^{n} \left[l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i)\right] + \omega(f_t) + \mathrm{constant}

where g_i = \partial_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)}) and h_i = \partial^2_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)}) are the first- and second-order derivatives of the loss.
This becomes our optimization goal for the new tree. One important advantage of this definition is that the value of the objective function only depends on g_i and h_i. This is how XGBoost supports custom loss functions. We can optimize every loss function, including logistic regression and pairwise ranking, using exactly the same solver that takes g_i and h_i as input!
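To illustrate this point, the following sketch (assuming the xgboost Python package) re-implements squared error as a custom objective that returns the per-example g_i and h_i. The function name squared_error and the dataset are placeholders; the built-in "reg:squarederror" objective would behave equivalently.

# Sketch of a custom objective: XGBoost only needs g_i and h_i of the loss.
import numpy as np
import xgboost as xgb
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
dtrain = xgb.DMatrix(X, label=y)

def squared_error(predt, dtrain):
    """Return (g_i, h_i) for l(y, yhat) = (y - yhat)^2."""
    label = dtrain.get_label()
    grad = 2.0 * (predt - label)          # g_i: first derivative of the loss
    hess = np.full_like(predt, 2.0)       # h_i: second derivative of the loss
    return grad, hess

booster = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain,
                    num_boost_round=50, obj=squared_error)
print("RMSE:", np.sqrt(np.mean((booster.predict(dtrain) - y) ** 2)))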

Figure 5. Example of XGBoost


If all this sounds a bit complicated, let's take a look at Figure 5 and see how the scores can be calculated. Basically, for a given tree structure, we push the statistics g_i and h_i to the leaves they belong to, sum the statistics together, and use the formula to calculate how good the tree is. This score is like the impurity measure in a decision tree, except that it also takes the model complexity into account.

2.3. Summary of differences between boosting techniques like AdaBoost and XGBoost
- Year: AdaBoost was introduced in 1995; XGBoost in 2014.
- Performance: AdaBoost is simpler and may suffer from overfitting if not properly tuned, and it does not use parallel processing. XGBoost outperforms AdaBoost thanks to its more sophisticated algorithms and optimizations, especially on complex, high-dimensional datasets, and it supports parallel processing.
- Speed and Scalability: AdaBoost is slower than XGBoost, especially when dealing with large datasets or high-dimensional feature spaces. XGBoost is highly scalable and efficient thanks to its parallelized implementation and optimization techniques.
- Robustness to Noise: AdaBoost is sensitive to noisy data and outliers since it tries to correct misclassifications from previous weak learners. XGBoost is more robust to noise and outliers due to its regularization techniques, such as tree pruning and column subsampling.
- Flexibility: AdaBoost is versatile and can be used with different base learners, such as decision stumps or decision trees. XGBoost supports various base learners and objective functions, offering more flexibility in model customization and tuning.
Both AdaBoost and XGBoost are powerful boosting algorithms with distinct
advantages and trade-offs. AdaBoost is simple, interpretable, and effective in certain
scenarios, while XGBoost excels in performance, scalability, and robustness to noise.
The choice between the two depends on the specific requirements of the problem, such
as dataset size, complexity, interpretability needs, and computational resources.
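As a rough illustration of this trade-off, the sketch below fits scikit-learn's AdaBoostClassifier and xgboost's XGBClassifier on the same synthetic dataset and compares accuracy and training time; actual results depend heavily on the data and hyperparameters, so this is indicative only.

# Side-by-side check of the two boosters on one synthetic dataset.
import time
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [
    ("AdaBoost", AdaBoostClassifier(n_estimators=200, random_state=0)),
    ("XGBoost", xgb.XGBClassifier(n_estimators=200, max_depth=3, random_state=0)),
]:
    start = time.perf_counter()
    model.fit(X_train, y_train)
    print(f"{name}: accuracy={model.score(X_test, y_test):.3f}, "
          f"train time={time.perf_counter() - start:.2f}s")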
