Trinh Khanh Ly 20213676
Table of Figures
Figure 1. Simple decision tree
Figure 2. Flow chart of the Extra-Trees model
Figure 3. Flowchart of AdaBoost
Figure 4. XGBoost algorithm structure
CHAPTER 1. COMPARISON OF TREE-BASED MODELS: DECISION TREE AND EXTRA TREES
1.1. Decision tree
1.1.1. Overview
A decision tree is one of the most powerful and widely used supervised learning algorithms for both classification and regression tasks. It builds a flowchart-like tree structure where each internal node denotes a test on an attribute, each branch represents an outcome of that test, and each leaf (terminal) node holds a class label. The tree is constructed by recursively splitting the training data into subsets based on attribute values until a stopping criterion is met, such as the maximum depth of the tree or the minimum number of samples required to split a node.
During training, the Decision Tree algorithm selects the best attribute to split the data
based on a metric such as entropy or Gini impurity, which measures the level of
impurity or randomness in the subsets. The goal is to find the attribute that maximizes
the information gain or the reduction in impurity after the split.
As a simple decision tree example, suppose that, based on the weather, a group of boys must decide whether or not to play soccer.
The initial characteristics are:
- Weather
- Humidity
- Wind
Based on the above information, the model can be built as follows:
Figure 1. Simple decision tree
If the weather is sunny and the humidity is normal, the boys are very likely to go and play soccer. If it is sunny and the humidity is high, they are likely not to play.
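As a small illustration, the decision logic described above can be written as plain if/else rules. This is a hypothetical hand-coded rendering of the tree in Figure 1; the overcast and rainy branches are assumptions, since the text only specifies the sunny cases.

```python
def will_play_soccer(weather: str, humidity: str, wind: str) -> bool:
    """Hand-coded toy decision tree for the weather example above."""
    if weather == "sunny":
        return humidity == "normal"   # play only when humidity is normal
    elif weather == "overcast":
        return True                   # assumed branch: overcast -> always play
    else:  # rainy
        return wind == "weak"         # assumed branch: play only if wind is weak


print(will_play_soccer("sunny", "normal", "weak"))  # True
print(will_play_soccer("sunny", "high", "weak"))    # False
```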
1.1.2. Attribute Selection Measures
Construction of Decision Tree: A tree can be “learned” by splitting the source set into
subsets based on Attribute Selection Measures. Attribute selection measure (ASM) is a
criterion used in decision tree algorithms to evaluate the usefulness of different
attributes for splitting a dataset. The goal of ASM is to identify the attribute that will
create the most homogeneous subsets of data after the split, thereby maximizing the
information gain. This process is repeated on each derived subset in a recursive manner
called recursive partitioning. The recursion is complete when all samples in the subset at a node have the same value of the target variable, or when splitting no longer adds value to the
predictions. The construction of a decision tree classifier does not require any domain
knowledge or parameter setting and therefore is appropriate for exploratory knowledge
discovery. Decision trees can handle high-dimensional data.
Entropy:
Entropy is a measure of the degree of randomness or uncertainty in the dataset. In the case of classification, it measures the randomness based on the distribution of class labels in the dataset.
The entropy of a subset of the original dataset containing K classes, evaluated at the ith node, can be defined as:

H_i = -\sum_{k=1}^{K} p_k \log_2(p_k)

where p_k is the proportion of samples in the node that belong to class k.
- The attribute with the highest information gain (i.e., the greatest reduction in entropy after splitting on that attribute) is chosen as the splitting criterion, and the process is repeated recursively to build the decision tree.
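As a small illustration (a sketch, not a library API), entropy can be computed from a list of class labels as follows:

```python
import numpy as np

def entropy(labels):
    """Entropy H = -sum_k p_k * log2(p_k) of the class distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

print(entropy(["yes", "no", "yes", "no"]))   # 1.0 (maximally mixed node)
print(entropy(["yes", "yes", "yes", "no"]))  # ~0.811 (mostly one class)
```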
Gini Impurity or Index:
Gini impurity is a score that evaluates how good a split is among the classified groups. It takes values in the range between 0 and 1, where 0 means that all observations belong to one class and values close to 1 mean that the elements are randomly distributed across many classes. We therefore want the Gini index to be as low as possible. The Gini index is the evaluation metric we shall use to evaluate our decision tree model, and for a set with C categories it is computed as

Gini = 1 - \sum_{i=1}^{C} p_i^2

• p_i is the proportion of elements in the set that belong to the ith category.
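A similar sketch for the Gini impurity of a set of labels (again an illustration, not a library function):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity = 1 - sum_i p_i^2 over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

print(gini_impurity(["yes", "yes", "yes", "yes"]))  # 0.0 (pure node)
print(gini_impurity(["yes", "no", "yes", "no"]))    # 0.5 (50/50 split)
```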
Information Gain:
Information gain measures the reduction in entropy or variance that results from
splitting a dataset based on a specific property. It is used in decision tree algorithms to
determine the usefulness of a feature by partitioning the dataset into more homogeneous
subsets with respect to the class labels or target variable. The higher the information
gain, the more valuable the feature is in predicting the target variable.
Information\ gain(S, A) = H(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} \, H(S_v)

• A is the specific attribute the dataset is split on.
• H(S) is the entropy of the dataset sample S and |S| is its number of instances.
• S_v is the subset of instances of S that have the value v for attribute A, |S_v| is its size, and H(S_v) is its entropy.
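Combining the two, here is a hedged sketch of information gain for a categorical attribute; the toy weather/play data is an assumption made up for this example:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    """Entropy of the parent set minus the weighted entropy of the subsets
    obtained by splitting on each distinct value of the feature."""
    feature_values = np.asarray(feature_values)
    labels = np.asarray(labels)
    parent_entropy = entropy(labels)
    weighted_child_entropy = 0.0
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        weighted_child_entropy += len(subset) / len(labels) * entropy(subset)
    return float(parent_entropy - weighted_child_entropy)

# Toy version of the weather example: splitting on "weather" separates the
# always-play overcast days, so the split removes a lot of uncertainty.
weather = ["sunny", "sunny", "overcast", "rainy", "overcast", "rainy"]
play    = ["no",    "no",    "yes",      "yes",   "yes",      "no"]
print(information_gain(weather, play))  # about 0.67 bits
```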
1.1.3. How does the Decision Tree algorithm Work?
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node that contains the best attribute.
Step-5: Recursively build new subtrees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be split any further; these final nodes are called leaf nodes.
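In practice these steps are usually delegated to a library. A minimal sketch using scikit-learn's DecisionTreeClassifier on a toy, integer-encoded version of the weather example (the encoding and labels below are assumptions):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy, integer-encoded version of the weather example:
# weather: 0 = sunny, 1 = overcast, 2 = rainy
# humidity: 0 = normal, 1 = high
# wind: 0 = weak, 1 = strong
X = [[0, 0, 0], [0, 1, 0], [1, 0, 1], [1, 1, 0], [2, 0, 0], [2, 1, 1]]
y = [1, 0, 1, 1, 1, 0]  # 1 = play, 0 = do not play

# criterion="entropy" makes the tree use information gain as its ASM
# (the scikit-learn default is "gini").
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X, y)

print(export_text(clf, feature_names=["weather", "humidity", "wind"]))
print(clf.predict([[0, 0, 0]]))  # sunny, normal humidity, weak wind
```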
1.1.4. Advantages
- Easy to interpret
- Little to no data preparation required
- Flexible: can be used for both classification and regression tasks with numerical or categorical features
1.1.5. Disadvantages
- Prone to overfitting, which can be mitigated by using the Random Forest algorithm.
- High-variance estimators: small changes in the training data can produce very different trees.
- More costly to train than simpler models.
1.2. Extra Trees
1.2.1. Overview
The Extremely Randomized Trees Classifier (Extra Trees Classifier) is a type of ensemble learning technique which aggregates the results of multiple de-correlated decision trees collected in a "forest" to output its classification result. In concept, it is very similar to a Random Forest Classifier and differs from it only in the way the decision trees in the forest are constructed: each tree is typically grown on the whole training sample rather than on a bootstrap replica, and at each node the candidate split thresholds are drawn at random instead of being optimized.
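A minimal sketch contrasting the two ensembles on a synthetic dataset (the dataset parameters below are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Same ensemble size for both; the conceptual difference lies only in how
# each tree's splits are chosen during construction.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
et = ExtraTreesClassifier(n_estimators=100, random_state=0)

print("Random Forest:", cross_val_score(rf, X, y, cv=5).mean())
print("Extra Trees:  ", cross_val_score(et, X, y, cv=5).mean())
```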
1.3. Summary of differences between Decision Tree and Extra Trees
Decision Trees and Extra Trees are two popular algorithms in the field of machine
learning, particularly in classification and prediction tasks. In this report, we conducted
a comparative analysis of these two algorithms based on several key factors.
Firstly, consider the tree-construction algorithm employed by each method. Decision Trees construct trees by selecting, at each node, the attribute that best optimizes the classification or prediction criterion, typically using algorithms such as ID3, C4.5, or CART. Extra Trees, on the other hand, build their trees by randomly selecting attributes and splitting thresholds, without this optimization.
Regarding the process of creating the individual trees, the difference in approach is clear: while Decision Trees select the best attribute for splitting at each node according to a specific criterion, Extra Trees pick attributes and thresholds at random, which makes tree creation faster.
Next, consider the ability of both algorithms to combat overfitting. Decision Trees are susceptible to overfitting, especially when grown deep, whereas Extra Trees tend to overfit less because they average many randomized trees and drop the optimized splitting rules.
Additionally, we considered the training time of each algorithm. Decision Trees may take longer to grow each tree because every split has to be optimized, whereas Extra Trees grow each tree faster as a result of random split selection.
Finally, we compared the accuracy of both algorithms, noting that it depends on the specific dataset and training approach. While Extra Trees may offer better accuracy in some cases thanks to their reduced overfitting, this is not always guaranteed and varies with many factors.
In conclusion, both Decision Trees and Extra Trees have their strengths and
weaknesses, and the choice between them depends on the specific requirements of the
problem at hand. Further experimentation and evaluation on specific datasets would be
necessary to determine which algorithm is more suitable for a given task.
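To make such an evaluation concrete, a small experiment along the following lines could be run. The dataset is synthetic and the numbers will vary; note also that an Extra Trees ensemble fits many trees, so its total training time can still exceed that of a single decision tree even though each individual tree is cheaper to grow.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [
    ("Decision Tree", DecisionTreeClassifier(random_state=0)),
    ("Extra Trees  ", ExtraTreesClassifier(n_estimators=100, random_state=0)),
]:
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_time = time.perf_counter() - start
    accuracy = model.score(X_test, y_test)
    print(f"{name}: accuracy={accuracy:.3f}, fit time={fit_time:.3f}s")
```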
Figure 3. Flowchart of AdaBoost
The amount of say (alpha) is applied with a negative sign when a sample is correctly classified and with a positive sign when it is misclassified, so the weights of misclassified samples are increased while the weights of correctly classified samples are decreased.
- Step 5: Decrease Errors
Construct a new dataset to check whether the errors have decreased. For this, the normalized "new sample weights" are used to divide the data points into buckets (cumulative weight ranges), so that each record occupies a slice proportional to its weight.
- Step 6: New Dataset
The algorithm then draws random numbers between 0 and 1 and, for each draw, selects the record whose bucket contains that number. Since incorrectly classified records have higher sample weights (and therefore larger buckets), the probability of selecting those records is very high.
- Step 7: Repeat Previous Steps
This now acts as our new dataset, and we repeat all of the above steps on it.
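As an illustration, here is a minimal NumPy sketch of the reweighting and resampling described above. The toy values, the "correct" mask, and the weak learner's error are all assumptions made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 5
sample_weights = np.full(n, 1.0 / n)                  # equal initial weights
correct = np.array([True, True, False, True, True])   # one misclassified point

# Amount of say (alpha) of this weak learner, based on its weighted error.
total_error = sample_weights[~correct].sum()
alpha = 0.5 * np.log((1.0 - total_error) / total_error)

# Increase the weights of misclassified samples and decrease the rest,
# then normalise so the new sample weights sum to 1.
new_weights = sample_weights * np.exp(np.where(correct, -alpha, alpha))
new_weights /= new_weights.sum()

# Resample a new dataset: records with larger weights (the previously
# misclassified ones) occupy larger "buckets" and are drawn more often.
indices = rng.choice(n, size=n, p=new_weights)
print("new sample weights:", np.round(new_weights, 3))
print("resampled indices: ", indices)
```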
2.1.3. Advantages of AdaBoost
- It is easier to use than algorithms such as SVM, with less need for parameter tweaking. Theoretically, AdaBoost is not prone to overfitting, though there is no concrete proof of this; a possible reason is that the parameters are not jointly optimized, and stage-wise estimation slows down the learning process.
- AdaBoost can be used to improve the accuracy of your weak classifiers, which makes it flexible. It has now been extended beyond binary classification and has found use cases in text and image classification as well.
2.1.4. Disadvantages of AdaBoost
- Since the boosting technique learns progressively, it is important to ensure that you have quality data. AdaBoost is also extremely sensitive to noisy data and outliers, so if you plan to use AdaBoost it is highly recommended to remove them first.
- AdaBoost has also been shown to be slower than XGBoost.
2.2. XGBoost
2.2.1. Overview
XGBoost stands for “Extreme Gradient Boosting”, where the term “Gradient
Boosting” originates from the paper Greedy Function Approximation: A Gradient
Boosting Machine, by Friedman. It is the gold standard in ensemble learning, especially
when it comes to gradient-boosting algorithms. It develops a series of weak learners
one after the other to produce a reliable and accurate predictive model. Fundamentally, XGBoost builds a strong predictive model by aggregating the predictions of several weak learners, usually decision trees. It uses a boosting technique to create a highly accurate ensemble model in which each weak learner corrects the mistakes of its predecessors.
The optimization method (gradient) minimizes a cost function by repeatedly changing
the model’s parameters in response to the gradients of the errors. The algorithm also
presents the idea of “gradient boosting with decision trees,” in which the objective
function is reduced by calculating the importance of each decision tree that is added to
the ensemble in turn. By adding a regularization term and utilizing a more advanced
optimization algorithm, XGBoost goes one step further and improves accuracy and
efficiency.
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in P
where 𝐾 is the number of trees, 𝑓𝑘 is a function in the functional space P, and P is the
set of all possible CARTs. The objective function to be optimized is given by:
obj(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)

where l is a differentiable loss function measuring the difference between the prediction \hat{y}_i and the target y_i, and \Omega penalizes the complexity of each tree f_k.
Using mean squared error (MSE) as our loss function, the objective becomes
obj^{(t)} = \sum_{i=1}^{n} \left( y_i - \left( \hat{y}_i^{(t-1)} + f_t(x_i) \right) \right)^2 + \Omega(f_t) + \mathrm{constant}
         = \sum_{i=1}^{n} \left[ 2 \left( \hat{y}_i^{(t-1)} - y_i \right) f_t(x_i) + f_t(x_i)^2 \right] + \Omega(f_t) + \mathrm{constant}
The form of MSE is friendly, with a first order term (usually called the residual) and a
quadratic term. For other losses of interest (for example, logistic loss), it is not so easy
to get such a nice form. So in the general case, we take the Taylor expansion of the loss function up to the second order:

obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left( y_i, \hat{y}_i^{(t-1)} \right) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)

where g_i = \partial_{\hat{y}_i^{(t-1)}} l\left( y_i, \hat{y}_i^{(t-1)} \right) and h_i = \partial^2_{\hat{y}_i^{(t-1)}} l\left( y_i, \hat{y}_i^{(t-1)} \right).
This becomes our optimization goal for the new tree. One important advantage of this
definition is that the value of the objective function depends only on g_i and h_i. This is how XGBoost supports custom loss functions: we can optimize every loss function, including logistic regression and pairwise ranking, using exactly the same solver that takes g_i and h_i as input.
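As an illustration of this point, here is a hedged sketch (assuming the xgboost Python package and a synthetic scikit-learn dataset) that supplies g_i and h_i for the squared-error loss as a custom objective:

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

def squared_error_objective(preds, dtrain):
    """Return g_i and h_i for the squared-error loss 1/2 * (yhat_i - y_i)^2."""
    labels = dtrain.get_label()
    grad = preds - labels        # g_i: first derivative w.r.t. the prediction
    hess = np.ones_like(preds)   # h_i: second derivative (a constant 1 here)
    return grad, hess

booster = xgb.train(
    params={"max_depth": 3, "eta": 0.1},
    dtrain=dtrain,
    num_boost_round=50,
    obj=squared_error_objective,  # XGBoost only needs g_i and h_i from us
)
print(booster.predict(dtrain)[:5])
```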
Performance
AdaBoost: Simpler and may suffer from overfitting if not properly tuned; no parallel processing.
XGBoost: Outperforms AdaBoost thanks to its more sophisticated algorithms and optimizations, especially on complex, high-dimensional datasets; supports parallel processing.

Speed and Scalability
AdaBoost: Slower compared to XGBoost, especially when dealing with large datasets or high-dimensional feature spaces.
XGBoost: Highly scalable and efficient, thanks to its parallelized implementation and optimization techniques.

Robustness to Noise
AdaBoost: Sensitive to noisy data and outliers, since it tries to correct misclassifications from previous weak learners.
XGBoost: More robust to noise and outliers due to its regularization techniques, such as tree pruning and column subsampling.

Flexibility
AdaBoost: Versatile and can be used with different base learners, such as decision stumps or decision trees.
XGBoost: Supports various base learners and objective functions, offering more flexibility in model customization and tuning.
Both AdaBoost and XGBoost are powerful boosting algorithms with distinct
advantages and trade-offs. AdaBoost is simple, interpretable, and effective in certain
scenarios, while XGBoost excels in performance, scalability, and robustness to noise.
The choice between the two depends on the specific requirements of the problem, such
as dataset size, complexity, interpretability needs, and computational resources.