
Decision Trees
23 June 2024 22:02


The Gini impurity is a measure of how often a randomly chosen element from the set would be
incorrectly labelled if it were labelled at random according to the distribution of labels in the
set.
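As a minimal sketch of this definition, the impurity can be computed directly from the label counts as G = 1 - Σ p_k², where p_k is the fraction of elements with label k (the function name and the example labels below are just for illustration):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity G = 1 - sum_k p_k**2 for a collection of class labels."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()          # empirical label distribution
    return 1.0 - np.sum(p ** 2)

# A pure node has impurity 0; a 50/50 binary node has the maximum binary impurity, 0.5.
print(gini_impurity(["a", "a", "a", "a"]))  # 0.0
print(gini_impurity(["a", "a", "b", "b"]))  # 0.5
print(gini_impurity(["a", "a", "a", "b"]))  # 0.375
```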

Advantages
• Simple to understand and to interpret. Trees can be visualized (see the sketch after this list).
• Requires little data preparation. Other techniques often require data normalization, dummy variables to be created, and blank values to be removed. Note, however, that this module does not support missing values.

• The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points
used to train the tree.

• Able to handle both numerical and categorical data.


• Can work on non-linear datasets.
• Can give you feature importances.
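A small sketch of the interpretability and prediction-cost points above, assuming the scikit-learn implementation; the dataset, depth limit, and variable names are illustrative only:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The fitted tree can be printed as a set of human-readable if/else rules.
print(export_text(clf, feature_names=load_iris().feature_names))

# Prediction walks a single root-to-leaf path, so on a balanced tree its cost
# is logarithmic in the number of training points.
print(clf.predict(X[:2]))
```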

Disadvantages
Decision-tree learners can create over-complex trees that do not generalize well to unseen
data. This is called overfitting. Mechanisms such as pruning, setting the minimum number of
samples required at a leaf node, or setting the maximum depth of the tree are necessary
to avoid this problem.

Decision trees can be unstable because small variations in the data might result in a
completely different tree being generated. This problem is mitigated by using decision
trees within an ensemble.

Predictions of decision trees are neither smooth nor continuous, but piecewise-constant
approximations. Therefore, they are not good at extrapolation.

This limitation is inherent to the structure of decision tree models. They are very useful
for interpretability and for handling non-linear relationships within the range of the
training data, but they aren't designed for extrapolation. If extrapolation is important for
your task, you might need to consider other types of models.
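A quick way to see the piecewise-constant behaviour and the extrapolation limit is to fit a regression tree to a simple linear trend; this is a sketch assuming scikit-learn, with synthetic data and an arbitrary depth limit:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Train on a simple linear relationship, y = 2x, for x in [0, 10).
X_train = np.arange(0, 10, 0.5).reshape(-1, 1)
y_train = 2 * X_train.ravel()

reg = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train)

# Inside the training range, predictions form a step function (piecewise constant).
print(reg.predict(np.linspace(0, 10, 5).reshape(-1, 1)))

# Outside the training range, the tree simply repeats the value of its last leaf,
# so it cannot follow the linear trend beyond the data it has seen.
print(reg.predict([[20.0], [100.0]]))
```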

The importance of a feature is computed as the (normalized) total reduction of the criterion
brought by that feature. It is also known as the Gini importance.
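For instance, in scikit-learn the (normalized) importances are exposed as feature_importances_; the dataset below is just for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# feature_importances_ holds the total criterion (impurity) reduction per feature,
# normalized so that the values sum to 1.
for name, importance in zip(data.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```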

Pruning is a technique used in machine learning to reduce the size of decision trees and to
avoid overfitting. Overfitting happens when a model learns the training data too well,
including its noise and outliers, which results in poor performance on unseen or test data.

Decision trees are susceptible to overfitting because they can potentially create very complex
trees that perfectly classify the training data but fail to generalize to new data. Pruning helps
to solve this issue by reducing the complexity of the decision tree, thereby improving its
predictive power on unseen data.

There are two main types of pruning: pre-pruning and post-pruning.

1. Pre-pruning (Early stopping): This method halts the tree construction early. It can be done in
various ways: by setting a limit on the maximum depth of the tree, by setting a limit on the
minimum number of instances that must be in a node to allow a split, or by stopping when a split
improves the model's accuracy by less than a certain threshold.

2. Post-pruning (Cost Complexity Pruning): This method allows the tree to grow to its full size,
then prunes it. Nodes are removed from the tree based on the error complexity trade-off. The
basic idea is to replace a whole subtree by a leaf node, and assign the most common class in
that subtree to the leaf node.
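In scikit-learn, post-pruning corresponds to the ccp_alpha parameter together with cost_complexity_pruning_path; the following is a sketch of a typical workflow, with an arbitrary dataset and illustrative variable names:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Effective alphas at which whole subtrees would be collapsed into leaves.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Refit one tree per alpha and keep the one that does best on held-out data.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_test, y_test),
)
print(best.get_n_leaves(), best.score(X_test, y_test))
```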



Pre-pruning, also known as early stopping, is a technique where the decision tree is pruned
during the learning process as soon as it's clear that further splits will not add significant value.
There are several strategies for pre-pruning (a combined code sketch follows the list):

1. Maximum Depth: One of the simplest forms of pre-pruning is to set a limit on the maximum
depth of the tree. Once the tree reaches the specified depth during training, no new nodes are
created. This strategy is simple to implement and can effectively prevent overfitting, but if the
maximum depth is set too low, the tree might be overly simplified and underfit the data.

2. Minimum Samples Split: This is a condition where a node will only be split if the number of
samples in that node is above a certain threshold. If the number of samples is too small, then
the node is not split and becomes a leaf node instead. This can prevent overfitting by not
allowing the model to learn noise in the data.

3. Minimum Samples Leaf: This condition requires that a split at a node must leave at least a
minimum number of training examples in each of the leaf nodes. Like the minimum samples
split, this strategy can prevent overfitting by not allowing the model to learn from noise in the
data.
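A sketch of how these three pre-pruning controls map onto scikit-learn's constructor parameters; the specific values and dataset are arbitrary examples:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

clf = DecisionTreeClassifier(
    max_depth=4,            # 1. stop growing below this depth
    min_samples_split=20,   # 2. a node needs at least this many samples to be split
    min_samples_leaf=10,    # 3. every leaf must keep at least this many samples
    random_state=0,
).fit(X, y)

print(clf.get_depth(), clf.get_n_leaves())
```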
