4.3 Decision Trees Learning Algorithms, Part 2
The data-set is then partitioned by the selected feature to produce subsets of the data
that are associated with the branch nodes corresponding to the values of the chosen
feature. The algorithm then recurses on each subset, considering only attributes never
selected before.
Throughout the algorithm, the decision tree is constructed with each non-terminal node
representing the selected feature on which the data is split, and each terminal node representing
the class label best suited to the final subset of that branch.
ID3 algorithm pseudocode
[Figure: example decision tree for the pseudocode, with four pure leaf nodes containing 3, 2, 2 and 3 data items respectively, all labelled YES.]
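As an illustration of the procedure described above, the following is a minimal Python sketch of recursive ID3, assuming categorical features stored as dicts and information gain as the selection criterion; all function and variable names are illustrative rather than taken from the lecture.

```python
# Minimal ID3 sketch: categorical features, class labels of any hashable type.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(rows, labels, feature):
    """Reduction in entropy obtained by splitting on `feature`."""
    total = len(labels)
    remainder = 0.0
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, features):
    """Build a tree of nested dicts; leaves are class labels."""
    if len(set(labels)) == 1:              # pure subset -> terminal node
        return labels[0]
    if not features:                       # inadequate features -> majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [f for f in features if f != best])  # never reuse a feature
    return tree
```

For example, `id3([{'outlook': 'sunny'}, {'outlook': 'rain'}], ['NO', 'YES'], ['outlook'])` returns `{'outlook': {'sunny': 'NO', 'rain': 'YES'}}`.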
Noise
Non-systematic errors in either the values of features or class labels are
usually referred to as noise.
Two modifications of the basic algorithm are required if tree building is to operate
on a noise-affected training set.
(1) The algorithm must be able to work with inadequate features, because
noise can cause even the most comprehensive set of features to appear
inadequate.
(2) The algorithm must be able to detect when testing further features will not
improve the predictive accuracy of the decision tree but rather result in
overfitting, and as a consequence take measures such as pruning.
General definition of overfitting
Overfitting is a significant practical difficulty for
decision tree models and many other predictive
models. Overfitting happens when the learning
algorithm continues to develop hypotheses that
reduce training set error at the cost of an increased
test set error.
Pre-pruning
• Stops growing the tree early, before it perfectly classifies the training data-set (when a data
split is not statistically significant).
• Criteria for stopping are usually based on a statistical significance test that decides whether
pruning or expanding a particular node is likely to produce an improvement beyond the
training set (e.g., a Chi-square test; see the sketch below).
• Has the problem of stopping too early, as it is not easy to estimate precisely when to stop
growing the tree.
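As an illustration of such a stopping criterion, the sketch below builds a contingency table of feature values against class labels at a node and applies a Chi-square test; the use of scipy, the helper name and the 0.05 significance level are assumptions for this example, not part of the lecture. Pre-pruning would then make a node a leaf whenever no candidate feature yields a significant split.

```python
import numpy as np
from scipy.stats import chi2_contingency

def split_is_significant(feature_values, labels, alpha=0.05):
    """Return True if the class distribution differs significantly across
    the values of the candidate feature, i.e. the split seems worth making."""
    values = sorted(set(feature_values))
    classes = sorted(set(labels))
    # Contingency table: rows = feature values, columns = class labels.
    table = np.array([[sum(1 for v, l in zip(feature_values, labels)
                           if v == value and l == cls)
                       for cls in classes]
                      for value in values])
    _, p_value, _, _ = chi2_contingency(table)
    return p_value < alpha   # expand the node only when the split is significant
```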
Post-pruning allows the tree to perfectly classify the training set, and then prunes the
tree by removing sub-trees. Often a distinct subset of the data-set (called the validation set) is set
aside to evaluate the effect of post-pruning nodes from the tree.
A simple variant of Post Pruning
Reduced-Error Pruning
• A node is removed if the resulting tree performs no worse than the original on the
validation set.
• Pruning means removing the whole subtree for which the node is the root, making
it a leaf assigned the most common class of the associated instances (see the sketch below).
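A rough sketch of reduced-error pruning, assuming the nested-dict trees of the ID3 sketch above; the helper names and the bottom-up, per-node evaluation on the validation instances reaching each node are illustrative choices, not prescribed by the lecture.

```python
from collections import Counter

def predict(tree, row, default):
    """Walk the nested-dict tree for one instance; fall back to `default`
    for feature values not seen during training."""
    while isinstance(tree, dict):
        feature = next(iter(tree))
        tree = tree[feature].get(row[feature], default)
    return tree

def accuracy(tree, rows, labels, default):
    return sum(predict(tree, r, default) == l
               for r, l in zip(rows, labels)) / len(labels)

def reduced_error_prune(node, train_rows, train_labels, val_rows, val_labels):
    """Bottom-up pruning: replace a subtree by a leaf carrying the majority
    training class whenever the leaf does no worse on the validation
    instances that reach this node."""
    if not isinstance(node, dict) or not val_labels:
        return node
    feature = next(iter(node))
    majority = Counter(train_labels).most_common(1)[0][0]
    for value in list(node[feature]):
        t_idx = [i for i, r in enumerate(train_rows) if r[feature] == value]
        v_idx = [i for i, r in enumerate(val_rows) if r[feature] == value]
        node[feature][value] = reduced_error_prune(
            node[feature][value],
            [train_rows[i] for i in t_idx], [train_labels[i] for i in t_idx],
            [val_rows[i] for i in v_idx], [val_labels[i] for i in v_idx])
    # Compare the (already pruned) subtree with a single majority-class leaf.
    subtree_acc = accuracy(node, val_rows, val_labels, majority)
    leaf_acc = sum(l == majority for l in val_labels) / len(val_labels)
    return majority if leaf_acc >= subtree_acc else node
```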
Later systems extend the ID3 setup in various ways, primarily with
extended data types for features, better pruning and noise handling.
Comparison of three TDIDT systems
Ensemble approaches
Ensemble methods construct more than one decision tree and use the set of trees for joint classification.
Two kinds of approaches, relevant not only for decision trees but for other kinds of classifiers as well:
Boosting approaches
Boosting is a sequential approach where a sequence of average-performing classifiers can give a
boosted overall performance by feeding the experience of one classifier into the next.
E.g. AdaBoost is a boosting technique that can be applied to many ML algorithms.
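A minimal usage sketch with scikit-learn's AdaBoostClassifier, which by default boosts shallow decision-tree stumps; the synthetic dataset and hyperparameters are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each weak learner is trained on a reweighted view of the data that
# emphasises the instances misclassified by the previous learners.
booster = AdaBoostClassifier(n_estimators=50, random_state=0)
booster.fit(X_train, y_train)
print("test accuracy:", booster.score(X_test, y_test))
```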
Bagging approaches
Bagging is a parallel approach where a set of classifiers together produce partial results that
are then combined into an aggregated overall result.
E.g. the Random Forest algorithm combines random decision trees with bagging to achieve very
high classification accuracy.
Random forests
Random forests or random decision forests are an ensemble learning method for classification,
regression and other tasks.
Random forests operate by constructing a multitude of decision trees at training time and outputting the
class that is most common among the classes predicted by the individual trees (classification) or the
mean of the individual predictions (regression).
The random forest approach is an alternative remedy for the decision tree problem of overfitting.
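A short usage sketch with scikit-learn's RandomForestClassifier, assuming scikit-learn is installed; the dataset and hyperparameters are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each grown on a bootstrap sample of the training data and
# restricted to a random subset of features at every split; the forest
# predicts by majority vote over the individual trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```

Averaging many decorrelated trees reduces variance, which is why the forest tends to overfit less than a single fully grown tree.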