.. index:: ensemble
Ensembles use multiple models to improve prediction performance. The module implements a number of popular approaches, including bagging, boosting, stacking and random forests. Most of these are available for both classification and regression, with the exception of stacking, which in the present implementation supports classification only.
.. index:: bagging
.. index:: single: ensemble; bagging
.. autoclass:: Orange.ensemble.bagging.BaggedLearner
   :members:
   :show-inheritance:

.. autoclass:: Orange.ensemble.bagging.BaggedClassifier
   :members:
   :show-inheritance:
.. index:: boosting
.. index:: single: ensemble; boosting
.. autoclass:: Orange.ensemble.boosting.BoostedLearner
   :members:
   :show-inheritance:

.. autoclass:: Orange.ensemble.boosting.BoostedClassifier
   :members:
   :show-inheritance:
The following script fits classification models by boosting and bagging on the lymphography data set, using a tree learner with post-pruning as the base learner. The classification accuracy of the methods is estimated by 10-fold cross validation (:download:`ensemble.py <code/ensemble.py>`):
.. literalinclude:: code/ensemble.py
   :lines: 7-
Running this script demonstrates some benefit of boosting and bagging over the baseline learner::

    Classification Accuracy:
        tree: 0.764
        boosted tree: 0.770
        bagged tree: 0.790
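For quick reference, a minimal stand-alone sketch along the same lines is shown below. It assumes the Orange 2.x module layout and the ``m_pruning``, ``folds`` and ``name`` keyword arguments; check these against the installed version before relying on them::

    import Orange

    # base learner: a classification tree with m-estimate post-pruning
    tree = Orange.classification.tree.TreeLearner(m_pruning=2, name="tree")

    # wrap the same base learner into boosting and bagging ensembles
    boost = Orange.ensemble.boosting.BoostedLearner(tree, name="boosted tree")
    bagged = Orange.ensemble.bagging.BaggedLearner(tree, name="bagged tree")

    lymphography = Orange.data.Table("lymphography")

    learners = [tree, boost, bagged]
    results = Orange.evaluation.testing.cross_validation(learners, lymphography, folds=10)
    print "Classification Accuracy:"
    for learner, ca in zip(learners, Orange.evaluation.scoring.CA(results)):
        print "%15s: %5.3f" % (learner.name, ca)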
.. index:: stacking
.. index:: single: ensemble; stacking
.. autoclass:: Orange.ensemble.stacking.StackedClassificationLearner
   :members:
   :show-inheritance:

.. autoclass:: Orange.ensemble.stacking.StackedClassifier
   :members:
   :show-inheritance:
Stacking often produces classifiers that are more predictive than individual classifiers in the ensemble. This effect is illustrated by a script that combines four different classification algorithms (:download:`ensemble-stacking.py <code/ensemble-stacking.py>`):
.. literalinclude:: code/ensemble-stacking.py
   :lines: 3-
The benefits of stacking on this particular data set are substantial (numbers show classification accuracy)::

    stacking: 0.915
    bayes: 0.858
    tree: 0.688
    lr: 0.868
    knn: 0.839
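A minimal sketch of such a combination might look as follows. The choice of data set is arbitrary, and the exact keyword arguments of ``StackedClassificationLearner`` are an assumption; consult the class documentation above::

    import Orange

    bupa = Orange.data.Table("bupa")  # any classification data set will do

    bayes = Orange.classification.bayes.NaiveLearner(name="bayes")
    tree = Orange.classification.tree.TreeLearner(name="tree")
    lr = Orange.classification.logreg.LogRegLearner(name="lr")
    knn = Orange.classification.knn.kNNLearner(name="knn")

    base_learners = [bayes, tree, lr, knn]
    stack = Orange.ensemble.stacking.StackedClassificationLearner(base_learners, name="stacking")

    learners = [stack] + base_learners
    results = Orange.evaluation.testing.cross_validation(learners, bupa, folds=10)
    for learner, ca in zip(learners, Orange.evaluation.scoring.CA(results)):
        print "%10s: %.3f" % (learner.name, ca)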
.. index:: random forest
.. index:: single: ensemble; random forest
.. autoclass:: Orange.ensemble.forest.RandomForestLearner
   :members:
   :show-inheritance:

.. autoclass:: Orange.ensemble.forest.RandomForestClassifier
   :members:
   :show-inheritance:
The following script assembles a random forest learner and compares it to a tree learner on the liver disorders (bupa) and housing data sets (:download:`ensemble-forest.py <code/ensemble-forest.py>`):
.. literalinclude:: code/ensemble-forest.py
   :lines: 7-
Notice that our forest contains 50 trees. Learners are compared through 3-fold cross validation::

    Classification: bupa.tab
    Learner  CA      Brier   AUC
    tree     0.586   0.829   0.575
    forest   0.710   0.392   0.752

    Regression: housing.tab
    Learner  MSE     RSE     R2
    tree     23.708  0.281   0.719
    forest   11.988  0.142   0.858
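For the classification part of the comparison, a minimal sketch might look as follows; ``trees`` is assumed to be the keyword that sets the number of trees in the forest::

    import Orange

    bupa = Orange.data.Table("bupa")

    tree = Orange.classification.tree.TreeLearner(name="tree")
    forest = Orange.ensemble.forest.RandomForestLearner(trees=50, name="forest")

    learners = [tree, forest]
    results = Orange.evaluation.testing.cross_validation(learners, bupa, folds=3)
    cas = Orange.evaluation.scoring.CA(results)
    aucs = Orange.evaluation.scoring.AUC(results)
    for learner, ca, auc in zip(learners, cas, aucs):
        print "%6s  CA: %.3f  AUC: %.3f" % (learner.name, ca, auc)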
The main purpose of the following example is to show how to access the individual classifiers once they are assembled into the forest, and how to assemble a tree learner to be used in random forests. In this example the best feature for a decision node is selected among three randomly chosen features, and maxDepth and minExamples are both set to 5 (:download:`ensemble-forest2.py <code/ensemble-forest2.py>`):
.. literalinclude:: code/ensemble-forest2.py
   :lines: 7-
Running the above code reports the sizes (number of nodes) of the trees in the constructed random forest.
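A sketch of the same idea is given below. The ``base_learner`` and ``attributes`` arguments, the ``classifiers`` attribute of the resulting classifier and the ``count_nodes()`` method of the individual trees are assumptions about the Orange 2.x API, and the spelling of the tree parameters differs between releases, so verify the names against the class documentation above::

    import Orange

    bupa = Orange.data.Table("bupa")

    # tree learner to be used inside the forest; these parameters may be
    # spelled max_depth / min_instances in newer Orange 2.x releases
    tree = Orange.classification.tree.TreeLearner(maxDepth=5, minExamples=5)

    # the forest picks the best of three randomly chosen features at each node
    forest_learner = Orange.ensemble.forest.RandomForestLearner(
        base_learner=tree, trees=50, attributes=3)
    forest = forest_learner(bupa)

    # report the size (number of nodes) of every tree in the forest
    for i, classifier in enumerate(forest.classifiers):
        print "tree %2d: %d nodes" % (i, classifier.count_nodes())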
L. Breiman (2001) suggested the possibility of using random forests as a non-myopic measure of feature importance.
The assessment of feature relevance with random forests is based on the idea that randomly changing the value of an important feature greatly affects an instance's classification, while changing the value of an unimportant feature does not affect it much. The implemented algorithm accumulates feature scores over a given number of trees. Importances of all features for a single tree are computed as the number of correctly classified out-of-bag (OOB) instances minus the number of correctly classified OOB instances when the feature's values are randomly shuffled. The accumulated feature scores are divided by the number of trees used and multiplied by 100 before they are returned.
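The per-tree score can be written down as a short, framework-independent sketch; ``predict``, ``oob_x``, ``oob_y`` and ``feature`` are illustrative names, not part of the module's API::

    import random

    def tree_importance(predict, oob_x, oob_y, feature):
        """Importance of one feature for one tree: correctly classified
        out-of-bag (OOB) instances minus correctly classified OOB instances
        after the feature's values are randomly shuffled."""
        correct = sum(1 for x, y in zip(oob_x, oob_y) if predict(x) == y)

        # shuffle the chosen feature's values across the OOB instances
        shuffled = [list(x) for x in oob_x]
        column = [x[feature] for x in shuffled]
        random.shuffle(column)
        for x, value in zip(shuffled, column):
            x[feature] = value

        correct_shuffled = sum(1 for x, y in zip(shuffled, oob_y) if predict(x) == y)
        return correct - correct_shuffled

    def forest_importance(per_tree_scores, trees):
        # accumulated scores are divided by the number of trees and scaled by 100
        return 100.0 * sum(per_tree_scores) / trees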
.. autoclass:: Orange.ensemble.forest.ScoreFeature
   :members:
Computation of feature importance with random forests is rather slow, and importances for all features need to be computed simultaneously. When called to compute the quality of a certain feature, it computes the qualities of all features in the data set. When called again, it uses the stored results if the domain is still the same and the data table has not changed (this is done by checking the data table's version and is not foolproof; it will not detect changes to the values of existing instances, but will notice adding and removing of instances; see the page on :class:`Orange.data.Table` for details).
The following script scores the features of the iris data set (:download:`ensemble-forest-measure.py <code/ensemble-forest-measure.py>`):
.. literalinclude:: code/ensemble-forest-measure.py
   :lines: 7-
The output of the above script is::

    DATA:iris.tab
    first: 3.91, second: 0.38
    different random seed
    first: 3.39, second: 0.46

    All importances:
       sepal length:  3.39
       sepal width:   0.46
       petal length: 30.15
       petal width:  31.98
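A minimal sketch of scoring all features of a data set this way is given below; the ``trees`` argument and the call signature ``measure(feature, data)`` are assumed from the class documentation above::

    import Orange

    iris = Orange.data.Table("iris")

    # importances of all features are computed (and cached) on the first call
    measure = Orange.ensemble.forest.ScoreFeature(trees=100)

    for feature in iris.domain.features:
        print "%15s: %5.2f" % (feature.name, measure(feature, iris))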
- L Breiman. Bagging Predictors. Technical report No. 421. University of California, Berkeley, 1994.
- Y Freund, RE Schapire. Experiments with a New Boosting Algorithm. Machine Learning: Proceedings of the Thirteenth International Conference (ICML'96), 1996.
- JR Quinlan. Boosting, bagging, and C4.5. In Proc. of 13th National Conference on Artificial Intelligence (AAAI'96), pp. 725-730, 1996.
- L Breiman. Random Forests. Machine Learning, 45, 5-32, 2001.
- M Robnik-Sikonja. Improving Random Forests. In Proc. of European Conference on Machine Learning (ECML 2004), pp. 359-370, 2004.
.. automodule:: Orange.ensemble