Ensemble algorithms (ensemble)
==============================

.. index:: ensemble

Ensembles use multiple models to improve prediction performance. The module implements a number of popular approaches, including bagging, boosting, stacking and random forests. Most of these are available for both classification and regression, with the exception of stacking, which in the present implementation supports classification only.

Bagging
-------

.. index:: bagging
.. index::
   single: ensemble; bagging

.. autoclass:: Orange.ensemble.bagging.BaggedLearner
   :members:
   :show-inheritance:

.. autoclass:: Orange.ensemble.bagging.BaggedClassifier
   :members:
   :show-inheritance:
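
For orientation, a minimal usage sketch (assuming, as in Orange 2.x, that the
learner wraps a base learner and that ``t`` sets the number of bootstrap
models)::

   import Orange

   data = Orange.data.Table("lymphography")
   tree = Orange.classification.tree.TreeLearner()

   # Wrap the base learner; each of the t models is trained on a bootstrap
   # sample of the data, and their predictions are combined by voting.
   bagger = Orange.ensemble.bagging.BaggedLearner(tree, t=10)
   classifier = bagger(data)  # a BaggedClassifier
   print(classifier(data[0]))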

Boosting
--------

.. index:: boosting
.. index::
   single: ensemble; boosting


.. autoclass:: Orange.ensemble.boosting.BoostedLearner
   :members:
   :show-inheritance:

.. autoclass:: Orange.ensemble.boosting.BoostedClassifier
   :members:
   :show-inheritance:
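
A matching sketch for boosting (again assuming the Orange 2.x signature, with
``t`` rounds of reweighted training in the style of AdaBoost)::

   import Orange

   data = Orange.data.Table("lymphography")
   tree = Orange.classification.tree.TreeLearner()

   # Each round increases the weight of instances the previous models
   # misclassified, so later models focus on the hard cases.
   booster = Orange.ensemble.boosting.BoostedLearner(tree, t=10)
   classifier = booster(data)  # a BoostedClassifier
   print(classifier(data[0]))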

Example
~~~~~~~

The following script fits classification models by boosting and bagging on the Lymphography data set, using a TreeLearner with post-pruning as the base learner. The classification accuracy of the methods is estimated by 10-fold cross validation (:download:`ensemble.py <code/ensemble.py>`):

.. literalinclude:: code/ensemble.py
   :lines: 7-

Running this script demonstrates some benefit of boosting and bagging over the baseline learner::

   Classification Accuracy:
              tree: 0.764
      boosted tree: 0.770
       bagged tree: 0.790

Stacking
--------

.. index:: stacking
.. index::
   single: ensemble; stacking


.. autoclass:: Orange.ensemble.stacking.StackedClassificationLearner
   :members:
   :show-inheritance:

.. autoclass:: Orange.ensemble.stacking.StackedClassifier
   :members:
   :show-inheritance:
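
A minimal usage sketch (the constructor arguments shown here, a list of
level-0 learners, are this sketch's assumption; see the class documentation
above for the full signature)::

   import Orange

   data = Orange.data.Table("promoters")

   # Level-0 learners whose predictions become the meta-learner's inputs.
   learners = [
       Orange.classification.bayes.NaiveLearner(name="bayes"),
       Orange.classification.knn.kNNLearner(name="knn"),
   ]
   stacker = Orange.ensemble.stacking.StackedClassificationLearner(learners)
   classifier = stacker(data)
   print(classifier(data[0]))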

Example
~~~~~~~

Stacking often produces a classifier that is more accurate than any individual classifier in the ensemble. This effect is illustrated by a script that combines four different classification algorithms (:download:`ensemble-stacking.py <code/ensemble-stacking.py>`):

.. literalinclude:: code/ensemble-stacking.py
   :lines: 3-

The benefits of stacking on this particular data set are substantial (numbers show classification accuracy)::

   stacking: 0.915
      bayes: 0.858
       tree: 0.688
         lr: 0.868
        knn: 0.839

Random Forest
-------------

.. index:: random forest
.. index::
   single: ensemble; random forest

.. autoclass:: Orange.ensemble.forest.RandomForestLearner
   :members:
   :show-inheritance:

.. autoclass:: Orange.ensemble.forest.RandomForestClassifier
   :members:
   :show-inheritance:
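
A minimal usage sketch (assuming, as in Orange 2.x, a ``trees`` parameter for
the number of trees grown)::

   import Orange

   data = Orange.data.Table("bupa")
   forest = Orange.ensemble.forest.RandomForestLearner(trees=50)
   classifier = forest(data)  # a RandomForestClassifier
   print(classifier(data[0]))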


Example
~~~~~~~

The following script assembles a random forest learner and compares it to a tree learner on the liver disorders (bupa) and housing data sets (:download:`ensemble-forest.py <code/ensemble-forest.py>`):

.. literalinclude:: code/ensemble-forest.py
   :lines: 7-

Notice that our forest contains 50 trees. Learners are compared through 3-fold cross validation::

   Classification: bupa.tab
   Learner  CA      Brier  AUC
   tree     0.586   0.829  0.575
   forest   0.710   0.392  0.752

   Regression: housing.tab
   Learner  MSE     RSE    R2
   tree     23.708  0.281  0.719
   forest   11.988  0.142  0.858

The following example shows how to access the individual classifiers once they are assembled into the forest, and how to assemble a tree learner to be used in random forests. The best feature for decision nodes is selected among three randomly chosen features, and maxDepth and minExamples are both set to 5 (:download:`ensemble-forest2.py <code/ensemble-forest2.py>`):

.. literalinclude:: code/ensemble-forest2.py
   :lines: 7-

Running the above code reports the sizes (number of nodes) of the trees in the constructed random forest.
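
To make the access pattern concrete, a sketch of iterating over the trees
(assuming, as the example above relies on, that the constructed classifier
exposes its trees through a ``classifiers`` attribute and that each tree
supports ``count_nodes()``)::

   import Orange

   data = Orange.data.Table("bupa")
   forest = Orange.ensemble.forest.RandomForestLearner(trees=10)(data)
   for tree in forest.classifiers:
       print(tree.count_nodes())  # size (number of nodes) of each tree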

Feature scoring
~~~~~~~~~~~~~~~

L. Breiman (2001) suggested the possibility of using random forests as a non-myopic measure of feature importance.

The assessment of feature relevance with random forests is based on the idea that randomly changing the value of an important feature greatly affects an instance's classification, while changing the value of an unimportant feature has little effect. The implemented algorithm accumulates feature scores over a given number of trees. The importance of a feature for a single tree is computed as the number of correctly classified out-of-bag (OOB) instances minus the number of correctly classified OOB instances when the feature's values are randomly shuffled. The accumulated feature scores are divided by the number of trees used and multiplied by 100 before they are returned.
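
The per-tree computation can be summarized in a few lines of plain Python. The
following is an illustrative sketch of the idea, not Orange's implementation;
``predict`` stands for a single tree applied to a feature vector, and
``X_oob``/``y_oob`` are the tree's out-of-bag instances and their classes::

   import random

   def per_tree_importance(predict, X_oob, y_oob, feature):
       # Correctly classified OOB instances with the original values.
       correct = sum(predict(x) == y for x, y in zip(X_oob, y_oob))
       # Randomly shuffle the chosen feature's values across OOB instances.
       shuffled = [list(x) for x in X_oob]
       column = [x[feature] for x in shuffled]
       random.shuffle(column)
       for x, v in zip(shuffled, column):
           x[feature] = v
       # The drop in correct classifications is this tree's score.
       return correct - sum(predict(x) == y for x, y in zip(shuffled, y_oob))

The feature's overall score is the average of these per-tree differences,
multiplied by 100.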

.. autoclass:: Orange.ensemble.forest.ScoreFeature
   :members:
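
A minimal usage sketch (assuming the scorer is called like Orange's other
feature scoring measures, with a feature and a data table)::

   import Orange

   data = Orange.data.Table("iris")
   measure = Orange.ensemble.forest.ScoreFeature(trees=100)

   # Scoring one feature computes (and caches) the scores of all of them.
   for attr in data.domain.attributes:
       print("%15s: %6.2f" % (attr.name, measure(attr, data)))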

Computation of feature importance with random forests is rather slow, and importances for all features need to be computed simultaneously. When the scorer is called to compute the quality of a certain feature, it computes the qualities of all features in the dataset. When called again, it uses the stored results if the domain is still the same and the data table has not changed. This check relies on the data table's version and is not foolproof: it will not detect changed values of existing instances, but will notice added or removed instances (see :class:`Orange.data.Table` for details).

The following script scores the features of the iris data set, repeating the scoring with a different random seed (:download:`ensemble-forest-measure.py <code/ensemble-forest-measure.py>`):

.. literalinclude:: code/ensemble-forest-measure.py
   :lines: 7-

The output of the above script is::

   DATA:iris.tab

   first: 3.91, second: 0.38

   different random seed
   first: 3.39, second: 0.46

   All importances:
      sepal length:   3.39
       sepal width:   0.46
      petal length:  30.15
       petal width:  31.98

References
----------

.. automodule:: Orange.ensemble