.. py:currentmodule:: Orange.feature.discretization

Feature discretization (discretization)
========================================

.. index:: discretization

.. index::
   single: feature; discretization

The feature discretization module provides routines that take a continuous feature and introduce a new, discretized feature based on the training data set. Most often such a procedure would be executed on all features of the data set using implementations from :doc:`Orange.data.discretization`. Implementations in this module discretize one feature at a time and do not provide wrappers for whole-data-set discretization. Discretization is data-specific and consists of learning the discretization procedure (see Discretization Algorithms) and the actual discretization of the data (see Discretizers). The split into these two phases is intentional: in machine learning, discretization may be learned from the training set and then applied to the test set.

Consider the following example (part of :download:`discretization.py <code/discretization.py>`):

.. literalinclude:: code/discretization.py
    :lines: 7-15

The discretized attribute ``sep_w`` is constructed with a call to :class:`Entropy`; instead of constructing it and calling it afterwards, we passed the arguments for the call to the constructor. We then constructed a new :class:`Orange.data.Table` with the attributes "sepal width" (the original continuous attribute), ``sep_w`` and the class attribute, and printed the first five data instances of the new table::

    Entropy discretization, first 5 data instances
    [3.5, '>3.30', 'Iris-setosa']
    [3.0, '(2.90, 3.30]', 'Iris-setosa']
    [3.2, '(2.90, 3.30]', 'Iris-setosa']
    [3.1, '(2.90, 3.30]', 'Iris-setosa']
    [3.6, '>3.30', 'Iris-setosa']

The name of the new categorical variable is derived from the name of the original continuous variable by prepending the prefix ``D_``. The values of the new attribute are computed automatically when they are needed, using the transformation function :obj:`~Orange.feature.Descriptor.get_value_from` (see :class:`Orange.feature.Descriptor`), which encodes the discretization:

>>> sep_w
EnumVariable 'D_sepal width'
>>> sep_w.get_value_from
<ClassifierFromVar instance at 0x01BA7DC0>
>>> sep_w.get_value_from.whichVar
FloatVariable 'sepal width'
>>> sep_w.get_value_from.transformer
<IntervalDiscretizer instance at 0x01BA2100>
>>> sep_w.get_value_from.transformer.points
<2.90000009537, 3.29999995232>

The construction of the new table in the discretization script converted all data instances from ``data`` to the new domain. This includes the new feature ``sep_w``, whose values are computed on the fly by calling ``sep_w.get_value_from`` for each data instance. The original, continuous sepal width is passed to the transformer, which determines the interval using its ``points``. The transformer returns the discrete value, which is in turn returned by ``get_value_from`` and stored in the new data instance.

With the exception of fixed discretization, discretization approaches infer the cut-off points from the training data set and thus construct a discretizer that converts continuous values of the feature into categorical values according to the inferred rule. In this respect, discretization behaves similarly to :class:`Orange.classification.Learner`.
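
Below is a minimal sketch of this learn-and-apply pattern, assuming the standard "iris" data set and the :class:`Entropy` discretization shown earlier. Converting instances to a domain that contains the discretized variable applies the stored cut-offs on the fly, so the same domain could equally be used to convert a separate test table::

    import Orange

    data = Orange.data.Table("iris")

    # learning phase: the cut-off points are inferred from the (training) data
    sep_w = Orange.feature.discretization.Entropy("sepal width", data)

    # application phase: converting instances to a domain that contains the
    # discretized variable applies the stored cut-offs to each instance
    new_domain = Orange.data.Domain([sep_w], data.domain.class_var)
    print Orange.data.Table(new_domain, data)[0]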

Discretization Algorithms
-------------------------

Instances of discretization classes are all derived from :class:`Discretization`.

.. method:: __call__(variable, data[, weightID])

    Given a continuous ``variable``, ``data`` and, optionally, the id
    of an attribute with example weights, this function returns a
    discretized feature. Argument ``variable`` can be a
    :obj:`~Orange.feature.Descriptor`, or the index or name of the
    variable within ``data.domain``.
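
    For illustration, a short sketch of the three ways of naming the
    variable, assuming the iris data set; each call below returns an
    equivalent discretized feature::

        import Orange

        data = Orange.data.Table("iris")
        disc = Orange.feature.discretization.Entropy()

        d1 = disc(data.domain["petal length"], data)  # by descriptor
        d2 = disc(2, data)                            # by index in data.domain
        d3 = disc("petal length", data)               # by name
        print d1.values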

Equal-width discretization splits the domain of the feature into a fixed number of equal-width intervals. The span of the original domain is computed from the training data and is defined by the smallest and the largest feature value.

.. attribute:: n

    Number of discretization intervals (default: 4).

The following example discretizes the features of the Iris data set into six intervals. The script constructs an :class:`Orange.data.Table` with the discretized features and outputs their descriptions:

.. literalinclude:: code/discretization.py
    :lines: 38-43

The output of this script is::

    D_sepal length: <<4.90, [4.90, 5.50), [5.50, 6.10), [6.10, 6.70), [6.70, 7.30), >7.30>
    D_sepal width: <<2.40, [2.40, 2.80), [2.80, 3.20), [3.20, 3.60), [3.60, 4.00), >4.00>
    D_petal length: <<1.98, [1.98, 2.96), [2.96, 3.94), [3.94, 4.92), [4.92, 5.90), >5.90>
    D_petal width: <<0.50, [0.50, 0.90), [0.90, 1.30), [1.30, 1.70), [1.70, 2.10), >2.10>

The cut-off values are hidden in the discretizer and stored in ``attr.get_value_from.transformer``:

>>> for attr in newattrs:
...    print "%s: first interval at %5.3f, step %5.3f" % \
...    (attr.name, attr.get_value_from.transformer.first_cut, \
...    attr.get_value_from.transformer.step)
D_sepal length: first interval at 4.900, step 0.600
D_sepal width: first interval at 2.400, step 0.400
D_petal length: first interval at 1.980, step 0.980
D_petal width: first interval at 0.500, step 0.400

All discretizers have the method construct_variable:

.. literalinclude:: code/discretization.py
    :lines: 69-73


Equal-frequency discretization infers the cut-off points so that the discretization intervals contain approximately equal numbers of training data instances.

.. attribute:: n

    Number of discretization intervals (default: 4).

The resulting discretizer is of class :class:`IntervalDiscretizer`. Its transformer includes points that store the inferred cut-offs.
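
As an illustration, a minimal sketch of equal-frequency discretization on the iris data set; it assumes the method is exposed as ``EqualFreq`` in this module (the class name here is an assumption, check the class listing)::

    import Orange

    data = Orange.data.Table("iris")
    # EqualFreq is assumed to be the equal-frequency discretization class
    equal_freq = Orange.feature.discretization.EqualFreq(n=4)
    pet_l = equal_freq("petal length", data)

    # the transformer is an IntervalDiscretizer; points hold the cut-offs
    print pet_l.get_value_from.transformer.points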

Entropy-based discretization, as originally proposed by [FayyadIrani93]_, infers the most appropriate number of intervals by recursively splitting the domain of the continuous feature so as to minimize the class entropy of the training examples. The splitting is repeated until the decrease in entropy is smaller than the increase in minimal description length (MDL) induced by the new cut-off point.

Entropy-based discretization may reduce a continuous feature to a single interval if no suitable cut-off points are found. In this case the new feature is constant and can be removed. This discretization can therefore also serve to identify non-informative features and can thus be used for feature subset selection.
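
The following sketch, assuming the standard iris data set, illustrates this use: a feature whose entropy discretization yields fewer than two values carries no information about the class and is a candidate for removal::

    import Orange

    data = Orange.data.Table("iris")
    entropy = Orange.feature.discretization.Entropy()

    for attr in data.domain.attributes:
        disc = entropy(attr, data)
        if len(disc.values) < 2:
            print "%s: single interval, candidate for removal" % attr.name
        else:
            print "%s: %d intervals" % (attr.name, len(disc.values))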

.. attribute:: force_attribute

    Forces the algorithm to induce at least one cut-off point, even when
    its information gain is lower than MDL (default: ``False``).

Part of :download:`discretization.py <code/discretization.py>`:

.. literalinclude:: code/discretization.py
    :lines: 77-80

The output shows that all attributes are discretized into three intervals::

    sepal length: <5.5, 6.09999990463>
    sepal width: <2.90000009537, 3.29999995232>
    petal length: <1.89999997616, 4.69999980927>
    petal width: <0.600000023842, 1.0000004768>

Bi-modal discretization infers two cut-off points that maximize the difference between the class distribution of the data instances in the middle interval and those in the other two intervals. The difference is scored by the chi-square statistic. All possible pairs of cut-off points are examined, so the discretization runs in O(n^2). This method is especially suitable for attributes in which the middle region corresponds to normal and the outer regions to abnormal values of the feature.

.. attribute:: split_in_two

    Decides whether the resulting attribute should have three or two values.
    If ``True`` (default), the feature will be discretized to three
    intervals and the discretizer is of type :class:`BiModalDiscretizer`.
    If ``False`` the result is the ordinary :class:`IntervalDiscretizer`.

The Iris data set has a three-valued class attribute. The figure below, drawn using LOESS probability estimation, shows that the sepal lengths of versicolors lie between those of setosas and virginicas.

.. image:: files/bayes-iris.gif

If we merge the classes setosa and virginica, we can check whether bi-modal discretization correctly recognizes the interval in which versicolors dominate. The following script performs the merging and constructs a new data set with a class that reports whether an iris is a versicolor or not.

.. literalinclude:: code/discretization.py
    :lines: 84-87

The following script implements the discretization:

.. literalinclude:: code/discretization.py
    :lines: 97-100

The middle intervals are printed::

    sepal length: (5.400, 6.200]
    sepal width: (2.000, 2.900]
    petal length: (1.900, 4.700]
    petal width: (0.600, 1.600]

Judging by the graph, the cut-off points inferred by discretization for "sepal length" make sense.

Discretizers
------------

Discretizers construct a categorical feature from a continuous feature according to the method they implement and its parameters. The most general is :class:`IntervalDiscretizer`, which is also used by most discretization methods. Two other discretizers, :class:`EquiDistDiscretizer` and :class:`ThresholdDiscretizer`, could easily be replaced by :class:`IntervalDiscretizer` but are used for speed and simplicity. The fourth discretizer, :class:`BiModalDiscretizer`, is specialized for discretizations induced by :class:`BiModalDiscretization`.

The base discretizer class implements the construction of a new attribute from an existing one.

.. method:: construct_variable(variable)

    Constructs a descriptor for a new variable. The new variable's
    name is equal to ``variable.name`` prefixed by "D\_". Its
    symbolic values are specific to the discretizer.

A discretizer defined by a set of cut-off points.

.. attribute:: points

    The cut-off points; feature values below or equal to the first point will be mapped to the first interval,
    those between the first and the second point
    (including those equal to the second) are mapped to the second interval and
    so forth to the last interval which covers all values greater than
    the last value in ``points``. The number of intervals is thus
    ``len(points)+1``.

The script that follows is an example of manual construction of a discretizer with cut-off points at 3.0 and 5.0:

.. literalinclude:: code/discretization.py
    :lines: 22-26

The first five data instances of ``data2`` are::

    [5.1, '>5.00', 'Iris-setosa']
    [4.9, '(3.00, 5.00]', 'Iris-setosa']
    [4.7, '(3.00, 5.00]', 'Iris-setosa']
    [4.6, '(3.00, 5.00]', 'Iris-setosa']
    [5.0, '(3.00, 5.00]', 'Iris-setosa']

The same discretizer can be used on several features by calling the function ``construct_var``:

.. literalinclude:: code/discretization.py
    :lines: 30-34

Each feature has its own instance of :class:`ClassifierFromVar` stored in ``get_value_from``, but all use the same :class:`IntervalDiscretizer`, ``idisc``. Changing any element of its ``points`` affects all attributes.

.. note::

    The length of :obj:`~IntervalDiscretizer.points` should not be changed
    while the discretizer is in use by any attribute: it must always match
    the number of values of the feature, which is determined by the length
    of the attribute's field ``values``. If ``attr`` is a discretized
    attribute, then ``len(attr.values)`` must equal
    ``len(attr.get_value_from.transformer.points) + 1``.
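
A small sketch of this shared-discretizer behaviour, assuming an :class:`IntervalDiscretizer` constructed by hand on the iris data set (the cut-off values below are illustrative only)::

    import Orange

    data = Orange.data.Table("iris")
    idisc = Orange.feature.discretization.IntervalDiscretizer(points=[2.0, 4.0])

    pet_l = idisc.construct_variable(data.domain["petal length"])
    pet_w = idisc.construct_variable(data.domain["petal width"])

    # both variables use the same transformer, so moving a cut-off point
    # changes the discretization of both
    idisc.points[0] = 2.5
    print pet_l.get_value_from.transformer.points
    print pet_w.get_value_from.transformer.points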

Discretizes to intervals of fixed width. All values lower than :obj:`~EquiDistDiscretizer.first_cut` are mapped to the first interval. Otherwise, value ``val`` is mapped to interval ``1 + floor((val - first_cut) / step)``; values beyond the last cut-off are mapped to the last interval.

.. attribute:: first_cut

    The first cut-off point.

.. attribute:: step

    Width of the intervals.

.. attribute:: n

    Number of the intervals.

.. attribute:: points (read-only)

    The cut-off points; this is not a real attribute, although it behaves
    like one. Reading it constructs a list of cut-off points and returns it,
    but changing the list does not affect the discretizer. It is only present
    to give :obj:`EquiDistDiscretizer` the same interface as
    :obj:`IntervalDiscretizer`.
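
For illustration, a sketch of the mapping described above; the discretizer is constructed by hand with made-up cut-off values (keyword construction with the attributes listed above is an assumption), and the helper function merely restates the interval formula::

    import math
    import Orange

    # four intervals: <2.0, [2.0, 2.5), [2.5, 3.0), >=3.0
    edisc = Orange.feature.discretization.EquiDistDiscretizer(
        first_cut=2.0, step=0.5, n=4)
    print edisc.points   # read-only cut-offs derived from first_cut, step, n

    def interval_index(val, first_cut=2.0, step=0.5, n=4):
        # 0 below first_cut; otherwise 1 + floor((val - first_cut) / step),
        # clipped to the last interval
        if val < first_cut:
            return 0
        return min(1 + int(math.floor((val - first_cut) / step)), n - 1)

    print interval_index(2.7)   # -> 2, i.e. the interval [2.5, 3.0)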

The threshold discretizer converts continuous values into binary ones by comparing them to a fixed threshold. Orange uses this discretizer for binarization of continuous attributes in decision trees.

.. attribute:: threshold

    The value threshold; values below or equal to the threshold belong to the first
    interval and those that are greater go to the second.
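
A brief sketch of such binarization, assuming the iris data set; the threshold of 3.0 is chosen only for illustration::

    import Orange

    data = Orange.data.Table("iris")
    tdisc = Orange.feature.discretization.ThresholdDiscretizer(threshold=3.0)
    sep_w_bin = tdisc.construct_variable(data.domain["sepal width"])

    new_domain = Orange.data.Domain([sep_w_bin], data.domain.class_var)
    print Orange.data.Table(new_domain, data)[0]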

The bi-modal discretizer has two cut-off points; values are discretized according to whether or not they fall into the region between these points, which includes the lower but not the upper boundary. This discretizer is returned by :class:`BiModalDiscretization` if its field :obj:`~BiModalDiscretization.split_in_two` is true (the default).

.. attribute:: low

    Lower boundary of the interval (included in the interval).

.. attribute:: high

    Upper boundary of the interval (not included in the interval).
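
An illustrative sketch of a manually constructed bi-modal discretizer; the boundaries below are made-up values for the iris "petal length" feature::

    import Orange

    data = Orange.data.Table("iris")
    bdisc = Orange.feature.discretization.BiModalDiscretizer(low=1.9, high=4.7)
    pet_l_bi = bdisc.construct_variable(data.domain["petal length"])

    # symbolic values constructed by the discretizer (middle region vs. outside)
    print pet_l_bi.values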

References
----------

.. [FayyadIrani93] UM Fayyad and KB Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In *Proc. 13th International Joint Conference on Artificial Intelligence*, pages 1022-1029, Chambery, France, 1993.