.. py:currentmodule:: Orange.statistics.distribution

.. index:: Distributions

============================
Distributions (distribution)
============================

:obj:`Distribution` and derived classes store empirical distributions of discrete and continuous variables.

The :obj:`Distribution` class can store absolute or relative frequencies. Its constructor is a convenience constructor that builds an instance of the appropriate derived class.

>>> import Orange
>>> data = Orange.data.Table("adult_sample")
>>> disc = Orange.statistics.distribution.Distribution("workclass", data)
>>> print disc
<685.000, 72.000, 28.000, 29.000, 59.000, 43.000, 2.000>
>>> print type(disc)
<type 'DiscDistribution'>

The resulting distribution is of type :obj:`DiscDistribution` since the variable ``workclass`` is discrete. The printed numbers are the counts of examples that have each particular attribute value.

>>> workclass = data.domain["workclass"]
>>> for i in range(len(workclass.values)):
...     print "%20s: %5.3f" % (workclass.values[i], disc[i])
         Private: 685.000
Self-emp-not-inc: 72.000
    Self-emp-inc: 28.000
     Federal-gov: 29.000
       Local-gov: 59.000
       State-gov: 43.000
     Without-pay: 2.000
    Never-worked: 0.000

Distributions resemble dictionaries: they support indexing by instances of :obj:`Orange.data.Value`, by integers or floats (depending on the distribution type), and by symbolic names (if :obj:`variable` is defined).

For instance, the number of examples with ``workclass="Private"`` can be obtained in three ways::

    print "Private: ", disc["Private"]
    print "Private: ", disc[0]
    print "Private: ", disc[Orange.data.Value(workclass, "Private")]

Elements cannot be removed from distributions.

The length of a distribution equals the number of possible values for discrete distributions (if :obj:`variable` is set), the value with the highest index encountered (if the distribution is discrete and :obj:`variable` is :obj:`None`), or the number of different values encountered (for continuous distributions).

.. attribute:: variable

    Variable to which the distribution applies; may be :obj:`None` if not
    applicable.

.. attribute:: unknowns

    The number of instances for which the value of the variable was
    undefined.

.. attribute:: abs

    Sum of all elements in the distribution. Usually it equals either
    :obj:`cases` if the instance stores absolute frequencies or 1 if the
    stored frequencies are relative, e.g. after calling :obj:`normalize`.

.. attribute:: cases

    The number of instances from which the distribution is computed,
    excluding those on which the value was undefined. If instances were
    weighted, this is the sum of weights.

.. attribute:: normalized

    :obj:`True` if distribution is normalized.

.. attribute:: random_generator

    A pseudo-random number generator (:obj:`Orange.misc.Random`) used by
    method :obj:`random`.

.. method:: __init__(variable[, data[, weightId=0]])

    Construct either :obj:`DiscDistribution` or :obj:`ContDistribution`,
    depending on the variable type. If the variable is the only argument, it
    must be an instance of :obj:`Orange.feature.Descriptor`. In that case,
    an empty distribution is constructed. If data is given as well, the
    variable can also be specified by name or index in the
    domain. The constructor then computes the distribution of the specified
    variable on the given data. If instances are weighted, the id of the
    meta-attribute with weights can be passed as the third argument.

    If variable is given by descriptor, it doesn't need to exist in the
    domain, but it must be computable from given instances. For example, the
    variable can be a discretized version of a variable from data.
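The bookkeeping behind :obj:`unknowns`, :obj:`cases` and :obj:`abs` can be illustrated in plain Python, independently of Orange. This is only a sketch, not Orange's implementation. The helper name ``count_values``, the use of ``None`` for an undefined value, and the choice to weight unknowns like the other counts are all assumptions made for the example.

```python
from collections import OrderedDict

def count_values(values, weights=None):
    """Tally weighted counts; None stands in for an undefined value (assumption)."""
    if weights is None:
        weights = [1.0] * len(values)
    counts = OrderedDict()
    unknowns = 0.0
    for value, weight in zip(values, weights):
        if value is None:
            # Undefined values are tallied separately (weighted here by assumption)
            unknowns += weight
        else:
            counts[value] = counts.get(value, 0.0) + weight
    # Sum of weights over instances with a defined value
    cases = sum(counts.values())
    return counts, unknowns, cases

observed = ["Private", "Local-gov", None, "Private", "State-gov"]
counts, unknowns, cases = count_values(observed)
```

While the counts stay absolute, their sum equals ``cases``, which is the situation the description of :obj:`abs` refers to.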

.. method:: keys()

    Return a list of possible values (if distribution is discrete and
    :obj:`variable` is set) or a list of encountered values otherwise.

.. method:: values()

    Return a list of frequencies of the values returned by :obj:`keys`.

.. method:: items()

    Return a list of pairs of elements of the above lists.

.. method:: native()

    Return the distribution as a list (for discrete distributions) or as a
    dictionary (for continuous distributions).

.. method:: add(value[, weight=1])

    Increase the count of the element corresponding to ``value`` by
    ``weight``.

    :param value: Value
    :type value: :obj:`Orange.data.Value`, string (if :obj:`variable` is set), :obj:`int` for discrete distributions or :obj:`float` for continuous distributions
    :param weight: Weight to be added to the count for ``value``
    :type weight: float

.. method:: normalize()

    Divide the counts by their sum, set :obj:`normalized` to :obj:`True` and
    :obj:`abs` to 1. Attributes :obj:`cases` and :obj:`unknowns` are
    unchanged. This changes absolute frequencies into relative ones.
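The effect of normalization can be sketched in plain Python using the ``workclass`` counts printed earlier. This is an illustration rather than Orange's code. The function name and the error raised for an all-zero distribution are assumptions of the sketch.

```python
def normalize(counts):
    """Return relative frequencies that sum to 1; the input list is not modified."""
    total = float(sum(counts))
    if total == 0:
        # Behaviour for an empty distribution is an assumption of this sketch
        raise ValueError("cannot normalize a distribution whose counts sum to 0")
    return [c / total for c in counts]

absolute = [685.0, 72.0, 28.0, 29.0, 59.0, 43.0, 2.0, 0.0]
relative = normalize(absolute)
cases = sum(absolute)  # the number of cases is not affected by normalization
```

After the call the relative frequencies sum to 1, while ``cases`` still reports how many instances the distribution was computed from.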

.. method:: modus()

    Return the most common value. If there are multiple such values, one is
    chosen at random, although the chosen value will always be the same for
    the same distribution.

.. method:: random()

    Return a random value based on the stored empirical probability
    distribution. For continuous distributions, this will always be one of
    the values which actually appeared (e.g. one of the values from
    :obj:`keys`).

    The method uses :obj:`random_generator`. If none has been constructed or
    assigned yet, a new one is constructed and stored for further use.
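Drawing from a stored empirical distribution can be sketched with Python's standard ``random`` module. The code below is not how Orange implements :obj:`random`; it is one common technique (inverting the cumulative sum of frequencies), and ``sample_index`` is a hypothetical helper name. The generator is created once and reused, mirroring the role of :obj:`random_generator`.

```python
import random

def sample_index(frequencies, rng):
    """Pick an index with probability proportional to its frequency."""
    total = float(sum(frequencies))
    threshold = rng.random() * total
    running = 0.0
    for index, freq in enumerate(frequencies):
        running += freq
        if threshold < running:
            return index
    # Guard against floating-point round-off: use the last non-zero entry
    return max(i for i, f in enumerate(frequencies) if f > 0)

rng = random.Random(42)  # created once and reused for every draw
frequencies = [0.5, 0.3, 0.2]
draws = [sample_index(frequencies, rng) for _ in range(1000)]
tally = [draws.count(i) for i in range(len(frequencies))]
```

Because frequencies are used only relative to their sum, the same function works for absolute counts and for normalized distributions.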

Class :obj:`Discrete` stores a discrete distribution of values. It differs from its parent class, :obj:`Distribution`, in having a few additional constructors.

.. method:: __init__(variable)

    Construct an instance of :obj:`Discrete` and set the variable
    attribute.

    :param variable: A discrete variable
    :type variable: Orange.feature.Discrete

.. method:: __init__(frequencies)

    Construct an instance and initialize the frequencies from the list, but
    leave `Distribution.variable` empty.

    :param frequencies: A list of frequencies
    :type frequencies: list

    A distribution constructed in this way can be used, for instance, to
    generate random numbers from a given discrete distribution::

        disc = Orange.statistics.distribution.Discrete([0.5, 0.3, 0.2])
        for i in range(20):
            print disc.random(),

    This prints out approximately ten 0's, six 1's and four 2's. The values
    can be named by assigning a variable::

        v = Orange.feature.Discrete(values=["red", "green", "blue"])
        disc.variable = v

.. method:: __init__(distribution)

    Copy constructor; makes a shallow copy of the given distribution.

    :param distribution: An existing discrete distribution
    :type distribution: Discrete

Class :obj:`Continuous` stores a continuous distribution, that is, a dictionary-like structure with values and their frequencies.

.. method:: __init__(variable)

    Construct an instance of :obj:`Continuous` and set the variable
    attribute.

    :param variable: A continuous variable
    :type variable: Orange.feature.Continuous

.. method:: __init__(frequencies)

    Construct an instance of :obj:`Continuous` and initialize it from
    the given dictionary with frequencies, whose keys and values must be integers.

    :param frequencies: Values and their corresponding frequencies
    :type frequencies: dict

.. method:: __init__(distribution)

    Copy constructor; makes a shallow copy of the given distribution.

    :param distribution: An existing continuous distribution
    :type distribution: Continuous

.. method:: average()

    Return the average value. Note that the average can also be
    computed using simpler and faster classes from module
    :obj:`Orange.statistics.basic`.

.. method:: var()

    Return the variance of distribution.

.. method:: dev()

    Return the standard deviation.

.. method:: error()

    Return the standard error.
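The statistics returned by ``average``, ``var``, ``dev`` and ``error`` can be sketched on a plain dictionary of values and frequencies. The text does not say whether Orange uses the population or the sample variance, so the population form below is an assumption, as is the helper name ``weighted_stats``.

```python
import math

def weighted_stats(freqs):
    """freqs maps each observed value to its (possibly weighted) frequency."""
    total = float(sum(freqs.values()))
    mean = sum(v * f for v, f in freqs.items()) / total
    # Population variance; a sample variance would divide by (total - 1) instead
    variance = sum(f * (v - mean) ** 2 for v, f in freqs.items()) / total
    deviation = math.sqrt(variance)
    # Standard error of the mean
    std_error = deviation / math.sqrt(total)
    return mean, variance, deviation, std_error

ages = {20.0: 2, 30.0: 1, 40.0: 1}
mean, variance, deviation, std_error = weighted_stats(ages)
```

For this toy input the mean is 27.5 and the variance is 68.75. Switching to the sample variance changes only the divisor.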

.. method:: percentile(p)

    Return the value at the `p`-th percentile.

    :param p: The percentile, must be between 0 and 100
    :type p: float
    :rtype: float

    For example, if `d_age` is a continuous distribution, the quartiles can
    be printed by ::

        print "Quartiles: %5.3f - %5.3f - %5.3f" % (
             d_age.percentile(25), d_age.percentile(50), d_age.percentile(75))
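One way to compute a percentile from values and their frequencies is the nearest-rank rule: return the smallest value whose cumulative frequency reaches ``p`` percent of the total. The sketch below uses that rule as an assumption. The text does not specify whether Orange interpolates between values, so results may differ from :obj:`percentile`.

```python
def percentile(freqs, p):
    """Smallest observed value whose cumulative frequency reaches p percent."""
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    total = float(sum(freqs.values()))
    target = total * p / 100.0
    running = 0.0
    for value in sorted(freqs):
        running += freqs[value]
        if running >= target:
            return value
    # Only reachable through floating-point round-off
    return max(freqs)

d_age = {18.0: 1, 25.0: 2, 40.0: 1}
quartiles = [percentile(d_age, p) for p in (25, 50, 75)]
```

With four instances in total, the quartile targets are cumulative counts of 1, 2 and 3, which fall on the values 18, 25 and 25.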

.. method:: density(x)

     Return the probability density at `x`. If the value is not in
     :obj:`Distribution.keys`, it is interpolated.

A class imitating :obj:`Continuous` by returning the statistics and densities for a Gaussian distribution. The class is meant only as a convenient substitute in code which expects an instance of :obj:`Distribution`. For general use, the Python module :obj:`random` provides a comprehensive set of functions for various random distributions.

.. attribute:: mean

    The mean of the Gaussian distribution.

.. attribute:: sigma

    The standard deviation of the distribution.

.. attribute:: abs

    The simulated number of instances; in effect, the Gaussian distribution
    density, as returned by method :obj:`density`, is multiplied by
    :obj:`abs`.

.. method:: __init__([mean=0, sigma=1])

    Construct an instance, set :obj:`mean` and :obj:`sigma` to the given
    values and :obj:`abs` to 1.

.. method:: __init__(distribution)

    Construct a distribution which approximates the given distribution,
    which must be either :obj:`Continuous`, in which case its
    average and deviation will be used for mean and sigma, or an existing
    :obj:`GaussianDistribution`, which will be copied. Attribute :obj:`abs`
    is set to the given distribution's ``abs``.

.. method:: average()

    Return :obj:`mean`.

.. method:: dev()

    Return :obj:`sigma`.

.. method:: var()

    Return the square of :obj:`sigma`.

.. method:: density(x)

    Return the density at point ``x``, that is, the Gaussian distribution
    density multiplied by :obj:`abs`.
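The scaled density described above follows directly from the normal probability density function. The sketch below is a plain-Python illustration rather than Orange's code. The function name is hypothetical, and the parameter ``total`` plays the role of :obj:`abs`.

```python
import math

def gaussian_density(x, mean=0.0, sigma=1.0, total=1.0):
    """Normal probability density at x, multiplied by total (the role of abs)."""
    z = (x - mean) / sigma
    return total * math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

peak = gaussian_density(0.0)                 # standard normal density at its mean
scaled = gaussian_density(0.0, total=100.0)  # same point with 100 simulated instances
```

With ``total`` left at 1 the function is the ordinary density, and any other value simply rescales the curve.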

Class distributions
-------------------

There is a convenience function for computing empirical class distributions from data.

.. function:: getClassDistribution(data[, weightID=0])

    Return a class distribution for the given data.

    :param data: A set of instances.
    :type data: Orange.data.Table
    :param weightID: An id for meta attribute with weights of instances
    :type weightID: int
    :rtype: :obj:`Discrete` or :obj:`Continuous`, depending on the class type

Distributions of all variables
------------------------------

Distributions of all variables can be computed and stored in :obj:`Domain`. The list-like object can be indexed by variable indices in the domain, as well as by variables and their names.

.. method:: __init__(data[, weightID=0])

    Construct an instance with distributions of all discrete and continuous
    variables from the given data.

    :param data: A set of instances.
    :type data: Orange.data.Table
    :param weightID: An id for meta attribute with weights of instances
    :type weightID: int

The script below computes distributions for all attributes in the data and prints out distributions for discrete and averages for continuous attributes::

    dist = Orange.statistics.distribution.Domain(data)

    for d in dist:
        if d.variable.var_type == Orange.feature.Type.Discrete:
            print "%30s: %s" % (d.variable.name, d)
        else:
            print "%30s: avg. %5.3f" % (d.variable.name, d.average())

The distribution for, say, attribute ``age`` can be obtained by its index and also by its name::

    dist_age = dist["age"]