.. py:currentmodule:: Orange.statistics.distribution
.. index:: Distributions
:obj:`Distribution` and derived classes store empirical distributions of discrete and continuous variables.
This class can store absolute or relative frequencies. It provides a convenience constructor which constructs instances of derived classes.
>>> import Orange >>> data = Orange.data.Table("adult_sample") >>> disc = Orange.statistics.distribution.Distribution("workclass", data) >>> print disc <685.000, 72.000, 28.000, 29.000, 59.000, 43.000, 2.000> >>> print type(disc) <type 'DiscDistribution'>
The resulting distribution is of type :obj:`DiscDistribution` since variable workclass is discrete. The printed numbers are counts of examples that have particular attribute value.
>>> workclass = data.domain["workclass"] >>> for i in range(len(workclass.values)): ... print "%20s: %5.3f" % (workclass.values[i], disc[i]) Private: 685.000 Self-emp-not-inc: 72.000 Self-emp-inc: 28.000 Federal-gov: 29.000 Local-gov: 59.000 State-gov: 43.000 Without-pay: 2.000 Never-worked: 0.000
Distributions resembles dictionaries, supporting indexing by instances of :obj:`Orange.data.Value`, integers or floats (depending on the distribution type), and symbolic names (if :obj:`variable` is defined).
For instance, the number of examples with workclass="private", can be obtained in three ways:
print "Private: ", disc["Private"] print "Private: ", disc[0] print "Private: ", disc[orange.Value(workclass, "Private")]
Elements cannot be removed from distributions.
Length of distribution equals the number of possible values for discrete distributions (if :obj:`variable` is set), the value with the highest index encountered (if distribution is discrete and :obj: variable is :obj:`None`) or the number of different values encountered (for continuous distributions).
.. attribute:: variable Variable to which the distribution applies; may be :obj:`None` if not applicable.
.. attribute:: unknowns The number of instances for which the value of the variable was undefined.
.. attribute:: abs Sum of all elements in the distribution. Usually it equals either :obj:`cases` if the instance stores absolute frequencies or 1 if the stored frequencies are relative, e.g. after calling :obj:`normalize`.
.. attribute:: cases The number of instances from which the distribution is computed, excluding those on which the value was undefined. If instances were weighted, this is the sum of weights.
.. attribute:: normalized :obj:`True` if distribution is normalized.
.. attribute:: random_generator A pseudo-random number generator used for method :obj:`Orange.misc.Random`.
.. method:: __init__(variable[, data[, weightId=0]]) Construct either :obj:`DiscDistribution` or :obj:`ContDistribution`, depending on the variable type. If the variable is the only argument, it must be an instance of :obj:`Orange.feature.Descriptor`. In that case, an empty distribution is constructed. If data is given as well, the variable can also be specified by name or index in the domain. Constructor then computes the distribution of the specified variable on the given data. If instances are weighted, the id of meta-attribute with weights can be passed as the third argument. If variable is given by descriptor, it doesn't need to exist in the domain, but it must be computable from given instances. For example, the variable can be a discretized version of a variable from data.
.. method:: keys() Return a list of possible values (if distribution is discrete and :obj:`variable` is set) or a list encountered values otherwise.
.. method:: values() Return a list of frequencies of values such as described above.
.. method:: items() Return a list of pairs of elements of the above lists.
.. method:: native() Return the distribution as a list (for discrete distributions) or as a dictionary (for continuous distributions)
.. method:: add(value[, weight=1]) Increase the count of the element corresponding to ``value`` by ``weight``. :param value: Value :type value: :obj:`Orange.data.Value`, string (if :obj:`variable` is set), :obj:`int` for discrete distributions or :obj:`float` for continuous distributions :param weight: Weight to be added to the count for ``value`` :type weight: float
.. method:: normalize() Divide the counts by their sum, set :obj:`normalized` to :obj:`True` and :obj:`abs` to 1. Attributes :obj:`cases` and :obj:`unknowns` are unchanged. This changes absoluted frequencies into relative.
.. method:: modus() Return the most common value. If there are multiple such values, one is chosen at random, although the chosen value will always be the same for the same distribution.
.. method:: random() Return a random value based on the stored empirical probability distribution. For continuous distributions, this will always be one of the values which actually appeared (e.g. one of the values from :obj:`keys`). The method uses :obj:`random_generator`. If none has been constructed or assigned yet, a new one is constructed and stored for further use.
Stores a discrete distribution of values. The class differs from its parent class in having a few additional constructors.
.. method:: __init__(variable) Construct an instance of :obj:`Discrete` and set the variable attribute. :param variable: A discrete variable :type variable: Orange.feature.Discrete
.. method:: __init__(frequencies) Construct an instance and initialize the frequencies from the list, but leave `Distribution.variable` empty. :param frequencies: A list of frequencies :type frequencies: list Distribution constructed in this way can be used, for instance, to generate random numbers from a given discrete distribution:: disc = Orange.statistics.distribution.Discrete([0.5, 0.3, 0.2]) for i in range(20): print disc.random(), This prints out approximatelly ten 0's, six 1's and four 2's. The values can be named by assigning a variable:: v = orange.EnumVariable(values = ["red", "green", "blue"]) disc.variable = v
.. method:: __init__(distribution) Copy constructor; makes a shallow copy of the given distribution :param distribution: An existing discrete distribution :type distribution: Discrete
Stores a continuous distribution, that is, a dictionary-like structure with values and their frequencies.
.. method:: __init__(variable) Construct an instance of :obj:`ContDistribution` and set the variable attribute. :param variable: A continuous variable :type variable: Orange.feature.Continuous.. method:: __init__(frequencies) Construct an instance of :obj:`Continuous` and initialize it from the given dictionary with frequencies, whose keys and values must be integers. :param frequencies: Values and their corresponding frequencies :type frequencies: dict.. method:: __init__(distribution) Copy constructor; makes a shallow copy of the given distribution :param distribution: An existing continuous distribution :type distribution: Continuous.. method:: average() Return the average value. Note that the average can also be computed using a simpler and faster classes from module :obj:`Orange.statistics.basic`... method:: var() Return the variance of distribution... method:: dev() Return the standard deviation... method:: error() Return the standard error... method:: percentile(p) Return the value at the `p`-th percentile. :param p: The percentile, must be between 0 and 100 :type p: float :rtype: float For example, if `d_age` is a continuous distribution, the quartiles can be printed by :: print "Quartiles: %5.3f - %5.3f - %5.3f" % ( dage.percentile(25), dage.percentile(50), dage.percentile(75))
.. method:: density(x) Return the probability density at `x`. If the value is not in :obj:`Distribution.keys`, it is interpolated.
A class imitating :obj:`Continuous` by returning the statistics and densities for Gaussian distribution. The class is not meant only for a convenient substitution for code which expects an instance of :obj:`Distribution`. For general use, Python module :obj:`random` provides a comprehensive set of functions for various random distributions.
.. attribute:: mean The mean value parameter of the Gauss distribution.
.. attribute:: sigma The standard deviation of the distribution
.. attribute:: abs The simulated number of instances; in effect, the Gaussian distribution density, as returned by method :obj:`density` is multiplied by :obj:`abs`.
.. method:: __init__([mean=0, sigma=1]) Construct an instance, set :obj:`mean` and :obj:`sigma` to the given values and :obj:`abs` to 1.
.. method:: __init__(distribution) Construct a distribution which approximates the given distribution, which must be either :obj:`Continuous`, in which case its average and deviation will be used for mean and sigma, or and existing :obj:`GaussianDistribution`, which will be copied. Attribute :obj:`abs` is set to the given distribution's ``abs``.
.. method:: average() Return :obj:`mean`.
.. method:: dev() Return :obj:`sigma`.
.. method:: var() Return square of :obj:`sigma`.
.. method:: density(x) Return the density at point ``x``, that is, the Gaussian distribution density multiplied by :obj:`abs`.
There is a convenience function for computing empirical class distributions from data.
.. function:: getClassDistribution(data[, weightID=0]) Return a class distribution for the given data. :param data: A set of instances. :type data: Orange.data.Table :param weightID: An id for meta attribute with weights of instances :type weightID: int :rtype: :obj:`Discrete` or :obj:`Continuous`, depending on the class type
Distributions of all variables can be computed and stored in :obj:`Domain`. The list-like object can be indexed by variable indices in the domain, as well as by variables and their names.
.. method:: __init__(data[, weightID=0]) Construct an instance with distributions of all discrete and continuous variables from the given data.
param data: | A set of instances. |
---|---|
type data: | Orange.data.Table |
param weightID: | An id for meta attribute with weights of instances |
type weightID: | int |
The script below computes distributions for all attributes in the data and prints out distributions for discrete and averages for continuous attributes.
dist = Orange.statistics.distribution.Domain(data) for d in dist: if d.variable.var_type == Orange.feature.Type.Discrete: print "%30s: %s" % (d.variable.name, d) else: print "%30s: avg. %5.3f" % (d.variable.name, d.average())
The distribution for, say, attribute age can be obtained by its index and also by its name:
dist_age = dist["age"]