05 Descriptive Statistics - Distribution
In statistics, the distribution of data usually refers to how the data is spread when graphed. Previously,
measures of dispersion were discussed, in which a numerical value was used to describe how the data is
spread. Several patterns of distribution have been identified, and these patterns are used frequently
in inferential statistics.
In inferential statistics, a sample of data is used to determine something about the population. Hence,
when the data is assumed to fit a particular type of distribution, more information can be gathered.
There are many types of distributions, but this section will focus on the uniform distribution and the
normal distribution. The uniform distribution will mainly be used to introduce terminology and concepts
used in distribution analysis.
Before beginning on distributions, some terminology and concepts from probability theory will be
formalized. You should be familiar with the idea of probability, so this will not be discussed here.
Probability theory is the mathematical formalization of probability, but this is beyond the scope of this
class. If you decide to look into it, you will see that our usual statistics shares some terminology with
probability theory, but the way it is presented is a bit different.
Probability Theory
To reiterate, only the terminology and concepts from probability theory that help in understanding
distributions will be discussed here. This is not a proper introduction to probability theory.
In probability theory, a trial, or experiment, is any procedure that can be repeated infinitely many times
and has a well-defined set of possible results from running the experiment. In probability theory, these
possible results are called outcomes and the set of all possible outcomes is called the sample space.
“Well-defined” is a mathematical concept whose definition varies slightly depending on what is being
discussed. In this case, the set is well-defined, so what you have been taught about sets being well-
defined applies (if you forgot, the basic idea is that a set is well-defined if it’s clear whether something
belongs in the set or not).
Performing a trial results in only one outcome. To be clear, a trial may result in any of the outcomes in
the sample space, but once it is executed, the result must be only one. For example, when you roll a six-
sided die, the sample space is that any one of the six sides is face-up. But after you throw the die, it is
not possible for more than one side to be face up. For the more “technical” amongst us, yes, depending
on where and how you throw it, it may be possible that the die ends up on a corner or an edge. If you
really want to account for these outcomes, you should include them in your sample space. Just note that
this tends to complicate analysis, so unless necessary for your needs, only the simple cases of one face
up are considered.
A trial is said to be deterministic if it has only one possible outcome. That is, there is only one element in
the sample space. Most of the experiments in your science classes are deterministic. If there are at least
2 possible outcomes, then the trial is said to be random.
In probability theory, a number between zero and one (including zero and one) is associated with each
outcome. This number is called the probability of the outcome, and it describes the likelihood that
the outcome will occur. A value of zero means the outcome will never occur, while a value of one means
the outcome will always occur. Note that when you add the probabilities of all the outcomes in the sample
space, you should get exactly one. This means that you are guaranteed that one of the outcomes will
occur. You should be familiar with this idea from basic probability.
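As a small illustration of the two rules above (each probability lies between zero and one, and the
probabilities over the sample space add up to exactly one), here is a minimal Python sketch. It assumes
a fair six-sided die, which is an assumption made only for this example; the names sample_space and
probability are also just illustrative.

```python
from fractions import Fraction

# Sample space for one roll of a six-sided die.
sample_space = [1, 2, 3, 4, 5, 6]

# Assumption for illustration: the die is fair, so each outcome gets 1/6.
probability = {outcome: Fraction(1, 6) for outcome in sample_space}

# Rule 1: every probability is between 0 and 1 (inclusive).
all_valid = all(0 <= p <= 1 for p in probability.values())

# Rule 2: the probabilities over the whole sample space add up to exactly 1.
total = sum(probability.values())

print(all_valid, total)
```

Fractions are used instead of decimals so that the sum is exactly one rather than approximately one.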
A random variable is a variable used to represent outcomes. If you recall your basic math, variables
usually represent some number that is to be determined. In statistics, a random variable represents an
outcome. A common name for a random variable is the letter “x”, the same letter you are used to seeing
for ordinary variables. Note, though, that to take advantage of numerical
processes, outcomes are usually represented by numbers, so you will frequently see random variables
taking numerical values. For these cases, keep in mind that these numbers actually
represent outcomes.
For example, when throwing a coin, we may let H represent the outcome that a heads lands face up,
while T represents the outcome that a tails lands face up. Assuming that there is a 50% chance for each
to occur, we may write
P( x = H ) = 0.5
which can be read as “the probability that the random variable x is the outcome represented by H is 0.5”.
Or, less formally, “the probability that the outcome is H is 0.5”.
Note that the use of the letters T and H is arbitrary. Any letter, number, or symbol may be used.
Sometimes, numerical labelling is convenient though, for example, if we are labelling the outcomes for
rolling a die, we can use the number 1 to represent the outcome when a one lands face up, the number
2 to represent the outcome when a two lands face up, etc.
A probability distribution describes the probabilities for each value of the random variable.
Distributions are usually expressed as a formula, graph, or table.
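To make the "table" form of a distribution concrete, the following Python sketch stores the coin
distribution from the example as a lookup table and reads off P( x = H ). The names coin and prob are
hypothetical helpers chosen for this sketch, and the 50% chance for each side is the same assumption
made in the text.

```python
# A probability distribution written as a table: value of x -> probability.
# Assumption (as in the text): heads and tails each have a 50% chance.
coin = {"H": 0.5, "T": 0.5}

def prob(distribution, value):
    """Return P(x = value); values outside the sample space get probability 0."""
    return distribution.get(value, 0.0)

print(prob(coin, "H"))  # P( x = H )
print(prob(coin, "T"))  # P( x = T )
```

The same helper works unchanged for a die if the table uses the labels 1 through 6.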
A discrete probability distribution is a distribution whose random variable is discrete. There are many
discrete distributions, but two of the more common ones are the binomial probability distribution and
the Poisson probability distribution. The binomial distribution is a distribution for the number of
successes in a fixed number of independent trials, where each trial has only two possible outcomes
(usually called success and failure). The Poisson distribution is a distribution
for the number of occurrences of an event over an interval (which may be time, length, etc.).
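These notes do not give the formulas for these two distributions, so the following Python sketch uses
the standard textbook formulas as an illustration only. The binomial formula counts successes over n
repeated two-outcome trials, and the Poisson formula uses the average number of occurrences per
interval. The function names, and the parameter names n, p, and rate, are choices made for this sketch.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(x = k): exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, rate):
    """P(x = k): exactly k occurrences in an interval, where `rate`
    is the average number of occurrences per interval."""
    return rate**k * exp(-rate) / factorial(k)

# Exactly 2 heads in 4 tosses of a fair coin.
print(binomial_pmf(2, 4, 0.5))

# No occurrences in an interval that averages 2 occurrences.
print(poisson_pmf(0, 2.0))
```

In both cases the formula returns the probability of a single value of the random variable, which is
what makes these distributions discrete.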
A continuous probability distribution is a distribution whose random variable is continuous. There are
many types of continuous distributions; the two that will be discussed are the continuous uniform
probability distribution and the normal probability distribution. Recall (from calculus ....) that the area
under a curve represents the sum of the function values over the interval. Thus, if the formula for a
continuous distribution is integrable, you may use calculus to determine probabilities.
There is an important difference between discrete and continuous distributions. For discrete
distributions, the formula usually gives the probability of a single value of the random variable. In
continuous distributions, because of the nature of continuity, the probability of any single value is
always zero, but the probability of a group of values (such as an interval) may not be zero. Thus, the
formula doesn’t give the probability of a single value; it is used to get the probability of a group of
values. There are omitted details regarding how the group is formed, but just be aware that there are
technical considerations involved. Because of this, the area under a portion of the graph represents the
probability of that group of values.
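The idea that probability is an area can be checked numerically. The sketch below approximates the area
under a curve with thin rectangles (a midpoint sum). The density used, f(x) = 2x between 0 and 1, is a
hypothetical example chosen only because its areas are easy to verify by hand; it does not come from
these notes.

```python
def area_under(density, a, b, steps=100_000):
    """Approximate the area under `density` between a and b with a midpoint sum."""
    width = (b - a) / steps
    return sum(density(a + (i + 0.5) * width) for i in range(steps)) * width

def density(x):
    # Hypothetical density for illustration: f(x) = 2x on [0, 1], 0 elsewhere.
    return 2 * x if 0 <= x <= 1 else 0.0

print(area_under(density, 0, 1))      # total area, approximately 1
print(area_under(density, 0.5, 1))    # P( 0.5 < x < 1 ), approximately 0.75
print(area_under(density, 0.5, 0.5))  # a single value has zero width, so zero area
```

The last line shows why the probability of any single value is zero: an "interval" consisting of one
point has no width, so it has no area.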
The following link describes some other common distributions:
https://www.analyticsvidhya.com/blog/2017/09/6-probability-distributions-data-science/
A continuous uniform distribution, or uniform distribution for short, is a distribution whose
formula is a constant. Its graph is a horizontal segment. As stated above, the area under the graph
represents the sum of all the probabilities, and it should be equal to one. The area under any portion of
the graph is a rectangle (or a square). This makes our life easier, as computing a probability
reduces to computing the area of a rectangle.
The following graph is an example of what the graph of a uniform distribution could look like.
You will notice that the random variable can take any real value between 1 and 3. Also, the y-value is
always 0.5. The graph also satisfies the requirement that the total area under the graph is one, since the
area is length times height. The length of the segment is 2 (coming from 3 – 1 = 2), and the height is 0.5
(the distance from the x-axis to the segment). So, the area is 2 × 0.5 = 1.
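The height of 0.5 is not arbitrary: it is the only height that makes the total area equal to one for a
segment of length 2. A minimal Python sketch of that relationship follows; the function name
uniform_height and the endpoint names a and b are chosen for this illustration.

```python
def uniform_height(a, b):
    """Height of the uniform distribution's graph between a and b,
    chosen so that length times height equals 1."""
    if b <= a:
        raise ValueError("the left endpoint must be smaller than the right endpoint")
    return 1 / (b - a)

height = uniform_height(1, 3)
total_area = (3 - 1) * height
print(height, total_area)  # 0.5 1.0
```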
In continuous distributions, random variables are usually represented by numbers, so when talking
about probabilities, there are additional notations formed by using inequalities. For example, for some
distribution we might write
P( x < 7 ) = 0.3
which can be read as “the probability that the random variable x is any outcome represented by any
number less than 7 is 0.3”. Or, less formally, “the probability that the outcome is less than 7 is 0.3”.
More examples, this time using the uniform distribution on the values between 1 and 3 described above:
P( x < 7 ) = 1 (all possible values of x are between 1 and 3, which are all less than 7, so it’s guaranteed
that x is less than 7)
P( x < 0.5 ) = 0 (all possible values of x are between 1 and 3, none of which are less than 0.5, so it’s
guaranteed that x is never less than 0.5)
P( x < 2 ) = 0.5 (from the length being 2 – 1 = 1, and the height being 0.5, so the area is length times
height, which would be 1 × 0.5 = 0.5)
P( x < 2.5 ) = 0.75 (from the length being 2.5 – 1 = 1.5, and the height being 0.5, so the area is length
times height, which would be 1.5 × 0.5 = 0.75)
P( x > 0.2 ) = 1 (all possible values of x are between 1 and 3, which are all greater than 0.2, so it’s
guaranteed that x is greater than 0.2)
P( x > 5 ) = 0 (all possible values of x are between 1 and 3, none of which are greater than 5, so it’s
guaranteed that x is never greater than 5)
P( x > 2 ) = 0.5 (from the length being 3 – 2 = 1, and the height being 0.5, so the area is length times
height, which would be 1 × 0.5 = 0.5)
P( x > 2.2 ) = 0.4 (from the length being 3 – 2.2 = 0.8, and the height being 0.5, so the area is length
times height, which would be 0.8 × 0.5 = 0.4)
P( 1.5 < x < 2.1 ) = 0.3 (from the length being 2.1 – 1.5 = 0.6, and the height being 0.5, so the area is
length times height, which would be 0.6 × 0.5 = 0.3)
P( 1.1 < x < 2.9 ) = 0.9 (from the length being 2.9 – 1.1 = 1.8, and the height being 0.5, so the area is
length times height, which would be 1.8 × 0.5 = 0.9)
P( x < 1.5 or x > 2.1 ) = 0.7 (from two areas being formed. One area is formed with the length being 1.5
– 1 = 0.5, and the height being 0.5, so the area is length times height, which would be 0.5 × 0.5 = 0.25.
The other area is formed with the length being 3 – 2.1 = 0.9, and the height being 0.5, so the area is
length times height, which would be 0.9 × 0.5 = 0.45. So the total area is 0.25 + 0.45 = 0.7).
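All of the computations above follow one pattern: find the part of the interval that overlaps the
segment from 1 to 3, then multiply its length by the height. The following Python sketch captures that
pattern. The function name uniform_prob and its parameters are hypothetical, and the defaults a = 1 and
b = 3 match the example distribution.

```python
def uniform_prob(lo, hi, a=1.0, b=3.0):
    """P( lo < x < hi ) for x uniform between a and b:
    length of the overlap with [a, b] times the height."""
    height = 1 / (b - a)
    overlap = max(0.0, min(hi, b) - max(lo, a))
    return overlap * height

inf = float("inf")

print(round(uniform_prob(-inf, 7), 2))     # P( x < 7 )
print(round(uniform_prob(-inf, 0.5), 2))   # P( x < 0.5 )
print(round(uniform_prob(-inf, 2.5), 2))   # P( x < 2.5 )
print(round(uniform_prob(2.2, inf), 2))    # P( x > 2.2 )
print(round(uniform_prob(1.5, 2.1), 2))    # P( 1.5 < x < 2.1 )

# P( x < 1.5 or x > 2.1 ): the two pieces do not overlap, so their areas add.
print(round(uniform_prob(-inf, 1.5) + uniform_prob(2.1, inf), 2))
```

Results are rounded to two decimal places only to hide tiny floating-point differences; the underlying
arithmetic is the same length-times-height computation used in the examples.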
Remember, the distribution of the data is under descriptive statistics because it describes how the data
is spread. There are formulas/procedures to determine how close a given data set is to a particular
distribution. Just note that there are many types of distributions. The continuous uniform distribution is
very simple. Its graph is just a line segment, so the variations between different uniform distributions are
mainly in the length of the segment and where it is placed. Although there are some real-world cases of
uniform distributions, in this class, it has mainly been used to introduce notations and concepts.
Another type of distribution that you will frequently encounter is the normal distribution. It is described
by the following formula:
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}} \]
The only variable in the above equation is x. You should be familiar with the constant π being
approximately 3.14; likewise, e is a constant, approximately 2.718. You’ve encountered the Greek letters
μ and σ previously, and these just represent constants in the formula. Not coincidentally, they also
represent the mean and standard deviation, respectively, of the distribution. The graphs of normal
distributions are usually described as “bell-shaped”.
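For readers who find code easier to read than the formula, here is a minimal Python sketch of the normal
distribution's formula, with mu standing for the mean μ and sigma for the standard deviation σ. The
function name normal_pdf is chosen for this illustration.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Value of the normal distribution's formula f(x) for mean mu
    and standard deviation sigma."""
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2 * pi))

# The curve is highest at the mean and symmetric around it (the "bell" shape).
print(normal_pdf(0, 0, 1))                        # height at the mean
print(normal_pdf(-1, 0, 1), normal_pdf(1, 0, 1))  # equal heights on both sides
```

Keep in mind that, as with every continuous distribution, these values are heights of the graph, not
probabilities; probabilities come from areas under the graph.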
As you can see from the formula, there are many variations of the normal distribution, depending on the
values of the mean and standard deviation. Also, if you recall your calculus, areas under curves can be
computed using integration. Unfortunately, the above function can’t be integrated “nicely”, that is, its
antiderivative doesn’t have a closed form. In other words, it’s not easy to compute the areas, so it’s not
easy to compute the probabilities.
The good news is that instead of studying each distinct normal distribution, we can study a particular
normal distribution, and we can use information from this to get information about the other normal
distributions. The standard normal distribution is the normal distribution when μ=0 and σ =1.
The standard normal distribution has been studied, and tables for areas have been made. The most used
table is the z-table.
If you have forgotten how to read z-tables, you may find a guide here:
https://towardsdatascience.com/how-to-use-and-create-a-z-table-standard-normal-table-240e21f36e53
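A z-table can also be reproduced with a few lines of code. The sketch below is an illustration, not the
method used in this class: it relies on the error function erf from Python's math module to get the
area under the standard normal curve to the left of a value z, which is what many z-tables list. It also
uses the usual conversion z = (x – μ) / σ to relate any normal distribution to the standard one; that
conversion is not spelled out in this section and is stated here as background. The function names are
hypothetical.

```python
from math import erf, sqrt

def standard_normal_cdf(z):
    """Area under the standard normal curve (mean 0, standard deviation 1)
    to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def normal_prob_less_than(x, mu, sigma):
    """P( X < x ) for a normal distribution with mean mu and standard
    deviation sigma, found by converting x to a z-value first."""
    z = (x - mu) / sigma
    return standard_normal_cdf(z)

print(round(standard_normal_cdf(0), 4))              # 0.5
print(round(standard_normal_cdf(1.96), 4))           # 0.975
print(round(normal_prob_less_than(110, 100, 10), 4)) # same as z = 1: 0.8413
```

The last line shows how one table serves every normal distribution: a value of 110 in a distribution with
mean 100 and standard deviation 10 corresponds to z = 1 in the standard normal distribution.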