Simple Statistics
Simple Statistics - univariate series. Measures of central
tendency and measures of dispersion.
What is statistics?
Statistics is
1. the science and art of collecting, organising, describing, presenting and analysing
data, which may be quantitative or qualitative (descriptive statistics);
2. the use of samples drawn from a previously defined population to establish
properties of the full population, and to predict possible future developments
(inferential statistics).
Population – the set of things that we set out to investigate. The elements of a
population are called the individuals.
The characteristic that is studied (for example the weight of a sample of the students
at this school, their grades or their eye-colour) is called a random variable.
In class we started our study of statistics by taking a random sample of size 10 of the
population of all students in group 101. On that sample we determined the values of
two random variables, the variable height H and the variable age A.
Height H (cm)    161  172  170  164  168  165  163  160  160  177
Age A (years)     17   18   18   19   18   17   18   18   18   18
Age A is a discrete random variable. A value for A is obtained by counting the number
of full years that have passed since an individual was born. Only integers
(between 0 and, say, 130) can occur as values for A.
The 3 M’s
1. The mean of a set of observed values of the random variable X is the arithmetic
average of those values. If the size of our data set (population or sample) is n and the
individual values are $x_1, x_2, \ldots, x_n$, then the mean is equal to
$\frac{1}{n}\sum_{i=1}^{n} x_i$. (If our data are obtained from a full population, the
mean is usually denoted by the Greek letter $\mu$ (or, if we want to specify the random
variable, by $\mu_X$); if the values are obtained from a sample, we will denote the
mean by $\bar{X}$.)
2. The median (or second quartile) of a set of observed values of the random variable
X is a number, denoted $Q_2(X)$, such that half of the values are smaller than or equal
to $Q_2(X)$ and half of the values are greater than or equal to $Q_2(X)$. The median
divides the data set into two equal parts.
3. The mode is a measure that indicates which value(s) occur(s) most often in the data
set. For a discrete variable this is a simple matter of counting: the mode for Age
is 18, which occurs seven times. (For Height, we might say that the mode is 160, which
occurs twice. For a continuous variable, however, exact repeats are rare and largely a
matter of rounding, so it is almost always more informative to speak of a modal class,
meaning the range that contains most of the values in the set.)
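As a quick check of the three measures on the class sample, here is a minimal Python sketch. It assumes Python 3.8 or later and uses only the standard `statistics` module; the variable names `height` and `age` are chosen here for illustration.

```python
from statistics import mean, median, multimode

# Sample of size 10 taken in class from group 101.
height = [161, 172, 170, 164, 168, 165, 163, 160, 160, 177]  # cm
age = [17, 18, 18, 19, 18, 17, 18, 18, 18, 18]               # full years

for name, values in (("Height", height), ("Age", age)):
    print(name)
    print("  mean   :", mean(values))       # arithmetic average
    print("  median :", median(values))     # middle of the sorted values
    print("  mode(s):", multimode(values))  # most frequent value(s)
```

With an even number of values, `median` returns the average of the two middle values of the sorted data, which is one common convention for the second quartile.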
Measures of dispersion
The range of the values, i.e. the difference between the max(imum) (the biggest
value) and the min(imum) (the smallest value) in a data set is an obvious first indicator
of how spread out the observed values are on the number line. In our example, the
range of Age is 19-17 = 2, and that of Height is 177-160 = 17.
However the range of course does not tell us much about the degree of variation that
is found in the data.
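The two ranges quoted above can be reproduced with a short Python sketch (the helper name `value_range` is an illustrative choice, not part of the course material):

```python
height = [161, 172, 170, 164, 168, 165, 163, 160, 160, 177]  # cm
age = [17, 18, 18, 19, 18, 17, 18, 18, 18, 18]               # full years

def value_range(values):
    """Difference between the largest and the smallest value."""
    return max(values) - min(values)

print("range of Age   :", value_range(age))
print("range of Height:", value_range(height))
```

Note that the range only depends on the two extreme values, which is why it says little about how the other values vary.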
Attention! The calculation of the variance and the standard deviation for a sample is
slightly different from the same calculation for a full population.
Variance for a sample: $S_X^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{X})^2$;
standard deviation: $S_X = \sqrt{S_X^2}$.
The calculation ‘by hand’1 of the variance of the Height data is somewhat more
lengthy. The following table shows how to proceed, step by step.
So the sample variance of Height is
$\frac{1}{10-1}\sum_{i=1}^{10} (h_i - \text{mean})^2 = 288/9 = 32$, and the
sample standard deviation is therefore $\sqrt{32} \approx 5.66$.
1 The statistical functions on most scientific calculators, and software tools like Excel,
allow for more efficient and less time-consuming ways to calculate the variance of data sets.
Fall 2020 – Business Maths & Statistics, A1
Note that because we squared the differences between the values of Height and their
mean, the unit of the variance (in this example) is cm². The square root of the
variance, the standard deviation, brings us back to the original unit, i.e. cm.
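The step-by-step calculation of the sample variance of Height can be sketched in Python as follows. This is only an illustration of the sample formula (division by n − 1); the variable names are chosen here and are not part of the course material.

```python
from math import sqrt

height = [161, 172, 170, 164, 168, 165, 163, 160, 160, 177]  # cm

n = len(height)
mean_h = sum(height) / n                     # sample mean in cm

# One row per student: value, deviation from the mean, squared deviation.
deviations = [h - mean_h for h in height]
squared = [d ** 2 for d in deviations]
for h, d, s in zip(height, deviations, squared):
    print(f"{h:6d} {d:8.1f} {s:8.1f}")

sum_sq = sum(squared)                        # sum of squared deviations (cm^2)
sample_variance = sum_sq / (n - 1)           # sample correction: divide by n - 1
sample_sd = sqrt(sample_variance)            # back to cm

print("sum of squares :", sum_sq)
print("sample variance:", sample_variance)
print("sample st. dev.:", round(sample_sd, 2))
```

The standard library function `statistics.variance` applies the same n − 1 correction, so it can be used to cross-check the result.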
Frequency distributions
The values of a continuous quantitative random variable are often summarised in a so-
called frequency distribution. We divide the range of the values into a certain number
of disjoint (but adjacent) intervals (the ‘classes’ or ‘bins’ of the distribution), and then
count how many values are contained in each of the intervals, i.e. we determine the
frequency of values in the respective classes.
As an example, to make a frequency distribution for the values in the sample of
students’ heights, we can choose intervals with a width of 5 cm, closed on the left
and open on the right. We choose four of them to cover the range of values that we
found.
The cumulative frequency percentages indicate e.g. that 70% of the students in the
sample had a height of less than 170 centimetres. Similarly, we can read in
the frequency percentages row that 40% of the students in the sample had a height
of at least 165 and less than 175 centimetres.
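The frequency distribution of the heights can be rebuilt with the Python sketch below. It assumes that the four classes start at 160 cm, i.e. [160, 165[, [165, 170[, [170, 175[ and [175, 180[; this starting point is an assumption made here, chosen because it covers the sample and agrees with the percentages quoted above.

```python
height = [161, 172, 170, 164, 168, 165, 163, 160, 160, 177]  # cm

lower, width, n_classes = 160, 5, 4   # assumed class layout (see above)

# Count the values per class; each class is closed on the left, open on the right.
counts = [0] * n_classes
for h in height:
    counts[(h - lower) // width] += 1

n = len(height)
percentages = [100 * c / n for c in counts]

# Cumulative percentages: running total of the frequency percentages.
cumulative = []
running = 0.0
for p in percentages:
    running += p
    cumulative.append(running)

for k in range(n_classes):
    a = lower + k * width
    print(f"[{a}, {a + width}[  f = {counts[k]}  "
          f"p = {percentages[k]:.0f} %  cumulative = {cumulative[k]:.0f} %")
```

The integer division `(h - lower) // width` gives the index of the class that contains `h`, so a value such as 165 falls in the second class and not in the first.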
We visualise a frequency distribution in a so-called histogram.
Here is another example. For a small shop in Belleville the daily turnover (in thousands
of euros) on 20 random days in the past 6 months is given in the following table.
class (k euros)                [0, 5[    [5, 10[   [10, 15[   [15, 20[
frequency fi                      1         7          9          3
freq. percentage pi               5 %      35 %       45 %       15 %
cumulative freq. percentage       5 %      40 %       85 %      100 %
In this case we do not know the individual values in the sample of the daily turnovers
of the shop. We only know how they are distributed over four adjacent ‘slots’ with a
‘width’ of 5000 €.
But we still can use the information thus provided to approximate the measures of
central tendency and of dispersion that numerically summarise the sample data.
In each of the classes our ‘best guess’ can be no other than that each of the values will
be around the average value in that class. These average values are called the
‘midpoints’, mi. In the shop’s example these are 2.5, 7.5, 12.5 and 17.5k euros. So our
approximation of the mean will be:
$\frac{1}{20} \times (1 \times 2.5 + 7 \times 7.5 + 9 \times 12.5 + 3 \times 17.5) = \frac{1}{20} \times 220 = 11$ k euros.
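The same approximation of the mean from grouped data, written as a minimal Python sketch (the list names `midpoints` and `frequencies` are illustrative):

```python
# Grouped data for the shop: class midpoints (k euros) and class frequencies.
midpoints = [2.5, 7.5, 12.5, 17.5]
frequencies = [1, 7, 9, 3]

n = sum(frequencies)                                   # sample size
weighted_sum = sum(f * m for f, m in zip(frequencies, midpoints))
approx_mean = weighted_sum / n                         # k euros

print("sample size     :", n)
print("weighted sum    :", weighted_sum)
print("approximate mean:", approx_mean, "k euros")
```

Each midpoint stands in for all the values of its class, so the result is an approximation of the true sample mean, not the mean itself.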
The cumulative frequency percentages guide us in approximating the median and the
other quartiles. We assume that the values increase approximately linearly within
each of the classes, which allows us to proceed by linear interpolation (Thales'
theorem).
We will use the fact that the median is a value Q2 such that 50% of all values are
below it and 50% of all values are above it. We can similarly determine the first and
the third quartile, Q1 and Q3: Q1 is a value such that 25% of all values are below it
and 75% are above it, and Q3 is a value such that 75% of all values are below it and
25% are above it. The cumulative percentage rises from 5% to 40% over the class
[5, 10[ and from 40% to 85% over the class [10, 15[, so Q1 lies in the former and
Q2 and Q3 lie in the latter.
$\frac{Q_1 - 5}{10 - 5} = \frac{25 - 5}{40 - 5} \implies Q_1 \approx 7.86$
$\frac{Q_2 - 10}{15 - 10} = \frac{50 - 40}{85 - 40} \implies Q_2 \approx 11.11$
$\frac{Q_3 - 10}{15 - 10} = \frac{75 - 40}{85 - 40} \implies Q_3 \approx 13.89$
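The three interpolations follow the same pattern, which can be sketched as a small Python function. The function name `interpolate_quantile` and the lists `bounds` and `cum_pct` are hypothetical names introduced for this illustration; the sketch assumes that every class has a non-zero frequency.

```python
# Class boundaries (k euros) and the cumulative percentage reached at each boundary.
bounds = [0, 5, 10, 15, 20]
cum_pct = [0, 5, 40, 85, 100]

def interpolate_quantile(target_pct):
    """Value below which roughly target_pct % of the data lie (linear interpolation)."""
    for k in range(1, len(bounds)):
        if cum_pct[k] >= target_pct:
            lo, hi = bounds[k - 1], bounds[k]        # class that contains the quantile
            p_lo, p_hi = cum_pct[k - 1], cum_pct[k]  # cumulative % at its two ends
            return lo + (hi - lo) * (target_pct - p_lo) / (p_hi - p_lo)
    raise ValueError("target_pct must be between 0 and 100")

q1 = interpolate_quantile(25)
q2 = interpolate_quantile(50)
q3 = interpolate_quantile(75)
print("Q1:", round(q1, 2), " Q2:", round(q2, 2), " Q3:", round(q3, 2))
```

The function first finds the class in which the cumulative percentage reaches the target, then solves the proportion shown above for the unknown quartile.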
Finally, to approximate the variance and standard deviation, we use (as for the
mean) the midpoints mi and the frequencies fi. We approximate the variance as the
average of the squared differences between the midpoints and the approximated mean,
weighted by the frequencies or frequency percentages. Only if the sample size is
known can we apply the usual ‘sample correction’ (dividing by n − 1 instead of n).
The formulas to use are therefore the following:
in case of a sample with known sample size:
variance ≈ $\frac{\sum f_i \times (m_i - \text{mean})^2}{(\sum f_i) - 1}$
in case of a population, or unknown sample size:
variance ≈ $\sum p_i \times (m_i - \text{mean})^2$
(where the $p_i$ are the frequency percentages).
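Applied to the shop example, the two formulas can be sketched in Python as follows. The sketch treats the frequency percentages as fractions (35 % = 0.35), and the variable names are illustrative.

```python
from math import sqrt

# Grouped data for the shop: class midpoints (k euros) and class frequencies.
midpoints = [2.5, 7.5, 12.5, 17.5]
frequencies = [1, 7, 9, 3]

n = sum(frequencies)
approx_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Known sample size: weight by frequencies and divide by n - 1.
weighted_sq = sum(f * (m - approx_mean) ** 2
                  for f, m in zip(frequencies, midpoints))
var_sample = weighted_sq / (n - 1)

# Population or unknown sample size: weight by frequency percentages (as fractions).
fractions = [f / n for f in frequencies]
var_population = sum(p * (m - approx_mean) ** 2
                     for p, m in zip(fractions, midpoints))

print("sample variance    :", round(var_sample, 2), " st. dev.:", round(sqrt(var_sample), 2))
print("population variance:", round(var_population, 2), " st. dev.:", round(sqrt(var_population), 2))
```

Since the turnovers are in k euros, both variances are in (k euros)² and both standard deviations are in k euros.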