Simple Statistics

This document provides an introduction to descriptive statistics. It defines key statistical concepts such as populations, samples, parameters, statistics, qualitative and quantitative variables, and measures of central tendency and dispersion. Measures of central tendency discussed are the mean, median, and mode. Measures of dispersion examined are range, variance and standard deviation. Examples are provided using sample data on student height and age.

Uploaded by

harsman
Fall 2020 – Business Maths & Statistics, A1


Simple Statistics - univariate series. Measures of central
tendency and measures of dispersion.

What is statistics?

Statistics is
1. the science and art of collecting, organising, describing, presenting and analysing
data, which may be quantitative or qualitative (descriptive statistics);
2. it uses samples drawn from a previously defined population in order to establish
properties concerning the full population, as well as to formulate (predict) possible
future developments (inferential statistics).

The goal of descriptive statistics is the study of a population (a set of individuals or
entities) or some characteristics of this population via the collection and analysis of
data concerning (all) the individuals, or those in a subset of the full population (such a
subset is called a sample).

Population – the set of things that we set out to investigate. The elements of a
population are called the individuals.

Sample – a subset of a previously defined population.




A parameter – some (usually numerical) characteristic of a property of a population.
Examples: a mean, a proportion, a variance …

A statistic – a (usually numerical) characteristic of a property of a sample, generally
used to estimate a corresponding population parameter. For example: the mean of a
sample is used to estimate the mean of the population from which it was drawn.

The characteristic that is studied (for example the weight of a sample of the students
at this school, their grades or their eye-colour) is called a random variable.

Qualitative variable – expresses a non-numeric characteristic, like gender, eye-colour,
nationality, operating system …

Quantitative variable – the characteristic has a numeric value, it can be expressed
numerically. We distinguish discrete quantitative variables, like age (in years), the
number of students in a class or the number of bedrooms in a house (the value of a
discrete variable is generally obtained by counting), and continuous quantitative
variables, like air pressure in a tire, the room temperature at noon, or the weight of
students (the value of a continuous variable is generally obtained by measuring).

In this first introduction to statistics we will concentrate on quantitative variables,
properties of individuals the values of which are found by measuring or counting.

In the majority of practical situations it will be infeasible or impossible to collect and
analyse data on all individuals of a given population. A statistical study therefore
usually starts with the collection of data from a random sample of the population.

In class we started our study of statistics by taking a random sample of size 10 of the
population of all students in group 101. On that sample we determined the values of
two random variables, the variable height H and the variable age A.

Here are the results:

Height H (cm)    161  172  170  164  168  165  163  160  160  177
Age A (years)     17   18   18   19   18   17   18   18   18   18

[ Height H is a continuous random variable. We measure height and, in principle,
we can do that with as much precision as we like; all real numbers within a certain
range can occur as values for H.

Age A is a discrete random variable. A value for A is obtained by counting the number
of full years that have passed since an individual was born. Only integer numbers
(between 0 and, say, 130) can occur as values for A. ]

In a second step, we provide a numerical summary of the data that we collected. Such
a summary consists of two types of measures. Measures of central tendency
(also called measures of location) indicate where on the number line the data
can be found, i.e. around which values they are located. Measures of dispersion indicate
the degree of variation among the data.

Measures of central tendency 


The 3 M’s


1. The mean of a set of observed values of the random variable X is the arithmetical
average of those values. If the size of our data set (population or sample) is n and the
individual values are $x_1, x_2, \dots, x_n$, then the mean is equal to
$\frac{1}{n}\sum_{i=1}^{n} x_i$. (In case our data are obtained from a full population,
the mean is usually indicated by the Greek letter $\mu$ (or, if we want to specify the
random variable, by $\mu_X$); in case the values are obtained from a sample, we will
indicate the mean by $\bar{X}$.)

For the observed values of age and height in our sample: $\bar{A} = 17.9$ and
$\bar{H} = 166$.

2. The median (or second quartile) of a set of observed values of the random variable
X is a number, indicated by $Q_2(X)$, such that half of the values are smaller than or
equal to $Q_2(X)$, and half of the values are greater than or equal to $Q_2(X)$. The
median divides the data set into two equal parts.

To find the median of a given set of quantitative data, proceed as follows:

a. First order the values, in ascending or descending order.
b. If the size n of the set is odd, then the median is the value at the midpoint of
the list, i.e. it is the ((n + 1)/2)-th number in the list. For example, if our
data set has size 39, order the data, and find the 20th number in the list. That is the
median.
c. If the size n of the set is even, then we take the average of the (n/2)-th and
the (n/2 + 1)-th values in the list as the median.

For the observed values of age and height in our sample: $Q_2(A) = 18$ and
$Q_2(H) = 164.5$.

3. The mode is a measure that indicates which value(s) occur(s) most often in the data
set. In case of a discrete variable this is a simple matter of counting: the mode for Age
is obviously 18. (In case of Height, we might say that the mode is 160, even though in
case of a continuous variable —for obvious reasons— it is almost always more
informative to speak of a modal class, meaning a certain range that contains most of
the values in the set.)
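
The three measures can be recomputed with a short Python sketch. It is only an illustration of the procedures described above, not part of the original course material; the names `heights`, `ages`, `mean`, `median` and `modes` are our own choices.

```python
from collections import Counter

# Sample data from the class example (heights in cm, ages in years).
heights = [161, 172, 170, 164, 168, 165, 163, 160, 160, 177]
ages = [17, 18, 18, 19, 18, 17, 18, 18, 18, 18]


def mean(values):
    """Arithmetic mean: the sum of the values divided by their number."""
    return sum(values) / len(values)


def median(values):
    """Median following steps a-c: sort, then take the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd size: the (n + 1)/2-th value (1-based), i.e. index (n - 1) // 2.
        return ordered[(n - 1) // 2]
    # Even size: average of the n/2-th and (n/2 + 1)-th values (1-based).
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


def modes(values):
    """All values that occur most often (there may be more than one)."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


print(mean(ages), mean(heights))      # 17.9 166.0
print(median(ages), median(heights))  # 18.0 164.5
print(modes(ages), modes(heights))    # [18] [160]
```

Returning a list from `modes` is a design choice of this sketch: a data set can have several values tied for the highest count, and a list makes that visible.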

Here is a visualisation of the Age values in a so-called bar graph:


There is a bar for each of the three age values; its length corresponds to the number of
occurrences of that value in the sample.

A different visualisation that we can use is the so-called pie chart.

Measures of dispersion

The range of the values, i.e. the difference between the max(imum) (the biggest
value) and the min(imum) (the smallest value) in a data set is an obvious first indicator
of how spread out the observed values are on the number line. In our example, the
range of Age is 19-17 = 2, and that of Height is 177-160 = 17.
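
As a small illustration, the two ranges can be checked in Python. The helper name `value_range` is hypothetical and chosen only for this sketch.

```python
# Sample data from the class example (heights in cm, ages in years).
heights = [161, 172, 170, 164, 168, 165, 163, 160, 160, 177]
ages = [17, 18, 18, 19, 18, 17, 18, 18, 18, 18]


def value_range(values):
    """Range: the difference between the largest and the smallest value."""
    return max(values) - min(values)


print(value_range(ages), value_range(heights))  # 2 17
```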

However, the range does not tell us much about the degree of variation that
is found in the data, since it depends only on the two most extreme values.

As an indicator of this variation, we use what is, basically, the average of the
squared distances between each of the values and their mean.
This is the essence of what is determined in the so-called variance and its square root,
the standard deviation.

Attention! The calculation of these measures for a sample is a little different from
that same calculation in case of a full population.

Here are the formulas:

variance for a population: $\sigma_X^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$ ; standard deviation: $\sigma_X = \sqrt{\sigma_X^2}$.

variance for a sample: $S_X^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{X})^2$ ; standard deviation: $S_X = \sqrt{S_X^2}$.

So, for our sample, we calculate the variance of Age as:

$S_A^2 = \frac{2 \times (17 - 17.9)^2 + 7 \times (18 - 17.9)^2 + (19 - 17.9)^2}{10 - 1} = 0.32$,

and the standard deviation $S_A = \sqrt{0.32} = 0.57$.
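
The difference between the population and the sample formula can be made concrete with a minimal Python sketch. The function name `variance` and its `sample` flag are assumptions made for this illustration; Python's standard `statistics` module offers `variance` (sample) and `pvariance` (population) for the same purpose.

```python
import math

# Ages (in years) of the 10 students in the sample.
ages = [17, 18, 18, 19, 18, 17, 18, 18, 18, 18]


def variance(values, sample=True):
    """Sum of squared deviations from the mean, divided by n - 1 for a
    sample or by n for a full population."""
    n = len(values)
    m = sum(values) / n
    squared_deviations = sum((x - m) ** 2 for x in values)
    divisor = n - 1 if sample else n
    return squared_deviations / divisor


sample_var = variance(ages)                    # divides by 9
sample_sd = math.sqrt(sample_var)
population_var = variance(ages, sample=False)  # divides by 10

print(round(sample_var, 2), round(sample_sd, 2))  # 0.32 0.57
print(round(population_var, 2))                   # 0.29
```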

The calculation ‘by hand’¹ of the variance of the Height data is somewhat more
lengthy. The following table shows how to proceed, step by step.

height h_i    height - mean    (height - mean)²
161           -5               25
172            6               36
170            4               16
164           -2                4
168            2                4
165           -1                1
163           -3                9
160           -6               36
160           -6               36
177           11              121
                               Sum = 288

So the sample variance of Height is
$\frac{1}{10-1}\sum_{i=1}^{10}(h_i - \text{mean})^2 = 288/9 = 32$, and the
sample standard deviation therefore $\sqrt{32} = 5.66$.
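
The step-by-step table can be reproduced with a short Python sketch. The variable names are our own; the steps mirror the columns of the table (deviation from the mean, then its square, then the sum).

```python
# Heights (in cm) of the 10 students in the sample.
heights = [161, 172, 170, 164, 168, 165, 163, 160, 160, 177]

mean_height = sum(heights) / len(heights)

# Column 2: deviation of each height from the mean.
deviations = [h - mean_height for h in heights]
# Column 3: squared deviations.
squared = [d ** 2 for d in deviations]

total = sum(squared)
sample_variance = total / (len(heights) - 1)  # sample formula: divide by n - 1
sample_sd = sample_variance ** 0.5

for h, d, s in zip(heights, deviations, squared):
    print(f"{h:>6} {d:>6.0f} {s:>6.0f}")
print("Sum =", total)
print(sample_variance, round(sample_sd, 2))
```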

¹ The statistical functions on most scientific calculators and in software tools like Excel allow for more
efficient and less time-consuming ways to calculate the variance of data sets.
Note that because we squared the differences between the values of Height and their
mean, the dimension of variance (in this example) is cm². The square root of the
variance, the standard deviation, brings us back to the original dimension, i.e. cm.

Frequency distributions
The values of a continuous quantitative random variable are often summarised in a so-
called frequency distribution. We divide the range of the values into a certain number
of disjoint (but adjacent) intervals (the ‘classes’ or ‘bins’ of the distribution), and then
count how many values are contained in each of the intervals, i.e. we determine the
frequency of values in the respective classes.
As an example, to make a frequency distribution for the values in the sample of
students’ heights, we can choose intervals with a width of 5 cm, closed to the left,
and open to the right. We choose four of them to cover the range of values that we
found.

                               [160,165[   [165,170[   [170,175[   [175,180[
frequency                      5           2           2           1
freq. percentage               50 %        20 %        20 %        10 %
cumulative freq. percentage    50 %        70 %        90 %        100 %

The cumulative frequency percentages indicate e.g. that 70% of the students in the
sample had a height of less than 170 centimetres. And we can similarly read in
the frequency percentages row that 40% of the students in the sample had a
height of at least 165 and less than 175 centimetres.
We visualise a frequency distribution in a so-called histogram.
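
The frequency distribution of the heights can be rebuilt from the raw data with a minimal Python sketch. The class boundaries in `edges` follow the four intervals chosen in the text (closed to the left, open to the right); all variable names are assumptions of this sketch.

```python
# Heights (in cm) of the 10 students in the sample.
heights = [161, 172, 170, 164, 168, 165, 163, 160, 160, 177]

# Boundaries of the classes [160,165[, [165,170[, [170,175[, [175,180[.
edges = [160, 165, 170, 175, 180]

# Count how many values fall into each class (left-closed, right-open).
frequencies = [0] * (len(edges) - 1)
for h in heights:
    for k in range(len(edges) - 1):
        if edges[k] <= h < edges[k + 1]:
            frequencies[k] += 1
            break

n = len(heights)
percentages = [100 * f / n for f in frequencies]

# Running total of the percentages gives the cumulative row.
cumulative = []
running = 0.0
for p in percentages:
    running += p
    cumulative.append(running)

print(frequencies)   # [5, 2, 2, 1]
print(percentages)   # [50.0, 20.0, 20.0, 10.0]
print(cumulative)    # [50.0, 70.0, 90.0, 100.0]
```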


Here is another example. For a small shop in Belleville the daily turnover (in thousands of
euros) on 20 random days in the past 6 months is given in the following table.
                               [0,5[    [5,10[   [10,15[   [15,20[
frequency f_i                  1        7        9         3
freq. percentage p_i           5 %      35 %     45 %      15 %
cumulative freq. percentage    5 %      40 %     85 %      100 %

In this case we do not know the individual values in the sample of the daily turnovers
of the shop. We only know how they are distributed over four adjacent ‘slots’ with a
‘width’ of 5000 €.
But we still can use the information thus provided to approximate the measures of
central tendency and of dispersion that numerically summarise the sample data. 

In each of the classes our ‘best guess’ can be no other than that each of the values will
be around the average value in that class. These average values are called the
‘midpoints’, mi. In the shop’s example these are 2.5, 7.5, 12.5 and 17.5k euros. So our
approximation of the mean will be:

$\frac{1}{20} \times (1 \times 2.5 + 7 \times 7.5 + 9 \times 12.5 + 3 \times 17.5) = \frac{1}{20} \times 220 = 11$ k euros.

It is the average of the midpoints weighted by frequencies. 



Written as a formula that will look like mean ≈ $\frac{\sum f_i \times m_i}{\sum f_i}$, or
(writing $p_i$ for the frequency percentages, expressed as fractions of 1), equivalently:
mean ≈ $\sum p_i \times m_i$.
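
Both forms of the approximation can be checked with a few lines of Python. This is only a sketch for the shop example; the names `midpoints`, `frequencies` and `proportions` are our own, and the proportions are the frequency percentages written as fractions of 1.

```python
# Class midpoints (in k euros) and frequencies from the shop example.
midpoints = [2.5, 7.5, 12.5, 17.5]
frequencies = [1, 7, 9, 3]

n = sum(frequencies)

# Form 1: midpoints weighted by frequencies, divided by the total frequency.
mean_from_frequencies = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Form 2: midpoints weighted by proportions (5 % -> 0.05, 35 % -> 0.35, ...).
proportions = [f / n for f in frequencies]
mean_from_proportions = sum(p * m for p, m in zip(proportions, midpoints))

print(mean_from_frequencies)            # 11.0
print(round(mean_from_proportions, 6))  # 11.0
```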
The mode in case of a frequency distribution, rather than a specific value, is a range of
values: the modal class is the class with the highest frequency of values. In this
example that would be the interval [10, 15[ .

The cumulative frequency percentages guide us in approximating the median and the
other quartiles. For this we assume that the increase of the values within each
of the classes is approximately linear, which allows us to proceed by linear
interpolation (Thales' theorem).

We will use the fact that the median is a value Q2 such that 50% of all values are
below and 50% of all values are above it. We can similarly determine the first and the
third quartile, Q1 and Q3, the first one being a value such that 25% of all values are
below and 75% are above it, the second being a value such that 75% of all values are
below, and 25% are above it.

Here is how we find the three quartiles by means of linear interpolation:


If, inside each class, the increase of the values (from 5 to 10, from 10 to 15 …) is
considered to be approximately linear, we can use Thales' theorem to write the
following equations and solve them to find approximate values for the quartiles:

$\frac{Q_1 - 5}{10 - 5} = \frac{25 - 5}{40 - 5} \implies Q_1 \approx 7.86$

$\frac{Q_2 - 10}{15 - 10} = \frac{50 - 40}{85 - 40} \implies Q_2 \approx 11.11$

$\frac{Q_3 - 10}{15 - 10} = \frac{75 - 40}{85 - 40} \implies Q_3 \approx 13.89$
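
The same interpolation can be written as a small Python function. This is a sketch under the linearity assumption stated above; the function name `quartile_by_interpolation` is hypothetical, and `cumulative` holds the cumulative percentage reached at each class boundary in `edges` (0 % at the lower end of the first class).

```python
# Class boundaries (in k euros) and the cumulative percentage at each boundary.
edges = [0, 5, 10, 15, 20]
cumulative = [0, 5, 40, 85, 100]


def quartile_by_interpolation(target):
    """Value below which `target` percent of the data lie, assuming the
    values increase linearly inside each class."""
    for k in range(len(edges) - 1):
        low_pct, high_pct = cumulative[k], cumulative[k + 1]
        if low_pct <= target <= high_pct:
            low, high = edges[k], edges[k + 1]
            # Same proportion as in the equations: (Q - low)/(high - low)
            # equals (target - low_pct)/(high_pct - low_pct).
            return low + (high - low) * (target - low_pct) / (high_pct - low_pct)
    raise ValueError("target must lie between 0 and 100")


q1 = quartile_by_interpolation(25)
q2 = quartile_by_interpolation(50)
q3 = quartile_by_interpolation(75)
print(round(q1, 2), round(q2, 2), round(q3, 2))  # 7.86 11.11 13.89
```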

Finally, to approximate the variance and standard deviation, we use (as for the
mean) the midpoints mi and the frequencies fi. Basically, we approximate the
variance as the average of the squares of the differences between the midpoints and the
approximated mean, weighted by the frequencies or frequency percentages. Only if
the sample size is known can we apply the usual ‘sample correction’. The
formulas to use therefore are the following:

in case of a sample with known sample size: variance ≈ $\frac{\sum f_i \times (m_i - \text{mean})^2}{(\sum f_i) - 1}$

in case of a population, or unknown sample size: variance ≈ $\sum p_i \times (m_i - \text{mean})^2$
(where the $p_i$ are the frequency percentages, expressed as fractions of 1).

So for the variance in the example we find:


$\frac{1}{20 - 1} \times (1 \times (2.5 - 11)^2 + 7 \times (7.5 - 11)^2 + 9 \times (12.5 - 11)^2 + 3 \times (17.5 - 11)^2) = \frac{1}{19} \times 305 = 16.053$

The standard deviation therefore is about 4.
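
The grouped-data variance can be verified with a minimal Python sketch. The variable names are our own; the second result, `variance_without_correction`, illustrates the formula for a population or an unknown sample size and is not a value given in the text.

```python
import math

# Class midpoints (in k euros) and frequencies from the shop example.
midpoints = [2.5, 7.5, 12.5, 17.5]
frequencies = [1, 7, 9, 3]

n = sum(frequencies)
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Squared differences between midpoints and mean, weighted by frequency.
sum_of_squares = sum(
    f * (m - grouped_mean) ** 2 for f, m in zip(frequencies, midpoints)
)

# Known sample size: apply the sample correction (divide by n - 1).
sample_variance = sum_of_squares / (n - 1)
sample_sd = math.sqrt(sample_variance)

# Population or unknown sample size: no correction (divide by n).
variance_without_correction = sum_of_squares / n

print(sum_of_squares)                                  # 305.0
print(round(sample_variance, 3), round(sample_sd, 2))  # 16.053 4.01
print(variance_without_correction)                     # 15.25
```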
