Lecture 3
and Statistics
1 Descriptive Statistics
1.1 Median
Definition: The median is the middle value of a data set after the data have been arranged in ascending or descending order. If the sample size is $n$, the median is obtained as follows:
If $n$ is odd, the median is the value in the $\left(\frac{n+1}{2}\right)^{\text{th}}$ position of the ordered data set.
If $n$ is even, the two middle values are in the $\left(\frac{n}{2}\right)^{\text{th}}$ and $\left(\frac{n+2}{2}\right)^{\text{th}}$ positions of the ordered data set, and the median is their average.
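The position rule in the definition can be sketched in Python as follows. This is a minimal illustration, not a reference implementation; the function name `median_by_position` and the two short lists are made up for the example.

```python
def median_by_position(data):
    """Median using the position rule: (n+1)/2 for odd n,
    average of positions n/2 and (n+2)/2 for even n."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    if n % 2 == 1:
        # Positions in the notes are 1-based, Python indices are 0-based.
        return ordered[(n + 1) // 2 - 1]
    lower_middle = ordered[n // 2 - 1]
    upper_middle = ordered[(n + 2) // 2 - 1]
    return (lower_middle + upper_middle) / 2


print(median_by_position([7, 1, 5]))     # odd n: value in the 2nd position
print(median_by_position([7, 1, 5, 3]))  # even n: average of 2nd and 3rd values
```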
Example 1:
Data: 2, 1, 1, 2, 3, 2, 3, 3, 4, 6, 4, 1, 2, 3, 2, 3, 5, 4, 4, 5
Median:
Ascending order: 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6.
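We have 2 values at the middle:
$\left(\frac{n}{2}\right)^{\text{th}}$ term $\Rightarrow \left(\frac{20}{2}\right)^{\text{th}}$ term = 10th term = 3
$\left(\frac{n+2}{2}\right)^{\text{th}}$ term $\Rightarrow \left(\frac{22}{2}\right)^{\text{th}}$ term = 11th term = 3
$\Rightarrow$ Median = Average $= \frac{3 + 3}{2} = 3$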
x f cf
1 3 3
2 5 8
3 5 13
4 4 17
5 2 19
6 1 20
20
The median lies between the 10th and 11th values. From the cf column, both values lie after the 8th position but not beyond the 13th position. Thus they are both 3.
Example 2:
Data: 10, 11, 10, 12, 11, 10, 11, 11, 11, 12, 13, 11, 12, 12, 13.
Cumulative frequency distribution table:
x f cf
10 3 3
11 6 9
12 4 13
13 2 15
15
The median position is the
$\left(\frac{n+1}{2}\right)^{\text{th}} = \left(\frac{16}{2}\right)^{\text{th}}$ = 8th position
$\Rightarrow$ Median = 11
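Reading the median off the cf column, as in Examples 1 and 2, can be sketched as below. The function name `median_from_frequency_table` is illustrative, and the sketch assumes the values are listed in ascending order.

```python
from itertools import accumulate


def median_from_frequency_table(values, freqs):
    """Median of ungrouped data given as (value, frequency) pairs.
    Assumes `values` is sorted in ascending order."""
    cum_freqs = list(accumulate(freqs))  # the cf column
    n = cum_freqs[-1]

    def value_at(position):
        # First value whose cumulative frequency reaches the position.
        for value, cf in zip(values, cum_freqs):
            if position <= cf:
                return value
        raise ValueError("position is beyond the end of the data")

    if n % 2 == 1:
        return value_at((n + 1) // 2)
    return (value_at(n // 2) + value_at((n + 2) // 2)) / 2


# Example 1: n = 20, so the 10th and 11th values are averaged.
print(median_from_frequency_table([1, 2, 3, 4, 5, 6], [3, 5, 5, 4, 2, 1]))
# Example 2: n = 15, so the 8th value is taken.
print(median_from_frequency_table([10, 11, 12, 13], [3, 6, 4, 2]))
```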
• Graphically - An Ogive
• Iterative formula
Example 3:
Class f cf
0-4 1 1
5-9 14 15
10-14 23 38
15-19 21 59
20-24 15 74
25-29 6 80
Middle position: $\left(\frac{n}{2}\right)^{\text{th}} = \left(\frac{80}{2}\right)^{\text{th}}$ = 40th position
We first need to make the data continuous (exclusive class intervals) if we are estimating the median, just like we do for the mode.
The new classes now become:
Class f cf
-0.5-4.5 1 1
4.5-9.5 14 15
9.5-14.5 23 38
14.5-19.5 21 59
19.5-24.5 15 74
24.5-29.5 6 80
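Iteratively, the median now becomes,
$$\Rightarrow \text{Median} = 14.5 + \frac{40 - 38}{21} \times 5 = 14.98$$
which is an application of the general formula
$$\Rightarrow \text{Median} = L_m + \frac{\frac{1}{2}N - cf_m}{f_m} \times i$$
where $L_m$ is the lower boundary of the median class, $cf_m$ is the cumulative frequency up to the class just before the median class, $f_m$ is the frequency of the median class, $N$ is the total frequency and $i$ is the class width.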
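The interpolation formula can be sketched in Python as follows. This is a minimal sketch under the assumption that all classes share the same width; the function name `grouped_median` and its parameter names are illustrative.

```python
def grouped_median(lower_bounds, freqs, width):
    """Median = L_m + ((N/2 - cf_m) / f_m) * i for grouped data.
    Assumes continuous classes of equal width, listed in ascending order."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("the table has no observations")
    target = n / 2          # the middle position
    cum_before = 0          # cumulative frequency before the current class
    for lower, f in zip(lower_bounds, freqs):
        if f > 0 and cum_before + f >= target:
            return lower + (target - cum_before) / f * width
        cum_before += f
    raise ValueError("median class not found")


# Example 3 with continuous class boundaries.
lower_bounds = [-0.5, 4.5, 9.5, 14.5, 19.5, 24.5]
freqs = [1, 14, 23, 21, 15, 6]
print(round(grouped_median(lower_bounds, freqs, 5), 2))  # 14.98
```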
We can also estimate it from the ogive, a graph of cumulative frequency (cf) against the variable, drawn for the continuous case.
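Reading a value off the ogive can be imitated numerically. The sketch below assumes the ogive is drawn with straight-line segments through the points (class boundary, cf), starting from cf = 0 at the lowest boundary; the function name `read_ogive` is illustrative.

```python
def read_ogive(boundaries, cum_freqs, target_cf):
    """Find the value of the variable at which a straight-line ogive
    reaches the cumulative frequency `target_cf`."""
    points = list(zip(boundaries, cum_freqs))
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c1 > c0 and c0 <= target_cf <= c1:
            # Linear interpolation along this segment of the ogive.
            return x0 + (target_cf - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("target is outside the range of the ogive")


# Ogive points for Example 3: class boundaries against cf.
boundaries = [-0.5, 4.5, 9.5, 14.5, 19.5, 24.5, 29.5]
cum_freqs = [0, 1, 15, 38, 59, 74, 80]
median_estimate = read_ogive(boundaries, cum_freqs, 40)
print(round(median_estimate, 2))  # 14.98
```

Under the straight-line assumption this gives the same estimate as the interpolation formula, since both interpolate linearly inside the median class.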
Other estimates:
Quartiles: The data are divided into 4 equal parts by three values: the 1st quartile (the lower quartile), the 2nd quartile (the median) and the 3rd quartile (the upper quartile).
[Figure: the ordered data split into the 1st, 2nd, 3rd and 4th quarters]
The lower quartile: The value separating the 1st quarter of the data from the other data values.
Estimated by applying the median concept:
$$\Rightarrow \text{Lower Quartile} = L_{LQ} + \frac{\frac{1}{4}N - cf_{LQ}}{f_{LQ}} \times i$$
The position $\frac{1}{4}N = \frac{80}{4} = 20$ is above 15 and below 38 in the cf column. It therefore lies in the 9.5 - 14.5 class. Thus,
$$LQ = 9.5 + \frac{20 - 15}{23} \times 5 = 9.5 + 1.09 = 10.59$$
The upper quartile: The value separating the first 3 quarters of the data from the last quarter, estimated as
$$\Rightarrow \text{Upper Quartile} = L_{UQ} + \frac{\frac{3}{4}N - cf_{UQ}}{f_{UQ}} \times i$$
The position $\frac{3}{4}N = \frac{3 \times 80}{4} = 60$ is above 59 and below 74 in the cf column. It therefore lies in the 19.5 - 24.5 class. Thus
$$UQ = 19.5 + \frac{60 - 59}{15} \times 5 = 19.5 + 0.33 = 19.83$$
Inter-quartile range: The difference between the upper quartile and the lower quartile,
$$\Rightarrow IQR = UQ - LQ$$
From the above example, it then implies that,
$$IQR = 19.83 - 10.59 = 9.24$$
Coefficient of Quartile Deviation: This is a relative measure of dispersion based on quartiles, formally defined as
$$\text{Coefficient of Quartile Deviation} = \frac{Q_3 - Q_1}{Q_3 + Q_1} \times 100\%$$
It is a ratio of two quantities of the same dimension, hence unit-less. It is a good measure of dispersion for comparing 2 data sets with different units.
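The quartile-based measures can be computed together for the grouped data of Example 3. This is a sketch only: the helper `grouped_quantile` is an illustrative name, it assumes equal class widths, and it works with unrounded quartiles, so the last digit of a result can differ slightly from a hand calculation that rounds each quartile first.

```python
def grouped_quantile(lower_bounds, freqs, width, fraction):
    """Value below which `fraction` of the grouped data lies,
    using L + ((fraction * N - cf) / f) * i."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("the table has no observations")
    target = fraction * n
    cum_before = 0
    for lower, f in zip(lower_bounds, freqs):
        if f > 0 and cum_before + f >= target:
            return lower + (target - cum_before) / f * width
        cum_before += f
    raise ValueError("quantile class not found")


lower_bounds = [-0.5, 4.5, 9.5, 14.5, 19.5, 24.5]
freqs = [1, 14, 23, 21, 15, 6]

lq = grouped_quantile(lower_bounds, freqs, 5, 0.25)  # lower quartile, Q1
uq = grouped_quantile(lower_bounds, freqs, 5, 0.75)  # upper quartile, Q3
iqr = uq - lq
coeff_qd = (uq - lq) / (uq + lq) * 100               # in percent

print(round(lq, 2), round(uq, 2))  # 10.59 19.83
print(iqr, coeff_qd)
```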
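Deciles: Data divided into 10 equal parts; e.g. the 6th decile becomes
$$\Rightarrow \text{6th decile} = L_{6\text{th}\,d} + \frac{\frac{6}{10}N - cf_{6\text{th}\,d}}{f_{6\text{th}\,d}} \times i$$
The position $\frac{6}{10}N = \frac{6 \times 80}{10} = 48$ is above 38 and below 59 in the cf column. It therefore lies in the 14.5 - 19.5 class. Thus
$$\text{6th decile} = 14.5 + \frac{48 - 38}{21} \times 5 = 14.5 + 2.38 = 16.88$$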
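Percentiles: Data divided into 100 equal parts; e.g. the 65th percentile becomes
$$\Rightarrow \text{65th percentile} = L_{65\text{th}\,p} + \frac{\frac{65}{100}N - cf_{65\text{th}\,p}}{f_{65\text{th}\,p}} \times i$$
The position $\frac{65}{100}N = \frac{65 \times 80}{100} = 52$ is above 38 and below 59 in the cf column. It therefore lies in the 14.5 - 19.5 class. Thus
$$\text{65th percentile} = 14.5 + \frac{52 - 38}{21} \times 5 = 14.5 + 3.33 = 17.83$$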
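Deciles and percentiles use the same interpolation with a different position, k/10 or k/100 of N. A possible generalisation is sketched below; the function name `grouped_fractile` and the `parts` parameter are assumptions made for this illustration, and equal class widths are assumed.

```python
def grouped_fractile(lower_bounds, freqs, width, k, parts):
    """k-th fractile when the data are divided into `parts` equal parts
    (parts=4 quartiles, parts=10 deciles, parts=100 percentiles)."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("the table has no observations")
    position = k * n / parts   # e.g. 6 * 80 / 10 = 48 for the 6th decile
    cum_before = 0
    for lower, f in zip(lower_bounds, freqs):
        if f > 0 and cum_before + f >= position:
            return lower + (position - cum_before) / f * width
        cum_before += f
    raise ValueError("fractile class not found")


lower_bounds = [-0.5, 4.5, 9.5, 14.5, 19.5, 24.5]
freqs = [1, 14, 23, 21, 15, 6]

sixth_decile = grouped_fractile(lower_bounds, freqs, 5, 6, 10)
percentile_65 = grouped_fractile(lower_bounds, freqs, 5, 65, 100)
print(round(sixth_decile, 2))   # 16.88
print(round(percentile_65, 2))  # 17.83
```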
2 Measures of Dispersion
These statistics include:
• Central percentage spread of items: These measures are linked to the median, e.g. the 10 to 90 percentile range, the quartile deviation, etc.
• Coefficient of Variation
2.1 Range
Definition:
The numerical difference between the highest and the smallest values of the items in a given data set or distribution:
Range = highest value - smallest value
3. A widely used measure in day-to-day discussions because of its simplicity, e.g. the range of prices of the same commodity in different places, the range of delivery times of different items ordered, or the range of manpower available on a production line at different times.
4. Has no natural partner in the measures of location and so cannot be used in further statistical work.
5. Widely used as a measure of dispersion, e.g. by meteorological departments when discussing the temperatures of different places in the world.
6. Widely used in quality control charts, including in making sample mean control charts. The chart for the ranges of samples enables a check to be kept on the variability of production.
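As a small illustration of the definition of the range, the sketch below applies it to the raw data of Examples 1 and 2 from the median section. The function name `data_range` is illustrative.

```python
def data_range(data):
    """Range = highest value - smallest value."""
    if not data:
        raise ValueError("the range of an empty data set is undefined")
    return max(data) - min(data)


example_1 = [2, 1, 1, 2, 3, 2, 3, 3, 4, 6, 4, 1, 2, 3, 2, 3, 5, 4, 4, 5]
example_2 = [10, 11, 10, 12, 11, 10, 11, 11, 11, 12, 13, 11, 12, 12, 13]
print(data_range(example_1))  # 6 - 1 = 5
print(data_range(example_2))  # 13 - 10 = 3
```

Only the two extreme items enter the calculation, which is why the range is simple to obtain but says nothing about how the remaining items are spread.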
2.2 The Mean Deviation
Definition:
A measure of dispersion that gives the average absolute difference (i.e. ignoring the minus sign) between each item and the mean.
A much more representative measure than the range. All item values are
taken into account.
e.g. Suppose an assembly line produced 3, 10, 5, 2 defective products.
Then the mean number of defectives is $\bar{x} = \frac{3 + 10 + 5 + 2}{4} = 5$. The absolute differences between each value and the mean become
$|3 - 5| = 2$, $|10 - 5| = 5$, $|5 - 5| = 0$, $|2 - 5| = 3$.
Thus the
$$\text{Mean deviation} = \frac{2 + 5 + 0 + 3}{4} = 2.5$$
Formulae for Mean Deviation:
For a set:
$$\text{Mean deviation, } Md = \frac{\sum |x - \bar{x}|}{n}$$
For a frequency distribution:
$$\text{Mean deviation, } Md = \frac{\sum f\,|x - \bar{x}|}{\sum f}$$
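For the grouped data of Example 3, with $x$ the class mid-point and $\bar{x} = \frac{\sum fx}{\sum f} = \frac{1225}{80} = 15.3125$, then,
class f x fx |x − x̄| f|x − x̄|
0-4 1 2 2 13.3125 13.3125
5-9 14 7 98 8.3125 116.375
10-14 23 12 276 3.3125 76.1875
15-19 21 17 357 1.6875 35.4375
20-24 15 22 330 6.6875 100.3125
25-29 6 27 162 11.6875 70.125
Total 80 1225 411.75
Thus,
$$\text{Mean deviation} = \frac{\sum f\,|x - \bar{x}|}{\sum f} = \frac{411.75}{80} = 5.146875 \approx 5.15$$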
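Both mean deviation formulae can be checked with a short Python sketch. The function names are illustrative, and for the grouped case the class mid-points are assumed to stand in for the items of each class.

```python
def mean_deviation(data):
    """Md = sum(|x - mean|) / n for a plain data set."""
    n = len(data)
    mean = sum(data) / n
    return sum(abs(x - mean) for x in data) / n


def grouped_mean_deviation(midpoints, freqs):
    """Md = sum(f * |x - mean|) / sum(f) for a frequency distribution."""
    total = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / total
    return sum(f * abs(x - mean) for x, f in zip(midpoints, freqs)) / total


# The assembly-line data: 3, 10, 5, 2 defective products.
print(mean_deviation([3, 10, 5, 2]))  # 2.5

# The grouped data with classes 0-4, 5-9, ..., 25-29.
midpoints = [2, 7, 12, 17, 22, 27]
freqs = [1, 14, 23, 21, 15, 6]
print(round(grouped_mean_deviation(midpoints, freqs), 2))  # 5.15
```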
Characteristics of the Mean Deviation
1. A good representative measure of dispersion that is not difficult to under-
stand. Used to compare variability between distributions of like nature.
2. It can be complicated and awkward to calculate if the mean is anything other than a whole number.
3. Mean deviation is virtually impossible to handle theoretically because of
the modulus sign. Hence, not used in more advanced analysis.
• Step 2: Calculate the sum of the squares of deviations of items from the
mean
• Step 3: Divide the sum by the number of items and take the square root;
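$$s^2 = \frac{\sum (x - \bar{x})^2}{n} \implies s = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$$
for a data set; while for a distribution, the value becomes,
$$s^2 = \frac{\sum f (x - \bar{x})^2}{\sum f} \implies s = \sqrt{\frac{\sum f (x - \bar{x})^2}{\sum f}}$$
This formula can be re-arranged to avoid the awkward subtraction from a decimal mean to become
$$s^2 = \frac{\sum x^2}{n} - \bar{x}^2 \implies s = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}$$
for a data set, and
$$s^2 = \frac{\sum f x^2}{\sum f} - \left(\frac{\sum f x}{\sum f}\right)^2 \implies s = \sqrt{\frac{\sum f x^2}{\sum f} - \left(\frac{\sum f x}{\sum f}\right)^2}$$
for a frequency distribution.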
From the previous example (the grouped data of Example 3, with $x$ the class mid-point), then
class f x fx x² fx²
0-4 1 2 2 4 4
5-9 14 7 98 49 686
10-14 23 12 276 144 3312
15-19 21 17 357 289 6069
20-24 15 22 330 484 7260
25-29 6 27 162 729 4374
Total 80 1225 21705
The variance becomes
$$s^2 = \frac{\sum f x^2}{\sum f} - \left(\frac{\sum f x}{\sum f}\right)^2 = \frac{21705}{80} - \left(\frac{1225}{80}\right)^2 = 271.3125 - 15.3125^2 = 271.3125 - 234.4727 = 36.8398$$
$$s = \sqrt{36.8398} = 6.07$$
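The definitional and the re-arranged variance formulae for a frequency distribution should agree, which can be checked numerically. The sketch below uses the class mid-points and frequencies of the grouped example; the function names are illustrative, and the divisor is the total frequency, as in the formulae of this section.

```python
import math


def grouped_variance_definition(midpoints, freqs):
    """s^2 = sum(f * (x - mean)^2) / sum(f)."""
    total = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / total
    return sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs)) / total


def grouped_variance_shortcut(midpoints, freqs):
    """s^2 = sum(f * x^2) / sum(f) - (sum(f * x) / sum(f))^2."""
    total = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(midpoints, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(midpoints, freqs))
    return sum_fx2 / total - (sum_fx / total) ** 2


midpoints = [2, 7, 12, 17, 22, 27]
freqs = [1, 14, 23, 21, 15, 6]

variance = grouped_variance_shortcut(midpoints, freqs)
std_dev = math.sqrt(variance)
print(round(variance, 4), round(std_dev, 2))  # 36.8398 6.07
```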
When comparing distributions with respect to variability, the coefficient of variation is much more appropriate.
$$\text{Coefficient of Variation} = \frac{\text{standard deviation}}{\text{mean}} \times 100\%$$
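The coefficient of variation can be sketched in Python as below. To avoid inventing data, the sketch reuses two data sets that appear earlier in these notes (the defective-product counts and the raw data of Example 2); the function name `coefficient_of_variation` is illustrative, and the standard deviation uses the divisor n, as in the formulae of this section.

```python
import math


def coefficient_of_variation(data):
    """CV = (standard deviation / mean) * 100, in percent."""
    n = len(data)
    mean = sum(data) / n
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    variance = sum((x - mean) ** 2 for x in data) / n
    return math.sqrt(variance) / mean * 100


defectives = [3, 10, 5, 2]
example_2 = [10, 11, 10, 12, 11, 10, 11, 11, 11, 12, 13, 11, 12, 12, 13]

cv_defectives = coefficient_of_variation(defectives)
cv_example_2 = coefficient_of_variation(example_2)
print(round(cv_defectives, 1), round(cv_example_2, 1))  # 61.6 8.3
```

Because the result is a percentage of the mean, the two values can be compared directly even though the data sets have different means.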
Example:
Over a period of one month, the daily number of components produced by
two comparable machines was measured:
Note that:
Psk < 0 shows there is left or negative skew
Psk = 0 signifies no skew (mean = mode for a symmetric distribution)
Psk > 0 means there is right or positive skew
Example:
Determine the skewness for the previous example.
The modal class is 9.5 - 14.5 (frequency 23), so
$$\text{Mode} = 9.5 + \frac{23 - 14}{(23 - 14) + (23 - 21)} \times 5 = 9.5 + 0.8182 \times 5 = 9.5 + 4.091 = 13.591$$
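The calculation can be carried through in Python. The formula for Psk is not shown in this part of the notes, so the sketch assumes Pearson's mode-based coefficient, Psk = (mean - mode) / standard deviation, which fits the remark that mean = mode gives no skew. The function names are illustrative, and equal class widths and class mid-points are assumed.

```python
import math


def grouped_mode(lower_bounds, freqs, width):
    """Mode = L + (f1 - f0) / ((f1 - f0) + (f1 - f2)) * i, where f1 is the
    modal class frequency and f0, f2 are its neighbours' frequencies."""
    m = max(range(len(freqs)), key=lambda j: freqs[j])  # modal class index
    f1 = freqs[m]
    f0 = freqs[m - 1] if m > 0 else 0
    f2 = freqs[m + 1] if m + 1 < len(freqs) else 0
    return lower_bounds[m] + (f1 - f0) / ((f1 - f0) + (f1 - f2)) * width


def pearson_skewness(midpoints, lower_bounds, freqs, width):
    """Assumed form: Psk = (mean - mode) / standard deviation."""
    total = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / total
    variance = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs)) / total
    mode = grouped_mode(lower_bounds, freqs, width)
    return (mean - mode) / math.sqrt(variance)


midpoints = [2, 7, 12, 17, 22, 27]
lower_bounds = [-0.5, 4.5, 9.5, 14.5, 19.5, 24.5]
freqs = [1, 14, 23, 21, 15, 6]

mode = grouped_mode(lower_bounds, freqs, 5)
psk = pearson_skewness(midpoints, lower_bounds, freqs, 5)
print(round(mode, 3))  # 13.591
print(psk > 0)         # True: a positive (right) skew under this assumption
```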