Biostat Lecture Four
Descriptive Statistics:
[email protected] 1
Measures of Central Tendency (MCT)
• A frequency distribution gives a general picture of the distribution of a variable.
• However, it cannot indicate the average value or the spread of the values.
• The tendency of statistical data to concentrate around a certain value is called “central tendency”.
• The various methods of determining the point about which the observations tend to concentrate are called measures of central tendency (MCT).
• The most common measures of central tendency include:
Arithmetic Mean
Median
Mode
Others
1. Arithmetic Mean
a) Ungrouped data
• The arithmetic mean is the "average" of the data set. It is by far the most widely used measure of central location and is usually denoted by $\bar{x}$.
• It is the sum of all the observations divided by the total number of observations:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
b) Grouped data
• In calculating the mean from grouped data, we assume that all values falling into a particular class interval are located at the mid-point of the interval. It is calculated as follows:
$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$
where,
$k$ = the number of class intervals
$m_i$ = the mid-point of the ith class interval
$f_i$ = the frequency of the ith class interval
Example. Compute the mean age of 169 subjects from the
grouped data.
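A minimal Python sketch of the grouped-data mean, assuming the class mid-points and frequencies of the 169-subject age table used in this lecture (classes 10-19 through 60-69). The function name `grouped_mean` is illustrative and not part of the lecture.

```python
def grouped_mean(midpoints, freqs):
    """Mean of grouped data: sum(m_i * f_i) / sum(f_i)."""
    if len(midpoints) != len(freqs):
        raise ValueError("midpoints and freqs must have the same length")
    total = sum(freqs)
    if total <= 0:
        raise ValueError("total frequency must be positive")
    return sum(m * f for m, f in zip(midpoints, freqs)) / total


# Age classes 10-19, 20-29, ..., 60-69: mid-points and frequencies.
midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
freqs = [4, 66, 47, 36, 12, 4]

mean_age = grouped_mean(midpoints, freqs)
print(round(mean_age, 2))
```

With these inputs the weighted sum is 5810.5 and the total frequency is 169, so the sketch prints 34.38.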
• When the data are skewed, the mean is “dragged” in the direction of the skewness.
• In extreme cases, all but one of the sample points can lie on one side of the arithmetic mean. In this case, the mean is a poor measure of central location because it does not reflect the center of the sample.
Properties of the Arithmetic Mean
• For a given set of data there is one and only one arithmetic mean (uniqueness).
• It is easy to calculate and understand (simple).
• It is influenced by each and every value in a data set.
• It is greatly affected by extreme values.
• For grouped data, if any class interval is open-ended, the arithmetic mean cannot be calculated.
2. Median
a) Ungrouped data
• The median is the value which divides the data set into two equal
parts.
• If the number of values is odd, the median will be the
middle value when all values are arranged in order of
magnitude.
• When the number of observations is even, there is no
single middle value but two middle observations.
• In this case the median is the mean of these two middle
observations, when all observations have been arranged in
the order of their magnitude.
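The odd and even cases above can be sketched in Python as follows. The function name and the two small data sets are hypothetical, chosen only to illustrate the rule.

```python
def median(values):
    """Median of ungrouped data.

    Odd n: the middle value of the sorted data.
    Even n: the mean of the two middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_result = median([7, 3, 9, 1, 5])       # sorted: 1, 3, 5, 7, 9
even_result = median([7, 3, 9, 1, 5, 11])  # sorted: 1, 3, 5, 7, 9, 11
print(odd_result, even_result)
```

For the odd-sized set the middle value is 5; for the even-sized set the two middle values are 5 and 7, so the median is 6.0.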
• The median describes the bulk of the data better than the mean when the distribution is skewed.
• Example
– Data: 14, 89, 93, 95, 96
– Skewness is reflected in the outlying low value of 14
– The sample mean is 77.4
– The median is 93
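The example can be checked with Python's standard `statistics` module. The extra line that drops the value 14 is an added illustration, not part of the lecture, and shows how far a single outlying value pulls the mean.

```python
import statistics

data = [14, 89, 93, 95, 96]

mean_all = statistics.mean(data)               # pulled down by the value 14
median_all = statistics.median(data)           # middle of the sorted values
mean_without_low = statistics.mean(data[1:])   # illustration: 14 removed

print(mean_all, median_all, mean_without_low)
```

The mean moves from 77.4 to 93.25 once the outlying value is removed, while the median of the full data set is already 93.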
b) Grouped data
• In calculating the median from grouped data, we assume that the values within a class interval are evenly distributed through the interval.
• The first step is to locate the class interval that contains the median, using the following procedure.
• Find n/2, then identify the first class interval whose cumulative frequency is at least n/2.
• Then, use the following formula.
$$\tilde{x} = L_m + \left( \frac{\frac{n}{2} - F_c}{f_m} \right) W$$
where,
$L_m$ = lower true class boundary of the interval containing the median
$F_c$ = cumulative frequency of the class interval immediately preceding the median class interval
$f_m$ = frequency of the interval containing the median
$W$ = class interval width
$n$ = total number of observations
Example. Compute the median age of 169
subjects from the grouped data.
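• n/2 = 169/2 = 84.5, which falls in the 3rd class interval (30-39)
• Lower true class boundary = 29.5, upper true class boundary = 39.5, so W = 10
• Frequency of the median class, fm = 47
• Cumulative frequency before the median class, Fc = 4 + 66 = 70, so (n/2 – Fc) = 84.5 – 70 = 14.5
• Median = 29.5 + (14.5/47) × 10 ≈ 32.59 years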
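The interpolation step can be sketched in Python as below. This is a sketch under stated assumptions: the frequencies are those of the lecture's age table, the lower true boundaries 9.5, 19.5, ... are inferred from the stated boundary 29.5 of the third class, and `grouped_median` is an illustrative name.

```python
def grouped_median(lower_bounds, freqs, width):
    """Median of grouped data by linear interpolation in the median class.

    lower_bounds: lower true class boundaries, in increasing order
    freqs:        class frequencies
    width:        common class width W
    """
    n = sum(freqs)
    if n <= 0:
        raise ValueError("total frequency must be positive")
    half = n / 2
    cumulative = 0  # F_c: cumulative frequency before the current class
    for lower, f in zip(lower_bounds, freqs):
        if cumulative + f >= half:
            # First class whose cumulative frequency reaches n/2.
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("median class not found")


lower_bounds = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5]
freqs = [4, 66, 47, 36, 12, 4]

median_age = grouped_median(lower_bounds, freqs, width=10)
print(round(median_age, 2))
```

Here the median class is the third one (cumulative frequency 70 before it, 117 after it), and the sketch evaluates 29.5 + (14.5/47) × 10.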
Properties of the median
• There is only one median for a given set of data (uniqueness).
• The median is easy to calculate.
• The median is a positional average and hence is insensitive to very large or very small values.
• The median can be calculated even in the case of open-end intervals.
• It is determined mainly by the middle points and is less sensitive to the remaining data points (weakness).
3. Mode
a) Ungrouped data
• It is the value that occurs most frequently in a set of values.
• If all the values are different, there is no mode. On the other hand, a set of values may have more than one mode.
• Example
• Data are: 1, 2, 3, 4, 4, 4, 4, 5, 5, 6
• Mode is 4
• Example
• Data are: 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8
• There are two modes – 2 & 5
• This distribution is said to be “bi-modal”
• Example
• Data are: 2.62, 2.75, 2.76, 2.86, 3.05, 3.12
• No mode, since all the values are different
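The three examples can be reproduced with a short Python sketch. The helper `modes` is illustrative; it follows the lecture's convention that data with all values different have no mode.

```python
from collections import Counter


def modes(values):
    """Return all modes, or an empty list if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # all values different: no mode
    return sorted(v for v, c in counts.items() if c == top)


one_mode = modes([1, 2, 3, 4, 4, 4, 4, 5, 5, 6])
two_modes = modes([1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8])
no_mode = modes([2.62, 2.75, 2.76, 2.86, 3.05, 3.12])
print(one_mode, two_modes, no_mode)
```

The sketch returns `[4]`, `[2, 5]`, and an empty list. Note that the standard library's `statistics.multimode` uses a different convention and returns every value when all values are different.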
b) Grouped data
• To find the mode of grouped data, we usually refer to
the modal class, where the modal class is the class
interval with the highest frequency.
• If a single value for the mode of grouped data must
be specified, it is taken as the mid-point of the modal
class interval.
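As a small Python illustration, the modal class and its mid-point can be read off a frequency table. The class limits and frequencies below are those of the lecture's 169-subject age table; the variable names are illustrative.

```python
# Class limits and frequencies of the age table.
classes = [(10, 19), (20, 29), (30, 39), (40, 49), (50, 59), (60, 69)]
freqs = [4, 66, 47, 36, 12, 4]

# Index of the class with the highest frequency (first one if tied).
modal_index = max(range(len(freqs)), key=freqs.__getitem__)
modal_class = classes[modal_index]

# Mid-point of the modal class, used when a single value is required.
modal_midpoint = (modal_class[0] + modal_class[1]) / 2
print(modal_class, modal_midpoint)
```

For this table the modal class is 20-29 (frequency 66), with mid-point 24.5.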
$$\hat{x} = L + w \left( \frac{f_2}{f_0 + f_2} \right)$$
where
$L$ = lower boundary of the modal class
$f_0$ = frequency of the class next below the modal class in value
$f_2$ = frequency of the class next above the modal class in value
$w$ = width of the modal class interval
Properties of mode
• It is not affected by extreme values.
• It can be calculated for distributions with open-end classes.
• Often its value is not unique.
• The main drawback of the mode is that it often does not exist.
Which measure of central tendency is best with a
given set of data?
• The mean can be used for discrete and continuous data.
• The median is appropriate for discrete and continuous data as well, but can also be used for ordinal data.
• The mode can be used for all types of data, but may be especially useful for nominal and ordinal measurements.
• For discrete or continuous data, the “modal class” can be used.
(a) Symmetric and unimodal distribution: the mean, median, and mode should all be approximately the same.
(b) Bimodal: the mean and median should be about the same, but they may take a value that is unlikely to occur; reporting the two modes might be best.
(c) Skewed to the right (positively skewed): the mean is sensitive to extreme values, so the median might be more appropriate.
[Figure: right-skewed distribution with the mode, median, and mean marked]
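(d) Skewed to the left (negatively skewed): as in (c), the mean is sensitive to extreme values, so the median might be more appropriate.
[Figure: left-skewed distribution with the mode, median, and mean marked]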
Measures of Dispersion
These two distributions have the same mean,
median, and mode
• Measures of central tendency alone are not enough to give a clear understanding of the distribution of the data.
Other synonymous terms:
– “Measure of Variation”
– “Measure of Spread”
– “Measure of Scatter”
• Measures of dispersion include:
– Range
– Variance
– Standard deviation
– Coefficient of variation
– Standard error
– Others
1. Range (R)
• The difference between the largest and smallest
observations in a sample.
• Example –
– Data values: 5, 9, 12, 16, 23, 34, 37, 42
– Range = 42-5 = 37
• A data set with a larger range exhibits more variability.
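The range example can be checked in a couple of lines of Python; the variable names are illustrative.

```python
data = [5, 9, 12, 16, 23, 34, 37, 42]

# Range: largest observation minus smallest observation.
data_range = max(data) - min(data)
print(data_range)
```

Only the two extreme observations (5 and 42) enter the calculation, which gives 37.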
Properties of range
• It is the simplest, crudest measure and is easily understood.
• It takes into account only two values, which makes it a poor measure of dispersion.
• It is very sensitive to extreme observations.
• It tends to increase as the sample size increases.
2. Variance (σ², S²)
• Variance is used to measure the dispersion of values relative to the mean.
• The variance is the average of the squares of the deviations taken from the mean.
• When values are close to their mean (narrow range) the dispersion is less than when there is scattering over a wide range.
– Population variance = σ²
– Sample variance = S²
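a) Ungrouped data
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \qquad S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$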
Degrees of freedom
• In computing the variance there are (n-1) degrees of freedom because only (n-1) of the deviations are independent of each other.
• The last one can always be calculated from the others, because the deviations from the mean sum to zero.
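A short Python sketch of why only (n-1) deviations are free: the deviations from the mean sum to zero, so the last one is fixed by the rest. The data are reused from the range example of this lecture, and the variable names are illustrative.

```python
import math
import statistics

data = [5, 9, 12, 16, 23, 34, 37, 42]
n = len(data)
mean = sum(data) / n
deviations = [x - mean for x in data]

# The deviations sum to zero, so the last deviation can be
# recovered from the first n-1 of them.
last_from_others = -sum(deviations[:-1])
print(math.isclose(last_from_others, deviations[-1]))

# Sample variance: sum of squared deviations over (n - 1).
sample_variance = sum(d ** 2 for d in deviations) / (n - 1)
print(math.isclose(sample_variance, statistics.variance(data)))
```

Both checks print `True`; `statistics.variance` also divides by (n-1), which is why the two results agree.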
b) Grouped data
$$S^2 = \frac{\sum_{i=1}^{k} (m_i - \bar{x})^2 f_i}{\sum_{i=1}^{k} f_i - 1}$$
where
$m_i$ = the mid-point of the ith class interval
$f_i$ = the frequency of the ith class interval
$k$ = the number of class intervals
$\bar{x}$ = the sample mean
Properties of Variance:
• The main disadvantage of the variance is that its unit is the square of the unit of the original measurements.
• The variance gives more weight to extreme values than to values near the mean, because the deviations are squared.
• The unit problem is overcome by the standard deviation, which is expressed in the original units.
4. Standard deviation (σ, S)
• It is the square root of the variance.
• This produces a measure having the same scale as that of the individual values.
$$\sigma = \sqrt{\sigma^2} \quad \text{and} \quad S = \sqrt{S^2}$$
Example. Compute the variance and SD of the age of 169 subjects from the grouped data.

Class interval   mi     fi    (mi-Mean)   (mi-Mean)²   (mi-Mean)² fi
10-19            14.5     4    -19.88      395.2144      1580.8576
20-29            24.5    66     -9.88       97.6144      6442.5504
30-39            34.5    47      0.12        0.0144         0.6768
40-49            44.5    36     10.12      102.4144      3686.9184
50-59            54.5    12     20.12      404.8144      4857.7728
60-69            64.5     4     30.12      907.2144      3628.8576
Total                   169               1907.2864     20197.6336

Mean = 5810.5/169 = 34.38 years
S² = 20197.6336/(169-1) = 120.22
SD = √S² = √120.22 = 10.96 years
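The grouped-data variance and SD can be sketched in Python from the mid-points and frequencies of the table. The function name `grouped_variance` is illustrative. The sketch keeps the mean unrounded, so its result can differ in the last digit from a hand calculation that rounds the mean first.

```python
import math


def grouped_variance(midpoints, freqs):
    """Sample variance of grouped data.

    S^2 = sum((m_i - mean)^2 * f_i) / (sum(f_i) - 1)
    """
    n = sum(freqs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
    squared = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, freqs))
    return squared / (n - 1)


midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
freqs = [4, 66, 47, 36, 12, 4]

variance = grouped_variance(midpoints, freqs)
sd = math.sqrt(variance)
print(round(variance, 1), round(sd, 2))
```

For the age table this gives a variance of about 120.2 and an SD of about 10.96 years.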
Properties of SD
• The SD has the advantage of being expressed in
the same units of measurement as the mean
5. Coefficient of variation (CV)
• When two data sets have different units of measurement, or their means differ substantially in size, the CV should be used as a measure of dispersion.
• It is well suited to comparing the variability of two or more sets of observations.
• A data set with a smaller coefficient of variation is considered more consistent.
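A minimal Python sketch of such a comparison, assuming the standard definition CV = (S / mean) × 100%, which this section does not spell out. The two data sets are hypothetical and are not taken from the lecture.

```python
import statistics


def coefficient_of_variation(values):
    """CV as a percentage: sample SD divided by the mean, times 100."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean * 100


# Hypothetical measurements in different units (not from the lecture).
heights_cm = [150, 160, 165, 170, 180]
weights_kg = [50, 60, 65, 80, 95]

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)
print(f"height CV = {cv_height:.1f}%, weight CV = {cv_weight:.1f}%")
```

Because the CV is unit-free, the two values can be compared directly; in this hypothetical example the heights have the smaller CV and are therefore the more consistent set.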