Topic02. Descriptive Stats
The amount of statistical information disseminated to the public, and indeed in the
medical literature, is sometimes beyond comprehension, and which part of it is "good"
statistics and which part is "bad" statistics is anybody's guess. Certainly, not all of it
can be accepted uncritically. Sometimes entirely erroneous conclusions are based on
unsound data. Indeed, the use of statistics has already been replaced by overuse and
abuse. People are writing books and papers based on inappropriate applications of
statistics. Alvan Feinstein recently commented: "some of these authors are very popular
because they are not afraid to provide solutions to problems that have not yet been
solved." We, of course, do not want to go down that path. We need to use statistics
wisely.
In this topic we will deal with the use of some basic statistical indicators which are
usually referred to as descriptive statistics. Specifically, we will be concerned with the
summarising of continuous data. We will discuss four main themes:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i .$$
When the data are in the form of a frequency distribution, the mean takes into account
the number of observations per category. Suppose that we have k categories with
$n_1, n_2, \dots, n_k$ observations respectively (total sample size: $N = n_1 + n_2 + \dots + n_k$)
and associated means $\bar{x}_1, \bar{x}_2, \dots, \bar{x}_k$. Then the overall mean is given by:
$$\bar{x} = \frac{1}{N} \sum_{i=1}^{k} n_i \bar{x}_i$$
Example 1: The number of subjects and mean lumbar spine BMD for three genotypes
are as follows:
Genotype    n     Mean (g/cm²)
TT          40    1.25
Tt          45    1.10
tt          15    1.00
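As an illustration of the frequency-weighted mean, the following Python sketch applies the formula to the three genotype groups of Example 1. The names `groups` and `weighted_mean` are chosen here for illustration only.

```python
# Group sizes and group means from Example 1 (lumbar spine BMD, g/cm^2).
groups = {"TT": (40, 1.25), "Tt": (45, 1.10), "tt": (15, 1.00)}


def weighted_mean(groups):
    """Overall mean from (n_i, mean_i) pairs: sum(n_i * mean_i) / sum(n_i)."""
    total_n = sum(n for n, _ in groups.values())
    weighted_sum = sum(n * m for n, m in groups.values())
    return weighted_sum / total_n


overall_mean = weighted_mean(groups)
print(round(overall_mean, 3))  # 1.145
```

The overall mean (1.145 g/cm²) is pulled toward the means of the larger groups, which is why it differs from the simple average of the three group means.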
The geometric mean is a useful measure of position for data involving ratios.
As can be seen from this relation, the geometric mean is undefined for a set of
values that contains zeros or negative values.
To calculate the average percentage increase over the 5 visits, we need to (i) convert
the percentage data into ratios and (ii) apply the geometric mean formula.
The 4 percentage increases can be written in ratio terms as: 1.054, 1.089, 1.096, 1.064.
Then the average log(ratio) is:
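As a rough check on this calculation, the sketch below averages the logarithms of the four ratios and back-transforms the result. It assumes natural logarithms; the geometric mean itself does not depend on the base chosen. Variable names are illustrative.

```python
import math

# Ratios for the four visit-to-visit increases quoted in the text.
ratios = [1.054, 1.089, 1.096, 1.064]

# Average of the natural logs, then back-transform with exp().
mean_log_ratio = sum(math.log(r) for r in ratios) / len(ratios)
geometric_mean = math.exp(mean_log_ratio)

# Average percentage increase per visit implied by the geometric mean.
average_increase_pct = (geometric_mean - 1) * 100
print(f"{mean_log_ratio:.4f} {geometric_mean:.4f} {average_increase_pct:.2f}%")
```

The geometric mean ratio comes out at roughly 1.076, i.e. an average increase of about 7.6% per visit.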
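$$\frac{1}{H} = \frac{1}{n}\left(\frac{1}{x_1} + \frac{1}{x_2} + \dots + \frac{1}{x_n}\right)$$
So:
$$\bar{x}_H = H = \frac{n}{\sum_{i=1}^{n} \dfrac{1}{x_i}}$$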
When a data set contains values which represent rates of change, the harmonic mean
is a useful measure of central tendency.
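The harmonic mean formula can be sketched in Python as follows. The speeds used here are hypothetical values invented for illustration (they are not from the text), and `harmonic_mean_manual` is an illustrative name; `statistics.harmonic_mean` is the standard library equivalent.

```python
from statistics import harmonic_mean


def harmonic_mean_manual(values):
    """H = n / sum(1 / x_i); all values must be positive."""
    return len(values) / sum(1 / x for x in values)


# Hypothetical rates (not from the text): speeds in km/h over two equal distances.
speeds = [40, 60]
h_manual = harmonic_mean_manual(speeds)
h_library = harmonic_mean(speeds)
print(h_manual, h_library)
```

Both calculations give a value of about 48 km/h, lower than the arithmetic mean of 50 km/h, because more time is spent travelling at the slower speed.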
4. THE MEDIAN of a set of observations is the value of the middle term when all
observations are arranged in order of magnitude. It is symbolised by Md.
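Example 3: For the set of values 14, 17, -13, 41, 12, we can find the median as
follows. Arranged in order of magnitude, the values are: -13 12 14 17 41. The middle
(third) value is 14, so the median is 14.
However, for the set of values -13 12 14 17 41 66, which has an even number of
observations, the median is the average of the two middle values: (14+17)/2 = 15.5.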
5. THE MODE (m) is another measure of central tendency; it is the most frequently
observed value of the variable.
For example, for the data set {4, 5, 3, 2, 4, 1, 7, 4, 2, 4}, the mode is 4, since it
is the most frequently occurring value.
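The median and mode examples above can be reproduced with Python's standard `statistics` module; this is a minimal sketch, with variable names chosen for illustration.

```python
from statistics import median, mode

odd_sample = [14, 17, -13, 41, 12]        # five values: a single middle term
even_sample = [-13, 12, 14, 17, 41, 66]   # six values: average the two middle terms
mode_sample = [4, 5, 3, 2, 4, 1, 7, 4, 2, 4]

median_odd = median(odd_sample)
median_even = median(even_sample)
most_common = mode(mode_sample)
print(median_odd, median_even, most_common)  # 14 15.5 4
```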
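$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
which is equivalent to:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} x_i^2 - \frac{n\bar{x}^2}{n-1}$$
For weighted data:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{k} w_i (x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{k} w_i x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{k} w_i x_i\right)^2\right]$$
where $n = \sum_{i=1}^{k} w_i$.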
The wider the dispersion of the values around their mean, the greater the variance. If
there is no dispersion (eg 5, 5, 5, 5) then all values are equal to the mean; it follows
that the variance is 0.
Example 4: Consider the data set 5, 17, 12 and 10, whose mean is $\bar{x} = 11$. We
calculate the variance as follows:
$$
\begin{aligned}
s^2 &= \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
    &= \frac{1}{4-1}\left[(5-11)^2 + (17-11)^2 + (12-11)^2 + (10-11)^2\right] \\
    &= \frac{(-6)^2 + 6^2 + 1^2 + (-1)^2}{3} \\
    &= 24.67.
\end{aligned}
$$
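The calculation in Example 4 can be checked with a short Python sketch that evaluates both the definitional formula and the equivalent shortcut formula, and compares them with `statistics.variance`. The function names are illustrative.

```python
from statistics import variance

data = [5, 17, 12, 10]  # Example 4


def sample_variance(values):
    """Definitional form: sum of squared deviations divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Equivalent form: (sum of x^2 - n * mean^2) / (n - 1)."""
    n = len(values)
    x_bar = sum(values) / n
    return (sum(x * x for x in values) - n * x_bar ** 2) / (n - 1)


var_definition = sample_variance(data)
var_shortcut = sample_variance_shortcut(data)
var_library = variance(data)
print(round(var_definition, 2), round(var_shortcut, 2), round(var_library, 2))
```

All three values agree at about 24.67, as in the worked example.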
Example 1 (continued): For the data in Example 1, we can treat the number of
subjects in each genotype as weights. The calculation of variance can be illustrated
by the following table:
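then
$$
\begin{aligned}
s^2 &= \frac{1}{n-1}\left[\sum_{i=1}^{k} w_i x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{k} w_i x_i\right)^2\right] \\
    &= \frac{1}{99}\left[131.95 - \frac{(114.5)^2}{100}\right] \\
    &= 0.00856\ \text{g}^2/\text{cm}^4
\end{aligned}
$$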
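The weighted calculation can be sketched in Python using the genotype counts from Example 1 as weights. This is an illustrative check of the arithmetic above; the variable names are not from the text.

```python
# (weight w_i, value x_i) pairs: genotype counts and mean BMD from Example 1.
genotypes = [(40, 1.25), (45, 1.10), (15, 1.00)]

n = sum(w for w, _ in genotypes)               # total weight: 100
sum_wx = sum(w * x for w, x in genotypes)      # sum of w_i * x_i: 114.5
sum_wx2 = sum(w * x * x for w, x in genotypes) # sum of w_i * x_i^2: 131.95

# Shortcut form of the weighted variance.
weighted_variance = (sum_wx2 - sum_wx ** 2 / n) / (n - 1)
print(n, round(sum_wx, 2), round(sum_wx2, 2), round(weighted_variance, 5))
```

The result, about 0.00856 g²/cm⁴, matches the value obtained above.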
2. STANDARD DEVIATION. The positive square root of the variance is called the
standard deviation and is denoted by s:
$$s = \sqrt{s^2}$$
The variance is expressed in units that are the square of the unit of measure of the
variable under study. For instance, the variance of BMD is measured in (g/cm²)².
However, the standard deviation is expressed in the original unit of measure of the
variable, e.g. g/cm².
If the data set has a large number of observations and is approximately symmetrical, the
standard deviation can be roughly approximated from the maximum and minimum values
as follows:
$$
s \approx
\begin{cases}
(\max - \min)/\sqrt{n} & \text{for } n < 12 \\
(\max - \min)/4 & \text{for } 20 < n < 40 \\
(\max - \min)/5 & \text{for } n \text{ about } 100 \\
(\max - \min)/6 & \text{for } n > 400.
\end{cases}
$$
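A minimal sketch of this rule of thumb in Python is shown below. Two assumptions are made here and should be read as such: the divisor for n < 12 is taken to be the square root of n, and the gaps between the stated bands (for example n between 12 and 20) are assigned to the nearest band, which the text does not specify. The function name is illustrative.

```python
import math


def rough_sd_from_range(values):
    """Approximate the SD as (max - min) / divisor, with the divisor chosen by n.

    Band boundaries between the ranges quoted in the text are assumptions.
    """
    n = len(values)
    value_range = max(values) - min(values)
    if n < 12:
        divisor = math.sqrt(n)  # assumed: square root of n for small samples
    elif n <= 40:
        divisor = 4
    elif n <= 400:
        divisor = 5
    else:
        divisor = 6
    return value_range / divisor


example_4 = [5, 17, 12, 10]
rough_sd = rough_sd_from_range(example_4)
print(rough_sd)  # 6.0
```

For the Example 4 data the rough estimate is 6.0, against an exact standard deviation of about 4.97, which shows how crude the approximation can be for very small samples.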
3. STANDARD ERROR (SE) is the standard deviation of the means of samples of a given
size drawn from a particular parent population. If n is the sample size, N is the
size of the parent population and σ is the standard deviation of the parent population,
then the SE is defined by:
$$SE = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}$$
Therefore, for a large parent population or for sampling with replacement, this equation
may be simplified to $\sigma / \sqrt{n}$. However, in a sample of data, SE is estimated by:
$$SE = \frac{s}{\sqrt{n}}$$
SE is a measure of a reasonable difference between a sample mean and the parent
population mean, and is used to test whether a particular sample could have been drawn
from a given parent population. It is also used to work out confidence limits.
The SE for the data set in Example 4 is: $SE = \dfrac{s}{\sqrt{n}} = \dfrac{\sqrt{24.67}}{\sqrt{4}} = \dfrac{4.97}{2} = 2.48$.
The CV for the data set in Example 4 is estimated by: $CV = \dfrac{4.97}{11} \times 100 = 45.2\%$.
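The standard error and CV for Example 4 can be checked with the Python sketch below. It takes the standard deviation s (the square root of the variance 24.67, about 4.97) as the numerator of the standard error, and the CV as s divided by the mean, expressed as a percentage. Variable names are illustrative.

```python
import math
from statistics import mean, stdev

data = [5, 17, 12, 10]  # Example 4

s = stdev(data)                            # sample standard deviation
standard_error = s / math.sqrt(len(data))  # SE = s / sqrt(n)
cv_percent = s / mean(data) * 100          # CV = s / mean * 100

print(f"{s:.2f} {standard_error:.2f} {cv_percent:.1f}")  # 4.97 2.48 45.2
```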
The following figure illustrates the 25th, 50th and 75th percentiles, often called the
lower quartile, the middle quartile (median) and the upper quartile, respectively.
[Figure: the ordered data -15, -9, 1, 3, 5, 9, 13, 17, 23, 92, divided by the quartiles
into four groups of 25% each, with the median marked]
Here the median can be estimated to be (5+9)/2 = 7, so the 50th percentile is 7.
Similarly, the 25th percentile is 1 and the 75th percentile is 17, and so on.
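The quartiles quoted for this data set can be reproduced by taking the median of the lower half and the median of the upper half of the ordered data. The sketch below uses that convention as an assumption; other percentile conventions (including the default of `statistics.quantiles`) interpolate differently and can give slightly different values. The function name is illustrative.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median of each half of the sorted data.

    Needs at least two values; for an odd count the middle value
    is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]
    return median(lower_half), median(ordered), median(upper_half)


data = [-15, -9, 1, 3, 5, 9, 13, 17, 23, 92]
quartiles = quartiles_by_halves(data)
print(quartiles)  # (1, 7.0, 17)
```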
III. MEASURES OF SHAPES
If S is positive (mode < mean), the distribution is skewed to the right; if S is
negative (mode > mean), the distribution is skewed to the left.
We have surveyed three main measures of central tendency. The question now is
which measure is the most appropriate and reliable. The answer depends on the
distribution of the observed data. However, it can be stated that, like any physical
measure, none of the above statistics is perfect in describing the central position of
a distribution.
What can reasonably be stated is that, from a theoretical point of view, the mean is the
best measure of central tendency of a distribution. This is because it can be computed
for any numerical data, makes use of all the observations and is unique. Furthermore,
it is readily understood by most people. While the mean is influenced by extreme
values, the median is not. However, the median is not likely to be representative when
the number of observations is small, because it is a positional average; it is also not
unique. On the other hand, unless the number of observations is sufficiently large and
the distribution of the data reveals a clear picture of central tendency, the mode has
no significance.
If the distribution of a data set is symmetrical, as in Figure 1, the mean, the median
and the mode are the same (or at least similar). If the distribution is skewed to the
right (as in Figure 2), the mean is larger than the median. If the distribution is skewed
to the left (Figure 3), the mean is smaller than the median.
[Figure 1: symmetrical distribution, mean and median coincide. Figure 2: distribution
skewed to the right, median below mean. Figure 3: distribution skewed to the left, mean
below median.]
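For a reasonably large data set that is approximately symmetrical, an empirical relation
between the mean, median and mode can be established:
$$\text{mean} - \text{mode} \approx 3\,(\text{mean} - \text{median})$$
That is, given a median and a mean, the value of the mode can be approximated by:
$$\text{mode} \approx 3\,(\text{median}) - 2\,(\text{mean})$$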
2. CHEBYSHEV'S THEOREM AND ESTIMATING RANGES OF VALUES AND CONFIDENCE
INTERVAL.
It is important to emphasise here again that a set of data is a sample drawn from the
population of all possible measurements. Thus, the sample mean $\bar{x}$, standard
deviation s, etc. may not equal the true population mean and standard deviation,
which are usually denoted by Greek characters such as µ and σ. The purpose of
parametric estimation is not just to obtain an estimate of the mean in the general
population, but also to indicate its "uncertainty", i.e. how close to or far from the
true value the estimate may be. Related to this estimation is the concept of the
confidence limit, which is introduced here via Chebyshev's theorem, one of the great
theorems in probability, named after the great Russian mathematician. The exact
statement of this theorem is quite mathematically involved; however, it can be
interpreted as follows:
USE OF STANDARD DEVIATION. For any symmetrical data set with a given mean ($\bar{x}$)
and standard deviation (s), we can estimate the range of individual measurements
with a certain accuracy. For example, the mean and standard deviation of the (natural)
logarithm of osteocalcin in a sample of Sydney subjects are 2.86 and 0.45,
respectively; it could be inferred that approximately 95% of subjects in this sample
have a log(osteocalcin) between 2.86-2(0.45) and 2.86+2(0.45) (that is, between 1.96
and 3.76). The 95% figure assumes an approximately bell-shaped distribution;
Chebyshev's theorem on its own guarantees only that at least 75% of values lie within
two standard deviations of the mean.
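The range in this example can be computed with a small Python helper, sketched below. The function name `two_sd_range` is illustrative. Because the values are natural logarithms, the limits can also be back-transformed to the original osteocalcin scale with `math.exp`; that back-transformation is an addition made here for illustration and is not part of the original example.

```python
import math


def two_sd_range(mean_value, sd, k=2):
    """Return (mean - k*sd, mean + k*sd)."""
    return mean_value - k * sd, mean_value + k * sd


# Mean and SD of log(osteocalcin) quoted in the text.
low, high = two_sd_range(2.86, 0.45)
print(round(low, 2), round(high, 2))  # 1.96 3.76

# Back-transform the limits from the natural-log scale to the original scale.
low_original, high_original = math.exp(low), math.exp(high)
print(round(low_original, 1), round(high_original, 1))
```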
USE OF STANDARD ERROR. The standard error (SE), which we discussed earlier, is
often referred to as the standard deviation of the mean, since it indicates the likely
difference between a sample mean and the parent population mean. The latter is often
unknown. However, one can apply Chebyshev's theorem to estimate the range of possible
values of the population mean with a certain level of confidence.
For example, the mean and standard error of femoral neck BMD among 20 women with
fractures from a community in Sydney were found to be 0.70 g/cm² and 0.02 g/cm²,
respectively. The true mean femoral neck BMD of all fracture subjects in Sydney was
unknown. However, it could be stated that the true mean is likely to lie between
0.70-2(0.02) = 0.66 g/cm² and 0.70+2(0.02) = 0.74 g/cm². What this means is that, if we
kept sampling 20 women with fractures from the Sydney population repeatedly (each time
with different subjects) and each time calculated the mean and the interval
mean ± 2 SE, then we would expect about 95% of these intervals to contain the true mean.
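The repeated-sampling idea can be illustrated with a small simulation, sketched below in Python. The population used here is hypothetical: a normal distribution with mean 0.70 and standard deviation 0.09 is assumed purely for illustration, and the seed and number of repeats are arbitrary. The sketch first reproduces the interval from the example, then counts how often an interval of the form mean ± 2 SE, built from a fresh sample of 20, contains the assumed true mean.

```python
import random
from statistics import mean, stdev


def two_se_interval(sample):
    """Return (mean - 2*SE, mean + 2*SE) with SE = s / sqrt(n)."""
    n = len(sample)
    centre = mean(sample)
    se = stdev(sample) / n ** 0.5
    return centre - 2 * se, centre + 2 * se


# Interval from the example: mean 0.70 g/cm^2, SE 0.02 g/cm^2.
text_interval = (0.70 - 2 * 0.02, 0.70 + 2 * 0.02)

# Hypothetical population parameters (assumed, not taken from the text).
TRUE_MEAN = 0.70
TRUE_SD = 0.09
rng = random.Random(1)

n_repeats = 2000
hits = 0
for _ in range(n_repeats):
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(20)]
    low, high = two_se_interval(sample)
    if low <= TRUE_MEAN <= high:
        hits += 1

coverage = hits / n_repeats
print(text_interval, coverage)
```

With samples of only 20, the observed coverage is typically a little below 95%, because a multiplier of 2 is slightly smaller than the exact value needed for so small a sample.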
3. TRANSFORMATION:
For a set of values $x_1, x_2, x_3, \dots, x_n$, let the mean be $\bar{x}$ and the variance
be $s_x^2$. Then, for any constants a and b, we have the following properties:
(a) Linear transformation: $y_i = a + b x_i$. The mean and variance of Y are given by:
$$\bar{y} = a + b\bar{x} \quad \text{and} \quad s_y^2 = b^2 s_x^2$$
For example, suppose the mean and variance of a variable X are 10 and 8, respectively.
If a new variable Y = 12 + 2X is defined, then the mean and variance of Y are:
$$\bar{y} = 12 + 2(10) = 32 \quad \text{and} \quad s_y^2 = 2^2 (8) = 32$$
(b) Z-transformation: $z_i = \dfrac{x_i - \bar{x}}{s_x}$. The mean and variance of Z can be
shown to be:
$$\bar{z} = 0 \quad \text{and} \quad s_z^2 = 1.$$
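Both properties can be verified numerically. The Python sketch below applies the linear transformation Y = 12 + 2X and the Z-transformation to the Example 4 data (5, 17, 12, 10); the choice of data set is made here for illustration only.

```python
from statistics import mean, stdev, variance

x = [5, 17, 12, 10]  # Example 4 data, reused for illustration
a, b = 12, 2

# Linear transformation: y_i = a + b * x_i
y = [a + b * xi for xi in x]
mean_y, var_y = mean(y), variance(y)
print(mean_y, a + b * mean(x))                            # the two means agree
print(round(var_y, 4), round(b ** 2 * variance(x), 4))    # the two variances agree

# Z-transformation: z_i = (x_i - mean) / sd
x_bar, s_x = mean(x), stdev(x)
z = [(xi - x_bar) / s_x for xi in x]
mean_z, var_z = mean(z), variance(z)
print(abs(mean_z) < 1e-12, abs(var_z - 1) < 1e-12)
```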
Presentations of the form a ± b are increasingly common in biomedical journals.
Some researchers indicate the two values as mean ± SE, mean ± SEM or mean ± SD;
others do not care to mention what these numbers actually stand for.
V. EXERCISES
2. Show that the sum of the deviations of a set of measurements, $x_i$, about their mean,
$\bar{x}$, is zero, i.e. $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$.
3. The hospitalisation costs of fracture (in $AUS) for 29 patients in Dubbo are as follows:
5373, 15984, 7478, 3446, 11004, 9116, 3213, 5418, 16386
2857, 3656, 61876, 2972, 3057, 14449, 9400, 27518, 23278
23548, 3016, 12921, 4640, 4644, 23098, 2654, 7975, 10245
4045, 5018.
Construct a histogram of the distribution of cost (you may use intervals of 5000, such as
5001-10000, 10001-15000, 15001-20000, etc.).
Calculate the mean, standard deviation, median, coefficient of skewness, etc. and
comment on the distribution of the data.
4. What can be said about a set of measurements which has a standard deviation of
zero?
6. When hunting insects, bats send out high-frequency sounds and then listen for the
echoes. One question of interest is the distance (in cm) between the bat and its
intended prey when the bat's echo-location system first detects the insect.
The following data comprise the bat-to-prey detection distances for 11 catches:
62 52 68 23 34 45 27 42 83 56 40
(a) Find the mean of the data set.
(b) Calculate the standard deviation of the data set, using: (i) the exact mean
(calculated to 2 d.p) (ii) the rounded mean.
(c) Calculate the 95% confidence interval (CI) for the measurements and the 95% CI for
the mean.
Comment on the difference between these results.
7. The osteocalcin levels of 5 subjects are as follows: 4, 3, 7, 11 and 10.
(a) Calculate the mean ($\bar{x}$), variance ($s^2$), standard deviation and standard error (SE)
manually. Show your working fully.
(b) Transform the original observations by subtracting the mean from each observation
(i.e. $x_i - \bar{x}$). Show that the mean of $(x_i - \bar{x})$ is zero.
(c) Let $z_i = \dfrac{x_i - \bar{x}}{s}$. Show that the mean and variance of Z are 0 and 1, respectively.
8. A set of 340 scores exhibiting a bell-shaped relative frequency distribution has mean
$\bar{x} = 72$ and standard deviation s = 8. How many of the scores would you expect to
fall in the interval 64 to 80? 56 to 88?
9. The theoretical frequency and phenotype value of a 2-allele gene locus (A and a) with
respective frequency p and q, are normally given by:
Where q = 1-p. Express the overall mean and variance of the phenotype in terms of µ,
a, d, p and q.
10. Data on lumbar spine BMD from 123 twins in Sydney stratified by VDR genotypes
are as follows:
Genotype    n     Mean (g/cm²)
TT          32    1.25
Tt          61    1.17
tt          30    1.07
Find the mean and variance of lumbar spine BMD for these twins.
(a) Calculate the mean, standard deviation and median.
(b) Find the mean and variance of Y when
(i) $y_i = x_i - 8$ (ii) $y_i = 7 x_i$ (iii) $y_i = \dfrac{x_i}{12}$ (iv) $y_i = \dfrac{x_i - 5}{7}$.
What relation can you deduce for each of the cases?
12. Use the technique of transformation (page 9) to calculate the mean and variance (and
hence SD) of the following samples: 997, 995, 998, 992 and 995, without using a
calculator.
14. Osteocalcin among a sample of 100 subjects from Denmark has the following
characteristics:
Mean: 6.9 ng/ml
Standard deviation: 5.1 ng/ml
Median: 6.2 ng/ml.
Comment on the distribution of the data.
15. Some characteristics of bone mineral contents (BMC) for Black and White people are
as follows:
Mean Median SD
Black: 2872 2812 374
White: 2744 2805 250
Calculate the coefficient of skewness for each group and comment on the results.
16. The changes in the vitamin D 1,25 level for a patient in 4 consecutive days are as
follows:
Day 1: 35; Day 2: 36; Day 3: 38; Day 4: 40
(a) Obtain the ratio of the change in one day to that in the preceding day for days 2, 3
and 4.
(b) Obtain the geometric mean of the three ratios. Show that the change in day 4 can
be obtained from knowledge of the change in day 1 and the geometric mean.
17. Data on lumbar spine BMD from a sample of 10 subjects are as follows: 0.98, 1.05,
1.01, 0.97, 0.95, 0.87, 0.50, 0.89, 1.05 and 1.08. Notice that there is one subject with
very low BMD. Would you exclude this subject when estimating the mean?
18. In an experiment designed to answer the question "does environment affect the
anatomy of the brain", rats from a genetically pure strain were randomly allocated to
two groups: a treatment group and a control group. Those in the treatment group were
placed in large cages with new toys every day. Those in the control group were
isolated in separate cages with no toys. After a month, the cortex (grey matter of the
brain) of each rat was weighed. The weights in mg were as follows:
Treatment group: 707 740 745 652 649 676 699 696 712 708 749 690
Control group: 669 650 651 627 656 642 698 648 676 657 692 621
(a) Present the data in a graphical format so that they can be visualised easily.
(b) Calculate the relevant statistics and discuss their values.