Presentation and Summary of Data
Objectives
a) To be able to recognise the scale of measurement of any given variable.
b) To know how to present numeric data graphically using histograms and boxplots.
c) To know how the various measures of location and dispersion are defined and to be able to select
appropriate measures for a given set of data.
d) To be able to produce both graphs and summary measures using SPSS.
2.1 Introduction
In a physiology experiment 30 students had their heart rates (in beats per minute) measured after
completing a standard exercise test. The results were as follows:
71 68 76 73 71 72 74 73 74 73
70 76 72 70 73 69 77 71 75 74
72 69 73 78 72 77 75 77 70 75
Clearly there is a degree of variation in these results. As is typical of many measurements made in
medicine and dentistry, this variation may arise from several sources, such as biological variation
(differences in resting heart rate and fitness between students) and measurement error.
In their present state, the figures convey little information. The aim is to present the data in a compact
and understandable form. There are two ways of doing this.
i. In graphical form
ii. By using summary statistics
A frequency distribution shows the frequency of occurrence of the different values of a variable, and
may be represented either as a table or as a graph. For nominal or ordinal scale variables the graph is
called a bar chart with frequency on the vertical axis and values of the variable on the horizontal axis
(see section 2.3). For interval scale variables the range of the variable is first divided into classes and
the figure is called a histogram (see section 2.6). Relative frequency is the frequency expressed as a
proportion (or percentage) of the total frequency and can be particularly useful for comparing two or
more frequency distributions.
2.3 Graphical presentation of categorical data
Although much used in the popular press, pie charts are not favoured in scientific work and bar charts
are preferred. The following data show the distribution of delay time for almost 2,000 middle-aged
men who had a heart attack in the greater Belfast area during the period 1983-85.
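Delay time   Frequency   Percentage
Short              746         39%
Medium             459         24%
Long               408         21%
Very long          301         16%
Total             1914        100%
Source: Rev.Epidem.et Sante Publ. 1990; 38: 419-427
[Figure: bar chart of frequency (vertical axis) against delay time (horizontal axis: Short, Medium,
Long, Very long), with the bars labelled 39%, 24%, 21% and 16%.]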
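The percentages in the delay-time table are relative frequencies. As a minimal sketch, they can be
recomputed from the frequencies as follows; the variable names are illustrative only and Python is
used purely for demonstration.

```python
# Frequencies taken from the delay-time table.
delay_counts = {"Short": 746, "Medium": 459, "Long": 408, "Very long": 301}

total = sum(delay_counts.values())

# Relative frequency = frequency / total frequency, expressed here as a percentage.
relative_pct = {
    category: 100 * count / total for category, count in delay_counts.items()
}

for category, count in delay_counts.items():
    print(f"{category:<10}{count:>6}{relative_pct[category]:>6.0f}%")
print(f"{'Total':<10}{total:>6}{sum(relative_pct.values()):>6.0f}%")
```

Rounded to whole numbers, the computed percentages agree with those shown in the table.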
• The arithmetic mean (usually shortened to mean) is the sum of all the observations in
the sample divided by the total number of observations:
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
• The median is the middle value when the sample is arranged in increasing order (with an even number
of observations, the mean of the two middle values). The median therefore cuts the sample in half,
with 50% of the values less than the median and 50% greater than the median.
• The quantiles (quartiles, deciles, percentiles etc.) are the (k - 1) values of the variable which divide
the sample into k equal parts when the sample values are arranged in increasing order. They
identify locations other than the centre of the sample.
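The definitions above can be sketched in a few lines of Python. This is an illustration only: the
sample is the first ten heart rates from section 2.1, the function names are hypothetical, and the
quartiles come from the standard library's statistics.quantiles, which interpolates between ordered
values (other packages may use slightly different conventions).

```python
import statistics

# First ten heart rates (beats per min) from section 2.1.
sample = [71, 68, 76, 73, 71, 72, 74, 73, 74, 73]


def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the ordered sample (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


print("mean   =", mean(sample))
print("median =", median(sample))

# Quartiles are the k - 1 = 3 values that divide the ordered sample into k = 4 parts.
q1, q2, q3 = statistics.quantiles(sample, n=4)
print("quartiles =", q1, q2, q3)
```

The second quartile is the median, so q2 and median(sample) should agree.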
2.5 Measures of Dispersion
A measure of dispersion is a quantity which describes the degree of variation, spread, or scatter of the
observations in the sample about their central value.
• The range is the difference between the largest and smallest values in the sample.
Range = xmax - xmin. Unfortunately it is severely affected by outliers (rogue results).
• The interquartile range is the difference between the third and first quartiles.
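• The variance is approximately the arithmetic mean of the squared deviations of the values
from their mean:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
[Figure: the sample values $x_1, x_2, \dots, x_i, \dots, x_n$ and their mean $\bar{x}$ marked on an
axis, showing the deviations $x_1 - \bar{x}$, $x_i - \bar{x}$ and $x_n - \bar{x}$.]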
• The square root of the variance is the standard deviation. It has the advantage of being
in the original scale of measurement, and is therefore used in preference to the variance.
• The coefficient of variation is the standard deviation divided by the mean, often expressed as a
percentage. It is particularly useful for comparing dispersions between two variables with different
units of measurement. Because it is a measure of relative variation (i.e. standard deviation relative
to mean) it can also be useful for comparing dispersions between two sets of data with the same units
of measurement but with very different means.
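As a rough sketch, the measures of dispersion above can be computed as follows. The sample is the
first ten heart rates from section 2.1; the function name and the added value of 120 (standing in
for a rogue result) are hypothetical and are there only to show how the range reacts to an outlier
while the interquartile range does not.

```python
import math
import statistics

# First ten heart rates (beats per min) from section 2.1.
sample = [71, 68, 76, 73, 71, 72, 74, 73, 74, 73]


def dispersion_summary(values):
    """Return the range, IQR, variance, SD and CV (%) of a sample."""
    n = len(values)
    x_bar = sum(values) / n
    # Sample variance: sum of squared deviations divided by n - 1.
    variance = sum((x - x_bar) ** 2 for x in values) / (n - 1)
    std_dev = math.sqrt(variance)
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "variance": variance,
        "sd": std_dev,
        # Coefficient of variation: SD relative to the mean, as a percentage.
        "cv_percent": 100 * std_dev / x_bar,
    }


original = dispersion_summary(sample)
with_outlier = dispersion_summary(sample + [120])  # 120 is a hypothetical rogue result

for name in ("range", "iqr", "sd", "cv_percent"):
    print(f"{name:<11}{original[name]:>8.2f}{with_outlier[name]:>8.2f}")
```

In this illustration the single rogue value inflates the range and the standard deviation, whereas
the interquartile range is unchanged.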
2.6 Graphical presentation of measurement data
The histogram is formed by dividing the range of the variable into a number of classes of equal width.
The frequency distribution is then plotted as a series of contiguous bars, the height of the bar being
proportional to the frequency in the class. The boxplot displays the minimum, the maximum, the median
and the quartiles. Sometimes outliers (rogue results) are identified separately by stars. Note the
difference in appearance of the plot for the symmetric and skewed distributions shown opposite.
[Figure: histograms (frequency on the vertical axis) with the mean, median and mode marked, and
boxplots with the minimum, first quartile, median, third quartile and maximum marked, for a symmetric
and a skewed distribution; one axis is labelled DBP.]
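The five values marked on a boxplot can be computed directly. The sketch below is illustrative
only: it uses the heart rate data from section 2.1 as a nearly symmetric sample and a small made-up
sample (labelled hypothetical in the code) as a positively skewed one. The quartile convention is
that of Python's statistics.quantiles and may differ slightly from other packages.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)


# Heart rates (beats per min) from section 2.1.
heart_rates = [
    71, 68, 76, 73, 71, 72, 74, 73, 74, 73,
    70, 76, 72, 70, 73, 69, 77, 71, 75, 74,
    72, 69, 73, 78, 72, 77, 75, 77, 70, 75,
]

# Hypothetical positively skewed sample, for comparison only.
skewed = [1, 2, 2, 3, 3, 4, 5, 7, 10, 16, 40]

for name, data in (("heart rate", heart_rates), ("skewed", skewed)):
    low, q1, med, q3, high = five_number_summary(data)
    print(f"{name:<11} min={low} Q1={q1} median={med} Q3={q3} max={high}")
```

For the heart rates the median lies midway between the minimum and maximum, whereas in the skewed
sample the distance from the third quartile to the maximum is far greater than the distance from the
minimum to the first quartile.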
These graphical procedures are important in determining which measures of location and
dispersion are most appropriate for summarising any given set of data. If a distribution is heavily
skewed then the median and interquartile range are preferred as the summary measures rather than
the mean and standard deviation. Sometimes variables which are heavily positively skewed are
logarithmically transformed in order to obtain a more symmetric distribution.
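The effect of a logarithmic transformation can be illustrated with a short sketch. The data are
hypothetical, and the skewness indicator used here, (mean - median) / standard deviation, is simply
a convenient assumption for this example rather than a measure defined in these notes. It is
typically close to zero for a symmetric sample and positive for a positively skewed one.

```python
import math
import statistics

# Hypothetical positively skewed sample (all values must be positive to take logs).
skewed = [1, 2, 2, 3, 3, 4, 5, 7, 10, 16, 40]


def skew_indicator(values):
    """(mean - median) / SD: a crude, scale-free indicator of skewness."""
    gap = statistics.mean(values) - statistics.median(values)
    return gap / statistics.stdev(values)


logged = [math.log10(x) for x in skewed]

print(f"raw data:    indicator = {skew_indicator(skewed):.2f}")
print(f"log10 data:  indicator = {skew_indicator(logged):.2f}")
```

The indicator is smaller after the transformation, which is consistent with the transformed data
being more nearly symmetric.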
2.7 Heart rate example
Summarise the heart rate data in section 2.1 by constructing the frequency distribution, and present the
results in both tabular and graphical (histogram) form. Calculate appropriate measures of location and
dispersion.
Heart rate        Tally        Frequency   Relative
(beats per min)                            frequency
_______________   __________   _________   _________
67-68             |                    1        .033
69-70             |||||                5        .167
71-72             ||||| ||             7        .233
73-74             ||||| |||            8        .267
75-76             |||||                5        .167
77-78             ||||                 4        .133
Total                                 30       1.000
[Figure: histogram of heart rate (beats per min.) with frequency on the vertical axis.]
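The frequency table can be checked with a short sketch. It assumes, as in the table, classes of
width 2 beats per min starting at 67; the names class_width, lowest and class_label are
illustrative.

```python
from collections import Counter

# Heart rates (beats per min) from section 2.1.
heart_rates = [
    71, 68, 76, 73, 71, 72, 74, 73, 74, 73,
    70, 76, 72, 70, 73, 69, 77, 71, 75, 74,
    72, 69, 73, 78, 72, 77, 75, 77, 70, 75,
]

class_width = 2  # width of each class (beats per min)
lowest = 67      # lower limit of the first class


def class_label(value):
    """Return the label of the class containing value, e.g. '71-72'."""
    lower = lowest + ((value - lowest) // class_width) * class_width
    return f"{lower}-{lower + class_width - 1}"


counts = Counter(class_label(x) for x in heart_rates)
n = len(heart_rates)

for label in sorted(counts):
    freq = counts[label]
    print(f"{label}  {'|' * freq:<8}  {freq:>2}  {freq / n:.3f}")
print(f"Total  {'':<8}  {n:>2}  {sum(counts.values()) / n:.3f}")
```

Each row gives the class, a simple tally, the frequency and the relative frequency.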
Since the frequency distribution is nearly symmetric the mean and standard deviation are the most
appropriate measures of location and dispersion.
$$\text{mean } \bar{x} = \frac{\sum x_i}{n} = \frac{2190}{30} = 73 \text{ beats per min}$$
$$\text{variance } s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{30-1}\left[(71-73)^2 + (68-73)^2 + \dots + (75-73)^2\right] = 7.10 \text{ (beats per min)}^2$$
Had the distribution been skewed then the median and interquartile range would have been preferred.
median = (73 + 73)/2 = 73 beats per min
first quartile, Q1 = 71 beats per min
third quartile, Q3 = 75 beats per min
interquartile range = Q3 - Q1 = 75 - 71 = 4 beats per min
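The hand calculations above can be reproduced with Python's standard statistics module. This is
offered only as a cross-check and is not part of the SPSS workflow used in this course; the quartile
convention of statistics.quantiles may differ slightly from that of other packages, although it
gives the same quartiles for these data.

```python
import statistics

# Heart rates (beats per min) from section 2.1.
heart_rates = [
    71, 68, 76, 73, 71, 72, 74, 73, 74, 73,
    70, 76, 72, 70, 73, 69, 77, 71, 75, 74,
    72, 69, 73, 78, 72, 77, 75, 77, 70, 75,
]

mean = statistics.mean(heart_rates)          # sum of values / n
variance = statistics.variance(heart_rates)  # sample variance, divisor n - 1
std_dev = statistics.stdev(heart_rates)      # square root of the sample variance
median = statistics.median(heart_rates)
q1, _, q3 = statistics.quantiles(heart_rates, n=4)
iqr = q3 - q1

print(f"mean     = {mean} beats per min")
print(f"variance = {variance:.2f} (beats per min)^2")
print(f"SD       = {std_dev:.2f} beats per min")
print(f"median   = {median} beats per min")
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr} beats per min")
```

The results agree with the values obtained by hand, and the standard deviation, the square root of
the unrounded variance, is about 2.67 beats per min.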
Throughout this course we will use the computer to perform these tasks, but it is nevertheless important to
appreciate how they are performed.
2.8 Obtaining graphical output and summary measures in SPSS
To obtain graphical output (histogram, boxplot, pie chart or bar chart) in SPSS follow the relevant
menu options below and then click on the variables and press the arrow button to move them into the
relevant boxes. Then press OK.
Graphs → Histogram... (click on the Display normal curve box for a superimposed Normal distribution)
Graphs → Boxplot... (optionally enter a variable in the Category Axis box for side-by-side boxplots)
Graphs → Pie...
Graphs → Bar...
To obtain summary measures in SPSS follow the Analyze → Descriptive Statistics → Frequencies...
menu options and then click on the variables and press the arrow button to move them into the
Variable(s): box. Press the Statistics... button and click all the required options. Press the Continue
and OK buttons.
The Analyze → Descriptive Statistics → Descriptives… option offers a less comprehensive range of
statistics in a more compact output which may be useful for screening large numbers of variables
quickly.
2.10 Practical
2.1) The following results for haemoglobin concentration (g/dl) were obtained from blood samples
from 11 individuals.
14.7 15.2 16.2 15.9 13.4 11.6 12.0 13.4 13.3 12.5 10.6
Use SPSS to calculate the following measures:
(i) mean
(ii) median
(iii) range
(iv) standard deviation
(v) coefficient of variation
The mean corpuscular haemoglobin (pg) was estimated for the same 11 people. The results
are given below.
23.8 20.0 21.7 22.0 23.7 24.0 23.7 27.7 30.3 27.4 22.4
Which of these two measurements, (i.e. haemoglobin or mean corpuscular haemoglobin), is
the more variable?
2.2) Now open the worksheet j:\medstats\caer.sav which contains selected data from a study of
ischaemic heart disease in a cohort of approximately 2,500 middle-aged men from the Welsh
town of Caerphilly.
Examine the distribution of each variable in the table by a suitable graphical method.
Depending on the shape of the distribution, select an appropriate summary measure of location
and of dispersion, and record the values of these measures in the table.
HT Height (cm)
WT Weight (kg)
For any variable that is heavily positively skewed, re-examine the shape of the distribution
after applying a logarithmic transformation to check that the distribution is more symmetric.
2.3) The following table shows the numbers of fatal and non-fatal road accidents reported to the
police in Northern Ireland in 1981 by day of the week.
Present the data using a multiple bar chart in such a way as to make it easy to compare the
distributions of each type of accident throughout the week.