
Presentation and Summary of Data

This document discusses presenting and summarizing numeric data. It defines key terms like variables, scales of measurement, and frequency distributions. Graphical methods for presenting data include histograms and box plots. Numerical summaries include measures of central tendency (mean, median, mode) and dispersion (range, interquartile range, variance, standard deviation, coefficient of variation). These statistical techniques help analyze and communicate patterns in data in a clear and understandable way.

Uploaded by

Dr P

Chapter 2 PRESENTATION AND SUMMARY OF DATA

Objectives
a) To be able to recognise the scale of measurement of any given variable.
b) To know how to present numeric data graphically using histograms and boxplots.
c) To know how the various measures of location and dispersion are defined and to be able to select
appropriate measures for a given set of data.
d) To be able to produce both graphs and summary measures using SPSS.

2.1 Introduction
In a physiology experiment 30 students had their heart rates (in beats per minute) measured after
completing a standard exercise test. The results were as follows:
71 68 76 73 71 72 74 73 74 73
70 76 72 70 73 69 77 71 75 74
72 69 73 78 72 77 75 77 70 75
Clearly there is a degree of variation in these results. As is typical of many measurements made in
medicine and dentistry, this variation may arise from several sources, e.g. biological variation
(differences in resting heart rate and fitness between students) and measurement error.
In their present state, the figures convey little information. The aim is to present the data in a compact
and understandable form. There are two ways of doing this.
i. In graphical form
ii. By using summary statistics

2.2 Some Definitions


A variable is a characteristic, of a given subject, which may take any one of a set of values. There are
two types of variables:
• qualitative variables e.g. sex, blood group, country of birth
• quantitative or measurable variables subdivided into:
    continuous variables, taking one of an infinite number of possible values
        e.g. height, weight, temperature, haemoglobin.
    discrete variables, taking a number of (usually integer) values
        e.g. parity, radioactive counts, number of days in hospital.

Any variable may also be assigned to one of three scales of measurement.


i. nominal categorical scale variables e.g. sex, blood group, country of birth.
ii. ordinal categorical and ranked scale variables e.g. pain on a three point scale, position in the
    class in an assessment.
iii. interval scale variables e.g. height (cm), temperature (°C), blood pressure (mmHg).

A frequency distribution shows the frequency of occurrence of the different values of a variable, and
may be represented either as a table or as a graph. For nominal or ordinal scale variables the graph is
called a bar chart with frequency on the vertical axis and values of the variable on the horizontal axis
(see section 2.3). For interval scale variables the range of the variable is first divided into classes and
the figure is called a histogram (see section 2.6). Relative frequency is the frequency expressed as a
proportion (or percentage) of the total frequency and can be particularly useful for comparing two or
more frequency distributions.

2.3 Graphical presentation of categorical data

Although much used in the popular press, pie charts are not favoured in scientific work and bar charts
are preferred. The following data show the distribution of delay time for almost 2,000 middle-aged
men who had a heart attack in the greater Belfast area during the period 1983-85.

Delay time    Frequency    Relative frequency
Short            746             39%
Medium           459             24%
Long             408             21%
Very long        301             16%
Total           1914            100%

Source: Rev.Epidem.et Sante Publ. 1990; 38: 419-427

[Figure: bar chart of frequency against delay time (Short, Medium, Long, Very long), with the
relative frequency marked on each bar.]
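
The relative frequencies in the table can be checked directly from the frequencies. The following
is a minimal Python sketch; the variable names are illustrative and do not come from the original
notes.

```python
# Frequencies of delay time, taken from the table above.
delay_counts = {"Short": 746, "Medium": 459, "Long": 408, "Very long": 301}

total = sum(delay_counts.values())

# Relative frequency = frequency expressed as a percentage of the total frequency.
relative_pct = {category: 100 * count / total
                for category, count in delay_counts.items()}

for category, count in delay_counts.items():
    print(f"{category:<10}{count:>6}{relative_pct[category]:>6.0f}%")
print(f"{'Total':<10}{total:>6}{100:>6.0f}%")
```

Because relative frequencies do not depend on the sample size, the same calculation applied to each
subgroup gives figures that can be compared directly between subgroups.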

The simple bar chart shown above may usefully be extended to a multiple bar chart or a composite bar
chart to assist in the comparison of subgroups.

For example, careful examination of the multiple bar chart shown opposite reveals a slight difference
in the distribution of delay time between single, married and divorced/widowed/separated men.

[Figure: multiple bar chart of relative frequency (%) against delay time, with separate bars for
single, married and div/wid/sep men.]

2.4 Measures of Location


A measure of location is the value at which the sample is ‘centred’.

• The arithmetic mean (usually shortened to mean) is the sum of all the observations in
the sample divided by the total number of observations

\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

• The median is the middle value if the sample is arranged in increasing order. The median therefore
cuts the sample in half with 50% less than the median and 50% greater than the median.

(a) for n odd, median is the middle observation


(b) for n even, median is the arithmetic mean of the two middle observations

• The mode is the most commonly occurring value in the sample.

• The quantiles (quartiles, deciles, percentiles etc.) are the (k - 1) values of the variable which divide
the sample into k equal parts when the sample values are arranged in increasing order. They
identify locations other than the centre of the sample.
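
The definitions above can be sketched in Python using the standard `statistics` module. The two
samples below are hypothetical and were chosen only to show the odd and even cases of the median;
they are not data from these notes.

```python
import statistics

# Hypothetical samples, chosen only to illustrate the definitions.
odd_sample = [3, 5, 5, 8, 12]         # n = 5 (odd)
even_sample = [3, 5, 5, 8, 12, 20]    # n = 6 (even)

mean_odd = statistics.mean(odd_sample)          # (3 + 5 + 5 + 8 + 12) / 5
median_odd = statistics.median(odd_sample)      # the middle observation
median_even = statistics.median(even_sample)    # mean of the two middle observations
mode_odd = statistics.mode(odd_sample)          # the most commonly occurring value

# Quartiles: the k - 1 = 3 values that divide the ordered sample into k = 4 parts.
# Software packages interpolate between observations in slightly different ways,
# so quartiles from other programs may differ a little for small samples.
quartiles = statistics.quantiles(even_sample, n=4)

print(mean_odd, median_odd, median_even, mode_odd, quartiles)
```

The second quartile returned by `statistics.quantiles` is the median, which gives a quick check on
the output.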

2.5 Measures of Dispersion
A measure of dispersion is a quantity which describes the degree of variation, spread, or scatter of the
observations in the sample about their central value.

• The range is the difference between the largest and smallest values in the sample.
Range = xmax - xmin. Unfortunately it is severely affected by outliers (rogue results).

• The interquartile range is the difference between the third and first quartiles.

• The variance is approximately the arithmetic mean of the squared deviations of the values
from their mean

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

[Figure: the observations x_1, x_2, ..., x_i, ..., x_n and their mean \bar{x} marked on a line,
showing the deviations x_1 - \bar{x}, x_i - \bar{x} and x_n - \bar{x}.]

• The square root of the variance is the standard deviation. It has the advantage of being
in the original scale of measurement, and is therefore used in preference to the variance.

• The coefficient of variation is the standard deviation as a percentage of the mean.


\[ c = \frac{s}{\bar{x}} \times 100\% \]

This expresses the standard deviation relative to the mean and provides a measure of variation
which is independent of the units of measurement.

The coefficient of variation is particularly useful for comparing dispersions between two
variables with different units of measurement. Because it is a measure of relative variation (i.e.
standard deviation relative to mean) it can also be useful for comparing dispersions between
two sets of data with the same units of measurement but with very different means.
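
As a rough illustration of the measures of dispersion, the Python sketch below computes the range,
variance, standard deviation and coefficient of variation for a hypothetical set of heights, and
then repeats the coefficient of variation after converting the heights from centimetres to metres.
The data and the helper function `coefficient_of_variation` are assumptions made for this example.

```python
import math
import statistics

def coefficient_of_variation(sample):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * statistics.stdev(sample) / statistics.mean(sample)

# Hypothetical heights, recorded in centimetres and again in metres.
heights_cm = [160, 165, 170, 175, 180]
heights_m = [h / 100 for h in heights_cm]

value_range = max(heights_cm) - min(heights_cm)   # x_max - x_min, in cm
variance = statistics.variance(heights_cm)        # divisor n - 1, in cm squared
std_dev = statistics.stdev(heights_cm)            # square root of the variance, in cm

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)

print(value_range, variance, round(std_dev, 2))
# The coefficient of variation is the same whichever unit is used.
print(round(cv_cm, 2), math.isclose(cv_cm, cv_m))
```

The standard deviation changes by a factor of 100 between the two units, while the coefficient of
variation does not, which is what makes it suitable for comparing variables measured in different
units.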

2.6 Graphical presentation of measurement data
The histogram is formed by dividing the range of the variable into a number of classes of equal
width. The frequency distribution is then plotted as a series of contiguous bars, the height of the
bar being proportional to the frequency in the class. The examples opposite show histograms for a
variable with a symmetric distribution (diastolic blood pressure) and a variable whose distribution
shows positive skewness (total triglyceride) with a long tail to the right.

[Figure: histograms of frequency against DBP (mmHg) and against total triglyceride (mg/100ml), with
the positions of the mean, median and mode marked on each.]

The boxplot is a five point summary of the data consisting of the minimum, maximum, median and
first and third quartiles. Sometimes outliers (rogue results) are identified separately by stars.
Note the difference in appearance of the plot for the symmetric and skewed distributions shown
opposite.

[Figure: boxplots of DBP (mmHg) and total triglyceride (mg/100ml), each labelled with the minimum,
first quartile, median, third quartile and maximum.]

These graphical procedures are important in determining which measures of location and
dispersion are most appropriate for summarising any given set of data. If a distribution is heavily
skewed then the median and interquartile range are preferred as the summary measures rather than
the mean and standard deviation. Sometimes variables which are heavily positively skewed are
logarithmically transformed in order to obtain a more symmetric distribution.
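
The effect of a logarithmic transformation can be sketched in Python. The sample below is
hypothetical and deliberately extreme (each value is double the previous one), so that the
transformed values are exactly evenly spaced; real data would only become approximately symmetric.

```python
import math
import statistics

# Hypothetical positively skewed sample with a long tail to the right.
skewed = [1, 2, 4, 8, 16, 32, 64]

mean_raw = statistics.mean(skewed)       # pulled upwards by the long right tail
median_raw = statistics.median(skewed)   # the middle observation

# Logarithmic transformation (base 10 here; the choice of base is an assumption).
logged = [math.log10(x) for x in skewed]
mean_log = statistics.mean(logged)
median_log = statistics.median(logged)

print(f"raw:    mean = {mean_raw:.2f}, median = {median_raw}")
print(f"logged: mean = {mean_log:.3f}, median = {median_log:.3f}")
```

On the original scale the mean is well above the median, which is a sign of positive skewness. After
the transformation the two agree, as expected for a symmetric distribution.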
2.7 Heart rate example
Summarise the heart rate data in section 2.1 by constructing the frequency distribution, and present the
results in both tabular and graphical (histogram) form. Calculate appropriate measures of location and
dispersion.

Heart rate         Tally        Frequency    Relative
(beats per min)                              frequency
_______________    _________    _________    _________
67-68              |                 1         .033
69-70              |||||             5         .167
71-72              ||||| ||          7         .233
73-74              ||||| |||         8         .267
75-76              |||||             5         .167
77-78              ||||              4         .133
Total                               30        1.000

[Figure: histogram of frequency against heart rate (beats per min.).]

Since the frequency distribution is nearly symmetric the mean and standard deviation are the most
appropriate measures of location and dispersion.
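\[ \text{mean } \bar{x} = \frac{\sum x_i}{n} = \frac{2190}{30} = 73 \text{ beats per min} \]

\[ \text{variance } s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
   = \frac{1}{30-1} \left[ (71-73)^2 + (68-73)^2 + \dots + (75-73)^2 \right]
   = 7.10 \text{ (beats per min)}^2 \]

\[ \text{standard deviation } s = \sqrt{7.10} = 2.67 \text{ beats per min} \]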

Had the distribution been skewed then the median and interquartile range would have been preferred.
median = (73 + 73)/2 = 73 beats per min
first quartile, Q1 = 71 beats per min
third quartile, Q3 = 75 beats per min
interquartile range = Q3 - Q1 = 75 - 71 = 4 beats per min

Throughout this course we will use the computer to perform these tasks, but it is nevertheless important to
appreciate how they are performed.
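
Although the course uses SPSS, the hand calculations above can also be checked with a short Python
sketch using only the standard library. The variable names are illustrative. The quartiles rely on
the default interpolation rule of `statistics.quantiles`, which happens to agree with the values
above for these data; other software may use a different rule.

```python
import statistics
from collections import Counter

# Heart rates (beats per min) of the 30 students in section 2.1.
heart_rates = [
    71, 68, 76, 73, 71, 72, 74, 73, 74, 73,
    70, 76, 72, 70, 73, 69, 77, 71, 75, 74,
    72, 69, 73, 78, 72, 77, 75, 77, 70, 75,
]
n = len(heart_rates)

# Frequency distribution over the classes 67-68, 69-70, ..., 77-78.
class_width, first_lower = 2, 67
counts = Counter((x - first_lower) // class_width for x in heart_rates)
table = []
for k in sorted(counts):
    lower = first_lower + k * class_width
    label = f"{lower}-{lower + class_width - 1}"
    table.append((label, counts[k], counts[k] / n))   # class, frequency, relative frequency

mean = statistics.mean(heart_rates)
variance = statistics.variance(heart_rates)   # divisor n - 1
std_dev = statistics.stdev(heart_rates)       # computed from the unrounded variance
q1, median, q3 = statistics.quantiles(heart_rates, n=4)
iqr = q3 - q1

for label, freq, rel in table:
    print(f"{label}  {freq:2d}  {rel:.3f}")
print(f"mean = {mean}, variance = {variance:.2f}, sd = {std_dev:.2f}")
print(f"median = {median}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```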

2.8 Obtaining graphical output and summary measures in SPSS
To obtain graphical output (histogram, boxplot, pie chart or bar chart) in SPSS follow the relevant
menu options below and then click on the variables and press the arrow button to move them into the
relevant boxes. Then press OK.

Graphs → Histogram... (click on Display normal curve box for a superimposed Normal distribution)
Graphs → Boxplot... (optionally enter a variable in the Category Axis box for side-by-side boxplots)
Graphs → Pie...
Graphs → Bar...

To obtain summary measures in SPSS follow the Analyze → Descriptive Statistics → Frequencies...
menu options and then click on the variables and press the arrow button to move them into the
Variable(s): box. Press the Statistics... button and click all the required options. Press the Continue
and OK buttons.

The Analyze → Descriptive Statistics → Descriptives… option offers a less comprehensive range of
statistics in a more compact output which may be useful for screening large numbers of variables
quickly.

2.9 Further Reading

Bland Sections 4.1-4.8, 5.3-5.8

2.10 Practical

2.1) The following results for haemoglobin concentration (g/dl) were obtained from blood samples
from 11 individuals.
14.7 15.2 16.2 15.9 13.4 11.6 12.0 13.4 13.3 12.5 10.6
Use SPSS to calculate the following measures:
(i) mean
(ii) median
(iii) range
(iv) standard deviation
(v) coefficient of variation

The mean corpuscular haemoglobin (pg) was estimated for the same 11 people. The results
are given below.
23.8 20.0 21.7 22.0 23.7 24.0 23.7 27.7 30.3 27.4 22.4
Which of these two measurements, (i.e. haemoglobin or mean corpuscular haemoglobin), is
the more variable?

2.2) Now open the worksheet j:\medstats\caer.sav which contains selected data from a study of
ischaemic heart disease in a cohort of approximately 2,500 middle-aged men from the Welsh
town of Caerphilly.
Examine the distribution of each variable in the table by a suitable graphical method.
Depending on the shape of the distribution, select an appropriate summary measure of location
and of dispersion, and record the values of these measures in the table.

Variable                                 Symmetric     Most appropriate       Most appropriate
                                         or skewed?    measure of location    measure of dispersion
SBP      Systolic blood pressure (mmHg)
HT       Height (cm)
WT       Weight (kg)
TOTTRIG  Total triglyceride (mmol/l)

For any variable that is heavily positively skewed, re-examine the shape of the distribution
after applying a logarithmic transformation to check that the distribution is more symmetric.

2.3) The following table shows the numbers of fatal and non-fatal road accidents reported to the
police in Northern Ireland in 1981 by day of the week.

Day of Week    Fatal Accidents    Non-fatal Accidents
Sunday               27                  585
Monday               18                  677
Tuesday              20                  657
Wednesday            26                  722
Thursday             28                  784
Friday               38                  842
Saturday             47                  774
Total               204                 5041

Source: Death and Injury Road Accidents in Northern Ireland, Royal Ulster Constabulary, 1981.

Present the data using a multiple bar chart in such a way as to make it easy to compare the
distributions of each type of accident throughout the week.

Comment on possible reasons for any differences in distribution you observe.
