Teaching
health statistics
LESSON AND SEMINAR OUTLINES
Second edition
Edited by
S.K. Lwanga
Department of Health Systems, World Health Organization, Geneva, Switzerland
Cho-Yook Tye
Formerly Department of Social Medicine and Public Health, National University of Singapore, Singapore
O. Ayeni
Special Programme of Research, Development and Research Training in Human Reproduction,
Department of Reproductive Health and Research, World Health Organization, Geneva, Switzerland
Contents
Preface
Introduction
* Seminar.
Preface
The need for a statistical approach is now well recognized in epidemiology and
public health, since these fields are concerned with communities or populations
where the laws of large numbers and random fluctuations clearly apply.
Teachers of health workers and students, however, have been slow to recognize
the need for a knowledge of statistics, even though all aspects of diagnosis and
prognosis are affected by rules of probability.
This book is intended to contribute to the long-term reorientation of the health
information systems of Member States, by bringing about improved data
generation, handling, processing and use, in order to meet future health
requirements.
The extent of statistical knowledge and skills that students need to acquire var-
ies from country to country, according to such factors as the common health
problems and methods of delivering health care in the country, and the career
prospects of the students on graduation. Nevertheless, there is a core of statisti-
cal knowledge that all students need to have, irrespective of their country of
training.
The present set of outlines is a revised version of Teaching health statistics: twenty
lesson and seminar outlines (Lwanga SK & Tye C-Y, eds. Geneva, World Health
Organization, 1986). The topics covered form an internationally acceptable standard
basic curriculum for teaching health statistics. While based on those of the
first edition, the lesson and seminar outlines have been revised and updated in
both content and orientation. They cover not only the conventional topics of
data collection, presentation and analysis, probability and vital statistics, but also
such topics as health indicators, use of computers and rapid methods of interim
assessment. The concepts highlighted by the outlines should be useful to all
students in the health field, and are meant to be used selectively by teachers of
statistics in preparing their courses.
This new edition is a result of close collaboration between a number of
eminent teachers of statistics, and has been coordinated and edited by
Mr S. K. Lwanga, Statistician, Department of Health Systems, World Health
Organization, with valuable assistance from Dr O. Ayeni, Biostatistician, Special
Programme of Research, Development and Research Training in Human
Reproduction, Department of Reproductive Health and Research, World Health
Organization.
The preparation of the first edition of this book was conceived by Dr Boga
Skrinjar-Nerima while she was Chief Medical Officer at the World Health Or-
ganization in charge of the development of health statistical services. Her contri-
bution is still highly appreciated.
The World Health Organization wishes to thank the following eminent teachers
who made invaluable contributions to this edition of lesson and seminar out-
lines: Professor E. Bamgboye, Department of Family and Community Medicine,
College of Medicine, King Saud University, Riyadh, Saudi Arabia; Professor R.
Biritwum, Department of Community Health, University of Ghana Medical
School, Accra, Ghana; Professor A. Indrayan, Division of Biostatistics and Medical
Informatics, University College of Medical Sciences, Delhi, India; and Professor
K. Sumbuloglu, Department of Biostatistics, Hacettepe University, Faculty of
Medicine, Ankara, Turkey. Thanks are also due to all the teachers and colleagues
who contributed to the first edition of the book or reviewed the various drafts of
this revised version.
This publication is specially dedicated to the memory of the late Ron Lowe, C.B.E.,
Emeritus Professor of Community Medicine, Welsh National School of Medi-
cine, Cardiff, Wales, who helped and guided the Organization's efforts towards
the improvement of the teaching of statistics and their use in epidemiology and
public health.
PART I
Statistical principles
and methods
OUTLINE 1 Introduction to the role of statistics in
health sciences and health care delivery
Enabling objectives
At the end of the lesson the students should be able to:
(b) Indicate, through examples (without necessarily going into great detail), how statistical
principles and concepts are relevant in the following situations:
(c) List sources of uncertainty in health sciences and health care delivery.
(d) Describe the role of statistics in the management of uncertainty in health sciences and health
care delivery.
(a) As an introduction, discuss the general and specific objectives of the course as a whole,
making it clear that it is not intended to produce health statisticians, but health workers who
will be able to make rational decisions in their work. Emphasize the use of statistics as a tool
rather than as an end. Give an overview of the course, its structure, organization, teaching
methods and timetable.
(b) Explain the meaning of "statistics" and "statistical methods", giving examples of their appli-
cation in health care. Explain the need for data in decision-making. Hence, explain the
importance of the study of survey design, instrument calibration, and data collection, process-
ing, analysis, presentation, interpretation and communication.
(c) Discuss the problems posed by variation and uncertainty in: the study of disease etiology,
causation or risk factors; the evaluation of response to treatment; the determination of "nor-
mal", "usual" and "ideal" values of characteristics; and hence the methods needed to handle
them.
(d) Explain the essential role of statistics in the field of health (for example, in acquiring and
using medical knowledge, and in medical practice). Use examples to show how decisions are
made by health workers in the course of their duties (for example, in making a diagnosis,
assessing prognosis and deciding on the correct treatment for a patient), and by health ad-
ministrators, planners and evaluators.
(e) Point out the widespread use of statistical methods in medical journals. Progressive health
workers depend, to a considerable extent, on literature to update their knowledge. Some-
times the handouts distributed by pharmaceutical firms also contain statistical results.
Readers, therefore, need to have the ability to evaluate the validity and reliability of the
information in these reports. They also need to be familiar with the basic technical language
of the statistical and epidemiological methods which are commonly used in the medical
literature. Health workers themselves may have to use this language in the reports of their
work.
Lesson exercises
Lesson exercises should test the students' ability to describe the importance and uses of statisti-
cal methods in the field of health, and should give as many examples as possible. The exercises
should, therefore, be of such a nature as to elicit from the students examples of the need for
statistics and their use in solving health problems.
• Give four areas in health care delivery where the science of statistics is
applied.
References
Bland M. An introduction to medical statistics. Oxford, Oxford University Press, 1987.
Dixon RA. Medical statistics: content and objectives of a core course for medical students.
Medical education, 1994, 28:59-67.
Feinstein AR. On teaching statistics to medical students. Clinical pharmacology and therapeutics,
1975, 18:121-126.
• incomplete information on the person or patient (patient in coma, lack of facilities for medical investigations,
illiteracy, recall failure, etc.);
• an imperfect tool (false positive and false negative results of laboratory and radiological investigations, clinical
signs and symptoms are sometimes not specific, lack of accepted measure of important concepts such as
community health, etc.);
• poor compliance with the prescribed regimen (non-compliance with treatment schedule, imperfect post-
surgical care, breakdown of a vaccine cold chain, non-acceptance of family planning advice, etc.);
• inadequate medical knowledge (lack of treatment for AIDS, unknown causes of many cancers, inability to
restore severely malignant tissues, lack of a universally applicable cheap and effective method to break the
parasite-vector-host cycle in malaria transmission, unknown specific factors causing women to live longer
than men, unknown relationship between the mind and physiological and biochemical mechanisms, etc.).
Enabling objectives
At the end of the lesson the students should be able to:
(b) Distinguish between regular and ad hoc health data collection systems.
(d) Discuss the major differences between the three data measuring procedures:
(e) Explain the concepts of reliability and validity with regard to measurement, and discuss
their implications for the use of health data.
(f) Distinguish between the four principal scales of measurement (nominal, ordinal, interval
and ratio), indicating their respective application for health data collection.
OUTLINE 2 HEALTH DATA
Examples:
Enabling objectives
At the end of the lesson the students should be able to:
(a) Explain the importance of information-based health services management.
(b) Describe the role of health personnel in the health data generation process.
(d) Describe the various types of HIS (for example, public, hospital, private sector).
Lesson content
Definition and description of a health information system
A HIS is made up of mechanisms and procedures for acquiring, analysing, using
and disseminating health data for health management.
OUTLINE 3 HEALTH INFORMATION SYSTEMS
Decision-making; health information system; health policy; health programmes; health system
management; feedback; disease surveillance; vital registration.
• the structures (from the peripheral units, medical records departments of hospitals, up to
the headquarters);
• management (dates and frequency of reporting, local analysis and feedback mechanisms);
• use (development of indicators and setting of priorities) of a HIS.
(g) Explain the types and training of the personnel of the system, with particular reference to
the country of interest to the students. For example, the personnel of the HIS may consist of
the following:
• Medical records department: records officer, assistant records officers, records assistants,
statistical assistants.
• National health statistics unit: medical statistician, records officer, biostatistics assistants,
computer system analysts, computer programmers.
(h) Describe the health data collecting forms in use in the HIS (see Handout 3.2). If possible, the
forms should be reviewed for any improvements that may be warranted.
(i) Explain the system of reporting and all the legislation regarding health information
reporting.
(j) Explain the use of computers in a HIS, for storage, retrieval and processing of health
data.
Lesson exercises
The teacher should set exercises that test the students' knowledge of the various components of
a HIS, the relevant forms for data collection, the factors that affect the quality of the data, and
the usefulness of the data.
• List six important forms in use in one health information subsystem of your
country. For each form, describe the information to be derived and how it is
used.
• Give five factors that can affect the quality and timely reporting of informa-
tion from the HIS.
• Select one of the forms in use in the HIS and describe its use, in covering:
• frequency of reporting;
• latest date for submission of forms;
• channel of reporting;
• required local analysis;
• feedback.
References
Ferrinho PD et al. Developing a health information system for a primary health care centre in
Alexandra, Johannesburg. South African medical journal, 1991, 80(8):400-403.
Kleczkowski BM, Elling RH, Smith DL. Health system support for primary health care. Geneva, World
Health Organization, 1984 (Public Health Papers, No. 80).
Last JM, ed. A dictionary of epidemiology, 3rd ed. New York, Oxford University Press, 1995.
McLachlan G, ed. Information systems for health services. Copenhagen, World Health Organization
Regional Office for Europe, 1980 (Public Health in Europe, No. 13).
The role of research and information systems in decision-making for the development of human resources for
health. Report of a WHO Study Group. Geneva, World Health Organization, 1990 (WHO Technical
Report Series, No. 802).
WHO cooperation in strengthening national health information systems: a briefing note for WHO country
representatives and ministries of health. Geneva, World Health Organization, 1997 (unpublished
document WHO/HST/97.2; available on request from Department of Health Systems, WHO,
1211 Geneva 27, Switzerland).
HANDOUT 3.1
Disease surveillance: The continuing scrutiny of all aspects of occurrence and spread of diseases to detect changes
in trends or distribution, as a basis for instigating control measures.
Feedback: The process by which information is passed back to the people providing the data. To be effective the
information should have useful analytical comments.
Health information system (HIS): The mechanisms and procedures for acquiring and analysing data, and provid-
ing information (for example, management information, health statistics, health literature) for the manage-
ment of a health programme or system, and for monitoring health activities.
Health policy: A set of statements and decisions defining health priorities and main directions for attaining health
goals.
Health system management: The management of the interrelated component parts, both sectoral and intersectoral,
as well as within the community itself, which produce a combined effect on the health of a population.
Vital registration: The formal recording of events of human life, such as births, deaths, marriages and divorces.
(g) Explain how distribution patterns of data are more readily discerned by using diagrams in-
stead of tabulated data. Describe how the following diagrams should be constructed, either
manually or using a computer, from data given in frequency tables: frequency histogram,
frequency polygon, cumulative frequency polygon and cumulative frequency chart (ogive),
bar chart, pie chart.
Lesson exercises
The teacher should organize a set of raw data suitable for tabular and graphic presentation. The
exercises should give emphasis mostly to the process of data reduction, tabulation and graphic
presentation, including the uses and interpretation of graphs and other diagrams.
• Using the data on intra-ocular pressure in Annex A (A.1), draw a frequency
polygon for one of the variables and produce cross-tabulations for any two
variables.
• List four different graphic methods for presenting data from a survey on
family planning in a village. Illustrate with two variables for each graphic method.
References
Campbell MJ, Machin D. Medical statistics. Chichester, John Wiley, 1990.
Colton T. Statistics in medicine. Boston, Little, Brown, 1995.
Daly LE, Bourke GJ, McGilvray J. Interpretation and uses of medical statistics. Oxford, Blackwell,
1985.
Dawson-Saunders B, Trapp RG. Basic and clinical biostatistics. Norwalk, Appleton & Lange, Prentice-
Hall International, 1990.
Gore SM, Altman DG. Statistics in practice. London, British Medical Association, 1982.
Huff D. How to lie with statistics. New York, Norton, 1954.
Kirkwood BR. Essentials of medical statistics. Oxford, Blackwell, 1988.
HANDOUT 4.1
Bar chart: Diagrammatic presentation of frequency data for nominal classes by bars whose length is proportional to
the class frequencies.
Class: One of the intervals into which the entire range of the variable has been divided (for example, each of the
intervals 3.0-3.3, 3.4-3.7, ... , 5.0-5.3 is a class).
Class frequency: The number of observations in each class, also known as the absolute class frequency.
Class limits: The true values at the beginning and end of each class, which depend on the accuracy of measurement
(for example, if measurement is accurate to the nearest tenth, then the class limits for the class 3.0-3.3 are
2.95 and 3.34).
Class marks: The variable values that demarcate each class (for example, 3.0 and 3.3 are, respectively, the lower and
upper class marks of the class 3.0-3.3).
Classification: The process of subdividing the range of values of a variable into classes or groups.
Cross-tabulation: A frequency table involving at least two variables that have been cross-classified (tabulated
against each other).
Cumulative class frequency: The number of observations up to the end of the particular class. It is obtained by
cumulating the frequencies of previous classes, including the class in question.
Frequency polygon: Diagrammatic presentation of the frequency distribution of a quantitative variable, with class
frequencies plotted against class midpoint marks, the points being joined by straight lines.
Frequency table or distribution: A tabular arrangement showing the number of times that data with particular
characteristics occur within a data set.
Histogram: Diagrammatic presentation of the frequency distribution of a quantitative variable, with areas of rec-
tangles proportional to the class frequency.
Ogive: Graph of the cumulative relative frequency distribution.
Ordered array: Simple rearrangement of the individual observations in order of magnitude.
Pie chart: Sectors of a circle, with areas proportional to class frequencies, used to present data in nominal classes.
Relative class frequency: The absolute class frequency expressed as a fraction of the total frequency.
Example 1
Extract of data on intra-ocular pressure measurements of 135 adults (for the full data set see Annex A). The data are
given in mmHg, but may also be expressed in kPa (1 mmHg = 0.133 kPa).
        Sex    (a)    (b)    (c)
24      M      20     27     -7     high
52      M      18     12      6     high
26      M      16     13      3     low
71      F      14     14      0     normal
49      M      13     14     -1     normal
39      M      21     16      5     high
71      M      14     12      2     low
32      F      13     12      1     normal
38      F      13     12      1     normal
33      F       9      8      1     normal
a Right eye (mmHg).
b Left eye (mmHg).
c Difference between the right and left eye measurements.
Using seven equal intervals, data on intra-ocular pressure in the right eye may be presented in a frequency distribution
table as in Table 4.1.
Table 4.1
Intra-ocular pressure (mmHg)    Frequency    Relative frequency (%)
0-3                               0            0
4-7                               1            0.7
8-11                             16           11.9
12-15                            63           46.7
16-19                            40           29.6
20-23                            13            9.6
24-27                             2            1.5
Total                           135
Evident features of the distribution of right eye intra-ocular pressure values, among the 135 subjects studied,
include their variation from 4 to 27 and the fact that an appreciable number of persons have values between 12 and
15.
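The relative frequencies in the table can be checked with a few lines of code. The following is a minimal sketch in Python, using the class frequencies given above; the function name frequency_table and the added cumulative-frequency column are illustrative choices, not part of the original outline.

```python
# Class intervals and frequencies copied from the table above
# (right eye intra-ocular pressure, 135 adults).
classes = ["0-3", "4-7", "8-11", "12-15", "16-19", "20-23", "24-27"]
frequencies = [0, 1, 16, 63, 40, 13, 2]

total = sum(frequencies)


def frequency_table(labels, counts):
    """Return rows of (class, frequency, relative frequency in %, cumulative frequency)."""
    n = sum(counts)
    rows = []
    running = 0  # cumulative class frequency up to and including the current class
    for label, count in zip(labels, counts):
        running += count
        rows.append((label, count, round(100 * count / n, 1), running))
    return rows


table = frequency_table(classes, frequencies)
for row in table:
    print("{:>6} {:>4} {:>6.1f} {:>4}".format(*row))
```

The cumulative column is the quantity plotted in a cumulative frequency polygon; dividing it by the total gives the cumulative relative frequency used for an ogive.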
Figure 4.7 Equal intervals on the horizontal axis for unequal data intervals
[Figure not reproduced. Horizontal axis: Age (years).]
Enabling objectives
At the end of the lesson the students should be able to:
(a) Explain why summary indices are needed in medicine.
(b) Compute the mean, median and mode of a given set of data (grouped and ungrouped).
(d) Discuss, with examples, the uses and limitations of the mean, median and mode, and their
relative advantages and disadvantages as summary indices of health data.
(e) Explain the use of quartiles and percentiles to summarize health data.
(f) Select an appropriate measure of central tendency and location for a given application.
(g) Differentiate between "average", "normal" and "ideal" values, with reference to health data.
Lesson content
The lesson should cover the definitions, calculation, relative advantages and dis-
advantages, and appropriate data situations for use of the following:
• Median
• Mode
Other measures of location
• Quartiles
• Percentiles
• Proportions
The teacher should be able to construct an outline of this lesson content with
reference to the material in the proposed handouts. The following are illustra-
tive examples of the computations of some of these descriptive statistics.
Examples (such as those given below or in Handout 5.2) should be used through-
out the lesson. Examples based on real data, and on topics familiar to the
students, would be preferable.
Arithmetic mean
To calculate the arithmetic mean, there are two steps:
First step: add all values to obtain the total number of people in the
households:
5 + 5 + 6 + 3 + ... + 4 = 165.
Second step: divide this total (165 people) by the number of households (31).
Thus the arithmetic mean is 165/31 = 5.32 persons per household.
Median
To determine the median household size, there are also two steps:
First step: arrange all values in order of their magnitude (this arrangement is
called an array):
1, 1, 1, 1, 1, 2, 2, ..., 9, 9, 9, 10, 11.
Second step: select the value which divides this distribution into two halves (for
example, the middle observation if the number of observations is
HANDOUT 5.1
Arithmetic mean: The sum of all values of a set of observations divided by the number of observations.
Geometric mean: A mean derived by multiplying together the n individual values in a series of observations and
calculating the nth root. The logarithm of the geometric mean is thus the arithmetic mean of the logarithms of
individual values.
Measures of central tendency and location: Summary indices describing the "central" point, or the most
characteristic value, of a set of measurements.
Median: Value that divides a distribution into two equal halves; central or middle value of a series of observations
when the observed values are listed in order of magnitude.
Mode: The most frequently occurring value in a series of observations.
Multimodal distributions: Data distributions with more than one mode. Distributions with two modes are
bimodal.
Percentiles: Those values in a series of observations, arranged in ascending order of magnitude, which divide the
distribution into 100 equal parts (thus the median is the 50th percentile).
Quartiles: The values which divide a series of observations, arranged in ascending order, into four equal parts. Thus
the second quartile is the median.
Summary indices: Values summarizing a set of observations.
Weighted mean: A mean for which individual values in the set are weighted, very often by their respective
frequencies.
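The definitions above can be tried out on a small data set. The following is a minimal sketch using Python's standard statistics module; the household sizes are hypothetical values invented for illustration, and the helper weighted_mean is an assumed name, not something defined in the handout.

```python
import math
import statistics

# Hypothetical household sizes, invented for illustration only.
sizes = [1, 2, 2, 3, 4, 4, 4, 5, 6, 9]

mean = statistics.mean(sizes)        # sum of all values / number of observations
median = statistics.median(sizes)    # middle value of the ordered array
mode = statistics.mode(sizes)        # most frequently occurring value
quartiles = statistics.quantiles(sizes, n=4)   # three cut points; the second is the median
geometric = statistics.geometric_mean(sizes)   # nth root of the product of the n values

# The logarithm of the geometric mean equals the arithmetic mean of the logarithms.
mean_of_logs = statistics.mean([math.log(x) for x in sizes])


def weighted_mean(values, weights):
    """Mean of `values`, each weighted by the matching entry of `weights`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Weighting each distinct value by its frequency reproduces the arithmetic mean.
distinct = [1, 2, 3, 4, 5, 6, 9]
freqs = [1, 2, 1, 3, 1, 1, 1]
mean_from_frequencies = weighted_mean(distinct, freqs)

print(mean, median, mode, quartiles, round(geometric, 2), mean_from_frequencies)
```

Note that different textbooks and software packages interpolate quartiles and percentiles in slightly different ways, so values other than the median may differ a little from hand calculations.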
Mean
The approximate mean is the weighted average of the class mid-values:
i.e. 32 420/240 = 135.1 mmHg.
Median
The median blood pressure lies in the interval between 130 and 140 mmHg. It is the average of the 120th and 121st
observations. Their estimated values are respectively:
Enabling objectives
At the end of this lesson, the students should be able to:
(a) Explain the meaning of a measure of variability or dispersion and its place in descriptive
statistics.
(b) Explain the uses of the terms: range, inter-quartile range, variance, standard deviation and
coefficient of variation, as measures of variability of health data.
(c) Compute the following, given either grouped or non-grouped data, with the aid of reference
material:
• range;
• inter-quartile range;
• variance;
• standard deviation;
• coefficient of variation.
(d) Describe the relative advantages and disadvantages of the five indices listed above.
(f) Discuss the concept of normality of health data in terms of mean, standard deviation and
percentiles.
OUTLINE 6 MEASURES OF VARIABILITY
Coefficient of variation
- Used for the comparison of relative variability of two distributions.
Two types of "normal" values are usually required for medical decisions: the
"point normal" values and the "normal ranges". Point normal values are esti-
mated by measures of central tendency (refer to Handout 5.1 for definitions of
measures of central tendency and location). Normal ranges give the general level
(in terms of an interval) of a characteristic for healthy population groups. Some
people in the population will have exceptionally high or low values of a particu-
lar characteristic and yet apparently be perfectly healthy. These are called
"outliers". Such exceptional values cannot be regarded as typical of the popula-
tion group. Hence sometimes a few very extreme measurements are excluded
from the computation of normal values.
Very often, normal values differ between geographical areas or between sexes
or age groups. For example, "normal" blood pressure differs between sexes,
and also varies with age, and its pattern is not the same in all human populations.
A statement of normal values must therefore indicate the population referred
to.
Coefficient of variation; dispersion; normal values; range; standard deviation; standard error; theo-
retical normal (Gaussian) distribution; variance.
(a) Recapitulate the various sources of variation as presented in Outline 1, and illustrate their
cumulative effect on the validity and reliability of measurements in health data. Distinguish
between random and systematic variations.
(b) Describe the nature of measures of variability or dispersion and their place in descriptive
statistics. Differentiate between a summary index of central tendency and a summary index
of dispersion, and explain their complementary roles in the study of any characteristic among
a group of subjects (for example, as indicators of homogeneity and heterogeneity), and for
comparison between different groups of subjects. Explain how variability may or may not be
related to the magnitude of the variable, and hence differentiate between indices of absolute
dispersion and of relative dispersion.
(c) Give the definitions and methods of computation of the different summary indices of abso-
lute dispersion commonly encountered in the literature. These should include:
• indices summarizing the squares of differences of individual values from the mean (sum
of squares, variance or mean square, standard deviation);
(d) Draw attention to the concept of a range of "normal" values, often determined arbitrarily as
the interval spanning the central 95% of values in the frequency distribution (that is, the
range from the 2.5 percentile to the 97.5 percentile), and explain how the standard deviation
is often used to estimate this normal range in the form x̄ ± 1.96 SD.
(e) Give special attention to this use of the standard deviation, which derives from the properties
of the theoretical normal distribution or normal curve. Explain the concept of the standard
normal deviate z (distance from the mean expressed in standard deviation units), and
illustrate how percentiles of the normal distribution are related to values of z. Mention how
the proportions of the normal distribution that lie within or outside various multiples of z
below or above the mean (for example, x̄ ± SD, x̄ ± 1.96 SD) can be used to determine the
"normal" range of values. Discuss when and why the standard deviation may or may not be
used in this way for empirical data (observed frequency distributions).
(f) Summarize the uses and limitations of the different measures of variability or dispersion.
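The link between the standard normal deviate z and the central 95% of a normal distribution can be illustrated numerically. The following is a minimal sketch using Python's statistics.NormalDist; the mean of 120 and standard deviation of 10 are assumed values chosen only for illustration, and z_score is a hypothetical helper name.

```python
from statistics import NormalDist

# Hypothetical mean and standard deviation (assumed values, not taken from the text).
mean, sd = 120.0, 10.0
dist = NormalDist(mu=mean, sigma=sd)

# z value that cuts off the upper 2.5% of the standard normal distribution (about 1.96).
z_975 = NormalDist().inv_cdf(0.975)

# "Normal range" estimated as mean +/- 1.96 SD.
lower = mean - 1.96 * sd
upper = mean + 1.96 * sd

# Proportion of the theoretical normal distribution lying inside that range (about 0.95).
central = dist.cdf(upper) - dist.cdf(lower)


def z_score(x, mean, sd):
    """Distance of x from the mean, expressed in standard deviation units."""
    return (x - mean) / sd


print(round(z_975, 2), round(lower, 1), round(upper, 1), round(central, 3))
```

This calculation holds only when the data are approximately normally distributed; for skewed empirical distributions the 2.5 and 97.5 percentiles of the observed data are the safer choice.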
Lesson exercises
The teacher should obtain data that can demonstrate variation in an attribute, such as a continu-
ous variable, and ask the students to calculate the various measures of variation and to describe
how they can compare variation in variables measured in different units.
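An exercise of this kind might be checked as follows. This is a minimal sketch in Python; the measurements are hypothetical, the sample variance uses the divisor n - 1, and coefficient_of_variation is an assumed helper name.

```python
import math
import statistics

# Hypothetical measurements of a continuous variable, invented for illustration.
values = [12, 14, 14, 15, 16, 18, 20, 21]

data_range = max(values) - min(values)           # range
q1, q2, q3 = statistics.quantiles(values, n=4)   # quartiles
iqr = q3 - q1                                    # inter-quartile range
variance = statistics.variance(values)           # sample variance (divisor n - 1)
sd = statistics.stdev(values)                    # standard deviation = sqrt(variance)


def coefficient_of_variation(data):
    """Standard deviation expressed as a percentage of the mean (relative dispersion)."""
    return 100 * statistics.stdev(data) / statistics.mean(data)


cv = coefficient_of_variation(values)

# Changing the unit of measurement (here multiplying by 10) changes the
# standard deviation but leaves the coefficient of variation unchanged.
rescaled = [10 * v for v in values]
cv_rescaled = coefficient_of_variation(rescaled)

print(data_range, iqr, round(variance, 2), round(sd, 2), round(cv, 1))
```

Because the coefficient of variation has no units, it is the index to use when comparing the variability of variables measured in different units.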
OUTLINE 7 INTRODUCTION TO PROBABILITY AND PROBABILITY DISTRIBUTIONS
Lesson content
Concept of probability
• Definition of probability (subjective definition and the frequency concept).
• Definitions of technical terms (trials, experiments, outcomes, events, chance,
odds; see Handout 7.1).
• Scale of measurement of probability and its interpretations.
Laws of probability
• Explanations of simple and compound events.
• Mutually exclusive and independent events.
• The addition and multiplication rules.
• Dependent events and definition of conditional probability.
Probability distributions
• Discrete probability distributions (binomial).
• Continuous probability distributions (normal).
• Properties and uses of the distributions.
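The addition and multiplication rules, conditional probability and odds listed above can be illustrated with exact fractions. The following is a minimal sketch in Python; all probabilities are hypothetical values chosen for easy arithmetic, and the function names conditional and odds are assumptions made for this example.

```python
from fractions import Fraction

# Hypothetical probabilities, chosen only to illustrate the rules.
p_a = Fraction(2, 5)    # P(A): a person has attribute A
p_b = Fraction(1, 10)   # P(B): a person has attribute B (A and B cannot occur together)

# Addition rule for mutually exclusive events: P(A or B) = P(A) + P(B).
p_a_or_b = p_a + p_b

# Multiplication rule for independent events: P(C and D) = P(C) x P(D),
# for example two successive births both being male when P(male) = 1/2.
p_male = Fraction(1, 2)
p_two_males = p_male * p_male


def conditional(p_a_and_b, p_b):
    """Conditional probability P(A | B) = P(A and B) / P(B)."""
    return p_a_and_b / p_b


def odds(p):
    """Odds in favour of an event with probability p."""
    return p / (1 - p)


print(p_a_or_b, p_two_males, conditional(Fraction(1, 10), Fraction(1, 2)), odds(p_male))
```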
(b) Explain the range of values for probabilities (0-1) and the interchangeable use of the terms
"chance" and "probability".
(c) Briefly review the uses of the descriptive statistical methods already learned, and introduce
the concept and meaning of inductive statistics, illustrating with medical data (for example,
how criteria of abnormality used in diagnosis are based on descriptive information but
applied to new patients).
(d) Explain the meaning of such terms as trials, outcomes, events, experiments.
(e) Explain the relationship between probability and observed proportions in data on a dicho-
tomous attribute.
Example: what is the chance of finding a person with Type A blood, or the chance of an
unborn child being male?
laws of probability and the binomial probability distribution with particular reference to health
problems.
To demonstrate and reinforce the concept of a sampling distribution and to show how it is
governed by laws of probability, let the students generate an empirical (observed) sampling
distribution and compare it with the theoretical (expected binomial) distribution. One way of
doing this is to use coloured beads to represent persons with different attributes in a population.
Give examples of dichotomous medical attributes that can be represented by beads of two
colours, say black and white, for example, genetic traits (sickle cell anaemia, blood grouping,
etc.).
• From a box containing a large number of beads of two colours (for example,
black and white), let each student take a random sample of a given size n (for
example 5); tabulate the number of black (or white) beads seen in each sample.
This gives an observed sampling distribution.
Given the actual proportion of black (or white) beads in the box, calculate the
binomial probability distribution for sample size n. Hence calculate the expected
sampling distribution for the observed number of samples.
Compare and comment on the observed and expected sampling distributions.
The goodness-of-fit will be tested later when the students have learned about the chi-squared
test.
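Where beads are not available, the bead exercise can be imitated on a computer. The following is a minimal sketch in Python under stated assumptions: the proportion of black beads (0.4), the sample size (5) and the number of samples (200) are hypothetical, and the box is treated as large enough that each bead drawn is black with the same probability. The function names are illustrative.

```python
import math
import random
from collections import Counter


def binomial_pmf(r, n, p):
    """Binomial probability of exactly r 'successes' in n independent trials."""
    return math.comb(n, r) * p ** r * (1 - p) ** (n - r)


def simulate_samples(num_samples, n, p, seed=0):
    """Draw num_samples samples of size n; count the black beads in each sample."""
    rng = random.Random(seed)  # fixed seed so the run can be repeated
    counts = Counter()
    for _ in range(num_samples):
        black = sum(rng.random() < p for _ in range(n))
        counts[black] += 1
    return counts


# Hypothetical settings for the exercise.
n, p, num_samples = 5, 0.4, 200

observed = simulate_samples(num_samples, n, p)
expected = {r: num_samples * binomial_pmf(r, n, p) for r in range(n + 1)}

for r in range(n + 1):
    print(r, observed.get(r, 0), round(expected[r], 1))
```

The two printed columns are the observed and expected sampling distributions that the students are asked to compare.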
The following exercise is designed to demonstrate the application of the laws of probability and
the binomial probability distribution.
• For a couple desiring to have a baby boy following three female births, if it is
known that the chance of a pregnancy resulting in a male baby is 0.5, what is
the chance that the fourth pregnancy will result in a male birth?
and 0! = 1 by definition.
At the height of the drought in a given region, it was estimated that 70% of the children under 10 years old were
severely malnourished. If five children, under 10 years old, were selected at random from the region, what is the
probability that: all, 4, 3, 2, 1, 0, are severely malnourished?
hence $P(r = 3) = \dfrac{5!}{(5 - 3)!\,3!} \times 0.7^{3} \times (1 - 0.7)^{2}$
$= 5 \times 2 \times 0.7^{3} \times 0.3^{2}$
$= 0.30870.$
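The remaining probabilities asked for in this example (all, 4, 2, 1 and 0 children severely malnourished) follow from the same formula. A minimal sketch in Python, using n = 5 and p = 0.7 from the example; the function name binomial_pmf is an assumption made for illustration.

```python
import math


def binomial_pmf(r, n, p):
    """Probability of exactly r affected children among n selected at random."""
    return math.comb(n, r) * p ** r * (1 - p) ** (n - r)


n, p = 5, 0.7  # five children, 70% severely malnourished

# Probabilities for r = 5, 4, 3, 2, 1, 0.
probabilities = {r: binomial_pmf(r, n, p) for r in range(n, -1, -1)}

for r, prob in probabilities.items():
    print(r, round(prob, 5))
```

The six probabilities add up to 1, which is a useful check on hand calculations.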
Enabling objectives
At the end of the lesson the students should be able to:
(a) State the reasons for sampling with the different sampling methods.
(e) List possible advantages and disadvantages of collecting health information through
samples.
(f) Discuss the relative advantages and disadvantages of each of the following sampling meth-
ods, as applied to the design of a health survey:
• probability (random) sample;
• simple random sample;
• stratified random sample;
• systematic sample;
• cluster sample;
• multistage sample.
(g) Calculate the standard error of the sample mean or proportion, given the relevant data and
formulae.
(h) Differentiate between point and interval estimates of health indices.
(j) Explain the meaning and application of confidence limits of an estimate of health indices.
OUTLINE 8 SAMPLING AND ESTIMATING POPULATION VALUES
(k) Explain how sampling error is related to sample size and to variability of the characteristic
under study.
(l) State the information needed to estimate the minimum sample size for a health survey.
Lesson content
The concept of sampling
• Population (universe)
• Sample
• Sampling
• Reasons for sampling
• Sampling unit
• Sampling frame
• Sampling fraction
• Unit of inquiry
• Probability and non-probability sampling
Sampling distributions
• Meaning of parameters and statistics
• The central limit theorem
Estimation
• Concept of standard error
• Point and interval estimation (mean and proportion)
• Precision
• Determination of minimum sample size
Bias and selection in sampling; cluster sampling; confidence limits; confidence range; difference
between sampling and non-sampling error; dummy tables; estimation of a population mean;
estimation of a population proportion; health survey; level of confidence; method of sampling;
multistage sampling; point and interval estimates; population (universe); population parameter;
precision of estimates; pre-coded data; probability sampling; quality of sample; representativeness
of a sample; sample statistic; sampling error; sampling fraction; sampling frame; sampling unit;
self-coding record forms; self-selected or natural samples; standard error; statistical estimation;
stratified random sampling; survey questionnaire; systematic sampling; validity of estimates; unit of
inquiry.
Concept of sampling
Explain the concepts of population, sample and sampling, giving the reasons for sampling:
limited resources available for estimation, lack of access to the total population, or situations in
which sampling is the only feasible method of collecting the information.
Also explain the following terms: sampling frame, sampling unit, sampling fraction.
Explain the characteristics of a good sample (the sample must be selected at random to reduce
bias, be representative to improve validity, and be large enough to increase precision).
Distinguish between random sampling and non-random (purposive) sampling.
Sampling distributions
Explain the concept of a sampling distribution using simple examples, without invoking mathe-
matical statistics. Explain the differences between a parameter and a statistic, and that every
sample statistic belongs to a sampling distribution.
Describe the principles and applications of the central limit theorem, which states that, for any
variable with a finite variance, whether normally distributed or not, the distribution of the sample
mean tends towards a normal distribution as the sample size increases.
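The central limit theorem can be demonstrated to the class by simulation. The sketch below is an illustration only: the skewed exponential population (mean 1, standard deviation 1), the sample sizes and the number of repeats are all assumptions chosen for the demonstration, not values from the outline.

```python
import random
import statistics

random.seed(1)  # fixed seed so that the demonstration is repeatable

def sample_means(n, repeats=2000):
    """Means of `repeats` random samples of size n drawn from a skewed
    (exponential) population with mean 1 and standard deviation 1."""
    return [statistics.fmean([random.expovariate(1.0) for _ in range(n)])
            for _ in range(repeats)]

for n in (2, 10, 50):
    means = sample_means(n)
    print(f"n = {n:2d}: mean of sample means = {statistics.fmean(means):.3f}, "
          f"SD of sample means = {statistics.stdev(means):.3f}, "
          f"1/sqrt(n) = {n ** -0.5:.3f}")
```

As n increases, the standard deviation of the sample means should approach the population standard deviation divided by the square root of n; a histogram of `means` for each n shows the shape moving from skewed towards normal.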
Estimation
Explain the concepts of statistical estimation. The following should be covered:
• The context of statistical estimation: the need to estimate population parameters from sample
statistics; the problem posed by sampling error for making reliable estimates; the concepts of
point and interval estimation; validity and precision of a statistical estimate.
• The concepts of confidence limits and level of confidence: the connection between sampling
distributions, confidence limits and levels of confidence. Interval estimation: estimation of a
population parameter in terms of an interval that has a specified probability of containing the
true value. The interval is the confidence interval, and the limits are the confidence limits.
• The estimation of normal anthropometric values for a population, with examples; estimation
of mean birth weight from hospital births; and estimation of disease prevalence in morbidity
surveys.
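The link between the standard error and confidence limits can be sketched in a few lines. This is an illustration under stated assumptions: it uses the large-sample normal approximation with z = 1.96 for 95% confidence, and the birth-weight and prevalence figures are hypothetical, not data from the outline.

```python
from math import sqrt

Z_95 = 1.96  # standard normal deviate for 95% confidence

def ci_mean(mean, sd, n, z=Z_95):
    """Confidence limits for a population mean (large-sample approximation)."""
    se = sd / sqrt(n)
    return mean - z * se, mean + z * se

def ci_proportion(r, n, z=Z_95):
    """Confidence limits for a population proportion, r of n with the attribute."""
    p = r / n
    se = sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# Hypothetical examples: mean birth weight 3.1 kg (SD 0.5 kg) in 100 hospital births,
# and 120 cases of a disease among 400 persons examined in a morbidity survey.
print(ci_mean(3.1, 0.5, 100))
print(ci_proportion(120, 400))
```

The point estimates here are 3.1 kg and 0.30; the intervals show how far the population values could plausibly lie from them because of sampling error alone.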
Lesson exercises
The exercises for this lesson should focus on helping the students crystallize the concepts of
sampling and estimation of population values covered in the lesson. The emphasis should not be
on correct memorization of formulae but on their appropriate use and interpretation of the
results. The exercises should, in particular, cover all the major points indicated in the enabling
objectives of the lesson (reasons for sampling, the advantages and disadvantages of the different
sampling methods, interpretation of confidence interval, etc.).
1 See Lwanga SK, Lemeshow S. Sample size determination in health studies: a practical manual, Geneva, World Health
Organization, 1991.
• In a family planning clinic, there are 2500 clients. Suppose the anticipated
prevalence of HIV infection is 3% and the investigator is willing to accept an
absolute error of 1%. What is the minimum sample size required to estimate the
prevalence of HIV with 95% confidence?
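One possible way of checking the answer to this exercise is sketched below. It assumes z = 1.96 for 95% confidence and uses the sample size formulae for a proportion, n = z²pq/d² for a very large population and n = z²pq/(d² + z²pq/N) for a finite population of size N; the function name `sample_size_proportion` is hypothetical.

```python
from math import ceil

def sample_size_proportion(p, d, z=1.96, N=None):
    """Minimum sample size to estimate a proportion p to within +/- d.
    If N is given, the finite-population formula is used."""
    q = 1 - p
    if N is None:
        return ceil(z**2 * p * q / d**2)
    return ceil(z**2 * p * q / (d**2 + z**2 * p * q / N))

# Values from the exercise: anticipated prevalence 3%, absolute error 1%, 2500 clients
print(sample_size_proportion(0.03, 0.01))          # ignoring the size of the clinic
print(sample_size_proportion(0.03, 0.01, N=2500))  # allowing for the 2500 clients
```

Comparing the two results shows students how much the required sample shrinks when the population is small relative to the sample.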
References
Bland M. An introduction to medical statistics. Oxford, Oxford University Press, 1987.
Colton T. Statistics in medicine. Boston, Little, Brown, 1995.
Daly LE, Bourke GJ, McGilvray J. Interpretation and uses of medical statistics. Oxford, Blackwell,
1985.
Dawson-Saunders B, Trapp RG. Basic and clinical biostatistics. Norwalk, Appleton & Lange,
Prentice-Hall International, 1990.
Dixon RA. Medical statistics: content and objectives of a core course for medical students.
Medical education, 1994, 28:59-67.
Feinstein AR. Clinical biostatistics. St Louis, Mosby, 1977.
Huff D. How to lie with statistics. New York, Norton, 1954.
Kirkwood BR. Essentials of medical statistics. Oxford, Blackwell, 1988.
Lemeshow S et al. Adequacy of sample size in health studies. London, John Wiley, 1990.
Lwanga SK, Lemeshow S. Sample size determination in health studies: a practical manual. Geneva,
World Health Organization, 1991.
HANDOUT 8.1
Confidence limits: The upper and lower limits of the interval in interval estimation. The interval itself is called the
confidence interval or confidence range. Confidence limits are so-called because they are determined in
accordance with a specified or conventional level of confidence or probability that these limits will in fact include
the population parameter being estimated. Thus, 95% confidence limits are values between which we are
95% confident that the population parameter being estimated will lie. Confidence limits can often be derived
from the standard error.
Interval estimation: Providing an estimate of a population parameter in terms of an interval or range of values
within which it is likely to lie.
Level of confidence: Conventionally 95% or 0.95, but may be set higher or lower as desired.
Point estimation: Providing an estimate of a population parameter in terms of a single value that it is most likely to
have. A point estimate is usually provided by a sample statistic. By itself, point estimation ignores sampling
error.
Population: Any specified group (usually large) of persons, things, or measurement values.
Population parameter: A descriptive index whose value refers to the population at large, as opposed to a sample
of the population (for example, a population mean or population proportion).
Precision of an estimate: The inverse of the standard error of the estimate. The less the sampling error that is likely
to occur, the greater the precision; that is, the smaller the confidence range, the greater the precision. Hence,
precision can be specified in terms of the confidence range or the standard error.
Sample: A subset of a population, whose properties have been, or are to be, generalized to the population.
Sample statistic: A descriptive index, the value of which is obtained from observations in a sample (for example, a
sample mean or a sample proportion).
Sampling: The process of selecting a sample from a population.
Sampling distribution: The distribution of probabilities with which sampling error of different magnitudes can
occur purely by chance for a particular sample statistic and sample size. It can be demonstrated experimentally
by tabulating the values of the same sample statistic obtained from repeated samples of the same size taken
randomly from the same population. It can also be calculated theoretically (for example, using the binomial or
the normal sampling distribution). Every sample statistic is a member of a sampling distribution, that is, the
distribution of values of that statistic that can be expected to occur in different samples of the same size drawn
randomly from the same universe.
Sampling error: A difference that occurs purely by chance between the value of a sample statistic and that of the
corresponding population parameter (for example, the difference between the value of the mean of a random
sample and that of the universe). Sampling error cannot be avoided or totally eliminated, and must always be
allowed for when making inferences or drawing conclusions from sample statistics. It can be reduced by
increasing sample size or using a more appropriate sampling method.
Sampling fraction: The proportion of sampling units to be selected from a specified sampling frame for inclusion in
the sample.
Sampling frame: The set of sampling units from which a sample is to be selected. For example, a list of names, or
places, or other items to be used as sampling units.
Sampling unit: The unit of selection in the sampling process. For example, a person, a household or a district. It is
not necessarily the unit of observation or study.
Standard error (SE): The standard deviation of a statistic.
Unit of inquiry: Smallest unit on which data are collected.
Universe (of a sample): The population of values, of which the values observed in the sample are a random sample,
and to which the properties of the sample can validly be generalized. The universe of a sample may be an
abstract or a real population of values, and it may be finite or infinite, depending on the type of sample and the
nature of the information under study.
Validity of an estimate: The extent to which an estimate corresponds to the parameter it is estimating. It depends,
not on the size of the sample, but on the representativeness of the sample. Hence it depends on the type or
nature of the sample, how it was selected, and on the accuracy of the information from which it was calcu-
lated and of the calculation itself.
Sampling
Advantages
• Sampling reduces demands on resources such as finance, personnel and materials.
• Results are obtained more quickly.
• Sampling may lead to better accuracy of collected data; a smaller sample allows more effort to be made to reduce
non-sampling errors and non-response biases.
• Precise allowance can be made for sampling error (which can be found by calculation), although not for
non-sampling errors.
Disadvantages
• There is always a sampling error.
• Sampling may create a feeling of discrimination within the population.
• Sampling may be inadvisable where every unit in the population is legally required to have a record.
• For rare events, small samples may not yield sufficient cases for study.
Probability sampling
• All individuals (elements) in the population have a known chance (probability) of selection. The chance of selection
need not be the same for each individual or element.
• The knowledge of the selection probability is in contrast with the situation for non-probability sampling techniques,
such as quota and chunk sampling.
• There must be an identified sampling frame, whether of individual elements or clusters of elements, from which the
sample is to be drawn.
Advantages
• Because every unit in the population has an equal chance of being included in the sample, the sample is assured
of being representative and subject only to sampling error.
• Estimates are easy to calculate.
Disadvantages
• If the sampling frame is large, this method may be impracticable because of the difficulty and expense of
constructing or updating it in large-scale surveys.
• Minority subgroups of interest in the population may not be present in the sample in sufficient numbers for study.
• A simple random sample is then selected from each stratum using the same sampling fraction, unless otherwise
prescribed for special reasons.
Advantages
• Every unit in a stratum has the same chance of being selected.
• Using the same sampling fraction for all strata ensures proportionate representation in the sample of the
characteristic being stratified.
• Adequate representation of minority subgroups of interest can be ensured by stratification and by varying the
sampling fraction between strata as required.
Disadvantages
• The sampling frame of the entire population has to be prepared separately for each stratum.
• Varying the sampling fraction between strata, to ensure selection of sufficient numbers in minority subgroups for
study, affects the proportional representativeness of the subgroups in the sample as a whole.
Systematic sampling
• Involves the selection of every kth unit in the population or the sampling frame, where 1/k is the sampling fraction.
• The first unit to be selected is selected at random from among the first k units.
Advantages
• The sample is easy to select.
• A suitable sampling frame can be identified more easily.
• The sample is evenly spread over the entire reference population.
Disadvantages
• The sample may be biased if a hidden periodicity in the population coincides with that of the selection.
• It is difficult to assess the precision of the estimate from one survey.
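The selection rule for systematic sampling (a random start among the first k units, then every kth unit) can be sketched as follows. The frame of 100 numbered households, the value k = 10 and the function name `systematic_sample` are hypothetical and chosen only for illustration.

```python
import random

def systematic_sample(frame, k, rng=random):
    """Select every k-th unit from the frame after a random start
    among the first k units (sampling fraction 1/k)."""
    start = rng.randrange(k)  # index 0..k-1, that is, one of the first k units
    return frame[start::k]

households = list(range(1, 101))  # hypothetical frame of 100 numbered households
sample = systematic_sample(households, k=10, rng=random.Random(7))
print(sample)
```

If the frame were ordered so that, say, every tenth household was a corner plot, the sample would consist entirely of one kind of household, which illustrates the periodicity problem listed among the disadvantages.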
Cluster sampling
• The population is first divided into clusters of homogeneous units, usually based on geographical contiguity.
• All the units in the selected clusters are then examined or studied.
Advantages
• Cuts down on the cost of preparing a sampling frame.
• Cuts down on the cost of travelling between selected units.
Disadvantages
• Sampling error is usually higher than for a simple random sample of the same size.
Multistage sampling
• Selection is done in stages until the final sampling units (for example, households or persons) are arrived at.
• In the first stage, a list of large-sized sampling units is prepared. These may be towns, or villages or schools.
• A sample of these is selected at random, with probability of selection proportional to size.
• For each of the selected first-stage units, a list of smaller sampling units is prepared. (For example, if the first-stage
units are towns, then second-stage units may be houses or households.)
• A sample of these second-stage units is then randomly selected from each of the selected first-stage units. These
are then studied.
Advantage
• Cuts down the cost of preparing a sampling frame.
Disadvantage
• Sampling error is increased compared with a simple random sample of the same size.
• objective;
With simple random sampling, for a given magnitude of confidence interval, the precision (z) can be measured by:
$z = d/\mathrm{SE}$.
If we want a 95% confidence interval, z must be 1.96 (see Table B.1, Annex B). Since the SE depends on n, we can
calculate the value of n required to achieve the chosen level of confidence.
If s is the sample estimate of the population standard deviation (see Outline 6), then the standard error (SE) of the
mean, for a sample of size n, is $s/\sqrt{n}$.
For estimating a population mean, with $\mathrm{SE} = s/\sqrt{n}$, the minimum required sample size, in general, is:
$n = s^{2}/\mathrm{SE}^{2} = z^{2} s^{2}/d^{2}$.
For a sample of size n, involving a binomial distribution with probability p (see Outline 7), let a individuals be
observed with the relevant characteristics. Then the standard error of the estimate of p (that is, a/n) is $\sqrt{pq/n}$, where
$q = 1 - p$.
If sampling is from a finite population of size N, then the minimum sample sizes are:
$n = z^{2} p q/(d^{2} + z^{2} p q/N)$,
If $n_0$ is the sample size for an infinite population, the finite population sample size is:
- how precisely one wishes to estimate this mean level; that is, the amount of sampling error that can be tolerated
(d), in either absolute or relative terms;
Example 1
A health officer wishes to estimate the mean haemoglobin level in a defined community. Preliminary information is
that this mean is about 150 mg/l with a standard deviation of 32 mg/l. If a sampling error of up to 5 mg/l in the
estimate is to be tolerated, how many subjects should be included in the study?
Therefore at least 136 people would have to be studied. For a larger community with, for example, N = 3000 people,
the required sample size would be:
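The arithmetic for Example 1 can be sketched as follows. The sketch assumes 95% confidence (z = 1.96) and the usual finite-population adjustment n = n0/(1 + n0/N), where n0 = z²s²/d² is the size needed for a very large population; that adjustment is an assumption here, since the corresponding formula is not legible in this handout. The function name `finite_population_n` is hypothetical.

```python
from math import ceil

z, s, d = 1.96, 32.0, 5.0  # 95% confidence; SD and tolerated error in mg/l

# Sample size for a very large (effectively infinite) population
n0 = z**2 * s**2 / d**2

def finite_population_n(n0, N):
    """Adjust the infinite-population sample size n0 for a population of size N."""
    return n0 / (1 + n0 / N)

print(ceil(n0))                             # very large population
print(ceil(finite_population_n(n0, 3000)))  # community of 3000 people
```

Under these assumptions, the figure of 136 quoted in the example is what the adjustment gives for a community of about 1000 people; the community size itself is not stated in the text as it stands, so this should be read as a plausible reconstruction only.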
- the sampling error that can be tolerated (d) in either absolute or relative terms;
- the acceptable chance of an unlucky sample (conventionally 5%).
The minimum sample required, for a very large population, is then:
Example 2
If p = 0.26 and d = 0.03, then, for a very large population:
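Assuming 95% confidence (z = 1.96) and the large-population formula n = z²pq/d² (the limit of the finite-population formula as N becomes very large), the calculation for Example 2 can be sketched as:

```latex
n = \frac{z^{2} p q}{d^{2}}
  = \frac{1.96^{2} \times 0.26 \times 0.74}{0.03^{2}}
  = \frac{0.7391}{0.0009}
  \approx 821.2
```

Rounding up, about 822 subjects would be needed under these assumptions.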
Enabling objectives
At the end of the lesson the students should be able to:
(a) Explain the context and meaning of statistical hypothesis.
(b) Explain when, and why, a test of significance needs to be carried out.
(c) Explain the procedures for carrying out tests of significance.
(d) Differentiate between type 1 and type 2 errors in hypothesis testing.
(e) Explain the possible outcomes of a test of statistical significance and their respective inter-
pretations in relation to the context of the test.
(f) Differentiate between statistical and medical significance.
(g) Select an appropriate test statistic for the comparison of two means, for independent and
dependent samples.
(h) Select an appropriate test statistic for the comparison of two proportions.
(i) Carry out an appropriate test of statistical significance for the difference between two means,
for independent and dependent samples.
(j) Carry out an appropriate test of statistical significance for the difference between two
proportions.
objectives of Outlines 6 and 7 and, in particular, that they have understood the concepts of
sampling error and sampling distributions (Outline 8).
Lesson content
Construct an outline of the lesson with reference to the definitions and explana-
tions of new terms and concepts in Handout 9.1, with the following content.
Alternative hypothesis; degrees of freedom; hypothesis testing; level of significance; null hypothesis;
1-tailed and 2-tailed tests; p-value; probability of a difference occurring purely by chance;
rejection of a hypothesis; statistical significance; test statistic (z, t, $\chi^2$); type 1 and type 2 errors.
• differences, for example, between the means of certain biochemical, physiological, demo-
graphic or any health measurements in different samples, or between the proportions
with certain attributes in different samples, or between the observed and expected number
of occurrences of certain events;
• formulation and testing of the null hypothesis;
• the probability that a difference of a given magnitude or greater magnitude can occur
purely by chance; illustrate this in relation to the theoretical sampling distribution;
• the direction of difference and the implications for 1-tailed or 2-tailed tests;
• the probability of being wrong in rejecting or not rejecting a hypothesis; type 1 and type 2
errors.
(b) Introduce the concept of level of significance
• The lowest value p (the probability) must have for an event to be considered "unlikely",
and hence for the null hypothesis to be rejected and the difference to be described as
being statistically significant.
• Describe the conventional levels of significance, i.e. "significant" for p < 0.05; "highly
significant" for p < 0.01; "not significant" for p ≥ 0.05.
(c) Describe the role of significance testing and the implications of the outcome
• The test of significance only takes care of the factor of chance; discuss how the other
possible causes of observed difference are dealt with. Emphasize the difference between
statistical and medical significance. For example:
- a statistically significant difference but of no clinical importance;
- a non-statistically significant observation but with the results pointing to a possible
clinical or medical importance.
• Discuss possible follow-up as a result of a statistical test of significance, for example, re-
peat of study with an enlarged sample size.
(d) Outline the methodology of the various tests of significance
There are many types of tests of significance, catering to different types of data and differences
being dealt with. The most commonly encountered are the z-test, the t-test and the $\chi^2$ test.
Mention only the usefulness of the $\chi^2$ test and indicate that detailed treatment of this test statistic
will follow in the next lesson. At least one type of test should be carried out by the students to
learn the concepts and principles involved; that is, how to:
• select the appropriate test to be used;
• differentiate between parametric and non-parametric tests;
• calculate the test statistic;
• evaluate its magnitude in relation to its theoretical sampling distribution, in terms of the
probability that this magnitude could have arisen purely by chance (if the null hypothesis
were true);
• decide whether the difference is significant and, if so, at what level of significance.
Refer to the worked examples given in Handouts 9.2 and 9.3.
Lesson exercises
The class exercises should emphasize the proper selection of the test to be used in each specific
situation and how to interpret the results obtained. The teacher should obtain a data set which
has both categorical and continuous variables that can be used for the various tests on means in
dependent and independent situations and in the case of proportions for small and large data
sets.
Class exercises are given to provide practice in carrying out tests of significance, and interpreting
the results in the context of the study objectives.
• The average clinic utilization rate for 1152 infants who reported to Kasangati
Health Clinic between 1961 and 1979 is provided in Table 9.1. (Kasangati Health
Clinic is the field station for Makerere Medical School, Institute of Public Health,
Kampala, Uganda.) The study was reported in the East African medical journal
(March, 1994).
• Determine whether the average utilization rate per child in the 1960s is sta-
tistically different from the rate in the 1970s.
• Comment on the distribution of the data for the test selected.
• Table 9.2 gives the summary of the data on immunization of children in Yemen,
as reported in the Demographic and Maternal and Child Health Survey, 1991/
1992 (source: Demographic and Health Surveys, 1991-92, Macro International
Inc., Calverton, MD, USA).
Total 59.1 59.0 53.4 44.8 59.0 53.4 44.8 49.6 42.0 37.9 6715b
vaccines).
Note: The DPT coverage rate for children without a written record is assumed to be the same as that for polio vaccine,
since mothers were specifically asked whether the child had received polio vaccine. For children whose information was
based on the mother's report, the proportion of vaccinations given during the first year of life was assumed to be the
same as for children with a written record of vaccination.
b Editors' note: Differences between the total numbers of children accounted for under the different "Characteristics"
References
Bland M. An introduction to medical statistics. Oxford, Oxford University Press, 1987.
Colton T. Statistics in medicine. Boston, Little, Brown, 1995.
Daly LE, Bourke GJ, McGilvray J. Interpretation and uses of medical statistics. Oxford, Blackwell,
1985.
Dawson-Saunders B, Trapp RG. Basic and clinical biostatistics. Norwalk, Appleton & Lange,
Prentice-Hall International, 1990.
Dixon RA. Medical statistics: content and objectives of a core course for medical students.
Medical education, 1994, 28:59-67.
Feinstein AR. Clinical biostatistics. St Louis, Mosby, 1977.
Huff D. How to lie with statistics. New York, Norton, 1954.
Kirkwood BR. Essentials of medical statistics. Oxford, Blackwell, 1988.
Warwick DP, Lininger CA. The sample survey: theory and practice. New York, McGraw Hill, 1975.
HANDOUT 9.1
1-tailed and 2-tailed tests: When the difference being tested for significance is not specified in direction (that is,
takes no account of whether $X_1 < X_2$ or $X_1 > X_2$), then the probabilities in both tails of the sampling
distribution are used in the test: a 2-tailed test is required. When the difference being tested is directionally specified
beforehand (when $X_1 < X_2$, but not $X_1 > X_2$, is being tested against the null hypothesis $X_1 = X_2$), then a
1-tailed test is appropriate because we are only concerned with the probability $P(X_1 < X_2)$ and not $P(X_1 > X_2)$.
Level of significance: The probability of a difference arising purely by chance, below which it is considered suffi-
ciently "unlikely" for the difference to be considered statistically significant (conventionally 0.05). The prob-
ability of wrongfully rejecting the null hypothesis.
Null hypothesis: The hypothesis of "no difference" or, more correctly, the hypothesis that the observed difference is
entirely due to sampling error, that is, that it occurred purely by chance. In a test of significance, the "null
hypothesis" is postulated to establish the basis for calculating the probability that the difference occurred
purely by chance. When the difference is not significant, the null hypothesis is not rejected; when the
difference is significant, the null hypothesis is rejected in favour of other hypotheses about the causes of the
difference. Note that the null hypothesis is never proved completely right or wrong, or true or false, but is only
rejected or not rejected at the probability level of significance concerned, for example, 0.05 or 0.01.
p-value: The probability of obtaining the results or more extreme results than those observed in the study under the
null hypothesis.
Statistical significance: The concept by which results are judged as due to chance or not.
Type 1 and 2 errors: Type 1 error is the risk of erroneously rejecting a null hypothesis that is really true. Type 2 error
is the chance of erroneously failing to reject a null hypothesis that is, in fact, false.
Data situation
A rural health survey investigated 124 households in a village and recorded their sources of water supply. By reviewing
the village's health centre morbidity records for a period of three months prior to the survey, it was possible to identify
household members with a history of diarrhoeal episodes. A total of 88 used the river for water supply and 49 of them
had episodes of diarrhoea, as against 10 from the 36 households using the well. There was no piped water in this
village. Is there a statistically significant difference in the proportions with episodes of diarrhoea between the
households using river and well water supplies?
Solution
Null hypothesis: there is no difference in the proportion with episodes of diarrhoea between household members
using river or well water supplies.
Alternative hypothesis: there is a difference in the proportion of diarrhoea episodes as a result of different sources of
water supply. (Note that this is a 2-tailed test as no direction is indicated for the difference in episodes of diarrhoea.)
Level of significance: 0.05.
Test statistic: The z-test for proportion is chosen as appropriate here:
$r_1$ and $r_2$ are the numbers with attributes (in this case episodes of diarrhoea) in each group;
$n_1$ and $n_2$ are the sample sizes in each group.
In our data situation,
$r_1 = 49$
$n_1 = 88$
$r_2 = 10$
$n_2 = 36$
$p_1 = 49/88 = 0.5568$
$p_2 = 10/36 = 0.2778$
$1 - p_1 = 39/88 = 0.4432$
$1 - p_2 = 26/36 = 0.7222$
$z = 3.044$,
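The formula for the test statistic is not legible at this point in the handout, so the sketch below rests on an assumption: that the standard error of the difference is calculated separately from each group (unpooled), as the listing of 1 - p1 and 1 - p2 above suggests. A pooled version is shown for comparison. All variable names are illustrative.

```python
from math import sqrt

r1, n1 = 49, 88  # river: households with diarrhoea, households surveyed
r2, n2 = 10, 36  # well: households with diarrhoea, households surveyed
p1, p2 = r1 / n1, r2 / n2

# Assumed form 1: standard error from the two sample proportions separately
se_unpooled = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z_unpooled = (p1 - p2) / se_unpooled

# Assumed form 2: standard error from the pooled proportion under the null hypothesis
p = (r1 + r2) / (n1 + n2)
se_pooled = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
z_pooled = (p1 - p2) / se_pooled

print(f"z (unpooled) = {z_unpooled:.3f}, z (pooled) = {z_pooled:.3f}")
```

The unpooled form gives a value of about 3.05, close to but not identical with the 3.044 quoted above, and the pooled form gives about 2.82. Both exceed 1.96, so the conclusion at the 5% level is the same under either assumption.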
Conclusion
Checking with the table of the normal distribution shows that the value of z at the 5% level is 1.96; therefore we
reject the null hypothesis that the proportion with diarrhoeal episodes is the same in the two groups of households
using the different sources of water supply. The difference in the proportion with episodes of diarrhoea is unlikely to be
due to chance, p < 0.05. In fact, it appears that the household members using the river have statistically significantly
more episodes of diarrhoea than those using the well. However, to establish a causal relationship, further investiga-
tions would have to be done.
Note: These data can also be tested by the $\chi^2$ test, but the results have to be presented in a 2 × 2 contingency table,
as shown below.
                              River    Well   Total
No diarrhoea                     39      26      65
Diarrhoea                        49      10      59
Total                            88      36     124
Percentage with diarrhoea      55.7    27.8    47.6
The hypothesis to be tested will now be that of no association between diarrhoeal episodes and source of water
supply. The data in fact indicate that an association exists between diarrhoeal episodes and source of water in the
village (with diarrhoeal episodes in 55.7% and 27.8% of households using river and well water, respectively).
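As an illustration of the test mentioned in the note, the statistic for this 2 × 2 table can be computed from the observed and expected frequencies. The sketch below applies no continuity correction, which is an assumption; the variable names are illustrative.

```python
# Observed frequencies from the contingency table (columns: river, well)
observed = [[39, 26],   # no diarrhoea
            [49, 10]]   # diarrhoea

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n  # expected frequency under no association
        chi2 += (o - e) ** 2 / e

print(f"chi-squared = {chi2:.2f} with 1 degree of freedom")
```

The result is about 7.98, well above the 5% critical value of 3.84 for 1 degree of freedom, and it equals the square of the z value obtained when the pooled proportion is used in the z-test.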
The following data are from a study to compare the mean concentration of lead (in mg/100 g) in the blood of a group
of workers in a battery plant (exposed) with that of a group of workers in a textile factory (not exposed).
Battery workers (exposed)    Textile workers (not exposed)
0.082                        0.040
0.080                        0.035
0.079                        0.036
0.069                        0.039
0.085                        0.040
0.090                        0.046
0.086                        0.040
where $x_1 = X_1 - \bar{X}_1$
and $x_2 = X_2 - \bar{X}_2$
We find
where the suffixes 1 and 2 refer to battery workers and textile factory workers, respectively, and SEd is the standard
error of the difference in mean lead concentrations between the two groups.
The null hypothesis ($H_0$) is that there is no difference in the mean lead concentration in the blood of the workers of the
two industries. This implies a 2-tailed test. We have
$d = \bar{X}_1 - \bar{X}_2 = 0.042414$.
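The calculation can be sketched as follows, assuming that the first column of figures refers to the battery workers and the second to the textile workers, and that a t-test with a pooled variance is intended (the formula itself is not legible in this handout). The helper name `sum_sq_dev` is hypothetical.

```python
from math import sqrt
from statistics import fmean

# The seven pairs of values listed above (mg/100 g); column assignment is assumed
battery = [0.082, 0.080, 0.079, 0.069, 0.085, 0.090, 0.086]
textile = [0.040, 0.035, 0.036, 0.039, 0.040, 0.046, 0.040]

def sum_sq_dev(values):
    """Sum of squared deviations of the values from their mean."""
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)

n1, n2 = len(battery), len(textile)
d = fmean(battery) - fmean(textile)  # difference between the two means

# Pooled variance and standard error of the difference (assumed method)
pooled_var = (sum_sq_dev(battery) + sum_sq_dev(textile)) / (n1 + n2 - 2)
se_d = sqrt(pooled_var * (1 / n1 + 1 / n2))
t = d / se_d

print(f"d = {d:.6f}, SE_d = {se_d:.6f}, t = {t:.2f}, df = {n1 + n2 - 2}")
```

With the seven rows shown, the difference in means comes to about 0.0421, slightly different from the 0.042414 printed above, so the listing may be incomplete or one figure may be misprinted. The t value is very large in either case, so the conclusion about the null hypothesis is not affected.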
Enabling objectives
At the end of the lesson the students should be able to:
(a) Give examples of types of questions concerning health or medicine that are answered by
analysis of statistical association or correlation.
(e) Explain the concept of relationship between two quantitative variables presented in a scatter
diagram.
Lesson content
Situations in which analysis of statistical association or correlation can provide
answers
• Studies of two or more variables measured on the same subject (or unit of
inquiry) where the interest is in their relationship
The $\chi^2$ test
Procedure
• The null hypothesis
• Calculation of expected frequencies (E) for each cell under the null
hypothesis
• The concept of degrees of freedom
• Calculation of the $\chi^2$ statistic
• Correction for continuity, for 2 × 2 tables
Limitations
• Effect of small expected frequencies
• Applicability only to categorical data
Interpretation
• Use of the table of the theoretical distribution of $\chi^2$ to determine significance
Properties
• Unit free (the coefficient r is an absolute number)
• Independent of change of origin and scale
• Lies between - 1 and +1
Uses
• Measure of the strength of association between two quantitative variables
Misuses
• Concluding no relationship from zero correlation, while in fact a strong non-
linear relationship may exist
• Unwarranted conclusion from spurious correlation
• Concluding a cause-effect relationship from a correlation, while it might just
be an indirect relationship
• Concluding an agreement between pairs of measurements, while they may
not have the same values at all points
Linear regression
A regression estimates the nature of the relationship. The concept and applica-
tions of linear regression should be covered, with explanation of the terms de-
pendent and independent variables. A description of the regression line should
be given.
Uses
• Measure of linear association
• Interpolation
• Prediction
Misuses
• Extrapolation without assurance that the trend remains the same
• Using a regression relationship whose slope has been shown to be not signifi-
cantly different from zero
• Forgetting that the predicted values are subject to sampling error
• Concluding that a cause-effect relationship exists, whereas the relationship
may just be statistical
• Applying a relationship established in one group of subjects to another group,
without the assurance that it is applicable to all groups
(c) Recapitulate the basic principles of hypothesis testing. For the $\chi^2$ statistic as a means for
testing the statistical significance of the association between two categorical variables, give a
heuristic explanation of the formula of $\chi^2$ so that the students realize that $(O - E)^2/E$ is a
measure of deviation from independence. Emphasize that E is the cell frequency expected
under the null hypothesis of independence (that is, of no association).
(d) Explain the concept of degrees of freedom by giving actual examples of, say, 2 × 2 and
2 × 3 tables illustrating the "freedom" to choose frequencies in one and two cells, respectively,
under the constraint of fixed marginal totals. Stress the interpretation of a significant
χ² as mere presence of association, with no implication for the strength of the association.
Point out that the magnitude of χ² is severely affected by n. Indicate what further calculations
are required to measure the strength of the association.
(e) Distinguish between linear and non-linear relationships between two quantitative variables
by giving examples, as shown in Handout 10.2. Emphasize that the coefficient of correlation
measures only the linear component of the relationship, which may not exist in some cases,
despite the presence of a strong non-linear relationship. Make liberal use of diagrams to
illustrate the magnitude and direction of correlation coefficients.
(f) Use scatter plots to explain the fluctuations around a line, even in the case of a linear rela-
tionship. In the case of high fluctuations, the predictive value of the relationship can be
reduced substantially.
(g) Briefly explain that the statistic t = r√(n − 2)/√(1 − r²) follows Student's t distribution with
n − 2 degrees of freedom, subject to the normality of either X or Y. This tests only the
hypothesis that the correlation is zero in the population. For testing other values of the
correlation, tests based on Fisher's z transformation are required. Explain the role of the sample size
in determining how much confidence can be placed in the value of a coefficient of correlation.
(h) Briefly explain the use of the t-test for the hypothesis of no correlation. This test does not
provide any clue to the magnitude of the correlation. A better indication of the magnitude of
correlation can be obtained by computing 100 × r², the percentage of the variation
in the dependent variable that is explained by its association with the independent variable. Also
mention that the t-test requires normality of Y, particularly for small samples, and that this
test should not be used indiscriminately.
(i) Discuss the need to obtain the nature of the relationship in the form of an equation. Restrict
the lesson to linear relationships only.
(j) Come back to the scatter plots used earlier in the context of correlation and illustrate various
types of regression lines. Use the illustrations in Handout 10.2 to explain the meaning of the
slope measured by the regression coefficient and of the intercept. Show the equivalence of
the use of r and b to indicate association. Give examples of situations for preferring one over
the other.
(k) Superimpose scatters of different variability and explain how variability affects the reliability
of predictions - whether extrapolated or interpolated.
(l) Discuss the uses and misuses of regression lines on the basis of the examples chosen from
published literature.
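The points made in notes (g) and (h) about the t-test, Fisher's z transformation and the role of sample size can be illustrated with a short calculation. The sketch below is an assumption-laden illustration rather than a prescribed method: the function names are hypothetical, the value r = 0.5 is invented, and the interval uses the usual large-sample approximation with standard error 1/√(n − 3) and the normal critical value 1.96.

```python
import math

def t_for_zero_correlation(r, n):
    """t statistic, with n - 2 degrees of freedom, for testing zero correlation."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

def fisher_z_interval(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for the population correlation."""
    z = math.atanh(r)                  # Fisher's z transformation
    se = 1.0 / math.sqrt(n - 3)        # approximate standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

r = 0.5                                # invented sample correlation
for n in (10, 30, 100):
    t = t_for_zero_correlation(r, n)
    lower, upper = fisher_z_interval(r, n)
    print(f"n = {n:3d}: t = {t:.2f} on {n - 2} df, "
          f"95% CI for correlation = ({lower:.2f}, {upper:.2f}), "
          f"variation explained = {100 * r * r:.0f}%")
```

The same r gives a larger t and a narrower interval as n increases, whereas the percentage of variation explained (100 × r²) does not depend on n. With n = 10 the t value falls short of the tabulated 5% point of 2.306 for 8 degrees of freedom, so the same correlation that is significant in a larger sample is not significant here.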
Lesson exercises
The teacher should give two different kinds of health-related data to the students: one set for
2 × 2 cross-tabulated data of two discrete variables, and the other for two quantitative variables
measured on the same individuals. The exercise should focus on enabling the students to pro-
duce a scatter diagram and to carry out appropriate procedures to test association between the
variables in each of the two situations. The students should also be tested on their ability to
interpret the results of the tests.
Total 13 12 25
χ² = 0.987, df = 1, P > 0.25. When each frequency is multiplied by 10, then
χ² = 9.87, df = 1, but now P < 0.005.
• Give some scatter plots of known data and ask the students to make an edu-
cated guess of the magnitude and direction of the coefficient of correlation. In-
clude among the scatter plots at least one random plot (no correlation) as well as
at least one with a non-linear relationship. Let the students calculate r to check
how good their guesses were.
• Give some regression lines with different slopes and ask the students to inter-
pret each of them. Superimpose scatter plots on them with different variability
and let the students describe the impact of variability on the reliability of the
conclusions based on the regression equation.
• Use the data on age, height and weight of a male preschool child, followed up
from the age of six months, to draw a scatter diagram and find the best regres-
sion line for age and weight.
Age (months)   Height (cm)   Weight (kg)
     6            66.9           7.1
     7            68.5           7.2
    12            72.0           7.8
    16            77.0           8.3
    18            79.0           8.9
    22            82.1           9.2
    24            82.7           9.5
    26            84.2          10.4
    30            86.0          11.0
    32            86.5          10.8
    34            89.5          11.4
    35            89.7          11.8
    43            95.0          13.0
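Teachers may wish to check the students' answers to this exercise. The sketch below fits the least-squares line of weight on age to the data above. It is an illustration under stated assumptions: the first column is read as age in months and the third as weight in kilograms, and the function name least_squares_line is hypothetical.

```python
import math

# Exercise data; reading the columns as age (months) and weight (kg)
# is an assumption based on the wording of the exercise.
age = [6, 7, 12, 16, 18, 22, 24, 26, 30, 32, 34, 35, 43]
weight = [7.1, 7.2, 7.8, 8.3, 8.9, 9.2, 9.5, 10.4, 11.0, 10.8, 11.4, 11.8, 13.0]

def least_squares_line(x, y):
    """Return (intercept a, slope b, correlation r) for the line y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                      # regression coefficient (slope)
    a = mean_y - b * mean_x            # intercept
    r = sxy / math.sqrt(sxx * syy)     # coefficient of correlation
    return a, b, r

a, b, r = least_squares_line(age, weight)
print(f"weight = {a:.2f} + {b:.3f} x age   (r = {r:.3f})")

# Interpolation: 20 months lies inside the observed age range (6 to 43).
print(f"predicted weight at 20 months: {a + b * 20:.1f} kg")
```

The fitted slope is roughly 0.16 kg per month and the intercept roughly 5.9 kg. Predictions for ages outside the observed range of 6 to 43 months would be extrapolations and should be treated with the caution described under the misuses of regression.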
OUTLINE 10 ASSOCIATION, CORRELATION AND REGRESSION
References
Armitage P, Berry G. Statistical methods in medical research. Oxford, Blackwell, 1987.
Campbell MJ, Machin D. Medical statistics. Chichester, John Wiley, 1990.
Colton T. Statistics in medicine. Boston, Little, Brown, 1995.
Edwards AL. An introduction to linear regression and correlation, 2nd ed. New York, WH Freeman,
1984 (first six chapters only).
Kirkwood BR. Essentials of medical statistics. Oxford, Blackwell, 1988.
Last JM, ed. A dictionary of epidemiology, 3rd ed. New York, Oxford University Press, 1995.
Wang C. Sense and nonsense of statistical inference: controversy, misuse and subtlety. New York, Marcel
Dekker, 1993.
HANDOUT 10.1
Association:¹ The degree of statistical dependence between two or more events or variables.
Bivariate relationship: Association (or relationship) between two variables.
Cell frequency: The number of observations in a cell of a contingency table.
Column total: The total number of observations in a column of a contingency table.
Contingency table:¹ A tabular cross-classification of data such that subcategories of one characteristic are indicated
horizontally (in rows) and subcategories of another characteristic are indicated vertically (in columns).
Dependent variable: In a regression analysis, this is the variable of which the value is thought to be predictable
from another variable.
Expected frequency: The number of observations to be expected in a class or cell if the null hypothesis is true.
Extrapolation: The use of the regression line to predict a value of the dependent variable from that of the
independent variable outside the range of values actually observed.
Grand total: The total number of observations cross-classified in a contingency table.
Independent variable: The variable, in a regression analysis, of which the value is thought to be predictive of
another variable.
Interpolation: The use of a regression line to estimate a value of the dependent variable from that of the
independent variable within the range of values actually observed.
Linear relationship: In a regression analysis, when the mathematical model describing the dependent variable in
terms of the independent variable is in the form of a straight line.
Multi-factorial relationship: Association (or relationship) between several factors or variables.
Non-linear relationship: When the form of the model describing y in terms of x is not a straight line.
Regression analysis:¹ Given data on a dependent variable y and an independent variable x, regression analysis
involves finding the "best" mathematical model (within some restricted form) to describe y as a function of x
or to predict y from x.
Regression coefficient(s): For a linear regression, these are the estimated slope and the intercept of the straight
line describing the dependent variable as a function of the independent variable.
Row total: The total number of observations in a row of a contingency table.
Spurious correlation:¹ An association between two variables that may be artefactual, fortuitous, false or due to all
kinds of non-causal associations resulting from chance or bias.
¹ From Last JM, ed. A dictionary of epidemiology, 3rd ed. New York, Oxford University Press, 1995.
[Figure: three scatter-plot panels; the plotted points and axis labels are not recoverable from the source. WHO 97530]