Chap 4

Chapter 4 discusses the importance of sampling in health and social care research, distinguishing between random and non-random sampling methods. It outlines various sampling techniques, including probability sampling methods like simple random, systematic, stratified, cluster, and multi-stage sampling, as well as non-random methods such as quota, convenience, and snowball sampling. The chapter emphasizes the need for appropriate sample selection to ensure representativeness and validity in research findings.

Uploaded by

Craig Wadawu
Copyright
© © All Rights Reserved

Chapter 4: Sampling

Learning Objectives
By the end of this chapter you should be able to:
• distinguish between random and non-random methods of
sample selection
• describe the advantages of random sample selection
• identify different methods of sample selection
• match different methods of sample selection to research
design
• describe the factors influencing sample size
• describe how to calculate the appropriate sample size

Introduction
Definition: Internal validity relates to the validity of the study
itself, including both the design and the instruments used.

Definition: External validity relates to the extent to which the
findings of the study can be generalised to a wider population
and be claimed to be representative.

Sampling is a crucial issue in health and social care research
design. When undertaking a quantitative approach,
researchers are striving to be able to generalise their findings
to the wider world (achieve external validity). It is therefore
essential that both the sampling method used and the sample
size achieved are appropriate: firstly, if the results are to be
representative, and secondly, if statistically significant
associations or differences are to be identified.
It is important to achieve representativeness in a sample,
because it is this very ‘representativeness’ that allows the
researcher to generalise the findings derived from a sample to
the wider population. Obtaining a representative sample is
therefore desirable as this means that your sample accurately
reflects the population under investigation.
In qualitative research, issues of sampling are different,
although no less important. Here, the objective of research is
more concerned with internal validity – with providing an
accurate representation of reality – than with the external
validity that will enable generalisation to other settings. Hence,
sampling to achieve statistical representativeness is of less
importance.
In this chapter we will concentrate on the different sampling
techniques that can be used in health and social care research to
achieve the desired validity objectives. We will examine
random and non-random methods of sample selection, which
are typically used in quantitative and qualitative research
respectively. In the second part of this chapter we will discuss
issues and techniques of sample size.

Why do we need to select a sample anyway?
In some circumstances it is not necessary to select a sample. If
the subjects of your study are so rare, for example a child with
a disease seen once in 100,000 children, then you might want to
study every case you can find. However, more generally you
are likely to find yourself in a situation where the potential
subjects of your study are more common and you cannot
include everybody.
Definition: A census is an enumeration of a population, originally
for purposes of taxation and military service. Derived from the
Latin censere, meaning ‘to rate’.

If you were to include everybody who is eligible in your study,
for example everybody in the UK who has been diagnosed as
suffering from asthma, then this would be defined as a census.
In many instances, a census is simply too large to handle. It
would take too long and cost too much money. If a census is
too large, it will be necessary to use a sample to carry out the
research.
Before looking at different sampling techniques it is important
to differentiate between two terms: population and sample.
A population refers to ‘units’ (i.e. people, schools, cities) that
you wish to make a statement about whereas the sample refers
to the group of units selected from the population from which
information would be collected. For example, imagine you wish
to carry out a study investigating the height of seven-year-old
boys in England. The population would be all seven-year-old
boys in England, whereas the sample would be a group of
seven-year-old boys in Sheffield from whom you could gather
data.
There are a number of different sampling techniques but overall
they can be split into two main groups: probability and
non-probability sampling, or, as they are more commonly known,
random and non-random methods.

Probability Sampling
Definition: Inferential statistics are used to make
generalisations from a sample to a population.

Probability samples are important because they allow the use of
inferential statistics. Inferential statistics provide ways of
judging whether it is possible to generalise the results from
the sample to the population.
There are five different types of probability sampling, which are
described in Fig 4.1 below:

Figure 4.1. Type of sampling and selection strategy (adapted from Henry 1990: 27)
Type of sampling: Selection strategy
Simple random: Each member of the study group has an equal probability of being selected.
Systematic: Each member of the study population is either assembled or listed, a random start is designated, then members of the population are selected at equal intervals.
Stratified: Each member of the study population is assigned to a group or stratum, then a simple random sample is selected from each stratum.
Cluster: Each member of the study population is assigned to a group or cluster, then clusters are selected at random and all members of selected clusters are included in the sample.
Multistage: Clusters are selected as in a cluster sample, and then sample members are selected from the cluster members by simple random sampling. Clustering may be done at more than one stage.

Before we look at examples of these sample designs in more
detail it is important to clarify the difference between
random sampling and randomisation.

Random Sampling versus Randomisation


For random sampling (probability sampling) to take place,
you must first have defined your potential subjects or
population. The group from which you will extract your
sample is also known as the ‘sampling frame’. So, for
instance, you may be interested in doing a study of children
aged between two and ten years diagnosed within the last
month with ear infections, or you may be interested in studying
adults (aged 16-65 years) diagnosed as having asthma and
receiving drug treatment for asthma in the last six months, and
living in a defined geographical region. Having defined your
potential sampling frame, random sampling is a way of
selecting a sample of people who will be representative of this
population.
However, if the research design is based on an experimental
design, such as a randomised controlled trial (RCT), with two or
more groups, then the population frame may often be more tightly
defined with strict eligibility criteria. Within an RCT potential
subjects are randomly allocated to either the intervention treatment group
or the control group. By randomly allocating subjects to each
of the groups, potential differences between the comparison
groups will be negated. In this way confounding variables
(i.e. variables you haven’t thought of or controlled for) will be
equally distributed between each of the groups and will be less
likely to influence the outcome or dependent variables in
either of the groups.
It is important to remember that randomisation is different
from random sampling. As we described in chapter 2,
randomisation within an experimental design is a way of
ensuring control over confounding variables and as such it
allows the researcher to have a greater confidence in
identifying real associations between an independent variable
(the cause) and a dependent variable (the effect or outcome
measure).
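
To make the distinction concrete, the following minimal sketch contrasts the two steps. It is an illustration only: the frame of 500 numbered people, the sample of 40, the fixed seed and the function names are all assumptions made for the example, not details of any study described here.

```python
import random

def random_sample(sampling_frame, size, rng):
    """Random sampling: decide who enters the study from the sampling frame."""
    return rng.sample(sampling_frame, size)

def randomise(subjects, rng):
    """Randomisation: allocate subjects already in the study to two groups."""
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"intervention": shuffled[:half], "control": shuffled[half:]}

rng = random.Random(1)                  # fixed seed so the sketch is repeatable
sampling_frame = list(range(1, 501))    # hypothetical frame of 500 numbered people
subjects = random_sample(sampling_frame, 40, rng)
groups = randomise(subjects, rng)
print(len(groups["intervention"]), len(groups["control"]))
```

Random sampling answers the question ‘who is in the study?’, whereas randomisation answers ‘which group is each subject in?’. A trial can use the second without the first.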
The term ‘random’ may imply to you that it is possible to take
some sort of haphazard or ad hoc approach, for example
stopping the first 20 people you meet in the street for inclusion
in your study. This is not random in the true sense of the word.
To be a ‘random’ sample, every individual in the population
must have an equal probability of being selected. In order to
carry out random sampling properly, strict procedures need to
be adhered to.
Random sampling techniques can be split into simple random
sampling and systematic random sampling.

Simple Random Sampling


If selections are made purely by chance this is known as simple
random sampling. So, for instance, if we had a population
containing 5,000 people, we could allocate every individual a
different number. If we wanted to achieve a sample size of
200, we could achieve this by pulling 200 of the 5000 numbers
out of a hat. This is an example of simple random sampling.
Another way of selecting the numbers would be to use a table
of random numbers. These tables are usually to be found in
the appendices of most statistical text-books.
Simple random sampling, although correct, is a very laborious
way of carrying out sampling. A simpler and quicker way is to
use systematic random sampling.
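
Today the ‘hat’ or random number table is usually replaced by a computer. As a minimal sketch of the example above (5,000 numbered people, a sample of 200), and assuming Python's standard `random` module stands in for the table:

```python
import random

population = list(range(1, 5001))      # every individual is allocated a different number
rng = random.Random(42)                # fixed seed only so the sketch is repeatable
sample = rng.sample(population, 200)   # draw 200 numbers without replacement
print(len(sample), len(set(sample)))
```

Because the draw is made without replacement, nobody can be selected twice, and every individual has the same chance of being included.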

Systematic Random Sampling


Systematic random sampling is a more commonly employed
method. After numbers are allocated to everybody in the
population frame, the first individual is picked using a random
number table and then subsequent subjects are selected using a
fixed sampling interval, i.e. every nth person.
Assume, for example, that we wanted to carry out a survey of
patients with asthma attending clinics in one city. There may
be too many to interview everyone, so we want to select a
representative sample. If there are 3,000 people attending the
clinics in total and we only require a sample of 200, we would
need to:
• calculate the sampling interval by dividing 3,000 by 200 to
give a sampling interval of 15 (a sampling fraction of 1 in 15)
• select a random number between one and 15 using a set of
random tables
• if this number were 13, we select the individual allocated
number 13 and then go on to select every 15th person
• this will give us a total sample size of 200 as required.
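
The steps above can be sketched as follows. This is an illustrative sketch, not a prescribed procedure: the function name and its parameters are assumptions, and the random start is either supplied (13, as in the example) or drawn by the computer instead of from random tables.

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Select every nth numbered individual after a random start."""
    interval = population_size // sample_size            # 3,000 / 200 = 15
    if start is None:
        start = random.Random(seed).randint(1, interval)  # random start between 1 and 15
    return list(range(start, population_size + 1, interval))[:sample_size]

selected = systematic_sample(3000, 200, start=13)
print(selected[:3], selected[-1], len(selected))
```

With a start of 13 the individuals numbered 13, 28, 43 and so on are chosen, ending with number 2,998 and giving 200 people in total.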
Care needs to be taken when using a systematic random
sampling method in case there is some bias in the way that lists
of individuals are compiled. For example, if the list alternates
so that every husband’s name immediately precedes his wife’s
name and the sampling interval is an even number, then we may
end up selecting only husbands or only wives.
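
A small sketch shows how this kind of list bias arises. The list of 100 couples, the start position and the two intervals are hypothetical values chosen only to illustrate the point.

```python
# Hypothetical list in which each husband is immediately followed by his wife.
couples_list = []
for couple in range(1, 101):
    couples_list.append(("husband", couple))
    couples_list.append(("wife", couple))

def every_nth(items, start, interval):
    """Select item number `start` (counting from 1), then every `interval`-th item."""
    return items[start - 1::interval]

even_interval = every_nth(couples_list, start=3, interval=10)
odd_interval = every_nth(couples_list, start=3, interval=9)
print({role for role, _ in even_interval})   # only one role appears
print({role for role, _ in odd_interval})    # both roles appear
```

With an even interval the selection never leaves the odd-numbered (or even-numbered) positions, so it picks the same member of every couple. An odd interval, or a list that has been shuffled first, avoids the problem in this particular case.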

Stratified Random Sampling


Stratified sampling is a way of ensuring that particular strata
or categories of individuals are represented in the sampling
process.
For example, if we know that approximately 4% of our
population frame was made up of a particular ethnic minority
group then there may be a chance that with simple or
systematic random sampling, we could end up with no ethnic
minorities in our sample. If we wanted to ensure that our
sample was representative of the population frame, then we
would employ a stratified sampling method.
• First we would split the population into the different strata, in
this case, separating out those individuals with the relevant
ethnic background.
• We would then use random sampling techniques, to sample
each of the two ethnic groups separately, using the same
sampling interval in each group.
• This would ensure that the final sample was
representative of the minority group we wanted to include, on a
pro-rata basis to the actual population.
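
As a rough sketch of proportionate stratified sampling, assume a hypothetical population frame of 5,000 people of whom 4% (200) belong to the minority group, and a sampling fraction of 1 in 25 applied to both strata. For brevity the sketch draws a simple random sample within each stratum; the function name and all figures are assumptions for illustration.

```python
import random

def proportionate_stratified_sample(frame, fraction, rng):
    """Sample the same fraction, at random, from every stratum in the frame."""
    strata = {}
    for stratum, person in frame:
        strata.setdefault(stratum, []).append((stratum, person))
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, round(len(members) * fraction)))
    return sample

rng = random.Random(7)   # fixed seed so the sketch is repeatable
frame = ([("minority", i) for i in range(200)]
         + [("majority", i) for i in range(4800)])
sample = proportionate_stratified_sample(frame, fraction=1 / 25, rng=rng)
counts = {name: sum(1 for stratum, _ in sample if stratum == name)
          for name in ("minority", "majority")}
print(counts)
```

The sample of 200 is guaranteed to contain 8 people from the minority group and 192 from the majority, mirroring the 4% to 96% split in the population.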
If, however, we actually want to be able to compare the results
of our minority group with the larger group, then we would
have difficulty in doing so using this form of proportionate
stratified sampling, because the numbers achieved in the
minority group, although pro-rata to those of the population,
would not be large enough to demonstrate statistically
significant differences.

If we really want to be able to compare the survey results of the
minority individuals with those of the larger group, then it is
necessary to use a disproportionate sampling method.
With disproportionate sampling, the strata selected are not
selected pro-rata for their size in the wider population.
For instance, if we are interested in comparing the views and
behaviour of particular minority groups with other larger
groups, then it is necessary to oversample the smaller
categories in order to achieve statistical power, i.e. in order to
be able to demonstrate statistically significant differences
between groups.
If during data analysis we wish to refer to the total sample as a
whole, as representative of the wider population, then it will
become necessary to re-weight the categories back into the
proportions in which they are represented in reality.
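
Re-weighting can be illustrated with a short sketch. The figures are hypothetical: 100 people are sampled from each of two strata, the minority stratum makes up 4% of the real population, and the outcome is the proportion of respondents who are satisfied. The function name is also an assumption.

```python
def reweighted_mean(stratum_means, population_shares):
    """Combine stratum results using each stratum's share of the real population."""
    if abs(sum(population_shares.values()) - 1.0) > 1e-9:
        raise ValueError("population shares must add up to 1")
    return sum(stratum_means[name] * population_shares[name] for name in stratum_means)

# Hypothetical results from a disproportionate sample of 100 people per stratum
stratum_means = {"minority": 0.80, "majority": 0.50}      # proportion satisfied
population_shares = {"minority": 0.04, "majority": 0.96}  # shares in the real population

# With equal stratum sizes the unweighted result is just the average of the two
unweighted = sum(stratum_means.values()) / len(stratum_means)
weighted = reweighted_mean(stratum_means, population_shares)
print(round(unweighted, 3), round(weighted, 3))
```

The unweighted figure (0.65) overstates satisfaction in the population as a whole because the minority group is over-represented in the sample; the re-weighted figure (0.512) restores each group to its true proportion.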
For example, if we wanted to compare the views and
satisfaction levels of women who gave birth in a birth pool
compared with those who gave birth ‘normally’ in a bed, a
systematic random sample, although representative of all
women giving birth would not produce a sufficient number of
women giving birth in water to be able to compare the results,
unless the total sample was so big that it would take many
years to collate. We would also end up interviewing more
women than we needed who have given birth ‘normally’. In
this case it would be necessary to oversample or over-represent
those women giving birth in water to have enough
individuals in each group in order to compare them. We would
therefore use disproportionate stratified random sampling to
select the sample in this instance.
The important thing to note here is that random sampling is still
taking place within each stratum or category. So we would use
systematic random selection to select a sample of women
giving birth in water and the same process to select women
giving birth ‘normally’.

Cluster Sampling
Cluster sampling is a method frequently employed in national
surveys where it is uneconomic to carry out interviews with
individuals scattered across the country. Cluster sampling
allows individuals to be selected in geographic batches. So, for
instance, before selecting at random, the researcher may decide
to focus on certain towns, electoral wards, hospitals or general
practices.
An example of a piece of research that used a cluster sampling
method can be found in:

Kennedy et al (2003) A cluster-randomised controlled trial of a
patient-centred guidebook for patients with ulcerative colitis:
effect on knowledge, anxiety and quality of life. Health and
Social Care in the Community, 11: 64
You can find this article online, at:
http://www.shef.ac.uk/library/elecjnls/
The aim of the study was to evaluate the impact of an
evidence-based guidebook on anxiety, knowledge and quality of life in
patients with ulcerative colitis.
Stage 1: A random sample of six hospitals was chosen from
the 20 district hospitals in the North West of England.
Stage 2: Hospitals were randomised to either receiving the
guidebook (intervention group) or not receiving the guidebook
(control group).
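
The two stages can be sketched as below. The hospital names are placeholders, and the equal split of three hospitals per arm is an assumption made for the sketch; it is not a detail taken from the published trial.

```python
import random

rng = random.Random(3)   # fixed seed so the sketch is repeatable
hospitals = [f"hospital_{i:02d}" for i in range(1, 21)]   # placeholder names

# Stage 1: select six of the 20 hospitals (the clusters) at random
selected = rng.sample(hospitals, 6)

# Stage 2: randomise whole hospitals, not individual patients, to each arm
rng.shuffle(selected)
allocation = {"intervention": selected[:3], "control": selected[3:]}
print(allocation)
```

Every eligible patient in a selected hospital then falls into that hospital's arm, which is what makes the design a cluster one.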

Multi-stage sampling
Multi-stage sampling allows the individuals within the
selected cluster units to then be selected at random. Obviously
care must be taken to ensure that the cluster units selected are
generally representative of the population and are not strongly
biased in any way.
An example of a piece of research that used a multistage
sampling method is:
Hughes et al (1997) Young people, alcohol and designer
drinks: quantitative and qualitative study. BMJ, 314: 414-
You can find this article online via the University of Sheffield
library, at:
http://www.shef.ac.uk/library/elecjnls/brbz.html

Aim of the study: to examine the appeal of designer alcoholic
drinks to young people aged 12 to 17 years.
Stage 1: A list was compiled of all postcode sectors in the Argyll and Clyde
Health Board (excluding those rural parts of the area, i.e. islands
and those with fewer than 500 households).
Stage 2: A random sample of 30 postcode sectors was chosen.
Stage 3: From each of the 30 postcode sectors, 40 people aged
12 to 17 years were identified using a random procedure
which stratified for age and sex.
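
A simplified sketch of the stages is given below. The number of eligible sectors (60) and of young people per sector (150) are invented for illustration, and the final stage uses simple random sampling without the stratification by age and sex that the study applied.

```python
import random

rng = random.Random(11)   # fixed seed so the sketch is repeatable

# Stage 1: a hypothetical list of eligible postcode sectors and their residents
sectors = {
    f"sector_{s:02d}": [f"s{s:02d}_person_{p:03d}" for p in range(150)]
    for s in range(60)
}

# Stage 2: a random sample of 30 postcode sectors
chosen_sectors = rng.sample(sorted(sectors), 30)

# Stage 3: 40 people selected at random within each chosen sector
sample = {sector: rng.sample(sectors[sector], 40) for sector in chosen_sectors}
total = sum(len(people) for people in sample.values())
print(len(sample), total)
```

Unlike a pure cluster sample, only some members of each selected cluster are included, giving 30 × 40 = 1,200 people in this sketch.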

Non-Random Sampling
Non-random or non-probability sampling refers to sampling
methods that do not adhere to the principles of probability
sampling, i.e. not everyone in the population has an equal
chance of being selected. For this reason, non-random
sampling is not used very often in quantitative health and social
care research.
Since the objective of qualitative research is to understand and
give meaning to a social process, rather than quantify and
generalise to a wider population, it is inappropriate to use
random sampling and apply statistical tests. The issues for
qualitative studies are therefore somewhat different, and the
approach to sampling is distinctive. Sample sizes used in
qualitative research are usually very small and the application
of statistical tests would be neither appropriate nor feasible.
For example, a study may wish to consider attitudes towards
compliance with asthma treatment. A small sample of local
asthmatic teenagers may be interviewed. The sample is thus
chosen on the basis of ‘theory’: this kind of theoretical
sampling aims to maximise the range of responses, but does
not strictly seek to ‘represent’ all the people in the community
and other stakeholders.
Non-random sampling techniques are also important to
consider as they are being used increasingly in market research
and commissioned studies such as political opinion polling.
There are three main types of non-random sampling: quota
sampling, convenience sampling and snowball sampling. The
technique most commonly used is known as quota sampling.

Quota Sampling
Quota sampling is a technique for sampling whereby the
researcher decides in advance on certain key characteristics that
they will use to stratify the sample. Interviewers are often set
sample quotas in terms of age and sex. So, for example, with a
sample of 200 people, they may decide that 50% should be
male and 50% should be female; and 40% should be aged
40 years or over and 60% aged 39 years or less. The difference from a
stratified sample is that the respondents in a quota sample are
not randomly selected within the strata. The respondents may
be selected just because they are accessible to the interviewer.
Because random sampling is not employed, it is not possible to
apply inferential statistics and generalise the findings to a wider
population.
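
The logic of filling quotas can be sketched as follows, scaled down to a sample of 10 (5 male, 5 female; 4 aged 40 or over, 6 aged 39 or less). The stream of passers-by, the age bands and the function name are hypothetical; the point is that people are taken in the order they happen to come along, not at random.

```python
def fill_quotas(passers_by, sex_quota, age_quota):
    """Recruit people in the order they are met until every quota is full."""
    sex_left, age_left = dict(sex_quota), dict(age_quota)
    recruited = []
    for person in passers_by:
        band = "40 or over" if person["age"] >= 40 else "39 or less"
        # Accept only if both of this person's quotas still have room
        if sex_left[person["sex"]] > 0 and age_left[band] > 0:
            recruited.append(person)
            sex_left[person["sex"]] -= 1
            age_left[band] -= 1
        if sum(sex_left.values()) == 0:
            break
    return recruited

# Hypothetical stream of people who happen to pass the interviewer
ages = [25, 45, 33, 61, 19, 52, 38]
passers_by = [{"sex": "female" if i % 3 == 0 else "male", "age": ages[i % 7]}
              for i in range(30)]

recruited = fill_quotas(passers_by,
                        sex_quota={"male": 5, "female": 5},
                        age_quota={"40 or over": 4, "39 or less": 6})
print(len(recruited))
```

The quotas are met exactly, yet who fills them depends entirely on who was accessible to the interviewer, which is why inferential statistics cannot properly be applied.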

Convenience or Opportunistic Sampling
Selecting respondents purely because they are easily accessible
is known as convenience sampling. Although quantitative
researchers generally frown upon this technique, it is an
acceptable approach when using a qualitative design, since
generalisability is not a main aim of qualitative approaches.
In fact qualitative data are often collected using a convenience
or opportunistic sampling approach, for instance where the
researcher selects volunteers amongst his or her work
colleagues. However, as this is rather a haphazard method,
many qualitative researchers employ a purposive sample to
identify specific groups of people who exhibit the
characteristics of the social process or phenomenon under
study. For example, a researcher may be seeking to interview
people who have recently been bereaved, or are being
rehabilitated into intermediate care; another researcher may be
seeking people who have experienced long-term
unemployment and suffer from chronic asthma. Sometimes
researchers try to find people who typify the characteristics
they are looking for. This is known as an ‘ideal type’.
Occasionally it is useful to deliberately include people who
exhibit the required characteristics in the extreme. Close
examination of extreme cases can sometimes be very
illuminating, when trying to formulate a theory. However,
caution must be adopted with this approach, to ensure these
extreme cases are not used to generalise.

Snowball or Networking Sampling


Snowballing occurs when one respondent supplies you with
the names of other individuals in a like position, who may be
interested in talking to you also. It is another sampling
technique commonly used in qualitative research.
Snowballing is particularly useful when trying to reach
individuals with rare or socially undesirable characteristics.
Let’s look at an example. Imagine you wish to explore the safe
sex practices of female sex workers working in Sheffield. Once
a female sex worker had participated in the study they could
then approach their colleagues to see if they would also be
interested in participating in the research. A snowballing
sampling method may be particularly useful in this situation
given the sensitive nature of the research study.
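
The way a snowball sample grows can be sketched as a chain of referrals. The referral network below is entirely hypothetical (participants are labelled A to F), and the sketch assumes that everyone who is referred agrees to take part, which would rarely hold in practice.

```python
from collections import deque

def snowball(seeds, referrals, max_size):
    """Recruit the seed participants, then the people each participant refers."""
    recruited = []
    queue = deque(seeds)
    seen = set(seeds)
    while queue and len(recruited) < max_size:
        person = queue.popleft()
        recruited.append(person)
        for contact in referrals.get(person, []):
            if contact not in seen:        # nobody is approached twice
                seen.add(contact)
                queue.append(contact)
    return recruited

# Hypothetical referrals: A names B and C, B names D, and so on
referrals = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "E": ["F"]}
print(snowball(["A"], referrals, max_size=5))
```

Starting from a single participant, A, the sample grows to A, B, C, D and E. Anyone not connected to the starting participant can never be reached, which is one reason a snowball sample cannot be treated as representative.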

Theoretical Sampling
If a researcher wishes to develop a social theory, a
‘theoretical’ sampling technique may be used. The idea is
that the researcher selects the subjects, collates and analyses the
data to produce an initial theory that is then used to guide
further sampling and data collection from which further theory
is developed. Since theoretical sampling is one of the main
methods used in qualitative research, this raises issues about
the degree to which findings can be generalised or achieve
transferability. For more information on this topic and
qualitative research in general see Chapter 7.

We have introduced you to random and non-random methods
of sampling. As mentioned above, quantitative and qualitative
research typically use random and non-random sampling
techniques respectively to obtain a sample. The main differences
between quantitative and qualitative sampling techniques are shown in
Figure 4.2.

Figure 4.2 Sampling in quantitative and qualitative research
Quantitative research:
• Sample big enough for statistical inference
• Selected to be representative
Qualitative research:
• Often very small, occasionally a single case
• Rarely attempts to be representative: sample chosen to maximise range of responses

Please now complete the following SAQ. Read the
descriptions and decide what type of sample selection has taken
place, and write in the boxes below.

Self-Assessment Questions 4.1 Sampling methods

1. A sample of school children, some with asthma and some without, is selected
from GP records. The children are selected randomly within each of the two
groups and the number of children in each group is representative of the total
patient population for this age group.

2. A sample of children is selected from social workers’ records. The children are selected
so that 50% have been placed into foster care and 50% are waiting for foster parents. The
children are randomly selected within each group.

3. A survey is carried out to examine the attitudes of mothers with children under one
year. The sample is selected by interviewers stopping likely-looking women pushing
prams in the street. The number of respondents who fall into different age bands and
social classes is strictly collated.

4. A sample of drug users is gathered by advertising in the local newspaper for potential
respondents.

5. All male adults whose National Insurance numbers end in ‘5’ are selected for a survey.

6. All patients attending wards 3, 5 and 10 in a hospital are selected for a study.

Now that you have been introduced to sampling techniques
used in qualitative and quantitative research we would like you
to read more on these sampling methods. Please read chapter
four, ‘Sampling’, of Alan Bryman’s book, which is provided
for you in the supplementary reading. When you have
finished, please complete the following SAQ. You may wish
to take notes as you read.

SAQ 4.2 Sampling

Having read the chapter from Bryman, answer the following:



1. What are the disadvantages often associated with using quota sampling?

2. For which research methodology are you most likely to use a snowball sampling technique
and why?

3. What are the main advantages of using a stratified sampling method?

4. What is meant by non-response? In what way could this affect obtaining a representative
sample?

Answers to SAQ 4.1
1. Stratified random sample. The sample is stratified because
the sample has been selected to ensure that two different groups
are represented.
2. Disproportionate stratified random sample. This sample is
stratified to ensure that children from the two different groups
are picked up; however, the two groups are selected so that they
are equal in size and are not representative of the wider
population of children.
3. Quota. The sample is not randomly selected but the
respondents are selected to meet certain criteria.
4. Convenience. The sample is not randomly selected and no
quotas are applied.
5. Systematic random sample.
6. Cluster sample because the patients are selected only from
certain wards.

So far in this chapter we have covered the different sampling
techniques that are used in health and social care research and
how random and non-random methods are typically applied in
quantitative and qualitative research respectively.
However, as well as correctly using the most appropriate
sampling technique, it is also important to consider another
aspect of sampling: sample size, which we will go on to look at
in a moment.

First, please complete the following for your log book.

Reflective Exercise 4.1 Sampling
We would like you to look at the practical issues surrounding
sampling techniques in research. Please read the journal article:
Helder D et al (2002) Living with Huntington's disease: Illness
perceptions, coping mechanisms, and patients' well-being. British
Journal of Health Psychology, 7 (4): 449 – 462.
(Tip: read the methods and the discussion).

1. How was the sample selected for this survey?

2. Did the researchers use random or non-random sampling methods?

3. Was the sample representative?

4. How might the sampling be improved?

Answers to SAQs 4.2
1. Quota samples are often non-representative because of the
interviewer bias involved in selecting the respondents.
Respondents may be untypical, as only those who are available
and in the interviewer’s vicinity are typically selected. The
interviewer may also misjudge respondents, e.g. they may
assume that a person is younger than required for the study and
therefore not approach them. The reliance upon social
class often causes problems in making sure that respondents are
correctly assigned to these social class groups.
2. Qualitative, because generalisability, which is an essential
requirement within quantitative research, isn’t possible with
this non-probability sampling technique.
3. The main advantage of using a stratified sample is that the
sample should be distributed in the same way as the population,
which may not be achieved by just using a simple random
sample.
4. In social survey research, non-response refers to those
subjects selected for the sample who do not take part, for
example by not returning self-completed questionnaires.
Non-response can affect obtaining a representative sample, as
those subjects who have not agreed to participate may differ
from those who have participated (e.g. in terms of age, social
class, reading and writing abilities), which may make the
sample left in the study unrepresentative of the population.

Please take a break before moving on to the second part of
this chapter.

Part 2: Sample size and the ‘power’ of research
In the previous section, we looked at methods of sampling.
Now we want to turn to another aspect of sampling: how big a
sample needs to be in quantitative research to enable a study to
have sufficient ‘power’ to do the job of testing a hypothesis.
At first glance, many pieces of research seem to choose a
sample size merely on the basis of what ‘looks’ about right, or
perhaps simply for reasons of convenience: ten seems a bit
small, and a hundred would be difficult to obtain, so 40 is a
happy compromise! Unfortunately, a lot of published research
uses precisely this kind of logic. In the following section, we
want to show you why using such reasoning could make your
research worthless. Choosing the correct size of sample is not
a matter of preference; it is a crucial element of the research
process, without which you may well be spending months trying
to investigate a problem with a tool which is either completely
useless, or over-expensive in terms of time and other resources.

The truth is out there: hypotheses and samples


Most (but not all) quantitative studies aim to test a hypothesis,
and we looked at the logic of hypothesis testing in Chapter 1.
A hypothesis is a kind of ‘truth claim’ about some aspect of the
world: the properties of a cell, the behaviour of a social group,
or whatever. Research sets out to try to prove this truth claim
right (or more properly, to disprove the null hypothesis - a truth
claim phrased as a negative).
For example, let us think about the following hypothesis:
Levels of childhood asthma are affected by traffic pollution
and the related null hypothesis:
Levels of childhood asthma are not affected by traffic
pollution
Let us imagine that we have this as our research hypothesis,
and we are planning research to test it. We will undertake a
trial, comparing groups of children living in different traffic
environments, to assess the extent of asthma in these different
groups.
Now, if you recall the chapter on validity, external validity is
the extent to which the findings of a study can be
generalised. Obviously the findings of a study - while
interesting in themselves - only have value if they can be
generalised, to discover something about the topic that can be
applied. Were we to find an association, then we should want
to do something to reduce exposure to traffic pollution. So our
study has to have the capacity to be generalised beyond the few
children actually in the study.

Definition: An association between two variables represents some
sort of relationship. An association can be a causal one or it might
be spurious. Associations can be positive or negative.

The measurement of such generalisability of a study is done by
statistical tests of inference.
You may be familiar with some such tests: tests such as
chi-squared, the t-test, and tests of correlation. We will not look
at these tests in any detail in this course (see Chapter 6 for
further details of these tests), but we need to
understand that the purpose of these and other tests of
statistical inference is to assess the extent to which the
findings of a study can be accepted as valid for the population
from which the study sample has been drawn. If the statistics
we use suggest that the findings are ‘true’, then we can be
happy to conclude that (within certain limits of probability), we
can assume that the study’s findings can be generalised, and we
can act on them (to reduce exposure to traffic by school
children, for instance).
From common sense, we see that the larger the sample is, the
easier it is to be satisfied that it is representative of the
population from which it is drawn: but how large does it need
to be? This is the question that we need to answer, and to do
so, we need to think a little more about the possibilities that our
findings may not reflect reality: that we have committed an
error in our conclusions.

Type 1 and Type 2 errors


What any researcher wants is to be right! She wants to
discover that there is an association between two variables: say,
asthma and traffic pollution, if such an association really
exists. Or, if there is no such association, she wants her study
to support the null hypothesis that the two are not related.
(While the former may be more exciting, both are important
findings, both accurately reflect the reality that our studies try
to tap into).
What no researcher wants is to be wrong! No-one wants to
find an association which does not really exist, or - just as
importantly - NOT find an association which DOES exist.
Both such situations can arise in any piece of research. The
first (finding an association which is not really there) is called a
Type 1 error. It is the error of falsely rejecting a true null
hypothesis.
(Think this through carefully: an example might be a study that
rejects the null hypothesis that there is no association between
asthma and pollution. The findings suggest such an association,
but in reality, no such relationship exists.)
The second kind of error, called a Type 2 error (sometimes
written as Type II), occurs when a study fails to find an
association that really does exist. It is then a matter of wrongly
accepting a false null hypothesis. (Using the traffic and
asthma example again, we conduct a study and find no
association, missing one that really does exist.)
Both types of error are serious. Both have consequences:
imagine the money that might be spent on reducing traffic
pollution, and all the time it does not really affect asthma
(result of a Type 1 error). Or imagine allowing traffic pollution
to continue, while it really is affecting children’s health (result
of a Type 2 error). Good research will minimise the chances of
committing both Type 1 and Type 2 errors as far as possible,
although they can never be ruled out absolutely.

Statistical Significance and Statistical Power


Fig 4.1 shows the four possible outcomes of a piece of research
diagrammatically. Each cell in the figure represents a possible
relationship between the findings of the study and the 'real-life'
situation in the population under investigation. (Of course, we
cannot actually know the latter unless we survey the whole
population: that is the reason we conduct studies that can be
generalised through statistical inference.) Cells 1 and 4
represent desirable outcomes, while cells 2 and 3 represent
potential outcomes of a study that are undesirable and need to
be minimised. These possible outcomes are related to two
concepts: statistical significance and
statistical power. The former is well known to most
researchers who use statistics; the latter is less well understood.

Figure 4.1: The Null Hypothesis (Ho), Statistical Significance and Statistical Power

                              POPULATION: Null Hypothesis is
STUDY: Null Hypothesis is     False                      True
False                         ❶ Correct Result           ❷ Type I Error (alpha)
True                          ❸ Type II Error (beta)     ❹ Correct Result

Cell 1. The null hypothesis has been disproved by the results of
the study (that is, there is support for a hypothesis which
suggests some differences between groups or association
between variables). This is also the situation in the population.
Thus, we can be satisfied that the study is reflecting the world
outside the limits of the study and it is to be accepted as a
'correct' result.
Cell 4. The results from the study support the null hypothesis.
This is the situation which pertains in the population, so we can
be satisfied that our study reflects the circumstances in the
population. Once again, this is a 'correct' result.
Cell 2. In this cell, as in cell 1, the study results falsify the null
hypothesis, indicating some kind of difference or association
between variables. However, in the world beyond the study,
the null hypothesis is actually true and there is no effect. This
is the Type I error: the error of wrongly rejecting a true null
hypothesis. The likelihood of committing a Type I error is
known as the alpha value or the level of statistical significance
of the test. Some of you may be familiar with alpha as the
threshold against which the quoted p value of a test is judged.
Alpha marks the probability of committing a Type I error when
the null hypothesis is true; thus an alpha of 0.05
indicates a five per cent (or one in 20) chance of committing a
Type I error. Cell 2 thus reflects an incorrect finding from a
study, and the alpha value represents the likelihood of this
occurring.
Cell 3. This cell similarly reflects an undesirable outcome of a
study. Here, as in Cell 4, a study supports the null hypothesis,
implying that there is no difference or association in the
population under investigation. But in reality, the null
hypothesis is false and there is some kind of difference or
association that the study is missing. This mistake is the Type
II error of wrongly accepting a false null hypothesis. The
likelihood of committing a Type II error is the beta value of a
statistical test, and the value (1 - beta) is the statistical power
of the test. Thus, the statistical power of a test is the likelihood
of avoiding a Type II error. Conventionally, a value of 0.80 or
80% is the target value for statistical power, meaning that
four times out of five a false null hypothesis
will be rejected. Outcomes of studies that fall into cell 3 are
incorrect; beta measures the likelihood of such an outcome,
and its complement (1 - beta) is the power of the study.
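The two error rates described for Cells 2 and 3 can be illustrated by simulating many small studies. The following is only a sketch under stated assumptions: the group size of 64, the true difference of half a standard deviation, the use of a simple z-test and the function name `rejection_rate` are all choices made for this illustration and do not come from the chapter.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def rejection_rate(true_difference, n_per_group=64, alpha=0.05,
                   n_studies=2000, seed=1):
    """Share of simulated two-group studies that reject the null hypothesis.

    Each study draws two groups from normal populations with standard
    deviation 1 whose means differ by true_difference, then applies a
    two-sided z-test. All numbers here are illustrative assumptions.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = sqrt(2 / n_per_group)
    rejections = 0
    for _ in range(n_studies):
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(true_difference, 1.0) for _ in range(n_per_group)]
        z = (fmean(group_b) - fmean(group_a)) / standard_error
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_studies


# Null hypothesis true in the population: every rejection is a Type I error (Cell 2).
type_1_rate = rejection_rate(true_difference=0.0)
# Null hypothesis false in the population: rejections are correct results (Cell 1).
power = rejection_rate(true_difference=0.5)
# Studies that fail to reject a false null hypothesis (Cell 3).
type_2_rate = 1 - power

print(f"Type I error rate: {type_1_rate:.3f}")
print(f"Power: {power:.3f}, Type II error rate: {type_2_rate:.3f}")
```

Under these assumptions the simulated Type I error rate should sit close to the chosen alpha of 0.05, and the power close to the conventional 0.8 target.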
All research should seek to avoid both Type I and Type II
errors, which lead to incorrect inferences about the world
beyond the study. In practice, there is a trade-off. Reducing
the likelihood of committing a Type I error by adopting a more
stringent level of significance (for example, 0.01 rather than
0.05) before accepting a positive
finding reduces the statistical power of the test, increasing the
possibility of a Type II error, and vice versa. However, both
statistical significance and statistical power are affected by
sample size.
Most researchers who use statistics will be aware that the
chances of gaining a statistically significant result will be
increased by enlarging a study's sample. Similarly, the
statistical power of a study is enhanced as sample size
increases. Let us look at each of these aspects of quantitative
research in turn.

The Statistical Significance of a Study


When a researcher uses a statistical test of inference, what she
is doing is testing her results against a gold standard. If the test
gives a positive result (this is usually known as ‘achieving
statistical significance’), then she can be relatively satisfied
that her results are ‘true’, and that the real world situation is
that discovered in the study (Cell 1 in Fig 4.1). If the test does
not give significant results (non-significant or NS), then she
can be reasonably satisfied that the results reflect Cell 4, where
she has found no association and no such association exists.
However, we can never be absolutely certain that we have a
result that falls in Cells 1 or 4. Statistical significance
represents the likelihood of committing a Type 1 error (Cell
2). Let us imagine that we have results suggesting an
association between traffic pollution and asthma, and a t-test (a
test to compare the results of two different groups) gives a
value which indicates that at the 5% or 0.05 level of statistical
significance, the level of asthma is higher in children exposed
to traffic fumes.
What this means is that, if traffic pollution really had no
effect, a difference as large as the one observed would arise by
chance in no more than five per cent of samples. We therefore
accept the result as reflecting a true effect (Cell 1), while
acknowledging that a Type 1 error (Cell 2) remains possible. If
the t-test value is higher, we might reach 1% or 0.01
significance. Then a chance association of this size would be
expected in no more than one per cent of samples.
Tests of statistical significance are designed to account for
sample size, thus the larger the sample, the ‘easier’ it is for
results to reach significance. A study that compares two groups
of 10 children will have to demonstrate a much greater
difference between the groups than a study with 1000 children
in each group. This is fair: the larger study is much more likely
to be ‘representative’ of a population than the smaller one.
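The comparison between groups of 10 and groups of 1000 can be made concrete with a rough calculation. This sketch assumes a two-sided test at the 0.05 level, equal group sizes and a normal approximation; the function name `smallest_significant_difference` is hypothetical. For very small groups a real t-test would demand a somewhat larger difference than this approximation suggests.

```python
from math import sqrt
from statistics import NormalDist


def smallest_significant_difference(n_per_group, alpha=0.05):
    """Smallest difference between two group means (in standard-deviation
    units) that reaches two-sided significance at level alpha.

    Normal approximation with equal group sizes and a common standard
    deviation: assumptions made for this sketch only.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z_crit * sqrt(2 / n_per_group)


for n in (10, 1000):
    d = smallest_significant_difference(n)
    print(f"{n} per group: difference must exceed {d:.2f} standard deviations")
```

On these assumptions the study with 10 children per group needs an observed difference ten times larger than the study with 1000 per group before it reaches significance.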
To summarise: statistical significance is a measure of how
unlikely positive results would be if there were no real effect.
When results reach significance, the
findings can be used to make conclusions about differences
which really exist.

The Statistical Power of a Study
As we have just seen, statistical tests build in a safety margin to
avoid generalising false positive results, possibly with
disastrous or expensive consequences. Researchers who use
small samples thus run the risk of not being able to demonstrate
differences or associations that really do exist. Thus they are in
danger of committing a Type 2 error (Cell 3 in Fig 4.1), of
accepting a false null hypothesis. Such studies are ‘under-
powered’, not possessing sufficient statistical power to detect
the effects they set out to detect. Conventionally, the target is
a power of 80% or 0.8, meaning that a study has an 80 per
cent likelihood of detecting a difference or association that
exists.
Examination of research undertaken in various fields of study
suggests that many studies do not meet this 0.8 conventional
target for power (Fox and Mathers 1997). What this means is
that many studies have much reduced likelihood of being able
to discern the effects which they set out to seek: a study with a
power of 0.66 will only detect an effect two times out of three,
while studies with power of 0.5 or less will detect effects no
more often than would be achieved by tossing a coin. A
non-significant finding of a study may thus simply reflect
the inadequate power of the study to detect differences or
associations at levels that are conventionally accepted as
statistically significant.
In such situations one must ask the simple question of such
research: ‘Why did you bother, when your study had little
chance of finding what you set out to find?’
A hypothetical example will illustrate the importance of being
able to detect a false null hypothesis. Imagine that researchers
wish to compare two inhalers for their efficacy in delivering
drugs for countering asthma attacks. Inhaler A is twice as
expensive as inhaler B. In the study, inhaler A demonstrates a
slightly higher level of efficacy, but because the study is small
(only 50 people in each group), this difference does not reach
significance, and the researchers conclude there is no
demonstrable difference. They report this finding, and as a
consequence, the cheaper but less effective inhaler comes to be
the one recommended to doctors to prescribe. If this study is
under-powered, this decision may be based on the acceptance of a
false null hypothesis (Cell 3: false negative), and the consequences
for children using a less effective inhaler could be fatal.
Statistical power calculations can be undertaken after a study
has been completed, to assess the likelihood of a study
discovering effects. More importantly, such calculations need
to be undertaken prior to a study to avoid the wasteful
consequences both of under-powering and of over-powering (in
which sample sizes are excessively large, leading to very high
power at the expense of higher than necessary study costs).
Power is a function of three variables: sample size, the chosen
level of statistical significance (alpha) and effect size. While
calculation of power entails recourse to tables of values for
these variables, the calculation is relatively straightforward in
most cases.
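As a rough illustration of how these three variables combine, the sketch below approximates the power of a comparison between two group means. It assumes equal group sizes, a two-sided test, a normal approximation, and an effect size expressed as a difference in standard-deviation units; the function name `power_two_groups` and the example values are hypothetical, not taken from the chapter's tables.

```python
from math import sqrt
from statistics import NormalDist


def power_two_groups(n_per_group, effect_size, alpha=0.05):
    """Approximate power of a two-sided comparison of two group means.

    effect_size is the difference between the means in standard-deviation
    units. Normal approximation; the negligible chance of a significant
    result in the wrong direction is ignored.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(effect_size * sqrt(n_per_group / 2) - z_crit)


# Vary each of the three inputs in turn (all values are illustrative).
for n in (50, 100, 200):
    print(f"n = {n} per group, ES = 0.5, alpha = 0.05: "
          f"power = {power_two_groups(n, 0.5):.2f}")
print(f"n = 50 per group, ES = 0.5, alpha = 0.01: "
      f"power = {power_two_groups(50, 0.5, alpha=0.01):.2f}")
print(f"n = 50 per group, ES = 0.2, alpha = 0.05: "
      f"power = {power_two_groups(50, 0.2):.2f}")
```

On these assumptions a study with 50 people per group falls short of the 0.8 target even for a difference of half a standard deviation, and a more stringent alpha or a smaller effect lowers its power further.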

Effect size and sample size


However, as was mentioned earlier, there is a trade-off between
significance and power, because as one tries to reduce the
chances of generating false positive results, the likelihood of a
false negative result increases. Researchers need to decide
which is more crucial, and set the significance level
accordingly. In the example of the two inhalers, it might be
concluded that wrongly accepting the null hypothesis of no
difference between A and B was the greater danger clinically,
and so a relatively lenient level of significance (0.05 rather
than 0.01) might be used, to increase
the power of the study to discern an effect.
Fortunately statistical significance and power are both
increased by increasing sample size, so increasing sample size
will reduce both Type 1 and Type 2 errors. However, that does
not mean that researchers necessarily need to vastly increase
the size of their samples, at great expense of time and
resources.
The other factor affecting the power of a study is the effect size
(ES) that is under investigation in the study. This is a measure
of how ‘wrong’ the null hypothesis is. For example, in the
study of inhalers, the ES is the difference in efficacy in treating
the symptoms of an asthma attack. An effect size may be a
difference between groups or the strength of an association
between variables such as asthma and traffic pollution.
If an ES is small, then many studies with small sample sizes are
likely to be under-powered. But if an ES is large, then a
relatively small-scale study could have sufficient power to
identify the effect under investigation. It is sometimes possible
to increase effect size, but usually this is the intractable element
in the equation, and accurate estimation of the effect size is
essential for calculating power before a study begins, and hence
the necessary sample size.
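Once an effect size has been estimated, the step from effect size to sample size can be sketched as follows. This assumes a comparison of two group means, a two-sided test and a normal approximation; the function name `n_per_group` is hypothetical, and the values 0.2, 0.5 and 0.8 are Cohen's conventional standardised differences for small, medium and large effects, used here as an assumption. Rough and ready tables built on other assumptions may give different figures.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate number of participants needed in each of two groups.

    effect_size is the expected difference between the group means in
    standard-deviation units. Normal approximation, two-sided test.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # guards against a Type I error
    z_power = z.inv_cdf(power)           # guards against a Type II error
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Assumed conventional effect sizes; replace with an estimate for your own study.
for label, es in (("small", 0.2), ("medium", 0.5), ("large", 0.8)):
    print(f"{label} effect ({es}): about {n_per_group(es)} per group")
```

The calculation makes the point of the preceding paragraph visible: halving the effect size roughly quadruples the sample required.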

An Effect Size can be estimated in four ways:


1. From a review of literature or meta-analysis, which can
suggest the size of ES that may be expected.
2. By means of a pilot study which can gather data from which
the size of effect may be estimated.

3. One can make a decision about the smallest size of effect
which it is worth identifying. To consider the earlier
example of two rival inhalers, if we are willing to accept the
two inhalers as equivalent if there is no more than a ten per
cent difference in their efficacy of treatment, then this effect
size may be set, acknowledging that smaller effects will not
be discernible.
4. As a last resort, one can use a ‘guesstimate’ as to whether an
ES is ‘small’, ‘medium’, or ‘large’.
These definitions and values for ‘small’, ‘medium’ and ‘large’
effects are conventions, as described by Cohen (1970). A
‘medium’ effect is defined as one which is 'visible to the naked
eye': in other words, one which could be discerned from everyday
experience without recourse to formal measurement.
For example, the difference between male and female adult
heights in the UK would be counted as a medium ES. Most
effects encountered in biomedical and social research should be
assumed to be small, unless there is a good reason to claim a
medium effect, while a ‘large’ effect size would probably need
to be defined as one which is so large that it hardly seems
necessary to undertake research into something so well
established.
Cohen offers the example of the difference between the heights
of 13 and 18 year old girls as a ‘large’ effect.
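Where pilot data are available, one common way of expressing an effect size for a difference between two groups is the difference between the group means divided by the pooled standard deviation. The sketch below shows that calculation; the function name `standardised_difference` and the pilot scores are invented purely for illustration and are not results from any study.

```python
from math import sqrt
from statistics import fmean, variance


def standardised_difference(group_a, group_b):
    """Difference between two group means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = ((n_a - 1) * variance(group_a) +
                       (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (fmean(group_a) - fmean(group_b)) / sqrt(pooled_variance)


# Hypothetical pilot data: symptom-relief scores, invented for illustration.
inhaler_a = [6, 7, 5, 8, 6, 7, 9, 5]
inhaler_b = [5, 6, 5, 7, 4, 6, 7, 4]

es = standardised_difference(inhaler_a, inhaler_b)
print(f"Estimated effect size from pilot data: {es:.2f}")
```

An estimate from so small a pilot is very imprecise, so it is best treated as one input alongside the literature and the smallest effect worth detecting.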
Power calculations may be used as part of the critical appraisal
of research papers. Unfortunately it is rare to see beta values
quoted for tests in research reports, and indeed often the results
reported are inadequate to calculate effect sizes. Studies have
evaluated the statistical power of research in various scientific
subjects, including nursing, education, management and medicine.

Calculating sample sizes in inferential studies


Calculating sample sizes to achieve sufficient power can be a
complicated matter, and you should consult a statistician if you
plan to use a quantitative sample methodology.
However there is a short-cut that can be used to give an
estimate of sample size for a power calculation. Table 4.2
gives some rough and ready figures for a range of simple tests,
based on a power (1 - beta) value of 0.8 and a significance level
of 0.05 (5%), for ‘small’ and ‘medium’ effect sizes. You may
be surprised to see how large the samples may need to be.

Table 4.2: Necessary sample sizes for statistical tests where
alpha (p) = 0.05

Test                    Degrees of Freedom   ES = Small       ES = Medium
T-test                                       300 per group    50 per group
F-test                  2                    322 per group    64 per group
                        3                    274 per group    45 per group
                        4                    240 per group    49 per group
Chi squared             1                    785 total        87 total
                        2                    964 total        107 total
                        3                    1090 total       121 total
Pearson’s correlation                        618              68

Statistical power is important for you to consider when you
plan your research study if you are using a quantitative design.
If in doubt, you should discuss this with your supervisor and
get expert statistical advice before planning your research
protocol.

Now please undertake the following reflective exercise for your
log book.

Reflective Exercise 4.2: Statistical power in health research
Please read the article in the associated reading by Fox and
Mathers entitled ‘Empowering your research: statistical power
in general practice research’
Now answer these questions:

1. What was the average power of research in the papers surveyed?

2. What proportion of papers had too high a power, and why is this an issue?

3. What recommendations do Fox and Mathers make to writers of papers which
report tests of statistical inference?

4. At what point in your PhD studies should you consult a statistician, to ensure your
study is neither under- nor over-powered?

Summary
Key points to remember when deciding on sample selection
are:
1. Always try to use a random method where possible and
remember that random does not mean haphazard.
2. Random selection means that everybody in your sampling
frame has an equal opportunity of being included in your
study.
3. If you need to be able to generalise about small or minority
groups and to compare those with larger groups, consider
using disproportionate stratified sampling, but remember to
re-weight the results afterwards if you wish to generalise
from the whole sample.

Key points to remember when deciding on sample size are:


1. A Type I error is the error of falsely rejecting a true null
hypothesis.
2. The likelihood of committing a Type I error is known as
alpha. The conventional level for alpha or p is usually 0.05
or 0.01.
3. A Type II error is the error of failing to reject a false null
hypothesis or wrongly accepting a false null hypothesis.
4. The likelihood of committing a Type II error is known as
beta. The conventional level of statistical power (1 - beta) is
set at 0.8 or 80%.
5. There is a trade off between committing a Type I error and a
Type II error, but historically science has placed the
emphasis on avoiding Type I errors.
6. Increasing the sample size reduces both Type I and Type II
error, but remember that it is costly and unethical to have
too large a sample size.
7. To calculate statistical power, you need to estimate the
effect size.

Recommended Reading

For details of sampling techniques


Bland M. An Introduction to Medical Statistics. Oxford:
Oxford University Press, 1995.
Clegg F. Simple Statistics: A Course Book for the Social
Sciences. Cambridge: Cambridge University Press, 1982.
For details of power calculations
Campbell MJ et al. Estimating sample sizes for binary, ordered,
categorical and continuous outcomes in two group
comparisons. BMJ 1995; 311: 1145-8.
Cohen J. Statistical Power Analysis for the Behavioural
Sciences. New York: Academic Press, 1977.
Machin D, Campbell MJ. Statistical Tables for the Design of
Clinical Trials. Oxford: Blackwell Scientific, 1987.
Studies of power in different scientific disciplines
Fox NJ, Mathers NJ. Empowering your research: statistical
power in general practice research. Family Practice 1997; 14:
324-9.
Polit DF, Sherman RE. Statistical power in nursing research.
Nursing Research 1990; 39: 365-369.
Reed JF, Slaichert W. Statistical proof in inconclusive
'negative' trials. Archives of Internal Medicine 1981; 141:
1307-1310.

Learning Review Form: Chapter 4
Please complete this form when you have completed the chapter. If you do not consider you
have achieved the learning outcomes, you need to go back and do more work on the chapter
or read the recommended reading.
When you have completed the form, save it to hand in with your log book

1. I am confident that I can:

                                            Not at all   Partly   Quite well   Very well
• distinguish between a random and          1            2        3            4
  non-random methods of sample selection
• describe the advantages of random         1            2        3            4
  sample selection
• identify different methods of sample      1            2        3            4
  selection
• match different methods of sample         1            2        3            4
  selection to research design
• describe the factors influencing          1            2        3            4
  sample size
• describe how to calculate the             1            2        3            4
  appropriate sample size
