Basics of Statistics
for Students of Veterinary Medicine
Brno, 2019
Contents
Preface
6 Parametric Tests
6.1 F-test (Variance Ratio Test)
6.2 t-test (Student's)
6.2.1 Population vs. Sample Comparison (One-sample t-test)
6.2.2 Samples Comparison (Two-sample t-test)
Preface
Author
Chapter 1
Basic concepts of statistics
Statistics is the science that allows us to formulate and describe complex data in a short
form, easily understood by all professionals. It allows us to compare data (numerical facts resulting
from observations in some investigative monitoring) and gives us probabilities of the likelihood of
studied events. The term “statistics” is often encountered as a synonym for “data”: statistics of
sickness rate during the last month (how many patients, number of cured patients), labour statistics
(number of workers unemployed, number employed in various occupations), election statistics
(number of votes in different regions, parties), etc. Hereafter, this use of the word “statistics” will
not appear in this textbook. Instead, "statistics" will be used in its other common sense: to refer to
the analysis and interpretation of data with a view toward objective evaluation of the reliability of
the conclusions based on the data.
Statistics are predominantly needed in more probabilistic and less predictive sciences such
as biology and applied biology (medicine). In a predictive science such as physics, to find out how
fast a 300 g stone dropped from a height of 30 m will reach the ground, one has only to apply the
data in the appropriate formula to obtain an accurate answer. In art, on the other hand, the
evaluation of a given piece depends to a great extent on subjective criteria. Medicine falls
somewhere in between. There are numerous and complex physical/chemical events occurring
simultaneously which cannot be evaluated separately. For instance, if one wants to determine the
time of induction, or return, of a given reflex of the specific tendon, the issue is more complicated
than it appears initially. In this case we are dealing with transmission of electrical potential
difference across many nerves and transmission to muscles, making relevant calculations more
tedious and specific data less well known. Furthermore, the specific functions are affected by
several other components of the internal milieu, such as the level of hormones. To complicate matters further,
this represents just one of many concomitant functions of a totally inhomogeneous system,
the living body. It is, therefore, easy to appreciate why biological sciences in general, and applied
biology such as medicine, in particular, are probabilistic in nature. As a result, a good grasp of
statistics is essential for one to be effective in this field.
Statistics applied to biological problems is simply called biostatistics or, sometimes,
biometry (the latter term literally meaning “biological measurements”). As biological entities are
counted or measured, it becomes apparent that some objective methods are necessary to aid the
investigator in presenting and analysing research data. Although the field of statistics has roots
extending back hundreds of years, its development began in earnest in the late nineteenth century,
and a major impetus from early in this development has been the need to examine biological and
medical data.
Nowadays, statistical methods are common and increasingly important in all
biological and medical sciences. Biostatistics is a necessary part of every education in biology and
medicine (both human and veterinary). When dealing with living organisms, we must always keep
in mind that every individual is unique and there is a high level of uncertainty regarding its
reactions. Therefore, all data obtained from biological objects may be very different and variable.
This results from vast genetic variability in living organisms and also from other aspects (ambient
environment, adaptability, etc.).
This large variability of biological data causes problems and difficulties in monitoring,
measurements and data acquisition in animals and other living organisms. These problems can
partially be solved by means of statistics, because only statistical methods are able to take into
account this great variability of biological data, evaluate them and give correct inferences
concerning studied biological objects. Statistics handles variability in two ways. First it provides
precise ways to describe and measure the extent of variability in our measured data. Secondly it
provides us with methods for using those measures of variability to determine a probability of the
correctness of any conclusions we draw from our data.
Before data can be analysed, they must be collected, and statistical considerations can aid in
the design of experiments and in the setting up of hypotheses to be tested. Many biologists attempt
the analysis of their research data only to find that too few data were collected to enable reliable
conclusions to be drawn, or that much extra effort was expended in collecting data that cannot be of
ready aid in the analysis of the experiment. Thus, knowledge of basic statistical principles and
procedures is important even before an experiment is begun.
Once the data have been obtained, we may organize and summarize them in such a way as
to arrive at their orderly and informative presentation. Such procedures are often termed descriptive
statistics. For example, a tabulation might be made of the heights of all students of the Faculty of
Veterinary Medicine, indicating an average height for each sex, or for each age. However, it might
be desired to make some generalizations from these data. We might, for example, wish to make
a reasonable estimate of the heights of all students in the university. Or we might wish to conclude
whether the males in the university are on the average taller than the females. The ability to make
such generalized conclusions, inferring characteristics of the whole from characteristics of its parts,
lies within the realm of inferential statistics.
1.2 Types of Biological Data
In biological and medical sciences, we analyse biological properties of living organisms that
are described on the basis of selected biological characters. These biological characters can
usually be measured by some means. Their values differ from one entity to another – therefore they are
called variables in statistics. Variables describe studied biological characters (properties of living
organisms usually) and they can quantify (more or less) these biological properties. Different kinds
of variables may be encountered by biologists, and it is desirable to be able to distinguish among
them. Variables can be quantitative or qualitative. Quantitative variables record the amount of
something (ordinal data and numerical data); qualitative variables describe the category to which
the data can be assigned and are therefore sometimes referred to as categorical data.
Biological variables differ in the exactness of their values – according to this exactness
we can distinguish between 3 types of biological data in statistics:
C. Numerical Data
They are represented by exact numeric values obtained by means of some objective
measurement (meter, thermometer, scale, measuring device etc.). Differences between various
degrees on the scale are uniform – the numerical scale consists of the same intervals. Numerical data represent the
highest level of quantification in statistical data – they are most often used for statistical evaluation.
Numerical variables allow us to record an amount for each observation and to compare the
magnitude of differences between them. They can be either continuous or discrete.
- Discrete Data (discontinuous)
Variables that can take only specific available values – most often integer numbers. For
example the number of bacterial colonies on a Petri dish can only be a positive integer
value; there can be 24 colonies, but never 24.5 colonies or -24 colonies. Similar examples include
the number of eggs laid, puppies in a litter, animals in a stable, patients, cells, etc.
- Continuous Data
These variables can take on any conceivable value from the infinite spectrum of real
numbers within the observed range (height, length, weight, volume, body
temperature, enzyme concentration, etc.).
Each category of statistical data has its own specific statistical methods used for
its examination. These methods differ in exactness according to the exactness of the data category.
Statistical methods used for numerical or ordinal data are more exact and generally are not
applicable to nominal data (since nominal data contain too little information for exact methods). The
reverse is possible: the less exact methods intended for nominal (or ordinal) data are also useful
for numerical data. In this case we can purposely use these less precise methods, e.g. for
preliminary analyses that must be performed quickly.
Sometimes the distinction between different types of data is not very obvious, e.g. when categories
fall into a natural order it is not reasonable to distinguish between ranked and categorical data.
Heights of students are continuous data. Of course, they may also be ranked: smallest,
next smallest, …, highest. If the height is categorized into three groups, <160 cm, 161-180 cm, >181 cm,
the values are still ordered, but we have lost a lot of information. There are so many ties that
analysis methods will be sensitive only to three ranks, one for each category. Although we could
analyse them as categories A, B, and C, the methods treating them as ranks first, second, and third
are still “stronger” methods. When rendering data into categories, one should note whether the
categories fall into a natural order. If they do, treat them as ranks. For example, categories of ethnic
groups do not fall into a natural order, but the pain categories severe, moderate, mild, and absent
do.
Note that we can always change data from higher to lower level of quantification, that is,
continuous to discrete to ranked to categorical, but not the other way. Thus, it is always preferable
to record data as high up on this sequence as possible; it can always be dropped lower.
By the word population, we denote the entire set of subjects about whom we want
information. Thus, the population also means "all items" (individuals) that could show the studied
variable. If we were to take our measurements on all individuals in the population, descriptive
statistics would give us exact information about the population. Populations are often very large
sets; the number of individuals in the population is considered to be "endless" (for statistical purposes
and calculations). In practice, the number of members in the population can be literally "endless" –
especially from the time viewpoint: e.g. body weights of all cattle in the CR, dogs in Europe, etc. –
the number of individuals is not fixed (it fluctuates since new members are born and others die).
Occasionally populations of interest may be relatively small, such as the ages of men who
have travelled to the moon or the heights of women who have swum the English Channel. If the
population under study is very small, it might be practical to obtain all the measurements in the
population. Generally, however, populations of interest are so large as to render the obtaining of all
the measurements unfeasible (it’s time-consuming, expensive, etc.). We are not able to obtain all
possible measurements from the population in practice – for example, we could not reasonably
expect to determine the body weight of each dog in Europe. What can be done in such cases is to
obtain a subset of all measurements in the population. This subset of measurements comprises
a sample, and from the characteristics of samples conclusions can be drawn about the characteristics
of the populations from which the samples come.
Often one samples a population that does not physically exist. Suppose an experiment is
performed in which a food supplement is administered to thirty piglets, and the sample data consist
of the growth rates of these thirty animals. Then the population about which conclusions might be
drawn is the growth rates of all the piglets that conceivably might have been administered the same
food supplement under identical conditions. Such a population is said to be “imaginary” and is also
referred to as “hypothetical” or “potential”.
of the sample, but the selection of any member of the population must in no way influence the
selection of any other member.
It is sometimes possible to assign each member of a population a unique number and to
draw a sample by choosing a set of such numbers at random. This is equivalent to having all
members of a population in a hat and drawing a sample from them while blindfolded. We should not
make a subjective choice when generating a random sample; instead we can use e.g. a table of random digits from
the statistical literature (e.g. Zar: Biostatistical Analysis), drawing lots for registration numbers of
animals in a stable, etc.
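As an illustration of drawing such a random sample with a computer instead of a printed table of random digits, here is a minimal Python sketch; the herd size, registration numbers and sample size are invented for the example.

```python
import random

# Hypothetical stable with 200 animals, registration numbers 1..200,
# from which a random sample of 30 animals is to be drawn.
population_ids = list(range(1, 201))   # assumed registration numbers
sample_size = 30                       # assumed sample size

random.seed(1)                         # fixed seed only to make the example reproducible
sample_ids = random.sample(population_ids, sample_size)  # sampling without replacement
print(sorted(sample_ids))
```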
Another requirement for a dependable generalization about certain characteristics is an
appropriate size of sample. The pattern of sample values (as well as sample descriptive
characteristics) gets closer in nature to the pattern of population values as the sample size gets
closer to the population size. This means that the bigger the sample, the better; however, there are
practical limits - not enough time, money, etc. Thus, compromises must often be made in
practice; in general, a sample size above 30 members is considered appropriate to give results of
calculations comparable to the population. However, samples consisting of e.g. only 10 individuals
may be sufficient in some cases in practice.
Fig. 1.1 Frequency Distribution – Discrete Data (Bar Graph)
When measuring continuous data, we create classes, i.e. equivalent intervals (categories) of
data, to simplify the situation. The number of classes should be appropriate to the sample size.
A certain number of values then falls into each defined interval (class). All the data in this
interval get the same value – the so-called midpoint (mean value) of the class. These values replace the
original values measured in all individuals of the monitored sample. In this way we obtain the
frequency of each class, i.e. the number of items (individuals) in the appropriate interval, and can draw
a chart of the frequency distribution for continuous data, as sketched in the example below.
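A minimal Python sketch of this class construction (the body weights and the class width of 2 kg are invented for illustration): each value is assigned to a class, the class frequency is counted, and the class midpoint stands in for the original values.

```python
# Hypothetical body weights (kg) and an assumed class width of 2 kg.
weights = [22.9, 23.3, 23.5, 23.9, 24.0, 24.5, 24.6, 24.8, 25.1,
           25.4, 25.5, 25.8, 26.1, 26.2, 26.3, 27.0, 27.3, 28.1]
class_width = 2.0
start = 22.0   # assumed lower edge of the first class

# Count how many values fall into each class and record the class midpoint.
frequencies = {}
for w in weights:
    index = int((w - start) // class_width)      # which class the value falls into
    midpoint = start + index * class_width + class_width / 2
    frequencies[midpoint] = frequencies.get(midpoint, 0) + 1

for midpoint in sorted(frequencies):
    print(f"class midpoint {midpoint:.1f} kg: frequency {frequencies[midpoint]}")
```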
Fig.1.2 represents the frequency distribution of continuous data (e.g. body weights).
Fig. 1.2 Frequency Distribution – Continuous Data (Histogram)
As you can see from the chart above, it is possible to use one more type of graphical
presentation of the frequency distribution in the sample - a polygon. A polygon is represented by a broken
line that joins the tops of the columns at the midpoints of the classes. The shape of the polygon is specific
to the unique sample used for our measurements, i.e. the shape of the polygon varies from
sample to sample. Therefore we can also use the term "empirical curve" for the polygon (from the
word empiria = experience).
Most biological variables (both discrete and continuous) possess a characteristic
property: frequencies in the middle of the sample (around the mean value) are the highest, and
frequencies of extremely small and large values in the sample are the lowest.
Fig. 1.3 Frequency (Probability) Distribution
The frequency distribution for the whole population is a statistical distribution that
determines the probability of occurrence of values of the studied variable; therefore we use the term
probability P(x) instead of frequency on the y-axis. The term frequency used with samples
represents an absolute scale (which is possible only in samples, since they have a definite number of
entities), whereas the term probability represents a relative scale (proportion of cases) that must be
used in populations, where the number of entities is infinite.
variables in biology and medicine can behave in another way – they can follow different probability
distributions whose curves have other shapes: asymmetrical, even extremely so (Fig. 1.5, 1.6),
or non-normal, irregular (Fig. 1.7):
(Fig. 1.5, 1.6: asymmetrical distributions; Fig. 1.7: non-normal, irregular distribution – in each graph, the x-axis shows values of the variable and the y-axis their probability)
1.4.3 Portions of Distribution
For each distribution we can define measures (quantiles) that divide a group of data -
population (displayed as the area under the curve of distribution) into 2 parts (portions):
- Values which are smaller than the quantile,
- Values which are larger than the quantile.
There are specific quantiles used for description of distributions in statistics:
The 50% quantile, $x_{0.5}$ (called the median), divides a group of data into 2 equal halves (Fig. 1.8).
Important quantiles and their corresponding proportions of the most common distribution
curves are tabulated in statistical tables and used as critical values in statistical hypotheses testing
(see Chapter 5) or as coefficients in calculations (see Chapter 4: confidence intervals of statistical
parameters – e.g. mean value μ, standard deviation σ).
Chapter 2
Descriptive Characteristics of Statistical Sets
The aim of statistical data evaluation: to get an image of monitored biological characters in
the whole population on the basis of data samples. At first we usually classify observed sample data
according to the measured values, arrange the variant sequences and draw graphs of frequency
distributions. These arranged data give us basic information about the sample and offer source
material for further statistical methods of data evaluation for the monitored biological characters.
A deeper analysis follows when we try to summarize the data information into one or several
numbers by means of specific, exactly defined parameters (statistical characteristics). We cannot
really determine the exact values of these parameters at the level of the whole population; therefore
we select a sample (or several samples) from the studied population and calculate the so-called
statistics from these sample data. These serve as estimates of the exact population parameters.
Several measures help to describe or characterize a population. For example, generally
a preponderance of measurements occurs somewhere around the middle of the range of a population
of measurements. Thus, some indication of a population “average” would express a useful bit of
descriptive information. Such information is called a measure of central tendency, and several such
measures (e.g. the mean and the median) will be discussed below. It is also important to describe
how dispersed the measurements are around the “average”. That is, we can ask whether there is
a wide spread of values in the population or whether the values are rather concentrated around the
middle. Such a descriptive property is called a measure of dispersion (e.g. the range, the standard
deviation, the variance etc.).
A quantity such as a measure of central tendency or a measure of dispersion is called
a parameter when it describes or characterizes a population, and we shall be very interested in
discussing parameters and drawing conclusions about them when studying a biological character in
the population. However, one seldom has data for entire populations, but nearly always has to rely
on samples to arrive at conclusions about populations. Thus, as mentioned above, one rarely is able
to calculate the true exact parameters. However, by random sampling of populations, parameters
can be estimated very well by means of special statistical methods (see the chapter Estimation of
population parameters). Using these statistical methods, we can determine so-called confidence
intervals for population parameters or calculate estimates of population parameters, called
statistics (on the basis of sample data). It is statistical convention to represent the true population
parameters by Greek letters and sample statistics by Latin letters.
The most often used measures of central tendency are the mean, the median, and
the mode. The most often used measures of dispersion and variability of statistical sets
are the range, the variance, the standard deviation, the coefficient of variability, and the
standard error of the mean.
2.1 Measures of Central Tendency
The most widely used measure of central tendency is the arithmetic mean, usually referred
to simply as the mean, which is the measure most commonly called an "average" (the term
"average" is used predominantly for the sample statistic, whereas the term "mean" is most often used
for the exact population parameter).
Each measurement in a population may be referred to as a value $x_i$. The subscript i may be
any integer value from 1 up through N, the total number of values of X in the population.
The calculation of the population mean μ (the theoretical exact parameter):

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
The most efficient and unbiased estimate of the population mean μ is the sample mean,
denoted as $\bar{x}$ (read as "x bar"). Whereas the size of the population (which we generally do not
know) is denoted as N, the size of a sample is indicated by n (the definite number of members in
a specific sample used for measurements).
The calculation of the sample average $\bar{x}$:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Properties of the mean:
- The mean is affected by extreme values in the set (when one value $x_i$ changes, the arithmetic mean
changes as well). The average is the correct measure of the central tendency of a sample only if
the sample is homogeneous enough in its values (it should be used only in homogeneous, regular
(Gaussian) distributions). Otherwise, especially in small samples, the average can be
biased by possible extreme values in the sample and does not represent the correct
measure of the centre of the sample (likewise in irregular distributions).
- It has the same units of measurement as the individual observations.
- $\sum (x_i - \bar{x}) = 0$: the sum of all deviations from the mean will always be 0.
The median is typically defined as the middle measurement in an ordered set of data
(ordered in an ascending or descending row). That is, there are just as many values larger than the
median as there are smaller ones. The sample median is the best estimate of the population median.
In a symmetrical distribution the sample median is also an unbiased estimate of μ, but it is not as efficient
a statistic as $\bar{x}$ and should not be used as a substitute for $\bar{x}$. If the frequency distribution is
asymmetrical, the median is a poor estimate of the mean.
The median of a sample of data may be found by first arranging the measurements in order
of magnitude. Then the middle value of this row is the median.
In larger samples we can find out which datum in the ranked sample data is the median by
calculating the rank of this figure: $\frac{n+1}{2}$ (this can be applied to the centre of any ordered row of n values
in general).
- If the sample size (n) is odd, there is only 1 middle value (the rank will be an integer), and it
indicates which datum in the ordered sample is the median.
- If n is even, the rank of the median is a half-integer and it indicates that there are two middle
values; the median is the midpoint (mean) between them.
Properties of median:
- Median is not affected by extreme values in the sample;
- Median = 50% quantile (it divides the sample data into 2 halves: values that are smaller than the
median and values that are larger than the median);
- It may be used in irregular (asymmetric) distributions – in this case the median is a better
characteristic of the middle of the set than the average.
The mode is commonly defined as the most frequently occurring measurement in a set of
data (the value with the highest frequency). Mode always indicates the top of the distribution curve.
A distribution with two modes (two tops) is said to be bimodal and may indicate a combination of
two distributions with different modes (e.g. heights of men and women). The sample mode is the
best estimate of the population mode. When we sample a symmetrical unimodal population, the
mode is an unbiased estimate of the mean and the median as well, but it is relatively inefficient and
should not be so used.
The mode is a somewhat simple measure of the central tendency and it is not often used in
biological and medical research, although it is often interesting to report the number of modes
detected in a population, if there are more than one.
Properties of the mode:
- Mode is not affected by extreme values in the sample.
- It is not a very exact measure of the middle of the set.
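The three measures of central tendency can be illustrated with a short Python sketch; the litter-size data below are invented for the example, and the standard statistics module is used.

```python
import statistics

# Invented sample of litter sizes (discrete data).
data = [6, 8, 7, 9, 8, 10, 7, 8, 11, 6, 8]

mean = sum(data) / len(data)        # arithmetic mean (average)
median = statistics.median(data)    # middle value of the ordered data, rank (n + 1)/2
mode = statistics.mode(data)        # most frequently occurring value

print(f"mean = {mean:.2f}, median = {median}, mode = {mode}")
```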
The mean value indicates only the centre of a set but does not indicate the dispersion of values
around the centre (how much the values are scattered). For this purpose we use measures of
variability that describe this dispersion of values around the centre of the set and also determine
the reliability of the mean value of the set - the reliability will be larger in samples that have similar
(homogeneous) values and no extreme values. Measures of variability of a population are exact
parameters of the population, and the sample measures of variability that estimate them are
statistics.
2.2.1 The Range
The difference between the highest and the lowest value in the set of data is termed the
range. If sample measurements are arranged in increasing order of magnitude, as if the median were
about to be determined, then sample range R is calculated:
R = xmax – xmin
If we want to express variability in terms of deviations from the mean, there will be
a difficulty: as the sum of all deviations from the mean, i.e. $\sum (x_i - \bar{x})$, will always equal zero,
such a summation would be useless as a measure of dispersion and variability. Summing the absolute
values of the deviations from the mean results in a quantity that is an expression of dispersion about
the mean. Dividing this quantity by n yields a measure known as the mean deviation (or the mean
absolute deviation) of the sample. This measure has the same units as do the data, but it is not very
often used as a measure of dispersion and variability in practice.
Another method of eliminating the signs of the deviations from the mean is to square the
deviations. The sum of squares of the deviations from the mean is called the sum of squares,
abbreviated SS, and is defined as follows:
population: $SS = \sum (x_i - \mu)^2$
sample: $SS = \sum (x_i - \bar{x})^2$
By means of the term "SS" we can define other measures of variability in the set of data:
The variance is defined as the mean sum of squares of deviations about the mean value of the data.
Sometimes this measure is also called the mean square – short for mean squared deviation.
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \qquad \text{(population variance)}$$

The best estimate of the population variance, σ², is the sample variance, s²:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \qquad \text{("estimated" variance, used for samples)}$$
The variance expresses the same type of information as does the mean deviation, but it has
certain very important properties relative to probability and hypothesis testing that make it distinctly
superior. Thus, the mean deviation is very seldom encountered in biostatistical analysis.
The calculation of s2 can be tedious for large samples, but it can be facilitated by the use of
the equality:
$$s^2 = \frac{\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{n}}{n-1}$$
This formula is much simpler to work with in practice; therefore it is often referred to as
a "working formula" or "machine formula" for the sample variance. There are, in fact, two major
advantages in calculating s² by this equation rather than by the previous (original) formula
for the "estimated variance". First, fewer computational steps are involved, a fact that decreases the chance
of error. On many calculators the summed quantities, $\sum x_i$ and $\sum x_i^2$, can both be obtained with only
one pass through the data, whereas the original formula for the "estimated variance" requires one pass
through the data to calculate $\bar{x}$ and at least one more pass to calculate and sum the squares of the
deviations, $x_i - \bar{x}$. Second, there may be a good deal of rounding error in calculating each deviation
$x_i - \bar{x}$, a situation that leads to decreased accuracy in computation, but which is avoided by the use
of the latter formula above.
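A small sketch (with invented measurements) confirming that the original definition of s² and the working formula give the same result:

```python
# Invented measurements.
x = [4.2, 5.1, 3.9, 4.8, 5.5, 4.4, 4.9]
n = len(x)
mean = sum(x) / n

# Original ("estimated variance") formula: sum of squared deviations / (n - 1).
ss = sum((xi - mean) ** 2 for xi in x)
s2_definition = ss / (n - 1)

# Working ("machine") formula: needs only the sums of x and of x squared.
sum_x = sum(x)
sum_x2 = sum(xi ** 2 for xi in x)
s2_working = (sum_x2 - sum_x ** 2 / n) / (n - 1)

print(s2_definition, s2_working)   # the two results agree (apart from rounding)
```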
The variance has the square of the units of the original measurements. If measurements are in grams,
their variance will be in grams squared, or if the measurements are in cubic centimetres, their
variance will be in terms of cubic centimetres squared, even though such squared units have no
physical interpretation.
The standard deviation is the positive square-root of the variance; therefore, it has the same
units as the original measurements.
Thus, the formula for the population standard deviation is:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \qquad \text{or} \qquad \sigma = \sqrt{\frac{\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{N}}{N}}$$

and for the sample standard deviation:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \qquad \text{or} \qquad s = \sqrt{\frac{\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{n}}{n-1}}$$
Remember that the standard deviation is, by definition, always a nonnegative quantity. The
standard deviation and the mean shall be reported with the same number of decimal places.
The coefficient of variability (V) is calculated as:

$$V = \frac{s}{\bar{x}} \qquad \text{or} \qquad V = \frac{s}{\bar{x}} \cdot 100\ [\%]$$
The coefficient of variability expresses sample variability relative to the mean of the sample;
because s and $\bar{x}$ have identical units, V has no units at all, a fact emphasizing that it is a relative
measure, divorced from the actual magnitude or units of measurement of the data.
The coefficient of variability of a sample, namely V, is an estimate of the coefficient of
variability of the population from which the sample came (i.e., an estimate of σ/μ).
The population standard error of the mean is the theoretical standard deviation of all sample
means of size n that could be drawn from a population. The sample standard error of the mean can
be used as a measure of the precision with which the sample mean $\bar{x}$ estimates the true population
mean μ.
We don’t know what the true mean value in the population is. We can only estimate it by
means of the sample average. But we don’t know how precise our calculation is and what’s the
difference between our calculated sample AVG and the true population mean .. SEM may serve as
a measure of precision of the calculated sample mean.
The value of the standard error of the mean depends on both the population variance (σ²) and the
sample size (n):

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
Since in general we don’t know the population variance, the best estimate for the population
standard error of the mean in practice (sample standard error of the mean) is calculated as:
$$s_{\bar{x}} = \frac{s}{\sqrt{n}}$$
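As a brief sketch (with invented body weights), the standard deviation, the coefficient of variability and the sample standard error of the mean can be computed directly from the formulas above:

```python
import math

# Invented sample of body weights (kg).
x = [25.8, 24.6, 26.1, 22.9, 25.1, 27.3, 24.0, 24.5]
n = len(x)
mean = sum(x) / n

s2 = sum((xi - mean) ** 2 for xi in x) / (n - 1)   # sample variance
s = math.sqrt(s2)                                  # sample standard deviation
v = s / mean * 100                                 # coefficient of variability [%]
sem = s / math.sqrt(n)                             # sample standard error of the mean

print(f"s = {s:.2f} kg, V = {v:.1f} %, SEM = {sem:.2f} kg")
```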
The sample standard error of the mean is useful for construction of the confidence interval
for the mean. The true mean value of the population will lie within the interval given by the sample
average ± SEM (see Chapter 4 for details).
Chapter 3
Distributions Commonly Used in Statistics
(Continuous Data)
In Chapter 1, we saw how frequency distributions arise from sample data and that the
population distribution, arising from sampling the entire population, becomes the probability
distribution. This probability distribution is used in the process of making statistical inferences
about population characteristics on the basis of sample information. There are, of course, endless
types of probability distributions possible. However, luckily, the great majority of statistical
methods for continuous data use only several probability distributions.
The probability distributions most often used in statistics for continuous data are the normal,
the non-normal, Student's t, Pearson's χ² (chi-square), and the Fisher-Snedecor F distribution. Rank-
order methods depend on distributions of ranks rather than continuous data, but several of them use
the normal or chi-square distribution. Categorical data depend mostly on the chi-square, with larger samples
transformed to normal. We need to become familiar with these commonly used distributions to
understand most of the methods given in this text. The following paragraphs describe these
distributions and some of their properties needed to use and interpret statistical methods.
Some of these distributions are useful with population variables and some are useful with
sample variables.
Fig. 3.1 Graph of Gaussian Normal Distribution
The curve is bell-shaped and symmetrical; the majority of values is located around the mean
(centre of symmetry), with progressively fewer observations toward the extreme values. At the
extremes, the curve does not terminate – it theoretically approaches the x-axis at infinity (both
+ and − infinity).
The shape of the normal curve is fully described by means of two parameters - μ and σ:
μ (Mean value) – "Parameter of location" – it describes the centre of symmetry and also the
location of the curve on the x-axis.
σ (Standard deviation) – "Parameter of dispersion (variability)" – it describes the spread of
the curve, given by the inflexion point (where the flexure changes from convex to concave).
The spread of the curve determines the variability of the biological character monitored in the
population.
The whole area under the curve represents all individuals in the population (100%); then:
Within the range μ ± 1σ: there are 68.3 % of all values (individuals) in the population,
Within the range μ ± 2σ: there are 95.5 % of all values (individuals) in the population,
Within the range μ ± 3σ: there are 99.7 % of all values (individuals) in the population.
The occurrence of the remaining values (0.3%) at both extreme ends of the x-axis is so
improbable that such extreme values are considered to be errors of measurement in statistical
terms. These percentages can also be verified numerically, as in the sketch below.
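A minimal sketch of such a check, using the cumulative distribution function of the standard normal (the scipy library is assumed to be available):

```python
from scipy.stats import norm

# Proportion of a normal population lying within mu +/- k*sigma, for k = 1, 2, 3.
for k in (1, 2, 3):
    proportion = norm.cdf(k) - norm.cdf(-k)
    print(f"within ±{k} sigma: {proportion * 100:.2f} %")   # 68.27 %, 95.45 %, 99.73 %
```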
3.1.2 Standard Normal Distribution
Any value X of a normally distributed variable can be transformed (standardized) as:

$$Z = \frac{X - \mu}{\sigma}$$
The normal distribution then becomes the standard normal, which has a mean of 0
and a standard deviation always equal to 1. Units of the new standardized variable Z on the x-axis
express the number of standard deviations away from the mean (zero value), positive for above the
mean and negative for below the mean. It means that standardized variable Z represents
a dimensionless quantity, it is a relative measure, divorced from the actual magnitude or units of
measurements of data.
This transformation is usually made in practice because the available probability tables are
usually for the standard normal curve. Appendix 1, in the back of the textbook, contains selected
values of z with four areas that are often used: (a) the area under the curve in the positive tail for
a given z, i.e., one-tailed α; (b) the area under all except that tail, i.e., 1 − α; (c) the areas combined for
both positive and negative tails, i.e., two-tailed α; and (d) the area under all except the two tails,
i.e., 1 − α. (See Statistical tables, Appendix 1.)
The graphical description of the Standard normal distribution is in the Figure 3.2.
Axis x: values of standardized variable Z,
Axis y: probability of values Z.
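A short sketch of the standardization and of the four tabulated areas for one value (the population parameters and the measured value are invented; scipy is assumed):

```python
from scipy.stats import norm

mu, sigma = 37.5, 0.4   # assumed population mean and standard deviation
x = 38.3                # one assumed measured value

z = (x - mu) / sigma    # standardized value: number of SDs above the mean

one_tail = norm.sf(z)            # (a) area in the positive tail beyond z (one-tailed alpha)
rest_one = 1 - one_tail          # (b) area under all except that tail
two_tail = 2 * norm.sf(abs(z))   # (c) both tails combined (two-tailed alpha)
rest_two = 1 - two_tail          # (d) area under all except the two tails

print(f"z = {z:.2f}")
print(one_tail, rest_one, two_tail, rest_two)
```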
3.1.3 Non-normal Distribution
Some of the variables monitored in biological and medical sciences do not follow the Gaussian
normal distribution – they usually have a variously irregular shape of the probability distribution
curve, often asymmetrical or with 2 or more peaks. Such curves are most often called "non-
normal" or "unknown", because due to their irregularity it is not possible to describe the shape of the
curve in a very exact way.
The example of graphical description of the non-normal distribution is in the Figure 3.3.
Axis x: values of monitored biological character,
Axis y: probability of occurrence of these values in population.
The curves of the non-normal probability distribution have shapes that can be variously
irregular; therefore it is impossible to use exact parameters that would determine the centre and
spread of the data (as was possible for the Gaussian normal distribution). Only one descriptive
characteristic, the median, is usually used for such non-normal distributions. The median is
considered the centre of such an irregular curve. Since the median is defined as the 50% quantile, it
divides the whole area under the curve into 2 equal halves regardless of the shape of the probability
distribution. Because of the curve's irregularity, it is not possible to determine its spread (the
variability of the monitored character).
(The t-distribution was published in 1908 by the English chemist W. S. Gosset under the pseudonym
"Student", because the policy of his employer, the Guinness Brewery, forbade publication.)
The t-distribution looks like the standard normal curve for the variable $Z = \frac{X - \mu}{\sigma}$; however, it is a little
fatter because it uses s instead of the accurate σ. Whereas the normal is a single distribution, t is
a family of curves. In the figure 3.4, two t distributions are superposed on a standard normal
distribution. The particular member of the t family depends on the sample size or, more exactly, on
the degrees of freedom (see below).
The t-distribution reflects the error of samples (when compared to the population sampled) that is
evident in all statistical calculations performed on the basis of such samples. This error of the
sample is caused by the small number of members in the sample, and generally we can say that the
smaller the sample, the more erroneous the calculations performed on its basis.
The shape of the t-distribution curve is similar to the standard Normal distribution (bell-shaped,
symmetrical about the value 0), but the spread of the curve is specific for different samples according to
the sample size – or, more exactly, according to the Degrees of Freedom (DF, ν) of the sample
monitored: ν = n − 1.
It is obvious from the graph of the t-distribution shown above that:
- The smaller the sample size, the broader and lower the curve,
- The larger the sample size, the narrower and higher the curve.
In the extreme case of endless expansion of a sample, n = ∞, the curve coincides with the
Normal distribution that describes the whole population (such a sample would have no error in statistical
calculations in comparison with the population).
In small samples (which have a large error in comparison with the population), the shape of the
curve also differs considerably from the shape of the Normal distribution (used for the population).
We can define the exact spread of the t-distribution curve by the ratio ν/(ν − 2).
Values of t-distribution are tabulated in statistical tables (See the statistical tables -
Appendix 2: Critical values for Student’s t-distribution) and they can be used in statistical
calculations as e.g.:
- Critical values in testing for difference between two means (see Chapter 6: Student t-test),
- Coefficients in calculations of confidence intervals for mean values (see Estimation of
population parameters).
We look up the critical values in the tables of the t-distribution according to the degrees of
freedom calculated for our sample (ν = n − 1) and also according to an error α chosen to specify the
exactness of our statistical calculations - for biological data, α = 5% is commonly used, or 1%
(when we need more precise calculations).
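The tabulated critical values can also be obtained from software; a minimal sketch using scipy (two-tailed critical values, which is the usual form of the t tables):

```python
from scipy.stats import t

n = 25        # assumed sample size
df = n - 1    # degrees of freedom, nu = n - 1

for alpha in (0.05, 0.01):
    # two-tailed critical value = the 1 - alpha/2 quantile of the t-distribution
    t_crit = t.ppf(1 - alpha / 2, df)
    print(f"alpha = {alpha}: critical t({df} DF) = {t_crit:.3f}")   # about 2.064 and 2.797
```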
Fig. 3.5 Graph of the Chi-square (χ²) distribution
The chi-square distribution has an asymmetrical, right-skewed curve (it rises from 0 rapidly to
a mode and then tails off slowly in a skew to the right) and it has different shapes for different
sample sizes (ν = n − 1 determines the shape).
It is obvious from the graph of the χ² distribution shown above that:
- the smaller the sample size, the higher and more asymmetrical the curve shape,
- the larger the sample size, the lower and more symmetrical the curve shape.
Values of the chi-square distribution are tabulated in statistical tables (see tables of the χ²
distribution: Appendix 3, 4) and they can be used in statistical calculations e.g. as:
- Critical values in testing for difference between frequencies in samples (see the χ²-test),
- Coefficients used in calculations of confidence intervals for standard deviation (see
Estimation of population parameters).
probability. To do this, we use their ratio, dividing the bigger by the smaller. The probability
distribution for this ratio is called F, named (by George Snedecor) after Sir Ronald Fisher, the
greatest producer of practical theories in the field of statistics.
The F-distribution looks very much like the chi-square distribution (indeed it is the ratio of
two independent chi-square-distributed variables), as can be seen in figure 3.6. F distribution has
one more complication than chi-square: because it involves two samples, the degrees of freedom for
each sample must be used. The shape of the F-distribution curve is then determined by the degrees of
freedom of the 2 samples (ν₁ and ν₂) that are used in testing (e.g. when we want to test differences
between the variability of a monitored biological character in two groups of animals: see Chapter 6,
F-test, for details).
The graphical description of the F-distribution is in the Figure 3.6.
Axis x: values of variable F,
Axis y: probability of values F.
As follows from the figure, the curve is asymmetrical: it rises from 0, runs up quickly
and then tails off slowly in a skew to the right toward higher values on the x-axis. The shape of
the curve changes according to the sample sizes (more exactly, according to ν₁ and ν₂).
It is obvious from the graph of the F-distribution shown above that:
- The smaller the sample size, the lower and more asymmetrical the curve shape,
- The larger the sample size, the higher and more symmetrical the curve shape.
Values of F-distribution are tabulated in statistical tables (See tables of the F-distribution:
Appendix 5) and they are most often used as critical values in testing for differences between 2
sample variances (see F-test).
Chapter 4
Estimation of population parameters
(Confidence Intervals)
Having obtained a random sample from a population of interest, we are ready to use
information from that sample to estimate the characteristics of the underlying population. If you are
willing to assume that the sample was drawn from a normal distribution, summarize the data with the
sample mean and sample standard deviation, which are the best estimates of the population mean and
population standard deviation, because these two parameters completely define the normal
distribution. When there is evidence that the population under study does not follow a normal
distribution, summarize the data with the median, the only descriptive characteristic used for a
non-normal distribution.
Although sample statistics are the best estimates of true population parameters, they are still
only estimates. Therefore, it is appropriate to determine confidence intervals for true population
parameters that allow us to express the precision of the estimates based on the sample data.
A confidence interval is an interval about an estimate, based on its probability distribution, that
expresses the confidence, or probability, that that interval contains the true population parameter
being estimated.
4.1.1 Confidence Interval for the Mean Value
The calculation of confidence interval for the mean consists of determination of confidence
limits L1 (lower) and L2 (upper). The limits will be symmetrical around the sample average $\bar{x}$; the true
mean value μ will lie within the interval restricted by the limits L1, L2.
For determination of the limits L1, L2 we need to know the standard error of the mean
$s_{\bar{x}}$ (SEM, SE) = a measure of the precision with which the sample average $\bar{x}$ estimates the true
population mean μ.
If we try to estimate the true mean value μ by means of several sample averages, we will
see that every average is slightly different (caused by the variability of individuals in the samples) – but all of
them estimate the same true μ. The question is: which average is the best one? We need some
measure to specify its precision – the SEM. This statistic quantifies the certainty with which the mean
computed from a random sample estimates the true mean of the population from which the sample
was drawn.
Calculation formula for standard error of the mean:
$$SEM = s_{\bar{x}} = \frac{s}{\sqrt{n}}$$
$$L_{1,2} = \bar{x} \pm s_{\bar{x}} \cdot t_{\alpha,\nu}$$
$\bar{x}$ – sample mean
$s_{\bar{x}}$ – standard error of the mean
$t_{\alpha,\nu}$ – confidence coefficient = critical value of the t-distribution (Appendix 2, Critical values for
Student's t-distribution), determined according to the selected error α and DF: ν = n − 1.
Example:
Calculate confidence intervals for the mean of body weights of piglets at the 95% and 99%
confidence level.
Body weights (in kg) of the sample of 25 piglets:
xi : 25.8, 24.6, 26.1, 22.9, 25.1, 27.3, 24.0, 24.5, 23.9, 26.2, 24.3, 24.6, 23.3, 25.5, 28.1, 24.8, 23.5,
26.3, 25.4, 25.5, 23.9, 27.0, 24.8, 22.9, 25.4.
Method:
1) Calculation of the sample mean and SEM:
Mean: $\bar{x}$ = 25.0 kg
SEM: $s_{\bar{x}}$ = 0.27 kg
2) Calculation of the confidence limits $L_{1,2} = \bar{x} \pm s_{\bar{x}} \cdot t_{\alpha,\nu}$ (ν = 24):
at the 95% level: t(0.05, 24) = 2.064, so L1,2 = 25.0 ± 0.27 × 2.064 = 25.0 ± 0.56 kg,
at the 99% level: t(0.01, 24) = 2.797, so L1,2 = 25.0 ± 0.27 × 2.797 = 25.0 ± 0.76 kg.
3) Conclusion:
The true mean value for the population of body weights in piglets lies within the confidence
intervals: 25.0 ± 0.56 kg (at the 95% confidence level),
25.0 ± 0.76 kg (at the 99% confidence level).
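The calculation above can be reproduced in a few lines of Python; a sketch using the piglet weights from the example (scipy is assumed for the t quantiles):

```python
import math
from scipy import stats

weights = [25.8, 24.6, 26.1, 22.9, 25.1, 27.3, 24.0, 24.5, 23.9, 26.2,
           24.3, 24.6, 23.3, 25.5, 28.1, 24.8, 23.5, 26.3, 25.4, 25.5,
           23.9, 27.0, 24.8, 22.9, 25.4]

n = len(weights)
mean = sum(weights) / n
s = math.sqrt(sum((w - mean) ** 2 for w in weights) / (n - 1))
sem = s / math.sqrt(n)

for alpha in (0.05, 0.01):
    t_crit = stats.t.ppf(1 - alpha / 2, n - 1)   # confidence coefficient t(alpha, nu)
    half_width = t_crit * sem
    print(f"{(1 - alpha) * 100:.0f}% CI: {mean:.1f} ± {half_width:.2f} kg")
# Prints roughly 25.0 ± 0.55 kg and 25.0 ± 0.75 kg; the tiny difference from the
# hand calculation above comes from rounding the SEM to 0.27 there.
```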
The calculation of confidence interval for the population standard deviation consists of
determination of confidence limits L1 (lower) and L2 (upper).
The confidence limits L1, L2 of the confidence interval around the estimate of the standard
deviation (s) must be calculated separately, since the interval is not symmetrical; that is, the
distance from L1 to s is not the same as the distance from s to L2.
$$L_1 = \sqrt{\frac{(n-1)\,s^2}{\chi^2_{1-\alpha/2}(\nu)}} \qquad\qquad L_2 = \sqrt{\frac{(n-1)\,s^2}{\chi^2_{\alpha/2}(\nu)}}$$
s² – sample variance
n – sample size
Confidence coefficients (= critical values of the χ² distribution):
Quantile $\chi^2_{\alpha/2}$ – left-tail value → denominator of the upper limit of the interval,
Quantile $\chi^2_{1-\alpha/2}$ – right-tail value → denominator of the lower limit of the interval.
(See Appendix 3 and 4: Critical values for the χ² distribution, Right tail, Left tail)
Fig. 4.1 Left-tail and right-tail values of the χ²-distribution (quantiles $\chi^2_{\alpha/2}$, $\chi^2_{1-\alpha/2}$)
Example:
Calculate confidence intervals for the standard deviation of body temperatures of twenty
five intertidal crabs placed in air at 24.3°C at the 95% confidence level.
Body temperatures (measured in °C): 25.8, 24.6, 26.1, 22.9, 25.1, 27.3, 24.0, 24.5, 23.9,
26.2, 24.3, 24.6, 23.3, 25.5, 28.1, 24.8, 23.5, 26.3, 25.4, 25.5, 23.9, 27.0, 24.8, 22.9, 25.4.
Method:
1) Calculation of sample variance:
s² = 1.80 (°C)²
2) Calculation of the confidence intervals:
At the 95% confidence level: ν = 24; left-tail value $\chi^2_{\alpha/2}$(24) = 12.40, right-tail value $\chi^2_{1-\alpha/2}$(24) = 39.36
(See Appendix 3 and 4: Critical values for 2 distribution, Right tail, Left tail)
$$L_1 = \sqrt{\frac{(25-1)\cdot 1.80}{39.36}} = \sqrt{\frac{43.20}{39.36}} = \sqrt{1.10} = 1.05\ ^{\circ}\mathrm{C}$$

$$L_2 = \sqrt{\frac{(25-1)\cdot 1.80}{12.40}} = \sqrt{\frac{43.20}{12.40}} = \sqrt{3.48} = 1.87\ ^{\circ}\mathrm{C}$$
3) Conclusion:
The true standard deviation for population of body temperatures of intertidal crabs
placed in air at 24.3°C lies within the confidence interval that is restricted by limits: L1 = 1.05°C
and L2 = 1.87°C (calculated at the 95% confidence level).
(Note that the confidence limits are not symmetrical around s; that is, the distance
from L1 to s is not the same as the distance from s to L2).
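A sketch reproducing this calculation with the same temperature data (scipy is assumed for the χ² quantiles):

```python
import math
from scipy import stats

temps = [25.8, 24.6, 26.1, 22.9, 25.1, 27.3, 24.0, 24.5, 23.9, 26.2,
         24.3, 24.6, 23.3, 25.5, 28.1, 24.8, 23.5, 26.3, 25.4, 25.5,
         23.9, 27.0, 24.8, 22.9, 25.4]

n = len(temps)
mean = sum(temps) / n
s2 = sum((ti - mean) ** 2 for ti in temps) / (n - 1)   # sample variance, about 1.80 (°C)²

alpha = 0.05
df = n - 1
chi2_right = stats.chi2.ppf(1 - alpha / 2, df)   # right-tail critical value, about 39.36
chi2_left = stats.chi2.ppf(alpha / 2, df)        # left-tail critical value, about 12.40

L1 = math.sqrt((n - 1) * s2 / chi2_right)        # lower limit, about 1.05 °C
L2 = math.sqrt((n - 1) * s2 / chi2_left)         # upper limit, about 1.87 °C
print(f"95% CI for sigma: {L1:.2f} °C to {L2:.2f} °C")
```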
When the population under study does not follow the Gaussian normal distribution, then only
one descriptive characteristic, the median, can be used to describe such a non-normal
distribution. Because of the irregularity of the distribution curve, it is not possible to determine the
spread of the distribution, i.e. the variability of the monitored data set.
The true exact median of the population under study cannot be calculated in practice, so we
use the imprecise sample median as an estimate of the true parameter. The population median $\tilde{\mu}$ is
estimated by means of the sample median $\tilde{x}$.
Although the sample median $\tilde{x}$ is the best estimate of the population median $\tilde{\mu}$, it is still only an estimate.
Therefore, it is useful to also calculate the confidence interval for $\tilde{\mu}$, which allows us to express the
precision of the estimate.
Example:
Calculate the confidence interval for the true population median of body weights (kg) in the
sample of 14 dogs of a particular breed:
Measured values (in kg): 14.1, 16.4, 16.8, 14.3, 12.3, 14.9, 15.3, 12.8, 15.6, 13.5, 16.0, 16.2,
17.1, 17.0
Method:
1) According to the sample size and the selected α we find in statistical tables the ranks for L1, L2:
n = 14
α = 0.05
Tab. 4.1 Ranks for Confidence Limits for the Median (part of the table, α = 0.05)
Tab. 4.2 Sample data (n = 14) with marked confidence limits for the median
x1 = 12.3
x2 = 12.8
x3 = 13.5 ← L1
x4 = 14.1
x5 = 14.3
x6 = 14.9
x7 = 15.3
x8 = 15.6
x9 = 16.0
x10 = 16.2
x11 = 16.4
x12 = 16.8 ← L2
x13 = 17.0
x14 = 17.1
3) Conclusion:
The true median of body weights in dogs of the particular breed will lie within the confidence
interval restricted by the limits L1 = 13.5 kg and L2 = 16.8 kg (calculated at the 95% confidence level).
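The tabulated ranks can be derived from the binomial distribution with p = 0.5 (each observation falls below the true median with probability one half). The following sketch is one common way of obtaining such ranks and reproduces the limits of this example; it is not necessarily the exact construction behind Tab. 4.1 (scipy is assumed).

```python
from scipy.stats import binom

weights = [14.1, 16.4, 16.8, 14.3, 12.3, 14.9, 15.3, 12.8, 15.6, 13.5,
           16.0, 16.2, 17.1, 17.0]
alpha = 0.05
n = len(weights)
ordered = sorted(weights)

# Largest lower rank l such that P(X <= l - 1) <= alpha/2 for X ~ Binomial(n, 0.5);
# the upper rank is then n + 1 - l.
lower_rank = 1
while binom.cdf(lower_rank - 1, n, 0.5) <= alpha / 2:
    lower_rank += 1
lower_rank -= 1
upper_rank = n + 1 - lower_rank

L1 = ordered[lower_rank - 1]   # value at the lower rank (1-based)
L2 = ordered[upper_rank - 1]   # value at the upper rank
print(lower_rank, upper_rank, L1, L2)   # for n = 14, alpha = 0.05: ranks 3 and 12 -> 13.5 and 16.8
```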
Chapter 5
Statistical Hypotheses Testing
This hypothesis is called a null hypothesis (H0) – it expresses the concept of “no
difference”.
For example:
H0: μ = const. (e.g. the population mean is equal to a certain value known about the studied
population – e.g. physiological values of some biochemical indices)
μ1 = μ2 (2 populations have the same mean value)
σ1² = σ2² (2 populations have the same variance)
If it is concluded (through a statistical test – see below) that it is likely that a null hypothesis
is false, then an alternate hypothesis (abbreviated HA) is assumed to be true. HA denies H0, so for
examples above:
HA: μ ≠ const.
μ1 ≠ μ2
σ1² ≠ σ2²
One states a null hypothesis and an alternate hypothesis for each statistical test performed,
and all possible outcomes are accounted for by this pair of hypotheses.
It must be emphasized that statistical hypotheses are to be stated before data are collected to
test them. A statement of hypotheses after examination of data can devalue a statistical test. One
may, however, legitimately formulate hypotheses after inspecting data if a new set of data is then
collected with which to test the hypotheses.
E.g.: we need to find out whether a vitamin supplement in food causes an increase of body weight
in piglets.
We set up an Experiment:
Group1 of piglets (Test sample) gets the vitamin supplement in food
Group2 of piglets (Control sample) gets standard food
After some period we measure the body weight in both groups of animals and may find,
e.g., that the test sample has a mean $\bar{x}_1$ which is higher than the mean of the control group, $\bar{x}_2$. We
have to decide (through a statistical test) whether the difference between the sample means is only
random (caused by variability of animals) – or whether it is big enough to conclude that population
means are different as well. It would mean that the difference was caused by our experimental
activity (we can say that this experimental activity is generally effective).
In this case we can reject the null hypothesis (μ1 = μ2), which means that the alternate
hypothesis is true: (μ1 ≠ μ2).
Conclusion in practice (for this particular experiment): the statement that “the increase of
body weight is caused by the vitamin supplement” is generally true (the increase of the body
weight is not a random effect).
Different variables are used as the test statistics in various statistical tests e.g.:
t – Testing for difference between 2 means (t-test),
F – Testing for difference between 2 variances (F-test),
χ² – Testing for difference between 2 frequencies (χ²-test).
The procedure of each of the statistical tests consists in a test statistic calculation – then we
determine if the calculated value of the test statistic exceeds some critical value of the test statistic.
When the calculated test statistic (in absolute value) exceeds the critical value, then the null
hypothesis is rejected. Otherwise, the null hypothesis is accepted. Fig. 5.1 shows an example of
critical value for the test statistic t (t distribution).
This critical value is associated with a particular probability threshold (P-value) that is used
as a criterion for rejection of H0 in the test and that is called the significance level, denoted by α. In
fact, the critical value is usually the 1-α/2 quantile of the appropriate distribution used as the test
statistic. As explained above, a probability of 5% or less is commonly used as the criterion for
rejection of H0 in biological and medical data testing. It means that the calculated test statistic (in
absolute value) has to exceed the critical value at the α = 0.05 level of significance to obtain
a statistically significant outcome of the test (denoted as P < 0.05 usually). When the calculated test
statistic exceeds the critical value at the α = 0.01 level of significance, we obtain a statistically
highly significant outcome of the test (denoted as P < 0.01 usually). In the case, when the calculated
test statistic does not exceed the critical value at the α = 0.05 level of significance, we obtain
a statistically not significant outcome of the test (denoted as P > 0.05 usually). Following Figure 5.2
demonstrates the aforesaid possible outcomes of the test statistic t.
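A small sketch of this decision rule (the calculated statistic and its degrees of freedom are invented for illustration; scipy is assumed):

```python
from scipy import stats

t_calculated = 2.31   # invented value of a calculated test statistic t
df = 20               # invented degrees of freedom

for alpha in (0.05, 0.01):
    t_critical = stats.t.ppf(1 - alpha / 2, df)   # two-tailed critical value
    rejected = abs(t_calculated) > t_critical
    print(f"alpha = {alpha}: critical value = {t_critical:.3f}, reject H0: {rejected}")

# Exact two-tailed P-value for the calculated statistic:
p_value = 2 * stats.t.sf(abs(t_calculated), df)
print(f"P = {p_value:.3f}")   # here P < 0.05 but P > 0.01 -> significant, not highly significant
```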
know. What we do know is that for a given sample size, n, the value of α is related inversely to the
value of β. That is, lower probabilities of committing a Type I error are associated with higher
probabilities of committing a Type II error. Both types of error may be reduced simultaneously by
increasing n. Thus, for a given α, larger samples will result in statistical testing with greater power
(1 − β). Table 5.1 summarizes these two types of statistical errors.
Tab. 5.1 The two types of statistical errors
                          DECISION
REALITY                   REJECTING H0                       NOT REJECTING H0
H0 is true                Type I error (α)                   correct decision
H0 is false               correct decision (power, 1 − β)    Type II error (β)
Since, for a given n, one cannot minimize both types of errors, it is appropriate to ask
what an acceptable combination of the two might be. In terms of veterinary medicine: "Not to treat
an ill animal (statistically evaluated as a healthy one – Type I error) is a more serious mistake than
to treat a healthy animal statistically evaluated as an ill one (Type II error)". Therefore, statistical
tests used in medicine are set up to achieve a minimal Type I error α. By experience, and hence by
convention, an α of 0.05 is usually considered to be a “small enough” chance of committing a Type
I error, while not being so small as to result in “too large a chance” of a Type II error. But there is
nothing sacred about the 0.05 level. Although it is the most widely used significance level,
researchers may decide for themselves whether it is more important to minimize one type of error or
the other. In some situations, for example, a 5% chance of an incorrect rejection of H0 may be felt to
be unacceptably high, so the 1% level of significance is sometimes employed.
It is necessary, of course, to state the significance level used when reporting the results of
a statistical test. Indeed, rather than simply stating whether the null hypothesis is rejected, it is good
practice to state also the test statistic itself and the best estimate of its exact probability (calculated
by means of a statistical software). In this way, readers of the research results may draw their own
conclusions, even if their choice of significance level is different from the author's.
Bear in mind, however, that the choice of α is to be made before even seeing the data.
Otherwise there is a great risk of having the choice influenced by examination of the data,
introducing bias instead of objectivity into proceedings. The best practice generally is to decide on
the null hypothesis, alternate hypothesis, and significance level before commencing with data
collection.
As we already know, it is conventional to refer to rejection of H0 at the 5% significance
level as denoting a “significant” difference (e.g. between compared population means) and rejection
at the 1% level as indicating a “highly significant difference”. As the significance level selected is
somewhat arbitrary, if test results are very near that level (e.g. between 0.04 and 0.06 if α = 0.05 is
used), then it may be wiser to repeat the analysis with additional data than to declare emphatically
that the null hypothesis is or is not a reasonable statement about the sampled population.
C) Continuous Form of Testing
Whether a difference between means exists most often is the focus in comparing two groups
with data in continuous form. Our first inclination is to look at the difference between means.
However, this difference depends on the scale. The offset distance of a broken femur appears larger
if measured in centimetres than in inches. The distance must be standardized into units of data
variability. We divide the distance between means by a measure of variability and achieve a statistic
t (if the population variability is estimated by small samples: see t-test). The risk of concluding
a difference when there is none (the P-value) is looked up in a table and the decision about the
group difference is made.
The continuous form of statistical testing is represented by parametric tests, which may be used for
testing hypotheses with numerical data that are assumed to come from a normally distributed
population (see Chapter 6: Parametric tests).
Parametric Tests – a summary:
They are used in testing for differences between data sets that follow Gaussian normal
distribution,
Hypotheses concerning the parameters (μ, σ) of this distribution are tested,
Calculations in these tests are based on sample statistics ($\bar{x}$, s).
Chapter 6
Parametric Tests
Among the most commonly employed biostatistical procedures is the comparison of two
samples to infer whether differences exist between the two populations sampled. In parametric tests,
we consider hypotheses concerning the population parameters μ (mean value) and σ² (variance) of
the Gaussian normal distribution.
As the mean is the most important characteristic of a population, the basic question asked in
parametric tests most often is whether two samples have the same mean or whether a sample mean is
the same as a population mean. Questions concerning two variances (or standard deviations) are
also considered in parametric tests in some instances. The question is answered by testing the null
hypothesis that the means (or variances) are equal and then accepting or rejecting this hypothesis.
Tests of means and variances were developed under the assumption that the sample was drawn from a normal distribution. Although real distributions are usually not truly normal, a distribution that is roughly normal in shape is adequate for a valid test. That is because the test is moderately robust.
Robustness is an important concept. A robust test is one that is affected little by deviations from
underlying assumptions. If a small-to-moderate sample is too far from normal in shape, the
calculation of error probabilities, based on the assumed distribution, will lead to erroneous
decisions; then non-parametric (rank-order) methods should be used preferably (see chapter Non-
parametric tests). In particular, tests of means are only moderately robust and they are especially sensitive to outliers (extremely high or low values); tests of variances are even less robust to departures from normality.
Student’s t-test (used in testing for difference between two means) and Snedecor’s F-test
(used in testing for difference between two variances) belong to the group of parametric tests.
6.1 F-test (Variance Ratio Test)
We can decide by means of the F-test whether some treatment (activity used in an experiment) influences the variability (variance σ²) of some biological character studied in a population. A null hypothesis H0: σ₁² = σ₂² is verified by examining the sample variances s₁² and s₂².
We apply the tested treatment (e.g. a new medical preparation) to one of two samples; the second sample (without any treatment) serves as a control group for comparison.
Method:
1) We calculate the sample variances s₁² and s₂²:

   s₁² = [Σxᵢ² − (Σxᵢ)²/n₁] / (n₁ − 1)        s₂² = [Σxᵢ² − (Σxᵢ)²/n₂] / (n₂ − 1)

2) We calculate the test statistic:

   F = higher variance (of s₁², s₂²) / lower variance (of s₁², s₂²)
3) We specify the degrees of freedom of the numerator and of the denominator of the F ratio, appropriate to the sizes of sample 1 and sample 2:

   DF numerator: νN = n − 1 for the sample with the higher variance (s₁² or s₂²)
   DF denominator: νD = n − 1 for the sample with the lower variance (s₁² or s₂²)
4) We find the critical value F(α; νN, νD) in the statistical tables for the F-distribution (Appendix 5) according to the chosen error α (e.g. 0.05) and the degrees of freedom of the numerator and denominator of our test statistic F.
5) We compare the calculated test statistic F with the tabulated critical value Fcrit. to make a conclusion about the population variances and about the effect of the treatment used in the experiment on the variability of the studied biological character:

If calculated F > Fcrit.   We reject H0; the alternate hypothesis is true: HA: σ₁² ≠ σ₂².
                           i.e. There is a significant difference between the variances – the variability of the 2 populations sampled is not equal (at the α level).
                           Conclusion: the treatment used in the experiment has influenced the variability of the studied biological character.

If calculated F ≤ Fcrit.   H0 is true: σ₁² = σ₂².
                           i.e. There is an insignificant difference between the variances – the variability of the 2 populations sampled is equal (at the α level).
                           Conclusion: the treatment used in the experiment has not influenced the variability of the studied biological character.
Example:
The effect of a new veterinary preparation on AST (aspartate aminotransferase) level in
blood serum in dairy cows has been monitored. In 10 dairy cows (control group), to which the
preparation was not applied, the following AST activities in blood serum have been found (in µmol·l⁻¹):
0.409, 0.345, 0.392, 0.377, 0.398, 0.381, 0.400, 0.405, 0.302, 0.337
In 10 dairy cows (test group), to which the preparation was applied, the following AST activities in blood serum have been found (in µmol·l⁻¹):
0.341, 0.302, 0.504, 0.452, 0.309, 0.375, 0.479, 0.423, 0.311, 0.333
Does the preparation influence the variance of AST activity in blood serum of dairy cows?
Method:
1) We calculate the sample variances s₁² and s₂²:
   Control: s₁² = 0.00125
   Preparation: s₂² = 0.00575
2) Test statistic: F = 0.00575 / 0.00125 = 4.60
3) Degrees of freedom: νN = 10 − 1 = 9 (preparation, higher variance), νD = 10 − 1 = 9 (control, lower variance)
4) Critical value (Appendix 5): F(0.05; 9, 9) = 4.03
5) F > Fcrit.: a statistically significant difference was found between the variances (the alternate hypothesis HA: σ₁² ≠ σ₂² is true) at the level α = 0.05.
Conclusion: the preparation influences the variability of AST activity in blood serum of dairy cows.
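The same comparison can be reproduced with a few lines of code. The sketch below is a minimal illustration (assuming NumPy and SciPy are available); it computes both sample variances, forms the F ratio with the larger variance in the numerator and looks up the two-tailed critical value:

```python
import numpy as np
from scipy import stats

# AST activities (µmol/l) from the example above
control = [0.409, 0.345, 0.392, 0.377, 0.398, 0.381, 0.400, 0.405, 0.302, 0.337]
treated = [0.341, 0.302, 0.504, 0.452, 0.309, 0.375, 0.479, 0.423, 0.311, 0.333]

s2_c = np.var(control, ddof=1)          # sample variance of the control group
s2_t = np.var(treated, ddof=1)          # sample variance of the treated group

# F ratio: higher variance over lower variance
if s2_t >= s2_c:
    F, df_num, df_den = s2_t / s2_c, len(treated) - 1, len(control) - 1
else:
    F, df_num, df_den = s2_c / s2_t, len(control) - 1, len(treated) - 1

alpha = 0.05
F_crit = stats.f.ppf(1 - alpha / 2, df_num, df_den)   # two-tailed critical value
print(f"F = {F:.2f}, F_crit(0.05; {df_num}, {df_den}) = {F_crit:.2f}")
print("Reject H0: variances differ" if F > F_crit else "Do not reject H0")
```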
6.2 t-test (Student’s)
Student’s t-test is used for testing for difference between 2 population means (in general).
However, there are several different variants of t-test in practice according to the data sets that are
available for comparison (see below). The t-test is the most common statistical procedure in the
medical and biological literature; you can expect it to appear in more than half the papers you read
in the general medical literature. In addition to being used to compare two group means, it is
widely applied incorrectly to compare multiple groups, by doing all the pairwise comparisons, for
example, by comparing more than one intervention (treatment) with a control condition.
Student’s t-test is especially useful for testing for significant differences between results
obtained under two experimental conditions (treatments). First, we hypothesise that the means of
our two populations are not different (null hypothesis, e.g. H0: μ₁ = μ₂). We then determine the
probability (our P-value) that the difference in our samples´ means could have arisen by chance.
This P-value is thus a measure of the compatibility between our experimental observations and our
null hypothesis. A low P-value, say < 0.05, is typically regarded as statistical evidence to reject the null hypothesis and conclude that there is a significant difference in the results obtained from the two
experimental conditions (treatments).
The null hypothesis states that the mean of the population from which the sample is drawn is
not different from a theoretical mean or from the population mean of another sample drawn from
the same population. We also must choose the alternate hypothesis, which will select a two-tailed
or one-tailed test. We should decide this before seeing the data so that our choice will not be
influenced by the outcome. The two-tailed t-test is used to test against the alternative hypothesis that μ₁ ≠ μ₂. It is sometimes the case that before the data are collected there is only one reasonable way in which the 2 means could differ, and the alternative hypothesis would be, for example, μ₁ > μ₂ (or μ₁ < μ₂). It is then appropriate to carry out a one-tailed t-test. We often expect the result to lie toward one tail, but expectation is not enough. If we are sure the other tail is impossible, such as for physical or psychological reasons, we unquestionably use a one-tailed test. Surgery to sever adhesions and return motion to a joint frozen by long casting will allow only a positive increase in the angle of motion; a negative angle is physically not possible. A one-tailed test is appropriate.
There are cases in which an outcome in either tail is possible, but a one-tailed test is
appropriate. When making a decision about a medical treatment, i.e., whether we will alter
treatment depending on the outcome of the test, the possibility requirement applies to the alteration
in treatment, not the physical outcome. If we will alter treatment only for significance in the
positive tail and it will in no way be altered for significance in the negative tail, a one-tailed test is
appropriate. However, very often we also need to test an alternate hypothesis that there is any
difference (whichever difference: toward positive or negative tail) between treatments used in
experiment – in these cases the two-tailed test is appropriate.
The difference between one-tailed and two-tailed tests is also reflected in the size of the
critical value; generally we can say that critical values used in one-tailed tests are lower than critical
values used in two-tailed tests (at the same α level of significance). In the table of critical values of
t-distribution (Appendix 2), we can see that the one-tailed critical values (marked as α(1)) at a specific α level are the same as the two-tailed critical values (marked as α(2)) at double that α level of significance, for a given ν (DF).
The procedure of Student’s t-test consists in the calculation of a test statistic t that results from the estimation of the parameters μ and σ in samples: x̄ and s. The calculated test statistic is compared with the tabulated critical value tα, which we can find in the tables of the t-distribution (Appendix 2) according to the chosen error α (our probability level for acceptance of a significant difference; it is typically set at 0.05 by most researchers in the biological and medical sciences) and ν (DF – degrees of freedom, calculated by means of a specific formula for each of the variants of the t-test).
If calculated t > tα => we reject the null hypothesis; it means that there is a significant difference between the means of the populations sampled at the α level of significance. We accept the alternate hypothesis that our two experimental groups (typically one sample with the treatment and the second one – control without treatment) produced statistically significantly different results (i.e. the samples were not drawn from the same population).
If calculated t ≤ tα => we accept the null hypothesis; it means that there is an insignificant difference between the means of the populations sampled at the α level of significance, i.e. our two experimental groups (typically one sample with the treatment and the second one – control without treatment) produced results that do not differ significantly (i.e. the samples were drawn from the same population).
6.2.1 Population vs. Sample Comparison (One-sample t-test)
This variant of the t-test is used for evaluation of data in experiments where a population parameter μ is known. It may be, e.g., the physiological value of a biochemical indicator – this value is considered as a constant. In the experiment, we then verify the null hypothesis that the test sample (under a treatment) comes from a population with this known parameter (H0: μ = const.). An alternate hypothesis is HA: μ ≠ const.
Method:
1) We calculate the sample mean and variance.
2) We calculate the test statistic:

   t = (x̄ − μ) / √(s²/n)

   x̄ – sample mean, μ – population mean, s – sample SD, n – number of items in the sample
3) We specify the degrees of freedom: ν = n − 1.
4) We compare the calculated t with the tabulated critical value t(α, ν), where ν = n − 1 and α = 0.05 (or 0.01).
If calculated t > t(α, ν) => H0 is rejected, i.e. the difference between the sample mean and the known population mean is statistically significant (at the α level).
It means that the treatment has been effective – it caused a change of the mean in comparison with the known population mean μ = const., i.e. the tested sample comes from another population with μ ≠ const.
If calculated t ≤ t(α, ν) => H0 is not rejected, i.e. the difference is statistically insignificant (at the α level); the treatment has not been effective and the tested sample comes from the population with μ = const.
Example:
In a population of dairy cows the mean value of glucose in blood serum is μ = 3.1 mmol·l⁻¹.
After applying an energy preparation glucose level in serum in 10 cows selected at random was
measured:
3.1, 2.7, 3.3, 3.1, 3.1, 3.2, 3.0, 2.8, 2.9, 2.7.
Does the preparation influence the glucose level in serum?
Method:
1) Sample statistics: x̄ = 3.0, s = 0.21, s² = 0.044, n = 10, ν = 9
2) Mean value known for the whole population: μ = 3.1 mmol·l⁻¹
3) Test statistic:

   t = (x̄ − μ) / √(s²/n); with the values above, |t| = 1.58

4) Critical value (Appendix 2): t(0.05; 9) = 2.262
5) |t| < t(0.05; 9) => H0 is not rejected; the sample comes from the population with μ = 3.1.
6) Conclusion:
The preparation used in the experiment is not effective (it does not change the glucose level in dairy cows).
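For comparison, the same one-sample test can be run in software. The following is a minimal sketch (assuming NumPy and SciPy); the exact t and P values may differ slightly from the hand calculation above, because the software does not round the intermediate sample statistics:

```python
import numpy as np
from scipy import stats

glucose = np.array([3.1, 2.7, 3.3, 3.1, 3.1, 3.2, 3.0, 2.8, 2.9, 2.7])  # mmol/l
mu0 = 3.1   # known population mean

t_stat, p_value = stats.ttest_1samp(glucose, popmean=mu0)   # two-tailed one-sample t-test
print(f"t = {t_stat:.2f}, P = {p_value:.3f}")
# |t| is below t_crit(0.05; 9) = 2.262, so H0 is not rejected
```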
6.2.2 Samples comparison (Two-sample t-test)
This variant of t-test is used for evaluation of data in experiments, where a population
parameter is not known. We compare data of 2 samples that comes either from one group of
subjects measured twice (typically before and after treatment – “paired experiment”, “dependent
samples”) or from two different random groups of subjects (typically treated test group and
untreated control group) – “unpaired experiment”.
A) Paired t-test (paired experiment, dependent samples)
Data evaluated in this variant of t-test come from paired subjects; it means that the same
subjects are submitted to two measurements (both test and control treatments are performed in one
group of subjects). Such a situation is represented, for instance, by body weight measured before and then 2 weeks after the start of a new diuretic medication. In that case the outcome we are interested in, i.e. the weight, is evaluated before and after the administration of the diuretic in the same individual.
Therefore we get “matched” data from these repeated measurements.
Note that paired experiments are, in principle, more sensitive (statistically powerful) than unpaired experiments. For
example, if each individual taking a diuretic loses 2 kg in 2 weeks then you can feel comfortable
that this represents an effect of the diuretic. In contrast, in two random groups, if one receives
placebo (control) and the other the diuretic, a 2 kg weight reduction in the diuretic-treated group is
less strong evidence of effectiveness of the medication. This is because the variation of weight in
each group is larger than 2 kg, so it is unclear whether the difference between the two groups is
a real effect of the diuretic or simply a random small difference in mean weight between the two
groups. It follows that, when possible, it is better to evaluate a given manipulation in the same
subject.
The first step in the procedure of the paired t-test consists in the calculation of the differences between paired (matched) values: xᵢ = x(test) − x(control). Then we calculate the sample mean x̄ and SD (standard deviation) of the differences xᵢ. We test the hypothesis that the population means of the measurements before and after the treatment are equal (i.e. the mean value of the differences xᵢ between matched measurements is equal to 0). Then the null hypothesis is H0: μ(differ.) = 0 and the alternate hypothesis is HA: μ(differ.) ≠ 0.
Method:
1) We calculate the differences between paired values, and the mean and standard deviation of the differences.
2) We calculate the test statistic for the paired t-test:

   t = x̄ / √(s²/n)

   x̄ – mean of the differences between paired values, s² – variance of the differences, n – number of pairs
3) We specify the degrees of freedom for the test: ν = n − 1
4) We compare the calculated t with the tabulated critical value t(α, ν), where ν = n − 1 and α = 0.05 (or 0.01):

If t > t(α, ν)    H0 is rejected, i.e. the difference between means is statistically significant (at α = 0.05) or highly significant (at α = 0.01).
                  It means that the treatment has been effective: the mean after the treatment is different from the mean before the treatment.

If t ≤ t(α, ν)    H0: μ(differ.) = 0 is true, i.e. the difference between the means of values before and after the treatment is statistically insignificant (at the specific α level).
                  It means that the treatment has not been effective: the mean after the treatment is the same as the mean before the treatment.
Example:
Determine a weight change of twelve rats after being subjected to a regimen of forced
exercise. Each weight change (in g) is the weight after exercise minus the weight before:
0.2, -0.5, -1.3, -1.6, -0.7, 0.4, -0.1, 0.0, -0.6, -1.1, -1.2, -0.8. Does the exercise cause any significant
change in rat weight?
Method:
1) We calculate the sample statistics of the differences: x̄ = −0.6 g, s² = 0.40 g², n = 12, ν = 11
2) Test statistic:

   t = x̄ / √(s²/n) = 0.6 / √(0.40/12) ≈ 3.3 (absolute value)
3) Critical values found in statistical tables of the t-distribution: tcrit.(0.05; 11) = 2.201
tcrit.(0.01; 11) = 3.106
4) We compare the calculated test statistic with the critical values:
t > tcrit.(0.01; 11) => H0 is rejected; there is a statistically highly significant difference between the mean weight before and after exercise (P < 0.01).
5) Conclusion:
The exercise causes a highly significant weight loss in rats.
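A software check of the same paired test can be done by testing the differences against zero. The sketch below is a minimal illustration (assuming NumPy and SciPy); the unrounded computation gives a t value close to the hand calculation above:

```python
import numpy as np
from scipy import stats

# weight changes (after - before, in g) of the 12 rats from the example above
diff = np.array([0.2, -0.5, -1.3, -1.6, -0.7, 0.4, -0.1, 0.0, -0.6, -1.1, -1.2, -0.8])

# a paired t-test is a one-sample t-test on the differences, H0: mean difference = 0
t_stat, p_value = stats.ttest_1samp(diff, popmean=0.0)
print(f"t = {t_stat:.2f}, P = {p_value:.4f}")   # P < 0.01 -> highly significant weight loss
```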
B) Unpaired t-test (unpaired experiment, independent samples)
Data evaluated in this variant of the t-test come from two independent groups of individuals – we deal with “unmatched” data. Typically there is one test group, to which we apply some tested treatment, and one control group without any treatment. We can also compare two groups with different treatments in an experiment, when we are interested in whether there is any difference between the effects of these two treatments.
The unpaired t-test is then used to determine whether the means of two independent samples are different enough to conclude that they were drawn from different populations. We test the null hypothesis that the population mean value μ₁ in the test group (treated) is the same as the population mean value μ₂ in the control group: H0: μ₁ = μ₂. The two-tailed alternate hypothesis is HA: μ₁ ≠ μ₂.
The populations sampled can have different variability – this variability affects the
calculation of the t-test. Therefore, we first have to test for a difference between the variances (through the F-test) to specify which calculation formula we need to use for the following t-test. The first step of the procedure of the unpaired t-test thus consists in the calculation of the sample statistics (estimated mean, standard deviation and variance for both samples compared):
Sample 1 (n₁): we calculate x̄₁, s₁²
Sample 2 (n₂): we calculate x̄₂, s₂²
In the following step, we determine by means of F-test whether there is any difference
between population variances:
F-test:

   F = higher variance (of s₁², s₂²) / lower variance (of s₁², s₂²)

We compare the calculated F with the tabulated critical value of the F-distribution, which we find according to the chosen α and the degrees of freedom νN (DF of the numerator) and νD (DF of the denominator).
If F ≤ F(α; νN, νD), the populations compared have the same variability (σ₁² = σ₂²), and we use the following formula for the unpaired t-test:

a) σ₁² = σ₂²:

   Test statistic:  t = (x̄₁ − x̄₂) / √{ [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2) · (n₁ + n₂)/(n₁·n₂) }

   DF: ν = n₁ + n₂ − 2
In the special case of equal sizes of the samples compared:

   For n₁ = n₂ = n:  t = (x̄₁ − x̄₂) / √[(s₁² + s₂²)/n],   DF: ν = 2·(n − 1)
If F > F(α; νN, νD), the populations compared have different variances (σ₁² ≠ σ₂²), and we use the following formula for the unpaired t-test:

b) σ₁² ≠ σ₂²:

   Test statistic:  t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)

   DF:  ν = (s₁²/n₁ + s₂²/n₂)² / [ (s₁²/n₁)²/(n₁ − 1) + (s₂²/n₂)²/(n₂ − 1) ]
   (For n₁, n₂ > 30 the critical value for ν = ∞ may be used.)

   (For n₁ = n₂ = n:  t = (x̄₁ − x̄₂) / √[(s₁² + s₂²)/n])
Conclusion:
We compare the calculated t with the tabulated critical value t(α, ν): if t > t(α, ν), H0 is rejected and the difference between the two population means is statistically significant at the α level (the treatment has been effective); if t ≤ t(α, ν), H0 is not rejected and the difference is insignificant (the treatment has not been effective).
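In software, the whole procedure – the preliminary F-test and the choice between formulas a) and b) – corresponds to the equal_var switch of the two-sample test. The sketch below is a minimal illustration with hypothetical data (not the worked example that follows), assuming NumPy and SciPy:

```python
import numpy as np
from scipy import stats

# hypothetical measurements for a treated and a control group (illustration only)
treated = np.array([12.1, 11.4, 13.0, 12.6, 11.8, 12.9])
control = np.array([10.9, 11.2, 10.4, 11.6, 10.8, 11.1])

# preliminary F-test: ratio of the larger to the smaller sample variance
s2_t, s2_c = np.var(treated, ddof=1), np.var(control, ddof=1)
if s2_t >= s2_c:
    F, df_num, df_den = s2_t / s2_c, len(treated) - 1, len(control) - 1
else:
    F, df_num, df_den = s2_c / s2_t, len(control) - 1, len(treated) - 1
F_crit = stats.f.ppf(1 - 0.05 / 2, df_num, df_den)   # two-tailed critical value, alpha = 0.05
equal_var = F <= F_crit

# unpaired t-test: formula a) if the variances are equal, formula b) (Welch) otherwise
t_stat, p_value = stats.ttest_ind(treated, control, equal_var=equal_var)
print(f"F = {F:.2f} (equal variances: {equal_var}), t = {t_stat:.2f}, P = {p_value:.4f}")
```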
Example:
Determine a drug effect on the change in blood-clotting times (in minutes) in pigs.
Times of individuals treated with drug (T): 9.9, 9.0, 11.1, 9.6, 8.7, 10.4, 9.5.
Times of untreated control individuals (C): 8.8, 8.4, 7.9, 8.7, 9.1, 9.6, 8.7.
Method:
1) We calculate the statistics of both samples:
   Test sample:    x̄₁ = 9.7 min, s₁² = 0.67 min², n₁ = 7, ν₁ = 6
   Control sample: x̄₂ = 8.7 min, n₂ = 7, ν₂ = 6
2–4) By means of the F-test we find that the variances do not differ significantly, so the formula for equal variances is used.
5) We calculate the test statistic for the unpaired t-test for equal variances:

   t = (x̄₁ − x̄₂) / √[(s₁² + s₂²)/n] = (9.7 − 8.7) / 0.26 = 3.83,   ν = n₁ + n₂ − 2 = 12

6–8) Critical value: t(0.01; 12) = 3.055; t > t(0.01; 12) => the difference between means is highly significant (P < 0.01).
9) Conclusion:
Drug administration causes a highly significant prolongation of blood-clotting time in pigs.
Chapter 7
Non-Parametric Tests
Non-parametric tests belong to special statistical methods that comprise procedures not
requiring the estimation of the population parameters (mean and variance) and not stating
hypotheses about parameters. As these methods also typically do not make assumptions about the
nature of the distribution (e.g., normality) of the sampled populations (although they might assume
that the sampled populations have the same dispersion or shape), they are sometimes referred to as
distribution-free tests. Non-parametric tests are often called “rank tests“ as their calculations are based on sums of the ranks of the values measured in the experiment.
Non-parametric tests may be applied in any situation where we would be justified in
employing a parametric test, such as the two-sample t-test, as well as in instances when the assumptions of the latter are untenable. However, if either the parametric or the non-parametric approach is applicable, then the latter will generally be less powerful than the former (the difference between the tested data sets must be more considerable to achieve statistical significance).
Non-parametric tests are especially employed when dealing with ordinal-scale data (data that consist of ranks) and with numerical-scale data when normality is not assumed, but they may also be employed when dealing with numerical data that follow the normal distribution (especially for preliminary analyses, as the calculations of non-parametric tests are often quicker and simpler than the parametric ones). Non-parametric tests may also be useful when dealing with small sample sizes – the sample frequency distribution is then insufficient to tell us whether the assumption of normality is justified.
In non-parametric tests, the hypothesis is verified that data from both samples were drawn
from the same population, i.e. that they have the same dispersion or shape of the distribution curve
(null hypothesis). We then determine (similarly to parametric tests) the probability (P-value) that
the observed difference in our samples could have arisen by chance. A low P-value (P<0.05), is
often regarded as statistical evidence to reject the null hypothesis and conclude that there is
significant difference in the results obtained from our two experimental conditions.
7.1 Mann-Whitney U-test
(A non-parametric test for independent samples)
For this test, as for many other non-parametric procedures, the actual measurements are not employed; instead we use the ranks of the measurements. The data of both samples compared
are arranged into one (mixed) sample and may be ranked either from the highest to lowest or from
the lowest to the highest values. The samples compared can consist of equal or unequal number of
the observations. E.g. if data in samples are arranged from the highest to the lowest, then the highest
value in either of the two samples compared is given rank 1, the second highest value is assigned
rank 2, and so on, with the lowest value being assigned rank N, where
N = n1 + n2.
   U = n₁·n₂ + n₁·(n₁ + 1)/2 − R₁
   U′ = n₁·n₂ + n₂·(n₂ + 1)/2 − R₂

where n₁ and n₂ are the numbers of observations in the samples,
R₁ is the sum of the ranks of the observations in Sample 1,
R₂ is the sum of the ranks of the observations in Sample 2.

   U′ = n₁·n₂ − U
The larger of the two calculated test statistics U and U´ is compared to the critical value Uα,
n1, n2, found in Appendix 6 (Critical Values for Mann-Whitney U-test). This table assumes that
n1 > n2; if n2 > n1, simply use Uα, n2, n1 as the critical value.
Then:
If the larger of U and U′ > Uα, n1, n2 => we reject H0 at the α level of significance (i.e. the samples tested were not drawn from the same population; there is a significant difference between the populations sampled – they do not have the same shape of the distribution curve). It means that “the treatment used in the experiment was effective“ at the α level of significance.
If the larger of U and U′ ≤ Uα, n1, n2 => we accept H0 at the α level of significance (i.e. the samples tested were drawn from the same population; there is an insignificant difference between the populations sampled – they have the same shape of the distribution curve). It means that “the treatment used in the experiment was not effective“ at the α level of significance.
Note that neither parameters nor parameter estimates are employed in the statistical
hypotheses or in the calculations of test statistics U or U´.
We may assign ranks either from large to small data, or from small to large, calling the
smallest datum rank 1, the next smallest rank 2, and so on. The value of U obtained using one
ranking procedure will be the same as the value of U´ using the other procedure.
Example:
By means of Mann-Whitney test for non-parametric testing find out whether there is some
difference between the heights of male and female students.
Method:
1) We arrange the data from the highest to the lowest and find the ranks of the male and female heights in this arranged (mixed) sample:
193 > 188 > 185 > 183 > 180 > 178 > 175 > 173 > 170 > 168 > 165 > 163
From the sums of ranks, the test statistics are U = 33 and
U′ = n₁·n₂ − U = 7·5 − 33 = 2
4) The larger of the two statistics, U = 33, exceeds the critical value U(0.05; 7, 5) = 30 (Appendix 6), so H0 is rejected.
5) Conclusion:
There is a statistically significant difference between male and female heights at the
significance level α = 0.05.
7.2 Wilcoxon Signed-Rank Test
(A non-parametric test for paired samples)
In this test the differences between the paired measurements are calculated, their absolute values are ranked, each rank is given the sign of its difference, and the sums of the positive ranks (W+) and of the negative ranks (W−) are computed.
The smaller of the two calculated sums W+ and W− is compared to the critical value Wα, n from the tables of critical values for the Wilcoxon signed-rank test (Appendix 7):
If the smaller of W+ and W− < Wα, n => we reject H0, i.e. the difference between the measurements before and after the treatment is statistically significant at the α level („The treatment used in the experiment was effective“).
If the smaller of W+ and W− ≥ Wα, n => we accept H0, i.e. the difference between the measurements before and after the treatment is statistically insignificant at the α level („The treatment used in the experiment was not effective“).
Example:
By means of Wilcoxon paired-sample test for non-parametric testing find out whether there
is some difference between the lengths of hind- and forelegs in deer.
Method:
1) We calculate the differences di between the paired values (hind-leg minus foreleg length for each deer).
2) We arrange the absolute values of the differences in ascending order.
3) We determine the ranks of the differences and give each rank the sign of its difference. Note that “average ranks“ are used for equal differences (e.g. there are three differences di = 5 in the ascending row, therefore all of them get the rank 7 instead of the original ranks 6, 7, 8):
Deer    Difference (di)    Rank of |di|    Signed rank of |di|
 1            4                4.5               4.5
 2            4                4.5               4.5
 3           -3                3                -3
 4            5                7                 7
 5           -1                1                -1
 6            5                7                 7
 7            6                9.5               9.5
 8            5                7                 7
 9            6                9.5               9.5
10            2                2                 2
4) We calculate the sums of the plus and minus ranks:
W+ = 4.5 + 4.5 + 7 + 7 + 9.5 + 7 + 9.5 + 2 = 51
W− = 3 + 1 = 4
5) The smaller of the two sums is W− = 4.
6) Critical value (Appendix 7): W(0.05; 10) = 8; since 4 < 8, H0 is rejected.
7) Conclusion:
Deer hind-leg lengths are not the same as the foreleg lengths (P < 0.05).
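The same paired comparison can be checked in software; the sketch below (a minimal illustration, assuming NumPy and SciPy) uses the ten differences from the table above. SciPy reports the smaller of the two rank sums and an exact or approximate P-value:

```python
import numpy as np
from scipy import stats

# differences (hind leg - foreleg) for the 10 deer in the example above
d = np.array([4, 4, -3, 5, -1, 5, 6, 5, 6, 2])

w_stat, p_value = stats.wilcoxon(d)   # Wilcoxon signed-rank test, H0: median difference = 0
print(f"W = {w_stat}, P = {p_value:.3f}")   # P < 0.05 -> the leg lengths differ
```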
Chapter 8
Relationship Between 2 Data Sets
(Quantitative Data)
Fig. 8.1 Linear function – relation between circle radius (X) and its circumference (Y)
This relationship is typical for relations between data in biology and medicine: there is no exact functional relationship, because most biological characters are highly variable; the relations therefore result from this variability and are only relative (statistical, correlative). These relations in biology and medicine are very complicated – there are many different causes, including random effects, that we are not able to exclude during our monitoring.
This relation is more or less loose – the magnitude of one of the variables probably changes as the magnitude of the second variable changes. Each value of xi corresponds to several random values of yi, and the reverse is also possible (“the variables are correlated”). In such a case it is often not reasonable to consider that there is an independent and a dependent variable (e.g. fore- and hind-leg lengths in animals, human height and weight, arm and leg lengths, etc.). It might be found that an individual with long arms will in general possess long legs, so a relationship may be describable; but there is no justification in stating that the length of one limb is dependent upon the length of the other. In such situations, correlation, rather than regression, analyses are called for, and both variables are theoretically random-effects factors.
We use the so-called correlation chart (“scatter diagram”, “dot plot”) for the graphical description of such a statistical relationship – each point represents a pair of X and Y values measured in one member of the sample under study. One pair of X and Y data may be denoted as (x1, y1), another as (x2, y2), another as (x3, y3), etc. These corresponding values on the x and y axes for one point in the scatter plot are called “correlation pairs” (xi, yi).
An example of the scatter diagram of a correlative relation is shown in Figure 8.2; it is the correlation between height and weight in men.
Fig. 8.2 Dot plot for correlative relationship between height and weight in men
If the points in the scatter diagram are clustered in some direction, it means that there is some relation between the biological characters monitored; the correlation may be positive – a “direct relation” (Fig. 8.3) – or negative – an “inverse relation” (Fig. 8.4).
Fig. 8.3 Positive correlation (direct relation)    Fig. 8.4 Negative correlation (inverse relation)
If the points in the scatter diagram are irregularly scattered over the area, it means that there is no correlation between the biological characters monitored (Fig. 8.5).
Fig. 8.5 No correlation
If we wish to know the strength of any relationship we have observed, and how reliable this
observation is, we need to employ statistical techniques called correlation and regression analysis.
Correlation is a measure of the relationship between two (or more) variables that helps us determine
whether the variables really are related and the degree to which they vary together. Regression is
a statistical tool for determining the mathematical relationship between one or several independent
or predictor variables and a single dependent or criterion variable, allowing us to calculate the
value of one variable given known values of the other variables. If we need to evaluate and describe the statistical relation in a graphical presentation, we have to estimate the best-fit function that can be used for the description of this relationship and to determine its equation (to calculate the coefficients of this equation – either linear or nonlinear).
According to the allocation of points in the scatter diagram we can distinguish between two
types of correlative relation: Linear or Non-linear correlation (Fig. 8.6). These two types of
correlative relationships differ in the way of their statistical evaluation and analysis.
Fig. 8.6 Linear and non-linear correlative relationships
8.2 Linear Correlative Relationship
The linear function is the most frequently used equation that we can use for estimation and
description of some correlative relationship between two variables (biological characters) monitored
in biology and related sciences. We need to employ the simplest case of regression analysis, the
simple linear regression, in this situation. Data amenable to simple regression analysis will consist
of a dependent variable that is a random-effect factor and an independent variable that is either
a fixed-effect or a random-effect factor. The data can be visualised in a scatter-plot and analysed by
fitting the best straight line to the points. The simplest and most commonly used fitting technique
of this sort is named least squares. The name comes from minimizing the sum of squared vertical
distances from the data points to the proposed line.
The analysis and description of the statistical relation is usually performed in the following
steps:
1) The construction of an empirical curve that describes the relation in a sample (estimates
the supposed theoretical line for the whole population):
We measure several values yi for the same value xi (e.g. in several men that have the same
height (xi) we measure their weights; we obtain several random values yi). We calculate an average of these values yi at the appropriate xi, and then we join these averages in order to construct the empirical curve that describes the relation in the particular sample monitored in our study. This
empirical curve can serve as the estimation of the best-fit linear function.
An example for the empirical curve in the case of relationship between height and weight in
men is presented in the Figure 8.7.
2) The construction of a regression line (i.e. calculation of equation for this best-fit line)
that can be used for description of the relation in the whole population.
We need to calculate coefficients of the best-fit regression equation using regression
analysis: y = a + bx
The coefficients a and b in the regression equation determine the properties of the line:
a (called the intercept) – represents the intercept point on the y-axis for x = 0,
b (the slope, or regression coefficient) = tan α, where α is the angle formed by the line and the x-axis.
We must always keep in mind that the coefficients a and b are only the best estimates of the true coefficients, denoted α and β, of the theoretical regression line that would uniquely describe the functional relationship existing in the whole population.
Figures 8.8 and 8.9 demonstrate properties of the regression line that are determined by
coefficients a and b in linear equation.
If the coefficient a is a positive value, then the line intersects the axis y above the value 0, if
the coefficient a is a negative value, the line intersects the axis y below the value 0.
If the coefficient b is a positive value, then the line is ascending (it indicates a direct relation
between x and y variables), if the coefficient b is a negative value, then the line is descending (it
indicates an inverse relation between x and y).
   b = [n·Σxᵢyᵢ − Σxᵢ·Σyᵢ] / [n·Σxᵢ² − (Σxᵢ)²]

   a = [Σyᵢ − b·Σxᵢ] / n
After calculation of the equation for the regression line, we need to determine two points for the construction of the theoretical regression line. We can choose any value x1 and calculate its corresponding value y1 (according to the calculated regression equation), and then choose another x2 and calculate the corresponding y2:
   y1 = a + b·x1
   y2 = a + b·x2
Figure 8.10 demonstrates construction of the best-fit regression line for the scattered
diagram in the case of direct correlative relationship.
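As a sketch of how the two coefficients are obtained from data, the following minimal example (assuming NumPy; the height/weight values are hypothetical and are not taken from Fig. 8.2) applies the formulas for b and a directly:

```python
import numpy as np

# hypothetical correlation pairs: height x (cm) and weight y (kg)
x = np.array([165, 170, 172, 175, 178, 180, 183, 185], dtype=float)
y = np.array([ 62,  68,  70,  74,  75,  80,  84,  88], dtype=float)
n = len(x)

# least-squares estimates of the slope b and intercept a
b = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
a = (np.sum(y) - b * np.sum(x)) / n
print(f"regression line: y = {a:.1f} + {b:.2f}x")

# prediction within the observed range of x (extrapolation outside it is unsafe)
print(f"predicted weight for height 176 cm: {a + b * 176:.1f} kg")
```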
Knowing the parameter estimates a and b for the linear regression equation, we can predict
the value of the dependent variable expected at a stated value xi. A word of caution is in order
concerning predicting yi values from regression equation. Generally, it is an unsafe procedure to
extrapolate from regression equations – that is, to predict yi values for xi values outside the observed
range of xi. What the linear regression actually describes is Y as a function of X within the range of
observed values of X. For values of X above or below this range, the function may not be the same (i.e., α and/or β may be different); indeed, the relationship may not even be linear in such ranges,
even though it is linear within the observed range. If there is good reason to believe that the
described function holds for X values outside the range of those observed, then we may cautiously
extrapolate. Otherwise, beware.
Correlation analysis is the statistical technique used for determination of association level
between variables in the analysis of the correlative relation monitored. In simple linear correlation,
we consider the linear relationship between two variables X and Y, whereas neither is assumed to be
functionally dependent upon the other. An example of a correlation situation is the relationship
between the wing length and tail length of a particular species of bird.
We calculate a correlation coefficient r that determines tightness (closeness) of the relation
between variables X and Y (and also determines the measure of dispersion of points around the
theoretical regression line in the scatter diagram). The calculation results from sample data -
correlation pairs (xi, yi), measured for each of individuals in the sample under study.
Calculation formula for the correlation coefficient r:

   r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² ]
Values of the correlation coefficient r lie within the interval ⟨−1; +1⟩. The larger the absolute value of r, the closer is the correlation between the X and Y variables. A positive correlation coefficient implies that for an increase in the value of one of the variables, the other variable also increases in value; a negative correlation coefficient indicates that an increase in the value of one of the variables is accompanied by a decrease in the value of the other variable. If the correlation coefficient r = 0, one has zero correlation, denoting that there is no linear association between the magnitudes of the two variables; that is, a change in the magnitude of one does not imply a change in the magnitude of the other. A correlation coefficient r = +1 represents a total (functional) direct relation
(an ascending line), and a correlation coefficient r = −1 represents a total (functional) inverse relation (a descending line). Figure 8.11 presents these considerations graphically.
The correlation coefficient r that we calculate from a sample is only an estimate of the actual correlation coefficient in the population (denoted ρ). If we need to know whether the correlation really exists in the population, we have to test the hypothesis of independence (H0: ρ = 0) using a t-test:

   Test statistic:  t = r / sr

where sr is the standard error of the correlation coefficient r and is calculated using the following formula:

   sr = √[ (1 − r²) / (n − 2) ]
We compare the calculated t with the critical value (Appendix 2: Critical values for Student’s t-distribution) according to the chosen α and the given ν = n − 2:
If t > t(α, ν) => H0 is not true; the correlation between X and Y really exists in the population sampled (r is significant),
If t ≤ t(α, ν) => H0 is true; the correlation between X and Y does not really exist in the population sampled (r is insignificant).
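A minimal sketch of this significance test for r is shown below (assuming NumPy and SciPy; the wing and tail lengths are hypothetical illustration data):

```python
import numpy as np
from scipy import stats

# hypothetical wing and tail lengths (cm) of birds of one species
wing = np.array([10.4, 10.8, 11.1, 10.2, 10.3, 10.2, 10.7, 10.5, 10.8, 11.2, 10.6, 11.4])
tail = np.array([ 7.4,  7.6,  7.9,  7.2,  7.4,  7.1,  7.4,  7.2,  7.8,  7.7,  7.8,  8.3])

r, p_value = stats.pearsonr(wing, tail)          # r and its two-tailed P-value
s_r = np.sqrt((1 - r ** 2) / (len(wing) - 2))    # standard error of r
t = r / s_r                                      # test statistic for H0: rho = 0
print(f"r = {r:.3f}, t = {t:.2f}, P = {p_value:.4f}")
```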
If we deal with a non-linear relation between two variables, or if we have data obtained from a bivariate population that is far from normal, then the correlation procedures discussed in chapter 8.2 are generally not applicable. Instead, in these situations we may operate with the ranks of the measurements of each variable studied.
Calculation of the Spearman rank correlation coefficient is a non-parametric method, since we do not need parameters (the means of variables X and Y) for the calculation. This method may also be used for data sets that do not follow the Gaussian normal distribution, and it can be used more generally – in both linear and non-linear correlations. The method may also be used for normal data sets, but the non-parametric correlation coefficient is less powerful (less effective) than the parametric one. Therefore, for normal data the non-parametric correlation coefficient is mostly used only for preliminary calculations.
In the course of the calculation of the Spearman rank correlation coefficient rS, we use only the ranks of the values instead of the actually measured values xi, yi. The calculation is based on the number of individuals (n) in the sample and the correlation pairs (xi, yi).
Method:
1) We assign ranks to the xi values and, separately, to the yi values. If there are some equal values in a row, they get so-called “average ranks”: e.g., if x4 and x1 are equal and would occupy ranks 2 and 3, both get the rank 2.5 (calculated as (2 + 3)/2).
2) We calculate the Spearman rank correlation coefficient:

   rS = 1 − (6·ΣDᵢ²) / [n·(n² − 1)]

Where:
Dᵢ – differences between the ranks of the corresponding xi and yi values
n – number of members in the sample
After the calculation, we compare the computed rS with the critical rank correlation coefficient found in the statistical tables (Appendix 8: Tables of Spearman rank correlation) according to the chosen α and the given n:
If |rS| > rcrit. => there is a significant correlation between the X and Y variables (the relation really exists in the population sampled),
If |rS| ≤ rcrit. => there is an insignificant correlation between the X and Y variables (the relation does not really exist in the population sampled).
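A minimal software sketch of the rank correlation is given below (assuming SciPy; the data are hypothetical and independent of the worked example that follows):

```python
from scipy import stats

# hypothetical wing and tail lengths (cm) of 12 birds
wing = [10.4, 10.8, 11.1, 10.2, 10.3, 10.2, 10.7, 10.5, 10.8, 11.2, 10.6, 11.4]
tail = [ 7.4,  7.6,  7.9,  7.2,  7.4,  7.1,  7.4,  7.2,  7.8,  7.7,  7.8,  8.3]

# Spearman rank correlation: ranks are assigned internally (ties get average ranks)
r_s, p_value = stats.spearmanr(wing, tail)
print(f"r_S = {r_s:.3f}, P = {p_value:.4f}")
```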
Example:
Calculate the Spearman rank correlation coefficient for the relation between wing and tail
lengths among birds of a particular species:
Method:

   rS = 1 − (6·ΣDᵢ²) / [n·(n² − 1)] = 1 − 6(42.00) / 1716 = 0.853
Correlation between wing and tail lengths among birds of a particular species is statistically
significant (it really exists in the population).
Chapter 9
Categorical Data
(Qualitative Data: On a Nominal Scale)
Categorical groups are formed in a natural way for most qualitative biological characters. Sometimes, however, categorical groups may also be formed artificially, by dividing the scale upon which continuous data occur. If we were to categorize age by decade (50-59, 60-69, and 70-79 years), we would have age groupings, which we could name 1, 2, and 3. These groups could be considered as categories and categorical methods used. However, they fall into a natural rank order, as group 1 clearly comes before group 2, etc. Rank methods give better results than categorical methods, since they are more powerful (categorical methods are not very powerful). When ranking is a natural option, rank methods should therefore be used.
When dealing with categorical data, the basic statistic, the count (frequency), is obtained by counting the number of “events” per category (individuals that possess the appropriate “quality”) in a sample of a total number of n individuals.
The symbol for the number of “events” used in the calculation formulas is fi (the count in category i).
Another important statistic obtained from categorical data is the proportion of data in a category, which is the count in the category divided by the total number n of individuals in the sample under study. The proportion of data is a relative measure (in contrast to absolute frequencies) and gives us the probability of data in the category. Multiplication by 100 yields percent (denoted %). Percent is useful in that most of the public is used to thinking in terms of percent, but statistical methods have been developed for proportions. The symbol for a sample proportion is P:

   P = fi / n

E.g.: if 5 animals out of 50 have a disease, the proportion (probability) of the disease in this sample is P = 5/50 = 0.1 (the incidence of the disease is 10%).
In the statistical methods intended and used for categorical data, we can distinguish between empirical and theoretical counts (in terms of the sample and the population):
fi – empirical count (frequency) – observed in a sample (actually found),
f̂i – theoretical count (frequency) – theoretically expected in the population sampled (this theoretical count may be obtained in various ways in particular statistical methods for categorical data, e.g. according to literature sources, from long-term monitoring of the “event” under study in the past, or by means of calculation from tables of empirical counts).
9.1 Analysis of Categorical Data
For nominal (categorical) data we can only use categorical methods that are based on frequencies (counts) or proportions of “events” in statistical sets (we cannot use any measured values or parameters as in the methods used for numerical data). When dealing with counts in statistical methods for categorical data, we usually arrange them into tables of counts, which provide a convenient layout for all the calculations used in the course of the analysis of nominal data.
Nominal data analyses give us the possibility to assess, for example, whether a sample conforms to a theoretical distribution and whether two categorical variables are related (independent).
It is frequently desired to obtain a sample of nominal data and to infer whether the
population from which it came conforms to a specified theoretical distribution. For example, a plant geneticist may raise 100 progeny from a cross that is hypothesized to result in a 3:1 phenotypic ratio
of yellow-flowered to green-flowered plants. Perhaps a ratio of 84 yellow : 16 green is observed,
although out of this total of 100 plants, the geneticist’s hypothesis would predict a ratio of 75
yellow : 25 green. The question to be asked, then, is whether the observed frequencies (84 and 16)
deviate significantly from the frequencies expected if the hypothesis were true (75 and 25).
The statistical procedure for attacking the question first involves the concise statement of the
hypothesis to be tested. The hypothesis in this case is that the population which was sampled has
a 3 : 1 ratio of yellow-flowered to green-flowered plants. This is referred to as a null hypothesis
(abbreviated H0), because it is a statement of “no difference”; in this instance, we are hypothesizing
that the population flower colour ratio is not different from 3 : 1. If it is concluded that H0 is false,
then an alternate hypothesis (abbreviated HA) will be assumed to be true. In this case, HA would be
that the population sampled has a flower-colour ratio which is not 3 yellow : 1 green. Recall that we
state a null hypothesis and an alternate hypothesis for every statistical test performed, and all
possible outcomes are accounted for by the two hypotheses.
The following calculation of a statistic called Chi-Square is used as a measure of how far a sample distribution deviates from a theoretical distribution. This Chi-Square analysis represents the basis of all calculations and techniques used for nominal data – we calculate the test statistic of the χ²-test for differences between the observed counts (in a sample) and those that would be expected theoretically (in the whole population).
The null hypothesis used in this Chi-Square test is H0: observed counts = expected counts (“no difference”).
Calculation of the test statistic Chi-square:

   χ² = Σᵢ₌₁ᵐ (fi − f̂i)² / f̂i     or equivalently     χ² = Σᵢ₌₁ᵐ fi²/f̂i − n

Where:
fi – observed frequencies (in sample class i),
f̂i – expected frequencies (in the whole population, class i), i.e. frequencies expected in class i if the null hypothesis is true,
m – number of classes (categories) in the sample or population,
n – total number of members in the sample.
The summation is performed over all m categories of data; in the example with the flower-colour ratio, there are two categories of data (i.e. m = 2): yellow-flowered plants and green-flowered plants. The expected frequency, f̂i, of each class is calculated by multiplying the total number of observations, n, by the proportion of the total that the null hypothesis predicts for the class. Therefore, for the two classes in the example, f̂₁ = 100 · (3/4) = 75 and f̂₂ = 100 · (1/4) = 25.
If the calculated χ² = 0, then the observed and theoretical frequencies are exactly identical. The bigger the value of the calculated statistic χ², the bigger the difference between the observed and theoretical frequencies. Thus, this type of calculation is referred to as a measure of goodness of fit (a “goodness-of-fit test”).
If we compare the calculated χ² with the critical value χ²α for the specific α and DF ν = m − 1 from the statistical tables (Appendix 3), then:
If χ² > χ²α, the difference between the observed and expected counts is significant (at the level α). The null hypothesis is not true, i.e. the sample distribution (empirical frequencies) deviates from the theoretical distribution (theoretical frequencies).
If χ² ≤ χ²α, the difference between the observed and expected counts is not significant (at the level α). The null hypothesis is true, i.e. the sample distribution (empirical frequencies) does not deviate from the theoretical distribution (theoretical frequencies).
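As a sketch, the goodness-of-fit calculation for the flower-colour example (84 yellow : 16 green observed against 75 : 25 expected) can be done as follows (assuming SciPy):

```python
from scipy import stats

observed = [84, 16]   # yellow-flowered, green-flowered plants
expected = [75, 25]   # counts predicted by the 3 : 1 hypothesis for n = 100

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)   # DF = m - 1 = 1
print(f"chi-square = {chi2:.2f}, P = {p_value:.3f}")
# the statistic exceeds the critical value 3.84 (alpha = 0.05), so the 3 : 1 ratio is rejected
```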
9.2 Test for Difference between Empirical and Theoretical Counts
(Sample vs. Population)
In practice, this test is usually used in the situations when the theoretical probability for
a studied “event” is known (e.g. predicted ratios in genetics, probabilities for the incidence of
a particular disease according to literature sources, from a long-term monitoring of the “event”
under study in the past, etc.). In the chi-square analysis we compare the expected frequencies (calculated from the theoretical probabilities) with the empirical frequencies observed in the sample under study in order to assess the statistical significance of the differences between these counts.
Example:
From the total number of 146 calves in a sample 13 have enteritis. In the whole population
the probability of this disease is 4.5%. Is the enteritis occurrence in the sample different from the
whole population?
Method:
We can distinguish between 2 categories in the sample and in the population:
- ill animals: observed 13, expected 146 × 0.045 = 6.57
- healthy animals: observed 133, expected 146 × 0.955 = 139.43
χ² = (13 − 6.57)²/6.57 + (133 − 139.43)²/139.43 = 6.59,   ν = m − 1 = 1
χ²crit. 0.05 = 3.84
χ²crit. 0.01 = 6.63
χ² > χ²crit. 0.05 => the difference between the empirical and theoretical counts is significant (at the α = 0.05 level of significance).
Conclusion: There is a significantly higher proportion of ill animals in the sample than in the population (P < 0.05).
(The empirical frequency of ill animals is 13, but theoretically it should be only 6.57 to have the same probability as in the whole population.)
Example:
The numbers of live-born and dead-born piglets were observed on 3 farms (A, B, C) in a region. We have to decide whether the frequencies of dead-born piglets differ among the farms monitored. The frequencies obtained from 10 litters on each farm are summarized in the following table:
3 groups – rows (A, B, C) – in general r, index (i)
2 classes – columns (live, dead) – in general c, index (j)
r \ c      Live    Dead
A           96      25
B          121      22
C           89      16
For the Chi-Square analysis we also need the theoretical (expected) counts – we are able to calculate them from the sums of the rows and columns in the table; thus the next step in the analysis is to sum the empirical frequencies in rows and columns (expected counts are given in parentheses):

r \ c            Live             Dead            Row sum (Ri)
A                 96 (100.34)      25 (20.66)     121
B                121 (118.59)      22 (24.41)     143
C                 89 (87.07)       16 (17.93)     105
Col. sum (Cj)    306               63             369 (n)
Then, we can calculate the theoretical frequencies f̂ij for each cell in the table.
Calculation formula for the theoretical frequency in each table cell (row i, column j):

   f̂ij = (Ri · Cj) / n

Where:
Ri = sum of the empirical counts in row i
Cj = sum of the empirical counts in column j
E.g. calculation of the theoretical frequency for the cell in the first row and the first column:

   f̂₁₁ = (121 · 306) / 369 = 100.34

   χ² = Σ fij²/f̂ij − n = 96²/100.34 + 25²/20.66 + 121²/118.59 + 22²/24.41 + 89²/87.07 + 16²/17.93 − 369 = 1.637
Degrees of freedom for the test (needed for the critical value χ² from the statistical tables) are calculated according to the following formula:

   ν = (r − 1)·(c − 1) = (3 − 1)·(2 − 1) = 2,   χ²crit. 0.05 (ν = 2) = 5.99
Conclusion:
χ² < χ²crit. 0.05 => the difference between the empirical and theoretical counts is insignificant (P > 0.05).
It means that the farms monitored do not differ in the mortality of new-born piglets; i.e. the frequencies of live-born and dead-born piglets do not differ among farms A, B, and C.
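The same r × c analysis can be reproduced in software; the sketch below (a minimal illustration, assuming NumPy and SciPy) computes the expected counts and the chi-square statistic from the piglet table above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# observed counts: rows = farms A, B, C; columns = live-born, dead-born piglets
observed = np.array([[96, 25],
                     [121, 22],
                     [89, 16]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.3f}, DF = {dof}, P = {p_value:.3f}")
print("expected counts:\n", np.round(expected, 2))
```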
9.4 Contingency Tables
(Analysis of relations between categorical data)
In many situations, nominal data for two variables may be tested for the hypothesis H0: the frequencies in the categories of one variable are independent of the frequencies in the second variable. E.g. whether the incidence (frequency) of a parasitic infection in dogs is the same in vaccinated as in non-vaccinated individuals (= Does the incidence of parasites depend on the vaccination? Is the vaccination effective?)
Observed data are arranged in a contingency table:
- number of rows – r (categories of variable 1: incidence of parasites)
- number of columns – c (categories of variable 2: vaccination)
The null hypothesis H0 for this contingency table: the frequencies in the columns are independent of the frequencies in the rows.
The method for the analysis of the contingency table r × c is the same as in the previous testing (see the test for differences between empirical and theoretical counts), i.e. it is based on the calculation of the Chi-square statistic. We test the null hypothesis (independence of the variables) by testing for a difference between the empirical and theoretical counts in this contingency table r × c.
E.g. it can be a situation, when we are concerned with the question whether some diseases
are associated with special breeds of cattle:
Variable 1 – Breeds of cattle (A, B, C)
Variable 2 – Diseases (1, 2)
Method:
1) We create the contingency table 3 × 2:

r \ c       Disease 1    Disease 2
Breed A       f11          f12
Breed B       f21          f22
Breed C       f31          f32
2) We calculate the theoretical frequencies f̂ij for all table cells (according to the known formula):

   f̂ij = (Ri · Cj) / n

   Ri – sums in rows
   Cj – sums in columns

3) We calculate the test statistic:

   χ² = Σ (fij − f̂ij)²/f̂ij     or     χ² = Σ fij²/f̂ij − n

4) We compare the calculated χ² with the critical value χ²(α, ν), where ν = (r − 1)·(c − 1).
5) Conclusion:
If the calculated Chi-square statistic is small, there is little dependence between the variables.
A large statistic indicates dependence. If the critical value is exceeded, then H0 (independence) is rejected and the dependence of the variables monitored is statistically proved (in this case it would mean that there is some dependence between the breeds (A, B, C) and the monitored diseases).
The contingency table 2 × 2 is a special case of the table r × c with only 2 categories in each variable. We can solve such a table either in the same way as the previous table r × c or by means of a special (shortened) method.
The following situation can be solved by a contingency table 2 × 2:
E.g.: Does the vaccination affect the incidence of parasites?
(Does the incidence of parasites depend on the vaccination?)
Variable A – vaccine application
Variable B – incidence of parasites
We test the null hypothesis H0: the incidence of parasites is not dependent on the
vaccination.
Method:
1) We create the contingency table 2 × 2:

            B         B´        Row sum
A           a         b         a + b
A´          c         d         c + d
Col. sum    a + c     b + d     n
a – Frequency of animals that have A and B (vaccinated animals that have parasites)
b – Frequency of animals that have A and B´ (vaccinated animals that have no parasites)
c – Frequency of animals that have A´ and B (not vaccinated animals that have parasites)
d – Frequency of animals that have A´ and B´ (not vaccinated animals that have no
parasites)
(a, b, c, and d represent the empirical frequencies)
2) We calculate the test statistic by means of the shortened formula:

   χ² = n·(a·d − b·c)² / [(a + b)·(c + d)·(a + c)·(b + d)]
Example:
In a sample of 50 dogs: 25 dogs got an experimental anti-parasitic substance, 25 dogs did not get the substance.
Does the substance affect the incidence of parasites in dogs?
Total: 25 + 25 = 50 animals (n).

   Test statistic: χ² = n·(15·16 − 9·10)² / [(a + b)·(c + d)·(a + c)·(b + d)] = 2.885
   ν = 1
χ²crit. 0.05 = 3.84
Conclusion:
χ² < χ²crit. => H0 is not rejected, i.e. the frequencies in the columns are independent of the frequencies in the rows (P > 0.05).
(It means that the tested substance does not affect the incidence of parasites in dogs.)
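A software sketch of the 2 × 2 analysis is given below (assuming NumPy and SciPy). The cell counts are one arrangement consistent with the calculation above; which cell corresponds to which category is an assumption, but the χ² value is the same either way:

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[15, 10],
                  [ 9, 16]])
a, b = table[0]
c, d = table[1]
n = table.sum()

# shortened formula for the 2 x 2 contingency table
chi2_short = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
print(f"chi-square (shortened formula) = {chi2_short:.3f}")

# the same value from SciPy (Yates' continuity correction switched off to match the formula)
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
print(f"chi-square = {chi2:.3f}, DF = {dof}, P = {p_value:.3f}")
```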
Appendix
Statistical Tables
Source:
Riffenburgh, R. H.: Statistics in medicine. ACADEMIC PRESS, San Diego, USA 1999
Zar, J. H.: Biostatistical Analysis. Prentice Hall, Upper Saddle River, N.J. 1999
List of Tables:
Appendix 1 Normal Distribution
Appendix 2 Critical values for Student’s t-distribution
Appendix 3 Critical values for χ² distribution, Right tail
Appendix 4 Critical values for χ² distribution, Left tail
Appendix 5 Critical values for Snedecor’s F-test
Appendix 6 Critical Values for Mann-Whitney U-test
Appendix 7 Critical values for Wilcoxon signed rank test
Appendix 8 Critical Values for the Spearman’s Rank Correlation Coefficient rS
Appendix 1 Normal Distribution 1)
1) For selected distances (z) to the right of the mean are given (a) one-tailed α, the area under the curve in the
positive tail; (b) one-tailed 1- α, the area under all except the tail; (c) two-tailed α, the areas combined for both
positive and negative tails; and (d) two-tailed 1- α, the area under all except the two tails. Entries for the most
commonly used areas are italicized.
Appendix 2 Critical values for Student’s t-distribution.
Appendix 3 Critical values for χ² distribution, Right tail
Appendix 4 Critical values for χ² distribution, Left tail
Appendix 5 Critical values for Snedecor’s F-test (two-tailed, α = 0.05)
Columns: Numerator DF; rows: Denominator DF
1 2 3 4 5 6 7 8 9
1 647.79 799.50 864.16 899.58 921.85 937.11 948.22 956.66 963.28
2 38.506 39.000 39.165 39.248 39.298 39.331 39.355 39.373 39.387
3 17.443 16.044 15.439 15.101 14.885 14.735 14.624 14.540 14.473
4 12.218 10.649 9.979 9.605 9.365 9.197 9.074 8.980 8.905
5 10.007 8.434 7.764 7.388 7.146 6.978 6.853 6.757 6.681
6 8.813 7.260 6.599 6.227 5.988 5.820 5.696 5.600 5.523
7 8.073 6.542 5.890 5.523 5.285 5.119 4.995 4.899 4.823
8 7.571 6.060 5.416 5.053 4.817 4.652 4.529 4.433 4.357
9 7.209 5.715 5.078 4.718 4.484 4.320 4.197 4.102 4.026
10 6.937 5.456 4.826 4.468 4.236 4.072 3.950 3.855 3.779
11 6.724 5.256 4.630 4.275 4.044 3.881 3.759 3.664 3.588
12 6.554 5.096 4.474 4.121 3.891 3.728 3.607 3.512 3.436
13 6.414 4.965 4.347 3.996 3.767 3.604 3.483 3.388 3.312
14 6.298 4.857 4.242 3.892 3.663 3.501 3.380 3.285 3.209
Appendix 6 Critical Values for Mann-Whitney U-test (2-tailed, α = 0.05)
n1
n2 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
2 16 18 20 22 23 25 27 29 31 32 34 36 38
3 15 17 20 22 25 27 30 32 35 37 40 37 45 47 50 52
4 16 19 22 25 28 32 35 38 41 44 47 50 53 57 60 63 67
5 23 27 30 34 37 42 46 49 53 57 61 65 68 72 76 80
6 31 36 39 44 49 53 58 62 67 71 75 80 84 89 93
7 41 46 51 56 61 66 71 76 81 86 91 96 101 106
8 51 57 63 69 74 80 86 91 97 102 108 114 119
9 64 70 76 82 89 95 101 107 114 120 126 132
10 77 84 91 97 104 111 118 125 132 138 145
11 91 99 106 114 121 129 136 143 151 158
12 107 115 123 131 139 147 155 163 171
13 124 132 141 149 158 167 175 184
14 141 151 160 169 178 188 197
15 161 170 180 190 200 210
16 181 191 202 212 222
17 202 213 224 235
18 225 236 248
19 248 261
20 273
Appendix 7 Critical values for Wilcoxon signed rank test
Appendix 8 Critical Values for the Spearman’s Rank Correlation Coefficient rS
References
Armitage, P., Berry, G., Matthews, J.N.S.: Statistical methods in medical research. Blackwell
Publishing, Oxford UK 2002, 814 p.
Ashcroft, S., Pereira, Ch.: Practical Statistics for the Biological Sciences. PALGRAVE
MACMILLAN, New York, N.Y. 2003, 167 p.
Carvounis, Ch.: Handbook of Biostatistics. PARTHENON PUBLISHING, New York, USA 2000,
103 p.
Riffenburgh, R. H.: Statistics in medicine. ACADEMIC PRESS, San Diego, California USA 1999,
581 p.
Zar, J. H.: Biostatistical Analysis. Prentice Hall, Upper Saddle River, N.J. 1999, 663 p.