Book
James McBroom
2020-12-05
Contents
1 Introduction To Statistics
1.1 Statistics
1.2 Individuals and Variables
1.3 Random Variables and Variation
1.4 Statistical Populations
1.5 Uncertainty, Error and Variability
1.6 Research Studies & Scientific Investigations
1.7 Probability
1.8 Statistical Inference
1.9 Using The R Software - Week 1
5 Week 7 - ANOVA
5.1 Statistical Models
5.2 Analysis of Variance (ANOVA) - The Concept
5.3 The One-Way Analysis of Variance
5.4 Using The R Software – Weeks 7/8
Chapter 1
Introduction To Statistics
Outline:
1. Introduction
2. Revision and Topics Assumed - Exploratory Data Analysis
(EDA) and Probability (separate document)
3. Using the R computer software:
• Accessing R
• Using R within R Studio.
• Creating A Basic R Script (numerical summaries, plots,
tables)
Accompanying Workshop (done in week 2):
• Using R - Using RStudio; Creating a project; writing and saving
commands;
• Entering data; viewing data;
• Basic EDA - summary(), boxplot(), table(), hist()
Workshop for Week 1
Nil
Project Requirements for Week 1
Nil
Things for you to Check in Week 1
• Ensure you have enrolled in a workshop;
• If you intend to use University computers ensure your comput-
ing account is active and accessible;
1.1 Statistics
• Why Statistics? (What is Statistics?)
• why do I need to use statistics?
• what is a statistical test?
• what do the results of a statistical test mean?
• what is statistical significance?
• how does probability fit in?
Statistical Inference requires assumptions about the data being analysed.
Lack of awareness and/or concern for assumptions leads to “misuse”:
You can prove anything with statistics.
There are three types of lies: lies, damned lies and statistics.
Statistics is concerned with scientific methods for collecting, organizing, summa-
rizing, presenting and analyzing data as well as with drawing valid conclusions
and making reasonable decisions on the basis of such analysis.
We can no more escape data than we can avoid the use of words. With data
comes variation. Statistics is a tool to understand variation.
There may be more than one population within the same problem. Some exam-
ples of different types of populations are:
Error ≠ uncertainty. Both are present to some extent in any scientific research.
1.7 Probability
A vital tool in statistics. See the assumed knowledge notes for a basic introduc-
tion.
The Study of Randomness – probability theory - describes random behaviour.
Note that random does not mean haphazard.
There are numerous schools of thought when it comes to defining what ‘proba-
bility’ means. One definition states:
… ‘empirical probability’ of an event is taken to be the ‘relative fre-
quency’ of occurrence of the event when the number of observations
is very large. The probability itself is the ‘limit’ of the relative fre-
quency as the number of observations increases indefinitely.
Note there are different conceptualizations of probability: empirical, theoretical,
subjective – We will assume the empirical approach in this course.
Population and Sample: Note that population does not necessarily refer to
people.
Population: the totality of individual observations about which inferences are
to be made
Sample: a subset of the population. The part of the population that we actually
examine in order to gather information.
Why Sample?
• (a) Cost: Resources available for study are limited, as are time and effort.
• (b) Utility: In some cases items may be destroyed in the process of sampling.
• (c) Accessibility: Impracticable or even impossible to measure an entire population.
Example of inference in every day life:
On a cold morning, should we wear warm clothes for the day ahead, or will it
warm up during the day?
Statistical Inference: a set of procedures used in making appropriate conclu-
sions and generalisations about a whole (the population), based on a limited
number of observations (the sample).
In general: Statistics are calculated from a sample and describe that sample.
• Parameters are inferred from the calculated statistics.
• Parameters are used to describe the population.
• Therefore, statistics can be used to infer things about a popula-
tion.
If all the population values were available, the population parameters would
simply be found by calculating them from the values available. This is what the
Australian Bureau of Statistics attempts to do when it conducts the census.
A population parameter has no error. It is an exact value which never changes
(assumption).
Sample statistics depend on the particular sample of data selected and thus
may vary in value; they represent a random variable and as such have variation.
A statistic should never be quoted without some estimate of its variation; usually
this is provided as a standard error of the statistic.
1.9 Using The R Software - Week 1
• Use of a hash at the start of a line indicates R is not to read that line – this
is ideal for putting comments and reminders in your code.
• R uses “object oriented” programming – everything in R is, or can be
made into, an “object”. For our purposes this basically means we can
and should give everything we do a name using the assignment operator.
For example x <- rnorm(1000). There is now an object called x that
contains 1000 random numbers from a normal distribution. Next time we
need those numbers, we simply use x. The kinds of names we should use
will be discussed in lectures.
• We write our code in the script window in RStudio, and we run it by
putting the cursor on the line of code we wish to run and clicking the run
button.
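To make these points concrete, here is a minimal sketch of what a first script might contain (the object x and rnorm() come from the example above; mean() and hist() are simply standard R functions used here for illustration):
# Anything after a hash is ignored by R - use it for comments and reminders
x <- rnorm(1000)   # create an object called x holding 1000 random normal values
mean(x)            # re-use the object later, e.g. to calculate its mean
hist(x)            # or to draw a histogram of the stored values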
A Inputting DATA
Most statistical analyses start with data, and so most analyses in R begin by
entering, or reading in, data. R has many functions to read in data (mostly
related to the kind of file the data is originally stored in, like Excel or text files).
Data should be given a name so that it is easily available after we input it. For
example:
rain.dat <- a.function.that.reads.in.data("from an excel file")
In this example we are using a function (I made its name up – we’ll see the
correct function name shortly) to read in data from a file (I’ve made up a
strange name for the file too…!) and I’ve put the result into an object that
I’ve called rain.dat. Now there is a data set in R called rain.dat, and if I
remember to save each time I quit R, rain.dat will be there when I next open
R, forever, or until I choose to delete it. I never have to read the data into R
again after this initial data input.
So let’s look at a proper example. The table below gives the rainfall over four
seasons in five different districts.
                      Season
District   Winter   Spring   Summer   Autumn
1            23       440      800       80
2           250       500     1180      200
3           120       400      420      430
4            10        20       30        5
5            60       200      250      120
Here there are three variables, district, season and rainfall – each measurement
has 3 parts to it - trivariate. Note that the data is in a compressed form –
typically each variable should have its own column, with one row per measurement, like this:
District   Season   Rainfall
1          Winter   23
1          Spring   440
1          Summer   800
1          Autumn   80
2          Winter   250
(and so on, down to District 5, Autumn, 120)
An Excel file with the data in this form can be found in the lecture notes folder
for week 1.
There are two ways to import this data into R. The first is to use the Import
Dataset menu in the Environment window in RStudio. This will be shown in
the lecture.
The second way is a useful shortcut when you have small to moderate sized data
(say less than a few thousand rows of data). R allows you to copy the data using
your mouse/keyboard, and then enter it using the following code:
rainfall <- read.table("clipboard", header = T) # On Windows
or
rainfall <- read.table(pipe("pbpaste"), header = T) # On Mac OS X
This will also be demonstrated in lectures.
B Accessing, Exploring, Graphing and Analysing
Once the data set has been entered various analyses can proceed. R uses func-
tions to do analyses, graphs and summaries.
1. Printing/Viewing: typing the name of the variable/dataset will
print it in the command window:
rainfall <- read.csv("rainfall.csv")
rainfall
## 8 2 Autumn 200
## 9 3 Winter 120
## 10 3 Spring 400
## 11 3 Summer 420
## 12 3 Autumn 430
## 13 4 Winter 10
## 14 4 Spring 20
## 15 4 Summer 30
## 16 4 Autumn 5
## 17 5 Winter 60
## 18 5 Spring 200
## 19 5 Summer 250
## 20 5 Autumn 120
Accessing variables within a dataset is achieved by giving the name of the
dataset, then a dollar sign, then the name of the variable, like this:
datasetname$variablename:
rainfall$Rainfall
## [1] 23 440 800 80 250 500 1180 200 120 400 420 430 10 20 30
## [16] 5 60 200 250 120
rainfall$District
## [1] 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5
2. Basic EDA – numerical summary of data using summary() and
table() functions:
For the rainfall example:
summary(rainfall)
table(rainfall$Season)
##
## Autumn Spring Summer Winter
##      5      5      5      5
hist(rainfall$Rainfall)
[Figure: histogram of rainfall$Rainfall, showing frequency against rainfall]
boxplot(rainfall$Rainfall)
[Figure: boxplot of rainfall$Rainfall]
qqnorm(rainfall$Rainfall)
qqline(rainfall$Rainfall)
[Figure: normal Q-Q plot of rainfall$Rainfall with reference line]
barplot(table(rainfall$Season))
[Figure: barplot of the number of observations in each season]
Chapter 2
Week 2 - Chi-Squared Tests
Outline:
1. Statistical Inference
• Introductory Examples: Goodness of Fit
• Test Statistics & the Null Hypothesis
– The Null Hypothesis: 𝐻0
– The Test Statistic, 𝑇
– Distribution of the Test Statistic
– The Null Distribution
– Degrees of Freedom
– The Statistical Table
– Using the 𝜒2 Table
– Examples of Using the Table
– The Significance Level 𝛼, and the Type I Error
– The Goodness of Fit Examples Revisited
• The Formal Chi Squared, 𝜒2 , Goodness of Fit Test
• The Chi Squared, 𝜒2 , Test of Independence – The two-
way contingency table
2. Using R
• Using the rep and factor functions to enter repeating categorical
data into R.
Mail (July 2017), and gives a summary (in the first two columns) of the number
of times each of the numbers, 1 to 45, has been drawn over a large subset of
games.
The total of all the numbers of times drawn is 7120 (166 + 162+ … + 155 + 152)
and, since eight numbers are drawn each time, this means that the data relate
to 890 games. The organisers guarantee that the mechanism used to select the
numbers is unbiased: therefore all numbers are equally likely to be selected.
Is this true?
How different from this expected number of 158.2222 can an observed number
of times be before we begin to claim ‘foul play’?
The table below also shows the differences between 158.2222 and the observed
frequency of occurrence for each number. Intuitively it would make sense to
add up all these differences and see how big the sum is. But, this doesn’t work.
Why? Also in the table are the squares for each of the differences – maybe the
sum of these values would mean something? But if these squared differences
are used, all differences of, say, two are treated equally. Is it fair
for the difference between the numbers 10 and 12 to be treated the same as the
difference between 1000 and 1002? A way to take this into account is to scale
the squared difference by dividing by the expected value – in this example this
scaling will be the same as all expected values are 158.2222.
The final column in the table gives the scaled squared differences:
$$T = \sum_{i=1}^{k} \frac{(O - E)^2}{E} \qquad (2.2)$$
where the summation takes place over all the bits to be added – that is, the
total number of categories for which there are observations. Here 𝑘 = 45, the
45 Gold Lotto numbers.
How bad is the lack of fit? Is a sum of 31.504 (31.386 in 2015) a lot more than
we would expect if the numbers are drawn randomly? Are any of the values in
the last column ‘very’ big? How big is ‘very’ big for any single number?
Plastic gloves are commonly used in a variety of factories. The following data
were collected in a study to test the belief that all parts of such gloves wear out
equally. Skilled workers who all carried out the same job in the same factory
used gloves from the same batch over a specified time period. Four parts on the
gloves were identified: the palm, fingertips, knuckle and the join between the
fingers and the palm. A ‘failure’ is detected by measuring the rate of permeation
through the material; failure occurs when the rate exceeds a given value. A total
of 200 gloves were assessed.
If the gloves wear evenly we would expect each of the four positions to have
the same number of first failures. That is, the 200 gloves would be distributed
equally (uniformly) across the four places and each place would have 200/4
= 50 first failures. How do the numbers shape up to this belief of a uniform
distribution? Are the observed numbers much different from 50?
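As a rough sketch of how such a statistic could be computed in R, suppose the observed first-failure counts for the four positions were those in the hypothetical vector below (the real counts are in the lecture table, which is not reproduced here); the expected count of 50 per position is the only value taken from the example:
observed <- c(60, 75, 30, 35)        # hypothetical counts: palm, fingertips, knuckle, join
expected <- rep(200 / 4, 4)          # 50 expected first failures at each position
T.stat <- sum((observed - expected)^2 / expected)
T.stat                               # compare with the chi-squared table on 4 - 1 = 3 df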
Perhaps the best-known historical goodness of fit story belongs to Mendel and
his peas. In his work (published in 1866), Mendel sought to prove what we now
accept as the basic laws of Mendelian genetics in which an offspring receives
discrete genes from each of its parents. Mendel carried out numerous breeding
experiments with garden peas which were bred in a way that ensured they were
heterozygote, that is, they had one of each of two possible alleles, at a number of
single loci each of which controlled clearly observed phenotypic characteristics.
One of the alleles was a recessive and the other dominant so that for the recessive
phenotype to be present, the offspring had to receive one of the recessive alleles
from both parents. Individuals who had one or two of the dominant alleles would
all appear the same phenotypically.
This phenomenon is illustrated in the table below for the particular pea char-
acteristic of round or wrinkled seed form. The wrinkled form is recessive, thus
for a seed to appear wrinkled it must have a wrinkled allele from both of its
parents; all other combinations will result in a seed form that appears round.
The two possible alleles are indicated as r and w.
From the table it can be seen that the probabilities associated with the wrinkled
and round phenotypic seed forms are ¼ and ¾, respectively. The following table
gives the actual results from one of Mendel’s experiments.
𝐻0 ∶ the alleles making up an individual’s genotype associate independently and each has a probability of 0.5.
𝐻1 ∶ the alleles making up an individual’s genotype do not associate independently.
𝐻0 : null hypothesis.
𝐻1 or 𝐻𝐴 : alternative hypothesis.
The basis of all statistical hypothesis testing is to assume that the null hypoth-
esis is true and then see what the chances are that this situation could produce
the data that are obtained in the sample. If the sample data are unlikely to be
obtained from the situation described by the null hypothesis, then probably the
situation is different; that is, the null hypothesis is not true.
The first step is to determine some measure which expresses the ‘difference’ in
a meaningful way. In the above examples the measure used is the sum of the
scaled squares of the differences. We call this measure the test statistic.
In the Gold Lotto example the test statistic was 31.504 across the entire 45
numbers. How ‘significant’ is this value of 31.504? What sorts of values would
we get for this value if the observed frequencies had been slightly different –
suppose the number of times a 19 was chosen was 160 instead of 171; or that
27 had been observed 138 times instead of 130. If the expected value for each
number is 158.22, how much total difference from this across the 45 numbers can
be tolerated before a warning bell sounds to the authorities? What distribution
of values is ‘possible’ for this test statistic from different samples if the expected
frequency truly is 158.22? If all observed frequencies were very close to 158
and 159, the test statistic would be about zero. But, we would not really be
surprised (would not doubt that the numbers occur equally often) if all numbers
were 158 or 159 except two, one having an observed value of 156 say, and the
other 160, say. In this case the test statistic would be < 0.2 – this would not
be a result that would alarm us. Clearly there are many, many possible ways
that the 7120 draws could be distributed between the 45 numbers without us
crying ‘foul’ and demanding that action be taken because the selection process
is biased.
It would be impossible to list out all the possible sets of outcomes for the
Gold Lotto case in a reasonable amount of time. Instead, consider the simple
supposedly unbiased coin which is tossed 20 times. If the coin truly is unbiased
we expect to see 10 heads and 10 tails. Would we worry if the result was 11
heads and 9 tails? or 12 heads and 8 tails?
Using the same test statistic as in the Lotto example, these two situations would
give:
$$T = \frac{(9-10)^2}{10} + \frac{(11-10)^2}{10} = 0.2$$
and
$$T = \frac{(8-10)^2}{10} + \frac{(12-10)^2}{10} = 0.8.$$
With the extreme situation of 15 heads and 5 tails, the test statistic starts to get much bigger: $T = \frac{(5-10)^2}{10} + \frac{(15-10)^2}{10} = 5$.
Each of the possible outcomes for any test statistic has an associated probability
based on the hypothesised situation. It is the accumulation of these probabilities
that gives us the probability distribution of the test statistic under the null
distribution.
In general, the test statistic, which has been calculated using the sample of
data, must be referred to the appropriate probability distribution. Depending
on where the particular value of the test statistic sits in the relative scale of
probabilities of the null distribution (on how likely the particular value is), a
decision can then be made about the validity of the proposed belief (hypothesis).
In the Gold Lotto example, we have a test statistic of 31.504 but, as yet, we
have no way of knowing whether or not this value is likely if the selection is
unbiased. We need some sort of probability distribution which will tell us which
values of the test statistic are likely if the ball selection is random. If we had this
distribution we could see how the value of 31.504 fits into the belief of random
ball selection. If it is not likely that a value as extreme as this would occur,
then we will tend to reject the belief that selection of the numbers is fair.
The test statistic used in the above examples was first proposed by Karl Pearson
in the late 1800s. Following his publication in 1900 it became known as
The Pearson (Chi-Squared) Test Statistic
In his 1900 work, Pearson showed that his test statistic has a probability distri-
bution which is approximately described by the Chi-Squared distribution. This
is a theoretical mathematical function which has been studied and whose prob-
ability distribution has a shape like this:
The exact shape of the chi-squared distribution changes depending on the de-
grees of freedom involved in the calculation process. Degrees of Freedom (DF)
will be discussed below.
The distribution shows what values of the test statistic are likely (those that
have a high probability of occurring). Values in the tail have a small probability
of occurring and so are unlikely. If our calculated test statistic lies in the outer
extremes of the distribution we will doubt the validity of the null distribution as
a description of the population from which our sample has been taken. Since we
are interested in knowing whether or not our calculated test statistic is ‘extreme’
within the probability distribution, we will look at the values in the ‘tail’ of the
distribution – this is discussed further in section 1.3.3.
For the Gold Lotto there are 45 categories (the numbers) for which expected
values are needed. The sum across these 45 expected values must be 7120, thus
only 44 frequencies could be selected freely, the final or 45th frequency being
completely determined by the 44 previous values and the total of 7120.
In the occupational health and safety example there are four categories (the
four parts of the glove) across which the sum of the expected frequencies must
be 200. The degrees of freedom must be three (4 – 1 = 3).
The banking institution example - four categories giving three degrees of free-
dom.
For Mendel’s peas - 2 categories giving 1 degree of freedom.
A number of the mathematical probability distributions, including the chi-
squared, which are used in statistical inference vary a little depending on the
degrees of freedom associated with them. The effects of different degrees of
freedom vary for different types of distributions. For chi-squared, the effects are
discussed below.
Degrees of freedom are usually denoted by the Greek letter 𝜈, which is pronounced ‘new’ (this is the most common symbol
used for degrees of freedom). Values for 𝜈 are given in the first column of the
table. The remaining columns in the table refer to quantiles, 𝜒2 (𝑝) , which are
the chi-squared values required to achieve the specific cumulative probabilities
𝑝 (ie probability of being less than or equal to the quantile).
We are interested in extreme values; that is, exceptionally large values of
the test statistic. Thus, the interesting part of the distribution is the right
hand tail. To obtain the proportion of chi-squared values lying in the extreme
right hand tail of the distribution, that is, the proportion of chi-squared values
greater than the particular quantile, the nominated 𝑝 is subtracted from 1. The
actual numbers in the bulk of the table are the 𝜒2 values that give the stated
cumulative probability, the quantile, for the nominated degrees of freedom. For
this particular table the probabilities relate to the left hand tail of the distri-
bution and are the probability of getting a value less than or equal to some
specified 𝜒2 value. For example, the column headed 0.950 contains values, say
A, along the x-axis such that $\Pr(\chi^2_\nu \le A) = 0.95$.
Conversely, there will be 5% of values greater than A: $\Pr(\chi^2_\nu > A) = 1 - 0.95 = 0.05$.
This can be seen graphically in the following figure, where the chi-squared
distribution for one degree of freedom is illustrated.
As you look down the column for 𝜒2 (0.95) you will notice that the values of A
vary. The rows are determined by the degrees of freedom, and thus the A value
also depends on the degrees of freedom. What is actually happening is that the
𝜒2 distribution changes its shape depending on the degrees of freedom, 𝜈. The
value of the degrees of freedom is incorporated into the expression as follows:
$\chi^2_\nu(0.95)$
This means the point in a 𝜒2 with 𝜈 degrees of freedom that has 95% of values
below it (or, equally, 5% of values above it).
(Note that the table does not have a row for df = 44 so we use the
next highest df, 50)
The calculated test statistic for the gold lotto sample data is T =
31.504, which does not lie in the critical region.
Do not reject 𝐻0 and conclude that the sample does not provide
evidence to reject the proposal that the 45 numbers are selected at
random (𝛼 = 0.05).
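If you have R available, the exact critical value and tail probability can be found directly rather than from the printed table (which forced the use of the df = 50 row); this sketch simply re-uses the df = 44 and T = 31.504 values quoted above:
qchisq(0.95, df = 44)         # exact 5% critical value for 44 df (about 60.5)
1 - pchisq(31.504, df = 44)   # probability of a value at least as extreme as 31.504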
Degrees of freedom = 3.
Degrees of freedom = 3.
4. Mendel’s Peas
Calculated test statistic: T = 0.5751 which does not lie in the critical
region.
Do not reject 𝐻0 and conclude that the sample does not provide
sufficient evidence to reject the proposal that the alleles, each with
probability of 0.5, are inherited independently from each parent (
𝛼 = 0.05).
$$T = \sum_{i=1}^{k} \frac{(O - E)^2}{E}$$
calculated using the sample data.
Null Distribution:
The Chi-Squared, 𝜒2 , distribution and table.
Significance Level, 𝛼:
Assume 0.05 (5%).
Critical Value, A:
Will depend on the degrees of freedom 𝜈, which in turn depends on
the number of possible outcomes (categories, 𝑘).
Critical Region:
That part of the distribution more extreme than the critical value –
part of the distribution where the 𝜒2 value exceeds A.
Conclusion:
If 𝑇 > 𝐴 (i.e. 𝑇 lies in the critical region) reject 𝐻0 in favour of the
alternative hypothesis 𝐻1 .
Interpretation:
If 𝑇 > 𝐴 the null hypothesis is rejected. Conclude that the alter-
native hypothesis is true with significance level of 0.05. The null
hypothesis has been falsified.
Recall the use of the definition of independent events in the basic rules of prob-
ability: this is the way in which the expected values are found.
The main difference between this form of the chi-squared test and the goodness
of fit test lies in the calculation of the degrees of freedom. If the concept does
not change, what would you expect the degrees of freedom to be for the example
given below? This will be discussed in lectures.
Example
One hundred students were selected at random from the ESC School and their
hair and eye colours recorded. These values have then been summarised into a
two-way table as follows.
Do these data support or refute the belief that a person’s hair and eye colours
are independent?
From the table, the probabilities for each eye colour are found by taking the
row sum and dividing by the total number of people, 100.
$$\Pr(\text{blue eyes}) = \frac{18}{100} = 0.18$$
$$\Pr(\text{green/hazel eyes}) = \frac{35}{100} = 0.35$$
$$\Pr(\text{brown eyes}) = \frac{47}{100} = 0.47$$
Similarly, the probabilities for the hair colours are found using the column totals.
$$\Pr(\text{brown/black hair}) = \frac{70}{100} = 0.70$$
$$\Pr(\text{blonde hair}) = \frac{20}{100} = 0.20$$
$$\Pr(\text{red hair}) = \frac{10}{100} = 0.10$$
The combined probabilities for each combination of hair and eye colour, under
the hypothesis that these characteristics are independent, are found by simply
multiplying the probabilities together. For example, under the assumption that
hair and eye colour are independent, the probability of having blue eyes and red
hair is:
$$\Pr(\text{blue eyes and red hair}) = \Pr(\text{blue eyes}) \times \Pr(\text{red hair}) = 0.18 \times 0.10 = 0.018$$
We would therefore expect 0.018 of the 100 people (ie 1.8 people) to have both
blue eyes and red hair, if the assumption that hair and eye colour are
independent is true. We can calculate the expected values of all combinations
of hair and eye colour in this way. In fact by following the example we can
formulate an equation for the expected value of a cell defined by the rth row
and cth column as:
$$E_{rc} = \frac{(\text{row } r \text{ total}) \times (\text{column } c \text{ total})}{\text{grand total}}$$
Using this formula, we can calculate the expected values of each cell as shown
in the following table. Note that the expected values are displayed in italics:
Once the observed and expected frequencies are available, Pearson’s Chi Squared
Test Statistic is found in the same way as before.
We can type the rainfall measurements into R as follows (note that we are typing
it in by row, not by column – this makes a difference as to how we do the next
bit):
rain <- c(23, 440, 800, 80,
250, 500, 1180, 200,
120, 400, 420, 430,
10, 20, 30, 5,
60, 200, 250, 120)
rain
## [1] 23 440 800 80 250 500 1180 200 120 400 420 430 10 20 30
## [16] 5 60 200 250 120
Next we can create the variable district using the rep() function. This func-
tion repeats the values you give it a specified number of times:
district <- rep(1:5, c(4, 4, 4, 4, 4))
district
## [1] 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5
rep() takes two arguments, separated by a comma. The first argument is the
things you want to replicate (in this case the numbers from 1 to 5 inclusive). The
second argument is the number of times you want these things to be replicated.
Note that the second argument must match the dimension of the first argument
– since there are 5 numbers in the first argument, there must be 5 numbers in
the second (each number in the second argument matches how many times its
counterpart in the first argument gets replicated). In this example we want the
numbers 1 to 5 to each be replicated 4 times (once for each season).
Note that we can use R functions inside R functions (even the same function). If
we look carefully at the second argument above, we can see that c(4,4,4,4,4)
is just the number 4 replicated 5 times. This could be written rep(4,5). This
kind of thing happens a lot (where you want to replicate each thing the
same number of times) and so rep also has an argument called each that
allows you to specify how many times each element in the first argument should
be replicated. Therefore, alternatives to the above code are:
district.a <- rep(1:5, rep(4, 5))
# OR
district.b <- rep(1:5, each = 4)
district.a
## [1] 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5
district.b
## [1] 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5
Now let’s look at the Season variable. The season variable is slightly different
to the district variable in that Season has character values (words/letters, not
numbers). This is not a major problem in R. However, to handle this type of
variable we need to introduce another function called factor(). The factor()
function is the way we tell R that a variable is categorical. Categorical variables
do not necessarily have to have character values (District is also a categorical
variable, for example), but any variable that has character values is categorical.
Entering character data into R is simple: we treat it like normal data but we
put each value in quotes:
seasons <- c("Winter", "Spring", "Summer", "Autumn")
seasons
## [1] "Winter" "Spring" "Summer" "Autumn"
season <- rep(seasons, 5)   # replicate the four season names 5 times
season <- factor(season)    # tell R that season is a categorical variable
season
## [1] Winter Spring Summer Autumn Winter Spring Summer Autumn Winter Spring
## [11] Summer Autumn Winter Spring Summer Autumn Winter Spring Summer Autumn
## Levels: Autumn Spring Summer Winter
This new variable season is a factor, and it replicates the character variable
seasons 5 times. Factors have levels. Levels are just the names of the categories
that make up the factor. R lists levels in alphanumeric order.
Note we could have done all this at once using:
season <- factor(rep(c("Winter", "Spring", "Summer", "Autumn"), 5))
season
## [1] Winter Spring Summer Autumn Winter Spring Summer Autumn Winter Spring
## [11] Summer Autumn Winter Spring Summer Autumn Winter Spring Summer Autumn
## Levels: Autumn Spring Summer Winter
Finally, remember we said District is also categorical (make sure you understand
why). If you create a variable but forget to make it categorical at the time by
using factor(), you can always come back and fix it:
district <- factor(district)
district
## [1] 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5
## Levels: 1 2 3 4 5
Now we have entered each variable into R manually, we can place them into their
own data set. R calls data sets “data frames”. The function data.frame() lets
you put your variables into a data set that you can call whatever you want:
rainfall.dat <- data.frame(district, season, rain)
rainfall.dat
ls()
##
## Chi-squared test for given probabilities
##
## data: lotto
## X-squared = 30.841, df = 44, p-value = 0.9333
We will talk more about p-values in later lectures. For now, if the p-value is
bigger than or equal to 0.05 it means we cannot reject the null hypothesis. If
the p-value is smaller than 0.05, we reject the null in favour of the alternative
hypothesis.
In this example we see that we cannot reject the null hypothesis, and conclude
that there is insufficient evidence to suggest the game is “unfair” at the 0.05
level of significance.
For the credit card debt example:
credit.cards <- c(70, 20, 50, 20) # Enter the data
# Note we have to give it the probabilities in the null hypothesis using the p = ... argument.
creditcard.test
##
## Chi-squared test for given probabilities
##
## data: credit.cards
## X-squared = 26.25, df = 3, p-value = 8.454e-06
Note that here we needed to specify the probabilities for each category (50%,
20% etc). Also note here that R outputs scientific notation for small p-values
– 8.454e-06 means 0.000008454, which is very much smaller than 0.05. We
therefore reject the null in favour of the alternative hypothesis and conclude
that the Bank’s claims about credit card repayments are false at the 0.05 level
of significance.
Below is the code required to do the test of independence between hair and eye
colour given in the example above:
hair.eyes <- as.table(rbind(c(5, 12, 1), c(25, 2, 8), c(40, 6, 1))) # Make the table with the data
hair.eyes
## A B C
## A 5 12 1
## B 25 2 8
## C 40 6 1
dimnames(hair.eyes) <- list(Eye = c("Blue", "Green", "Brown"),
Hair = c("Brown", "Blonde", "Red")) # We don't need to, but we can give the rows and columns meaningful names
hair.eyes
## Hair
## Eye Brown Blonde Red
## Blue 5 12 1
## Green 25 2 8
## Brown 40 6 1
haireye.test <- chisq.test(hair.eyes)
## Warning in chisq.test(hair.eyes): Chi-squared approximation may be incorrect
haireye.test
##
## Pearson's Chi-squared test
##
## data: hair.eyes
## X-squared = 39.582, df = 4, p-value = 5.282e-08
Note the warning message: one of the assumptions of the Chi-Squared test of
independence is that no more than 20% of all expected cell frequencies are less
than 5. We can see the expected cell frequencies by using:
haireye.test$expected
## Hair
## Eye Brown Blonde Red
## Blue 12.6 3.6 1.8
## Green 24.5 7.0 3.5
## Brown 32.9 9.4 4.7
We can see that 4 of the 9 cells have expected frequencies less than 5. This can
be an issue for the test of independence but ways to fix it are beyond the scope
of this course.
From the output we see the p-value is less than 0.05, so we therefore reject
the null in favour of the alternative hypothesis and conclude that hair and eye
colour are dependent on each other at the 0.05 level of significance.
Chapter 3
Week 3/4 - Probability Distributions and the Test of Proportion
Outline:
1. Revision and Basics for Statistical Inference
• Review – Revision and things for you to look up
• Types of Inference
• Notation
• Probability Distribution Functions
• Areas Under a Probability Distribution Curve - Probabilities
• Tchebychef’s Rule – applies for any shaped distribution
• Probability of a Range of the Variable
• Cumulative Probability
• Central Limit Theorem
2. Inference for Counts and Proportions – test of a pro-
portion
• An Example
• One- and Two-Tailed Hypotheses
• The p-value of a Test Statistic
3. Statistical Distributions
4. Using R
R:
3.1.3 Notation
population parameter: Greek Letter
sample statistic: Name - Upper case; observed value - lower case
sample statistic: is an ESTIMATOR of the population parameter - use of a
‘hat’ over the Greek symbol: $\hat{\theta}$, $\hat{\sigma}$, $\hat{\phi}$.
Some estimators are used so often they get a special symbol. E.g.: the sample mean,
$\bar{X} = \hat{\mu}$, the estimate of the population mean $\mu$.
Sometimes use letters eg SE for standard error – the standard deviation of a
sample statistic
Probability Distribution Functions: 𝑓(𝑥)
Statistical probability models:
Can be expressed in graphical form – distribution curve
• possible values of X along x-axis
• relative frequencies (or probabilities) for each possible value along y-axis
• total area under curve is 1; representing total of probabilities for all pos-
sible values/outcomes.
Shape can also be described by appropriate mathematical formula and/or ex-
pressed as a table of possible values and associated probabilities.
If the allowable values of X are discrete: Probability Mass Function (PMF),
𝑓(𝑥) = 𝑃 𝑟(𝑋 = 𝑥).
Tchebychef's Rule (applies for a distribution of any shape):
$$\Pr(\mu - k\sigma \le X \le \mu + k\sigma) \ge 1 - \frac{1}{k^2}$$
For a continuous random variable, the probability of a range of values is the area under the curve between the two limits:
$$\Pr(a < X < b) = \int_a^b f(x)\,dx.$$
For a discrete random variable, the probability of a set of values is the sum of their individual probabilities:
$$\Pr(X \in A) = \sum_{x \in A} f(x).$$
For example, for a single roll of a fair six-sided die:
$$\Pr(X = x) = \frac{1}{6}, \quad x = 1, 2, \ldots, 6,$$
where 𝑋 represents the random variable describing the side that lands upper-
most, and 𝑥 represents the possible values 𝑋 can take. This kind of distribution
is known as a Uniform distribution (why?).
Cumulative Probability: the cumulative distribution function (CDF) is
$$F(x) = \Pr(X \le x)$$
For the dice example, the cumulative probability distribution can be calculated
as follows:
X        1     2     3     4     5     6
f(x)    1/6   1/6   1/6   1/6   1/6   1/6
F(x)    1/6   2/6   3/6   4/6   5/6   6/6
We can also express the CDF for this example mathematically (note that this
is not always possible for all random variables, but it is generally possible to
create a table as above):
$$F(x) = \Pr(X \le x) = \frac{x}{6}, \quad x = 1, 2, \ldots, 6.$$
(Check for yourself that the values you get from this formula match those in the
table.)
The Central Limit Theorem (CLT): for sufficiently large samples of independent observations, the distribution of the sample mean is approximately normal, whatever the shape of the parent distribution. The theorem also requires the random variables to be identically
distributed, unless certain conditions are met. The CLT is what justifies approximating the distributions of many large-sample statistics by the normal distribution.
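The CLT can be illustrated by simulation; the sketch below is an illustration only (not one of the course examples) and draws repeated samples from a strongly right-skewed exponential distribution:
set.seed(1)                                                 # make the simulation repeatable
sample.means <- replicate(5000, mean(rexp(30, rate = 1)))   # 5000 sample means, each from n = 30
hist(sample.means)                                          # roughly bell-shaped despite the skewed parent
qqnorm(sample.means)
qqline(sample.means)                                        # points close to the line suggest approximate normality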
eucalyptus species in Northern New South Wales are suffering from the disease
die back.
Research Question:
Is the coin unbiased? That is, if it is tossed, is it just as likely to
come down with a head showing as with a tail? Is the probability of
seeing a head when the coin is tossed equal to ½?
What sort of experiment:
A single toss will not tell us much – how many tosses will we carry
out? Resources are limited so we decide to use only six.
What sort of data:
Success or failure – assume a head is a success.
Binary Data – each toss is a Bernoulli trial (an experiment with
only two possible outcomes).
What feature from the experiment will have meaning for the question:
The number of heads seen in the sample. If the coin is unbiased we
would expect to see three heads and three tails in 6 tosses. The number
of heads is a Binomial variable – the sum of a series of independent
Bernoulli trials.
Hypotheses:
We want to test the current belief that the probability of seeing a
head is 0.5. The null hypothesis always reflects the status quo and
assumes the current belief to be true. The alternative hypothesis is
the opposite of the null and reflects the reason why the research was
conducted.
Null Hypothesis:
𝐻0 : within the population of interest, the probability that a head
will be seen is ½.
𝐻0 ∶ 𝑃 𝑟(head) = 0.5.
Alternative Hypothesis:
𝐻1 : the distribution within the population is not as specified; the
probability that a head will be seen is not one half.
𝐻1 ∶ 𝑃 𝑟(head) ≠ 0.5.
Sample:
Selected at random from the population of interest – six random
throws.
Test Statistic:
In the example, the question simply raises the issue that the coin may not be
unbiased. There is no indication as to whether the possible bias will make a
head more likely or less likely. The results could be too few heads or too many
heads. This is a case 3 situation and is an example of a two-tailed hypothesis.
The critical value can be at either end of the distribution and the value of the
stipulated significance, 0.05, must be split between the two ends, 0.025 (or as
close as we can get it) going to each tail.
One-Tailed Hypothesis
Suppose instead that the researcher suspects the coin is biased in such a way as
to give more heads and this is what is to be assessed (tested). The alternative
hypothesis would be that the probability of a head is greater than ½: 𝐻1 ∶ 𝑝 >
0.5 – a case 1 situation.
Clearly the opposite situation could also occur if the researcher expected bias
towards tails leading to an alternative: 𝐻1 ∶ 𝑝 < 0.5. This is a case 2 situation.
In both of these cases, the researcher clearly expects that if the null hypothesis
is not true it will be false in a specific way. These are examples of a one-tailed
hypothesis.
The critical value occurs entirely in the tail containing the extremes anticipated
by the researcher. Thus for case 1 the critical value will cut off an upper tail.
For case 2 the critical value must cut off a lower tail.
Back to the Example
The example as given is a two-tailed situation thus two critical values are needed,
one to cut off the upper 0.025 portion of the null distribution, and the other
to cut off the lower 0.025 portion.
To find the actual critical values we look at the distribution as we did for chi-
squared.
AND NOW ANOTHER PROBLEM ARISES!!!!!!!
For chi-squared we had a continuous curve and the possible values could be
anything, enabling us to find a specific value for any significance level nominated.
Here we have discrete data (a count) with only the integer values from zero to
six and their associated probabilities. Working with 5% we want the values that
will cut off a lower and an upper probability of 0.025 each.
From the table we see:
• probability of being less than 0 = 0
• probability of being less than 1 = 0.01562
• probability of being less than 2 = 0.10938
The closest we can get to 0.025 in the lower tail is 0.01562 for a number of heads
of less than 1 (i.e. 0). Similar reasoning gives an upper critical value of greater
than 5 (i.e. 6) with a probability of 0.01562.
We cannot find critical values for an exact significance level of 0.05 in this case.
The best we can do is to use a significance level of 0.01562 + 0.01562 = 0.03124
and the critical values of 1 and 5 – approximately 97% of the values lie between
1 and 5, inclusive.
?? What significance level would you be using if you selected the
critical values of (less than) 2 and (greater than) 4 ??
Critical Region:
The part of the distribution more extreme than the critical values,
A. The critical region for a significance level of 0.03124 will be any
value less than 1 or any value greater than 5:
𝑇 < 1 or 𝑇 > 5.
Thus, if the sample test statistic (number of heads) is either zero
or six, then it lies in the critical region (reject the null hypothesis).
Any other value is said to lie in the acceptance region (cannot reject
the null hypothesis).
Test Statistic:
Calculated using the sample data – number of heads.
We now need to carry out the experiment
Collect the data:
Complete six independent tosses of the coin. The experimental re-
sults are: H T H H H H
Calculate the test statistic:
Test statistic (number of heads) = 5
Compare the test statistic with the null distribution:
Where in the distribution does the value of 5 lie?
Two possible outcomes:
1. T lies in the critical region - conclusion: reject 𝐻0 in favour
of the alternative hypothesis.
2. T does not lie in critical region – conclusion: do not reject
𝐻0 (there is insufficient evidence to reject 𝐻0 ).
Here we have a critical region defined as values of 𝑇 < 1 and values
of 𝑇 > 5. The test statistic of 5 does NOT lie in the critical region
so the null hypothesis 𝐻0 is not rejected.
Interpretation – one of two possibilities
$$\Pr(X = x; n, p) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, 2, \ldots, n, \quad 0 \le p \le 1,$$
where:
• 𝑋 is the name of the binomial variable – the number of successes
• 𝑛 is the sample size – the number of identical, independent observations;
• 𝑥 is the number of successes in the 𝑛 observations;
• 𝑝 is the probability of a success.
The mathematical model in words is:
the probability of observing 𝑥 successes of the variable, 𝑋, in a
sample of 𝑛 independent trials, if the probability of a success for any
single trial is the same and equal to 𝑝.
(Compare this formula to that discussed in the coin toss example box.)
Binomial Tables: Binomial probabilities for various (but very limited) values
of 𝑛 and 𝑝 can be found in table form. See the Tables folder on the L@G site.
Also note the folder contains a binomial table generator written in java script
that will show you probabilities for user-selected 𝑛 and 𝑝. R will also calculate
Binomial probabilities (see R section in these notes).
$$P(X = 2) = \binom{5}{2} \times 0.25^2 \times (1 - 0.25)^{5-2} = 10 \times 0.25^2 \times 0.75^3 = 0.2637.$$
The probability that this couple will have 2 children out of 5 with blood type O is
0.2637. Can you find this probability in the Binomial tables?
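R gives the same probability directly with dbinom() (covered in the R section later in these notes):
dbinom(2, size = 5, prob = 0.25)   # Pr(X = 2) for a Bin(5, 0.25) variable
## [1] 0.2636719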
3.3.1.1.2 Example 2: A couple have 5 children, and two of them have blood
type O. Using this data, test the hypothesis that the probability of the couple
having a child with type O blood is 0.25.
We are testing the following hypotheses:
𝐻0 ∶the probability of the couple having a child with type O blood is 0.25
𝐻0 ∶𝑃 (Blood Type O) = 0.25
versus
𝐻1 ∶the probability of the couple having a child with type O blood is not 0.25
𝐻1 ∶𝑃 (Blood Type O) ≠ 0.25
This is very similar to the coin tossing example above. Our test statistic will be
the number of children with type O blood, which we are told in the question is
T = 2.
The null distribution is the distribution of the test statistic assuming the null
hypothesis is true. 𝑇 is binomial with 𝑛 = 5 and 𝑝 = 0.25. This distribution is
shown here graphically:
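Since the figure itself is not reproduced here, the following sketch shows one way this null distribution could be computed and plotted in R:
x <- 0:5
null.probs <- dbinom(x, size = 5, prob = 0.25)   # Pr(T = 0), ..., Pr(T = 5) under H0
round(null.probs, 4)
barplot(null.probs, names.arg = x,
        xlab = "Number of children with type O blood", ylab = "Probability")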
Example 3: Suppose you answer 20 True/False questions and that, after 5 hours of study, the probability of answering any single question correctly is 0.6. The probability of getting exactly 16 correct is
$$P(X = 16) = \binom{20}{16} \times 0.6^{16} \times (1 - 0.6)^{20-16} = 4845 \times 0.6^{16} \times 0.4^4 = 0.0349.$$
The probability of getting 16/20 True/False questions correct (if the probability
of a correct answer is 0.60, and assuming your answers to each question are
independent) is 0.0349. You need to study harder!
This is the probability of getting exactly 16 correct. What is the probability of
getting 16 or less correct? We could sum the probabilities for getting 0, 1, 2,
3….16 correct (tedious!!) or we could note that 𝑃 (𝑋 ≤ 16) = 1 − 𝑃 (𝑋 > 16):
$$\begin{aligned} P(X \le 16) &= 1 - \sum_{x=17}^{20} P(X = x) \\ &= 1 - \left(P(X = 17) + P(X = 18) + P(X = 19) + P(X = 20)\right) \\ &= 1 - (0.012 + 0.003 + 0.0005 + 0.00004) \\ &= 0.984 \end{aligned}$$
Note: this is a one-tailed (upper) hypothesis test, since your research question
asks whether 5 hours of study will increase the chance of successfully answering
a question, 𝑝, over and above guessing (ie will 𝑝 be greater than 0.5?).
𝐻0 ∶𝑝 ≤ 0.5
𝐻1 ∶𝑝 > 0.5
Significance level 𝛼 = 0.05 (one tailed, all in the upper tail). Test statistic
𝑇 = 16.
To obtain the critical value we need to find the value in the upper tail that
has 0.05 of values (or as close as we can get to it) above it. We can sum the
probabilities backwards from 20 until we reach approximately 0.05:
So our critical value is 13, and the critical region is any value greater than 13.
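A quick way to do this 'summing backwards' in R, using the Bin(20, 0.5) null distribution assumed above, is with pbinom():
1 - pbinom(13, size = 20, prob = 0.5)   # Pr(T > 13), roughly 0.058
1 - pbinom(14, size = 20, prob = 0.5)   # Pr(T > 14), roughly 0.021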
The test statistic 𝑇 = 16 > 13. Therefore we reject 𝐻0 and conclude that the
probability of getting a T/F question correct is significantly greater than 0.50
if we study for 5 hours, at the 𝛼 = 0.05722 level of significance.
NOTE:
If we took the critical value to be 14, our significance level would be
0.02072 and our test statistic would still be significant (i.e. we would
still reject the null hypothesis).
New conclusion: The test statistic of T = 16 > 14. Therefore we
reject 𝐻0 and conclude that the probability of getting a T/F question
correct is significantly greater than 0.50 if we study for 5 hours, at
the 𝛼 = 0.02072 level of significance.
What is the main effect of reducing the significance level? We have
reduced the chance of a Type I error. Make sure you can explain
this. Can the significance level be reduced further?
In example 3 we found that the probability of getting 16/20 True/False questions
correct (if the probability of a correct answer is 0.60) is 0.0349. You should
perhaps study more.
If the probability of getting a correct answer really is 0.6, how many of the 20
questions would you expect to answer correctly?
0.6 × 20 = 12.
What should the probability of correctly answering a question be to make 16
correct answers out of 20 the expected outcome?
$$p \times 20 = 16 \quad \Rightarrow \quad p = \frac{16}{20} = 0.8$$
The mode of a random variable is the most frequent, or probable, value. For
the binomial distribution the mode is either equal to the mean or very close to
it (the mode will be a whole number, whereas the mean will not necessarily be
so). The mode of any particular Bin(𝑛, 𝑝) distribution can be found by perusing
the binomial tables for that distribution and finding the value with the largest
probability, although this becomes prohibitive as 𝑛 gets large. (There is a
formula to calculate the mode for the binomial distribution; however this goes
beyond the scope of this course.)
If 𝑝 > 0.5, the distribution has a left skew (not shown). If 𝑝 < 0.5, the distribution
has a right skew (first figure). The closer 𝑝 gets to either 0 or 1, the more skewed
(right or left, respectively) the distribution becomes.
The number of trials, 𝑛, mediates the effects of 𝑝 to a certain degree in the
sense that the larger 𝑛 is, the less skewed the distribution becomes for values of
𝑝 ≠ 0.5.
The Normal distribution has probability density function
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
where:
• 𝑥 is a particular value of the random variable, 𝑋, and 𝑓(𝑥) is the associated
probability;
• 𝜎2 is the population variance of the random variable, 𝑋;
• 𝜇 is the population mean of the random variable 𝑋.
We write: 𝑋 ∼ 𝑁 (𝜇, 𝜎2 ): “X is normally distributed with mean 𝜇 and variance
𝜎2 .”
Properties of the Normal probability distribution function
$$Z = \frac{X - \mu}{\sigma}$$
If there are 450 students in a particular Faculty who do part-time work, how
many of them would you expect to earn more than $7.00 per hour?
You do not need to do these by hand. The R functions qqnorm() (and qqline()
to add a reference line) do these for you. See the example R code in the week
1 lecture notes folder for an example of how to do these graphs. (Boxplots can
be used to show similar things.)
If $X \sim \mathrm{Bin}(n, p)$ then, approximately, $X \sim N(np,\; np(1-p))$.
$$T = \frac{X - np}{\sqrt{np(1-p)}} \sim N(0, 1) \quad \text{(approximately)}$$
The following example will illustrate how to use this formula to test hypotheses
about proportions when the sample size (number of trials) is large.
Forestry Example:
A forester wants to know if more than 40% of the eucalyptus trees in a particular
state forest are host to a specific epiphyte. She takes a random sample of 150
trees and finds that 65 do support the specified epiphyte.
• Research Question
• What sort of experiment?
• What sort of data?
• What feature of the data is of interest?
• Null Hypothesis
• Alternative Hypothesis
• One-tailed or two-tailed test?
• Sample size?
• Null Distribution?
• Test Statistic?
• Significance Level?
• Critical Value?
• Compare test statistic to critical value
• Conclusion
Although 65/150 = 0.43 is greater than 0.4, this on its own is not enough to
say that the true population proportion of epiphyte hosts is greater than 0.4.
Remember, we are using this sample to infer things about the wider population
of host trees. Of course, in this sample the proportion of hosts is greater than
0.4, but this is only one sample of 150 trees. What if we took another sample
of 150 trees from the forest and found that the sample proportion was 0.38?
Would we then conclude that the true population proportion was in fact less
than 0.4? Whenever we sample we introduce uncertainty. It is this uncertainty
we are trying to take into account when we do hypothesis testing.
How many host trees would we need to have found in our sample to feel confident
that the actual population proportion is > 40%? That is, how many host trees
would we need to have found in our 150 tree sample in order to reject 𝐻0 ?
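As a sketch only (not the worked lecture solution), the forestry example could be checked in R either with the normal approximation test statistic given above or with the exact binomial test:
n <- 150; x <- 65; p0 <- 0.4
T.stat <- (x - n * p0) / sqrt(n * p0 * (1 - p0))    # (65 - 60) / 6
T.stat                                              # 0.833; compare with qnorm(0.95) = 1.645
binom.test(x, n, p = p0, alternative = "greater")   # exact one-tailed test of the same hypotheses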
For the coin tossing example, Bin(6, 0.5), the probability of seeing two heads or fewer, 𝑃 𝑟(𝑋 ≤ 2), can be found by summing the individual probabilities with dbinom():
sum(dbinom(0:2, 6, 0.5))
## [1] 0.34375
# OR
pbinom(2, 6, 0.5)
## [1] 0.34375
What if we want the probability of finding exactly 2 heads: 𝑃 𝑟(𝑋 = 2)?
dbinom(2, size = 6, p = 0.5)
## [1] 0.234375
If we want the probability of seeing an upper extreme set, for example seeing
three or more heads, we can use the subtraction from unity approach as indicated
in the examples above:
𝑃 𝑟(𝑋 ≥ 3) = 1 − 𝑃 𝑟(𝑋 ≤ 2)
1 - pbinom(2, 6, 0.5)
## [1] 0.65625
Or, we can do each probability in the set individually and add them up (note,
this is only really a good option if you don’t have a large number of trials, or if
there are not a lot of probabilities to add up):
sum(dbinom(3:6, 6, 0.5))
## [1] 0.65625
For the standard normal distribution, find 𝑃 𝑟(𝑋 ≤ 1.96):
pnorm(1.96)
## [1] 0.9750021
Find 𝑃 𝑟(0.5 ≤ 𝑋 ≤ 1.96)
pnorm(1.96) - pnorm(0.5)
## [1] 0.2835396
Find 𝑃 𝑟(𝑋 > 1.96)
1 - pnorm(1.96)
## [1] 0.0249979
Find
1. 𝑃 𝑟(−1.96 ≤ 𝑋 ≤ 1.96); and
2. 𝑃 𝑟(|𝑋| > 1.96).
# 1.
pnorm(1.96) - pnorm(-1.96)
## [1] 0.9500042
# 2.
pnorm(-1.96) + (1 - pnorm(1.96))
## [1] 0.04999579
# OR
1 - (pnorm(1.96) - pnorm(-1.96)) ## MAKE SURE YOU UNDERSTAND WHY!
## [1] 0.04999579
attach(rainfall.dat)
# attaching a data frame lets us use the variable names directly
# (eg we can type 'rain' instead of needing to use 'rainfall.dat$rain')
# NEVER FORGET TO detach() THE DATA FRAME WHEN YOU ARE DONE!!!
We can find the mean rainfall for each district using the by() function:
by(rain, district, mean)
## district: 1
## [1] 335.75
## ------------------------------------------------------------
## district: 2
## [1] 532.5
## ------------------------------------------------------------
## district: 3
## [1] 342.5
## ------------------------------------------------------------
## district: 4
## [1] 16.25
## ------------------------------------------------------------
## district: 5
## [1] 157.5
There are always several ways to do the same thing in R. Another way
we could find the mean for each district is to use the tapply function:
tapply(rain, district, mean).
Which you use can often just boil down to a personal preference (eg you might
prefer the output from using by over the output from tapply). As an exercise,
try adding the tapply version to the end of the code box above and see which
you prefer.
Now that we are finished with the rainfall.dat data frame we should detach
it:
detach()
More examples will be shown in lectures – please see the accompanying R file
in the weeks 3/4 lecture notes folder.
Chapter 4
Week 5/6 - t-Tests
Outline:
1. Hypothesis Testing – General Process
• The Concept
• The Basic Steps for Hypothesis Testing – 10 steps
• The Scientific Problem and Question
• The Research Hypothesis
• Resources, Required Detectable Differences, Significance Level
Required
• The Statistical Hypotheses
• One and Two Tailed Hypotheses
• Theoretical Models used in Testing Hypotheses
• The Test Statistic, its Null Distribution, Significance Level and
Critical Region
• Sample Collection and Calculation of Sample Test Statistic
• Comparison of Sample Test Statistic with Null Distribution
• The 𝑝-Value of a Test
• Conclusions and Interpretation
• Possible Errors
• Power of a Statistical Test
2. Specific Tests of Hypotheses I
3. Belief: crabs in the Tweed River weigh less than those in the Brisbane
River.
• If 𝐻0 is true, this is the distribution we would expect the feature (or some
expression based on it) to have.
• The distribution for the population of ‘feature values’ if H0 is true – eg,
the distribution of the sample mean.
Significance Level – alpha, 𝛼
The risk you are willing to take that you will reject the null hypothesis when it
is really true. The probability of a Type I error. It defines the ‘cut off’ point
for the test statistic.
Critical Region
• Determined by the specified significance level, 𝛼
• The region of the null distribution where it is considered unlikely for a
value of the test statistic to occur.
• If sample value lies here, it is regarded as evidence to reject 𝐻0 in favour
of 𝐻1 .
The relationships of the test statistic to the sample and population
are critical.
                                 TRUE SITUATION
                                 𝐻0 is True            𝐻0 is False
TEST          Fail to Reject 𝐻0  Correct               Type II Error (𝛽)
CONCLUSION    Reject 𝐻0          Type I Error (𝛼)      Correct
𝐻0 ∶𝜇 = 𝜇0
𝐻1 ∶𝜇 ≠ 𝜇0
𝐻0 ∶𝜇 ≤ 𝜇0
𝐻1 ∶𝜇 > 𝜇0
𝐻0 ∶𝜇 ≥ 𝜇0
𝐻1 ∶𝜇 < 𝜇0
Using theory: If $X \sim N(\mu, \sigma^2)$ then $\bar{X}_n \sim N(\mu, \frac{\sigma^2}{n})$. Applying the standard
normal ($Z$) transform we get the test statistic:
$$T = \frac{\bar{X}_n - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1).$$
When $\sigma$ is unknown and estimated from the sample by $s$, the test statistic becomes:
$$T = \frac{\bar{X}_n - \mu_0}{s/\sqrt{n}} \sim t_{n-1}.$$
The hypothesised value for 𝜇 (𝜇0 ) is substituted into the formula along with the
values calculated from the sample – mean and standard deviation – to obtain
the sample test statistic. The calculated sample test statistic is compared with
the relevant critical value from the Student’s 𝑡 distribution with 𝑛 − 1 degrees
of freedom.
Example
THE QUESTION:
Fiddler crabs in the Tweed River appear to be heavier than those
reported in the literature, where the mean weight is given as 230gm.
Is this true?
IDENTIFY A FEATURE WHICH WILL HAVE MEAN-
ING FOR THE QUESTION:
The mean weight.
THE RESEARCH HYPOTHESIS:
The mean weight of fiddler crabs in the Tweed river is greater than
230gm.
DETERMINE THE RESOURCES, DETECTABLE DIFFERENCE, LEVEL OF SIGNIFICANCE
Estimation of sample size for a given detectable difference is dis-
cussed in a later section. Assume for this example that a sample of
16 crabs will be taken.
STATISTICAL HYPOTHESES:
𝐻0 ∶𝜇 ≤ 230
𝐻1 ∶𝜇 > 230
Interested in testing the mean – central limit theorem gives the dis-
tribution of the sample mean as normal with mean 𝜇 and variance
𝜎2 /16.
Standard deviation, 𝜎, is not known and will have to be estimated
from the sample using 𝑠. This means our null distribution is the
Student’s 𝑡 distribution.
TEST STATISTIC, NULL DISTRIBUTION & CRITICAL REGION:
The mean of a random sample of size 𝑛 from a variable with a
normal distribution 𝑁 (𝜇, 𝜎2 ) has a normal distribution 𝑁 (𝜇, 𝜎2 /𝑛).
Converting this to a 𝑍 format, and acknowledging that the popula-
tion standard deviation of the weights of crabs in the Tweed River
is not known gives a test statistic:
$$T = \frac{\bar{X}_n - \mu_0}{s/\sqrt{n}} \sim t_{n-1}.$$

The sample of 16 crabs gave a mean weight of 240 gm and a standard deviation of
24 gm, so the sample test statistic is:

$$T = \frac{240 - 230}{24/\sqrt{16}} = \frac{10}{6} = 1.667.$$
Using R:
1 - pt(1.6667, 15)
## [1] 0.05815621
The 𝑝-value for the calculated 𝑇 of 1.667 on 15 df is 0.058. (This is
larger than 0.05.)
MAKE CONCLUSION AND INFERENCES:
There is insufficient evidence to reject the null hypothesis (𝑝 ≥ 0.05).
The sample data do not support the research hypothesis that the
mean weight of crabs in the Tweed is greater than that reported in
the literature (230gm), at the 0.05 level of significance.
SPECIFY THE ERROR YOU MAY BE MAKING IN
YOUR INFERENTIAL CONCLUSIONS:
The researcher may be incorrect in not rejecting the null hypothesis
in favour of the research hypothesis – a type II error. The probability
associated with this error is unknown unless the true alternative
value of the mean weight for Tweed river crabs is known. The failure
to reject the null may simply reflect a low powered test.
Question: What if the standard deviation for the sample had been
20?
R: A one-sample t-test can be carried out using t.test() in R. See R section
for details.
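For the crab example, if the 16 sample weights had been entered into a vector
(crab.wt is a hypothetical name used only for illustration), the whole test is a
single call:

# crab.wt: hypothetical vector holding the 16 sample crab weights
t.test(crab.wt, mu = 230, alternative = "greater")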
where 𝜇1 and 𝜇2 are the respective means of the two populations to be compared.
One-tailed Hypothesis:
𝐻0 ∶𝜇1 ≤ 𝜇2
𝐻1 ∶𝜇1 > 𝜇2
Or
One-tailed Hypothesis:
𝐻0 ∶𝜇1 ≥ 𝜇2
𝐻1 ∶𝜇1 < 𝜇2
The situation where 𝜎1 and 𝜎2 are known is most unlikely and will not be
discussed.
The estimation of $\sigma^2_{\bar{X}_1 - \bar{X}_2}$ depends on whether or not 𝜎1 and 𝜎2 can be assumed
equal.
Let 𝑛1 and 𝑛2 denote the sample sizes taken from populations 1 and 2, respec-
tively. Let 𝑠1 and 𝑠2 denote the standard deviations of each sample taken from
populations 1 and 2, respectively.
Standard deviations unknown but assumed equal (Pooled Procedure)

Therefore

$$T = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1 + n_2 - 2} \text{ if } H_0 \text{ is true.}$$

Standard deviations unknown and not assumed equal

$$T = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
NOTE:
• For large sample sizes, 𝑇 ∼ 𝑍.
• For small sample sizes, 𝑇 ∼ 𝑡 with weighted DF.
In this course, the situation of unequal standard deviations and small samples
will not be considered further.
Question: Why would a test to compare means be of interest if pop-
ulations have unequal standard deviations?
The hypothesised value for (𝜇1 − 𝜇2 ) under 𝐻0 is substituted into the formula
along with the values calculated from the samples (means and standard devia-
tions) to obtain the sample test statistic. The calculated sample test statistic is
compared with the relevant critical value of 𝑡.
NOTE: When comparing the means from two populations using the test statis-
tic shown above, the choice of which sample mean is subtracted from the other
is arbitrary. For two-tailed hypotheses this is not an issue. However, it can
create an issue for one-tailed tests when deciding whether the test is upper or
lower tailed. This will be discussed further in lectures.
Example
THE QUESTION:
Fiddler crabs in the Tweed River appear to be heavier than fiddler
crabs in the Brisbane River. Is this true?
IDENTIFY A FEATURE WHICH WILL HAVE MEANING FOR THE QUESTION:
The difference between the mean weights of crabs in the two loca-
tions.
THE RESEARCH HYPOTHESIS:
Mean weight for Tweed River crabs is greater than the mean weight
for Brisbane River crabs.
DETERMINE THE RESOURCES, DETECTABLE DIFFERENCE, LEVEL OF SIGNIFICANCE
Sample size? Assume sample sizes of 16 and 25 have been taken
from Tweed and Brisbane rivers respectively. Following tradition,
take the level of significance to be 𝛼 = 0.05.
STATISTICAL HYPOTHESES:
𝐻0 ∶𝜇T ≤ 𝜇B
𝐻1 ∶𝜇T > 𝜇B
$$T = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1 + n_2 - 2} \text{ if } H_0 \text{ is true,}$$

where

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}.$$
# Sp^2
sp.sqrd <- ((16-1)*24^2 + (25-1)*18^2)/(16 + 25 -2)
round(sp.sqrd, 3)
## [1] 420.923
# Test Stat (sample means of 240 and 215 are assumed here; they are not shown
# in the notes but are consistent with the value reported below)
test.stat <- (240 - 215)/(sqrt(sp.sqrd) * sqrt(1/16 + 1/25))
round(test.stat, 3)
## [1] 3.806
COMPARE THE TEST STATISTIC WITH THE NULL
DISTRIBUTION:
MAKE CONCLUSION AND INFERENCES:
SPECIFY THE ERROR YOU MAY BE MAKING IN
YOUR INFERENTIAL CONCLUSIONS:
R: Two-sample t-tests can be carried out using t.test() - see Using R section.
𝐻0 ∶ There is no difference between the mean blood pressures in the two popu-
lations
𝐻1 ∶ There is a difference between the mean blood pressures in the two popula-
tions
Since the two populations are paired (ie the same individuals are measured
twice), we are actually testing whether the population mean difference (𝜇𝐷 ) is
zero or not:
𝐻0 ∶𝜇𝐷 = 0
𝐻1 ∶𝜇𝐷 ≠ 0
This is just a one-sample t-test on the difference data: ie, use the difference
data as the sample and calculate its sample mean (𝑋 𝐷 ) and sample standard
deviation (𝑠𝐷 ). We can then use these in the one-sample t-test test statistic:
$$T = \frac{\bar{X}_n - \mu_0}{s/\sqrt{n}} = \frac{\bar{X}_D - \mu_D}{s_D/\sqrt{n}} = \frac{2.5 - 0}{5.5/\sqrt{12}} = 1.574.$$
From tables, 𝑡11 (0.975) = 2.201. Since 𝑇 = 1.574 is not greater than 2.201
(nor less than −2.201) we cannot reject the null at the 0.05 level of significance.
We conclude there is insufficient evidence to suggest that the mean standing
blood pressure differs from the mean lying blood pressure.
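(The tabled critical value can be confirmed in R:)

qt(0.975, df = 11)   # 2.201, the critical value used above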
One-tailed versions will be discussed during lectures.
4.3 Using R
4.3.1 More Probability Functions:
Use the functions pnorm(z, mean, sd) and pt(t, df) to find the cumulative
probability for particular values of 𝑍 and of the calculated t-test statistic, 𝑇,
based on specified degrees of freedom (df).
Eg: An art auction produces normally distributed sale prices with a mean of
1600 dollars and a standard deviation of 220 dollars. What is the probability
that a particular painting will cost at least 2000 dollars?
Let 𝑋 denote sales prices. We want to find 𝑃 𝑟(𝑋 > 2000) = 1 − 𝑃 𝑟(𝑋 ≤ 2000).
1 - pnorm(2000, mean = 1600, sd = 220)
## [1] 0.03451817
# Or, convert to Z-value first:
1 - pnorm((2000 - 1600)/220)
## [1] 0.03451817
(Exercise: Modify the above code to find the probability that a painting will
cost 5000 dollars or less.)
Suppose now that the standard deviation given above had been estimated from
a random sample of 10 of the paintings. Student’s 𝑡 should be used (with df =
9) rather than the normal. Note you should convert the figure to a Z-value first
before using the pt function:
z <- (2000 - 1600)/220
1 - pt(z, df = 9)
## [1] 0.05119906
The examples below use R’s built-in sleep data set, which records the extra hours
of sleep gained by ten patients under each of two drugs (group). Typing its name
displays the data:
sleep
##    extra group ID
## 1 0.7 1 1
## 2 -1.6 1 2
## 3 -0.2 1 3
## 4 -1.2 1 4
## 5 -0.1 1 5
## 6 3.4 1 6
## 7 3.7 1 7
## 8 0.8 1 8
## 9 0.0 1 9
## 10 2.0 1 10
## 11 1.9 2 1
## 12 0.8 2 2
## 13 1.1 2 3
## 14 0.1 2 4
## 15 -0.1 2 5
## 16 4.4 2 6
## 17 5.5 2 7
## 18 1.6 2 8
## 19 4.6 2 9
## 20 3.4 2 10
# Do the hypothesis test:
t.test(sleep$extra, mu = 0, alternative = "greater")
##
## One Sample t-test
##
## data: sleep$extra
## t = 3.413, df = 19, p-value = 0.001459
## alternative hypothesis: true mean is greater than 0
## 95 percent confidence interval:
## 0.7597797 Inf
## sample estimates:
## mean of x
## 1.54
(Exercise: How would you modify the above code if you wanted to test whether
the mean extra hours of sleep is significantly less than 1 hour?)
Suppose in the sleep example we want to test whether the mean extra hours
of sleep differs between the two drugs. We simply do the two-sample t-test
comparing the two groups (drugs) using the t.test() function:
# Do the hypothesis test:
t.test(extra ~ group, data = sleep)
##
## Welch Two Sample t-test
##
## data: extra by group
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.3654832 0.2054832
## sample estimates:
## mean in group 1 mean in group 2
## 0.75 2.33
(Exercise: Investigate what R means when it says “Welch Two Sample t-test”
in the output. Start by looking at the help for t.test() using ?t.test in the
R console).
Assuming the twelve lying and standing blood pressure measurements have been
entered into the vectors lying.bp and standing.bp, the paired test is:
# Do the test:
t.test(lying.bp, standing.bp, paired = TRUE, alternative = "two.sided")
##
## Paired t-test
##
## data: lying.bp and standing.bp
## t = 1.574, df = 11, p-value = 0.1438
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.9958458 5.9958458
## sample estimates:
## mean of the differences
## 2.5
# OR, create the differences yourself first:
diff <- lying.bp - standing.bp
t.test(diff, alternative = "two.sided", mu = 0)
##
## One Sample t-test
##
## data: diff
Week 7 - ANOVA
Outline:
1. Analysis Of Variance
• Statistical Models
– The Concept
– Data Synthesis: What the Model Means.
– Analysis of Variance (ANOVA) - The Concept
– The One-Way Analysis of Variance
∗ Introduction
∗ Design and Model
∗ Statistical Hypotheses in the ANOVA
∗ Sums of Squares
∗ Degrees of Freedom
∗ Mean Squares
∗ The Analysis of Variance Table
∗ Test of Hypothesis – The F-Test
∗ Comparison of the F-test in the ANOVA with 1 df for
Treatment, versus, the Two-sample Independent t-test.
∗ Worked Example
∗ Assumptions in the ANOVA Process
2. Using R
• Using R for ANOVA
• Calculating and Testing a Mean: The One-Sample t-test.
• Protected 𝑡-tests in R: Multiple Comparisons of Treatment
Means (MCTM).
## harvesting.system observations
## 1 nil 55
## 2 nil 55
## 3 nil 55
## 4 nil 55
## 5 nil 55
## 6 CS1 55
## 7 CS1 55
## 8 CS1 55
## 9 CS1 55
## 10 CS1 55
## 11 CS2 55
## 12 CS2 55
## 13 CS2 55
## 14 CS2 55
## 15 CS2 55
## 16 new 55
## 17 new 55
## 18 new 55
## 19 new 55
## 20 new 55
Each individual tree has its own peculiarities which we have said are
distributed as a normal variable with a mean of zero and standard
deviation of 10. A random sample is selected from a normal distri-
bution with these parameters, usually using a computer package but
## harvesting.system observations
## 1 nil 73.88037
## 2 nil 48.76506
## 3 nil 56.98792
## 4 nil 57.63700
## 5 nil 53.38633
## 6 CS1 33.78667
## 7 CS1 69.78944
## 8 CS1 46.05728
## 9 CS1 64.23654
## 10 CS1 51.83999
## 11 CS2 74.15087
## 12 CS2 67.17830
## 13 CS2 55.05871
## 14 CS2 66.69690
## 15 CS2 49.02933
## 16 new 41.22425
## 17 new 78.15040
## 18 new 68.42995
## 19 new 58.56322
## 20 new 51.90277
## harvesting.system observations
## 1 nil 108.880369
## 2 nil 83.765059
## 3 nil 91.987917
## 4 nil 92.637000
## 5 nil 88.386331
## 6 CS1 38.786673
## 7 CS1 74.789435
## 8 CS1 51.057283
## 9 CS1 69.236543
## 10 CS1 56.839995
## 11 CS2 69.150871
## 12 CS2 62.178296
## 13 CS2 50.058710
## 14 CS2 61.696901
## 15 CS2 44.029332
## 16 new 6.224251
## 17 new 43.150398
## 18 new 33.429953
## 19 new 23.563222
## 20 new 16.902768
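The synthesis shown above can be reproduced along the following lines in R. This
is only a sketch: the treatment effects used (+35, +5, −5 and −35 for nil, CS1,
CS2 and new) are inferred from the values displayed above, and set.seed() simply
makes the sketch reproducible; it will not recreate the exact numbers shown.

harvesting.system <- factor(rep(c("nil", "CS1", "CS2", "new"), each = 5))
mu <- 55                                            # general mean
alpha <- c(nil = 35, CS1 = 5, CS2 = -5, new = -35)  # treatment effects (inferred)
set.seed(1)                                         # for reproducibility only
errors <- rnorm(20, mean = 0, sd = 10)              # individual tree effects
observations <- mu + alpha[as.character(harvesting.system)] + errors
data.frame(harvesting.system, observations)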
1B 2C 3A 4C 5A 6B
7A 8D 9B 10 D 11 D 12 C
The numbers refer to the experimental units (plots) and the letters represent
the treatments allocated at random.
In a CRD, the only sources of variation are the treatments and random error.
Thus each piece of data (i.e. each measurement) can be thought of as being
made up as follows:
measurement = general mean + treatment effect + random error .
This model is specified symbolically as:
𝑦𝑖𝑗 = 𝜇 + 𝛼𝑖 + 𝜖𝑖𝑗
[Check you understand the reason for equality and why the word “treatment”
has been bracketed.]
The null hypothesis is:
𝐻0 ∶ 𝜇 1 = 𝜇 2 = … = 𝜇 𝑘
where
• 𝜇𝑖 is the population mean of the variable for the 𝑖th treatment;
• 𝑘 is the number of treatments.
In English: the k population treatment means are all equal.
The alternative hypothesis is:
𝐻1 ∶ The treatment means are not all equal (at least one differs from the others).
$$\text{Total SS} = \sum_i \sum_j y_{ij}^2 - \frac{\left(\sum_i \sum_j y_{ij}\right)^2}{n} = \text{Raw SS} - \text{CF}$$

$$\text{TSS} = \text{Treatment SS} = \sum_i \frac{T_i^2}{n_i} - \text{CF}$$
TSS is found by getting the total of the observations in each treatment (𝑇𝑖 ),
squaring it, and dividing by the number of observations in the treatment; these
values from each treatment are then added together and the correction factor
subtracted from this sum.
The error d.f. can be obtained by considering the contribution to error made
by each treatment.
Examples:
1. With four treatments, each replicated three times, the degrees of freedom
breakdown will be:
Source DF
Treatment 3
Error 8
Total 11
2. With five treatments having unequal replication, the breakdown is:
Treatment A B C D E Total
Replications 12 9 8 7 9 45
Source DF Explanation
Treatments 4 (5-1)
Error 40 (11+8+7+6+8)
Total 44 (45-1)
The ratio of the treatment mean square to the error mean square is called the:
variance ratio. Under 𝐻0 , given all assumptions are true, the variance ratio has
an F-distribution.
At this point, you should think very clearly about 𝐻0 and the F-test. The
hypothesis tested by an F-test is that of equality of variances, but the hypothesis
given for the ANOVA concerns means. How is this apparent anomaly explained?
𝐻0 ∶𝜇1 = 𝜇2 = … = 𝜇𝑘
𝐻1 ∶The treatment means are not all equal
plot(seq(0, 10, 0.01), df(seq(0, 10, 0.01), 5, 20), type = "l", ylab = "Probability", xlab = "Variance Ratio")
[Figure: density curve of the F distribution with 5 and 20 degrees of freedom, plotted against the variance ratio.]
Filter/Day 1 2 3 4 5 6 7 8 9 10 Total
Standard 24 28 38 41 32 45 33 39 50 18 348
New 15 24 39 44 17 28 18 35 22 20 262
$$\text{Tot SS} = 24^2 + 28^2 + \dots + 22^2 + 20^2 - \frac{(24 + 28 + \dots + 22 + 20)^2}{20} = 20732 - \frac{(610)^2}{20} = 20732 - 18605 = 2127$$
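These sums of squares are easy to check in R; a minimal sketch using the filter
data above (the object names are arbitrary):

standard <- c(24, 28, 38, 41, 32, 45, 33, 39, 50, 18)
new.filter <- c(15, 24, 39, 44, 17, 28, 18, 35, 22, 20)
y <- c(standard, new.filter)
CF <- sum(y)^2 / length(y)                                  # correction factor: 18605
total.SS <- sum(y^2) - CF                                   # 2127
treat.SS <- (sum(standard)^2 + sum(new.filter)^2)/10 - CF   # 369.8
error.SS <- total.SS - treat.SS                             # 1757.2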
Degrees of freedom:
total = 20 - 1 = 19
treatment = 2 - 1 = 1
error = 19 - 1 = 18 (or 2 × (10 − 1))
ANOVA Table:

Source        DF     SS        MS        VR
Treatment      1     369.8     369.80    3.788
Error         18     1757.2     97.62
Total         19     2127.0
$$s_p^2 = \frac{9 \times 9.762^2 + 9 \times 9.998^2}{10 + 10 - 2} = 97.622; \qquad s_p = 9.880.$$

From the ANOVA: EMS = 97.622 and $\sqrt{\text{EMS}} = 9.880 = s_p$ from above.

VR = 3.788 and $\sqrt{\text{VR}} = 1.946$, the equivalent t-test statistic, T.

From tables: $F_{1,18}(0.05) = 4.414$ and $t_{18}(0.025) = 2.109 = \sqrt{4.414}$.

Note that for both the tabled values and the sample test statistics, the F is the
square of the equivalent 𝑡 value: 2.109² = 4.414 and 1.946² = 3.788.
𝐻0 ∶𝜇1 = 𝜇2 = 𝜇3
𝐻1 ∶The treatment means are not all equal
where 𝜇𝑖 is the population mean wing thickness for the ith butterfly species.
The resulting data have been analysed and the results are presented in tabular
form as:
Antidote/rep 1 2 3 4 5 Total
A1 76 52 92 80 70 370
A2 110 96 74 105 125 510
A3 95 145 100 190 201 731
A4 87 93 91 120 99 490
Analyse these data and prepare a report for the club members on the compara-
tive merits of the four antidotes.
Degrees of freedom:
total = 20 - 1 = 19
treatment = 4 - 1 = 3
error = 19 - 3 = 16 (or 4 × (5 − 1))
ANOVA Table:

Source        DF     SS          MS         VR
Antidote       3     13582.15    4527.38    5.729
Error         16     12644.80     790.30
Total         19     26226.95
𝐻0 ∶𝜇1 = 𝜇2 = 𝜇3 = 𝜇4
𝐻1 ∶The treatment means are not all equal
where 𝜇𝑖 is the population mean blood alcohol for the ith antidote.
First we need to enter the data into R:
ba <- c(76, 52, 92, 80, 70,
110, 96, 74, 105, 125,
95, 145, 100, 190, 201,
87, 93, 91, 120, 99)
The aov() function fits ANOVA models in R. The syntax looks like this:
name.the.model <- aov(y.variable ~ factor.variable, data = dataset)
The summary() function provides a summary of the ANOVA and prints out the
ANOVA table.
hangover.model <- aov(ba ~ antidote, data = hangovers)
summary(hangover.model)
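The protected t-tests (multiple comparisons) used below come from the agricolae
package. A minimal sketch of the install-and-load step the next paragraphs refer
to:

# install.packages("agricolae")   # run once to install the package
library(agricolae)                # load it for the current session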
You only need to install the package once if you are using your own computer
(if you are using the University computers you will need to install the package
every time you log in).
We can use a library/package we have installed by using the library() function
as above.
The agricolae library contains many R functions, but we are only interested in
one for this course: the LSD.test() function. The syntax to use this function
is:
hangovers.lsd <- LSD.test(hangover.model, "antidote", console = T)
##
## Study: hangover.model ~ "antidote"
##
## LSD t Test for ba
##
## Mean Square Error: 790.3
##
𝐻0 ∶𝜇1 = 𝜇2 = 𝜇3 = 𝜇4
𝐻1 ∶The treatment means are not all equal
where 𝜇𝑖 is the population mean percent damage for the ith harvester.
First we need to enter the data into R:
perc.damage <- c(78.65, 95.67, 78.52, 97.74, 79.57,
62.81, 54.69, 45.64, 52.43, 71.66,
45.83, 36.58, 59.92, 42.25, 45.05,
15.89, 35.01, 38.38, 19.82, 40.93)
harv.system <-factor(rep(c("Nil", "CS1", "CS2", "New"), c(5, 5, 5, 5)))
harvester <- data.frame(harv.system, perc.damage)
rm(perc.damage, harv.system)
Now use the aov() function to run the model above in R, and get an output
summary with the summary() function:
harv.model <- aov(perc.damage ~ harv.system, data = harvester)
summary(harv.model)
LSD.test(harv.model, "harv.system", console = T)
##
## Study: harv.model ~ "harv.system"
##
## LSD t Test for perc.damage
##
## Mean Square Error: 100.0363
##
## harv.system, means and individual ( 95 %) CI
##
## perc.damage std r LCL UCL Min Max
## CS1 57.446 10.036779 5 47.96378 66.92822 45.64 71.66
Week 8 - Multiple Treatment Comparisons and LSD
Outline:
1. Analysis Of Variance Continued
• Multiple Comparisons of Treatment Means
– Introduction
– The Protected (Extended) t test
– Least Significant Differences - LSD
– Worked ANOVA Examples
2. Using R
• See lecture notes in week 7.
Accompanying Workshop - done in week 9
• The analysis of variance process and multiple comparisons of
means - when the ANOVA rejects 𝐻0
Workshop for week 8
• Based on lectures in week 7
Project Requirements for Week 8
• Nil.
Assessment for Week 8
• Your second quiz worth 7% is this week.
IMPORTANT NOTE
Remember the extended t-test as described above must only be applied after a
significant F-test has been found. This proviso gives a “protection” to the test
to prevent the detection of false significant differences which can arise simply
by comparing the highest and lowest of a number of means.
Example Wing Thickness of Butterflies
In the week 7 notes the following ANOVA was given for the wing
thickness of butterfly species:
Species 1 2 3
Mean Wing Thickness 4.67 6.80 7.75
Number of Replicates 3 5 4
𝐻0 ∶𝜇𝑖 = 𝜇𝑗
𝐻1 ∶𝜇𝑖 ≠ 𝜇𝑗
where 𝜇𝑖 and 𝜇𝑗 are the population mean wing thicknesses for the
two species being compared.
The test statistic for each comparison will be
$$T = \frac{\bar{X}_i - \bar{X}_j}{s\sqrt{\frac{1}{n_i} + \frac{1}{n_j}}}$$
where symbols are as defined in the notes on independent t-testing
and $s = \sqrt{\text{EMS}}$ from the ANOVA table.
Under 𝐻0 : 𝑇 ∼ 𝑡9 . From tables, 𝑡9 (0.975) = 2.262.
Note that the degrees of freedom is always 9 in this example (error
df from ANOVA) regardless of which pair of means is being tested.
Thus for each pairwise test the critical region will be: 𝑇 < −2.262
or 𝑇 > 2.262 (alternatively, we can write these two regions as |𝑇 | >
2.262).
(i) Species 1 vs Species 2
The calculated 𝑇 does not lie in the critical region and thus we
cannot reject 𝐻0 . We conclude that there is insufficient evidence
in the data to suggest that the mean wing thicknesses of species 2
(6.80) and 3 (7.75) are significantly different (𝑝 ≥ 0.05).
The overall conclusion is that species 2 and 3 do not differ with re-
spect to mean wing thickness. Species 1 butterflies have, on average,
a mean wing thickness significantly less than that of butterflies of
species 2 and 3 (𝛼 = 0.05).
$$SE_{\bar{y}_i - \bar{y}_j} = s\sqrt{\frac{2}{n}}$$

where $s = \sqrt{\text{EMS}}$ from the ANOVA and 𝑛 is the number of (equal) reps in each
treatment.
Instead of evaluating the test statistic for all possible treatment pairs, the test
statistic formula can be rearranged to find the smallest difference that must
exist between any two means for significance to be reached. (Note that this is
similar to the approach taken to find a confidence interval).
Using the traditional level of significance of 0.05, the critical value is $t_\nu(0.025)$
(these comparisons are always two-tailed). Substituting this in the equation
for the test statistic T gives:

$$|\bar{y}_i - \bar{y}_j|_{\text{LSD}} = t_\nu(0.025) \times s \times \sqrt{\frac{1}{n_i} + \frac{1}{n_j}}$$

where $|\bar{y}_i - \bar{y}_j|_{\text{LSD}}$ is the difference that must exist between means 𝑖 and 𝑗 if the
test statistic is to just reach significance; that is, for 𝑇 to be just larger than
$t_\nu(0.025)$.
When all treatments have the same number of replicates, say 𝑛, only one LSD
needs to be found for all the pairwise comparisons. If the replicates differ, an LSD
value must be calculated for every pair of treatments with differing replication
(this situation is not considered in this course).
The next step is to find all differences between the means and compare them
with the relevant LSD.
Table of Mean Differences
The most efficient way of looking at the differences between the means is to
construct a table of mean differences as follows:
𝐻0 ∶𝜇1 = 𝜇2 = 𝜇3 = 𝜇4 = 𝜇5
𝐻1 ∶The mean fuel efficiencies are not the same for all five fuel types
Means Table:
Fuel Type A B C D E
Mean Efficiency (%) 93.0 84.3 64.3 92.6 88.4
𝑛𝑖 4 4 4 4 4
$$\text{LSD}(5\%, 4, 4) = t_{15}(0.975) \times s \times \sqrt{\frac{1}{4} + \frac{1}{4}} = 2.131 \times 3.596 \times 0.7071 = 5.419.$$
Thus any pair of means that differs by at least 5.419 will be signifi-
cantly different at the 0.05 significance level.
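(The LSD value can be checked with a one-line calculation in R, where 3.596 is
the square root of the error mean square:)

qt(0.975, df = 15) * 3.596 * sqrt(1/4 + 1/4)   # approximately 5.42, matching the hand calculation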
Table of Mean Differences:
Fuel Type →            C       B       E       D       A
Fuel Type ↓           64.3    84.3    88.4    92.6    93.0
A            93.0     28.7     8.7     4.6     0.4     0
D            92.6     28.3     8.3     4.2     0
E            88.4     24.1     4.1     0
B            84.3     20.0     0
C            64.3      0
The first entry in the table, 28.7, is the difference between the small-
est mean (64.3 for fuel type C) and the largest mean (93.0 for fuel
type A). Any difference value in the table greater than the calculated
LSD = 5.419 indicates a significant difference between those means
at the 0.05 level of significance.
A useful way of presenting the results is as follows:
Significant Differences:
A > C, B *
D > C, B *
E > C *
B > C *
(where * indicates significance at the 5% level). The general symbols for other
levels of significance are ** for 1% and *** for 0.1%.
Conclusions
The mean efficiency of fuel type C is significantly lower than the
mean efficiency of all the other fuel types (p < 0.05). Fuel types A
and D have greater efficiency on average than do fuel types C and B
(p < 0.05). Fuel types A, D and E appear to have the same efficiency
on average (p > 0.05).
Example: Harvester Example Revisited
Do this by hand yourself, then check your results using R.
How close are the estimates of treatment means and the standard deviation
to the values we started with? Remember, we assumed these values and then
used them to synthesise the data.
## harvesting.system observations
## 1 nil 73.644323
## 2 nil 84.769935
## 3 nil 85.514228
## 4 nil 95.613793
## 5 nil 91.977370
## 6 CS1 52.988987
## 7 CS1 46.922544
## 8 CS1 50.971228
## 9 CS1 53.460526
## 10 CS1 58.795703
## 11 CS2 44.527301
## 12 CS2 59.322388
## 13 CS2 60.097702
## 14 CS2 48.297088
## 15 CS2 50.464486
## 16 new 18.803197
## 17 new 11.836971
## 18 new 2.674373
## 19 new 21.417665
## 20 new 31.253825
Example: Growth Curves - Marine Birds
An ESC researcher is studying the growth rates of young marine birds on the
Great Barrier Reef. The growth curve of these birds is known to be logistic in
nature with a functional form as follows:
$$W = \frac{K}{1 + \exp(-r(t - t_m))},$$
where:
• 𝑊 is the weight in grams of the individual bird at time 𝑡 days;
• 𝐾 is the asymptotic weight of the individual bird (its adult weight);
• 𝑟 is the growth constant for the individual bird;
• 𝑡𝑚 is the time (in days) at which the bird reaches half its adult (asymptotic) weight.
Analyse these data to determine in what way (if any) the growth patterns differ
between the three species. Initially consider 𝑡𝑚 , the time to reach half the adult
weight.
6.2 Using R
Refer to R section in week 7 lecture notes.
Chapter 7
Week 9 - Factorial ANOVA
Outline:
1. Analysis Of Variance Continued
• Treatment Designs: Factorial ANOVA
– The Treatment Design Concept
– The Factorial Alternative - How to Waste resources??
– What is this ‘Interaction’??
– The Factorial Model
– Factorial Effects
– The Factorial ANOVA
– Partitioning Treatment Sums of Squares Into Factorial
Components
– Factorial Effects and Hypotheses
– Testing and Interpreting Factorial Effects
2. Using R
• Using R for Factorial ANOVA
Accompanying Workshop - done in week 10
• Defining factorial treatments.
• The analysis of variance for factorial treatment designs – by
hand and using R
Workshop for week 9
• Based on lectures in week 8
• Response to marker’s critique. Designed to help you improve
in your assignment 2.
Project Requirements for Week 9
The different possible values for a factor are known as its levels. In the example,
degree of instruction has 4 levels and type of machine has 3 levels. The total
number of treatments will be the product of the levels of the factors involved:
4 × 3 = 12.
Standard Order - ABC
When trying to identify all treatments from a number of factors, it is useful to
have some standard approach – this avoids missing treatments along the way!!
A commonly used approach is known as the ‘standard ABC order’ in which the
levels of the first factors (A and B) are held constant while the levels of the
outermost factor (C) are allowed to vary. Once all levels of C have been varied
for a particular combination of A and B, the level of the next factor, B, is varied
and C again moves through its possible levels. Eventually factor B will reach its
last level together with the last level of C, only then will the level of A change.
The process with factors B and C then repeats but with the second level of A.
Assuming there are ‘a’ levels of A, ‘b’ levels of B and ‘c’ levels of C, the 𝑎 × 𝑏 × 𝑐
treatments can be written schematically as: A1B1C1, A1B1C2, …, A1B1Cc,
A1B2C1, A1B2C2, …, A1B2Cc, …, A1BbCc, A2B1C1, A2B1C2, …, A2B1Cc,
A2B2C1, A2B2C2, …, A2BbCc, …, AaBbCc.
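In R the full set of treatment combinations can be generated in this standard
order with expand.grid(); a small sketch in which the numbers of levels (a = 2,
b = 2, c = 3) are chosen purely for illustration. Because expand.grid() varies its
first argument fastest, the factors are supplied in reverse order:

A <- paste0("A", 1:2)
B <- paste0("B", 1:2)
C <- paste0("C", 1:3)
treatments <- expand.grid(C = C, B = B, A = A)[, c("A", "B", "C")]
apply(treatments, 1, paste0, collapse = "")   # "A1B1C1" "A1B1C2" "A1B1C3" "A1B2C1" ...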
Example: Air Pollution, Land Usage, and Location
Three land uses (rural, residential and national park)
Two states (Qld and NSW)
Incomplete factorials are a valuable design that can save resources, but they are
not covered here.
Looking at the means for the different chemicals for the two times
and overall gives:
attach(weeds)
interaction.plot(chemical, time, drymatter, type = "l", col = c("blue", "red"), main = "Interaction Plot")
[Figure: interaction plot of mean dry matter for chemicals A-D at the two application times (Early and Late).]
detach()
$$y_{ijk} = \mu + \alpha_i + \beta_j + \alpha\beta_{ij} + \epsilon_{ijk}$$
where
• 𝑦𝑖𝑗𝑘 is the observation on the 𝑘th replicate receiving the 𝑖th level of A and
the 𝑗th level of B, 𝑘 = 1, … , 𝑛;
• 𝑛 is the number of replicates for each of the 𝑎 × 𝑏 treatment groups;
• 𝜇 is the overall grand mean of y;
• 𝛼𝑖 is the effect of the 𝑖th level of factor A, 𝑖 = 1, … , 𝑎;
• 𝛽𝑗 is the effect of the 𝑗th level of factor B, 𝑗 = 1, … , 𝑏;
• 𝛼𝛽𝑖𝑗 is the effect of the interaction between factor A and B;
• 𝜖𝑖𝑗𝑘 is the random effect attributable to the ijkth individual observation.
Notes
• The index, 𝑘, counts the replicates for a particular treatment group and
goes from 1 to n. (There are 𝑎 × 𝑏 treatment groups, each replicated n
times).
• Each level of factor A is replicated not only by each group of n, but also
across the levels of factor B – overall each level of A has 𝑏 × 𝑛 replicates.
Similarly, levels of factor B gain replication because they occur at the
different levels of factor A – overall each level of B has 𝑎 × 𝑛 replicates. It
is the replication of one factor’s levels across the levels of the other factor
that gives the factorial treatment design its power. As long as everything
is balanced, and all levels of one factor are represented equally in all levels
of every other factor, then any effect due to one factor is ‘evened’ out when
another factor’s levels are being compared.
Preliminary Model
It is useful to consider an initial model which contains a single treatment term
encompassing all the factorial effects (both main and interaction). Such a model
is known as a preliminary model. One of the advantages of this model is to
ascertain the sources of variation which will be unexplained (error) and the
sources that will form part of the experimental design (for example, blocking).
This will be discussed further in courses in later years.
The preliminary model for the general two-factor model above is:
$$y_{ijk} = \mu + \text{treat}_{ij} + \epsilon_{ijk}$$
where
• 𝑦𝑖𝑗𝑘 , 𝜇, and 𝜖𝑖𝑗𝑘 are as previously defined;
• treat𝑖𝑗 is the effect of the 𝑖𝑗th treatment group (made up from the factors
A and B), 𝑖 = 1, … , 𝑎 and 𝑗 = 1, … , 𝑏.
Factorial Model:
Factorial:
$$\text{Total SS} = \sum_i \sum_j y_{ij}^2 - \frac{\left(\sum_i \sum_j y_{ij}\right)^2}{n} = 50.6983$$

$$\text{TSS} = \text{Treatment SS} = \sum_i \frac{T_i^2}{n_i} - \text{CF} = 44.7183$$
$$\text{Chemical SS} = \frac{(\text{Total A})^2}{3+3} + \frac{(\text{Total B})^2}{3+3} + \frac{(\text{Total C})^2}{3+3} + \frac{(\text{Total D})^2}{3+3} - \frac{(\text{Total A} + \text{Total B} + \text{Total C} + \text{Total D})^2}{4 \times (3+3)}$$

$$= \frac{20.3^2}{6} + \frac{17.6^2}{6} + \frac{22.8^2}{6} + \frac{22.3^2}{6} - \frac{(20.3 + 17.6 + 22.8 + 22.3)^2}{24} = 2.7883.$$

$$\text{Time SS} = \frac{(\text{Total Early})^2}{3+3+3+3} + \frac{(\text{Total Late})^2}{3+3+3+3} - \frac{(\text{Total Early} + \text{Total Late})^2}{2 \times (3+3+3+3)} = \frac{45.0^2}{12} + \frac{38.0^2}{12} - \frac{(45.0 + 38.0)^2}{24} = 2.0417.$$
Source DF SS MS VR
Treatments 7 44.7183 6.3883 17.09
Error 16 5.9800 0.3738
Total 23 50.6983
Factorial:
Source DF SS MS VR
Chemical 3 2.7883 0.92944 2.49
Time 1 2.0417 2.04167 5.46
Chemical × Time 3 39.8883 13.29611 35.57
Error 16 5.9800 0.37375
Total 23 50.6983
Each line of the Factorial ANOVA resulting from the partitioned treatment
source of variation, provides a test of some hypothesis. This means that a
factorial ANOVA with 2 main effects will need a null and alternative hypothesis
for EACH of the main effects and for the interaction, i.e. 3 null and 3 alternative
hypotheses will be needed.
A main effect line provides a test of the equality of means for the particular
factor, where the mean for each level is computed across all levels of the other
factors. It describes the effect of the factor when it is averaged across all levels
of the other factors. In the above example ‘chemicals’ and ‘time appln’ are the
2 main effects.
An interaction line tests whether or not the two (or more) factors interact
with each other; are the differences between the levels of one factor the same
regardless of which level of the other factor is considered OR do the differences
change, depending on which level of the other factor is being considered?
𝐻0 (interaction): the differences between the levels of one factor are the same at
every level of the other factor (no interaction).
𝐻1 (interaction): the differences between the levels of one factor depend on the
level of the other factor (an interaction is present).
CRITICAL NOTE
• Step 1:
• Step 2 A:
• Step 2 B:
Return to the previous example and assume that the required assumptions of
independence, homogeneity of variances, additivity of model terms and normality
of the variable (dry matter production) are all valid.
Interaction
Find the least significant difference needed between any two of the
8 treatment means for the difference to be significant.
Standard deviation:
best estimate is square root of error mean square in the ANOVA: sd
= 0.61135
Number of replicates in each treatment:
for the two-way (treatment) means there are 3 replicates
Standard error of difference between 2 means:
$$SE_{\bar{y}_i - \bar{y}_j} = s \times \sqrt{\frac{1}{n_i} + \frac{1}{n_j}} = 0.61135 \times \sqrt{\frac{2}{3}} = 0.499$$
LSD.test(weeds.aov, c("time", "chemical"), console = T)
##
## Study: weeds.aov ~ c("time", "chemical")
##
## LSD t Test for drymatter
##
## Mean Square Error: 0.37375
##
## time:chemical, means and individual ( 95 %) CI
##
## drymatter std r LCL UCL Min Max
## Early:A 5.400000 0.6244998 3 4.6517505 6.148249 4.7 5.9
## Early:B 3.233333 0.5686241 3 2.4850838 3.981583 2.6 3.7
[Figure: bar chart of the eight time-by-chemical treatment means with LSD grouping letters (a, a, b, bc, bc, cd, de, e).]
[Figure: interaction plot of mean dry matter for chemicals A-D at the two application times (Early and Late).]
detach()
Chapter 8
Week 10/11 - Correlation and Simple Linear Regression
Outline:
1. BIVARIATE STATISTICAL METHODS
• Introduction
• Covariance
• The Correlation Coefficient – Pearson’s Product Moment
Coefficient
2. Regression Analysis
• Assumptions Underlying The Regression Model
• Simple Linear Regression
• Estimating A Simple Regression Model
• Evaluating The Model
• The Coefficient of Determination (𝑅2 )
• The Standard Error (Root Mean Squared Error) of the
Regression, 𝑆𝜖
• Testing The Significance Of The Independent Variable
• Testing The Overall Significance Of The Model
• Functional Forms Of The Regression Model
3. Using R
• Using R for Correlation and Simple Linear Regression Modelling
8.1.1.1 Covariance
The dispersion or variation in a variable is usually measured by the variance
or its square root, standard deviation. Recall the definition of the population
variance for a single variable, say X:
$$Var(X) = \frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2$$

The corresponding sample variance is:

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$$
A measure of the way in which two variables vary together is given by the
covariance which is defined as:
$$Cov(X_1, X_2) = \frac{1}{n}\sum_{i=1}^{n}(X_{1i} - \mu_1)(X_{2i} - \mu_2)$$

with sample version

$$Cov(X_1, X_2) = \frac{1}{n-1}\sum_{i=1}^{n}(X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)$$
One of the classical bivariate situations involves a bivariate normal – the two
variables have a joint normal distribution. The theory for this is beyond the
scope of this course but the following illustrates the graph of a bivariate normal
distribution.
# install.packages(c("mvtnorm", "plotly"))
library(mvtnorm)
library(plotly)
graph   # displays the 3-D surface of the bivariate normal density (construction code not shown here)
The bivariate observations, (𝑥1𝑖 , 𝑥2𝑖 ) are said to be 𝑁 ((𝜇1 , 𝜇2 ), (𝜎1 , 𝜎2 , 𝜌12 ))
where 𝜌12 is a function of the covariance (known as the correlation) between the
two variables.
$$\rho_{x_1 x_2} = \frac{Cov(X_1, X_2)}{\sqrt{Var(X_1)\,Var(X_2)}} = \frac{\sum_{i=1}^{n}(X_{1i} - \mu_1)(X_{2i} - \mu_2)}{\sqrt{\sum_{i=1}^{n}(X_{1i} - \mu_1)^2 \sum_{i=1}^{n}(X_{2i} - \mu_2)^2}}$$

$$r_{x_1 x_2} = \frac{Cov(X_1, X_2)}{\sqrt{Var(X_1)\,Var(X_2)}} = \frac{\sum_{i=1}^{n}(X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)}{\sqrt{\sum_{i=1}^{n}(X_{1i} - \bar{X}_1)^2 \sum_{i=1}^{n}(X_{2i} - \bar{X}_2)^2}}$$

Working Formula:

$$r_{x_1 x_2} = \frac{\sum x_1 x_2 - \frac{1}{n}\left(\sum x_1\right)\left(\sum x_2\right)}{\sqrt{\left(\sum x_1^2 - \frac{1}{n}\left(\sum x_1\right)^2\right)\left(\sum x_2^2 - \frac{1}{n}\left(\sum x_2\right)^2\right)}}$$
Other names for this correlation coefficient are: Simple Correlation Coefficient
and Product Moment Coefficient.
The Pearson Correlation Coefficient lies between –1 and +1. That is,
−1 ≤ 𝜌𝑥1 𝑥2 ≤ 1
A value near zero indicates little or no linear relationship between the two
variables; a value close to –1 indicates a strong negative linear relationship (as
one goes up the other comes down – large values of one variable are associated
with small values of the other variable); a value near +1 indicates a strong
positive linear relationship (large values of one variable are associated with large
values of the other variable). The following figures illustrate these ideas.
library(MASS)
##
## Attaching package: 'MASS'
[Figure: six scatterplots of X2 against X1 for simulated bivariate normal data with rho = -1, -0.8, 0, 0.25, 0.5 and 1, illustrating how the sign and magnitude of the correlation reflect the direction and strength of the linear relationship.]
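Scatterplots of this kind can be generated with mvrnorm() from the MASS package
loaded above. A minimal sketch for a single panel (rho = 0.8 is chosen here purely
for illustration):

rho <- 0.8
Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)         # correlation (covariance) matrix
xy <- mvrnorm(n = 200, mu = c(0, 0), Sigma = Sigma)  # simulate bivariate normal data
plot(xy[, 1], xy[, 2], xlab = "X1", ylab = "X2", main = paste("rho =", rho))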
The correlation coefficient measures the linear relationship between two vari-
ables. If the relationship is nonlinear any interpretation of the correlation will
be misleading. The following figure demonstrates this concept.
x <- seq(-3, 3, 0.1)
y <- 4*x^2 + 2*x + 10 + rnorm(length(x), 0, 4)
plot(x, y)
abline(lm(y ~ x), col = "blue")
legend("topleft", legend = paste("Correlation = ", round(cor(x, y), 3)), bty = "n")
[Figure: scatterplot of y against x with the fitted straight line; despite the obvious quadratic pattern the legend reports Correlation = 0.243.]
Note
Just because one variable relates to another variable does not mean that changes
in one causes changes in the other. Other variables may be acting on one or both
of the related variables and affecting them in the same direction. Cause-and-
effect may be present, but correlation does not prove cause. For example, the
length of a person’s pants and the length of their legs are positively correlated -
people with longer legs have longer pants; but increasing one’s pants length will
not lengthen one’s legs!
Property of Linearity
A low correlation (near 0) does not mean that 𝑋1 and 𝑋2 are not related in some
way. When |𝜌| < 𝜖 (where 𝜖 is some small value, say, for example, 0.1 or 0.2)
indicating no or very weak correlation between the two variables, there may still
be a definite pattern in the data reflecting a very strong “nonlinear” relationship
(see the previous figure for an example of this). Pearson’s correlation applies
only to the strength of linear relationships.
[Figure: scatterplot of the response variable against Age (X).]
𝐸(𝑌𝑖 |𝑋𝑖 ) = 𝛽0 + 𝛽1 𝑋𝑖 , 𝑖 = 1, … , 𝑛.
𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝜖𝑖 , 𝑖 = 1, … , 𝑛,
where
• 𝑦𝑖 is the 𝑌 measurement for the individual having an 𝑋 value of 𝑥𝑖 ;
• 𝛽0 is a constant representing the y value when x is zero (intercept);
• 𝛽1 is the slope of the population regression line;
• 𝜖𝑖 is the random error or dispersion or scatter in the 𝑦𝑖 observation asso-
ciated with unknown (excluded) influences.
The sample simple linear regression is:
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$
where symbols with hats (^) denote estimates obtained from sample data.
$$\min_{(\beta_0, \beta_1)}\left(\sum_{i=1}^{n}\epsilon_i^2\right) = \min_{(\beta_0, \beta_1)}\left(\sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right)^2\right)$$

The values of 𝛽0 and 𝛽1 that minimise the sum of squared errors (ie solve the above
equation) are:

$$\hat{\beta}_1 = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$

and

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}.$$
X Y
200 180
250 230
300 280
350 310
𝑦 ̂ = 8 + 0.88𝑥
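The fitted line can be verified quickly in R (a sketch; the object names are
arbitrary):

x <- c(200, 250, 300, 350)
y <- c(180, 230, 280, 310)
coef(lm(y ~ x))   # intercept 8 and slope 0.88, matching the fitted model above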
[Figure: plot of a horizontal regression line illustrating a zero slope; Y does not change when X changes.]
Of the total variation of the individual responses from their mean, some will
be explained by the model, and some will be unexplained. Thus the total
variation is made up of two components - the variation explained by the model
(or systematic variation) and the variation left unexplained (error variation).
plot(0:5, 2*(0:5), type = "n", xlab = "X", ylab = "Y", main = "Total Variation = Systematic Variation + Error Variation")
[Figure: diagram decomposing an observation's deviation from the mean into Error Variation (Y − Ŷ), Systematic Variation (Ŷ − Ȳ) and Total Variation (Y − Ȳ).]
(Note that for simple linear regression 𝑅2 is simply the square of the correlation
coefficient.)
In other words, 99.8% of the variation in the dependent variable (y) is explained
by the model. But is this significant? What about 60% or 40%? The coefficient
of determination does not give a testable measure of the significance of the model
overall in explaining the variation in (y). Inferences concerning the regression
model depend on the standard error or root mean square error (RMSE) of the
model.
$$S_\epsilon = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n}\hat{\epsilon}_i^2}{n-2}} = \sqrt{\frac{\text{SSE}}{n-2}}$$

A computational (working) form is:

$$S_\epsilon = \sqrt{\frac{\sum_{i=1}^{n} y_i^2 - \hat{\beta}_0\sum_{i=1}^{n} y_i - \hat{\beta}_1\sum_{i=1}^{n} x_i y_i}{n-2}}$$
From the previous example:

$$S_\epsilon = \sqrt{\frac{259800 - 8 \times 1000 - 0.88 \times 286000}{4-2}} = \sqrt{60} = 7.746.$$
𝐻0 ∶ 𝛽 1 = 0
𝐻1 ∶ 𝛽 1 ≠ 0
$$T = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}$$

where

$$SE(\hat{\beta}_1) = \frac{S_\epsilon}{\sqrt{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}}$$
𝐻0 ∶ 𝛽 1 = 0
𝐻1 ∶ 𝛽 1 ≠ 0
using

$$SE(\hat{\beta}_1) = \frac{7.7459}{\sqrt{315000 - 4 \times 275^2}} = 0.06928,$$

and

$$T = \frac{0.88}{0.06928} = 12.702.$$
Degrees of freedom are 4−2 = 2. From 𝑡 tables, 𝑡2 (0.025) = 4.303. As 𝑇 > 4.303,
reject the null hypothesis at the 0.05 level of significance and conclude that 𝑌
has a significant linear relationship with 𝑋 .
Source       DF      Sum of Squares         Mean Square          Variance Ratio
Regression   k − 1   SSR = Σ(ŷᵢ − ȳ)²       MSR = SSR/(k − 1)    MSR/MSE
Error        n − k   SSE = Σ(yᵢ − ŷᵢ)²      MSE = SSE/(n − k)
Total        n − 1   TSS = Σ(yᵢ − ȳ)²

TSS = SSR + SSE
NB: 𝑘 is the number of parameters estimated in the model. For simple linear
regression, there are two parameters estimated (𝛽0 and 𝛽1 ), so 𝑘 = 2.
Notes:
• the purpose of the ANOVA is to break up the variation in 𝑦; in simple
regression it can also be used to test 𝐻0 ∶ 𝛽1 = 0, and it shows how the coefficient
of determination (𝑅2 ) is derived.
• it is based on the F-test statistic, defined as the variance ratio.
• main question: “Is the ratio of the explained variance (MSR) to the unex-
plained variance (MSE) sufficiently greater than 1, to reject 𝐻0 that 𝑦 is
unrelated to 𝑥?”
• if we reject 𝐻0 the main conclusion is:
– the linear model explains a part of the variation in 𝑦. We accept that
𝑥 and 𝑦 are linearly related (at the given level of significance).
$$s^2_{y_p} = s^2_\epsilon\left(\frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}\right)$$
where 𝑥𝑝 denotes the value of 𝑥 we are interested in predicting the 𝑦 value for.
This interval is sometimes called the “narrow” interval, or “confidence” interval
in R.
$$s^2_{y_p} = s^2_\epsilon\left(1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}\right)$$

This larger variance gives the "wide" (prediction) interval referred to in R.
$$y_i = e^{\beta_0 + \epsilon_i} x_i^{\beta_1}$$

becomes

$$\ln(y_i) = \beta_0 + \beta_1 \ln(x_i) + \epsilon_i$$
distance <- c(1.2, 0.8, 1, 1.3, 0.7, 0.3, 1, 0.6, 0.9, 1.1)
price <- c(101, 92, 110, 120, 90, 51, 93, 75, 77, 120)
house.prices <- data.frame(distance, price)
rm(distance, price)
attach(house.prices)
plot(x = distance, y = price, xlab = "Distance (km) from Abattoir", ylab = "House Price (x $1000)")
[Figure: scatterplot of house price (× $1000) against distance (km) from the abattoir.]
detach()
house.lm <- lm(price ~ distance, data = house.prices)
summary(house.lm)
##
## Call:
## lm(formula = price ~ distance, data = house.prices)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.5377 -6.2549 0.7738 8.1221 13.7083
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.14 10.86 3.327 0.010431 *
## distance 63.77 11.63 5.484 0.000584 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.46 on 8 degrees of freedom
## Multiple R-squared: 0.7899, Adjusted R-squared: 0.7636
## F-statistic: 30.08 on 1 and 8 DF, p-value: 0.0005844
Before we use this model to test hypotheses or make predictions etc, we should
first assess whether the fitted model follows the assumptions of regression mod-
elling. We do this graphically as follows:
plot(house.lm)
[Figure: residual diagnostic plots for lm(price ~ distance): Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage.]
What are the hypotheses and the population regression model for this example?
What would the predicted house price be 0.63 km away from the abattoir?
What would the predicted house price be 2.0 km away from the abattoir?
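Predictions of this kind can be obtained with predict(); a sketch using the fitted
object house.lm from above (interval = "prediction" requests the "wide" interval,
interval = "confidence" the "narrow" one):

new.dist <- data.frame(distance = c(0.63, 2.0))
predict(house.lm, newdata = new.dist, interval = "prediction", level = 0.95)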
Distance to Center Line (Center) 12.8 12.9 12.9 13.6 14.5 14.6 15.1 17.5 19.5 20.8
Distance from Car to Cyclist (Car) 5.5 6.2 6.3 7.0 7.8 8.3 7.1 10.0 10.8 11.0
The data have been analysed using lm() in R and the results are given below.
center <- c(12.8, 12.9, 12.9, 13.6, 14.5, 14.6, 15.1, 17.5, 19.5, 20.8)
car <- c(5.5, 6.2, 6.3, 7, 7.8, 8.3, 7.1, 10, 10.8, 11)
cyclist <- data.frame(center, car)
rm(center, car)
[Figure: scatterplot of the distance from the car to the cyclist against the distance from the centre line.]
detach()
# Fit the model and summarise it (cyclist.lm is an illustrative object name):
cyclist.lm <- lm(car ~ center, data = cyclist)
summary(cyclist.lm)
##
## Call:
## lm(formula = car ~ center, data = cyclist)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.76990 -0.44846 0.03493 0.35609 0.84148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.18247 1.05669 -2.065 0.0727 .
## center 0.66034 0.06748 9.786 9.97e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5821 on 8 degrees of freedom
## Multiple R-squared: 0.9229, Adjusted R-squared: 0.9133
## F-statistic: 95.76 on 1 and 8 DF, p-value: 9.975e-06
# Check model using residual diagnostics:
[Figure: residual diagnostic plots (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage) for the fitted model.]
River 1 2 3 4 5 6 7 8 9 10
Dissolved Oxygen (%) 45.8 67.9 54.6 51.9 32.7 62.3 71.3 78.9 38.7 49.7
Temperature (C) 27.9 16.5 22.3 24.8 31.2 18.5 13.1 10.7 29.1 22.1
The data have been analyzed using lm() in R and the results are given below.
# Enter Data:
DO <- c(45.8, 67.9, 54.6, 51.9, 32.7, 62.3, 71.3, 78.9, 38.7, 49.7)
temp <- c(27.9, 16.5, 22.3, 24.8, 31.2, 18.5, 13.1, 10.7, 29.1, 22.1)
rivers <- data.frame(DO, temp)
rm(DO, temp)
attach(rivers)
[Figure: "Scatterplot of DO vs Temperature" - dissolved oxygen (%) plotted against temperature (degrees C).]
detach()
do.lm <- lm(DO ~ temp, data = rivers)
anova(do.lm)
## Response: DO
## Df Sum Sq Mean Sq F value Pr(>F)
## temp 1 1880.1 1880.14 248.63 2.615e-07 ***
## Residuals 8 60.5 7.56
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(do.lm)
##
## Call:
## lm(formula = DO ~ temp, data = rivers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.671 -1.728 0.468 1.483 3.617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 100.8129 3.0097 33.50 6.89e-10 ***
## temp -2.1014 0.1333 -15.77 2.62e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.75 on 8 degrees of freedom
## Multiple R-squared: 0.9688, Adjusted R-squared: 0.9649
## F-statistic: 248.6 on 1 and 8 DF, p-value: 2.615e-07
# Check model using residual diagnostics:
plot(do.lm)
[Figure: residual diagnostic plots for do.lm: Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage.]
• What is the population regression model? What are the hypotheses?
• Is the model fit adequate? Explain your answer.
• What does the 𝑅2 tell us?
• Is there a significant linear relationship between dissolved oxygen and temperature?
• What is the fitted regression model?
• Predict the value of the dependent variable if the independent variable = 30. Comment on this prediction.
• Predict the value of the dependent variable if the independent variable = 9. Comment on this prediction.
• Check both of the above predictions in R (see the R script file “Regression Examples – Dissolved Oxygen.R”) using 95% “wide” prediction intervals.
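A sketch of the kind of call the script uses to produce those intervals, assuming
the fitted object do.lm from above:

new.temp <- data.frame(temp = c(30, 9))
predict(do.lm, newdata = new.temp, interval = "prediction", level = 0.95)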