
1014SCG Statistics - Lecture Notes

James McBroom

2020-12-05
Contents

1 Introduction To Statistics
  1.1 Statistics
  1.2 Individuals and Variables
  1.3 Random Variables and Variation
  1.4 Statistical Populations
  1.5 Uncertainty, Error and Variability
  1.6 Research Studies & Scientific Investigations
  1.7 Probability
  1.8 Statistical Inference
  1.9 Using The R Software - Week 1

2 Week 2 - Chi-Squared Tests
  2.1 STATISTICAL INFERENCE – AN INTRODUCTION
  2.2 Using R (Week 2)

3 Week 3/4 - Probability Distributions and The Test of Proportion
  3.1 Revision and Basics for Statistical Inference
  3.2 Inference for Counts and Proportions – Test of a Proportion
  3.3 THEORETICAL STATISTICAL DISTRIBUTIONS
  3.4 Using R Week 3/4

4 Week 5/6 - T-tests
  4.1 Hypothesis Testing – The General Process
  4.2 Specific Tests of Hypotheses I
  4.3 Using R

5 Week 7 - ANOVA
  5.1 Statistical Models
  5.2 Analysis of Variance (ANOVA) - The Concept
  5.3 The One-Way Analysis of Variance
  5.4 USING THE R SOFTWARE – WEEKS 7/8

6 Week 8 - Multiple Treatment Comparisons and LSD
  6.1 Multiple Comparisons of Treatment Means
  6.2 Using R

7 Week 9 - Factorial ANOVA
  7.1 Treatment Designs: Factorial ANOVA
  7.2 Using R for Factorial ANOVA

8 Week 10/11 - Correlation and Simple Linear Regression
  8.1 BIVARIATE STATISTICAL METHODS
  8.2 Regression Analysis
  8.3 Using R and Examples
James McBroom - June 2020
Copyright Information
©Griffith University 2019. Subject to the provisions of the Copyright Act, no
part of this publication may be reproduced in any form or by any means (whether
mechanical, electronic, microcopying, photocopying, recording, or otherwise),
stored in a retrieval system or transmitted without prior permission.
Chapter 1

Introduction To Statistics

Outline:
1. Introduction
2. Revision and Topics Assumed - Exploratory Data Analysis
(EDA) and Probability (separate document)
3. Using the R computer software:
• Accessing R
• Using R within R Studio.
• Creating A Basic R Script (numerical summaries, plots,
tables)
Accompanying Workshop (done in week 2):
• Using R - Using RStudio; Creating a project; writing and saving
commands;
• Entering data; viewing data;
• Basic EDA - summary(), boxplot(), table(), histogram()
Workshop for Week 1
Nil
Project Requirements for Week 1
Nil
Things for you to Check in Week 1
• Ensure you have enrolled in a workshop;
• If you intend to use University computers ensure your computing account is active and accessible;


• Read the lecture notes for week 1 (these notes).


• Try out R (eg go to the R section at the end of these notes and
try to type and run the code there).

1.1 Statistics
• Why Statistics? (What is Statistics?)
• why do I need to use statistics?
• what is a statistical test?
• what do the results of a statistical test mean?
• what is statistical significance?
• how does probability fit in?
Statistical Inference requires assumptions about the data being analysed.
Lack of awareness and/or concern for assumptions leads to “misuse”:
You can prove anything with statistics.
There are three types of lies; lies, damned lies and statistics.
Statistics is concerned with scientific methods for collecting, organizing, summarizing, presenting and analyzing data as well as with drawing valid conclusions and making reasonable decisions on the basis of such analysis.
We can no more escape data than we can avoid the use of words. With data
comes variation. Statistics is a tool to understand variation.

1.2 Individuals and Variables


Individuals: the objects described by the data – may be people, but not necessarily. Need a clear definition of what individuals the data describe and how many of them there are in the data.
Variable: A measurement made on an individual – a characteristic of an individual.
Example: A researcher randomly selects 100 trees from Toohey forest and
measures their height (H) and circumference at chest height (CCH). In this
example the individuals are trees in Toohey forest (there are 100 individuals in
this study). There are two variables: H and CCH.

1.3 Random Variables and Variation


A random variable is a variable whose value may change depending on chance.
The name of a random variable will often be symbolised by an upper case letter,
e.g.: 𝑋, 𝑄, 𝑃 , 𝑆, 𝑌 , 𝑍.

A particular value (or realisation) of a random variable is called a variate, and is often symbolised by a lower case letter, e.g.: 𝑥, 𝑞, 𝑝, 𝑠, 𝑦, 𝑧.

1.4 Statistical Populations


In statistics the term population refers to the totality of the data to which
reference is to be made. It is critically dependent on the measurement being
made and the question being asked.

There may be more than one population within the same problem. Some examples of different types of populations are:


• real versus conceptual
• univariate versus multivariate
• quantitative versus qualitative
• discrete versus continuous

A statistical population can be described using the distribution of its measurements or in terms of the probabilities of the values of a random variable.

1.5 Uncertainty, Error and Variability


The historical development of statistical theory began with the theory of errors.
Note the following entry from the Encyclopedia Britannica:

A physical occurrence can seldom if ever be reproduced exactly. In particular the carefully staged physical occurrences known as scientific experiments are never capable of exact repetition. For example the yield of a product from a chemical reaction may vary quite considerably from one occasion to another due to slight differences of conditions such as temperature, pressure, concentration and agitation rate. Other errors will be introduced because of the impossibility of obtaining an entirely representative sample of the product for analysis. Finally, no method of chemical analysis of the sampled material will be exactly reproducible. The application of probability theory to the treatment of these various discrepancies is called the theory of errors.

Error ≠ uncertainty. Both are present to some extent in any scientific research.

Error: experimental error AND natural variability

Uncertainty: sampling AND lack of knowledge



1.6 Research Studies & Scientific Investigations


There are two main categories of scientific investigations:
Exploratory - fact finding, often with no specific hypothesis in mind.
Controlled Experimentation - investigations which begin with a specific hypothesis. In this latter category there are 2 main types of study:
• Observational Study – take measurements of something happening without control.
• Experiment – manipulative, do something to observe a response – control.

1.7 Probability
A vital tool in statistics. See the assumed knowledge notes for a basic introduction.
The Study of Randomness – probability theory - describes random behaviour.
Note that random does not mean haphazard.
There are numerous schools of thought when it comes to defining what ‘probability’ means. One definition states:
… ‘empirical probability’ of an event is taken to be the ‘relative frequency’ of occurrence of the event when the number of observations is very large. The probability itself is the ‘limit’ of the relative frequency as the number of observations increases indefinitely.
Note there are different conceptualizations of probability: empirical, theoretical,
subjective – We will assume the empirical approach in this course.

1.8 Statistical Inference


There are two main phases in any statistical analysis:
Exploratory Data Analysis – See separate assumed knowledge notes.
Statistical Inference – See below.
One can undertake exploratory data analysis without progressing to inference.
Conversely, in practice any inferential process should always be preceded by an
exploration of the data in order to understand the data more fully.

1.8.1 Introduction to Statistical Inference


Research is carried out to find out about generalities. Advances in human
knowledge generally rely on being able to state that something will happen in
most cases. BUT, we rarely have information for all cases.

Population and Sample: Note that population does not necessarily refer to
people.
Population: the totality of individual observations about which inferences are
to be made
Sample: a subset of the population. The part of the population that we actually
examine in order to gather information.
Why Sample?
(a) Cost: Resources available for study are limited, as are time and effort.
(b) Utility: In some cases items may be destroyed in the process of sampling.
(c) Accessibility: It may be impracticable or even impossible to measure an entire population.
Example of inference in every day life:
On a cold morning, should we wear warm clothes for the day ahead, or will it
warm up during the day?
Statistical Inference: a set of procedures used in making appropriate conclusions and generalisations about a whole (the population), based on a limited number of observations (the sample).

In general: Statistics are calculated from a sample and describe that sample.
• Parameters are inferred from the calculated statistics.
• Parameters are used to describe the population.
• Therefore, statistics can be used to infer things about a population.
If all the population values were available, the population parameters would
simply be found by calculating them from the values available. This is what the
Australian Bureau of Statistics attempts to do when it conducts the census.
A population parameter has no error. It is an exact value which never changes
(assumption).

Sample statistics depend on the particular sample of data selected and thus
may vary in value; they represent a random variable and as such have variation.
A statistic should never be quoted without some estimate of its variation; usually
this is provided as a standard error of the statistic.
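To see this variation concretely, the short simulation below (not part of the original example; the population mean of 50 and standard deviation of 10 are made up) draws many samples of size 25 from the same population and looks at how much the sample mean changes from sample to sample. The spread of those means is close to the usual standard error formula: the population standard deviation divided by the square root of the sample size.

set.seed(1)                        # make the simulation reproducible
n <- 25                            # sample size
many.means <- replicate(2000, mean(rnorm(n, mean = 50, sd = 10)))

sd(many.means)                     # how much the sample mean varies across samples
10 / sqrt(n)                       # theoretical standard error: sd / sqrt(n) = 2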

1.8.2 Estimation and Hypothesis Testing


There are two basic branches of statistical inference: estimation and hypothesis testing.
Both make use of statistics calculated from sample data, and each has a specific
role to play in statistical inference. The choice of which to use depends on the
question being asked.
• What is the value of something? - This would entail estimation.
Versus
• Are two things the same? – This would entail hypothesis testing.
The basic principles of estimation & hypothesis testing are the same for ALL
types of parameters and statistics. However, the details may change. For example:
How to calculate the test statistic & estimate the standard error, and what probability distribution to use, are both important considerations in any inferential procedure and have situation-specific answers (which we will deal with later).
BUT
If you have a basic understanding of the concepts you will easily cope with
most situations.

1.9 Using The R Software - Week 1.


1.9.1 Accessing R in the common use computer labs
• If you wish to install your own copy of R at home or on your laptop, both
R and RStudio are free to download and use. Please see the information
in Other Resources on the L@G site.
• R and RStudio are installed on all common use computers across the
University. Note that nothing can be saved on Uni computers, so make
sure you save all work to a project on your USB stick (or similar) if using
Uni computers, or else you will lose all your work when you log off and
shut down. Tutors will help you with this in workshops.

1.9.2 Using R Within RStudio


There are four main windows in RStudio:

• Script - where you write the programs and code.


• Console - where the code from the script is run: output and errors appear
here also.
• Environment - Data, variables, functions etc are shown here, as well as
a tab that records the “history” of your analysis.
• Plot/Help - Any graphs you create, as well as the online help, plus a file
explorer, are shown here (using tabs).
The four windows usually appear together as a split screen. Note that when
you first open RStudio the script window may not appear – see the notes on
using RStudio for creating projects for details on how to open a new R script.

1.9.3 Getting Help


R has an excellent online ‘help’ facility once you get used to it.
Alternatives include searching the internet for R code help but just be aware,
as with all of the internet, that there are good and bad sources on the web. If
in doubt ask your lecturer or tutor.
The online help system in R is accessed in the help tab in the bottom right
window of RStudio. You can read the manuals if you wish, or else you can
search for specific functions in R to learn the syntax and what arguments the
function requires. See the Rintro notes and lectures for some examples of how
to use the help system within R.
Invaluable Help
As you move through the course you should try the code given to you (and
obviously write your own as well!) and document what each step in the code
does. R allows you to put comments in the script files, and this is very useful to
give you reminders about what each piece of code does. Save these R scripts on
your computer (perhaps in their own project areas – see the notes on creating
and using R projects in RStudio).
You will find these R script files, containing code demonstrated in lectures, in
each week’s lecture notes folder. You should download, save, and try running the
code in these files. They often have a lot of comments in them, and you should
feel free to add your own as well to help you understand what is happening from
your own perspective. You will be expected and encouraged to create your own
script files in workshops to help you do certain questions and analyses. Save
these in some sort of systematic way, and comment them extensively so that
when you look at them again for revision in week 12 (or later in your career
when you need to analyze some data) it is not a complete mystery to you!

1.9.4 Creating an R script/program


Some Initial Notes
• R is case sensitive – bla is different to Bla in R’s computer brain.

• Use of a hash at start of a line indicates R is not to read that line – this
is ideal for putting comments and reminders in your code.
• R uses “object oriented” programming – everything in R is, or can be
made into, an “object”. For our purposes this basically means we can
and should give everything we do a name using the assignment operator.
For example x <- rnorm(1000). There is now an object called x that
contains 1000 random numbers from a normal distribution. Next time we
need those numbers, we simply use x. The kinds of names we should use
will be discussed in lectures.
• We write our code in the script window in RStudio, and we run it by putting the cursor on the line of code we wish to run and clicking the Run button. A short example script illustrating these points follows this list.
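Putting these points together, a first script might look like the short sketch below (the object name x is just an example):

# Anything after a hash is a comment - R does not run it
x <- rnorm(1000)   # assignment: an object called x now holds 1000 random normal values
mean(x)            # use the object by name whenever you need those numbers
# mean(X)          # would fail: R is case sensitive, so X is not the same object as x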
A Inputting DATA
Most statistical analyses start with data, and so most analyses in R begin by
entering, or reading in, data. R has many functions to read in data (mostly
related to the kind of file the data is originally stored in, like Excel or text files).
Data should be given a name so that it is easily available after we input it. For
example:
rain.dat <- a.function.that.reads.in.data("from an excel file")
In this example we are using a function (I made its name up – we’ll see the
correct function name shortly) to read in data from a file (I’ve made up a
strange name for the file too…!) and I’ve put the result into an object that
I’ve called rain.dat. Now there is a data set in R called rain.dat, and if I
remember to save each time I quit R, rain.dat will be there when I next open
R, forever, or until I choose to delete it. I never have to read the data into R
again after this initial data input.
So let’s look at a proper example. The table below gives the rainfall over four
seasons in five different districts.

                      Season
District   Winter   Spring   Summer   Autumn
1              23      440      800       80
2             250      500     1180      200
3             120      400      420      430
4              10       20       30        5
5              60      200      250      120

Here there are three variables, district, season and rainfall – each measurement
has 3 parts to it - trivariate. Note that the data is in a compressed form –
typically each variable should have its own column, like this:

District   Season   Rainfall
1          Winter         23
1          Spring        440
1          Summer        800
1          Autumn         80
2          Winter        250
2          Spring        500
2          Summer       1180
2          Autumn        200
…          …               …

An Excel file with the data in this form can be found in the lecture notes folder
for week 1.
There are two ways to import this data into R. The first is to use the Import
Dataset menu in the Environment window in RStudio. This will be shown in
the lecture.
The second way is a useful shortcut when you have small to moderate sized data
(say less than a few thousand rows of data). R allows you to copy the data using
your mouse/keyboard, and then enter it using the following code:
rainfall <- read.table("clipboard", header = T) # On Windows
or
rainfall <- read.table(pipe("pbpaste"), header = T) # On MacOSX
This will also be demonstrated in lectures.
B Accessing, Exploring, Graphing and Analysing
Once the data set has been entered various analyses can proceed. R uses functions to do analyses, graphs and summaries.
1. Printing/Viewing: typing the name of the variable/dataset will
print it in the command window:
rainfall <- read.csv("rainfall.csv")
rainfall

## District Season Rainfall


## 1 1 Winter 23
## 2 1 Spring 440
## 3 1 Summer 800
## 4 1 Autumn 80
## 5 2 Winter 250
## 6 2 Spring 500
## 7 2 Summer 1180
## 8 2 Autumn 200
## 9 3 Winter 120
## 10 3 Spring 400
## 11 3 Summer 420
## 12 3 Autumn 430
## 13 4 Winter 10
## 14 4 Spring 20
## 15 4 Summer 30
## 16 4 Autumn 5
## 17 5 Winter 60
## 18 5 Spring 200
## 19 5 Summer 250
## 20 5 Autumn 120
Accessing variables within a dataset is achieved by giving the name of the dataset, followed by a dollar sign, followed by the name of the variable, like this: datasetname$variablename:
rainfall$Season

## [1] "Winter" "Spring" "Summer" "Autumn" "Winter" "Spring" "Summer" "Autumn"


## [9] "Winter" "Spring" "Summer" "Autumn" "Winter" "Spring" "Summer" "Autumn"
## [17] "Winter" "Spring" "Summer" "Autumn"
rainfall$Rainfall

## [1] 23 440 800 80 250 500 1180 200 120 400 420 430 10 20 30
## [16] 5 60 200 250 120
rainfall$District

## [1] 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5
2. Basic EDA – numerical summary of data using summary() and
table() functions:
For the rainfall example:
summary(rainfall)

## District Season Rainfall


## Min. :1 Length:20 Min. : 5.0
## 1st Qu.:2 Class :character 1st Qu.: 52.5
## Median :3 Mode :character Median : 200.0
## Mean :3 Mean : 276.9
## 3rd Qu.:4 3rd Qu.: 422.5
## Max. :5 Max. :1180.0
summary(rainfall$Rainfall)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 5.0 52.5 200.0 276.9 422.5 1180.0


table(rainfall$Season) #for categorical variables

##
## Autumn Spring Summer Winter
## 5 5 5 5

3. Graphs - Histograms, Boxplots, Quantile-Quantile (QQ) plots, Barplots:
For the rainfall example:


hist(rainfall$Rainfall)

(Figure: histogram of rainfall$Rainfall, with Frequency on the vertical axis and Rainfall from 0 to 1200 on the horizontal axis.)
boxplot(rainfall$Rainfall)


boxplot(Rainfall ~ Season, data = rainfall, main = "Boxplots are Great!!!")

(Figure: boxplots of Rainfall by Season, titled "Boxplots are Great!!!", with Autumn, Spring, Summer and Winter on the horizontal axis.)
qqnorm(rainfall$Rainfall)
qqline(rainfall$Rainfall)

(Figure: Normal Q-Q plot of rainfall$Rainfall, plotting Sample Quantiles against Theoretical Quantiles.)

barplot(table(rainfall$Season))
(Figure: barplot of the season counts, five for each of Autumn, Spring, Summer and Winter.)


dotchart(as.numeric(table(rainfall$Season)))

Chapter 2

Week 2 - Chi-Squared Tests

Outline:
1. Statistical Inference
• Introductory Examples: Goodness of Fit
• Test Statistics & the Null Hypothesis
– The Null Hypothesis: 𝐻0
– The Test Statistic, 𝑇
– Distribution of the Test Statistic
– The Null Distribution
– Degrees of Freedom
– The Statistical Table
– Using the 𝜒2 Table
– Examples of Using the Table
– The Significance Level 𝛼, and the Type I Error
– The Goodness of Fit Examples Revisited
• The Formal Chi Squared, 𝜒2 , Goodness of Fit Test
• The Chi Squared, 𝜒2, Test of Independence – The two-way contingency table
2. Using R
• Using the rep and factor functions to enter repeating categorical
data into R.


• Using R for Goodness of Fit Tests


• Using R for Tests of Independence – Two Way Contingency
Table
Workshop for Week 2
Work relating to Week 1 – R – importing data, exploring data using
graphs and numerical summaries. Project selection and planning.
Things YOU must do in Week 2
• Revise and summarise the lecture notes for week 1;
• Read your week 2 lecture notes before the lecture;
• Attend your first workshop/computer lab;
• Read the workshop on learning@griffith before your workshop;
• Download and install R and RStudio on your personal computer(s)
[links are on L@G in Other Resources];
• Get to know your tutor and some of your fellow students in the
workshop;
• Start working with R;

2.1 STATISTICAL INFERENCE – AN INTRODUCTION

From last week's notes:
Statistical Inference: a set of procedures used in making appropriate conclusions
and generalisations about a whole (the population), based on a limited number
of observations (the sample).
and
There are two basic branches of statistical inference: estimation and hypothesis
testing.
During the rest of this course you will learn a number of different statistical tests used for hypothesis testing and estimation (refer to the end of this week's notes).

2.1.1 Some Goodness of Fit Examples


2.1.1.1 Example 1 – Gold Lotto: Is it fair or are players being ripped off?

In Queensland, Saturday night Gold Lotto involves the selection of eight numbers from a pool of 45. Numbers are not replaced, so each can only be selected once on any draw. The following table is based on an extract from the Sunday Mail (July 2017), and gives a summary (in the first two columns) of the number of times each of the numbers, 1 to 45, has been drawn over a large subset of games.

The total of all the numbers of times drawn is 7120 (166 + 162+ … + 155 + 152)
and, since eight numbers are drawn each time, this means that the data relate
to 890 games. The organisers guarantee that the mechanism used to select the
numbers is unbiased: therefore all numbers are equally likely to be selected.

Is this true?

There are 45 numbers and if each number is equally likely to be selected on each of the 7120 draws, each number should have been selected 158.2222 times (7120/45). [This is an approximation, which is OK given the large number of draws we have. Why is this an approximation?]

How different from this expected number of 158.2222, can an observed number
of times be before we begin to claim ‘foul play’?

The table below also shows the differences between 158.2222 and the observed
frequency of occurrence for each number. Intuitively it would make sense to
add up all these differences and see how big the sum is. But, this doesn’t work.
Why? Also in the table are the squares for each of the differences – maybe the
sum of these values would mean something? But, if these squared differences
are used it means that all differences of say two are treated equally. Is it fair
for the difference between the numbers 10 and 12 to be treated the same as the
difference between 1000 and 1002? A way to take this into account is to scale
the squared difference by dividing by the expected value – in this example this
scaling will be the same as all expected values are 158.2222.

The final column in the table gives the scaled squared differences:

$$\frac{(\text{observed number} - \text{expected number})^2}{\text{expected number}} = \frac{(O - E)^2}{E} \tag{2.1}$$

The sum of the final column can be expressed as:

$$T = \sum_{i=1}^{k} \frac{(O - E)^2}{E} \tag{2.2}$$

where the summation takes place over all the bits to be added – that is, the
total number of categories for which there are observations. Here 𝑘 = 45, the
45 Gold Lotto numbers.

(Table: for each of the 45 Lotto numbers, the observed number of times drawn, the difference from the expected 158.2222, the squared difference, and the scaled squared difference.)

How bad is the lack of fit? Is a sum of 31.504 (31.386 in 2015) a lot more than
we would expect if the numbers are drawn randomly? Are any of the values in
the last column ‘very’ big? How big is ‘very’ big for any single number?
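One way to get a feel for how big the test statistic tends to be when the selection really is fair is to simulate fair draws in R and apply equation (2.2). The sketch below is an illustration only: it generates its own fair data rather than using the newspaper counts, which are not reproduced in full in these notes.

set.seed(2017)
expected <- 7120 / 45                          # 158.2222 under the null hypothesis

# Simulate 890 fair games, each selecting 8 of the 45 numbers without replacement
draws <- replicate(890, sample(1:45, size = 8))
obs <- table(factor(draws, levels = 1:45))     # observed count for each of the 45 numbers

T.stat <- sum((obs - expected)^2 / expected)   # equation (2.2)
T.stat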

2.1.1.2 Example 2 – Occupational Health and Safety: How do gloves wear out?

Plastic gloves are commonly used in a variety of factories. The following data
were collected in a study to test the belief that all parts of such gloves wear out
equally. Skilled workers who all carried out the same job in the same factory
used gloves from the same batch over a specified time period. Four parts on the
gloves were identified: the palm, fingertips, knuckle and the join between the
fingers and the palm. A ‘failure’ is detected by measuring the rate of permeation
through the material; failure occurs when the rate exceeds a given value. A total
of 200 gloves were assessed.

If the gloves wear evenly we would expect each of the four positions to have
the same number of first failures. That is, the 200 gloves would be distributed
equally (uniformly) across the four places and each place would have 200/4
= 50 first failures. How do the numbers shape up to this belief of a uniform
distribution? Are the observed numbers much different from 50?

2.1.1.3 Example 3 - Credit Card Debt


A large banking corporation claims that there is no problem with credit card
debt as 50% of people pay the full debt owing, 20% pay half of the debt, 25%
pay the minimum repayment, and only 5% fail to pay at least the minimum.
They have been requested to prove that their claim is correct and have collected
the following information for the previous month on 160 of their credit card
customers selected at random.

Do the data support the bank’s statements?

If the distribution of credit card payments is as stipulated by the bank we would expect to see 80 paying all of the debt, 32 paying half, 40 paying the minimum and only 8 failing to pay. How do the observed data compare with these expected numbers? Is the bank correct in its claims?

2.1.1.4 Example 4 – Mendel’s Peas


(Mendel, G. 1866. Versuche über Pflanzen-Hybriden. Verhandlungen des naturforschenden Vereines in Brünn 4, 3–47. English translation by the Royal Horticultural Society of London, reprinted in Peters, J. A. 1959. Classic Papers in Genetics. Prentice-Hall, Englewood Cliffs, NJ.)

Perhaps the best-known historical goodness of fit story belongs to Mendel and
his peas. In his work (published in 1866), Mendel sought to prove what we now
accept as the basic laws of Mendelian genetics in which an offspring receives
discrete genes from each of its parents. Mendel carried out numerous breeding
experiments with garden peas which were bred in a way that ensured they were
heterozygote, that is, they had one of each of two possible alleles, at a number of
single loci each of which controlled clearly observed phenotypic characteristics.
One of the alleles was a recessive and the other dominant so that for the recessive
phenotype to be present, the offspring had to receive one of the recessive alleles
from both parents. Individuals who had one or two of the dominant alleles would
all appear the same phenotypically.

Mendel hypothesised two things. Firstly, he believed that the transmission of a particular parental allele (of the two possible) was a random event, thus the probabilities that an offspring would have the recessive or the dominant allele from one of the parents were 50:50. Secondly, he believed that the inheritance of the alleles from each parent were independent events; the probability that a particular allele would come from one parent was not in any way affected by which allele came from the other parent. If his beliefs were true, the offspring from the heterozygote parents would appear phenotypically to have a ratio between the dominant and recessive forms of 3:1.

This phenomenon is illustrated in the table below for the particular pea characteristic of round or wrinkled seed form. The wrinkled form is recessive, thus for a seed to appear wrinkled it must have a wrinkled allele from both of its parents; all other combinations will result in a seed form that appears round. The two possible alleles are indicated as r and w.

From the table it can be seen that the probabilities associated with the wrinkled
and round phenotypic seed forms are ¼ and ¾, respectively. The following table
gives the actual results from one of Mendel’s experiments.

2.1.2 Test Statistics & the Null Hypothesis


2.1.2.1 The Null Hypothesis
The belief or proposal that allows us to calculate the expected values is known
as the null hypothesis. It represents a known situation that we take as being
true; it is the base line. The observed values from the sample of data are
used to test the null hypothesis. If the null hypothesis is true, how likely is it that we would observe the results we saw in the sample of data? What is Pr(data | null hypothesis)? If the results from the sample are unlikely under
the null hypothesis, we reject it in favour of something else, the alternative
hypothesis.
In the Gold Lotto, we are told by the organisers that there is no bias in the
selection process and that each number is equally likely to be drawn. Assuming
this belief (hypothesis) to be true, we determined that each of the 45 numbers
should have been selected equally often, namely 158.22 times.
In the gloves example it was suggested that each part would wear evenly thus
the number of first failures would be the same for all four parts – 50 each in the
200 gloves.
For the credit card example, a specific distribution was proposed by the banking
organisation, namely the probabilities of 0.50, 0.20, 0.25 and 0.05 for each of
the outcomes of full debt payment, half debt payment, payment of minimum
amount and no payment. It was this distribution that was to be tested with
the observed sample of 160 customers – under the null hypothesis, 80 customers
should have paid out in full, 32 should have paid half their debt, 40 should have
paid just the minimum payment and 8 should have made no payment at all.
In Mendel’s peas examples, the hypotheses that would be used are:

𝐻0 ∶ the alleles making up an individual’s genotype associate independently and each has a probability of 0.5.
𝐻1 ∶ the alleles making up an individual’s genotype do not associate independently.

𝐻0 : null hypothesis.
𝐻1 or 𝐻𝐴 : alternative hypothesis.
The basis of all statistical hypothesis testing is to assume that the null hypoth-
esis is true and then see what the chances are that this situation could produce
the data that are obtained in the sample. If the sample data are unlikely to be
obtained from the situation described by the null hypothesis, then probably the
situation is different; that is, the null hypothesis is not true.

2.1.2.2 The Test Statistic


The issue in each of the above examples is to find an objective way of deciding whether the difference between the observed and expected frequencies is ‘excessive’. Clearly we would not expect the observed to be exactly equal to the expected in every sample we take. A certain amount of ‘leeway’ must be acceptable. But, when does the difference become so great that we say we no longer believe the given hypothesis?

The first step is to determine some measure which expresses the ‘difference’ in
a meaningful way. In the above examples the measure used is the sum of the
scaled squares of the differences. We call this measure the test statistic.

In the Gold Lotto example the test statistic was 31.504 across the entire 45
numbers. How ‘significant’ is this value of 31.504? What sorts of values would
we get for this value if the observed frequencies had been slightly different –
suppose the number of times a 19 was chosen was 160 instead of 171; or that
27 had been observed 138 times instead of 130. If the expected value for each
number is 158.22, how much total difference from this across the 45 numbers can
be tolerated before a warning bell sounds to the authorities? What distribution
of values is ‘possible’ for this test statistic from different samples if the expected
frequency truly is 158.22? If all observed frequencies were very close to 158
and 159, the test statistic would be about zero. But, we would not really be
surprised (would not doubt that the numbers occur equally often) if all numbers
were 158 or 159 except two, one having an observed value of 156 say, and the
other 160, say. In this case the test statistic would be < 0.2 – this would not
be a result that would alarm us. Clearly there are many, many possible ways
that the 7120 draws could be distributed between the 45 numbers without us
crying ‘foul’ and demanding that action be taken because the selection process
is biased.

It would be impossible to list out all the possible sets of outcomes for the
Gold Lotto case in a reasonable amount of time. Instead, consider the simple
supposedly unbiased coin which is tossed 20 times. If the coin truly is unbiased
we expect to see 10 heads and 10 tails. Would we worry if the result was 11
heads and 9 tails? or 12 heads and 8 tails?

Using the same test statistic as in the Lotto example, these two situations would
give:
$$T = \frac{(9-10)^2}{10} + \frac{(11-10)^2}{10} = 0.2$$

and

$$T = \frac{(8-10)^2}{10} + \frac{(12-10)^2}{10} = 0.8.$$

What about a result of 14 heads and 6 tails? (T = 3.2)

And 15 heads and 5 tails? (T = 5.0)

With the extreme situation of 15 heads and 5 tails, the test statistic is starting
to get much bigger.
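These coin calculations are easy to check in R. The few lines below compute T directly for the 14 heads / 6 tails case, and then confirm it with the built-in chisq.test() function, which performs exactly this calculation when given a vector of counts (by default it tests against equal expected proportions):

observed <- c(heads = 14, tails = 6)
expected <- c(10, 10)                       # 20 tosses of a fair coin

sum((observed - expected)^2 / expected)     # T = 3.2, as above

chisq.test(observed)                        # reports X-squared = 3.2 on df = 1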

2.1.3 Distribution of the Test Statistic

2.1.3.1 The Null Distribution

Each of the possible outcomes for any test statistic has an associated probability
based on the hypothesised situation. It is the accumulation of these probabilities
that gives us the probability distribution of the test statistic under the null
distribution.

If the proposed (hypothesised) situation is true, some of the outcomes will be highly likely (eg if a coin really is unbiased, 10 heads and 10 tails, or 11 heads and 9 tails, will be highly likely). Other outcomes will be most unlikely (eg 19 heads and 1 tail, or 20 heads and no tails, if the coin is unbiased). Suppose we toss the coin 20 times and get 16 heads and 4 tails. How likely is this to occur if the coin is unbiased? You will see in a later lecture how the probability associated with each of these outcomes can be calculated exactly for the coin example. However, for the moment we will use ‘mathemagic’ to find all the required probabilities.

In general, the test statistic, which has been calculated using the sample of
data, must be referred to the appropriate probability distribution. Depending
on where the particular value of the test statistic sits in the relative scale of
probabilities of the null distribution (on how likely the particular value is), a
decision can then be made about the validity of the proposed belief (hypothesis).

In the Gold Lotto example, we have a test statistic of 31.504 but, as yet, we
have no way of knowing whether or not this value is likely if the selection is
unbiased. We need some sort of probability distribution which will tell us which
values of the test statistic are likely if the ball selection is random. If we had this
distribution we could see how the value of 31.504 fits into the belief of random
ball selection. If it is not likely that a value as extreme as this would occur,
then we will tend to reject the belief that selection of the numbers is fair.

The test statistic used in the above examples was first proposed by Karl Pearson in the late 1800s. Following his publication in 1900 it became known as:

The Pearson (Chi-Squared) Test Statistic

In his 1900 work, Pearson showed that his test statistic has a probability distribution which is approximately described by the Chi-Squared distribution. This is a theoretical mathematical function which has been studied and whose probability distribution has a shape like this:
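You can draw this shape for yourself in R using the dchisq() function; the degrees of freedom values below are chosen purely for illustration:

x <- seq(0.01, 20, length.out = 400)
plot(x, dchisq(x, df = 1), type = "l", ylim = c(0, 0.5),
     xlab = "Value of the test statistic", ylab = "Density")
lines(x, dchisq(x, df = 4), lty = 2)                 # add curves for other df values
lines(x, dchisq(x, df = 9), lty = 3)
legend("topright", legend = c("df = 1", "df = 4", "df = 9"), lty = 1:3)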

The exact shape of the chi-squared distribution changes depending on the degrees of freedom involved in the calculation process. Degrees of Freedom (DF) will be discussed below.
The distribution shows what values of the test statistic are likely (those that
have a high probability of occurring). Values in the tail have a small probability
of occurring and so are unlikely. If our calculated test statistic lies in the outer
extremes of the distribution we will doubt the validity of the null distribution as
a description of the population from which our sample has been taken. Since we
are interested in knowing whether or not our calculated test statistic is ‘extreme’
within the probability distribution, we will look at the values in the ‘tail’ of the
distribution – this is discussed further in section 2.1.3.3.

2.1.3.2 Degrees of Freedom


The concept of degrees of freedom arises often in statistical inference. It is
best understood by the following example. Suppose you are asked to select 10
numbers at random. Your answer will include any 10 numbers you like; there
are no constraints and you can freely select any ten numbers. But, now suppose
that you are asked to select 10 numbers whose total must be 70. How many
numbers can you now select freely? The first nine numbers can be anything
at all. But, the tenth number must be something that makes the sum of all
10 numbers equal to 70. The single constraint concerning the total has limited
your freedom of choice by one. Only nine numbers could be chosen freely, the
final number was determined by the total having to be 70.
In all the chi-squared situations above, the expected frequencies are determined
by applying the stated hypothesised probability distribution and using the total
number of observations in the sample. Thus, a constraint is in place: the sum
of the expected frequencies must equal the total number of observations. This
represents a single constraint so the degrees of freedom will be one less than the
number of categories for which expected values are given.

For the Gold Lotto there are 45 categories (the numbers) for which expected
values are needed. The sum across these 45 expected values must be 7120, thus
only 44 frequencies could be selected freely, the final or 45th frequency being
completely determined by the 44 previous values and the total of 7120.
In the occupational health and safety example there are four categories (the
four parts of the glove) across which the sum of the expected frequencies must
be 200. The degrees of freedom must be three (4 – 1 = 3).
The banking institution example - four categories giving three degrees of freedom.
For Mendel’s peas - 2 categories giving 1 degree of freedom.
A number of the mathematical probability distributions, including the chi-
squared, which are used in statistical inference vary a little depending on the
degrees of freedom associated with them. The effects of different degrees of
freedom vary for different types of distributions. For chi-squared, the effects are
discussed below.

2.1.3.3 The Statistical Table


One of the most common ways of presenting the null distribution of a test statistic is with the statistical table. A statistical table consists of selected probabilities from the cumulative probability distribution of the test statistic if the null hypothesis is true. The selected probabilities will represent different areas of the ‘tail’ of the distribution, and the actual probabilities chosen for inclusion in a table depend on a number of things. To some extent they are subjective and vary depending on the person who creates the table. Conventionally, probabilities of 0.05 (5%) and 0.01 (1%) (for significance levels of 0.05 and 0.01) are used most frequently.
The type of distribution used depends on the type of data involved, on the type
of hypothesis (or belief) proposed, and the test statistic used. For example, a
researcher might be interested in testing a belief about the difference in mean
pH level between two different types of soil. Thus, a frequency distribution
which is appropriate for a test statistic describing ‘the difference between two
means’ will be needed.
The statistical table needed in the examples of goodness of fit presented above is the chi-squared (𝜒2) table. It is based on the 𝜒2 distribution, whose general shape is given in the picture above (also see lectures). The associated table is known as a chi-squared, 𝜒2, table and a copy is given in the statistical tables section of the L@G site.

2.1.3.4 Using the 𝜒2 Table


In order to read the 𝜒2 table we need to know the degrees of freedom of the test statistic relevant for the given situation. Degrees of freedom are denoted by the Greek letter 𝜈 (‘nu’, pronounced like ‘new’); this is the most common symbol used for degrees of freedom. Values for 𝜈 are given in the first column of the table. The remaining columns in the table refer to quantiles, 𝜒2(𝑝), which are
the chi-squared values required to achieve the specific cumulative probabilities
𝑝 (ie probability of being less than or equal to the quantile).
We are interested in extreme values; that is, exceptionally large values of
the test statistic. Thus, the interesting part of the distribution is the right
hand tail. To obtain the proportion of chi-squared values lying in the extreme
right hand tail of the distribution, that is, the proportion of chi-squared values
greater than the particular quantile, the nominated 𝑝 is subtracted from 1. The
actual numbers in the bulk of the table are the 𝜒2 values that give the stated
cumulative probability, the quantile, for the nominated degrees of freedom. For
this particular table the probabilities relate to the left hand tail of the distribution and are the probability of getting a value less than or equal to some specified 𝜒2 value. For example, the column headed 0.950 contains values, say A, along the x-axis such that $\Pr(\chi^2_\nu \le A) = 0.95$.
Conversely, there will be 5% of values greater than A: $\Pr(\chi^2_\nu > A) = 1 - 0.95 = 0.05$. This can be seen graphically in the following figure, where the chi-squared
distribution for one degree of freedom is illustrated.

As you look down the column for 𝜒2 (0.95) you will notice that the values of A
vary. The rows are determined by the degrees of freedom, and thus the A value
also depends on the degrees of freedom. What is actually happening is that the
𝜒2 distribution changes its shape depending on the degrees of freedom, 𝜈. The
value of the degrees of freedom is incorporated into the expression as follows:

$$\chi^2_{\nu}(0.95)$$

This means the point in a 𝜒2 with 𝜈 degrees of freedom that has 95% of values
below it (or, equally, 5% of values above it).

2.1.3.5 Examples Using the 𝜒2 Table


1. If df = 2, what is the 𝜒2 value which leads to a right hand tail of 0.05? That is, find A such that $\Pr(\chi^2_2 > A) = 0.05$.

Solution: Degrees of freedom of 2 means use the 2nd row; a tail probability of 0.05 implies the 0.95 quantile, thus the 4th column, giving: A = 5.99.

2. If df = 3, what is the 𝜒2 value which leads to a right hand tail of 0.05? That is, find A such that $\Pr(\chi^2_3 > A) = 0.05$.

Solution: Degrees of freedom of 3 means use the 3rd row; a tail probability of 0.05 implies the 0.95 quantile and thus the 4th column, giving: A = 7.82.

3. If df = 1, what is the 𝜒2 value which leads to a right hand tail of 0.01? That is, find A such that $\Pr(\chi^2_1 > A) = 0.01$.

Solution: Find the value in the 1st row (degrees of freedom of one) and the 0.99 quantile, 2nd column (the 0.99 quantile has an extreme right tail probability of 0.01): A = 6.64.
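The same quantiles can be obtained in R with the qchisq() function, which takes the cumulative probability and the degrees of freedom (the printed table rounds slightly differently in the last decimal place):

qchisq(0.95, df = 2)   # 5.991465 - example 1 (table value 5.99)
qchisq(0.95, df = 3)   # 7.814728 - example 2 (table value 7.82)
qchisq(0.99, df = 1)   # 6.634897 - example 3 (table value 6.64)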

2.1.3.6 The Significance Level, 𝛼, and The Type I Error


The tail probability that we select as being ‘extreme’ is the probability that the
null hypothesis will be rejected when it is really true. It is the error we could
be making when we conclude that the data are such that we cannot accept the
proposed distribution. This error is known as the Type I Error. More will be
said about errors in a later section. For the moment you need to understand
that the level of significance is a measure of the probability that a Type I Error
may occur.
The value chosen for the level of significance (and thus the probability of a Type
I Error) is purely arbitrary. It has been traditionally set at 0.05 (one chance
in 20) and 0.01 (one chance in 100), partly because these are the values chosen
by Sir Ronald A. Fisher when he first tabulated some of the more important
statistical tables. Error rates of 1 in 20 and 1 in 100 were acceptable to Fisher
for his work. More will be said about this in later sections. Unless otherwise
stated, in this course we will use a level of significance of 𝛼 = 0.05.
The value, A, discussed in section 2.1.3.4 is known as the critical value for the
specified significance level. It is the value against which we compare the
calculated test statistic to decide whether or not to reject the null hypothesis.

2.1.3.7 The Goodness of Fit Examples Revisited


1. Gold Lotto
Degrees of freedom 𝜈= 44.
Using significance level of 𝛼 = 0.05 means the 0.95 quantile is needed.
With df = 50, from the 4th column we get 67.5.

(Note that the table does not have a row for df = 44 so we use the next highest df, 50.)

Critical region is a calculated value > 67.5

The calculated test statistic for the Gold Lotto sample data is T = 31.504, which does not lie in the critical region.

Do not reject 𝐻0 and conclude that the sample does not provide
evidence to reject the proposal that the 45 numbers are selected at
random (𝛼 = 0.05).

2. Occupational Health and Safety – Glove Wearing

Degrees of freedom = 3.

Critical value from 𝜒2 table with df = 3 using significance level of


0.05 requires quantile of 0.95 = 7.82.

Critical region is a calculated value > 7.82.

Calculated test statistic: T =

Write your conclusion:

3. Bank Customers and Credit Cards

Degrees of freedom = 3.

Critical value from 𝜒2 table with df = 3 using significance level of


0.05 requires 0.95 quantile = 7.82

Critical region is a calculated value > 7.82

Calculated test statistic: T =

Write your conclusion:

4. Mendel’s Peas

Degrees of freedom = 1. Critical value from 𝜒2 table with df = 1


using significance level of 0.05 (0.95 quantile) = 3.84.

Critical region is a calculated value > 3.84.

Calculated test statistic: T = 0.5751 which does not lie in the critical
region.

Do not reject 𝐻0 and conclude that the sample does not provide
sufficient evidence to reject the proposal that the alleles, each with
probability of 0.5, are inherited independently from each parent (
𝛼 = 0.05).

2.1.4 The Formal Chi-Squared, 𝜒2 , Goodness of Fit Test


Research Question:
The values are not as specified (eg uniformly; in some specified ratios
such as 20%, 50%, 10%, 15%, 5%).
Null Hypothesis 𝐻0 :
Within the population of interest, the distribution of the possible
outcomes is as specified. [Nullifies the Research Question]
Alternative Hypothesis 𝐻1 :
The distribution within the population is not as specified. [Matches
the Research Question]
Sample:
Selected at random from the population of interest.
Test Statistic:

$$T = \sum_{i=1}^{k} \frac{(O - E)^2}{E}$$
calculated using the sample data.
Null Distribution:
The Chi-Squared, 𝜒2 , distribution and table.
Significance Level, 𝛼:
Assume 0.05 (5%).
Critical Value, A:
Will depend on the degrees of freedom 𝜈, which in turn depends on
the number of possible outcomes (categories, 𝑘).
Critical Region:
That part of the distribution more extreme than the critical value –
part of the distribution where the 𝜒2 value exceeds A.
Conclusion:
If 𝑇 > 𝐴 (i.e. 𝑇 lies in the critical region) reject 𝐻0 in favour of the
alternative hypothesis 𝐻1 .
Interpretation:
If 𝑇 > 𝐴 the null hypothesis is rejected. Conclude that the alternative hypothesis is true with significance level of 0.05. The null hypothesis has been falsified.

If 𝑇 ≤ 𝐴 the null hypothesis is not rejected. Conclude that there is insufficient evidence in the data to reject the null hypothesis – note that this does not prove the null.
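In R the whole goodness of fit procedure is carried out by the chisq.test() function: give it the vector of observed counts and the hypothesised probabilities via the p argument, and it returns the test statistic, the degrees of freedom and the p-value. The observed counts below are invented purely to show the syntax (they are not the data from the credit card example), but the probabilities are the bank's claimed distribution:

# Hypothetical observed counts for the four repayment outcomes (n = 160)
obs <- c(full = 70, half = 38, minimum = 42, none = 10)

# Test against the bank's claimed distribution of 50%, 20%, 25% and 5%
chisq.test(obs, p = c(0.50, 0.20, 0.25, 0.05))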

2.1.5 The 𝜒2 Test of Independence – The Two-Way Contingency Table
A second form of the goodness of fit test occurs when the question raised concerns whether or not two categorical variables are independent of each other.
For example, are hair colour and eye colour independent of each other? Are
sex and dexterity (right or left handedness) independent of each other? Is the
incidence of asthma in children related to the use of mosquito coils? Does the
type of credit card preferred depend on the sex of the customer? Do males
prefer Android phones and females prefer iPhones?

Recall the use of the definition of independent events in the basic rules of probability: this is the way in which the expected values are found.

The main difference between this form of the chi-squared test and the goodness
of fit test lies in the calculation of the degrees of freedom. If the concept does
not change, what would you expect the degrees of freedom to be for the example
given below? This will be discussed in lectures.

Example

One hundred students were selected at random from the ESC School and their
hair and eye colours recorded. These values have then been summarised into a
two-way table as follows.

Do these data support or refute the belief that a person’s hair and eye colours
are independent?

If the two characteristics are independent then the probability of a particular hair colour and a particular eye colour will be the probability associated with the particular hair colour multiplied by the probability associated with the relevant eye colour.

From the table, the probabilities for each eye colour are found by taking the
row sum and dividing by the total number of people, 100.

$$\Pr(\text{blue eyes}) = \frac{18}{100} = 0.18$$
$$\Pr(\text{green/hazel eyes}) = \frac{35}{100} = 0.35$$
$$\Pr(\text{brown eyes}) = \frac{47}{100} = 0.47$$

Similarly, the probabilities for the hair colours are found using the column totals.

$$\Pr(\text{brown/black hair}) = \frac{70}{100} = 0.70$$
$$\Pr(\text{blonde hair}) = \frac{20}{100} = 0.20$$
$$\Pr(\text{red hair}) = \frac{10}{100} = 0.10$$

The combined probabilities for each combination of hair and eye colour, under
the hypothesis that these characteristics are independent, are found by simply
multiplying the probabilities together. For example, under the assumption that
hair and eye colour are independent, the probability of having blue eyes and red
hair is:

$$\Pr(\text{Blue Eyes and Red Hair}) = \Pr(\text{blue eyes}) \times \Pr(\text{red hair}) = \frac{18}{100} \times \frac{10}{100} = \frac{18 \times 10}{100 \times 100} = 0.018$$

We would therefore expect 0.018 of the 100 people (ie 1.8 people) to have both
blue eyes and red hair, if the assumption that hair and eye colour are
independent is true. We can calculate the expected values of all combinations
of hair and eye colour in this way. In fact by following the example we can
formulate an equation for the expected value of a cell defined by the rth row
and cth column as:

$$E(r, c) = \frac{(\text{Row } r \text{ Total}) \times (\text{Column } c \text{ Total})}{\text{Overall Total}}.$$

Using this formula, we can calculate the expected values of each cell as shown
in the following table. Note that the expected values are displayed in italics:

Once the observed and expected frequencies are available, Pearson’s Chi Squared
Test Statistic is found in the same way as before

$$T = \frac{(5 - 12.6)^2}{12.6} + \frac{(25 - 24.5)^2}{24.5} + \frac{(40 - 32.9)^2}{32.9} + \ldots + \frac{(1 - 4.7)^2}{4.7} = 39.582$$

The degrees of freedom are 2 × 2 = 4 (check you understand this).

The critical value from a 𝜒2 table with df = 4 using a significance level of 0.05 is $\chi^2_4(0.95) = 9.49$.
The test statistic of 39.582 is much larger than this critical value of 9.49. We
say that the test is significant - the null hypothesis that hair and eye colours
are independent is rejected. We conclude that hair and eye colour are not
independent (𝛼 = 0.05).
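The expected counts used above can be reproduced in R from the marginal totals alone (18/35/47 for eye colour and 70/20/10 for hair colour). Running the full test with chisq.test() would also need the complete observed two-way table, which is only partly reproduced in these notes, so that final call is left as a comment:

eye.totals <- c(Blue = 18, GreenHazel = 35, Brown = 47)      # row totals
hair.totals <- c(BrownBlack = 70, Blonde = 20, Red = 10)     # column totals

# Expected count in each cell: (row total x column total) / overall total
expected <- outer(eye.totals, hair.totals) / 100
expected                              # e.g. Blue & Red = 1.8, Brown & BrownBlack = 32.9

# With the full 3 x 3 table of observed counts (call it observed.table):
# chisq.test(observed.table)          # reproduces T = 39.58 on 4 degrees of freedom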

2.2 Using R (Week 2)


2.2.1 Using rep and factor to Enter Data Manually in R
Sometimes, when the amount of data is reasonably small, it can be convenient
to enter data into R manually (as opposed to importing it into R from Excel,
say).
Recall the rainfall example from week 1 lecture notes. This data was presented
in a table in the notes, and we then expanded it out (by expanding District and
Season in repeating patterns) into Excel so we could import it. We need not
have done that: instead, we could have typed the rainfall measurements into R
by hand and then used some functions in R to create the repeating patterns for
District and Season.
Note that this approach is an alternative to the import data approach. You
do one or the other, not both. If the data you wish to use is already in “Excel”
form, import it using the methods we saw in the week 1 lectures. If the data
are small and appear in a table like the rainfall data, then this new approach
can be useful.
Here’s the rainfall data as it appears in the table from week 1:

We can type the rainfall measurements into R as follows (note that we are typing
it in by row, not by column – this makes a difference as to how we do the next
bit):
rain <- c(23, 440, 800, 80,
250, 500, 1180, 200,
120, 400, 420, 430,
10, 20, 30, 5,
60, 200, 250, 120)
rain

## [1] 23 440 800 80 250 500 1180 200 120 400 420 430 10 20 30
## [16] 5 60 200 250 120

Next we can create the variable district using the rep() function. This function repeats the values you give it a specified number of times:
district <- rep(1:5, c(4, 4, 4, 4, 4))
district

## [1] 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5

rep() takes two arguments, separated by a comma. The first argument is the set of things you want to replicate (in this case the numbers from 1 to 5 inclusive). The
second argument is the number of times you want these things to be replicated.
Note that the second argument must match the dimension of the first argument
– since there are 5 numbers in the first argument, there must be 5 numbers in
the second (each number in the second argument matches how many times its
counterpart in the first argument gets replicated). In this example we want the
numbers 1 to 5 to each be replicated 4 times (once for each season).

Note that we can use R functions inside R functions (even the same function). If
we look carefully at the second argument above, we can see that c(4,4,4,4,4)
is just the number 4 replicated 5 times. This could be written rep(4,5). This
kind of thing happens a lot (where you want every element replicated the same
number of times) and so rep also has an argument called each that
allows you to specify how many times each element in the first argument should
be replicated. Two alternatives to the above code are therefore:

district.a <- rep(1:5, rep(4, 5))

# OR

district.b <- rep(1:5, each = 4)

district.a

## [1] 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5
district.b

## [1] 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5
Now let’s look at the Season variable. The season variable is slightly different
to the district variable in that Season has character values (words/letters, not
numbers). This is not a major problem in R. However, to handle this type of
variable we need to introduce another function called factor(). The factor()
function is the way we tell R that a variable is categorical. Categorical variables
do not necessarily have to have character values (District is also a categorical
variable, for example), but any variable that has character values is categorical.
Entering character data into R is simple: we treat it like normal data but we
put each value in quotes:
seasons <- c("Winter", "Spring", "Summer", "Autumn")
seasons

## [1] "Winter" "Spring" "Summer" "Autumn"


Note the way the Season data appears in the table – the above character se-
quence appears once for each of the 5 rows (remember, we entered the rain data
on a row by row basis). Therefore, we need to replicate this entire sequence 5
times.
season <- factor(rep(seasons, 5))
season

## [1] Winter Spring Summer Autumn Winter Spring Summer Autumn Winter Spring
## [11] Summer Autumn Winter Spring Summer Autumn Winter Spring Summer Autumn
## Levels: Autumn Spring Summer Winter
This new variable season is a factor, and it replicates the character variable
seasons 5 times. Factors have levels. Levels are just the names of the categories
that make up the factor. R lists levels in alphanumeric order.
Note we could have done all this at once using:
season <- factor(rep(c("Winter", "Spring", "Summer", "Autumn"), 5))
season

## [1] Winter Spring Summer Autumn Winter Spring Summer Autumn Winter Spring

## [11] Summer Autumn Winter Spring Summer Autumn Winter Spring Summer Autumn
## Levels: Autumn Spring Summer Winter
Finally, remember we said District is also categorical (make sure you understand
why). If you create a variable but forget to make it categorical at the time by
using factor(), you can always come back and fix it:
district <- factor(district)
district

## [1] 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5
## Levels: 1 2 3 4 5
Now we have entered each variable into R manually, we can place them into their
own data set. R calls data sets “data frames”. The function data.frame() lets
you put your variables into a data set that you can call whatever you want:
rainfall.dat <- data.frame(district, season, rain)
rainfall.dat

## district season rain


## 1 1 Winter 23
## 2 1 Spring 440
## 3 1 Summer 800
## 4 1 Autumn 80
## 5 2 Winter 250
## 6 2 Spring 500
## 7 2 Summer 1180
## 8 2 Autumn 200
## 9 3 Winter 120
## 10 3 Spring 400
## 11 3 Summer 420
## 12 3 Autumn 430
## 13 4 Winter 10
## 14 4 Spring 20
## 15 4 Summer 30
## 16 4 Autumn 5
## 17 5 Winter 60
## 18 5 Spring 200
## 19 5 Summer 250
## 20 5 Autumn 120
Once your variables are in a data frame (in this case in a data frame called
rainfall.dat), R will keep them in that data frame. However, the variables
are also individually just hanging around in the workspace, so we should clean
up our mess (we can safely delete variables once we put them into a data frame).
To see what is in our workspace we can use the function ls(), which lists the
contents of the R workspace. The rm() function removes variables from the R
workspace – use it cautiously!

ls()

## [1] "district" "district.a" "district.b" "rain" "rainfall"


## [6] "rainfall.dat" "season" "seasons"
rm(rain, district, district.a, district.b, seasons, season)
ls()

## [1] "rainfall" "rainfall.dat"

2.2.2 Using R for Goodness of Fit Tests


The R function chisq.test() does goodness of fit tests. It requires you to
enter the data (ie the observed data in each category) and in general it also
requires you to enter the probabilities associated with each category under the
null hypothesis assumptions. When the null hypothesis assumption is “equally
likely” or “uniformity” (as in the gold lotto and gloves examples) you may omit
entering the probabilities for each category. However, if the probabilities under
the null hypothesis are different for the categories (like the credit cards example,
or Mendel’s peas) you must enter these probabilities in the same order that you
enter the observed data.
For the lotto example:
lotto <- c(166,159,156,174,163,170,162,145,158,170,151,164,168,167,171,130,162,160,158,
           155,165,151,140,154,148,152,141,157,161,150,145,172,161,178,158,147,184,162,153)
# (the remaining counts have been cut off in the printed notes; the full vector
#  contains the observed frequency for each of the 45 lotto numbers)
lotto.test <- chisq.test(lotto)
lotto.test

##
## Chi-squared test for given probabilities
##
## data: lotto
## X-squared = 30.841, df = 44, p-value = 0.9333
We will talk more about p-values in later lectures. For now, if the p-value is
bigger than or equal to 0.05 it means we cannot reject the null hypothesis. If
the p-value is smaller than 0.05, we reject the null in favour of the alternative
hypothesis.
In this example we see that we cannot reject the null hypothesis, and conclude
that there is insufficient evidence to suggest the game is “unfair” at the 0.05
level of significance.
For the credit card debt example:
credit.cards <- c(70, 20, 50, 20) # Enter the data

creditcard.test <- chisq.test(credit.cards, p = c(0.5, 0.2, 0.25, 0.05))



# Note we have to give it the probabilities in the null hypothesis using the p = ... argument.

creditcard.test

##
## Chi-squared test for given probabilities
##
## data: credit.cards
## X-squared = 26.25, df = 3, p-value = 8.454e-06

Note that here we needed to specify the probabilities for each category (50%,
20% etc). Also note here that R outputs scientific notation for small p-values
– 8.454e-06 means 0.000008454, which is very much smaller than 0.05. We
therefore reject the null in favour of the alternative hypothesis and conclude
that the Bank’s claims about credit card repayments are false at the 0.05 level
of significance.

2.2.3 Using R for Tests of Independence – Two Way Contingency Table
The chisq.test() function can also be used to do the Chi-Squared test of
Independence. The test needs your data to be in table form, like in the hair
eye colour example above. If you do not have it in that form because your data
are raw and not tabulated, the table() function in R can be used to create the
table. We will show an example of this form in the lectures.
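As a quick illustration of the raw-data case (the data values here are made up purely for demonstration; they are not the hair/eye data):
eye <- factor(c("Blue", "Brown", "Blue", "Green", "Brown"))
hair <- factor(c("Red", "Brown", "Blonde", "Brown", "Brown"))
table(eye, hair) # tabulates the raw observations into a two-way contingency table
# The resulting table could then be passed to chisq.test() in the usual way.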

Below is the code required to do the test of independence between hair and eye
colour given in the example above:
hair.eyes <- as.table(rbind(c(5, 12, 1), c(25, 2, 8), c(40, 6, 1))) # Make the table with the data
hair.eyes

## A B C
## A 5 12 1
## B 25 2 8
## C 40 6 1
dimnames(hair.eyes) <- list(Eye = c("Blue", "Green", "Brown"),
                       Hair = c("Brown", "Blonde", "Red")) # We don't need to, but we can give the rows and columns meaningful names
hair.eyes

## Hair
## Eye Brown Blonde Red
## Blue 5 12 1
## Green 25 2 8
## Brown 40 6 1

haireye.test <- chisq.test(hair.eyes) # Now do the test of independence on the table

## Warning in chisq.test(hair.eyes): Chi-squared approximation may be incorrect


haireye.test

##
## Pearson's Chi-squared test
##
## data: hair.eyes
## X-squared = 39.582, df = 4, p-value = 5.282e-08
Note the warning message: one of the assumptions of the Chi-Squared test of
independence is that no more than 20% of all expected cell frequencies are less
than 5. We can see the expected cell frequencies by using:
haireye.test$expected

## Hair
## Eye Brown Blonde Red
## Blue 12.6 3.6 1.8
## Green 24.5 7.0 3.5
## Brown 32.9 9.4 4.7
We can see that 4 of the 9 cells have expected frequencies less than 5. This can
be an issue for the test of independence but ways to fix it are beyond the scope
of this course.
From the output we see the p-value is less than 0.05, so we therefore reject
the null in favour of the alternative hypothesis and conclude that hair and eye
colour are dependent on each other at the 0.05 level of significance.
Chapter 3

Week 3/4 - Probability Distributions and The Test of Proportion

Outline:
1. Revision and Basics for Statistical Inference
• Review – Revision and things for you to look up
• Types of Inference
• Notation
• Probability Distribution Functions
• Areas Under a Probability Distribution Curve - Probabilities
• Tchebychef’s Rule – applies for any shaped distribution
• Probability of a Range of the Variable
• Cumulative Probability
• Central Limit Theorem
2. Inference for Counts and Proportions – Test of a Proportion
• An Example
• One- and Two-Tailed Hypotheses
• The p-value of a Test Statistic
3. Statistical Distributions


• The Binomial Distribution – for counts and proportions

– Examples using the Binomial Distribution

• The Normal Distribution

• The Normal Approximation to the Binomial

4. Using R

• Calculating Binomial and Normal probabilities using R.

• Sorting data sets and running functions separately for different categories

Workshop for Week 3

• Based on week 1 and week 2 lecture notes.

R:

• Entering data, summarising data, rep(), factor()


• using R for Goodness of Fit and test of independence.

Workshop for Week 4

• Based on week 3 lecture notes.

Project Requirements for Week 3

• Ensure you have generated your randomly selected individual project data;

• Explore your data using summaries, tables, graphs etc.

Project Requirements for Week 4

• Project 1 guidelines will be available in week 4. Make sure you read it carefully!

Things YOU must do in Week 3

• Revise and summarise the lecture notes for week 2;

• Read your week 3 & 4 lecture notes before the lecture;

• Read the workshop on learning@griffith before your workshop;

• Revise and practice the 𝜒2 tests - You have a quiz on them in week 4

3.1 Revision and Basics for Statistical Inference


3.1.1 Revision and things for you to look up
Sample versus population; statistic versus parameter (see diagram in week 1
notes); sampling variability.

3.1.2 Types of Inference


There are two basic branches of statistical inference: estimation and hypothesis testing.
eg1: Groundwater monitoring:
• what is the level of sodium (Na) in the groundwater downstream of the landfill (gdf)?
• is the level of Na in the gdf above the set standard for drinking water?
• what level of Na in the gdf can be expected over the next 12 months?
eg2: A new treatment is proposed for protecting pine trees against a disease –
does it work? How effective is it? Does it give more than 20% protection?

3.1.3 Notation
population parameter: Greek letter
sample statistic: Name - upper case; observed value - lower case
sample statistic: is an ESTIMATOR of the population parameter - indicated by a
'hat' over the Greek symbol: $\hat{\theta}$, $\hat{\sigma}$, $\hat{\phi}$.
Some estimators are used so often they get a special symbol. E.g.: the sample mean,
$\bar{X} = \hat{\mu}$, is the estimate of the population mean 𝜇.
Sometimes letters are used, e.g. SE for standard error – the standard deviation of a
sample statistic.
Probability Distribution Functions: 𝑓(𝑥)
Statistical probability models:
Can be expressed in graphical form – distribution curve
• possible values of X along x-axis
• relative frequencies (or probabilities) for each possible value along y-axis
• total area under curve is 1; representing total of probabilities for all pos-
sible values/outcomes.
Shape can also be described by appropriate mathematical formula and/or ex-
pressed as a table of possible values and associated probabilities.
If the allowable values of X are discrete: Probability Mass Function (PMF),
𝑓(𝑥) = 𝑃 𝑟(𝑋 = 𝑥).

If the allowable values of X are continuous: Probability Density Function (PDF)


NB 𝑓(𝑥) ≠ 𝑃 𝑟(𝑋 = 𝑥) for continuous r.v.s.

3.1.4 Areas Under a Probability Distribution Curve - Probabilities
For continuous variables, the total area under the probability curve will be 1,
as this is the totality of the possible values X can take. Similarly, the sum over
all allowable values of a discrete random variable will be 1.

3.1.5 Tchebychef’s Rule – applies for any shaped distribution
For any integer $k > 1$, at least $100(1 - \frac{1}{k^2})\%$ of the measurements in a population
will be within a distance of $k$ standard deviations from the population mean:

$$\Pr(\mu - k\sigma \le X \le \mu + k\sigma) \ge 1 - \frac{1}{k^2}$$
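As a quick numerical illustration (not part of the original notes), the bound can be evaluated in R for a few values of 𝑘:
k <- 2:4
1 - 1/k^2 # at least 75%, 88.9% and 93.75% of values lie within 2, 3 and 4 SDs of the mean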

3.1.6 Probability of a Range of the Variable


For continuous random variables, the area under the curve representing 𝑓(𝑥)
between two points is the probability that 𝑋 lies between those two points:

$$\Pr(a < X < b) = \int_a^b f(x)\,dx.$$

Note that 𝑃 (𝑋 = 𝑎) = 0 for all 𝑎 ∈ 𝑋 when 𝑋 is continuous. (Why??)


For discrete random variables, 𝑃 (𝑋 = 𝑎) = 𝑓(𝑎), and to find the probability
that 𝑋 lies in some range of values (e.g. 𝑃 (𝑎 ≤ 𝑋 < 𝑏), 𝑃 (𝑋 > 𝑐), 𝑃 (𝑋 < 𝑑)
etc.), we simply sum the probabilities associated with the values specified in the
range:

$$P(X \in A) = \sum_{x \in A} f(x).$$

Example: Throwing a Fair Dice

The probability that any one of the six sides of a fair dice lands uppermost
when thrown is 1/6. This can be represented mathematically as:

$$P(X = x) = \frac{1}{6}, \quad x = 1, 2, \dots, 6,$$

where 𝑋 represents the random variable describing the side that lands upper-
most, and 𝑥 represents the possible values 𝑋 can take. This kind of distribution
is known as a Uniform distribution (why?).

How would we represent this probability mass function (pmf) graphically?

3.1.7 Cumulative Probability (CDF): 𝐹 (𝑥) = 𝑃 𝑟(𝑋 ≤ 𝑥)


The cumulative distribution function (CDF) of a random variable is the probability
that the random variable takes a value less than or equal to a specified value, x:

𝐹 (𝑥) = 𝑃 𝑟(𝑋 ≤ 𝑥)

For the dice example, the cumulative probability distribution can be calculated
as follows:

 X      1     2     3     4     5     6
 f(x)   1/6   1/6   1/6   1/6   1/6   1/6
 F(x)   1/6   2/6   3/6   4/6   5/6   6/6

We can also express the CDF for this example mathematically (note that this
is not always possible for all random variables, but it is generally possible to
create a table as above):

$$F(x) = \Pr(X \le x) = \frac{x}{6}, \quad x = 1, 2, \dots, 6.$$

(Check for yourself that the values you get from this formula match those in the
table.)

The cumulative probability is found by summing the relevant probabilities, starting
from the left hand (smallest) values of the variable and stopping at the specified
value; this gives the cumulative probability up to the stopping point and represents
the probability that the variable is less than or equal to the specified value.

What is 𝑃 𝑟(𝑋 < 4)?

3.1.8 Central Limit Theorem


In probability theory, the central limit theorem (CLT) states conditions under
which the mean of a sufficiently large number of independent random variables,
each with finite mean and variance, will be approximately normally distributed.
If we were to take lots of samples from our population of interest, the means of
these samples would be normally distributed.

The central limit theorem also requires the random variables to be identically
distributed, unless certain conditions are met. The CLT also justifies the ap-
proximation of large-sample statistics to the normal distribution in controlled
experiments.
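A small simulation sketch (not part of the original notes) shows the CLT in action: individual values drawn from a skewed (exponential) distribution are clearly non-normal, yet the means of repeated samples are approximately normally distributed:
set.seed(42) # for reproducibility
sample.means <- replicate(1000, mean(rexp(30, rate = 1))) # 1000 sample means, each based on n = 30
hist(sample.means) # roughly bell-shaped
qqnorm(sample.means); qqline(sample.means) # points close to the line => approximately normal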

3.2 Inference for Counts and Proportions – Test of a Proportion
3.2.1 An Example
Suppose we are concerned that the coin used to decide who will bat first in a
cricket match is not unbiased. Note that this scenario is analogous to situations
which arise in all sorts of research situations. For example, consider the follow-
ing claims: the sex ratio for some animal species is 50:50; half of the Australian
population have access to the internet at home; fifty percent of Australian chil-
dren now go on to tertiary studies; there is a 50% chance that in the next six
months there will be a better than average rainfall in Queensland; half of the

eucalyptus species in Northern New South Wales are suffering from the disease
die back.
Research Question:
Is the coin unbiased? That is, if it is tossed, is it just as likely to
come down with a head showing as with a tail? Is the probability of
seeing a head when the coin is tossed equal to ½?
What sort of experiment:
A single toss will not tell us much – how many tosses will we carry
out? Resources are limited so we decide to use only six.
What sort of data:
Success or failure – assume a head is a success.
Binary Data – each toss is a Bernoulli trial (an experiment with
only two possible outcomes).
What feature from the experiment will have meaning for the question:
The number of heads seen in the sample. If the coin is unbiased we
would expect to see three heads and three tails in 6 tosses. The number
of heads is a Binomial variable – the sum of a series of independent
Bernoulli trials.
Hypotheses:
We want to test the current belief that the probability of seeing a
head is 0.5. The null hypothesis always reflects the status quo and
assumes the current belief to be true. The alternative hypothesis is
the opposite of the null and reflects the reason why the research was
conducted.
Null Hypothesis:
𝐻0 : within the population of interest, the probability that a head
will be seen is 1/2.
𝐻0 ∶ 𝑃 𝑟(head) = 0.5.
Alternative Hypothesis:
𝐻1 : the distribution within the population is not as specified; the
probability that a head will be seen is not one half.
𝐻1 ∶ 𝑃 𝑟(head) ≠ 0.5.
Sample:
Selected at random from the population of interest – six random
throws.
Test Statistic:

Seems sensible to look at the number of heads in the sample of 6
tosses as the test statistic - how likely are we to get the number of
heads seen?
Null Distribution:
The distribution of the random variable, number of heads in a sample
of six, IF the null hypothesis is true – that is, if 𝑃 𝑟(head) = 0.5. See
below for derivation and final distribution.

The following table shows the probability distribution function for
the number of heads in 6 tosses of an unbiased coin (a binomial variable
with 𝑛 = 6 and 𝑝 = 0.5, as derived in the box above). Note that R
can be used to get these values.

 x (number of heads)    0         1         2         3         4         5         6
 f(x) = Pr(X = x)       0.01562   0.09375   0.23438   0.31250   0.23438   0.09375   0.01562
 F(x) = Pr(X <= x)      0.01562   0.10938   0.34375   0.65625   0.89062   0.98438   1.00000

NOTE THAT THIS TABLE RELIES ON THE FACT THAT


EACH OF THE 64 POSSIBLE OUTCOMES IS EQUALLY
LIKELY – WHAT HAPPENS IF THE PROBABILITY OF A
HEAD IS TWICE THE PROBABILITY OF A TAIL??
• What is the probability of getting 5 or 6 heads?
• What is the probability of getting at least 4 heads?
• What is the probability of getting no more than 3 heads?
Significance Level, 𝛼:
Traditionally assume 0.05 (5%).
Critical Value, A:
AND NOW A PROBLEM ARISES!!!!!!!

3.2.2 One- and Two-Tailed Hypotheses


We need a value (or values) of the test statistic that will ‘cut off’ a portion of
the null distribution representing 95% of the entire area.
Firstly we need to decide where the 5% to be omitted is to be. Will it be in the
upper tail as it was for the chi-squared situation? Or, will it be in the lower
tail? Or, will it need to be apportioned across both tails?
The answer will depend on the alternative hypothesis. Consider the following
three possibilities:
1. the researcher’s belief is such that the test statistic will be larger than
that expected if the null hypothesis is true;
2. the researcher’s belief is such that the test statistic will be smaller than
that expected if the null hypothesis is true;
3. the researcher’s belief is such that the it is not clear whether the test statis-
tic will be larger or smaller, it will just be different from that expected
if the null hypothesis is true.
Two-Tailed Hypothesis

In the example, the question simply raises the issue that the coin may not be
unbiased. There is no indication as to whether the possible bias will make a
head more likely or less likely. The results could be too few heads or too many
heads. This is a case 3 situation and is an example of a two-tailed hypothesis.
The critical value can be at either end of the distribution and the value of the
stipulated significance, 0.05, must be split between the two ends, 0.025 (or as
close as we can get it) going to each tail.
One-Tailed Hypothesis
Suppose instead that the researcher suspects the coin is biased in such a way as
to give more heads and this is what is to be assessed (tested). The alternative
hypothesis would be that the probability of a head is greater than 1/2: 𝐻1 ∶ 𝑝 >
0.5 – a case 1 situation.
Clearly the opposite situation could also occur if the researcher expected bias
towards tails leading to an alternative: 𝐻1 ∶ 𝑝 < 0.5. This is a case 2 situation.
In both of these cases, the researcher clearly expects that if the null hypothesis
is not true it will be false in a specific way. These are examples of a one-tailed
hypothesis.
The critical value occurs entirely in the tail containing the extremes anticipated
by the researcher. Thus for case 1 the critical value will cut off an upper tail.
For case 2 the critical value must cut off a lower tail.
Back to the Example
The example as given is a two-tailed situation thus two critical values are needed,
one to cut off the upper 0.025 portion of the null distribution, and the other
to cut off the lower 0.025 portion.
To find the actual critical values we look at the distribution as we did for chi-
squared.
AND NOW ANOTHER PROBLEM ARISES!!!!!!!
For chi-squared we had a continuous curve and the possible values could be
anything, enabling us to find a specific value for any significance level nominated.
Here we have discrete data (a count) with only the integer values from zero to
six and their associated probabilities. Working with 5% we want the values that
will cut off a lower and an upper probability of 0.025 each.
From the table we see:
• probability of being less than 0 = 0
• probability of being less than 1 = 0.01562
• probability of being less than 2 = 0.10938
The closest we can get to 0.025 in the lower tail is 0.01562 for a number of heads
of less than 1 (i.e. 0). Similar reasoning gives an upper critical value of greater
than 5 (i.e. 6) with a probability of 0.01562.

We cannot find critical values for an exact significance level of 0.05 in this case.
The best we can do is to use a significance level of 0.01562 + 0.01562 = 0.03124
and the critical values of 1 and 5 – approximately 97% of the values lie between
1 and 5, inclusive.
?? What significance level would you be using if you selected the
critical values of (less than) 2 and (greater than) 4 ??
Critical Region:
The part of the distribution more extreme than the critical values,
A. The critical region for a significance level of 0.03124 will be any
value less than 1 or any value greater than 5:
𝑇 < 1 or 𝑇 > 5.
Thus, if the sample test statistic (number of heads) is either zero
or six, then it lies in the critical region (reject the null hypothesis).
Any other value is said to lie in the acceptance region (cannot reject
the null hypothesis).
Test Statistic:
Calculated using the sample data – number of heads.
We now need to carry out the experiment
Collect the data:
Complete six independent tosses of the coin. The experimental re-
sults are: H T H H H H
Calculate the test statistic:
Test statistic (number of heads) = 5
Compare the test statistic with the null distribution:
Where in the distribution does the value of 5 lie?
Two possible outcomes:
1. T lies in the critical region - conclusion: reject 𝐻0 in favour
of the alternative hypothesis.
2. T does not lie in critical region – conclusion: do not reject
𝐻0 (there is insufficient evidence to reject 𝐻0 ).
Here we have a critical region defined as values of 𝑇 < 1 and values
of 𝑇 > 5. The test statistic of 5 does NOT lie in the critical region
so the null hypothesis 𝐻0 is not rejected.
Interpretation – one of two possibilities

Rejecting 𝐻0 in favour of 𝐻1 – within the possible error defined
by the significance level, we believe the alternative hypothesis to be
true and the null hypothesis has been falsified.
Failing to reject 𝐻0 – there is no evidence to reject the null hy-
pothesis. Note: this does not prove the null hypothesis is
true! It may simply mean that the data are inadequate – e.g. the
sample size may be too small (Mythbusters effect…).
For the example, the null has not been rejected and we could give
the conclusion as:
We cannot reject the null hypothesis. We conclude that, based on this
data, there is insufficient evidence to suggest the coin is biased.
Note that this does not PROVE that the coin is unbiased, it simply
says that given the available data there is no reason to believe that
it is biased.
NOTE: Intuitively, getting 5 out of 6 is not particularly informative
– a sample of 6 is very small. If the equivalent figure of 15 out of 18
were obtained what would the decision be?

3.2.3 The 𝑝 - value of a test statistic.


An alternative to working with a specific significance level is to use what is
known as the 𝑝 - value of the test statistic. This is the probability of getting
a value as or more extreme than the sample test statistic, assuming
the null hypothesis is true. Instead of giving the conclusion conditioned on the
possible error as determined by the significance level, the conclusion is given
together with the actual 𝑝-value. There are various ways of expressing this and
some possible wordings are given with each example below.
In the example above, we had a sample test statistic of 5 heads. What is the
probability of observing a value this extreme, or more extreme, if the probability
of a head is 0.5 (i.e. if the null hypothesis is true)? From the table we want the
probability of 5 or 6 heads.
If the coin is unbiased, the probability of getting 5 heads in 6 independent ran-
dom tosses is 0.09375. The probability of getting 6 heads is 0.01562. Therefore,
the probability of selecting a number as or more extreme than 5 is: 0.09375 +
0.01562 = 0.10937. Note that this can be read directly from the cumulative
column of the table by realising that:

$$\Pr(X \ge 5) = 1 - \Pr(X < 5) = 1 - \Pr(X \le 4) = 1 - 0.89062 = 0.10937$$

There is approximately an 11% chance of seeing something as or more extreme
than the observed sample data just by pure chance, assuming the probability of
a head is 0.5 – this seems like reasonable odds.
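If you want to verify this p-value yourself, the pbinom() function (introduced in the Using R section later in these notes) gives it directly:
1 - pbinom(4, size = 6, prob = 0.5) # Pr(X >= 5) = Pr(X = 5) + Pr(X = 6), approximately 0.109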

3.3 THEORETICAL STATISTICAL DISTRIBUTIONS
All forms of statistical inference draw on the concept of some form of theoretical
or empirical statistical model that describes the values that are being measured.
This model is used to describe the distribution of the chosen test statistic under
the null hypothesis (that is, if the null is true).

3.3.1 The Binomial Distribution – Discrete Variable


The data and test statistic used for the coin example were a specific case of the
binomial probability distribution.
binary variable - variable can have only two possible values: present-absent,
yes-no, success- failure, etc.
Bernoulli trial - process of deciding whether or not each individual has the
property of interest; is a success or a failure.
The sum of 𝑛 independent Bernoulli trials results in a variable with
a binomial distribution (a binomial random variable).
A Binomial random variable measures the number of successes, number of
presences, number of yes’s etc. out of the total number of (independent) trials
(𝑛) conducted.
The Binomial Setting
1. There are a fixed number of observations (trials), 𝑛.
2. The 𝑛 observations (trials) are all independent.
3. Each observation falls into one of just two categories, which for
convenience we call “success” and “failure” (but could be any
dichotomy).
4. The probability of a “success” is called 𝑝 and it is the same for
each observation. (Note that this implies the probability of a
“failure” is (1 − 𝑝), since there are only two (mutually exclusive
and exhaustive) categories for each trial outcome.)
Mathematical Jargon: 𝑋 ∼ Bin(𝑛, 𝑝)
𝑋 is distributed as a Binomial random variable with number of trials
𝑛 and probability of “success” 𝑝.
The mathematical model for the probability mass function of a binomial random
variable is given by:

$$\Pr(X = x; n, p) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad x = 0, 1, 2, \dots, n, \quad 0 \le p \le 1,$$

where:
• 𝑋 is the name of the binomial variable – the number of successes
• 𝑛 is the sample size – the number of identical, independent observations;
• 𝑥 is the number of successes in the 𝑛 observations;
• 𝑝 is the probability of a success.
The mathematical model in words is:
the probability of observing 𝑥 successes of the variable, 𝑋, in a
sample of 𝑛 independent trials, if the probability of a success for any
single trial is the same and equal to 𝑝.
(Compare this formula to that discussed in the coin toss example box.)
Binomial Tables: Binomial probabilities for various (but very limited) values
of 𝑛 and 𝑝 can be found in table form. See the Tables folder on the L@G site.
Also note the folder contains a binomial table generator written in java script
that will show you probabilities for user-selected 𝑛 and 𝑝. R will also calculate
Binomial probabilities (see R section in these notes).

3.3.1.1 Examples Using the Binomial Distribution


3.3.1.1.1 Example 1: Each child born to a particular set of parents has
probability 0.25 of having blood type O. If these parents have 5 children, what
is the probability that exactly 2 of them have type O blood?
Let 𝑋 denote the number of children with blood type O. Then 𝑋 ∼ Bin(5, 0.25).
We want to find 𝑃 (𝑋 = 2). Using the Binomial pmf above:

$$P(X = 2) = \binom{5}{2} \times 0.25^2 \times (1 - 0.25)^{5-2} = 10 \times 0.25^2 \times 0.75^3 = 0.2637.$$

The probability that this couple will have 2 children out of 5 with blood type O is
0.2637. Can you find this probability in the Binomial tables?
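A quick check in R (the dbinom() function is covered in the Using R section later in this chapter):
dbinom(2, size = 5, prob = 0.25) # Pr(X = 2), approximately 0.2637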

3.3.1.1.2 Example 2: A couple have 5 children, and two of them have blood
type O. Using this data, test the hypothesis that the probability of the couple
having a child with type O blood is 0.25.
We are testing the following hypotheses:

𝐻0 ∶the probability of the couple having a child with type O blood is 0.25
𝐻0 ∶𝑃 (Blood Type O) = 0.25

versus

𝐻1 ∶the probability of the couple having a child with type O blood is not 0.25
𝐻1 ∶𝑃 (Blood Type O) ≠ 0.25

This is very similar to the coin tossing example above. Our test statistic will be
the number of children with type O blood, which we are told in the question is
T = 2.
The null distribution is the distribution of the test statistic assuming the null
hypothesis is true. 𝑇 is binomial with 𝑛 = 5 and 𝑝 = 0.25. This distribution is
shown here graphically:

Significance level 𝛼 = 0.05 (two-tailed = 0.025 either end). However, because
this is a discrete distribution we may not be able to get exactly 0.05.
In the right hand tail, using 3 as the critical value gives 0.0146 + 0.0010 = 0.0156
(using 2 as the critical value would make the overall 𝛼 too large). In the left
hand tail the first category is very large but it’s the best we can do, so ≤ 0 gives
0.2373.
This means that the overall significance level for this test is 𝛼 = 0.0156 +
0.2373 = 0.2529 (the small sample size, 𝑛 = 5, is a big problem in this case).
Our test statistic 𝑇 = 2. This does not lie in the critical region.
Inference: There is insufficient evidence to reject the null hypothesis that
the probability of the couple having a child with type O blood is 0.25, at the
𝛼 = 0.2529 level of significance.
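The tail probabilities used above can be checked in R, for example:
dbinom(0:5, size = 5, prob = 0.25) # the full null distribution
sum(dbinom(4:5, size = 5, prob = 0.25)) # upper tail, Pr(X >= 4), approximately 0.0156
dbinom(0, size = 5, prob = 0.25) # lower tail, Pr(X = 0), approximately 0.2373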

3.3.1.1.3 Example 3. You have to answer 20 True/False questions. You
have done some study so your answers will not be total guesses (i.e. the chance
of getting any one question correct will not be 50/50). If the probability of
getting a question correct is 0.60, what is the probability that you get exactly
16 of the 20 questions correct?
Let 𝑋 denote the number of questions answered correctly. Then
𝑋 ∼ Bin(20, 0.6). We want to find 𝑃 (𝑋 = 16). Using the Binomial
pmf:

$$P(X = 16) = \binom{20}{16} \times 0.6^{16} \times (1 - 0.6)^{20-16} = 4845 \times 0.6^{16} \times 0.4^{4} = 0.0349.$$

The probability of getting 16/20 True/False questions correct (if the probability
of a correct answer is 0.60, and assuming your answers to each question are
independent) is 0.0349. You need to study harder!
This is the probability of getting exactly 16 correct. What is the probability of
getting 16 or less correct? We could sum the probabilities for getting 0, 1, 2,
3….16 correct (tedious!!) or we could note that 𝑃 (𝑋 ≤ 16) = 1 − 𝑃 (𝑋 > 16):

$$\begin{aligned}
P(X \le 16) &= 1 - \sum_{x=17}^{20} P(X = x) \\
&= 1 - (P(X = 17) + P(X = 18) + P(X = 19) + P(X = 20)) \\
&= 1 - (0.012 + 0.003 + 0.0005 + 0.00004) \\
&= 0.984
\end{aligned}$$

(Make sure you know where these numbers come from).


There is a 98.4% chance that you will answer between 0 and 16 questions cor-
rectly. More to the point, there is only a less than 2% chance of answering 17
or more questions correctly with the level of study undertaken (that is, a level
of study that leads to a 60% chance of answering any one question correctly).
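These calculations are much quicker in R (see the Using R section later in this chapter); a quick check:
dbinom(16, size = 20, prob = 0.6) # Pr(X = 16), approximately 0.035
pbinom(16, size = 20, prob = 0.6) # Pr(X <= 16), approximately 0.984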

3.3.1.1.4 Example 4. You decide to do an experiment to test whether
studying for 5 hours increases your chance of getting a correct answer in a
T/F exam compared to simply guessing (no study). You study for 5 hours,
answer 20 T/F questions, and discover you answered 16 correctly. Test your
hypothesis using this sample data.

Note: this is a one-tailed (upper) hypothesis test, since your research question
asks whether 5 hours of study will increase the chance of successfully answering
a question, 𝑝, over and above guessing (ie will 𝑝 be greater than 0.5?).

𝐻0 ∶𝑝 ≤ 0.5
𝐻1 ∶𝑝 > 0.5

With our sample of 20 questions and 16 successes, do we have enough evidence
to reject 𝐻0 ?
Null Distribution: the distribution of the test statistic assuming the null
hypothesis is true. This is a Bin(20, 0.5) and is shown graphically below.

Significance level 𝛼 = 0.05 (one tailed, all in the upper tail). Test statistic
𝑇 = 16.
To obtain the critical value we need to find the value in the upper tail that
has 0.05 of values (or as close as we can get to it) above it. We can sum the
probabilities backwards from 20 until we reach approximately 0.05:

0.000001 + 0.00001 + 0.0002 + 0.0011 + 0.0046 + 0.0148 + 0.037 = 0.0577

So our critical value is 13, and the critical region is any value greater than 13.

The test statistic 𝑇 = 16 > 13. Therefore we reject 𝐻0 and conclude that the
probability of getting a T/F question correct is significantly greater than 0.50
if we study for 5 hours, at the 𝛼 = 0.0577 level of significance.
NOTE:
If we took the critical value to be 14, our significance level would be
0.02072 and our test statistic would still be significant (i.e. we would
still reject the null hypothesis).
New conclusion: The test statistic of T = 16 > 14. Therefore we
reject 𝐻0 and conclude that the probability of getting a T/F question
correct is significantly greater than 0.50 if we study for 5 hours, at
the 𝛼 = 0.02072 level of significance.
What is the main effect of reducing the significance level? We have
reduced the chance of a Type I error. Make sure you can explain
this. Can the significance level be reduced further?
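The tail probabilities behind these critical values can also be checked with pbinom() (a quick sketch, not part of the original notes):
1 - pbinom(13, size = 20, prob = 0.5) # Pr(X > 13), approximately 0.0577 (critical value 13)
1 - pbinom(14, size = 20, prob = 0.5) # Pr(X > 14), approximately 0.0207 (critical value 14)
1 - pbinom(15, size = 20, prob = 0.5) # Pr(X >= 16), the p-value for T = 16, approximately 0.006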
In example 3 we found that the probability of getting 16/20 True/False questions
correct (if the probability of a correct answer is 0.60) is 0.0349. You should
perhaps study more.
If the probability of getting a correct answer really is 0.6, how many of the 20
questions would you expect to answer correctly?

0.6 × 20 = 12
.
What should the probability of correctly answering a question be to make 16
correct answers out of 20 the expected outcome?

$$p \times 20 = 16 \quad\Rightarrow\quad p = \frac{16}{20} = 0.8$$

Back To The Theoretical Binomial Distribution:


The expected (mean) value of the random variable 𝑋 ∼ Bin(𝑛, 𝑝) is 𝑛𝑝. Note
that this value does not always work out to be a whole number. You should
think of the expected value (or mean) as the long run, average value of the
random variable under repeated sampling.
The variance of a Bin(𝑛, 𝑝) random variable is 𝑛𝑝(1 − 𝑝). Note that the variance
depends on the mean, 𝑛𝑝.

The mode of a random variable is the most frequent, or probable, value. For
the binomial distribution the mode is either equal to the mean or very close to
it (the mode will be a whole number, whereas the mean will not necessarily be
so). The mode of any particular Bin(𝑛, 𝑝) distribution can be found by perusing
the binomial tables for that distribution and finding the value with the largest
probability, although this becomes prohibitive as 𝑛 gets large. (There is a
formula to calculate the mode for the binomial distribution; however this goes
beyond the scope of this course.)

The probability of success, 𝑝, influences the shape of the binomial distribution.


If 𝑝 = 0.5, the distribution is symmetric around its mode (second figure). If

𝑝 > 0.5, the distribution has a left skew (not shown). If 𝑝 < 0.5, the distribution
has a right skew (first figure). The closer 𝑝 gets to either 0 or 1, the more skewed
(right or left, respectively) the distribution becomes.
The number of trials, 𝑛, mediates the effects of 𝑝 to a certain degree in the
sense that the larger 𝑛 is, the less skewed the distribution becomes for values of
𝑝 ≠ 0.5.

3.3.2 The Normal Distribution – A Continuous Variable


“Everybody believes in the Normal Distribution (Normal Approxi-
mation), the experimenters because they think it is a Mathematical
theorem, the mathematicians because they think it is an experimen-
tal fact.”
(G. Lippman, A Nobel Prize winner in 1908, who specialised in Physics and
Astronomy and was responsible for making improvements to the seismograph.)
Original mathematical derivation - Abraham De Moivre in 1733.
In the early 1800s, the mathematician and physicist Gauss “rediscovered” it - errors
in physical measurements. Often called the Gaussian distribution for this reason.
(History is written by the “winners”!)
The normal distribution arises in many situations where the random variable
is continuous and can be thought of as an agglomeration of a number of
components.
Examples:
• A physical feature determined by genetic effects and environmental influ-
ences, height, air temperature, yield from a crop, soil permeability.
• The final grade on an exam, where a number of questions each receive
some of the marks.
• The day to day movements of a stock market index.
The Mathematical Equation of the Normal Distribution:
$$f_X(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}, \quad x \in \Re, \; \mu \in \Re, \; \sigma > 0,$$

where:
• 𝑥 is a particular value of the random variable, 𝑋, and 𝑓(𝑥) is the associated
probability density;
• 𝜎2 is the population variance of the random variable, 𝑋;
• 𝜇 is the population mean of the random variable 𝑋.
We write: 𝑋 ∼ 𝑁 (𝜇, 𝜎2 ): “X is normally distributed with mean 𝜇 and variance
𝜎2 .”
Properties of the Normal probability distribution function

• the shape of the curve is determined by the values of the parameters 𝜇 and 𝜎2;
• the location of the peak (mode) is determined by the value of 𝜇;
• the spread, or dispersion, of the curve is determined by the value of 𝜎2 ;
• it is symmetric about 𝜇- thus the mean, median and mode are all equal
to 𝜇;
• the total area under the curve is one – as for all probability distribution
functions.
The Standard Normal Distribution
The shape of the normal probability distribution function depends on the pop-
ulation parameters. Separate curves are needed to describe each population.
This is a problem because it means we need statistical tables of probabilities
for each possible combination of 𝜇 and 𝜎 (and there are infinitely many such
combinations)!!
Happily, we can convert any normal distribution into the standard normal dis-
tribution via what is known as the Z-transformation formula. This means we
only need probability tables for the standard normal distribution: we can work
out probabilities for any other normal distribution from this.
We denote a random variable with the standard normal distribution as Z. The
standard normal distribution is a normal distribution with mean = 0 and vari-
ance (and hence standard deviation) = 1. That is, 𝑍 ∼ 𝑁 (0, 1).
If 𝑋 ∼ 𝑁 (𝜇, 𝜎2 ), we can convert it to a standard normal distribution via the
Z-transformation:

$$Z = \frac{X - \mu}{\sigma}$$

Probability of a Range of the Variable – the continuous case


The area under the graphical model between 2 points, is the probability that the
variable lies between those 2 points. Tables exist that tabulate some of these
probabilities. See the tables folder on the L@G site. Also see the class examples
below.
Cumulative Probability as an Area – the continuous case
The area under a graphical model starting from the left hand (smallest) values
of the variable and stopping at a specified value is the cumulative probability
up to the stopping point; It represents the probability that the variable is less
than or equal to the specified value.
Class Examples
BE GUIDED BY THE DIAGRAM AND THE SYMMETRY OF
THE CURVE

1. If 𝑍 ∼ 𝑁 (0, 1) find 𝑃 𝑟(𝑍 > 1.52)

2. If 𝑍 ∼ 𝑁 (0, 1) find the probability that 𝑍 lies between 0 and 1.52.

3. Find 𝑃 𝑟(−1.52 < 𝑍 < 1.52) where 𝑍 ∼ 𝑁 (0, 1).

4. Find 𝑃 𝑟(𝑍 < −1.96)

5. Find the value 𝑍𝑖 for which 𝑃 𝑟(0 < 𝑍 < 𝑍𝑖 ) = 0.45.
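If you want to check your answers to these class examples, the pnorm() and qnorm() functions (covered in the Using R section below) can be used; a sketch:
1 - pnorm(1.52) # Example 1: Pr(Z > 1.52), approximately 0.064
pnorm(1.52) - pnorm(0) # Example 2: Pr(0 < Z < 1.52), approximately 0.436
pnorm(1.52) - pnorm(-1.52) # Example 3, approximately 0.871
pnorm(-1.96) # Example 4, approximately 0.025
qnorm(0.95) # Example 5: Pr(Z < z) = 0.5 + 0.45 = 0.95, so z is approximately 1.645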

Class Example of Application of the Normal Distribution

Many university students do some part-time work to supplement their allowances.
In a study on students’ wages earned from part-time work, it was
found that their hourly wages are normally distributed with mean 𝜇 = $6.20
and standard deviation 𝜎 = $0.60. Find the proportion of students who do
part-time work and earn more than $7.00 per hour.

If there are 450 students in a particular Faculty who do part-time work, how
many of them would you expect to earn more than $7.00 per hour?
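One way to work this example in R (a sketch using the pnorm() function covered in the next section):
p.over7 <- 1 - pnorm(7.00, mean = 6.20, sd = 0.60) # proportion earning more than $7.00 per hour
p.over7 # approximately 0.091
450 * p.over7 # expected number of the 450 students, approximately 41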

Normal Quantile Plots (normal probability plots)

You do not need to do these by hand. The R functions qqnorm() (and qqline()
to add a reference line) do these for you. See the example R code in the week
1 lecture notes folder for an example of how to do these graphs. (Boxplots can
be used to show similar things.)
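For example, a minimal sketch using simulated (not real) data:
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2) # 50 simulated measurements
qqnorm(x) # points should fall close to a straight line if x is (approximately) normal
qqline(x) # adds the reference line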

Some Normal Notes

• Not all Bell-Shaped Curves are normal.


• It is a model that has been shown to approximate other models for a large
number of cases.
• It is by far the most commonly used probability distribution function in
(traditional) statistical inference.

3.3.3 Normal Approximation to the Binomial


There is a limit to creating binomial tables, especially when the number of trials,
𝑛, becomes large (say 30 or greater). Fortunately, as 𝑛 becomes large we can
approximate the binomial distribution with a normal distribution as follows:

If 𝑋 ∼ Bin(𝑛, 𝑝) then, approximately, 𝑋 ∼ 𝑁 (𝑛𝑝, 𝑛𝑝(1 − 𝑝)).

This approximation is reasonably good provided: 𝑛𝑝 ≥ 5 and 𝑛𝑝(1 − 𝑝) ≥ 5.


Note that (1 − 𝑝) is sometimes referred to as 𝑞, with adjustments made to the
above formulae accordingly.

3.3.3.1 Binomial Test of Proportion for Large Samples


We saw earlier in the binomial examples how to test hypotheses about propor-
tions when the number of trials is small (<20, say). When the number of trials
is large, we can use the normal approximation to the binomial to test hypotheses
about proportions.
If 𝑋 ∼ Bin(𝑛, 𝑝) then, approximately, 𝑋 ∼ 𝑁 (𝑛𝑝, 𝑛𝑝(1 − 𝑝)). Using the 𝑍-transform,

$$T = \frac{X - np}{\sqrt{np(1 - p)}} \;\dot\sim\; N(0, 1)$$

The following example will illustrate how to use this formula to test hypotheses
about proportions when the sample size (number of trials) is large.
Forestry Example:
A forester wants to know if more than 40% of the eucalyptus trees in a particular
state forest are host to a specific epiphyte. She takes a random sample of 150
trees and finds that 65 do support the specified epiphyte.
• Research Question?
• What sort of experiment?
• What sort of data?
• What feature of the data is of interest?
• Null Hypothesis?
• Alternative Hypothesis?
• One-tailed or two-tailed test?
• Sample size?
• Null Distribution?
• Test Statistic?
• Significance Level?
• Critical Value?
• Compare test statistic to critical value
• Conclusion
Although 65/150 = 0.43 is greater than 0.4, this on its own is not enough to
say that the true population proportion of epiphyte hosts is greater than 0.4.
Remember, we are using this sample to infer things about the wider population
of host trees. Of course, in this sample the proportion of hosts is greater than
0.4, but this is only one sample of 150 trees. What if we took another sample
of 150 trees from the forest and found that the sample proportion was 0.38?
Would we then conclude that the true population proportion was in fact less
than 0.4? Whenever we sample we introduce uncertainty. It is this uncertainty
we are trying to take into account when we do hypothesis testing.

How many host trees would we need to have found in our sample to feel confident
that the actual population proportion is > 40%? That is, how many host trees
would we need to have found in our 150 tree sample in order to reject 𝐻0 ?
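A sketch of how the numbers could be put together in R, assuming the normal approximation above (with n = 150, x = 65 hosts and hypothesised proportion p = 0.4):
n <- 150; x <- 65; p0 <- 0.4
test.stat <- (x - n * p0) / sqrt(n * p0 * (1 - p0)) # the Z-type test statistic
test.stat # approximately 0.83
qnorm(0.95) # one-tailed 5% critical value, approximately 1.645
# 0.83 < 1.645, so on these numbers H0 would not be rejected at the 0.05 level.
n * p0 + qnorm(0.95) * sqrt(n * p0 * (1 - p0)) # approximately 69.9: about 70 or more hosts
# out of the 150 sampled would be needed to reject H0.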

3.4 Using R Week 3/4


3.4.1 Calculating Binomial and Normal Probabilities in R
3.4.1.1 Binomial Probabilities:
R has several functions available for calculating binomial probabilities. The two
most useful are dbinom(x, size = n, p) and pbinom(x, size = n, p).
When 𝑋 ∼ Bin(𝑛, 𝑝):
• dbinom(x, n, p) calculates 𝑃 (𝑋 = 𝑥) (ie the density function); and
• pbinom(x, n, p) calculates 𝑃 (𝑋 ≤ 𝑥) (ie the cumulative distribution function).
Use the example of two heads out of six tosses of an unbiased coin. We want
the probability of getting two or fewer heads: 𝑃 𝑟(𝑋 ≤ 2), where 𝑋 ∼ Bin(𝑛 =
6, 𝑝 = 0.5).
x <- dbinom(0:2, 6, 0.5)
sum(x)

## [1] 0.34375
# OR

pbinom(2, 6, 0.5)

## [1] 0.34375
What if we want the probability of finding exactly 2 heads: 𝑃 𝑟(𝑋 = 2)?
dbinom(2, size = 6, p = 0.5)

## [1] 0.234375
If we want the probability of seeing an upper extreme set, for example seeing
three or more heads, we can use the subtraction from unity approach as indicated
in the examples above:

𝑃 𝑟(𝑋 ≥ 3) = 1 − 𝑃 𝑟(𝑋 ≤ 2)

1 - pbinom(2, 6, 0.5)

## [1] 0.65625

Or, we can do each probability in the set individually and add them up (note,
this is only really a good option if you don’t have a large number of trials, or if
there are not a lot of probabilities to add up):

𝑃 𝑟(𝑋 ≥ 3) = 𝑃 𝑟(𝑋 = 3) + 𝑃 𝑟(𝑋 = 4) + 𝑃 𝑟(𝑋 = 5) + 𝑃 𝑟(𝑋 = 6)

sum(dbinom(3:6, 6, 0.5))

## [1] 0.65625

3.4.1.2 Normal Probabilities:


There are similar functions for calculating Normal probabilities. However, note
that 𝑃 𝑟(𝑋 = 𝑥) is a meaningless quantity when a distribution is not discrete
(the Binomial is discrete, the Normal is continuous). Therefore the only useful
function to us for normal probability calculations is pnorm(x, mean, sd).
When 𝑋 ∼ 𝑁 (𝜇, 𝜎2 ):
• pnorm(x, mean = 𝜇 , sd = 𝜎 ) calculates 𝑃 𝑟(𝑋 ≤ 𝑥).
For example, suppose 𝑋 ∼ 𝑁 (0, 1). Find 𝑃 𝑟(𝑋 ≤ 1.96).
pnorm(1.96, mean = 0, sd = 1)

## [1] 0.9750021
Find 𝑃 𝑟(0.5 ≤ 𝑋 ≤ 1.96)
pnorm(1.96) - pnorm(0.5)

## [1] 0.2835396
Find 𝑃 𝑟(𝑋 > 1.96)
1 - pnorm(1.96)

## [1] 0.0249979
Find
1. 𝑃 𝑟(−1.96 ≤ 𝑋 ≤ 1.96); and
2. 𝑃 𝑟(|𝑋| > 1.96).
# 1.
pnorm(1.96) - pnorm(-1.96)

## [1] 0.9500042
# 2.
pnorm(-1.96) + (1 - pnorm(1.96))

## [1] 0.04999579

# OR
1 - (pnorm(1.96) - pnorm(-1.96)) ## MAKE SURE YOU UNDERSTAND WHY!

## [1] 0.04999579

3.4.2 Sorting Data Sets and running Functions Separately for Different Categories:
When the data contain variables that are categorical, sometimes we would like
to sort the data based on those categories. We can do this in R using the
order() function. See the accompanying R file in the week 3/4 lecture notes
folder – examples will be shown and discussed in lectures.
We might also sometimes want to run separate analyses/summaries for our
datasets based on the categories of the factor (categorical) variables. For ex-
ample, suppose we wanted to know the mean rainfall of each district from the
rainfall data in week 1 lectures:
# Put the data into R if you do not still have it:

rain <- c(23, 440, 800, 80,


250, 500, 1180, 200,
120, 400, 420, 430,
10, 20, 30, 5,
60, 200, 250, 120)

district <- factor(rep(1:5, each = 4))


season <- factor(rep(c("Winter", "Spring", "Summer", "Autumn"), 5))

rainfall.dat <- data.frame(district, season, rain)

# And cleanup our mess

rm(rain, district, season)

## Now use R to calculate the mean rainfall for each district:

attach(rainfall.dat)
# attaching a data frame lets us use the variable names directly
# (eg we can type 'rain' instead of needing to use 'rainfall.dat$rain')
# NEVER FORGET TO detach() THE DATA FRAME WHEN YOU ARE DONE!!!

by(rain, district, mean)

## district: 1
## [1] 335.75
## ------------------------------------------------------------

## district: 2
## [1] 532.5
## ------------------------------------------------------------
## district: 3
## [1] 342.5
## ------------------------------------------------------------
## district: 4
## [1] 16.25
## ------------------------------------------------------------
## district: 5
## [1] 157.5
There are always several ways to do the same thing in R. Another way
we could find the mean for each district is to use the tapply function:
tapply(rain, district, mean).
Which you use can often just boil down to a personal preference (eg you might
prefer the output from using by over the output from tapply). As an exercise,
try adding the tapply version to the end of the code box above and see which
you prefer.
Now that we are finished with the rainfall.dat data frame we should detach
it:
detach()

More examples will be shown in lectures – please see the accompanying R file
in the weeks 3/4 lecture notes folder.
Chapter 4

Week 5/6 - T-tests

Outline:
1. Hypothesis Testing – General Process
• The Concept
• The Basic Steps for Hypothesis Testing – 10 steps
• The Scientific Problem and Question
• The Research Hypothesis
• Resources, Required Detectable Differences, Significance Level
Required
• The Statistical Hypotheses
• One and Two Tailed Hypotheses
• Theoretical Models used in Testing Hypotheses
• The Test Statistic, its Null Distribution, Significance Level and
Critical Region
• Sample Collection and Calculation of Sample Test Statistic
• Comparison of Sample Test Statistic with Null Distribution
• The 𝑝-Value of a Test
• Conclusions and Interpretation
• Possible Errors
• Power of a Statistical Test
2. Specific Tests of Hypotheses I


• Hypothesis Testing: The Proportion versus a Stated Value


• Hypothesis Testing: The Mean versus a Stated Value (One-sample t-test)
• Hypothesis Testing: Difference Between Two Means I – Independent Samples (Two-Sample t-test)
• Hypothesis Testing: Difference Between Two Means II – Paired Samples
3. Using R
• Functions for probability 𝑃 𝑟(𝑋 ≤ 𝑥): pnorm(), pt().
• Calculating and Testing a Mean: The One-Sample t-test.
• Testing the Difference Between Two Means - The Two-Sample
t-test.
• The Paired t-test.
Workshop for Week 5
• Keep working with your project data;
• Normal Distribution, Test of Independence, Binomial and nor-
mal approximation;
• Assignment help– applicable to Assignment 1.
Project Requirements for Weeks 5 & 6
• Proceed with your project - it is due at the end of week 6!
Things YOU must do in Week 5 & 6
• Revise and summarise the lecture notes for week 3 & 4;
• Read your week 5 & 6 lecture notes before the lecture;
• Read the workshop on learning@griffith before your workshop;
• Submit your assignment in week 6.

4.1 Hypothesis Testing – The General Process


4.1.1 The Concept
The first area of statistical inference that we discuss involves using sample data
to test some sort of belief. The second branch of statistical inference, estimation,
will be discussed later.
Do the sample data support the claim made by the researcher? In such situations
there are two main types of question:

1. Question asked is:


Is the value (parameter) as proposed?
• Is the proportion of males equal to 0.5?
• Is the standard deviation of leaf area greater than 10% of its mean?
• Is the maximum energy output greater than 10kw?
• Is the mean dissolved oxygen (DO) in the Brisbane river below the critical
level for fish survival?
2. Question asked is:
Are the parameters the same for different groups/situations/etc?
• Is the mean level of NOX (nitrogen oxide) in the atmosphere increasing –
time 1 versus time 2?
• Is a particular grass species more tolerant to pressure from foot traffic
than another grass species?
• Is the average house loan through a particular bank the same this year as
at the same time last year?

4.1.2 The Basic Steps for Hypothesis Testing – the HT 10 steps
1. Identify clearly the scientific problem and question.
2. From the identified question, clearly define the research hypothesis at is-
sue.
3. Decide on the resources, required detectable difference and significance
level.
4. Formulate the statistical hypotheses: null and alternative.
5. Determine the theoretical model - based on null hypothesis and assump-
tions.
6. Identify the test statistic, its null distribution, and the relevant critical
region.
7. Obtain the sample data and calculate the sample test statistic.
8. Compare the sample test statistic with the null distribution using the
critical region OR evaluate the p-value for the test.
9. Make statistical conclusion and interpret result in terms of original ques-
tion.
10. Consider the possible errors - type I, type II.

4.1.3 The Scientific Problem and Question


It is the duty of the researcher to identify and explain the problem being studied.
If this is not carried out with care, improper, incorrect, and/or misleading
conclusions may occur.

4.1.4 The Research Hypothesis


• A specific belief about some feature of the population variable – eg a mean,
proportion, range.
• The feature will describe the variable in some way.
• The feature must be measurable or observable (not necessarily quantita-
tive).
• Also known as a scientific hypothesis or an English hypothesis.
• Refers to a situation, problem, question.
Dictionary Definitions of the English word hypothesis
Supposition made as basis for reasoning, without assumption of its
truth, or as starting-point for investigation (The Concise Oxford
Dictionary, 1975)
A proposition assumed as a premise in an argument; a proposition
(or set of propositions) proposed as an explanation for the occurrence
of some specified group of phenomena, either asserted or merely as
a provisional conjecture to guide investigation (Macquarie Concise
Dictionary, 1996)
One of the most common problem areas in research design is inadequate clar-
ification of the research hypothesis – it must be specific and unambiguous; it
must be clear what is to be measured. What may seem obvious to the researcher
at the time may be less than obvious to someone else, for example a research
assistant actually collecting the data, and may be no longer obvious to anyone
at a later date!
Example:
Decide whether each of the following is a good research hypothesis.
1. 36% of Australian females between 15 and 24 years of age smoke cigarettes.
2. The probability that a cyclone first located in the Coral Sea will cross the
Queensland coast is 0.20.
3. Budgerigars in inland Australia have a smaller range of body weights than
do budgerigars on the coast.
4. The minimum temperature in Brisbane never goes below 0∘ C.
5. The average Mastercard debt is $600.
6. Toyota Corollas are better cars than Ford Lasers.

7. Most people eat meat.


8. OPs in Private Schools cover a smaller range than OPs in State Schools.
9. Five percent of women who take the contraceptive pill still fall pregnant.
10. The average level of hydrocarbon concentration in body tissues increases
up the food chain indicating an accumulation process.
11. The noise levels from the freeway are above the maximum decibel level
set by the Australian standards.
Difficulties in defining the research hypothesis
The following are common difficulties encountered by researchers when they are attempting to define the research hypothesis:

• Identifying the problem of interest
• Defining the population
• Identifying the specific question which is being asked
• Stating the specific belief
Remember, the feature describes the population variable
Example: Identify the variables, populations and research hypotheses for some
of the examples given in the example above.

4.1.5 Resources, Required Detectable Differences, Significance Level Required
Resources: The resources that are available for the study need to be assessed
at the beginning of the project and compared with the resources required to
achieve the desired aim. If the two are not compatible, proceeding with the
research may be a complete waste of time and money. Statistical input can help
with this process, and ‘clever’ designs may enable research that would otherwise
not be possible.
Detectable Differences: It is important to recognise the difference between
‘statistical difference’ and ‘observed difference’. For example, two sample means
may have different values, but because of the variation associated with the
measurement, it is not possible to say that they come from different populations
– they are not statistically different. The researcher needs to think about the
minimum difference he wishes to be able to detect – this will influence the size
of the sample needed in the experimental design. It may also mean that the
resources will not be sufficient; this will mean further thinking and maybe the
decision not to go ahead with the study.
The Significance Level: The chance the researcher is willing to take of incorrectly supporting the research hypothesis – usually designated by 𝛼 (alpha).

• Traditionally the level is set at 0.05 or 0.01 – why?
• The level depends on the situation.
• 0.05 and 0.01 are like hair lengths: different people and/or problems require different reliabilities – be yourself!
• It is the possible error if the conclusion is to reject the null hypothesis.

4.1.6 The Statistical Hypotheses


The Alternative Hypothesis – 𝐻1 or 𝐻𝑎
• The ‘research’ hypothesis – possibly reformulated in statistical jargon.
• The ‘belief’ we want to prove true.
• The opposite of the null hypothesis.
• By disproving the null, we say we have ‘proved’ the alternative.
• Usually represented as H1
The Null Hypothesis - 𝐻0
• Restatement of the research hypothesis in a form that is testable – usually
involves negation.
• Expresses the belief about the feature describing the variable in a way
that is testable.
• There must be a known theoretical model relating to the distribution of the
feature OR a way of obtaining an empirical null distribution (resampling
or bootstrapping).
• Is true if and only if the alternative is false. We can never prove it true.
Hypotheses are statements about the population not about the sam-
ple.

4.1.7 One and Two Tailed Hypotheses


Where do the tails fit in?
Tails play a significant (pun intended!) role in statistical inference – depends
on question being asked.
Two Tailed:
Null contains: equals
Alternative contains: not equals
One Tailed:
Null contains: equals and greater than OR equals and less than
Alternative contains: less than OR greater than
Example:
A comparative study is to be carried out on the populations of fiddler crabs in
the Tweed River and the Brisbane River. One aspect to be studied is the weight
of an adult crab, a component of interest to a potential marketing venture.
Write the hypotheses for the following:
1. Belief: crabs in the Tweed River have a different weight from those in the
Brisbane River.
2. Belief: crabs in the Tweed River weigh more than those in the Brisbane
River.

3. Belief: crabs in the Tweed River weigh less than those in the Brisbane
River.

4.1.8 Theoretical Models used in Testing Hypotheses


Theoretical models are used to specify the null distribution, that is, the dis-
tribution of the test statistic if the null hypothesis is true.
The model will depend on the measurement and on the feature of interest in
the research hypothesis. For example:
• A study involves a series of Bernoulli trials; feature of interest is a count
or proportion - the theoretical model will be the Binomial;
• If a continuous measurement such as weight is to be investigated and the
mean is of interest - the Normal or t- distribution may be an appropriate
model;
• If the aim is to test the goodness of fit of some data to a specified distri-
bution - the chi-squared model could be used.
The feature of interest is usually converted to a test statistic which has a known
distribution, assuming the null hypothesis is true (the Null Distribution).
All theoretical models involve assumptions. Violations of these assumptions may
or may not have a dramatic effect on the outcome of any inference undertaken.
If you are ever in any doubts regarding assumptions and your data, consult a
statistician for advice.

4.1.9 The Test Statistic, its Null Distribution, Significance Level and Critical Region
The Test Statistic:
• Usually a function of the ‘feature of interest’ and is known to have a
particular distribution – this contributes to the ‘testability’ of the process.
• Should be something that has meaning in the context of the feature of
interest – if you want to determine if two things are different, you might
decide to look at their absolute difference, and include some sort of weighting – a difference of two has more impact if the values are near 10 than if the values are near 1000.
For example, when testing hypotheses about the population mean the equiva-
lent Z score (or t- value if the standard deviation is estimated) becomes a test
statistic.
The null distribution
• Is the probability distribution of the test statistic, assuming the null hy-
pothesis is true.

• If 𝐻0 is true, this is the distribution we would expect the feature (or some
expression based on it) to have.
• The distribution for the population of ‘feature values’ if H0 is true – eg,
the distribution of the sample mean.
Significance Level – alpha, 𝛼
The risk you are willing to take that you will reject the null hypothesis when it
is really true. The probability of a Type I error. It defines the ‘cut off’ point
for the test statistic.
Critical Region
• Determined by the specified significance level, 𝛼
• The region of the null distribution where it is considered unlikely for a
value of the test statistic to occur.
• If sample value lies here, it is regarded as evidence to reject 𝐻0 in favour
of 𝐻1 .
The relationships of the test statistic to the sample and population
are critical.

4.1.10 Sample Collection and Calculation of Sample Test Statistic
Ways of selecting the sample are discussed at length in various introductory texts.
In general, samples should be random and representative of the population they
are taken from. The test statistic is calculated as per the definition of whatever
‘meaningful’ feature has been selected, given the question asked and the available
data – eg a count or a mean or a sum of deviations or …

4.1.11 Comparison of Sample Test Statistic with Null Distribution
• The sample test statistic is calculated from the observed data and com-
pared with the null distribution which reflects the population if 𝐻0 is
true.
• If the sample test statistic lies in the ‘critical region’ the null hypothesis
is rejected at the specified level of significance.
• If it does not lie in the critical region the null hypothesis is not rejected –
the data do not provide evidence to reject the null hypothesis in favour of
the research (alternative) hypothesis.

4.1.12 The p-Value of a Test


• Probability of observing a value of the test statistic as extreme as, or
more extreme than, that seen in the sample.
• Calculated from the null distribution.

• Called the p-value for the sample test statistic


• Is the probability of selecting a sample at least as favourable to the
research hypothesis (alternative) as the observed sample.
• It represents the attained level of significance for the test.

4.1.13 Conclusion and Interpretation


• Depends on whether we reject or fail to reject the null hypothesis.
• Remember, failing to reject the null hypothesis does not mean the
null hypothesis is true

4.1.14 Consider Possible Errors:


Two basic types of error can occur whenever hypothesis testing is carried out.
These are summarised in the following table:

| TEST CONCLUSION     | TRUE SITUATION: 𝐻0 is True | TRUE SITUATION: 𝐻0 is False |
|---------------------|-----------------------------|------------------------------|
| Fail to Reject 𝐻0  | Correct                     | Type II Error (𝛽)           |
| Reject 𝐻0          | Type I Error (𝛼)           | Correct                      |

The LEVEL OF SIGNIFICANCE is the probability of making a Type I


error and is under the control of the person carrying out the statistical test.
The symbol used is 𝛼 (alpha).
The PROBABILITY OF A TYPE II ERROR depends on the true alter-
native hypothesis (and several other things) and is thus usually unknown. The
symbol used is 𝛽 (beta).

4.1.15 Power of a Statistical Test


• The power of a statistical test is the probability of correctly rejecting the
null hypothesis.
• The probability of correctly detecting a valid alternative hypothesis.
• Power is calculated as one minus the probability of a Type II error: Power = 1 − 𝛽.
• A test with low power results in a higher chance of not rejecting the null
hypothesis when it should in fact be rejected.
For example, if we conclude that the null hypothesis: equal numbers of males
and females cannot be rejected, then it may be that the test of proportion being
used has a low power and we are simply not detecting the actual difference.
This may be a case of no statistical difference when there is a meaningful real
difference.

Note: It is also possible to find a statistically significant difference that is not a scientifically significant or meaningful effect. Being a slave to p-values can lead you into trouble - there is no substitute for common sense and scientific knowledge. You should always ask yourself: “Does this result make sense?”
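As a rough numerical illustration of how power behaves, R's built-in power.t.test() function can be used to explore the trade-offs; the settings below (a difference of 5 units, sd of 10, 20 observations per group) are invented purely for illustration and are not from any study in these notes:

# Power of a two-sample t-test to detect a difference of 5 units
# (illustrative values only: sd = 10, n = 20 per group, alpha = 0.05)
power.t.test(n = 20, delta = 5, sd = 10, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")

# Sample size per group needed to achieve 80% power with the same settings
power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.8,
             type = "two.sample", alternative = "two.sided")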

4.2 Specific Tests of Hypotheses I


4.2.1 Hypothesis Testing: The Proportion versus a Stated
Value
See the week 3/4 lecture notes for theory, details and examples.
Use the binomial distribution if the sample size is no more than 20. Use the normal approximation if the sample size is greater than 20.

4.2.2 Hypothesis Testing: The Mean versus a Stated Value (The one-sample t-test)
Two-tailed hypotheses:

𝐻0 ∶𝜇 = 𝜇0
𝐻1 ∶𝜇 ≠ 𝜇0

One-tailed Upper hypotheses:

𝐻0 ∶𝜇 ≤ 𝜇0
𝐻1 ∶𝜇 > 𝜇0

One-tailed Lower hypotheses:

𝐻0 ∶𝜇 ≥ 𝜇0
𝐻1 ∶𝜇 < 𝜇0

Using theory: If $X \sim N(\mu, \sigma^2)$ then $\bar{X}_n \sim N(\mu, \frac{\sigma^2}{n})$. Applying the standard normal ($Z$) transform we get the test statistic:

$$T = \frac{\bar{X}_n - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1).$$

Generally, however, we do not know the population standard deviation $\sigma$. Instead, we estimate it using the sample standard deviation $s$. Introducing this extra level of uncertainty into the test statistic changes the distribution of the test statistic to a Student's $t$:

$$T = \frac{\bar{X}_n - \mu_0}{s/\sqrt{n}} \sim t_{n-1}.$$

The hypothesised value for 𝜇 (𝜇0 ) is substituted into the formula along with the
values calculated from the sample – mean and standard deviation – to obtain
the sample test statistic. The calculated sample test statistic is compared with
the relevant critical value from the Student’s 𝑡 distribution with 𝑛 − 1 degrees
of freedom.
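In R, the relevant critical values come from qt(). As a quick sketch (using an illustrative sample of size 16, i.e. 15 degrees of freedom):

# Two-tailed critical values at alpha = 0.05 with 15 df
qt(c(0.025, 0.975), df = 15)

# One-tailed (upper) critical value at alpha = 0.05 with 15 df
qt(0.95, df = 15)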
Example
THE QUESTION:
Fiddler crabs in the Tweed River appear to be heavier than those
reported in the literature, where the mean weight is given as 230gm.
Is this true?
IDENTIFY A FEATURE WHICH WILL HAVE MEANING FOR THE QUESTION:
The mean weight.
THE RESEARCH HYPOTHESIS:
The mean weight of fiddler crabs in the Tweed river is greater than
230gm.
DETERMINE THE RESOURCES, DETECTABLE DIFFERENCE, LEVEL OF SIGNIFICANCE
Estimation of sample size for a given detectable difference is dis-
cussed in a later section. Assume for this example that a sample of
16 crabs will be taken.
STATISTICAL HYPOTHESES:

𝐻0 ∶𝜇 ≤ 230
𝐻1 ∶𝜇 > 230

where 𝜇 is the population mean weight of fiddler crabs in the Tweed


river.
DEDUCE A THEORETICAL MODEL FOR THE FEATURE:
Continuous data, possibly normal – a genetic & environmental
derivation.

Interested in testing the mean – central limit theorem gives the dis-
tribution of the sample mean as normal with mean 𝜇 and variance
𝜎2 /16.
Standard deviation, 𝜎, is not known and will have to be estimated
from the sample using 𝑠. This means our null distribution is the
Student’s 𝑡 distribution.
TEST STATISTIC, NULL DISTRIBUTION & CRITICAL REGION:
The mean of a random sample of size 𝑛 from a variable with a
normal distribution 𝑁 (𝜇, 𝜎2 ) has a normal distribution 𝑁 (𝜇, 𝜎2 /𝑛).
Converting this to a 𝑍 format, and acknowledging that the popula-
tion standard deviation of the weights of crabs in the Tweed River
is not known gives a test statistic:

$$T = \frac{\bar{X}_n - \mu_0}{s/\sqrt{n}} \sim t_{n-1}.$$

where 𝑠 is the estimated standard deviation calculated from the sam-


ple. This test statistic has a 𝑡 distribution with (𝑛–1) degrees of
freedom if 𝐻0 is true.
Since a sample size of 16 has been proposed, the degrees of freedom
will be 15 and the critical region relevant for a level of significance
of 𝛼 = 0.05 and a one-tailed test (𝐻1 has a ‘greater than’ not a ‘not
equals to’) is found from the 𝑡-table (see table at end of notes) to
be: 𝑡 > 1.75.
COLLECT THE DATA:
The data have been collected.
CALCULATE THE TEST STATISTIC:
Calculations on observations: sample mean 𝑋 = 240, and sample
standard deviation 𝑠 = 24.
Compute sample test statistic,

$$T = \frac{240 - 230}{24/\sqrt{16}} = \frac{10}{6} = 1.667.$$

COMPARE THE TEST STATISTIC WITH THE NULL DISTRIBUTION:
Does T lie in critical region? No. Calculated 𝑇 of 1.667 < 1.7531.
OR: What is the 𝑝 - value for 𝑇 ?

Using R:

1 - pt(1.6667, 15)

## [1] 0.05815621
The 𝑝-value for the calculated 𝑇 of 1.667 on 15 df is 0.058. (This is
larger than 0.05.)
MAKE CONCLUSION AND INFERENCES:
There is insufficient evidence to reject the null hypothesis (𝑝 ≥ 0.05).
The sample data do not support the research hypothesis that the
mean weight of crabs in the Tweed is greater than that reported in
the literature (230gm), at the 0.05 level of significance.
SPECIFY THE ERROR YOU MAY BE MAKING IN
YOUR INFERENTIAL CONCLUSIONS:
The researcher may be incorrect in not rejecting the null hypothesis
in favour of the research hypothesis – a type II error. The probability
associated with this error is unknown unless the true alternative
value of the mean weight for Tweed river crabs is known. The failure
to reject the null may simply reflect a low powered test.
Question: What if the standard deviation for the sample had been
20?
R: A one-sample t-test can be carried out using t.test() in R. See R section
for details.
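As a minimal sketch of that call (the weights below are invented stand-ins for a real sample of 16 Tweed River crabs, used only to show the syntax):

# Hypothetical crab weights (gm) - illustration only, not real data
crab.weights <- c(245, 230, 262, 224, 251, 238, 219, 266,
                  242, 228, 255, 233, 247, 221, 259, 240)

# One-sample t-test of H0: mu <= 230 against H1: mu > 230
t.test(crab.weights, mu = 230, alternative = "greater")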

4.2.3 Hypothesis Testing: Difference Between Two Means I – Independent Samples (The two-sample t-test)
Two-tailed hypotheses:
𝐻0 ∶𝜇1 = 𝜇2
𝐻1 ∶𝜇1 ≠ 𝜇2

where 𝜇1 and 𝜇2 are the respective means of the two populations to be compared.
One-tailed Hypothesis:
𝐻0 ∶𝜇1 ≤ 𝜇2
𝐻1 ∶𝜇1 > 𝜇2

Or
One-tailed Hypothesis:
𝐻0 ∶𝜇1 ≥ 𝜇2
𝐻1 ∶𝜇1 < 𝜇2

Note that in the two-sample t-test the tail of a one-tailed hypothesis


test (upper or lower) depends on how you calculate the test statistic.
This will be discussed in lectures.
Using theory: If $X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$ then $(\bar{X}_1 - \bar{X}_2) \sim N(\mu_1 - \mu_2, \sigma^2_{\bar{X}_1 - \bar{X}_2})$.

The situation where $\sigma_1$ and $\sigma_2$ are known is most unlikely and will not be discussed.

The estimation of $\sigma^2_{\bar{X}_1 - \bar{X}_2}$ depends on whether or not $\sigma_1$ and $\sigma_2$ can be assumed equal.
Let 𝑛1 and 𝑛2 denote the sample sizes taken from populations 1 and 2, respec-
tively. Let 𝑠1 and 𝑠2 denote the standard deviations of each sample taken from
populations 1 and 2, respectively.
Standard deviations unknown but assumed equal (Pooled Procedure)

$$\hat{\sigma}^2_{\bar{X}_1 - \bar{X}_2} = s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}.$$

Therefore

$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}.$$

The Test Statistic is:

$$T = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1 + n_2 - 2} \text{ if } H_0 \text{ is true.}$$

𝑠𝑝 is known as the pooled sample standard deviation.


Standard deviations unknown but cannot be assumed equal
The Test Statistic is:

$$T = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \sim t \text{ (approximately, with weighted degrees of freedom) if } H_0 \text{ is true.}$$

NOTE:
• For large sample sizes, $T \sim Z$.
• For small sample sizes, $T \sim t$ with weighted DF.

In this course, the situation of unequal standard deviations and small samples
will not be considered further.
Question: Why would a test to compare means be of interest if pop-
ulations have unequal standard deviations?
The hypothesised value for $(\mu_1 - \mu_2)$ under $H_0$ is substituted into the formula along with the values calculated from the samples (means and standard deviations) to obtain the sample test statistic. The calculated sample test statistic is compared with the relevant critical value of $t$.
NOTE: When comparing the means from two populations using the test statis-
tic shown above, the choice of which sample mean is subtracted from the other
is arbitrary. For two-tailed hypotheses this is not an issue. However, it can
create an issue for one-tailed tests when deciding whether the test is upper or
lower tailed. This will be discussed further in lectures.
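As an illustrative sketch, the pooled calculation can also be scripted directly from summary statistics; the helper function name and arguments below are my own, not part of the notes:

# Pooled two-sample t statistic from summary statistics (illustrative helper)
pooled.t <- function(mean1, mean2, s1, s2, n1, n2) {
  sp2 <- ((n1 - 1)*s1^2 + (n2 - 1)*s2^2) / (n1 + n2 - 2)   # pooled variance
  T.stat <- (mean1 - mean2) / sqrt(sp2 * (1/n1 + 1/n2))     # test statistic
  list(sp = sqrt(sp2), T = T.stat, df = n1 + n2 - 2)
}

# Example call using the crab summary statistics from the example below
pooled.t(240, 215, 24, 18, 16, 25)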
Example
THE QUESTION:
Fiddler crabs in the Tweed River appear to be heavier than fiddler
crabs in the Brisbane River. Is this true?
IDENTIFY A FEATURE WHICH WILL HAVE MEANING FOR THE QUESTION:
The difference between the mean weights of crabs in the two loca-
tions.
THE RESEARCH HYPOTHESIS:
Mean weight for Tweed River crabs is greater than the mean weight
for Brisbane River crabs.
DETERMINE THE RESOURCES, DETECTABLE DIFFERENCE, LEVEL OF SIGNIFICANCE
Sample size? Assume sample sizes of 16 and 25 have been taken
from Tweed and Brisbane rivers respectively. Following tradition,
take the level of significance to be 𝛼 = 0.05.
STATISTICAL HYPOTHESES:

𝐻0 ∶𝜇T ≤ 𝜇B
𝐻1 ∶𝜇T > 𝜇B

where 𝜇T and 𝜇B are the population mean weights of fiddler crabs


in the Tweed and Brisbane rivers, respectively.

DEDUCE A THEORETICAL MODEL FOR THE FEATURE:
Assume weight has a normal distribution.
Assume the standard deviations are unknown but the same.
TEST STATISTIC, NULL DISTRIBUTION & CRITICAL REGION:
The Test Statistic is:

$$T = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1 + n_2 - 2} \text{ if } H_0 \text{ is true,}$$

where

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}.$$

Degrees of freedom = 16 + 25 − 2 = 39.


From 𝑡-tables, 𝑡39 (0.95) = 1.69 (or −1.69 - see discussion below).
COLLECT THE DATA:
Take a sample of 16 crabs from the Tweed river; take a sample of 25
crabs from the Brisbane river.
CALCULATE THE TEST STATISTIC:
Using the sample crab weight data from each river, calculate means
and std deviations:
Tweed:
mean = 240, standard deviation = 24, n = 16
Brisbane:
mean = 215, standard deviation = 18, n = 25
𝑠2𝑝 =
𝑇 =
# CHECK HAND CALCULATIONS: Calculate Sp^2 and T using R:

# Sp^2
sp.sqrd <- ((16-1)*24^2 + (25-1)*18^2)/(16 + 25 -2)
round(sp.sqrd, 3)

## [1] 420.923

# Test Stat

T.stat <- (240 - 215)/(sqrt(sp.sqrd * (1/16 + 1/25)))


round(T.stat, 3)

## [1] 3.806
COMPARE THE TEST STATISTIC WITH THE NULL
DISTRIBUTION:
MAKE CONCLUSION AND INFERENCES:
SPECIFY THE ERROR YOU MAY BE MAKING IN
YOUR INFERENTIAL CONCLUSIONS:
R: Two-sample t-tests can be carried out using t.test() - see Using R section.
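As a follow-up check on the null distribution (the comparison and conclusion themselves are left for the lecture discussion), the critical value and p-value for the calculated statistic can be obtained with qt() and pt():

# Upper-tail critical value for alpha = 0.05 with 39 df (matches the tabled 1.69)
qt(0.95, df = 39)

# p-value for the calculated test statistic T = 3.806 on 39 df
1 - pt(3.806, df = 39)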

4.2.4 Hypothesis Testing: Difference Between Two Means II – Paired Samples (The Paired t-test)
The same experimental unit is measured twice. E.g. Standing and lying down
blood pressures. In these cases the data are not independent.
Here, we first calculate the difference between the two measures on each individual. Then we apply the one-sample t-test to the difference variable.
Example:
Twelve randomly selected individuals had their blood pressure measured in both
standing and lying down positions. The data is given in the table below.

| Subject | Standing BP | Lying Down BP | Difference |
|---------|-------------|---------------|------------|
| 1       | 132         | 136           | 4          |
| 2       | 146         | 145           | -1         |
| 3       | 135         | 140           | 5          |
| 4       | 141         | 147           | 6          |
| 5       | 139         | 142           | 3          |
| 6       | 162         | 160           | -2         |
| 7       | 128         | 137           | 9          |
| 8       | 137         | 136           | -1         |
| 9       | 145         | 149           | 4          |
| 10      | 151         | 158           | 7          |
| 11      | 131         | 120           | -11        |
| 12      | 143         | 150           | 7          |
| Mean    | –           | –             | 2.50       |
| SD      | –           | –             | 5.50       |

Two-tailed Paired t-test



𝐻0 ∶ There is no difference between the mean blood pressures in the two popu-
lations
𝐻1 ∶ There is a difference between the mean blood pressures in the two popula-
tions
Since the two populations are paired (i.e. the same individuals are measured twice), we are actually testing whether the population mean difference ($\mu_D$) is zero or not:

𝐻0 ∶𝜇𝐷 = 0
𝐻1 ∶𝜇𝐷 ≠ 0

This is just a one-sample t-test on the difference data: i.e. use the difference data as the sample and calculate its sample mean ($\bar{X}_D$) and sample standard deviation ($s_D$). We can then use these in the one-sample t-test test statistic:

$$T = \frac{\bar{X}_n - \mu_0}{s/\sqrt{n}} = \frac{\bar{X}_D - \mu_D}{s_D/\sqrt{n}} = \frac{2.5 - 0}{5.5/\sqrt{12}} = 1.574.$$

From tables, $t_{11}(0.975) = 2.201$, so the two-tailed critical values are $\pm 2.201$. Since $T = 1.574$ is not greater than 2.201 (or less than -2.201) we cannot reject the null at the 0.05 level of significance. We conclude there is insufficient evidence to suggest that the mean standing blood pressure differs from the mean lying blood pressure.
One-tailed versions will be discussed during lectures.

4.3 Using R
4.3.1 More Probability Functions:
Use the functions: pnorm(z, mean, sd) and pt(t, df) to find the cumulative
probability for particular values of 𝑍 and the calculated t-test statistic, 𝑇 based
on Specified degrees of freedom (df).
Eg: An art auction produces normally distributed sale prices with a mean of
1600 dollars and a standard deviation of 220 dollars. What is the probability
that a particular painting will cost at least 2000 dollars?

Let $X$ denote sale prices. We want to find $Pr(X > 2000) = 1 - Pr(X \le 2000)$.
1 - pnorm(2000, mean = 1600, sd = 220)

## [1] 0.03451817
# Or, convert to Z-value first:

z <- (2000 - 1600)/220


1 - pnorm(z)

## [1] 0.03451817
(Exercise: Modify the above code to find the probability that a painting will
cost 5000 dollars or less.)
Suppose now that the standard deviation given above had been estimated from
a random sample of 10 of the paintings. Student’s 𝑡 should be used (with df =
9) rather than the normal. Note you should convert the figure to a standardised value first (in the same way as a Z-value) before using the pt function:
z <- (2000 - 1600)/220
1 - pt(z, df = 9)

## [1] 0.05119906

4.3.2 Testing a Mean – The One-Sample t-test


Use the t.test() function in R:
t.test(x, alternative, mu)
EG: Student’s sleep data (1908) contain data on the extra amount of sleep
gained (hours) from two types of soporific drug. Twenty patients were randomly
assigned to either drug 1 or drug 2 (10 patients in each group) and their extra
amount of sleep obtained was recorded. [NB: Actually the data are paired -
10 patients measured twice, but we are going to pretend they are independent
groups of patients for the next two exercises].
Let’s test whether the mean amount of extra sleep is greater than 0 hours, across
both groups. In other words, do both the drugs increase mean hours of sleep?
The data are in R already, so we just need to use the t.test() function:
# Print out the data, note the variable names
sleep

## extra group ID
## 1 0.7 1 1
## 2 -1.6 1 2
## 3 -0.2 1 3
## 4 -1.2 1 4

## 5 -0.1 1 5
## 6 3.4 1 6
## 7 3.7 1 7
## 8 0.8 1 8
## 9 0.0 1 9
## 10 2.0 1 10
## 11 1.9 2 1
## 12 0.8 2 2
## 13 1.1 2 3
## 14 0.1 2 4
## 15 -0.1 2 5
## 16 4.4 2 6
## 17 5.5 2 7
## 18 1.6 2 8
## 19 4.6 2 9
## 20 3.4 2 10
# Do the hypothesis test:

t.test(sleep$extra, alternative = "greater", mu = 0)

##
## One Sample t-test
##
## data: sleep$extra
## t = 3.413, df = 19, p-value = 0.001459
## alternative hypothesis: true mean is greater than 0
## 95 percent confidence interval:
## 0.7597797 Inf
## sample estimates:
## mean of x
## 1.54

(Exercise: How would you modify the above code if you wanted to test whether
the mean extra hours of sleep is significantly less than 1 hour?)

4.3.3 Testing the Difference Between Two Means – The Two-Sample t-test
Again, we use the t.test() function in R to do two-sample t-tests.

Suppose in the sleep example we want to test whether the mean extra hours
of sleep differs between the two drugs. We simply do the two-sample t-test
comparing the two groups (drugs) using the t.test() function:
# Do the hypothesis test:
t.test(extra ~ group, data = sleep)

##
## Welch Two Sample t-test
##
## data: extra by group
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.3654832 0.2054832
## sample estimates:
## mean in group 1 mean in group 2
## 0.75 2.33
(Exercise: Investigate what R means when it says “Welch Two Sample t-test”
in the output. Start by looking at the help for t.test() using ?t.test in the
R console).

4.3.4 Testing the Mean Difference Between Paired Data – The Paired t-test
Recall the blood pressure example. Enter the data and then do the test.
# Enter the data:
standing.bp <- c(132,146,135,141,139,162,128,137,145,151,131,143)
lying.bp <- c(136,145,140,147,142,160,137,136,149,158,120,150)

# Do the test:
t.test(lying.bp, standing.bp, paired = TRUE, alternative = "two.sided")

##
## Paired t-test
##
## data: lying.bp and standing.bp
## t = 1.574, df = 11, p-value = 0.1438
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.9958458 5.9958458
## sample estimates:
## mean of the differences
## 2.5
# OR, create the differences yourself first:
diff <- lying.bp - standing.bp
t.test(diff, alternative = "two.sided", mu = 0)

##
## One Sample t-test
##
## data: diff

## t = 1.574, df = 11, p-value = 0.1438


## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.9958458 5.9958458
## sample estimates:
## mean of x
## 2.5
(Exercise: Can you work out how to do a paired t-test on Student’s sleep data?
Remember, the data are actually paired.)
Chapter 5

Week 7 - ANOVA

Outline:
1. Analysis Of Variance
• Statistical Models
– The Concept
– Data Synthesis: What the Model Means.
– Analysis of Variance (ANOVA) - The Concept
– The One-Way Analysis of Variance
∗ Introduction
∗ Design and Model
∗ Statistical Hypotheses in the ANOVA
∗ Sums of Squares
∗ Degrees of Freedom
∗ Mean Squares
∗ The Analysis of Variance Table
∗ Test of Hypothesis – The F-Test
∗ Comparison of the F-test in the ANOVA with 1 df for
Treatment, versus, the Two-sample Independent t-test.
∗ Worked Example
∗ Assumptions in the ANOVA Process
2. Using R
• Using R for ANOVA
• Calculating and Testing a Mean: The One-Sample t-test.
• Protected 𝑡-tests in R: Multiple Comparisons of Treatment
Means (MCTM).


Accompanying Workshops - done in week 8


• The analysis of variance process - by hand and using R
• Multiple comparisons of means - when the ANOVA rejects the
null hypothesis.
Project Requirements for Week 7
• Nil.

5.1 Statistical Models


5.1.1 The Concept
All statistical tests are based on an underlying statistical model.
Example - The Independent t-test
When a comparison of two treatment means is carried out using
samples from two independent populations, it is assumed that the
data observations are constructed from the following model:
observation = treatment mean + error
The test involves comparing the variation between the treatment
means with the variation within each treatment - not all observations
from the one treatment will be identical.
BUT: Before looking at the treatment means, we usually subtract
the overall mean, and work with deviations from this general mean -
recall that the definition of variance involves the deviations, not the
original observations.
The correct statistical model is thus:
observation = overall mean + treatment effect + error
The treatment effect assesses how the particular treatment shifts the
observations above or below the overall (general) mean.
Testing equality of treatment means ≡ testing equality of their effects.
The statistical model includes reference to an error component, which is a
means of indicating the random variation expected within the same treatment
- it acknowledges the natural variation in the observations.
The modeling of the natural variation (or random error) requires using the
distribution it follows. Traditionally this has depended on invoking some
known mathematical formula, eg., the normal distribution function. Recent
increases in computational power offer alternatives to this - see later discussion.

5.1.2 Data Synthesis - What the Model Means


When a model is specified for a dependent variable (the variable of interest), the
researcher is stating a belief that the variable of interest is made up of several
components. The ANOVA aims to test this belief and also to provide estimates
of the various component parts of the model. That is, each data value is to be
analysed to find its component parts.
Example:
We wish to explore the effect of a new method of harvesting within
the forest, on the damage to adjacent trees – note that system of har-
vesting is the treatment. The measurement taken will be percentage
of trees within two metres of the felled tree, which have some form
of damage.
The following information is available:
• mean damage if tractor system and no guidance: 90%
• mean damage using current harvesting system 1 (CS1): 60%
• mean damage using current harvesting system 2 (CS2): 50%
It is believed that the new system will give an improvement of 30%
over the best current system. Thus the mean damage when the new
system is used will be 20%. It has been decided to use five random
replicates of each of the four treatments: tractor, CS1, CS2 and new.
The statistical model being proposed for the individual observations
of % damage is thus:
% Damage = general constant + system effect + individual variation

$$\%D_{ij} = \mu + sy_i + \epsilon_{ij}$$

The known values of the effects can be used to simulate a set of


typical data. Analysis of these data should lead to estimates which
closely reflect those used in the simulation.
The overall average will be an average of the four treatment means:

(90 + 60 + 50 + 20)/4 = 55.

The effect size of each harvesting system treatment will be:


• Nil guidance (mean of 90): effect (nil) +35
• CS1 (mean of 60): effect (CS1) +5
• CS2 (mean of 50): effect (CS2) -5
• New (mean of 20): effect (new) -35

To complete the synthesis we need some estimate of standard devi-


ation (natural variability seen in different parts of the forest using
the same system). Previous studies indicate a standard deviation
of 10%. For the distribution of the random errors assume a normal
statistical distribution, with given sd.
That is: $N(0, 100)$ – a mean of zero and an sd of 10, which becomes a variance of 100.
Suppose that we synthesise the data for a completely randomised
experiment such as will be used in getting the real data. That is,
we want five random replicates of the four harvesting treatments; in
total 20 individual observations must be created.
The model proposed is: % damage = 𝜇 + system effect + random
bit
We start with 20 individuals that have a basic mean of 55 (𝜇 = 55):
harvesting.system <- factor(rep(c("nil", "CS1", "CS2", "new"), each = 5))
observations <- rep(55, 20)
dat <- data.frame(harvesting.system, observations)
dat

## harvesting.system observations
## 1 nil 55
## 2 nil 55
## 3 nil 55
## 4 nil 55
## 5 nil 55
## 6 CS1 55
## 7 CS1 55
## 8 CS1 55
## 9 CS1 55
## 10 CS1 55
## 11 CS2 55
## 12 CS2 55
## 13 CS2 55
## 14 CS2 55
## 15 CS2 55
## 16 new 55
## 17 new 55
## 18 new 55
## 19 new 55
## 20 new 55
Each individual tree has its own peculiarities which we have said are
distributed as a normal variable with a mean of zero and standard
deviation of 10. A random sample is selected from a normal distri-
bution with these parameters, usually using a computer package but

tables are available.

These 20 random deviates are assigned at random to the 20 trees


being simulated:
dat$observations <- dat$observations + rnorm(20, mean = 0, sd = 10)
dat

## harvesting.system observations
## 1 nil 73.88037
## 2 nil 48.76506
## 3 nil 56.98792
## 4 nil 57.63700
## 5 nil 53.38633
## 6 CS1 33.78667
## 7 CS1 69.78944
## 8 CS1 46.05728
## 9 CS1 64.23654
## 10 CS1 51.83999
## 11 CS2 74.15087
## 12 CS2 67.17830
## 13 CS2 55.05871
## 14 CS2 66.69690
## 15 CS2 49.02933
## 16 new 41.22425
## 17 new 78.15040
## 18 new 68.42995
## 19 new 58.56322
## 20 new 51.90277

Finally the allocated harvest system effects are added to complete


the model and give the simulated data:
sys.effect <- rep(c(35, 5, -5, -35), each = 5)
dat$observations <- dat$observations + sys.effect
dat

## harvesting.system observations
## 1 nil 108.880369
## 2 nil 83.765059
## 3 nil 91.987917
## 4 nil 92.637000
## 5 nil 88.386331
## 6 CS1 38.786673
## 7 CS1 74.789435
## 8 CS1 51.057283
## 9 CS1 69.236543
## 10 CS1 56.839995

## 11 CS2 69.150871
## 12 CS2 62.178296
## 13 CS2 50.058710
## 14 CS2 61.696901
## 15 CS2 44.029332
## 16 new 6.224251
## 17 new 43.150398
## 18 new 33.429953
## 19 new 23.563222
## 20 new 16.902768

5.2 Analysis of Variance (ANOVA) - The Concept
It is clear that observed data vary. The ultimate aim of analysing this variation
is to determine how much of it is due to known causes, and how much is due to
individual differences or random, unexplained causes.
The statistical model clearly identifies the known causes and the unknown com-
ponent (the error). If the only variable in the model is a treatment effect, then
the entire known component will be due to the treatments. In more sophisti-
cated experimental designs, some of the known component may be part of the
design - this situation is covered in second year. For treatments to produce a
significant amount of variation in the observations, we usually expect them to
have different means. Indeed, in a classical ANOVA using an F-test, we assume
that the only way that the treatments could affect the observations is to change
the mean value; there are no differences in any other aspect of the distributions
associated with each treatment.
Recall the independent t-test, which provides an objective way of assessing
whether or not two treatment means are different. The ANOVA provides an
extension of this idea to situations where there are more than two treatments,
and what is wanted is a way of objectively testing whether there are any real dif-
ferences between the treatment means in a global sense. The ANOVA provides
a way of comparing more than two treatment means at the same time.
The variation which is analysed is measured as the mean square, which is the
sum of squares divided by the degrees of freedom – recall the basic definition of
the standard deviation.

5.3 The One-Way Analysis of Variance


5.3.1 Introduction
When the variation in the observed data is partitioned into two sections, one
attributable to the treatments (explained) and the other to natural (error, residual, unexplained) sources, the process is described as a one-way ANOVA. Later in this section we will look at situations where the variation is split up into more than just one explained effect.
For a one-way ANOVA, the two sources of variation are known as:
• a between treatment variation; and
• a within treatment variation.
The ‘within-treatment’ variation represents the left-over or residual part after
treatments have been taken into account; the natural variation, unexplained
variation or error.
Other names frequently used for a one-way analysis are:
• ‘between and within’ - meaning sources of variation are between treat-
ments and within treatments
• ‘completely randomised ANOVA’ - indicating that the analysis regards the
data as having come from a completely randomised experimental design
model.

5.3.2 Design and Model


In a completely randomised design (CRD), a number of treatments, say k, are
each replicated a number of times. The ‘treatments’ represent ‘populations’ or
‘groups’ that are to be compared.
For example, four treatments, A, B, C and D, could each be replicated three
times. This would require 12 (i.e. 4 × 3) experimental units or plots.
For the design to be a CRD, the four treatments must be assigned to the 12
plots at random, i.e. without any pattern. A schematic way of representing this is:

1B 2C 3A 4C 5A 6B
7A 8D 9B 10 D 11 D 12 C

The numbers refer to the experimental units (plots) and the letters represent
the treatments allocated at random.
In a CRD, the only sources of variation are the treatments and random error.
Thus each piece of data (i.e. each measurement) can be thought of as being
made up as follows:
measurement = general mean + treatment effect + random error.
This model is specified symbolically as:

𝑦𝑖𝑗 = 𝜇 + 𝛼𝑖 + 𝜖𝑖𝑗

where:

• $y_{ij}$ is the $j$th observation (replicate) on the $i$th treatment, $i = 1, \dots, k$ and $j = 1, \dots, n_i$;
• $n_i$ is the number of replicates in treatment $i$;
• $\mu$ is the overall population mean;
• $\alpha_i$ is the effect of the $i$th treatment;
• $\epsilon_{ij}$ is the random variation associated with the $ij$th observation.
Note that this is the model used in the synthesis exercise above.
Examples: Write the models for each of the experiments described
below:
1. Four species of midge are being compared for their ability to
survive when inundated with water – part of a project to help
canal estate residents to survive the summer outdoors. The
experiment involves four replicates of each species.
2. A comparison is to be made of the effects of single sex classes on
the learning of Grade 11 students. Five state high schools have
been selected to participate in the study, and in each school one
class of each of males, females and co-ed will be monitored.
3. A study will compare the levels of hydrocarbon in muscle tissues
of individuals at different levels of the food chain. Six levels
have been identified and at each level a random sample of six
individuals will be measured.
4. The average loss due to house thefts is believed to be much
greater in Sydney than in other Australian capital cities. Over
a two year period in each of the eight cities, a random sample
will be taken of the losses reported to insurance companies.
5. A comparison is to be made of the annual incomes of a number
of categories of small businesses. Random samples of businesses
in each of the categories are selected at random from the official
taxation office listings
NOTES:
• For the CRD, the number of replications for the treatments need not be
equal.
• The mean of the 𝑖th treatment is given as: 𝜇 + 𝛼𝑖 .

5.3.3 Statistical Hypotheses in the ANOVA


The reason for comparing several treatment means simultaneously is to see
whether or not there are any differences between all or just some of the means.
Such differences will then be said to indicate differences between the treatments
involved, with respect to the mean values of the variable whose means have been
calculated. Thus the null hypothesis, which again must contain equality, will
be that there is no difference between the population (treatment) means; that
is, the (treatment) means are all equal.

[Check you understand the reason for equality and why the word “treatment”
has been bracketed.]
The null hypothesis is:

𝐻0 ∶ 𝜇 1 = 𝜇 2 = … = 𝜇 𝑘

where
• 𝜇𝑖 is the population mean of the variable for the 𝑖th treatment;
• 𝑘 is the number of treatments.
In English: the k population treatment means are all equal.
The alternative hypothesis is:

𝐻1 ∶ the population treatment means are not all equal.

Expressing 𝐻1 in symbolic form is not straightforward and is unnecessary for


this course.

5.3.4 Sums of Squares


The aim of the analysis of variance technique is to partition the variation, but,
as you have already seen, variation is defined as the ratio of “sum of squares”
to “degrees of freedom”. Recall the earliest measure of variation you have used,
namely:

$$S^2 = \frac{\sum(y - \bar{y})^2}{n - 1} = \frac{\text{sums of squares}}{\text{degrees of freedom}}$$

This represents the TOTAL variation in the data.


What is needed is a breakdown of this total variation into its parts. Each
‘deviate’ (the amount an individual observation differs from the overall mean)
can be broken into the part that is due to the treatment and the ‘left over’ or
random piece. It turns out that the sums of squares (SS) part of the variation
calculation is very important in this regard.
It can be shown that the Total SS breaks down into two components:

Total Sum of Squares = Treatment Sum of Squares + Random Sum of Squares

Working Formulae for Finding Sum of Squares



$$\text{Total SS} = \sum_i \sum_j y_{ij}^2 - \frac{\left(\sum_i \sum_j y_{ij}\right)^2}{n} = \text{Raw SS} - \text{CF}$$

where 𝑛 is the total number of observations – it will be the sum of 𝑛𝑖 , the


number of observations in each treatment.
Raw SS is simply the sum of each squared value – find it by squaring each
observation and then adding up the squares.
CF is the Correction Factor and is the square of the sum of all values divided
by the number of values involved – find it be adding up all the observations,
squaring the total, and then dividing by the number of observations. The CF
is the part of the calculation that corrects for the overall mean.

$$\text{TSS} = \text{Treatment SS} = \sum_i \frac{T_i^2}{n_i} - \text{CF}$$

TSS is found by getting the total of the observations in each treatment (𝑇𝑖 ),
squaring it, and dividing by the number of observations in the treatment; these
values from each treatment are then added together and the correction factor
subtracted from this sum.

ESS = Error Sum of Squares = Residual Sum of Squares = Tot SS - TSS

This Error/Residual Sum of Squares is simply the Unexplained Sum of Squares


and it is usually found by subtraction as indicated.
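As a small sketch of these working formulae in R (the toy data below are invented purely to show the arithmetic; none of the values come from the examples in these notes):

# Toy data: 3 treatments with 4 replicates each (values invented for illustration)
y     <- c(10, 12, 11, 13,  15, 14, 16, 17,  20, 19, 21, 22)
treat <- rep(c("A", "B", "C"), each = 4)

n  <- length(y)
CF <- sum(y)^2 / n                      # correction factor
tot.SS <- sum(y^2) - CF                 # Total SS = Raw SS - CF

treat.totals <- tapply(y, treat, sum)   # treatment totals T_i
treat.n      <- tapply(y, treat, length)
treat.SS     <- sum(treat.totals^2 / treat.n) - CF   # Treatment SS

error.SS <- tot.SS - treat.SS           # Error SS by subtraction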

5.3.5 Degrees of Freedom


As before, total degrees of freedom are given by one less than the total number
of pieces of data.
The degrees of freedom involved in the comparison of the treatment means is
one less than the number of treatments. Think about this from the point of view
of how many independent comparisons can be made. This will be discussed in
lectures.
The error (or residual) degrees of freedom, is the difference between the total
degrees of freedom and the treatment degrees of freedom.
OR

The error d.f. can be obtained by considering the contribution to error made
by each treatment.
Examples:
1. With four treatments, each replicated three times, the degrees of freedom
breakdown will be:

| Source    | DF |
|-----------|----|
| Treatment | 3  |
| Error     | 8  |
| Total     | 11 |

2. Suppose five treatments are replicated as follows:

| Treatment    | A  | B | C | D | E | Total |
|--------------|----|---|---|---|---|-------|
| Replications | 12 | 9 | 8 | 7 | 9 | 45    |

| Source     | DF | Explanation  |
|------------|----|--------------|
| Treatments | 4  | (5-1)        |
| Error      | 40 | (11+8+7+6+8) |
| Total      | 44 | (45-1)       |

5.3.6 Mean Squares


The variance is given by the ratio of a sum of squares to its degrees of freedom.
In general, we define this ratio as: the mean square.
For a completely randomised design, there are two mean squares of interest:
treatment mean square and error mean square.
The treatment mean square is an estimate of the treatment variance and repre-
sents the amount of variation between treatments.
The error mean square is an estimate of the error or residual variance and
represents the amount of random variation within treatments. It is a measure
of the natural variability, and is the best estimate of the population variance,
𝜎2 .
If 𝐻0 is correct and there is no difference between the treatment means, any
variation between treatments must simply be random variation, and the treat-
ment mean square also estimates the population variance, 𝜎2 . Thus both mean
squares (treatment and error) are estimates of the same thing and, as such, they
should be equal, and their ratio will be one.

The ratio of the treatment mean square to the error mean square is called the:
variance ratio. Under 𝐻0 , given all assumptions are true, the variance ratio has
an F-distribution.
At this point, you should think very clearly about 𝐻0 and the F-test. The
hypothesis tested by an F-test is that of equality of variances, but the hypothesis
given for the ANOVA concerns means. How is this apparent anomaly explained?

5.3.7 The Analysis of Variance Table


The calculations involved in the analysis of variance are usually summarised in
an analysis of variance table as follows:

| Source                         | DF  | Sums of Squares | Mean Squares    | Variance Ratio |
|--------------------------------|-----|-----------------|-----------------|----------------|
| Between Treatments (Treatment) | k-1 | TSS             | TMS = TSS/(k-1) | TMS/EMS        |
| Within Treatments (Error)      | n-k | ESS             | EMS = ESS/(n-k) |                |
| Total                          | n-1 | Tot SS          |                 |                |

5.3.8 Test of Hypothesis – The F-Test


The statistical hypotheses which are tested in the analysis of variance were given
above and are:

𝐻0 ∶𝜇1 = 𝜇2 = … = 𝜇𝑘
𝐻1 ∶The treatment means are not all equal

If 𝐻0 is true: variation due to treatments = the natural variation, estimated by


EMS.
If 𝐻0 is false: treatment variation > the natural variation, i.e. TMS > EMS.
𝐻0 true: TMS/EMS = VR = 1.
How do we test whether or not a result which is different from 1 is really differ-
ent???
It can be shown that if the assumptions given below hold and $H_0$ is true then $VR \sim F_{\text{tdf}, \text{edf}}$, where $F$ is a distribution with a known mathematical form, available in table format – see tables. It depends on two types of degrees of freedom: one for the numerator (treatment df) and one for the denominator (error df).
General shape of F is:

plot(seq(0, 10, 0.01), df(seq(0, 10, 0.01), 5, 20), type = "l",
     ylab = "Probability", xlab = "Variance Ratio")

Examples of using F-tables (see previous example on projected ANOVAs):

1. 4 treatments and 3 replicates: treatment df = 3, error df = 8 → $F_{3,8}(0.05) = 4.07$
2. 5 treatments with replicates 12, 9, 8, 7 and 9: treatment df = 4, error df = 40 → $F_{4,40}(0.05) = 2.61$
Worked Example:
Two types of air filter have been compared in order to assess their
effectiveness in reducing particulate discharge into the atmosphere.
Each filter type was run over a period of 10 days and the amount of
particulates discharged measured. These particulate readings were
found to be:

| Filter/Day | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | Total |
|------------|----|----|----|----|----|----|----|----|----|----|-------|
| Standard   | 24 | 28 | 38 | 41 | 32 | 45 | 33 | 39 | 50 | 18 | 348   |
| New        | 15 | 24 | 39 | 44 | 17 | 28 | 18 | 35 | 22 | 20 | 262   |

$$\begin{aligned}
\text{Tot SS} &= 24^2 + 28^2 + \dots + 22^2 + 20^2 - \frac{(24 + 28 + \dots + 22 + 20)^2}{20} \\
&= 20732 - \frac{(610)^2}{20} \\
&= 20732 - 18605 \\
&= 2127
\end{aligned}$$

$$\begin{aligned}
\text{Treat SS} &= \frac{348^2}{10} + \frac{262^2}{10} - \frac{610^2}{20} \\
&= 18974.8 - 18605 \\
&= 369.8
\end{aligned}$$

$$\begin{aligned}
\text{Error SS} &= 2127 - 369.8 \\
&= 1757.2
\end{aligned}$$

Degrees of freedom:
total = 20 - 1 = 19
treatment = 2 - 1 = 1
error = 19 - 1 = 18 (or 2 × (10 − 1))
ANOVA Table:

| Source                      | DF | Sums of Squares | Mean Squares | Variance Ratio |
|-----------------------------|----|-----------------|--------------|----------------|
| Between Treatments (Filter) | 1  | 369.8           | 369.8        | 3.7881         |
| Within Treatments (Error)   | 18 | 1757.2          | 97.622       |                |
| Total                       | 19 | 2127.0          |              |                |

From tables: 𝐹1,18 (0.05) = 4.414 → critical region: VR > 4.414

Calculated VR of 3.7881 < 4.414 so we cannot reject 𝐻0 and con-


clude that there is no evidence to support the research hypothesis
that there is a significant difference between the population mean
particulate levels for the two filters (p > 0.05).

NOTE: If significance level used is 0.10 instead of 0.05 then:


𝐹1,18 (0.10) = 3.007 and we reject 𝐻0 .
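The hand calculations above can be verified with aov() (a sketch; the data are typed in from the table above and the object names are my own):

# Particulate data from the table above
particulates <- c(24, 28, 38, 41, 32, 45, 33, 39, 50, 18,   # Standard filter
                  15, 24, 39, 44, 17, 28, 18, 35, 22, 20)   # New filter
filter.type  <- factor(rep(c("Standard", "New"), each = 10))

# One-way ANOVA comparing the two filters
summary(aov(particulates ~ filter.type))

# Critical values used above
qf(0.95, 1, 18)   # 0.05 level (tabled value 4.414)
qf(0.90, 1, 18)   # 0.10 level (tabled value 3.007)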

5.3.9 Comparison of the F-test in the ANOVA with 1 df for Treatment Versus the Two Sample Independent t-test

Standard Filter New Filter


mean = 34.8 mean = 26.2
std dev = 9.762 std dev = 9.998

9×9.7622 +9×9.9982
𝑆p2 = 10+10−2 = 97.622; 𝑆𝑝 = 9.880.

34.8 − 26.2 8.6


𝑇 = = = 1.946.
9.880 × 1
√ 10 + 1 4.418
10


From ANOVA: EMS = 97.622 & EMS = 9.880 = 𝑆𝑝 from above.

VR = 3.788 & 𝑉 𝑅 = 1.946, the equivalent t test statistic, T.

From Tables: 𝐹1,18 (0.05) = 4.414. 𝑡1 8(0.025) = 2.109 = 4.414.
Note for the tabled values and the sample test statistics, the F is the square of
the equivalent 𝑡 value – 2.1092 = 4.414 and 1.9462 = 3.788.

5.3.10 Worked Example:


A study has been carried out to test the belief that some butterfly species are
more fragile than others. The wing thickness was used as a measure of fragility.
We wish to investigate the differences between the mean wing thicknesses of the
three species.

𝐻0 ∶𝜇1 = 𝜇2 = 𝜇3
𝐻1 ∶The treatment means are not all equal

where 𝜇𝑖 is the population mean wing thickness for the ith butterfly species.
The resulting data have been analysed and the results are presented in tabular
form as:

| Source          | DF | Sums of Squares | Mean Squares | Variance Ratio |
|-----------------|----|-----------------|--------------|----------------|
| Between Species | 2  | 16.7000         | 8.3500       | 17.822         |
| Within Species  | 9  | 4.2167          | 0.4685       |                |
| Total           | 11 | 20.9167         |              |                |

Under 𝐻0 , the variance ratio, 𝑉 𝑅 ∼ 𝐹2,9


From tables, 𝐹2,9 (0.05) = 4.26. Thus the critical region is 𝐹 > 4.26.
The calculated VR of 17.822 lies in the critical region and 𝐻0 is rejected.
We conclude that there is a significant difference (𝑝 < 0.05) between the mean
wing thicknesses of the three butterfly species.
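Although only the summary table is available for this example, the critical value and p-value for the observed variance ratio can be checked in R (a brief sketch):

# Critical value for F(2, 9) at the 0.05 level (matches the tabled 4.26)
qf(0.95, df1 = 2, df2 = 9)

# p-value for the observed variance ratio of 17.822
1 - pf(17.822, df1 = 2, df2 = 9)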
NOTES
• Rejection of 𝐻0 in the ANOVA does not indicate which treatment means
are different. It only specifies that there are some differences somewhere.
Further testing is needed to determine where the differences actually are.
• The sums of squares in the ANOVA are additive, but the mean squares
are not. BEWARE!

5.3.11 Assumptions in the ANOVA Process


The inferences involved in the analysis of variance require the following assump-
tions:
• the error terms must be independent;
• the error terms must each have a mean of zero;
• the error terms must all have the same variance;
• the terms in the model (treatment effects and error term) must be additive.
Depending on the statistical test used in the ANOVA, additional assumptions
may be necessary, eg., if the F-test is used, the variable analysed must have an
approximately normal distribution.
Note that this last assumption is also required in the various forms of the t-test.

5.4 USING THE R SOFTWARE – WEEKS 7/8


Note that this covers the lecture material in weeks 7 and 8
The function that does anova in R is aov().
We could also use the lm() function, but aov() is designed specifically for
analysis of variance, so we will use that. The required code will be demonstrated
with the following example.
An Example – Hangover Antidotes
Certain members of a club, intrigued by suggested antidotes to the consequences
of drinking, decided to investigate the following four well-known antidotes:
• A1 - mashed potato (1 kg)

• A2 - milk (500 ml)


• A3 - raw onion (1) [AKA the “Tony Abbott”]
• A4 - water (500 ml)
Twenty one volunteers were selected with five being assigned at random to each
of A1, A2 A3 and A4. Each volunteer was given the assigned antidote and
then required to drink a prescribed amount of alcohol, an amount kept secret
to protect the innocent. One hour after the volunteer had drunk the alcohol, a
sample of his/her blood was taken and tested for blood alcohol. The results are
given below.

| Antidote/rep | 1   | 2   | 3   | 4   | 5   | Total |
|--------------|-----|-----|-----|-----|-----|-------|
| A1           | 76  | 52  | 92  | 80  | 70  | 370   |
| A2           | 110 | 96  | 74  | 105 | 125 | 510   |
| A3           | 95  | 145 | 100 | 190 | 201 | 731   |
| A4           | 87  | 93  | 91  | 120 | 99  | 490   |

Analyse these data and prepare a report for the club members on the compara-
tive merits of the four antidotes.

$$\begin{aligned}
\text{Tot SS} &= 76^2 + 52^2 + \dots + 120^2 + 99^2 - \frac{(76 + 52 + \dots + 120 + 99)^2}{20} \\
&= 246937 - \frac{(2101)^2}{20} \\
&= 246937 - 220710.05 \\
&= 26226.95
\end{aligned}$$

$$\begin{aligned}
\text{Treat SS} &= \frac{370^2}{5} + \frac{510^2}{5} + \frac{731^2}{5} + \frac{490^2}{5} - \frac{2101^2}{20} \\
&= 234292.2 - 220710.05 \\
&= 13582.15
\end{aligned}$$

$$\begin{aligned}
\text{Error SS} &= 26226.95 - 13582.15 \\
&= 12644.80
\end{aligned}$$

Degrees of freedom:
total = 20 - 1 = 19
treatment = 4 - 1 = 3
error = 19 - 3 = 16 (or 4 × (5 − 1))

ANOVA Table:

| Source                        | DF | Sums of Squares | Mean Squares | Variance Ratio |
|-------------------------------|----|-----------------|--------------|----------------|
| Between Treatments (Antidote) | 3  | 13582.15        | 4527.3833    | 5.73           |
| Within Treatments (Error)     | 16 | 12644.80        | 790.3000     |                |
| Total                         | 19 | 26226.95        |              |                |

From tables: 𝐹3,16 (0.05) = 3.239 → critical region: VR > 3.239


Calculated VR of 5.73 > 3.239 so we reject $H_0$ and conclude that there are significant differences between the population mean blood alcohol levels for the 4 antidotes (p < 0.05).

5.4.1 Hangover Antidotes: ANOVA Using R


The model claimed to explain an individual’s blood alcohol is:

𝑏𝑎𝑖𝑗 = 𝜇 + 𝑎𝑛𝑖 + 𝜖𝑖𝑗 , 𝑖 = 1, … , 4; 𝑗 = 1, … , 5.

In words: the blood alcohol of a particular individual depends on which antidote


he/she has taken (𝑎𝑛) and on some basic individualism (𝜖). The indices, 𝑖 and
𝑗 are used to identify a particular person’s blood alcohol reading by indicating
which antidote the person is taking (𝑖) and then which particular person is
wanted within the group receiving that antidote (𝑗).
The hypotheses being tested by this model are:

𝐻0 ∶𝜇1 = 𝜇2 = 𝜇3 = 𝜇4
𝐻1 ∶The treatment means are not all equal

where 𝜇𝑖 is the population mean blood alcohol for the ith antidote.
First we need to enter the data into R:
ba <- c(76, 52, 92, 80, 70,
110, 96, 74, 105, 125,
95, 145, 100, 190, 201,
87, 93, 91, 120, 99)

antidote <- factor(rep(c("Mashed Potato", "Milk", "Onion", "Water"),


c(5, 5, 5, 5)))

# Put data into a dataset called hangovers:


hangovers <- data.frame(antidote, ba)

#Remove variables to keep workspace clean:


rm(ba, antidote)

The aov() function does anovas in R. The syntax looks like this:
name.the.model <- aov(y.variable ~ factor.variable, data = dataset)
The summary() function provides a summary of the ANOVA and prints out the
ANOVA table.
hangover.model <- aov(ba ~ antidote, data = hangovers)
summary(hangover.model)

## Df Sum Sq Mean Sq F value Pr(>F)


## antidote 3 13582 4527 5.729 0.00736 **
## Residuals 16 12645 790
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the R output we can see that the effect of antidote is significant at the 0.05
level of significance (p-value for antidote is 0.00736 < 0.05). We can therefore
reject the null hypothesis in favour of the alternative and conclude that at least
one antidote has a different mean blood alcohol level.
To find out which means differ, we use the Least Significant Difference, LSD (see
lecture notes for week 8). To obtain the LSD in R, we first need to download
and install an R library called agricolae:
#install.packages("agricolae")
library(agricolae)

You only need to install the package once if you are using your own computer
(if you are using the University computers you will need to install the package
every time you log in).
We can use a library/package we have installed by using the library() function
as above.
The agricolae library contains many R functions, but we are only interested in
one for this course: the LSD.test() function. The syntax to use this function
is:
hangovers.lsd <- LSD.test(hangover.model, "antidote", console = T)

##
## Study: hangover.model ~ "antidote"
##
## LSD t Test for ba
##
## Mean Square Error: 790.3
##

## antidote, means and individual ( 95 %) CI


##
## ba std r LCL UCL Min Max
## Mashed Potato 74.0 14.69694 5 47.34814 100.6519 52 92
## Milk 102.0 18.85471 5 75.34814 128.6519 74 125
## Onion 146.2 49.19045 5 119.54814 172.8519 95 201
## Water 98.0 13.03840 5 71.34814 124.6519 87 120
##
## Alpha: 0.05 ; DF Error: 16
## Critical Value of t: 2.119905
##
## least Significant Difference: 37.69142
##
## Treatments with the same letter are not significantly different.
##
## ba groups
## Onion 146.2 a
## Milk 102.0 b
## Water 98.0 b
## Mashed Potato 74.0 b
From the output of the LSD test and the graph, we can see that Onion has a
significantly higher mean blood alcohol than all the other antidotes (its letter,
“a”, is different to all the others). Milk, water, and mashed potato are not
significantly different to each other (their letters are all “b”, indicating they are
not significantly different to each other).
An overall conclusion to the analysis would therefore be: There is a significant
effect of antidote on mean blood alcohol (p-value = 0.00736). Further testing
using the Least Significant Difference value shows that Onion is associated with
the highest mean blood alcohol (146.2), and that this is significantly higher than
the other three antidotes (p-value < 0.05). The remaining three hangover cures
(milk, water, and mashed potato) are not significantly different to each other in
terms of mean blood alcohol (p-value > 0.05).
In terms of hangover cures, the analysis shows that eating a raw onion (aka
“doing a Tony Abbott”) is significantly worse than drinking milk or water, or
eating mashed potato.

5.4.2 Harvester Example - ANOVA Using R:


The model claimed to explain the percent damage due to the various harvester
types is:

𝑃 𝑒𝑟𝑐𝑒𝑛𝑡𝐷𝑎𝑚𝑎𝑔𝑒𝑖𝑗 = 𝜇 + ℎ𝑎𝑟𝑣𝑒𝑠𝑡𝑒𝑟𝑖 + 𝜖𝑖𝑗 , 𝑖 = 1, … , 4; 𝑗 = 1, … , 5.

The hypotheses being tested by this model are:



𝐻0 ∶𝜇1 = 𝜇2 = 𝜇3 = 𝜇4
𝐻1 ∶The treatment means are not all equal

where 𝜇𝑖 is the population mean percent damage for the ith harvester.
First we need to enter the data into R:
perc.damage <- c(78.65, 95.67, 78.52, 97.74, 79.57,
62.81, 54.69, 45.64, 52.43, 71.66,
45.83, 36.58, 59.92, 42.25, 45.05,
15.89, 35.01, 38.38, 19.82, 40.93)
harv.system <-factor(rep(c("Nil", "CS1", "CS2", "New"), c(5, 5, 5, 5)))
harvester <- data.frame(harv.system, perc.damage)
rm(perc.damage, harv.system)

Now use the aov() function to run the model above in R, and get an output
summary with the summary() function:
harv.model <- aov(perc.damage ~ harv.system, data = harvester)
summary(harv.model)

## Df Sum Sq Mean Sq F value Pr(>F)


## harv.system 3 8379 2793 27.92 1.35e-06 ***
## Residuals 16 1601 100
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see from the output that harvesting system is significant at the 5%
level (p-value = 1.35e-06 < 0.05). Therefore, since we have rejected the null
hypothesis in favour of the alternative that at least one harvester mean differs
from the rest, we need to find exactly which means differ from each other using
the LSD value:
library(agricolae)
harvester.lsd <- LSD.test(harv.model, "harv.system", console = T)

##
## Study: harv.model ~ "harv.system"
##
## LSD t Test for perc.damage
##
## Mean Square Error: 100.0363
##
## harv.system, means and individual ( 95 %) CI
##
## perc.damage std r LCL UCL Min Max
## CS1 57.446 10.036779 5 47.96378 66.92822 45.64 71.66
## CS2 45.926 8.623649 5 36.44378 55.40822 36.58 59.92


## New 30.006 11.374464 5 20.52378 39.48822 15.89 40.93
## Nil 86.030 9.780718 5 76.54778 95.51222 78.52 97.74
##
## Alpha: 0.05 ; DF Error: 16
## Critical Value of t: 2.119905
##
## least Significant Difference: 13.40989
##
## Treatments with the same letter are not significantly different.
##
## perc.damage groups
## Nil 86.030 a
## CS1 57.446 b
## CS2 45.926 b
## New 30.006 c
From the LSD.test() output and the graph we can see that Nil Harvester type
produces the largest mean percent damage (86.03%) and that this is significantly
larger than all the other harvester types. Harvester types CS1 and CS2 are not
significantly different to each other in terms of mean percent damage (57.4%
and 47.9%, respectively), but both produce significantly less mean damage than
Nil, and both produce significantly more damage than the New harvester type.
The new harvester type has the lowest mean percent damage (30%) and this is
significantly lower than all other harvester types.
Chapter 6

Week 8 - Multiple
Treatment Comparisons
and LSD

Outline:
1. Analysis Of Variance Continued
• Multiple Comparisons of Treatment Means
– Introduction
– The Protected (Extended) t test
– Least Significant Differences - LSD
– Worked ANOVA Examples
2. Using R
• See lecture notes in week 7.
Accompanying Workshop - done in week 9
• The analysis of variance process and multiple comparisons of
means - when the ANOVA rejects 𝐻0
Workshop for week 8
• Based on lectures in week 7
Project Requirements for Week 8
• Nil.
Assessment for Week 8
• Your second quiz worth 7% is this week.


6.1 Multiple Comparisons of Treatment Means


6.1.1 Introduction
In the ANOVA, the F-test is used to test the overall hypothesis of equality of
all treatment means.
If 𝐻0 is false, i.e. some difference (or differences) do exist, interest lies in deter-
mining where the differences do occur; which treatment means are different.
A number of tests exist for comparisons of multiple treatment means, the most
common of which is the extended (or protected) t-test. Other tests in common
usage include:
• Tukey’s q
• Student-Neuman-Keull’s Multiple Range Test
• Scheffe’s Test
• Bonferroni
• Duncan’s Multiple Range Test

6.1.2 The Protected (Extended) t-test


The original t-test was designed to compare two treatment means. If this test is
simply extended and used to carry out all possible pairwise comparisons between
more than two treatments, spurious significance may be found simply because so
many of the tests are done - each test may be at a prescribed level of significance
related to its specific type I error probability, but over all the possible tests this
probability of error (the experimentwise error) may be quite different.
To overcome this problem a requirement is imposed that the F-test in the
ANOVA must be significant before any t-tests are carried out. If the overall
test for significance says that there are no significant differences then no further
testing is carried out. The t-test with this conditioning on the outcome of the
F-test is known as the Protected t- test.
Even though significant differences may occur in pairwise t-testing, if the F-test
in the ANOVA is not significant, the null hypothesis that all treatment means
are equal must be accepted.
Providing the F-test is significant, at least two treatment means will be detected
as significantly different when the t-tests are done.
The treatment means are considered in pairs and each pair is tested using the
standard t-test. EXCEPT that: in applying each test, the standard deviation
used is that obtained from the error mean square in the ANOVA.
Similarly, the degrees of freedom appropriate for each individual t-test are the
error degrees of freedom in the ANOVA.

IMPORTANT NOTE
Remember the extended t-test as described above must only be applied after a
significant F-test has been found. This proviso gives a “protection” to the test
to prevent the detection of false significant differences which can arise simply
by comparing the highest and lowest of a number of means.
Example Wing Thickness of Butterflies
In the week 8 notes the following ANOVA was given for the wing
thickness of butterfly species:

| Source | DF | Sums of Squares | Mean Squares | Variance Ratio |
|---|---|---|---|---|
| Between Species | 2 | 16.7000 | 8.3500 | 17.822 |
| Within Species | 9 | 4.2167 | 0.4685 | |
| Total | 11 | 20.9167 | | |

On the basis of the 𝐹 -test, the variance ratio of 17.822, we rejected


the null hypothesis and concluded that the mean wing thicknesses
of the 3 butterfly species were not all the same (p<0.05).
The actual means were:

Species 1 2 3
Mean Wing Thickness 4.67 6.80 7.75
Number of Replicates 3 5 4

Since the F-test was significant in the ANOVA, further pairwise t-testing on the 3 means to isolate the specific differences will be valid.
The general hypotheses will be:

𝐻0 ∶𝜇𝑖 = 𝜇𝑗
𝐻1 ∶𝜇𝑖 ≠ 𝜇𝑗

where 𝜇𝑖 and 𝜇𝑗 are the population mean wing thicknesses for the
two species being compared.
The test statistic for each comparison will be

$$T = \frac{\bar{X}_i - \bar{X}_j}{s\sqrt{\frac{1}{n_i} + \frac{1}{n_j}}}$$

where symbols are as defined in the notes on independent t-testing and $s = \sqrt{\text{EMS}}$ from the ANOVA table.
Under 𝐻0 : 𝑇 ∼ 𝑡9 . From tables, 𝑡9 (0.975) = 2.262.
Note that the degrees of freedom is always 9 in this example (error
df from ANOVA) regardless of which pair of means is being tested.
Thus for each pairwise test the critical region will be: 𝑇 < −2.262
or 𝑇 > 2.262 (alternatively, we can write these two regions as |𝑇 | >
2.262).
(i) Species 1 vs Species 2

$$T = \frac{4.67 - 6.80}{\sqrt{0.4685}\sqrt{\frac{1}{3} + \frac{1}{5}}} = \frac{-2.13}{0.6845 \times \sqrt{0.5333}} = \frac{-2.13}{0.4999} = -4.261.$$

The calculated 𝑇 lies in the critical region and thus we reject 𝐻0 in


favour of 𝐻1 . We conclude that the mean wing thicknesses of species
1 (4.67) and 2 (6.80) are significantly different (𝑝 < 0.05).
(ii) Species 1 vs Species 3

$$T = \frac{4.67 - 7.75}{\sqrt{0.4685}\sqrt{\frac{1}{3} + \frac{1}{4}}} = \frac{-3.08}{0.6845 \times \sqrt{0.5833}} = \frac{-3.08}{0.5228} = -5.891.$$

The calculated 𝑇 lies in the critical region and thus we reject 𝐻0 in


favour of 𝐻1 . We conclude that the mean wing thicknesses of species
1 (4.67) and 3 (7.75) are significantly different (𝑝 < 0.05).
(iii) Species 2 vs Species 3

$$T = \frac{6.80 - 7.75}{\sqrt{0.4685}\sqrt{\frac{1}{5} + \frac{1}{4}}} = \frac{-0.95}{0.6845 \times \sqrt{0.4500}} = \frac{-0.95}{0.4592} = -2.069.$$

The calculated 𝑇 does not lie in the critical region and thus we
cannot reject 𝐻0 . We conclude that there is insufficient evidence
in the data to suggest that the mean wing thicknesses of species 2
(6.80) and 3 (7.75) are significantly different (𝑝 ≥ 0.05).
The overall conclusion is that species 2 and 3 do not differ with re-
spect to mean wing thickness. Species 1 butterflies have, on average,
a mean wing thickness significantly less than that of butterflies of
species 2 and 3 (𝛼 = 0.05).
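These protected t-tests can also be checked quickly in R. The sketch below is only illustrative: the raw wing thickness measurements are not listed here, so it works entirely from the summary values quoted above (the species means, the replicate numbers and the error line of the ANOVA), and the object names are made up.

sp.means <- c(4.67, 6.80, 7.75)   # species means
reps     <- c(3, 5, 4)            # replicates per species
ems      <- 0.4685                # error mean square from the ANOVA
err.df   <- 9                     # error degrees of freedom
s        <- sqrt(ems)

# All pairwise protected t statistics:
prs <- combn(3, 2)
for (k in 1:ncol(prs)) {
  i <- prs[1, k]; j <- prs[2, k]
  T <- (sp.means[i] - sp.means[j]) / (s * sqrt(1/reps[i] + 1/reps[j]))
  cat("Species", i, "vs", j, ": T =", round(T, 3), "\n")
}
qt(0.975, err.df)   # common critical value, approximately 2.262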

6.1.3 Least Significant Differences - LSD


When a number of pairwise 𝑡-tests are carried out following an ANOVA, much
of each calculation is common to all tests. This is especially true if there are
equal replicates for some of the treatments.
The standard deviation, 𝑠, and the critical region are always the same as they use
the ANOVA error results. In the case of equal replications for each treatment,
the standard error of the difference between the two means will be the same for
all pairs of comparisons:

$$SE_{\bar{y}_i - \bar{y}_j} = s\sqrt{\frac{2}{n}}$$

where 𝑠 = EMS from the ANOVA and 𝑛 is the number of (equal) reps in each
treatment.
Instead of evaluating the test statistic for all possible treatment pairs, the test
statistic formula can be rearranged to find the smallest difference that must
exist between any two means for significance to be reached. (Note that this is
similar to the approach taken to find a confidence interval).
Using the traditional level of significance of 0.05, the critical value is: 𝑡𝜈 (0.025)
(these comparisions are always two-tailed). Substituting this in the equation
for the test statistic T, gives:

$$T = \frac{|\bar{y}_i - \bar{y}_j|_{\text{LSD}}}{s\sqrt{\frac{1}{n_i} + \frac{1}{n_j}}} > t_{\nu}(0.025).$$

Rearranging and solving for |𝑦𝑖̄ − 𝑦𝑗̄ |LSD gives:

$$|\bar{y}_i - \bar{y}_j|_{\text{LSD}} > t_{\nu}(0.025) \times s \times \sqrt{\frac{1}{n_i} + \frac{1}{n_j}}$$

where |𝑦𝑖̄ − 𝑦𝑗̄ |LSD is the difference that must exist between means 𝑖 and 𝑗 if the
test statistic is to just reach significance; that is, for 𝑇 to be just larger than
𝑡𝑣 (0.025).
When all treatments have the same number of replicates, say 𝑛, only one LSD
needs to be found for all the pairwise comparisons. If the replicates differ a LSD
value must be calculated for every pair of differing replicates (this situation is
not considered in this course).
The next step is to find all differences between the means and compare them
with the relevant LSD.
Table of Mean Differences

The most efficient way of looking at the differences between the means is to
construct a table of mean differences as follows:

The treatment means are ranked in ascending order across the columns and in descending order down the rows; the body of the table is then the matrix of differences between the treatment means.

Example: Fuel Efficiency


An experiment was carried out to compare the fuel efficiency (mea-
sured as a percentage) of petrol engines using different fuels. Five
fuel types were used, each replicated four times.
The statistical hypotheses are:

𝐻0 ∶𝜇1 = 𝜇2 = 𝜇3 = 𝜇4 = 𝜇5
𝐻1 ∶The mean fuel efficiencies are not the same for all five fuel types

where 𝜇𝑖 is the mean fuel efficiency of fuel type 𝑖.


Analysis of the results gave the following ANOVA table which will
be completed in lectures:

| Source | DF | Sums of Squares | Mean Squares | Variance Ratio |
|---|---|---|---|---|
| Fuel Type | | | 106.8 | 8.26 |
| Error | | | | |
| Total | | 621.15 | | |

Means Table:

Fuel Type A B C D E
Mean Efficiency (%) 93.0 84.3 64.3 92.6 88.4
𝑛𝑖 4 4 4 4 4

From the tables 𝐹4,15 (0.95) = 3.06.


The VR of 8.26 lies in the critical (rejection) region thus we reject
𝐻0 . Therefore conclude that the five fuel types do not all produce
the same efficiency (𝑝 < 0.05).
To determine which of the fuel types differ, t-testing can be carried
out using the LSD.

$$\begin{aligned}
\text{LSD}(5\%, 4, 4) &= t_{15}(0.975) \times s \times \sqrt{\frac{1}{4} + \frac{1}{4}} \\
&= 2.131 \times 3.596 \times 0.7071 \\
&= 5.419.
\end{aligned}$$

Thus any pair of means that differs by at least 5.419 will be signifi-
cantly different at the 0.05 significance level.
Table of Mean Differences:

| Fuel Type ↓ | Mean | C (64.3) | B (84.3) | E (88.4) | D (92.6) | A (93.0) |
|---|---|---|---|---|---|---|
| A | 93.0 | 28.7 | 8.7 | 4.6 | 0.4 | 0 |
| D | 92.6 | 28.3 | 8.3 | 4.2 | 0 | |
| E | 88.4 | 24.1 | 4.1 | 0 | | |
| B | 84.3 | 20.0 | 0 | | | |
| C | 64.3 | 0 | | | | |

The first entry in the table, 28.7, is the difference between the small-
est mean (64.3 for fuel type C) and the largest mean (93.0 for fuel
type A). Any difference value in the table greater than the calculated
LSD = 5.419 indicates a significant difference between those means
at the 0.05 level of significance.
A useful way of presenting the results is as follows:

Significant Differences:

• A > C, B *
• D > C, B *
• E > C *
• B > C *

(where * indicates significance at the 5% level). The general symbols for other levels of significance are: ** for 1% and *** for 0.1%.
Conclusions
The mean efficiency of fuel type C is significantly lower than the
mean efficiency of all the other fuel types (p < 0.05). Fuel types A
and D have greater efficiency on average than do fuel types C and B
(p < 0.05). Fuel types A, D and E appear to have the same efficiency
on average (p > 0.05).
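The LSD above can be verified quickly in R. This is only a sketch working from the summary quantities quoted in the example (the raw efficiency data are not given), and the object names below are illustrative.

mse    <- 3.596^2     # error mean square (s = 3.596 from the ANOVA)
err.df <- 15          # error degrees of freedom
n      <- 4           # replicates per fuel type

lsd <- qt(0.975, err.df) * sqrt(mse) * sqrt(1/n + 1/n)
lsd                   # approximately 5.42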
Example: Harvester Example Revisited
Do this by hand yourself, then check your results using R.
How close are the estimates of treatment means and the standard deviation
to the values we started with - remember we assumed these values and then
constructed (simulated) these data?


harvesting.system <- factor(rep(c("nil", "CS1", "CS2", "new"), each = 5))
observations <- rep(55, 20)
dat <- data.frame(harvesting.system, observations)
dat$observations <- dat$observations + rnorm(20, mean = 0, sd = 10)
sys.effect <- rep(c(35, 5, -5, -35), each = 5)
dat$observations <- dat$observations + sys.effect
dat

## harvesting.system observations
## 1 nil 73.644323
## 2 nil 84.769935
## 3 nil 85.514228
## 4 nil 95.613793
## 5 nil 91.977370
## 6 CS1 52.988987
## 7 CS1 46.922544
## 8 CS1 50.971228
## 9 CS1 53.460526
## 10 CS1 58.795703
## 11 CS2 44.527301
## 12 CS2 59.322388
## 13 CS2 60.097702
## 14 CS2 48.297088
## 15 CS2 50.464486
## 16 new 18.803197
## 17 new 11.836971
## 18 new 2.674373
## 19 new 21.417665
## 20 new 31.253825
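Once you have done the analysis by hand, a quick check in R might look like the sketch below. It assumes the dat data frame simulated above is still in your workspace; because no random seed was set, your numbers will differ from run to run.

# Fit the one-way ANOVA to the simulated data and summarise it:
sim.model <- aov(observations ~ harvesting.system, data = dat)
summary(sim.model)

# Follow up with the LSD, as in the harvester example:
library(agricolae)
LSD.test(sim.model, "harvesting.system", console = TRUE)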
Example: Growth Curves - Marine Birds
An ESC researcher is studying the growth rates of young marine birds on the
Great Barrier Reef. The growth curve of these birds is known to be logistic in
nature with a functional form as follows:

$$W = \frac{K}{1 + \exp(-r(t - t_m))},$$
where:
• 𝑊 is the weight in grams of the individual bird at time 𝑡 days;
• 𝐾 is the asymptotic weight of the individual bird (its adult weight);
• 𝑟 is the growth constant for the individual bird;

• 𝑡 is the time from birth in days;


• 𝑡𝑚 is the time in days to reach a weight of 𝐾/2 grams.
It is believed that different species have different growth patterns which are
reflected in different values for the coefficients, 𝐾, 𝑟 and 𝑡𝑚 . The growth curves
of six individual birds from each of three species were studied. The breeds
involved and the resulting coefficients are given below.

Bridled Tern Black Noddy Crested Tern


𝐾 𝑡𝑚 𝑅 𝐾 𝑡𝑚 𝑅 𝐾 𝑡𝑚 𝑅
105.0 9.7 0.1594 96.4 8.3 0.2171 280.4 12.8 0.1474
114.0 15.5 0.1193 103.4 7.1 0.1751 290.5 13.9 0.1447
124.7 14.9 0.1087 104.4 7.6 0.1707 291.5 17.2 0.1453
127.1 13.0 0.1097 115.2 10.1 0.1346 290.7 13.8 0.1342
127.7 13.7 0.0920 116.7 9.9 0.1672 292.2 12.2 0.1193
130.9 15.1 0.1242 117.7 10.1 0.1808 326.4 19.5 0.0821

Analyse these data to determine in what way (if any) the growth patterns differ
between the three species. Initially consider 𝑡𝑚 , the time to reach half the adult
weight.
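As a starting point, a sketch of the suggested first analysis is given below: a one-way ANOVA of $t_m$ for the three species, entering the $t_m$ columns of the table directly (object names are illustrative).

tm <- c(9.7, 15.5, 14.9, 13.0, 13.7, 15.1,    # Bridled Tern
        8.3, 7.1, 7.6, 10.1, 9.9, 10.1,       # Black Noddy
        12.8, 13.9, 17.2, 13.8, 12.2, 19.5)   # Crested Tern
species <- factor(rep(c("Bridled Tern", "Black Noddy", "Crested Tern"), each = 6))
birds <- data.frame(species, tm)

birds.model <- aov(tm ~ species, data = birds)
summary(birds.model)

# If the F-test is significant, follow up with the LSD as in the examples above:
# library(agricolae); LSD.test(birds.model, "species", console = TRUE)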

6.2 Using R
Refer to R section in week 7 lecture notes.
Chapter 7

Week 9 - Factorial ANOVA

Outline:
1. Analysis Of Variance Continued
• Treatment Designs: Factorial ANOVA
– The Treatment Design Concept
– The Factorial Alternative - How to Waste resources??
– What is this ‘Interaction’??
– The Factorial Model
– Factorial Effects
– The Factorial ANOVA
– Partitioning Treatment Sums of Squares Into Factorial
Components
– Factorial Effects and Hypotheses
– Testing and Interpreting Factorial Effects
2. Using R
• Using R for Factorial ANOVA
Accompanying Workshop - done in week 10
• Defining factorial treatments.
• The analysis of variance for factorial treatment designs – by
hand and using R
Workshop for week 9
• Based on lectures in week 8
• Response to marker’s critique. Designed to help you improve
in your assignment 2.
Project Requirements for Week 9


• Assignment 2 is available on Learning@Griffith from this week


(week 9). It is due in week 11.

7.1 Treatment Designs: Factorial ANOVA


7.1.1 The Treatment Design Concept
The treatments involved in a study often involve several factors. Interest will
then lie in the effects of each of the factors and in the possible interaction
between the factors.
Example: Environmental Monitoring
An environmental monitoring system treatment may consist of two
components: the machine used and the degree of instructions given
to the operator. Suppose these are:
• 4 degrees of instruction - 0, small (S), standard (M) and extra
(E)
• 3 types of machine - A, B and C
The treatments will be made up of all combinations of the two
factors.

| Treatment | Degree of Instruction | Type of Machine |
|---|---|---|
| 1 | 0 Instruction | Machine A |
| 2 | | |
| 3 | | |
| 4 | | |
| 5 | | |
| 6 | | |
| 7 | | |
| 8 | | |
| 9 | | |
| 10 | | |
| 11 | | |
| 12 | | |

The different possible values for a factor are known as its levels. In the example,
degree of instruction has 4 levels and type of machine has 3 levels. The total
number of treatments will be the product of the levels of the factors involved:
4 × 3 = 12.
Standard Order - ABC
When trying to identify all treatments from a number of factors, it is useful to
have some standard approach – this avoids missing treatments along the way!!

A commonly used approach is known as the ‘standard ABC order’ in which the
levels of the first factors (A and B) are held constant while the levels of the
outermost factor (C) are allowed to vary. Once all levels of C have been varied
for a particular combination of A and B, the level of the next factor, B, is varied
and C again moves through its possible levels. Eventually factor B will reach its
last level together with the last level of C, only then will the level of A change.
The process with factors B and C then repeats but with the second level of A.
Assuming there are ‘a’ levels of A, ‘b’ levels of B and ‘c’ levels of C, the 𝑎 × 𝑏 × 𝑐
treatments can be written schematically as: A1B1C1, A1B1C2, …, A1B1Cc,
A1B2C1, A1B2C2, …, A1B2Cc, …, A1BbCc, A2B1C1, A2B1C2, …, A2B1Cc,
A2B2C1, A2B2C2, …, A2BbCc, …, AaBbCc.
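A quick way to list all treatment combinations in R is the expand.grid() function. The sketch below uses illustrative numbers of levels; note that expand.grid() varies its first argument fastest, so listing C first reproduces the standard ABC order described above.

A <- paste0("A", 1:2)          # a = 2 levels (illustrative)
B <- paste0("B", 1:2)          # b = 2 levels
C <- paste0("C", 1:3)          # c = 3 levels

treatments <- expand.grid(C = C, B = B, A = A)   # C changes fastest, then B, then A
treatments[, c("A", "B", "C")]                   # 2 x 2 x 3 = 12 treatment combinations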
Example: Air Pollution, Land Usage, and Location
Three land uses (rural, residential and national park)
Two states (Qld and NSW)

7.1.2 The Factorial Alternative - How to Waste resources?


The alternative to the factorial design in the environmental monitoring example
above would involve several experiments.
• instruction level 0: 3 machine types, each replicated a minimum of 5 →
15 units
• instruction level S: 3 machine types, each replicated a minimum of 5 →
15 units
• instruction level M: 3 machine types, each replicated a minimum of 5 →
15 units
• instruction level E: 3 machine types, each replicated a minimum of 5 →
15 units
• machine type A: 4 instruction levels, each replicated a minimum of 4 →
16 units
• machine type B: 4 instruction levels, each replicated a minimum of 4 →
16 units
• machine type C: 4 instruction levels, each replicated a minimum of 4 →
16 units
A total of seven separate experiments, using a total of (4 × 15) + (3 × 16) = 108 experimental units.
AND still no quantitative measure of the interactive effect of the degree of
instruction and the machine type.
When all possible combinations are used the design is called a complete fac-
torial.

Incomplete factorials are a valuable design; can save resources – but not covered
here.

7.1.3 What is this ‘Interaction’??


The best way to understand the ‘interaction’ concept is to look at an example.
Example: Chemicals and Weed Control
A controlled laboratory experiment has been carried out to look at
the effects of four different chemicals on the control of a noxious weed.
Since it was unknown whether the chemical effect would depend on
the stage of development of the plant at the time of application, some
plants were treated in an early stage of growth and others later in
their growing process.
Each chemical was applied to six experimental plants, three in early
growth stage and three in late growth, giving a total of 24 exper-
imental plants. Each chemical was applied at the concentration
recommended by the manufacturers. For confidentiality purposes,
the chemicals are known simply as A, B, C and D.
The variable measured was dry matter production at six weeks of
age. The results for the 24 individuals are given below.

| Time of Application | A | B | C | D | Time of Application Total |
|---|---|---|---|---|---|
| Early | 5.9, 4.7, 5.6 (16.2) | 2.6, 3.7, 3.4 (9.7) | 4.7, 4.3, 3.8 (12.8) | 2.0, 2.4, 1.9 (6.3) | 45.0 |
| Late | 2.5, 0.6, 1.0 (4.1) | 2.7, 2.3, 2.9 (7.9) | 3.1, 4.1, 2.8 (10.0) | 5.9, 4.6, 5.5 (16.0) | 38.0 |
| Chemical Total | 20.3 | 17.6 | 22.8 | 22.3 | 83.0 |

(The three replicate values are listed in each cell, with the chemical-by-time total in parentheses.)

There are 8 different sets of three individuals – representing 8 different treatments. The two way table of means identifies the mean values for these 8 sets:

| Time of Application | A | B | C | D | Time Means |
|---|---|---|---|---|---|
| Early | 5.4 | 3.2 | 4.3 | 2.1 | 3.8 |
| Late | 1.4 | 2.6 | 3.3 | 5.3 | 3.2 |
| Chemical Means | 3.4 | 2.9 | 3.8 | 3.7 | Grand Mean 3.5 |

Looking at the means for the different chemicals for the two times
and overall gives:

drymatter<- c(5.9, 4.7, 5.6, 2.5, 0.6, 1,


2.6, 3.7, 3.4, 2.7, 2.3, 2.9,
4.7, 4.3, 3.8, 3.1, 4.1, 2.8,
2, 2.4, 1.9, 5.9, 4.6, 5.5)

time <- factor(rep(rep(c("Early", "Late"), c(3, 3)), 4))


chemical <- factor(rep(LETTERS[1:4], rep(6, 4)))

# Put variables in a dataframe and clean up:


weeds <- data.frame(drymatter, time, chemical)
rm(drymatter, time, chemical)

attach(weeds)
interaction.plot(chemical, time, drymatter, type = "l", col = c("blue", "red"),
                 main = "Interaction Plot for Weed Control Example")

(Figure: interaction plot of mean dry matter against chemical, with separate lines for the early and late application times.)
detach()

The differences between the chemicals clearly depend on whether


the application occurred during the early or late stage of growth. If
application is in the early growth stage, chemical D appears to be
more effective as it produces much less dry matter of weed; but if
application is in the later growth stage it is chemical A which is more
effective, and chemical D is the worst of the chemicals (weeds treated
with D have a high average production). The effect of chemicals
interacts with the time of application – the effect of the chemicals
depends on which time of application is being considered.

What about the question of whether it is better to apply the chemical


in the early stages of growth or hold off until a later stage?

7.1.4 The Factorial Model


Factorial Model
When more than one factor is involved the model used to describe the dependent
variable will involve terms for each factor and also terms for possible interactions
between the various factors. The following general model would apply for two
factors, say A and B, with 𝑎 and 𝑏 levels, respectively:

𝑦𝑖𝑗𝑘 = 𝜇 + 𝛼𝑖 + 𝛽𝑗 + 𝛼𝛽𝑖𝑗 + 𝜖𝑖𝑗𝑘 ,

where
• 𝑦𝑖𝑗𝑘 is the observation on the 𝑘th replicate receiving the 𝑖th level of A and
the 𝑗th level of B, 𝑘 = 1, … , 𝑛;
• 𝑛 is the number of replicates for each of the 𝑎 × 𝑏 treatment groups;
• 𝜇 is the overall grand mean of y;
• 𝛼𝑖 is the effect of the 𝑖th level of factor A, 𝑖 = 1, … , 𝑎;
• 𝛽𝑗 is the effect of the 𝑗th level of factor B, 𝑗 = 1, … , 𝑏;
• 𝛼𝛽𝑖𝑗 is the effect of the interaction between factor A and B;
• 𝜖𝑖𝑗𝑘 is the random effect attributable to the ijkth individual observation.
Notes
• The index, 𝑘, counts the replicates for a particular treatment group and
goes from 1 to n. (There are 𝑎 × 𝑏 treatment groups, each replicated n
times).
• Each level of factor A is replicated not only by each group of n, but also
across the levels of factor B – overall each level of A has 𝑏 × 𝑛 replicates.
Similarly, levels of factor B gain replication because they occur at the
different levels of factor A – overall each level of B has 𝑎 × 𝑛 replicates. It
is the replication of one factor’s levels across the levels of the other factor
that gives the factorial treatment design its power. As long as everything
is balanced, and all levels of one factor are represented equally in all levels
of every other factor, then any effect due to one factor is ‘evened’ out when
another factor’s levels are being compared.
Preliminary Model
It is useful to consider an initial model which contains a single treatment term
encompassing all the factorial effects (both main and interaction). Such a model
is known as a preliminary model. One of the advantages of this model is to
ascertain the sources of variation which will be unexplained (error) and the
sources that will form part of the experimental design (for example, blocking).
This will be discussed further in courses in later years.

The preliminary model for the general two-factor model above is:

𝑦𝑖𝑗𝑘 = 𝜇 + treat𝑖𝑗 + 𝜖𝑖𝑗𝑘

where
• 𝑦𝑖𝑗𝑘 , 𝜇, and 𝜖𝑖𝑗𝑘 are as previously defined;
• treat𝑖𝑗 is the effect of the 𝑖𝑗th treatment group (made up from the factors
A and B), 𝑖 = 1, … , 𝑎 and 𝑗 = 1, … , 𝑏.

7.1.5 Factorial Effects


When the treatment effect in the preliminary model is partitioned to provide
the factorial components of the treatment, two types of terms appear – main
effect terms and interaction terms.
A main effect represents the effect of the identified factor on average across
all levels of any other factors.
An interaction effect is a measure of any additional contribution arising
from the interaction of one factor with another (or others). If two factors act
independently of each other their interaction will be zero; the contribution made
by any one of them in the make-up of the observation is through the values of
its individual levels, regardless of the level of the other factor.

7.1.6 The Factorial ANOVA


The PROJECTED ANOVA
A projected ANOVA is a very useful design tool. It consists of the source of
variation together with the degrees of freedom associated with each source. Data
from an experiment, which has been designed using a projected ANOVA, will
fall automatically into an appropriate analysis. In the factorial ANOVA sources
of variation will come from main effects and interaction effects, and degrees
of freedom will be needed for each type of effect.
• For main effects the degrees of freedom are those that would apply if the
factor were a treatment – that is, the degrees of freedom will be one less
than the number of levels.
• For interaction effects the degrees of freedom are the product of the
degrees of freedom of the factors in the interaction.
Preliminary Projected ANOVA (Balanced):

| Source of Variation | Degrees of Freedom |
|---|---|
| Treatments | $ab - 1$ |
| Error | $ab(n - 1)$ |
| Total | $abn - 1$ |

Factorial Projected ANOVA (Balanced):

| Source of Variation | Degrees of Freedom |
|---|---|
| Main Effect A | $a - 1$ |
| Main Effect B | $b - 1$ |
| Interaction A and B | $(a - 1) \times (b - 1)$ |
| Error | $ab(n - 1)$ |
| Total | $abn - 1$ |

Example: Chemicals and Weed Control (Contd.)


Recall the previous example where the effects of four different chemicals on the control of a noxious weed were studied. Each chemical
will be applied at a standard concentration at two different times of
growth, early and late. Forming a complete factorial will require 8
treatments: 4 chemicals × 2 times. There were 3 replicates within
each treatment. The variable measured is dry matter production
(production).
Preliminary Model:

production𝑖𝑗𝑘 = 𝜇 + treat𝑖𝑗 + 𝜖𝑖𝑗𝑘

Factorial Model:

production𝑖𝑗𝑘 = 𝜇 + Ch𝑖 + apptime𝑗 + (Ch × apptime)𝑖𝑗 + 𝜖𝑖𝑗𝑘 ,

where

• production𝑖𝑗𝑘 is the observed dry matter production on the 𝑘th replicate of the 𝑖th chemical applied at the 𝑗th application time;
• Ch𝑖 is the effect of the 𝑖th chemical, 𝑖 = 1, … , 4;
• apptime𝑗 is the effect of the 𝑗th application time, 𝑗 = 1, 2;
• (Ch × apptime)𝑖𝑗 is the effect of the interaction between chemical and application time;
• 𝜖𝑖𝑗𝑘 is the unexplained component of the 𝑖𝑗𝑘th observation, 𝑘 = 1, 2, 3.
In words, the model says, the amount of dry matter production in the
ijkth replicate (observation) is due to the 𝑖th chemical, the 𝑗th time
of application, the possible two way interaction between these two
factors and any inherent variation associated with that particular
replicate (observation).

Projected ANOVAs (to be completed in lectures)


Preliminary:

Source of Variation Degrees of Freedom


Treatments
Error
Total

Factorial:

Source of Variation Degrees of Freedom


Chemical
Applicaction Time
Ch × AppTime
Error
Total

7.1.7 Partitioning Treatment Sums of Squares Into Factorial Components
The sums of squares for the factorial ANOVA are found using a two stage
approach which parallels the projected ANOVA process. For a case of two
factors the process is as follows.
• Stage 1:
– The SS for the treatments are found as usual – this value is the pre-
liminary treatment sums of squares which must then be partitioned
into its component parts as determined by the factorial design.
• Stage 2 A:
– For each main effect, the sums of squares is found by regarding the
factor as a treatment – the total for each level of the factor is squared
and divided by the number of observations in that level (note that
any other factors are completely ignored at this stage); these modified
totals are then added and the Correction Factor is subtracted, as for
the calculation of any treatment sums of squares.
• Stage 2 B:
– The interaction sum of squares is found by subtracting the main effect
sums of squares from the preliminary treatment sum of squares.
Example: Chemicals and Weed Control (contd):
Return to the example on the chemicals and weed control. The
Total Sum of Squares, Treatment Sum of Squares and Residual Sum
of Squares are calculated in the usual way as:

$$\text{Total SS} = \sum_i \sum_j y_{ij}^2 - \frac{\left(\sum_i \sum_j y_{ij}\right)^2}{n} = 50.6983$$

$$\text{TSS} = \text{Treatment SS} = \sum_i \frac{T_i^2}{n_i} - \text{CF} = 44.7183$$

$$\text{ESS} = \text{Residual Sum of Squares} = \text{Total SS} - \text{TSS} = 5.9800$$

$$\begin{aligned}
\text{Chemical SS} &= \frac{(\text{Total A})^2}{3+3} + \frac{(\text{Total B})^2}{3+3} + \frac{(\text{Total C})^2}{3+3} + \frac{(\text{Total D})^2}{3+3} - \frac{(\text{Total A} + \text{Total B} + \text{Total C} + \text{Total D})^2}{4 \times (3+3)} \\
&= \frac{20.3^2}{6} + \frac{17.6^2}{6} + \frac{22.8^2}{6} + \frac{22.3^2}{6} - \frac{(20.3 + 17.6 + 22.8 + 22.3)^2}{24} \\
&= 2.7883.
\end{aligned}$$

$$\begin{aligned}
\text{Time SS} &= \frac{(\text{Total Early})^2}{3+3+3+3} + \frac{(\text{Total Late})^2}{3+3+3+3} - \frac{(\text{Total Early} + \text{Total Late})^2}{2 \times (3+3+3+3)} \\
&= \frac{45.0^2}{12} + \frac{38.0^2}{12} - \frac{(45.0 + 38.0)^2}{24} \\
&= 2.0417.
\end{aligned}$$

$$\begin{aligned}
\text{Interaction SS} &= \text{Treatment SS} - \text{Chemical SS} - \text{Time SS} \\
&= 44.7183 - 2.7883 - 2.0417 \\
&= 39.8883.
\end{aligned}$$
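A rough numerical check of these hand calculations can be done in R. This sketch assumes the weeds data frame created earlier in the chapter is still loaded.

totals <- with(weeds, tapply(drymatter, list(time, chemical), sum))
grand  <- sum(weeds$drymatter)
cf     <- grand^2 / nrow(weeds)                 # correction factor

chem.ss  <- sum(colSums(totals)^2 / 6)  - cf    # 2.7883
time.ss  <- sum(rowSums(totals)^2 / 12) - cf    # 2.0417
treat.ss <- sum(totals^2 / 3)           - cf    # 44.7183
int.ss   <- treat.ss - chem.ss - time.ss        # 39.8883

c(chem.ss, time.ss, treat.ss, int.ss)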

The ANOVAs are:


Preliminary:

Source DF SS MS VR
Treatments 7 44.7183 6.3883 17.09
Error 16 5.9800 0.3738
Total 23 50.6983

Factorial:

Source DF SS MS VR
Chemical 3 2.7883 0.92944 2.49
Time 1 2.0417 2.04167 5.46
Chemical × Time 3 39.8883 13.29611 35.57
Error 16 5.9800 0.37375
Total 23 50.6983

7.1.8 Factorial Effects and Hypotheses


Factorial Effects

Each line of the Factorial ANOVA resulting from the partitioned treatment
source of variation, provides a test of some hypothesis. This means that a
factorial ANOVA with 2 main effects will need a null and alternative hypothesis
for EACH of the main effects and for the interaction, i.e. 3 null and 3 alternative
hypotheses will be needed.

A main effect line provides a test of the equality of means for the particular
factor, where the mean for each level is computed across all levels of the other
factors. It describes the effect of the factor when it is averaged across all levels
of the other factors. In the above example ‘chemicals’ and ‘time appln’ are the
2 main effects.

An interaction line tests whether or not the two (or more) factors interact
with each other; are the differences between the levels of one factor the same
regardless of which level of the other factor is considered OR do the differences
change, depending on which level of the other factor is being considered?

Hypotheses for Chemicals and Weed Control example:

Chemical Main Effect

𝐻0 ∶ (main effect 1): 𝐻1 ∶ (main effect 1):

Time of Application Main Effect

𝐻0 ∶ (main effect 2): 𝐻1 ∶ (main effect 2):

Interaction Between Chemical and Time of Application

𝐻0 ∶ (interaction): 𝐻1 ∶ (interaction):

CRITICAL NOTE

If an interaction is significant then the main effects for the factors in


the interaction should NOT be considered in isolation.

7.1.9 Testing and Interpreting Factorial Effects


All tests are carried out by considering the appropriate variance ratio and ap-
plying it to an F-test.

• Step 1:

– Consider the interactions working from the highest order down.

• Step 2 A:

– If an interaction is significant, the factor means within it must be in-


terpreted using the two-way (or three-way, etc) table of means. Each
factor must be considered separately for each level of the other fac-
tor(s) - the example will clarify this.

• Step 2 B:

– If a main effect is not involved in a significant interaction, then it is


interpreted in the same way that a treatment effect would be inter-
preted.

Example Chemicals and Weed Control (contd. and completed):

Return to the previous example and assume that the required assumptions of independence, homogeneity of variances, additivity of model terms and normality of the variable, dry matter production, are all valid.

Interaction

𝐻0 : there is no interaction between chemicals and time of application


– the effects of chemicals and time of application are independent.

𝐻1 : there is an interaction between chemicals and time of application


– the effects of chemicals depend on the level of time of application,
and vice versa.

Under 𝐻0 , the 𝑉 𝑅 ∼ 𝐹3,16 . From tables, 𝐹3,16 (0.95) = 3.24. The


critical region is 𝑉 𝑅 ≥ 3.24.

Calculated VR of 35.57 lies in critical region therefore 𝐻0 is rejected


and we conclude that there is a significant interaction between chem-
icals and time of application (𝑝 < 0.05). Further testing should be
carried out on the two way table of means and not on the main effect
means. This means we do not comment on the significance of the
main effects in the ANOVA table or interpret them.

Interpretation of two way table of means (used to further test the significant interaction)

Find the least significant difference needed between any two of the
8 treatment means for the difference to be significant.
Standard deviation:
best estimate is square root of error mean square in the ANOVA: sd
= 0.61135
Number of replicates in each treatment:
for the two-way (treatment) means there are 3 replicates
Standard error of difference between 2 means:

$$\begin{aligned}
SE_{\bar{y}_i - \bar{y}_j} &= s \times \sqrt{\frac{1}{n_i} + \frac{1}{n_j}} \\
&= 0.61135 \times \sqrt{\frac{2}{3}} \\
&= 0.499
\end{aligned}$$

LSD(0.05) = 𝑡16 (0.975) × SEdiff means = 2.12 × 0.499 = 1.058


The means need to be 1.06 units apart before the treatment means
are said to be significantly different at the 5% level of significance.
Look at the graph of the means (the interaction plot) in section 7.1.3.
Considering the means for the chemicals (to be completed
in lectures):
Early Application:
Late Application:
Considering the means for the time of application (to be
completed in lectures):
Chemical A:
Chemical B:
Chemical C:
Chemical D:
Conclusions(to be completed in lectures):

7.2 Using R for Factorial ANOVA


Chemical and Time of Application Example:

drymatter<- c(5.9, 4.7, 5.6, 2.5, 0.6, 1,


2.6, 3.7, 3.4, 2.7, 2.3, 2.9,
4.7, 4.3, 3.8, 3.1, 4.1, 2.8,
2, 2.4, 1.9, 5.9, 4.6, 5.5)

time <- factor(rep(rep(c("Early", "Late"), c(3, 3)), 4))


chemical <- factor(rep(LETTERS[1:4], rep(6, 4)))

# Put variables in a dataframe and clean up:


weeds <- data.frame(drymatter, time, chemical)
rm(drymatter, time, chemical)

weeds.aov <- aov(drymatter ~ time*chemical, data = weeds)


summary(weeds.aov)

## Df Sum Sq Mean Sq F value Pr(>F)


## time 1 2.04 2.042 5.463 0.0328 *
## chemical 3 2.79 0.929 2.487 0.0977 .
## time:chemical 3 39.89 13.296 35.575 2.62e-07 ***
## Residuals 16 5.98 0.374
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We must examine the interaction hypothesis first. We can see from the R output
that there is a significant chemical by time interaction (p = 2.62e-07 < 0.05).
That is, we reject the null hypothesis in favour of the alternative.
Since there is a significant interaction we cannot look at the main effects of
chemical or time. Our interpretation must account for the interaction between
the two factors. That is, the LSD value must take into account the interaction.
We can do this in R as follows:
library(agricolae)
weeds.interaction.LSD <- LSD.test(weeds.aov, c("time", "chemical"), console = T)

##
## Study: weeds.aov ~ c("time", "chemical")
##
## LSD t Test for drymatter
##
## Mean Square Error: 0.37375
##
## time:chemical, means and individual ( 95 %) CI
##
## drymatter std r LCL UCL Min Max
## Early:A 5.400000 0.6244998 3 4.6517505 6.148249 4.7 5.9
## Early:B 3.233333 0.5686241 3 2.4850838 3.981583 2.6 3.7
## Early:C 4.266667 0.4509250 3 3.5184172 5.014916 3.8 4.7


## Early:D 2.100000 0.2645751 3 1.3517505 2.848249 1.9 2.4
## Late:A 1.366667 1.0016653 3 0.6184172 2.114916 0.6 2.5
## Late:B 2.633333 0.3055050 3 1.8850838 3.381583 2.3 2.9
## Late:C 3.333333 0.6806859 3 2.5850838 4.081583 2.8 4.1
## Late:D 5.333333 0.6658328 3 4.5850838 6.081583 4.6 5.9
##
## Alpha: 0.05 ; DF Error: 16
## Critical Value of t: 2.119905
##
## least Significant Difference: 1.058185
##
## Treatments with the same letter are not significantly different.
##
## drymatter groups
## Early:A 5.400000 a
## Late:D 5.333333 a
## Early:C 4.266667 b
## Late:C 3.333333 bc
## Early:B 3.233333 bc
## Late:B 2.633333 cd
## Early:D 2.100000 de
## Late:A 1.366667 e
plot(weeds.interaction.LSD)

(Figure: "Groups and Range" bar plot of the eight time-by-chemical treatment means with their LSD grouping letters, produced by plot() on the LSD.test object.)



# Do an interaction plot to help interpret the results:


attach(weeds)
interaction.plot(chemical, time, drymatter, type = "l",
col = c("blue", "red"),
main = "Interaction Plot for Weed Control Example")

(Figure: interaction plot of mean dry matter against chemical, with separate lines for the early and late application times.)
detach()
Chapter 8

Week 10/11 - Correlation


and Simple Linear
Regression

Outline:
1. BIVARIATE STATISTICAL METHODS
• Introduction
• Covariance
• The Correlation Coefficient – Pearson’s Product Moment
Coefficient
2. Regression Analysis
• Assumptions Underlying The Regression Model
• Simple Linear Regression
• Estimating A Simple Regression Model
• Evaluating The Model
• The Coefficient of Determination (𝑅2 )
• The Standard Error (Root Mean Squared Error) of the
Regression, 𝑆𝜖
• Testing The Significance Of The Independent Variable
• Testing The Overall Significance Of The Model
• Testing The Overall Significance Of The Model
• Functional Forms Of The Regression Model
3. Using R
• Using R for Correlation and Simple Linear Regression Mod-
elling


Accompanying Workshop - done in week 12


• Regression Examples
Workshop for week 10/11
• Based on lectures in week 9/10
• Help with Assignment 2
Project Requirements for Week 10
• Assignment 2 is available on Learning@Griffith. It is due in
week 12.

8.1 BIVARIATE STATISTICAL METHODS


8.1.1 Introduction
When research involves more than one measurement on each experimental unit
the resulting data are said to be multivariate, there is more than one variable.
The simplest case occurs when there are just two variables – bivariate data.
With multivariate data two situations may occur:
• the different variables are independent of each other;
• the different variables depend on each other.
When there are dependencies between the variables, any analysis of a single
variable should take into account the impact of the other variable(s). Standard
univariate methods of analysis may lead to misleading results. A second area
of difference arises in considering whether or not there is a causal relationship
between the two variables. If there is expected to be a causal relationship –
one variable drives or predicts the other – then the correct statistical analysis
will be regression analysis. However, when there is no reason to suppose any
cause and effect between the two variables, the appropriate statistical analysis
is correlation.

8.1.1.1 Covariance
The dispersion or variation in a variable is usually measured by the variance
or its square root, standard deviation. Recall the definition of the population
variance for a single variable, say X:

$$Var(X) = \frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2$$

The sample variance is given by:



$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$$

A measure of the way in which two variables vary together is given by the
covariance which is defined as:

$$Cov(X_1, X_2) = \frac{1}{n}\sum_{i=1}^{n}(X_{1i} - \mu_1)(X_{2i} - \mu_2)$$

with the sample covariance being:

$$Cov(X_1, X_2) = \frac{1}{n-1}\sum_{i=1}^{n}(X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)$$

One of the classical bivariate situations involves a bivariate normal – the two
variables have a joint normal distribution. The theory for this is beyond the
scope of this course but the following illustrates the graph of a bivariate normal
distribution.
# install.packages(c("mvtnorm", "plotly"))

library(mvtnorm)
library(plotly)

x <- seq(-3, 3, 0.1)


x <- expand.grid(x, x)

test <- dmvnorm(x)

graph <- plot_ly(x = ~ x$Var1, y = ~ x$Var2, z = ~ test, type = 'mesh3d')

graph <- graph %>% layout(


title = "Interactive 3D Mesh Plot: Standard Bivariate Normal Density",
scene = list(
xaxis = list(title = "X1"),
yaxis = list(title = "X2"),
zaxis = list(title = "Prob")
))

graph

The bivariate observations, (𝑥1𝑖 , 𝑥2𝑖 ) are said to be 𝑁 ((𝜇1 , 𝜇2 ), (𝜎1 , 𝜎2 , 𝜌12 ))
where 𝜌12 is a function of the covariance (known as the correlation) between the
two variables.

8.1.1.2 The Correlation Coefficient – Pearson’s Product Moment Coefficient
The covariance is a measure which depends on the scale of the measurements
and as such does not provide any idea of how related the two variables really
are. Instead, a measure is needed which is unitless and which does provide a
general idea of the degree of dependence.
One such measure is the Correlation Coefficient. The most well-known corre-
lation coefficient is the one developed by Karl Pearson in 1896 - the Pearson
Product Moment Correlation Coefficient. It is usually symbolised by the Greek
letter ‘rho’ (𝜌) and it achieves its standardisation by dividing the covariance by
the product of the standard deviations of the two variables. Note that we use
the square roots of the variances so that the unitless nature of the coefficient is
achieved – the covariance involves each of the original units whereas the variance
by definition involves squares of the original units.
Population Correlation Coefficient:

$$\rho_{x_1 x_2} = \frac{Cov(X_1, X_2)}{\sqrt{Var(X_1)\,Var(X_2)}} = \frac{\sum_{i=1}^{n}(X_{1i} - \mu_1)(X_{2i} - \mu_2)}{\sqrt{\sum_{i=1}^{n}(X_{1i} - \mu_1)^2 \sum_{i=1}^{n}(X_{2i} - \mu_2)^2}}$$

Sample Correlation Coefficient:

$$r_{x_1 x_2} = \frac{Cov(X_1, X_2)}{\sqrt{Var(X_1)\,Var(X_2)}} = \frac{\sum_{i=1}^{n}(X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)}{\sqrt{\sum_{i=1}^{n}(X_{1i} - \bar{X}_1)^2 \sum_{i=1}^{n}(X_{2i} - \bar{X}_2)^2}}$$

Working Formula:

$$r_{x_1 x_2} = \frac{\sum x_1 x_2 - \frac{1}{n}\left(\sum x_1 \sum x_2\right)}{\sqrt{\left(\sum x_1^2 - \frac{1}{n}\left(\sum x_1\right)^2\right)\left(\sum x_2^2 - \frac{1}{n}\left(\sum x_2\right)^2\right)}}$$

Other names for this correlation coefficient are: Simple Correlation Coefficient and Product Moment Coefficient.
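In R the sample covariance and Pearson correlation are given by the cov() and cor() functions. A minimal sketch with two made-up vectors (illustrative values only):

x1 <- c(2.1, 3.4, 4.8, 5.0, 6.3)
x2 <- c(1.0, 2.2, 2.9, 3.8, 4.4)

cov(x1, x2)    # sample covariance
cor(x1, x2)    # Pearson correlation ("pearson" is the default method)

# The working formula above gives the same value:
(sum(x1*x2) - sum(x1)*sum(x2)/5) /
  sqrt((sum(x1^2) - sum(x1)^2/5) * (sum(x2^2) - sum(x2)^2/5))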
The Pearson Correlation Coefficient lies between –1 and +1. That is,

−1 ≤ 𝜌𝑥1 𝑥2 ≤ 1

A value near zero indicates little or no linear relationship between the two
variables; a value close to –1 indicates a strong negative linear relationship (as
one goes up the other comes down – large values of one variable are associated
with small values of the other variable); a value near +1 indicates a strong
positive linear relationship (large values of one variable are associated with large
values of the other variable). The following figures illustrate these ideas.

library(MASS)

##
## Attaching package: 'MASS'

## The following object is masked from 'package:plotly':


##
## select
make.sig <- function(cor, sd1, sd2){
covar <- cor*sd1*sd2
sig.mat <- matrix(c(sd1^2, covar, covar, sd2^2), byrow = T, nc = 2)
sig.mat
}

par(mfrow = c(2, 3))

x <- mvrnorm(50, c(0, 0), make.sig(-1, 1, 1))


plot(x, sub = paste(expression(rho), " = -1"), xlab = expression(X[1]), ylab = expression(X[2]))
abline(lm(x[, 2] ~ x[, 1]), col = "red")

x <- mvrnorm(50, c(0, 0), make.sig(-0.8, 1, 1))


plot(x, sub = paste(expression(rho), " = -0.8"), xlab = expression(X[1]), ylab = expression(X[2]))
abline(lm(x[, 2] ~ x[, 1]), col = "red")

x <- mvrnorm(50, c(0, 0), make.sig(0, 1, 1))


plot(x, sub = paste(expression(rho), " = 0"), xlab = expression(X[1]), ylab = expression(X[2]))
abline(lm(x[, 2] ~ x[, 1]), col = "red")

x <- mvrnorm(50, c(0, 0), make.sig(0.25, 1, 1))


plot(x, sub = paste(expression(rho), " = 0.25"), xlab = expression(X[1]), ylab = expression(X[2]))
abline(lm(x[, 2] ~ x[, 1]), col = "red")

x <- mvrnorm(50, c(0, 0), make.sig(0.5, 1, 1))


plot(x, sub = paste(expression(rho), " = 0.5"), xlab = expression(X[1]), ylab = expression(X[2]))
abline(lm(x[, 2] ~ x[, 1]), col = "red")

x <- mvrnorm(50, c(0, 0), make.sig(1, 1, 1))


plot(x, sub = paste(expression(rho), " = 1"), xlab = expression(X[1]), ylab = expression(X[2]))
abline(lm(x[, 2] ~ x[, 1]), col = "red")

(Figure: six scatterplots of simulated bivariate normal data with fitted lines, illustrating correlations of −1, −0.8, 0, 0.25, 0.5 and 1.)

par(mfrow = c(1, 1))

The correlation coefficient measures the linear relationship between two vari-
ables. If the relationship is nonlinear any interpretation of the correlation will
be misleading. The following figure demonstrates this concept.
x <- seq(-3, 3, 0.1)
y <- 4*x^2 + 2*x + 10 + rnorm(length(x), 0, 4)

plot(x, y)
abline(lm(y ~ x), col = "blue")
legend("topleft", legend = paste("Correlation = ", round(cor(x, y), 3)), bty = "n")

(Figure: scatterplot of a strong quadratic relationship between x and y with a fitted straight line; the Pearson correlation is only about 0.24.)
Note
Just because one variable relates to another variable does not mean that changes
in one causes changes in the other. Other variables may be acting on one or both
of the related variables and affecting them in the same direction. Cause-and-
effect may be present, but correlation does not prove cause. For example, the
length of a person’s pants and the length of their legs are positively correlated -
people with longer legs have longer pants; but increasing one’s pants length will
not lengthen one’s legs!
Property of Linearity
A low correlation (near 0) does not mean that 𝑋1 and 𝑋2 are not related in some
way. When |𝜌| < 𝜖 (where 𝜖 is some small value, say, for example, 0.1 or 0.2)
indicating no or very weak correlation between the two variables, there may still
be a definite pattern in the data reflecting a very strong “nonlinear” relationship
(see the previous figure for an example of this). Pearson’s correlation applies
only to the strength of linear relationships.

8.2 Regression Analysis


Regression analysis is a technique which develops a mathematical equation that
analyses the relationship between two or more variables, whereby one variable
is considered dependent upon other independent variables. The dependent vari-
able is represented by 𝑌 , and the independent variables by 𝑋.

8.2.1 Assumptions Underlying The Regression Model


For any given value of 𝑋 there is a distribution of different values of 𝑌 : the
conditional distribution of 𝑌 given the particular 𝑋 value;

• the mean of the conditional distribution is the true population value of 𝑌𝑖


for the given 𝑋𝑖 (and therefore lies on the true population regression line);
the residual associated with the mean 𝑌 value is equal to zero;
• the conditional distribution is approximately normally distributed;
• the variances of the conditional distributions (ie. for each different value
of 𝑋) are equal: homoscedascity;
• each 𝑌𝑖 value is independent of every other (that is, the error terms are
not autocorrelated);
• the relationship between the independent and dependent variables can be
expressed in linear form (as a straight line);
• there is more than one value of the independent variable, 𝑋;
• there is no random error in the independent variable, the 𝑋 values, or any
error in the 𝑋 observations is very much less than the random error in the
dependent variable, the 𝑌 values.
Scatterplots
The relationship between two variables can be graphically presented in a scat-
terplot. The dependent variable (𝑌 ) is on the vertical axis and the independent
variable 𝑋) is on the horizontal axis. For example, assume the weight of an
animal is deemed to be dependent upon the age of the animal:
age <- seq(2, 12, 0.1)
weight <- 3*age + rnorm(length(age), 0, 3)

plot(weight ~ age, xlab = "Age (X)", ylab = "Weight (Y)")


(Figure: scatterplot of weight (Y) against age (X) showing a positive linear trend.)

8.2.2 Simple Linear Regression


Simple Linear regression is appropriate when a single dependent variable, 𝑌 , is
explained by a single independent variable, 𝑋, using a linear relationship. The
deterministic form of the regression model is:

𝐸(𝑌𝑖 |𝑋𝑖 ) = 𝛽0 + 𝛽1 𝑋𝑖 , 𝑖 = 1, … , 𝑛.

Because of natural variation in the data (sampling variability), actual relation-


ships will not fall exactly on the straight line. Hence, in estimating the mathe-
matical model we attempt to estimate the line of best fit. The line will not be
able to pass through all the individual observations; in fact, the line of best fit
may not pass through any of the points. The difference between the observed
point and the estimated line at a particular 𝑋𝑖 is known as the ith error term,
𝜖𝑖 .
The population simple regression model is thus:

𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝜖𝑖 , 𝑖 = 1, … , 𝑛,

where
• 𝑦𝑖 is the 𝑌 measurement for the individual having an 𝑋 value of 𝑥𝑖 ;
• 𝛽0 is a constant representing the y value when x is zero (intercept);
• 𝛽1 is the slope of the population regression line;
• 𝜖𝑖 is the random error or dispersion or scatter in the 𝑦𝑖 observation asso-
ciated with unknown (excluded) influences.
The sample simple linear regression is:

𝑦𝑖̂ = 𝛽0̂ + 𝛽1̂ 𝑥𝑖 , 𝑖 = 1, … , 𝑛,

where symbols with hats (^) denote estimates obtained from sample data.

8.2.3 Estimating A Simple Linear Regression Model


One possible strategy is to choose the line that minimises the sum of squared
residuals. In other words,

$$\min_{(\beta_0, \beta_1)}\left(\sum_{i=1}^{n}\epsilon_i^2\right) = \min_{(\beta_0, \beta_1)}\left(\sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right)^2\right)$$

This method is known as ordinary least squares, or OLS, because it attempts


to estimate the model that minimises the sum of squared errors. The equations
for 𝛽0 and 𝛽1 that minimise the sum of squared errors (ie solve the above
equation) are:

$$\hat{\beta}_1 = \frac{n\sum_{i=1}^{n}x_i y_i - \left(\sum_{i=1}^{n}x_i\right)\left(\sum_{i=1}^{n}y_i\right)}{n\sum_{i=1}^{n}x_i^2 - \left(\sum_{i=1}^{n}x_i\right)^2}$$

and

𝛽0̂ = 𝑦 ̄ − 𝛽1̂ 𝑥.̄


Example:
Given the following data, estimate the regression equation.

X Y
200 180
250 230
300 280
350 310

$$\hat{\beta}_1 = \frac{4 \times 286000 - 1100 \times 1000}{4 \times 315000 - 1100^2} = 0.88, \qquad \hat{\beta}_0 = 250 - 0.88 \times 275 = 8.00$$

Therefore, the estimated regression equation (line of best fit through


𝑋 and 𝑌 ) is:

𝑦 ̂ = 8 + 0.88𝑥
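The hand calculation can be checked with the lm() function. This is a quick sketch using the four data points from the table above:

x <- c(200, 250, 300, 350)
y <- c(180, 230, 280, 310)

fit <- lm(y ~ x)
coef(fit)    # intercept approximately 8, slope approximately 0.88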

8.2.4 Evaluating The Model


In the regression model it is assumed that changes in the value of 𝑦 are associated
with changes in the value of 𝑥. This is only an assumption and may or may
not be true. To test whether 𝑦 is related linearly to 𝑥, we first imagine that
changes in 𝑥 do not lead to any changes in 𝑦. We then test whether the data
support this assumption. Under the assumption of “no effect”, the graph of 𝑌
against 𝑋 would be a horizontal line.
plot(0:5, 2*(0:5), type = "n", xlab = "X", ylab = "Y", main = "Y does not change when X changes")
abline(h = mean(2*(0:5)))
text(3, 6, expression(bar(Y)), cex = 1.5)

(Figure: a horizontal line at the mean of Y, illustrating that Y does not change when X changes.)

Of the total variation of the individual responses from their mean, some will
be explained by the model, and some will be unexplained. Thus the total
variation is made up of two components - the variation explained by the model
(or systematic variation) and the variation left unexplained (error variation).
plot(0:5, 2*(0:5), type = "n", xlab = "X", ylab = "Y", main = "Total Variation = Systematic Variation + Error Variation")

ybar <- mean(2*(0:5))


abline(h = ybar)
abline(2, 1.5)

points(3, 9, col = 'red')


lines(c(3, 3), c(0, 9))
lines(c(3.2, 3.2), c(6.9, 9), col = "blue", lty = 2)
lines(c(3.2, 3.2), c(6.62, ybar), col = "red", lty = 3)
lines(c(2.8, 2.8), c(ybar, 9), col = "purple", lty = 1)

text(2, 7.5, "Total Variation", col = "purple")


text(2, 7, expression((Y - bar(Y))), col = "purple")

text(4, 6, "Systematic Variation", col = "red")


text(4, 5.5, expression((hat(Y) - bar(Y))), col = "red")

text(4, 9, "Error Variation", col = "blue")


text(3.9, 8.5, expression((Y - hat(Y))), col = "blue")

(Figure: "Total Variation = Systematic Variation + Error Variation" – the fitted line and the mean of Y, with the total variation (Y − Ȳ) at one point split into systematic variation (Ŷ − Ȳ) and error variation (Y − Ŷ).)

8.2.5 The Coefficient of Determination (𝑅2 )


The coefficient of determination measures the proportion of variation in the
dependent variable (𝑦) that is explained by the model (𝑥). Of the total variation
in the data, (𝑌 − 𝑌 ̄ ), the portion identified by (𝑌 ̂ − 𝑌 ̄ ) is explained by the
regression model. Thus:

$$\begin{aligned}
R^2 &= \frac{\text{Explained (or systematic) Variation}}{\text{Total Variation}} \\
&= \frac{\text{Sum of Squares for Regression}}{\text{Total Sum of Squares}} \\
&= \frac{\sum(\hat{y}_i - \bar{y})^2}{\sum(y_i - \bar{y})^2} \\
&= \frac{\left(n\sum_{i=1}^{n}x_i y_i - \sum_{i=1}^{n}x_i \sum_{i=1}^{n}y_i\right)^2}{\left[n\sum_{i=1}^{n}x_i^2 - \left(\sum_{i=1}^{n}x_i\right)^2\right]\left[n\sum_{i=1}^{n}y_i^2 - \left(\sum_{i=1}^{n}y_i\right)^2\right]}
\end{aligned}$$

For the example data,

$$R^2 = \frac{(4 \times 286000 - 1000 \times 1100)^2}{(4 \times 315000 - 1100^2) \times (4 \times 259800 - 1000^2)} = 0.988.$$

(Note that for simple linear regression 𝑅2 is simply the square of the correlation
coefficient.)

In other words, 98.8% of the variation in the dependent variable (y) is explained
by the model. But is this significant? What about 60% or 40%? The coefficient
of determination does not give a testable measure of the significance of the model
overall in explaining the variation in (y). Inferences concerning the regression
model depend on the standard error or root mean square error (RMSE) of the
model.
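Continuing the lm() sketch from the earlier worked example, the coefficient of determination is reported by summary():

summary(fit)$r.squared    # approximately 0.988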

8.2.6 The Standard Error (or Root Mean Squared Error) Of The Regression, $S_\epsilon$
The standard error of the regression is used as a measure of the dispersion of the
observed points around the regression line. It is the estimate of the standard
deviation of the error variance. A formula for this estimate is:

$$\begin{aligned}
S_\epsilon &= \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n-2}} \\
&= \sqrt{\frac{\sum_{i=1}^{n}\hat{\epsilon}_i^2}{n-2}} \\
&= \sqrt{\frac{\text{SSE}}{n-2}}
\end{aligned}$$

A more useable form of the equation is:

$$S_\epsilon = \sqrt{\frac{\sum_{i=1}^{n}y_i^2 - \hat{\beta}_0\sum_{i=1}^{n}y_i - \hat{\beta}_1\sum_{i=1}^{n}x_i y_i}{n-2}}$$
From the previous example:

$$S_\epsilon = \sqrt{\frac{259800 - 8 \times 1000 - 0.88 \times 286000}{4 - 2}} = 7.7459.$$

8.2.7 Testing The Significance Of The Independent Variable
To test whether the independent variable 𝑋 is significant in a simple linear
regression model, we need to test the hypothesis that 𝛽1 equals zero. If 𝛽1 equals
zero then the value of 𝑌 will not be affected by any change in 𝑋 . Therefore
the hypotheses are:

𝐻0 ∶ 𝛽 1 = 0
𝐻1 ∶ 𝛽 1 ≠ 0

The test statistic for this hypothesis is:

$$T = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}$$

where

$$SE(\hat{\beta}_1) = \frac{S_\epsilon}{\sqrt{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}}$$

and $\hat{\beta}_1$ denotes the sample estimate of the slope parameter, $\beta_1$.

If the null hypothesis is true, $T$ is approximately distributed as a Student's $t$ with degrees of freedom $df = n - 2$.
Using the previous example again, we can test

$$H_0: \beta_1 = 0$$
$$H_1: \beta_1 \neq 0$$

using

$$SE(\hat{\beta}_1) = \frac{7.7459}{\sqrt{315000 - 4 \times 275^2}} = 0.06928,$$

and

$$T = \frac{0.88}{0.06928} = 12.702.$$

Degrees of freedom are $4 - 2 = 2$. From $t$ tables, $t_2(0.025) = 4.303$. As $T > 4.303$, we reject the null hypothesis at the 0.05 level of significance and conclude that $Y$ has a significant linear relationship with $X$.
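The slope test can likewise be sketched in R using these quantities (illustrative variable names; in practice summary() of the fitted lm object reports the same t value and its p-value directly):

n      <- 4
xbar   <- 275       # mean of x
sum_x2 <- 315000    # sum of x^2
s_eps  <- 7.7459    # residual standard error from above
b1     <- 0.88      # estimated slope

se_b1  <- s_eps / sqrt(sum_x2 - n * xbar^2)       # approximately 0.06928
t_stat <- b1 / se_b1                              # approximately 12.702

qt(0.975, df = n - 2)                             # critical value, 4.303
2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)  # two-sided p-value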

8.2.8 Testing The Overall Significance Of The Model


Testing the overall significance of the model is conducted using ANOVA. Variation in the data, as already discussed, can be divided into total, explained (in the ANOVA it is termed regression) and unexplained or error. The ANOVA table is thus:

| Source     | DF    | Sum of Squares                      | Mean Square        | Variance Ratio |
|------------|-------|-------------------------------------|--------------------|----------------|
| Regression | $k-1$ | $SSR = \sum(\hat{y}_i - \bar{y})^2$ | $MSR = SSR/(k-1)$  | $MSR/MSE$      |
| Error      | $n-k$ | $SSE = \sum(y_i - \hat{y}_i)^2$     | $MSE = SSE/(n-k)$  |                |
| Total      | $n-1$ | $TSS = \sum(y_i - \bar{y})^2$       |                    |                |

$$TSS = SSR + SSE$$

NB: $k$ is the number of parameters estimated in the model. For simple linear regression, there are two parameters estimated ($\beta_0$ and $\beta_1$), so $k = 2$.

Notes:

• The purpose of the ANOVA is to break up the variation in $y$; in simple regression it can also test $H_0: \beta_1 = 0$, and it shows how the coefficient of determination ($R^2$) is derived.
• It is based on the F-test statistic, defined as the variance ratio.
• The main question is: "Is the ratio of the explained variance (MSR) to the unexplained variance (MSE) sufficiently greater than 1 to reject $H_0$ that $y$ is unrelated to $x$?" (see the short sketch after these notes).
• If we reject $H_0$, the main conclusion is:
  – the linear model explains a part of the variation in $y$; we accept that $x$ and $y$ are linearly related (at the given level of significance).
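As a rough illustration of how these pieces fit together, the ANOVA quantities for the worked example can be assembled in R from values already computed (a sketch only, with illustrative variable names; with the raw data, anova() on the fitted lm object produces this table directly):

n <- 4; k <- 2
sum_y <- 1000; sum_y2 <- 259800

TSS <- sum_y2 - sum_y^2 / n      # total sum of squares = 9800
SSE <- 7.7459^2 * (n - k)        # error SS, from the RMSE above (about 120)
SSR <- TSS - SSE                 # regression SS (about 9680)

MSR   <- SSR / (k - 1)
MSE   <- SSE / (n - k)
Fstat <- MSR / MSE               # about 161.3 (equal to T^2 = 12.702^2)
SSR / TSS                        # R-squared, about 0.988
pf(Fstat, k - 1, n - k, lower.tail = FALSE)   # p-value for the overall F test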

8.2.9 Predictions Using The Regression Model

Two types of predictions are of interest - one for an actual (individual) value of $Y$, and one for a conditional average. Both predictions have the same point estimate but different interval widths.

For predicting the conditional mean $E(Y|x_p)$, the standard error of the prediction is given by:

$$s^2_{y_p} = s^2_\epsilon \left(\frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}\right)$$

where $x_p$ denotes the value of $x$ at which we want to predict $y$. This interval is sometimes called the "narrow" interval, or "confidence" interval in R.

Whereas for predicting an actual (individual) value of $Y$ at $x_p$:

$$s^2_{y_p} = s^2_\epsilon \left(1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}\right)$$

This interval is sometimes called the "wide" interval, or "prediction" interval in R.
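In R, both intervals are obtained with predict() on a fitted lm object. A minimal sketch with made-up illustrative data (the names x, y, fit and new_x, and the values, are hypothetical):

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
fit <- lm(y ~ x)

new_x <- data.frame(x = c(2.5, 4.5))

# "Narrow" interval for the conditional mean E(Y | x):
predict(fit, newdata = new_x, interval = "confidence", level = 0.95)

# "Wide" interval for a new individual observation of Y:
predict(fit, newdata = new_x, interval = "prediction", level = 0.95)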

8.2.10 Functional Forms Of The Regression Model


The form of the relationship between two variables may not be linear. If the
relationship between two variables is not linear then the data either has to be
transformed in some manner, or non-linear regression must be used (non-linear
regression is beyond the scope of this course and will not be discussed here).
There are many transformations that can be used in regression modelling - some will help to linearise a non-linear relationship; some do other things (e.g. stabilise the variance). One common transformation is the (natural) logarithmic transformation.

E.g.:

$$y_i = e^{\beta_0 + \epsilon_i} x_i^{\beta_1}$$

becomes

$$\ln(y_i) = \beta_0 + \beta_1 \ln(x_i) + \epsilon_i$$

under a natural log transformation of both $y_i$ and $x_i$.
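In R, a log-log model of this kind can be fitted by transforming inside the model formula. A minimal sketch with simulated (hypothetical) data, where the true intercept is 0.5 and the true slope is 1.5:

set.seed(1)
x <- seq(1, 10, by = 0.5)
y <- exp(0.5 + rnorm(length(x), sd = 0.1)) * x^1.5   # y = e^(beta0 + error) * x^beta1

loglog.lm <- lm(log(y) ~ log(x))   # linear on the log-log scale
summary(loglog.lm)                 # intercept estimates beta0, slope estimates beta1

# Back-transform a prediction to the original scale if required:
exp(predict(loglog.lm, newdata = data.frame(x = 7)))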

8.3 Using R and Examples:


8.3.1 House prices and distance from an abattoir
You should try to run a simple regression on this data (during your own time)
and get the following output.

| Distance to Abattoir (km) | House Price ($000s) |
|---------------------------|---------------------|
| 1.2                       | 101                 |
| 0.8                       | 92                  |
| 1.0                       | 110                 |
| 1.3                       | 120                 |
| 0.7                       | 90                  |
| 0.3                       | 51                  |
| 1.0                       | 93                  |
| 0.6                       | 75                  |
| 0.9                       | 77                  |
| 1.1                       | 120                 |

distance <- c(1.2, 0.8, 1, 1.3, 0.7, 0.3, 1, 0.6, 0.9, 1.1)
price <- c(101, 92, 110, 120, 90, 51, 93, 75, 77, 120)
house.prices <- data.frame(distance, price)
rm(distance, price)

attach(house.prices)
plot(x = distance, y = price, xlab = "Distance (km) from Abattoir", ylab = "House Price (x $1000)", main = "House Prices vs Distance from Abattoir")

[Figure: "House Prices vs Distance from Abattoir" -- scatterplot of House Price (x $1000) against Distance (km) from Abattoir.]

detach()

house.lm <- lm(price ~ distance, data = house.prices)

# Get the anova table:


anova(house.lm)

## Analysis of Variance Table


##
## Response: price
## Df Sum Sq Mean Sq F value Pr(>F)
## distance 1 3289.9 3289.9 30.079 0.0005844 ***

## Residuals 8 875.0 109.4


## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Do the regression - get the coefficient estimates (line of best fit) table:
summary(house.lm)

##
## Call:
## lm(formula = price ~ distance, data = house.prices)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.5377 -6.2549 0.7738 8.1221 13.7083
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.14 10.86 3.327 0.010431 *
## distance 63.77 11.63 5.484 0.000584 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.46 on 8 degrees of freedom
## Multiple R-squared: 0.7899, Adjusted R-squared: 0.7636
## F-statistic: 30.08 on 1 and 8 DF, p-value: 0.0005844

Before we use this model to test hypotheses or make predictions etc., we should first assess whether the fitted model follows the assumptions of regression modelling. We do this graphically as follows:
plot(house.lm)

Residuals vs Fitted
10 15
10
5
Residuals

0
−10

9
−20

60 70 80 90 100 110 120

Fitted values
lm(price ~ distance)
Normal Q−Q
1.5

10
Standardized residuals

0.5
−0.5

1
−1.5

−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5

Theoretical Quantiles
lm(price ~ distance)
160CHAPTER 8. WEEK 10/11 - CORRELATION AND SIMPLE LINEAR REGRESSION

Scale−Location
9

0.0 0.2 0.4 0.6 0.8 1.0 1.2


10
1
Standardized residuals

60 70 80 90 100 110 120

Fitted values
lm(price ~ distance)
Residuals vs Leverage

10
1
1.0
Standardized residuals

0.5
0.0

6
−1.0

0.5

1 1

Cook's distance
−2.0

0.0 0.1 0.2 0.3 0.4 0.5

Leverage
lm(price ~ distance)

These graphs will be discussed in lectures.

What are the hypotheses and the population regression model for this example?

What is the fitted model for this example?

What would the predicted house price be 0.63 km away from the abattoir?

What would the predicted house price be 2.0 km away from the abattoir?
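One way to answer the two prediction questions is with predict() on the fitted house.lm model (a sketch; note that 2.0 km lies outside the observed range of distances, 0.3 to 1.3 km, so that prediction is an extrapolation and should be treated cautiously):

# Predicted house prices ($000s) at 0.63 km and 2.0 km from the abattoir
predict(house.lm, newdata = data.frame(distance = c(0.63, 2.0)))

# Equivalent hand calculation from the rounded fitted coefficients:
36.14 + 63.77 * c(0.63, 2.0)   # approximately 76.3 and 163.7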

8.3.2 Bicycle lanes and rider safety


As part of ongoing traffic studies, researchers examined whether bicycle lanes
increased rider safety. Ten roads with bicycle lanes were randomly selected and
the distance of these lanes from the center line measured. The researchers then
set up video cameras on each road to record how close cars were to cyclists as
they passed. The data are recorded in the table below (measurements are in
feet).

Distance to Center Line (Center) 12.8 12.9 12.9 13.6 14.5 14.6 15.1 17.5 19.5 20.8
Distance from Car to Cyclist (Car) 5.5 6.2 6.3 7.0 7.8 8.3 7.1 10.0 10.8 11.0

The data have been analysed using lm() in R and the results are given below.
center <- c(12.8, 12.9, 12.9, 13.6, 14.5, 14.6, 15.1, 17.5, 19.5, 20.8)
car <- c(5.5, 6.2, 6.3, 7, 7.8, 8.3, 7.1, 10, 10.8, 11)

cyclist <- data.frame(center, car)


rm(center, car)

# Plot the Data:


attach(cyclist)

plot(center, car, xlab = "Distance (feet) to Centre lane",
     ylab = "Distance (feet) of Car to Cyclist",
     main = "Scatterplot of Cyclist Data")

[Figure: "Scatterplot of Cyclist Data" -- Distance (feet) of Car to Cyclist against Distance (feet) to Centre lane.]



detach()

# Run the model using lm():

cyclist.lm <- lm(car ~ center, data = cyclist)


anova(cyclist.lm)

## Analysis of Variance Table


##
## Response: car
## Df Sum Sq Mean Sq F value Pr(>F)
## center 1 32.449 32.449 95.763 9.975e-06 ***
## Residuals 8 2.711 0.339
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(cyclist.lm)

##
## Call:
## lm(formula = car ~ center, data = cyclist)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.76990 -0.44846 0.03493 0.35609 0.84148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.18247 1.05669 -2.065 0.0727 .
## center 0.66034 0.06748 9.786 9.97e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5821 on 8 degrees of freedom
## Multiple R-squared: 0.9229, Adjusted R-squared: 0.9133
## F-statistic: 95.76 on 1 and 8 DF, p-value: 9.975e-06
# Check model using residual diagnostics:

par(mfrow = c(2, 2))


plot(cyclist.lm)

[Figure: Residual diagnostic plots for cyclist.lm (lm(car ~ center)) -- Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage, in a 2 x 2 layout.]

par(mfrow = c(1, 1))

Questions to consider:

• What are the hypotheses? What is the population regression model?
• Is this model an adequate fit to the data?
• Why are Pr(>F) and Pr(>|t|) for the slope the same?
• What does the $R^2$ tell us?
• What is the fitted regression model for the distance a bike lane is from the centre line (center) and how close a car passes a cyclist (car)?
• Is there a significant linear relationship between the distance a bike lane is from the centre line (center) and how close a car passes a cyclist (car) for these data? Explain your answer.
• Use the equation of the fitted regression line to predict how much distance a car would leave when passing a cyclist in a bike lane that is 17 feet from the centre line. Comment on this prediction.
• Check your answer to the above prediction using R (see the "Regression Examples – Cycle Lanes.R" file in the week 10/11 lecture notes folder); a short sketch is given below.
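A quick check of that last prediction using the fitted cyclist.lm model (a sketch):

# Predicted passing distance (feet) for a bike lane 17 feet from the centre line
predict(cyclist.lm, newdata = data.frame(center = 17))

# Equivalent hand calculation from the fitted coefficients:
-2.18247 + 0.66034 * 17   # approximately 9.04 feet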

8.3.3 Dissolved Oxygen and Temperature in freshwater rivers

The following data were collected in a study to examine the relationship between dissolved oxygen and temperature in freshwater river systems.

River 1 2 3 4 5 6 7 8 9 10
Dissolved Oxygen (%) 45.8 67.9 54.6 51.9 32.7 62.3 71.3 78.9 38.7 49.7
Temperature (C) 27.9 16.5 22.3 24.8 31.2 18.5 13.1 10.7 29.1 22.1

The data have been analyzed using lm() in R and the results are given below.

# Enter Data:

DO <- c(45.8, 67.9, 54.6, 51.9, 32.7, 62.3, 71.3, 78.9, 38.7, 49.7)
temp <- c(27.9, 16.5, 22.3, 24.8, 31.2, 18.5, 13.1, 10.7, 29.1, 22.1)

rivers <- data.frame(DO, temp)


rm(DO, temp)

# Plot the Data:

attach(rivers)

plot(temp, DO, xlab = "Temperature (Degrees C)",
     ylab = "Dissolved Oxygen (%)",
     main = "Scatterplot of DO vs Temperature")

[Figure: "Scatterplot of DO vs Temperature" -- Dissolved Oxygen (%) against Temperature (Degrees C).]

detach()

# Fit the model to the data:

do.lm <- lm(DO ~ temp, data = rivers)

anova(do.lm)

## Analysis of Variance Table


##

## Response: DO
## Df Sum Sq Mean Sq F value Pr(>F)
## temp 1 1880.1 1880.14 248.63 2.615e-07 ***
## Residuals 8 60.5 7.56
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(do.lm)

##
## Call:
## lm(formula = DO ~ temp, data = rivers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.671 -1.728 0.468 1.483 3.617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 100.8129 3.0097 33.50 6.89e-10 ***
## temp -2.1014 0.1333 -15.77 2.62e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.75 on 8 degrees of freedom
## Multiple R-squared: 0.9688, Adjusted R-squared: 0.9649
## F-statistic: 248.6 on 1 and 8 DF, p-value: 2.615e-07
# Check model using residual diagnostics:

par(mfrow = c(2, 2))

plot(do.lm)

[Figure: Residual diagnostic plots for do.lm (lm(DO ~ temp)) -- Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage, in a 2 x 2 layout.]

par(mfrow = c(1, 1))

Questions to consider:

• What is the population regression model? What are the hypotheses?
• Is the model fit adequate? Explain your answer.
• What does the $R^2$ tell us?
• Is there a significant linear relationship between dissolved oxygen and temperature?
• What is the fitted regression model?
• Predict the value of the dependent variable if the independent variable = 30. Comment on this prediction.
• Predict the value of the dependent variable if the independent variable = 9. Comment on this prediction.
• Check both of the above predictions in R (see the R script file "Regression Examples – Dissolved Oxygen.R") using 95% "wide" prediction intervals; a short sketch is given below.
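A sketch of those checks using the fitted do.lm model (note that 9 degrees C lies just below the observed temperature range, 10.7 to 31.2, so that prediction involves mild extrapolation):

# 95% "wide" prediction intervals for temperatures of 30 and 9 degrees C
predict(do.lm, newdata = data.frame(temp = c(30, 9)),
        interval = "prediction", level = 0.95)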
