
Department of Statistics

STA1502
Statistical inference I

Study guide for STA1502



CONTENTS

Orientation iii

STUDY UNIT 1
1.1 Introduction 1
1.2 Inference about the Difference Between Two Population Means: Independent Samples 1
1.3 Observational and Experimental Data 18
1.4 Inference about the Difference Between Two Population Means: Matched Pairs Experiment 19
1.5 Inference about the Ratio of Two Variances 28
1.6 Self-correcting Exercises for Unit 1 38
1.7 Solutions to Self-correcting Exercises for Unit 1 39
1.8 Learning Outcomes 46

STUDY UNIT 2
2.1 Introduction 48
2.2 Inference about the Difference Between Two Population Proportions 48
2.3 One-Way Analysis of Variance 57
2.4 Multiple Comparisons 68
2.5 Analysis of Variance experimental designs (read only) 80
2.6 Randomized Block (Two-way) Analysis of Variance 81
2.7 Self-correcting Exercises for Unit 2 87
2.8 Solutions to Self-correcting Exercises for Unit 2 89
2.9 Learning Outcomes 98

STUDY UNIT 3
3.1 Chi–square test 100
3.2 Chi-squared goodness-of-fit test 101
3.3 Chi-squared test of a Contingency Table 107
3.4 Summary of tests on nominal data 109

STUDY UNIT 4
4.1 Simple linear regression and correlation 118
4.2 Simple Linear Regression and Correlations 118
4.3 Diagnostic Tools for Checking the Regression Assumptions 122

STUDY UNIT 5
5.1 Nonparametric statistics 131
5.2 The Wilcoxon Rank Sum Test: Independent Random Samples 131

5.3 Sign Test and Wilcoxon Signed Rank Sum Test 137
5.4 The Wilcoxon Signed Rank Sum Test for a matched paired experiment 147
5.5 The Kruskal–Wallis H-test for completely randomized designs 151
5.6 Friedman Test for the Randomized Block Design 154

STUDY UNIT 6
6.1 Introduction 158
6.2 Components of time series 158
6.3 Smoothing techniques 160
6.4 Trend and seasonal effects 164
6.5 Introduction to forecasting 167
6.6 Forecasting models 167

ORIENTATION
Welcome
Welcome to STA1502. This module is the second of the first-year statistics modules. STA1501
and STA1502 form the first-year Statistics course for students from the College of Economic and
Management Sciences. If you are a BSc student in the College of Science, Engineering and
Technology, the three modules STA1501, STA1502 and STA1503 form the first year in Statistics.

In the preceding module STA1501, we treated probability and probability distributions; unless
one has a proper understanding of the laws of probability, the mechanisms underlying statistical data
analysis cannot be properly understood. Probability theory is the tool that makes statistical inference
possible. In STA1502, we consider the applications of these probability distributions. You have
learned in STA1501 that the shape of the normal distribution is determined by the value of the mean
μ and the variance σ², whilst the shape of the binomial distribution is determined by the sample size
n and the probability of a success p. These defining values are called parameters. We most often
do not know what the values of the parameters are, and thus we cannot "utilise" these distributions (i.e.
use the mathematical formula to draw a probability density graph or compute specific probabilities)
unless we somehow estimate these unknown parameters. It makes perfect logical sense that, to
estimate the value of an unknown population parameter, we compute a corresponding or comparable
characteristic of the sample.

The objective of this module is to focus on the issues related to prediction and inference in statistics
and therefore it is called Statistical Inference and the "I" in the title indicates that it is a module at
the first level. We draw inference about a population (a complete set of data) based on the limited
information contained in a sample. In dictionary terms, inference is the act or process of inferring;
to infer means to conclude or judge from premises or evidence; meaning to derive by reasoning.
In general, the term implies a conclusion based on experience or knowledge. More specifically in
statistics, we have as evidence the limited information contained in the outcome of a sample and
we want to conclude something about the unknown population from which the sample was drawn.
The set of principles, procedures and methods that we use to study populations by making use of
information obtained from samples is called statistical inference.

Learning outcomes
There are very specific outcomes for this module, listed below. Throughout your study of this module
you must come back to this page, sit back and reflect upon them, think them through, digest them
into your system and feel confident in the end that you have mastered the following outcomes:
Describing the behaviour of sample statistics in repeated sampling, focussing on sampling
distributions of the sample mean and the sample proportion.
Evaluating the reliability of estimates of the population parameters with the use of the Central
Limit Theorem and the sampling distributions of the corresponding sample statistics.
Considering point and interval estimators for single or compound population parameters.
Basic concepts of large-sample statistical estimation and hypothesis testing involving population
means and proportions.
Small-sample tests and confidence intervals for population means and proportions.
Employing three different non-parametric tests to compare two populations of ordinal or interval data
when normality cannot be accepted.
Applying the classical time series and its decomposition into trend, seasonal and random
variation.
Measuring long-term trend using regression analysis and seasonal variation by computing
seasonal indexes.
Describing four forecasting techniques, including the autoregressive model.

The prescribed textbook

For this module you have to study certain sections from six chapters of the prescribed textbook:

Keller, Gerald and N. Gaciu (2020). Statistics for Management and Economics, 2nd edition. ISBN:
9781473768260

Chapter 13: INFERENCE ABOUT COMPARING TWO POPULATIONS


Chapter 14: ANALYSIS OF VARIANCE (not 14.5 and 14.6)
Chapter 15: CHI-SQUARED TESTS
Chapter 16: SIMPLE LINEAR REGRESSION AND CORRELATION
Chapter 19: NONPARAMETRIC STATISTICS
Chapter 20: TIME SERIES ANALYSIS AND FORECASTING

The study guide


The study guide may be better described as a textbook guide because it guides you through the
textbook in a systematic way. It is no substitute for the textbook, where the different topics are
explained in detail. You have to use the two together, as the guide supplements the textbook with additional
exercises and longer explanations but does not repeat the basic theoretical knowledge. This study
guide serves as an interactive workbook, where spaces are provided for your convenience. Should
you so prefer, you are welcome to write and reference your solutions in your own book or file, if the
space we supply is insufficient or not to your liking.

Study Units and workload

We realise that you might feel overwhelmed by the volumes and volumes of printed matter that
you have to absorb as a student! How do you eat an elephant? Bite by bite! We have divided
the 6 chapters of the textbook into 6 study units or "sessions". Make very sure about the sections
indicated in each study unit since some sections of the textbook are excluded and we do not want
you frustrated by working through unnecessary material. Regular contact with statistics will ensure that
your study becomes personally rewarding.

Try to work through as many of the exercises as possible

Doing exercises on your own will not only enhance your understanding of the work, but it will give you
confidence as well. Feedback is given immediately after the activity to help you check whether you
understand the specific concept. The activities are designed (i.e. specific exercises are selected) so
that you can reflect on a concept discussed in the textbook. You can only obtain maximum benefit
from this activity-feedback process if you discipline yourself not to peep at the solution before you
have attempted it on your own!

Final word: Attitude

We know that many of you have some "math anxiety" to deal with, but we will do our best to make
your statistics understandable and not too theoretical. Studying statistics is sometimes not "exciting"
or "fun" but keep in mind that the considerable effort to master the content of this module can be very
rewarding. We claim that knowledge of statistics will enable you to make effective decisions in your
business and to conduct quantitative research into the many large and detailed data sources that
are available. Statistical literacy will enable you to understand statistical reports you might encounter
as a manager in your business.

We are there to assist you in a process where you shift yourself from a supported school learner to
an independent learner. Studying through distance education is neither easy nor quick. There will
be times when you feel frustrated and discouraged and then only your attitude will pull you through!
You are the master of your own destiny.

In a paper by Sue Gordon¹ (1995) from the University of Sydney, the following metaphor is given:
"The learning of statistics is like building a road. It’s a wonderful road, it will take you to places you
did not think you could reach. But when you have constructed one bit of road you cannot sit back and
think ‘Oh, that’s a great piece of road!’ and stop at that. Each bit leads you on, shows the direction
to go, opens the opportunity for more road to be built. And furthermore, the part of the road that
you built a few weeks ago, that you thought you were finished with, is going to develop pot holes
the instant you turn your back on it. This is not to be construed as failure on your part, this is not
inadequacy. This is just part of road building. This is what learning statistics is about: go back and
repair, go on and build, go back and repair."

¹ Gordon, Sue (1995). A Theoretical Approach to Understanding Learners of Statistics. Journal of Statistics Education, 3(3). University of Sydney.

A few logistical problems

(You can skip the following section if you have read through it when you did STA1501.)

Decimal comma or point?

We realise that in the South African schooling system commas are used to indicate the decimal digit
values. You have been penalised at school for using a point. Now we sit between two fires: the
school system and common practice in calculators and computers! Most computer packages use
decimal points (ignoring the option to change it) and Keller (the author) also uses the decimal point
in our textbook (Statistics for Management and Economics). Therefore we use the decimal point in
our study guide, assignments and examination.

Role of computers and statistical calculators:

The emphasis in the textbook is well beyond the arithmetic of calculating statistics and the focus is
on the identification of the correct technique, interpretation and decision making. This is achieved
with a flexible design giving both manual calculations and computer steps.

Every statistical technique that needs computation is illustrated in a three-step approach:


Step 1 MANUALLY
Step 2 EXCEL
Step 3 MINITAB

It is a good idea that you initially go through the laborious manual computations to enhance your
understanding of the principles and mathematics but we strongly urge you to manage the Excel
computations because using computers reflects the real world outside. The additional advantage of
using a computer is that you can do calculations for larger and more realistic data sets. Whether
you use a computer program or a statistical calculator as a tool for your calculations is irrelevant to us.
However, the emphasis in this module will always be on the interpretation and how to articulate the
results in report writing.

CD appendixes and a study guide are provided on the CD-ROM (included in the textbook) in PDF
format. The screenshot below is just to give you an idea of some of the topics covered. Although it will
not be to your disadvantage if you do not use the CD, we encourage you to try your best to have at
least a few sessions on a computer. Statistical Software makes Statistics exciting - so, play around
on the computer should you have access!

Some Key Terms/Symbols

Sampling distribution of the sample proportion


Standard error of the proportion
Sampling distribution of the difference between two sample means
Standard error of the difference between two means
Pooled variance estimator
Matched pairs experiment
Degrees of freedom
Pooled proportion estimator
Response variable
Sum of squares for error
Multinomial experiment
Least squares method
Distribution–free methods
Random variation
Trend analysis

STUDY UNIT 1
1.1 Introduction
You should not attempt to do the module STA1502 without knowledge of the contents of STA1501.
This module, STA1502, continues with the follow-up chapters of the same textbook. Chapters
such as what is Statistics, graphical descriptive techniques, numerical descriptive techniques,
data collection and sampling, probability, random variables and discrete probability distributions,
continuous probability distributions, sampling distributions, introduction to hypothesis testing and
inference about a population were covered in STA1501.
In this module we continue with chapters such as

1. Inference about comparing two populations

2. Analysis of variance

3. Chi–square tests

4. Simple linear regression and correlation

5. Nonparametric statistics

6. Time–series analysis and forecasting

In STA1501 you learnt about Statistical Inference for a single population and derived hypothesis
tests and confidence intervals from the information contained in a single sample. You did this for

the population mean μ
the population variance σ²
the population proportion p

1.2 Inference about the Difference Between Two Population Means: Independent Samples
In STA1501 you learnt how to draw inferences when the population mean μ, the population variance σ²
and the population proportion p of the population under study were not known. In order to test and estimate the difference between two
population means, we draw a random sample from each of two independent populations. In such a
case, we almost always do not know the standard deviation of either population.
In this section, we discuss independent samples, using a two-sample test that compares the
means of samples selected from two populations. We define independent samples as samples
that are completely unrelated to one another.

We must distinguish between the following two cases:

when the population variances σ₁² and σ₂² are equal;

when the population variances σ₁² and σ₂² are unequal.

We are now sampling from two independent populations where the means of the populations are
our focus. Note that we need subscripts to distinguish between the population mean of the first
population, called μ₁, and the population mean of the second population, called μ₂.
The difference between the two population means is denoted by (μ₁ − μ₂), and its best estimator is
the difference between the two sample means, x̄₁ − x̄₂.

1.2.1 Independent samples when population variances are equal

Assume that two random samples are independently selected from two populations that are normally
distributed with equal variances. To test whether the two population means are equal, we can use a
pooled–variance t test to determine whether there is a significant difference between the means.
In statistical notation we summarize this as follows:
We have a random sample of size n₁ from a population which is normally distributed with mean μ₁ and
variance σ₁², denoted by N(μ₁; σ₁²), and an independent random sample of size n₂ from a population
which is normally distributed with mean μ₂ and variance σ₂², denoted by N(μ₂; σ₂²).
There are 5 steps for inference based on the sampling distribution of x̄₁ − x̄₂:

Step 1

The null hypothesis H₀ of no difference in the means of two independent populations can be denoted
by
H₀: μ₁ = μ₂ or H₀: μ₁ − μ₂ = D₀
and H₀ may be tested at the α% level of significance against one of the following alternatives:

(i) H₁: (μ₁ − μ₂) ≠ D₀. This is a two-tailed test.

(ii) H₁: (μ₁ − μ₂) < D₀. This is called a one-tailed test.

(iii) H₁: (μ₁ − μ₂) > D₀. This is called a one-tailed test.

The symbol D₀ implies a known, specified difference under H₀ and is usually (mostly) the value 0,
indicating that we are testing
H₀: μ₁ = μ₂ or H₀: μ₁ − μ₂ = 0.

Step 2

The pooled variance sp² combines the two sample variances s₁² and s₂², independently computed
from the two samples. The pooled variance sp² is the best estimate of the variance common to
both populations. The formula for the pooled variance sp² is

    sp² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2)

where
n₁: the sample size taken from population 1
n₂: the sample size taken from population 2
x̄₁: the mean of the sample taken from population 1
x̄₂: the mean of the sample taken from population 2
s₁²: the variance of the sample taken from population 1
s₂²: the variance of the sample taken from population 2
(n₁ + n₂ − 2): the degrees of freedom for the t-distribution

Step 3

The test statistic t for x̄₁ − x̄₂ is

    t = [(x̄₁ − x̄₂) − (μ₁ − μ₂)] / √( sp² (1/n₁ + 1/n₂) )

where
x̄₁: the mean of the sample taken from population 1
x̄₂: the mean of the sample taken from population 2
(μ₁ − μ₂): the difference between the population means, which in most cases equals 0

Step 4

The rejection region for x̄₁ − x̄₂ is

    t > t(α; df) for a one-tailed test
    |t| > t(α/2; df) for a two-tailed test

where
t: the calculated t-test statistic
t(α/2; df): the critical value at the α level of significance for a two-tailed test
t(α; df): the critical value at the α level of significance for a one-tailed test
df: the degrees of freedom, equal to (n₁ + n₂ − 2)


Step 5

The confidence interval for (μ₁ − μ₂) is

    (x̄₁ − x̄₂) ± t(α/2; df) √( sp² (1/n₁ + 1/n₂) )

where
x̄₁: the mean of the sample taken from population 1
x̄₂: the mean of the sample taken from population 2
t(α/2; df): the critical value, with df = n₁ + n₂ − 2
√( sp² (1/n₁ + 1/n₂) ): the standard error of x̄₁ − x̄₂
α: the level of significance
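If you have access to a computer, the five steps above can be mirrored in a few lines of code. The sketch below is in Python with SciPy (an assumption on our side; Python is not part of the prescribed manual/Excel/Minitab steps), and the sample values at the end are made up purely to show the mechanics.

    import numpy as np
    from scipy import stats

    def pooled_t_test(x1, x2, alpha=0.05, d0=0.0):
        """Two-tailed pooled-variance t-test and CI for mu1 - mu2 (equal variances assumed)."""
        x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
        n1, n2 = len(x1), len(x2)
        # Step 2: pooled variance
        sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
        se = np.sqrt(sp2 * (1 / n1 + 1 / n2))          # standard error of x1bar - x2bar
        diff = x1.mean() - x2.mean()
        t = (diff - d0) / se                            # Step 3: test statistic
        df = n1 + n2 - 2
        t_crit = stats.t.ppf(1 - alpha / 2, df)         # Step 4: critical value t(alpha/2; df)
        ci = (diff - t_crit * se, diff + t_crit * se)   # Step 5: confidence interval
        return t, df, t_crit, ci

    # hypothetical samples, only to illustrate the calculation
    print(pooled_t_test([23.1, 19.4, 21.7, 24.0, 20.2], [18.9, 22.3, 17.5, 19.8, 21.1, 18.2]))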

1.2.2 Independent samples when population variances are unequal: σ₁² ≠ σ₂²

When σ₁² ≠ σ₂² we cannot use the pooled variance sp². Instead, we estimate each population variance
with its own sample variance. There are 4 steps for inference based on the sampling distribution of x̄₁ − x̄₂.

Step 1: is the same as given in Section 1.2.1.

Step 2: the test statistic for x̄₁ − x̄₂ is

    t = [(x̄₁ − x̄₂) − (μ₁ − μ₂)] / √( s₁²/n₁ + s₂²/n₂ )

The sampling distribution can be approximated by a Student t-distribution with the degrees of freedom
df equal to

    df = ( s₁²/n₁ + s₂²/n₂ )² / [ (s₁²/n₁)²/(n₁ − 1) + (s₂²/n₂)²/(n₂ − 1) ]

Remark: Round df to the nearest integer.

Step 3: The rejection region is the same as provided in Section 1.2.1, using the above degrees of
freedom df.

Step 4: The confidence interval estimator of (μ₁ − μ₂) when σ₁² ≠ σ₂² is

    (x̄₁ − x̄₂) ± t(α/2; df) √( s₁²/n₁ + s₂²/n₂ )

The critical value t(α/2; df) or t(α; df) is obtained from Table 2.
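For the unequal-variances case, the calculation changes only in the test statistic and the degrees of freedom. A minimal sketch, again assuming Python with SciPy is available (it is not required for this module):

    import numpy as np
    from scipy import stats

    def welch_t_test(x1, x2, alpha=0.05):
        """Unequal-variances t-test for H0: mu1 = mu2 (two-tailed), with the df formula above."""
        x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
        n1, n2 = len(x1), len(x2)
        v1, v2 = x1.var(ddof=1) / n1, x2.var(ddof=1) / n2       # s1^2/n1 and s2^2/n2
        t = (x1.mean() - x2.mean()) / np.sqrt(v1 + v2)
        df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
        df = round(df)                                          # round df to the nearest integer
        t_crit = stats.t.ppf(1 - alpha / 2, df)                 # critical value from the t-table
        return t, df, t_crit

The built-in call stats.ttest_ind(x1, x2, equal_var=False) performs the same test directly, except that it does not round the degrees of freedom.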

Table 1: Cumulative Standardized Normal Probabilities

Table 2: Critical values of t

Activity 1.1
Say whether the following statements are correct or incorrect and try to rectify the incorrect
statements to make them true.

(a) s²pooled = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2) = [Σ(x₁ᵢ − x̄₁)² + Σ(x₂ᵢ − x̄₂)²] / (n₁ + n₂ − 2)

.............................................................................................................................................................

.............................................................................................................................................................

(b) If we derive a confidence interval for (μ₁ − μ₂) we use SE = √( s₁²/n₁ + s₂²/n₂ ),
but if we test H₀: μ₁ = μ₂ we use SE = √( s²pooled (1/n₁ + 1/n₂) ).
n1 n2

.............................................................................................................................................................

.............................................................................................................................................................

(c) In a one-tailed test for the difference between two population means, (μ₁ − μ₂), if the null
hypothesis is rejected when the alternative hypothesis, H₁: μ₁ < μ₂, is false, a Type I error
is committed.

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

Feedback
(a) Correct. With a little algebraic manipulation it follows from the definitions of
s₁² = Σ(x₁ᵢ − x̄₁)²/(n₁ − 1) and s₂² = Σ(x₂ᵢ − x̄₂)²/(n₂ − 1) that (n₁ − 1)s₁² = Σ(x₁ᵢ − x̄₁)²
and that (n₂ − 1)s₂² = Σ(x₂ᵢ − x̄₂)².

(b) Incorrect. We use SE = √( s²pooled (1/n₁ + 1/n₂) ) for both the hypothesis test and the confidence
interval!

(c) Correct.

You will find that in most of the exercises on this section, whether they are for an assignment, the
examination or practice, the information you have to work with will either be

raw data for two samples, or


summarised data given in a table format as

                          Population 1    Population 2
    Sample size           n₁              n₂
    Sample mean           x̄₁             x̄₂
    Sample variance       s₁²             s₂²

There could be "variations" on the theme of summarised data where computed sums are given
instead of sample statistics, e.g. Σx₁ᵢ instead of x̄₁, or Σx₁ᵢ² and Σx₁ᵢ instead of s₁².
In the case of raw data, you must try to have at least a scientific pocket calculator with statistical
functions that will enable you to compute the sample statistics.

Activity 1.2
Question 1

Psychologists have claimed that the scores on a tolerance measurement scale have a normal
distribution. Suppose that this scale is administered to two independent random samples of males
and females and their tolerance towards other road users is measured. (The higher the score, the
more tolerant you are.) The following scores were obtained:

Males: 12 8 11 14 10
Females: 15 12 14 11 13 14 12

(a) Test H₀: μmales = μfemales against the alternative H₁: μmales ≠ μfemales.
Use α = 0.01 and assume that σ₁² = σ₂².

............................... ............................... ............................... ............................... ...............................

(b) Compute a 99% confidence interval for the difference (μmales − μfemales). How do you interpret
this interval?

.............................................................................................................................................................

(c) What can you conclude from questions (a) and (b)?

.............................................................................................................................................................

Question 2

You and some friends have decided to test the validity of an advertisement by a local pizza restaurant,
which says it delivers to the dormitories faster than a local branch of a national chain. Both the local
pizza restaurant and the national chain are located across the street from your college campus. You define the
variable of interest as the delivery time, in minutes, from the time the pizza is ordered to when it is
delivered. You collect the data by ordering 10 pizzas from the local pizza restaurant and 10 pizzas
from the local branch of the chain at different times. The data for the delivery times are given below:

Chain x₁: 22  15.2  18.7  15.6  20.8  19.5  17  19.5  16.5  24
Local x₂: 16.8  11.7  15.6  16.7  17.5  18.1  14.1  21.8  13.9  20.8

The summary statistics are shown below:

n₁ = 10   x̄₁ = 18.88   s₁² = 8.2151
n₂ = 10   x̄₂ = 16.70   s₂² = 9.5822

At the 0.05 level of significance, is there evidence that the mean delivery time for the local pizza
restaurant is less than the mean delivery time for the national pizza chain?

(a) Test the null hypothesis H₀: μ₁ = μ₂ against the alternative H₁: μ₁ > μ₂ (μ₁ is the population
mean for the chain and μ₂ is the population mean for the local restaurant). Use α = 0.05 and assume that the
population variances are equal. Interpret the results obtained.

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

(b) Calculate a 95% confidence interval for the difference of the means.

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

Question 3

Assume that you have a sample of n₁ = 8, with sample mean x̄₁ = 42 and sample standard deviation
s₁ = 4, and you have an independent sample of n₂ = 15 from another population with sample mean
x̄₂ = 34 and sample standard deviation s₂ = 5.

(a) Using the unequal-variances approach at α = 0.01, test the null hypothesis H₀: μ₁ = μ₂ against
H₁: μ₁ ≠ μ₂. Interpret the results obtained.

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

(b) Calculate the 99% confidence interval for the difference (μ₁ − μ₂).

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

Feedback

Question 1

(a) Step 1:
We have to test H₀: μmales = μfemales, i.e. H₀: (μ₁ − μ₂) = 0,
against H₁: μmales ≠ μfemales, i.e. H₁: (μ₁ − μ₂) ≠ 0.
Step 2:
The data are
Males x₁: 12 8 11 14 10
Females x₂: 15 12 14 11 13 14 12
Using a scientific calculator we obtain:
n₁ = 5 and n₂ = 7
x̄₁ = Σx₁ᵢ/n₁ = 55/5 = 11
x̄₂ = Σx₂ᵢ/n₂ = 91/7 = 13
The sample variances are
s₁² = Σ(x₁ᵢ − x̄₁)²/(n₁ − 1) = 5
s₂² = Σ(x₂ᵢ − x̄₂)²/(n₂ − 1) = 2

The pooled variance sp² is

sp² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2)
    = [(5 − 1)(5) + (7 − 1)(2)] / (5 + 7 − 2) = (20 + 12)/10 = 32/10 = 3.2
Step 3:
The test statistic for x̄₁ − x̄₂ is
t = [(x̄₁ − x̄₂) − (μ₁ − μ₂)] / √( sp² (1/n₁ + 1/n₂) )
  = [(11 − 13) − 0] / √( 3.2 (1/5 + 1/7) )
  = −2 / 1.0474
  = −1.9095

Step 4:
The rejection region is |t| > t(α/2; n₁+n₂−2) for a two-tailed test, because H₁ uses the symbol ≠.
|t| > t(0.01/2; 5+7−2)
|t| > t(0.005; 10)
|t| > 3.169 (using Table 2)
The critical value is t(0.005; 10) = 3.169.

Step 5:
Decision rule

If the absolute value of the test statistic is greater than the critical value, reject the null hypothesis H₀ at the α level of
significance.
If the absolute value of the test statistic is smaller than the critical value, we fail to reject H₀.

Since the test statistic (−1.9095) lies between the critical values, that is −3.169 < −1.9095 <
3.169, we fail to reject the null hypothesis H₀. The conclusion is that there is not a significant difference
between the means of the males and the females.

(b) (x̄₁ − x̄₂) ± t(α/2; n₁+n₂−2) √( s²pooled (1/n₁ + 1/n₂) ) = (11 − 13) ± (3.169) √( 3.2 (1/5 + 1/7) )
= −2 ± (3.169)(1.0474)
= −2 ± 3.3194
= (−5.3194; 1.3194).

We are 99% confident that the unknown difference (μmales − μfemales) will be between −5.3194
and 1.3194. We see that (−5.3194; 1.3194) includes the null value, which implies that we are 99%
confident that the mean for the males is the same as the mean for the females.

[Extra explanation: We translate the phrase "the mean for the males is the same as the mean for
the females" as μmales = μfemales, which is in general μ₁ = μ₂. But if μmales = μfemales it implies
that (μ₁ − μ₂) = 0.
So, to conclude that μmales = μfemales we have to check whether zero is included in the confidence
interval.]

(c) We conclude from questions (a) and (b) that using a two-sided confidence interval and performing
a two-sided hypothesis test must always lead to the same conclusion because it is a different
"juggle" of the same information! This is indeed the case with this exercise!

You will find that in most of the exercises on this section, whether they are for an assignment, the
examination or exercises in Keller or any other textbook, we will simply state: " Assume that.....blah-
blah-blah" and then we conveniently take care of the assumptions of normality and equal variances!
But, strictly speaking, we should have first checked whether these conditions are met before we
proceed with the test.
There exist additional preliminary tests where we can formally test for normality and for the equality
of variances. The tests for normality are covered in detail in your second-year statistics syllabus.
Most statistical packages will provide you with a statistical test to formally test H₀: σ₁² = σ₂². In
the module STA2601 you will be formally introduced to the statistical package JMP. In case you do
not continue with statistics but anyhow apply your first-year knowledge using a statistical package of
your own choice, be aware that most statistical software packages will automatically include a test
for the equality of variances when you request to do a test for means! (This also happens when you
request to do an ANOVA test for means – a procedure you will learn about in the following study unit.)
The output for the test for the equality of variances will be a so-called F -test. An F-test, in general, is
basically the ratio of two quantities – in this application two variances. The p-value associated with
the F -test could be interpreted exactly like you have learned to do for any other test. If it is significant
(i.e. p-value < α) you will reject H₀: σ₁² = σ₂².

Question 2

The summary statistics are:


n₁ = 10   x̄₁ = 18.88   s₁² = 8.2151
n₂ = 10   x̄₂ = 16.70   s₂² = 9.5822
H₀: μ₁ − μ₂ = 0 against H₁: μ₁ − μ₂ > 0; this is a one-tailed test.

(a) Step 1
The pooled variance sp² is
sp² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2)
    = [(10 − 1)(8.2151) + (10 − 1)(9.5822)] / (10 + 10 − 2)
    = (73.9359 + 86.2398) / 18
    = 160.1757 / 18
    = 8.8987

Step 2
The test statistic for x̄₁ − x̄₂ is
t = (x̄₁ − x̄₂) / √( sp² (1/n₁ + 1/n₂) )
  = (18.88 − 16.70) / √( 8.8987 (1/10 + 1/10) )
  = 2.18 / 1.3341
  = 1.6341

Step 3
The rejection region for x̄₁ − x̄₂ is
t > t(α; n₁+n₂−2) (for a one-tailed test)
t > t(0.05; 10+10−2)
t > t(0.05; 18)
t > 1.734
The critical value is t(0.05; 18) = 1.734.
Step 4
Decision rule
Since the test statistic (1.6341) is less than the critical value (1.734), we fail to reject H₀ at the 5%
level of significance. Therefore we conclude that there is insufficient evidence to reject the null
hypothesis H₀. Based on the results, there is insufficient evidence for the local pizza restaurant
to make the advertising claim that it has a faster delivery time.

(b) The 95% confidence interval for (μ₁ − μ₂) is

(x̄₁ − x̄₂) ± t(α/2; n₁+n₂−2) √( sp² (1/n₁ + 1/n₂) )
(18.88 − 16.7) ± t(0.05/2; 10+10−2) √( 8.8987 (1/10 + 1/10) )
2.18 ± t(0.025; 18) (1.3341)
2.18 ± (2.101)(1.3341)
2.18 ± 2.8029
(2.18 − 2.8029; 2.18 + 2.8029)
(−0.6229; 4.9829)

Therefore, we are 95% confident that the difference in mean delivery time between the national pizza
chain and the local pizza restaurant is between −0.6229 and 4.9829 minutes. From a

hypothesis testing perspective, using a two-tailed test at the 5% level of significance, because the
interval does include zero, we fail to reject the null hypothesis of no difference between the means
of the two populations.
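The same arithmetic can be checked from the summary statistics alone. A short sketch, assuming Python with SciPy (again, not part of the prescribed manual/Excel/Minitab approach):

    import numpy as np
    from scipy import stats

    n1, xbar1, s1_sq = 10, 18.88, 8.2151   # national chain
    n2, xbar2, s2_sq = 10, 16.70, 9.5822   # local restaurant

    sp2 = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)   # pooled variance
    t = (xbar1 - xbar2) / np.sqrt(sp2 * (1 / n1 + 1 / n2))        # test statistic
    t_crit = stats.t.ppf(1 - 0.05, n1 + n2 - 2)                   # one-tailed critical value
    print(round(t, 4), round(t_crit, 3))                          # about 1.634 and 1.734, so H0 is not rejected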

Question 3
Solutions

n₁ = 8    x̄₁ = 42   s₁ = 4    α = 0.01
n₂ = 15   x̄₂ = 34   s₂ = 5

(a) Step 1
H₀: μ₁ = μ₂ against H₁: μ₁ ≠ μ₂. The two population variances are unequal (σ₁² ≠ σ₂²) and this is a
two-tailed test.
Step 2
The test statistic for x̄₁ − x̄₂ is
t = (x̄₁ − x̄₂) / √( s₁²/n₁ + s₂²/n₂ )
  = (42 − 34) / √( (4)²/8 + (5)²/15 )
  = 8 / √3.6667
  = 8 / 1.9149
  = 4.1778

Step 3:
The degrees of freedom df are
df = ( s₁²/n₁ + s₂²/n₂ )² / [ (s₁²/n₁)²/(n₁ − 1) + (s₂²/n₂)²/(n₂ − 1) ]
   = (16/8 + 25/15)² / [ (2)²/7 + (1.6667)²/14 ]
   = (2 + 1.6667)² / (0.5714 + 0.1984)
   = 13.4444 / 0.7698
   = 17.4648
Rounding df to the nearest integer gives df = 17.

Step 4
The rejection region is
|t| > t(α/2; df) (for a two-tailed test, α is divided by 2)
|t| > t(0.01/2; 17)
|t| > t(0.005; 17)
|t| > 2.898
The critical value at α = 0.01 is 2.898.

Step 5
Interpretation
Since the test statistic (4.1778) is greater than the critical value (2.898), we reject H₀ at the 1% level
of significance and therefore we conclude that there is sufficient evidence to infer that the
means of the two populations differ.
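The same unequal-variances calculation from summary statistics, as a hedged Python sketch (again assuming SciPy is available):

    import numpy as np
    from scipy import stats

    n1, xbar1, s1 = 8, 42.0, 4.0
    n2, xbar2, s2 = 15, 34.0, 5.0

    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (xbar1 - xbar2) / np.sqrt(v1 + v2)                          # test statistic
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))  # Welch-type degrees of freedom
    t_crit = stats.t.ppf(1 - 0.01 / 2, round(df))                   # t(0.005; 17)
    print(round(t, 4), round(df, 4), round(t_crit, 3))              # about 4.1778, 17.46 and 2.898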

1.3 Observational and Experimental Data


The equal-means techniques require that the populations be normally distributed.
Both the equal-variances and unequal-variances techniques require that the populations be
normally distributed.
To check whether the normality requirement is satisfied we can draw histograms of the data, so
that we can be confident in the validity of the results.
A normal distribution has a bell shape.
When the normality requirement is not satisfied, we can use a nonparametric technique: the
Wilcoxon Rank Sum Test.
When the p-value is less than α (the level of significance), or when the value of the test statistic
is greater than the critical value, the null hypothesis H₀ is rejected at the α level of significance.
As a result we conclude that there is sufficient evidence to infer (here we complete the sentence
with the alternative hypothesis H₁ statement).
When the p-value is greater than α, or when the value of the test statistic is less than the critical
value, H₀ is not rejected at the α level of significance. As a result, we conclude that there is not
enough evidence to infer (again we complete the sentence with the H₁ statement).
Experimental data are usually more expensive to obtain because of the planning required to set
up the experiment.
Observational data usually require less work to gather because we can take a random sample of
each population.
Both observational and experimental data can be used to test the difference between two means;
the issue is how the data are obtained under each approach. This is most relevant to the
interpretation of the statistical techniques.

1.4 Inference about the Difference Between Two Population Means: Matched Pairs Experiment
Have you noticed that when we derived the sampling distribution of (x̄₁ − x̄₂), we used the fact that
E(x̄₁ − x̄₂) = (μ₁ − μ₂) (the minus sign stays), but that var(x̄₁ − x̄₂) = s₁²/n₁ + s₂²/n₂ (the
minus sign disappears)?

(Yes, there is a plus sign even though you might expect a minus sign!) In other words, if we create
a new variable by subtracting two variables, the variance of this new variable will – provided they
are independently distributed – be the sum of the variances of the two original variables.

Strictly speaking there is (in general) a third term that takes care of the dependency between the two
variables. We did not even bother to mention it in section 1.1 because this dependency term falls
away if we assume that X and Y are independent.

However, if we cannot assume that we have two samples from two independent populations, we
have a problem with var(x̄₁ − x̄₂):
using σ̂² = [Σ(x₁ᵢ − x̄₁)² + Σ(x₂ᵢ − x̄₂)²] / (n₁ + n₂ − 2) = s²pooled is not valid any more!
So, whenever there is a "connectedness" between one set of values (sample 1) and the second
set of values (sample 2), we could take care of the dependency by treating the data as matched
pairs. We remove the dependency by reducing the two samples to one set of scores. This would
immediately imply that n1 = n2 :

Thus, we create a single random sample by taking the paired differences dᵢ = x₁ᵢ − x₂ᵢ. With a little
adaptation (and imagination) we are now back to the set-up discussed in STA1501 (depending on
whether we consider the sample as having a known or unknown population variance!) for topics
such as:

Testing the Population Mean when the Population Standard deviation is known
and
Inference about a Population Mean when the standard deviation is unknown.

Comparing the means of two dependent data sets is always a separate choice (or sub-menu
in computer jargon) of the test procedures available for testing means (main-menu in computer
jargon) in any statistical software package. It is generally known as a “paired samples t-test” and
observations of a single sample, obtained by first taking the differences, are used.

Now the formula for the test statistic is

    t = (x̄D − 0) / ( sD / √nD )

where
x̄D: the mean of the differences between the paired observations, x̄D = (1/nD) Σ dᵢ
sD: the standard deviation of the differences dᵢ
nD: the number of paired observations.

For dependent observations, the hypothesis test for the difference between the two means therefore
boils down to the hypothesis test for a single sample.
H₀: μX = μY is the same as H₀: μD = 0.

It is interesting to note that in the paired observations test, the degrees of freedom are half of what
they are if the samples are not paired. (When the samples are not paired two kinds of variation are
present: differences among the groups and differences among the subjects.)
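The paired test is therefore just a one-sample t-test applied to the differences. A minimal sketch in Python (an illustration only, assuming SciPy is available; the module itself only requires the manual or Excel/Minitab steps):

    import numpy as np
    from scipy import stats

    def paired_t_test(x, y, alpha=0.05):
        """Matched pairs t-test of H0: mu_D = 0 (two-tailed) using the differences d = x - y."""
        d = np.asarray(x, float) - np.asarray(y, float)   # reduce the two samples to one set of scores
        n = len(d)
        t = d.mean() / (d.std(ddof=1) / np.sqrt(n))       # xbar_D / (s_D / sqrt(n_D))
        t_crit = stats.t.ppf(1 - alpha / 2, n - 1)        # n_D - 1 degrees of freedom
        return t, n - 1, t_crit

The built-in equivalent is stats.ttest_rel(x, y), which returns the same test statistic together with a two-sided p-value.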

Activity 1.3
Say whether the following statements are correct or incorrect and try to rectify the incorrect
statements to make them true.

(a) Repeated measurements from the same individuals constitute an example of data collected from
matched pairs experiment.

.............................................................................. ..............................................................................

.............................................................................................................................................................

(b) The number of degrees of freedom associated with the t-test, when the data are gathered from a
matched pairs experiment with 8 pairs, is 7.

.............................................................................................................................................................

.............................................................................................................................................................

(c) The matched pairs experiment always produces a larger test statistic than the independent samples
experiment.

.............................................................................................................................................................

.............................................................................................................................................................

(d) In comparing two population means of interval data, we must decide whether the samples are
independent (in which case the parameter of interest is μ₁ − μ₂) or matched pairs (in which case
the parameter is μD) in order to select the correct test statistic.

.............................................................................................................................................................

.............................................................................................................................................................

(e) When comparing two population means using data that are gathered from a matched pairs
experiment, the test statistic for μD has a Student t-distribution with ν = nD − 1 degrees of
freedom, provided that the differences are normally distributed.

.............................................................................................................................................................

.............................................................................................................................................................

Feedback

(a) Correct.

(b) Correct.

(c) Incorrect. We may say that the matched pairs produce a smaller estimated SE because we
eliminate the often considerable variability due to individual variation in the separate samples.

(d) Correct.

(e) Correct.

Activity 1.4
Suppose that person A believes that sons, upon maturity, are in general taller than their fathers.
Person B, on the other hand, argues that the opposite is true. In order to investigate this issue, we
measure the heights of a random sample of nine father-son pairs. The following are the results (in
cm):

Pair 1 2 3 4 5 6 7 8 9
Son 185 173 168 178 188 173 165 183 175
Father 180 175 160 178 183 175 160 173 178

(a) Perform the appropriate test to solve this issue. Use α = 0.05.

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

(b) Find a 95% confidence interval estimate for (μ₁ − μ₂), the mean difference in heights of fathers
and sons.

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

Feedback

We have dependent (paired) observations and we need to work with the differences of the pairs,

dᵢ = height of son − height of father.

Pair         1    2    3    4    5    6    7    8    9
Son          185  173  168  178  188  173  165  183  175
Father       180  175  160  178  183  175  160  173  178
Differences  5    −2   8    0    5    −2   5    10   −3

The data to use in the matched pairs test are the differences:

dᵢ = 5, −2, 8, 0, 5, −2, 5, 10, −3

(a) Step 1
The hypotheses are
H₀: μD = 0 against H₁: μD ≠ 0
Step 2
The test statistic is
t = (x̄D − μD) / ( sD / √n )
where
the sample mean is x̄D = Σdᵢ/n = 26/9 = 2.8889
and the sample variance is
sD² = Σ(dᵢ − x̄D)²/(n − 1)
    = [(5 − 2.8889)² + (−2 − 2.8889)² + .... + (−3 − 2.8889)²] / (9 − 1)
    = 180.8889 / 8
    = 22.6111
The sample standard deviation is sD = √22.6111 = 4.7551.
We can also use a scientific calculator to calculate the mean x̄D and standard deviation sD.
Therefore the test statistic is
t = (x̄D − 0) / (sD/√n) = 2.8889 / (4.7551/√9) = 2.8889 / 1.5850 = 1.8226

Step 3
The rejection region is
|t| > t(α/2; n−1) for a two-tailed test
t > t(α; n−1) for a one-tailed test.
Therefore |t| > t(0.05/2; 9−1)
|t| > t(0.025; 8)
|t| > 2.306 (using Table 2).
The critical value at the 5% level is 2.306.
Step 4
Decision rule
Reject the null hypothesis H₀ when the absolute value of the test statistic t is greater than the critical value.
Since the test statistic (1.8226) is less than the critical value (2.306), we cannot reject H₀ at the
5% level of significance.
Conclusion: The heights of sons and fathers do not differ significantly at the 5% level of
significance.

(b) The 95% confidence interval is

x̄D ± t(α/2; n−1) (sD/√n)
2.8889 ± t(0.05/2; 8) (4.7551/√9)
2.8889 ± t(0.025; 8) (1.5850)
2.8889 ± (2.306)(1.5850)
2.8889 ± 3.6550
(2.8889 − 3.6550; 2.8889 + 3.6550)
(−0.7661; 6.5439)

Conclusion: We are 95% confident that the mean difference in heights of sons and fathers is
between −0.7661 and 6.5439. (Sons seem to be taller than their fathers, but not significantly so.)
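If you would like to verify the above on a computer (optional; assumes Python with SciPy), the built-in paired test gives the same value:

    from scipy import stats

    sons    = [185, 173, 168, 178, 188, 173, 165, 183, 175]
    fathers = [180, 175, 160, 178, 183, 175, 160, 173, 178]

    t, p = stats.ttest_rel(sons, fathers)   # matched pairs (paired) t-test, two-sided
    print(round(t, 4), round(p, 4))         # t is about 1.82; not significant at the 5% level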

Activity 1.5
Question 1
In testing the hypothesis H₀: μD = 5 vs. H₁: μD > 5, two random samples from two
dependent normal populations produced the following statistics: x̄D = 9, nD = 20, and sD = 7.5.
What conclusion can we draw at the 1% significance level?

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

Question 2
Promotional Campaigns
The general manager of a chain of fast food chicken restaurants wants to determine how effective
their promotional campaigns are. In these campaigns “20% off” coupons are widely distributed.
These coupons are only valid for one week. To examine their effectiveness, the executive records
the daily gross sales (in R1000’s) in one restaurant during the campaign and during the week after
the campaign ends. The data is shown below.

Day          Sales during campaign    Sales after campaign
Sunday       18.1                     16.6
Monday       10.0                     8.8
Tuesday      9.1                      8.6
Wednesday    8.4                      8.3
Thursday     10.8                     10.1
Friday       13.1                     12.3
Saturday     20.8                     18.9
(a) Can they infer at the 5% significance level that sales increase during the campaign?

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

(b) Find the 95% confidence interval for the difference in sales during the week.

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

(c) What can you conclude from the answers in (a) and (b)?

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

Feedback

Question 1

The given information:

The null hypothesis H₀: μD = 5 against the alternative H₁: μD > 5.
Two random samples from dependent normal populations.
The sample mean x̄D = 9.
The sample size nD = 20.
The standard deviation sD = 7.5.

Step 1
The test statistic is
t = (x̄D − μD) / (sD/√n) = (9 − 5) / (7.5/√20) = 4 / 1.6771 = 2.3851
Step 2
The rejection region is
t > t(α; n−1) for a one-tailed test
t > t(0.01; 20−1)
t > t(0.01; 19)
t > 2.539 (using Table 2)
The critical value at the 1% level of significance is 2.539.

Step 3
Decision rule
The test statistic t follows a t(n − 1) distribution. Since the test statistic (2.3851) is less than the critical
value (2.539), we fail to reject the null hypothesis H₀ at the 1% level of significance.
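A quick computational check of this answer (a hedged sketch assuming Python with SciPy; the manual approach above remains the prescribed one):

    from scipy import stats

    xbar_d, n_d, s_d, d0, alpha = 9.0, 20, 7.5, 5.0, 0.01

    t = (xbar_d - d0) / (s_d / n_d ** 0.5)     # (9 - 5) / (7.5 / sqrt(20))
    t_crit = stats.t.ppf(1 - alpha, n_d - 1)   # one-tailed critical value t(0.01; 19)
    print(round(t, 4), round(t_crit, 3))       # about 2.385 and 2.539, so H0 is not rejected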

Question 2

We have dependent (paired) observations and we need to work with the differences of the pairs.

Day                      1     2     3    4    5     6     7
Sales during campaign    18.1  10.0  9.1  8.4  10.8  13.1  20.8
Sales after campaign     16.6  8.8   8.6  8.3  10.1  12.3  18.9
Differences              1.5   1.2   0.5  0.1  0.7   0.8   1.9

The data to use in the matched pairs test are

dᵢ = 1.5, 1.2, 0.5, 0.1, 0.7, 0.8, 1.9

The sample mean x̄D = 0.9571
The standard deviation sD = 0.6161
The sample size nD = 7.

(a) Step 1
The null hypothesis H₀: μD = 0 against the alternative H₁: μD > 0.
Step 2
The test statistic is
t = (x̄D − μD) / (sD/√nD) = (0.9571 − 0) / (0.6161/√7) = 0.9571 / 0.2329 = 4.1095
Step 3
The decision rule
Reject H₀ if the test statistic is greater than the critical value.
Since the value of the test statistic (4.1095) is greater than the critical value t(α; n−1) =
t(0.05; 7−1) = t(0.05; 6) = 1.943, we reject the null hypothesis H₀ at the 5% level of significance.
The conclusion: the sales increase during the campaign.

(b) The 95% confidence interval is

x̄D ± t(α/2; n−1) (sD/√nD)
0.9571 ± t(0.05/2; 6) (0.6161/√7)
0.9571 ± t(0.025; 6) (0.2329)
0.9571 ± (2.447)(0.2329)
(0.9571 − 0.5699; 0.9571 + 0.5699)
(0.3872; 1.5270)

The null hypothesis H₀ is rejected at the 5% level of significance because zero lies outside
the confidence limits.
We are 95% confident that the mean difference in sales is between 0.3872 and 1.5270 thousand
rand.

(c) We can estimate that the daily sales during the campaign increase on average by between 0.3872
and 1.527 thousand rand.

1.5 Inference about the Ratio of Two Variances


The focus in this section is to determine whether two independent populations have the same
variability. By testing variances, we can detect differences in the variability in two independent
populations. One of the important reason to test for the differences between the variances of
two populations is to determine whether to use the pooled–variance t test (which assumes equal
variances) or the separated variance t test (which assume unequal variances) while comparing the
means of the two independent populations.
You have to know the following points:

The sample variance is an unbiased, consistent estimator of the population variance.


The sampling took place independently from two normal populations.
The test for the difference variances of the two independent populations is based on the ratio of
the two sample variances.
The test statistic follows the F –distribution. That means

S12
F =
S22

where S12 : The variance of sample 1.


S22 The variance of sample 2.

S12 2
1
The statistics is the estimator of the parameter 2:
S22 2
The rejection region is
F > F( ; n1 1; n2 1) : For one–tailed test.
F > F( ; n1 1; n2 1) : For two–tailed test.
2
29 STA1502/1

where
: The level of significance

n1 : The sample size selected from population 1.

n2 : The sample size selected from population 2.

n1 1: The degrees of freedom from sample 1. (That is the


numerator degrees of freedom.)

n2 1: The degrees of freedom from sample 2. (That is the


denominator degrees of freedom.)
The F –test statistic follows an F –distribution with n1 1 and n2 1 degrees of freedom.
The critical value of the F –distribution is
F( ; n1 1; n2 1) : For one–tailed test.
F( ; n1 1; n2 1) : For two–tailed test.
2

The hypothesis testing is


The null hypothesis testing H0 : 2 = 2 against one of the alternative.
1 2

H1 : 2 6= 2 : For a two–tailed test.


1 2

OR
H1 : 2 < 2 : For a one–tailed test.
1 2

H1 : 2 > 2 : For a one–tailed test.


1 2

The decision rule


– Reject the null hypothesis H0 if the F test statistic is greater than the critical value. We denote
by reject H0 if F > F( ; n1 1; n2 1) for a two–tailed test or F > F( ; n1 1; n2 1) for one–tailed
2

test.
– Otherwise, do not reject H0 :
2
1
The confidence interval of 2 is
2
– The Lower Confidence Limit (LCL) is

s21 1
LCL = :
s22 F( ; n1 1; n2 1)
2

– The Upper Confidence Limit (UCL) is

S12
U CL = F( ; n2 1; n1 1) :
S22 2

Note that the statistic tables give values for F( ; n1 1; n2 1) 6= F( ; n2 1; n1 1) you must therefore
2 2

make sure that you know what to use for the upper and lower limits in the assignment or
examination and the read off the correct value from the table.
30

Table 3A Critical values of the F-distribution: A = 0.05


31 STA1502/1

Table 3A Critical values of the F-distribution: A = 0.05 (continued)


32

Table 3B Critical values of the F-distribution: A = 0.025


33 STA1502/1

Table 3B Critical values of the F-distribution: A = 0.025 (continued)


34

Table 3C Values of the F-distribution: A = 0.01


35 STA1502/1

Table 3C Values of the F-distribution: A = 0.01 (continued)


36

Activity 1.6
Question 1
2
1
In constructing a 90% interval estimate for the ratio of two population variances, 2, two independent
2
samples of sizes 40 and 60 are drawn from the populations. If the sample variances are 515 and 920,
then the lower confidence limit is:

1. 0:244

2. 0:352

3. 0:341

4. 0:890

5. 0:918

Question 2

An experimenter is concerned that variability of responses using two different experimental


procedures may not be the same. He randomly selects two samples of 16 and 14 responses from
two normal populations and gets the statistics: S12 = 55, and S22 = 118, respectively.

(a) Do the sample variances provide enough evidence at the 10% significance level to infer that the
two population variances differ?

(b) Estimate with 90% confidence the ratio of the two population variances.

(c) Describe what the interval estimate tells you and briefly explain how to use the interval estimate
to test the hypotheses.

Feedback Feedback

Question 1
S12 1
The formula for the LCL is and you have to substitute the correct values into this
S22 F
2; 1; 2

formula.

S12 515
=
S22 920
= 0:5598:
37 STA1502/1

Go to the F -table (Table 3A) with heading 0:05 (because = 0:1 and you need ) and where the
2
values for 40 and 60 meet, you will read off the value 1:59.

The critical value F( ; df1 ; df2 ) = F(0:05; 40 1; 60 1) = F(0:05; 39; 59) = 1:5 (the nearest)
2

S12 1 1
= 0:5598
S22 F( ;df1 ;df2 ) 1:59
2
= 0:3521 which is option 2
= 0:3521

Question 2
2 2
1 1 1 2 2 2
(a) H0 : 2 = 1 versus H1 : 2 6= 1 or H0 : = versus H1 : 1 6= 2
2 2

1 1
Rejection region:F > F0:05;15;13 = 2:53 or F < F0:95;13;15 = = 0:408
F0:05;13;15 2:45

55
Test statistics: F = = 0:466
118

Conclusion: We don’t reject the null hypothesis H0 . No, the sample variances don’t provide
enough evidence at the 10% significance level to infer that the two population variances differ

(b) The 90% confidence interval for the ratio of the two population variances:

S12 1 df1 = n1 1 = 16 1 = 15
LCL =
S22 F df2 = n2 1 = 14 1 = 13
2 ;df1 ;df2
55 1
=
118 F0:05;15;13
1
= 0:4661
2:53
= 0:1842
S12
U CL = F
S22 2 ;df2 ;df1
55
= F0:05;13;15
118
= 0:4661 2:45
= 1:1419
2
1
(c) We estimate that the ratio 2 lies between 0:1842 and 1:1419. Since the hypothesized value 1 is
2
included in the 90% interval estimate, we fail to reject the null hypothesis at = 0:10.
38

1.6 Self-correcting Exercises for Unit 1


Question 1
In random samples of 25 from each of two normal population, we found the following statistics:
X 1 = 524 S1 = 129
X 2 = 469 S2 = 141
(a) Estimate the difference between the two population means with 95% confidence.

(b) Repeat (a) increasing the standard deviations to S1 = 225 and S2 = 260:

(c) Describe what happens when the sample standard deviations get larger.

(d) Repeat (a) with samples of size 100:

(e) Discuss the effects of increasing the sample size.

Question 2

Every month a clothing store conducts an inventory and calculates losses from theft. The store
would like to reduce these loses and is considering two methods. The first is to hire a security
guard, and the second is to install cameras. To help decide which method to choose, the manager
hired a security guard, and the second is to install cameras. To help decide which method to choose,
the manager hired a security guard for 6 months. During the next 6 month period, the store installed
cameras. The monthly losses were recorded and are listed here. The manager decided that because
the cameras were cheaper than the guard, he would install the cameras unless there was enough
evidence to infer that the guard was better. What the manager should do?
Security guard 355 284 401 398 477 254
Cameras 486 303 270 386 411 435
Question 3

How effective is an antilock braking system (ABS), which pumps very rapidly rather than lock and
thus avoid skids? As a test, a car buyer organized an experiment. He hit the brakes and using
a stop–watch, recorded the number of second it took to stop an ABS–equipped car and another
identical car without ABS. The speed when the brakes were applied and the number of seconds
each took to stop on dry pavement are listed here. Can we infer that ABS is better?
Speeds 20 25 30 35 40 45 50 55
ABS 3:6 4:1 4:8 5:3 5:9 6:3 6:7 7:0
Non–ABS 3:4 4:0 5:1 5:5 6:4 6:5 6:9 7:3
Question 4

In an effort to determine whether a new type of fertilizer is more effective than the type currently in
use, researchers took 23 two–acre plots of land scattered throughout the country. Each plot was
39 STA1502/1

divided into two equal–sized subplots, one of which was treated with the current fertilizer and the
other with the new fertilizer. What was planted, and the crop yields were measured.
Plot 1 2 3 4 5 6 7 8 9 10 11 12
Current fertilizer 56 45 68 72 61 69 57 55 60 72 75 66
New fertilizer 60 49 66 73 59 67 61 60 58 75 72 68

(a) Can we conclude at the 5% significance level that the new fertilizer is more effective than the
current one?

(b) Estimate with 95% confidence the difference in mean crop yields between the two fertilizers.

(c) What is the required condition(s) for the validity of the results obtained in parts (a) and (b)?

(d) Is the required condition(s) satisfied?

(e) Are these data experimental or observational. Explain.

(f) How should the experiment be conducted if the researchers believed that the land throughout the
country was essentially the same?

1.7 Solutions to Self-correcting Exercises for Unit 1

Question 1

The given information

Two independent normal populations.


The sample size: n1 = 25 and n2 = 25:
The sample means: X 1 = 524 and X 2 = 469:
The sample standard deviations.:
S1 = 129 and S2 = 141

(a) The 95% confidence interval is


s
1 1
X1 X2 t( ; n1 +n2 2) s2p +
2 n1 n2
The pooled variance s2p is
(n1 1) s21 + (n2 1) s22
s2p =
n1 + n2 2
(25 1) (129)2 + (25 1) (141)2
=
25 + 25 2
399384 + 477144
=
48
2 876528
sp = = 18261
48
40

The standard error SE(X X2) is


s
1 1
SE(X 1 X2) = s2p +
n1 n2
s
1 1
= 18261 +
25 25
p
= 1460:88
= 38:2215

The 95% confidence interval is


s
1 1
X1 X2 t( ; n1 +n2 2) s2p +
2 n1 n2
(524 469) t( 0:05 ; 25+25 2) 38:2215
2

55 t(0:025; 48) 38:2215


55 2:009 38:2215
55 76:7870
(55 76:7870; 55 + 76; 7870)
( 21; 787; 131; 787)

(b) Repeat (a) with s1 = 255 and s2 = 260:


The 95% confidence interval is
s
1 1
X1 X2 t( ; n1 +n2 2) s2p +
2 n1 n2
The pooled variance is
(n1 1) s21 + (n2 1) s22
s2p =
n1 + n2 2
(25 1) (255)2 + (25 1) (260)2
=
25 + 25 2
1560600 + 1622400
=
48
3183000
=
48
= 66312:5

The standard error SE(X 1 X2) is


s
1 1
SE(X 1 X2) = s2p +
n1 n2
s
1 1
= 66312:5 +
25 25
p
= 5305
= 72:8354
41 STA1502/1

The 95% confidence interval is


X1 X2 t( ; n1 +n2 2) SE(X 1 X 2 )
2

(524 469) t( 0:05 ; 25+25 2) (72:8354)


2

55 t(0:025; 48) 72:8354


55 2:009 72:8354
55 146; 3263
(55 146:3263; 55 + 146; 3263)
( 91; 3263; 201; 3263)

(c) The interval widens if we increase the standard deviations.

(d) Repeat (a) with n1 = 100 and n2 = 100:


The 95% confidence interval is
s
1 1
X1 X2 t( ; n1 +n2 2) s2p +
2 n1 n2
s
1 1
(524 469) t( 0:05 ; 100+100 2) s2p +
2 100 100

The pooled variance s2p is


(n1 1) s21 + (n2 1) s22
s2p =
n1 + n2 2
(100 1) (255)2 + (100 1) (260)2
=
100 + 100 2
6437475 + 6692400
=
198
13129875
=
198
= 66312; 5

The standard error for X 1 X 2 is


s
1 1
SE(X 1 X2) = s2p +
n1 n2
s
1 1
= 66312:5 +
100 100
p
SE(X 1 X2) = 1326:25
= 36:4177
42

The 95% confidence interval is


X1 X2 t( ; n1 +n2 2) SE(X 1 X2)
2

(524 469) t(0:025; 198) 36:4177


55 1:972 36:4177
55 71:8157
(55 71:8157; 55 + 71:8157)
( 16:8157; 126:8157)

(e) The effects of increasing the sample size have narrowed the confidence interval.

Question 2

The given information.

The data are as follows:


Security Guard 355 284 401 398 477 254
Cameras 486 303 270 386 411 435
The summary statistics are
X 1 = 361:5 S1 = 82:2648 n1 = 6
X 2 = 381:8333 S2 = 81:5682 n2 = 6
= 0:01
Step 1
The hypothesis testing
H0 : 1 = 2 against H1 : 1 < 2 or
H0 : ( 1 2) = 0 against H1 : ( 1 2) <0

Step 2
The pooled variance for independent samples.
(n1 1) s21 + (n2 1) s22
s2p =
n1 + n2 2
(6 1) (82:2648)2 + (6 1) (81:5682)2
=
6+6 2
33837:4866 + 33266:8563
=
10
67104:3429
=
10
= 6710:4343
43 STA1502/1

Step 3
The test statistic is
X X2
t = r 1
s2p n11 + n12
361:5 381:8333
= q
6710:4343 16 + 16
20:3333
=
47:2949
= 0:4299

Step 4
The rejection region is
t > t( ; n1 +n2 2)

t > t(0:01; 6+6 2)

t > t(0:01; 10)


t > 2:764
Since the test statistic ( 0:4299) is negative the rejection region is t < 2:764: but because
0:4299 > 2:764; we fail to reject H0 :
The critical value at 1% is 2:764:
Step 5
The 99% confidence interval is
s
1 1
X1 X2 t( ; n1 +n2 2) s2p +
2 n1 n2
s
1 1
(361:5 381:8333) t( 0:01 ; 6+6 2) 6710:4343 +
2 6 6
20:3333 t(0:005; 10) 47:2949
20:3333 3:169 47:2949
20:333 149:8870
( 20:3333 149:8870; 20:3333 + 149:8870)
( 170:2203; 129:5537)

Making a decision
Since the confidence interval limits include zero, we fail to reject the null hypothesis at 1% level of
significance.
Conclusion: There is no enough evidence to reject the null hypothesis H0 :
44

Question 3

The given information.

The data are from dependent samples.


No 1 2 3 4 5 6 7 8
ABS 3:6 4:1 4:8 5:3 5:9 6:3 6:7 7:0
Non-ABS 3:4 4:0 5:1 5:5 6:4 6:5 6:9 7:3
Differences 0:2 0:1 0:3 0:2 0:5 0:2 0:2 0:3
The data to use are di = ABS Non–ABS.
di = 0:2 0:1 0:3 0:2 0:5 0:2 0:2 0:3

Step 1
The hypothesis testing
H0 : D = 0 against H1 : D > 0:
Step 2
The test statistic
XD
t=
SD
p
nD
The mean X D = 0:175:
The standard deviation SD = 0:2252:
The sample size nD = 8:
0:175 0:175
t= = = 2:1985
0:2252
p 0:0796
8

Step 3
The 95% confidence interval is
SD
X1 t( ; nD 1) p
2 nD
0:2252
0:175 t(0:025;7) p
8
0:175 2:365 0:0796
0:175 0:1883
( 0:175 0:1883; 0:175 + 0:1883)
( 0:3363; 0:0133)

Making a decision
Since zero lies between the confidence limits, we fail to reject the null hypothesis at 5% level of
significance.
Conclusion: There is no enough evidence that ABS performs better than the Non–ABS at the 5%
level of significance.
45 STA1502/1

Question 4

The given information.

The two samples are dependent, the matched pairs method is use.
The data are
Plot 1 2 3 4 5 6 7 8 9 10 11 12
Current fertilizer 56 45 68 72 61 69 57 55 60 72 75 66
New fertilizer 60 49 66 73 59 67 61 60 58 75 72 68
Differences di 4 4 2 1 2 2 4 5 2 3 3 2
di = current new fertilizer
The summary statistics calculated based on di data are
XD = 1 SD = 3:0151 nD = 12

(a) Step 1

The hypothesis testing


H0 : D = 0 against H1 : D < 0:
Step 2
The test statistic
XD 1 1
t = = =
SD 3:0151 0:8704
p p
nD 12
t = 1:1489

Step 3
The rejection region is
t > t( ;nD 1)

t > t(0:05; 11)


t > 1:796

Making a decision
Because the test statistic is negative the rejection region is t < 1:796:
Since 1:1489 > 1:796; we fail to reject H0 at 5% level of significance.
Conclusion: We may not infer that the new fertilizer is more effective than the current fertilizer.
46

(b) The 95% confidence interval is


SD
XD t( ; nD 1) p
2 nD
3:0151
1 t( 0:05 ;11) p
2 12
1 t(0:025;11) 0:8704
: 1 2:201 0:8704
1 1:9158
( 1 1:9158; 1 + 1:9158)
( 2:9158; 0:9158)

We are 95% confident that the mean difference in crop yield is between 2:9158 and 0:9158. Since
zero lies between the two confidence limits, therefore we can’t infer that the new fertilizer is more
effective than the current fertilizer.

(c) The required condition for validity of the results is that the differences are required to be normally
distributed.

(d) No, the required conditions is not satisfied because the histogram of the differences is not a bell
shape.

(e) The data are experimental design.

(f) The experiment design should be independent samples.

1.8 Learning Outcomes


Use the following learning objectives as a checklist after you have completed this study unit to
evaluate the knowledge you have acquired.

Can you

calculate the small-sample SE of (x1 x2 ) under the assumption that 2 = 2?


1 2

perform a small-sample statistical test for the difference between two population means in the
case of independent random samples?

derive a small-sample confidence interval for the difference between two population means
( 1 2) in the case of independent random samples?

explain the difference between independent samples and dependent samples?

apply Student’s t-distribution to a paired difference test?


47 STA1502/1

perform a small-sample statistical test for the difference between two population means in the
case of dependent random samples?

derive a small-sample confidence interval for the difference between two population means
( 1 2) in the case of dependent random samples?

use a confidence interval estimator to test hypotheses for the ration of two variances when two
independent samples are drawn from normal populations.

Key Terms/Symbols
t-distribution
F-distribution
degrees of freedom
dependent and independent random samples
paired difference test
48

STUDY UNIT 2
2.1 Introduction
In this study we tie some loose ends. We continue our inference about comparing two populations,
but we shift from means and comparing two variances to proportions. In the last section we move
back to means but extend it to more than two populations.

2.2 Inference about the Difference Between


Two Population Proportions
Comparing proportions from two independent populations is analogous to comparing means from
two independent populations.
The steps are:
Step 1

The null hypothesis is


H0 : P1 = P2 or H0 : P1 P2 = 0:
The alternative hypothesis is
H1 : P1 < P2 or H1 : P1 > P2 or H1 : P1 6= P2
(left–sided) (right–sided) (two–sided)
or equivalently:
H1 : P1 P2 < 0 or H1 : P1 P2 > 0 or H1 : P1 P2 6= 0
(left–sided) (right–sided) (two–sided)
where
P1 and P2 are the population proportions.
P1 is estimated by Pb1 and P2 is estimated by Pb2 :

Step 2

We are now sampling from two independent populations where the proportions of the populations
have a certain attribute.
X1
If Pb1 = is the proportion in a random sample size n1 from a population 1 with parameters P1
n1
where x1 is the number of items that satisfied condition in sample 1.
X2
Pb2 = is the proportion in a random sample of size n2 from a second independent population
n2
with parameter P2 where X2 is the number of items that satisfied condition in sample 2.
49 STA1502/1

Step 3

The test statistic is Z that we can use for this particular hypothesis:

Pb1 Pb2 (P1 P2 )


Z=s
1 1
P 1 P +
n1 n2

Z has an approximate normal distribution denoted by N (0; 1) where 0 is the mean and 1 is the
variance.
Z tests the null hypothesis H0 : P1 P2 = 0:
P is called the pooled–variance and we calculate it as

X1 + X2 Total numbers of successes in both samples


P = =
n1 + n2 n1 + n2

Step 4
The confidence interval is given by
v
u
u Pb1 1 Pb1 Pb2 1 Pb2
t
Pb1 Pb2 Z( ) +
2 n1 n2

This formula is valid when n1 Pb1 ; n1 1 Pb1 ; n2 Pb2 and n2 1 Pb2 are greater than or equal to 5:
The standard error
v
u s
u Pb1 1 Pb1 Pb2 1 Pb2
t 1 1
+ = P 1 P +
n1 n2 n1 n2

Step 5
The decision rule

Reject the null hypothesis H0 if the test statistic Z is greater than the critical value, otherwise we
do not reject H0 :
Reject the null hypothesis H0 if the p–value is less than (the level of significance), otherwise we
fail to reject H0 :
Reject the null hypothesis H0 if zero lies between the two confidence limits.

To illustrate the use of the Z test for the equality of two properties, work through activities 2.1 and
2.2 to enhance your understanding.
50

Activity 2.1
Say whether the following statements are correct or incorrect and try to rectify the incorrect
statements to make them true.
r
pb1 (1 pb1 ) pb2 (1 pb2 )
(a) If we derive a confidence interval for (P1 P2 ) we use SE = +
n1 n2
r
1 1 X1 + X2
but if we test H0 : P1 = P2 we use SE = p(1 p)( + ) with p = :
n1 n2 n1 + n2

.............................................................................. ..............................................................................

.............................................................................................................................................................

(b) In testing a hypothesis about the difference between two population proportions (P1 P2 ) , the z
test statistic measures how close the computed sample difference between two proportions has
come to the hypothesized value of zero.

.............................................................................................................................................................

.............................................................................................................................................................

(c) In a one-tailed test for the difference between two population proportions (P1 P2 ), if the null
hypothesis is rejected when the alternative hypothesis, H1 : P1 > P2 ; is false, a Type I error is
committed.

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

r
pb1 (1 pb1 ) pb2 (1 pb2 )
(d) If we derive a confidence interval for (P1 P2 ); we use SE = +
n1 n2
r
pb1 (1 pb1 ) pb2 (1 pb2 )
and if we test H0 : P1 P2 = 0:15; we will also use SE = + for the z test
n1 n2
statistic.

.............................................................................. ..............................................................................
51 STA1502/1

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

Feedback Feedback

(a) Correct.

(b) Correct.

(c) Correct.

(d) Correct.

Activity 2.2
1. A seed distributer, called Easy Grow Seeds, claims that 75% of a specific variety of maize, called
Golden Glow, will germinate. A random sample of n1 = 300 seeds was selected from this batch
and 207 germinated. Denote the population proportion of seeds that germinate as P1 : Suppose
that a second, independent seed distributer, called Seeds of All Kinds claims that 80% of their
stock of the same variety of maize, called Golden Glow, will germinate. (Denote this population
proportion of seeds that germinate as P2 :) From this population we draw a random sample of size
n2 = 200 and the number seeds that germinate in this sample is 153.

Test H0 : P1 = P2 against H1 : P1 6= P2 at the 10% level of significance.

To draw a final conclusion show


(a) the use of critical values

(b) computation of the p-value

............................... ............................... ............................... ............................... ...............................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................
52

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

............................... ............................... ............................... ............................... ...............................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

2. Let n1 = 100; X1 = 45; n2 = 50; and X2 = 25:


(a) At the 0:01 level of significance, is there evidence of a significant different between the two
population proportions?

(b) Construct a 99% confidence interval estimate of the difference between the two population
proportions.

(c) Calculate the p–value.

............................... ............................... ............................... ............................... ...............................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

............................... ............................... ............................... ............................... ...............................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................
53 STA1502/1

Feedback Feedback

Question 1

To test the null hypothesis H0 : P1 = P2 ) H0 : (P1 P2 ) = 0:


x1 x2
( ) (P1 P2 )
n n2
We use the test statistic Z = r1 , which has an approximate n(0; 1) distribution.
1 1
p(1 p)( + )
n1 n2

total number of successes in both samples


ppooled =
n1 + n2
x1 + x2
ppooled =
n1 + n2
207 + 153
=
300 + 200

= 0:72
r
1 1
SEpooled (p1 p2 ) = p(1 p)( + )
n1 n2
q
1 1
= 0:72(1 0:72)( 300 + 200 )

= 0:0410
x1 x2
( ) (p1 p2 )
n n2
Z = r1
1 1
p(1 p)( + )
n1 n2

( 207
300
153
200 ) 0
=
0:0410
0:075
=
0:0410

= 1:8293

(a) Find the critical values:


For a two-tailed test with = 0:10; we will reject H0 if jZj > z 2 = z0:05 : (This implies we will reject
H0 if Z 1:645 or if Z 1:645:)
From Table 1 we find z 2 = z0:05 = 1:645:
54

Since jZj = j 1:8298j = 1:8298 > 1:645 =) we reject H0 : It seems likely that the two populations
do not have the same proportions:

Extra explanation:
With a confidence interval our focus is on the inside of the probability statement and with a
hypothesis test our focus is on the outside of the probability statement. For example, for a 90%
confidence interval

P ( 1:645 Z 1:645) = 0:90

which implies that


P (Z 1:645) + P (Z 1:645) =

0.05 0.05

-1.645 0 1.645
Rejection region Rejection region
Two-sided hypothesis test using = 0:10

(b) Compute the p-value:


Since the alternative hypothesis is two-tailed we need to double the probability of observing a
value of the test statistic or more extreme.
p-value= 2 P (Z 1:8298) = 2(0:0336) = 0:0672
Since 0:0672 < 0:10 (the p-value < ) we reject H0 and come to the same conclusion!

Question 2

(a) The given information


The number of successes is X1 = 45 and X2 = 25:
The sample size is n1 = 100 and n2 = 50:
The level of significance = 0:01:
Step 1
The hypothesis testing

H0 : P1 P2 = 0 against H1 : P1 P2 6= 0
55 STA1502/1

Step 2
The test statistic
Pb1 Pb2
Z = s
1 1
P 1 P +
n1 n2
X1 45
Pb1 = = = 0:45
n1 100
X2 25
Pb2 = = = 0:5
n2 50
The pooled proportion is
X1 + X2 45 + 25 70
P = = = = 0:4667
n1 + n2 100 + 50 150
The test statistic is

0:45 0:5
Z = q
1 1
0:4667 (1 0:4667) 100 + 50
0:05 0:05
Z = p = = 0:5774
0:0075 0:0866
Step 3
The critical value at = 0:01 is 2:33 when using Table 1.

Step 4
0:01
The rejection region for a two–tailed test is to corresponding Z of = = 0:005 which gives
2 2
Z = 2:58:
Therefore the rejection is Z < 2:58 and Z > 2:58:

Step 5
Make a decision
Since the test statistic ( 0:5774) lies between the critical value for a two–tailed 2:58 and 2:58;
we fail to reject the null hypothesis H0 at 1% level of significance.
56

(b) The 99% confidence interval is


s
1 1
Pb1 Pb2 Z2 P 1 P +
n1 n2
s
1 1
(0:45 0:5) Z 0:01: 0:4667 (1 0:4667) +
2 100 50
0:05 Z0:005 0:0866
0:05 2:58 0:0866
0:05 0:2234
( 0:05 0:2234; 0:05 + 0:2234)
( 0:2734; 0:1734)

since zero lies between the two confidence limits, we fail to reject the null hypothesis H0 at 1 level
of significance.

(c) Because the hypothesis is a two–tailed test than p–value = 2 P (Z < 0:5774)

= 2 P (Z < 0:58)
= 2 0:2810
= 0:562

Making a decision
Since the p–value (0:562) is greater than = 0:01 (level of significance, we fail to reject H0 at 1%
level of significance.
Conclusion: There is no enough evidence to reject the null hypothesis. Therefore the two
population proportions are similar.
57 STA1502/1

2.3 One-Way Analysis of Variance


In section of study unit 1 we compared the means of two independent and dependent samples. In
general unit 1 through section 2.1 and 2.2 we discussed hypothesis testing methods that allowed
us to reach conclusions about differences between two populations. What happens when we have
more than two independent samples?
We perform a test for means called Analysis of Variance (ANOVA).
The name of the technique derives from the way in which the calculations are performed; that is, the
technique analyzes the variance of the data to determine whether we can infer that the population
means differ when the samples are independently drawn.
ANOVA are methods that allow us to compare multiple populations; or groups. In this approach, we
take samples from each group to examine the effects of differences among two or more groups. The
criteria that distinguish the groups are called factors. Factors contain levels which are analogous to
the categories of a categorical variable.
One–way ANOVA is two-parts process:

To determine if there is a significant difference among the group means. Through this test if the
null hypothesis (all the means are equal) is rejected than we proceed with the second test.
To identify the groups whose means are significantly different from the other group means.

In other words, there is a variation:

(i) that is due to differences among the groups that measures the differences from group to group,
sample to sample or treatment to treatment.
The total of variation among group of variation (or group of treatment is denoted SST).

(ii) That is due to the differences within the groups that measures random variation. The total of
variation within the group or treatment is denoted by SSE.

The total variation (SSTotal) = SST + SSE

To perform an ANOVA test of equality of population means. We have the following steps:
Step 1

Assuming that there is k groups (treatments) that represent the population whose values are
randomly and independently selected, follow a normal distribution, and we have equal variance.
The null hypothesis of no differences in the population means:
H0 : 1 = 2 = ::: = k

is tested against the alternative that not all the k population means are equal:
H1 = not all j are equal where j = 1; 2; :::; k:
or
H1 = At least one population mean is different from the other population means.
58

Step 2
Calculate the among group variation or among treatment called the sum of squares for treatment
(SST ) given by

Pk 2
SST = j=1 nj Xj X

where

k = number of groups (treatment)


nj = number of values in group j (or treatment j)
Xj = sample mean of group j
X = grand mean

X
k
X
nj

Xij
j=1 i=1
X= n

n = n1 + n2 + ::: + nj

Step 3
Calculate the within treatments variation usually called sum of squares within treatment or sum of
square error (SSE) given by

Pk Pnj 2
SSE = j=1 i=1 Xij Xj or

SSE = (n1 1) s21 + (n2 1) s22 + ::: + (nk 1) s2k

where S12 ; S22 ; :::; Sk2 are the variances


SSE is the combined or pooled variation of the k samples, where

Xij = ith value in group j


Xj = sample mean of group j:

Step 4
Calculate the total variation representing the sum of squares total (SST otal) given by

Pk Pnj 2
SST otal = j=1 i=1 Xij X

where
X
k
X
nj

Xij
j=1 i=1
X= n = grand mean
59 STA1502/1

Xij = ith value in group j


nj = number of values in group j
n = n1 + n2 + ::: + nk

Remarks:

Because there are k groups (treatments that we are comparing, therefore there are k 1 degrees
of freedom associated with the sum of squares among groups (treatments).
There are n k degrees of freedom associated with the sum of squares within groups or sum of
squares error.
Because each of the k groups (treatments) contributes nj 1 degrees of freedom through the sum
squares total. That is, we compare each value xij to the grand mean X; based on all n values.

Therefore

* Mean squares for treatment is


MST = SST
k 1
* Mean squares for error is
MSE = SSE
n k
* Mean squares total is
MSTotal = SSTotal
n 1

Step 5
F –test Differences among more than Two means.

To determine if there is a significant difference among group means, we use F–test for differences
among more than two means.
If the null hypothesis H0 : 1 = 2 = ::: = k is true, we conclude that there is no differences
among the k group means such as M ST; M SE and M ST otal; thus these means provide
estimates of the overall variance in the population.
The test statistic for the one–way ANOVA is

M ST
F =
M SE

where
M ST = Mean squares for treatment with k 1 degrees of freedom.
M SE = Mean squares for error with n k degrees of freedom.
60

Step 6
The critical value is

F( ; k 1; n k)

where

= level of significance
k 1 = The degrees of freedom for treatments
n k = The degrees of freedom for error.

Step 7
Decision rule

Reject the null hypothesis H0 : 1 = 2 = ::: k against H1 : Not all j are equal (where
j = 1; 2; :::; k) at a selected level of significance if the F –test statistic is greater than the critical
value F( ; k 1; n k) of the F –distribution. Otherwise, do not reject H0 :
Reject H0 if p–value is less than the level of significance :
ANOVA SUMMARY TABLE

The ANOVA summary table summarizes the results of a one–way ANOVA.


Source of Variation Degrees of freedom Sum of squares Mean of squares F
SST M ST
Treatment k 1 SST M ST = F =
k 1 M SE
SSE
Error n k SSE M SE =
n k
Total n 1 SST otal
61 STA1502/1

Activity 2.3
The marketing manager of a pizza chain is in the process of examining some of the demographic
characteristics of her customers. In particular, she would like to investigate the belief that the ages
of the customers of pizza parlors, hamburger huts, and fast-food chicken restaurants are different.
As an experiment, the ages of eight customers randomly selected of each of the restaurants are
recorded and listed below. Assume that we know from previous analyses that the ages are normally
distributed with the same variances.

Customers’ Ages
Pizza Hamburger Chicken
23 26 25
19 20 28
25 18 36
17 35 23
36 33 39
25 25 27
28 19 38
31 17 31

(a) State whether the following calculations are correct or incorrect.


(i) x = 26:833; x1 = 25:5; x2 = 24:125; x3 = 30:875

(ii) SST otal = 1067:34

(iii) SSW ithin = SSE = 863:76

(iv) SSBetween = SST = 203:58

[:-) Always keep in mind that small differences could be due to rounding errors!]

............................... ............................... ............................... ............................... ...............................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................
62

(b) Set up an ANOVA table

............................... ............................... ............................... ............................... ...............................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

(c) Do these data provide enough evidence at the 5% significance level to infer that there are
differences in ages among the customers of the three restaurants?

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................
63 STA1502/1

Feedback Feedback
The given information
Pizza Hamburger Chicken
23 26 25
19 20 28
25 18 36
17 35 23
36 33 39
25 25 27
28 19 38
31 17 31
Total 204 193 247
Sample Mean X 1 = 25:5 X 2 = 24:125 X 3 = 30:875
Standard Deviation S1 = 6:1875 S2 = 6:8959 S3 = 6:1281
Sample Size n1 = 8 n2 = 8 n3 = 8
(a) (i) Correct
3 X
X 24
Xij
n = n1 + n2 + n3
j=1 i=1
X = =8+8+8
n = 24
204 + 193 + 247
=
24
644
=
24
= 26:8333

or
X1 + X2 + X3
X =
3
25:5 + 24:125 + 30:875
=
3
80:5
X = = 26:8333
3
(ii) Correct
3
X 2
SS Between = SST = nj X j X
j=1
2 2 2
SST = n1 X 1 X + n2 X 2 X + n3 X 3 X
= 8(25:5 26:8333)2 + 8(24:125 26:8333)2 + 8 (30:875 26:8333)2
= 14:2215 + 58:6791 + 130:6827
= 203:5833
3
X
SSW ithin = SSE = (nj 1) s2j
j=1
64

SSE = (n1 1) s21 + (n2 1) s22 + (n3 1) s23


= 7 (6:1875)2 + 7 (6:8959)2 + 7 (6:1281)2
= 267:9961 + 332:8741 + 262:8753
= 863:7455

SST otal = SST + SSE


= 203:5833 + 863:7455
= 1067:3288

(b) The ANOVA table


Source of Variation df SS MS F
Treatment 3 1=2 203:5833 101:7917 2:4588
Error 24 3 = 21 863:7455 41:1307
Total 23 1067:3288
SST 203:5833
M ST = = = 101:7917
k 1 2

SSE 863:7455
M SE = = = 41:1307
n k 21

M ST 101:7917
F = = = 2:4588
M SE 41:1307
(c) Decision Rule
Reject H0 if F –test statistic > The critical value F( ;k 1;n k) . Since F –test statistic (2:4588) is
less than the critical F(0:05; 2; 21) = 3:44; we fail to reject the null hypothesis H0 at 5% level of
significance.
H0 : 1 = 2 = 3 against H1 : at least two means differ.
Conclusion: The data do not provide enough evidence at the 5% significance level to infer that
there are differences in ages among the customers of the three restaurants.

Activity 2.4
A statistics practitioner calculated the following statistics:
Treatment
Statistic 1 2 3
The sample size n 5 5 5
The sample X 10 15 20
The sample S 2 50 50 50
(a) Complete the ANOVA table.

(b) Repeat part (a) changing the sample sizes to 10 each.


65 STA1502/1

(c) Describe what happen to the F –statistic when the sample sizes increase

.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

............................... ............................... ............................... ............................... ...............................

............................... ............................... ............................... ............................... ...............................

............................... ............................... ............................... ............................... ...............................

............................... ............................... ............................... ............................... ...............................

Feedback Feedback

The given information


k=3 n1 = 5 n2 = 5 n3 = 5

n = n1 + n2 + n3 = 5 + 5 + 5 = 15

~ 1 = 10
X X 2 = 15 X 3 = 20

S12 = 50 S22 = 50 S32 = 50


(a)
X1 + X2 + X3 10 + 15 + 20 45
X = = = = 15
3 3 3
3
X 2
SST = nj X j X
j=1
2 2 2
SST = n1 X 1 15 + n2 X 2 15 + n3 X 3 15
= 5 (10 15)2 + 5 (15 15)2 + (20 15)2
= 125 + 0 + 125
= 250
66

3
X
SSE = (nj 1) Sj2
j=1

= (n1 1) S12 + (n2 1) S22 + (n3 1) S32


= (5 1) 50 + (5 1) 50 + (5 1) 50
= 200 + 200 + 200
= 600

SST otal = SST + SSE


= 250 + 600
= 850

SST 250 250


M ST = = = = 125
k 1 3 1 2

SSE 600 600


M SE = = = = 50
n k 15 3 12

M ST 125
F = = = 2:5
M SE 50
ANOVA TABLE
Source of variation df SS MS F
Treatment k 1=2 250 125 2:5
Error n k = 12 600 50
Total 14 850
(b) The given information
k=3 n1 = 10 n2 = 10 n3 = 10

n = n1 + n2 + n3 = 10 + 10 + 10 = 30

~ 1 = 10
X X 2 = 15 X 3 = 20

S12 = 50 S22 = 50 S32 = 50

X1 + X2 + X3 10 + 15 + 20 45
X = = = = 15
3 3 3
3
X 2
SST = nj X j X
j=1
2 2 2
SST = n1 X 1 X + n2 X 2 X + n3 X 3 X
= 10 (10 15)2 + 10 (15 15)2 + 10 (20 15)2
= 250 + 0 + 250
= 500
67 STA1502/1

3
X
SSE = (nj 1) Sj2
j=1

= (n1 1) S12 + (n2 1) S22 + (n3 1) S32


= 9 50 + 9 50 + 9 50
= 450 + 450 + 450
= 1350

SST otal = SST + SSE


= 500 + 1350
= 1850
SST 500 500
M ST = = = = 125
k 1 3 1 2

SSE 1350 1350


M SE = = = = 50
n k 30 3 27

M ST 125
F = = = 25
M SE 50
ANOVA TABLE
Source of variation df SS MS F
Treatment k 1=3 1=2 500 125 25
Error n k = 30 3 = 27 1350 50
Total 29 1850
(c) F –increased (In this case from F = 2:6 to F = 25):
68

2.4 Multiple comparisons.

Performing an analysis of variance test to determine whether differences exist between two or
more population means is a good start, but not nearly enough for a practical application where it
is necessary to identify which treatment means are responsible for the differences. The statistical
method used to determine this is called multiple comparisons. We will consider three methods for
this purpose, namely

Fisher’s least significant difference method (LSD) which is used of you want find areas for further
investigation.
The Bonferroni method which is used of you want to identify two or three pairwise comparisons.
Tukey’s method is used when you want to consider all possible population-combinations.

These three methods are discussed in the next section. Make sure that you understand them and can
apply the knowledge. The formulas for the three methods are different, but you need not remember
them. In fact, rather go through each example and its solution to see how the three methods are
applied.

As your knowledge of statistics expands, lengthy calculations will interest you less and less, seeing
that your interest should move to the actual statistical analysis. There is a very delicate balance
between the importance of the calculation and the statistical analysis: if the calculation is incorrect,
the analysis has no meaning. Still,you are being trained to make a meaningful and correct analysis.
Once you understand the method applied in the calculation, that part can be taken over by statistical
software. This is why most statisticians start to use statistical software for their calculations at an
early stage. We are introducing students at second level in STA2601 to the software package JMP.
It is therefore advisable for you to take note of any given Excel and Minitab printouts in a textbook.
Try to do them yourself if you have access to Excel or Minitab and if you do not have access, study
them and note what information they supply and how to interpret it. No professional statistician can
function properly without knowledge of and using statistical software.
If we conclude from an ANOVA F –test that the population means are all equal (i.e. if we do not reject
the null hypothesis H0 ); then can end our analysis. However, if we conclude that the population,
means are not all equal (i.e. if we reject the null hypothesis H0 ); then we are led directly into a new
question: If the population means are not all equal, which ones differ and which ones the same?
The next section tests will answer this new question.
69 STA1502/1

2.4.1 Multiple Comparisons: Fisher’s Least Significant Difference (LSD)


Method

The least significant difference (LSD) method determines which population means differ. We define
the least significant difference LSD as
s
1 1
LSD = t( ; (n k)) M SE +
2 ni nj

where

MSE is an unbiased estimator of the common variance of the populations we are testing.

A simple way of determining whether differences exist between each pair of population means is to
compare the absolute value of the difference between their two sample means and LSD.
In other words, we will conclude that i and j differ if

Xi X j > LSD

where X i X j is the pairwise absolute differences given always a positive difference.

2.4.2 Multiple Comparisons: Bonferroni tests of individual pairs of means

If we conclude that the null hypothesis of equal population means is not valid, we need to determine
which pairs of population means are the same and which ones differ. The Bonferroni t–test is one of
different types of tests that can determine which pairs of means differ. When we have k groups (or
k (k 1)
treatmens), we can test a total of C = hypothesis, each with significant level of (i.e. the
2
probability of a type 1 error), that means the error of incorrectly rejecting the null hypothesis.
The Bonferroni test attempts to control the overall probability of a type 1 error by ensuring that it is
no greater than the original specified level of significance. In general, the hypothesis that we test are
given by:
H0 : i = j against H1 : i 6= j (i and j = 1; 2; :::; k ):
The Bonferroni tests for ANOVA test have the following steps:
Step 1
k (k 1)
The hypothesis statements H0 and H1 for the number of hypothesis is determined by C = to
2
be tested simultaneously.
Step 2
The test statistic is

Xi Xj
Tij = s
1 1
M SE +
ni nj
70

where
ni and nj are the sample sizes of groups (treatment) i and j:
X i and X j are the sample means of groups (or treatments) i and j:
M SE is the mean within sum of squares calculated from
the original ANOVA.

The test statistic follows a t(n k) distribution under the null hypothesis H0 :
Step 3
The critical value
k (k 1)
The critical values that are used for all c = tests for the pairs of population means are
2
t( and t( ; (n
2c ; (n k)) 2c k))

Step 4
Specify the significance level
The overall significance level for all the tests is specified as = 0:05:
Step 5
Decision rule

Reject the null hypothesis H0 : i = j if the test statistic tij < t( ;(n k)) or if tij > t( ; (n k))
2 2
we can use the Table 2 to obtain these critical values.
Reject the null hypothesis H0 if p–value < c

c is the Bonferroni correction

2.4.3 Multiple Comparisons: The Kukey- Kramer Method

After performing the one–way ANOVA and finding a significant difference among the treatment,
we still do not know which treatments differ. All we know is that there is sufficient evidence to
state that the population means are not all the same. That is, one or more population means
are significantly different. To determine which treatments differ, we use the Tukey–Kramer multiple
comparisons procedure for one–way ANOVA. Using this technique, we are able to simultaneously
make comparison between all pairs of groups.
This technique determines a critical number similar to the least significant difference (LSD) method
of section 2.4.1 for Fisher’s test. The critical number is denoted by !; such that if any pair of sample
means has a difference greater than !; we conclude that the pair’s two corresponding population
means are different.
The test is based on the studentized range, which is to calculate the variable q:
X max X min
q=
S
p
n
71 STA1502/1

where X max and X min are the largest and smallest sample means respectively, assuming that there
are no differences between the population means.
The critical number ! is
r
M SE
! = q (k; n k)
ng

where
k = number of treatments
n = number of observations
= n1 + n2 + ::: + nk
n k = number of degrees of freedom
associated with M SE
ng = number of observations in each
of k samples
= significance level
q (k; n k) = critical value of the studentized
range as given in Table 4.
72

Table 4A: Critical Values of the Studentized Range, = 0:05


73 STA1502/1

Table 4B: Critical Values of the Studentized Range, = 0:01


74

Activity 2.5
Question 1
An investor studied the percentage rates of return of three different types of mutual funds. Random
samples of percentage rates of return for four periods were taken from each fund. The results appear
in the table below:
Mutual Funds Percentage Rates
Fund 1 Fund 2 Fund 3
12 4 9
15 8 3
13 6 5
14 5 7
17 4 4

Use Tukey’s method with = :05 to determine which population means differ.
............................... ............................... ............................... ............................... ...............................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

............................... ............................... ............................... ............................... ...............................

Question 2

(a) Use Fisher’s LSD method with = 0:05 to determine which population means differ in the following
problem.
k=3 n1 = 10 n2 = 10 n3 = 10

M SE = 700 X 1 = 128:7 X 2 = 101:4 X 3 = 133:7


(b) Repeat part (a) using the Bonferroni adjustment.

(c) Repeat part (a) using Tukey’s multiple comparison method.

............................... ............................... ............................... ............................... ...............................

.............................................................................................................................................................

.............................................................................................................................................................

Feedback Feedback
Question 1

Mutual Funds
Fund 1 Fund 2 Fund 3
12 4 9
15 8 3
13 6 5
14 5 7
17 4 4
The sample sizes: n1 = 5, n2 = 5, n3 = 5
Totals of observations: 71, 27, 28
The sample means: X̄1 = 14.2, X̄2 = 5.4, X̄3 = 5.6

The sample variances: S1² = 3.7, S2² = 2.8, S3² = 5.8

n = n1 + n2 + n3 = 5 + 5 + 5 = 15
k = 3
n − k = 15 − 3 = 12
α = 0.05
ng = 5

ω = q_0.05(k, n − k) √(MSE/ng)

SSE = Σ (nj − 1)Sj²
    = (n1 − 1)S1² + (n2 − 1)S2² + (n3 − 1)S3²
    = (5 − 1)(3.7) + (5 − 1)(2.8) + (5 − 1)(5.8)
    = 14.8 + 11.2 + 23.2 = 49.2

MSE = SSE/(n − k) = 49.2/12 = 4.1

The critical number is

ω = q_0.05(3, 12) √(4.1/5)
  = 3.77 × 0.9055      (q_0.05(3, 12) = 3.77 using Table 4A)
  = 3.4137

The pairwise absolute differences are

|X̄1 − X̄2| = |14.2 − 5.4| = |8.8| = 8.8
Since 8.8 > 3.4137, μ1 and μ2 differ; there is enough evidence to reject the null hypothesis
H0: μ1 = μ2. The test is significant.

|X̄1 − X̄3| = |14.2 − 5.6| = 8.6 > 3.4137, so μ1 and μ3 differ; the test is significant.
|X̄2 − X̄3| = |5.4 − 5.6| = |−0.2| = 0.2
Since 0.2 < 3.4137, μ2 and μ3 do not differ.

Conclusion: It is clear that the mean percentage rate of return for mutual fund 1 is significantly
different from that of the other two mutual funds.
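
The same comparison can be checked with software. Below is a minimal sketch in Python (assuming NumPy and a recent version of SciPy, whose studentized_range distribution supplies the critical value q); small rounding differences from the table values are expected.

import numpy as np
from scipy import stats

fund1 = np.array([12, 15, 13, 14, 17])
fund2 = np.array([4, 8, 6, 5, 4])
fund3 = np.array([9, 3, 5, 7, 4])
samples = [fund1, fund2, fund3]

k = len(samples)                                   # number of treatments
ng = len(fund1)                                    # observations per sample
n = k * ng
mse = np.mean([s.var(ddof=1) for s in samples])    # with equal sample sizes, MSE is the mean variance (4.1)
q_crit = stats.studentized_range.ppf(0.95, k, n - k)   # q_0.05(3, 12), about 3.77
omega = q_crit * np.sqrt(mse / ng)                     # critical number, about 3.41
means = np.array([s.mean() for s in samples])
for i in range(k):
    for j in range(i + 1, k):
        diff = abs(means[i] - means[j])
        print(f"fund {i+1} vs fund {j+1}: |difference| = {diff:.2f}, differ: {diff > omega}")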

Question 2

(a) Given information


The level of significance = 0:05:
The number of treatment k = 3
The sample size n1 = 10; n2 = 10 and n3 = 10
n = n1 + n2 + n3 = 10 + 10 + 10 = 30

The mean squares error M SE = 700


The sample mean X 1 = 128:7; X 2 = 101:4 and X 3 = 133:7
n k = 30 3 = 27
The least significant difference LSD is

LSD = t_(α/2, n − k) √(MSE (1/ni + 1/nj))
    = t_(0.025, 27) √(700 (1/10 + 1/10))
    = t_(0.025, 27) √140
    = 2.052 × 11.8322      (we use Table 2 for t_(0.025, 27))
    = 24.2797

Decision Rule
Reject H0 if |X̄i − X̄j| > LSD; otherwise we fail to reject the null hypothesis H0.
The pairwise absolute differences are
(i) |X̄1 − X̄2| = |128.7 − 101.4| = |27.3| = 27.3
Since 27.3 > 24.2797, μ1 and μ2 differ; the test is significant.

(ii) |X̄1 − X̄3| = |128.7 − 133.7| = |−5| = 5
Since 5 < 24.2797, μ1 and μ3 do not differ; the test is not significant.

(iii) |X̄2 − X̄3| = |101.4 − 133.7| = |−32.3| = 32.3
Since 32.3 > 24.2797, μ2 and μ3 differ; the test is significant.
Conclusion: μ2 differs from μ1 and μ3.

(b) Bonferroni test


Step 1
The hypotheses statements H0 and H1.
The number of pairwise comparisons is C = k(k − 1)/2 = 3(3 − 1)/2 = 6/2 = 3; this means we have three pairs of hypotheses, as given below.
H0: μ1 = μ2 against H1: μ1 ≠ μ2

H0: μ1 = μ3 against H1: μ1 ≠ μ3

H0: μ2 = μ3 against H1: μ2 ≠ μ3

The above hypotheses need to be tested simultaneously.


Step 2
Calculate the test statistics
From the given information
k = 3, α = 0.05, n = 30, n1 = n2 = n3 = 10
MSE = 700, X̄1 = 128.7, X̄2 = 101.4, X̄3 = 133.7

The test statistic for each pair is

t_ij = (X̄i − X̄j) / √(MSE (1/ni + 1/nj))

(i) For the hypothesis H0: μ1 = μ2
t_1,2 = (128.7 − 101.4)/√(700(1/10 + 1/10)) = 27.3/11.8322 = 2.3073

(ii) For the hypothesis H0: μ1 = μ3
t_1,3 = (128.7 − 133.7)/√(700(1/10 + 1/10)) = −5/11.8322 = −0.4226

(iii) For the hypothesis H0: μ2 = μ3
t_2,3 = (101.4 − 133.7)/√(700(1/10 + 1/10)) = −32.3/11.8322 = −2.7298

Decision rule
The critical value is t_(α/(2C), n − k).
Reject H0 if the test statistic t_ij is greater than the positive critical value or less than −t_(α/(2C), n − k).
Step 3
Making a decision
The critical value is t_(0.05/(2×3), 30 − 3) = t_(0.0083, 27) ≈ t_(0.01, 27) = 2.473
For the hypothesis H0: μ1 = μ2, t_1,2 = 2.3073. Since 2.3073 lies between the critical values
−2.473 and 2.473, we fail to reject H0 and we conclude that μ1 and μ2 are not significantly
different.
For the hypothesis H0: μ1 = μ3, t_1,3 = −0.4226.
Since −0.4226 lies between the critical values −2.473 and 2.473, we fail to reject H0 and we
conclude that μ1 and μ3 are not significantly different.
For the hypothesis H0: μ2 = μ3, t_2,3 = −2.7298.
Since −2.7298 lies outside of the critical values −2.473 and 2.473, we reject the null hypothesis
H0: μ2 = μ3 and conclude that μ2 and μ3 are significantly different.

(c) Tukey’s multiple comparison method.


From the given information
n1 = 10, n2 = 10, n3 = 10

n = n1 + n2 + n3 = 10 + 10 + 10 = 30

k = 3, MSE = 700, ng = 10

α = 0.05, X̄1 = 128.7, X̄2 = 101.4, X̄3 = 133.7

The critical number ω is

ω = q_α(k, n − k) √(MSE/ng)
  = q_0.05(3, 30 − 3) √(700/10)
  = q_0.05(3, 27) × 8.3667
  = 3.49 × 8.3667
  = 29.1998      (q_0.05(3, 27) ≈ 3.49 from Table 4A, using the nearest tabulated value q_0.05(3, 30))

The pairwise absolute differences are

|X̄1 − X̄2| = |128.7 − 101.4| = 27.3
Since 27.3 < 29.1998, μ1 does not differ from μ2.

|X̄1 − X̄3| = |128.7 − 133.7| = |−5| = 5
Since 5 < 29.1998, μ1 and μ3 are not different.
|X̄2 − X̄3| = |101.4 − 133.7| = |−32.3| = 32.3
Since 32.3 > 29.1998, μ2 and μ3 are significantly different.
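
All three multiple comparison methods in this question can also be computed from the summary statistics with a few lines of Python. The sketch below assumes SciPy is available; because it uses exact percentage points rather than rounded table look-ups, the Bonferroni and Tukey critical values differ slightly from the hand calculations above.

import math
from scipy import stats

mse, n, k, ng, alpha = 700, 30, 3, 10, 0.05
means = {1: 128.7, 2: 101.4, 3: 133.7}
se = math.sqrt(mse * (1/ng + 1/ng))                          # 11.8322
lsd = stats.t.ppf(1 - alpha/2, n - k) * se                   # Fisher's LSD, about 24.3
C = k * (k - 1) // 2                                         # number of pairwise comparisons
bonf = stats.t.ppf(1 - alpha/(2*C), n - k) * se              # Bonferroni-adjusted critical difference
omega = stats.studentized_range.ppf(1 - alpha, k, n - k) * math.sqrt(mse/ng)   # Tukey's omega
for i, j in [(1, 2), (1, 3), (2, 3)]:
    d = abs(means[i] - means[j])
    print(f"|mean{i} - mean{j}| = {d:.1f}  LSD:{d > lsd}  Bonferroni:{d > bonf}  Tukey:{d > omega}")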

2.5 Analysis of variance experimental designs

The way that a sample is selected is called the experimental design, and it determines the amount of
information in the sample. Research can involve:

an observational study, which only observes the characteristics of data that already exist (i.e. the researcher
does not produce the data), for instance a sample survey in the form of a questionnaire;
experimentation, in which the researcher may use one or more experimental conditions in order to
determine the effect on the response.

We will use terms such as factor, level, treatment and response in the design of a statistical
experiment.
The one–way analysis of variance introduced in the previous sections is only one of many
different experimental designs of the analysis of variance.
In the following section, we present an overview of the concepts used when we discuss the analysis of
variance in the design of a statistical experiment.

2.5.1 Single–Factor Experimental Design

The experiment described in one–way analysis of variance (ANOVA) is a single–factor analysis of


variance simply because it addresses the problem of comparing two or more populations defined
on the basis of only one factor. One–way analysis of variance is one of the simplest
experimental designs: the completely randomized design, in which random samples are
selected independently from each of k populations. This design involves only one factor, hence the
designation as a one–way classification.

2.5.2 Multifactor Experimental Design

The experiment in this case involves two or more factors that define the treatments. The focus in this
section is to determine whether the levels of each factor are different from one another. The analysis
of variance is used to address this problem.

2.5.3 Independent Samples and Block

The one–way analysis of variance described earlier is a generalization of the two independent
samples design, to be used when the experimental units are quite similar or homogeneous and
only one factor is involved.

When the problem objective is to compare more than two populations, in which the data have to be
gathered from a matched pairs experiment, then the experimental design is called the randomized
block design (i.e. a direct extension of the matched pairs). The design uses blocks of k experimental
units that are relatively similar (or homogeneous), with one unit within each block randomly assigned
to each treatment. The randomized block experiment is also called the two–way analysis of
variance.

2.5.4 Fixed and Random Effects

Fixed–effects analysis of variance is a technique that includes all possible levels of a factor in
the analysis.
Random–effects analysis of variance is a technique that treats the levels included in the study as a
random sample from all possible levels of the factor.

In some experimental designs, there are no differences in the calculation of the test statistic between
fixed and random effects. However, in others, including the two–factor experiment, the calculations
are different.

2.6 Randomized Block(two-way) Analysis of Variance

The randomized block design identifies two factors: treatments and blocks, both of which affect
the response. The procedure for carrying out the randomized block design is summarized
below.

A. The null and alternative hypotheses are expressed in terms of the equality of the population means
for all of the treatment groups.

H0: μ1 = μ2 = ... = μk

H1: The population means are not all the same.

B. The format of the data to be analyzed.


The data are listed in tabular form, as shown below, with a separate column for each of the k
treatments and a separate row for each of the b blocks, as with one–way, or completely randomized,
ANOVA.

                             Treatment, j = 1 to k
                          1      2      3     ...    k      Block means
Blocks, i = 1 to b   1    X11    X12    X13   ...    X1k    X̄1.
                     2    X21    X22    X23   ...    X2k    X̄2.
                     3    X31    X32    X33   ...    X3k    X̄3.
                     .    .      .      .            .      .
                     b    Xb1    Xb2    Xb3   ...    Xbk    X̄b.
Treatment means           X̄1     X̄2     X̄3    ...    X̄k     X̄ = Σ Σ Xij / N  (grand mean)

Xij = the observation for the ith block and the jth treatment.
X̄ = grand mean, the mean of all the observations. We can also average the means of the treatments or
the means of the blocks to calculate X̄.

C. The ANOVA table for the randomized block design.


Source of variation    Degrees of freedom     Sum of squares    Mean Squares    F–ratio
Treatments             k − 1                  SST               MST             F = MST/MSE
Blocks                 b − 1                  SSB               MSB             F = MSB/MSE
Sampling error         (k − 1)(b − 1)         SSE               MSE
Total                  kb − 1                 SSTotal

SST = b Σ_{j=1}^{k} (X̄j − X̄)²

MST = SST/(k − 1)

SSB = k Σ_{i=1}^{b} (X̄i. − X̄)²

MSB = SSB/(b − 1)

SSE = SSTotal − SST − SSB

MSE = SSE/[(k − 1)(b − 1)]

SSTotal = Σ_{j=1}^{k} Σ_{i=1}^{b} (Xij − X̄)²

D. The test statistic, the critical value and the decision rule.
(1) The test statistic

For treatments: F = MST/MSE    (as with the one–way ANOVA)
For blocks:     F = MSB/MSE

(2) The critical value

The test is right–tailed, for a given level of significance α.
The critical value for the treatments is F_(α, (k − 1), (k − 1)(b − 1)).

(3) The decision rule

Reject H0: μ1 = μ2 = ... = μk at level α if F = MST/MSE is greater than the critical value given in (2).

The calculations for this type of analysis are so time consuming that in some cases we provide only some of the
details, or only computer printouts, in the explanations. Learn this method by working through an application.
With this design, the procedure used to test whether the treatment means differ can also be used to test whether
there are differences in the block means. Of course, if the block means do not differ, it implies that blocking
was unnecessary and this specific design was not the appropriate one!
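
To see the formulas above in action, here is a minimal sketch in Python (NumPy and SciPy assumed) that computes the randomized block ANOVA table for a small, purely hypothetical data set with b = 4 blocks and k = 3 treatments.

import numpy as np
from scipy import stats

# rows = blocks (b), columns = treatments (k); the numbers are made up for illustration
X = np.array([[52., 47., 44.],
              [60., 55., 49.],
              [58., 54., 56.],
              [51., 52., 44.]])
b, k = X.shape
grand = X.mean()
sst = b * ((X.mean(axis=0) - grand) ** 2).sum()          # between-treatment variation
ssb = k * ((X.mean(axis=1) - grand) ** 2).sum()          # between-block variation
sstotal = ((X - grand) ** 2).sum()
sse = sstotal - sst - ssb
mst, msb = sst / (k - 1), ssb / (b - 1)
mse = sse / ((k - 1) * (b - 1))
f_treat, f_block = mst / mse, msb / mse
print("treatments: F =", round(f_treat, 3), "p =", round(stats.f.sf(f_treat, k - 1, (k - 1)*(b - 1)), 4))
print("blocks:     F =", round(f_block, 3), "p =", round(stats.f.sf(f_block, b - 1, (k - 1)*(b - 1)), 4))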

Activity 2.6
Question 1
The following statistics were generated from a randomized block experiment with k = 3 and b = 7:

SST = 100 SSB = 50 SSE = 25

(a) Test to determine whether the treatment means differ. (Use α = 0.05.)

(b) Test to determine whether the block means differ. (Use α = 0.05.)

............................... ............................... ............................... ............................... ...............................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

Question 2
A partial ANOVA table in a randomized block design is shown below, where the treatments refer to
different high blood pressure drugs, and the blocks refer to different groups of men with high blood
pressure. Use the given ANOVA table to answer the questions:

Source of Variation SS df MS F

Treatments 6; 720 4 1; 680 14:6087

Blocks 3; 120 6 520 4:5217

Error 2; 760 24 115

Total 12; 600 34

(a) Can we infer at the 5% significance level that the treatment means differ?

(b) Can we infer at the 5% significance level that the block means differ?

............................... ............................... ............................... ............................... ...............................

.............................................................................................................................................................

.............................................................................................................................................................

.............................................................................................................................................................

Feedback Feedback
Question 1
The given information
k = 3, b = 7, SST = 100

SSE = 25, SSB = 50

H0: μ1 = μ2 = μ3 against H1: The population means are not all the same.

(a) Test for the treatments

MST = SST/(k − 1) = 100/(3 − 1) = 100/2 = 50

MSE = SSE/[(k − 1)(b − 1)] = 25/[(3 − 1)(7 − 1)] = 25/(2 × 6) = 25/12 = 2.0833

F = MST/MSE = 50/2.0833 = 24.00
The critical value is F_(α, (k − 1), (k − 1)(b − 1)) = F_(0.05, 2, 12) = 3.89
The rejection region is F > 3.89.
Since the test statistic F = 24.00 is greater than the critical value, we reject the null hypothesis
H0.
Conclusion: There is enough evidence to conclude that the treatment means differ.

(b) Test for the blocks

MSB = SSB/(b − 1) = 50/(7 − 1) = 50/6 = 8.3333

F = MSB/MSE = 8.3333/2.0833 = 4.00
The critical value is F_(α, b − 1, (b − 1)(k − 1)) = F_(0.05, 6, 12) = 3.00

The rejection region is F > 3.00. Since the test statistic F = 4.00 is greater than the critical
value (3.00), the null hypothesis H0 is rejected at the 5% significance level.
Conclusion: There is enough evidence to conclude that the block means differ.
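
A quick way to verify these two F-tests (a sketch assuming SciPy is installed):

from scipy import stats

k, b = 3, 7
SST, SSB, SSE = 100, 50, 25
MST, MSB = SST / (k - 1), SSB / (b - 1)
MSE = SSE / ((k - 1) * (b - 1))
print(MST / MSE, stats.f.ppf(0.95, k - 1, (k - 1) * (b - 1)))   # about 24.0 versus 3.89
print(MSB / MSE, stats.f.ppf(0.95, b - 1, (k - 1) * (b - 1)))   # about 4.0 versus 3.00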

Question 2

The ANOVA table

Source of variation df SS MS F
Treatments 4 6720 1680 14:6087
Blocks 6 3120 520 4:5217
Error 24 2760 115
Total 34 12600

H0: μ1 = μ2 = μ3 = μ4 = μ5 (there are k = 5 treatments, since the treatment degrees of freedom equal 4)
H1: At least two means differ

(a) Test for treatments


MST = 1680, MSB = 520, MSE = 115, F = 14.6087
The critical value is F_(α, k − 1, (k − 1)(b − 1)) = F_(0.05, 4, 4 × 6) = F_(0.05, 4, 24) = 2.78
Since the test statistic F = 14.6087 is greater than the critical value (2.78), the null hypothesis
is rejected at the 5% level of significance.
Conclusion: There is enough evidence to reject the null hypothesis H0; at least two treatment
(population) means differ.

(b) Test for Blocks


MSB = 520, SSB = 3120, F = 4.5217
The critical value is F_(0.05, (b − 1), (k − 1)(b − 1)) = F_(0.05, 6, 4 × 6) = F_(0.05, 6, 24) = 2.51
Since the test statistic F = 4.5217 is greater than the critical value (2.51), we reject the null
hypothesis H0.
Conclusion: There is enough evidence that at least two block (population) means differ.

2.7 Self-correcting Exercises for Unit 2

Question 1
These statistics were calculated from two random samples:
P̂1 = 0.60, n1 = 225, P̂2 = 0.55, n2 = 225

(a) Calculate the p–value of a test to determine whether there is evidence to infer that the population
proportions differ.

(b) Repeat part (a) with P̂1 = 0.95 and P̂2 = 0.90.

(c) Describe the effect on the p–value of increasing the sample proportions.

(d) Repeat part (a) with P̂1 = 0.10 and P̂2 = 0.05.

(e) Describe the effect on the p–value of decreasing the sample proportions.

Question 2
Surveys have been widely used by politicians around the world as a way of monitoring the opinions
of the electorate. Six months ago, a survey was undertaken to determine the degrees of support for
a national party leader. Of a sample of 1100; 56% indicated that they would vote for this politician.
This month, another survey of 800 voters revealed that 46% now support the leader.

(a) At the 5% significance level, can we infer that the national leader’s popularity has decreased?

(b) At the 5% significance level, can we infer that the national leader’s popularity has decreased by
more than 5%?

(c) Estimate with 95% confidence the decrease in percentage support between now and 6 months
ago.

Question 3
Consider the following ANOVA table:

Source of Variation Sum of squares df Mean Squares F


Treatments 128 4 32 2:963
Error 270 25 10:8
Total 398 29

Say whether the following statements are true or false.

(a) The total number of observations in all the samples is 30.

(b) The within-treatments variation stands for the sum of squares for error.

(c) In one-way analysis of variance, if all the sample means are equal, then the sum of squares for
treatments will be zero.

(d) The rejection region, at the 1% level of significance, for this one-way analysis of variance is where
F > F_(α, k − 1, n − k) = F_(0.01, 4, 25).

(e) Assume that the above ANOVA is applied to independent samples taken from normally distributed
populations with equal variances. If the null hypothesis is rejected, then we can infer that at least
two population means differ.

Question 4
A consumer organization was concerned about the differences between the advertised sizes of
containers and the actual amount of product. In a preliminary study, six packages of three different
brands of margarine that are supposed to contain 500ml were measured. The differences from 500ml
are listed here. Do these data provide sufficient evidence to conclude that differences exist between
the three brands? (Use = 0:01).
Brand 1 Brand 2 Brand 3
1 2 1
3 2 2
3 4 4
0 3 2
1 0 3
0 4 4

2.8 Solutions to Self-correcting Exercises for Unit 2

Question 1
The given information

Two population proportions


P̂1 = 0.60, P̂2 = 0.55
n1 = 225, n2 = 225

(a) To calculate the p–value, we need the test statistic.
The test statistic for the difference between two population proportions P̂1 − P̂2 is

Z = [P̂1 − P̂2 − (P1 − P2)] / √[P̄(1 − P̄)(1/n1 + 1/n2)]

The pooled proportion is

P̄ = (X1 + X2)/(n1 + n2)

P̂1 = X1/n1, so X1 = n1 P̂1
P̂2 = X2/n2, so X2 = n2 P̂2

Thus

P̄ = (X1 + X2)/(n1 + n2) = (n1 P̂1 + n2 P̂2)/(n1 + n2)
  = (225 × 0.60 + 225 × 0.55)/(225 + 225)
  = (135 + 123.75)/450
  = 258.75/450
  = 0.575

The test statistic for P̂1 − P̂2 is

Z = [P̂1 − P̂2 − (P1 − P2)] / √[P̄(1 − P̄)(1/n1 + 1/n2)]
  = [(0.60 − 0.55) − 0] / √[0.575(1 − 0.575)(1/225 + 1/225)]
  = 0.05/0.0466
  = 1.0730

The hypotheses for H0 and H1 are

H0: P1 − P2 = 0 against H1: P1 − P2 ≠ 0

Since we have a two–tailed test,

p–value = 2 × P(Z > 1.0730)
        = 2 × P(Z > 1.07)
        = 2 × 0.1423      (from Table 1)
        = 0.2846

Making a decision
Since the p–value = 0.2846 is greater than α = 0.05, we fail to reject H0.
Conclusion: The two population proportions do not differ.
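
As a check, here is a minimal Python sketch of this two-proportion z-test (SciPy assumed); the p-value differs slightly from 0.2846 only because the hand calculation rounds Z to 1.07 before using Table 1.

import math
from scipy import stats

p1_hat, n1 = 0.60, 225
p2_hat, n2 = 0.55, 225
p_pool = (n1 * p1_hat + n2 * p2_hat) / (n1 + n2)                        # pooled proportion, 0.575
z = (p1_hat - p2_hat) / math.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
p_value = 2 * stats.norm.sf(abs(z))                                      # two-tailed p-value
print(round(z, 4), round(p_value, 4))                                    # about 1.073 and 0.283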

(b) The given information

P̂1 = 0.95, P̂2 = 0.90
n1 = 225, n2 = 225
The pooled proportion is

P̄ = (X1 + X2)/(n1 + n2)

X1 = n1 P̂1 and X2 = n2 P̂2, thus

P̄ = (n1 P̂1 + n2 P̂2)/(n1 + n2)
  = (225 × 0.95 + 225 × 0.90)/(225 + 225)
  = (213.75 + 202.5)/450
  = 0.925

The test statistic for P̂1 − P̂2 is

Z = [P̂1 − P̂2 − (P1 − P2)] / √[P̄(1 − P̄)(1/n1 + 1/n2)]
  = [(0.95 − 0.90) − 0] / √[0.925(1 − 0.925)(1/225 + 1/225)]
  = 0.05/0.0248
  = 2.0161

Since this is a two–tailed test,

p–value = 2 × P(Z > 2.0161)
        = 2 × P(Z > 2.02)
        = 2 × 0.0217
        = 0.0434

Making a decision
Since the p–value = 0.0434 is less than α = 0.05, we reject the null hypothesis H0.
Conclusion: The two population proportions differ.

(c) The p–value decreases from 0:2846 to 0:0434:

(d) The given information


Pb1 = 0:01 Pb2 = 0:05

n1 = 225 n2 = 225

X1 = n1 Pb1 = 225 0:10 = 22:5

X2 = n2 Pb2 = 225 0:05 = 11:25

The pooled proportion P is


X1 + X2 22:5 + 11:25 33:75
P = = = = 0:075
n1 + n2 225 + 225 450
92

The test statistic for Pb1 Pb2 is


Pb1 Pb2 (P1 P2 )
Z = s
1 1
P 1 P +
n1 n2
(0:10 0:075) 0
= s
1 1
0:075 (1 0:075) +
225 225
0:05
=
0:0248

= 2:0161

p-value = 2 P (Z > 2:0161)


= 2 P (Z > 2:02)
= 2 0:0217 (From Table 1)
= 0:0434

Since p–value (0:0434) < 0:05; we fail to reject H0 :

(e) The p–value decreases.

Question 2
The given information

The sample sizes are n1 = 1100 and n2 = 800.

The sample proportions are P̂1 = 0.56 and P̂2 = 0.46.

(a) The hypotheses H0 and H1:

H0: P1 − P2 = 0 against H1: (P1 − P2) > 0.
The level of significance is α = 0.05.
The test statistic for P̂1 − P̂2 is

Z = [P̂1 − P̂2 − (P1 − P2)] / √[P̄(1 − P̄)(1/n1 + 1/n2)]

The pooled proportion is

P̄ = (X1 + X2)/(n1 + n2)
X1 = n1 P̂1 = 1100 × 0.56 = 616
X2 = n2 P̂2 = 800 × 0.46 = 368

P̄ = (616 + 368)/(1100 + 800) = 984/1900 = 0.5179

The test statistic is now

Z = [(0.56 − 0.46) − 0] / √[0.5179(1 − 0.5179)(1/1100 + 1/800)]
  = 0.1/0.0232
  = 4.3103

The critical value for a one–tailed test is Z_0.05 = 1.645 (from Table 1).
Making a decision
Since the test statistic Z = 4.3103 is greater than the critical value (1.645), we reject the null
hypothesis H0 at the 5% significance level.
Conclusion: The popularity decreased.

(b) If the popularity decreased by more than 5%, this means (P1 − P2) > 0.05.
The hypotheses are
H0: P1 − P2 = 0.05 against H1: (P1 − P2) > 0.05.
In this question the pooled proportion P̄ is no longer appropriate, and we have to
calculate the standard error SE for the difference between the proportions as given below.
The standard error for P̂1 − P̂2 is

SE = √[P̂1(1 − P̂1)/n1 + P̂2(1 − P̂2)/n2]
   = √[0.56(1 − 0.56)/1100 + 0.46(1 − 0.46)/800]
   = √(0.000224 + 0.0003105)
   = 0.0231

The test statistic for P̂1 − P̂2 is

Z = [P̂1 − P̂2 − (P1 − P2)] / SE
  = [(0.56 − 0.46) − 0.05] / 0.0231
  = 0.05/0.0231
  = 2.16

The critical value for a one–tailed test is Z_0.05 = 1.645.

Making a decision
Since the test statistic Z = 2.16 is greater than the critical value (1.645), we reject the null
hypothesis H0.
Conclusion: The popularity decreased by more than 5%.
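
The same one-tailed test of a non-zero difference can be verified in Python (a sketch, SciPy assumed):

import math
from scipy import stats

p1_hat, n1 = 0.56, 1100
p2_hat, n2 = 0.46, 800
D = 0.05                                                          # hypothesized difference under H0
se = math.sqrt(p1_hat*(1 - p1_hat)/n1 + p2_hat*(1 - p2_hat)/n2)   # unpooled SE, about 0.0231
z = ((p1_hat - p2_hat) - D) / se
p_value = stats.norm.sf(z)                                        # upper-tail p-value
print(round(se, 4), round(z, 2), round(p_value, 4))               # about 0.0231, 2.16, 0.015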

(c) The given information

The standard error of the estimated decrease is SE = 0.0231.
P̂1 = 0.56 and P̂2 = 0.46
The 95% confidence interval is

(P̂1 − P̂2) ± Z_(α/2) × SE
(0.56 − 0.46) ± Z_0.025 × 0.0231
0.10 ± 1.96 × 0.0231
0.10 ± 0.0453
(0.10 − 0.0453; 0.10 + 0.0453)
(0.0547; 0.1453)

Making a decision
Since zero lies outside the confidence limits, we conclude that the null hypothesis H0: P1 − P2 = 0 is
rejected.

Question 3
The given information
The ANOVA table

Source of Variation Sum of squares df Mean Squares F


Treatments 128 4 32 2:963
Error 270 25 10:8
Total 398 29
(a) Correct
The degrees of freedom for error is n − k, where k = 5. From the ANOVA table n − k = 25, so
n = 25 + 5 = 30. Alternatively,
the total degrees of freedom in the general ANOVA table is n − 1; from the ANOVA table
n − 1 = 29, so n = 29 + 1 = 30.

(b) Correct

(c) Correct
If X̄1 = X̄2 = ... = X̄k, consider the sum of squares for treatments,

SST = Σ_{j=1}^{k} nj (X̄j − X̄)²

Since X̄ is the average of the sample means, which are all equal, every difference in the formula for SST
is zero, and so SST = 0.

(d) Correct

(e) Correct

Question 4
The given information

The data of the three brands.


Brand 1 Brand 2 Brand 3
1 2 1
3 2 2
3 4 4
0 3 2
1 0 3
0 4 4
Total 8 15 16
Sample mean X 1 = 1:3333 X 2 = 2:5 X 3 = 2:6667

Sample variance S12 = 1:8667 S22 = 2:3 S32 = 1:4667

Sample size n1 = 6 n2 = 6 n3 = 6

One–way ANOVA procedure


(i)
SST = Σ_{j=1}^{k} nj (X̄j − X̄)²,  with k = 3

n = n1 + n2 + n3 = 6 + 6 + 6 = 18

X̄ = (X̄1 + X̄2 + X̄3)/3
  = (1.3333 + 2.5 + 2.6667)/3
  = 6.5/3
  = 2.1667

SST = n1(X̄1 − 2.1667)² + n2(X̄2 − 2.1667)² + n3(X̄3 − 2.1667)²
    = 6(1.3333 − 2.1667)² + 6(2.5 − 2.1667)² + 6(2.6667 − 2.1667)²
    = 6.3333

(ii)
SSE = Σ_{j=1}^{k} (nj − 1)Sj²
    = (n1 − 1)S1² + (n2 − 1)S2² + (n3 − 1)S3²
    = (6 − 1)(1.8667) + (6 − 1)(2.3) + (6 − 1)(1.4667)
    = 9.3333 + 11.5 + 7.3333
    = 28.167

(iii)
SSTotal = SST + SSE
        = 6.3333 + 28.167
        = 34.5

(iv)
MST = SST/(k − 1) = 6.3333/(3 − 1) = 6.3333/2 = 3.1667

(v)
MSE = SSE/(n − k) = 28.167/(18 − 3) = 28.167/15 = 1.8778

(vi) The test statistic F is
F = MST/MSE = 3.1667/1.8778 = 1.6864

(vii) The rejection region is F > F_(α, k − 1, n − k) = F_(0.01, 2, 15) = 6.36.

(viii) The critical value is F_(0.01, 2, 15) = 6.36.

(ix) Making a decision

Since the test statistic F = 1.6864 is less than the critical value (6.36), we fail to reject the null
hypothesis H0 at the 1% level of significance.

Conclusion: There is not enough evidence to conclude that differences exist between the three
brands.
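
These one-way ANOVA results can be confirmed directly from the raw data (a sketch assuming SciPy):

from scipy import stats

brand1 = [1, 3, 3, 0, 1, 0]
brand2 = [2, 2, 4, 3, 0, 4]
brand3 = [1, 2, 4, 2, 3, 4]
F, p = stats.f_oneway(brand1, brand2, brand3)
crit = stats.f.ppf(0.99, 2, 15)                  # critical value at alpha = 0.01
print(round(F, 4), round(p, 4), round(crit, 2))  # F about 1.69, p about 0.22, critical value 6.36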

2.9 Learning Outcomes


Use the following learning outcomes as a checklist after you have completed this study unit to
evaluate the knowledge you have acquired.

Can you

define SE for (P̂1 − P̂2) under the assumption that P1 = P2?

perform a large-sample statistical test for (P1 − P2)?

derive a large-sample confidence interval for (P1 − P2)?

demonstrate an understanding of the different parts of a statistical test:


- null hypothesis

- alternative hypothesis

- test statistic and its p-value

- rejection region =) critical values

- significance levels

- conclusion

demonstrate an understanding of the connections between the concepts significance level and
p-value?

interpret computer output regarding inferences about an F-test for two population variances

define the following concepts


- within-treatments variation

- sum of squares for error

- between-treatments variation

- rejection region =) critical values for an ANOVA test

differentiate between one- and two-way analysis of variance experimental designs as well as
randomized block designs?

perform statistical tests for H0: μ1 = μ2 = μ3 = ... = μk?

understand the three multiple comparison methods


interpret computer output regarding inferences about an ANOVA test for more than two population
means

Key Terms/Symbols
degrees of freedom
F-test for two population variances
ANOVA-test
within-treatments variation
sum of squares for error
between-treatments variation
SS Within
SS Between
SS Blocks
SS Error
SS Treatment
overall mean

STUDY UNIT 3
3.1 Chi–square test

It is just as important to consider the sampled population as it is to know the data type of your
sample. What do you want to know about a specific population or populations? In the earlier study
units we were always interested in the parameters of the population, which implied that we had some
information about the population (e.g. we knew that it was normally, or approximately so, distributed).
What we have discussed so far implied so-called parametric techniques, where we considered the
statistics of a sample to predict the parameters of the distribution describing the population. In the first
part of this study unit we consider other very important parametric techniques, namely chi-squared
tests. In the second part of this unit we then venture into something new, addressing the dilemma
when one cannot make assumptions about the shape of the sampled population. As statisticians
we are often faced with this reality. Do you think that it is still possible to use a random sample
drawn from such a population and make a sensible analysis and even predictions about that sampled
population? Yes! You are going to see that there are also nonparametric techniques that you can use
if you do not know about the distribution of the sampled population. As usual, apart from explaining
the methods, the necessary conditions under which these alternatives apply will also be described.
Of course, the correct technique for the particular data type remains important.
The first part of this study guide covers two applications of the continuous chi-squared distribution,
which is the technique applicable if the data is nominal. In STA1501 you heard about this distribution
and here hypothesis tests will be discussed and the conditions for their application. Only the chi-
squared goodness-of-fit test and the chi-squared test of a contingency table form part of the contents
of this module (the test for normality is therefore not included). In the second part of this study unit
you will be introduced to three nonparametric techniques. You will see that the sampled populations
are nonnormal and that dependence and independence of the samples play an important role. The
techniques you have to know for this module are the Wilcoxon rank sum test for ordinal or interval
data from two independent samples, the sign test for ordinal data in the form of matched pairs and
lastly the Wilcoxon signed rank test for interval data, also in the form of matched pairs. There are
other nonparametric tests in the prescribed book, but they are not included in the contents of this
module. Remember about them because you never know if you may need to use one of them in
future. Then you simply take the prescribed book and read up about them!
As you study these different tests, please do not be discouraged by all the different definitions that
are given and are used in the manual examples. Remember that we are statisticians and we do not
want to test your memory, but your knowledge of the different procedures and their conditions. In the
examination you will be given a list of formulas from which you can select the one you need (should
we ask a question in an examination paper where you need a formula).

3.2 Chi-squared goodness-of-fit test


In this section we introduce

Chi-Squared Goodness-of-Fit Test

Test statistic
Required conditions

In distance learning the pronunciation of words or symbols is often a problem. If you wonder about
the word "chi" or its symbol χ, think of the words "pie" or "sky" in English, because "chi" rhymes with
them. The ch is pronounced as a k, which means that you actually say "kai".
For the symbol χ² you say "kai-square".

Recall the knowledge given to you in STA1501 about a binomial experiment and the binomial
distribution. Just a reminder - the prefix bi- refers to two, while the prefix multi- refers to many.

Chi-square is a family of distributions commonly used for significance testing. A chi-square test
(also chi-squared or χ² test) is any statistical hypothesis test in which the sampling distribution of
the test statistic is a chi-square distribution when the null hypothesis is true, or any in which this
is asymptotically true, meaning that the sampling distribution (if the null hypothesis is true) can be
made to approximate a chi-square distribution as closely as desired by making the sample size large
enough. A number of tests exist, but you are required to focus only on this one.

Below is a table illustrating the similarities and differences between a binomial and a multinomial
experiment.
Binomial experiment consists of                       Multinomial experiment consists of
a fixed number n of trials                            a fixed number n of trials
two possible outcomes per trial                       k categories (cells) of outcomes per trial
constant probabilities p and 1 − p                    constant probabilities pi for each cell i
two probabilities, p (success) and 1 − p (failure)    k probabilities pi with p1 + p2 + ... + pk = 1
independent trials                                    independent trials
x successes in n trials                               observed frequencies fi of outcomes in cell i
expected value = np                                   expected frequencies ei = npi

The discussion in STA1501 on the chi-squared distribution was very brief. In this section you are
going to learn more about different tests where the test statistic has a chi-squared distribution.

The chi-squared distribution

is a family of continuous probability distributions
is represented by a positively skewed curve whose shape is determined by the
number of degrees of freedom
ranges between 0 and ∞
is used to describe nominal data (you can make a mental link between the -nomial as in binomial
and multinomial if you have difficulty remembering that χ² analysis is done on nominal data)

There are many interesting and practical applications of the chi-squared distribution. Researchers
are also very keen to use a chi-squared test and we hope that you will now study research results
and see if the conditions for application of this distribution are satisfied. The purpose of an analysis
can be to determine if the sample is from a specified population or the interest can be to determine
if there is a relationship between two populations, e.g. between predicted values and actual values.
An example of the latter: suppose a telecommunications company, interested in customer care, is
uncertain about the continuation or not of a specific product. They decide to ask customers if they
would like the service to continue for the next year or not (this would be categorical or nominal data).
The recorded data (two categories of ’yes’ and ’no’) can be saved and the product continued for a
year. Then data (’yes’ or ’no’) can again be collected and a chi-square analysis can be made to see if
there is a relationship between what the people said and what they actually did. If the null hypothesis
is rejected, it indicates that there is a relationship between the two populations. In this scenario the
managers can then decide to use data where customers say what they are going to do, the data are
reliable enough for their planning.
If you study the examples in the book and in the activities, see if you understand the following
comment: Samples should not be too large for applications of the chi-squared test, and in practice,
analysts carefully study the distribution of the items in the chi-square table and do not only rely on
the numerical value of the test.

Goodness-of-fit test
Make sure that you understand the hypothesis testing procedure and the sampling distribution of the
test statistic for the goodness-of-fit test.

Test statistic
How would you express the formula for the test statistic of the goodness-of-fit test in your own words?


χ² = Σ_{i=1}^{k} (fi − ei)²/ei


The procedure is:


Square the difference between the observed and expected frequency and divide it by the expected
frequency for each cell. Add all these answers and it gives you the formula for the test statistic of the
chi-squared goodness-of-fit test.

Is that not easier to remember than the formula itself? It tells you exactly what to do. Can you explain
it to someone else?
If you are still not so sure, we illustrate with the words:

Square (...)² the difference (fi − ei) between the observed frequency fi and the expected frequency ei,
and divide it by the expected frequency ei for each cell.
Add all these answers (Σ from i = 1 to k) and you have the formula for the test statistic of the
chi-squared goodness-of-fit test.
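
Here is a minimal Python sketch of a goodness-of-fit test (SciPy assumed), using hypothetical counts from 120 rolls of a die to test whether all six faces are equally likely (pi = 1/6 for each cell):

from scipy import stats

observed = [25, 17, 15, 23, 24, 16]        # hypothetical observed frequencies fi
expected = [120 / 6] * 6                   # expected frequencies ei = n * pi
chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(round(chi2, 3), round(p_value, 4))   # chi-square = 5.0 with k - 1 = 5 degrees of freedom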

Table 5: Critical values of χ²


Activity 3.1
Question 1
Employee absenteeism has become a serious problem which cannot be ignored. The personnel
department at a university decided to record the weekdays during which lecturers in the Faculty of
Humanities in a sample of 300 called in sick over the past several months. Determine if the given
data suggests that absenteeism is higher on some days of the week than on others.
From existing medical evidence the following information is specified in the null hypothesis for the
consecutive days of the week:
Monday P1 = 0:3; Tuesday P2 = 0:1; Wednesday P3 = 0:2; Thursday P4 = 0:2; Friday P5 = 0:2

Day of the week    Monday    Tuesday    Wednesday    Thursday    Friday
Number absent      84        24         56           64          72

Question 2
In a goodness-of-fit test, suppose that a sample showed that the observed frequency fi and expected
frequency ei were equal for each cell i: Then, the null hypothesis is

1. rejected at α = 0.05 but is not rejected at α = 0.25

2. not rejected at α = 0.05 but is rejected at α = 0.25

3. rejected at any level

4. not rejected at any level

5. the same as the difference between fi and ei

Question 3
The critical value in a goodness-of-fit test with 6 degrees of freedom, considered at the 5%
significance level, is

1. equal to 18:5476

2. equal to 12:6

3. equal to 0:872085

4. always greater than the test statistic

5. always less than the test statistic



Question 4
A chi-squared goodness-of-fit test is always conducted as

1. a lower-tail test

2. an upper-tail test

3. a two-tailed test

4. a measure of the size of the cells

5. any of the above

Question 5
Five statements are given below. Only one of them is a true statement. Which option is true?

1. For a chi-squared distributed random variable with 10 degrees of freedom and a level of
significance of 0:025, the chi-squared table value is 20:5. The computed value of the test statistic
is 16:857. This will lead us to reject the null hypothesis.

2. Whenever the expected frequency of a cell is less than 5, one remedy for this condition is to
decrease the size of the sample.

3. For a chi-squared distributed random variable with 12 degrees of freedom and a level of
significance of 0:05, the chi-squared value from the table is 21:0. The computed value of the
test statistics is 25:1687. This will lead us to reject the null hypothesis.

4. The chi-squared goodness-of-fit test can be used for any type of data.

5. In a multinomial experiment the probability Pi that the outcome will fall into cell i can change from
one trial to the next.

3.3 Chi-squared test of a Contingency Table


In this section we present

Chi-Squared Test of a Contingency Table

Test statistic

Rejection region and p-value

Rule of five

You need to realize that there are many similarities between the two χ²-tests in this chapter, and that
there are also definite differences.
In statistics, contingency tables are used to record and analyse the relationship between two or
more variables, most usually categorical variables. Suppose that we have two variables, sex (male
or female) and handedness (right- or left-handed). We observe the values of both variables in a
random sample of 100 people.

Then a contingency table can be used to express the relationship between these two variables, as
follows:
Right-handed Left-handed TOTAL
Male 43 9 52
Female 44 4 48
TOTAL 87 13 100

The figures in the right-hand column and the bottom row are called marginal totals and the figure
in the bottom right-hand corner is the grand total. The table allows us to deduce at a glance that
the proportion of men who are right-handed is about the same as the proportion of women who are
right-handed. However the two proportions are not identical and the statistical significance of the
difference between them can be tested statistically using one of a number of available methods. In
our case we will use a nonparametric method called a Pearson’s chi-square test. In this case the
entries provided in the table must represent a random sample from the population contemplated in
the null hypothesis. If the proportions of individuals in the different columns vary between rows (and,
therefore, vice versa) we say that the table shows contingency between the two variables. If there is
no contingency, we say that the two variables are independent.
If we make a table of comparisons it might help you to remember the different principles involved and
the calculation methods.

χ² Goodness-of-Fit Test:
Only applicable for nominal data produced by a multinomial experiment.
Data are classified into k categories.
Expected frequency for each category is ei = npi.
Degrees of freedom: df = k − 1.
The probabilities pi are given; H0 lists values for the probabilities pi.
Test statistic: χ² = Σ_{i=1}^{k} (fi − ei)²/ei.

χ² Test of a Contingency Table:
Only applicable for nominal data arranged in a contingency table.
A contingency table with r rows and c columns consists of k = rc cells.
Expected frequency of the cell in row i and column j is
eij = (total of row i × total of column j)/sample size.
Degrees of freedom: df = (r − 1)(c − 1).
The pi are calculated assuming H0 is true; H0 states that the two variables are independent.
Used to test for evidence to conclude (infer) that two classifications of a population are related
(not statistically independent), or that two or more populations differ.
Test statistic: χ² = Σ_{i=1}^{k} (fi − ei)²/ei.

For both tests the expected value for each cell must be at least 5 (rule of five).

The manual calculation of the χ²-value for a contingency table is rather cumbersome, but not that
complex!
Make sure that you understand the process of

calculating the expected frequencies for each cell - multiply total of row and total of column and
divide by the grand total
writing the given (observed) frequencies and calculated (expected) frequencies next to each other
for each cell in a new contingency table
calculation of the test statistic, which involves only this last contingency table for each cell: subtract
the two frequencies, square the answer, then divide by the calculated (expected) frequency

If you calculate these values with Excel or Minitab it is of course not so complex, but remember that,
at this first-year level, you have to know the "how" of the process itself and not only the interpretation
of the 2 and p values.
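
A minimal Python sketch of the contingency-table test (SciPy assumed), applied to the sex-by-handedness table above; correction=False is passed so the ordinary Pearson χ² is computed rather than the continuity-corrected version SciPy applies to 2×2 tables by default.

import numpy as np
from scipy import stats

table = np.array([[43, 9],      # male: right-handed, left-handed
                  [44, 4]])     # female: right-handed, left-handed
chi2, p_value, df, expected = stats.chi2_contingency(table, correction=False)
print(round(chi2, 3), df, round(p_value, 4))
print(expected)    # expected frequencies: row total x column total / grand total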

3.4 Summary of tests on nominal data


This section emphasises the contexts in which the various chi–square tests apply, as follows:

Z –test of inference about a population proportion as you have learnt in STA1501.

Z –test of inference about the difference between two population proportions as we have indicated
in the study unit in section 2.2.

The chi–squared goodness-of–fit test.

The chi–squared test of a contingency table.

You will understand that it is necessary to learn each technique at a time and focus on the kinds
of problems each addresses. A summary of the statistical test on nominal data is given below to
ensure that you are capable of selecting the correct method.

Statistical Techniques for Nominal Data

Problem objective                                 Number of categories    Statistical Technique
Describe a population                             2                       Z–test of p or chi–squared goodness–of–fit test
Describe a population                             More than 2             Chi–squared goodness–of–fit test
Compare two populations                           2                       Z–test of P1 − P2 or chi–squared test of a contingency table
Compare two populations                           More than 2             Chi–squared test of a contingency table
Compare two or more populations                   2 or more               Chi–squared test of a contingency table
Analyze the relationship between two variables    2 or more               Chi–squared test of a contingency table

Notice that there are two groups of tests: those based on the Z–test of p together with the
chi–squared test of a multinomial experiment, and those that employ two or more categories in a table.
In the first approach, called the chi–squared goodness-of-fit test, we determine the frequency of each
category and use these frequencies to calculate the test statistic. In the second approach, called the
chi–squared test of a contingency table, we use the cell frequencies to calculate the chi–squared test statistic.

Activity 3.2
Question 1
The trustee of a company’s pension plan has solicited the opinions of a sample of the company’s
employees about a proposed revision of the plan. A breakdown of the responses is shown in the
accompanying table. Is there enough evidence to infer that the responses differ between the three
groups of employees?
Responses    Blue–collar Workers    White–collar Workers    Managers
For          67                     32                      11
Against      63                     18                      9

Question 2
The number of degrees of freedom for a contingency table with 5 rows and 7 columns is

1. 35
2. 12
3. 10
4. 24
5. 30

Question 3
In a chi-squared test of a contingency table, the test statistic value was χ² = 12.678, and the critical
value at α = 0.025 was 14.4. Thus,

1. the number of degrees of freedom was not 6


2. we fail to reject the null hypothesis at α = 0.025
3. we reject the null hypothesis at α = 0.025
4. we don’t have enough evidence to accept or reject the null hypothesis at α = 0.025
5. we should decrease the level of significance in order to reject the null hypothesis

Question 4
Which of the following statements is/are false?

1. A chi-squared test for independence is applied to a contingency table with 3 rows and 4 columns
for two qualitative variables. The degrees of freedom for this test must be 12:

2. A chi-squared test for independence with 10 degrees of freedom results in a test statistic of 17:894.
Using the chi-squared table, the most accurate statement that can be made about the p-value for
this test is that 0:05 < p-value< 0:10.

3. In a chi-squared test of independence, the value of the test statistic was 15.652, and the critical
value at α = 0.025 was 11.1433. Thus, we must reject the null hypothesis at α = 0.025.

4. A chi-squared test for independence with 6 degrees of freedom results in a test statistic of 13:25:
Using the chi-squared table, the most accurate statement that can be made about the p-value for
this test is that p-value is greater than 0:025 but smaller than 0:05.

5. The chi-squared test of a contingency table is used to determine if there is enough evidence to
infer that two nominal variables are related, and to infer that differences exist among two or more
populations of nominal variables.

Activity 3.3
Question 1
A statistics professor posted the following grade distribution guidelines for his elementary statistics
class:
8% A, 35% B, 40% C, 12% D, and 5% F.

A sample of 100 elementary statistics grades at the end of last semester showed
12 A’s, 30 B’s, 35 C’s, 15 D’s, and 8 F’s.

Suppose that you test at the 5% significance level to determine whether the actual grades deviate
significantly from the posted grade distribution guidelines. Compare your calculations with the step
by step calculations given below. Indicate in which step the first error was made.

1. H0 : p1 = 0:08; p2 = 0:35; p3 = 0:40; p4 = 0:12; p5 = 0:05:


H1 : At least two proportions differ from their specified values.

2. Rejection region: χ² > χ²_(0.050, 4) = 9.49


3. Test statistic: 5:889

4. Conclusion: Reject the null hypothesis.

5. The actual grades do not deviate significantly from the posted grade distribution guidelines.

Question 2
Which of the following tests is appropriate for nominal data if the problem objective is to compare two
or more populations and the number of categories is at least 2?

1. The z -test for one proportion, p, or difference of two proportions

2. The chi-squared goodness-of-fit test

3. The chi-squared test of a contingency table

4. All of the above

5. Not one of the above



Feedback Feedback

Activity 3.1
Question 1
H0 : p1 = 0:3; p2 = 0:1; p3 = 0:2; p4 = 0:2; p5 = 0:2
H1 : At least one pi is not equal to its specified value.

The observed frequencies.


Cell 1 2 3 4 5
Frequency fi 84 24 56 64 72
The expected frequencies are ei = N × pi, where
N = 84 + 24 + 56 + 64 + 72 = 300
e1 = N p1 = 300 × 0.3 = 90
e2 = N p2 = 300 × 0.1 = 30
e3 = N p3 = 300 × 0.2 = 60
e4 = N p4 = 300 × 0.2 = 60
e5 = N p5 = 300 × 0.2 = 60
Total = 300

The test statistic for the chi–square test is

χ² = Σ_{i=1}^{5} (fi − ei)²/ei

Cell                       1     2     3     4     5
Observed frequency fi      84    24    56    64    72
Expected frequency ei      90    30    60    60    60

χ² = (84 − 90)²/90 + (24 − 30)²/30 + (56 − 60)²/60 + (64 − 60)²/60 + (72 − 60)²/60
   = 0.4 + 1.2 + 0.2667 + 0.2667 + 2.4
   = 4.5334

The rejection region is

χ² > χ²_(α, k − 1)
χ² > χ²_(0.01, 5 − 1)
χ² > χ²_(0.01, 4) = 13.3      (from Table 5)

Making a decision
Since the test statistic χ² = 4.5334 is less than the critical value (13.3), we fail to reject H0 at the 1%
significance level.
Conclusion: There is not enough evidence to infer that absenteeism is higher on some
days of the week.

Question 2
Option (4)
The chi-squared goodness-of-fit test involves the difference between the expected and observed
frequencies. In this question there is never a difference between the two, with the result that the null
hypothesis will never be rejected.

Question 3
Option (2)
From Table 5 (the χ² table), find the cell where the column under χ²_0.050 meets the row with 6 degrees
of freedom in the first column. The value written there is 12.6.

Question 4
Option (2)
If you are not sure, look at the little picture at the top of the page listing the χ² values in Table 5 and you will
see that the shaded area lies on the right-hand side.

Question 5
Option (3)

1. False, because the table is correct, but the value 16:857 does not fall in the critical region and
therefore the null hypothesis will not be rejected.

2. False. The remedy is to combine cells should any expected value in a cell be less than 5:

3. True. The test statistic 25.1687 is greater than the critical value 21.0, so the null hypothesis would be rejected.

4. False. Only nominal data may be used in applications of the test.

5. False. These probabilities have to remain constant for each trial of a multinomial experiment.

Activity 3.2
Question 1
The given information

In this problem we want to find out if the job description of an employee has an influence on their
choice option. A contingency table is used to address this problem.
The hypotheses are
H0 : The two variables (employees and their responses) are independent.
H1 : The two variables are dependent.
The table for observed frequencies and the expected frequencies are in brackets.
                    Employees
Responses           Blue collar    White collar    Managers    Total
For revision        67 (71.5)      32 (27.5)       11 (11)     110
Against revision    63 (58.5)      18 (22.5)       9 (9)       90
Total               130            50              20          200

The grand total is 200:


The formula to calculate the expected frequencies is

eij = (row total × column total)/grand total

For revision, blue collar: (110 × 130)/200 = 71.5

Against revision, blue collar: (90 × 130)/200 = 58.5

For revision, white collar: (110 × 50)/200 = 27.5

Against revision, white collar: (90 × 50)/200 = 22.5

For revision, manager: (110 × 20)/200 = 11

Against revision, manager: (90 × 20)/200 = 9

Therefore the test statistic is

χ² = Σ_{i=1}^{6} (fi − ei)²/ei
   = (67 − 71.5)²/71.5 + (63 − 58.5)²/58.5 + (32 − 27.5)²/27.5
     + (18 − 22.5)²/22.5 + (11 − 11)²/11 + (9 − 9)²/9
   = 0.2832 + 0.3462 + 0.7364 + 0.9 + 0 + 0
   = 2.2658

The degrees of freedom are

df = (number of rows − 1) × (number of columns − 1) = (r − 1)(c − 1)
   = (2 − 1)(3 − 1)
   = 1 × 2
   = 2

The critical value is χ²_(α, df) = χ²_(0.05, 2) = 5.99 (using Table 5).

Making a decision
Since the test statistic (2:2658) is less than the critical value (5:99) ; we fail to reject the null
hypothesis H0 at 5% level of significance.
Conclusion: There is not enough evidence that the response to the proposed revision plan
depends on the group (according to job description in the company) of the employee.
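
A sketch of the same test in Python (SciPy assumed):

from scipy import stats

table = [[67, 32, 11],     # for the revision
         [63, 18, 9]]      # against the revision
chi2, p_value, df, expected = stats.chi2_contingency(table)
print(round(chi2, 4), df, round(p_value, 4))   # about 2.27 with 2 degrees of freedom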

Question 2
Option (4)
The degrees of freedom are df = (r − 1)(c − 1) = (5 − 1)(7 − 1) = 4 × 6 = 24.

Question 3
Option (2)
The number of degrees of freedom was 6, as can be seen from the χ² table if you find the cell under
χ²_0.025 with the value 14.4 written in it. Furthermore, because 14.4 is larger than the calculated 12.678,
the null hypothesis cannot be rejected. For option 5, if you look at the table and decrease the
significance level to χ²_0.010, the critical value is 16.8 and the null hypothesis would still not be rejected
because 12.678 < 16.8.

Question 4
Option (1)
Option 1 is false because of the number of degrees of freedom: it is not 3 × 4 = 12 but 2 × 3 = 6.
Option 2 is true because p-values can only be determined accurately with computer software.
However, we can get some indication from the χ² table: 17.894 lies between the table values 16.0
and 18.3, which correspond respectively with significance levels of 0.100 and 0.050. Therefore the
comment about the range of the p-value is true.
Option 3 is true because the test statistic’s value 15:652 is more than the table value 11:1433, which
places it in the rejection region at level = 0:025:
Option 4 is true for the same reasons as option 2 is true.
Option 5 is true.

Activity 3.3
Question 1

The hypothesis are


H0: p1 = 0.08, p2 = 0.35, p3 = 0.40, p4 = 0.12, p5 = 0.05
H1 : At least one pi is not equal to its specified value.
The observed frequencies are
Cell 1 2 3 4 5 Total
Frequency fi 12 30 35 15 8 100 = n
The expected frequencies ei = npi are
e1 = np1 = 100 × 0.08 = 8
e2 = np2 = 100 × 0.35 = 35
e3 = np3 = 100 × 0.40 = 40
e4 = np4 = 100 × 0.12 = 12
e5 = np5 = 100 × 0.05 = 5
Total = 100

The data to be used


Cell 1 2 3 4 5
fi 12 30 35 15 8
ei 8 35 40 12 5
The test statistic is

χ² = Σ_{i=1}^{5} (fi − ei)²/ei
   = (12 − 8)²/8 + (30 − 35)²/35 + (35 − 40)²/40 + (15 − 12)²/12 + (8 − 5)²/5
   = 2 + 0.7143 + 0.625 + 0.75 + 1.8
   = 5.8893

The rejection region is

χ² > χ²_(α, k − 1)
χ² > χ²_(0.05, 5 − 1)
χ² > χ²_(0.05, 4)
χ² > 9.49

Making a decision
The test statistic does not fall in the rejection region, therefore the null hypothesis H0 cannot be
rejected.

1. Correct

2. Correct

3. Correct

4. Incorrect
The error lies in the interpretation of the calculated value.

5. Correct

Option (4)

Question 2
Option (2)

STUDY UNIT 4
4.1 Simple Linear Regression and correlation
Introduction
In this study unit the discussion is about the relationship between interval variables. In regression
analysis involving two variables, one of the variables is used to make predictions about the other
variable. Recall that interval data are real numbers, such as heights, weights, incomes and distance,
as was said in chapter 2 (or STA1501), where you were told that interval data can also be referred to
as quantitative or numerical data. In this unit the so-called probabilistic model for regression analysis
is described, with initial interest in the first-order linear model (also called the simple linear regression
model). In this model an error variable is introduced. Finding the equation of the regression line is
the first step, but this has to be followed by an assessment of the fit of the line to the data as well as
looking into the relationship between the dependent and independent variables. The importance of
the error variable and the conditions that apply to it, forms the basis of many of the discussions that
follow.

You will not be examined on all the sections of this chapter; our focus will be on the topics of
Simple Linear Regression and Correlation. The topics covered in these sections are very important
and should you continue with statistics, you will surely learn about them in a second-level module.

4.2 Simple Linear Regression and Correlation


In this chapter, we restrict our attention to simple linear regression and correlation. The focus is
on the following topics:

1. Model

2. Estimating the coefficients

3. Assessing the model

4. Correlation and coefficient of determination

4.2.1 A Simple Linear Model

In this discussion we have bivariate data, and we use the equation of a straight line to
describe the relationship between the independent variable X and the dependent variable Y. We
describe the strength of the relationship using the correlation coefficient r. Consider the problem of
predicting the value of a response Y based on the value of an independent variable X. The best–fitting
line is Y = β0 + β1X + ε

where

β0 = intercept, the value of Y when X = 0.
β1 = the slope of the line, defined as the change in Y for a one-unit increase in X.
X = independent variable (or predictor variable).
Y = dependent variable (or response variable).
ε = random error that explains the deviation of the points (X, Y) about the line.

The model Y = β0 + β1X + ε is called the first–order model or the simple linear regression model.
It is used to analyze the relationship between two variables, X and Y, both of which must be interval. To describe the
relationship between X and Y, we need to know the values of the coefficients β0 and β1. However,
these coefficients are population parameters, which are always unknown. In the next section, we
discuss how these parameters are estimated.

4.2.2 Estimating the coefficients

To find the best–fitting line for a set of bivariate data we need to estimate
the parameters β0 and β1. Since these parameters represent the coefficients of a straight line, their
estimators are based on drawing a straight line through the sample data. The formula for the best–
fitting line is Ŷ = b0 + b1X, where

b0 = intercept
b1 = the slope
X = independent variable
Ŷ = the predicted or fitted value of Y.

The least squares method is an approach that enables us to produce this straight line. Finding the
values of b0 and b1 requires differential calculus, which is beyond the scope of this module. Rather than derive
their values, we simply present formulas for calculating the values of b0 and b1.
Least squares line coefficients

b1 = Sxy / Sx²

b0 = Ȳ − b1 X̄

where
Sxy = the covariance of (X, Y)
Sx² = the variance of X.

The sample covariance and sample variance are calculated as

Sxy = [ Σ_{i=1}^{n} XiYi − (Σ_{i=1}^{n} Xi)(Σ_{i=1}^{n} Yi)/n ] / (n − 1)

Sx² = [ Σ_{i=1}^{n} Xi² − (Σ_{i=1}^{n} Xi)²/n ] / (n − 1)

X̄ = (Σ_{i=1}^{n} Xi)/n,    Ȳ = (Σ_{i=1}^{n} Yi)/n

You can use your scientific calculator to obtain the necessary sums and sums of squares as well
as the values of b0, b1 and the correlation coefficient r. Make sure you consult your calculator manual
to find the easiest way to obtain the least squares estimators b0 and b1. Be careful about rounding
errors: carry at least four significant figures, and round off only when reporting the end result.
The deviations between the actual data points and the line are called the residuals, denoted ei, that
is, ei = yi − ŷi.
The minimized sum of squared deviations is called the sum of squares for error, denoted SSE.
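
A minimal sketch of the least squares calculation in Python (NumPy assumed), using a small hypothetical data set:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])     # hypothetical X values
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])     # hypothetical Y values
n = len(x)
s_xy = (np.sum(x * y) - x.sum() * y.sum() / n) / (n - 1)   # sample covariance Sxy
s_xx = (np.sum(x ** 2) - x.sum() ** 2 / n) / (n - 1)       # sample variance Sx^2
b1 = s_xy / s_xx                                           # slope
b0 = y.mean() - b1 * x.mean()                              # intercept
print(round(b1, 4), round(b0, 4))
# np.polyfit(x, y, 1) returns the same slope and intercept and can serve as a cross-check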

4.2.3 Assessing the model


In section 4.2.2 you used the least squares method to obtain the straight line. In
this section it is important for us to assess how well the linear model fits the data. If the fit is poor,
we should seek another model. Several techniques are used to evaluate the model, but we present
two statistics and one test procedure to determine whether a linear model should be employed.
The methods are

(1) The standard error of estimate: In the linear model, the error variable ε is normally distributed
    with mean 0 and standard deviation σε. If σε is large, the model fits the data poorly. If σε
    is small, the errors tend to be close to the mean (which is 0) and, as a result, the model fits
    the data well. We would like to know σε, but unfortunately it is a population parameter and is
    unknown. However, we can estimate σε from the data by using the sum of squares for error (SSE).
    The formula for the standard error of estimate is

        se = √[ SSE / (n − 2) ]

    where

        SSE = Σ (yi − ŷi)² = (n − 1) [ Sy² − Sxy²/Sx² ]

    Sy² = the variance of the y-variable
    Sxy² = the square of the covariance
    Sx² = the variance of the X-variable

(2) The coefficient of determination is a measure of the strength of the relationship. To answer the
    question "How well does the regression model fit?", we can use the correlation coefficient r
    given by

        r = Sxy / (Sx·Sy)        with −1 ≤ r ≤ 1

    that is, the correlation coefficient is a value between −1 and +1 that measures the strength of
    the relationship.
    The coefficient of determination R² is the proportion of the total variation in y that is explained
    by the linear regression of y on x. In general, the higher the value of R², the better the model
    fits the data. You will discover in practice that when you improve the model, the value of R²
    increases. The formula for the coefficient of determination is

        R² = Sxy² / (Sx²·Sy²)

(3) The t-test of the slope β1 is used to determine whether there is evidence of a linear relationship;
    its computation is illustrated in the sketch below.
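The sketch below (again my own illustration, building on the least_squares() helper and the hypothetical data given earlier) computes the three assessment tools in one pass: the standard error of estimate se, the correlation coefficient r with R², and the t statistic for the slope, which has n − 2 degrees of freedom.

import math

def assess_model(x, y, b0, b1):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
    s_xx = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    s_yy = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(sse / (n - 2))                    # standard error of estimate
    r = s_xy / math.sqrt(s_xx * s_yy)                # correlation coefficient
    r_sq = r ** 2                                    # coefficient of determination
    t_slope = b1 / (se / math.sqrt((n - 1) * s_xx))  # t statistic for the slope (n - 2 df)
    return se, r, r_sq, t_slope

A large absolute value of t_slope, compared with the critical t-value with n − 2 degrees of freedom, is evidence of a linear relationship.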

4.3 Required Conditions: Diagnostic Tools for checking the regression assumptions

In the previous sections we examined several tests and measures that helped us to answer linear
regression questions, for instance:

Is the independent variable X useful in predicting the response (dependent) variable y ?


If so, how well does it work?

Several statistical tests and measures were used to determine whether the model fits the data well:
the t-test for the slope (or the ANOVA F-test) and the value of the coefficient of determination
R². Take note that the results of a regression analysis are valid only when the data satisfy the
necessary regression assumptions.
The regression assumptions:

(1) The relationship between Y and X must be linear, given by the model Y = β0 + β1X + ε.

(2) The values of the random error ε must


(i) be independent.

(ii) have a mean of 0 and a common variance σ².

(iii) be normally distributed.

The diagnostic tools for checking these assumptions involve the analysis of the residuals (errors).
Residual plots are tedious to construct by hand but easy to produce with a computer. When the
residuals are normally distributed, or approximately so, a normal probability plot of the residuals
should appear as a straight line.
When the observations are collected at regular time intervals, the error terms are often dependent
and the observations make up a time series; such data are analyzed using time series methods.
In a diagnostic analysis the requirements for the error variable and the influence of very large or
very small observations must be investigated. You need not apply the different tests, but you have to
know about them and what they mean.
When the normality requirement is not satisfied, we can use a nonparametric technique called the
Spearman rank correlation coefficient to replace the t-test.

Concept                   Meaning of the concept                        Test

Normality                 Bell-shaped symmetrical curve                 Draw a histogram of the residuals
Heteroscedasticity        The variance is not constant                  Plot the residuals and interpret
Homoscedasticity          The variance is constant                      Plot the residuals and interpret
Independence of           Looking at the relationship                   Graph the residuals against the
error variables           among the residuals                           time periods - no pattern
Dependence of             Looking at the relationship                   Graph the residuals against the
error variables           among the residuals                           time periods - a pattern exists
Outliers                  Error in recording of values, wrong           Clear from a scatter diagram
                          sample data point, incorrectly recorded value
Influential observation   Looks like an outlier, but has a              Scatter diagram inspection
                          big influence on the statistic
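As a rough illustration of the first rows of this table, the following sketch (assuming the matplotlib library is installed; the helper and data names come from the earlier sketches and are not part of the study guide) draws the two residual plots that are usually examined: a histogram of the residuals to judge normality, and the residuals plotted in observation order to look for non-constant variance or time dependence.

import matplotlib.pyplot as plt

def residual_plots(x, y, b0, b1):
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))
    ax1.hist(residuals, bins=8)                 # roughly bell-shaped suggests normality
    ax1.set_title("Histogram of residuals")
    ax2.scatter(range(1, len(residuals) + 1), residuals)
    ax2.axhline(0)                              # look for fanning out or a pattern over time
    ax2.set_title("Residuals in observation order")
    plt.show()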

Activity 4.1
Question 1
The regression line y^ = 3 + 2x has been fitted to the data points (4; 8); (2; 5); and (1; 2). The sum of
the squared residuals will be

1. 7

2. 15

3. 8

4. 22

5. 7:5

Question 2
If an estimated regression line has a y -intercept of 10 and a slope of 4, then when x = 2 the actual
value of y is

1. 15

2. 24

3. 18

4. 14

5. unknown

Question 3
Given the least squares regression line ŷ = 5 − 2x, choose the correct statement:

1. The relationship between x and y is positive.

2. The relationship between x and y is negative.

3. As x increases, so does y:

4. As x decreases, so does y:

5. The formula gives the equation of the population regression line.

Question 4
A regression analysis between weight y (in kilogram) and height x (in centimetre) resulted in the
following least squares line: y^ = 70 + 2x. This implies that if the height is increased by 1 centimetre,
the weight, on average, is expected to

1. increase by 1 kilogram

2. decrease by 2 kilogram

3. increase by 2 kilogram

4. decrease by an unknown amount

5. increase by an unknown amount.

Question 5
In regression analysis, the residuals represent the

1. difference between the actual y -values and their predicted values

2. difference between the actual x-values and their predicted values

3. square root of the slope of the regression line

4. change in y per unit change in x

5. sum of the squares for error, denoted by SSE



Question 6

In a simple linear regression problem, the following statistics are calculated from a sample of 10
observations: Σ(x − x̄)(y − ȳ) = 2250, sx = 10, Σx = 50, Σy = 75. The least squares estimates
of the slope and y-intercept are, respectively,

1. 2:2 and 3:5

2. 2:5 and 1:5

3. 5:5 and 2:5

4. 2.5 and −5.0

5. 25 and 117:5

Question 7

A random sample of 11 statistics students produced the following data where x is the third test score,
out of 100, and y is the final exam score, out of 300. Can you predict the final exam score of a random
student if you know the third test score?
x third exam score 65 67 71 71 66 75 67 70 71 69 69
y final exam score 175 133 185 163 126 198 153 163 159 151 159
You can easily show, by estimating the slope and intercept, that the best-fit line for the third exam/final
exam example has the equation ŷ = −173.51 + 4.83x.
What would be the expected final scores for students who obtained third exam scores of (i) 68, (ii)
78 and (iii) 94?

Question 8

The simple linear regression line is Son’s Height = 33:73 + 0:516 Father’s Height

(a) Interpret the coefficients.

(b) What does the regression line tell you about the heights of sons of tall fathers?

(c) What does the regression line tell you about the heights of sons of short fathers?

Question 9

Which value of the coefficient of correlation r indicates a stronger correlation than 0:65?

1. 0:55
2. −0.75
3. 0:60
4. 0:05
5. 0:65

Question 10

In a regression problem the following pairs of (x, y) are given: (3, −1), (3, 1), (3, 0), (3, −2) and (3, 2).
That indicates that the

1. correlation coefficient has no limits

2. correlation coefficient is −1

3. correlation coefficient is 0

4. correlation coefficient is 1

5. changes in y caused no change in the values of x

__________________________________________________________________________

Feedback

Activity 4.1
Question 1

The given information

The data points for the actual values


For (4, 8): X = 4 and Y = 8.
For (2, 5): X = 2 and Y = 5.
For (1, 2): X = 1 and Y = 2.
The predicted values are calculated by the regression line Yb = 3 + 2X when you substitute the
values of X = 4; 2 and 1 in the regression line.
If X = 4; then Yb = 3 + 2 (4)
= 3+8
= 11
If X = 2; then Yb = 3 + 2 (2)
= 3+4
= 7
If X = 1; Yb = 3 + 2 (1)
= 3+2
= 5

The residuals ei = Yi − Ŷi

Data points
X    Y    Predicted Ŷ    ei = Yi − Ŷi     ei² = (Yi − Ŷi)²
4    8    11             8 − 11 = −3      9
2    5    7              5 − 7 = −2       4
1    2    5              2 − 5 = −3       9
                                   Total  22
Option (4)

Question 2

Option (5)

We can say nothing about the actual value of y , because the interpretation of the calculated values
only refer to the sample.

Question 3

Option (2)

In the least squares regression line ŷ = 5 − 2x the value of the slope is −2, which is negative;
therefore the relationship is negative (if the one increases, the other will decrease).

Question 4

Option (3)

The relationship can be expressed based on the slope. From the equation y^ = 70 + 2x we know the
slope of the line is 2, which implies that ratio rise/run is 2=1. For each move forward (x height) the
movement up (y weight) will be double of that.

Question 5

Option (1)

The residuals ei = Yi − Ŷi.



Question 6

The given information:

    Σ(X − X̄)(Y − Ȳ) = 2250,  sx = 10,  ΣX = 50,  ΣY = 75,  n = 10

The covariance of (X, Y):   sxy = Σ(x − x̄)(y − ȳ)/(n − 1) = 2250/(10 − 1) = 250

The variance of X:          sx² = 10² = 100

The slope:                  b1 = sxy/sx² = 250/100 = 2.5

The intercept:              b0 = ȳ − b1·x̄ = 75/10 − 2.5(50/10) = 7.5 − 12.5 = −5

Option (4)

Question 7

We are given the equation for this estimation: Ŷ = −173.51 + 4.83X. Thus, for those who obtained
third exam scores of (i) 68, (ii) 78 and (iii) 94 we would expect the following final exam scores:

(i) when x = 68, then ŷ = −173.51 + 4.83(68) = 154.93

This means that for a student who obtained 68 out of 100 in the third test, we expect him/her to
obtain about 155 out of 300 in the examination.

(ii) when x = 78, then ŷ = −173.51 + 4.83(78) = 203.23

For one who obtained 78 out of 100 in the third test we expect him/her to obtain about 203 out of 300 in
the examination.

(iii) when x = 94, then ŷ = −173.51 + 4.83(94) = 280.51

Thus, one who obtained 94 out of 100 in the third test is expected to obtain about 281 out of 300 in the
examination.

Question 8

Compare the given equation of the regression line with the standard form of the regression line:

y^ = b0 + b1 x
Son’s height = 33:73 + 0:516 Father’s height

This implies that the dependent variable y represents the son's height and the independent variable
x represents the father's height (in centimetres). We assume that both father and son are measured
when they are fully grown.

Let us answer the questions:

(a) The intercept b0 = 33:73 is where the regression line and the y -axis intersect and at that point
x = 0. It does not mean that when the father’s height is 0 (not born yet ??) the son’s height is
33:73 cm. You can see that makes no sense - it is meaningless!
The slope coefficient b1 = 0:516 implies that for each additional cm of the father’s height the son’s
height increases on average by 0:516 cm.

(b) The son's predicted height equals the father's height where ŷ = x, that is, where x = 33.73/(1 − 0.516) ≈ 69.7.
For fathers taller than this break-even value, the predicted height of the son is less than the father's own
height. In other words, if the father is tall, the son would on average be shorter than his father: heights
regress towards the mean.

(c) If the father is short, then on average the son will be taller than his father.

Question 9

Option (2)

Remember that we said that the closer the value of r is to either +1 or −1, the stronger the
relationship between the variables. The fact that we compare positive and negative values is
irrelevant if the only issue is the strength of the relationship. A value of r close to zero indicates
a very weak relationship; a value of −0.75 indicates a strong negative relationship.

Question 10

The given information

The pairs of (x, y) are

    (3, −1), (3, 1), (3, 0), (3, −2) and (3, 2)

The correlation coefficient r is

    r = Sxy/(Sx·Sy)

    Sxy = [ ΣXY − (ΣX)(ΣY)/n ] / (n − 1)

Using the scientific calculator, the summary statistics are

    ΣX = 15,  ΣY = 0,  ΣXY = 0,  ΣX² = 45,  ΣY² = 10,  X̄ = 3,  Ȳ = 0

    Sxy = [ 0 − (15)(0)/5 ] / (5 − 1) = 0

    r = 0/(Sx·Sy) = 0

Option (3)

STUDY UNIT 5
5.1 Non parametric statistics
In Unit 1, we presented statistical techniques for comparing two populations by comparing
their respective population parameters (usually their population means). Those approaches are
applicable to quantitative data drawn from normal distributions. In this unit, we present statistical
tests for comparing populations for the many types of data that do not satisfy the assumptions of a
normal distribution. The following statistical tests will be discussed at first-year level:

The Rank Sum Test

The Sign Test

The Signed Rank Sum Test

The Friedman F –test

The Kruskal–Wallis H –test

We use nonparametric techniques when the sample sizes are small and the original populations are
not normal.
We present nonparametric techniques appropriate for comparing two or more populations using
either independent or matched paired samples. Furthermore, we will discuss a measure of
association that is useful in determining whether one variable increases, or decreases, as the other
increases.

5.2 The Wilcoxon Rank Sum Test: Independent Random Samples

When comparing the means of two populations based on independent samples, if you are not sure
whether a two-sample t-test can be used, one alternative is to replace the values of the observations
by their ranks. Two different nonparametric tests use a test statistic based on these sample ranks:

Wilcoxon Rank Sum Test

Mann–Whitney U-test

We only discuss the Wilcoxon Rank Sum Test, since the two tests are equivalent because they use the
same information. The Wilcoxon Rank Sum Test is based on the sum of the ranks of the sample that has
the smaller sample size.

The steps are

1. The Hypotheses
The null hypothesis H0 : The two population locations are the same.

The alternative hypothesis H1 : The location of population 1 is to the left of the location of
population 2.

2. The test statistic Z


Statisticians have shown that when the sample sizes are larger than 10, the test statistic is
approximately normally distributed with mean E(T) and standard deviation σT, where

    E(T) = n1(n1 + n2 + 1)/2

    σT = √[ n1·n2(n1 + n2 + 1)/12 ]

Therefore, the standardized test statistic is

    Z = [T − E(T)]/σT

We arbitrarily select T1 (the rank sum of sample 1) as the test statistic and label it T.

3. The rejection region is Z > Zα for a right-sided alternative; use Z < −Zα for a left-sided alternative and |Z| > Zα/2 for a two-sided alternative.

4. The decision rules


Reject H0 when the p-value is less than the level of significance; otherwise we do not reject H0.

Reject H0 when the value of the test statistic is more extreme than the critical value at the α level
of significance.

5. Interpretation.

Remark
You must be able to use the table of critical values for the Wilcoxon Rank Sum Test. Make sure that
you understand that n1 is the number of observations in the data set with the smallest rank total
(which need not be the one given as "sample 1"). Furthermore, take note that you use the right
table for the right test: Table 6(a) is used for either α = 0.025 for a one-tailed test or α = 0.05 for a
two-tailed test, and Table 6(b) is used for either α = 0.05 for a one-tailed test or α = 0.10 for a two-tailed
test.

The formula given for sample sizes larger than 10 is a normal approximation; it is used without the
tables (because they do not list values for samples larger than 10) and needs only the sizes of the two
independent samples and the test statistic.
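The following rough sketch (my own helper, not from the prescribed book) shows the whole procedure in Python: rank the combined samples, giving tied values their average rank, sum the ranks of sample 1, and standardize with E(T) and σT. The two small samples are only there to check the ranking arithmetic against Activity 5.1, Question 4 below; the normal approximation itself is intended for samples larger than 10.

import math

def rank_sum_test(sample1, sample2):
    combined = sorted(sample1 + sample2)
    ranks = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1                              # values in positions i+1 .. j are tied
        ranks[combined[i]] = (i + 1 + j) / 2    # average rank for the tied values
        i = j
    n1, n2 = len(sample1), len(sample2)
    t = sum(ranks[v] for v in sample1)          # rank sum of sample 1
    e_t = n1 * (n1 + n2 + 1) / 2
    sigma_t = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return t, (t - e_t) / sigma_t

t, z = rank_sum_test([16, 17, 19, 22, 47], [27, 31, 34, 37, 40])
print(t)   # 20, the rank sum of sample A in Activity 5.1, Question 4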

Table 6a and b Critical values for the Wilcoxon Rank Sum Test

Table 7 Critical Values for the Wilcoxon Signed Rank Sum Test

Activity 5.1
Question 1
Consider the following data set: 14; 14; 15; 16; 18; 19; 19; 20; 21; 22; 23; 25; 25; 25; 25;and 28.
The rank assigned to the four observations of value 25 is

1. 12

2. 12:5

3. 13

4. 13:5

5. 14

Question 2
The Wilcoxon rank sum test statistic T is approximately normally distributed whenever the sample
sizes are

1. larger than 10

2. smaller than 10

3. between 5 and 15

4. larger than 20 but smaller than 30

5. smaller than 20

Question 3
A Wilcoxon rank sum test for comparing two populations involves two independent samples of sizes
5 and 7. The alternative hypothesis is stated as: The location of population 1 is different from the
location of population 2. The appropriate critical values at the 5% significance level are

1. 20 and 45

2. 22 and 43

3. 33 and 58

4. 35 and 56

5. 12 and 32

Question 4
Consider the following two independent samples:
Sample A: 16 17 19 22 47
Sample B: 27 31 34 37 40

The value of the test statistic for a left-tail Wilcoxon rank sum test is

1. 6

2. 20

3. 35

4. 55

5. 121

Question 5
Two observers are placed on two different observation points (randomly chosen) for a specified
period of time. They have to observe the drivers of the cars passing by and count the number of
them driving by while talking on a cell phone. Data given below was recorded at Point A for 6 days
and at Point B for 7 days. At the 0:10 level, can we conclude that the number of drivers talking on cell
phones at the two locations have the same median occurrence?

Point A 74 61 73 67 80 89
Point B 90 73 97 81 77 61 79

Question 6
A Wilcoxon rank sum test for comparing two populations involves two independent samples of sizes
15 and 20. The unstandardized test statistic (that is the rank sum) is T = 210. The value of the
standardized test statistic z is

1. 14:0

2. 10:5

3. 6:0

4. 0:7

5. −2.0

5.3 Sign Test and Wilcoxon Signed Rank Sum Test


5.3.1 The sign test

The sign test is the nonparametric test to apply if you want to compare two samples forming matched
pairs of values, provided the data are ordinal and the populations are nonnormal. We say the two
samples are dependent. Typical of this is that one person is tested "before and after", or one person
is asked to make two different observations. Of course, this means that the sizes of the two dependent
samples will always be equal.
In ordinal data, numbers are often allocated to the different ranked categories simply because it is
convenient. You were earlier told about a similar argument for nominal data, where we could indicate
male = 1 and female = 0, because the '0' and '1' are easier to work with than the words 'female'
and 'male'. Please understand that if numbers are used for this purpose, their placement on the
number line is not relevant. They are just symbols - little pictures would perhaps have been less
confusing, but then less convenient!
The sign test, true to its name, considers only the sign (positive or negative) of the difference
between the pair of observations; the size of the difference is of no significance. Think of the
procedures to follow in these nonparametric tests as the rules of a game.

For the sign test the rules are as follows:

Name the one sample 1 and the other one 2.


Determine the difference between the data value in sample 1 and the data value in sample 2 for
each pair.
Count the number of positive and the number of negative differences and ignore the zero differences.
The number of positive differences is the value of the test statistic.
The sample size is the number of pairs with either a positive or a negative difference. (Do not
count the zero differences.)
If n < 10, use the binomial table with p = 0.5, x = the number of positive differences and n = the
number of nonzero differences.
If n ≥ 10, use the normal approximation of the binomial.
Null hypothesis H0: the two population locations are the same.
Alternative hypothesis H1: the population locations are different (this can be one- or two-sided).
The test statistic: with n the number of pairs without ties, use x, the number of times that the
difference (sample 1 minus sample 2) is positive.
Tied observations means that the measurements associated with one or more pairs are equal.
When this happens, delete the tied pairs and reduce n, the total number of pairs.

The test statistic Z is

    Z = (x − 0.5n) / (0.5√n)

The rejection region is


(1) For a two-tailed test:
Reject H0 if Z ≥ Zα/2 or Z ≤ −Zα/2, where Zα/2 is the Z-value from Table 1.

(2) For a one-tailed test: Reject H0 if Z ≥ Zα or Z ≤ −Zα (depending on the direction of the alternative).

(3) Alternatively, calculate the p-value and reject H0 if the p-value is less than α.
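Put together, these rules amount to a few lines of code. The helper below is my own sketch (assuming n ≥ 10 so that the normal approximation applies); applied to the commercial-quiz data of Activity 5.2 below it gives z ≈ −2.53, matching the worked answer.

import math
from statistics import NormalDist

def sign_test(sample1, sample2):
    diffs = [a - b for a, b in zip(sample1, sample2) if a != b]  # drop zero differences
    n = len(diffs)                              # number of nonzero differences
    x = sum(1 for d in diffs if d > 0)          # number of positive differences
    z = (x - 0.5 * n) / (0.5 * math.sqrt(n))
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided

commercial1 = [7, 8, 6, 10, 5, 7, 5, 4, 6, 7, 5, 8]
commercial2 = [9, 9, 6, 10, 4, 9, 7, 5, 8, 9, 6, 10]
print(sign_test(commercial1, commercial2))      # z is about -2.53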

5.4 The Wilcoxon Signed Rank Sum Test for a matched paired experiment

A signed rank test can be used to analyse a paired-difference experiment by considering the paired
differences of two treatments, 1 and 2. The null hypothesis is H0 : The two population locations are
the same, against the alternative hypothesis H1 : The location of population 1 is different from the
location of population 2.
Steps in calculating the test statistic for the Wilcoxon Signed Rank Test

1. Calculate the difference between sample 1 and sample 2 for each of the n pairs. Differences
   equal to 0 are eliminated, and the number of pairs, n, is reduced accordingly.

2. Rank the absolute values of the differences by assigning 1 to the smallest, 2 to the second
   smallest, and so on. Tied observations are assigned the average of the ranks that would have
   been assigned had there been no ties. (The absolute value of −5 is |−5| = 5 and the absolute
   value of +5 is |+5| = 5.)

3. Calculate the rank sum of the negative differences and label this value T−. Similarly, calculate
   T+, the rank sum of the positive differences.

4. The author Keller of our prescribed book uses T = T+, but Mendenhall and Beaver have used the
   smaller of these two quantities as the test statistic to test the hypothesis that the two population
   locations are the same. In this module we use T = T+.
   The formula to calculate the standardized test statistic is

       Z = [T − E(T)]/σT

where

The mean

    E(T) = n(n + 1)/4

The standard deviation σT is equal to

    σT = √[ n(n + 1)(2n + 1)/24 ]

T is approximately normally distributed.

5. The rejection region is Z > Zα/2 or Z < −Zα/2.

6. Decision rule
(1) Reject H0 if the test statistic Z is more extreme than the critical value; otherwise do not reject H0.

(2) Reject H0 if the p-value < α.
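The steps above translate directly into code. The sketch below is my own stand-alone helper (not the prescribed book's); the small data set is only there to check the arithmetic, since it reproduces z ≈ 1.89 from Activity 5.3, Question 2, while the normal approximation is really meant for larger n.

import math

def signed_rank_test(sample1, sample2):
    diffs = [a - b for a, b in zip(sample1, sample2) if a != b]   # step 1: drop zero differences
    n = len(diffs)
    abs_sorted = sorted(abs(d) for d in diffs)
    rank_of = {}
    i = 0
    while i < len(abs_sorted):
        j = i
        while j < len(abs_sorted) and abs_sorted[j] == abs_sorted[i]:
            j += 1
        rank_of[abs_sorted[i]] = (i + 1 + j) / 2                  # step 2: average rank for ties
        i = j
    t_plus = sum(rank_of[abs(d)] for d in diffs if d > 0)         # steps 3-4: T = T+
    e_t = n * (n + 1) / 4
    sigma_t = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return t_plus, (t_plus - e_t) / sigma_t

print(signed_rank_test([9, 12, 13, 8, 7, 10], [5, 10, 11, 9, 3, 9]))   # T+ = 19.5, z about 1.89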

Activity 5.2
Question 1
It is important to sponsors of television shows that viewers remember as much as possible about
the commercials. The advertising executive of a large company is trying to decide which of two
commercials to use on a weekly half-hour sit-com. To help make a decision she decides to have 12
individuals watch both commercials. After each viewing, each respondent is given a quiz consisting
of 10 questions. The number of favourable responses is recorded and listed below. Assume that the
quiz results are not normally distributed.
Quiz Scores
Respondent Commercial 1 Commercial 2
1 7 9
2 8 9
3 6 6
4 10 10
5 5 4
6 7 9
7 5 7
8 4 5
9 6 8
10 7 9
11 5 6
12 8 10
(a) Which test is appropriate for this situation?

(b) Do these data provide enough evidence at the 5% significance level to conclude that the two
commercials differ?

Question 2
In a normal approximation to the sign test, the standardized test statistic is calculated as z = −1.58.
To test the alternative hypothesis that the location of population 1 is to the left of the location of
population 2, the p-value of the test is

1. 0:1142

2. 0:2215

3. 0:0571

4. 0:2284

5. 0.4429

Comparison between the Wilcoxon Signed Rank Sum Test and the Sign Test

If the matched pairs of observations from the two dependent nonnormal populations are interval and
not ordinal, the signed rank sum test of Wilcoxon is the appropriate test to use. Think about this - the
requirements for the sign test and this signed rank sum test are the same except for the type of data.

For the

– Sign test the data are ordinal

– Wilcoxon Signed Rank Sum Test the data are interval

For the Wilcoxon Signed Rank Sum Test the rules are as follows:

Name the one sample 1 and the other one 2.


Determine the difference between the data value in sample 1 and the data value in sample 2 for
each pair. Write these values in a column next to the relevant pair of values.
’Throw away’ (ignore) all the pairs where the observations from the two samples were the same
(difference was zero).
Make another column and in this one you write down the absolute value of the differences. This
means that you ignore the fact that some differences were negative - make them positive.
Rank this column of absolute values from 1 to n, where n is the number of nonzero differences.
Now you need two more columns: in the one you rewrite the ranks of the differences that were
originally positive and in the next column you rewrite the ranks of the differences that were
originally negative.
The value of the test statistic is the same as the total of the ranks of the original positive
differences.

If n < 30, use Table 7, which lists a lower and an upper cut-off value for one- or two-tailed tests,
depending on four different significance levels, with n = the number of nonzero differences.
If n ≥ 30, use the normal approximation as given in section 5.4.
Null hypothesis: the two population locations are the same.
Alternative hypothesis: the population locations are different (can be one-or two-sided).

Activity 5.3
Question 1
A matched pairs experiment produced the following statistics. Conduct a Wilcoxon Signed Rank Sum
Test to determine whether the location of population 1 is to the right of the location of population 2.
(Use α = 0.01.)
T+ = 3457    T− = 2429    n = 108

Question 2
Perform the Wilcoxon Signed Rank Sum Test for the following matched pairs to determine whether
the two population locations differ. (Use α = 0.10.)
Pair 1 2 3 4 5 6
Sample 1 9 12 13 8 7 10
Sample 2 5 10 11 9 3 9

Question 3
In a Wilcoxon Signed Rank Sum Test, the test statistic is calculated as T = 91. There are 18
observation pairs of which 3 have zero differences and a two-tailed test is performed at the 5%
significance level. Choose the correct option below:

1. The critical cut-off values are T ≥ TU = 90 and T ≤ TL = 30.

2. The critical cut-off values are T ≥ TU = 131 and T ≤ TL = 40.

3. The null hypothesis is rejected.

4. The null hypothesis will not be rejected.

5. The test results are inconclusive.

Question 4
In a Wilcoxon Signed Rank Sum Test with n = 30, the rank sums of the positive and negative
differences are 198 and 165, respectively. The value of the standardized test statistic z is

1. 232:50

2. −0.7096

3. 2:8125

4. 48:6107

5. 0:6425

Feedback

Activity 5.1
Question 1
Option (4)
The data set is already ranked (we wanted to test something else than ranking)
14; 14; 15; 16; 18; 19; 19; 20; 21; 22; 23; 25; 25; 25; 25, and 28.

Position 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Data 14 14 15 16 18 19 19 20 21 22 23 25 25 25 25 28
Ranks 1:5 1:5 3 4 5 6:5 6:5 8 9 10 11 13:5 13:5 13:5 13:5 16

From the data row:
Rank of 14 = (1 + 2)/2 = 1.5
Rank of 19 = (6 + 7)/2 = 6.5
Rank of 25 = (12 + 13 + 14 + 15)/4 = 54/4 = 13.5
Question 2
Option (1)
In the discussion about the sampling distribution of the Wilcoxon Rank Sum Test statistic it is stated
that T is approximately normally distributed whenever the sample sizes are larger than 10.

Question 3
Option (1)
n1 = 5 and n2 = 7. The values for n1 are listed in the first row and those for n2 in the first column.
The statement in the alternative hypothesis about the location of the populations being different does
not imply that the location of population 1 lies to the left or the right of population 2. It is a two-tailed
statement. The appropriate critical values at the 5% (two-tailed) significance level are 20 and 45:

Question 4
Option (2)

Ranked data 16 17 19 22 27 31 34 37 40 47
Ranks 1 2 3 4 5 6 7 8 9 10

From the given information


Total ranks of Sample A: 1 + 2 + 3 + 4 + 10 = 20
Total ranks of Sample B : 5 + 6 + 7 + 8 + 9 = 35

Question 5
(This is not a multiple choice question.)
Point A 74 61 73 67 80 89
Point B 90 73 97 81 77 61 79

Position 1 2 3 4 5 6 7 8 9 10 11 12 13
Ranked data 61 61 67 73 73 74 77 79 80 81 89 90 97
Ranks 1:5 1:5 3 4:5 4:5 6 7 8 9 10 11 12 13

Rank of 61 = (1 + 2)/2 = 1.5        Rank of 73 = (4 + 5)/2 = 9/2 = 4.5

Total ranks of Sample A: 6 + 1:5 + 4:5 + 3 + 9 + 11 = 35


Total ranks of Sample B : 12 + 4:5 + 13 + 10 + 7 + 1:5 + 8 = 56

Sample A has the smallest total, so the test statistic is equal to 35:
If we are only testing for a "difference" in the data from the two points, it is a two-sided test. From
Table 6(b) the limits for n1 = 6 and n2 = 7 are 30 and 54. The test statistic of 35 falls between
these limits, so the null hypothesis cannot be rejected at the 10% level. We conclude that the median
number of persons talking on their cell phones while driving could be the same at points A and B .

Question 6
Option (5)
This answer is simply substitution into formulae.

T = 210

    E(T) = n1(n1 + n2 + 1)/2 = 15(15 + 20 + 1)/2 = 270

    σT = √[ n1·n2(n1 + n2 + 1)/12 ] = √[ 15(20)(15 + 20 + 1)/12 ] = 30

    z = [T − E(T)]/σT = (210 − 270)/30 = −60/30 = −2

Activity 5.2
Question 1
Quiz Scores
Respondent Commercial 1 Commercial 2 Difference
1 7 9 -2
2 8 9 -1
3 6 6 0
4 10 10 0
5 5 4 1
6 7 9 -2
7 5 7 -2
8 4 5 -1
9 6 8 -2
10 7 9 -2
11 5 6 -1
12 8 10 -2

X = 1 (positive differences) n = 10 (excluded 0 as differences)

(a) The appropriate test for this situation is the Sign Test.

(b) Do these data provide enough evidence at the 5% significance level to conclude that the two
commercials differ?

The Hypotheses are


H0 : The two population locations are equal.

H1 : The two population locations are not equal.

Rejection region: jzj > z0:025 = 1:96 (two-sided test)

Test statistic: z = (x − 0.5n)/(0.5√n) = (1 − 0.5(10))/(0.5√10) = −4/1.5811 = −2.53

Two cells have zeros and are not counted for the sample size. Therefore n = 10 and x = 1 (only one
plus).
Conclusion: Reject the null hypothesis. Yes, these data provide enough evidence at the 5%
significance level to conclude that the two commercials differ.

Question 2
The standardized test statistic is calculated as z = −1.58. The p-value is then
p-value = P(Z < −1.58) = 0.0571
Option (3)

Activity 5.3
Question 1
H0 : The two population locations are the same.
H1 : The location of population 1 is to the right of the location of population 2:

T = T+ = 3457,  T− = 2429,  n = 108

The mean:   E(T) = n(n + 1)/4 = (108 × 109)/4 = 2943

    σT = √[ n(n + 1)(2n + 1)/24 ] = √[ 108(108 + 1)(2 × 108 + 1)/24 ] = √106438.5 = 326.25

The test statistic

    z = [T − E(T)]/σT = (3457 − 2943)/326.25 = 514/326.25 = 1.5754

p-value = P(Z > 1.5754) ≈ P(Z > 1.58) = 0.0571

Conclusion
Since the p-value = 0.0571 > α = 0.01, we fail to reject H0 at α = 0.01. There is not enough evidence
to conclude that population 1 is located to the right of the location of population 2.

Question 2
H0 : The two population locations are the same.
H1 : The location of population 1 is different from the location of population 2:

Rejection region: T ≥ TU = 19 or T ≤ TL = 2 (using Table 7 for a two-tailed test at α = 0.10).

Pair Sample 1 Sample 2 Difference jDifferencej Positive rank Negative rank


1 9 5 4 4 5.5
2 12 10 2 2 3.5
3 13 11 2 2 3.5
4 8 9 -1 1 1.5
5 7 3 4 4 5.5
6 10 9 1 1 1.5
Totals T + = 19:5 T = 1:5

The test statistic is z = [T − E(T)]/σT, where T = T+ = 19.5 and n = 6

    E(T) = n(n + 1)/4 = (6 × 7)/4 = 10.5

    σT = √[ n(n + 1)(2n + 1)/24 ] = √[ (6 × 7 × 13)/24 ] = √22.75 = 4.7697

    Z = (19.5 − 10.5)/4.7697 = 1.8869

    p-value = 2 P(Z > 1.8869) ≈ 2 P(Z > 1.89) = 2(0.0294) = 0.0588

Since the p-value (0.0588) < α = 0.10 (equivalently, T = 19.5 ≥ TU = 19), we reject H0 at α = 0.10.


Conclusion: There is enough evidence to infer that the population locations differ.

Question 3
Option (4)
The value of the test statistic is calculated as T = 91; therefore the test statistic lies inside the 'safe'
region of [40, 131] for a two-tailed test at the 5% significance level. The null hypothesis is therefore
not rejected. (Using Table 7.)

Question 4
Option (2)

T = T+ = 198

    E(T) = n(n + 1)/4 = (30 × 31)/4 = 232.5

    σT = √[ n(n + 1)(2n + 1)/24 ] = √[ 30(30 + 1)(61)/24 ] = √2363.75 = 48.6184

The standardized test statistic is

    z = [T − E(T)]/σT = (198 − 232.5)/48.6184 = −0.7096

Summary of the different tests


Summary of tests on data from a normal or approximately normal distribution

Deciding which test to use is the task of the statistician in practice and it is our aim to supply you
with the tools to make such a decision. This is not always so straightforward. The significance of
data type is obvious, but note how the study objective gives direction. Even now, while you are still
studying, make a point of looking at published statistical information and determine if it involves "lying
with statistics" or not. Two tables made from the information in the above-mentioned summaries are
given below. Look at them, but try to make your own. Making such a summary is a very valuable
method of studying.

Data type                         Problem objective                     Descriptive measure / parameter             Statistical technique

Nominal (2 categories)            Describe a population                 proportion p                                z-test
Nominal (2 or more categories)    Describe a population                 -                                           χ² goodness-of-fit test
Nominal (2 categories)            Compare two populations               proportions p1 − p2                         z-test or χ² contingency table
Nominal (2 or more categories)    Analyze relationship between          -                                           χ² contingency table
                                  two variables
Interval (σ known; single         Describe a population                 central location μ                          z-test
normal population or n ≥ 30)
Interval (σ unknown;              Describe a population                 central location μ                          t-test
normal or n ≥ 30)
Interval                          Describe a population                 variability σ²                              χ²-test
Interval                          Compare two populations               central location μ1 − μ2                    t-test
                                  (independent samples)                 (difference of means)
Interval                          Compare two populations               central location μD                         t-test
                                  (matched pairs)                       (difference of population means)
Interval                          Compare two populations               variability σ1²/σ2²                         F-test
                                                                        (ratio of two variances)

Parametric versus non-parametric tests


At this stage you should be quite familiar with both parametric and nonparametric tests. The table
below lists some obvious similarities and differences between the two types of tests.

Parametric testing Nonparametric testing

Basic principles of hypothesis testing apply Basic principles of hypothesis testing apply

Population must be normally or Population need not be normally


approximately so distributed distributed or approximately so

Sample size needs to be large Sample size can be very small

Calculations can become very tedious Calculations usually simpler


because of large sample sizes because of small sample sizes

Data dependent on specific test Data dependent on specific test

One sample: One sample:


t test Wilcoxon signed rank test

Two independent samples: Two independent samples:


t test Wilcoxon rank sum test

Two dependent samples: Two dependent (paired) samples:


t test for paired samples Wilcoxon signed rank sum test

5.5 The Kruskal-Wallis H-test for completely randomized designs

The Kruskal-Wallis H-test is a nonparametric alternative to the analysis of variance F-test for a
completely randomized design. The procedure for conducting the Kruskal-Wallis test is similar to that
used for the Wilcoxon Rank Sum Test.
Suppose you are comparing k populations based on independent random samples: n1 from
population 1, n2 from population 2, ..., nk from population k, where n = n1 + n2 + ... + nk. The first
step is to rank all observations from the smallest (rank 1) to the largest (rank n). Tied observations
are assigned a rank equal to the average of the ranks they would otherwise occupy. The steps to carry
out the Kruskal-Wallis test are as follows:
Step 1
Null hypothesis and alternative hypotheses
H0 : m1 = m2 = ::: = mk for the j = 1 through k populations
or
H0 : The population medians are equal.
H1 : At least one mj differs from the others.
or
H1 : The population medians are not equal.
Step 2
The test statistic H

1. Rank the combined data values as if they were from a single group. The smallest data value
   gets rank 1, the next smallest rank 2, and so on.
   In the event of a tie, each of the tied values gets the average of their ranks.

2. Add the ranks of the data values from each of the k groups, obtaining T1, T2, ..., Tk.

The calculated value of the test statistic is

    H = [ 12/(n(n + 1)) × Σ (Tj²/nj) ] − 3(n + 1)

where

n1 ; n2 ; :::; nk = Sample size for the k samples.


n = n1 + n2 + ::: + nk

Step 3
The rejection region

    H > χ²(α, k−1)

The critical value is the chi-square value χ²(α, k−1), provided each sample size is at least 5.
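The computation of H can be sketched in a few lines of Python (my own helper, not from the prescribed book): rank all observations together, sum the ranks per group and substitute into the formula above.

def kruskal_wallis_h(*samples):
    combined = sorted(v for s in samples for v in s)
    ranks = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        ranks[combined[i]] = (i + 1 + j) / 2       # average rank for tied values
        i = j
    n = len(combined)
    rank_sums = [sum(ranks[v] for v in s) for s in samples]   # T1, T2, ..., Tk
    h = (12 / (n * (n + 1))) * sum(t * t / len(s) for t, s in zip(rank_sums, samples))
    return h - 3 * (n + 1)

# The three samples of Activity 5.4, Question 1:
print(kruskal_wallis_h([23, 22, 25, 20, 18],
                       [25, 27, 17, 19, 20],
                       [25, 22, 19, 21, 26]))   # about 0.38; compare with 5.99 (chi-square, 2 df)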

Activity 5.4
Question 1
Apply the Kruskal–Wallis test to determine if there is enough evidence at the 5% significance level to
infer that at least one of the population medians differs from the others.
Sample
1 2 3
23 25 25
22 27 22
25 17 19
20 19 21
18 20 26
Question 2
Conduct the Kruskal-Wallis test on the following statistics. (Use α = 5%.)
T1 = 984 n1 = 23
T2 = 1502 n2 = 36
T3 = 1430 n3 = 29

Feedback

Activity 5.4
Question 1
Rank the combined data values as if they were a single group.
1 2 3
Data Rank Data Rank Data Rank
23 10 25 12 25 12
22 8:5 27 15 22 8:5
25 12 17 1 19 3:5
20 5:5 19 3:5 21 7
18 2 20 5:5 26 14
Total T1 = 38 T2 = 37 T3 = 45
Step 1
The null and alternative hypotheses
H0 : The medians are equal (m1 = m2 = m3 ) :
H1 : At least one median differs from the others.

Step 2
The test statistic H

    H = [ 12/(n(n + 1)) × Σ (Tj²/nj) ] − 3(n + 1)

    n = n1 + n2 + n3 = 5 + 5 + 5 = 15
    T1 = 38,  T2 = 37,  T3 = 45

    H = [ 12/(15 × 16) × ( 38²/5 + 37²/5 + 45²/5 ) ] − 3(15 + 1)
      = (12/240)(288.8 + 273.8 + 405.0) − 48
      = (12/240)(967.6) − 48
      = 48.38 − 48
      = 0.38

Step 3
The rejection region

    H > χ²(α, k−1) = χ²(0.05, 3−1) = χ²(0.05, 2) = 5.99   (using Table 5)

Step 4
Making a decision
Since the test statistic H = 0.38 is smaller than the critical value (5.99), we do not reject the null
hypothesis H0; there is not enough evidence at the 5% significance level to infer that the population
medians differ.

Question 2

The given information

The level of significance = 0:05:


T1 = 984 T2 = 1502 T3 = 1430
n1 = 23 n2 = 36 n3 = 29
n = n1 + n2 + n3 = 23 + 36 + 29 = 88

Step 1
The null and alternative hypothesis
H0 : m1 = m2 = m3
H1 : At least two medians differ.

Step 2
The test statistic H

    H = [ 12/(n(n + 1)) × Σ (Tj²/nj) ] − 3(n + 1)

      = [ 12/(88 × 89) × ( 984²/23 + 1502²/36 + 1430²/29 ) ] − 3(88 + 1)

      = (12/7832)(175278.6578) − 267

      = 268.5577 − 267

      = 1.5577

Step 3
The rejection region

    H > χ²(α, k−1) = χ²(0.05, 3−1) = χ²(0.05, 2) = 5.99

Step 4
Making a decision
Since the test statistic H = 1.5577 is smaller than the critical value (5.99), we do not reject the null
hypothesis H0 at the 5% significance level.

5.6 Friedman Test for the Randomized Block Design


The Friedman test is an extension of the Wilcoxon Signed Rank Test for paired samples and it
is the nonparametric counterpart of the randomized block ANOVA design. While randomized block
ANOVA requires that observations be from normal populations with equal variances, the Friedman
test makes no such demands. This approach can be applied to ordinal data instead of being limited
to data from the interval or ratio scales of measurement. The Friedman test is concerned with the
medians rather than the means. The steps for the Friedman test for the randomized block design
are as follows:
Step 1
The null and alternative hypotheses
H0 : m1 = m2 = ... = mk for the j = 1 through k treatments,
or
H0 : The population medians are equal.
H1 : At least one mj differs from the others,
or
H1 : The population medians are not equal.

Step 2
The test statistic

    Fr = [ 12/(bk(k + 1)) × Σ Tj² ] − 3b(k + 1)

where

b = number of blocks
k = number of treatments
Tj = sum of the ranks for treatment j

Step 3
The rejection region

    Fr > χ²(α, k−1)

where the critical value is the chi-squared value with k − 1 degrees of freedom.

Step 4
The decision rule
Reject H0 if the test statistic Fr is greater than the critical value χ²(α, k−1); otherwise do not reject.
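A compact sketch of the computation (my own illustration, not the prescribed book's code): rank the treatments within each block, sum the ranks per treatment and substitute into the formula above.

def friedman_fr(blocks):
    """blocks: list of b rows, each row holding the k treatment values for one block."""
    b = len(blocks)
    k = len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        ordered = sorted(row)
        for j, value in enumerate(row):
            first = ordered.index(value) + 1        # first position of this value (1-based)
            count = ordered.count(value)            # tied values share the average rank
            rank_sums[j] += first + (count - 1) / 2
    fr = (12 / (b * k * (k + 1))) * sum(t * t for t in rank_sums)
    return fr - 3 * b * (k + 1)

# The stain-remover ratings of Activity 5.5, Question 2:
stains = [[2, 7, 3, 6], [9, 10, 7, 5], [4, 6, 1, 4],
          [9, 7, 4, 5], [6, 8, 4, 3], [9, 4, 2, 6]]
print(friedman_fr(stains))   # about 8.45, compared with chi-square(0.05, 3) = 7.81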

Activity 5.5
Question 1
The nonparametric counterpart to the randomized block design of the analysis of variance is

1. Kruskal–Wallis test

2. Friedman test

3. Wilcoxon Rank Sum Test

4. Wilcoxon Signed Rank Sum Test

5. Sign test

Question 2
The maker of a stain remover is testing the effectiveness of four different formulations for a new
product. Six common types of stains were used as the blocks in the experiment. The data of ratings
are given below:
Stain–Remover Formulas
1 2 3 4
Creosote 2 7 3 6
Type Crayon 9 10 7 5
of Motor Oil 4 6 1 4
Stain Grape Juice 9 7 4 5
Ink 6 8 4 3
Coffee 9 4 2 6

The Friedman test is used to examine whether the stain remover formulas could be equally effective in
removing stains from this type of fabric (use α = 0.05).

Feedback

Activity 5.5
Question 1
Option (2)

Question 2
Rank the data within each block (type of stain).
                    1            2            3            4
                Data Rank    Data Rank    Data Rank    Data Rank
Creosote          2    1       7    4       3    2       6    3
Crayon            9    3      10    4       7    2       5    1
Motor Oil         4    2.5     6    4       1    1       4    2.5
Grape Juice       9    4       7    3       4    1       5    2
Ink               6    3       8    4       4    2       3    1
Coffee            9    4       4    2       2    1       6    3
Total Rank     T1 = 17.5    T2 = 21      T3 = 9       T4 = 12.5
b = 6    k = 4
Step 1
The null and alternative hypothesis
H0 : m1 = m2 = m3 = m4
H1 : The population medians are not equal.

Step 2
The test statistic:

    Fr = [ 12/(bk(k + 1)) × Σ Tj² ] − 3b(k + 1)

       = [ 12/(6 × 4 × 5) × ( 17.5² + 21² + 9² + 12.5² ) ] − 3(6)(4 + 1)

       = (12/120)(984.5) − 90

       = 98.45 − 90

       = 8.45

Step 3
The rejection region is

    Fr > χ²(α, k−1) = χ²(0.05, 3) = 7.81

Step 4
Making a decision
Since the test statistic Fr = 8.45 is greater than the critical value (7.81), we reject H0 at the 5%
significance level.

STUDY UNIT 6
Time series analysis and forecasting
6.1 Introduction
Time series and forecasts based on time series are very relevant and significant in modern times. At
first year level, fortunately these concepts are simple and easy to explain. If you sit and think about
it, you can make a long list of events that you can observe at regular time intervals. If you drive to
work in a car or taxi or train, you can record the traffic every first day of the month, or every Friday,
or every day of the week, or...; if you have a favourite take-away food store you can record the length
of the queue at regular time intervals; an obvious example is to record the monthly rainfall at your
home. The list never ends, as government bodies, researchers, economists, etc. all record different
phenomena over short and long periods of time. These scores, collected at regular time intervals
are known as time series.
The question is – what do we do with the time series? Do you record the data simply to look at it, is
it just for the sake of fun, or what? As statisticians we are going to teach you how to look at, interpret
and even ’smooth’ the time series data, but is that the end of the process? That would have been
a sad day if everything stopped just there! The point is that what we observe as a pattern in the
past could well be repeated in the future and therefore a technique has been developed where the
data of a time series is used and the characteristics of that particular phenomenon is used to predict
what can be expected in the future. Of course, statisticians are always very careful not to say that
anything is certain (think in terms of hypothesis testing!), so they use models in their predictions. We
will only look at three elementary models, but there are many other models, some of which are much
more complex. This chapter begins with the components of a time series, then covers smoothing
techniques, trend and seasonal effects, and forecasting.

6.2 Components of time series


Time series forecasting assumes that the factors that have influenced activities in the past and
present will continue to do so in approximately the same way in the future. Times series forecasting
seeks to identify and isolate these components factors in order to make predictions. The following
four factors are examined in time series.

Trend

Cyclical variation

Seasonal variation

Random (or irregular) variation.



– A trend is an overall long-term upward or downward movement in a time series. The duration is
more than 1 year.

– The cyclical involves the up-and-down swing or movements through the series. The duration is
more than 1 year. This is often correlated with business cycle.

– The seasonal variation refers to cycles that occur over short repetitive calendar periods, typically
observed in monthly or quarterly data. The duration is less than 1 year.

– Random variation is caused by irregular and unpredictable changes in a time series.

The first step in a time series is to visualize the data and observe whether any patterns exist over
time. The second step is to determine whether there is a long–term upward or downward movements
in the series. This is to verify if there is a trend. If there is no obvious long–term upward or downward
trend, then you can use moving averages or exponential smoothing to smooth the series. If a trend
is present, you can consider several time series forecasting techniques.
Depending on the model, we can put these components together in different ways to represent the
time series. The model that we discuss is the multiplicative model. It states that any time series, Y ,
consists of the product of the four components listed above:

    Y = T × C × S × I

In addition to a multiplicative model, we can also use an additive model as given below

Additive model: Y = T + C + S + I

sometimes the most appropriate model is a mixed one.

Mixed model: Y = (T × C × S) + I

The choice of which model to use is beyond the scope of the module.

6.3 Smoothing techniques


One of the simplest ways to reduce random variation is to smooth the time series. In this section we
introduce two methods:

Moving averages

Exponential smoothing

6.3.1 Moving averages

The first technique of smoothing is to determine moving averages. Remember that the data
points in a time series are consecutive values, i.e. they are ordered. The idea of an average
is nothing new and in this case you substitute the actual observations of a time series with a
list of averages. You can compute a three-period moving average, which is the average of three
consecutive observations or you can compute a four-period moving average, which is the average
of four consecutive observations, etc. Make sure that you understand how these three, or four, or
... moving averages are calculated. In a three-period moving average each observation (except the
first and last values) are part of three averages.

Suppose we have real observations indicated as A, B, C, D, E, F and G; then the three-, four- and
five-period moving averages would be as follows:

Three-period moving averages (centred on B, C, D, E and F):
    (A+B+C)/3,  (B+C+D)/3,  (C+D+E)/3,  (D+E+F)/3,  (E+F+G)/3

Four-period moving averages:
    (A+B+C+D)/4,  (B+C+D+E)/4,  (C+D+E+F)/4,  (D+E+F+G)/4

Five-period moving averages (centred on C, D and E):
    (A+B+C+D+E)/5,  (B+C+D+E+F)/5,  (C+D+E+F+G)/5

Note that for these 7 values A, B, ...you could calculate

– 5 three-period moving averages

– 4 four-period moving averages

– 3 five-period moving averages

See if common sense leads you to the following pointers:

By smoothing observations information is lost.


The more periods you include in the average, the smoother the graph becomes.
However, the more periods you include in the average, the fewer observations you have left.
Smoothing with the method of moving averages removes the random variation, but this must be
balanced against the importance of maintaining the real character of the time series.
When you have an annual time series, the chosen period length has to be an odd number of years;
otherwise the approach has to change. A short sketch of the calculation follows this list.
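The calculation itself is a one-liner per period. The sketch below (my own illustration) computes k-period moving averages for any k; it reproduces the three-month averages 70, 73, 80, 86 of Activity 6.1, Question 3 below.

def moving_averages(series, k):
    """Return the list of k-period moving averages of a time series."""
    return [sum(series[i:i + k]) / k for i in range(len(series) - k + 1)]

sales = [73, 65, 72, 82, 86, 90]            # monthly sales from Activity 6.1, Question 3
print(moving_averages(sales, 3))            # [70.0, 73.0, 80.0, 86.0]
print(moving_averages(sales, 5))            # [75.6, 79.0]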

6.3.2 Exponential smoothing

This method is mathematically more complex, but still a ’relatively crude method’ to remove random
variation. However, it removes two of the concerns mentioned above when the method of moving
averages is used for smoothing out random variation. These are the following:

With every calculation all the observations up to that particular observation form part of the
calculation, in other words give weight to the answer.
The smoothing process starts from the very first observation and continues up to the very last
observation.

The formula given may look a little complex, but with constant use it is manageable. Application of
the formula smooths values by calculating a weighted average of each observation in the series and
the previously already smoothed observation. The smoothing constant w is a number between 0 and
1 and seeing that w is multiplied by the actual observation yt (at time t), you should understand that
the closer w is to 1 the more influence the actual observation y will have. That is the sort of decision
the statistician has to make. Choosing the value of w will therefore depend on the importance of the
actual observations.
Considering the above, exponential smoothing produces a series of exponentially weighted moving
averages. The weights assigned to the values change so that the most recent (the last) value
receives the highest weight, the previous value receives the second-highest weight, and so on.
Each smoothed value depends on all previous values, which is an advantage of exponential smoothing
over the method of moving averages.

Exponentially smoothed time series formula

    St = w·yt + (1 − w)·St−1    for t ≥ 2

where

St = the exponentially smoothed value at time period t.
yt = the value of the time series at time period t.
St−1 = the exponentially smoothed value at time period t − 1.
w = the smoothing constant (a number between 0 and 1).

Take note that we begin by setting

    S1 = y1

Keep in mind that you will receive a list of formulas in the examination. You simply have to recognize
which formula to use where and to know the meaning of the different symbols.
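A minimal sketch of the recursion (my own illustration): start with S1 = y1 and repeatedly apply the formula; the smoothing constant w is chosen by the statistician and is set here arbitrarily to 0.3.

def exponential_smoothing(series, w=0.3):
    smoothed = [series[0]]                                     # S1 = y1
    for y in series[1:]:
        smoothed.append(w * y + (1 - w) * smoothed[-1])        # St = w*yt + (1 - w)*S(t-1)
    return smoothed

sales = [73, 65, 72, 82, 86, 90]
print(exponential_smoothing(sales, w=0.3))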

Activity 6.1
Question 1
Test your knowledge.
Link each of the descriptions below to one of the four time series components (long-term trend,
cyclic, seasonal or random variation):

1. The time series component that reflects a long-term, relatively smooth pattern or direction
exhibited by a time series over a long time period (more than one year)

2. The time series component that reflects variability over short repetitive time periods and has
duration of less than one year

3. The time series component that reflects the irregular changes in a time series that are not
caused by any other component, and tends to hide the existence of the other more predictable
components

4. The time series component that reflects a wave-like pattern describing a long-term trend that is
generally apparent over a number of years

Question 2
In exponentially smoothed time series, the smoothing constant w is chosen on the basis of how much
smoothing is required. In general, which of the following statements is true?

1. A small value of w such as w = 0:1 results in very little smoothing, while a large value such as
w = 0:8 results in too much smoothing.

2. A small value of w such as w = 0.1 results in too much smoothing, while a large value such as
w = 0:8 results in very little smoothing.

3. A small value of w such as w = 0:1 and a large value such as w = 0:8 may both result in very little
smoothing.

4. A small value of w such as w = 0:1 and a large value such as w = 0:8 may both result in too much
smoothing.

5. It is impossible to have too much or too little smoothing, regardless of the value of w:

Question 3
Monthly sales (in R11,000) of a computer store are shown below.
Month Jan Feb March April May June
Sales 73 65 72 82 86 90

Calculate the three-month and five-month moving averages.


________________________________________________________________________

6.4 Trend and seasonal effects


In this section, we use regression models for forecasting with time series data. Observations of any
variable recorded over time in sequential order are considered a time series. In order to forecast, we
present more precise measurements of the time series components:

1. Trend analysis

2. Seasonal analysis

3. Deseasonalizing a time series

6.4.1 Trend Analysis

Once you can see that there is a trend in a time series, you have to determine what the ’nature’ of the
trend is. This we do using mathematics. Do you remember the following from school mathematics?

A polynomial has many terms (from the prefix ’poly-’)


A linear equation is of the first power. The regression equation y^ = b0 + b1 x; is an example of a
linear relationship between x as independent variable and y as dependent variable. In time series
data x will always indicate time.
A nonlinear equation is of a power greater than 1, and this is where the polynomial comes in. The
equation ŷ = b0 + b1x + b2x² is quadratic; ŷ = b0 + b1x + b2x² + b3x³ is of the third power.

At this stage you should know enough about the possibility of fitting a regression line through given data
and also about the principles involved in such a method. Now, in time series analysis, such a fitted
line can assist you in seeing whether there is a trend in the data. The ŷ then becomes the trend line
estimate of the y of the regression model y = β0 + β1t + ε. The slope of the line indicates the trend: if
the slope is positive, you know the trend is positive, and the larger the numerical value of the slope, the
larger the positive trend.
These arguments about a graph assisting us to find a trend in a time series also apply if the relationship is
nonlinear. Should a quadratic model be needed to fit the time series, the trend equation relies on the
multiple regression technique (not included in this module).

We estimate a linear trend model using regression techniques. Let yt be the value of the
response variable at time t. In this section we use t as the explanatory (independent) variable
corresponding to consecutive time periods such as 1, 2, 3, and so on. The model is specified as

    yt = β0 + β1t + εt

where
yt = the value of the series at time t

The estimated model is used to make forecasts as

    ŷt = b0 + b1t

where b0 and b1 are the coefficient estimates.
An exponential trend model is used for a time series that is expected to grow by an increasing
amount each time period. It is specified as ln(yt) = β0 + β1t + εt, where ln(yt) is the natural log of
yt. The estimated model is used to make forecasts as

    ŷt = exp( b0 + b1t + se²/2 )

where b0 and b1 are the coefficient estimates and

se = the standard error of estimate.

The polynomial trend model of order q is

    yt = β0 + β1t + β2t² + ... + βqt^q + εt

This model specializes to a linear trend model, a quadratic trend model and a cubic trend model for
q = 1, 2 and 3, respectively. The estimated model is used to make forecasts as

    ŷt = b0 + b1t + b2t² + ... + bqt^q

where
b0, b1, b2, ..., bq are the coefficient estimates.
The quadratic model is

    yt = β0 + β1t + β2t² + εt

The cubic model is

    yt = β0 + β1t + β2t² + β3t³ + εt
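For the linear case (q = 1) the fit is ordinary least squares with t = 1, 2, ..., n as the explanatory variable. The sketch below is my own illustration; the sales figures are simply reused from Activity 6.1 to show the call.

def linear_trend(series):
    """Fit y_t = b0 + b1*t by least squares, with t = 1, 2, ..., n."""
    n = len(series)
    t = list(range(1, n + 1))
    tbar = sum(t) / n
    ybar = sum(series) / n
    b1 = (sum((ti - tbar) * (yi - ybar) for ti, yi in zip(t, series))
          / sum((ti - tbar) ** 2 for ti in t))
    b0 = ybar - b1 * tbar
    return b0, b1

sales = [73, 65, 72, 82, 86, 90]
b0, b1 = linear_trend(sales)
print(b0 + b1 * 7)          # trend forecast for period 7; b1 > 0 signals an upward trend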

6.4.2 Seasonal analysis and deseasonalizing a time series

To detect seasonality in a time series, several 'seasons' must be observed. Seasonal indexes can be
calculated and used to either inflate or deflate the trend in the series. Depending on the choice, this will
either express the degree to which the seasons differ from one another or it can be used to remove
the seasonal variation. The purpose of removing the seasonality is that other changes in the series
can then be detected. This has many benefits, especially in forecasting.

Activity 6.2
Question 1
The quarterly earnings (in millions of rands) of a large soft-drink manufacturer have been recorded
for the years 2001 to 2004. These data are listed here. Calculate the seasonal indexes, given the
regression line
Ŷ = 61.75 + 1.18t    (t = 1, 2, 3, ..., 16)

Year
Quarter 2001 2002 2003 2004
1 52 57 60 66
2 67 75 77 82
3 85 90 94 98
4 54 61 63 67
Question 2
The Pyramid of Giza is one of the most visited monuments in Egypt. The number of visitors per
quarter has been recorded (in thousands) as shown in the accompanying table:

Year
Quarter 2000 2001 2002 2003
Winter 210 215 218 220
Spring 260 275 282 290
Summer 480 490 505 525
Autumn 250 255 265 270
(a) Plot the time series.

(b) Discuss your observations. Would exponential smoothing be recommended for this data?

6.5 Introduction to forecasting


When two or more models are fitted to the same time series, it is useful to have one or more criteria
on which to compare them. In this section, we present two such criteria:

Mean Absolute Deviation (MAD).

Sum of Squares for Forecast Error (SSE).
According to the MAD criterion, the best-fitting model is the one having the lowest mean value of
|yt − ŷt|; note that we work with absolute values. The MAD criterion is

    MAD = Σ |yt − ŷt| / n

where

yt = an observed value of y
ybt = the value of y that is predicted using the estimation equation or model
n = the number of time periods.

The mean squared error (MSE) criterion: from a given set of models or estimation equations fitted to
the same time series data, the model or equation that best fits the time series is the one with the
lowest value of

    MSE = Σ (yt − ŷt)² / n

where

yt = an observed value of y
ybt = the value of Y that is predicted using the estimation equation or model.
n = the number of time periods.

When we compare the MAD and the MSE criteria, the MSE is to be preferred whenever the cost of
an error in estimation or forecasting increases more than in direct proportion to the size of the error.

The sum of squares for forecast error (SSE) is

SSE = Σ(y_t - ŷ_t)²
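All three measures are straightforward to compute once the forecasts and the actual values are available. A minimal Python sketch follows; it uses the same figures that appear again in Activity 6.3, Question 1.

import numpy as np

# Forecasts and actual values (the figures of Activity 6.3, Question 1)
forecast = np.array([63, 72, 86, 71, 60], dtype=float)
actual = np.array([57, 60, 70, 75, 70], dtype=float)

errors = actual - forecast                 # y_t - y-hat_t
mad = np.mean(np.abs(errors))              # mean absolute deviation: 9.6
mse = np.mean(errors ** 2)                 # mean squared error: 110.4
sse = np.sum(errors ** 2)                  # sum of squares for forecast error: 552.0
print(mad, mse, sse)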

6.6 Forecasting models


A large number of different forecasting techniques are available to statisticians. In this section, we
present:

Forecasting with exponential smoothing

Forecasting with seasonal indexes



The model selected for forecasting a time series is determined by the components present in the
recorded series, and the choice between candidate models is based on measures of accuracy and
precision such as MAD and SSE. The way a particular smoothing method works also gives an
indication of how its forecasts will behave: if you think about the method applied in exponential
smoothing, you can imagine that for a time series with a small positive trend the forecast will be too
low, and if there is a small negative trend the forecast will tend to be too high.

A proper analysis of the given data must underlie the choice of model, and you should not try to
forecast too far into the future, as accuracy decreases with each additional time period added.

At first-year level we only introduce you to forecasting and expect you to understand three relatively
elementary forecasting models: exponential smoothing, seasonal indexes and the autoregressive
model. The exponential and seasonal models will be easy for you to understand.

Recall from the regression model that one of the assumptions requires the residuals (errors) to be
independent of each other. When this assumption is not met, the condition is known as
autocorrelation (or serial correlation), and an autoregressive model can be used to exploit it for
forecasting. A common method of testing for autocorrelation is the Durbin–Watson test.
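The Durbin–Watson statistic itself is simple to compute from the residuals of a fitted model. The sketch below (with made-up residuals) shows the calculation; as a rough guide, values near 2 suggest no first-order autocorrelation, while values well below 2 suggest positive autocorrelation.

import numpy as np

def durbin_watson(residuals):
    # d = sum of squared successive differences of the residuals,
    #     divided by the sum of squared residuals
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Illustrative residuals from some fitted trend model (made-up numbers)
e = [1.2, 0.8, 1.1, 0.5, -0.3, -0.9, -1.4, -0.7, 0.2, 1.0]
print(round(durbin_watson(e), 2))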
A broad outline of the three models follows:

Forecasting model: Exponential smoothing
Conditions: no trend, no exponential growth, no seasonal variation
Forecasting: preferably used for a one-period forecast, but can be more
Action: choose a smoothing constant, assume an initial forecast and use the smoothed value S_t as the forecast F_{t+1}

Forecasting model: Seasonal indexes
Conditions: long-term trend, seasonal variation
Forecasting: preferably one season, but can be more
Action: the regression equation is used, together with the seasonal index for period t

Forecasting model: Autoregressive model
Conditions: autocorrelation, no trend, no seasonality
Forecasting: can be complex if the time series values are themselves correlated
Action: based on the correlation of consecutive terms (first-order autocorrelation)
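To make the first model in this outline concrete, here is a minimal sketch of simple exponential smoothing; the series and the smoothing constant w are made up purely for illustration, and the last smoothed value S_t then serves as the forecast F_{t+1}.

def exponential_smoothing_forecast(y, w):
    # S_t = w*y_t + (1 - w)*S_{t-1}, with the initial smoothed value S_1 = y_1;
    # the final smoothed value is returned as the forecast for the next period.
    s = y[0]
    for value in y[1:]:
        s = w * value + (1 - w) * s
    return s

# Hypothetical series with no clear trend or seasonality (illustrative only)
series = [23, 25, 21, 24, 26, 22, 25, 24]
print(round(exponential_smoothing_forecast(series, w=0.3), 2))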

Activity 6.3
Question 1
Calculate MAD and SSE for the forecasts that follow

Period 1 2 3 4 5
Forecast ŷ_t 63 72 86 71 60
Actual y_t 57 60 70 75 70
Question 2
The following is the list of mean absolute deviation (MAD) statistics for each of the models you have
estimated from time-series data:

Model MAD
Linear trend 1.38
Quadratic trend 1.22
Exponential trend 1.39
Autoregressive 0.71
Based on the MAD criterion, the most appropriate model is

1. linear trend

2. quadratic trend

3. exponential trend

4. autoregressive

5. not possible to answer

Feedback

Activity 6.1
Question 1

1. long-term trend

2. seasonal variation

3. random variation

4. cyclical variation

Question 2
Option (2)

Question 3
Month Sales Moving averages
Three-month Five-month
Jan 73
Feb 65 70
March 72 73 75.6
April 82 80 79.0
May 86 86
June 90

The monthly sales are 73 65 72 82 86 90

The three-month moving averages:

February: (73 + 65 + 72)/3 = 210/3 = 70

March: (65 + 72 + 82)/3 = 219/3 = 73

April: (72 + 82 + 86)/3 = 240/3 = 80

May: (82 + 86 + 90)/3 = 258/3 = 86

The five-month moving averages:

March: (73 + 65 + 72 + 82 + 86)/5 = 378/5 = 75.6

April: (65 + 72 + 82 + 86 + 90)/5 = 395/5 = 79.0
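If you want to verify moving-average calculations like these, a one-line NumPy convolution gives the same results. The sketch below is purely illustrative and uses the monthly sales above.

import numpy as np

sales = np.array([73, 65, 72, 82, 86, 90], dtype=float)

# A k-period moving average is a convolution with equal weights 1/k
three_month = np.convolve(sales, np.ones(3) / 3, mode="valid")   # [70. 73. 80. 86.]
five_month = np.convolve(sales, np.ones(5) / 5, mode="valid")    # [75.6 79. ]
print(three_month, five_month)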

Activity 6.2
Question 1
Year Quarter Period t y ŷ y/ŷ
2001 1 1 52 62.9 0.827
2 2 67 64.1 1.046
3 3 85 65.2 1.303
4 4 54 66.4 0.813
2002 1 5 57 67.6 0.843
2 6 75 68.8 1.090
3 7 90 70.0 1.286
4 8 61 71.1 0.857
2003 1 9 60 72.3 0.830
2 10 77 73.5 1.048
3 11 94 74.7 1.259
4 12 63 75.9 0.830
2004 1 13 66 77.0 0.857
2 14 82 78.2 1.048
3 15 98 79.4 1.234
4 16 67 80.6 0.831

Quarter
1 2 3 4 Total
2001 0.827 1.046 1.303 0.813
2002 0.843 1.090 1.286 0.857
2003 0.830 1.048 1.259 0.830
2004 0.857 1.048 1.234 0.831
Average 0.839 1.058 1.271 0.833 4.001
Seasonal index 0.839 1.058 1.270 0.833 4.000
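These indexes feed directly into forecasting with seasonal indexes (Section 6.6): a trend forecast for a future period is multiplied by the index of the corresponding quarter. The short sketch below, using the trend line and the indexes above, forecasts the first quarter of 2005 (period t = 17); the specific numbers are only an illustration.

# Trend line and seasonal indexes from the feedback above
b0, b1 = 61.75, 1.18
seasonal_index = [0.839, 1.058, 1.270, 0.833]   # quarters 1 to 4

t = 17                                           # first quarter of 2005
trend_forecast = b0 + b1 * t                     # about 81.81
seasonal_forecast = trend_forecast * seasonal_index[0]
print(round(seasonal_forecast, 2))               # about 68.64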

Question 2
(a)

[Time series plot of the quarterly number of visitors (in thousands) to the Pyramid of Giza, 2000-2003.
Title: "Pyramids of Egypt 2000-2003 Data"; horizontal axis: Year; vertical axis: Number of Visitors (0 to 600).]

We note a distinct pattern of seasonal variation in the series. This could have been detected from
the data themselves, but in the graph it is immediately visible.

(b) Exponential smoothing is a method for removing the random variation in a time series, which
makes it easier to detect the trend. As you will see in the further discussions, exponential smoothing
is not an accurate forecasting method when the time series has clear seasonal effects, so it would
not be recommended for these data.

Activity 6.3
Question 1

Period 1 2 3 4 5
Forecast ŷ_t 63 72 86 71 60
Actual y_t 57 60 70 75 70

MAD = Σ|y_t - ŷ_t| / n

    = (|57 - 63| + |60 - 72| + |70 - 86| + |75 - 71| + |70 - 60|) / 5

    = (6 + 12 + 16 + 4 + 10) / 5

    = 48/5

    = 9.6

SSE = Σ(y_t - ŷ_t)²

    = (57 - 63)² + (60 - 72)² + (70 - 86)² + (75 - 71)² + (70 - 60)²

    = 36 + 144 + 256 + 16 + 100

    = 552

Question 2
Option (4)

Learning Outcomes
Use the chapter summary as a checklist after you have completed this study unit, to see whether you
have mastered the knowledge in the chapter and really acquired a good understanding of the work
covered.

Can you

list and understand the principles involved in the general procedures for applying chi-squared
tests?
apply your knowledge of the chi-square test, for nominal scale variables, to describe a single
population and/or to determine the relationship between two populations?
apply non-parametric statistical tests?
employ the Wilcoxon rank sum test, the sign test and the Wilcoxon signed rank sum test to
compare two populations of ordinal data?
apply the Friedman test and the Kruskal–Wallis H-test?
analyse the relationship between two interval variables using simple linear regression?
explain and decompose the components of a time series?
explain how trend and seasonal variation are measured?
describe exponential smoothing, seasonal indexes and the autoregressive model for forecasting
in time series?

References
Keller, Gerald and Gaciu (2020). Statistics for Management and Economics (second edition). Belmont,
CA: Duxbury, Thomson.
Mendenhall, W., Beaver, R.J. and Beaver, B.M. (2009). Introduction to Probability and Statistics (13th
edition).
Weiers, Ronald M. (2005). Introduction to Business Statistics. Brooks/Cole, Duxbury, Thomson.
