STA1502 - Study Guide
STA1502
Statistical inference I
CONTENTS
Orientation
STUDY UNIT 1
1.1 Introduction
1.2 Inference about the Difference Between Two Population Means: Independent Samples
1.3 Observational and Experimental Data
1.4 Inference about the Difference Between Two Population Means: Matched Pairs Experiment
1.5 Inference about the Ratio of Two Variances
1.6 Self-correcting Exercises for Unit 1
1.7 Solutions to Self-correcting Exercises for Unit 1
1.8 Learning Outcomes
STUDY UNIT 2
2.1 Introduction
2.2 Inference about the Difference Between Two Population Proportions
2.3 One-Way Analysis of Variance
2.4 Multiple Comparisons
2.5 Analysis of Variance experimental designs (read only)
2.6 Randomized Block (two-way) Analysis of Variance
2.7 Self-correcting Exercises for Unit 2
2.8 Solutions to Self-correcting Exercises for Unit 2
2.9 Learning Outcomes
STUDY UNIT 3
3.1 Chi-square test
3.2 Chi-squared goodness-of-fit test
3.3 Chi-squared test of a Contingency Table
3.4 Summary of test on nominal data
STUDY UNIT 4
4.1 Simple linear regression and correlation
4.2 Simple Linear Regression and Correlations
4.3 Diagnostic Tools for checking the regression assumptions
STUDY UNIT 5
5.1 Nonparametric statistics
5.2 The Wilcoxon Rank Sum Test: Independent Random Samples
5.3 Sign Test and Wilcoxon Signed Rank Sum Test
5.4 The Wilcoxon Signed Rank Sum Test for a matched paired experiment
5.5 The Kruskal-Wallis H-test for completely randomized designs
5.6 Friedman Test for the Randomized Block Design
STUDY UNIT 6
6.1 Introduction
6.2 Components of time series
6.3 Smoothing techniques
6.4 Trend and seasonal effects
6.5 Introduction to forecasting
6.6 Forecasting models
ORIENTATION
Welcome
Welcome to STA1502. This module is the second one of the first-year statistics courses. STA1501
and STA1502 form the first year Statistics course for students from the College of Economic and
Management Sciences. If you are a BSc student in the College of Science, Engineering and
Technology, the three modules STA1501, STA1502 and STA1503 form the first year in Statistics.
In the preceding module STA1501, we treated probability and probability distributions, and unless
one has a proper understanding of the laws of probability, the mechanisms underlying statistical data
analysis will not be understood properly. Probability theory is the tool that makes statistical inference
possible. In STA1502, we consider the applications of the probability distributions. You have
learned in STA1501 that the shape of the normal distribution is determined by the value of the mean
$\mu$ and the variance $\sigma^2$, whilst the shape of the binomial distribution is determined by the sample size
$n$ and the probability of a success $p$. These critical values are called parameters. We most often
don't know what the values of the parameters are and thus we cannot "utilise" these distributions (i.e.
use the mathematical formula to draw a probability density graph or compute specific probabilities)
unless we somehow estimate these unknown parameters. It makes perfect logical sense that to
estimate the value of an unknown population parameter, we compute a corresponding or comparable
characteristic of the sample.
The objective of this module is to focus on the issues related to prediction and inference in statistics
and therefore it is called Statistical Inference and the "I" in the title indicates that it is a module at
the first level. We draw inference about a population (a complete set of data) based on the limited
information contained in a sample. In dictionary terms, inference is the act or process of inferring;
to infer means to conclude or judge from premises or evidence; meaning to derive by reasoning.
In general, the term implies a conclusion based on experience or knowledge. More specifically in
statistics, we have as evidence the limited information contained in the outcome of a sample and
we want to conclude something about the unknown population from which the sample was drawn.
The set of principles, procedures and methods that we use to study populations by making use of
information obtained from samples is called statistical inference.
Learning outcomes
There are very specific outcomes for this module, listed below. Throughout your study of this module
you must come back to this page, sit back and reflect upon them, think them through, digest them
into your system and feel confident in the end that you have mastered the following outcomes:
For this module you have to study certain sections from six chapters of the prescribed textbook:
Keller, Gerald and N. Gaciu (2020, 2nd edition) Statistics for Management and Economics ISBN:
9781473768260
If you so prefer, you are welcome to write and reference your solutions in your own book or file, if the
space we supply is insufficient or not to your liking.
We realise that you might feel overwhelmed by the volumes and volumes of printed matter that
you have to absorb as a student! How do you eat an elephant? Bite by bite! We have divided
the 6 chapters of the textbook into 5 study units or "sessions". Make very sure about the sections
indicated in each study unit since some sections of the textbook are excluded and we do not want
you frustrated by working through unnecessary work. Regular contact with statistics will ensure that
your study becomes personally rewarding.
Doing exercises on your own will not only enhance your understanding of the work, but it will give you
confidence as well. Feedback is given immediately after the activity to help you check whether you
understand the specific concept. The activities are designed (i.e. specific exercises are selected) so
that you can reflect on a concept discussed in the textbook. You can only obtain maximum benefit
from this activity-feedback process if you discipline yourself not to peep at the solution before you
have attempted it on your own!
We know that many of you have some "math anxiety" to deal with, but we will do our best to make
your statistics understandable and not too theoretical. Studying statistics is sometimes not "exciting"
or "fun" but keep in mind that the considerable effort to master the content of this module can be very
rewarding. We claim that knowledge of statistics will enable you to make effective decisions in your
business and to conduct quantitative research into the many larger and detailed data sources that
are available. Statistical literacy will enable you to understand statistical reports you might encounter
as a manager in your business.
We are there to assist you in a process where you shift yourself from a supported school learner to
an independent learner. Studying through distance education is neither easy nor quick. There will
be times when you feel frustrated and discouraged and then only your attitude will pull you through!
You are the master of your own destiny.
In a paper by Sue Gordon¹ (1995) from the University of Sydney, the following metaphor is given:
"The learning of statistics is like building a road. It's a wonderful road, it will take you to places you
did not think you could reach. But when you have constructed one bit of road you cannot sit back and
think 'Oh, that's a great piece of road!' and stop at that. Each bit leads you on, shows the direction
to go, opens the opportunity for more road to be built. And furthermore, the part of the road that
you built a few weeks ago, that you thought you were finished with, is going to develop pot holes
the instant you turn your back on it. This is not to be construed as failure on your part, this is not
inadequacy. This is just part of road building. This is what learning statistics is about: go back and
repair, go on and build, go back and repair."
¹ Gordon, Sue (1995). A Theoretical Approach to Understanding Learners of Statistics. Journal of Statistics Education,
v. 3, n. 3. University of Sydney.
(You can skip the following section if you have read through it when you did STA1501.)
We realise that in the South African schooling system commas are used to indicate the decimal digit
values. You have been penalised at school for using a point. Now we sit between two fires: the
school system and common practice in calculators and computers! Most computer packages use
decimal points (ignoring the option to change it) and Keller (the author) also uses the decimal point
in our textbook (Statistics for Management and Economics). Therefore we use the decimal point in
our study guide, assignments and examination.
The emphasis in the textbook is well beyond the arithmetic of calculating statistics and the focus is
on the identification of the correct technique, interpretation and decision making. This is achieved
with a flexible design giving both manual calculations and computer steps.
It is a good idea that you initially go through the laborious manual computations to enhance your
understanding of the principles and mathematics but we strongly urge you to manage the Excel
computations because using computers reflects the real world outside. The additional advantage of
using a computer is that you can do calculations for larger and more realistic data sets. Whether
you use a computer program or a statistical calculator as tool for your calculations is irrelevant to us.
However, the emphasis in this module will always be on the interpretation and how to articulate the
results in report writing.
CD Appendixes and a Study Guide are provided on the CD-ROM (included in the textbook) in PDF
format. The slide shot below is just to give you an idea of some of the topics covered. Although it will
not be to your disadvantage if you do not use the CD, we encourage you to try your best to have at
least a few sessions on a computer. Statistical Software makes Statistics exciting - so, play around
on the computer should you have access!
STUDY UNIT 1
1.1 Introduction
You should not attempt to do the module STA1502 without knowledge of the contents of STA1501.
This module STA1502 is a continuation in the same textbook of the follow–up chapters. Chapters
such as what is Statistics, graphical descriptive techniques, numerical descriptive techniques,
data collection and sampling, probability, random variables and discrete probability distributions,
continuous probability distributions, sampling distributions, introduction to hypothesis testing and
inference about a population were covered in STA1501.
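In this module we continue with chapters such as
1. Inference about the difference between two populations
2. Analysis of variance
3. Chi-square tests
4. Simple linear regression and correlation
5. Nonparametric statistics
6. Time series and forecasting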
In STA1501 you learnt about Statistical Inference for a single population and derived hypothesis
tests and confidence intervals from the information contained in a single sample. You did this for
We are now sampling from two independent populations where the means of the populations are
our focus. Note that we need subscripts to distinguish between the population mean of the first
population, called $\mu_1$, and the population mean of the second population, called $\mu_2$.
The difference between the two population means is denoted by $(\mu_1 - \mu_2)$, and its best
estimator is the difference between the two sample means, denoted by $\bar{X}_1 - \bar{X}_2$.
Assume that two random samples are independently selected from two populations that are normally
distributed with equal variances. To test whether the two population means are equal, we can use a
pooled–variance t test to determine whether there is a significant difference between the means.
In statistical notation we summarize this as follows:
We have a random sample of size $n_1$ from a population that is normally distributed with the mean $\mu_1$ and the
variance $\sigma_1^2$, denoted by $N(\mu_1; \sigma_1^2)$, and an independent random sample of size $n_2$ from a population that is normally
distributed with the mean $\mu_2$ and the variance $\sigma_2^2$, denoted by $N(\mu_2; \sigma_2^2)$.
There are 5 steps under the sampling distribution of $\bar{X}_1 - \bar{X}_2$:
Step 1
The null hypothesis $H_0$ of no difference in the means of two independent populations can be denoted
by
$H_0: \mu_1 = \mu_2$ or $H_0: \mu_1 - \mu_2 = D_0$
and $H_0$ may be tested at the $\alpha$ level of significance against one of the following alternatives:
Step 2
The pooled variance $s_p^2$ combines the two sample variances $s_1^2$ and $s_2^2$ of the samples independently selected
from the two populations. The pooled variance $s_p^2$ is the best estimate of the variance common to
both populations. The formula for the pooled variance $s_p^2$ is
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
where
$n_1$: The sample size taken from population 1.
Step 3
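$$t_{(\bar{X}_1 - \bar{X}_2)} = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
where
$\bar{X}_1$: The mean of the sample taken from population 1.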
Step 4
where
$t$: Represents the calculated t-test statistic.
$t_{(\alpha/2;\,df)}$: Represents the critical value at the $\alpha$ level of significance for a two-tailed test.
$t_{(\alpha;\,df)}$: Represents the critical value at the $\alpha$ level of significance for a one-tailed test.
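$$(\bar{X}_1 - \bar{X}_2) \pm t_{(\alpha/2;\,df)} \sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
where
$\bar{X}_1$: The sample mean taken from population 1.
$\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$: The standard error for $\bar{X}_1 - \bar{X}_2$.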
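The pooled-variance steps above can be sketched in Python. This is only an illustrative sketch: the function names `pooled_t_test` and `pooled_ci` and the summary values passed to them are hypothetical (they do not come from the textbook), and the critical value 2.086 is assumed to be $t_{(0.025;\,20)}$ as read from a t table.

```python
import math

def pooled_t_test(n1, mean1, var1, n2, mean2, var2, d0=0.0):
    """Pooled-variance t statistic from summary statistics (equal variances assumed)."""
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    # Standard error of the difference between the two sample means.
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    t = ((mean1 - mean2) - d0) / se
    return sp2, se, t, df

def pooled_ci(mean1, mean2, se, t_crit):
    """Confidence interval for mu1 - mu2, given a critical value from the t table."""
    diff = mean1 - mean2
    return diff - t_crit * se, diff + t_crit * se

# Hypothetical summary statistics, for illustration only.
sp2, se, t, df = pooled_t_test(10, 20.0, 4.0, 12, 18.0, 6.0)
lower, upper = pooled_ci(20.0, 18.0, se, t_crit=2.086)  # assumed t(0.025; 20)
print(f"sp2={sp2:.4f}, t={t:.4f}, df={df}, CI=({lower:.4f}; {upper:.4f})")
```

The critical value is passed in as an argument because the standard library has no t distribution; in practice you would read it from the table or let a statistical package supply it.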
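We cannot use the pooled variance $s_p^2$. Instead, we estimate each population variance with the
sample variance. There are 4 steps under the sampling distribution of $\bar{X}_1 - \bar{X}_2$:
$$t_{(\bar{X}_1 - \bar{X}_2)} = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$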
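The sampling distribution can be estimated by a Student t-distribution with the degrees of freedom
df equal to
$$df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$$
Remark: Round df to the nearest integer.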
Step 3: The rejection region is the same as provided in Section 1.2.1 when using the above
degrees of freedom df.
Step 4: The confidence interval estimator of $(\mu_1 - \mu_2)$ when $\sigma_1^2 \neq \sigma_2^2$ is
$$(\bar{X}_1 - \bar{X}_2) \pm t_{(\alpha/2;\,df)} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
The critical value for $t_{(\alpha/2;\,n_1+n_2-2)}$ or $t_{(\alpha;\,n_1+n_2-2)}$ is obtained from Table 2.
Table 1 (continued)
Activity 1.1
Say whether the following statements are correct or incorrect and try to rectify the incorrect
statements to make them true.
.............................................................................. ..............................................................................
.............................................................................................................................................................
(b) If we derive a confidence interval for $(\mu_1 - \mu_2)$ we use $SE = \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$
but if we test $H_0: \mu_1 = \mu_2$ we use $SE = \sqrt{s_{pooled}^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$.
.............................................................................................................................................................
.............................................................................................................................................................
(c) In a one-tailed test for the difference between two population means, $(\mu_1 - \mu_2)$, if the null
hypothesis is rejected when the alternative hypothesis, $H_1: \mu_1 < \mu_2$, is false, a Type I error
is committed.
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
Feedback
(a) Correct. With a little algebraic manipulation it follows from the definitions
$s_1^2 = \dfrac{\sum (x_{1i} - \bar{x}_1)^2}{n_1 - 1}$ and $s_2^2 = \dfrac{\sum (x_{2i} - \bar{x}_2)^2}{n_2 - 1}$
that $(n_1 - 1)s_1^2 = \sum (x_{1i} - \bar{x}_1)^2$ and that $(n_2 - 1)s_2^2 = \sum (x_{2i} - \bar{x}_2)^2$.
(b) Incorrect. We use $SE = \sqrt{s_{pooled}^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$ for both the hypothesis test and the confidence
interval!
(c) Correct.
You will find that in most of the exercises on this section, whether they are for an assignment, the
examination or exercises, the information you have to work with will either be
Sample size     $n_1$          $n_2$
Sample mean     $\bar{x}_1$    $\bar{x}_2$
There could be "variations" on the theme of summarised data where computed sums are given
instead of sample statistics, e.g. $\sum x_{1i}$ instead of $\bar{x}_1$, or $\sum x_{1i}^2$ and $\sum x_{1i}$ instead of $s_1^2$.
In the case of raw data, you must try to have at least a Scientific Pocket Calculator with Statistical
Functions that will enable you to compute the sample statistics:
Activity 1.2
Question 1
Psychologists have claimed that the scores on a tolerance measurement scale have a normal
distribution. Suppose that this scale is administered to two independent random samples of males
and females and their tolerance towards other road users is measured. (The higher the score, the
more tolerant you are.) The following scores were obtained:
Males: 12 8 11 14 10
Females: 15 12 14 11 13 14 12
(b) Compute a 99% confidence interval for the difference $(\mu_{males} - \mu_{females})$. How do you interpret
this interval?
.............................................................................................................................................................
(c) What can you conclude from questions (a) and (b)?
.............................................................................................................................................................
Question 2
You and some friends have decided to test the validity of an advertisement by a local pizza restaurant,
which says it delivers to the dormitories faster than a local branch of a national chain. Both the local
pizza restaurant and the national chain are located across the street from your college campus. You define the
variable of interest as the delivery time, in minutes, from the time the pizza is ordered to when it is
delivered. You collect the data by ordering 10 pizzas from the local pizza restaurant and 10 pizzas
from the national chain at different times. The data for the delivery times are given below:
At the 0.05 level of significance, is there evidence that the mean delivery time for the local pizza
restaurant is less than the mean delivery time for the national pizza chain?
(a) Test the null hypothesis $H_0: \mu_1 = \mu_2$ against the alternative $H_1: \mu_1 < \mu_2$ ($\mu_1$ is the population
mean for local and $\mu_2$ is the population mean for chain). Use $\alpha = 0.05$ and assume that the
population variances are equal. Interpret the results obtained.
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
(b) Calculate a 95% confidence interval for the difference of the means.
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
Question 3
Assume that you have a sample of $n_1 = 8$, with the sample mean $\bar{X}_1 = 42$ and a standard deviation
$S_1 = 4$, and you have an independent sample of $n_2 = 15$ from another population with a sample mean
$\bar{X}_2 = 34$ and a sample standard deviation $S_2 = 5$.
(a) Using the unequal variance approach at $\alpha = 0.01$, test the null hypothesis $H_0: \mu_1 = \mu_2$ against
$H_1: \mu_1 \neq \mu_2$. Interpret the results obtained.
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
Feedback
Question 1
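(a) Step 1:
We have to test $H_0: \mu_{males} = \mu_{females} \Rightarrow H_0: (\mu_1 - \mu_2) = 0$
against $H_1: \mu_{males} \neq \mu_{females} \Rightarrow H_1: (\mu_1 - \mu_2) \neq 0$
Step 2:
The data are
Males $X_1$: 12 8 11 14 10
Females $X_2$: 15 12 14 11 13 14 12
Using a scientific calculator we have obtained:
$n_1 = 5$, $n_2 = 7$
$\bar{X}_1 = \dfrac{\sum X_{1i}}{n_1} = \dfrac{55}{5} = 11$
$\bar{X}_2 = \dfrac{\sum X_{2i}}{n_2} = \dfrac{91}{7} = 13$
The sample variances are
$S_1^2 = \dfrac{1}{n_1 - 1} \sum (X_{1i} - \bar{X}_1)^2 = 5$
$S_2^2 = \dfrac{1}{n_2 - 1} \sum (X_{2i} - \bar{X}_2)^2 = 2$
$s_p^2 = \dfrac{(5 - 1)(5) + (7 - 1)(2)}{5 + 7 - 2} = \dfrac{20 + 12}{10} = \dfrac{32}{10} = 3.2$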
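Step 3:
The test statistic for $\bar{X}_1 - \bar{X}_2$ is
$$\begin{aligned}
t &= \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \\
  &= \frac{(11 - 13) - 0}{\sqrt{3.2\left(\dfrac{1}{5} + \dfrac{1}{7}\right)}} \\
  &= \frac{-2}{1.0474} \\
  &= -1.9095
\end{aligned}$$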
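If you have access to a computer, Steps 2 and 3 can be checked with a few lines of Python. This is only a sketch using the standard `statistics` module; the variable names are our own choice. Because no intermediate values are rounded, the test statistic comes out as about -1.909, which agrees with the hand calculation up to rounding.

```python
import math
import statistics

males = [12, 8, 11, 14, 10]
females = [15, 12, 14, 11, 13, 14, 12]

n1, n2 = len(males), len(females)
mean1, mean2 = statistics.mean(males), statistics.mean(females)
# statistics.variance divides by n - 1, i.e. it is the sample variance.
var1, var2 = statistics.variance(males), statistics.variance(females)

# Pooled variance and standard error of the difference between the means.
sp2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
se = math.sqrt(sp2 * (1 / n1 + 1 / n2))

# Test statistic under H0: mu1 - mu2 = 0.
t_stat = (mean1 - mean2) / se
print(mean1, mean2, var1, var2, sp2, round(t_stat, 4))
```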
Step 4:
The rejection region is $|t| > t_{(\alpha/2;\,n_1+n_2-2)}$ for a two-tailed test because $H_1$ uses the symbol $\neq$.
If the absolute value of the test statistic is greater than the critical value, reject the null hypothesis $H_0$ at the $\alpha$ level of
significance.
If the absolute value of the test statistic is smaller than the critical value, we fail to reject $H_0$.
Since the test statistic ($-1.9095$) lies between the critical values, that is, $-3.169 < -1.9095 <
3.169$, we fail to reject the null hypothesis $H_0$. The conclusion is that there is not a significant difference
between the means of the males and the females.
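(b)
$$\begin{aligned}
(\bar{x}_1 - \bar{x}_2) \pm t_{(\alpha/2;\,n_1+n_2-2)} \sqrt{S_{pooled}^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}
  &= (11 - 13) \pm (3.169)\sqrt{3.2\left(\frac{1}{5} + \frac{1}{7}\right)} \\
  &= -2 \pm (3.169)\sqrt{1.0971} \\
  &= -2 \pm 3.3194 \\
  &= (-5.3194;\ 1.3194)
\end{aligned}$$
We are 99% confident that the unknown difference $(\mu_{males} - \mu_{females})$ lies between $-5.3194$
and $1.3194$. We see that $(-5.3194;\ 1.3194)$ includes the null value zero, which implies that, at the 1% level of
significance, we cannot reject the claim that the mean for the males is the same as the mean for the females.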
[Extra explanation: We translate the phrase "the mean for the males is the same as the mean for
the females" as $\mu_{males} = \mu_{females}$, which is in general $\mu_1 = \mu_2$. But, if $\mu_{males} = \mu_{females}$, it implies
that $(\mu_1 - \mu_2) = 0$.
So, to conclude that $\mu_{males} = \mu_{females}$ we have to check whether zero is included in the confidence
interval.]
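The confidence interval in (b) and the check for zero can be reproduced as follows. This is a minimal sketch: the summary values are those computed in (a), the critical value 3.169 is taken from the t table as in the solution, and the variable names are our own.

```python
import math

# Summary values from part (a).
n1, n2 = 5, 7
mean1, mean2 = 11, 13
sp2 = 3.2
t_crit = 3.169  # t(0.005; 10) read from the t table, as in the solution

se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
margin = t_crit * se
lower = (mean1 - mean2) - margin
upper = (mean1 - mean2) + margin

# Zero inside the interval corresponds to not rejecting H0 in the two-sided test.
contains_zero = lower <= 0 <= upper
print(round(lower, 4), round(upper, 4), contains_zero)
```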
(c) We conclude from questions (a) and (b) that using a two-sided confidence interval and performing
a two-sided hypothesis test must always lead to the same conclusion because it is a different
"juggle" of the same information! This is indeed the case with this exercise!
You will find that in most of the exercises on this section, whether they are for an assignment, the
examination or exercises in Keller or any other textbook, we will simply state: " Assume that.....blah-
blah-blah" and then we conveniently take care of the assumptions of normality and equal variances!
But, strictly speaking, we should have first checked whether these conditions are met before we
proceed with the test.
There exist additional preliminary tests where we can formally test for normality and for the equality
of variances. The tests for normality are covered in detail in your second-year statistics syllabus.
Most statistical packages will provide you with a statistical test to formally test $H_0: \sigma_1^2 = \sigma_2^2$. In
the module STA2601 you will be formally introduced to the statistical package JMP. In case you do
not continue with statistics but anyhow apply your first-year knowledge using a statistical package of
your own choice, be aware that most statistical software packages will automatically include a test
for the equality of variances when you request to do a test for means! (This also happens when you
request to do an ANOVA test for means – a procedure you will learn about in the following study unit.)
The output for the test for the equality of variances will be a so-called F -test. An F-test, in general, is
basically the ratio of two quantities – in this application two variances. The p-value associated with
the F-test could be interpreted exactly like you have learned to do for any other test. If it is significant
(i.e. p-value < $\alpha$) you will reject $H_0: \sigma_1^2 = \sigma_2^2$.
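To make the idea of "a ratio of two variances" concrete, the sketch below computes that ratio for the tolerance scores of Question 1. It only illustrates the F ratio and its two degrees of freedom; the p-value requires the F distribution, which a statistical package would supply and which is deliberately not computed here. The variable names are our own.

```python
import statistics

males = [12, 8, 11, 14, 10]
females = [15, 12, 14, 11, 13, 14, 12]

# Sample variances (divisor n - 1).
var_males = statistics.variance(males)
var_females = statistics.variance(females)

# The F ratio is simply one sample variance divided by the other.
f_ratio = var_males / var_females
df_numerator = len(males) - 1
df_denominator = len(females) - 1
print(f_ratio, df_numerator, df_denominator)
```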
Question 2
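(a) Step 1
The pooled variance $S_p^2$
$$\begin{aligned}
S_p^2 &= \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2} \\
      &= \frac{(10 - 1)(8.2151) + (10 - 1)(9.5822)}{10 + 10 - 2} \\
      &= \frac{73.9359 + 86.2398}{18} \\
      &= \frac{160.1757}{18} \\
      &= 8.8987
\end{aligned}$$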
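Step 2
The test statistic for $\bar{X}_1 - \bar{X}_2$ is
$$\begin{aligned}
t &= \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \\
  &= \frac{18.88 - 16.70}{\sqrt{8.8987\left(\dfrac{1}{10} + \dfrac{1}{10}\right)}} \\
  &= \frac{2.18}{1.3341} \\
  &= 1.6341
\end{aligned}$$
Step 3
The rejection region for $\bar{X}_1 - \bar{X}_2$ is
$t > t_{(\alpha;\,n_1+n_2-2)}$ (For one-tailed test.)
$t > t_{(0.05;\,10+10-2)}$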
Therefore, we are 95% confident that the difference in mean delivery time between the local pizza
restaurant and the national pizza chain is between $-0.6229$ and $4.9829$ minutes. From a
hypothesis testing perspective, using a two-tailed test at the 5% level of significance, because the
interval does include zero, we fail to reject the null hypothesis of no difference between the means of the two
populations.
Question 3
Solutions
$n_1 = 8$, $n_2 = 15$, $S_1 = 4$, $\alpha = 0.01$
$\bar{X}_1 = 42$, $\bar{X}_2 = 34$, $S_2 = 5$
(a) Step 1
$H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 \neq \mu_2$. The two population variances are unequal, $\sigma_1^2 \neq \sigma_2^2$, and the test is
two-tailed.
Step 2
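The test statistic for $\bar{X}_1 - \bar{X}_2$ is
$$\begin{aligned}
t &= \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \\
  &= \frac{42 - 34}{\sqrt{\dfrac{(4)^2}{8} + \dfrac{(5)^2}{15}}} \\
  &= \frac{8}{\sqrt{3.6667}} = \frac{8}{1.9149} \\
  &= 4.1778
\end{aligned}$$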
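Step 3:
The degrees of freedom df is
$$df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$$
$$\begin{aligned}
df &= \frac{\left(\dfrac{16}{8} + \dfrac{25}{15}\right)^2}{\dfrac{\left(\frac{16}{8}\right)^2}{8 - 1} + \dfrac{\left(\frac{25}{15}\right)^2}{15 - 1}}
    = \frac{(2 + 1.6667)^2}{\dfrac{(2)^2}{7} + \dfrac{(1.6667)^2}{14}} \\
   &= \frac{13.4444}{0.5714 + 0.1984} = \frac{13.4444}{0.7698} = 17.4648
\end{aligned}$$
Round df to the nearest integer; this implies that df = 17.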
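The unequal-variances calculation in Steps 2 and 3 can be checked with a short Python sketch. The function name `welch_t_and_df` is our own, hypothetical choice. Because the code does not round intermediate values, it gives df of about 17.46 rather than 17.4648; both round to 17.

```python
import math

def welch_t_and_df(n1, mean1, s1, n2, mean2, s2):
    """Unequal-variances t statistic and its approximate degrees of freedom."""
    v1 = s1 ** 2 / n1  # s1^2 / n1
    v2 = s2 ** 2 / n2  # s2^2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Summary values from Question 3.
t_stat, df = welch_t_and_df(8, 42, 4, 15, 34, 5)
df_rounded = round(df)
print(round(t_stat, 4), round(df, 2), df_rounded)
```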
Step 4
The rejection region is
(Yes, there is a plus sign even though you might expect a minus sign!) In other words, if we create
a new variable by subtracting two variables, the variance of this new variable will – provided they
are independently distributed – be the sum of the variances of the two original variables.
Strictly speaking there is (in general) a third term that takes care of the dependency between the two
variables. We did not even bother to mention it in section 1.1 because this dependency term falls
away if we assume that X and Y are independent.
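For reference, the general rule behind this remark can be written out as below. The covariance notation $\mathrm{Cov}(X, Y)$ is not used elsewhere in this guide and is introduced here only for illustration.

```latex
\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2\,\mathrm{Cov}(X, Y)
```

When $X$ and $Y$ are independent, $\mathrm{Cov}(X, Y) = 0$ and only the sum of the two variances remains, which is why the plus sign appears.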
However, if we cannot assume that we have two samples from two independent populations, we
have a problem with $\mathrm{var}(\bar{x}_1 - \bar{x}_2)$.
Using $\hat{\sigma}^2 = \dfrac{\sum (x_{1i} - \bar{x}_1)^2 + \sum (x_{2i} - \bar{x}_2)^2}{n_1 + n_2 - 2} = s_{pooled}^2$ is not valid anymore!
So, whenever there is a "connectedness" between one set of values (sample 1) and the second
set of values (sample 2), we could take care of the dependency by treating the data as matched
pairs. We remove the dependency by reducing the two samples to one set of scores. This would
immediately imply that n1 = n2 :
Thus, we create a single random sample by taking the paired differences $d_i = x_{1i} - x_{2i}$. With a little
adaptation (and imagination) we are now back to the set-up discussed in STA1501 (depending on
whether we consider the sample as having a known or unknown population variance!) for the topic
such as:
Testing the Population Mean when the Population Standard deviation is known
and
Inference about a Population Mean when the standard deviation is unknown.
Comparing the means of two dependent data sets is always a separate choice (or sub-menu
in computer jargon) of the test procedures available for testing means (main-menu in computer
jargon) in any statistical software package. It is generally known as a “paired samples t-test” and
observations of a single sample, obtained by first taking the differences, are used.
$\bar{x}_D$ = mean difference between the paired observations = $\dfrac{1}{n_D} \sum d_i$
$s_D$ = standard deviation of the differences $d_i$
$n_D$ = number of paired observations.
For dependent observations, the hypothesis test for the difference between the two means therefore
boils down to the hypothesis test for a single sample.
$H_0: \mu_X = \mu_Y$ is the same as $H_0: \mu_D = 0$.
It is interesting to note that in the paired observations test, the degrees of freedom are half of what
they are if the samples are not paired. (When the samples are not paired two kinds of variation are
present: differences among the groups and differences among the subjects.)
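The reduction of two dependent samples to one sample of differences can be sketched in Python as follows. The data are hypothetical and chosen only for illustration, and the test statistic used is the usual one-sample t statistic applied to the differences, with $n_D - 1$ degrees of freedom.

```python
import math
import statistics

# Hypothetical paired measurements on the same five subjects (illustration only).
x1 = [10, 12, 9, 14, 11]
x2 = [8, 11, 9, 12, 10]

# Reduce the two dependent samples to one sample of differences.
d = [a - b for a, b in zip(x1, x2)]
n_d = len(d)
mean_d = statistics.mean(d)
s_d = statistics.stdev(d)  # sample standard deviation (divisor n_d - 1)

# One-sample t statistic for H0: mu_D = 0.
t_stat = mean_d / (s_d / math.sqrt(n_d))
df = n_d - 1
print(d, mean_d, round(s_d, 4), round(t_stat, 4), df)
```

With two independent samples of five observations each, the pooled test would have 5 + 5 - 2 = 8 degrees of freedom; the paired version has 4, which is the halving mentioned above.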
Activity 1.3
Say whether the following statements are correct or incorrect and try to rectify the incorrect
statements to make them true.
(a) Repeated measurements from the same individuals constitute an example of data collected from
matched pairs experiment.
.............................................................................. ..............................................................................
.............................................................................................................................................................
(b) The number of degrees of freedom associated with the t-test, when the data are gathered from a
matched pairs experiment with 8 pairs, is 7.
.............................................................................................................................................................
.............................................................................................................................................................
(c) The matched pairs experiment always produces a larger test statistic than the independent samples
experiment.
.............................................................................................................................................................
.............................................................................................................................................................
(d) In comparing two population means of interval data, we must decide whether the samples are
independent (in which case the parameter of interest is $\mu_1 - \mu_2$) or matched pairs (in which case
the parameter is $\mu_D$) in order to select the correct test statistic.
.............................................................................................................................................................
.............................................................................................................................................................
(e) When comparing two population means using data that are gathered from a matched pairs
experiment, the test statistic for $\mu_D$ has a Student t-distribution with $\nu = n_D - 1$ degrees of
freedom, provided that the differences are normally distributed.
.............................................................................................................................................................
.............................................................................................................................................................
Feedback
(a) Correct.
(b) Correct.
(c) Incorrect. We may say that the matched pairs produce a smaller estimated SE because we
eliminate the often considerable variability due to individual variation in the separate samples.
(d) Correct.
(e) Correct.
Activity 1.4
Suppose that person A believes that sons, upon maturity, are in general taller than their fathers.
Person B, on the other hand, argues that the opposite is true. In order to investigate this issue, we
measure the heights of a random sample of nine father-son pairs. The following are the results (in
cm):
Pair 1 2 3 4 5 6 7 8 9
Son 185 173 168 178 188 173 165 183 175
Father 180 175 160 178 183 175 160 173 178
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
(b) Find a 95% confidence interval estimate for $(\mu_1 - \mu_2)$, the mean difference in heights of fathers
and sons.
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
Feedback
We have dependent (paired) observations and we need to work with the differences of the pairs,
$d_i = \text{son} - \text{father}$:
Pair           1    2    3    4    5    6    7    8    9
Son          185  173  168  178  188  173  165  183  175
Father       180  175  160  178  183  175  160  173  178
Differences    5   -2    8    0    5   -2    5   10   -3
(a) Step 1
The hypotheses are
$$H_0: \mu_D = 0 \quad \text{against} \quad H_1: \mu_D \neq 0$$
Step 2
The test statistic is
$$t = \frac{\bar{X}_D - \mu_D}{S_D / \sqrt{n}}$$
where the sample mean is
$$\bar{X}_D = \frac{\sum d_i}{n} = \frac{26}{9} = 2.8889$$
The sample variance is
$$S_D^2 = \frac{\sum \left(d_i - \bar{X}_D\right)^2}{n - 1}$$
$$S_D^2 = \frac{(5 - 2.8889)^2 + (-2 - 2.8889)^2 + \dots + (-3 - 2.8889)^2}{9 - 1} = \frac{180.8889}{8} = 22.6111$$
The sample standard deviation is $S_D = \sqrt{22.6111} = 4.7551$.
We can also use a scientific calculator to calculate the mean $\bar{X}_D$ and the standard deviation $S_D$.
Therefore the test statistic is
$$t = \frac{\bar{x}_D - 0}{s_D / \sqrt{n}} = \frac{2.8889}{4.7551 / \sqrt{9}} = \frac{2.8889}{1.5850} = 1.8226$$
Step 3
The rejection region is
$|t| > t_{(\alpha/2;\, n-1)}$ for a two-tailed test.
Conclusion: We are 95% confident that the mean difference in heights of fathers and sons is
between $-0.7661$ and $6.5439$. (Sons seem to be taller than their fathers, but not significantly.)
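The arithmetic in this feedback can be checked with a few lines of Python. This is only a sketch using the standard library; the critical value $t_{(0.025;\,8)} = 2.306$ is assumed to be read from a t-table, and the variable names are illustrative.

```python
import math
import statistics

# Heights (cm) of the nine father-son pairs from Activity 1.4
son = [185, 173, 168, 178, 188, 173, 165, 183, 175]
father = [180, 175, 160, 178, 183, 175, 160, 173, 178]

# The paired test works on the single sample of differences d_i = son - father
d = [s - f for s, f in zip(son, father)]
n = len(d)

mean_d = statistics.mean(d)    # sample mean of the differences
sd_d = statistics.stdev(d)     # sample standard deviation (divisor n - 1)
se_d = sd_d / math.sqrt(n)     # estimated standard error of the mean difference

# Test statistic under H0: mu_D = 0
t_stat = (mean_d - 0) / se_d

# Assumption: two-tailed critical value t(0.025; 8) = 2.306 taken from a t-table
t_crit = 2.306
lcl = mean_d - t_crit * se_d
ucl = mean_d + t_crit * se_d

print(round(mean_d, 4), round(sd_d, 4), round(t_stat, 4))
print(round(lcl, 4), round(ucl, 4))
```

Because the test statistic is smaller than 2.306 in absolute value and the interval contains zero, both approaches lead to the same decision.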
Activity 1.5
Question 1
In testing the hypothesis $H_0: \mu_D = 5$ vs. $H_1: \mu_D > 5$, two random samples from two
dependent normal populations produced the following statistics: $\bar{x}_D = 9$, $n_D = 20$, and $s_D = 7.5$.
What conclusion can we draw at the 1% significance level?
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
Question 2
Promotional Campaigns
The general manager of a chain of fast food chicken restaurants wants to determine how effective
their promotional campaigns are. In these campaigns “20% off” coupons are widely distributed.
These coupons are only valid for one week. To examine their effectiveness, the executive records
the daily gross sales (in R1000’s) in one restaurant during the campaign and during the week after
the campaign ends. The data is shown below.
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
(b) Find the 95% confidence interval for the difference in sales during the week.
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
(c) What can you conclude from the answers in (a) and (b)?
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
Feedback
Question 1
Step 1
The test statistic is
$$t = \frac{\bar{X}_D - \mu_D}{S_D / \sqrt{n}} = \frac{9 - 5}{7.5 / \sqrt{20}} = \frac{4}{1.6771} = 2.3851$$
Step 2
The rejection region is
$t > t_{(\alpha;\, n-1)}$ for a one-tailed test:
$t > t_{(0.01;\, 20-1)}$
$t > t_{(0.01;\, 19)}$
$t > 2.539$ using Table 2.
The critical value at the 1% level of significance is 2.539.
Step 3
Decision rule
The test statistic follows a t-distribution with $n - 1 = 19$ degrees of freedom. Since the test statistic (2.3851) is less than the critical
value (2.539), we fail to reject the null hypothesis $H_0$ at the 1% level of significance.
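When only summary statistics are given, as in Question 1, the test statistic can be computed directly from them. The helper below is a hypothetical sketch (its name and arguments are not from the study guide), and the critical value $t_{(0.01;\,19)} = 2.539$ is assumed to come from the t-table.

```python
import math

def paired_t_from_summary(mean_d, sd_d, n, mu0):
    """Return the t statistic for H0: mu_D = mu0 from summary statistics of the differences."""
    standard_error = sd_d / math.sqrt(n)
    return (mean_d - mu0) / standard_error

# Summary statistics from Question 1
t_stat = paired_t_from_summary(mean_d=9, sd_d=7.5, n=20, mu0=5)

# Assumption: right-tailed critical value t(0.01; 19) = 2.539 read from a t-table
t_crit = 2.539
reject_h0 = t_stat > t_crit

print(round(t_stat, 4), reject_h0)
```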
Question 2
We have dependent (paired) observations and we need to work with the differences of the pairs.
The data to use in the matched pairs are
Day                       1     2     3     4     5     6     7
Sales during campaign  18.1  10    9.1   8.4  10.8  13.1  20.8
Sales after campaign   16.6   8.8   8.6   8.3  10.1  12.3  18.9
Differences             1.5   1.2   0.5   0.1   0.7   0.8   1.9
(a) Step 1
The null hypothesis $H_0: \mu_D = 0$ against the alternative $H_1: \mu_D > 0$.
Step 2
The test statistic is
$$t = \frac{\bar{X}_D - \mu_D}{s_D / \sqrt{n_D}} = \frac{0.9571 - 0}{0.6161 / \sqrt{7}} = \frac{0.9571}{0.2329} = 4.1095$$
Step 3
The Decision rule
Reject $H_0$ if the test statistic is greater than the critical value.
Since the value of the test statistic (4.1095) is greater than the critical value $t_{(\alpha;\, n-1)} = t_{(0.05;\, 7-1)} = t_{(0.05;\, 6)} = 1.943$, we reject the null hypothesis $H_0$ at the 5% level of significance.
The conclusion: The sales increase during the campaign.
The null hypothesis $H_0$ is rejected at the 5% level of significance because zero lies outside of
the confidence limits.
We are 95% confident that the mean difference in sales is between 0.3872 and 1.5270 thousand
rand.
(c) We can estimate that the daily sales during the campaign increase on average by between 0.3872
and 1.527 thousand rand.
$$F = \frac{S_1^2}{S_2^2}$$
The statistic $\dfrac{S_1^2}{S_2^2}$ is the estimator of the parameter $\dfrac{\sigma_1^2}{\sigma_2^2}$.
The rejection region is
$F > F_{(\alpha;\, n_1-1,\, n_2-1)}$ for a one-tailed test.
$F > F_{(\alpha/2;\, n_1-1,\, n_2-1)}$ for a two-tailed test.
where
$\alpha$: the level of significance
OR
$H_1: \sigma_1^2 < \sigma_2^2$ for a one-tailed test.
test.
- Otherwise, do not reject $H_0$.
The confidence interval of $\dfrac{\sigma_1^2}{\sigma_2^2}$ is
- The Lower Confidence Limit (LCL) is
$$LCL = \frac{s_1^2}{s_2^2} \cdot \frac{1}{F_{(\alpha/2;\, n_1-1,\, n_2-1)}}$$
- The Upper Confidence Limit (UCL) is
$$UCL = \frac{s_1^2}{s_2^2} \cdot F_{(\alpha/2;\, n_2-1,\, n_1-1)}$$
Note that in the statistical tables, in general, $F_{(\alpha/2;\, n_1-1,\, n_2-1)} \neq F_{(\alpha/2;\, n_2-1,\, n_1-1)}$. You must therefore
make sure that you know which value to use for the upper and lower limits in the assignment or
examination, and then read off the correct value from the table.
Activity 1.6
Question 1
In constructing a 90% interval estimate for the ratio of two population variances, $\sigma_1^2 / \sigma_2^2$, two independent
samples of sizes 40 and 60 are drawn from the populations. If the sample variances are 515 and 920,
then the lower confidence limit is:
1. 0.244
2. 0.352
3. 0.341
4. 0.890
5. 0.918
Question 2
(a) Do the sample variances provide enough evidence at the 10% significance level to infer that the
two population variances differ?
(b) Estimate with 90% confidence the ratio of the two population variances.
(c) Describe what the interval estimate tells you and briefly explain how to use the interval estimate
to test the hypotheses.
Feedback
Question 1
The formula for the LCL is $\dfrac{S_1^2}{S_2^2} \cdot \dfrac{1}{F_{(\alpha/2;\, \nu_1,\, \nu_2)}}$ and you have to substitute the correct values into this
formula.
$$\frac{S_1^2}{S_2^2} = \frac{515}{920} = 0.5598$$
Go to the F-table (Table 3A) with heading 0.05 (because $\alpha = 0.1$ and you need $\alpha/2$) and where the
values for 40 and 60 meet, you will read off the value 1.59.
The critical value is $F_{(\alpha/2;\, df_1,\, df_2)} = F_{(0.05;\, 40-1,\, 60-1)} = F_{(0.05;\, 39,\, 59)} \approx 1.59$ (the nearest tabulated value).
$$\frac{S_1^2}{S_2^2} \cdot \frac{1}{F_{(\alpha/2;\, df_1,\, df_2)}} = 0.5598 \times \frac{1}{1.59} = 0.3521$$
which is option 2.
Question 2
(a) $H_0: \dfrac{\sigma_1^2}{\sigma_2^2} = 1$ versus $H_1: \dfrac{\sigma_1^2}{\sigma_2^2} \neq 1$, or $H_0: \sigma_1^2 = \sigma_2^2$ versus $H_1: \sigma_1^2 \neq \sigma_2^2$
Rejection region: $F > F_{0.05;\,15;\,13} = 2.53$ or $F < F_{0.95;\,15;\,13} = \dfrac{1}{F_{0.05;\,13;\,15}} = \dfrac{1}{2.45} = 0.408$
Test statistic: $F = \dfrac{55}{118} = 0.466$
Conclusion: Since $0.408 < 0.466 < 2.53$, we do not reject the null hypothesis $H_0$. No, the sample variances do not provide
enough evidence at the 10% significance level to infer that the two population variances differ.
(b) The 90% confidence interval for the ratio of the two population variances, with $df_1 = n_1 - 1 = 16 - 1 = 15$ and $df_2 = n_2 - 1 = 14 - 1 = 13$:
$$LCL = \frac{S_1^2}{S_2^2} \cdot \frac{1}{F_{\alpha/2;\,df_1;\,df_2}} = \frac{55}{118} \cdot \frac{1}{F_{0.05;\,15;\,13}} = 0.4661 \times \frac{1}{2.53} = 0.1842$$
$$UCL = \frac{S_1^2}{S_2^2} \cdot F_{\alpha/2;\,df_2;\,df_1} = \frac{55}{118} \cdot F_{0.05;\,13;\,15} = 0.4661 \times 2.45 = 1.1419$$
(c) We estimate that the ratio $\dfrac{\sigma_1^2}{\sigma_2^2}$ lies between 0.1842 and 1.1419. Since the hypothesized value 1 is
included in the 90% interval estimate, we fail to reject the null hypothesis at $\alpha = 0.10$.
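The interval estimate for the variance ratio can be reproduced with a short Python sketch. The function name is hypothetical, and the two F critical values (2.53 and 2.45) are assumed to be read from the F-table, as in the feedback; note that the degrees of freedom are swapped for the upper limit.

```python
def variance_ratio_ci(s1_sq, s2_sq, f_crit_df1_df2, f_crit_df2_df1):
    """
    Confidence limits for sigma1^2 / sigma2^2.

    f_crit_df1_df2: F(alpha/2; n1 - 1, n2 - 1), used for the lower limit
    f_crit_df2_df1: F(alpha/2; n2 - 1, n1 - 1), used for the upper limit
    """
    ratio = s1_sq / s2_sq
    lcl = ratio / f_crit_df1_df2
    ucl = ratio * f_crit_df2_df1
    return lcl, ucl

# Question 2: sample variances 55 and 118 with df1 = 15 and df2 = 13
# Assumption: F(0.05; 15, 13) = 2.53 and F(0.05; 13, 15) = 2.45 from the F-table
lcl, ucl = variance_ratio_ci(55, 118, 2.53, 2.45)

# The two-tailed test at alpha = 0.10 rejects H0 only if 1 lies outside the interval
reject_h0 = not (lcl <= 1 <= ucl)

print(round(lcl, 4), round(ucl, 4), reject_h0)
```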
(b) Repeat (a), increasing the standard deviations to $s_1 = 225$ and $s_2 = 260$.
(c) Describe what happens when the sample standard deviations get larger.
Question 2
Every month a clothing store conducts an inventory and calculates losses from theft. The store
would like to reduce these losses and is considering two methods. The first is to hire a security
guard, and the second is to install cameras. To help decide which method to choose,
the manager hired a security guard for 6 months. During the next 6-month period, the store installed
cameras. The monthly losses were recorded and are listed here. The manager decided that because
the cameras were cheaper than the guard, he would install the cameras unless there was enough
evidence to infer that the guard was better. What should the manager do?
Security guard 355 284 401 398 477 254
Cameras 486 303 270 386 411 435
Question 3
How effective is an antilock braking system (ABS), which pumps the brakes very rapidly rather than locking them
and thus avoids skids? As a test, a car buyer organized an experiment. He hit the brakes and, using
a stopwatch, recorded the number of seconds it took to stop an ABS-equipped car and another
identical car without ABS. The speed when the brakes were applied and the number of seconds
each took to stop on dry pavement are listed here. Can we infer that ABS is better?
Speeds    20   25   30   35   40   45   50   55
ABS      3.6  4.1  4.8  5.3  5.9  6.3  6.7  7.0
Non-ABS  3.4  4.0  5.1  5.5  6.4  6.5  6.9  7.3
Question 4
In an effort to determine whether a new type of fertilizer is more effective than the type currently in
use, researchers took 12 two-acre plots of land scattered throughout the country. Each plot was
divided into two equal-sized subplots, one of which was treated with the current fertilizer and the
other with the new fertilizer. Wheat was planted, and the crop yields were measured.
Plot 1 2 3 4 5 6 7 8 9 10 11 12
Current fertilizer 56 45 68 72 61 69 57 55 60 72 75 66
New fertilizer 60 49 66 73 59 67 61 60 58 75 72 68
(a) Can we conclude at the 5% significance level that the new fertilizer is more effective than the
current one?
(b) Estimate with 95% confidence the difference in mean crop yields between the two fertilizers.
(c) What is the required condition(s) for the validity of the results obtained in parts (a) and (b)?
(f) How should the experiment be conducted if the researchers believed that the land throughout the
country was essentially the same?
Question 1
(e) Increasing the sample size narrows the confidence interval.
Question 2
Step 2
The pooled variance for independent samples is
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
$$= \frac{(6 - 1)(82.2648)^2 + (6 - 1)(81.5682)^2}{6 + 6 - 2}$$
$$= \frac{33837.4866 + 33266.8563}{10} = \frac{67104.3429}{10} = 6710.4343$$
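Step 3
The test statistic is
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{361.5 - 381.8333}{\sqrt{6710.4343\left(\dfrac{1}{6} + \dfrac{1}{6}\right)}} = \frac{-20.3333}{47.2949} = -0.4299$$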
Step 4
The rejection region is
$t > t_{(\alpha;\, n_1+n_2-2)}$
Making a decision
Since the confidence interval limits include zero, we fail to reject the null hypothesis at the 1% level of
significance.
Conclusion: There is not enough evidence to reject the null hypothesis $H_0$.
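The pooled-variance calculation for Question 2 can be verified from the raw monthly losses. This is a minimal sketch with the standard library; it only recomputes the pooled variance and the test statistic and does not look up a critical value.

```python
import math
import statistics

# Monthly losses from Question 2
guard = [355, 284, 401, 398, 477, 254]
cameras = [486, 303, 270, 386, 411, 435]

n1, n2 = len(guard), len(cameras)
mean1, mean2 = statistics.mean(guard), statistics.mean(cameras)
s1_sq = statistics.variance(guard)     # sample variance, divisor n1 - 1
s2_sq = statistics.variance(cameras)   # sample variance, divisor n2 - 1

# Pooled variance for two independent samples (equal population variances assumed)
sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

# Test statistic with n1 + n2 - 2 degrees of freedom
se = math.sqrt(sp_sq * (1 / n1 + 1 / n2))
t_stat = (mean1 - mean2) / se

print(round(sp_sq, 4), round(se, 4), round(t_stat, 4))
```

Small differences in the last decimal place compared with the hand calculation come from rounding the standard deviations before squaring them.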
Question 3
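Step 1
The hypothesis testing
$H_0: \mu_D = 0$ against $H_1: \mu_D < 0$, where $d_i = \text{ABS} - \text{Non-ABS}$.
Step 2
The test statistic
$$t = \frac{\bar{X}_D}{S_D / \sqrt{n_D}}$$
The mean $\bar{X}_D = -0.175$.
The standard deviation $S_D = 0.2252$.
The sample size $n_D = 8$.
$$t = \frac{-0.175}{0.2252 / \sqrt{8}} = \frac{-0.175}{0.0796} = -2.1985$$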
Step 3
The 95% confidence interval is
$$\bar{X}_D \pm t_{(\alpha/2;\, n_D-1)} \frac{S_D}{\sqrt{n_D}}$$
$$-0.175 \pm t_{(0.025;\,7)} \frac{0.2252}{\sqrt{8}}$$
$$-0.175 \pm 2.365 \times 0.0796$$
$$-0.175 \pm 0.1883$$
$$(-0.175 - 0.1883;\ -0.175 + 0.1883)$$
$$(-0.3633;\ 0.0133)$$
Making a decision
Since zero lies between the confidence limits, we fail to reject the null hypothesis at the 5% level of
significance.
Conclusion: There is not enough evidence that ABS performs better than the Non-ABS at the 5%
level of significance.
Question 4
The two samples are dependent, so the matched pairs method is used.
The data are
Plot                  1   2   3   4   5   6   7   8   9  10  11  12
Current fertilizer   56  45  68  72  61  69  57  55  60  72  75  66
New fertilizer       60  49  66  73  59  67  61  60  58  75  72  68
Differences d_i      -4  -4   2  -1   2   2  -4  -5   2  -3   3  -2
$d_i = \text{current} - \text{new fertilizer}$
The summary statistics calculated based on the $d_i$ data are
$\bar{X}_D = -1$, $S_D = 3.0151$, $n_D = 12$
(a) Step 1
Step 3
The rejection region is
$t > t_{(\alpha;\, n_D-1)}$
Making a decision
Because the test statistic is negative, the rejection region is $t < -1.796$.
Since $-1.1489 > -1.796$, we fail to reject $H_0$ at the 5% level of significance.
Conclusion: We may not infer that the new fertilizer is more effective than the current fertilizer.
We are 95% confident that the mean difference in crop yield is between $-2.9158$ and $0.9158$. Since
zero lies between the two confidence limits, we cannot infer that the new fertilizer is more
effective than the current fertilizer.
(c) The required condition for validity of the results is that the differences are required to be normally
distributed.
(d) No, the required condition is not satisfied because the histogram of the differences is not
bell-shaped.
Can you
perform a small-sample statistical test for the difference between two population means in the
case of independent random samples?
derive a small-sample confidence interval for the difference between two population means
$(\mu_1 - \mu_2)$ in the case of independent random samples?
perform a small-sample statistical test for the difference between two population means in the
case of dependent random samples?
derive a small-sample confidence interval for the difference between two population means
$(\mu_1 - \mu_2)$ in the case of dependent random samples?
use a confidence interval estimator to test hypotheses for the ratio of two variances when two
independent samples are drawn from normal populations?
Key Terms/Symbols
t-distribution
F-distribution
degrees of freedom
dependent and independent random samples
paired difference test
STUDY UNIT 2
2.1 Introduction
In this study unit we tie up some loose ends. We continue our inference about comparing two populations,
but we shift from comparing means and variances to comparing proportions. In the last section we move
back to means but extend the comparison to more than two populations.
Step 2
We are now sampling from two independent populations where a proportion of each population
has a certain attribute.
$\hat{P}_1 = \dfrac{X_1}{n_1}$ is the proportion in a random sample of size $n_1$ from population 1 with parameter $P_1$,
where $X_1$ is the number of items that satisfy the condition in sample 1.
$\hat{P}_2 = \dfrac{X_2}{n_2}$ is the proportion in a random sample of size $n_2$ from a second independent population
with parameter $P_2$, where $X_2$ is the number of items that satisfy the condition in sample 2.
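Step 3
The test statistic that we can use for this particular hypothesis is
$$Z = \frac{\hat{P}_1 - \hat{P}_2}{\sqrt{\bar{P}\left(1 - \bar{P}\right)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
$Z$ has an approximate normal distribution denoted by $N(0; 1)$ where 0 is the mean and 1 is the
variance.
$Z$ tests the null hypothesis $H_0: P_1 - P_2 = 0$.
$\bar{P}$ is called the pooled proportion and we calculate it as
$$\bar{P} = \frac{X_1 + X_2}{n_1 + n_2}$$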
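Step 4
The confidence interval is given by
$$\left(\hat{P}_1 - \hat{P}_2\right) \pm Z_{\alpha/2} \sqrt{\frac{\hat{P}_1\left(1 - \hat{P}_1\right)}{n_1} + \frac{\hat{P}_2\left(1 - \hat{P}_2\right)}{n_2}}$$
This formula is valid when $n_1\hat{P}_1$, $n_1(1 - \hat{P}_1)$, $n_2\hat{P}_2$ and $n_2(1 - \hat{P}_2)$ are greater than or equal to 5.
The standard error
$$\sqrt{\frac{\hat{P}_1\left(1 - \hat{P}_1\right)}{n_1} + \frac{\hat{P}_2\left(1 - \hat{P}_2\right)}{n_2}}$$
is used for the confidence interval. When testing $H_0: P_1 - P_2 = 0$ it is replaced by the pooled form
$$\sqrt{\bar{P}\left(1 - \bar{P}\right)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$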
Step 5
The decision rule
Reject the null hypothesis $H_0$ if the test statistic $Z$ is greater than the critical value, otherwise we
do not reject $H_0$.
Reject the null hypothesis $H_0$ if the p-value is less than $\alpha$ (the level of significance), otherwise we
fail to reject $H_0$.
Reject the null hypothesis $H_0$ if zero does not lie between the two confidence limits.
To illustrate the use of the Z test for the equality of two proportions, work through activities 2.1 and
2.2 to enhance your understanding.
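The confidence interval in Step 4 and the confidence-limit decision rule in Step 5 can be sketched in Python as follows. The counts used here are hypothetical illustration data (they do not come from the study guide), the function name is an assumption, and the critical value $Z_{0.025} = 1.96$ is the usual table value for a 95% interval.

```python
import math

def proportion_diff_ci(x1, n1, x2, n2, z_crit):
    """Confidence limits for P1 - P2 using the unpooled standard error."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Validity condition: all four counts must be at least 5
    counts = [n1 * p1_hat, n1 * (1 - p1_hat), n2 * p2_hat, n2 * (1 - p2_hat)]
    if min(counts) < 5:
        raise ValueError("normal approximation not justified: a count is below 5")
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff - z_crit * se, diff + z_crit * se

# Hypothetical data: 60 successes out of 100 versus 45 successes out of 100
lcl, ucl = proportion_diff_ci(60, 100, 45, 100, z_crit=1.96)

# Two-tailed decision: reject H0: P1 - P2 = 0 when zero is outside the limits
reject_h0 = not (lcl <= 0 <= ucl)

print(round(lcl, 4), round(ucl, 4), reject_h0)
```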
Activity 2.1
Say whether the following statements are correct or incorrect and try to rectify the incorrect
statements to make them true.
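(a) If we derive a confidence interval for $(P_1 - P_2)$ we use $SE = \sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$,
but if we test $H_0: P_1 = P_2$ we use $SE = \sqrt{\bar{p}(1 - \bar{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$ with $\bar{p} = \dfrac{X_1 + X_2}{n_1 + n_2}$.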
.............................................................................. ..............................................................................
.............................................................................................................................................................
(b) In testing a hypothesis about the difference between two population proportions $(P_1 - P_2)$, the z
test statistic measures how close the computed sample difference between two proportions has
come to the hypothesized value of zero.
.............................................................................................................................................................
.............................................................................................................................................................
(c) In a one-tailed test for the difference between two population proportions $(P_1 - P_2)$, if the null
hypothesis is rejected when the alternative hypothesis, $H_1: P_1 > P_2$, is false, a Type I error is
committed.
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
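(d) If we derive a confidence interval for $(P_1 - P_2)$, we use $SE = \sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$,
and if we test $H_0: P_1 - P_2 = 0.15$, we will also use $SE = \sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$ for the z test
statistic.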
.............................................................................. ..............................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
Feedback
(a) Correct.
(b) Correct.
(c) Correct.
(d) Correct.
Activity 2.2
1. A seed distributor, called Easy Grow Seeds, claims that 75% of a specific variety of maize, called
Golden Glow, will germinate. A random sample of $n_1 = 300$ seeds was selected from this batch
and 207 germinated. Denote the population proportion of seeds that germinate as $P_1$. Suppose
that a second, independent seed distributor, called Seeds of All Kinds, claims that 80% of their
stock of the same variety of maize, called Golden Glow, will germinate. (Denote this population
proportion of seeds that germinate as $P_2$.) From this population we draw a random sample of size
$n_2 = 200$ and the number of seeds that germinate in this sample is 153.
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
(b) Construct a 99% confidence interval estimate of the difference between the two population
proportions.
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
Feedback
Question 1
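$$\bar{p} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{207 + 153}{300 + 200} = 0.72$$
$$SE_{pooled}(\hat{p}_1 - \hat{p}_2) = \sqrt{\bar{p}(1 - \bar{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{0.72(1 - 0.72)\left(\frac{1}{300} + \frac{1}{200}\right)} = 0.0410$$
$$Z = \frac{\left(\dfrac{x_1}{n_1} - \dfrac{x_2}{n_2}\right) - (p_1 - p_2)}{\sqrt{\bar{p}(1 - \bar{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{\left(\dfrac{207}{300} - \dfrac{153}{200}\right) - 0}{0.0410} = \frac{-0.075}{0.0410} = -1.8293$$
Since $|Z| = |-1.8293| = 1.8293 > 1.645$, we reject $H_0$. It seems likely that the two populations
do not have the same proportions.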
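The pooled proportion, standard error and test statistic for the two seed distributors can be checked with the sketch below. The function name is illustrative, and the critical value 1.645 is the two-tailed value for $\alpha = 0.10$ used in the feedback. Carrying the unrounded standard error gives a statistic of about $-1.83$, in line with the hand calculation that rounds the standard error to 0.0410.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled z test statistic for H0: P1 - P2 = 0."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se_pooled
    return p_pooled, se_pooled, z

# Activity 2.2: 207 of 300 seeds and 153 of 200 seeds germinated
p_pooled, se_pooled, z = two_proportion_z(207, 300, 153, 200)

# Two-tailed test at alpha = 0.10, critical value 1.645 from the normal table
reject_h0 = abs(z) > 1.645

print(round(p_pooled, 2), round(se_pooled, 4), round(z, 2), reject_h0)
```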
Extra explanation:
With a confidence interval our focus is on the inside of the probability statement and with a
hypothesis test our focus is on the outside of the probability statement. For example, for a 90%
confidence interval
[Figure: standard normal curve centred at 0 for a two-sided hypothesis test using $\alpha = 0.10$; the rejection regions, each with area 0.05, lie below $-1.645$ and above $1.645$.]
Question 2
$H_0: P_1 - P_2 = 0$ against $H_1: P_1 - P_2 \neq 0$
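Step 2
The test statistic
$$Z = \frac{\hat{P}_1 - \hat{P}_2}{\sqrt{\bar{P}\left(1 - \bar{P}\right)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
$$\hat{P}_1 = \frac{X_1}{n_1} = \frac{45}{100} = 0.45$$
$$\hat{P}_2 = \frac{X_2}{n_2} = \frac{25}{50} = 0.5$$
The pooled proportion is
$$\bar{P} = \frac{X_1 + X_2}{n_1 + n_2} = \frac{45 + 25}{100 + 50} = \frac{70}{150} = 0.4667$$
The test statistic is
$$Z = \frac{0.45 - 0.5}{\sqrt{0.4667(1 - 0.4667)\left(\dfrac{1}{100} + \dfrac{1}{50}\right)}} = \frac{-0.05}{\sqrt{0.0075}} = \frac{-0.05}{0.0866} = -0.5774$$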
Step 3
The critical value at $\alpha = 0.01$ is 2.33 for a one-tailed test when using Table 1.
Step 4
The rejection region for a two-tailed test corresponds to the $Z$ value for $\dfrac{\alpha}{2} = \dfrac{0.01}{2} = 0.005$, which gives
$Z = 2.58$.
Therefore the rejection region is $Z < -2.58$ or $Z > 2.58$.
Step 5
Make a decision
Since the test statistic $(-0.5774)$ lies between the critical values for a two-tailed test, $-2.58$ and $2.58$,
we fail to reject the null hypothesis $H_0$ at the 1% level of significance.
Since zero lies between the two confidence limits, we fail to reject the null hypothesis $H_0$ at the 1% level
of significance.
(c) Because the test is two-tailed, the p-value is
$$2P(Z < -0.5774) \approx 2P(Z < -0.58) = 2 \times 0.2810 = 0.562$$
Making a decision
Since the p-value (0.562) is greater than $\alpha = 0.01$ (the level of significance), we fail to reject $H_0$ at the 1%
level of significance.
Conclusion: There is not enough evidence to reject the null hypothesis. Therefore we cannot conclude
that the two population proportions differ.
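The p-value approach can also be reproduced without a normal table. The sketch below is an illustration only: it builds the standard normal cumulative distribution function from `math.erf` (the helper name `normal_cdf` is hypothetical) and carries unrounded intermediate values, so the results differ slightly in the third decimal from the table-based figures.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative probability P(Z < z), built from the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Question 2: 45 successes out of 100 versus 25 successes out of 50
x1, n1, x2, n2 = 45, 100, 25, 50
p_pooled = (x1 + x2) / (n1 + n2)
se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z = (x1 / n1 - x2 / n2) / se_pooled

# Two-tailed p-value: twice the area in the tail beyond |z|
p_value = 2 * normal_cdf(-abs(z))
alpha = 0.01
reject_h0 = p_value < alpha

print(round(z, 4), round(p_value, 4), reject_h0)
```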
To determine if there is a significant difference among the group means. Through this test, if the
null hypothesis (all the means are equal) is rejected, then we proceed with the second test.
To identify the groups whose means are significantly different from the other group means.
(i) Variation that is due to differences among the groups, which measures the differences from group to group,
sample to sample or treatment to treatment.
The total variation among groups (or treatments) is denoted by SST.
(ii) Variation that is due to the differences within the groups, which measures random variation. The total
variation within the groups (or treatments) is denoted by SSE.
To perform an ANOVA test of equality of population means, we have the following steps:
Step 1
Assuming that there is k groups (treatments) that represent the population whose values are
randomly and independently selected, follow a normal distribution, and we have equal variance.
The null hypothesis of no differences in the population means:
H0 : 1 = 2 = ::: = k
is tested against the alternative that not all the k population means are equal:
H1 = not all j are equal where j = 1; 2; :::; k:
or
H1 = At least one population mean is different from the other population means.
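Step 2
Calculate the among-group (among-treatment) variation, called the sum of squares for treatments
(SST), given by
\[ SST = \sum_{j=1}^{k} n_j \left(\bar{X}_j - \bar{\bar{X}}\right)^2 \]
where
\[ \bar{\bar{X}} = \frac{\sum_{j=1}^{k} \sum_{i=1}^{n_j} X_{ij}}{n}, \qquad n = n_1 + n_2 + \dots + n_k \]
Step 3
Calculate the within-treatments variation, usually called the sum of squares within treatments or the sum of
squares for error (SSE), given by
\[ SSE = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left(X_{ij} - \bar{X}_j\right)^2 \quad \text{or} \quad SSE = \sum_{j=1}^{k} (n_j - 1)\, s_j^2 \]
Step 4
Calculate the total variation, represented by the sum of squares total (SSTotal), given by
\[ SSTotal = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left(X_{ij} - \bar{\bar{X}}\right)^2 \]
where
\[ \bar{\bar{X}} = \frac{\sum_{j=1}^{k} \sum_{i=1}^{n_j} X_{ij}}{n} = \text{grand mean} \]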
Remarks:
Because there are k groups (treatments) being compared, there are $k - 1$ degrees
of freedom associated with the sum of squares among groups (treatments).
There are $n - k$ degrees of freedom associated with the sum of squares within groups (the sum of
squares for error), because each of the k groups (treatments) contributes $n_j - 1$ degrees of freedom.
There are $n - 1$ degrees of freedom associated with the sum of squares total, because we compare
each value $x_{ij}$ to the grand mean $\bar{\bar{X}}$, which is based on all n values.
Therefore the degrees of freedom add up: $(k - 1) + (n - k) = n - 1$.
Step 5
F-test for differences among more than two means
To determine whether there is a significant difference among the group means, we use the F-test for differences
among more than two means.
If the null hypothesis $H_0: \mu_1 = \mu_2 = \dots = \mu_k$ is true, there are no differences
among the k group means, and the mean squares MST, MSE and MSTotal all provide
estimates of the overall variance in the population.
The test statistic for the one-way ANOVA is
\[ F = \frac{MST}{MSE} \]
where
$MST = SST/(k - 1)$ is the mean square for treatments, with $k - 1$ degrees of freedom.
$MSE = SSE/(n - k)$ is the mean square for error, with $n - k$ degrees of freedom.
Step 6
The critical value is
\[ F_{\alpha,\, k-1,\, n-k} \]
where
$\alpha$ = level of significance
$k - 1$ = the degrees of freedom for treatments
$n - k$ = the degrees of freedom for error.
Step 7
Decision rule
Reject the null hypothesis $H_0: \mu_1 = \mu_2 = \dots = \mu_k$ in favour of $H_1$: not all $\mu_j$ are equal (where
$j = 1, 2, \dots, k$) at the selected level of significance $\alpha$ if the F-test statistic is greater than the critical
value $F_{\alpha,\, k-1,\, n-k}$ of the F-distribution. Otherwise, do not reject $H_0$.
Equivalently, reject $H_0$ if the p-value is less than the level of significance $\alpha$.
ANOVA SUMMARY TABLE
Activity 2.3
The marketing manager of a pizza chain is in the process of examining some of the demographic
characteristics of her customers. In particular, she would like to investigate the belief that the ages
of the customers of pizza parlors, hamburger huts, and fast-food chicken restaurants are different.
As an experiment, the ages of eight customers randomly selected from each of the restaurants are
recorded and listed below. Assume that we know from previous analyses that the ages are normally
distributed with the same variances.
Customers’ Ages
Pizza Hamburger Chicken
23 26 25
19 20 28
25 18 36
17 35 23
36 33 39
25 25 27
28 19 38
31 17 31
[:-) Always keep in mind that small differences could be due to rounding errors!]
(c) Do these data provide enough evidence at the 5% significance level to infer that there are
differences in ages among the customers of the three restaurants?
Feedback
The given information
Pizza Hamburger Chicken
23 26 25
19 20 28
25 18 36
17 35 23
36 33 39
25 25 27
28 19 38
31 17 31
Total 204 193 247
Sample mean $\bar{X}_1 = 25.5$ $\bar{X}_2 = 24.125$ $\bar{X}_3 = 30.875$
Standard deviation $S_1 = 6.1875$ $S_2 = 6.8959$ $S_3 = 6.1281$
Sample size $n_1 = 8$ $n_2 = 8$ $n_3 = 8$
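(a) (i) Correct
\[ n = n_1 + n_2 + n_3 = 8 + 8 + 8 = 24 \]
\[ \bar{\bar{X}} = \frac{\sum_{j=1}^{3} \sum_{i=1}^{n_j} X_{ij}}{n} = \frac{204 + 193 + 247}{24} = \frac{644}{24} = 26.8333 \]
or
\[ \bar{\bar{X}} = \frac{\bar{X}_1 + \bar{X}_2 + \bar{X}_3}{3} = \frac{25.5 + 24.125 + 30.875}{3} = \frac{80.5}{3} = 26.8333 \]
(ii) Correct
\[ SS_{Between} = SST = \sum_{j=1}^{3} n_j \left(\bar{X}_j - \bar{\bar{X}}\right)^2 \]
\[ SST = n_1\left(\bar{X}_1 - \bar{\bar{X}}\right)^2 + n_2\left(\bar{X}_2 - \bar{\bar{X}}\right)^2 + n_3\left(\bar{X}_3 - \bar{\bar{X}}\right)^2 \]
\[ = 8(25.5 - 26.8333)^2 + 8(24.125 - 26.8333)^2 + 8(30.875 - 26.8333)^2 \]
\[ = 14.2215 + 58.6791 + 130.6827 \]
\[ = 203.5833 \]
\[ SS_{Within} = SSE = \sum_{j=1}^{3} (n_j - 1)\, s_j^2 \]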
\[ MSE = \frac{SSE}{n - k} = \frac{863.7455}{21} = 41.1307 \]
\[ F = \frac{MST}{MSE} = \frac{101.7917}{41.1307} = 2.4748 \]
(c) Decision rule
$H_0: \mu_1 = \mu_2 = \mu_3$ against $H_1$: at least two means differ.
Reject $H_0$ if the F-test statistic is greater than the critical value $F_{\alpha,\, k-1,\, n-k}$. Since the F-test statistic (2.4748) is
less than the critical value $F_{0.05,\, 2,\, 21} = 3.44$, we fail to reject the null hypothesis $H_0$ at the 5% level of
significance.
Conclusion: The data do not provide enough evidence at the 5% significance level to infer that
there are differences in ages among the customers of the three restaurants.
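The sums of squares and the F statistic for Activity 2.3 can be reproduced directly from the raw ages. The sketch below uses only the Python standard library; the function name `one_way_anova` is illustrative. It works without intermediate rounding, so the last decimals may differ slightly from the hand calculation.

```python
def one_way_anova(groups):
    """Return (SST, SSE, MST, MSE, F) for a list of samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Among-group variation: n_j * (group mean - grand mean)^2
    sst = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group variation: (observation - own group mean)^2
    sse = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    mst = sst / (k - 1)
    mse = sse / (n - k)
    return sst, sse, mst, mse, mst / mse


pizza = [23, 19, 25, 17, 36, 25, 28, 31]
hamburger = [26, 20, 18, 35, 33, 25, 19, 17]
chicken = [25, 28, 36, 23, 39, 27, 38, 31]

sst, sse, mst, mse, f_stat = one_way_anova([pizza, hamburger, chicken])
print(f"SST = {sst:.4f}, SSE = {sse:.4f}")
print(f"MST = {mst:.4f}, MSE = {mse:.4f}, F = {f_stat:.4f}")
```

The resulting F statistic is then compared with the critical value read from the F table, exactly as in the decision rule above.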
Activity 2.4
A statistics practitioner calculated the following statistics:
                               Treatment
Statistic                      1     2     3
Sample size $n$                5     5     5
Sample mean $\bar{X}$          10    15    20
Sample variance $S^2$          50    50    50
(a) Complete the ANOVA table.
(b) Repeat part (a) with a sample size of 10 in each treatment.
(c) Describe what happens to the F-statistic when the sample sizes increase.
Feedback
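(a) $n = n_1 + n_2 + n_3 = 5 + 5 + 5 = 15$
$\bar{X}_1 = 10 \quad \bar{X}_2 = 15 \quad \bar{X}_3 = 20$
\[ SSE = \sum_{j=1}^{3} (n_j - 1)\, S_j^2 = 4(50) + 4(50) + 4(50) = 600 \]
\[ F = \frac{MST}{MSE} = \frac{125}{50} = 2.5 \]
ANOVA TABLE
Source of variation    df             SS     MS     F
Treatment              k - 1 = 2      250    125    2.5
Error                  n - k = 12     600    50
Total                  14             850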
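(b) The given information
$k = 3 \quad n_1 = 10 \quad n_2 = 10 \quad n_3 = 10$
$n = n_1 + n_2 + n_3 = 10 + 10 + 10 = 30$
$\bar{X}_1 = 10 \quad \bar{X}_2 = 15 \quad \bar{X}_3 = 20$
\[ \bar{\bar{X}} = \frac{\bar{X}_1 + \bar{X}_2 + \bar{X}_3}{3} = \frac{10 + 15 + 20}{3} = \frac{45}{3} = 15 \]
\[ SST = \sum_{j=1}^{3} n_j \left(\bar{X}_j - \bar{\bar{X}}\right)^2 \]
\[ SST = n_1\left(\bar{X}_1 - \bar{\bar{X}}\right)^2 + n_2\left(\bar{X}_2 - \bar{\bar{X}}\right)^2 + n_3\left(\bar{X}_3 - \bar{\bar{X}}\right)^2 \]
\[ = 10(10 - 15)^2 + 10(15 - 15)^2 + 10(20 - 15)^2 \]
\[ = 250 + 0 + 250 \]
\[ = 500 \]
\[ SSE = \sum_{j=1}^{3} (n_j - 1)\, S_j^2 = 9(50) + 9(50) + 9(50) = 1350 \]
\[ MST = \frac{SST}{k - 1} = \frac{500}{2} = 250 \]
\[ F = \frac{MST}{MSE} = \frac{250}{50} = 5 \]
ANOVA TABLE
Source of variation    df                   SS      MS     F
Treatment              k - 1 = 3 - 1 = 2    500     250    5
Error                  n - k = 30 - 3 = 27  1350    50
Total                  29                   1850
(c) The F-statistic increased (in this case from F = 2.5 to F = 5).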
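When only the sample sizes, means and variances are known, as in Activity 2.4, the ANOVA quantities follow from $SST = \sum n_j(\bar{X}_j - \bar{\bar{X}})^2$ and $SSE = \sum (n_j - 1)S_j^2$. The sketch below (standard-library Python, with an illustrative function name) applies these formulas to both sample sizes, so the effect of a larger sample on F can be seen directly: SST doubles while MSE stays at 50.

```python
def anova_from_summary(sizes, means, variances):
    """One-way ANOVA from summary statistics; returns a dict of table entries."""
    k = len(sizes)
    n = sum(sizes)
    # Weighted grand mean (equals the simple average when sizes are equal)
    grand_mean = sum(nj * m for nj, m in zip(sizes, means)) / n
    sst = sum(nj * (m - grand_mean) ** 2 for nj, m in zip(sizes, means))
    sse = sum((nj - 1) * v for nj, v in zip(sizes, variances))
    mst = sst / (k - 1)
    mse = sse / (n - k)
    return {"SST": sst, "SSE": sse, "MST": mst, "MSE": mse, "F": mst / mse}


means = [10, 15, 20]
variances = [50, 50, 50]

table_small = anova_from_summary([5, 5, 5], means, variances)
table_large = anova_from_summary([10, 10, 10], means, variances)

for label, table in (("n_j = 5", table_small), ("n_j = 10", table_large)):
    print(label, table)
```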
Performing an analysis of variance test to determine whether differences exist between two or
more population means is a good start, but it is not nearly enough for a practical application where it
is necessary to identify which treatment means are responsible for the differences. The statistical
method used to determine this is called multiple comparisons. We will consider three methods for
this purpose, namely
Fisher's least significant difference (LSD) method, which is used if you want to find areas for further
investigation.
The Bonferroni method, which is used if you want to identify two or three pairwise comparisons.
Tukey's method, which is used when you want to consider all possible combinations of populations.
These three methods are discussed in the next section. Make sure that you understand them and can
apply the knowledge. The formulas for the three methods are different, but you need not remember
them. Rather, go through each example and its solution to see how the three methods are
applied.
As your knowledge of statistics expands, lengthy calculations will interest you less and less, seeing
that your interest should move to the actual statistical analysis. There is a very delicate balance
between the importance of the calculation and the statistical analysis: if the calculation is incorrect,
the analysis has no meaning. Still, you are being trained to make a meaningful and correct analysis.
Once you understand the method applied in the calculation, that part can be taken over by statistical
software. This is why most statisticians start to use statistical software for their calculations at an
early stage. We introduce students to the software package JMP at second level, in STA2601.
It is therefore advisable for you to take note of any Excel and Minitab printouts given in a textbook.
Try to produce them yourself if you have access to Excel or Minitab; if you do not have access, study
them and note what information they supply and how to interpret it. No professional statistician can
function properly without knowing and using statistical software.
If we conclude from an ANOVA F-test that the population means are all equal (i.e. if we do not reject
the null hypothesis $H_0$), then we can end our analysis. However, if we conclude that the population
means are not all equal (i.e. if we reject the null hypothesis $H_0$), then we are led directly to a new
question: if the population means are not all equal, which ones differ and which ones are the same?
The tests in the next section answer this new question.
The least significant difference (LSD) method determines which population means differ. We define
the least significant difference LSD as
\[ LSD = t_{\alpha/2,\, n-k} \sqrt{MSE \left(\frac{1}{n_i} + \frac{1}{n_j}\right)} \]
where
MSE is an unbiased estimator of the common variance of the populations we are testing.
A simple way of determining whether differences exist between each pair of population means is to
compare the absolute value of the difference between their two sample means with LSD.
In other words, we will conclude that $\mu_i$ and $\mu_j$ differ if
\[ \left|\bar{X}_i - \bar{X}_j\right| > LSD \]
If we conclude that the null hypothesis of equal population means is not valid, we need to determine
which pairs of population means are the same and which ones differ. The Bonferroni t-test is one of
several tests that can determine which pairs of means differ. When we have k groups (or
treatments), we can test a total of $C = \dfrac{k(k-1)}{2}$ hypotheses, each with significance level $\alpha$ (i.e. the
probability of a type I error, which is the error of incorrectly rejecting the null hypothesis).
The Bonferroni test attempts to control the overall probability of a type I error by ensuring that it is
no greater than the originally specified level of significance. In general, the hypotheses that we test are
given by:
$H_0: \mu_i = \mu_j$ against $H_1: \mu_i \neq \mu_j$ ($i$ and $j = 1, 2, \dots, k$).
The Bonferroni tests for ANOVA have the following steps:
Step 1
State the hypotheses $H_0$ and $H_1$. The number of hypotheses to be tested simultaneously is
$C = \dfrac{k(k-1)}{2}$.
Step 2
The test statistic is
\[ T_{ij} = \frac{\bar{X}_i - \bar{X}_j}{\sqrt{MSE \left(\dfrac{1}{n_i} + \dfrac{1}{n_j}\right)}} \]
where
$n_i$ and $n_j$ are the sample sizes of groups (treatments) i and j;
$\bar{X}_i$ and $\bar{X}_j$ are the sample means of groups (or treatments) i and j;
MSE is the mean square within groups (mean square for error) calculated from
the original ANOVA.
The test statistic follows a t distribution with $n - k$ degrees of freedom under the null hypothesis $H_0$.
Step 3
The critical value
The critical values that are used for all $c = \dfrac{k(k-1)}{2}$ tests for the pairs of population means are
$-t_{\alpha/(2c),\, n-k}$ and $t_{\alpha/(2c),\, n-k}$
Step 4
Specify the significance level
The overall significance level for all the tests is specified as $\alpha = 0.05$.
Step 5
Decision rule
Reject the null hypothesis $H_0: \mu_i = \mu_j$ if the test statistic $t_{ij} < -t_{\alpha/(2c),\, n-k}$ or if $t_{ij} > t_{\alpha/(2c),\, n-k}$.
We can use Table 2 to obtain these critical values.
Equivalently, reject the null hypothesis $H_0$ if the p-value $< \alpha/c$.
After performing the one-way ANOVA and finding a significant difference among the treatments,
we still do not know which treatments differ. All we know is that there is sufficient evidence to
state that the population means are not all the same. That is, one or more population means
are significantly different. To determine which treatments differ, we use the Tukey-Kramer multiple
comparisons procedure for one-way ANOVA. Using this technique, we are able to
make comparisons between all pairs of groups simultaneously.
This technique determines a critical number similar to the least significant difference (LSD)
of Fisher's method in section 2.4.1. The critical number is denoted by $\omega$, such that if any pair of sample
means has a difference greater than $\omega$, we conclude that the pair's two corresponding population
means are different.
The test is based on the studentized range, which is the variable $q$:
\[ q = \frac{\bar{X}_{max} - \bar{X}_{min}}{s / \sqrt{n}} \]
where $\bar{X}_{max}$ and $\bar{X}_{min}$ are the largest and smallest sample means respectively, assuming that there
are no differences between the population means.
The critical number $\omega$ is
\[ \omega = q_{\alpha}(k,\, n-k) \sqrt{\frac{MSE}{n_g}} \]
where
$k$ = number of treatments
$n$ = number of observations $= n_1 + n_2 + \dots + n_k$
$n - k$ = number of degrees of freedom associated with MSE
$n_g$ = number of observations in each of the k samples
$\alpha$ = significance level
$q_{\alpha}(k,\, n-k)$ = critical value of the studentized range as given in Table 4.
Activity 2.5
Question 1
An investor studied the percentage rates of return of three different types of mutual funds. Random
samples of percentage rates of return for five periods were taken from each fund. The results appear
in the table below:
Mutual Funds Percentage Rates
Fund 1 Fund 2 Fund 3
12 4 9
15 8 3
13 6 5
14 5 7
17 4 4
Use Tukey's method with $\alpha = 0.05$ to determine which population means differ.
Question 2
(a) Use Fisher's LSD method with $\alpha = 0.05$ to determine which population means differ in the following
problem.
$k = 3 \quad n_1 = 10 \quad n_2 = 10 \quad n_3 = 10$
$MSE = 700 \quad \bar{X}_1 = 128.7 \quad \bar{X}_2 = 101.4 \quad \bar{X}_3 = 133.7$
Feedback
Question 1
Mutual Funds
                        Fund 1     Fund 2     Fund 3
                        12         4          9
                        15         8          3
                        13         6          5
                        14         5          7
                        17         4          4
Sample size             $n_1 = 5$  $n_2 = 5$  $n_3 = 5$
Total of observations   71         27         28
Sample mean             $\bar{X}_1 = 14.2$  $\bar{X}_2 = 5.4$  $\bar{X}_3 = 5.6$
$n = n_1 + n_2 + n_3 = 5 + 5 + 5 = 15$
$k = 3$
$n - k = 15 - 3 = 12$
$\alpha = 0.05$
$n_g = 5$
\[ \omega = q_{0.05}(k,\, n-k) \sqrt{\frac{MSE}{n_g}} \]
\[ SSE = \sum_{j=1}^{3} (n_j - 1)\, S_j^2 \]
\[ MSE = \frac{SSE}{n - k} = \frac{49.2}{12} = 4.1 \]
The critical number is
\[ \omega = q_{0.05}(3,\, 12) \sqrt{\frac{4.1}{5}} = 3.77 \times 0.9055 = 3.4137 \]
($q_{0.05}(3,\, 12) = 3.77$ using Table 4A)
$\left|\bar{X}_1 - \bar{X}_2\right| = |14.2 - 5.4| = 8.8 > 3.4137$; $\mu_1$ and $\mu_2$ differ, the test is significant.
$\left|\bar{X}_1 - \bar{X}_3\right| = |14.2 - 5.6| = 8.6 > 3.4137$; $\mu_1$ and $\mu_3$ differ, the test is significant.
$\left|\bar{X}_2 - \bar{X}_3\right| = |5.4 - 5.6| = |-0.2| = 0.2$
Since $0.2 < 3.4137$, $\mu_2$ and $\mu_3$ do not differ.
Conclusion: The mean percentage rate of return for mutual fund 1 is significantly
different from that of the other two mutual funds.
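As a check on Question 1, the critical number $\omega$ and the pairwise comparisons can be sketched in a few lines of standard-library Python. The value $q_{0.05}(3, 12) = 3.77$ is not computed here; it is copied from Table 4A as quoted in the feedback, and the variable names are illustrative.

```python
import math
from itertools import combinations

funds = {
    "Fund 1": [12, 15, 13, 14, 17],
    "Fund 2": [4, 8, 6, 5, 4],
    "Fund 3": [9, 3, 5, 7, 4],
}
Q_CRIT = 3.77  # q_0.05(3, 12), taken from the studentized range table (assumed input)

k = len(funds)
n = sum(len(values) for values in funds.values())
n_g = 5  # observations per sample

means = {name: sum(values) / len(values) for name, values in funds.items()}
sse = sum((x - means[name]) ** 2 for name, values in funds.items() for x in values)
mse = sse / (n - k)
omega = Q_CRIT * math.sqrt(mse / n_g)

# A pair of population means is declared different when |difference| > omega
differing = [
    (a, b) for a, b in combinations(funds, 2)
    if abs(means[a] - means[b]) > omega
]

print(f"MSE = {mse:.4f}, omega = {omega:.4f}")
for a, b in combinations(funds, 2):
    print(a, "vs", b, f"|difference| = {abs(means[a] - means[b]):.1f}")
print("Pairs that differ:", differing)
```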
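Question 2
Decision rule
Reject $H_0$ if $\left|\bar{X}_i - \bar{X}_j\right| > LSD$; otherwise we fail to reject the null hypothesis $H_0$.
The pairwise absolute differences are
(i) $\left|\bar{X}_1 - \bar{X}_2\right| = |128.7 - 101.4| = |27.3| = 27.3$
Since $27.3 > 24.2797$, $\mu_1$ and $\mu_2$ differ; therefore the test is significant.
$H_0: \mu_1 = \mu_3$ against $H_1: \mu_1 \neq \mu_3$
$H_0: \mu_2 = \mu_3$ against $H_1: \mu_2 \neq \mu_3$
\[ T_{ij} = \frac{\bar{X}_i - \bar{X}_j}{\sqrt{MSE \left(\dfrac{1}{n_i} + \dfrac{1}{n_j}\right)}} \]
(ii) The test for the hypothesis $H_0: \mu_1 = \mu_3$
\[ t_{1,3} = \frac{\bar{X}_1 - \bar{X}_3}{\sqrt{MSE \left(\dfrac{1}{n_1} + \dfrac{1}{n_3}\right)}} = \frac{128.7 - 133.7}{\sqrt{700 \left(\dfrac{1}{10} + \dfrac{1}{10}\right)}} = \frac{-5}{11.8322} = -0.4226 \]
(iii) The test for the hypothesis $H_0: \mu_2 = \mu_3$
\[ t_{2,3} = \frac{\bar{X}_2 - \bar{X}_3}{\sqrt{MSE \left(\dfrac{1}{n_2} + \dfrac{1}{n_3}\right)}} = \frac{101.4 - 133.7}{\sqrt{700 \left(\dfrac{1}{10} + \dfrac{1}{10}\right)}} = \frac{-32.3}{11.8322} = -2.7298 \]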
Decision rule
The critical values are $\pm t_{\alpha/(2c),\, n-k}$.
Reject $H_0$ if the test statistic $t_{ij}$ is greater than the positive critical value or if the test statistic
is less than $-t_{\alpha/(2c),\, n-k}$.
Step 3
Making a decision
The critical value is $t_{0.05/(2 \times 3),\, 30-3} = t_{0.05/6,\, 27} = t_{0.0083,\, 27} \approx t_{0.01,\, 27} = 2.473$
For the hypothesis $H_0: \mu_1 = \mu_2$, $t_{1,2} = 2.3073$. Since $2.3073$ lies between the critical values
$-2.473$ and $2.473$, we fail to reject $H_0$ and we conclude that $\mu_1$ and $\mu_2$ are not significantly
different.
For the hypothesis $H_0: \mu_1 = \mu_3$, $t_{1,3} = -0.4226$.
Since $-0.4226$ lies between the critical values $-2.473$ and $2.473$, we fail to reject $H_0$ and we
conclude that $\mu_1$ and $\mu_3$ are not significantly different.
For the hypothesis $H_0: \mu_2 = \mu_3$, $t_{2,3} = -2.7298$.
Since $-2.7298$ lies outside the critical values $-2.473$ and $2.473$, we reject the null hypothesis
$H_0: \mu_2 = \mu_3$ and we conclude that $\mu_2$ and $\mu_3$ are significantly different.
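The three Bonferroni test statistics can be recomputed from the summary values used in this feedback ($MSE = 700$, ten observations per sample, sample means 128.7, 101.4 and 133.7). The sketch below is standard-library Python with illustrative names; the critical value 2.473 is not computed but copied from the feedback, where it is read from the t table.

```python
import math
from itertools import combinations

means = {1: 128.7, 2: 101.4, 3: 133.7}   # sample means by treatment
MSE = 700.0
N_PER_GROUP = 10
T_CRIT = 2.473  # t(0.01, 27), table value used in the feedback (assumed input)

# Standard error of a difference between two sample means
se = math.sqrt(MSE * (1 / N_PER_GROUP + 1 / N_PER_GROUP))

t_stats = {
    (i, j): (means[i] - means[j]) / se
    for i, j in combinations(means, 2)
}
# Two-sided rule: reject H0 when the statistic falls outside (-T_CRIT, T_CRIT)
rejected = [pair for pair, t in t_stats.items() if abs(t) > T_CRIT]

for (i, j), t in t_stats.items():
    print(f"t_{i},{j} = {t:.4f}")
print("Pairs declared different:", rejected)
```

Note that Fisher's LSD declares $\mu_1$ and $\mu_2$ different while the Bonferroni test does not; the Bonferroni adjustment uses a stricter critical value.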
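$n = n_1 + n_2 + n_3 = 10 + 10 + 10 = 30$
$k = 3 \quad MSE = 700 \quad n_g = 10$
$\left|\bar{X}_1 - \bar{X}_3\right| = |128.7 - 133.7| = |-5| = 5$
Since $5 < 29.1998$, $\mu_1$ and $\mu_3$ are not significantly different.
$\left|\bar{X}_2 - \bar{X}_3\right| = |101.4 - 133.7| = |-32.3| = 32.3$
Since $32.3 > 29.1998$, $\mu_2$ and $\mu_3$ are significantly different.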
The way that a sample is selected is called the experimental design, and it determines the amount of
information in the sample. A researcher can carry out:
an observational study, in which the researcher only observes the characteristics of data that already exist (i.e. the researcher
does not produce the data), for instance a sample survey in the form of a questionnaire;
an experiment, in which the researcher imposes one or more experimental conditions in order to
determine their effect on the response.
We will use terms such as factor, level, treatment and response in the design of a statistical
experiment.
The one-way analysis of variance introduced in the previous sections is only one of many
different experimental designs of the analysis of variance.
In the following section, we present an overview of the concepts used when we discuss the analysis of
variance in the design of a statistical experiment.
The experiment in this case involves two or more factors that define the treatments. The focus in this
section is to determine whether the levels of each factor are different from one another. The analysis
of variance is used to address this problem.
The one-way analysis of variance described earlier is a generalization of the two-independent-samples
design, to be used when the experimental units are quite similar (homogeneous) and there is
only one factor.
When the problem objective is to compare more than two populations and the data are
gathered from a matched pairs experiment, the experimental design is called the randomized
block design (i.e. a direct extension of the matched pairs design). The design uses blocks of k experimental
units that are relatively similar (homogeneous), with one unit within each block randomly assigned
to each treatment. The randomized block experiment is also called the two-way analysis of
variance.
Fixed-effects analysis of variance is a technique that includes all possible levels of a factor in
the analysis.
Random-effects analysis of variance is a technique that treats the levels included in the study as
a random sample of all possible levels.
In some experimental designs, there are no differences in the calculation of the test statistic between
fixed and random effects. However, in others, including the two-factor experiment, the calculations
are different.
The randomized block design identifies two factors, treatments and blocks, both of which affect
the response. The procedure for carrying out the randomized block design is summarized
below.
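A. The null and alternative hypotheses are expressed in terms of the equality of the population means
for all of the treatment groups.
\[ H_0: \mu_1 = \mu_2 = \dots = \mu_k \]
$X_{ij}$ = the observation for the $i$th block and the $j$th treatment.
$\bar{\bar{X}}$ = grand mean, the mean of all the observations. We can also average the means of the treatments or
the means of the blocks to calculate $\bar{\bar{X}}$.
\[ SST = b \sum_{j=1}^{k} \left(\bar{X}_j - \bar{\bar{X}}\right)^2 \qquad MST = \frac{SST}{k - 1} \]
\[ SSB = k \sum_{i=1}^{b} \left(\bar{X}_i - \bar{\bar{X}}\right)^2 \qquad MSB = \frac{SSB}{b - 1} \]
\[ MSE = \frac{SSE}{(k - 1)(b - 1)} \]
\[ SSTotal = \sum_{j=1}^{k} \sum_{i=1}^{b} \left(X_{ij} - \bar{\bar{X}}\right)^2 \]
D. The test statistic, the critical value and the decision rule.
(1) The test statistic
For the treatments (as with the one-way ANOVA):
\[ F = \frac{MST}{MSE} \]
For the blocks:
\[ F = \frac{MSB}{MSE} \]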
The calculations for this type of analysis are so time-consuming that in some cases we provide some of the
details and ask you to complete the others, or we give only computer printouts in the explanations. Learn this method
through an application. With this design, in addition to testing whether the treatment means differ, we can also test whether
there are differences in the block means. Of course, if the block means do not differ, it implies that
the randomized block design was not the appropriate one!
Activity 2.6
Question 1
The following statistics were generated from a randomized block experiment with $k = 3$ and $b = 7$:
$SST = 100 \quad SSB = 50 \quad SSE = 25$
(a) Test to determine whether the treatment means differ. (Use $\alpha = 0.05$.)
(b) Test to determine whether the block means differ. (Use $\alpha = 0.05$.)
Question 2
A partial ANOVA table in a randomized block design is shown below, where the treatments refer to
different high blood pressure drugs, and the blocks refer to different groups of men with high blood
pressure. Use the given ANOVA table to answer the questions:
Source of Variation SS df MS F
(a) Can we infer at the 5% significance level that the treatment means differ?
(b) Can we infer at the 5% significance level that the block means differ?
Feedback
Question 1
The given information
$k = 3 \quad b = 7 \quad SST = 100$
$SSE = 25 \quad SSB = 50$
(a)
\[ MSE = \frac{SSE}{(k - 1)(b - 1)} = \frac{25}{(3 - 1)(7 - 1)} = \frac{25}{2 \times 6} = \frac{25}{12} = 2.0833 \]
\[ MST = \frac{SST}{k - 1} = \frac{100}{2} = 50 \]
\[ F = \frac{MST}{MSE} = \frac{50}{2.0833} = 24.0004 \]
The critical value is $F_{\alpha,\, k-1,\, (k-1)(b-1)} = F_{0.05,\, 3-1,\, (3-1)(7-1)} = F_{0.05,\, 2,\, 12} = 3.89$
The rejection region is $F > 3.89$.
Since the test statistic $F = 24.0004$ is greater than the critical value, we reject the null hypothesis
$H_0$.
Conclusion: There is enough evidence to conclude that the treatment means differ.
(b)
\[ MSB = \frac{SSB}{b - 1} = \frac{50}{6} = 8.3333 \]
\[ F = \frac{MSB}{MSE} = \frac{8.3333}{2.0833} = 4.00 \]
The critical value is $F_{\alpha,\, b-1,\, (b-1)(k-1)} = F_{0.05,\, 6,\, 12} = 3.00$
The rejection region is $F > 3.00$. Since the test statistic $F = 4.00$ is greater than the critical
value (3.00), the null hypothesis $H_0$ is rejected at the 5% significance level.
Conclusion: There is enough evidence to conclude that the block means differ.
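The two F statistics of Question 1 follow mechanically from the sums of squares, as the following sketch shows. It is written in standard-library Python with an illustrative function name; the critical values 3.89 and 3.00 are the table values quoted in the feedback, not computed.

```python
def randomized_block_f(sst, ssb, sse, k, b):
    """Return (F for treatments, F for blocks) in a randomized block design."""
    mst = sst / (k - 1)                 # mean square for treatments
    msb = ssb / (b - 1)                 # mean square for blocks
    mse = sse / ((k - 1) * (b - 1))     # mean square for error
    return mst / mse, msb / mse


f_treat, f_block = randomized_block_f(sst=100, ssb=50, sse=25, k=3, b=7)

F_CRIT_TREAT = 3.89  # F(0.05; 2, 12), table value from the feedback
F_CRIT_BLOCK = 3.00  # F(0.05; 6, 12), table value from the feedback

print(f"Treatments: F = {f_treat:.4f}, reject H0: {f_treat > F_CRIT_TREAT}")
print(f"Blocks:     F = {f_block:.4f}, reject H0: {f_block > F_CRIT_BLOCK}")
```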
Question 2
The completed ANOVA table is
Source of variation    df    SS       MS      F
Treatments             4     6720     1680    14.6087
Blocks                 6     3120     520     4.5217
Error                  24    2760     115
Total                  34    12600
$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5$
$H_1$: At least two means differ
Question 1
These statistics were calculated from two random samples:
$\hat{P}_1 = 0.60 \quad n_1 = 225 \quad \hat{P}_2 = 0.55 \quad n_2 = 225$
(a) Calculate the p-value of a test to determine whether there is evidence to infer that the population
proportions differ.
(b) Repeat part (a) with $\hat{P}_1 = 0.95$ and $\hat{P}_2 = 0.90$.
(c) Describe the effect on the p-value of increasing the sample proportions.
(d) Repeat part (a) with $\hat{P}_1 = 0.10$ and $\hat{P}_2 = 0.05$.
(e) Describe the effect on the p-value of decreasing the sample proportions.
Question 2
Surveys have been widely used by politicians around the world as a way of monitoring the opinions
of the electorate. Six months ago, a survey was undertaken to determine the degrees of support for
a national party leader. Of a sample of 1100; 56% indicated that they would vote for this politician.
This month, another survey of 800 voters revealed that 46% now support the leader.
(a) At the 5% significance level, can we infer that the national leader’s popularity has decreased?
(b) At the 5% significance level, can we infer that the national leader’s popularity has decreased by
more than 5%?
(c) Estimate with 95% confidence the decrease in percentage support between now and 6 months
ago.
Question 3
Consider the following ANOVA table:
(b) The within-treatments variation stands for the sum of squares for error.
(c) In one-way analysis of variance, if all the sample means are equal, then the sum of squares for
treatments is zero.
(d) The rejection region, at the 1% level of significance, for this one-way analysis of variance is
$F > F_{\alpha,\, k-1,\, n-k} = F_{0.01,\, 4,\, 25}$.
(e) Assume that the above ANOVA is applied to independent samples taken from normally distributed
populations with equal variances. If the null hypothesis is rejected, then we can infer that at least
two population means differ.
Question 4
A consumer organization was concerned about the differences between the advertised sizes of
containers and the actual amount of product. In a preliminary study, six packages of three different
brands of margarine that are supposed to contain 500ml were measured. The differences from 500ml
are listed here. Do these data provide sufficient evidence to conclude that differences exist between
the three brands? (Use $\alpha = 0.01$.)
Brand 1 Brand 2 Brand 3
1 2 1
3 2 2
3 4 4
0 3 2
1 0 3
0 4 4
Question 1
The given information
$z = 1.0730$
Making decision
Since the p–value = 0.2846 is greater than $\alpha = 0.05$, we fail to reject $H_0$.
Conclusion: There is not enough evidence to infer that the two population proportions differ.
The pooled proportion estimate is $\hat{p} = \dfrac{213.75 + 202.5}{450} = 0.925$
$z = 2.0161$
Making a decision
Since the p–value = 0.0434 is less than $\alpha = 0.05$, we reject the null hypothesis $H_0$.
Conclusion: There is enough evidence to infer that the two population proportions differ.
n1 = 225 n2 = 225
= 2:0161
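The calculations for Question 1 can be checked numerically. The following is a minimal sketch, assuming the usual pooled two-sample z statistic for $H_0 : p_1 = p_2$ and a two-tailed p-value; the function name `two_proportion_z` is hypothetical and chosen only for illustration. Because the normal tail area is computed directly instead of being read from a table with z rounded to two decimals, the results may differ slightly from the hand calculations.

```python
import math


def two_proportion_z(p1_hat, n1, p2_hat, n2):
    """Pooled z statistic and two-tailed p-value for H0: p1 = p2."""
    # Pooled estimate of the common proportion under H0.
    pooled = (p1_hat * n1 + p2_hat * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    # Upper-tail area of the standard normal via the complementary error function.
    upper_tail = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    return z, 2 * upper_tail


# Parts (a), (b) and (d) of Question 1, each with n1 = n2 = 225.
for p1, p2 in [(0.60, 0.55), (0.95, 0.90), (0.10, 0.05)]:
    z, p_value = two_proportion_z(p1, 225, p2, 225)
    print(f"p1_hat={p1:.2f}, p2_hat={p2:.2f}: z={z:.4f}, p-value={p_value:.4f}")
```

Running the loop shows the pattern asked about in parts (c) and (e): the same difference of 0.05 gives a smaller p-value when the sample proportions lie closer to 0 or 1, because the standard error shrinks.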
Question 2
The given information
$z = 4.3103$
The critical value for a one–tailed test is $z_{\alpha} = z_{0.05} = 1.645$ (from Table 1).
Make a decision
Since the test statistic $z = 4.3103$ is greater than the critical value (1.645), we reject the null
hypothesis $H_0$ at the 5% significance level.
Conclusion: There is enough evidence to infer that the leader's popularity has decreased.
(b) If the popularity decreased by more than 5%, this means $(P_1 - P_2) > 0.05$.
The hypotheses are
$H_0 : (P_1 - P_2) = 0.05$ against $H_1 : (P_1 - P_2) > 0.05$.
Because $H_0$ no longer states that the two proportions are equal, the pooled proportion estimate
cannot be used, and we calculate the standard error from the separate sample proportions as given below.
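The standard error of $\hat{p}_1 - \hat{p}_2$ is
$$SE = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} = \sqrt{\frac{0.56(1-0.56)}{1100} + \frac{0.46(1-0.46)}{800}} = \sqrt{0.000224 + 0.0003105} = 0.0231$$
The test statistic is
$$z = \frac{(\hat{p}_1 - \hat{p}_2) - 0.05}{SE} = \frac{0.05}{0.0231} = 2.1645$$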
Making a decision
Since zero lies outside the confidence limits, we conclude that the null hypothesis $H_0$ is
rejected.
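Parts (b) and (c) of Question 2 both rest on the unpooled standard error. The sketch below recomputes the standard error, the test statistic for a hypothesized difference of 0.05, and a 95% confidence interval. It assumes the critical value $z_{0.025} = 1.96$ for the interval, and the function name `diff_proportions_unpooled` is hypothetical.

```python
import math


def diff_proportions_unpooled(p1_hat, n1, p2_hat, n2, d0=0.0, z_crit=1.96):
    """Unpooled SE, z statistic for H0: p1 - p2 = d0, and a confidence interval."""
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    z = (diff - d0) / se
    # z_crit = 1.96 is assumed here for a 95% two-sided interval.
    interval = (diff - z_crit * se, diff + z_crit * se)
    return se, z, interval


# Survey data from Question 2: 56% of 1100 six months ago, 46% of 800 now.
se, z, interval = diff_proportions_unpooled(0.56, 1100, 0.46, 800, d0=0.05)
print(f"SE = {se:.4f}, z = {z:.4f}")
print(f"95% CI for p1 - p2: ({interval[0]:.4f}, {interval[1]:.4f})")
```

The test statistic is compared with the one-tailed critical value 1.645, while the interval estimates the size of the decrease in support.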
Question 3
The given information
The ANOVA table
(b) Correct
(c) Correct
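If $\bar{X}_1 = \bar{X}_2 = \dots = \bar{X}_k$, the sum of squares for treatments is
$$SST = \sum_{j=1}^{k} n_j \left(\bar{X}_j - \bar{\bar{X}}\right)^2$$
Since $\bar{\bar{X}}$ is the average of the sample means, which are all equal, every difference in the formula for SST
is zero and therefore SST = 0.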
(d) Correct
(e) Correct
Question 4
The given information
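Sample sizes: $n_1 = 6$, $n_2 = 6$, $n_3 = 6$
Because the three samples are the same size, the grand mean is the average of the sample means:
$$\bar{\bar{X}} = \frac{\bar{X}_1 + \bar{X}_2 + \bar{X}_3}{3} = \frac{1.3333 + 2.5 + 2.6667}{3} = \frac{6.5}{3} = 2.1667$$
$$SST = n_1\left(\bar{X}_1 - 2.1667\right)^2 + n_2\left(\bar{X}_2 - 2.1667\right)^2 + n_3\left(\bar{X}_3 - 2.1667\right)^2$$
$$= 6(1.3333 - 2.1667)^2 + 6(2.5 - 2.1667)^2 + 6(2.6667 - 2.1667)^2 = 6.3333$$
(ii)
$$SSE = \sum_{j=1}^{k} (n_j - 1) S_j^2$$
(iii)
(iv)
$$MST = \frac{SST}{k-1} = \frac{6.3333}{3-1} = \frac{6.3333}{2} = 3.1667$$
(v)
$$MSE = \frac{SSE}{n-k} = \frac{28.167}{18-3} = \frac{28.167}{15} = 1.8778$$
(vi) The test statistic F is
$$F = \frac{MST}{MSE} = \frac{3.1667}{1.8778} = 1.6864$$
(vii) The rejection region is $F > F_{\alpha,\,k-1,\,n-k} = F_{0.01,\,2,\,15} = 6.36$.
Conclusion: Since F = 1.6864 is less than 6.36, there is not enough evidence to conclude that differences exist between the three
brands.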
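As a check on the hand calculation, the sums of squares and the F statistic can be recomputed directly from the raw data in Question 4. This is a minimal sketch in plain Python; the function name `one_way_anova` is hypothetical, and the critical value 6.36 still has to be read from the F table.

```python
def one_way_anova(groups):
    """Return SST, SSE, MST, MSE and F for a one-way analysis of variance."""
    k = len(groups)                      # number of treatments
    n = sum(len(g) for g in groups)      # total sample size
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-treatments variation.
    sst = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-treatments variation (sum of squares for error).
    sse = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    mst = sst / (k - 1)
    mse = sse / (n - k)
    return sst, sse, mst, mse, mst / mse


# Differences from 500 ml for the three margarine brands in Question 4.
brands = [
    [1, 3, 3, 0, 1, 0],
    [2, 2, 4, 3, 0, 4],
    [1, 2, 4, 2, 3, 4],
]
sst, sse, mst, mse, f_stat = one_way_anova(brands)
print(f"SST={sst:.4f}  SSE={sse:.4f}  MST={mst:.4f}  MSE={mse:.4f}  F={f_stat:.4f}")
```

Computing SSE from the individual observations, as done here, is equivalent to the formula with the sample variances $S_j^2$.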
Can you
- alternative hypothesis
- significance levels
- conclusion
demonstrate an understanding of the connections between the concepts significance level and
p-value?
interpret computer output regarding inferences about an F-test for two population variances
- between-treatments variation
differentiate between one- and two-way analysis of variance experimental designs as well as
randomized block designs?
Key Terms/Symbols
degrees of freedom
F-test for two population variances
ANOVA-test
within-treatments variation
sum of squares for error
between-treatments variation
SS Within
SS Between
SS Blocks
SS Error
SS Treatment
overall mean
STUDY UNIT 3
3.1 Chi–square test
It is just as important to consider the sampled population as it is to know the data type of your
sample. What do you want to know about a specific population or populations? In the earlier study
units we were always interested in the parameters of the population, which implied that we had some
information about the population (e.g. we knew that it was normally, or approximately so, distributed).
What we have discussed so far implied so-called parametric techniques, where we considered the
statistics of a sample to predict the parameters of the distribution describing the population. In the first
part of this study unit we consider other very important parametric techniques, namely chi-squared
tests. In the second part of this unit we then venture into something new, addressing the dilemma
when one cannot make assumptions about the shape of the sampled population. As statisticians
we are often faced with this reality. Do you think that it is still possible to use a random sample
drawn from such a population and make a sensible analysis and even predictions about that sampled
population? Yes! You are going to see that there are also nonparametric techniques that you can use
if you do not know the distribution of the sampled population. As usual, apart from explaining
the methods, we also describe the conditions under which these alternatives apply.
Of course, choosing the correct technique for the particular data type remains important.
The first part of this study guide covers two applications of the continuous chi-squared distribution,
which is the technique applicable if the data is nominal. In STA1501 you heard about this distribution
and here hypothesis tests will be discussed and the conditions for their application. Only the chi-
squared goodness-of-fit test and the chi-squared test of a contingency table form part of the contents
of this module (the test for normality is therefore not included). In the second part of this study unit
you will be introduced to three nonparametric techniques. You will see that the sampled populations
are nonnormal and that dependence and independence of the samples play an important role. The
techniques you have to know for this module are the Wilcoxon rank sum test for ordinal or interval
data from two independent samples, the sign test for ordinal data in the form of matched pairs and
lastly the Wilcoxon signed rank test for interval data, also in the form of matched pairs. There are
other nonparametric tests in the prescribed book, but they are not included in the contents of this
module. Remember about them because you never know if you may need to use one of them in
future. Then you simply take the prescribed book and read up about them!
As you study these different tests, please do not be discouraged by all the different definitions that
are given and are used in the manual examples. Remember that we are statisticians and we do not
want to test your memory, but your knowledge of the different procedures and their conditions. In the
examination you will be given a list of formulas from which you can select the one you need (should
we ask a question in an examination paper where you need a formula).
Test statistic
Required conditions
In distance learning the pronunciation of words or symbols is often a problem. If you wonder about
the word "chi" or its symbol $\chi$, think of the words "pie" or "sky" in English, because "chi" rhymes with
them. The ch is pronounced as a k, which means that you actually say "kai".
For the symbol $\chi^2$ you say "kai-square".
Recall the knowledge given to you in STA1501 about a binomial experiment and the binomial
distribution. Just a reminder - the prefix bi- refers to two, while the prefix multi- refers to many.
Chi-square is a family of distributions commonly used for significance testing. A chi-square test
(also chi-squared or $\chi^2$ test) is any statistical hypothesis test in which the sampling distribution of
the test statistic is a chi-square distribution when the null hypothesis is true, or any in which this
is asymptotically true, meaning that the sampling distribution (if the null hypothesis is true) can be
made to approximate a chi-square distribution as closely as desired by making the sample size large
enough. A number of tests exist, but you are required to focus only on this one.
Below is a table illustrating the similarities and differences between a binomial and a multinomial
experiment.
Binomial experiment consists of | Multinomial experiment consists of
a fixed number n of trials | a fixed number n of trials
two possible outcomes per trial | k categories (cells) of outcomes per trial
constant probabilities $p$ and $1 - p$ for the outcomes | constant probabilities $p_i$ for each cell i
two probabilities $p$ (success) and $1 - p$ (failure) | k probabilities $p_i$, with $p_1 + p_2 + \dots + p_k = 1$
independent trials | independent trials
x successes in n trials | observed frequencies $f_i$ of outcomes in cell i
expected value $np$ | expected frequencies $e_i = np_i$
The discussion in STA1501 on the chi-squared distribution was very brief. In this section you are
going to learn more about different tests where the test statistic has a chi-squared distribution.
There are many interesting and practical applications of the chi-squared distribution. Researchers
are also very keen to use a chi-squared test and we hope that you will now study research results
and see if the conditions for application of this distribution are satisfied. The purpose of an analysis
can be to determine if the sample is from a specified population or the interest can be to determine
if there is a relationship between two populations, e.g. between predicted values and actual values.
An example of the latter: suppose a telecommunications company, interested in customer care, is
uncertain about the continuation or not of a specific product. They decide to ask customers if they
would like the service to continue for the next year or not (this would be categorical or nominal data).
The recorded data (two categories of ’yes’ and ’no’) can be saved and the product continued for a
year. Then data (’yes’ or ’no’) can again be collected and a chi-square analysis can be made to see if
there is a relationship between what the people said and what they actually did. If the null hypothesis
is rejected, it indicates that there is a relationship between the two populations. In this scenario the
managers can then decide to use data in which customers say what they are going to do, because such data are
reliable enough for their planning.
If you study the examples in the book and in the activities, see if you understand the following
comment: Samples should not be too large for applications of the chi-squared test, and in practice,
analysts carefully study the distribution of the items in the chi-square table and do not only rely on
the numerical value of the test.
Goodness-of-fit test
Make sure that you understand the hypothesis testing procedure and the sampling distribution of the
test statistic for the goodness-of-fit test.
Test statistic
How would you express the formula for the test statistic of the goodness-of-fit test in your own words?
$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$$
Is that not easier to remember than the formula itself? It tells you exactly what to do. Can you explain
it to someone else?
If you are still not so sure, we illustrate with the words:
Square the difference between the observed frequency $f_i$ and the expected frequency $e_i$, that is $(f_i - e_i)^2$,
and divide it by the expected frequency $e_i$ for each cell.
Add all these answers over the cells $i = 1, \dots, k$ and it gives you the test statistic of the
chi-squared goodness-of-fit test.
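The verbal recipe above translates almost word for word into code. The following is a minimal sketch, not a replacement for the manual method you need to know. The function name `chi_square_gof` and the die-rolling data are hypothetical and used only for illustration.

```python
def chi_square_gof(observed, probabilities):
    """Chi-squared goodness-of-fit statistic for given cell probabilities."""
    n = sum(observed)
    # Expected frequency for each cell: e_i = n * p_i.
    expected = [n * p for p in probabilities]
    # Square the difference, divide by the expected frequency, add over all cells.
    statistic = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
    degrees_of_freedom = len(observed) - 1
    return expected, statistic, degrees_of_freedom


# Hypothetical example: a die rolled 60 times, H0: all six faces equally likely.
observed = [8, 12, 9, 11, 6, 14]
expected, statistic, df = chi_square_gof(observed, [1 / 6] * 6)
print(f"chi-squared = {statistic:.4f} with {df} degrees of freedom")
```

The statistic is then compared with the upper-tail critical value from the chi-squared table for $k - 1$ degrees of freedom.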
Activity 3.1
Question 1
Employee absenteeism has become a serious problem which cannot be ignored. The personnel
department at a university decided to record the weekdays during which lecturers in the Faculty of
Humanities in a sample of 300 called in sick over the past several months. Determine if the given
data suggests that absenteeism is higher on some days of the week than on others.
From existing medical evidence the following information is specified in the null hypothesis for the
consecutive days of the week:
Monday $P_1 = 0.3$; Tuesday $P_2 = 0.1$; Wednesday $P_3 = 0.2$; Thursday $P_4 = 0.2$; Friday $P_5 = 0.2$
Day of the week    Monday    Tuesday    Wednesday    Thursday    Friday
Number absent      84        24         56           64          72
Question 2
In a goodness-of-fit test, suppose that a sample showed that the observed frequency fi and expected
frequency ei were equal for each cell i: Then, the null hypothesis is
Question 3
The critical value in a goodness-of-fit test with 6 degrees of freedom, considered at the 5%
significance level, is
1. equal to 18.5476
2. equal to 12.6
3. equal to 0.872085
Question 4
A chi-squared goodness-of-fit test is always conducted as
1. a lower-tail test
2. an upper-tail test
3. a two-tailed test
Question 5
Five statements are given below. Only one of them is a true statement. Which option is true?
1. For a chi-squared distributed random variable with 10 degrees of freedom and a level of
significance of 0.025, the chi-squared table value is 20.5. The computed value of the test statistic
is 16.857. This will lead us to reject the null hypothesis.
2. Whenever the expected frequency of a cell is less than 5, one remedy for this condition is to
decrease the size of the sample.
3. For a chi-squared distributed random variable with 12 degrees of freedom and a level of
significance of 0.05, the chi-squared value from the table is 21.0. The computed value of the
test statistic is 25.1687. This will lead us to reject the null hypothesis.
4. The chi-squared goodness-of-fit test can be used for any type of data.
5. In a multinomial experiment the probability $P_i$ that the outcome will fall into cell i can change from
one trial to the next.
Test statistic
Rule of five
You need to realize that there are many similarities between the two $\chi^2$-tests in this chapter, and that
there are also definite differences.
In statistics, contingency tables are used to record and analyse the relationship between two or
more variables, most usually categorical variables. Suppose that we have two variables, sex (male
or female) and handedness (right- or left-handed). We observe the values of both variables in a
random sample of 100 people.
Then a contingency table can be used to express the relationship between these two variables, as
follows:
Right-handed Left-handed TOTAL
Male 43 9 52
Female 44 4 48
TOTAL 87 13 100
The figures in the right-hand column and the bottom row are called marginal totals and the figure
in the bottom right-hand corner is the grand total. The table allows us to deduce at a glance that
the proportion of men who are right-handed is about the same as the proportion of women who are
right-handed. However, the two proportions are not identical, and the significance of the
difference between them can be tested statistically using one of a number of available methods. In
our case we will use a nonparametric method called Pearson’s chi-square test. In this case the
entries provided in the table must represent a random sample from the population contemplated in
the null hypothesis. If the proportions of individuals in the different columns vary between rows (and,
therefore, vice versa) we say that the table shows contingency between the two variables. If there is
no contingency, we say that the two variables are independent.
If we make a table of comparisons it might help you to remember the different principles involved and
the calculation methods.
Chi-squared goodness-of-fit test | Chi-squared test of a contingency table
Only applicable for nominal data produced by a multinomial experiment. | Only applicable for nominal data arranged in a contingency table.
Test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$ | Test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$
$H_0$ lists values for the probabilities $p_i$. | $H_0$ states that the two variables are independent.
The manual calculation of the $\chi^2$-values for the contingency table is rather cumbersome, but not that
complex!
Make sure that you understand the process of
- calculating the expected frequencies for each cell: multiply the row total by the column total and
divide by the grand total
- writing the given (observed) frequencies and calculated (expected) frequencies next to each other
for each cell in a new contingency table
- calculating the test statistic, which involves only this last contingency table: for each cell, subtract
the two frequencies, square the answer, then divide by the calculated (expected) frequency, and add the results over all cells
If you calculate these values with Excel or Minitab it is of course not so complex, but remember that,
at this first-year level, you have to know the "how" of the process itself and not only the interpretation
of the $\chi^2$ and p values.
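To make the three steps concrete, here is a minimal sketch that applies them to the sex and handedness table given earlier in this section. The function name `contingency_chi_square` is hypothetical, and the code only mirrors the manual process; it is not the procedure used by Excel or Minitab.

```python
def contingency_chi_square(table):
    """Expected frequencies, chi-squared statistic and degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Step 1: expected frequency = row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    # Steps 2 and 3: pair observed with expected, then sum (f - e)^2 / e over all cells.
    statistic = sum(
        (f - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for f, e in zip(obs_row, exp_row)
    )
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return expected, statistic, degrees_of_freedom


# Rows: male, female. Columns: right-handed, left-handed.
handedness = [[43, 9], [44, 4]]
expected, statistic, df = contingency_chi_square(handedness)
for row in expected:
    print([round(e, 2) for e in row])
print(f"chi-squared = {statistic:.4f} with {df} degree(s) of freedom")
```

The expected frequencies printed by the sketch are the values you would write in brackets next to the observed frequencies in the new contingency table.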
Z –test of inference about the difference between two population proportions as we have indicated
in the study unit in section 2.2.
You will understand that it is necessary to learn each technique one at a time and to focus on the kinds
of problems each addresses. A summary of the statistical tests on nominal data is given below to
ensure that you are capable of selecting the correct method.
Notice that there are two groups of tests: the Z–tests of proportions, and the chi–squared tests,
which are based on a multinomial experiment and can handle two or more categories.
The chi–squared tests come in two forms. In the first, called the chi–squared goodness-of-fit test, we determine
the frequency of each category and use these frequencies to calculate the test statistic. In the second, called the
chi–squared test of a contingency table, we use the cell frequencies of the table to calculate the chi–squared test statistic.
Activity 3.2
Question 1
The trustee of a company’s pension plan has solicited the opinions of a sample of the company’s
employees about a proposed revision of the plan. A breakdown of the responses is shown in the
accompanying table. Is there enough evidence to infer that the responses differ between the three
groups of employees?
Responses    Blue-collar workers    White-collar workers    Managers
For          67                     32                      11
Against      63                     18                      9
Question 2
The number of degrees of freedom for a contingency table with 5 rows and 7 columns is
1. 35
2. 12
3. 10
4. 24
5. 30
Question 3
In a chi-squared test of a contingency table, the test statistic value was $\chi^2 = 12.678$, and the critical
value at $\alpha = 0.025$ was 14.4. Thus,
Question 4
Which of the following statements is/are false?
1. A chi-squared test for independence is applied to a contingency table with 3 rows and 4 columns
for two qualitative variables. The degrees of freedom for this test must be 12.
2. A chi-squared test for independence with 10 degrees of freedom results in a test statistic of 17.894.
Using the chi-squared table, the most accurate statement that can be made about the p-value for
this test is that 0.05 < p-value < 0.10.
3. In a chi-squared test of independence, the value of the test statistic was 15.652, and the critical
value at $\alpha = 0.025$ was 11.1433. Thus, we must reject the null hypothesis at $\alpha = 0.025$.
4. A chi-squared test for independence with 6 degrees of freedom results in a test statistic of 13.25.
Using the chi-squared table, the most accurate statement that can be made about the p-value for
this test is that the p-value is greater than 0.025 but smaller than 0.05.
5. The chi-squared test of a contingency table is used to determine if there is enough evidence to
infer that two nominal variables are related, and to infer that differences exist among two or more
populations of nominal variables.
Activity 3.3
Question 1
A statistics professor posted the following grade distribution guidelines for his elementary statistics
class:
8% A, 35% B, 40% C, 12% D, and 5% F.
A sample of 100 elementary statistics grades at the end of last semester showed
12 A’s, 30 B’s, 35 C’s, 15 D’s, and 8 F’s.
Suppose that you test at the 5% significance level to determine whether the actual grades deviate
significantly from the posted grade distribution guidelines. Compare your calculations with the step
by step calculations given below. Indicate in which step the first error was made.
5. The actual grades do not deviate significantly from the posted grade distribution guidelines.
Question 2
Which of the following tests is appropriate for nominal data if the problem objective is to compare two
or more populations and the number of categories is at least 2?
Feedback Feedback
Activity 3.1
Question 1
$H_0 : p_1 = 0.3,\ p_2 = 0.1,\ p_3 = 0.2,\ p_4 = 0.2,\ p_5 = 0.2$
$H_1$ : At least one $p_i$ is not equal to its specified value.
Making a decision
Since the test statistic $\chi^2 = 4.5334$ is less than the critical value (13.3), we fail to reject $H_0$ at the 1%
significance level.
Conclusion: There is not enough evidence to infer that absenteeism is higher on some
days of the week.
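The intermediate steps behind this test statistic can be reconstructed from the data in the question. The following is a worked sketch, assuming the expected frequencies are $e_i = np_i$ with $n = 300$ and that the critical value is read from the chi-squared table with $k - 1 = 4$ degrees of freedom.

```latex
e_1 = 300(0.3) = 90, \quad e_2 = 300(0.1) = 30, \quad e_3 = e_4 = e_5 = 300(0.2) = 60

\chi^2 = \frac{(84-90)^2}{90} + \frac{(24-30)^2}{30} + \frac{(56-60)^2}{60}
       + \frac{(64-60)^2}{60} + \frac{(72-60)^2}{60}
       = 0.4000 + 1.2000 + 0.2667 + 0.2667 + 2.4000 = 4.5334
```

Friday contributes the most to the statistic, but the total is still well below the critical value.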
Question 2
Option (4)
The chi-squared goodness-of-fit test involves the difference between the expected and observed
frequencies. In this question there is never a difference between the two, with the result that the null
hypothesis will never be rejected.
Question 3
Option (2)
From the $\chi^2$ table (Table 5), find the cell where the column headed $\chi^2_{0.050}$ meets the row for 6 degrees of freedom
in the first column. The value written there is 12.6.
Question 4
Option (2)
If you are not sure, look at the little picture at the top of the page listing the $\chi^2$ table (Table 5) and you will
see that the shaded area lies on the right-hand side.
Question 5
Option (3)
1. False. The table value is correct, but the value 16.857 does not fall in the critical region and
therefore the null hypothesis will not be rejected.
2. False. The remedy is to combine cells should any expected value in a cell be less than 5.
3. True. The test statistic 25.1687 is greater than the table value 21.0, and the null hypothesis would be rejected.
5. False. These probabilities have to remain constant for each trial of a multinomial experiment.
Activity 3.2
Question 1
The given information
In this problem we want to find out if the job description of an employee has an influence on their
choice option. A contingency table is used to address this problem.
The hypotheses are
H0 : The two variables (employees and their responses) are independent.
H1 : The two variables are dependent.
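The observed frequencies are shown in the table, with the expected frequencies in brackets.
                    Employees
Responses           Blue collar    White collar    Managers    Total
For revision        67 (71.5)      32 (27.5)       11 (11)     110
Against revision    63 (58.5)      18 (22.5)       9 (9)       90
Total               130            50              20          200
The expected frequency for each cell is (row total × column total) / grand total:
For revision, blue collar: $\frac{110 \times 130}{200} = 71.5$
Against revision, blue collar: $\frac{90 \times 130}{200} = 58.5$
For revision, white collar: $\frac{110 \times 50}{200} = 27.5$
Against revision, white collar: $\frac{90 \times 50}{200} = 22.5$
For revision, managers: $\frac{110 \times 20}{200} = 11$
Against revision, managers: $\frac{90 \times 20}{200} = 9$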
Making a decision
Since the test statistic (2.2658) is less than the critical value (5.99), we fail to reject the null
hypothesis $H_0$ at the 5% level of significance.
Conclusion: There is not enough evidence that the response to the proposed revision plan
depends on the group (according to job description in the company) of the employee.
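The value of the test statistic can be reconstructed from the observed and expected frequencies in the table. This is a worked sketch, assuming $(r-1)(c-1) = (2-1)(3-1) = 2$ degrees of freedom, which is consistent with the critical value 5.99 at the 5% level.

```latex
\chi^2 = \frac{(67-71.5)^2}{71.5} + \frac{(32-27.5)^2}{27.5} + \frac{(11-11)^2}{11}
       + \frac{(63-58.5)^2}{58.5} + \frac{(18-22.5)^2}{22.5} + \frac{(9-9)^2}{9}
       = 0.2832 + 0.7364 + 0 + 0.3462 + 0.9000 + 0 = 2.2658
```

The managers' cells contribute nothing because their observed and expected frequencies coincide.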
Question 2
Option (4)
The degrees of freedom are $df = (r-1)(c-1) = (5-1)(7-1) = 4 \times 6 = 24$.
Question 3
Option (2)
The number of degrees of freedom was 6, as can be seen from the $\chi^2$ table if you find the cell under
$\chi^2_{0.025}$ with the value 14.4 written in it. Furthermore, because 14.4 is larger than the calculated 12.678,
the null hypothesis cannot be rejected. For option 5, if you look at the table and you decrease the
significance level to 0.010 (column $\chi^2_{0.010}$), the critical value is 16.8 and the null hypothesis would still not be rejected
because 12.678 < 16.8.
Question 4
Option (1)
Option 1 is false because of the number of degrees of freedom. It is not $3 \times 4 = 12$ but $(3-1)(4-1) = 2 \times 3 = 6$.
Option 2 is true. The p-value can only be determined accurately with computer software;
however, we can get some indication from the $\chi^2$ table. The value 17.894 lies between the table values 16.0
and 18.3, which correspond respectively to significance levels of 0.100 and 0.050. Therefore the
comment about the range of the p-value is true.
Option 3 is true because the test statistic’s value 15.652 is more than the table value 11.1433, which
places it in the rejection region at level $\alpha = 0.025$.
Option 4 is true for the same reasons as option 2 is true.
Option 5 is true.
Activity 3.3
Question 1
Making a decision
The test statistic does not fall in the rejection region, therefore the null hypothesis H0 cannot be
rejected.
1. Correct
2. Correct
3. Correct
4. Incorrect
The error lies in the interpretation of the calculated value.
5. Correct
Option (4)
Question 2
Option (2)
STUDY UNIT 4
4.1 Simple Linear Regression and correlation
Introduction
In this study unit the discussion is about the relationship between interval variables. In regression
analysis involving two variables, one of the variables is used to make predictions about the other
variable. Recall that interval data are real numbers, such as heights, weights, incomes and distance,
as was said in chapter 2 (or STA1501), where you were told that interval data can also be referred to
as quantitative or numerical data. In this unit the so-called probabilistic model for regression analysis
is described, with initial interest in the first-order linear model (also called the simple linear regression
model). In this model an error variable is introduced. Finding the equation of the regression line is
the first step, but this has to be followed by an assessment of the fit of the line to the data as well as
looking into the relationship between the dependent and independent variables. The importance of
the error variable and the conditions that apply to it, forms the basis of many of the discussions that
follow.
You will not be examined on all the sections of this chapter; our focus will be on the topics of
simple linear regression and correlation. The topics covered in these sections are very important
and should you continue with statistics, you will surely learn more about them in a second-level module.
1. Model
This discussion deals with bivariate data, for which we use the equation of a straight line to
describe the relationship between the independent variable X and the dependent variable Y. We
describe the strength of the relationship using the correlation coefficient r. Consider the problem of
predicting the value of a response Y based on the value of an independent variable X. The model
is $Y = \beta_0 + \beta_1 X + \varepsilon$
where
$\beta_0$ = the intercept of the line, that is, the value of Y when X = 0.
$\beta_1$ = the slope of the line, defined as the change in Y for a one-unit change in X.
X = Independent variable (or predictor variable).
Y = Dependent variable (or response variable).
$\varepsilon$ = Random error that explains the deviation of the points (X, Y) about the line.
The model $Y = \beta_0 + \beta_1 X + \varepsilon$ is called the first–order model or the simple linear regression model.
It is used to analyze the relationship between two variables, X and Y, both of which must be interval. To describe the
relationship between X and Y, we need to know the values of the coefficients $\beta_0$ and $\beta_1$. However,
these coefficients are population parameters, which are almost always unknown. In the next section, we
discuss how these parameters are estimated.
To find the best–fitting line for a set of bivariate data, we need to estimate
the parameters $\beta_0$ and $\beta_1$. Since these parameters represent the coefficients of a straight line, their
estimators are based on drawing a straight line through the sample data. The formula for the
best–fitting line is $\hat{Y} = b_0 + b_1 X$ where
$b_0$ = the intercept
$b_1$ = the slope
X = the independent variable
$\hat{Y}$ = the predicted or fitted value of Y.
The least squares method is an approach that enables us to produce the straight line that minimizes the sum of the
squared deviations between the data points and the line. Deriving the
values of $b_0$ and $b_1$ requires differential calculus, which is beyond the scope of this module. Rather than derive
their values, we will simply present formulas for calculating $b_0$ and $b_1$.
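Least squares line coefficients
$$b_1 = \frac{S_{xy}}{S_x^2}$$
$$b_0 = \bar{Y} - b_1 \bar{X}$$
where
$S_{xy}$: the covariance of (X, Y)
$S_x^2$: the variance of X,
computed as
$$S_{xy} = \frac{1}{n-1}\left[\sum_{i=1}^{n} X_i Y_i - \frac{\left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n}\right]$$
$$S_x^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}\right]$$
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \qquad \bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$$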
You can use your scientific calculator to obtain the necessary sums and sums of squares as well
as the values of $b_0$, $b_1$ and the correlation coefficient r. Make sure you consult your calculator manual
to find the easiest way to obtain the least squares estimators $b_0$ and $b_1$. Be careful about rounding
errors: carry at least four significant figures, and round off only when reporting the end result.
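If you prefer to check your calculator work, the shortcut formulas for the least squares coefficients can be coded directly. This is a minimal sketch using the exam-score data from Activity 4.1, Question 7; the function name `least_squares` is hypothetical.

```python
def least_squares(x, y):
    """Least squares intercept b0 and slope b1 from the shortcut formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    # Covariance: (sum of x*y minus (sum x)(sum y)/n) divided by n - 1.
    s_xy = (sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n) / (n - 1)
    # Variance of x: (sum of x^2 minus (sum x)^2/n) divided by n - 1.
    var_x = (sum(a * a for a in x) - sum_x ** 2 / n) / (n - 1)
    b1 = s_xy / var_x
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


# Third exam score (x) and final exam score (y) for the 11 students in Question 7.
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]
b0, b1 = least_squares(x, y)
print(f"y_hat = {b0:.2f} + {b1:.2f} x")
```

No rounding is done until the final print statement, which follows the advice above about rounding errors.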
The deviations between the actual data points and the line are called the residuals, denoted $e_i$; that
is, $e_i = y_i - \hat{y}_i$.
The minimized sum of squared deviations is called the sum of squares for error, denoted SSE.
(1) The standard error of estimate: In the linear model, the error variable $\varepsilon$ is normally distributed
with mean 0 and standard deviation $\sigma_\varepsilon$. If $\sigma_\varepsilon$ is large, this implies that the model fit is poor. If $\sigma_\varepsilon$
is small, the errors tend to be close to the mean (which is 0), and as a result the model fits the
data well. We need to calculate $\sigma_\varepsilon$; unfortunately it is a population parameter that is unknown.
However, we can estimate $\sigma_\varepsilon$ from the data by using the sum of squares for error (SSE). The formula is
$$s_\varepsilon = \sqrt{\frac{SSE}{n-2}}$$
where
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = (n-1)\left(S_y^2 - \frac{S_{xy}^2}{S_x^2}\right)$$
(2) The coefficient of correlation and the coefficient of determination measure the strength of the relationship. To answer
the question “How well does the regression model fit?”, we can use the correlation coefficient r
given by
$$r = \frac{S_{xy}}{S_x S_y} \quad \text{for } -1 \le r \le 1$$
that is, the correlation coefficient is a value between $-1$ and $+1$ that measures the strength of the linear
relationship.
The coefficient of determination $R^2$ is the proportion of the total variation in y that is explained by the
linear regression of y on x. In general, the higher the value of $R^2$, the better the model fits the
data. You will discover in practice that when you improve the model, the value of $R^2$ increases.
The formula for the coefficient of determination is
$$R^2 = \frac{S_{xy}^2}{S_x^2 S_y^2}$$
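The measures of fit described in (1) and (2) can be computed together from the sample variances and the covariance. The sketch below again uses the exam-score data from Activity 4.1, Question 7. It assumes the usual estimator $s_\varepsilon = \sqrt{SSE/(n-2)}$ for the standard error of estimate, and the function name `fit_summary` is hypothetical.

```python
import math


def fit_summary(x, y):
    """Correlation r, R squared, SSE and standard error of estimate."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    r = s_xy / math.sqrt(var_x * var_y)
    r_squared = s_xy ** 2 / (var_x * var_y)
    # Shortcut formula: SSE = (n - 1) * (s_y^2 - s_xy^2 / s_x^2).
    sse = (n - 1) * (var_y - s_xy ** 2 / var_x)
    # Assumed estimator of the error standard deviation: sqrt(SSE / (n - 2)).
    s_eps = math.sqrt(sse / (n - 2))
    return r, r_squared, sse, s_eps


x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]
r, r_squared, sse, s_eps = fit_summary(x, y)
print(f"r = {r:.4f}, R^2 = {r_squared:.4f}, SSE = {sse:.2f}, s_eps = {s_eps:.2f}")
```

Note that $R^2$ computed from the formula equals the square of r, which is why the two measures always tell the same story in simple linear regression.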
(3) The t–test of the slope $\beta_1$ is used to determine whether there is evidence of a linear relationship.
Several statistical tests and measures are used to determine whether the model fits the data well:
the t–test for the slope (or the ANOVA F–test) and the value of the coefficient of determination
$R^2$. Take note that the results of a regression analysis are valid only when the data satisfy the
necessary regression assumptions.
The regression assumptions:
(1) The relationship between Y and X must be linear, given by the model $Y = \beta_0 + \beta_1 X + \varepsilon$.
The diagnostic tools for checking the assumptions involve the analysis of the residuals.
Residual plots are complicated to construct by hand but easy to produce with a computer. When
the residuals are normally distributed, or approximately so, the normal probability plot of the residuals should appear as a straight line.
When the observations are collected at regular time intervals, the error terms are often dependent
because the observations make up a time series. Such data are analysed using time series
methods.
In a diagnostic analysis the requirements for the error variable and the influence of very large or very small
observations must be investigated. You need not apply the different tests, but you have to know
about them and what they mean.
When the normality requirement is unsatisfied, we can use a nonparametric technique called the
Spearman rank correlation coefficient to replace the t–test.
Activity 4.1
Question 1
The regression line $\hat{y} = 3 + 2x$ has been fitted to the data points (4, 8), (2, 5), and (1, 2). The sum of
the squared residuals will be
1. 7
2. 15
3. 8
4. 22
5. 7.5
Question 2
If an estimated regression line has a y -intercept of 10 and a slope of 4, then when x = 2 the actual
value of y is
1. 15
2. 24
3. 18
4. 14
5. unknown
Question 3
Given the least squares regression line $\hat{y} = 5 - 2x$, choose the correct statement:
3. As x increases, so does y:
4. As x decreases, so does y:
Question 4
A regression analysis between weight y (in kilogram) and height x (in centimetre) resulted in the
following least squares line: $\hat{y} = 70 + 2x$. This implies that if the height is increased by 1 centimetre,
the weight, on average, is expected to
1. increase by 1 kilogram
2. decrease by 2 kilogram
3. increase by 2 kilogram
Question 5
In regression analysis, the residuals represent the
Question 6
In a simple linear regression problem, the following statistics are calculated from a sample of 10
observations: $\sum (x - \bar{x})(y - \bar{y}) = 2250$; $s_x = 10$; $\sum x = 50$; $\sum y = 75$. The least squares estimates
5. 25 and 117:5
Question 7
A random sample of 11 statistics students produced the following data where x is the third test score,
out of 100, and y is the final exam score, out of 300. Can you predict the final exam score of a random
student if you know the third test score?
x third exam score 65 67 71 71 66 75 67 70 71 69 69
y final exam score 175 133 185 163 126 198 153 163 159 151 159
You can easily show by estimating the slope and intercept that the best fit line for the third exam/final
exam example has the equation $\hat{y} = -173.51 + 4.83x$.
What would be the expected final scores for students who obtained third exam scores of (i) 68, (ii)
78 and (iii) 94?
Question 8
The simple linear regression line is Son's Height = 33.73 + 0.516 × Father's Height
(b) What does the regression line tell you about the heights of sons of tall fathers?
(c) What does the regression line tell you about the heights of sons of short fathers?
Question 9
Which value of the coefficient of correlation r indicates a stronger correlation than 0.65?
1. 0.55
2. −0.75
3. 0.60
4. 0.05
5. 0.65
Question 10
In a regression problem the following pairs of (x, y) are given: (3, 1), (3, −1), (3, 0), (3, −2) and (3, 2).
That indicates that the
2. correlation coefficient is 1
3. correlation coefficient is 0
4. correlation coefficient is 1
__________________________________________________________________________
Feedback
Activity 4.1
Question 1
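The sum of squared residuals asked for in Question 1 can be checked numerically. The sketch below is only an illustration: it takes the fitted line and the three data points from the question, and the function names are chosen here for convenience.

```python
# Data points and fitted line from Activity 4.1, Question 1.
points = [(4, 8), (2, 5), (1, 2)]


def predict(x):
    """Fitted regression line: y-hat = 3 + 2x."""
    return 3 + 2 * x


def sum_squared_residuals(data):
    """Sum of (observed y - predicted y)^2 over all data points."""
    return sum((y - predict(x)) ** 2 for x, y in data)


sse = sum_squared_residuals(points)
print(sse)  # residuals are -3, -2 and -3, so the sum is 9 + 4 + 9 = 22
```

The result, 22, corresponds to option (4).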
Question 2
Option (5)
We can say nothing about the actual value of y, because the estimated regression line gives only the
predicted value $\hat{y}$, and the interpretation of the calculated values refers only to the sample.
Question 3
Option (2)
In the least squares regression line $\hat{y} = 5 - 2x$ the value of the slope is −2, which is negative;
therefore the relationship is negative (if the one increases, the other will decrease).
Question 4
Option (3)
The relationship can be expressed based on the slope. From the equation $\hat{y} = 70 + 2x$ we know the
slope of the line is 2, which implies that the ratio rise/run is 2/1. For each move forward (x = height) the
movement up (y = weight) will be double that.
Question 5
Option (1)
Question 6
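Given: $\sum (x - \bar{x})(y - \bar{y}) = 2250$, $s_x = 10$, $\sum x = 50$, $\sum y = 75$ and $n = 10$.
The covariance of (X, Y), denoted by $s_{xy}$, is
$$s_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1} = \frac{2250}{10 - 1} = 250$$
and the variance of X is $s_x^2 = 10^2 = 100$.
The slope is
$$b_1 = \frac{s_{xy}}{s_x^2} = \frac{250}{100} = 2.5$$
The intercept is
$$b_0 = \bar{y} - b_1 \bar{x} = \frac{75}{10} - 2.5 \left( \frac{50}{10} \right) = 7.5 - 12.5 = -5$$
Option (4)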
Question 7
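We are given the equation for this estimation: $\hat{Y} = -173.51 + 4.83X$. Thus, for those who obtained
third exam scores of (i) 68, (ii) 78 and (iii) 94 we would expect the final exam scores of:
(i) $-173.51 + 4.83(68) = 154.93$
(ii) $-173.51 + 4.83(78) = 203.23$
(iii) $-173.51 + 4.83(94) = 280.51$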
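The fitted line for Question 7 and its predictions can be reproduced with a short script. This is a minimal sketch: `least_squares` and `predict` are names chosen here, and the predictions deliberately use the rounded equation quoted in the question, so they may differ slightly from predictions made with unrounded coefficients.

```python
# Third exam scores (x) and final exam scores (y) from Activity 4.1, Question 7.
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]


def least_squares(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    s_xx = sum((xi - x_bar) ** 2 for xi in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def predict(score):
    """Predicted final exam score from the rounded equation in the question."""
    return -173.51 + 4.83 * score


b0, b1 = least_squares(x, y)
print(round(b0, 2), round(b1, 2))  # coefficients of the fitted line, rounded
for score in (68, 78, 94):
    print(score, round(predict(score), 2))
```

Note that the observed third exam scores range from 65 to 75, so the predictions for 78 and 94 are extrapolations and should be treated with caution.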
Question 8
Compare the given equation of the regression line with the standard form of the regression line:
$\hat{y} = b_0 + b_1 x$
Son's height = 33.73 + 0.516 × Father's height
This implies that the dependent variable y represents the son's height and the independent variable
x represents the father's height. We assume that both father and son are measured when they are
fully grown.
(a) The intercept $b_0 = 33.73$ is where the regression line and the y-axis intersect and at that point
x = 0. It does not mean that when the father's height is 0 (not born yet?) the son's height is
33.73 cm. You can see that makes no sense; it is meaningless!
The slope coefficient $b_1 = 0.516$ implies that for each additional cm of the father's height the son's
height increases on average by 0.516 cm.
(b) The predicted height of the son equals the father's height where x = 33.73 + 0.516x, that is, at
x = 33.73/0.484 ≈ 69.69. Fathers taller than this value are 'tall' and fathers shorter than this value are
'short'. Because the slope is less than 1, if the father is tall, the son would on average be shorter
than his father.
(c) If the father is short, then on average the son will be taller than his father.
Question 9
Option (2)
Remember that we said that the closer the value of r is to either +1 or −1, the stronger the
relationship between the variables. The fact that we compare positive and negative values is
irrelevant if the only issue is the strength of the relationship. A value of r close to zero indicates
a very weak relationship. Here r = −0.75 indicates a strong negative relationship.
Question 10
Option (3)
STUDY UNIT 5
5.1 Non parametric statistics
In Unit 1, we presented statistical techniques for comparing two populations by comparing
their respective population parameters (usually their population means). These approaches were
applicable to quantitative data that have a normal distribution. In Unit 5, we
present statistical tests for comparing populations for the many types of data that do not satisfy the
assumption of a normal distribution. The following statistical tests will be discussed at first-year
level.
We use nonparametric techniques when the sample sizes are small and the original populations are
not normal.
We present nonparametric techniques appropriate for comparing two or more populations using
either independent or matched paired samples. Furthermore, we will discuss a measure of
association that is useful in determining whether one variable increases as the other increases or
whether one variable decreases as the other increases.
Mann–Whitney U-test
We only discuss the Wilcoxon Rank Sum Test, since the two tests are equivalent: they use the same
information. The Wilcoxon Rank Sum Test is based on the sum of the ranks of the sample that has the
smaller sample size.
1. The Hypotheses
The null hypothesis H0 : The two population locations are the same.
The alternative hypothesis H1 : The location of population 1 is to the left of the location of
population 2.
$$E(T) = \frac{n_1 (n_1 + n_2 + 1)}{2}$$
$$\sigma_T = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}$$
$$Z = \frac{T - E(T)}{\sigma_T}$$
Reject H0 when the value of the test statistic falls beyond the critical value at the alpha (α) level
of significance (for the left-tail alternative above, when $Z < -z_\alpha$).
5. Interpretation.
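The normal approximation above is simple substitution, which a short script can confirm. The sketch below assumes the rank sum T and the two sample sizes are already known; the function names are illustrative, and the standard normal probability is computed with `math.erf` instead of being read from a table. The example values (T = 210, n1 = 15, n2 = 20) are those of Activity 5.1, Question 6.

```python
import math


def rank_sum_z(t, n1, n2):
    """Standardized Wilcoxon rank sum statistic Z = (T - E(T)) / sigma_T."""
    expected = n1 * (n1 + n2 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (t - expected) / sigma


def standard_normal_cdf(z):
    """P(Z <= z) for a standard normal variable."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


z = rank_sum_z(210, 15, 20)
p_left = standard_normal_cdf(z)  # p-value for a left-tail alternative
print(z, round(p_left, 4))
```

For a right-tail alternative the p-value would be `1 - standard_normal_cdf(z)`, and for a two-tail alternative twice the smaller tail area.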
Remark
You must be able to use the table of critical values for the Wilcoxon Rank Sum Test. Make sure that
you understand that n1 is the number of observations in the data set with the smallest rank-total
(which need not be the one given as "sample 1"). Furthermore, take care to use the right
table for the right test. Table 6(a) is used for either α = 0.025 for a one-tailed test or α = 0.05 for a
two-tailed test. Table 6(b) is used for either α = 0.05 for a one-tailed test or α = 0.10 for a two-tailed
test.
The formula given for sample sizes larger than 10 is a normal approximation. It is calculated
without the tables (because they do not list values larger than 10) and uses only the sizes of the two
independent samples and the test statistic.
Table 6a and b Critical values for the Wilcoxon Rank Sum Test
Table 7 Critical Values for the Wilcoxon Signed Rank Sum Test
Activity 5.1
Question 1
Consider the following data set: 14, 14, 15, 16, 18, 19, 19, 20, 21, 22, 23, 25, 25, 25, 25 and 28.
The rank assigned to the four observations of value 25 is
1. 12
2. 12.5
3. 13
4. 13.5
5. 14
Question 2
The Wilcoxon rank sum test statistic T is approximately normally distributed whenever the sample
sizes are
1. larger than 10
2. smaller than 10
3. between 5 and 15
5. smaller than 20
Question 3
A Wilcoxon rank sum test for comparing two populations involves two independent samples of sizes
5 and 7. The alternative hypothesis is stated as: The location of population 1 is different from the
location of population 2. The appropriate critical values at the 5% significance level are
1. 20 and 45
2. 22 and 43
3. 33 and 58
4. 35 and 56
5. 12 and 32
Question 4
Consider the following two independent samples:
Sample A: 16 17 19 22 47
Sample B: 27 31 34 37 40
The value of the test statistic for a left-tail Wilcoxon rank sum test is
1. 6
2. 20
3. 35
4. 55
5. 121
Question 5
Two observers are placed on two different observation points (randomly chosen) for a specified
period of time. They have to observe the drivers of the cars passing by and count the number of
them driving by while talking on a cell phone. Data given below was recorded at Point A for 6 days
and at Point B for 7 days. At the 0.10 level, can we conclude that the number of drivers talking on cell
phones at the two locations have the same median occurrence?
Point A 74 61 73 67 80 89
Point B 90 73 97 81 77 61 79
Question 6
A Wilcoxon rank sum test for comparing two populations involves two independent samples of sizes
15 and 20. The unstandardized test statistic (that is, the rank sum) is T = 210. The value of the
standardized test statistic z is
1. 14.0
2. 10.5
3. 6.0
4. 0.7
5. −2.0
The sign test is the nonparametric test to apply if you want to compare two samples forming matched
pairs of values, provided the data is ordinal and the populations are nonnormal. We say the two
samples are dependent. Typical of this is that one person is tested "before and after", or one person
is asked to make two different observations. Of course, this means that the size of the two dependent
samples will always be equal.
In ordinal data, numbers are often allocated to the different ranked categories, simply because it is
convenient. You were earlier told about a similar argument for nominal data where we could indicate
male ⇒ 1 and female ⇒ 0, because the '0' and '1' are easier to work with than the words 'female'
and 'male'. Please understand that if numbers are used for this purpose their placement on the
number line is not relevant. They are just symbols; other little symbols would maybe have
been less confusing, but then less convenient!
The sign test, true to its name, considers only the sign (positive or negative) of the difference
between the pair of observations, and the size of the difference is of no significance. Think of the
procedures to follow in these nonparametric tests as the rules of a game.
(3) Calculate the p-value and reject H0 if the p-value is less than α.
1. Calculate the difference between sample 1 and sample 2 for each of the n pairs. Differences
equal to 0 are eliminated, and the number of pairs, n, is reduced accordingly.
2. Rank the absolute values of the differences by assigning 1 to the smallest, 2 to the second
smallest, and so on. Tied observations are assigned the average of the ranks that would have
been assigned with no ties. The absolute value of −5 is denoted |−5| = 5 and the absolute value
of +5 is denoted |+5| = 5.
3. Calculate the rank sum for the negative differences and label this value $T^-$. Similarly, calculate
$T^+$, the rank sum for the positive differences.
4. Keller, the author of our prescribed book, uses $T = T^+$, but Mendenhall and Beaver use the
smaller of the two quantities $T^+$ and $T^-$ as the test statistic to test the hypothesis that the two
population locations are the same. In this module we use $T = T^+$.
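The formula to calculate the standardized test statistic is
$$Z = \frac{T - E(T)}{\sigma_T}$$
where the mean is
$$E(T) = \frac{n (n + 1)}{4}$$
and the standard deviation is
$$\sigma_T = \sqrt{\frac{n (n + 1)(2n + 1)}{24}}$$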
6. Decision rule
(1) Reject H0 if the test statistic Z is greater than the critical value; otherwise do not reject H0.
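The steps of the Wilcoxon Signed Rank Sum Test can be sketched in code as follows. This is an illustration only, not the prescribed book's implementation: the helper names are chosen here, the standard deviation is the usual $\sqrt{n(n+1)(2n+1)/24}$, and the example data are the matched pairs of Activity 5.3, Question 2.

```python
import math


def average_ranks(values):
    """Rank values from smallest to largest; tied values share the average rank."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def signed_rank_test(sample1, sample2):
    """Return (T+, T-, n, z) for matched pairs, using T = T+ as the test statistic."""
    # Step 1: differences, with zero differences eliminated.
    diffs = [a - b for a, b in zip(sample1, sample2) if a != b]
    n = len(diffs)
    # Step 2: rank the absolute differences, averaging the ranks of ties.
    ranks = average_ranks([abs(d) for d in diffs])
    # Step 3: rank sums of the positive and the negative differences.
    t_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    t_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    # Normal approximation with T = T+.
    expected = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return t_plus, t_minus, n, (t_plus - expected) / sigma


sample1 = [9, 12, 13, 8, 7, 10]
sample2 = [5, 10, 11, 9, 3, 9]
t_plus, t_minus, n, z = signed_rank_test(sample1, sample2)
print(t_plus, t_minus, n, round(z, 4))
```

A useful check is that $T^+ + T^-$ always equals $n(n+1)/2$, the sum of the ranks 1 to n.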
Activity 5.2
Question 1
It is important to sponsors of television shows that viewers remember as much as possible about
the commercials. The advertising executive of a large company is trying to decide which of two
commercials to use on a weekly half-hour sit-com. To help make a decision she decides to have 12
individuals watch both commercials. After each viewing, each respondent is given a quiz consisting
of 10 questions. The number of favourable responses is recorded and listed below. Assume that the
quiz results are not normally distributed.
Quiz Scores
Respondent Commercial 1 Commercial 2
1 7 9
2 8 9
3 6 6
4 10 10
5 5 4
6 7 9
7 5 7
8 4 5
9 6 8
10 7 9
11 5 6
12 8 10
(a) Which test is appropriate for this situation?
(b) Do these data provide enough evidence at the 5% significance level to conclude that the two
commercials differ?
Question 2
In a normal approximation to the sign test, the standardized test statistic is calculated as z = −1.58.
To test the alternative hypothesis that the location of population 1 is to the left of the location of
population 2, the p-value of the test is
1. 0.1142
2. 0.2215
3. 0.0571
4. 0.2284
5. 0.4429
Comparison between the Wilcoxon Signed Rank Sum Test and Sign Test
If the matched pairs of observations from the two dependent nonnormal populations are interval and
not ordinal, the signed rank sum test of Wilcoxon is the appropriate test to use. Think about this: the
requirements for the sign test and this signed rank sum test are the same except for the type of data.
For the Wilcoxon Signed Rank Sum Test the rules are as follows:
If n < 30, use Table 10 which lists a lower and upper cut-off value for one or two-tailed tests,
depending on four different significance levels and n = total number of nonzero differences.
If n ≥ 30, use the normal approximation as given in section 5.2(2).
Null hypothesis: the two population locations are the same.
Alternative hypothesis: the population locations are different (can be one-or two-sided).
Activity 5.3
Question 1
A matched pairs experiment produced the following statistics. Conduct a Wilcoxon Signed Rank Sum
Test to determine whether the location of population 1 is to the right of the location of population 2.
(Use α = 0.01.)
$T^+ = 3457$, $T^- = 2429$, $n = 108$
Question 2
Perform the Wilcoxon Signed Rank Sum Test for the following matched pairs to determine whether
the two population locations differ. (Use α = 0.10.)
Pair 1 2 3 4 5 6
Sample 1 9 12 13 8 7 10
Sample 2 5 10 11 9 3 9
Question 3
In a Wilcoxon Signed Rank Sum Test, the test statistic is calculated as T = 91. There are 18
observation pairs of which 3 have zero differences and a two-tailed test is performed at the 5%
significance level. Choose the correct option below:
Question 4
In a Wilcoxon Signed Rank Sum Test with n = 30, the rank sums of the positive and negative
differences are 198 and 165, respectively. The value of the standardized test statistic z is
1. 232.50
2. −0.7096
3. 2.8125
4. 48.6107
5. 0.6425
Feedback
Activity 5.1
Question 1
Option (4)
The data set is already ordered (we wanted to test something other than ordering):
14, 14, 15, 16, 18, 19, 19, 20, 21, 22, 23, 25, 25, 25, 25 and 28.
Position 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Data 14 14 15 16 18 19 19 20 21 22 23 25 25 25 25 28
Ranks 1.5 1.5 3 4 5 6.5 6.5 8 9 10 11 13.5 13.5 13.5 13.5 16
From the raw data:
Rank of 14: $\frac{1 + 2}{2} = 1.5$
Rank of 19: $\frac{6 + 7}{2} = 6.5$
Rank of 25: $\frac{12 + 13 + 14 + 15}{4} = \frac{54}{4} = 13.5$
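Assigning average ranks to tied observations is mechanical, so it lends itself to a short routine. The following is a minimal sketch (the function name is chosen here): it records the positions that each value occupies in the ordered data and gives every tied value the mean of those positions.

```python
def average_ranks(values):
    """Rank values from smallest to largest; tied values share the average rank."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]


# Data set from Activity 5.1, Question 1.
data = [14, 14, 15, 16, 18, 19, 19, 20, 21, 22, 23, 25, 25, 25, 25, 28]
ranks = average_ranks(data)
for value, rank in zip(data, ranks):
    print(value, rank)
```

Whatever the ties, the ranks of n observations always add up to n(n + 1)/2, which is 136 for these 16 values.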
Question 2
Option (1)
In the discussion about the sampling distribution of the Wilcoxon Rank Sum Test statistic it is stated
that T is approximately normally distributed whenever the sample sizes are larger than 10.
Question 3
Option (1)
n1 = 5 and n2 = 7. The values for n1 are listed in the first row and those for n2 in the first column.
The statement in the alternative hypothesis about the location of the populations being different does
not imply that the location of population 1 lies to the left or the right of population 2. It is a two-tailed
statement. The appropriate critical values at the 5% (two-tailed) significance level are 20 and 45:
Question 4
Option (2)
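Ranked data 16 17 19 22 27 31 34 37 40 47
Ranks 1 2 3 4 5 6 7 8 9 10
The observations of Sample A (16, 17, 19, 22 and 47) have ranks 1, 2, 3, 4 and 10, so the rank sum
of Sample A is T = 1 + 2 + 3 + 4 + 10 = 20.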
Question 5
(This is not a multiple choice question.)
Point A 74 61 73 67 80 89
Point B 90 73 97 81 77 61 79
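Position 1 2 3 4 5 6 7 8 9 10 11 12 13
Ranked data 61 61 67 73 73 74 77 79 80 81 89 90 97
Ranks 1.5 1.5 3 4.5 4.5 6 7 8 9 10 11 12 13
Rank of 61: $\frac{1 + 2}{2} = 1.5$
Rank of 73: $\frac{4 + 5}{2} = \frac{9}{2} = 4.5$
The rank sums are 6 + 1.5 + 4.5 + 3 + 9 + 11 = 35 for Point A and
12 + 4.5 + 13 + 10 + 7 + 1.5 + 8 = 56 for Point B.
Sample A has the smallest total, so the test statistic is equal to 35.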
If we are only testing for a "difference" in the data from the two points, it is a two-sided test. From
Table 6(b) the limits for n1 = 6 and n2 = 7 are 30 and 54. The test statistic of 35 falls between
these limits, so the null hypothesis cannot be rejected at the 10% level. We conclude that the median
number of persons talking on their cell phones while driving could be the same at points A and B .
Question 6
Option (5)
This answer is simply substitution into formulae.
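$T = 210$
$$E(T) = \frac{n_1 (n_1 + n_2 + 1)}{2} = \frac{15 (15 + 20 + 1)}{2} = 270$$
$$\sigma_T = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}} = \sqrt{\frac{15 \times 20 (15 + 20 + 1)}{12}} = \sqrt{900} = 30$$
$$z = \frac{T - E(T)}{\sigma_T} = \frac{210 - 270}{30} = -2.0$$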
Activity 5.2
Question 1
Quiz Scores
Respondent Commercial 1 Commercial 2 Difference
1 7 9 -2
2 8 9 -1
3 6 6 0
4 10 10 0
5 5 4 1
6 7 9 -2
7 5 7 -2
8 4 5 -1
9 6 8 -2
10 7 9 -2
11 5 6 -1
12 8 10 -2
(a) The appropriate test for this situation is the Sign Test.
(b) Do these data provide enough evidence at the 5% significance level to conclude that the two
commercials differ?
Two cells have zeros and are not counted for the sample size. Therefore n = 10 and x = 1 (only one
plus).
Conclusion: Reject the null hypothesis. Yes, these data provide enough evidence at the 5%
significance level to conclude that the two commercials differ.
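The calculation behind this conclusion can be sketched as follows. Two things here are assumptions rather than taken from the feedback: the p-value is computed exactly from the binomial distribution with p = 0.5, and the standardized statistic uses the usual normal approximation $z = (x - 0.5n)/(0.5\sqrt{n})$. The function name is illustrative.

```python
import math


def sign_test(sample1, sample2):
    """Return (n, x, z, two-sided exact p-value) for a sign test on matched pairs."""
    diffs = [a - b for a, b in zip(sample1, sample2) if a != b]  # drop zero differences
    n = len(diffs)
    x = sum(1 for d in diffs if d > 0)  # number of plus signs
    # Exact tail probability under Binomial(n, 0.5), doubled for a two-sided test.
    tail = min(x, n - x)
    p_one_sided = sum(math.comb(n, k) for k in range(tail + 1)) / 2 ** n
    p_two_sided = min(1.0, 2 * p_one_sided)
    # Normal approximation to the same test.
    z = (x - 0.5 * n) / (0.5 * math.sqrt(n))
    return n, x, z, p_two_sided


# Quiz scores from Activity 5.2, Question 1.
commercial1 = [7, 8, 6, 10, 5, 7, 5, 4, 6, 7, 5, 8]
commercial2 = [9, 9, 6, 10, 4, 9, 7, 5, 8, 9, 6, 10]
n, x, z, p_value = sign_test(commercial1, commercial2)
print(n, x, round(z, 4), round(p_value, 4))
```

With n = 10 and x = 1 the two-sided p-value is well below 0.05, which agrees with the decision to reject the null hypothesis.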
Question 2
The standardized test statistic is calculated as z = −1.58. The p-value should then be such that
p-value: $P(Z < -1.58) = P(Z > 1.58) = 0.0571$
Option (3)
Activity 5.3
Question 1
H0 : The two population locations are the same.
H1 : The location of population 1 is to the right of the location of population 2:
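The mean is
$$E(T) = \frac{n (n + 1)}{4} = \frac{108 \times 109}{4} = 2943$$
The standard deviation is
$$\sigma_T = \sqrt{\frac{n (n + 1)(2n + 1)}{24}} = \sqrt{\frac{108 (108 + 1)(2 \times 108 + 1)}{24}} = \sqrt{106438.5} = 326.25$$
With $T = T^+ = 3457$, the test statistic is
$$z = \frac{T - E(T)}{\sigma_T} = \frac{3457 - 2943}{326.25} = 1.58$$
and the p-value is $P(Z > 1.58) = 0.0571$.
Conclusion
There is not enough evidence to conclude that population 1 is located to the right of the location of
population 2: since p-value = 0.0571 > α = 0.01, we fail to reject H0 at α = 0.01.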
Question 2
H0 : The two population locations are the same.
H1 : The location of population 1 is different from the location of population 2:
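The test statistic is $z = \frac{T - E(T)}{\sigma_T}$, where $T = T^+ = 19.5$ and $n = 6$.
$$E(T) = \frac{n (n + 1)}{4} = \frac{6 \times 7}{4} = 10.5$$
$$\sigma_T = \sqrt{\frac{n (n + 1)(2n + 1)}{24}} = \sqrt{\frac{6 \times 7 \times 13}{24}} = \sqrt{22.75} = 4.7697$$
$$Z = \frac{19.5 - 10.5}{4.7697} = 1.8869$$
p-value $= 2 P(Z > 1.8869) \approx 2 P(Z > 1.89) = 2 (0.0294) = 0.0588$
Since the p-value of 0.0588 is less than α = 0.10, we reject H0 and conclude that the two population
locations differ.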
Question 3
Option (4)
The value of the test statistic is calculated as T = 91; therefore the test statistic lies inside the 'safe'
region of [40, 131] for a two-tailed test at the 5% significance level. The null hypothesis is therefore
not rejected (using Table 7).
Question 4
Option (2)
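$T = T^+ = 198$
$$E(T) = \frac{n (n + 1)}{4} = \frac{30 \times 31}{4} = 232.5$$
$$\sigma_T = \sqrt{\frac{n (n + 1)(2n + 1)}{24}} = \sqrt{\frac{30 (30 + 1)(61)}{24}} = \sqrt{2363.75} = 48.6184$$
$$z = \frac{T - E(T)}{\sigma_T} = \frac{198 - 232.5}{48.6184} = -0.7096$$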
Deciding which test to use is the task of the statistician in practice and it is our aim to supply you
with the tools to make such a decision. This is not always so straightforward. The significance of
data type is obvious, but note how the study objective gives direction. Even now, while you are still
studying, make a point of looking at published statistical information and determine if it involves "lying
with statistics" or not. Two tables made from the information in the above-mentioned summaries are
given below. Look at them, but try to make your own. Making such a summary is a very valuable
method of studying.
Basic principles of hypothesis testing apply
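1. Rank the combined data values as if they were from a single group. The smallest data value
gets rank 1, the next smallest rank 2, and so on.
In the event of a tie, each of the tied values gets their average rank.
2. Add the ranks for the data values from each of the k groups, obtaining $T_1, T_2, \ldots, T_k$.
The test statistic is then
$$H = \left[ \frac{12}{n (n + 1)} \sum_{j=1}^{k} \frac{T_j^2}{n_j} \right] - 3 (n + 1)$$
where $n_j$ is the size of sample j and $n = n_1 + n_2 + \cdots + n_k$.
Step 3
The rejection region
$$H > \chi^2_{\alpha, k-1}$$
The critical value of H is the chi-squared value $\chi^2_{\alpha, k-1}$, provided each sample size is at least 5.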
Activity 5.4
Question 1
Apply the Kruskal–Wallis test to determine if there is enough evidence at the 5% significance level to
infer that at least one of the population medians differs from the others.
Sample
1 2 3
23 25 25
22 27 22
25 17 19
20 19 21
18 20 26
Question 2
Conduct the Kruskal–Wallis Test on the following statistics. (Use α = 5%.)
T1 = 984 n1 = 23
T2 = 1502 n2 = 36
T3 = 1430 n3 = 29
Feedback
Activity 5.4
Question 1
Rank the combined data values as if they were from a single group.
Sample 1 Sample 2 Sample 3
Data Rank Data Rank Data Rank
23 10 25 12 25 12
22 8.5 27 15 22 8.5
25 12 17 1 19 3.5
20 5.5 19 3.5 21 7
18 2 20 5.5 26 14
Total T1 = 38 T2 = 37 T3 = 45
Step 1
The null and alternative hypotheses
H0 : The medians are equal (m1 = m2 = m3 ) :
H1 : At least one median differs from the others.
Step 2
The test statistic H
$$H = \left[ \frac{12}{n (n + 1)} \sum_{j=1}^{k} \frac{T_j^2}{n_j} \right] - 3 (n + 1)$$
$n = n_1 + n_2 + n_3 = 5 + 5 + 5 = 15$
$T_1 = 38$, $T_2 = 37$, $T_3 = 45$
$$H = \left[ \frac{12}{15 (15 + 1)} \left( \frac{(38)^2}{5} + \frac{(37)^2}{5} + \frac{(45)^2}{5} \right) \right] - 3 (15 + 1)$$
$$= \frac{12}{15 \times 16} (288.8 + 273.8 + 405) - 3 \times 16$$
$$= \frac{12}{240} (967.6) - 48$$
$$= 48.38 - 48$$
$$= 0.38$$
Step 3
The rejection region
$$H > \chi^2_{\alpha, k-1} = \chi^2_{0.05, 3-1} = \chi^2_{0.05, 2} = 5.99$$
Step 4
Making a decision
Since the test statistic H = 0.38 is less than the critical value (5.99), we do not reject the null hypothesis
H0. There is not enough evidence to infer that the population medians differ.
Question 2
Step 1
The null and alternative hypotheses
H0 : m1 = m2 = m3
H1 : At least two medians differ.
Step 2
The test statistic H
$$H = \left[ \frac{12}{n (n + 1)} \sum_{j=1}^{k} \frac{T_j^2}{n_j} \right] - 3 (n + 1)$$
$$= \left[ \frac{12}{88 (88 + 1)} \left( \frac{(984)^2}{23} + \frac{(1502)^2}{36} + \frac{(1430)^2}{29} \right) \right] - 3 (88 + 1)$$
$$= \frac{12}{7832} (175278.6578) - 267$$
$$= 268.5577 - 267$$
$$= 1.5577$$
Step 3
The rejection region
$$H > \chi^2_{\alpha, k-1} = \chi^2_{0.05, 3-1} = \chi^2_{0.05, 2} = 5.99$$
Step 4
Making a decision
Since the test statistic H = 1.5577 is less than the critical value (5.99), we do not reject the null
hypothesis H0 at the 5% significance level.
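The Kruskal–Wallis statistic depends only on the rank sums and the sample sizes, so both questions of Activity 5.4 can be checked with one small function. The sketch below uses the standard form of the statistic, with 3(n + 1) subtracted; the function name is chosen here, and the built-in check that the rank sums add up to n(n + 1)/2 is an extra safeguard, not part of the test itself.

```python
import math


def kruskal_wallis_h(rank_sums, sizes):
    """H = 12 / (n (n + 1)) * sum(T_j^2 / n_j) - 3 (n + 1)."""
    n = sum(sizes)
    # The ranks 1..n always add up to n (n + 1) / 2, whatever the ties.
    if not math.isclose(sum(rank_sums), n * (n + 1) / 2):
        raise ValueError("rank sums are inconsistent with the sample sizes")
    total = sum(t ** 2 / n_j for t, n_j in zip(rank_sums, sizes))
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


h_question1 = kruskal_wallis_h([38, 37, 45], [5, 5, 5])
h_question2 = kruskal_wallis_h([984, 1502, 1430], [23, 36, 29])
print(round(h_question1, 4), round(h_question2, 4))
```

Each value is then compared with the chi-squared critical value with k − 1 = 2 degrees of freedom, 5.99 at the 5% level.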
Step 2
The test statistic
$$F_r = \left[ \frac{12}{b k (k + 1)} \sum_{j=1}^{k} T_j^2 \right] - 3 b (k + 1)$$
where
b = number of blocks
k = number of treatments
$T_j$ = sum of the ranks for treatment j
Step 3
The rejection region
$$F_r > \chi^2_{\alpha, k-1}$$
where the critical value is taken from the chi-squared distribution with k − 1 degrees of freedom.
Step 4
The decision rule
Reject H0 if the test statistic $F_r$ is greater than the critical value $\chi^2_{\alpha, k-1}$; otherwise do not reject H0.
Activity 5.5
Question 1
The nonparametric counterpart to the randomized block design of the analysis of variance is
1. Kruskal–Wallis test
2. Friedman test
5. Sign test
Question 2
The maker of a stain remover is testing the effectiveness of four different formulations for a new
product. Six common types of stains were used as the blocks in the experiment. The data of ratings
are given below:
Stain–Remover Formulas
1 2 3 4
Creosote 2 7 3 6
Type Crayon 9 10 7 5
of Motor Oil 4 6 1 4
Stain Grape Juice 9 7 4 5
Ink 6 8 4 3
Coffee 9 4 2 6
The Friedman test is used to examine whether the stain remover formulas could be equally effective in
removing stains from this type of fabric (use α = 0.05).
Feedback
Activity 5.5
Question 1
Option (2)
Question 2
Rank the data within each type of stain (each block).
Formula 1 Formula 2 Formula 3 Formula 4
Block Data Rank Data Rank Data Rank Data Rank
Creosote 2 1 7 4 3 2 6 3
Crayon 9 3 10 4 7 2 5 1
Motor Oil 4 2.5 6 4 1 1 4 2.5
Grape Juice 9 4 7 3 4 1 5 2
Ink 6 3 8 4 4 2 3 1
Coffee 9 4 4 2 2 1 6 3
Total Rank T1 = 17.5 T2 = 21 T3 = 9 T4 = 12.5
b = 6, k = 4
Step 1
The null and alternative hypotheses
H0 : m1 = m2 = m3 = m4
H1 : The population medians are not all equal.
Step 2
The test statistic:
$$F_r = \left[ \frac{12}{b k (k + 1)} \sum_{j=1}^{k} T_j^2 \right] - 3 b (k + 1)$$
$$= \frac{12}{6 (4)(4 + 1)} \left[ (17.5)^2 + (21)^2 + (9)^2 + (12.5)^2 \right] - 3 (6)(4 + 1)$$
$$= \frac{12}{120} (984.5) - 90$$
$$= 98.45 - 90$$
$$= 8.45$$
Step 3
The rejection region is
$$F_r > \chi^2_{\alpha, k-1} = \chi^2_{0.05, 3} = 7.81$$
Step 4
Making a decision
Since the test statistic $F_r = 8.45$ is greater than the critical value (7.81), we reject H0 at the 5%
significance level.
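The Friedman calculation for the stain-remover ratings can be checked with a short script. This is a sketch under the assumptions stated in the comments: ranks are assigned within each block (type of stain), tied ratings share their average rank, and the function names are chosen here.

```python
def average_ranks(values):
    """Rank values from smallest to largest; tied values share the average rank."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def friedman_statistic(blocks):
    """Return (rank sums per treatment, F_r); each row of blocks is one block."""
    b = len(blocks)      # number of blocks
    k = len(blocks[0])   # number of treatments
    rank_sums = [0.0] * k
    for row in blocks:
        # Ranks run from 1 to k inside every block.
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    f_r = 12 / (b * k * (k + 1)) * sum(t ** 2 for t in rank_sums) - 3 * b * (k + 1)
    return rank_sums, f_r


# Ratings from Activity 5.5, Question 2 (rows = stains, columns = formulas 1 to 4).
ratings = [
    [2, 7, 3, 6],    # Creosote
    [9, 10, 7, 5],   # Crayon
    [4, 6, 1, 4],    # Motor Oil
    [9, 7, 4, 5],    # Grape Juice
    [6, 8, 4, 3],    # Ink
    [9, 4, 2, 6],    # Coffee
]
rank_sums, f_r = friedman_statistic(ratings)
print(rank_sums, round(f_r, 2))
```

As a check on the ranking, the rank sums must add up to b × k(k + 1)/2, which is 60 for 6 blocks and 4 treatments.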
STUDY UNIT 6
Time series analysis and forecasting
6.1 Introduction
Time series and forecasts based on time series are very relevant and significant in modern times. At
first year level, fortunately these concepts are simple and easy to explain. If you sit and think about
it, you can make a long list of events that you can observe at regular time intervals. If you drive to
work in a car or taxi or train, you can record the traffic every first day of the month, or every Friday,
or every day of the week, or...; if you have a favourite take-away food store you can record the length
of the queue at regular time intervals; an obvious example is to record the monthly rainfall at your
home. The list never ends, as government bodies, researchers, economists, etc. all record different
phenomena over short and long periods of time. These scores, collected at regular time intervals
are known as time series.
The question is – what do we do with the time series? Do you record the data simply to look at it, is
it just for the sake of fun, or what? As statisticians we are going to teach you how to look at, interpret
and even ’smooth’ the time series data, but is that the end of the process? That would have been
a sad day if everything stopped just there! The point is that what we observe as a pattern in the
past could well be repeated in the future. A technique has therefore been developed in which the
data of a time series and the characteristics of that particular phenomenon are used to predict
what can be expected in the future. Of course, statisticians are always very careful not to say that
anything is certain (think in terms of hypothesis testing!), so they use models in their predictions. We
will only look at three elementary models, but there are many other models, some of which are much
more complex. This unit covers time series components, smoothing techniques, trend and
seasonal effects, and forecasting.
Trend
Cyclical variation
Seasonal variation
Random variation
– A trend is an overall long-term upward or downward movement in a time series. The duration is
more than 1 year.
– The cyclical variation involves the up-and-down swings or movements through the series. The duration is
more than 1 year. This is often correlated with the business cycle.
– The seasonal variation refers to cycles that occur over short repetitive calendar periods. The
duration is less than 1 year.
– Random variation is caused by irregular and unpredictable changes in a time series. They
have monthly or quarterly data.
The first step in a time series analysis is to visualize the data and observe whether any patterns exist over
time. The second step is to determine whether there is a long-term upward or downward movement
in the series. This is to verify if there is a trend. If there is no obvious long-term upward or downward
trend, then you can use moving averages or exponential smoothing to smooth the series. If a trend
is present, you can consider several time series forecasting techniques.
Depending on the model, we can put these components together in different ways to represent the
time series. The model that we discuss is the multiplicative model. It states that any time series, Y,
consists of the product of the four components listed above (trend T, cyclical variation C, seasonal
variation S and random or irregular variation I):
$$Y = T \times C \times S \times I$$
In addition to a multiplicative model, we can also use an additive or a mixed model, as given below.
Additive model: $Y = T + C + S + I$
Mixed model: $Y = (T \times C \times S) + I$
The choice of which model to use is beyond the scope of the module.
Moving averages
Exponential smoothing
The first technique of smoothing is to determine moving averages. Remember that the data
points in a time series are consecutive values, i.e. they are ordered. The idea of an average
is nothing new and in this case you substitute the actual observations of a time series with a
list of averages. You can compute a three-period moving average, which is the average of three
consecutive observations or you can compute a four-period moving average, which is the average
of four consecutive observations, etc. Make sure that you understand how these three, or four, or
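... moving averages are calculated. In a three-period moving average each observation (except those
near the two ends of the series) is part of three averages.
Suppose we have real observations indicated as A, B, C, D, E, F and G, then the three, four and
five-period moving averages would be as follows:
Three-period: (A + B + C)/3, (B + C + D)/3, (C + D + E)/3, (D + E + F)/3, (E + F + G)/3
Four-period: (A + B + C + D)/4, (B + C + D + E)/4, (C + D + E + F)/4, (D + E + F + G)/4
Five-period: (A + B + C + D + E)/5, (B + C + D + E + F)/5, (C + D + E + F + G)/5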
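A moving average is easy to compute once the pattern is clear. The sketch below is an illustration only: the function name is chosen here, and the example series is the monthly sales data of Activity 6.1, Question 3.

```python
def moving_average(series, period):
    """Averages of every run of `period` consecutive observations."""
    return [
        sum(series[i:i + period]) / period
        for i in range(len(series) - period + 1)
    ]


sales = [73, 65, 72, 82, 86, 90]   # Jan to June
three_period = moving_average(sales, 3)
five_period = moving_average(sales, 5)
print(three_period)   # each value belongs to the middle month of its three
print(five_period)
```

A series of n observations yields n − period + 1 averages, so values are lost at both ends of the series. For an even period such as four, the averages fall between two time periods and are often centred afterwards; this sketch does not do that.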
Exponential smoothing is mathematically more complex, but still a 'relatively crude method' to remove
random variation. However, it removes two of the concerns mentioned above when the method of moving
averages is used for smoothing out random variation. These are the following:
With every calculation all the observations up to that particular observation form part of the
calculation, in other words give weight to the answer.
The smoothing process starts from the very first observation and continues up to the very last
observation.
The formula given may look a little complex, but with constant use it is manageable. Application of
the formula smooths values by calculating a weighted average of each observation in the series and
the previously already smoothed observation. The smoothing constant w is a number between 0 and
1 and seeing that w is multiplied by the actual observation yt (at time t), you should understand that
the closer w is to 1 the more influence the actual observation y will have. That is the sort of decision
the statistician has to make. Choosing the value of w will therefore depend on the importance of the
actual observations.
Considering the above, exponential smoothing is a series of exponentially weighted moving
averages. The weights assigned to the values change so that the most recent (the last) value
receives the highest weight, the previous value receives the second-highest weight, and so on.
Throughout the series, each smoothed value depends on all previous values, which is an advantage of
exponential smoothing over the method of moving averages.
$$S_t = w y_t + (1 - w) S_{t-1} \quad \text{for } t \geq 2$$
where
$S_1 = y_1$
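The recursive nature of the formula is easiest to see in code. The sketch below is a minimal illustration: the function name is chosen here, the smoothing constant w = 0.2 is an arbitrary value picked for the example, and the series is the monthly sales data of Activity 6.1, Question 3.

```python
def exponential_smoothing(series, w):
    """S_1 = y_1 and S_t = w * y_t + (1 - w) * S_(t-1) for t >= 2."""
    if not 0 <= w <= 1:
        raise ValueError("the smoothing constant w must lie between 0 and 1")
    smoothed = [series[0]]
    for y_t in series[1:]:
        smoothed.append(w * y_t + (1 - w) * smoothed[-1])
    return smoothed


sales = [73, 65, 72, 82, 86, 90]   # Jan to June
smoothed = exponential_smoothing(sales, 0.2)   # w = 0.2 is only an example value
print([round(s, 2) for s in smoothed])
```

Unlike a moving average, the smoothed series has a value for every time period. With w = 1 the smoothed series equals the original series, and with w = 0 every smoothed value equals the first observation.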
Keep in mind that you will receive a list of formulas in the examination. You simply have to recognize
which formula to use where and to know the meaning of the different symbols.
Activity 6.1
Question 1
Test your knowledge.
Link each of the descriptions below to one of the four time series components (long-term trend,
cyclic, seasonal or random variation):
1. The time series component that reflects a long-term, relatively smooth pattern or direction
exhibited by a time series over a long time period (more than one year)
2. The time series component that reflects variability over short repetitive time periods and has
duration of less than one year
3. The time series component that reflects the irregular changes in a time series that are not
caused by any other component, and tends to hide the existence of the other more predictable
components
4. The time series component that reflects a wave-like pattern describing a long-term trend that is
generally apparent over a number of years
Question 2
In exponentially smoothed time series, the smoothing constant w is chosen on the basis of how much
smoothing is required. In general, which of the following statements is true?
1. A small value of w such as w = 0.1 results in very little smoothing, while a large value such as
w = 0.8 results in too much smoothing.
2. A small value of w such as w = 0.1 results in too much smoothing, while a large value such as
w = 0.8 results in very little smoothing.
3. A small value of w such as w = 0.1 and a large value such as w = 0.8 may both result in very little
smoothing.
4. A small value of w such as w = 0.1 and a large value such as w = 0.8 may both result in too much
smoothing.
5. It is impossible to have too much or too little smoothing, regardless of the value of w.
Question 3
Monthly sales (in R11,000) of a computer store are shown below.
Month Jan Feb March April May June
Sales 73 65 72 82 86 90
1. Trend analysis
2. Seasonal analysis
Once you can see that there is a trend in a time series, you have to determine what the ’nature’ of the
trend is. This we do using mathematics. Do you remember the following from school mathematics?
At this stage you should know enough about fitting a regression line through given data
and about the principles involved in such a method. In time series analysis, such a fitted line
can help you to see whether there is a trend in the data.
The $\hat{y}$ then becomes the trend line estimate of the y of the regression model $y = \beta_0 + \beta_1 t + \varepsilon$. The
slope of the line indicates the trend. If the slope is positive, you know the trend is positive, and the
larger the numerical value of the slope the stronger the positive trend.
These arguments about a graph assisting us to find a trend in a time series also apply if the relationship is
nonlinear. Should a quadratic model be needed to fit the time series, the trend equation relies on the
multiple regression technique (not included in this module).
We estimate a linear trend model using regression techniques. Let $y_t$ be the value of the
response variable at time t. In this section we use t as the explanatory (independent) variable,
corresponding to consecutive time periods such as 1, 2, 3, and so on. The model is specified as
$$y_t = \beta_0 + \beta_1 t + \varepsilon_t$$
where
$y_t$ = the value of the series at time t.
The estimated model is used to make forecasts as
$$\hat{y}_t = b_0 + b_1 t$$
where
$b_0$ and $b_1$ are the coefficient estimates.
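As a minimal sketch of how $b_0$ and $b_1$ can be obtained, the snippet below applies the usual
least-squares formulas with t = 1, 2, ..., n as the explanatory variable. The data are the quarterly
earnings from Activity 6.2; the function name `fit_linear_trend` is our own and is used only for
illustration.

```python
def fit_linear_trend(y):
    """Least-squares estimates (b0, b1) for y_t = b0 + b1*t with t = 1..n."""
    n = len(y)
    t = range(1, n + 1)
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    # Slope: sum of cross-deviations divided by the sum of squared deviations of t
    s_ty = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    s_tt = sum((ti - t_bar) ** 2 for ti in t)
    b1 = s_ty / s_tt
    b0 = y_bar - b1 * t_bar
    return b0, b1


# Quarterly earnings (millions of rands), 2001 Q1 to 2004 Q4, from Activity 6.2
earnings = [52, 67, 85, 54, 57, 75, 90, 61, 60, 77, 94, 63, 66, 82, 98, 67]
b0, b1 = fit_linear_trend(earnings)
print(f"trend line: y_hat = {b0:.2f} + {b1:.2f} t")
```

The positive slope points to an upward trend, and the rounded coefficients agree with the regression
line quoted in Activity 6.2.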
An exponential trend model is used for a time series that is expected to grow by an increasing
amount each time period. It is specified as $\ln(y_t) = \beta_0 + \beta_1 t + \varepsilon_t$, where
$\ln(y_t)$ is the natural log of $y_t$. The estimated model is used to make forecasts as
$$\hat{y}_t = \exp\left(b_0 + b_1 t + \frac{s_e^2}{2}\right)$$
where $b_0$ and $b_1$ are the coefficient estimates and $s_e$ is the standard error of the estimate.
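The exponential trend forecast can be sketched in a few lines of Python. This is an illustration
under stated assumptions, not a prescribed procedure: we assume that $s_e$ is the standard error of
the estimate from the regression of $\ln(y_t)$ on t, computed with n - 2 degrees of freedom, and the
helper names `fit_line` and `exponential_trend_forecast` are our own. The data are the monthly sales
from Activity 6.1, Question 3.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


def exponential_trend_forecast(y, t_new):
    """Forecast y at period t_new from the fitted model ln(y_t) = b0 + b1*t."""
    n = len(y)
    t = list(range(1, n + 1))
    log_y = [math.log(v) for v in y]
    b0, b1 = fit_line(t, log_y)
    residuals = [ly - (b0 + b1 * ti) for ti, ly in zip(t, log_y)]
    # Assumption: s_e^2 = SSE / (n - 2), the squared standard error of the estimate
    s_e_squared = sum(r * r for r in residuals) / (n - 2)
    return math.exp(b0 + b1 * t_new + s_e_squared / 2)


sales = [73, 65, 72, 82, 86, 90]  # monthly sales, January to June
print(round(exponential_trend_forecast(sales, 7), 1))  # forecast for July
```

The term $s_e^2/2$ is a correction applied when the log-scale prediction is converted back to the
original units; without it the forecast would be slightly smaller.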
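A polynomial trend model of order q is specified as
$$y_t = \beta_0 + \beta_1 t + \beta_2 t^2 + \cdots + \beta_q t^q + \varepsilon_t$$
This model specializes to a linear trend model, a quadratic trend model and a cubic trend model for
q = 1, 2 and 3, respectively. The estimated model is used to make forecasts as
$$\hat{y}_t = b_0 + b_1 t + b_2 t^2 + \cdots + b_q t^q$$
where
$b_0, b_1, b_2, \ldots, b_q$ are the coefficient estimates.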
The quadratic model is
$$y_t = \beta_0 + \beta_1 t + \beta_2 t^2 + \varepsilon_t$$
The cubic model is
$$y_t = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 t^3 + \varepsilon_t$$
To detect seasonality in a time series, several ’seasons’ must be observed. Seasonal indexes can be
calculated and used to either inflate or deflate the trend in the series. Depending on the choice, an
index either expresses the degree to which the seasons differ from one another or is used to remove
the seasonal variation. The purpose of removing the seasonality is that other changes in the series
can then be detected. This has many benefits, especially in forecasting.
Activity 6.2
Question 1
The quarterly earnings (in millions of rands) of a large soft-drink manufacturer have been recorded
for the years 2001 to 2004. These data are listed here. Calculate the seasonal indexes given the
regression line
$$\hat{y} = 61.75 + 1.18t \quad (t = 1, 2, 3, \ldots, 16)$$
Year
Quarter 2001 2002 2003 2004
1 52 57 60 66
2 67 75 77 82
3 85 90 94 98
4 54 61 63 67
Question 2
The Pyramid of Giza is one of the most visited monuments in Egypt. The number of visitors per
quarter has been recorded (in thousands) as shown in the accompanying table:
Year
Quarter 2000 2001 2002 2003
Winter 210 215 218 220
Spring 260 275 282 290
Summer 480 490 505 525
Autumn 250 255 265 270
(a) Plot the time series.
(b) Discuss your observations. Would exponential smoothing be recommended for this data?
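The mean absolute deviation (MAD) criterion: From a given set of models or estimation equations fit
to the same time series data, the model or equation that best fits the time series is the one with the
lowest value of
$$MAD = \frac{\sum |y_t - \hat{y}_t|}{n}$$
where
$y_t$ = an observed value of y
$\hat{y}_t$ = the value of y that is predicted using the estimation equation or model
n = the number of time periods.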
The mean squared error (MSE) criterion: From a given set of models or estimation equations fit to
the same time series data, the model or equation that best fits the time series is the one with the
lowest value of
$$MSE = \frac{\sum (y_t - \hat{y}_t)^2}{n}$$
where
$y_t$ = an observed value of y
$\hat{y}_t$ = the value of y that is predicted using the estimation equation or model
n = the number of time periods.
When we compare the MAD and the MSE criterion, the MSE is to be preferred whenever the cost of
an error in estimation or forecasting increases in more than a direct proportion to the amount of the
error.
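The difference between the two criteria can be illustrated with a short Python sketch. The numbers
are hypothetical and chosen only to make the point: both sets of forecasts have the same total
absolute error, but one spreads it evenly while the other concentrates it in a single large miss.

```python
def mad(actual, forecast):
    """Mean absolute deviation of the forecast errors."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mse(actual, forecast):
    """Mean squared error of the forecast errors."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


# Hypothetical values, for illustration only
actual = [50, 50, 50, 50]
steady_misses = [48, 48, 48, 48]  # four errors of 2
one_big_miss = [50, 50, 50, 42]   # a single error of 8

print(mad(actual, steady_misses), mse(actual, steady_misses))  # 2.0 4.0
print(mad(actual, one_big_miss), mse(actual, one_big_miss))    # 2.0 16.0
```

MAD rates the two sets of forecasts as equally good, whereas MSE penalises the single large error
four times as heavily. This is why MSE is the better choice when large errors are
disproportionately costly.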
The sum of squares error (SSE) is
$$SSE = \sum (y_t - \hat{y}_t)^2$$
The selected model for forecasting a time series is determined by the components present in the
recorded time series. The choice of model is therefore based on measures of accuracy and precision.
In general, the way a particular smoothing method works gives you an indication of how its forecasts
will behave. If you think about the method applied in exponential smoothing, you can imagine that
for a time series with a small positive trend the forecast will tend to be too low, and for a time
series with a small negative trend the forecast will tend to be too high.
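The lag of exponential smoothing behind a trend can be checked with a small Python sketch. It
assumes the usual recurrence $S_1 = y_1$, $S_t = w y_t + (1 - w) S_{t-1}$ with forecast
$F_{t+1} = S_t$; the two series and the value w = 0.3 are hypothetical, and the function name
`one_step_forecasts` is our own.

```python
def one_step_forecasts(y, w):
    """Return F_2, ..., F_(n+1), where F_(t+1) = S_t.

    Assumed recurrence: S_1 = y_1 and S_t = w*y_t + (1 - w)*S_(t-1).
    """
    s = y[0]
    forecasts = [s]          # F_2 = S_1
    for value in y[1:]:
        s = w * value + (1 - w) * s
        forecasts.append(s)  # F_(t+1) = S_t
    return forecasts


rising = [100, 102, 104, 106, 108, 110]  # hypothetical small positive trend
falling = rising[::-1]                   # hypothetical small negative trend

for name, series in (("rising", rising), ("falling", falling)):
    f = one_step_forecasts(series, w=0.3)
    # Error = actual value at t+1 minus the forecast made for t+1
    errors = [actual - fc for fc, actual in zip(f, series[1:])]
    print(name, [round(e, 2) for e in errors])
```

For the rising series every error is positive (the forecasts are too low), and for the falling
series every error is negative (the forecasts are too high).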
A proper analysis of the given data must underlie the choice and you have to realize that one should
not try to forecast too far in the future as the accuracy decreases with each additional time frame
added.
At first-year level we only introduce you to forecasting and expect you to understand three relatively
elementary forecasting models: exponential smoothing, seasonal indexes and the autoregressive
model. The exponential smoothing and seasonal models will be easy for you to understand.
Forecasting model: Exponential smoothing
   Conditions:  No trend; no exponential smoothing; no seasonal variation
   Forecasting: Preferably used for one time period forecast, but can be more
   Action:      Smoothing constant; assume initial forecast; substitute St with Ft+1

Forecasting model: Seasonal indexes
   Conditions:  Long-term trend; seasonal variation
   Forecasting: Preferably one season, but can be more
   Action:      Regression equation is used as well as seasonal index for period t

Forecasting model: Autoregressive model
   Conditions:  Autocorrelation; no trend; no seasonality
   Forecasting: Can be complex if the time series values are themselves correlated
   Action:      Based on correlation of consecutive terms (first order autocorrelation)
Activity 6.3
Question 1
Calculate MAD and SSE for the forecasts that follow.
Period                  1    2    3    4    5
Forecast $\hat{y}_t$    63   72   86   71   60
Actual $y_t$            57   60   70   75   70
Question 2
The following is the list of mean absolute deviation (MAD) statistics for each of the models you have
estimated from time-series data:
Model               MAD
Linear trend        1.38
Quadratic trend     1.22
Exponential trend   1.39
Autoregressive      0.71
Based on the MAD criterion, the most appropriate model is
1. linear trend
2. quadratic trend
3. exponential trend
4. autoregressive
Feedback
Activity 6.1
Question 1
1. long-term trend
2. seasonal variation
3. random variation
4. cyclical variation
Question 2
Option (2)
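The effect of the smoothing constant can be seen in a short Python sketch. It assumes the usual
recurrence $S_1 = y_1$, $S_t = w y_t + (1 - w) S_{t-1}$ and applies it to the monthly sales from
Question 3. The `roughness` measure (total absolute change from one period to the next) is our own
illustrative choice, not a statistic used in this module.

```python
def exponential_smoothing(y, w):
    """Assumed recurrence: S_1 = y_1, S_t = w*y_t + (1 - w)*S_(t-1)."""
    smoothed = [y[0]]
    for value in y[1:]:
        smoothed.append(w * value + (1 - w) * smoothed[-1])
    return smoothed


def roughness(series):
    """Total absolute period-to-period change; smaller means smoother."""
    return sum(abs(b - a) for a, b in zip(series, series[1:]))


sales = [73, 65, 72, 82, 86, 90]  # monthly sales from Question 3
print(f"raw series: roughness {roughness(sales)}")
for w in (0.1, 0.8):
    s = exponential_smoothing(sales, w)
    print(f"w = {w}: roughness {roughness(s):.2f}")
```

With w = 0.1 the smoothed series barely moves, while with w = 0.8 it stays close to the original
observations, which is consistent with option (2).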
Question 3
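Month    Sales   Three-month moving average   Five-month moving average
Jan      73
Feb      65      70
March    72      73                           75.6
April    82      80                           79.0
May      86      86
June     90

The three-month moving averages
Feb: $\frac{73 + 65 + 72}{3} = \frac{210}{3} = 70$
March: $\frac{65 + 72 + 82}{3} = \frac{219}{3} = 73$
April: $\frac{72 + 82 + 86}{3} = \frac{240}{3} = 80$
May: $\frac{82 + 86 + 90}{3} = \frac{258}{3} = 86$

The five-month moving averages
March: $\frac{73 + 65 + 72 + 82 + 86}{5} = \frac{378}{5} = 75.6$
April: $\frac{65 + 72 + 82 + 86 + 90}{5} = \frac{395}{5} = 79.0$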
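The same moving averages can be reproduced with a few lines of Python. This is a minimal sketch for
odd window lengths only; the function name `moving_average` is our own.

```python
def moving_average(values, window):
    """Centred moving averages for an odd window length."""
    half = window // 2
    return [
        sum(values[i - half:i + half + 1]) / window
        for i in range(half, len(values) - half)
    ]


sales = [73, 65, 72, 82, 86, 90]  # January to June
print(moving_average(sales, 3))   # [70.0, 73.0, 80.0, 86.0]  (Feb to May)
print(moving_average(sales, 5))   # [75.6, 79.0]              (March, April)
```

Note that a longer window leaves fewer averages, because the first and last periods do not have
enough neighbours on both sides.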
Activity 6.2
Question 1
Year   Quarter   Period t   y    $\hat{y}$   $y/\hat{y}$
2001   1         1          52   62.9        0.827
       2         2          67   64.1        1.046
       3         3          85   65.2        1.303
       4         4          54   66.4        0.813
2002   1         5          57   67.6        0.843
       2         6          75   68.8        1.090
       3         7          90   70.0        1.286
       4         8          61   71.1        0.857
2003   1         9          60   72.3        0.830
       2         10         77   73.5        1.048
       3         11         94   74.7        1.259
       4         12         63   75.9        0.830
2004   1         13         66   77.0        0.857
       2         14         82   78.2        1.048
       3         15         98   79.4        1.234
       4         16         67   80.6        0.831
Year             Quarter 1   Quarter 2   Quarter 3   Quarter 4   Total
2001             0.827       1.046       1.303       0.813
2002             0.843       1.090       1.286       0.857
2003             0.830       1.048       1.259       0.830
2004             0.857       1.048       1.234       0.831
Average          0.839       1.058       1.271       0.833       4.001
Seasonal index   0.839       1.058       1.270       0.833       4.000
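The steps behind these tables can be sketched in Python: divide each observation by its trend value,
average the ratios for each quarter, and rescale the averages so that they add up to 4. The function
name `seasonal_indexes` is our own. Because the sketch uses unrounded trend values, individual
ratios may differ from the tables in the third decimal place.

```python
def seasonal_indexes(y, b0, b1, seasons=4):
    """Seasonal indexes from the ratios y_t / y_hat_t, with y_hat_t = b0 + b1*t."""
    ratios = [yi / (b0 + b1 * t) for t, yi in enumerate(y, start=1)]
    # Average the ratios belonging to the same season (every `seasons`-th value)
    averages = []
    for s in range(seasons):
        season_ratios = ratios[s::seasons]
        averages.append(sum(season_ratios) / len(season_ratios))
    # Rescale so that the indexes sum to the number of seasons
    scale = seasons / sum(averages)
    return [a * scale for a in averages]


# Quarterly earnings (millions of rands), 2001 Q1 to 2004 Q4
earnings = [52, 67, 85, 54, 57, 75, 90, 61, 60, 77, 94, 63, 66, 82, 98, 67]
indexes = seasonal_indexes(earnings, b0=61.75, b1=1.18)
print([round(i, 3) for i in indexes])
```

An index above 1 (quarters 2 and 3) means that earnings in that quarter are typically above the
trend line, and an index below 1 (quarters 1 and 4) means that they are typically below it.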
Question 2
(a)
[Figure: time series plot of the quarterly number of visitors, 2000 to 2003. Vertical axis: Number of
Visitors (0 to 600); horizontal axis: Year.]
We note a distinct pattern of seasonal variation in the series. This could have been detected in
the data, but in the graph one can see it without even thinking!
(b) Exponential smoothing is a method that removes the random variation in a time series, which
makes it easier to detect the trend. In the further discussions you will see that exponential
smoothing is not an accurate forecasting method if the time series has clear seasonal effects.
Activity 6.3
Question 1
Period                  1    2    3    4    5
Forecast $\hat{y}_t$    63   72   86   71   60
Actual $y_t$            57   60   70   75   70
$$MAD = \frac{\sum |y_t - \hat{y}_t|}{n}$$
$$MAD = \frac{|57 - 63| + |60 - 72| + |70 - 86| + |75 - 71| + |70 - 60|}{5}
      = \frac{6 + 12 + 16 + 4 + 10}{5} = \frac{48}{5} = 9.6$$
$$SSE = \sum (y_t - \hat{y}_t)^2$$
$$SSE = (57 - 63)^2 + (60 - 72)^2 + (70 - 86)^2 + (75 - 71)^2 + (70 - 60)^2 = 552$$
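These hand calculations can be verified with a short Python check. The MSE is included only for
comparison; it was not asked for in the question.

```python
forecast = [63, 72, 86, 71, 60]
actual = [57, 60, 70, 75, 70]

# Forecast error for each period: actual minus forecast
errors = [a - f for a, f in zip(actual, forecast)]

mad = sum(abs(e) for e in errors) / len(errors)
sse = sum(e ** 2 for e in errors)
mse = sse / len(errors)

print(mad, sse, mse)  # 9.6 552 110.4
```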
Question 2
Option (4)
Learning Outcomes
After you have completed this study unit, use the chapter summary as a checklist to evaluate whether
you have really acquired a good understanding of the work covered.
Can you
list and understand principles involved in the general procedures when applying chi-squared
testing?
apply your knowledge of the chi-square test, for nominal scale variables, to describe a single
population and/or to determine the relationship between two populations?
apply non-parametric statistical tests?
employ the Wilcoxon rank sum test, the sign test and the Wilcoxon signed rank sum test to
compare two populations of ordinal data?
apply the Friedman F-test and the Kruskal–Wallis H-test?
analyse the relationship between two interval variables using simple linear regression?
explain and decompose the components of a time series?
explain how trend and seasonal variation are measured?
describe exponential smoothing, seasonal indexes and the autoregressive model for forecasting
in time series?