Stat2012 Notes Study Guide
University of Witwatersrand
STAT2012
An Introduction to
Mathematical Statistics
Contents
1 What is Statistics?
1.1 Introduction
1.2 Population and Samples
1.3 Types of data
1.4 Discrete and Continuous Variables
1.5 Uses of Statistics
1.6 Conclusion
2 Descriptive Statistics
2.1 Graphical techniques
2.1.1 Tabulation
2.1.2 Pie Chart
2.1.3 Bar Chart
2.1.4 Multiple Bar Graph
2.1.5 Component Bar Graph
2.1.6 Percentage Component Graph
2.1.7 Line graphs
2.1.8 Frequency Distribution
2.1.9 Histogram
2.1.10 Cumulative Frequency Curve
2.2 Summary Statistics
2.2.1 Measures of location
2.2.2 Measures of dispersion
2.2.3 Box and whisker plot
2.3 Conclusion
3 Probability
3.1 Assigning probabilities
3.2 Probability of events
3.3 Addition of probabilities
3.4 Conditional probability
3.5 Independent events
3.6 Relative frequency approach to probability
Preface
Whether you are doing this course by choice, as a filler, or because you are forced
to, the aim of this book is to introduce you to Mathematical Statistics and to help
you appreciate its applications. The notes will attempt to clarify the links between
the techniques taught and the interpretation of the results obtained. It will not be
assumed that you are familiar with any statistical techniques at the outset of this course.
The best way for you to study this book is to work in a systematic manner through
all the chapters. After understanding a chapter, you should be able to answer the
questions at the end of that chapter before moving on to the next chapter.
These notes have been revised from the notes compiled by Charles Chimedza and
Nothabo Ndebele from the University of Witwatersrand who previously taught this
course in 2017. The chapter on Probability has been adapted from the Advanced
Level Mathematics Statistics 1 written by Steve Dobbs and Jane Miller, published
by Cambridge University Press 2002.
Chapter 1
What is Statistics?
1.1 Introduction
What is statistics? It is not primarily the adding up of numbers to reach the
conclusion that there are 33 346 registered students at Wits and 30% of them are science
students in the year 2014, or that the demographic profile of first year students
consists of more females (54%) than males (46%). Statistics, rather, can be described
as the science of decision making in the face of uncertainty. The emphasis here is
placed not so much on the collection of data as on drawing conclusions from
the data.
Inferential statistics, on the other hand, makes inferences and predictions about a
population based on a sample collected. The estimation of parameters and the testing
of statistical hypotheses are the primary methods of inferential statistics.
• Nominal,
• Ordinal,
• Interval and
• Ratio.
Nominal
This is the weakest of the four measurement scales of data. It distinguishes one
object or event from another on the basis of a 'name'. An example of this is
classifying items coming off an assembly line as defective or non-defective, or classifying
a bank account as open or closed. The 'naming' can be coded, e.g. if the bank
account is open, then use the value 1, and if it is closed use the value 2. Data of this type
are typically referred to as count data, frequency data or categorical data.
Ordinal
Objects or events are distinguished on the basis of the relative amounts of some
characteristic they possess. These measurements enable observations to be ranked.
An example is ranking different sized jerseys from smallest to largest by assigning
the smallest a rank of 1 and increasing the rank by 1 up to the largest jersey with
a rank of 4, say. Note that the magnitude of the difference between measurements
is not reflected in the rank.
Interval Scale
This scale is applied when objects, or events can be distinguished one from another
and ranked, and when the differences between measurements have meaning. Suppose
that four objects A, B, C and D are assigned scores of 20, 30, 60 and 70 respectively.
If the interval scale is used then we can say that the difference between A and B
is equal to the difference between C and D, i.e. there are equal differences in the
amount of the trait or characteristic being measured. However, the ratios of the scores
cannot be used. The score of 60 for C does not mean that C has twice as much of
the trait as B, which has a score of 30. The values 20, 30, 60 and 70 are scores assigned,
and not measurements as such.
Ratio scale
This kind of scale has all the properties of the scales above and has the additional
property that the ratios are meaningful. This scale includes the familiar measurements
of height, weight, etc., that is, quantitative data. Both the differences in magnitude
and the ratios can be used for analysis, as they have meaning attached to them.
Exercise 1.1
1. The banks in South Africa are assigned positions according to their reported
profits. The bank with the highest profit is given position 1, the bank with
the second highest profit is given position 2, etc. What type of data is this?
2. If the actual profit for each bank is recorded, what type of data would it be?
A continuous variable can take any value between two given values. An example
is the height of a student, which could be 158.7cm, 164.2cm or 168.9cm.
The values lie on a continuous scale. The type of data which results from this
measurement is continuous data.
1.6 Conclusion
Statistics plays an important role in almost every field as it helps in decision making.
Before any decision can be made about any business or society or problem it is often
necessary to gather enough data or information to support a decision being made.
Statistics helps in the collection of information in a scientific and systematic manner,
and in making decisions based on descriptive and inferential statistics.
Chapter 2
Descriptive Statistics
This chapter on descriptive statistics is split into two parts: graphical techniques,
which include tables and graphs, and summary statistics, which involve numerical
calculations.
2.1.1 Tabulation
Data is typically presented in a tabular form. However, the data can be summarised
in a way that is simpler and easier to understand and to analyse further. Suppose data is
recorded on the gender of each lecturer in a school, and the results are presented in
the following way:
This is especially useful if the number of observations is large and the distinct
categories are few. This leads to the idea of contingency tables. A contingency
table is a convenient way of summarising data with more than one variable. It
consists of row(s) and column(s) of data that represent the variables. Suppose
more information on the lecturers is recorded, as in the table below.
The contingency table in this example is a 3 x 2 table as there are 3 rows and
2 columns. The number of rows and columns is determined by the number of
categories in each variable. For gender, there are 2 and for school, there are 3.
Contingency tables are sometimes referred to as cross tabulations and are used for
data that has two variables.
Table 2.4 lists the 91 staff members working at a company, tabulated
by their qualification. Each category of qualification can be expressed as a
proportion and a percentage. The proportion is calculated as the number in each
category divided by the total number of staff members. For example, Engineering
qualifications give a proportion of 38/91, Science 24/91, and so on. The percentage
is the proportion multiplied by 100.
Table 2.4: Table of staff members working in a certain company tabulated by their
qualification.
A pie chart is constructed by using the proportions in Table 2.4. As the pie is a
circle, the calculation of the angle in each category is the proportion x 360◦ . The
pie chart is shown in Figure 2.1, with the angles calculated in Table 2.5.
Qualification   Angle (in degrees)
Engineering     38/91 x 360 = 150.3
Science         24/91 x 360 = 94.9
Arts            13/91 x 360 = 51.4
Commerce        8/91 x 360 = 31.6
Medicine        5/91 x 360 = 19.8
Other           3/91 x 360 = 11.9
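The proportion, percentage and angle calculations can be checked with a short Python sketch. The counts are the numerators read off the angle calculations above; the names `counts` and `pie_angle` are illustrative and not part of the notes.

```python
# Frequencies read off the angle calculations (they total 91 staff members).
counts = {
    "Engineering": 38, "Science": 24, "Arts": 13,
    "Commerce": 8, "Medicine": 5, "Other": 3,
}
total = sum(counts.values())


def pie_angle(count, total):
    """Angle in degrees of a pie slice: proportion x 360."""
    return count / total * 360


angles = {name: pie_angle(c, total) for name, c in counts.items()}

# Print proportion, percentage and angle for each qualification.
for name, c in counts.items():
    print(f"{name:<12} {c / total:6.3f} {100 * c / total:6.1f}% {angles[name]:6.1f}")
```

Because the proportions add up to 1, the angles add up to 360 degrees, which is a quick check on the arithmetic.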
Consider the same example in Table 2.4. A bar chart that uses the frequencies to
construct the bars is given in Figure 2.2.
Figure 2.3: Multiple bar graph of number of lecturers by school, split by gender.
Figure 2.4: Multiple bar graph of number of lecturers by gender, split by school.
Table 2.7: Calculating the percentage of males and females in each school
Consider data of average temperatures (in °C) in Cape Town in each month of
a specific year. To plot a line graph, use the variable Month on the x-axis
and the Average Temperature on the y-axis of an x-y plot. Join the points to form
a line to produce a line graph, like Figure 2.7. The line graph is useful in detecting
trends or patterns over time.
Month Jan Feb Mar Apr May June July Aug Sep Oct Nov Dec
Average Temp 22 23 21 18 16 13 13 13 14 16 18 20
Figure 2.7: Line graph of average temperatures during a year in Cape Town
6 4 7 10
5 6 7 8
7 8 8 9
7 5 6 6
9 4 7 8
Mark Frequency
4 2
5 2
6 4
7 5
8 4
9 2
10 1
Total 20
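As a minimal sketch, the same frequency table can be produced by counting the 20 marks listed above with Python's standard library. The variable names are illustrative.

```python
from collections import Counter

# The 20 marks as listed above, read row by row.
marks = [6, 4, 7, 10, 5, 6, 7, 8, 7, 8, 8, 9, 7, 5, 6, 6, 9, 4, 7, 8]

# Counter tallies how many times each distinct mark occurs.
frequency = Counter(marks)

for mark in sorted(frequency):
    print(mark, frequency[mark])
print("Total", sum(frequency.values()))
```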
Sometimes the frequency distribution is not as simple to construct as the one above.
Observations are not always easy to group as there may be too many unique values.
Consider the following example: the number of calls from motorists per day for
roadside service in a certain month.
To construct a frequency distribution for this data, the following can be done:
2. Calculate the range of the data, i.e. the difference between the maximum and
minimum.
Range = 217 - 28 = 189.
3. Calculate k = 1 + 3.22 x log10 (n) to find the number of classes to have in the
frequency table.
k = 1 + 3.22 x log10 (31) ≈ 6.
5. Determine the lower end of the first class making sure the smallest value is
equal to or more than the lower end.
Lower end = 28.
6. Determine the frequencies for each class by counting the number of observa-
tions falling in each class.
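The two numerical steps shown above (the range and the number of classes) can be sketched in Python as follows. The values 217, 28 and n = 31 are taken from the steps; the constant 3.22 is used exactly as it appears in these notes, although this rule (often called Sturges' rule) is commonly quoted with 3.322. Both constants round to 6 classes here.

```python
from math import log10

n = 31                       # number of observations (days in the month)
maximum, minimum = 217, 28   # largest and smallest number of calls

# Step 2: range of the data.
data_range = maximum - minimum

# Step 3: number of classes, using the constant as given in the notes.
k = 1 + 3.22 * log10(n)
number_of_classes = round(k)

print(data_range, round(k, 2), number_of_classes)
```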
The class limits are defined as the starting and ending points of each class. The
class boundary is the average of the upper limit of the current class and the lower
limit of the next class. The class midpoint is the average of the lower and upper
limits of the current class.
2.1.9 Histogram
A histogram is a picture of a frequency distribution. It is used to represent
continuous quantitative data. It usually consists of adjacent rectangles that are not
separated. The area of each rectangle is drawn in proportion to the frequency
corresponding to that frequency class. When the class intervals are equal, the area of
each rectangle is a constant multiple of the height and the histogram can be drawn
like a bar chart, except that the bars are not separated. It is important to note that the
class intervals need not be equal.
One cannot use imaginary limits to construct a histogram. Real limits will need
to be constructed in this case as shown in the table below. The real limits are used
to construct a histogram shown in Figure 2.8.
Shipping time   Frequency   Lower limit   Upper limit   Lower real limit     Upper real limit
10 - 19         7           10            19            (10+9)/2 = 9.5       (19+20)/2 = 19.5
20 - 29         20          20            29            (20+19)/2 = 19.5     (29+30)/2 = 29.5
30 - 39         9           30            39            (30+29)/2 = 29.5     (39+40)/2 = 39.5
40 - 49         3           40            49            (40+39)/2 = 39.5     (49+50)/2 = 49.5
50 - 59         5           50            59            (50+49)/2 = 49.5     (59+60)/2 = 59.5
60 - 69         1           60            69            (60+59)/2 = 59.5     (69+70)/2 = 69.5
Table 2.11: Computing the real limits of the shipping time frequencies
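The real limits in Table 2.11 can be reproduced with a short Python sketch. It assumes, as in the table, that adjacent classes are separated by the same gap (here 20 - 19 = 1), so that averaging neighbouring limits is the same as moving half a gap outwards. The function name `real_limits` is illustrative.

```python
# Class limits of the shipping time data, as (lower limit, upper limit).
classes = [(10, 19), (20, 29), (30, 39), (40, 49), (50, 59), (60, 69)]

# Gap between the upper limit of one class and the lower limit of the next.
gap = classes[1][0] - classes[0][1]


def real_limits(lower, upper, gap):
    """Move each limit half a gap outwards to obtain the real limits."""
    return lower - gap / 2, upper + gap / 2


boundaries = [real_limits(lo, hi, gap) for lo, hi in classes]

for (lo, hi), (real_lo, real_hi) in zip(classes, boundaries):
    print(f"{lo} - {hi}: {real_lo} to {real_hi}")
```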
Using the same example in Table 2.10, the values to plot in the cumulative
frequency plot can be found in two different ways: the less than cumulative frequency,
which accumulates the frequencies starting from the lowest class, and the greater
than cumulative frequency, which accumulates the frequencies starting from the highest
class. The computations for these curves are shown in Tables 2.12 and 2.13, respectively.
The two curves can be drawn on the same plot, and the value on the horizontal axis
at which the curves intersect is the median. The cumulative frequency curves are shown in
Figure 2.9.
Shipping time Frequency Lower real limit Upper real limit < cumulative frequency
10 -19 7 9.5 19.5 7
20 -29 20 19.5 29.5 27
30 - 39 9 29.5 39.5 36
40 - 49 3 39.5 49.5 39
50 - 59 5 49.5 59.5 44
60 - 69 1 59.5 69.5 45
Shipping time Frequency Lower real limit Upper real limit > cumulative frequency
10 -19 7 9.5 19.5 45
20 -29 20 19.5 29.5 38
30 - 39 9 29.5 39.5 18
40 - 49 3 39.5 49.5 9
50 - 59 5 49.5 59.5 6
60 - 69 1 59.5 69.5 1
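Both cumulative frequency columns can be generated from the class frequencies with a running total. The following is a minimal Python sketch using the shipping time frequencies above; the variable names are illustrative.

```python
from itertools import accumulate

# Class frequencies of the shipping time data, lowest class first.
frequencies = [7, 20, 9, 3, 5, 1]

# "Less than" cumulative frequency: running total from the lowest class.
less_than = list(accumulate(frequencies))

# "Greater than" cumulative frequency: running total from the highest class,
# reversed back so that it lines up with the class order.
greater_than = list(accumulate(reversed(frequencies)))[::-1]

print(less_than)
print(greater_than)
```

The less than values are plotted against the upper real limits and the greater than values against the lower real limits.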
Exercise 2.1
1. What is the difference between a histogram and a bar graph?
(a) Classify these ages into four classes, A: ages below 22.5, B: ages between
22.5 and 25 inclusive, C: ages between 25 and 27.5 inclusive, and D: above
27.5, to create a frequency table.
(b) Construct a pie chart from these classes.
(c) Construct a bar chart from the data.
Classes Frequency
5-9 1
10-14 9
15-19 20
20-24 12
25-29 5
5. The monthly sales (in millions of $) of a large business are given as:
6. A company has been selling two types of cars A and B from 1992 to 1998.
The number of sales obtained (in billions of $) is given as:
Year A B
1992 134 119
1993 126 96
1994 198 182
1995 144 98
1996 164 78
1997 200 197
1998 213 187
(a) Draw the line graphs of the sales of the two data sets on the same plot,
and comment on the trend of the lines.
for data that is not grouped like frequency data. If the data is grouped, with $x_i$
occurring $f_i$ times and a total of $n$ observations, then
\[
\text{mean} = \bar{x} = \frac{1}{n}\sum_{i=1}^{k} f_i x_i, \qquad n = \sum_{i=1}^{k} f_i \tag{2.2}
\]
Example 2.1
Calculate the mean age of the 45 people that attended a cultural movie on a specific
day.
7 9 11 12 12 12 13 13 14 14
14 14 15 15 15 16 17 18 18 19
19 19 20 20 20 21 22 22 22 23
24 24 25 26 28 29 31 31 32 34
38 39 39 16 25
Example 2.2
Calculate the mean of the following grouped data of average income per hour on
2000 participants in a survey:
The mean is
\[
\bar{x} = \frac{1}{1844}\sum_{i=1}^{k} f_i x_i = \frac{0 \times 1235 + 25.5 \times 459 + 75.5 \times 121 + 150.5 \times 29}{1844} = 13.67
\]
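The grouped mean above can be verified with a few lines of Python. The class values and frequencies are the ones appearing in the calculation; the variable names are illustrative.

```python
# Class values x_i and frequencies f_i as they appear in the calculation.
midpoints = [0, 25.5, 75.5, 150.5]
frequencies = [1235, 459, 121, 29]

# n is the sum of the frequencies.
n = sum(frequencies)

# Grouped mean: sum of f_i * x_i divided by n.
mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n

print(n, round(mean, 2))
```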
Median
Suppose all observations are sorted into numerical order from lowest to
highest. The median is the middle value in the sorted list. Half of all the
observations will be greater than the median and the other half will be less than the median.
The median is also known as the 50th percentile or the second quartile.
The median $\tilde{x}$ is calculated by first sorting the data in ascending order to get a new
data array $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$, and then finding the central value. If $n$ is odd, then
the median is the $\frac{n+1}{2}$th value,
\[
\tilde{x} = x_{\left(\frac{n+1}{2}\right)}, \tag{2.3}
\]
Example 2.3
Calculate median of the data in Example 2.1.
Sort the data:
7 9 11 12 12 12 13 13 14 14
14 14 15 15 15 16 16 17 18 18
19 19 19 20 20 20 21 22 22 22
23 24 24 25 25 26 28 29 31 31
32 34 38 39 39
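Since n = 45 is odd, the median is the value in position (45 + 1)/2 = 23 of the sorted list. A minimal Python sketch of this lookup, cross-checked against the standard library, is given below; the variable names are illustrative.

```python
from statistics import median

# The sorted ages from the example above.
ages = [
    7, 9, 11, 12, 12, 12, 13, 13, 14, 14,
    14, 14, 15, 15, 15, 16, 16, 17, 18, 18,
    19, 19, 19, 20, 20, 20, 21, 22, 22, 22,
    23, 24, 24, 25, 25, 26, 28, 29, 31, 31,
    32, 34, 38, 39, 39,
]

n = len(ages)

# Position (n + 1) / 2 in 1-based counting; subtract 1 for Python's 0-based index.
position = (n + 1) // 2
middle = ages[position - 1]

print(n, position, middle, median(ages))
```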
\[
\text{Median} = L_m + \frac{c_m\left(\frac{n}{2} - F_{m-1}\right)}{f_m}, \tag{2.5}
\]
where
$L_m$ is the lower limit of the class containing the median,
$c_m$ is the difference between the upper end and lower end of the median class,
$f_m$ is the frequency of the median class,
$F_{m-1}$ is the cumulative frequency of the class just before the median class,
$n = \sum_{i=1}^{k} f_i$ is the sum of frequencies and
$k$ is the number of classes.
When calculating the median for grouped data it is important to remember that
the real limits or class boundaries are used.
Example 2.4
The time it takes to build a three-roomed house is believed to be at most 12 weeks.
The man in charge of time and service delivery for a building company took a
random sample of three-roomed house constructions and inquired how long it took
to build them. The data is given as:
2. Lm = 10.5, since the median x(45) lies in the class 10.5 - 13.5.
3. cm = 13.5 − 10.5 = 3.
4. fm = 45.
5. F(m−1) = 25.
6. Median = $10.5 + \frac{3\left(\frac{90}{2} - 25\right)}{45} = 11.833$.
7. Important to note: check that the median calculated falls in the median class.
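The steps above can be sketched as a small Python function implementing equation (2.5). The total n = 90 is not stated in the steps; it is inferred here from the median position x(45) and from the percentile calculations later in this chapter, so treat it as an assumption. The function and argument names are illustrative.

```python
def grouped_median(lower, width, n, cum_before, freq):
    """Equation (2.5): L_m + c_m * (n/2 - F_{m-1}) / f_m."""
    return lower + width * (n / 2 - cum_before) / freq


# Values from the steps above; n = 90 is inferred, not stated in the steps.
result = grouped_median(lower=10.5, width=3, n=90, cum_before=25, freq=45)

print(round(result, 3))

# Step 7: check that the median falls inside the median class 10.5 - 13.5.
print(10.5 <= result <= 13.5)
```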
Mode
The mode is the observation with the largest frequency for ungrouped data. The
data is said to have no mode if all the observations are unique, as observations only
occur once in the data. It is also possible to have more than one mode.
With grouped data, there is no single most frequently occurring observation.
Instead, the mode is located in the class with the highest frequency, called the
modal class, and is calculated as
\[
\text{Mode} = L_m + \frac{c_m (f_m - f_{m-1})}{2f_m - (f_{m-1} + f_{m+1})}, \tag{2.6}
\]
where
$L_m$ is the lower end of the modal class,
$c_m$ is the upper end of the modal class minus the lower end of the modal class,
$f_m$ is the frequency of the modal class,
$f_{m-1}$ is the frequency of the class before the modal class and
$f_{m+1}$ is the frequency of the class after the modal class.
Example 2.5
Using the data in Example 2.4, to calculate the mode is done in the following way:
1. The modal class is 10.5 - 13.5, as it has the highest frequency, therefore Lm =
10.5.
2. cm = 13.5 − 10.5 = 3.
3. fm = 45.
4. fm−1 = 20.
5. fm+1 = 10.
6. Mode = $10.5 + \frac{3(45-20)}{2(45)-(20+10)} = 11.75$.
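Equation (2.6) translates directly into a short Python function. The sketch below uses the values from the steps above; the function and argument names are illustrative.

```python
def grouped_mode(lower, width, f_modal, f_before, f_after):
    """Equation (2.6): L_m + c_m (f_m - f_{m-1}) / (2 f_m - (f_{m-1} + f_{m+1}))."""
    return lower + width * (f_modal - f_before) / (2 * f_modal - (f_before + f_after))


mode = grouped_mode(lower=10.5, width=3, f_modal=45, f_before=20, f_after=10)

print(mode)
```

Note that the two neighbouring frequencies are added in the denominator: 2(45) - (20 + 10) = 60.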
For any set of data, the mean, median and mode are likely to be different, so it
has to be decided which is the best one to use in a given situation.
Quantiles/Quartiles/Percentiles
While the mean, median and mode describe the center of the data, it is sometimes
useful to also summarise other specific points of location of the data. If the data
values are sorted or ranked in ascending order, they can then be partitioned into
equal-sized portions, with the dividing points called quantiles.
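For ungrouped data, the 1st quartile is calculated by first ordering the data, and
then computing
\[
\text{1st quartile} = x_{\left|\frac{n+1}{4}\right|} + \frac{1}{4}\left[x_{\left|\frac{n+1}{4}\right|+1} - x_{\left|\frac{n+1}{4}\right|}\right], \tag{2.7}
\]
where $\left|\frac{n+1}{4}\right|$ takes only the integer part of the value. For example,
$\left|\frac{13}{4}\right| = 3$ and $\left|\frac{27}{4}\right| = 6$.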
Example 2.6
Calculate the 1st and 3rd quartile of the following data:
\[
\begin{aligned}
&= x_{(2)} + \frac{1}{4}\left[x_{(3)} - x_{(2)}\right] \\
&= 0.20 + \frac{1}{4}\left[10.00 - 0.20\right] \\
&= 2.65.
\end{aligned} \tag{2.9}
\]
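The 3rd quartile is given by:
\[
\begin{aligned}
\text{3rd quartile} &= x_{\left|\frac{3}{4}(8+1)\right|} + \frac{3}{4}\left[x_{\left|\frac{3}{4}(8+1)\right|+1} - x_{\left|\frac{3}{4}(8+1)\right|}\right] \\
&= x_{(6)} + \frac{3}{4}\left[x_{(7)} - x_{(6)}\right] \\
&= 23.90 + \frac{3}{4}\left[122.13 - 23.90\right] \\
&= 97.57.
\end{aligned} \tag{2.10}
\]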
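The two quartile calculations can be sketched in Python using only the ordered values that appear above, x(2) = 0.20, x(3) = 10.00, x(6) = 23.90 and x(7) = 122.13, with n = 8. The helper names `rank_position` and `interpolate` are illustrative, and taking the multiplier as the fractional part of the position is an assumption that happens to match the 1/4 and 3/4 used in this example.

```python
from math import floor


def rank_position(n, quarter):
    """Split quarter * (n + 1) / 4 into its integer part and fractional part."""
    position = quarter * (n + 1) / 4
    whole = floor(position)
    return whole, position - whole


def interpolate(lower, upper, fraction):
    """Move the given fraction of the way from lower to upper."""
    return lower + fraction * (upper - lower)


# n = 8 observations: positions 2.25 and 6.75.
rank_q1, frac_q1 = rank_position(8, 1)
rank_q3, frac_q3 = rank_position(8, 3)

# Ordered values taken from the worked example above.
q1 = interpolate(0.20, 10.00, frac_q1)     # between x(2) and x(3)
q3 = interpolate(23.90, 122.13, frac_q3)   # between x(6) and x(7)

print(rank_q1, frac_q1, round(q1, 2))
print(rank_q3, frac_q3, round(q3, 2))
```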
The quantiles for grouped data are calculated much like the grouped data median.
The following calculation is for the $q$th percentile:
\[
q\text{-th percentile} = L_q + \frac{c_q\left(\frac{qn}{100} - F_{q-1}\right)}{f_q}, \tag{2.11}
\]
where
$L_q$ is the lower limit of the class containing the $q$th percentile,
$c_q$ is the difference between the upper end and lower end of the $q$th percentile class,
$f_q$ is the frequency of the $q$th percentile class,
$F_{q-1}$ is the cumulative frequency of the class before the $q$th percentile class and
$n = \sum_{i=1}^{k} f_i$ is the sum of frequencies.
For example, to calculate the first quartile (25th percentile) the formula will look
as follows:
\[
\text{25-th percentile} = L_{25} + \frac{c_{25}\left(\frac{25n}{100} - F_{25-1}\right)}{f_{25}},
\]
where
$L_{25}$ is the lower limit of the class containing the 25th percentile,
$c_{25}$ is the difference between the upper end and lower end of the 25th percentile
class,
$f_{25}$ is the frequency of the 25th percentile class,
$F_{25-1}$ is the cumulative frequency of the class before the 25th percentile class and
$n = \sum_{i=1}^{k} f_i$ is the sum of frequencies.
Example 2.7
Using the data in Example 2.4, the
\[
\text{25th percentile} = 7.5 + \frac{3\left(\frac{90}{4} - 5\right)}{20} = 10.125
\]
and the
\[
\text{75th percentile} = 10.5 + \frac{3\left(\frac{75 \times 90}{100} - 25\right)}{45} = 13.33.
\]
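Equation (2.11) can be written as one Python function and applied to both calculations above. All numbers are the ones used in this example; the function and argument names are illustrative.

```python
def grouped_percentile(q, lower, width, n, cum_before, freq):
    """Equation (2.11): L_q + c_q * (q n / 100 - F_{q-1}) / f_q."""
    return lower + width * (q * n / 100 - cum_before) / freq


# 25th percentile: class 7.5 - 10.5 with frequency 20 and 5 observations before it.
p25 = grouped_percentile(25, lower=7.5, width=3, n=90, cum_before=5, freq=20)

# 75th percentile: class 10.5 - 13.5 with frequency 45 and 25 observations before it.
p75 = grouped_percentile(75, lower=10.5, width=3, n=90, cum_before=25, freq=45)

print(p25, round(p75, 2))
```

Setting q = 50 in the same function gives the grouped median of equation (2.5).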
Skewness
Skewness is a measure of the asymmetry of a distribution. A distribution can have a
positive or a negative skew, depending on where the mean, median and mode are
situated. The following figure presents the three different scenarios.
\[
P_{s1} = \frac{\text{mean} - \text{mode}}{\text{standard deviation}} \tag{2.12}
\]
\[
P_{s2} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} \tag{2.13}
\]
Values of $P_{s1}$ and $P_{s2}$ less than 0 indicate negative skewness in the data, while
values greater than 0 indicate positive skewness.
Kurtosis
Kurtosis measures how peaked the distribution of the data is. If the data set has a
high kurtosis, then the histogram of the data will have a high peak. It also means
that there is a great number of observations around the mode or the modal class
has a high frequency.
The Range
The range (R) is the difference between the maximum and minimum values in the
data set. It is given by:
Quartile deviation (IQR)
This measure is half the difference between the third and first quartiles. It is given by:
\[
\text{IQR} = \frac{1}{2}\left[\text{3rd quartile} - \text{1st quartile}\right] \tag{2.17}
\]
This measure gives half the range of the middle 50% of the values, which makes it a
better statistic than the range. It is not greatly affected by outliers, that is, it has the
property of robustness.
Variance
The variance is the most commonly used measure of dispersion or variability in
statistical analysis. This measurement takes into account all observations in the
data set. It is denoted by $s^2$, and the greater the value, the greater the variability.
If all observations are close or almost equal then the variance will be low.
The variance is calculated by summing the squared deviations of each observation
from the mean and dividing by $n - 1$. It is given by:
\[
\begin{aligned}
s^2 &= \frac{1}{n-1}\sum_{i=1}^{n} |x_i - \bar{x}|^2 \\
&= \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right] \\
&= \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left[\sum_{i=1}^{n} x_i\right]^2}{n}\right].
\end{aligned} \tag{2.18}
\]
The positive square root of the variance $s^2$ is called the standard deviation, and
is denoted by $s$.
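The three forms of equation (2.18) give the same value, which can be checked numerically. The sketch below borrows the 20 marks from the frequency distribution section purely as illustrative data; the variable names are not part of the notes.

```python
from statistics import variance

# The 20 marks from the frequency distribution section, used as sample data.
marks = [6, 4, 7, 10, 5, 6, 7, 8, 7, 8, 8, 9, 7, 5, 6, 6, 9, 4, 7, 8]

n = len(marks)
mean = sum(marks) / n
sum_of_squares = sum(x * x for x in marks)

# First form: sum of squared deviations from the mean, divided by n - 1.
form1 = sum((x - mean) ** 2 for x in marks) / (n - 1)

# Second form: sum of squares minus n times the squared mean.
form2 = (sum_of_squares - n * mean ** 2) / (n - 1)

# Third form: sum of squares minus the squared total divided by n.
form3 = (sum_of_squares - sum(marks) ** 2 / n) / (n - 1)

# Standard deviation: positive square root of the variance.
standard_deviation = form1 ** 0.5

print(round(form1, 4), round(form2, 4), round(form3, 4))
print(round(standard_deviation, 4), round(variance(marks), 4))
```

The second and third forms are convenient for hand calculation because they need only the totals of x and x squared.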
1. quartiles,
2. median,
Outliers can also be indicated in the box and whisker plot. The features of the box
and whisker plot are that the box contains 50% of the observations and the whiskers
are lines which extend from the box to the maximum and minimum values.
Example 2.8
The rate at which accidents occurred at a road junction controlled by yield signs
was recorded. Road accidents were also recorded after traffic lights were erected in
place of the yield signs. The results are given as:
The box and whisker plots of the accidents before and after the traffic lights were
erected are given as:
Draw the boxplots to scale in one plot, and comment on the distribution of the
accidents before and after the traffic lights were erected.
2.3 Conclusion
In this chapter, a variety of diagrammatic representations of data was looked at.
The diagrams and summary statistics help one understand the data better and to
deduce the structure of the data. A step above descriptive statistics is inferential
statistics which will be looked at in further chapters.
Exercise 2.2
1. A glass manufacturing company recorded the following profits (in millions).
The profits were recorded every four months from January 1985 to December
1987:
12 18 10 13 20 11 12 19 10
(a) Calculate the mean, median, mode and variance of the data.
(b) Comment on the median and mean, in terms of skewness.
(c) Construct a bar chart and comment on the distribution.
(d) Construct a box-and-whisker plot for the data.
Chapter 3
Probability
Probability is the measure of the likelihood that an event will occur. Probability
quantifies events into a number which can be used to make decisions. It is measured
on a scale from zero representing impossibility to one representing certainty.
Each of the possible outcomes has a probability assigned to it. For example, the
sample space for throwing a dice is {1,2,3,4,5,6}, and the probability assigned to
each of the outcomes will be 1/6, in the belief that the dice is fair, that is, each
outcome is equally likely. When probabilities are assigned to possible outcomes,
1. each probability must lie between 0 and 1, inclusively, and
2. the sum of all probabilities assigned must equal to 1.
Example 3.1
Assign probabilities to the following experiments:
1. Choosing a card from a standard pack of playing cards.
The sample space consists of 52 playing cards {Ace of Clubs, 2 of Clubs, 3 of
Clubs, . . ., King of Spades}. The probability assigned to each item will be 1/52,
assuming all the cards are equally likely to be picked.
To find the probability of an event, look at the sample space and add the probabilities
of the outcomes which make up the event. For example, when a coin is tossed twice, the
sample space is {(H, H), (T, T), (H, T), (T, H)} and the probability of each
outcome is 1/4. If an event A consists of two of these outcomes, then the probability of A is
P(A) = 1/4 + 1/4 = 1/2.
If A is an event, the event "not A" is the event consisting of those outcomes in the
sample space which are not in A. Since the sum of the probabilities assigned to
outcomes in the sample space is 1,
P(A) + P(not A) = 1.
The event "not A" is called the complement of the event A. The symbol A′ is used
to denote the complement of A. Therefore,
P(A) + P(A′) = 1.
Example 3.2
The numbers 1, 2, . . . , 9 are written on separate cards. The cards are shuffled and
the top card is turned over. Calculate the probability that the number on this card is a
prime number.
The sample space is {1, 2, 3, 4, 5, 6, 7, 8, 9}. Each outcome is equally likely and
has a probability of 1/9. Let B be the event that the card turned over is prime. Then
B = {2,3,5,7}. The probability of B is the sum of the probabilities of the outcomes
in B,
\[
P(B) = \frac{1}{9} + \frac{1}{9} + \frac{1}{9} + \frac{1}{9} = \frac{4}{9},
\]
and the probability of the card not being a prime number is
\[
P(B') = 1 - \frac{4}{9} = \frac{5}{9}.
\]
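This example can be checked by listing the sample space and counting the prime outcomes. The sketch below uses exact fractions so that the results appear as 4/9 and 5/9; the helper `is_prime` and the variable names are illustrative.

```python
from fractions import Fraction

# The nine equally likely outcomes.
sample_space = range(1, 10)


def is_prime(k):
    """True if k is a prime number (trial division, fine for small k)."""
    return k > 1 and all(k % d for d in range(2, int(k ** 0.5) + 1))


# Event B: the card turned over shows a prime number.
B = [k for k in sample_space if is_prime(k)]

p_B = Fraction(len(B), len(sample_space))
p_not_B = 1 - p_B   # complement rule: P(B') = 1 - P(B)

print(B, p_B, p_not_B)
```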
Exercise 3.1
1. A fair 20-sided dice has eight faces coloured red, ten coloured blue and two
coloured green. The dice is rolled:
2. A dice with 6 faces has been made from brass and aluminium and is not fair.
The probability of a 6 is 1/4, the probabilities of 2, 3, 4 and 5 are each 1/6, and the
probability of 1 is 1/12. The dice is rolled.
In Example 3.2 there are two events: event B, consisting of all the prime numbers
from 1 to 9, and event A, consisting of all the non-prime numbers. The events
A = {1,4,6,8,9} and
B = {2,3,5,7}
are mutually exclusive because they have no outcomes in common.
This can be seen in what is called a Venn diagram in Figure 3.1.
When two events have outcomes in the sample space that are the same,
they are not mutually exclusive. Using the same Example 3.2, let the event C be all
the even numbers in the sample space, C = {2,4,6,8}. The events A and C are not
mutually exclusive, as there are outcomes that are the same, namely {4,6,8}. In
this case the simple addition rule is not valid: P(A or C) ≠ P(A) + P(C).
The addition rule is modified as follows: P(A or C) = P(A) + P(C) - P(A and C).
The event "A and C" is called the intersection of the sets A and C. It is denoted
by A ∩ C, with probability P(A ∩ C), and is illustrated by the region in the Venn
diagram in Figure 3.2 that contains the outcomes {4,6,8}. P(A or C) is therefore the
probability of all outcomes in A plus the probability of all outcomes in C minus the
probability of all outcomes in the intersection of A and C.
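The modified addition rule can be verified on this example with Python sets, where `|` is the union and `&` is the intersection. This is a minimal sketch; the helper `prob` is illustrative and assumes equally likely outcomes.

```python
from fractions import Fraction

S = set(range(1, 10))      # sample space: the cards 1 to 9
A = {1, 4, 6, 8, 9}        # non-prime numbers
C = {2, 4, 6, 8}           # even numbers


def prob(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event), len(S))


# Left-hand side: probability of the union, counted directly.
lhs = prob(A | C)

# Right-hand side: the modified addition rule.
rhs = prob(A) + prob(C) - prob(A & C)

print(A & C, lhs, rhs)
```

Adding P(A) and P(C) alone would give 5/9 + 4/9 = 1, because the outcomes 4, 6 and 8 would be counted twice.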
Additional notation
The union of the events A and B is the set of all outcomes which belong to
event A or to event B (or to both). Its probability is P(A or B), as in the previous
example, and the union is denoted by A ∪ B.
A set that contains no outcomes is called an empty set and is denoted by ∅ or {}.
Consider a class of 20 students, of whom 12 are girls and 8 are boys. Suppose further
that 7 of the girls and 2 of the boys are left handed. If a student is picked randomly
from the class, then the chance that he or she is left handed is (7 + 2)/20 = 9/20.
However, the probability of selecting a left handed student from the group of girls
is 7/12, and the probability of selecting a left handed student from the group of boys
is 2/8 = 1/4. These probabilities have been calculated on the basis of
an extra condition, which is selecting the student from a certain group. This is an
example of conditional probability.
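The three probabilities above can be reproduced from a table of counts. In the sketch below the right handed counts (5 girls and 6 boys) are derived from the totals given in the text, and the helper names are illustrative.

```python
from fractions import Fraction

# Counts by (group, handedness); right handed counts follow from 12 - 7 and 8 - 2.
counts = {
    ("girl", "left"): 7, ("girl", "right"): 5,
    ("boy", "left"): 2, ("boy", "right"): 6,
}
total = sum(counts.values())


def p_left_given(group):
    """P(left handed | group): restrict attention to the chosen group only."""
    group_size = sum(c for (g, _), c in counts.items() if g == group)
    return Fraction(counts[(group, "left")], group_size)


# Unconditional probability of picking a left handed student.
p_left = Fraction(counts[("girl", "left")] + counts[("boy", "left")], total)

print(p_left, p_left_given("girl"), p_left_given("boy"))
```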
which is essentially:
Complete the probability of selecting a left handed student given that the student
is a boy.
Example 3.3
Weather records indicate that the probability that a particular day is dry is 3/10. The
South African football team Bafana Bafana has a better record of success on
dry days than on wet days. The probability that the team wins on a dry day is 3/8,
whereas the probability that they win on a wet day is 3/11. The team is due to play
their next match in a few days.
2. Three Saturdays ago, the team won their match. What is the probability that
it was a dry day?
The sequence involves first the type of weather and then the result of the football
match. In cases of conditional probability like in this example, one can make use of
a tree diagram, as in Figure 3.3.
Notice that the probabilities on the first layer of branches are the probabilities of the
type of weather, wet or dry. The probabilities on the second layer of branches are the
conditional probabilities. You can use the tree diagram to calculate any of the four
possibilities:
1. Winning can happen in two different ways, so P(win) = P(dry & win) +
P(wet & win) = 3/10 x 3/8 + 7/10 x 3/11 = 9/80 + 21/110 = 267/880.
2. In this case, you have been asked to calculate a conditional probability.
However, the sequence of events has been reversed and you want to find
P(dry|win).
\[
P(\text{dry} \mid \text{win}) = \frac{P(\text{dry \& win})}{P(\text{win})} = \frac{9/80}{267/880} = \frac{99}{267}.
\]
Think of P(dry|win) as being the proportion of times that the weather is dry
out of all the times that the team wins.
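The tree diagram calculation can be followed step by step with exact fractions. The sketch below reads the probabilities in this example as 3/10 (dry day), 3/8 (win given dry) and 3/11 (win given wet), which is consistent with the values 9/80 and 267/880 in the working above; the variable names are illustrative.

```python
from fractions import Fraction

# First layer of the tree: type of weather.
p_dry = Fraction(3, 10)
p_wet = 1 - p_dry

# Second layer: conditional probabilities of winning.
p_win_given_dry = Fraction(3, 8)
p_win_given_wet = Fraction(3, 11)

# Multiply along each branch that ends in a win.
p_dry_and_win = p_dry * p_win_given_dry
p_wet_and_win = p_wet * p_win_given_wet

# Add the two winning branches.
p_win = p_dry_and_win + p_wet_and_win

# Reverse the condition: proportion of wins that happened on a dry day.
p_dry_given_win = p_dry_and_win / p_win

print(p_dry_and_win, p_win, p_dry_given_win)
```

Python reduces fractions automatically, so the last value prints as 33/89, which is the same as 99/267.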
Exercise 3.2
1. The Prosecutor's fallacy. An accused prisoner is on trial. The defence lawyer
asserts that, in the absence of further evidence, the probability that the prisoner
is guilty is 1 in a million. The prosecuting lawyer produces a further piece of
evidence and asserts that if the prisoner were guilty, the probability that this
evidence would be obtained is 999 in 1000, and if he were not guilty it would
be only 1 in 1000. Assuming that the court accepts the legality of the evidence,
and that both lawyers' figures are correct, what is the probability that the
prisoner is guilty?
Bayes’ Theorem
Bayes' Theorem describes the probability of an event, based on prior knowledge of
conditions that might be related to the event.
Bayes' theorem is stated mathematically as the following equation:
\[
P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}. \tag{3.3}
\]
The examples done in this section on conditional probability use Bayes' Theorem.
Example 3.4
In a carnival game, a contestant has to first toss a fair coin and then roll a fair
cubical dice whose faces are numbered 1 to 6. The contestant wins a prize if the
coin shows heads and the dice score is below 3. Find the probability that the contestant
wins the prize.
The two events of tossing a coin and rolling a dice are independent. The outcome
of rolling the dice does not depend on the outcome of tossing the coin. Therefore
the probability of winning is:
\[
\begin{aligned}
P(\text{prize won}) &= P(\text{coin shows heads and dice score is lower than 3}) \\
&= P(\text{coin shows heads}) \times P(\text{dice score is lower than 3}) \\
&= \frac{1}{2} \times \frac{2}{6} = \frac{1}{6}.
\end{aligned}
\]
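The multiplication rule for independent events can be confirmed by listing every (coin, dice) outcome and counting the winning ones. This is a minimal sketch assuming all 12 outcomes are equally likely; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

coin = ["H", "T"]
dice = range(1, 7)

# All 12 equally likely (coin, dice) outcomes.
outcomes = list(product(coin, dice))

# Winning outcomes: heads together with a dice score below 3.
wins = [(c, d) for c, d in outcomes if c == "H" and d < 3]

# Probability by direct counting.
p_enumerated = Fraction(len(wins), len(outcomes))

# Probability by multiplying the two separate probabilities.
p_product = Fraction(1, 2) * Fraction(2, 6)

print(wins, p_enumerated, p_product)
```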
Example 3.5
Consider data on the mode of transport of 92 students to campus everyday:
\[
P(\text{male}) = \frac{\text{Number of males}}{\text{Total number of students}} = \frac{51}{92} = 55.4\%
\]
3.7.1 Permutations
In the previous section, you could count the number of outcomes in a sample space.
When the number of outcomes is fairly small, it is quite straightforward, but in
certain instances counting the possible number of outcomes can be cumbersome.
Think of listing every possible hand of 5 different cards from a pack of 52 playing cards.
Suppose you have 3 letters: A, B and C written on 3 separate cards. There are
different ways of arranging these cards: any of the 3 cards can be placed first,
either of the remaining 2 second, and the last card third.
Therefore altogether there are 3 x 2 x 1 = 6 possible ways of arranging the three cards.
Similarly the number of ways to arrange the letters A, B, C and D will be 4x3x2x1
= 24. The different arrangements of objects (they need not be letters) are called
permutations. The number of permutations of n distinct objects is n!, where
n! = n x (n − 1) x (n − 2) x . . . x 2 x 1. The expression n! is called n factorial.
The formula for permutations is further extended to give the number of different
permutations of r objects which can be made from n distinct objects:
\[ {}^{n}P_{r} = \frac{n!}{(n-r)!} \tag{3.6} \]
Equation 3.6 is used, for example, in the case of arranging 4 letters out of the 7 letters
A, B, C, D, E, F and G. The number of arrangements is:
\[ {}^{7}P_{4} = \frac{7!}{(7-4)!} = \frac{7!}{3!} = 840. \]
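These counts can be reproduced in Python. The sketch below uses `math.factorial` and `math.perm` (available from Python 3.8) and, as a cross-check, counts the arrangements by brute force; the variable names are illustrative.

```python
import math
from itertools import permutations

# Arrangements of all n distinct objects: n!
print(math.factorial(3))  # 6 arrangements of A, B, C
print(math.factorial(4))  # 24 arrangements of A, B, C, D

# Arrangements of r objects chosen from n distinct objects: n! / (n - r)!
n_arrangements = math.perm(7, 4)
print(n_arrangements)  # 840

# Brute-force cross-check: list every ordered choice of 4 letters from 7.
brute_force = sum(1 for _ in permutations("ABCDEFG", 4))
print(brute_force)  # 840
```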
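If the n objects are not all distinct, with $n_1$ alike of one kind, $n_2$ alike of a
second kind, ..., and $n_k$ alike of a k-th kind, the number of distinct permutations is
\[ \frac{n!}{n_1!\, n_2! \cdots n_k!} \tag{3.7} \]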
Example 3.6
Find the number of distinct permutations of the letters of the word MISSISSIPPI.
There are 11 letters, of which 4 are S’s, 4 are I’s, 2 are P’s and 1 is an M. The number
of distinct permutations of the letters is therefore
\[ \frac{11!}{4! \times 4! \times 2! \times 1!} = 34650. \]
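The same count can be obtained for any word by tallying the repeated letters. This is a small sketch; the helper name `distinct_permutations` is hypothetical.

```python
from collections import Counter
from math import factorial

def distinct_permutations(word):
    """Number of distinct arrangements: n! / (n1! * n2! * ... * nk!)."""
    total = factorial(len(word))
    for repeat_count in Counter(word).values():
        # Each partial quotient is a whole number, so integer division is exact.
        total //= factorial(repeat_count)
    return total

print(Counter("MISSISSIPPI"))                # letter counts: I 4, S 4, P 2, M 1
print(distinct_permutations("MISSISSIPPI"))  # 34650
print(distinct_permutations("ABC"))          # 6, all letters distinct
```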
3.7.3 Combinations
A combination is a selection of objects in which the order of the objects does not
matter. For example, if you were dealt a hand of 13 cards from a pack of 52 cards,
you would not be interested in the order in which you received the cards. The
number of combinations of r objects chosen from n distinct objects is
\[ {}^{n}C_{r} = \frac{n!}{(n-r)! \times r!} \tag{3.8} \]
Example 3.7
How many ways can you select three letters from the word ATMOSPHERIC,
where the order is not important?
\[ {}^{11}C_{3} = \frac{11!}{(11-3)! \times 3!} = 165. \]
Since the order is not important, selecting the arrangements ATM, AMT, TMA,
TAM, MTA and MAT are all counted as just one selection. Try computing the
permutation of this example.
Example 3.8
A team of 5 people, which must contain 3 men and 2 women, is chosen from 8 men
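and 7 women. How many different teams can be selected?
The number of different teams of 3 men which can be selected from 8 is ${}^{8}C_{3}$, and
the number of different teams of 2 women which can be selected from 7 is ${}^{7}C_{2}$. Any
team of men can be combined with any team of women, so the number of different teams is
\[ {}^{8}C_{3} \times {}^{7}C_{2} = 56 \times 21 = 1176. \]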
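Examples 3.7 and 3.8 can be reproduced with `math.comb` (Python 3.8 or later). The sketch also shows how a combination count relates to the corresponding permutation count; the variable names are illustrative.

```python
import math

# Example 3.7: choose 3 of the 11 distinct letters of ATMOSPHERIC.
selections = math.comb(11, 3)
print(selections)  # 165

# Each unordered selection of 3 letters can be ordered in 3! = 6 ways,
# so the permutation count is 6 times larger.
arrangements = math.perm(11, 3)
print(arrangements)  # 990

# Example 3.8: 3 men from 8 and 2 women from 7, chosen independently.
teams = math.comb(8, 3) * math.comb(7, 2)
print(teams)  # 1176
```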
Exercise 3.3
1. How many different arrangements can be made of the letters in the word
STATISTICS?
2. (a) Calculate the number of arrangements of the letters in the word NUMBER.
(b) How many arrangements in (a) begin and end with a vowel?
Acknowledgements
This chapter has been adapted from Advanced Level Mathematics Statistics 1, written
by Steve Dobbs and Jane Miller in 2002.
Chapter 4
4.1 Introduction
We have been looking at variables, describing them through samples and populations.
Now, we will proceed to look at the distributions of data. There is a class
of distributions into which most data fall. We will discuss these distributions, for
continuous and discrete data. This discussion should help us understand the
structure of populations better.
Probability distribution
Let X be a random variable and let $x_i$, $i = 1, 2, \cdots, k$ denote the k distinct values
that X may assume. Since each value of X corresponds to a basic outcome of a random
trial, a probability distribution for the sample space will associate a probability
value with each $x_i$. The probability that random variable X will assume value $x_i$
will be denoted by $P(X = x_i)$; for example, P(X = 0) is the probability that X = 0.
Example 4.1
In the game problem mentioned earlier, suppose a team plays 3 games, and
has the following probabilities of winning the games.
Prob(winning 0 games) = 0.65,
Prob(winning 1 game) = 0.15,
Prob(winning 2 games) = 0.10,
Prob(winning 3 games) = 0.10.
x          0     1     2     3
P(X = x)   0.65  0.15  0.10  0.10
This represents a probability distribution, i.e., the probability associated with each
possible outcome.
Properties
Any probability distribution for a discrete random variable X has the following properties.
1. $0 \le P(X = x_i) \le 1$ for $i = 1, 2, \cdots, k$,
2. $\sum_{i=1}^{k} P(X = x_i) = 1$.
Cumulative probabilities are found by adding; for example,
P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2), whereas
P(X < 2) = P(X = 0) + P(X = 1).
x    P(X = x)    P(X ≤ x)
0    0.65        0.65
1    0.15        0.80
2    0.10        0.90
3    0.10        1.00
Expected value
The mean value of a random variable in many trials is known as the expected
value. For a discrete random variable X, the expected value is defined by
\[ E[X] = \sum_{i=1}^{k} x_i P(x_i), \quad \text{where } P(x_i) = P(X = x_i). \]
Variance
The outcomes of a discrete random variable will vary, and as such it is useful to have
a measure of their variability. As we discussed earlier, a key measure of variation is
the variance, defined by
\[ \sigma_x^2 = Var[X] = \sum_{i=1}^{k} (x_i - E[X])^2 P(x_i) \]
or
\[ Var[X] = E\left[(X - E[X])^2\right] = E[X^2] - [E[X]]^2. \]
Example 4.2
Find E[X] and Var[X] in the above example.
Solution
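One way to carry out this calculation is sketched below in Python, using the probabilities of Example 4.1 and the two variance formulas given above. The variable names are illustrative assumptions, not notation from the notes.

```python
values = [0, 1, 2, 3]
probs = [0.65, 0.15, 0.10, 0.10]

# A valid distribution: probabilities lie in [0, 1] and sum to 1.
assert all(0 <= p <= 1 for p in probs)
assert abs(sum(probs) - 1) < 1e-12

# E[X] = sum of x * P(x)
mean = sum(x * p for x, p in zip(values, probs))

# Var[X] from the definition: sum of (x - E[X])^2 * P(x)
var_definition = sum((x - mean) ** 2 * p for x, p in zip(values, probs))

# Var[X] from the shortcut: E[X^2] - (E[X])^2
second_moment = sum(x**2 * p for x, p in zip(values, probs))
var_shortcut = second_moment - mean**2

print(round(mean, 4))            # 0.65
print(round(var_definition, 4))  # 1.0275
print(round(var_shortcut, 4))    # 1.0275
```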
Exercise 4.1
Consider a man who tosses a coin once. The probability of getting a head is 1/2, and the
probability of getting a tail is 1 − 1/2 = 1/2. Let X = 1 if he gets a head and X = 0 if he
gets a tail. What is the distribution function of X? Find E[X].
Definition 1 mgf
The moment generating function (mgf) of a discrete random variable X is defined
to be
\[ M_X(t) = E(e^{tX}) = \sum_{x \in \mathcal{X}} e^{tx} P(x) \]
where $\mathcal{X}$ is the set of possible X values. The mgf exists if $M_X(t)$ is defined for
all $t \in [-h, h]$ for some $h > 0$.
Note:
Some of the merits of using the mgf include the following properties.
1. If the mgf exists then, when t = 0, $M_X(0) = 1$ for any random variable.
2. If the mgf exists and is the same for two distributions then the two distributions
are the same. This means mgfs uniquely identify probability distributions.
3. If the mgf exists, the r-th moment of X is the r-th derivative of the mgf at t = 0:
\[ E(X^r) = \frac{d^r}{dt^r} M_X(t)\Big|_{t=0} = M_X^{(r)}(0) \]
So we can say the mean of X can be found by evaluating the first derivative
of the mgf at t = 0. That is, $\mu = E[X] = M'(0)$.
The variance of X can be found by evaluating the first and second derivatives
of the mgf at t = 0. That is, $\sigma^2 = E[X^2] - (E[X])^2 = M''(0) - (M'(0))^2$.
4. Let X have mgf $M_X(t)$ and let Y = aX + b. Then $M_Y(t) = e^{bt} M_X(at)$.
Example 4.3
Consider the continuous random variable X, which represents the yield of a crop in
tons per acre. Suppose the yield can take any value between 0 and 1 ton, and that
X has the density $f(x) = 12x(1-x)^2$ for $0 \le x \le 1$.
Then
\[ P(0.5 \le X \le 0.7) = \int_{0.5}^{0.7} 12x(1-x)^2\,dx = \left[6x^2 - 8x^3 + 3x^4\right]_{0.5}^{0.7} = 0.9163 - 0.6875 = 0.2288. \]
See Figure 4.1. This means the probability that the crop yield will be between 0.5
and 0.7 is 0.2288.
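The area can be checked in two ways: from the antiderivative, and by numerical integration. The sketch below assumes the density $f(x) = 12x(1-x)^2$ on [0, 1] that appears in the integral; `simpson` is a hand-written helper, not a library routine.

```python
def density(x):
    """Crop-yield density f(x) = 12x(1 - x)^2 on [0, 1]."""
    return 12 * x * (1 - x) ** 2

def cdf(x):
    """Antiderivative 6x^2 - 8x^3 + 3x^4 of the density, for 0 <= x <= 1."""
    return 6 * x**2 - 8 * x**3 + 3 * x**4

def simpson(g, a, b, n=100):
    """Composite Simpson's rule with an even number n of subintervals."""
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    total = g(a) + g(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3

exact = cdf(0.7) - cdf(0.5)
numeric = simpson(density, 0.5, 0.7)
print(round(exact, 4), round(numeric, 4))

# The total area under a density must equal 1.
print(round(simpson(density, 0.0, 1.0), 4))
```

Simpson's rule is exact for cubic polynomials, so the two answers agree up to floating-point rounding.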
Properties
As in the discrete case above, the probability that X < x0, P(X < x0) = F(x0), is the
area under the curve f(x) to the left of x0, i.e., using calculus,
\[ F(x_0) = \int_{-\infty}^{x_0} f(x)\,dx. \]
This is the cumulative distribution function.
Expected value
The expected value of a continuous variable is defined as
\[ E[X] = \int_{-\infty}^{\infty} x f(x)\,dx. \]
Example 4.4
In Example 4.3, we calculate the expected value from
\[ E[X] = \int_0^1 x f(x)\,dx = \int_0^1 x \times 12x(1-x)^2\,dx = \int_0^1 \left(12x^2 - 24x^3 + 12x^4\right)dx \]
\[ = \left[4x^3 - 6x^4 + \tfrac{12}{5}x^5\right]_0^1 = 4 - 6 + 2.4 = 0.4. \]
Variance
The variance, as in the discrete case, is calculated with the summation replaced by
the integral, as follows:
\[ Var[X] = \int_{-\infty}^{\infty} (x - E[X])^2 f(x)\,dx = \int_{-\infty}^{\infty} x^2 f(x)\,dx - [E[X]]^2 = E[X^2] - [E[X]]^2 \]
Example 4.5
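For the density in Example 4.3, where $E[X] = \int_0^1 x f(x)\,dx = 0.4$,
\[ Var[X] = \int_0^1 (x - E[X])^2 f(x)\,dx = E\left[(X - E[X])^2\right] = E[X^2] - [E[X]]^2 \]
\[ = \int_0^1 x^2 \times 12x(1-x)^2\,dx - 0.4^2 = \left[\tfrac{12}{4}x^4 - \tfrac{24}{5}x^5 + \tfrac{12}{6}x^6\right]_0^1 - 0.16 \]
\[ = 0.2 - 0.16 = 0.04. \]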
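Because the density $f(x) = 12x(1-x)^2 = 12x - 24x^2 + 12x^3$ is a polynomial, its moments on [0, 1] can be computed exactly, term by term. The sketch below does this with exact fractions; the coefficient list and the helper `moment` are illustrative assumptions.

```python
from fractions import Fraction

# Coefficients of f(x) = 0 + 12x - 24x^2 + 12x^3, lowest power first.
coeffs = [0, 12, -24, 12]

def moment(k):
    """E[X^k]: the integral of x^k f(x) over [0, 1].

    Each term c * x^(j + k) integrates to c / (j + k + 1).
    """
    return sum(Fraction(c, j + k + 1) for j, c in enumerate(coeffs))

total_area = moment(0)     # must be 1 for a density
mean = moment(1)           # E[X]
second_moment = moment(2)  # E[X^2]
variance = second_moment - mean**2

print(total_area, mean, second_moment, variance)  # 1 2/5 1/5 1/25
```

In decimals this gives E[X] = 0.4 and Var[X] = 0.04.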
Exercise 4.2
In Example 4.3, find the cumulative distribution function of X.
The mgf exists if $M_X(t)$ is defined for all $t \in [-h, h]$ for some $h > 0$.
There are many other distributions not discussed in this chapter. However, the
few distributions in the next two sections fall under a class of distributions we call
empirical distributions.
P (X = 1) = p
P (X = 0) = 1−p
i.e., P (X = 0) = 1 − P (X = 1)
= 1−p
Example 4.6
Tossing a fair coin, you get a head or a tail, each with probability p = 1/2. Thus if
a head is labelled 1 and a tail 0, the random variable X representing the outcome
takes values 0 or 1. If the probability that X = 1 is p, then we have that
p = P(X = 1) = 1/2
P(X = 0) = 1 − p = 1 − 1/2 = 1/2
Binomial Distribution
Suppose in an experiment there are two possible outcomes (failure and success) and
that the probability of success is p. Suppose also that the experiment is repeated n
times. Then the number of successes x follows a Binomial distribution.
Let $X = X_1 + X_2 + \cdots + X_n$ where the $X_i$ are independent and identically distributed
Bernoulli random variables; then X is called a binomial random variable. Thus
\[ P(x) = P(X = x) = \binom{n}{x} p^x (1-p)^{n-x} \]
for $x = 0, 1, \cdots, n$ and $0 < p < 1$, where $\binom{n}{x} = \frac{n!}{x!(n-x)!}$.
The quantities n and p are called parameters and they specify the distribution.
Let us look at one application of the Binomial distribution with parameters n and p, i.e., Bin(n, p).
Example 4.7
Bin(n, p) here means Binomial distribution with parameters n and p.
Find
(i) probability of getting 4 heads in 6 tosses of a fair coin.
(ii) E[X] and Var[X].
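Solution
(i) With n = 6 and p = 1/2,
\[ P(X = 4) = \binom{6}{4}\left(\tfrac{1}{2}\right)^4\left(1 - \tfrac{1}{2}\right)^{6-4} = 15 \times \tfrac{1}{16} \times \tfrac{1}{4} = \tfrac{15}{64}. \]
(ii) In general,
\[ E[X] = \sum_x x \binom{n}{x} p^x (1-p)^{n-x} = np \]
\[ Var[X] = \sum_x x^2 \binom{n}{x} p^x (1-p)^{n-x} - (np)^2 = np(1-p) \]
so here E[X] = 6 × 1/2 = 3 and Var[X] = 6 × 1/2 × 1/2 = 1.5.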
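The binomial probabilities and the formulas np and np(1 − p) can be checked directly from the probability function. This is a minimal sketch with exact fractions; `binom_pmf` is a hypothetical helper name.

```python
from fractions import Fraction
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 6, Fraction(1, 2)
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]

p_four_heads = pmf[4]
mean = sum(x * q for x, q in enumerate(pmf))
variance = sum(x**2 * q for x, q in enumerate(pmf)) - mean**2

print(p_four_heads)  # 15/64
print(mean)          # 3, which equals n * p
print(variance)      # 3/2, which equals n * p * (1 - p)
```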
Poisson Distribution
A Poisson random variable is a discrete random variable that can take integer values
from 0 upwards without limit. The parameter for this distribution is λ, and we write Po(λ).
An example of the application of the Poisson distribution follows: the number of
individuals arriving at a bank teller per quarter hour, X, is a Poisson random variable.
The Poisson probability function is
\[ P(x) = \frac{\lambda^x e^{-\lambda}}{x!}, \]
where P(x) = P(X = x), x = 0, 1, 2, ..., and 0 < λ < ∞.
Example 4.8
The number of students arriving at a take away every 15 minutes is a Poisson random
variable with parameter λ = 0.2. Find the probability that zero, one and two
students arrive at the take away.
\[ P(X = 0) = \frac{(0.2)^0 e^{-0.2}}{0!} = 0.8187 \quad \text{(no students arrive)} \]
\[ P(X = 1) = \frac{(0.2)^1 e^{-0.2}}{1!} = 0.1637 \]
\[ P(X = 2) = \frac{(0.2)^2 e^{-0.2}}{2!} = 0.0164 \]
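The three probabilities can be reproduced from the Poisson probability function. The sketch below also sums the first 50 terms of the series to illustrate that the mean and variance are both close to λ; the cut-off of 50 terms and the helper name `poisson_pmf` are arbitrary illustrative choices.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Po(lam)."""
    return lam**x * exp(-lam) / factorial(x)

lam = 0.2
p0, p1, p2 = (poisson_pmf(x, lam) for x in range(3))
print(round(p0, 4), round(p1, 4), round(p2, 4))  # 0.8187 0.1637 0.0164

# Truncated sums for the mean and variance (terms beyond x = 50 are negligible).
terms = range(50)
mean = sum(x * poisson_pmf(x, lam) for x in terms)
variance = sum(x**2 * poisson_pmf(x, lam) for x in terms) - mean**2
print(round(mean, 6), round(variance, 6))
```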
Properties
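\[ E[X] = \sum_{x=0}^{\infty} \frac{x \lambda^x e^{-\lambda}}{x!} = \sum_{x=1}^{\infty} \frac{\lambda^x e^{-\lambda}}{(x-1)!} = \lambda \sum_{x=1}^{\infty} \frac{\lambda^{x-1} e^{-\lambda}}{(x-1)!} = \lambda \times 1 = \lambda. \]
Similarly
\[ Var[X] = \sum_{x=0}^{\infty} \frac{x^2 \lambda^x e^{-\lambda}}{x!} - \lambda^2 = \lambda^2 + \lambda - \lambda^2 = \lambda. \]
Hint: Let y = x − 1.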
Exercise 4.3
Show that the limiting distribution of the Binomial distribution is the Poisson
distribution, i.e., that as n → ∞ with np = λ held fixed,
\[ \lim_{n \to \infty} P(x) = \frac{e^{-\lambda} \lambda^x}{x!}. \]
\[ P(X = x) = \frac{\binom{S}{x}\binom{N-S}{n-x}}{\binom{N}{n}} \]
Properties
\[ E[X] = a + \frac{s-1}{2}, \qquad Var[X] = \frac{s^2 - 1}{12}. \]
Example 4.9
Let X be a discrete random variable which can assume the 5 values
X = 0 1 2 3 4.
In this case a = 0 and s = 5, therefore
x          0    1    2    3    4
P(X = x)   0.2  0.2  0.2  0.2  0.2
That is the discrete Uniform probability function. One way to obtain it is to use the formula
\[ P(X = x) = P(x) = \frac{1}{5}, \]
for x = 0, 0 + 1, 0 + 2, 0 + 3, 0 + (5 − 1) = 4, thus s = 5.
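For this example the mean and variance formulas of the discrete Uniform distribution can be checked numerically. The sketch assumes, as in the example, that the s values are a, a + 1, ..., a + s − 1; the variable names are illustrative.

```python
from fractions import Fraction

a, s = 0, 5
values = range(a, a + s)  # 0, 1, 2, 3, 4
p = Fraction(1, s)        # each value has probability 1/s

mean = sum(x * p for x in values)
variance = sum((x - mean) ** 2 * p for x in values)

print(mean, variance)  # 2 2

# Compare with the closed-form expressions.
print(a + Fraction(s - 1, 2), Fraction(s**2 - 1, 12))  # 2 2
```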
Exercise 4.4
Show that for the discrete Uniform distribution
\[ E[X] = a + \frac{s-1}{2} \quad \text{and} \quad Var[X] = \frac{s^2 - 1}{12}. \]
Hint: $Var(X) = \sum (x - E[X])^2 P(X = x)$.
Example 4.10
The marks from a certain exam are uniformly distributed over 50 to 75. The
density function for the marks is given by
\[ f(x) = \begin{cases} \frac{1}{25} & 50 < x < 75, \\ 0 & \text{elsewhere.} \end{cases} \]
Properties
\[ E[X] = \frac{1}{2}(b + a), \qquad Var[X] = \frac{(b-a)^2}{12} \]
Exercise 4.5
For the continuous Uniform distribution, show that
\[ E[X] = \frac{1}{2}(a + b) \quad \text{and} \quad Var[X] = \frac{(b-a)^2}{12}. \]
\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right], \quad -\infty < x < \infty, \]
where $-\infty < \mu < \infty$ and $\sigma > 0$.
We shall use the normal distribution to a great extent in inference. This is a two
parameter distribution, usually denoted by $N(\mu, \sigma^2)$. For X a normal random
variable with mean µ and variance σ², the random variable $Z = \frac{X-\mu}{\sigma}$ is said
to be standard normal: it has mean 0 and variance 1. Tables which give normal
cumulative probabilities are widely available.
Properties
Let X be a normally distributed random variable with mean µ and variance σ², i.e.,
$X \sim N(\mu, \sigma^2)$; then $Z = \frac{X-\mu}{\sigma}$ is normally distributed with mean 0 and variance 1.
We say Z has a standard normal distribution, and the transformation $\frac{X-\mu}{\sigma}$ is called standardisation.
Example 4.11
Let X be the length of a long distance telephone call. Then X has an exponential
distribution.
Properties
\[ E[X] = \frac{1}{\lambda} \quad \text{and} \quad Var[X] = \frac{1}{\lambda^2}. \]
The parameter here is λ.
Memoryless Property
The memoryless property of a given probability distribution is mostly associated
with the distribution of times. Consider tossing a coin: if your outcome was a head
(H) and you toss the coin again, does it mean that the second toss is going to be a
tail (T)? No: P(T|H) = P(T),
since these events are independent. The memoryless property means the future is
independent of the past. There are certain probability distributions which have this
property.
We will mention two in this course. In the discrete case, a sequence of independent
Bernoulli trials is memoryless (see the coin tossing example).
The only memoryless continuous probability distribution is the exponential
distribution. Suppose X ∈ [0, ∞) is a continuous random variable; the probability
distribution of X is said to be memoryless if for any real numbers t, a ∈ [0, ∞),
P(X > t + a | X > a) = P(X > t).
Exercise 4.6
1. Let X be Exponentially distributed. Show that E(X) = 1/λ.
4.7 Conclusion
We have briefly looked at some types of distributions. Some kinds of data tend to
follow certain distributions. These distributions are not many, and we rarely meet
data with an unknown distribution, especially when the sample size is large.
The number of people arriving at a certain bus stop in specified periods of time
will generally follow a Poisson distribution. The time taken by radioactive material
to decay generally follows the exponential distribution. Such natural phenomena
make it necessary for us to study empirical distributions.
Exercise 4.7
1. The heights (in metres) of children aged between 10 and 14 are recorded below
1.4, 1.5, 1.6, 1.2, 1.63
.
3. Suppose that X, the number of years of schooling a student completes beyond the
age of 14, is distributed normally with a mean of 4 and a standard deviation
of 2.
(a) What is the probability that a student completes more than 8 years of
schooling beyond the age 14?
(b) What is the probability that a student completes 2 to 8 years of schooling
beyond the age 14?
4. The probability that an individual gets a loan from a bank is 0.25. If 12 people
applied for loans what is the probability that
5. A new typist makes on average 1 error per page on her typing. What is the
probability that she will make
Chapter 5
5.1 Introduction
The objective of statistics is to make inferences about a population based on infor-
mation contained in a sample. Statistical inference is mainly concerned with making
inferences about population parameters. Methods of making inferences about pa-
rameters fall into two categories, making decisions concerning the value of a param-
eter or estimating/predicting the value of the parameter. The relevant information
in a sample can be used to estimate the likely values of their associated population
parameters.
We will often need to test the truth of some claims made about a population; this
will be covered under hypothesis testing.
5.2 Estimation
When a single statistic is used to estimate a population parameter we call it a
point estimator. A good estimator of a population parameter should at least be
an unbiased estimator. The next question is what is an unbiased estimator?
This means if you have a representative sample you can estimate the variance of the
population. Similarly, the point estimate of population standard deviation is given
by
\[ \hat{\sigma} = s = \sqrt{\frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]} \tag{5.3} \]
Example 5.1
A company which manufactures and bottles chemicals, collected a bottle from each
batch of 20, which they dispatched for quality control purposes. They measured the
350, 351, 348, 352, 350, 356, 348, 347, 348, 352, 354
3. Estimate the proportion of bottles which have less than 350ml in the whole
consignment.
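Solution
1.
\[ \hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{11} \times 3856 = 350.545 \]
2.
\[ \hat{\sigma} = s = \sqrt{\frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right)} = \sqrt{\frac{1}{10} \times \left(1351782 - \frac{3856^2}{11}\right)} = 2.8058 \]
3.
\[ \hat{\pi} = p = \frac{x}{n} = \frac{4}{11} = 0.3636 \]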
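The three point estimates can be reproduced with Python's standard `statistics` module, whose `stdev` function uses the n − 1 divisor of Equation 5.3. The variable names below are illustrative.

```python
import statistics
from math import sqrt

volumes = [350, 351, 348, 352, 350, 356, 348, 347, 348, 352, 354]
n = len(volumes)

# 1. Point estimate of the population mean.
mean = statistics.mean(volumes)

# 2. Point estimate of the population standard deviation (divisor n - 1).
s = statistics.stdev(volumes)

# The same value from the computational formula in Equation 5.3.
sum_x = sum(volumes)
sum_x2 = sum(v**2 for v in volumes)
s_formula = sqrt((sum_x2 - sum_x**2 / n) / (n - 1))

# 3. Point estimate of the proportion of bottles below 350 ml.
below = sum(1 for v in volumes if v < 350)
p_hat = below / n

print(round(mean, 3), round(s, 4), round(p_hat, 4))
```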
When computing the confidence intervals for our population parameters, we need
to consider the following;
1. the size n of the sample we are dealing with
2. whether we know the population standard deviation σ or not.
You will get the value of α from the question. For instance, if you are asked to
construct a 95% confidence interval, then
95% = 0.95 × 100%
= (1 − 0.05) × 100%
so that α = 0.05 and α/2 = 0.05/2 = 0.025. We now do the opposite of what we did in
the last chapter, i.e., we want to find the value of k that gives P(Z > k) = 0.025.
Looking down column φ(−z) (depending on the tables) until we get the probability
closest to 0.025, we see that the corresponding value is 1.96. So in this case
$k = z_{\alpha/2} = z_{0.05/2} = z_{0.025} = 1.96$.
where $\bar{x}$ is the mean of a sample of size n from a population with known variance
σ², and $z_{\alpha/2}$ is the value of the standard normal distribution that leaves an area of α/2 to
the right.
A (1 − α)100% confidence interval for µ, when σ is unknown and the sample is small,
is given by
\[ \bar{x} - t_{n-1}(\alpha/2)\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{n-1}(\alpha/2)\frac{s}{\sqrt{n}} \tag{5.7} \]
where $\bar{x}$ and s are the mean and standard deviation respectively of a sample of
size n < 30 from an approximately normal population, and $t_{n-1}(\alpha/2)$ is the value of
the t-distribution, with n − 1 degrees of freedom, leaving an area of α/2 to the right.
\[ T = \frac{\bar{X} - \mu}{s/\sqrt{n}} \tag{5.8} \]
This statistic follows what is called the Student’s t distribution with n − 1 degrees
of freedom. This is the value being used in Equation 5.7. We look up this value
from our Student’s t tables which look like Table 5.1 below.
Suppose you want to look up tn−1 (α/2) where n=7, and α = 0.05 then we have;
As the sample size n increases the t-distribution tends to the standard normal
distribution. So we can read off the values of $z_{\alpha/2}$ here. For example
$z_{\alpha/2} = z_{0.05/2} = z_{0.025} = 1.96$ (see last row, column 5). Let us look at a few
examples. Each of these examples should help you appreciate when to apply which formula.
Example 5.2
A sample of 36 people at a rave night club revealed an average age of $\bar{x}$ = 19.38
years and a sample standard deviation of s = 4.760 years. Determine a 95% confidence
interval for the true mean age of individuals at the rave night club.
Solution
n is large (n > 30)
σ is unknown, so we use case 2.
\[ \bar{x} - z_{\alpha/2}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{s}{\sqrt{n}} \]
\[ 19.38 - 1.96 \times \frac{4.760}{\sqrt{36}} \le \mu \le 19.38 + 1.96 \times \frac{4.760}{\sqrt{36}} \]
17.83 ≤ µ ≤ 20.93
We are 95% confident that the true mean lies between 17.83 and 20.93.
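The interval can be reproduced from the summary statistics alone. The sketch below uses `statistics.NormalDist().inv_cdf` from the Python standard library in place of the printed normal table; the function name `z_interval` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sd, n, confidence):
    """Large-sample confidence interval: xbar +/- z_{alpha/2} * sd / sqrt(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when confidence = 0.95
    half_width = z * sd / sqrt(n)
    return xbar - half_width, xbar + half_width

low, high = z_interval(xbar=19.38, sd=4.760, n=36, confidence=0.95)
print(round(low, 2), round(high, 2))  # 17.83 20.93
```

The software value of $z_{0.025}$ has more decimal places than the tabulated 1.96, so the limits can differ from a hand calculation in the third decimal place.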
Example 5.3
The scores below were recorded after a statistics examination was marked. The
standard deviation of the marks is known to be 7.93%.
73 52 67 53 51 61 49 66 41 48
52 47 65 46 71 67 48 66 47 44
63 65 44 46 61 52 55 54 51 56
49 62 57 56 47 45 56 59 59 47
48 57 48 52 53 52 51 63 68 53
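Determine the 98.44% confidence interval for the mean mark.
Solution
Since σ is known, we use σ in place of s:
\[ \bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \]
\[ 54.78 - 2.42 \times \frac{7.93}{\sqrt{50}} \le \mu \le 54.78 + 2.42 \times \frac{7.93}{\sqrt{50}} \]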
52.07 ≤ µ ≤ 57.49
We are 98.44% confident that the true mean mark lies between 52.07 and 57.49.
Example 5.4
The weights of seven similar containers of a chemical are recorded below.
Solution
x = 283.5, n = 7, s = 8.0186 from calculator.
Confidence interval estimation can be extended to cover other parameters like pro-
portions, the standard deviation as well as differences between two means and a
ratio of standard deviations.
The ideas used in the construction of confidence intervals can be extended to help
determine the size of a sample that will lead us to estimate the mean to any desired
degree of accuracy.
Example 5.5
The manager of a department wants the margin of error of the mean number of
calculation errors in reports in his department to be within ± 3 points for the year
2006 trainees. The required confidence level is 0.95. What sample
size should he take of reports if the standard deviation is known to be 9.2?
Solution
\[ n = \left[\frac{z_{\alpha/2} \times \sigma}{\epsilon}\right]^2 = \left[\frac{1.96 \times 9.2}{3}\right]^2 = 36.128 \]
Since n must be a whole number, we round up: a sample of at least 37 reports is needed.
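The sample-size formula is easy to wrap in a small function. In the sketch below the result is rounded up, since rounding down would give a margin of error slightly larger than requested; `sample_size` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist

def sample_size(sigma, margin, confidence):
    """Unrounded n = (z_{alpha/2} * sigma / margin)^2."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return (z * sigma / margin) ** 2

raw_n = sample_size(sigma=9.2, margin=3, confidence=0.95)
n_required = ceil(raw_n)  # always round up to the next whole report

print(round(raw_n, 2))  # 36.13
print(n_required)       # 37
```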
Exercise 5.1
1. Show that x is an unbiased estimator of µ.
4. Suppose
We can use confidence intervals to test the validity of a claim. Tests of claims are
best handled by hypothesis testing.
The testing of a statistical hypothesis is perhaps the most important part of decision
making.
The truth or falsity of a statistical hypothesis is never known with certainty unless
we examine the entire population. A random sample is taken from the population
of interest and the information contained in this sample is used to decide whether
the hypothesis is likely to be true or false. Evidence that is inconsistent with the
stated hypothesis leads to a rejection whereas evidence supporting the hypothesis
leads to its acceptance. For example, one might want to test the hypothesis that
the pass rate for an exam is always 40%, or test the claim that women live longer
than men. There are two types of hypothesis, a null and an alternative hypothesis.
A rejection region is the range of values of the sample statistic that leads
to the rejection of the null hypothesis.
A two tailed test has an area of rejection both below and above the hypothesised
value.
A one sided and right tailed test has an area of rejection which lies above the
null hypothesised value of the population parameter
A one sided and left tailed test has the area of rejection which lies below the
null hypothesised value of the population parameter.
A test statistic T is a value calculated from the sample data which is used to
decide whether or not H0 should be rejected.
Then $|Z_{calc}|$ tends to be large if H0 is false and small otherwise. An α-size test is to
reject H0 if $|Z_{calc}| > z_{\alpha/2}$.
Example 5.7
The ages of a sample of employees who come very early to work were observed in one
financial institution. The following information was obtained from the sample: $\bar{x}$ = 26.3,
σ² = 9 and n = 36. You are required to test the hypothesis that the mean age is
25. Assume α = 0.005.
Solution
Since σ is known, we are using Case 1, the Z distribution. The test is a two tailed
test.
α = 0.005 ⇒ $z_{0.0025}$ = 2.81
H0 : µ = 25 versus
H1 : µ ≠ 25
The critical region is Z < −2.81 and Z > 2.81. Thus, reject H0 if $|Z_{calc}|$ > 2.81.
The test statistic is
\[ Z_{calc} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{26.3 - 25}{3/\sqrt{36}} = 2.6 \]
Since $|Z_{calc}|$ < 2.81, that is, 2.6 < 2.81, we fail to reject H0 and conclude that
there is not enough evidence that the average age of early comers differs from 25.
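The steps of this two tailed Z test can be sketched in Python as follows. The critical value is computed with `statistics.NormalDist` rather than read from a table, and `z_statistic` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def z_statistic(xbar, mu0, sigma, n):
    """Test statistic (xbar - mu0) / (sigma / sqrt(n)) for known sigma."""
    return (xbar - mu0) / (sigma / sqrt(n))

alpha = 0.005
z_calc = z_statistic(xbar=26.3, mu0=25, sigma=sqrt(9), n=36)

# Two tailed test: alpha is split equally between the two tails.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
reject_h0 = abs(z_calc) > z_crit

print(round(z_calc, 2))  # 2.6
print(round(z_crit, 2))  # 2.81
print(reject_h0)         # False: fail to reject H0
```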
Example 5.8
It is desired that a statistician test the hypotheses given below and make correct
recommendations based on the information gathered from a sample of measurements
of 40mm pipes which were supplied by a new supplier. Assume α = 0.01.
H0 : µ = 40, H1 : µ ≠ 40, n = 18, $\bar{x}$ = 34, and s² = 64.
Solution
Since n is small and σ unknown we use the t-distribution. The test is a two tailed
test.
H0 : µ = 40 versus
H1 : µ ≠ 40
α = 0.01, $t_{n-1}(\alpha/2) = t_{17}(0.005) = 2.90$
The critical region is T < −2.90 and T > 2.90. Thus reject H0 if $|T_{calc}|$ > 2.90.
The test statistic is
\[ T_{calc} = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} = \frac{34 - 40}{8/\sqrt{18}} = -3.182 \]
Since $|T_{calc}|$ > 2.90, that is, 3.182 > 2.90, we reject H0 and conclude that the true
mean is not equal to 40. The pipes are not 40mm.
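The same test can be sketched in Python. The standard library has no Student's t quantile function, so the critical value below is simply the tabulated figure for 17 degrees of freedom, entered by hand; `t_statistic` is a hypothetical helper name.

```python
from math import sqrt

def t_statistic(xbar, mu0, s, n):
    """Test statistic (xbar - mu0) / (s / sqrt(n)) for unknown sigma."""
    return (xbar - mu0) / (s / sqrt(n))

t_calc = t_statistic(xbar=34, mu0=40, s=sqrt(64), n=18)

# Tabulated two tailed critical value t_17(0.005), read from a t table.
t_crit = 2.898
reject_h0 = abs(t_calc) > t_crit

print(round(t_calc, 3))  # -3.182
print(reject_h0)         # True: reject H0
```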
Exercise 5.2
1. A geologist is testing the hypothesis that the melting point of an unusual
carbon substance is 1946 °C. He makes 7 determinations and obtains the
values 1944, 1947, 1945, 1947, 1949, 1946 and 1944 °C. What conclusions
can you draw at a significance level of 0.05?
3. A baker stated that on average the number of loaves of bread sold daily is 3 000
with a standard deviation of 300. An employer wants to test the accuracy of
this statement. A random sample of 36 days showed the average daily sales
were 3 150. Test at the 1% level of significance if the bakery’s statement can
be accepted.
Depending on the sample size and knowledge of σ, we reject H0 if the test statistic >
tabulated value(α):
1. If σ is known, then use $z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}}$ and reject H0 if $z > z_\alpha$.
2. If σ is unknown and n < 30, then use $t = \frac{\bar{x}-\mu}{s/\sqrt{n}}$ and reject H0 if $t > t_\alpha(n-1)$.
3. If σ is unknown and n > 30, then use $z = \frac{\bar{x}-\mu}{s/\sqrt{n}}$ and reject H0 if $z > z_\alpha$.
Example 5.9
The average distance travelled by a small engined vehicle on 10 litres of petrol is
162.5 kms with a standard deviation of 6.9 kms. Is there reason to believe that
adding a new additive to petrol increases the distance travelled on 10 litres if a
random sample of 50 small cars has an average of 165.2 kms per 10 litres, at the 5%
level of significance?
Solution
H0 : µ = 162.5
H1 : µ > 162.5
n = 50, $\bar{x}$ = 165.2 and σ = 6.9. Since σ is known we use the Z distribution. The
test is a one tailed test.
α = 0.05, $z_\alpha = z_{0.05} = 1.645$
The critical region is Z > 1.645. Thus reject H0 if $Z_{calc}$ > 1.645.
\[ Z_{calc} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \frac{165.2 - 162.5}{6.9/\sqrt{50}} = 2.7669 \]
Since $Z_{calc} > z_{0.05}$, that is, 2.7669 > 1.645, we reject H0 and conclude that the
additive increases the distance travelled on 10 litres of petrol.
Exercise 5.3
1. A supermarket claims that customers to its stores spend on average 25 minutes
carrying out their purchases. A consumer body wants to verify this claim.
They observed the entry and departure times from supermarkets in the chain
of 24 randomly selected customers. The sample average time was half an hour
with a standard deviation of 14.1 minutes. Test the validity of the
supermarket’s belief at the 2.5% level of significance.
The critical values, for particular tests, sample sizes and significance levels, are
available in tables. Remember, sometimes, we have to extrapolate the critical
1. If σ is known, then use $z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}}$ and reject H0 if $z < -z_\alpha$.
2. If σ is unknown and n < 30, then use $t = \frac{\bar{x}-\mu}{s/\sqrt{n}}$ and reject H0 if $t < -t_\alpha(n-1)$.
3. If σ is unknown and n > 30, then use $z = \frac{\bar{x}-\mu}{s/\sqrt{n}}$ and reject H0 if $z < -z_\alpha$.
Example 5.10
The average length of time spent in a bank queue has been 50 minutes with a
standard deviation of 10 minutes. A new banking system with new software is being
tested. A random sample of 12 clients had an average banking time of 42 minutes with a
standard deviation of 11.9 minutes under the new system. Test the hypothesis that
the population mean is now less than 50 using a level of significance of
(i) 0.05,
(ii) 0.01.
Solution.
Let H0 : µ = 50 minutes versus H1 : µ < 50 minutes.
(i) Using the t-distribution at α = 0.05 ⇒ $t_{n-1}(0.05) = t_{11}(0.05) = 1.80$. Reject H0 if
T < −1.80, that is, the critical region is (−∞, −1.80). The test statistic is
\[ T = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} = \frac{42 - 50}{11.9/\sqrt{12}} = -2.3288 \]
Since −2.3288 < −1.80, that is, the test statistic is in the critical region, we
reject the null hypothesis at the 5% level of significance.
(ii) Using the t-distribution at α = 0.01 ⇒ $t_{n-1}(0.01) = t_{11}(0.01) = 2.72$. Reject H0 if
T < −2.72, that is, the critical region is (−∞, −2.72). Since −2.33 > −2.72, we do
not reject H0 at the 1% level of significance.
Note: We did not use σ = 10, since the sample pertains to a new banking system
with new software, so the population is not the same.
This essentially means that the true mean is likely to be less than 50 minutes but
does not differ significantly enough to warrant the high cost that would be required to
overhaul the current banking system.
Exercise 5.3
1. A cement manufacturer packs 50kg bags. To see if the manufacturer puts
enough cement in the bags, the contents of random sample of 200 such bags
were weighed. The average contents turned out to be 48kgs with a standard
deviation of 0.12. State the null and research hypothesis for this problem.
Hence test at the 0.05 significance level, whether or not the manufacturer
satisfies the requirement?
2. A scientist claimed that mice with an average life span of 28 months will live
to be about 43 months old when 45% of the calories in their food are replaced
by vitamins and proteins. Is there any reason to believe that the mean age is
less than 43 months if 54 mice that are placed on this diet have an average
life of 38 months with a standard deviation of 5.8 months. Use a 0.025 level
of significance.
3. Redo number 2, but suppose a sample of 20 mice, instead of 54, is placed on this diet.
Note
In inference we use samples to make inferences about a population. You will never
have to calculate the population standard deviation. You are either given its value
or you do not know it.
The p-value
What is the p-value? You will find this value under a variety of names, some of
these are; critical level, the probability value and the associated probability.
Suppose, for a given null hypothesis H0 , we calculate a test statistic, say ktest , then
You can also say the p-value is the smallest value of α for which the test results are
statistically significant.
The probability value is commonly used in statistical computer packages. Once
calculated you do not need statistical tables.
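As an illustration of how a package might report a p-value for a Z test, the sketch below computes p-values from the standard normal distribution using `statistics.NormalDist`. The test statistics are those of Examples 5.7 and 5.9; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def p_value_two_tailed(z):
    """P(|Z| >= |z|) under H0."""
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))

def p_value_right_tailed(z):
    """P(Z >= z) under H0."""
    return 1 - STANDARD_NORMAL.cdf(z)

# Example 5.7: two tailed test with Zcalc = 2.6 and alpha = 0.005.
p_57 = p_value_two_tailed(2.6)
print(p_57 < 0.005)  # False: fail to reject H0, as with the critical value

# Example 5.9: right tailed test with alpha = 0.05.
z_59 = (165.2 - 162.5) / (6.9 / sqrt(50))
p_59 = p_value_right_tailed(z_59)
print(p_59 < 0.05)  # True: reject H0, as with the critical value
```

Comparing the p-value with α leads to the same decision as comparing the test statistic with the critical value.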
Exercise 5.4
1. Suppose it is known that a variable X is normally distributed with a mean
of 340 minutes. A random sample of 20 observations has an average of 332 minutes
with a standard deviation of 43 minutes. Test, at the 0.025
level of significance, the hypothesis that µ = 340 minutes against the alternative µ < 340
minutes.
4. A manufacturer claims that the average life of batteries produced by his firm
is at least 30 months. You disagree, contending that the average life of the
batteries is less than 30 months. A random sample of 12 batteries has a
mean of 38.7 months and a standard deviation of 18 months. Perform the
appropriate hypothesis test. Use a significance level of 0.05.
5. The manufacturer of an over the counter pain reliever claims that its product
brings pain relief to headache sufferers in less than 3.5 minutes on average. To
be able to make this claim in its television advertisements the manufacturer
was required by a particular television network to present statistical evidence
in support of the claim. The manufacturer reported that for a random sample of
50 headache sufferers, the mean time to relief was 3.3 minutes and the standard
deviation was 66 seconds. Do these data support the manufacturer’s claim?
Test using α = 0.05.
6. What is the advantage of using the p-value over the critical value?
5.6 Summary
In this chapter we discussed the concept of estimation and hypothesis testing. These
two concepts fall under the topic statistical inference. In statistical inference we use
sample data to make inferences about the population from which the sample came.
You should now be able to compute point and interval estimates of population
parameters. You should also be able to test a variety of hypotheses, depending on the
question.
Chapter 6
6.1 Introduction
A correlation analysis is the study of the strength of the relationship between any
pair of variables. Correlation measures how strongly pairs of variables are
related. The term regression analysis on the other hand describes a collection
of statistical techniques that quantifies how one variable depends on another (or
several other variables). Regression analysis is now perhaps the most widely used
method of modelling relationships.
Note
A scatter plot is made up of the axes and the points where values meet. The points
should not be joined together. An example of a scatter plot is shown in Figure 6.1.
Example 6.1
The incomes and amounts spent on entertainment by a sample of individuals at a
bar was recorded and a scatter plot constructed.
The scatter plot shows that as income increases the amount spent on entertainment
also increases.
It is important to construct a scatter plot in order to get an understanding of what
you will expect the relationship to be.
6.3 Correlation
Correlation is the intensity or strength of the relationship between two variables.
It is a measure of the extent to which variables are related or associated. If the
correlation between two variables is zero, then the two variables are not linearly
related. On the other hand, a correlation of 1 means that there is a perfect linear
relationship between the two variables. Here, “perfect” means an exact relationship.
How is the correlation calculated? The population correlation coefficient is
\[ \rho_{xy} = \frac{cov(X, Y)}{\sqrt{var(X)\,var(Y)}}. \]
We rarely deal with population parameters as they stand. We often estimate the
population parameters on the basis of samples. In this particular case, ρxy is esti-
mated from a sample, say (x1 , y1 ), (x2 , y2 ), ..., (xn , yn ), to give ρ̂xy .
There are other measures of association, but for now we shall work with the one
above. ρxy is confined to the interval −1 ≤ ρxy ≤ 1. If ρxy = −1, then there is
an exact inverse linear relationship between x and y. If x increases, then y decreases and
vice-versa.
Example 6.2
Consider the number of expensive goods n sold by a company (which got the goods
at a very cheap cost price) and the profit p. ρnp ≈ 1 in this case. This means
that the more the sales, the higher the profit. (“≈” here means approximately.) If
ρnp ≈ 0, then we would conclude that there is no linear relationship between the profit
and number of sales.
On the other hand, ρnp ≈ −1 suggests that the more the sales the lower the profit.
The product is probably being sold at less than the cost price.
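Let us look at the sample correlation coefficient ρ̂xk xk for any k. Then:
\[
\begin{aligned}
\hat{\rho}_{x_k x_k}
&= \frac{\sum_{i=1}^{n} [X_{ki} - \bar{X}_k][X_{ki} - \bar{X}_k]}
        {\sqrt{\sum_{i=1}^{n} [X_{ki} - \bar{X}_k]^2 \times \sum_{i=1}^{n} [X_{ki} - \bar{X}_k]^2}} \\
&= \frac{\sum_{i=1}^{n} [X_{ki} - \bar{X}_k]^2}
        {\sqrt{\left[\sum_{i=1}^{n} [X_{ki} - \bar{X}_k]^2\right]^2}} \\
&= \frac{\sum_{i=1}^{n} [X_{ki} - \bar{X}_k]^2}
        {\sum_{i=1}^{n} [X_{ki} - \bar{X}_k]^2} \\
&= 1
\end{aligned}
\]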
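This gives us the matrix:
\[
\hat{\rho} =
\begin{pmatrix}
1 & \hat{\rho}_{x_1 x_2} & \cdots & \hat{\rho}_{x_1 x_p} \\
\hat{\rho}_{x_2 x_1} & 1 & \cdots & \hat{\rho}_{x_2 x_p} \\
\vdots & \vdots & \ddots & \vdots \\
\hat{\rho}_{x_p x_1} & \hat{\rho}_{x_p x_2} & \cdots & 1
\end{pmatrix}
\]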
Example 6.3
Suppose you are given three variables z1 , z2 and z3 representing the coded tensile
strength, melting point and amount of Titanium in a new alloy respectively. Use
the following data to calculate the correlation matrix.
z1 z2 z3
12 3 0.2
23 7 0.8
9 2 0.1
30 10 1.0
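\[
\hat{\rho}_{z_1 z_2}
= \frac{\sum_{i=1}^{n} [z_{1i} - \bar{z}_1][z_{2i} - \bar{z}_2]}
       {\sqrt{\sum_{i=1}^{n} [z_{1i} - \bar{z}_1]^2 \times \sum_{i=1}^{n} [z_{2i} - \bar{z}_2]^2}}
= 0.999
\]
\[
\hat{\rho}_{z_1 z_3}
= \frac{\sum_{i=1}^{n} [z_{1i} - \bar{z}_1][z_{3i} - \bar{z}_3]}
       {\sqrt{\sum_{i=1}^{n} [z_{1i} - \bar{z}_1]^2 \times \sum_{i=1}^{n} [z_{3i} - \bar{z}_3]^2}}
= 0.993
\]
\[
\hat{\rho}_{z_2 z_3}
= \frac{\sum_{i=1}^{n} [z_{2i} - \bar{z}_2][z_{3i} - \bar{z}_3]}
       {\sqrt{\sum_{i=1}^{n} [z_{2i} - \bar{z}_2]^2 \times \sum_{i=1}^{n} [z_{3i} - \bar{z}_3]^2}}
= 0.988
\]
\[
\hat{\rho} =
\begin{pmatrix}
1 & 0.999 & 0.993 \\
0.999 & 1 & 0.988 \\
0.993 & 0.988 & 1
\end{pmatrix}
\]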
The correlation between any two of the variables is very high meaning that the
variables are highly related. They either increase or decrease together. Addition of
Titanium increases the tensile strength and melting point of the alloy. The process
seems to produce an alloy whose strength can be increased or decreased, by changing
the amount of Titanium.
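The correlation matrix in Example 6.3 can be checked numerically. The following is a minimal sketch in plain Python; the function names sample_correlation and correlation_matrix are illustrative choices, not part of the notes. It applies the sample correlation formula to each pair of columns and should reproduce the values 0.999, 0.993 and 0.988 after rounding to three decimals.

```python
from math import sqrt

# Data from Example 6.3: coded tensile strength, melting point, titanium.
z1 = [12, 23, 9, 30]
z2 = [3, 7, 2, 10]
z3 = [0.2, 0.8, 0.1, 1.0]


def sample_correlation(x, y):
    """Sum of cross-deviations divided by the square root of the
    product of the two sums of squared deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


def correlation_matrix(columns):
    """Square matrix of pairwise sample correlations."""
    return [[sample_correlation(a, b) for b in columns] for a in columns]


rho_hat = correlation_matrix([z1, z2, z3])
for row in rho_hat:
    print([round(value, 3) for value in row])
```

The matrix is symmetric with ones on the diagonal, so only the three entries above the diagonal need to be computed by hand.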
Exercise 6.1
Consider the following data collected on two variables which are suspected to be
related. The variable x represents the grade scores of students in a class and y the
number of students with the same grade.
Grade x 2 5 9 3 6 7
Frequency y 34 45 59 40 50 48
The data above shows the cross-tabulation relating the two variables x and y. From
this illustration, it is clear that there is a linear relationship between the two vari-
ables x and y. As one variable increases, the other variable also increases. This
means that as the grade increases the number of students with high grades also
increases.
Calculate the correlation coefficient of the grade and the frequency. Comment on
your result.
When we calculated the correlation coefficient we quantified the strength of the re-
lationship between two variables. In regression analysis we study how one variable
(the dependent variable) depends on other variables (the independent variables). In this
course we will introduce the case of one independent variable.
In real life situations we are usually interested in the relationship between variables.
For example, it is difficult to study the impact or effect of salary increments, food
price increases, etc. on inflation by using descriptive techniques. Thus, regression
analysis equips you with a way of studying such situations.
The term simple implies that a single independent variable x is involved and the
term linear implies linearity in the parameters.
Y = f (X) (6.2)
Table 6.1: Names given to response and explanatory variables in Regression Analysis
Y X
(a) Predictand Predictor
(b) Regressand Regressor
(c) Dependent variable Independent variable
(d) Effect variable Causal variable
(e) Endogenous variable Exogenous variable
(f) Target variable Control variable
Each pair of the above terms is appropriate for a particular use of regression analysis.
For example, the terminology in (a) is often used if the purpose of the regression is
prediction; pairs (b), (c) and (d) are used by different applied researchers in their
discussion of regression models; (e) is usually used in studies of causation or
causality; while pair (f) is more appropriate in control problems.
Exact Relationships
An exact relationship is a relationship of the form:
yi = β0 + β1 xi (6.3)
where the subscript i (for i = 1, 2, · · · , n) refers to the ith observation. What does
this mean? Well, for any value of xi the yi value will be equal to some constant
value β0 added to the β1 × xi . β0 and β1 are constants. yi is determined by xi , i.e.,
a unit change in X causes a change equal to β1 in Y .
Note
The variables X and Y can be random or deterministic i.e., non-random.
Generally this equation can be expressed as follows:
y = β0 + β1 x (6.4)
where x and y are possible values of X and Y respectively. An example of an exact
relationship follows.
Exercise 6.2
The relationship between the area of a square A and the length of one side L is
given by:
Area = β0 + β1 × length²
where β0 = 0 and β1 = 1. This is an exact relationship. Figure 6.2 gives a plot of the
relationship between Area [A] and the square of the length [L²].
These are both exact relationships. In Figure 6.2, we plotted A versus L², which
gives a linear relationship of the form y = β0 + β1 x. On the other hand, in Figure 6.3
we plotted A versus L, and we notice that, this time, there is a quadratic relationship.
Statistical Relationships
A statistical relationship, unlike an exact relationship, is not a perfect one, that is,
it does not give unique values of Y for a given value of X, but can be described ex-
actly in probabilistic terms. For instance, consider the following regression model
showing a statistical relationship between Y and X which is no longer exact because
of the error term εi :
Yi = β0 + β1 Xi + εi . (6.5)
The variable εi is a value added to the equation to make the two sides of the equation
equal. The term εi is called the error term. The error term is usually assumed
to have a normal distribution with mean 0 and variance σ². The relationship be-
tween Y and X in Equation 6.5 is called a stochastic or statistical relationship
because of the presence of the random error term needed to make the equation exact.
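The difference between an exact and a statistical relationship can be seen by simulation. The sketch below is only an illustration: the values of BETA_0, BETA_1 and SIGMA are hypothetical, chosen for the example and not taken from the notes. For each X it computes the exact value β0 + β1 X and then an observed value that adds a normally distributed error with mean 0.

```python
import random

# Hypothetical parameter values, chosen only for illustration.
BETA_0 = 2.0   # intercept
BETA_1 = 0.5   # slope
SIGMA = 1.0    # standard deviation of the error term


def simulate(xs, seed=0):
    """Return one simulated Y per X using Y = BETA_0 + BETA_1 * X + error."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [BETA_0 + BETA_1 * x + rng.gauss(0.0, SIGMA) for x in xs]


xs = [1, 2, 3, 4, 5]
exact = [BETA_0 + BETA_1 * x for x in xs]      # exact relationship
observed = simulate(xs)                         # statistical relationship
errors = [y - e for y, e in zip(observed, exact)]

for x, e, y in zip(xs, exact, observed):
    print(f"x={x}  exact={e:.2f}  observed={y:.2f}")
```

Changing the seed gives a different set of observed values for the same X values, which is what "does not give unique values of Y for a given value of X" means in practice.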
Exercise 6.3
[Statistical relationship] A group of students are interested in evaluating the ad-
vantages and disadvantages of different study patterns and their effect on their
performance. Consider Y , the mark a student gets after an examination, and X1 ,
the number of hours the student puts into reading for the examination.
The variable X1 was chosen by the students because it appeared to con-
tribute a lot to the examination mark. A possible equation to represent the rela-
tionship between Y and X1 is given as:
y = β0 + β1 x1 + ε (6.7)
β0 and β1 are unknown constants or regression parameters. If a student puts 0 hours
into studying for the examination, then we expect him/her to get β0 marks. On
the other hand, if a student increases his/her study time by one hour, the model
suggests that the mark should change by β1 . Please note that, in equations such as Equation 6.7,
we will index y and x as yi and xi respectively, when we have the actual observations
x1 , x2 , · · · , xn and y1 , y2 , · · · , yn at hand.
Notice how the points in Figure 6.4 are not always on the line. This is because the
relationship between examination mark and the time spent on studying is not exact.
The students could have added x2 , the number of books the student consulted as a
variable, since this appears to have an impact on the final examination mark. This
would give Equation 6.8.
Y = β0 + β1 X1 + β2 X2 + ε (6.8)
Procedures embraced by regression analysis concern themselves with drawing con-
clusions about these coefficients. An example of the implications of these coefficients
follows from the fact that a positive coefficient means that the more the hours spent
studying, the higher the examination mark, etc. The term ε in the equation is added
to account for the fact that the equation is not exact. If there are p explanatory
variables, a regression equation can be expressed more generally as in Equation 6.9.
y = β0 + β1 x1 + β2 x2 + · · · + βp xp + ε (6.9)
Exercise 6.4
Figure 6.5 shows examples of scatter plots showing the relationship between
an independent variable and a dependent variable. The variable y represents the
dependent variable, while x represents the independent variable. In the first plot
we have a linear relationship. This could be the relationship between Intelligence
Quotient (IQ) x and the mark obtained in an achievement test y. In the second
plot, we have a quadratic relationship. This could be the relationship between time
x and the distance y travelled (from the source) by an object thrown up into the
air.
Figure 6.5: Scatter plots showing two possible relationships between x and y
So, by constructing a scatter plot, we are guided in our decision or choice of the
equation to use.
All procedures and conclusions drawn in regression analysis depend, at least indi-
rectly, on the assumptions of the regression model. A model is what the data
analysts perceive as the mechanism that generates the data on which the regression
analysis is conducted.
The term fitting the model to a set of data involves estimation of the regression
coefficients and formulation of a fitted regression model (i.e. the model with the
estimated coefficients):
ŷ = β̂0 + β̂1 x (6.10)
3. Explaining: system explanation, which variables contribute the most and how
they contribute to the dependent variable.
Regression Assumptions
We shall use the Least Squares (LS) procedure to estimate the parameters in the
model given by Equation 6.5. Although there are other procedures available for
estimating the parameters, we shall, however, only use the LS procedure. We will
also make the following assumptions (these are necessary for inference):
• The εi ’s are random variables with mean zero and constant variance. This is
called the homogeneous variance assumption. Mathematically, this assump-
tion is:
E(εi ) = 0 and Var(εi ) = σ², for i = 1, 2, · · · , n
• The normal theory assumption is imposed on the εi ’s. This is the assump-
tion that the εi ’s are normally distributed with mean zero and variance σ².
Mathematically, this is stated as
εi ∼ N (0, σ²).
E(yi ) = β0 + β1 xi i = 1, 2, . . . , n, (6.11)
E(y|x) = β0 + β1 x. (6.12)
We call this the fitted model because the model now has estimated parameters.
Generally in Statistics, the ‘hat’ notation is used to indicate an estimate. Notice
that we don’t have the error term in Model 6.13. The relationship between x and
E(Y |x) is now an exact one.
Definition 4 (Residual)
Let ri = yi − ŷi . This difference is called a residual.
The distinction between the residual ri and the error term εi is important. The
former measures the deviation of yi from the fitted value ŷi , whereas the latter is the
deviation of yi from the true regression line β0 + β1 xi . Since εi is usually unknown, it is
estimated by ri . The residuals are needed not only for estimating the magnitude of
the random variation in the yi ’s, but also for assessing the appropriateness of the
regression model employed. We shall discuss this later.
Figure 6.6: The observed values yi (marked ‘⋆’), residuals ri and fitted values ŷi
(marked ‘♦’)
Figure 6.6 illustrates what is really happening: the points marked by the ‘⋆’s represent
the observed values, while the fitted values lie on the line indicated by ‘♦’s.
The residual ri is shown clearly as the difference between the observed value yi and
the fitted value ŷi .
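To minimise the RSS, β̂0 and β̂1 must satisfy the conditions:
\[
\frac{\partial}{\partial \hat{\beta}_0}\left(\sum_{i=1}^{n} r_i^2\right) = 0
\quad\text{and}\quad
\frac{\partial}{\partial \hat{\beta}_1}\left(\sum_{i=1}^{n} r_i^2\right) = 0.
\]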
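Thus:
\[
\begin{aligned}
\frac{\partial}{\partial \hat{\beta}_0}\left(\sum_{i=1}^{n} r_i^2\right)
&= \frac{\partial}{\partial \hat{\beta}_0}\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2 \\
&= -2\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) \\
&= -2\sum_{i=1}^{n} y_i + 2n\hat{\beta}_0 + 2\hat{\beta}_1\sum_{i=1}^{n} x_i \\
&= 0.
\end{aligned}
\]
So that:
\[
\sum_{i=1}^{n} y_i - n\hat{\beta}_0 - \hat{\beta}_1\sum_{i=1}^{n} x_i = 0.
\]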
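For β1 we have:
\[
\begin{aligned}
\frac{\partial}{\partial \hat{\beta}_1}\left(\sum_{i=1}^{n} r_i^2\right)
&= \frac{\partial}{\partial \hat{\beta}_1}\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2 \\
&= -2\sum_{i=1}^{n} x_i\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) \\
&= -2\sum_{i=1}^{n} y_i x_i + 2\hat{\beta}_0\sum_{i=1}^{n} x_i + 2\hat{\beta}_1\sum_{i=1}^{n} x_i^2 \\
&= 0.
\end{aligned}
\]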
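Equations 6.14 and 6.15 are called normal equations. Solving for β̂0 in Equation
6.15 gives us the following estimate for β0 in terms of β̂1 . The equation
\[
\sum_{i=1}^{n} y_i = n\hat{\beta}_0 + \hat{\beta}_1\sum_{i=1}^{n} x_i
\]
can be solved to give:
\[
\hat{\beta}_0 = \frac{1}{n}\sum_{i=1}^{n} y_i - \hat{\beta}_1\,\frac{1}{n}\sum_{i=1}^{n} x_i
= \bar{y} - \hat{\beta}_1\bar{x}. \tag{6.16}
\]
Substituting this expression for β̂0 into the other normal equation and solving for β̂1 gives:
\[
\begin{aligned}
\hat{\beta}_1
&= \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}
        {\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2} \\
&= \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}
        {\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} \\
&= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
        {\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{aligned}
\]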
The fitted line is ŷ = β̂0 + β̂1 x. This fitted line is referred to by many different
names in statistics. Some of the names are: the least squares line, fitted regression
line, estimated regression line or just the fitted model.
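The closed-form estimates can be turned into a few lines of code. The sketch below is an illustration only: the function name least_squares and the small data set are hypothetical, made up for the example. It computes β̂0 and β̂1 from the deviation form of the formulas and then checks that the residuals satisfy both normal equations, that is, that Σ ri and Σ xi ri are zero up to rounding.

```python
def least_squares(x, y):
    """Closed-form least squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Both normal equations should hold up to floating-point rounding.
sum_r = sum(residuals)
sum_xr = sum(xi * ri for xi, ri in zip(x, residuals))
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
print(f"sum of residuals = {sum_r:.2e}, sum of x*residuals = {sum_xr:.2e}")
```

If the data lie exactly on a line, the same function returns that line's intercept and slope and every residual is zero.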
Example 6.4
The starting salary S per year of people of different educational background (Ed)
has always been of interest to people going to university. They have always tried to
find out the relationship between these. We expect the starting salary to be directly
related to the educational level, i.e., as the educational level increases, so does the
salary.
As this may not be the case, we shall investigate this suspicion using regression
analysis. Suppose that an individual’s educational level is given a score, then an
appropriate model is given by:
S = β0 + β1 Ed + ǫ
The data on S and Ed were collected and recorded as shown below. Find the esti-
mates for β0 and β1 and discuss your results.
Individual   S ($)    Ed
1            20 000   2.8
2            24 500   3.4
3            23 000   3.2
4            25 000   3.8
5            20 000   3.2
6            22 500   3.4
Solution:
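Calculations give
\[
\begin{aligned}
\hat{\beta}_1 &= \frac{S_{Ed,S}}{S_{Ed,Ed}} \\
&= \frac{448400 - \dfrac{(19.8)(135000)}{6}}{65.88 - \dfrac{19.8^2}{6}} \\
&= \frac{2900}{0.54} \\
&= \$5370.
\end{aligned}
\]
Therefore,
\[
\hat{\beta}_0 = \frac{135000}{6} - 5370 \times \frac{19.8}{6} = \$4779.
\]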
Thus, we have the fitted model Ŝi = 4779 + 5370Edi . How do we interpret this
model? The starting salary is predicted to be $4 779, when the Educational level
score is zero. This may not say much since an educational score of zero does not
apply to this group of people by virtue of their being at University.
Perhaps, of primary interest is the slope (coefficient) which indicates that for a
one-unit increase in educational score, the predicted salary increases by $5370.
For example, for an educational level score of 2.8, the predicted salary is Ŝ =
4779 + 5370 × 2.8 = $19 815.
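The hand calculation in Example 6.4 can be reproduced with the raw-sum formulas. The sketch below is a plain Python check; the variable names and the predict helper are illustrative. Note that without intermediate rounding the slope is about 5370.37 and the intercept about 4777.78; the value 4779 in the text arises from using the slope rounded to 5370.

```python
# Data from Example 6.4: educational score (Ed) and starting salary (S).
ed = [2.8, 3.4, 3.2, 3.8, 3.2, 3.4]
salary = [20000, 24500, 23000, 25000, 20000, 22500]

n = len(ed)
sum_ed = sum(ed)                                    # 19.8
sum_s = sum(salary)                                 # 135000
sum_ed_s = sum(e * s for e, s in zip(ed, salary))   # 448400
sum_ed_sq = sum(e * e for e in ed)                  # 65.88

# Slope and intercept from the computational (raw sums) formulas.
slope = (sum_ed_s - sum_ed * sum_s / n) / (sum_ed_sq - sum_ed ** 2 / n)
intercept = sum_s / n - slope * sum_ed / n


def predict(score):
    """Predicted starting salary for a given educational score."""
    return intercept + slope * score


print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"predicted salary at Ed = 2.8: {predict(2.8):.2f}")
```

The prediction at an educational score of 2.8 agrees with the text's $19 815 to within rounding.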
Exercise 6.5
1. In Example 6.4, remove the last (sixth) observation and estimate β0 and β1 .
2. Predict the salary for someone with an educational score of 2.8.
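The Least Squares estimates of β0 and β1 are unbiased so that we have the following
properties.
\[
\hat{\beta}_0 \sim N\!\left(\beta_0,\ \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right]\right), \tag{6.18}
\]
\[
\hat{\beta}_1 \sim N\!\left(\beta_1,\ \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right). \tag{6.19}
\]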
Now, using the properties of the sampling distributions of β̂0 and β̂1 , inferences
about β0 and β1 can be made. First, however, an estimate of one other unknown
parameter in the regression model is needed. This is an estimate of σ². This estimate
is given by:
\[
\begin{aligned}
\hat{\sigma}^2 \approx s^2 &= \frac{1}{n-2}\sum_{i=1}^{n}(r_i - \bar{r})^2 \\
&= \frac{1}{n-2}\sum_{i=1}^{n} r_i^2 \\
&= \frac{1}{n-2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2,
\end{aligned} \tag{6.20}
\]
since Σ ri = 0 and hence r̄ = 0.
Note that s² = MSE, the Mean Square Error, and SSE = Σ (yi − ŷi )² is the Error Sum
of Squares.
and
\[
\widehat{\operatorname{Var}}(\hat{\beta}_1) = \frac{s^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \tag{6.22}
\]
respectively.
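As a numerical illustration of the variance estimate and of Equation 6.22, the sketch below applies them to the data of Example 6.4. It is a plain Python check with illustrative variable names: it fits the line, forms the residuals, and divides the error sum of squares by n − 2.

```python
from math import sqrt

# Data from Example 6.4: educational score (Ed) and starting salary (S).
ed = [2.8, 3.4, 3.2, 3.8, 3.2, 3.4]
salary = [20000, 24500, 23000, 25000, 20000, 22500]

n = len(ed)
ed_bar = sum(ed) / n
s_bar = sum(salary) / n

# Least squares fit (deviation form of the estimates).
s_xx = sum((e - ed_bar) ** 2 for e in ed)
s_xy = sum((e - ed_bar) * (s - s_bar) for e, s in zip(ed, salary))
b1 = s_xy / s_xx
b0 = s_bar - b1 * ed_bar

# Residuals r_i = y_i - y_hat_i and the estimate s^2 = SSE / (n - 2).
residuals = [s - (b0 + b1 * e) for e, s in zip(ed, salary)]
sse = sum(r ** 2 for r in residuals)
s_squared = sse / (n - 2)

# Estimated variance of the slope, as in Equation 6.22.
var_b1 = s_squared / s_xx

print(f"SSE = {sse:.0f}, s = {sqrt(s_squared):.1f}")
print(f"estimated standard error of the slope = {sqrt(var_b1):.1f}")
```

The residuals sum to zero up to rounding, which is why the centred and uncentred forms of the estimate coincide.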
Let us discuss how we can make inferences about the regression parameters before
proceeding to investigate how well the fitted line fits the data.
Exercise 6.6
1. Deduce the variance of yi .
2. Assuming β̂0 and β̂1 are independent, find the variance of ŷi .
4. State the assumptions under which this test statistic has a t-distribution. The
assumptions about the error must be valid for the conclusions or inferences to
be valid.
6. Find the values for the test statistic that allow a rejection of the null
hypothesis. We find the critical value tn−2 (α/2), such that we reject H0 if
t < −tn−2 (α/2) or if t > tn−2 (α/2).
Example 6.6
We assume that a two-tailed test is appropriate and we use the data in Example 6.4
to test the hypothesis
H0 : β1 = 0 versus
H1 : β1 ≠ 0
at α = 0.05.
Calculation of the variance estimate:
\[
s = \sqrt{\frac{7425926}{4}} \approx 1363
\]
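The remaining arithmetic of the test can be sketched as follows. This is an illustration under stated assumptions: the statistic is taken to be the usual t = β̂1 divided by the square root of the estimated variance in Equation 6.22, and the critical value 2.776 for 4 degrees of freedom at α = 0.05 (two-tailed) is read from a standard t table rather than computed. The summary numbers are those of Examples 6.4 and 6.6.

```python
from math import sqrt

# Summary quantities taken from Examples 6.4 and 6.6.
n = 6
b1_hat = 2900 / 0.54        # least squares slope (about 5370)
s_xx = 0.54                 # sum of squared deviations of Ed
sse = 7425926               # error sum of squares
s = sqrt(sse / (n - 2))     # estimate of sigma (about 1363)

# Assumption: two-tailed 5% critical value for 4 degrees of freedom,
# read from a standard t table.
T_CRITICAL = 2.776

se_b1 = s / sqrt(s_xx)              # square root of Equation 6.22
t_stat = (b1_hat - 0) / se_b1       # test statistic under H0: beta1 = 0
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"standard error = {se_b1:.1f}, t = {t_stat:.3f}")
print("reject H0" if reject_h0 else "do not reject H0")
```

With these numbers the statistic lies only slightly beyond the critical value, so the conclusion is sensitive to the choice of α.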
Exercise 6.7
1. If β̂1 had been -0.32, what conclusions would you have drawn?
2. Test the null hypothesis that the intercept is zero.
\[
Y_i - \bar{Y}
\]
The greater the variability in the data, the larger will be the deviations Yi − Ȳ,
and the greater is the uncertainty associated with a prediction of Yi made without utilising
knowledge of Xi .
\[
SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2
\]
where SST stands for Total Sum of Squares. If there is a lot of variability in the Yi ,
then SST is large.
ri = Yi − Ŷi
If all the Yi values fall on the regression line, all the deviations ri , will be zero.
The larger the deviations ri the greater the uncertainty associated with a prediction
utilising knowledge of the independent variables Xi .
The conventional measure of variability around the fitted regression is the Error
Sum of Squares (SSE) which is calculated as follows:
\[
SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} r_i^2 \tag{6.25}
\]
If all the Yi values fall on the regression line, SSE will be zero.
We can show that SSR is the sum of squares involving the deviations:
\[
\hat{Y}_i - \bar{Y},
\]
which represent the difference between each fitted value and the mean of the fitted values
(this mean equals Ȳ).
\[
SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 \tag{6.27}
\]
SSR can be viewed as a measure of the effect of the regression relation in reducing
the variability of Yi . If SSR = 0, the regression calculation will not reduce variability
at all. SSR can be interpreted as the amount of variation in Y explained by the
regression.
\[
SSR = \frac{\left(\sum_{i=1}^{n} Y_i X_i - n\bar{Y}\bar{X}\right)^2}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} \tag{6.31}
\]
and
\[
SSE = SST - SSR \tag{6.32}
\]
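The shortcut formulas 6.31 and 6.32 can be compared with the direct definitions of SSR and SSE. The sketch below is a plain Python check with illustrative variable names, using the data of Example 6.4. It computes each sum of squares both ways and confirms the decomposition SST = SSR + SSE.

```python
# Data from Example 6.4, with X = educational score and Y = starting salary.
x = [2.8, 3.4, 3.2, 3.8, 3.2, 3.4]
y = [20000, 24500, 23000, 25000, 20000, 22500]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Shortcut forms: SST from raw sums, SSR from Equation 6.31, SSE from 6.32.
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
s_xx = sum(xi * xi for xi in x) - n * x_bar ** 2
sst = sum(yi * yi for yi in y) - n * y_bar ** 2
ssr_shortcut = s_xy ** 2 / s_xx
sse_shortcut = sst - ssr_shortcut

# Direct definitions, using the fitted values from the least squares line.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]
ssr_direct = sum((fi - y_bar) ** 2 for fi in fitted)
sse_direct = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

print(f"SST = {sst:.0f}")
print(f"SSR = {ssr_shortcut:.0f} (shortcut), {ssr_direct:.0f} (direct)")
print(f"SSE = {sse_shortcut:.0f} (shortcut), {sse_direct:.0f} (direct)")
```

The SSE obtained here is the 7425926 that appears under the square root in Example 6.6.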
Mean Squares
A sum of squares divided by the degrees of freedom is called a mean square. For
example, s² = MSE. The two important mean squares are the regression mean
square denoted by MSR and the error mean square denoted by MSE.
Thus:
\[
MSR = \frac{SSR}{1} \tag{6.33}
\]
and
\[
MSE = \frac{SSE}{n-2} = s^2 \tag{6.34}
\]
\[
E[MSE] = \sigma^2
\]
It can also be shown that:
\[
E[MSR] = \sigma^2 + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2
\]
Thus, when β1 = 0, E[MSR] = σ², so both MSE and MSR have the same expected
value under this condition. On the other hand, when β1 ≠ 0, the term
β1² Σ (Xi − X̄)² will be positive and E[MSR] > E[MSE]. Hence, if β1 ≠ 0, MSR
will tend to be larger than MSE.
Exercise 6.8
1. What does it mean if the fitted model gives you SSE = 0.
2. If SSR is zero, what does it tell you about the model?
3. In the Simple Linear Regression model, suppose that SST has 14 degrees of
freedom. Deduce the SSE and SSR degrees of freedom.
Source of variation   SS                    df      MS                  F
Regression            SSR = Σ(Ŷi − Ȳ)²      1       MSR = SSR/1         F = MSR/MSE
Error                 SSE = Σ(Yi − Ŷi)²     n − 2   MSE = SSE/(n − 2)
Total                 SST = Σ(Yi − Ȳ)²      n − 1
Table 6.2: The basic ANOVA table for simple linear regression
From the ANOVA table, we can get the variance estimate s² and test the hypothesis that
there is a regression relationship. How do we do this? The ratio F in the ANOVA
table has what we call Fisher’s F distribution with 1 and n − 2 degrees of freedom
if the assumptions of the model hold and β1 = 0.
If F is near 1, then MSR and MSE are approximately equal. A value of F much larger than 1
suggests that β1 ≠ 0. Thus, an upper-tail test is appropriate.
Figure 6.9: The general form of the statistical decision rule for an F-test
Accept H0 if F ≤ F1,n−2 (1 − α)
Reject H0 if F > F1,n−2 (1 − α).
\begin{align}
R^2 &= \frac{SSR}{SST} \tag{6.35} \\
&= 1 - \frac{SSE}{SST} \tag{6.36}
\end{align}
Thus, R² measures the proportionate reduction in SST associated with the use of
an independent variable.
In the Simple Linear Regression case, we usually refer to the coefficient of deter-
mination as the Coefficient of Simple Determination (R²). Note that R² is the
square of the simple correlation coefficient of the independent and dependent variables.
Adjusted R²
One phenomenon found on adding terms to a regression model is that the R² in-
creases. Although this may be an indication that the extra terms improve the
regression equation, it may also be a reflection of the fact that one is using more
variables to predict the same number of data points. This problem may be taken
into account by examining not only the actual value of R², but also the value of
the adjusted R². This statistic takes into account the number of data points and
variables in the regression equation, by replacing the SSE and Σ (yi − ȳ)² by the
corresponding mean squares, giving
\[
\bar{R}^2 = 1 - \frac{SSE/(n-2)}{\sum_{i=1}^{n}(y_i - \bar{y})^2/(n-1)} \tag{6.37}
\]
which can also be written as
\[
\bar{R}^2 = 1 - \frac{n-1}{n-2}\,(1 - R^2).
\]
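The step between the two forms of the adjusted R² can be filled in as follows. This is a short worked derivation, writing SST for Σ (yi − ȳ)² and using the identity 1 − R² = SSE/SST from Equation 6.36:

\[
\begin{aligned}
\bar{R}^2 &= 1 - \frac{SSE/(n-2)}{SST/(n-1)} \\
&= 1 - \frac{n-1}{n-2}\cdot\frac{SSE}{SST} \\
&= 1 - \frac{n-1}{n-2}\,(1 - R^2).
\end{aligned}
\]

Since (n − 1)/(n − 2) > 1, the adjusted value is never larger than R² itself.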
Example 6.7
Individual X Y
1 2 8.74
2 2 10.53
3 2 10.99
4 2 11.97
5 3 12.83
6 3 14.69
7 3 14.69
8 3 15.30
9 4 16.11
10 4 16.31
11 4 16.46
12 4 17.69
13 5 19.65
14 5 18.86
15 5 19.93
16 5 20.51
Solution
The model is given by: y = β0 + β1 x + ε (price = β0 + β1 salary + ε)
Let us now construct the ANOVA table. First we calculate SST, SSR and SSE.
\[
SST = \sum_{i=1}^{16} y_i^2 - n\bar{y}^2 = 191.194,
\]
\[
SSR = \sum_{i=1}^{16} \hat{y}_i^2 - n\bar{y}^2 = 177.668,
\]
\[
SSE = SST - SSR = 13.526.
\]
SOURCE       df   SS        MS        F
Regression   1    177.668   177.668   183.921
Error        14   13.526    0.966
Total        15   191.194
We can see from this ANOVA table that F is quite large. In fact it leads us to a
rejection of the hypothesis of no regression relationship (verify).
We can compute the coefficient of determination from Table 6.3 above.
Thus:
\[
R^2 = \frac{SSR}{SST} = \frac{177.668}{191.194} = 0.929.
\]
Thus, about 92.9% of the variation in prices (Y) is explained by the regression
model. So, the salary estimates do seem to determine the price of flour. The
adjusted R² = 92.4%.
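The ANOVA quantities of Example 6.7 can be recomputed from the raw data. The sketch below is a plain Python check with illustrative variable names. It evaluates the sums of squares, the mean squares, the F ratio, R² and the adjusted R². Working without intermediate rounding gives F of about 183.88; the slightly larger value in the table comes from dividing by MSE rounded to 0.966.

```python
# Data from Example 6.7: four observations of Y at each of X = 2, 3, 4, 5.
x = [2] * 4 + [3] * 4 + [4] * 4 + [5] * 4
y = [8.74, 10.53, 10.99, 11.97, 12.83, 14.69, 14.69, 15.30,
     16.11, 16.31, 16.46, 17.69, 19.65, 18.86, 19.93, 20.51]

n = len(y)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# Sums of squares.
sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = s_xy ** 2 / s_xx
sse = sst - ssr

# Mean squares and the F ratio (1 and n - 2 degrees of freedom).
msr = ssr / 1
mse = sse / (n - 2)
f_stat = msr / mse

# Coefficient of determination and its adjusted version.
r_squared = ssr / sst
adj_r_squared = 1 - (n - 1) / (n - 2) * (1 - r_squared)

print(f"SST = {sst:.3f}, SSR = {ssr:.3f}, SSE = {sse:.3f}")
print(f"MSE = {mse:.3f}, F = {f_stat:.2f}")
print(f"R^2 = {r_squared:.3f}, adjusted R^2 = {adj_r_squared:.3f}")
```

The same script can be reused for Exercise 6.9 by replacing the two data lists.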
The correlation coefficient (r) measures the strength of the linear relationship be-
tween the dependent variable and all the independent variables. It is computed from
the formula:
\begin{align}
r &= +\sqrt{R^2} \tag{6.38} \\
&= +\sqrt{\frac{SSR}{SST}}, \tag{6.39}
\end{align}
where SSR is the Regression Sum of Squares and SST the Total Sum of Squares. R
close to 1 means that there is a good linear relationship between the dependent and
independent variables.
Exercise 6.9
A student recorded the 6 test marks she obtained after devoting a particular number
of hours of study. The marks are:
3. Show that F = 42.89, and test the null hypothesis that β1 = 0 at α = 0.05.
6.11 Conclusion
In this chapter, we focused on Simple Linear Regression. We discussed the estima-
tion of the parameters using the Least Squares technique. There are other methods
available for estimating these parameters. We shall meet these in future modules.
We went on to discuss how to check if any of the assumptions which enable us to use
the least squares approach have been violated. This is often ignored by “pseudo-
statisticians”. Some blame this abuse of Regression Analysis on computers which
allow you to use statistical computer packages without looking at the underlying
theory behind the techniques.
Exercise 6.10
1. Find the relationship between the correlation coefficient r and β1 .
Field: 1 2 3 4 5 6 7 8 9 10
Level: 4.5 17.7 -16.6 -14 18.6 -10.6 5.8 -8.1 -5.2 7.8
Yield: 75 112 38 120 105 52 116 118 105 110
4. For each of the following pairs of variables, explain whether an exact or sta-
tistical relation would most likely hold:
6. The following information was recorded over 10 years, the amount of rainfall,
maize production and maize price.