STAT2012 Notes - study guide

Introduction to statistics (University of the Witwatersrand, Johannesburg)


University of Witwatersrand

STAT2012

An Introduction to
Mathematical Statistics

Lecturers: Raeesa Ganey & Anna Kaduma Gumbie


2018


Contents

1 What is Statistics?
    1.1 Introduction
    1.2 Population and Samples
    1.3 Types of data
    1.4 Discrete and Continuous Variables
    1.5 Uses of Statistics
    1.6 Conclusion

2 Descriptive Statistics
    2.1 Graphical techniques
        2.1.1 Tabulation
        2.1.2 Pie Chart
        2.1.3 Bar Chart
        2.1.4 Multiple Bar Graph
        2.1.5 Component Bar Graph
        2.1.6 Percentage Component Graph
        2.1.7 Line graphs
        2.1.8 Frequency Distribution
        2.1.9 Histogram
        2.1.10 Cumulative Frequency Curve
    2.2 Summary Statistics
        2.2.1 Measures of location
        2.2.2 Measures of dispersion
        2.2.3 Box and whisker plot
    2.3 Conclusion

3 Probability
    3.1 Assigning probabilities
    3.2 Probability of events
    3.3 Addition of probabilities
    3.4 Conditional probability
    3.5 Independent events
    3.6 Relative frequency approach to probability
    3.7 Counting methods
        3.7.1 Permutations
        3.7.2 Permutations when objects are not distinct
        3.7.3 Combinations

4 Random Variables and Distributions
    4.1 Introduction
    4.2 Discrete Random Variables
    4.3 Continuous Random Variables
    4.4 Empirical distributions
    4.5 Discrete distributions
    4.6 Continuous distributions
    4.7 Conclusion

5 Estimation and Hypothesis Testing
    5.1 Introduction
    5.2 Estimation
    5.3 Confidence Intervals
        5.3.1 Using Student's t Tables
    5.4 Sampling to a Desired Precision
    5.5 Hypothesis Testing
        5.5.1 Two-sided tests
        5.5.2 One-sided tests (right tailed)
        5.5.3 One-sided tests (left tailed)
    5.6 Summary

6 Correlation and Regression Analysis
    6.1 Introduction
    6.2 Scatter Plots
    6.3 Correlation
        6.3.1 Correlation Coefficient
        6.3.2 Correlation Matrix
    6.4 Simple Linear Regression Analysis
        6.4.1 Types of Relationship
    6.5 The Least Squares Technique
    6.6 Properties of Estimates
    6.7 Making Inferences about β0 and β1
    6.8 Analysis of Variance of Simple Linear Regression
    6.9 The Basic ANOVA Table
    6.10 The Coefficient of (simple) Determination
    6.11 Conclusion


Preface

Whether you are taking this course by choice, as a filler, or because it is required,
the aim of this book is to introduce you to Mathematical Statistics and to show
you its applications. The notes attempt to clarify the links between the techniques
taught and the interpretation of the results obtained. It is not assumed that
you are familiar with any statistical techniques at the outset of this course.

The best way to study this book is to work systematically through all the
chapters. Once you understand a chapter, you should be able to answer the
questions at its end before moving on to the next chapter.

These notes have been revised from the notes compiled by Charles Chimedza and
Nothabo Ndebele from the University of Witwatersrand who previously taught this
course in 2017. The chapter on Probability has been adapted from the Advanced
Level Mathematics Statistics 1 written by Steve Dobbs and Jane Miller, published
by Cambridge University Press 2002.


Chapter 1

What is Statistics?

1.1 Introduction
What is statistics? It is not primarily the adding up of numbers to conclude
that there were 33 346 registered students at Wits in 2014, 30% of them science
students, or that the demographic profile of first-year students consists of more
females (54%) than males (46%). Rather, statistics can be described as the science
of decision making in the face of uncertainty. The emphasis is placed not so much
on the collection of data as on drawing conclusions from the data.

Mathematical statistics is the application of mathematical concepts to decision
making in statistics. In this course, the focus will be placed on parts from Linear
Algebra and Mathematical Analysis.

1.2 Population and Samples


A population is the collection of items under investigation. It may be finite or
infinite. When doing a statistical analysis, however, it is not always practical to
collect data on the entire population, since the population could be very large.
Examples include full information on every birth in South Africa in a particular
year, or the details of all the stars in the sky.

Given the limitations of collecting data on a whole population, a sample is drawn
instead and conclusions are based on it. A sample is a subset of the population,
chosen so that it represents the population. When it does, conclusions made about
the sample can be generalised to the entire population. Great care should therefore
be taken when selecting a sample and collecting the data. Sampling is the statistical
technique used to select a sample from a population.


Descriptive statistics is used to summarise what is happening in the sample, so
that conclusions can be drawn without scrutinising each and every observation.
The data is summarised in a meaningful way, so that patterns or trends can be
seen. Examples of descriptive statistics include graphs, tables and numerical
summaries such as the mean, median and mode.

Inferential statistics, on the other hand, makes inferences and predictions about a
population based on a collected sample. The estimation of parameters and the testing
of statistical hypotheses are the primary methods of inferential statistics.

1.3 Types of data


This section looks at how data is measured, that is, at the different scales
of measurement.

Measurement is the assignment of numerals to objects or events according to
certain rules. The four common types of measurement scales are:

• Nominal,

• Ordinal,

• Interval and

• Ratio.

Nominal
This is the weakest of the four measurement scales. It distinguishes one
object or event from another on the basis of a 'name'. Examples are classifying
items coming off an assembly line as defective or non-defective, or classifying
a bank account as open or closed. The 'naming' can be coded, e.g. use the value 1
if the bank account is open and the value 2 if it is closed. Data of this type
is typically referred to as count data, frequency data or categorical data.

Ordinal
Objects or events are distinguished on the basis of the relative amounts of some
characteristic they possess. These measurements enable observations to be ranked.
An example is ranking jerseys of different sizes from smallest to largest by
assigning the smallest jersey rank = 1 and increasing the rank by 1 up to the
largest jersey with rank = 4, say. Note that the magnitude of the difference
between measurements is not reflected in the rank.

Interval Scale
This scale is applied when objects or events can be distinguished from one another
and ranked, and when the differences between measurements have meaning. Suppose
that four objects A, B, C and D are assigned scores of 20, 30, 60 and 70 respectively.
If the interval scale is used, then we can say that the difference between A and B
is equal to the difference between C and D, i.e. there are equal differences in the
amount of the trait or characteristic being measured. However, the ratios of the
scores cannot be used. The score of 60 for C does not mean that C has twice as much
of the trait as B, which has a score of 30. The values 20, 30, 60 and 70 are scores
that have been assigned, not measurements in the usual sense.

Ratio scale
This scale has all the properties of the scales above and the additional property
that ratios are meaningful. It includes the familiar measurements of height,
weight, etc., that is, quantitative data. Both the differences in magnitude and the
ratios can be used for analysis, as they have meaning attached to them.

Exercise 1.1
1. The banks in South Africa are assigned positions according to their reported
profits. The bank with the highest profit is given position 1, the bank with
the second highest profit is given position 2, etc. What type of data is this?

2. If the actual profit for each bank is recorded, what type of data would it be?

1.4 Discrete and Continuous Variables


A discrete variable is a quantity that can assume only values from a prescribed set.
The set of values is called the domain of the variable. Think of throwing a die: the
outcome can only take the values 1, 2, 3, 4, 5 and 6. The values a discrete variable
can take are typically integers. The type of data which results from this
measurement is discrete data.

A continuous variable can take any value between two given values. An example
is the height of a student, which could be 158.7 cm, 164.2 cm or 168.9 cm.
The values lie on a continuous scale. The type of data which results from this
measurement is continuous data.


1.5 Uses of Statistics


One could fill this entire book with the uses of statistics and its applications
in different fields. From forecasting future trends in the exchange rate, to
modelling rainfall patterns to support food security, to understanding astronomical
data, the uses and applications are endless. Some of you might nevertheless
question why you have to do a course in statistics and how it will actually help in
your career. Although this is an introductory course, the techniques taught will
prove to be very useful and can easily be applied to daily life.

1.6 Conclusion
Statistics plays an important role in almost every field as it helps in decision making.
Before any decision can be made about a business, a society or a problem, it is often
necessary to gather enough data or information to support that decision.
Statistics helps in collecting information in a scientific and systematic manner,
and in making decisions based on descriptive and inferential statistics.


Chapter 2

Descriptive Statistics

This chapter on descriptive statistics is split into two parts: graphical techniques,
which include tables and graphs, and summary statistics, which involve numerical
calculations.

2.1 Graphical techniques


A picture is worth a thousand words, or numbers, and there is no better way of
getting a ’feel’ for the data than to display them in a figure or graph. The general
principle should be to convey as much information as possible in the figure, with
the constraint that it is not overwhelmed by too much detail. Graphical techniques
are therefore representations of data such that the main features of the data are
captured.

2.1.1 Tabulation
Data is typically presented in tabular form, which summarises it in a way that is
simple to understand and convenient for further analysis. Suppose data is
recorded on the gender of each lecturer in a school, and the results are presented in
the following way:

male male female male female female male
female male female male male female male

By looking at this information, the distribution of genders is not immediately clear.
It becomes even more difficult with more data, say 1000 lecturers. There is
a better way of presenting this information without losing any of its detail,
namely tabulation.


Gender    Number of lecturers
Male      8
Female    6
Total     14

Table 2.1: Number of lecturers by gender

This is especially useful if the number of observations is large and the distinct
categories are few. This leads to the idea of contingency tables. A contingency
table is a convenient way of summarising data with more than one variable. It
consists of rows and columns that represent the categories of the variables. Suppose
more information on the lecturers is recorded, as in the table below.

Lecturer   Gender   School
1          male     Maths
2          male     Statistics
3          female   Maths
4          male     Statistics
5          female   Maths
6          female   CSAM
7          male     CSAM
8          female   CSAM
9          male     Statistics
10         female   Statistics
11         male     CSAM
12         male     Maths
13         female   Maths
14         male     Maths

Table 2.2: Number of lecturers by gender and school

This information can be presented in a contingency table as shown in Table 2.3.

School / Gender   Male   Female   Total
Maths             3      3        6
Statistics        3      1        4
CSAM              2      2        4
Total             8      6        14

Table 2.3: Contingency table of lecturers by gender and school


The contingency table in this example is a 3 x 2 table as there are 3 rows and
2 columns. The number of rows and columns is determined by the number of
categories in each variable. For gender, there are 2 and for school, there are 3.
Contingency tables are sometimes referred to as cross tabulations and are used for
data that has two variables.
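
As a rough illustration of how the listing in Table 2.2 turns into the cross tabulation in Table 2.3, the sketch below counts the (school, gender) pairs with Python's standard library. The function name `contingency_table` and the variable names are illustrative choices, not part of the notes.

```python
from collections import Counter

# (gender, school) for the 14 lecturers listed in Table 2.2
lecturers = [
    ("male", "Maths"), ("male", "Statistics"), ("female", "Maths"),
    ("male", "Statistics"), ("female", "Maths"), ("female", "CSAM"),
    ("male", "CSAM"), ("female", "CSAM"), ("male", "Statistics"),
    ("female", "Statistics"), ("male", "CSAM"), ("male", "Maths"),
    ("female", "Maths"), ("male", "Maths"),
]


def contingency_table(records):
    """Count how many records fall in each (school, gender) cell."""
    return Counter((school, gender) for gender, school in records)


cells = contingency_table(lecturers)
schools = ["Maths", "Statistics", "CSAM"]
genders = ["male", "female"]

# Print one row per school, then the column totals.
print(f"{'School':<12}{'Male':>6}{'Female':>8}{'Total':>7}")
for school in schools:
    row = [cells[(school, g)] for g in genders]
    print(f"{school:<12}{row[0]:>6}{row[1]:>8}{sum(row):>7}")
col_totals = [sum(cells[(s, g)] for s in schools) for g in genders]
print(f"{'Total':<12}{col_totals[0]:>6}{col_totals[1]:>8}{sum(col_totals):>7}")
```

Each cell is simply the number of lecturers sharing one school and one gender, so the row and column totals must both add up to the 14 lecturers.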

2.1.2 Pie Chart


A pie chart is a circle that is divided into segments, like a pie cut into pieces from
the centre outwards. Each segment represents one or more values taken by a variable
and is used to illustrate a proportion. Pie charts are a useful way to organise data
in order to see the size of components relative to the whole.

Table 2.4 lists the 91 staff members working at a company, tabulated by their
qualification. Each category of qualification can be expressed as a proportion and
as a percentage. The proportion is the number in the category divided by the total
number of staff members, e.g. Engineering qualifications give a proportion of 38/91,
Science 24/91, and so on. The percentage is the proportion multiplied by 100.

Qualification   Frequency   Proportion   Percentage
Engineering     38          38/91        41.76 %
Science         24          24/91        26.37 %
Arts            13          13/91        14.29 %
Commerce        8           8/91         8.79 %
Medicine        5           5/91         5.49 %
Other           3           3/91         3.30 %
Total           91          1            100 %

Table 2.4: Table of staff members working in a certain company tabulated by their
qualification.

A pie chart is constructed by using the proportions in Table 2.4. As the pie is a
circle, the angle for each category is the proportion × 360°. The
pie chart is shown in Figure 2.1, with the angles calculated in Table 2.5.


Qualification   Angle (degrees)
Engineering     38/91 × 360 = 150.3
Science         24/91 × 360 = 94.9
Arts            13/91 × 360 = 51.4
Commerce        8/91 × 360 = 31.6
Medicine        5/91 × 360 = 19.8
Other           3/91 × 360 = 11.9

Table 2.5: Calculation of angles in the pie chart
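
The proportion, percentage and angle calculations can be checked with a few lines of code. This is a minimal sketch using the frequencies from Table 2.4; the function name `pie_chart_rows` is a hypothetical helper introduced only for this illustration.

```python
# Frequencies from Table 2.4
qualifications = {
    "Engineering": 38, "Science": 24, "Arts": 13,
    "Commerce": 8, "Medicine": 5, "Other": 3,
}


def pie_chart_rows(freqs):
    """Return (name, frequency, proportion, percentage, angle) per category."""
    total = sum(freqs.values())
    rows = []
    for name, f in freqs.items():
        proportion = f / total
        # percentage = proportion x 100, angle = proportion x 360 degrees
        rows.append((name, f, proportion, 100 * proportion, 360 * proportion))
    return rows


rows = pie_chart_rows(qualifications)
for name, f, prop, pct, angle in rows:
    print(f"{name:<12}{f:>4}{prop:>8.4f}{pct:>8.2f}%{angle:>8.1f}")
```

Because the proportions sum to 1, the percentages sum to 100 and the angles sum to 360 degrees, apart from rounding in the printed values.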

Figure 2.1: Pie chart of qualifications

2.1.3 Bar Chart


A bar chart, or bar graph, is a visual representation of data by means of bars or
blocks placed side by side. Each bar represents the count of one category of the
data. Bar charts can use either the actual frequencies or the proportions, unlike
the pie chart, which only uses proportions.

For the same example as in Table 2.4, a bar chart that uses the frequencies to
construct the bars is given in Figure 2.2.


Figure 2.2: Bar chart of qualifications

2.1.4 Multiple Bar Graph


A multiple bar graph is a bar graph with multiple bars in each category. Consider
the contingency table in Table 2.3. Using the school as the variable and splitting
the frequency by gender gives the multiple bar graph shown in Figure 2.3. When
the gender is taken as the variable and split by school, the multiple bar
graph looks like the one in Figure 2.4. The graphs in Figures 2.3 and 2.4 present
the same data, but display it differently.

2.1.5 Component Bar Graph


Charts help to reveal and explain the main characteristics of data without the need
to browse through all the observations. Another member of the bar chart family
is the component bar graph. The component bar graph, also known as a stacked
bar graph, allows one to examine the composition of several variables over time or
over some other entity. It is important to note that the variables being compared
should have the same units of measurement. In this graph, the bars are stacked on
top of each other. Figure 2.5 shows stacked bar graphs of the data in Table 2.3.


Figure 2.3: Multiple bar graph of number of lecturers by school, split by gender.

Figure 2.4: Multiple bar graph of number of lecturers by gender, split by school.


Figure 2.5: Stacked bar graphs

2.1.6 Percentage Component Graph


The percentage component graph is a component graph constructed using percentages
instead of the observed values or frequencies used in Figure 2.5. Using the
same data as in Table 2.3, percentage bar graphs are constructed and shown in
Figure 2.6. The graph can be constructed in two ways: either as the percentage of
lecturers in each school within a gender group (Table 2.6), or as the percentage of
males and females within each school (Table 2.7).

School / Gender   Male          Female
Maths             3/8 = 37.5%   3/6 = 50%
Statistics        3/8 = 37.5%   1/6 = 17%
CSAM              2/8 = 25%     2/6 = 33%

Table 2.6: Calculating the percentage of lecturers by gender


School / Gender   Male        Female
Maths             3/6 = 50%   3/6 = 50%
Statistics        3/4 = 75%   1/4 = 25%
CSAM              2/4 = 50%   2/4 = 50%

Table 2.7: Calculating the percentage of males and females in each school
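
The two ways of computing percentages differ only in which total each count is divided by: the gender total (Table 2.6) or the school total (Table 2.7). The sketch below shows both on the counts from Table 2.3; the function names are hypothetical and chosen only for this example.

```python
# Counts from Table 2.3, stored as school -> gender -> frequency
counts = {
    "Maths": {"male": 3, "female": 3},
    "Statistics": {"male": 3, "female": 1},
    "CSAM": {"male": 2, "female": 2},
}


def percent_within_gender(table):
    """Divide each count by its gender (column) total, as in Table 2.6."""
    genders = list(next(iter(table.values())).keys())
    totals = {g: sum(row[g] for row in table.values()) for g in genders}
    return {
        school: {g: 100 * row[g] / totals[g] for g in genders}
        for school, row in table.items()
    }


def percent_within_school(table):
    """Divide each count by its school (row) total, as in Table 2.7."""
    return {
        school: {g: 100 * n / sum(row.values()) for g, n in row.items()}
        for school, row in table.items()
    }


by_gender = percent_within_gender(counts)
by_school = percent_within_school(counts)
for school in counts:
    print(school, by_gender[school], by_school[school])
```

In the first version each gender's percentages add up to 100 across schools; in the second, each school's percentages add up to 100 across genders.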

Figure 2.6: Percentage stacked bar graphs

2.1.7 Line graphs


A line graph is a graphical display of information that changes continuously over
time. Within a line graph, there are points connecting the data to show a continuous
change. The lines in a line graph can descend and ascend based on the data. A line
graph can be used to compare different events, situations, and information.

Consider data on the average temperature (in °C) in Cape Town in each month of a
specific year. To plot a line graph, put the variable Month on the x-axis and
the Average Temperature on the y-axis of an x-y plot. Join the points to form
a line, as in Figure 2.7. The line graph is useful for detecting
trends or patterns over time.

Month Jan Feb Mar Apr May June July Aug Sep Oct Nov Dec
Average Temp 22 23 21 18 16 13 13 13 14 16 18 20


Figure 2.7: Line graph of average temperatures during a year in Cape Town

2.1.8 Frequency Distribution


Frequency tells you how often something occurs. The frequency of an observation
in statistics tells you the number of times the observation occurs in the data. A
frequency distribution is a listing of observations according to their frequencies or
occurrences.

Consider test marks of 20 students in a first statistics course:

6 4 7 10
5 6 7 8
7 8 8 9
7 5 6 6
9 4 7 8

A frequency table of this data will look like:


Mark Frequency
4 2
5 2
6 4
7 5
8 4
9 2
10 1
Total 20

Table 2.8: Frequency table of marks
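
A frequency table like Table 2.8 is just a count of how often each distinct value occurs. As a small sketch, the standard-library `Counter` does this directly for the 20 test marks listed above.

```python
from collections import Counter

# Test marks of the 20 students
marks = [6, 4, 7, 10,
         5, 6, 7, 8,
         7, 8, 8, 9,
         7, 5, 6, 6,
         9, 4, 7, 8]

# Counter maps each distinct mark to the number of times it occurs
frequency = Counter(marks)

print("Mark  Frequency")
for mark in sorted(frequency):
    print(f"{mark:<6}{frequency[mark]}")
print(f"Total {sum(frequency.values())}")
```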

Sometimes the frequency distribution is not as simple to construct as the one above.
Observations are not always easy to group as there may be too many unique values.
Consider the following example: the number of calls from motorists per day for
roadside service in a certain month.

28 122 217 130 120 86 80 90


120 140 70 40 145 187 113 90
68 174 194 170 100 75 104 97
75 123 100 82 109 120 81

To construct a frequency distribution for this data, the following can be done:

1. Determine the smallest and largest value in the data.
   Minimum = 28 and maximum = 217.

2. Calculate the range of the data, i.e. the difference between the maximum and
   minimum.
   Range = 217 - 28 = 189.

3. Calculate k = 1 + 3.22 × log10(n) to find the number of classes to have in the
   frequency table.
   k = 1 + 3.22 × log10(31) ≈ 6.

4. Estimate the approximate class size by dividing the range by k.
   Class size = 189/6 ≈ 32.

5. Determine the lower end of the first class, making sure the smallest value is
   equal to or more than the lower end.
   Lower end = 28.


6. Determine the frequencies for each class by counting the number of observa-
tions falling in each class.

Number of calls   Frequency
28 - 59           2
60 - 91           10
92 - 123          11
124 - 155         3
156 - 187         3
188 - 219         2

Table 2.9: Frequency table of number of calls
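
The six steps above can be sketched in code as follows. This is only an illustration under stated assumptions: the data is integer-valued, the number of classes is rounded to the nearest integer, the class size is rounded up, and the first class starts at the minimum. The function name `grouped_frequencies` is hypothetical.

```python
import math

# Number of calls per day for roadside service (n = 31)
calls = [28, 122, 217, 130, 120, 86, 80, 90,
         120, 140, 70, 40, 145, 187, 113, 90,
         68, 174, 194, 170, 100, 75, 104, 97,
         75, 123, 100, 82, 109, 120, 81]


def grouped_frequencies(data, constant=3.22):
    """Build (lower limit, upper limit, frequency) classes for integer data."""
    n = len(data)
    smallest, largest = min(data), max(data)        # step 1
    data_range = largest - smallest                 # step 2
    k = round(1 + constant * math.log10(n))         # step 3: number of classes
    width = math.ceil(data_range / k)               # step 4: class size
    classes = []
    for i in range(k):
        lower = smallest + i * width                # step 5: first class starts at the minimum
        upper = lower + width - 1                   # integer data, so limits are 1 apart
        count = sum(lower <= x <= upper for x in data)  # step 6
        classes.append((lower, upper, count))
    return classes


for lower, upper, count in grouped_frequencies(calls):
    print(f"{lower} - {upper}: {count}")
```

Whatever class limits are chosen, the frequencies should add up to the number of observations, which is a quick check on any hand count.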

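The class limits are the starting and ending points of each class. The
class boundary is the average of the upper limit of the current class and the lower
limit of the next class. The class midpoint is the average of the lower and upper
limits of the current class. In Table 2.9, for example, the boundary between the
first two classes is (59 + 60)/2 = 59.5 and the midpoint of the first class is
(28 + 59)/2 = 43.5.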

2.1.9 Histogram
A histogram is a picture of a frequency distribution. It is used to represent
continuous quantitative data. It consists of adjacent rectangles that are not
separated. The area of each rectangle is proportional to the frequency of the
corresponding class. When the class intervals are equal, the area of
each rectangle is a constant multiple of its height, and the histogram can be drawn
like a bar chart, except that the bars are not separated. It is important to note
that the class intervals need not be equal.

Consider an example in which a survey was carried out on 45 shipping companies.
They were asked how long it took to ship heavy equipment (in days) from the Far
East. The frequency distribution is shown in Table 2.10.


Shipping time   Frequency
10 - 19         7
20 - 29         20
30 - 39         9
40 - 49         3
50 - 59         5
60 - 69         1

Table 2.10: Frequency of shipping time (in days).

In constructing a histogram, the class intervals should be continuous. In this
frequency table, the classes are discontinuous: there are gaps between the classes.
For example, which class does the value 19.5 fit in? Whenever there are gaps between
the classes, the table is said to have imaginary limits. This happens when the class
boundaries and the class limits are not the same.

One cannot use imaginary limits to construct a histogram. Real limits need
to be constructed in this case, as shown in Table 2.11. The real limits are used
to construct the histogram shown in Figure 2.8.

Shipping time   Frequency   Lower limit   Upper limit   Lower real limit     Upper real limit
10 - 19         7           10            19            (10 + 9)/2 = 9.5     (19 + 20)/2 = 19.5
20 - 29         20          20            29            (20 + 19)/2 = 19.5   (29 + 30)/2 = 29.5
30 - 39         9           30            39            (30 + 29)/2 = 29.5   (39 + 40)/2 = 39.5
40 - 49         3           40            49            (40 + 39)/2 = 39.5   (49 + 50)/2 = 49.5
50 - 59         5           50            59            (50 + 49)/2 = 49.5   (59 + 60)/2 = 59.5
60 - 69         1           60            69            (60 + 59)/2 = 59.5   (69 + 70)/2 = 69.5

Table 2.11: Computing the real limits of the shipping time frequencies
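
The real limits can also be computed mechanically. The sketch below assumes that the gap between adjacent classes is the same throughout (1 day here), so that each real limit lies half a gap outside the stated limit; for the first and last class this matches averaging with the neighbouring values 9 and 70. The function name `real_limits` is a hypothetical helper.

```python
# Stated (lower, upper) class limits from Table 2.10
class_limits = [(10, 19), (20, 29), (30, 39), (40, 49), (50, 59), (60, 69)]


def real_limits(limits):
    """Extend each class by half the gap between adjacent classes."""
    # Gap between the upper limit of one class and the lower limit of the next
    gap = limits[1][0] - limits[0][1]
    return [(lower - gap / 2, upper + gap / 2) for lower, upper in limits]


for (lower, upper), (real_lower, real_upper) in zip(class_limits, real_limits(class_limits)):
    print(f"{lower} - {upper}: real limits {real_lower} to {real_upper}")
```

With real limits, the upper limit of each class coincides with the lower limit of the next, so a value such as 19.5 no longer falls into a gap.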


Figure 2.8: Histogram of shipping time.

2.1.10 Cumulative Frequency Curve


A cumulative frequency curve is drawn using, at each point, the sum of the
frequency of the current class and of all previous classes. It is a plot of the
cumulative frequencies against the real limits of the classes (the upper real
limits for a less than curve, the lower real limits for a greater than curve). If
each frequency is divided by the total of all the frequencies, it becomes a
relative frequency. If the cumulative relative frequencies are multiplied by 100,
a percentage cumulative frequency curve is formed.

Using the same example as in Table 2.10, the values to plot can be found in two
different ways: the less than cumulative frequency, which accumulates the
frequencies starting from the lowest class, and the greater than cumulative
frequency, which accumulates the frequencies starting from the highest class. The
computation of these values is shown in Tables 2.12 and 2.13, respectively.

The two curves can be drawn on the same plot, and the value on the horizontal axis
at which they intersect is the median. The cumulative frequency curves are shown in
Figure 2.9.


Shipping time   Frequency   Lower real limit   Upper real limit   < cumulative frequency
10 - 19         7           9.5                19.5               7
20 - 29         20          19.5               29.5               27
30 - 39         9           29.5               39.5               36
40 - 49         3           39.5               49.5               39
50 - 59         5           49.5               59.5               44
60 - 69         1           59.5               69.5               45

Table 2.12: Less than cumulative frequency computation

Shipping time   Frequency   Lower real limit   Upper real limit   > cumulative frequency
10 - 19         7           9.5                19.5               45
20 - 29         20          19.5               29.5               38
30 - 39         9           29.5               39.5               18
40 - 49         3           39.5               49.5               9
50 - 59         5           49.5               59.5               6
60 - 69         1           59.5               69.5               1

Table 2.13: Greater than cumulative frequency computation
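
Both cumulative columns are running sums of the same frequencies, taken from opposite ends. A minimal sketch with the standard-library `accumulate` function, using the frequencies from Table 2.10:

```python
from itertools import accumulate

# Class frequencies from Table 2.10, lowest class first
frequencies = [7, 20, 9, 3, 5, 1]

# Less than: running total starting from the lowest class
less_than = list(accumulate(frequencies))

# Greater than: running total starting from the highest class,
# then reversed so that it lines up with the classes again
greater_than = list(accumulate(reversed(frequencies)))[::-1]

print("less than:   ", less_than)
print("greater than:", greater_than)
```

The last less than value and the first greater than value both equal the total number of observations, 45.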

Figure 2.9: Cumulative frequency curves


Exercise 2.1
1. What is the difference between a histogram and a bar graph?

2. A company that produces timber is interested in the distribution of the heights
of their pine trees. Compute the frequency distribution and then construct
a histogram to display the heights (in metres) of the following sample of 30
trees:
18.3 19.1 17.3 19.4 17.6 20.1 19.9 20.0 19.5 19.3
17.7 19.1 17.4 19.3 18.7 18.2 20.0 17.7 20.0 17.5
18.5 17.8 20.1 19.4 20.5 16.8 18.8 19.7 18.4 20.4

3. A sample of 13 men working in a mine was collected during an investigation
into allegations that the mine only employs men who are older than 25 years
of age. Their ages are recorded as follows:

19.0 30.0 29.1 20.6 27.9
26.9 23.3 31.3 32.3 24.8
29.9 21.5 26.6

(a) Classify these ages into four classes to create a frequency table; A: ages
below 22.5, B: ages between 22.5 and 25 inclusive, C: ages between 25 and 27.5
inclusive, and D: above 27.5.
(b) Construct a pie chart from these classes.
(c) Construct a bar chart from the data.

4. Given the following frequency distribution:

Classes Frequency
5-9 1
10-14 9
15-19 20
20-24 12
25-29 5

(a) What is the sample size?
(b) What is the class size?
(c) What are the boundaries of the third class?
(d) What is the midpoint of the second class?
(e) Obtain the real limits.

5. The monthly sales (in millions of $) of a large business are given as:

149 148 189 167


380 170 216 155
280 655 250 235
221 950 750 912
510 565 215 842

(a) Construct a frequency distribution.


(b) Draw a histogram.
(c) Compute the greater and less than cumulative frequency values.
(d) Plot the cumulative frequency curves, and hence determine the median.

6. A company has been selling two types of cars A and B from 1992 to 1998.
The number of sales obtained (in billions of $) is given as:

Year A B
1992 134 119
1993 126 96
1994 198 182
1995 144 98
1996 164 78
1997 200 197
1998 213 187

(a) Draw the line graphs of the sales of the two data sets on the same plot,
and comment on the trend of the lines.


2.2 Summary Statistics


Summary statistics are numerical measures characterising the distribution of the
data. A distribution can be described in terms of measures of its location and of
its spread or variation. Here, the focus is on quantitative variables.

2.2.1 Measures of location


Mean
The arithmetic mean is defined as the sum of all observations, divided by the number
of observations. Suppose there are n observations denoted as $x_1, x_2, \ldots, x_n$, then:

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{2.1} \]

for data that is not grouped into a frequency table. If the data is grouped, with $x_i$
occurring $f_i$ times and a total of n observations, then

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{k} f_i x_i, \qquad n = \sum_{i=1}^{k} f_i \tag{2.2} \]

where k is the number of classes.

Example 2.1
Calculate the mean age of the 45 people that attended a cultural movie on a specific
day.

7 9 11 12 12 12 13 13 14 14
14 14 15 15 15 16 17 18 18 19
19 19 20 20 20 21 22 22 22 23
24 24 25 26 28 29 31 31 32 34
38 39 39 16 25

The mean will be:

\[ \bar{x} = \frac{1}{45}[7 + 9 + 11 + 12 + \cdots + 16 + 25] = \frac{1}{45}[927] = 20.6 \]
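The calculation in Example 2.1 can be checked with a few lines of Python. This is only a sketch: the list `ages` is typed in from the example, and the function name `arithmetic_mean` is chosen here for illustration.

```python
# Ages of the 45 attendees, copied from Example 2.1.
ages = [
    7, 9, 11, 12, 12, 12, 13, 13, 14, 14,
    14, 14, 15, 15, 15, 16, 17, 18, 18, 19,
    19, 19, 20, 20, 20, 21, 22, 22, 22, 23,
    24, 24, 25, 26, 28, 29, 31, 31, 32, 34,
    38, 39, 39, 16, 25,
]


def arithmetic_mean(values):
    """Equation (2.1): sum of the observations divided by their count."""
    return sum(values) / len(values)


mean_age = arithmetic_mean(ages)
print(len(ages), sum(ages), mean_age)
```

The printed count and sum should match the n = 45 and total of 927 used in the worked solution.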


Example 2.2
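Calculate the mean of the following grouped data on average income per hour of
2000 participants in a survey:

Income per hour     Frequency
R0                  1235
R1 - R50            459
R51 - R100          121
R101 - R200         29

Each class is represented by its midpoint (0, 25.5, 75.5 and 150.5), and n = 1844
is the sum of the listed frequencies. The mean is

\[ \bar{x} = \frac{1}{1844} \sum_{i=1}^{k} f_i x_i = \frac{0 \times 1235 + 25.5 \times 459 + 75.5 \times 121 + 150.5 \times 29}{1844} = 13.67 \]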
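A short Python sketch of the grouped mean in Equation (2.2) follows. It assumes, as in the worked solution, that each class is represented by the midpoint of its limits; the names `classes` and `grouped_mean` are illustrative only.

```python
# (lower limit, upper limit, frequency) for each income class in Example 2.2.
classes = [(0, 0, 1235), (1, 50, 459), (51, 100, 121), (101, 200, 29)]


def grouped_mean(rows):
    """Equation (2.2), using the class midpoint as the representative x_i."""
    n = sum(freq for _, _, freq in rows)
    total = sum(freq * (low + high) / 2 for low, high, freq in rows)
    return total / n


income_mean = grouped_mean(classes)
print(round(income_mean, 2))
```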

Median
Suppose all observations are sorted into numerical order from lowest to highest.
The median is the middle value in the sorted list. Half of all the observations
will be greater than the median and the other half will be less than the median.
The median is also known as the 50th percentile or the second quartile.

The median $\tilde{x}$ is calculated by first sorting the data in ascending order to get a new
data array $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$, and then finding the central value. If n is odd, then
the median is the $\frac{n+1}{2}$th value,

\[ \tilde{x} = x_{\left(\frac{n+1}{2}\right)}, \tag{2.3} \]

and when n is even,

\[ \tilde{x} = \frac{1}{2}\left[x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right]. \tag{2.4} \]

Example 2.3
Calculate the median of the data in Example 2.1.
Sort the data:

7 9 11 12 12 12 13 13 14 14
14 14 15 15 15 16 16 17 18 18
19 19 19 20 20 20 21 22 22 22
23 24 24 25 25 26 28 29 31 31
32 34 38 39 39

Because n = 45 is odd, the median is given by:

\[ \tilde{x} = x_{\left(\frac{45+1}{2}\right)} = x_{(23)} = 19. \]
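Equations (2.3) and (2.4) can be sketched in Python as below. The function name `median` and the list `ages` (the data of Example 2.1) are illustrative; Python lists are indexed from 0, so position (n+1)/2 in the notes becomes index n // 2.

```python
def median(values):
    """Equations (2.3) and (2.4): middle value of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: x_((n+1)/2) in 1-based notation is index n // 2 in Python.
        return ordered[mid]
    # Even n: average of x_(n/2) and x_(n/2 + 1).
    return (ordered[mid - 1] + ordered[mid]) / 2


ages = [
    7, 9, 11, 12, 12, 12, 13, 13, 14, 14,
    14, 14, 15, 15, 15, 16, 17, 18, 18, 19,
    19, 19, 20, 20, 20, 21, 22, 22, 22, 23,
    24, 24, 25, 26, 28, 29, 31, 31, 32, 34,
    38, 39, 39, 16, 25,
]
print(median(ages))
```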

The median for grouped data is computed as:

\[ \text{Median} = L_m + \frac{c_m\left(\frac{n}{2} - F_{m-1}\right)}{f_m}, \tag{2.5} \]

where
$L_m$ is the lower limit of the class containing the median,
$c_m$ is the difference between the upper end and lower end of the median class,
$f_m$ is the frequency of the median class,
$F_{m-1}$ is the cumulative frequency of the class just before the median class,
$n = \sum_{i=1}^{k} f_i$ is the sum of frequencies and
k is the number of classes.

When calculating the median for grouped data it is important to remember that
the real limits or class boundaries are used.

Example 2.4
The time it takes to build a three-roomed house is believed to be at most 12 weeks.
The man in charge of time and service delivery for a building company took a
random sample of three-roomed house constructions and inquired how long it took
to build them. The data is given as:

Time in weeks Frequency


5-7 5
8 - 10 20
11 - 13 45
14 - 16 10
17 - 19 6
20 - 22 4


The following steps are taken to calculate the median:

1. Calculate the real limits and the cumulative frequency.

Time in weeks   Frequency   Real limits    Cumulative frequency
5 - 7           5           4.5 - 7.5      5
8 - 10          20          7.5 - 10.5     25
11 - 13         45          10.5 - 13.5    70
14 - 16         10          13.5 - 16.5    80
17 - 19         6           16.5 - 19.5    86
20 - 22         4           19.5 - 22.5    90

2. $L_m = 10.5$, since the median $x_{(45)}$ lies in the class 10.5 - 13.5.

3. $c_m = 13.5 - 10.5 = 3$.

4. $f_m = 45$.

5. $F_{m-1} = 25$.

6. Median $= 10.5 + \frac{3\left(\frac{90}{2} - 25\right)}{45} = 11.833$.

7. Important to note: check that the calculated median falls in the median class.

Mode
The mode is the observation with the largest frequency for ungrouped data. The
data is said to have no mode if all the observations are unique, as observations only
occur once in the data. It is also possible to have more than one mode.

With grouped data, there is no single most frequently occurring observation.
Instead, the class with the highest frequency is called the modal class, and the
mode is found within it.

The mode for grouped data is estimated as follows:

\[ \text{Mode} = L_m + \frac{c_m (f_m - f_{m-1})}{2 f_m - (f_{m-1} + f_{m+1})}, \tag{2.6} \]

where
$L_m$ is the lower end of the modal class,
$c_m$ is the upper end of the modal class minus the lower end of the modal class,
$f_m$ is the frequency of the modal class,
$f_{m-1}$ is the frequency of the class before the modal class and
$f_{m+1}$ is the frequency of the class after the modal class.

Example 2.5
Using the data in Example 2.4, the mode is calculated in the following way:

1. The modal class is 10.5 - 13.5, as it has the highest frequency, therefore $L_m = 10.5$.

2. $c_m = 13.5 - 10.5 = 3$.

3. $f_m = 45$.

4. $f_{m-1} = 20$.

5. $f_{m+1} = 10$.

6. Mode $= 10.5 + \frac{3(45 - 20)}{2(45) - (20 + 10)} = 11.75$.
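The steps of Example 2.5 can be sketched in Python as follows. The table is entered with the real limits from Example 2.4; the name `grouped_mode` is illustrative, and treating a missing neighbouring class as having frequency 0 is an assumption made here, not something stated in the notes.

```python
# (lower real limit, upper real limit, frequency) from Example 2.4.
table = [
    (4.5, 7.5, 5), (7.5, 10.5, 20), (10.5, 13.5, 45),
    (13.5, 16.5, 10), (16.5, 19.5, 6), (19.5, 22.5, 4),
]


def grouped_mode(rows):
    """Equation (2.6) applied to the class with the highest frequency."""
    m = max(range(len(rows)), key=lambda i: rows[i][2])  # index of the modal class
    lower, upper, f_m = rows[m]
    # Assumption: a missing neighbour (first or last class) counts as frequency 0.
    f_before = rows[m - 1][2] if m > 0 else 0
    f_after = rows[m + 1][2] if m + 1 < len(rows) else 0
    return lower + (upper - lower) * (f_m - f_before) / (2 * f_m - (f_before + f_after))


mode_weeks = grouped_mode(table)
print(mode_weeks)
```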

For any set of data, the mean, median and mode are likely to be different, so it
has to be decided which is the best one to use in a given situation.

Quantiles/Quartiles/Percentiles
While the mean, median and mode describe the center of the data, it is sometimes
useful to also summarise other specific points of location of the data. When the
data values are sorted or ranked in ascending order, they can be partitioned into
equal-sized portions, with dividing points called quantiles.

Quantiles are used to describe the percentage or proportion of observations lying
above or below a certain level. The quartiles (when data is divided into four) are:

1. 1st quartile or the 25th percentile:
The value below which 25% of the observations lie.

2. 2nd quartile or the 50th percentile, the median:
The value below which 50% of the observations lie, with the other 50% above.

3. 3rd quartile or the 75th percentile:
The value below which 75% of the observations lie.


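For ungrouped data, the 1st quartile is calculated by first ordering the data, and
then computing

\[ \text{1st quartile} = x_{\left|\frac{n+1}{4}\right|} + \frac{1}{4}\left[x_{\left|\frac{n+1}{4}\right|+1} - x_{\left|\frac{n+1}{4}\right|}\right], \tag{2.7} \]

where $\left|\frac{n+1}{4}\right|$ takes only the integer part of $\frac{n+1}{4}$. For example, $\left|\frac{13}{4}\right| = 3$ and $\left|\frac{27}{4}\right| = 6$.

The 3rd quartile is calculated in a similar manner:

\[ \text{3rd quartile} = x_{\left|\frac{3}{4}(n+1)\right|} + \frac{3}{4}\left[x_{\left|\frac{3}{4}(n+1)\right|+1} - x_{\left|\frac{3}{4}(n+1)\right|}\right]. \tag{2.8} \]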

Example 2.6
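Calculate the 1st and 3rd quartile of the following data:

0 0.20 10.00 20.12
20.20 23.90 122.13 200.00

The 1st quartile is given by:

\[ \begin{aligned}
\text{1st quartile} &= x_{\left|\frac{8+1}{4}\right|} + \frac{1}{4}\left[x_{\left|\frac{8+1}{4}\right|+1} - x_{\left|\frac{8+1}{4}\right|}\right] \\
&= x_{(2)} + \frac{1}{4}[x_{(3)} - x_{(2)}] \\
&= 0.20 + \frac{1}{4}[10.00 - 0.20] \\
&= 2.65.
\end{aligned} \tag{2.9} \]

The 3rd quartile is given by:

\[ \begin{aligned}
\text{3rd quartile} &= x_{\left|\frac{3}{4}(8+1)\right|} + \frac{3}{4}\left[x_{\left|\frac{3}{4}(8+1)\right|+1} - x_{\left|\frac{3}{4}(8+1)\right|}\right] \\
&= x_{(6)} + \frac{3}{4}[x_{(7)} - x_{(6)}] \\
&= 23.90 + \frac{3}{4}[122.13 - 23.90] \\
&= 97.57.
\end{aligned} \tag{2.10} \]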
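The quartile rule of Equations (2.7) and (2.8) can be sketched in Python as below. The function name `quartile` and its `which` argument are illustrative. The sketch follows the formula exactly as written in these notes (fixed weights 1/4 and 3/4), which differs from the interpolation rules used by some statistical software, and it assumes the position found is smaller than n so that the next ordered value exists.

```python
def quartile(values, which):
    """Equations (2.7) and (2.8): which=1 gives the 1st quartile, which=3 the 3rd."""
    ordered = sorted(values)
    n = len(ordered)
    position = which * (n + 1) // 4   # integer part of which * (n + 1) / 4
    weight = which / 4                # fixed weight 1/4 or 3/4, as in the notes
    lower = ordered[position - 1]     # x_(position), converted to 0-based indexing
    upper = ordered[position]         # x_(position + 1)
    return lower + weight * (upper - lower)


data = [0, 0.20, 10.00, 20.12, 20.20, 23.90, 122.13, 200.00]
print(round(quartile(data, 1), 2), round(quartile(data, 3), 2))
```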
The quantiles for grouped data are calculated much like the grouped data median.
The following calculation is for the qth percentile:

\[ q\text{-th percentile} = L_q + \frac{c_q\left(\frac{qn}{100} - F_{q-1}\right)}{f_q}, \tag{2.11} \]

where
$L_q$ is the lower limit of the class containing the qth percentile,
$c_q$ is the difference between the upper end and lower end of the qth percentile class,
$f_q$ is the frequency of the qth percentile class,
$F_{q-1}$ is the cumulative frequency of the class before the qth percentile class and
$n = \sum_{i=1}^{k} f_i$ is the sum of frequencies.

For example, to calculate the first quartile (25th percentile) the formula will look
as follows:

\[ 25\text{-th percentile} = L_{25} + \frac{c_{25}\left(\frac{25n}{100} - F_{25-1}\right)}{f_{25}}, \]

where
$L_{25}$ is the lower limit of the class containing the 25th percentile,
$c_{25}$ is the difference between the upper end and lower end of the 25th percentile
class,
$f_{25}$ is the frequency of the 25th percentile class,
$F_{25-1}$ is the cumulative frequency of the class before the 25th percentile class and
$n = \sum_{i=1}^{k} f_i$ is the sum of frequencies.

Example 2.7
Using the data in Example 2.4, the

\[ \text{25th percentile} = 7.5 + \frac{3\left(\frac{90}{4} - 5\right)}{20} = 10.125 \]

and the

\[ \text{75th percentile} = 10.5 + \frac{3\left(\frac{75 \times 90}{100} - 25\right)}{45} = 13.33. \]
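Equation (2.11) can be sketched as a single Python function that locates the percentile class from the cumulative frequencies. The name `grouped_percentile` is illustrative, and the table uses the real limits from Example 2.4. Setting q = 50 reproduces the grouped median of Equation (2.5).

```python
# (lower real limit, upper real limit, frequency) from Example 2.4.
table = [
    (4.5, 7.5, 5), (7.5, 10.5, 20), (10.5, 13.5, 45),
    (13.5, 16.5, 10), (16.5, 19.5, 6), (19.5, 22.5, 4),
]


def grouped_percentile(rows, q):
    """Equation (2.11): q-th percentile of a grouped frequency table."""
    n = sum(freq for _, _, freq in rows)
    target = q * n / 100
    cumulative = 0  # F_{q-1}: total frequency of the classes already passed
    for lower, upper, freq in rows:
        if cumulative + freq >= target:
            # First class whose cumulative frequency reaches the target position.
            return lower + (upper - lower) * (target - cumulative) / freq
        cumulative += freq
    raise ValueError("q must lie between 0 and 100")


for q in (25, 50, 75):
    print(q, round(grouped_percentile(table, q), 3))
```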

Skewness
Skewness is a measure of the asymmetry of a distribution. A distribution can have a
positive or a negative skew, depending on where the mean, median and mode are
situated. The following figure presents the three different scenarios.


Karl Pearson suggested two calculations as a measure of skewness:

\[ P_{s1} = \frac{\text{mean} - \text{mode}}{\text{standard deviation}} \tag{2.12} \]

\[ P_{s2} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} \tag{2.13} \]

Values of $P_{s1}$ and $P_{s2}$ less than 0 indicate negative skewness in the data, while
values greater than 0 indicate positive skewness.

Kurtosis
Kurtosis measures how peaked the distribution of the data is. If the data set has a
high kurtosis, then the histogram of the data will have a high peak. It also means
that there is a great number of observations around the mode or the modal class
has a high frequency.

The kurtosis is measured by:

\[ m_4 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^4. \tag{2.14} \]

The value of $m_4$ can be standardised in order to get a scale-free value which can be
used in comparisons,

\[ m_4' = \frac{m_4}{s^4}. \tag{2.15} \]

Data with a kurtosis equal to 3 is said to have a mesokurtic distribution, that is,
it is peaked like the normal distribution. Data with a kurtosis greater than 3 is said
to have a leptokurtic distribution, that is, it is more peaked than a normal
distribution, with values concentrated around the mean. When the data has a kurtosis
less than 3, it is said to have a platykurtic distribution, which means it is
flatter than a normal distribution, with a wider peak. The values are more widely
spread around the mean.
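As a rough illustration, Pearson's second skewness measure (2.13) and the standardised kurtosis (2.15) can be computed with Python's standard `statistics` module. The function names are illustrative, the sample data are the ages from Example 2.1, and the n - 1 divisor in $m_4$ follows the definition used in these notes rather than any particular software package.

```python
import statistics


def pearson_skew_2(values):
    """Equation (2.13): 3 * (mean - median) / standard deviation."""
    spread = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    return 3 * (statistics.mean(values) - statistics.median(values)) / spread


def standardised_kurtosis(values):
    """Equations (2.14) and (2.15), with the n - 1 divisor used in these notes."""
    n = len(values)
    mean = statistics.mean(values)
    m4 = sum((x - mean) ** 4 for x in values) / (n - 1)
    return m4 / statistics.stdev(values) ** 4


ages = [
    7, 9, 11, 12, 12, 12, 13, 13, 14, 14,
    14, 14, 15, 15, 15, 16, 17, 18, 18, 19,
    19, 19, 20, 20, 20, 21, 22, 22, 22, 23,
    24, 24, 25, 26, 28, 29, 31, 31, 32, 34,
    38, 39, 39, 16, 25,
]
print(pearson_skew_2(ages), standardised_kurtosis(ages))
```

Because the mean of the ages (20.6) exceeds the median (19), the skewness measure comes out positive.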


2.2.2 Measures of dispersion


Variability is an important feature of data. Measures of dispersion quantify how
much the data fluctuates between observations.

The Range
The range (R) is the difference between the maximum and minimum values in the
data set. It is given by:

\[ R = \text{Maximum value} - \text{Minimum value} \tag{2.16} \]

The range has the limitation that it depends only on the extreme values in the data,
which means it is sensitive to outliers.

Quartile deviation (IQR)
This measure is half the difference between the third and first quartiles. It is given by:

\[ \text{IQR} = \frac{1}{2}[\text{3rd quartile} - \text{1st quartile}] \tag{2.17} \]

This measure gives half the range of the middle 50% of the values, which makes it a
better statistic than the range. It is not greatly affected by outliers; that is, it
has the property of robustness.

Variance
The variance is the most commonly used measure of dispersion or variability in
statistical analysis. This measurement takes into account all observations in the
data set. It is denoted by $s^2$, and the greater the value, the greater the variability.
If all observations are close or almost equal then the variance will be low.

The variance is calculated by summing the squared deviations of each observation
from the mean and dividing by n - 1. It is given by:

\[ \begin{aligned}
s^2 &= \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
&= \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right] \\
&= \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{\left[\sum_{i=1}^{n} x_i\right]^2}{n} \right].
\end{aligned} \tag{2.18} \]


For grouped data, the variance is calculated by:

\[ s^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{k} x_i^2 f_i - \frac{\left[\sum_{i=1}^{k} x_i f_i\right]^2}{n} \right], \tag{2.19} \]

where $f_i$ are the k frequencies corresponding to each $x_i$. If the data is classified
into classes, the midpoint $m_i$ of each class is used instead of $x_i$.

The positive square root of the variance $s^2$ is called the standard deviation, and
is denoted by s.
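The two equivalent forms of Equation (2.18) can be compared in a short Python sketch. The data are the "Before" accident counts from Example 2.8, and the function names are illustrative. The standard library's `statistics.variance` also uses the n - 1 divisor, so it serves as an independent check.

```python
import math
import statistics

# "Before" accident counts from Example 2.8.
before = [395, 425, 415, 369, 514, 548, 498, 548, 355, 251, 600, 358, 478]


def sample_variance(values):
    """First line of Equation (2.18): squared deviations divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Last line of Equation (2.18): needs only the sums of x and x squared."""
    n = len(values)
    return (sum(x * x for x in values) - sum(values) ** 2 / n) / (n - 1)


s2 = sample_variance(before)
s = math.sqrt(s2)  # standard deviation
print(round(s, 3), math.isclose(s2, sample_variance_shortcut(before)))
```

The standard deviation obtained this way agrees with the value s = 98.619 quoted for the "Before" period in Example 2.8.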

2.2.3 Box and whisker plot


The box and whisker plot is constructed using the summary statistics, as opposed
to the other plots, which use the raw data. The diagram shows the

1. quartiles,

2. median,

3. maximum and minimum values.

Outliers can also be indicated in the box and whisker plot. In a box and whisker
plot, the box contains the middle 50% of the observations and the whiskers
are lines which extend from the box to the maximum and minimum values.

Example 2.8
The rate at which accidents occurred at a road junction controlled by yield signs
were recorded. Road accidents were also recorded after traffic lights were erected in
place of the yield signs. The results are given as:


Year Before After


1 395 280
2 425 284
3 415 271
4 369 308
5 514 268
6 548 287
7 498 323
8 548 275
9 355 278
10 251 296
11 600 233
12 358 251
13 478 244

The summary statistics are calculated and are given as:

Period   n    Mean      Median   s        Min   Max   Q1    Q3
Before   13   442.615   425      98.619   251   600   369   514
After    13   276.769   278      24.863   233   323   268   287

The box and whisker plots of the accidents before and after the traffic lights were
erected are given as:

Draw the boxplots to scale in one plot, and comment on the distribution of the
accidents before and after the traffic lights were erected.


2.3 Conclusion
In this chapter, a variety of diagrammatic representations of data was looked at.
The diagrams and summary statistics help one to understand the data better and to
deduce the structure of the data. A step above descriptive statistics is inferential
statistics, which will be looked at in further chapters.

Exercise 2.2
1. A glass manufacturing company recorded the following profits (in millions).
The profits are recorded every four months from January 1985 to December
1987:

12 18 10 13 20 11 12 19 10

(a) Calculate the mean, median, mode and variance of the data.
(b) Comment on the median and mean, in terms of skewness.
(c) Construct a bar chart and comment on the distribution.
(d) Construct a box-and-whisker plot for the data.


Chapter 3

Probability

Probability is the measure of the likelihood that an event will occur. Probability
quantifies events into a number which can be used to make decisions. It is measured
on a scale from zero representing impossibility to one representing certainty.

3.1 Assigning probabilities


In many situations, you may be unsure of the outcome of some activity or experi-
ment, but are sure of the possible outcomes. For example, when you roll a dice, the
possible outcomes are getting a 1,2,3,4,5 or 6. But you do not know which number
you will get. If you toss a coin twice, then the possible outcomes could be (H,H),
(H,T), (T,H) and (T,T). The list of all possible outcomes is called a sample space,
Ω.

Each of the possible outcomes has a probability assigned to it. For example, the
sample space for throwing a dice is {1,2,3,4,5,6}, and the probability assigned to
each of the outcomes will be $\frac{1}{6}$, in the belief that the dice is fair, that is, each
outcome is equally likely. When probabilities are assigned to possible outcomes,
1. each probability must lie between 0 and 1, inclusive, and
2. the sum of all probabilities assigned must equal 1.

Example 3.1
Assign probabilities to the following experiments:
1. Choosing a card from a standard pack of playing cards.
The sample space consists of 52 playing cards {Ace of Clubs, 2 of Clubs, 3 of
Clubs, . . ., King of Spades}. The probability assigned to each item will be $\frac{1}{52}$,
assuming all the cards are equally likely to be picked.


2. The combined experiment of tossing a coin and rolling a dice.
The sample space is {(H, 1), (H, 2), (H, 3), (H, 4), (H, 5), (H, 6),
(T, 1), (T, 2), (T, 3), (T, 4), (T, 5), (T, 6)}, and each of the outcomes will have
an assigned probability of $\frac{1}{12}$.

3.2 Probability of events


Sometimes the interest is not in one particular outcome, but in two or three
or more of them. For example, suppose a coin is tossed twice. You might be interested
in whether the result is the same both times. The list of outcomes in which you are
interested is called an event. The event that both tosses of the coin give the same
result is {(H, H), (T, T)}. Events are denoted by capital letters. If A denotes this
event, then A = {(H, H), (T, T)}. An event can be just one outcome, a list of
outcomes or even no outcomes at all.

To find the probability of an event, look at the sample space and add the probabilities
of the outcomes which make up the event. For example, when a coin is tossed twice, the
sample space is {(H, H), (T, T), (H, T), (T, H)} and the probability of each
outcome is $\frac{1}{4}$. The event A consists of two outcomes, so the probability of A is
P(A) $= \frac{1}{4} + \frac{1}{4} = \frac{1}{2}$.

If A is an event, the event "not A" is the event consisting of those outcomes in the
sample space which are not in A. Since the sum of the probabilities assigned to
outcomes in the sample space is 1,

P(A) + P(not A) = 1.

The event "not A" is called the complement of the event A. The symbol A′ is used
to denote the complement of A. Therefore,

P(A) + P(A′) = 1.

Example 3.2
The numbers 1, 2, . . . , 9 are written on separate cards. The cards are shuffled and
the top card is turned over. Calculate the probability that the number on this card
is a prime number.

The sample space is {1, 2, 3, 4, 5, 6, 7, 8, 9}. Each outcome is equally likely and
has a probability of $\frac{1}{9}$. Let B be the event that the number on the card turned
over is prime. Then B = {2,3,5,7}. The probability of B is the sum of the
probabilities of the outcomes in B,

\[ P(B) = \frac{1}{9} + \frac{1}{9} + \frac{1}{9} + \frac{1}{9} = \frac{4}{9}, \]

and the probability of the card not being a prime number is

\[ P(B') = 1 - \frac{4}{9} = \frac{5}{9}. \]
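For equally likely outcomes, the calculation in Example 3.2 amounts to counting, which can be sketched in Python with sets and exact fractions. The helper name `probability` is illustrative.

```python
from fractions import Fraction

sample_space = set(range(1, 10))  # cards numbered 1 to 9
prime_cards = {2, 3, 5, 7}        # event B from Example 3.2


def probability(event, space):
    """Probability of an event when every outcome in `space` is equally likely."""
    return Fraction(len(event & space), len(space))


p_b = probability(prime_cards, sample_space)
p_not_b = probability(sample_space - prime_cards, sample_space)  # complement B'
print(p_b, p_not_b, p_b + p_not_b)
```

Using `Fraction` keeps the results exact, so the complement rule P(B) + P(B′) = 1 can be checked without rounding error.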

Exercise 3.1
1. A fair 20-sided dice has eight faces coloured red, ten coloured blue and two
coloured green. The dice is rolled:

(a) Find the probability that the bottom face is red.


(b) Let A be the event that the bottom face is not red. Find the probability
of A.

2. A dice with 6 faces has been made from brass and aluminium and is not fair.
The probability of a 6 is $\frac{1}{4}$, the probabilities of 2, 3, 4 and 5 are each $\frac{1}{6}$, and the
probability of 1 is $\frac{1}{12}$. The dice is rolled.

(a) Find the probability of rolling 1 or 6.


(b) Find the probability of rolling an even number.

3.3 Addition of probabilities


When two events A and B have no outcomes in common, they are said to be
mutually exclusive. The result P(A or B) = P(A) + P(B) is known as the
addition law of mutually exclusive events.

In Example 3.2, event B consists of all the prime numbers from 1 to 9. Let A be
the event consisting of all the non-prime numbers. Then

A = {1,4,6,8,9} and
B = {2,3,5,7}

are mutually exclusive because they have no outcomes in common. This can be seen
in what is called a Venn diagram in Figure 3.1.


Figure 3.1: Mutually exclusive events A and B

When two events A and B share outcomes in the sample space, they are not
mutually exclusive. Continuing with Example 3.2, let the event C consist of all
the even numbers in the sample space, C = {2,4,6,8}. The events A and C are not
mutually exclusive, as they have outcomes in common, namely {4,6,8}. In
this case the addition rule is not valid: P(A or C) ≠ P(A) + P(C).
The addition rule is modified as follows: P(A or C) = P(A) + P(C) - P(A and C).

The event "A and C" is called the intersection of the sets A and C. It is denoted
by A ∩ C and is illustrated by the region in the Venn diagram that contains the
outcomes {4,6,8} in Figure 3.2. P(A or C) is therefore the probability of all
outcomes in A plus the probability of all outcomes in C minus the probability of all
outcomes in the intersection of A and C.

\[ \begin{aligned}
P(A \text{ or } C) &= P(A) + P(C) - P(A \cap C) \\
&= \frac{5}{9} + \frac{4}{9} - \frac{3}{9} \\
&= \frac{6}{9}.
\end{aligned} \]

Complete a Venn diagram with events A, B and C.
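Both versions of the addition law can be verified by counting outcomes with Python sets. This is a small sketch using the events A, B and C defined above; the helper `prob` is illustrative and assumes equally likely outcomes.

```python
from fractions import Fraction

space = set(range(1, 10))
A = {1, 4, 6, 8, 9}  # non-prime numbers
B = {2, 3, 5, 7}     # prime numbers
C = {2, 4, 6, 8}     # even numbers


def prob(event):
    """Probability of an event when all outcomes in `space` are equally likely."""
    return Fraction(len(event), len(space))


# A and B share no outcomes, so the simple addition law applies.
p_a_or_b = prob(A) + prob(B)

# A and C overlap in {4, 6, 8}, so the intersection must be subtracted once.
p_a_or_c = prob(A) + prob(C) - prob(A & C)

print(p_a_or_b, p_a_or_c)
```

In Python, `A | B` is the union and `A & C` the intersection, which mirrors the set notation used in the Venn diagrams.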


Figure 3.2: Intersection space of events A and C

Additional notation
The union of the events A and B is the set of all outcomes which belong to
event A or event B (or both). Its probability is P(A or B), as calculated in the
previous example, and the union is denoted by A ∪ B.

A set that contains no outcomes is called an empty set and is denoted by ∅ or {}.

An event A is said to be a subset of event B if and only if each outcome of A is
an outcome of B. It is denoted by A ⊆ B.

3.4 Conditional probability


Conditional probability is the probability of some event A given that some other
event B has occurred. It is written as P(A|B) and is read as "the probability of A
given B has occurred".

Consider a class of 20 students, of whom 12 are girls and 8 are boys. Suppose further
that 7 of the girls and 2 of the boys are left handed. If a student is picked randomly
from the class, then the chance that he or she is left handed is $\frac{7+2}{20} = \frac{9}{20}$.

However, the probability of selecting a left-handed student from the group of girls
is $\frac{7}{12}$ and the probability of selecting a left-handed student from the group of boys
is $\frac{2}{8} = \frac{1}{4}$. These probabilities have been calculated on the basis of
an extra condition, which is selecting the student from a certain group. This is an
example of conditional probability.


This can be written as follows:

\[ P(\text{left handed} \mid \text{girl}) = \frac{7}{12} = \frac{7/20}{12/20} \]

which is essentially:

\[ P(\text{left handed} \mid \text{girl}) = \frac{P(\text{left handed and girl})}{P(\text{girl})} \]

Compute the probability of selecting a left handed student given that the student
is a boy.

The equation can be generalised as follows:

If A and B are two events and P(A) > 0, then the conditional probability of B given
A is

\[ P(B|A) = \frac{P(A \text{ and } B)}{P(A)}. \tag{3.1} \]

Rewriting equation 3.1 gives

\[ P(A \text{ and } B) = P(A) \times P(B|A), \tag{3.2} \]

which is known as the multiplication law of probability.

Example 3.3
Weather records indicate that the probability that a particular day is dry is $\frac{3}{10}$. The
record of the South African football team Bafana Bafana shows that their success is
better on dry days than on wet days. The probability that the team wins on a dry day
is $\frac{3}{8}$, whereas the probability that they win on a wet day is $\frac{3}{11}$. The team is due to play
their next match in a few days.

1. What is the probability that the team will win?

2. Three Saturdays ago, the team won their match, what is the probability that
it was a dry day?

The sequence involves first the type of weather and then the result of the football
match. In cases of conditional probability like in this example, one can make use of
a tree diagram, as in Figure 3.3.


Figure 3.3: Tree diagram of conditional probabilities

Notice that the probabilities on the first layer of branches are the probabilities of
the type of weather, wet or dry. The probabilities on the second layer of branches are
the conditional probabilities. You can use the tree diagram to read off any of the four
possibilities:

• Probability of winning given the weather is dry.

• Probability of not winning given the weather is dry.

• Probability of winning given the weather is wet.

• Probability of not winning given the weather is wet.

1. A win can happen in two different ways: the day is dry and the team wins, or
the day is wet and the team wins.

\[ \begin{aligned}
P(\text{win}) &= P(\text{dry \& win}) + P(\text{wet \& win}) \\
&= P(\text{dry}) \times P(\text{win}|\text{dry}) + P(\text{wet}) \times P(\text{win}|\text{wet}) \\
&= \frac{3}{10} \times \frac{3}{8} + \frac{7}{10} \times \frac{3}{11} \\
&= \frac{9}{80} + \frac{21}{110} \\
&= \frac{267}{880} = 0.303.
\end{aligned} \]


2. In this case, you have been asked to calculate a conditional probability. However,
the sequence of events has been reversed and you want to find P(dry|win).

\[ P(\text{dry}|\text{win}) = \frac{P(\text{dry \& win})}{P(\text{win})} = \frac{9/80}{267/880} = \frac{99}{267}. \]

Think of P(dry|win) as being the proportion of times that the weather is dry
out of all the times that the team wins.
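The tree-diagram calculation of Example 3.3 can be reproduced with exact fractions in Python. The variable names are illustrative; the probabilities are those given in the example.

```python
from fractions import Fraction

p_dry = Fraction(3, 10)
p_wet = 1 - p_dry
p_win_given_dry = Fraction(3, 8)
p_win_given_wet = Fraction(3, 11)

# Multiplication law (3.2) along each branch of the tree, then add the branches.
p_dry_and_win = p_dry * p_win_given_dry
p_wet_and_win = p_wet * p_win_given_wet
p_win = p_dry_and_win + p_wet_and_win

# Reverse the conditioning with Equation (3.1).
p_dry_given_win = p_dry_and_win / p_win

print(p_win, round(float(p_win), 3), p_dry_given_win)
```

`Fraction` reduces results to lowest terms, so P(dry|win) is displayed as 33/89, which equals 99/267.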

Exercise 3.2
1. The Prosecutor's fallacy. An accused prisoner is on trial. The defence lawyer
asserts that, in the absence of further evidence, the probability that the prisoner
is guilty is 1 in a million. The prosecuting lawyer produces a further piece of
evidence and asserts that if the prisoner were guilty, the probability that this
evidence would be obtained is 999 in 1000, and if he were not guilty it would
be only 1 in 1000. Assuming that the court accepts the legality of the evidence,
and that both lawyers' figures are correct, what is the probability that the
prisoner is guilty?

Bayes’ Theorem
Bayes' Theorem describes the probability of an event, based on prior knowledge of
conditions that might be related to the event.
Bayes' theorem is stated mathematically as the following equation:

\[ P(A|B) = \frac{P(B|A) \, P(A)}{P(B)}, \tag{3.3} \]

where A and B are events and P(B) ≠ 0.

• P(A|B) is a conditional probability: the likelihood of event A occurring given
that B is true.

• P(B|A) is also a conditional probability: the likelihood of event B occurring
given that A is true.

• P(A) and P(B) are the probabilities of observing A and B independently of
each other; these are known as the marginal probabilities.

The examples done in this section on conditional probability use Bayes' Theorem.


3.5 Independent events


Independent events are events which have no effect on one another. For two inde-
pendent events, A and B,

P(A and B) = P(A) x P(B) (3.4)

This result is called the multiplication law for independent events.

Example 3.4
In a carnival game, a contestant has to first toss a fair coin and then roll a fair
cubical dice whose faces are numbered 1 to 6. The contestant wins a prize if the
coin shows heads and the dice score is below 3. Find the probability that the
contestant wins the prize.

The two events of tossing a coin and rolling a dice are independent. The outcome
of rolling the dice does not depend on the outcome of tossing the coin. Therefore
the probability of winning is:

\[ \begin{aligned}
P(\text{prize won}) &= P(\text{coin shows heads and dice score is lower than 3}) \\
&= P(\text{coin shows heads}) \times P(\text{dice score is lower than 3}) \\
&= \frac{1}{2} \times \frac{2}{6} = \frac{1}{6}.
\end{aligned} \]
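The multiplication law for independent events can be checked by listing the whole sample space of the combined experiment. This Python sketch uses `itertools.product` to build the 12 equally likely outcomes; the helper `prob` is illustrative.

```python
from fractions import Fraction
from itertools import product

# All 12 equally likely (coin, dice) outcomes of the combined experiment.
space = list(product("HT", range(1, 7)))


def prob(predicate):
    """Fraction of outcomes in `space` for which `predicate` is true."""
    favourable = [outcome for outcome in space if predicate(outcome)]
    return Fraction(len(favourable), len(space))


p_heads = prob(lambda o: o[0] == "H")
p_low = prob(lambda o: o[1] < 3)
p_prize = prob(lambda o: o[0] == "H" and o[1] < 3)

print(p_heads, p_low, p_prize, p_prize == p_heads * p_low)
```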

3.6 Relative frequency approach to probability


The relative frequency approach to probability is based on the number of times an
event has occurred out of the total number of occasions on which it could have
occurred.

\[ P(\text{event}) = \frac{\text{Number of times an event has occurred}}{\text{Total number of events}} \tag{3.5} \]

Example 3.5
Consider data on the mode of transport of 92 students to campus every day:

Mode of transport / Gender   Male   Female   Total
Car                          20     15       35
Taxi                         24     18       42
Bicycle                      5      3        8
Walked                       2      5        7
Total                        51     41       92

1. The probability of randomly selecting a student that is male:

\[ P(\text{male}) = \frac{\text{Number of males}}{\text{Total number of students}} = \frac{51}{92} = 55.4\% \]

2. The probability of randomly selecting a student that is female and travels
to campus using a car:

\[ P(\text{female travelling by car}) = \frac{\text{Number of females travelling by car}}{\text{Total number of students}} = \frac{15}{92} = 16.3\% \]

3. Calculate the probability of selecting a male that rides a bicycle to campus.

4. Calculate the probability of selecting a female that walks to campus.
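The first two calculations in Example 3.5 can be reproduced from the table counts with a few lines of Python. The dictionary `counts` and its layout are illustrative; only the two probabilities already worked out in the example are computed here.

```python
# Counts from the table in Example 3.5: counts[mode] = (male, female).
counts = {
    "Car": (20, 15),
    "Taxi": (24, 18),
    "Bicycle": (5, 3),
    "Walked": (2, 5),
}

total = sum(male + female for male, female in counts.values())
total_male = sum(male for male, _ in counts.values())

# Equation (3.5): favourable count divided by the total count.
p_male = total_male / total
p_female_car = counts["Car"][1] / total

print(total, round(100 * p_male, 1), round(100 * p_female_car, 1))
```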

3.7 Counting methods


This section is about the numbers of arrangements of different objects, and the
number of ways you can choose different objects.

3.7.1 Permutations
In the previous section, you could count the number of outcomes in a sample space.
When the number of outcomes is fairly small, it is quite straightforward, but in
certain instances counting the possible number of outcomes can be cumbersome.
Think of listing all the different hands of 5 cards from a pack of 52 playing cards.

Suppose you have 3 letters: A, B and C written on 3 separate cards. There are
different ways of arranging these cards.

ABC ACB BCA BAC CAB CBA

There are 3 choices for the first position: either A, B or C.

There are 2 choices for the second position:


B or C if A has been used
C or A if B has been used


A or B if C has been used

There is 1 choice for the third position:


only C if A and B have been used
only B if C and A have been used
only A if B and C have been used

Therefore altogether there are 3x2x1 = 6 possible ways of arranging the three cards.

Similarly the number of ways to arrange the letters A, B, C and D will be 4x3x2x1
= 24. The different arrangements of objects (they need not be letters) are called
permutations. The number of permutations of n distinct objects is n!, where
n! = n x (n − 1) x (n − 2) x . . . x 2 x 1. The expression n! is called n factorial.

The formula for permutations is further extended to give the number of different
permutations of r objects which can be made from n distinct objects:

\[ {}^{n}P_{r} = \frac{n!}{(n-r)!} \tag{3.6} \]

Equation 3.6 is used, for example, when arranging 4 letters out of the 7 letters
A, B, C, D, E, F and G. That is, having the arrangements:

ABCD ABCE ABCF ABCG ...and so on.

The number of arrangements will be:

\[ {}^{7}P_{4} = \frac{7!}{(7-4)!} = \frac{7!}{3!} = 840. \]
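The count of 840 can be confirmed in Python in three ways: from the factorial formula (3.6), from the standard library function `math.perm` (available from Python 3.8), and by actually listing the arrangements with `itertools.permutations`. The variable names are illustrative.

```python
import math
from itertools import permutations

letters = "ABCDEFG"

by_formula = math.factorial(7) // math.factorial(7 - 4)  # n! / (n - r)!
by_library = math.perm(7, 4)                             # built-in nPr
by_listing = sum(1 for _ in permutations(letters, 4))    # count every arrangement

print(by_formula, by_library, by_listing)
```

Listing the arrangements is only practical for small cases; the formula is what makes larger counts manageable.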

3.7.2 Permutations when objects are not distinct


In the previous section, objects are distinctly different from each other. Suppose
now that in a set of n objects there are k subgroups of identical objects, with $n_1$ in
the first group and $n_2$ in the second group, up to $n_k$ in the kth group, such that
$n_1 + n_2 + \ldots + n_k = n$. Then the number of distinguishable permutations is:

\[ \frac{n!}{n_1! \, n_2! \cdots n_k!} \tag{3.7} \]


Example 3.6
Find the number of distinct permutations of the letters of the word MISSISSIPPI.

There are 11 letters, of which there are 4 S's, 4 I's, 2 P's and 1 M. The number of
distinct permutations of the letters is therefore

\[ \frac{11!}{4! \times 4! \times 2! \times 1!} = 34650. \]

Note: in permutations, the order of objects is significant when counting
the number of arrangements.
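Equation (3.7) can be sketched in Python by counting how often each letter occurs with `collections.Counter`. The function name `distinct_permutations` is illustrative.

```python
import math
from collections import Counter


def distinct_permutations(word):
    """Equation (3.7): n! divided by the factorial of each letter's count."""
    result = math.factorial(len(word))
    for count in Counter(word).values():
        result //= math.factorial(count)  # each division is exact
    return result


print(distinct_permutations("MISSISSIPPI"))
```

When all letters are different, every count is 1 and the function reduces to n!, in line with the earlier result for distinct objects.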

3.7.3 Combinations
A combination is a selection in which the order of objects does not matter when
counting the number of different selections. For example, if you were dealt a hand
of 13 cards from a pack of 52 cards, you would not be interested in the order in
which you received the cards.

The number of different combinations of r objects selected from n distinct objects
is:

\[ {}^{n}C_{r} = \frac{n!}{(n-r)! \times r!} \tag{3.8} \]

Example 3.7
In how many ways can you select three letters from the word ATMOSPHERIC,
where the order is not important?

\[ {}^{11}C_{3} = \frac{11!}{(11-3)! \times 3!} = 165. \]

Since the order is not important, the arrangements ATM, AMT, TMA,
TAM, MTA and MAT are all counted as just one selection. Try computing the
number of permutations for this example.

Example 3.8
A team of 5 people, which must contain 3 men and 2 women, is chosen from 8 men
and 7 women. How many different teams can be selected?

The number of different teams of 3 men which can be selected from 8 is ${}^{8}C_{3}$, and
the number of different teams of 2 women which can be selected from 7 is ${}^{7}C_{2}$. Any
of the ${}^{8}C_{3}$ teams of men can join up with any of the ${}^{7}C_{2}$ teams of women to make
a team of 5. The number of possible teams is

\[ {}^{8}C_{3} \times {}^{7}C_{2} = 1176. \]
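The counts in Examples 3.7 and 3.8 can be confirmed with `math.comb` (available from Python 3.8) and, for the smaller case, by listing the selections with `itertools.combinations`. The variable names are illustrative.

```python
import math
from itertools import combinations

# Example 3.7: choose 3 of the 11 distinct letters, order ignored.
three_letter_selections = math.comb(11, 3)
listed = sum(1 for _ in combinations("ATMOSPHERIC", 3))

# Example 3.8: every group of 3 men can be paired with every group of 2 women.
teams = math.comb(8, 3) * math.comb(7, 2)

print(three_letter_selections, listed, teams)
```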

Exercise 3.3
1. How many different arrangements can be made of the letters in the word
STATISTICS?

2. (a) Calculate the number of arrangements of the letters in the word NUMBER.
(b) How many arrangements in (a) begin and end with a vowel?

3. A contains 20 chocolates, 15 toffees and 12 peppermints. If three sweets are


chosen at random, what is the probability that they are:

(a) all different


(b) all chocolates
(c) all the same
(d) all not chocolates

Hint: Use the idea of relative frequencies to calculate the probability.

Acknowledgements
This chapter has been adapted from Advanced Level Mathematics Statistics 1, written
by Steve Dobbs and Jane Miller in 2002.


Chapter 4

Random Variables and Distributions

4.1 Introduction
We have been looking at variables, describing them through samples and populations.
Now we will look at the distributions of data. Most data fall into one of a small
class of distributions. We will discuss these distributions for both discrete and
continuous data. This discussion should help us understand the structure of
populations better.

4.2 Discrete Random Variables


These are variables that take on distinct values, e.g. the number of games won by
a team: 0, 1, 2, etc., up to the total number of games played.

Probability distribution
Let X be a random variable and let $x_i$, $i = 1, 2, \ldots, k$, denote the k distinct values
that X may assume. Since each value of X corresponds to a basic outcome of a random
trial, a probability distribution for the sample space associates a probability
with each $x_i$. The probability that the random variable X assumes the value $x_i$
is denoted by $P(X = x_i)$; for example, $P(X = 0)$ is the probability that X = 0.

Example 4.1
In the game problem mentioned earlier, suppose a team plays 3 games and
has the following probabilities of winning games.

P(winning 0 games) = 0.65,
P(winning 1 game) = 0.15,
P(winning 2 games) = 0.10,
P(winning 3 games) = 0.10.

x          0     1     2     3
P(X = x)   0.65  0.15  0.10  0.10

This represents a probability distribution, i.e., the probability associated with each
possible outcome.

Properties
Any probability distribution for a discrete random variable X has the following properties.

1. $0 \le P(X = x_i) \le 1$ for $i = 1, 2, \ldots, k$,

2. $\sum_{i=1}^{k} P(X = x_i) = 1$.

The probability distribution of a discrete random variable is also referred to as a
probability function or probability mass function.

Cumulative probability distribution


The cumulative probability distribution gives the probability that a random variable
X takes a value less than or equal to some specified value x, denoted by
$P(X \le x)$. For example,

\[
P(X \le 2) = P(X = 0) + P(X = 1) + P(X = 2),
\]
whereas, with a strict inequality,
\[
P(X < 2) = P(X = 0) + P(X = 1).
\]

In Example 4.1, the cumulative distribution is given by:

X P (X = x) P (X ≤ x)
0 0.65 0.65
1 0.15 0.80
2 0.10 0.90
3 0.10 1.00


Expected value
The long-run mean value of a random variable over many trials is known as the expected
value. For a discrete random variable X, the expected value is
\[
E[X] = \sum_{i=1}^{k} x_i P(x_i), \qquad \text{where } P(x_i) = P(X = x_i).
\]

Variance
The outcomes of a discrete random variable will vary, and as such it is useful to have
a measure of their variability. As we discussed earlier, a key measure of variation is
the variance, defined as

\[
\begin{aligned}
\sigma_x^2 = Var[X] &= \sum_{i=1}^{k} (x_i - E[X])^2 P(x_i) \\
&= E\big[(X - E[X])^2\big] \\
&= E[X^2] - (E[X])^2 .
\end{aligned}
\]

The standard deviation of a random variable X is given by

\[
\sigma_x = \sqrt{Var[X]}.
\]

Example 4.2
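Find E[X] and Var[X] for the distribution in Example 4.1.

Solution

\[
\begin{aligned}
E[X] &= 0 \times 0.65 + 1 \times 0.15 + 2 \times 0.10 + 3 \times 0.10 \\
&= 0.15 + 0.20 + 0.30 \\
&= 0.65
\end{aligned}
\]

\[
\begin{aligned}
Var[X] &= (0 - 0.65)^2 \times 0.65 + (1 - 0.65)^2 \times 0.15 + (2 - 0.65)^2 \times 0.10 + (3 - 0.65)^2 \times 0.10 \\
&= 1.0275 \approx 1.03
\end{aligned}
\]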
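
The same calculation can be sketched in a few lines of Python. The lists `xs` and `ps` simply hold the values and probabilities of Example 4.1; the variable names are illustrative assumptions.

```python
from math import sqrt

xs = [0, 1, 2, 3]                 # possible values of X (games won)
ps = [0.65, 0.15, 0.10, 0.10]     # P(X = x) from Example 4.1

mean = sum(x * p for x, p in zip(xs, ps))                    # E[X]
var = sum((x - mean) ** 2 * p for x, p in zip(xs, ps))       # definition of Var[X]
var_alt = sum(x * x * p for x, p in zip(xs, ps)) - mean**2   # E[X^2] - (E[X])^2
sd = sqrt(var)

print(f"E[X]   = {mean:.4f}")     # 0.6500
print(f"Var[X] = {var:.4f}")      # 1.0275
print(f"sd     = {sd:.4f}")
```

Both variance formulas give the same value, which is a useful check when working by hand.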

Exercise 4.1
Consider a man who tosses a coin once. The probability of getting a head is $\frac{1}{2}$, and the
probability of getting a tail is $1 - \frac{1}{2} = \frac{1}{2}$. Let X = 1 if he gets a head and X = 0 if he
gets a tail. What is the distribution function of X? Find E[X].


Moment Generating Functions


The expected values $E[X], E[X^2], \ldots, E[X^k]$ are called moments. As you have seen
previously, $\mu = E[X]$ and $\sigma^2 = Var[X] = E[X^2] - \mu^2$, which are functions of
moments.
Special functions called moment generating functions can sometimes make finding
the mean and variance of a random variable simpler.

Definition 1 mgf
The moment generating function (mgf) of a discrete random variable X is defined
to be
\[
M_X(t) = E(e^{tX}) = \sum_{x \in \mathcal{X}} e^{tx} P(x),
\]
where $\mathcal{X}$ is the set of possible values of X. The mgf exists if $M_X(t)$ is finite for
all $t \in [-h, h]$, for some $h > 0$.

Note:
Some of the merits of using the mgf include the following properties.

1. If the mgf exists then, when t = 0, $M_X(0) = 1$ for any random variable.

2. If the mgf exists and is the same for two distributions, then the two distributions
   are the same. This means mgfs uniquely identify probability distributions.

3. If the mgf exists, then

   \[
   E(X^r) = \left. \frac{d^r}{dt^r} M_X(t) \right|_{t=0} = M_X^{(r)}(0).
   \]

   So the mean of X can be found by evaluating the first derivative
   of the mgf at t = 0. That is, $\mu = E[X] = M'(0)$.

   The variance of X can be found by evaluating the first and second derivatives
   of the mgf at t = 0. That is, $\sigma^2 = E[X^2] - (E[X])^2 = M''(0) - (M'(0))^2$.

4. Let X have mgf $M_X(t)$ and let $Y = aX + b$. Then $M_Y(t) = e^{bt} M_X(at)$.
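
To make properties 1 and 3 concrete, the sketch below evaluates the mgf of the distribution in Example 4.1 and approximates its first two derivatives at t = 0 with finite differences. This is only a numerical illustration under that assumption (finite differences are approximate, and the step size `h` is an arbitrary choice); it is not how the derivatives would be found analytically.

```python
from math import exp

xs = [0, 1, 2, 3]                 # values of X in Example 4.1
ps = [0.65, 0.15, 0.10, 0.10]     # corresponding probabilities


def mgf(t):
    """M_X(t) = sum over x of e^{tx} P(x)."""
    return sum(exp(t * x) * p for x, p in zip(xs, ps))


h = 1e-4                                         # small step for finite differences
m1 = (mgf(h) - mgf(-h)) / (2 * h)                # approximates M'(0)  = E[X]
m2 = (mgf(h) - 2 * mgf(0) + mgf(-h)) / h**2      # approximates M''(0) = E[X^2]

print(f"M(0)     = {mgf(0):.4f}")
print(f"E[X]    ~= {m1:.4f}")
print(f"E[X^2]  ~= {m2:.4f}")
print(f"Var[X]  ~= {m2 - m1**2:.4f}")
```

The approximations agree with E[X] = 0.65 and Var[X] = 1.0275 obtained directly from the probability distribution.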


4.3 Continuous Random Variables


A continuous random variable, on the other hand, is a random variable that can
take on a continuum of values (any value in an interval). An example of this is the
height of students: height can take an infinite number of possible values.
Here we talk of values in an interval as opposed to particular values. Thus the
probability density function of a continuous random variable X is a mathematical
function for which the area under the curve corresponding to any interval is equal to
the probability that X will take a value in that interval. The probability density
function is denoted by f(x). The value f(x) is called the probability density at x.
For example,
\[
f(x) = \begin{cases} 12x(1 - x)^2 & 0 \le x \le 1, \\ 0 & \text{elsewhere.} \end{cases}
\]
For instance, the probability density at x = 0.5 is

\[
f(0.5) = 12(0.5)(1 - 0.5)^2 = 1.5.
\]
Note that a density, unlike a probability, may exceed 1.

Example 4.3
Consider the continuous random variable X, which represents the yield of a crop in
tons per acre. Suppose the yield can take any value between 0 and 1 ton, and that
X has the density above.
Then
\[
\begin{aligned}
P(0.5 \le X \le 0.7) &= \int_{0.5}^{0.7} 12x(1 - x)^2 \, dx \\
&= \left[ 6x^2 - 8x^3 + 3x^4 \right]_{0.5}^{0.7} \\
&= (6 \times 0.7^2 - 8 \times 0.7^3 + 3 \times 0.7^4) - (6 \times 0.5^2 - 8 \times 0.5^3 + 3 \times 0.5^4) \\
&= 0.9163 - 0.6875 \\
&= 0.2288
\end{aligned}
\]

See Figure 4.1. This means the probability that the crop yield will be between 0.5
and 0.7 is 0.2288.
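
The integral can also be checked numerically. The sketch below compares the antiderivative used in the example with a composite Simpson's rule; the helper `simpson` is written here for illustration and is an assumption of this example, not something defined in the notes.

```python
def f(x):
    """Density of the crop yield: 12x(1 - x)^2 on [0, 1], zero elsewhere."""
    return 12 * x * (1 - x) ** 2 if 0 <= x <= 1 else 0.0


def antiderivative(x):
    """6x^2 - 8x^3 + 3x^4, the antiderivative used in Example 4.3."""
    return 6 * x**2 - 8 * x**3 + 3 * x**4


def simpson(g, a, b, n=1000):
    """Composite Simpson's rule with n (even) subintervals."""
    h = (b - a) / n
    total = g(a) + g(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3


p_exact = antiderivative(0.7) - antiderivative(0.5)
p_numeric = simpson(f, 0.5, 0.7)
area_total = simpson(f, 0.0, 1.0)   # a density must integrate to 1

print(f"{p_exact:.4f} {p_numeric:.4f} {area_total:.4f}")
```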

Properties
As in the discrete case above:

1. $f(x) \ge 0$ for $x \in [a, b]$, the interval of definition.

2. $\int_a^b f(x) \, dx = 1$.

Note that, in the continuous case,

\[
P(a \le X \le b) = P(a < X \le b) = P(a \le X < b) = P(a < X < b).
\]


Figure 4.1: The area representing the probability

Cumulative Probability Function


The cumulative probability function of a continuous random variable is denoted
by F(x) and defined by $F(x) = P(X \le x)$ for $x \in (-\infty, \infty)$.

The probability $P(X \le x_0) = F(x_0)$ is the area under the curve f(x)
to the left of $x_0$, i.e., using calculus, $F(x_0) = \int_{-\infty}^{x_0} f(x) \, dx$. This is the cumulative
distribution function.

Expected value
The expected value of a continuous variable is defined as
\[
E[X] = \int_{-\infty}^{\infty} x f(x) \, dx.
\]

Example 4.4
In Example 4.3, we calculate the expected value from
\[
\begin{aligned}
E[X] &= \int_0^1 x f(x) \, dx \\
&= \int_0^1 x \times 12x(1 - x)^2 \, dx \\
&= \left[ 4x^3 - 6x^4 + \tfrac{12}{5} x^5 \right]_0^1 \\
&= 4 - 6 + 2.4 \\
&= 0.4.
\end{aligned}
\]


Variance
The variance, as in the discrete case, is calculated with the summation replaced by
the integral, as follows:
\[
\begin{aligned}
Var[X] &= \int_{-\infty}^{\infty} (x - E[X])^2 f(x) \, dx \\
&= \int_{-\infty}^{\infty} x^2 f(x) \, dx - (E[X])^2 \\
&= E[X^2] - (E[X])^2
\end{aligned}
\]

Example 4.5
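For the density in Example 4.3, where $E[X] = \int_0^1 12x^2(1 - x)^2 \, dx = 0.4$,
\[
\begin{aligned}
Var[X] &= \int_0^1 (x - E[X])^2 f(x) \, dx \\
&= E\big[(X - E[X])^2\big] \\
&= E[X^2] - (E[X])^2 \\
&= \int_0^1 x^2 \times 12x(1 - x)^2 \, dx - 0.4^2 \\
&= \left[ \tfrac{12}{4} x^4 - \tfrac{24}{5} x^5 + \tfrac{12}{6} x^6 \right]_0^1 - 0.16 \\
&= 0.2 - 0.16 \\
&= 0.04.
\end{aligned}
\]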
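
As a numerical check of the mean and variance of the density $f(x) = 12x(1-x)^2$ on [0, 1], the sketch below integrates $x f(x)$ and $x^2 f(x)$ with a composite Simpson's rule. The helper `simpson` is an illustrative assumption, not part of the notes.

```python
def f(x):
    """Density 12x(1 - x)^2 on [0, 1]."""
    return 12 * x * (1 - x) ** 2


def simpson(g, a, b, n=1000):
    """Composite Simpson's rule with n (even) subintervals."""
    h = (b - a) / n
    total = g(a) + g(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3


mean = simpson(lambda x: x * f(x), 0.0, 1.0)           # E[X]
second_moment = simpson(lambda x: x**2 * f(x), 0.0, 1.0)  # E[X^2]
variance = second_moment - mean**2                      # E[X^2] - (E[X])^2

print(f"E[X]   = {mean:.4f}")
print(f"E[X^2] = {second_moment:.4f}")
print(f"Var[X] = {variance:.4f}")
```

The integration gives E[X] = 0.4, E[X²] = 0.2 and Var[X] = 0.04, so the expected yield lies inside the interval [0, 1] on which the density is defined, as it must.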

Exercise 4.2
In Example 4.3, find the cumulative distribution function of X.

Moment generating function


Definition 2 mgf
The moment generating function (mgf) of a continuous random variable X is defined
to be
\[
M_X(t) = E(e^{tX}) = \int_{-\infty}^{\infty} e^{tx} f(x) \, dx.
\]

The mgf exists if $M_X(t)$ is finite for all $t \in [-h, h]$, for some $h > 0$.

4.4 Empirical distributions


There is a wide variety of probability distributions. However, only a limited number
of types, or families, of probability distributions are used in practical applications.
These fall under two categories: discrete and continuous.

There are many other distributions not discussed in this chapter. However, the
few distributions in the next two sections fall under a class of distributions we call
empirical distributions.


4.5 Discrete distributions


Bernoulli trials
A random variable X which has two possible outcomes, say 0 and 1, is called a
Bernoulli random variable.
The probability distribution of X is

\[
P(X = 1) = p, \qquad P(X = 0) = 1 - P(X = 1) = 1 - p.
\]

Example 4.6
Tossing a fair coin, you get a head or a tail, each with probability $p = \frac{1}{2}$. Thus, if
a head is labelled 1 and a tail 0, the random variable X representing the outcome
takes the values 0 or 1. If the probability that X = 1 is p, then we have

\[
p = P(X = 1) = \tfrac{1}{2}, \qquad P(X = 0) = 1 - p = 1 - \tfrac{1}{2} = \tfrac{1}{2},
\]

since the events X = 1 and X = 0 are mutually exclusive and together cover all outcomes.

Binomial Distribution
Suppose an experiment has two possible outcomes (failure and success) and
that the probability of success is p. Suppose also that the experiment is repeated n
times independently. Then the number of successes follows a Binomial distribution.
Let $X = X_1 + X_2 + \cdots + X_n$, where the $X_i$ are independent and identically distributed
Bernoulli random variables; then X is called a binomial random variable. Thus

\[
P(x) = P(X = x) = \binom{n}{x} p^x (1 - p)^{n-x},
\]

for $x = 0, 1, \ldots, n$ and $0 < p < 1$, where $\binom{n}{x} = \frac{n!}{x!(n-x)!}$.

The quantities n and p are called parameters and they specify the distribution.
Let us look at one application of the Binomial distribution with parameters n and p, i.e. Bin(n, p).


Example 4.7
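Bin(n, p) here means the Binomial distribution with parameters n and p.

Find
(i) the probability of getting 4 heads in 6 tosses of a fair coin,
(ii) E[X] and Var[X].

Solution
\[
P(X = 4) = \binom{6}{4} \left(\tfrac{1}{2}\right)^4 \left(1 - \tfrac{1}{2}\right)^{6-4}
= 15 \times \tfrac{1}{16} \times \tfrac{1}{4} = \tfrac{15}{64}.
\]
\[
E[X] = \sum_x x \binom{n}{x} p^x (1 - p)^{n-x} = np
\]
\[
Var[X] = \sum_x x^2 \binom{n}{x} p^x (1 - p)^{n-x} - (np)^2 = np(1 - p)
\]
For n = 6 and p = 1/2 this gives E[X] = 3 and Var[X] = 1.5.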
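
The binomial probabilities and moments for 6 tosses of a fair coin can be verified by direct summation. This is a minimal sketch; `binom_pmf` is a name introduced here for illustration.

```python
from math import comb


def binom_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 6, 0.5
p4 = binom_pmf(4, n, p)                                   # probability of exactly 4 heads

mean = sum(x * binom_pmf(x, n, p) for x in range(n + 1))  # should equal np
var = sum(x**2 * binom_pmf(x, n, p) for x in range(n + 1)) - mean**2  # should equal np(1 - p)

print(f"P(X = 4) = {p4}")       # 0.234375, i.e. 15/64
print(f"E[X] = {mean}, Var[X] = {var}")
```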

Poisson Distribution
A Poisson random variable is a discrete random variable that can take any non-negative
integer value 0, 1, 2, .... The parameter of this distribution is λ, and the distribution
is denoted by Po(λ).
As an example of the application of the Poisson distribution, the number X of
individuals arriving at a bank teller per quarter hour is a Poisson random variable.
The Poisson probability function is
\[
P(x) = \frac{\lambda^x e^{-\lambda}}{x!},
\]
where $P(x) = P(X = x)$, $x = 0, 1, 2, \ldots$, and $0 < \lambda < \infty$.

Example 4.8
The number of students arriving at a take away every 15 minutes is a Poisson random
variable with parameter λ = 0.2. Find the probability that zero, one and two
students arrive at the take away.
\[
P(X = 0) = \frac{(0.2)^0 e^{-0.2}}{0!} = 0.8187 \quad \text{(no students arrive)}
\]
\[
P(X = 1) = \frac{(0.2)^1 e^{-0.2}}{1!} = 0.1637
\]
\[
P(X = 2) = \frac{(0.2)^2 e^{-0.2}}{2!} = 0.0164
\]
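
These three probabilities follow directly from the Poisson probability function. A short sketch, with `poisson_pmf` as an illustrative helper name:

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """P(X = x) for X ~ Po(lam)."""
    return lam**x * exp(-lam) / factorial(x)


lam = 0.2
probs = [poisson_pmf(x, lam) for x in range(3)]
for x, pr in enumerate(probs):
    print(f"P(X = {x}) = {pr:.4f}")
```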


Properties
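\[
\begin{aligned}
E[X] &= \sum_{x=0}^{\infty} \frac{x \lambda^x e^{-\lambda}}{x!} \\
&= \sum_{x=1}^{\infty} \frac{\lambda^x e^{-\lambda}}{(x-1)!} \\
&= \lambda \sum_{x=1}^{\infty} \frac{\lambda^{x-1} e^{-\lambda}}{(x-1)!} \\
&= \lambda \times 1 \\
&= \lambda.
\end{aligned}
\]
Similarly,
\[
\begin{aligned}
Var[X] &= \sum_{x=0}^{\infty} \frac{x^2 \lambda^x e^{-\lambda}}{x!} - \lambda^2 \\
&= (\lambda^2 + \lambda) - \lambda^2 \\
&= \lambda.
\end{aligned}
\]
Hint: Let y = x − 1.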

Poisson Distribution linked with the Binomial Distribution


The Poisson distribution can be viewed as a limiting case of a Binomial distribution,
when the number of trials n gets large and the probability of success p gets small.

In the Binomial distribution $P(X = x) = \binom{n}{x} p^x (1 - p)^{n-x}$ we set $E(X) = np = \lambda$,
so that
\[
p = \frac{\lambda}{n},
\]
which means
\[
1 - p = 1 - \frac{\lambda}{n}.
\]
Thus,
\[
P(X = x) = \binom{n}{x} \left(\frac{\lambda}{n}\right)^x \left(1 - \frac{\lambda}{n}\right)^{n-x} \to \frac{e^{-\lambda} \lambda^x}{x!} \quad \text{as } n \to \infty.
\]
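
The limit can be illustrated numerically (this is not a proof). The sketch below fixes hypothetical values λ = 2 and x = 3, both chosen only for illustration, and shows the Binomial probability with p = λ/n approaching the Poisson probability as n grows.

```python
from math import comb, exp, factorial


def binom_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def poisson_pmf(x, lam):
    """P(X = x) for X ~ Po(lam)."""
    return lam**x * exp(-lam) / factorial(x)


lam, x = 2.0, 3                      # hypothetical values for illustration
target = poisson_pmf(x, lam)

errors = []
for n in (10, 100, 1000, 10000):
    approx = binom_pmf(x, n, lam / n)   # keep np = lam fixed while n grows
    errors.append(abs(approx - target))
    print(f"n = {n:>5}: Bin = {approx:.6f}, Po = {target:.6f}")
```

The gap between the two probabilities shrinks steadily as n increases.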

Exercise 4.3
Show that the limiting distribution of the Binomial distribution, as $n \to \infty$ with
$np = \lambda$ held fixed, is the Poisson distribution:
\[
\lim_{n \to \infty} P(x) = \frac{e^{-\lambda} \lambda^x}{x!}.
\]

The Hypergeometric Distribution


We will at this stage mention a distribution related to the Binomial distribution.
Suppose we are looking at a population consisting of N items. Suppose each item
is either a success or a failure, and that S of the N items are successes. If a random
sample of n items is drawn without replacement and X of the sampled items are
successes, then X follows a Hypergeometric distribution given by

\[
P(X = x) = \frac{\binom{S}{x} \binom{N-S}{n-x}}{\binom{N}{n}}.
\]

Discrete Uniform distribution


The discrete Uniform random variable is a variable that can take on integer values
within a given interval with equal probabilities.
For example, let X be a discrete random variable which can assume s values.
Then the discrete Uniform probability function is defined as
\[
P(X = x) = P(x) = \frac{1}{s}, \qquad x = a, a + 1, \ldots, a + (s - 1),
\]
where s > 0 is the number of terms and a is the first term.

Properties
\[
E[X] = a + \frac{s - 1}{2}, \qquad Var[X] = \frac{s^2 - 1}{12}.
\]

Example 4.9
Let X be a discrete Uniform random variable which can assume the 5 values
0, 1, 2, 3, 4. In this case a = 0 and s = 5, so the probability function is

x          0    1    2    3    4
P(X = x)   0.2  0.2  0.2  0.2  0.2

Equivalently, using the formula,
\[
P(X = x) = P(x) = \frac{1}{5}, \qquad x = 0, 1, 2, 3, 4.
\]
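
For this example the stated mean and variance formulas can be checked numerically against the definitions. The sketch below is a numerical check for a = 0 and s = 5 only, not a general proof; the variable names are illustrative.

```python
a, s = 0, 5                          # first value and number of values
values = list(range(a, a + s))       # 0, 1, 2, 3, 4
p = 1 / s                            # equal probability for each value

mean = sum(x * p for x in values)
var = sum((x - mean) ** 2 * p for x in values)

formula_mean = a + (s - 1) / 2       # E[X] = a + (s - 1)/2
formula_var = (s**2 - 1) / 12        # Var[X] = (s^2 - 1)/12

print(mean, formula_mean)
print(var, formula_var)
```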

Exercise 4.4
Show that for the discrete Uniform distribution
\[
E[X] = a + \frac{s - 1}{2} \quad \text{and} \quad Var[X] = \frac{s^2 - 1}{12}.
\]
Hint: $Var(X) = \sum_x (x - E[X])^2 P(X = x)$.


4.6 Continuous distributions


Continuous Uniform Distribution
This distribution is also known as the rectangular distribution. It is the
continuous equivalent of the discrete Uniform distribution discussed above. A
continuous Uniform variable has a uniform probability density over an interval. This
distribution is of the form
\[
f(x) = \begin{cases} \dfrac{1}{b - a} & a < x < b, \\ 0 & \text{elsewhere.} \end{cases}
\]

Example 4.10
The marks from a certain exam are uniformly distributed over the interval 50 to 75. The
density function for the marks is given by
\[
f(x) = \begin{cases} \dfrac{1}{25} & 50 < x < 75, \\ 0 & \text{elsewhere.} \end{cases}
\]

Properties
\[
E[X] = \tfrac{1}{2}(b + a), \qquad Var[X] = \frac{(b - a)^2}{12}.
\]

For the example above, E[X] = 62.5.

Exercise 4.5
For the continuous Uniform distribution, show that

\[
E[X] = \tfrac{1}{2}(a + b) \quad \text{and} \quad Var[X] = \frac{(b - a)^2}{12}.
\]

The Normal distribution


This is one of the most important probability distributions in statistics. The normal
probability density function is

\[
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right],
\]
where $-\infty < x < \infty$, $-\infty < \mu < \infty$ and $\sigma > 0$.


We shall use the normal distribution to a great extent in inference. This is a
two-parameter distribution, usually denoted by $N(\mu, \sigma^2)$. For X a normal random
variable with mean µ and variance $\sigma^2$, the random variable $Z = \frac{X - \mu}{\sigma}$ has mean 0
and variance 1 and is said to be standard normal. Tables which give normal
cumulative probabilities are widely available.

Properties
Let X be a normally distributed random variable with mean µ and variance $\sigma^2$, i.e.,
$X \sim N(\mu, \sigma^2)$. Then $Z = \frac{X - \mu}{\sigma}$ is normally distributed with mean 0 and variance 1.
We say Z has a standard normal distribution, and the transformation $\frac{X - \mu}{\sigma}$ is called standardisation.
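
Standardisation is what allows a single table of standard normal probabilities to serve every normal distribution. The sketch below illustrates this with Python's `statistics.NormalDist` (Python 3.8+); the values µ = 100, σ = 15 and x = 120 are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0        # hypothetical parameters, for illustration only
x = 120.0

z = (x - mu) / sigma           # standardised value of x

p_direct = NormalDist(mu, sigma).cdf(x)   # P(X <= x) from N(mu, sigma^2)
p_standard = NormalDist().cdf(z)          # P(Z <= z) from N(0, 1)

print(f"z = {z:.4f}")
print(f"P(X <= {x}) = {p_direct:.4f}, P(Z <= z) = {p_standard:.4f}")
```

The two probabilities coincide, which is exactly what reading a standard normal table after standardising relies on.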

The Exponential probability distribution


An exponential random variable is a continuous random variable that can take on
any positive value and is usually used to describe the time between events.

Example 4.11
Let X be the length of a long distance telephone call. Then X has an exponential
distribution.

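The density function is

\[
f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{for } 0 \le x < \infty, \ \lambda > 0, \\ 0 & \text{elsewhere.} \end{cases}
\]

Properties
\[
E[X] = \frac{1}{\lambda} \quad \text{and} \quad Var[X] = \frac{1}{\lambda^2}.
\]
The parameter here is λ.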

Memoryless Property
The memoryless property of a probability distribution is mostly associated
with the distribution of times. Consider tossing a coin: if the first outcome was a head
(H) and you toss the coin again, is the second toss any more likely to be a
tail (T)? It is not: P(T | H) = P(T). In other words,

\[
P(X_{t+1} \mid X_1, X_2, \ldots, X_t) = P(X_{t+1})
\]

since these events are independent. The memoryless property means the future is
independent of the past. There are certain probability distributions which have this
property.
We will mention two in this course. In the discrete case, a sequence of independent
Bernoulli trials has this property (see the coin-tossing example).
The only memoryless continuous probability distribution is the exponential
distribution. Suppose $X \in [0, \infty)$ is a continuous random variable. The probability
distribution of X is said to be memoryless if, for any real numbers $t, a \in [0, \infty)$,

\[
P(X > t + a \mid X > t) = P(X > a).
\]
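
The memoryless identity can be checked numerically for the exponential distribution, using the fact that $P(X > x) = e^{-\lambda x}$ for $x \ge 0$. The values of λ, t and a below are hypothetical and chosen only for illustration; a numerical check like this is not a proof.

```python
from math import exp


def survival(x, lam):
    """P(X > x) for an exponential random variable with rate lam (x >= 0)."""
    return exp(-lam * x)


lam, t, a = 0.5, 2.0, 3.0      # hypothetical rate and times

# P(X > t + a | X > t) = P(X > t + a) / P(X > t), since {X > t + a} is contained in {X > t}
conditional = survival(t + a, lam) / survival(t, lam)
unconditional = survival(a, lam)

print(f"{conditional:.6f} {unconditional:.6f}")
```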

Exercise 4.6
1. Let X be Exponentially distributed. Show that $E(X) = \frac{1}{\lambda}$.

2. Prove that the Exponential distribution is memoryless.

4.7 Conclusion
We have briefly looked at some types of distributions. Some kinds of data tend to
follow certain distributions. There are not many of these distributions, and rarely do
we meet data with an unknown distribution, especially when the sample size is large.
The number of people arriving at a certain bus stop in specified periods of time
will generally follow a Poisson distribution. The time taken by radioactive material
to decay generally follows the exponential distribution. Such natural phenomena
make it necessary for us to study empirical distributions.

Exercise 4.7
1. The heights (in metres) of children aged between 10 and 14 are recorded below:
   1.4, 1.5, 1.6, 1.2, 1.63.

   (a) Find $\bar{x}$ and $s^2$, and use them to estimate µ and $\sigma^2$ respectively.
   (b) If the height is normally distributed, find P(X < 1.5).

2. Find the cumulative probability distributions of the Exponential, Poisson and
   continuous Uniform distributions.

3. Suppose that X, the number of years of schooling a student completes beyond the
   age of 14, is normally distributed with a mean of 4 and a standard deviation
   of 2.

   (a) What is the probability that a student completes more than 8 years of
       schooling beyond the age of 14?
   (b) What is the probability that a student completes 2 to 8 years of schooling
       beyond the age of 14?

4. The probability that an individual gets a loan from a bank is 0.25. If 12 people
   apply for loans, what is the probability that

   (a) at least 10 get loans,
   (b) more than 2 but fewer than 9 get loans,
   (c) exactly 5 get loans?

5. A new typist makes on average 1 error per page in her typing. What is the
   probability that she will make

   (a) no errors on a page?
   (b) at least 4 errors on a page?
   (c) no errors on a 5-page document? [Hint: her error rate is now 5
       per document.]


Chapter 5

Estimation and Hypothesis Testing

5.1 Introduction
The objective of statistics is to make inferences about a population based on
information contained in a sample. Statistical inference is mainly concerned with making
inferences about population parameters. Methods of making inferences about
parameters fall into two categories: making decisions concerning the value of a
parameter, or estimating/predicting the value of the parameter. The relevant information
in a sample can be used to estimate the likely values of the associated population
parameters.
We will often need to test the truth of claims made about a population; this
will be covered under hypothesis testing.

5.2 Estimation
When a single statistic is used to estimate a population parameter we call it a
point estimator. A good estimator of a population parameter should at least be
an unbiased estimator. The next question is: what is an unbiased estimator?

A statistic is called an unbiased estimator of a parameter if the expected value of
the statistic, taken over all samples, equals the parameter. This technical definition simply
means that if we were to take a sample from a population and compute a statistic, and
continue taking samples and calculating the corresponding statistics, the average
of these statistics should be equal to the population parameter. Estimation can be
applied to many population parameters; in this chapter we will concentrate on the
population mean and standard deviation.


Point Estimate of the Mean


A point estimator of the population mean µ is given by the statistic $\bar{x}$. Suppose we
have a random sample of observations of size n, $x_1, x_2, \ldots, x_n$. The point estimator
of µ is given by $\hat{\mu}$, where
\[
\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad (5.1)
\]
One of the most important properties of the sample mean $\bar{x}$ is that it is an unbiased
estimator of the population mean.

Point Estimate of the Population Standard Deviation


The point estimate of the population variance is given by
\[
\hat{\sigma}^2 = s^2 = \frac{1}{n - 1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n} \right] \qquad (5.2)
\]

This means that if you have a representative sample you can estimate the variance of the
population. Similarly, the point estimate of the population standard deviation is given
by

\[
\hat{\sigma} = s = \sqrt{ \frac{1}{n - 1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n} \right] } \qquad (5.3)
\]

Point Estimate of a Population Proportion π


Suppose we wish to estimate the proportion of the population that have a cellphone
in a large city. There will not be enough time to go to every individual to gather
the information in order to determine the population proportion π.

If a representative sample of size n is taken and the number x of individuals with
cellphones in the sample is recorded, we can estimate the population proportion
using
\[
\hat{\pi} = p = \frac{x}{n} \qquad (5.4)
\]

Example 5.1
A company which manufactures and bottles chemicals collected a bottle from each
batch of 20 which they dispatched, for quality control purposes. They measured the
volumes (in ml) and found the following results.

350, 351, 348, 352, 350, 356, 348, 347, 348, 352, 354

1. Estimate the mean volume of the whole consignment.

2. Estimate the standard deviation of the volume of chemicals.

3. Estimate the proportion of bottles which have less than 350ml in the whole
   consignment.

Solution

1. The point estimate of the consignment mean is

\[
\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1}{11} \times 3856 = 350.545
\]

2.

\[
\hat{\sigma} = s = \sqrt{ \frac{1}{n - 1} \left( \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n} \right) }
= \sqrt{ \frac{1}{10} \times \left( 1351782 - \frac{3856^2}{11} \right) } = 2.8058
\]

3. Four of the 11 bottles (348, 348, 347, 348) contain less than 350ml, so

\[
\hat{\pi} = p = \frac{x}{n} = \frac{4}{11} = 0.3636
\]
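
The three point estimates can be reproduced with Python's `statistics` module, whose `stdev` function uses the n − 1 divisor of equation (5.3). This is an illustrative sketch; the variable names are chosen for this example.

```python
from statistics import mean, stdev

# Bottle volumes (ml) from Example 5.1
volumes = [350, 351, 348, 352, 350, 356, 348, 347, 348, 352, 354]

mu_hat = mean(volumes)                                     # point estimate of the mean
sigma_hat = stdev(volumes)                                 # sample standard deviation (n - 1 divisor)
pi_hat = sum(v < 350 for v in volumes) / len(volumes)      # proportion below 350 ml

print(f"mean       = {mu_hat:.3f}")
print(f"std dev    = {sigma_hat:.4f}")
print(f"proportion = {pi_hat:.4f}")
```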


5.3 Confidence Intervals


Once we have calculated a point estimate, we cannot state with any
confidence how accurate it is. The sample mean is one point in the sample
space and may not accurately estimate the true mean of the population. For this
reason, interval estimation is more useful and appropriate for providing
estimates of population values.

When computing confidence intervals for population parameters, we need
to consider the following:
1. the size n of the sample we are dealing with,
2. whether we know the population standard deviation σ or not.

We can define a confidence interval as a range of values for which the probability that
the population parameter lies within it is known. In confidence estimation we will
be using the symbol $z_{\alpha/2}$; this value is found using normal distribution tables. $z_{\alpha/2}$ is
the value of z for which $P(Z > z_{\alpha/2}) = \alpha/2$.

You will get the value of α from the question. For instance, if you are asked to
construct a 95% confidence interval, then
95% = 0.95 × 100%
    = (1 − 0.05) × 100%
so that α = 0.05 and α/2 = 0.05/2 = 0.025. We now do the opposite of what we did in
the last chapter, i.e., we want to find the value of k that gives P(Z > k) = 0.025.
Looking down the column Φ(−z) (depending on the tables) until we get the probability
closest to 0.025, we find it at z = −1.96 (row −1.9, column 0.06). So in this case
$k = z_{\alpha/2} = z_{0.05/2} = z_{0.025} = 1.96$.

z       0.00     ...   0.06     ...   0.09
-3.4    0.0003   ...   0.0003   ...   0.0002
...
-1.9    0.0287   ...   0.0250   ...   0.0233
...
3.4     0.9997   ...   0.9997   ...   0.9998

Some normal quantiles:

z_{0.10}     1.28
z_{0.05}     1.645
z_{0.025}    1.96
z_{0.01}     2.33
z_{0.005}    2.58
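
The quantiles listed above can be reproduced without tables using the inverse cumulative distribution function of the standard normal, available as `statistics.NormalDist().inv_cdf` from Python 3.8. The helper name `z_upper` is an assumption made for this sketch.

```python
from statistics import NormalDist


def z_upper(tail_area):
    """Value z with P(Z > z) = tail_area for a standard normal Z."""
    return NormalDist().inv_cdf(1 - tail_area)


for tail in (0.10, 0.05, 0.025, 0.01, 0.005):
    print(f"z_{tail} = {z_upper(tail):.3f}")

# For a 95% confidence interval: alpha = 0.05, so use the upper tail area alpha / 2
alpha = 0.05
z_crit = z_upper(alpha / 2)
```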


Case 1: Confidence Interval for µ, when $\sigma^2$ is known

A (1 − α)100% confidence interval for µ, when σ is known, is given by
\[
\bar{x} - z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}} \qquad (5.5)
\]

where $\bar{x}$ is the mean of a sample of size n from a population with known variance
$\sigma^2$, and $z_{\alpha/2}$ is the value of the standard normal distribution that leaves an area of α/2 to
the right under the normal curve.

Case 2: Confidence Interval for µ, when the sample size is large (i.e. n > 30) and $\sigma^2$ is unknown

A (1 − α)100% confidence interval for µ, when σ is unknown and the sample is large, is
given by
\[
\bar{x} - z_{\alpha/2} \frac{s}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2} \frac{s}{\sqrt{n}} \qquad (5.6)
\]
where $\bar{x}$ and $s^2$ are the mean and variance of a sample of size n, and $z_{\alpha/2}$ is the value
of the standard normal distribution leaving an area of α/2 to the right.

Case 3: Confidence Interval for µ, when the sample size is small (i.e. n < 30) and $\sigma^2$ is unknown

If the sample size is small and $\sigma^2$ is unknown, we use a t-distribution.

A (1 − α)100% confidence interval for µ, when σ is unknown and the sample is small,
is given by
\[
\bar{x} - t_{n-1}(\alpha/2) \frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{n-1}(\alpha/2) \frac{s}{\sqrt{n}} \qquad (5.7)
\]
where $\bar{x}$ and s are the mean and standard deviation respectively of a sample of
size n < 30 from an approximately normal population, and $t_{n-1}(\alpha/2)$ is the value of
the t-distribution, with n − 1 degrees of freedom, leaving an area of α/2 to the right.

5.3.1 Using Student’s t Tables


If X is normally distributed with mean µ and standard deviation σ, then for the mean
$\bar{X}$ of a sample of size n, $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
will also be normally distributed, with mean 0 and standard deviation 1.


If σ is unknown then we cannot standardise $\bar{X}$ to get Z. Instead, we create a new
statistic called the Student's t, where we substitute s for σ:

\[
T = \frac{\bar{X} - \mu}{s / \sqrt{n}} \qquad (5.8)
\]

This statistic follows what is called the Student's t distribution with n − 1 degrees
of freedom. This is the value being used in Equation 5.7. We look up this value
from our Student's t tables, which look like Table 5.1 below.

Table 5.1: Student's t distribution tables

                Amount of α in one tail
df              0.25    0.10    0.05    0.025   0.01    0.005
1               1.000   3.08    6.31    12.7    31.8    63.7
2               0.816   1.89    2.92    4.30    6.97    9.92
3               0.765   1.64    2.35    3.18    4.54    5.84
4               0.741   1.53    2.13    2.78    3.75    4.60
5               0.727   1.48    2.02    2.57    3.37    4.03
6               0.718   1.44    1.94    2.45    3.14    3.71
...             ...     ...     ...     ...     ...     ...
29              0.683   1.31    1.70    2.05    2.46    2.76
For n > 30, z   0.674   1.28    1.65    1.96    2.33    2.58

Suppose you want to look up $t_{n-1}(\alpha/2)$ where n = 7 and α = 0.05. Then we have

\[
t_{n-1}(\alpha/2) = t_{7-1}(0.05/2) = t_6(0.025) = 2.45 \quad \text{(row df = 6, column 0.025)}.
\]

As the sample size n increases, the t-distribution tends to the standard normal
distribution, so we can also read off the values of $z_{\alpha/2}$ here. For example,
$z_{\alpha/2} = z_{0.05/2} = z_{0.025} = 1.96$ (see last row, column 5). Let us look at a few examples. Each of these
examples should help you appreciate when to apply which formula.

Example 5.2
A sample of 36 people at a rave night club revealed an average age of $\bar{x} = 19.38$
years and a sample standard deviation of s = 4.760 years. Determine a 95% confidence
interval for the true mean age of individuals at the rave night club.

Solution
n is large (n > 30) and σ is unknown, so we use Case 2.

A 95% confidence interval for µ is given by

\[
\bar{x} - z_{\alpha/2} \frac{s}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2} \frac{s}{\sqrt{n}}
\]
\[
19.38 - 1.96 \times \frac{4.760}{\sqrt{36}} \le \mu \le 19.38 + 1.96 \times \frac{4.760}{\sqrt{36}}
\]
\[
17.83 \le \mu \le 20.93
\]

We are 95% confident that the true mean age lies between 17.83 and 20.93 years.
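
The large-sample interval of equation (5.6) is easy to wrap in a small function. The sketch below is one possible implementation under the assumptions of Case 2; the name `z_interval` is hypothetical, and the critical value is computed with `statistics.NormalDist` rather than read from tables.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(xbar, sd, n, confidence):
    """Confidence interval xbar +/- z_{alpha/2} * sd / sqrt(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 point of N(0, 1)
    half_width = z * sd / sqrt(n)
    return xbar - half_width, xbar + half_width


# Example 5.2: n = 36, sample mean 19.38, sample standard deviation 4.760
lower, upper = z_interval(19.38, 4.760, 36, 0.95)
print(f"({lower:.2f}, {upper:.2f})")
```

The same function covers Case 1 if the known population standard deviation σ is passed as `sd`.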

Example 5.3
The scores below were recorded after a statistics examination was marked. The
standard deviation of the marks is known to be 7.93%.

73 52 67 53 51 61 49 66 41 48
52 47 65 46 71 67 48 66 47 44
63 65 44 46 61 52 55 54 51 56
49 62 57 56 47 45 56 59 59 47
48 57 48 52 53 52 51 63 68 53
Determine the 98.44% confidence interval for the mean mark.

Solution

n = 50, $\bar{x}$ = 54.78%, s = 7.73%

σ is known, and n is large, so we use Case 1. Here α = 1 − 0.9844 = 0.0156, so
α/2 = 0.0078 and $z_{\alpha/2} = 2.42$.

Then a 98.44% confidence interval for the mean mark is

\[
\bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}
\]
\[
54.78 - 2.42 \times \frac{7.93}{\sqrt{50}} \le \mu \le 54.78 + 2.42 \times \frac{7.93}{\sqrt{50}}
\]
\[
52.07 \le \mu \le 57.49
\]
We are 98.44% confident that the true mean mark lies between 52.07 and 57.49.

Example 5.4
The weights of seven similar containers of a chemical are recorded below.

277.83 289.17 294.84 277.83 283.5 289.17 272.16


Find a 95% confidence interval for the mean of such containers, assuming an
approximately normal distribution.

Solution
$\bar{x}$ = 283.5, n = 7, s = 8.0186 (from calculator).

σ is unknown and n is small, so we use Case 3.

\[
t_6(0.025) = 2.45
\]

Then a 95% confidence interval for µ is

\[
\left( \bar{x} - t_{n-1}(\alpha/2) \times \frac{s}{\sqrt{n}} \ , \ \bar{x} + t_{n-1}(\alpha/2) \times \frac{s}{\sqrt{n}} \right)
\]
\[
\left( 283.5 - 2.45 \times \frac{8.0186}{\sqrt{7}} \ , \ 283.5 + 2.45 \times \frac{8.0186}{\sqrt{7}} \right)
\]
\[
(283.5 - 7.4253 \ , \ 283.5 + 7.4253)
\]
\[
(276.0747 \ , \ 290.9253)
\]
Therefore, a 95% confidence interval for µ is 276.0747 ≤ µ ≤ 290.9253.
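
The calculation in this example can be sketched in Python as follows. The standard library has no t-distribution quantile function, so the critical value 2.45 is taken from Table 5.1 as a plain constant; that choice, and the variable names, are assumptions of this sketch.

```python
from math import sqrt
from statistics import mean, stdev

# Container weights from Example 5.4
weights = [277.83, 289.17, 294.84, 277.83, 283.5, 289.17, 272.16]

n = len(weights)
xbar = mean(weights)
s = stdev(weights)              # sample standard deviation (n - 1 divisor)

t_crit = 2.45                   # t_6(0.025), read from Table 5.1 (df = n - 1 = 6)
half_width = t_crit * s / sqrt(n)

lower, upper = xbar - half_width, xbar + half_width
print(f"xbar = {xbar:.2f}, s = {s:.4f}")
print(f"95% CI: ({lower:.4f}, {upper:.4f})")
```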

Confidence interval estimation can be extended to cover other parameters like
proportions and the standard deviation, as well as differences between two means and a
ratio of standard deviations.

The ideas used in the construction of confidence intervals can be extended to help
determine the size of a sample that will lead us to estimate the mean to any desired
degree of accuracy.


5.4 Sampling to a Desired Precision


Let $x_1, x_2, \ldots, x_n$ be a random sample from $N(\mu, \sigma^2)$, and let $\bar{x}$ be the sample mean
and σ the population standard deviation. Then the minimum sample
size required for $\bar{X}$ to be within ε of the true mean µ with 100(1 − α)% probability is
\[
n = \left( \frac{z_{\alpha/2} \times \sigma}{\epsilon} \right)^2,
\]
rounded up to the next whole number.

Example 5.5
The Manager of a department wants to estimate the mean number of calculation
errors in reports by the 2006 trainees in his department to within ±3 of the true
mean, with probability 0.95. What sample size of reports should he take if the
standard deviation is known to be 9.2?

Solution

\[
n = \left[ \frac{z_{\alpha/2} \times \sigma}{\epsilon} \right]^2 = \left[ \frac{1.96 \times 9.2}{3} \right]^2 = 36.128
\]

Since the sample size must be a whole number and 36 observations would fall just
short of the required precision, we round up: the minimum sample size needed is 37.
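
A minimal sketch of the sample-size calculation follows. The formula gives 36.128, and because a fractional observation is impossible and the precision requirement must still be met, the sketch rounds up with `math.ceil`, which gives 37. The function name `min_sample_size` is hypothetical.

```python
from math import ceil


def min_sample_size(z, sigma, eps):
    """Smallest whole n with z * sigma / sqrt(n) <= eps."""
    return ceil((z * sigma / eps) ** 2)


raw = (1.96 * 9.2 / 3) ** 2          # value given by the formula before rounding
n_required = min_sample_size(1.96, 9.2, 3)

print(f"formula value = {raw:.3f}")  # 36.128
print(f"sample size   = {n_required}")
```

With n = 36 the margin of error would be 1.96 × 9.2 / 6 ≈ 3.005, slightly more than 3, which is why rounding down does not satisfy the requirement.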

Exercise 5.1
1. Show that x is an unbiased estimator of µ.

2. Show that var(x) = σ 2 /n.

3. Show that sb2 is an unbiased estimator of σ 2 .

4. Suppose

22.4 31.6 -3.8 34.0 -35.2 -21.5 23.0 10.8 23.2

is a random sample from N (µ, σ 2 ). Construct a 90% confidence interval for


the mean µ in each of the following cases.


(a) σ² = 7.24 is known.

(b) σ² is unknown.

5. A sample of 44 retired gentlemen was drawn from a group of pensioners and
their weekly cigarette expenses recorded in order to estimate the population
mean cigarette expenses. A mean of $325.00 and a standard deviation of
$125.00 were found from the sample.

(a) Find a 99% confidence interval of the true mean.

(b) Estimate µ, the mean amount spent on cigarettes during the past year.
(c) How large a sample is needed so that one is
i. 95% confident that the sample mean will be within 0.1 of the true
mean?
ii. 98% confident that the sample mean will be within 0.5 of the true
mean?

We can use confidence intervals to test the validity of a claim. Tests of claims are
best handled by hypothesis testing.

5.5 Hypothesis Testing


Often, before you engage in any study or eventually make any recommendations, you
will need supporting evidence. It has become generally accepted that recommendations
and findings should be accompanied by statistical evidence. This is often in the
form of statistical tests.

The testing of a statistical hypothesis is perhaps the most important part of decision
making.

What is a Statistical Hypothesis?


A statistical hypothesis is an assumption or statement, which may or may not be
true, concerning one or more populations.

The truth or falsity of a statistical hypothesis is never known with certainty unless
we examine the entire population. A random sample is taken from the population
of interest and the information contained in this sample is used to decide whether
the hypothesis is likely to be true or false. Evidence that is inconsistent with the
stated hypothesis leads to a rejection whereas evidence supporting the hypothesis
leads to its acceptance. For example, one might want to test the hypothesis that
the pass rate for an exam is always 40%, or test the claim that women live longer
than men. There are two types of hypothesis: the null hypothesis and the
alternative hypothesis.

The Null Hypothesis H0


The null hypothesis usually denoted H0 is a claim about a population parameter or
characteristic.

The Alternative Hypothesis H1


The alternative hypothesis, usually denoted by H1, expresses the way in which the
value of a population parameter or characteristic in a statistical model may deviate
from that specified in the null hypothesis.

The null hypothesis is rejected only if there is sufficient evidence against it; if
the evidence is insufficient, we fail to reject it.


There are two main types of errors:
1. type I error: When H0 is rejected when it is in fact true.

2. type II error: When H0 is ’accepted’ when it is in fact false.


The probability of committing a Type I error is known as the level of significance
of the test and is denoted by α. This is the value we were using in confidence
interval construction. The value of α can vary at the researcher's discretion.
However, a value of α = 0.05 is a common benchmark for reasonable doubt.

The probability of committing a Type II error is denoted by β. This probability
helps us determine the power of our test. The power of a test is the probability
that the test correctly rejects H0 when H0 is in fact false. So we can say the
power of a test = 1 − P(making a Type II error) = 1 − β.
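
To make the link between α, β and power concrete, here is a minimal Python sketch for a two-sided Z test with known σ. The function name z_test_power and the numbers used (µ0 = 25, σ = 3, n = 36, and a true mean of 26) are illustrative assumptions, not results from these notes.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(mu0, mu_true, sigma, n, alpha):
    """Power of the two-sided Z test of H0: mu = mu0 when the mean is mu_true."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Under the true mean, Zcalc is normal with this mean and variance 1
    shift = (mu_true - mu0) / (sigma / sqrt(n))
    # P(reject H0) = P(Zcalc > z_crit) + P(Zcalc < -z_crit)
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)

power_at_null = z_test_power(mu0=25, mu_true=25, sigma=3, n=36, alpha=0.05)
power_at_26 = z_test_power(mu0=25, mu_true=26, sigma=3, n=36, alpha=0.05)
beta_at_26 = 1 - power_at_26  # probability of a Type II error
print(power_at_null, power_at_26, beta_at_26)
```

When the true mean equals µ0, the rejection probability is just α, which is the Type I error rate; as the true mean moves away from µ0, the power rises and β falls.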

Formulating a Hypothesis Test


When testing a hypothesis, the following steps are followed;
1. State the hypothesis
Null Hypothesis (H0 ): There is no difference between population parameter
and given parameter.
Alternative Hypothesis (H1 ): There is a difference between population param-
eter and given parameter.

Note: The alternative hypothesis will indicate whether a 1-tailed or a 2-tailed
test is utilized to reject the null hypothesis.

2. Set the rejection criteria


This determines how different the parameters and/or statistics must be before
the null hypothesis can be rejected. This “region of rejection” is based on α
the significance level. The point of rejection is known as the critical value.

3. Compute the test statistic


The collected data are converted into standardized scores for comparison with
the critical value.

4. Decide results of null hypothesis


If the test statistic falls in the region of rejection bounded by the
critical value(s), the null hypothesis is rejected.

Terms used in Hypothesis Testing


A critical value is a value that separates the acceptance region from the rejection
region.

An acceptance region is the range of values of the sample statistic, centred about
the hypothesised value of the population parameter, that leads to the acceptance of
the null hypothesis.

A rejection region is the range of values of the sample statistic that leads to the
rejection of the null hypothesis.

A two tailed test has an area of rejection both below and above the hypothesised
value.

A one sided and right tailed test has an area of rejection which lies above the
null hypothesised value of the population parameter.

A one sided and left tailed test has an area of rejection which lies below the
null hypothesised value of the population parameter.

A test statistic T is a value calculated from the sample data which is used to
decide whether or not H0 should be rejected.

Hypothesis Testing for a Single Population Mean


In this chapter we will confine our discussion of hypothesis testing to tests
concerning the population mean. The hypotheses we will discuss are:

1. Two-sided test: H0 : µ = µ0 versus H1 : µ ≠ µ0.

2. One-sided test (right tailed): H0 : µ = µ0 versus H1 : µ > µ0.

3. One-sided test (left tailed): H0 : µ = µ0 versus H1 : µ < µ0.


Let us now proceed to test a hypothesis. When calculating the test statistic of a
single mean, one has to ascertain whether the variance is known or not and also
have an appreciation of the sample size.

The following test statistics are used depending on whether σ is known.

Case 1: When σ² is known: Use Z, where
\[
z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \tag{5.9}
\]

Case 2: When σ² is unknown and n < 30: Use T, where
\[
T = \frac{\bar{x} - \mu}{s / \sqrt{n}} \tag{5.10}
\]

Case 3: When σ² is unknown and n is large: Use Z, where
\[
z = \frac{\bar{x} - \mu}{s / \sqrt{n}} \tag{5.11}
\]

5.5.1 Two-sided tests


Testing H0 : µ = µ0 Versus H1 : µ ≠ µ0 When σ² is Known
The test statistic is
\[
Z_{calc} = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} \tag{5.12}
\]
Then |Zcalc| tends to be large if H0 is false and small otherwise. An α-size test is to
reject H0 if |Zcalc| > z_{α/2}.

Testing H0 : µ = µ0 Versus H1 : µ ≠ µ0 When σ² is Unknown, n < 30

The test statistic is
\[
T_{calc} = \frac{\bar{X} - \mu_0}{s / \sqrt{n}} \tag{5.13}
\]
Then Tcalc ∼ t_{n−1}. An α-size test is to reject H0 if |Tcalc| > t_{n−1}(α/2).


Testing H0 : µ = µ0 Versus H1 : µ ≠ µ0 When σ² is Unknown, n Large

The test statistic is
\[
Z_{calc} = \frac{\bar{X} - \mu_0}{s / \sqrt{n}} \tag{5.14}
\]
An α-size test is to reject H0 if |Zcalc| > z_{α/2}.

Example 5.7
The ages of a sample of employees who come very early to work were observed in one
financial institute. The following information was obtained from the sample: x̄ = 26.3,
σ² = 9 and n = 36. You are required to test the hypothesis that the mean age is
25. Assume α = 0.005.

Solution
Since σ is known, we are using Case 1, the Z distribution. The test is a two tailed
test.
α = 0.005 ⇒ z_{0.0025} = 2.81
H0 : µ = 25 versus
H1 : µ ≠ 25
The critical region is Z < −2.81 and Z > 2.81. Thus, reject H0 if |Zcalc| > 2.81.
The test statistic is
\[
Z_{calc} = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}
         = \frac{26.3 - 25}{3 / \sqrt{36}}
         = 2.6
\]

Since |Zcalc| < 2.81, that is, 2.6 < 2.81, we fail to reject H0 and conclude that
there is insufficient evidence that the average age of early comers differs from 25.
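
The same two-sided Z test can be sketched in Python. The function name two_sided_z_test is illustrative. Here the critical value is computed from the standard normal distribution instead of being read from a table, so it comes out as roughly 2.807 rather than the rounded 2.81.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_z_test(xbar, mu0, sigma, n, alpha):
    """Return (Zcalc, critical value, reject H0?) for H1: mu != mu0."""
    z_calc = (xbar - mu0) / (sigma / sqrt(n))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    return z_calc, z_crit, abs(z_calc) > z_crit

# Values from Example 5.7 (sigma^2 = 9, so sigma = 3)
z_calc, z_crit, reject = two_sided_z_test(xbar=26.3, mu0=25, sigma=3, n=36, alpha=0.005)
print(f"Zcalc = {z_calc:.2f}, critical value = {z_crit:.2f}, reject H0: {reject}")
```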

Example 5.8
It is desired that a statistician test the hypotheses given below and make correct
recommendations based on the information gathered from a sample of measurements
of 40mm pipes which were supplied by a new supplier. Assume α = 0.01.
H0 : µ = 40, H1 : µ ≠ 40, n = 18, x̄ = 34, and s² = 64.


Solution
Since n is small and σ unknown we use the t-distribution. The test is a two tailed
test.
H0 : µ = 40 versus
H1 : µ ≠ 40
α = 0.01, t_{n−1}(α/2) = t_{17}(0.005) = 2.90
The critical region is T < −2.90 and T > 2.90. Thus reject H0 if |Tcalc| > 2.90.
The test statistic is
\[
T_{calc} = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}
         = \frac{34 - 40}{8 / \sqrt{18}}
         = -3.182
\]

Since |Tcalc| > 2.90, that is, 3.182 > 2.90, we reject H0 and conclude that the true
mean is not equal to 40. The pipes are not 40mm.

Exercise 5.2
1. A geologist is testing the hypothesis that the melting point of an unusual
carbon substance is 1946 °C. He makes 7 determinations and obtains the
values 1944, 1947, 1945, 1947, 1949, 1946 and 1944 °C. What conclusions
can you draw at a significance level of 0.05?

2. An insurance broker claims that teachers spend an average of R65.00 per
month on life insurance. To test this hypothesis a random sample of 124 teachers
was taken and it was observed that on average teachers spend about R52.24
with a standard deviation of R11.64. Using a 0.01 significance level, does this
support the broker's claim?

3. A baker stated that on average the number of loaves of bread sold daily is 3 000
with a standard deviation of 300. An employer wants to test the accuracy of
this statement. A random sample of 36 days showed the average daily sales
were 3 150. Test at the 1% level of significance if the bakery's statement can
be accepted.

Let us discuss one sided or one tailed tests.


5.5.2 One-sided tests (right tailed)


The ideas discussed in Section 5.5, cases 1, 2 and 3, can be extended to this
one sided test. The test statistic to be used in this test will be determined by
whether we know σ or not and by the sample size n (whether it is a small sample or
a large sample), see Section 5.5.

We wish to test the hypotheses


H0 : µ = µ0 versus
H1 : µ > µ0

Depending on the sample size and knowledge of σ, we reject H0 if test statistic >
tabulated value(α).

1. If σ is known, then use
\[
z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}
\]
and reject H0 if z > zα.

2. If σ is unknown and n < 30, then use
\[
t = \frac{\bar{x} - \mu}{s / \sqrt{n}}
\]
and reject H0 if t > tα(n − 1).

3. If σ is unknown and n > 30, then use
\[
z = \frac{\bar{x} - \mu}{s / \sqrt{n}}
\]
and reject H0 if z > zα.

Example 5.9
The average distance travelled by a small engined vehicle on 10 litres of petrol is
162.5 km with a standard deviation of 6.9 km. Is there reason to believe, at the 5%
level of significance, that adding a new additive to petrol increases the distance
travelled on 10 litres if a random sample of 50 small cars has an average of
165.2 km per 10 litres?

Solution

H0 : µ = 162.5

H1 : µ > 162.5
n = 50, x̄ = 165.2 and σ = 6.9. Since σ is known we use the Z distribution. The
test is a one tailed test.
α = 0.05, zα = z0.05 = 1.645

The critical region is Z > 1.645. Thus reject H0 if Zcalc > 1.645.


The test statistic is
\[
Z_{calc} = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}
         = \frac{165.2 - 162.5}{6.9 / \sqrt{50}}
         = 2.7669
\]

Since Zcalc > z0.05, that is, 2.7669 > 1.645, we reject H0 and conclude that the
additive increases the distance travelled on 10 litres of petrol.
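
A one-sided Z test differs from the two-sided one only in where the rejection region lies. The sketch below, with the illustrative function name one_sided_z_test and a hypothetical tail argument, handles both the right-tailed and the left-tailed case and reproduces the calculation of Example 5.9.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_z_test(xbar, mu0, sd, n, alpha, tail):
    """tail = 'right' tests H1: mu > mu0; tail = 'left' tests H1: mu < mu0."""
    z_calc = (xbar - mu0) / (sd / sqrt(n))
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # all of alpha sits in one tail
    if tail == "right":
        reject = z_calc > z_alpha
    elif tail == "left":
        reject = z_calc < -z_alpha
    else:
        raise ValueError("tail must be 'right' or 'left'")
    return z_calc, z_alpha, reject

# Values from Example 5.9
z_calc, z_alpha, reject = one_sided_z_test(165.2, 162.5, 6.9, 50, 0.05, "right")
print(f"Zcalc = {z_calc:.4f}, z_alpha = {z_alpha:.3f}, reject H0: {reject}")
```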

Exercise 5.3
1. A supermarket claims that customers to its stores spend on average 25 minutes
carrying out their purchases. A consumer body wants to verify this claim.
They observed the entry and departure times of 24 randomly selected customers
at supermarkets in the chain. The sample average time was half an hour
with a standard deviation of 14.1 minutes. Test the validity of the
supermarket's belief at the 2.5% level of significance.

2. It is claimed that an automobile is driven on average less than 12 000 km per
year. To test this claim a random sample of 100 automobile owners are asked
to keep a record of the kilometres they travel. Would you agree with this claim
if the random sample showed an average of 14 500 kilometres and a standard
deviation of 2 400 kilometres? Use a 0.01 level of significance.

5.5.3 One-sided tests (left tailed)


The test statistic to be used in this test will be determined by whether we know σ
or not and by the sample size n (whether it is a small sample or a large sample), as
we did in Section 5.5.2.

We wish to test the hypotheses


H0 : µ = µ0 versus
H1 : µ < µ0

Depending on the sample size and knowledge of σ, we reject H0 if test statistic
< −tabulated value(α).

The critical values, for particular tests, sample sizes and significance levels, are
available in tables. Remember, sometimes, we have to interpolate the critical
values from tables.

1. If σ is known, then use
\[
z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}
\]
and reject H0 if z < −zα.

2. If σ is unknown and n < 30, then use
\[
t = \frac{\bar{x} - \mu}{s / \sqrt{n}}
\]
and reject H0 if t < −tα(n − 1).

3. If σ is unknown and n > 30, then use
\[
z = \frac{\bar{x} - \mu}{s / \sqrt{n}}
\]
and reject H0 if z < −zα.

Example 5.10
The average length of time spent in a bank queue has been 50 minutes with a
standard deviation of 10 minutes. A bank is testing a new banking system with new
software. A random sample of 12 clients had an average banking time of 42 minutes
with a standard deviation of 11.9 minutes under the new system. Test the hypothesis
that the population mean is now less than 50 using a level of significance of

(i) 0.05 and

(ii) 0.01.

Assume the population of times to be normal.

Solution.
Let H0 : µ = 50 minutes versus H1 : µ < 50 minutes.

(i) Using the t-distribution at α = 0.05 ⇒ t_{n−1}(0.05) = t_{11}(0.05) = 1.80. Reject H0 if
T < −1.80, that is, the critical region is (−∞, −1.80). The test statistic is
\[
T = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}
  = \frac{42 - 50}{11.9 / \sqrt{12}}
  = -2.3288
\]

Since −2.3288 < −1.80, that is, the test statistic is in the critical region, we
reject the null hypothesis at the 5% level of significance.

(ii) Using the t-distribution at α = 0.01 ⇒ t_{n−1}(0.01) = t_{11}(0.01) = 2.72. Reject H0 if
T < −2.72, that is, the critical region is (−∞, −2.72). Since −2.33 > −2.72, we do
not reject H0 at the 1% level of significance.
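
The two decisions in this example can be reproduced with a short Python sketch. The function name left_tailed_t_test is illustrative, and the critical values 1.80 and 2.72 are the table values quoted in the solution, passed in by hand because the standard library has no t-distribution.

```python
from math import sqrt

def left_tailed_t_test(xbar, mu0, s, n, t_crit):
    """Return (T, reject H0?) for H1: mu < mu0, given the tabulated t_crit."""
    t_calc = (xbar - mu0) / (s / sqrt(n))
    return t_calc, t_calc < -t_crit

# Example 5.10: same data, two significance levels with their table values
decisions = {}
for alpha, t_crit in [(0.05, 1.80), (0.01, 2.72)]:
    t_calc, reject = left_tailed_t_test(xbar=42, mu0=50, s=11.9, n=12, t_crit=t_crit)
    decisions[alpha] = reject
    print(f"alpha = {alpha}: T = {t_calc:.4f}, reject H0: {reject}")
```

The test statistic is the same in both parts; only the critical value, and therefore the decision, changes with α.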


Note: We did not use σ = 10, since the sample pertains to a new banking system
with new software, so the population is not the same.

This essentially means that the true mean is likely to be less than 50 minutes, but
the difference is not significant at the 1% level and so may not warrant the high
cost that would be required to overhaul the current banking system.

Exercise 5.3
1. A cement manufacturer packs 50 kg bags. To see if the manufacturer puts
enough cement in the bags, the contents of a random sample of 200 such bags
were weighed. The average contents turned out to be 48 kg with a standard
deviation of 0.12. State the null and research hypotheses for this problem.
Hence test, at the 0.05 significance level, whether or not the manufacturer
satisfies the requirement.

2. A scientist claimed that mice with an average life span of 28 months will live
to be about 43 months old when 45% of the calories in their food are replaced
by vitamins and proteins. Is there any reason to believe that the mean age is
less than 43 months if 54 mice that are placed on this diet have an average
life of 38 months with a standard deviation of 5.8 months? Use a 0.025 level
of significance.

3. Redo number 2, using a sample of 20 mice instead of 54 placed on this diet.

Note

In inference we use samples to make inferences about a population. You will never
have to calculate the population standard deviation. You are either given its value
or you do not know it.

The p-value
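What is the p-value? You will find this value under a variety of names; some of
these are the critical level, the probability value and the associated probability.
Suppose, for a given null hypothesis H0, we calculate a test statistic, say k_test. Then,
for a right tailed test,

p-value = P(K ≥ k_test | H0),

where K denotes the test statistic regarded as a random variable. In other words, the
p-value is the probability, assuming H0 is true, of obtaining a value at least as
extreme as the one observed. Generally we reject H0, at the α level, if the p-value < α.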


You can also say the p-value is the smallest value of α for which the test results are
statistically significant.
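
As a rough illustration, the p-values for the two Z tests worked earlier in this chapter can be computed directly from the standard normal distribution. The function name z_p_value and its tail argument are illustrative assumptions; the test statistics 2.6 and 2.7669 are those of Examples 5.7 and 5.9.

```python
from statistics import NormalDist

def z_p_value(z_calc, tail):
    """p-value of a Z test; tail is 'right', 'left' or 'two'."""
    phi = NormalDist().cdf  # standard normal cumulative distribution
    if tail == "right":
        return 1 - phi(z_calc)
    if tail == "left":
        return phi(z_calc)
    if tail == "two":
        return 2 * (1 - phi(abs(z_calc)))
    raise ValueError("tail must be 'right', 'left' or 'two'")

p_example_57 = z_p_value(2.6, "two")       # two-sided test, alpha = 0.005
p_example_59 = z_p_value(2.7669, "right")  # right tailed test, alpha = 0.05
print(p_example_57 < 0.005, p_example_59 < 0.05)
```

Comparing each p-value with its α leads to the same decisions as the critical value approach: H0 is not rejected in Example 5.7 and is rejected in Example 5.9.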

The probability value is commonly used in statistical computer packages. Once cal-
culated you do not need statistical tables.

Exercise 5.4
1. Suppose it is known that a variable X is normally distributed with a mean
of 340 minutes. A random sample of 20 observations has an average of 332
minutes with a standard deviation of 43 minutes. Test the hypothesis, at the
0.025 level of significance, that µ = 340 minutes against the alternative
µ < 340 minutes.

2. Suppose the observations

1.4 -2.6 1.3 2.1 -2.2 -3.6 3.2 1.8 2.4

are independent and identically distributed as N(µ, σ²). Test the hypothesis
at the 5% significance level that the mean of the population is 2.2.

3. A random sample of size 14 from a normal distribution has a mean x̄ = 33.2
and a standard deviation of s = 5.41. Does this suggest, at the 0.05 level of
significance, that the population mean is greater than 32?

4. A manufacturer claims that the average life of batteries produced by his firm
is at least 30 months. You disagree, contending that the average life of the
batteries is less than 30 months. A random sample of 12 batteries has a
mean of 38.7 months and a standard deviation of 18 months. Perform the
appropriate hypothesis test. Use a significance level of 0.05.

5. The manufacturer of an over the counter pain reliever claims that its product
brings pain relief to headache sufferers in less than 3.5 minutes on average. To
be able to make this claim in its television advertisements the manufacturer
was required by a particular television network to present statistical evidence
in support of the claim. The manufacturer reported that for a random sample
of 50 headache sufferers, the mean time to relief was 3.3 minutes and the standard
deviation was 66 seconds. Do these data support the manufacturer's claim?
Test using α = 0.05.

6. What is the advantage of using the p-value over the critical value?


5.6 Summary
In this chapter we discussed the concept of estimation and hypothesis testing. These
two concepts fall under the topic statistical inference. In statistical inference we use
sample data to make inferences about the population from which the sample came.

You should now be able to compute point and interval estimates of a population
mean. You should also be able to test a variety of hypotheses, depending on the
question.


Chapter 6

Correlation and Regression Analysis

6.1 Introduction
A correlation analysis is the study of the strength of the relationship between any
pair of variables. Correlation measures how strongly pairs of variables are
related. The term regression analysis, on the other hand, describes a collection
of statistical techniques that quantify how one variable depends on another (or
several other variables). Regression analysis is now perhaps the most widely used
method of modelling relationships.

6.2 Scatter Plots


The easiest way to visualise the nature of the relationship between variables is to
use a scatter plot. A scatter plot is a plot of one variable against another. You
may use three variables to get a three-dimensional plot. By looking at the scatter
plot, you can get an idea of the relationship between variables, that is, whether the
variables are linearly related or related in some other way.

Note
A scatter plot is made up of the axes and the points where values meet. The points
should not be joined together. An example of a scatter plot is shown in Figure 6.1.

Example 6.1
The incomes and amounts spent on entertainment by a sample of individuals at a
bar was recorded and a scatter plot constructed.


Figure 6.1: Scatter plot of incomes versus entertainment

The scatter plot shows that as income increases the amount spent on entertainment
also increases.
It is important to construct a scatter plot in order to get an understanding of what
you will expect the relationship to be.

6.3 Correlation
Correlation is the intensity or strength of the relationship between two variables.
It is a measure of the extent to which variables are related or associated. If the
correlation between two variables is zero, then the two variables are not linearly
related. On the other hand, a correlation of 1 or −1 means that there is a perfect
linear relationship between the two variables. Here, "perfect" means an exact
relationship. How is the correlation calculated?

6.3.1 Correlation Coefficient


Symbolically, the population correlation between two variables X and Y is given by
the correlation coefficient ρxy. This is defined by:
\[
\rho_{xy} = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}} .
\]
We rarely deal with population parameters as they stand. We often estimate the
population parameters on the basis of samples. In this particular case, ρxy is esti-
mated from a sample, say (x1 , y1 ), (x2 , y2 ), ..., (xn , yn ), to give ρ̂xy .


This quantity is calculated using the formula given in Equation 6.1:
\[
\hat{\rho}_{xy} = \frac{\sum_{i=1}^{n} [X_i - \bar{X}][Y_i - \bar{Y}]}
{\sqrt{\sum_{i=1}^{n} [X_i - \bar{X}]^2 \times \sum_{i=1}^{n} [Y_i - \bar{Y}]^2}} \tag{6.1}
\]

There are other measures of association, but for now we shall work with the one
above. ρxy is confined to the interval [−1, 1], that is, −1 ≤ ρxy ≤ 1. If ρxy = −1,
then there is an exact inverse relationship between x and y. If x increases, then y
decreases and vice-versa.

Example 6.2
Consider the number of expensive goods n sold by a company (which got the goods
at a very cheap cost price) and the profit p. ρnp ≈ 1 in this case. This means
that the more the sales, the higher the profit. (“≈” here means approximately) If
ρnp ≈ 0, then we would conclude that, there is no relationship between the profit
and number of sales.

On the other hand, ρnp ≈ −1 suggests that the more the sales the lower the profit.
The product is probably being sold at less than the cost price.

6.3.2 Correlation Matrix


You will notice that the correlation coefficient is defined for two variables. Now let
us assume we have many variables. Take, for instance, p variables, then how do we
handle the correlation problem? Let these variables be X1, X2, ..., Xp. Then we
can express the different combinations of correlation coefficients in a matrix, known
as a correlation matrix, as follows:
\[
\rho =
\begin{pmatrix}
1 & \rho_{x_1 x_2} & \cdots & \rho_{x_1 x_p} \\
\rho_{x_2 x_1} & 1 & \cdots & \rho_{x_2 x_p} \\
\vdots & \vdots & \ddots & \vdots \\
\rho_{x_p x_1} & \rho_{x_p x_2} & \cdots & 1
\end{pmatrix}
\]

ρ_{xi xj}, for i, j = 1, 2, ..., p, represents the correlation between variables Xi and
Xj. Therefore by looking at the correlation matrix, we can tell which variables are
related. Notice that the correlation between Xk and itself is always 1. For example,
let us look at the sample correlation coefficient for any k, ρ̂_{xk xk}. Then:
\[
\begin{aligned}
\hat{\rho}_{x_k x_k}
&= \frac{\sum_{i=1}^{n} [X_{ki} - \bar{X}_k][X_{ki} - \bar{X}_k]}
        {\sqrt{\sum_{i=1}^{n} [X_{ki} - \bar{X}_k]^2 \times \sum_{i=1}^{n} [X_{ki} - \bar{X}_k]^2}} \\
&= \frac{\sum_{i=1}^{n} [X_{ki} - \bar{X}_k]^2}
        {\sqrt{\left[ \sum_{i=1}^{n} [X_{ki} - \bar{X}_k]^2 \right]^2}} \\
&= \frac{\sum_{i=1}^{n} [X_{ki} - \bar{X}_k]^2}
        {\sum_{i=1}^{n} [X_{ki} - \bar{X}_k]^2} \\
&= 1
\end{aligned}
\]
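This gives us the matrix:
\[
\hat{\rho} =
\begin{pmatrix}
1 & \hat{\rho}_{x_1 x_2} & \cdots & \hat{\rho}_{x_1 x_p} \\
\hat{\rho}_{x_2 x_1} & 1 & \cdots & \hat{\rho}_{x_2 x_p} \\
\vdots & \vdots & \ddots & \vdots \\
\hat{\rho}_{x_p x_1} & \hat{\rho}_{x_p x_2} & \cdots & 1
\end{pmatrix}
\]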

Example 6.3
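Suppose you are given three variables z1, z2 and z3 representing the coded tensile
strength, melting point and amount of Titanium in a new alloy respectively. Use
the following data to calculate the correlation matrix.

z1    z2    z3
12    3     0.2
23    7     0.8
9     2     0.1
30    10    1.0

\[
\hat{\rho}_{z_1 z_2} = \frac{\sum_{i=1}^{n} [z_{1i} - \bar{z}_1][z_{2i} - \bar{z}_2]}
{\sqrt{\sum_{i=1}^{n} [z_{1i} - \bar{z}_1]^2 \times \sum_{i=1}^{n} [z_{2i} - \bar{z}_2]^2}} = 0.999
\]
\[
\hat{\rho}_{z_1 z_3} = \frac{\sum_{i=1}^{n} [z_{1i} - \bar{z}_1][z_{3i} - \bar{z}_3]}
{\sqrt{\sum_{i=1}^{n} [z_{1i} - \bar{z}_1]^2 \times \sum_{i=1}^{n} [z_{3i} - \bar{z}_3]^2}} = 0.993
\]
\[
\hat{\rho}_{z_2 z_3} = \frac{\sum_{i=1}^{n} [z_{2i} - \bar{z}_2][z_{3i} - \bar{z}_3]}
{\sqrt{\sum_{i=1}^{n} [z_{2i} - \bar{z}_2]^2 \times \sum_{i=1}^{n} [z_{3i} - \bar{z}_3]^2}} = 0.988
\]
\[
\hat{\rho} =
\begin{pmatrix}
1 & 0.999 & 0.993 \\
0.999 & 1 & 0.988 \\
0.993 & 0.988 & 1
\end{pmatrix}
\]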
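
The entries of this correlation matrix can be checked with a short Python sketch that applies Equation 6.1 to every pair of columns. The function names sample_correlation and correlation_matrix are illustrative choices; only the data come from the example.

```python
from math import sqrt

def sample_correlation(x, y):
    """Sample correlation coefficient of two equal-length lists (Equation 6.1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)

def correlation_matrix(columns):
    """Matrix of pairwise sample correlations for a list of variables."""
    return [[sample_correlation(a, b) for b in columns] for a in columns]

z1 = [12, 23, 9, 30]       # coded tensile strength
z2 = [3, 7, 2, 10]         # melting point
z3 = [0.2, 0.8, 0.1, 1.0]  # amount of Titanium

matrix = correlation_matrix([z1, z2, z3])
for row in matrix:
    print("  ".join(f"{value:.3f}" for value in row))
```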


The correlation between any two of the variables is very high meaning that the
variables are highly related. They either increase or decrease together. Addition of
Titanium increases the tensile strength and melting point of the alloy. The process
seems to produce an alloy whose strength can be increased or decreased, by changing
the amount of Titanium.

Exercise 6.1
Consider the following data collected on two variables which are suspected to be
related. The variable x represents the grade scores of students in a class and y the
number of students with the same grade.

Grade x 2 5 9 3 6 7
Frequency y 34 45 59 40 50 48
The data above shows the cross-tabulation relating the two variables x and y. From
this illustration, it is clear that there is a linear relationship between the two vari-
ables x and y. As one variable increases, the other variable also increases. This
means that as the grade increases the number of students with high grades also
increases.

Calculate the correlation coefficient of the grade and the frequency. Comment on
your result.

When we calculated the correlation coefficient we quantified the strength of the
relationship between two variables. In regression analysis we study how one variable
(the dependent variable) depends on other variables (the independent variables). In
this course we will introduce the case of one independent variable.

In real life situations we are usually interested in the relationship between variables.
For example, it is difficult to study the impact or effect of salary increments, food
price increases, etc. on inflation by using descriptive techniques. Thus, regression
analysis equips you with a way of studying such situations.

Note: Descriptive techniques generally look at one variable at a time, while
regression analysis looks at the relationship between variables.

6.4 Simple Linear Regression Analysis


Simple regression analysis is seldom used in applied research because the workings
of most socio-economic systems cannot be adequately represented by such a simple
formulation. However, knowledge of simple regression analysis is a good foundation
for understanding multiple regression analysis [which is usually used in applied
research].

The term simple implies that a single independent variable x is involved and the
term linear implies linearity in the parameters.

Consider the following equation which relates Y and X:

Y = f (X) (6.2)

where f is a function showing how Y is related to X. In the literature, the response
variable, i.e., the variable of interest Y, and the explanatory or independent variable
X, which is used to explain Y, have several names. We give these names in the table
below.

Table 6.1: Names given to response and explanatory variables in Regression Analysis

       Y                       X
(a)    Predictand              Predictor
(b)    Regressand              Regressor
(c)    Dependent variable      Independent variable
(d)    Effect variable         Causal variable
(e)    Endogenous variable     Exogenous variable
(f)    Target variable         Control variable

Each pair of the above terms is appropriate for a particular use of regression analysis.
For example, the terminology in (a) is often used if the purpose of the regression is
prediction; pairs (b), (c) and (d) are used by different applied researchers in their
discussion of regression models; (e) is usually used in studies of causation or
causality; while pair (f) is more appropriate in control problems.

6.4.1 Types of Relationship


There are two main types of relationship between or among variables, namely exact
and statistical relationships.

Exact Relationships
An exact relationship is a relationship of the form:
\[
y_i = \beta_0 + \beta_1 x_i \tag{6.3}
\]
where the subscript i (for i = 1, 2, ..., n) refers to the ith observation. What does
this mean? Well, for any value of xi, the value of yi will be equal to some constant
β0 added to β1 × xi, where β0 and β1 are constants. yi is determined by xi, i.e.,
a unit change in X causes a change equal to β1 in Y.

Note
The variables X and Y can be random or deterministic i.e., non-random.
Generally this equation can be expressed as follows:
y = β0 + β1 x (6.4)
where x and y are possible values of X and Y respectively. An example of an exact
relationship follows.

Exercise 6.2
The relationship between the area of a square A and the length of one side L is
given by:
Area = β0 + β1 × length²
where β0 = 0 and β1 = 1. This is an exact relationship. Figure 6.2 gives a plot of the
relationship between Area [A] and the square of the length [L²].

Figure 6.2: The relationship between A and L²

If we had plotted A against L, we would have obtained the relationship shown in
Figure 6.3.


Figure 6.3: The relationship between A and L

These are both exact relationships. In Figure 6.2, we plotted A versus L², which
gives a linear relationship of the form y = β0 + β1 x. On the other hand, in Figure 6.3
we plotted A versus L, and we notice that, this time, there is a quadratic relationship.

Statistical Relationships
A statistical relationship, unlike an exact relationship, is not a perfect one, that is,
it does not give unique values of Y for a given value of X, but can be described
exactly in probabilistic terms. For instance, consider the following regression model
showing a statistical relationship between Y and X which is no longer exact because
of the error term εi:
\[
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i . \tag{6.5}
\]
The variable εi is a value added to the equation to make the two sides of the equation
equal. The term εi is called the error term. The error term is usually assumed
to have a normal distribution with mean 0 and variance σ². The relationship
between Y and X in Equation 6.5 is called a stochastic or statistical relationship
because of the presence of the random error term needed to make the equation exact.
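
The difference between an exact and a statistical relationship can be seen by simulating data from Equation 6.5. In the sketch below, the function name simulate_responses and all numerical values (β0 = 40, β1 = 5, σ = 4, and the x values) are hypothetical choices made purely for illustration.

```python
import random

def simulate_responses(xs, beta0, beta1, sigma, seed=0):
    """Generate Y_i = beta0 + beta1 * x_i + e_i with e_i drawn from N(0, sigma^2)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [beta0 + beta1 * x + rng.gauss(0, sigma) for x in xs]

xs = [1, 2, 3, 4, 5]

# sigma = 0 removes the error term, so the relationship is exact
exact = simulate_responses(xs, beta0=40, beta1=5, sigma=0)

# sigma > 0 gives a statistical relationship: points scatter about the line
noisy = simulate_responses(xs, beta0=40, beta1=5, sigma=4)

for x, y_exact, y_noisy in zip(xs, exact, noisy):
    print(x, y_exact, round(y_noisy, 2))
```

With σ = 0 every point lies on the line β0 + β1 x; with σ > 0 the same x can give different values of Y from one run to the next, which is what the error term describes.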

Definition 3 A regression model is a statistical relationship between a dependent
variable, say Y, and some explanatory variable(s), say X. The model is said to be
deterministic if the explanatory variable X is non-random.


Note: The following term from Equation 6.5:
\[
\beta_0 + \beta_1 X_i \tag{6.6}
\]
is the deterministic component of Yi, and β0 and β1 are called regression
coefficients or regression parameters. εi is the stochastic or random disturbance
term; it takes care of all the variation of Y which is not explained by the
deterministic component β0 + β1 Xi. We will discuss how to estimate the regression
coefficients from given data later.

Let us look at a practical example of a statistical relationship in order for us to
appreciate the differences between an exact relationship and a statistical relationship.

Exercise 6.3
[Statistical relationship] A group of students are interested in evaluating the ad-
vantages and disadvantages of different study patterns and their effect on their
performance. Consider Y , the mark a student gets after an examination, and X1 ,
the number of hours the student puts into reading for the examination.

The variable X1 was chosen by the students because it appeared to contribute a lot
to the examination mark. A possible equation to represent the relationship between
Y and X1 is:

y = β0 + β1 x1 + ε (6.7)

β0 and β1 are unknown constants or regression parameters. If a student puts 0 hours
into studying for the examination, then we expect him/her to get β0 marks. On
the other hand, if a student increases his/her study time by one hour, the model
suggests that the expected mark changes by β1. Please note that, in Equation 6.7,
we will index y and x as yi and xi respectively when we have the actual observations
x1, x2, · · · , xn and y1, y2, · · · , yn at hand.
Notice how the points in Figure 6.4 are not always on the line. This is because the
relationship between examination mark and the time spent on studying is not exact.

Figure 6.4: Examination mark versus time spent studying

The students could have added x2, the number of books the student consulted, as a
variable, since this appears to have an impact on the final examination mark. This
would give Equation 6.8.

Y = β0 + β1 X1 + β2 X2 + ε (6.8)

Procedures embraced by regression analysis concern themselves with drawing
conclusions about these coefficients. For example, a positive coefficient β1 means
that the more hours spent studying, the higher the expected examination mark. The
term ε in the equation is added to account for the fact that the equation is not
exact. If there are p explanatory variables, a regression equation can be expressed
more generally as in Equation 6.9.

y = β0 + β1 x1 + β2 x2 + · · · + βp xp + ε (6.9)

This is referred to as multiple regression.

When To Apply Regression Analysis


There are conditions which must be satisfied before we can apply regression analysis.
The first condition is that the variables of concern should be related to each
other, otherwise the idea of regression collapses. The second condition is that
one variable should change in response to the other, i.e. there should be a dependence
relationship. How do we check these requirements? This is often done by:

1. constructing a scatter plot.


2. calculating the correlation of the variables.


Exercise 6.4
Figure 6.5 shows examples of scatter plots of the relationship between
an independent variable and a dependent variable. The variable y represents the
dependent variable, while x represents the independent variable. In the first plot
we have a linear relationship. This could be the relationship between Intelligence
Quotient (IQ) x and the mark obtained in an achievement test y. In the second
plot, we have a quadratic relationship. This could be the relationship between time
x and the distance y travelled (from the source) by an object thrown up into the
air.

Figure 6.5: Scatter plots showing two possible relationships between x and y

So, by constructing a scatter plot, we are guided in our decision or choice of the
equation to use.
All procedures and conclusions drawn in regression analysis depend, at least indi-
rectly, on the assumptions of the regression model. A model is what the data
analysts perceive as the mechanism that generates the data on which the regression
analysis is conducted.

Fitting the model to a set of data involves estimating the regression
coefficients and formulating a fitted regression model (i.e. the model with the
estimated coefficients):

ŷ = β̂0 + β̂1 x (6.10)

Some Uses of Regression Analysis


Uses of regression analysis fall into the following categories:
1. Prediction: using the available data in conjunction with the model to
predict/estimate future outcomes.


2. Variable screening: removing unnecessary variables from a multiple linear
regression model. In this way we can find the variables that are important to a
dependent variable.

3. Explaining: system explanation, i.e. which variables contribute the most and how
they contribute to the dependent variable.

4. Inference: estimation of the parameters in the model.

5. Planning and control: if we have an appropriate model we can explain the
physical system and thus plan ahead and control the system.

Regression techniques are also widely used in econometrics.

Regression Assumptions
We shall use the Least Squares (LS) procedure to estimate the parameters in the
model given by Equation 6.5. Although other procedures are available for
estimating the parameters, we shall only use the LS procedure. We will
also make the following assumptions (these are necessary for inference):

• There is a linear relationship between the independent variable xi and the
dependent variable yi.

• The xi's are non-random and are observed with negligible error.

• The εi's are random variables with mean zero and constant variance. This is
called the homogeneous variance assumption. Mathematically, this assumption is:

\[ E(\varepsilon_i) = 0 \quad\text{and}\quad \mathrm{Var}(\varepsilon_i) = E(\varepsilon_i^2) = \sigma^2. \]

• The εi's are uncorrelated, i.e.

\[ E(\varepsilon_i \varepsilon_j) = \begin{cases} 0 & \text{if } i \neq j \\ \sigma^2 & \text{if } i = j \end{cases} \qquad \text{for } i, j = 1, 2, \ldots, n. \]

• The normal theory assumption is imposed on the εi's. This is the assumption
that the εi's are normally distributed with mean zero and variance σ².
Mathematically, this is stated as

\[ \varepsilon_i \sim N(0, \sigma^2). \]


From above, it follows that:

E(yi ) = β0 + β1 xi i = 1, 2, . . . , n, (6.11)

since E(ǫi ) = 0. Thus, we can use the notation:

E(y|x) = β0 + β1 x. (6.12)

This is the expected value of y for a given value of X, i.e. given X = x.

6.5 The Least Squares Technique


The method of least squares, attributed to K. Gauss, a German scientist, is perhaps
the most extensively used technique for estimating the parameters β0 and β1, to
give the estimates β̂0 and β̂1, respectively. These parameters have to be estimated
because we do not have the actual values available. When we use these estimates,
we get the fitted model given by:

ŷi = β̂0 + β̂1 xi. (6.13)

We call this the fitted model because the model now has estimated parameters.
Generally in Statistics, the ‘hat’ notation is used to indicate an estimate. Notice
that we don’t have the error term in Model 6.13. The relationship between x and
E(Y |x) is now an exact one.

Definition 4 (Residual)
Let ri = yi − ŷi. This difference is called a residual.

The distinction between the residual ri and the error term εi is important. The
residual measures the deviation of yi from the fitted value ŷi, whereas the error
term measures the deviation of yi from the true regression line β0 + β1 xi. Since
εi is unobservable, it is estimated by ri. The residuals are needed not only for
estimating the magnitude of the random variation in the yi's, but also for assessing
the appropriateness of the regression model employed. We shall discuss this later.

Least Squares Estimation


We want the fitted values ŷi to be as close as possible to yi. To achieve this, let us
consider $\sum_{i=1}^{n} r_i^2$, the Residual Sum of Squares (RSS). It makes good sense to
minimise RSS, since a good fit should produce the smallest possible sum of squares.
This is the basis of the least squares technique.


Figure 6.6: The observed values yi (marked ‘⋆’), residuals ri and fitted values ŷi
(marked ‘♦’)

Figure 6.6 illustrates what is happening: the points marked by ‘⋆’ represent
the observed values, while the fitted values, marked by ‘♦’, lie on the line.
The residual ri is shown as the difference between the observed value yi and
the fitted value ŷi.
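To minimise the RSS, β̂0 and β̂1 must satisfy the conditions:

\[ \frac{\partial}{\partial \hat\beta_0}\left(\sum_{i=1}^{n} r_i^2\right) = 0 \quad\text{and}\quad \frac{\partial}{\partial \hat\beta_1}\left(\sum_{i=1}^{n} r_i^2\right) = 0. \]

Thus:

\begin{align*}
\frac{\partial}{\partial \hat\beta_0}\left(\sum_{i=1}^{n} r_i^2\right)
  &= \frac{\partial}{\partial \hat\beta_0}\sum_{i=1}^{n}\left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right)^2 \\
  &= -2\sum_{i=1}^{n}\left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right) \\
  &= -2\sum_{i=1}^{n} y_i + 2n\hat\beta_0 + 2\hat\beta_1\sum_{i=1}^{n} x_i \\
  &= 0.
\end{align*}

So that:

\[ \sum_{i=1}^{n} y_i - n\hat\beta_0 - \hat\beta_1\sum_{i=1}^{n} x_i = 0. \]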


Dividing by n leads us to:

\[ \bar{y} = \hat\beta_0 + \hat\beta_1 \bar{x} \qquad (6.14) \]

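For β1 we have:

\begin{align*}
\frac{\partial}{\partial \hat\beta_1}\left(\sum_{i=1}^{n} r_i^2\right)
  &= \frac{\partial}{\partial \hat\beta_1}\sum_{i=1}^{n}\left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right)^2 \\
  &= -2\sum_{i=1}^{n} x_i\left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right) \\
  &= -2\sum_{i=1}^{n} y_i x_i + 2\hat\beta_0\sum_{i=1}^{n} x_i + 2\hat\beta_1\sum_{i=1}^{n} x_i^2 \\
  &= 0.
\end{align*}

This simplifies to

\[ \sum_{i=1}^{n} x_i y_i = \hat\beta_0\sum_{i=1}^{n} x_i + \hat\beta_1\sum_{i=1}^{n} x_i^2 \qquad (6.15) \]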

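Equations 6.14 and 6.15 are called the normal equations. The first normal equation,
written as $\sum_{i=1}^{n} y_i = n\hat\beta_0 + \hat\beta_1\sum_{i=1}^{n} x_i$, can be solved to give the
following estimate of β0 in terms of β̂1:

\begin{align*}
\hat\beta_0 &= \frac{1}{n}\sum_{i=1}^{n} y_i - \hat\beta_1\,\frac{1}{n}\sum_{i=1}^{n} x_i \qquad (6.16) \\
            &= \bar{y} - \hat\beta_1\bar{x}
\end{align*}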

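Substituting for β̂0 in Equation 6.15 gives us

\begin{align*}
0 &= \sum_{i=1}^{n} x_i y_i - \left[\frac{1}{n}\sum_{i=1}^{n} y_i - \hat\beta_1\frac{1}{n}\sum_{i=1}^{n} x_i\right]\sum_{i=1}^{n} x_i - \hat\beta_1\sum_{i=1}^{n} x_i^2 \\
  &= \hat\beta_1\left[\frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2 - \sum_{i=1}^{n} x_i^2\right] + \sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right),
\end{align*}

so that

\begin{align*}
\hat\beta_1 &= \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2} \\
            &= \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} \\
            &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.
\end{align*}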

We often state the above as:

\[ \hat\beta_1 = \frac{S_{xy}}{S_{xx}}, \qquad (6.17) \]



where $S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$ and $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$. Note that we first find
β̂1 and then find β̂0, using β̂0 = ȳ − β̂1 x̄.

The fitted line is ŷ = β̂0 + β̂1 x. This fitted line is referred to by many different
names in statistics. Some of the names are: the least squares line, fitted regression
line, estimated regression line or just the fitted model.
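The estimates in Equations 6.16 and 6.17 translate directly into a few lines of code.
The following is a minimal sketch using only the Python standard library; the
function name least_squares_fit and the small data set in the usage lines are our
own choices for illustration.

```python
def least_squares_fit(x, y):
    """Return (b0, b1), the least squares estimates of the intercept and slope."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # S_xy and S_xx as defined after Equation 6.17
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("all x values are equal, so the slope is undefined")
    b1 = s_xy / s_xx         # Equation 6.17
    b0 = y_bar - b1 * x_bar  # Equation 6.16
    return b0, b1


if __name__ == "__main__":
    # Made-up points that lie exactly on the line y = 1 + 2x
    print(least_squares_fit([0, 1, 2, 3], [1, 3, 5, 7]))  # prints (1.0, 2.0)
```

Note that the slope is computed first and the intercept second, in the same order
as in the derivation above.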

Example 6.4

The starting salary S per year of people of different educational background (Ed)
has always been of interest to people going to university. They have always tried to
find out the relationship between these. We expect the starting salary to be directly
related to the educational level, i.e., as the educational level increases, so does the
salary.

As this may not be the case, we shall investigate this suspicion using regression
analysis. Suppose that an individual’s educational level is given a score, then an
appropriate model is given by:

S = β0 + β1 Ed + ǫ

The data on S and Ed were collected and recorded as shown below. Find the esti-
mates for β0 and β1 and discuss your results.

Number Annual salary ($) Educational level score

1 20 000 2.8
2 24 500 3.4
3 23 000 3.2
4 25 000 3.8
5 20 000 3.2
6 22 500 3.4

Solution:

Calculations give


Number   Si        Edi    Si Edi    Edi²

1        20 000    2.8    56 000     7.84
2        24 500    3.4    83 300    11.56
3        23 000    3.2    73 600    10.24
4        25 000    3.8    95 000    14.44
5        20 000    3.2    64 000    10.24
6        22 500    3.4    76 500    11.56

Total    135 000   19.8   448 400   65.88

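\begin{align*}
\hat\beta_1 &= \frac{S_{Ed,S}}{S_{Ed,Ed}} \\
            &= \frac{448400 - \dfrac{(19.8)(135000)}{6}}{65.88 - \dfrac{(19.8)^2}{6}} \\
            &= \frac{2900}{0.54} \\
            &\approx \$5370.
\end{align*}

Therefore,

\begin{align*}
\hat\beta_0 &= \frac{135000}{6} - 5370 \times \frac{19.8}{6} \\
            &= \$4779.
\end{align*}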

Thus, we have the fitted model Ŝi = 4779 + 5370Edi . How do we interpret this
model? The starting salary is predicted to be $4 779, when the Educational level
score is zero. This may not say much since an educational score of zero does not
apply to this group of people by virtue of their being at University.

Perhaps of primary interest is the slope (coefficient), which indicates that for a
one-unit increase in educational score, the predicted salary increases by $5370.
For example, for an educational level score of 2.8, the predicted salary is Ŝ =
4779 + 5370 × 2.8 = $19815.
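The arithmetic in Example 6.4 can be checked with a short script. This is only a
sketch for verification; the variable names are our own. It uses the computational
forms of the sums of squares, exactly as in the solution above.

```python
# Data from Example 6.4
salary = [20000, 24500, 23000, 25000, 20000, 22500]  # S_i, annual salary ($)
education = [2.8, 3.4, 3.2, 3.8, 3.2, 3.4]           # Ed_i, educational level score

n = len(salary)
ed_bar = sum(education) / n
s_bar = sum(salary) / n

# Computational forms: sum(x*y) - n*x_bar*y_bar and sum(x^2) - n*x_bar^2
s_xy = sum(e * s for e, s in zip(education, salary)) - n * ed_bar * s_bar
s_xx = sum(e ** 2 for e in education) - n * ed_bar ** 2

b1 = s_xy / s_xx         # slope estimate
b0 = s_bar - b1 * ed_bar  # intercept estimate

print(f"S_xy = {s_xy:.1f}, S_xx = {s_xx:.2f}")
print(f"slope = {b1:.2f}, intercept = {b0:.2f}")
print(f"predicted salary at Ed = 2.8: {b0 + b1 * 2.8:.2f}")
```

With the unrounded slope (about 5370.37) the intercept comes out at about 4778
rather than 4779. The small difference arises because the worked solution rounds
the slope to 5370 before computing the intercept.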

Exercise 6.5
1. In Example 6.4, remove the last (sixth) number and estimate β0 and β1 .
2. Predict the salary for someone with an educational score of 2.8.


6.6 Properties of Estimates


We shall briefly mention, without proving, some of the properties of the estimates
we derived earlier. We shall then use these properties in assessing how good our
model is.

The Least Squares estimates of β0 and β1 are unbiased so that we have the following
properties.

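The sampling distribution of β̂0 is given by:

\[ \hat\beta_0 \sim N\left(\beta_0,\ \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right]\right), \qquad (6.18) \]

where σ² is the variance of the error term.

The sampling distribution of β̂1 is given by:

\[ \hat\beta_1 \sim N\left(\beta_1,\ \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right). \qquad (6.19) \]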

Now, using the properties of the sampling distributions of β̂0 and β̂1, inferences
about β0 and β1 can be made. First, however, an estimate of one other unknown
parameter in the regression model is needed, namely σ². This estimate
is given by:

\begin{align*}
\hat\sigma^2 = s^2 &= \frac{1}{n-2}\sum_{i=1}^{n} (r_i - \bar{r})^2 \qquad (6.20) \\
                   &= \frac{1}{n-2}\sum_{i=1}^{n} r_i^2 \\
                   &= \frac{1}{n-2}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
\end{align*}

since $\sum_{i=1}^{n} r_i = 0$ and hence $\bar{r} = 0$.

Note that s² = MSE, the Mean Square Error, and that
$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the Error Sum of Squares.
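As an illustration of Equation 6.20, the sketch below computes s² from the residuals
of a fitted line. The function name residual_variance is our own, and the usage
lines apply it to the data of Example 6.4.

```python
def residual_variance(x, y):
    """Return s^2 = SSE / (n - 2) for a simple linear regression of y on x."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    # r_i = y_i - y_hat_i, the residuals of the fitted line
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(r ** 2 for r in residuals)
    return sse / (n - 2)


if __name__ == "__main__":
    # Data from Example 6.4
    education = [2.8, 3.4, 3.2, 3.8, 3.2, 3.4]
    salary = [20000, 24500, 23000, 25000, 20000, 22500]
    s2 = residual_variance(education, salary)
    print(f"s^2 = {s2:.1f}, s = {s2 ** 0.5:.1f}")
```

The divisor is n − 2 rather than n because two parameters, β0 and β1, were estimated
from the same data.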


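We will use s² to estimate σ² in our expressions for Var(β̂0) and Var(β̂1) to get:

\[ \widehat{\mathrm{Var}}(\hat\beta_0) = s^2\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right] \qquad (6.21) \]

and

\[ \widehat{\mathrm{Var}}(\hat\beta_1) = \frac{s^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad (6.22) \]

respectively.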

Let us discuss how we can make inferences about the regression parameters before
proceeding to investigate how well the fitted line fits the data.

Exercise 6.6
1. Deduce the variance of yi.
2. Assuming β̂0 and β̂1 are independent, find the variance of ŷi.

6.7 Making Inferences about β0 and β1


The following procedure can be used to test hypotheses about the slope and the
intercept.

1. Establish the null and alternative hypotheses.


(a) The null hypothesis is that there is no linear relationship between Y
and X, that is, H0 : β1 = 0.
(b) The alternative hypothesis is that there is a linear relationship between
Y and X, i.e. H1 : β1 ≠ 0.
2. Determine the tolerance α for a type I error probability.
3. Identify an appropriate test statistic. Using H0 : β1 = 0, the test statistic is
given by:

\[ t = \frac{\hat\beta_1}{\sqrt{s^2 / \sum_{i=1}^{n} (x_i - \bar{x})^2}} \qquad (6.23) \]

Note that, under H0, t has a Student's t-distribution with n − 2 degrees of freedom.


4. State the assumptions under which this test statistic has a t-distribution. The
assumptions about the error must be valid for the conclusions or inferences to
be valid.

5. Determine whether these assumptions are satisfied.

6. Find the values of the test statistic that allow a rejection of the null
hypothesis. We find the critical value $t_{n-2}(\tfrac{1}{2}\alpha)$, such that we reject H0 if
$t < -t_{n-2}(\tfrac{1}{2}\alpha)$ or if $t > t_{n-2}(\tfrac{1}{2}\alpha)$.

7. Compute the t-value based on sample data.

8. Interpret the result from a statistical viewpoint.

Example 6.6
We assume that a two-tailed test is appropriate and we use the data in Example 6.4
to test the hypothesis
H0 : β1 = 0 versus
H1 : β1 ≠ 0
at α = 0.05.

First we calculate s as follows:

\[ s = \sqrt{\frac{\sum_{i=1}^{n} (S_i - \hat{S}_i)^2}{n-2}} \]

Number   Si       Ŝi       Si − Ŝi   (Si − Ŝi)²

1        20 000   19 815      185        34 225
2        24 500   23 037    1 463     2 140 369
3        23 000   21 963    1 037     1 075 369
4        25 000   25 185     -185        34 225
5        20 000   21 963   -1 963     3 853 369
6        22 500   23 037     -537       288 369
Total                           0     7 425 926

\[ s = \sqrt{\frac{7425926}{4}} \approx 1363 \]
Calculation of the variance estimate:


Number   Edi    Edi − $\overline{Ed}$   (Edi − $\overline{Ed}$)²

1        2.8    -0.5       0.25
2        3.4     0.1       0.01
3        3.2    -0.1       0.01
4        3.8     0.5       0.25
5        3.2    -0.1       0.01
6        3.4     0.1       0.01
Total                      0.54

We calculate the estimated standard error of β̂1, that is, the square root of the
estimated variance in Equation 6.22, as follows:

\[ \sqrt{\widehat{\mathrm{Var}}(\hat\beta_1)} = \frac{1363}{\sqrt{0.54}} \approx 1854 \]

Our test statistic t is found to be:

\[ t = \frac{5370 - 0}{1854} = 2.896 \]

The critical value is $t_{n-2}(\tfrac{1}{2}\alpha) = t_4(0.025) = 2.78$. Therefore, we reject H0 since t
exceeds the critical value. We therefore conclude that there is a significant
statistical relationship between the starting salary and the educational level score.
See Figure 6.7 for the decision-making diagram.

Figure 6.7: t-Distribution: when to reject or accept H0
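The calculation in Example 6.6 can be reproduced with the sketch below. The function
name slope_t_statistic is our own. The critical value 2.78 is taken from the example
(it is read from a t-table, not computed by the code), so it is an input here.

```python
import math


def slope_t_statistic(x, y):
    """Return (b1, se_b1, t) for testing H0: beta1 = 0 in simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)            # Equation 6.20
    se_b1 = math.sqrt(s2 / s_xx)  # square root of Equation 6.22
    return b1, se_b1, b1 / se_b1  # t as in Equation 6.23


if __name__ == "__main__":
    # Data from Example 6.4
    education = [2.8, 3.4, 3.2, 3.8, 3.2, 3.4]
    salary = [20000, 24500, 23000, 25000, 20000, 22500]
    T_CRIT = 2.78  # t_4(0.025) as quoted in Example 6.6, taken from a t-table
    b1, se, t = slope_t_statistic(education, salary)
    print(f"b1 = {b1:.1f}, se(b1) = {se:.1f}, t = {t:.3f}")
    print("reject H0" if abs(t) > T_CRIT else "do not reject H0")
```

The comparison uses |t| because the test is two-tailed: H0 is rejected for large
positive or large negative values of t.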

A similar approach can be used to test the significance of the intercept.

Exercise 6.7
1. If βˆ1 had been -0.32, what conclusions would you have drawn?
2. Test the null hypothesis that the intercept is zero.


6.8 Analysis of Variance of Simple Linear Regression

Analysis of variance (ANOVA) is a highly useful and flexible mode of analysis for
regression models. We will use ANOVA to estimate σ² and to measure the degree
of linear association between X and Y in the sample data.

Partitioning the Total Sum of Squares


The uncertainty associated with a prediction is related to the variability of the Y
observations around their mean, as measured by the following deviations:

Yi − Ȳ

The greater the variability in the data, the larger the deviations Yi − Ȳ will be,
and the greater the uncertainty associated with a prediction of Yi made without
utilising knowledge of Xi.

Conventionally, the variability of the observations is measured by the sum of
squares of the deviations Yi − Ȳ, which is denoted by:

\[ SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \qquad (6.24) \]

where SST stands for Total Sum of Squares. If there is a lot of variability in the Yi,
then SST is large.

Error Sum of Squares


The uncertainty associated with a prediction is related to the variability of the Yi
around the fitted regression line as measured by the following deviations:

ri = Yi − Ŷi

If all the Yi values fall on the regression line, all the deviations ri , will be zero.
The larger the deviations ri the greater the uncertainty associated with a prediction
utilising knowledge of the independent variables Xi .


The conventional measure of variability around the fitted regression is the Error
Sum of Squares (SSE) which is calculated as follows:

\[ SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} r_i^2 \qquad (6.25) \]

If all the Yi values fall on the regression line, SSE will be zero.

Regression Sum of Squares


The reduction in the variability associated with the utilisation of the knowledge of
the independent variable Xi is another sum of squares known as Regression Sum of
Squares (SSR). Figure 6.8 illustrates how each of these components arises.

Figure 6.8: Variability on the regression line

SSR = SST − SSE (6.26)

We can show that SSR is the sum of squares of the deviations

Ŷi − Ȳ,

which are the differences between each fitted value and the mean of the observed
values (this is also the mean of the fitted values):

\[ SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 \qquad (6.27) \]


SSR can be viewed as a measure of the effect of the regression relation in reducing
the variability of the Yi. If SSR = 0, the regression does not reduce the variability
at all. SSR can be interpreted as the amount of variation in Y explained by the
regression.

A more mathematical approach is to say we are partitioning the variability of the
Yi's. Thus, for Simple Linear Regression, the decomposition of the SST into two
components is achieved as follows:

\[ SST = SSR + SSE \qquad (6.28) \]

\[ \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \qquad (6.29) \]

The computational formulas are as follows:

\[ SST = \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2 \qquad (6.30) \]

\[ SSR = \frac{\left(\sum_{i=1}^{n} Y_i X_i - n\bar{Y}\bar{X}\right)^2}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} \qquad (6.31) \]

and

SSE = SST − SSR (6.32)
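The computational formulas 6.30 to 6.32 can be sketched in code as follows. The
function name sums_of_squares is our own, and the usage lines apply it to the data
of Example 6.4 as an illustration.

```python
def sums_of_squares(x, y):
    """Return (SST, SSR, SSE) using the computational formulas 6.30 to 6.32."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sst = sum(yi ** 2 for yi in y) - n * y_bar ** 2                 # Equation 6.30
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    s_xx = sum(xi ** 2 for xi in x) - n * x_bar ** 2
    ssr = s_xy ** 2 / s_xx                                          # Equation 6.31
    sse = sst - ssr                                                 # Equation 6.32
    return sst, ssr, sse


if __name__ == "__main__":
    # Data from Example 6.4
    education = [2.8, 3.4, 3.2, 3.8, 3.2, 3.4]
    salary = [20000, 24500, 23000, 25000, 20000, 22500]
    sst, ssr, sse = sums_of_squares(education, salary)
    print(f"SST = {sst:.0f}, SSR = {ssr:.0f}, SSE = {sse:.0f}")
```

For these data the SSE agrees with the total of the squared residuals tabulated in
Example 6.6 (about 7 425 926), which is a useful check on hand calculations.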

Partitioning of Degrees of Freedom


A sum of squares has an associated number of degrees of freedom. Recall that the
variance estimate s² in Equation 6.20 has a denominator n − 2. These are the degrees
of freedom associated with the numerator sum of squares in s².

Corresponding to the partitioning of the sum of squares is a partitioning of the
degrees of freedom. SST has n − 1 degrees of freedom (df) associated with it. Why?
SST has n deviations, namely Yi − Ȳ. However, there is one constraint on these
deviations, namely $\sum (Y_i - \bar{Y}) = 0$, so we lose one degree of freedom and remain with
n − 1 degrees of freedom in the n deviations.

SSE has n − 2 degrees of freedom, since we imposed two constraints on the ri's
during the estimation of β0 and β1. SSR has 1 df: there are two parameters in
the regression function, but the deviations Ŷi − Ȳ are subject to the constraint
$\sum (\hat{Y}_i - \bar{Y}) = 0$. Thus, the degrees of freedom are additive and given by
n − 1 = 1 + (n − 2).


Mean Squares
A sum of squares divided by its degrees of freedom is called a mean square. For
example, s² = MSE. The two important mean squares are the regression mean
square, denoted by MSR, and the error mean square, denoted by MSE.

Thus:

\[ MSR = \frac{SSR}{1} \qquad (6.33) \]

and

\[ MSE = \frac{SSE}{n-2} = s^2 \qquad (6.34) \]

Some Properties of Mean Squares


It can be shown that the expectations of the mean squares are given by:

\[ E[MSE] = \sigma^2 \]

and

\[ E[MSR] = \sigma^2 + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2. \]

Thus, when β1 = 0, E[MSR] = σ², so MSE and MSR have the same expected
value under this condition. On the other hand, when β1 ≠ 0, the term
$\beta_1^2 \sum (X_i - \bar{X})^2$ is positive and E[MSR] > E[MSE]. Hence, if β1 ≠ 0, MSR
will tend to be larger than MSE.

Exercise 6.8
1. What does it mean if the fitted model gives you SSE = 0?
2. If SSR is zero, what does it tell you about the model?
3. In the Simple Linear Regression model, suppose that SST has 14 degrees of
freedom. Deduce the SSE and SSR degrees of freedom.

6.9 The Basic ANOVA Table


It is useful to collect the sum of squares, degrees of freedom and mean squares in
an ANOVA table for regression analysis. Table 6.2 gives the structure and the ap-
pearance of the basic ANOVA table.


Source of variation   SS                   df      MS                  F

Regression            SSR = Σ(Ŷi − Ȳ)²     1       MSR = SSR/1         F = MSR/MSE
Error                 SSE = Σ(Yi − Ŷi)²    n − 2   MSE = SSE/(n − 2)
Total                 SST = Σ(Yi − Ȳ)²     n − 1

Table 6.2: The basic ANOVA table for simple linear regression

From the ANOVA table, we can get the variance estimate s² and test the hypothesis that
there is a regression relationship. How do we do this? Under H0 : β1 = 0, the ratio F
in the ANOVA table has an F-distribution (Fisher's distribution) with 1 and n − 2
degrees of freedom, provided the assumptions of the model hold.

If F is near 1, then MSR and MSE are approximately equal, which is consistent with
β1 = 0. A value of F much larger than 1 suggests that β1 ≠ 0. Thus, an upper-tail
test is appropriate.

The hypotheses we are testing here are as follows:


H0 : β1 = 0 versus
H1 : β1 ≠ 0
at level α. Our decision rule here is as shown in Figure 6.9.

Figure 6.9: The general form of the statistical decision rule for an F-test

The decision rule is given by:


Accept H0 if F ≤ F1,n−2 (1 − α)
Reject H0 if F > F1,n−2 (1 − α).
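The F-ratio and the decision rule can be sketched as follows. The function name
anova_f is our own. The critical value 7.71 used in the usage lines is F(1, 4) at
the 0.95 level, read from an F-table; it is an input to the sketch, not something
the code computes. The data are those of Example 6.4.

```python
def anova_f(x, y):
    """Return (MSR, MSE, F) for the simple linear regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = s_xy ** 2 / s_xx
    sse = sst - ssr
    msr = ssr / 1        # regression mean square, 1 df
    mse = sse / (n - 2)  # error mean square, n - 2 df
    return msr, mse, msr / mse


if __name__ == "__main__":
    # Data from Example 6.4
    education = [2.8, 3.4, 3.2, 3.8, 3.2, 3.4]
    salary = [20000, 24500, 23000, 25000, 20000, 22500]
    F_CRIT = 7.71  # F_{1,4}(0.95) from an F-table
    msr, mse, f_ratio = anova_f(education, salary)
    print(f"MSR = {msr:.0f}, MSE = {mse:.0f}, F = {f_ratio:.3f}")
    print("Reject H0" if f_ratio > F_CRIT else "Accept H0")
```

In simple linear regression this F-ratio equals the square of the t statistic in
Equation 6.23, so the F-test and the two-tailed t-test of H0 : β1 = 0 always lead
to the same decision.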

6.10 The Coefficient of (simple) Determination


The coefficient of determination R2 is a measure of the effect an independent
variable X has in the regression model in explaining the total variability in Y .

The coefficient of determination, denoted by R2 , is defined as follows:

\begin{align*}
R^2 &= \frac{SSR}{SST} \qquad (6.35) \\
    &= 1 - \frac{SSE}{SST} \qquad (6.36)
\end{align*}
Thus, R2 measures the proportionate reduction in SST associated with the use of
an independent variable.

In the Simple Linear Regression case, we usually refer to the coefficient of
determination as the Coefficient of Simple Determination (R²). Note that R² is the
square of the simple correlation coefficient of the independent and dependent
variables.

R² takes values between 0 and 1. We obtain R² = 0 when β̂1 = 0, and R² = 1 (100%)
when the Yi's fall directly on the regression line. As a rule of thumb, a value of
R² above 80%, or sometimes 70%, suggests that the model has a good fit.

Adjusted R2
One phenomenon found on adding terms to a regression model is that R² increases.
Although this may be an indication that the extra terms improve the
regression equation, it may also be a reflection of the fact that one is using more
variables to predict the same number of data points. This problem may be taken
into account by examining not only the actual value of R², but also the value of
the adjusted R². This statistic takes into account the number of data points and
variables in the regression equation, by replacing SSE and $\sum_{i=1}^{n} (y_i - \bar{y})^2$ by the
corresponding mean squares, giving

\[ \bar{R}^2 = 1 - \frac{SSE/(n-2)}{\sum_{i=1}^{n} (y_i - \bar{y})^2/(n-1)} \qquad (6.37) \]


which can also be written as $\bar{R}^2 = 1 - \frac{n-1}{n-2}(1 - R^2)$.

It is so scaled that, if a second added variable results in a non-significant
improvement in the regression fit, the adjusted R² will decrease.
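Equations 6.35 and 6.37 can be sketched as two small functions. The function names
and the numbers in the usage lines (SSR = 80, SST = 100, n = 12) are made up purely
for illustration.

```python
def r_squared(ssr, sst):
    """Coefficient of determination, R^2 = SSR / SST (Equation 6.35)."""
    return ssr / sst


def adjusted_r_squared(r2, n):
    """Adjusted R^2 for simple linear regression (one explanatory variable)."""
    return 1 - (n - 1) / (n - 2) * (1 - r2)


if __name__ == "__main__":
    # Hypothetical sums of squares and sample size
    r2 = r_squared(80.0, 100.0)
    print(f"R^2 = {r2:.2f}, adjusted R^2 = {adjusted_r_squared(r2, 12):.2f}")
```

The adjusted value is always a little smaller than R² (unless R² = 1), and the gap
shrinks as the sample size n grows.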

Example 6.7

The price of a kilogram of flour (Y) at a market place in a particularly busy township
seems to vary according to what the vendor thinks is your salary X (in thousands
of Rands). Use the data supplied to investigate whether this suspicion is true.

Individual X Y
1 2 8.74
2 2 10.53
3 2 10.99
4 2 11.97
5 3 12.83
6 3 14.69
7 3 14.69
8 3 15.30
9 4 16.11
10 4 16.31
11 4 16.46
12 4 17.69
13 5 19.65
14 5 18.86
15 5 19.93
16 5 20.51

Solution
The model is given by: y = β0 + β1 x + ε (price = β0 + β1 salary + ε)

Fitting the model gives us:

\begin{align*}
\widehat{\text{price}} &= \hat\beta_0 + \hat\beta_1\,\text{salary} \\
                       &= 4.8970 + 2.9805\,\text{salary}
\end{align*}


Let us now construct the ANOVA table. First we calculate SST, SSR and SSE:

\begin{align*}
SST &= \sum_{i=1}^{16} y_i^2 - n\bar{y}^2 = 191.194, \\
SSR &= \sum_{i=1}^{16} \hat{y}_i^2 - n\bar{y}^2 = 177.668, \\
SSE &= SST - SSR = 191.194 - 177.668 = 13.526.
\end{align*}

From the above, the ANOVA table can be constructed as follows.

Table 6.3: The ANOVA Table

SOURCE       df   SS        MS        F
Regression   1    177.668   177.668   183.921
Error        14   13.526    0.966
Total        15   191.194

We can see from this ANOVA table that F is quite large. In fact, it leads us to
reject the hypothesis of no regression relationship (verify).

We can compute the coefficient of determination from Table 6.3 above.
Thus:

\[ R^2 = \frac{SSR}{SST} = \frac{177.668}{191.194} = 0.929. \]

Thus, about 92.9% of the variation in prices (Y) is explained by the regression
model. So, the salary estimates do seem to determine the price of flour. The
adjusted R² = 92.4%.
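The figures quoted in Example 6.7 can be reproduced from the data with the following
sketch. The variable names are our own, and only the Python standard library is used.

```python
# Data from Example 6.7: perceived salary X (thousands of Rands) and flour price Y
x = [2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5]
y = [8.74, 10.53, 10.99, 11.97, 12.83, 14.69, 14.69, 15.30,
     16.11, 16.31, 16.46, 17.69, 19.65, 18.86, 19.93, 20.51]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = s_xy / s_xx         # slope estimate
b0 = y_bar - b1 * x_bar  # intercept estimate

sst = sum((yi - y_bar) ** 2 for yi in y)  # total sum of squares
ssr = b1 * s_xy                           # equals S_xy^2 / S_xx
sse = sst - ssr
mse = sse / (n - 2)
f_ratio = ssr / mse
r2 = ssr / sst
adj_r2 = 1 - (n - 1) / (n - 2) * (1 - r2)

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
print(f"SST = {sst:.3f}, SSR = {ssr:.3f}, SSE = {sse:.3f}")
print(f"MSE = {mse:.3f}, F = {f_ratio:.2f}")
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```

Computed without intermediate rounding, F comes out at about 183.88. The value
183.921 in Table 6.3 results from dividing by the MSE after rounding it to 0.966.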

The correlation coefficient (r) measures the strength of the linear relationship be-
tween the dependent variable and all the independent variables. It is computed from
the formula:



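\begin{align*}
r &= \pm\sqrt{R^2} \qquad (6.38) \\
  &= \pm\sqrt{\frac{SSR}{SST}}, \qquad (6.39)
\end{align*}

where SSR is the Regression Sum of Squares and SST the Total Sum of Squares, and
the sign of r is the sign of the slope β̂1. A value of |r| close to 1 means that
there is a good linear relationship between the dependent and independent variables.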

Exercise 6.9
A student recorded the 5 test marks she obtained after devoting a particular number
of hours of study. The marks are:

Marks (%)    60   50    40    100   10
Time (hrs)   2    1.5   0.5   3     0

1. Estimate β0 and β1 in a simple linear regression model.

2. Construct the ANOVA table.

3. Show that F = 42.89, and test the null hypothesis that β1 = 0 at α = 0.05.

6.11 Conclusion
In this chapter, we focused on Simple Linear Regression. We discussed the estimation
of the parameters using the Least Squares technique. There are other methods
available for estimating these parameters. We shall meet these in future modules.

We went on to discuss how to check if any of the assumptions which enable us to use
the least squares approach have been violated. This is often ignored by “pseudo-
statisticians”. Some blame this abuse of Regression Analysis on computers which
allow you to use statistical computer packages without looking at the underlying
theory behind the techniques.

Exercise 6.10
1. Find the relationship between the correlation coefficient r and β1 .


2. A farmer is carrying out an experiment on the effect of soil acidity levels on
yield. The farmer records the yields of 10 fields of equal size but different
acidity levels. [A negative acidity level here indicates that the acidity is less
than the standard level pegged at 0.]

Field: 1 2 3 4 5 6 7 8 9 10
Level: 4.5 17.7 -16.6 -14 18.6 -10.6 5.8 -8.1 -5.2 7.8
Yield: 75 112 38 120 105 52 116 118 105 110

(a) Examine graphically the relationship between Level and Yield.


(b) Is this a statistical relationship? Explain.
(c) Which is the dependent variable, and which is the independent variable?
Explain.
(d) Fit a simple linear regression model to this data.
(e) Investigate if any assumptions were violated.
(f) What do you expect the yield to be for a field with acidity level 0?
(g) Do you think the field would make a better independent variable?

3. A recent review of salaries at a company in Harare made the recommendation
that, depending on the number of years of service X, the minimum salary Y
of an employee should be as follows:
x: 5 10 15 20 25 30
y: 39.5 49.0 58.5 68.0 77.5 87.0

(a) Examine graphically the relationship between number of years of service
and minimum salary.
(b) Do these recommendations represent an exact or statistical relation? Ex-
plain using statistics like the sample correlation coefficient.

4. For each of the following pairs of variables, explain whether an exact or sta-
tistical relation would most likely hold:

(a) X=number of beds in a hotel; Y =hotel’s annual operating cost.


(b) X = Volume of a gas; Y = Pressure on the gas.
(c) X = A departmental store’s promotional and advertising expenditure; Y
= the company profits.

5. Find the correlation matrix ρ for the data in Question 1 above.


6. The following information was recorded over 10 years: the amount of rainfall,
maize production and maize price.

(a) Which is the dependent variable? Explain.


(b) What is the regression model linking these variables?
(c) Which two variables are likely to have a high correlation?
