Study Notes
TERMINOLOGY
1.1 Definitions
Examples
1. The population of people inhabiting a certain country.
2. The collection of all cars of a certain type manufactured during a
particular month.
3. All patients in a certain area suffering from AIDS.
4. Exam marks obtained by all students studying a certain statistics course.
A census is a study where every member (element) of the population is
included.
Examples
1. Study of the entire population carried out by the government every 10
years.
2. Special investigations e.g. tax study commissioned by a government.
3. Any study of all the individuals/elements in a population.
A census is usually very costly and time consuming. It is therefore not carried
out very often. A study of a population is usually confined to a subgroup of the
population.
The number of values in the sample (sample size) is denoted by n. The number
of values in the population (population size) is denoted by N.
Discrete variables are variables that can assume a finite or countable number
of possible values. Such variables are usually obtained by counting.
Examples
1. The number of cars parked in a parking lot.
2. The number of students attending a statistics lecture.
3. A person’s response (agree, not agree) to a statement. A one (1) is
recorded when the person agrees with the statement, a zero (0) is
recorded when a person does not agree.
Examples
1. The body temperature of a person.
2. The weight of a person.
3. The height of a tree.
4. The contents of a bottle of cool drink.
Examples
1. The course of study at university (B.Com, B.Eng, BA etc.)
2. The grade (A, B, C, D or E) obtained in an examination.
A variable can be treated as nominal when its values represent categories with
no intrinsic ranking. For example, the department of the company in which an
employee works. Examples of nominal variables include region, postal code, or
religious affiliation.
A variable can be treated as ordinal when its values represent categories with
some intrinsic order or ranking.
Examples
Discrete and continuous variable examples given above.
Examples
1. The difference between a temperature of 100 degrees and 90 degrees is
the same difference as that between 90 degrees and 80 degrees. Taking
ratios in such a case does not make sense.
2. When referring to dates (years) or temperatures measured (degrees
Fahrenheit or Celsius) there is no natural zero point.
Examples
Variables like height, weight, mark (in test) and speed are ratio variables. These
variables have a natural zero and ratios make sense when doing calculations
e.g. a weight of 80 kilograms is twice as heavy as one of 40 kilograms.
An experiment is the process of observing some phenomenon that occurs.
An experiment can be observational or designed.
Examples
1. Mean (average) age of all employees working at a certain company.
2. The proportion of registered female voters in a certain country.
Examples
1. The mean (average) monthly salary of 50 selected employees in a certain
government department.
2. The proportion of smokers in a sample of 60 university students.
Sampling frame (synonyms: "sample frame", "survey frame") is the actual set
of units from which a sample is drawn.
Example
Consider a survey aimed at establishing the number of potential customers for
a new service in a certain city. The research team has drawn 1000 numbers at
random from a telephone directory for the city, made 200 calls each day from
Monday to Friday from 8am to 5pm and asked some questions.
In this example, the population of interest is all the inhabitants in the city. The
sampling frame includes only those city dwellers that satisfy all the following
conditions:
The sampling frame in this case definitely differs from the population. For
example, it under-represents people who have no telephone (e.g. the poorest),
who have an unlisted number, who were not at home at the time of the calls
(e.g. employed people), or who do not like to participate in telephone
interviews (e.g. busier and more active people). Such differences between the
sampling frame and the population of interest are a main cause of bias when
drawing conclusions based on the sample.
Probability samples are drawn according to the laws of chance. These include
simple random sampling, systematic sampling and stratified random sampling.
In simple random sampling each sample of a given size that can be drawn will
have the same chance of being drawn. Most of the theory in statistical
inference is based on random sampling being used.
Examples
1. The 6 winning numbers (drawn from 49 numbers) in a Lotto draw. Each
potential sample of 6 winning numbers has the same chance of being
drawn.
be randomly generated by computer or numbers could be picked out of
a hat. These numbers could then be matched to names in the telephone
directory, thereby providing a list of 2 000 people.
Example
Suppose the first 6 random numbers in the table of random numbers are:
10480, 22368, 24130, 42167, 37570, 77921.
Use these numbers to select the 6 winning numbers in a Lotto draw.
The 49 numbers from which the draw is made all involve 2 digits i.e. 01, 02, . .
. , 49.
Putting the above numbers from the table of random numbers next to each
other in a string of digits gives: 10 48 02 23 68 24 13 04 21 67 37 57 07 79 21 .
The winning numbers can be selected by taking all pairs of digits between 01
and 49 (discarding any numbers outside this range and any repeats), working
either from left to right or from right to left in the above string.
By working from left to right the winning numbers are: 10, 48, 2, 23, 24 and
13.
By working from right to left the winning numbers are: 21, 7, 37, 4, 13 and
24.
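The selection rule above can be sketched in Python. This is only an illustration: the helper name pick_numbers and its parameters are assumptions made for this sketch, and repeats are discarded as the rule states.

```python
# The six entries read from the table of random numbers.
table_entries = ["10480", "22368", "24130", "42167", "37570", "77921"]
digits = "".join(table_entries)

# Split the digit string into consecutive pairs: "10", "48", "02", ...
pairs = [digits[i:i + 2] for i in range(0, len(digits), 2)]


def pick_numbers(pairs, how_many=6, low=1, high=49):
    """Keep the first `how_many` distinct values that lie in [low, high]."""
    chosen = []
    for pair in pairs:
        value = int(pair)
        if low <= value <= high and value not in chosen:
            chosen.append(value)
        if len(chosen) == how_many:
            break
    return chosen


left_to_right = pick_numbers(pairs)
right_to_left = pick_numbers(list(reversed(pairs)))
print(left_to_right)   # [10, 48, 2, 23, 24, 13]
print(right_to_left)   # [21, 7, 37, 4, 13, 24]
```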
The advantage of simple random sampling is that it is simple and easy to apply
when small populations are involved. However, because every person or item
in a population has to be listed before the corresponding random numbers can
be read, this method is very cumbersome to use for large populations and
cannot be used if no list of the population items is available. It can also be very
time consuming to try and locate every person included in the sample. There is
also a possibility that some of the persons in the sample cannot be contacted
at all.
Examples
1. A manufacturer might decide to select every 20th item on a production
line to test for defects and quality. This technique requires the first item
to be selected at random as a starting point for testing and, thereafter,
every 20th item is chosen.
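A minimal sketch of this selection in Python is given below. The function name systematic_sample, the seed and the production run of 200 numbered items are hypothetical values chosen for illustration only.

```python
import random


def systematic_sample(items, step, seed=None):
    """Pick a random start among the first `step` items, then every `step`-th item."""
    rng = random.Random(seed)
    start = rng.randrange(step)      # random starting index 0 .. step-1
    return items[start::step]


production_line = list(range(1, 201))    # hypothetical item numbers 1..200
sample = systematic_sample(production_line, step=20, seed=1)
print(sample)
```

Only the starting point is random; once it is fixed, the rest of the sample is determined.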
A general problem with random sampling is that you could, by chance, miss
out a particular group in the sample. However, if you subdivide the population
into groups, and sample from each group, you can make sure the sample is
representative. Some examples of strata commonly used are those according
to province, age and gender. Other strata may be according to religion,
academic ability or marital status.
Example
In a study investigating the expenditure pattern of consumers, they were
divided into low, medium and high income groups.
Income group    Percentage of population
low             40
medium          45
high            15
A stratified sample of 500 consumers is to be selected for this study.
When sampling is proportional to size (an income group comprises the same
percentage of the sample as of the population) the sample sizes for the strata
should be calculated as follows.
low: $\frac{40 \times 500}{100} = 200$; medium: $\frac{45 \times 500}{100} = 225$; high: $\frac{15 \times 500}{100} = 75$
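The proportional allocation can be checked with a short Python sketch. The variable names are chosen for illustration; the percentages and the sample size of 500 come from the example.

```python
# Percentage of the population in each income group (from the example).
population_share = {"low": 40, "medium": 45, "high": 15}
sample_size = 500

# Each stratum receives the same percentage of the sample as of the population.
allocation = {group: pct * sample_size / 100
              for group, pct in population_share.items()}

print(allocation)   # {'low': 200.0, 'medium': 225.0, 'high': 75.0}
```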
Example
A company is marketing a new product and needs to know how potential
customers might react to the product.
Stage 1: It is decided that age (the 3 groups under 20, 20-40, over 40)
and gender (male, female) are the characteristics that will determine the
sample.
When obtaining a quota sample, interviewers often choose who they like
(within criteria specifications) and may therefore select those who are easiest
to interview. Therefore sampling bias can result. It is also impossible to
estimate the accuracy of quota sampling (because sampling is not random).
Chapter 1 – Tutorial
1. Determine whether the data set is a population or a sample.
(a) The age of the Premier of each Province in South Africa.
(b) The speed of every 5th car passing a police speed trap.
(c) A survey of 500 students from a university with 10000 students.
(d) The annual salary for each employee at Coke.
(e) The cholesterol level of 20 patients in a hospital with 100
patients.
2. Identify the population and the sample for each of the statements
below.
(a) A study of 33043 infants in Italy was conducted to find a
link between a heart rhythm abnormality and sudden
infant death syndrome.
(b) A survey of 2104 households in South Africa found that 42%
subscribe to DSTV.
(c) A survey of 546 women found that more than 56% are the
primary investor in their household.
(d) The Ancient Mayans predicted the end of the world to be in
2012. A study was designed in KwaZulu-Natal where 1200
residents were randomly asked whether they believed the
prediction or not. The results indicated that 52% of the
interviewed residents believed in the Mayans' prediction.
3. Determine whether the numeric value is a parameter or a statistic.
(a) The average annual salary for 25 of a company's 1250 statisticians is
R250000.
(b) In a survey of a sample of high school students, 41% said
that their mother has taught them the most about
managing money.
(c) In a survey of sample computers, 15% said their computer
had a malfunction that needed to be repaired by a service
technician.
(d) In a recent year, the interest category for 9% of all new magazines
was sport.
(e) In a recent year, the average stats mark for all graduates at UKZN
was 34%.
(f) In a recent survey of 1000 adults from Gauteng, 34% said
using a cell phone while driving should be illegal.
(m) The monthly birth rate at a maternity hospital.
(n) The mass of babies at birth.
(o) The daily distance travelled by a courier service truck.
(p) The names of teams in a cricket league.
CHAPTER 2
DESCRIPTIVE STATISTICS
(Exploratory Data Analysis)
All the data sets used in this chapter will be regarded as samples drawn from
some population. One of the main purposes of studying a sample is to get
information about the population. The main focus here is on summarizing and
describing some features of the data.
Example
[Figure: line graph titled "Thando's weight (kg)". Horizontal axis: Year (2013 to 2019); vertical axis: Weight (65 to 75 kg).]
The graph above shows how Thando's weight varied from the beginning of
2014 to the beginning of 2018.
[1] See Appendix A3.
Bar charts
A bar chart or bar graph is a chart consisting of rectangular bars with heights
proportional to the values that they represent. Bar charts are used for
comparing two or more values that are taken over time or under different
conditions.
In a simple bar chart the figures used to make comparisons are represented by
bars. These are either drawn vertically or horizontally. Only totals are
represented. The height or length of the bar is drawn in proportion to the size
of the figure being presented.
Example
The South African population data is displayed in the following simple bar
chart.
[Figure: simple bar chart of the South African population. Horizontal axis: Year (2015 to 2018); vertical axis: Population (54,000,000 to 57,500,000). Bar labels: 55,291,225; 56,015,473; 56,717,156; 57,398,421.]
[2] See Appendix A4.
[3] See Appendix A5.
When you want to draw a bar chart to illustrate your data, it is often the case
that the totals of the figures can be broken down into parts or components.
[Figure: bar chart with Female and Male series. Horizontal axis: Population group (Black African, Coloured, Indian/Asian, White); vertical axis: 0 to 25,000,000.]
You start by drawing a simple bar chart with the total figures as shown above.
The columns or bars (depending on whether you draw the chart vertically or
horizontally) are then divided into the component parts.
[4] See Appendix A6.
[Figure: component bar chart titled "Mid-year population estimates for South Africa by population group, sex 2017", with Male, Female and Total series. Horizontal axis: Population group (Black African, Coloured, Indian/Asian, White); vertical axis: Number (0 to 50,000,000).]
Pie chart [5]
A pie chart is a diagram that shows the subdivision of some entity/total into
subgroups. The diagram is in the form of a circle which is divided into slices
with each slice having an area according to the proportion that it makes up of
the total.
Example
The pie chart below shows the weighting of services used in the construction
input price index (Construction Materials Price Indices, April 2019).
[Figure: pie chart with slices for Site preparation; Construction of buildings; Civil engineering; Other structures; Construction by specialist trade contractors; Plumbing; Electrical contractors; Shopfitting; Other building installation; Painting and decorating; Other building completion; Renting of construction or demolition equipment with operators.]
The degrees needed for each slice are found by calculating the appropriate
percentage of 360°.
For example, civil engineering $= \frac{37}{100} \times 360^\circ = 133.2^\circ \approx 133^\circ$
The complete calculations are shown in the table below.
The symbol sigma, ∑ (capital S in the Greek alphabet), is used to denote “the
sum of” values.
Suppose the symbol x is used to denote some variable of interest in a study. In
order to distinguish between values of this variable, subscripts are used.
$x_1$ is the first value in the data set, which has a subscript 1.
$x_2$ is the second value in the data set, which has a subscript 2.
.
.
$x_n$ is the nth value in the data set, which has a subscript n.
The sum of these values is written as
$$x_1 + x_2 + \dots + x_n = \sum_{i=1}^{n} x_i$$
or, for short,
$$x_1 + x_2 + \dots + x_n = \sum x$$
Example 1
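If $x_1 = 70$; $x_2 = 74$; $x_3 = 66$; $x_4 = 68$; $x_5 = 71$
Then
$$\sum_{i=1}^{5} x_i = 70 + 74 + 66 + 68 + 71 = 349$$
The sum of the squares of a set of values is written as $\sum x^2$ for short.
Example 2
For the data set in example 1,
$$\sum_{i=1}^{5} x_i^2 = 70^2 + 74^2 + 66^2 + 68^2 + 71^2 = 24397$$
Note: $\sum_{i=1}^{5} x_i^2 \neq \left( \sum_{i=1}^{5} x_i \right)^2$, since
$$\sum_{i=1}^{5} x_i^2 = 24397 \neq \left( \sum_{i=1}^{5} x_i \right)^2 = 349^2 = 121801$$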
The summation notation can also be used to write the sum of products of
corresponding values for 2 different sets of values.
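$$\sum_{i=1}^{n} x_i y_i = x_1 y_1 + x_2 y_2 + \dots + x_n y_n$$
i      1    2    3    4    5    6
x_i    11   13   7    12   10   8
y_i    8    5    7    6    9    11
For this data:
$$\sum_{i=1}^{6} x_i y_i = (11)(8) + (13)(5) + (7)(7) + (12)(6) + (10)(9) + (8)(11) = 452$$
Note: $\sum_{i=1}^{n} x_i y_i \neq \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)$
$$\sum_{i=1}^{6} x_i = 61; \quad \sum_{i=1}^{6} y_i = 46$$
$$\therefore \left( \sum_{i=1}^{6} x_i \right) \left( \sum_{i=1}^{6} y_i \right) = 2806 \neq \sum_{i=1}^{6} x_i y_i$$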
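The summation results in these examples can be verified with a few lines of Python. The variable names are chosen for illustration; the data values are those of the examples above.

```python
# Example 1 and 2: sum, sum of squares, and square of the sum.
x = [70, 74, 66, 68, 71]
sum_x = sum(x)
sum_x_squared = sum(v ** 2 for v in x)
square_of_sum = sum_x ** 2

# Sum of products of corresponding values versus product of the sums.
xs = [11, 13, 7, 12, 10, 8]
ys = [8, 5, 7, 6, 9, 11]
sum_xy = sum(a * b for a, b in zip(xs, ys))
product_of_sums = sum(xs) * sum(ys)

print(sum_x, sum_x_squared, square_of_sum)   # 349 24397 121801
print(sum_xy, product_of_sums)               # 452 2806
```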
Frequency distribution
A frequency distribution is a table in which data are grouped into classes and
the number of values (frequencies) which fall in each class is recorded.
The main purpose of constructing a frequency distribution is to gain insight
into the distribution pattern of the frequencies over the classes. Hence, the
name frequency distribution is used to refer to this pattern.
Example 1
In a survey of 40 families in an urban neighbourhood, the number of children
per family was recorded and the following data was obtained.
1 0 3 2 1 5 6 2
2 1 0 3 4 2 1 6
3 2 1 5 3 3 2 4
2 2 3 0 2 1 4 5
3 3 4 4 1 2 4 5
Number of children    Tally          Frequency
0                     ///            3
1                     ///// //       7
2                     ///// /////    10
3                     ///// ///      8
4                     ///// /        6
5                     ////           4
6                     //             2
Total                                40
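As a sketch, the tally for the number of children per family can be reproduced with Python's collections.Counter. The variable names are illustrative; the 40 values are those listed in the example.

```python
from collections import Counter

children = [
    1, 0, 3, 2, 1, 5, 6, 2,
    2, 1, 0, 3, 4, 2, 1, 6,
    3, 2, 1, 5, 3, 3, 2, 4,
    2, 2, 3, 0, 2, 1, 4, 5,
    3, 3, 4, 4, 1, 2, 4, 5,
]

frequency = Counter(children)

# Print one row per distinct value: value, tally marks, frequency.
for value in sorted(frequency):
    print(value, "/" * frequency[value], frequency[value])
print("Total", sum(frequency.values()))
```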
Example 2
Consider the following data of the amount of money spent by 50 DUT staff
members on public transport per day. The highest amount is R64 and the
lowest amount is R39.
Data set: The daily amount of money spent on public commuting by 50 DUT
staff members
57 39 52 52 43
50 53 42 58 55
58 50 53 50 49
45 49 51 44 54
49 57 55 64 45
50 45 51 54 58
53 49 52 51 41
52 40 44 49 45
43 47 47 43 51
55 55 46 54 41
The classes into which the above values can be sorted can be found by
following the steps shown below.
1. Find the maximum and minimum values and calculate the range (R):
$R = 64 - 39 = 25$
2. Determine the number of classes (k):
$k$ = the rounded up value of $(1 + 1.44 \ln n)$
$1 + 1.44 \times \ln(50) = 6.63$
i.e. $k = 7$.
3. Calculate the class width such that:
the number of classes × class width > range
i.e. $7 \times$ class width $> 25$
$\therefore$ class width $> \frac{25}{7} \approx 3.57$, so a class width of 4 is used.
4. Find the lower value that defines the first class. This is usually a value
just below the minimum value in the data set. Since the minimum value
for this data set is 39, the lowest class can have a minimum value one
below this i.e. 38.
5. Find the lower values that define each of the classes that follow by
successively adding the class width to the lower value of the previous
class: 42, 46, 50, 54, 58 and 62.
The frequency distribution below shows the data values sorted into the
classes:
The table below shows the classes and their frequencies for the cost of
commuting data set.
class limits    f
38 – 41         4
42 – 45         10
46 – 49         8
50 – 53         15
54 – 57         9
58 – 61         3
62 – 65         1
Total           50
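The steps for building this frequency distribution can be sketched in Python as follows. This is an illustration under two assumptions that match the worked example: the class width is taken as the smallest whole number satisfying the inequality in step 3, and the first class starts one below the minimum value.

```python
import math

data = [
    57, 39, 52, 52, 43, 50, 53, 42, 58, 55,
    58, 50, 53, 50, 49, 45, 49, 51, 44, 54,
    49, 57, 55, 64, 45, 50, 45, 51, 54, 58,
    53, 49, 52, 51, 41, 52, 40, 44, 49, 45,
    43, 47, 47, 43, 51, 55, 55, 46, 54, 41,
]

n = len(data)
data_range = max(data) - min(data)              # step 1: 64 - 39 = 25
k = math.ceil(1 + 1.44 * math.log(n))           # step 2: rounded up value of 6.63
width = math.floor(data_range / k) + 1          # step 3: smallest integer with k * width > range
first_lower = min(data) - 1                     # step 4: one below the minimum

# Step 5: inclusive classes (lower limit, upper limit).
classes = [(first_lower + i * width, first_lower + (i + 1) * width - 1)
           for i in range(k)]
frequencies = [sum(low <= v <= high for v in data) for low, high in classes]

for (low, high), f in zip(classes, frequencies):
    print(f"{low} - {high}: {f}")
```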
The values in the above example that define the classes of the frequency
distribution are called class limits. The classes of the type 38 – 41, 42 – 45, …,
etc. in which both the upper and lower limits are included are called “inclusive
classes”. For example, the class 38 – 41 includes all the values from 38 to 41.
1. The classes should be clearly defined and should not lead to any
ambiguity.
2. Each of the given values in the data set should be included in one of
the classes.
3. The classes should be of equal width, otherwise the different class
frequencies will not be comparable. If the class widths are unequal,
then comparable figures can be obtained by dividing the value of the
frequencies by the corresponding widths of the class intervals. The
ratios thus obtained are called ‘frequency density’.
4. The number of classes should not be too large nor too small.
Class midpoints
Examples
1. For the frequency distribution in example 2 (cost of daily commute
data), the class midpoints are given below.
class limits    midpoint
38 – 41         39.5
42 – 45         43.5
46 – 49         47.5
50 – 53         51.5
54 – 57         55.5
58 – 61         59.5
62 – 65         63.5
Cumulative frequencies
The “less than” cumulative frequency of a class is the number of values in the
sample that are less than or equal to the upper class boundary of the class.
Example
For the frequency distribution in example 2 (cost of daily commute data) the
cumulative frequencies are calculated as shown below.
classes    upper class limit    f     cumulative frequency    calculations
38 – 41    41                   4     4                       4
42 – 45    45                   10    14                      4 + 10
46 – 49    49                   8     22                      4 + 10 + 8
50 – 53    53                   15    37                      4 + 10 + 8 + 15
54 – 57    57                   9     46                      4 + 10 + 8 + 15 + 9
58 – 61    61                   3     49                      4 + 10 + 8 + 15 + 9 + 3
62 – 65    65                   1     50                      4 + 10 + 8 + 15 + 9 + 3 + 1
Total                           50
Examples
classes    f     relative frequency    percentage frequency
38 – 41    4     0.08                  8
42 – 45    10    0.2                   20
46 – 49    8     0.16                  16
50 – 53    15    0.3                   30
54 – 57    9     0.18                  18
58 – 61    3     0.06                  6
62 – 65    1     0.02                  2
Total      50    1                     100
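The cumulative, relative and percentage frequencies for the cost of daily commute data can be computed from the class frequencies. The sketch below uses itertools.accumulate for the running totals; the variable names are illustrative.

```python
from itertools import accumulate

f = [4, 10, 8, 15, 9, 3, 1]      # class frequencies, lowest class first
n = sum(f)

cumulative = list(accumulate(f))             # "less than" cumulative frequencies
relative = [v / n for v in f]                # frequency divided by sample size
percentage = [100 * v / n for v in f]        # relative frequency as a percentage

print(cumulative)    # [4, 14, 22, 37, 46, 49, 50]
print(percentage)    # [8.0, 20.0, 16.0, 30.0, 18.0, 6.0, 2.0]
```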
Histogram [6]
Example
[Figure: histogram of the cost of daily commute data. Horizontal axis: class interval (38 - 41, 42 - 45, 46 - 49, 50 - 53, 54 - 57, 58 - 61, 62 - 65); bar heights 4, 10, 8, 15, 9, 3 and 1.]
[6] See Appendix A8.
Frequency polygon [7]
Example
For the cost of daily commute data the following values are plotted.
midpoint    35.5    39.5    43.5    47.5    51.5    55.5    59.5    63.5    67.5
f           0       4       10      8       15      9       3       1       0
[Figure: frequency polygon of the cost of daily commute data. Horizontal axis: class midpoint (35.5 to 67.5); vertical axis: frequency (0 to 16).]
[7] See Appendix A9.
Note:
The two plotted values at the lower and upper ends were added to anchor the
graph to the horizontal axis. The lower end value is a plot of 0 versus the
midpoint of the class below the first (lowest) class (35.5). This midpoint is
obtained by subtracting the class width (4) from the midpoint of the lowest
class (39.5). The upper end value is a plot of 0 versus the midpoint of the class
above the last class (67.5). This midpoint is obtained by adding the class width
(4) to the midpoint of the last (highest) class (63.5).
This is the graph of the cumulative frequencies versus the upper class limits.
Example
For the “less than” ogive of the frequency distribution in example 2 (daily cost
of commute data), the following values are plotted:
Upper class limit       37    41    45    49    53    57    61    65
cumulative frequency    0     4     14    22    37    46    49    50
[Figure: "less than" ogive of the cost of daily commute data. Horizontal axis: upper class boundary; vertical axis: cumulative frequency (0 to 50), with plotted values 0, 4, 14, 22, 37, 46, 49 and 50.]
[8] See Appendix A10.
Note:
The plotted value at the lower end was added to anchor the graph to the
horizontal axis. The lower end value is a plot of 0 versus the upper class
boundary of the class below the first (lowest) class (37). This upper class
boundary is obtained by subtracting the class width (4) from the upper class
boundary of the lowest class (41).
[Figure: bell-shaped frequency curve. Horizontal axis: x (-4 to 4); vertical axis: frequency (0 to 0.45).]
This shape is for data sets where the majority of values are in the central
portion of the scale with fewer and fewer values the further away from the
center (in both directions). Many data sets have this shape. Examples are
[Figure: flat (uniform) frequency curve. Horizontal axis: x (0 to 6); vertical axis: frequency (0 to 0.12).]
This shape occurs when all the values in the data set occur approximately the
same number of times. Examples are:
1. Frequencies of winning numbers in a large number of Lotto draws.
2. Frequencies of winning numbers in a large number of roulette games.
3. Frequencies obtained when tossing an unbiased coin and recording 0 if
tails come up and 1 if heads come up.
Bimodal shape
[Figure: frequency curve with two peaks. Horizontal axis: Body length (mm) (0 to 120); vertical axis: frequency (0 to 60).]
This pattern, which shows two distinct peaks (hence the name bimodal),
appears when there are two subgroups with different sets of values in the
same data set.
Examples
1. Measuring the body lengths of ants when there are adults and juveniles
together in the same data set. The two peaks in the curve reflect the fact
that juvenile ants have shorter body lengths than adult ants.
[Figure: frequency curve with a peak at the lower end of the scale. Horizontal axis: x (0 to 14); vertical axis: frequency (0 to 0.8).]
This shape shows a high clustering of values at the lower end of the scale and
less and less clustering further away from the lower end towards the upper
end.
Example
The time it takes to serve a customer at a supermarket. For most customers
the service time is quite short. The longer the service time, the fewer the
customers.
[Figure: frequency curve with a peak at the upper end of the scale. Horizontal axis: x (0 to 16); vertical axis: frequency (0 to 0.3).]
This shape shows a high clustering of values at the upper end of the scale and
less and less clustering further away from the upper end towards the lower
end.
Example
Marks in a test where most students did well, but a few performed poorly.
Tutorial
shares.
Label Value
A 55
B 121
C 83
D 46
23 46 66 67 13 58 19 17 65 17 25 20 47 28 16 38 44 29
48 29 69 34 35 60 37 52 80 59 51 33 48 46 23 38 52
50 17 57 41 77 45 47 49 19 32 64 27 61 70 19
57 23 35 18 21 26 51 47 29 21 46 43 29 23 39
50 41 19 36 28 31 42 52 29 18 28 46 33 28 20
20.5 - 25.5    17
25.5 - 30.5    20
30.5 - 35.5    16
35.5 - 40.5    15
40.5 - 45.5    8
45.5 - 50.5    6
CHAPTER 3
MEASURES OF LOCATION AND DISPERSION
3.1. Introduction
A measure of central tendency is a value that shows the location on the scale
where a data set is centrally located (most values are clustered around it).
In the calculations a distinction will be made between methods used when the
data are in raw form (values as collected) or grouped form (form of a
frequency distribution).
A. Raw data
Mean: The mean (or average) of a set of data values is the sum of all of the
data values in the set divided by n, the number of data values. That is
$$\text{mean} = \bar{x} = \frac{\sum x}{n}$$
$\bar{x}$ is pronounced “x bar”.
Example
The marks of seven students in a mathematics test with a maximum possible
mark of 20 are given below:
15 13 18 16 14 17 12
$$\bar{x} = \frac{\sum x}{n} = \frac{15 + 13 + 18 + 16 + 14 + 17 + 12}{7} = \frac{105}{7} = 15$$
Median: The median is the value in the data set which is such that half of the
values in the data set are less than or equal to it and half are greater than or
equal to it.
For an odd number of values in the data set, the median is the middle value of
the data set when it has been arranged in ascending order. That is, from the
smallest value to the largest value.
Median $= \frac{1}{2}(n+1)$th value in a data set, where n is the sample size
If the number of values in the data set is even, then the median is the average
of the two middle values.
Examples
47 35 37 32 38 39 36 34 35
Arrange the data values in order from the lowest value to the highest value:
32 34 35 35 36 37 38 39 47
Median $= \frac{1}{2}(n+1)$th value
$= \frac{1}{2}(9+1)$th value
= 5th value
= 36
2. Consider the above data set with the first value (47) omitted.
Arrange the data values in order from the lowest value to the highest value:
32 34 35 35 36 37 38 39
In this case the number of values is, n=8, which is an even number.
Median $= \frac{1}{2}(n+1)$th value
$= \frac{1}{2}(8+1)$th value
= 4.5th value
The value that lies in position 4.5 in the ranked data set would be the
average of the 4th and 5th values:
$$\therefore \text{Median} = \frac{35 + 36}{2} = 35.5$$
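The mean and median calculations above can be checked with Python's standard statistics module. The variable names are illustrative; the data are those of the examples.

```python
import statistics

marks = [15, 13, 18, 16, 14, 17, 12]
mean_mark = statistics.mean(marks)            # sum of the values divided by n

odd_set = [47, 35, 37, 32, 38, 39, 36, 34, 35]    # n = 9
even_set = odd_set[1:]                            # first value (47) omitted, n = 8

median_odd = statistics.median(odd_set)       # middle value of the ranked data
median_even = statistics.median(even_set)     # average of the two middle values

print(mean_mark, median_odd, median_even)     # 15 36 35.5
```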
Mode: The mode of a set of data values is the value(s) that occurs most often.
Example:
Find the mode of the following data set:
48 44 48 45 42 49 48
The mode is 48 since it occurs most often.
Note:
1. It is possible for a set of data values to have more than one mode.
2. If there are two data values that occur most frequently, we say that the
set of data values is bimodal e.g. the data set 2 2 4 5 5 6 has two
modes (2 and 5).
3. If no value in the data set occurs more than once, it has no mode e.g. the
data set 4 5 7 9 has no mode.
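The three cases in the note can be illustrated with a small Python function. The function name modes is an assumption for this sketch; it returns an empty list when no value occurs more than once, following the convention stated in note 3.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                     # no value occurs more than once
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([48, 44, 48, 45, 42, 49, 48]))   # [48]
print(modes([2, 2, 4, 5, 5, 6]))             # [2, 5]
print(modes([4, 5, 7, 9]))                   # []
```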
Examples
1. The amounts (thousands) for which each of 7 properties were sold are
shown below.
For this data set mean = x̄ = 772.86. This value of the mean is not a
central value for the data set (it is greater than all the values but the
largest one). The reason for this is that the last value (2350) has a
considerable influence on the value of the mean.
The median = 555 is a value that is more centrally located than the mean.
Unlike the mean, the median is not influenced by the large last value in
the data set.
A, A, D, N, D, A, D, D, N, N.
For this data set the modal response is D (since D occurs more times
than the other responses). It is not possible to calculate a median or a
mean for this data set.
When calculating the mean for raw data, it is usually assumed that all the
values in the data set are equally important. If the values are not all considered
equally important, the weighted mean ($\bar{x}_w$) is calculated according to the
formula below.
$$\bar{x}_w = \frac{\sum_{i=1}^{r} x_i w_i}{\sum_{i=1}^{r} w_i}$$
Example
Solution:
The above formula is applied with
$x_1 = 65$, $x_2 = 70$, $x_3 = 55$, $w_1 = 10$, $w_2 = 30$, $w_3 = 60$
$$\bar{x}_w = \frac{(65 \times 10) + (70 \times 30) + (55 \times 60)}{10 + 30 + 60} = \frac{6050}{100} = 60.5$$
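The weighted mean formula translates directly into Python. The function name weighted_mean is chosen for this sketch; the values and weights are those used in the solution.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)


values = [65, 70, 55]
weights = [10, 30, 60]
print(weighted_mean(values, weights))   # 60.5
```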
B. Grouped data
Mean:
For grouped data the mean is calculated from the formula below:
$$\bar{x} = \frac{\sum (x_{mid} \times f)}{n}$$
where
$x_{mid}$ is the class midpoint, f the class frequency and n is the sample size.
This formula is a special case of the weighted mean formula with $w_i = f_i$ and
$\sum w_i = n$.
Example
For the frequency distribution of the cost of daily commute data (example 2 of
the frequency distributions), the mean can be calculated as shown below.
$$\bar{x} = \frac{2487}{50} = 49.74$$
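As a check on this result, the grouped mean can be computed in Python from the class limits and frequencies of the cost of daily commute distribution. The variable names are illustrative.

```python
# (lower limit, upper limit, frequency) for each class.
classes = [
    (38, 41, 4), (42, 45, 10), (46, 49, 8), (50, 53, 15),
    (54, 57, 9), (58, 61, 3), (62, 65, 1),
]

n = sum(f for _, _, f in classes)

# Class midpoint = (lower limit + upper limit) / 2, weighted by the frequency.
total = sum((low + high) / 2 * f for low, high, f in classes)
grouped_mean = total / n

print(total, n, grouped_mean)   # 2487.0 50 49.74
```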
3.3 Measures of variability (variation, spread, dispersion)
Variability refers to the extent to which the values in a data set vary around
(differ from) the associated measure of central tendency.
Example
The performance of 2 different stocks is monitored over a period of 8 days.
Their values are shown in the table below.
Day 1 2 3 4 5 6 7 8
A 103 120 112 108 130 106 120 112
B 112 97 85 123 153 85 146 110
The scatter plots [9] that follow show the performance of each stock.
[Figure: scatter plot titled "Stock A". Horizontal axis: Day (1 to 8); vertical axis: Stock price (0 to 140), with plotted values 103, 120, 112, 108, 130, 106, 120 and 112.]
[Figure: scatter plot titled "Stock B". Horizontal axis: Day (1 to 8); vertical axis: Stock price (0 to 180), with plotted values 112, 97, 85, 123, 153, 85, 146 and 110.]
[9] See Appendix A – page 24.
The mean values for the two stocks are the same (= 113.875), but they differ in
variability (extent of spread around the mean). Stock B has a far wider spread
around the mean than stock A.
A. Raw data
Example:
For the stocks data sets:
Range for stock A = 130 – 103 = 27
Range for stock B = 153 – 85 = 68
The larger (wider) spread in the stock B values is reflected in the larger range
(more than twice that of stock A).
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
i.e.
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}{n - 1}$$
The variance is expressed in the data units squared.
The standard deviation, $s = \sqrt{s^2}$, which is the positive square root of the
variance, is expressed in the same units as the data.
Example
Stock A (x values)    $x^2$
103                   10609
120                   14400
112                   12544
108                   11664
130                   16900
106                   11236
120                   14400
112                   12544
∑ 911                 104297
Variance: $s^2 = \frac{104297 - (8 \times 113.875^2)}{7} = 79.55$
Standard deviation: $s = \sqrt{79.55} = 8.919$
For stock B the standard deviation is 25.682 (check this using your calculator).
Interpretation: The stock A values differ (on average) from the mean by 8.919,
while stock B values differ (on average) from the mean by almost 3 times this
amount.
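The range, variance and standard deviation of the two stocks can be checked with Python's statistics module, which uses the n − 1 divisor for the sample variance. The shortcut formula is included for comparison; the variable names are illustrative.

```python
import statistics

stock_a = [103, 120, 112, 108, 130, 106, 120, 112]
stock_b = [112, 97, 85, 123, 153, 85, 146, 110]

for name, values in (("A", stock_a), ("B", stock_b)):
    n = len(values)
    mean = statistics.mean(values)
    value_range = max(values) - min(values)
    variance = statistics.variance(values)        # divides by n - 1
    # Shortcut formula: (sum of squares - n * mean^2) / (n - 1).
    shortcut = (sum(v ** 2 for v in values) - n * mean ** 2) / (n - 1)
    std_dev = statistics.stdev(values)
    print(name, mean, value_range, round(variance, 2),
          round(shortcut, 2), round(std_dev, 3))
```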
B. Grouped data
For grouped data, the raw data formulae for the variance and standard
deviation can be slightly modified.
$$s^2 = \frac{\sum_{i=1}^{k} (x_{mid(i)} - \bar{x})^2 f_i}{n - 1}$$
i.e.
$$s^2 = \frac{\sum_{i=1}^{k} x_{mid(i)}^2 f_i - n \bar{x}^2}{n - 1}$$
Example
The standard deviations of 2 data sets that are expressed in different units
cannot be directly compared. However, such a comparison may be done by
calculating the:
coefficient of variation $= CV = \frac{s}{\bar{x}} \times 100$, which is expressed as a percentage
Example
The age of three students were 19, 20 and 21 years and their respective
weights were 55, 60 and 65 kilograms. Since the two data sets are in different
units, they cannot be compared directly.
For the age data: $\bar{x} = 20$, $s = 1$ $\therefore CV = \frac{1}{20} \times 100 = 5\%$
For the weight data: $\bar{x} = 60$, $s = 5$ $\therefore CV = \frac{5}{60} \times 100 = 8.33\%$
The coefficient of variation calculations show that in relative terms the
variability for the weight data set is greater than that of the age data set.
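A short Python sketch of the coefficient of variation for the two data sets follows. The function name coefficient_of_variation is chosen for illustration.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


ages = [19, 20, 21]
weights = [55, 60, 65]

cv_age = coefficient_of_variation(ages)
cv_weight = coefficient_of_variation(weights)
print(round(cv_age, 2), round(cv_weight, 2))   # 5.0 8.33
```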
Examples
The 9 deciles D1, D2, . . . , D9 are the values that have 10%, 20%, ... , 90%
respectively of the values in the data set less than or equal to them.
The three quartiles (Q1 ,Q2 and Q3) are summary measures that divide a ranked
data set into four equal parts. As such, approximately 25% of the values in the
data set will be less than Q1, 50% of the values less than Q2 and 75% of the
values less than Q3.
The quartile deviation, $Q = \frac{Q_3 - Q_1}{2}$, can also be used as a measure of variability.
The quartile deviation value shows the extent to which the values in the data
set deviate from the median. For a skew data set (heavy clustering at lower or
upper end of the scale) the quartile deviation is a more appropriate measure of
variability than the standard deviation (which is more suitable as a measure of
variability for symmetric data sets).
The value ( Q3−Q1) is called the Inter-quartile Range (IQR). IQR indicates the
spread or variation of the middle 50% of the values in the data set.
$Q_1 = \left[ \frac{n+1}{4} \right]$th value in the ranked data set
$Q_2 = \left[ \frac{2(n+1)}{4} \right]$th value in the ranked data set = Median
$Q_3 = \left[ \frac{3(n+1)}{4} \right]$th value in the ranked data set
Example
6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36, 56
Solution
Ranked data set: 6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 56
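$Q_1 = \left[ \frac{n+1}{4} \right]$th value in the ranked data set
$= \left[ \frac{12+1}{4} \right]$th value in the ranked data set
= 3.25th ≅ 3rd value in the ranked data set
= 15 kilometres
$Q_2 = \left[ \frac{2(n+1)}{4} \right]$th value in the ranked data set
$= \left[ \frac{2(12+1)}{4} \right]$th value in the ranked data set
= 6.5th value in the ranked data set
$= \frac{40 + 41}{2}$
= 40.5 kilometres
$Q_3 = \left[ \frac{3(n+1)}{4} \right]$th value in the ranked data set
$= \left[ \frac{3(12+1)}{4} \right]$th value in the ranked data set
= 9.75th ≅ 10th value in the ranked data set
= 47 kilometres
Quartile deviation: $Q = \frac{Q_3 - Q_1}{2} = \frac{47 - 15}{2} = 16$ kilometres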
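The quartile calculation can be sketched in Python. The helper value_at_position is hypothetical and encodes an assumed reading of the convention used in the worked solution: a position ending in .5 is the average of the two neighbouring values, and any other fractional position is rounded to the nearest whole position. Statistical packages often interpolate instead, so their results may differ.

```python
import math


def value_at_position(ranked, position):
    """Value at a 1-based (possibly fractional) position in ranked data."""
    whole = math.floor(position)
    fraction = position - whole
    if fraction == 0.5:
        # Halfway between two positions: average the neighbouring values.
        return (ranked[whole - 1] + ranked[whole]) / 2
    nearest = whole if fraction < 0.5 else whole + 1
    return ranked[nearest - 1]


distances = [6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36, 56]
ranked = sorted(distances)
n = len(ranked)

q1 = value_at_position(ranked, (n + 1) / 4)        # position 3.25
q2 = value_at_position(ranked, 2 * (n + 1) / 4)    # position 6.5
q3 = value_at_position(ranked, 3 * (n + 1) / 4)    # position 9.75
quartile_deviation = (q3 - q1) / 2

print(q1, q2, q3, quartile_deviation)   # 15 40.5 47 16.0
```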
Percentile rank of a score is the percentage of values in the data set that are
smaller than the given score and is denoted by PR x where x is the given score.
$$PR_x = \frac{\text{number of values less than } x}{n} \times 100$$
For the distance to work data set above, $P_{80}$ and $PR_{40}$ are calculated as follows:
Ranked data set: 6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 56
$P_k = \left[ \frac{k(n+1)}{100} \right]$th value in a ranked data set
$P_{80} = \left[ \frac{80(12+1)}{100} \right]$th value in the ranked data set
$P_{80}$ = 10.4th ≅ 10th value in the ranked data set
$\therefore P_{80}$ = 47 kilometres
$PR_{40} = \frac{5}{12} \times 100 = 41.67\%$
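The percentile and percentile rank calculations can be sketched in Python as follows. Rounding the position to the nearest whole number mirrors the 10.4th ≅ 10th step in the worked example; this is an assumption about the convention, and a position ending in .5 would need separate handling.

```python
ranked = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 56]
n = len(ranked)

# 80th percentile: position k(n + 1)/100, rounded to the nearest position.
position = 80 * (n + 1) / 100          # 10.4
p80 = ranked[round(position) - 1]      # 10th value in the ranked data set

# Percentile rank of 40: percentage of values smaller than 40.
below = sum(v < 40 for v in ranked)
pr40 = below / n * 100

print(position, p80)            # 10.4 47
print(below, round(pr40, 2))    # 5 41.67
```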
Examples
If it is known that the data set of interest has a bell-shaped clustering pattern
of the values then results that are better than that of Chebychev’s theorem can
be obtained. For data with such a shape:
(i) Approximately 68% of data values are within 1 standard deviation of the
mean.
(ii) Approximately 95% of data values are within 2 standard deviations of
the mean.
(iii) Approximately 99.7% of data values are within 3 standard deviations of
the mean.
Example
Men’s heights have a bell-shaped distribution with a mean of 175.8
centimetres and a standard deviation of 7.4 centimetres.
Approximately 68% of data values are within 175.8 ± 7.4 = (168.4; 183.2).
Approximately 95% of data values are within 175.8 ± 14.8 = (161.0; 190.6).
Approximately 99.7% of data values are within 175.8 ± 22.2 = (153.6; 198.0).
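The three intervals in the men's heights example can be reproduced with a few lines of Python. This is only a sketch; the function name `empirical_rule_intervals` is hypothetical, and the rule itself applies only when the data are bell-shaped.

```python
def empirical_rule_intervals(mean, sd):
    """Intervals mean ± k·sd for k = 1, 2, 3 with their approximate coverage (%)."""
    coverage = {1: 68.0, 2: 95.0, 3: 99.7}
    return {
        pct: (round(mean - k * sd, 1), round(mean + k * sd, 1))
        for k, pct in coverage.items()
    }

# Men's heights: mean 175.8 cm, standard deviation 7.4 cm
intervals = empirical_rule_intervals(175.8, 7.4)
for pct, (low, high) in intervals.items():
    print(f"Approximately {pct}% of values lie in ({low}; {high})")
```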
Tutorial
Downtime Frequency
0-9 3
10-19 13
20-29 30
30-39 25
40-49 14
50-59 8
60-69 4
70-79 2
80-89 1
Diameter (millimeters)    Number of washers (frequency)
30-40                     10
40-50                     50
50-60                     55
60-70                     79
70-80                     68
80-90                     60
90-100                    50
100-110                   28
110-120                   8
Total                     400
a. Calculate the approximate sample mean and standard
deviation of the weight for the above ball bearings.
b. Construct a cumulative frequency distribution for the above
data and plot an ogive.
c. From the ogive above and the formula in your notes find the
first and third quartiles and the median weight for the ball
bearings.
a. between 50 and 60 km/h?
b. less than or equal to 50 km/h or greater than or equal to 60
km/h?
c. Find the interval of speed that will contain approximately 95%
of data values.
CHAPTER 4
CORRELATION AND REGRESSION
4.1 Bivariate data and scatter diagrams
The first step in the exploration of bivariate data is to plot the variables on a
graph. From such a graph, which is known as a scatter diagram (scatter plot,
scatter graph), an idea can be formed about the nature of the relationship.
Examples
1. It is believed that a person’s height (y) (measured in centimetres) is
dependent on the person’s shoe size (x). The values of x and y for 12
students are shown below.
Scatter diagram (see Appendix A12)
[Figure: Relationship between height and shoe size. Horizontal axis: shoe size.]
Scatter diagram
[Figure: Relationship between rainfall and quantity of air pollution removed. Horizontal axis: rainfall (in centimetres); vertical axis: quantity of air pollution removed.]
3. Data on the annual GDP growth rate (x) of various African countries and
the cost of building individual prestige houses (y) in these countries was
taken from the Africa Property & Construction Cost Guide, July 2017,
and is shown below:
Scatter diagram
[Figure: Relationship between annual GDP growth rate and building costs. Horizontal axis: annual GDP growth (%); vertical axis: building costs.]
In all these cases the relationship can be fairly well described by means
of a straight line i.e. all these relationships are linear relationships.
2. Finding the equation of the straight line that will best describe the
relationship between the 2 variables (the linear regression equation).
Once this line is determined, it can be used to estimate a value of y for a
given value of x (linear estimation).
4.2 Linear Correlation
$-1 \le r \le 1$
If the plotted points are closely clustered around this line, r will lie close to
either 1 or –1 (depending on whether the linear relationship is positive or
negative). The further the plotted points are away from the line, the closer the
value of r will be to 0. Consider the scatter diagrams that follow.
[Figure: scatter diagrams illustrating different values of r; one panel shows no pattern (r close to 0).]
$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$
Example
Consider the data on a person’s shoe size (x) and height (y) considered earlier.
For this data r can be calculated in the following way.
x y xy x² y²
5 160 800 25 25600
4 152 608 16 23104
12 196 2352 144 38416
8 168 1344 64 28224
9 178 1602 81 31684
7.5 165 1237.5 56.25 27225
6.5 165 1072.5 42.25 27225
11.5 170 1955 132.25 28900
10.5 188 1974 110.25 35344
11 180 1980 121 32400
6 163 978 36 26569
4.5 155 697.5 20.25 24025
∑ 95.5 2040 16600.5 848.25 348716
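Substituting
$n = 12$, $\sum x = 95.5$, $\sum y = 2040$, $\sum xy = 16600.5$, $\sum x^2 = 848.25$, $\sum y^2 = 348716$:
$r = \frac{12 \times 16600.5 - 95.5 \times 2040}{\sqrt{12 \times 848.25 - (95.5)^2}\,\sqrt{12 \times 348716 - (2040)^2}}$
$= \frac{4386}{\sqrt{1058.75 \times 22992}}$
$= 0.889$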
Comment: Strong positive correlation i.e. the increase in a person’s shoe size is
closely linked with an increase in the person’s height.
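The calculation of r for the shoe size and height data can be checked with the following Python sketch, which builds the same column sums as the table and substitutes them into the formula. The function name `correlation` is illustrative only.

```python
import math

# Shoe size (x) and height in centimetres (y) for the 12 students
shoe_size = [5, 4, 12, 8, 9, 7.5, 6.5, 11.5, 10.5, 11, 6, 4.5]
height = [160, 152, 196, 168, 178, 165, 165, 170, 188, 180, 163, 155]

def correlation(x, y):
    """Correlation coefficient r computed from the column sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

r = correlation(shoe_size, height)
print(round(r, 3))  # 0.889
```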
Coefficient of determination
The strength of the correlation between 2 variables is proportional to the
square of the correlation coefficient ($r^2$). This quantity, called the coefficient of
determination, is the proportion of variability in the y variable that is
accounted for by its linear relationship with the x variable.
Example
In the above example on height (y) and shoe size (x), the
coefficient of determination $= r^2 = (0.889)^2 = 0.7903$.
This means that approximately 79% of the variability in a
person’s height is explained by its relationship with the person’s shoe size.
According to the least squares principle, the line that “best” fits the plotted
points is the one that minimizes the sum of the squares of the vertical
deviations (see vertical lines in the graph) between the plotted y and estimated
y (values on the line). For this reason the line fitted according to this principle
is called the least squares line.
$\hat{y} = a + bx$
where $\hat{y}$ is the fitted y value (the y value on the line, which generally differs from the
observed y value), a is the y-intercept and b the slope of the line.
It can be shown that the coefficients that define the least squares line can be
calculated from (see Appendix A13)
$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$
and
$a = \bar{y} - b\bar{x}$
Example
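For the above data on shoe size (x) and height (y) the least squares line can
be calculated as shown below.
Substituting $n = 12$, $\sum x = 95.5$, $\sum y = 2040$, $\sum xy = 16600.5$ and $\sum x^2 = 848.25$:
$b = \frac{12 \times 16600.5 - 95.5 \times 2040}{12 \times 848.25 - (95.5)^2} = \frac{4386}{1058.75} = 4.14$
$a = \bar{y} - b\bar{x} = \frac{2040}{12} - 4.14 \times \frac{95.5}{12} = 137.05$
Therefore the equation of the y on x least squares line that can be used to
estimate values of y (height) based on x (shoe size) is:
$\hat{y} = 137.05 + 4.14x$
Suppose the height of a student with shoe of size 7 is to be estimated. This can
be done by substituting the value of x = 7 into the above equation. Then
$\hat{y} = 137.05 + 4.14(7) = 166.03$ centimetres.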
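The least squares coefficients for the shoe size and height data can be checked with the Python sketch below. It uses the standard least squares formulas for the slope and intercept; the function name `least_squares_line` is illustrative only.

```python
# Shoe size (x) and height in centimetres (y) for the 12 students
shoe_size = [5, 4, 12, 8, 9, 7.5, 6.5, 11.5, 10.5, 11, 6, 4.5]
height = [160, 152, 196, 168, 178, 165, 165, 170, 188, 180, 163, 155]

def least_squares_line(x, y):
    """Return (a, b) for the fitted line y-hat = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(u * v for u, v in zip(x, y))
    sum_x2 = sum(u * u for u in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
    a = sum_y / n - b * sum_x / n                                 # intercept
    return a, b

a, b = least_squares_line(shoe_size, height)
estimated_height = a + b * 7   # estimate for shoe size 7

print(round(a, 2), round(b, 2), round(estimated_height, 2))
```

Because the code carries the slope at full precision, the intercept comes out near 137.03 rather than the 137.05 obtained when b is first rounded to 4.14; the estimated height for shoe size 7 is about 166.03 centimetres either way.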
A word of caution
The linear relationship between y and x is often only valid for values of x
within a certain range e.g. when estimating a person’s height using the
person’s shoe size as explanatory variable, it should be taken into
account that at some shoe size the person’s height will stop increasing.
Assuming a linear relationship between shoe size and height for values
beyond the shoe size where the person’s height stops increasing would
be incorrect.
If variables are not linearly related, it does not mean that they are not
related. There are many situations where the relationships between
variables are non-linear.
Example
A plot of the banana consumption (y) versus the price (x) is shown in the graph
on the following page. A straight line will not describe this relationship very
well, but the non-linear curve shown below will describe it well.
[Figure: Nonlinear regression example. Banana consumption (y) plotted against price (x), with a fitted non-linear curve.]
Tutorial
Assessed value, X (thousands of rand)    Selling price, Y (thousands of rand)
116,0 185,0
160,8 246,4
103,2 162,2
55,8 97,6
89,6 148,0
65,0 110,4
144,0 236,6
80,6 126,8
Find the line of best fit and use it to estimate the selling price of a house
when its assessed value is R100 000.
Normal stress, x    Shear resistance, y
26.8 26.5
25.4 27.3
28.9 24.2
23.6 27.1
27.7 23.6
23.9 25.9
24.7 26.3
28.1 22.5
26.9 21.7
27.4 21.4
22.6 25.8
25.6 24.9
(b) Determine the correlation coefficient between the shear resistance and the
normal stress.
(c) Estimate the shear resistance for a normal stress of 24.5 (kilograms per
square cm).
Find the least squares regression line by which one may predict the
efficiency from the extraction time.
thousands of units) and its price (in cents) in six different market
areas:
x y
Price Demand
19 55
23 7
21 20
15 123
16 88
18 76
Plot the data and the regression line on suitable axes. (Show your working for the 2
points needed to plot the straight line)
CHAPTER 5
RANDOM VARIABLES AND
PROBABILITY DISTRIBUTIONS
5.1 Introduction to probability distributions
Probability (chance)
A probability is the chance that something of interest will happen.
A probability is expressed as a proportion i.e. it ranges from 0 to 1.
Chance can be expressed as a percentage i.e. it ranges from 0 to 100.
Examples
1. The probability of rain tomorrow is 0.40
There is a 40% chance of rain tomorrow.
2. The probability of winning the Lotto is 1/13983816.
3. The probability of a certain new product being successful is 0.75.
Random experiment
This is an experiment that gives different outcomes when repeated under
similar conditions.
Examples
Examples
1. T = the number of tails (t) when a coin is flipped 3 times.
2. X = the sum of the values (x) showing when two dice are rolled.
3. H = the height (h) of a woman chosen at random from a group.
4. V = the liquid volume (v) of soda in a can marked 12 oz.
There are two types of random variables:
Examples
1. The variables T and X from the above examples are discrete random
variables.
2. The variables H and V from the above examples are continuous random
variables.
Examples
Outcomes         T
hhh              0
hht, hth, thh    1
tth, tht, htt    2
ttt              3
Assuming that the outcomes are all equally likely, the probability
distribution for T is given in the following table.
t 0 1 2 3 Total
P(t) 1/8 3/8 3/8 1/8 1
2. A pair of dice is tossed. Let X denote the sum of the digits. The
probability distribution of X can be found from the following table. The
entry in any particular cell is the sum of the row and column values.
1st die
1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
2nd die 3 4 5 6 7 8 9
4 5 6 7 8 9 10
5 6 7 8 9 10 11
6 7 8 9 10 11 12
x 2 3 4 5 6 7 8 9 10 11 12
P(X=x) 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36
Note:
For any discrete random variable X, the probabilities of the values that it can
assume are such that
$0 \le P(x) \le 1$ and $\sum_{x} P(x) = 1$.
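The probability distribution of the sum of two dice can be generated by listing all 36 equally likely outcomes, as in the following Python sketch. The variable names are illustrative; exact fractions are used so that the two conditions in the note can be checked without rounding error.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 equally likely (1st die, 2nd die) pairs give each sum
counts = Counter(first + second for first, second in product(range(1, 7), repeat=2))

# Probability distribution of X = sum of the two dice
pmf = {x: Fraction(count, 36) for x, count in sorted(counts.items())}

for x, p in pmf.items():
    print(x, f"{counts[x]}/36")

# Conditions from the note: every probability lies in [0, 1] and they sum to 1
assert all(0 <= p <= 1 for p in pmf.values())
assert sum(pmf.values()) == 1
```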
$F(x) = P(X \le x) = \sum_{r \le x} P(r)$
Examples
x      0    1    2    3
F(x)   1/8  1/2  7/8  1
x      2     3     4     5      6      7      8      9      10     11     12
F(x)   1/36  3/36  6/36  10/36  15/36  21/36  26/36  30/36  33/36  35/36  1
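A cumulative distribution function is obtained by accumulating the probabilities from the smallest value upwards. The Python sketch below does this for T, the number of tails in 3 coin flips; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import accumulate

# Probability distribution of T = number of tails in 3 flips
tails = [0, 1, 2, 3]
probabilities = [Fraction(1, 8), Fraction(3, 8), Fraction(3, 8), Fraction(1, 8)]

# F(t) = P(T <= t) is the running total of the probabilities
cdf = dict(zip(tails, accumulate(probabilities)))

for t, cumulative in cdf.items():
    print(t, cumulative)
```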
x        1    2    3    4
P(X=x)   0.1  0.3  0.4  0.2
[Figure: (a) CDF; (b) PMF]
The graphs on the previous page are plots of the probability mass function
(graph on the right) and cumulative distribution function (graph on the left).
A random variable can only take on one value at a time i.e. the events $X = x_1$
and $X = x_2$ for $x_1 \ne x_2$ are mutually exclusive. The probability of the variable
taking on any number of different values can be found by simply adding the
appropriate probabilities.
Examples
1. Find the probability of getting 2 or more tails when a coin is flipped 3
times.
P(T ≥ 2) = 3/8 + 1/8 = ½.
2. Find the probability of getting at least one tail when a coin is flipped 3
times.
P(at least 1) = P(1) + P(2) + P(3) = 3/8 + 3/8 +1/8 = 7/8
Or
P(at least 1) = 1 – P(0) = 1 – 1/8 = 7/8
The mean or expected value of a random variable X is the average value that
we would expect for X when performing the random experiment many times.
E(X) = μ = ∑ xp( x) .
Examples
Thus if 3 coins are flipped a large number of times, we should expect the
average number of tails (per 3 flips) to be about 1.5. Since the number of tails
is an integer value, it will never actually assume the mean value of 1.5. This
mean value more reflects the fact that the extreme values (0 and 3) occur the
same proportion of times (an eighth) and the middle values occur the same
proportion of times (three eighths).
s        0     1     2     3     4     5
P(S=s)   0.12  0.04  0.16  0.32  0.24  0.12
s        0     1     2     3     4     5     sum
P(S=s)   0.12  0.04  0.16  0.32  0.24  0.12  1
s×P(s)   0     0.04  0.32  0.96  0.96  0.60  2.88
μ = E(S) = 2.88
Variance
(a) For a random variable X, the variance, denoted by $\sigma^2$, can be calculated
by using the formula
$\sigma^2 = \sum (x - \mu)^2 P(x) = \sum x^2 P(x) - \mu^2$
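As an illustration, the mean and both forms of the variance formula can be applied to the distribution of S tabulated earlier. This Python sketch is not part of the original worked examples; the variable names are illustrative only.

```python
from math import sqrt

# Probability distribution of S from the expected value example
values = [0, 1, 2, 3, 4, 5]
probabilities = [0.12, 0.04, 0.16, 0.32, 0.24, 0.12]

# Mean: sum of s * P(s)
mean = sum(s * p for s, p in zip(values, probabilities))

# Variance, definition form: sum of (s - mean)^2 * P(s)
variance_definition = sum((s - mean) ** 2 * p for s, p in zip(values, probabilities))

# Variance, shortcut form: sum of s^2 * P(s) minus mean^2
variance_shortcut = sum(s * s * p for s, p in zip(values, probabilities)) - mean ** 2

standard_deviation = sqrt(variance_shortcut)

print(round(mean, 2), round(variance_definition, 4), round(variance_shortcut, 4))
```

Both forms give the same variance (about 2.1056 here), which is a useful check on hand calculations.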
Examples
3. Fourteen percent of flights from a certain airport are delayed. If 20
flights are chosen at random, then we can consider each flight to be an
independent Bernoulli trial. If we define a successful trial to be one
where a flight takes off on time, then the random variable Z representing
the number of on-time flights will be binomially distributed with n = 20,
p = 0.86 and q = 0.14.
Examples
$P(X = 2) = {}^{3}C_{2}\left(\frac{1}{2}\right)^{2}\left(\frac{1}{2}\right)^{1} = 0.375$.
Mean and standard deviation of a binomial random variable
E ( T )=μ=3 × 0.5=1.5
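Binomial probabilities, and the mean and standard deviation of a binomial random variable, can be computed with the Python sketch below. It assumes the standard results μ = np and σ = √(npq) with q = 1 – p; the function names are illustrative only.

```python
from math import comb, sqrt

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p): nCx * p^x * q^(n - x), with q = 1 - p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomial_mean_sd(n, p):
    """Mean n*p and standard deviation sqrt(n*p*q) of X ~ Bin(n, p)."""
    return n * p, sqrt(n * p * (1 - p))

# Three coin flips: probability of exactly 2 tails, and the mean number of tails
p_two_tails = binomial_pmf(2, 3, 0.5)
mean_tails, sd_tails = binomial_mean_sd(3, 0.5)

# On-time flights example: n = 20, p = 0.86
mean_flights, sd_flights = binomial_mean_sd(20, 0.86)

print(p_two_tails, mean_tails, round(sd_tails, 3))
print(round(mean_flights, 1), round(sd_flights, 3))
```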
[Figures: bar charts of binomial probability distributions for n = 20 (x = 0 to 20), including X ∼ Bin(20, 0.1) and X ∼ Bin(20, 0.9).]
A Poisson random variable (X) is one that counts the number of events that
occur at random in an interval of time or space. The average number of events
that occur in the time/space interval is denoted by λ.
Examples
1. The number of bad cheques presented for daily payment at a bank.
2. The number of road deaths per month.
3. The number of bacteria in a given culture.
4. The number of defects per square meter on metal sheets being
manufactured.
5. The number of mistakes per typewritten page.
Formula for the calculation of Poisson probabilities
$P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}$, for x = 0, 1, 2, …
Examples
1. A bank receives on average μ = 6 bad cheques per day. Calculate the
probability of the bank receiving
Solution
= 1 – 0.062
=0.938
Solution
In this case λ=1 is claimed and X the number of mistakes ≥ 5. If the claim is
true,
$P(X \ge 5) = 1 - P(X \le 4)$
$= 1 - \left[\frac{1^0 e^{-1}}{0!} + \frac{1^1 e^{-1}}{1!} + \frac{1^2 e^{-1}}{2!} + \frac{1^3 e^{-1}}{3!} + \frac{1^4 e^{-1}}{4!}\right]$
= 1 – 0.9963
= 0.0037.
The above calculation shows that if the claim of 1 mistake per page on average
is true, there is only a 37 in 10 000 chance of getting 5 or more mistakes per
page. This remote chance of 5 or more mistakes when an average of 1 mistake
per page is true casts doubt on whether the claim of 1 mistake per page on
average is in fact true.
Example
Calls arrive at a switchboard at an average rate of 1 every 15 seconds. What is
the probability of not more than 5 calls arriving during a particular minute?
Solution
A mean rate of 1 every 15 seconds is equivalent to a mean rate of 4 every
minute. Since the question concerns an interval of 1 minute, λ = 4 (not µ = 1).
$P(X \le 5) = e^{-4} + \frac{4e^{-4}}{1!} + \frac{4^2 e^{-4}}{2!} + \frac{4^3 e^{-4}}{3!} + \frac{4^4 e^{-4}}{4!} + \frac{4^5 e^{-4}}{5!} = 0.7851$
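The two Poisson calculations above (5 or more mistakes per page with λ = 1, and not more than 5 calls per minute with λ = 4) can be checked with the following Python sketch. The function names `poisson_pmf` and `poisson_cdf` are illustrative only.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with mean lam."""
    return lam ** x * exp(-lam) / factorial(x)

def poisson_cdf(x, lam):
    """P(X <= x): sum of the individual probabilities from 0 to x."""
    return sum(poisson_pmf(k, lam) for k in range(x + 1))

# Mistakes per page: P(X >= 5) = 1 - P(X <= 4) with lambda = 1
p_five_or_more_mistakes = 1 - poisson_cdf(4, 1)

# Switchboard: 1 call every 15 seconds means lambda = 4 per minute
p_at_most_five_calls = poisson_cdf(5, 4)

print(round(p_five_or_more_mistakes, 4))  # 0.0037
print(round(p_at_most_five_calls, 4))     # 0.7851
```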
For this reason probabilities associated with individual values of a continuous
random variable X are taken as 0.
The clustering pattern of the values of X over the possible values in the interval
is described by a mathematical function f(x) called the probability density
function. A high (low) clustering of values will result in high (low) values of this
function. For a continuous random variable X, only probabilities associated
with ranges of values (e.g. an interval of values from a to b) will be calculated.
The probability that the value of X will fall between the values a and b is given
by the area between a and b under the curve describing the probability density
function f(x). For any probability density function the total area under the
graph of f(x) is 1.
$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, for $-\infty < x < +\infty$
The constants μ and σ are the mean and standard deviation, respectively, of
X. These constants completely specify the density function. A graph of the
curve describing the probability density function (known as the normal curve) for the
case μ = 0 and σ = 1 is shown below.
[Figure: Graph of the standard normal distribution, p(z) plotted against z for z from -4 to 4.]
5.6.2 Properties of the Normal distribution
An increase (decrease) in the mean µ results in a shift of the graph to the right
(left) e.g. the curve of the distribution with a mean of -2 is moved 2 units to the
left. An increase (decrease) in the standard deviation σ results in the graph
becoming more (less) spread out e.g. compare the curves of the distributions
with σ² = 0.5, 1 and 2.
[Figure: Histogram of frequency (freq) against mark.]
random variable of interest [$X \sim N(\mu, \sigma^2)$] to a standardized Normal random
variable:
$Z = \frac{X - \mu}{\sigma}$
It can be shown that the transformed random variable $Z \sim N(0, 1)$. The random
variable Z can be transformed back to X by using the formula
$X = \mu + \sigma Z$
Various areas under the above normal curve are shown. The standard Normal
table gives the area under the curve to the left of the value z. Other types of
areas can be found by combining several of the areas as shown in the following
examples.
The areas shown in the standard Normal table are those under the standard
normal curve to the left of the value of z looked up i.e. the areas are the
probabilities P(Z ≤ z). For example, P(Z ≤ 0.14) = 0.5557.
Note:
For negative values of z less than the minimum value (–3.79) in the
table, the probabilities are taken as 0, that is, P(Z ≤ z) = 0 for z < –3.79.
For positive values of z greater than the maximum value (3.79) in the
table, the probabilities are taken as 1, that is, P(Z ≤ z) = 1 for z > 3.79.
Examples
In all the examples that follow Z ∼ N(0, 1).
a) P(Z <1.35)=0.9115
In all the above examples an area was found for a given value of z. It is also
possible to find a value of z when an area to its left is given. This can be written
as $P(Z \le z_\alpha) = \alpha$ (α is the Greek letter for a and is pronounced “alpha”). In this
case $z_\alpha$ has to be found where α is the area to its left.
Examples
1. Find the value of z that has an area of 0.0344 to its left.
Search the body of the table for the required area (0.0344) and then
read off the value of z corresponding to this area. In this case
$z_{0.0344} = -1.82$.
Finding 0.975 in the body of the table and reading off the z value gives
$z_{0.975} = 1.96$.
3. Find the values of z that have areas of 0.95 and 0.05 to their left.
When searching the body of the table for 0.95 this value is not found.
The z value corresponding to 0.95 can be estimated from the following
information obtained from the table.
z area to left
1.64 0.9495
? 0.95
1.65 0.9505
Since the required area (0.95) is halfway between the 2 areas obtained
from the table, the required z can be taken as the value halfway
between the two z values that were obtained.
From the table: $z = \frac{1.64 + 1.65}{2} = 1.645$
Exercise: Using the same approach as above, verify that the z value
corresponding to an area of 0.05 to its left is -1.645.
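The table look-ups in these examples can be cross-checked in Python. The sketch below uses `statistics.NormalDist` from the standard library in place of the printed table, and a small helper (the hypothetical `interpolate_z`) that mirrors the interpolation between two neighbouring table entries.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def interpolate_z(area, z_low, area_low, z_high, area_high):
    """Linear interpolation between two neighbouring table entries."""
    share = (area - area_low) / (area_high - area_low)
    return z_low + share * (z_high - z_low)

# Interpolating between the table entries for 1.64 and 1.65
z_from_table = interpolate_z(0.95, 1.64, 0.9495, 1.65, 0.9505)

# Values of z for a given area to the left, computed directly
z_0344 = standard_normal.inv_cdf(0.0344)
z_975 = standard_normal.inv_cdf(0.975)
z_95 = standard_normal.inv_cdf(0.95)
z_05 = standard_normal.inv_cdf(0.05)

print(round(z_from_table, 3), round(z_95, 3), round(z_05, 3))
print(round(z_0344, 2), round(z_975, 2))
```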
At the bottom of the standard normal table selected percentiles $z_\alpha$ are given
for different values of α. This means that the area under the normal curve to
the left of $z_\alpha$ is α.
Examples:
1. α = 0.900, $z_\alpha = 1.282$ means P(Z < 1.282) = 0.900.
2. α = 0.995, $z_\alpha = 2.576$ means P(Z < 2.576) = 0.995.
P(Z ≥ z)=P(Z ≤ – z )
Let X be a N(μ, σ²) random variable and Z a N(0, 1) random variable. Then
$P(X \le x) = P\left(Z \le \frac{x - \mu}{\sigma}\right)$
Example 1
The height (in centimetres) of a population of women is approximately
normally distributed with a mean of μ=161.3 and a standard deviation of σ =6.7
centimetres.
Solution
To calculate the probability that a woman is less than 160 centimetres tall, we
first find the z-score for 160 centimetres:
$z = \frac{160 - 161.3}{6.7} = -0.19$
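The z-score and the corresponding probability for the women's heights example can be computed with the Python sketch below. It uses `statistics.NormalDist` from the standard library rather than the printed table; rounding z to two decimals imitates a table look-up, so the two probabilities differ slightly.

```python
from statistics import NormalDist

# Women's heights: mean 161.3 cm, standard deviation 6.7 cm
heights = NormalDist(mu=161.3, sigma=6.7)

# Standardise 160 cm
z = (160 - heights.mean) / heights.stdev

# Probability using z rounded to 2 decimals, as when reading the table
p_table_style = NormalDist().cdf(round(z, 2))

# Probability without rounding z
p_exact = heights.cdf(160)

print(round(z, 2), round(p_table_style, 4), round(p_exact, 4))
```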
Example 2
The length X (centimetres) of sardines is a N (11.73 , 0.1344) random variable.
What proportion of sardines is:
(a) longer than 12.7 centimetres?
(b) between 11.049 and 12.319 centimetres?
Solution
= 0.9441 – 0.0294
= 0.9147
The standard Normal table can be used to find percentiles for random
variables which are normally distributed. The p-th percentile for X is given by
$x_p = \mu + \sigma z_p$
Example
The scores X obtained in a mathematics entrance examination are Normally
distributed with and . Find the score which marks the 80th
percentile.
Solution
From the standard Normal table, the z-value which is closest to an area of 0.80
in the body of the table is 0.84 (the actual area to its left is 0.7995). The score
which corresponds to a z-value of 0.84 can be found by
That is, a score of approximately 609 is better than 80% of all other exam
scores.
Tutorial
1. The probability distribution of X, the number of cylinders to be
tuned up in the engines of cars at a certain service station, is
shown in the table below.
X 4 6 8
probability 0.5 0.3 0.2
2. A game between two players is fair if each player has the same
mathematical expectation. If someone gives us RS each time we
roll a 1 or 2 with a balanced die, how much must we pay that
person each time we roll a 3, 4, 5 or 6 to make the game fair?
0.4.
(a) What is the probability of at
least one sale?
(b) What is the expected number
of sales?
average 1.9 trucks break down per day, and you keep two
trucks available to replace those that break down. If you can
assume that the number of breakdowns on any day is a
Poisson random variable, what is the probability that on
any one day
(a) no extra replacement trucks are needed;
(b) the number of replacement trucks is inadequate?
15. Given a standard normal distribution, find the area under the curve
which lies
a. to the left of z = 1.43 i.e. P (z < 1.43)
b. to the right of z = −0.89 i.e. P (z > −0.89)
c. between z = −2.16 and z = −0.65 i.e. P (−2.16 < z < −0.65)
d. to the right of z = 1.96 i.e. P (z > 1.96)
e. between z = −0.48 and z = 1.74 i.e. P (−0.48 < z < 1.74).
16. Find the value of z if the area under a standard normal curve
(a) to the right of z is 0.3622
(b) to the left of z is 0.1131 i.e. find z0.1131
(c) between 0 and z, with z > 0, is 0.4838;
(d) between −z and z, with z > 0, is 0.9500.
17. Given the normally distributed variable X with mean 18 and standard
deviation 2.5, find
(a) P (X < 15);
(b) the value of k such that P (X < k) = 0.2236;
(c) the value of k such that P ( X >k )=0.1814
(d) P (17 < X < 21);
21. The weights of adult male rhesus monkeys are normally distributed
with a mean of 15 pounds and a standard deviation of 3 pounds.
(a) A male rhesus monkey is randomly selected. What is the
probability that its weight is more than 17 pounds?
(b) If 50 male rhesus monkeys are randomly selected, about how
many would you expect to weigh less than 12 pounds?
22. The manager of a gym has determined that the length of time
members spend at the gym is a normally distributed random
variable with a mean of 80 minutes and a standard deviation of
20 minutes.
(a) What proportion of members spend more than 2 hours at the gym?
(b) What proportion of members spend less than 1 hour at the gym?
(c) What is the least amount of time spent by 60% of the
members at the gym?
CHAPTER 6
HYPOTHESIS TESTING
6.1 Formulation of hypotheses and related terminology
Statistical hypothesis
A statistical hypothesis is an assertion (claim) made about a value(s) of a
population parameter.
Purpose
The purpose of testing of hypotheses is to determine whether a claim
that is made could be true. The conclusion about the truth of such a
claim is not stated with absolute certainty, but rather in terms of the
language of probability.
Null hypothesis (H 0 )
This is a statement concerning the value of the parameter of interest ( θ ) in a
claim that is made. This is formulated as
$H_0: \theta = \theta_0$
Alternative hypothesis (H 1 )
This is a statement about the possible values of the parameter θ that are
believed to be true if H 0 is not true. One of the alternative hypotheses shown
below will apply.
a. $H_1: \theta < \theta_0$
b. $H_1: \theta > \theta_0$
c. $H_1: \theta \ne \theta_0$
Examples
One-sided alternative
This is a hypothesis that specifies the alternative values (to the null hypothesis)
in a direction that is either below or above that specified by the null
hypothesis.
Example
The alternative hypothesis H1 (see example 1 above) is the alternative that the
value of the parameter is less than that stated under the null hypothesis.
Two-sided alternative
This is a hypothesis that specifies the alternative values (to the null hypothesis)
in directions that can be either below or above that specified by the null
hypothesis.
Example
The alternative hypothesis H1 (see example 2 above) is the alternative that the
value of the parameter is either greater than that stated under the null
hypothesis or less than that stated under the null hypothesis.
The testing procedure and terminology will be explained for the test for the
population mean μ with population variance σ 2 known.
1. H 0 :µ=μ0
H 1 : µ ≠ μ0
2. H 0 :µ=μ0
H 1 : μ < μ0
3. H 0 : μ=μ 0
H 1 : μ > μ0
When testing for the population mean, the test statistic used is:
$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$
We calculate the value of the statistic by substituting the values of $\bar{x}$, $\mu_0$, $\sigma$ and $n$
into the equation and obtain $z_{calc}$.
If the difference between $\bar{x}$ and $\mu_0$ (and therefore the value of $z_{calc}$) is
reasonably small, $H_0$ will not be rejected. In this case the sample mean is
consistent with the value of the population mean that is being tested. If this
difference (and therefore the value of $z_{calc}$) is sufficiently large, $H_0$ will be
rejected. In this case the sample mean is not consistent with the value of the
population mean that is being tested. In order to decide how large this
difference between $\bar{x}$ and $\mu_0$ (and therefore the value of $z_{calc}$) should be before
$H_0$ is rejected, the following should be considered.
Type I error
A type I error is committed when the null hypothesis is rejected when, in
fact it is true i.e. H 0 is wrongly rejected.
For example, a type I error is committed when it is decided that the
statement
H0: µ = μ0 should be rejected when, in fact, it is true.
Type II error
A type II error is committed when the null hypothesis is not rejected
when, in fact, it is false i.e. a decision not to reject H 0 is wrong.
For example, a type II error is committed when it is decided that the
statement
H0: µ = μ0 should not be rejected when, in fact, it is false.
P(type I error) = P(reject μ = μ0 | μ = μ0 is true) = α
P(type II error) = P(do not reject μ = μ0 | μ = μ0 is false) = β
1 – P(type II error) = 1 – β = the power of the test. It is the probability of not
making a type II error.
Probabilities of type I and type II errors work in opposite directions. The more
reluctant you are to reject H0, the higher the risk of accepting it when, in fact, it
is false. The easier you make it to reject H0, the lower the risk of accepting it
when, in fact, it is false.
For the test of the population mean the critical value is determined in the
following way. Assuming that H0 is true, the test statistic will follow a standard
Normal distribution i.e.
$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1)$
1. When testing H0 versus the alternative hypothesis H1 (µ < µ0), the critical
region lies in the left tail of the standard Normal distribution. This is
called a left-tailed test. That is, the value of −z crit (the critical value) is
such that the area under the standard normal curve to the left of −z crit is
α. That is, P(Z<−z crit ) = α. The graph below illustrates the case for α =
0.05.
That is, P(Z < –1.645) = 0.05:
2. When testing H0 versus the alternative hypothesis H1 (µ > µ0), the critical
region lies in the right tail of the standard Normal distribution. This is
called a right-tailed test. That is, the value of +z crit is such that the area
under the standard Normal curve to the right of z crit is α. That is,
P( Z > z crit ) = α . This leaves an area of 1−α to the left of z crit . The graph
below illustrates the case for α = 0.05. This means 1 – α = 0.95 and thus
P(Z > 1.645) = 0.05:
Critical region (CR)
The critical region, or rejection region R, is the set of values of the test statistic
for which the null hypothesis is rejected.
$\{z \mid z > z_{crit}\}$
When H0 is rejected, it does not necessarily mean that it is not true. It means
that according to the sample evidence available it appears not to be true.
Similarly when H0 is not rejected, it does not necessarily mean that it is true. It
means that there is not sufficient sample evidence to disprove H 0.
Critical values for tests based on the standard normal distribution can be found
from the selected percentiles listed at the bottom of the pages of the standard
normal table.
6.3 Test for the population mean (population variance known)
A summary of the steps to be followed in the testing procedure is shown below
(continuing onto the following page).
Test for μ when σ² is known
1. State the null and alternative hypotheses:
H 0 :µ=μ0
H 1 : µ ≠ μ0
or
H 0 :µ=μ0
H 1 : μ < μ0
or
H 0 : μ=μ 0
H 1 : μ > μ0
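2. Calculate the value of the test statistic: $z_{calc} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
3. State the level of significance α and determine the critical value(s) and
critical region.
(i) For a two-tailed test, the critical region is: $\{z \mid z < -z_{crit} \text{ or } z > +z_{crit}\}$
(ii) For a left-tailed test, the critical region is: $\{z \mid z < -z_{crit}\}$
(iii) For a right-tailed test, the critical region is: $\{z \mid z > +z_{crit}\}$
4. If $z_{calc}$ lies in the critical region, reject H0, otherwise do not reject H0.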
Example 1
A hardware store receives complaints that the mean content of the “1
kilogram” cement bags that are sold by them is less than 1 kilogram. A
random sample of 40 cement bags is selected from the shelves and the
mean is found to be 0.987 kilograms. From past experience the standard
deviation of the contents of these bags is known to be 0.025 kilograms. Test, at
the 5% level of significance, whether this complaint is justified.
Solution:
Step 1:
$H_0: \mu \ge 1$ (The complaint is not justified)
$H_1: \mu < 1$ (The complaint is justified)
Step 2:
n = 40, x̄ = 0.987, σ = 0.025, μ0 = 1 (given)
Test statistic: $z_{calc} = \frac{0.987 - 1}{0.025/\sqrt{40}} = -3.289$.
Step 3:
α = 0.05
Critical region: left-tailed test so critical value $= -z_{crit} = -1.645$
Step 4:
Since $z_{calc} < -z_{crit}$, that is $-3.289 < -1.645$, H0 is rejected.
Step 5:
Conclusion: Sample evidence suggests that there is less than 1 kilogram
of cement in the bags. The customers’ complaints are justified.
Example 2
A supermarket manager suspects that the machine filling “500 gram”
containers of coffee is over-filling them i.e. the actual contents of these
containers is more than 500 grams. A random sample of 30 of these
containers is selected from the shelves and the mean found to be 501.8
grams. From past experience the variance of the contents of these containers is
known to be 60 grams². Test at the 5% level of significance whether the
manager’s suspicion is justified.
Solution:
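Step 1:
$H_0: \mu \le 500$ (Suspicion is not justified)
$H_1: \mu > 500$ (Suspicion is justified)
Step 2:
n = 30, x̄ = 501.8, σ² = 60, μ0 = 500 (given)
Test statistic: $z_{calc} = \frac{501.8 - 500}{\sqrt{60/30}} = 1.273$.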
Step 3:
α = 0.05
Critical region: right-tailed test so critical value = z crit =1.645
Step 4:
Since z calc < z crit , that is 1.273 < 1.645, H0 is not rejected.
Step 5:
Conclusion: There is not sufficient sample evidence that the coffee machine is
over-filling the 500 gram coffee containers. The manager’s suspicion is not
justified.
Example 3
During a quality control exercise the manager of a factory that fills cans
of frozen shrimp wants to check whether the mean weights of the cans
conform to specifications i.e. the mean of these cans should be 600
grams as stated on the label of the can. He/she wants to guard against
either over or under filling the cans. A random sample of 50 of these
cans is selected and the mean found to be 595 grams. From past
experience the standard deviation of contents of these bags is known to
be 20 grams. Test, at the 5% level of significance, whether the weights
conform to specifications. Repeat the test at the 10% level of
significance.
Solution:
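Step 1:
$H_0: \mu = 600$ (Weights conform to specifications)
$H_1: \mu \ne 600$ (Weights do not conform to specifications)
Step 2:
n = 50, x̄ = 595, σ = 20, μ0 = 600 (given)
Test statistic: $z_{calc} = \frac{595 - 600}{20/\sqrt{50}} = -1.768$.
Step 3:
α = 0.05
Critical region: two-tailed test so critical values $= \pm z_{crit} = \pm 1.96$
Step 4:
Since $-z_{crit} < z_{calc} < +z_{crit}$
That is, $-1.96 < -1.768 < 1.96$, H0 is not rejected.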
Step 5:
Conclusion: Sample evidence suggests that the weights appear to
conform to specifications.
In such a case:
α = 0.10
Critical region: two-tailed test and critical values $= \pm z_{crit} = \pm 1.645$
Since $z_{calc} < -z_{crit}$, that is $-1.768 < -1.645$, H0 is rejected.
Thus, being less strict about controlling a type I error (changing α from 0.05 to
0.10) results in a different conclusion about H0 (reject instead of do not reject).
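The three worked examples (cement bags, coffee containers and shrimp cans) follow the same steps, which can be sketched in Python as below. The function `z_test` and its `tail` argument are illustrative names, not part of the notes; critical values come from `statistics.NormalDist` rather than the printed table.

```python
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n, alpha, tail):
    """Test for the mean with known sigma; returns (z_calc, reject_H0)."""
    z_calc = (sample_mean - mu0) / (sigma / sqrt(n))
    standard_normal = NormalDist()
    if tail == "left":
        # Critical region: z < -z_crit
        reject = z_calc < standard_normal.inv_cdf(alpha)
    elif tail == "right":
        # Critical region: z > +z_crit
        reject = z_calc > standard_normal.inv_cdf(1 - alpha)
    elif tail == "two":
        # Critical region: z < -z_crit or z > +z_crit, with alpha/2 in each tail
        reject = abs(z_calc) > standard_normal.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("tail must be 'left', 'right' or 'two'")
    return z_calc, reject

cement = z_test(0.987, 1, 0.025, 40, 0.05, "left")
coffee = z_test(501.8, 500, sqrt(60), 30, 0.05, "right")  # variance 60, so sigma = sqrt(60)
shrimp_5 = z_test(595, 600, 20, 50, 0.05, "two")
shrimp_10 = z_test(595, 600, 20, 50, 0.10, "two")

for name, (z_calc, reject) in [("cement", cement), ("coffee", coffee),
                               ("shrimp 5%", shrimp_5), ("shrimp 10%", shrimp_10)]:
    print(name, round(z_calc, 3), "reject H0" if reject else "do not reject H0")
```

Note that the coffee example gives the variance, so its square root is passed as sigma.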
6.4 Test for the population mean (population variance not known,
n < 30): t-test12
When performing the test for the population mean for the case where the
population variance is not known, the following modifications are made to the
procedure.
The t-distribution was first proposed in a paper by William Gosset in 1908 who
wrote the paper under the pseudonym “Student”. The t-distribution has the
following properties.
(Footnote 12: See Appendix A14.)
sample size increases, the distribution approaches a standard normal
distribution. For n > 30, the differences are negligible.
The mean is zero (like the standard normal distribution).
The distribution is symmetrical about the mean.
The variance is greater than one, but approaches one from above as the
sample size increases (σ² = 1 for the standard normal distribution).
Test for μ when σ² is not known, n < 30 (t-test)
1. State null and alternative hypotheses:
H0 : μ = μ0
H1 : μ ≠ μ0
or
H0 : μ = μ0
H1 : μ < μ0
or
H0 : μ = μ0
H1 : μ > μ0
2. Calculate the value of the test statistic: $t_{calc} = \frac{\bar{x} - \mu_0}{S/\sqrt{n}}$
3. State the level of significance α and determine the critical value(s) and
critical region (t crit is read from the t-distribution table with n − 1 degrees of freedom).
(i) For a two-tailed test, the critical region is: {t | t < −t crit or t > +t crit}
(ii) For a left-tailed test, the critical region is: {t | t < −t crit}
(iii) For a right-tailed test, the critical region is: {t | t > +t crit}
4. If t calc lies in the critical region, reject H0, otherwise do not reject H0.
Example 4
A paint manufacturer claims that the average drying time for a new paint is 2
hours (120 minutes). The drying times for 20 randomly selected cans of paint
were obtained. The results are shown below.¹³
(a) Test, at (i) the 5% and (ii) the 1% level of significance, whether the
population mean drying time is greater than 2 hours (120 minutes).
(b) Test, at the 5% level of significance, whether the population mean drying
time could be 2 hours (120 minutes).
Solution:
(a) Step 1:
H0 : μ = 120 (mean is 2 hours)
H1 : μ > 120 (mean is greater than 2 hours)
Step 2:
n = 20, μ0 = 120 (given), x̄ = 124.1, S = 9.65674 (calculated from the
data).
Test statistic: $t_{calc} = \frac{124.1 - 120}{9.65674/\sqrt{20}} = 1.899$
(i) Step 3:
α = 0.05
Critical region: right-tailed test.
From the t-distribution table with degrees of freedom ν = n − 1 = 19, t crit = 1.729
¹³ See Appendix pg. on how to conduct a t-test for the mean in Excel.
Step 4:
Since t calc >t crit
that is, 1.899 > 1.729 , H0 is rejected.
Step 5:
Conclusion: The mean drying time appears to be greater than 2 hours.
(ii) Step 3:
α = 0.01
Critical region: right-tailed test.
From the t-distribution table with degrees of freedom ν = n − 1 = 19,
t crit = 2.539
Step 4
Since t calc <t crit
that is, 1.899 < 2.539 , H0 is not rejected.
Step 5:
Conclusion: There is insufficient evidence that the mean drying time is greater than 2 hours.
Thus, being more strict about controlling a type I error (changing α from 0.05
to 0.01) results in a different conclusion about H0 (do not reject instead of
reject).
(b) Step 1:
H0 : μ = 120 (mean is 2 hours)
H1 : μ ≠ 120 (mean is not 2 hours)
Step 2:
n = 20, μ0 = 120 (given), x̄ = 124.1, S = 9.65674 (calculated from the
data).
Test statistic: $t_{calc} = \frac{124.1 - 120}{9.65674/\sqrt{20}} = 1.899$ (as calculated in part (a)).
Step 3:
α = 0.05
Critical region: two-tailed test.
From the t-distribution table with degrees of freedom = ν = n–1 =19,
t crit =± 2.093
Step 4:
Since −t crit ≤ t calc ≤+ t crit
that is, –2.093 <1.899 < 2.093, H0 is not rejected.
Step 5:
Conclusion: The mean drying time appears to be 2 hours.
Note:
Despite the fact that the same data were used in the above examples,
the conclusions were different. In the first test H0 was rejected, but in
the next 2 tests H0 was not rejected.
In the first test the probability of a type I error was set at 5%, while in
the second test this was changed to 1%. To achieve this, the critical value was
moved from 1.729 to 2.539, resulting in the test statistic value (1.899)
being less than (instead of greater than) the critical value.
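The three tests in Example 4 can be replayed in a few lines. This is a hedged Python sketch added for illustration: the helper names `t_statistic` and `in_critical_region` are hypothetical, and the critical values are simply copied from the t-table readings quoted in the solution (19 degrees of freedom) rather than computed.

```python
import math

def t_statistic(sample_mean, mu0, s, n):
    """One-sample t statistic: (sample mean - mu0) / (S / sqrt(n))."""
    return (sample_mean - mu0) / (s / math.sqrt(n))

def in_critical_region(t_calc, t_crit, tail):
    """True if t_calc lies in the critical region for the given kind of test."""
    if tail == "right":
        return t_calc > t_crit
    if tail == "left":
        return t_calc < -t_crit
    if tail == "two":
        return abs(t_calc) > t_crit
    raise ValueError("tail must be 'right', 'left' or 'two'")

t_calc = t_statistic(124.1, 120, 9.65674, 20)

# (label, table value of t_crit for 19 df, kind of test)
tests = [
    ("(a)(i)  right-tailed, alpha = 0.05", 1.729, "right"),
    ("(a)(ii) right-tailed, alpha = 0.01", 2.539, "right"),
    ("(b)     two-tailed,   alpha = 0.05", 2.093, "two"),
]
for label, t_crit, tail in tests:
    verdict = "reject H0" if in_critical_region(t_calc, t_crit, tail) else "do not reject H0"
    print(f"{label}: t_calc = {t_calc:.3f}, t_crit = {t_crit} -> {verdict}")
```

Only the first test rejects H0; the data and the test statistic are identical in all three.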
6.5 Test for the population proportion
The test for the population proportion (π) is based on the fact that the sample
proportion $p = X/n \sim N(\pi, \pi(1-\pi)/n)$ approximately, where n is the sample size and X the
number of items labeled “success” in the sample. From this result it follows
that $Z = \frac{p - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}} \sim N(0, 1)$, where π0 is the value of π under H0.
For this reason the critical value(s) and critical region are the same as those for
the test for the population mean (both based on the standard normal
distribution).
Test for the population proportion π
1. State the null and alternative hypotheses.
H 0 :π =π 0
H 1: π ≠ π0
or
H 0 :π =π 0
H1: π < π 0
or
H 0 :π =π 0
H1: π > π 0
2. Calculate the test statistic $z_{calc} = \frac{p - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}}$
3. State the level of significance α and determine the critical value(s) and
critical region.
(i) For a two-tailed test, the critical region is: {z | z < −z crit or z > +z crit}
(ii) For a left-tailed test, the critical region is: {z | z < −z crit}
(iii) For a right-tailed test, the critical region is: {z | z > +z crit}
4. If z calc lies in the critical region, reject H0, otherwise do not reject H0.
Example 5
A construction company suspects that the proportion of jobs they
complete behind schedule is 0.20 (20%). Of their 80 most recent jobs 22
were completed behind schedule. Test at the 5% level of significance
whether this information confirms their suspicion.
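Solution:
Step 1:
H0 : π = 0.20 (Suspicion is confirmed)
H1 : π ≠ 0.20 (Suspicion is not confirmed)
Step 2:
n = 80, x = 22 (given), p = 22/80 = 0.275, π0 = 0.20.
Test statistic: $z_{calc} = \frac{0.275 - 0.20}{\sqrt{0.20 \times 0.80/80}} = 1.677$.
Step 3:
α = 0.05
Critical region: two-tailed test, so the critical values are ± z crit = ± 1.96
Step 4:
Since −z crit < z calc < +z crit,
that is, −1.96 < 1.677 < 1.96, H0 is not rejected.
Step 5:
Conclusion: The sample evidence is consistent with the suspicion that 20% of
jobs are completed behind schedule.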
Example 6
During a marketing campaign for a new product 176 out of the 200
potential users of this product that were contacted indicated that they
would use it. Is this evidence that more than 85% of all the potential users will
actually use the product? Use α = 0.01.
Solution:
Step 1:
H0 : π ≤ 0.85 (At most 85% of all potential users will use the product)
H1 : π > 0.85 (More than 85% of all potential users will use the product)
Step 2:
n = 200, x = 176, π0 = 0.85 (given), p = 176/200 = 0.88.
Test statistic: $z_{calc} = \frac{0.88 - 0.85}{\sqrt{0.85 \times 0.15/200}} = 1.188$.
Step 3:
α = 0.01
Critical region: right-tailed test, so the critical value is + z crit = 2.326
Step 4:
Since z calc < z crit,
that is, 1.188 < 2.326, H0 is not rejected.
Step 5:
Conclusion: There is insufficient evidence that more than 85% of all potential
users will use the product.
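Examples 5 and 6 can be checked with a short script. This is a sketch under stated assumptions rather than part of the original notes: `z_proportion` and `z_critical` are hypothetical helper names, and the critical values come from the standard library's `statistics.NormalDist` instead of the printed normal table.

```python
import math
from statistics import NormalDist

def z_proportion(x, n, pi0):
    """z statistic for a sample proportion p = x/n tested against pi0."""
    p = x / n
    return (p - pi0) / math.sqrt(pi0 * (1 - pi0) / n)

def z_critical(alpha, two_tailed):
    """Positive critical value: alpha/2 per tail if two-tailed, else alpha in one tail."""
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)

# Example 5: two-tailed test of pi = 0.20 at alpha = 0.05.
z5 = z_proportion(22, 80, 0.20)
crit5 = z_critical(0.05, two_tailed=True)
reject5 = abs(z5) > crit5

# Example 6: right-tailed test of pi > 0.85 at alpha = 0.01.
z6 = z_proportion(176, 200, 0.85)
crit6 = z_critical(0.01, two_tailed=False)
reject6 = z6 > crit6

print(f"Example 5: z = {z5:.3f}, critical values = ±{crit5:.3f}, reject H0: {reject5}")
print(f"Example 6: z = {z6:.3f}, critical value = {crit6:.3f}, reject H0: {reject6}")
```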
6.6 Test for the difference between means for two independent
samples¹⁴
For small samples (both sample sizes n1,n2 < 30)
Examples
1. Are the mean salaries the same for males and females with the same
educational qualifications and work experience?
2. Do smokers and non-smokers have the same mortality rate?
3. Are the variances in drying times for two different types of paints
different?
4. Is a particular diet successful in reducing people’s weights?
When testing for the difference of means from 2 different populations labeled
1 and 2, the hypotheses are:
H0 : μ1 = μ2
H1 : μ1 ≠ μ2
or
H0 : μ1 = μ2
H1 : μ1 > μ2
or
H0 : μ1 = μ2
H1 : μ1 < μ2
¹⁴ See Appendix A15.
Notation
The following notation will be used in the description of the two-sample
tests.
Measure                                 Notation (population 1)   Notation (population 2)
sample size                             n1                        n2
sample                                  x1, x2, …, xn             x1, x2, …, xm
sample mean                             x̄1                        x̄2
sample variance (standard deviation)    S1² (S1)                  S2² (S2)
In the examples that follow, we will assume that the populations from which
the samples are drawn are Normally distributed, that the sample sizes are
small (n1, n2 < 30) and that the population variances σ1², σ2² are not known but
equal to σ². They may be replaced by their sample estimates S1², S2², and
σ² by the pooled estimate S².
Test for difference between two population means (small sample sizes,
population variances unknown but equal)
Step 1: State the null and alternative hypotheses.
Step 2: Calculate the test statistic: $t_{calc} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{S^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
with $S^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$
Step 3: State the level of significance α and determine the critical value(s)
and critical region (t crit is read from the t-distribution table with n1 + n2 − 2
degrees of freedom).
(i) For a two-tailed test, the critical region is: {t | t < −t crit or t > +t crit}
(ii) For a left-tailed test, the critical region is: {t | t < −t crit}
(iii) For a right-tailed test, the critical region is: {t | t > +t crit}
Step 4: If t calc lies in the critical region, reject H0, otherwise do not reject H0.
Example 7
A certain hospital has been getting complaints that the response to calls from
senior citizens is slower (takes longer time on average) than that to calls from
other patients. In order to test this claim, a pilot study was carried out. The
results are shown below.
Solution:
Label the “senior citizens” and “others” populations as 1 and 2 and their
population mean response times as μ1 and μ2 , respectively.
Step 1:
H0 : μ1 = μ2
H1 : μ1 > μ2
Step 2:
$S^2 = \frac{(17 \times 0.25^2) + (12 \times 0.21^2)}{29} = 0.0549$
Test statistic: $t_{calc} = \frac{5.6 - 5.3}{\sqrt{0.0549\left(\frac{1}{18} + \frac{1}{13}\right)}} = 3.518$
Step 3:
α = 0.01
Critical region: right-tailed test
From the t-distribution table with ν=n+m−2=18+13−2=29 degrees of
freedom, t crit =2.462
Step 4:
Since t calc >+t crit , that is, 3.518 > 2.462, H0 is rejected.
Step 5:
Conclusion: The claim is justified, i.e. the mean response time for senior citizens
is longer than that for others.
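The pooled-variance calculation in Example 7 can be sketched in Python as follows. The summary values (sample sizes 18 and 13, means 5.6 and 5.3, standard deviations 0.25 and 0.21) are read off the worked solution; the function name `pooled_t` is a hypothetical helper, and the critical value 2.462 is the table reading quoted in the solution rather than a computed quantity.

```python
import math

def pooled_t(mean1, s1, n1, mean2, s2, n2):
    """Return (pooled variance, t statistic, degrees of freedom) for two independent samples."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    t_calc = (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return pooled_var, t_calc, df

# Population 1 = senior citizens, population 2 = others.
pooled_var, t_calc, df = pooled_t(5.6, 0.25, 18, 5.3, 0.21, 13)

t_crit = 2.462  # right-tailed, alpha = 0.01, 29 degrees of freedom (from the t-table)
reject = t_calc > t_crit
print(f"S^2 = {pooled_var:.4f}, t_calc = {t_calc:.3f}, df = {df}, reject H0: {reject}")
```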
Tutorial
(a) A water faucet manufacturer announces that the mean flow rate of a
certain type of faucet is less than 2.5 gallons per minute.
(b) A cereal company advertises that the mean weight of the contents of its
1kg size cereal boxes is more than 1kg.
(c) A consumer analyst reports that the mean life of a certain type of auto-
mobile battery is not 74 months.
5. A paint manufacturer claims that the average drying time for his
new latex paint is two hours. To test this claim, drying times are
obtained for n = 20 randomly selected cans of paint. The results are
displayed below in minutes.
123 109 115 121 130
127 106 120 116 136
131 128 139 110 133
122 133 119 135 109
If we assume that the drying times are Normally distributed, do the sample
data suggest that the mean drying time is actually greater than the
manufacturer’s claim of 120 minutes? Use α = 0.05. (The sample mean
and standard deviation of the data are given by x̄ = 123.1 and s = 10).
6. An industrial company claims that the mean pH level of the water
in a nearby river is 6.8. You randomly select 19 water samples and
measure the pH of each. The sample mean and standard deviation are
6.7 and 0.24, respectively. Is there enough evidence to reject the
company’s claim at α = 0.05? Assume the population is normally
distributed.
8. A medical researcher claims that less than 20% of the adults in RSA
are not allergic to any medication. In a random sample of 100 adults,
15% say they are not allergic to any medication. At a 0.01 level of
significance, is there enough evidence to support the researcher’s
claim?
9. Harper’s index claims that 23% of people in the United States are in
favour of outlawing cigarettes. You decide to test this claim and ask a
random sample of 200 people in the United States whether they are in
favour of outlawing cigarettes. Of the 200 people, 27% are in favour.
Using α = 0.05, is there enough evidence to reject the claim?
10. The U.S. National Centre for Health Statistics gathers and publishes
data on the daily intake of selected nutrients by race and income level.
Suppose we are considering protein intake and want to compare the
mean daily intake of people with incomes that are above the poverty
level with those of people with incomes below the poverty level. The
data in Table A give the protein intake, in grams, over a 24-hour period
for people with incomes above and below the poverty level.
TABLE A
Above poverty level      Below poverty level
86.0   69.0              51.4   49.7   72.0
59.7   80.2              76.7   65.8   55.0
68.6   78.1              73.7   62.1   79.7
98.6   69.8              66.2   75.8   65.4
87.7   77.2              65.5   62.0   73.3
x̄1 = 77.49, s1 = 11.34   x̄2 = 66.29, s2 = 9.17
At the 5% significance level, do the data suggest that people with incomes
above the poverty level have a greater mean daily intake of protein than
those with incomes below the poverty level? Assume that the daily
intake of protein for both populations is normally distributed and that the
variances for the two populations are the same.
11. Two different hardening processes, (1) saltwater quenching and (2) oil
quenching, are used on samples of a particular type of metal alloy. The
results are shown here. Assume that hardness is normally distributed
and that the population variances are equal.
b. Based on the confidence interval, do you think that the mean
hardening times of the two processes are the same?
c. To confirm/check your answer in part (b) test the hypothesis
that the mean hardness for the saltwater quenching process
equals the mean hardness for the oil quenching process. Use
a 0.05 level of significance and assume equal variances.
12. Two methods of packaging frozen shrimps yield about the same
average weight per package. However, method 2 is somewhat faster
and a particular company that packages shrimps would like to use it
unless the variance of method 2 is shown to be larger than that of
method 1 at the 5% level of significance. Two samples of 51 packages,
one packed using the first method and one using the second method, are
examined. The sample standard deviations are s1 = 4.2 grams for method
1 and s2 = 5.8 grams for method 2. What decision should be made?
CHAPTER 7
CHI-SQUARE TESTS
7.1 Introduction
Chi-square ( χ 2 ) tests are used to test hypotheses on patterns of outcomes,
which are based on frequency counts, for categorical random variables.
The two chi-square tests that will be covered in this chapter are:
1. The goodness-of-fit test (Section 7.4)
2. The test of independence (Section 7.5)
7.3 The test statistic
The 𝜒2 test statistic can be computed as follows:
$\chi^2 = \sum \frac{O^2}{E} - n$
OR
$\chi^2 = \sum \frac{(O - E)^2}{E}$
where
O = observed frequency, E = expected frequency, n = sample size
For χ² tests, the rejection region lies in the right tail of the curve, to the
right of the critical value χ²(df; α).
Rejection Rule: If the calculated test statistic (χ²calc) lies in the rejection region,
that is, if χ²calc > χ²crit, reject H0 in favour of H1. χ²crit may be found using the χ² tables for
the given level of significance (α) and degrees of freedom = k − 1 (where k = the
number of categories of the categorical variable).
7.4 Goodness-of-Fit Test
In this type of hypothesis test, one determines whether the data "fit" a
particular distribution or not. For example, one may suspect that the
data fit a binomial distribution. A χ² goodness-of-fit test may be used to
determine if there is a fit or not. The null and the alternative hypotheses for
this test may be written in sentences or may be stated as equations or
inequalities.
Example 1
The following table gives the age distribution of a sample of 100 people
arrested for drunk driving:
Solution:
Step 1:
𝐻0: The proportion of people arrested for drunk driving is the same for all age
groups
𝐻1: The proportion is not the same for all age groups
Step 2:
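Under H0 each of the 5 age groups has the same expected frequency, E = 100/5 = 20.
Age group   O     E     O²/E
16-20       25    20    31.25
21-25       32    20    51.2
26-30       19    20    18.05
31-35       16    20    12.8
36-40       8     20    3.2
Total       100   100   116.5
∴ test statistic: $\chi^2_{calc} = \sum \frac{O^2}{E} - n = 116.5 - 100 = 16.5$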
Step 3:
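α = 0.01
Step 4:
With df = k − 1 = 5 − 1 = 4, χ²crit = 13.277 (the area to the right of χ²crit is α).
Since χ²calc = 16.5 > 13.277, H0 is rejected.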
Step 5:
Sample evidence suggests that the proportion of arrests is not the same for all
age groups.
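The goodness-of-fit calculation above can be verified with a short script. This Python sketch is an illustration only: the variable names are assumptions, and the critical value 13.277 is the table reading quoted in the solution (df = 4, α = 0.01). It also shows that the two forms of the test statistic agree.

```python
# Observed arrests per age group (16-20, 21-25, 26-30, 31-35, 36-40).
observed = [25, 32, 19, 16, 8]
n = sum(observed)
k = len(observed)

# Under H0 every age group has the same expected frequency.
expected = [n / k] * k

# Shortcut form: sum(O^2 / E) - n
chi2_short = sum(o * o / e for o, e in zip(observed, expected)) - n
# Definitional form: sum((O - E)^2 / E)
chi2_long = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

chi2_crit = 13.277  # chi-square table value for df = k - 1 = 4, alpha = 0.01
reject = chi2_short > chi2_crit
print(f"chi2 (shortcut) = {chi2_short:.2f}, chi2 (definition) = {chi2_long:.2f}, reject H0: {reject}")
```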
7.5 Test of Independence
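Degrees of freedom: df = (r − 1) × (c − 1), where r is the number of rows and c
the number of columns of the contingency table.
Test statistic:
$\chi^2 = \sum \frac{O^2}{E} - n$
OR
$\chi^2 = \sum \frac{(O - E)^2}{E}$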
Example
A random sample of 90 adults is classified according to gender and the
number of hours they watch television during a week:
                  Male   Female
Under 25 hours    27     19
Over 25 hours     15     29
Use a 0.01 level of significance and test the hypothesis that the time spent
watching television is independent of whether the viewer is male or
female.
Solution:
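Step 1:
H0 : The time spent watching television is independent of gender
H1 : The time spent watching television is not independent of gender
Step 2:
Next, we need to calculate the test statistic. But in order to do so, we need to
compute the expected frequencies for each cell. This is done using the
formula: E = (row total × column total) / n
                  Male   Female   Total
Under 25 hours    27     19       46
Over 25 hours     15     29       44
Total             42     48       90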
Cell 1: E=(46 × 42)/90=21.47
Cell 2: E=(46 × 48)/90=24.53
Cell 3: E=(44 × 42)/90=20.53
Cell 4: E=(44 × 48)/90=23.47
Thus,
Observed (O)   Expected (E)   O²/E
27             21.47          33.95
19             24.53          14.72
15             20.53          10.96
29             23.47          35.83
90             90             95.46
∴ test statistic: $\chi^2_{calc} = \sum \frac{O^2}{E} - n = 95.46 - 90 = 5.46$
Step 3:
Determine the critical value: χ²crit
α = 0.01
With df = (r − 1) × (c − 1) = (2 − 1) × (2 − 1) = 1, χ²crit = 6.635
Step 4:
Since χ²calc < χ²crit,
that is, 5.46 < 6.635, do not reject H0 at the 1% level of significance.
Step 5:
Conclusion: there is insufficient evidence to suggest that the time spent
watching TV is dependent on gender.
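The expected frequencies and the test statistic for this example can be reproduced as follows. This is a minimal Python sketch with illustrative variable names, and the critical value 6.635 is taken from the χ² table (df = 1, α = 0.01) rather than computed. Because the script does not round the expected frequencies to two decimals, its statistic comes out at about 5.47 instead of the hand-calculated 5.46; the decision is unaffected.

```python
# Rows: under 25 hours, over 25 hours. Columns: male, female.
table = [[27, 19],
         [15, 29]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Expected frequency for each cell: (row total * column total) / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2_calc = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(table, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(table) - 1) * (len(table[0]) - 1)

chi2_crit = 6.635  # chi-square table value for df = 1, alpha = 0.01
reject = chi2_calc > chi2_crit
print(f"expected = {[[round(e, 2) for e in row] for row in expected]}")
print(f"chi2_calc = {chi2_calc:.2f}, df = {df}, reject H0: {reject}")
```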
Tutorial
1. What type of data would you use for a χ 2 test?
a. Ratio
b. Categorical
c. Interval
d. Ordinal
5. The following conclusion at 10% level of significance is true:
a. Reject the null hypothesis and conclude that the names are
equally popular
b. Reject the null hypothesis and conclude that the names are not
equally popular
c. Accept the null hypothesis and conclude that the names are
equally popular
d. Accept the null hypothesis and conclude that the names are not
equally popular
8. The critical value for a χ²-test of a contingency table with 4 columns and
6 rows at α = 0.05 is:
a. 36.415
b. 28.869
c. 31.410
d. 24.996
d. The total area under the curve is 1
Region   Reaction
         Effective   Not Effective
East     274         126
South    203         197
West     291         109
North    257         143
c. Z-test
d. T-test
North West 3 10 13
Gauteng 12 36 48
Mpumalanga 2 10 12
Limpopo 3 13 16
South Africa 40 128 167
The figures in the table were rounded-off to the nearest 100 000 from the
results of the 2016 Community Survey for ease of calculation. These results
illustrate the distribution of households, in the nine provinces, amongst
RDP/government subsidised dwellings in South Africa.
19. At the 5% level of significance, the χ 2 critical value for the test is:
e. 15.507
f. 26.296
g. 28.869
h. 16.919
19.If the test statistic value is 2.89, do we reject or fail to reject H 0 at the 5%
level of significance?
a. Fail to reject H 0
b. Reject H 0
c. Fail to accept H 0
d. Cannot be determined
21. The χ² goodness-of-fit test has 23 categories. The critical value at α =
0.05 is approximately:
a. 35.172
b. 33.924
c. 32.813
d. 36.415
that:
a. H 0 will be accepted at α = 0.05
b. H 0 will be accepted at α = 0.005
c. H 0 will be rejected at α = 0.05
d. H 0 will be rejected at α = 0.005
3. Check Analysis ToolPak and click on OK.
4. On the Data tab, in the Analysis group, you can now click on Data Analysis.
A.2 Creating a random sample
The Excel software package has a facility with which a random sample of a
specific size can be selected from a given population.
Below is the population data of size 10:
12 15 16 18 20 19 14 11 16 13
Select a random sample of size 5 from this population.
1. Input the population data
2. On the Data tab, in the Analysis group, click Data Analysis.
4. Click on the Input Range box and select the range A2:A11.
5. Click on the Random button.
6. Type in 5 in the Number of Samples box
7. Click in the Output Range box and select cell B2.
8. Click OK.
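For readers who prefer a script to the Excel dialog, the same task can be sketched in Python. This is an illustrative alternative and not the Excel procedure itself: `random.Random.sample` draws without replacement, whereas these notes do not state how Excel's Sampling tool draws, and the seed value 2024 is an arbitrary assumption chosen only to make the run repeatable.

```python
import random

# Population data of size 10, as listed above.
population = [12, 15, 16, 18, 20, 19, 14, 11, 16, 13]

# A seeded generator makes the selection repeatable; drop the seed for a fresh sample each run.
rng = random.Random(2024)

# Draw a simple random sample of size 5 (without replacement).
sample = rng.sample(population, k=5)
print(sample)
```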
A.3 Drawing a line graph
1. Input the Year and Thando’s weight.
2. Highlight the data and click on the Insert tab and select the scatter
plot with straight lines and markers.
3. Click on the green plus sign, tick the box for Axis Titles and write in
the titles of the axis.
4. Right click on a year, select Format Axis.
Set the Minimum value to 2013 and the Maximum value to 2019.
5. Final output appears as follows:
[Figure: line graph of Thando’s weight (vertical axis, 65 to 70) against Year (horizontal axis, 2013 to 2019)]
5. Label the axes.
6. The completed simple bar graph is as follows:
A.5 Constructing a Component Bar Chart
Given the following mid-year population estimates for South Africa by
population group and sex, 2017:
Population group Male Female
Black African 22 311 400 23 345 000
Coloured 2 403 400 2 559 500
Indian/Asian 719 300 689 800
White 2 186 500 2 307 100
A.6 Constructing a Multiple (Component) Bar Chart
Given the following mid-year population estimates for South Africa by
population group and sex, 2017:
Plumbing 2
Electrical contractors 8
Shopfitting 1
Other building installation 8
Painting and decorating 1
Other building completion 8
Renting of construction or demolition equipment
with operators 3
A.8 Constructing a Histogram
This example teaches you how to create a histogram in Excel.
1. First, enter the data and the bin numbers (upper levels).
4. Select the input range (the cost of daily commute values).
5. Click in the Bin Range box and select the bin range.
6. Click the Output Range option button, click in the Output Range box and
select a cell in which you want the output to appear.
7. Check Chart Output.
8. Click OK.
9. Click on Quick Analysis and choose Chart and then Clustered
9. Click on the More value in the table and delete.
10. Properly label your bins.
11. To remove the space between the bars, right click a bar, click Format Data
Series and change the Gap Width to 0%.
12. To add borders, right click a bar, click Format Data Series, click the Fill &
Line icon, click Border and select a color.
13. To add the data values above each bar, right click a bar, click Add Data
Labels → Add Data Labels
Result:
A.9 Constructing a Frequency Polygon
1. Input the Midpoint and frequency values.
2. Highlight the data and click on the Insert tab and select the 2D line
graph.
3. Click on the green plus sign, tick the box for Axis Titles and write in the
titles of the axes.
4. The final output appears as follows:
A.10 Constructing a “Less than” ogive
1. Input the upper class limits and the cumulative frequency values.
2. Highlight the data and click the Insert tab and then click on Scatter with
Straight Lines and Markers.
3. Click on the green plus sign and tick Axis, Axis Titles, Chart Title and data
Labels.
4. Right click on the horizontal axis and click on Format Axis.
5. Set the Minimum value to 40 and the Maximum value to 70.
To generate descriptive statistics for these scores, execute the following steps.
1. On the Data tab, in the Analysis group, click Data Analysis.
2. Select Descriptive Statistics and click OK.
6. Click OK.
Result:
A.12 Drawing a Scatter plot
1. Input the data for stock A and stock B given in the notes.
2. Highlight the data for stock A then click the Insert tab and choose
Scatter:
3. Highlight the data for stock B then click the Insert tab and choose
Scatter:
4. Click on the green plus sign and add axes titles, chart title and data
labels for both scatter plots.
5. The scatter plots for stock A and B are as follows:
2. Before we begin the analysis, we can create a scatter plot of the
variables shoe size (x) and height (y) and fit a trend line to the data as
follows:
From the above scatter diagram and linear trend line, it would seem that
height and shoe size have a positive linear correlation.
3. Select the Y Range. This is the predicted variable (also called dependent
variable).
4. Select the X Range. These are the explanatory variables (also called
independent variables). These columns must be adjacent to each other.
5. Check Labels.
6. Click in the Output Range box and select whichever cell you want the output
to appear in.
8. Click OK.
R Square
R Square equals 0.79, which is an average fit. Approximately 79% of the
variation in height is explained by the independent variable shoe size. The
closer R Square is to 1, the better the regression line fits the data.
Coefficients
The regression line is: ŷ = 137.03 + 4.14(shoe size). In other words, for each
unit increase in shoe size, height increases by 4.14 centimetres.
You can also use these coefficients to do a forecast. For example, if shoe size
equals 8, a person’s expected height = 137.03 + 4.14(8) = 170.15 centimetres.
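The forecast above amounts to evaluating the fitted line at a chosen shoe size. The following Python sketch illustrates this; the coefficients are the ones quoted from the regression output, while the function name `predict_height` is a hypothetical helper introduced for the example.

```python
# Coefficients quoted from the regression output above.
INTERCEPT = 137.03  # centimetres
SLOPE = 4.14        # centimetres of height per unit of shoe size

def predict_height(shoe_size: float) -> float:
    """Expected height (cm) for a given shoe size, using the fitted regression line."""
    return INTERCEPT + SLOPE * shoe_size

print(f"Shoe size 8 -> expected height {predict_height(8):.2f} cm")
```

Forecasts are most trustworthy for shoe sizes within the range of the data used to fit the line.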
11.Delete the Dummy variable column
12.Alter the heading to read: t-Test: Mean
Output is as follows:
The value of the test statistic is tcalc = 1.899 (3 decimal places). From the output
table, the one-tail probability P(T<=t) is 0.036 (appropriate for a one-tailed test
such as this). This probability is known as the p-value (the probability of getting
a t-value more extreme than the test statistic when H0 is true). When testing at
the 5% level of significance, a p-value below 0.05 will cause the null hypothesis
to be rejected.
New Old
13 12
17 8
19 6
11 16
20 12
15 14
18 10
9 18
12 4
16 11
The value of the test statistic is tcalc = 2.177 (3 decimal places). From the output
table, the two-tail probability P(T<=t) is 0.043 (appropriate for a two-tailed test
such as this). This probability is known as the p-value (the probability of getting
a t-value more extreme than the test statistic when H0 is true). When testing at
the 5% level of significance, a p-value below 0.05 will cause the null hypothesis
to be rejected.
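The test statistic quoted above can be recomputed from the New and Old columns. The sketch below is an addition to these notes and rests on an assumption: that the Excel output came from a two-sample t-test for independent samples with equal variances (the steps that produced it are not shown here). The critical value 2.101 is the standard t-table reading for a two-tailed test at α = 0.05 with 18 degrees of freedom.

```python
import math
from statistics import mean, variance

new = [13, 17, 19, 11, 20, 15, 18, 9, 12, 16]
old = [12, 8, 6, 16, 12, 14, 10, 18, 4, 11]
n1, n2 = len(new), len(old)

# Pooled variance of the two samples (statistics.variance uses the n - 1 divisor).
pooled_var = ((n1 - 1) * variance(new) + (n2 - 1) * variance(old)) / (n1 + n2 - 2)

# Two-sample t statistic for the difference between the means.
t_calc = (mean(new) - mean(old)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

t_crit = 2.101  # two-tailed, alpha = 0.05, df = n1 + n2 - 2 = 18 (from the t-table)
reject = abs(t_calc) > t_crit
print(f"t_calc = {t_calc:.3f}, t_crit = ±{t_crit}, reject H0: {reject}")
```

Rejecting H0 here agrees with the p-value rule, since 0.043 is below 0.05.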