Study Notes

Chapter 1 introduces key terminology in statistics, including definitions of data, populations, samples, and various measurement scales. It discusses the importance of sampling methods, such as probability sampling and convenience sampling, and highlights the significance of ensuring representativeness in samples to avoid bias. The chapter also distinguishes between parameters and statistics, emphasizing their roles in population and sample analysis, respectively.

Uploaded by

dastsam624
Copyright
© All Rights Reserved
CHAPTER 1

TERMINOLOGY

1.1 Definitions

Data (or a data set) is a set of values collected or obtained when gathering
information on some issue of interest.

Examples

1. The monthly sales of a certain vehicle collected over a period.
2. The number of passengers using a certain airline on various routes.
3. Rating (on a scale from 1 to 5) of a new product by customers.
4. The yields of a certain crop obtained after applying different types of
fertilizer.

Statistics is the collection of methods for planning experiments, obtaining
data, and then organizing, summarizing, presenting, analyzing and interpreting
the data and drawing conclusions from it.

Statistics in the above sense refers to the methodology used in drawing
meaningful information from a data set. This use of the term should not be
confused with statistics (referring to a set of numerical values) or statistics
(referring to measures of description obtained from a data set).

Descriptive statistics is the collection, organization, summarization and
presentation of data. This will be discussed further in chapter 2.

A population refers to all subjects possessing a common characteristic that is
being studied.

Examples
1. The population of people inhabiting a certain country.
2. The collection of all cars of a certain type manufactured during a
particular month.
3. All patients in a certain area suffering from AIDS.
4. Exam marks obtained by all students studying a certain statistics course.

A census is a study where every member (element) of the population is
included.

Examples
1. Study of the entire population carried out by the government every 10
years.
2. Special investigations e.g. tax study commissioned by a government.
3. Any study of all the individuals/elements in a population.

A census is usually very costly and time consuming. It is therefore not carried
out very often. A study of a population is usually confined to a subgroup of the
population.

A sample is a subgroup or subset of the population.

The number of values in the sample (sample size) is denoted by n. The number
of values in the population (population size) is denoted by N.

Statistical inference involves generalizing from samples to populations and
expressing the conclusions in the language of probability (chance).

A variable is a characteristic or attribute that can assume different values for
different subjects in the population or sample.

Discrete variables are variables that can assume a finite or countable number
of possible values. Such variables are usually obtained by counting.

Examples
1. The number of cars parked in a parking lot.
2. The number of students attending a statistics lecture.
3. A person’s response (agree, not agree) to a statement. A one (1) is
recorded when the person agrees with the statement, a zero (0) is
recorded when a person does not agree.

Continuous variables are variables that can assume any value within some
interval, so that the number of possible values is infinite and cannot be
counted. Such variables are usually obtained by measurement.

Examples
1. The body temperature of a person.
2. The weight of a person.
3. The height of a tree.
4. The contents of a bottle of cool drink.

1.2 Measurement scales


Qualitative variables are variables that assume non-numerical values.

Examples
1. The course of study at university (B.Com, B.Eng, BA etc.)
2. The grade (A, B, C, D or E) obtained in an examination.

Nominal scale is a level of measurement which classifies data into categories in
which no order or ranking can be imposed on the data.

A variable can be treated as nominal when its values represent categories with
no intrinsic ranking, for example the department of the company in which an
employee works. Other examples of nominal variables include region, postal
code and religious affiliation.

Ordinal scale is a level of measurement which classifies data into categories
that can be ordered or ranked. Differences between the ranks are not
meaningful.

A variable can be treated as ordinal when its values represent categories with
some intrinsic order or ranking.

Examples

1. Levels of service satisfaction from very dissatisfied to very satisfied.
2. Attitude scores representing degree of satisfaction or confidence and
preference rating scores (low, medium or high).
3. Likert scale responses to statements (strongly agree, agree, neutral,
disagree, strongly disagree).

Quantitative variables are variables which assume numerical values.

Examples
Discrete and continuous variable examples given above.

Interval scale is a level of measurement which classifies data that can be
ordered and ranked and where differences are meaningful. However, there is
no meaningful zero and ratios are meaningless.

Examples
1. The difference between a temperature of 100 degrees and 90 degrees is
the same difference as that between 90 degrees and 80 degrees. Taking
ratios in such a case does not make sense.
2. When referring to dates (years) or temperatures measured (degrees
Fahrenheit or Celsius) there is no natural zero point.

Ratio scale is a level of measurement where differences and ratios are
meaningful and there is a natural zero. This is the “highest” level of
measurement in terms of possible operations that can be performed on the
data.

Examples
Variables like height, weight, mark (in test) and speed are ratio variables. These
variables have a natural zero and ratios make sense when doing calculations
e.g. a weight of 80 kilograms is twice as heavy as one of 40 kilograms.

Summary of the 4 measurement scales:

Measurement scale   Examples                      Meaningful calculations
Nominal             Types of music                Put into categories
                    University faculties
                    Vehicle makes
Ordinal             Motion picture ratings:       Put into categories
                    G - General audiences         Put into order
                    PG - Parental guidance
                    PG-13 - Parents cautioned
                    R - Restricted
                    NC-17 - No under 17
Interval            Years: 2009, 2010, 2011       Put into categories
                    Months: 1, 2, . . . , 12      Put into order
                                                  Differences between values
                                                  are meaningful
Ratio               Rainfall                      Put into categories
                    Humidity                      Put into order
                    Income                        Differences between values
                                                  are meaningful
                                                  Ratios are meaningful

An experiment is the process of observing some phenomenon that occurs.
An experiment can be observational or designed.

1. A designed experiment can be controlled to a certain extent by the
experimenter. Consider a study of 4 fuel additives on the reduction in
oxides of nitrogen. You may have 4 drivers and 4 cars at your disposal.
You are not particularly interested in any effects of particular cars or
drivers on the resultant oxide reduction. However, you do not want the
results for the fuel additives to be influenced by the driver or car. An
appropriate design of the experiment (way of performing the
experiment) will allow you to estimate effects of all factors of interest
without these outside factors influencing the results.

2. An observational study is not controlled by the experimenter. The
characteristic of interest is simply observed and the results recorded. For
example:
2.1 Collecting data that compares reckless driving of female and male
drivers.
2.2 Collecting data on smoking and lung cancer.

A parameter is a characteristic or measure of description obtained from a
population.

Examples
1. Mean (average) age of all employees working at a certain company.
2. The proportion of registered female voters in a certain country.

A statistic is a characteristic or measure of description obtained from a sample.

Examples
1. The mean (average) monthly salary of 50 selected employees in a certain
government department.
2. The proportion of smokers in a sample of 60 university students.

1.3 Sampling methods

When selecting a sample, the main objective is to ensure that it is as
representative as possible of the population from which it is drawn. When a
sample fails to achieve this objective, it is said to be biased.

Sampling frame (synonyms: "sample frame", "survey frame") is the actual set
of units from which a sample is drawn.

Example
Consider a survey aimed at establishing the number of potential customers for
a new service in a certain city. The research team has drawn 1000 numbers at
random from a telephone directory for the city, made 200 calls each day from
Monday to Friday from 8am to 5pm and asked some questions.

In this example, the population of interest is all the inhabitants in the city. The
sampling frame includes only those city dwellers that satisfy all the following
conditions:

1. They have a telephone.
2. The telephone number is included in the directory.
3. They are likely to be at home from 8am to 5pm from Monday to Friday.
4. They are not people who refuse to answer telephone surveys.

The sampling frame in this case definitely differs from the population. For
example, it under-represents people who have no telephone (e.g. the poorest),
who have an unlisted number, who were not at home at the time of the calls
(e.g. employed people), or who don't like to participate in telephone
interviews (e.g. busier and more active people). Such differences between the
sampling frame and the population of interest are a main cause of bias when
drawing conclusions based on the sample.

Probability samples are drawn according to the laws of chance. These include
simple random sampling, systematic sampling and stratified random sampling.

In simple random sampling each sample of a given size that can be drawn will
have the same chance of being drawn. Most of the theory in statistical
inference is based on random sampling being used.

Examples
1. The 6 winning numbers (drawn from 49 numbers) in a Lotto draw. Each
potential sample of 6 winning numbers has the same chance of being
drawn.

2. Each name in a telephone directory could be numbered sequentially. If
the sample size was to include 2 000 people, then 2 000 numbers could
be randomly generated by computer or numbers could be picked out of
a hat. These numbers could then be matched to names in the telephone
directory, thereby providing a list of 2 000 people.

A random sample can be selected by using a table of random numbers.

Example

Suppose the first 6 random numbers in the table of random numbers are:
10480, 22368, 24130, 42167, 37570, 77921.
Use these numbers to select the 6 winning numbers in a Lotto draw.

The 49 numbers from which the draw is made all involve 2 digits, i.e.
01, 02, . . . , 49.
Putting the above numbers from the table of random numbers next to each
other in a string of digits gives: 10 48 02 23 68 24 13 04 21 67 37 57 07 79 21.

The winning numbers can be selected by taking pairs of digits between 01 and
49 (discarding any pairs outside this range and any repeats), working either
from left to right or from right to left in the above string.

By working from left to right the winning numbers are: 10, 48, 2, 23, 24 and
13.
By working from right to left the winning numbers are: 21, 7, 37, 4, 13 and
24 (the second 21 is a repeat and is discarded).
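
The selection procedure above can be sketched in Python. This is only an illustration of the pairing-and-discarding rule described in the text; the function name `pick_from_digits` and its parameters are assumptions made for this sketch and are not part of the notes.

```python
def pick_from_digits(random_numbers, how_many=6, low=1, high=49, reverse=False):
    """Pick distinct two-digit values in [low, high] from random-number table entries."""
    # Write the table entries next to each other as one string of digits.
    digits = "".join(str(n) for n in random_numbers)
    # Split the string into consecutive pairs of digits: 10, 48, 02, ...
    pairs = [int(digits[i:i + 2]) for i in range(0, len(digits) - 1, 2)]
    if reverse:
        pairs.reverse()  # read the same pairs from right to left
    chosen = []
    for value in pairs:
        # Discard values outside the range and values already chosen.
        if low <= value <= high and value not in chosen:
            chosen.append(value)
        if len(chosen) == how_many:
            break
    return chosen


table = [10480, 22368, 24130, 42167, 37570, 77921]
left_to_right = pick_from_digits(table)
right_to_left = pick_from_digits(table, reverse=True)
print(left_to_right)
print(right_to_left)
```

Reading from left to right reproduces the six numbers 10, 48, 2, 23, 24 and 13 given above.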

The advantage of simple random sampling is that it is simple and easy to apply
when small populations are involved. However, because every person or item
in a population has to be listed before the corresponding random numbers can
be read, this method is very cumbersome to use for large populations and
cannot be used if no list of the population items is available. It can also be very
time consuming to try and locate every person included in the sample. There is
also a possibility that some of the persons in the sample cannot be contacted
at all.

Systematic sampling is a sampling method in which data is obtained by
selecting every kth object, where k is approximately $N/n$.

Examples

1. A manufacturer might decide to select every 20th item on a production
line to test for defects and quality. This technique requires the first item
to be selected at random as a starting point for testing and, thereafter,
every 20th item is chosen.

2. A market researcher might select every 10th person who enters a
particular store, after selecting a person at random as a starting point; or
interview occupants of every 5th house in a street, after selecting a
house at random as a starting point.

3. A systematic sample of 500 students is to be selected from a university
with an enrolled population of 10 000. In this case the population size
N = 10 000 and the sample size n = 500. Then every $\frac{10000}{500} = 20$th student
will be included in the sample. The first student in the sample can be
randomly selected from an alphabetical list of students and thereafter
every 20th student can be selected until 500 names have been obtained.
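
A minimal Python sketch of example 3 follows. The function name `systematic_sample`, the `seed` parameter and the use of student numbers 1 to 10 000 as the list are assumptions made purely for illustration.

```python
import random


def systematic_sample(population, n, seed=None):
    """Select every k-th element after a random start, where k = N // n."""
    N = len(population)
    k = N // n                    # sampling interval, approximately N / n
    rng = random.Random(seed)
    start = rng.randrange(k)      # random starting position among the first k elements
    return population[start::k][:n]


students = list(range(1, 10001))  # hypothetical student numbers 1 to 10 000
sample = systematic_sample(students, 500, seed=1)
print(len(sample), sample[:3])
```

With N = 10 000 and n = 500 the interval is k = 20, so consecutive selected students are always 20 positions apart in the list.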

Stratified random sampling involves sampling in which the population is
divided into groups (called strata) according to some characteristic. Each of
these strata is then sampled using random sampling.

A general problem with random sampling is that you could, by chance, miss
out a particular group in the sample. However, if you subdivide the population
into groups, and sample from each group, you can make sure that every group
is represented in the sample. Some examples of strata commonly used are those
according to province, age and gender. Other strata may be according to
religion, academic ability or marital status.

Example
In a study investigating the expenditure pattern of consumers, they were
divided into low, medium and high income groups.

Income group   Percentage of population
low            40
medium         45
high           15

A stratified sample of 500 consumers is to be selected for this study.

When sampling is proportional to size (an income group comprises the same
percentage of the sample as of the population) the sample sizes for the strata
should be calculated as follows.
low: $\frac{40 \times 500}{100} = 200$; medium: $\frac{45 \times 500}{100} = 225$; high: $\frac{15 \times 500}{100} = 75$
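
The proportional allocation above can be checked with a short Python sketch. The function name `proportional_allocation` is an assumption for illustration only.

```python
def proportional_allocation(percentages, sample_size):
    """Give each stratum the same share of the sample as it has of the population."""
    return {group: pct * sample_size / 100 for group, pct in percentages.items()}


income_groups = {"low": 40, "medium": 45, "high": 15}
allocation = proportional_allocation(income_groups, 500)
print(allocation)
```

When the products are not whole numbers, the stratum sample sizes would have to be rounded so that they still add up to the required sample size.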

Convenience sampling is sampling in which data that is readily available is
used, e.g. surveys done on the internet. Quota sampling is one form of
convenience sampling.

Quota sampling is performed in 4 stages.

(a) Stage 1: Decide which characteristics of the elements/individuals in the
population to be sampled are of importance.
(b) Stage 2: Decide on the categories to be sampled from. These categories
are determined by cross-classification according to the characteristics
chosen at stage 1.
(c) Stage 3: Decide on the overall number (quota) and the numbers
(sub-quotas) to be sampled from each of the categories specified in stage 2.
(d) Stage 4: Collect the information required until all the numbers
(quotas) are obtained.

Example
A company is marketing a new product and needs to know how potential
customers might react to the product.

Stage 1: It is decided that age (the 3 groups under 20, 20-40, over 40)
and gender (male, female) are the characteristics that will determine the
sample.

Stage 2: The 6 categories to be sampled from are (male under 20),
(male 20-40), (male over 40), (female under 20), (female 20-40) and
(female over 40).

Stage 3: The numbers (sub-quotas) to be sampled are:
(male under 20) = 40; (male 20-40) = 60; (male over 40) = 25;
(female under 20) = 35; (female 20-40) = 65 and (female over 40) = 30.
The total quota is the total of all the sub-quotas i.e. 255.

Stage 4: Visit a place where individuals to be interviewed are readily
available e.g. a large shopping center and interview people until all the
quotas are filled.
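
Stages 2 and 3 of this example can be sketched in Python. The cross-classification is done here with `itertools.product`; the variable names are assumptions for illustration.

```python
from itertools import product

ages = ["under 20", "20-40", "over 40"]
genders = ["male", "female"]

# Stage 2: cross-classify the two characteristics to get the categories.
categories = list(product(genders, ages))

# Stage 3: attach the sub-quotas from the example to the categories, in order.
sub_quotas = dict(zip(categories, [40, 60, 25, 35, 65, 30]))
total_quota = sum(sub_quotas.values())

print(len(categories), total_quota)
```

The number of categories is the product of the number of levels of each characteristic, here 2 × 3 = 6.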

Quota sampling is a cheap and convenient way of obtaining a sample in a short
space of time. However, this method of sampling is not based on the laws of
chance and cannot guarantee a sample that is representative of the population
from which it is drawn.

When obtaining a quota sample, interviewers often choose who they like
(within criteria specifications) and may therefore select those who are easiest
to interview. Therefore sampling bias can result. It is also impossible to
estimate the accuracy of quota sampling (because sampling is not random).

Chapter 1 – Tutorial
1. Determine whether the data set is a population or a sample.
(a) The age of the Premier of each Province in South Africa.
(b) The speed of every 5th car passing a police speed trap.
(c) A survey of 500 students from a university with 10000 students.
(d) The annual salary for each employee at Coke.
(e) The cholesterol level of 20 patients in a hospital with 100
patients.

2. Identify the population and the sample for each of the statements
below.
(a) A study of 33043 infants in Italy was conducted to find a
link between a heart rhythm abnormality and sudden
infant death syndrome.
(b) A survey of 2104 households in South Africa found that 42%
subscribe to DSTV.
(c) A survey of 546 women found that more than 56% are the
primary investor in their household.
(d) The Ancient Mayans predicted the end of the world to be in
2012. A study was designed in KwaZulu-Natal where 1200
residents were randomly asked whether they believed the
prediction or not. The results indicated that 52% of the
interviewed residents believed in the Mayans' prediction.
3. Determine whether the numeric value is a parameter or a statistic.
(a) The average annual salary for 25 of a company's 1250 statisticians is
R250000.
(b) In a survey of a sample of high school students, 41% said
that their mother has taught them the most about
managing money.
(c) In a survey of a sample of computer users, 15% said their computer
had a malfunction that needed to be repaired by a service
technician.
(d) In a recent year, the interest category for 9% of all new magazines
was sport.
(e) In a recent year, the average stats mark for all graduates at UKZN
was 34%.
(f) In a recent survey of 1000 adults from Gauteng, 34% said
using a cell phone while driving should be illegal.
4. For each of the following random variables (a) to (p):
(i) indicate the data type (i.e. discrete or continuous), and
(ii) the measurement scale (i.e. nominal, ordinal, interval or
ratio).

(a) The shelf life of milk.
(b) The number of life policies issued per day.
(c) The area of a shop floor.
(d) The number of pages in a text book.
(e) The flavours available in Dogmore food chunks.
(f) The types of wood that could be used to make a desk.
(g) The size categories for shoes.
(h) The voltage produced by a generator.
(i) The car types in the Mercedes range.
(j) The "yes/no/sometimes" response to "Do you drink Gin?".
(k) The number of loaves of bread sold daily by a bakery.
(l) The income per day of a bakery.
(m) The monthly birth rate at a maternity hospital.
(n) The mass of babies at birth.
(o) The daily distance travelled by a courier service truck.
(p) The names of teams in a cricket league.

5. A city's telephone book lists 100 000 people. If the telephone
book is the frame for a study, how large would the sample size
be if systematic sampling were done on every 200th person?

6. If every 11th item is systematically sampled to produce a
sample size of 75 items, approximately how large is the
population?

7. In a study investigating liver function in lions, the lions were
divided into 3 groups: Adult Males, Adult Females and Cubs (less
than a year old).

Lions                          Percentage of population
Adult Males                    20
Adult Females                  32
Cubs (less than a year old)    48

A stratified sample of 120 lions is to be selected for this study.
How many lions should be represented by each stratum?

8. Cadbury wants to market a new type of chocolate and needs
to know how potential customers might react to the product.
It is decided that age (under 21, 21 to 40, over 40), gender
(male, female) and race (black, coloured, Indian, white) are the
characteristics that will determine the sample. Quota sampling
is to be used.
(a) How many possible categories are there to be sampled from
(in stage 2 of quota sampling)?
(b) What is an advantage of using quota sampling?
(c) What is a disadvantage in using quota sampling?

CHAPTER 2
DESCRIPTIVE STATISTICS
(Exploratory Data Analysis)
All the data sets used in this chapter will be regarded as samples drawn from
some population. One of the main purposes of studying a sample is to get
information about the population. The main focus here is on summarizing and
describing some features of the data.

2.1 Graphs and diagrams

A line graph¹ is a graph used to present some characteristic recorded over
time.

Example
[Figure: line graph "Thando's weight (kg)". Vertical axis: Weight, 65 to 75 kg. Horizontal axis: Year, 2013 to 2019.]

The graph above shows how Thando's weight varied from the beginning of
2014 to the beginning of 2018.

1 See Appendix A3.

Bar charts

A bar chart or bar graph is a chart consisting of rectangular bars with heights
proportional to the values that they represent. Bar charts are used for
comparing two or more values that are taken over time or under different
conditions.

Simple Bar Chart²

In a simple bar chart the figures used to make comparisons are represented by
bars. These are either drawn vertically or horizontally. Only totals are
represented. The height or length of the bar is drawn in proportion to the size
of the figure being presented.

Example

The South African population data is displayed in the following simple bar
chart.

[Figure: simple bar chart "South African population (2015 - 2018)". Vertical axis: Population, 54,000,000 to 58,000,000. Horizontal axis: Year, 2015 to 2018. Data labels on the bars, in increasing order: 55,291,225; 56,015,473; 56,717,156; 57,398,421.]

Component Bar Chart³

2 See Appendix A4.
3 See Appendix A5.

When you want to draw a bar chart to illustrate your data, it is often the case
that the totals of the figures can be broken down into parts or components.

[Figure: component bar chart "Mid-year population estimates for South Africa by population group, sex 2017". Vertical axis: Number, 0 to 50,000,000. Horizontal axis: Population group (Black African, Coloured, Indian/Asian, White). Each bar is divided into Female and Male components.]

You start by drawing a simple bar chart with the total figures as shown above.
The columns or bars (depending on whether you draw the chart vertically or
horizontally) are then divided into the component parts.

Multiple (compound) Bar Chart⁴

You may find that your data allows you to make comparisons of the
component figures themselves. If so, you will want to create a multiple
(compound) bar chart. This type of chart enables you to trace the trends of
each individual component, as well as making comparisons between the
components.

4 See Appendix A6.

[Figure: multiple (compound) bar chart "Mid-year population estimates for South Africa by population group, sex 2017". Vertical axis: Number, 0 to 50,000,000. Horizontal axis: Population group (Black African, Coloured, Indian/Asian, White). Separate bars for Male, Female and Total in each group.]

Pie Chart⁵

A pie chart is a diagram that shows the subdivision of some entity/total into
subgroups. The diagram is in the form of a circle which is divided into slices
with each slice having an area according to the proportion that it makes up of
the total.

Example
The pie chart below shows the weighting of services used in the construction
input price index (Construction Materials Price Indices, April 2019).

Service                                                          Percentage   Degrees
Site preparation                                                 1            3
Construction of buildings                                        24           86
Civil engineering                                                37           133
Other structures                                                 2            6
Construction by specialist trade contractors                     6            22
Plumbing                                                         2            6
Electrical contractors                                           8            29
Shopfitting                                                      1            2
Other building installation                                      8            27
Painting and decorating                                          1            4
Other building completion                                        8            30
Renting of construction or demolition equipment with operators   3            12

5 See Appendix A7.


[Figure: pie chart "Service weighting in the CIPI", with one slice per service listed in the table above. The largest slices are Civil engineering (37%) and Construction of buildings (24%).]


The degrees needed for each slice are found by calculating the appropriate
percentage of 360°.
For example, civil engineering: $\frac{37}{100} \times 360^\circ = 133.2^\circ \approx 133^\circ$
The complete calculations are shown in the table above.
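
As a small sketch of this calculation in Python (the function name `slice_degrees` is an assumption for illustration):

```python
def slice_degrees(percentage):
    """Angle of a pie slice, in degrees, for a given percentage of the total."""
    return percentage / 100 * 360


civil_engineering = slice_degrees(37)
print(round(civil_engineering))
```

Note that some degrees in the table differ slightly from what the rounded percentages give (for example 1% corresponds to 3.6°), presumably because the table was calculated from unrounded weights.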

2.2 Sigma and subscript notation

The symbol sigma, ∑ (capital S in the Greek alphabet), is used to denote “the
sum of” values.
Suppose the symbol x is used to denote some variable of interest in a study. In
order to distinguish between values of this variable, subscripts are used.

$x_1$ is the first value in the data set, which has a subscript 1.
$x_2$ is the second value in the data set, which has a subscript 2.
.
.
$x_n$ is the nth value in the data set, which has a subscript n.

The sum of these values is written in shorthand notation as

$$x_1 + x_2 + \dots + x_n = \sum_{i=1}^{n} x_i$$

If it is understood that the range of subscript indices over which the
summation is taken involves all the x values, the summation can be written
simply as:

$$x_1 + x_2 + \dots + x_n = \sum x$$

Example 1
If $x_1 = 70$; $x_2 = 74$; $x_3 = 66$; $x_4 = 68$; $x_5 = 71$

Then

$$\sum_{i=1}^{5} x_i = x_1 + x_2 + \dots + x_5 = 70 + 74 + 66 + 68 + 71 = 349$$

The sum of the squares of a set of values is written as $\sum x^2$ for short.

Example 2
For the data set in example 1,

$$\sum_{i=1}^{5} x_i^2 = 70^2 + 74^2 + 66^2 + 68^2 + 71^2 = 24397$$

Note: $\sum_{i=1}^{n} x_i^2 \neq \left( \sum_{i=1}^{n} x_i \right)^2$

For example, with reference to the abovementioned data:

$$\sum_{i=1}^{5} x_i^2 = 24397 \neq \left( \sum_{i=1}^{5} x_i \right)^2 = 349^2 = 121801$$
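
The three quantities in examples 1 and 2 can be verified with a few lines of Python; the variable names are chosen for this sketch only.

```python
x = [70, 74, 66, 68, 71]

total = sum(x)                            # sum of the values
sum_of_squares = sum(v ** 2 for v in x)   # square each value first, then add
square_of_sum = total ** 2                # add first, then square the total

print(total, sum_of_squares, square_of_sum)
```

The sum of squares and the square of the sum are computed in a different order of operations, which is why they differ.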

The summation notation can also be used to write the sum of products of
corresponding values for 2 different sets of values.

$$\sum_{i=1}^{n} x_i y_i = x_1 y_1 + x_2 y_2 + \dots + x_n y_n$$

Example: Consider the following values.

i      1    2    3    4    5    6
x_i    11   13   7    12   10   8
y_i    8    5    7    6    9    11

For this data:

$$\sum_{i=1}^{6} x_i y_i = (11 \times 8) + (13 \times 5) + (7 \times 7) + (12 \times 6) + (10 \times 9) + (8 \times 11)$$
$$= 88 + 65 + 49 + 72 + 90 + 88 = 452$$

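Note: $\sum_{i=1}^{n} x_i y_i \neq \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)$

For example, with reference to the abovementioned data:

$$\sum_{i=1}^{6} x_i = 61 ; \quad \sum_{i=1}^{6} y_i = 46$$

$$\therefore \left( \sum_{i=1}^{6} x_i \right) \left( \sum_{i=1}^{6} y_i \right) = 2806 \neq \sum_{i=1}^{6} x_i y_i$$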
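
The same comparison in Python, as a brief sketch with variable names chosen for illustration:

```python
x = [11, 13, 7, 12, 10, 8]
y = [8, 5, 7, 6, 9, 11]

# Multiply corresponding values first, then add the products.
sum_of_products = sum(a * b for a, b in zip(x, y))

# Add each set of values first, then multiply the two totals.
product_of_sums = sum(x) * sum(y)

print(sum_of_products, product_of_sums)
```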

The summation notation is used extensively in specifying calculations in
statistical formulae.

2.3 Frequency distributions and related graphs

Frequency distribution

A frequency distribution is a table in which data are grouped into classes and
the number of values (frequencies) which fall in each class is recorded.
The main purpose of constructing a frequency distribution is to gain insight
into the distribution pattern of the frequencies over the classes. Hence, the
name frequency distribution is used to refer to this pattern.

Example 1
In a survey of 40 families in an urban neighbourhood, the number of children
per family was recorded and the following data was obtained.
1 0 3 2 1 5 6 2
2 1 0 3 4 2 1 6
3 2 1 5 3 3 2 4
2 2 3 0 2 1 4 5
3 3 4 4 1 2 4 5

number of children   Tally        frequency (f)
0                    ///          3
1                    //// //      7
2                    //// ////    10
3                    //// ///     8
4                    //// /       6
5                    ////         4
6                    //           2
Total                             40

Note: The sum of the frequencies = sample size, i.e. ∑ f =n
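
The tally in example 1 can be reproduced with Python's `collections.Counter`. This is only a sketch; the variable names are assumptions.

```python
from collections import Counter

children = [
    1, 0, 3, 2, 1, 5, 6, 2,
    2, 1, 0, 3, 4, 2, 1, 6,
    3, 2, 1, 5, 3, 3, 2, 4,
    2, 2, 3, 0, 2, 1, 4, 5,
    3, 3, 4, 4, 1, 2, 4, 5,
]

# Count how many families have each number of children.
frequency = Counter(children)
for value in sorted(frequency):
    print(value, frequency[value])

# The frequencies add up to the sample size.
n = sum(frequency.values())
```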

Example 2
Consider the following data of the amount of money spent by 50 DUT staff
members on public transport per day. The highest amount is R64 and the
lowest amount is R39.

Data set: The daily amount of money spent on public commuting by 50 DUT
staff members
57 39 52 52 43
50 53 42 58 55
58 50 53 50 49
45 49 51 44 54
49 57 55 64 45
50 45 51 54 58
53 49 52 51 41
52 40 44 49 45
43 47 47 43 51
55 55 46 54 41

Constructing a frequency distribution

The classes into which the above values can be sorted can be found by
following the steps shown below.

1. Find the maximum and minimum values and calculate the range (R):

$$R = X_{\max} - X_{\min} = 64 - 39 = 25$$

2. Decide on the number of classes. Use Sturges’ rule which states that:

$$\text{number of classes} = k = \text{the rounded up value of } (1 + 1.44 \ln n)$$
$$1 + 1.44 \times \ln(50) = 6.63, \quad \text{i.e. } k = 7.$$

3. Calculate the class width such that:

$$\text{number of classes} \times \text{class width} > \text{range}$$
$$\text{i.e. } 7 \times \text{class width} > 25$$
$$\therefore \text{class width} > \frac{25}{7} \approx 3.57$$

This suggests a class width of 4.

4. Find the lower value that defines the first class. This is usually a value
just below the minimum value in the data set. Since the minimum value
for this data set is 39, the lowest class can have a minimum value one
below this i.e. 38.

5. Find the lower values that define each of the classes that follow by
successively adding the class width to the lower value of the preceding class:

lower value of the second class = 38 + 4 = 42.

lower value of the third class = 42 + 4 = 46 etc.

The data values are therefore sorted into the following classes:

38 – 41, 42 – 45, 46 – 49, 50 – 53, 54 – 57, 58 – 61, 62 – 65

The table below shows the classes and their frequencies for the cost of
commuting data set.

class limits   f
38 – 41        4
42 – 45        10
46 – 49        8
50 – 53        15
54 – 57        9
58 – 61        3
62 – 65        1
Total          50
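
The five construction steps can be traced in Python as a rough sketch. The variable names are assumptions, and the expression used for the class width is just one way of taking the smallest whole number that satisfies the inequality in step 3.

```python
import math

amounts = [
    57, 39, 52, 52, 43, 50, 53, 42, 58, 55,
    58, 50, 53, 50, 49, 45, 49, 51, 44, 54,
    49, 57, 55, 64, 45, 50, 45, 51, 54, 58,
    53, 49, 52, 51, 41, 52, 40, 44, 49, 45,
    43, 47, 47, 43, 51, 55, 55, 46, 54, 41,
]

n = len(amounts)
data_range = max(amounts) - min(amounts)   # step 1: 64 - 39 = 25
k = math.ceil(1 + 1.44 * math.log(n))      # step 2: Sturges' rule, rounded up
width = math.floor(data_range / k) + 1     # step 3: smallest whole width with k * width > range
lower = min(amounts) - 1                   # step 4: start one below the minimum

# Step 5: inclusive classes (lower limit, upper limit), each `width` values wide.
classes = [(lower + i * width, lower + (i + 1) * width - 1) for i in range(k)]
frequencies = [sum(lo <= v <= hi for v in amounts) for lo, hi in classes]

for (lo, hi), f in zip(classes, frequencies):
    print(f"{lo} - {hi}: {f}")
```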

The values in the above example that define the classes of the frequency
distribution are called class limits. The classes of the type 38 – 41, 42 – 45, …,
etc. in which both the upper and lower limits are included are called “inclusive
classes”. For example, the class 38 – 41 includes all the values from 38 to 41.

The following points must be kept in mind for classification:

1. The classes should be clearly defined and should not lead to any
ambiguity.
2. Each of the given values in the data set should be included in one of
the classes.
3. The classes should be of equal width, otherwise the different class
frequencies will not be comparable. If the class widths are unequal,
then comparable figures can be obtained by dividing each frequency
by the width of its class interval. The ratios thus obtained are
called ‘frequency densities’.
4. The number of classes should not be too large nor too small.

Class midpoints

The midpoint of a class ($x_{mid}$) can be calculated from

$$x_{mid} = \frac{\text{lower class limit} + \text{upper class limit}}{2}$$

Examples
1. For the frequency distribution in example 2 (cost of daily commute
data), the class midpoints are given below.

class limits   midpoints
38 – 41        39.5
42 – 45        43.5
46 – 49        47.5
50 – 53        51.5
54 – 57        55.5
58 – 61        59.5
62 – 65        63.5
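
A short Python sketch of the midpoint calculation, with the class limits typed in as pairs (the variable names are assumptions):

```python
classes = [(38, 41), (42, 45), (46, 49), (50, 53), (54, 57), (58, 61), (62, 65)]

# Midpoint = (lower class limit + upper class limit) / 2
midpoints = [(lower + upper) / 2 for lower, upper in classes]
print(midpoints)
```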

Cumulative frequencies

The “less than” cumulative frequency of a class is the number of values in the
sample that are less than or equal to the upper class limit of the class.

Example
For the frequency distribution in example 2 (cost of daily commute data) the
cumulative frequencies are calculated as shown below.

classes    upper class limit   f    cumulative frequency   calculations
38 – 41    41                  4    4                      4
42 – 45    45                  10   14                     4 + 10
46 – 49    49                  8    22                     4 + 10 + 8
50 – 53    53                  15   37                     4 + 10 + 8 + 15
54 – 57    57                  9    46                     4 + 10 + 8 + 15 + 9
58 – 61    61                  3    49                     4 + 10 + 8 + 15 + 9 + 3
62 – 65    65                  1    50                     4 + 10 + 8 + 15 + 9 + 3 + 1
Total                          50
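
The running totals can be obtained in Python with `itertools.accumulate`, shown here as a sketch:

```python
from itertools import accumulate

frequencies = [4, 10, 8, 15, 9, 3, 1]

# Each cumulative frequency is the sum of all frequencies up to and including that class.
cumulative = list(accumulate(frequencies))
print(cumulative)
```

The last cumulative frequency always equals the sample size, here 50.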

Relative and percentage frequencies


- $\text{Relative frequency} = \frac{\text{frequency}}{\text{sample size}}$, i.e. $Rf = \frac{f}{n}$
- The percentage frequency of a class is calculated as: $Rf \times 100$

Examples

1. For the frequency distribution in example 2 (cost of daily commute
data) the relative and percentage frequencies are calculated as
shown below.

classes    f     relative frequency    percentage frequency
38 – 41    4     0.08                  8
42 – 45    10    0.2                   20
46 – 49    8     0.16                  16
50 – 53    15    0.3                   30
54 – 57    9     0.18                  18
58 – 61    3     0.06                  6
62 – 65    1     0.02                  2
Total      50    1                     100
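The columns derived so far (midpoints, cumulative, relative and percentage
frequencies) can be reproduced with a few lines of Python. This is a minimal
sketch using the class limits and frequencies of the commute example; the
variable names are illustrative and not part of the notes.

```python
# Class limits and frequencies from the cost of daily commute example.
classes = [(38, 41), (42, 45), (46, 49), (50, 53), (54, 57), (58, 61), (62, 65)]
freqs = [4, 10, 8, 15, 9, 3, 1]

n = sum(freqs)  # sample size

# Midpoint = (lower class limit + upper class limit) / 2
midpoints = [(lo + hi) / 2 for lo, hi in classes]

# A running total of the frequencies gives the "less than" cumulative frequency.
cumulative = []
running = 0
for f in freqs:
    running += f
    cumulative.append(running)

# Relative frequency = f / n; percentage frequency = relative frequency x 100.
relative = [f / n for f in freqs]
percentage = [100 * rf for rf in relative]

for (lo, hi), mid, f, cf, rf, pf in zip(
    classes, midpoints, freqs, cumulative, relative, percentage
):
    print(f"{lo} - {hi}  mid={mid}  f={f}  cum={cf}  rel={rf:.2f}  pct={pf:.0f}")
```

The last cumulative frequency always equals the sample size, which is a quick
check that no class was skipped.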

Histogram (see Appendix A8)

A histogram is the graphical representation of a frequency distribution. The
frequency for each class is represented by a rectangular bar with the class
boundaries as base and the frequency as height.

Example

The histogram of the frequency distribution in example 2 (cost of daily
commute data) is shown below.

[Figure: Histogram of the cost of daily commute data. Horizontal axis: class
interval (38 - 41 to 62 - 65); bar heights (frequencies): 4, 10, 8, 15, 9, 3, 1.]

Frequency polygon (see Appendix A9)

This is also a graphical representation of a frequency distribution. For each
class the class midpoint is plotted against the frequency and the plotted points
are joined by means of straight lines.

Example

For the cost of daily commute data the following values are plotted.

midpoint   35.5   39.5   43.5   47.5   51.5   55.5   59.5   63.5   67.5
f          0      4      10     8      15     9      3      1      0

The plot is shown below.

[Figure: Frequency polygon of the cost of daily commute data. Horizontal axis:
class midpoint (35.5 to 67.5); vertical axis: frequency.]

Note:
The two plotted values at the lower and upper ends were added to anchor the
graph to the horizontal axis. The lower end value is a plot of 0 versus the
midpoint of the class below the first (lowest) class (35.5). This midpoint is
obtained by subtracting the class width (4) from the midpoint of the lowest
class (39.5). The upper end value is a plot of 0 versus the midpoint of the class
above the last class (67.5). This midpoint is obtained by adding the class width
(4) to the midpoint of the last (highest) class (63.5).

The histogram and frequency polygon are equivalent graphical representations
of the pattern of the frequencies shown in the frequency distribution. It can be
shown that, for classes of equal width, the areas under the histogram and
frequency polygon are the same. This total area is proportional to the total
number of observations in the data set (n): it equals n multiplied by the
class width.

“Less than” ogive (see Appendix A10)

This is the graph of the cumulative frequencies versus the upper class limits.

Example

For the “less than” ogive of the frequency distribution in example 2 (daily cost
of commute data), the following values are plotted:

upper class limit      37   41   45   49   53   57   61   65
cumulative frequency   0    4    14   22   37   46   49   50

[Figure: "Less than" ogive. Horizontal axis: upper class boundary; vertical
axis: cumulative frequency; plotted values 4, 14, 22, 37, 46, 49, 50.]

Note:
The plotted value at the lower end was added to anchor the graph to the
horizontal axis. The lower end value is a plot of 0 versus the upper class
boundary of the class below the first (lowest) class (37). This upper class
boundary is obtained by subtracting the class width (4) from the upper class
boundary of the lowest class (41).

The shape of a distribution


The main purpose of drawing a histogram is to describe the clustering pattern
of the values in the data set. For a large sample size, the histogram (frequency
polygon) can be fairly well approximated by a smooth curve (called a density
curve) that is fitted to the frequencies. The following patterns of the shape of
the frequency curve appear regularly in data sets.

Symmetric bell shape

[Figure: Symmetric bell-shaped frequency curve. Horizontal axis: x (-4 to 4);
vertical axis: frequency.]

This shape is for data sets where the majority of values are in the central
portion of the scale with fewer and fewer values the further away from the
center (in both directions). Many data sets have this shape. Examples are

1. Marks obtained in an examination.


2. Heights of a large group of adult males.
3. IQ scores in a large population.

Uniform (rectangular) shape

[Figure: Uniform (rectangular) frequency curve. Horizontal axis: x (0 to 6);
vertical axis: frequency.]

This shape occurs when all the values in the data set occur approximately the
same number of times. Examples are:
1. Frequencies of winning numbers in a large number of Lotto draws.
2. Frequencies of winning numbers in a large number of roulette games.
3. Frequencies obtained when tossing an unbiased coin and recording 0 if
tails come up and 1 if heads come up.

Bimodal shape

[Figure: Bimodal frequency curve. Horizontal axis: body length (mm), 0 to 120;
vertical axis: frequency.]

This pattern, which shows two distinct peaks (hence the name bimodal), appears
when there are two subgroups with different sets of values in the same data
set.
Examples
1. Measuring the body lengths of ants when there are adults and juveniles
together in the same data set. The two peaks in the curve reflect the fact
that juvenile ants have shorter body lengths than adult ants.

2. Heights of a population of males and females. Since females are, on
average, shorter than males, the frequency curve will have two peaks. One
peak will be located where most of the female heights are concentrated
and one where most of the male heights are concentrated.

Positive skew shape


[Figure: Positively skewed frequency curve. Horizontal axis: x (0 to 14);
vertical axis: frequency.]

This shape shows a high clustering of values at the lower end of the scale and
less and less clustering further away from the lower end towards the upper
end.

Example
The time it takes to serve a customer at a supermarket. For most customers
the service time is quite short. The longer the service time, the fewer the
customers.

Negative skew shape

[Figure: Negatively skewed frequency curve. Horizontal axis: x (0 to 16);
vertical axis: frequency.]

This shape shows a high clustering of values at the upper end of the scale and
less and less clustering further away from the upper end towards the lower
end.

Example
Marks in a test where most students did well, but a few performed poorly.

Tutorial

1. According to the Air Transport Association of America, Delta
Airlines led all U.S. carriers in the number of passengers flown in
a recent year. The top 5 airlines were Delta, United, American,
US Airways, and Southwest. The number of passengers flown (in
thousands) by each of these airlines follows:

Airline       Passengers
Delta         103 133
United        84 203
American      81 083
US Airways    58 659
Southwest     55 946

Construct a pie chart to depict this information.

2. Research International reports that in a recent year, Huggies
was the top selling diaper brand in South Africa with 41.3% of
the market share. Other leading brands included Pampers
with 25.6%, Luvs with 12.1%, Drypers with 3.3%, Fitti with
0.9%, and private labels with 15.8% of the market share. Use
this information to construct a pie chart of the diaper market
shares.

3. Construct a pie chart from the following data.

Label Value
A 55
B 121
C 83
D 46

4. The following data represent the number of passengers per
flight in a sample of Mango flights from Durban to Port
Elizabeth.

23 46 66 67 13 58 19 17 65 17 25 20 47 28 16 38 44 29
48 29 69 34 35 60 37 52 80 59 51 33 48 46 23 38 52
50 17 57 41 77 45 47 49 19 32 64 27 61 70 19

Construct a frequency distribution from the raw data.


a. Calculate the range of the data.
b. Calculate the class width.

5. For the following data, construct a frequency distribution with six
classes.

57 23 35 18 21 26 51 47 29 21 46 43 29 23 39
50 41 19 36 28 31 42 52 29 18 28 46 33 28 20

6. Complete the following frequency distribution table and then
construct the histogram and frequency polygon.

Class boundaries   Frequency   Midpoint   Relative frequency   Cumulative frequency
20.5 - 25.5        17
25.5 - 30.5        20
30.5 - 35.5        16
35.5 - 40.5        15
40.5 - 45.5        8
45.5 - 50.5        6

7. Complete the following frequency distribution table and then
construct the histogram and frequency polygon.

Class boundaries   Frequency   Midpoint   Relative frequency   Cumulative frequency
50.5 - 60.5        13
60.5 - 70.5        27
70.5 - 80.5        43
80.5 - 90.5        31
90.5 - 100.5       9

8. Comment on the shape of the distributions in questions 6 and 7,
respectively.

CHAPTER 3
MEASURES OF LOCATION AND DISPERSION
3.1 Introduction
A measure of central tendency is a value that shows the location on the scale
where a data set is centrally located (most values are clustered around it).

In the calculations a distinction will be made between methods used when the
data are in raw form (values as collected) or grouped form (form of a
frequency distribution).

3.2 The mean (average), median and mode

A. Raw data
Mean: The mean (or average) of a set of data values is the sum of all of the
data values in the set divided by n, the number of data values. That is

$$\text{mean} = \bar{x} = \frac{\sum x}{n}$$

$\bar{x}$ is pronounced “x bar”.

Example
The marks of seven students in a mathematics test with a maximum possible
mark of 20 are given below:
15 13 18 16 14 17 12

$$\bar{x} = \frac{\sum x}{n} = \frac{15+13+18+16+14+17+12}{7} = \frac{105}{7} = 15$$

Median: The median is the value in the data set which is such that half of the
values in the data set are less than or equal to it and half are greater than or
equal to it.

For an odd number of values in the data set, the median is the middle value of
the data set when it has been arranged in ascending order. That is, from the
smallest value to the largest value.

Median $= \frac{1}{2}(n+1)$th value in the ranked data set, where n is the
sample size

If the number of values in the data set is even, then the median is the average
of the two middle values.

Examples

1. The marks of nine students in a geography test that had a maximum
possible mark of 50 are given below:

47 35 37 32 38 39 36 34 35

Find the median of this set of data values.

Arrange the data values in order from the lowest value to the highest value:

32 34 35 35 36 37 38 39 47

The number of values, n, in the data set is 9.

Median $= \frac{1}{2}(n+1)$th value

$= \frac{1}{2}(9+1)$th value $=$ 5th value

$= 36$

2. Consider the above data set with the first value (47) omitted.

Arrange the data values in order from the lowest value to the highest value:

32 34 35 35 36 37 38 39

In this case the number of values is n = 8, which is an even number.

Median $= \frac{1}{2}(n+1)$th value

$= 4.5$th value

The value that lies in position 4.5 in the ranked data set is the
average of the 4th and 5th values:

$$\therefore \text{Median} = \frac{35+36}{2} = 35.5$$
Mode: The mode of a set of data values is the value(s) that occurs most often.

Example:
Find the mode of the following data set:
48 44 48 45 42 49 48
The mode is 48 since it occurs most often.

Note:

1. It is possible for a set of data values to have more than one mode.
2. If there are two data values that occur most frequently, we say that the
set of data values is bimodal e.g. the data set 2 2 4 5 5 6 has two
modes (2 and 5).
3. If no value in the data set occurs more than once, it has no mode e.g. the
data set 4 5 7 9 has no mode.
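
The three measures can be checked with Python's standard `statistics` module.
This is a small sketch using the data sets from the examples above; the
variable names are illustrative.

```python
from statistics import mean, median, multimode

maths_marks = [15, 13, 18, 16, 14, 17, 12]
geography_marks = [47, 35, 37, 32, 38, 39, 36, 34, 35]

print(mean(maths_marks))            # 15
print(median(geography_marks))      # 36 (odd n: the middle ranked value)
print(median(geography_marks[1:]))  # 35.5 (even n: average of the two middle values)

# multimode returns every value that occurs most often.
print(multimode([48, 44, 48, 45, 42, 49, 48]))  # [48]
print(multimode([2, 2, 4, 5, 5, 6]))            # [2, 5]
```

Note that `multimode` follows a different convention from note 3: for a data
set such as 4 5 7 9, where no value repeats, it returns all the values rather
than reporting that there is no mode.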

Comparison of mean, median and mode

1. The mean is used as a measure of central tendency for symmetrical,
bell-shaped data that do not have extreme values (extreme values are called
outliers).
2. The median may be more useful than the mean when there are extreme
values in the data set as it is not affected by the extreme values.
3. The mode is useful when the most common item, characteristic or value
of a data set is required.

Examples

1. The amounts (in thousands) for which each of 7 properties was sold are
shown below.

280, 390, 412, 555, 698, 725, 2 350

For this data set the mean is $\bar{x} = 5410/7 = 772.86$. This value of the
mean is not a central value for the data set (it is greater than all the
values but the largest one). The reason for this is that the last value
(2 350) has a considerable influence on the value of the mean.

The median = 555 is a value that is more centrally located than the mean.
Unlike the mean, the median is not influenced by the large last value in
the data set.

2. For qualitative (non-numerical) data only the mode can be calculated.
For example, suppose 10 ratepayers are asked whether they think the
percentage increase in rates is reasonable. They can either agree (A),
disagree (D) or be neutral (N) on the issue. Their responses are shown
below.

A, A, D, N, D, A, D, D, N, N.

For this data set the modal response is D (since D occurs more times
than the other responses). It is not possible to calculate a median or a
mean for this data set.

The weighted mean

When calculating the mean for raw data, it is usually assumed that all the
values in the data set are equally important. If the values are not all considered
equally important, the weighted mean ($\bar{x}_w$) is calculated according to the
formula below.

$$\bar{x}_w = \frac{\sum_{i=1}^{r} x_i w_i}{\sum_{i=1}^{r} w_i}$$

In the formula $x_1, x_2, \ldots, x_r$ are the values and $w_1, w_2, \ldots, w_r$
are their respective weights.

Example

The final mark (percentage) in a certain course is based on an assignment mark
(which counts for 10% of the final mark), a test mark (which counts for 30% of
the final mark) and an exam mark (which counts for 60% of the final mark).
Calculate the final mark of a student who gets a 65% assignment mark, a 70%
test mark and a 55% exam mark.

Solution:

The above formula is applied with
$x_1 = 65$, $x_2 = 70$, $x_3 = 55$, $w_1 = 10$, $w_2 = 30$, $w_3 = 60$:

$$\bar{x}_w = \frac{(65 \times 10) + (70 \times 30) + (55 \times 60)}{10+30+60} = \frac{6050}{100} = 60.5$$
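
The weighted mean formula translates directly into code. The sketch below
assumes a helper named `weighted_mean` (an illustrative name, not from the
notes) and reuses the marks and weights of the example.

```python
def weighted_mean(values, weights):
    """Return sum(x_i * w_i) / sum(w_i) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)


# Assignment, test and exam marks with weights of 10, 30 and 60.
final_mark = weighted_mean([65, 70, 55], [10, 30, 60])
print(final_mark)  # 60.5
```

Only the ratios between the weights matter: weights of 0.1, 0.3 and 0.6
would lead to the same final mark, up to floating-point rounding.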

B. Grouped data

Mean:
For grouped data the mean is calculated from the formula below:

$$\bar{x} = \frac{\sum (x_{mid} \times f)}{n}$$

where $x_{mid}$ is the class midpoint, $f$ the class frequency and $n$ is the
sample size. This formula is a special case of the weighted mean formula with
$w_i = f_i$ and $\sum w_i = n$.

Example
For the frequency distribution of example 2 (cost of daily commute data), the
mean can be calculated as shown below.

Class interval   x_mid   f     x_mid × f
38 – 41          39.5    4     158
42 – 45          43.5    10    435
46 – 49          47.5    8     380
50 – 53          51.5    15    772.5
54 – 57          55.5    9     499.5
58 – 61          59.5    3     178.5
62 – 65          63.5    1     63.5
Total                    50    2487

$$\bar{x} = \frac{2487}{50} = 49.74$$
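
The same calculation can be sketched in Python, treating each class midpoint
as a value weighted by its class frequency. The list names are illustrative.

```python
# Class midpoints and frequencies from the table above.
midpoints = [39.5, 43.5, 47.5, 51.5, 55.5, 59.5, 63.5]
freqs = [4, 10, 8, 15, 9, 3, 1]

n = sum(freqs)                                           # sample size
total = sum(m * f for m, f in zip(midpoints, freqs))     # sum of x_mid * f
grouped_mean = total / n

print(total, n, round(grouped_mean, 2))  # 2487.0 50 49.74
```

Because every value in a class is replaced by the class midpoint, the result
is an approximation of the mean of the raw data.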

3.3 Measures of variability (variation, spread, dispersion)

Variability refers to the extent to which the values in a data set vary around
(differ from) the associated measure of central tendency.

Example
The performance of 2 different stocks is monitored over a period of 8 days.
Their values are shown in the table below.

Day 1 2 3 4 5 6 7 8
A 103 120 112 108 130 106 120 112
B 112 97 85 123 153 85 146 110

The scatter plots that follow show the performance of each stock (see
Appendix A, page 24).

[Figure: Stock A. Horizontal axis: day (1 to 8); vertical axis: stock price;
plotted values 103, 120, 112, 108, 130, 106, 120, 112.]

[Figure: Stock B. Horizontal axis: day (1 to 8); vertical axis: stock price;
plotted values 112, 97, 85, 123, 153, 85, 146, 110.]

The mean values for the two stocks are the same (= 113.875), but they differ in
variability (extent of spread around the mean). Stock B has a far wider spread
around the mean than stock A.

A. Raw data

Range: $R = x_{max} - x_{min}$

Example:
For the stocks data sets:
Range for stock A = 130 – 103 = 27
Range for stock B = 153 – 85 = 68

The larger (wider) spread in the stock B values is reflected in the larger range
(more than twice that of stock A).

Standard deviation and variance

The sample variance (denoted by $s^2$) is a measure of variability based on
squared differences between the values in the data set and the mean.

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

i.e.

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$$

The variance is expressed in the data units squared.
The standard deviation, $s = \sqrt{s^2}$, which is the positive square root of
the variance, is expressed in the same units as the data.

Example

For stock A the standard deviation is calculated as follows.

Stock A (x values)   x²
103                  10609
120                  14400
112                  12544
108                  11664
130                  16900
106                  11236
120                  14400
112                  12544
∑ = 911              ∑ = 104297

Variance: $s^2 = \dfrac{104297 - (8 \times 113.875^2)}{7} = 79.55$

Standard deviation: $s = \sqrt{79.55} = 8.919$

For stock B the standard deviation is 25.682 (check this using your calculator).

Interpretation: The stock A values differ (on average) from the mean by 8.919,
while stock B values differ (on average) from the mean by almost 3 times this
amount.
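
Both standard deviations can be verified with a short Python sketch. It uses
`statistics.variance` and `statistics.stdev`, which divide by n - 1 as in
the formulas above, and also evaluates the computational form of the
variance by hand for stock A.

```python
from math import sqrt
from statistics import mean, stdev, variance

stock_a = [103, 120, 112, 108, 130, 106, 120, 112]
stock_b = [112, 97, 85, 123, 153, 85, 146, 110]

# Range = largest value - smallest value
range_a = max(stock_a) - min(stock_a)
range_b = max(stock_b) - min(stock_b)

# Computational form: (sum of x^2 - n * mean^2) / (n - 1)
n = len(stock_a)
var_a_by_hand = (sum(x * x for x in stock_a) - n * mean(stock_a) ** 2) / (n - 1)

print(range_a, range_b)                    # 27 68
print(round(variance(stock_a), 2))         # 79.55
print(round(sqrt(var_a_by_hand), 3))       # 8.919
print(round(stdev(stock_b), 3))            # 25.682
```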

B. Grouped data

Standard deviation and variance

For grouped data, the raw data formulae for the variance and standard
deviation can be slightly modified.

$$s^2 = \frac{\sum_{i=1}^{k} (x_{mid(i)} - \bar{x})^2 f_i}{n-1}$$

i.e.

$$s^2 = \frac{\sum_{i=1}^{k} x_{mid(i)}^2 f_i - n\bar{x}^2}{n-1}$$

As before, the standard deviation is $s = \sqrt{s^2}$.

Example

For the frequency distribution of example 2 (cost of commuting data), the
variance and standard deviation can be calculated as shown below.

Class interval   x_mid(i)   f_i   x_mid(i) f_i   x²_mid(i) f_i
38 – 41          39.5       4     158            6241
42 – 45          43.5       10    435            18922.5
46 – 49          47.5       8     380            18050
50 – 53          51.5       15    772.5          39783.75
54 – 57          55.5       9     499.5          27722.25
58 – 61          59.5       3     178.5          10620.75
62 – 65          63.5       1     63.5           4032.25
Total                       50    2487           125372.5

$$\text{variance} = s^2 = \frac{\sum_{i=1}^{k} x_{mid(i)}^2 f_i - n\bar{x}^2}{n-1} = \frac{125372.5 - (50)(49.74)^2}{50-1} = 34.06367$$

$$\text{standard deviation} = s = \sqrt{s^2} = \sqrt{34.06367} = 5.836$$
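
A minimal Python sketch of the grouped-data calculation follows. It applies
the computational formula to the class midpoints and frequencies; the
variable names are illustrative.

```python
from math import sqrt

midpoints = [39.5, 43.5, 47.5, 51.5, 55.5, 59.5, 63.5]
freqs = [4, 10, 8, 15, 9, 3, 1]

n = sum(freqs)
grouped_mean = sum(m * f for m, f in zip(midpoints, freqs)) / n

# Sum of x_mid^2 * f over all classes (125372.5 in the table above).
sum_sq = sum(m * m * f for m, f in zip(midpoints, freqs))

grouped_variance = (sum_sq - n * grouped_mean ** 2) / (n - 1)
grouped_sd = sqrt(grouped_variance)

print(round(grouped_variance, 5), round(grouped_sd, 3))
```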

3.4 Coefficient of variation

The standard deviations of 2 data sets that are expressed in different units
cannot be directly compared. However, such a comparison may be done by
calculating the:

coefficient of variation: $CV = \dfrac{s}{\bar{x}} \times 100$, which is
expressed as a percentage.

Example
The ages of three students were 19, 20 and 21 years and their respective
weights were 55, 60 and 65 kilograms. Since the two data sets are in different
units, their standard deviations cannot be compared directly.

For the age data: $\bar{x} = 20$, $s = 1$ $\therefore CV = \frac{1}{20} \times 100 = 5\%$

For the weight data: $\bar{x} = 60$, $s = 5$ $\therefore CV = \frac{5}{60} \times 100 = 8.33\%$
The coefficient of variation calculations show that in relative terms the
variability for the weight data set is greater than that of the age data set.
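
The comparison can be sketched in Python as follows. The function name
`coefficient_of_variation` is illustrative; it uses the sample standard
deviation from the `statistics` module.

```python
from statistics import mean, stdev


def coefficient_of_variation(data):
    """Return the sample standard deviation as a percentage of the mean."""
    return 100 * stdev(data) / mean(data)


ages = [19, 20, 21]       # years
weights = [55, 60, 65]    # kilograms

cv_age = coefficient_of_variation(ages)
cv_weight = coefficient_of_variation(weights)

print(round(cv_age, 2), round(cv_weight, 2))  # 5.0 8.33
```

Because the units cancel in the ratio s / mean, the two percentages can be
compared even though one data set is in years and the other in kilograms.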

3.5 Measures of non-central location

3.5.1 Percentiles, Quartiles and Percentile Rank

The $i$th percentile, $P_i$, is the value that has $i$% of the values in a data
set less than or equal to it, where $0 < i \le 100$.

Examples

• Median = $M_e$ = 50th percentile = $P_{50}$
• First quartile = $Q_1$ = 25th percentile = $P_{25}$
• Third quartile = $Q_3$ = 75th percentile = $P_{75}$
• The 9 deciles $D_1, D_2, \ldots, D_9$ are the values that have 10%, 20%, ..., 90%
  respectively of the values in the data set less than or equal to them:
  $D_1 = P_{10}$, $D_2 = P_{20}$, ..., $D_5 = P_{50} = M_e$, ..., $D_9 = P_{90}$.

3.5.2 Calculation of quartiles and quartile deviation for raw data

The three quartiles (Q1 ,Q2 and Q3) are summary measures that divide a ranked
data set into four equal parts. As such, approximately 25% of the values in the
data set will be less than Q1, 50% of the values less than Q2 and 75% of the
values less than Q3.

The quartile deviation, $Q = \dfrac{Q_3 - Q_1}{2}$, can also be used as a measure
of variability.
The quartile deviation value shows the extent to which the values in the data
set deviate from the median. For a skew data set (heavy clustering at lower or
upper end of the scale) the quartile deviation is a more appropriate measure of
variability than the standard deviation (which is more suitable as a measure of
variability for symmetric data sets).

The value ( Q3−Q1) is called the Inter-quartile Range (IQR). IQR indicates the
spread or variation of the middle 50% of the values in the data set.

$Q_1 = \left[\dfrac{n+1}{4}\right]$th value in the ranked data set

$Q_2 = \left[\dfrac{2(n+1)}{4}\right]$th value in the ranked data set = Median

$Q_3 = \left[\dfrac{3(n+1)}{4}\right]$th value in the ranked data set

Use the following guidelines to obtain a quartile from its position:

1. If the position is a whole number then select the value from the
data set that corresponds to that whole number position.
2. If the position is halfway between two whole numbers then select
the average of the two data values which correspond to the two whole
number positions.
3. If the position does not satisfy either of the above two cases then
round off to the nearest whole number and select the data value that
corresponds to the rounded-off whole number position.

Example

The distances from home to work (kilometres) of 12 employees at a certain
company are shown below.

6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36, 56

Calculate $Q_1$, $Q_2$, $Q_3$, IQR and $Q$ for this data set.

Solution

Ranked data set: 6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 56

$Q_1 = \left[\dfrac{n+1}{4}\right]$th value in the ranked data set

$= \left[\dfrac{12+1}{4}\right]$th value in the ranked data set

$= 3.25$th $\cong$ 3rd value in the ranked data set

$= 15$ kilometres

$Q_2 = \left[\dfrac{2(n+1)}{4}\right]$th value in the ranked data set

$= \left[\dfrac{2(12+1)}{4}\right]$th value in the ranked data set

$= 6.5$th value in the ranked data set

$=$ average of the 6th and 7th values in the ranked data set

$= \dfrac{40+41}{2}$

$= 40.5$ kilometres

$Q_3 = \left[\dfrac{3(n+1)}{4}\right]$th value in the ranked data set

$= \left[\dfrac{3(12+1)}{4}\right]$th value in the ranked data set

$= 9.75$th $\cong$ 10th value in the ranked data set

$= 47$ kilometres

IQR $= Q_3 - Q_1 = 47 - 15 = 32$ kilometres

Quartile deviation: $Q = \dfrac{Q_3 - Q_1}{2} = \dfrac{47 - 15}{2} = 16$ kilometres
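
The position rule used in these notes can be sketched in Python. The helper
`value_at_position` is an illustrative name; it implements the three
guidelines above (whole position, halfway position, otherwise round to the
nearest position). Statistical packages often interpolate instead, so their
quartiles may differ slightly from the values obtained here.

```python
def value_at_position(ranked, position):
    """Return the value at a 1-based (possibly fractional) position."""
    whole = int(position)
    fraction = position - whole
    if fraction == 0:
        # Guideline 1: whole-number position.
        return ranked[whole - 1]
    if abs(fraction - 0.5) < 1e-9:
        # Guideline 2: halfway, so average the two neighbouring values.
        return (ranked[whole - 1] + ranked[whole]) / 2
    # Guideline 3: round to the nearest whole-number position.
    return ranked[round(position) - 1]


distances = [6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36, 56]
ranked = sorted(distances)
n = len(ranked)

q1 = value_at_position(ranked, (n + 1) / 4)
q2 = value_at_position(ranked, 2 * (n + 1) / 4)
q3 = value_at_position(ranked, 3 * (n + 1) / 4)
iqr = q3 - q1
quartile_deviation = iqr / 2

print(q1, q2, q3, iqr, quartile_deviation)  # 15 40.5 47 32 16.0
```

The sketch does not guard against positions below 1 or above n, which can
occur for very small data sets.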

3.5.3 Calculation of percentiles and percentile rank for raw data


The value of the $k$th percentile is:

$P_k = \left[\dfrac{k(n+1)}{100}\right]$th value in a ranked data set

The percentile rank of a score is the percentage of values in the data set that
are smaller than the given score and is denoted by $PR_x$, where $x$ is the
given score.

$$PR_x = \frac{\text{number of values less than } x}{n} \times 100$$

For the distance to work data set above, $P_{80}$ and $PR_{40}$ are calculated
as follows:

Ranked data set: 6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 56

$P_k = \left[\dfrac{k(n+1)}{100}\right]$th value in a ranked data set

$P_{80} = \left[\dfrac{80(12+1)}{100}\right]$th value in the ranked data set

$P_{80} = 10.4$th $\cong$ 10th value in the ranked data set

$\therefore P_{80} = 47$ kilometres

$PR_{40} = \dfrac{\text{number of values less than } 40}{12} \times 100$

$PR_{40} = \dfrac{5}{12} \times 100 = 41.67\%$
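
The two calculations can be sketched in Python under the same position
guidelines as for the quartiles. The function names `percentile` and
`percentile_rank` are illustrative and not taken from the notes.

```python
def percentile(data, k):
    """k-th percentile using position k(n + 1)/100 and the rounding guidelines."""
    ranked = sorted(data)
    position = k * (len(ranked) + 1) / 100
    whole = int(position)
    fraction = position - whole
    if fraction == 0:
        return ranked[whole - 1]
    if abs(fraction - 0.5) < 1e-9:
        return (ranked[whole - 1] + ranked[whole]) / 2
    return ranked[round(position) - 1]


def percentile_rank(data, score):
    """Percentage of values in data that are strictly smaller than score."""
    return 100 * sum(1 for value in data if value < score) / len(data)


distances = [6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36, 56]

p80 = percentile(distances, 80)
pr40 = percentile_rank(distances, 40)

print(p80, round(pr40, 2))  # 47 41.67
```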

3.6 Chebychev’s theorem and bell-shaped data


Chebychev’s Theorem

Chebychev’s theorem states that for any data set a proportion of at least
$1 - \dfrac{1}{d^2}$ of the values lie within $d$ standard deviations of the mean.

Examples

1. The proportion of values that lie within 2 standard deviations of the mean
is at least $1 - \dfrac{1}{2^2} = 0.75$.
2. The proportion of values that lie within 3 standard deviations of the mean
is at least $1 - \dfrac{1}{3^2} = 0.889$.

The Empirical Rule (bell-shaped distributions)

If it is known that the data set of interest has a bell-shaped clustering pattern
of the values then results that are better than that of Chebychev’s theorem can
be obtained. For data with such a shape:

(i) Approximately 68% of data values are within 1 standard deviation of the
mean.
(ii) Approximately 95% of data values are within 2 standard deviations of
the mean.
(iii) Approximately 99.7% of data values are within 3 standard deviations of
the mean.

Example
Men’s heights have a bell-shaped distribution with a mean of 175.8
centimetres and a standard deviation of 7.4 centimetres.

Approximately 68% of data values are within 175.8 ± 7.4 = (168.4; 183.2).

Approximately 95% of data values are within 175.8 ± 14.8 = (161.0; 190.6).

Approximately 99.7% of data values are within 175.8 ± 22.2 = (153.6; 198.0).
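
Both results are easy to tabulate in code. The sketch below assumes two
illustrative helpers: `chebyshev_lower_bound` evaluates 1 - 1/d², and
`empirical_rule_intervals` builds the three intervals for bell-shaped data.

```python
def chebyshev_lower_bound(d):
    """Minimum proportion of values within d standard deviations of the mean."""
    if d <= 0:
        raise ValueError("d must be positive")
    return 1 - 1 / d ** 2


def empirical_rule_intervals(mean, sd):
    """Approximate 68%, 95% and 99.7% intervals for bell-shaped data."""
    return {
        pct: (mean - k * sd, mean + k * sd)
        for pct, k in ((68, 1), (95, 2), (99.7, 3))
    }


print(chebyshev_lower_bound(2))            # 0.75
print(round(chebyshev_lower_bound(3), 3))  # 0.889

# Men's heights: mean 175.8 cm, standard deviation 7.4 cm.
for pct, (low, high) in empirical_rule_intervals(175.8, 7.4).items():
    print(f"about {pct}% of heights lie in ({low:.1f}; {high:.1f})")
```

For d = 1 the Chebychev bound is 0, so the theorem only gives useful
information for d greater than 1.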

Tutorial

1. In a factory, the time during working hours in which a machine is
not operating as a result of breakage or failure is called the
"downtime". The following distribution shows a sample of 100
downtimes of a certain machine (rounded to the nearest minute):

Downtime Frequency
0-9 3
10-19 13
20-29 30
30-39 25
40-49 14
50-59 8
60-69 4
70-79 2
80-89 1

Calculate the (approximate) mean and standard deviation of the
downtimes.

2. The diameters of a sample of 400 washers produced by a
machine are summarized below:

Diameter (millimeters)   Number of washers (frequency)
30-40                    10
40-50                    50
50-60                    55
60-70                    79
70-80                    68
80-90                    60
90-100                   50
100-110                  28
110-120                  8
Total                    400

Calculate the (approximate) mean and standard deviation for the
data.

3. The frequency distribution of the number of days to maturity of
40 short-term investments is summarized below:

Days to maturity   Frequency
30-39              3
40-49              1
50-59              8
60-69              10
70-79              7
80-89              7
90-99              4

Calculate the mean number of days to maturity and the
standard deviation of the distribution.

4. In a factory the weight of all ball bearings produced is under
examination. The weights of 100 ball bearings were obtained
and recorded as follows:

Weight (grammes)   Frequency
5-9                16
10-14              30
15-19              39
20-24              12
25-29              3
a. Calculate the approximate sample mean and standard
deviation of the weight for the above ball bearings.
b. Construct a cumulative frequency distribution for the above
data and plot an ogive.
c. From the ogive above and the formula in your notes find the
first and third quartiles and the median weight for the ball
bearings.

5. The number of traffic tickets issued by a certain police department
in a 7-day period was
19 17 14 21 19 16 34
a. Find the mean and standard deviation for the above data.
b. Find the coefficient of variation and explain what this tells us.
c. Find the first and third quartiles, and the median for the above
data.
d. Are there any outliers?

6. The diastolic blood pressure readings for 12 randomly
selected men aged 45 - 49 years were as follows:
94 84 74 90 98 92 74 90 80 98 78 80
a. Find the mean and standard deviation for the above data.
b. Find the coefficient of variation and explain what this tells us.
c. Find the first and third quartiles, and the median for the
above data.
d. Are there any outliers?

7. What proportion of values lie within 1 standard deviation
of the mean? (Hint: Use Chebychev's theorem).

8. In a wildlife study, it is found that the average speed of the
Cheetah is 60 km/h with a standard deviation of 4 km/h.
What proportion of Cheetahs will have a speed
a. between 50 and 60 km/h?
b. less than or equal to 50 km/h or greater than or equal to 60
km/h?
c. Find the interval of speed that will contain approximately 95%
of data values.

CHAPTER 4
CORRELATION AND REGRESSION
4.1 Bivariate data and scatter diagrams

Often two variables are measured simultaneously and relationships between
these variables explored. Data sets involving two variables are known as
bivariate data sets.

The first step in the exploration of bivariate data is to plot the variables on a
graph. From such a graph, which is known as a scatter diagram (scatter plot,
scatter graph), an idea can be formed about the nature of the relationship.

Examples
1. It is believed that a person’s height (y) (measured in centimetres) is
dependent on the person’s shoe size (x). The values of x and y for 12
students are shown below.

x   5     4     12    8     9     7.5   6.5   11.5   10.5   11    6     4.5
y   160   152   196   168   178   165   165   170    188    180   163   155

Scatter diagram (see Appendix A12)

[Figure: Relationship between height and shoe size. Horizontal axis: shoe size;
vertical axis: height (in centimetres).]

2. In a study of the relationship between the amount of daily rainfall (x)
and the quantity of air pollution removed (y), the following data were
collected.

Rainfall (centimeters)   Quantity removed (micrograms per cubic meter)
4.3                      126
4.5                      121
5.9                      116
5.6                      118
6.1                      114
5.2                      118
3.8                      132
2.1                      141
7.5                      108

Scatter diagram

[Figure: Relationship between rainfall and quantity of air pollution removed.
Horizontal axis: rainfall (in centimetres); vertical axis: quantity of air
pollution removed.]

3. Data on the annual GDP growth rate (x) of various African countries and
the cost of building individual prestige houses (y) in these countries was
taken from the Africa Property & Construction Cost Guide, July 2017,
and is shown below:

GDP growth (annual % since 2000)   Cost of building individual prestige houses (in US$/m²)
3.0                                4650
-0.3                               1952
3.9                                2100
5.6                                1350
6.6                                1500
2.7                                2560
6.9                                1700
1.3                                1187
7.0                                1120
5.1                                1540
2.9                                1590

Scatter diagram

[Figure: Relationship between annual GDP growth rate and building costs.
Horizontal axis: annual GDP growth (%); vertical axis: building costs.]

• In all these cases the relationship can be fairly well described by means
  of a straight line, i.e. all these relationships are linear relationships.

• In the first example an increase in y is proportional to an increase in x
  (positive linear relationship). In the second and third examples a
  decrease in y is proportional to an increase in x (negative linear
  relationship).

• In all the examples changes in the values of y are affected by changes in
  the values of x (not the other way round). The variable x is known as the
  explanatory (independent) variable and the variable y the response
  (dependent) variable.

In this section only linear relationships between 2 variables will be explored.
The issues to be explored are

1. Measuring the strength of the linear relationship between the 2
variables (the linear correlation coefficient).

2. Finding the equation of the straight line that will best describe the
relationship between the 2 variables (the linear regression equation).
Once this line is determined, it can be used to estimate a value of y for a
given value of x (linear estimation).

4.2 Linear Correlation

The calculation of the coefficient of correlation (r) is based on the closeness of
the plotted points (in the scatter diagram) to the line fitted through them. It
can be shown that

$$-1 \le r \le 1$$

If the plotted points are closely clustered around this line, r will lie close to
either 1 or –1 (depending on whether the linear relationship is positive or
negative). The further the plotted points are away from the line, the closer the
value of r will be to 0. Consider the scatter diagrams that follow.

[Figure: Strong positive correlation (r close to 1)]

[Figure: Strong negative correlation (r close to –1)]

[Figure: No pattern (r close to 0)]

For a sample of n pairs of values $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$,
the coefficient of correlation can be calculated from the formula

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

Example
Consider the data on a person’s shoe size (x) and height (y) considered earlier.
For this data r can be calculated in the following way.

x      y      xy        x²       y²
5      160    800       25       25600
4      152    608       16       23104
12     196    2352      144      38416
8      168    1344      64       28224
9      178    1602      81       31684
7.5    165    1237.5    56.25    27225
6.5    165    1072.5    42.25    27225
11.5   170    1955      132.25   28900
10.5   188    1974      110.25   35344
11     180    1980      121      32400
6      163    978       36       26569
4.5    155    697.5     20.25    24025
∑      95.5   2040      16600.5  848.25   348716

Substituting

$n = 12$, $\sum x = 95.5$, $\sum y = 2040$,
$\sum xy = 16600.5$, $\sum x^2 = 848.25$, $\sum y^2 = 348716$

into the equation for r gives:

$$r = \frac{12 \times 16600.5 - 95.5 \times 2040}{\sqrt{12 \times 848.25 - (95.5)^2}\,\sqrt{12 \times 348716 - (2040)^2}}$$

$$= \frac{4386}{\sqrt{1058.75 \times 22992}}$$

$$= 0.889$$

Comment: Strong positive correlation i.e. the increase in a person’s shoe size is
closely linked with an increase in the person’s height.

Coefficient of determination
The strength of the correlation between 2 variables is proportional to the
square of the correlation coefficient (r2). This quantity, called the coefficient of
determination, is the proportion of variability in the y variable that is
accounted for by its linear relationship with the x variable.

Example
In the above example on height (y) and shoe size (x), the
coefficient of determination is $r^2 = (0.889)^2 = 0.7903$.
This means that approximately 79% of the variability in a person’s height
is explained by its linear relationship with the person’s shoe size.
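
The correlation calculation can be sketched in Python directly from the
column sums. The variable names are illustrative; the data are the shoe
size and height values from the table above.

```python
from math import sqrt

shoe_size = [5, 4, 12, 8, 9, 7.5, 6.5, 11.5, 10.5, 11, 6, 4.5]
height = [160, 152, 196, 168, 178, 165, 165, 170, 188, 180, 163, 155]

n = len(shoe_size)
sum_x = sum(shoe_size)
sum_y = sum(height)
sum_xy = sum(x * y for x, y in zip(shoe_size, height))
sum_x2 = sum(x * x for x in shoe_size)
sum_y2 = sum(y * y for y in height)

numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

r = numerator / denominator
r_squared = r ** 2

print(round(r, 3), round(r_squared, 2))  # 0.889 0.79
```

Squaring the unrounded r gives a coefficient of determination of about
0.7902, marginally different from the 0.7903 obtained by squaring the
rounded value 0.889.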

4.3 Linear Regression


Finding the equation of the line that best fits the (x, y) points is based on the
least squares principle. This principle can best be explained by considering the
scatter diagram below.

According to the least squares principle, the line that “best” fits the plotted
points is the one that minimizes the sum of the squares of the vertical
deviations (see vertical lines in the graph) between the plotted y and estimated
y (values on the line). For this reason the line fitted according to this principle
is called the least squares line.

Calculation of the least squares linear regression line (see Appendix A13)

The equation for the line to be fitted to the (x, y) points is

$$\hat{y} = a + bx$$

where $\hat{y}$ is the fitted y value (the y value on the line, which generally
differs from the observed y value), a is the y-intercept and b the slope of
the line.
It can be shown that the coefficients that define the least squares line can be
calculated from

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$

and

$$a = \bar{y} - b\bar{x}$$

Example
For the above data on shoe size (x) and height (y) the least squares line can be
calculated as shown below.

Substituting

$$n = 12,\quad \sum x = 95.5,\quad \sum y = 2040,\quad \sum xy = 16600.5,\quad \sum x^2 = 848.25$$

into the above equations gives:

$$b = \frac{12 \times 16600.5 - 95.5 \times 2040}{12 \times 848.25 - (95.5)^2} = \frac{4386}{1058.75} = 4.14$$

and

$$a = \frac{2040}{12} - 4.14 \times \frac{95.5}{12} = 137.05$$

Therefore the equation of the y on x least squares line that can be used to
estimate values of y (height) based on x (shoe size) is:

$$\hat{y} = 137.05 + 4.14x$$

Suppose the height of a student with a shoe size of 7 is to be estimated. This can
be done by substituting the value x = 7 into the above equation. Then

$$\hat{y} = 137.05 + 4.14 \times 7 \approx 166$$
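The slope, intercept and estimate can be reproduced from the same totals in Python. This is a minimal sketch; the function name `least_squares_from_sums` is an assumption made for this example. Because the code keeps the unrounded slope, the intercept it produces differs slightly from the hand value, which uses b rounded to 4.14.

```python
def least_squares_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Intercept a and slope b of the least squares line y-hat = a + b x."""
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n  # a = y-bar - b * x-bar
    return a, b


# Totals from the shoe size (x) and height (y) table
a, b = least_squares_from_sums(12, 95.5, 2040, 16600.5, 848.25)

# Estimated height for a shoe size of 7
predicted_height = a + b * 7

print(f"a = {a:.2f}, b = {b:.2f}, estimate at x = 7: {predicted_height:.0f}")
```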

A word of caution

• The linear relationship between y and x is often only valid for values of x
within a certain range e.g. when estimating a person’s height using the
person’s shoe size as explanatory variable, it should be taken into
account that at some shoe size the person’s height will stop increasing.
Assuming a linear relationship between shoe size and height for values
beyond the shoe size where the person’s height stops increasing would
be incorrect.

• Only relationships between variables that could be related in a practical
sense should be explored e.g. it would be pointless to explore the relationship
between the number of vehicles in New York and the number of
divorces in South Africa. Even if data collected on such variables might
suggest a relationship, it cannot be of any practical value.

• If variables are not linearly related, it does not mean that they are not
related. There are many situations where the relationships between
variables are non-linear.

Example

A plot of the banana consumption (y) versus the price (x) is shown in the graph
on the following page. A straight line will not describe this relationship very
well, but the non-linear curve shown below will describe it well.

[Figure: Nonlinear regression example. Banana consumption (y) plotted against price (x), with a fitted non-linear curve.]

Tutorial

1. The following are the assessed valuations of eight houses in a certain
city and the selling prices of these houses. The data constitute a
random sample of all the houses assessed and sold in that city.

Assessed value, X        Selling price, Y
(thousands of rand)      (thousands of rand)
116.0                    185.0
160.8                    246.4
103.2                    162.2
55.8                     97.6
89.6                     148.0
65.0                     110.4
144.0                    236.6
80.6                     126.8

Find the line of best fit and use it to estimate the selling price of a house
when its assessed value is R100 000.

2. In a study of the relationship between the amount of rainfall (x) and the
quantity of pollution removed from the air (y), the following data were
collected:

x   4.3  4.5  5.9  5.6  6.1  5.2  3.8  2.1  7.5
y   126  121  116  118  114  118  132  141  108

a. Find the equation of the regression line to predict the particulate
removed from the amount of daily rainfall. Estimate the
amount of particulate removed when the daily rainfall is x =
4.8 units.
b. Determine the correlation coefficient between the particulate
removed and the amount of daily rainfall.

3. In a certain type of metal test the normal stress on a specimen is
known to be functionally related to the shear resistance. The
following is a set of coded experimental data on the two variables:

Normal stress, x    Shear resistance, y
26.8                26.5
25.4                27.3
28.9                24.2
23.6                27.1
27.7                23.6
23.9                25.9
24.7                26.3
28.1                22.5
26.9                21.7
27.4                21.4
22.6                25.8
25.6                24.9

(a) Find the equation for the line of best fit.

(b) Determine the correlation coefficient between the shear resistance and the
normal stress.

(c) Estimate the shear resistance for a normal stress of 24.5 (kilograms per
square cm).

4. A chemical company, wishing to study the effects of extraction time
(x) on the efficiency (y) of an extraction operation, obtained the
data shown below:

x (minutes)   27  45  41  19  35  39  19  49  15  31
y (%)         57  64  80  46  62  72  52  77  57  65

Find the least squares regression line by which one may predict the
efficiency from the extraction time.

5. The following sample data show the demand for a product (in
thousands of units) and its price (in cents) in six different market
areas:

Price, x    Demand, y
19          55
23          7
21          20
15          123
16          88
18          76

Σx = 112, Σx² = 2136, Σy = 369, Σy² = 32123, Σxy = 6247

a. Fit a least squares line that will enable us to predict the
demand for the product in terms of its price.
b. Predict the demand for the product in a market area where it is
priced at 15 cents.

Plot the data and the regression line on suitable axes. (Show your working for the 2
points needed to plot the straight line.)

CHAPTER 5
RANDOM VARIABLES AND
PROBABILITY DISTRIBUTIONS
5.1 Introduction to probability distributions

Probability (chance)
• A probability is the chance that something of interest will happen.
• A probability is expressed as a proportion i.e. it ranges from 0 to 1.
Chance can also be expressed as a percentage i.e. it ranges from 0% to 100%.

Examples
1. The probability of rain tomorrow is 0.40.
There is a 40% chance of rain tomorrow.
2. The probability of winning the Lotto is 1/13 983 816.
3. The probability of a certain new product being successful is 0.75.

Random experiment
This is an experiment that gives different outcomes when repeated under
similar conditions.

1. The experiment can have more than one possible outcome.


2. All possible outcomes can be listed.
3. The outcome that will occur when the experiment is performed depends
on chance.

Examples

1. Tossing a coin (possible outcomes: heads, tails).


2. Rolling a die (possible outcomes: 1, 2, 3, 4, 5, 6).
3. Asking a person to assign a rating to a product (possible outcomes: A, B,
C, D, E).
4. Drawing a card from a deck of cards (possible outcomes: 13 hearts, 13
clubs, 13 spades, 13 diamonds).

A random variable is a variable whose value depends on the outcome of a
random experiment. A random variable is denoted by a capital letter and a
particular value of a random variable by a lower case (small) letter.

Examples
1. T = the number of tails (t) when a coin is flipped 3 times.
2. X = the sum of the values (x) showing when two dice are rolled.
3. H = the height (h) of a woman chosen at random from a group.
4. V = the liquid volume (v) of soda in a can marked 12 oz.
There are two types of random variables:

Discrete Random Variables
• Variables that have a finite or countable number of possible values.
• These variables usually occur in counting experiments.

Continuous Random Variables
• Variables that can take on any value in some interval i.e. they can take
an infinite number of possible values.
• These variables usually occur in experiments where measurements are
taken.

Examples
1. The variables T and X from the above examples are discrete random
variables.
2. The variables H and V from the above examples are continuous random
variables.

5.2 Discrete probability distributions and their graphical representations

A discrete probability distribution is a list of the possible distinct values of the
random variable together with their corresponding probabilities. The
probability of the random variable X assuming a particular value x is denoted
by P(X=x) = P(x). This probability, which is a function of x, is referred to as the
probability mass function.

Examples

1. As above, let T be the random variable that represents the number of
tails obtained when a coin is flipped three times. Then T has 4 possible
values 0, 1, 2, and 3. The outcomes of the experiment and the values of
T are summarized in the next table.

Outcomes         T
hhh              0
hht, hth, thh    1
tth, tht, htt    2
ttt              3

Assuming that the outcomes are all equally likely, the probability
distribution for T is given in the following table.

t 0 1 2 3 Total
P(t) 1/8 3/8 3/8 1/8 1

2. A pair of dice is tossed. Let X denote the sum of the digits. The
probability distribution of X can be found from the following table. The
entry in any particular cell is the sum of the row and column values.

                  1st die
              1   2   3   4   5   6
          1   2   3   4   5   6   7
          2   3   4   5   6   7   8
2nd die   3   4   5   6   7   8   9
          4   5   6   7   8   9  10
          5   6   7   8   9  10  11
          6   7   8   9  10  11  12

x        2     3     4     5     6     7     8     9     10    11    12
P(X=x)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
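Both probability distributions can be built by listing the equally likely outcomes and counting, which is what the tables do by hand. The sketch below is illustrative only; the helper name `pmf_from_outcomes` is an assumption made for this example, and exact fractions are used so that the results match the tables.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def pmf_from_outcomes(outcomes, value_of):
    """Probability mass function when all outcomes are equally likely."""
    counts = Counter(value_of(outcome) for outcome in outcomes)
    total = sum(counts.values())
    return {value: Fraction(count, total) for value, count in sorted(counts.items())}


# T = number of tails in three flips of a coin (8 equally likely outcomes)
coin_outcomes = list(product("ht", repeat=3))
pmf_tails = pmf_from_outcomes(coin_outcomes, lambda outcome: outcome.count("t"))

# X = sum of the values showing on two dice (36 equally likely outcomes)
dice_outcomes = list(product(range(1, 7), repeat=2))
pmf_sum = pmf_from_outcomes(dice_outcomes, sum)

for t, p in pmf_tails.items():
    print(f"P(T = {t}) = {p}")
```

Note that `Fraction` reduces automatically, so 6/36 is shown as 1/6.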

Note:
For any discrete random variable X, the probabilities of the values that it can
assume are such that

$$0 \le P(x) \le 1 \quad \text{and} \quad \sum_{x} P(x) = 1.$$

The cumulative distribution function

The cumulative distribution function is defined as

$$F(x) = P(X \le x) = \sum_{r \le x} P(r)$$

Examples

1. For the probability mass function in example 1 the cumulative
distribution function is

x      0    1    2    3
F(x)   1/8  1/2  7/8  1

2. For the probability mass function in example 2 the cumulative
distribution function is

x      2     3     4     5      6      7      8      9      10     11     12
F(x)   1/36  3/36  6/36  10/36  15/36  21/36  26/36  30/36  33/36  35/36  1

3. Consider a discrete random variable with probability mass function given
below.

x        1    2    3    4
P(X=x)   0.1  0.3  0.4  0.2

[Figure: (a) CDF, (b) PMF]

The graphs above are plots of the probability mass function
(graph on the right) and cumulative distribution function (graph on the left).

A random variable can only take on one value at a time i.e. the events X = x₁
and X = x₂ for x₁ ≠ x₂ are mutually exclusive. The probability of the variable
taking on any one of several different values can be found by simply adding the
appropriate probabilities.

Examples
1. Find the probability of getting 2 or more tails when a coin is flipped 3
times.
P(T ≥ 2) = 3/8 + 1/8 = ½.

2. Find the probability of getting at least one tail when a coin is flipped 3
times.
P(at least 1) = P(1) + P(2) + P(3) = 3/8 + 3/8 +1/8 = 7/8

Or

P(at least 1) = 1 – P(0) = 1 – 1/8 = 7/8.
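The cumulative distribution function is a running total of the probabilities, and the two probabilities in the coin examples are sums over the relevant values. A minimal sketch, using the distribution of T (number of tails in three flips); the variable names are assumptions made for this example.

```python
from fractions import Fraction
from itertools import accumulate

# Probability distribution of T, the number of tails in three coin flips
values = [0, 1, 2, 3]
probs = [Fraction(1, 8), Fraction(3, 8), Fraction(3, 8), Fraction(1, 8)]

# F(t) = P(T <= t) is the running total of the probabilities
cdf = dict(zip(values, accumulate(probs)))

# P(T >= 2): add the probabilities of the mutually exclusive values 2 and 3
p_at_least_two = sum(p for t, p in zip(values, probs) if t >= 2)

# P(T >= 1) through the complement: 1 - P(T = 0)
p_at_least_one = 1 - probs[0]

print(cdf)
print(p_at_least_two, p_at_least_one)
```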

5.3 Mean (expected value), variance and standard deviation of a discrete random variable

The mean or expected value of a random variable X is the average value that
we would expect for X when performing the random experiment many times.

Notation: The mean or expected value of a random variable X will be
represented by μ or E(X).

We can calculate the mean by using the formula

$$E(X) = \mu = \sum x\,P(x).$$

Examples

1. The expected value of the random variable T from above is:

$$E(T) = \sum t\,P(t) = \left(0 \times \tfrac{1}{8}\right) + \left(1 \times \tfrac{3}{8}\right) + \left(2 \times \tfrac{3}{8}\right) + \left(3 \times \tfrac{1}{8}\right) = \tfrac{3}{2}$$

Thus if the experiment of flipping a coin 3 times is repeated a large number of times, we should expect the
average number of tails (per 3 flips) to be about 1.5. Since the number of tails
is an integer value, it will never actually assume the mean value of 1.5. Rather, the
mean reflects the symmetry of the distribution: the extreme values (0 and 3) occur the
same proportion of times (an eighth each) and the middle values (1 and 2) occur the same
proportion of times (three eighths each).

2. The score S obtained in a certain quiz is a random variable with
probability distribution given below.

s        0     1     2     3     4     5
P(S=s)   0.12  0.04  0.16  0.32  0.24  0.12

The mean of the random variable S can be calculated as shown below.

s        0     1     2     3     4     5     sum
P(S=s)   0.12  0.04  0.16  0.32  0.24  0.12  1
s×P(s)   0     0.04  0.32  0.96  0.96  0.60  2.88

μ = E(S) = 2.88

Variance
For a random variable X, the variance, denoted by σ², can be calculated
by using the formula

$$\sigma^2 = \sum (x - \mu)^2 P(x) = \sum x^2 P(x) - \mu^2$$

The standard deviation of X, denoted by σ, is just the positive square root of
σ². This is a measure of the extent to which the values are spread around the
mean.
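The formulas for the mean, variance and standard deviation translate directly into code. The following sketch applies them to the quiz score S and to the number of tails T from the earlier examples; the function name `mean_var_sd` is an assumption made for this example.

```python
import math


def mean_var_sd(pmf):
    """Mean, variance and standard deviation of a discrete random variable.

    pmf maps each possible value x to its probability P(x).
    """
    mean = sum(x * p for x, p in pmf.items())
    variance = sum(x ** 2 * p for x, p in pmf.items()) - mean ** 2
    return mean, variance, math.sqrt(variance)


# Quiz score S from the example above
quiz_pmf = {0: 0.12, 1: 0.04, 2: 0.16, 3: 0.32, 4: 0.24, 5: 0.12}
mean_s, var_s, sd_s = mean_var_sd(quiz_pmf)

# Number of tails T in three coin flips
tails_pmf = {0: 1 / 8, 1: 3 / 8, 2: 3 / 8, 3: 1 / 8}
mean_t, var_t, sd_t = mean_var_sd(tails_pmf)

print(f"S: mean = {mean_s:.2f}, variance = {var_s:.4f}, sd = {sd_s:.4f}")
print(f"T: mean = {mean_t:.2f}, variance = {var_t:.4f}, sd = {sd_t:.4f}")
```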

5.4 Binomial distribution

Assumptions:
A discrete random variable X is said to have a binomial distribution if a random
experiment satisfies the following conditions.

1. The experiment is repeated a fixed number of times. Each repetition is
called a trial. The number of trials is denoted by n.
2. All trials are independent of each other.
3. The outcome for each trial of the experiment can be one of two
complementary outcomes, one labeled “success” and the other labeled
“failure”. A single such trial is called a Bernoulli trial.
4. The probability of success has a constant value of p for each trial and the
probability of failure is q = 1 − p.
5. The random variable X counts the number of successes that occur
in the n trials.

Examples

1. Consider the experiment of flipping a coin 5 times. If we let the event of
getting “tails” on a flip be labeled “success” and “heads” be labeled “failure”, and if
the random variable T represents the number of tails obtained, then T
will be binomially distributed with n = 5, p = ½ and q = ½.

2. A student answers 10 questions in a multiple-choice test by guessing
each answer. For each question, there are 5 possible answers, only one
of which is correct. If we consider a “success” as getting a question right
and consider the 10 questions as 10 independent Bernoulli trials, then
the random variable X representing the number of correct answers will
be binomially distributed with n = 10, p = 0.2 and q = 0.8.

3. Fourteen percent of flights from a certain airport are delayed. If 20
flights are chosen at random, then we can consider each flight to be an
independent Bernoulli trial. If we define a successful trial to be one
where a flight takes off on time, then the random variable Z representing
the number of on-time flights will be binomially distributed with n = 20,
p = 0.86 and q = 0.14.

Formula for the calculation of binomial probabilities

$$P(x) = {}^{n}C_{x}\, p^{x} q^{\,n-x} \quad \text{for } x = 0, 1, 2, \ldots, n.$$

A shorthand way of referring to a binomially distributed random variable X,
based on n trials with probability of success p, is X ~ B(n,p) or X ~ Bin(n,p).

Examples

1. As in the previous examples, let T be the random variable representing
the number of tails when a coin is flipped 3 times. Using the formula
above with n = 3 and p = ½, we can calculate the probability of exactly 2
tails as:

$$P(T = 2) = {}^{3}C_{2} \left(\tfrac{1}{2}\right)^{2} \left(\tfrac{1}{2}\right)^{1} = 0.375.$$

2. Let the random variable X represent the number of correct answers in
the multiple-choice test described above. Then the probability of a
student guessing 3 answers correctly is:

$$P(X = 3) = {}^{10}C_{3}\,(0.2)^{3}(0.8)^{7} = 0.2013,$$

and the probability of guessing seven answers correctly is:

$$P(X = 7) = {}^{10}C_{7}\,(0.2)^{7}(0.8)^{3} = 0.000786.$$
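The binomial formula can be evaluated directly with Python's standard library, where `math.comb(n, x)` gives the number of combinations nCx. The function name `binomial_pmf` is an assumption made for this sketch.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p)."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)


# Exactly 2 tails in 3 flips of a coin
p_two_tails = binomial_pmf(2, 3, 0.5)

# Guessing 3 (and 7) of 10 multiple-choice answers correctly, p = 0.2
p_three_correct = binomial_pmf(3, 10, 0.2)
p_seven_correct = binomial_pmf(7, 10, 0.2)

print(f"{p_two_tails:.3f} {p_three_correct:.4f} {p_seven_correct:.6f}")
```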

Mean and standard deviation of a binomial random variable

If X is a binomial random variable with n trials, probability of success p and
probability of failure q, then the mean, variance and standard deviation of X
can be calculated by using the following formulae.
mean = E(X) = µ = np
var(X) = σ² = npq
standard deviation (X) = σ = √(npq)

Example

For T = the number of tails when a coin is flipped 3 times, n = 3, p = q = ½.

E(T) = μ = 3 × 0.5 = 1.5

σ = √(3 × 0.5 × 0.5) = √0.75 = 0.866

Shape of the binomial distribution

A binomial distribution is symmetric if p = q, positively skewed if p < q and
negatively skewed if p > q. These shapes are illustrated in the graphs for n = 20
shown below.

[Figure: graph of the probabilities of X ~ Bin(20, 0.5) for x = 0, 1, …, 20]

[Figure: graph of the probabilities of X ~ Bin(20, 0.1) for x = 0, 1, …, 20]

[Figure: graph of the probabilities of X ~ Bin(20, 0.9) for x = 0, 1, …, 20]
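The three shapes can also be seen numerically by tabulating the probabilities for n = 20 and noting where the largest probability falls relative to the middle of the range. This is a rough sketch only; the names `binomial_table` and `summary` are assumptions made for this example.

```python
from math import comb, sqrt


def binomial_table(n, p):
    """List of P(X = x) for x = 0, 1, ..., n when X ~ Bin(n, p)."""
    return [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]


n = 20
summary = {}
for p in (0.5, 0.1, 0.9):
    table = binomial_table(n, p)
    peak = table.index(max(table))  # value of x with the highest probability
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    summary[p] = (mean, sd, peak)
    print(f"p = {p}: mean = {mean:.1f}, sd = {sd:.3f}, peak at x = {peak}")
```

With p = 0.5 the peak sits in the middle of the range; with p = 0.1 it sits near the left end, leaving a long tail to the right (positive skewness), and with p = 0.9 the picture is mirrored.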

5.5 Poisson distribution

A Poisson random variable (X) is one that counts the number of events that
occur at random in an interval of time or space. The average number of events
that occur in the time/space interval is denoted by λ.

Examples
1. The number of bad cheques presented for daily payment at a bank.
2. The number of road deaths per month.
3. The number of bacteria in a given culture.
4. The number of defects per square meter on metal sheets being
manufactured.
5. The number of mistakes per typewritten page.

Formula for the calculation of Poisson probabilities

The probability that x events occur in the time/space interval is given by

$$P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!} \quad \text{for } x = 0, 1, 2, \ldots \text{ where } \lambda > 0.$$

A shorthand way of referring to a Poisson distributed random variable X with
average (mean) rate of occurrence λ is X ~ Po(λ).

Examples
1. A bank receives on average λ = 6 bad cheques per day. Calculate the
probability of the bank receiving

(a) exactly 4 bad cheques on a given day.
(b) at least 3 bad cheques on a given day.

Solution

(a) Substituting λ = 6 and x = 4 into the above formula gives

$$P(4) = \frac{6^{4} e^{-6}}{4!} = 0.134.$$

(b) P(X ≥ 3) = 1 – P(X ≤ 2)
= 1 – [P(0) + P(1) + P(2)]
= 1 – 0.062
= 0.938

2. A secretary claims an average mistake rate of 1 per page. A sample page
is selected at random and 5 mistakes are found. What is the probability of
her making 5 or more mistakes on a page if her claim of 1 mistake per page on
average is correct?

Solution

In this case λ = 1 is claimed and the event of interest is X ≥ 5, where X is the
number of mistakes on the page. If the claim is true,

$$P(X \ge 5) = 1 - P(X \le 4) = 1 - \left[\frac{1^{0}e^{-1}}{0!} + \frac{1^{1}e^{-1}}{1!} + \frac{1^{2}e^{-1}}{2!} + \frac{1^{3}e^{-1}}{3!} + \frac{1^{4}e^{-1}}{4!}\right] = 1 - 0.9963 = 0.0037.$$

The above calculation shows that if the claim of 1 mistake per page on average
is true, there is only a 37 in 10 000 chance of getting 5 or more mistakes per
page. This remote chance of 5 or more mistakes when an average of 1 mistake
per page is true casts doubt on whether the claim of 1 mistake per page on
average is in fact true.
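The Poisson probabilities in the two examples can be checked with a short script. This is a sketch only; the function names `poisson_pmf` and `poisson_cdf` are assumptions made for this example.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with mean lam."""
    return lam ** x * exp(-lam) / factorial(x)


def poisson_cdf(x, lam):
    """P(X <= x), found by adding P(0), P(1), ..., P(x)."""
    return sum(poisson_pmf(k, lam) for k in range(x + 1))


# Bad cheques: mean of 6 per day
p_exactly_4 = poisson_pmf(4, 6)
p_at_least_3 = 1 - poisson_cdf(2, 6)

# Typing mistakes: claimed mean of 1 per page
p_five_or_more = 1 - poisson_cdf(4, 1)

print(f"{p_exactly_4:.3f} {p_at_least_3:.3f} {p_five_or_more:.4f}")
```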

Mean and standard deviation of a Poisson random variable

The mean and variance of the Poisson distribution are given by

E(X) = λ and var(X) = λ,

so the standard deviation is √λ.

Example
Calls arrive at a switchboard at an average rate of 1 every 15 seconds. What is
the probability of not more than 5 calls arriving during a particular minute?

Solution
A mean rate of 1 every 15 seconds is equivalent to a mean rate of 4 every
minute. Since the question concerns an interval of 1 minute, λ = 4 (not λ = 1).

$$P(X \le 5) = e^{-4} + \frac{4e^{-4}}{1!} + \frac{4^{2}e^{-4}}{2!} + \frac{4^{3}e^{-4}}{3!} + \frac{4^{4}e^{-4}}{4!} + \frac{4^{5}e^{-4}}{5!} = 0.7851$$
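The key step in this example is rescaling the mean to the interval in the question before applying the Poisson formula. A minimal sketch of that step, with variable names that are assumptions made for this example:

```python
from math import exp, factorial

seconds_per_call = 15   # on average one call every 15 seconds
interval_seconds = 60   # the question concerns one minute

# Mean number of calls in the interval of interest
lam = interval_seconds / seconds_per_call

# P(X <= 5) = P(0) + P(1) + ... + P(5)
p_at_most_5 = sum(lam ** k * exp(-lam) / factorial(k) for k in range(6))

print(f"lambda = {lam}, P(X <= 5) = {p_at_most_5:.4f}")
```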

5.6 Probability distributions of continuous random variables

A random variable X is called continuous if it can assume any of the possible
values in some interval i.e. the number of possible values is infinite. In this
case the approach used for a discrete random variable (a list of possible values with
their corresponding probabilities) cannot be used (since there are an infinite
number of possible values it is not possible to draw up a list of them).
For this reason probabilities associated with individual values of a continuous
random variable X are taken as 0.

The clustering pattern of the values of X over the possible values in the interval
is described by a mathematical function f(x) called the probability density
function. A high (low) clustering of values will result in high (low) values of this
function. For a continuous random variable X, only probabilities associated
with ranges of values (e.g. an interval of values from a to b) will be calculated.
The probability that the value of X will fall between the values a and b is given
by the area between a and b under the curve describing the probability density
function f(x). For any probability density function the total area under the
graph of f(x) is 1.

5.6.1 Normal distribution


A continuous random variable X is normally distributed (that is, it follows a
normal distribution) if the probability density function of X is given by:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\; e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}} \quad \text{for } -\infty < x < +\infty$$

The constants μ and σ are the mean and standard deviation, respectively, of
X. These constants completely specify the density function. A graph of the
curve describing the probability density function (known as the normal curve) for the
case μ = 0 and σ = 1 is shown below.

[Figure: Graph of the standard normal distribution, showing the density p(z) against z for z from −4 to 4]
5.6.2 Properties of the Normal distribution

a. The graph of the function defined above has a symmetric, bell-shaped
appearance.
b. The mean µ is located on the horizontal axis where the graph reaches its
maximum value.
c. At the two ends of the scale the curve describing the function gets closer
and closer to the horizontal axis without actually touching it.
d. The parameter µ shows where the distribution is centrally located and σ
describes the spread of the values around µ.
e. A shorthand way of referring to a random variable X which follows a
normal distribution with mean µ and variance σ² is to write it as
X ~ N(µ, σ²).

Many quantities measured in everyday life have a distribution which closely
matches that of a normal random variable, for example, marks in an exam,
weights of products, heights of a male population.
The next diagram shows graphs of normal distributions for various values of μ
and σ².

An increase (decrease) in the mean µ results in a shift of the graph to the right
(left) e.g. the curve of the distribution with a mean of -2 is moved 2 units to the
left. An increase (decrease) in the standard deviation σ results in the graph
becoming more (less) spread out e.g. compare the curves of the distributions
with σ² = 0.5, 1 and 2.

5.6.3 Empirical example – The Normal distribution and a histogram

Consider the scores obtained by 4 500 candidates in a matric mathematics
examination.

[Figure: Histogram of the marks (horizontal axis: mark; vertical axis: frequency)]

The histogram of the marks has an appearance that can be described by a
normal curve i.e. it has a symmetric, bell-shaped appearance. The mean of the
marks is 51.95 and the standard deviation is 10.

5.7 The Standard Normal Distribution

To find probabilities for a normally distributed random variable, we need to be
able to calculate the areas under the graph of the Normal distribution. Such
areas are obtained from a table showing the cumulative distribution of the
normal distribution. Since the Normal distribution is specified by the mean (µ)
and standard deviation (σ), there are many possible normal distributions that
can occur. It would be impossible to construct a table for each possible mean and
standard deviation. This problem is overcome by transforming X, the normal
random variable of interest [X ~ N(µ, σ²)], to a standardized Normal random
variable:

$$Z = \frac{X - \mu}{\sigma}$$

It can be shown that the transformed random variable Z ~ N(0, 1). The random
variable Z can be transformed back to X by using the formula

$$X = \mu + \sigma Z$$

The Normal distribution with mean µ = 0 and standard deviation σ = 1 is called
the standard normal distribution. The symbol Z is reserved for a random
variable with this distribution. The graph of the standard Normal distribution
appears below.

The Standard Normal Distribution

Various areas under the above normal curve are shown. The standard Normal
table gives the area under the curve to the left of the value z. Other types of
areas can be found by combining several of the areas as shown in the following
examples.

5.7.1 Calculating probabilities using the standard Normal table

The areas shown in the standard Normal table are those under the standard
normal curve to the left of the value of z looked up i.e. the areas are the probabilities
P(Z ≤ z). For example, P(Z ≤ 0.14) = 0.5557.

Note:
• For negative values of z less than the minimum value (–3.79) in the
table, the probabilities are taken as 0, that is, P(Z ≤ z) = 0 for z < –3.79.
• For positive values of z greater than the maximum value (3.79) in the
table, the probabilities are taken as 1, that is, P(Z ≤ z) = 1 for z > 3.79.

Examples
In all the examples that follow Z ~ N(0, 1).

a) P(Z < 1.35) = 0.9115

b) P(Z > –0.47) = 1 – P(Z ≤ –0.47)
= 1 – 0.3192
= 0.6808

c) P(–0.47 < Z < 1.35) = P(Z < 1.35) – P(Z < –0.47)
= 0.9115 – 0.3192
= 0.5923

d) P(Z > 0.76) = 1 – P(Z < 0.76)
= 1 – 0.7764
= 0.2236

e) P(0.95 ≤ Z ≤ 1.36) = P(Z ≤ 1.36) – P(Z ≤ 0.95)
= 0.9131 – 0.8289
= 0.0842

f) P(–1.96 ≤ Z ≤ 1.96) = P(Z ≤ 1.96) – P(Z ≤ –1.96)
= 0.9750 – 0.0250
= 0.95
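The table look-ups in these examples can be cross-checked with `statistics.NormalDist` from the Python standard library (available from Python 3.8), whose `cdf` method returns the area to the left of a value. This is offered only as a check on the table method, not as a replacement for it; the variable names are assumptions made for this sketch.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

p_a = Z.cdf(1.35)                  # P(Z < 1.35)
p_b = 1 - Z.cdf(-0.47)             # P(Z > -0.47)
p_c = Z.cdf(1.35) - Z.cdf(-0.47)   # P(-0.47 < Z < 1.35)
p_f = Z.cdf(1.96) - Z.cdf(-1.96)   # P(-1.96 <= Z <= 1.96)

for label, p in (("a", p_a), ("b", p_b), ("c", p_c), ("f", p_f)):
    print(f"{label}) {p:.4f}")
```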

In all the above examples an area was found for a given value of z. It is also
possible to find a value of z when an area to its left is given. This can be written
as P(Z ≤ zα) = α (α is the Greek letter for a and is pronounced “alpha”). In this
case zα has to be found where α is the area to its left.

Examples

1. Find the value of z that has an area of 0.0344 to its left.

Search the body of the table for the required area (0.0344) and then
read off the value of z corresponding to this area. In this case
z0.0344 = –1.82.

2. Find the value of z that has an area of 0.975 to its left.

Finding 0.975 in the body of the table and reading off the z value gives
z0.975 = 1.96.

3. Find the values of z that have areas of 0.95 and 0.05 to their left.

When searching the body of the table for 0.95 this value is not found.
The z value corresponding to 0.95 can be estimated from the following
information obtained from the table.

z       area to left
1.64    0.9495
?       0.95
1.65    0.9505

Since the required area (0.95) is halfway between the 2 areas obtained
from the table, the required z can be taken as the value halfway
between the two z values that were obtained.

From the table:

$$z = \frac{1.64 + 1.65}{2} = 1.645$$

Exercise: Using the same approach as above, verify that the z value
corresponding to an area of 0.05 to its left is -1.645.

At the bottom of the standard normal table selected percentiles zα are given
for different values of α. This means that the area under the normal curve to
the left of zα is α.

Examples:
1. α = 0.900, zα = 1.282 means P(Z < 1.282) = 0.900.

2. α = 0.995, zα = 2.576 means P(Z < 2.576) = 0.995.

3. α = 0.005, zα = –2.576 means P(Z < –2.576) = 0.005.

The standard normal distribution is symmetric with respect to the mean = 0.
From this it follows that the area under the normal curve to the right of a
positive z entry in the standard normal table is the same as the area to the left
of the associated negative entry (–z) i.e.

P(Z ≥ z) = P(Z ≤ –z)

For example, P(Z ≥ 1.96) = 1 – 0.975 = 0.025 = P(Z ≤ –1.96)
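Reading the table "backwards" corresponds to the `inv_cdf` method of `statistics.NormalDist` (Python 3.8 or later), which returns the z value with a given area to its left. The sketch below checks the values found above and the symmetry property; the variable names are assumptions made for this example.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

z_0344 = Z.inv_cdf(0.0344)  # area of 0.0344 to the left
z_975 = Z.inv_cdf(0.975)    # area of 0.975 to the left
z_95 = Z.inv_cdf(0.95)      # area of 0.95 to the left
z_05 = Z.inv_cdf(0.05)      # area of 0.05 to the left

# Symmetry: the area to the right of 1.96 equals the area to the left of -1.96
right_tail = 1 - Z.cdf(1.96)
left_tail = Z.cdf(-1.96)

print(f"{z_0344:.2f} {z_975:.2f} {z_95:.3f} {z_05:.3f}")
print(f"{right_tail:.3f} {left_tail:.3f}")
```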

5.7.2 Calculating probabilities for any Normal random variable

Let X be a N(μ, σ²) random variable and Z a N(0, 1) random variable. Then

$$P(X \le x) = P\left(Z \le \frac{x - \mu}{\sigma}\right)$$

Example 1
The height (in centimetres) of a population of women is approximately
normally distributed with a mean of μ = 161.3 and a standard deviation of σ = 6.7
centimetres.

Solution
To calculate the probability that a woman is less than 160 centimetres tall, we
first find the z-score for 160 centimetres:

$$z = \frac{160 - 161.3}{6.7} = -0.19$$

then use P(X ≤ 160) = P(Z ≤ –0.19) = 0.4247.

This means that 42.47% (a proportion of 0.4247) of women are less than 160
centimetres tall.

Example 2
The length X (centimetres) of sardines is a N(11.73, 0.1344) random variable, so that σ = √0.1344 ≈ 0.37.
What proportion of sardines is:
(a) longer than 12.7 centimetres?
(b) between 11.049 and 12.319 centimetres?

Solution

(a)
$$P(X > 12.7) = P\left(Z > \frac{12.7 - 11.73}{0.37}\right)$$
= P(Z > 2.62)
= 1 – P(Z ≤ 2.62)
= 1 – 0.9956
= 0.0044

(b)
$$P(11.049 \le X \le 12.319) = P\left(\frac{11.049 - 11.73}{0.37} \le Z \le \frac{12.319 - 11.73}{0.37}\right)$$
= P(–1.84 ≤ Z ≤ 1.59)
= P(Z ≤ 1.59) – P(Z ≤ –1.84)
= 0.9441 – 0.0329
= 0.9112
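Probabilities for any normal random variable can also be computed without standardizing by hand, since `statistics.NormalDist` (Python 3.8 or later) accepts the mean and standard deviation directly. The sketch below uses σ = 0.37 for the sardine lengths, as in the solution above. Because the table method rounds z to two decimal places, the computed values can differ from the table values in the third decimal place. The variable names are assumptions made for this example.

```python
from statistics import NormalDist

# Example 1: heights of women, mean 161.3 cm and standard deviation 6.7 cm
heights = NormalDist(mu=161.3, sigma=6.7)
p_short = heights.cdf(160)  # P(X <= 160)

# Example 2: sardine lengths, mean 11.73 cm and standard deviation 0.37 cm
sardines = NormalDist(mu=11.73, sigma=0.37)
p_long = 1 - sardines.cdf(12.7)                            # P(X > 12.7)
p_between = sardines.cdf(12.319) - sardines.cdf(11.049)    # P(11.049 <= X <= 12.319)

print(f"{p_short:.4f} {p_long:.4f} {p_between:.4f}")
```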

5.8 Finding percentiles by using the standard Normal table

The standard Normal table can be used to find percentiles for random
variables which are normally distributed. The p-th percentile for X is given by

$$x_p = \mu + \sigma z_p$$

Example
The scores X obtained in a mathematics entrance examination are Normally
distributed with μ = 514 and σ = 113. Find the score which marks the 80th
percentile.

Solution

From the standard Normal table, the z-value which is closest to an area of 0.80
in the body of the table is 0.84 (the actual area to its left is 0.7995). The score
which corresponds to a z-value of 0.84 can be found by

$$x_{0.80} = \mu + \sigma z_{0.80} = 514 + (113)(0.84) = 608.92.$$

That is, a score of approximately 609 is better than 80% of all exam
scores.
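The percentile formula can be checked in Python, either by applying x = μ + σz to the standard normal percentile or by asking the N(514, 113²) distribution for its percentile directly. Both routes use `statistics.NormalDist` (Python 3.8 or later). The result differs slightly from 608.92 because the code does not round z to 0.84. The variable names are assumptions made for this sketch.

```python
from statistics import NormalDist

mu, sigma = 514, 113

# Route 1: standard normal percentile, then transform back to X
z_80 = NormalDist().inv_cdf(0.80)
x_80_from_z = mu + sigma * z_80

# Route 2: percentile of the N(514, 113^2) distribution directly
x_80_direct = NormalDist(mu=mu, sigma=sigma).inv_cdf(0.80)

print(f"z = {z_80:.2f}, 80th percentile = {x_80_from_z:.0f}")
```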

Exercises: With reference to the above normal distribution:

(a) Find the 35th percentile, P35.
(b) If a person scores in the top 5% of test scores, what is the minimum
score they could have received?
(c) If a person scores in the bottom 10% of test scores, what is the
maximum score they could have received?

Tutorial
1. The probability distribution of X, the number of cylinders to be
tuned up in the engines of cars at a certain service station, is
shown in the table below.

x             4     6     8
probability   0.5   0.3   0.2

The cost of tune up for each cylinder is R200. What is the
expected tune up cost of cylinders at this service station?

2. A game between two players is fair if each player has the same
mathematical expectation. If someone gives us R5 each time we
roll a 1 or 2 with a balanced die, how much must we pay that
person each time we roll a 3, 4, 5 or 6 to make the game fair?

3. A union wage negotiator feels that the probabilities are 0.40,
0.30, 0.20 and 0.10 respectively that the union members will
get a R1.50 per hour raise, a R1.00 an hour raise, a 50 cents an
hour raise, or no raise at all. What is their expected raise?

4. An importer is offered a shipment of machine tools for
R140,000, and the respective probabilities that he will be able
to sell them for R180,000, R170,000 or R150,000 are 0.32, 0.55
and 0.13. What is the importer's expected gross profit?

5. A builder has to choose between two jobs. The first job
promises a profit of R80,000 with a probability of 0.75 or a
loss of R25,000 with a probability of 0.25; the second job
promises a profit of R120,000 with a probability of 0.5 or a
loss of R45,000 with a probability of 0.5. Which job should
the builder choose if he wants to maximize his expected
profit?

6. It is known that 20% of all callers phoning an internet help
line are put on hold. Suppose 25 people phone this help line.
(a) What is the probability that 10 or more people will be put on
hold?
(b) What is the probability that 5 or fewer people will be put on hold?
(c) What are the mean and standard deviation of the number of people
put on hold?

7. An insurance broker, who has 5 independent contacts,
believes that, for each, the probability of making a sale is
0.4.
(a) What is the probability of at least one sale?
(b) What is the expected number of sales?

8. Consider families with three children, and suppose that each
child (independently) has probability 0.51 of being a boy.
(a) Find the probability that at least one child in such a family
is a boy.
(b) Find the probability that at least two are boys, given that at
least one is a boy.

9. A missile manufacturer claims that his missiles are 90 per
cent effective. The Air Force checks the stock by firing 10
missiles and obtains 5 successes. What is the probability of
obtaining 5 or fewer successes if p = 0.9? What conclusion is
one able to draw?

10. The probability is 0.06 that a patient will cancel a dental
appointment. Consider a group of 10 patients scheduled for
appointments this morning, and let X denote the number of
cancellations in this group.
(a) What is the probability that exactly 2 out of 10 appointments
will be cancelled?
(b) What is the probability that at least 2 out of the 10
appointments will be cancelled?
(c) What is the expected number of cancellations?
(d) What is the standard deviation of the number of cancellations?

11. The intensive care unit at a particular hospital has patients
arriving at an average rate of 5 per day.
(a) What is the probability (4 decimals) that 5 patients arrive on a
particular day?
(b) What is the probability that at least 5 patients arrive on a
particular day?

12. You are in charge of a large fleet of delivery trucks. On
average 1.9 trucks break down per day, and you keep two
trucks available to replace those that break down. If you can
assume that the number of breakdowns on any day is a
Poisson random variable, what is the probability that on
any one day
(a) no extra replacement trucks are needed;
(b) the number of replacement trucks is inadequate?

13. Suppose that the number of goals scored in a soccer match is
a Poisson random variable with mean 3. Find the probability
that 2 or more goals are scored in such a match.

14. On average an insurance company receives 6 claims
between 14:00 and 16:00 on a particular day. What is the
probability that the company receives exactly 17 claims
between 8:00 and 16:00 on that day?

15. Given a standard normal distribution, find the area under the curve
which lies
a. to the left of z = 1.43 i.e. P (z < 1.43)
b. to the right of z = −0.89 i.e. P (z > −0.89)
c. between z = −2.16 and z = −0.65 i.e. P (−2.16 < z < −0.65)
d. to the right of z = 1.96 i.e. P (z > 1.96)
e. between z = −0.48 and z = 1.74 i.e. P (−0.48 < z < 1.74).

16. Find the value of z if the area under a standard normal curve
(a) to the right of z is 0.3622
(b) to the left of z is 0.1131 i.e. find z0.1131
(c) between 0 and z, with z > 0, is 0.4838;
(d) between −z and z, with z > 0, is 0.9500.

17. Given the normally distributed variable X with mean 18 and standard
deviation 2.5, find
(a) P (X < 15);
(b) the value of k such that P (X < k) = 0.2236;
(c) the value of k such that P (X > k) = 0.1814;
(d) P (17 < X < 21).

18. The loaves of bread distributed to local stores by a certain bakery
have an average length of 30 centimeters and a standard
deviation of 2 centimeters. Assume that the lengths are normally
distributed.
(a) what percentage of the loaves are
i. longer than 31.7 centimeters?
ii. between 29.3 and 33.5 centimeters in length?
iii. shorter than 25.5 centimeters?
(b) The owner of the bakery keeps the smaller loaves for private use.
If he/she retains only 5% of all the loaves, what is the
maximum size loaf that is kept?

19. A soft-drink machine is regulated so that it discharges an average
of 200 milliliters per cup. Assume the amount of drink is
normally distributed with a standard deviation equal to 15
milliliters.
(a) What fraction of the cups will contain more than 224 ml?
(b) What is the probability that a cup contains between 191 and 209
ml?
(c) (i) What is the probability that a cup will overflow if the cup
can hold 230 ml?
(ii) Using part (i), how many cups will probably overflow if 230
milliliter cups are used for the next 1000 drinks?
(d) Below what value do we get the smallest 25% of the drinks?

20. The tensile strength of a certain metal component is normally
distributed with a mean 10,000 kilograms per square centimeter and
a standard deviation of 100 kilograms per square centimeter.
Measurements are recorded to the nearest 50 kilograms per
square centimeter.
(a) What proportion of these components exceed 10,150 kilograms
per square centimeter in tensile strength?
(b) If specifications require that all components have tensile
strength between 9800 and 10,200 kilograms per square
centimeter inclusive, what proportion of pieces would we
expect to scrap?

21. The weights of adult male rhesus monkeys are normally distributed
with a mean of 15 pounds and a standard deviation of 3 pounds.
(a) A male rhesus monkey is randomly selected. What is the
probability that its weight is more than 17 pounds?
(b) If 50 male rhesus monkeys are randomly selected, about how
many would you expect to weigh less than 12 pounds?
22. The manager of a gym has determined that the length of time
members spend at the gym is a normally distributed random
variable with a mean of 80 minutes and a standard deviation of
20 minutes.
(a) What proportion of members spend more than 2 hours at the gym?
(b) What proportion of members spend less than 1 hour at the gym?
(c) What is the least amount of time spent by 60% of the
members at the gym?

CHAPTER 6
HYPOTHESIS TESTING
6.1 Formulation of hypotheses and related terminology

Statistical hypothesis
A statistical hypothesis is an assertion (claim) made about a value(s) of a
population parameter.

Purpose
The purpose of testing of hypotheses is to determine whether a claim
that is made could be true. The conclusion about the truth of such a
claim is not stated with absolute certainty, but rather in terms of the
language of probability.

Examples of claims to be tested

1. A supermarket receives complaints that the mean content of “1
kilogram” sugar bags that are sold by them is less than 1 kilogram.

2. A construction company suspects that the proportion of jobs they
complete behind schedule is 0.20 (20%). They want to test whether this
is indeed the case.

Null and alternative hypotheses

Null hypothesis (H 0 )
This is a statement concerning the value of the parameter of interest ( θ ) in a
claim that is made. This is formulated as

H0 : θ = θ0

(The statement that the parameter θ is equal to the hypothetical value θ0.)

Alternative hypothesis (H 1 )

This is a statement about the possible values of the parameter θ that are
believed to be true if H 0 is not true. One of the alternative hypotheses shown
below will apply.

a. H1 : θ < θ0
b. H1 : θ > θ0
c. H1 : θ ≠ θ0

Examples

1. In the first example (above) the parameter of interest is the population
mean µ and the hypotheses to be tested are:

H0 : µ = 1 (Population mean is 1 kilogram)

H1 : µ < 1 (Population mean is less than 1 kilogram)

In terms of the general notation stated above, θ = µ and θ0 = 1.

2. In the second example (above) the parameter of interest is the
population proportion, π, of job completions behind schedule and the
hypotheses to be tested are:

H0 : π = 0.20 (Population proportion is 0.20)

H1 : π ≠ 0.20 (Population proportion is not equal to 0.20)

In terms of the general notation stated above, θ = π and θ0 = 0.20.

One and two-sided alternatives

One-sided alternative
This is a hypothesis that specifies the alternative values (to the null hypothesis)
in a direction that is either below or above that specified by the null
hypothesis.

Example

The alternative hypothesis H1 (see example 1 above) is the alternative that the
value of the parameter is less than that stated under the null hypothesis.

Two-sided alternative
This is a hypothesis that specifies the alternative values (to the null hypothesis)
in directions that can be either below or above that specified by the null
hypothesis.

Example
The alternative hypothesis H1 (see example 2 above) is the alternative that the
value of the parameter is either greater than that stated under the null
hypothesis or less than that stated under the null hypothesis.

6.2 Testing hypotheses for one sample: Terminology and summary of procedure

The testing procedure and terminology will be explained for the test for the
population mean μ with population variance σ 2 known.

The hypotheses to be tested are:

1. H 0 :µ=μ0
H 1 : µ ≠ μ0

2. H 0 :µ=μ0
H 1 : μ < μ0

3. H 0 : μ=μ 0
H 1 : μ > μ0

The data set that is needed to perform the test is x1, x2, ..., xn,
a random sample of size n drawn from the population for which the mean is
tested. The test is performed to see whether or not the sample data are
consistent with what is stated by the null hypothesis.
The instrument that is used to perform the test is called a test statistic. A test
statistic is a quantity calculated from the sample data.

When testing for the population mean, the test statistic used is:

$Z = \dfrac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$

We calculate the value of the statistic by substituting the values of x̄, μ0, σ and n
into the equation and obtain $z_{\text{calc}}$.
If the difference between x̄ and μ0 (and therefore the absolute value of $z_{\text{calc}}$) is
reasonably small, H0 will not be rejected. In this case the sample mean is
consistent with the value of the population mean that is being tested. If this
difference (and therefore the absolute value of $z_{\text{calc}}$) is sufficiently large, H0 will be
rejected. In this case the sample mean is not consistent with the value of the
population mean that is being tested. In order to decide how large this
difference between x̄ and μ0 (and therefore the value of $z_{\text{calc}}$) should be before
H0 is rejected, the following should be considered.

Type I error
• A type I error is committed when the null hypothesis is rejected when, in
fact, it is true i.e. H0 is wrongly rejected.
• For example, a type I error is committed when it is decided that the
statement H0: µ = μ0 should be rejected when, in fact, it is true.

Type II error
• A type II error is committed when the null hypothesis is not rejected
when, in fact, it is false i.e. a decision not to reject H0 is wrong.
• For example, a type II error is committed when it is decided that the
statement H0: µ = μ0 should not be rejected when, in fact, it is false.

The following table gives a summary of possible conclusions and their
correctness when performing a test of hypotheses.

Actually true / Conclusion    Reject H0             Do not reject H0
H0 is true                    Type I error          Correct conclusion
H0 is false                   Correct conclusion    Type II error

A Type I error is often considered to be more serious, and therefore more
important to avoid, than a Type II error. The hypothesis testing procedure is
therefore designed so that there is a guaranteed small probability of rejecting
the null hypothesis wrongly. This probability is never 0. Mathematically, the
probability of a type I error can be stated as

P(type I error) = P(Reject H0 | H0 is true) = α

When testing for the population mean (H0 : μ = μ0):

P(type I error) = P(reject μ = μ0 | μ = μ0 is true) = α

P(type II error) = P(do not reject µ = µ0 | µ = µ0 is false) = β

1 − P(type II error) = 1 − β = the power of the test. It is the probability of
rejecting H0 when it is false, i.e. of not making a type II error.

Probabilities of type I and type II errors work in opposite directions. The more
reluctant you are to reject H0, the higher the risk of accepting it when, in fact, it
is false. The easier you make it to reject H0, the lower the risk of accepting it
when, in fact, it is false.
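
The trade-off described above can be illustrated numerically. The sketch below is only an illustration under assumed values: the numbers μ0 = 500, true mean 503, σ = 8 and n = 30 are hypothetical and do not come from these notes, and the function name `beta_right_tailed` is made up for this example. It computes β for a right-tailed test of the mean (σ known) at several values of α.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal distribution N(0, 1)


def beta_right_tailed(alpha, mu0, mu_true, sigma, n):
    """P(type II error) of a right-tailed z-test when the true mean is mu_true."""
    # Cut-off with area alpha to its right under H0
    z_crit = std_normal.inv_cdf(1 - alpha)
    # Distance between the true mean and mu0, measured in standard errors
    shift = (mu_true - mu0) / (sigma / n ** 0.5)
    # Under the true mean the test statistic is N(shift, 1), so
    # P(do not reject) = P(Z_calc <= z_crit) = Phi(z_crit - shift)
    return std_normal.cdf(z_crit - shift)


# Hypothetical setting (assumed for illustration only)
results = []
for alpha in (0.10, 0.05, 0.01):
    beta = beta_right_tailed(alpha, mu0=500, mu_true=503, sigma=8, n=30)
    results.append((alpha, beta, 1 - beta))
    print(f"alpha = {alpha:.2f}  beta = {beta:.3f}  power = {1 - beta:.3f}")
```

As α is made smaller (H0 becomes harder to reject), the printed β grows and the power 1 − β shrinks, which is the opposite-direction behaviour described in the text.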

Critical value(s) and critical region

Critical (cut-off) value(s)

• The critical value(s) for tests of hypotheses is (are) a value(s) to which the
test statistic is compared in order to determine whether or not the null
hypothesis should be rejected.
• The critical value is determined according to the specified value of α, the
probability of a type I error.

For the test of the population mean the critical value is determined in the
following way. Assuming that H0 is true, the test statistic will follow a standard
Normal distribution i.e.

$Z = \dfrac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} \sim N(0, 1)$

1. When testing H0 versus the alternative hypothesis H1 (µ < µ0), the critical
region lies in the left tail of the standard Normal distribution. This is
called a left-tailed test. That is, the value of −z crit (the critical value) is
such that the area under the standard normal curve to the left of −z crit is
α. That is, P(Z<−z crit ) = α. The graph below illustrates the case for α =
0.05.
That is, P(Z < –1.645) = 0.05:

2. When testing H0 versus the alternative hypothesis H1 (µ > µ0), the critical
region lies in the right tail of the standard Normal distribution. This is
called a right-tailed test. That is, the value of + z crit is such that the area
under the standard Normal curve to the right of z crit is α. That is,
P( Z > z crit ) = α . This leaves an area of 1−α to the left of z crit . The graph
below illustrates the case for α = 0.05. This means 1 – α = 0.95 and thus
P(Z > 1.645) = 0.05:

3. When testing H0 versus the alternative hypothesis H1 (µ ≠ µ0), the critical
regions lie in both the left and right tails of the standard Normal
distribution. This is called a two-tailed test. The critical values are given
by $\pm z_{\text{crit}}$. The area under the standard Normal curve to the left of $-z_{\text{crit}}$ is
α/2 and the area under the standard Normal curve to the right of $+z_{\text{crit}}$ is α/2.
That is, $P(Z < -z_{\text{crit}}) = \alpha/2$ and $P(Z > z_{\text{crit}}) = \alpha/2$.
The area under the normal curve between these two critical values is 1 –
α. The graph below illustrates the case for α = 0.05 i.e.
P(Z < –1.96) = 0.025 and P(Z > 1.96) = 0.025.
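
The critical values quoted for the three cases can also be obtained from the inverse of the standard normal distribution function instead of a table. The following is a minimal sketch using Python's standard library; the helper name `z_critical` is an assumption made for this illustration.

```python
from statistics import NormalDist


def z_critical(alpha, tail):
    """Return the positive critical value z_crit for a test based on N(0, 1).

    tail is "left", "right" or "two". For a left-tailed test the cut-off
    is -z_crit; for a two-tailed test the cut-offs are -z_crit and +z_crit.
    """
    if tail not in ("left", "right", "two"):
        raise ValueError("tail must be 'left', 'right' or 'two'")
    # A two-tailed test puts alpha/2 in each tail
    tail_area = alpha / 2 if tail == "two" else alpha
    return NormalDist().inv_cdf(1 - tail_area)


print(round(z_critical(0.05, "left"), 3))   # 1.645, reject when z < -1.645
print(round(z_critical(0.05, "right"), 3))  # 1.645, reject when z > +1.645
print(round(z_critical(0.05, "two"), 3))    # 1.96, reject when z < -1.96 or z > +1.96
```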

Critical region (CR)
The critical region, or rejection region R, is the set of values of the test statistic
for which the null hypothesis is rejected.

(i) For a left-tailed test, the rejection region is:

$\{z \mid z < -z_{\text{crit}}\}$

(ii) For a right-tailed test, the rejection region is:

$\{z \mid z > +z_{\text{crit}}\}$

(iii) For a two-tailed test, the rejection region is:

$\{z \mid z < -z_{\text{crit}} \text{ or } z > +z_{\text{crit}}\}$

H0 is rejected when there is a sufficiently large difference between the sample
mean x̄ and the mean (μ0) under H0. Such a large difference is called a
significant difference (result of the test is significant). The value of α is called
the level of significance. It specifies the level beyond which this difference
(between x̄ and μ0) is sufficiently large for H0 to be rejected. The value of α is
specified prior to performing the test and is often taken as either 0.05 (5%
level of significance) or 0.01 (1% level of significance).

When H0 is rejected, it does not necessarily mean that it is not true. It means
that according to the sample evidence available it appears not to be true.
Similarly when H0 is not rejected, it does not necessarily mean that it is true. It
means that there is not sufficient sample evidence to disprove H 0.

Critical values for tests based on the standard normal distribution can be found
from the selected percentiles listed at the bottom of the pages of the standard
normal table.

6.3 Test for the population mean (population variance known)
A summary of the steps to be followed in the testing procedure is shown below
(continuing onto the following page).

Test for μ when σ² is known
1. State the null and alternative hypotheses:
H 0 :µ=μ0
H 1 : µ ≠ μ0
or
H 0 :µ=μ0
H 1 : μ < μ0
or
H 0 : μ=μ 0
H 1 : μ > μ0

2. The test statistic:

$Z = \dfrac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} \sim N(0, 1)$

Calculate $z_{\text{calc}}$.

3. State the level of significance α and determine the critical value(s) and
critical region.

(i) For a left-tailed test, the critical region is: $\{z \mid z < -z_{\text{crit}}\}$

(ii) For a right-tailed test, the critical region is: $\{z \mid z > +z_{\text{crit}}\}$

(iii) For a two-tailed test, the critical region is:

$\{z \mid z < -z_{\text{crit}} \text{ or } z > +z_{\text{crit}}\}$

4. If $z_{\text{calc}}$ lies in the critical region, reject H0; otherwise do not reject H0.

5. State the conclusion in terms of the original problem.
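
The five steps above can be sketched as a small function. This is an illustrative sketch only: the function name `z_test_mean`, its argument names and the numbers in the usage lines are hypothetical and are not part of these notes.

```python
from math import sqrt
from statistics import NormalDist


def z_test_mean(x_bar, mu0, sigma, n, alpha, tail):
    """z-test for a population mean when sigma is known.

    tail encodes the alternative hypothesis chosen in step 1:
    "left" (mu < mu0), "right" (mu > mu0) or "two" (mu != mu0).
    Returns (z_calc, z_crit, reject).
    """
    # Step 2: test statistic
    z_calc = (x_bar - mu0) / (sigma / sqrt(n))
    # Step 3: critical value for the chosen level of significance
    tail_area = alpha / 2 if tail == "two" else alpha
    z_crit = NormalDist().inv_cdf(1 - tail_area)
    # Step 4: does z_calc lie in the critical region?
    if tail == "left":
        reject = z_calc < -z_crit
    elif tail == "right":
        reject = z_calc > z_crit
    elif tail == "two":
        reject = abs(z_calc) > z_crit
    else:
        raise ValueError("tail must be 'left', 'right' or 'two'")
    return z_calc, z_crit, reject


# Hypothetical numbers, chosen only to exercise the function
z_calc, z_crit, reject = z_test_mean(
    x_bar=52.0, mu0=50.0, sigma=6.0, n=36, alpha=0.05, tail="right"
)
# Step 5 (the conclusion in words) is left to the analyst
print(f"z_calc = {z_calc:.3f}, z_crit = {z_crit:.3f}, reject H0: {reject}")
```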

Example 1
A hardware store receives complaints that the mean content of the “1
kilogram” cement bags that are sold by them is less than 1 kilogram. A
random sample of 40 cement bags is selected from the shelves and the
mean is found to be 0.987 kilograms. From past experience the standard
deviation of the contents of these bags is known to be 0.025 kilograms. Test, at
the 5% level of significance, whether this complaint is justified.

Solution:

Step 1:
H 0 : μ ≥1 (The complaint is not justified)

H 1 : μ <1 (The complaint is justified)

Step 2:
n = 40, x̄ = 0.987, σ = 0.025, μ0 = 1 (given)

Test statistic: $z_{\text{calc}} = \dfrac{0.987 - 1}{0.025 / \sqrt{40}} = -3.289$.

Step 3:
α = 0.05
Critical region: left-tailed test so critical value = $-z_{\text{crit}} = -1.645$

Step 4:
Since $z_{\text{calc}} < -z_{\text{crit}}$, that is, −3.289 < −1.645, H0 is rejected.

Step 5:
Conclusion: Sample evidence suggests that there is less than 1 kilogram
of cement in the bags. The customers’ complaints are justified.

Example 2

A supermarket manager suspects that the machine filling “500 gram”
containers of coffee is over-filling them i.e. the actual contents of these
containers is more than 500 grams. A random sample of 30 of these
containers is selected from the shelves and the mean found to be 501.8
grams. From past experience the variance of the contents of these containers is
known to be 60 grams². Test at the 5% level of significance whether the
manager’s suspicion is justified.

Solution:

Step 1:
H 0 : μ ≤500 (Suspicion is not justified)

H 1 : μ >500 (Suspicion is justified)

Step 2:
n = 30, x̄ = 501.8, σ2 = 60, μ0 = 500 (given)

Test statistic: $z_{\text{calc}} = \dfrac{501.8 - 500}{\sqrt{60} / \sqrt{30}} = 1.273$

Step 3:
α = 0.05
Critical region: right-tailed test so critical value = z crit =1.645

Step 4:
Since z calc < z crit , that is 1.273 < 1.645, H0 is not rejected.

Step 5:
Conclusion: There is not sufficient sample evidence that the coffee machine is
over-filling the 500 gram coffee containers. The manager’s suspicion is not
justified.

Example 3

During a quality control exercise the manager of a factory that fills cans
of frozen shrimp wants to check whether the mean weights of the cans
conform to specifications i.e. the mean of these cans should be 600
grams as stated on the label of the can. He/she wants to guard against
either over or under filling the cans. A random sample of 50 of these
cans is selected and the mean found to be 595 grams. From past
experience the standard deviation of contents of these cans is known to
be 20 grams. Test, at the 5% level of significance, whether the weights
conform to specifications. Repeat the test at the 10% level of
significance.

Solution:

Step 1:
H 0 : μ=600 (Weights conform to specifications)

H 1 : μ ≠ 600 (Weights do not conform to specifications)

Step 2:
n = 50, x̄ = 595, σ = 20, μ0 = 600 (given)

Test statistic: $z_{\text{calc}} = \dfrac{595 - 600}{20 / \sqrt{50}} = -1.768$

Step 3:
α = 0.05
Critical region: two-tailed test so critical values = $\pm z_{\text{crit}} = \pm 1.96$

Step 4:
Since $-z_{\text{crit}} < z_{\text{calc}} < +z_{\text{crit}}$,
that is, –1.96 < –1.768 < 1.96, H0 is not rejected.

Step 5:
Conclusion: Sample evidence suggests that the weights appear to
conform to specifications.

Suppose the test is performed at the 10% level of significance.
In such a case:

α = 0.10
Critical region: two-tailed test and critical values = $\pm z_{\text{crit}} = \pm 1.645$

Since $z_{\text{calc}} = -1.768 < -1.645$, H0 is rejected.

Conclusion: The weights appear not to conform to specifications.

Thus, being less strict about controlling a type I error (changing α from 0.05 to
0.10) results in a different conclusion about H0 (reject instead of do not reject).
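
The effect of the significance level in Example 3 can be checked with a few lines of code. This is a sketch using only the values given in the example; the variable names are chosen for illustration. The sample mean (595) lies below 600, so the computed statistic is negative, and a two-tailed decision depends only on its absolute value.

```python
from math import sqrt
from statistics import NormalDist

# Values given in Example 3
n, x_bar, sigma, mu0 = 50, 595, 20, 600

z_calc = (x_bar - mu0) / (sigma / sqrt(n))
print(f"z_calc = {z_calc:.3f}")

decisions = {}
for alpha in (0.05, 0.10):
    # Two-tailed test: alpha/2 in each tail
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    decisions[alpha] = abs(z_calc) > z_crit
    verdict = "reject H0" if decisions[alpha] else "do not reject H0"
    print(f"alpha = {alpha:.2f}: critical values = ±{z_crit:.3f} -> {verdict}")
```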

6.4 Test for the population mean (population variance not known,
n < 30): t-test¹²

When performing the test for the population mean for the case where the
population variance is not known, the following modifications are made to the
procedure.

• In the test statistic formula the population standard deviation σ is
replaced by the sample standard deviation S:

$T = \dfrac{\bar{X} - \mu_0}{S / \sqrt{n}}$

• Since the test statistic that is used to perform the test follows a
Student’s t-distribution with n–1 degrees of freedom, critical values are
looked up in the t-tables.

The t-distribution was first proposed in 1908 in a paper by William Gosset, who
wrote under the pseudonym “Student”. The t-distribution has the
following properties.

• The Student t-distribution is symmetric and bell-shaped, but for smaller
sample sizes it shows increased variability when compared to the standard
normal distribution (its curve has a flatter appearance than that of the
standard normal distribution). In other words, the distribution is less
peaked than a standard normal distribution and has thicker tails. As the
sample size increases, the distribution approaches a standard normal
distribution. For n > 30, the differences are negligible.
• The mean is zero (like the standard normal distribution).
• The distribution is symmetrical about the mean.
• The variance is greater than one, but approaches one from above as the
sample size increases (σ² = 1 for the standard normal distribution).

¹² See Appendix A14.
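
The last property can be made concrete. The notes do not give a formula for the variance; the sketch below relies on the standard result that a t-distribution with ν degrees of freedom has variance ν/(ν − 2) for ν > 2, which is brought in here as an outside fact. The function name `t_variance` is hypothetical.

```python
def t_variance(nu):
    """Variance of a Student t-distribution with nu degrees of freedom (nu > 2)."""
    if nu <= 2:
        raise ValueError("the variance is only finite for nu > 2")
    return nu / (nu - 2)


# Degrees of freedom are nu = n - 1 for the one-sample t-test
for n in (5, 10, 30, 100, 1000):
    nu = n - 1
    print(f"n = {n:4d}  nu = {nu:4d}  variance = {t_variance(nu):.4f}")
```

The printed variances are all above one and shrink towards one as n grows, in line with the statement that the t-distribution approaches the standard normal distribution.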

Test for μ when σ² is not known, n < 30 (t-test)
1. State null and alternative hypotheses:
H 0 :µ=μ0
H 1 : µ ≠ μ0
or
H 0 :µ=μ0
H 1 : μ < μ0
or
H 0 : μ=μ 0
H 1 : μ > μ0

2. Calculate the value of the test statistic: $t_{\text{calc}} = \dfrac{\bar{x} - \mu_0}{S / \sqrt{n}}$

3. State the level of significance α and determine the critical value(s) and
critical region.

Degrees of freedom = ν = n–1.

(i) For a left-tailed test, the critical region is: $\{t \mid t < -t_{\text{crit}}\}$

(ii) For a right-tailed test, the critical region is: $\{t \mid t > +t_{\text{crit}}\}$

(iii) For a two-tailed test, the critical region is:

$\{t \mid t < -t_{\text{crit}} \text{ or } t > +t_{\text{crit}}\}$

4. If $t_{\text{calc}}$ lies in the critical region, reject H0; otherwise do not reject H0.

5. State the conclusion in terms of the original problem.

Example 4

A paint manufacturer claims that the average drying time for a new paint is 2
hours (120 minutes). The drying times for 20 randomly selected cans of paint
were obtained. The results are shown below.¹³

123 106 139 135
127 128 119 130
131 133 121 136
122 115 116 133
109 120 130 109

Assuming that the sample was drawn from a normal distribution,

(a) Test whether the population mean drying time is greater than 2 hours
(120 minutes)

(i) at the 5% level of significance.
(ii) at the 1% level of significance.

(b) Test, at the 5% level of significance, whether the population mean drying
time could be 2 hours (120 minutes).

Solution:

(a) Step 1:
H0 : μ = 120 (mean is 2 hours)
H1 : μ > 120 (mean is greater than 2 hours)

Step 2:
n = 20, μ0 = 120 (given), x̄ = 124.1, S = 9.65674 (calculated from the
data).

Test statistic: $t_{\text{calc}} = \dfrac{124.1 - 120}{9.65674 / \sqrt{20}} = 1.899$

(i) Step 3:
α = 0.05
Critical region: right-tailed test.
From the t-distribution table with degrees of freedom = ν = n – 1 = 19,
$t_{\text{crit}} = 1.729$

¹³ See Appendix pg. on how to conduct a t-test for the mean in Excel.

Step 4:
Since t calc >t crit
that is, 1.899 > 1.729 , H0 is rejected.

Step 5:
Conclusion: The mean drying time appears to be greater than 2 hours.

(ii) Step 3:
α = 0.01
Critical region: right-tailed test.
From the t-distribution table with degrees of freedom = ν = n – 1 = 19,
$t_{\text{crit}} = 2.539$

Step 4
Since t calc <t crit
that is, 1.899 < 2.539 , H0 is not rejected.

Step 5:
Conclusion: At the 1% level there is not sufficient evidence that the mean
drying time is greater than 2 hours.

Thus, being more strict about controlling a type I error (changing α from 0.05
to 0.01) results in a different conclusion about H0 (do not reject instead of
reject).

(b) Step 1:

H0 : μ = 120 (mean is 2 hours)


H1 : μ ≠ 120 (mean is not equal to 2 hours)

Step 2:
n = 20, μ0 = 120 (given), x̄ = 124.1, S = 9.65674 (calculated from the
data).

Test statistic: $t_{\text{calc}} = \dfrac{124.1 - 120}{9.65674 / \sqrt{20}} = 1.899$ (as calculated in part (a)).

Step 3:
α = 0.05

Critical region: two-tailed test.
From the t-distribution table with degrees of freedom = ν = n–1 =19,
t crit =± 2.093

Step 4:
Since −t crit ≤ t calc ≤+ t crit
that is, –2.093 <1.899 < 2.093, H0 is not rejected.

Step 5:
Conclusion: The mean drying time appears to be 2 hours.

Note:
• Despite the fact that the same data were used in the above examples,
the conclusions were different. In the first test H0 was rejected, but in
the next 2 tests H0 was not rejected.

• In the first test the probability of a type I error was set at 5%, while in
the second test this was changed to 1%. To achieve this, the critical value was
moved from 1.729 to 2.539, resulting in the test statistic value (1.899)
being less than (instead of greater than) the critical value.

• In the third test (which has a two-sided alternative hypothesis), the
upper critical value was increased to 2.093 (to have an area of 0.025
under the t-curve to its right). Again this resulted in the test statistic
value (1.899) being less than (instead of greater than) the critical value.
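
The calculations in Example 4 can be reproduced from the raw drying times. The sketch below uses Python's standard library, which has no t-distribution quantile function, so the critical values are simply the table values quoted in the example (19 degrees of freedom). The variable names are chosen for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Drying times (minutes) from Example 4
drying_times = [123, 106, 139, 135,
                127, 128, 119, 130,
                131, 133, 121, 136,
                122, 115, 116, 133,
                109, 120, 130, 109]

mu0 = 120
n = len(drying_times)
x_bar = mean(drying_times)
s = stdev(drying_times)            # sample standard deviation (divisor n - 1)
t_calc = (x_bar - mu0) / (s / sqrt(n))
print(f"n = {n}, x_bar = {x_bar:.1f}, S = {s:.5f}, t_calc = {t_calc:.3f}")

# Critical values as read from the t-table in the example (nu = 19)
reject_right_5 = t_calc > 1.729        # (a)(i)  right-tailed, alpha = 0.05
reject_right_1 = t_calc > 2.539        # (a)(ii) right-tailed, alpha = 0.01
reject_two_5 = abs(t_calc) > 2.093     # (b)     two-tailed,   alpha = 0.05
print(reject_right_5, reject_right_1, reject_two_5)
```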

6.5 Test for population proportion

The test for the population proportion (π) is based on the fact that, for large n, the sample
proportion $p = X/n$ is approximately $N(\pi, \pi(1-\pi)/n)$, where n is the sample size and X the
number of items labeled “success” in the sample. From this result it follows
that

$Z = \dfrac{p - \pi_0}{\sqrt{\pi_0 (1 - \pi_0) / n}} \sim N(0, 1)$

where π0 is the value of π under H0.
For this reason the critical value(s) and critical region are the same as that for
the test for the population mean (both based on the standard normal
distribution).

Test for the population proportion π
1. State the null and alternative hypotheses.
H 0 :π =π 0
H 1: π ≠ π0
or
H 0 :π =π 0
H1: π < π 0
or
H 0 :π =π 0
H1: π > π 0

2. Calculate the test statistic: $z_{\text{calc}} = \dfrac{p - \pi_0}{\sqrt{\pi_0 (1 - \pi_0) / n}}$

3. State the level of significance α and determine the critical value(s) and
critical region.

(i) For a left-tailed test, the critical region is: $\{z \mid z < -z_{\text{crit}}\}$

(ii) For a right-tailed test, the critical region is: $\{z \mid z > +z_{\text{crit}}\}$

(iii) For a two-tailed test, the critical region is:

$\{z \mid z < -z_{\text{crit}} \text{ or } z > +z_{\text{crit}}\}$

4. If $z_{\text{calc}}$ lies in the critical region, reject H0; otherwise do not reject H0.

5. State the conclusion in terms of the original problem.

Example 5
A construction company suspects that the proportion of jobs they
complete behind schedule is 0.20 (20%). Of their 80 most recent jobs 22
were completed behind schedule. Test at the 5% level of significance
whether this information confirms their suspicion.
Solution:
Step 1:
H0 : π = 0.20 (Suspicion is confirmed)

H1 : π ≠ 0.20 (Suspicion is not confirmed)

Step 2:
n = 80, x = 22 (given), p = 22/80 = 0.275, π0 = 0.20.

Test statistic: $z_{\text{calc}} = \dfrac{0.275 - 0.20}{\sqrt{0.20 \times 0.80 / 80}} = 1.677$.

Step 3:
α = 0.05

Critical region: two-tailed test so critical value = ± z crit =± 1.96

Step 4:
Since −z crit < z calc <+ z crit
that is, –1.96 < 1.677 < 1.96, H0 is not rejected.

Step 5:
Conclusion: The sample evidence is consistent with the suspicion that 20% of
jobs are completed behind schedule.

Example 6
During a marketing campaign for a new product 176 out of the 200
potential users of this product that were contacted indicated that they
would use it. Is this evidence that more than 85% of all the potential users will
actually use the product? Use α = 0.01.

Solution:
Step 1:
H0 : π ≤ 0.85 (85% of all potential users will use the product)

H1 : π > 0.85 (More than 85% of all potential users will use the product)

Step 2:
n = 200, x = 176, π0 = 0.85 (given), p = 176/200 = 0.88.

Test statistic: $z_{\text{calc}} = \dfrac{0.88 - 0.85}{\sqrt{0.85 \times 0.15 / 200}} = 1.188$.

Step 3:
α = 0.01
Critical region: right-tailed test so critical value = $+z_{\text{crit}} = 2.326$

Step 4:
Since $z_{\text{calc}} < z_{\text{crit}}$,
that is, 1.188 < 2.326, H0 is not rejected.

Step 5:
Conclusion: There is not sufficient evidence that more than 85% of all
potential users will use the product.
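
The proportion tests in Examples 5 and 6 can be re-computed as follows. This is a sketch only; the function name `z_test_proportion` and its arguments are assumptions made for illustration. The critical values come from the inverse standard normal distribution function, with the whole of α placed in one tail for a one-sided test and α/2 in each tail for a two-sided test.

```python
from math import sqrt
from statistics import NormalDist


def z_test_proportion(x, n, pi0, alpha, tail):
    """z-test for a population proportion; tail is "left", "right" or "two".

    Returns (p, z_calc, z_crit, reject).
    """
    p = x / n                                        # sample proportion
    z_calc = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)   # test statistic
    tail_area = alpha / 2 if tail == "two" else alpha
    z_crit = NormalDist().inv_cdf(1 - tail_area)
    if tail == "left":
        reject = z_calc < -z_crit
    elif tail == "right":
        reject = z_calc > z_crit
    elif tail == "two":
        reject = abs(z_calc) > z_crit
    else:
        raise ValueError("tail must be 'left', 'right' or 'two'")
    return p, z_calc, z_crit, reject


# Example 5: 22 of 80 jobs behind schedule, H1: pi != 0.20, alpha = 0.05
ex5 = z_test_proportion(x=22, n=80, pi0=0.20, alpha=0.05, tail="two")
# Example 6: 176 of 200 potential users, H1: pi > 0.85, alpha = 0.01
ex6 = z_test_proportion(x=176, n=200, pi0=0.85, alpha=0.01, tail="right")

for name, (p, z_calc, z_crit, reject) in (("Example 5", ex5), ("Example 6", ex6)):
    print(f"{name}: p = {p:.3f}, z_calc = {z_calc:.3f}, "
          f"z_crit = {z_crit:.3f}, reject H0: {reject}")
```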

6.6 Test for the difference between means for two independent
samples¹⁴
For small samples (both sample sizes n1,n2 < 30)

The tests discussed in the previous sections involve hypotheses concerning
parameters of a single population and were based on a random sample drawn
from a single population of interest. Often the interest is in tests concerning
parameters of two different populations (labeled populations 1 and 2) where
two random samples (one from each population) are drawn.

Examples
1. Are the mean salaries the same for males and females with the same
educational qualifications and work experience?
2. Do smokers and non-smokers have the same mortality rate?
3. Are the variances in drying times for two different types of paints
different?
4. Is a particular diet successful in reducing people’s weights?

When testing for the difference of means from 2 different populations labeled
1 and 2, the hypotheses are:

H0 : μ1 = μ2
H1 : μ1 ≠ μ2
or
H0 : μ1 = μ2
H1 : μ1 > μ2
or
H0 : μ1 = μ2
H1 : μ1 < μ2

¹⁴ See Appendix A15.

Notation
The following notation will be used in the description of the two sample
tests.

Measure                                 Notation (population 1)   Notation (population 2)
sample size                             n1                        n2
sample                                  x1, x2, …, x_n1           x1, x2, …, x_n2
sample mean                             x̄1                        x̄2
sample variance (standard deviation)    S1² (S1)                  S2² (S2)

In the examples that follow, we will assume that the populations from which
the samples are drawn are Normally distributed, that the sample sizes are
small ($n_1, n_2 < 30$) and that the population variances $\sigma_1^2, \sigma_2^2$ are not known but
equal to a common value $\sigma^2$. The individual variances are estimated by the
sample variances $S_1^2$ and $S_2^2$, and the common variance $\sigma^2$ by the pooled estimate

$S^2 = \dfrac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}$

In such a case the resulting statistic follows a t-distribution. The degrees of
freedom is n1 + n2 – 2.

Test for difference between two population means (small sample sizes,
population variances unknown but equal)

Step 1: State null and alternative hypotheses


H 0 : μ1=μ2
H 1 : μ 1 ≠ μ2
or
H 0 : μ1=μ2
H 1 : μ 1> μ 2
or
H 0 : μ1=μ2
H 1 : μ 1< μ 2

Step 2: Calculate the test statistic:

$t_{\text{calc}} = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{S^2 \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}}$

with

$S^2 = \dfrac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$

Step 3: State the level of significance α and determine the critical value(s)
and critical region.

Degrees of freedom = n1 +n 2−2

(i) For a left-tailed test, the critical region is: $\{t \mid t < -t_{\text{crit}}\}$

(ii) For a right-tailed test, the critical region is: $\{t \mid t > +t_{\text{crit}}\}$

(iii) For a two-tailed test, the critical region is:

$\{t \mid t < -t_{\text{crit}} \text{ or } t > +t_{\text{crit}}\}$

Step 4: If tcalc lies in the critical region, reject H0, otherwise do not reject H0.

Step 5: State the conclusion in terms of the original problem.

Example 7
A certain hospital has been getting complaints that the response to calls from
senior citizens is slower (takes longer time on average) than that to calls from
other patients. In order to test this claim, a pilot study was carried out. The
results are shown below.

Patient type       Sample mean response time   Sample standard deviation   Sample size
Senior citizens    5.60 minutes                0.25 minutes                18
Others             5.30 minutes                0.21 minutes                13

Test, at the 1% level of significance, whether the complaint is justified.

Solution:

Label the “senior citizens” and “others” populations as 1 and 2 and their
population mean response times as μ1 and μ2 , respectively.

Step 1:
H 0 : μ1=μ2
H1 : μ1 > μ2

Step 2:
$S^2 = \dfrac{(17 \times 0.25^2) + (12 \times 0.21^2)}{29} = 0.0549$

Test statistic: $t_{\text{calc}} = \dfrac{5.6 - 5.3}{\sqrt{0.0549 \left( \dfrac{1}{18} + \dfrac{1}{13} \right)}} = 3.518$

Step 3:
α = 0.01
Critical region: right-tailed test
From the t-distribution table with ν = n1 + n2 − 2 = 18 + 13 − 2 = 29 degrees of
freedom, t crit =2.462

Step 4:
Since t calc >+t crit , that is, 3.518 > 2.462, H0 is rejected.

Step 5:
Conclusion: The claim is justified i.e. the mean response time for senior citizens
takes longer than that for others.
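
The pooled-variance calculation in Example 7 can be verified from the summary statistics alone. The sketch below is illustrative: the function name `pooled_t` is an assumption, and the critical value 2.462 is the t-table value quoted in the example, since Python's standard library has no t-distribution quantile function.

```python
from math import sqrt


def pooled_t(x_bar1, s1, n1, x_bar2, s2, n2):
    """Two-sample t statistic assuming equal (unknown) population variances.

    Returns (pooled variance, t_calc, degrees of freedom).
    """
    df = n1 + n2 - 2
    s2_pooled = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    t_calc = (x_bar1 - x_bar2) / sqrt(s2_pooled * (1 / n1 + 1 / n2))
    return s2_pooled, t_calc, df


# Summary statistics from Example 7 (1 = senior citizens, 2 = others)
s2_pooled, t_calc, df = pooled_t(5.60, 0.25, 18, 5.30, 0.21, 13)
print(f"S^2 = {s2_pooled:.4f}, t_calc = {t_calc:.3f}, df = {df}")

t_crit = 2.462                 # table value for a right-tailed test, alpha = 0.01, 29 df
reject = t_calc > t_crit
print("reject H0" if reject else "do not reject H0")
```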

Tutorial

1. Write the claim as a mathematical sentence. State the null and
alternative hypotheses.
(a) A water faucet manufacturer announces that the mean flow rate of a
certain type of faucet is less than 2.5 gallons per minute.
(b) A cereal company advertises that the mean weight of the contents of its
1kg size cereal boxes is more than 1kg.
(c) A consumer analyst reports that the mean life of a certain type of
automobile battery is not 74 months.

2. A company uses thousands of light bulbs every year. The type of
light bulb used in the past had an average life of µ = 1000 hours with a
standard deviation of σ = 100 hours. A new brand of light bulb with
lower price is now being considered and will be used unless it has a
smaller average life than the old brand. A random sample of 36 light
bulbs from the new brand is tested and yields an average of x̄ = 968
hours. Based on the sample that has been drawn and using a level of
significance of 0.05, should the company invest in these new light
bulbs?
3. In a labour management discussion, management revealed from past
records that workers at a certain plant took on the average 32.6
minutes with a standard deviation of 6.1 minutes to complete a
certain task. A random sample of 60 workers’ times was then collected
showing that it now took on the average 33.8 minutes to complete the
task. Can this be taken as an indication of a deliberate go-slow strike?
Use a 1% level of significance.
4. A company that sells frozen shrimp prints “Contents 100 grams” on
the package. The owner of the company is concerned that money is
being lost due to overfilling the boxes. A sample of 25 packages
yielded an average of x̄ = 101.58 grams. Suppose it is known from past
experience that the population of package weights has a standard
deviation of σ = 4 grams. Is the owner’s concern well founded? Use a
5% level of significance.

5. A paint manufacturer claims that the average drying time for his
new latex paint is two hours. To test this claim, drying times are
obtained for n = 20 randomly selected cans of paint. The results are
displayed below in minutes.
123 109 115 121 130
127 106 120 116 136
131 128 139 110 133
122 133 119 135 109

If we assume that the drying times are Normally distributed, do the sample
data suggest that the mean drying time is actually greater than the
manufacturer’s claim of 120 minutes? Use α = 0.05. (The sample mean
and standard deviation of the data are given by x̄ = 123.1 and s = 10).
6. An industrial company claims that the mean pH level of the water
in a nearby river is 6.8. You randomly select 19 water samples and
measure the pH of each. The sample mean and standard deviation are
6.7 and 0.24, respectively. Is there enough evidence to reject the
company’s claim at α = 0.05? Assume the population is normally
distributed.

7. The life of a certain part in a cardiac pacemaker is assumed to be
normally distributed. A random sample of 10 of these parts is
subjected to an accelerated life test by running them continuously at
an elevated temperature until failure giving a sample mean of 26
hours and a sample standard deviation of 1.625 hours. The
manufacturer wants to be certain that the mean battery life exceeds
25 hours. What conclusions can be drawn from the sample if a 0.05
level of significance is used?

8. A medical researcher claims that less than 20% of the adults in RSA
are not allergic to any medication. In a random sample of 100 adults,
15% say they are not allergic to any medication. At a 0.01 level of
significance, is there enough evidence to support the researcher’s
claim?
9. Harper’s index claims that 23% of people in the United States are in
favour of outlawing cigarettes. You decide to test this claim and ask a
random sample of 200 people in the United States whether they are in
favour of outlawing cigarettes. Of the 200 people, 27% are in favour.
Using α = 0.05, is there enough evidence to reject the claim?

10. The U.S. National Centre for Health Statistics gathers and publishes
data on the daily intake of selected nutrients by race and income level.
Suppose we are considering protein intake and want to compare the
mean daily intake of people with incomes that are above the poverty
level with those of people with incomes below the poverty level. The
data in Table A give the protein intake, in grams, over a 24-hour period
for people with incomes above and below the poverty level.

TABLE A
Above poverty level      Below poverty level
86.0   69.0              51.4   49.7   72.0
59.7   80.2              76.7   65.8   55.0
68.6   78.1              73.7   62.1   79.7
98.6   69.8              66.2   75.8   65.4
87.7   77.2              65.5   62.0   73.3
x̄1 = 77.49, s1 = 11.34   x̄2 = 66.29, s2 = 9.17

At the 5% significance level, do the data suggest that people with incomes
above the poverty level have a greater mean daily intake of protein than
those with incomes below the poverty level? Assume that the daily
intake of protein for both populations is normally distributed and that the
variances of the two populations are the same.
11. Two different hardening processes, (1) saltwater quenching and (2) oil
quenching, are used on samples of a particular type of metal alloy. The
results are shown here. Assume that hardness is normally distributed
and that the population variances are equal.

Saltwater quench    Oil quench
152                 146
146                 158
154                 152
139                 151
148                 143
a. Find a 95% confidence interval for µ1 − µ2.

b. Based on the confidence interval, do you think that the mean
hardness of the two processes is the same?
c. To confirm/check your answer in part (b) test the hypothesis
that the mean hardness for the saltwater quenching process
equals the mean hardness for the oil quenching process. Use
a .05 level of significance and assume equal variances.
12. Two methods of packaging frozen shrimps yield about the same
average weight per package. However, method 2 is somewhat faster
and a particular company that packages shrimps would like to use it
unless the variance of method 2 is shown to be larger than that of
method 1 at the 5% level of significance. Two samples of 51 packages,
one packed using the first method and one using the second method, are
examined. The sample standard deviations are s1 = 4.2 grams for method
1 and s2 = 5.8 grams for method 2. What decision should be made?

CHAPTER 7
CHI-SQUARE TESTS
7.1 Introduction
Chi-square ($\chi^2$) tests are used to test hypotheses on patterns of outcomes,
which are based on frequency counts, for categorical random variables.

The two chi-square tests that will be covered in this chapter are:

• Goodness-of-fit test: This test is used to assess how closely
the distribution of a categorical variable matches an expected
distribution. For example, has the mode of transportation
(drive, bike, walk, other) used by students to get to class
changed from that of 5 years ago?
• Test of independence: This test is used to assess whether two categorical
variables are independent of one another or if there is an association
between the two variables. For example, is there an association
between gender and smoking habits?

7.2 Properties of the Chi-square distribution

1. It is a family of distributions, one for each number of degrees of freedom.
2. It has only one parameter: the degrees of freedom ($df$).
3. $\chi^2$ is a skewed distribution, skewed to the right. As $df$ increases it
becomes more symmetrical.
4. $\chi^2$ assumes non-negative values only.
5. The total area under the curve is equal to 1.

7.3 The test statistic
The $\chi^2$ test statistic can be computed as follows:

$$\chi^2 = \sum \frac{O^2}{E} - n$$

OR

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where
O = observed frequency, E = expected frequency, n = sample size

For $\chi^2$ tests, the rejection region lies in the right tail of the curve:

[Figure: $\chi^2$ distribution curve. The rejection region is the right tail beyond the critical value $\chi^2_{df;\alpha}$ and has area $\alpha$.]

Rejection Rule: If the calculated test statistic ($\chi^2_{\text{calc}}$) lies in the rejection region,
that is, if $\chi^2_{\text{calc}} > \chi^2_{\text{crit}}$, reject $H_0$ in favour of $H_1$. $\chi^2_{\text{crit}}$ may be found using the $\chi^2$ tables for
the given level of significance ($\alpha$) and degrees of freedom. For a goodness-of-fit test, $df = k - 1$ (where $k$ = the
number of categories of the categorical variable).
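
The two forms of the $\chi^2$ test statistic in this section give the same value. A short derivation, assuming (as is the case for both tests in this chapter) that the observed and the expected frequencies each add up to the sample size, $\sum O = \sum E = n$:

$$\sum \frac{(O - E)^2}{E} = \sum \frac{O^2 - 2OE + E^2}{E} = \sum \frac{O^2}{E} - 2\sum O + \sum E = \sum \frac{O^2}{E} - 2n + n = \sum \frac{O^2}{E} - n$$

The shortcut form $\sum O^2/E - n$ needs one column fewer in a hand calculation, which is why the worked examples use it.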

7.4 Goodness-of-Fit Test
In this type of hypothesis test, one determines whether the data "fit" a
particular distribution or not. For example, one may suspect that the unknown
data fit a binomial distribution. A $\chi^2$ goodness-of-fit test may be used to
determine if there is a fit or not. The null and the alternate hypotheses for
this test may be written in sentences or may be stated as equations or
inequalities.

Example 1
The following table gives the age distribution of a sample of 100 people
arrested for drunk driving:

Age              16-20   21-25   26-30   31-35   36-40
No. of arrests   25      32      19      16      8

At a 1% level of significance, test the hypothesis that the proportion of people
arrested for drunk driving is the same for all age groups.

Solution:
Step 1:
𝐻0: The proportion of people arrested for drunk driving is the same for all age
groups
𝐻1: The proportion is not the same for all age groups
Step 2:

Under $H_0$ each of the 5 age groups has the same expected frequency, E = 100/5 = 20.

Age      Observed (O)   Expected (E)   $O^2/E$
16-20    25             20             31.25
21-25    32             20             51.2
26-30    19             20             18.05
31-35    16             20             12.8
36-40    8              20             3.2
Total    100            100            116.5

∴ test statistic: $\chi^2_{\text{calc}} = \sum \frac{O^2}{E} - n = 116.5 - 100 = 16.5$
Step 3:

Determine the critical value: $\chi^2_{\text{crit}}$

$\alpha = 0.01$

Degrees of freedom: $df = k - 1 = 5 - 1 = 4$

∴ $\chi^2_{\text{crit}} = \chi^2_{df;\alpha} = \chi^2_{4;0.01} = 13.277$

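Step 4:

[Figure: $\chi^2$ curve with the right-tail rejection region of area $\alpha$ beyond $\chi^2_{\text{crit}} = 13.277$.]

Since $\chi^2_{\text{calc}} > \chi^2_{\text{crit}}$, that is, 16.5 > 13.277, reject $H_0$ at the 1% level of significance.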

Step 5:
Sample evidence suggests that the proportion of arrests is not the same for all
age groups.
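
The arithmetic in Example 1 can be checked with a short script. This is only a sketch using the Python standard library; the function name `chi_square_gof` is made up for illustration, and the critical value is simply copied from the $\chi^2$ table as in Step 3 rather than computed.

```python
# Observed arrests per age group, taken from Example 1.
observed = [25, 32, 19, 16, 8]


def chi_square_gof(observed, expected):
    """Return the chi-square statistic sum((O - E)^2 / E)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


n = sum(observed)            # sample size: 100
k = len(observed)            # number of categories: 5
expected = [n / k] * k       # equal proportions under H0: 20 per group

chi2_calc = chi_square_gof(observed, expected)

# Shortcut form used in the notes: sum(O^2 / E) - n
chi2_shortcut = sum(o ** 2 / e for o, e in zip(observed, expected)) - n

chi2_crit = 13.277           # table value for df = 4, alpha = 0.01
reject_h0 = chi2_calc > chi2_crit

print(round(chi2_calc, 2), round(chi2_shortcut, 2), reject_h0)
```

Both forms of the statistic give 16.5, which exceeds 13.277, so the script reaches the same decision as Step 4.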

7.5 Test of Independence

Tests of independence involve using a contingency table of observed (data)
values. A contingency table is said to be of size ($r \times c$) where $r$ = number of
rows and $c$ = number of columns.

A test of independence determines whether two factors are independent or
not. In a test of independence, we state the null and alternate hypotheses in
words. Since the contingency table consists of two factors, the null hypothesis
states that the factors are independent and the alternate hypothesis states
that they are not independent (dependent).

The test of independence is always a right-tailed test, meaning that the
critical region lies in the right tail of the $\chi^2$ distribution. If the expected and
observed values are not close together, then the test statistic is large and
lies far out in the right tail of the chi-square curve, as in the case of a
goodness-of-fit test.

The degrees of freedom for the test of independence are:

$$df = (r - 1) \times (c - 1)$$

The following formula calculates the expected frequency (E):

$$E = \frac{(\text{row total}) \times (\text{column total})}{\text{grand total}}$$

The test statistic for a test of independence is the same as that of a goodness-of-fit test:

$$\chi^2 = \sum \frac{O^2}{E} - n$$

OR

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Example
A random sample of 90 adults is classified according to gender and the
number of hours they watch television during a week:

                  Male   Female
Under 25 hours    27     19
Over 25 hours     15     29

Use a 0.01 level of significance and test the hypothesis that the time spent
watching television is independent of whether the viewer is male or
female.

Solution:

Step 1:

$H_0$: Gender and time spent watching TV are independent

$H_1$: Gender and time spent watching TV are dependent

Step 2:

Next, we need to calculate the test statistic. To do so, we first
compute the expected frequency for each cell using the formula:

$$E = \frac{(\text{row total}) \times (\text{column total})}{\text{grand total}}$$

The expected frequencies are shown in brackets:

                  Male          Female        Total
Under 25 hours    27 (21.47)    19 (24.53)    46
Over 25 hours     15 (20.53)    29 (23.47)    44
Total             42            48            90

Cell 1: E=(46 × 42)/90=21.47
Cell 2: E=(46 × 48)/90=24.53
Cell 3: E=(44 × 42)/90=20.53
Cell 4: E=(44 × 48)/90=23.47

Thus,

Observed (O)   Expected (E)   $O^2/E$
27             21.47          33.95
19             24.53          14.72
15             20.53          10.96
29             23.47          35.83
Total: 90      90             95.46

∴ test statistic: $\chi^2_{\text{calc}} = \sum \frac{O^2}{E} - n = 95.46 - 90 = 5.46$

Step 3:
Determine the critical value: $\chi^2_{\text{crit}}$

$\alpha = 0.01$

Degrees of freedom: $df = (r - 1) \times (c - 1) = (2 - 1) \times (2 - 1) = 1 \times 1 = 1$

∴ $\chi^2_{\text{crit}} = \chi^2_{df;\alpha} = \chi^2_{1;0.01} = 6.635$

Step 4:
Since $\chi^2_{\text{calc}} < \chi^2_{\text{crit}}$, that is, 5.46 < 6.635, do not reject $H_0$ at the 1% level of significance.

Step 5:
Conclusion: there is insufficient evidence to suggest that the time spent
watching TV is dependent on gender.
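
The expected frequencies and the test statistic for this example can be reproduced as follows. This is a minimal sketch in plain Python; the variable names are illustrative, and the critical value is copied from the $\chi^2$ table as in Step 3.

```python
# Observed counts from the example: rows are viewing time, columns are gender.
observed = [
    [27, 19],   # under 25 hours: male, female
    [15, 29],   # over 25 hours:  male, female
]

row_totals = [sum(row) for row in observed]            # [46, 44]
col_totals = [sum(col) for col in zip(*observed)]      # [42, 48]
grand_total = sum(row_totals)                          # 90

# E = (row total x column total) / grand total, for every cell
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum over all cells of (O - E)^2 / E
chi2_calc = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(row_totals) - 1) * (len(col_totals) - 1)     # (r - 1)(c - 1)
chi2_crit = 6.635            # table value for df = 1, alpha = 0.01
reject_h0 = chi2_calc > chi2_crit

print([[round(e, 2) for e in row] for row in expected])
print(round(chi2_calc, 2), df, reject_h0)
```

Because the script does not round the expected frequencies before dividing, its statistic comes out at about 5.47 rather than 5.46. The decision is the same: the statistic is below 6.635, so $H_0$ is not rejected.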

Tutorial
1. What type of data would you use for a $\chi^2$ test?
a. Ratio
b. Categorical
c. Interval
d. Ordinal

Read the following information and answer questions 2 - 5.

A car manufacturer wishes to test if 5 names are equally popular.
The following popularity results were obtained from a sample.

Proposed name                 A    B    C    D    E    Total
Number who prefer the name    14   24   62   80   20   200

2. The null hypothesis is:
a. The car names are not equally popular
b. The car names are equally popular
c. Some names are more popular than others
d. Popularity is not the same for each name

3. The test statistic value is:
a. 85.4
b. 40
c. 16.9
d. 7.779

4. At the 10% significance level the null hypothesis is rejected if the test statistic is:
a. > 7.779
b. > 9.236
c. > 13.277
d. > 15.086

5. The following conclusion at 10% level of significance is true:
a. Reject the null hypothesis and conclude that the names are
equally popular
b. Reject the null hypothesis and conclude that the names are not
equally popular
c. Accept the null hypothesis and conclude that the names are
equally popular
d. Accept the null hypothesis and conclude that the names are not
equally popular

6. Suppose the null hypothesis states that there is no relationship between
income level and donations to charity each year. Then the alternate
hypothesis would state that:
a. Income level and donations to charity have no association
b. Income level has nothing to do with charity donations
c. If income levels rise, then donations to charity remain unchanged
d. Income level and donations to charity are related

7. Suppose when using a $\chi^2$-test you reject $H_0$ at $\alpha = 0.01$. It follows that:
a. $H_0$ will be accepted at $\alpha = 0.05$
b. $H_0$ will be accepted at $\alpha = 0.005$
c. $H_0$ will be rejected at $\alpha = 0.05$
d. $H_0$ will be rejected at $\alpha = 0.005$

8. The critical value for a $\chi^2$-test of a contingency table with 4 columns and
6 rows at $\alpha = 0.05$ is:
a. 36.415
b. 28.869
c. 31.410
d. 24.996

9. The following statement is false about the properties of a chi-squared
distribution:
a. It has only one parameter which is the degrees of freedom
b. $\chi^2$ is skewed to the left
c. $\chi^2$ assumes non-negative values only
d. The total area under the curve is 1

Refer to this information to answer questions 10 – 11.

A test for independence is used to test if gender and handedness (i.e.
right-handed, ambidextrous or left-handed) are associated.

10. The degrees of freedom are:
a. 2
b. 6
c. 3
d. 4

11. If $\alpha = 0.05$ the critical value is:
a. 9.488
b. 12.592
c. 5.991
d. 7.815

Refer to this information to answer questions 12 - 15.

A pharmaceutical company introduced a new drug for migraine in the
market a few months ago. The management wants to determine if the
reaction of customers depends on the region. The company
selected 1600 customers from the regions and asked them if the drug
was “effective” or “not effective”. The results are listed in the table
below:

Region    Reaction
          Effective    Not Effective
East      274          126
South     203          197
West      291          109
North     257          143

12. The appropriate test is:
a. $\chi^2$ goodness-of-fit test
b. $\chi^2$ test for independence
c. Z-test
d. T-test

13. The null hypothesis is:
a. $H_0$: The reaction to the drug depends on the region
b. $H_0$: The reaction to the drug is related to the region
c. $H_0$: The reaction to the drug is independent of the region
d. $H_0$: The reaction to the drug is associated with the region

14. The critical value at the 5% significance level is:
a. 6.251
b. 5.991
c. 7.815
d. 3.841

15. The expected value corresponding to the cell “West – Effective” is:
a. 9.21
b. 47.331
c. 256.25
d. 1025

Refer to the following contingency table to answer questions 16 - 21.

Province          Live in RDP/government    Do not live in RDP/government    Total
                  subsidised dwelling       subsidised dwelling
                  (in 100 000)              (in 100 000)
Western Cape      6                         13                               19
Eastern Cape      4                         14                               18
Northern Cape     1                         2                                3
Free State        3                         7                                10
Kwa-Zulu Natal    6                         23                               29
North West        3                         10                               13
Gauteng           12                        36                               48
Mpumalanga        2                         10                               12
Limpopo           3                         13                               16
South Africa      40                        128                              168
The figures in the table were rounded-off to the nearest 100 000 from the
results of the 2016 Community Survey for ease of calculation. These results
illustrate the distribution of households, in the nine provinces, amongst
RDP/government subsidised dwellings in South Africa.

16. To test if a relationship exists between the type of dwelling
(RDP/government subsidised dwelling or non-RDP/government
subsidised dwelling) a household occupies and the province in which the
household lives, the test one would perform is:
a. χ 2 goodness-of-fit test
b. χ 2 test of independence
c. χ 2 test of homogeneity
d. None of the above
17. The expected value for the cell where the row “Kwa-Zulu Natal” and the
column “Live in RDP/government subsidised dwelling” intersect is:
a. 6.90
b. 21.93
c. 23
d. 5.59
18. The degrees of freedom for the test are:
a. 9
b. 16
c. 8
d. 18

19. At the 5% level of significance, the $\chi^2$ critical value for the test is:
a. 15.507
b. 26.296
c. 28.869
d. 16.919

20. If the test statistic value is 2.89, do we reject or fail to reject $H_0$ at the 5%
level of significance?
a. Fail to reject $H_0$
b. Reject $H_0$
c. Fail to accept $H_0$
d. Cannot be determined

21. The conclusion for the test at the 5% level of significance is:
a. Do not reject $H_0$ and conclude that there is evidence to suggest
that a household’s dwelling type is independent of the province in
which the household lives.
b. Do not reject $H_1$ and conclude that there is evidence to suggest
that a household’s dwelling type is independent of the province in
which the household lives.
c. Reject $H_0$ and conclude that there is evidence to suggest that a
household’s dwelling type is dependent on the province in which
the household lives.
d. Reject $H_1$ and conclude that there is evidence to suggest that a
household’s dwelling type is dependent on the province in which
the household lives.

22. A $\chi^2$ goodness-of-fit test has 23 categories. The critical value at $\alpha = 0.05$
is approximately:
a. 35.172
b. 33.924
c. 32.813
d. 36.415

23. Suppose when using a $\chi^2$-test you rejected $H_0$ at $\alpha = 0.01$. It follows
that:
a. $H_0$ will be accepted at $\alpha = 0.05$
b. $H_0$ will be accepted at $\alpha = 0.005$
c. $H_0$ will be rejected at $\alpha = 0.05$
d. $H_0$ will be rejected at $\alpha = 0.005$

APPENDIX A – EXCEL NOTES

A.1 Installing the Data Analysis ToolPak in Excel


The Analysis ToolPak is an Excel add-in program that provides data analysis
tools for financial, statistical and engineering data analysis.
To load the Analysis ToolPak add-in, execute the following steps.
1. On the File tab, click Options.
2. Under Add-ins, select Analysis ToolPak and click on the Go button.

3. Check Analysis ToolPak and click on OK.

4. On the Data tab, in the Analysis group, you can now click on Data Analysis.

The following dialog box appears.


5. For example, select Histogram and click OK to create a Histogram in Excel.

A.2 Creating a random sample
The Excel software package has a facility with which a random sample of a
specific size can be selected from a given population.
Below is the population data of size 10:
12 15 16 18 20 19 14 11 16 13
Select a random sample of size 5 from this population.
1. Input the population data
2. On the Data tab, in the Analysis group, click Data Analysis.

3. Select Sampling and click OK.

4. Click on the Input Range box and select the range A2:A11.
5. Click on the Random button.
6. Type in 5 in the Number of Samples box
7. Click in the Output Range box and select cell B2.
8. Click OK.
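
For readers who want to cross-check the idea outside Excel, the same kind of draw can be sketched in Python. This is an illustration only, not part of the Excel procedure: the fixed seed is an assumption made so that the run is repeatable, and `random.sample` draws without replacement, which may differ from how the Excel Sampling tool draws its values.

```python
import random

# Population data of size 10, as listed in section A.2.
population = [12, 15, 16, 18, 20, 19, 14, 11, 16, 13]

# A seeded generator (the seed value 1 is an arbitrary assumption)
# so that repeated runs give the same sample.
rng = random.Random(1)

# Draw 5 values at random, without replacement.
sample = rng.sample(population, 5)

print(sample)
```

Changing or removing the seed gives a different random sample each time, just as rerunning the Excel tool does.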

A.3 Drawing a line graph
1. Input the Year and Thando’s weight.
2. Highlight the data and click on the Insert tab and select the scatter
plot with straight lines and markers.

3. Click on the green plus sign, tick the box for Axis Titles and write in
the titles of the axis.
4. Right click on a year, select Format Axis.
Set the Minimum value to 2013 and the Maximum value to 2019.
5. Final output appears as follows:

[Figure: line graph titled “Thando's weight (kg)”, with Year (2013 to 2019) on the horizontal axis and Weight (65 to 75) on the vertical axis.]

A.4 Constructing a Simple Bar Chart


Given the following mid-year population estimates for South Africa by
population group, 2017:
Population group    Number
Black African       45 656 400
Coloured            4 962 900
Indian/Asian        1 409 100
White               4 493 500

1. Input the data.


2. Click on the Insert tab and click on 2D clustered column chart.
3. Click OK.
4. Click on the green plus sign and tick Axis Titles, Chart Title and Data
labels.

5. Label the axes.
6. The completed simple bar graph is as follows:

A.5 Constructing a Component Bar Chart
Given the following mid-year population estimates for South Africa by
population group and sex, 2017:
Population group    Male          Female
Black African       22 311 400    23 345 000
Coloured            2 403 400     2 559 500
Indian/Asian        719 300       689 800
White               2 186 500     2 307 100

1. Input the data.


2. Click on the Insert tab and click on 2D stacked column chart.
3. Click OK.
4. Click on the green plus sign and tick Axis Titles, Chart Title and Data
labels.

5. Label the axes.


6. The completed component bar chart is as follows:

A.6 Constructing a Multiple (Component) Bar Chart
Given the following mid-year population estimates for South Africa by
population group and sex, 2017:

Population group    Male          Female        Total
Black African       22 311 400    23 345 000    45 656 400
Coloured            2 403 400     2 559 500     4 962 900
Indian/Asian        719 300       689 800       1 409 100
White               2 186 500     2 307 100     4 493 500

1. Input the data.


2. Click on the Insert tab and click on 2D clustered column chart.
3. Click OK.
4. Click on the green plus sign and tick Axis Titles, Chart Title.

5. Label the axes.


6. The completed multiple bar chart is as follows:

A.7 Constructing a Pie Chart


The table below shows the weighting of services used in the construction input
price index (Construction Materials Price Indices, April 2019).
Service                                                           Weight (%)
Site preparation                                                  1
Construction of buildings                                         24
Civil engineering                                                 37
Other structures                                                  2
Construction by specialist trade contractors                      6
Plumbing                                                          2
Electrical contractors                                            8
Shopfitting                                                       1
Other building installation                                       8
Painting and decorating                                           1
Other building completion                                         8
Renting of construction or demolition equipment with operators    3

1. Input the data.


2. Click on the Insert tab and click on 2-D Pie chart.
3. Click OK.
4. Add the title to the graph.
The completed pie chart is as follows:

A.8 Constructing a Histogram
This example teaches you how to create a histogram in Excel.
1. First, enter the data and the bin numbers (upper levels).

2. On the Data tab, in the Analysis group, click Data Analysis.

3. Select Histogram and click OK.

4. Select the input range (the cost of daily commute values).
5. Click in the Bin Range box and select the bin range.
6. Click the Output Range option button, click in the Output Range box and
select a cell in which you want the output to appear.
7. Check Chart Output.

8. Click OK.
9. Click on Quick Analysis and choose Chart and then Clustered
10. Click on the More value in the table and delete it.
11. Properly label your bins.
12. To remove the space between the bars, right click a bar, click Format Data
Series and change the Gap Width to 0%.
13. To add borders, right click a bar, click Format Data Series, click the Fill &
Line icon, click Border and select a color.
14. To add the data values above each bar, right click a bar, click Add Data
Labels → Add Data Labels.
Result:

A.9 Constructing a Frequency Polygon
1. Input the Midpoint and frequency values.
2. Highlight the data and click on the Insert tab and select the 2D line
graph.

3. Click on the green plus sign, tick the box for Axis Titles and write in the
titles of the axes.
4. The final output appears as follows:

A.10 Constructing a “Less than” ogive
1. Input the upper class limits and the cumulative frequency values.
2. Highlight the data and click the Insert tab and then click on Scatter with
Straight Lines and Markers.

3. Click on the green plus sign and tick Axes, Axis Titles, Chart Title and Data
Labels.

4. Right click on the horizontal axis and click on Format Axis.
5. Set the Minimum value to 40 and the Maximum value to 70.

A.11 Calculating Summary Statistics


You can use the Analysis ToolPak add-in to generate descriptive statistics. For
example, you may have the scores of 14 participants for a test.

To generate descriptive statistics for these scores, execute the following steps.
1. On the Data tab, in the Analysis group, click Data Analysis.

2. Select Descriptive Statistics and click OK.

3. Select the range A2:A15 as the Input Range.


4. Select cell C1 as the Output Range.
5. Make sure Summary statistics is checked.

6. Click OK.
Result:

A.12 Drawing a Scatter plot

1. Input the data for stock A and stock B given in the notes.

2. Highlight the data for stock A then click the Insert tab and choose
Scatter:

3. Highlight the data for stock B then click the Insert tab and choose
Scatter:

4. Click on the green plus sign and add axis titles, a chart title and data
labels for both scatter plots.

5. The scatter plots for stock A and stock B are as follows:

A.13 Performing Regression Analysis


1. Input the data.

2. Before we begin the analysis, we can create a scatter plot of the
variables shoe size (x) and height (y) and fit a trend line to the data as
follows:

Fit a trend line as follows:


1. Click on the green plus sign
2. Select trend line, click on the arrow and select linear
3. Add the correct axis labels

From the above scatter diagram and linear trend line, it would seem that
height and shoe size have a positive linear correlation.

1. On the Data tab, in the Analysis group, click Data Analysis.

2. Select Regression and click OK.

3. Select the Y Range. This is the predicted variable (also called the dependent
variable).
4. Select the X Range. These are the explanatory variables (also called
independent variables). These columns must be adjacent to each other.
5. Check Labels.
6. Click in the Output Range box and select the cell in which you want the output
to appear.
7. Click OK.

Excel produces the following Summary Output.

R Square
R Square equals 0.79, which indicates a moderate fit. Approximately 79% of the
variation in height is explained by the independent variable shoe size. The
closer R Square is to 1, the better the regression line fits the data.

Coefficients
The regression line is: ^y = 137.03 + 4.14(shoe size). In other words, for each
unit increase in shoe size, height increases by 4.14 centimetres.
You can also use these coefficients to do a forecast. For example, if shoe size
equals 8, a person’s expected height = 137.03 + 4.14(8) = 170.15 centimetres.
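
The forecast above can also be written as a small function. This is a sketch only: the coefficients are copied from the regression line quoted in this section, and the function name `predict_height` is made up for illustration.

```python
# Coefficients copied from the regression line in the notes:
# predicted height = 137.03 + 4.14 * (shoe size)
INTERCEPT = 137.03   # predicted height (cm) when shoe size is 0
SLOPE = 4.14         # change in predicted height (cm) per unit of shoe size


def predict_height(shoe_size):
    """Return the predicted height in centimetres for a given shoe size."""
    return INTERCEPT + SLOPE * shoe_size


print(round(predict_height(8), 2))
```

As with any regression forecast, the function should only be trusted for shoe sizes within the range of the data used to fit the line.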

A.14 Performing a t-test: one sample mean


1. For the first example in section 6.4 of the notes enter the data set on the
drying times of paint in Excel. Create another data set called Dummy
variable and enter at least two zeros as follows:

2. Click on the Data tab.
3. Click Data Analysis.
4. Click on t-Test: Two-Sample Assuming Unequal Variances, then OK.
5. Input the drying times for the Variable 1 Range.
6. Input the Dummy variable for the Variable 2 Range.
7. Type in 120 for Hypothesized Mean Difference.
8. Check Labels.
9. Type 0.05 for Alpha.
10. Click OK.
11. Delete the Dummy variable column.
12. Alter the heading to read: t-Test: Mean.
Output is as follows:

The value of the test statistic is tcalc = 1.899 (3 decimal places). From the table
P(T< = –1.899) = 0.036 (for a left-tailed or one-tail test such as this). This
probability is known as the p-value (the probability of getting a t-value more
remote than the test statistic). When testing at the 5% level of significance, a
p-value of below 0.05 will cause the null hypothesis to be rejected.

A.15 Performing a two sample t-test: equal variances


Example:
A marketing research firm tests the effectiveness of a new flavouring for a
leading soft drink using a sample of 20 people, half of whom taste the soft
drink with the old flavouring and the other half who taste the beverage with
the new flavouring. The people in the study are then given a questionnaire
which evaluates how enjoyable the soft drink was. The scores are given below.
Determine whether there is a significant difference in preference between the
two flavourings at the 5% level of significance.
In other words, test the hypothesis that:
$H_0: \mu_1 = \mu_2$
$H_1: \mu_1 \neq \mu_2$
OR
$H_0: \mu_1 - \mu_2 = 0$
$H_1: \mu_1 - \mu_2 \neq 0$

New Old
13 12
17 8
19 6
11 16
20 12
15 14
18 10
9 18
12 4
16 11

1. Input this data into Excel.
2. Click on the Data tab, then click on Data Analysis.
3. Select t-Test: Two-Sample Assuming Equal Variances and click OK.
4. Select the New data set as the Variable 1 Range.
5. Select the Old data set as the Variable 2 Range.
6. Type in 0 for Hypothesized Mean Difference.
7. Check Labels.
8. Type 0.05 for Alpha.
9. Click Output Range and click on the cell in which you want the output to
appear.
10. Click OK.
The summary output follows:

The value of the test statistic is $t_{\text{calc}}$ = 2.177 (3 decimal places). From the output table,
the two-tail probability is 0.043 (this is a two-tailed test). This probability is
known as the p-value (the probability of getting a t-value more remote than
the test statistic). When testing at the 5% level of significance, a p-value of
below 0.05 will cause the null hypothesis to be rejected. Here 0.043 < 0.05, so $H_0$ is rejected.
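
The test statistic reported by Excel can be verified by hand with the pooled-variance formula. The script below is a minimal sketch using only the Python standard library; it computes the t statistic and degrees of freedom but not the p-value, which would need a t-distribution table or a statistics package.

```python
import math
import statistics

# Questionnaire scores from the example in section A.15.
new = [13, 17, 19, 11, 20, 15, 18, 9, 12, 16]
old = [12, 8, 6, 16, 12, 14, 10, 18, 4, 11]

n1, n2 = len(new), len(old)
mean1, mean2 = statistics.mean(new), statistics.mean(old)
var1, var2 = statistics.variance(new), statistics.variance(old)  # sample variances

# Pooled variance, valid under the equal-variances assumption
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)

# Standard error of the difference between the two sample means
std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

t_calc = (mean1 - mean2) / std_error
df = n1 + n2 - 2

print(round(t_calc, 3), df)
```

The script gives t = 2.177 with 18 degrees of freedom, matching the test statistic quoted from the Excel output.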
