Stats Notes

Chapter 1 introduces key terminology related to data and statistics, defining concepts such as data sets, populations, samples, and various measurement scales. It explains different sampling methods, including probability sampling techniques like simple random sampling, systematic sampling, and stratified random sampling, as well as non-probability methods like convenience and quota sampling. The chapter emphasizes the importance of representative sampling to avoid bias in statistical inference.

Uploaded by

shivaal.s56
Copyright
© All Rights Reserved
CHAPTER 1

TERMINOLOGY

1.1 Definitions

A data set (or simply data) is a set of values collected or obtained when
gathering information on some issue of interest.

Examples
1. The monthly sales of a certain vehicle collected over a period.
2. The number of passengers using a certain airline on various routes.
3. Rating (on a scale from 1 to 5) of a new product by customers.
4. The yields of a certain crop obtained after applying different types of
fertilizer.

Statistics is the collection of methods for planning experiments, obtaining data,
and then organizing, summarizing, presenting, analyzing and interpreting the data
and drawing conclusions from it.

Statistics in the above sense refers to the methodology used in drawing
meaningful information from a data set. This use of the term should not be
confused with statistics (referring to a set of numerical values) or statistics
(referring to measures of description obtained from a data set).

Descriptive statistics is the collection, organization, summarization and
presentation of data. This will be discussed further in chapter 2.

A population refers to all subjects possessing a common characteristic that is
being studied.

Examples
1. The population of people inhabiting a certain country.
2. The collection of all cars of a certain type manufactured during a particular
month.
3. All patients in a certain area suffering from AIDS.
4. Exam marks obtained by all students studying a certain statistics course.

A census is a study where every member (element) of the population is included.

Examples
1. Study of the entire population carried out by the government every 10
years.
2. Special investigations e.g. tax study commissioned by a government.
3. Any study of all the individuals/elements in a population.

A census is usually very costly and time consuming. It is therefore not carried
out very often. A study of a population is usually confined to a subgroup of the
population.

A sample is a subgroup or subset of the population.

The number of values in the sample (sample size) is denoted by n. The number
of values in the population (population size) is denoted by N.

Statistical inference involves generalizing from samples to populations and
expressing the conclusions in the language of probability (chance).

A variable is a characteristic or attribute that can assume different values for
different subjects in the population or sample.

Discrete variables are variables that can assume a finite or countable number of
possible values. Such variables are usually obtained by counting.

Examples
1. The number of cars parked in a parking lot.
2. The number of students attending a statistics lecture.
3. A person’s response (agree, not agree) to a statement. A one (1) is
recorded when the person agrees with the statement, a zero (0) is
recorded when a person does not agree.

Continuous variables are variables that can assume any value within an interval,
so that the number of possible values is infinite and cannot be counted. Such
variables are usually obtained by measurement.

Examples
1. The body temperature of a person.
2. The weight of a person.
3. The height of a tree.
4. The contents of a bottle of cool drink.

1.2 Measurement scales

Qualitative variables are variables that assume non-numerical values.

Examples
1. The course of study at university (B.Com, B.Eng, BA etc.)
2. The grade (A, B, C, D or E) obtained in an examination.

Nominal scale is a level of measurement which classifies data into categories in
which no order or ranking can be imposed on the data.

A variable can be treated as nominal when its values represent categories with
no intrinsic ranking, for example the department of the company in which an
employee works. Other examples of nominal variables include region, postal code
and religious affiliation.

Ordinal scale is a level of measurement which classifies data into categories that
can be ordered or ranked. Differences between the ranks are not meaningful, i.e.
the size of the gap between two ranks cannot be measured.

A variable can be treated as ordinal when its values represent categories with
some intrinsic order or ranking.

Examples
1. Levels of service satisfaction from very dissatisfied to very satisfied.
2. Attitude scores representing degree of satisfaction or confidence and
preference rating scores (low, medium or high).
3. Likert scale responses to statements (strongly agree, agree, neutral,
disagree, strongly disagree).

Quantitative variables are variables which assume numerical values.

Examples
Discrete and continuous variable examples given above.

Interval scale is a level of measurement which classifies data that can be ordered
and ranked and where differences are meaningful. However, there is no
meaningful zero and ratios are meaningless.

Examples
1. The difference between a temperature of 100 degrees and 90 degrees is
the same difference as that between 90 degrees and 80 degrees. Taking
ratios in such a case does not make sense.
2. When referring to dates (years) or temperatures measured (degrees
Fahrenheit or Celsius) there is no natural zero point.

Ratio scale is a level of measurement where differences and ratios are
meaningful and there is a natural zero. This is the “highest” level of
measurement in terms of possible operations that can be performed on the
data.

Examples
Variables like height, weight, mark (in test) and speed are ratio variables. These
variables have a natural zero and ratios make sense when doing calculations
e.g. a weight of 80 kilograms is twice as heavy as one of 40 kilograms.

Summary of the 4 measurement scales:

Measurement scale   Examples                       Meaningful calculations
Nominal             Types of music                 Put into categories
                    University faculties
                    Vehicle makes
Ordinal             Motion picture ratings:        Put into categories
                    G – General audiences          Put into order
                    PG – Parental guidance
                    PG-13 – Parents cautioned
                    R – Restricted
                    NC-17 – No under 17
Interval            Years: 2009, 2010, 2011        Put into categories
                    Months: 1, 2, . . . , 12       Put into order
                                                   Differences between values are meaningful
Ratio               Rainfall                       Put into categories
                    Humidity                       Put into order
                    Income                         Differences between values are meaningful
                                                   Ratios are meaningful

An experiment is the process of observing some phenomenon that occurs.
An experiment can be observational or designed.

1. A designed experiment can be controlled to a certain extent by the
experimenter. Consider a study of 4 fuel additives on the reduction in
oxides of nitrogen. You may have 4 drivers and 4 cars at your disposal. You
are not particularly interested in any effects of particular cars or drivers on
the resultant oxide reduction. However, you do not want the results for
the fuel additives to be influenced by the driver or car. An appropriate
design of the experiment (way of performing the experiment) will allow
you to estimate effects of all factors of interest without these outside
factors influencing the results.

2. An observational study is not controlled by the experimenter. The
characteristic of interest is simply observed and the results recorded. For
example:
2.1 Collecting data that compares reckless driving of female and male
drivers.
2.2 Collecting data on smoking and lung cancer.

A parameter is a characteristic or measure of description obtained from a
population.

Examples
1. Mean (average) age of all employees working at a certain company.
2. The proportion of registered female voters in a certain country.

A statistic is a characteristic or measure of description obtained from a sample.

Examples
1. The mean (average) monthly salary of 50 selected employees in a certain
government department.
2. The proportion of smokers in a sample of 60 university students.

1.3 Sampling methods

When selecting a sample, the main objective is to ensure that it is as
representative as possible of the population from which it is drawn. When a
sample fails to achieve this objective, it is said to be biased.

Sampling frame (synonyms: "sample frame", "survey frame") is the actual set of
units from which a sample is drawn.

Example
Consider a survey aimed at establishing the number of potential customers for
a new service in a certain city. The research team has drawn 1000 numbers at
random from a telephone directory for the city, made 200 calls each day from
Monday to Friday from 8am to 5pm and asked some questions.

In this example, the population of interest is all the inhabitants in the city. The
sampling frame includes only those city dwellers that satisfy all the following
conditions:

1. They have a telephone.
2. The telephone number is included in the directory.
3. They are likely to be at home from 8am to 5pm from Monday to Friday.
4. They are not people who refuse to answer telephone surveys.

The sampling frame in this case definitely differs from the population. For
example, it under-represents people who have no telephone (e.g. the poorest),
who have an unlisted number, who were not at home at the time of the calls
(e.g. employed people), or who don't like to participate in telephone
interviews (e.g. busier and more active people). Such differences between the
sampling frame and the population of interest are a main cause of bias when
drawing conclusions based on the sample.

Probability samples are drawn according to the laws of chance. These include
simple random sampling, systematic sampling and stratified random sampling.

In simple random sampling each sample of a given size that can be drawn will
have the same chance of being drawn. Most of the theory in statistical inference
is based on random sampling being used.

Examples
1. The 6 winning numbers (drawn from 49 numbers) in a Lotto draw. Each
potential sample of 6 winning numbers has the same chance of being
drawn.

2. Each name in a telephone directory could be numbered sequentially. If
the sample size was to include 2 000 people, then 2 000 numbers could
be randomly generated by computer or numbers could be picked out of a
hat. These numbers could then be matched to names in the telephone
directory, thereby providing a list of 2 000 people.
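The computer-generated selection described in example 2 can be sketched in Python. This is only an illustration under stated assumptions: the directory of 10 000 names is made up, and the function name `simple_random_sample` is hypothetical. The standard library call `random.sample` draws without replacement, so every possible set of the requested size has the same chance of being drawn.

```python
import random

def simple_random_sample(names, n, seed=None):
    """Draw n distinct names; every subset of size n is equally likely."""
    rng = random.Random(seed)      # a seed makes the draw reproducible
    return rng.sample(names, n)    # sampling without replacement

# Hypothetical directory: names numbered sequentially from 1 to 10 000.
directory = [f"Person {i}" for i in range(1, 10001)]
sample = simple_random_sample(directory, 2000, seed=1)
print(len(sample))                 # number of people selected
```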

A random sample can be selected by using a table of random numbers.

Example

Suppose the first 6 random numbers in the table of random numbers are:
10480, 22368, 24130, 42167, 37570, 77921.
Use these numbers to select the 6 winning numbers in a Lotto draw.

The 49 numbers from which the draw is made all involve 2 digits i.e. 01, 02, . .
. , 49.
Putting the above numbers from the table of random numbers next to each
other in a string of digits gives: 10 48 02 23 68 24 13 04 21 67 37 57 07 79 21 .

The winning numbers can be selected by taking pairs of digits between 01 and 49
(discarding any numbers outside this range and any repeats), working either
from left to right or from right to left in the above string.

By working from left to right the winning numbers are: 10, 48, 2, 23, 24 and 13.
By working from right to left the winning numbers are: 21, 7, 37, 4, 13 and 24
(the second 21 is a repeat and is discarded).
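The rule used in this example (pair up the digits, discard values outside 01 to 49, discard repeats) can be written as a short Python sketch. The function name `pick_from_random_digits` and its parameters are assumptions made for illustration only.

```python
def pick_from_random_digits(numbers, low=1, high=49, count=6, reverse=False):
    """Select `count` distinct values in [low, high] from 2-digit pairs."""
    digits = "".join(str(n) for n in numbers)          # one long string of digits
    pairs = [digits[i:i + 2] for i in range(0, len(digits) - 1, 2)]
    if reverse:
        pairs.reverse()                                # work from right to left
    chosen = []
    for pair in pairs:
        value = int(pair)
        # keep the value only if it is in range and not already chosen
        if low <= value <= high and value not in chosen:
            chosen.append(value)
        if len(chosen) == count:
            break
    return chosen

table = [10480, 22368, 24130, 42167, 37570, 77921]
left_to_right = pick_from_random_digits(table)
right_to_left = pick_from_random_digits(table, reverse=True)
print(left_to_right)   # [10, 48, 2, 23, 24, 13]
print(right_to_left)   # [21, 7, 37, 4, 13, 24]
```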

The advantage of simple random sampling is that it is simple and easy to apply
when small populations are involved. However, because every person or item in
a population has to be listed before the corresponding random numbers can be
read, this method is very cumbersome to use for large populations and cannot
be used if no list of the population items is available. It can also be very time
consuming to try and locate every person included in the sample. There is also
a possibility that some of the persons in the sample cannot be contacted at all.

Systematic sampling is a sampling method in which data is obtained by selecting
every kth object, where k is approximately $\frac{N}{n}$.

Examples
1. A manufacturer might decide to select every 20th item on a production
line to test for defects and quality. This technique requires the first item
to be selected at random as a starting point for testing and, thereafter,
every 20th item is chosen.
2. A market researcher might select every 10th person who enters a
particular store, after selecting a person at random as a starting point; or
interview occupants of every 5th house in a street, after selecting a house
at random as a starting point.

3. A systematic sample of 500 students is to be selected from a university
with an enrolled population of 10 000. In this case the population size
N = 10 000 and the sample size n = 500. Then every
$\frac{10000}{500} = 20$th student will be included in the sample. The
first student in the sample can be randomly selected from an alphabetical
list of students and thereafter every 20th student can be selected until
500 names have been obtained.
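Example 3 can be sketched in Python as follows. The function name `systematic_sample` is hypothetical, and the sketch assumes that the random starting point is chosen among the first k items of the list, so that exactly n items are obtained without running past the end of the list.

```python
import random

def systematic_sample(items, n, seed=None):
    """Select every k-th item after a random start, with k = N // n."""
    N = len(items)
    k = N // n                       # sampling interval, approximately N/n
    rng = random.Random(seed)
    start = rng.randrange(k)         # assumed: random start among the first k items
    return [items[start + i * k] for i in range(n)]

students = list(range(1, 10001))     # student numbers 1 to 10 000 (illustrative)
chosen = systematic_sample(students, 500, seed=0)
print(len(chosen), chosen[1] - chosen[0])   # 500 20
```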

Stratified random sampling involves sampling in which the population is divided
into groups (called strata) according to some characteristic. Each of these strata
is then sampled using random sampling.

A general problem with random sampling is that you could, by chance, miss out
a particular group in the sample. However, if you subdivide the population into
groups, and sample from each group, you can make sure the sample is
representative. Some examples of strata commonly used are those according to
province, age and gender. Other strata may be according to religion, academic
ability or marital status.

Example
In a study investigating the expenditure pattern of consumers, they were divided
into low, medium and high income groups.

Income group   Percentage of population
low            40
medium         45
high           15

A stratified sample of 500 consumers is to be selected for this study.

When sampling is proportional to size (an income group comprises the same
percentage of the sample as of the population) the sample sizes for the strata
should be calculated as follows.

low: $\frac{40 \times 500}{100} = 200$; medium: $\frac{45 \times 500}{100} = 225$; high: $\frac{15 \times 500}{100} = 75$
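The proportional allocation above can be checked with a few lines of Python. The function name `proportional_allocation` is an assumption made for illustration. In general the results need not be whole numbers, and how to round them is left open here.

```python
def proportional_allocation(percentages, sample_size):
    """Stratum sample size = (percentage of population x total sample size) / 100."""
    return {stratum: pct * sample_size / 100
            for stratum, pct in percentages.items()}

income_groups = {"low": 40, "medium": 45, "high": 15}
allocation = proportional_allocation(income_groups, 500)
print(allocation)   # {'low': 200.0, 'medium': 225.0, 'high': 75.0}
```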

Convenience Sampling – Sampling in which data that is readily available is
used, e.g. surveys done on the internet. Methods of this kind include quota
sampling.

Quota sampling – Quota sampling is performed in 4 stages.

(a) Stage 1: Decide which characteristics of the elements/individuals in the
population to be sampled are of importance.
(b) Stage 2: Decide on the categories to be sampled from. These categories
are determined by cross-classification according to the characteristics
chosen at stage 1.
(c) Stage 3: Decide on the overall number (quota) and numbers (sub-quotas)
to be sampled from each of the categories specified in stage 2.
(d) Stage 4: Collect the information required until all the numbers (quotas)
are obtained.

Example
A company is marketing a new product and needs to know how potential
customers might react to the product.

Stage 1: It is decided that age (the 3 groups under 20, 20-40, over 40) and
gender (male, female) are the characteristics that will determine the
sample.

Stage 2: The 6 categories to be sampled from are (male under 20), (male
20-40), (male over 40), (female under 20), (female 20-40) and (female
over 40).

Stage 3: The numbers (sub-quotas) to be sampled are:
(male under 20) = 40; (male 20-40) = 60; (male over 40) = 25;
(female under 20) = 35; (female 20-40) = 65 and (female over 40) = 30.
The total quota is the total of all the sub-quotas i.e. 255.

Stage 4: Visit a place where individuals to be interviewed are readily
available e.g. a large shopping center and interview people until all the
quotas are filled.
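Stages 2 and 3 of this example amount to a cross-classification followed by a total, which can be sketched in Python. The variable names are illustrative only. `itertools.product` lists every combination of gender and age group.

```python
from itertools import product

ages = ["under 20", "20-40", "over 40"]
genders = ["male", "female"]

# Stage 2: cross-classify the characteristics chosen at stage 1.
categories = list(product(genders, ages))   # 2 x 3 = 6 categories

# Stage 3: sub-quotas from the example, in the same order as the categories.
sub_quotas = dict(zip(categories, [40, 60, 25, 35, 65, 30]))
total_quota = sum(sub_quotas.values())

print(len(categories), total_quota)         # 6 255
```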

Quota sampling is a cheap and convenient way of obtaining a sample in a short
space of time. However, this method of sampling is not based on the laws of
chance and cannot guarantee a sample that is representative of the population
from which it is drawn.

When obtaining a quota sample, interviewers often choose who they like (within
criteria specifications) and may therefore select those who are easiest to
interview. Therefore sampling bias can result. It is also impossible to estimate
the accuracy of quota sampling (because sampling is not random).

Chapter 1 – Tutorial
1. Determine whether the data set is a population or a sample.
(a) The age of the Premier of each Province in South Africa.
(b) The speed of every 5th car passing a police speed trap.
(c) A survey of 500 students from a university with 10000 students.
(d) The annual salary for each employee at Coke.
(e) The cholesterol level of 20 patients in a hospital with 100
patients.

2. Identify the population and the sample for each of the statements
below.
(a) A study of 33043 infants in Italy was conducted to find a
link between a heart rhythm abnormality and sudden
infant death syndrome.
(b) A survey of 2104 households in South Africa found that 42%
subscribe to DSTV.
(c) A survey of 546 women found that more than 56% are the
primary investor in their household.
(d) The Ancient Mayans predicted the end of the world to be in
2012. A study was designed in KwaZulu-Natal where 1200
residents were randomly asked whether they believed the
prediction or not. The results indicated that 52% of the
interviewed residents believed in the Mayans' prediction.

3. Determine whether the numeric value is a parameter or a statistic.
(a) The average annual salary for 25 of a company's 1250 statisticians is
R250000.
(b) In a survey of a sample of high school students, 41% said
that their mother has taught them the most about
managing money.
(c) In a survey of a sample of computer owners, 15% said their computer
had a malfunction that needed to be repaired by a service
technician.
(d) In a recent year, the interest category for 9% of all new magazines
was sport.
(e) In a recent year, the average stats mark for all graduates at UKZN was
34%.
(f) In a recent survey of 1000 adults from Gauteng, 34% said
using a cell phone while driving should be illegal.

4. For each of the following random variables (a) to (p):
(i) indicate the data type (i.e. discrete or continuous), and
(ii) the measurement scale (i.e. nominal, ordinal, interval or
ratio).

(a) The shelf life of milk.
(b) The number of life policies issued per day.
(c) The area of a shop floor.
(d) The number of pages in a text book.
(e) The flavours available in Dogmore food chunks.
(f) The types of wood that could be used to make a desk.
(g) The size categories for shoes.
(h) The voltage produced by a generator.
(i) The car types in the Mercedes range.
(j) The "yes/no/sometimes" response to "Do you drink Gin?".
(k) The number of loaves of bread sold daily by a bakery.
(l) The income per day of a bakery.
(m) The monthly birth rate at a maternity hospital.
(n) The mass of babies at birth.
(o) The daily distance travelled by a courier service truck.
(p) The names of teams in a cricket league.

5. A city's telephone book lists 100 000 people. If the telephone
book is the frame for a study, how large would the sample size
be if systematic sampling were done on every 200th person?

6. If every 11th item is systematically sampled to produce a
sample size of 75 items, approximately how large is the
population?

7. In a study investigating liver function in lions, the lions were
divided into 3 groups: Adult Males, Adult Females and Cubs (less
than a year old).

Lions                        Percentage of population
Adult Males                  20
Adult Females                32
Cubs (less than a year old)  48

A stratified sample of 120 lions is to be selected for this study.
How many lions should be represented by each stratum?
8. Cadbury wants to market a new type of chocolate and needs to
know how potential customers might react to the product. It is
decided that age (under 21, 21 to 40, over 40), gender (male,
female) and race (black, coloured, Indian, white) are the
characteristics that will determine the sample. Quota sampling
is to be used.
(a) How many possible categories are there to be sampled from
(in stage 2 of quota sampling)?
(b) What is an advantage of using quota sampling?
(c) What is a disadvantage in using quota sampling?

CHAPTER 2
DESCRIPTIVE STATISTICS
(Exploratory Data Analysis)
All the data sets used in this chapter will be regarded as samples drawn from
some population. One of the main purposes of studying a sample is to get
information about the population. The main focus here is on summarizing and
describing some features of the data.

2.1 Graphs and diagrams

A line graph [1] is a graph used to present some characteristic recorded over
time.

Example

[Figure: line graph, "Thando's weight (kg)". Vertical axis: Weight (67 to 75);
horizontal axis: Year (2013 to 2019).]

The graph above shows how Thando's weight varied from the beginning of 2014
to the beginning of 2018.

[1] See Appendix A3.

Bar charts

A bar chart or bar graph is a chart consisting of rectangular bars with heights
proportional to the values that they represent. Bar charts are used for
comparing two or more values that are taken over time or under different
conditions.

Simple Bar Chart [2]

In a simple bar chart the figures used to make comparisons are represented by
bars. These are either drawn vertically or horizontally. Only totals are
represented. The height or length of the bar is drawn in proportion to the size
of the figure being presented.

Example

The South African population data is displayed in the following simple bar chart.

[Figure: simple bar chart, "South African population (2015 - 2018)". Vertical
axis: Population (54,000,000 to 58,000,000); horizontal axis: Year (2015, 2016,
2017, 2018). Data labels on the bars, in increasing order: 55,291,225;
56,015,473; 56,717,156; 57,398,421.]

Component Bar Chart [3]

When you want to draw a bar chart to illustrate your data, it is often the case
that the totals of the figures can be broken down into parts or components.

[2] See Appendix A4.
[3] See Appendix A5.

[Figure: component bar chart, "Mid-year population estimates for South Africa by
population group, sex 2017". Vertical axis: Number (0 to 50,000,000); horizontal
axis: Population group. Each bar is divided into a Female and a Male component.]

         Black African   Coloured    Indian/Asian   White
Female   23,345,000      2,559,500   689,800        2,307,100
Male     22,311,400      2,403,400   719,300        2,186,500

You start by drawing a simple bar chart with the total figures as shown above.
The columns or bars (depending on whether you draw the chart vertically or
horizontally) are then divided into the component parts.

Multiple (compound) Bar Chart [4]

You may find that your data allows you to make comparisons of the component
figures themselves. If so, you will want to create a multiple (compound) bar
chart. This type of chart enables you to trace the trends of each individual
component, as well as making comparisons between the components.

[Figure: multiple (compound) bar chart, "Mid-year population estimates for South
Africa by population group, sex 2017". Vertical axis: Number (0 to 50,000,000);
horizontal axis: Population group (Black African, Coloured, Indian/Asian, White).
Separate bars are drawn for Male, Female and Total.]

[4] See Appendix A6.

Pie Chart [5]

A pie chart is a diagram that shows the subdivision of some entity/total into
subgroups. The diagram is in the form of a circle which is divided into slices
with each slice having an area according to the proportion that it makes up of
the total.

Example
The pie chart below shows the weighting of services used in the construction
input price index (Construction Materials Price Indices, April 2019).

Service                                                        Percentage   Degrees
Site preparation                                               1            3
Construction of buildings                                      24           86
Civil engineering                                              37           133
Other structures                                               2            6
Construction by specialist trade contractors                   6            22
Plumbing                                                       2            6
Electrical contractors                                         8            29
Shopfitting                                                    1            2
Other building installation                                    8            27
Painting and decorating                                        1            4
Other building completion                                      8            30
Renting of construction or demolition equipment with operators 3            12

[5] See Appendix A7.

[Figure: pie chart, "Service weighting in the CIPI", with one slice per service
and each slice labelled with its percentage.]

The number of degrees needed for each slice is found by calculating the
appropriate percentage of 360°.
For example, civil engineering $= \frac{37}{100} \times 360^\circ = 133^\circ$
The complete calculations are shown in the table above.
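The slice calculation can be sketched in Python. The function name `slice_angle` is hypothetical. Note that the degrees in the table appear to have been computed from unrounded percentages, so applying this function to the rounded percentages listed there may differ from the table by a degree or so.

```python
def slice_angle(percentage):
    """Angle of a pie slice in degrees: the given percentage of 360."""
    return percentage / 100 * 360

civil_engineering = round(slice_angle(37))
print(civil_engineering)   # 133
```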

2.2 Sigma and subscript notation

The symbol sigma ∑ (Capital S in Greek alphabet) is used to denote “the sum of”
values.
Suppose the symbol $x$ is used to denote some variable of interest in a study. In
order to distinguish between values of this variable, subscripts are used.

$x_1$ is the first value in the data set, which has subscript 1.
$x_2$ is the second value in the data set, which has subscript 2.
.
.
$x_n$ is the $n$th value in the data set, which has subscript $n$.

The sum of these values is written in shorthand notation as

\[ x_1 + x_2 + \cdots + x_n = \sum_{i=1}^{n} x_i \]

If it is understood that the range of subscript indices over which the summation
is taken involves all the $x$ values, the summation can be written simply as:

\[ x_1 + x_2 + \cdots + x_n = \sum x \]

Example 1
If $x_1 = 70$; $x_2 = 74$; $x_3 = 66$; $x_4 = 68$; $x_5 = 71$

Then

\[ \sum_{i=1}^{5} x_i = x_1 + x_2 + \cdots + x_5 = 70 + 74 + 66 + 68 + 71 = 349 \]

The sum of the squares of a set of values is written as $\sum x^2$ for short.

Example 2
For the data set in example 1,

\[ \sum_{i=1}^{5} x_i^2 = (70)^2 + (74)^2 + (66)^2 + (68)^2 + (71)^2 = 24397 \]

Note: In general, $\sum_{i=1}^{n} x_i^2 \neq \left( \sum_{i=1}^{n} x_i \right)^2$

For example, with reference to the abovementioned data:

\[ \sum_{i=1}^{5} x_i^2 = 24397 \neq \left( \sum_{i=1}^{5} x_i \right)^2 = (349)^2 = 121801 \]
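The sums in examples 1 and 2 can be verified with a short Python sketch (the variable names are illustrative). It also shows that the sum of the squares and the square of the sum are different quantities.

```python
x = [70, 74, 66, 68, 71]

sum_x = sum(x)                           # sum of the values
sum_x_squared = sum(v ** 2 for v in x)   # sum of the squared values
square_of_sum = sum_x ** 2               # square of the sum of the values

print(sum_x, sum_x_squared, square_of_sum)   # 349 24397 121801
```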

The summation notation can also be used to write the sum of products of
corresponding values for 2 different sets of values.

\[ \sum_{i=1}^{n} x_i y_i = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n \]

Example: Consider the following values.

$i$      1    2    3    4    5    6
$x_i$   11   13    7   12   10    8
$y_i$    8    5    7    6    9   11

For this data:

\[ \sum_{i=1}^{6} x_i y_i = (11 \times 8) + (13 \times 5) + (7 \times 7) + (12 \times 6) + (10 \times 9) + (8 \times 11) = 88 + 65 + 49 + 72 + 90 + 88 = 452 \]

Note: In general, $\sum_{i=1}^{n} x_i y_i \neq \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)$

For example, with reference to the abovementioned data:
$\sum_{i=1}^{6} x_i = 61$ ; $\sum_{i=1}^{6} y_i = 46$

\[ \therefore \left( \sum_{i=1}^{6} x_i \right) \left( \sum_{i=1}^{6} y_i \right) = 2806 \neq \sum_{i=1}^{6} x_i y_i \]
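As a check on this example, the sum of products and the product of sums can be computed in Python (variable names are illustrative). `zip` pairs each x value with the y value that has the same subscript.

```python
x = [11, 13, 7, 12, 10, 8]
y = [8, 5, 7, 6, 9, 11]

sum_xy = sum(a * b for a, b in zip(x, y))   # sum of products of paired values
product_of_sums = sum(x) * sum(y)           # (sum of x) times (sum of y)

print(sum_xy, product_of_sums)              # 452 2806
```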

The summation notation is used extensively in specifying calculations in
statistical formulae.

2.3 Frequency distributions and related graphs

Frequency distribution

A frequency distribution is a table in which data are grouped into classes and the
number of values (frequencies) which fall in each class is recorded.
The main purpose of constructing a frequency distribution is to gain insight into
the distribution pattern of the frequencies over the classes. Hence, the name
frequency distribution is used to refer to this pattern.

Example 1
In a survey of 40 families in an urban neighbourhood, the number of children
per family was recorded and the following data was obtained.
1 0 3 2 1 5 6 2
2 1 0 3 4 2 1 6
3 2 1 5 3 3 2 4
2 2 3 0 2 1 4 5
3 3 4 4 1 2 4 5

number of children   Tally        frequency (f)
0                    ///          3
1                    //// //      7
2                    //// ////    10
3                    //// ///     8
4                    //// /       6
5                    ////         4
6                    //           2
Total                             40

Note: The sum of the frequencies = sample size, i.e. $\sum f = n$
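The tally for example 1 can be reproduced in Python with `collections.Counter`, which counts how often each value occurs. This is a sketch for checking the table, and the variable names are illustrative.

```python
from collections import Counter

children = [
    1, 0, 3, 2, 1, 5, 6, 2,
    2, 1, 0, 3, 4, 2, 1, 6,
    3, 2, 1, 5, 3, 3, 2, 4,
    2, 2, 3, 0, 2, 1, 4, 5,
    3, 3, 4, 4, 1, 2, 4, 5,
]

freq = Counter(children)               # value -> frequency
for value in sorted(freq):
    print(value, freq[value])          # one row of the frequency table

total = sum(freq.values())             # should equal the sample size n
print("Total", total)
```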

Example 2
Consider the following data of the amount of money spent by 50 DUT staff
members on public transport per day. The highest amount is R64 and the
lowest amount is R39.

Data set: The daily amount of money spent on public commuting by 50 DUT
staff members
57 39 52 52 43
50 53 42 58 55
58 50 53 50 49
45 49 51 44 54
49 57 55 64 45
50 45 51 54 58
53 49 52 51 41
52 40 44 49 45
43 47 47 43 51
55 55 46 54 41

Constructing a frequency distribution

The classes into which the above values can be sorted can be found by
following the steps shown below.

1. Find the maximum and minimum values and calculate the range (R):

$R = X_{max} - X_{min} = 64 - 39 = 25$

2. Decide on the number of classes. Use Sturges’ rule which states that:

number of classes $= k =$ the rounded up value of $(1 + 1.44 \ln n)$

$1 + 1.44 \times \ln(50) = 6.63$, i.e. $k = 7$.

3. Calculate the class width such that:

the number of classes $\times$ class width $>$ range

i.e. $7 \times$ class width $> 25$

$\therefore$ class width $> \frac{25}{7} \approx 3.57$

This suggests a class width of 4.

4. Find the lower value that defines the first class. This is usually a value just
below the minimum value in the data set. Since the minimum value for
this data set is 39, the lowest class can have a minimum value one below
this i.e. 38.

5. Find the lower values that define each of the classes that follow by
successively adding the class width to the lower value of the preceding class:

lower value of the second class = 38 + 4 = 42.

lower value of the third class = 42 + 4 = 46 etc.

Continuing in this way gives the following 7 classes into which the data
values are sorted:

38 – 41, 42 – 45, 46 – 49, 50 – 53, 54 – 57, 58 – 61, 62 – 65
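Steps 1 to 5 can be sketched in Python. The function names `sturges_classes` and `class_limits` are hypothetical, and the sketch assumes whole-number data, so that each upper limit is one less than the next lower limit. The width is taken as the smallest whole number for which the number of classes times the width strictly exceeds the range.

```python
import math

def sturges_classes(n):
    """Number of classes: the rounded up value of 1 + 1.44 ln(n)."""
    return math.ceil(1 + 1.44 * math.log(n))

def class_limits(lowest, width, k):
    """Inclusive (lower, upper) limits of k classes for whole-number data."""
    return [(lowest + i * width, lowest + (i + 1) * width - 1) for i in range(k)]

n, x_min, x_max = 50, 39, 64
value_range = x_max - x_min            # R = 25
k = sturges_classes(n)                 # 7 classes
width = value_range // k + 1           # smallest integer with k * width > R
limits = class_limits(x_min - 1, width, k)   # first class starts one below the minimum

print(k, width)
print(limits)
```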

The table below shows the classes and their frequencies for the cost of
commuting data set.

class limits   f
38 – 41        4
42 – 45        10
46 – 49        8
50 – 53        15
54 – 57        9
58 – 61        3
62 – 65        1
Total          50
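The class frequencies in this table can be checked by counting, for each class, the values that lie between its lower and upper limits (both included). The Python sketch below uses illustrative variable names.

```python
spend = [
    57, 39, 52, 52, 43,
    50, 53, 42, 58, 55,
    58, 50, 53, 50, 49,
    45, 49, 51, 44, 54,
    49, 57, 55, 64, 45,
    50, 45, 51, 54, 58,
    53, 49, 52, 51, 41,
    52, 40, 44, 49, 45,
    43, 47, 47, 43, 51,
    55, 55, 46, 54, 41,
]

# Inclusive classes 38-41, 42-45, ..., 62-65 (class width 4).
classes = [(38 + 4 * i, 41 + 4 * i) for i in range(7)]

# Count the values falling in each class.
frequencies = [sum(lo <= v <= hi for v in spend) for lo, hi in classes]

for (lo, hi), f in zip(classes, frequencies):
    print(f"{lo} - {hi}: {f}")
print("Total:", sum(frequencies))
```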

The values in the above example that define the classes of the frequency
distribution are called class limits. The classes of the type 38 – 41, 42 – 45, …,
etc. in which both the upper and lower limits are included are called “inclusive
classes”. For example, the class 38 – 41 includes all the values from 38 to 41.

The following points must be kept in mind for classification:

1. The classes should be clearly defined and should not lead to any
ambiguity.
2. Each of the given values in the data set should be included in one of
the classes.
3. The classes should be of equal width, otherwise the different class
frequencies will not be comparable. If the class widths are unequal,
then comparable figures can be obtained by dividing the value of the
frequencies by the corresponding widths of the class intervals. The
ratios thus obtained are called ‘frequency densities’.
4. The number of classes should not be too large nor too small.

Class midpoints

The midpoint of a class ($x_{mid}$) can be calculated from

\[ x_{mid} = \frac{\text{lower class limit} + \text{upper class limit}}{2} \]

Examples
1. For the frequency distribution in example 2 (cost of daily commute
data), the class midpoints are given below.

class limits   midpoints
38 – 41        39.5
42 – 45        43.5
46 – 49        47.5
50 – 53        51.5
54 – 57        55.5
58 – 61        59.5
62 – 65        63.5
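The midpoints in this table follow directly from the formula, as the short Python sketch below shows (variable names are illustrative).

```python
# Inclusive class limits for the cost of daily commute data.
classes = [(38 + 4 * i, 41 + 4 * i) for i in range(7)]

# Midpoint = (lower class limit + upper class limit) / 2
midpoints = [(lo + hi) / 2 for lo, hi in classes]

print(midpoints)   # [39.5, 43.5, 47.5, 51.5, 55.5, 59.5, 63.5]
```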

Cumulative frequencies

The “less than” cumulative frequency of a class is the number of values in the
sample that are less than or equal to the upper class boundary of the class.

Example
For the frequency distribution in example 2 (cost of daily commute data) the
cumulative frequencies are calculated as shown below.

classes    upper class limit    f     cumulative frequency    calculations
38 – 41    41                   4     4                       4
42 – 45    45                   10    14                      4 + 10
46 – 49    49                   8     22                      4 + 10 + 8
50 – 53    53                   15    37                      4 + 10 + 8 + 15
54 – 57    57                   9     46                      4 + 10 + 8 + 15 + 9
58 – 61    61                   3     49                      4 + 10 + 8 + 15 + 9 + 3
62 – 65    65                   1     50                      4 + 10 + 8 + 15 + 9 + 3 + 1
Total                           50
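
The running totals in the calculations column can be reproduced with a short Python sketch. It assumes the class frequencies are held in a list in class order; the variable names are illustrative only.

```python
from itertools import accumulate

# Class frequencies of the cost of daily commute data, in class order.
frequencies = [4, 10, 8, 15, 9, 3, 1]

# Each "less than" cumulative frequency is the sum of the frequencies
# of the class itself and of all the classes below it.
cumulative = list(accumulate(frequencies))
print(cumulative)
```

The last cumulative frequency always equals the sample size (here 50), which is a useful check.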

Relative and percentage frequencies

• The relative frequency of a class is calculated as:

$RF = \frac{\text{frequency}}{\text{sample size}} = \frac{f}{n}$

• The percentage frequency of a class is calculated as: $RF \times 100$

Examples

1. For the frequency distribution in example 2 (cost of daily commute data)
the relative and percentage frequencies are calculated as shown
below.

classes    f     relative frequency    percentage frequency
38 – 41    4     0.08                  8
42 – 45    10    0.2                   20
46 – 49    8     0.16                  16
50 – 53    15    0.3                   30
54 – 57    9     0.18                  18
58 – 61    3     0.06                  6
62 – 65    1     0.02                  2
Total      50    1                     100
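
As a hedged illustration, the two columns above can be computed in Python as follows. The list names are chosen for this sketch; the percentages are computed as 100 × f / n, which is the same quantity as RF × 100.

```python
# Class frequencies of the cost of daily commute data, in class order.
frequencies = [4, 10, 8, 15, 9, 3, 1]
n = sum(frequencies)  # sample size

relative = [f / n for f in frequencies]          # RF = f / n
percentage = [100 * f / n for f in frequencies]  # RF x 100

for f, rf, pf in zip(frequencies, relative, percentage):
    print(f, round(rf, 2), round(pf, 1))
```

The relative frequencies should add up to 1 and the percentage frequencies to 100.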

Histogram [6]

A histogram is the graphical representation of a frequency distribution. The
frequency for each class is represented by a rectangular bar with the class
boundaries as base and the frequency as height.

Example

The histogram of the frequency distribution in example 2 (cost of daily
commute data) is shown below.

[Figure: Histogram of amount spent on commuting. Horizontal axis: amount spent (classes 38 - 41 to 62 - 65); vertical axis: frequency. Bar heights: 4, 10, 8, 15, 9, 3, 1.]

[6] See Appendix A8.
Frequency polygon [7]

This is also a graphical representation of a frequency distribution. For each class
the class midpoint is plotted against the frequency and the plotted points are joined
by means of straight lines.

Example

For the cost of daily commute data the following values are plotted.

midpoint 35.5 39.5 43.5 47.5 51.5 55.5 59.5 63.5 67.5
f 0 4 10 8 15 9 3 1 0

The plot is shown below.


[Figure: Frequency polygon of the cost of daily commute data. Horizontal axis: class midpoint (35.5 to 67.5); vertical axis: frequency.]

Note:
The two plotted values at the lower and upper ends were added to anchor the
graph to the horizontal axis. The lower end value is a plot of 0 versus the
midpoint of the class below the first (lowest) class (35.5). This midpoint is
obtained by subtracting the class width (4) from the midpoint of the lowest class
(39.5). The upper end value is a plot of 0 versus the midpoint of the class above
the last class (67.5). This midpoint is obtained by adding the class width (4) to
the midpoint of the last (highest) class (63.5).

[7] See Appendix A9.

The histogram and frequency polygon are equivalent graphical representations
of the pattern of the frequencies shown in the frequency distribution. It can be
shown that the areas under the histogram and frequency polygon are the same.
The total area under the histogram (frequency polygon) represents the total
number of observations in the data set (n).

“Less than” ogive [8]

This is the graph of the cumulative frequencies versus the upper class limits.

Example

For the “less than” ogive of the frequency distribution in example 2 (daily cost
of commute data), the following values are plotted:

Upper class limit       37    41    45    49    53    57    61    65
Cumulative frequency    0     4     14    22    37    46    49    50

[Figure: “Less than” ogive. Horizontal axis: upper class limit; vertical axis: cumulative frequency.]

Note:
The plotted value at the lower end was added to anchor the graph to the
horizontal axis. The lower end value is a plot of 0 versus the upper class
limit of the class below the first (lowest) class (37). This upper class
limit is obtained by subtracting the class width (4) from the upper class
limit of the lowest class (41).

[8] See Appendix A10.

The shape of a distribution


The main purpose of drawing a histogram is to describe the clustering pattern
of the values in the data set. For a large sample size, the histogram (frequency
polygon) can be fairly well approximated by a smooth curve (called a density
curve) that is fitted to the frequencies. The following patterns of the shape of
the frequency curve appear regularly in data sets.

Symmetric bell shape

[Figure: symmetric bell-shaped frequency curve. Horizontal axis: x (from -4 to 4); vertical axis: frequency.]

This shape is for data sets where the majority of values are in the central portion
of the scale with fewer and fewer values the further away from the center (in
both directions). Many data sets have this shape. Examples are

1. Marks obtained in an examination.


2. Heights of a large group of adult males.
3. IQ scores in a large population.

Uniform (rectangular) shape

[Figure: uniform (rectangular) frequency curve. Horizontal axis: x (from 0 to 6); vertical axis: frequency.]

This shape occurs when all the values in the data set occur approximately the
same number of times. Examples are:
1. Frequencies of winning numbers in a large number of Lotto draws.
2. Frequencies of winning numbers in a large number of roulette games.
3. Frequencies obtained when tossing an unbiased coin and recording 0 if
tails come up and 1 if heads come up.

Bimodal shape

[Figure: bimodal frequency curve. Horizontal axis: body length (mm); vertical axis: frequency.]

This pattern, which shows two distinct peaks (hence the name bimodal data),
appears when there are two subgroups with different sets of values in the
same data set.

Examples
1. Measuring the body lengths of ants when there are adults and juveniles
together in the same data set. The two peaks in the curve reflect the fact
that juvenile ants have shorter body lengths than adult ants.

2. Heights of a population of males and females. Since the females are
shorter than the males, the frequency curve will have two peaks. One
peak will be located where most of the female heights are concentrated and
one where most of the male heights are concentrated.

Positive skew shape


[Figure: positively skewed frequency curve. Horizontal axis: x (from 0 to 14); vertical axis: frequency.]

This shape shows a high clustering of values at the lower end of the scale and
less and less clustering further away from the lower end towards the upper end.

Example
The time it takes to serve a customer at a supermarket. For most customers the
service time is quite short. The longer the service time, the fewer the
customers with that service time.

Negative skew shape

[Figure: negatively skewed frequency curve. Horizontal axis: x (from 0 to 16); vertical axis: frequency.]

This shape shows a high clustering of values at the upper end of the scale and
less and less clustering further away from the upper end towards the lower end.

Example
Marks in a test where most students did well, but a few performed poorly.

Tutorial

1. According to the Air Transport Association of America, Delta
Airlines led all U.S. carriers in the number of passengers flown in
a recent year. The top 5 airlines were Delta, United, American,
US Airways, and Southwest. The numbers of passengers flown (in
thousands) by each of these airlines follow:

Airline       Passengers
Delta         103 133
United        84 203
American      81 083
US Airways    58 659
Southwest     55 946

Construct a pie chart to depict this information.

2. Research International reports that in a recent year, Huggies
was the top selling diaper brand in South Africa with 41.3% of
the market share. Other leading brands included Pampers
with 25.6%, Luvs with 12.1%, Drypers with 3.3%, Fitti with
0.9%, and private labels with 15.8% of the market share. Use
this information to construct a pie chart of the diaper market
shares.

3. Construct a pie chart from the following data.

Label Value
A 55
B 121
C 83
D 46

4. The following data represent the number of passengers per
flight in a sample of Mango flights from Durban to Port
Elizabeth.

23 46 66 67 13 58 19 17 65 17 25 20 47 28 16 38 44 29
48 29 69 34 35 60 37 52 80 59 51 33 48 46 23 38 52
50 17 57 41 77 45 47 49 19 32 64 27 61 70 19

Construct a frequency distribution from the raw data.


a. Calculate the range of the data.
b. Calculate the class width.

5. For the following data, construct a frequency distribution with six
classes.

57 23 35 18 21 26 51 47 29 21 46 43 29 23 39
50 41 19 36 28 31 42 52 29 18 28 46 33 28 20

6. Complete the following frequency distribution table and then
construct the histogram and frequency polygon.

Class boundaries    Frequency    Midpoint    Relative frequency    Cumulative frequency
20.5 - 25.5         17
25.5 - 30.5         20
30.5 - 35.5         16
35.5 - 40.5         15
40.5 - 45.5         8
45.5 - 50.5         6

7. Complete the following frequency distribution table and then
construct the histogram and frequency polygon.

Class boundaries    Frequency    Midpoint    Relative frequency    Cumulative frequency
50.5 - 60.5         13
60.5 - 70.5         27
70.5 - 80.5         43
80.5 - 90.5         31
90.5 - 100.5        9

8. Comment on the shape of the distributions in questions 6 and 7,
respectively.

CHAPTER 3
MEASURES OF LOCATION AND
DISPERSION
3.1. Introduction
A measure of central tendency is a value that shows the location on the scale
where a data set is centrally located (most values are clustered around it).

In the calculations a distinction will be made between methods used when the
data are in raw form (values as collected) or grouped form (form of a frequency
distribution).

3.2 The mean (average), median and mode

A. Raw data
Mean: The mean (or average) of a set of data values is the sum of all of the
data values in the set divided by $n$, the number of data values. That is

$\text{mean} = \bar{x} = \frac{\sum x}{n}$

$\bar{x}$ is pronounced “x bar”.

Example
The marks of seven students in a mathematics test with a maximum possible
mark of 20 are given below:
15 13 18 16 14 17 12

$\bar{x} = \frac{\sum x}{n} = \frac{15+13+18+16+14+17+12}{7} = 15$

Median: The median is the value in the data set which is such that half of the
values in the data set are less than or equal to it and half are greater than or
equal to it.

For an odd number of values in the data set, the median is the middle value of
the data set when it has been arranged in ascending order. That is, from the
smallest value to the largest value.

$\text{Median} = \frac{1}{2}(n + 1)\text{th value}$ in a data set, where $n$ is the sample size

If the number of values in the data set is even, then the median is the average
of the two middle values.

Examples

1. The marks of nine students in a geography test that had a maximum
possible mark of 50 are given below:

47 35 37 32 38 39 36 34 35

Find the median of this set of data values.

Arrange the data values in order from the lowest value to the highest value:

32 34 35 35 36 37 38 39 47

The number of values, 𝑛𝑛, in the data set is 9.

$\text{Median} = \frac{1}{2}(n + 1)\text{th value} = \frac{1}{2}(9 + 1)\text{th value} = 5\text{th value} = 36$

2. Consider the above data set with the first value (47) omitted.

Arrange the data values in order from the lowest value to the highest value:

32 34 35 35 36 37 38 39

In this case the number of values is $n = 8$, which is an even number.

$\text{Median} = \frac{1}{2}(n + 1)\text{th value} = \frac{1}{2}(8 + 1)\text{th value} = 4.5\text{th value}$

The value that lies in position 4.5 in the ranked data set is the
average of the 4th and 5th values:

$\therefore \text{Median} = \frac{35 + 36}{2} = 35.5$

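
The mean and the position rule for the median can be checked with a small Python sketch. The helper `median_by_position` is a hypothetical name introduced here; it simply applies the ½(n + 1)th value rule to the ranked data, and Python's standard `statistics` module is used as a cross-check.

```python
from statistics import mean, median

maths_marks = [15, 13, 18, 16, 14, 17, 12]
geography_marks = [47, 35, 37, 32, 38, 39, 36, 34, 35]


def median_by_position(values):
    """Return the (n + 1) / 2 th value of the ranked data set."""
    ranked = sorted(values)
    position = (len(ranked) + 1) / 2      # 1-based position, e.g. 5 or 4.5
    lower = ranked[int(position) - 1]     # value at the whole-number position
    if position == int(position):
        return lower                      # odd n: the middle value
    upper = ranked[int(position)]         # even n: the next value up
    return (lower + upper) / 2            # average of the two middle values


print(mean(maths_marks))
print(median_by_position(geography_marks))       # nine values
print(median_by_position(geography_marks[1:]))   # first value (47) omitted
```

The last two lines correspond to examples 1 and 2 above.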
Mode: The mode of a set of data values is the value(s) that occurs most often.

Example:
Find the mode of the following data set:
48 44 48 45 42 49 48
The mode is 48 since it occurs most often.

Note:

1. It is possible for a set of data values to have more than one mode.
2. If there are two data values that occur most frequently, we say that the
set of data values is bimodal e.g. the data set 2 2 4 5 5 6 has two modes
(2 and 5).
3. If no value in the data set occurs more than once, it has no mode e.g. the
data set 4 5 7 9 has no mode.
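
The three notes above can be illustrated with a short Python sketch. The function `modes` is a hypothetical helper written for these notes; it follows the convention used here that a data set in which no value repeats has no mode (some software instead reports every value as a mode in that case).

```python
from collections import Counter


def modes(values):
    """Return the list of modes; an empty list means 'no mode'."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:          # no value occurs more than once
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([48, 44, 48, 45, 42, 49, 48]))  # one mode
print(modes([2, 2, 4, 5, 5, 6]))            # bimodal
print(modes([4, 5, 7, 9]))                  # no mode
```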

Comparison of mean, median and mode

1. The mean is used as a measure of central tendency for symmetrical,
bell-shaped data that do not have extreme values (extreme values are called
outliers).
2. The median may be more useful than the mean when there are extreme
values in the data set as it is not affected by the extreme values.
3. The mode is useful when the most common item, characteristic or value
of a data set is required.

Examples

1. The amounts (thousands) for which each of 7 properties were sold are
shown below.

280, 390, 412, 555, 698, 725, 2 350

For this data set the mean is $\bar{x} = 772.86$. This value of the mean is not a central
value for the data set (it is greater than all the values but the largest one).
The reason for this is that the last value (2 350) has a considerable
influence on the value of the mean.

The median = 555 is a value that is more centrally located than the mean.
Unlike the mean, the median is not influenced by the large last value in
the data set.

2. For qualitative (non-numerical) data only the mode can be calculated. For
example, suppose 10 rate payers are asked whether they think the
percentage increase in rates is reasonable. They can either agree (A),
disagree (D) or be neutral (N) on the issue. Their responses are shown
below.

A, A, D, N, D, A, D, D, N, N.

For this data set the modal response is D (since D occurs more times than
the other responses). It is not possible to calculate a median or a mean
for this data set.

The weighted mean

When calculating the mean for raw data, it is usually assumed that all the values
in the data set are equally important. If the values are not all considered equally
important, the weighted mean ($\bar{x}_w$) is calculated according to the formula
below.

$\bar{x}_w = \frac{\sum_{i=1}^{r} x_i w_i}{\sum_{i=1}^{r} w_i}$

In the formula $x_1, x_2, \ldots, x_r$ are the values and $w_1, w_2, \ldots, w_r$ are their
respective weights.

Example

The final mark (percentage) in a certain course is based on an assignment mark
(which counts for 10% of the final mark), a test mark (which counts for 30% of
the final mark) and an exam mark (which counts for 60% of the final mark).
Calculate the final mark of a student who gets a 65% assignment mark, a 70%
test mark and a 55% exam mark.

Solution:
The above formula is applied with
$x_1 = 65$, $x_2 = 70$, $x_3 = 55$, $w_1 = 10$, $w_2 = 30$, $w_3 = 60$:

$\bar{x}_w = \frac{(65 \times 10) + (70 \times 30) + (55 \times 60)}{10 + 30 + 60} = \frac{6050}{100} = 60.5$
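
The same calculation can be sketched in Python. The function name `weighted_mean` is introduced here purely for illustration.

```python
def weighted_mean(values, weights):
    """Sum of value x weight, divided by the sum of the weights."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)


marks = [65, 70, 55]      # assignment, test and exam marks (%)
weights = [10, 30, 60]    # contribution of each mark to the final mark (%)

final_mark = weighted_mean(marks, weights)
print(final_mark)
```

Setting all the weights equal gives back the ordinary mean, which is why the unweighted mean can be seen as a special case.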

B. Grouped data

Mean:
For grouped data the mean is calculated from the formula below:

$\bar{x} = \frac{\sum (x_{mid} \times f)}{n}$

where $x_{mid}$ is the class midpoint, $f$ the class frequency and $n$ is the sample size.
This formula is a special case of the weighted mean formula with $w_i = f_i$ and
$\sum w_i = n$.

Example
For the frequency distribution of example 2 (cost of daily commute data),
the mean can be calculated as shown below.

Class interval    $x_{mid}$    $f$    $x_{mid} \times f$
38 – 41           39.5         4      158
42 – 45           43.5         10     435
46 – 49           47.5         8      380
50 – 53           51.5         15     772.5
54 – 57           55.5         9      499.5
58 – 61           59.5         3      178.5
62 – 65           63.5         1      63.5
Total                          50     2487

$\bar{x} = \frac{2487}{50} = 49.74$
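
A minimal Python sketch of the grouped mean is given below. It assumes the class midpoints and frequencies are available as two lists in the same class order; the variable names are illustrative.

```python
# Class midpoints and frequencies of the cost of daily commute data.
midpoints = [39.5, 43.5, 47.5, 51.5, 55.5, 59.5, 63.5]
frequencies = [4, 10, 8, 15, 9, 3, 1]

n = sum(frequencies)  # sample size
grouped_mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n
print(round(grouped_mean, 2))
```

Because each value is replaced by its class midpoint, the result is an approximation of the mean of the raw data.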

3.3 Measures of variability (variation, spread, dispersion)

Variability refers to the extent to which the values in a data set vary around
(differ from) the associated measure of central tendency.

Example
The performance of 2 different stocks is monitored over a period of 8 days.
Their values are shown in the table below.

Day 1 2 3 4 5 6 7 8
A 103 120 112 108 130 106 120 112
B 112 97 85 123 153 85 146 110

The scatter plots [9] that follow show the performance of each stock.

[Figure: Stock A. Horizontal axis: day (1 to 8); vertical axis: stock price.]

[Figure: Stock B. Horizontal axis: day (1 to 8); vertical axis: stock price.]

[9] See Appendix A – page 24.

The mean values for the two stocks are the same (= 113.875), but they differ in
variability (extent of spread around the mean). Stock B has a far wider spread
around the mean than stock A.

A. Raw data

Range: $R = x_{max} - x_{min}$

Example:
For the stocks data sets:
Range for stock A = 130 – 103 = 27
Range for stock B = 153 – 85 = 68

The larger (wider) spread in the stock B values is reflected in the larger range
(more than twice that of stock A).

Standard deviation and variance

The sample variance (denoted by $s^2$) is a measure of variability based on
squared differences between the values in the data set and the mean.

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

i.e. $s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$

The variance is expressed in the data units squared.
The standard deviation, $s = \sqrt{s^2}$, which is the positive square root of the
variance, is expressed in the same units as the data.

Example

For stock A the standard deviation is calculated as follows.

Stock A ($x$ values)    $x^2$
103                     10609
120                     14400
112                     12544
108                     11664
130                     16900
106                     11236
120                     14400
112                     12544
$\sum$ 911              104297

Variance: $s^2 = \frac{104297 - (8 \times 113.875^2)}{7} = 79.55$

Standard deviation: $s = \sqrt{79.55} = 8.919$

For stock B the standard deviation is 25.682 (check this using your calculator).

Interpretation: The stock A values differ (on average) from the mean by 8.919,
while stock B values differ (on average) from the mean by almost 3 times this
amount.
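
The range, variance and standard deviation of both stocks can be checked with the Python sketch below. The function `sample_variance` is a hypothetical helper that applies the computational formula given above; `statistics.stdev` from the standard library is included only as a cross-check.

```python
from math import sqrt
from statistics import stdev

stock_a = [103, 120, 112, 108, 130, 106, 120, 112]
stock_b = [112, 97, 85, 123, 153, 85, 146, 110]


def sample_variance(values):
    """Computational formula: (sum of x^2 - n * mean^2) / (n - 1)."""
    n = len(values)
    x_bar = sum(values) / n
    return (sum(x * x for x in values) - n * x_bar ** 2) / (n - 1)


for name, values in (("A", stock_a), ("B", stock_b)):
    value_range = max(values) - min(values)
    s2 = sample_variance(values)
    # name, range, variance, standard deviation
    print(name, value_range, round(s2, 2), round(sqrt(s2), 3))
```

For stock B this is the check "using your calculator" suggested above, done in code instead.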

B. Grouped data

Standard deviation and variance

For grouped data, the raw data formulae for the variance and standard
deviation can be slightly modified.

$s^2 = \frac{\sum_{i=1}^{k} (x_{mid(i)} - \bar{x})^2 f_i}{n - 1}$

i.e. $s^2 = \frac{\sum_{i=1}^{k} x_{mid(i)}^2 f_i - n\bar{x}^2}{n - 1}$

As before, standard deviation $= s = \sqrt{s^2}$.

Example

For the frequency distribution of example 2 (cost of commuting data), the
variance and standard deviation can be calculated as shown below.

Class interval    $x_{mid(i)}$    $f_i$    $x_{mid(i)} f_i$    $x_{mid(i)}^2 f_i$
38 – 41           39.5            4        158                 6241
42 – 45           43.5            10       435                 18922.5
46 – 49           47.5            8        380                 18050
50 – 53           51.5            15       772.5               39783.75
54 – 57           55.5            9        499.5               27722.25
58 – 61           59.5            3        178.5               10620.75
62 – 65           63.5            1        63.5                4032.25
Total                             50       2487                125372.5

$\text{variance} = s^2 = \frac{\sum_{i=1}^{k} x_{mid(i)}^2 f_i - n\bar{x}^2}{n - 1} = \frac{125372.5 - (50)(49.74)^2}{50 - 1} = 34.06367$

$\text{standard deviation} = s = \sqrt{s^2} = \sqrt{34.06367} = 5.836$
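
The grouped-data calculation can be sketched in Python as follows, assuming the midpoints and frequencies are stored as two lists in class order. The variable names are chosen for this illustration only.

```python
from math import sqrt

# Class midpoints and frequencies of the cost of commuting data.
midpoints = [39.5, 43.5, 47.5, 51.5, 55.5, 59.5, 63.5]
frequencies = [4, 10, 8, 15, 9, 3, 1]

n = sum(frequencies)
grouped_mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n

# Sum of (midpoint squared x frequency), the last column of the table.
sum_sq = sum(m * m * f for m, f in zip(midpoints, frequencies))

grouped_variance = (sum_sq - n * grouped_mean ** 2) / (n - 1)
grouped_sd = sqrt(grouped_variance)
print(round(grouped_variance, 5), round(grouped_sd, 3))
```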

3.4 Coefficient of variation

The standard deviations of 2 data sets that are expressed in different units
cannot be directly compared. However, such a comparison may be done by
calculating the coefficient of variation:

$CV = \frac{s}{\bar{x}} \times 100$, which is expressed as a percentage.

Example
The ages of three students were 19, 20 and 21 years and their respective
weights were 55, 60 and 65 kilograms. Since the two data sets are in different
units, their standard deviations cannot be compared directly.

For the age data: $\bar{x} = 20$, $s = 1$ $\therefore CV = \frac{1}{20} \times 100 = 5\%$

For the weight data: $\bar{x} = 60$, $s = 5$ $\therefore CV = \frac{5}{60} \times 100 = 8.33\%$

The coefficient of variation calculations show that in relative terms the
variability for the weight data set is greater than that of the age data set.
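
As a hedged illustration, the two coefficients of variation can be computed with Python's standard `statistics` module. The helper name `coefficient_of_variation` is introduced here and is not part of the notes.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV = (sample standard deviation / mean) x 100, as a percentage."""
    return stdev(values) / mean(values) * 100


ages = [19, 20, 21]        # years
weights = [55, 60, 65]     # kilograms

print(round(coefficient_of_variation(ages), 2))
print(round(coefficient_of_variation(weights), 2))
```

Because the units cancel in the ratio, the two results can be compared even though the data are in years and kilograms.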

3.5 Measures of non-central location

3.5.1 Percentiles, Quartiles and Percentile Rank


The $i$th percentile, $P_i$, is the value that has $i\%$ of the values in a data set less
than or equal to it, where $0 < i \le 100$.

Examples

• Median = $M_e$ = 50th percentile = $P_{50}$

• First quartile = $Q_1$ = 25th percentile = $P_{25}$

• Third quartile = $Q_3$ = 75th percentile = $P_{75}$

• The 9 deciles $D_1, D_2, \ldots, D_9$ are the values that have 10%, 20%, ...,
90% respectively of the values in the data set less than or equal to them.

$D_1 = P_{10}$, $D_2 = P_{20}$, …, $D_5 = P_{50} = M_e$, …, $D_9 = P_{90}$.

3.5.2 Calculation of quartiles and quartile deviation for raw data

The three quartiles ($Q_1$, $Q_2$ and $Q_3$) are summary measures that divide a ranked
data set into four equal parts. As such, approximately 25% of the values in the
data set will be less than $Q_1$, 50% of the values less than $Q_2$ and 75% of the
values less than $Q_3$.

The quartile deviation, $Q = \frac{Q_3 - Q_1}{2}$, can also be used as a measure of variability.
The quartile deviation value shows the extent to which the values in the data set
deviate from the median. For a skew data set (heavy clustering at lower or upper
end of the scale) the quartile deviation is a more appropriate measure of
variability than the standard deviation (which is more suitable as a measure of
variability for symmetric data sets).

The value ($Q_3 - Q_1$) is called the Inter-quartile Range (IQR). IQR indicates the
spread or variation of the middle 50% of the values in the data set.

$Q_1 = \left(\frac{n+1}{4}\right)\text{th}$ value in the ranked data set

$Q_2 = \left(\frac{2(n+1)}{4}\right)\text{th}$ value in the ranked data set = Median

$Q_3 = \left(\frac{3(n+1)}{4}\right)\text{th}$ value in the ranked data set

Use the following guidelines to obtain the quartile:


1. If the position point is a whole number then select the value from the data
set that is corresponding to the whole number position.
2. If the position point is halfway between two whole numbers then select
the average of the two data values which correspond to the two whole
number positions.
3. If the position point does not satisfy either of the above two cases then
round off to the nearest whole number and select the data value that
corresponds to the rounded-off whole number position.

Example

The distances from home to work (kilometres) of 12 employees at a certain
company are shown below.

6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36, 56

Calculate $Q_1$, $Q_2$, $Q_3$, $IQR$ and $Q$ for this data set.

Solution

Ranked data set: 6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 56

$Q_1 = \left(\frac{n+1}{4}\right)\text{th}$ value in the ranked data set

$= \left(\frac{12+1}{4}\right)\text{th}$ value in the ranked data set

= 3.25th ≅ 3rd value in the ranked data set

= 15 kilometres

$Q_2 = \left(\frac{2(n+1)}{4}\right)\text{th}$ value in the ranked data set

$= \left(\frac{2(12+1)}{4}\right)\text{th}$ value in the ranked data set

= 6.5th value in the ranked data set

= average of the 6th and 7th values in the ranked data set

$= \frac{40 + 41}{2}$

= 40.5 kilometres

$Q_3 = \left(\frac{3(n+1)}{4}\right)\text{th}$ value in the ranked data set

$= \left(\frac{3(12+1)}{4}\right)\text{th}$ value in the ranked data set

= 9.75th ≅ 10th value in the ranked data set

= 47 kilometres

IQR $= Q_3 - Q_1 = 47 - 15 = 32$ kilometres

Quartile deviation: $Q = \frac{Q_3 - Q_1}{2} = \frac{47 - 15}{2} = 16$ kilometres
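
The three position guidelines can be sketched in Python as shown below. The helper `value_at_position` is a hypothetical name, and the sketch assumes the position lies between 1 and n. Note that library routines such as `statistics.quantiles` interpolate between values and may therefore give slightly different quartiles from the rounding rule used in these notes.

```python
def value_at_position(ranked, position):
    """Apply the three guidelines to a 1-based position in ranked data."""
    whole = int(position)
    fraction = position - whole
    if fraction == 0:                       # guideline 1: whole number
        return ranked[whole - 1]
    if fraction == 0.5:                     # guideline 2: halfway
        return (ranked[whole - 1] + ranked[whole]) / 2
    return ranked[round(position) - 1]      # guideline 3: round to nearest


distances = [6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36, 56]
ranked = sorted(distances)
n = len(ranked)

q1 = value_at_position(ranked, (n + 1) / 4)
q2 = value_at_position(ranked, 2 * (n + 1) / 4)
q3 = value_at_position(ranked, 3 * (n + 1) / 4)
iqr = q3 - q1
quartile_deviation = iqr / 2
print(q1, q2, q3, iqr, quartile_deviation)
```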

3.5.3 Calculation of percentiles and percentile rank for raw data


The value of the $k$th percentile is:

$P_k = \left(\frac{k(n+1)}{100}\right)\text{th}$ value in a ranked data set

The percentile rank of a score is the percentage of values in the data set that are
smaller than the given score and is denoted by $PR_x$, where $x$ is the given score.

$PR_x = \frac{\text{number of values less than } x}{n} \times 100$

For the distance to work data set above, $P_{80}$ and $PR_{40}$ are calculated as follows:

Ranked data set: 6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 56

$P_k = \left(\frac{k(n+1)}{100}\right)\text{th}$ value in a ranked data set

$P_{80} = \left(\frac{80(12+1)}{100}\right)\text{th}$ value in the ranked data set

$P_{80}$ = 10.4th ≅ 10th value in the ranked data set

$\therefore P_{80} = 47$ kilometres

$PR_{40} = \frac{\text{number of values less than } 40}{12} \times 100$

$PR_{40} = \frac{5}{12} \times 100 = 41.67\%$
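
A hedged Python sketch of both calculations follows. The function names `percentile` and `percentile_rank` are introduced here for illustration; `percentile` reuses the same rounding guidelines as for the quartiles and assumes the position lies between 1 and n.

```python
def percentile(ranked, k):
    """k-th percentile using the k(n + 1)/100 position rule of these notes."""
    position = k * (len(ranked) + 1) / 100
    whole = int(position)
    fraction = position - whole
    if fraction == 0:
        return ranked[whole - 1]
    if fraction == 0.5:
        return (ranked[whole - 1] + ranked[whole]) / 2
    return ranked[round(position) - 1]


def percentile_rank(values, score):
    """Percentage of values in the data set that are smaller than score."""
    return sum(1 for v in values if v < score) / len(values) * 100


ranked = sorted([6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36, 56])
p80 = percentile(ranked, 80)
pr40 = percentile_rank(ranked, 40)
print(p80, round(pr40, 2))
```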

3.6 Chebychev’s theorem and bell-shaped data

Chebychev’s Theorem
Chebychev’s theorem states that for any data set a proportion of at least $1 - \frac{1}{d^2}$
of the values lie within $d$ standard deviations of the mean.

Examples

1. Proportion of values that lie within 2 standard deviations of the mean is
at least $1 - \frac{1}{2^2} = 0.75$.
2. Proportion of values that lie within 3 standard deviations of the mean is
at least $1 - \frac{1}{3^2} = 0.889$.
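
The bound can be evaluated for any number of standard deviations with a few lines of Python. The function name `chebyshev_lower_bound` is an assumption of this sketch; the result is floored at 0 because a proportion cannot be negative, so the bound only carries information when d is greater than 1.

```python
def chebyshev_lower_bound(d):
    """Minimum proportion of values within d standard deviations of the mean."""
    if d <= 0:
        raise ValueError("d must be positive")
    return max(0.0, 1 - 1 / d ** 2)


for d in (2, 3):
    print(d, round(chebyshev_lower_bound(d), 3))
```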

The Empirical Rule (bell-shaped distributions)

If it is known that the data set of interest has a bell-shaped clustering pattern of
the values then results that are better than that of Chebychev’s theorem can be
obtained. For data with such a shape:

(i) Approximately 68% of data values are within 1 standard deviation of the
mean.
(ii) Approximately 95% of data values are within 2 standard deviations of the
mean.
(iii) Approximately 99.7% of data values are within 3 standard deviations of
the mean.

Example
Men’s heights have a bell-shaped distribution with a mean of 175.8 centimetres
and a standard deviation of 7.4 centimetres.

Approximately 68% of data values are within 175.8 ± 7.4 = (168.4; 183.2).

Approximately 95% of data values are within 175.8 ± 14.8 = (161.0; 190.6).

Approximately 99.7% of data values are within 175.8 ± 22.2 = (153.6; 198.0).
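
The three intervals above can be reproduced with a short Python sketch. The variable names are illustrative; the percentages are those stated in the Empirical Rule and apply only to bell-shaped data.

```python
mean_height = 175.8   # centimetres
sd_height = 7.4       # centimetres

# Interval of k standard deviations on either side of the mean, k = 1, 2, 3.
intervals = {
    k: (round(mean_height - k * sd_height, 1), round(mean_height + k * sd_height, 1))
    for k in (1, 2, 3)
}

for k, share in ((1, "68%"), (2, "95%"), (3, "99.7%")):
    print("Approximately", share, "of heights lie within", intervals[k])
```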

Tutorial

1. In a factory, the time during working hours in which a machine is
not operating as a result of breakage or failure is called the
"downtime". The following distribution shows a sample of 100
downtimes of a certain machine (rounded to the nearest minute):

Downtime Frequency
0-9 3
10-19 13
20-29 30
30-39 25
40-49 14
50-59 8
60-69 4
70-79 2
80-89 1

Calculate the (approximate) mean and standard deviation of the


downtimes.

2. The diameters of a sample of 400 washers produced by a


machine are summarized below:

Diameter Number of
(millimeters) washers
(frequency)
30-40 10
40-50 50
50-60 55
60-70 79
70-80 68
80-90 60
90-100 50
100-110 28
110-120 8
Total 400

Calculate the (approximate) mean and standard deviation for the
data.

3. The frequency distribution of the number of days to maturity of


40 short-term investments is summarized below:
Days to Frequency
Maturity
30-39 3
40-49 1
50-59 8
60-69 10
70-79 7
80-89 7
90-99 4
Calculate the mean number of days to maturity and the
standard deviation of the distribution.

4. In a factory the weight of all ball bearings produced is under


examination. The weights of 100 ball bearings were obtained
and recorded as follows:
Weight Frequency
(grammes)
5-9 16
10-14 30
15-19 39
20-24 12
25- 29 3

a. Calculate the approximate sample mean and standard


deviation of the weight for the above ball bearings.
b. Construct a cumulative frequency distribution for the above
data and plot an ogive.
c. From the ogive above and the formula in your notes find the
first and third quartiles and the median weight for the ball
bearings.

5. The number of traffic tickets issued by a certain police department in
a 7-day period was
19 17 14 21
19 16 34
a. Find the mean and standard deviation for the above data.
b. Find the coefficient of variation and explain what this tells us.
c. Find the first and third quartiles, and the median for the above
data.
d. Are there any outliers?

6. The diastolic blood pressure readings for 12 randomly selected
men aged 45 - 49 years were as follows:
94 84 74 90 98 92 74 90 80 98 78 80
a. Find the mean and standard deviation for the above data.
b. Find the coefficient of variation and explain what this tells us.
c. Find the first and third quartiles, and the median for the above
data.
d. Are there any outliers?

7. What proportion of values lie within 1 standard deviation
of the mean? (Hint: Use Chebychev's theorem).

8. In a wildlife study, it is found that the average speed of the


Cheetah is 60km/h with a standard deviation of 4km/h.
What proportion of Cheetahs will have a speed

a. between 50 and 60 km/h?


b. less than or equal to 50 km/h or greater than or equal to 60
km/h?
c. Find the interval of speed that will contain approximately 95%
of data values.

CHAPTER 4
CORRELATION AND REGRESSION
4.1 Bivariate data and scatter diagrams

Often two variables are measured simultaneously and relationships between
these variables explored. Data sets involving two variables are known as
bivariate data sets.

The first step in the exploration of bivariate data is to plot the variables on a
graph. From such a graph, which is known as a scatter diagram (scatter plot,
scatter graph), an idea can be formed about the nature of the relationship.

Examples
1. It is believed that a person’s height (y) (measured in centimetres) is
dependent on the person’s shoe size (x). The values of x and y for 12
students are shown below.

x 5 4 12 8 9 7.5 6.5 11.5 10.5 11 6 4.5


y 160 152 196 168 178 165 165 170 188 180 163 155

Scatter diagram [10]

[Figure: Relationship between height and shoe size. Horizontal axis: shoe size; vertical axis: height (in centimetres).]

[10] See Appendix A12.

2. In a study of the relationship between the amount of daily rainfall (x)
and the quantity of air pollution removed (y), the following data were
collected.

Rainfall (centimeters)    Quantity removed (micrograms per cubic meter)
4.3 126
4.5 121
5.9 116
5.6 118
6.1 114
5.2 118
3.8 132
2.1 141
7.5 108

Scatter diagram

[Figure: Relationship between rainfall and quantity of air pollution removed. Horizontal axis: rainfall (in centimetres); vertical axis: quantity of air pollution removed.]

3. Data on the annual GDP growth rate (x) of various African countries and
the cost of building individual prestige houses (y) in these countries was
taken from the Africa Property & Construction Cost Guide, July 2017,
and is shown below:

GDP growth (annual % since 2000)    Cost of building individual prestige houses (in US$/m²)
3.0 4650
-0.3 1952
3.9 2100
5.6 1350
6.6 1500
2.7 2560
6.9 1700
1.3 1187
7.0 1120
5.1 1540
2.9 1590

Scatter diagram
[Figure: Relationship between annual GDP growth rate and building costs. Horizontal axis: annual GDP growth (%); vertical axis: building costs.]

• In all these cases the relationship can be fairly well described by means
of a straight line i.e. all these relationships are linear relationships.

• In the first example an increase in y is proportional to an increase in x
(positive linear relationship).
In the second and third examples a decrease in y is proportional to an
increase in x (negative linear relationship).

• In all the examples changes in the values of y are affected by changes in
the values of x (not the other way round). The variable x is known as the
explanatory (independent) variable and the variable y the response
(dependent) variable.

In this section only linear relationships between 2 variables will be explored.
The issues to be explored are

1. Measuring the strength of the linear relationship between the 2
variables (the linear correlation coefficient).

2. Finding the equation of the straight line that will best describe the
relationship between the 2 variables (the linear regression equation).
Once this line is determined, it can be used to estimate a value of y for a
given value of x (linear estimation).

4.2 Linear Correlation

The calculation of the coefficient of correlation ($r$) is based on the closeness
of the plotted points (in the scatter diagram) to the line fitted through them. It
can be shown that

$-1 \le r \le 1$

If the plotted points are closely clustered around this line, $r$ will lie close to
either 1 or –1 (depending on whether the linear relationship is positive or
negative). The further the plotted points are away from the line, the closer the
value of $r$ will be to 0. Consider the scatter diagrams that follow.

[Figure: three scatter diagrams. Panel titles: Strong positive correlation ($r$ close to 1); Strong negative correlation ($r$ close to –1); No pattern ($r$ close to 0).]

For a sample of $n$ pairs of values $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the coefficient
of correlation can be calculated from the formula

$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}}$

Example
Consider the data on a person’s shoe size (x) and height (y) considered earlier.
For this data 𝑟𝑟 can be calculated in the following way.

x       y      xy        x²       y²
5 160 800 25 25600
4 152 608 16 23104
12 196 2352 144 38416
8 168 1344 64 28224
9 178 1602 81 31684
7.5 165 1237.5 56.25 27225
6.5 165 1072.5 42.25 27225
11.5 170 1955 132.25 28900
10.5 188 1974 110.25 35344
11 180 1980 121 32400
6 163 978 36 26569
4.5 155 697.5 20.25 24025
∑ 95.5 2040 16600.5 848.25 348716

Substituting

$n = 12$, $\sum x = 95.5$, $\sum y = 2040$, $\sum xy = 16600.5$, $\sum x^2 = 848.25$, $\sum y^2 = 348716$

into the equation for $r$ gives:

$r = \frac{12 \times 16600.5 - 95.5 \times 2040}{\sqrt{[12 \times 848.25 - (95.5)^2][12 \times 348716 - (2040)^2]}} = \frac{4386}{\sqrt{1058.75 \times 22992}} = 0.889$

Comment: Strong positive correlation i.e. the increase in a person’s shoe size is
closely linked with an increase in the person’s height.
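
The calculation of $r$ can be verified with the Python sketch below, which applies the formula above directly to the shoe size and height data. The function name `correlation` is chosen for this illustration.

```python
from math import sqrt

shoe_size = [5, 4, 12, 8, 9, 7.5, 6.5, 11.5, 10.5, 11, 6, 4.5]       # x
height = [160, 152, 196, 168, 178, 165, 165, 170, 188, 180, 163, 155]  # y


def correlation(xs, ys):
    """Coefficient of correlation from the sums used in the table above."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = correlation(shoe_size, height)
print(round(r, 3))
```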

Coefficient of determination
The strength of the linear relationship between 2 variables can also be
measured by the square of the correlation coefficient ($r^2$). This quantity,
called the coefficient of determination, is the proportion of variability in
the y variable that is accounted for by its linear relationship with the x variable.

Example
In the above example on height (y) and shoe size (x), the
coefficient of determination $= r^2 = (0.889)^2 = 0.7903$.
This means that approximately 79% of the variability in a
person’s height is explained by its linear relationship with the person’s shoe size.

4.3 Linear Regression


Finding the equation of the line that best fits the (x, y) points is based on the
least squares principle. This principle can best be explained by considering the
scatter diagram below.

According to the least squares principle, the line that “best” fits the plotted
points is the one that minimizes the sum of the squares of the vertical
deviations (see vertical lines in the graph) between the plotted y and estimated
y (values on the line). For this reason the line fitted according to this principle
is called the least squares line.
Calculation of the least squares linear regression line 11

The equation for the line to be fitted to the (x, y) points is

\[ \hat{y} = a + bx \]

where $\hat{y}$ is the fitted y value (the y value on the line, which generally
differs from the observed y value), a is the y-intercept and b the slope of the
line. It can be shown that the coefficients that define the least squares line
can be calculated from

\[ b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} \]

and

\[ a = \bar{y} - b\bar{x} \]

Example
For the above data on shoe size (x) and height (y) the least squares line can
be calculated as shown below.

Substituting

\[ n = 12, \quad \sum x = 95.5, \quad \sum y = 2040, \quad \sum xy = 16600.5, \quad \sum x^2 = 848.25 \]

into the above equations gives:

\[ b = \frac{12 \times 16600.5 - 95.5 \times 2040}{12 \times 848.25 - (95.5)^2} = \frac{4386}{1058.75} = 4.14 \]

and

\[ a = \frac{2040}{12} - 4.14 \times \frac{95.5}{12} = 137.05 \]

Footnote 11: See Appendix A13.

Therefore the equation of the y on x least squares line that can be used to
estimate values of y (height) based on x (shoe size) is:

\[ \hat{y} = 137.05 + 4.14x \]

Suppose the height of a student with a shoe size of 7 is to be estimated. This
can be done by substituting x = 7 into the above equation. Then

\[ \hat{y} = 137.05 + 4.14 \times 7 \approx 166 \]
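
The fit can be reproduced with a short Python sketch. This is an illustration rather than part of the notes, and the function name `least_squares` is hypothetical. Because the code carries the slope at full precision instead of rounding it to 4.14 first, the intercept comes out as about 137.03 rather than 137.05; the estimate for shoe size 7 still rounds to 166.

```python
# Shoe size (x) and height (y) pairs from the worked example.
x = [5, 4, 12, 8, 9, 7.5, 6.5, 11.5, 10.5, 11, 6, 4.5]
y = [160, 152, 196, 168, 178, 165, 165, 170, 188, 180, 163, 155]


def least_squares(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(u * v for u, v in zip(xs, ys))
    sum_x2 = sum(u * u for u in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n  # a = y-bar minus b times x-bar
    return a, b


a, b = least_squares(x, y)
print(round(b, 2))          # slope, 4.14
print(round(a, 2))          # intercept at full precision
print(round(a + b * 7))     # estimated height for shoe size 7: 166
```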

A word of caution

• The linear relationship between y and x is often only valid for values of x
within a certain range e.g. when estimating a person’s height using the
person’s shoe size as explanatory variable, it should be taken into
account that at some shoe size the person’s height will stop increasing.
Assuming a linear relationship between shoe size and height for values
beyond the shoe size where the person’s height stops increasing would
be incorrect.

• Only relationships between variables that could be related in a practical
sense should be explored e.g. it would be pointless to explore the
relationship between the number of vehicles in New York and the number of
divorces in South Africa. Even if data collected on such variables might
suggest a relationship, it cannot be of any practical value.

• If variables are not linearly related, it does not mean that they are not
related. There are many situations where the relationships between
variables are non-linear.

Example

A plot of the banana consumption (y) versus the price (x) is shown in the graph
on the following page. A straight line will not describe this relationship very
well, but the non-linear curve shown below will describe it well.

[Figure: Nonlinear regression example. Banana consumption (y) plotted against
price (x), with the fitted curve y = α + β/x + u = α + βz + u.]

Tutorial

1. The following are the assessed valuations of eight houses in a certain city
and the selling prices of these houses. The data constitute a random
sample of all the houses assessed and sold in that city.

Assessed value (thousands of rand), X    Selling price (thousands of rand), Y

116,0 185,0
160,8 246,4
103,2 162,2
55,8 97,6
89,6 148,0
65,0 110,4
144,0 236,6
80,6 126,8

Find the line of best fit and use it to estimate the selling price of a house
when its assessed value is R100 000.

2. In a study of the relationship between the amount of rainfall and the
quantity of pollution removed from the air, the following data were
collected:

x
126 121 116 118 114 118 132 141 108
y

a. Find the equation of the regression line to predict the particulate
removed from the amount of daily rainfall. Estimate the amount of
particulate removed when the daily rainfall is x = 4.8 units.

b. Determine the correlation coefficient between the particulate
removed and the amount of daily rainfall.

3. In a certain type of metal test the normal stress on a specimen is
known to be functionally related to the shear resistance. The
following is a set of coded experimental data on the two variables:

Normal stress, x    Shear resistance, y
26.8 26.5
25.4 27.3
28.9 24.2
23.6 27.1
27.7 23.6
23.9 25.9
24.7 26.3
28.1 22.5
26.9 21.7
27.4 21.4
22.6 25.8
25.6 24.9

(a) Find the equation for the line of best fit.

(b) Determine the correlation coefficient between the shear resistance and the
normal stress.

(c) Estimate the shear resistance for a normal stress of 24.5 (kilograms per
square cm).

4. A chemical company, wishing to study the effects of extraction time
(x) on the efficiency (y) of an extraction operation, obtained the data
shown below:

x (minutes) 27 45 41 19 35 39 19 49 15 31
y (%) 57 64 80 46 62 72 52 77 57 65

Find the least squares regression line by which one may predict the efficiency
from the extraction time.

5. The following sample data show the demand for a product (in
thousands of units) and its price (in rands) in six different market
areas:
x (price)    y (demand)
19 55
23 7
21 20
15 123
16 88
18 76

$\sum x = 112$, $\sum x^2 = 2136$, $\sum y = 369$, $\sum y^2 = 32123$, $\sum xy = 6247$

a. Fit a least squares line that will enable us to predict the
demand for the product in terms of its price.
b. Predict the demand for the product in a market area where it is
priced at 15 rands.

Plot the data and the regression line on suitable axes. (Show your working for the 2
points needed to plot the straight line)

CHAPTER 5
RANDOM VARIABLES AND
PROBABILITY DISTRIBUTIONS
5.1 Introduction to probability distributions

Probability (chance)
• A probability is the chance that something of interest will happen.
• A probability is expressed as a proportion i.e. it ranges from 0 to 1.
Chance can be expressed as a percentage i.e. it ranges from 0 to 100.

Examples
1. The probability of rain tomorrow is 0.40
There is a 40% chance of rain tomorrow.
2. The probability of winning the Lotto is 1/13983816.
3. The probability of a certain new product being successful is 0.75.

Random experiment
This is an experiment that gives different outcomes when repeated under
similar conditions.

1. The experiment can have more than one possible outcome.


2. All possible outcomes can be listed.
3. The outcome that will occur when the experiment is performed depends
on chance.

Examples

1. Tossing a coin (possible outcomes: heads, tails).


2. Rolling a die (possible outcomes: 1, 2, 3, 4, 5, 6).
3. Asking a person to assign a rating to a product (possible outcomes: A, B,
C, D, E).
4. Drawing a card from a deck of cards (possible outcomes: 13 hearts, 13
clubs, 13 spades, 13 diamonds).

A random variable is a variable whose value depends on the outcome of a
random experiment. A random variable is denoted by a capital letter and a
particular value of a random variable by a lower case (small) letter.

Examples
1. T = the number of tails (t) when a coin is flipped 3 times.
2. X = the sum of the values (x) showing when two dice are rolled.
3. H = the height (h) of a woman chosen at random from a group.
4. V = the liquid volume (v) of soda in a can marked 12 oz.

There are two types of random variables:

Discrete Random Variables


• Variables that have a finite or countable number of possible values.
• These variables usually occur in counting experiments.

Continuous Random Variables


• Variables that can take on any value in some interval i.e. they can take
an infinite number of possible values.
• These variables usually occur in experiments where measurements are
taken.

Examples
1. The variables T and X from the above examples are discrete random
variables.
2. The variables H and V from the above examples are continuous random
variables.

5.2 Discrete probability distributions and their graphical representations

A discrete probability distribution is a list of the possible distinct values of
the random variable together with their corresponding probabilities. The
probability of the random variable X assuming a particular value x is denoted
by P(X=x) = P(x). This probability, which is a function of x, is referred to as
the probability mass function.

Examples

1. As above, let T be the random variable that represents the number of
tails obtained when a coin is flipped three times. Then T has 4 possible
values 0, 1, 2, and 3. The outcomes of the experiment and the values of
T are summarized in the next table.

Outcomes         T
hhh              0
hht, hth, thh    1
tth, tht, htt    2
ttt              3

Assuming that the outcomes are all equally likely, the probability
distribution for T is given in the following table.

t 0 1 2 3 Total
P(t) 1/8 3/8 3/8 1/8 1

2. A pair of dice is tossed. Let X denote the sum of the digits. The
probability distribution of X can be found from the following table. The
entry in any particular cell is the sum of the row and column values.

1st die
1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
2nd die 3 4 5 6 7 8 9
4 5 6 7 8 9 10
5 6 7 8 9 10 11
6 7 8 9 10 11 12

x 2 3 4 5 6 7 8 9 10 11 12
P(X=x) 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36
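
The same distribution can be obtained by enumerating the 36 equally likely outcomes. The sketch below is illustrative only (names such as `counts` and `pmf` are not from the notes) and uses exact fractions so that the probabilities sum to exactly 1.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 (1st die, 2nd die) outcomes give each sum.
counts = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))

# Probability mass function P(X = x) as exact fractions.
pmf = {total: Fraction(count, 36) for total, count in sorted(counts.items())}

for total in pmf:
    print(total, f"{counts[total]}/36")

# The probabilities of a discrete distribution must add up to 1.
print(sum(pmf.values()))  # 1
```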

Note:
For any discrete random variable X, the probabilities P(x) satisfy

\[ 0 \le P(x) \le 1 \quad \text{and} \quad \sum_{x} P(x) = 1. \]

The cumulative distribution function

The cumulative distribution function is defined as

\[ F(x) = P(X \le x) = \sum_{r \le x} P(r) \]

Examples

1. For the probability mass function in example 1 above (number of tails in
three flips) the cumulative distribution function is

x 0 1 2 3
F(x) 1/8 ½ 7/8 1

2. For the probability mass function in example 2 above (sum of two dice)
the cumulative distribution function is

x 2 3 4 5 6 7 8 9 10 11 12
F(x) 1/36 3/36 6/36 10/36 15/36 21/36 26/36 30/36 33/36 35/36 1

3. Consider a discrete random variable with probability mass function given
below.

x 1 2 3 4
P(X=x) 0.1 0.3 0.4 0.2

[Figure: (a) CDF and (b) PMF of the random variable in example 3]

The graphs on the previous page are plots of the probability mass function
(graph on the right) and cumulative distribution function (graph on the left).
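
As a hedged illustration (not part of the notes), the cumulative distribution function for example 3 can be built from the probability mass function by taking running totals. The variable names are hypothetical.

```python
from itertools import accumulate

# Probability mass function from example 3.
values = [1, 2, 3, 4]
pmf = [0.1, 0.3, 0.4, 0.2]

# F(x) = P(X <= x) is the running total of the probabilities.
cdf = list(accumulate(pmf))

for x, f in zip(values, cdf):
    print(x, round(f, 2))
```

Rounding is applied only when printing, because sums of decimal fractions are not always exact in floating point.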

A random variable can only take on one value at a time i.e. the events
$X = x_1$ and $X = x_2$ for $x_1 \ne x_2$ are mutually exclusive. The
probability of the variable taking on any number of different values can be
found by simply adding the appropriate probabilities.

Examples
1. Find the probability of getting 2 or more tails when a coin is flipped 3
times.
P(T ≥ 2) = 3/8 + 1/8 = ½.

2. Find the probability of getting at least one tail when a coin is flipped 3
times.
P(at least 1) = P(1) + P(2) + P(3) = 3/8 + 3/8 +1/8 = 7/8

Or

P(at least 1) = 1 – P(0) = 1 – 1/8 = 7/8.

5.3 Mean (expected value), variance and standard deviation of a
discrete random variable

The mean or expected value of a random variable X is the average value that
we would expect for X when performing the random experiment many times.

Notation: The mean or expected value of a random variable X will be
represented by µ or E(X).

We can calculate the mean by using the formula

\[ E(X) = \mu = \sum x P(x). \]
Examples

1. The expected value of the random variable T from above is:

\[ E(T) = \sum t P(t) = \left(0 \times \tfrac{1}{8}\right) + \left(1 \times \tfrac{3}{8}\right) + \left(2 \times \tfrac{3}{8}\right) + \left(3 \times \tfrac{1}{8}\right) = \tfrac{3}{2} \]

Thus if 3 coins are flipped a large number of times, we should expect the
average number of tails (per 3 flips) to be about 1.5. Since the number of tails
is an integer, it can never actually equal the mean value of 1.5. Instead, the
mean reflects the symmetry of the distribution: the extreme values (0 and 3)
each occur an eighth of the time and the middle values (1 and 2) each occur
three eighths of the time.

2. The score S obtained in a certain quiz is a random variable with
probability distribution given below.

s 0 1 2 3 4 5
P(S=s) 0.12 0.04 0.16 0.32 0.24 0.12

The mean of the random variable S can be calculated as shown below.

s 0 1 2 3 4 5 sum
P(S=s) 0.12 0.04 0.16 0.32 0.24 0.12 1
s×P(s) 0 0.04 0.32 0.96 0.96 0.60 2.88

µ = E(S) = 2.88

Variance
For a random variable X, the variance, denoted by σ², can be calculated
by using the formula

\[ \sigma^2 = \sum (x - \mu)^2 P(x) = \sum x^2 P(x) - \mu^2 \]

The standard deviation of X, denoted by σ, is the positive square root of σ².
It is a measure of the extent to which the values are spread around the
mean.
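
To illustrate these formulas, the sketch below computes the mean, variance and standard deviation of the quiz score S from the earlier example. It is a minimal Python illustration, not part of the notes, and it evaluates the variance with both forms of the formula to show that they agree.

```python
from math import sqrt

# Probability distribution of the quiz score S from the example above.
scores = [0, 1, 2, 3, 4, 5]
probs = [0.12, 0.04, 0.16, 0.32, 0.24, 0.12]

# Mean: sum of s * P(s).
mean = sum(s * p for s, p in zip(scores, probs))

# Variance, first from the definition and then from the shortcut form.
var_definition = sum((s - mean) ** 2 * p for s, p in zip(scores, probs))
var_shortcut = sum(s * s * p for s, p in zip(scores, probs)) - mean ** 2

std_dev = sqrt(var_definition)

print(round(mean, 2))            # 2.88
print(round(var_definition, 4))  # 2.1056
print(round(var_shortcut, 4))    # 2.1056
print(round(std_dev, 4))
```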

5.4 Binomial distribution


Assumptions:
A discrete random variable X is said to have a binomial distribution if a random
experiment satisfies the following conditions.

1. The experiment is repeated a fixed number of times. Each repetition is
called a trial. The number of trials is denoted by n.
2. All trials are independent of each other.
3. The outcome for each trial of the experiment can be one of two
complementary outcomes, one labeled “success” and the other labeled
“failure”. A single such trial is called a Bernoulli trial.
4. The probability of success has a constant value of p for each trial and the
probability of failure is q = 1 − p.
5. The random variable X counts the number of successes that occur
in the n trials.

Examples

1. Consider the experiment of flipping a coin 5 times. If we let the event of
getting “tails” on a flip be labeled “success” and “heads” failure, and if the
random variable T represents the number of tails obtained, then T will be
binomially distributed with n = 5, p = ½ and q = ½.

2. A student answers 10 questions in a multiple-choice test by guessing each
answer. For each question, there are 5 possible answers, only one of
which is correct. If we consider a “success” as getting a question right and
consider the 10 questions as 10 independent Bernoulli trials, then the
random variable X representing the number of correct answers will be
binomially distributed with n = 10, p = 0.2 and q = 0.8.

3. Fourteen percent of flights from a certain airport are delayed. If 20 flights
are chosen at random, then we can consider each flight to be an
independent Bernoulli trial. If we define a successful trial to be one where
a flight takes off on time, then the random variable Z representing the
number of on-time flights will be binomially distributed with n = 20, p =
0.86 and q = 0.14.

Formula for the calculation of binomial probabilities

\[ P(x) = {}^{n}C_{x}\, p^{x} q^{n-x} \quad \text{for } x = 0, 1, 2, \ldots, n. \]

A shorthand way of referring to a binomially distributed random variable X,
based on n trials with probability of success p, is X ~ B(n, p) or X ~ Bin(n, p).

Examples

1. As in the previous examples, let T be the random variable representing
the number of tails when a coin is flipped 3 times. Using the formula
above with n = 3 and p = ½, we can calculate the probability of exactly 2
tails as:

\[ P(T = 2) = {}^{3}C_{2} \left(\tfrac{1}{2}\right)^{2} \left(\tfrac{1}{2}\right)^{1} = 0.375. \]

2. Let the random variable X represent the number of correct answers in the
multiple-choice test described above. Then the probability of a student
guessing 3 answers correctly is:

\[ P(X = 3) = {}^{10}C_{3} (0.2)^{3} (0.8)^{7} = 0.2013, \]

and the probability of guessing seven answers correctly is:

\[ P(X = 7) = {}^{10}C_{7} (0.2)^{7} (0.8)^{3} = 0.000786. \]
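
These binomial probabilities can be checked with a few lines of Python. The sketch below is illustrative and not part of the notes; the function name `binomial_pmf` is hypothetical, and `math.comb` (Python 3.8 or later) supplies the number of combinations.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p): nCx * p^x * q^(n - x) with q = 1 - p."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)


# Exactly 2 tails in 3 coin flips.
print(round(binomial_pmf(2, 3, 0.5), 3))    # 0.375

# Guessing 3, and then 7, of 10 five-option questions correctly.
print(round(binomial_pmf(3, 10, 0.2), 4))   # 0.2013
print(round(binomial_pmf(7, 10, 0.2), 6))   # 0.000786
```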

Mean and standard deviation of a binomial random variable

If X is a binomial random variable with n trials, probability of success p and
probability of failure q, then the mean, variance and standard deviation of X
can be calculated by using the following formulae.
mean = E(X) = µ = np
var(X) = σ² = npq
standard deviation(X) = σ = √(npq).

Example

For T = the number of tails when a coin is flipped 3 times, n = 3, p = q = ½.

\[ E(T) = \mu = 3 \times 0.5 = 1.5 \]

\[ \sigma = \sqrt{3 \times 0.5 \times 0.5} = \sqrt{0.75} = 0.866 \]
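
As a sanity check, the shortcut formulas np and npq can be compared with the general definitions of mean and variance applied to the binomial probabilities. The Python sketch below does this for the coin example; it is an illustration only and its variable names are not from the notes.

```python
from math import comb, sqrt

n, p = 3, 0.5
q = 1 - p

# Binomial probabilities P(T = t) for t = 0, 1, ..., n.
pmf = [comb(n, t) * p ** t * q ** (n - t) for t in range(n + 1)]

# Mean and variance from the general definitions for a discrete variable.
mean_from_pmf = sum(t * prob for t, prob in enumerate(pmf))
var_from_pmf = sum((t - mean_from_pmf) ** 2 * prob for t, prob in enumerate(pmf))

print(mean_from_pmf, n * p)          # both 1.5
print(var_from_pmf, n * p * q)       # both 0.75
print(round(sqrt(n * p * q), 3))     # 0.866
```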

Shape of the binomial distribution

A binomial distribution is symmetric if p = q, positively skewed if p < q and
negatively skewed if p > q. These shapes are illustrated in the graphs for
n = 20 shown below and on the following page.

[Figure: bar charts of the binomial probabilities for X ~ Bin(20, 0.5),
X ~ Bin(20, 0.1) and X ~ Bin(20, 0.9)]

5.5 Poisson distribution

A Poisson random variable (X) is one that counts the number of events that
occur at random in an interval of time or space. The average number of events
that occur in the time/space interval is denoted by μ.

Examples
1. The number of bad cheques presented for daily payment at a bank.
2. The number of road deaths per month.
3. The number of bacteria in a given culture.
4. The number of defects per square meter on metal sheets being
manufactured.
5. The number of mistakes per typewritten page.

Formula for the calculation of Poisson probabilities

The probability that x events occur in the time/space interval is given by

\[ P(X = x) = \frac{\mu^{x} e^{-\mu}}{x!} \]

for x = 0, 1, 2, … where µ > 0.

A shorthand way of referring to a Poisson distributed random variable X with
average (mean) rate of occurrence µ is X ~ Po(µ).

Examples
1. A bank receives on average μ = 6 bad cheques per day. Calculate the
probability of the bank receiving

(a) exactly 4 bad cheques per day.


(b) at least 3 bad cheques per day.

Solution

(a) Substituting µ = 6 and x = 4 into the above formula gives

\[ P(4) = \frac{6^{4} e^{-6}}{4!} = 0.134. \]

(b) P(X ≥ 3) = 1 – P(X ≤ 2)
= 1 – [P(0) + P(1) + P(2)]
= 1 – 0.062
= 0.938

2. A secretary claims an average mistake rate of 1 per page. A sample page
is selected at random and 5 mistakes are found. What is the probability of
her making 5 or more mistakes on a page if her claim of 1 mistake per page
on average is correct?

Solution

In this case µ = 1 is claimed and the event of interest is X ≥ 5, where X is
the number of mistakes on a page. If the claim is true,

\[ P(X \ge 5) = 1 - P(X \le 4) = 1 - \left( \frac{1^{0} e^{-1}}{0!} + \frac{1^{1} e^{-1}}{1!} + \frac{1^{2} e^{-1}}{2!} + \frac{1^{3} e^{-1}}{3!} + \frac{1^{4} e^{-1}}{4!} \right) = 1 - 0.9963 = 0.0037. \]

The above calculation shows that if the claim of 1 mistake per page on average
is true, there is only a 37 in 10 000 chance of getting 5 or more mistakes per
page. This remote chance of 5 or more mistakes when an average of 1 mistake
per page is true casts doubt on whether the claim of 1 mistake per page on
average is in fact true.
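
The Poisson calculations in the two examples above can be reproduced with the following Python sketch. It is illustrative only; the helper names `poisson_pmf` and `poisson_cdf` are hypothetical and not part of the notes.

```python
from math import exp, factorial


def poisson_pmf(x, mu):
    """P(X = x) for X ~ Po(mu): mu^x * e^(-mu) / x!."""
    return mu ** x * exp(-mu) / factorial(x)


def poisson_cdf(x, mu):
    """P(X <= x), the sum of the probabilities for 0, 1, ..., x."""
    return sum(poisson_pmf(k, mu) for k in range(x + 1))


# Bank example, mu = 6 bad cheques per day.
print(round(poisson_pmf(4, 6), 3))       # exactly 4: 0.134
print(round(1 - poisson_cdf(2, 6), 3))   # at least 3: 0.938

# Secretary example, mu = 1 mistake per page.
print(round(1 - poisson_cdf(4, 1), 4))   # 5 or more: 0.0037
```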

Mean and standard deviation of a Poisson random variable

The mean and variance of the Poisson distribution are given by
E(X) = µ and var(X) = µ.

Example
Calls arrive at a switchboard at an average rate of 1 every 15 seconds. What is
the probability of not more than 5 calls arriving during a particular minute?

Solution
A mean rate of 1 every 15 seconds is equivalent to a mean rate of 4 every
minute. Since the question concerns an interval of 1 minute, µ = 4 (not µ = 1).

\[ P(X \le 5) = e^{-4} + \frac{4 e^{-4}}{1!} + \frac{4^{2} e^{-4}}{2!} + \frac{4^{3} e^{-4}}{3!} + \frac{4^{4} e^{-4}}{4!} + \frac{4^{5} e^{-4}}{5!} = 0.7851 \]
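
The key step in this example is rescaling the mean to match the interval in the question. A minimal Python sketch of that step follows; the variable names are illustrative and not from the notes.

```python
from math import exp, factorial

# Stated rate: 1 call per 15 seconds.
calls_per_interval = 1
interval_seconds = 15

# Rescale the mean to the 60-second interval the question asks about.
mu = calls_per_interval * 60 / interval_seconds  # 4.0 calls per minute

# P(X <= 5) is the sum of the Poisson probabilities for x = 0, ..., 5.
p_at_most_5 = sum(mu ** x * exp(-mu) / factorial(x) for x in range(6))

print(mu)                     # 4.0
print(round(p_at_most_5, 4))  # 0.7851
```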

5.6 Probability distributions of continuous random variables

A random variable X is called continuous if it can assume any of the possible
values in some interval i.e. the number of possible values is infinite. In this
case the definition of a discrete probability distribution (a list of possible
values with their corresponding probabilities) cannot be used, since with an
infinite number of possible values it is not possible to draw up such a list.
For this reason probabilities associated with individual values of a continuous
random variable X are taken as 0.

The clustering pattern of the values of X over the possible values in the interval
is described by a mathematical function f(x) called the probability density
function. A high (low) clustering of values will result in high (low) values of this
function. For a continuous random variable X, only probabilities associated
with ranges of values (e.g. an interval of values from a to b) will be calculated.
The probability that the value of X will fall between the values a and b is given
by the area between a and b under the curve describing the probability density
function f(x). For any probability density function the total area under the
graph of f(x) is 1.

5.6.1 Normal distribution


A continuous random variable X is normally distributed (that is, it follows a
normal distribution) if the probability density function of X is given by:

\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}, \quad \text{for } -\infty < x < +\infty \]

The constants µ and σ are the mean and standard deviation, respectively, of
X. These constants completely specify the density function. A graph of the
curve describing the probability density function (known as the normal curve)
for the case µ = 0 and σ = 1 is shown below.

[Figure: Graph of the standard normal distribution; horizontal axis z from –4
to 4, vertical axis p(z)]

5.6.2 Properties of the Normal distribution

a. The graph of the function defined above has a symmetric, bell-shaped
appearance.
b. The mean µ is located on the horizontal axis where the graph reaches its
maximum value.
c. At the two ends of the scale the curve describing the function gets closer
and closer to the horizontal axis without actually touching it.
d. The parameter µ shows where the distribution is centrally located and σ
describes the spread of the values around µ.
e. A shorthand way of referring to a random variable X which follows a
normal distribution with mean µ and variance σ² is to write it as
X ~ N(µ, σ²).

Many quantities measured in everyday life have a distribution which closely
matches that of a normal random variable, for example, marks in an exam,
weights of products, heights of a male population.
The next diagram shows graphs of normal distributions for various values of µ
and σ².

An increase (decrease) in the mean µ results in a shift of the graph to the right
(left) e.g. the curve of the distribution with a mean of –2 is moved 2 units to
the left. An increase (decrease) in the standard deviation σ results in the graph
becoming more (less) spread out e.g. compare the curves of the distributions
with σ² = 0.5, 1 and 2.

5.6.3 Empirical example – The Normal distribution and a histogram

Consider the scores obtained by 4 500 candidates in a matric mathematics
examination.

[Figure: Histogram of the examination marks; horizontal axis: mark, vertical
axis: frequency]

The histogram of the marks has an appearance that can be described by a
normal curve i.e. it has a symmetric, bell-shaped appearance. The mean of the
marks is 51.95 and the standard deviation is 10.

5.7 The Standard Normal Distribution

To find probabilities for a normally distributed random variable, we need to be
able to calculate the areas under the graph of the Normal distribution. Such
areas are obtained from a table showing the cumulative distribution of the
normal distribution. Since the Normal distribution is specified by the mean (µ)
and standard deviation (σ), there are many possible normal distributions that
can occur. It would be impossible to construct a table for each possible mean
and standard deviation. This problem is overcome by transforming X, the normal
random variable of interest [X ~ N(µ, σ²)], to a standardized Normal random
variable:

\[ Z = \frac{X - \mu}{\sigma} \]

It can be shown that the transformed random variable Z ~ N(0, 1). The
random variable Z can be transformed back to X by using the formula

\[ X = \mu + Z\sigma \]

The Normal distribution with mean µ = 0 and standard deviation σ = 1 is called
the standard normal distribution. The symbol Z is reserved for a random
variable with this distribution. The graph of the standard Normal distribution
appears below.

The Standard Normal Distribution

Various areas under the above normal curve are shown. The standard Normal
table gives the area under the curve to the left of the value z. Other types of
areas can be found by combining several of the areas as shown in the following
examples.

5.7.1 Calculating probabilities using the standard Normal table

The areas shown in the standard Normal table are those under the standard
normal curve to the left of the value of z looked up i.e. the areas are the
probabilities P(Z ≤ z). For example, P(Z ≤ 0.14) = 0.5557.

Note:
• For negative values of z less than the minimum value (–3.79) in the
table, the probabilities are taken as 0, that is, P(Z ≤ z) = 0 for
z < –3.79.
• For positive values of z greater than the maximum value (3.79) in the
table, the probabilities are taken as 1, that is, P(Z ≤ z) = 1 for
z > 3.79.

Examples
In all the examples that follow Z ~ N(0, 1).

a) P(Z < 1.35) = 0.9115

b) P(Z > –0.47) = 1 – P(Z ≤ –0.47)
= 1 – 0.3192
= 0.6808

c) P(–0.47 < Z < 1.35) = P(Z < 1.35) – P(Z < –0.47)
= 0.9115 – 0.3192
= 0.5923

d) P(Z > 0.76) = 1 – P(Z < 0.76)
= 1 – 0.7764
= 0.2236

e) P(0.95 ≤ Z ≤ 1.36) = P(Z ≤ 1.36) – P(Z ≤ 0.95)
= 0.9131 – 0.8289
= 0.0842

f) P(–1.96 ≤ Z ≤ 1.96) = P(Z ≤ 1.96) – P(Z ≤ –1.96)
= 0.9750 – 0.0250
= 0.95
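
Where no table is at hand, the left-tail areas can be computed from the error function, using the identity Φ(z) = ½[1 + erf(z/√2)]. The Python sketch below is an illustration and not part of the notes; the function name `phi` is hypothetical. Results may differ from table-based answers in the fourth decimal, because table entries are rounded before they are subtracted.

```python
from math import erf, sqrt


def phi(z):
    """Area under the standard normal curve to the left of z, P(Z <= z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))


print(round(phi(1.35), 4))                 # a) 0.9115
print(round(1 - phi(-0.47), 4))            # b) 0.6808
print(round(phi(1.35) - phi(-0.47), 4))    # c) 0.5923
print(round(phi(1.96) - phi(-1.96), 4))    # f) 0.95
```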

In all the above examples an area was found for a given value of z. It is also
possible to find a value of z when an area to its left is given. This can be
written as $P(Z \le z_{\alpha}) = \alpha$ (α is the Greek letter for a and is
pronounced “alpha”). In this case $z_{\alpha}$ has to be found where α is the
area to its left.

Examples
1. Find the value of z that has an area of 0.0344 to its left.

Search the body of the table for the required area (0.0344) and then
read off the value of z corresponding to this area. In this case
$z_{0.0344} = -1.82$.

2. Find the value of z that has an area of 0.975 to its left.

Finding 0.975 in the body of the table and reading off the z value gives
$z_{0.975} = 1.96$.

3. Find the values of z that have areas of 0.95 and 0.05 to their left.

When searching the body of the table for 0.95 this value is not found.
The z value corresponding to 0.95 can be estimated from the following
information obtained from the table.

z area to left
1.64 0.9495
? 0.95
1.65 0.9505

Since the required area (0.95) is halfway between the 2 areas obtained
from the table, the required z can be taken as the value halfway
between the two z values that were obtained.

From the table: z = (1.64 + 1.65)/2 = 1.645

Exercise: Using the same approach as above, verify that the z value
corresponding to an area of 0.05 to its left is -1.645.

At the bottom of the standard normal table selected percentiles $z_{\alpha}$ are
given for different values of α. This means that the area under the normal
curve to the left of $z_{\alpha}$ is α.

Examples:
1. α = 0.900, $z_{\alpha}$ = 1.282 means P(Z < 1.282) = 0.900.

2. α = 0.995, $z_{\alpha}$ = 2.576 means P(Z < 2.576) = 0.995.

3. α = 0.005, $z_{\alpha}$ = –2.576 means P(Z < –2.576) = 0.005.

The standard normal distribution is symmetric with respect to the mean = 0.
From this it follows that the area under the normal curve to the right of a
positive z entry in the standard normal table is the same as the area to the
left of the associated negative entry (–z) i.e.

P(Z ≥ z) = P(Z ≤ –z)

For example, P(Z ≥ 1.96) = 1 – 0.975 = 0.025 = P(Z ≤ –1.96).
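
The reverse look-up (finding z for a given area to its left) can also be done numerically. The sketch below assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist` with an `inv_cdf` method; the names `Z` and `z_alpha` are illustrative and not part of the notes.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
Z = NormalDist()

# z value with area alpha to its left, for several values of alpha.
alphas = (0.005, 0.05, 0.90, 0.95, 0.975, 0.995)
z_alpha = {alpha: Z.inv_cdf(alpha) for alpha in alphas}

for alpha, z in z_alpha.items():
    print(alpha, round(z, 3))

# Symmetry: the percentile for alpha is minus the percentile for 1 - alpha.
print(round(z_alpha[0.05] + z_alpha[0.95], 6))
```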

5.7.2 Calculating probabilities for any Normal random variable

Let X be a N(µ, σ²) random variable and Z a N(0, 1) random variable. Then

\[ P(X \le x) = P\left(Z \le \frac{x - \mu}{\sigma}\right). \]

Example 1
The height (in centimetres) of a population of women is approximately
normally distributed with a mean of µ = 161.3 and a standard deviation of
σ = 6.7 centimetres.

Solution
To calculate the probability that a woman is less than 160 centimetres tall, we
first find the z-score for 160 centimetres:

\[ z = \frac{160 - 161.3}{6.7} = -0.19 \]

and then use P(X ≤ 160) = P(Z ≤ –0.19) = 0.4247.

This means that 42.47% (a proportion of 0.4247) of women are less than 160
centimetres tall.

Example 2
The length X (centimetres) of sardines is a N(11.73, 0.1344) random variable.
What proportion of sardines is:
(a) longer than 12.7 centimetres?
(b) between 11.049 and 12.319 centimetres?

Solution

Here σ = √0.1344 ≈ 0.37.

(a) P(X > 12.7) = P(Z > (12.7 – 11.73)/0.37)
= P(Z > 2.62)
= 1 – P(Z ≤ 2.62)
= 1 – 0.9956
= 0.0044

(b) P(11.049 ≤ X ≤ 12.319) = P((11.049 – 11.73)/0.37 ≤ Z ≤ (12.319 – 11.73)/0.37)
= P(–1.84 ≤ Z ≤ 1.59)
= P(Z ≤ 1.59) – P(Z ≤ –1.84)
= 0.9441 – 0.0329
= 0.9112
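
The two examples can be cross-checked without a table. The sketch below assumes Python 3.8 or later and uses `statistics.NormalDist`; the variable names are illustrative, and σ = 0.37 is taken from the worked solution. Small differences from the table-based working are expected because the z-scores are not rounded to two decimals here.

```python
from statistics import NormalDist

# Example 1: heights of women, mean 161.3 cm and standard deviation 6.7 cm.
heights = NormalDist(mu=161.3, sigma=6.7)
p_under_160 = heights.cdf(160)

# Example 2: sardine lengths, mean 11.73 cm and standard deviation 0.37 cm.
sardines = NormalDist(mu=11.73, sigma=0.37)
p_longer = 1 - sardines.cdf(12.7)                          # part (a)
p_between = sardines.cdf(12.319) - sardines.cdf(11.049)    # part (b)

print(round(p_under_160, 4))
print(round(p_longer, 4))
print(round(p_between, 4))
```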

5.8 Finding percentiles by using the standard Normal table


The standard Normal table can be used to find percentiles for random
variables which are normally distributed. The p-th percentile for X is given by

\[ x_p = \mu + \sigma z_p \]

where $z_p$ is the corresponding percentile of the standard Normal
distribution.

Example
The scores X obtained in a mathematics entrance examination are Normally
distributed with µ = 514 and σ = 113. Find the score which marks the 80th
percentile.

Solution
From the standard Normal table, the z-value which is closest to an area of 0.80
in the body of the table is 0.84 (the actual area to its left is 0.7995). The
score which corresponds to a z-value of 0.84 can be found by

\[ x_{0.80} = \mu + \sigma z_{0.80} = 514 + (113)(0.84) = 608.92. \]

That is, a score of approximately 609 is better than 80% of all other exam
scores.
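
The percentile can also be computed directly. The sketch below assumes Python 3.8 or later and uses `statistics.NormalDist.inv_cdf`; the variable names are illustrative. Because it uses the exact z value rather than the table value 0.84, the result differs slightly from 608.92 but still rounds to 609.

```python
from statistics import NormalDist

mu, sigma = 514, 113

# Exact 80th percentile of the standard normal distribution.
z_80 = NormalDist().inv_cdf(0.80)

# Transform back to the scale of the exam scores: x_p = mu + sigma * z_p.
x_80 = mu + sigma * z_80

# The same value obtained directly from the distribution of X.
x_80_direct = NormalDist(mu, sigma).inv_cdf(0.80)

print(round(z_80, 2))   # 0.84
print(round(x_80))      # 609
```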

Exercises: With reference to the above normal distribution:
(a) Find the 35th percentile, $P_{35}$.
(b) If a person scores in the top 5% of test scores, what is the minimum
score they could have received?
(c) If a person scores in the bottom 10% of test scores, what is the
maximum score they could have received?

Tutorial
1. The probability distribution of X, the number of cylinders to be
tuned up in the engines of cars at a certain service station, is
shown in the table below.

X 4 6 8
probability 0.5 0.3 0.2

The cost of tune up for each cylinder is R200. What is the
expected tune up cost of cylinders at this service station?

2. A game between two players is fair if each player has the same
mathematical expectation. If someone gives us R5 each time we
roll a 1 or 2 with a balanced die, how much must we pay that
person each time we roll a 3, 4, 5 or 6 to make the game fair?

3. A union wage negotiator feels that the probabilities are 0.40,
0.30, 0.20 and 0.10 respectively that the union members will
get a R1.50 per hour raise, a R1.00 an hour raise, a 50 cents an
hour raise, or no raise at all. What is their expected raise?

4. An importer is offered a shipment of machine tools for
R140,000, and the respective probabilities that he will be able
to sell them for R180,000, R170,000 or R150,000 are 0.32, 0.55
and 0.13. What is the importer's expected gross profit?

5. A builder has to choose between two jobs. The first job
promises a profit of R80,000 with a probability of 0.75 or a loss
of R25,000 with a probability of 0.25; the second job promises
a profit of R120,000 with a probability of 0.5 or a loss of
R45,000 with a probability of 0.5. Which job should the builder
choose if he wants to maximize his expected profit?

6. It is known that 20% of all callers phoning an internet help line
are put on hold. Suppose 25 people phone this help line.
(a) What is the probability that 10 or more people will be put on
hold?
(b) What is the probability that 5 or fewer people will be put on hold?
(c) What are the mean and standard deviation of the number of people
put on hold?

7. An insurance broker, who has 5 independent contacts,
believes that, for each, the probability of making a sale is
0.4.
a) What is the probability of at least one sale?
b) What is the expected number of sales?

8. Consider families with three children, and suppose that each
child (independently) has probability 0.51 of being a boy.
(a) Find the probability that at least one child in such a family is a
boy.
(b) Find the probability that at least two are boys, given that at
least one is a boy.
9. A missile manufacturer claims that his missiles are 90 per cent
effective. The Air Force checks the stock by firing 10 missiles
and obtains 5 successes. What is the probability of obtaining 5
or fewer successes if p = 0.9? What conclusion is one able to
draw?

10. The probability is 0.06 that a patient will cancel a dental
appointment. Consider a group of 10 patients scheduled for
appointments this morning, and let X denote the number of
cancellations in this group.
(a) What is the probability that exactly 2 out of 10 appointments
will be cancelled?
(b) What is the probability that at least 2 out of the 10
appointments will be cancelled?
(c) What is the expected number of cancellations?
(d) What is the standard deviation of the number of cancellations?

11. The intensive care unit at a particular hospital has patients


arriving at an average rate of 5 per day.
(a) What is the probability (4 decimals) that 5 patients arrive on a
particular day?
(b) What is the probability that at least 5 patients arrive on a particular
day?

12. You are in charge of a large fleet of delivery trucks. On
average 1.9 trucks break down per day, and you keep two
trucks available to replace those that break down. If you can
assume that the number of breakdowns on any day is a
Poisson random variable, what is the probability that on
any one day
(a) no extra replacement trucks are needed;
(b) the number of replacement trucks is inadequate?

13. Suppose that the number of goals scored in a soccer match is a


Poisson random variable with mean 3. Find the probability that
2 or more goals are scored in such a match.

14. On average an insurance company receives 6 claims


between 14:00 and 16:00 on a particular day. What is the
probability that the company receives exactly 17 claims
between 8:00 and 16:00 on that day?

15. Given a standard normal distribution, find the area under the curve which
lies
a. to the left of z = 1.43 i.e. P (z < 1.43)
b. to the right of z = −0.89 i.e. P (z >−0.89)
c. between z = −2.16 and z = −0.65 i.e. P (−2.16 < z < −0.65)
d. to the right of z = 1.96 i.e. P (z > 1.96)
e. between z = −0.48 and z = 1.74 i.e. P (−0.48 < z < 1.74).

16. Find the value of z if the area under a standard normal curve
(a) to the right of z is 0.3622
(b) to the left of z is 0.1131 i.e. find z 0.1131
(c) between 0 and z, with z > 0, is 0.4838;
(d) between −z and z, with z > 0, is 0.9500.

17. Given the normally distributed variable X with mean 18 and standard
deviation 2.5, find
(a) P (X < 15);

(b) the value of k such that P (X < k) = 0.2236;
(c) the value of k such that P (X > k) = 0.1814;
(d) P (17 < X < 21);

18. The loaves of bread distributed to local stores by a certain bakery


have an average length of 30 centimeters and a standard deviation of
2 centimeters. Assume that the lengths are normally distributed.
(a) what percentage of the loaves are
i. longer than 31.7 centimeters?
ii. between 29.3 and 33.5 centimeters in length?
iii. shorter than 25.5 centimeters?
(b) The owner of the bakery keeps the smaller loaves for private use. If
he/she retains only 5% of all the loaves, what is the maximum size
loaf that is kept?

19. A soft-drink machine is regulated so that it discharges an average of


200 milliliters per cup. Assume the amount of drink is normally
distributed with a standard deviation equal to 15 milliliters.
(a) What fraction of the cups will contain more than 224 ml?
(b) What is the probability that a cup contains between 191 and 209 ml?
(c) (i) What is the probability that a cup will overflow if the cup
can hold 230 ml?
(ii) Using part (i), how many cups will probably overflow if 230
milliliter cups are used for the next 1000 drinks?
(d) Below what value do we get the smallest 25% of the drinks?

20. The tensile strength of a certain metal component is normally


distributed with a mean 10 000 kilograms per square centimeter and a
standard deviation of 100 kilograms per square centimeter.
Measurements are recorded to the nearest 50 kilograms per square
centimeter.
(a) What proportion of these components exceed 10 150 kilograms per
square centimeter in tensile strength?
(b) If specifications require that all components have tensile
strength between 9800 and 10200 kilograms per square
centimeter inclusive, what proportion of pieces would we expect
to scrap?

21. The weights of adult male rhesus monkeys are normally distributed
with a mean of 15 pounds and a standard deviation of 3 pounds.
(a) A male rhesus monkey is randomly selected. What is the
probability that its weight is more than 17 pounds?
(b) If 50 male rhesus monkeys are randomly selected, about how many
would you expect to weigh less than 12 pounds?
22. The manager of a gym has determined that the length of time
members spend at the gym is a normally distributed random
variable with a mean of 80 minutes and a standard deviation of 20
minutes.
(a) What proportion of members spend more than 2 hours at the gym?
(b) What proportion of members spend less than 1 hour at the gym?
(c) What is the least amount of time spent by 60% of the members
at the gym?

CHAPTER 6
HYPOTHESIS TESTING

6.1 Formulation of hypotheses and related terminology

Statistical hypothesis
A statistical hypothesis is an assertion (claim) made about a value(s) of a
population parameter.

Purpose
The purpose of testing of hypotheses is to determine whether a claim
that is made could be true. The conclusion about the truth of such a
claim is not stated with absolute certainty, but rather in terms of the
language of probability.

Examples of claims to be tested

1. A supermarket receives complaints that the mean content of “1


kilogram” sugar bags that are sold by them is less than 1 kilogram.

2. A construction company suspects that the proportion of jobs they


complete behind schedule is 0.20 (20%). They want to test whether this
is indeed the case.

Null and alternative hypotheses

Null hypothesis ($H_0$)

This is a statement concerning the value of the parameter of interest ($\theta$) in a
claim that is made. This is formulated as

$H_0 : \theta = \theta_0$

(the statement that the parameter $\theta$ is equal to the hypothetical value $\theta_0$).

Alternative hypothesis ($H_1$)
This is a statement about the possible values of the parameter $\theta$ that are
believed to be true if $H_0$ is not true. One of the alternative hypotheses shown
below will apply.

(a) $H_1 : \theta < \theta_0$, or
(b) $H_1 : \theta > \theta_0$, or
(c) $H_1 : \theta \neq \theta_0$

Examples

1. In the first example (above) the parameter of interest is the population
mean $\mu$ and the hypotheses to be tested are:

$H_0 : \mu = 1$ (Population mean is 1 kilogram)
$H_1 : \mu < 1$ (Population mean is less than 1 kilogram)

In terms of the general notation stated above, $\theta = \mu$ and $\theta_0 = 1$.

2. In the second example (above) the parameter of interest is the
population proportion, $\pi$, of job completions behind schedule and the
hypotheses to be tested are

$H_0 : \pi = 0.20$ (Population proportion is 0.20)
$H_1 : \pi \neq 0.20$ (Population proportion is not equal to 0.20)

In terms of the general notation stated above, $\theta = \pi$ and $\theta_0 = 0.20$.

One and two-sided alternatives

One-sided alternative
This is a hypothesis that specifies the alternative values (to the null hypothesis)
in a direction that is either below or above that specified by the null
hypothesis.

Example

The alternative hypothesis H 1 (see example 1 above) is the alternative that the
value of the parameter is less than that stated under the null hypothesis.

Two-sided alternative
This is a hypothesis that specifies the alternative values (to the null hypothesis)
in directions that can be either below or above that specified by the null
hypothesis.

Example
The alternative hypothesis H 1 (see example 2 above) is the alternative that the
value of the parameter is either greater than that stated under the null
hypothesis or less than that stated under the null hypothesis.

6.2 Testing hypotheses for one sample: Terminology and summary of procedure

The testing procedure and terminology will be explained for the test for the
population mean $\mu$ with population variance $\sigma^2$ known.

The hypotheses to be tested are:

1. $H_0 : \mu = \mu_0$
   $H_1 : \mu \neq \mu_0$

2. $H_0 : \mu = \mu_0$
   $H_1 : \mu < \mu_0$

3. $H_0 : \mu = \mu_0$
   $H_1 : \mu > \mu_0$

The data set that is needed to perform the test is $x_1, x_2, \ldots, x_n$,
a random sample of size $n$ drawn from the population for which the mean is
tested. The test is performed to see whether or not the sample data are
consistent with what is stated by the null hypothesis.
The instrument that is used to perform the test is called a test statistic. A test
statistic is a quantity calculated from the sample data.

When testing for the population mean, the test statistic used is:

$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$

We calculate the value of the statistic by substituting the values of $\bar{x}$, $\mu_0$, $\sigma$ and
$n$ into the equation and obtain $z_{\text{calc}}$.

If the difference between $\bar{x}$ and $\mu_0$ (and therefore the value of $z_{\text{calc}}$) is
reasonably small, $H_0$ will not be rejected. In this case the sample mean is
consistent with the value of the population mean that is being tested. If this
difference (and therefore the value of $z_{\text{calc}}$) is sufficiently large, $H_0$ will be
rejected. In this case the sample mean is not consistent with the value of the
population mean that is being tested. In order to decide how large this
difference between $\bar{x}$ and $\mu_0$ (and therefore the value of $z_{\text{calc}}$) should be
before $H_0$ is rejected, the following should be considered.

Type I error
• A type I error is committed when the null hypothesis is rejected when, in
fact, it is true i.e. $H_0$ is wrongly rejected.
• For example, a type I error is committed when it is decided that the
statement $H_0 : \mu = \mu_0$ should be rejected when, in fact, it is true.

Type II error
• A type II error is committed when the null hypothesis is not rejected
when, in fact, it is false i.e. a decision not to reject $H_0$ is wrong.
• For example, a type II error is committed when it is decided that the
statement $H_0 : \mu = \mu_0$ should not be rejected when, in fact, it is false.

The following table gives a summary of possible conclusions and their
correctness when performing a test of hypotheses.

Actual situation     Reject $H_0$          Do not reject $H_0$
$H_0$ is true        Type I error          Correct conclusion
$H_0$ is false       Correct conclusion    Type II error

A Type I error is often considered to be more serious, and therefore more
important to avoid, than a Type II error. The hypothesis testing procedure is
therefore designed so that there is a guaranteed small probability of rejecting
the null hypothesis wrongly. This probability is never 0. Mathematically, the
probability of a type I error can be stated as

P(type I error) = P(Reject $H_0$ | $H_0$ is true) = $\alpha$

When testing for the population mean, $H_0 : \mu = \mu_0$:

P(type I error) = P(reject $\mu = \mu_0$ | $\mu = \mu_0$ is true) = $\alpha$

P(type II error) = P(do not reject $\mu = \mu_0$ | $\mu = \mu_0$ is false) = $\beta$

1 − P(type II error) = $1 - \beta$ = the power of the test. It is the probability of not
making a type II error, that is, of rejecting $H_0$ when it is in fact false.

Probabilities of type I and type II errors work in opposite directions. The more
reluctant you are to reject $H_0$ (the smaller $\alpha$ is), the higher the risk of not
rejecting it when, in fact, it is false. The easier you make it to reject $H_0$, the
lower the risk of not rejecting it when, in fact, it is false, but the higher the
risk of rejecting it when, in fact, it is true.
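
The trade-off between the two error probabilities can be illustrated numerically. The sketch below is only an illustration under assumed values: the function name beta_right_tailed and the numbers used (hypothesised mean 100, true mean 102, σ = 5, n = 25) are hypothetical and do not come from these notes. It computes β for a right-tailed test of the mean with σ known, for three values of α.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def beta_right_tailed(mu0, mu_true, sigma, n, alpha):
    """P(type II error) for H0: mu = mu0 vs H1: mu > mu0 when the true mean is mu_true."""
    z_crit = Z.inv_cdf(1 - alpha)                 # H0 is rejected when z_calc > z_crit
    shift = (mu_true - mu0) / (sigma / sqrt(n))   # mean of the test statistic under mu_true
    return Z.cdf(z_crit - shift)                  # chance that z_calc stays below z_crit

# Hypothetical numbers, chosen only to show the pattern.
for alpha in (0.10, 0.05, 0.01):
    beta = beta_right_tailed(mu0=100, mu_true=102, sigma=5, n=25, alpha=alpha)
    print(f"alpha = {alpha:.2f}  beta = {beta:.3f}  power = {1 - beta:.3f}")
```

As α is made smaller, the printed β becomes larger and the power smaller, which is the opposite movement described above.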

Critical value(s) and critical region

Critical (cut-off) value(s)


• The critical value(s) for tests of hypotheses is(are) a value(s) to which the
test statistic is compared in order to determine whether or not the null
hypothesis should be rejected.
• The critical value is determined according to the specified value of $\alpha$, the
probability of a type I error.

For the test of the population mean the critical value is determined in the
following way. Assuming that $H_0$ is true, the test statistic will follow a standard
Normal distribution i.e.

$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1)$$

1. When testing $H_0$ versus the alternative hypothesis $H_1$ ($\mu < \mu_0$), the
critical region lies in the left tail of the standard Normal distribution. This
is called a left-tailed test. The critical value $-z_{\text{crit}}$
is such that the area under the standard normal curve to the left of
$-z_{\text{crit}}$ is $\alpha$. That is, $P(Z < -z_{\text{crit}}) = \alpha$. The graph below illustrates the
case for $\alpha = 0.05$.
That is, $P(Z < -1.645) = 0.05$:

2. When testing $H_0$ versus the alternative hypothesis $H_1$ ($\mu > \mu_0$), the
critical region lies in the right tail of the standard Normal distribution.
This is called a right-tailed test. The critical value $+z_{\text{crit}}$ is such that
the area under the standard Normal curve to the right of $z_{\text{crit}}$ is $\alpha$. That
is, $P(Z > z_{\text{crit}}) = \alpha$. This leaves an area of $1 - \alpha$ to the left of $z_{\text{crit}}$. The
graph below illustrates the case for $\alpha = 0.05$. This means $1 - \alpha = 0.95$ and
thus $P(Z > 1.645) = 0.05$:

3. When testing $H_0$ versus the alternative hypothesis $H_1$ ($\mu \neq \mu_0$), the
critical regions lie in both the left and right tails of the standard Normal
distribution. This is called a two-tailed test. The critical values are given
by $\pm z_{\text{crit}}$. The area under the standard Normal curve to the left of
$-z_{\text{crit}}$ is $\alpha/2$ and the area under the standard Normal curve to the right of
$+z_{\text{crit}}$ is $\alpha/2$. That is, $P(Z < -z_{\text{crit}}) = \alpha/2$ and $P(Z > z_{\text{crit}}) = \alpha/2$.
The area under the normal curve between these two critical values is $1 - \alpha$.
The graph below illustrates the case for $\alpha = 0.05$ i.e.
$P(Z < -1.96) = 0.025$ and $P(Z > 1.96) = 0.025$.
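
The critical values quoted in the three cases above can also be obtained without a table. The following is a minimal sketch using Python's standard library; the function name z_critical and its tail argument are illustrative choices, not notation from these notes.

```python
from statistics import NormalDist

def z_critical(alpha, tail):
    """Critical value(s) for a z-test; tail is 'left', 'right' or 'two'."""
    z = NormalDist()  # standard normal distribution
    if tail == "left":
        return z.inv_cdf(alpha)              # area alpha lies to the left
    if tail == "right":
        return z.inv_cdf(1 - alpha)          # area alpha lies to the right
    if tail == "two":
        upper = z.inv_cdf(1 - alpha / 2)     # area alpha/2 lies in each tail
        return (-upper, upper)
    raise ValueError("tail must be 'left', 'right' or 'two'")

print(round(z_critical(0.05, "left"), 3))                    # -1.645
print(round(z_critical(0.05, "right"), 3))                   # 1.645
print(tuple(round(v, 2) for v in z_critical(0.05, "two")))   # (-1.96, 1.96)
```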

Critical region (CR)
The critical region, or rejection region R, is the set of values of the test statistic
for which the null hypothesis is rejected.

(i) For a left-tailed test, the rejection region is:

$\{\, z \mid z < -z_{\text{crit}} \,\}$

(ii) For a right-tailed test, the rejection region is:

$\{\, z \mid z > +z_{\text{crit}} \,\}$

(iii) For a two-tailed test, the rejection region is:

$\{\, z \mid z < -z_{\text{crit}} \text{ or } z > +z_{\text{crit}} \,\}$

$H_0$ is rejected when there is a sufficiently large difference between the sample
mean $\bar{x}$ and the mean ($\mu_0$) under $H_0$. Such a large difference is called a
significant difference (result of the test is significant). The value of $\alpha$ is called
the level of significance. It specifies the level beyond which this difference
(between $\bar{x}$ and $\mu_0$) is sufficiently large for $H_0$ to be rejected. The value of $\alpha$ is
specified prior to performing the test and is often taken as either 0.05 (5%
level of significance) or 0.01 (1% level of significance).

When H 0 is rejected, it does not necessarily mean that it is not true. It means
that according to the sample evidence available it appears not to be true.
Similarly when H 0 is not rejected, it does not necessarily mean that it is true. It
means that there is not sufficient sample evidence to disprove H 0 .

Critical values for tests based on the standard normal distribution can be found
from the selected percentiles listed at the bottom of the pages of the standard
normal table.

6.3 Test for the population mean (population variance known)

A summary of the steps to be followed in the testing procedure is shown below.

Test for $\mu$ when $\sigma^2$ is known

1. State the null and alternative hypotheses:
$H_0 : \mu = \mu_0$
$H_1 : \mu \neq \mu_0$
or
$H_0 : \mu = \mu_0$
$H_1 : \mu < \mu_0$
or
$H_0 : \mu = \mu_0$
$H_1 : \mu > \mu_0$

2. The test statistic:

$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1)$$

Calculate $z_{\text{calc}}$.

3. State the level of significance $\alpha$ and determine the critical value(s) and
critical region.

(i) For a left-tailed test, the critical region is: $\{\, z \mid z < -z_{\text{crit}} \,\}$

(ii) For a right-tailed test, the critical region is: $\{\, z \mid z > +z_{\text{crit}} \,\}$

(iii) For a two-tailed test, the critical region is: $\{\, z \mid z < -z_{\text{crit}} \text{ or } z > +z_{\text{crit}} \,\}$

4. If $z_{\text{calc}}$ lies in the critical region, reject $H_0$, otherwise do not reject $H_0$.

5. State the conclusion in terms of the original problem.

Example 1

A hardware store receives complaints that the mean content of the “1
kilogram” cement bags that are sold by them is less than 1 kilogram. A
random sample of 40 cement bags is selected from the shelves and the
mean is found to be 0.987 kilograms. From past experience the standard
deviation of the contents of these bags is known to be 0.025 kilograms. Test, at
the 5% level of significance, whether this complaint is justified.

Solution:

Step 1:
$H_0 : \mu = 1$ (The complaint is not justified)

$H_1 : \mu < 1$ (The complaint is justified)

Step 2:
$n = 40$, $\bar{x} = 0.987$, $\sigma = 0.025$, $\mu_0 = 1$ (given)

Test statistic: $z_{\text{calc}} = \dfrac{0.987 - 1}{0.025/\sqrt{40}} = -3.289$.

Step 3:
$\alpha = 0.05$
Critical region: left-tailed test so critical value = $-z_{\text{crit}} = -1.645$

Step 4:
Since $z_{\text{calc}} < -z_{\text{crit}}$, that is $-3.289 < -1.645$, $H_0$ is rejected.

Step 5:
Conclusion: Sample evidence suggests that there is less than 1 kilogram
of cement in the bags. The customers’ complaints are justified.
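
The five steps can be condensed into a few lines of code. The sketch below reruns Example 1 with Python's standard library; the helper name z_test and its arguments are illustrative assumptions, not part of the notes.

```python
from math import sqrt
from statistics import NormalDist

def z_test(x_bar, mu0, sigma, n, alpha, tail):
    """Return (z_calc, reject) for a test of the mean with sigma known."""
    z_calc = (x_bar - mu0) / (sigma / sqrt(n))
    z = NormalDist()
    if tail == "left":
        reject = z_calc < z.inv_cdf(alpha)              # below -z_crit
    elif tail == "right":
        reject = z_calc > z.inv_cdf(1 - alpha)          # above +z_crit
    else:                                               # two-tailed
        reject = abs(z_calc) > z.inv_cdf(1 - alpha / 2)
    return z_calc, reject

# Example 1: cement bags, left-tailed test at the 5% level.
z_calc, reject = z_test(x_bar=0.987, mu0=1, sigma=0.025, n=40, alpha=0.05, tail="left")
print(round(z_calc, 3), reject)   # -3.289 True
```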

Example 2

A supermarket manager suspects that the machine filling “500 gram”
containers of coffee is over-filling them i.e. the actual contents of these
containers is more than 500 grams. A random sample of 30 of these
containers is selected from the shelves and the mean found to be 501.8
grams. From past experience the variance of the contents of these containers is
known to be 60 grams². Test at the 5% level of significance whether the
manager’s suspicion is justified.

Solution:

Step 1:
$H_0 : \mu = 500$ (Suspicion is not justified)

$H_1 : \mu > 500$ (Suspicion is justified)

Step 2:
$n = 30$, $\bar{x} = 501.8$, $\sigma^2 = 60$, $\mu_0 = 500$ (given)

Test statistic: $z_{\text{calc}} = \dfrac{501.8 - 500}{\sqrt{60}/\sqrt{30}} = 1.273$

Step 3:
$\alpha = 0.05$
Critical region: right-tailed test so critical value = $z_{\text{crit}} = 1.645$

Step 4:
Since $z_{\text{calc}} < z_{\text{crit}}$, that is $1.273 < 1.645$, $H_0$ is not rejected.

Step 5:
Conclusion: The sample evidence is not sufficient to conclude that the coffee
machine is over-filling the 500 gram coffee containers. The manager’s suspicion
is not justified.

Example 3

During a quality control exercise the manager of a factory that fills cans
of frozen shrimp wants to check whether the mean weights of the cans
conform to specifications i.e. the mean of these cans should be 600
grams as stated on the label of the can. He/she wants to guard against
either over or under filling the cans. A random sample of 50 of these
cans is selected and the mean found to be 595 grams. From past
experience the standard deviation of the contents of these cans is known to
be 20 grams. Test, at the 5% level of significance, whether the weights
conform to specifications. Repeat the test at the 10% level of
significance.

Solution:

Step 1:
$H_0 : \mu = 600$ (Weights conform to specifications)

$H_1 : \mu \neq 600$ (Weights do not conform to specifications)

Step 2:
$n = 50$, $\bar{x} = 595$, $\sigma = 20$, $\mu_0 = 600$ (given)

Test statistic: $z_{\text{calc}} = \dfrac{595 - 600}{20/\sqrt{50}} = -1.768$

Step 3:
$\alpha = 0.05$
Critical region: two-tailed test so critical values = $\pm z_{\text{crit}} = \pm 1.96$

Step 4:
Since $-z_{\text{crit}} < z_{\text{calc}} < +z_{\text{crit}}$,
that is, $-1.96 < -1.768 < 1.96$, $H_0$ is not rejected.

Step 5:
Conclusion: Sample evidence suggests that the weights appear to
conform to specifications.

Suppose the test is performed at the 10% level of significance.
In such a case:

$\alpha = 0.10$
Critical region: two-tailed test and critical values = $\pm z_{\text{crit}} = \pm 1.645$

Since $z_{\text{calc}} = -1.768 < -1.645$, $H_0$ is rejected.

Conclusion: The weights appear not to conform to specifications.

Thus, being less strict about controlling a type I error (changing α from 0.05 to
0.10) results in a different conclusion about H 0 (reject instead of do not reject).
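
The effect of changing α in Example 3 can be checked with a short loop. This is a minimal sketch using only Python's standard library; the variable names are illustrative. Because the sample mean lies below 600, the computed test statistic is negative, and for a two-tailed test its absolute value is compared with the upper critical value.

```python
from math import sqrt
from statistics import NormalDist

x_bar, mu0, sigma, n = 595, 600, 20, 50
z_calc = (x_bar - mu0) / (sigma / sqrt(n))   # negative: sample mean is below 600

decisions = {}
for alpha in (0.05, 0.10):
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
    decisions[alpha] = "reject H0" if abs(z_calc) > z_crit else "do not reject H0"
    print(f"alpha = {alpha:.2f}: z_calc = {z_calc:.3f}, "
          f"critical values = +/-{z_crit:.3f} -> {decisions[alpha]}")
```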

6.4 Test for the population mean (population variance not known,
n < 30): t-test 12

When performing the test for the population mean for the case where the
population variance is not known, the following modifications are made to the
procedure.

• In the test statistic formula the population standard deviation $\sigma$ is
replaced by the sample standard deviation $S$:

$$T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$$

• Since the test statistic that is used to perform the test follows a
Student’s t-distribution with $n - 1$ degrees of freedom, critical values are
looked up in the t-tables.

The t-distribution was first proposed in a paper by William Gosset in 1908 who
wrote the paper under the pseudonym “Student”. The t-distribution has the
following properties.

• The Student t-distribution is symmetric and bell-shaped, but for smaller
sample sizes it shows increased variability when compared to the standard
normal distribution (its curve has a flatter appearance than that of the
standard normal distribution). In other words, the distribution is less
peaked than a standard normal distribution and with thicker tails. As the
sample size increases, the distribution approaches a standard normal
distribution. For n > 30, the differences are negligible.
• The mean is zero (like the standard normal distribution).
• The distribution is symmetrical about the mean.
• The variance is greater than one, but approaches one from above as the
sample size increases ($\sigma^2 = 1$ for the standard normal distribution).

12 See Appendix A 14.
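
The last property can be made concrete. The notes do not state a formula for the variance; the sketch below uses the standard result that a t-distribution with ν degrees of freedom has variance ν/(ν − 2) for ν > 2. The function name t_variance is an illustrative choice.

```python
def t_variance(df):
    """Variance of Student's t-distribution with df degrees of freedom (finite for df > 2)."""
    if df <= 2:
        raise ValueError("the variance is finite only for df > 2")
    return df / (df - 2)

# The variance stays above 1 and shrinks towards 1 as the degrees of freedom grow.
for df in (3, 5, 10, 30, 100):
    print(df, round(t_variance(df), 3))
```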

Test for $\mu$ when $\sigma^2$ is not known, n < 30 (t-test)

1. State the null and alternative hypotheses:
$H_0 : \mu = \mu_0$
$H_1 : \mu \neq \mu_0$
or
$H_0 : \mu = \mu_0$
$H_1 : \mu < \mu_0$
or
$H_0 : \mu = \mu_0$
$H_1 : \mu > \mu_0$

2. Calculate the value of the test statistic:

$$t_{\text{calc}} = \frac{\bar{x} - \mu_0}{S/\sqrt{n}}$$

3. State the level of significance $\alpha$ and determine the critical value(s) and
critical region.

Degrees of freedom = $\nu = n - 1$.

(i) For a left-tailed test, the critical region is: $\{\, t \mid t < -t_{\text{crit}} \,\}$

(ii) For a right-tailed test, the critical region is: $\{\, t \mid t > +t_{\text{crit}} \,\}$

(iii) For a two-tailed test, the critical region is: $\{\, t \mid t < -t_{\text{crit}} \text{ or } t > +t_{\text{crit}} \,\}$

4. If $t_{\text{calc}}$ lies in the critical region, reject $H_0$, otherwise do not reject $H_0$.

5. State the conclusion in terms of the original problem.

Example 4

A paint manufacturer claims that the average drying time for a new paint is 2
hours (120 minutes). The drying times (in minutes) for 20 randomly selected cans
of paint were obtained. The results are shown below. 13

123  106  139  135
127  128  119  130
131  133  121  136
122  115  116  133
109  120  130  109

Assuming that the sample was drawn from a normal distribution,

(a) Test whether the population mean drying time is greater than 2 hours
(120 minutes)

(i) at the 5% level of significance.


(ii) at the 1% level of significance.

(b) Test, at the 5% level of significance, whether the population mean drying
time could be 2 hours (120 minutes).

Solution:

(a) Step 1:
$H_0 : \mu = 120$ (mean is 2 hours)
$H_1 : \mu > 120$ (mean is greater than 2 hours)

Step 2:
$n = 20$, $\mu_0 = 120$ (given), $\bar{x} = 124.1$, $S = 9.65674$ (calculated from the
data).

Test statistic: $t_{\text{calc}} = \dfrac{124.1 - 120}{9.65674/\sqrt{20}} = 1.899$

13 See Appendix pg. on how to conduct a t-test for the mean in Excel.

(i) Step 3:
$\alpha = 0.05$
Critical region: right-tailed test.
From the t-distribution table with degrees of freedom $\nu = n - 1 = 19$,
$t_{\text{crit}} = 1.729$

Step 4:
Since $t_{\text{calc}} > t_{\text{crit}}$,
that is, $1.899 > 1.729$, $H_0$ is rejected.

Step 5:
Conclusion: The mean drying time appears to be greater than 2 hours.

(ii) Step 3:
$\alpha = 0.01$
Critical region: right-tailed test.
From the t-distribution table with degrees of freedom $\nu = n - 1 = 19$,
$t_{\text{crit}} = 2.539$

Step 4:
Since $t_{\text{calc}} < t_{\text{crit}}$,
that is, $1.899 < 2.539$, $H_0$ is not rejected.

Step 5:
Conclusion: There is not sufficient evidence that the mean drying time is
greater than 2 hours.

Thus, being more strict about controlling a type I error (changing α from 0.05
to 0.01) results in a different conclusion about H 0 (do not reject instead of
reject).

(b) Step 1:

$H_0 : \mu = 120$ (mean is 2 hours)
$H_1 : \mu \neq 120$ (mean is not equal to 2 hours)

Step 2:
$n = 20$, $\mu_0 = 120$ (given), $\bar{x} = 124.1$, $S = 9.65674$ (calculated from the
data).

Test statistic: $t_{\text{calc}} = \dfrac{124.1 - 120}{9.65674/\sqrt{20}} = 1.899$ (as calculated in part (a)).

Step 3:
$\alpha = 0.05$
Critical region: two-tailed test.
From the t-distribution table with degrees of freedom $\nu = n - 1 = 19$,
$\pm t_{\text{crit}} = \pm 2.093$

Step 4:
Since $-t_{\text{crit}} \leq t_{\text{calc}} \leq +t_{\text{crit}}$,
that is, $-2.093 < 1.899 < 2.093$, $H_0$ is not rejected.

Step 5:
Conclusion: The sample evidence is consistent with a mean drying time of 2 hours.

Note:
• Despite the fact that the same data were used in the above examples,
the conclusions were different. In the first test H 0 was rejected, but in
the next 2 tests H 0 was not rejected.

• In the first test the probability of a type I error was set at 5%, while in
the second test this was changed to 1%. To achieve this, the critical value was
moved from 1.729 to 2.539, resulting in the test statistic value (1.899)
being less than (instead of greater than) the critical value.

• In the third test (which has a two-sided alternative hypothesis), the
upper critical value was increased to 2.093 (to have an area of 0.025
under the t-curve to its right). Again this resulted in the test statistic
value (1.899) being less than (instead of greater than) the critical value.
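
The three tests in Example 4 can be reproduced from the raw data. The sketch below uses Python's standard library to compute the sample mean, the sample standard deviation and the test statistic. The standard library has no t-distribution, so the critical values are simply copied from the t-table values quoted above (19 degrees of freedom). Variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

drying_times = [123, 106, 139, 135, 127, 128, 119, 130, 131, 133,
                121, 136, 122, 115, 116, 133, 109, 120, 130, 109]
mu0 = 120
n = len(drying_times)
x_bar = mean(drying_times)
s = stdev(drying_times)                  # sample standard deviation (divisor n - 1)
t_calc = (x_bar - mu0) / (s / sqrt(n))

# Critical values copied from the t-table for 19 degrees of freedom.
tests = {
    "right-tailed, alpha = 0.05": t_calc > 1.729,
    "right-tailed, alpha = 0.01": t_calc > 2.539,
    "two-tailed, alpha = 0.05": abs(t_calc) > 2.093,
}
print(round(x_bar, 1), round(s, 5), round(t_calc, 3))
for name, reject in tests.items():
    print(name, "->", "reject H0" if reject else "do not reject H0")
```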

6.5 Test for population proportion

The test for the population proportion ($\pi$) is based on the fact that, for
large samples, the sample proportion $p = X/n$ is approximately
$N(\pi, \pi(1 - \pi)/n)$, where $n$ is the sample size and $X$ the
number of items labeled “success” in the sample. From this result it follows
that

$$Z = \frac{p - \pi_0}{\sqrt{\pi_0(1 - \pi_0)/n}} \sim N(0, 1)$$

where $\pi_0$ is the value of $\pi$ under $H_0$.

For this reason the critical value(s) and critical region are the same as that for
the test for the population mean (both based on the standard normal
distribution).

Test for the population proportion $\pi$

1. State the null and alternative hypotheses.

$H_0 : \pi = \pi_0$
$H_1 : \pi \neq \pi_0$
or
$H_0 : \pi = \pi_0$
$H_1 : \pi < \pi_0$
or
$H_0 : \pi = \pi_0$
$H_1 : \pi > \pi_0$

2. Calculate the test statistic

$$z_{\text{calc}} = \frac{p - \pi_0}{\sqrt{\pi_0(1 - \pi_0)/n}}$$

3. State the level of significance $\alpha$ and determine the critical value(s) and
critical region.

(i) For a left-tailed test, the critical region is: $\{\, z \mid z < -z_{\text{crit}} \,\}$

(ii) For a right-tailed test, the critical region is: $\{\, z \mid z > +z_{\text{crit}} \,\}$

(iii) For a two-tailed test, the critical region is: $\{\, z \mid z < -z_{\text{crit}} \text{ or } z > +z_{\text{crit}} \,\}$

4. If $z_{\text{calc}}$ lies in the critical region, reject $H_0$, otherwise do not reject $H_0$.

5. State the conclusion in terms of the original problem.

Example 5
A construction company suspects that the proportion of jobs they
complete behind schedule is 0.20 (20%). Of their 80 most recent jobs 22
were completed behind schedule. Test at the 5% level of significance
whether this information confirms their suspicion.

Solution:
Step 1:
$H_0 : \pi = 0.20$ (Suspicion is confirmed)

$H_1 : \pi \neq 0.20$ (Suspicion is not confirmed)

Step 2:
$n = 80$, $x = 22$ (given), $p = 22/80 = 0.275$, $\pi_0 = 0.20$.

Test statistic: $z_{\text{calc}} = \dfrac{0.275 - 0.20}{\sqrt{0.20 \times 0.80/80}} = 1.677$.

Step 3:
$\alpha = 0.05$

Critical region: two-tailed test so critical values = $\pm z_{\text{crit}} = \pm 1.96$

Step 4:
Since $-z_{\text{crit}} < z_{\text{calc}} < +z_{\text{crit}}$,
that is, $-1.96 < 1.677 < 1.96$, $H_0$ is not rejected.

Step 5:
Conclusion: The sample evidence is consistent with the suspicion that 20% of
jobs are completed behind schedule.
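
Example 5 can be verified with a few lines of code. This is a minimal sketch using only the standard library; the function name proportion_z is an illustrative assumption. Note that the standard error is computed from the hypothesised value of the proportion, not from the sample proportion.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z(x, n, pi0):
    """z statistic for H0: pi = pi0, using the standard error under H0."""
    p = x / n
    return (p - pi0) / sqrt(pi0 * (1 - pi0) / n)

z_calc = proportion_z(x=22, n=80, pi0=0.20)
z_crit = NormalDist().inv_cdf(1 - 0.05 / 2)   # two-tailed test, alpha = 0.05
print(round(z_calc, 3), round(z_crit, 2), abs(z_calc) > z_crit)   # 1.677 1.96 False
```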

Example 6
During a marketing campaign for a new product 176 out of the 200
potential users of this product that were contacted indicated that they
would use it. Is this evidence that more than 85% of all the potential users will
actually use the product? Use α = 0.01.

Solution:
Step 1:
$H_0 : \pi = 0.85$ (85% of all potential users will use the product)

$H_1 : \pi > 0.85$ (More than 85% of all potential users will use the product)

Step 2:
$n = 200$, $x = 176$, $\pi_0 = 0.85$ (given), $p = 176/200 = 0.88$.

Test statistic: $z_{\text{calc}} = \dfrac{0.88 - 0.85}{\sqrt{0.85 \times 0.15/200}} = 1.188$.

Step 3:
$\alpha = 0.01$
Critical region: right-tailed test so critical value = $+z_{\text{crit}} = 2.326$

Step 4:
Since $z_{\text{calc}} < z_{\text{crit}}$,
that is, $1.188 < 2.326$, $H_0$ is not rejected.

Step 5:
Conclusion: There is not sufficient evidence that more than 85% of all
potential users will use the product.
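
For a one-tailed test the whole of α lies in a single tail, so at α = 0.01 the right-tail critical value is the 99th percentile of the standard normal distribution. The sketch below recomputes Example 6 on that basis, using only the standard library; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

x, n, pi0, alpha = 176, 200, 0.85, 0.01
p = x / n
z_calc = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)
z_crit = NormalDist().inv_cdf(1 - alpha)      # right-tailed: area alpha in the upper tail
reject = z_calc > z_crit
print(round(z_calc, 3), round(z_crit, 3), reject)   # 1.188 2.326 False
```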

6.6 Test for the difference between means for two independent
samples 14
For small samples (both sample sizes $n_1, n_2 < 30$)

The tests discussed in the previous sections involve hypotheses concerning
parameters of a single population and were based on a random sample drawn
from a single population of interest. Often the interest is in tests concerning
parameters of two different populations (labeled populations 1 and 2) where
two random samples (one from each population) are drawn.

Examples
1. Are the mean salaries the same for males and females with the same
educational qualifications and work experience?
2. Do smokers and non-smokers have the same mortality rate?
3. Are the variances in drying times for two different types of paints
different?
4. Is a particular diet successful in reducing people’s weights?

14
See Appendix A15.

When testing for the difference of means from 2 different populations labeled
1 and 2, the hypotheses are:

$H_0 : \mu_1 = \mu_2$
$H_1 : \mu_1 \neq \mu_2$
or
$H_0 : \mu_1 = \mu_2$
$H_1 : \mu_1 > \mu_2$
or
$H_0 : \mu_1 = \mu_2$
$H_1 : \mu_1 < \mu_2$

Notation
The following notation will be used in the description of the two sample
tests.

Measure                                 Notation (population 1)          Notation (population 2)
sample size                             $n_1$                            $n_2$
sample                                  $x_1, x_2, \ldots, x_{n_1}$      $x_1, x_2, \ldots, x_{n_2}$
sample mean                             $\bar{x}_1$                      $\bar{x}_2$
sample variance (standard deviation)    $S_1^2$ ($S_1$)                  $S_2^2$ ($S_2$)

In the examples that follow, we will assume that the populations from which
the samples are drawn are Normally distributed and that the sample sizes are
small ($n_1, n_2 < 30$) and that the population variances $\sigma_1^2$, $\sigma_2^2$ are not known
but equal to $\sigma^2$. They may be replaced by their sample estimates $S_1^2$, $S_2^2$ and

$$S^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2},$$

respectively.

In such a case the resulting statistic follows a t-distribution. The degrees of
freedom is $n_1 + n_2 - 2$.

Test for difference between two population means (small sample sizes,
population variances unknown but equal)

Step 1: State null and alternative hypotheses

$H_0 : \mu_1 = \mu_2$
$H_1 : \mu_1 \neq \mu_2$
or
$H_0 : \mu_1 = \mu_2$
$H_1 : \mu_1 > \mu_2$
or
$H_0 : \mu_1 = \mu_2$
$H_1 : \mu_1 < \mu_2$

Step 2: Calculate the test statistic:

$$t_{\text{calc}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{S^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

with

$$S^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

Step 3: State the level of significance $\alpha$ and determine the critical value(s)
and critical region.

Degrees of freedom = $n_1 + n_2 - 2$

(i) For a left-tailed test, the critical region is: $\{\, t \mid t < -t_{\text{crit}} \,\}$

(ii) For a right-tailed test, the critical region is: $\{\, t \mid t > +t_{\text{crit}} \,\}$

(iii) For a two-tailed test, the critical region is: $\{\, t \mid t < -t_{\text{crit}} \text{ or } t > +t_{\text{crit}} \,\}$

Step 4: If $t_{\text{calc}}$ lies in the critical region, reject $H_0$, otherwise do not reject $H_0$.

Step 5: State the conclusion in terms of the original problem.

Example 7
A certain hospital has been getting complaints that the response to calls from
senior citizens is slower (takes longer time on average) than that to calls from
other patients. In order to test this claim, a pilot study was carried out. The
results are shown below.

Patient type       Sample mean response time    Sample standard deviation    Sample size
Senior citizens    5.60 minutes                 0.25 minutes                 18
Others             5.30 minutes                 0.21 minutes                 13

Test, at the 1% level of significance, whether the complaint is justified.

Solution:

Label the “senior citizens” and “others” populations as 1 and 2 and their
population mean response times as µ1 and µ 2 , respectively.

Step 1:
$H_0 : \mu_1 = \mu_2$
$H_1 : \mu_1 > \mu_2$

Step 2:
$S^2 = \dfrac{(17 \times 0.25^2) + (12 \times 0.21^2)}{29} = 0.0549$

Test statistic: $t_{\text{calc}} = \dfrac{5.6 - 5.3}{\sqrt{0.0549\left(\frac{1}{18} + \frac{1}{13}\right)}} = 3.518$

Step 3:
$\alpha = 0.01$
Critical region: right-tailed test
From the t-distribution table with $\nu = n_1 + n_2 - 2 = 18 + 13 - 2 = 29$ degrees of
freedom, $t_{\text{crit}} = 2.462$

Step 4:
Since $t_{\text{calc}} > +t_{\text{crit}}$, that is, $3.518 > 2.462$, $H_0$ is rejected.

Step 5:
Conclusion: The claim is justified i.e. the mean response time for senior citizens
is longer than that for others.
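
The pooled-variance calculation in Example 7 can be sketched as follows. The function name pooled_t and its argument names are illustrative assumptions. The critical value is copied from the t-table value quoted in the solution, since Python's standard library has no t-distribution.

```python
from math import sqrt

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (pooled variance, t statistic, degrees of freedom) for two independent samples."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    t_calc = (mean1 - mean2) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return pooled_var, t_calc, df

# Example 7: senior citizens (sample 1) versus other patients (sample 2).
pooled_var, t_calc, df = pooled_t(5.60, 0.25, 18, 5.30, 0.21, 13)
t_crit = 2.462   # from the t-table: 29 degrees of freedom, right tail, alpha = 0.01
print(round(pooled_var, 4), round(t_calc, 3), df, t_calc > t_crit)   # 0.0549 3.518 29 True
```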
Tutorial

1. Write the claim as a mathematical sentence. State the null and alternative
hypotheses.
(a) A water faucet manufacturer announces that the mean flow rate of a
certain type of faucet is less than 2.5 gallons per minute.
(b) A cereal company advertises that the mean weight of the contents of its 1kg
size cereal boxes is more than 1kg.
(c) A consumer analyst reports that the mean life of a certain type of
automobile battery is not 74 months.

2. A company uses thousands of light bulbs every year. The type of light
bulb in the past had an average life of µ = 1000 hours with a standard
deviation of σ = 100 hours. A new brand of light bulb with lower price is
now being considered and will be used unless it has a smaller average life
than the old brand. A random sample of 36 light bulbs from the new brand
is tested and yields an average of x̄ = 968 hours. Based on the sample that has
been drawn and using a level of significance of 0.05, should the company
invest in these new light bulbs?
3. In a labour management discussion, management revealed from past
records that workers at a certain plant took on the average 32.6 minutes
with a standard deviation of 6.1 minutes to complete a certain task. A
random sample of 60 workers’ times was then collected showing that it
now took on the average 33.8 minutes to complete the task. Can this be
taken as an indication of a deliberate go-slow strike? Use a 1% level of
significance.
4. A company that sells frozen shrimp prints “Contents 100 grams” on the
package. The owner of the company is concerned that money is being
lost due to overfilling the boxes. A sample of 25 packages yielded an
average of x̄ = 101.58 grams. Suppose it is known from past experience
that the population of package weights has a standard deviation of σ =
4 grams. Is the owner’s concern well founded? Use a 5% level of
significance.

5. A paint manufacturer claims that the average drying time for his new
latex paint is two hours. To test this claim, drying times are obtained for
n = 20 randomly selected cans of paint. The results are displayed below in
minutes.
123 109 115 121 130
127 106 120 116 136
131 128 139 110 133
122 133 119 135 109

If we assume that the drying times are Normally distributed, do the sample
data suggest that the mean drying time is actually greater than the
manufacturer’s claim of 120 minutes? Use α = 0.05. (The sample mean and standard
deviation of the data are given by x̄ = 123.1 and s = 10).
6. An industrial company claims that the mean pH level of the water in a
nearby river is 6.8. You randomly select 19 water samples and measure
the pH of each. The sample mean and standard deviation are 6.7 and 0.24,
respectively. Is there enough evidence to reject the company’s claim at α =
0.05? Assume the population is normally distributed.

7. The life of a certain part in a cardiac pacemaker is assumed to be
normally distributed. A random sample of 10 of these parts is subjected
to an accelerated life test by running them continuously at an elevated
temperature until failure giving a sample mean of 26 hours and a sample
standard deviation of 1.625 hours. The manufacturer wants to be
certain that the mean battery life exceeds 25 hours. What conclusions
can be drawn from the sample if a 0.05 level of significance is used?

8. A medical researcher claims that less than 20% of the adults in RSA are
not allergic to any medication. In a random sample of 100 adults, 15% say
they are not allergic to any medication. At a 0.01 level of significance,
is there enough evidence to support the researcher’s claim?
9. Harper’s index claims that 23% of people in the United States are in
favour of outlawing cigarettes. You decide to test this claim and ask a random
sample of 200 people in the United States whether they are in favour of
outlawing cigarettes. Of the 200 people, 27% are in favour. Using α =
0.05, is there enough evidence to reject the claim?
10. The U.S. National Centre for Health Statistics gathers and publishes
data on the daily intake of selected nutrients by race and income level.
Suppose we are considering protein intake and want to compare the
mean daily intake of people with incomes that are above the poverty
level with those of people with incomes below the poverty level. The
data in Table A give the protein intake, in grams, over a 24-hour period for
people with incomes above and below the poverty level.

TABLE A
Above poverty level        Below poverty level
86.0   69.0                51.4   49.7   72.0
59.7   80.2                76.7   65.8   55.0
68.6   78.1                73.7   62.1   79.7
98.6   69.8                66.2   75.8   65.4
87.7   77.2                65.5   62.0   73.3
x̄₁ = 77.49, s₁ = 11.34     x̄₂ = 66.29, s₂ = 9.17

At the 5% significance level, do the data suggest that people with incomes
above the poverty level have a greater mean daily intake of protein than those
with incomes below the poverty level? Assume that the daily intake of
protein for both populations is normally distributed and that the variances of
the two populations are the same.
11. Two different hardening processes, (1) saltwater quenching and (2) oil
quenching, are used on samples of a particular type of metal alloy. The
results are shown here. Assume that hardness is normally distributed and
that the population variances are equal.

Saltwater quench   Oil quench
152                146
146                158
154                152
139                151
148                143
a. Find a 95% confidence interval for µ 1 − µ 2 .
b. Based on the confidence interval, do you think that the mean
hardness values of the two processes are the same?
c. To confirm/check your answer in part (b) test the hypothesis
that the mean hardness for the saltwater quenching process
equals the mean hardness for the oil quenching process. Use a
0.05 level of significance and assume equal variances.
12. Two methods of packaging frozen shrimps yield about the same
average weight per package. However, method 2 is somewhat faster and
a particular company that packages shrimps would like to use it unless the
variance of method 2 is shown to be larger than that of method 1 at the
5% level of significance. Two samples of 51 packages, one packed using the
first method and one using the second method, are examined. The sample
standard deviations are s 1 = 4.2 grams for method 1 and s 2 = 5.8 grams
for method 2. What decision should be made?

CHAPTER 7
CHI-SQUARE TESTS
7.1 Introduction
Chi-square (χ²) tests are used to test hypotheses on patterns of outcomes,
which are based on frequency counts, for categorical random variables.

The two chi-square tests that will be covered in this chapter are:

• Goodness of fit test: This test is used to assess how closely
the distribution of a categorical variable matches an expected
distribution. For example, has the mode of transportation
(drive, bike, walk, other) used by students to get to class
changed from that of 5 years ago?
• Test of independence: This test is used to assess whether two categorical
variables are independent of one another or if there is an association
between the two variables. For example, is there an association
between gender and smoking habits?

7.2 Properties of the Chi-square distribution

1. It is a family of distributions, one for each value of the degrees of freedom.
2. It has only one parameter: the degrees of freedom (df).
3. χ² is a skewed distribution, skewed to the right. As df increases it
becomes more symmetrical.
4. χ² assumes non-negative values only.
5. Total area under the curve is equal to 1.

7.3 The test statistic
The χ² test statistic can be computed as follows:

χ² = Σ(O²/E) − n

OR

χ² = Σ[(O − E)²/E]

where
O = observed frequency
E = expected frequency
n = sample size

For χ² tests, the rejection region lies in the right tail of the curve:

[Figure: χ² curve; the rejection region, of area α, lies to the right of the critical value χ²(df; α)]

Rejection Rule: If the calculated test statistic (χ²_calc) lies in the rejection region,
that is, if χ²_calc > χ²_crit, reject H₀ in favour of H₁. χ²_crit may be found using the χ²
tables for the given level of significance (α) and degrees of freedom = k − 1 (where k =
the number of categories of the categorical variable).
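
The two forms of the test statistic give the same value whenever the expected frequencies add up to n. A minimal Python sketch checks this; the function names and the three-category data are hypothetical and chosen only for illustration.

```python
def chi_square_deviation(observed, expected):
    """Chi-square statistic as the sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_shortcut(observed, expected):
    """Chi-square statistic as the sum of O^2 / E, minus n.
    Valid only when the expected frequencies sum to n."""
    n = sum(observed)
    return sum(o ** 2 / e for o, e in zip(observed, expected)) - n

# Hypothetical data: 60 observations spread over 3 categories
observed = [18, 22, 20]
expected = [20, 20, 20]

by_deviation = chi_square_deviation(observed, expected)
by_shortcut = chi_square_shortcut(observed, expected)
print(by_deviation, by_shortcut)
```

The shortcut form needs one fewer subtraction per category when working by hand, which is presumably why the worked examples in this chapter use it.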

7.4 Goodness-of-Fit Test
In this type of hypothesis test, one determines whether the data "fit" a
particular distribution or not. For example, one may suspect that the unknown
data fits a binomial distribution. A χ² goodness-of-fit test may be used to
determine if there is a fit or not. The null and the alternate hypotheses for this
test may be written in sentences or may be stated as equations or inequalities.

Example 1
The following table gives the age distribution of a sample of 100 people
arrested for drunk driving:

Age              16-20   21-25   26-30   31-35   36-40
No. of arrests   25      32      19      16      8

At a 1% level of significance, test the hypothesis that the proportion of people
arrested for drunk driving is the same for all age groups.

Solution:
Step 1:
H₀: The proportion of people arrested for drunk driving is the same for all age
groups
H₁: The proportion is not the same for all age groups
Step 2:

Age     Observed (O)   Expected (E)   O²/E
16-20   25             20             31.25
21-25   32             20             51.2
26-30   19             20             18.05
31-35   16             20             12.8
36-40   8              20             3.2
Total   100            100            116.5

∴ test statistic: χ²_calc = Σ(O²/E) − n = 116.5 − 100 = 16.5

Step 3:

Determine the critical value: χ²_crit

α = 0.01

Degrees of freedom: df = k − 1 = 5 − 1 = 4

∴ χ²_crit = χ²(df; α) = χ²(4; 0.01) = 13.277

Step 4:

[Figure: χ² curve with the rejection region of area α in the right tail, to the right of χ²_crit = 13.277]

Since χ²_calc > χ²_crit, that is, 16.5 > 13.277, reject H₀ at the 1% level of significance.

Step 5:
Sample evidence suggests that the proportion of arrests is not the same for all
age groups.
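
As a cross-check on Example 1, the sketch below recomputes the test statistic with the (O − E)²/E form instead of the O²/E form used in the table. The variable names are hypothetical, and the critical value is copied from Step 3 rather than computed.

```python
# Observed arrests per age group, from Example 1
observed = [25, 32, 19, 16, 8]

n = sum(observed)          # sample size, 100
k = len(observed)          # number of categories, 5
expected = [n / k] * k     # equal proportions under H0, so 20 per group

chi2_calc = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = k - 1

CHI2_CRIT = 13.277  # table value for df = 4, alpha = 0.01, from Step 3
reject_h0 = chi2_calc > CHI2_CRIT

print(chi2_calc, df, reject_h0)
```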

7.5 Test of Independence
Tests of independence involve using a contingency table of observed (data)
values. A contingency table is said to be of size (r × c) where r = number of rows
and c = number of columns.

A test of independence determines whether two factors are independent or
not. In a test of independence, we state the null and alternate hypotheses in
words. Since the contingency table consists of two factors, the null hypothesis
states that the factors are independent and the alternate hypothesis states
that they are not independent (dependent).

The test of independence is always a right-tailed test, meaning that the
critical region lies in the right tail of the χ² distribution. If the expected and
observed values are not close together, then the test statistic is very large and
will lie way out in the right tail of the chi-square curve, as in the case of a
goodness-of-fit test.

The degrees of freedom for the test of independence are:

df = (r − 1) × (c − 1)

The following formula calculates the expected frequency (E):

E = (row total × column total) / grand total

The test statistic for a test of independence is the same as that of a goodness-
of-fit test:

χ² = Σ(O²/E) − n

OR

χ² = Σ[(O − E)²/E]

Example
A random sample of 90 adults is classified according to gender and the
number of hours they watch television during a week:

                 Male   Female
Under 25 hours   27     19
Over 25 hours    15     29

Use a 0.01 level of significance and test the hypothesis that the time spent
watching television is independent of whether the viewer is male or
female.

Solution:

Step 1:

H₀: Gender and time spent watching TV are independent

H₁: Gender and time spent watching TV are dependent

Step 2:

Next, we need to calculate the test statistic. But in order to do so, we need to
compute the expected frequencies for each cell. This is done using the
formula:

E = (row total × column total) / grand total

The observed frequencies, with the expected frequencies in brackets, are:

                 Male         Female       Total
Under 25 hours   27 (21.47)   19 (24.53)   46
Over 25 hours    15 (20.53)   29 (23.47)   44
Total            42           48           90

Cell 1: E = (46 × 42)/90 = 21.47
Cell 2: E = (46 × 48)/90 = 24.53
Cell 3: E = (44 × 42)/90 = 20.53
Cell 4: E = (44 × 48)/90 = 23.47

Thus,
Observed (O)   Expected (E)   O²/E
27             21.47          33.95
19             24.53          14.72
15             20.53          10.96
29             23.47          35.83
Total: 90      90             95.46

∴ test statistic: χ²_calc = Σ(O²/E) − n = 95.46 − 90 = 5.46

Step 3:
Determine the critical value: χ²_crit

α = 0.01

Degrees of freedom: df = (r − 1) × (c − 1) = (2 − 1) × (2 − 1) = 1 × 1 = 1

∴ χ²_crit = χ²(df; α) = χ²(1; 0.01) = 6.635
Step 4:
Since χ²_calc < χ²_crit, that is, 5.46 < 6.635, do not reject H₀ at the 1% level of significance.

Step 5:
Conclusion: there is insufficient evidence to suggest that the time spent
watching TV is dependent on gender.
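
The expected frequencies and the test statistic for this example can be reproduced with a short Python sketch. The variable names are hypothetical and the critical value is copied from Step 3. Because the sketch keeps the expected frequencies unrounded, it gives about 5.47 rather than the hand-calculated 5.46; the conclusion is unchanged.

```python
# Observed contingency table: rows = (under 25 hours, over 25 hours),
# columns = (male, female)
table = [[27, 19],
         [15, 29]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

# E = (row total x column total) / grand total, for every cell
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi2_calc = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(table, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(table) - 1) * (len(table[0]) - 1)

CHI2_CRIT = 6.635  # table value for df = 1, alpha = 0.01, from Step 3
reject_h0 = chi2_calc > CHI2_CRIT

print([[round(e, 2) for e in row] for row in expected])
print(round(chi2_calc, 2), df, reject_h0)
```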

Tutorial
1. What type of data would you use for a χ² test?
a. Ratio
b. Categorical
c. Interval
d. Ordinal

Read the following information and answer questions 2 - 5.

A car manufacturer wishes to test if 5 names are equally popular. The
following popularity results were obtained from a sample.

Proposed name                A    B    C    D    E    Total
Number who prefer the name   14   24   62   80   20   200

2. The null hypothesis is:


a. The car names are not equally popular
b. The car names are equally popular
c. Some names are more popular than others
d. Popularity is not the same for each name

3. The test statistic value is:


a. 85.4
b. 40
c. 16.9
d. 7.779

4. At 10% significance level the null hypothesis is rejected if:


a. χ² > 7.779
b. χ² > 9.236
c. χ² > 13.277
d. χ² > 15.086

5. The following conclusion at 10% level of significance is true:
a. Reject the null hypothesis and conclude that the names are equally
popular
b. Reject the null hypothesis and conclude that the names are not
equally popular
c. Accept the null hypothesis and conclude that the names are equally
popular
d. Accept the null hypothesis and conclude that the names are not
equally popular

6. Suppose the null hypothesis states that there is no relationship between
income level and donations to charity each year. Then the alternate
hypothesis would state that:
a. Income level and donations to charity have no association
b. Income level has nothing to do with charity donations
c. If income levels rise, then donations to charity remain unchanged
d. Income level and donations to charity are related

7. Suppose when using a χ² test you reject H₀ at α = 0.01. It follows that:

a. H₀ will be accepted at α = 0.05
b. H₀ will be accepted at α = 0.005
c. H₀ will be rejected at α = 0.05
d. H₀ will be rejected at α = 0.005

8. The critical value for a χ² test of a contingency table with 4 columns and
6 rows at 𝛼𝛼 = 0.05 is:
a. 36.415
b. 28.869
c. 31.410
d. 24.996

9. The following statement is false about the properties of a chi-squared
distribution:
a. It has only one parameter which is the degrees of freedom
b. χ² is skewed to the left
c. χ² assumes non-negative values only
d. The total area under the curve is 1

Refer to this information to answer questions 10 – 11.
A test for independence is used to test if gender and handedness i.e. right-
handed, ambidextrous or left-handed are associated.

10.The degrees of freedom are


a. 2
b. 6
c. 3
d. 4

11. If 𝛼𝛼 = 0.05 the critical value is:


a. 9.488
b. 12.592
c. 5.991
d. 7.815

Refer to this information to answer questions 12 - 15.


A pharmaceutical company introduced a new drug for migraine in the
market a few months ago. The management wants to determine if the
reaction of customers depends on the different regions. The company
selected 1600 customers from the regions and asked them if the drug was
“effective” or “not effective” and the results are listed in the table below:

Region   Reaction
         Effective   Not Effective
East     274         126
South    203         197
West     291         109
North    257         143

12.The appropriate test is:


a. χ² goodness-of-fit test
b. χ² test for independence
c. Z-test
d. T-test

13.The null hypothesis is:
a. H₀: The reaction to the drug depends on the region
b. H₀: The reaction to the drug is related to the region
c. H₀: The reaction to the drug is independent of the region
d. H₀: The reaction to the drug is associated with the region

14.The critical value at 5% significance level is:


a. 6.251
b. 5.991
c. 7.815
d. 3.841

15.The expected value corresponding to the cell “West – Effective” is:


a. 9.21
b. 47.331
c. 256.25
d. 1025

Refer to the following contingency table to answer questions 16 - 21.

Province         Live in RDP/government   Do not live in RDP/government   Total
                 subsidised dwelling      subsidised dwelling
                 (in 100 000)             (in 100 000)
Western Cape     6                        13                              19
Eastern Cape     4                        14                              18
Northern Cape    1                        2                               3
Free State       3                        7                               10
Kwa-Zulu Natal   6                        23                              29
North West       3                        10                              13
Gauteng          12                       36                              48
Mpumalanga       2                        10                              12
Limpopo          3                        13                              16
South Africa     40                       128                             168
The figures in the table were rounded-off to the nearest 100 000 from the
results of the 2016 Community Survey for ease of calculation. These results
illustrate the distribution of households, in the nine provinces, amongst
RDP/government subsidised dwellings in South Africa.

16. To test if a relationship exists between the type of dwelling
(RDP/government subsidised dwelling or non-RDP/government
subsidised dwelling) a household occupies and the province in which the
household lives, the test one would perform is:
a. χ² goodness-of-fit test
b. χ² test of independence
c. χ² test of homogeneity
d. None of the above
17.The expected value for the cell where the heading titles “Kwa-Zulu Natal”
and “Live in RDP/government subsidised dwellings” intersect is:
a. 6.90
b. 21.93
c. 23
d. 5.59
18. The degrees of freedom for the test are:
a. 9
b. 16
c. 8
d. 18

19. At the 5% level of significance, the χ² critical value for the test is:

a. 15.507
b. 26.296
c. 28.869
d. 16.919

20. If the test statistic value is 2.89, do we reject or fail to reject H₀ at the 5%
level of significance?
a. Fail to reject H₀
b. Reject H₀
c. Fail to accept H₀
d. Cannot be determined

21. The conclusion for the test at the 5% level of significance is:

a. Do not reject H₀ and conclude that there is evidence to suggest
that a household’s dwelling type is independent of the province in
which the household lives.
b. Do not reject H₁ and conclude that there is evidence to suggest that
a household’s dwelling type is independent of the province in which
the household lives.
c. Reject H₀ and conclude that there is evidence to suggest that a
household’s dwelling type is dependent on the province in which
the household lives.
d. Reject H₁ and conclude that there is evidence to suggest that a
household’s dwelling type is dependent on the province in which
the household lives.

22. The χ² goodness-of-fit test has 23 categories. The critical value at α = 0.05
is approximately:
a. 35.172
b. 33.924
c. 32.813
d. 36.415

23. Suppose when using a χ² test you rejected H₀ at α = 0.01. It follows that:

a. H₀ will be accepted at α = 0.05
b. H₀ will be accepted at α = 0.005
c. H₀ will be rejected at α = 0.05
d. H₀ will be rejected at α = 0.005

APPENDIX A – EXCEL NOTES

A.1 Installing the Data Analysis ToolPak in Excel


The Analysis ToolPak is an Excel add-in program that provides data analysis
tools for financial, statistical and engineering data analysis.
To load the Analysis ToolPak add-in, execute the following steps.
1. On the File tab, click Options.
2. Under Add-ins, select Analysis ToolPak and click on the Go button.

3. Check Analysis ToolPak and click on OK.

4. On the Data tab, in the Analysis group, you can now click on Data Analysis.

The following dialog box appears.


5. For example, select Histogram and click OK to create a Histogram in Excel.

A.2 Creating a random sample
The Excel software package has a facility with which a random sample of a
specific size can be selected from a given population.
Below is the population data of size 10:
12 15 16 18 20 19 14 11 16 13
Select a random sample of size 5 from this population.
1. Input the population data
2. On the Data tab, in the Analysis group, click Data Analysis.

3. Select Sampling and click OK.

4. Click on the Input Range box and select the range A2:A11.
5. Click on the Random button.
6. Type in 5 in the Number of Samples box
7. Click in the Output Range box and select cell B2.
8. Click OK.
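
For comparison, the same selection can be sketched outside Excel with Python's standard random module. The seed value is arbitrary and only makes the run repeatable. This sketch draws without replacement, as in simple random sampling.

```python
import random

# Population data of size 10, as entered in the worksheet above
population = [12, 15, 16, 18, 20, 19, 14, 11, 16, 13]

rng = random.Random(2024)            # arbitrary seed, for a repeatable run
sample = rng.sample(population, 5)   # 5 distinct positions, no replacement

print(sample)
```

To draw with replacement instead, rng.choices(population, k=5) could be used in place of rng.sample.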

A.3 Drawing a line graph
1. Input the Year and Thando’s weight.
2. Highlight the data and click on the Insert tab and select the scatter
plot with straight lines and markers.

3. Click on the green plus sign, tick the box for Axis Titles and write in
the titles of the axes.
4. Right click on a year, select Format Axis.
Set the Minimum value to 2013 and the Maximum value to 2019.
5. Final output appears as follows:

[Figure: line graph titled "Thando's weight (kg)", with Weight (67 to 75) on the vertical axis and Year (2013 to 2019) on the horizontal axis]

A.4 Constructing a Simple Bar Chart


Given the following mid-year population estimates for South Africa by
population group, 2017:
Population Group Number
Black African 45 656 400
Coloured 4 962 900
Indian/Asian 1 409 100
White 4 493 500

1. Input the data.


2. Click on the Insert tab and click on 2D clustered column chart.
3. Click OK.
4. Click on the green plus sign and tick Axis Titles, Chart Title and Data
labels.

5. Label the axes.
6. The completed simple bar graph is as follows:

A.5 Constructing a Component Bar Chart
Given the following mid-year population estimates for South Africa by
population group and sex, 2017:
Population group Male Female
Black African 22 311 400 23 345 000
Coloured 2 403 400 2 559 500
Indian/Asian 719 300 689 800
White 2 186 500 2 307 100

1. Input the data.


2. Click on the Insert tab and click on 2D stacked column chart.
3. Click OK.
4. Click on the green plus sign and tick Axis Titles, Chart Title and Data
labels.

5. Label the axes.


6. The completed component bar chart is as follows:

A.6 Constructing a Multiple (Component) Bar Chart
Given the following mid-year population estimates for South Africa by
population group and sex, 2017:

Population group Male Female Total


Black African 22 311 400 23 345 000 45 656 400
Coloured 2 403 400 2 559 500 4 962 900
Indian/Asian 719 300 689 800 1 409 100
White 2 186 500 2 307 100 4 493 500

1. Input the data.


2. Click on the Insert tab and click on 2D clustered column chart.
3. Click OK.
4. Click on the green plus sign and tick Axis Titles, Chart Title.

5. Label the axes.


6. The completed multiple bar chart is as follows:

A.7 Constructing a Pie Chart


The table below shows the weighting of services used in the construction input
price index (Construction Materials Price Indices, April 2019).
Service                                                          Weight (%)
Site preparation                                                 1
Construction of buildings                                        24
Civil engineering                                                37
Other structures                                                 2
Construction by specialist trade contractors                     6
Plumbing                                                         2
Electrical contractors                                           8
Shopfitting                                                      1
Other building installation                                      8
Painting and decorating                                          1
Other building completion                                        8
Renting of construction or demolition equipment with operators   3
1. Input the data.


2. Click on the Insert tab and click on 2-D Pie chart.
3. Click OK.
4. Add the title to the graph.
The completed pie chart is as follows:

A.8 Constructing a Histogram
This example teaches you how to create a histogram in Excel.
1. First, enter the data and the bin numbers (upper levels).

2. On the Data tab, in the Analysis group, click Data Analysis.

3. Select Histogram and click OK.

4. Select the input range (the cost of daily commute values).
5. Click in the Bin Range box and select the bin range.
6. Click the Output Range option button, click in the Output Range box and
select a cell in which you want the output to appear.
7. Check Chart Output.

8. Click OK.
9. Click on Quick Analysis and choose Chart and then Clustered
10. Click on the More value in the table and delete it.
11. Properly label your bins.
12. To remove the space between the bars, right click a bar, click Format Data
Series and change the Gap Width to 0%.
13. To add borders, right click a bar, click Format Data Series, click the Fill &
Line icon, click Border and select a color.
14. To add the data values above each bar, right click a bar, click Add Data
Labels → Add Data Labels
Result:

A.9 Constructing a Frequency Polygon
1. Input the Midpoint and frequency values.
2. Highlight the data and click on the Insert tab and select the 2D line
graph.

3. Click on the green plus sign, tick the box for Axis Titles and write in the
titles of the axes.
4. The final output appears as follows:

A.10 Constructing a “Less than” ogive
1. Input the upper class limits and the cumulative frequency values.
2. Highlight the data and click the Insert tab and then click on Scatter with
Straight Lines and Markers.

3. Click on the green plus sign and tick Axis, Axis Titles, Chart Title and data
Labels.

4. Right click on the horizontal axis and click on Format Axis.
5. Set the Minimum value to 40 and the Maximum value to 70.

A.11 Calculating Summary Statistics


You can use the Analysis ToolPak add-in to generate descriptive statistics. For
example, you may have the scores of 14 participants for a test.

To generate descriptive statistics for these scores, execute the following steps.
1. On the Data tab, in the Analysis group, click Data Analysis.

2. Select Descriptive Statistics and click OK.

3. Select the range A2:A15 as the Input Range.


4. Select cell C1 as the Output Range.
5. Make sure Summary statistics is checked.

6. Click OK.
Result:

A.12 Drawing a Scatter plot

1. Input the data for stock A and stock B given in the notes.

2. Highlight the data for stock A then click the Insert tab and choose
Scatter:

3. Highlight the data for stock B then click the Insert tab and choose
Scatter:

4. Click on the green plus sign and add axis titles, chart title and data
labels for both scatter plots.

5. The scatter plots for stock A and B are as follows:

A.13 Performing Regression Analysis


1. Input the data.

2. Before we begin the analysis, we can create a scatter plot of the
variables shoe size (x) and height (y) and fit a trend line to the data as
follows:

Fit a trend line as follows:


1. Click on the green plus sign
2. Select trend line, click on the arrow and select linear
3. Add the correct axis labels

From the above scatter diagram and linear trend line, it would seem that
height and shoe size have a positive linear correlation.

1. On the Data tab, in the Analysis group, click Data Analysis.

2. Select Regression and click OK.

3. Select the Y Range. This is the predicted variable (also called dependent
variable).
4. Select the X Range. These are the explanatory variables (also called
independent variables). These columns must be adjacent to each other.
5. Check Labels.
6. Click in the Output Range box and select whichever cell you want the output
to appear in.
7. Click OK.

Excel produces the following Summary Output.

R Square
R Square equals 0.79, which is an average fit. Approximately 79% of the
variation in height is explained by the independent variable shoe size. The
closer R Square is to 1, the better the regression line fits the data.

Coefficients
The regression line is: ŷ = 137.03 + 4.14(shoe size). In other words, for each
unit increase in shoe size, height increases by 4.14 centimetres.
You can also use these coefficients to do a forecast. For example, if shoe size
equals 8, a person’s expected height = 137.03 + 4.14(8) = 170.15 centimetres.
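
The forecast above is a direct substitution into the fitted line, which can be written as a small Python function. The function and constant names are hypothetical; the coefficients are the ones reported in the Excel output.

```python
INTERCEPT = 137.03  # intercept from the Excel regression output
SLOPE = 4.14        # centimetres of height per unit of shoe size

def predicted_height(shoe_size):
    """Expected height in centimetres for a given shoe size."""
    return INTERCEPT + SLOPE * shoe_size

height_for_size_8 = predicted_height(8)
print(round(height_for_size_8, 2))
```

Such forecasts are best restricted to shoe sizes within the range of the data used to fit the line.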

A.14 Performing a t-test: one sample mean


1. For the first example in section 6.4 of the notes enter the data set on the
drying times of paint in Excel. Create another data set called Dummy
variable and enter at least two zeros as follows:

2. Click on the DATA tab


3. Click data analysis
4. Click on t-Test: Two-sample assuming Unequal Variances, then OK
5. Input the drying times for the variable 1 range
6. Input the Dummy variable for variable 2 range
7. Type in 120 for Hypothesized Mean Difference
8. Check labels
9. Type 0.05 for alpha
10.Click OK

11.Delete the Dummy variable column
12.Alter the heading to read: t-Test: Mean
Output is as follows:

The value of the test statistic is t_calc = 1.899 (3 decimal places). From the table
P(T <= –1.899) = 0.036 (for a left-tailed or one-tail test such as this). This
probability is known as the p-value (the probability of getting a t-value more
remote than the test statistic). When testing at the 5% level of significance, a
p-value of below 0.05 will cause the null hypothesis to be rejected.

A.15 Performing a two sample t-test: equal variances


Example:
A marketing research firm tests the effectiveness of a new flavouring for a
leading soft drink using a sample of 20 people, half of whom taste the soft
drink with the old flavouring and the other half who taste the beverage with
the new flavouring. The people in the study are then given a questionnaire
which evaluates how enjoyable the soft drink was. The scores are given below.
Determine whether there is a significant difference in preference between the
two flavourings at the 5% level of significance.
In other words, test the hypothesis that:
H₀: µ₁ = µ₂
H₁: µ₁ ≠ µ₂
OR
H₀: µ₁ − µ₂ = 0
H₁: µ₁ − µ₂ ≠ 0

New Old
13 12
17 8
19 6
11 16
20 12
15 14
18 10
9 18
12 4
16 11

1. Input this data into Excel


2. Click on the Data tab, click on Data Analysis
3. Select t-Test: Two Sample Assuming Equal Variances, click OK
4. Select the New data set as the Variable 1 Range
5. Select the Old data set as the Variable 2 Range
6. Type in 0 for Hypothesized Mean Difference
7. Check Labels
8. Type 0.05 for Alpha
9. Click Output Range and click on the cell in which you want the output to
appear
10.Click OK
The summary output follows:

The value of the test statistic is t_calc = 2.177 (3 decimal places). From the table
P(T <= 2.177) = 0.043 (for a two-tailed test such as this). This probability is
known as the p-value (the probability of getting a t-value more remote than
the test statistic). When testing at the 5% level of significance, a p-value of
below 0.05 will cause the null hypothesis to be rejected.
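
The test statistic reported by Excel can be checked by hand or with a short Python sketch using the standard statistics module. The variable names are hypothetical. The critical value 2.101 is the usual t-table entry for 18 degrees of freedom with 0.025 in each tail and is an addition here, not part of the Excel output.

```python
import math
import statistics

# Questionnaire scores from the example above
new = [13, 17, 19, 11, 20, 15, 18, 9, 12, 16]
old = [12, 8, 6, 16, 12, 14, 10, 18, 4, 11]

n1, n2 = len(new), len(old)
df = n1 + n2 - 2

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n1 - 1) * statistics.variance(new)
              + (n2 - 1) * statistics.variance(old)) / df

std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_calc = (statistics.mean(new) - statistics.mean(old)) / std_err

T_CRIT = 2.101  # t-table value for 18 df, two-tailed test at alpha = 0.05
reject_h0 = abs(t_calc) > T_CRIT

print(round(pooled_var, 2), round(t_calc, 3), df, reject_h0)
```

Comparing |t_calc| with the critical value leads to the same decision as comparing the two-tailed p-value with 0.05.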
