STA132 Complete Note

Uploaded by

ronkeolaoye0203

EDWARD CARES

STA 132: Lecture Notes

Course Title: Laboratory For Inference


Course Code: STA132
Credit: 2
Status: Core (for Statistics Major); Required or Optional (for Non-statistics Major)
Course Lecturer: Professor W. B. Yahya

Course Contents:
Presentation and analysis of data. Curve fitting and goodness-of-fit tests. Construction of
questionnaires and simple index numbers. Use of random numbers and statistical tables. 90h(P); C

SOME DEFINITIONS

Statistics (As a field of study): Statistics is the science of collecting, organizing, summarizing,
analyzing, and making inferences from data.
The subject of statistics is divided into two broad areas: descriptive statistics and inferential
statistics.

Population and Sample

A population consists of all elements that are being studied. For example, we may be interested
in studying the distribution of scores obtained by all the students that offered STA132 during the
2018/2019 academic session.


A sample is a subset of the population. For example, we may be interested in studying the
distribution of scores obtained by 100 randomly selected students that offered STA132 during the
2018/2019 academic session.

Since parameters are descriptions of the population, a population can have many parameters.
Similarly, a sample can have many statistics.

Data, Parameter and Statistics:

In order to obtain information, data are collected on the variables used to describe an event. Data
are the values or measurements that variables describing an event can assume.

Data are individual pieces of factual information recorded and used for the purpose of analysis. They
are the raw information from which statistics are created. In other words, a statistic is a
characteristic of, or a fact about, a sample.

Both populations and samples have characteristics that are associated with them. These
are called parameters and statistics, respectively.

A parameter is a characteristic of, or a fact about, a population. For instance, the average age of
students in Nigerian universities is $\mu$, say 29 years.

A statistic is a characteristic of, or a fact about, a sample. For instance, the average age of n
(e.g. 15) randomly selected students of the University of Ilorin is $\bar{x}$, say 26 years. Loosely
speaking, statistics can be regarded as the results of data analysis. We talk of statistics after
some computation has taken place that provides some understanding of what the data mean.

Difference Between Population and Sample



Example 1: Suppose $X$ is the (random) variable that describes the scores obtained by students in
the STA132 first CA test. Then the scores obtained by 10 randomly selected students in this test,
represented by $x_i$, $i = 1, 2, \dots, 10$, are as follows:

$x_i$: 20, 4, 21, 19, 12, 7, 8, 25, 17, 11.

Example 2: The sample mean ($\bar{x}$) is a statistic.

The sample mean score of the students in Example 1 is
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{144}{10} = 14.4$.

Realizations: These are specific values of a random variable. For instance, in Example 1 above,
$x_i$: 20, 4, 21, … are realizations, all of which are specific values of the random variable $X$.
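
As a quick check of Example 2, the sample mean of the ten scores listed in Example 1 can be computed directly. This is only a minimal sketch; the variable names are illustrative and not part of the notes.

```python
# Scores of the 10 randomly selected students in Example 1
scores = [20, 4, 21, 19, 12, 7, 8, 25, 17, 11]

n = len(scores)          # sample size
total = sum(scores)      # sum of the realizations x_i
sample_mean = total / n  # the statistic x-bar

print(f"n = {n}, sum = {total}, sample mean = {sample_mean}")
```

The sum of the scores is 144, so the sample mean is 144/10 = 14.4.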

Types of Variables/Data

The manner in which you analyze data depends on the type of data/variables that you are
evaluating. There are several different classifications that are used in classifying data.
Variable
• A variable is an item of data that varies from one subject/unit/observation to another.
• Examples of variables include gender, investment type, test scores, and weight.
For instance, if $X$ represents the monthly salary of academic staff in Nigerian universities, then
$X$ is a variable.
Note: Variables whose values are determined by chance are called random variables.

Types/Classifications of Variables
• Qualitative: non-numerical quality
• Quantitative: numerical
  - Discrete: counts
  - Continuous: measures


Qualitative Variable/Data
• Qualitative data are data values that can be placed into distinct categories, according to some
  characteristic or attribute.
• This variable describes the quality of something in a non-numerical format. Examples: Colour
  (Red, Black, …); Gender (Male, Female); Class of Degree (First Class, Second Class Upper, …).
• Counts can be applied to qualitative data, but you cannot order (if purely nominal) or measure
  this type of variable. Examples are gender, marital status, geographical region of an
  organization, job title, …

Example 3: Distribution of colour of cars owned by academic staff in the Faculty of Physical
Sciences, University of Ilorin.

Car Colour    Red    Blue    Yellow    Green    White
Frequency     20     54      12        56       3

• Qualitative data is usually treated as categorical data.
  - With categorical data, the observations can be sorted into non-overlapping categories or by
    characteristics.
  - For example, shirts can be sorted according to colour; the characteristic 'colour' can have
    non-overlapping categories: white, black, red, etc. People can be sorted by gender with
    categories male and female.
  - Categories should be chosen carefully since a bad choice can prejudice the outcome.
    Every value of a data set should belong to one and only one category.
• Measurement Scale
  - Nominal: classifies with no ranking (e.g. colour, investment type...)
  - Ordinal: classifies with ranking (e.g. product satisfaction, grades…)
• Analyze qualitative data using:
  - Frequency tables, contingency tables (for 2 variables)
  - Mode (the most frequently occurring category)
  - Graphs: bar charts, pie charts, Pareto charts

Quantitative Data
• Quantitative or numerical data arise when the observations are frequencies or measurements
  that are numeric.
• Discrete Data
  - The data are said to be discrete if the measurements are integers (e.g. number of
    employees of a company, number of incorrect answers on a test, number of
    participants in a program…)
• Continuous Data
  - The data are said to be continuous if the measurements can take on any value,
    usually within some range (e.g. weight). Age and income are continuous quantitative
    variables. For continuous variables, arithmetic operations such as differences and
    averages make sense.
• Analysis can take almost any form:
  - Create groups or categories and generate frequency tables.
  - Effective graphs include: histograms, stem-and-leaf plots, dot plots, box plots,
    and XY scatter plots (2 variables).
  - All descriptive statistics can be applied.
• Measurement Scale
  - Interval: ordered, and the difference between values is meaningful (e.g. standardized
    scores...)
  - Ratio: ordered, the difference between values is meaningful, and there is a true zero in the
    measurement

Note: Some “quantitative” variables can be treated only as ranks; they have a natural order, but
their values are not strictly measured (ordinal data). Examples are: 1) age group (taking the
values child, teen, adult, senior), and 2) Likert scale data (responses such as strongly agree,
agree, neutral, disagree, strongly disagree). For these variables, the distance between adjacent
points on the scale is not necessarily the same, and the ratio of values is not meaningful.
• Analyze using:
  - Frequency tables
  - Mode, median, quartiles
  - Graphs: bar charts, dot plots, pie charts, and line charts (2 variables)


PRESENTATION OF DATA

1.1 INTRODUCTION

Once data has been collected, it has to be classified and organised in such
a way that it becomes easily readable and interpretable, that is, converted to
information. Before the calculation of descriptive statistics, it is sometimes a good
idea to present data as tables, charts, diagrams or graphs. Most people find
‘pictures’ much more helpful than ‘numbers’ in the sense that, in their opinion,
they present data more meaningfully.

In this course, we will consider the various possible types of presentation of data and the
justification for their use in given situations.

1.2 TABULAR FORMS

Raw information usually occurs as individual observations, in the form of a table or array of
unordered values. These observations must first be arranged in some order (ascending or
descending if they are numerical) or simply grouped together in the form of a frequency table
before proper presentation on diagrams is possible.

1.2.1 Arrays

An array is a matrix of rows and columns of numbers which have been arranged in some order
(preferably ascending). It is probably the most primitive way of tabulating information but can
be very useful if it is small in size. Some important statistics can immediately be located by
mere inspection.

Without any calculations, one can easily find the

1. Minimum observation
2. Maximum observation
3. Number of observations, n
4. Mode
5. Median, if n is odd


Example

2 7 8 11 15
16 18 19 19 19
23 23 24 26 27
29 33 40 44 47
49 51 54 63 68

Table 1.2.1

We can easily verify the following:

1. Minimum = 2
2. Maximum = 68
3. Number of observations = 25
4. Mode = 19
5. Median = 24
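
The five values read off the array above can also be checked programmatically. The following is a small sketch using only Python's standard library; the dictionary keys are illustrative names.

```python
from statistics import median, mode

# The ordered array from Table 1.2.1, read row by row
data = [2, 7, 8, 11, 15,
        16, 18, 19, 19, 19,
        23, 23, 24, 26, 27,
        29, 33, 40, 44, 47,
        49, 51, 54, 63, 68]

summary = {
    "minimum": min(data),
    "maximum": max(data),
    "n": len(data),
    "mode": mode(data),      # most frequently occurring value
    "median": median(data),  # middle value, since n is odd
}

for name, value in summary.items():
    print(f"{name}: {value}")
```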

1.2.2 Simple tables

A table is slightly more complex than an array since it needs a heading and the names of the
variables involved. We can also use symbols to represent the variables at times, provided they
are sufficiently explicit for the reader. Optionally, the table may also include totals or
percentages (relative figures).

Example

DISTRIBUTION OF AGES OF DCDMBS STUDENTS


Age of student Frequency Relative frequency
19 14 0.0350
20 23 0.0575
21 134 0.3350
22 149 0.3725
23 71 0.1775
24 9 0.0225
Total 400 1.0000

Table 1.2.2
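
The relative frequency column of the table above is each frequency divided by the total. A minimal sketch of that calculation (the names are illustrative):

```python
# Age -> frequency, as listed in Table 1.2.2
frequencies = {19: 14, 20: 23, 21: 134, 22: 149, 23: 71, 24: 9}

total = sum(frequencies.values())
relative = {age: f / total for age, f in frequencies.items()}

print(f"{'Age':>5} {'Frequency':>10} {'Relative':>10}")
for age, f in frequencies.items():
    print(f"{age:>5} {f:>10} {relative[age]:>10.4f}")
print(f"{'Total':>5} {total:>10} {sum(relative.values()):>10.4f}")
```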

1.2.3 Compound tables

A compound table is just an extension of a simple table in which there is more than one
variable distributed among its attributes (sub-variables). An attribute is just a quality,
property or component of a variable according to which it can be differentiated with respect to
other variables.

We may refer to a compound table as a cross tabulation or even as a contingency table,
depending on the context in which it is used.


Example

UNISA 2004 results for first-year DCDMBS students

                  COURSE
RESULT     BA    B Com    B Sc
Pass       37    25       33
Supp        5    10        4
Fail       11     8       27

Table 1.2.3
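
A cross tabulation such as the one above is often summarised by its row and column totals. The sketch below computes them for the results table; the data structure and names are illustrative choices, not something prescribed by the notes.

```python
courses = ["BA", "B Com", "B Sc"]

# Rows of the compound table: result -> counts per course (same order as `courses`)
table = {
    "Pass": [37, 25, 33],
    "Supp": [5, 10, 4],
    "Fail": [11, 8, 27],
}

row_totals = {result: sum(counts) for result, counts in table.items()}
column_totals = {
    course: sum(counts[i] for counts in table.values())
    for i, course in enumerate(courses)
}
grand_total = sum(row_totals.values())

print("Row totals:   ", row_totals)
print("Column totals:", column_totals)
print("Grand total:  ", grand_total)
```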

1.3 LINE GRAPHS

A line graph is usually meant for showing the frequencies for various
values of a variable. Successive points are joined by means of line segments so
that a glance at the graph is enough for the reader to understand the distribution of
the variable.

1.3.1 Single line graph

The simplest of line graphs is the single line graph, so called because it
displays information concerning one variable only, in terms of its frequencies.

Example

Using the data from the table below,

Age of students    Number of students (frequency)
19 14
20 23
21 134
22 149
23 71
24 8
Total 399

Table 1.3.1.1

we may generate the following line graph:

[Figure: "Line graph for ages of students"; x-axis: Age (19 to 24); y-axis: Number of students]

Fig. 1.3.1.2

1.3.2 Multiple line graph

Multiple line graphs illustrate information on several variables so that comparison is possible
between them. Consider the following table containing information on the ages of first-year
students attending courses at the University of Mauritius (UoM), the De Chazal du Mée Business
School (DCDMBS) and the University of Technology of Mauritius (UTM) respectively.

AGE DISTRIBUTION OF STUDENTS AT ACADEMIC INSTITUTIONS

                   Number of students
Age of students    UoM    DCDMBS    UTM
19 14 8 2
20 23 52 23
21 134 101 152
22 149 133 98
23 71 54 34
24 8 18 13

Table 1.3.2.1

This data, when displayed on a multiple line graph, enables a comparison between the
frequencies for each age among the institutions (maybe in an attempt to know whether younger
students prefer to enrol for courses at one of these institutions).

[Figure: "Multiple line graph for age distribution at academic institutions"; x-axis: Age (19 to 24);
y-axis: Number of students; legend: UoM, DCDMBS, UTM]

Fig. 1.3.2.2

1.4 PIE CHARTS

A pie chart or circular diagram is one which essentially displays the relative figures
(proportions or percentages) of classes or strata of a given sample or population. We should
not include absolute values (class frequencies) on a pie chart. Perhaps, this is the simplest
diagram that can be used to display data and that is the reason why it is quite limited in its
presentation.

The pie chart follows the principle that the angle of each of its sectors
should be proportional to the frequency of the class that it represents.

Merits

1. It gives a simple pictorial display of the relative sizes of classes.
2. It shows clearly when one class is more important than another.
3. It can be used for comparison of the same elements but in two or more
   different populations.

Limitations

1. It only shows the relative sizes of classes.
2. It involves calculation of angles of sectors and drawing them accurately.
3. It is sometimes difficult to compare sector sizes accurately by eye.


1.4.1 Simple pie chart

Example

Using the same data from Table 1.2.3, but this time including the total number of students
enrolled for BA, B Com and B Sc, we shall now display the distribution of students across these
three courses in the population.

UNISA 2004 results for first-year DCDMBS students

                  COURSE
RESULT     BA    B Com    B Sc
Pass       37    25       33
Supp        5    10        4
Fail       11     8       27
TOTAL      53    43       64

Table 1.4.1.1
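
Since each sector angle is proportional to the class frequency, the angles and percentages for a pie chart of the course totals above can be sketched as follows. The 360-degree scaling is the standard convention; the names are illustrative.

```python
# Course totals from the TOTAL row of the table above
totals = {"BA": 53, "B Com": 43, "B Sc": 64}

grand_total = sum(totals.values())

sectors = {
    course: {
        "percent": 100 * count / grand_total,  # relative size of the class
        "angle": 360 * count / grand_total,    # sector angle in degrees
    }
    for course, count in totals.items()
}

for course, s in sectors.items():
    print(f"{course:<6} {s['percent']:6.2f}%  {s['angle']:7.2f} degrees")
```

Rounded to whole percentages, this gives roughly 33% for BA, 27% for B Com and 40% for B Sc, and the three angles add up to 360 degrees.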

It is customary to include a legend to relate the colours or patterns used for each sector to
its corresponding data.

[Figure: pie chart, "Distribution of students enrolled for BA, B Com and B Sc"; sectors: BA 33%,
B Com 27%, B Sc 40%]

Fig. 1.4.1.2


1.4.2 Enhanced pie chart

This is just an enhancement (as the name suggests) of a simple pie chart in order to lay
emphasis on a particular sector.

Example

Again, using the same data from Table 1.2.3, but this time including the total number of
students enrolled for BA, B Com and B Sc, we shall now display the distribution of students
across these three courses in the population.

UNISA 2004 results for first-year DCDMBS students

                  COURSE
RESULT     BA    B Com    B Sc
Pass       37    25       33
Supp        5    10        4
Fail       11     8       27
TOTAL      53    43       64

Table 1.4.1.3

It is customary to include a legend to relate the colours or patterns used for each sector to
its corresponding data. In Fig. 1.4.1.4, we show the importance of the number of passes in
B Sc.

[Figure: enhanced pie chart, "Distribution of students enrolled for BA, B Com and B Sc"; sectors:
BA 33%, B Com 27%, B Sc 40%]

Fig. 1.4.1.4


1.5 BAR CHARTS

The bar chart is one of the most common methods of presenting data in a
visual form. Its main purpose is to display quantities in the form of bars. A bar
chart consists of a set of bars whose heights are proportional to the frequencies
that they represent.

Note that the figure may be drawn horizontally or vertically. There are
different types of bar charts, depending on the number of variables and the type of
information to be displayed.

General merits

1. The quantities can be easily read in terms of heights of the bars.
2. Comparison can be made between values of a variable.
3. It can be used even for non-numerical data.

General limitations

1. The class intervals must be equal in the distribution.
2. It cannot be used for continuous variables.

Note Any additional merit or limitation for each type of bar chart will be
mentioned in its corresponding section.

1.5.1 Simple bar chart

The simple bar chart is used for the case of one variable only. In Table
1.5.1.1 below, our variable is age.

Example

Age of students    Number of students (frequency)
19 14
20 23
21 134
22 149
23 71
24 8
Total 399

Table 1.5.1.1

[Figure: "Simple bar chart for age distribution of students"; x-axis: Age (19 to 24);
y-axis: Number of students]

Fig. 1.5.1.2

1.5.2 Multiple bar chart

The multiple bar chart is an extension of a simple bar chart when there are quantities of
several variables to be displayed. The bars representing the quantities for the different
variables are placed next to one another for each attribute.

Example

UNISA 2004 results for first-year DCDMBS students

                  COURSE
RESULT     BA    B Com    B Sc
Pass       37    25       33
Supp        5    10        4
Fail       11     8       27
TOTAL      53    43       64

Table 1.5.2.1

Fig. 1.5.2.2 shows how an array of frequencies may be very easily displayed on a multiple bar
chart.

[Figure: "Multiple bar chart showing the results for BA, B Com and B Sc"; x-axis: Courses;
y-axis: Results; legend: Pass, Supp, Fail]

Fig. 1.5.2.2

Merits

1. Comparison may be made among components of the same variable.
2. Comparison is also possible for the same component across all variables.

Limitations

1. The figure becomes very cumbersome when there are too many variables
and components.
2. Only absolute, not relative, values are available – it is much easier to
compare component percentages across variables.

1.5.3 Component bar chart

In this type of bar chart, the components (quantities) of each variable are
piled on top of one another.


Example

UNISA 2004 results for first-year DCDMBS students

                  COURSE
RESULT     BA    B Com    B Sc
Pass       37    25       33
Supp        5    10        4
Fail       11     8       27
TOTAL      53    43       64

Table 1.5.2.1

[Figure: "Component bar chart showing the results for each course"; x-axis: Courses;
y-axis: Results; legend: Pass, Supp, Fail]

Fig. 1.5.2.2

Merits

1. Comparison may be made among components of the same variable.
2. Comparison is also possible for the same component across all variables.
3. It saves space as compared to a multiple bar chart.

Limitations

1. Only absolute, not relative, values are available – it is much easier to compare component
   percentages across variables.
2. It is awkward to compute the quantities for individual components.


1.5.4 Percentage (component) bar chart

A percentage (component) bar chart displays the component percentages of each variable, piled
on top of one another. This is a refinement of the component bar chart since, irrespective of
the number of components, the heights of the bars are always kept to a given value (100%).

Fig. 1.5.4 presents the same data as for the previous example.
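
The heights used in a percentage component bar chart are the component counts expressed as a percentage of each course total. A short sketch, using the results table from the previous example (the nested-dictionary layout is an illustrative choice):

```python
# Counts per course and result, from the UNISA results table
results = {
    "BA":    {"Pass": 37, "Supp": 5,  "Fail": 11},
    "B Com": {"Pass": 25, "Supp": 10, "Fail": 8},
    "B Sc":  {"Pass": 33, "Supp": 4,  "Fail": 27},
}

percentages = {}
for course, counts in results.items():
    course_total = sum(counts.values())
    percentages[course] = {
        result: 100 * count / course_total for result, count in counts.items()
    }

for course, parts in percentages.items():
    line = ", ".join(f"{result} {value:.1f}%" for result, value in parts.items())
    print(f"{course:<6} {line}")
```

Every bar sums to 100%, which is why absolute quantities cannot be recovered from the chart unless the totals are also given.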

[Figure: "Percentage bar chart showing the results for each course"; x-axis: Courses;
y-axis: Results (0% to 100%); legend: Pass, Supp, Fail]

Fig. 1.5.4

Merits

1. Comparison may be made among components of the same variable.
2. Comparison is also possible for the same component across all variables.
3. It saves space as compared to a multiple bar chart.

Limitations

1. Only relative, not absolute, values are available – it is not possible to compute quantities
   unless the totals are known.
2. It is awkward to compute the percentages for individual components.
3. The same percentages do not necessarily mean the same quantities (these cannot be calculated
   unless the totals are known).


1.6 HISTOGRAMS

Out of several methods of presenting a frequency distribution graphically, the histogram is the
most popular and widely used in practice. A histogram is a set of vertical bars whose areas are
proportional to the frequencies of the classes that they represent.

While constructing a histogram, the variable is always taken on the x-axis while the
frequencies are on the y-axis. Each class is then represented by a distance on the scale that
is proportional to its class interval. The distance for each rectangle on the x-axis shall
remain the same in the case that the class intervals are uniform throughout the distribution.
If the classes have different class intervals, they will obviously vary accordingly on the
x-axis. The y-axis represents the frequencies of each class which constitute the height of the
rectangle.

The histogram should be clearly distinguished from the bar chart. The
most striking physical difference between these two diagrams is that, unlike the
bar chart, there are no ‘gaps’ between successive rectangles of a histogram. A bar
chart is one-dimensional since only the length, and not the width, matters whereas
a histogram is two-dimensional since both length and width are important.

A histogram is mainly used to display data for continuous variables but can also be adjusted so
as to present discrete data by making an appropriate continuity correction. Moreover, it can be
quite misleading if the distribution has unequal class intervals.

1.6.1 Histograms for equal class intervals

Example

Consider the set of data in Table 1.6.1.1, which represents the ages of workers of a private
company. The real limits and mid-class values have already been computed.

Age group Real limits Mid-class value Frequency


21 – 25 20.5 – 25.5 23 5
26 – 30 25.5 – 30.5 28 12
31 – 35 30.5 – 35.5 33 23
36 – 40 35.5 – 40.5 38 39
41 – 45 40.5 – 45.5 43 32
46 – 50 45.5 – 50.5 48 21
51 – 55 50.5 – 55.5 53 9
56 – 60 55.5 – 60.5 58 2
Total 143

Table 1.6.1.1
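
The real limits and mid-class values in the table above can be derived from the stated class limits. The sketch below assumes the usual continuity correction of half the gap between consecutive classes (here 0.5); the variable names are illustrative.

```python
# Stated class limits from the "Age group" column of Table 1.6.1.1
class_limits = [(21, 25), (26, 30), (31, 35), (36, 40),
                (41, 45), (46, 50), (51, 55), (56, 60)]

# Gap between the upper limit of one class and the lower limit of the next
gap = class_limits[1][0] - class_limits[0][1]

rows = []
for lower, upper in class_limits:
    real_lower = lower - gap / 2   # lower real limit
    real_upper = upper + gap / 2   # upper real limit
    mid_value = (real_lower + real_upper) / 2
    rows.append((real_lower, real_upper, mid_value))

for (lower, upper), (rl, ru, mid) in zip(class_limits, rows):
    print(f"{lower} - {upper}: real limits {rl} - {ru}, mid-class value {mid}")
```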


The data is presented on the histogram in Fig. 1.6.1.2.

Presentation of grouped data (uniform class interval) on a histogram

[Figure: "Histogram for grouped data"; x-axis: Age group of workers, with classes [20.5, 25.5) to
[55.5, 60.5); y-axis: Number of workers (frequency)]

Fig. 1.6.1.2

1.6.2 Histograms for unequal class intervals

When class intervals are unequal, a correction must be made. This consists
of finding the frequency density for each class, which is the ratio of the frequency
to the class interval. The frequency densities now become the actual heights of
the rectangles since the areas of the rectangles should be proportional to the
frequencies.

$$\text{Frequency density} = \frac{\text{Frequency}}{\text{Class interval}}$$

Example

The temperatures (in degrees Fahrenheit) were simultaneously recorded in various cities in the
world at a specific moment. Table 1.6.2.1 below gives the thermometer readings.

Temperature    Class interval    Frequency    Frequency density
[0 – 5) 5 3 0.60
[5 – 10) 5 6 1.20
[10 – 20) 10 10 1.00
[20 – 30) 10 15 1.50
[30 – 40) 10 10 1.00
[40 – 50) 10 5 0.50
[50 – 70) 20 5 0.25
Total 54

Table 1.6.2.1
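
The frequency density column above is obtained by dividing each frequency by the width of its class. A minimal sketch (the tuple layout is an illustrative choice):

```python
# (lower limit, upper limit, frequency) for each class in Table 1.6.2.1
classes = [
    (0, 5, 3), (5, 10, 6), (10, 20, 10), (20, 30, 15),
    (30, 40, 10), (40, 50, 5), (50, 70, 5),
]

densities = []
for lower, upper, frequency in classes:
    width = upper - lower               # class interval
    densities.append(frequency / width)  # height of the rectangle

for (lower, upper, frequency), density in zip(classes, densities):
    print(f"[{lower} - {upper}): frequency {frequency}, density {density:.2f}")

total_frequency = sum(frequency for _, _, frequency in classes)
print("Total frequency:", total_frequency)
```

Because each rectangle's area is width times density, the areas reproduce the original frequencies.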

Note [20 – 30) means ‘from 20 to 30, including 20 but excluding 30’.

Presentation of grouped data (unequal class intervals) on a histogram

[Figure: "Histogram (unequal class intervals) using frequency density"; x-axis: Temperature
(degrees Fahrenheit); y-axis: Frequency density]

Fig. 1.6.2.2

1.7 FREQUENCY POLYGONS

A frequency polygon is a graph of a frequency distribution. There are actually two ways of
drawing a frequency polygon:

1. By first drawing a histogram for the data
2. Direct construction


1.7.1 Drawing a histogram first

This is indeed a very effective way in which a frequency polygon may be constructed. Draw a
histogram of the given data and then join, by means of straight lines, the midpoints of the
upper horizontal side of each rectangle with the adjacent ones. It is an accepted practice to
close the polygon at both ends of the distribution by extending the lines to the base line
(x-axis). When this is done, two hypothetical classes with zero frequency must be included,
one at each end. This extension is made with the objective of making the area under the
polygon equal to the area under the corresponding histogram.

Example

Temperature Frequency
[0 – 10) 2
[10 – 20) 7
[20 – 30) 11
[30 – 40) 17
[40 – 50) 9
[50 – 60) 3
[60 – 70) 1
Total 50

Table 1.7.1.1

Presentation of grouped data on a histogram and frequency polygon

[Figure: "Histogram and frequency polygon for grouped data"; x-axis: Temperature (degrees
Fahrenheit); y-axis: Frequency]

Fig. 1.7.1.2


1.7.2 Direct construction

The frequency polygon may also be directly drawn by finding the points on the figure. The
x-coordinate of each point is the mid-class value of the cell whilst the y-coordinate is the
frequency of the cell (or frequency density if class intervals are unequal). Successive points
are then linked by means of line segments.

In that state, the polygon would be ‘hanging in the air’, that is, it would not touch the
x-axis. To satisfy this requirement, we determine its left x-intercept by subtracting the class
interval of the first class from the x-coordinate of the first point, and its right x-intercept
by adding the class interval of the last class to the x-coordinate of the last point.

Example

Using the data from Table 1.6.1.1, we have the following polygon:

Presentation of grouped data on a frequency polygon

[Figure: "Frequency polygon for grouped data"; x-axis: Age of students, with classes 20.5 – 25.5 to
55.5 – 60.5; y-axis: Number of students (frequency)]

Fig. 1.7.2
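
The direct construction can be sketched in a few lines. This example assumes the polygon is built from the age classes of Table 1.6.1.1 (whose real limits match the axis labels of the figure) and that all classes share the same interval of 5; the names are illustrative.

```python
# Mid-class values and frequencies from Table 1.6.1.1
mid_values = [23, 28, 33, 38, 43, 48, 53, 58]
frequencies = [5, 12, 23, 39, 32, 21, 9, 2]
class_interval = 5  # assumed uniform for every class

# Points of the polygon proper: (mid-class value, frequency)
points = list(zip(mid_values, frequencies))

# Close the polygon on the x-axis with one zero-frequency point at each end
left_closure = (mid_values[0] - class_interval, 0)
right_closure = (mid_values[-1] + class_interval, 0)
vertices = [left_closure] + points + [right_closure]

for x, y in vertices:
    print(f"({x}, {y})")
```

The two closing points are the mid-class values of the hypothetical empty classes on either side of the distribution.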

A frequency polygon sketches an outline of the data pattern more clearly. In fact, it is the
refinement of a histogram, as it does not assume that the frequencies of observations within a
class are equal. The polygon becomes increasingly smooth and curve-like as we increase the
number of classes in a distribution.


1.8 OGIVES

An ogive is the typical shape of a cumulative frequency curve or polygon. It is generated when
cumulative frequencies are plotted against real limits of classes in a distribution. There are
two types of ogives: ‘less than’ and ‘more than’. Before differentiating between these two, let
us start by defining cumulative frequency.

1.8.1 Cumulative frequency

This self-explanatory term means that the frequencies of classes are accumulated over the
entire distribution. We define the two types of cumulative frequencies as follows:

Definition 1

The ‘less than’ cumulative frequency of a class is the total number of observations, in the
entire distribution, which are less than or equal to the upper real limit of the class.

Definition 2

The ‘more than’ cumulative frequency of a class is the total number of observations, in the
entire distribution, which are greater than or equal to the lower real limit of the class.

Note For the rest of this course, we will denote ‘cumulative frequency’ by CF.

Example

Age group    Real limits    Frequency    ‘Less than’ CF    ‘More than’ CF
21 – 25 20.5 – 25.5 5 5 143
26 – 30 25.5 – 30.5 12 17 138
31 – 35 30.5 – 35.5 23 40 126
36 – 40 35.5 – 40.5 39 79 103
41 – 45 40.5 – 45.5 32 111 64
46 – 50 45.5 – 50.5 21 132 32
51 – 55 50.5 – 55.5 9 141 11
56 – 60 55.5 – 60.5 2 143 2
Total 143

Table 1.8.1
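
Both cumulative frequency columns in the table above are running totals of the class frequencies, taken from opposite ends of the distribution. A small sketch using the standard library (names are illustrative):

```python
from itertools import accumulate

# Class frequencies from Table 1.8.1, in order of increasing age
frequencies = [5, 12, 23, 39, 32, 21, 9, 2]

# 'Less than' CF: running total from the first class upwards
less_than_cf = list(accumulate(frequencies))

# 'More than' CF: running total from the last class downwards
more_than_cf = list(accumulate(reversed(frequencies)))[::-1]

for f, lt, mt in zip(frequencies, less_than_cf, more_than_cf):
    print(f"frequency {f:>3}   'less than' CF {lt:>4}   'more than' CF {mt:>4}")
```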


Note Careful inspection of Table 1.8.1 reveals that the ‘less than’ CF of a class
is also the overall rank of the last observation in that class. This is a very
important finding since it will be of tremendous help to us when
calculating percentiles.

1.8.2 ‘Less than’ cumulative frequency ogive

The ‘less than’ CF ogive is used to determine the number of observations which fall below a
given value. We can thus use it to estimate the value of the median and other percentiles by
interpolation on the ogive itself.

The difference between a CF curve and a CF polygon is that, for the polygon, successive points
are linked by means of line segments whereas, for the curve, we fit a smooth curve of best fit
through the points.

The points on a ‘less than’ CF ogive have upper real limits for x-coordinates and ‘less than’
CF for y-coordinates. This is quite easy to remember: ‘less than’ CFs are defined according to
upper real limits!

If we use the data from Table 1.8.1, the following ‘less than’ CF curve is
obtained.

[Figure: "'Less than' cumulative frequency curve for age"; x-axis: Age (upper real limits, 20.5 to
60.5); y-axis: 'Less than' CF]

Fig. 1.8.2

Note The ‘less than’ CF ogive has an x-intercept equal to the lower real limit of
the first class.


1.8.3 ‘More than’ cumulative frequency ogive

The ‘more than’ CF ogive is used to determine the number of observations which fall above a
given value. We can also use it to estimate the value of the median and other percentiles by
interpolation on the ogive itself.

The points on a ‘more than’ CF ogive have lower real limits for x-coordinates and ‘more than’
CF for y-coordinates. Remember that ‘more than’ CFs are defined according to lower real limits!

Again, if we use the data from Table 1.8.1, the following ‘more than’ CF
curve is obtained.

[Figure: "'More than' cumulative frequency ogive for age"; x-axis: Age (lower real limits, 20.5 to
60.5); y-axis: 'More than' CF]

Fig. 1.8.3

Note The ‘more than’ CF ogive has an x-intercept equal to the upper real limit of
the last class.

The main use of cumulative frequency ogives is to estimate percentiles, more specifically the
median and the lower and upper quartiles. However, we may also estimate the percentage of the
distribution that falls below or above a given value. Alternatively, we may find a value above
or below which a certain percentage of the distribution lies.

It is generally advisable to use a CF curve, instead of a CF polygon, since it has been found
to yield more realistic and reliable estimates for percentiles.
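
Reading a percentile off an ogive amounts to interpolating between plotted points. The sketch below uses straight-line interpolation, that is, a CF polygon rather than the smooth curve recommended above, on the ‘less than’ CFs of Table 1.8.1. Taking the median at rank n/2 is one common convention for grouped data and is an assumption here; the function name is hypothetical.

```python
# Points of the 'less than' ogive from Table 1.8.1, starting at the
# lower real limit of the first class (CF = 0)
real_limits = [20.5, 25.5, 30.5, 35.5, 40.5, 45.5, 50.5, 55.5, 60.5]
less_than_cf = [0, 5, 17, 40, 79, 111, 132, 141, 143]


def value_at_rank(rank, limits, cf):
    """Linearly interpolate the value whose 'less than' CF equals `rank`."""
    if not 0 < rank <= cf[-1]:
        raise ValueError("rank must lie between 0 and the total frequency")
    for i in range(1, len(cf)):
        if cf[i] >= rank:
            fraction = (rank - cf[i - 1]) / (cf[i] - cf[i - 1])
            return limits[i - 1] + fraction * (limits[i] - limits[i - 1])


n = less_than_cf[-1]
median_estimate = value_at_rank(n / 2, real_limits, less_than_cf)
lower_quartile = value_at_rank(n / 4, real_limits, less_than_cf)
upper_quartile = value_at_rank(3 * n / 4, real_limits, less_than_cf)

print(f"Q1 = {lower_quartile:.2f}, median = {median_estimate:.2f}, Q3 = {upper_quartile:.2f}")
```

With these data the median estimate falls in the 35.5 to 40.5 class, at about 39.5 years.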


1.9 STEM AND LEAF DIAGRAMS

Stem and leaf diagrams, or stemplots, are used to represent raw data, that
is, individual observations, without loss of information. The ‘leaves’ in the
diagram are actually the last digits of the values (observations) while the ‘stems’
are the remaining part of the values. For example, the value 117 would be split as
‘11’, the stem, and ‘7’, the leaf. By splitting all the values and distributing them
appropriately, we form a stemplot. The example in Section 1.9.1 would be a better
illustration of the above explanation.

1.9.1 Simple stem and leaf plot

Example

The following are the marks (out of 100) obtained by 20 students in an assignment:

84 17 38 45 47
53 76 54 75 22
66 65 55 54 51
44 39 19 54 72

Table 1.9.1.1

In the first instance, the data is classified in the order that it appears on a
stemplot (see Fig. 1.9.1.2). The leaves are then arranged in ascending order (see
Fig. 1.9.1.3) – this is indeed a very practical way of arranging a set of data in
order if the number of observations is not very large.

Fig. 1.9.1.2 (unordered)          Fig. 1.9.1.3 (ordered)

Stem | Leaf                       Stem | Leaf
   1 | 7 9                           1 | 7 9
   2 | 2                             2 | 2
   3 | 8 9                           3 | 8 9
   4 | 5 7 4                         4 | 4 5 7
   5 | 3 4 5 4 1 4                   5 | 1 3 4 4 4 5
   6 | 6 5                           6 | 5 6
   7 | 6 5 2                         7 | 2 5 6
   8 | 4                             8 | 4

Key 1|7 means 17

Note A stemplot must always be accompanied by a key in order to help the reader interpret the
values.
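
The splitting of each value into a stem and a leaf can be sketched as follows for the 20 marks in Table 1.9.1.1. This is an illustrative implementation for two-digit values only; the function name is hypothetical.

```python
from collections import defaultdict

# Marks from Table 1.9.1.1, in the order given
marks = [84, 17, 38, 45, 47, 53, 76, 54, 75, 22,
         66, 65, 55, 54, 51, 44, 39, 19, 54, 72]


def stem_and_leaf(values):
    """Group the last digit (leaf) of each value under its leading part (stem)."""
    stems = defaultdict(list)
    for value in sorted(values):           # sorting gives ordered leaves
        stem, leaf = divmod(value, 10)
        stems[stem].append(leaf)
    return dict(stems)


plot = stem_and_leaf(marks)

print("Stem | Leaf")
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>4} | {leaves}")
print("Key 1|7 means 17")
```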


1.9.2 Back-to-back stemplots

These stemplots are mainly designed to compare two distributions in terms of spread and
skewness.

Example

Table 1.9.2.1 shows the results obtained by 20 pupils in French and English examinations.

FRENCH 75 69 58 58 46 44 32 50 53 78
81 61 61 45 31 44 53 66 47 57
ENGLISH 52 58 68 77 38 85 43 44 56 65
65 79 44 71 84 72 63 69 72 79

Table 1.9.2.1

Using the same classification and ordering principles as in the previous example, we have the
following back-to-back stemplot:

       French | Stem | English
          2 1 |   3  | 8
    7 6 5 4 4 |   4  | 3 4 4
  8 8 7 3 3 0 |   5  | 2 6 8
      9 6 1 1 |   6  | 3 5 5 8 9
          8 5 |   7  | 1 2 2 7 9 9
            1 |   8  | 4 5

Key (French): 8|5 means 58          Key (English): 6|3 means 63

Fig. 1.9.2.2

From Fig. 1.9.2.2, we can deduce that pupils performed better in English
than in French (since they had higher marks in English given the negative
skewness of the distribution).

Merits

1. There is no loss of information from the original data.
2. All descriptive statistics can be exactly calculated or located.
3. If rotated through an angle of 90° anticlockwise, the figure resembles a bar
chart from which the distribution (spread and skewness) of observations
can be readily observed.
Limitations
1. The figure becomes too lengthy if there are too many observations.
2. It is applicable only to discrete data.


1.10 BOX AND WHISKERS DIAGRAMS

Box and whiskers diagrams, commonly known as boxplots, are specially designed to display dispersion and skewness in a distribution. The figure consists of a ‘box’ in the middle from which two lines (whiskers) extend respectively to the minimum and maximum values of the distribution. The position of the median is also indicated inside the box. A boxplot can be drawn either horizontally or vertically on a graph. One axis is scaled to accommodate the values of the observations while the other has no scale, given that the width of the box is irrelevant. The boxplot is applicable to both discrete and continuous data.

A boxplot is drawn according to five descriptive statistics:

1. Minimum value
2. Lower quartile
3. Median
4. Upper quartile
5. Maximum value

Note: Calculation of these statistics will be explained in detail in the chapter on Descriptive Statistics. We will simply label the positions of these values on the diagram.

Example

Using the data from Table 1.9.1.1 in Section 1.9.1, we have the following
five summary statistics:

Minimum           17
Lower quartile    40.25
Median            53.5
Upper quartile    65.75
Maximum           84

Fig. 1.10.1

[Fig. 1.10.2: boxplot of the marks drawn on a horizontal axis, "Number of marks", scaled from 0 to 100]
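The five summary statistics above can be reproduced with a short Python sketch. The notes defer the calculation method to a later chapter, so the rule used here is an assumption: each quartile is located at position p(n + 1) in the ordered data, with linear interpolation between neighbouring values. This convention happens to reproduce the figures quoted above; other quartile conventions give slightly different values. The function names are hypothetical.

```python
def quantile_n_plus_1(sorted_vals, p):
    """Quantile at 1-based position p*(n+1), interpolating linearly (assumed rule)."""
    n = len(sorted_vals)
    pos = p * (n + 1)
    lower = int(pos)           # whole part of the position
    frac = pos - lower         # fractional part used for interpolation
    if lower < 1:
        return sorted_vals[0]
    if lower >= n:
        return sorted_vals[-1]
    below, above = sorted_vals[lower - 1], sorted_vals[lower]
    return below + frac * (above - below)

def five_number_summary(values):
    data = sorted(values)
    return {
        "minimum": data[0],
        "lower quartile": quantile_n_plus_1(data, 0.25),
        "median": quantile_n_plus_1(data, 0.50),
        "upper quartile": quantile_n_plus_1(data, 0.75),
        "maximum": data[-1],
    }

# Marks from Table 1.9.1.1
marks = [84, 17, 38, 45, 47, 53, 76, 54, 75, 22,
         66, 65, 55, 54, 51, 44, 39, 19, 54, 72]

summary = five_number_summary(marks)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The range and inter-quartile range follow by subtraction: maximum minus minimum, and upper quartile minus lower quartile.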


1.10.1 What information can be gathered from a boxplot?

Apart from the five descriptive statistics, we can deduce the following
about the distribution:

1. The range – the numerical difference between the maximum and the
minimum values.
2. The inter-quartile range – the difference between the upper and lower
quartiles. It measures the dispersion for the middle 50% of the distribution.
3. The skewness of the distribution – if the median is closer to the lower
(upper) quartile, the distribution is positively (negatively) skewed. If it is
exactly in the middle of those quartiles, the distribution is symmetrical.

1.10.2 Using the boxplot for comparison

Several boxplots may even be plotted on the same axes for comparison
purposes. We might wish to compare marks obtained by students in French and
English so as to study any similarities and differences between their performances
in these subjects.

[Fig. 1.10.3: boxplots for French and English drawn on the same horizontal axis, "Number of marks", scaled from 0 to 100]

From Fig. 1.10.3, the following can be observed:

1. In general, students have scored higher marks in French than in English (the ‘box’ is more to the right).
2. The range for both subjects is the same.
3. The distribution for French is negatively skewed but that for English
is positively skewed.


1.11 SCATTER DIAGRAMS

Scatter diagrams, also known as scatterplots, are used to investigate the relationship between two variables. If it is suspected that a causal (cause-effect) relationship exists between two variables, inspection of a scatterplot is a natural first check. In such a relationship, we normally have an independent (explanatory) variable, also known as a predictor, and a dependent (response) variable. Detailed explanations of these terms will be given in the section on Regression.

Just imagine that we wish to know whether the length of a metal rod varies
with temperature. We may choose to record the length of the rod at various
temperatures. It is clear here that ‘temperature’ is the independent variable and
‘length’ is the dependent one. These data are kept in the form of a table in which
‘temperature’ and ‘length’ are labelled as X and Y respectively. We next plot the
corresponding pairs of readings in (x, y) form on a graph, the scatter diagram. Fig.
1.11.2 is an example of a scatterplot.

Example

Temperature (°C)   13     50     63     58     20     78     39     55     29     62
Length (cm)        5.10   5.68   5.85   5.74   5.25   5.98   5.59   5.73   5.46   5.81

Table 1.11.1

[Fig. 1.11.2: Scatterplot of length of metal rod at various temperatures; Length (5 to 6.1) plotted against Temperature (0 to 90)]


A scatterplot enables us to check whether two variables appear to be related by inspecting the pattern of points. It even reveals the nature of the relationship, that is, whether it is linear or non-linear, through the shape of the pattern. A clear pattern suggests an association but does not by itself prove a causal relationship. Scatter diagrams are especially useful in regression and correlation analyses.
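As a numerical companion to the visual check, the strength of a linear pattern can be summarised by the Pearson correlation coefficient. The sketch below is illustrative only: the coefficient is not defined in this section, and the function name `pearson_r` is hypothetical. It uses the temperature and length readings of Table 1.11.1.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Table 1.11.1
temperature = [13, 50, 63, 58, 20, 78, 39, 55, 29, 62]
length = [5.10, 5.68, 5.85, 5.74, 5.25, 5.98, 5.59, 5.73, 5.46, 5.81]

r = pearson_r(temperature, length)
print(f"r = {r:.3f}")
```

A value of r close to +1 indicates a strong increasing linear pattern, which agrees with the upward trend of the points in the scatterplot.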

1.12 TIME SERIES HISTORIGRAMS

A time series is a series of figures which show the evolution of a variable over time. The horizontal axis is labelled as the time axis instead of the usual x-axis. The graph of a time series is known as a historigram. Points on the graph are plotted with x-coordinates as time units while the y-coordinates are the values assumed by the variable at those particular times. Successive points are then linked by means of straight lines (similar to a line graph, Section 1.3).

Example

The following data represent the annual sales of petrol in Iraq in millions
of dollars for the period 1985-96.

Year (19_) 85 86 87 88 89 90 91 92 93 94 95 96
Sales ($m) 600 840 420 720 640 860 420 740 670 900 430 760

Table 1.12.1

[Fig. 1.12.2: Time series for annual sales of petrol; Sales ($m, 0 to 1000) plotted against Year (85 to 96)]


A time series shows the trend, cycle and seasonality in the behaviour of a variable. It provides a basis for forecasting the values of the variable on the assumption that history repeats itself.

1.13 LORENZ CURVES

The Lorenz curve is a device for demonstrating the evenness of a distribution by showing the degree of concentration of a property. A common application of the Lorenz curve is to show the distribution of wealth in a population. The construction of such a diagram is explained by means of the example below.

Example

Table 1.13.1 below refers to tax paid by people in various income groups
in a sample. Construct a Lorenz curve for the data and comment on it.

Annual gross income             Number of people    Tax paid ($)
Less than 6 000                 140                 60 000
6 000 and less than 8 000       520                 200 000
8 000 and less than 10 000      620                 660 000
10 000 and less than 14 000     440                 700 000
14 000 and less than 20 000     240                 740 000
20 000 and less than 32 000     40                  680 000
TOTAL                           2000                3 040 000

Table 1.13.1

The above table should now be altered so that relative cumulative frequencies are displayed for both variables, that is, ‘number of people’ and ‘tax paid’. We must change the labels for the first column, determine the cumulative frequencies and then convert these to proportions (or percentages) as shown in Table 1.13.2.

Annual gross income   Cumulative number of people   Cumulative tax paid ($)   Proportion of people   Proportion of tax
Less than 6 000       140                           60 000                    0.07                   0.0197
Less than 8 000       660                           260 000                   0.33                   0.0855
Less than 10 000      1280                          920 000                   0.64                   0.3026
Less than 14 000      1720                          1 620 000                 0.86                   0.5329
Less than 20 000      1960                          2 360 000                 0.98                   0.7763
Less than 32 000      2000                          3 040 000                 1.00                   1.0000

Table 1.13.2
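The conversion from class frequencies to cumulative proportions can be sketched in Python as follows. This is a minimal illustration using the figures of Table 1.13.1; the helper name `cumulative_proportions` is hypothetical.

```python
from itertools import accumulate

def cumulative_proportions(frequencies):
    """Running totals divided by the grand total."""
    total = sum(frequencies)
    return [running / total for running in accumulate(frequencies)]

# Table 1.13.1: class frequencies for each income group
people = [140, 520, 620, 440, 240, 40]
tax_paid = [60_000, 200_000, 660_000, 700_000, 740_000, 680_000]

people_prop = cumulative_proportions(people)
tax_prop = cumulative_proportions(tax_paid)

# Each pair is one point on the Lorenz curve; (0, 0) is the starting point.
for p, t in zip(people_prop, tax_prop):
    print(f"{p:.2f}  {t:.4f}")
```

Plotting the pairs (proportion of people, proportion of tax), starting from the origin, gives the points that are joined to form the Lorenz curve.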


Next, we plot one relative cumulative frequency against the other. It does not really matter which axis is used for which variable, because we only wish to observe the departure of the Lorenz curve from the line of uniform distribution. On the graph, this is simply the line y = x, which represents the ideal situation where, for our example, the tax paid is equally distributed among the various classes of income earners. This line is drawn on the same graph in order to make the ‘bulge’ of the Lorenz curve more visible, as illustrated in Fig. 1.13.3 below.

[Fig. 1.13.3: Lorenz curve for the distribution of taxpayers; Proportion of tax paid plotted against Proportion of taxpayers, together with the line of uniform distribution]

The further the curve is from the line of uniform distribution, the more
uneven is the distribution. It can be observed, for example, that approximately
36% of the population of taxpayers pays only 10% of the total tax. This shows a
considerable degree of unevenness in the population. In an ideal situation, 36% of
the population would have paid 36% of the total tax.

The Lorenz curve is normally used in economic contexts, and its interpretation is very useful whenever there is an imbalance somewhere in the economic sector (for example, in the distribution of wealth in a population).


1.14 Z-CHARTS

A Z-chart is useful for presenting business data. It shows the following:

1. The value of a variable plotted against time over the year.
2. The cumulative sum of values for that variable over the year to date.
3. The annual moving total for that variable.

The annual moving total is the sum of the values of the variable for the 12-month period up to the end of the month under consideration. A line for the budget for the year to date may be added to a Z-chart, for comparison with the cumulative sum of actual values.

Example

The sales figures for a company for 2002 and 2003 are as follows.

Month        2002 sales ($m)    2003 sales ($m)
January      7                  8
February     7                  8
March        8                  8
April        7                  9
May          9                  8
June         8                  8
July         8                  7
August       7                  8
September    6                  9
October      7                  6
November     8                  9
December     8                  9
Total        90                 97

Table 1.14.1

Table 1.14.2 also includes the cumulative sales for 2003 and the annual moving total. For the moving total, the 12-month window is rolled forward one month at a time: from Jan-Dec 2002 to Feb 2002-Jan 2003, then Mar 2002-Feb 2003 and so on until Jan-Dec 2003, with the total sales recalculated and recorded at each step. For example, the entry for January 2003 is 90 − 7 + 8 = 91.

Note Z-charts do not have to cover 12 months of a year. They could, for
example, also be drawn for four quarters of a year or seven days of a
week.


Month        2002 sales ($m)   2003 sales ($m)   Cumulative sales 2003 ($m)   Annual moving total ($m)
January      7                 8                 8                            91
February     7                 8                 16                           92
March        8                 8                 24                           92
April        7                 9                 33                           94
May          9                 8                 41                           93
June         8                 8                 49                           93
July         8                 7                 56                           92
August       7                 8                 64                           93
September    6                 9                 73                           96
October      7                 6                 79                           95
November     8                 9                 88                           96
December     8                 9                 97                           97

Table 1.14.2
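The two derived columns of the table can be computed with a short Python sketch. It is only an illustration: it assumes the monthly sales of Table 1.14.1 and takes the moving total for each month of 2003 as the sum of that month and the eleven months before it.

```python
from itertools import accumulate

# Table 1.14.1: monthly sales in $m, January to December
sales_2002 = [7, 7, 8, 7, 9, 8, 8, 7, 6, 7, 8, 8]
sales_2003 = [8, 8, 8, 9, 8, 8, 7, 8, 9, 6, 9, 9]

# Cumulative sales for 2003 (year to date)
cumulative_2003 = list(accumulate(sales_2003))

# Annual moving total: for month m of 2003 (m = 0 for January), sum the
# 12 months ending with that month, i.e. positions m+1 .. m+12 of the
# combined 24-month series.
combined = sales_2002 + sales_2003
moving_total = [sum(combined[m + 1 : m + 13]) for m in range(12)]

print(cumulative_2003)
print(moving_total)
```

Plotting the monthly figures, the cumulative totals and the moving totals against the months gives the three lines that form the "Z" shape.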

Fig. 1.14.3


Interpretation of Z-charts

The popularity of Z-charts in practical applications derives from the wealth of information which they can contain.

1. Monthly totals show the monthly results at a glance with any seasonal variations.
2. Cumulative totals show the performance to date and can be easily compared with planned and budgeted performance by superimposing the budget line.
3. Annual moving totals compare the current levels of performance with those
of the previous year. If the line is rising, then this year’s monthly results are
better than the results of the corresponding month last year. The opposite
applies if the line is falling. The annual moving total line indicates the long-
term trend of the variable, whether rising, falling or steady.

Note While the values of the annual moving total and the cumulative values are
plotted on month-end positions, the values for the current monthly figures
are plotted on mid-month positions. This is because monthly figures
represent achievement over a particular month whereas the annual moving
totals and the cumulative values represent achievement up to a particular
month end.


LEVELS/SCALES OF MEASUREMENTS

There are four different scales of measurement. The data can be defined as being one of the four
scales. The four types of scales are:

• Nominal Scale
• Ordinal Scale
• Interval Scale
• Ratio Scale

Nominal Scale

A nominal scale is the 1st level of measurement scale in which the numbers serve as “tags” or
“labels” to classify or identify the objects. A nominal scale usually deals with the non-numeric
variables or the numbers that do not have any value.

Characteristics of Nominal Scale

• A nominal scale variable is classified into two or more categories. In this measurement mechanism, the answer should fall into one of the classes.
• It is qualitative. The numbers are used here to identify the objects.
• The numbers don’t define the object characteristics. The only permissible aspect of numbers in the nominal scale is “counting.”

Example:

An example of a nominal scale measurement is given below:

What is your gender?

M- Male = 0

F- Female = 1

Here, the variables are used as tags, and the answer to this question should be either M or F.

Ordinal Scale

The ordinal scale is the 2nd level of measurement that reports the ordering and ranking of data
without establishing the degree of variation between them. Ordinal represents the “order.”
Ordinal data is known as qualitative data or categorical data. It can be grouped, named and also
ranked.

Characteristics of the Ordinal Scale

• The ordinal scale shows the relative ranking of the variables
• It identifies and describes the magnitude of a variable
• Along with the information provided by the nominal scale, ordinal scales give the rankings of those variables
• The interval properties are not known
• The surveyors can quickly analyse the degree of agreement concerning the identified order of variables

Example:

• Ranking of school students – 1st, 2nd, 3rd, etc.
• Ratings in restaurants
• Evaluating the frequency of occurrences
    o Very often
    o Often
    o Not often
    o Not at all
• Assessing the degree of agreement
    o Totally agree
    o Agree
    o Neutral
    o Disagree
    o Totally disagree

Interval Scale

The interval scale is the 3rd level of measurement scale. It is defined as a quantitative measurement scale in which the difference between two values is meaningful. In other words, the variables are measured on an exact scale with equal intervals rather than in a relative way, but the zero point is arbitrary.

Characteristics of Interval Scale:

• The interval scale is quantitative as it can quantify the difference between the values
• It allows calculating the mean and median of the variables
• To understand the difference between the variables, you can subtract the values between the variables
• The interval scale is the preferred scale in Statistics as it helps to assign any numerical values to arbitrary assessment such as feelings, calendar types, etc.

Example:

• Likert Scale
• Net Promoter Score (NPS)
• Bipolar Matrix Table

Ratio Scale

The ratio scale is the 4th level of measurement scale, which is quantitative. It allows researchers to compare the differences or intervals. The ratio scale also has a unique feature: it possesses a true origin, or zero point.

Characteristics of Ratio Scale:

• Ratio scale has a feature of absolute zero
• It doesn’t have negative numbers, because of its zero-point feature
• It affords unique opportunities for statistical analysis. The variables can be meaningfully added, subtracted, multiplied and divided. Mean, median, and mode can be calculated using the ratio scale.
• Ratio scale has unique and useful properties. One such feature is that it allows unit conversions like kilogram – calories, gram – calories, etc.

Example:

An example of a ratio scale is:

What is your weight in kg?

• Less than 55 kg
• 55 – 75 kg
• 76 – 85 kg
• 86 – 95 kg
• More than 95 kg

Chapter 10: Chi-Square Tests: Solutions

10.1 Goodness of Fit Test

In this section, we consider experiments with multiple outcomes. The probability of each
outcome is fixed.
Definition: A chi-square goodness-of-fit test is used to test whether a frequency distribution obtained experimentally fits an “expected” frequency distribution that is based on the theoretical or previously known probability of each outcome.
An experiment is conducted in which a simple random sample is taken from a population, and each member of the sample is grouped into exactly one of k categories.
Step 1: The observed frequencies are calculated for the sample.
Step 2: The expected frequencies are obtained from previous knowledge (or belief) or
probability theory. In order to proceed to the next step, it is necessary that each expected
frequency is at least 5.
Step 3: A hypothesis test is performed:

(i) The null hypothesis H0 : the population frequencies are equal to the expected frequencies.

(ii) The alternative hypothesis, Ha : the null hypothesis is false (what does this imply about the population frequencies?).

(iii) α is the level of significance.

(iv) The degrees of freedom: k − 1.

(v) A test statistic is calculated:

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \sum \frac{(O - E)^2}{E}$$

(vi) From α and k − 1, a critical value is determined from the chi-square table.

(vii) Reject H0 if χ2 is larger than the critical value (right-tailed test).

Example 1: Researchers have conducted a survey of 1600 coffee drinkers asking how much coffee they drink in order to confirm previous studies. Previous studies have indicated that 72% of Americans drink coffee. The results of previous studies (first table) and the survey (second table) are below. At α = 0.05, is there enough evidence to conclude that the distributions differ?


Previous studies:

Response           % of Coffee Drinkers
2 cups per week    15%
1 cup per week     13%
1 cup per day      27%
2+ cups per day    45%

Survey:

Response           Frequency
2 cups per week    206
1 cup per week     193
1 cup per day      462
2+ cups per day    739

(i) The null hypothesis H0 :the population frequencies are equal to the expected frequencies
(to be calculated below).

(ii) The alternative hypothesis, Ha : the null hypothesis is false.

(iii) α = 0.05.

(iv) The degrees of freedom: k − 1 = 4 − 1 = 3.

(v) The test statistic can be calculated using a table:

Response           % of Coffee Drinkers   E                     O     O − E   (O − E)²   (O − E)²/E
2 cups per week    15%                    0.15 × 1600 = 240     206   −34     1156       4.817
1 cup per week     13%                    0.13 × 1600 = 208     193   −15     225        1.082
1 cup per day      27%                    0.27 × 1600 = 432     462   30      900        2.083
2+ cups per day    45%                    0.45 × 1600 = 720     739   19      361        0.5014

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \sum \frac{(O - E)^2}{E} = 8.483.$$

(vi) From α = 0.05 and k − 1 = 3, the critical value is 7.815.

(vii) Is there enough evidence to reject H0 ? Since χ2 ≈ 8.483 > 7.815, there is enough
statistical evidence to reject the null hypothesis and to believe that the old percentages
no longer hold.
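The calculation in Example 1 can be checked with a small Python sketch. This is an illustration under stated assumptions rather than a standard routine: the function name `chi_square_gof` is hypothetical, and the critical value 7.815 is simply copied from the chi-square table value quoted above, not computed.

```python
def chi_square_gof(observed, proportions):
    """Return expected frequencies and the chi-square goodness-of-fit statistic."""
    n = sum(observed)
    expected = [p * n for p in proportions]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return expected, statistic

# Example 1: survey frequencies and proportions from previous studies
observed = [206, 193, 462, 739]
proportions = [0.15, 0.13, 0.27, 0.45]

expected, statistic = chi_square_gof(observed, proportions)

# The test requires every expected frequency to be at least 5.
assert all(e >= 5 for e in expected)

critical_value = 7.815          # table value for alpha = 0.05, d.f. = 3
reject_h0 = statistic > critical_value
print(f"chi-square = {statistic:.3f}, reject H0: {reject_h0}")
```

The same function applies to Example 2 by passing five equal proportions of 0.2.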

Example 2: A department store, A, has four competitors: B,C,D, and E. Store A hires a
consultant to determine if the percentage of shoppers who prefer each of the five stores
is the same. A survey of 1100 randomly selected shoppers is conducted, and the results
about which one of the stores shoppers prefer are below. Is there enough evidence using a
significance level α = 0.05 to conclude that the proportions are really the same?

Store A B C D E
Number of Shoppers 262 234 204 190 210


(i) The null hypothesis H0 :the population frequencies are equal to the expected frequencies
(to be calculated below).

(ii) The alternative hypothesis, Ha : the null hypothesis is false.

(iii) α = 0.05.

(iv) The degrees of freedom: k − 1 = 5 − 1 = 4.

(v) The test statistic can be calculated using a table:

Preference   % of Shoppers   E                    O     O − E   (O − E)²   (O − E)²/E
A            20%             0.2 × 1100 = 220     262   42      1764       8.018
B            20%             0.2 × 1100 = 220     234   14      196        0.891
C            20%             0.2 × 1100 = 220     204   −16     256        1.164
D            20%             0.2 × 1100 = 220     190   −30     900        4.091
E            20%             0.2 × 1100 = 220     210   −10     100        0.455

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \sum \frac{(O - E)^2}{E} = 14.618.$$

(vi) From α = 0.05 and k − 1 = 4, the critical value is 9.488.

(vii) Is there enough evidence to reject H0 ? Since χ2 ≈ 14.618 > 9.488, there is enough
statistical evidence to reject the null hypothesis and to believe that customers do not
prefer each of the five stores equally.

3
EDWARD CARES

10.2 Independence

Recall that two events are independent if the occurrence of one of the events has no effect
on the occurrence of the other event.
A chi-square independence test is used to test whether or not two variables are independent.
As in section 10.1, an experiment is conducted in which the frequencies for two variables
are determined. To use the test, the same assumptions must be satisfied: the observed
frequencies are obtained through a simple random sample, and each expected frequency is
at least 5. The frequencies are written down in a table: the columns contain outcomes for
one variable, and the rows contain outcomes for the other variable.
The procedure for the hypothesis test is essentially the same. The differences are that:

(i) H0 is that the two variables are independent.

(ii) Ha is that the two variables are not independent (they are dependent).

(iii) The expected frequency $E_{r,c}$ for the entry in row r, column c is calculated using:

$$E_{r,c} = \frac{(\text{Sum of row } r) \times (\text{Sum of column } c)}{\text{Sample size}}$$

(iv) The degrees of freedom: (number of rows - 1)×(number of columns - 1).

Example 3: The results of a random sample of children with pain from musculoskeletal
injuries treated with acetaminophen, ibuprofen, or codeine are shown in the table. At α =
0.10, is there enough evidence to conclude that the treatment and result are independent?

                                  Acetaminophen (c. 1)   Ibuprofen (c. 2)   Codeine (c. 3)   Total
(r. 1) Significant Improvement    58 (66.7)              81 (66.7)          61 (66.7)        200
(r. 2) Slight Improvement         42 (33.3)              19 (33.3)          39 (33.3)        100
Total                             100                    100                100              300

First, calculate the column and row totals.
Then find the expected frequency for each item and write it in parentheses next to the observed frequency.
Now perform the hypothesis test.

(i) The null hypothesis H0 : the treatment and response are independent.

(ii) The alternative hypothesis, Ha : the treatment and response are dependent.

(iii) α = 0.10.

(iv) The degrees of freedom: (number of rows - 1)×(number of columns - 1) = (2 − 1) × (3 − 1) = 1 × 2 = 2.

(v) The test statistic can be calculated using a table:

Row, Column   E                        O    O − E    (O − E)²   (O − E)²/E
1,1           200·100/300 = 66.7       58   −8.7     75.69      1.135
1,2           200·100/300 = 66.7       81   14.3     204.49     3.067
1,3           200·100/300 = 66.7       61   −5.7     32.49      0.487
2,1           100·100/300 = 33.3       42   8.7      75.69      2.271
2,2           100·100/300 = 33.3       19   −14.3    204.49     6.135
2,3           100·100/300 = 33.3       39   5.7      32.49      0.975

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \sum \frac{(O - E)^2}{E} = 14.07.$$

(vi) From α = 0.10 and d.f = 2, the critical value is 4.605.

(vii) Is there enough evidence to reject H0 ? Since χ² ≈ 14.07 > 4.605, there is enough statistical evidence to reject the null hypothesis and to believe that there is a relationship between the treatment and response.
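The expected frequencies and test statistic of Example 3 can be reproduced with the Python sketch below. It is illustrative only: the function name `chi_square_independence` is hypothetical, and the critical value 4.605 is copied from the table value quoted above rather than computed.

```python
def chi_square_independence(table):
    """Expected frequencies, chi-square statistic and d.f. for a contingency table.

    `table` is a list of rows of observed frequencies (totals excluded).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # E[r][c] = (sum of row r) * (sum of column c) / sample size
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return expected, statistic, df

# Example 3: rows = significant / slight improvement,
# columns = acetaminophen, ibuprofen, codeine
observed = [[58, 81, 61],
            [42, 19, 39]]

expected, statistic, df = chi_square_independence(observed)
critical_value = 4.605          # table value for alpha = 0.10, d.f. = 2
print(f"chi-square = {statistic:.2f}, d.f. = {df}, reject H0: {statistic > critical_value}")
```

Because the sketch keeps the expected frequencies unrounded, individual terms may differ slightly in the last digit from a hand calculation that rounds E to one decimal place.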

Practice Problem 1: A doctor believes that the proportions of births in this country on each
day of the week are equal. A simple random sample of 700 births from a recent year is
selected, and the results are below. At a significance level of 0.01, is there enough evidence
to support the doctor’s claim?

Day         Sunday   Monday   Tuesday   Wednesday   Thursday   Friday   Saturday
Frequency   65       103      114       116         115        112      75

(i) The null hypothesis H0 :the population frequencies are equal to the expected frequencies
(to be calculated below).

(ii) The alternative hypothesis, Ha : the null hypothesis is false.

(iii) α = 0.01.

(iv) The degrees of freedom: k − 1 = 7 − 1 = 6.

(v) The test statistic can be calculated using a table:


Day         E               O     O − E   (O − E)²   (O − E)²/E
Sunday      700/7 = 100     65    −35     1225       12.25
Monday      700/7 = 100     103   3       9          0.09
Tuesday     700/7 = 100     114   14      196        1.96
Wednesday   700/7 = 100     116   16      256        2.56
Thursday    700/7 = 100     115   15      225        2.25
Friday      700/7 = 100     112   12      144        1.44
Saturday    700/7 = 100     75    −25     625        6.25

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \sum \frac{(O - E)^2}{E} = 26.8.$$

(vi) From α = 0.01 and k − 1 = 6, the critical value is 16.812.

(vii) Is there enough evidence to reject H0 ? Since χ2 ≈ 26.8 > 16.812, there is enough
statistical evidence to reject the null hypothesis and to believe that the proportion of
births is not the same for each day of the week.

Practice Problem 2: The side effects of a new drug are being tested against a placebo. A
simple random sample of 565 patients yields the results below. At a significance level of
α = 0.05, is there enough evidence to conclude that the treatment is independent of the side
effect of nausea?

Result            Drug (c.1)   Placebo (c.2)   Total
Nausea (r.1)      36           13              49
No nausea (r.2)   254          262             516
Total             290          275             565

(i) The null hypothesis H0 : the treatment and response are independent.

(ii) The alternative hypothesis, Ha : the treatment and response are dependent.

(iii) α = 0.05.

(iv) The degrees of freedom: (number of rows - 1)×(number of columns - 1) = (2 − 1) × (2 − 1) = 1 × 1 = 1.

(v) The test statistic can be calculated using a table:

Row, Column   E                         O     O − E    (O − E)²   (O − E)²/E
1,1           49·290/565 = 25.15        36    10.85    117.72     4.681
1,2           49·275/565 = 23.85        13    −10.85   117.72     4.936
2,1           516·290/565 = 264.85      254   −10.85   117.72     0.444
2,2           516·275/565 = 251.15      262   10.85    117.72     0.469

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \sum \frac{(O - E)^2}{E} = 10.53.$$

(vi) From α = 0.05 and d.f = 1, the critical value is 3.841.

(vii) Is there enough evidence to reject H0 ? Since χ² ≈ 10.53 > 3.841, there is enough statistical evidence to reject the null hypothesis and to believe that there is a relationship between the treatment and response.

Practice Problem 3: Suppose that we have a 6-sided die. We assume that the die is unbiased
(upon rolling the die, each outcome is equally likely). An experiment is conducted in which
the die is rolled 240 times. The outcomes are in the table below. At a significance level of
α = 0.05, is there enough evidence to support the hypothesis that the die is unbiased?

Outcome 1 2 3 4 5 6
Frequency 34 44 30 46 51 35

(i) The null hypothesis H0 : each face is equally likely to be the outcome of a single roll.

(ii) The alternative hypothesis, Ha : the null hypothesis is false.

(iii) α = 0.05.

(iv) The degrees of freedom: k − 1 = 6 − 1 = 5.

(v) The test statistic can be calculated using a table:

Face   E              O    O − E   (O − E)²   (O − E)²/E
1      240/6 = 40     34   −6      36         0.9
2      240/6 = 40     44   4       16         0.4
3      240/6 = 40     30   −10     100        2.5
4      240/6 = 40     46   6       36         0.9
5      240/6 = 40     51   11      121        3.025
6      240/6 = 40     35   −5      25         0.625

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \sum \frac{(O - E)^2}{E} = 8.35.$$

(vi) From α = 0.05 and k − 1 = 5, the critical value is 11.070.

(vii) Is there enough evidence to reject H0 ? Since χ² ≈ 8.35 < 11.070, we fail to reject the null hypothesis that the die is fair.

CGN 3421 - Computer Methods Gurley

Numerical Methods Lecture 5 - Curve Fitting Techniques


Topics
motivation
interpolation
linear regression
higher order polynomial form
exponential form
Curve fitting - motivation
For root finding, we used a given function to identify where it crossed zero:
where does $f(x) = 0$?
Q: Where does this given function $f(x)$ come from in the first place?

• Analytical models of phenomena (e.g. equations from physics)


• Create an equation from observed data

1) Interpolation (connect the data-dots)


If data is reliable, we can plot it and connect the dots.
This is piece-wise, linear interpolation.
This has limited use as a general function $f(x)$,
since it's really a group of small $f(x)$s, connecting one point to the next.
It doesn’t work very well for data that has built-in random error (scatter).

2) Curve fitting - capturing the trend in the data by assigning a single function across the entire range.
The example below uses a straight line function

[Figure: two panels. Interpolation: f(x) = ax + b for each line. Curve Fitting: f(x) = ax + b for entire range.]

A straight line is described generically by f(x) = ax + b

The goal is to identify the coefficients ‘a’ and ‘b’ such that f(x) ‘fits’ the data well


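Other examples of data sets that we can fit a function to:

[Figure: four plots. Top left: height of dropped object against time. Top right: oxygen in soil against temperature. Bottom left: pore pressure against soil depth. Bottom right: profit against paid labor hours.]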

Is a straight line suitable for each of these cases ?


No. But we’re not stuck with just straight line fits. We’ll start with straight lines, then expand the concept.

Linear curve fitting (linear regression)

Given the general form of a straight line

$$f(x) = ax + b$$

How can we pick the coefficients that best fit the line to the data?

First question: What makes a particular straight line a ‘good’ fit?

Why does the blue line appear to us to fit the trend better?

• Consider the distance between the data and points on the line

• Add up the length of all the red and blue vertical lines

• This is an expression of the ‘error’ between data and fitted line

• The one line that provides a minimum error is then the ‘best’
straight line


Quantifying error in a curve fit


Assumptions:

1) Positive and negative errors have the same value (the data point is above or below the line).
2) Greater errors are weighted more heavily.

We can do both of these things by squaring the distance.

Denote data values as $(x, y)$ and points on the fitted line as $(x, f(x))$.

[Figure: data points $(x_2, y_2)$ and $(x_4, y_4)$ with the corresponding points $(x_2, f(x_2))$ and $(x_4, f(x_4))$ on the fitted line]

Sum the error at the four data points:

$$err = \sum (d_i)^2 = (y_1 - f(x_1))^2 + (y_2 - f(x_2))^2 + (y_3 - f(x_3))^2 + (y_4 - f(x_4))^2$$

Our fit is a straight line, so now substitute $f(x) = ax + b$:

$$err = \sum_{i=1}^{\#\,\text{data points}} (y_i - f(x_i))^2 = \sum_{i=1}^{\#\,\text{data points}} (y_i - (a x_i + b))^2$$

The ‘best’ line has minimum error between line and data points

This is called the least squares approach, since we minimize the square of the error.

$$\text{minimize } err = \sum_{i=1}^{n} (y_i - (a x_i + b))^2, \qquad n = \#\,\text{data points}$$

time to pull out the calculus... finding the minimum of a function


1) derivative describes the slope
2) slope = zero is a minimum
==> take the derivative of the error with respect to $a$ and $b$, set each to zero

$$\frac{\partial\, err}{\partial a} = -2 \sum_{i=1}^{n} x_i (y_i - a x_i - b) = 0$$

$$\frac{\partial\, err}{\partial b} = -2 \sum_{i=1}^{n} (y_i - a x_i - b) = 0$$


Solve for the $a$ and $b$ so that the previous two equations both equal 0.
Re-write these two equations:

$$a \sum x_i^2 + b \sum x_i = \sum (x_i y_i)$$

$$a \sum x_i + b \, n = \sum y_i$$

Put these into matrix form:

$$\begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix} \begin{bmatrix} b \\ a \end{bmatrix} = \begin{bmatrix} \sum y_i \\ \sum (x_i y_i) \end{bmatrix}$$

What’s unknown?
We have the data points $(x_i, y_i)$ for $i = 1, \ldots, n$, so we have all the summation terms in the matrix,
so the unknowns are $a$ and $b$.


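Good news, we already know how to solve this problem.
Remember Gaussian elimination?

$$A = \begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix}, \quad X = \begin{bmatrix} b \\ a \end{bmatrix}, \quad B = \begin{bmatrix} \sum y_i \\ \sum (x_i y_i) \end{bmatrix}$$

so

$$AX = B$$

Using built-in Mathcad matrix inversion, the coefficients $a$ and $b$ are solved:
>> X = A^(-1)*B

Note: $A$, $B$, and $X$ are not the same as $a$, $b$, and $x$.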

Let’s test this with an example:

i    1     2     3     4     5     6
x    0     0.5   1.0   1.5   2.0   2.5
y    0     1.5   3.0   4.5   6.0   7.5

First we find values for all the summation terms:

$$n = 6$$

$$\sum x_i = 7.5, \quad \sum y_i = 22.5, \quad \sum x_i^2 = 13.75, \quad \sum x_i y_i = 41.25$$

Now plugging into the matrix form gives us:


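$$\begin{bmatrix} 6 & 7.5 \\ 7.5 & 13.75 \end{bmatrix} \begin{bmatrix} b \\ a \end{bmatrix} = \begin{bmatrix} 22.5 \\ 41.25 \end{bmatrix}$$

Note: we are using $\sum x_i^2$, NOT $(\sum x_i)^2$.

$$\begin{bmatrix} b \\ a \end{bmatrix} = \text{inv}\begin{bmatrix} 6 & 7.5 \\ 7.5 & 13.75 \end{bmatrix} \begin{bmatrix} 22.5 \\ 41.25 \end{bmatrix} \quad \text{or use Gaussian elimination...}$$

The solution is

$$\begin{bmatrix} b \\ a \end{bmatrix} = \begin{bmatrix} 0 \\ 3 \end{bmatrix} \;\Longrightarrow\; f(x) = 3x + 0$$
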
This fits the data exactly. That is, the error is zero. Usually this is not the outcome. Usually we have data
that does not exactly fit a straight line.
Here’s an example with some ‘noisy’ data

x = [0 .5 1 1.5 2 2.5], y = [-0.4326 -0.1656 3.1253 4.7877 4.8535 8.6909]

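$$\begin{bmatrix} 6 & 7.5 \\ 7.5 & 13.75 \end{bmatrix} \begin{bmatrix} b \\ a \end{bmatrix} = \begin{bmatrix} 20.8593 \\ 41.6584 \end{bmatrix}, \quad \begin{bmatrix} b \\ a \end{bmatrix} = \text{inv}\begin{bmatrix} 6 & 7.5 \\ 7.5 & 13.75 \end{bmatrix} \begin{bmatrix} 20.8593 \\ 41.6584 \end{bmatrix}, \quad \begin{bmatrix} b \\ a \end{bmatrix} = \begin{bmatrix} -0.975 \\ 3.561 \end{bmatrix}$$

so our fit is $f(x) = 3.561\,x - 0.975$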
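The least squares line can also be computed directly from the summation terms. The Python sketch below is a minimal illustration, not the Mathcad procedure used in the lecture: the function name `fit_line` is hypothetical, and the 2×2 system is solved with the closed-form (Cramer's rule) solution instead of matrix inversion or Gaussian elimination.

```python
def fit_line(xs, ys):
    """Least squares straight line f(x) = a*x + b via the normal equations."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)                # sum of x_i^2, not (sum x_i)^2
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    det = n * sum_xx - sum_x * sum_x               # determinant of the 2x2 matrix
    a = (n * sum_xy - sum_x * sum_y) / det         # slope
    b = (sum_xx * sum_y - sum_x * sum_xy) / det    # intercept
    return a, b

x = [0, 0.5, 1.0, 1.5, 2.0, 2.5]

# Exact data: every point lies on y = 3x
a_exact, b_exact = fit_line(x, [0, 1.5, 3.0, 4.5, 6.0, 7.5])

# Noisy data from the lecture example
y_noisy = [-0.4326, -0.1656, 3.1253, 4.7877, 4.8535, 8.6909]
a_noisy, b_noisy = fit_line(x, y_noisy)

print(f"exact data: a = {a_exact:.3f}, b = {b_exact:.3f}")
print(f"noisy data: a = {a_noisy:.3f}, b = {b_noisy:.3f}")
```

With the noisy values entered to four decimal places, the coefficients agree with the quoted fit to within about 0.001; small differences of this size would be expected if the lecture values were computed from unrounded data.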

Here’s a plot of the data and the curve fit:

So... what do we do when a straight line is not suitable for the data set?

[Figure: profit against paid labor hours]

A straight line will not predict the diminishing returns that the data shows.
Curve fitting - higher order polynomials

We started the linear curve fit by choosing a generic form of the straight line f(x) = ax + b
This is just one kind of function. There are an infinite number of generic forms we could choose from for
almost any shape we want. Let’s start with a simple extension to the linear regression concept
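Recall the examples of sampled data:

[Figure: four sample data plots. Top left: height of dropped object versus time. Top right: oxygen in soil versus temperature. Bottom left: pore pressure versus soil depth. Bottom right: profit versus paid labor hours.]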


Is a straight line suitable for each of these cases ? Top left and bottom right don’t look linear in trend, so
why fit a straight line? No reason to, let’s consider other options. There are lots of functions with lots of
different shapes that depend on coefficients. We can choose a form based on experience and trial/error.
Let’s develop a few options for non-linear curve fitting. We’ll start with a simple extension to linear
regression...higher order polynomials

Polynomial Curve Fitting


Consider the general form for a polynomial of order $j$:

$$f(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \cdots + a_j x^j = a_0 + \sum_{k=1}^{j} a_k x^k \qquad (1)$$
Just as was the case for linear regression, we ask:

How can we pick the coefficients that best fit the curve to the
data? We can use the same idea:
The curve that gives minimum error between the data $y$ and the fit
$f(x)$ is ‘best’.

Quantify the error for these two second order curves...

• Add up the lengths of all the red and blue vertical lines
• Pick the curve with minimum total error
Error - Least squares approach

The general expression for any error using the least squares approach is

$$\mathrm{err} = \sum (d_i)^2 = (y_1 - f(x_1))^2 + (y_2 - f(x_2))^2 + (y_3 - f(x_3))^2 + (y_4 - f(x_4))^2 \qquad (2)$$

where we want to minimize this error. Now substitute the form of our eq. (1)

$$f(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \cdots + a_j x^j = a_0 + \sum_{k=1}^{j} a_k x^k$$

into the general least squares error eq. (2)

$$\mathrm{err} = \sum_{i=1}^{n} \left( y_i - \left( a_0 + a_1 x_i + a_2 x_i^2 + a_3 x_i^3 + \cdots + a_j x_i^j \right) \right)^2 \qquad (3)$$

where: $n$ - # of data points given, $i$ - the current data point being summed, $j$ - the polynomial order
Re-writing eq. (3):

$$\mathrm{err} = \sum_{i=1}^{n} \left( y_i - \left( a_0 + \sum_{k=1}^{j} a_k x_i^k \right) \right)^2 \qquad (4)$$

Find the best line = minimize the error (squared distance) between line and data points.
Find the set of coefficients $a_k$, $a_0$ so we can minimize eq. (4).

CALCULUS TIME
To minimize eq. (4), take the derivative with respect to each coefficient $a_0$, $a_k$, $k = 1, \ldots, j$, and set each to
zero:

$$\frac{\partial\,\mathrm{err}}{\partial a_0} = -2 \sum_{i=1}^{n} \left( y_i - \left( a_0 + \sum_{k=1}^{j} a_k x_i^k \right) \right) = 0$$

$$\frac{\partial\,\mathrm{err}}{\partial a_1} = -2 \sum_{i=1}^{n} \left( y_i - \left( a_0 + \sum_{k=1}^{j} a_k x_i^k \right) \right) x_i = 0$$

$$\frac{\partial\,\mathrm{err}}{\partial a_2} = -2 \sum_{i=1}^{n} \left( y_i - \left( a_0 + \sum_{k=1}^{j} a_k x_i^k \right) \right) x_i^2 = 0$$

$$\vdots$$

$$\frac{\partial\,\mathrm{err}}{\partial a_j} = -2 \sum_{i=1}^{n} \left( y_i - \left( a_0 + \sum_{k=1}^{j} a_k x_i^k \right) \right) x_i^j = 0$$

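Re-write these $j + 1$ equations, and put into matrix form:

$$\begin{bmatrix}
n & \sum x_i & \sum x_i^2 & \cdots & \sum x_i^j \\
\sum x_i & \sum x_i^2 & \sum x_i^3 & \cdots & \sum x_i^{j+1} \\
\sum x_i^2 & \sum x_i^3 & \sum x_i^4 & \cdots & \sum x_i^{j+2} \\
\vdots & \vdots & \vdots & & \vdots \\
\sum x_i^j & \sum x_i^{j+1} & \sum x_i^{j+2} & \cdots & \sum x_i^{j+j}
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_j \end{bmatrix}
=
\begin{bmatrix} \sum y_i \\ \sum (x_i y_i) \\ \sum (x_i^2 y_i) \\ \vdots \\ \sum (x_i^j y_i) \end{bmatrix}$$

where all summations above are over $i = 1, \ldots, n$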

What’s unknown?

We have the data points $(x_i, y_i)$ for $i = 1, \ldots, n$.

We want $a_0$, $a_k$, $k = 1, \ldots, j$.

We already know how to solve this problem. Remember Gaussian elimination?

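$$A = \begin{bmatrix}
n & \sum x_i & \sum x_i^2 & \cdots & \sum x_i^j \\
\sum x_i & \sum x_i^2 & \sum x_i^3 & \cdots & \sum x_i^{j+1} \\
\sum x_i^2 & \sum x_i^3 & \sum x_i^4 & \cdots & \sum x_i^{j+2} \\
\vdots & \vdots & \vdots & & \vdots \\
\sum x_i^j & \sum x_i^{j+1} & \sum x_i^{j+2} & \cdots & \sum x_i^{j+j}
\end{bmatrix}, \quad
X = \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_j \end{bmatrix}, \quad
B = \begin{bmatrix} \sum y_i \\ \sum (x_i y_i) \\ \sum (x_i^2 y_i) \\ \vdots \\ \sum (x_i^j y_i) \end{bmatrix}$$

where all summations above are over $i = 1, \ldots, n$ data points

Note: No matter what the order $j$, we always get equations LINEAR with respect to the coefficients.
This means we can use the following solution method

$$A X = B$$

Using the built-in Mathcad matrix inversion, the coefficients $a_0, a_1, \ldots, a_j$ are solved:
>> X = A^(-1)*B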
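
The same steps can be sketched outside Mathcad. The following Python sketch builds the $(j+1) \times (j+1)$ system from the summation terms and solves it with Gaussian elimination and partial pivoting instead of a matrix inverse. The function names `gauss_solve` and `polyfit_normal` and the sample points, which lie exactly on y = 1 + 2x + 3x^2, are assumptions made for illustration.

```python
def gauss_solve(A, B):
    """Solve A X = B by Gaussian elimination with partial pivoting."""
    n = len(B)
    # Work on an augmented copy so the caller's lists are left untouched.
    M = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(A, B)]
    for col in range(n):
        # Bring the row with the largest pivot into place to limit round-off.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution, starting from the last row.
    X = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(M[r][c] * X[c] for c in range(r + 1, n))
        X[r] = (M[r][n] - known) / M[r][r]
    return X


def polyfit_normal(xs, ys, j):
    """Return [a_0, ..., a_j] for the order-j least squares polynomial."""
    # Entry (r, c) of A is sum of x_i^(r+c); entry r of B is sum of x_i^r * y_i.
    A = [[sum(x ** (r + c) for x in xs) for c in range(j + 1)]
         for r in range(j + 1)]
    B = [sum((x ** r) * y for x, y in zip(xs, ys)) for r in range(j + 1)]
    return gauss_solve(A, B)


# Hypothetical data lying exactly on y = 1 + 2x + 3x^2.
coeffs = polyfit_normal([-1, 0, 1, 2], [2, 1, 6, 17], 2)
print(coeffs)
```

The printed coefficients should come out close to 1, 2 and 3. Calling `polyfit_normal` with `j = 1` gives the straight-line fit from earlier in the lecture.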

Example #1:

Fit a second order polynomial to the following data

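i    1     2     3     4     5     6
x    0     0.5   1.0   1.5   2.0   2.5
y    0     0.25  1.0   2.25  4.0   6.25

Since the order is 2 ($j = 2$), the matrix form to solve is

$$\begin{bmatrix} n & \sum x_i & \sum x_i^2 \\ \sum x_i & \sum x_i^2 & \sum x_i^3 \\ \sum x_i^2 & \sum x_i^3 & \sum x_i^4 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \\ \sum x_i^2 y_i \end{bmatrix}$$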
Now plug in the given data.
Before we go on...what answers do you expect for the coefficients after looking at the data?
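$$n = 6$$
$$\sum x_i = 7.5, \qquad \sum y_i = 13.75$$
$$\sum x_i^2 = 13.75, \qquad \sum x_i y_i = 28.125$$
$$\sum x_i^3 = 28.125, \qquad \sum x_i^2 y_i = 61.1875$$
$$\sum x_i^4 = 61.1875$$

$$\begin{bmatrix} 6 & 7.5 & 13.75 \\ 7.5 & 13.75 & 28.125 \\ 13.75 & 28.125 & 61.1875 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} 13.75 \\ 28.125 \\ 61.1875 \end{bmatrix}$$

Note: we are using $\sum x_i^2$, NOT $\left( \sum x_i \right)^2$. There’s a big difference.

Using the inversion method:

$$\begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix} = \mathrm{inv}\left( \begin{bmatrix} 6 & 7.5 & 13.75 \\ 7.5 & 13.75 & 28.125 \\ 13.75 & 28.125 & 61.1875 \end{bmatrix} \right) \begin{bmatrix} 13.75 \\ 28.125 \\ 61.1875 \end{bmatrix}$$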

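or use Gaussian elimination, which gives us the solution for the coefficients:

$$\begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \;\Longrightarrow\; f(x) = 0 + 0 \cdot x + 1 \cdot x^2$$

This fits the data exactly. That is, $f(x) = y$ since $y = x^2$.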

Example #2: uncertain data


Now we’ll try some ‘noisy’ data

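x = [0 .5 1 1.5 2 2.5]
y = [0.0674 -0.9156 1.6253 3.0377 3.3535 7.9409]
The resulting system to solve is:

$$\begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix} = \mathrm{inv}\left( \begin{bmatrix} 6 & 7.5 & 13.75 \\ 7.5 & 13.75 & 28.125 \\ 13.75 & 28.125 & 61.1875 \end{bmatrix} \right) \begin{bmatrix} 15.1093 \\ 32.2834 \\ 71.276 \end{bmatrix}$$

giving:

$$\begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} -0.1812 \\ -0.3221 \\ 1.3537 \end{bmatrix}$$

So our fitted second order function is:

$$f(x) = -0.1812 - 0.3221\,x + 1.3537\,x^2$$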

Example #3 : data with three different fits

In this example, we’re not sure which order will fit
well, so we try three different polynomial orders.
Note: Linear regression, or first order curve fitting, is
just the general polynomial form we just saw, where
we use j = 1.

• 2nd and 6th order look similar, but 6th has a
‘squiggle’ to it. We may not want that...

Overfit / Underfit - picking an inappropriate order

Overfit - over-doing the requirement for the fit to ‘match’ the data trend (order too high)

Polynomials become more ‘squiggly’ as their order increases. A ‘squiggly’ appearance comes from
inflections in the function.

Consideration #1:

3rd order - 1 inflection point
4th order - up to 2 inflection points
nth order - up to n-2 inflection points
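
One way to see where the n-2 comes from, using the notation of eq. (1) with order $j$: inflection points can only occur where the second derivative is zero, and the second derivative of an order $j$ polynomial is itself a polynomial of order $j - 2$:

$$f''(x) = \sum_{k=2}^{j} k (k - 1)\, a_k\, x^{k-2}$$

A polynomial of order $j - 2$ has at most $j - 2$ real roots, so $f(x)$ has at most $j - 2$ inflection points. This is an upper bound; a particular polynomial may have fewer.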

Consideration #2:

2 data points - linear touches each point


3 data points - second order touches each point
n data points - n-1 order polynomial will touch each point

SO: Picking an order too high will overfit data

General rule: pick a polynomial form at least several orders lower than the number of data points.
Start with linear and add order until trends are matched.

Underfit - the order is too low to capture obvious trends in the data

[Figure: Profit versus paid labor hours. Straight line will not predict diminishing returns that data shows.]

General rule: View data first, then select an order that reflects inflections, etc.

For the example above:

1) Obviously nonlinear, so order > 1
2) No inflection points observed as obvious, so order < 3 is recommended
=====> I’d use 2nd order for this data

Curve fitting - Other nonlinear fits (exponential)

Q: Will a polynomial of any order necessarily fit any set of data?


A: Nope, lots of phenomena don’t follow a polynomial form. They may be, for example, exponential

Example : Data (x,y) follows exponential form

The next line references a separate worksheet with a function inside called
Create_Vector. I can use the function here as long as I reference the worksheet first

Reference:C:\Mine\Mathcad\Tutorials\MyFunctions.mcd

X := Create_Vector ( −2 , 4 , .25) Y := 1.6⋅ exp( 1.3⋅ X)

f2 := regress ( X , Y , 2) f3 := regress ( X , Y , 3)

fit2( x) := interp ( f2 , X , Y , x) fit3( x) := interp ( f3 , X , Y , x) i := −2 , −1.9.. 4

[Figure: data with the 2nd order and 3rd order polynomial fits, plotted for x from -2 to 4]

Note that neither 2nd nor 3rd order fit really describes the data well, but higher order will only get more
‘squiggly’.

We created this sample of data using an exponential function. Why not create a general form of the exponential
function, and use the error minimization concept to identify its coefficients? That is, let’s replace
the polynomial equation

$$f(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \cdots + a_j x^j = a_0 + \sum_{k=1}^{j} a_k x^k$$

with a general exponential equation

$$f(x) = C e^{Ax} = C \exp(Ax)$$

where we will seek C and A such that this equation fits the data as best it can.
Again with the error: solve for the coefficients $C$, $A$ such that the error is minimized:

$$\text{minimize} \quad \mathrm{err} = \sum_{i=1}^{n} \left( y_i - C \exp(A x_i) \right)^2$$

Problem: When we take partial derivatives of err with respect to $C$ and $A$ and set them to zero, we get two NONLINEAR
equations with respect to $C$, $A$.

So what? We can’t use Gaussian Elimination or the inverse function anymore.
Those methods are for LINEAR equations only...
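
To see why, here is a sketch of the two equations, obtained by differentiating the error expression above term by term:

$$\frac{\partial\,\mathrm{err}}{\partial C} = -2 \sum_{i=1}^{n} \left( y_i - C e^{A x_i} \right) e^{A x_i} = 0$$

$$\frac{\partial\,\mathrm{err}}{\partial A} = -2 \sum_{i=1}^{n} \left( y_i - C e^{A x_i} \right) C\, x_i\, e^{A x_i} = 0$$

The unknown $A$ sits inside the exponential in every term, so these equations cannot be arranged into the $AX = B$ matrix form used for the polynomial fits.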

Now what?
Solution #1: Nonlinear equation solving methods
Remember we used Newton Raphson to solve a single nonlinear equation? (root finding)
We can use Newton Raphson to solve a system of nonlinear equations.
Is there another way? For the exponential form, yes there is

Solution #2: Linearization:


Let’s see if we can do some algebra and change of variables to re-cast this as a linear problem...
Given: pairs of data (x, y)
Find: a function to fit data of the general exponential form $y = C e^{Ax}$

1) Take the logarithm of both sides to get rid of the exponential: $\ln(y) = \ln(C e^{Ax}) = Ax + \ln(C)$

2) Introduce the following change of variables: $Z = \ln(y)$, $X = x$, $B = \ln(C)$

Now we have: $Z = AX + B$, which is a LINEAR equation

The original data points in the $x$-$y$ plane get mapped into the $X$-$Z$ plane.

This is called data linearization. The data is transformed as: $(x, y) \Rightarrow (X, Z) = (x, \ln(y))$

Now we use the method for solving a first order linear curve fit

$$\begin{bmatrix} n & \sum X \\ \sum X & \sum X^2 \end{bmatrix} \begin{bmatrix} B \\ A \end{bmatrix} = \begin{bmatrix} \sum Z \\ \sum XZ \end{bmatrix}$$

for $A$ and $B$, where above $Z = \ln(y)$ and $X = x$.
Finally, we operate on $B = \ln(C)$ to solve $C = e^B$

And we now have the coefficients for $y = C e^{Ax}$
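
The linearization steps can be sketched in a few lines of Python. This is an illustrative sketch, not the Mathcad worksheet that follows. The function name `fit_exponential` is made up, the straight-line solve uses the closed form of the 2x2 system, and the sample data assumes that `Create_Vector(-2, 4, .25)` produces values from -2 to 4 in steps of 0.25.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = C * exp(A * x) by fitting a straight line to (x, ln y)."""
    if any(y <= 0 for y in ys):
        raise ValueError("linearization needs strictly positive y values")
    zs = [math.log(y) for y in ys]             # change of variables Z = ln(y)
    n = len(xs)
    sx = sum(xs)
    sz = sum(zs)
    sxx = sum(x * x for x in xs)
    sxz = sum(x * z for x, z in zip(xs, zs))

    # First order fit Z = A*X + B from the 2x2 normal equations.
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("need at least two distinct x values")
    A = (n * sxz - sx * sz) / det
    B = (sxx * sz - sx * sxz) / det
    return math.exp(B), A                      # C = e^B, and A


# Synthetic data built the same way as in the example: y = 1.6 * exp(1.3 * x).
xs = [-2 + 0.25 * k for k in range(25)]
ys = [1.6 * math.exp(1.3 * x) for x in xs]
C, A = fit_exponential(xs, ys)
print(C, A)
```

Because this data is exactly exponential, the recovered values should match C = 1.6 and A = 1.3 to within round-off. Note that the fit minimizes the squared error in ln(y), not in y, so with noisy data it will generally differ from the true nonlinear least squares solution.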
Example: repeat previous example, add exponential fit

X := Create_Vector ( − 2 , 4 , .25 ) Y := 1.6 ⋅ exp ( 1.3 ⋅ X )

f2 := regress ( X , Y , 2 ) f3 := regress ( X , Y , 3 )

fit2 ( x) := interp ( f2 , X , Y , x) fit3 ( x) := interp ( f3 , X , Y , x)

ADDING NEW STUFF FOR EXP FIT

Y2 := ln ( Y )    fexp := regress ( X , Y2 , 1 )    coeff := submatrix ( fexp , 4 , 5 , 1 , 1 )

C := exp ( coeff_1 )    A := coeff_2    fitexp ( x) := C ⋅ exp ( A ⋅ x)    i := − 2 , − 1.9 .. 4

[Figure: data with the 2nd order polynomial fit and the exponential fit, plotted for x from -2 to 4]

A = 1.3
C = 1.6


Chapter 8
An Introduction to
Questionnaire Design

Introduction
In this chapter you will learn about:
• The key principles of designing effective questionnaires.
• How to formulate meaningful questions.
• The use of structured, semi-structured and unstructured
questionnaires in different types of research design.
• The three most important types of questions for asking
about behaviour, attitudes or classifying respondents
• Key terms used in questionnaire design
• The link between the interviewer, the respondent and the
questionnaire.

The key principles of effective questionnaire design


There are seven steps in the design of a questionnaire:

Step 1 – Decide what information is required


The starting point is for the researcher to refer to the proposal and
brief and make a listing of all the objectives and what information
is required in order that they are achieved.

Step 2 – Make a rough listing of the questions
A list is now made of all the questions that could go into the ques-
tionnaire. The aim at this stage is to be as comprehensive as possi-
ble in the listing and not to worry about the phrasing of the
questions. That comes next.

Step 3 – Refine the question phrasing


The questions must now be developed close to the point where they
make sense and will generate the right answers. Tips on how to
write good questions are provided later in this chapter.

Step 4 – Develop the response format


Every question needs a response. This could be a pre-coded list of
answers or it could be open ended to collect verbatim comments.
Consideration of the responses is just as important as getting the
questions right. In fact, considering the answers will help get the
questions right.

Step 5 – Put the questions into an appropriate sequence


The ordering of the questions is important as it brings logic and
flow to the interview. Normally the respondent is eased into the
task with relatively straightforward questions while the more diffi-
cult or sensitive ones are left until they are warmed up. Questions
on brand awareness are asked first unprompted and then they are
prompted.

Step 6 – Finalise the layout of the questionnaire


The questionnaire now needs to be fully formatted with clear
instructions to the interviewer, including a powerful introduction,
routings and probes. There needs to be enough space to write in
answers and the response codes need to be well separated from
each other so there is no danger of circling the wrong one.

Step 7 – Pretest and revise


The final step is to test the questionnaire. It usually isn’t necessary
to carry out more than 10 to 20 interviews in a pilot because the aim
is to make sure that it works, and not to obtain pilot results. In the-
ory the questionnaire should be piloted using the interviewing
method that will be used in the field (over the phone if telephone

interviews are to be used; self completed if it will be a self comple-
tion questionnaire). Time and money can preclude a proper pilot so
at the very least it should be tested on one or two colleagues for
sense, flow and clarity of instructions. The whole purpose of the test
is to find out if changes are needed so that final revisions can be
made. When carrying out the pilot it is best to run through the
questionnaire with the guinea pig respondent and then go back
over the questions and ask for each one, “what was going through
your mind when you were asked this question?”.
Questionnaire design is one of the hardest and yet one of the most
important parts of the market research process. Given the same
objectives, two researchers would probably never design the same
questionnaire.

Designing effective questionnaires


The primary purpose of a questionnaire is to help extract data from
respondents. It serves as a standard guide for the interviewers who
each need to ask the questions in exactly the same way. Without
this standard, questions would be asked in a haphazard way at the
discretion of the individual. Questionnaires are also an important
part in the data collection methodology. They are the medium on
to which responses are recorded to facilitate data analysis.
There are five people to take into consideration when designing a
questionnaire:
Client – the client wants answers to their particular problem and
even, on occasion, to have their worst fears shown up to be unlikely
or improbable.
Researcher – the researcher needs to uncover information and bal-
ance the needs of three groups of people. She or he needs to ensure
that the interviewer can manage the questionnaire easily, that the
questions are interesting for the respondent and that the question-
naire matches the client’s needs.
Interviewer – the interviewer wants a questionnaire which is easy
to follow and which can be completed in the time specified by the
researcher.
Respondent – respondents generally want to enjoy the interview
experience. They need to feel that the questions are phrased so that
they can be answered truthfully, and so that they allow the respon-
dent to actually say what he or she thinks. They may also want to

know if they will receive anything in return for giving their opin-
ion.
Data-processor – the data processor wants a questionnaire which
will result in data which can be processed efficiently and with min-
imum error.
If questionnaires fail it is usually because they are dashed off with
insufficient thought. Questions may be missed out; they could be
badly constructed, too long, or too complicated and sometimes
unintelligible. Good questionnaires are iterations which begin as a
rough draft and, through constant refinement, are converted to pre-
cise and formatted documents. It is not unusual for a questionnaire
to develop through to version 7 or 8.
There are normally five sections in a questionnaire:
• The respondent’s identification data – such as their name,
address, date of the interview, name of the interviewer. The
questionnaire would also have a unique number for purposes
of entering the data into the computer.
• An introduction – this is the interviewer’s request for help.
It is normally scripted and lays out the credentials of the
market research company, the purpose of the study and any
aspects of confidentiality.
• Instructions – the interviewer and the respondent need to
know how to move through the questionnaire such as which
questions to skip and where to move to if certain answers are
given.
• Information – this is the main body of the document and is
made up of the many questions and response codes.
• Classification data – these questions, sometimes at the front
of the questionnaire, sometimes at the end, establish the
important characteristics of the respondent, particularly
related to their demographics.
Ten things to think about when designing a questionnaire:

10 things to think about in effective questionnaire design


1. Think about the objectives of the survey: at the outset, the
researcher should sit down with the research plan (the
statement of what is to be achieved and the methods which
will be involved) and list the objectives of the study. This
will ensure that the survey covers all the necessary points

and it will generate a rough topic list which will eventually
be converted into more explicit questions.
2. Think about how the interview will be carried out: the
way that the interview will be carried out will have a bearing
on the framing of the questions. For example, interviews
carried out over the telephone have some limitations
compared with face to face interviews. Self-completion
questionnaires need to be very precise and explicit in the
way they are designed.
3. Think about the introduction to the questionnaire:
scripted introductions can sound “wooden”. However, each
interviewer should say the same thing so there has to be a
standard introduction. It should quickly and succinctly
communicate the purpose of the survey, any aspects of
confidentiality and what is required of the respondent. The
introduction is arguably one of the most important
components of a questionnaire because if it fails to engage
with the respondent, there will be no interview at all.
4. Think about the formatting: the questionnaire should be
clear and easy to read. It should be easy for the interviewer
to navigate around. Questions and response options should
be laid out in a standard format and if the questionnaire is
to be administered on a doorstep in winter, the typeface
should be large enough to read. Where appropriate, there
should be ample space to write in open ended comments.
There should be somewhere (front or back) to write down
the details of the respondent, the date of the interview and
the name of the interviewer.
5. Think about questions from the respondents’ point of
view: questions should be framed in a respondent friendly
manner. Researchers usually know what they want from a
survey but this seldom converts into a straight question. The
question usually has to be broken down into two or three
parts to make it relevant from the respondent’s point of
view. Furthermore, researchers can be greedy for information
and design questionnaires that are too long and impose
impossible tasks for the respondent.
6. Think about the possible answers at the same time as
thinking about the questions: the whole purpose of a
question is to derive answers and so it is essential that some
thought is given to all the possible replies that could be

received. It is the anticipation of the complete range of
possible answers that throws up the faults in the question.
For example, it is no good asking people how many loaves of
bread they buy in a year if they think in terms of loaves
purchased per week.
7. Think about the order of the questions: the questions
should flow easily from one to another and be grouped into
topics in a logical sequence.
8. Think about the types of questions: texture in the
interview can be achieved by incorporating different styles of
questions. The researcher can choose from open ended
questions, closed questions and scales.
9. Think about how the data will be processed: the
questionnaire is simply the vehicle by which data is
collected from many individuals before being stirred in the
analysis pot. Consideration of how the data will be analysed
at the time of designing the questionnaire will make things
easier later on.
10. Think about interviewer instructions: questionnaires are
administered by interviewers who, skilled as they are,
need clear guidance on what to do at every stage of the
interview. These instructions need to be differentiated
from the text either by capital letters, emboldened or
underlined type.

In addition to these points that will guide the overall design of the
questionnaire, the questions themselves must be carefully designed. To write a
good question you need to make sure that the respondents:
• Can understand the question
• Are willing to answer the question
• Are able to answer the question.

Key point
The best questionnaires are constantly edited and refined until
finally they have clear questions and instructions, laid out
in a logical order.

Below are twelve things to watch out for when formulating indi-
vidual questions.
• Ensure that questions are without bias. Questions should not be
worded in such a way as to lead the respondent into the
answer.

• Make the questions as simple as possible. Questions should not
only be short, they should also be simple. Those which
include multiple ideas or two questions in one will confuse
and be misunderstood.
• Make the questions very specific. Notwithstanding the
importance of brevity and simplicity, there are occasions
when it is advisable to lengthen the question by adding
memory cues. For example, it is good practice to be specific
with time periods.
• Avoid jargon or shorthand. It cannot be assumed that
respondents will understand words commonly used by
researchers. Trade jargon, acronyms and initials should be
avoided unless they are in everyday use.
• Steer clear of sophisticated or uncommon words. A questionnaire
is not a place to score literary points so only use words in
common parlance. Colloquialisms are acceptable if they will
be understood by everybody (some are highly regional).
• Avoid ambiguous words. Words such as ‘usually’ or
‘frequently’ have no specific meaning and need qualifying.
• Avoid questions with a negative in them. Questions are more
difficult to understand if they are asked in a negative sense.
It is better to say “Do you ever ...?”, as opposed to “Do you
never ...?”.
• Avoid hypothetical questions. It is difficult to answer questions
on imaginary situations. Answers may be given but they
cannot necessarily be trusted.
• Do not use words which could be misheard. This is especially
important when the interview is administered over the
telephone. For example, fifteen and fifty can sound very
similar.
• Desensitise questions by using response bands. Questions which
ask women about their age or companies about their
turnover are best presented as a range of response bands.
This softens the question by indicating that precision isn’t
necessary and a broad answer is acceptable. The data will
almost certainly be grouped into bands at the analysis stage,
so it may as well be collected in this way.
• Ensure that fixed responses do not overlap. The categories which
are used in fixed response questions (such as the age bands

of respondents, the turnover bands of companies etc) should
be sequential and not overlap otherwise some answers will
be caught on the cusp.
• Allow for ‘others’ in fixed response questions. Pre-coded answers
should always allow for a response other than those listed.
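
The rule that fixed response bands should be sequential and should not overlap can be checked mechanically before a questionnaire goes to print. The following is a minimal sketch, assuming whole-number bands such as ages; the function name `check_bands` and the example bands are illustrative only.

```python
def check_bands(bands):
    """Return a list of problems found in (low, high) response bands.

    Each band covers low..high inclusive. Use None as the high value of an
    open-ended final band such as "45 or over".
    """
    problems = []
    for (low1, high1), (low2, high2) in zip(bands, bands[1:]):
        if high1 is None:
            problems.append(f"open-ended band starting at {low1} is not last")
        elif low2 <= high1:
            problems.append(f"bands overlap at {low2}")
        elif low2 > high1 + 1:
            problems.append(f"gap between {high1} and {low2}")
    return problems


# A respondent aged 25, 35 or 45 would be caught on the cusp here.
overlapping = [(18, 25), (25, 35), (35, 45), (45, None)]
# Sequential bands with no overlaps and no gaps.
sequential = [(18, 24), (25, 34), (35, 44), (45, None)]

print(check_bands(overlapping))
print(check_bands(sequential))
```

The second list should come back empty. A check like this does not replace the pilot, but it catches the most common band errors early.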

Think about
How many questionnaires pass in front of you that you put
straight in the bin? Start collecting them. In time you will have
a good variety from which you can pick and choose questions
and layouts when you have to design a questionnaire.

Matching the questionnaire to the research objectives

The survey plan will have a range of objectives which could require
qualitative or quantitative methods (or both). The specific market
research objectives will dictate the type of information needed from
the questionnaire.

Figure 8.1 Types Of Questionnaires For Different Studies

Type of Study                   Questionnaire Type   Method of Administration

Large, quantitative studies     Structured           Telephone / Face-to-face /
                                                     Self completion

Business to business studies;   Semi-structured      Telephone / Face-to-face
investigative consumer studies

Qualitative studies             Unstructured         Depth telephone /
                                                     Face-to-face / Focus groups

Structured questionnaires consist of closed or prompted questions
with predefined answers. The researcher has to anticipate all possible
answers with pre-coded responses. They are used in large interview
programmes (anything over 30 interviews and more likely over
200 interviews in number) and may be carried out over the telephone,
face-to-face or self completion depending on the respondent
type, the content of questionnaire and the budget.
Semi-structured questionnaires comprise a mixture of closed and open
questions. They are commonly used in business-to-business market
research where there is a need to accommodate a large range of dif-
ferent responses from companies. The use of semi-structured ques-
tionnaires enables a mix of qualitative and quantitative information
to be gathered. They can be administered over the telephone or
face-to-face.
Unstructured questionnaires are made up of questions that elicit free
responses. These are guided conversations rather than structured
interviews and would often be referred to as a “topic guide”. The
topic guide is made up of a list of questions with an apparent order
but is not so rigid that the interviewer has to slavishly follow it in
every detail. The interviewer can probe or even construct new ques-
tions which have not been scripted. This type of questionnaire is
used in qualitative research for depth interviewing (face-to-face,
depth telephone interviews) and they form the basis of many stud-
ies into technical or narrow markets.
Whichever of these types of questionnaire is used (structured,
semi-structured, or unstructured), a check should be made on how meaningful
it is by asking “Is it measuring or probing what you think it’s measuring
or probing?”. If you get this right, respondents will be able to give
valid answers.
Another simple measure is to think through all the possible
responses. This will make sure that the responses that are obtained
are reliable. Basically this means that the answers received should be
the same as those that would be given, if you repeated the question.
There are two major issues that can have a bad effect on both the
quality of your data, and a respondent’s attitude towards market
research. These are using excessively long questionnaires, and repet-
itive questioning techniques. Variety is the spice of questionnaires,
as well as of life! Use lots of different question types to stop respon-
dents getting bored. Stimulus materials, such as show cards and
advertisements, can also help provide texture in the interview.

An introduction to different question types


Questions are designed to collect three different types of information
from populations: information about behaviour, information
about attitudes, and information that is used for classification
purposes. The three different types of information that can be gathered
and the surveys in which they are used are summarised in Figure 8.2.

Figure 8.2 Three Different Types of Questions

Question Type    Information Sought             Types of Surveys

Behavioural      Factual information on what    Surveys to find out market
                 the respondent does or what    size, market shares,
                 they own. Also the frequency   awareness and usage
                 with which certain actions
                 are carried out.

Attitudinal      What people think of           Image and attitude surveys.
                 products, services or          Brand mapping studies.
                 brands. Their image and        Customer/employee
                 ratings of things. Why they    satisfaction surveys
                 do things.

Classification   Information that can be used   All surveys
                 to group respondents to see
                 how they differ, one from the
                 other – such as their age,
                 gender, social grade, location
                 of household, type of house,
                 family composition.

Behavioural questions
Behavioural questions are designed to find out what people (or
companies) do. For example, do people eat butter or margarine?
How much do they eat? What brands do they buy? Who buys it?
etc. They determine people’s actions in terms of what they have
bought, used, visited, seen, read or heard. Behavioural questions
record facts and not matters of opinion.

Behavioural questions address the following:


• Have you ever ........?
• Do you ever ........?
• Who do you know ........?

• When did you last ........?
• Which do you do most often ........?
• Who does it ........?
• How many ........?
• Do you have ........?
• In what way do you do it ........?
• In the future will you ........?

Attitudinal questions
People hold opinions or beliefs on everything from the products
they buy and the companies which make or supply them through
to social issues and politics. These attitudes are important because
they influence the way people act.
Researchers explore attitudes using questions which especially begin
with the word ‘why...’. Also useful are the questions How?, Which?,
Who?, Where?, What? In attitudinal and motivational research, the
following phrases are often used: “Why did you say that?” or “Would you
explain?”.

Attitudinal questions address the following:


• Why do you ........?
• What do you think of ........?
• Do you agree or disagree ........?
• How do you rate ........?
• Which is best (or worst) for ........?
Scales are commonly used to measure attitudes. Scalar questions use
a limited choice of response, chosen to measure an attitude, an
intention or some aspect of the respondent’s behaviour. There are
five different types of rating scales which researchers commonly
use:
1. Verbal rating scales. These are the simplest of all scales in
which respondents choose a word or phrase on a scale to
indicate the level of their feeling. They normally range across
four or five possibilities such as:

Very likely
Quite likely
Neither likely nor unlikely
Not very likely
Not likely at all.
2. Numerical rating scales. This is a very similar approach to
the verbal rating except the respondent is asked to give a
numerical `score’ rather than a semantic response. The scores
are often out of a number with 5, 7 and 10 being popular
choices (where the largest number is best and 1 is worst). It
should be borne in mind that the bigger the scale, the more
consideration is required from the respondent.
3. The use of adjectives. An alternative to a scale is to ask
respondents which words best describe a company, a product
or a brand. The adjectives could be both positive and
negative and they need not be opposites. This could easily be
converted into a scale, for example, asking people which of
two adjectives they associate with a product or brand –
reliable v unreliable. In a self completion questionnaire a
line or scale could separate the two words and the
respondent is asked to mark the line to indicate their view.
4. The use of positioning statements. Here the respondent is
asked to agree or disagree with a number of statements. It is
important that the respondent is readily able to identify with
one of the statements and not left feeling that somehow
they do not capture their mood. Positioning statements are a
variation of the verbal rating scale and are often known as
agree/disagree scales or Likert scales after the person who
popularised them. Typically a statement is read out and the
respondent is presented with five choices such as:
Agree strongly
Agree slightly
Neither agree nor disagree
Disagree slightly
Disagree strongly
5. Ranking questions. Researchers often need to find out what is
the order of importance of various factors from a list. Typically
this is achieved by presenting the list and asking which is
most important, which is second most important and so on.
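The agree/disagree scale in item 4 is usually turned into numbers before analysis. The sketch below shows one way this could be done in Python. The 5-to-1 coding, the helper name `summarise_likert` and the sample answers are assumptions made for illustration; they do not come from a real survey.

```python
from collections import Counter

# Assumed coding: 5 = strongest agreement, 1 = strongest disagreement.
LIKERT_CODES = {
    "Agree strongly": 5,
    "Agree slightly": 4,
    "Neither agree nor disagree": 3,
    "Disagree slightly": 2,
    "Disagree strongly": 1,
}


def summarise_likert(responses):
    """Return (frequency table, mean score) for a list of verbal responses."""
    counts = Counter(responses)
    scores = [LIKERT_CODES[r] for r in responses]
    mean_score = sum(scores) / len(scores)
    return counts, mean_score


# Hypothetical answers to one positioning statement.
sample = [
    "Agree strongly",
    "Agree slightly",
    "Agree slightly",
    "Neither agree nor disagree",
    "Disagree slightly",
]
counts, mean_score = summarise_likert(sample)
print(counts.most_common())
print(round(mean_score, 2))
```

The frequency table is generally the safer summary. A mean score treats the gaps between adjacent scale points as equal, which is a further assumption.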

Think about
The questions we ask are who, what, when, where, why, and
how. Which of these do you think is the most difficult for peo-
ple to answer? Why is it the most difficult?

Classification questions
The third group of questions are those used to classify the informa-
tion once it has been collected. Classification questions check that
the correct quota of people or companies has been interviewed and
are used to make comparisons between different groups of respon-
dents. Most classification questions are behavioural (factual).
A number of standard classification questions crop up constantly in
market research surveys. These are:
• Gender. The only classifications are male and female.
• Marital status. This is usually asked by simply saying “Are
you .....”
– Single ❑
– Married ❑
– Widowed ❑
– Divorced ❑
– Separated ❑
• Socio Economic Grade (SEG). This is a classification peculiar to
UK market researchers in which respondents are pigeonholed
according to the occupation of the head of the household.
Thus, it combines the attributes of income, education and
work status. In addition to social grades, researchers
sometimes classify respondents by income group or lifestyle.
In summary the socio economic grades are:
A higher managerial, administrative or professional
B intermediate managerial, administrative or professional
C1 supervisory, clerical, junior administrative or professional
C2 skilled manual workers
D semi-skilled and unskilled manual workers
E state pensioners, widows, casual and lowest grade workers.

For most practical purposes these can be reduced to just four:

AB ❑
C1 ❑
C2 ❑
DE ❑
Alternatively, a question may be asked about the income of the
respondent or the combined income of the household. The ques-
tion would be de-sensitised by using income bands.
• Industrial occupation. In Europe companies are classified
according to their Standard Industrial Classification (SIC).
Often researchers condense the many divisions into more
convenient and broader groupings such as:

Accommodation and Food Services ❑


Administrative and Support & Waste Management and
Remediation Services ❑
Arts, Entertainment, and Recreation ❑
Police, Fire Service and Other Support Services ❑
Construction ❑
Educational Services ❑
Finance & Insurance ❑
Health Care and Social Assistance ❑
Information ❑
Management of Companies and Enterprises ❑
Manufacturing ❑
Mining ❑
Professional, Scientific and Technical Services ❑
Property, Rental and Leasing ❑
Retail Trade ❑
Transportation & Warehousing ❑
Utilities ❑
Wholesale Trade ❑
Other Services (Except Public Administration) ❑

In surveys of the general public, it may be relevant to establish
the level of employment of the respondent. For example:
Working full time (over 30 hours a week) ❑
Working part-time (8-30 hours a week) ❑
Housewife (full time at home) ❑
Student (full time) ❑
Retired ❑
Temporarily unemployed (but seeking work) ❑
Permanently unemployed (eg chronically sick, independent
means etc) ❑
• Number of employees. The size of the firm in which the
respondent works can be classified according to the number
of employees:
0–9 ❑
10–24 ❑
25–99 ❑
100–249 ❑
250 + ❑
• Location. Depending on the scope of the survey, this can be a
country code or in any single country a code indicating the
domicile of the respondent such as state in which they live
or a broader grouping such as East Coast, Central, West
Coast etc.

Think about
What classification questions would be most important in a sur-
vey for your company?

Key terms in questionnaire design


Questionnaire: a set of common questions laid out in a standard
and logical form to record individual respondent’s attitudes and
behaviour. Instructions show the interviewer or the respondent
how to move through the questions and complete the schedule. It
could be printed on paper or on a computer screen.

Question: this is the framing of the precise questions that are asked.
Care needs to be taken to ensure that the questions elicit a useful and
unbiased response. The questions can be open ended (used in smaller,
qualitative surveys) or closed (used in quantitative surveys).

Key point
Classification questions are some of the most important questions in
the questionnaire as they are used to cross analyse the data and pick
up different patterns of response across different groups of people.

Open ended questions: these are questions that invite free ranging
responses – sometimes called verbatim responses. Such responses are
extremely useful for obtaining a deep understanding of the respondents’
views and behaviour but they are difficult to capture precisely (the
respondent may give a long winded answer that is shortened by the
interviewer) and are time consuming to analyse. They are only suited to
qualitative and small quantitative surveys.

Closed questions: these questions invite a response that is fitted
into a preordained answer. Usually the answers are read out or
shown to the respondent and they choose which best fits their
reply. Sometimes the answers are not read out (as in a brand aware-
ness question) though the responses are listed and to that extent are
“closed”. Closed questions are the norm in quantitative surveys. It
is vital to ensure that the correct response codes are designed for the
question otherwise there will be significant numbers that cannot be
placed in any useful response and are put in the dustbin category of
“others”.

Direct questions: A direct question measures exactly what it
appears to be measuring. For example, “How do you travel to work
each day?”

Indirect questions: An indirect question usually disguises its true
purpose. For example, “Which tour operators have you booked holidays
through in the past two years?” This indirect question will also give
some idea of how many holidays (if any) the respondent has taken over
the last two years. Indirect questioning is usually used when a direct
question would reveal the true purpose of the research and so might
bias the respondent’s answers.

Multiple response questions: some questions can receive a number
of answers and others only one answer. For example, a question that
asks how many brands someone is aware of could generate a list of
names and therefore is multiple response. Another question may
seek to find out which brand is used most frequently and this could
allow for just one response (ie single response). Sometimes the ques-
tions are marked multi-response so that the interviewer knows that
more than one answer is anticipated and allowed.

Prompted questions: A prompted question is used to give a common
framework for the answers. The answer options are printed on
showcards, within the questionnaire or read out, and the respon-
dent chooses one of them. Prompted questions can help respon-
dents to understand difficult subjects and make it easier for them to
answer by indicating what’s expected and prompting their memory.
It also helps a researcher have some control over the scope of the
answers. The drawback here is of course that the questions can
become suggestive and leading, and they can also be complicated
for the interviewers. Closed questions are prompted.

Unprompted questions: An unprompted question allows the
respondent to give their own answer in their own words. Open
ended questions are unprompted.

Response codes: the answers to closed questions, each requiring a
mark to indicate which has been chosen. This could be by circling
a number or ticking a box.

Question grid: questions may be laid out in grids to save space on
the questionnaire. For example, a list of brands could be listed on
the page and against these there could be response columns to indi-
cate if the respondent has heard of each brand, ever used brands, is
a frequent user of the brands etc. Grids are used to save space and
to make it easier for the interviewer and respondent to quickly
move through the questions.
           Heard of   Ever used   Frequent user
Brand 1       1           1             1
Brand 2       2           2             2
Brand 3       3           3             3

Rating scales: these are words, numbers or pictures that indicate a
range of different responses to a question. Scales suit researchers as
a means for locating a respondent’s view on a continuum but they
may not always be easy for respondents to relate to as they may not
think in such terms. Scales have engaged the interest of researchers
for years and some are named after their originator. Likert invented
an agree/disagree scale with five positions. Osgood gave his name to
the bi-polar scale. The Thurstone scale starts by generating a list of
possible statements that relate to a subject and then a distilled list is
created which is the scale of issues that covers the subject.
Routing: the instructions that tell an interviewer or a respondent
where to go next when completing the questionnaire.
Trade-off question: at its simplest this could be a question which
asks the respondent to spend a number of points between factors
that influence their choice of a product or brand. The more sophis-
ticated trade-off questions ask respondents to express their prefer-
ences between pairs of attributes or between concepts (with a price
attached). This is a conjoint measurement that produces utility val-
ues indicating the weight of importance attached to the different
attributes.

The respondent, the interviewer and questionnaire design
A questionnaire is the link between the interviewer and the respon-
dent. In a good interview the process feels more like an interesting
conversation than an interrogation. The combination of a good
interviewer and a good questionnaire is crucial to a successful
interview. The flow of the questions is critical to a good interview.
The flow of the questionnaire is dependent on six factors:
• Easy to answer questions should be put at the beginning to
give the respondent confidence in their ability to help
• Questions likely to interest the respondent should also be at
the start
• Questions should be asked in a logical order
• Filter questions should follow each other without being
interrupted by other questions
• It can be helpful to have an introduction before each change
of topic to help the respondent make an easy jump

• Personal, emotional or complicated questions should be at
the end to avoid people being put off answering further
questions
Obtaining a market research interview is not easy, especially given
the large number of surveys that are taking place and the bombard-
ment of our privacy through the ‘cold call’ selling of financial ser-
vices or double-glazing. The respondent believes, with some
justification, that they are giving up their valuable time and may be
getting little in return.
It is in the opening seconds of the introduction that the interview
will be won or lost and so the questionnaire must have an intro-
duction with a hook that interests the respondents.
Skills are required on the part of the interviewer to communicate
the introduction as quickly as possible so that the respondent can start
talking and answering the questions. The more information that is
packed in to the introduction and the longer it takes, the more time
a respondent will have to think of reasons why they don’t want to
take part. A fast engagement is vital.
The interviewer’s approach really does make a difference.
Respondents like to feel that they are in the hands of a professional:
someone who is businesslike without being pushy.
Respondents will talk to people they trust. Building trust in a few
seconds is difficult when the interviewer has only their voice and
words. However both can be powerful ordnance if they are used cor-
rectly. The right words and voice will create legitimacy for the inter-
view. The wrong ones will result in the brush off. It does therefore
help to have a script prepared before making contact with the
respondent to ensure that the introduction is, as near as possible,
the best one to win trust and co-operation.
In most cases, once a respondent has started the interview, they will
see it through to completion. However, compliance is not a foregone
conclusion and a different set of skills is needed for the execution
of the interview itself.
The crucial requirement of any interview is to know the question-
naire thoroughly. This is especially the case with paper based ques-
tionnaires, as complex routings could break the flow of the
interview.
The interview is, of course, a script of a kind and the questions have
to be read out exactly as stated. Good interviewers develop their
own style, speaking at a moderate pace and with good clarity and

diction. And, although it may be the last interview in a busy and tir-
ing day, they must sound interested. In fact, they need to be inter-
ested because a good interviewer really does have to listen.
Although the questionnaire is a script, and it must be adhered to,
there is scope to build in social lubrication and verbal encouragements
that indicate the interviewer is listening and is interested. The body
language of the voice becomes even more important in telephone
interviews as there is nothing else to create a rapport.

Key point
A good questionnaire will be successful in collecting accurate facts
and opinions and will be an enjoyable event for the respondent.

By the time the interview is finished, a relationship will have been
created with the respondent. The respondent should be thanked for their
time and effort and it may be appropriate to ask permission to call
again should it be necessary to clarify any of the answers. (This is
more important in business to business interviews.)

Think about
Write an introduction to a questionnaire that you think would
be successful in winning your cooperation. The introduction
should include all the necessary coverage of who is carrying out
the survey (not necessarily who is commissioning it), promises of
anonymity and confidentiality, how long it will take and a persuasive
hook. See if you can use fewer than 100 words.

SCARY STORY
In the 1980s, Coke became seriously concerned that it was losing
market share to Pepsi. In 1984 it only had a 4.9% lead over Pepsi in
the US. This was despite the fact that Coke outspent Pepsi on
advertising, by upwards of $100 million per year. One major
problem was that Pepsi’s advertising was simply more effective.
The Pepsi Challenge had been fabulously successful: Pepsi made
great play in its ads that in blind taste tests, people preferred
Pepsi to Coke.
Roy Stout, head of market research for Coca-Cola USA, put it this
way, “If we have twice as many vending machines, dominate
fountain, have more shelf space, spend more on advertising, and
are competitively priced, why are we losing share? You look at
the Pepsi Challenge, and you have to begin asking about taste.”
In September 1984, Coke thought that it had found the answer
with a new formula that beat Pepsi in blind taste tests by as much
as 6 to 8%. Bearing in mind that Pepsi had beaten Coke by any-
where from 10 to 15%, this was an 18% swing. All discouraging
market research was tossed into the bin and New Coke was
launched – with disastrous results.
When it hit the streets, New Coke was rejected by huge groups of
people. Comments were received such as “sewer water”, “furni-
ture polish”, “Coke for wimps”, “two-day-old Pepsi”, and “I miss
the battery acid tang”.
What we can learn from this story is not that the research carried
out by Coke or by Pepsi was wrong; rather that the wrong ques-
tions were asked. An assumption was made that Coke drinkers
chose the drink on taste and this became the subject of the study.
In fact the reality was far more subtle and the main driver of
choice was the brand. For years Coke was promoted as “the real
thing”, and the launch of New Coke implied to loyal drinkers that
they had been duped.


CHAPTER 15
INDEX NUMBERS

LEARNING OBJECTIVES
When you have completed this chapter, you will be able to:
LO1 Compute and interpret a simple index.
LO2 Describe the difference between a weighted index and an unweighted index.
LO3 Compute and interpret a Laspeyres price index.
LO4 Compute and interpret a Paasche price index.
LO5 Compute and interpret a value index.
LO6 Explain how the Consumer Price Index is constructed and interpreted.

Information on prices and quantities for margarine, shortening, milk, and
potato chips for the years 2004 and 2014 is provided in Exercise 27.
Compute a simple price index for each of the four items, using 2004 as
the base period. (Exercise 27, LO1)

15.1 INTRODUCTION
In this chapter, we will examine a useful descriptive tool called an index. An index ex-
presses the relative change in a value from one period to another. No doubt you are fa-
miliar with indexes such as the Consumer Price Index (CPI), which is released monthly
by Statistics Canada. There are many other indexes, such as the S&P/TSX Composite
Index, Dow Jones Industrial Average (DJIA), Nasdaq, and the NIKKEI 225. Indexes are
published on a regular basis by the federal government; by business publications, such
as BusinessWeek and Forbes; in most daily newspapers; and on the Internet.
Of what importance is an index? Why is the CPI so important and so widely
reported? As the name implies, it measures the change in the price of a large group of
items consumers purchase. Governments, consumer groups, unions, management,
senior citizens organizations, and others in business and economics are very concerned
about changes in prices. These groups closely monitor the CPI as well as other indexes.
To combat sharp price increases, the Bank of Canada often raises the interest rate to “cool
down” the economy. Likewise, the S&P/TSX Composite Index measures the overall daily
performance of the largest publicly traded companies on the Toronto Stock Exchange.
A few stock market indexes appear daily in the financial section of most newspa-
pers. Many are reported in real time on several websites.

LO1 15.2 SIMPLE INDEX NUMBERS


What is an index number? An index or index number measures the change in a particular
item (typically a product or service) or basket of items between two time periods.

Index number: A number that expresses the relative change in price, quantity, or
value compared with a base period.

If the index number is used to measure the relative change in just one variable, such
as hourly wages in manufacturing, we refer to this as a simple index. It is the ratio of two
values of the variable and that ratio converted to a percentage. The following three
examples will serve to illustrate the use of index numbers. As noted in the definition, the
main use of an index number in business is to show the percentage change in one or
more items from one time period to another.

Example According to Statistics Canada, the average total undergraduate tuition fees for full-time stu-
dents were $4025 in the 2003–2004 academic year and $5772 in the 2013–2014 academic
year. What is the index of the average total undergraduate tuition fees for full-time students for
the 2013–2014 academic year based on the 2003–2004 academic year?
Solution It is 143.4, found by:
$$I = \frac{\text{Average total undergraduate tuition fees, 2013–2014 academic year}}{\text{Average total undergraduate tuition fees, 2003–2004 academic year}} \times 100 = \frac{5772}{4025} \times 100 = 143.4$$
Thus, the index of the average total undergraduate tuition fees for the 2013–2014 academic year,
with the 2003–2004 academic year as the base, is 143.4. This means that there was a 43.4%
increase in the average tuition fees during the ten-year period.
Source: Statistics Canada statcan.gc.ca/daily-quotidien/130912/dq130912b-eng.htm

Example An index can also compare one item with another. The population of British Columbia in
2013 was estimated at 4 606 371, and for Ontario, it was estimated at 13 603 904. What is the
population of British Columbia compared with Ontario?
Solution The index of population for British Columbia is 33.9, found by:
$$I = \frac{\text{Population of British Columbia}}{\text{Population of Ontario}} \times 100 = \frac{4\,606\,371}{13\,603\,904} \times 100 = 33.9$$
This indicates that the population of British Columbia is 33.9% (about one-third) of the
population of Ontario, or the population of British Columbia is 66.1% less than the population of
Ontario (100 - 33.9 = 66.1).
Source: Statistics Canada www5.statcan.gc.ca/cansim/a26?lang=eng&retrLang=eng&id=0510005&paSer=&pattern=&stByVal=1&p1=1&p
2=31&tabMode=dataTable&csid=

Example The numbers of passengers (in millions) for the five busiest airports in Canada in 2013 are
given below. What is the index for Toronto, Vancouver, Calgary, and Montreal compared with
Edmonton?

City        Airport                                        Number of Passengers (millions)   Index
Toronto     Toronto Pearson International Airport          36.1                              515.7
Vancouver   Vancouver International Airport                18.0                              257.1
Calgary     Calgary International Airport                  14.3                              204.3
Montreal    Pierre Elliott Trudeau International Airport   14.1                              201.4
Edmonton    Edmonton International Airport                  7.0                              100.0

Solution To find the four indexes, we divide the numbers of passengers for Toronto, Vancouver, Calgary,
and Montreal by the number at Edmonton and multiply by 100. We conclude that Calgary had
104.3% more passengers than Edmonton, Montreal 101.4% more, Vancouver 157.1% more, and
Toronto 415.7% more.
Source: en.wikipedia.org/wiki/List_of_the_busiest_airports_in_Canada#Canada.27s_21_busiest_airports_by_passenger_traffic

City        Airport                                        Number of Passengers (millions)   Index   Found by
Toronto     Toronto Pearson International Airport          36.1                              515.7   36.1/7.0 × 100
Vancouver   Vancouver International Airport                18.0                              257.1   18.0/7.0 × 100
Calgary     Calgary International Airport                  14.3                              204.3   14.3/7.0 × 100
Montreal    Pierre Elliott Trudeau International Airport   14.1                              201.4   14.1/7.0 × 100
Edmonton    Edmonton International Airport                  7.0                              100.0    7.0/7.0 × 100
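The "found by" column can be reproduced with a few lines of Python. This is only a sketch: the function name `index_against_base` is hypothetical, and the passenger figures are the ones given in the example above.

```python
def index_against_base(values, base_key):
    """Express every value as an index with values[base_key] = 100."""
    base = values[base_key]
    return {key: round(value / base * 100, 1) for key, value in values.items()}


# Passengers in millions, 2013, as listed in the example.
passengers = {
    "Toronto": 36.1,
    "Vancouver": 18.0,
    "Calgary": 14.3,
    "Montreal": 14.1,
    "Edmonton": 7.0,
}

indexes = index_against_base(passengers, "Edmonton")
for city, idx in indexes.items():
    # The index minus 100 gives the percentage above (or below) the base.
    print(f"{city}: index {idx}, {idx - 100:+.1f}% relative to Edmonton")
```

Changing `base_key` re-expresses the same data against a different airport, which shows that an index has meaning only relative to its chosen base.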

Note the following from the previous discussion:

1. Index numbers are actually percentages because they are based on the number 100. How-
ever, the percent symbol is usually omitted.
2. Each index has a base period. The current base period for the CPI is 2002 = 100, changed
from 1992 = 100.
3. Most business and economic indexes are computed to the nearest whole number, such as
214 or 96, or to the nearest tenth of a percentage, such as 83.4 or 118.7.

15.3 WHY CONVERT DATA TO INDEXES?


Compiling index numbers is not a recent innovation. An Italian, G. R. Carli, is credited with
originating index numbers in 1764. They were incorporated in his report on price fluctua-
tions in Europe from 1500 to 1750. No systematic approach to collecting and reporting data
in index form was evident until about 1900. The cost-of-living index (now called the Con-
sumer Price Index [CPI]) was introduced in 1913, and a long list of indexes has been compiled
since then.
Indexes allow us to express a change in price, quantity or value as a percentage.
Why convert data to indexes? An index is a convenient way to express a change in a
diverse group of items. The CPI, for example, encompasses many items—including gasoline,
golf balls, lawn mowers, hamburgers, funeral services, and dentists’ fees. Prices are expressed
in dollars per kilogram, box, yard, and many other different units. Only by converting the
prices of these many diverse goods and services to one index number can the federal gov-
ernment and others concerned with inflation keep informed of the overall movement of
consumer prices.
Converting data to indexes also makes it easier to assess the trend in a series composed of
exceptionally large numbers. For example, the estimated population of Canada for March 31,
2014, was 35 344 962 and 34 940 975 for March 31, 2013. The increase of 403 987 appears
significant. Yet if the March 2014 estimated population was expressed as an index based on
the March 2013 population estimate, the increase would be approximately 1.2%.
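$$\frac{\text{Population estimate, March 31, 2014}}{\text{Population estimate, March 31, 2013}} \times 100 = \frac{35\,344\,962}{34\,940\,975} \times 100 = 101.2$$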
Population estimate March 31, 2013 34 940 975

15.4 CONSTRUCTION OF INDEX NUMBERS


We already discussed the construction of a simple price index. The price in a selected year
(e.g., 2014) is divided by the price in the base year. The base-period price is designated $p_0$, and
a price other than the base period is often referred to as the given period or selected period and
designated $p_t$. To calculate the simple price index $P$ using 100 as the base value for any given
period, the following formula is used:

SIMPLE INDEX    $$P = \frac{p_t}{p_0} \times 100$$    [15–1]

Suppose that the price of a ski weekend package (including meals and lift tickets) at Blue
Mountain was $600 in 2004. The price rose to $795 in 2014. What is the price index for 2014
using 2004 as the base period and 100 as the base value? It is 132.5, found by:
$$P = \frac{p_t}{p_0} \times 100 = \frac{\$795}{\$600} \times 100 = 132.5$$
Interpreting this result, the price of the ski weekend increased 32.5% from 2004 to 2014.
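Formula (15–1) can be written as a small helper. The sketch below is illustrative: the name `simple_index` and the check for a positive base are assumptions, and the prices are those of the ski weekend example.

```python
def simple_index(p_t, p_0, base_value=100):
    """Formula (15-1): given-period price over base-period price, times base_value."""
    if p_0 <= 0:
        raise ValueError("the base-period price must be positive")
    return p_t / p_0 * base_value


# Ski weekend package: $600 in 2004 (base period), $795 in 2014 (given period).
ski_index = simple_index(795, 600)
percent_change = ski_index - 100

print(round(ski_index, 1))       # index for 2014 with 2004 = 100
print(round(percent_change, 1))  # percentage change since the base period
```

Subtracting the base value from the index gives the percentage change directly, which is how the 32.5% increase above is read off.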
The base period need not be a single year. Note in Table 15–1 that if we use 2005–2006 =
100, the base price for the stapler would be $21 [found by determining the mean price of 2005
and 2006: ($20 + $22)/2 = $21]. If 2005–2007 is selected as the base, the prices $20, $22, and $23
are averaged, giving a mean price of $21.67. The indexes constructed using the
three different base periods are presented in Table 15–1. (Note that when 2005–2007 = 100,
the index numbers for 2005, 2006, and 2007 average 100.0, as we would expect.) Logically, the
index numbers for 2014 using the three different bases are not the same.

TABLE 15–1 Prices of a Benson Automatic Stapler, Model 3, Converted to Indexes Using Three
Different Base Periods

        Price of      Price Index     Price Index               Price Index
Year    Stapler ($)   (2005 = 100)    (2005–2006 = 100)         (2005–2007 = 100)
2000    18             90.0           18/21 × 100 =  85.7       18/21.67 × 100 =  83.1
2005    20            100.0           20/21 × 100 =  95.2       20/21.67 × 100 =  92.3
2006    22            110.0           22/21 × 100 = 104.8       22/21.67 × 100 = 101.5
2007    23            115.0           23/21 × 100 = 109.5       23/21.67 × 100 = 106.1
2014    38            190.0           38/21 × 100 = 181.0       38/21.67 × 100 = 175.4
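The three index columns of Table 15–1 can be recomputed with one function that averages the prices over whichever years form the base period. This is a sketch: `index_series` is a hypothetical name, and rounding the base mean to cents is an assumption made here to mirror the table, which uses $21.67.

```python
# Stapler prices from Table 15-1.
prices = {2000: 18, 2005: 20, 2006: 22, 2007: 23, 2014: 38}


def index_series(prices, base_years):
    """Index every year's price against the mean price over base_years."""
    # Assumption: round the base mean to cents, as the table does (21.67).
    base = round(sum(prices[y] for y in base_years) / len(base_years), 2)
    return {year: round(p / base * 100, 1) for year, p in prices.items()}


base_2005 = index_series(prices, [2005])
base_2005_2006 = index_series(prices, [2005, 2006])
base_2005_2007 = index_series(prices, [2005, 2006, 2007])

for year in prices:
    print(year, base_2005[year], base_2005_2006[year], base_2005_2007[year])
```

With the unrounded mean (65/3) as the base, the 2007 index comes out at 106.2 rather than 106.1. Small differences like this come from rounding the base and not from the method.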

self-review 15–1 1. Listed below are the top steel-producing nations, in tonnes (million), for the year 2013. Express
the amount produced by China, the European Union, Japan, and India as an index, using the
United States as a base. What percentage more steel does China produce than the United States?
Amount
Nation (million tonnes)
People’s Republic of China 779.0
European Union 165.8
Japan 110.6
United States 86.9
India 81.2

Source: issb.co.uk/global.html#CSP

2. The average weekly earnings (including overtime), educational and related services, in
Canada from 2008–2012 are given below:
Year Average Weekly Earnings
2008 $742.69
2009 770.30
2010 787.37
2011 808.69
2012 816.48

Source: Statistics Canada statcan.gc.ca/tables-tableaux/sum-som/l01/cst01/health23-eng.htm

(a) Using 2008 as the base period and 100 as the base value, determine the indexes for 2008–2012.
Interpret the index.
(b) Use the average of 2009, 2010, and 2011 as the base and determine indexes for 2008–2012, us-
ing 100 as the base value.
(c) What is the index for 2011 data using 2009 as the base?

EXERCISES
1. Average house prices in dollars for Manitoba from January 2008 to January 2014 are listed below:

Average House List Prices in Manitoba


Jan. 2014 $254 481
Jan. 2013 241 652
Jan. 2012 227 807
Jan. 2011 221 933
Jul. 2010 206 454
Jan. 2009 183 873
Jul. 2008 169 668

Develop a simple index for the change in list price based on the average of years 2010–2012.
2. The following table shows the average cost of a 1-bedroom apartment for selected cities across
Canada. See Connect for the data set.

City One-Bedroom Rent ($)


Regina 861
Calgary 989
Edmonton 900
Saskatoon 831
Hamilton 747
Winnipeg 717
Quebec City 628
Ottawa 932
Toronto 1027
Kitchener 786
Vancouver 998
Windsor 658
Victoria 834
Montreal 645

Source: newsroom.bmo.com/press-releases/bmo-rider-nation-
tops-canada-s-cities-and-regions-tsx-bmo-201311210913105001.
Retrieved February 25, 2014.

a. Develop a simple index with Quebec City as the base.


b. How much more expensive is a one-bedroom apartment in Calgary, Vancouver, and Toronto?
3. Listed below is the change in the average price of homes listed with the CREA (Canadian Real Estate
Association: crea.ca) from January 2014 to January 2008.

Jan-14 Jan-13 Jan-12 Jan-11 Jan-10 Jan-09 Jan-08


$388 553 $354 951 $348 178 $344 118 $328 728 $274 711 $309 448

a. Develop a simple index with January 2008 as the base year to show the change in the listed
prices. By what percentage did the list price increase over the seven years?
b. Develop a simple index with the average of January 2010 to January 2012 as the base year to
show the change in the list prices.
4. In January 2001, the price for a whole fresh chicken was $1.99 per kilogram. In September 2014, the
price for the same chicken was $5.49. Use the January 2001 price as the base period and 100 as the
base value to develop a simple index. By what percentage has the cost of chicken increased over
the period?

LO2 15.5 UNWEIGHTED INDEXES


In many situations, we wish to combine several items and develop an index to compare the cost of
this aggregation of items in two different time periods. For example, we might be interested in an
index for items that relate to the expense of running and maintaining an automobile. The items in
the index might include tires, oil changes, and gasoline prices. Or we might be interested in a col-
lege student index. This index might include the cost of books, tuition, housing, meals, and enter-
tainment. There are several ways we can combine the items to determine the index.

Simple Average of the Price Indexes


Table 15–2 reports the prices for several food items for the years 2004 and 2014. We would like
to develop an index for this group of food items for 2014, using 2004 as the base. This is writ-
ten in the abbreviated code 2004 = 100.

TABLE 15–2 Computation of Index for Food Price 2014, 2004 = 100

Item                                              2004 Price ($)   2014 Price ($)   Simple Index
Bread, white (loaf)                                0.97             1.98            204.1
Eggs (dozen)                                       1.85             2.19            118.4
Milk, white (litre)                                0.98             1.43            145.9
Apples, red delicious (500 grams [g])              1.98             2.75            138.9
Orange juice (355 millilitres [mL] concentrate)    1.58             1.70            107.6
Coffee, 100% ground roast (400 g)                  5.40             6.99            129.4
Total                                             12.76            17.04

We could begin by computing a simple price index for each item, using 2004 as the
base year and 2014 as the given year. The simple index for bread is 204.1, found by using
formula (15–1).
\[ P = \frac{p_t}{p_0} \times 100 = \frac{\$1.98}{\$0.97}(100) = 204.1 \]
We compute the simple index for the other items in Table 15–2 similarly. The largest price in-
crease is for bread, 104.1% (204.1 - 100 = 104.1), and milk was second at 45.9%. The price of
eggs increased 18.4% in the period, found by: 118.4 - 100.0 = 18.4. Then it would be natural
to average the simple indexes. The formula is:

SIMPLE AVERAGE OF THE PRICE INDEXES
\[ P = \frac{\sum P_i}{n} \qquad \text{[15–2]} \]
where Pi refers to the simple index for each of the items and n the number of items. In our
example, the index is 140.7, found by:
\[ P = \frac{\sum P_i}{n} = \frac{204.1 + 118.4 + \cdots + 129.4}{6} = \frac{844.3}{6} = 140.7 \]
This indicates that the mean of the group of indexes increased 40.7% from 2004 to 2014.
A positive feature of the simple average of price indexes is that we would obtain the same
value for the index regardless of the units of measure. In the above index, if apples were priced in
tonnes, instead of kilograms, the impact of apples on the combined index would not change.
That is, the commodity “apples” represents one of six items in the index, so the impact of the
item is not related to the units. A negative feature of this index is that it fails to consider the rela-
tive importance of the items included in the index. For example, milk and eggs receive the same
weight, even though a typical family might spend far more over the year on milk than on eggs.
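As a minimal sketch (not part of the original text), the simple indexes in Table 15–2 and their average from formula (15–2) could be reproduced in Python as follows. The names `prices` and `simple_index` are illustrative only, and each simple index is rounded to one decimal before averaging, as the table does.

```python
# Prices from Table 15-2: item -> (2004 price, 2014 price)
prices = {
    "bread": (0.97, 1.98),
    "eggs": (1.85, 2.19),
    "milk": (0.98, 1.43),
    "apples": (1.98, 2.75),
    "orange juice": (1.58, 1.70),
    "coffee": (5.40, 6.99),
}


def simple_index(p0, pt):
    """Formula (15-1): given-period price as a percentage of the base-period price."""
    return pt / p0 * 100


# Round each simple index to one decimal, then average (formula 15-2).
simple_indexes = {
    item: round(simple_index(p0, pt), 1) for item, (p0, pt) in prices.items()
}
average_index = sum(simple_indexes.values()) / len(simple_indexes)

print(simple_indexes)
print(round(average_index, 1))
```

Because each item enters only through its own ratio, changing the unit in which an item is priced leaves its simple index, and hence the average, unchanged.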

Simple Aggregate Index


A second possibility is to sum the prices (rather than the indexes) for the two periods and then
determine the index based on the totals. The formula is:

SIMPLE AGGREGATE INDEX
\[ P = \frac{\sum p_t}{\sum p_0} \times 100 \qquad \text{[15–3]} \]

This is called a simple aggregate index. The index for the above food items is found by sum-
ming the prices in 2004 and 2014. The sum of the prices for the base period is $12.76 and for
the given period, it is $17.04. The simple aggregate index is 133.5. This means that the aggre-
gate group of prices increased 33.5% in the 10-year period.
\[ P = \frac{\sum p_t}{\sum p_0} \times 100 = \frac{\$17.04}{\$12.76}(100) = 133.5 \]
Because the value of a simple aggregate index can be influenced by the units of measure-
ment, it is not used frequently. In our example, the value of the index would differ significantly
if we were to report the price of apples in tonnes rather than kilograms. Also, note the effect of
coffee on the total index. For both the current year and the base year, the value of coffee is
slightly more than 40% of the total index, so a change in the price of coffee will drive the index
much more than any other item. Therefore, we need a way to appropriately “weight” the items
according to their relative importance.
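The sensitivity to units can be seen in a short Python sketch. It computes the simple aggregate index from the Table 15–2 prices and then recomputes it with apples priced per 5 kilograms instead of per 500 grams. The 5-kilogram unit is a hypothetical choice made only for illustration; it multiplies both apple prices by 10.

```python
# Prices from Table 15-2: item -> (2004 price, 2014 price)
prices = {
    "bread": (0.97, 1.98),
    "eggs": (1.85, 2.19),
    "milk": (0.98, 1.43),
    "apples": (1.98, 2.75),
    "orange juice": (1.58, 1.70),
    "coffee": (5.40, 6.99),
}


def simple_aggregate_index(price_table):
    """Formula (15-3): ratio of summed given-period prices to summed base-period prices."""
    total_base = sum(p0 for p0, _ in price_table.values())
    total_given = sum(pt for _, pt in price_table.values())
    return total_given / total_base * 100


aggregate = simple_aggregate_index(prices)

# Hypothetical unit change: price apples per 5 kg (10 times the 500 g price).
rescaled_prices = dict(prices)
p0, pt = prices["apples"]
rescaled_prices["apples"] = (p0 * 10, pt * 10)
rescaled = simple_aggregate_index(rescaled_prices)

print(round(aggregate, 1))  # index with the original units
print(round(rescaled, 1))   # index after changing only the unit for apples
```

Nothing about the prices themselves changed between the two calls, yet the index moves, because the item with the largest dollar figure dominates the sums.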

15.6 WEIGHTED INDEXES


Two methods of computing a weighted price index are the Laspeyres method and the
Paasche method. They differ only in the period used for weighting. The Laspeyres method
uses base-period weights; that is, the original prices and quantities of the items bought are used
to find the percentage change over a period of time in either price or quantity consumed,
depending on the problem. The Paasche method uses current-year weights.

LO3 Laspeyres Price Index


Etienne Laspeyres developed a method in the latter part of the nineteenth century to deter-
mine a weighted index using base-period weights. Applying his method, a weighted price
index is computed by:

LASPEYRES PRICE INDEX
\[ P = \frac{\sum p_t q_0}{\sum p_0 q_0} \times 100 \qquad \text{[15–4]} \]
where:
P is the price index.
pt is the current price.
p0 is the price in the base period.
q0 is the quantity used in the base period.

Example The prices for the six food items from Table 15–2 are repeated below in Table 15–3. The num-
ber of units of each consumed by a typical family in 2004 and 2014 is also included.
TABLE 15–3 Price and Quantity of Food Items, 2004 = 100

2004 2014
Item Price ($) Quantity Price ($) Quantity
Bread, white (loaf) $0.97 50 $1.98 55
Eggs (dozen) 1.85 26 2.19 20
Milk, white (litre) 0.98 102 1.43 130
Apples, red delicious (500 g) 1.98 30 2.75 40
Orange juice, (355 mL concentrate) 1.58 40 1.70 41
Coffee, 100% ground roast (400 g) 5.40 12 6.99 12

Determine a weighted price index using the Laspeyres method. Interpret the result.

Solution First, we determine the total amount spent for the six items in the base period, 2004. To find
this value, we multiply the base period price for bread ($0.97) by the base period quantity of
50. The result is $48.50. This indicates that a total of $48.50 was spent in the base period on
bread. We continue that for all items and total the results. The base period total is $383.96. The
current period total is computed in a similar fashion. For the first item, bread, we multiply the
quantity in 2004 by the price of bread in 2014, that is, $1.98(50). The result is $99.00. We make
the same calculation for each item and total the result. The total is $536.18. Because of the re-
petitive nature of these calculations, a spreadsheet is effective for carrying out the calculations.
The Excel output showing the calculations is given below:

2004 2014
Item Price ($) Quantity Price ($) Quantity P0Q0 PtQ0
Bread, white (loaf) $0.97 50 $1.98 55 $48.50 $99.00
Eggs (dozen) 1.85 26 2.19 20 48.10 56.94
Milk, white (litre) 0.98 102 1.43 130 99.96 145.86
Apples, red delicious (500 g) 1.98 30 2.75 40 59.40 82.50
Orange juice (355 mL, concentrate) 1.58 40 1.70 41 63.20 68.00
Coffee, 100% ground roast (400 g) 5.40 12 6.99 12 64.80 83.88
$383.96 $536.18
Laspeyres: 139.6

The weighted price index for 2014 is 139.6, found by:


\[ P = \frac{\sum p_t q_0}{\sum p_0 q_0} \times 100 = \frac{\$536.18}{\$383.96}(100) = 139.6 \]
On the basis of this analysis, we conclude that the price of this group of items has in-
creased by 39.6% in the 10-year period. The advantage of this method over the simple aggre-
gate index is that the weight of each of the items is considered. In the simple aggregate index,
coffee had about 40% of the weight in determining the index. In the Laspeyres index, the item
with the most weight is milk because the product of the price and the units sold is the largest.
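The spreadsheet calculation above can also be sketched in Python. This is an illustrative alternative to the Excel output, not the method used in the text; the `basket` layout and the function name `laspeyres_index` are assumptions made for the example.

```python
# Table 15-3: item -> (p0, q0, pt, qt) for 2004 (base) and 2014 (given)
basket = {
    "bread": (0.97, 50, 1.98, 55),
    "eggs": (1.85, 26, 2.19, 20),
    "milk": (0.98, 102, 1.43, 130),
    "apples": (1.98, 30, 2.75, 40),
    "orange juice": (1.58, 40, 1.70, 41),
    "coffee": (5.40, 12, 6.99, 12),
}


def laspeyres_index(items):
    """Formula (15-4): both totals use the base-period quantities q0."""
    given_total = sum(pt * q0 for p0, q0, pt, qt in items.values())
    base_total = sum(p0 * q0 for p0, q0, pt, qt in items.values())
    return given_total / base_total * 100


laspeyres = laspeyres_index(basket)
print(round(laspeyres, 1))
```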

LO4 Paasche Price Index


The major disadvantage of the Laspeyres’ index is that it assumes that the base-period quanti-
ties are still realistic in the given period. That is, the quantities used for the six items are
about the same in 2004 as 2014. In this case, note that the quantity of eggs purchased de-
clined by 23%, the quantity of milk increased by nearly 28%, and the number of apples in-
creased by 33%.
The Paasche index is an alternative. The procedure is similar, but instead of using base-
period weights, we use current-period weights. We use the sum of the products of the 2014
prices and the 2014 quantities. This has the advantage of using the more recent quantities. If
there has been a change in the quantities consumed since the base period, such a change is
reflected in the Paasche index.

PAASCHE PRICE INDEX
\[ P = \frac{\sum p_t q_t}{\sum p_0 q_t} \times 100 \qquad \text{[15–5]} \]

Example Use the information from Table 15–3 to determine the Paasche index. Discuss which of the
indexes should be used.
Solution Again, because of the repetitive nature of the calculations, Excel is used to perform the calcula-
tions. The results are shown in the following output:

2004 2014
Item Price ($) Quantity Price ($) Quantity P0Qt PtQt
Bread, white (loaf) $0.97 50 $1.98 55 $53.35 $108.90
Eggs (dozen) 1.85 26 2.19 20 37.00 43.80
Milk, white (litre) 0.98 102 1.43 130 127.40 185.90
Apples, red delicious (500 g) 1.98 30 2.75 40 79.20 110.00
Orange juice, (355 mL concentrate) 1.58 40 1.70 41 64.78 69.70
Coffee, 100% ground roast (400 g) 5.40 12 6.99 12 64.80 83.88
$426.53 $602.18
Paasche 141.2

The Paasche index is 141.2, found by:


\[ P = \frac{\sum p_t q_t}{\sum p_0 q_t} \times 100 = \frac{\$602.18}{\$426.53}(100) = 141.2 \]
This result indicates that there has been an increase of 41.2% in the price of this market
basket of goods between 2004 and 2014. That is, it costs 41.2% more to purchase these items
in 2014 than it did in 2004. All things considered, because of the change in the quantities pur-
chased between 2004 and 2014, the Paasche index is more reflective of the current situation.
It should be noted that the Laspeyres index is more widely used. The CPI, the most widely
reported index, is an example of the Laspeyres index.
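For readers who prefer code to a spreadsheet, the Paasche calculation can be sketched in Python. The only difference from a Laspeyres calculation is that the current-period quantities weight both totals. The `basket` layout and the function name `paasche_index` are illustrative assumptions.

```python
# Table 15-3: item -> (p0, q0, pt, qt) for 2004 (base) and 2014 (given)
basket = {
    "bread": (0.97, 50, 1.98, 55),
    "eggs": (1.85, 26, 2.19, 20),
    "milk": (0.98, 102, 1.43, 130),
    "apples": (1.98, 30, 2.75, 40),
    "orange juice": (1.58, 40, 1.70, 41),
    "coffee": (5.40, 12, 6.99, 12),
}


def paasche_index(items):
    """Formula (15-5): both totals use the current-period quantities qt."""
    given_total = sum(pt * qt for p0, q0, pt, qt in items.values())
    base_total = sum(p0 * qt for p0, q0, pt, qt in items.values())
    return given_total / base_total * 100


paasche = paasche_index(basket)
print(round(paasche, 1))
```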

How do we decide which index to use? When is the Laspeyres index more appropriate,
and when is the Paasche index the better choice?
Laspeyres’ Index
Advantages Requires quantity data from only the base period. This allows a more
meaningful comparison over time. The changes in the index can be at-
tributed to changes in the price.
Disadvantages Does not reflect changes in buying patterns over time. Also, it may over-
weight goods whose prices increase.

Paasche’s Index
Advantages Because it uses quantities from the current period, it reflects current buy-
ing habits.
Disadvantages It requires quantity data for the current year. Because different quantities
are used each year, it is impossible to attribute changes in the index to
changes in price alone. It tends to overweight the goods whose prices
have declined. It requires the prices to be recomputed each year.

Fisher’s Ideal Index


As noted above, Laspeyres’ index tends to overweight goods for which the prices have in-
creased. Paasche’s index, however, tends to overweight goods for which prices have gone
down. In an attempt to offset these shortcomings, Irving Fisher, in his book The Making of Index
Numbers, published in 1922, proposed an index called the Fisher’s ideal index. It is the geomet-
ric mean of the Laspeyres and Paasche indexes. We described the geometric mean in Chapter
3. It is determined by taking the kth root of the product of k positive numbers.

FISHER’S IDEAL INDEX
\[ \text{Fisher’s ideal index} = \sqrt{(\text{Laspeyres’ index})(\text{Paasche’s index})} \qquad \text{[15–6]} \]

The Fisher’s index seems to be theoretically ideal because it combines the best features of
both Laspeyres’ and Paasche’s. That is, it balances the effects of the two indexes. However, it is
rarely used in practice because it has the same basic set of problems as the Paasche index. It
requires that a new set of quantities be determined for each year.

Example Determine the Fisher’s ideal index for the data in Table 15–3.

Solution Fisher’s ideal index is:


\[ \text{Fisher’s ideal index} = \sqrt{(\text{Laspeyres’ index})(\text{Paasche’s index})} = \sqrt{(139.6)(141.2)} = 140.4 \]
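A brief Python check of this result, offered as a sketch, takes the two rounded index values from the text as inputs. The function name `fisher_ideal_index` is illustrative.

```python
import math


def fisher_ideal_index(laspeyres, paasche):
    """Formula (15-6): geometric mean of the Laspeyres and Paasche indexes."""
    return math.sqrt(laspeyres * paasche)


# Index values reported in the text for Table 15-3.
fisher = fisher_ideal_index(139.6, 141.2)
print(round(fisher, 1))
```

Because it is a geometric mean of two positive numbers, the result always lies between the Laspeyres and Paasche values.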

self-review 15–2 An index of clothing prices for 2014 based on 2004 is to be constructed. The clothing items consid-
ered are shoes and dresses. The information for prices and quantities for both years is given below.
Use 2004 as the base period and 100 as the base value.

2004 2014
Item Price ($) Quantity Price ($) Quantity
Dress (each) $75 500 $85 520
Shoes (pair) 40 1200 45 1300

(a) Determine the simple average of the price indexes.


(b) Determine the aggregate price indexes for the two years.
(c) Determine Laspeyres’ price index.
(d) Determine Paasche’s price index.
(e) Determine Fisher’s ideal index.

EXERCISES
For exercises 5–8:
a. Determine the simple price indexes.
b. Determine the simple aggregate price indexes for the two years.
c. Determine Laspeyres’ price index.
d. Determine Paasche’s price index.
e. Determine Fisher’s ideal index.
5. The prices of toothpaste (100 mL), shampoo (500 mL), cough tablets (package of 100), and antiperspi-
rant (45 g) for August 2004 and August 2014 are given below. The quantities purchased are also in-
cluded. Use August 2004 as the base.

August 2004 August 2014


Item Price ($) Quantity Price ($) Quantity
Toothpaste $2.49 6 $2.69 6
Shampoo 3.29 4 3.59 5
Cough tablets 1.79 2 2.79 3
Antiperspirant 2.29 3 3.79 4

6. Fruit prices and the amounts consumed for 2004 and 2014 are given below. Use 2004 as the base.

2004 2014
Item Price ($) Quantity Price ($) Quantity
Bananas (pounds [lb]) $0.23 100 $0.49 120
Grapefruit (each) 0.29 50 0.27 55
Apples 0.35 85 0.35 85
Strawberries (basket) 1.02 8 1.99 10
Oranges (bag) 0.89 6 2.99 8

7. The prices and the numbers of various items produced by a small machine and stamping plant are
reported below. Use 2003 as the base.

2003 2013
Item Price ($) Quantity Price ($) Quantity
Washer $0.07 17 000 $0.10 20 000
Cotter pin 0.04 125 000 0.10 130 000
Stove bolt 0.15 40 000 0.18 42 000
Hex nut 0.08 62 000 0.10 65 000

8. The quantities and prices of office supplies for the years 2004 and 2014 for Sam’s Student Centre are
given below:

2004 2014
Item Price ($) Quantity Price ($) Quantity
Pens (dozen) $0.90 50 $1.10 55
Pencils (dozen) 0.65 50 0.80 60
Erasers (each) 0.45 250 0.55 275
Paper, lined (package [pkg]) 0.89 500 1.09 750
Paper, printer (pkg) 5.99 300 4.99 450
Printer (cartridges) 15.99 150 19.99 200

LO5 15.7 VALUE INDEX


A value index measures changes in both the price and quantities involved. A value index,
such as the index of department store sales, considers the base-year prices, the base-year
quantities, the present-year prices, and the present-year quantities for its construction. Its
formula is:

VALUE INDEX
\[ V = \frac{\sum p_t q_t}{\sum p_0 q_0} \times 100 \qquad \text{[15–7]} \]

Example The prices and quantities sold at the Waleska Department Store for various items of apparel for
May 2003 and May 2013 are as follows:

                2003 Price,   2003 Quantity Sold   2013 Price,   2013 Quantity Sold
Item            p0 ($)        (thousands), q0      pt ($)        (thousands), qt
Ties (each)     $10           1000                 $12           900
Suits (each)    300           100                  400           120
Shoes (pair)    100           500                  120           500

What is the index of value for May 2013, using May 2003 as the base period?

Solution Total sales in May 2013 were $118 800 000, and the comparable figure for 2003 is $90 000 000
(Table 15–4). Thus, the index of value for May 2013 using 2003 = 100 is 132.0. The value of
apparel sales in 2013 was 132% of the 2003 sales. To put it another way, the value of apparel
sales increased 32% from May 2003 to May 2013.
\[ V = \frac{\sum p_t q_t}{\sum p_0 q_0} \times 100 = \frac{118\,800}{90\,000}(100) = 132.0 \]

TABLE 15–4 Construction of a Value Index for 2013 (2003 = 100)

               2003 Price,  2003 Quantity Sold  p0q0            2013 Price,  2013 Quantity Sold  ptqt
Item           p0 ($)       (thousands), q0     ($ thousands)   pt ($)       (thousands), qt     ($ thousands)
Ties (each)    $10          1000                $10 000         $12          900                 $10 800
Suits (each)   300          100                 30 000          400          120                 48 000
Shoes (pair)   100          500                 50 000          120          500                 60 000
Total                                           $90 000                                          $118 800
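The totals in Table 15–4 and the resulting value index can be reproduced with a short Python sketch. The `apparel` layout and the function name `value_index` are assumptions made for illustration; quantities are in thousands, so the totals are in thousands of dollars.

```python
# Waleska Department Store: item -> (p0, q0, pt, qt); quantities in thousands
apparel = {
    "ties": (10, 1000, 12, 900),
    "suits": (300, 100, 400, 120),
    "shoes": (100, 500, 120, 500),
}


def value_index(items):
    """Formula (15-7): current-period value relative to base-period value."""
    current_value = sum(pt * qt for p0, q0, pt, qt in items.values())
    base_value = sum(p0 * q0 for p0, q0, pt, qt in items.values())
    return current_value / base_value * 100


base_total = sum(p0 * q0 for p0, q0, pt, qt in apparel.values())
current_total = sum(pt * qt for p0, q0, pt, qt in apparel.values())
v = value_index(apparel)

print(base_total, current_total)  # totals in $ thousands
print(round(v, 1))
```

Unlike the Laspeyres and Paasche indexes, both prices and quantities change between the numerator and the denominator, so the result reflects the combined effect of the two.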

self-review 15–3 The number of items produced by Houghton Products for 2004 and 2014 and the wholesale prices
for the two periods are as follows:

Price ($) Number Produced


Item Produced 2004 2014 2004 2014
Shear pins (box) $3 $4 10 000 9000
Cutting compound (500 g) 1 5 600 200
Tie rods (each) 10 8 3000 5000

(a) Find the index of the value of production for 2014 using 2004 as the base period.
(b) Interpret the index.

EXERCISES
9. The prices and production of grains for August 2004 and August 2014 are as follows:

          2004         2004 Quantity Produced    2014         2014 Quantity Produced
Grain     Price ($)    (millions of bushels)     Price ($)    (millions of bushels)
Oats      $1.52        200                       $1.87        214
Wheat     2.10         565                       2.05         489
Corn      1.48         291                       1.48         203
Barley    3.05         87                        3.29         106

Using 2004 as the base period, find the value index of grains produced for August 2014.
10. The Johnson Wholesale Company manufactures a variety of products. The prices and quantities
produced for April 2004 and April 2014 are as follows:

2004 2014
2004 Quantity 2014 Quantity
Product Price ($) Produced Price ($) Produced
Small motor (each) $23.60 1760 $28.80 4259
Scrubbing compound (litre) 2.96 86 450 3.08 62 949
Nails (pound) 0.40 9460 0.48 22 370

Using April 2004 as the base period, find the index of the value of goods produced for April 2014.

15.8 SPECIAL-PURPOSE INDEXES


Many important indexes are prepared and published by private organizations. Financial insti-
tutions, utility companies, and university bureaus of research often prepare indexes on em-
ployment, factory hours and wages, and retail sales for the regions they serve. Many trade
associations prepare indexes of price and quantity that are vital to their particular area of inter-
est. As well, there are many special-purpose indexes. How are these special indexes prepared?
An example will help explain some of the details.

Example A provincial Chamber of Commerce wants to develop a measure of general business activity
for the southwest area of the province. The director of economic development has been
assigned to develop the index. It will be called the General Business Activity Index of the
Southwest Region.

Solution After considerable thought and research, the director concluded that there were four factors to be looked at: the regional department store sales (which are reported in $thousands), the regional employment index (which has a 2003 base and is reported by the province), the vehicle traffic reported in the region, as determined by independent studies (reported in thousands), and exports of the industries in the region (in tonnes). The most recent available data are reported below:

        Department     Index of      Vehicle
Year    Store Sales    Employment    Traffic    Exports
2003    2000           100           500        500
2008    4100           110           300        900
2013    4400           125           180        700

After review and consultation, the director assigned weights of 40% to department store sales,
30% to employment, 10% to vehicle traffic, and 20% to exports.
To develop the General Business Activity Index of the Southwest Region for 2013 using
2003 = 100, each 2013 value is expressed as a percentage. For example, department store
sales for 2013 are converted to a percentage by (4400/2000)(100) = 220. This means that de-
partment store sales have increased by 120% in the period. This percentage is then multiplied
by the appropriate weight. For the department store sales, this is (220)(0.40) = 88.0. The details
of the calculations are as follows:

2008 2013
Department Store Sales (4100/2000)(100)(0.40) = 82.0 (4400/2000)(100)(0.40) = 88.0
Index of Employment (110/100)(100)(0.30) = 33.0 (125/100)(100)(0.30) = 37.5
Vehicle Traffic (300/500)(100)(0.10) = 6.0 (180/500)(100)(0.10) = 3.6
Exports (900/500)(100)(0.20) = 36.0 (700/500)(100)(0.20) = 28.0
157.0 157.1

The General Business Activity Index of the Southwest Region for 2008 is 157.0, and for 2013, it is 157.1. Interpreting, business activity increased by 57.0% from 2003 to 2008 and by 57.1% from 2003 to 2013. In other words, activity was essentially unchanged between 2008 and 2013.
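The weighted calculation for both years can be sketched in Python as follows. The dictionary layout and the function name `activity_index` are illustrative assumptions; the data and weights are those given in the example.

```python
# Raw component values from the example, keyed by year.
components = {
    "department store sales": {2003: 2000, 2008: 4100, 2013: 4400},
    "index of employment": {2003: 100, 2008: 110, 2013: 125},
    "vehicle traffic": {2003: 500, 2008: 300, 2013: 180},
    "exports": {2003: 500, 2008: 900, 2013: 700},
}

# Weights assigned by the director; they sum to 1.
weights = {
    "department store sales": 0.40,
    "index of employment": 0.30,
    "vehicle traffic": 0.10,
    "exports": 0.20,
}


def activity_index(year, base_year=2003):
    """Weighted sum of each component expressed as a percentage of its base-year value."""
    return sum(
        weights[name] * series[year] / series[base_year] * 100
        for name, series in components.items()
    )


index_2008 = activity_index(2008)
index_2013 = activity_index(2013)
print(round(index_2008, 1), round(index_2013, 1))
```

Expressing each component as a percentage of its own base value before weighting is what allows series measured in dollars, index points, vehicle counts, and tonnes to be combined.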

As we stated at the start of the section, there are many special-purpose indexes. Here are a
few examples:

The Consumer Price Index Statistics Canada reports this index monthly. It describes the
changes in prices from one period to another for a “market basket” of goods and services. The

base year for the CPI as of 2014 is 2002 = 100.0. A historical summary of the CPI for Canada from 2002 to 2013 follows. We present some applications later in the chapter.

Year    All Items    Year    All Items
(2002 = 100)
2002    100.0        2008    114.1
2003    102.8        2009    114.4
2004    104.7        2010    116.5
2005    107.0        2011    119.9
2006    109.1        2012    121.7
2007    111.5        2013    122.8

Source: Statistics Canada statcan.gc.ca/tables-tableaux/sum-som/l01/cst01/econ150a-eng.htm Retrieved February 2014.

S&P/TSX Composite Index Introduced in 1977 as The TSE 300 Composite Index, the
Toronto Stock Exchange’s composite index represented the average performance of 300 of
Canada’s largest public companies traded on the Toronto Stock Exchange. Effective May 2002,
the index was renamed S&P/TSX and is no longer restricted to 300 companies.

Dow Jones Industrial Average This is an index of stock prices, but perhaps it would be
better to say that it is an “indicator” rather than an index. It is supposed to be the mean price of
30 specific industrial stocks. However, summing the 30 stocks and dividing by 30 does not
calculate its value. This is because of stock splits, mergers, and stocks being added or dropped.
When changes occur, adjustments are made in the denominator used with the average. Today,
the DJIA is more of a psychological indicator than a representation of the general price move-
ment on the New York Stock Exchange (NYSE). The lack of representativeness of the stocks
on the DJIA is one of the reasons for the development of the NYSE Index. This index was de-
veloped as an average price of all stocks on the NYSE.
There are many other indexes that track business and economic behaviour, such as the
Nasdaq and the Russell 2000.

LO6 15.9 CONSUMER PRICE INDEX


Frequent mention has been made of the CPI in the preceding pages. It measures the change in
price of a fixed market basket of goods and services from one period to another.
In brief, the CPI serves several major functions. It allows consumers to determine the de-
gree to which their purchasing power is being eroded by price increases. In that respect, it is a
yardstick for revising wages, pensions, and other income payments to keep pace with changes
in price. Equally important, it is an economic indicator of the rate of inflation and is used by
business analysts and governments for evaluating and forecasting trends in interest rates and
so on. The CPI is also used as a deflator to show the trend in “real” increases. Refer to the table
above for the historical summary of the CPI from 2002 to 2013, as reported by Statistics
Canada. The current base year is 2002 = 100.

Special Uses of the Consumer Price Index


In addition to measuring changes in the prices of goods and services, the CPI has a number of
other applications. It is used to determine real disposable personal income, to deflate sales or
other series, to find the purchasing power of the dollar, and to establish cost-of-living increases.
We first discuss the use of the CPI in determining real income.

Real Income As an example of the meaning and computation of real income, assume the
CPI is 122.8 (2013) with 2002 = 100. Also, assume that Ms. Watts earned $35 000 in the base
period of 2002. She has a current income of $42 980. Note that although her money income has
increased by 22.8% since the base period of 2002, the prices she paid for food, gasoline,

clothing, and other items have also increased by 22.8%. Thus, Ms. Watts’ standard of living has
remained the same from the base period to the present time. Price increases have exactly
offset an increase in income, so her present buying power (real income) is still $35 000. (See
Table 15–5 for computations.) In general:

REAL INCOME
\[ \text{Real income} = \frac{\text{Money income}}{\text{CPI}} \times 100 \qquad \text{[15–8]} \]

TABLE 15–5 Computation of Real Income for 2002 and 2013

                        Consumer Price Index                   Computation
Year    Money Income    (2002 = 100)            Real Income    of Real Income
2002    $35 000         100                     $35 000        ($35 000/100)(100)
2013    42 980          122.8                   35 000         ($42 980/122.8)(100)

The concept of real income is sometimes called deflated income, and the CPI is called
the deflator. Also, a popular term for deflated income is income expressed in constant dollars.
Thus, in Table 15–5, to determine whether Ms. Watts’ standard of living changed, her money
income was converted to constant dollars. We found that her purchasing power, expressed in
2002 dollars (constant dollars), remained at $35 000.
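Formula (15–8) is simple enough to check in a few lines of Python. The sketch below uses Ms. Watts’ figures from Table 15–5; the function name `real_income` is illustrative.

```python
def real_income(money_income, cpi):
    """Formula (15-8): money income deflated by the CPI (base period = 100)."""
    return money_income / cpi * 100


real_2002 = real_income(35_000, 100)
real_2013 = real_income(42_980, 122.8)

print(round(real_2002, 2), round(real_2013, 2))
```

Both values come out the same, which is the sense in which her standard of living has not changed.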

self-review 15–4 The take-home pay of Jon Greene and the CPI for 2003 and 2013 are as follows:

Year Take-Home Pay ($) CPI (2002 = 100)


2003 $35 000 102.8
2013 44 300 122.8

(a) What was Jon’s real income in 2003?


(b) What was his real income in 2013?
(c) Interpret your findings.

Deflating Sales A price index can also be used to “deflate” sales or similar money series.
Deflated sales are determined by:

USING AN INDEX AS A DEFLATOR
\[ \text{Deflated sales} = \frac{\text{Actual sales}}{\text{An appropriate index}} \times 100 \qquad \text{[15–9]} \]

Example Sam’s Enterprises has retail stores in Victoria and Collingwood. Sales in 2003 were $445 873 and $775 995, respectively. Last year, sales were $773 998 and $973 545, respectively. Sam wants to know how much sales have increased over the last 11 years, so he decides to deflate the sales for last year to the 2003 levels. Given that the industry index for last year is 122.3 (2003 = 100), express Sam’s sales last year in constant 2003 dollars.

Solution The results are shown in the following Excel output:

Sam’s Enterprises
Index, last year = 122.3
               Sales      Sales        Constant Dollars
               2003       Last year    (2003)              Found by
Collingwood    775 995    973 545      796 030             = 973545/122.3*100
Victoria       445 873    773 998      632 868             = 773998/122.3*100

Comparing the sales for 2003 to the constant dollars, we see that sales grew in both locations
from 2003 to last year.
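The same deflation can be sketched in Python instead of Excel. The function name `deflate` and the `sales` layout are illustrative assumptions; the real-growth percentages at the end are an extra step that the example does not report.

```python
def deflate(actual_sales, index):
    """Formula (15-9): express sales in base-period (constant) dollars."""
    return actual_sales / index * 100


index_last_year = 122.3

# store -> (2003 sales, last year's sales), in dollars
sales = {
    "Collingwood": (775_995, 973_545),
    "Victoria": (445_873, 773_998),
}

constant_dollars = {
    store: round(deflate(last_year, index_last_year))
    for store, (sales_2003, last_year) in sales.items()
}

# Real growth: constant-dollar sales compared with actual 2003 sales.
real_growth_pct = {
    store: (constant_dollars[store] / sales_2003 - 1) * 100
    for store, (sales_2003, last_year) in sales.items()
}

print(constant_dollars)
print({store: round(pct, 1) for store, pct in real_growth_pct.items()})
```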

Purchasing Power of the Dollar The CPI is also used to determine the purchasing power of
the dollar.

USING AN INDEX TO FIND PURCHASING POWER
\[ \text{Purchasing power of dollar} = \frac{\$1}{\text{CPI}} \times 100 \qquad \text{[15–10]} \]

Example Suppose the CPI this month is 125.0 (2002 = 100). What is the purchasing power of the dollar?

Solution Using formula (15–10), it is $0.80, found by:


\[ \text{Purchasing power of dollar} = \frac{\$1}{125.0}(100) = \$0.80 \]
The CPI of 125.0 indicates that prices have increased by 25% from 2002 to this month. Thus, the purchasing power of a dollar has been cut. That is, a dollar this month buys only what $0.80 bought in 2002. To put it another way, if you lost $1000 in 2002 and just found it, the $1000 could buy only what $800 would have bought in 2002.
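As a small illustrative sketch, formula (15–10) and the $1000 example can be checked in Python. The function name `purchasing_power` is an assumption made for the example.

```python
def purchasing_power(cpi, amount=1.0):
    """Formula (15-10): value of `amount` current dollars in base-period dollars."""
    return amount / cpi * 100


per_dollar = purchasing_power(125.0)
found_money = purchasing_power(125.0, amount=1000)

print(round(per_dollar, 2))   # base-period value of one current dollar
print(round(found_money, 2))  # base-period value of $1000 found this month
```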

Cost-of-Living Adjustments The CPI is also the basis for cost-of-living adjustments, or COLA,
in many management–union contracts. The specific clause in the contract is often referred to as
the “escalator clause.” Many workers have their incomes or pensions pegged to the CPI.
The CPI is also used to adjust alimony and child support payments; attorneys’ fees; workers’ compensation payments; rentals on apartments, homes, and office buildings; welfare payments; and so on. As a brief example, suppose a retiree receives a pension of $500 a month and the CPI increases by 5 points, from 165 to 170. Suppose also that for each point that the CPI increases, the pension benefits increase 1%. The monthly increase in benefits will then be $25, found by $500(5 points)(0.01), and the retiree will receive $525 per month.
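The escalator rule in this example, 1% of the pension per CPI point, can be sketched in Python. The function name `cola_increase` and its parameter `pct_per_point` are hypothetical; actual contracts may define the adjustment differently, for instance in terms of the percentage change in the CPI.

```python
def cola_increase(benefit, cpi_old, cpi_new, pct_per_point=1.0):
    """Increase in a benefit that rises by pct_per_point percent per CPI point."""
    points = cpi_new - cpi_old
    return benefit * points * pct_per_point / 100


pension = 500
increase = cola_increase(pension, cpi_old=165, cpi_new=170)
new_pension = pension + increase

print(increase, new_pension)
```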

self-review 15–5 Suppose that the CPI for the latest month is 134.0 (2002 = 100). What is the purchasing power of the
dollar? Interpret.

15.10 SHIFTING THE BASE


If two or more time series have the same base period, they can be compared directly. As an
example, suppose we are interested in the trend in the prices of food, shelter, clothing and
footwear, and health and personal care over the last four years compared with the base year
(2002 = 100). Note in Table 15–6 that all of the CPIs use the same base. Thus, it can be said that the price of all consumer items combined increased by 19% from the base period (2002) to 2008. Likewise, shelter increased by 13.8%, clothing and footwear by 5.2%, and so on.

TABLE 15–6 Trend in Consumer Prices to March 2011 (2002 = 100)

        All                             Clothing        Health and
Year    Items    Food     Shelter    and Footwear    Personal Care
2002    100.0    100.0    100.0      100.0           100.0
2005    108.6    109.3    103.7      103.9           108.1
2008    119.0    120.3    113.8      105.2           115.5
2011    126.5    127.1    123.0      106.0           120.0

A problem arises, however, when two or more series being compared do not have the
same base period. The following example compares price changes in the S&P/TSX Composite
Index and the DJIA.

Example We want to compare the price changes in the S&P/TSX Composite Index and the DJIA. The
two indexes from 2004 to 2013 follow. The information is reported at the end of December for
each year. (See Connect for the file Stock Indexes.)

Year S&P/TSX DJIA


2004 9 246.65 10 783.01
2005 11 272.36 10 717.50
2006 12 908.39 12 463.15
2007 13 778.58 13 264.82
2008 8 987.70 8 776.39
2009 11 746.11 10 428.05
2010 13 443.22 11 577.51
2011 11 955.10 12 217.56
2012 12 433.53 13 104.14
2013 13 621.55 16 576.66

Solution From the information given, we are not sure that the base periods are the same. Hence, a direct
comparison is not appropriate. Because we want to compare the changes in the two business
indexes, the logical approach is to let a particular period, say, December 2004, be the base for
both indexes. For the S&P/TSX Composite Index, the base is 9246.65, and for the DJIA, the
base is 10 783.01.
The calculation of the index for the S&P/TSX Composite Index for December 2010 is:
\[ \text{Index} = \frac{13\,443.22}{9246.65}(100) = 145.4 \]
The following Excel output shows the complete set of indexes:

Year S&P/TSX Index DJIA Index


2004 9 246.65 100.0 10 783.01 100.0
2005 11 272.36 121.9 10 717.50 99.4
2006 12 908.39 139.6 12 463.15 115.6
2007 13 778.58 149.0 13 264.82 123.0
2008 8 987.70 97.2 8 776.39 81.4
2009 11 746.11 127.0 10 428.05 96.7
2010 13 443.22 145.4 11 577.51 107.4
2011 11 955.10 129.3 12 217.56 113.3
2012 12 433.53 134.5 13 104.14 121.5
2013 13 621.55 147.3 16 576.66 153.7

We conclude that both indexes have increased over the period. The S&P/TSX Composite Index
has increased 47.3% over the time period, and the DJIA has increased 53.7% over the same period.
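The base shift for both series can be sketched in Python rather than Excel. The function name `shift_base` is illustrative; the year-end values are those listed in the example.

```python
# Year-end index levels from the example.
tsx = {
    2004: 9246.65, 2005: 11272.36, 2006: 12908.39, 2007: 13778.58,
    2008: 8987.70, 2009: 11746.11, 2010: 13443.22, 2011: 11955.10,
    2012: 12433.53, 2013: 13621.55,
}
djia = {
    2004: 10783.01, 2005: 10717.50, 2006: 12463.15, 2007: 13264.82,
    2008: 8776.39, 2009: 10428.05, 2010: 11577.51, 2011: 12217.56,
    2012: 13104.14, 2013: 16576.66,
}


def shift_base(series, base_year):
    """Re-express every value as a percentage of the value in base_year."""
    base = series[base_year]
    return {year: round(value / base * 100, 1) for year, value in series.items()}


tsx_index = shift_base(tsx, 2004)
djia_index = shift_base(djia, 2004)

for year in tsx:
    print(year, tsx_index[year], djia_index[year])
```

Once both series share the December 2004 base, the values for any year can be compared directly, even though the raw index levels are on different scales.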

self-review 15–6 The following table shows the average earnings by gender of Canadian workers:

Year Women Men


2000 27 500 44 600
2001 27 600 44 500
2002 29 300 46 700
2003 29 000 46 000
2004 29 400 46 200
2005 30 000 46 900
2006 30 500 47 100
2007 31 300 47 800
2008 31 700 49 300
2009 32 600 47 400
2010 32 600 47 800
2011 32 100 48 100

The changes in earnings for men and women are to be compared. Unfortunately, the base-period (2000) earnings are different for the two groups: $27 500 for women and $44 600 for men. Calculate the indexes for both groups and interpret the findings.

EXERCISES
11. In 2002, Marilyn started working for $600 per week. How much would she have to earn in 2013 to have the same purchasing power if the CPI is 122.8 in 2013? Use 2002 as the base year.
12. The price of a pair of boots in 2006 was $125, and $150 in 2014. During the same period, the CPI for
clothing and footwear increased by 3.1%. Did the price of the boots increase more than, the same, or
less than the CPI?
13. At the end of 2013, the average salary for a senior customer service representative at Mercury Distri-
bution Inc. was $48 500. The CPI for 2013 was 122.8 (2002 = 100.0). The mean salary for the same
position in the base period of 2002 was $39 000. What was the real income of the customer service
representative in 2013? How much had the average salary increased?
14. The Trade Union Association maintains indexes on the hourly wages for a number of the trades.
Unfortunately, the indexes do not all have the same base periods. Listed below is information on
plumbers and electricians. Shift the base periods to 2000, and compare the hourly wage increases.

Year Plumbers (1995 = 100) Electricians (1998 = 100)


2000 133.8 126.0
2013 159.4 158.7

15. In 1998, the mean salary of plant workers at Mercury Distribution Inc. was $26 650. The salary in-
cluded bonuses and overtime. By 2003, the mean salary increased to $31 972, and it was further
increased to $36 382 in 2008, $37 269 in 2011, and $39 500 in 2014. The company maintains infor-
mation on employment trends throughout its industry. Its industry index, which has a base of 1998,
was 122.5 for 2003, 136.9 for 2008, 144.9 for 2011, and 146.0 in 2014. Compare Mercury Distribution Inc.’s plant workers’ salaries to the industry trends.
16. Sam Steward is a freelance Web page designer. His yearly wages for the years 2009 through 2014 are
listed below. An industry index for computer programmers that reports the rate of wage inflation in
the industry is also included. This index has a base of 1998.

Year Wage ($ thousands) Index (1998 = 100)


2009 $175 148.3
2010 175 140.6
2011 150 120.9
2012 120 110.2
2013 120 105.3
2014 130 105.0

Compute Sam’s real income for the period. Did his wages match the increase or decline in the industry?

Chapter Summary
I. An index number measures the relative change from one period to another.
A. The major characteristics of an index are as follows:
1. It is a percentage, but the percent sign is usually omitted.
2. It has a base period.
3. Most indexes are reported to the nearest tenth of a percent, such as 153.1.
4. The base of most indexes is 100.
B. The reasons for computing an index are as follows:
1. It facilitates the comparison of unlike series.
2. If the numbers are very large, often it is easier to comprehend the change of the index than
the actual numbers.
II. There are two types of price indexes—unweighted and weighted.
A. In an unweighted index, we do not consider the quantities.
1. In a simple index, we compare the base period to the given period.
\[ I = \frac{p_t}{p_0} \times 100 \tag{15–1} \]
where p_t refers to the price in the current period, and p_0 is the price in the base period.
2. In the simple average of price indexes, we add the simple indexes for each item and divide
by the number of items.
\[ P = \frac{\sum P_i}{n} \tag{15–2} \]
3. In a simple aggregate price index, the prices of the items in the group are totalled for both
periods and compared.
\[ P = \frac{\sum p_t}{\sum p_0} \times 100 \tag{15–3} \]
Statistics in Action
In the 1920s, wholesale prices in Germany increased dramatically. In 1920, wholesale prices increased
by about 80%; in 1921, the rate of increase was 140%; and in 1922, it was a whopping 4100%! Between
December 1922 and November 1923, wholesale prices increased by another 4100%. By that time,
government printing presses could not keep up, even by printing notes as large as 500 million marks.
Stories are told that workers were paid daily and then twice daily so that their wives could shop for
necessities before the wages became too devalued.

B. In a weighted index, the quantities are considered.
1. In the Laspeyres method, the base period quantities are used in both the base period and the
given period.
\[ P = \frac{\sum p_t q_0}{\sum p_0 q_0} \times 100 \tag{15–4} \]
2. In the Paasche method, current period quantities are used.
\[ P = \frac{\sum p_t q_t}{\sum p_0 q_t} \times 100 \tag{15–5} \]
3. Fisher’s ideal index is the geometric mean of the Laspeyres’ index and Paasche’s index.
\[ \text{Fisher’s ideal index} = \sqrt{(\text{Laspeyres’ index})(\text{Paasche’s index})} \tag{15–6} \]
C. A value index uses both base-period and current-period prices and quantities.
\[ V = \frac{\sum p_t q_t}{\sum p_0 q_0} \times 100 \tag{15–7} \]
III. The most widely reported index is the Consumer Price Index (CPI).
A. It is often used to show the rate of inflation.
B. It is reported monthly by Statistics Canada.
C. The base year for 2010 is 2002 = 100.0, changed from 1992 = 100.0 in January 2002.
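
As a rough illustration of formulas 15–1 to 15–7 in the summary above, the following Python sketch computes each index for a small two-item basket. The prices and quantities are the ones that appear in the answer to Self-Review 15–2 later in this chapter; the function and variable names are our own choices for illustration and do not come from the text.

```python
from math import sqrt

# Data from the Self-Review 15-2 answer: two items, two periods.
p0 = [75.0, 40.0]   # base-period prices
pt = [85.0, 45.0]   # current-period prices
q0 = [500, 1200]    # base-period quantities
qt = [520, 1300]    # current-period quantities


def simple_index(current, base):
    """Formula 15-1: one value relative to its base value, times 100."""
    return current / base * 100


def simple_average_index(pt, p0):
    """Formula 15-2: mean of the item-by-item simple indexes."""
    indexes = [simple_index(c, b) for c, b in zip(pt, p0)]
    return sum(indexes) / len(indexes)


def simple_aggregate_index(pt, p0):
    """Formula 15-3: total of current prices over total of base prices."""
    return sum(pt) / sum(p0) * 100


def weighted_total(prices, quantities):
    """Sum of price times quantity over all items."""
    return sum(p * q for p, q in zip(prices, quantities))


def laspeyres(pt, p0, q0):
    """Formula 15-4: base-period quantities as weights."""
    return weighted_total(pt, q0) / weighted_total(p0, q0) * 100


def paasche(pt, p0, qt):
    """Formula 15-5: current-period quantities as weights."""
    return weighted_total(pt, qt) / weighted_total(p0, qt) * 100


def fisher(pt, p0, q0, qt):
    """Formula 15-6: geometric mean of Laspeyres and Paasche."""
    return sqrt(laspeyres(pt, p0, q0) * paasche(pt, p0, qt))


def value_index(pt, p0, q0, qt):
    """Formula 15-7: current-period value over base-period value."""
    return weighted_total(pt, qt) / weighted_total(p0, q0) * 100


print("Simple average:  ", round(simple_average_index(pt, p0), 1))
print("Simple aggregate:", round(simple_aggregate_index(pt, p0), 1))
print("Laspeyres:       ", round(laspeyres(pt, p0, q0), 1))
print("Paasche:         ", round(paasche(pt, p0, qt), 1))
print("Fisher:          ", round(fisher(pt, p0, q0, qt), 1))
print("Value index:     ", round(value_index(pt, p0, q0, qt), 1))
```

The first five results should agree with parts (a) to (e) of the Self-Review 15–2 answer (112.9, 113.0, 112.9, 112.9, and 112.9). The value index is not part of that answer and is included only to show formula 15–7 on the same data.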

Chapter Exercises
The following information is from the CREA file. (See Connect for the data file and source.)

Region Jan-14 Jan-13 Jan-12 Jan-11 Jan-10 Jan-09 Jan-08


National Average $388 553 $354 951 $348 178 $344 118 $328 728 $274 711 $309 448
Vancouver 812 536 748 651 752 380 762 562 637 637 536 162 588 183
Calgary 444 153 418 938 382 468 394 655 382 009 362 143 408 672
Saskatoon 332 133 320 812 309 828 300 353 270 191 278 545 258 444
Toronto 526 528 482 648 463 534 427 159 409 058 343 632 374 449
Halifax 264 780 291 044 259 395 252 141 241 968 242 861 218 505

17. Refer to the table above. Use the National Average as the base period, and compute a simple index for each city
for Jan-14. Interpret your findings.
18. Refer to the table above. Use the National Average as the base period, and compute a simple index for each city
for Jan-10. Interpret your findings.
19. Refer to the table above. Use the National Average as the base period, and compute a simple index for each city
for Jan-08. Interpret your findings.
20. Refer to the table above. Use the data from Jan-14 for Vancouver, Calgary, and Saskatoon as the base period,
and compute a simple index for each city for Jan-14. Interpret your findings.
21. Refer to the table above. Use the data from Jan-14 for Calgary and Saskatoon as the base period, and compute a
simple index for each city for Jan-14. Interpret your findings.
22. Refer to the table above. Compare Jan-14 with Jan-08 for the national average and each city. Which city
increased the most?
The following information from Blackberry Limited’s stock prices is taken from the first trading day in March each
year. (See Connect for the data file and source.)

Date Closing Price


Mar-14 $9.66
Mar-13 13.03
Mar-12 13.44
Mar-11 55.65
Mar-10 75.25
Mar-09 54.49
Mar-08 115.49
Mar-07 157.50
Mar-06 98.89
Mar-05 92.71
Mar-04 122.40
Mar-03 19.08
Mar-02 44.34
Mar-01 33.60

23. Compute a simple index for the closing price. Use Mar-01 as the base period. What can you conclude about the
change in the closing stock price over the period?
24. Compute a simple index for the closing price using Mar-03 as the base period. What can you conclude about
the change in the closing price over the period?
25. Compute a simple index for the closing price using the period Mar-04–Mar-06 as the base period. What can you
conclude about the change in the closing price over the period?
26. Compute a simple index for the closing price using the period Mar-06–Mar-10 as the base period. What can you
conclude about the change in the closing price over the period?

The following information was reported on food items for the years 2004 and 2014:

2004 2014
Item Price ($) Quantity Price ($) Quantity
Margarine (454 g) $0.81 18 $2.39 27
Shortening (454 g) 0.84 5 1.49 9
Milk (2 liters [L]) 1.44 70 3.79 65
Potato chips (454 g) 2.91 27 3.99 33

27. Compute a simple price index for each of the four items. Use 2004 as the base period.
28. Compute a simple aggregate price index. Use 2004 as the base period.
29. Compute Laspeyres’ price index for 2014 using 2004 as the base period.
30. Compute Paasche’s index for 2014 using 2004 as the base period.
31. Determine Fisher’s ideal index using the values for the Laspeyres and Paasche indexes computed in the two
previous problems.
32. Determine a value index for 2014 using 2004 as the base period.
Betts Electronics purchases three replacement parts for robotic machines used in its manufacturing process.
Information on the price of the replacement parts and the quantity purchased is given below:

2005 2014
Part Price ($) Quantity Price ($) Quantity
RC-33 $0.50 320 $0.60 340
SM-14 1.20 110 0.90 130
WC-50 0.85 230 1.00 250

33. Compute a simple price index for each of the three items. Use 2005 as the base period.
34. Compute a simple aggregate price index for 2014. Use 2005 as the base period.
35. Compute Laspeyres’ price index for 2014 using 2005 as the base period.
36. Compute Paasche’s index for 2014 using 2005 as the base period.
37. Determine Fisher’s ideal index using the values for the Laspeyres and Paasche indexes computed in the two
previous problems.
38. Determine a value index for 2014 using 2005 as the base period.
Prices for selected foods for 2005 and 2014 are given in the following table:

2005 2014
Item Price ($) Quantity Price ($) Quantity
Cabbage (500 g) $0.60 2000 $0.90 1500
Carrots (bunch) 0.49 200 0.69 200
Peas (kilograms [kg]) 1.99 400 2.99 500
Endive (bunch) 0.89 100 1.29 200

39. Compute a simple price index for each of the four items. Use 2005 as the base period.
40. Compute a simple aggregate price index. Use 2005 as the base period.
41. Compute Laspeyres’ price index for 2014 using 2005 as the base period.
42. Compute Paasche’s index for 2014 using 2005 as the base period.
43. Determine Fisher’s ideal index using the values for the Laspeyres and Paasche indexes computed in the two
previous problems.
44. Determine a value index for 2014 using 2005 as the base period.

The prices of selected items for 2006 and 2014 are as follows. Quantity purchased is also listed.

2006 2014
Item Price ($) Quantity Price ($) Quantity
Paper, computer (pkg) $4.99 400 $5.99 500
Paper, lined (pkg) 0.89 1000 0.99 1200
Paper, plain (pkg) 0.99 850 1.19 1000
Paper, coloured (pkg) 1.49 350 1.79 350

45. Compute a simple price index for each of the four items. Use 2006 as the base period.
46. Compute a simple aggregate price index. Use 2006 as the base period.
47. Compute Laspeyres’ price index for 2014 using 2006 as the base period.
48. Compute Paasche’s index for 2014 using 2006 as the base period.
49. Determine Fisher’s ideal index using the values for the Laspeyres and Paasche indexes computed in the two
previous problems.
50. Determine a value index for 2014 using 2006 as the base period.
51. A special-purpose index is to be designed to monitor the overall economy of the region. Four key series were
selected. After considerable deliberation it was decided to weight retail sales 20%, total bank deposits 10%, industrial
production in the region 40%, and nonagricultural employment 30%. The data for 2006 and 2014 are as follows:

Year   Retail Sales ($ millions)   Bank Deposits ($ billions)   Industrial Production (2003 = 100)   Employment
2006   $1159.0                     $87                          110.6                                1 214 000
2014    1971.0                      91                          114.7                                1 501 000

Construct a special-purpose index for 2014 using 2006 as the base period, and interpret.
52. M Studios is studying its revenue to determine where its greatest growth has been. The business started
10 years ago, and a summary of sales is given below:

Year   Consumer Price Index (2002 = 100)   Photographic Supplies (in thousands)   Index of Photographic Services (in thousands)
2007   111.5                               175                                    65
2008   114.1                               205                                    70
2009   114.4                               300                                    72
2010   116.5                               310                                    86
2011   119.9                               315                                    92
2012   121.7                               318                                    92.5
2013   122.8                               320                                    93

a. Make whatever calculations are necessary to compare the trend in revenue from 2007 to 2013.
b. Interpret.
53. The management of Ingalls Super Discount stores wants to construct an index of economic activity for its
metropolitan area. Management contends that if the index reveals that the economy is slowing down,
inventory should be kept at a low level.
Three series seem to hold promise as predictors of economic activity—area retail sales, bank deposits, and
employment. All of these data can be secured monthly from the government. Retail sales is to be weighted 40%, bank
deposits 35%, and employment 25%. Seasonally adjusted data for the first three months of the year are as follows:

Month      Retail Sales ($ millions)   Bank Deposits ($ billions)   Employment (thousands)
January    $8.0                        $20                          300
February    6.8                         23                          303
March       6.4                         21                          297

Construct an index of economic activity for each of the three months, using January as the base period.

54. The following table gives information on the CPI and the monthly take-home pay of Bill Martin, an employee
at the Jeep Corporation.

Year   Consumer Price Index (2002 = 100)   Mr. Martin’s Monthly Take-Home Pay
2002   100.0                               $2400
2007   111.5                                2800
2010   116.5                                2900
2013   122.8                                3050

a. What is the purchasing power of the dollar for 2007 based on the period 2002?
b. Determine Mr. Martin’s “real” monthly income for 2007.
c. What is the purchasing power of the dollar for 2010 based on the period 2002?
d. Determine Mr. Martin’s “real” monthly income for 2010.
55. WSD Bank Inc. reported $17 446 (million) in commercial loans in 2000, $19 989 in 2002, $21 468 in 2004,
$21 685 in 2005, $15 922 in 2007, $18 375 in 2009, and $54 818 in 2014. Using 2000 as the base, develop
a simple index for the change in the amount of commercial loans for the years 2002, 2004, 2005, 2007, 2009,
and 2014.
The following are the quantities and prices for the years 2005 and 2014 for Kinzua Valley Geriatrics. 2005 is the
base period. Use this information for exercises 56 and 57.

2005 2014
Item Price Quantity Price Quantity
Syringes (dozen) $6.10 1500 $6.83 2000
Thermometers 8.10 10 9.35 12
Pain medication (bottle) 4.00 250 4.62 250
Patient record forms 6.00 1000 6.85 900
Computer paper (box) 12.00 30 13.65 40

56. a. Determine the simple price indexes.
b. Determine the simple average of the price indexes.
c. Determine the simple aggregate price index for the two years.
57. a. Determine Laspeyres’ price index.
b. Determine Paasche’s price index.
c. Determine Fisher’s ideal index.
58. The sales of Hill Enterprises, a small injection moulding company, increased from $875 000 in 1998 to $1 596 000
in 2013. Details are in the chart below. The owner, Harry Hill, realizes that the prices of raw materials used in the
process have also increased over the period, so Mr. Hill wants to deflate sales to account for the increase in raw
material prices. What are the deflated sales for each year based on 1998 dollars?

Year Sales CPI


1998 $ 875 000 91.3
2001 1 482 000 97.8
2004 1 491 000 104.7
2007 1 502 000 111.5
2010 1 515 000 116.5
2013 1 596 000 122.8

59. In 2009, the mean salary for a marketing director with a bachelor’s degree was $89 673. The CPI for 2009
was 114.4. The mean annual salary for a marketing director in the base period of 2002 (2002 = 100.0) was
$69 800. What was the real income of the marketing director in 2009? How much had the mean salary
increased?

60. The prices and quantities of various items sold at the Accessory Shop in July 2007 and July 2014 are as follows:

2007 2014
Item Price Quantity Price Quantity
Handbags $49.00 1500 $79.00 2000
Gloves 25.00 10 30.00 12
Umbrellas 14.00 250 18.00 250
Scarves 21.00 1000 25.00 900
Hats 22.00 325 35.00 525

Determine the value index for July 2014 using July 2007 as the base year. Interpret the index.
61. The following table gives information on the CPI and the yearly salary of Simone Smith:

Year   Consumer Price Index (2002 = 100)   Simone Smith’s Yearly Salary
2002   100.0                               $30 000
2006   109.1                                32 250
2010   116.5                                36 000
2013   122.8                                40 500

a. What is the purchasing power of the dollar for 2013 based on the period 2002?
b. Determine Simone Smith’s “real” yearly salary for 2010. Interpret the result.
c. What is the purchasing power of the dollar for 2013 based on the period 2006?
d. Determine Simone Smith’s “real” yearly salary for 2013. Interpret the result.
The following are the quantities and prices for the years 2009 and 2014 for Nine Thirty Photography, (2009 = 100).
Use this information for exercises 62 to 64.

2009 2014
Item Price Quantity Price Quantity
Camera $825.00 300 $975.00 500
Lens 125.00 200 175.00 250
Case 20.00 250 28.00 250
Lights 21.00 1000 25.00 900
Storage 110.00 325 110.00 525

62. a. Determine the simple price indexes.
b. Determine the simple average of the price indexes.
c. Determine the simple aggregate price index for the two years.
63. a. Determine Laspeyres’ price index.
b. Determine Paasche’s price index.
c. Determine Fisher’s ideal index.
64. Determine the value index for 2014 using 2009 as the base year. Interpret the index.

Data Set Exercises

Practise and learn online with Connect. Questions and tables with online data sets are marked with .

65. a. Use the file Stock Indexes on Connect to compare the price changes in the S&P/TSX Composite Index and
the NASDAQ from 2001 to 2013. Interpret your findings.
b. Use the file Stock Indexes on Connect to compare the price changes in the S&P/TSX Composite Index and
the S&P 500 from 2001 to 2013. Interpret your findings.
c. Use the file Stock Indexes on Connect to compare the price changes in the S&P/TSX Venture Index and the
NASDAQ from 2001 to 2013. Interpret your findings.
d. Use the file Stock Indexes on Connect to compare the price changes in the S&P/TSX Venture Index and the
S&P 500 from 2001 to 2013. Interpret your findings.

Practice Test
Part I Objective
1. To compute an index, the base period is always the ______ (numerator, denominator, can be in
either, always 100).
2. A number that measures the relative change from one period to another is called a/an ______.
3. In a weighted index, both the price and the ______ are considered.
4. In a Laspeyres index, the ______ quantities are used in both the numerator and denominator (base
period, given period, oldest, newest).
5. The current base period for the CPI is ______.
Part II Problems
1. The sales at Roberta’s Ice Cream Stand for the last five years are as follows:

Year Sales
2007 $130 000
2008 145 000
2009 120 000
2010 170 000
2011 190 000

a. Find the simple index for each year using 2007 as the base year.
b. Find the simple index for each year using 2007–2008 as the base year.
2. The prices and quantities of several golf items purchased by members of the men’s golf league at the Osler
Bluffs Golf and Tennis Club are as follows:

2006 2011
Price Quantity Price Quantity
Driver $250.00 5 $275.00 6
Putter 60.00 12 75.00 10
Iron 700.00 3 750.00 4

a. Determine the simple aggregate price index, with 2006 as the base period.
b. Determine Laspeyres’ price index.
c. Determine Paasche’s price index.
d. Determine the value index.

Answers to Self-Reviews
15–1 1.
Nation            Amount (millions of tonnes)   Index
China             779.0                         896.4
European Union    165.8                         190.8
Japan             110.6                         127.3
United States      86.9                         100.0
India              81.2                          93.4
China produces 796.4% more steel than the United States.
2.
Year   Average Weekly Earnings   Index
2008   $742.69                   100.0
2009    770.30                   103.7
2010    787.37                   106.0
2011    808.69                   108.9
2012    816.48                   109.9
(a) Wages have increased by about 10% from 2008 to 2012.

(b) Base = (770.30 + 787.37 + 808.69)/3 = 788.79.
Year   Average Weekly Earnings   Index
2008   $742.69                    94.2
2009    770.30                    97.7
2010    787.37                    99.8
2011    808.69                   102.5
2012    816.48                   103.5
(c) (808.69/770.30)(100) = 105.0.
15–2 (a) P1 = ($85/$75)(100) = 113.3.
P2 = ($45/$40)(100) = 112.5.
P = (113.3 + 112.5)/2 = 112.9.
(b) P = ($130/$115)(100) = 113.0.
(c) \[ P = \frac{\$85(500) + \$45(1200)}{\$75(500) + \$40(1200)}(100) = \frac{\$96\,500}{\$85\,500}(100) = 112.9 \]
(d) \[ P = \frac{\$85(520) + \$45(1300)}{\$75(520) + \$40(1300)}(100) = \frac{\$102\,700}{\$91\,000}(100) = 112.9 \]
(e) \[ P = \sqrt{(112.9)(112.9)} = 112.9 \]
15–3 (a) \[ P = \frac{\$4(9000) + \$5(200) + \$8(5000)}{\$3(10\,000) + \$1(600) + \$10(3000)}(100) = \frac{\$77\,000}{\$60\,600}(100) = 127.1 \]
(b) The value of sales has gone up 27.1% from 2004 to 2014.
15–4 (a) $34 046.69, found by: (35 000/102.8)(100).
(b) In terms of the base period, Jon’s salary was $34 046.69 in 2003 and $36 074.92 in 2013.
(c) This indicates his take-home pay increased at a slightly faster rate than the price paid for food,
transportation, and so on.
15–5 $0.75, found by: ($1.00/134.0)(100). A 2002 dollar is worth only $0.75 this month.
15–6
Year   Women    Index   Men      Index
2000   27 500   100.0   44 600   100.0
2001   27 600   100.4   44 500    99.8
2002   29 300   106.5   46 700   104.7
2003   29 000   105.5   46 000   103.1
2004   29 400   106.9   46 200   103.6
2005   30 000   109.1   46 900   105.2
2006   30 500   110.9   47 100   105.6
2007   31 300   113.8   47 800   107.2
2008   31 700   115.3   49 300   110.5
2009   32 600   118.5   47 400   106.3
2010   32 600   118.5   47 800   107.2
2011   32 100   116.7   48 100   107.8
The average earnings for men increased by 7.8% over the time period and increased 16.7% for women over the same
time period.
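
The deflating calculations in Self-Reviews 15–4 and 15–5 (real income and the purchasing power of the dollar) can be sketched in Python as follows. This is only an illustrative sketch: the function names `real_income` and `purchasing_power` are our own, and the figures are the ones quoted in those two answers.

```python
def real_income(money_income, cpi):
    """Express a money income in base-period dollars: (income / CPI) x 100."""
    return money_income / cpi * 100


def purchasing_power(cpi, amount=1.00):
    """Base-period value of `amount` current dollars: (amount / CPI) x 100."""
    return amount / cpi * 100


# Self-Review 15-4(a): salary of $35 000 when the CPI is 102.8.
print(round(real_income(35_000, 102.8), 2))   # 34046.69

# Self-Review 15-5: purchasing power of $1.00 when the CPI is 134.0.
print(round(purchasing_power(134.0), 2))      # 0.75
```

Both functions divide by the CPI and multiply by 100, so they differ only in what is being deflated: an income in the first case and a single dollar in the second.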
