Module Text
Author: Dr L. Short
MSc International Finance • Module FIN11101 • September 2011 Edition
The following introductory comments are designed to give you a general overview of
the module. Further details can be found at the appropriate place later in the notes.
1. General Structure
The module is designed roughly along the following general lines:
• 2 Lectures per week
• 1 Practical (Laboratory) per week
• 1 Tutorial per week.
You will find the module material organised around this course structure:
• Lecture material (in this book).
• Practical material in the Student Study Guide.
• Tutorial material at the end of the Student Study Guide.
The material in this module is written under the assumption that you have not studied
Statistics in any great detail, if at all. We start from very elementary ideas but
progress at a fairly brisk rate so that, by the end of the module, you will have
encountered a good many statistical ideas. Some of these ideas are relatively simple,
but some are decidedly not.
In addition statistics is essentially a “doing” subject; you cannot really learn about
how data is collected and analysed without actually collecting and analysing some
data for yourself. For this reason the lecture material is supplemented by material of
a very practical nature designed to be worked through on a PC.
We will find that, even though software may be available to perform various statistical
tasks, the interpretation of the resulting output requires some (theoretical) statistical
knowledge. Although the relevant ideas are discussed in the lecture notes, the
purpose of the tutorial sessions is to help you enhance this knowledge by making you
familiar with performing statistical computations by hand (sometimes with calculator
assistance). There is nothing quite like having to actually write down all the steps in a
calculation to see if you really understand the material.
2. Lecture Material
Roughly speaking you should find the following:
• Units 1 to 4 are relatively straightforward. This material is often described
under the heading “Descriptive Statistics”.
• Units 5 and 6 are a bit harder going, and introduce the fundamental notions of
probability and probability distributions. In particular the Binomial and Normal
distributions are discussed in some detail.
• Units 7 to 9 are more challenging, and consider material that is often described
under the heading “Inferential Statistics”. The ideas discussed build on the
concepts introduced in Units 5 and 6.
Data Collection
In this module we use secondary data, i.e. data collected by somebody other than
ourselves. In Unit 2 we consider in some detail the use of various important statistical
websites in obtaining data. Indeed one of the central themes running through the
module, and one we try to emphasise throughout, is that statistics is
“data based”; for this reason we endeavour to use “real data” in most of our
discussions and illustrations. In most examples you will find either a (web based)
reference to the data used, or the data itself in the form of an Excel file. The latter are
collected together on the module web page, and you can download them if you wish
to reproduce any of the results we quote.
Data Analysis
This is usually divided into the two categories mentioned above:
• Descriptive Statistics. Here data is analysed in one of two ways:
- Graphical Analysis. This is considered in detail in Unit 3.
- Numerical Summaries. This is considered in some detail in Unit 4.
• Inferential Statistics. This involves the more advanced techniques discussed
in Units 7 to 9 (with Units 5 and 6 providing necessary background material).
Reporting Results
This is often the most difficult part of the process, and certainly the part students
struggle with most. Whilst the most useful asset in compiling reports is experience, we
provide several opportunities for you to develop your report writing skills. Specifically:
• In Practical 1 (see below) you are asked to download various articles and,
where appropriate, look at the graphical and numerical measures the authors
use to summarise their own findings.
• In Tutorial 1 (see below) you are asked to summarise the findings of other
authors considered in Practical 1.
• In Assessment 1 (see below) you are required to produce a written summary of
a given paper (dealing with purchasing power parity).
• In Assessment 2 you are required to produce a report summarising your own
investigations (into the movement of stock prices).
The skills you develop in the course of these tasks should help in your later
dissertation work.
3. Practical Material
Working through the practical material is essential in developing a “working
knowledge” of how data can be obtained and analysed. In addition you should
develop a “feel” for how data behaves, and some of the potential difficulties that can
arise.
You work through each of the practical units and complete the tasks as indicated.
You should be aware that, because websites are, by their very nature, subject to
amendment (updating), the screen views provided may not be precisely the same as
you obtain. At the end of each unit you will find some practical exercises which you
should try and find time to work through; these will test your understanding of the
material in the unit. Further (optional) exercises can be found on the module web
page, allowing you to extend your knowledge if you wish.
Familiarity with the practical material will be required when you attempt the two
assessments for the module.
4. Tutorial Material
Working through this tutorial material will give you practice in performing statistical
calculations by hand, usually on “small” data sets. This will help you both in
understanding computer (Excel) output and in producing, and writing down, logical
arguments leading to a specific conclusion. Both of these will be useful in the
assessments you need to undertake, especially the second one (see below).
Further (optional) tutorial questions can be found on the module web page, giving
further practice or allowing you to extend your knowledge if you wish.
5. Assessment Material
The module has two assessments. Precise details of, and guidelines for, the
assessments will be made available at the appropriate time during the course.
Roughly speaking Assessment 1 will cover Units 1-4 in the lecture notes, and
Assessment 2 Units 5-9.
You need to successfully complete the assessments in order to pass the module.
1 Introduction
Learning Outcomes
1. Overview
In this introductory unit we briefly look at various motivating factors for the topics we
shall study. Statistics is often broadly defined as the “Analysis of data” or, more
succinctly as “Data Analysis”. Closely connected to this is the concept of Probability,
which you may have encountered in the context of assessing Risk. (You will
probably have some general idea of these concepts, although no detailed knowledge
is assumed.) There are various questions that might spring to mind in relation to
these terms:
Some Questions
Question 11: How can we analyse the risks present in any particular situation?
From this brief extract we can give some (very partial) answers to a few questions:
(Partial) Answer 1: Bank of England (and building societies and others) collect data
for various reasons:
• Produce surveys (Derivatives and Forex - foreign exchange) for BIS (Bank for
International Settlements).
• Produce reports (e.g. Inflation).
• Monitor financial/banking sector.
• Make data (widely) available to others (general public, ONS – Office for
National Statistics).
• Link to regulatory bodies, e.g. FSA (Financial Services Authority).
(Partial) Answer 4: Websites are probably now the most convenient data sources.
• BoE has interactive databases and (monthly) on-line publications.
• ONS is an extensive resource for “general” statistical information.
• International data is available from the ECB (and many others).
The remaining questions are not addressed in the above brief extract from BoE.
Partial answers can be found in the following references, which you are asked to
download in Practical 1. You should look through each of these papers during the
next few weeks, even if you do not understand all the technical details. In particular,
look at the Barwell paper (see below) in preparation for Tutorial 1.
Note: Copies of some papers, data sets and Excel spreadsheets can be found on the
module web page.
2. Analysis in Context
In a financial context we would like to address problems such as the following:
Comments: Although it looks like we should take the firm’s advice it is not clear what
the chances of success really are.
• The firm seems to predict events with greater accuracy than just guessing.
Does the latter correspond to 50% accuracy?
• However, which percentage do we believe? After all, we do not know in advance
whether our investments will go up or not; we have only the firm’s forecast.
• Suppose the firm is very good at predicting “small” price rises, but is very poor
at predicting “large” price drops. If we follow its forecasts we might make a lot of
small gains, but also experience large losses which the firm could not forecast.
(Unfortunately the losses may well greatly outweigh the gains.)
• Maybe we should take the average of the two percentages? And is 65% that
much above 50%?
• We really need
- some means of sorting out the “logic” of the situation, and
- some means of performing the necessary computations.
• Although Problem 4 may seem a little artificial, this is precisely the type of
gamble investors take, albeit without really knowing their potential returns, and
the probabilities of achieving them.
Example 3.1: I have a (well mixed) bag of 100 counters, 25 red and 75 black. You
are invited to play the following game, with an entrance fee of £50. A counter is
picked at random. If the counter is black you win £100, otherwise you win nothing.
Should you play?
Comments 1: The risk here is (of course) that we do not win. There are several
important ideas that are relevant here:
• How likely are we to win (or lose)? If winning is “unlikely” we don’t really want to
be playing.
• How do we measure the “cut-off” between playing and not playing?
• How much can we expect to win?
• How risky is the game?
• How much risk are we prepared to accept? (What is an acceptable risk, and
what is unacceptable?)
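One of the questions above (“How much can we expect to win?”) can already be answered with a little arithmetic. Expected values are developed properly in Unit 5; the following is only a minimal sketch, written in Python for readers who like to check such calculations on a computer, using the numbers given in Example 3.1.

```python
# Expected winnings for the game of Example 3.1: 100 counters, 75 black and
# 25 red; a black counter pays 100 pounds and the entrance fee is 50 pounds.
p_black = 75 / 100                         # probability of drawing a black counter
payout = 100                               # prize if the counter is black (pounds)
fee = 50                                   # entrance fee (pounds)

expected_payout = p_black * payout         # 0.75 * 100 = 75 pounds
expected_net_gain = expected_payout - fee  # 75 - 50 = 25 pounds

print(expected_payout, expected_net_gain)
```

On average the game returns £75 per play against a £50 fee, so it looks favourable; whether you actually play also depends on how you feel about the 25% chance of losing the whole fee, which is where the notion of risk comes in.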
Example 3.2: I have a (well mixed) bag of 100 counters, some red and some black.
You are invited to play the following game, with an entrance fee of £50.
• A counter is picked at random.
• If the counter is black you win £100, otherwise you win nothing.
Should you play?
Comments 2: Here we just do not know, i.e. we do not have enough information
to make a rational choice.
• The uncertainty in Example 3.1 is measurable. We are not sure what colour will
be chosen but, as we discuss in Unit 5, we can assign (measure) likelihoods
(probabilities) to all (two) possibilities. Most games of chance (gambling) are like
this.
• The uncertainty in Example 3.2 is unmeasurable. We cannot make any sensible
predictions about what will happen; if we play the second game we are “leaving
everything up to fate”. We will either win or not, but we cannot judge beforehand
which is the more likely, and hence cannot assess the inherent risk involved.
• In practice we are often somewhere between these two situations. We have
some knowledge, but not (nearly) as much as we would like. We may need to
estimate certain quantities, and this will lead to increased (but still quantifiable)
uncertainty.
• The idea of investors behaving in a rational manner, after having considered all
the available information, is a core assumption in much of the economic and
financial literature. In practice this is frequently not the case, and behavioural
risk refers to the risks resulting from this non-rational behaviour. The field of
behavioural finance has grown up to explain how psychological factors
influence investor behaviour. For an introduction to this rapidly expanding field
see http://www.investorhome.com/psych.htm. You will encounter some
behavioural finance ideas in later modules.
Example 4.1: As this section of the notes is being revised (June 1 2009) General
Electric stock is at $13.48; this is the value quoted on NYSE (New York Stock
Exchange). What will the stock value be tomorrow (June 2 2009)?
Comments: As it stands this is unmeasurable (today).
• Indeed, with this perspective, much of finance would lie in the realm of
unmeasurable uncertainty, and no reliable predictions (forecasts) can be
made.
• To make any kind of progress we need to assume some kind of “predictable”
behaviour for the stock price, so we can use the price today to estimate the
value tomorrow.
• It is usual to formalise this procedure into the term “model”, and to say that we
“model the stock price movements”. Precisely which model we choose is still a
matter of some debate, and we shall look at some of the possibilities later.
• But the important point is that we need to select some type of model in order to
remove the “unmeasurable uncertainty” and replace it by “measurable
uncertainty”. (Precisely what we mean by this phrase will only become clear
once we have worked through most of the module!)
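As a purely illustrative example of what “choosing a model” might mean, the short Python sketch below simulates tomorrow’s price as today’s price multiplied by one plus a small random return. This is a common textbook device, not necessarily the model adopted later in the module, and the 1% daily volatility figure is an arbitrary assumption.

```python
import random

# Illustrative only: start from the $13.48 GE quote mentioned above and assume
# daily returns are normally distributed with mean 0 and an (assumed) 1% spread.
price_today = 13.48
daily_volatility = 0.01

random.seed(1)  # so the illustration is reproducible
simulated_tomorrows = [
    round(price_today * (1 + random.gauss(0, daily_volatility)), 2)
    for _ in range(5)
]
print(simulated_tomorrows)
```

Once some such model is assumed, the unmeasurable question “what will the price be tomorrow?” becomes the measurable question “how widely spread are the plausible values?”.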
5. Using Statistics
The word “statistics” derives from the word “state”, a body of people existing in social
union, and its original 18th century meaning was “a bringing together of those facts
illustrating the condition and prospect of society”. Just as the word “risk” has many
connotations, so too does the modern usage of the term “statistics”. For example:
• Business statistics. The science of good decision making in the face of
uncertainty. Used in many disciplines such as financial analysis, econometrics,
auditing, production/operations and marketing research.
• Economic statistics. Focuses on the collection, processing, compilation and
dissemination of statistics concerning the economy of a region, country or group
of countries. This in itself is often subdivided into various groupings such as
- Agriculture, Fishing & Forestry
- Commerce, Energy & Industry
- Labour Market
- Natural & Built Environment
- Social & Welfare
• Crime & Justice. Statistics relating to crime and arrest, criminal justice and law
enforcement data, prisons, drugs, demographic trends and so on.
Obviously this list could be extended considerably, and there are clearly connections
between the various groupings. But the important point is that
Terminology We shall use the term “finance” in a very wide sense to include
• Financial Institutions and Financial Services
• Corporate Finance
• Econometrics (Financial Economics)
• Financial Accounting
• Mathematical Finance
Examples taken from all of these areas will appear at various stages.
Our interest will largely be in “finance related” areas where very large quantities of
statistical information have been gathered over the years. Indeed the rate at which
information is collated is increasing (rapidly) with time. We would clearly like to
make use of some of this information (since it was deemed important enough to
collect in the first place!).
Example 5.1: General Electric stock prices for the past six months are as shown in
Table 5.1. What will be its value in July 2009?
Table 5.1: General Electric monthly closing stock price, January 2009 to May 2009
We shall not try and solve this type of problem until much later. At present you might
like to consider
• “How much” information would you expect historic prices to contain?
- This bears directly on the famous Efficient Market Hypothesis (EMH).
- Is 6 months data enough to assess past trends? Just how far back in time
should we go?
• “How accurately” would you expect to be able to forecast future prices? (Would
we be happy to be able to predict prices to, say, within 10%?)
• How much money would you be prepared to invest in your predictions?
- Using monetary values is a good way to assess the “subjective
probabilities” you may have on particular stock movements.
- What you are prepared to invest will also reflect your “risk taking” capacity;
the more risk you are prepared to take the more (money) you will be willing
to invest.
Summary
All you could ever want to know about corporate finance. More for reference and
long term study.
• Ferguson, N. (2008): The Ascent of Money. A Financial History of the World.
London, Allen Lane.
Provides some very interesting background material to help understand some of the
benefits of, and problems with, modern finance.
• Schrage, M. (2003). Daniel Kahneman: The Thought Leader Interview,
Strategy+Business website. Available from www.strategy-business.com
(accessed 1st. June 2009).
For ease of reference we also list the various papers mentioned in the unit:
• Bahra, B. (1996) Probability distributions of future asset prices implied by option
prices, Bank of England Quarterly Bulletin (August).
• Barwell, R., May, O., Pezzini, S. (2006) The distribution of assets, incomes and
liabilities across UK households: results from the 2005 NMG research survey,
Bank of England Quarterly Bulletin (Spring).
• Boyle, M. (1996) Reporting panel selection and the cost effectiveness of
statistical reporting.
• Bunn, P. (2005) Stress testing as a tool for estimating systematic risk, Financial
Stability Review (June).
• Sentance, A. (2008) How big is the risk of recession? Speech given to Devon
and Cornwall Business Council.
Learning Outcomes
At the end of this unit you should be familiar with the following:
• How data is collected.
• Important data sources.
• The basic types of data.
• The accuracy inherent in the data.
• How to present data
- Meaningfully
- Unambiguously
- Efficiently
“One hallmark of the statistically conscious investigator is their firm belief that
however the survey, experiment, or observational program actually turned out, it
could have turned out somewhat differently. Holding such a belief and taking
appropriate actions make effective use of data possible. We need not always
ask explicitly "How much differently?" but we should be aware of such
questions. Most of us find uncertainty uncomfortable ... (but) ... each of us who
deals with the analysis and interpretation of data must learn to cope with
uncertainty.”
Frederick Mosteller and John Tukey, Data Analysis, Including Statistics (1968)
1. Guiding Principles
• Whenever we discuss information we must also discuss its accuracy.
• Effective display of data must satisfy the following criteria:
1. Remind us that the data being displayed do contain some uncertainty.
2. Characterise the size of that uncertainty as it pertains to the inferences
(conclusions) we have in mind.
3. Help keep us from drawing incorrect conclusions (in 2) through the lack of
a full appreciation of the precision of our knowledge.
• Here we look at the numeric display of data, in Unit 3 at its graphical display,
and in many of the remaining units at the inferences that can (and cannot) be
drawn from the data.
• Central to all of this is the accuracy we can assign to the procedures we
undertake (whether it be data collection or data analysis).
2. Data Sources
In the past, unless you collected your own data, there were relatively few sources of
data available. However, with the advent of modern computing techniques, and in
particular the emergence of the Internet (Web), this has all radically changed.
Here we give a very brief list of some (generally free) sources we have found useful.
You should look to compile your own list of websites, and you should regard the list
merely as a starting point.
Bank Websites
• Bank of England www.bankofengland.co.uk
• Federal Reserve Bank of St. Louis http://stlouisfed.org/default.cfm
- One of the websites for the FED (Federal Reserve System), containing
material under the general areas of Banking, Consumers, Economic
Research, Financial Services, Education and Publications.
- Gives access to FRED (Federal Reserve Economic Database) containing
about 20,000 economic time series in Excel format.
• Liber8 An economic information portal for librarians and students, closely
linked with the Federal Reserve Bank of St. Louis.
- Gives access to many economic databases at an international, national or
regional level. Sometimes more easily accessible than the St. Louis FRB.
- Many economic indicators available.
- Access to further US databases such as the Bureau of Labor Statistics,
the Bureau of the Census, etc.
Statistics Websites
• UK Statistics Authority http://www.statistics.gov.uk/
- Responsible for promoting, and safeguarding the quality of, official UK
statistics.
- Has links to other government sites (Revenue & Customs, Crime &
Justice, Health & Care and ONS).
Finance Websites
• Yahoo Finance http://finance.yahoo.com/
- Extensive website with data available in many forms (current and historic
prices, charts). Historic data is free!
- Interesting background information on thousands of companies.
- Key statistics including a large volume of accounting based information.
- Separate sites devoted to major stock markets (US, UK, Singapore, India,
Hong Kong and so on). However there is less information available for
some markets; see http://www.eoddata.com/ for more details.
- Option information on company stock.
• Investopedia http://www.investopedia.com/ (Free registration)
- Extensive financial dictionary, articles and tutorials.
• The Economics Network http://www.economicsnetwork.ac.uk/
- Clicking on Online Data (left margin) leads to a good source of online
economic data, and it may be worthwhile bookmarking this page
(http://www.economicsnetwork.ac.uk/links/data_free.htm). This contains some
of the above websites (but with different commentaries), and some new ones.
• Biz/ed http://www.bized.co.uk/
A site for students and educators in business studies, economics and
accounting (plus other topics). Important items are:
- Data section, comprising:
Time Web: an integrated package of data and learning materials, with
much useful advice on data analysis.
Key Economic Data: includes “Most commonly requested data” and
advice on its statistical analysis.
Links to ONS and Penn World Data (a valuable data source based at
Pennsylvania University).
- Company Info: data and case studies on a variety of organisations.
- Virtual Worlds: enables you to take a tour of, for example, the economy,
banks and developing countries. Go into this and see what you find!
- Reference: definitions, database links, study skills and more.
- This is a website you should have a look at sometime.
Useful Databases
• UK Data Archive (UKDA) http://www.data-archive.ac.uk
- Curator of the largest digital collection of social science and humanities data in the UK.
- Houses a variety of databases including ESDS and Census.ac.uk, the
latter giving information from the last four U.K. censuses (1971-2001).
• MIMAS http://www.mimas.ac.uk
- A nationally sponsored data centre providing the UK Higher Education
sector with access to key data and information resources.
- Purpose is to support teaching, learning and research across a wide range
of disciplines. Free to authorised (education) institutions.
Links to Websites
• FDF Financial Data Finder http://www.cob.ohio-state.edu/fin/fdf/osudata.htm
- Provides direct access to 1800 financial websites (including some of the
above), arranged alphabetically.
3. Data Types
Data divides broadly into two types: quantitative and qualitative.
4. Data Accuracy
One must always bear in mind the accuracy of the data being used but,
unfortunately, this can be difficult to assess. There are many factors influencing
(often adversely) data accuracy;
• Statistics are often merely a by-product of business and government activities,
with no “experimental design” available, as alluded to above.
• There is no incentive for companies to provide accurate statistics; indeed this is
often expensive and time consuming.
• A well documented example of these difficulties concerns energy statistics from
China. The issues involved are somewhat complex; if you are interested look at:
Sinton, J. (2000). What goes up: recent trends in China’s energy consumption,
Energy Policy 28 671-687.
Sinton, J. (2001). Accuracy and reliability of China’s energy statistics,
China Economic Review 12 (4) 373-383.
and Chow, G. (2006). Are China’s Official Statistics Reliable?
CESifo Economic Studies 52 (2) 396-414.
• You should look at LIMMD Unit 2 Section 2.3 (see Section 3.2) for a discussion
of the great lengths ESDS International goes to in assessing the quality of its
macro databanks.
• A useful source of further information (especially the Economics section) is the
following website: Ludwig von Mises Institute at http://mises.org/
Before using data always try and assess its accuracy (difficult as this may be).
5. Data Tables
Much of the material of the next few sections is adapted from Klass, G. Just Plain
Data Analysis Companion Website at http://lilt.ilstu.edu/jpda/.
We shall only present some of this material, and you should visit the website and
read further. Klass also has a good section on “Finding the data”, which supplements
Section 2. There are three general characteristics of a good tabular display:
• The table should present meaningful data.
• The data should be unambiguous.
• The table should convey ideas about the data efficiently.
Example 6.1 The crime statistics shown in Table 6.1 below have various points of
interest.
• Pure counts can be misleading. Maine appears to be a far safer place to live
than Massachusetts in that, in 2005, there were 1483 violent crimes in the
former compared to 29,644 in the latter (about 1 to 20).
- The problem with this argument is we are not comparing “like with like”
since Massachusetts has by far the larger population, and hence we would
expect a much larger incidence of violent crime.
• Rates (per “unit”) are more meaningful. To remove the dependency on
population size we could work with rates per person:
Rate per person = Raw count / Population size
or equivalently
Rate per 100,000 = Raw count / (Population size / 100,000)
This gives the “more meaningful” rates in Table 6.1 above (a small numerical check is sketched just after this bullet).
- On this basis Maine is indeed safer to live in, but only by a (2005) ratio of
112.5/460.8 ≈ 1 to 4.
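As a small check of the rate formula above (a sketch only: the Maine population figure is not read from Table 6.1 but inferred from the quoted count of 1,483 crimes and rate of 112.5):

```python
# Rate per 100,000 = raw count / (population / 100,000)
def rate_per_100k(raw_count, population):
    return raw_count / (population / 100_000)

# Maine, 2005: 1,483 violent crimes (from the text); a population of roughly
# 1.32 million is assumed here, inferred from the quoted rate of 112.5.
print(round(rate_per_100k(1483, 1_318_000), 1))   # approximately 112.5
```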
• Time comparisons In Table 6.1 we have data for two years, 2005 and 2006.
- When available, tabulations that allow for comparisons across time usually
say more about what is happening (compared to data that does not allow
such comparisons). However care is needed.
- To produce comparisons it is usual to compute percentage changes:
% Change = (Change / Original value) x 100%
“Percent” simply means “per hundred” and hence we multiply by 100. This
also has the effect of making the numbers larger and “more meaningful”,
and we measure the change relative to “what it was”, rather than “what it
has become”. This is again just a convention, but one that usually makes
most sense.
- More usefully:
% Change = ((New value - Old value) / Old value) x 100%
(a short sketch of this calculation follows this list).
- Observe that the % changes are different (albeit similar) for the raw counts
and the rates. Why should this be?
- On this basis violent crime is increasing at a much faster rate in Maine
than in Massachusetts (where it is actually decreasing). On an “absolute
level” Massachusetts is more violent, but on a “relative level” Maine is.
- Great care is needed in interpreting these latter statements. With only two
years’ data available, meaningful comparisons may not be possible. The
one year changes are subject to random fluctuations, and we do not
know how large we can expect these fluctuations to be. To obtain a more
reliable analysis we would need more data, over maybe 5 or 10 years.
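The sketch promised at the end of the list above simply encodes the percentage-change formula; the numbers fed to it are invented for illustration and are not the Table 6.1 figures.

```python
def percent_change(old_value, new_value):
    """Percentage change measured relative to the old value."""
    return (new_value - old_value) / old_value * 100

# Hypothetical counts, purely to show the calculation:
print(round(percent_change(1483, 1580), 1))    # 6.5  (an increase)
print(round(percent_change(29644, 29000), 1))  # -2.2 (a decrease)
```

The point of the function is only that the change is always measured relative to the old value, exactly as in the formula above.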
An important part of statistical analysis is trying to assess just how large we can expect
random fluctuations in the data to be, often without having access to a lot of data.
7. Presenting Data
Whether or not the data is unambiguous depends largely on the descriptive text
contained in the title, headings and notes.
• The text should clearly and precisely define each number in the table.
• The title, headings and footnotes should
- Convey the general purpose of the table.
- Explain coding, scaling and definition of the variables.
• Define relevant terms or abbreviations.
Example 7.1 The data in Table 7.1 is poorly defined, with two major difficulties:
For example, the 6.7% change reported in Table 7.1 could arise from
- 1998 (White) rate = 26.7%, and 1987 rate = 20% (1st interpretation)
- 1998 (White) rate = 21.34%, and 1987 rate = 20% (2nd interpretation)
(So if the birth rate for whites in 1987 was 20%, then what was it for whites in
1998: 21.34% or 26.7%?)
Both interpretations are valid; Table 7.1 should indicate which one is meant.
• Do not worry if you find this slightly confusing. Quoting results in percentages
causes more confusion in data interpretation than almost anything else. One
must always ask “Percentage of what?”; in Tutorial 2 you will get some practice
in computing percentages.
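The two readings of the reported 6.7% change can be made explicit with a few lines of arithmetic, using the 20% 1987 rate assumed above (a sketch only):

```python
rate_1987 = 20.0   # percent, the 1987 value assumed in the example

# Interpretation 1: a change of 6.7 percentage points.
rate_1998_points = rate_1987 + 6.7             # 26.7

# Interpretation 2: a relative increase of 6.7% of the 1987 rate.
rate_1998_relative = rate_1987 * (1 + 0.067)   # 21.34

print(rate_1998_points, rate_1998_relative)
```

The two interpretations give 26.7% and 21.34% respectively, which is precisely the ambiguity Table 7.1 needs to resolve.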
N.B. Because the first type of ambiguity occurs very frequently, and is often the
source of major confusion, we give a “graphical view” in Fig.7.1.
[Fig.7.1: “Babies born to teenage mothers” versus “Did not give birth” (1987-98)]
8.1 Sorting
• Sort data by the most meaningful variable.
Example 8.1 Table 8.1 contains average television viewing times for 16 countries.
• Here the data is sorted by the “Country” variable in alphabetic order.
• Usually this is a poor way to sort data since the variable of interest is really the
(2003) viewing times. For example, we would like to know which country
contains the most frequent viewers.
• Current software allows data to be sorted very easily in a variety of ways. In
Table 8.2 we have sorted from largest to smallest (hours viewing).
Table 8.1: Data sorted by country Table 8.2: Data sorted by viewing times
• This table gives us a much better idea of what the “average” viewing time is. We
shall examine these ideas in more detail in Units 3 and 4. What would you
conclude from Table 8.2 about average viewing times?
• But note that, if we have data for more years (and more countries) available, the
situation becomes a bit trickier! In Table 8.3 we cannot sort all columns at the
same time – why?
- If we do want to sort the data we must select a variable to sort on, i.e. the
“most meaningful” variable.
- We may consider the latest year (2005) as the most important and sort on
this. We would then essentially be using 2005 as a “base year” on which
to make subsequent comparisons (as we shall do in Units 3 and 4).
Results are shown in Table 8.4 (using Excel).
Note that missing values are placed at the top. Would it be better to
place them according to their average values over all the years? Or
would they be better placed at the bottom, i.e. out of the way?
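The same sorting decisions arise outside Excel. The sketch below uses the pandas library on a made-up fragment of viewing-time data (the column names and values are illustrative, not the Table 8.3 figures); note the explicit choice of where the missing value is placed.

```python
import pandas as pd

# Illustrative data only; the real figures are in Tables 8.1-8.4.
df = pd.DataFrame({
    "Country":   ["A", "B", "C", "D"],
    "Hours2003": [3.5, 2.1, 4.0, 2.8],
    "Hours2005": [3.7, None, 4.2, 2.6],   # one missing value
})

# Sort on the "most meaningful" variable (here the latest year), largest first,
# pushing the missing value to the bottom rather than the top.
print(df.sort_values("Hours2005", ascending=False, na_position="last"))
```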
Example 8.2 Table 8.6 contains population data taken from Wikipedia
(http://en.wikipedia.org/wiki/World_population) in two forms;
• The raw data is given to the nearest million.
• The percentage data is given to one decimal place (1dp). Why 1dp?
For example, a population quoted as 106 (to the nearest million) could lie anywhere
between 105.5 and 106.5, so the corresponding percentage of the 1750 total of 791 lies in
(105.5/791 x 100%, 106.5/791 x 100%) = (13.3375%, 13.46397%), i.e. (13.3, 13.5)% when
rounded to 1dp. The quoted value of 13.4% is thus not accurate and, strictly
speaking, the result should only be given (consistently) to the nearest integer.
MORAL
• In practice we usually live with slight inaccuracies in our data, and hence in
values computed from them.
• But this does mean we usually cannot quote high accuracy for any subsequent
computations in the data. For example, computing the average population (in
1750) over the 6 regions gives (see Unit 4 for further discussion)
Average population = (106 + 502 + 163 + 16 + 2 + 2) / 6 = 791 / 6 = 131.83333333
However we do not believe the (implied) accuracy of 8 decimal places. How
many places can we legitimately quote?
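The rounding argument can be checked directly. The sketch below recomputes the interval for the percentage and the worst-case error of the average, using the figures quoted above:

```python
# Each regional 1750 population is only correct to the nearest million, so a
# quoted value of 106 really means "somewhere between 105.5 and 106.5".
total = 791                                   # 1750 total (millions), as summed above

low_pct = 105.5 / total * 100
high_pct = 106.5 / total * 100
print(round(low_pct, 2), round(high_pct, 2))  # about 13.34 to 13.46

# The average of the six 1750 figures, each uncertain by up to +/- 0.5 million;
# in the worst case the average inherits the full +/- 0.5 error.
values = [106, 502, 163, 16, 2, 2]
average = sum(values) / len(values)
print(round(average, 4), "+/- 0.5")
```

On this reasoning the average is only worth quoting to about the nearest whole number; anything beyond that gives a false impression of accuracy.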
9. A Practical Illustration
It is instructive to see how the ONS actually tabulates the data it collects.
Data Source We will look at unemployment statistics taken from the ONS website;
look at Practical Unit 2 for details of the latter.
• Go to UK Snapshot, Labour Market and Latest on Employment & Earnings;
you should obtain something like Fig.9.1. (If you do not a slightly different search
may be required!)
• For general information click on the Related Links at the R.H.S. of Fig.9.1. In
particular individual data series are available for various (user specified) time
periods – see Table 9.1 below.
• Click on the (Labour Market Statistics) First Release option; Fig.9.2 results.
Now choose March 2009 to obtain the document depicted in Fig.9.3.
Fig.9.2: First Release Statistics option Fig.9.3: First Release Document (pdf)
Structure of Data Table The table is formed from parts of many ONS data sets,
those used being indicated by a four letter code in the rows labelled People. For
example, MGSC refers to All UK Unemployed Aged 16+, with the additional
information 000s (data measured in thousands); SA (seasonally adjusted); Annual
= 4-quarter average (annual figures found by averaging over quarterly figures).
All these series can be downloaded via the Labour Market Statistics Time Series
Data option in Fig.9.1.
• The table is structured roughly as shown in Table 9.2, and allows four variables
to be tabulated (in a two dimensional table):
- Ages of unemployed (in the ranges 16-17, 18-24, 25-49, 50 and over and
two “cumulative” categories, 16 and over and 16-59/64). You may like to
think why these particular age divisions are chosen.
- Length of unemployment (in the ranges Up to 6 months, 6-12 months,
over 12 months and over 24 months). In addition two rates (percentages)
are given. Again you may think about these particular ranges and rates.
- Gender (in the categories All, Male and Female). Why is the “All” category
given, since this must be the sum of the males and females?
- Time (divided into quarterly periods).
• The general observation is that the ONS Table 9.1, although quite large, does
contain a considerable amount of data in a very compact form, i.e. the data is
presented efficiently.
Note: The data structure shown in Table 9.2 is not always the most convenient,
depending on the question(s) we are trying to answer. We shall return to Table 9.1 in
Unit 6.
It is obvious that no two people are exactly the same in all respects, that no two
apples are identical in size and shape and so on. However, it is quite surprising
how people will quite happily convince themselves of certain 'facts' on the basis
of very little (and quite often biased) evidence. If you visit a place once in the
cold and wet it is difficult to imagine that it is ever nice there. If you ask one
group of people their opinion on some topic you may be convinced that most
people think in a particular way but if you had asked another group you may
have got an entirely different impression. This is the main problem encountered
in data collection. The people or things you are interested in are all different and
yet somehow you need to get sufficiently accurate information to make a sound
decision.
2. Surveys
Surveys fall broadly into two categories: those in which questions are asked and
those where the data is obtained by measurement or direct observation.
• The first type is used extensively to get information about people and this
may include both factual information and opinions.
• The other type of survey is used in many other areas such as land use
surveys, pollution monitoring, process quality control and invoice checking.
In both cases a distinction must be made between those situations where data
can be collected on everyone or everything of interest and those where that is
impossible. The first situation, which is comparatively rare, is called a census.
There are no real problems analysing data of this sort from the statistical view-
point as potentially complete information is available. Most data, however, are
not of this type. In real life it usually takes too long or costs too much to collect
data on all the individuals of interest. In a business organisation decisions
usually have to be made quickly. Although the Government carries out a census of
people in the UK once every ten years, by the time the full analysis is complete
much of the information is out of date.
In some situations it is impossible to carry out a complete survey. The only way
to test the strength or durability of certain components is to use destructive
testing. For example, to see how long a new type of light bulb lasts requires
switching some on and recording the times to failure. It would not be very
profitable if all the light bulbs manufactured had to be tested!
In practice then you are likely to want information about a large group of people
or things (the population) but you are restricted to collecting data from a
smaller group (the sample). As soon as you are in this situation there is no
possibility of getting completely accurate information. The best you can hope for
is that the information contained in the sample is not misleading. In order to plan
the survey properly you need some idea of how accurate a quantity, calculated from
this sample, is likely to be. This will be discussed further in Unit 4.
Systematic sampling Although simple random sampling is intuitively
appealing, it can be a laborious task selecting, say, 400 numbers at random and
then matching them to a list of 8000 names. It would be a lot quicker just to pick
the first individual at random from the first 20 names and then subsequently pick
the 20th name on the list. This method is called systematic sampling and
approximates to simple random sampling as long as the list is not constructed in
a way that might affect the results (for instance if the list of names is organised
by date of birth). This method is particularly convenient for lists stored on
separate cards or invoices filed in a drawer.
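A minimal sketch of the scheme just described, using the same figures (400 names from a list of 8,000, so a sampling interval of 20); the list itself is just a placeholder.

```python
import random

population_size = 8000
sample_size = 400
interval = population_size // sample_size     # 20

# Pick a random starting point among the first 20 positions, then take every
# 20th position after that.
start = random.randrange(interval)            # 0, 1, ..., 19
selected_positions = list(range(start, population_size, interval))

print(len(selected_positions))                # 400 positions selected
print(selected_positions[:5])                 # e.g. [7, 27, 47, 67, 87]
```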
Stratified sampling In many populations there are natural groupings where
people or things within a group tend to be more similar in some respects than
people or things from different groups. Such a population is called a stratified
population. One aim of a survey is to get the estimates as precise as possible.
This suggests that it might be more efficient to sample within each group, and
then pool the results, than to carry out a simple random sample of the whole
population where just by chance some groups may get unfairly represented.
There might also be advantages from an administrative point of view. If, for
example, a survey were to be carried out amongst the workforces of several
engineering companies, then it is easier to choose a random sample within
each company than to get a random sample of the combined workforce. It can
be shown mathematically that stratified random sampling, in general, gives
better estimates for the same total sample size than simple random sampling as
long as the variability within the strata is relatively small and the number of
individuals chosen in each stratum is proportional to the size of the stratum.
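A sketch of the proportional allocation just mentioned, using invented stratum sizes (three companies' workforces) rather than any real survey figures:

```python
# Allocate a total sample of 400 across strata in proportion to stratum size.
strata_sizes = {"Company A": 5000, "Company B": 2000, "Company C": 1000}
total_sample = 400
population = sum(strata_sizes.values())

allocation = {name: round(total_sample * size / population)
              for name, size in strata_sizes.items()}
print(allocation)   # {'Company A': 250, 'Company B': 100, 'Company C': 50}
```

Within each stratum the individuals would then be chosen by simple random (or systematic) sampling.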
Cluster sampling In some situations there may be a large number of groups of
individuals; for example, the workforces of small light engineering companies or
small rural communities. From an administrative point of view it is much easier
to select the groups at random and then look at all individuals in the selected
groups than to select individuals from all groups. Such a method is called
cluster sampling. Cluster sampling works best when the groups are similar in
nature so that it does not really matter which ones are included in the sample
because each cluster is like a mini-population. This method does, however, get
used in other situations where the saving in costs, made by restricting the
number of groups visited, is thought to outweigh any loss in statistical
efficiency.
Multi-stage sampling Most large scale surveys combine different sampling
methods. For example a drinking survey organised by the Office of Population
11. References
• Chow, G. (2006). Are China’s Official Statistics Reliable? CESifo Economic
Studies 52 (2), 396-414.
• Sinton, J. (2000). What goes up: recent trends in China’s energy consumption,
Energy Policy 28, 671-687.
• Sinton, J. (2001). Accuracy and reliability of China’s energy statistics, China
Economic Review 12 (4), 373-383.
Learning Outcomes
At the end of this unit you should be familiar with the following:
• General principles of a graphic display.
• The various components of a chart, and their importance.
• The various chart types available.
• When graphic displays are inadequate.
• The fact that a given dataset can be graphed in different ways to give
different visual impressions.
1. Guiding Principles
• Graphical displays are intended to give an immediate visual impression. This
depends on two factors:
- The precise variables that are graphed.
- The form of the graphic chosen.
• The intention should be to not mislead.
• Ideally a (graphical) chart should convey ideas about the data that would not be
readily apparent if displayed in a table (or described as text). If this is not the
case rethink whether a graphic is needed.
Reference: Much of the material of the next few sections is again adapted from
Klass, G. Just Plain Data Analysis Companion Website at http://lilt.ilstu.edu/jpda/
Fig.3.1 illustrates these ideas in the context of an Excel bar chart. You should be
familiar with how charts are constructed in Excel, how the data is set up, and the
labels and scales defined.
Title This should be used only to define the data series used.
• Do not impose a data interpretation on the reader. For example, a title like
“Rapid Increase of RPI and CPI” should be avoided.
• Specify units of measurement if appropriate, either
- at end of title (after a colon : ), or
- in parentheses in a subtitle (“constant dollars”, “% of GDP”, …)
(What are the units of RPI and CPI? See Practical 2.)
Legends These are required if two, or more, series are graphed (in order to identify
the various series!).
• Legends can be placed at the side (default in Excel) of a chart, or at the top or
bottom of the chart if you wish to increase the plot size.
• Legends are not necessary for a single series, so delete them if they appear (as
they do in Excel). This increases the plot area.
• Legends should be brief. Excel supplies them automatically, but does allow you
to alter them.
The Excel spreadsheet shown in Table 4.1 below shows U.S. defence spending
(column B) and various other quantities that can be used as a divisor. Below we give
five line (time series) plots of defence spending plotted against some of the
remaining columns in Table 4.1.
• Each represents a valid presentation of the data, BUT
• Depending on which divisor is used one could conclude that defence spending
is
- Steadily increasing (Fig.4.1 and Fig.4.4 [lower line] and Fig.4.5)
- Dramatically increasing (Fig. 4.1 and Fig.4.4)
- Steadily decreasing (Fig.4.2, and look at vertical scale)
- Dramatically decreasing (Fig.4.3, and look at vertical scale)
- Holding relatively constant (Fig.4.4)
You might like to consider how all this is possible! Remember that graphs can be
subjective, and you must carefully examine scales and appreciate the importance of
relative (percentage) and absolute value changes. Look again at Figs.4.2 and 4.3.
Brief Explanations If you are unsure of the rationale for these calculations, the following points (and the short sketch after them) may help:
• We need to standardise data in order to account for differences in
- populations, and prices, both
- at different times, and in different parts of the world (geographic location)
• Because the CPI is based on the prices of a market basket of goods that
consumers typically purchase, it is not a good deflator for a measure of
aggregate government spending.
• A better measure is often taken to be GDP (Gross Domestic Product), and %
GDP is used as a deflator for government revenue, and spending, indicators.
• The GDP deflator is used to construct the constant dollar price measure in
order to eliminate (or at least minimise) the problem of changing prices (over
time) in calculating total (defence) expenditure.
• In general one can expect GDP to increase (over time) faster than inflation.
Dividing a measure of government spending by GDP will therefore produce a
lower growth rate than if an inflation measure were used as divisor.
• To account for changing population figures, per capita (per person) values are
often quoted by dividing the relevant measure (say GDP) by the population size.
Such figures are available in a variety of places, with the Census being the
source of the most up to date information. Adjusted figures are often later
corrected for inaccuracies in the population count once new census figures
become available. (Revision of figures is commonly done with economic figures
in general, and government data in particular.)
• One problem with % GDP measures is that time series trends often fluctuate
more because of a country’s changing GDP than changes in the numerator
(here defence spending).
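The sketch referred to above puts the three standardisations (constant dollars via a deflator, % of GDP, and per capita values) into explicit arithmetic; the numbers are invented solely for illustration and bear no relation to Table 4.1.

```python
# Hypothetical figures, for illustration only (not the Table 4.1 data).
nominal_spending = 600.0      # defence spending, billions of current dollars
gdp = 14000.0                 # nominal GDP, billions of current dollars
gdp_deflator = 1.20           # price level relative to the chosen base year
population = 300_000_000      # population size

constant_dollar_spending = nominal_spending / gdp_deflator   # "real" spending
share_of_gdp = nominal_spending / gdp * 100                  # % of GDP
per_capita_spending = nominal_spending * 1e9 / population    # dollars per person

print(round(constant_dollar_spending, 1))   # 500.0 billions of base-year dollars
print(round(share_of_gdp, 2))               # 4.29 % of GDP
print(round(per_capita_spending, 0))        # 2000.0 dollars per person
```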
You may find these a bit too laboured, and the advice may not be precisely the same
as we give, but they may be worth a look. Some hand computations for the more
important charts will be considered in Tutorial 3.
• Bar charts often contain little data, a lot of ink, and rarely reveal ideas that
cannot be presented more simply in a table.
• Never use a 3D (three dimensional) bar chart.
But you only have to look at some of the papers cited in previous units to see how
popular bar charts are.
Exercise: Check this out. In particular look at Barwell, R., May, O., Pezzini, S. (2006)
The distribution of assets, incomes and liabilities across UK households: results from
the 2005 NMG research survey, Bank of England Quarterly Bulletin (Spring).
The following examples are designed to illustrate some of the good, and bad,
features of bar charts.
Comments A bar chart typically displays the relationship between one, or more,
categorical variables. In Fig.5.1 the two variables are Country (with 17 values) and
“Age Status” (taking two values: child or elderly).
• One variable (Country) is plotted on the y-axis, and the second variable (“Age
Status”) is accommodated by employing a multiple bar chart, with multiple bars
(here two) for each value of the first (Country) variable.
• The lengths of the bars, measured on the x-axis scale, quantify the relationship.
• We can quickly grasp the main point from Fig.5.1:
- The U.S. has the highest child poverty rate amongst developed nations.
• There are a few subsidiary points depicted in Fig.5.1:
- In many countries there tends to be a substantial difference between child
and elderly poverty (France, Germany, Italy and the U.S. are the
exceptions).
- The three countries with the lowest child poverty are Scandinavian.
- Five of the seven countries with the highest child poverty are European.
• Note that we can easily make these latter conclusions since the data is sorted
on the most significant variable. (Why is child poverty regarded as the more
important variable?)
• Data sorting is easily done in Excel using Data > Sort. With the data of Fig.5.1
we would sort on the Children column in ascending order.
• It may not be clear to you exactly what is being plotted in Fig.5.1. What is the
meaning of the phrase “% living in families below 50% of median family
income”?
We shall look at the median in Unit 4.
• Column charts. These are really just bar charts with vertical bars, and come in
the same four types as the bar charts above.
Data Source The data depicted in Table 5.2 (Distribution of Educational Staff) no
longer appears to be collected by OECD. This demonstrates that contents of web
pages do change. If you want specific information download it while you can,
assuming this is permitted.
Go to http://www.sourceoecd.org/ and search for this data; if you cannot find it then
use the Excel file EducationStaff.xls on the module web page for the data.
Fig.5.2b: Stacked bar chart with columns B and C (in Table 5.1) interchanged.
Search for the data in Table 5.3. If you cannot find it both the data, and the chart of
Fig.5.3, are available in GovReceipts.xls on the module web page.
Comments The categories in Table 5.3 are nominal rather than ordinal, i.e. there is
no implicit order to the various categories (income taxes, corporation taxes, etc.);
refer back to Unit 2 Section 2.3.
• In such a case we cannot stack the categories in any order of importance.
• In Fig.5.3 we have placed the categories in decreasing numerical order, from
bottom to top. But even here it is difficult to distinguish the differences in size of
the upper components of the chart. (Is “Other” bigger in 2000 or 2007?)
• Similar problems occur with Stacked line charts, and Area charts. You may like
to investigate these charts, and the difficulties that can occur.
• What are the units in the receipts of Table 5.3?
5.2. Histogram
Although a very simple type of chart, the histogram is, from a theoretical point of
view, the most important of all graphical representations. The reason for this relates
to the concept of a probability distribution, examined in Unit 5.
Data Source A great deal of useful information concerning stock price movements
can be found on the Yahoo! Finance website at http://finance.yahoo.com/. Here we
shall just use stock data taken directly from Yahoo; in Table 5.4(a) we have monthly
(closing) stock prices, from September 1984 through to February 2008, for the U.S.
computer company Apple. See the file AppleStock_25Y.xls.
Data Manipulation Note the stock data is continuous and quantitative, in contrast
to the integer-valued and categorical data from which our previous bar charts were
constructed.
• From our raw data we construct a frequency table, as in Table 5.4(b), indicating
the number (or frequency) of stock prices which fall within each of the indicated
intervals.
• This frequency table is constructed in Excel (see Practical 3 for some details),
but requires a little explanation.
- The intervals are termed bins.
- The upper limit of each bin is given in Table 5.4(b). Thus the frequency
opposite the bin value 13 indicates there is just one stock price below $13,
73 stock prices in the range $13 - $25 and so on.
• It is possible (in Excel) for the user to specify “more appropriate” intervals, such
as 0-10, 10-20, 20-30 and so on. This is frequently very convenient.
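The same frequency table can be produced outside Excel. The sketch below uses the numpy library with a handful of made-up prices and the kind of “more appropriate” bins just mentioned (0-10, 10-20, and so on); the figures are not the Apple data of Table 5.4.

```python
import numpy as np

# A few illustrative monthly closing prices (not the real Apple series).
prices = [8.5, 12.3, 14.1, 18.9, 22.4, 25.0, 27.8, 31.2, 15.5, 19.9]

# User-specified bin edges: 0-10, 10-20, 20-30, 30-40.
bin_edges = [0, 10, 20, 30, 40]
frequencies, edges = np.histogram(prices, bins=bin_edges)

for left, right, freq in zip(edges[:-1], edges[1:], frequencies):
    print(f"{left}-{right}: {freq}")
```

Each frequency is the number of prices falling in the corresponding bin, exactly as in Table 5.4(b); the histogram is then just one bar per bin whose area represents that frequency.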
Histogram Construction The histograms of Fig.5.4 are constructed directly from the
frequency table with
• bar areas representing the appropriate frequencies, and
• bar widths corresponding to the bin widths.
• In Excel the default setting is to have the bars separated from each other as in
Fig.5.4 (a), but the more common approach is to have no gap as in Fig.5.4(b).
• The difference has to do with whether the underlying variable (here stock price)
is discrete or continuous. See Unit 2 Section 2.3 for a discussion.
• Excel Implementation Histograms are best constructed through Data, Data
Analysis and then Histogram. You may need to use Options and Add Ins to
get the Data Analysis software installed. See Practical 3 for some details.
Fig.5.4: (a) Price histogram (with gaps) (b) Price histogram (no gaps)
Comments It is very important to appreciate that it is the areas of the bars which are
proportional to the frequencies (for reasons we discuss in Unit 5).
• If the bins are of equal width then the bar areas are proportional to the bar
heights, and the heights equally well represent the frequencies.
• If the bins are of unequal width adjustment of the bar heights is necessary to
keep the bar areas in proportion.
- See Tutorial 3 for examples of hand calculations involving unequal widths,
and Section 6 for what happens if these adjustments are not made.
- The adjustments are made automatically in Excel. In Fig.5.4 the bins are,
for all practical purposes, of equal width, although this may not seem to be
precisely the case from Table 5.4(b). (The explanation is rounding.)
• What is important in Fig.5.4 is the “overall shape” of the histogram. We would,
for example, be interested in knowing how much of the time the stock is below a
certain (average?) level, and this is influenced by more than a single bar. By
contrast in Fig.5.1, for example, we are more interested in individual bars
(categories), and comparisons between them. (This highlights another important
distinction between categorical and continuous data.)
• Statistics is full of “odd sounding” terms, such as histogram. If you wonder how
these words originated you may care to look at the website
Probability and Statistics on the Earliest Uses Pages available at
http://www.economics.soton.ac.uk/staff/aldrich/Probability%20Earliest%20Uses.htm
From here you should be able to track down the original meaning of a particular
statistical term. For example, histogram derives from the Greek
histos (anything set upright) and gramma (drawing)
But, as was the case with bar charts, you only have to look at some of the papers
cited in previous units to see how popular pie charts are. (Exercise: Check this out.)
The following examples are designed to illustrate some of the good, and bad,
features of pie charts.
Various pie charts are available in Excel. You are asked to explore some of the
options available in Practical 3.
Pie charts are used to represent the distribution of the categorical components of a
single variable (series).
Single Pie Chart In Fig.5.6 we display, for the 2007 data, the various percentages
making up the total.
• In Excel it is possible to also display the actual values in Table 5.3, or not to
display any numerical values at all, just the labels.
Comparing Pie Charts In Fig.5.7 we compare the 2000 and 2007 data.
Fig.5.7: Pie charts for comparison of U.S. Federal Government Receipt data
Fig.5.8: Two three dimensional pie charts for Government Receipts (2007)
Table 5.4: Apple and AT&T Stock
Fig.5.9: Scatter plot for Apple and AT&T Stock
Scatter plots are really intended to give a “global view” of a data set, often with the
view of further analysis (such as “fitting a straight line” to the data). In Fig.5.9 there
appears no “real” relationship between the two variables. However, if we know
beforehand a relationship exists, as in the case of (defined) functions, we can use a
scatter plot to graph the relation. Graphing functions in Excel is explored in Practical
Unit 3.
Data The data given in Table 5.5 relates to supply and demand of a product. Using
Copy & Paste to add the line graph for the supply curve to that of the demand curve
produces Fig.5.10. From this we can identify the equilibrium price (where demand =
supply) as £100.
Question: What happens if the demand line in Fig.5.10 suddenly shifts to the right
(representing an increased demand)?
Suitably adjusting the vertical scale is a common way to give a misleading impression
of the data. Another frequent offender is lack of a vertical scale.
Fig.6.2: Comparative behaviour of series Table 6.1: Actual data for Fig.6.2
3. “Stretched” Scales
You should be aware that data variations can appear smaller (or larger) by
“stretching” the horizontal (or vertical) scale. For example, compare Fig.6.4 with
Fig.6.5 overleaf; in the latter there appears virtually no variation in either series. This
type of “stretching” is now very easily done in Excel, so you should always take care
with the overall size of your charts.
4. Pie Charts
There is a large, and growing, literature on the use (and misuse) of pie charts. For an
interesting discussion of why pie charts can easily mislead us see
• Visual Gadgets (2008). Misleading the reader with pie charts, available at
http://visualgadgets.blogspot.com/2008/05/misleading-reader-with-pie-charts.html
You should be able to track many other articles via Google.
5. Stacked Charts
Rather than represent the data of Table 6.1 as a time series we could choose to use
a column chart; the second chart in Fig.6.6 is a stacked version. In truth neither chart
represents the data variations particularly well, but the stacked chart is particularly
inappropriate. The total height of each column is meaningless, since we cannot add
“Number of sales” with “Price of product”, not least because they are measured in
different units. Always check that items are comparable when stacking them in a
chart.
Fig.6.6: Column chart, and stacked column chart, for data of Table 6.1
6. Pictograms
The pictogram is a very visually appealing way of representing data values. The
selected data of Table 6.2 gives passenger numbers for April 2009 at some of the
major U.K. airports (London and Scotland). Figures are to the nearest hundred
thousand.
Data Source Figures, referring to April 2009, are taken from the webpage Recent
UK Airport Passenger Numbers – from CAA, BAA and IATA statistics at
http://airportwatch.org.uk/news/detail.php?art_id=2258. Here you can find further
data, with exact passenger numbers.
[Table 6.2: April 2009 passenger numbers (to the nearest hundred thousand) for Edinburgh, Stansted, Gatwick and Heathrow.]
The pictogram illustrated in Fig.6.7 attempts to compare the passenger numbers for
the airports shown, using a graphic of an aeroplane rather than lines or bars. The
obvious impression given is that Heathrow has vastly more passengers than the
other airports. In fact the ratio
Heathrow Passenger Numbers / Edinburgh Passenger Numbers = 5.6/0.8 = 7
So the “Heathrow graphic” should be seven times the size of the “Edinburgh graphic”.
This is clearly not the case since the eye picks up the “complete picture” and
registers the corresponding area. Unfortunately in Fig.6.7 both linear dimensions
(length and height) of the “Heathrow plane” are scaled up by an approximate factor
of seven compared to the “Edinburgh plane”, with a consequent area magnification of
7² = 49. In effect the final graphic is seven times too large, and a more representative
pictogram is shown in Fig.6.8.
7. Chart References
The ONS has produced a useful summary of best practices in using charts, under the
title Drawing Charts – Best Practice. See if you can find it under Neighbourhood
Statistics; the URL is
http://www.neighbourhood.statistics.gov.uk/HTMLDocs/images/Drawing%20Charts%
20-%20Best%20Practice%20v5_tcm97-51125.pdf
• You may also care to look at a set of (un-named) power point slides at
http://mtsu32.mtsu.edu:11235/Misleading%20Statistics.ppt
• A good discussion of the use of pictograms, both one and two dimensional, in
Excel is given in Hunt, N. (2000) Pictograms in Excel. Teaching Statistics, 22, 2,
56-58.
• There is a very interesting website Numeracy in the News located at
http://www.mercurynie.com.au/mathguys/mercindx.htm. Here you will find
discussions of numeracy in relation to specific newspaper articles. In particular
if you click on the Data Representation icon you will find many newspaper
pieces analysed in terms of their statistical content. Alternatively you can go
directly to http://www.mercurynie.com.au/mathguys/maths/datreprs.htm. You
should look at some of this material, and we shall consider one or two articles in
Tutorial 3.
• In addition there are a large number of graphics chartists use to describe, for
example, the price movements of stocks and indices. You can get an idea of
some of the possibilities by going to Yahoo Finance! and obtaining historical
quotes for a particular stock (say General Electric). Under Charts click on
Basic Technical Analysis (left side) and a chart should appear that
incorporates a time series, and histogram, of the stock movements. In addition
you can add to this chart items like “Bollinger Bands” (if you know what they
are!). Below is the kind of effect you can achieve.
• From an accounting perspective you should download, and read, the following
article: Burgess, D.O. (2008). Does Graph Design Matter To CPAs And
Financial Statement Readers? Journal of Business & Economics Research 6
(5) 111-124.
Here a survey of financial readers was undertaken to ascertain whether the
meaning of financial statements can be distorted, intentionally or not, by the
graphical representation chosen. To give you an idea of what is involved
complete the following exercise:
Exercise: Examine the following two graphs and then comment, as indicated, on the
five statements that follow.
You may like to revisit this exercise at the end of the course and see if your
responses differ.
• Finally there is one important chart we have not discussed in this unit. This is
the boxplot, and relies on the use of quartiles, a topic we discuss in the next
unit.
Nevertheless, the basic graphical devices we have discussed in this unit are used
repeatedly in the financial and economic literature. Look back to some of the
referenced papers to confirm this. In general, unless you have compelling reasons
not to, use simple graphics to describe your data, in preference to anything “more
fancy”.
8. References
• Hunt, N., Mashhoudy, H. (2008). The Humble Pie – Half Baked or Well Done? Teaching Statistics, 30 (1), 6-12.
• Hunt, N. (2000). Pictograms in Excel. Teaching Statistics, 22 (2), 56-58.
• Noah, T. (2004). Stupid Budget Tricks. How not to Discredit the Clinton Surplus. Slate Magazine (Aug. 9th), available at http://slate.msn.com/id/2104952/
• Visual Gadgets (2008). Misleading the reader with pie charts, available at http://visualgadgets.blogspot.com/2008/05/misleading-reader-with-pie-charts.html
Learning Outcomes
At the end of this unit you should be familiar with the following:
• General ideas of location and spread of a data set.
• Use of numeric “summary measures”.
• The various numeric measures of location available; mean, median and mode,
and their importance.
• The various numeric measures of spread available; range, IQR and standard
deviation, and their importance.
• The use of graphical representations (stem and leaf plots and boxplots) to
compute, and display, quartiles.
• Understand when numeric measures are inadequate.
• Appreciate the properties of some basic types of financial data.
• Understand how market efficiency can be investigated using pivot tables.
A market where chief executive officers make 262 times that of the average
worker and 821 times that of the minimum-wage worker is not a market that is
working well.
Marcy Kaptur (American politician)
(See http://www.brainyquote.com/quotes/keywords/average.html)
1. Guiding Principles
• Numerical displays are intended to give a quick summary of the (numerical)
content of quantitative data.
• There are two important factors associated with any dataset:
- A measure of “central location”, by which we mean “where is the bulk of
the data located”?
- A measure of “spread” indicating whether the data is “tightly bunched”
about the centre or not.
• In most practical situations (large data set) numerical summary measures are
usually computed using computer software (Excel in our case). However, as
with graphical measures, hand computations (using small data sets) are
important for two reasons:
- They illustrate the underlying principles of the calculation. The rationale
behind measures can be of crucial importance in financial situations.
- They “give one a feel for the data” and allow an understanding of which
particular summary measure should be used in a given situation.
Exercises involving hand computation are given in Tutorial 4.
Note From this unit onwards our discussions will, in general, become more
numerically based. A good modern reference text is
Nieuwenhuis, G. (2009) Statistical Methods for Business and Economics.
Maidenhead: McGraw-Hill Education (UK) Limited.
Although you are not required to purchase this, the text covers similar material to the
module, but in considerably more detail, and contains some more advanced topics
we do not have time to cover. In addition there are many finance/economic based
examples, and Excel (and SPSS) applications are discussed.
For our purposes it will be most useful to compute summary measures for a sample
taken from a larger population. So we can assume our data values are discrete (see
Unit 1) and label them x1, x2, x3, x4, ..... , xn --- (1)
This indicates we have n values (not necessarily distinct).
Definition 1 The mean (or arithmetic average) is usually denoted x̄ (read “x bar”)
and defined as
x̄ = (1/n)[x1 + x2 + x3 + x4 + .... + xn] = (1/n) Σ_{i=1}^{n} xi --- (2)
Definition 2 The median Q2 is the “middle value” of the data, i.e. the x-value such
that half the x-values are smaller, and half the x-values bigger, than Q2.
(3) says to take the unique middle value when it exists (n odd), otherwise take
the (arithmetic) average of the “two middle values”. In practice this verbal
description tends to be more useful than the mathematical one in (3).
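The formula (3) itself is not reproduced in these notes. A standard version, offered here only as a reconstruction consistent with the verbal description above, is: if the data in (1) are first arranged in increasing order, then

Q2 = the ((n + 1)/2)-th value                                  if n is odd
Q2 = the average of the (n/2)-th and (n/2 + 1)-th values        if n is even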
Definition 3 The quartiles are the values that divide the data in (1) into “four equal
quarters”.
Definition 4 The mode is the “most popular” value of the data, i.e. the x-value in (1)
which occurs most often.
Note 3 The data really needs to be ordered if we want to reliably identify the mode
(and quartiles), especially if n is large in (1). To compute the mean we do not need
ordered data.
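Before turning to the module's own example, these definitions can be checked quickly outside Excel. The short Python sketch below computes the mean, median and mode for a small made-up data set (the values are invented for illustration and are not the module data):

# Python sketch: mean, median and mode for a small illustrative data set.
# The data values below are invented for illustration only.
import statistics

data = [21, 26, 26, 28, 33, 35, 39, 41, 44, 80]

mean = sum(data) / len(data)          # definition (2)
median = statistics.median(data)      # n is even here, so the average of the two middle values
mode = statistics.mode(data)          # the most frequently occurring value

print(mean, median, mode)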
It is traditional to compute the mean, median (quartiles) and mode for a “small”
artificial data set to illustrate the computations. We choose a slightly larger set and,
for variety, consider a “non-financial” application. (Nevertheless you should be able to
see economic, and financial, implications in Example 3.1.) For illustrative purposes
our calculations extend over the next two sections.
[Stem and leaf plot of the data: stems on the left, leaves on the right. For example a stem of 5 with leaf 0 is read as 50, and a stem of 3 with leaf 9 is read as 39.]
Notes
1. If you require more leaves, to better highlight the underlying “shape”, you can
divide each stem in two as illustrated in Fig.3.1c. Then, for example, the stem 2L
refers to the “low” half of the twenties, i.e. 20 – 24, and 2H means the “high” half, 25 – 29.
2. Stem and leaf plots are produced by some computer software but, unfortunately,
not directly by Excel. However you can produce a “rotated” (histogram) version of a
stem and leaf plot in Excel using the procedure described in
Excel Charts for Statistics on the Peltier Technical Services, Inc. website at
http://www.peltiertech.com/Excel/Charts/statscharts.html#Hist1
3. We may take the view that data values in the range 60 and above are “atypical”,
and regard them as outliers. The remaining “typical” data values then appear “fairly
symmetric” – we shall return to this point several times later.
4. Computation of Quartiles
With a little practice, we can fairly easily read off the quartiles from the stem and leaf
plot. It is often convenient to compile a frequency table as part of the calculation.
[Annotated stem and leaf plot showing how the quartiles are located; Q1 and Q3 fall on data values, while Q2 is the average of the two data values either side of the middle.]
Notes
1. Graphically we have the following situation (not drawn to scale):
2. Remember, in our example, Q2 is not directly a data value, but Q1 and Q3 are. This
is why, in Fig.4.3, 20 + 20 ≠ 41.
3. Our procedure of successively dividing the data into halves is not the only one
possible. Wikipedia, at http://en.wikipedia.org/wiki/Quartile, will give you more details,
and further references, such as http://mathworld.wolfram.com/Quartile.html, to other
computational procedures for quartiles. But the important point to bear in mind is that
all such procedures will give “very similar” answers and, as we discuss in the next
section, it is the “overall shape” defined by the quartiles which is of greatest interest
and importance.
4. From Fig.4.2 we can also read off the mode as 26 (occurring 8 times in the data).
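As a rough cross-check of hand calculations, quartiles can also be obtained in Python. Note that, as discussed in Note 3 above, software uses slightly different quartile conventions, so the values may differ a little from the “successive halving” method; the data below are again invented for illustration:

# Python sketch: quartiles of a small illustrative data set.
# Different packages use slightly different quartile conventions, so results
# may not match the hand method in the text exactly.
import statistics

data = sorted([21, 26, 26, 28, 33, 35, 39, 41, 44, 80])

q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3)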
[Boxplot: the box runs from Q1 to Q3 with Q2 marked inside it, and the whiskers extend out to the minimum and maximum values.]
Notes
1. The width of the box is arbitrary.
2. Although Excel does not graph boxplots directly, they can be obtained as shown in
Practical Unit 4.
3. Boxplots are particularly important when comparing two, or more, datasets. You
are asked to compare the actor, and actress, boxplots in the Tutorial Exercises.
Of course this type of calculation is more suited to Excel implementation, and you are
asked to perform some computations of this type in Practical Unit 4.
Calculation 2 A much simpler calculation results from using the frequency table in
Fig.4.1, but we have to make an assumption. Our frequency table just tells us how
many data values are in a particular interval, but not what the values are explicitly.
We assume all values are concentrated at the centre of the interval so, for example,
we imagine the data value 25 occurs 28 times. This easily leads to the frequency
table of Table 6.1. The mean is now calculated from the formula (why?)
x̄ = (1/n)[f1 x1 + f2 x2 + f3 x3 + .... ] = (1/n) Σ fi xi --- (5a)
where n = f1 + f2 + f3 + .... = Σ fi --- (5b)
(the sums extending over all the intervals of the frequency table).
The actual computation shown in Table 6.2 gives
x̄ = 2960/82 = 36.10 (years) --- (6)
Table 6.1: Frequency table from Fig.4.1 Table 6.2: Computation of mean.
Notes 1. The value in (6) is an approximation (estimate) of the mean, whereas (4) is
the exact value. The virtue of the frequency table approach is its simplicity, since the
complete dataset (82 values in our case) is replaced by a much smaller number of
intervals (7 here).
2. The formalism in Table 6.2, where our result (the mean) is calculated in terms of
column sums of the data, occurs very often in statistical calculations. Such sums are
easily computed by hand (small dataset) and in Excel (large dataset).
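The sketch below implements (5a) and (5b) in Python. The interval midpoints follow the “centre of interval” assumption above, but the frequencies are hypothetical, chosen only so that the totals match the quoted n = 82 and Σ fi xi = 2960; they are not the actual entries of Table 6.1.

# Python sketch of the grouped-mean formulas (5a) and (5b).
# Frequencies are hypothetical, chosen so the totals match those quoted in the
# text (n = 82 and sum of f*x = 2960); they are not the real Table 6.1 values.
midpoints = [25, 35, 45, 55, 65, 75, 85]   # assumed interval centres
freqs     = [28, 34, 11,  4,  3,  1,  1]   # hypothetical frequencies (sum = 82)

n = sum(freqs)                                             # (5b): total frequency
mean = sum(f * x for f, x in zip(freqs, midpoints)) / n    # (5a): grouped mean
print(n, round(mean, 2))                                   # approximately 36.10 years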
Example 3.1 The results in Fig.8.1 are immediate from Fig.4.3. Note the following:
• The range uses only two data values, and hence is very sensitive to outliers in
the data.
• The IQR is designed to eliminate this difficulty by looking at the “middle half” of
the data. Observe that the range is not double the IQR as one might expect for
a symmetric dataset.
• However the IQR still only uses two data values.
IQR = 39 – 28 = 11
|______________|_________________|___________________|_______________|
Min = 21 Q1 = 28 Q2 = 33 Q3 = 39 Max = 80
Range = 80 – 21 = 59
[Diagram: the deviations x1 - x̄, x2 - x̄, ... , xn - x̄ of the individual data values from the mean x̄.]
Example 8.1 The following data gives prices (in pence) of a 100 gram jar of a
particular brand of instant coffee on sale in 15 different shops on the same day:
100 109 101 93 96 104 98 97 95 107 102 104 101 99 102
Calculate the standard deviation of the prices.
Solution The required calculations are depicted in Table 8.1. Note the following:
• The sum of the deviations (cell C40) is zero; this is a consequence of the
definition (2) of the mean. The cancellation of positive and negative deviations
is avoided by squaring, as specified in (8).
• The sum of squared deviations is 266 (non-zero, of course), and the average
squared deviation is 266/15 = 17.733. Then s = √17.733 = 4.21 (pence).
• In words: On average any particular coffee price is 4.2 pence from the mean of
101 pence.
Comments 1. Note that no individual price can involve a fraction of a penny; neither
can the deviations (column C). The standard deviation just gives an average
measure, averaged over all prices (data values).
2. You may also have noticed the mean, as defined by (2), is often not a possible
data value. Thus in (7) 35.4 is not a possible age (since all ages are given to the
nearest year). Again this occurs because we average over all data values.
3. There is an alternative calculation of s based on the formula
s² = (1/n) Σ_{i=1}^{n} xi² - x̄² --- (9)
This is just an algebraic rearrangement of (8); you should be able to prove this if you
are familiar with manipulating summation symbols. We can check this by using
column F in Table 8.1 to give s² = 153281/15 – 101² = 17.733, and this agrees with
the entry in cell D41.
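The Python sketch below applies both (8) and (9) to the coffee prices listed in Example 8.1 and confirms that the two forms agree. (Note that the 15 prices as printed above sum to 1508 rather than the 1515 needed for a mean of exactly 101, so the printed output will differ slightly from Table 8.1; one of the prices has probably been garbled in transcription, but the method is unaffected.)

# Python sketch checking Example 8.1 with both the deviation form (8)
# and the rearranged form (9); the two results should be identical.
prices = [100, 109, 101, 93, 96, 104, 98, 97, 95, 107, 102, 104, 101, 99, 102]

n = len(prices)
mean = sum(prices) / n
dev_sq = sum((x - mean) ** 2 for x in prices)         # sum of squared deviations
var_8 = dev_sq / n                                    # definition (8)
var_9 = sum(x * x for x in prices) / n - mean ** 2    # rearranged form (9)
print(mean, dev_sq, var_8, var_9, var_8 ** 0.5)       # s is the square root of (8)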
Example 3.1 To calculate s for our original (Oscar winners) data we can do one of
three calculations:
• Use (8) with our 82 data values.
• Use a slight modification of (9), relating to frequency distributions, and use the
frequency table in Fig.4.1 (just as we did for the mean in Section 6).
• Use Excel’s “built in” functions.
The first calculation is too time-consuming, and you are asked to explore the third
option in the Practical Exercises. Here we look at the second alternative and, as with
the mean, the ease of the calculation is slightly offset by the approximate nature of
the computation. The frequency version of (9) is (compare this with (5))
s² = (1/n) Σ fi xi² - x̄², with n = Σ fi --- (10)
To implement (10) all we need to do is add an extra column (D) to Table 6.2 to give
Table 8.2. Note that, since we are calculating entirely from the frequency table, we
use the value of x given in (6). This gives
s² = (1/(n – 1)) Σ_{i=1}^{n} (xi - x̄)² --- (8*)
Here we divide by (n – 1) rather than n. The reason centres around the fact that, to
compute s, we first need to compute x from the data. To see why this is important
we return to Example 8.1, where we have the 15 data values
100 109 101 93 96 104 98 97 95 107 102 104 101 99 102
Once we have computed the mean x = 101 our 15 data values are no longer all
needed – technically they are not all independent (of each other). In fact, knowing x
= 101, we can remove any one of our data values, i.e. our data could be any of the
following 15 sets (each one comprising only 14 values):
* 109 101 93 96 104 98 97 95 107 102 104 101 99 102
100 * 101 93 96 104 98 97 95 107 102 104 101 99 102
.......................................................................................................................................
100 109 101 93 96 104 98 97 95 107 102 104 101 99 *
In each case the starred * entry is uniquely determined by the requirement that the
mean is 101. To take account of this “one redundant” data value we adjust the
denominator (n) in (8) by one to give (8*).
Notes
1. You will probably find this argument a little strange! However it expresses a very
general view in statistics that it is only independent quantities that are important in
computations, and not necessarily all the (data) values we have available.
2. We shall essentially repeat the argument in Unit 7 Section 8, when we introduce
the important concept of degrees of freedom.
3. At the moment just remember to use (8*) when the mean x has to be calculated
from the data. Thus our previous calculations of s in Example 8.1 are, strictly
speaking, incorrect. For example, using the sum of squared deviations of 266 gives
s² = 266/14 = 19 (in place of s² = 266/15 = 17.733)
4. Observe that the above difference (4.36 compared to 4.21) is quite small. As n
increases the difference between (8) and (8*) clearly decreases. This leads to the
“rule of thumb”: use (8*) for “small” samples and (8) for “large” samples. The dividing
line between the two is often taken as n = 25, but this is really quite arbitrary.
5. Which formula is right – (8) or (8*)? The answer is that both are really just
definitions designed to capture (in a single number) the concept of “spread” of data
values around the mean. We are free to choose whichever definition we want.
Theoretically we choose the one with the “better mathematical properties” and,
because of the independence idea, this turns out to be (8*). The drawback is
explaining why we prefer (8*), without going into too many technical details, since (8)
is obviously more intuitive.
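For larger data sets both versions are, of course, computed by software. As a sketch, Python's standard library offers both divisors (pstdev divides by n, as in (8); stdev divides by n – 1, as in (8*)):

# Python sketch contrasting the "divide by n" and "divide by n - 1" versions,
# using the coffee-price data of Example 8.1.
import statistics

prices = [100, 109, 101, 93, 96, 104, 98, 97, 95, 107, 102, 104, 101, 99, 102]

pop_sd = statistics.pstdev(prices)    # divides by n      (definition (8))
sam_sd = statistics.stdev(prices)     # divides by n - 1  (definition (8*))
print(pop_sd, sam_sd)                 # the two values differ only slightly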
Example 10.1 The following (hypothetical) data gives (starting monthly) values of two
stock portfolios, X and Y, over a six month period.
Month   Portfolio X (£000)   Portfolio Y (£000)
1            1000                 1000
2            1008                 1015
3            1018                 1066
4            1048                 1194
5            1032                 1086
6            1038                 1043
7            1058                 1058
Although both portfolios have the same starting (1000) and finishing (1058) values,
their maximum values are different (1058 for X and 1194 for Y). Hence we would
judge Y is more volatile than X. To quantify this, the appropriate calculations are
normally expressed in terms of returns since an investor will usually target a specific
return, say 5%, on his investment (no matter how much he invests).
The return is just the familiar percentage change we have seen before (where?):
Portfolio return = [(End value - Start value) / Start value] * 100%
Table 9.2 gives the calculated returns. For example, during month 1
Portfolio return = [(1008 - 1000) / 1000] * 100% = 0.8%
Hence Y has a slightly higher average return (1.2% compared to 0.9%). However,
this is more than offset by its (much) higher volatility:
• s²_X = (1/5)[0.8² + 0.9823² + 2.8626² + 1.5267² + 0.5814² + 1.9268²] - 0.9377² = 2.3569
• s²_Y = (1/5)[1.5² + 5.0246² + 12.0075² + 9.0452² + 3.9595² + 1.4382²] - 1.1609² = 54.2477
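A Python sketch of the same calculation is given below. It recomputes the month-on-month percentage returns directly from the table of portfolio values, so the variances printed may differ slightly from those quoted above (which use rounded returns and a divisor of 5), but the conclusion that Y is far more variable than X is unchanged.

# Python sketch: month-on-month percentage returns and their variability
# for the two portfolios in Example 10.1.
import statistics

portfolio_x = [1000, 1008, 1018, 1048, 1032, 1038, 1058]
portfolio_y = [1000, 1015, 1066, 1194, 1086, 1043, 1058]

def returns(values):
    # percentage change from one month to the next
    return [(b - a) / a * 100 for a, b in zip(values, values[1:])]

rx, ry = returns(portfolio_x), returns(portfolio_y)
print(statistics.mean(rx), statistics.variance(rx))   # average return and variance for X
print(statistics.mean(ry), statistics.variance(ry))   # average return and variance for Y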
Solution Note that the average return of both investments will be 3%. We obtain the
values shown in Table 11.2. For example, after 2 months Portfolio Y will have grown
to £1000*(1 + 0.02)*(1 + 0.04) = £1060.8
(We have retained sufficient decimal places to minimise the effect of rounding errors
in the calculations.)
Time (months)   Portfolio X Investment Value (£)   Portfolio Y Investment Value (£)
1 1000*1.03 = 1030 1000*1.02 = 1020
2 1030*1.03 = 1060.9 1020*1.04 = 1060.8
3 1060.9*1.03 = 1092.727 1060.8*1.02 = 1082.016
4 1092.727*1.03 = 1125.50881 1082.016*1.04 = 1125.29664
5 1125.50881*1.03 = 1159.27407 1125.29664*1.02 = 1147.8025728
6 1159.27407*1.03 = 1194.05230 1147.80257*1.04 = 1193.7146758
Conclusion Observe that, after each two month period (when the average return is
the same on both portfolios), X has a larger value than Y. Although the differences
here are small, they will increase over time. You may care to see what the difference
is after a further six months, assuming the same pattern of investment returns. Also,
the more that is initially invested, the larger the differences will be; with a £1 million
investment the difference in portfolio values after six months would be £337.62.
More important than the actual amounts involved:
• The investment with the larger variation in returns is ALWAYS worth less after
any period of time (provided the average return is the same in all cases).
• The larger the variation the less the investment is worth (subject to the average
return being the same).
You may care to investigate these assertions for yourself. In Table 11.3 we give
investment values, computed in Excel, for the three sets of returns shown, with an
initial investment of £1000. Over 20 time periods the line graphs of Fig.11.1 are
obtained. Note in particular how the most variable returns (Investment 3) produce the
lowest final value (by a significant amount), and hence the lowest overall return. In
each case the average return is the same (5%).
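If you do not want to build Table 11.3 yourself, the Python sketch below makes the same point with three invented return patterns, each averaging 5% per period (these are not the returns used in Table 11.3):

# Python sketch: three return patterns with the same average return (5%)
# but different variability; the most variable ends with the lowest value.
def grow(initial, pattern, periods=20):
    # apply the repeating pattern of percentage returns for the given number of periods
    value = initial
    for t in range(periods):
        value *= 1 + pattern[t % len(pattern)] / 100
    return value

steady   = [5, 5]        # constant 5% each period
moderate = [0, 10]       # alternating 0% and 10%   (average 5%)
volatile = [-15, 25]     # alternating -15% and 25% (average 5%)

for name, pattern in [("steady", steady), ("moderate", moderate), ("volatile", volatile)]:
    print(name, round(grow(1000, pattern), 2))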
However, in practice, predictable returns (with, for example, AAA bonds) will
invariably produce the lowest returns! To increase returns requires more risk to be
taken, which implies more unpredictable returns (and hence obviously greater
potential variation in returns).
We can now clearly see why data variation, as measured by the standard deviation,
is used as a measure of risk.
In the first “Key Points” section you will find the following (my highlighting):
• In April 2008 median gross weekly earnings were £479, up 4.6 per cent from
£458 in 2007, for full-time UK employee jobs on adult rates whose earnings
were not affected by absence
• Between 2007 and 2008 the weekly earnings for full-time employees in the top
decile grew by 4.4 per cent compared with a growth of 3.5 per cent for the
bottom decile.
• For the 2007/08 tax year median gross annual earnings for full-time employees
on adult rates who have been in the same job for at least 12 months was
£25,100. For males the median gross annual earnings was £27,500 and for
females it was £21,400
• The stronger growth in full-time men’s hourly earnings excluding overtime
compared with women’s has meant that the gender pay gap has increased to
12.8 per cent, up from 12.5 per cent in 2007. On the basis of mean full-time
hourly earnings excluding overtime, the gender pay gap has increased, from
17.0 per cent in 2007 to 17.1 per cent in 2008.
Read through the paper and note the types of summary measures and graphs used.
Many of these should have been covered in Units 3 and 4; which ones have not?
Example 13.1 The Excel file IBM_Weekly contains weekly closing IBM stock prices
from 03/01/2000 to 13/12/2007. The data, a small portion of which is shown in
Fig.13.1, was downloaded from Yahoo Finance!
• At the two “endpoints” the series takes roughly the same value, i.e. the stock
has (only) maintained the same price level (over an 8 year time span). We could
look for “economic interpretations” of this either
- in the news released by IBM itself, or
- in general news from the technology sector, or
- in more general economic news
over this time frame. You may care to look into this (Exercise).
• In between times the stock clearly has high and low points. The general
question we would like to ask is the following:
Question 1 “If we held the stock initially (Jan 2000) what investment strategy
should we have adopted to maximise our profits at the end (Dec 2007)?”
(To keep things as simple as possible we are ignoring the possibility of the
stock price having exactly the same value in two successive time periods. Do
you think this is reasonable?)
• The result is shown in Table 13.1. Clearly we want to count how many times the
stock went Up, and how many times Down.
Table 13.1: Stock Up or Down? Table 13.2: Count of Ups and Downs in stock
Comment We could represent the numerical result in Table 13.2 graphically by, for
example, a histogram. Would this be a sensible thing to do? Remember you would
only include a table and a graphical representation in a report if they gave different,
but complementary, information (or possibly different perspectives on the same
information). Look back to the “Summary” advice at the end of Section 5 of Unit 3.
In Step 3 we answered the question
• How often (or what percentage of the time) does the stock price go up?
There is another, more interesting question we can look at:
• If we know the stock price went up last week, what are the chances (probability)
of it going up again this week?
Rationale If this probability is “large” we may be tempted to buy the stock once we
have observed the price increase. This would be a sound investment strategy under
the given circumstances. (We shall not formally meet the idea of “probability” until
Unit 5, but here we just need an “informal idea” of probability as the likelihood/chance
of the stock continuing to rise.)
Table 13.3: Two week changes Table 13.4: Counts of two week changes
Conclusion If the stock went Up last week (which it did 202 times) then it
subsequently (this week) went Up again 48% of the time (97 times) and, of course,
Down 52% of the time. Similarly, if the stock went Down last week (213 times) it
subsequently continued to go down 51% of the time, and went back Up 49% of the
time.
All these percentages are (depressingly) close to 50%, so any strategy to buy, or sell,
the stock based on its previous week's movement seems doomed to failure. Of course
we could look at how the stock behaved over the previous two (or three or ...) weeks
before deciding whether to buy (or sell or hold). You may care to investigate some of
these possibilities using pivot tables.
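Outside Excel, the same cross-tabulation can be sketched in a few lines of Python; the short price series below is hypothetical and simply stands in for the IBM_Weekly data:

# Python sketch of the pivot-table idea: cross-tabulating this week's move
# against last week's move. The price list is a hypothetical stand-in.
from collections import Counter

prices = [100, 102, 101, 103, 104, 102, 101, 105, 107, 106]

moves = ["Up" if b > a else "Down" for a, b in zip(prices, prices[1:])]
pairs = Counter(zip(moves, moves[1:]))        # (last week, this week) counts

for last_week in ("Up", "Down"):
    total = sum(c for (lw, _), c in pairs.items() if lw == last_week)
    for this_week in ("Up", "Down"):
        count = pairs[(last_week, this_week)]
        share = count / total if total else 0
        print(last_week, "->", this_week, count, round(100 * share), "%")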
Whether or not we can predict stock prices is at the heart of the idea of market
efficiency, a concept that is generally phrased in terms of the Efficient Market
Hypothesis (EMH). As you are probably aware, there is a vast literature on this topic
and you cannot go very far in any finance course without meeting it. For an extended
discussion see Brealey, R.A. and Myers, S.C. (2003). Principles of Corporate
Finance, 7th. International Edition. New York, McGraw Hill.
Here we merely state the EMH version relevant to our analysis:
Weak form of EMH: Security prices reflect all information contained in the record of
past prices. (It is impossible to make consistently superior profits by studying past
returns.)
Although we have not really got very close to answering Question 1 (producing an
“optimal” investment strategy), you should be able to appreciate the use of pivot
tables (cross-tabulation) in looking for “patterns” within the data.
14. References
Learning Outcomes
At the end of this unit you should be familiar with the following:
• Understand how probability is defined and calculated in simple situations.
• Apply the basic probability laws using tables and tree diagrams.
• Understand the concept of a probability distribution.
• Appreciate the idea of a conditional probability distribution, and compute
conditional probabilities.
• Recognise the role of the mean and variance in characterising a probability
distribution.
Who cares if you pick a black ball or a white ball out of a bag? If you’re so
concerned about the colour, don’t leave it to chance. Look in the bag and pick
the colour you want.
Adapted from Stephanie Plum (Hard Eight)
1. Introduction
An increasingly important issue is to examine how financial quantities, such as stock
prices, behave. For example we may be interested in answering the following
questions:
• What is the probability my IBM stock will increase in value today?
• What is the probability my IBM stock will increase in value by 1% today?
• What is the probability my IBM stock will increase in value by 1% over the next
week?
• What is the probability my portfolio of stocks will increase in value by 1% over
the next month?
and so on. But before we can answer questions like these we need to look at the
(statistical) language needed to make meaningful (quantitative) statements.
Terminology:
To avoid long, and potentially complicated, verbal descriptions we write
P(E) = Probability of the event (outcome) E
Sometimes we may write Pr(E), or P{E} or Prob(E) or something similar.
We can estimate probabilities, and probability distributions, from data in several ways;
in this unit we shall use a mixture of the following three types of data:
• “Real data” (taken from ONS). This will emphasise that probabilities are not just
theoretical constructs, but are tied firmly to collected data.
• “Simulated data” using Excel’s random number generators. This will let us
easily obtain, in certain well defined situations, as much data as we require; the
latter will let us illustrate concepts of interest.
• “Theoretical data”. This will allows us to keep the calculations as simple as
possible, and let us concentrate on the underlying ideas without worrying too
much about computational details.
Two “simple” examples which incorporate many of the ideas we need are found in
the age old pursuits of coin tossing and dice throwing. Despite their apparent
simplicity such examples contain a great deal of interest and can be used to illustrate
a variety of concepts.
This is the “standard” long run frequency interpretation of probability, and is the most
commonly used idea to define precisely the concept of probability. But to use this
definition we need to perform an experiment (a large number of times).
“Solution” : Rather than actually tossing a coin we shall simulate the process in
Excel using the built in random number generator. This allows us to generate
“Heads” and “Tails”, with equal probability, as often as we wish. From Table 2.1 we
obtain the two estimates
P(H) = 6/10 = 0.6 and P(H) = 7/10 = 0.7
[Table 2.1: Excel simulation of 10 coin tosses using =RAND(), with Head coded as 1 and Tail as 0; one simulation gives 6 heads in 10 tosses, the other a column sum of 7.]
[Fig 2.1: Bar charts of the estimated probabilities Prob(H) and Prob(T) from the simulated tosses.]
Notes: 1. For a discussion of how the results of Fig. 2.1 are obtained see the
spreadsheet Unit5_CoinTossing in the Excel file CoinTossing2.xls. You are asked
to look at these simulations in Question 1 of Practical Exercises 4.
2. Look at the Excel file CoinTossing1.xls and the spreadsheet Proportions for an
empirical discussion of (1); see also Question 3 of Practical Exercises 4.
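For readers who prefer to experiment outside Excel, the following Python sketch plays the same role as the =RAND() simulation (random.random() replaces =RAND()); as the number of tosses grows the estimates settle down towards 0.5:

# Python sketch of the coin-tossing simulation.
import random

def estimate_prob_heads(tosses):
    # toss a fair coin 'tosses' times and return the proportion of heads
    heads = sum(1 for _ in range(tosses) if random.random() < 0.5)
    return heads / tosses

for n in (10, 100, 10000):
    print(n, estimate_prob_heads(n))   # estimates settle towards 0.5 as n grows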
A Problem: There is one major (philosophical) flaw with the frequency approach to
probability. In some (many) situations we have no control over our experiment,
which is essentially a one-off event, and hence cannot be repeated (and certainly
not many times).
Example 2.2: IBM stock is today worth $100. What is the probability it will be worth
$105 tomorrow?
“Solution”: Here we cannot use (1) since we cannot “repeat” the stock price
movement over the next day “many” times (and observe how often it reaches $105).
The stock price will move of its own accord, and it will assume a single value
tomorrow. Of course this value is unknown today. There appears to be no simple way
to assign a probability to the required event.
In fact there are two approaches we might take:
• Simulate the stock price process “many” times as we did in Example 2.1 for the
coin. Observe how many times the stock reaches the required level of $105 and
use (1). The difficulty is we need a model of the behaviour of the stock price to
do this. In Example 1a we used a “random” mechanism to model our coin toss.
• Observe the stock price over the next 100 days (say) and use (1) to assess the
required probability. Of course this will not give us an answer for tomorrow, and
there is the additional problem that, at the start of each new day, the stock will
not start at $100.
• Since all the probabilities must sum to one (why?) we conclude P(6) = 1/6.
A Problem: This symmetry approach is a theoretical one, and will therefore have
nothing to say about probabilities determined by “real world events”:
Example 3.2: What is the probability that IBM goes bankrupt within the next year?
4. Subjective Probability
Here we start with the idea that sometimes there is no objective way to
measure probability. In this case probability is really the degree of belief held by an
individual that a particular event will occur.
Example 4.1 : "I believe that Manchester United have probability of 0.9 of winning
the English Premiership next year since they have been playing really well this year,
and I expect their good form to continue into next season."
• Despite this difficulty the “degree of belief” idea does allow us to make
probability statements in situations where the other two approaches (based on
frequency and symmetry) may not be applicable.
• One can quantify the subjective probability in terms of odds. Just how much am
I prepared to bet on Manchester United winning the premiership next season at
a given set of odds (quoted by a bookmaker)?
Example 4.2 : IBM stock is today worth $100. What will it be worth next month?
“Solution”: Most investors’ views are biased towards optimistic outcomes, and so will
tend to overestimate a stock’s future value. So I believe the stock will fall in price to
$90.
You may like to look at Barberis, N., Thaler, R. (2002) : A Survey of Behavioral
Finance available at http://badger.som.yale.edu/faculty/ncb25/ch18_6.pdf.
This is a good, relatively recent, review of the literature and is quite readable, but
goes far beyond the limits of the course.
Q1. If the London Stock Exchange general index has increased on each of the
past 3 days, what is the probability that it will increase in value today as
well?
Probability =
Q2. If the London Stock Exchange general index has decreased on each of the
past 3 days, what is the probability that it will decrease in value today as
well?
Probability =
Q3. If you look at the London Stock Market today in your opinion it is
(choose one alternative):
1. Overvalued by __________ %
2. Undervalued by __________ %
3. Valued at a fundamentally correct level.
4. Cannot say whether it is fairly valued or not.
Q4. If the London Stock Exchange general index is valued at 6000 today, what
do you think will be its value in 6 months time?
Value in 6 months time __________
Q5. Assume the following situation. During the last 2 years the stock of a certain
company has risen by 60%, and the future for the stock looks bright. How do
you value this information?
1. The stock is worth buying.
2. The information is not sufficient to decide on buying the stock.
3. The stock is not worth buying.
5. Bayesian Methodology
An inevitable criticism of the subjective probability approach is precisely in its
subjective nature with different “answers” from different people. To improve upon
this situation the “Bayesian approach” allows probability estimates to be “updated”
as new information becomes available.
Example 5.1
(a) You have a coin in your hand. What is your “best estimate” of P(H), the
probability of obtaining a head on any toss?
(b) Your coin is tossed 10 times and 3 heads result. Now what is your best estimate
of P(H)?
(c) 10 further tosses give 6 heads. Now what is your best estimate of P(H)?
Solution:
(a) Without performing any experiment (tossing the coin), the best we can do is to
invoke the “Principle of Indifference” of Section 4.1.2 and conclude
P(H) = 0.5
(b) Clearly we should use the proportion of heads obtained as an estimate of the
required probability: P(H) = 3/10 = 0.3
(Using the proportion is the “rational thing to do”, but we can give formal arguments
to justify this choice.)
(c) We could again use the proportion of heads obtained as an estimate, i.e.
P(H) = 6/10 = 0.6
The “Bayesian point of view” suggests that we can improve on this estimate by
using any previous knowledge we have. Here we can argue that we already have an
estimate of P(H) from (b) and we can average this estimate with the current one, i.e.
P(H) = (0.3 + 0.6)/2 = 0.45
(We can take the simple average since both estimates in (b) and (c) are based on the
same sample size/number of tosses. If this were not the case we would take a weighted
average, weighted by the respective sample sizes.)
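A minimal Python sketch of this “weight by sample size” idea is given below; the second call uses invented sample sizes purely to show the effect of unequal weights:

# Python sketch: combine two estimates of P(H) by weighting each by its
# number of tosses (equal sizes reduce to the plain average).
def combined_estimate(estimates_and_sizes):
    # estimates_and_sizes: list of (estimate, number of tosses) pairs
    total = sum(size for _, size in estimates_and_sizes)
    return sum(est * size for est, size in estimates_and_sizes) / total

print(combined_estimate([(0.3, 10), (0.6, 10)]))   # equal sizes: 0.45
print(combined_estimate([(0.3, 10), (0.6, 20)]))   # hypothetical unequal sizes: 0.5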
Comment The zeros in the table probably just indicate a value less than 500 – Why?
We need to bear this in mind when assessing the accuracy of our computed
probabilities below. See Lecture Unit 2 Section 8.2.
Question 1 What is the probability of being unemployed for less than 6 months,
during the period Nov 2007 – Jan 2008, if you are a male aged 18-24?
Solution
Number of males unemployed for <6 months, during Nov 2007 – Jan 2008, in the
age range 18-24 = 332
Total number of (economically active) males in this age range, during Nov 2007 –
Jan 2008 = 4210
Using the (frequency) definition (1)
P(unemployed < 6 months under given conditions) = 332/4210 = 0.079
(Equivalently, about 8% of economically active males aged 18-24 were unemployed for
less than 6 months during Nov 2007 – Jan 2008.) You should think about the number of
decimal places we can sensibly quote here.
Comment Clearly we can repeat the above calculation to change all the frequencies
in Table 6.1 into probabilities. The results for Table 6.1(a) are shown in Table 6.2,
and we leave it as an exercise for the reader to obtain the corresponding results for
Table 6.1(b). We have quoted probabilities to 4 decimal places, although three
places are more realistic (why?).
Question 2
What is the probability of being unemployed, during Nov 2007–Jan 2008, if you are a
male aged 18-24?
Comment Instead of adding the frequencies in Table 6.1(a), we can add the
probabilities in Table 6.2. Explicitly
P (unemployed if male aged 18-24) = 0.0789 + 0.0169 + 0.0140 + 0.0095 = 0.1192
Can you see why this works? In more general terms we would like to know how to
combine probabilities of certain events to produce probabilities of “more
complicated” events. We do this in the next section.
“Solution” Our data in Table 6.1, where ages are split between 4 “sub-tables”, is not
best suited to answering this question. A better alternative is to “collect ages
together”, at fixed time intervals, as shown in Table 6.3. (As an exercise you should
produce the table for May-Jul 2008).
We look at a more detailed solution to Question 2 after we have seen how to
combine probabilities. Here we just note the following male probabilities for Feb-Apr:
• Age range 16-17: P(unemployed < 6 months) = 76/367 = 0.207 (21%)
• Age range 18-24: P(unemployed < 6 months) = 337/4205 = 0.080 (8%)
• Age range 25-49: P(unemployed < 6 months) = 187/9718 = 0.019 (2%)
• Age range 50+: P(unemployed < 6 months) = 61/4606 = 0.013 (1.3%)
We can clearly see these probabilities decrease with increasing age. Is this what you
would expect?
Table 6.3: Unemployment figures across age ranges with time (interval) fixed.
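The probabilities quoted in the bullet list above can be checked with a short Python sketch, using only the counts given in the text:

# Python sketch turning the Feb-Apr male counts quoted above into
# probabilities, following the frequency definition (1).
counts = {                      # (unemployed < 6 months, economically active)
    "16-17": (76, 367),
    "18-24": (337, 4205),
    "25-49": (187, 9718),
    "50+":   (61, 4606),
}

for age, (unemployed, active) in counts.items():
    print(age, round(unemployed / active, 3))   # decreases with increasing age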
7. Probability Laws
The following gives a more “formal structure” to probability manipulations. For a
general discussion we assume we have two “events”, conveniently labelled A and B,
with known probabilities of occurrence, and we do not enquire where these have
come from. These known probabilities apply to “simple events” and we want to
combine these values to determine the probabilities of “compound events” (which
are formed as combinations of the simple events).
Fig 7.1: Possible Sample Space Representations
An event A of interest will usually be a subset of the sample space S, i.e. just part of
S. (The terminology “subset” arises from the fact that we can phrase “events” in
terms of “sets”, and use set theory to discuss everything. We shall not do this, but lot
of books do in order to give more formal proofs of general results.)
Fig 7.2: An event A of interest Fig 7.3: An event A and its complement (not A)
[Two further diagrams show events A and B as regions within the sample space S.]
An Important Observation: In practice, we are usually not dealing with just two
events but with many more. It is then far simpler to deal with mutually exclusive
events (for the addition law) and with independent events (for the multiplication law),
since both laws extend very simply. If we have events A1, A2, A3, ... , An then
• if the events are mutually exclusive, P(A1 or A2 or A3 or ... or An) = P(A1) + P(A2) + P(A3) + .... + P(An) --- (1)
• if the events are independent, P(A1 and A2 and A3 and ... and An) = P(A1)P(A2)P(A3).....P(An) --- (2)
Example 7.1 A particular industry currently contains 100 firms and, from past
records, it is estimated the probability of any particular firm going bankrupt (within the
next year) is 5%. Determine the probability that:
(a) No firm goes bankrupt
(b) At least one firm goes bankrupt.
(c) Exactly two firms go bankrupt
What assumptions are you making in your calculations?
(Since, from Table 7.1, probabilities must lie between 0 and 1, we cannot use the figure 5% directly; we write it as 0.05.)
• Now we can also write
P(not A1) = P(not A2) = ..... = P(not A100) = 1 - 0.05 = 0.95
as the probability that an individual firm does not go bankrupt.
(a) P(no firm goes bankrupt) = P(Firm 1 does not go bankrupt and Firm 2 does not
go bankrupt and ..... and Firm 100 does not go bankrupt)
= P(not A1 and not A2 and not A3 and ... and not A100)
= P(not A1) P(not A2) P(not A3).....P(not A100)
assuming the events are independent. Hence
P(no firm goes bankrupt) = 0.95*0.95* ... *0.95 = 0.95¹⁰⁰ = 0.0059
You will need a calculator (or Excel) here. Note that, although each individual
probability is close to 1, the product is very close to zero. With a 5% chance of any
firm going bankrupt, there is only a 0.59% chance of no firm going bankrupt.
(c) This is a bit trickier. Given two specific firms, say F2 and F7, we can easily
compute, again using independence,
P(F2 and F7 go bankrupt) = P(A2 and A7) = P(A2)P(A7) = 0.05*0.05 = 0.0025
However, this is not the probability we want for two reasons:
• If we want exactly two bankruptcies, in addition to F2 and F7 going
bankrupt, we require the remaining 98 firms do not go bankrupt (for
otherwise we would have more than two bankruptcies). Thus
P(only F2 and F7 go bankrupt) = P(A2 and A7 and not A1 and not A3 and .....)
= P(A2)*P(A7)*P(not A1)* P(not A3)* ......
= 0.05² * 0.95⁹⁸ = 0.0000164
(Even though we have not explicitly written down all the events we can “clearly”
see what the corresponding probabilities must be!)
• We have also arbitrarily selected the firms F2 and F7 to go bankrupt. We can see
that, no matter which two we select, the same probability 0.0000164 will result.
So we need to decide in how many ways we can select the two firms. Often
such counting problems can be difficult, but here there is a simple solution.
- The first firm can be chosen in any one of 100 ways, and the second firm
in any of 99 ways. We multiply these numbers together and divide by 2 (to
avoid counting choices such as {F2 and F7} and {F7 and F2} as different).
- Number of possible choices = 100*99/2 = 4950. (See Unit 6, Section 2).
- We must now add our computed probability together 4950 times. Can you
see the addition law at work here?
P(exactly 2 bankrupt firms) = 4950*0.0000164 = 0.081
There is thus about an 8% chance of exactly two firms going bankrupt.
Note The solution we have given is far longer than you would normally give.
Typically you should make your calculations clear, but not necessarily explain
all the logical steps you have taken in computing the various probabilities.
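The whole calculation can be checked with a few lines of Python; part (b), which is not worked through above, is simply one minus the answer to part (a):

# Python sketch of the bankruptcy calculations in Example 7.1.
from math import comb

n, p = 100, 0.05

p_none = (1 - p) ** n                                       # (a): about 0.0059
p_at_least_one = 1 - p_none                                 # (b): complement of (a)
p_exactly_two = comb(n, 2) * p ** 2 * (1 - p) ** (n - 2)    # (c): about 0.081
print(round(p_none, 4), round(p_at_least_one, 4), round(p_exactly_two, 4))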
Solution We can turn Table 6.3 into one involving probabilities by simple division
(using “economically active” as the denominator), as we did with Table 6.2.
Table 7.2: Unemployment probabilities across age ranges with time (interval) fixed.
(a & b) Since each of the unemployment intervals are disjoint the corresponding
events are mutually exclusive. For example let
A1 = Unemployed < 6 months ; A2 = Unemployed 6-12 months
A3 = Unemployed 12-24 months ; A4 = Unemployed > 24 months
and A = Unemployed
The events A1 –A4 are mutually exclusive (amongst each other), and
A = A1 or A2 or A3 or A4
Then P(A) = P(A1 or A2 or A3 or A4)
= P(A1) + P(A2 ) + P(A3) + P(A4)
The appropriate sums, for each age range, and for males and females are given in
Table 7.2 for two of the time periods.
(c & d) Since each of the age ranges are disjoint the corresponding events are
mutually exclusive. For example let
B1 = Unemployed < 6 months in age range 16-17
B2 = Unemployed < 6 months in age range 18-24
B3 = Unemployed < 6 months in age range 25-49
B4 = Unemployed < 6 months in age range 50+
and B = Unemployed < 6 months.
The events B1 –B4 are mutually exclusive (amongst each other), and
B = B1 or B2 or B3 or B4
Then P(B) = P(B1 or B2 or B3 or B4)
= P(B1) + P(B2 ) + P(B3) + P(B4)
8. Tree Diagrams
Often a very convenient way to implement the addition and multiplication laws of
probability is to draw a so-called tree diagram. This gives a pictorial representation
of some, or all, possible outcomes, together with the corresponding probabilities.
Example 8.1 We have an initial investment of £1000 that, over any period of 1 year,
has one of two possible behaviours: an increase of 10% or a decrease of 10%.
These two outcomes occur with equal probability.
Fig 8.1: Tree diagram comprising all possible investment outcomes (£).
Note In the financial jargon the tree of Fig.8.1 is termed recombining. This means
that, for example, although the Year 2 value of £990 can be reached in two different
ways, the actual value attained is the same, i.e.
£1000*1.1*0.9 = £1000*0.9*1.1
This happens because the up and down factors (1.1 and 0.9) do not change over time,
so the two moves can be applied in either order.
• Because the up and down probabilities (0.5 each) also do not vary, every path
through the tree has the same probability, and this enables us just to count paths to
obtain the desired probabilities. Since, for example, the final investment value £1089
can be achieved in 3 different ways, and each way (path) occurs with probability 0.125,
P(investment value = £1089) = 3*0.125 = 0.375
In this way we end up with the probabilities shown in Table 8.1, and displayed
as a histogram in Fig.8.4(b).
Summary of Calculation
Fig 8.4: (a) Investment paths (b) Investment Probabilities (as histogram)
Comment The device of counting paths in Step 3 has “hidden” the explicit use of the
addition and multiplication laws. For example we can write
Investment value in Year 3 = £1089
• Each row of this description corresponds to a particular branch in the tree, over
which we multiply probabilities (indicated by the “and”). With equal probabilities
each row gives the same product 0.5³ = 0.125. (Note we are implicitly assuming
independence of returns each year – is this reasonable?)
• The different rows correspond to various distinct branches in the tree, over
which we add probabilities (indicated by the “or”). The simple addition law
applies since events on separate branches are mutually exclusive (why?). The
number of branches determines the number of terms in the sum.
Example 8.2 For the 3-year returns in Example 8.1 determine the
(a) mean and (b) standard deviation.
Solution (a) Look at the probability distribution of returns in Table 8.1. We treat this
like a frequency table and compute the mean return (see Unit 4 Section 6, eq. (5a))
as
x̄ = (1/n)[f1 x1 + f2 x2 + f3 x3 + .... + fn xn] = (1/n) Σ fi xi
If we interpret the relative frequency fi/n as a probability pi we obtain the result
x̄ = Σ pi xi --- (3)
and, similarly, for the variance
s² = Σ pi xi² - x̄² --- (4)
Thus, although the mean return will be £1000 there is considerable variation in the
actual returns (as specified in Table 8.1). We shall see how to interpret the precise
value of £174 in Unit 6, where we shall also discuss more fully the idea of a
probability distribution, and its description in terms of the mean and standard
deviation.
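As a check, the Python sketch below evaluates (3) and (4) for the Year 3 distribution; the four possible final values are derived from the stated setup (£1000 moving up or down by 10% each year for three years) and the probabilities are those of Table 8.1:

# Python sketch: mean and standard deviation of the Year 3 investment values.
values = [1000 * 1.1 ** up * 0.9 ** (3 - up) for up in range(3, -1, -1)]   # 1331, 1089, 891, 729
probs  = [0.125, 0.375, 0.375, 0.125]                                      # from Table 8.1

mean = sum(p * x for p, x in zip(probs, values))                  # (3): should be 1000
var  = sum(p * x * x for p, x in zip(probs, values)) - mean ** 2  # (4)
print(round(mean, 2), round(var ** 0.5, 2))                       # sd is roughly 174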
9. Conditional Probability
Example 9.1: We now come to a very important, although quite subtle, idea. We
return to Table 7.2 (or Table 6.3) and try to compare the 18-24 and 25-49 age groups
in terms of time unemployed. The difficulty is that the two age groups are not directly
comparable since they have different sizes (as measured by the “economically
active” values in Table 6.3). This is reflected in the probabilities in Table 7.2 adding to
different values.
The solution is to “normalise” the age categories so that their sum is the same. In
addition, by making this sum equal to 1, we ensure each set of probabilities define a
probability distribution. This is simply accomplished on dividing each row entry by
the corresponding row sum. In Table 9.1 we give all the (eight) probability
distributions that result from the left hand data in Table 7.2. In Fig. 9.1 we have
plotted three of these distributions separately, to emphasise that individually they
form a probability distribution. In addition we have plotted all three on a single
histogram which allows for easier comparisons. Observe the vertical (probability)
scale is (roughly) constant throughout.
Each row of Table 9.1 defines a (conditional) probability distribution.
For example, the four probabilities in one particular row of Table 9.1 are applicable when we are dealing with Males in the Age Group 18-24. To express
this compactly, with the minimum of words, we use the notation of Example 7.2
applied to males, i.e.
A1 = Unemployed < 6 months ; A2 = Unemployed 6-12 months
A3 = Unemployed 12-24 months ; A4 = Unemployed > 24 months
In addition we set
C1 = male in age group 16-17 ; C2 = male in age group 18-24
C3 = male in age group 25-49 ; C4 = male in age group 50+
Then our 4 probabilities translate into the following statements:
P(A1 | C2) = 0.6569 ; P(A2 | C2) = 0.1365 ; P(A3 | C2) = 0.1248 ; P(A4 | C2) = 0.0819 .
For example, the first statement is read as “the probability of being unemployed < 6
months, given you are a male in the age group 18-24, is 0.6569”.
The general notation P(A | B) means the probability of A occurring given that B has
already occurred.
Example 9.2: There is nothing special about normalising the row sums to be 1. We
can equally well make the column sum add to 1, and this defines yet further
conditional probability distributions.
To make this clear we return to the left table in Table 7.2, and split this into males
and females. Evaluating the column sums gives Tables 9.2 and 9.3.
Interpretation: As an example, the first entry 0.6477 in Table 9.2(b) gives the
probability of being in the age group 16-17 given that you have been unemployed for
< 6 months. In the symbolism of Example 9.1 above P(C1 | A1) = 0.6477.
Note: 1. P(C1 | A1) = 0.6477 ≠ P(A1 | C1) = 0.7917 (from Table 9.1)
Thus the symbols P(A | B) and P(B | A) are, in general, not interchangeable, and we
must be very careful in our use of conditional probabilities.
2. Histograms of the male and female conditional distributions in Table 9.2(b) and
9.3(b) are given in Fig.9.2. Here we have superimposed the four unemployment
categories together in a single histogram for males and females. You should note
carefully the different axes labels in Figs.9.1 and 9.2, and how these relate to the
corresponding conditional distribution.
We have a much greater chance of choosing a black ball, and hence winning £100.
• Adopting our frequency interpretation of probability, these probabilities indicate
that, if we imagine repeatedly drawing a counter, then 75% of the time we will
win £100, and 25% of the time we will win nothing.
• To quantify this we calculate the expected winnings. This is just the mean value
of our winnings, computed using (3), but the “expected value” terminology is
used since it is more descriptive. We obtain
Expected winnings = £100*0.75 + £0*0.25 = £75
• Of course we have a cost, of £50, which we must pay to enter the game, so
Expected profit = £75 - £50 = £25
Conclusions At first sight we should “clearly” play the game. However, our expected
profit is what we can “reasonably expect” to win in a long run series of games (unless
we run into a lot of “bad luck”). But we only expect to play the game once.
• Under these circumstances (one play of the game only) we have to decide
individually whether we are prepared to take the risk of losing.
• Different people will have different “risk appetites”. If you are risk averse, as
most of us are, you may well be unwilling to take the risk (of losing). But if you
are a risk taker you may well take the opposite view.
• In addition your considerations will undoubtedly be complicated by a
consideration of your current financial wealth. If you have £100,000 in the bank
you are probably not unduly worried about the risk of losing £50. But, if you only
have £100 in the bank, £50 is a big investment to make.
To answer the question posed in Example 10.1: whether you should play or not
depends on your risk profile, i.e. your attitude to risk.
The basic moral from this example is that, although one can compute probabilities,
and in a variety of ways, the resulting number may not be the determining factor in
any investment strategy you may adopt (although the probability value will play a
part). How one assesses the risks involved may well be a “behavioural finance
issue”, beyond the realms of mere definitions of probability. Feller’s quote at the
beginning of this unit reflects precisely this view.
We can reinforce this latter view by actually computing the risk in Example 10.1 using
the standard deviation (or its square the variance) as a risk measure, as we
advocated in Unit 4 Section 10. Using (4)
Variance of winnings = £100²*0.75 + £0²*0.25 - £75² = 1875 (in units of £²)
What does this value actually tell us about the risk involved? Do we judge it to be
“large” or “small” relative to the expected winnings?
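For reference, the expected winnings and their variance can be reproduced with the following short Python sketch (the game pays £100 with probability 0.75 and £0 with probability 0.25):

# Python sketch: expected winnings and variance for the game described above.
winnings = [100, 0]
probs    = [0.75, 0.25]

expected = sum(p * w for p, w in zip(probs, winnings))                       # 75
variance = sum(p * w * w for p, w in zip(probs, winnings)) - expected ** 2   # 1875
print(expected, variance, variance ** 0.5)   # standard deviation is about 43.3 pounds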
Question If we work with the profit (winnings – entrance fee) what value do we
obtain for the standard deviation?
In Fig.11.1 six lotteries are depicted. Each lottery has 100 tickets, represented by
the tally marks. The values at the left give the prizes that are won by tickets in that
row. For example, in lottery (a), only one ticket wins the highest prize of $200, two
tickets win the next highest prize of $187 and so on.
Exercise You must decide which lottery you wish to participate in. In fact you are
required to order the six lotteries in the order you would most like to participate, on
the assumption that you are risk averse, i.e. you will avoid risk unless you are
suitably compensated by extra returns (profits).
A sensible way to look at the problem is to estimate the mean (expected return) and
variance (risk) of each lottery (probability distribution).
We shall return to this exercise in the tutorial. Observe how these lotteries bring
together several ideas we have previously looked at:
• Symmetric and skewed histograms/distributions (Units 3 – 5)
• Stem and leaf plots, which is really what Fig.11.1 represents (Unit 4).
• Probability distributions (Unit 5)
• Cumulative frequency curves (Unit 4). See tutorial exercises.
12. References
Learning Outcomes
At the end of this unit you should be familiar with the following:
• Calculate probabilities associated with the binomial distribution.
• Appreciate how the binomial distribution arises in finance.
• Basic properties of the binomial distribution.
• Recognise the central role of the normal distribution.
• Basic properties of the normal distribution.
1. Introduction
In Unit 5 we have seen that the term “probability distribution” refers to a set of (all)
possible outcomes in any given situation, together with their associated probabilities.
In practice relatively few such distributions are found to occur, and here we discuss
the most important of these from a financial perspective. The ideas we discuss in this
unit form the foundations for much of the theoretical developments that underlie most
of the ideas we discuss in the remainder of the module.
2. Binomial Distribution
The investment Example 8.1 in Unit 5 provides an illustration of the so-called
binomial distribution, which applies in the following general circumstances:
Binomial experiment An “experiment” consists of a series of n “trials”. Each trial can
result in one of two possible outcomes
P(x successes) = C(n, x) p^x q^(n-x) --- (1)
⎛n⎞
Note In (1) ⎜⎜ ⎟⎟ is called the binomial coefficient for algebraic reasons (not really
⎝x⎠
connected to our present purpose). This gives the number of different ways x
successes can occur (equivalent to the number of paths in our investment example).
It can be computed in a variety of ways:
C(n, x) = n!/(x!(n - x)!)    --- (2a)
where the factorial function n! is defined as the product of all integers from 1
up to n, i.e. n! = n(n - 1)(n - 2) ... 3 × 2 × 1    --- (2b)
Without some mathematical background the version (2) may seem unnecessarily
complicated, but it does arise very naturally. If you are uncomfortable with (2)
compute binomial coefficients using your calculator or Excel. In addition:
• Factorials can be computed in Excel using the FACT function.
• The complete result (1) can be computed directly in Excel, using the
BINOMDIST function. (Investigate this function as an Exercise; it will prove
useful in Tutorial 6).
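As a cross-check on the Excel route, here is a minimal Python sketch (offered only as an illustration, not as part of the module materials) that evaluates (1) directly; math.comb plays the role of the binomial coefficient.

from math import comb

def binomial_prob(n, x, p):
    # P(x successes in n trials) from formula (1), with q = 1 - p
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)

# Reproduces Example 2.1 below: n = 3 years, p = 0.5
for x in range(3, -1, -1):
    print(x, binomial_prob(3, x, 0.5))    # 0.125, 0.375, 0.375, 0.125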
Example 2.1: We return to Example 8.1 of Unit 5. Here our experiment is observing
investment results over n = 3 years (trials). On each trial we arbitrarily define
“Success” = increase in investment value (by 10%) with probability p = 0.5
“Failure” = decrease in investment value (by 10%) with probability q = 0.5
Then we can compute the following:
P(3 successes) = C(3, 3) × 0.5³ × 0.5⁰ = 1 × 0.125 × 1 = 0.125
P(2 successes) = C(3, 2) × 0.5² × 0.5¹ = 3 × 0.25 × 0.5 = 0.375
P(1 success) = C(3, 1) × 0.5¹ × 0.5² = 3 × 0.5 × 0.25 = 0.375
P(0 successes) = C(3, 0) × 0.5⁰ × 0.5³ = 1 × 1 × 0.125 = 0.125
Notes: Observe the following points:
• The second binomial coefficient can easily be computed using (2) as
C(3, 2) = 3!/(2!(3 - 2)!) = (3 × 2 × 1)/((2 × 1) × 1) = 3
• This coefficient counts the number of paths which yield the final outcomes of 2
“successes”. We need to translate this into the investment going up twice and
down once. The corresponding paths are shown in Fig.8.3 (Unit 5).
• The above calculations reproduce the probabilities in Table 8.1 of Unit 5. The
advantages of the current algebraic formulation are twofold:
- We can perform calculations without drawing any (tree) diagram.
- We can identify the number of paths without having to locate them.
• The structure of the above calculations is very simple. In (1) the p^x term gives
the probability of x successes along a particular path, and the q^(n - x) term the
probability of (n - x) failures. These probabilities are multiplied in accordance
with the multiplication law (why?). Finally we need to identify the number of
paths leading to x successes and multiply by this (why?). In essence (1)
provides a very compact representation of the multiplication and addition laws
applied to a binomial experiment (repeated trials with only two possible
outcomes each time).
For a binomial experiment with n trials and success probability p, the mean and standard
deviation of the number of successes are
Mean = np    --- (3a)    and    Standard deviation = √(npq)    --- (3b)
It is important to realise the mean and standard deviation apply to the number of
successes (x), and these may not always represent quantities of interest.
Example 3.1: A fair coin is tossed 5 times. Determine (a) the probability of obtaining
exactly one head, (b) the mean number of heads and (c) the standard deviation of the
number of heads.
Then we are in a binomial situation with n = 5 trials, each of which can result in only
one of two outcomes. Since the coin is fair
p = P(Success) = 0.5 and q = P(Failure) = 0.5 (with p + q = 1)
(a) From (1) P(1 success) = C(5, 1) × 0.5¹ × 0.5⁴ = 5 × 0.5 × 0.0625 = 0.15625
since C(5, 1) = 5!/(1!(5 - 1)!) = (5 × 4 × 3 × 2 × 1)/((4 × 3 × 2 × 1) × 1) = 5
(More intuitively a single head can occur in one of 5 ways, i.e. HTTTT or THTTT or
TTHTT or TTTHT or TTTTH.) There is thus only about a 16% chance of obtaining a
single head when a coin is tossed 5 times.
(b) From (3a) Mean number of heads = 5*0.5 = 2.5
Obviously we can never obtain 2.5 heads on any throw (experiment). The
interpretation is the long run frequency one: if we repeat the experiment many times
we will obtain, on average, 2.5 heads.
(c) From (3b) Standard deviation of number of heads = √(5 × 0.5 × 0.5) = 1.118
The meaning of this quantity is best illustrated by simulation.
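One way to see what the standard deviation is telling us is to repeat the experiment a large number of times. The Python sketch below is an illustrative stand-in for the module's Excel simulation (the function name and the 10,000 repetitions are choices made here, not taken from the notes).

import random
from statistics import mean, pstdev

def heads_in_5_tosses():
    # number of heads when a fair coin is tossed 5 times
    return sum(random.random() < 0.5 for _ in range(5))

results = [heads_in_5_tosses() for _ in range(10000)]   # 10,000 repetitions of the experiment
print(mean(results), pstdev(results))                   # close to np = 2.5 and √(npq) = 1.118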
Notes: 1. We can compute, as in Example 2.1, the (binomial) probabilities for all
possible outcomes in Example 3.1, i.e. 0 heads, 1 head, 2 heads and so on. The
resulting histogram of Fig.3.1 gives the theoretical probability distribution for 5 coin
tosses. We shall compare this to the simulated (experimental) version in Section 4.
2. If we draw a tree diagram of the situation in Example 3.1 we obtain something like
Fig.3.2. Although the tree is not recombining, you should note the similarity with
Fig.8.4 of Unit 5 relating to our investment example - see Section 5 below.
[Figure: simulated histograms for (a) Experiment 1 and (b) Experiment 2]
An Important Point We can improve the agreement between the simulated and
theoretical distributions by conducting larger simulations. For example in place of our
100 simulations we may use 500; you should explore this.
5. Investment Example
Important results emerge when we alter parameters in our simulation. Rather than
doing this for the coin tossing example (where a value such as p = P(Head) = 0.3 is
not what we would expect from a fair coin), we return to our investment example of Unit 5.
Example 5.1: The Excel file Binomial_Investment.xlsx gives instructions for
generating our investment values of Example 8.1 of Unit 5. We have extended the
time period to 5 years, as shown in Fig.5.1, really to obtain more representative
histograms. The parameter values we can vary are the following:
• p = P(investment increases over 1 year) = 0.5 in Figs.5.1 and 5.2
• VUp = % increase in investment = 10% in Figs.5.1 and 5.2
• VDown = % decrease in investment = 10% in Figs.5.1 and 5.2
Varying these values leads to the results displayed in Fig.5.3, and we observe the
following important points:
• With p = 0.5 the investment values (IVs) are symmetric about the mean of
£1000.
• With p > 0.5 the IVs are skewed to the right, with larger values having greater
probability than smaller values. The mean is correspondingly > £1000 and,
importantly, the variation around the mean (as measured by the standard
deviation) also increases.
• Varying p changes the probabilities of the various outcomes (IVs) but not the
outcomes themselves.
• Varying VUp and VDown leaves the probabilities unchanged but changes the
values of the various outcomes.
Fig. 5.3: Varying Parameters in Fig.5.1
• Increasing (decreasing) VUp increases (decreases) the IVs, but the mean and
standard deviation change in unexpected ways. For example in Fig.5.3(b) the
mean has increased by £274.23 above £1000, but in Fig.5.3(c) the decrease is
only £225. In addition the standard deviation is (very much) different in the two
cases, despite the fact that p remains unchanged and the values of VUp and
VDown are “symmetrical”.
• As we alter parameters, the changes in mean and standard deviation are very
important, and you should study them carefully in Fig.5.3.
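If you want to explore these effects outside Excel, the Python sketch below (an illustrative stand-in for the Binomial_Investment.xlsx calculation, with the parameter names p, VUp and VDown borrowed from the text) computes the exact terminal-value distribution over 5 years together with its mean and standard deviation, so the consequences of varying the parameters can be checked directly.

from math import comb, sqrt

def terminal_distribution(start=1000, years=5, p=0.5, v_up=0.10, v_down=0.10):
    # exact (value, probability) pairs for the binomial investment model
    dist = []
    for ups in range(years + 1):
        value = start * (1 + v_up) ** ups * (1 - v_down) ** (years - ups)
        prob = comb(years, ups) * p ** ups * (1 - p) ** (years - ups)
        dist.append((value, prob))
    return dist

dist = terminal_distribution()                       # symmetric case: p = 0.5, VUp = VDown = 10%
mean = sum(v * pr for v, pr in dist)
var = sum(v * v * pr for v, pr in dist) - mean ** 2
print(round(mean, 2), round(sqrt(var), 2))           # mean is exactly £1000 in this symmetric case

Re-running the last three lines with, say, p = 0.6 or v_up = 0.2 mirrors the comparisons made in Fig.5.3.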
An Important Point The probability distributions in Fig.5.3 are the (exact)
theoretical ones and, as we can see, their shapes and summary measures depend
on the parameters of the distribution. In some situations we are more interested in
simulating sample paths:
• The actual paths may be too numerous to compute. For example, most
investments do not just have 2 possible future values, but rather hundreds. (A
stock may increase, or decrease, by any amount during each time frame.)
• We may not know how sample paths evolve forward in time. When valuing
options we only have terminal (future) values available, and we have to try and
work out how values evolve backwards in time.
The agreement between simulated and theoretical distributions can again be improved by
using more (than 100) simulations; you are asked to investigate this in Practical
Exercises 5.
Fig. 6.1: Theoretical Histogram for n = 25 Fig. 6.2: Simulated Histogram for n = 25
Feature 1 Many practical situations seem to involve random variables which follow a
normal distribution, either exactly or approximately. This situation arises for reasons
connected to Fig.6.5. We can regard the number of heads as the sum of a large
number (here 100) of independent random variables (each variable taking the value
1 if a head occurs, and the value 0 otherwise). In such instances we can give
theoretical arguments that lead to the normal distribution. In general we can expect
the sum of a large number of independent random variables to lead to a normal
distribution, and a lot of practical situations fall into this category.
Feature 2 When we take samples from a population, and examine the average
value in the sample, the normal distribution invariably arises. We shall look at this
situation in detail in Unit 7.
Feature 3 The normal distribution is “very well behaved mathematically”, and this
leads to rather simple general theoretical results. Although we shall not pursue any of
this we give a few details in the next section.
Although we invariably refer to the normal distribution, there are in fact infinitely many
of them! Some of these are depicted in Fig.7.1. However the situation is not nearly as
complicated as it might appear since all normal distributions have a simple relation
connecting them. The key is that, similar to the binomial distribution, the normal
distribution is completely described by just two parameters. These are actually
quantities we have met several times before, the mean and the standard deviation.
Terminology We use the symbolism N(μ, σ²) to denote a normal distribution with
mean μ (pronounced mu) and standard deviation σ (pronounced sigma). The square
of the standard deviation, σ², is termed the variance. Its importance is explained in
Section 8.
Comment Do not let the use of Greek letters (μ and σ) for the mean and standard
deviation confuse you. They are simply used in place of the letters we have so far
been using (x̄ and s) to distinguish between population values and sample values.
We shall discuss this further in Unit 7.
You should observe how the mean and standard deviation affect the location and
shape of the normal curve in Fig.7.1.
• The mean determines the location of the “centre” of the distribution. Since the
curve is symmetric (evident from (4) in the next section), the mean equals the
mode (and the median), so the mean also gives the highest point on the curves.
• As we increase the standard deviation the curves become “more spread out”
about the centre. We shall make this more precise in Section 9.
• Also note how the vertical scale changes as we alter the standard deviation. In
addition observe we do not label the y-axis “Probability” as in, for example, the
discrete binomial distribution in Fig.6.5. As we emphasise in Sections 8 and 9 it
is not the height of the curve, but rather the area under the curve, that
determines the corresponding probability.
(Compare these results to those in Section 6.)
We shall use the normal distribution most of the time from now on. When doing this
you should bear three points in mind:
• We can always think of a normal distribution as arising from data being
collected, compiled into a frequency table and a histogram being drawn; a
“smoothed out” version of the histogram, as in Fig.6.5, yields a normal
distribution.
• There are instances when the normal distribution is not applicable, and these
occur when distributions are skewed. Examples include wages and waiting
times.
• In practice, and where feasible, always draw histograms to check whether the
assumption of a normal distribution is appropriate.
This is of the same general form as (4), and hence represents another normal
distribution. This in turn means when we add two normal distributions together
(equivalent to multiplying their defining equations) we obtain another normal
distribution. In particular this normal distribution has
• mean the sum of the two individual means, and
• variance equal to the sum of the two individual variances.
It is the fact that the variances (square of the standard deviation) add together that
makes the variance a more important quantity (at least theoretically) than the
standard deviation. These are very important properties of the normal distribution, to
which we shall return in Unit 7.
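A quick simulation makes these two properties concrete. The Python sketch below is illustrative only, with arbitrarily chosen means and standard deviations; it adds two independent normal variables and checks that the mean and variance of the sum are close to the sums of the individual means and variances.

import random
from statistics import mean, pvariance

x = [random.gauss(10, 2) for _ in range(100000)]   # N(10, 2²), variance 4
y = [random.gauss(5, 3) for _ in range(100000)]    # N(5, 3²), variance 9
s = [a + b for a, b in zip(x, y)]                  # the sum of the two variables
print(mean(s), pvariance(s))                       # close to 10 + 5 = 15 and 4 + 9 = 13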
Note Moving from a discrete, to a continuous, distribution is actually a little more
complicated than we have described. We shall discuss the source of the difficulties in
Unit 8 when we consider how to draw reliable conclusions from statistical analysis of
data.
and corresponds to Fig.8.1. A set of tabulated areas is given in Table 9.1 on the
following page. Note carefully the following points:
• The underlying variable is always labelled Z for a standard normal distribution,
and is often referred to as a standard normal variable.
- Z is used as a reference for computational purposes only.
- Z has no underlying interpretation (such as the number of heads).
- Z is dimensionless, i.e. just a number without any units attached.
• It is not necessary to know the values (6) in order to read the corresponding
tables, but these parameter values are helpful in both understanding, and
remembering, what happens in the general case (discussed below).
• Only one specific type of area is tabulated – between 0 and the value of z
chosen. The figure at the top of Table 9.1 is a reminder of this. (Other tables
you may find in books may tabulate different areas, so some care is needed.)
• Only positive z values, up to about 3.5, are tabulated – see Fig.8.1.
Three properties of the standard normal curve are used repeatedly:
P1 Total area under curve = 1
P2 The curve is symmetric (about z = 0), so
Area to left of negative z-value = Area to right of positive z-value
P3 P1 and P2 imply
(Area to left of z = 0) = (Area to right of z = 0) = 0.5
Example 9.1 If Z is a standard normal variable, determine the probabilities:
(a) P(Z < 1.75), i.e. the probability that Z is less than 1.75
(b) P(Z < -1.6) (c) P(1.4 < Z < 2.62)
Solution You should always draw a sketch to indicate the area required. In addition
you may need further sketches to actually compute the area.
(a) In pictures: draw the curve and split the required area (to the left of z = 1.75) into the
area to the left of 0 (0.5, by P3) and the area between 0 and 1.75 (0.4599, from the tables).
In symbols P(Z < 1.75) = P(Z < 0) + P(0 < Z < 1.75) = 0.5 + 0.4599 = 0.9599
Here the formulae just reflect the thought processes displayed in the pictures!
Table 9.1: Standard Normal Distribution
Entries in the table give the area under the curve between the mean and z standard
deviations above the mean. For example, with z = 1.02, the area under the curve between
the mean (of zero) and z is .3461.
z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
.0 .0000 .0040 .0080 .0120 .0160 .0199 .0239 .0279 .0319 .0359
.1 .0398 .0438 .0478 .0517 .0557 .0596 .0636 .0675 .0714 .0753
.2 .0793 .0832 .0871 .0910 .0948 .0987 .1026 .1064 .1103 .1141
.3 .1179 .1217 .1255 .1293 .1331 .1368 .1406 .1443 .1480 .1517
.4 .1554 .1591 .1628 .1664 .1700 .1736 .1772 .1808 .1844 .1879
.5 .1915 .1950 .1985 .2019 .2054 .2088 .2123 .2157 .2190 .2224
.6 .2257 .2291 .2324 .2357 .2389 .2422 .2454 .2486 .2518 .2549
.7 .2580 .2612 .2642 .2673 .2704 .2734 .2764 .2794 .2823 .2852
.8 .2881 .2910 .2939 .2967 .2995 .3023 .3051 .3078 .3106 .3133
.9 .3159 .3186 .3212 .3238 .3264 .3289 .3315 .3340 .3365 .3389
1.0 .3413 .3438 .3461 .3485 .3508 .3531 .3554 .3577 .3599 .3621
1.1 .3643 .3665 .3686 .3708 .3729 .3749 .3770 .3790 .3810 .3830
1.2 .3849 .3869 .3888 .3907 .3925 .3944 .3962 .3980 .3997 .4015
1.3 .4032 .4049 .4066 .4082 .4099 .4115 .4131 .4147 .4162 .4177
1.4 .4192 .4207 .4222 .4236 .4251 .4265 .4279 .4292 .4306 .4319
1.5 .4332 .4345 .4357 .4370 .4382 .4394 .4406 .4418 .4429 .4441
1.6 .4452 .4463 .4474 .4484 .4495 .4505 .4515 .4525 .4535 .4545
1.7 .4554 .4564 .4573 .4582 .4591 .4599 .4608 .4616 .4625 .4633
1.8 .4641 .4649 .4656 .4664 .4671 .4678 .4686 .4693 .4699 .4706
1.9 .4713 .4719 .4726 .4732 .4738 .4744 .4750 .4756 .4761 .4767
2.0 .4772 .4778 .4783 .4788 .4793 .4798 .4803 .4808 .4812 .4817
2.1 .4821 .4826 .4830 .4834 .4838 .4842 .4846 .4850 .4854 .4857
2.2 .4861 .4864 .4868 .4871 .4875 .4878 .4881 .4884 .4887 .4890
2.3 .4893 .4896 .4898 .4901 .4904 .4906 .4909 .4911 .4913 .4916
2.4 .4918 .4920 .4922 .4925 .4927 .4929 .4931 .4932 .4934 .4936
2.5 .4938 .4940 .4941 .4943 .4945 .4946 .4948 .4949 .4951 .4952
2.6 .4953 .4955 .4956 .4957 .4959 .4960 .4961 .4962 .4963 .4964
2.7 .4965 .4966 .4967 .4968 .4969 .4970 .4971 .4972 .4973 .4974
2.8 .4974 .4975 .4976 .4977 .4977 .4978 .4979 .4979 .4980 .4981
2.9 .4981 .4982 .4982 .4983 .4984 .4984 .4985 .4985 .4986 .4986
3.0 .4986 .4987 .4987 .4988 .4988 .4989 .4989 .4989 .4990 .4990
3.1 .4990 .4991 .4991 .4991 .4992 .4992 .4992 .4992 .4993 .4993
3.2 .4993 .4993 .4994 .4994 .4994 .4994 .4994 .4995 .4995 .4995
3.3 .4995 .4995 .4995 .4996 .4996 .4996 .4996 .4996 .4996 .4997
3.4 .4997 .4997 .4997 .4997 .4997 .4997 .4997 .4997 .4997 .4998
3.5 .4998 .4998 .4998 .4998 .4998 .4998 .4998 .4998 .4998 .4998
3.6 .4998 .4998 .4998 .4999 .4999 .4999 .4999 .4999 .4999 .4999
(b) In pictures: draw the curve and use symmetry; the required area (to the left of z = -1.6)
equals the area to the right of z = 1.6, which is the area to the right of 0 (0.5, by P3) minus
the area between 0 and 1.6 (0.4452, from the tables).
In symbols P(Z < -1.6) = P(Z > 1.6) = P(Z > 0) - P(0 < Z < 1.6) = 0.5 - 0.4452 = 0.0548
Since this is only roughly 5%, any sketch of the area will not really be to scale.
(c) In pictures: the required area (between 1.4 and 2.62) is the area between 0 and 2.62
(0.4956, from the tables) minus the area between 0 and 1.4 (0.4192, from the tables).
In symbols P(1.4 < Z < 2.62) = P(0 < Z < 2.62) - P(0 < Z < 1.4) = 0.4956 - 0.4192 = 0.0764
The procedure defined by (7) is often termed “standardizing a variable” and means
expressing the size of a (random) variable relative to its mean and standard
deviation. Once we have a Z value available probabilities are again found as in
Example 9.1.
Example 9.2 IQ’s (Intelligence Quotients) of schoolchildren are normally distributed
with mean = 100 and standard deviation = 15. If a school child is selected at random,
determine the probability that their IQ is
(a) over 135 (b) between 90 and 120 ?
Solution: (a) With X = 135   Z = (135 - 100)/15 = 2.33 (to 2D)
Hence P(X > 135) = P(Z > 2.33) = 0.5 - 0.4901 = 0.0099, i.e. about 0.01.
Note we may also phrase our answer in the form “There is a 1% chance of a
randomly chosen school child having an IQ greater than 135.”
(b) Here we have two x values and need to compute the two corresponding Z values.
With X1 = 90   Z1 = (90 - 100)/15 = -0.67 (to 2D)
and with X2 = 120   Z2 = (120 - 100)/15 = 1.33 (to 2D)
Then P(90 < X < 120) = P(-0.67 < Z < 1.33) = 0.2486 + 0.4082 = 0.6568
using sketches of the two areas (between 0 and 0.67, and between 0 and 1.33).
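If Python is available, the standard-library NormalDist class gives a convenient cross-check on these table-based answers (this is offered purely as an illustration; the module's own computations use Table 9.1 and Excel).

from statistics import NormalDist

z = NormalDist()                       # standard normal: mean 0, standard deviation 1
print(z.cdf(1.75))                     # P(Z < 1.75) ≈ 0.9599  (Example 9.1(a))
print(z.cdf(-1.6))                     # P(Z < -1.6) ≈ 0.0548  (Example 9.1(b))
print(z.cdf(2.62) - z.cdf(1.4))        # P(1.4 < Z < 2.62) ≈ 0.0764  (Example 9.1(c))

iq = NormalDist(100, 15)               # Example 9.2: IQs with mean 100 and standard deviation 15
print(1 - iq.cdf(135))                 # P(IQ > 135) ≈ 0.0098, i.e. about 1%
print(iq.cdf(120) - iq.cdf(90))        # P(90 < IQ < 120) ≈ 0.656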
on the web page, together with the graphics shown in this section (on the Graphics
sheet).
We are trying to see patterns in the data, hopefully with a future view to modelling
(or predicting) exchange rates. Overleaf we have drawn several graphs (discussed in
previous units).
• The line (time series) plot of Fig.10.2 shows exchange rate fluctuations without
any evident pattern.
• The histogram of Fig.10.3 appears to show two separate “normal like”
distributions. However this is very difficult to interpret since we have no
indication of the “time sequence” when particular exchange rate values
occurred, and this is a crucial feature of the data (as Fig.10.2 shows).
• Since we have found little apparent pattern in the exchange rates themselves,
our next focus is on changes in the rate. In Figs.10.4 and 10.5 we plot the
changes themselves, and also the percentage changes. Look back at Practical
Units 2 and 3 to recall how these changes are computed in Excel.
Fig.10.2: Time Series Plot of Exchange Rate Fig.10.3: Histogram of Exchange Rate
• It is difficult to see any patterns in the time series of either set of rate changes.
However, if we look at the corresponding histograms the situation changes
dramatically. In Figs.10.6 and 10.7 we can “clearly see” the presence of normal
distributions. Thus, although successive changes over time show no (apparent)
pattern, the changes “all taken together” over the complete time range show
clear evidence of being normally distributed. Many questions now naturally
arise:
• Why should time series and histograms give such completely different results?
• Do similar patterns appear in other exchange rates?
• Do similar patterns appear in other financial quantities (stock prices, GDP,
financial ratios and so on)?
• Once we convince ourselves a normal distribution is appropriate – see Section
11 below – we can start to do some calculations and estimate various
probabilities.
The last result in Fig.12.1 that 99.7% of a normal distribution lies within 3 standard
deviations of the mean allows us to state that, roughly,
• Smallest data value = Mean – 3*Standard Deviation
• Largest data value = Mean + 3*Standard Deviation
These limits are often useful when drawing appropriate normal distributions, or when
trying to assess the range (Largest – Smallest) of a data set when the mean and
standard deviation are known. These limits often go by the name “the 3σ rule”.
Learning Outcomes
At the end of this unit you should be familiar with the following:
• Understand the concept of sampling.
• Appreciate how the normal distribution arises in sampling.
• Understand the Central Limit Theorem, and its limitations.
• Recognize that, with unknown variance, the t - distribution is required.
• Understand how t-tables are used to compute probabilities.
• Recognise that the square of a normally distributed variable gives rise to the
chi-square distribution (χ²).
• Recognise that the ratio of two chi-squared variables gives rise to the F-
distribution.
• Understand how F-tables are used to compute probabilities.
1. Introduction
From a technical point of view this unit is the most complicated in the entire module,
and centres around two fundamental ideas:
• The need to take samples from a population, and the consequences this has
for the “structure” of the possible samples we may take.
• If a random variable (X) follows a normal distribution, then functions of X (such
as X2) will follow a different distribution.
By the end of the unit we will have met three further probability distributions (t, χ² and
F), all dependent on the normal distribution in one way or another. Taken together
use of these four distributions will account for almost all the statistical analysis you
are likely to encounter. Rather than having detailed technical knowledge of all these
distributions, it is far more useful to understand the general situations in which they
arise, and the uses to which they are put. This will enable you to make sense of the
statistical output of most software packages (Excel included).
2. Why Sample?
In very general terms we wish to deduce information about all “items of a particular
type” by studying just some of these items. The former is termed a population, and
the latter a sample (taken from the population). It is important to realise the term
“population” will not, in general, have its “usual interpretation” as a large group of
people (or possibly animals).
• In most situations we cannot examine every member of the population, for example because:
- It is too costly.
- It involves testing an item to destruction (assessing lifetimes).
- We are not really sure who is in the population (future potential buyers).
• We are trying to use sample information in order to estimate population
information. How well we can do this depends on how representative of the
whole population our sample is.
• If we take more than one sample we are almost certain to obtain different
results from the various samples. What we hope is that sample results will not
vary very much from each other if our samples are representative (of the
population). In practice this means we can use the results from any one sample,
and do not need to take more than one sample provided we can guarantee our
sample really is representative of the population.
The situation is summarised in Fig.2.1.
[Fig.2.1: a population with possible samples (Sample 1, Sample 2, ...), together with the
questions “Is this sample representative?” and “Would another sample give similar results?”]
mean of these 1000 values was calculated from the data. The mean height of all
females in the targeted age range is unfortunately unknown. However it can be
thought of as follows: if the height of every female in the targeted age range could be
measured then the population mean would be the mean height of these numbers.
If the manufacturer had been quite lazy and only taken a sample of 50 individuals, or
worse still a sample with only 5 individuals, would you expect the results to be as
reliable? Intuitively we feel that a bigger sample must be better. But (see Fig.4.1) in
what way is the mean calculated from a sample of size 50 better than one calculated
from one of size 5? To answer this question we need to know how the mean
calculated from different samples of the same size can vary.
If more than one sample of the same size was drawn from the population, as in
Fig.4.2, how much variation would you expect amongst the sample means? We know
that because of the sampling process the sample mean is unlikely to be exactly the
same as the true population mean. In some samples just by chance there will be too
high a proportion of tall women, and in others too low a proportion, compared with
the population as a whole.
[Fig.4.1: One population sampled with sample sizes 1000, 50 and 5, giving means x̄1, x̄2 and x̄3]
[Fig.4.2: A population (mean μ) sampled repeatedly; sample no. 1, sample no. 2, ..., sample no. k
have means x̄1, x̄2, ..., x̄k, giving a group of sample means x̄1, x̄2, ..., x̄k]
We are led to the general situation of Fig.4.3 with four important components:
• Our population comprising all items of interest.
• Our sample comprising those parts of the population we have examined.
• A probability distribution describing our population, as discussed in Unit 6.
• A probability distribution describing our sample, to be discussed in Section 5.
Population Sample
Comprises all units Comprises randomly selected units
Too large to study directly Small enough to study directly
Mean μ usually unknown Mean x̄ can be computed, i.e. known
StDev σ usually unknown StDev s is known
Example 4.2: The Excel file SingleDice.xlsx allows you to simulate either
(a) 36 throws, or (b) 100 throws
of a fair dice. In either case a histogram of the results is produced. Importantly, by
pressing F9 you can repeat the calculation (simulation) as often as you wish.
Table 4.1 lists one possible set of simulated results, together with a frequency table
of the results. For example a score of 5 actually occurred 10 times in the 36 throws.
The corresponding histogram is shown in Fig. 4.4(a), and histograms for three further
simulations are also given.
Table 4.1: Simulation of 36 throws of a fair dice, with corresponding frequency table
The most noticeable feature of Fig.4.4 is the lack of an “obvious pattern” in the
histograms. An equivalent, but more informative, way of expressing this is to say
there is “large variability present”. Another way of stating this is to ask “What will the
next histogram (Simulation 5) look like?” We are forced to conclude, on the basis of
the first four simulations, that we do not know what simulation 5 will produce.
In the Practical Exercises P5, Q5 you are asked to look at this example further.
Example 4.3: The Excel file SampleMean.xlsx allows you to perform simulations
that are superficially similar to Example 4.1, but actually are fundamentally different
in character. In light of the results in Example 4.1 we realise that, when there is large
variability present within the data, there is little point in trying to predict “too
precisely”. We settle for a more modest, but still very important, aim:
We seek to predict the average value (mean) of our sample (simulation).
Once we move from looking at individual values within a sample, and concentrate on
sample means our predictive ability improves dramatically. The spreadsheet
UniformSamp allows you to throw 9 dice 100 times (a grand total of 900 throws).
Each time the 9 dice are thrown the average value is calculated, and this is the only
information retained – the individual 9 values are no longer of any importance. In this
way we build up a set of 100 mean values, and these are shown shaded in Column
J of Table 4.2. From these 100 values we extract the following information:
(a) A frequency table (shown in Columns K and L), and hence a histogram. Fig.
4.5(a) shows the histogram corresponding to the data in Table 4.2.
(b) A mean; this is termed the “mean of the sample means”.
(c) A standard deviation; this is termed the “standard deviation of the sample
means”.
For the data of Table 4.2 these latter two (sample) quantities are
x̄ = 3.5 and s = 0.519    --- (1)
as you can check from the frequency table.
Table 4.2: 100 repetitions of 9 throws of a fair dice, with mean values recorded.
Fig.4.5: Histograms for 4 simulations (each representing 900 dice throws) of sample means
Some Conclusions: The most obvious feature of Fig. 4.5(a) is the appearance of
what looks like a normal distribution; in words the sample means appear to follow a
normal distribution. We can confirm this by running further simulations, three of which
are illustrated in Fig.4.5. Although there is clearly some variability present, we can
discern a normal distribution starting to appear.
In addition, we know, from our work in Units 5 and 6, that a normal distribution is
characterised by its mean and standard deviation. From (1) we can see what the
sample mean and standard deviation are, but what are they estimates of? To answer
this we need to identify the underlying population – see Fig.4.3. Since we are
throwing a dice (9 times) the population is {1, 2, 3, 4, 5, 6}, each value occurring with
equal probability. This defines the population probability distribution in Fig.4.6.
X 1 2 3 4 5 6
Probability 1/6 1/6 1/6 1/6 1/6 1/6
Using (3) and (4) in Section 8 of Unit 5 gives us the population mean and standard
deviation as μ = (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5    --- (2a)
and σ² = (1² + 2² + 3² + 4² + 5² + 6²)/6 - 3.5² = 91/6 - 12.25 = 2.9167, so that σ = 1.7078    --- (2b)
• When we look at the average value in a sample the variation of the sample
mean (as measured by s) is smaller than the variation in the original
population. Intuitively we can see why this is so from the sample means shown
in Table 4.2 and also from the x-scales in Figs.4.5. Although any x value in
{1, 2, 3, 4, 5, 6} is equally likely, the values of x̄ are not equally likely. Indeed the
means in Table 4.2 include nothing smaller than 2.25 and nothing larger than 5.
The probability of obtaining a sample mean of, say, 6 in a sample of size 9 is
(1/6)⁹ = 9.9×10⁻⁸ (why?). Compare this with the probability of 1/6 of obtaining
an individual (population) value of 6. This reduced range of x̄ values reduces
the variation in (sample) mean values.
• The precise factor by which the variation is reduced is very important. Since
σ/s = 1.7078/0.519 = 3.29   and   Sample size n = 9 = 3²    --- (3)
it appears that the reduction is by a factor of √n.
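The whole experiment is easy to repeat outside Excel; the Python sketch below is an illustrative stand-in for the SampleMean.xlsx simulation (the function name and random draws are choices made here). It forms 100 sample means of 9 dice throws and reports their mean and standard deviation, which should be close to μ = 3.5 and σ/√9 = 1.7078/3 ≈ 0.57 (the value 0.519 in (1) is just one particular simulation).

import random
from statistics import mean, pstdev

def sample_mean_of_9_dice():
    # mean score of 9 throws of a fair dice
    return mean(random.randint(1, 6) for _ in range(9))

means = [sample_mean_of_9_dice() for _ in range(100)]   # 100 samples, 900 throws in all
print(mean(means), pstdev(means))                       # close to 3.5 and 1.7078/3 ≈ 0.57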
Central Limit Theorem We are given a probability distribution with mean μ and
standard deviation σ.
(a) As the sample size n increases, the distribution of the sample mean x̄
approaches a normal distribution with mean μ and standard deviation σ/√n.
(b) If the (original) probability distribution is actually a normal distribution itself, then
the result in (a) holds for any value of n.
More informally
(a) For a “large” sample the sample mean has (approximate) normal distribution
whatever the original (parent) distribution.
(b) If the original (parent) distribution is normal the sample mean has an exact
normal distribution
More formally, if the distribution of X has mean μ and variance σ²
μ_X̄ = μ_X   and   σ_X̄ = σ_X/√n    --- (5)
Here the subscript (X or X̄) denotes which distribution we are referring to.
Whichever formulation of CLT you are comfortable with, the result itself is possibly
the most important result in the whole of statistics. The two important points to always
bear in mind are the appearance of the normal distribution, and the reduction of
variability (standard deviation) by a factor of √n.
You should now be able to answer the questions in the final box of Fig.4.3. In
addition there are various additional points you should appreciate about CLT.
However, before discussing these, we look at an example of CLT in use. For this we
return to the theme of Example 4.1.
Example 5.1: Female heights (X) follow a normal distribution with mean 163 cm. and
standard deviation 3.5 cm. A market researcher takes a random sample of 10
females. What is the probability the sample mean:
(a) is less than 162 cm. (b) is more than 165 cm.
(c) is between 162.5 cm. and 163.5 cm.
Solution: Histograms of the two distributions, heights and mean heights, are
depicted in Fig.5.1. The distributions are centred at the same place μ = 163, but the
sampling distribution has standard deviation 3.5/√10 = 1.107. This is approximately
one third of the standard deviation of the height distribution.
From Fig.5.1 we can clearly see that probability values will depend on the distribution
we use. For example P(X < 162) appears (top graph) larger than (bottom graph)
P(X̄ < 162). Computationally we have the two important formulae:
Individual values      Z = (X value - Mean)/(Standard Deviation) = (X - μ)/σ    --- (6)
Sample mean values   Z = (X̄ value - Mean)/(Standard Deviation) = (X̄ - μ)/(σ/√n)    --- (7)
Fig.5.1: Histograms for (a) Heights and (b) Mean Heights in Example 5.1.
(a) Here with X̄ = 162   Z = (162 - 163)/(3.5/√10) = -1/1.1068 = -0.904
Hence P(X̄ < 162) = P(Z < -0.90) = 0.5 - 0.3159 = 0.18 (2D)
(See Unit 6 Section 9 if you don’t understand this calculation.)
(b) With X̄ = 165   Z = (165 - 163)/(3.5/√10) = 2/1.1068 = 1.807
Hence P(X̄ > 165) = P(Z > 1.81) = 0.5 - 0.4649 = 0.035 (3D)
(c) Finally with X̄ = 162.5   Z = (162.5 - 163)/(3.5/√10) = -0.5/1.1068 = -0.4518
Hence P(162.5 < X̄ < 163.5) = P(-0.45 < Z < 0.45) = 2*0.1736 = 0.35 (2D)
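These calculations can be cross-checked (purely as an illustration) by working in Python directly with the sampling distribution N(163, (3.5/√10)²):

from math import sqrt
from statistics import NormalDist

xbar = NormalDist(163, 3.5 / sqrt(10))      # sampling distribution of the mean, n = 10
print(xbar.cdf(162))                        # (a) ≈ 0.183
print(1 - xbar.cdf(165))                    # (b) ≈ 0.035
print(xbar.cdf(163.5) - xbar.cdf(162.5))    # (c) ≈ 0.349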
A very important issue is the role played by the sample size in (5). The following
example illustrates the general idea.
(In Example 5.2 the sample size is increased to n = 100, so that σ/√n = 3.5/√100 = 0.35,
and the calculations of Example 5.1 are repeated.)
(a) Here with X̄ = 162   Z = -1/0.35 = -2.86
Hence P(X̄ < 162) = P(Z < -2.86) = 0.5 - 0.4979 = 0.002 (3D)
(b) With X̄ = 165   Z = 2/0.35 = 5.71
Hence P(X̄ > 165) = P(Z > 5.71) = 0.5 - 0.5 = 0.0000 (4D)
This probability is zero to 4D (accuracy of tables).
(c) Finally with X̄ = 162.5   Z = -0.5/0.35 = -1.429
Hence P(162.5 < X̄ < 163.5) = P(-1.43 < Z < 1.43) = 2*0.4236 = 0.85 (2D)
The reason for the very considerable change in probabilities is revealed if we sketch
the second graph in Fig.5.1 for n = 100. The result is Fig.5.2 and we can very clearly
see how “much more concentrated” the distribution (of means) is about 163. It
becomes “highly likely” (as (c) shows) to obtain a sample mean “very close” to 163,
and “very unlikely” (as (a) and (b) show) to find a sample mean “very far” from 163.
Fig.5.2: Histograms for (a) Heights and (b) Mean Heights in Example 5.2.
[Figure: (a) Initial Distribution: Normal; (b) Sample Distribution (n = 9): Normal]
http://onlinestatbook.com/rvls/ , which you may care to look at. (These links are in
addition to those mentioned in Unit 2 Section 2.)
Rule of Thumb Just how large n needs to be (for a normal distribution for X̄ to be
apparent) will depend on how far the distribution of X is from normality, and no
precise rules can be given. However a “rule of thumb”, which is found to work well in
practice, states that n > 30 is considered a “large sample”. By this we mean the
distribution of X̄ will resemble a normal distribution.
6. Simulated Distributions
There is one very important point that can initially be confusing. In Fig.6.1 we depict a
theoretical population X and the corresponding theoretical distribution for X̄; in this
particular case both are normal distributions. In Examples 4.2 and 4.3 we are
simulating these distributions (using various features of Excel) and, in such
circumstances, we only obtain an approximation to the underlying (theoretical)
distributions. Indeed we can clearly see the following:
• In Fig.4.4 we do not obtain an exact uniform distribution (for X).
• In Fig.4.5 we do not obtain the exact distribution for X̄ since this is unique (and
quite difficult to determine theoretically). Clearly Fig.4.5 is giving us four
different approximations to this (unknown) distribution.
In both cases we can obtain better and better approximations to the underlying
distributions by increasing indefinitely the number of samplings (the number of throws = 36
in Example 4.2 and the number of samples = 100 in Example 4.3). But, for example,
repeating Example 4.3 with 100,000 samples is very time consuming!
This distinction between underlying distributions (represented by curves), and
simulated ones (represented by histograms), should always be borne in mind.
(Recall the purpose of this is so that we have to tabulate the areas under only a
single standard normal distribution.) However, the standard deviation is quite a
complicated quantity (the square root of the average squared deviation from the mean) and it is entirely
possible (even probable) that the population standard deviation σ is not known. In
such a case the logical step is to estimate it from our sample. Calculating the sample
standard deviation s we then form the “standardised” t-value via
t = (x̄ - μ)/(s/√n)    --- (8)
The difference between (7) and (8) is more than just a matter of changing symbols
(from z to t and σ to s). To see what is involved we perform a simulation.
Example 7.1: The Excel workbook tDist contains the spreadsheet Barrels1 and
illustrates how the t-distribution arises when sampling from a normal population whose
standard deviation is unknown. The title “Barrels” refers to a piece of history. The
t-distribution was derived by W.S. Gosset in 1908 whilst he was conducting tests on
the average strength of barrels of Guinness beer. His employers (the Guinness
brewery in Dublin) would not allow employees to publish work under their own
names, and he used the pseudonym ‘Student’. For this reason the distribution is also
often termed ‘Student’s t-distribution’.
Table 7.1: Illustration of Normal and t-distributions for “beer barrel” example.
Spreadsheet Table 7.1 illustrates the calculations, based on samples of size 3 (cell
E3) drawn from a N(5, 0.1²) distribution.
• In Cells C9-E9 we generate three normally distributed values using the
Excel command NORMINV(RAND(),B3,B4). These values are intended to
represent the amount of beer in each of the three barrels. (This calculation
represents a slightly modified version of the original computation.)
• We calculate the sample mean in cell G9, and this allows us to compute a Z-
value using (7). Thus cell H9 contains
Z = (5.10545 - 5)/(0.1/√3) = 1.82642    --- (9a)
• Using the sample standard deviation s in place of σ we also compute a t-value
using (8); for the sample shown s = 0.05001, so that
t = (5.10545 - 5)/(0.05001/√3) = 3.652    --- (9b)
Obviously the z and t values differ due to the difference between (9a) and (9b).
In addition, we have shown cells C9-I9 in Table 7.1 to the increased accuracy
(5 dp) necessary to obtain z and t values correct to 3D.
• We now repeat these calculations a set number of times – we have chosen 200,
as shown in cells K27 and L27.
• From these 200 z and t-values we form two histograms based on the frequency
tables shown in columns J to L. These are shown superimposed in Fig.7.1.
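The same comparison can be sketched in Python; the code below is illustrative only and mirrors the logic of the spreadsheet rather than reproducing it exactly. It draws 200 samples of size 3 from N(5, 0.1²), forms a z-value with the known σ = 0.1 and a t-value with the sample s, and then compares how often each exceeds 3 in absolute value.

import random
from math import sqrt
from statistics import mean, stdev

mu, sigma, n = 5, 0.1, 3
z_vals, t_vals = [], []
for _ in range(200):                                  # 200 simulated samples, as in the spreadsheet
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar, s = mean(sample), stdev(sample)             # stdev uses the (n - 1) divisor
    z_vals.append((xbar - mu) / (sigma / sqrt(n)))    # formula (7): sigma known
    t_vals.append((xbar - mu) / (s / sqrt(n)))        # formula (8): sigma replaced by s
print(sum(abs(z) > 3 for z in z_vals),                # large z-values are rare
      sum(abs(t) > 3 for t in t_vals))                # large t-values occur noticeably more often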
Observations The most important feature of the above calculations is the following.
Replacing σ (which is constant, but generally unknown)
with s (which is known but varies from sample to sample)
introduces extra variation. Specifically in all calculations of the form (9a) the
denominator contains the same σ = 0.1 factor, but in (9b) the s = 0.05001 factor
changes with each such calculation. This extra variation has the consequence that t-
values are more variable than z-values, i.e. there are more “large” t-values than
“large” z-values. This is clear both from the “tails” of Fig.7.1 and from the columns J-L
in Table 7.1. In particular, whereas z-values never get larger than 4, t-values do (see
cells K25 and L25).
A Second Simulation Observe that, because of our very small sample size n = 3,
there is really a great deal of variation in both the z- and t-values – see Table 7.1.
Intuitively we do not expect to get very reliable information from such a small sample.
We can confirm this by running a second simulation (just recalculate by pressing F9).
Results are shown in Table 7.2 and Fig.7.2. Again observe the behaviour at the
“tails” of the (normal and t) distributions.
Changing the Sample Size We can reduce the variation present by increasing n. In
the spreadsheet Barrels2 we have set n = 9. In Fig.7.3 we can see that there are now
many fewer values in the tails and, in particular, few with a (z or t) value greater than
3. However there are still more such t-values than z-values. Very importantly,
although maybe not apparent from Fig.7.3, we obtain a different t-distribution for
each sample size n.
If σ is unknown but the sample size is “large” use the normal distribution
If σ is unknown and the sample size is “small” use the t-distribution
t Tables
For hand computation we require, in place of normal distribution tables (Unit 6 Table
9.1), so-called t-tables. The t-distribution tables differ from the normal tables because
we need a lot of different values for different sample sizes. We cannot include as
much detail in the t-distribution tables as in the normal tables otherwise the tables
would be quite bulky - we would need one complete table for each sample size.
Instead we have t-values for a selection of areas in the right hand tail of the
distribution as depicted in Table 8.1. The following is conventional:
Table 8.1: t Distribution Table
Entries in the table give t values for an area in the upper tail of the t distribution. For
example, with 5 degrees of freedom and a .05 area in the upper tail, t = 2.015.
Area in Upper Tail
Degrees
of Freedom .10 .05 .025 .01 .005
1 3.078 6.314 12.71 31.82 63.66
2 1.886 2.920 4.303 6.965 9.925
3 1.638 2.353 3.182 4.541 5.841
4 1.533 2.132 2.776 3.747 4.604
Remember the t and normal (z) tables tabulate different types of areas (probabilities).
You may have to do some preliminary manipulation(s) before using the appropriate table.
Fig.8.2a: t-distribution computations for Example 8.1(a) (t = 2.145)
N.B. Packages such as Excel (and SPSS) give t-values as part of the output of an
appropriate computation; in this context you will not need t-tables. However it is very
important you understand the idea behind the t-distribution, together with its
connection to, and similarity with, the normal distribution. In this sense the
computations of Example 8.1 are important.
Degrees of Freedom
The t-distribution is thought to be “difficult” because its precise
distribution (shape) depends on the sample size n. To make matters worse this
dependence on sample size is phrased in terms of what appears, at first sight, to be
something rather more complicated.
The term degrees of freedom (df) is a very commonly occurring one in statistics and,
as we shall see in later units, is invariably output by statistical software packages
(such as Excel and SPSS). The essential idea is illustrated by the following example;
you may care to re-read Section 9 of Unit 4 first.
Example 8.2: We shall try and explain the origin of (10), i.e. ν=n–1
The basic argument is the following:
• In order to use our basic result (8) we need to compute s (sample value).
• To compute s we need to know the sample mean. Recall the formula (9) of Unit
4 Section 8:
s² = (1/n) Σ (xᵢ - x̄)²   (sum over i = 1 to n)    --- (11a)
• This knowledge (of x̄) puts one constraint on our (n) data values. If we know
the mean of a set of data we can “throw away” one of the data values and “lose
nothing”. Thus, if our (sample) data values are
10 , 15 , 20 , 25 , 30
then each of the following data sets contain precisely the same information:
10 , 15 , 20 , 25 , * (mean = 20)
10 , 15 , 20 , * , 30 (mean = 20)
10 , 15 , * , 25 , 30 (mean = 20)
10 , * , 20 , 25 , 30 (mean = 20)
* , 15 , 20 , 25 , 30 (mean = 20)
In each case the missing entry * is uniquely determined by the requirement the
mean (of the 5 values) is 20.
• In effect we have “lost” one data value (degree of freedom) from the data once
x̄ is known. This is often explicitly indicated by changing (11a) into
s² = (1/(n - 1)) Σ (xᵢ - x̄)²   (sum over i = 1 to n)    --- (11b)
The difference between (11a) and (11b) is only noticeable for “small samples”.
• Note that none of these difficulties occur when using (7) with σ known.
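The numerical effect of the two divisors is easy to see; in Python (again purely illustrative) the standard library provides both versions, and applying them to the five data values above gives:

from statistics import pvariance, variance

data = [10, 15, 20, 25, 30]
print(pvariance(data))   # divisor n,     as in (11a): 50.0
print(variance(data))    # divisor n - 1, as in (11b): 62.5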
General Statement
• Estimates of statistical parameters are often based on different amounts of
information (data). The number of independent pieces of information that go into
the estimate of a parameter is called the degrees of freedom (of the parameter).
• Thus, if we use the sample standard deviation (s) as an estimate of the
population standard deviation (σ) the estimate is based on (n – 1) df. This is the
origin of the denominator in (11b) and the result (10).
9. The χ² Distribution
In Units 8 and 9 we shall start squaring and adding data values – indeed we have
already done so in (11) above. A very important situation arises when our data is
normally distributed, and we want to see what happens when we square. This gives
rise to the so-called Chi-square (χ²) distribution.
Spreadsheet: Cells A6-C6 contain random normal variables with the mean and
standard deviation indicated in B2 and B4 respectively. For simplicity we have
chosen standard normal variables.
• The squares of these variables are computed in cells D6-F6, and the sum of
these cells is placed in G6.
• The whole procedure is then repeated a “large” number of times in order to
obtain a representative histogram. As in Example 7.1 we have chosen 200
times.
• A frequency table of the results in column G is then compiled in columns I-J,
and a histogram produced. For the data of Table 9.1 we obtain the first
histogram of Fig.9.1.
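An equivalent simulation can be sketched in Python (an illustrative stand-in for the spreadsheet, not a copy of it): square three standard normal values, add them, repeat 200 times and inspect the results.

import random
from statistics import mean

# 200 simulated values of z1² + z2² + z3², with each z a standard normal variable
chi_sq = [sum(random.gauss(0, 1) ** 2 for _ in range(3)) for _ in range(200)]
print(min(chi_sq), max(chi_sq), mean(chi_sq))   # skewed towards small values; mean roughly 3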
Observations The most important feature of the above calculations is the following.
The histogram (and frequency table) is skewed in such a way that “small” values
occur much more frequently than “large” values.
• We can understand the scales involved by recalling that, if z = N(0,1), then -3 <
z < 3 and hence, when we square, 0 < z² < 9. Adding three such variables
together will give us a sum in the interval (0, 27). In Fig.9.1 the x-scale is roughly
one half of this, indicating the lack of “large” values.
• It is now clear why the histogram is not symmetric despite the symmetry of the
underlying normal distribution: squaring removes the sign, and the most likely z
values (those near zero) all give small values of z².
Changing the Number of Variables The use of 3 (normal) variables in Table 9.1 is
arbitrary. The spreadsheet ChiSq_n=9 increases this to 9 and representative results
are displayed in Fig.9.2. We note similar results to those obtained in Section 7 in
connection with the t-distribution:
• As we increase the number of variables the histograms become less skewed
and resemble more a normal distribution.
• There is a different (chi-square) distribution for each number of variables
added.
Chi-Square Tables
We will not usually need to do any explicit calculations involving the Chi-square
distribution. However it is useful to be able to check computer output using
appropriate tables. As shown in Table 9.2 specified right hand tail values (areas) of
the distribution are tabulated in a very similar manner to the t-tables of Section 8.
Example: Find the chi-square values corresponding to (a) an upper tail area of 0.1 with 5
degrees of freedom, and (b) an upper tail area of 0.05 with 20 degrees of freedom.
Solution: (a) Here we need df = 5 and Area in Upper Tail = 0.1. Table 9.2 gives the
value χ² = 9.24. The meaning of this is that 10% of the distribution (specified by df =
5) lies above the value 9.24.
(b) Similarly with df = 20 and Upper Tail Area = 0.05 Table 9.2 gives χ² = 31.41.
Since we are adding more (normal) variables together we would expect the chi-
square value to have increased from its value in (a). Note that df does not go above
20. For larger values than this we would use the appropriate normal distribution.
We shall return to these tables in Unit 9. For now remember that, when sums of
squares of normally distributed random variables are involved in a calculation, the
Chi-square distribution will be involved (either explicitly or implicitly).
Table 9.2: Chi-Square Distribution Table
Entries in the table give χ² values (to 2 decimal places) for an area in the upper tail of
the χ² distribution. For example, with 5 degrees of freedom and a .05 area in the
upper tail, χ² = 11.07.
Observations The histogram (and frequency table) is even more skewed than for
the chi-square distribution – see Fig.9.1. Again “small” values occur much more
frequently than “large” values.
• We can appreciate the scales involved by recalling that, if z = N(0,1), then -3 < z
< 3 and hence, when we square, 0 < z² < 9. Adding two such variables together
will give us a numerator sum in the interval (0, 18); similarly the denominator
sum lies in (0, 36). In general we would expect the ratio to be small since there
are more terms in the denominator.
• In view of this we would not expect the histogram to be symmetric despite the
symmetry of the underlying normal distribution.
Fig.10.1: F ratios, frequency table and histogram for data of Table 10.1
Changing the Number of Variables The spreadsheet FDist_(5,3) depicts the case
m = 5 and n = 3 with representative results displayed in Fig.10.3. We note:
• Even though the range of the F-ratios has increased the histograms remain
skewed, with smaller values predominating.
• There is a different F- distribution for each set of (m,n) values.
F Tables
You will rarely need to do any explicit calculations involving the F-distribution.
However it is useful to be able to check computer output using appropriate tables,
especially since F ratios occur repeatedly when using regression models. (We shall
discuss this in detail in Units 9 and 10.)
As shown in Tables 10.2 and 10.3 F-distribution tables are more complicated than
our previous (normal, t and chi-square) tables since they depend on the two
parameters m and n. It is conventional to select a specific right hand tail probability
(often termed percentage points) and tabulate the F-value corresponding to this area
for selected values of m and n. You need to be careful since m is tabulated across
the top row, and n down the first column.
Example 10.2: Find the F-values corresponding to (a) an upper tail probability of 0.1
with (5,10) degrees of freedom, and (b) an upper tail probability of 0.05 with (20,20)
degrees of freedom.
Solution: (a) Here we need Table 10.2 with m = 5 and n = 10. This gives the value
F= 2.52.
The meaning of this is that 10% of the F-distribution, specified by (m,n) = (5,10) lies
above the value 2.52.
Table 10.2: F Distribution, Upper Tail Area = 0.1 (ν1 = m across the top, ν2 = n down the first column)
ν1 = 1 2 3 4 5 6 7 8 9 10 12 24
ν2 1 39.86 49.50 53.59 55.83 57.24 58.20 58.91 59.44 59.86 60.19 60.71 62.00
2 8.53 9.00 9.16 9.24 9.29 9.33 9.35 9.37 9.38 9.39 9.41 9.45
3 5.54 5.46 5.39 5.34 5.31 5.28 5.27 5.25 5.24 5.23 5.22 5.18
4 4.54 4.32 4.19 4.11 4.05 4.01 3.98 3.95 3.94 3.92 3.90 3.83
5 4.06 3.78 3.62 3.52 3.45 3.40 3.37 3.34 3.32 3.30 3.27 3.19
6 3.78 3.46 3.29 3.18 3.11 3.05 3.01 2.98 2.96 2.94 2.90 2.82
7 3.59 3.26 3.07 2.96 2.88 2.83 2.78 2.75 2.72 2.70 2.67 2.58
8 3.46 3.11 2.92 2.81 2.73 2.67 2.62 2.59 2.56 2.54 2.50 2.40
9 3.36 3.01 2.81 2.69 2.61 2.55 2.51 2.47 2.44 2.42 2.38 2.28
10 3.29 2.92 2.73 2.61 2.52 2.46 2.41 2.38 2.35 2.32 2.28 2.18
11 3.23 2.86 2.66 2.54 2.45 2.39 2.34 2.30 2.27 2.25 2.21 2.10
12 3.18 2.81 2.61 2.48 2.39 2.33 2.28 2.24 2.21 2.19 2.15 2.04
13 3.14 2.76 2.56 2.43 2.35 2.28 2.23 2.20 2.16 2.14 2.10 1.98
14 3.10 2.73 2.52 2.39 2.31 2.24 2.19 2.15 2.12 2.10 2.05 1.94
15 3.07 2.70 2.49 2.36 2.27 2.21 2.16 2.12 2.09 2.06 2.02 1.90
16 3.05 2.67 2.46 2.33 2.24 2.18 2.13 2.09 2.06 2.03 1.99 1.87
17 3.03 2.64 2.44 2.31 2.22 2.15 2.10 2.06 2.03 2.00 1.96 1.84
18 3.01 2.62 2.42 2.29 2.20 2.13 2.08 2.04 2.00 1.98 1.93 1.81
19 2.99 2.61 2.40 2.27 2.18 2.11 2.06 2.02 1.98 1.96 1.91 1.79
20 2.97 2.59 2.38 2.25 2.16 2.09 2.04 2.00 1.96 1.94 1.89 1.77
21 2.96 2.57 2.36 2.23 2.14 2.08 2.02 1.98 1.95 1.92 1.87 1.75
22 2.95 2.56 2.35 2.22 2.13 2.06 2.01 1.97 1.93 1.90 1.86 1.73
23 2.94 2.55 2.34 2.21 2.11 2.05 1.99 1.95 1.92 1.89 1.84 1.72
24 2.93 2.54 2.33 2.19 2.10 2.04 1.98 1.94 1.91 1.88 1.83 1.70
25 2.92 2.53 2.32 2.18 2.09 2.02 1.97 1.93 1.89 1.87 1.82 1.69
26 2.91 2.52 2.31 2.17 2.08 2.01 1.96 1.92 1.88 1.86 1.81 1.68
27 2.90 2.51 2.30 2.17 2.07 2.00 1.95 1.91 1.87 1.85 1.80 1.67
28 2.89 2.50 2.29 2.16 2.06 2.00 1.94 1.90 1.87 1.84 1.79 1.66
29 2.89 2.50 2.28 2.15 2.06 1.99 1.93 1.89 1.86 1.83 1.78 1.65
30 2.88 2.49 2.28 2.14 2.05 1.98 1.93 1.88 1.85 1.82 1.77 1.64
32 2.87 2.48 2.26 2.13 2.04 1.97 1.91 1.87 1.83 1.81 1.76 1.62
34 2.86 2.47 2.25 2.12 2.02 1.96 1.90 1.86 1.82 1.79 1.75 1.61
36 2.85 2.46 2.24 2.11 2.01 1.94 1.89 1.85 1.81 1.78 1.73 1.60
38 2.84 2.45 2.23 2.10 2.01 1.94 1.88 1.84 1.80 1.77 1.72 1.58
40 2.84 2.44 2.23 2.09 2.00 1.93 1.87 1.83 1.79 1.76 1.71 1.57
60 2.79 2.39 2.18 2.04 1.95 1.87 1.82 1.77 1.74 1.71 1.66 1.51
120 2.75 2.35 2.13 1.99 1.90 1.82 1.77 1.72 1.68 1.65 1.60 1.45
Table 10.3: F Distribution, Upper Tail Area = 0.05 (ν1 = m across the top, ν2 = n down the first column)
ν1 = 1 2 3 4 5 6 7 8 9 10 12 24
ν2 1 161.45 199.50 215.71 224.58 230.16 233.99 236.77 238.88 240.54 241.88 243.90 249.05
2 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37 19.38 19.40 19.41 19.45
3 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79 8.74 8.64
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96 5.91 5.77
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 4.68 4.53
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06 4.00 3.84
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 3.57 3.41
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35 3.28 3.12
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 3.07 2.90
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 2.91 2.74
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85 2.79 2.61
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 2.69 2.51
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 2.60 2.42
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60 2.53 2.35
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 2.48 2.29
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49 2.42 2.24
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45 2.38 2.19
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41 2.34 2.15
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38 2.31 2.11
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35 2.28 2.08
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32 2.25 2.05
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30 2.23 2.03
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27 2.20 2.01
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25 2.18 1.98
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24 2.16 1.96
26 4.23 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22 2.15 1.95
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25 2.20 2.13 1.93
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19 2.12 1.91
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22 2.18 2.10 1.90
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16 2.09 1.89
32 4.15 3.29 2.90 2.67 2.51 2.40 2.31 2.24 2.19 2.14 2.07 1.86
34 4.13 3.28 2.88 2.65 2.49 2.38 2.29 2.23 2.17 2.12 2.05 1.84
36 4.11 3.26 2.87 2.63 2.48 2.36 2.28 2.21 2.15 2.11 2.03 1.82
38 4.10 3.24 2.85 2.62 2.46 2.35 2.26 2.19 2.14 2.09 2.02 1.81
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08 2.00 1.79
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99 1.92 1.70
120 3.92 3.07 2.68 2.45 2.29 2.18 2.09 2.02 1.96 1.91 1.83 1.61
(b) Here we need Table 10.3 with (m,n) = (20,20) . Unfortunately this particular
combination is not tabulated, and we have one of two alternatives:
• Go to the nearest value, here (24,20) and use the value F = 2.08.
• Interpolate between the two nearest values, here (12,20) and (24,20). If you
understand how to do this it will give
F = 2.28 - (8/12)*(2.28 - 2.08) = 2.28 - 0.133 = 2.147
In practice we shall never require F-values to more than 1 decimal place (at most),
and here we can safely quote the value F = 2.1. (In fact more accurate tables and
computations give the value F = 2.124, so our result is acceptable.)
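Such table values can also be sanity-checked by simulation. The Python sketch below is illustrative only; it uses the standard construction of an F variable as the ratio of two independent chi-square variables, each divided by its degrees of freedom, and estimates the upper 5% point for (20, 20) degrees of freedom.

import random
from statistics import quantiles

def chi_sq(df):
    # sum of df squared standard normal values
    return sum(random.gauss(0, 1) ** 2 for _ in range(df))

f_vals = [(chi_sq(20) / 20) / (chi_sq(20) / 20) for _ in range(20000)]
print(quantiles(f_vals, n=20)[-1])   # estimated upper 5% point; close to 2.12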
We shall return to these F-tables in Units 8 and 9. For now remember that, when
ratios of sums of squares of normally distributed random variables are involved in a
calculation, the F-distribution will be involved (either explicitly or implicitly).
Summary
At the end of this rather long, and more theoretical, unit it is useful to bear in mind the
following general results:
• The normal distribution is the most important distribution in the whole of
statistics due to its appearance in the Central Limit Theorem.
• Sampling distributions arise when samples are taken from a population.
• Sample averages are of particular importance, and their distribution (in repeated
sampling) has much smaller variation than individual values.
• There are various (sampling) distributions, all based on the normal distribution,
which are of importance in practice. These arise in the following situations:
- We need to compute a sample standard deviation in addition to a sample
mean (the t-distribution).
- We need to compute sums of squares of (random normal) variables (the
chi-square distribution).
- We need to compute ratios of sums of squares (F-distribution).
• When computing quantities from a sample it is important to know how many
independent data values our computations are based on. This leads to the very
important concept of “degrees of freedom”.
Further Practical Exercises (on the Module Web Page) will give you some practice in
using the various distributions, and exploring their properties, in Excel. The Tutorial
Exercises will make you more familiar with hand computations involving the normal
distribution.
Learning Outcomes
At the end of this unit you should be familiar with the following:
• Understand the concept of proportion, and how the CLT applies.
• Appreciate the idea of estimating population parameters using sample values.
• Understand how confidence intervals are constructed, and their interpretation.
• Recognize how hypothesis testing is carried out.
• Understand the meaning and computation of p-values.
1. Introduction
In this unit we continue the theme of sampling and ask the general question of how
we can use sample values to estimate corresponding population values. In particular
we wish to investigate how much accuracy/reliability we can assign to sample values.
There are two main (inter-related) ways of doing this:
• Form so called confidence intervals for the population parameters.
• Test whether population parameters have prescribed values.
We shall need the ideas and techniques developed in Units 6 and 7 relating to the
normal and t-distributions to satisfactorily address these issues.
However, before doing this, we shall review some sampling material in a slightly
different context. An important point to bear in mind is that the sample mean is not
the only quantity we can extract from sample data. The sample mean is important
since, as we shall see, we can use it to approximate the population mean. However,
our interest in Unit 7 has been in some variable which can be measured (assigned a
numerical value) and averaged. We cannot always sensibly do this.
In many such cases there is no meaningful average we can work out, since our
questions may only have “Yes” or “No” answers, with no associated numerical value.
Consider the case of a company sponsoring market research into a new product and
the impact of a TV advertising campaign designed to launch the product. A sample of
the public was selected and asked whether they had seen the recent TV adverts for
the product. Clearly their response would be either Yes or No (we shall discount the
Don’t Knows). At the end of the survey we would be able to calculate the proportion,
or percentage, of respondents who saw the adverts.
As was the case for the sample mean, if the sampling process is repeated a number
of times we expect the different samples to provide different sample proportions, as
illustrated in Fig.2.1.
(Fig.2.1: a population with proportion Π; samples no. 1, 2, 3, …, k give sample proportions p1, p2, p3, …, pk.)
For this collection of sample proportions the CLT gives
Mean(p) = Π and Standard Deviation(p) = √(Π(1 - Π)/n) --- (1a)
In addition the sampling distribution of the proportion approaches a normal
distribution as the sample size n increases (“large” sample).
Technical Note: For the above normal approximation to hold we require
nП > 5 and n(1 – П) > 5 --- (1b)
You should be familiar with this type of computation from the simulations of Unit 7. In
addition we also compute the mean and standard deviation, and these are also
depicted in Fig.3.2. You should be able to check these values from Table 3.1.
Two further simulations are shown in Fig.3.2. We conclude the following:
• The sampling variation can be accounted for by using the normal distribution
• The parameters of the normal distribution are in accord with CLT. Here
П = Proportion of sixes (in all possible dice throws) = 1/6 = 0.1666 --- (2a)
and hence √(Π(1 - Π)/n) = √(0.1666 * 0.8333 / 25) = 0.0745 --- (2b)
CLT says the mean and standard deviation of the sample proportions should
approach (2a) and (2b) respectively (and should be exact for large enough n). You
can see the values in Figs.3.1 and 3.2 agree well with (2a,b).
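If you would like to reproduce this behaviour outside the Excel simulations, the following minimal Python sketch (an illustration only) repeats the dice experiment: many samples of 25 throws, recording the proportion of sixes in each sample.

    import numpy as np

    rng = np.random.default_rng(1)
    n, k = 25, 1000                        # 25 throws per sample, 1000 repeated samples

    throws = rng.integers(1, 7, size=(k, n))
    p = (throws == 6).mean(axis=1)         # proportion of sixes in each sample

    print(p.mean())    # close to Π = 1/6 = 0.1666, see (2a)
    print(p.std())     # close to √(Π(1 - Π)/25) = 0.0745, see (2b)

A histogram of p will also show the approximate normal shape promised by the CLT.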
Comments on CLT: Proportions often cause a good deal of confusion because
there are some subtle differences with the CLT result for means.
• Notice that (1a) makes no mention of the population standard deviation; for
proportions this is not a relevant concept. (Recall that standard deviations
measure deviations from the mean but, in the context of proportions, the mean
does not make any sense!)
• Nevertheless the sampling distribution does have a standard deviation. You
have to think about this in order not to become too confused! Look at Table 3.1.
• The final conditions for CLT to hold effectively mean the following:
- If our population proportion П is very small (say П = 0.01) we need a very
large sample (n > 500). Otherwise we will not find any member of the
population having the characteristic of interest, and will end up with p = 0.
This will give us no useful information (apart from П is “small”).
- If our population proportion П is very large (say П = 0.99) we again need a
very large sample (n > 500). Otherwise we will find that all members of the
population have the characteristic of interest, and will end up with p = 1
Again this will give us no useful information (apart from П is “large”).
• The form of the standard deviation in (1a) should be familiar. In Unit 6 Section 3
we stated the standard deviation of X (the number of successes) of a binomial
distribution was given by √(np(1 - p)). If we divide this result by n, to turn the number
of successes into the proportion of successes, we obtain precisely (1a), but now
with Π in place of p! Can you see why the binomial distribution is appropriate in
the context of proportions?
We now illustrate the use of the CLT for proportions. You should compare the
following calculations with those in Section 9 of Unit 6.
Example 3.2: An insurance company knows, from records compiled over the
previous 10 years, that on average 5% of its customers will have a car accident in the
current year. In such an event it has to pay out an average of £3000. The actuarially
fair premium would therefore be £150, but the firm charges £200 to cover risk and
profit. The firm will go bankrupt if more than 6% of its customers have accidents.
(a) If the firm has 1,000 customers, calculate the probability of bankruptcy.
(b) Calculate the same probability if the firm has 10,000 customers.
(c) Why should you feel happier dealing with a large insurance company?
(d) Is a large insurance company more profitable in the long run?
Question: Why are the 1,000 customers regarded as a sample rather than the whole
population (since these are all the company customers)?
• Theory (in words): Since we have a large sample CLT states that the sampling
distribution of proportions follows a normal distribution with
mean = Π and standard deviation = √(Π(1 - Π)/n) --- (3a)
• Theory (in formulas): We have seen how to standardise a variable via
Z = (Sample value - Population value) / (Standard deviation of sample value)
Notes:
1. We can interpret the result more simply by saying there is a 7.35% chance of
the insurance company becoming bankrupt (in the current year).
2. The above calculation is performed in percentages, although the answer appears as a decimal. Why is this?
3. You should check that you obtain the same final probability if the calculation is
performed using proportions (0.05 instead of 5% and so on).
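As a check on the figure quoted in Note 1, here is a short Python sketch of the normal-approximation calculation for parts (a) and (b). It assumes scipy is available and is an illustration of the method rather than the module's own solution.

    from math import sqrt
    from scipy.stats import norm

    def prob_bankruptcy(n, pi=0.05, threshold=0.06):
        # CLT for proportions: p is approximately N(pi, pi*(1 - pi)/n)
        se = sqrt(pi * (1 - pi) / n)
        z = (threshold - pi) / se
        return norm.sf(z)                # P(Z > z) = P(p > threshold)

    print(prob_bankruptcy(1000))         # about 0.073, the 7.35% chance in Note 1
    print(prob_bankruptcy(10000))        # very much smaller: the case for a large insurer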
4. Estimation
So far our attention has been focused on the sample mean and sample proportion.
• If the mean and standard deviation of the population values are assumed to be
known then it is possible to make probability statements about the sample
mean.
• Similarly if the population proportion is assumed known we can make probability
statements about sample proportions.
However, in practice population values are rarely known and have to be estimated
from (sample) surveys or experiments. For example, to determine the average
height and weight of the adult population of Scotland we would need to survey every
such adult, and this is a practical impossibility. The best we can do is to sample some
of them.
In general we use (known) sample values to estimate (unknown) population values.
Although it may sound obvious
• The “best” estimate of a population mean μ is the sample mean
• The “best” estimate of a population proportion П is the sample proportion p
(The problem in proving these statements rigorously is in giving a definition of what
we mean by “best”.)
There is one further important idea which we illustrate via the following example.
x̄ = (1/10)(99.8 + 100.7 + 100.9 + 99.8 + 99.5 + 99.2 + 99.7 + 99.8 + 100.2 + 99.7)
= 99.93 (grams)
This value is often termed a point estimate (of the population mean weight μ).
Comment: The problem with this type of estimate is that
• we have no indication of how accurate it is, given that
• the value of x̄ will undoubtedly change if we take another sample.
(Figure: standard normal curve with a shaded area and the critical value -1.645 marked on the z-axis.)
From the normal tables an area (of the type shown shaded) of 0.475 corresponds to
a z-value of 1.96. Here we are reading the normal tables “in reverse” – given an area
we use the tables to obtain the corresponding z-value.
(b) and (c) are left as an exercise.
Comment: The z-values in Example 4.2 are often termed critical values, and
denoted ZC. You should compare them with the values in Unit 6 Section 12.
• The un-shaded regions in Fig.4.1 are often termed critical regions, or tail
regions of the distribution.
• In finance a very common term is “tails of the distribution”. These are precisely
the un-shaded regions above.
We have said that the best estimate of the population mean is the sample mean.
Recall the following results (CLT for means Unit 7 Section 5):
• The sample mean has a normal distribution,
- if the underlying variable is normally distributed in the population, and
- an approximate normal distribution, if the variable has a different
distribution (as long as the sample size is fairly large).
In particular, at the 95% level of confidence,
P( μ - 1.96 σ/√n < x̄ < μ + 1.96 σ/√n ) = 0.95
which, on rearranging, gives
P( x̄ - 1.96 σ/√n < μ < x̄ + 1.96 σ/√n ) = 0.95 --- (4a)
• This second statement shows that in 95% of all samples, the sample mean will be
within a distance of 1.96 σ/√n from the true mean. This quantity, 1.96 σ/√n may be
used to indicate how good the estimate of the mean really is by constructing an
interval estimate for the population mean.
• The interval contained in (4a) is termed the 95% confidence interval for the mean
(Fig.5.1: diagram highlighting the unknown population mean μ.)
Comment You should be able to see this argument is really the same as the
previous “argument in words”. The only real difference is that in Fig.5.1 we highlight
the unknown nature of μ; this is disguised somewhat in the argument leading to (4a).
Example 4.1 (revisited): Suppose we know, from past records, the following:
• The population (of all metal bars made on this production line) is normal.
• The population has (from past records) standard deviation σ = 1 (gram)
Under these (rather restrictive) circumstances CLT applies and we can use (4b) to
give us our 95% confidence interval for the mean (weight of metal bars produced on
this production line) as
99.93 - 1.96 * 1/√10 to 99.93 + 1.96 * 1/√10 = 99.93 – 0.6198 to 99.93 + 0.6198
= 99.31 to 100.55 (rounded to 2D)
Note: This interval estimate has, by its very nature, a built-in measure of its
reliability. We round the results to 1 decimal place since this is the accuracy of the
original data (and never quote more accuracy than is justified).
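A minimal Python sketch of this confidence interval calculation, using the same ten weights and the assumed σ = 1 (illustrative only):

    from math import sqrt

    data = [99.8, 100.7, 100.9, 99.8, 99.5, 99.2, 99.7, 99.8, 100.2, 99.7]
    sigma, n = 1.0, len(data)                # population sd assumed known
    xbar = sum(data) / n                     # 99.93 grams

    half_width = 1.96 * sigma / sqrt(n)      # 0.6198
    print(xbar - half_width, xbar + half_width)   # about 99.31 to 100.55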
Example 5.1: The spreadsheet CI95 in the Excel workbook ConInt1 gives a
demonstration of this latter (long run frequency) interpretation in action. In Fig.5.2
cells A7-I7 contain (nine) randomly selected values from a N(9, 3²) distribution, with
the mean calculated in cell J7.
• Using (4b) the lower and upper 95% confidence limits are computed in cells K7-
L7. Since we have specified μ (=9) we can check whether this interval does, in
fact, contain μ; a 1 in cell M7 indicates it does.
• We now repeat (100 times) this entire CI calculation, and count how many
constructed intervals actually contain μ. For the values in Fig.5.2 it is 96.
• In Fig.5.2 we give a graphical view of this latter value (96) by joining each of the
(100) lower and upper confidence limits by a straight line. We then note how
many of these lines cross the (horizontal) line μ = 9. Unfortunately this can be a
little difficult to determine due to graphical resolution difficulties!
• Two further simulations are shown in Fig.5.3. In all instances we can see the
95% CI does indeed appear to contain μ 95% of the time. You are asked to
perform further simulations in Practical Exercises 6.
• The use of this particular value was for illustrative purposes, and we could
equally well use any confidence level. In general we can write a more general
form of (4a) as
α% confidence interval for the mean = x̄ - Zα σ/√n to x̄ + Zα σ/√n --- (5)
where the (critical) value Zα depends on the particular value of α.
• In practice the three levels in Example 4.2 (and Fig.4.1) tend to be used. The
particular level chosen is a compromise between the following:
- The higher the confidence level the more certain we are that our
calculated interval contains the true (population) mean.
- The higher the confidence level the wider the confidence interval is. Can
you see why this is the case?
• What we have been working out so far are two-sided confidence intervals
(extending both sides of the sample mean). In some circumstances one-sided
intervals may be more appropriate. If you ever need these the theory is very
similar to the two-sided case.
The spreadsheet CIAlpha in the ConInt1 workbook allows you to specify the
confidence level α and check the validity of (5). You should explore these issues.
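Returning to the coverage demonstration of Example 5.1: if you prefer to see it outside Excel, the following Python sketch (an independent illustration, not the ConInt1 workbook) repeats the 100-interval experiment with samples of nine values from N(9, 3²).

    import numpy as np

    rng = np.random.default_rng(7)
    mu, sigma, n, reps = 9, 3, 9, 100

    samples = rng.normal(mu, sigma, size=(reps, n))
    xbar = samples.mean(axis=1)
    half = 1.96 * sigma / np.sqrt(n)          # known-sigma 95% interval

    contains_mu = (xbar - half < mu) & (mu < xbar + half)
    print(contains_mu.sum(), "intervals out of", reps, "contain mu")   # typically about 95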
Summary
• If we calculate the mean of a set of data then we have a point estimate of the
population mean. We have a single value with no indication whether it is likely
to be close to the true value or not.
• An interval estimate is one which gives a likely range of values of the
parameter to be estimated rather than just a single value. It is of much more
practical use to know that the true value is likely to lie between two limits.
• A confidence interval states both
- these two limits, and
- precisely how likely the true value will lie between them.
• As an illustration, x̄ ± 1.96 σ/√n is a 95% confidence interval for the population
mean. Use of the ± sign is a compact (and common) way to indicate both ends
of the confidence interval in a single formula.
Example 6.1: An electronics firm is concerned about the length of time it takes to
deliver custom made circuit breaker panels. The firm’s managing director felt it
averaged about three weeks to deliver a panel after receiving the order. A random
sample of 100 orders showed a mean delivery time of 3.4 weeks and a standard
deviation of 1.1 weeks.
• If “hypothetical” value lies within the confidence interval accept this value
• If “hypothetical” value lies outside the confidence interval reject this value
Again Π is (usually) unknown, and we will estimate it by the sample value to give
Example 7.1: Coopers & Lybrand surveyed 210 chief executives of fast growing
small companies. Only 51% of these executives have a management succession
plan in place. A spokesman for Coopers & Lybrand said that many companies do
not worry about management succession unless it is an immediate problem.
Use the data given to compute a 95 % confidence interval to estimate the proportion
of all fast growing small companies that have a management succession plan.
Solution : (i) Here X = number of small fast growing companies which have a
management succession plan
Population information : Π = population proportion = unknown
(Population = set of all small fast growing companies)
Approximate 95% confidence interval for the proportion = p ± 1.96 √(p(1 - p)/n)
= 0.51 ± 1.96 √(0.51 * 0.49 / 210) = 0.51 ± 1.96 * 0.034
= 0.51 – 0.07 to 0.51 + 0.07
= 0.44 to 0.58 (44% to 58%)
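A quick Python check of this approximate 95% interval (illustrative only):

    from math import sqrt

    p, n = 0.51, 210
    se = sqrt(p * (1 - p) / n)        # about 0.034
    half = 1.96 * se                  # about 0.07

    print(p - half, p + half)         # roughly 0.44 to 0.58, as above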
8. Hypothesis tests
In Section 4 we introduced the ideas of estimation and observed how the standard
error of the quantity being estimated is a measure of the precision of the estimate.
(The standard error is a very commonly used term to denote the standard deviation
of the sampling distribution. This usage implies the term “standard deviation” refers to
the entire population.)
In many situations, estimation gives all the information required to make decisions.
However, there are circumstances where it is necessary to see whether the data
supports some previous supposition. Examples include
• comparing the mean level of sample output on a production line against a fixed
target value,
• comparing the efficacy of a new drug with a placebo or
• seeing whether times to failure of a particular component follow a specified
probability distribution.
Although, in all these cases, some quantities will have to be estimated from the data
the emphasis has switched from pure estimation to that of testing.
• In the first example, this would be testing whether the mean has wandered
away from the target value.
• In the second whether the new drug is better than the placebo and
• In the third whether the data is compatible with the particular distribution.
Although there are a lot of different types of tests that can be carried out on data (a
glance at any more advanced statistics text-book will reveal a frightening array!) the
idea behind all tests is the same. Once you have mastered this basic concept, life
becomes easy. So what is this basic concept?
Basic Concept
• Whenever any sort of test is performed in real life there will be objectives which
are specified. A test might be carried out to
- determine the breaking stress of a metal bar or
- assess the academic achievement of a pupil.
But in each case there is a clear goal.
Example 9.1 A programmer has written a new program and postulates that the mean
CPU time to run the program on a particular machine will be 17.5 seconds. From past
experience of running similar programs on his computer set-up he knows that the
standard deviation of times will be about 1.2 seconds. He runs the program eight
times and records the following CPU times (in seconds).
15.8, 15.5, 15.0, 14.8, 15.6, 16.5, 16.7, 17.0.
Assuming that running times are normally distributed, calculate the following:
Solution (Informal) The following solution goes through Steps 1-5 above. From this
we can fairly easily put together the more formal calculation.
• The (essential) computational parts of the calculation are starred **.
• The remainder are explanations of the ideas we are using.
First X = CPU time to run program
(a) x̄ = (1/8)[15.8 + 15.5 + 15.0 + 14.8 + 15.6 + 16.5 + 16.7 + 17.0] = 15.8625 **
Standard error of x̄ = σ/√n = 1.2/√8 = 0.4243 **
Recall: 1. The standard error of x̄ is just another phrase for the standard
deviation of x̄. The terminology is common when dealing with sampling
distributions, i.e. distributions of values computed from sample information, rather
than from complete population information (the latter usually being unknown).
2. The sampling distribution of means has standard deviation σ/√n.
3. We shall also shortly use the related facts that the sampling distribution of
means has mean μ, and the sampling distribution of means has a normal
distribution (σ known).
Step 4 We need to assess what our sample computations are telling us.
• It is certainly true that x̄ < μ (15.86 < 17.5).
• But the crucial question is “Is x̄ sufficiently less than μ ?”
• To assess this we are asked to compute a 95% confidence interval for the (true)
mean. From (4b) of Section 5 we easily obtain
(b) 95% CI for mean = x̄ ± 1.96 σ/√n = 15.8625 ± 1.96 * 0.4243 (accuracy?)
Comment If you just look at the starred entries you will see this solution is quite
short. It is only because we have given an (extended) discussion of the underlying
ideas that the solution appears rather long. In practice you would just give the salient
points (starred entries) in your solution. We shall do this in the next section in
Example 10, after we have introduced a bit more statistical jargon.
The Tutorial Exercises will give you practice in writing down solutions using the
appropriate terminology.
- The null hypothesis often involves just the parameters of a population but
it can also be concerned with theoretical models or the relationship
between variables in a population.
- Although the null hypothesis is written as a simple statement, other
assumptions may be made implicitly.
• A test may assume that the individuals are chosen at random, or
• the data come from a normal distribution and so on.
• If any of these additional assumptions are not true then the test
results will not be valid.
- The alternative hypothesis is what will be assumed to be true if it is
found subsequently that the data does not support the null hypothesis. So
either H0 is true or H1 is true
• It is denoted by H1 and unlike the null hypothesis, which is very
specific in nature, the alternative tends to be much more general.
• In the case of Example 9.1, for example, the alternative hypothesis
may take one of three forms:
(i) H1 : μ ≠ 17.5 (ii) H1 : μ > 17.5 (iii) H1 : μ < 17.5
• The form of H0 and H1 must be decided before any data is collected. It may not
be obvious why this should be so – why not have a (quick) look at the data to
get an idea of what is going on, before you decide what to test?
- The problem is that the sample is only one possible sample and, if we
sample again, the results will change.
- We can only put “limited trust” in the values we actually observe.
Example 10.1 Solution (Formal) The following provides a more formal (and
compact) solution to Example 9.1.
• Step 1 : We wish to test the null hypothesis H0 : μ = 17.5
against the (2-sided) alternative hypothesis H1 : μ ≠ 17.5
Since we are told run times are normally distributed we do not need to assume this.
(This is important here since we do not have a large sample to use CLT.)
• Step 2 : We choose a confidence level of 95%. Note this is done before looking
at sample results (or, in practice, before collecting any data).
• Step 3 : This is exactly the same as before with
x = 15.86 and Standard error = 0.42
• Step 4 : This is exactly the same as before with
95% CI for mean = 15.03 to 16.69
• Step 5 : Since, assuming H0 is true, our CI does not contain μ we reject H0
and accept H1 .
Note Here we can say we accept H1 since H0 and H1 include all possibilities (since
H1 is two-sided). If H1 were 1-sided (say H1: μ < 17.5) we could not do this since the
sample data may also be compatible with another H1 (H1: μ > 17.5).
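If you want to see Steps 3 to 5 carried out numerically, here is a minimal Python sketch for Example 10.1 (an illustration only, not part of the module software):

    from math import sqrt

    times = [15.8, 15.5, 15.0, 14.8, 15.6, 16.5, 16.7, 17.0]
    mu0, sigma = 17.5, 1.2                  # H0 value and the (known) population sd

    n = len(times)
    xbar = sum(times) / n                   # 15.86
    se = sigma / sqrt(n)                    # 0.42
    lower, upper = xbar - 1.96 * se, xbar + 1.96 * se   # about 15.03 to 16.69

    print("reject H0" if not (lower < mu0 < upper) else "cannot reject H0")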
11. p-values
This section is intended to explain why so-called “p-values” are computed, as well as
showing how they are obtained.
Terminology There is one further piece of terminology that is used almost
exclusively in (statistical) computer packages. Rather than focusing on how confident
we are (see Step 2 in the formal testing procedure of Section 9) we highlight the
error we are (potentially) making.
Fig.11.1: The “confidence region” (area = 95%) and the significance region (area = 5%).
• In general the significance level can refer to 1-sided regions (areas), although we
shall only look at the 2-sided case.
• The significance level is used to “set a scale” by which we can judge what we
mean by the phrase “unlikely”. We speak of deciding when sample values are
“significant”, i.e. unlikely to have occurred by chance alone. (We equate
“extreme” values with “significant values” in that they are too many standard
deviations away from the mean.)
- What all this comes down to is that, when dealing with continuously
varying quantities (such as height) we cannot ask for probabilities of
specific values, but must specify a range (however small) of values.
Criticism of CIs A major criticism of the use of confidence intervals in hypothesis
testing is the need to pre-set the confidence (significance) level.
• It is quite possible to reject H0 using a 99% confidence level, but to accept H0
using a 95% confidence level.
• This means our “entire conclusion” depends on the, somewhat arbitrary, levels
we set at the start of the analysis.
Asking the Right Question The concept of a p-value is designed to overcome the
above difficulty. To explain this we assume we have taken a sample and ask the
following question:
Question A: How likely was our observed sample value?
• The rationale for this is the following. If what we actually observed was a priori
(before the event) an unlikely outcome then its subsequent occurrence casts
doubt on some of our assumptions. We may well need to revise the latter (in
light of our sample evidence).
• The difficulty with asking this precise question is that, as we have just seen, the
likelihood (probability) is usually zero. We could ask instead
Question B: How likely were we to observe values within a “small” interval (say
within 1%) about our observed sample value?
• The difficulty with this question is one we have already alluded to:
- Our sample values will change if we take another sample so
• whilst we attach importance to our sample results,
• we do not wish to give them undue importance.
In effect we are not overly concerned about the particular sample mean of 15.86
in Example 9.1, so the answer to Question B is not particularly useful.
• What we are interested in is how our sample mean will aid us in assessing the
validity of the null hypothesis. In particular, if μ = 17.5 how far away is our
sample value (mean) from this (since the further away the less confidence we
have in H0)? So we (finally) ask the question
Question C: How likely were we to observe something as “extreme” as our
sample value?
(Fig.11.2: sampling distribution assuming μ = 17.5, with the observed sample mean x̄ = 15.86 and the p-value (tail area) marked.)
Then P(z < -3.86) = 0.5 – 0.4999 = 0.0001 (from normal tables).
In pictures
• Recall that an event 3 standard deviations (3σ) or more from the mean has
“very little” chance of happening, in agreement with the above computation.
p-value = 0.0001 + 0.0001 = 0.0002
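The same two-sided p-value can be computed directly; a short Python sketch (scipy assumed available) gives

    from scipy.stats import norm

    z = (15.8625 - 17.5) / 0.4243       # about -3.86
    p_value = 2 * norm.cdf(z)           # both tails, since H1 is two-sided
    print(p_value)                      # about 0.0001; the rounded tables above give 0.0002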
• There is one final interpretation of the significance level. At the start of this
section we introduced the idea that the significance level measures how
“uncertain” we are (in contrast to the confidence interval that focuses on how
“confident” we are). The significance level measures the maximum error we
are prepared to make in making our decision as to whether to reject H0 or not.
- Remember that, whatever decision is made, we cannot be 100% certain
we have made the right one. It is possible our sample value is
unrepresentative and, by chance alone, we have been “unlucky” with our
sample results.
- Before taking a sample we need to decide how much error we are
prepared to live with. This is what our choice of significance level does.
• You may like to go back to the Excel demonstration of the idea behind
confidence intervals (ConfInt spreadsheet of Example 5.1). Observe that, in the
95% CI, we can still expect 5 of our intervals to behave “badly” and not contain
the true value of μ; this would correspond to rejecting H0 5% of the time. (This
would be the wrong decision since we know the true value of μ!)
(Concept map of key terms: F-distribution – Unit 7 Section 10; 95% Confidence Interval – Unit 8 Section 5; Degrees of freedom – Unit 7 Section 8; Standard error – Unit 8 Section 9; t-distribution – Unit 7 Sections 7-8; p-value – Unit 8 Section 11.)
Note Sometimes p-values are called significant (or sig.) values. Excel actually uses
both terms in the output of Fig.12.1. As you can see you need a fair amount of
statistical background to understand the output of most (statistical) packages. The
remaining terms in Fig.12.1 will be explained in Unit 9.
= p - 1.96 √(p(1 - p)/n) to p + 1.96 √(p(1 - p)/n)
rather than the
(Exact) 95% confidence interval for the proportion
= Π - 1.96 √(Π(1 - Π)/n) to Π + 1.96 √(Π(1 - Π)/n)
even though we have a value for П specified under H0. The theory becomes a lot
simpler if we work with approximate, rather than exact, confidence intervals for the
population proportion. (You can see why if you look at the version of (4a) which
applies to proportions.)
Example 13.1: The catering manager of a large restaurant franchise believes that
37% of their lunch time customers order the “dish of the day”. On a particular day,
of the 50 lunch time customers which were randomly selected, 21 ordered the dish
of the day. Test the catering manager’s claim using a 99% confidence interval.
This value measures the chances of making the wrong decision. The catering
manager has decided he is prepared to live with the consequences of this, i.e.
• too many “dish of the day” dishes unsold (if in fact Π < 0.37 and we accept H0)
• not enough “dish of the day” made (if in fact Π > 0.37 and we accept H0)
and the consequent effect on the supplies that need to be ordered, the staff that need
to be deployed, customer dissatisfaction and so on.
Step 4: Examine sample data by computing the 2-sided 99% confidence interval.
Here 99% confidence interval for (true) proportion = p ± 2.58 √(p(1 - p)/n) **
= 0.42 ± 2.58 √(0.42 * 0.58 / 50) = 0.42 ± 0.180 = 0.24 to 0.60 **
or 99% confidence interval for (true) percentage = 24% to 60% **
Step 5 (Conclusion) : Here, on the basis of H0 being true, we have obtained a 99%
confidence interval which does contain Π. Since this will happen 99% of the time, we
regard this as a “very likely” event to happen. We cannot reject H0.
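A minimal Python version of Steps 4 and 5 for Example 13.1 (illustrative only):

    from math import sqrt

    pi0 = 0.37                              # H0: population proportion is 0.37
    p, n = 21 / 50, 50                      # sample proportion 0.42

    half = 2.58 * sqrt(p * (1 - p) / n)     # about 0.18
    lower, upper = p - half, p + half       # about 0.24 to 0.60

    print("cannot reject H0" if lower < pi0 < upper else "reject H0")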
Example 14.1: A random sample of 10 items in a sales ledger has a mean value of
£60 and a standard deviation of £8. Find a 95% confidence interval for the population
mean of sale ledger items.
Hence 95% CI for μ = x̄ ± tC s/√n = 60 ± 2.262 * 8/√10 (in £)
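A quick Python check of this t-based interval (scipy assumed available; t.ppf(0.975, 9) reproduces the tabulated critical value 2.262):

    from math import sqrt
    from scipy.stats import t

    xbar, s, n = 60, 8, 10
    tc = t.ppf(0.975, n - 1)            # 2.262
    half = tc * s / sqrt(n)             # about 5.7

    print(xbar - half, xbar + half)     # the 95% CI for the mean ledger value (in £)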
Step 4a : Examine sample data by computing the 2-sided 95% confidence interval
via 95% CI for the mean = x̄ ± tc s/√n **
Step 4a : 95% CI for the mean = 10.33 ± 2.11 * 0.77/√18 **
= 10.33 ± 0.38 = 9.95 to 10.71
Step 5 (Conclusion): Here, on the basis of H0 being true, we have obtained a 95%
confidence interval which does contain μ. We cannot reject H0.
Comments:
1. Although our confidence interval is not very wide we may view with some
concern the fact that μ = 10 is very close to one edge of the interval. In practice
this may prompt us to do some further analysis, i.e.
• think about changing our confidence/significance level, or
• taking another sample, or
• taking some other course of action.
• We now need to calculate P(|t| > 1.818), representing the probability of a more
extreme value than the one observed, and corresponding to the area below:
(Figure: t-distribution with the two tail areas beyond -1.818 and +1.818 shaded.)
Unfortunately this area is not directly obtainable from the t-tables (why?). We need to
use the Excel function TDIST. Explicitly
p-value = TDIST(1.818, 17, 2) = 0.087
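If Excel is not to hand, the same two-tailed probability can be obtained in Python (scipy assumed available); this is simply the analogue of TDIST with its third argument equal to 2.

    from scipy.stats import t

    p_value = 2 * t.sf(1.818, 17)       # both tails, 17 degrees of freedom
    print(p_value)                      # about 0.087, matching TDIST(1.818, 17, 2)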
The validity of any statistical procedure ultimately rests on how well the data
conforms to the (often implicit) assumptions made. Always bear this in mind.
Learning Outcomes
At the end of this unit you should be able to:
• Appreciate the concept of covariance and correlation.
• Interpret the coefficient of correlation.
• Plot data to illustrate the relationship between variables.
• Determine the equation of a regression line and interpret the gradient and
intercept of the line.
• Understand the regression (Anova) output from Excel.
• Appreciate the usefulness of residual analysis in testing the assumptions
underlying regression analysis.
• Predict/forecast values using the regression equation.
• Understand how data can be transformed to improve a linear fit.
• Appreciate the importance of using statistical software (Excel) to perform
statistical computations involving correlation and regression.
• Understand the inter-relationships between expected values, variances and
covariances as expressed in the efficient frontier for portfolios.
1. Introductory Ideas
We are very often interested in the nature of any relationship(s) between variables of
interest, such as interest rates and inflation or salary and education.
• Covariance is the fundamental numerical financial quantity used to measure
the mutual variation between variables.
• Correlation is a scaled covariance measure and provides an important first
step in seeking to quantify the relationship between (two) variables. It is used
more often than the covariance in non-financial contexts.
• Regression extends the notion of correlation to many variables, and also
implicitly brings in notions of causality, i.e. whether changes in one variable
cause changes in another variable. In this unit we shall consider only the case
of two variables (simple regression).
Example 1.1: Look back at Problem 2 in Section 2 of Unit 1; here we want to know
how our gold and sterling assets will behave. Specifically if gold prices start to fall:
• Can we expect the value of sterling to fall?
• If so by how much?
We are interested in whether changes in one asset will accompany changes in the
other. Note we do not use the terminology “do changes in gold prices cause changes
in sterling”, rather we are more interested in how they vary together.
Example 1.2: There has always been a great deal of media attention focused on the
“correct” values at which interest rates should be set.
• The manufacturing sector complains interest rates are too high since
- this encourages foreign investment in the UK which in turns
- causes the £ to appreciate (against other currencies); in turn
- a higher £ causes difficulties with UK exports (higher prices), resulting in
- falling export sales, increased layoffs and rising unemployment.
• Whether you believe in either scenario, it is clear that there are a large number
of variables (interest rates, exchange rates, inflation, manufacturing output,
exports, unemployment and so on) that need to be considered in analysing the
situation.
There is an enormous literature on these ideas, ranging from the non-technical to the
very-technical. A brief overview is given at http://en.wikipedia.org/wiki/Interest_rates
Probably the crucial issue here is causation, i.e. “what causes what?” If all variables
are interlinked, do they all cause each other, or is there some “important” variable
that explains many (or all) the others? Economics seeks to use data, and statistical
analysis, to try and make sense of the relationships between (many) variables.
Correlation (or covariance) is the starting point for doing this. Very roughly speaking:
• In physics causation is often well established (“Gravity causes objects to fall”)
• In finance causation is plausible but not well understood (“High unemployment
causes interest rates to decrease”)
• In the social sciences causation is problematic (“Low educational attainment
causes alcohol abuse”).
A Useful Reference We shall only cover a small fraction of the available material in
the area of regression. If you wish to read further a good introductory text, with a very
modern outlook, is Koop, G. (2006) Analysis of Financial Data
A companion volume Koop, G. (2003) Analysis of Economic Data
covers similar ground, but with a slightly less financial and more economic
orientation. We shall refer to, and use, some of the data sets discussed by Koop.
We can see an inverse relationship between the two variables, i.e. higher fares are
associated with a lower demand. (But note the last three entries.) However we attach
no significance to which variable is plotted on which axis; if you interchanged them in
Fig.2.1 would your conclusion be any different?
Table 2.1: Great Britain data Fig.2.1: Scatter plot of bus fares and passengers
• We have deliberately not used X and Y for any of the variables since this
terminology is almost invariably used in the (mathematical) sense “Y depends
on X”. With correlation we do not want to imply this - see Section 3.
• The scatter plots of Fig.2.2 are indicative of the following:
- There appears to be some relation between executive pay (E) and
company profit (P) in the following sense:
• As P increases E increases “in general”. This does not mean that
every time P increases E increases, just “much” of the time.
• As E increases P increases “in general”. But we have no sense of
“cause and effect” here. We only know that high (low) values of P
tend to be associated with high (low) values of E.
• When we have further variables present there are other correlations we can
look at and, in general, these initially complicate the situation. The scatter plots
of Fig.2.3 are indicative of the following:
- First observe that we can have repeated values, here of D, with
corresponding different values of P. (An important question we could ask
is “Why are these values of P different, given that D is the same?”)
- Would you agree there appears to be some relation between P and D?
Maybe a “weak positive correlation” between P and D?
- There does appear to be some relation between E and D with a “strong
positive correlation” between E and D.
• These ideas raise a fundamental problem.
- It is possible that E does directly depend on D - Fig.2.3 (b)
- It is possible that D does directly depend on P - Fig.2.3 (c)
(Diagram: (a) Incorrect view, showing direct causation between E and P; (b) Correct view, showing indirect causation with E and P linked through D.)
Notation We use the notation rXY to denote the correlation (coefficient) between
the variables X and Y or, more simply, just r if the variables are obvious from the
context. We shall see how to actually compute a value for rXY in Section 4.
• From Example 2.1 we can (only really) conclude
rEP > 0 , rED > 0 , rPD > 0
but cannot assess the strength of these correlations in any quantitative sense.
• You should note that we would intuitively expect, for example,
rEP = rPE
(although this may not be clear from Figs.2.2). This means that we cannot use
the value of rEP to infer some causal connection between E and P.
Question How would you expect the sales S in Fig.2.1 to fit into all this? In particular
what sign (positive or negative) would you expect for
rES , rPS and rDS ?
Software The scatter plots in Figs.2.2 and 2.3 are produced individually in Excel. A
weakness of the software is that we cannot obtain a “matrix” of scatter plots, where
every variable is plotted against every other variable apart from itself (Why?). In
Example 2.1, with 4 variables, the result is 12 plots, as shown in Fig.2.5. The latter is
obtained in SPSS (Statistical Package for Social Sciences), and is a more powerful
statistical package than Excel. You may need to learn some SPSS at some stage.
Example 3.1
(a) If we drop an object the time it takes to hit the ground depends on the height
from which it is dropped.
• Although this sounds trivially obvious we can check this by performing
experiments and, indeed, discover the precise relationship between height
and time taken.
• We have control over the various heights we choose and this is
characteristic of an independent variable.
• We cannot directly control the time taken, this depending on our choice of
height, and this is characteristic of a dependent variable.
(b) We can (very plausibly) argue the sales of a product (televisions) depend on the
price we charge for them. Here we would regard sales (termed demand by
economists) as depending on price.
• Note we can vary the price we charge (independent variable), but we do
not have control over the number sold (dependent variable).
• Here we suspect sales do not depend on price alone, but on other factors
(possibly advertising, the economic climate and so on).
• We could complicate the discussion and argue that, if the sales drop, we
can lower the price to try and improve the situation. In this sense may we
regard price as depending on sales (even though we cannot select the
level of sales as we would an independent variable)?
(c) We can argue the exchange rate of the £ (against the $ say) depends on the
level of UK (and US) interest rates.
• We could try and check this by checking data compiled by, say, the Bank
of England. Statistical techniques, or even simple scatter plots, would help
us decide whether there was indeed a connection.
• But maybe we could not say whether changes in one caused changes in
the other (possibly because there were other factors to take into account).
• However, we have no control over either variable and so we cannot do
any experiments as we can in (a) and (b). This is typical of economic
situations where “market forces” determine what occurs, and no “designed
experiments” are possible.
• When this type of situation occurs, where we cannot meaningfully label
anything as an independent variable, the terminology “explanatory
variable” is used. This is meant to indicate that we are trying to use
changes in this variable to “explain” changes in another (dependent)
variable.
Definition
(a) Given two (random) variables X and Y the covariance of X and Y, denoted
Cov(X,Y) or σ(X,Y), is defined by the mean value of the product of their
deviations (from their respective mean values)
Cov(X,Y) = E[(X – E(X))(Y – E(Y))] --- (2)
(b) In particular, if Y = X, (2) and (1) become identical, so that
V(X) = Cov(X,X) --- (3)
In this sense the covariance is a natural generalisation (to two variables) of the
variance (for a single variable). This is really the motivation for taking the
product in (2) to measure the interaction of X and Y, rather than any other
combination.
Example 4.1 Consider the following set of returns for two assets X and Y:
Possible States Probability R(X) = Return on X R(Y) = Return on Y
State 1 0.2 11% -3%
State 2 0.2 9% 15%
State 3 0.2 25% 2%
State 4 0.2 7% 20%
State 5 0.2 -2% 6%
Here we have used (3) in Unit 5 Section 8, expressed in “expected value formalism”.
Although we might naively prefer X to Y on the basis of expected returns, we must
also look at the “variability/riskiness” of each asset.
• Var[R(X)] = 0.2*(11-10)² + 0.2*(9-10)² + 0.2*(25–10)² + 0.2*(7–10)² + 0.2*(-2-10)²
= 0.2 + 0.2 + 45 + 1.8 + 28.8 = 76%
or σ(R(X)) = √76 = 8.72%
• Var[R(Y)] = 0.2*(-3-8)² + 0.2*(15-8)² + 0.2*(2–8)² + 0.2*(20–8)² + 0.2*(6-8)²
= 24.2 + 9.8 + 7.2 + 28.8 + 0.8 = 70.8%
Here we have essentially used (4) in Unit 5 Section 8. Both assets have similar
variability and, taken in isolation, we would still prefer X to Y (a larger expected
return with about the same degree of risk).
There is essentially nothing in these calculations that we did not cover in Unit 5.
However, there is one new ingredient here, which becomes important if we want to
combine both assets into a portfolio.
• Using (2) Cov[R(X),R(Y)] = E[(R(X) – 10)*(R(Y) – 8)]
= 0.2*(11-10)(-3-8) + 0.2*(9-10)(15-8) +
0.2*(25-10)(2-8) + 0.2*(7-10)(20-8) + 0.2*(-2-10)(6-8)
= -2.2 – 1.4 – 18 – 7.2 + 4.8 = -24
Note that the individual terms in this sum can be positive or negative, unlike in the
variance formulae where all are constrained to be positive. We can see that most of
the terms are negative as is the resulting sum.
• We need to be careful how we interpret this value of -24. The negative sign tells
us that, in general, as R(X) increases R(Y) decreases, i.e. the assets X and Y
are negatively correlated. We shall need the ideas of Section 5 to label this as a
“weak” correlation, although a scatter plot would be indicative.
• The units of Cov[R(X),R(Y)] are actually squared % - can you see why? To
avoid this happening we really should work with decimal forms of the returns,
rather than percentages. Thus 10% would be used as 0.1, and so on. This gives
Cov[R(X),R(Y)] = -0.0024
But we still have no scale against which we can assess the meaning of this value.
Of course this situation in Example 4.1 is not at all realistic. In practice we would not
have only 5 possible returns, nor would we know associated probabilities. But this
very simplified model does illustrate the essentially simple nature of (2), when
stripped of the symbolism. We shall return to this example in Section 13 to discuss
the fundamental importance of the covariance in a portfolio context. You should also
compare Example 4.1 with Example 8.1 of Unit 5.
r(X,Y) = -24%² / (8.72% * 8.41%) = -0.33 (to 2D)
Note that the dimensions cancel out and, as stated, our final value is dimensionless.
Because of this we would obtain the same value using the decimal forms of each of
the quantities (-.0024, 0.0872 and 0.0841). However most people prefer to work with
numbers that are “not too small”, otherwise interpretation can become difficult.
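The whole of Example 4.1 can be reproduced in a few lines of Python. The sketch below (illustrative only) works in decimal returns, as recommended above, and recovers the expected returns, variances, covariance and the correlation of -0.33.

    probs = [0.2, 0.2, 0.2, 0.2, 0.2]
    rx = [0.11, 0.09, 0.25, 0.07, -0.02]      # returns on X, in decimal form
    ry = [-0.03, 0.15, 0.02, 0.20, 0.06]      # returns on Y

    ex = sum(p * x for p, x in zip(probs, rx))                    # 0.10
    ey = sum(p * y for p, y in zip(probs, ry))                    # 0.08
    vx = sum(p * (x - ex) ** 2 for p, x in zip(probs, rx))        # 0.0076
    vy = sum(p * (y - ey) ** 2 for p, y in zip(probs, ry))        # 0.00708
    cov = sum(p * (x - ex) * (y - ey)
              for p, x, y in zip(probs, rx, ry))                  # -0.0024

    r = cov / (vx ** 0.5 * vy ** 0.5)
    print(round(r, 2))                                            # -0.33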
However we still have the difficulty of assigning a meaning to our correlation
coefficient value of -0.33. To do this we need the following result:
Property of r(X,Y): -1 ≤ r(X,Y) ≤ 1 --- (5)
Comments
1. The result (5) is really an algebraic one, following from the form of (2). In words
(5) says that the variation/interaction between X and Y cannot be larger (in
absolute value) than the product of the variation in X and the variation in Y.
2. We can understand how the product form of (1) results in (6) by noting that
sums of products of xy terms will be
• Large and positive if both x and y are positive (or both negative)
• Large and negative if x and y have different signs.
• Small if the signs of x and y have “randomly mixed” signs.
These situations are illustrated in Fig.5.1 where, to keep things simple, we have
not subtracted off the appropriate means in (2).
3. The two equalities in (6), i.e. r(X,Y) = 1 or r(X,Y) = -1, occur only if the (X,Y)
points lie exactly on a straight line. Can you prove this? (See (7) of Section 6.)
4. Because the product in (2) just contains single powers of X and Y, characteristic
of the equation of a straight line, the correlation coefficient measures linear
association between two variables. Nonlinear correspondences (such as y =
x2) are not picked up by r(X,Y).
5. Indeterminate cases occur rather frequently in practice when r is around ± 0.5.
In such cases we say X and Y are “weakly” correlated, and this is precisely the
situation we have seen in Example 5.1.
SCov(X,Y) = (1/(n - 1)) Σi (xi - x̄)(yi - ȳ) --- (7a)
with x̄ = mean of x-values, ȳ = mean of y-values
The summation sign (Σ) indicates contributions are to be added (summed) over all
possible data values. Recall the “degrees of freedom” argument in Unit 7 Section 8 to
account for the (n – 1) factor in (7a).
Using the expression for the standard deviation of a random variable (Unit 4 Section
8), we can show that (4) can be written in either of the forms
r(X,Y) = Σi (xi - x̄)(yi - ȳ) / √[ Σi (xi - x̄)² Σi (yi - ȳ)² ] --- (7b)
or
r(X,Y) = [ n Σxy - Σx Σy ] / √{ [n Σx² - (Σx)²] [n Σy² - (Σy)²] } --- (7c)
If you have good algebraic skills you should be able to derive these results. Although
(7b,c) look a bit daunting it is simple to implement in Excel in several different ways –
see Example 6.1 and the Practical Exercises.
Example 6.1 You have decided that you would like to buy a second hand car, and
have made up your mind which particular make and model you would like. However
you are unsure of what is a sensible price to pay. For a few weeks you have looked
at advertisements in the local paper and recorded the age of cars, using the year of
registration, and the asking price. The data is shown in the table below:
(Although you can find car price data on the web, the above device of using local
newspaper information is highlighted in Obrenski, T. (2008) Pricing Models Using
Real data, Teaching Statistics 30(2).)
The scatter plot of Fig.6.2 appears to show a “reasonably strong” negative correlation
between Y = Price of car (in £) and X = Age of car (in years).
We can confirm this conclusion with the calculation of r(X,Y) given in the Excel
spreadsheet Regression1L (CarData_Rsquare tab), and reproduced in Fig.6.2.
• Given the column sums in Fig.6.2 r(X,Y) is computed from (7c) as follows:
rXY = [n Σxy - Σx Σy] / √{[n Σx² - (Σx)²][n Σy² - (Σy)²]}
= [9 * 126,780 − 41 * 31,375] / √{[9 * 203 - 41²][9 * 126,984,425 - 31,375²]}
= -145,355 / √(146 * 158,469,200) = -145,355 / 152,106.88 = -0.96
in agreement with the square root of the value 0.91 given in Fig.6.2.
• You should check (7b) gives the same result. The advantage of the latter is the
intermediate numbers are smaller, although they do become non-integer. This
is not really an issue if you are using Excel, but can be with calculators.
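The same arithmetic is easily checked in Python. The sketch below (illustrative only) applies (7c) directly to the car data:

    from math import sqrt

    age   = [3, 3, 3, 4, 4, 6, 6, 6, 6]
    price = [4995, 4950, 4875, 4750, 3755, 1675, 2150, 2725, 1500]
    n = len(age)

    sx, sy = sum(age), sum(price)                        # 41 and 31,375
    sxy = sum(x * y for x, y in zip(age, price))         # 126,780
    sxx = sum(x * x for x in age)                        # 203
    syy = sum(y * y for y in price)                      # 126,984,425

    r = (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    print(round(r, 2))                                   # -0.96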
Fig.7.2: Problem with using residual sum criteria for determining line of best fit.
• What information does the data itself contain? Bear in mind that what we really
want our straight line for is prediction (or forecasting).
• If we ignore all the x-values in the data the best estimate we can make of any
y-value is the mean y of all the y-values.
• Of course this provides very poor estimates, but serves the purpose of setting
a “base line” against which we can judge how good our straight line fit is (once
we have derived it!).
• Thus, in Fig.7.1, we need to ask how much better does the regression (red) line
do in “explaining the data” than the horizontal (blue) line. This gives rise to two
important terms:
- Regression sum of squares (RSS) measuring the (squared) difference
between the regression line and the “mean line”.
- Total sum of squares (TSS) measuring the (squared) difference between
the data values and the “mean line”. TSS is independent of the
regression line, and serves as a measure of the situation before
attempting to fit a regression line.
• It should be obvious from Fig.7.1 that, for every observation,
Total “distance” = Error “distance” + Regression “distance”
It is perhaps not obvious, but a similar relation does hold for the squares of these
distances (after we add them together over all data values):
TSS = ESS + RSS --- (8)
• The result (8) is of fundamental (theoretical) importance and should be checked
in any Excel or SPSS output.
Example 7.1 If we return to the car price data of Example 6.1, the column sums
appearing in (9) are computed below.
x y xy x²
3 4995 14985 9
3 4950 14850 9
3 4875 14625 9
4 4750 19000 16
4 3755 15020 16
6 1675 10050 36
6 2150 12900 36
6 2725 16350 36
6 1500 9000 36
Σx = 41 Σy = 31375 Σxy = 126780 Σx2 = 203
Table 7.1: Hand computation of best (least squares) straight line in Example 6.1
a = (Σy − bΣx)/n = (31,375 − (−995.6) * 41)/9 = 8,021.6
The least squares regression line (of best fit) is therefore given by:
Price = 8021.6 - 995.6 * Age
Thus, the slope of the regression line is really a “scaled” version of the
correlation coefficient, the scaling depending on the standard deviations of X
and Y.
• In the form (9) we require the value of b before we can compute a. It is possible
to give a formula for a not involving b, but this produces extra calculations that
are not really needed. The form (9) is computationally the most efficient.
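For reference, the fitted line can be checked in Python; np.polyfit carries out the same least squares calculation, and the loop evaluates the predictions asked for in Example 8.1 below (illustrative only).

    import numpy as np

    age   = np.array([3, 3, 3, 4, 4, 6, 6, 6, 6], dtype=float)
    price = np.array([4995, 4950, 4875, 4750, 3755, 1675, 2150, 2725, 1500], dtype=float)

    b, a = np.polyfit(age, price, 1)      # slope and intercept of the least squares line
    print(a, b)                           # about 8021.6 and -995.6

    for x in (4, 7, 10):
        print(x, "years:", a + b * x)     # note the negative predicted price at 10 years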
Example 8.1 Use the regression line in Example 7.1 to predict the following:
(a) The price of a 4 year old car (b) The price of a 7 year old car
(c) The price of a 10 year old car
• Very often we do not want our y-values turning negative. In our case this gives
the largest Age (x) value as
Age = 8021.6/995.6 ≈ 8 years
• However, in general, we can expect car prices to behave in a nonlinear fashion
both when the age is
- Small (new cars commanding substantially higher prices), and
- Large (after a certain age cars have no intrinsic market value)
9. Excel Implementation
From a practical point of view this section is the most important one in the Unit,
since you will usually perform regression calculations via Excel rather than by
hand calculation.
We may note that in our (car price data) calculation using (9) we obtain no
indication of the accuracy of our estimates of a and b. To do this we have to make
certain assumptions (relating to the normal distribution) which we discuss in Section
11. This allows us to develop standard error estimates, as in Unit 8, together with
confidence intervals for a and b.
• The necessary calculations, although feasible by hand, are much more
conveniently carried out using statistical software. We illustrate the ideas using
Excel, although different software will give similar output (probably with a
different layout of results).
• The Excel output of Fig.9.1 below refers to the car price data of Example 6.1,
and is obtained via Tools Data Analysis Regression
• The a and b coefficients in the regression equation (Example 7.1) are output,
together with indications of their accuracy. Specifically:
• The standard errors enable us to verify the given 95% confidence intervals. For
example, for the “Age coefficient” b:
95% CI for b = -995.582 ± 2.365*116.018 = -1269.9 to -721.2
(Here we need the CI calculation with the t-distribution as in Example 14.1 of
Unit 8. The df = n – 2 since we are estimating the two parameters in the
regression equation from the data – see Unit 7 Section 8.)
• The p-values can be obtained as in Example 14.2 (Unit 8) but require some
effort. For completeness we briefly give the details:
t = (-995.582 − 0) / 116.0182 = -8.581 (available above)
You should now begin to see how much of our previous work is “packaged together”
in Excel output – look back at Unit 8 Section 12.
For ESS: Mean Square for ESS = ESS / Degrees of Freedom = 1,528,483.39 / 7 = 218,354.77
(If not we would think of nonlinear regression, or transforming the data – see Section
12.) We can try and assess “how good the fit is” using the following reasoning. To be
specific we use the car data (price and age):
• Not all cars of the same age sell for the same price. (Why?)
• The variability in price can be split into two parts:
- The relationship of price with age
- Other factors (make of car, interest rates and so on). These factors may
be unknown, or are just
not included in the model for a variety of reasons (keep the model
simple, no data available on other factors, …..)
• We can isolate these two contributions by arguing that
- The relationship of price with age is measured by the Regression sum
of squares (RSS) discussed in Section 7.
- Factors not included in the model will result in an “error” in the
regression, measured by the residual (error) sum of squares (ESS).
- We also know that the total sum of squares (TSS) is given by
TSS = RSS + ESS (Check in Fig.9.1)
Computation: Both of these quantities are available in the Excel output – look at
the ANOVA (Analysis of Variance) section of Fig.9.1 Here
R² = RSS / TSS = 16,079,205.5 / 17,607,688.89 = 0.9131922764
and we would round this to something like R2 = 0.91
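The sums of squares, the decomposition (8) and the R² value can all be verified in a few lines of Python (illustrative only):

    import numpy as np

    age   = np.array([3, 3, 3, 4, 4, 6, 6, 6, 6], dtype=float)
    price = np.array([4995, 4950, 4875, 4750, 3755, 1675, 2150, 2725, 1500], dtype=float)

    b, a = np.polyfit(age, price, 1)
    fitted = a + b * age

    TSS = ((price - price.mean()) ** 2).sum()     # about 17,607,689
    RSS = ((fitted - price.mean()) ** 2).sum()    # about 16,079,206
    ESS = ((price - fitted) ** 2).sum()           # about 1,528,483

    print(abs(TSS - (RSS + ESS)))                 # essentially zero: (8) holds
    print(RSS / TSS)                              # R squared, about 0.913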
Comments:
1. We can clearly see that R2 lies between 0 and 1.
2. We can express the R2 value in words in the following (important) form:
R2 measures the proportion of variability in y that is accounted for by its
straight line dependence on x
3. With this last interpretation we usually quote a percentage. For our example
91% of the variability in price of a car is due to its (linear) dependence on the
car’s age. Hence 9% of the variability is due to other factors.
4. Rule of thumb : There are no precise rules for assessing how high a value of
R2 is needed in order to be able to say we have a “good fit” to the data (in the
sense that we have explained “most” of the variation in y by our choice of x-
variable). Rough guidelines are:
• R2 > 0.8 ⇒ very good fit
• 0.5 < R2 < 0.8 ⇒ reasonable fit
• R2 < 0.5 ⇒ not very good fit
In these terms the car example provides a very good fit of the linear regression
model to the data.
Example 11.1 In an investigation into the relationship between the number of weekly
loan applications and the mortgage rate, 15 weeks were selected at random from
among the 260 weeks of the past 5 years. The data are shown below:
Week  Mortgage rate (%)  Number of loan applications
1   11.0   75
2   13.5   65
3   13.0   62
4   12.0   76
5   15.0   50
6   14.0   58
7   14.5   54
8   13.5   64
9   10.0   87
10  11.0   79
11  10.5   80
12  12.0   72
13  12.5   69
14  13.0   65
15  13.0   61
Table 11.1: Mortgage Applications
We consider the mortgage rate data shown in Table 11.1. The residuals, shown in
Fig.11.1, can be found in the Excel spreadsheet Regression1 (DataSet2 tab).
• Residuals appear “small”, judging from the regression line.
• There appears to be no systematic pattern in the residual plot with residuals, by
and large, alternating in sign. This is also reflected in the regression line
“interleaving” the data points, with data values alternately above and below the
regression line.
• We conclude the straight line fit to the data is a “good” one (and this should also
be reflected in the R2 and r values).
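If you want to check the "no pattern" claim numerically, the following Python sketch (not the Regression1 spreadsheet) fits the line to the Table 11.1 data and lists the residuals; they should be small and mixed in sign, with no obvious trend.

    import numpy as np

    rate = np.array([11.0, 13.5, 13.0, 12.0, 15.0, 14.0, 14.5, 13.5,
                     10.0, 11.0, 10.5, 12.0, 12.5, 13.0, 13.0])
    apps = np.array([75, 65, 62, 76, 50, 58, 54, 64,
                     87, 79, 80, 72, 69, 65, 61], dtype=float)

    b, a = np.polyfit(rate, apps, 1)
    residuals = apps - (a + b * rate)

    print(np.round(residuals, 1))         # small, with alternating signs by and large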
Example 11.2 The sales (in thousands of units) of a small electronics firm for the last
10 years have been as follows:
Year 1 2 3 4 5 6 7 8 9 10
Sales 2.60 2.85 3.02 3.45 3.69 4.26 4.73 5.16 5.91 6.5
The scatter plot, regression line and residual plot are shown in Fig.11.2 below.
• Regression line appears to fit the data well, with “small” residuals and high R2.
• However the residual plot tells a slightly different story, exhibiting a “definite
pattern”.
Conclusion
Although we can use linear regression to “adequately” model the data of Example
11.2, we can do better. (See Section 12.)
We can use the precise pattern exhibited by the residuals to infer the type
of curve we should fit to the data (possibly a quadratic in this illustration).
Example 11.3 In neither of the previous examples have we really tested whether the
residuals are normally distributed. (We have just looked at their size, and any
pattern present.) The reason is we do not really have enough data to realistically
assess the distribution of residuals. For this reason we return to Example 2.2 relating
to executive pay. All the results discussed below are available in the Excel file
ExecutivePay.xls
• From Fig.11.4 we observe the regression fit is “not great” in terms of the R2
value, nor from a visual inspection of the scatter plot.
• The residual plot reveals a potential further problem. The residuals appear to be
increasing in size as x (Profit) increases.
- This arises when the data becomes increasingly (or decreasingly) variable
as x increases.
- This means the standard deviation of the data depends on x, i.e. is not
constant (violating one of the assumptions on which linear regression is
based).
• There are several ways we can check the normality assumption of the
residuals. Fig.11.5 displays conceptually the simplest; we draw the histogram of
residuals and “observe by eye” whether it looks like a normal distribution. (The
residuals are produced automatically in Excel using the Regression tool in
Data Analysis; then using the Histogram tool, again in Data Analysis.)
• Although the residuals appear normally distributed (left histogram) with mean
zero, it is not clear what range of values on the horizontal axis we should
expect. To deal with this we work with the standardised residuals (right histogram).
• “Standardising” a variable X means computing (X – Mean of X) / (Standard deviation of X).
• This is done automatically in Excel (and SPSS provided you request it!). If X is
normally distributed the standardised variable (which we have called Z in Unit 6)
follows a standard normal distribution. We know that the range of Z is
(practically) from -3 to 3. The right histogram in Fig.11.5 fits into this pattern.
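The standardisation step is easy to reproduce directly. The sketch below applies it to an invented set of residuals (the numbers are illustrative only, not the ExecutivePay.xls output); note that Excel’s standardised residuals are produced from the regression output itself, so small differences from this simple version are possible.

import numpy as np

# Illustrative residuals (invented numbers, not the ExecutivePay.xls output)
residuals = np.array([120.0, -340.0, 85.0, 410.0, -260.0, 55.0, -190.0, 115.0])

# Standardise: subtract the mean and divide by the standard deviation
z = (residuals - residuals.mean()) / residuals.std(ddof=1)

print(np.round(z, 2))   # for roughly normal data these values should lie between -3 and 3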
Although the first option is of great practical importance, we shall look at the second
possibility only. Before reading further you should review the material in Practical Unit
3 Section 5 relating to the graphs of the various standard elementary functions
(powers of x, exponential and logarithmic functions).
Some General Rules
What transformation will help linearise the data? In order to answer this question we
will make use of Fig.12.1 below. The easiest way to remember what is happening
here is the following:
• The quadrant in which the data lies determines whether we need to increase or
decrease the powers of x or y.
• Remember that, because logs increase more slowly than powers, reducing powers
(of x or y) is similar to taking logs (of x or y).
You may care to think why there are some negative signs in some of the
transformations in Fig.12.1.
[Fig. 12.1: Scatter Plot Determines Choice of Transformation – a four-quadrant diagram. Choices of transformation shown for quadrant D: √x or √y, log x or log y, -1/x or -1/y; for quadrant C: x², x³ or x⁴ (for x), or √y, log y or -1/y (for y).]
Note that the choice of a transformation is not unique. Several transformations will
often do a “reasonably good” job of linearising the data, and it is often just a matter of
trial and error to find an “optimum choice”. Experimentation also leads one to an
appreciation of just why the transformations given in Fig.12.1 are useful.
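If you wish to experiment along these lines outside Excel, the following Python sketch tries several of the transformations of Fig.12.1 on an invented set of curved (x, y) data and reports the R2 of the resulting straight-line fit. Both the data and the particular candidates chosen are illustrative assumptions only.

import numpy as np

# Hypothetical curved (x, y) data -- not Table 12.1, which is not reproduced here.
x = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
y = np.array([1.2, 2.1, 4.8, 8.9, 14.5, 22.0, 31.5])

def r_squared(u, v):
    # R2 for the straight-line regression of v on u
    fitted = np.polyval(np.polyfit(u, v, 1), u)
    return 1 - np.sum((v - fitted) ** 2) / np.sum((v - v.mean()) ** 2)

# Candidate transformations in the spirit of Fig.12.1
candidates = {
    "y on x      ": (x,      y),
    "y on x^2    ": (x ** 2, y),
    "sqrt(y) on x": (x,      np.sqrt(y)),
    "log(y) on x ": (x,      np.log(y)),
    "-1/y on x   ": (x,      -1 / y),
}
for name, (u, v) in candidates.items():
    print(f"{name}  R2 = {r_squared(u, v):.3f}")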
Example 12.1 A market research agency has observed the trends shown in Table
12.1 in Sales (y) and Advertising Expenditure (x) for 10 different firms.
The scatter plot, regression line and residual plot are shown in Fig.12.2. Comparing
this scatter plot with the diagrams A to D in Figure 12.1 we choose transformation C.
[Fig.12.2 / Fig.12.3 (scatter plot panels): y against x, y against x², √y against x, log(y) against x, -1/y against x, and y against x⁴.]
• Some of these possibilities are depicted in Fig.12.3, and we can see that most
of them do a reasonable job in linearising the data. You can find all these
results in the Excel spreadsheet Regression1
(Transformations_AdvertisingData tab).
• However you should appreciate that we cannot just choose (essentially at random)
any transformation. For example, if we decrease the power of x the data
becomes “even less linear”, as Fig.12.4 illustrates.
Example 12.1 (Continued) Does all this really help us in our regression fits?
• We return to the data of Table 12.1 and perform the following transformation
(the last one depicted in Fig.12.3): we replace x by x⁴ and regress y on x⁴.
• Our new regression fit, and residual plot, are shown in Fig.12.6.
- The new R2 (93%) is higher than the old R2 (85%).
- The new residual plot, whilst not perfect, appears far more random than
the old residual plot (Fig.12.5)
• We conclude that transforming the data (replacing x by x⁴) is worthwhile here. This
does not, however, imply that we have found the “best” transformation, and you
may care to experiment to see if any significant improvement is possible.
Comment Although we have really concentrated on powers (of x and y) to define our
transformations, in economics logarithmic transformations find great use, especially
in the context of elasticity. See Tutorial 9, Q5 for an illustration.
The last two are looked at in Assessment 2, and here we just analyse Example 4.1 a
little further.
Two Key Results Given two random variables X and Y, and any two constants a
and b, we look at the linear combination (or transformation)
P = aX + bY --- (10a)
Then E(P) = aE(X) + bE(Y) --- (10b)
and Var(P) = a²Var(X) + b²Var(Y) + 2abCov(X,Y) --- (10c)
“Proofs” (10b) really follows from (3) of Unit 5 when expressed in terms of
expectations. (10c) is more involved and follows from algebraic manipulation of (1)
using the definitions (2) and (3) of Section 4.
(10c) is a result of fundamental importance, and highlights that the variation in a sum
of (random) variables is not just the sum of the variations present within each
variable, but also contains a component relating to the interaction (covariance) of the
two variables. The more strongly two variables move together (positive covariance) the
greater the variation in their sum, while a negative covariance reduces it. In part the
importance of the covariance concept stems from (10c).
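As a quick sanity check on (10b) and (10c), the Python sketch below simulates two correlated variables (with arbitrary, invented parameters) and compares the sample mean and variance of P = aX + bY with the values predicted by the two formulas.

import numpy as np

rng = np.random.default_rng(0)

# Simulate two correlated random variables (illustrative parameters only)
n = 200_000
X = rng.normal(0.10, 0.09, n)
Y = 0.5 * X + rng.normal(0.04, 0.06, n)      # Y partly driven by X

a, b = 0.4, 0.6
P = a * X + b * Y

# Compare the simulated moments of P with formulas (10b) and (10c)
cov_xy = np.cov(X, Y)[0, 1]
mean_formula = a * X.mean() + b * Y.mean()
var_formula  = a**2 * X.var(ddof=1) + b**2 * Y.var(ddof=1) + 2 * a * b * cov_xy
print(f"E(P):   simulated {P.mean():.5f}   formula {mean_formula:.5f}")
print(f"Var(P): simulated {P.var(ddof=1):.6f}   formula {var_formula:.6f}")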
Portfolio 1 Suppose we invest half of our money in asset X, and the other half in
asset Y. Then, in terms of returns, the portfolio return R(P) is given by
R(P) = 0.5R(X) + 0.5R(Y)
Then, using (10b), E[R(P)] = 0.5E[R(X)] + 0.5E[R(Y)] = 0.5*0.1 + 0.5*0.08 = 0.09
and, using (10c), Var[R(P)] = 0.5²Var[R(X)] + 0.5²Var[R(Y)] + 2*0.5*0.5*Cov(X,Y)
= 0.25*0.0076 + 0.25*0.00708 – 0.5*0.0024 = 0.00247
so the portfolio risk (the standard deviation of the return) is about √0.00247 ≈ 0.050,
i.e. 5.0%, well below the risk of holding either asset on its own.
If the major concern (of a portfolio manager) is to minimise risk, then Portfolio 1
provides a much better alternative than investing in X or Y alone. (Although
maximising expected return sounds very tempting, it is a very high risk strategy!)
Portfolio 2 Since Y is slightly less risky than X (smaller standard deviation), we may
think of investing more in Y. Suppose we invest 60% of our wealth in Y and 40% in X.
Then, repeating the previous calculations, we obtain
R(P) = 0.4R(X) + 0.6R(Y)
E[R(P)] = 0.4E[R(X)] + 0.6E[R(Y)] = 0.4*0.1 + 0.6*0.08 = 0.088
Var[R(P)] = 0.4²Var[R(X)] + 0.6²Var[R(Y)] + 2*0.4*0.6*Cov(X,Y)
= 0.36*0.0076 + 0.16*0.00708 – 0.48*0.0024 = 0.00272
Unfortunately, not only have we reduced our expected return to 8.8%, we have
increased the portfolio risk to 5.2% (compared to Portfolio 1).
The conventional way to display portfolio risk is to plot it not against the weight
invested in X, but against the expected gain, i.e. the expected portfolio return. This is
shown in Fig.13.2, which illustrates the trade-off between portfolio return and risk. The
resulting curve is termed the efficient frontier, and represents one of the major
results in the area of portfolio theory. You will learn a lot more about this in other
modules.
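To see where Fig.13.2 comes from, the short Python loop below traces out expected return and risk as the weight invested in X varies, using the figures quoted above (expected returns 0.10 and 0.08, variances 0.0076 and 0.00708, covariance -0.0024). The assignment of the two variances to X and Y here is an assumption and should be checked against Example 4.1.

import numpy as np

# Figures quoted in the portfolio examples above; the assignment of the two
# variances to X and Y is an assumption (check against Example 4.1).
mu_x, mu_y   = 0.10, 0.08
var_x, var_y = 0.0076, 0.00708
cov_xy       = -0.0024

for w in np.linspace(0.0, 1.0, 11):        # w = proportion invested in X
    exp_ret = w * mu_x + (1 - w) * mu_y                                     # (10b)
    var = w**2 * var_x + (1 - w)**2 * var_y + 2 * w * (1 - w) * cov_xy      # (10c)
    print(f"w = {w:.1f}   expected return = {exp_ret:.3f}   risk = {np.sqrt(var):.4f}")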
Summary The idea of covariance, and the allied notion of correlation, is central to
many of the more advanced applications of statistics, especially in a financial context.
In a portfolio context the interplay between expected values (returns), variances and
covariances gives rise to the concept of an efficient frontier, and this is the starting
point for much of modern portfolio theory. In a regression context correlation
measures the extent to which variables interact, and underlies notions of causality; the
latter is of fundamental importance in modern econometrics.
In addition to all this, regression is probably the most important single
statistical technique used in applications. In practice this tends to be in the context of
multiple regression, and we look at this in Unit 10.
Finally any regression analysis you perform should be accompanied by graphical
output; at least a scatter plot (with fitted line superimposed) and a residual plot to
indicate the quality of the fit achieved.