
MSc International Finance

Statistics, Probability & Risk


FIN 11101 Module Text

School of Accounting, Financial Services & Law

Author: Dr L. Short
MSc International Finance • Module FIN11101 • September 2011 Edition

The module material has been written and developed by


Dr Les Short • School of Accounting, Financial Services & Law • Edinburgh Napier University

First published by Edinburgh Napier University, Scotland © 2009


No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means –
electronic, electrostatic, magnetic tape, mechanical, photocopying, recording or otherwise – without permission in writing from
Edinburgh Napier University, 219 Colinton Road, Edinburgh, EH14 1DJ, Scotland.
Contents

Background Notes for the Module
Unit 1: Introduction
Unit 2: Economic and Financial Data
Unit 3: Graphical Summaries of Data
Unit 4: Data – Numerical Summaries
Unit 5: Probability – Basic Concepts
Unit 6: Two Important Probability Distributions
Unit 7: Sampling and Sampling Distributions
Unit 8: Inference & Hypothesis Testing
Unit 9: Correlation and Regression


A friend was bragging to a statistician over coffee one afternoon how two-day
volatility in the stock market had treated his holdings rather kindly. He chortled,
"Yeah... yesterday I gained 60% but today I lost 40% for a net gain of 20%."
The statistician sat in horrified silence. He finally mustered the courage and
said, "My good friend I'm sorry to inform you but you had a net loss of 4%!!!"
Available at http://www.ilstu.edu/~gcramsey/Tukey.html
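
To see why the statistician is right, note that percentage changes compound multiplicatively
rather than adding: a £100 holding becomes £100 × 1.60 = £160 after the 60% gain, and then
£160 × 0.60 = £96 after the 40% loss, a net loss of 4%.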

Background Notes for the Module

The following introductory comments are designed to give you a general overview of
the module. Further details can be found at the appropriate place later in the notes.

1. General Structure
The module is designed roughly along the following general lines:
• 2 Lectures per week
• 1 Practical (Laboratory) per week
• 1 Tutorial per week.

You will find the module material organised around this course structure:
• Lecture material (in this book).
• Practical material in the Student Study Guide.
• Tutorial material at the end of the Student Study Guide.

The material in this module is written under the assumption that you have not studied
Statistics in any great detail, if at all. We start from very elementary ideas but
progress at a fairly brisk rate so that, by the end of the module, you will have
encountered a good many statistical ideas. Some of these ideas are relatively simple,
but some are decidedly not.
In addition statistics is essentially a “doing” subject; you cannot really learn about
how data is collected and analysed without actually collecting and analysing some
data for yourself. For this reason the lecture material is supplemented by material of
a very practical nature designed to be worked through on a PC.
We will find that, even though software may be available to perform various statistical
tasks, the interpretation of the resulting output requires some (theoretical) statistical
knowledge. Although the relevant ideas are discussed in the lecture notes, the
purpose of the tutorial sessions is to help you enhance this knowledge by making you
familiar with performing statistical computations by hand (sometimes with calculator
assistance). There is nothing quite like having to actually write down all the steps in a
calculation to see if you really understand the material.


2. Lecture Material
Roughly speaking you should find the following:
• Units 1 to 4 are relatively straightforward. This material is often described
under the heading “Descriptive Statistics”.
• Units 5 and 6 are a bit harder going, and introduce the fundamental notions of
probability and probability distributions. In particular the Binomial and Normal
distributions are discussed in some detail.
• Units 7 to 9 are more challenging, and consider material that is often described
under the heading “Inferential Statistics”. The ideas discussed build on the
concepts introduced in Units 5 and 6.

In general terms, Statistics consists of three fundamental areas:


• Data collection.
• Data analysis.
• Reporting/communicating results (of data analysis).

Data Collection
In this module we use secondary data, i.e. data collected by somebody other than
ourselves. In Unit 2 we consider in some detail the use of various important statistical
websites in obtaining data. Indeed one of the central themes running through the
module, and which we try and emphasise throughout the module, is that statistics is
“data based”; for this reason we endeavour to use “real data” in most of our
discussions and illustrations. In most examples you will find either a (web based)
reference to the data used, or the data itself in the form of an Excel file. The latter are
collected together on the module web page, and you can download them if you wish
to reproduce any of the results we quote.

Data Analysis
This is usually divided into the two categories mentioned above:
• Descriptive Statistics. Here data is analysed in one of two ways:
- Graphical Analysis. This is considered in detail in Unit 3.
- Numerical Summaries. This is considered in some detail in Unit 4.
• Inferential Statistics. This involves the more advanced techniques discussed
in Units 7 to 9 (with Units 5 and 6 providing necessary background material).


Reporting Results
This is often the most difficult part of the process, and certainly where students find
most difficulty. Whilst the most useful asset in compiling reports is experience, we
provide several opportunities for you to develop your report writing skills. Specifically:
• In Practical 1 (see below) you are asked to download various articles and,
where appropriate, look at the graphical and numerical measures the authors
use to summarise their own findings.
• In Tutorial 1 (see below) you are asked to summarise the findings of other
authors considered in Practical 1.
• In Assessment 1 (see below) you are required to produce a written summary of
a given paper (dealing with purchasing power parity).
• In Assessment 2 you are required to produce a report summarising your own
investigations (into the movement of stock prices).

The skills you develop in the course of these tasks should help in your later
dissertation work.

3. Practical Material
Working through the practical material is essential in developing a “working
knowledge” of how data can be obtained and analysed. In addition you should
develop a “feel” for how data behaves, and some of the potential difficulties that can
arise.
You work through each of the practical units and complete the tasks as indicated.
You should be aware that, because websites are, by their very nature, subject to
amendment (updating), the screen views provided may not be precisely the same as
you obtain. At the end of each unit you will find some practical exercises which you
should try and find time to work through; these will test your understanding of the
material in the unit. Further (optional) exercises can be found on the module web
page, allowing you to extend your knowledge if you wish.
Familiarity with the practical material will be required when you attempt the two
assessments for the module.


4. Tutorial Material
Working through this tutorial material will give you practice in performing statistical
calculations by hand, usually on “small” data sets. This will help you both in
understanding computer (Excel) output and in producing, and writing down, logical
arguments leading to a specific conclusion. Both of these will be useful in the
assessments you need to undertake, especially the second one (see below).
Further (optional) tutorial questions can be found on the module web page, giving
further practice or allowing you to extend your knowledge if you wish.

5. Assessment Material
The module has two assessments. Precise details of, and guidelines for, the
assessments will be made available at the appropriate time during the course.
Roughly speaking Assessment 1 will cover Units 1-4 in the lecture notes, and
Assessment 2 Units 5-9.
You need to successfully complete the assessments in order to pass the module.


1 Introduction

Learning Outcomes

At the end of this unit you should be able to:


• Appreciate the ideas that statistics is concerned with.
• Understand how, and why, financial institutions produce, collect and use data.
• Realise the need to check data accuracy.
• Recognise the need for statistics to measure uncertainty/risk empirically.
• Recognise the importance of probability to assess, and understand, the nature
of risk and how it can be controlled.
• Recognise the need for “mathematical models” in order to control the
uncertainty inherent in many financial situations.
• Appreciate the need for underlying finance concepts and theory.

“We evaluate risk poorly – whether it comes to insurance,
speculation or beef scares. These mistakes cost us all money in the
long run.”
Kay, J. (1996) On the trail of a good bet, Financial Times 29th March


1. Overview
In this introductory unit we briefly look at various motivating factors for the topics we
shall study. Statistics is often broadly defined as the “Analysis of data” or, more
succinctly as “Data Analysis”. Closely connected to this is the concept of Probability,
which you may have encountered in the context of assessing Risk. (You will
probably have some general idea of these concepts, although no detailed knowledge
is assumed.) There are various questions that might spring to mind in relation to
these terms:

Some Questions

Question 1: Who collects data and why?

Question 2: How is data collected?

Question 3: What type of statistical data is available?

Question 4: Where does one actually find (statistical) data?

Question 5: How reliable is the statistical data available?

Question 6: How does one actually analyse (statistical) data?

Question 7: What types of answers/conclusions can we expect to make from an


analysis of (statistical) data?

Question 8: What is probability, and what is its purpose?

Question 9: What has probability got to do with analysing data?

Question 10: How do we measure/quantify risk?

Question 11: How can we analyse the risks present in any particular situation?

Following on from Questions 6 and 11

Question 12a: Do we need (computer) software to do our analysis, or can we make


do with hand calculations?

Question 12b: If we need software, what computer packages are best?


Some Partial Answers


We shall spend this module looking at some aspects of the above questions, largely
in a financial context. The following extract, taken from the Bank of England (BoE)
website, gives us some interesting perspectives on some of the above questions.

Overview of Monetary & Financial Statistics


Monetary and Financial Statistics Division (MFSD) collects monetary and financial data
from all banks operating in the UK.
Monetary statistics are collected from banks (including the central bank) and building
societies operating in the UK - together they make up the Monetary Financial Institutions
(MFI) sector. Sector and industrial breakdowns of the money and credit data are
available. There is also a wider measure of lending to individuals available, which
includes mortgage lending and consumer credit provided by institutions other than banks
and building societies. Data for notes and coin (also known as 'narrow money') are
published on a weekly and monthly basis.
Financial Statistics include the banks' contribution to UK GDP and balance of payments,
the UK's gold and foreign currency reserves, statistics on financial derivatives and the
UK's international banking activity. We also collect effective and quoted interest rates data
and compile statistics on new issues and redemptions of equities, bonds, commercial
paper and other debt securities in the UK capital markets. Every three years we compile
the UK's contribution to the triennial survey of derivatives and Forex for the BIS - the next
one is due in 2010.
MFSD has a small Research & Development Team which undertakes research to ensure
the quality and relevance of our statistics, and to monitor our compliance with our
Statistical Code of Practice. The team oversees our international work, including
representational work with the European Central Bank (ECB). They also work closely with
the Office for National Statistics (ONS) in a number of areas (see Key Resources for an
overview of the joint research programme).
The data collected by the Monetary and Financial Statistics Division contribute to a wide
range of outputs. Many of these series are accessible from this web site while some
others are primarily collected as contributions to wider economic or financial aggregates
and so are not automatically identified here.
Much of MFSD's data are used within the Bank by Monetary Analysis, as part of the input
to the MPC process and the Inflation Report, and within Financial Stability in their
assessments of the UK banking sector. The ONS is also a major customer for our data. A
Firm Agreement governs the relationship between the Bank and ONS, with the ONS
providing an annual assessment of our performance (see Key Resources).
The Bank currently acts as the 'post box' for the banks' FSA regulatory returns - receiving
the bulk of data electronically and passing them on to the FSA. A Service Level
Agreement covers the provision of services between the Bank and FSA (see Key
Resources). Data are available on line via our interactive database or in formatted tables
in our Monthly on-line publication Monetary and Financial Statistics.


From this brief extract we can give some (very partial) answers to a few questions:
(Partial) Answer 1: Bank of England (and building societies and others) collect data
for various reasons:
• Produce surveys (Derivatives and Forex, i.e. foreign exchange) for the BIS (Bank
for International Settlements).
• Produce reports (e.g. Inflation).
• Monitor financial/banking sector.
• Make data (widely) available to others (general public, ONS – Office for
National Statistics).
• Link to regulatory bodies, e.g. FSA (Financial Services Authority).

(Partial) Answer 2: Collated at a central point (BoE) from other sources (e.g.


building societies).

(Partial) Answer 3: Types of data include


• Monetary statistics (money, credit, mortgage lending).
• Financial statistics (gold/currency reserves, financial derivatives, interest rates,
equities and bond data, debt securities).
• Economic statistics (GDP, balance of payments).
• “International” aspects (Forex, aggregated data) with links to BIS and ECB
(European Central Bank).

(Partial) Answer 4: Websites are probably now the most convenient data sources.
• BoE has interactive databases and (monthly) on-line publications.
• ONS is an extensive resource for “general” statistical information.
• International data available from the ECB (and many others).

(Partial) Answer 5: Data integrity carefully monitored.


• BoE has a Statistical Code of Practice, and many other major data providers
have similar mechanisms.
• But this will not be universally true, and data accuracy is then an important
issue.


The remaining questions are not addressed in the above brief extract from BoE.
Partial answers can be found in the following references, which you are asked to
download in Practical 1. You should look through each of these papers during the
next few weeks, even if you do not understand all the technical details. In particular,
look at the Barwell paper (see below) in preparation for Tutorial 1.

(Partial) Answer 6/7: See the following BoE article:


Barwell, R., May, O., Pezzini, S. (2006), The distribution of assets, incomes and
liabilities across UK households: results from the 2005 NMG research survey,
Bank of England Quarterly Bulletin (Spring).

(Partial) Answer 8/9: See the following BoE articles:


Boyle, M. (1996) Reporting panel selection and the cost effectiveness of statistical
reporting.
If you have sufficient background (in derivatives) you might also look at
Bahra, B. (1996) Probability distributions of future asset prices implied by option
prices, Bank of England Quarterly Bulletin (August).

(Partial) Answer 10/11: See the following BoE articles:


Sentance, A. (2008) How big is the risk of recession? Speech given to Devon and
Cornwall Business Council.
Bunn, P. (2005) Stress testing as a tool for estimating systematic risk, Financial
Stability Review (June).

Note: Copies of some papers, data sets and Excel spreadsheets can be found on the
module web page.


2. Analysis in Context
In a financial context we would like to address problems such as the following:

Problem 1: We intend to build up a portfolio of assets (possibly by using funds from
investing clients), and we start out by purchasing £1 million in gold.
(a) How much money can we “realistically” lose on this position over a period of one
day?
(b) How much money can we “realistically” lose on this position over a period of one
week?

Problem 2: We wish to add to our portfolio by purchasing £1 million in sterling. Now
how much money can we realistically lose on the combined position
(a) over a period of one day?
(b) over a period of one week?

Comments: There are a variety of issues involved:


• How do we assess the “riskiness” of our (gold/sterling) positions?
• What do we mean by “realistically”?
- It is possible, though highly unlikely, that we could lose almost all our
money in a single day if events take a dramatic turn. (Enormous new gold
reserves are found throughout the world!)
- How do we assess “likely” fluctuations in the price of gold and sterling?
• How do the risks associated with gold and with sterling combine with one
another? We would like the respective risks to offset each other!
• How does time enter into the discussion? Is the risk (of holding gold, say) over
two days double the risk over one day?
• Where can we find data relating to all these issues?
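
To give the word "realistically" in Problems 1 and 2 a little more shape, here is a minimal
sketch of one common way of putting a number on a "realistic" loss. All of the inputs are
invented purely for illustration (nothing in this unit fixes them): we assume that daily
percentage changes in the gold price are roughly normally distributed with zero mean and a
standard deviation of 1.2%, and we ask what loss is exceeded on only about 5% of days. In
Python (a spreadsheet would do the same arithmetic):

    import math

    position = 1_000_000      # £1 million held in gold
    daily_vol = 0.012         # assumed standard deviation of daily returns (1.2%)
    z_95 = 1.645              # 95th percentile of the standard normal distribution

    one_day_loss = z_95 * daily_vol * position
    one_week_loss = z_95 * daily_vol * math.sqrt(5) * position   # 5 trading days

    print(f"Loss exceeded on roughly 5% of days:  £{one_day_loss:,.0f}")
    print(f"Loss exceeded in roughly 5% of weeks: £{one_week_loss:,.0f}")

Under these assumptions the one-day figure is about £20,000 and the one-week figure about
£44,000. Notice that the weekly risk is not five times the daily risk: if day-to-day changes
are independent it grows with the square root of time, which is one (model-dependent) answer
to the time question raised above.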

Problem 3: We intend to build up a portfolio of assets and go to a firm of financial
consultants for advice. We have done some research on the firm and conclude:
• When an investment goes up in price, the firm has correctly predicted this 70% of
the time.
• When an investment goes down in price, the firm has correctly predicted this 60%
of the time.
The firm recommends we invest in certain assets. What action do we take, e.g. how
much reliability can we put in the forecasts of the firm?


Comments: Although it looks like we should take the firm’s advice it is not clear what
the chances of success really are.
• The firm seems to predict events with greater accuracy than just guessing.
Does the latter correspond to 50% accuracy?
• However which percentage do we believe? After all we do not know in advance
whether our investments will go up or not; we have only the firm’s forecast.
• Suppose the firm is very good at predicting “small” price rises, but is very poor
at predicting “large” price drops. If we follow its forecasts we might make a lot of
small gains, but also experience large losses which the firm could not forecast.
(Unfortunately the losses may well greatly outweigh the gains.)
• Maybe we should take the average of the two percentages? And is 65% that
much above 50%?
• We really need
- some means of sorting out the “logic” of the situation, and
- some means of performing the necessary computations (a small illustrative
calculation is sketched below).
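
To illustrate the kind of computation Problem 3 calls for, the sketch below applies the
conditional probability ideas introduced in Unit 5. It needs one extra assumption that the
problem does not give us: that, before hearing any forecast, an investment is equally likely
to go up or down (a 50/50 prior). With a different prior the answer changes, which is
precisely why the "logic" needs sorting out.

    # Assumed prior: investments rise half the time (NOT given in Problem 3).
    p_up = 0.5
    p_down = 1 - p_up

    # The firm's track record, as stated in Problem 3.
    p_pred_up_given_up = 0.70     # correct when the price rises
    p_pred_up_given_down = 0.40   # wrong 40% of the time when the price falls

    # How often does the firm predict a rise at all?
    p_pred_up = p_pred_up_given_up * p_up + p_pred_up_given_down * p_down

    # Bayes' rule: chance of a rise, given that the firm predicts one.
    p_up_given_pred_up = p_pred_up_given_up * p_up / p_pred_up
    print(p_up_given_pred_up)     # approximately 0.636

So, under this particular prior, a recommendation from the firm raises the chance of a rise
from 50% to roughly 64%: better than guessing, but far from certain, and it says nothing
about the relative sizes of the gains and losses involved.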

Problem 4: You are offered one of the following two options:


• Option 1: Pay £10 to participate in a game of chance where you have a
50% chance of winning £20 and a 50% chance of winning nothing.
• Option 2: Pay £100 to participate in a game of chance where you have a
25% chance of winning £400 and a 75% chance of winning nothing.
(a) Which option is the riskier?
(b) Which option would you choose to take?

Comments: How do we measure risk?


• In its simplest terms we expect risk to measure the “uncertainty” of an event; the
more uncertain an outcome is the greater the risk associated with trying to
make any prediction. Associated with this logic are two important points:
- Risk can be defined, and measured, in a variety of different ways.
- Uncertainty is quantified by the ideas of probability; a very unlikely event
has a small probability associated with it. But “unlikely” and “uncertain” are
not the same concept, and we need to work hard to uncover the
distinction. The key idea is that of variability (discussed later).


• Although Problem 4 may seem a little artificial, this is precisely the type of
gamble investors take, albeit without really knowing their potential returns, and
the probabilities of achieving them. (A small worked calculation for the two options
is sketched below.)
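
As a preview of the ideas developed in Unit 5, the small worked calculation below computes
the expected net winnings and a simple measure of spread (the standard deviation) for each
option in Problem 4. The entrance fees, prizes and probabilities come straight from the
problem; the only assumption is the choice of standard deviation as the risk measure, and,
as noted above, other measures are possible.

    import math

    def mean_and_sd(outcomes):
        """outcomes: list of (net winnings, probability) pairs."""
        mean = sum(x * p for x, p in outcomes)
        variance = sum(p * (x - mean) ** 2 for x, p in outcomes)
        return mean, math.sqrt(variance)

    # Net winnings after paying the entrance fee.
    option_1 = [(20 - 10, 0.50), (0 - 10, 0.50)]     # pay £10; win £20 or nothing
    option_2 = [(400 - 100, 0.25), (0 - 100, 0.75)]  # pay £100; win £400 or nothing

    print(mean_and_sd(option_1))   # (0.0, 10.0)
    print(mean_and_sd(option_2))   # (0.0, about 173.2)

Both options have the same expected net winnings (zero), so on that measure neither is
better; the difference lies entirely in the spread of outcomes, which is far larger for
Option 2. In that sense Option 2 is the riskier, although which one you would choose depends
on your attitude to risk.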

We shall examine some of the issues involved in solving Problems 1-4 in several of
the later units.

3. Why Do We Need Probability?


In any practical application the inherent risks are usually quite complicated to
evaluate. To get started we look at “simpler” situations where the risk is both easier to
identify, and easier to compute. Unfortunately this is not always so easy to do!

Example 3.1: I have a (well mixed) bag of 100 counters, 25 red and 75 black. You
are invited to play the following game, with an entrance fee of £50. A counter is
picked at random. If the counter is black you win £100, otherwise you win nothing.
Should you play?

Comments 1: The risk here is (of course) that we do not win. There are several
important ideas that are relevant here:
• How likely are we to win (or lose)? If winning is “unlikely” we don’t really want to
be playing.
• How do we measure the “cut-off” between playing and not playing?
• How much can we expect to win?
• How risky is the game?
• How much risk are we prepared to accept? (What is an acceptable risk, and
what is unacceptable?)

We shall give a solution to Example 3.1 in Unit 5. At present we merely note that we
need probability ideas in order to assess the “likelihood” of winning the game (and
also to give a precise meaning to the “random” selection of counters).

Example 3.2: I have a (well mixed) bag of 100 counters, some red and some black.
You are invited to play the following game, with an entrance fee of £50.
• A counter is picked at random.
• If the counter is black you win £100, otherwise you win nothing.
Should you play?


Comments 2: Here we just do not know, i.e. we do not have enough information
to make a rational choice.
• The uncertainty in Example 3.1 is measurable. We are not sure what colour will
be chosen but, as we discuss in Unit 5, we can assign (measure) likelihoods
(probabilities) to all (two) possibilities. Most games of chance (gambling) are like
this.
• The uncertainty in Example 3.2 is unmeasurable. We cannot make any sensible
predictions about what will happen; if we play the second game we are “leaving
everything up to fate”. We will either win or not, but we cannot judge beforehand
which is the more likely, and hence cannot assess the inherent risk involved.
• In practice we are often somewhere between these two situations. We have
some knowledge, but not (nearly) as much as we would like. We may need to
estimate certain quantities, and this will lead to increased (but still quantifiable)
uncertainty.
• The idea of investors behaving in a rational manner, after having considered all
the available information, is a core assumption in much of the economic and
financial literature. In practice this is frequently not the case, and behavioural
risk refers to the risks resulting from this non-rational behaviour. The field of
behavioural finance has grown up to explain how psychological factors
influence investor behaviour. For an introduction to this rapidly expanding field
see http://www.investorhome.com/psych.htm. You will encounter some
behavioural finance ideas in later modules.

4. The Need for Models

Example 4.1: As this section of the notes is being revised (June 1 2009) General
Electric stock is at $13.48; this is the value quoted on NYSE (New York Stock
Exchange). What will the stock value be tomorrow (June 2 2009)?
Comments: As it stands this is un-measurable (today).
• Indeed, with this perspective, much of finance would lie in the realm of
unmeasurable uncertainty, and no reliable predictions (forecasts) can be
made.
• To make any kind of progress we need to assume some kind of “predictable”
behaviour for the stock price, so we can use the price today to estimate the
value tomorrow.


• It is usual to formalise this procedure into the term “model”, and to say that we
“model the stock price movements”. Precisely which model we choose is still a
matter of some debate, and we shall look at some of the possibilities later.
• But the important point is that we need to select some type of model in order to
remove the “un-measurable uncertainty” and replace it by “measurable
uncertainty”. (Precisely what we mean by this phrase will only become clear
once we have worked through most of the module!) One very simple candidate
model is sketched below.
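
To make the word "model" slightly more concrete, here is a minimal sketch of one of the
simplest candidates: assume that each day the stock price is multiplied by a random factor
drawn from a normal distribution. Nothing in this unit claims this is the right model for
General Electric, and the 2% daily volatility below is an invented figure; the point is only
that, once some model is assumed, tomorrow's price becomes a quantity with a measurable
spread rather than a complete unknown.

    import random

    price_today = 13.48    # GE closing price quoted above (1 June 2009)
    daily_vol = 0.02       # assumed standard deviation of daily returns (2%)

    # Simulate many possible "tomorrows" under the assumed model.
    simulated = [price_today * (1 + random.gauss(0, daily_vol)) for _ in range(10_000)]
    simulated.sort()

    # The middle 95% of the simulated prices.
    print("Plausible range for tomorrow:",
          round(simulated[250], 2), "to", round(simulated[9750], 2))

A different distribution or a different volatility would give a different range; choosing,
estimating and checking such assumptions is what much of the rest of the module is about.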

5. Using Statistics
The word “statistics” derives from the word “state”, a body of people existing in social
union, and its original 18th century meaning was “a bringing together of those facts
illustrating the condition and prospect of society”. Just as the word “risk” has many
connotations, so too does the modern usage of the term “statistics”. For example;
• Business statistics. The science of good decision making in the face of
uncertainty. Used in many disciplines such as financial analysis, econometrics,
auditing, production/operations and marketing research.
• Economic statistics. Focuses on the collection, processing, compilation and
dissemination of statistics concerning the economy of a region, country or group
of countries. This in itself is often subdivided into various groupings such as
- Agriculture, Fishing & Forestry
- Commerce, Energy & Industry
- Labour Market
- Natural & Built Environment
- Social & Welfare

• Financial statistics. Statistics relating to international and domestic finance,
including data on exchange rates, liquidity, money and banking, interest rates,
government accounts, public sector and so on.
• Health statistics. Statistics relating to various health issues (both national and
international) such as disease levels, drug use, hospitals, births and deaths and
so on.
• Population & Migration. Statistics relating to various demographic issues such
as population estimates, population projections, census information, births,
deaths and marriages, immigration and emigration and so on. (Note the overlap
with some health statistics issues.)


• Transport, Travel & Tourism. Statistics relating to a variety of travel-related
issues such as
- air, freight, rail, sea and public and private transport,
- business, domestic, holiday and overseas tourism

• Crime & Justice. Statistics relating to crime and arrest, criminal justice and law
enforcement data, prisons, drugs, demographic trends and so on.

Obviously this list could be extended considerably, and there are clearly connections
between the various groupings. But the important point is that

STATISTICS IS CONCERNED WITH THE COLLECTION AND ANALYSIS OF DATA,
OFTEN IN ORDER TO FORECAST FUTURE TRENDS

Terminology: We shall use the term “finance” in a very wide sense to include
• Financial Institutions and Financial Services
• Corporate Finance
• Econometrics (Financial Economics)
• Financial Accounting
• Mathematical Finance

Examples taken from all of these areas will appear at various stages.
Our interest will largely be in “finance related” areas where very large quantities of
statistical information have been gathered over the years. Indeed the rate at which
information is collated is increasing (rapidly) with time. We would clearly like to
make use of some of this information (since it was deemed important enough to
collect in the first place!).

Example 5.1: General Electric monthly closing stock prices for the first five months of
2009 are shown in Table 5.1. What will the stock's value be in July 2009?

Date (2009)       Jan      Feb      March    April    May
Stock price ($)   11.78    8.51     10.11    12.65    12.69

Table 5.1: General Electric monthly closing stock price, January to May 2009


Comments: Observe the following:


• Here we want to use the information contained in this (historic) series of prices
to forecast the stock price in the following month. (A first descriptive step, computing
the month-to-month percentage changes, is sketched at the end of these comments.)
• To do this we need to establish if there is a “pattern” in the prices.
• This in turn requires that we use some type of model identifying and describing
the “pattern”.

We shall not try and solve this type of problem until much later. At present you might
like to consider
• “How much” information would you expect historic prices to contain?
- This bears directly on the famous Efficient Market Hypothesis (EMH).
- Is 6 months data enough to assess past trends? Just how far back in time
should we go?
• “How accurately” would you expect to be able to forecast future prices? (Would
we be happy to be able to predict prices to, say, within 10%?)
• How much money would you be prepared to invest in your predictions?
- Using monetary values is a good way to assess the “subjective
probabilities” you may have on particular stock movements.
- What you are prepared to invest will also reflect your “risk taking” capacity;
the more risk you are prepared to take the more (money) you will be willing
to invest.
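
As a first, purely descriptive step (no model and no forecast yet), the sketch below computes
the month-to-month percentage changes in the Table 5.1 prices. Even this simple calculation
makes a useful point: the monthly moves range from roughly -28% to +25%, which should temper
any hope of predicting the next value to within a few percent.

    prices = [11.78, 8.51, 10.11, 12.65, 12.69]   # Jan-May 2009 closing prices (Table 5.1)
    months = ["Feb", "Mar", "Apr", "May"]

    for month, previous, current in zip(months, prices, prices[1:]):
        change = 100 * (current - previous) / previous
        print(f"{month}: {change:+.1f}%")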

Summary

WE NEED (STATISTICAL) DATA TO HELP US FORMULATE USEFUL MODELS
OF IMPORTANT UNDERLYING PROCESSES (MOVEMENTS OF STOCK PRICES,
INTEREST RATES ETC.)

WE NEED STATISTICAL ANALYSIS TO HELP US UNDERSTAND PRECISELY
WHAT THE MODELS ARE TELLING US.


6. Important Business and Finance Concepts


In addition to purely statistical considerations, there are a few very important
“finance” principles that help guide us in our formulation, and analysis, of models. We
shall meet some of these in this module, but you will come across them repeatedly
(explicitly or implicitly, and possibly in disguised form) in much of the finance
literature, and in many of the other modules in this programme.
• Present Value (PV). The value at time t (now) of an “asset” is equal to its
expected value at some future time T “discounted” by a (possibly random)
“discount factor”.
(This provides a computational tool for the valuation of many types of assets, e.g.
annuities, bonds, equity and company value, and has links with the EMH. A small
numerical illustration is given after this list.)
• No Arbitrage Principle. Very technical to state in general terms, but roughly
“Arbitrage relates to a trading strategy that takes advantage of two or more
securities being mis-priced relative to each other”. The no arbitrage principle
says this cannot happen; see http://en.wikipedia.org/wiki/Arbitrage for further
details.
(This theoretical concept can be turned into schemes for giving forward and
future prices of many financial assets such as stocks and bonds, indices,
commodities and derivative contracts. In addition arbitrage lies at the heart of
hedging principles.)
• Sensitivities. In general terms, this measures how sensitive a quantity is to (small)
changes in market parameters.
(Sensitivity analysis appears in many guises: cash flows and project net present
value (NPV), financial statement modelling, Capital Asset Pricing Model
(CAPM), bond duration (and convexity), stock beta values, options (Greeks:
delta, gamma and so on), parameter and “shock” sensitivities in regression
models (see later units).)
• Risk and Return. The only way to make greater profits (consistently) is to take
greater risks. Note that this is only a necessary condition, not a sufficient one;
taking more risks does not guarantee higher returns!
(This fundamental insight leads, when formulated mathematically, to important
ideas in portfolio theory and asset allocation.)
• Performance measures. How does one assess performance?
(This question leads to important ideas such as risk adjusted returns, VAR
(Value at risk), EVA (Economic value added) and portfolio, and fund
management, performance measures.)
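
As a small numerical illustration of the present value idea (the figures are invented purely
for the example): suppose an asset is expected to be worth £1,000 in two years' time and we
discount at 5% per year. Then

    PV = £1,000 / (1.05)^2 ≈ £907.03,

so paying much more than about £907 today for that expected payoff only makes sense with a
lower discount rate or, equivalently, a lower perceived risk.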


7. Recommended (Background) Reading


• Bernstein, P. (1996). Against the Gods: The Remarkable Story of Risk. New
York, Wiley.

A very readable historical look at the evolution of financial risk ideas.


• Brealey, R.A. and Myers, S.C. (2003). Principles of Corporate Finance, 7th
International Edition. New York, McGraw Hill.

All you could ever want to know about corporate finance. More for reference and
long term study.
• Ferguson, N. (2008). The Ascent of Money: A Financial History of the World.
London, Allen Lane.

Provides some very interesting background material to help understand some of the
benefits of, and problems with, modern finance.
• Schrage, M. (2003). Daniel Kahneman: The Thought Leader Interview,
Strategy+Business website. Available from www.strategy-business.com
(accessed 1st June 2009).

An interview with one of the founders of behavioural finance.


• Statistics from Wikipedia, available at http://en.wikipedia.org/wiki/Statistics
(accessed 1st June 2009).
This provides a relatively brief overview of the scope of modern statistics, together
with a lot of useful references (including online courses and textbooks).

For ease of reference we also list the various papers mentioned in the unit:
• Bahra, B. (1996) Probability distributions of future asset prices implied by option
prices, Bank of England Quarterly Bulletin (August).
• Barwell, R., May, O., Pezzini, S. (2006) The distribution of assets, incomes and
liabilities across UK households: results from the 2005 NMG research survey,
Bank of England Quarterly Bulletin (Spring).
• Boyle, M. (1996) Reporting panel selection and the cost effectiveness of
statistical reporting.
• Bunn, P. (2005) Stress testing as a tool for estimating systematic risk, Financial
Stability Review (June).
• Sentance, A. (2008) How big is the risk of recession? Speech given to Devon
and Cornwall Business Council.


2 Economic and Financial Data

Learning Outcomes

At the end of this unit you should be familiar with the following:
• How data is collected.
• Important data sources.
• The basic types of data.
• The accuracy inherent in the data.
• How to present data
- Meaningfully
- Unambiguously
- Efficiently

“One hallmark of the statistically conscious investigator is their firm belief that
however the survey, experiment, or observational program actually turned out, it
could have turned out somewhat differently. Holding such a belief and taking
appropriate actions make effective use of data possible. We need not always
ask explicitly "How much differently?" but we should be aware of such
questions. Most of us find uncertainty uncomfortable ... (but) ... each of us who
deals with the analysis and interpretation of data must learn to cope with
uncertainty.”
Frederick Mosteller and John Tukey: Exploratory Data Analysis 1968


1. Guiding Principles
• Whenever we discuss information we must also discuss its accuracy.
• Effective display of data must satisfy the following criteria:
1. Remind us that the data being displayed do contain some uncertainty.
2. Characterise the size of that uncertainty as it pertains to the inferences
(conclusions) we have in mind.
3. Help keep us from drawing incorrect conclusions (in 2) through the lack of
a full appreciation of the precision of our knowledge.
• Here we look at the numeric display of data, in Unit 3 at its graphical display,
and in many of the remaining units at the inferences that can (and cannot) be
drawn from the data.
• Central to all of this is the accuracy we can assign to the procedures we
undertake (whether it be data collection or data analysis).

2. Data Sources
In the past, unless you collected your own data, there were relatively few sources of
data available. However, with the advent of modern computing techniques, and in
particular the emergence of the Internet (Web), this has all radically changed.

THIS MODULE WILL GIVE YOU AN OPPORTUNITY TO BECOME FAMILIAR WITH
SOME OF THE DATA SOURCES AVAILABLE ON THE WEB. YOU SHOULD
REGARD THIS AS AN IMPORTANT FEATURE OF THE COURSE.

Here we give a very brief list of some (generally free) sources we have found useful.
You should look to compile your own list of websites, regarding the list below merely
as a starting point.


Bank Websites
• Bank of England www.bankofengland.co.uk
• Federal Reserve Bank of St. Louis http://stlouisfed.org/default.cfm
- One of the websites for the FED (Federal Reserve System), containing
material under the general areas of Banking, Consumers, Economic
Research, Financial Services, Education and Publications.
- Gives access to FRED (Federal Reserve Economic Database), containing
about 20,000 economic time series in Excel format.

• Liber8 An economic information portal for librarians and students, and closely
linked with the Federal Reserve Bank of St. Louis.
- Gives access to many economic databases at an international, national or
regional level. Sometimes more easily accessible than the St. Louis FRB.
▪ Many Economic Indicators available.
▪ Access to further FED databases such as the Bureau of Labour
Statistics, Bureau of the Census, etc.

Government (Agency) Websites


• Office for National Statistics (ONS) http://www.statistics.gov.uk/default.asp
- See http://en.wikipedia.org/wiki/Office_for_National_Statistics for an
account of the major statistical areas overseen by the ONS.
- Downloadable (time series) data in Excel format.
- Many publications available online.

• European Union Statistics. For a detailed general discussion see
http://en.wikipedia.org/wiki/European_Union_statistics
Eurostat is the data provider for European statistics, available at
http://epp.eurostat.ec.europa.eu/portal
- For a general discussion of the areas covered by Eurostat see
http://en.wikipedia.org/wiki/Eurostat
- Data can be freely downloaded, but registration is required.
- The database is vast, but three of Eurostat’s significant roles are:


▪ Producing macroeconomic data which helps guide the European
Central Bank in its monetary policy for the euro.
▪ Providing the data used for the Excessive Deficit Procedure.
▪ Its regional data and classification (NUTS), which guide the EU's
structural policies.

• U.S. Statistics. These are coordinated at http://www.fedstats.gov/


- FedStats provides access to the full range of official statistical information
produced by the Federal Government without having to know in advance
which Federal agency produces which particular statistic.
- Searching and linking capabilities to more than 100 agencies.
- Provides data and trend information on such topics as economic and
population trends, crime, education, health care, aviation safety, energy
use, farm production and more.

Statistics Websites
• UK Statistics Authority http://www.statistics.gov.uk/
- Responsible for promoting, and safeguarding the quality of, official UK
statistics.
- Has links to other government sites (Revenue & Customs, Crime &
Justice, Health & Care and ONS).

• Economic and Social Data Service (ESDS) http://www.esds.ac.uk/

- National data service providing access and support for an extensive range
of key economic and social data, both quantitative and qualitative,
spanning many disciplines and themes.
- ESDS Government http://www.esds.ac.uk/government/
▪ Large scale government surveys (General Household Survey,
Labour Force Survey) monitoring changes in population structure.
▪ Online course materials available.
- ESDS International http://www.esds.ac.uk/international/
▪ Access to international aggregate (macro) datasets, and user guides.
Available datasets include data from the OECD, IMF (International
Monetary Fund), UN (United Nations) and Eurostat.


• OECD (Organisation for Economic Cooperation & Development)


http://www.oecd.org/home/
- Brings together countries (over 100) to support sustainable economic
growth and related issues. Many publications and manuals available.
- Many (Excel downloadable) datasets available.
- Useful section of frequently requested statistics.

• IMF (International Monetary Fund) http://www.imf.org/external/index.htm


- Promotes international monetary cooperation (and related issues)
amongst its (185) member countries. Many publications and manuals
available.
- Many economic and financial (Excel downloadable) datasets available.
- Useful section of frequently asked questions relating to WEO (World
Economic Outlook) database.

Finance Websites
• Yahoo Finance http://finance.yahoo.com/
- Extensive website with data available in many forms (current and historic
prices, charts). Historic data is free!
- Interesting background information on thousands of companies.
- Key statistics including a large volume of accounting based information.
- Separate sites devoted to major stock markets (US, UK, Singapore, India,
Hong Kong and so on). However there is less information available for
some markets; see http://www.eoddata.com/ for more details.
- Option information on company stock.
• Investopedia http://www.investopedia.com/ (Free registration)
- Extensive financial dictionary, articles and tutorials.

Learning (Statistics) Websites


• FreeStatistics: Learning Statistics Extensive material, on a large number of
statistical topics, available at http://freestatistics.altervista.org/en/learning.php
• Online Text and Notes in Statistics for Economists available at
http://www.economicsnetwork.ac.uk/teaching/text/statisticsforeconomists.htm
- Contains descriptions, and links, to a variety of online learning sites.


Clicking on Online Data (left margin) leads to a good source of online economic
data, and it may be worthwhile bookmarking this site
(http://www.economicsnetwork.ac.uk/links/data_free.htm). This contains some
of the above websites (but with different commentaries), and some new ones.

• ESDS International http://www.esds.ac.uk/international/


- Provides comprehensive support material for (macro) datasets.
- Provides teaching and learning resources.

• Biz/ed http://www.bized.co.uk/
A site for students and educators in business studies, economics and
accounting (plus other topics). Important items are:
- Data section comprising
▪ Time Web: an integrated package of data and learning materials, with
much useful advice on data analysis.
▪ Key Economic Data: includes “Most commonly requested data”
and advice on its statistical analysis.
▪ Links to ONS and Penn World Data (a valuable data source based
at Pennsylvania University).
- Company Info: data and case studies on a variety of organisations.
▪ Virtual Worlds: this enables you to take a tour of, for example, the
economy, banks and developing countries. Go into this and see what
you find!
▪ Reference: definitions, database links, study skills and more.
- This is a website you should have a look at sometime.

Useful Databases
• UK Data Archive (UKDA) http://www.data-archive.ac.uk
- Curator of largest digital collection in UK for social sciences/humanities.
- Houses a variety of databases including ESDS and Census.ac.uk, the
latter giving information from the last four U.K. censuses (1971-2001).

• (Office for) National Statistics (ONS) http://www.statistics.gov.uk


- See Practical 2 for a discussion of the website.


• Global Market Information Database (GMID)


- Provides key statistical data on countries, markets and consumers.
- Includes demographic, economic and lifestyle indicators.
- Very up to date; data is added as soon as it becomes available.
- Covers over 350 markets and over 200 countries.

• MIMAS http://www.mimas.ac.uk
- A nationally sponsored data centre providing the UK Higher Education
sector with access to key data and information resources.
- Purpose is to support teaching, learning and research across a wide range
of disciplines. Free to authorised (education) institutions.

Links to Websites
• FDF Financial Data Finder http://www.cob.ohio-state.edu/fin/fdf/osudata.htm
- Provides direct access to 1800 financial websites (including some of the
above), arranged alphabetically.

3. Data Types

3.1 Data Types


Although data may appear to be all the same (just numbers), there are several
different categories of data types. This classification is important since different types
may require different methods of analysis, and different ways of interpreting results.
The basic setup is as in Fig.3.1 below.

Data
• Quantitative: Continuous or Discrete
• Qualitative: Ordinal or Nominal

Fig.3.1: Basic data classification


• Quantitative variables are measurable or countable, for example height, age,


amount spent, price, number of defaulters, percentage of defects, time to
complete a task. The values of the data are “genuine” numbers and not just
labels.
• Quantitative data are further sub-classified into discrete and continuous.
- A discrete variable can only assume certain, usually integer, values, for
example the number of children or the number of defective items. In fact,
discrete data most commonly arise as the result of counting something.
- A continuous variable can assume any value within a specified range.
For example the weight of a parcel and the time spent on a phone call are
both continuous variables as they can be measured as accurately as the
tools available allow.
- The distinction between discrete and continuous can be blurred and there
are grey areas. Some data which are really discrete are conventionally
treated as continuous. Examples include:
▪ Money values (one cannot have fractions of a penny in practice, but
money is usually treated as a continuous variable).
▪ Population figures (again one is dealing with numbers so large that
they can be treated as continuous).
▪ It sounds confusing, but in practice it doesn't cause problems because
the convention is simply to treat a variable as continuous if it seems
sensible to do so.
▪ In a similar way some variables that are really continuous are often
recorded in a discrete fashion; the age of adults is the most common
example. Age is a continuous variable; you could measure age to within
a second or smaller if you really wanted to. However, it is usually
recorded as 'age in years last birthday', which is discrete.
▪ You should be aware that some types of data can be either discrete or
continuous but, in most of the circumstances we will deal with in this
module, the distinction is not of vital importance.
• Qualitative (or Categorical) variables refer to what category something falls in
and are not naturally numerical. A person’s gender is a qualitative variable, for
instance, and so is the colour of a product. Sometimes qualitative variables will
be coded as numbers (for instance, the colour of shirt you buy may be coded 1
for red, 2 for white and 3 for blue), but these numbers are not a count or a
measure of anything.


- Qualitative variables are further subdivided into ordinal and nominal:


▪ Ordinal variables take possible values having an underlying order;
for example the shift you work (Early, Middle, or Late) or how you
rate a chocolate bar (Good, Fair, Poor, Awful).
▪ Nominal variables are purely descriptive; for example gender,
colour, country of origin and so on. In other words, the data are
merely labels. (A small illustration of these data types in software
terms is given at the end of Section 3.1.)
- Very often economic and financial variables (GDP, stock prices, etc.)
will be quantitative, but not always so. You may ask financial market
practitioners whether they expect the market to go up or down, or
whether the economy will get better or worse.

• Other Terminology. Sometimes the following distinctions are important.


- Internal data is data that is used by the body that collected it. (Banks may
collect credit data on their own customers.)
▪ Advantages: Control and full knowledge of its accuracy.
▪ Disadvantages: It takes time and effort to collect.
- External data is collected by one body and used by another. (An
insurance company may use data collected by Central Statistical Office.)
▪ Advantages: Saves time and effort.
▪ Disadvantages: May have to be paid for. May not be quite what you
need. May be out of date.
- Primary data was collected for the purpose it is being used for.
▪ It is usually raw data in that it is unprocessed, possibly responses
from a questionnaire or a list of figures from a set of invoices.
- Secondary data is data that was collected for some other purpose.
▪ Often already processed as a series of tables, charts, etc.
▪ Care must be taken to find out as much as possible about how it was
collected so as to decide how reliable it is and what situations it is
applicable to:
- Is it derived from employed adults only or from all adults?
- Was it from a large or small sample?
- Did respondents choose to reply (may give a biased result)?
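
For readers who like to see these distinctions in software terms, the sketch below shows one
way the four data types of Fig.3.1 might be represented using Python's pandas library. This
is purely illustrative (the practical work in this module uses Excel, and the column names
and values below are invented); the point is that statistical software has to be told which
columns are genuine numbers and which are merely labels, with or without an ordering.

    import pandas as pd

    survey = pd.DataFrame({
        "monthly_income": [2350.75, 1820.00, 3010.50],   # continuous (quantitative)
        "num_credit_cards": [1, 3, 2],                   # discrete (quantitative)
        "chocolate_rating": pd.Categorical(              # ordinal (qualitative)
            ["Good", "Fair", "Awful"],
            categories=["Awful", "Poor", "Fair", "Good"], ordered=True),
        "gender": pd.Categorical(["F", "M", "F"]),       # nominal (qualitative)
    })

    print(survey.dtypes)                      # which columns are numeric, which categorical
    print(survey["monthly_income"].mean())    # averaging makes sense for quantitative data
    print(survey["chocolate_rating"].min())   # ordering makes sense for ordinal data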


3.2 ESDS International


Economic and Social Data Service International has a very interesting group of
learning and teaching resources entitled LIMMD (Linking International Macro and
Micro Data). These can be found at
http://www.esds.ac.uk/international/resources/learning.asp
(or by following the Learning and Teaching Resources link from the home page),
and comprise five self study units. Now read Unit 1 (Basics) since this contains an
interesting discussion of data types.

3.3 Data Analysis


• Some statistical techniques require quantitative data. For instance, it is
meaningless to average code numbers representing a product's colour, but it
does make sense to average the time spent on the phone or even the
responses on a scale 'Strongly agree' to 'Strongly disagree'. (The latter is only
truly meaningful if we are reasonably sure that the average of an 'Agree' and a
'Disagree' is 'Indifferent'.)
• Categorical data is more common in the social sciences where test subjects are
often assigned groups (with the group number becoming a categorical variable).
However, in economics/finance there is usually no question of “designing the
experiment” in this fashion. For example, we cannot test different economic
models by selecting a variety of interest rates and seeing how the models
perform. Interest rates are chosen by the system (using some not very well
understood processes!), and are not set externally by econometricians. The
distinction here is between endogenous (set within the system/model), and
exogenous (set outside the system/model), variables, although we will rarely
use these terms.
• There are many statistical techniques available for analysing categorical
variables, although we will generally not need them.

3.4 Two Important Distinctions


Numerical tables generally serve one of two purposes:
• Databases (tables and spreadsheets) compiled by government statistical
agencies or non-governmental organisations.
- Limited purpose is to present all available numerical information that might
be relevant to a wide variety of data users.
- Textual discussions accompanying these tabulations serve only to
▪ describe how the data were obtained, and to
▪ define what the numbers mean.

• Tables contained in Research Reports


- These seek to present numerical evidence relevant to support specific
conclusions (made in the report).
- To serve this purpose much care must be given to
▪ the selection of the data, and
▪ the design of the table.

Research reports (based on the analysis of numerical information) should
address two different audiences:
- Those who read the text and ignore the data presented in tables (and
charts).
▪ Here it probably does not much matter how you present your data!
- Those who skim the text and grasp the main ideas from the data
presentation.
▪ Here tables should be self-explanatory,
- conveying the critical ideas contained in the data,
- without relying on the text to explain what the numbers mean.
▪ Tables should complement the textual discussion, and the text
- will provide a general summary of the most important ideas to
be derived from the data,
- without repeating many of the numbers contained in the data.

4. Data Accuracy
One must always bear in mind the accuracy of the data being used but,
unfortunately, this can be difficult to assess. There are many factors influencing
(often adversely) data accuracy:
• Statistics are often merely a by-product of business and government activities,
with no “experimental design” available, as alluded to above.
• There is no incentive for companies to provide accurate statistics; indeed this is
often expensive and time consuming.


• Companies may have a strong incentive to hide information, to mislead rivals


about their competitive strategy or strength.
• Companies may misrepresent their position to tax authorities to seek subsidies
or avoid taxation. As an extreme example remember Enron (see
http://en.wikipedia.org/wiki/Enron_scandal).
• Governments themselves may seek to misrepresent data to improve their
political image and maintain power, and to gain possible subsidies. Accusations
of falsifying financial data to comply with the Maastricht criteria for Euro zone
entry continue to haunt the Greek government; see
http://www.eubusiness.com/Greece/041014155857.dhkojaqs/
• Data gatherers may be inadequately trained, especially where questionnaires
are used. Delivery of questions, recording and interpretation of answers are all
possible sources of error. There is a large literature on this but a useful, and
relatively brief, starting point is
Whitney, D.R. (1972). The Questionnaire as a Data Source, Technical
Bulletin No.13, University of Iowa, available at both
http://wbarratt.indstate.edu/documents/questionnaire.htm
and http://faculty.washington.edu/tabrooks/599.course/questionnaire.html
• Lack of clear definitions and classifications.
- Classification of a company within industries may be problematic for well
diversified companies.
- Definition of “Employed” may not be constant across regions/countries.
• Price Statistics. In addition to above classification/definition problems:
- Multiple prices exist, depending on volume purchased.
- Different prices exist as quality of product varies.
- Computing price indices ambiguous depending on methodology used.
• National Income Statistics. When aggregating data:
- Errors in basic data (for reasons above).
- Adjustment of data to conform to “national format”.
- Data may not be available, and “gaps” need to be filled somehow.
• National Statistics. Data is only useful when users have confidence in the
reliability of the data. There has recently been concern over the reliability of


energy statistics from China. The issues involved are somewhat complex; if you
are interested look at;
Sinton, J. (2000). What goes up: recent trends in China’s energy consumption,
Energy Policy 28 671-687.
Sinton, J (2001). Accuracy and reliability of China’s energy statistics,
China Economic Review 12 (4) 373-383.
and Chow, G (2006). Are China’s Official Statistics Reliable?
CESifo Economic Studies 52 (2) 396-414.

• You should look at LIMMD Unit 2 Section 2.3 (see Section 3.2) for a discussion
of the great lengths the ESDS International go to in assessing the quality of
their macro databanks.
• A useful source of further information (especially the Economics section) is the
following website: Ludwig von Mises Institute at http://mises.org/

Before using data always try and assess its accuracy (difficult as this may be).

5. Data Tables
Much of the material of the next few sections is adapted from Klass, G. Just Plain
Data Analysis Companion Website at http://lilt.ilstu.edu/jpda/.
We shall only present some of this material, and you should visit the website and
read further. Klass also has a good section on “Finding the data”, which supplements
Section 2. There are three general characteristics of a good tabular display:
• The table should present meaningful data.
• The data should be unambiguous.
• The table should convey ideas about the data efficiently.


6. Presenting Meaningful Data


Even if data is accurate it needs to measure something of importance/relevance,
otherwise it is just “meaningless data” collected for no real purpose. Whether or not
the data is meaningful is related to how closely the data relate to the main points
that you are trying to make (in your analysis or report).
• The data in a table, and the relationships among the data, constitute the
evidence offered to support a conclusion. Report only this (meaningful) data.
• Ideally this should be an important conclusion, an essential part of the analysis
you are making.
• Knowing which data are meaningful requires an understanding of both;
- The specific subject matter you are writing about, and
- A good understanding of where the data comes from and how it was
collected.

6.1 Count Data


• Many indicators are based on counts (obtained from surveys or records):
- Unemployment figures are derived from monthly sample survey questions
relating to respondents' employment status.
- Infant/adult mortality is assessed from enumeration of death certificates.
- Measures of voter turnout based on counting votes actually cast, or from
post election surveys.
- Poverty rates based on number of people living in poor families (suitably
defined), or on number of poor families.
- Crime data is based on number of crimes actually reported to police.
• Interpreting data often requires a good understanding of how the data was
collected, actual survey questions used, and definitions used in obtaining
counts.

6.2 More Meaningful Variables


• Rather than raw counts (and aggregate totals) one often wants;
- Rates (murders per million of population)
- Ratios (health expenditure as a percentage of GDP), or
- Per capita measures (health expenditure per person)
• These all involve division of two quantities.


Example 6.1 The crime statistics shown in Table.6.1 below have various points of
interest.
• Pure counts can be misleading. Maine appears to be a far safer place to live
than Massachusetts in that, in 2005, there were 1483 violent crimes in the
former compared to 29,644 in the latter (about 1 to 20).
- The problem with this argument is we are not comparing “like with like”
since Massachusetts has by far the larger population, and hence we would
expect a much larger incidence of violent crime.
• Rates (per “unit”) are more meaningful. To remove the dependency on
population size we could work with rates per person:
Rate per person = Raw count / Population size

- These values become extremely small; for Maine
  1483/1318220 = 0.001125

Table 6.1: Crime Rates in US Cities


(Source: Federal Bureau of Investigation, Crime in the United States 2006 available
at http://www.fbi.gov/ucr/cius2006/data/table_04.html)


- We find small quantities difficult to understand (what is 0.001125 violent


crimes?)
- It is common practice to choose a “base value” and measure in units of
the base. A convenient, but not unique, base is 100,000 (people)
Rate per 100,000 = (Raw count / Population size) x 100,000

or equivalently   Rate per 100,000 = Raw count / (Population size / 100,000)

This gives the "more meaningful" rates in Table 6.1 above.
- On this basis Maine is indeed safer to live in, but only by a (2005) ratio of
112.5/460.8 ≈ 1 to 4. (A short sketch at the end of this example carries out
these calculations.)

• Time comparisons In Table 6.1 we have data for two years, 2005 and 2006.
- When available, tabulations that allow for comparisons across time usually
say more about what is happening (compared to data that does not allow
such comparisons). However care is needed.
- To produce comparisons it is usual to compute percentage changes:
% Change = (Change / Original value) x 100%

"Percent" simply means "per hundred" and hence we multiply by 100. This
also has the effect of making the numbers larger and "more meaningful",
and we measure the change relative to "what it was", rather than "what it
has become". This is again just a convention, but one that usually makes
most sense.
- More usefully   % Change = (New value - Old value) / Old value x 100%

- Observe that the % changes are different (albeit similar) for the raw counts
and the rates. Why should this be?
- On this basis violent crime is increasing at a much faster rate in Maine
than in Massachusetts (where it is actually decreasing). On an “absolute
level” Massachusetts is more violent, but on a “relative level” Maine is.
- Great care is needed in interpreting these latter statements. With only two
years data available meaningful comparisons may not be possible. The
one year changes are subject to random fluctuations, and we do not


know how large we can expect these fluctuations to be. To obtain a more
reliable analysis we would need more data, over maybe 5 or 10 years.

- In general, when comparing two years' data, one should be wary of
arbitrary selections of a base year. It is possible that in 2006 (immediately
following our base year) there was a major police initiative in
Massachusetts targeting violent crime, and this is responsible for the
decrease. It would then not be a fair comparison with the 2006 figures in
Maine where no such police action took place. We need to make sure
there is nothing unusual associated with our base year.
- One way of avoiding the trap of making too much out of random data
fluctuations, especially when there are large year-to-year variations, is to
average data values over a longer period (say 5 years in Table 6.1). We
look at this in Unit 4, together with how to reliably measure how variable
data is.

An important part of statistical analysis is trying to assess just how large we can expect
random fluctuations in the data to be, often without having access to a lot of data.
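The following minimal Python sketch reproduces the arithmetic of Example 6.1. The 2005 count and population are the Maine figures quoted above; the 2006 rate is an invented placeholder (Table 6.1 is not reproduced here), so the percentage change shown is purely illustrative.

    # Rate per 100,000: removes the dependence on population size
    count_2005 = 1483                 # violent crimes in Maine, 2005 (from the text)
    pop_2005 = 1318220                # Maine population, 2005 (from the text)
    rate_2005 = count_2005 / pop_2005 * 100000
    print(rate_2005)                  # approximately 112.5

    # Percentage change between two years (the 2006 rate is a made-up value)
    rate_2006 = 120.0
    pct_change = (rate_2006 - rate_2005) / rate_2005 * 100
    print(pct_change)                 # about 6.7% on these illustrative numbers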

7. Presenting Data
Whether or not the data is unambiguous depends largely on the descriptive text
contained in the title, headings and notes.
• The text should clearly and precisely define each number in the table.
• The title, headings and footnotes should
- Convey the general purpose of the table.
- Explain coding, scaling and definition of the variables.
• Define relevant terms or abbreviations.

Example 7.1 The data in Table 7.1 is poorly defined, with two major difficulties:

Change in Teenage Birth Rates (1987-98)          <- Meaning?

    White        6.7%
    Black       -4.9%
    Asian       -1.8%                             <- Meaning?
    Hispanic     3.7%

    Source: Statistical Abstract 2000, table 85

Table 7.1: A poorly defined table (the two "Meaning?" annotations mark the
ambiguities discussed below)


• Ambiguity 1: What is meant by a “teenage birth rate”? Does this measure


- The percentage of all babies born belonging to teenage mothers?
- The percentage of teenage mothers who gave birth?
We cannot decide this from the information (headings) given in the table.

• Ambiguity 2: What does a "change in percentage" mean? We know (from
Example 6.1, for example) that

  Percentage change in birth rate = (New birth rate (1998) - Old birth rate (1987)) / Old birth rate (1987) x 100%

By "change in percentage birth rate" we could mean one of two quantities:
- Percentage rate in 1998 – Percentage rate in 1987 (a difference in percentage points), or
- (New % rate (1998) - Old % rate (1987)) / Old % rate (1987) x 100%

For example, the 6.7% change reported in Table 7.1 could arise from
- 1998 (White) rate = 26.7%, and 1987 rate = 20% (1st interpretation)
- 1998 (White) rate = 21.34%, and 1987 rate = 20% (2nd interpretation)
(So if the birth rate for whites in 1987 was 20% then what was it for whites in
1998, 21.34% or 26.7%?)
Both interpretations are valid; Table 7.1 should indicate which one is meant.

• Do not worry if you find this slightly confusing. Quoting results in percentages
causes more confusion in data interpretation than almost anything else. One
must always ask "Percentage of what?"; in Tutorial 2 you will get some practice
in computing percentages. (The short sketch below works through both
interpretations numerically.)
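Here is a minimal Python sketch of the two interpretations; the 20% starting rate and the two 1998 rates are the illustrative values used above, not figures from the Statistical Abstract.

    # Two readings of "a 6.7% change" in a percentage rate
    old_rate = 20.0                   # illustrative 1987 rate (%)

    # Interpretation 1: a difference in percentage points
    new_rate_1 = 26.7
    point_change = new_rate_1 - old_rate                        # 6.7 percentage points

    # Interpretation 2: a relative (percentage) change of the rate itself
    new_rate_2 = 21.34
    relative_change = (new_rate_2 - old_rate) / old_rate * 100  # 6.7%

    print(point_change, relative_change)   # both about 6.7, up to floating-point rounding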

N.B. Because the first type of ambiguity occurs very frequently, and is often the
source of major confusion, we give a “graphical view” in Fig.7.1.

[Fig.7.1 contains two diagrams side by side: on the left, all babies born (1987-98)
split into those born to teenage mothers and those born to non-teenage mothers;
on the right, all teenage mothers (1987-98) split into those who gave birth and
those who did not give birth during 1987-98.]

Fig.7.1: Two different underlying populations on which percentages can be based.


8. Presenting Data Efficiently and Effectively

An efficient tabular display will allow the reader to


• Quickly discern the purpose and importance of the data.
• Draw various interesting conclusions from a large amount of information.
How quickly the reader can
• Digest the information presented,
• Discern the critical relationships among the data,
• Draw meaningful conclusions,
depends on how well the table is formatted.
There are a few important ideas to bear in mind when trying to present your data in
as efficient a manner as possible:

8.1 Sorting
• Sort data by the most meaningful variable.

Example 8.1 Table 8.1 contains average television viewing times for 16 countries.
• Here the data is sorted by the “Country” variable in alphabetic order.
• Usually this is a poor way to sort data since the variable of interest is really the
(2003) viewing times. For example, we would like to know which country
contains the most frequent viewers.
• Current software allows data to be sorted very easily in a variety of ways. In
Table 8.2 we have sorted from largest to smallest (hours viewing).

Table 8.1: Data sorted by country Table 8.2: Data sorted by viewing times


• This table gives us a much better idea of what the “average” viewing time is. We
shall examine these ideas in more detail in Units 3 and 4. What would you
conclude from Table 8.2 about average viewing times?
• But note that, if we have data for more years (and more countries) available, the
situation becomes a bit trickier! In Table 8.3 we cannot sort all columns at the
same time – why?
- If we do want to sort the data we must select a variable to sort on, i.e. the
“most meaningful” variable.
- We may consider the latest year (2005) as the most important and sort on
this. We would then essentially be using 2005 as a “base year” on which
to make subsequent comparisons (as we shall do in Units 3 and 4).
▪ Results are shown in Table 8.4 (using Excel).
▪ Note that missing values are placed at the top. Would it be better to
place them according to their average values over all the years? Or
would they be better placed at the bottom, i.e. out of the way? (A short
sketch after Table 8.5 shows how this choice is made in other software.)

Table 8.3: TV viewing times for period 1997 - 2005.


Table 8.4: TV viewing times ordered on 2005 viewing figures.

• Suppose we wanted to highlight the relative position of a particular country,


Japan say, year by year.
- We could then regard each year, together with the country designation, as
a separate table and produce a series of tables, sorted by viewing time,
one for each year.
- This would involve considerably more effort! Sample results are shown in
Tables 8.5; what can we conclude about Japan?
- Rather than having many tables (with obviously very considerable
repetition of values) we shall look at graphical (Unit 3) and numerical (Unit
4) summary measures to present the data more economically.
• Data of the type given above (Table 8.3) is not too easy to locate. The OECD is
often the best place to start looking, and the above data is available from OECD
Communications Outlook 2007 (p.91 Table 6.8) at
http://213.253.134.43/oecd/pdfs/browseit/9307021E.PDF (read only version)
The read only version can be downloaded free. (It is probably simplest to go to
http://www.oecd.org/document and then search for Communications
Outlook.)


Table 8.5: TV viewing times ordered year by year.
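Sorting, and the placement of missing values, is easy to experiment with outside Excel as well. The following minimal pandas sketch mirrors the idea behind Tables 8.3-8.5; the country names and viewing figures are invented placeholders, not the OECD data.

    import pandas as pd

    # Illustrative data only (not the OECD figures): daily TV viewing in minutes
    tv = pd.DataFrame({
        "Country": ["Japan", "UK", "France", "Norway"],
        "Y2005": [262.0, 219.0, None, 154.0],      # None marks a missing value
    })

    # Sort on the "most meaningful" variable, keeping missing values out of the way
    tv_sorted = tv.sort_values("Y2005", ascending=False, na_position="last")
    print(tv_sorted)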

8.2 Decimal Places and Rounding


• Never give data to more accuracy (decimal places) than is warranted. However
this can sometimes be tricky to judge.

Example 8.2 Table 8.6 contains population data taken from Wikipedia
(http://en.wikipedia.org/wiki/World_population) in two forms;
• The raw data is given to the nearest million.
• The percentage data is given to one decimal place (1dp). Why 1dp?

Table 8.6: Population figures (actual and percentage).


• For illustration we shall look at the 1750 data.


- The raw data is consistent since the World population is indeed the sum of
the other 6 populations (791 = 106 + 502 + 163 + 16 + 2 + 2).
- The percentage data is “slightly inconsistent” since the 6 parts do not
(quite) add to 100% (13.4 + 63.5 + 20.6 + 2.0 + 0.3 + 0.3 = 100.1).
• This inconsistency is due to rounding the % data to 1dp. Explicitly
% Africa population = 106/791 x 100% = 13.4007505 ; % Asia population = 63.4639697 ;
% Europe population = 20.6068268 ; % Latin America population = 2.0227560 ;
% North America population = 0.2528445 ; % Oceania population = 0.2528445.
• These values sum to 99.99999200%, and we would regard this as “acceptably
close” to 100% or, more informally, 100% “to within rounding error”. (See also
Tutorial 2 for a more quantitative discussion.)
• So why do we not give the % data values to more (than one) decimal places?
(We may regard 7dp as too many, but surely 1dp is not enough?) The reason is
that the original (raw) data from which the percentages are calculated, is not of
sufficient accuracy to allow this.
• Look at the Africa population for 1750. This is given as 106 million, but is
rounded to the nearest million. This means the “true value” could be anywhere
within the range 105.5 to 106.5 million, usually written (105.5, 106.5).
[Of course what we really mean by this is a value just greater than 105.5, say
105.000001, and just less than 106.5, say 106.499999. But the distinction is too
cumbersome to maintain, and we usually live with the slight ambiguity.]
• Even if we assume the world population is exactly 791 million (which it will not
be), the possible % figures for Africa will lie in the range

(105.5/791 x 100 , 106.5/791 x 100) = (13.3375 , 13.46397) = (13.3 , 13.5) %
rounded to 1dp. The quoted value of 13.4% is thus not accurate and, strictly
speaking, the result should only be given (consistently) to the nearest integer.
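As a quick check on this rounding argument, here is a minimal Python sketch using only the 1750 figures already quoted from Table 8.6:

    # 1750 populations in millions, each rounded to the nearest million
    regions = {"Africa": 106, "Asia": 502, "Europe": 163,
               "Latin America": 16, "North America": 2, "Oceania": 2}
    world = sum(regions.values())                           # 791

    # Percentages rounded to 1dp do not quite add to 100%
    shares_1dp = {name: round(pop / world * 100, 1) for name, pop in regions.items()}
    print(round(sum(shares_1dp.values()), 1))               # 100.1

    # Africa's raw figure is only known to +/- 0.5 million, so even with an exact
    # world total its share could lie anywhere in this interval:
    print(105.5 / world * 100, 106.5 / world * 100)         # roughly 13.34 to 13.46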

MORAL
• In practice we usually live with slight inaccuracies in our data, and hence in
values computed from them.
• But this does mean we usually cannot quote high accuracy for any subsequent
computations in the data. For example, computing the average population (in
1750) over the 6 regions gives (see Unit 4 for further discussion)


Average population = (1/6)(106 + 502 + 163 + 16 + 2 + 2) = 791/6 = 131.83333333
However we do not believe the (implied) accuracy of 8 decimal places. How
many places can we legitimately quote?

• ALWAYS BE VERY CAREFUL IN THE ACCURACY YOU GIVE RESULTS TO


• ALWAYS BE PREPARED TO JUSTIFY THE ACCURACY YOU CLAIM FOR
YOUR RESULTS

9. A Practical Illustration

It is instructive to see how the ONS actually tabulates the data it collects.
Data Source We will look at unemployment statistics taken from the ONS website;
look at Practical Unit 2 for details of the latter.
• Go to UK Snapshot, Labour Market and Latest on Employment & Earnings;
you should obtain something like Fig.9.1. (If you do not a slightly different search
may be required!)

Fig.9.1: Source page for unemployment statistics (annotated in the original to
highlight the guide to the website and a useful summary panel)

• For general information click on the Related Links at the R.H.S. of Fig.9.1. In
particular individual data series are available for various (user specified) time
periods – see Table 9.1 below.
• Click on the (Labour Market Statistics) First Release option; Fig.9.2 results.
Now choose the March 2009 to give the document depicted in Fig.9.3.


Fig.9.2: First Release Statistics option Fig.9.3: First Release Document (pdf)

• The document contains a very detailed (40 page) discussion of published


statistics relating to the labour market. For our present purposes we want to
look at Table 9, a part of which is reproduced in Table 9.1 below.

Structure of Data Table The table is formed from parts of many ONS data sets,
those used being indicated by a four letter code in the rows labelled People. For
example MGSC refers to All UK Unemployed Aged 16+ with the additional
information 000s (data measured in thousands) : SA (seasonally adjusted) ; Annual
= 4 quarter average (annual figures found by averaging over quarterly figures).
All these series can be downloaded via the Labour Market Statistics Time Series
Data option in Fig.9.1.


Table 9.1: (Portion of) Unemployment Table Published by ONS.

• The table is structured roughly as shown in Table 9.2, and allows four variables
to be tabulated (in a two dimensional table):
- Ages of unemployed (in the ranges 16-17, 18-24, 25-49, 50 and over and
two “cumulative” categories, 16 and over and 16-59/64). You may like to
think why these particular age divisions are chosen.
- Length of unemployment (in the ranges Up to 6 months, 6-12 months,
over 12 months and over 24 months). In addition two rates (percentages)
are given. Again you may think about these particular ranges and rates.
- Gender (in the categories All, Male and Female). Why is the “All” category
given, since this must be the sum of the males and females?
- Time (divided into quarterly periods).
• The general observation is that the ONS Table 9.1, although quite large, does
contain a considerable amount of data in a very compact form, i.e. the data is


very efficiently presented. In addition the data is unambiguous so that, for


example, the first data value in Table 9.1 has a well-defined meaning. Clearly
the data is also meaningful, so our three criteria (of Section 5) for a good tabular
display are met.
• When you come to design your own tables you might not find it straightforward
to present data efficiently!

Table 9.2: Structure of Table 9.1

Note: The data structure shown in Table 9.2 is not always the most convenient,
depending on the question(s) we are trying to answer. We shall return to Table 9.1 in
Unit 6.

10. Data Collection and Sampling


Now that we have looked at a variety of data sources, and the type of information
they contain, we conclude this unit with a more general consideration of how data is
collected. The importance of this discussion will not really become apparent until Unit
7 when we consider sampling, and some theoretical properties associated with
samples.
1. Data Collection A set of data can be collected in several different ways. It can
be obtained through a planned survey using questionnaires or direct
observation, by a controlled experiment or by extracting data from previously
published sources such as books and computer databases. The problems
associated with each type of method are rather different but to understand the
ideas behind any formal sampling schemes or experimental designs you have
to be aware of the important role that variation plays in statistics. It is fairly


obvious that no two people are exactly the same in all respects, that no two
apples are identical in size and shape and so on. However, it is quite surprising
how people will quite happily convince themselves of certain 'facts' on the basis
of very little (and quite often biased) evidence. If you visit a place once in the
cold and wet it is difficult to imagine that it is ever nice there. If you ask one
group of people their opinion on some topic you may be convinced that most
people think in a particular way but if you had asked another group you may
have got an entirely different impression. This is the main problem encountered
in data collection. The people or things you are interested in are all different and
yet somehow you need to get sufficiently accurate information to make a sound
decision.
2. Surveys Surveys fall broadly into two categories: those in which questions are
asked and those where the data is obtained by measurement or direct
observation.
• The first type is used extensively to get information about people and this
may include both factual information and opinions.
• The other type of survey is used in many other areas such as land use
surveys, pollution monitoring, process quality control and invoice checking.
In both cases a distinction must be made between those situations where data
can be collected on everyone or everything of interest and those where that is
impossible. The first situation, which is comparatively rare, is called a census.
There are no real problems analysing data of this sort from the statistical
viewpoint as potentially complete information is available. Most data, however, are
not of this type. In real life it usually takes too long or costs too much to collect
data on all the individuals of interest. In a business organisation decisions
usually have to be made quickly. Although the Government carries out a census of
people in the UK once every ten years, by the time the full analysis is complete
much of the information is out of date.
In some situations it is impossible to carry out a complete survey. The only way
to test the strength or durability of certain components is to use destructive
testing. For example, to see how long a new type of light bulb lasts requires
switching some on and recording the times to failure. It would not be very
profitable if all the light bulbs manufactured had to be tested!
In practice then you are likely to want information about a large group of people
or things (the population) but you are restricted to collecting data from a
smaller group (the sample). As soon as you are in this situation there is no
possibility of getting completely accurate information. The best you can hope for
is that the information contained in the sample is not misleading. In order to plan


a satisfactory survey it is necessary to understand the nature of variability, a


central theme in this module.
It is possible to get reliable estimates of what is happening in the population, by
collecting data from a sample, but only if sampling is done in a sensible way.
The sample must, in some sense, 'represent' the population from which it
comes. Obviously a survey, designed to determine what proportion of the UK
population is in favour of reducing the tax on alcoholic drinks, is not going to be
very satisfactory if all the interviews are carried out in pubs. Unfortunately, it is
very easy to introduce bias unwittingly into a survey unless a great deal of
thought is given to the selection of a suitable sample. A bias in this context is
the difference between the results from the sample and what is actually
happening in the population as a whole, due to the fact that the sample has
been chosen in such a way that it could never be representative of the
population. The proportion of pub-goers who favour a reduction in tax on
alcoholic drinks is likely to be higher than the proportion of the general
population who favour such a decrease, so even if a representative sample of
pub-goers is chosen the proportion estimated from the sample will be an
overestimate of the proportion in the population.
3. Sampling Methods The aim of good sampling is to choose some sort of
representative sample so that the results from the sample reflect what is
happening in the population as a whole. If an investigator already has a fair
amount of background knowledge about the population he or she may be in a
position to choose a sample that is reasonably representative. The problem with
doing this is that it is impossible to put any measure of accuracy on the results
obtained. The only way, to get both a reasonable estimate and be able to say
how precise that estimate is likely to be, is to make use of the ideas of
probability and randomness. Although we shall not discuss probability formally
until Unit 5, the following discussions require no technical probability
knowledge.
Simple random sampling The ideas of probability sampling are most easily
understood by considering simple random sampling. In this method, individuals
to be included in the sample are chosen at random with equal probabilities. In
order to do this it is necessary to have a list of all individuals in the population.
Each individual is assigned a number and the numbers are selected randomly
either by using random number tables (or a random number generator on a
computer) or some physical method such as 'drawing numbers out of a hat'.
Every individual has an equal chance of being in the sample. If the sample is
chosen in this way then it is possible to determine how precise an estimate,


calculated from this sample, is likely to be. This will be discussed further in Unit
4.
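A minimal Python sketch of simple random sampling; the sampling frame of 8,000 names and the sample size of 400 are illustrative figures (they match the example used in the systematic-sampling discussion that follows).

    import random

    # The sampling frame: a numbered list of every individual in the population
    population = [f"person_{i}" for i in range(1, 8001)]     # 8,000 names (illustrative)

    random.seed(1)                             # only so that the example is reproducible
    sample = random.sample(population, k=400)  # each individual equally likely to be chosen
    print(sample[:5])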
Systematic sampling Although simple random sampling is intuitively
appealing, it can be a laborious task selecting, say, 400 numbers at random and
then matching them to a list of 8000 names. It would be a lot quicker just to pick
the first individual at random from the first 20 names and then subsequently pick
the 20th name on the list. This method is called systematic sampling and
approximates to simple random sampling as long as the list is not constructed in
a way that might affect the results (for instance if the list of names is organised
by date of birth). This method is particular convenient for lists stored on
separate cards or invoices filed in a drawer.
Stratified sampling In many populations there are natural groupings where
people or things within a group tend to be more similar in some respects than
people or things from different groups. Such a population is called a stratified
population. One aim of a survey is to get the estimates as precise as possible.
This suggests that it might be more efficient to sample within each group, and
then pool the results, than to carry out a simple random sample of the whole
population where just by chance some groups may get unfairly represented.
There might also be advantages from an administrative point of view. If, for
example, a survey were to be carried out amongst the workforces of several
engineering companies, then it is easier to choose a random sample within
each company than to get a random sample of the combined workforce. It can
be shown mathematically that stratified random sampling, in general, gives
better estimates for the same total sample size than simple random sampling as
long as the variability within each stratum is relatively small and the number of
individuals chosen in each stratum is proportional to the size of that stratum
(a small sketch of this proportional allocation follows).
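A minimal sketch of proportional allocation in Python; the stratum sizes and the overall sample size below are invented for illustration.

    # Proportional allocation: each stratum's sample size is proportional to its size
    strata_sizes = {"Company A": 5000, "Company B": 2500, "Company C": 500}   # illustrative
    total_sample = 400                      # overall sample size we can afford
    N = sum(strata_sizes.values())          # population size (8,000 here)

    allocation = {name: round(total_sample * size / N)
                  for name, size in strata_sizes.items()}
    print(allocation)                       # {'Company A': 250, 'Company B': 125, 'Company C': 25}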
Cluster sampling In some situations there may be a large number of groups of
individuals; for example, the workforces of small light engineering companies or
small rural communities. From an administrative point of view it is much easier
to select the groups at random and then look at all individuals in the selected
groups than to select individuals from all groups. Such a method is called
cluster sampling. Cluster sampling works best when the groups are similar in
nature so that it does not really matter which ones are included in the sample
because each cluster is like a mini-population. This method does, however, get
used in other situations where the saving in costs, made by restricting the
number of groups visited, is thought to outweigh any loss in statistical
efficiency.
Multi-stage sampling Most large scale surveys combine different sampling
methods. For example a drinking survey organised by the Office of Population


Censuses and Surveys used the Postcode Address File, which is a


computerised list of postal addresses, as the sampling frame. The country is
divided into regions and the number of postcode sectors selected in each region
is made proportional to the number of delivery addresses in that region. The
postcode sectors within each region were stratified by two other factors to
ensure that the sampled sectors represented different types of area. Addresses
were then systematically chosen in each sector.
Quota sampling When market research or opinion polls are carried out there is
usually insufficient time to select individuals by one of the random sampling
methods. However, if the results are to be credible then great care has to be
taken that the sample chosen is as representative as possible. The usual
method employed is some sort of quota sampling based on the ideas of
stratified sampling. The population is divided into groups on the basis of several
factors. In market research these usually include age, sex and socio-economic
class. The number of individuals in the sample for a particular group is chosen
proportional to the size of that group. The difference between quota sampling
and stratified sampling is that the final choice of individuals is not made at
random but is left to the discretion of the interviewer who just has to get the
requisite number of individuals. This freedom of choice by the interviewer may
introduce bias into the survey although good market research organisations
give very precise instructions to interviewers on how to obtain their quotas so
as to avoid biases. The other problem with quota sampling is that generally no
record is kept of individuals who refuse to answer the questions and these
individuals may have very different opinions from those that do. There is,
therefore, no way of assessing how accurate the results are from quota
sampling. However, in spite of these difficulties it is the main sampling method
employed by commercial organisations who need speedy results.


11. References
• Chow, G. (2006). Are China’s Official Statistics Reliable? CESifo Economic
Studies 52 (2) 396-414.
• Sinton, J. (2000). What goes up: recent trends in China’s energy consumption,
Energy Policy 28 671-687.
• Sinton, J. (2001). Accuracy and reliability of China’s energy statistics, China
Economic Review 12 (4) 373-383.


3 Graphical Summaries of Data

Learning Outcomes
At the end of this unit you should be familiar with the following:
• General principles of a graphic display.
• The various components of a chart, and their importance.
• Familiarity with the various chart types.
• When graphic displays are inadequate.
• Appreciate that a given dataset can be graphed in different ways to give
different visual impressions.

A sentence should contain no unnecessary words, a paragraph no unnecessary


sentences; for the same reason a drawing should contain no unnecessary lines
and a machine no unnecessary parts.
Edward Tufte: The Visual Display of Quantitative Information (1983)


1. Guiding Principles
• Graphical displays are intended to give an immediate visual impression. This
depends on two factors:
- The precise variables that are graphed.
- The form of the graphic chosen.
• The intention should be to not mislead.
• Ideally a (graphical) chart should convey ideas about the data that would not be
readily apparent if displayed in a table (or described as text). If this is not the
case rethink whether a graphic is needed.

We shall seek to examine each of these issues in this unit.

Reference: Much of the material of the next few sections is again adapted from
Klass, G. Just Plain Data Analysis Companion Website at http://lilt.ilstu.edu/jpda/

2. Good Graphic Displays


• The three general characteristics of a good tabular display also apply
graphically:
- The chart should present meaningful data by
▪ defining clearly what the numbers represent, and
▪ allowing the reader to quickly deduce the meaning of the display
- The chart should be unambiguous, again with clear definitions of
precisely what is being plotted.
- The chart should convey ideas about the data efficiently.
• The primary purpose of a chart is to simplify (possibly several) numeric
comparisons into a single graphic.

3. (Three) Chart Components


• Labelling This includes the chart title, axes titles and labels, legends defining the
separate data series, and notes (often to indicate the data source).
• Scales These define the range of the Y and X-axes.
• Graphical Elements This is what varies from chart to chart, and comprises
the bars in a bar chart, the lines in a time series plot, the points in a scatter plot,
the slices in a pie chart and so on.


Fig.3.1 illustrates these ideas in the context of an Excel bar chart. You should be
familiar with how charts are constructed in Excel, how the data is set up, and the
labels and scales defined.

Title This should be used only to define the data series used.
• Do not impose a data interpretation on the reader. For example, a title like
“Rapid Increase of RPI and CPI” should be avoided.
• Specify units of measurement if appropriate, either
- at end of title (after a colon : ), or
- in parentheses in a subtitle (“constant dollars”, “% of GDP”, …)
(What are the units of RPI and CPI? See Practical 2.)

Fig.3.1: Chart Components (with CPI & RPI data)

Axes Titles This should be brief.


• Do not use if this merely repeats information that is already clear from the title
or axes labels.
• In Fig.3.1 both the labels “Index value” (implicit in the title) and “Year” (obvious
from the x-axis label) could be removed, although one could argue they do
clarify the situation a little. (Subjective judgement required.)
• But, for example, a y-axis label of “RPI and CPI Index Value” is too much.

Axes Scales These define the overall look of the graphic.


• May require careful choice in order not to give a misleading impression.


• Scales do not need to be numeric and can sometimes be merely labels


(names). In Fig.3.1 the x-axis “Years” are really just labels since we would
never think of doing any numerical calculation (like adding) with them.
• Nonetheless there is still an order to the x-axis label/scale. In some instances
this might not be the case (University names, for example).

Data Labels These define individual data points.


• Data labels can be obtained in Excel (see Practical 3). But avoid too many.
• With data labels you may not require a scale on one, or both, axes.

Legends These are required if two, or more, series are graphed (in order to identify
the various series!).
• Legends can be placed at the side (default in Excel) of a chart, or at the top or
bottom of the chart if you wish to increase the plot size.
• Legends are not necessary for a single series, so delete them if they appear (as
they do in Excel). This increases the plot area.
• Legends should be brief. Excel supplies them automatically, but does allow you
to alter them.

Gridlines These should be used sparingly so as not to overwhelm the graphical


elements of the chart.
• They are useful if you wish to read off values from a graph.
• Excel gives various options for removing and altering the form of gridlines.

Source Data sources should be given for two important reasons:


• To allow other readers to check the validity of the results by accessing data
sources and repeating calculations. This is a primary motivation in much
scientific research, and is often expressed in the phrase “proper academic
citation”.
• To allow knowledgeable readers, familiar with the more common data sources,
to assess the reliability of the data.

Other Chart Elements Various other points should be borne in mind:


• Non-data elements, unnecessary for defining the meaning and values of the
data, should be kept to an absolute minimum.
- Fancy plot area borders, and shadings, are unnecessary.
- Keep shading of graphical elements simple. Avoid unnecessary 3-D
effects.


4. Importance of Graphical Representation Chosen


Before looking at the various types of graphics available, we emphasise the following
(almost always overlooked) observation.
• What we see in a graphic display, and the consequent inferences we make,
will often depend on the precise variables that are graphed.

To illustrate this we look at the following example.

Example 4.1 U.S. Defence Spending


Data Source The data, together with the various charts described below, is available
on the Module Web Page in the Excel file DefenceSpending.xls. The data is taken
from the following source:
The Budget of the United States Government: Historical Tables Fiscal Year
2006 at http://www.gpoaccess.gov/usbudget/fy06/hist.html
• It can be a little difficult to find this without the above URL. It is simplest to go to
the home page (The Budget of the United States Government) at
http://www.gpoaccess.gov/usbudget/. Then click on Browse and select Fiscal
Year 2006. On the resulting page you will see a Spreadsheet option; click on
this and select Historical Tables. Table 10.1 – Gross Domestic Product and
Deflators Used in the Historical Tables 1940-2010 will give you the data
saved as HistoricalBudgetData2006 on the module page.
• You should accustom yourself to having to work (quite hard sometimes) to
obtain the data you want. Searching websites can be time consuming and
frustrating!
• Using this data you should be able to reproduce the charts quoted below.

Table 4.1: U.S. Defence Spending Data.


The Excel spreadsheet shown in Table 4.1 below shows U.S. defence spending
(column B) and various other quantities that can be used as a divisor. Below we give
five line (time series) plots of defence spending plotted against some of the
remaining columns in Table 4.1.
• Each represents a valid presentation of the data, BUT
• Depending on which divisor is used one could conclude that defence spending
is
- Steadily increasing (Fig.4.1 and Fig.4.4 [lower line] and Fig.4.5)
- Dramatically increasing (Fig. 4.1 and Fig.4.4)
- Steadily decreasing (Fig.4.2, and look at vertical scale)
- Dramatically decreasing (Fig.4.3, and look at vertical scale)
- Holding relatively constant (Fig.4.4)

You might like to consider how all this is possible! Remember that graphs can be
subjective, and you must carefully examine scales and appreciate the importance of
relative (percentage) and absolute value changes. Look again at Figs.4.2 and 4.3.

Fig.4.1: Spending in Current & Constant $ Fig.4.2: Spending as % of GDP

Fig.4.3: Spending as % of Total Spending Fig.4.4: Spending Per Capita


Fig.4.5: Defence & Total Spending (Constant $ with 1980 = 100)

Brief Explanations If you are unsure of the rationale for these calculations:
• We need to standardise data in order to account for differences in
- populations, and prices, both
- at different times, and in different parts of the world (geographic location)
• Because the CPI is based on the prices of a market basket of goods that
consumers typically purchase, it is not a good deflator for a measure of
aggregate government spending.
• A better measure is often taken to be GDP (Gross Domestic Product), and %
GDP is used as a deflator for government revenue, and spending, indicators.
• The GDP deflator is used to construct the constant dollar price measure in
order to eliminate (or at least minimise) the problem of changing prices (over
time) in calculating total (defence) expenditure.
• In general one can expect GDP to increase (over time) faster than inflation.
Dividing a measure of government spending by GDP will therefore produce a
lower growth rate than if an inflation measure were used as divisor.
• To account for changing population figures, per capita (per person) values are
often quoted by dividing the relevant measure (say GDP) by the population size.
Such figures are available in a variety of places, with the Census being the
source of the most up to date information. Adjusted figures are often later
corrected for inaccuracies in the population count once new census figures
become available. (Revision of figures is commonly done with economic figures
in general, and government data in particular.)
• One problem with % GDP measures is that time series trends often fluctuate
more because of a country’s changing GDP than changes in the numerator
(here defence spending).


- Often government spending will increase incrementally at a steady rate,


while measures of spending (or taxation), as a % of GDP, will show
dramatic changes, increasing during recessions (why?) and decreasing
during periods of economic growth (why?).
• It has been suggested that inflation measures are a better deflator for defence
spending than GDP. See Noah, T. (2004) Stupid Budget Tricks. How not to
Discredit the Clinton Surplus, Slate Magazine (Aug. 9th.) available at
http://slate.msn.com/id/2104952/
• Example 4.1 will reappear in Section 6, where poor and misleading graphical
representations are considered.
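To make the role of the divisor concrete, here is a minimal Python sketch of the three standard adjustments discussed above; the spending, GDP, deflator and population figures are invented for illustration and are not taken from the Budget historical tables.

    # Illustrative figures only (not from the U.S. Budget historical tables)
    defence_current = 500.0       # current-dollar defence spending, $bn
    gdp_current = 12000.0         # current-dollar GDP, $bn
    gdp_deflator = 1.25           # price level relative to the chosen base year
    population_m = 300.0          # population in millions

    constant_dollars = defence_current / gdp_deflator           # base-year (constant) dollars
    share_of_gdp = defence_current / gdp_current * 100          # spending as % of GDP
    per_capita = defence_current * 1e9 / (population_m * 1e6)   # dollars per person

    print(constant_dollars, share_of_gdp, per_capita)           # 400.0, ~4.17, ~1666.67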

5. Common Graphical Representations


There are a large variety of ways data can be represented graphically. Here we
briefly mention some of the more commonly occurring ones, and indicate how they
can be constructed in Excel. You may like to look at the Excel Help material on
Charts. If you Search for Charts you should be able to find, amongst many other
items, the following four demonstrations:
• Charts I: How to create a chart
• Charts II: Choosing the right chart
• Charts III: Create a professional looking chart
• Charts IV: Charts for the scientist

You may find these a bit too laboured, and the advice may not be precisely the same
as we give, but they may be worth a look. Some hand computations for the more
important charts will be considered in Tutorial 3.

5.1a Bar Charts


Whilst we do not fully agree with his conclusions, Klass (Just Plain Data Analysis
Companion Website) gives the following advice:

• Bar charts often contain little data, a lot of ink, and rarely reveal ideas that
cannot be presented more simply in a table.
• Never use a 3D (three dimensional) bar chart.


But you only have to look at some of the papers cited in previous units to see how
popular bar charts are.
Exercise: Check this out. In particular look at Barwell, R., May, O., Pezzini, S. (2006)
The distribution of assets, incomes and liabilities across UK households : results from
the 2005 NMG research survey, Bank of England Quarterly Bulletin (Spring).

The following examples are designed to illustrate some of the good, and bad,
features of bar charts.

Example 5.1 Poverty Rates in Wealthy Nations

Table.5.1: Poverty Rates Fig.5.1: Poverty Rates Bar Chart


Data Source Luxembourg Income Study (LIS) at http://www.lisproject.org/

• LIS is a cross-national data archive located in Luxembourg. The LIS archive


contains two primary databases.
• The LIS Database includes income microdata from a large number of countries
at multiple points in time.
• The newer LWS Database includes wealth microdata from a smaller selection
of countries. Both databases include labour market and demographic data as
well.
• Registered users may access the microdata for social scientific research using
a remote-access system.
• All visitors to the website may download the LIS Key Figures, which provide
country-level poverty and inequality indicators.
The data in Table 5.1 is available in BarChartPovety.xls on the module web page.


Comments A bar chart typically displays the relationship between one, or more,
categorical variables. In Fig.5.1 the two variables are Country (with 17 values) and
“Age Status” (taking two values: child or elderly).
• One variable (Country) is plotted on the y-axis, and the second variable (“Age
Status”) is accommodated by employing a multiple bar chart, with multiple bars
(here two) for each value of the first (Country) variable.
• The lengths of the bars, measured on the x-axis scale, quantify the relationship.
• We can quickly grasp the main point from Fig.5.1:
- The U.S. has the highest child poverty rate amongst developed nations.
• There are a few subsidiary points depicted in Fig.5.1:
- In many countries there tends to be a substantial difference between child
and elderly poverty (France, Germany, Italy and the U.S. are the
exceptions).
- The three countries with the lowest child poverty are Scandinavian.
- Five of the seven countries with the highest child poverty are European.
• Note that we can easily make these latter conclusions since the data is sorted
on the most significant variable. (Why is child poverty regarded as the more
important variable?)
• Data sorting is easily done in Excel using Data, Sort. With the data of Fig.5.1
we would sort on the Children column in ascending order.
• It may not be clear to you exactly what is being plotted in Fig.5.1. What is the
meaning of the phrase “% living in families below 50% of median family
income”?
We shall look at the median in Unit 4.
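If you want to experiment outside Excel, a multiple (grouped) horizontal bar chart of this kind can be sketched with matplotlib; the three countries and poverty rates below are invented placeholders rather than the LIS key figures.

    import numpy as np
    import matplotlib.pyplot as plt

    # Illustrative values only (not the LIS poverty rates)
    countries = ["Country A", "Country B", "Country C"]
    child = [5.0, 12.0, 20.0]        # % living in families below 50% of median income
    elderly = [8.0, 10.0, 25.0]

    y = np.arange(len(countries))    # one slot per country on the y-axis
    h = 0.4                          # height of each bar
    plt.barh(y - h / 2, child, height=h, label="Children")
    plt.barh(y + h / 2, elderly, height=h, label="Elderly")
    plt.yticks(y, countries)
    plt.xlabel("Poverty rate (%)")
    plt.legend()
    plt.show()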

Excel Implementation In general, apart from the histogram, discussed in Section


5.2, Excel charts are obtained via Insert and Charts; see Practical Unit 2 Section 6.
Various bar charts exist:
• Conventionally bar charts have horizontal bars, and come in four basic types:
- Clustered. Compares values across categories (as in Fig.5.1)
- Stacked Compares the contribution of each value to a total across
categories (see Examples 5.2 and 5.3 below).
- 100% Stacked A stacked chart with values given in percentages (of total).
- Three dimensional (3D) versions of each of the above.


• Column charts. These are really just bar charts with vertical bars, and come in
the same four types as the bar charts above.

Question How does Fig.5.1 compare with the recommendations of Section 3?

5.1b Stacked Bar Charts

Example 5.2 OECD Education at a Glance: OECD Indicators (1995)


This is a text compiling educational statistics and indicators across OECD countries;
it has now been superseded by the 2008 edition. Search with Google and see what
you can find.

Data Source The data depicted in Table 5.2 (Distribution of Educational Staff) no
longer appears to be collected by OECD. This demonstrates that contents of web
pages do change. If you want specific information download it while you can,
assuming this is permitted.
Go to http://www.sourceoecd.org/ and search for this data; if you cannot find it then
use the Excel file EducationStaff.xls on the module web page for the data.

Table.5.2: Educational staff and their functions

Fig.5.2a: Stacked bar chart of education data in Table 5.2


Fig.5.2b: Stacked bar chart with columns B and C (in Table 5.2) interchanged.

Comments Stacked bar charts need to be used with some care.


• They work best when the primary comparisons are to be made across the
series represented at the bottom of the bar.
- In Fig.5.2a the “teachers data” in Column B of Table 5.2 is placed at the
bottom of each bar.
- This forces the reader’s attention on the crucial comparison, and results in
the “obvious conclusion” that “U.S. teachers have by far the largest
proportion of supervisory and support staff”.
• However, if we put the “Principal and supervisors” at the bottom, as in Fig.5.2b,
the conclusion is no longer quite so obvious, and we need to look at the figure
rather more carefully.
• Also note that the legend in Fig.5.2b is taking up far too much of the chart, and
we are much better placing it at the top, as in Fig.5.2a. Do you know how to do
this in Excel?

Example 5.3a U.S. Federal Government Receipts


Data Source As in Example 4.1, U.S. Budget data is available from Budget of the
United States Government available at http://www.gpoaccess.gov/usbudget/

Search for the data in Table 5.3. If you cannot find it both the data, and the chart of
Fig.5.3, are available in GovReceipts.xls on the module web page.


Table.5.3: Government Receipts Fig.5.3: Government Receipts Stacked Bar Chart

Comments The categories in Table 5.3 are nominal rather than ordinal, i.e. there is
no implicit order to the various categories (income taxes, corporation taxes, etc.);
refer back to Unit 2 Section 2.3.
• In such a case we cannot stack the categories in any order of importance.
• In Fig.5.3 we have placed the categories in decreasing numerical order, from
bottom to top. But even here it is difficult to distinguish the differences in size of
the upper components of the chart. (Is “Other” bigger in 2000 or 2007?)
• Similar problems occur with Stacked line charts, and Area charts. You may like
to investigate these charts, and the difficulties that can occur.
• What are the units in the receipts of Table 5.3?

5.2. Histogram
Although a very simple type of chart, the histogram is, from a theoretical point of
view, the most important of all graphical representations. The reason for this relates
to the concept of a probability distribution, examined in Unit 5.

Example 5.4 Distribution of Stock Prices and Returns

Data Source A great deal of useful information concerning stock price movements
can be found on the Yahoo! Finance website at http://finance.yahoo.com/. Here we
shall just use stock data taken directly from Yahoo; in Table 5.4(a) we have monthly
(closing) stock prices, from September 1984 through to February 2008, for the U.S.
computer company Apple. See the file AppleStock_25Y.xls.


Table.5.4: (a) Apple Stock Prices (b) Derived Frequency Table

Data Manipulation Note the stock data is continuous and quantitative, in contrast
to the integer-valued and categorical data from which our previous bar charts were
constructed.
• From our raw data we construct a frequency table, as in Table 4(b), indicating
the number (or frequency) of stock prices which fall within each of the indicated
intervals.
• This frequency table is constructed in Excel (see Practical 3 for some details),
but requires a little explanation.
- The intervals are termed bins.
- The upper limit of each bin is given in Table 5.4(b). Thus the frequency
opposite the bin value 13 indicates there is just one stock price below $13,
73 stock prices in the range $13 - $25 and so on.
• It is possible (in Excel) for the user to specify “more appropriate” intervals, such
as 0-10, 10-20, 20-30 and so on. This is frequently very convenient.

Histogram Construction The histograms of Fig.5.4 are constructed directly from the
frequency table with
• bar areas representing the appropriate frequencies, and
• bar widths (along the x-axis) corresponding to the bin widths.
• In Excel the default setting is to have the bars separated from each other as in
Fig.5.4 (a), but the more common approach is to have no gap as in Fig.5.4(b).
• The difference has to do with whether the underlying variable (here stock price)
is discrete or continuous. See Unit 2 Section 2.3 for a discussion.
• Excel Implementation Histograms are best constructed through Data, Data
Analysis and then Histogram. You may need to use Options and Add Ins to
get the Data Analysis software installed. See Practical 3 for some details.
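Outside Excel, the same two steps (frequency table, then histogram) can be sketched in Python; the short price series and the bins below are made-up illustrations, not the Apple data.

    import numpy as np
    import matplotlib.pyplot as plt

    # Illustrative monthly closing prices in $ (not the Apple series)
    prices = np.array([12.5, 14.0, 18.3, 22.1, 25.6, 31.0, 29.4, 35.2, 40.8, 38.1])

    # Frequency table: counts of prices falling into user-specified bins
    bins = [0, 10, 20, 30, 40, 50]
    counts, edges = np.histogram(prices, bins=bins)
    print(counts)                                    # [0 3 3 3 1] for these prices

    # Histogram with no gaps between bars (continuous data)
    plt.hist(prices, bins=bins, edgecolor="black")
    plt.xlabel("Price ($)")
    plt.ylabel("Frequency")
    plt.show()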


Fig.5.4: (a) Price histogram (with gaps) (b) Price histogram (no gaps)

Comments It is very important to appreciate that it is the areas of the bars which are
proportional to the frequencies (for reasons we discuss in Unit 5).
• If the bins are of equal width then the bar areas are proportional to the bar
heights, and the heights equally well represent the frequencies.
• If the bins are of unequal width adjustment of the bar heights is necessary to
keep the bar areas in proportion.
- See Tutorial 3 for examples of hand calculations involving unequal widths,
and Section 6 for what happens if these adjustments are not made.
- The adjustments are made automatically in Excel. In Fig.5.4 the bins are,
for all practical purposes, of equal width, although this may not seem to be
precisely the case from Table 5.4(b). (The explanation is rounding.)
• What is important in Fig.5.4 is the “overall shape” of the histogram. We would,
for example, be interested in knowing how much of the time the stock is below a
certain (average?) level, and this is influenced by more than a single bar. By
contrast in Fig.5.1, for example, we are more interested in individual bars
(categories), and comparisons between them. (This highlights another important
distinction between categorical and continuous data.)
• Statistics is full of “odd sounding” terms, such as histogram. If you wonder how
these words originated you may care to look at the website
Probability and Statistics on the Earliest Uses Pages available at
http://www.economics.soton.ac.uk/staff/aldrich/Probability%20Earliest%20Uses.htm

From here you should be able to track down the original meaning of a particular
statistical term. For example, histogram derives from the Greek
histos (anything set upright) and gramma (drawing)


Stock returns By the return on a stock we usually mean either


• the actual return (expressed in monetary units, here $)
Return = New stock price – Old stock price
• or the percentage return (expressed in non-monetary units)
% Return = [(New stock price – Old stock price) / Old stock price] x 100%

(Where have we seen this type of formula before?)
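
As a quick check of the two definitions, here is a minimal Python sketch that computes both
kinds of return from a list of successive prices; the price list is hypothetical, not the
Apple series.

# Monetary and percentage returns from one period to the next.
prices = [100.0, 104.0, 101.0, 109.0]        # hypothetical prices

for old, new in zip(prices[:-1], prices[1:]):
    actual_return = new - old                # in $ (monetary units)
    pct_return = (new - old) / old * 100     # in %
    print(f"{old:>6} -> {new:>6}: return = {actual_return:+.2f}, "
          f"% return = {pct_return:+.2f}%")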


For the Apple stock of Table 5.4 we obtain the histograms of Fig.5.5 for the monthly
returns and percentage returns respectively.

Fig.5.5: (a) Histogram of returns (b) Histogram of % returns

Conclusion We can tentatively conclude that


• the histograms of returns (Fig.5.5) both look “approximately symmetric” (about
the highest bar), whereas
• the histograms of prices (Fig.5.4) look skewed.

Do you agree with this?

Histogram Shapes Although not obvious from Fig.5.4 it is found that

• Histograms of “similar shapes” occur frequently in practical applications.


• A major concern is whether histograms are symmetric or not (skewed).
• We really need to specify symmetric about what? We usually mean the
“centre” of the histogram, but this merely shifts the question onto what we
mean by “the centre”. We discuss this in Unit 4.

Further examples of symmetric and skewed “distributions” can be found in the


practical and tutorial exercises.


5.3. Pie Chart


Once again, whilst we do not fully agree, Klass gives the following advice:

• Pie charts should rarely be used.


• Pie charts usually contain more ink than is necessary to display the data, and
the slices provide for a poor representation of the magnitude of the data points.
• Never use a 3D (three dimensional) pie chart.

But, as was the case with bar charts, you only have to look at some of the papers
cited in previous units to see how popular pie charts are. (Exercise: Check this out.)
The following examples are designed to illustrate some of the good, and bad,
features of pie charts.
Various pie charts are available in Excel. You are asked to explore some of the
options available in Practical 3.
Pie charts are used to represent the distribution of the categorical components of a
single variable (series).

Example 5.3b U.S. Federal Government Receipts


We return to Example 5.3a and use various pie charts to represent the data in Table
5.3 (reproduced below).

Table.5.3: Government Receipts Fig.5.6: Government Receipts (2007) Pie Chart

Single Pie Chart In Fig.5.6 we display, for the 2007 data, the various percentages
making up the total.
• In Excel it is possible to also display the actual values in Table 5.3, or not to
display any numerical values at all, just the labels.


• Unfortunately, without numerical information it is difficult to accurately assess


the relative sizes of the various slices of the pie. (Bar lengths in a bar chart are,
by comparison, easier for the eye to distinguish.) This is the major problem with
pie charts, so you should always include quantitative (numerical) information.
• In addition most of our pie chart is taken up with labels (of the various
categories), and this contradicts the general advice given in Section 3.3.

Comparing Pie Charts In Fig.5.7 we compare the 2000 and 2007 data.

Fig.5.7: Pie charts for comparison of U.S. Federal Government Receipt data

• Again we note the difficulty in making numerical comparisons based on the


relative sizes of the slices. Can you see any difference by eye between the two
social insurance payment slices?
• You can argue that we do not have to rely on visual inspection in Fig.5.7 since
we have the two respective numerical values (32% and 37%) available. But the
point is that they are already available in our original table (Table 5.3), so what
extra does the pie chart give us? If it gives us nothing extra why have it – just
stick with the table instead!

3D Pie Charts In Fig.5.8 we reproduce Fig.5.6 in two three dimensional versions.

Fig.5.8: Two three dimensional pie charts for Government Receipts (2007)


Here the 3D effect provides visual distortions in two ways:


• The corporate income tax slice looks bigger than it should (especially in the
right hand picture)
• It is harder to see the difference between the slices representing the social
insurance payments and the individual income taxes.
• It is also possible for two small percentages (8% and 4% say) to appear the
same size in a 3D view, although one percentage is twice the other.
In Section 6 we shall examine why such distortions tend to occur.

Notes 1. Another, more positive, view of pie charts is given in


• Hunt, N., Mashhoudy, H. (2008). The Humble Pie – Half Baked or Well Done?
Teaching Statistics 30 (1) ps. 6-12.
Here you will find sensible advice on how to construct pie charts, and ways of
minimising problems in interpretation.
2. Teaching Statistics is a journal devoted to good practice in the teaching of
statistics, and contains many articles from which you can learn much. You can
access the website at
http://www3.interscience.wiley.com/journal/118539683/home?CRETRY=1&SRETRY=0 ,

However it is easier to do a Google search on “Teaching Statistics” (and use the


“Wiley InterScience” entry). You should be able to download articles.

5.4. Scatter Plots


A two-dimensional scatter plot is very often the most effective medium for the
graphical display of data.
• A scatter plot will very effectively highlight the relationship between two
quantitative (numerical) variables.
• There may, or may not, be an implied causal relationship between the two
variables.
- By this we mean changes in one variable (the independent variable) cause
changes to occur in the other variable (the dependent variable). We shall
discuss this more fully in Unit 9.
- If this is the case place the independent variable on the x-axis.
• Data points are not connected in a scatter plot, so in Excel choose this option.


Example 5.5 Movement of Stock Prices


We return to Example 5.4 and include the monthly stock prices of a second company
AT&T (Inc.), as displayed in Table 5.4. In our scatter plot of Fig.5.9 (obtained via the
Excel Chart Wizard or Chart Menu) we have arbitrarily chosen to put the Apple stock
on the x-axis.

Table.5.4: Apple and AT&T Stock Fig.5.9: Scatter plot for Apple and AT&T Stock

Comments What can we conclude from Fig.5.9?


• Can we use one stock price in an attempt to predict the other? We shall look at
this type of problem in the Regression section (Units 9 and 10).
• The interpretation of Fig.5.9 may be made a little more difficult by the quantity of
data available, which causes many individual points to “merge” into each other.
• Financial data is often in the form of “high frequency” data. In contrast to
some (economic) areas where data may only be available yearly, financial data
(especially stock prices) may be observed daily (or even more frequently). This
produces a lot of data over time, and special mathematical techniques are
sometimes required to separate the “noise” from the “trend”.
• Would you expect Apple and AT&T stock to be “related” to one another?

Scatter plots are really intended to give a “global view” of a data set, often with the
view of further analysis (such as “fitting a straight line” to the data). In Fig.5.9 there
appears no “real” relationship between the two variables. However, if we know
beforehand a relationship exists, as in the case of (defined) functions, we can use a
scatter plot to graph the relation. Graphing functions in Excel is explored in Practical
Unit 3.


5.5. Line Chart/Graph


Line graphs work best when both (x and y) variables are quantitative (numerical) and
ordered (so it makes sense to compare two x or y-values). We have already seen
illustrations in Section 4 (see Figs.4); we now look at a slightly more complicated
example.

Example 5.6 Supply-Demand Curves

Reference The following is adapted from the website “Developing Analytical


Graphs in Microsoft Excel” at
http://csob.berry.edu/faculty/economics/ExcelGraphDevelopment/BuildingExcelWork
books.html
Here you will find very detailed instructions, and related references, for producing
graphics relating to macroeconomic models. The website is well worth a look.

Data The data given in Table 5.5 relates to supply and demand of a product. Using
Copy & Paste to add the line graph for the supply curve to that of the demand curve
produces Fig.5.10. From this we can identify the equilibrium price (where demand =
supply) as £100.

Table.5.5: Supply-Demand data Fig.5.10 Line plot of Table 5.5

Question: What happens if the demand line in Fig.5.10 suddenly shifts to the right
(representing an increased demand)?


5.6. Time Series Line Chart


Although time series are a special type of line chart, in which the x-axis denotes
time, they form a very important class of data (since many quantities are observed
over a period of time).
We have already seen examples of time series graphs; see, for example, Section 4
Figs.4 and Practical Unit 2 Section 6.

Summary Try and observe the following rules:


• Keep your chart as simple as possible.
• Do not use 3D effects unless they are really necessary.
• Sometimes the original table of data may be more informative

6. Poor and Misleading Graphical Representations


There are various ways, some intentional and some not, in which graphs can give a
distorted view of the data.

1. Bar Charts – Vertical Scale


In Fig.6.1(a), judging by the heights of the bars, sales of product A, for example,
appear to be about four times those of product C. (If we examine the scales, about
25 for A and 15 for C, we see that this is not really so, but the visual impression
remains.) The difficulty is that the vertical scale does not start at zero; changing the
scale as in (b) gives a better impression of the comparative product sales.
In fact a “fairer comparison” of (a) and (b) would have the latter scale about twice the
size; see Fig.6.1(c).


Fig.6.1: Effect of vertical scale in column charts

Suitably adjusting the vertical scale is a common way to give a misleading impression
of the data. Another frequent offender is lack of a vertical scale.

2. Time Series – Comparisons


In Fig.6.2 the price (in £), and the sales (in thousands), of a product are given in the
form of a line (time) series plot. Without any vertical scale no meaningful
comparisons can be made regarding the variation either within each series, or
between the two series, over time. (We really want to know whether the changes are
significant in any practical sense.)
With access to the data of Table 6.1 we can see the two series have different scales,
and a less misleading plot is shown in Fig.6.3. However it is more difficult now to
compare the variations in the series, and a better alternative is to use two vertical
axes with separate scales, as in Fig.6.4. You are asked to use Excel to obtain this
type of graph in Practical Exercises 3. You should compare Figs.6.2 and 6.4.

Fig.6.2: Comparative behaviour of series Table 6.1: Actual data for Fig.6.2


Fig.6.3: A better representation of Fig.6.2 Fig.6.4: Use of a secondary (y) axis

3. “Stretched” Scales
You should be aware that data variations can appear smaller (or larger) by
“stretching” the horizontal (or vertical) scale. For example, compare Fig.6.4 with
Fig.6.5 overleaf; in the latter there appears virtually no variation in either series. This
type of “stretching” is now very easily done in Excel, so you should always take care
with the overall size of your charts.

Fig.6.5: Eliminating data variation by “stretching” the horizontal (time) scale

4. Pie Charts
There is a large, and growing, literature on the use (and misuse) of pie charts. For an
interesting discussion of why pie charts can easily mislead us see


• Visual Gadgets (2008). Misleading the reader with pie charts, available at
http://visualgadgets.blogspot.com/2008/05/misleading-reader-with-pie-charts.html
You should be able to track many other articles via Google.

5. Stacked Charts
Rather than represent the data of Table 6.1 as a time series we could choose to use
a column chart; the second chart in Fig.6.6 is a stacked version. In truth neither chart
represents the data variations particularly well, but the stacked chart is particularly
inappropriate. The total height of each column is meaningless, since we cannot add
“Number of sales” with “Price of product”, not least because they are measured in
different units. Always check that items are comparable when stacking them in a
chart.

Fig.6.6: Column chart, and stacked column chart, for data of Table 6.1

6. Pictograms
The pictogram is a very visually appealing way of representing data values. The
selected data of Table 6.2 gives passenger numbers for April 2009 at some of the
major U.K. airports (London and Scotland). Figures are to the nearest hundred
thousand.

Data Source Figures, referring to April 2009, are taken from the webpage Recent
UK Airport Passenger Numbers – from CAA, BAA and IATA statistics at
http://airportwatch.org.uk/news/detail.php?art_id=2258. Here you can find further
data, with exact passenger numbers.


Table 6.2: Passenger Numbers (Heathrow, Gatwick, Stansted, Edinburgh)

Fig.6.7: Misleading pictogram depicting relative passenger numbers

The pictogram illustrated in Fig.6.7 attempts to compare the passenger numbers for
the airports shown, using a graphic of an aeroplane rather than lines or bars. The
obvious impression given is that Heathrow has vastly more passengers than the
other airports. In fact the ratio
Heathrow Passenger Numbers / Edinburgh Passenger Numbers = 5.6 / 0.8 = 7
So the “Heathrow graphic” should be seven times the size of the “Edinburgh graphic”.
This is clearly not the case since the eye picks up the “complete picture” and
registers the corresponding area. Unfortunately in Fig.6.7 both linear dimensions
(length and height) of the “Heathrow plane” are scaled up by an approximate factor
of seven compared to the “Edinburgh plane”, with a consequent area magnification of
7² = 49. In effect the final graphic is seven times too large, and a more representative
pictogram is shown in Fig.6.8.


Fig.6.8: Improved pictogram depicting relative passenger numbers

7. Chart References
The ONS has produced a useful summary of best practices in using charts, under the
title Drawing Charts – Best Practice. See if you can find it under Neighbourhood
Statistics; the URL is
http://www.neighbourhood.statistics.gov.uk/HTMLDocs/images/Drawing%20Charts%
20-%20Best%20Practice%20v5_tcm97-51125.pdf
• You may also care to look at a set of (un-named) power point slides at
http://mtsu32.mtsu.edu:11235/Misleading%20Statistics.ppt
• A good discussion of the use of pictograms, both one and two dimensional, in
Excel is given in Hunt, N. (2000) Pictograms in Excel. Teaching Statistics, 22, 2,
56-58.
• There is a very interesting website Numeracy in the News located at
http://www.mercurynie.com.au/mathguys/mercindx.htm. Here you will find
discussions of numeracy in relation to specific newspaper articles. In particular
if you click on the Data Representation icon you will find many newspaper
pieces analysed in terms of their statistical content. Alternatively you can go
directly to http://www.mercurynie.com.au/mathguys/maths/datreprs.htm. You
should look at some of this material, and we shall consider one or two articles in
Tutorial 3.

7. Some Further Graphical Representations


We have really only touched the surface of what is available in terms of representing
data graphically. There are various further graphical devices that have found use in
recent years. They tend to be used in specific situations, rather than being designed
for general purpose use. You may care to investigate the following (using Google in
the first instance);
• Heat Maps
• Market Maps
• Sector Strips


• In addition there are a large number of graphics chartists use to describe, for
example, the price movements of stocks and indices. You can get an idea of
some of the possibilities by going to Yahoo Finance! and obtaining historical
quotes for a particular stock (say General Electric). Under Charts click on
Basic Technical Analysis (left side) and a chart should appear that
incorporates a time series, and histogram, of the stock movements. In addition
you can add to this chart items like “Bollinger Bands” (if you know what they
are!). Below is the kind of effect you can achieve.

• From an accounting perspective you should download, and read, the following
article: Burgess, D.O. (2008). Does Graph Design Matter To CPAs And
Financial Statement Readers? Journal of Business & Economics Research 6
(5) 111-124.
Here a survey of financial readers was undertaken to ascertain whether the
meaning of financial statements can be distorted, intentionally or not, by the
graphical representation chosen. To give you an idea of what is involved
complete the following exercise:

Exercise: Examine the following two graphs and then comment, as indicated, on the
five statements that follow.


You may like to revisit this exercise at the end of the course and see if your
responses differ.

• Finally there is one important chart we have not discussed in this unit. This is
the boxplot, and relies on the use of quartiles, a topic we discuss in the next
unit.

Nevertheless, the basic graphical devices we have discussed in this unit are used
repeatedly in the financial and economic literature. Look back to some of the
referenced papers to confirm this. In general, unless you have compelling reasons
not to, use simple graphics to describe your data, in preference to anything “more
fancy”.


8. References
• Hunt, N., Mashhoudy, H. (2008). The Humble Pie – Half Baked or Well Done?
Teaching Statistics 30 (1) ps. 6-12.
• Hunt, N. (2000) Pictograms in Excel. Teaching Statistics, 22, 2, 56-58.
• Noah, T. (2004) Stupid Budget Tricks. How not to Discredit the Clinton Surplus,
Slate Magazine (Aug. 9th.) available at http://slate.msn.com/id/2104952/
• Visual Gadgets (2008). Misleading the reader with pie charts, available at
http://visualgadgets.blogspot.com/2008/05/misleading-reader-with-pie-charts.html


4 Numerical Summaries of Data

Learning Outcomes
At the end of this unit you should be familiar with the following:
• General ideas of location and spread of a data set.
• Use of numeric “summary measures”.
• The various numeric measures of location available; mean, median and mode,
and their importance.
• The various numeric measures of spread available; range, IQR and standard
deviation, and their importance.
• The use of graphical representations (stem and leaf plots and boxplots) to
compute, and display, quartiles.
• Understand when numeric measures are inadequate.
• Appreciate the properties of some basic types of financial data.
• Understand how market efficiency can be investigated using pivot tables.

A market where chief executive officers make 262 times that of the average
worker and 821 times that of the minimum-wage worker is not a market that is
working well.
Marcy Kaptur (American politician)
(See http://www.brainyquote.com/quotes/keywords/average.html)


1. Guiding Principles
• Numerical displays are intended to give a quick summary of the (numerical)
content of quantitative data.
• There are two important factors associated with any dataset
- A measure of “central location”, by which we mean “where is the bulk of
the data located”?
- A measure of “spread” indicating whether the data is “tightly bunched”
about the centre or not.
• In most practical situations (large data set) numerical summary measures are
usually computed using computer software (Excel in our case). However, as
with graphical measures, hand computations (using small data sets) are
important for two reasons:
- They illustrate the underlying principles of the calculation. The rationale
behind measures can be of crucial importance in financial situations.
- They “give one a feel for the data” and allow an understanding of which
particular summary measure should be used in a given situation.
Exercises involving hand computation are given in Tutorial 4.

Note From this unit onwards our discussions will, in general, become more
numerically based. A good modern reference text is
Nieuwenhuis, G. (2009) Statistical Methods for Business and Economics.
Maidenhead: McGraw-Hill Education (UK) Limited.
Although you are not required to purchase this, the text covers similar material to the
module, but in considerably more detail, and contains some more advanced topics
we do not have time to cover. In addition there are many finance/economic based
examples, and Excel (and SPSS) applications are discussed.

2. Measures of Location – Three basic definitions


The “average value” of a set of data is a very important indicator of the typical kind of
value we can expect to find in the data. Unfortunately there are many possible
averages; just Google “average” and see what Wikipedia says. In statistics there are
two major averages used – the mean and the median. Traditionally a third, the
mode, is usually added to this list but its use appears to be limited, and we really just
mention it in passing.


For our purposes it will be most useful to compute summary measures for a sample
taken from a larger population. So we can assume our data values are discrete (see
Unit 1) and label them x_1, x_2, x_3, ..., x_n    --- (1)
This indicates we have n values (not necessarily distinct).

Definition 1 The mean (or arithmetic average) is usually denoted x̄ (read “x bar”)
and defined as

x̄ = (1/n)[x_1 + x_2 + x_3 + .... + x_n] = (1/n) Σ_{i=1}^{n} x_i    --- (2)

Note 1 Here we have used the summation operator Σ to provide a compact
representation of the sum (over x values). You will see this written down in many
textbooks, and it is invaluable when proving general results. However on the few
occasions we use the notation it will merely be as a shorthand notation for a sum.

Definition 2 The median Q2 is the “middle value” of the data, i.e. the x-value such
that half the x-values are smaller, and half the x-values bigger, than Q2.

Notes 2 There are a few important points relating to Definition 2:


• The notation Q2 arises because, for reasons we discuss in Section 8, it is
convenient to “divide the data into quarters” using the quartiles Q1 (first
quartile), Q2 (second quartile) and Q3 (third quartile). Then Q2 also has the
property of “dividing the data in half” and hence defines the median.
• The data must be ordered (sorted), either increasing or decreasing, for the
process of computing quartiles to make sense (why?).
• Although it is not very easy (or useful) to write down a general formula for the
quartiles we can do so for the median.

Q2 = x_m                  where m = (n + 1)/2,  if n is odd
Q2 = (x_m + x_{m+1}) / 2  where m = n/2,        if n is even      --- (3)

(3) says to take the unique middle value when it exists (n odd), otherwise take
the (arithmetic) average of the “two middle values”. In practice this verbal
description tends to be more useful than the mathematical one in (3).

Definition 3 The quartiles are the values that divide the data in (1) into “four equal
quarters”.
Definition 4 The mode is the “most popular” value of the data, i.e. the x-value in (1)
which occurs most often.
Note 3 The data really needs to be ordered if we want to reliably identify the mode
(and quartiles), especially if n is large in (1). To compute the mean we do not need
ordered data.
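
Definitions (2) and (3) translate directly into code. The following is a minimal Python
sketch, using a small made-up data set, that implements them literally and checks the
answers against the standard library; remember the data must be sorted before the median
can be read off.

from statistics import mean, median

data = [7, 3, 9, 4, 4, 6, 10, 2]           # hypothetical values x_1, ..., x_n

# Mean, definition (2): sum of the values divided by n.
n = len(data)
x_bar = sum(data) / n

# Median Q2, definition (3): sort first, then take the middle value
# (n odd) or the average of the two middle values (n even).
xs = sorted(data)
if n % 2 == 1:
    q2 = xs[(n + 1) // 2 - 1]              # position (n + 1)/2, counting from 1
else:
    q2 = (xs[n // 2 - 1] + xs[n // 2]) / 2 # positions n/2 and n/2 + 1

print(x_bar, q2)                           # 5.625 5.0
print(mean(data), median(data))            # the library functions agree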


It is traditional to compute the mean, median (quartiles) and mode for a “small”
artificial data set to illustrate the computations. We choose a slightly larger set and,
for variety, consider a “non-financial” application. (Nevertheless you should be able to
see economic, and financial, implications in Example 3.1.) For illustrative purposes
our calculations extend over the next two sections.

3. Stem and Leaf Plots


Example 3.1 Oscar Winners
We are interested in the age at which actors, and actresses, win a movie Oscar.
Questions of interest include:
What is the “typical” age of winners? Is there any difference between genders?
Here we shall look at actresses and you are asked, in the Tutorial Exercises, to
perform a similar analysis for actors.

Data Sources Go to Wikipedia at (Google search with “Oscar winners”)


http://en.wikipedia.org/wiki/Academy_Award. This will give you all the winners from
1928 to the present. Getting ages is a bit more time consuming, but can be done on
an individual basis or by consulting a film compendium (such as Halliwell’s Film
Guide 2008 and Halliwell’s Who’s Who in the Movies, 2003 being the latest edition).
Table 3.1 gives the resulting data for actresses. If you wish to see details of the
actual winning films see the Oscars (Excel) file.

Table 3.1: Age of Oscar Winning Actress


Step 1 Partially sort the data.


Although we could just produce a table of sorted values, a much better alternative is
to arrange values in the form of a stem and leaf plot.
• Divide the data range into suitable intervals, here called the stem values. If
possible, use 10 as the interval size, and suppress the zero.
• Place each data value in the appropriate interval/stem.
• Make sure the values are equally spaced (so values “line up”), as in Fig.3.1a.

Fig.3.1a: Stem and leaf plot with “partially” sorted data (stem values on the left, leaves
on the right; the annotated entries are read as 50 and 39 respectively).

Fig.3.1b: Stem and leaf with “fully” sorted data.

Step 2 Fully sort the data.


Each row produced is termed a leaf. Observe that the leaves are (generally) of
different lengths, and this gives the plot an overall shape.
• Now sort each leaf (in ascending order). The resulting picture Fig.3.1b is termed
a stem and leaf plot.
• Note the similarity with a horizontal bar graph or histogram. The crucial
difference is that all the data values are preserved in the stem and leaf plot.


Fig.3.1c: Stem and leaf with stem size = 5.

Notes
1. If you require more leaves, to better highlight the underlying “shape”, you can
divide each stem in two as illustrated in Fig.3.1c. Then, for example, the stem 2L
refers to the range “20 lower”, or 20 – 24, and 2H means 25 - 29.
2. Stem and leaf plots are produced by some computer software but, unfortunately,
not directly by Excel. However you can produce a “rotated” (histogram) version of a
stem and leaf plot in Excel using the procedure described in
Excel Charts for Statistics on the Peltier Technical Services, Inc. website at
http://www.peltiertech.com/Excel/Charts/statscharts.html#Hist1
3. We may take the view that data values in the range 60 and above are “atypical”,
and regard them as outliers. The remaining “typical” data values then appear “fairly
symmetric” – we shall return to this point several times later.

4. Computation of Quartiles
With a little practice, we can fairly easily read off the quartiles from the stem and leaf
plot. It is often convenient to compile a frequency table as part of the calculation.

Step 1 Obtain frequency table.


This just consists of counting the number of data values along each leaf. In addition a
cumulative frequency table, comprising the number of data value in all previous rows
(including the current one), is very useful in Step 2.


Fig.4.1: Frequency table from stem and leaf plot.

Fig.4.2: Computation of Quartiles (Q1 and Q3 are marked on the plot; Q2 is the average
of the two data values either side of the middle position).

Step 2 Compute the median


Find the “middle” data value Q2. This can be a little trickier than it appears since, if we
have an even number of values, there is (strictly speaking) no middle value. In the
notation of (1) we have the following data values:
1st value 2nd value ................................................................. nth value
• If n is odd the middle value is the value in position (n + 1)/2; see (3).
• If n is even we take the average of the two values either side of this position, i.e.
positions n/2 and (n + 2)/2; see (3).
• In our case n = 82 – see Fig.4.1 – and we want the average of the 41st and
42nd values. Our cumulative frequency table allows us to more easily locate the
required entries, and we insert a vertical line to denote the location of Q2. The
actual median value is Q2 = ½(33 + 33) = 33.

Step 3 Compute the remaining quartiles.


Take the first half of the data up to, but not including, Q2 and repeat Step 2 to locate
the median of this smaller dataset.
• We now have values from 1 through 41, and hence we require the (41 + 1)/2 =
21st value. This gives Q1 = 28.
• Similarly we obtain Q3 = 39.
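
The split-in-half procedure of Steps 2 and 3 can be sketched in a few lines of Python.
The data set and the function name below are hypothetical; as Note 3 below points out,
other software (including Excel) may use slightly different conventions and return
slightly different quartile values.

def split_half_quartiles(values):
    """Q1, Q2, Q3 by the 'divide the data in half' method described above.
    Other conventions (e.g. Excel's QUARTILE) can give slightly different answers."""
    xs = sorted(values)
    n = len(xs)

    def middle(seq):
        m = len(seq)
        if m % 2 == 1:
            return seq[m // 2]
        return (seq[m // 2 - 1] + seq[m // 2]) / 2

    q2 = middle(xs)
    if n % 2 == 1:
        lower, upper = xs[: n // 2], xs[n // 2 + 1 :]   # exclude the middle value
    else:
        lower, upper = xs[: n // 2], xs[n // 2 :]
    return middle(lower), q2, middle(upper)

print(split_half_quartiles([21, 25, 26, 28, 30, 33, 35, 39, 41, 60]))
# (26, 31.5, 39)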


Notes
1. Graphically we have the following situation (not drawn to scale):

41 data values 41 data values


______________|_________________|___________________|_______________
Q1 = 28 Q2 = 33 Q3 = 39
20 data values 20 data values 20 data values 20 data values

Fig.4.3: Graphical view of quartiles.

2. Remember, in our example, Q2 is not directly a data value, but Q1 and Q3 are. This
is why, in Fig.4.3, 20 + 20 ≠ 41.
3. Our procedure of successively dividing the data into halves is not the only one
possible. Wikipedia, at http://en.wikipedia.org/wiki/Quartile, will give you more details,
and further references, such as http://mathworld.wolfram.com/Quartile.html, to other
computational procedures for quartiles. But the important point to bear in mind is that
all such procedures will give “very similar” answers and, as we discuss in the next
section, it is the “overall shape” defined by the quartiles which is of greatest interest
and importance.
4. From Fig.4.2 we can also read off the mode as 26 (occurring 8 times in the data).

5. Boxplots – Graphical Representation of Quartiles


It is often very instructive to view the quartiles graphically, and the boxplot (or box
and whisker plot) is designed to do this. This is really no more than a fancy version
of Fig.4.3 drawn to scale, and can be drawn horizontally or vertically. A horizontal
version is shown in Fig.4.4 with the median depicted as a vertical line enclosed by a
box representing the remaining two quartiles. The vertical edges of the box are
connected to the minimum and maximum data values by lines, termed the whiskers.

Fig.4.4: Boxplot (of quartiles) – the median Q2 sits inside a box running from Q1 to Q3,
with whiskers extending out to the minimum and maximum values.


Notes
1. The width of the box is arbitrary.
2. Although Excel does not graph boxplots directly, they can be obtained as shown in
Practical Unit 4.
3. Boxplots are particularly important when comparing two, or more, datasets. You
are asked to compare the actor, and actress, boxplots in the Tutorial Exercises.

6. Computation of the Mean


The mean cannot really be obtained from the stem and leaf plot of Fig.4.2. Due to the
arithmetical nature of its definition, we must perform the calculation in (2).
Calculation 1 Adding the 82 data values in Table 3.1 gives (Exercise)

x̄ = (1/82) Σ_{i=1}^{82} x_i = 2905/82 = 35.43 (years)    --- (4)

Of course this type of calculation is more suited to Excel implementation, and you are
asked to perform some computations of this type in Practical Unit 4.
Calculation 2 A much simpler calculation results from using the frequency table in
Fig.4.1, but we have to make an assumption. Our frequency table just tells us how
many data values are in a particular interval, but not what the values are explicitly.
We assume all values are concentrated at the centre of the interval so, for example,
we imagine the data value 25 occurs 28 times. This easily leads to the frequency
table of Table 6.1. The mean is now calculated from the formula (why?)

x̄ = (1/n)[f_1 x_1 + f_2 x_2 + .... + f_k x_k] = (1/n) Σ_{i=1}^{k} f_i x_i    --- (5a)

where n = f_1 + f_2 + .... + f_k = Σ_{i=1}^{k} f_i    --- (5b)

Here k is the number of intervals, the x_i are the interval centres and the f_i the
corresponding frequencies. The actual computation shown in Table 6.2 gives

x̄ = 2960/82 = 36.10 (years)    --- (6)

Table 6.1: Frequency table from Fig.4.1 Table 6.2: Computation of mean.
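
Formula (5a) translates into a couple of lines of code once the interval centres and
frequencies are known. The sketch below uses a small made-up frequency table, not the
actual values behind Table 6.1.

# Approximate mean from a frequency table, as in (5a)/(5b).
midpoints   = [25, 35, 45, 55]     # x_i: centres of the intervals (illustrative)
frequencies = [28, 30, 15,  9]     # f_i: how many values fall in each interval

n = sum(frequencies)                                             # (5b)
x_bar = sum(f * x for f, x in zip(frequencies, midpoints)) / n   # (5a)
print(n, round(x_bar, 2))          # 82 35.61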


Notes 1. The value in (6) is an approximation (estimate) of the mean, whereas (4) is
the exact value. The virtue of the frequency table approach is its simplicity, since the
complete dataset (82 values in our case) is replaced by a much smaller number of
intervals (7 here).
2. The formalism in Table 6.2, where our result (the mean) is calculated in terms of
column sums of the data, occurs very often in statistical calculations. Such sums are
easily computed by hand (small dataset) and in Excel (large dataset).

7. Comparison of Measures of Location


We have calculated 3 measures of central location for Example 3.1. Explicitly
Mean = 35.4 ; Median = 33 ; Mode = 26 --- (7)
Which value should we prefer? The general situation is depicted in Fig.7.1.
• For skewed distributions the mean is a poor indication of the “centre”; use the
median. See Fig.7.1(a) and (b).
• For symmetric distributions all three measures agree. In practice the mean
tends to be used since it has “better mathematical properties” which make it
easier to manipulate and derive theoretical results.
• Also (7) shows our movie data is positively skewed. The skew is not great, the
mean being only slightly greater than the median since, although our data has
some large values, it does not have many of them. (Probably only two values
out of 82 can be considered outliers, as we discuss in Section 8.)

Fig.7.1: Relative locations of mean, median and mode – (a) negatively skewed,
(b) positively skewed, (c) symmetric (no skew).

8. Measures of Spread – Three basic definitions


We can get an indication of the skewness of a distribution from the corresponding
histogram, as in Fig.7.1. We look at more quantitative measures in this section,
relating to the idea of the “spread” of a set of data.
Definition 5 The range is the difference between the largest and smallest values:
Range = Maximum value – Minimum value
Definition 6 The interquartile range (IQR) is the difference between the third and
first quartiles: IQR = Q3 – Q1 .

Example 3.1 The results in Fig.8.1 are immediate from Fig.4.3. Note the following:
• The range uses only two data values, and hence is very sensitive to outliers in
the data.
• The IQR is designed to eliminate this difficulty by looking at the “middle half” of
the data. Observe that the range is not double the IQR as one might expect for
a symmetric dataset.
• However the IQR still only uses two data values.

IQR = 39 – 28 = 11
|______________|_________________|___________________|_______________|
Min = 21 Q1 = 28 Q2 = 33 Q3 = 39 Max = 80
Range = 80 – 21 = 59

Fig.8.1: Graphical view of Range and IQR.

Definition 7 The standard deviation (s) is defined by

s² = (1/n)[(x_1 − x̄)² + (x_2 − x̄)² + (x_3 − x̄)² + .... + (x_n − x̄)²] = (1/n) Σ_{i=1}^{n} (x_i − x̄)²    --- (8)


Comments This is a very important definition and we note the following:


• The computation in (8) requires use of all the data values in (1).
• We require the mean x̄ before we can compute the standard deviation.
• (8) involves the square of the standard deviation (for reasons we explain
below). To compute s requires a square root to be taken at the end of the
calculation.
• The individual terms (x_1 − x̄), (x_2 − x̄), ... , (x_n − x̄) in (8) are called deviations from
the mean. Then (x_1 − x̄)² + (x_2 − x̄)² + (x_3 − x̄)² + .... + (x_n − x̄)² is termed the sum
of the squared deviations. Finally the R.H.S. is the average squared deviation
(from the mean). See Fig.8.2.
• In words (8) gives the average (squared) distance of a data value from the
mean.

Fig.8.2: Graphical view of deviations from the mean – the data values x_1, x_2, ..., x_n
on a number line, with the distances x_1 − x̄, x_2 − x̄, ..., x_n − x̄ measured from the
mean x̄.

Calculation of Standard Deviation


There are several important points in the definition (8) that we wish to illustrate. Since
we need to use all data points, and Example 3.1 contains 82 of them, we first
consider a much smaller (real, but rather artificial) dataset.

Example 8.1 The following data gives prices (in pence) of a 100 gram jar of a
particular brand of instant coffee on sale in 15 different shops on the same day:
100 109 101 93 96 104 98 97 95 107 102 104 101 99 102
Calculate the standard deviation of the prices.

Solution The required calculations are depicted in Table 8.1. Note the following:
• The sum of the deviations (cell C40) is zero; this is a consequence of the
definition (2) of the mean. The cancellation of positive and negative deviations
is avoided by squaring, as specified in (8).
• The sum of squared deviations is 266 (non-zero, of course), and the average
squared deviation is 266/15 = 17.733. Then s = √17.733 = 4.21 (pence).
• In words: On average any particular coffee price is 4.2 pence from the mean of
101 pence.


Table 8.1: Computing the standard deviation using (8).

Comments 1. Note that no individual price can involve a fraction of a penny; neither
can the deviations (column C). The standard deviation just gives an average
measure, averaged over all prices (data values).
2. You may also have noticed the mean, as defined by (2), is often not a possible
data value. Thus in (7) 35.4 is not a possible age (since all ages are given to the
nearest year). Again this occurs because we average over all data values.
3. There is an alternative calculation of s based on the formula
s² = (1/n) Σ_{i=1}^{n} x_i²  −  x̄²    --- (9)

This is just an algebraic rearrangement of (8); you should be able to prove this if you
are familiar with manipulating summation symbols. We can check this by using
column F in Table 8.1 to give s² = 153281/15 – 101² = 17.733,
and this agrees with the entry in cell D41.

Example 3.1 To calculate s for our original (Oscar winners) data we can do one of
three calculations:
• Use (8) with our 82 data values.
• Use a slight modification of (9), relating to frequency distributions, and use the
frequency table in Fig.4.1 (just as we did for the mean in Section 6).
• Use Excel’s “built in” functions.
The first calculation is too time-consuming, and you are asked to explore the third
option in the Practical Exercises. Here we look at the second alternative and, as with


the mean, the ease of the calculation is slightly offset by the approximate nature of
the computation. The frequency version of (9) is (compare this with (5))
s² = (1/n) Σ_{i=1}^{k} f_i x_i²  −  x̄²    with    n = Σ_{i=1}^{k} f_i    --- (10)

To implement (10) all we need to do is add an extra column (D) to Table 6.2 to give
Table 8.2. Note that, since we are calculating entirely from the frequency table, we
use the value of x̄ given in (6). This gives

Table 8.2: Frequency table calculation of standard deviation

s² = 119450/82 – 36.0975² = 153.68  and so  s = √153.68 = 12.4 (years)


A more accurate value can be obtained using the data of Table 3.1, and Excel yields
s = 11.4 years.
You should compare this measure of spread with the IQR in Fig.8.1.
N.B. Usually you will use software (possibly Excel) to compute standard deviations
since, as we have seen, their computation is numerically tedious. However, it is very
important you understand the meaning of the standard deviation as a measure of
the spread of data values. (Understanding how to calculate statistical quantities does
tend to reinforce their meanings in one's mind, so hand computations do serve a very
useful purpose.)
Why the spread is important is briefly considered in Section 10.

9. Standard deviation – A complication


The standard deviation is often regarded as marking the divide between the area of
“descriptive statistics”, discussed in Units 3 and 4, and the more “quantitative
statistics” that we shall consider in the remaining units. This is partly because of the
increased difficulty in computing the (standard deviation) measure, but also due to
the following “complication”.
In equations (8) – (10) we have computed an average (squared deviation from the
mean) by dividing by n, the number of data values. However, there is an alternative
formula in place of (8):


s² = (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)²    --- (8*)

Here we divide by (n – 1) rather than n. The reason centres around the fact that, to
compute s, we first need to compute x̄ from the data. To see why this is important
we return to Example 8.1, where we have the 15 data values
100 109 101 93 96 104 98 97 95 107 102 104 101 99 102

Once we have computed the mean x̄ = 101 our 15 data values are no longer all
needed – technically they are not all independent (of each other). In fact, knowing x̄
= 101, we can remove any one of our data values, i.e. our data could be any of the
following 15 sets (each one comprising only 14 values):
* 109 101 93 96 104 98 97 95 107 102 104 101 99 102
100 * 101 93 96 104 98 97 95 107 102 104 101 99 102
.......................................................................................................................................
100 109 101 93 96 104 98 97 95 107 102 104 101 99 *

In each case the starred * entry is uniquely determined by the requirement that the
mean is 101. To take account of this “one redundant” data value we adjust the
denominator (n) in (8) by one to give (8*).

Notes
1. You will probably find this argument a little strange! However it expresses a very
general view in statistics that it is only independent quantities that are important in
computations, and not necessarily all the (data) values we have available.
2. We shall essentially repeat the argument in Unit 7 Section 8, when we introduce
the important concept of degrees of freedom.

3. At the moment just remember to use (8*) when the mean x̄ has to be calculated
from the data. Thus our previous calculations of s in Example 8.1 are, strictly
speaking, incorrect. For example, using the sum of squared deviations of 266 gives
s² = 266/14 = 19 (in place of s² = 266/15 = 17.733)

and hence s = √19 = 4.36 (in place of s = √17.733 = 4.21)

4. Observe that the above difference (4.36 compared to 4.21) is quite small. As n
increases the difference between (8) and (8*) clearly decreases. This leads to the
“rule of thumb”: use (8*) for “small” samples and (8) for “large” samples. The dividing
line between the two is often taken as n = 25, but this is really quite arbitrary.
5. Which formula is right – (8) or (8*)? The answer is that both are really just
definitions designed to capture (in a single number) the concept of “spread” of data


values around the mean. We are free to choose whichever definition we want.
Theoretically we choose the one with the “better mathematical properties” and,
because of the independence idea, this turns out to be (8*). The drawback is
explaining why we prefer (8*), without going into too many technical details, since (8)
is obviously more intuitive.

10. Financial Perspective – Volatility and Risk


It is clear that both the computation, and meaning, of the standard deviation is a good
deal more involved than that of the IQR. This is only to be expected since, as we have
already mentioned, the former uses all data values whereas the latter uses only two
of them. In finance the standard deviation is of fundamental importance since it is
used as a quantitative measure of risk of, for example, a stock or stock portfolio. The
terminology used is that the standard deviation is a measure of volatility: the more
the return on a stock varies from the average return on the stock (measured over a
period of time), the more volatile the stock. In turn, the more volatile a stock the
riskier it is to invest in, since returns are more uncertain.

Example 10.1 The following (hypothetical) data gives (starting monthly) values of two
stock portfolios, X and Y, over a six month period.
Month   Portfolio X (£000)   Portfolio Y (£000)
  1           1000                 1000
  2           1008                 1015
  3           1018                 1066
  4           1048                 1194
  5           1032                 1086
  6           1038                 1043
  7           1058                 1058

Table 9.1: Two stock portfolio values.

Although both portfolios have the same starting (1000) and finishing (1058) values,
their maximum values are different (1058 for X and 1194 for Y). Hence we would
judge Y is more volatile than X. To quantify this, the appropriate calculations are
normally expressed in terms of returns since an investor will usually target a specific
return, say 5%, on his investment (no matter how much he invests).
The return is just the familiar percentage change we have seen before (where?):
Portfolio return = [(End value – Start value) / Start value] * 100%

Table 9.2 gives the calculated returns. For example, during month 1

Portfolio return = [(1008 – 1000) / 1000] * 100% = 0.8%


Month   Portfolio X Return (%)   Portfolio Y Return (%)
  1            0.8                     1.5
  2            0.9823                  5.0246
  3            2.8626                 12.0075
  4           -1.5267                 -9.0452
  5            0.5814                 -3.9595
  6            1.9268                  1.4382

Table 9.2: Portfolio returns.

The average returns (R_X and R_Y) are then

• R_X = (1/6)[0.8 + 0.9823 + 2.8626 − 1.5267 + 0.5814 + 1.9268] = 0.9377%
• R_Y = (1/6)[1.5 + 5.0246 + 12.0075 − 9.0452 − 3.9595 + 1.4382] = 1.1609%

Hence Y has a slightly higher average return (1.2% compared to 0.9%). However,
this is more than offset by its (much) higher volatility:

• s²_X = (1/5)[0.8² + 0.9823² + 2.8626² + 1.5267² + 0.5814² + 1.9268²] − 0.9377² = 2.3569
• s²_Y = (1/5)[1.5² + 5.0246² + 12.0075² + 9.0452² + 3.9595² + 1.4382²] − 1.1609² = 52.9000

Hence s_X = √2.3569 = 1.54% and s_Y = √52.9000 = 7.27%


Portfolio Y is thus about 5 times more volatile (using the standard deviation measure)
than portfolio X. The (marginally) higher expected (average) return on Y does not
really compensate an investor for the much higher risk he is taking.
Note that we have used (9*) in our volatility calculations, i.e. (9) with n replaced by
(n – 1), as discussed in Section 9. Also we have retained 4 decimal places throughout
the calculations, and rounded the final results to 2 dp.

11. Why is data variation important? Investor returns


Suppose we turn Example 10.1 around by specifying the returns an investor is
seeking, and calculating the final value of his investment after a specified period of
time.
Example 11.1 Two (hypothetical) investors X and Y wish to invest £1000 in a stock
portfolio and are looking for the (monthly) returns shown in Table 11.1 over the next
six months. If these returns actually occur, which investment will be the more
profitable?
Month 1 2 3 4 5 6
Portfolio X 3 3 3 3 3 3
Portfolio Y 2 4 2 4 2 4

Table 11.1: Monthly portfolio returns (%).


Solution Note that the average return of both investments will be 3%. We obtain the
values shown in Table 11.2. For example, after 2 months Portfolio Y will have grown
to £1000*(1 + 0.02)*(1 + 0.04) = £1060.8
(We have retained sufficient decimal places to minimise the effect of rounding errors
in the calculations.)
Time Portfolio X Investment Value (£) Portfolio Y Investment Value (£)
(months)
1 1000*1.03 = 1030 1000*1.02 = 1020
2 1030*1.03 = 1060.9 1020*1.04 = 1060.8
3 1060.9*1.03 = 1092.727 1060.8*1.02 = 1082.016
4 1092.727*1.03 = 1125.50881 1082.016*1.04 = 1125.29664
5 1125.50881*1.03 = 1159.27407 1125.29664*1.02 = 1147.8025728
6 1159.27407*1.03 = 1194.05230 1147.80257*1.04 = 1193.7146758

Table 11.2: Monthly portfolio values (£).
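
The month-by-month compounding in Table 11.2 can be checked with a few lines of code;
the helper function below is a hypothetical sketch, and reproduces the final values of
roughly £1194.05 and £1193.71.

def grow(initial, returns_pct):
    """Compound an initial investment through a sequence of monthly % returns."""
    value = initial
    for r in returns_pct:
        value *= 1 + r / 100
    return value

print(grow(1000, [3, 3, 3, 3, 3, 3]))   # Portfolio X: 1194.05...
print(grow(1000, [2, 4, 2, 4, 2, 4]))   # Portfolio Y: 1193.71...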

Conclusion Observe that, after each two month period (when the average return is
the same on both portfolios), X has a larger value than Y. Although the differences
here are small, they will increase over time. You may care to see what the difference
is after a further six months, assuming the same pattern of investment returns. Also,
the more initially invested the larger the differences will be; with a £1 million
investment the difference in portfolio values after six months will be £337.62.
More important than the actual amounts involved:
• The investment with the larger variation in returns is ALWAYS worth less after
any period of time (provided the average return is the same in all cases).
• The larger the variation the less the investment is worth (subject to the average
return being the same).
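
The algebra behind these assertions is worth a line. If two successive returns average r
but differ from it by ±d, a £1 investment grows by the factor
(1 + r − d)(1 + r + d) = (1 + r)² − d², which is always smaller than the (1 + r)² obtained
from two constant returns of r, and the shortfall d² grows with the spread d. (In Example
11.1, 1.02 × 1.04 = 1.0608 against 1.03² = 1.0609.) Compounding repeats this small loss
over every such pair of periods.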
You may care to investigate these assertions for yourself. In Table 11.3 we give
investment values, computed in Excel, for the three sets of returns shown, with an
initial investment of £1000. Over 20 time periods the line graphs of Fig.11.1 are
obtained. Note in particular how the most variable returns (Investment 3) produce the
lowest final value (by a significant amount), and hence the lowest overall return. In
each case the average return is the same (5%).

Table 11.3: Three sets of investment returns


Fig.11.1: Line graph of investment returns

Thus, in an ideal world, investors would like returns that

• are as large as possible, and
• contain no variation whatsoever, i.e. constant (predictable) returns.

However, in practice, predictable returns (with, for example, AAA bonds) will
invariably produce the lowest returns! To increase returns requires more risk to be
taken, which implies more unpredictable returns (and hence obviously greater
potential variation in returns).
We can now clearly see why data variation, as measured by the standard deviation,
is used as a measure of risk.

12. Using Numerical Measures


Most (quantitative) papers you will read will make reference to a variety of numerical
summary measures for the particular data sets under discussion. We have already
seen examples of this;
• Fig.5.1 of Unit 3 uses the variable “% living in families below 50% of median
family income”.
• Look back at the paper Barwell, R., May, O., Pezzini, S. (2006) The distribution
of assets, incomes and liabilities across UK households: results from the 2005
NMG research survey, Bank of England Quarterly Bulletin (Spring).
Identify the numerical summary measures used.
Now try and find the following article (Bank of England website)
Dobbs, C. (2008) Patterns of pay: results of the Annual Survey of Hours and
Earnings 1997 to 2008


In the first “Key Points” section you will find the following (my highlighting):
• In April 2008 median gross weekly earnings were £479, up 4.6 per cent from
£458 in 2007, for full-time UK employee jobs on adult rates whose earnings
were not affected by absence
• Between 2007 and 2008 the weekly earnings for full-time employees in the top
decile grew by 4.4 per cent compared with a growth of 3.5 per cent for the
bottom decile.
• For the 2007/08 tax year median gross annual earnings for full-time employees
on adult rates who have been in the same job for at least 12 months was
£25,100. For males the median gross annual earnings was £27,500 and for
females it was £21,400
• The stronger growth in full-time men’s hourly earnings excluding overtime
compared with women’s has meant that the gender pay gap has increased to
12.8 per cent, up from 12.5 per cent in 2007. On the basis of mean full-time
hourly earnings excluding overtime, the gender pay gap has increased, from
17.0 per cent in 2007 to 17.1 per cent in 2008.

Read through the paper and note the types of summary measures and graphs used.
Many of these should have been covered in Units 3 and 4; which ones have not?

13. Pivot Tables


Often data can be analysed in a very simple (numerical) fashion by a process of
cross-tabulation; Excel performs the process using “Pivot Tables”. The idea is very
useful when the data is divided into various categories, but data will not always come
in this form. We give an illustration using stock prices; look back at Examples 5.4 and
5.5 of Unit 3 for the construction of histograms and scatter plots of (Apple) stock
data.

Example 13.1 The Excel file IBM_Weekly contains weekly closing IBM stock prices
from 03/01/2000 to 13/12/2007. The data, a small portion of which is shown in
Fig.13.1, was downloaded from Yahoo Finance!

Step 1 Graph the data.


As with any data set, our first step is to graph the series, and a line (time series) chart
is the most appropriate. There are two important observations:


Fig.13.1: Line graph of IBM stock prices

• At the two “endpoints” the series takes roughly the same value, i.e. the stock
has (only) maintained the same price level (over an 8 year time span). We could
look for “economic interpretations” of this either
- in the news released by IBM itself, or
- in general news from the technology sector, or
- in more general economic news
over this time frame. You may care to look into this (Exercise).
• In between times the stock clearly has high and low points. The general
question we would like to ask is the following:

Question 1 “If we held the stock initially (Jan 2000) what investment strategy
should we have adopted to maximise our profits at the end (Dec 2007)?”

Terminology To avoid our discussions becoming unnecessarily complicated (and


long winded) we define S(t) = Stock price at time t.

Step 2 Ups and downs in the data


Our first step in formulating a (retrospective) investment strategy is to find out how
often the stock went up (and down).
• To do this in Excel we merely need to code the simple logic:
If S(t + 1) > S(t) record an “Up” in the stock price, otherwise a “Down”.


(To keep things as simple as possible we are ignoring the possibility of the
stock price having exactly the same value in two successive time periods. Do
you think this is reasonable?)
• The result is shown in Table 13.1. Clearly we want to count how many times the
stock went Up, and how many times Down.

Table 13.1: Stock Up or Down? Table 13.2: Count of Ups and Downs in stock
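
Outside Excel, the same coding logic can be sketched in a few lines of Python; the price
list below is a hypothetical stand-in for the weekly IBM closing prices.

# Label each week "Up" or "Down" by comparing S(t + 1) with S(t).
prices = [112.3, 113.1, 110.9, 111.5, 115.0, 114.2]   # hypothetical weekly closes

moves = []
for s_now, s_next in zip(prices[:-1], prices[1:]):
    moves.append("Up" if s_next > s_now else "Down")

print(moves)   # ['Up', 'Down', 'Up', 'Up', 'Down']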

Step 3 Set up a Pivot Table


Note that our “UpOrDown” data is now of categorical form, i.e. each data value
belongs to one of two categories “Up” or “Down”. Such a situation calls for the use of
a pivot table to make the count.
• The mechanics of setting up pivot tables in Excel are discussed in Practical 4.
In our case the result is Table 13.2.
• The entry (blank) indicates there is an entry which does not fit into either of our
two categories. Can you suggest a reason why this happens?
• We conclude that the stock went Up almost the same number of times it went
Down (202 compared to 214). In fact the stock price falls 51%, and rises 49% of
the time. This is not a very promising situation in which to answer Question 1
(or, at least, formulate a “sensible” investment strategy.)

Comment We could represent the numerical result in Table 13.2 graphically by, for
example, a histogram. Would this be a sensible thing to do? Remember you would
only include a table and a graphical representation in a report if they gave different,
but complementary, information (or possibly different perspectives on the same
information). Look back to the “Summary” advice at the end of Section 5 of Unit 3.
In Step 3 we answered the question
• How often (or what percentage of the time) does the stock price go up?
There is another, more interesting question we can look at:
• If we know the stock price went up last week, what are the chances (probability)
of it going up again this week?

Statistics, Probability & Risk 102


Unit 4: Numerical Summaries of Data

Rationale If this probability is “large” we may be tempted to buy the stock once we
have observed the price increase. This would be a sound investment strategy under
the given circumstances. (We shall not formally meet the idea of “probability” until
Unit 5, but here we just need an “informal idea” of probability as the likelihood/chance
of the stock continuing to rise.)

Step 4 A more detailed look at Ups and Downs.


We shall obtain UpOrDown movements this week and last week. We really want to
obtain two UpOrDown columns as in Table 13.1 (column C), one applying to stock
price movements this week and the second to last week. So we can keep track of
which is which we are going to call the UpOrDown variable UpOrDownThisWeek.
• We can easily set up, using Copy and Paste, Columns D (same as column C)
and E (lagged version of C) in Table 13.3.
• Now we produce a two way pivot table as described in Practical 4; the results
are depicted in Table 13.4. (A cross-tabulation sketch in Python follows this list.)
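The two-way pivot table can be mimicked outside Excel with pandas' crosstab function. Again this is only an optional sketch, with the same assumed file and column names as before.

import pandas as pd

prices = pd.read_csv("IBM_Weekly.csv")["Close"]        # assumed file and column names
this_week = (prices.diff() > 0).map({True: "Up", False: "Down"}).iloc[1:]
last_week = this_week.shift(1)                         # lagged copy, as in column E

moves = pd.DataFrame({"LastWeek": last_week, "ThisWeek": this_week}).dropna()
print(pd.crosstab(moves["LastWeek"], moves["ThisWeek"]))                     # counts
print(pd.crosstab(moves["LastWeek"], moves["ThisWeek"], normalize="index"))  # row proportions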

Table 13.3: Two week changes Table 13.4: Counts of two week changes

Conclusion If the stock went Up last week (which it did 202 times) then it
subsequently (this week) went Up again 48% of the time (97 times) and, of course,
Down 52% of the time. Similarly, if the stock went Down last week (213 times) it
subsequently continued to go down 51% of the time, and went back Up 49% of the
time.
All these percentages are (depressingly) close to 50%, so any strategy to buy, or sell,
the stock based on its previous week's movement seems doomed to failure. Of course
we could look at how the stock behaved over the previous two (or three or ...) weeks
before deciding whether to buy (or sell or hold). You may care to investigate some of
these possibilities using pivot tables.
Whether or not we can predict stock prices is at the heart of the idea of market
efficiency, a concept that is generally phrased in terms of the Efficient Market
Hypothesis (EMH). As you are probably aware, there is a vast literature on this topic
and you cannot go very far in any finance course without meeting it. For an extended


discussion see Brealey, R.A. and Myers, S.C. (2003). Principles of Corporate
Finance, 7th. International Edition. New York, McGraw Hill.
Here we merely state the EMH version relevant to our analysis:

Weak form of EMH: Security prices reflect all information contained in the record of
past prices. (It is impossible to make consistently superior profits by studying past
returns.)
Although we have not really got very close to answering Question 1 (producing an
“optimal” investment strategy), you should be able to appreciate the use of pivot
tables (cross-tabulation) in looking for “patterns” within the data.



5 Probability – Basic Concepts

Learning Outcomes
At the end of this unit you should be familiar with the following:
• Understand how probability is defined and calculated in simple situations.
• Apply the basic probability laws using tables and tree diagrams.
• Understand the concept of a probability distribution.
• Appreciate the idea of a conditional probability distribution, and compute
conditional probabilities.
• Recognise the role of the mean and variance in characterising a probability
distribution.

All possible definitions of probability fall short of the actual practice.


William Feller. An Introduction to Probability Theory and its Applications (1968)

Who cares if you pick a black ball or a white ball out of a bag? If you’re so
concerned about the colour, don’t leave it to chance. Look in the bag and pick
the colour you want.
Adapted from Stephanie Plum (Hard Eight)


1. Introduction
An increasingly important issue is to examine how financial quantities, such as stock
prices, behave. For example we may be interested in answering the following
questions:
• What is the probability my IBM stock will increase in value today?
• What is the probability my IBM stock will increase in value by 1% today?
• What is the probability my IBM stock will increase in value by 1% over the next
week?
• What is the probability my portfolio of stocks will increase in value by 1% over
the next month?
and so on. But before we can answer questions like these we need to look at the
(statistical) language needed to make meaningful (quantitative) statements.

The appropriate language is that of probability. Although you will probably be familiar with the general usage of the term, probability is a surprisingly difficult
concept to define precisely. Indeed there are four main ways in which people have
sought to define probability:
• Relative frequency interpretation
• Symmetry approach
• Subjective probability approach
• Bayesian methodology
None of these approaches is free from logical difficulties, but each has its uses
and highlights different aspects of the subject. The first two approaches go back
hundreds of years, but the last two are of quite recent origin and, more importantly,
are finding increasing application in finance. Probability has a long, interesting and
controversial history and, for further information and references, you should initially
consult http://en.wikipedia.org/wiki/Probability.

Terminology:
To avoid long, and potentially complicated, verbal descriptions we write
P(E) = Probability of the event (outcome) E
Sometimes we may write Pr(E), or P{E} or Prob(E) or something similar.

In general terms we shall interpret the probability of an event as the “likelihood” of the event happening; the larger the probability the more likely an event is to occur.
However there are several ways these ideas can be interpreted. We can derive probabilities, and probability distributions, from data in several ways; in this unit we
shall use a mixture of the following three types of data:
• “Real data” (taken from ONS). This will emphasise that probabilities are not just
theoretical constructs, but are tied firmly to collected data.
• “Simulated data” using Excel’s random number generators. This will let us
easily obtain, in certain well defined situations, as much data as we require; the
latter will let us illustrate concepts of interest.
• “Theoretical data”. This will allow us to keep the calculations as simple as
possible, and let us concentrate on the underlying ideas without worrying too
much about computational details.
Two “simple” examples which incorporate many of the ideas we need are found in
the age old pursuits of coin tossing and dice throwing. Despite their apparent
simplicity such examples contain a great deal of interest and can be used to illustrate
a variety of concepts.

2. Frequency Interpretation of Probability


If we repeat an experiment (in which E may or may not occur) a “large” number of
times N, we define
P(E) = (Number of times E occurs) / (Total number N of trials)   --- (1)

This is the “standard” long run frequency interpretation of probability, and is the most
commonly used idea to define precisely the concept of probability. But to use this
definition we need to perform an experiment (a large number of times).

Example 2.1 : A fair coin is tossed. What is P(Heads)?

“Solution” : Rather than actually tossing a coin we shall simulate the process in
Excel using the built in random number generator. This allows us to generate
“Heads” and “Tails”, with equal probability, as often as we wish. From Table 2.1 we
obtain the two estimates
P(H) = 6/10 = 0.6 and P(H) = 7/10 = 0.7

Table 2.1: Two simulations of 10 coin tosses (generated in Excel with =RAND(), coding Head = 1 and Tail = 0; the two runs gave 6 heads and 7 heads respectively)
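If you would like to repeat the experiment outside Excel, a minimal Python sketch of the same simulation is given below. As with the Excel version, every run produces slightly different estimates of P(H).

import random

random.seed(1)                        # fix the seed so the run can be reproduced
for n in (10, 50, 100, 500):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads, heads / n)        # tosses, number of heads, estimate of P(H)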


"Experimental" Probability (N = 10) "Experimental" Probability (N = 50)

0.7
1
0.6
0.8
0.5

Probability
Probability

0.4 0.6
0.3 0.4
0.2
0.2
0.1
0
0
Prob(H) Prob(T) Prob(H) Prob(T)
H or T H or T

(a) 6 H in 10 throws (b) 28 H in 50 throws


"Experimental" Probability (N = 100) "Experimental" Probability (N = 500)

1 1
0.8 0.8
Probability

Probability
0.6 0.6
0.4 0.4
0.2 0.2
0 0
Prob(H) Prob(T) Prob(H) Prob(T)
H or T H or T

(c) 54 H in 100 throws (d) 235 H in 500 throws

Fig. 2.1: Some results from the simulation of tossing a coin.

Notes: 1. For a discussion of how the results of Fig. 2.1 are obtained see the
spreadsheet Unit5_CoinTossing in the Excel file CoinTossing2.xls. You are asked
to look at these simulations in Question 1 of Practical Exercises 4.
2. Look at the Excel file CoinTossing1.xls and the spreadsheet Proportions for an
empirical discussion of (1); see also Question 3 of Practical Exercises 4.

Discussion of results: There are various points of interest:


• The estimates of P(Heads) are clearly quite variable, especially for “small”
numbers of repetitions.
• Even if we perform a “very large” number of tosses we may never get the
“exact” value of P(Heads). But, as we increase the number of tosses, the
variation in our estimate seems to decrease, and we always end up with a value
“around 0.5”.
• Every time we perform a new simulation we obtain different estimates of
P(Heads). A sensible strategy would be to perform many simulations (with a
large, fixed number of tosses), and average over all the simulations. For more
complex problems this is now a very common technique, especially in finance
(going by the descriptive name Monte Carlo simulation). We look at
simulations, in the context of generating probability distributions, in Unit 7.


A Problem: There is one major (philosophical) flaw with the frequency approach to
probability. In some (many) situations we have no control over our experiment,
which is essentially a one-off event and hence cannot be repeated (and certainly
not many times).
Example 2.2: IBM stock is today worth $100. What is the probability it will be worth
$105 tomorrow?

“Solution”: Here we cannot use (1) since we cannot “repeat” the stock price
movement over the next day “many” times (and observe how often it reaches $105).
The stock price will move of its own accord, and it will assume a single value
tomorrow. Of course this value is unknown today. There appears to be no simple way
to assign a probability to the required event.
In fact there are two approaches we might take:
• Simulate the stock price process “many” times as we did in Example 2.1 for the
coin. Observe how many times the stock reaches the required level of $105 and
use (1). The difficulty is we need a model of the behaviour of the stock price to
do this. In Example 2.1 we used a “random” mechanism to model our coin toss.
(A toy simulation along these lines is sketched after this list.)
• Observe the stock price over the next 100 days (say) and use (1) to assess the
required probability. Of course this will not give us an answer for tomorrow, and
there is the additional problem that, at the start of each new day, the stock will
not start at $100.
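Purely to illustrate the first approach, the sketch below estimates the required probability by Monte Carlo simulation under an entirely artificial model: the daily return is assumed to be normally distributed with mean 0 and standard deviation 2%. The model (and hence the answer) is an assumption made only to show the mechanics, not a claim about how IBM actually behaves.

import random

random.seed(1)
trials = 100_000
# Simulate tomorrow's price 100*(1 + return) under the assumed return model
# and count how often it reaches at least $105.
hits = sum(100 * (1 + random.gauss(0, 0.02)) >= 105 for _ in range(trials))
print(hits / trials)                  # estimated probability under this toy model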

Note 1: There is really another problem concealed here.


• If we observe the stock price in practice it is very unlikely to reach exactly $105.
What we really should be asking for is the probability the stock reaches at least
$105. Indeed when there are a large number of possible outcomes the
probability of any one of them occurring must be very small, and possibly not of
much real interest. Our stock may take the values
100 , 100.01 , 100.02 , 100.03 , ……
and so on. Assuming any particular value tomorrow is not really of great
interest.
• This is very similar to Example 2.1. If we toss the coin 500 times we do not
really want to know the probability of, say, precisely 250 heads, which must be
very small. (We have 501 possible outcomes and the sum of all probabilities
must be one!). We are much more interested in the probability of at least 250
heads.


3. Symmetry Approach to Probability


In some (simple) situations we can assign equal probabilities based on symmetry
considerations:

Example 3.1 : A fair dice is thrown. What is the probability of obtaining a 6?

Solution : There are 6 possible outcomes – we can obtain a 1, 2, 3, 4, 5 or 6.


“Clearly” no outcome is (a priori) to be preferred to any other, so that all outcomes
must have the same probability. Hence
P(1) = P(2) = P(3) = P(4) = P(5) = P(6).

We can now use one of two arguments:

• Using (1): P(6) = (Number of times a 6 occurs) / (Total number of possible outcomes) = 1/6

• Since all the probabilities must sum to one (why?) we conclude P(6) = 1/6

Notes : There are several important ideas here:


• We have really used this type of symmetry argument in “expecting” the result
P(H) = 1/2 in the coin tossing of Example 2.1.

• The argument is known more formally as the “Principle of Insufficient Reason” or the “Principle of Indifference”. It asserts that
“In the absence of any known reason to assign two outcomes different
probabilities, they ought to be assigned the same probability.”
• Outcomes where we can assign equal probabilities are termed (not surprisingly)
“equally likely” outcomes.
• Although not immediately apparent, there are logical difficulties with the above
type of reasoning. If you are interested try, as a starting point,
http://en.wikipedia.org/wiki/Bertrand_paradox_(probability)

A Problem: This symmetry approach is a theoretical one, and will therefore have
nothing to say about probabilities determined by “real world events”:

Example 3.2: What is the probability that IBM goes bankrupt within the next year?

“Solution”: Where do we start?


• Maybe the best we can do is to argue IBM are “equally likely” to become
bankrupt as any other company “in the same market sector”. Even if we could


identify the number of such companies, how do we assign a specific probability of bankruptcy? (Knowing each company will be given the same probability does
not help in deciding what this probability is. In particular all these probabilities
will not sum to one, unless we know some company will go bankrupt.)
• Of course “market forces”, “company structure” and so on will be large
determining factors relating to whether bankruptcy occurs.
• We would need to adopt an historical perspective and see how many
bankruptcies have occurred in the sector over the years. (Nevertheless we may
not expect similar patterns to persist in the future as economic conditions
change)

Notes: Observe the following two points:


• Firms are interested in the credit worthiness of their customers, so we would
like to be able to “solve” Example 3.2.
• Despite the apparent theoretical nature of “equally likely” outcomes it is an
important idea which recurs in many places. We shall meet it again when we
discuss tree diagrams and the binomial distribution.

4. Subjective Probability
Here we start with the idea that sometimes there is no objective way to
measure probability. In this case probability is really the degree of belief held by an
individual that a particular event will occur.

A subjective probability describes an individual's personal judgement about how likely a particular event is to occur. It is not based on any precise computation but is
often a “reasonable assessment by a knowledgeable person”.

Example 4.1 : "I believe that Manchester United have probability of 0.9 of winning
the English Premiership next year since they have been playing really well this year,
and I expect their good form to continue into next season."

Comments: Observe the following points:


• The probability 0.9 is not “derived” from any analysis of the team’s performance
over this year and previous years (“playing well” not being a quantitative
statement).
• This (subjective) probability estimate will, of course, change value from one
person to the next. What then is the “usefulness” of such a probability?


• Despite this difficulty the “degree of belief” idea does allow us to make
probability statements in situations where the other two approaches (based on
frequency and symmetry) may not be applicable.
• One can quantify the subjective probability in terms of odds. Just how much am
I prepared to bet on Manchester United winning the premiership next season at
a given set of odds (quoted by a bookmaker)?

Subjective probability has become of increasing importance in the area of behavioural finance where, for example, investors need to show “good judgement” in deciding which stocks to invest in.

Example 4.2 : IBM stock is today worth $100. What will it be worth next month?

“Solution”: Most investors' views are biased towards optimistic outcomes, and so will
tend to overestimate a stock’s future value. So I believe the stock will fall in price to
$90.

Comments : Note the following points:


• Optimism bias is the demonstrated systematic tendency for people to be over-
optimistic about the outcome of planned actions. For further details and
references see http://en.wikipedia.org/wiki/Appraisal_optimism.
• Many types of biases have been identified in the area of finance, all of which
involve some non-rational behaviour on the part of participants (individuals,
markets and so on).
• A central feature of most of the “classical finance” literature (corporate finance,
portfolio theory, options etc.) is that rational behaviour applies. For example,
given two possible projects, a firm will invest in the one giving the higher return
(profit) – to do otherwise would not be rational. (Assuming there are no other
mitigating factors, such as environmental issues.)
• Behavioral finance is largely concerned with analysing whether rational
behaviour applies in the financial markets and, if it does not, seeking to provide
possible explanations.

You may like to look at Barberis, N., Thaler, R. (2002) : A Survey of Behavioral
Finance, available at http://badger.som.yale.edu/faculty/ncb25/ch18_6.pdf.
This is a good, relatively recent, review of the literature and is quite readable, but
goes far beyond the limits of the course.


A SHORT QUESTIONNAIRE – PLEASE COMPLETE NOW

Q1. If the London Stock Exchange general index has increased on each of the
past 3 days, what is the probability that it will increase in value today as
well?
Probability =

Q2. If the London Stock Exchange general index has decreased on each of the
past 3 days, what is the probability that it will decrease in value today as
well?
Probability =

Q3. If you look at the London Stock Market today in your opinion it is
(choose one alternative):
1. Overvalued by __________ %
2. Undervalued by __________ %
3. Valued at a fundamentally correct level.
4. Cannot say whether it is fairly valued or not.

Q4. If the London Stock Exchange general index is valued at 6000 today, what
do you think will be its value in 6 months time?
Value in 6 months time __________

Q5. Assume the following situation. During the last 2 years the stock of a certain
company has risen by 60%, and the future for the stock looks bright. How do
you value this information?
1. The stock is worth buying.
2. The information is not sufficient to decide on buying the stock.
3. The stock is not worth buying.


5. Bayesian Methodology
An inevitable criticism of the subjective probability approach lies precisely in its
subjective nature, with different “answers” from different people. To improve upon
this situation the “Bayesian approach” allows probability estimates to be “updated”
as new information becomes available.

Example 5.1
(a) You have a coin in your hand. What is your “best estimate” of P(H), the
probability of obtaining a head on any toss?
(b) Your coin is tossed 10 times and 3 heads result. Now what is your best estimate
of P(H)?
(c) 10 further tosses give 6 heads. Now what is your best estimate of P(H)?

Solution:
(a) Without performing any experiment (tossing the coin), the best we can do is to
invoke the “Principle of Indifference” of Section 3 and conclude
P(H) = 0.5
(b) Clearly we should use the proportion of heads obtained as an estimate of the
required probability: P(H) = 3/10 = 0.3.
(Using the proportion is the “rational thing to do”, but we can give formal arguments
to justify this choice.)
(c) We could again use the proportion of heads obtained as an estimate, i.e.
P(H) = 6/10 = 0.6.
The “Bayesian point of view” suggests that we can improve on this estimate by
using any previous knowledge we have. Here we can argue that we already have an
estimate of P(H) from (b) and we can average this estimate with the current one, i.e.
P(H) = (0.3 + 0.6)/2 = 0.45
(We can take the simple average since both estimates in (b) and (c) are based on the
same sample size/number of tosses. If this were not the case we would take a
weighted average, weighted by the respective sample sizes.)
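A few lines of Python (an optional aside) make the pooling arithmetic explicit; with equal batch sizes the weighted average reduces to the simple average of 0.45 used above.

heads = [3, 6]            # heads observed in the two batches of tosses
tosses = [10, 10]         # size of each batch

estimates = [h / n for h, n in zip(heads, tosses)]        # 0.3 and 0.6
pooled = sum(heads) / sum(tosses)                         # 9/20 = 0.45
weighted = sum(e * n for e, n in zip(estimates, tosses)) / sum(tosses)
print(estimates, pooled, weighted)    # the weighted average equals the pooled estimate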

Comments : Observe the following:


• This “Bayesian” estimate does in fact agree with what the “frequency approach”
of Section 2 would give. For, combining (b) and (c), we can argue we have


tossed the coin 10 + 10 = 20 times, and obtained 3 + 6 = 9 heads. The proportion of heads then gives the estimate P(H) = 9/20 = 0.45.
• The Bayesian approach is based on the idea of “updating estimates as more
information is received” and, in general, does not provide the same results as
the frequency interpretation.
• Bayesian ideas have assumed a position of great importance in statistics in the
last 20 years, and are finding their way into the financial literature. Whilst there
are no really elementary books, a good starting point is
Rachev, S.T., Hsu, J.S.J., Bagasheva, B.S., Fabozzi, F.J. (2008). Bayesian
Methods in Finance: Wiley.
This material goes well beyond the level of this course, but does highlight some
current finance trends and important research areas.

6. Probabilities Derived From Data


Data Source We will look at unemployment statistics taken from the ONS website;
look back at Section 9 of Lecture Unit 2 for details. We shall look at a portion of data
derived from Table 9.1 (Unit 2), and shown in Table 6.1.

Table 6.1(a): Unemployment data (in ‘000s) – “middle age” range

Table 6.1(b): Unemployment data (in ‘000s) – “extreme age” range


Comment The zeros in the table probably just indicate a value less than 500 – Why?
We need to bear this in mind when assessing the accuracy of our computed
probabilities below. See Lecture Unit 2 Section 8.2.

Question 1 What is the probability of being unemployed for less than 6 months,
during the period Nov 2007 – Jan 2008, if you are a male aged 18-24?

Solution
Number of males unemployed for <6 months, during Nov 2007 – Jan 2008, in the
age range 18-24 = 332
Total number of (economically active) males in this age range, during Nov 2007 –
Jan 2008 = 4210
Using the (frequency) definition (1)
P(unemployed < 6 months under given conditions) = 332/4210 = 0.079
(Equivalently about 8% of the males aged 18-24 are unemployed during Nov 2007 –
Jan 2008.) You should think of the number of decimal places we can quote here.
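As a quick check of the arithmetic (remembering that the table entries are in thousands), a two-line calculation gives the same figure:

unemployed_under_6_months = 332       # in thousands, from Table 6.1(a)
economically_active = 4210            # in thousands

print(round(unemployed_under_6_months / economically_active, 3))   # 0.079, roughly 8%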

Comment Clearly we can repeat the above calculation to change all the frequencies
in Table 6.1 into probabilities. The results for Table 6.1(a) are shown in Table 6.2,
and we leave it as an exercise for the reader to obtain the corresponding results for
Table 6.1(b). We have quoted probabilities to 4 decimal places, although three
places are more realistic (why?).

Question 2
What is the probability of being unemployed, during Nov 2007–Jan 2008, if you are a
male aged 18-24?

Table 6.2: Unemployment probabilities – “middle age” range.

Solution If we go back to Table 6.1(a) we argue, as in Question 1, that
Total number of unemployed males in this age range = 332 + 71 + 59 + 40 = 502


Total number of males in this age range = 4210


Hence P(unemployed if you are male aged 18-24) = 502/4210 = 0.119
(Equivalently about 12% of the males aged 18-24 are unemployed.)

Comment Instead of adding the frequencies in Table 6.1(a), we can add the
probabilities in Table 6.2. Explicitly
P (unemployed if male aged 18-24) = 0.0789 + 0.0169 + 0.0140 + 0.0095 = 0.1192
Can you see why this works? In more general terms we would like to know how to
combine probabilities of certain events to produce probabilities of “more
complicated” events. We do this in the next section.

Question 3 Is your probability of being unemployed dependent on your age?

“Solution” Our data in Table 6.1, where ages are split between 4 “sub-tables”, is not
best suited to answering this question. A better alternative is to “collect ages
together”, at fixed time intervals, as shown in Table 6.3. (As an exercise you should
produce the table for May-Jul 2008).
We look at a more detailed solution to Question 3 after we have seen how to
combine probabilities. Here we just note the following male probabilities for Feb-Apr:
• Age range 16-17: P(unemployed < 6 months) = 76/367 = 0.207 (21%)
• Age range 18-24: P(unemployed < 6 months) = 337/4205 = 0.080 (8%)
• Age range 25-49: P(unemployed < 6 months) = 187/9718 = 0.019 (2%)
• Age range 50+: P(unemployed < 6 months) = 61/4606 = 0.013 (1.3%)
We can clearly see these probabilities decrease with increasing age. Is this what you
would expect?

Table 6.3: Unemployment figures across age ranges with time (interval) fixed.


7. Probability Laws
The following gives a more “formal structure” to probability manipulations. For a
general discussion we assume we have two “events”, conveniently labelled A and B,
with known probabilities of occurrence, and we do not enquire where these have
come from. These known probabilities apply to “simple events” and we want to
combine these values to determine the probabilities of “compound events” (which
are formed as combinations of the simple events).

7.1 Summary of Rules/Laws


The important rules/laws that enable one to compute with probabilities are given in
Table 7.1 below. There are various other laws, often quoted in textbooks, but these
are used much less often and usually only in more specialised applications. A
working knowledge of the contents of Table 7.1 is almost always sufficient.
Before discussing what these rules mean, and how to compute with them, it is useful
to look at how these can be remembered (although the important multiplication law
has no simple visual interpretation). A good (brief) summary is provided at
http://en.wikipedia.org/wiki/Probability. Here you will also find a set of references and
links dealing with theoretical and practical issues relating to probability (though not
finance based). This is a good place to start if you are looking for material suitable for
further study.

Event Law/Rule Name of Law/Rule


A 0 ≤ P(A) ≤ 1 Definition
not A P(not A) = 1 – P(A) Complementary event
A or B P(A or B) = P(A)+P(B)-P(A and B) Addition law
A or B P(A or B) = P(A) + P(B) Addition law-mutually exclusive events
A and B P(A and B) = P(A | B) x P(B) Multiplication law - general version
A and B P(A and B) = P(A) x P(B) Multiplication law – independent events
A | B P(A | B) = P(A and B) / P(B) Definition of conditional probability

Table 7.1: Important Probability Rules

7.2 How to Remember the Rules/Laws


This is most easily done using a pictorial representation of the sample space S,
which just contains all the possible outcomes of the experiment. There are several
(equivalent) ways of doing this, depending on one's personal preference.


Fig 7.1: Possible Sample Space Representations

An event A of interest will usually be a subset of the sample space S, i.e. just part of
S. (The terminology “subset” arises from the fact that we can phrase “events” in
terms of “sets”, and use set theory to discuss everything. We shall not do this, but a lot
of books do in order to give more formal proofs of general results.)

Fig 7.2: An event A of interest Fig 7.3: An event A and its complement (not A)

Fig 7.4: Two events A and B: (a) general events; (b) mutually exclusive events

Notes: In these pictures (sometimes termed Venn diagrams) we interpret areas as measuring probabilities.
(This ties in with our “area interpretation” of histograms in Unit 3.) We can use this
idea to “derive” the results in Table 7.1.
• We “normalise” the total area within our sample space to be 1. Then:
- Since A is a subset of S, P(A) = Area within A ≤ Area within S = 1.
- Obviously areas are non-negative, so P(A) ≥ 0
• Since S is composed of A and “not A” (S = A U (not A) in set theory notation)
P(A) + P(not A) = P(S) = 1 ⇒ P(not A) = 1 – P(A) (see Fig.7.3)

• P(A or B) = Area inside A or B (see Fig.7.4(a))


= Area inside A + Area inside B – Area inside both (Why?)
so P(A or B) = P(A) + P(B) – P(A and B)


• If A and B cannot both happen (at the same time)


P(A and B) = 0 and P(A or B) = P(A) + P(B) (see Fig.7.4(b))
Such events are termed mutually exclusive, the occurrence of one precluding
(preventing) the occurrence of the other.

Unfortunately there is no really simple way to visualise the multiplication law.


However, when the events are independent (the occurrence of one not affecting the
occurrence of the other) the rule is very simple:
To obtain P(A and B) multiply the two individual probabilities P(A) and P(B)
(Independence is really quite a difficult concept, and we shall examine it in more
detail in Section 9 when we look at conditional probability.)

An Important Observation: In practice, we are usually not dealing with just two
events but with many more. It is then far simpler to deal with mutually exclusive or
independent events, since the addition and multiplication laws extend very simply.
For events A1 , A2 , A3 , ... , An:
• if the events are mutually exclusive, P(A1 or A2 or A3 or ... or An) = P(A1) + P(A2) + P(A3) + .... + P(An) --- (1)
• if the events are independent, P(A1 and A2 and A3 and ... and An) = P(A1)P(A2)P(A3).....P(An) --- (2)

7.3 Using the Addition and (Simple) Multiplication Laws


Direct (formal) use of the addition and multiplication laws in Table 7.1 is illustrated in
the following example.

Example 7.1 A particular industry currently contains 100 firms and, from past
records, it is estimated the probability of any particular firm going bankrupt (within the
next year) is 5%. Determine the probability that:
(a) No firm goes bankrupt
(b) At least one firm goes bankrupt.
(c) Exactly two firms go bankrupt
What assumptions are you making in your calculations?

Solution Formally we define the (100) events


A1 = event Firm 1 goes bankrupt ; A2 = event Firm 2 goes bankrupt (and so on).
• Then we are given P(A1) = P(A2) = ..... = P(A100) = 0.05


(Note we have converted the 5% into the proportion 0.05 since, from Table 7.1, probabilities must lie between 0 and 1.)
• Now we can also write
P(not A1) = P(not A2) = ..... = P(not A100) = 1 - 0.05 = 0.95
as the probability that an individual firm does not go bankrupt.

(a) P(no firm goes bankrupt) = P(Firm 1 does not go bankrupt and Firm 2 does not
go bankrupt and ..... and Firm 100 does not go bankrupt)
= P(not A1 and not A2 and not A3 and ... and not A100)
= P(not A1)P(not A2)P(not A3).....P(not A100)
assuming the events are independent. Hence
P(no firm goes bankrupt) = 0.95*0.95* ... *0.95 = 0.95¹⁰⁰ = 0.0059

You will need a calculator (or Excel) here. Note that, although each individual
probability is close to 1, the product is very close to zero. With a 5% chance of any
firm going bankrupt, there is only a 0.59% chance of no firm going bankrupt.

(b) If we define the event A = No firm goes bankrupt


then not A = At least one firm goes bankrupt
is precisely the event we are interested in. From Table 7.1
P(At least one firm goes bankrupt) = P(not A) = 1 – P(A) = 1 – 0.0059 = 0.9941
using the result in (a). Clearly if there is a 0.59% chance of no firm going
bankrupt there will be a 99.41% chance of some firm going bankrupt.

(c) This is a bit trickier. Given two specific firms, say F2 and F7, we can easily
compute, again using independence,
P(F2 and F7 go bankrupt) = P(A2 and A7) = P(A2)P(A7) = 0.05*0.05 = 0.0025
However, this is not the probability we want for two reasons:
• If we want exactly two bankruptcies, in addition to F2 and F7 going
bankrupt, we require the remaining 98 firms do not go bankrupt (for
otherwise we would have more than two bankruptcies). Thus
P(only F2 and F7 go bankrupt) = P(A2 and A7 and not A1 and not A3 and .....)
= P(A2)*P(A7)*P(not A1)* P(not A3)* ......
= 0.05² * 0.95⁹⁸ = 0.0000164


(Even though we have not explicitly written down all the events we can “clearly”
see what the corresponding probabilities must be!)
• We also have arbitrarily selected the firms F2 and F7 to go bankrupt. We can see
that, no matter which two we select, the same probability 0.0000164 will result.
So we need to decide in how many ways we can select the two firms. Often
such counting problems can be difficult, but here there is a simple solution.
- The first firm can be chosen in any one of 100 ways, and the second firm
in any of 99 ways. We multiply these numbers together and divide by 2 (to
avoid counting choices such as {F2 and F7} and {F7 and F2} as different).
- Number of possible choices = 100*99/2 = 4950. (See Unit 6, Section 2).
- We must now add our computed probability together 4950 times. Can you
see the addition law at work here?
P(exactly 2 bankrupt firms) = 4950*0.0000164 = 0.081
There is thus about an 8% chance of exactly two firms going bankrupt.
Note The solution we have given is far longer than you would normally give.
Typically you should make your calculations clear, but not necessarily explain
all the logical steps you have taken in computing the various probabilities.
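The arithmetic of parts (a) to (c) can be checked with a few lines of Python; the sketch simply reproduces the calculation above under the same independence assumption (it is not intended as a realistic bankruptcy model). The comb function counts the 4950 possible pairs of firms.

from math import comb

n, p = 100, 0.05                                        # number of firms, bankruptcy probability

p_none = (1 - p) ** n                                   # (a) no firm goes bankrupt
p_at_least_one = 1 - p_none                             # (b) at least one bankruptcy
p_exactly_two = comb(n, 2) * p**2 * (1 - p)**(n - 2)    # (c) exactly two bankruptcies

print(round(p_none, 4), round(p_at_least_one, 4), round(p_exactly_two, 3))
# approximately 0.0059, 0.9941 and 0.081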

Assumptions We have made two important simplifying assumptions:

• In using the same bankruptcy probability (0.05) for each firm we are assuming
the firms are all alike in this respect. In practice a firm builds up stock, finances,
reputation and so on over the years. If a firm continues to survive its “survival
probability” will increase, and its “bankruptcy probability” will decrease. Our
probabilities of 0.05 should really be firm dependent, and also time dependent.
• We have assumed bankruptcies occur independently of one another. But,
especially in the same industry, economic factors will usually imply one
bankruptcy will result in further bankruptcies. The probability of a bankruptcy will
increase once one has occurred.

Example 7.2 We return to Question 3 of Section 6 and, in particular, to Table 6.3.


Specifically we concentrate on the time period Feb-Apr 2008 and calculate the
following probabilities:
(a) Being unemployed if you are a male in each of the age ranges 16-17, 18-24, 25-
49 and 50+.
(b) Being unemployed if you are a female in the above age ranges.
(c) Being unemployed < 6 months if you are male.
(d) Being unemployed < 6 months if you are female.


Solution We can turn Table 6.3 into one involving probabilities by simple division
(using “economically active” as the denominator), as we did with Table 6.2.

Table 7.2: Unemployment probabilities across age ranges with time (interval) fixed.

(a & b) Since each of the unemployment intervals are disjoint the corresponding
events are mutually exclusive. For example let
A1 = Unemployed < 6 months ; A2 = Unemployed 6-12 months
A3 = Unemployed 12-24 months ; A4 = Unemployed > 24 months
and A = Unemployed
The events A1 –A4 are mutually exclusive (amongst each other), and
A = A1 or A2 or A3 or A4
Then P(A) = P(A1 or A2 or A3 or A4)
= P(A1) + P(A2 ) + P(A3) + P(A4)
The appropriate sums, for each age range, and for males and females are given in
Table 7.2 for two of the time periods.

Comments We may note the following:


• Unemployment probabilities decrease as age increases for both men and
women. Is this what you would expect?
• Corresponding probabilities for men and women are different. For example, if
you are aged 18-24 you are more likely to be unemployed if you are male
(0.122) rather than female (0.103). This implies that, in all age ranges, the
events
B = Unemployed = {< 6, 6-12, 12-24, > 24} months
and C = Gender = {M, F}
are not independent. Therefore we cannot write, for example,


P(Male and Unemployed < 6 months) = P(Male)*P(Unemployed < 6 months)


Can you see how to work out each of these probabilities?

(c & d) Since each of the age ranges are disjoint the corresponding events are
mutually exclusive. For example let
B1 = Unemployed < 6 months in age range 16-17
B2 = Unemployed < 6 months in age range 18-24
B3 = Unemployed < 6 months in age range 25-49
B4 = Unemployed < 6 months in age range 50+
and B = Unemployed < 6 months.
The events B1 –B4 are mutually exclusive (amongst each other), and
B = B1 or B2 or B3 or B4
Then P(B) = P(B1 or B2 or B3 or B4)
= P(B1) + P(B2 ) + P(B3) + P(B4)

= 0.207 + 0.080 + 0.019 + 0.013 = 0.319 (males)
= 0.168 + 0.074 + 0.035 + 0.014 = 0.291 (females)
Again we may note the probability is different for males and females. We would find
the same were true for the other age ranges (Exercise).

Summary Most applications of the probability laws require us to either add or


multiply probabilities, with the additional restriction that (some) probabilities must add
to one. Simple as this may appear, we must always be careful that the logic we are
using is correct; if it is not, our computations will be incorrect. Unfortunately probability is an
area where errors in logic are all too common, and often difficult to spot. Particular
attention should be paid to deciding whether events are mutually exclusive,
independent or equally likely; there is often “no recipe” for deciding this!

8. Tree Diagrams
Often a very convenient way to implement the addition and multiplication laws of
probability is to draw a so-called tree diagram. This gives a pictorial representation
of some, or all, possible outcomes, together with the corresponding probabilities.

Example 8.1 We have an initial investment of £1000 that, over any period of 1 year,
has one of two possible behaviours: an increase of 10% or a decrease of 10%.
These two outcomes occur with equal probability.


Determine all possible investment outcomes over a 3 year period.

Solution We explain the calculations in several stages:


Step 1 Compute all possible investment prices at the end of 3 years.
These are depicted in Fig.8.1 and summarise the following calculations:
• If investment increases in Year 1, Investment value = £1000*1.1 = £1100
• If investment decreases in Year 1, Investment value = £1000*0.9 = £900
• If, at the end of Year 1, the investment has value £1100 then
- If investment increases in Year 2, Investment value = £1100*1.1 = £1210
- If investment decreases in Year 2, Investment value = £1100*0.9 = £990
• If, at the end of Year 1, the investment has value £900 then
- If investment increases in Year 2, Investment value = £900*1.1 = £990
- If investment decreases in Year 2, Investment value = £900*0.9 = £810
• Similarly there are 4 possible investment values at the end of Year 3 as shown.

Fig 8.1: Tree diagram comprising all possible investment outcomes (£).

Note In the financial jargon the tree of Fig.8.1 is termed recombining. This means
that, for example, although the Year 2 value of £990 can be reached in two different
ways, the actual value attained is the same, i.e.
£1000*1.1*0.9 = £1000*0.9*1.1
This happens because our probabilities (0.5) do not vary with time.


Step 2 Assign probabilities of investment values


These are depicted in Fig.8.2 where we have merely assigned a probability of 0.5 to
each branch of the tree. We have removed the investment values to emphasise that
the probabilities apply to each “small branch” only, and not to any “large branch” in
the tree; see Fig.8.3.

Fig 8.2: “Local” probabilities Table 8.1: “Global” probabilities

Step 3 Compute probabilities of final investment values.


All possible distinct paths in the tree diagram are shown in Fig.8.3. Note:
• There are 8 possible paths (since at each of the 3 stages we have 2 choices).
• Not all these paths lead to different final outcomes (investment value). In fact
- A final value of £1089 can occur in 3 possible ways.
- A final value of £891 can occur in 3 possible ways.
- Final values of £1331 or £729 can each occur in just 1 way.
• All 3-step paths have the same probability 0.125 of occurrence.

Fig 8.3: All possible distinct paths in the tree diagram (panels (a) to (h) show Paths 1 to 8)

• This last observation enables us just to count paths to obtain the desired
probabilities. Since, for example, the final investment value £1089 can be
achieved in 3 different ways, and each way (path) occurs with probability 0.125,
P(investment value = £1089) = 3*0.125 = 0.375
In this way we end up with the probabilities shown in Table 8.1, and displayed
as a histogram in Fig.8.4(b). (The short enumeration sketch below reproduces these counts.)
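The path counting can be checked by brute force: the sketch below (an optional aside) enumerates all 2³ = 8 equally likely paths and tallies the final investment values.

from itertools import product
from collections import Counter

values = Counter()
for path in product([1.1, 0.9], repeat=3):      # up 10% or down 10% in each of 3 years
    final = 1000.0
    for factor in path:
        final *= factor
    values[round(final)] += 1                   # round away floating-point noise

for value, n_paths in sorted(values.items()):
    print(value, n_paths, n_paths / 8)          # final value, number of paths, probability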

Summary of Calculation

Fig 8.4: (a) Investment paths (b) Investment Probabilities (as histogram)

Comment The device of counting paths in Step 3 has “hidden” the explicit use of the
addition and multiplication laws. For example we can write
Investment value in Year 3 = £1089


in the equivalent form


Value in Year1 = £1100 and Value in Year2 = £1210 and Value in Year3 = £1089
or Value in Year1 = £1100 and Value in Year2 = £990 and Value in Year3 = £1089
or Value in Year1 = £900 and Value in Year2 = £990 and Value in Year3 = £1089

• Each row of this description corresponds to a particular branch in the tree, over
which we multiply probabilities (indicated by the “and”). With equal probabilities
each row gives the same product 0.5³ = 0.125. (Note we are implicitly assuming
independence of returns each year – is this reasonable?)
• The different rows correspond to various distinct branches in the tree, over
which we add probabilities (indicated by the “or”). The simple addition law
applies since events on separate branches are mutually exclusive (why?). The
number of branches determines the number of terms in the sum.

Example 8.2 For the 3-year returns in Example 8.1 determine the
(a) mean and (b) standard deviation.

Solution (a) Look at the probability distribution of returns in Table 8.1. We treat this
like a frequency table and compute the mean return as (see Unit 6 Section 6 eq. 5)

x̄ = (1/n)[f1x1 + f2x2 + f3x3 + .... + fnxn] = (1/n) Σ fixi   (summing over i = 1, ..., n)

If we interpret the relative frequency fi/n as a probability pi we obtain the result

x̄ = Σ pixi   ---- (3)

i.e. we weight each possible return by its corresponding probability of occurrence.


Then Mean return (£) = 729*0.125 + 891*0.375 + 1089*0.375 + 1331*0.125
= £1000
(b) Recall that the variance measures the average squared deviation from the mean
(the standard deviation being its square root). If we interpret the relative frequency
fi/n as a probability pi, eq. (10) in Unit 4 Section 8 gives the important result

s² = Σ pi(xi – x̄)²   --- (4)   (summing over i = 1, ..., n)

Here s² = 0.125*(729 – 1000)² + 0.375*(891 – 1000)² + 0.375*(1089 – 1000)² + 0.125*(1331 – 1000)² = 30,301
as you can verify. Hence, taking square roots,
Standard deviation of returns = £174.07


Thus, although the mean return will be £1000 there is considerable variation in the
actual returns (as specified in Table 8.1). We shall see how to interpret the precise
value of £174 in Unit 6, where we shall also discuss more fully the idea of a
probability distribution, and its description in terms of the mean and standard
deviation.
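As an optional check, formulas (3) and (4) can be evaluated directly from the probabilities in Table 8.1:

values = [729, 891, 1089, 1331]
probs = [0.125, 0.375, 0.375, 0.125]

mean = sum(p * x for p, x in zip(probs, values))
variance = sum(p * (x - mean) ** 2 for p, x in zip(probs, values))
print(mean, variance, variance ** 0.5)          # 1000.0, 30301.0 and about 174.07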

9. Conditional Probability
Example 9.1: We now come to a very important, although quite subtle, idea. We
return to Table 7.2 (or Table 6.3) and try to compare the 18-24 and 25-49 age groups
in terms of time unemployed. The difficulty is that the two age groups are not directly
comparable since they have different sizes (as measured by the “economically
active” values in Table 6.3). This is reflected in the probabilities in Table 7.2 adding to
different values.
The solution is to “normalise” the age categories so that their sum is the same. In
addition, by making this sum equal to 1, we ensure each set of probabilities define a
probability distribution. This is simply accomplished on dividing each row entry by
the corresponding row sum. In Table 9.1 we give all the (eight) probability
distributions that result from the left hand data in Table 7.2. In Fig. 9.1 we have
plotted three of these distributions separately, to emphasise that individually they
form a probability distribution. In addition we have plotted all three on a single
histogram which allows for easier comparisons. Observe the vertical (probability)
scale is (roughly) constant throughout.

Table 9.1: Conditional probability distributions defined from Table 7.2 (each row defines a conditional probability distribution)


Fig 9.1: Histograms of conditional probability distributions (Row sums = 1)

Notation and Terminology: As we have seen, the computation of conditional distributions is very straightforward. However, the notation used tends to confuse, really just because of its unfamiliarity. The four probabilities in the “18-24” age group are applicable when we are dealing with males in the age group 18-24. To express this compactly, with the minimum of words, we use the notation of Example 7.2 applied to males, i.e.
A1 = Unemployed < 6 months ; A2 = Unemployed 6-12 months
A3 = Unemployed 12-24 months ; A4 = Unemployed > 24 months
In addition we set
C1 = male in age group 16-17 ; C2 = male in age group 18-24
C3 = male in age group 25-49 ; C4 = male in age group 50+
Then our 4 probabilities translate into the following statements:
P(A1 | C2) = 0.6569 ; P(A2 | C2) = 0.1365 ; P(A3 | C2) = 0.1248 ; P(A4 | C2) = 0.0819 .
For example, the first statement is read as “the probability of being unemployed < 6
months, given you are a male in the age group 18-24, is 0.6569”.
The general notation P(A | B) means the probability of A occurring given that B has
already occurred.


Example 9.2: There is nothing special about normalising the row sums to be 1. We
can equally well make the column sum add to 1, and this defines yet further
conditional probability distributions.
To make this clear we return to the left table in Table 7.2, and split this into males
and females. Evaluating the column sums gives Tables 9.2 and 9.3.

Interpretation: As an example, the first entry 0.6477 in Table 9.2(b) gives the
probability of being in the age group 16-17 given that you have been unemployed for
< 6 months. In the symbolism of Example 9.1 above P(C1 | A1) = 0.6477.

Note: 1. P(C1 | A1) = 0.6477 ≠ P(A1 | C1) = 0.7917 (from Table 9.1)
Thus the symbols P(A | B) and P(B | A) are, in general, not interchangeable, and we
must be very careful in our use of conditional probabilities.

Table 9.2: Male data from Table 7.2

(a) Not normalised (b) Normalised (column sums = 1)

Table 9.3: Female data from Table 7.2

(a) Not normalised (b) Normalised (column sums = 1)

2. Histograms of the male and female conditional distributions in Table 9.2(b) and
9.3(b) are given in Fig.9.2. Here we have superimposed the four unemployment
categories together in a single histogram for males and females. You should note
carefully the different axes labels in Figs.9.1 and 9.2, and how these relate to the
corresponding conditional distribution.


Fig 9.2: Histograms of conditional probability distributions (Column sums = 1)

Comment Conditional probability distributions are of fundamental importance in


many areas of finance, and the idea of “conditioning” on an event is very common.
Remember that conditional distributions just arise when we concentrate on part of the
(whole) probability space, and make this part into a probability distribution in its own
right (by ensuring the appropriate probabilities sum to one). The required division
gives rise to the important formula (see bottom entry of Table 7.1):
P(A | B) = P(A and B) / P(B)   --- (5)
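The row and column normalisations behind Tables 9.1 to 9.3 are easy to reproduce for any two-way table of counts. The sketch below uses purely illustrative numbers (not the ONS figures), so only the method, and not the values, should be compared with the tables above.

import pandas as pd

# Purely illustrative counts - NOT the ONS data from Table 7.2.
counts = pd.DataFrame(
    {"<6 months": [80, 60], "6-12 months": [15, 25], ">12 months": [5, 15]},
    index=["Male", "Female"],
)

row_conditional = counts.div(counts.sum(axis=1), axis=0)   # each row now sums to 1
col_conditional = counts.div(counts.sum(axis=0), axis=1)   # each column now sums to 1
print(row_conditional)
print(col_conditional)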

10. Expected return and risk


We return to Example 3.1 of Unit 1 and, for ease of reference, repeat the problem.
Example 10.1 I have a (well mixed) bag of 100 counters, 25 red and 75 black. You
are invited to play the following game, with an entrance fee of £50. A counter is
picked at random. If the counter is black you win £100, otherwise you win nothing.
Should you play?
Solution If we let
B = event a black counter is drawn and R = event a red counter is drawn
Then using our basic result (1) we compute
P(B) = (Number of ways a black counter can be drawn) / (Total number of counters) = 75/100 = 0.75

and hence P(R) = 0.25


We have a much greater chance of choosing a black ball, and hence winning £100.
• Adopting our frequency interpretation of probability, these probabilities indicate
that, if we imagine repeatedly drawing a counter, then 75% of the time we will
win £100, and 25% of the time we will win nothing.
• To quantify this we calculate the expected winnings. This is just the mean value
of our winnings, computed using (3), but the “expected value” terminology is
used since it is more descriptive. We obtain
Expected winnings = £100*0.75 + £0*0.25 = £75
• Of course we have a cost, of £50, which we must pay to enter the game, so
Expected profit = £75 - £50 = £25
Conclusions At first sight we should “clearly” play the game. However, our expected
profit is what we can “reasonably expect” to win in a long run series of games (unless
we run into a lot of “bad luck”). But we only expect to play the game once.
• Under these circumstances (one play of the game only) we have to decide
individually whether we are prepared to take the risk of losing.
• Different people will have different “risk appetites”. If you are risk averse, as
most of us are, you may well be unwilling to take the risk (of losing). But if you
are a risk taker you may well take the opposite view.
• In addition your considerations will undoubtedly be complicated by a
consideration of your current financial wealth. If you have £100,000 in the bank
you are probably not unduly worried about the risk of losing £50. But, if you only
have £100 in the bank, £50 is a big investment to make.

To answer the question posed in Example 10.1: whether you should play or not
depends on your risk profile, i.e. your attitude to risk.
The basic moral from this example is that, although one can compute probabilities,
and in a variety of ways, the resulting number may not be the determining factor in
any investment strategy you may adopt (although the probability value will play a
part). How one assesses the risks involved may well be a “behavioural finance
issue”, beyond the realms of mere definitions of probability. Feller’s quote at the
beginning of this unit reflects precisely this view.
We can reinforce this latter view by actually computing the risk in Example 10.1 using
the standard deviation (or its square the variance) as a risk measure, as we
advocated in Unit 4 Section 10. Using (4)
Variance of winnings = £100²*0.75 + £0²*0.25 – £75² = 1875 (in units of £²)

So standard deviation of winnings = √1875 = £43.3
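As a quick check of these figures:

winnings = {100: 0.75, 0: 0.25}       # prize (in £) and its probability

mean = sum(x * p for x, p in winnings.items())                    # expected winnings: 75
variance = sum(p * x**2 for x, p in winnings.items()) - mean**2   # 1875
print(mean, mean - 50, variance, variance ** 0.5)                 # winnings, profit, variance, sd (about 43.3)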


What does this value actually tell us about the risk involved? Do we judge it to be
“large” or “small” relative to the expected winnings?

Question If we work with the profit (winnings – entrance fee) what value do we
obtain for the standard deviation?

11. Assessing the mean and variance


In Example 10.1 we assessed the risk (variance) and expected return (mean) using
given parameter values (easily computed probabilities). Very interesting results arise
if people are asked to estimate risk and expected return from essentially a graphical
probability distribution. Here we briefly describe results obtained by
Lopes, L. A. (1987). Between Hope and Fear: The Psychology of Risk. Advances in
Experimental Social Psychology, 20, 255-295.

Fig 11.1: Six probability distributions expressed as lotteries (panels (a) to (f)).


In Fig.11.1 six lotteries are depicted. Each lottery has 100 tickets, represented by
the tally marks. The values at the left give the prizes that are won by tickets in that
row. For example, in lottery (a), only one ticket wins the highest prize of $200, two
tickets win the next highest prize of $187 and so on.

Exercise You must decide which lottery you wish to participate in. In fact you are
required to order the six lotteries in the order you would most like to participate, on
the assumption that you are risk averse, i.e. you will avoid risk unless you are
suitably compensated by extra returns (profits).
A sensible way to look at the problem is to estimate the mean (expected return) and
variance (risk) of each lottery (probability distribution).
We shall return to this exercise in the tutorial. Observe how these lotteries bring
together several ideas we have previously looked at:
• Symmetric and skewed histograms/distributions (Units 3 – 5)
• Stem and leaf plots, which is really what Fig.11.1 represents (Unit 4).
• Probability distributions (Unit 5)
• Cumulative frequency curves (Unit 4). See tutorial exercises.


12. References

• Barberis, N., Thaler, R. (2002). A Survey of Behavioral Finance, available at http://badger.som.yale.edu/faculty/ncb25/ch18_6.pdf.
• Lopes, L. A. (1987). Between Hope and Fear: The Psychology of Risk. Advances in Experimental Social Psychology, 20, 255-295.
• Rachev, S.T., Hsu, J.S.J., Bagasheva, B.S., Fabozzi, F.J. (2008). Bayesian Methods in Finance. Wiley.


6 Two Important Probability Distributions

Learning Outcomes
At the end of this unit you should be familiar with the following:
• Calculate probabilities associated with the binomial distribution.
• Appreciate how the binomial distribution arises in finance.
• Basic properties of the binomial distribution.
• Recognise the central role of the normal distribution.
• Basic properties of the normal distribution.

“All models are false, but some are useful.”


George Box (eminent statistician)


1. Introduction
In Unit 5 we have seen that the term “probability distribution” refers to a set of (all)
possible outcomes in any given situation, together with their associated probabilities.
In practice relatively few such distributions are found to occur, and here we discuss
the most important of these from a financial perspective. The ideas we discuss in this
unit form the foundation for much of the theoretical development that underlies most
of the remainder of the module.

2. Binomial Distribution
The investment Example 8.1 in Unit 5 provides an illustration of the so-called
binomial distribution, which applies in the following general circumstances:
Binomial experiment An “experiment” consists of a series of n “trials”. Each trial can
result in one of two possible outcomes

• Success, with constant probability p
• Failure, with constant probability q = 1 - p
(The fact that p + q = 1 means “Success” and “Failure” account for all the possible
outcomes.) We are interested in the number of successes in the n trials.
Formal result The probability of x successes (where 0 ≤ x ≤ n) is given by

P(x successes) = C(n, x) p^x q^(n-x) --- (1)

Note In (1) the quantity C(n, x) is called the binomial coefficient for algebraic reasons (not really
connected to our present purpose). This gives the number of different ways x
successes can occur (equivalent to the number of paths in our investment example).
It can be computed in a variety of ways:

• On many calculators, using a key usually labelled nCr (again named for algebraic
reasons not directly relevant to us).
• In Excel using the COMBIN function.
• By hand using the formula

C(n, x) = n! / [x!(n - x)!] --- (2a)

where the factorial function n! is defined as the product of all integers from 1
up to n, i.e. n! = n(n - 1)(n - 2)...3*2*1 --- (2b)


Without some mathematical background the version (2) may seem unnecessarily
complicated, but it does arise very naturally. If you are uncomfortable with (2)
compute binomial coefficients using your calculator or Excel. In addition:
• Factorials can be computed in Excel using the FACT function.
• The complete result (1) can be computed directly in Excel, using the
BINOMDIST function. (Investigate this function as an Exercise; it will prove
useful in Tutorial 6).

Example 2.1: We return to Example 8.1 of Unit 5. Here our experiment is observing
investment results over n = 3 years (trials). On each trial we arbitrarily define
“Success” = increase in investment value (by 10%) with probability p = 0.5
“Failure” = decrease in investment value (by 10%) with probability q = 0.5
Then we can compute the following:

P(3 successes) = C(3, 3) * 0.5^3 * 0.5^0 = 1*0.125*1 = 0.125

P(2 successes) = C(3, 2) * 0.5^2 * 0.5^1 = 3*0.25*0.5 = 0.375

P(1 success) = C(3, 1) * 0.5^1 * 0.5^2 = 3*0.5*0.25 = 0.375

P(0 successes) = C(3, 0) * 0.5^0 * 0.5^3 = 1*1*0.125 = 0.125
Notes: Observe the following two points:
• The second binomial coefficient can easily be computed using (2) as

C(3, 2) = 3! / [2!(3 - 2)!] = (3*2*1) / (2*1*1) = 3

• This coefficient counts the number of paths which yield the final outcomes of 2
“successes”. We need to translate this into the investment going up twice and
down once. The corresponding paths are shown in Fig.8.3 (Unit 5).
• The above calculations reproduce the probabilities in Table 8.1 of Unit 5. The
advantages of the current algebraic formulation are twofold:
- We can perform calculations without drawing any (tree) diagram.
- We can identify the number of paths without having to locate them.


• The structure of the above calculations is very simple. In (1) the p^x term gives
the probability of x successes along a particular path, and the q^(n-x) term the
probability of (n - x) failures. These probabilities are multiplied in accordance
with the multiplication law (why?). Finally we need to identify the number of
paths leading to x successes and multiply by this (why?). In essence (1)
provides a very compact representation of the multiplication and addition laws
applied to a binomial experiment (repeated trials with only two possible
outcomes each time).
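For readers who prefer code to a calculator, the short Python sketch below evaluates formula (1) directly (math.comb plays the role of the binomial coefficient (2a), much as COMBIN does in Excel) and reproduces the four probabilities of Example 2.1.

    from math import comb            # comb(n, x) is the binomial coefficient n!/(x!(n-x)!)

    def binomial_prob(x, n, p):
        """Probability of x successes in n trials, formula (1)."""
        return comb(n, x) * p ** x * (1 - p) ** (n - x)

    # Example 2.1: n = 3 years, p = 0.5
    for x in (3, 2, 1, 0):
        print("P(" + str(x) + " successes) =", binomial_prob(x, 3, 0.5))
    # Prints 0.125, 0.375, 0.375, 0.125, matching Example 2.1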

The importance of the binomial distribution is twofold:


• Many situations can be classified into two possibilities – a machine either
breaks down or it does not, a coin either lands heads or it does not, a thrown
dice either shows a 6 or it does not, and so on. In all these cases (1) can be
applied.
• As the number of trials n increases a very important histogram shape emerges.
We study this in Section 4.

3. Binomial Distribution Properties


In Example 8.2 of Unit 5 we looked at the mean and standard deviation of our
investment returns. The probability (binomial) distribution defined by (1) also has a
mean and standard deviation, which can be expressed in terms of n, p and q = 1 - p.
The latter are often termed the parameters of the distribution, and textbooks
commonly write something like Binomial (n,p) to denote a binomial distribution with
specified values of n and p. Explicitly we have the following results:
Mean of Binomial(n, p) = np --- (3a)

Standard deviation of Binomial(n, p) = √(npq) --- (3b)

It is important to realise the mean and standard deviation apply to the number of
successes (x), and these may not always represent quantities of interest.

Example 3.1: A fair coin is tossed 5 times.


(a) What is the probability of obtaining just one head?
(b) What is the expected (mean) number of heads?
(c) What is the standard deviation of the number of heads?

Solution: We define Success = Occurrence of a head (on any toss)


Failure = Occurrence of a tail (on any toss)


Then we are in a binomial situation with n = 5 trials, each of which can result in only
one of two outcomes. Since the coin is fair
p = P(Success) = 0.5 and q = P(Failure) = 0.5 (with p + q = 1)

(a) From (1) P(1 success) = C(5, 1) * 0.5^1 * 0.5^4 = 5*0.5*0.0625 = 0.15625

since C(5, 1) = 5! / [1!(5 - 1)!] = (5*4*3*2*1) / (4*3*2*1*1) = 5
(More intuitively a single head can occur in one of 5 ways, i.e. HTTTT or THTTT or
TTHTT or TTTHT or TTTTH.) There is thus only about a 16% chance of obtaining a
single head when a coin is tossed 5 times.
(b) From (3a) Mean number of heads = 5*0.5 = 2.5
Obviously we can never obtain 2.5 heads on any throw (experiment). The
interpretation is the long run frequency one: if we repeat the experiment many times
we will obtain, on average, 2.5 heads.

(c) From (3b) Standard deviation of number of heads = √(5*0.5*0.5) = 1.118
The meaning of this quantity is best illustrated by simulation.

Notes: 1. We can compute, as in Example 2.1, the (binomial) probabilities for all
possible outcomes in Example 3.1, i.e. 0 heads, 1 head, 2 heads and so on. The
resulting histogram of Fig.3.1 gives the theoretical probability distribution for 5 coin
tosses. We shall compare this to the simulated (experimental) version in Section 4.
2. If we draw a tree diagram of the situation in Example 3.1 we obtain something like
Fig.3.2. Although the tree is not recombining, you should note the similarity with
Fig.8.4 of Unit 5 relating to our investment example - see Section 5 below.

Fig. 3.1: Probability Distribution Fig. 3.2: Tree diagram


4. Simulating a Binomial Distribution


We have already simulated coin tosses in Example 2.1 of Unit 5 to illustrate the
frequency interpretation of probability. Now we simulate the situation of Example 3.1.
Example 4.1: The sheet Unit6_5Coins in the Excel file CoinTossing2.xlsx gives
instructions for simulating coin tossing, and you are asked to explore this in Practical
Exercises 5. In relation to Example 3.1 we perform 100 simulations of tossing a coin
5 times, and Table 4.1 shows some results (for the first 10 simulations). In addition
the histograms of Fig.4.1 display the resulting probability distributions; these are
approximations to the theoretical distribution depicted in Fig.3.1. Note that the
theoretical histogram is symmetric, whereas those in Fig.4.1 are not. However, in both
cases, the means and standard deviations approximate well the theoretical values in
Example 3.1(b,c).
Table 4.1: Spreadsheet to simulate number of heads in 5 coin tosses

(a) Experiment 1

(b) Experiment 2

Fig. 4.1: Histograms of Simulations (including Mean and Standard Deviation)

(a) Experiment 1 (b) Experiment 2


An Important Point We can improve the agreement between the simulated and
theoretical distributions by conducting larger simulations. For example in place of our
100 simulations we may use 500; you should explore this.
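An equivalent simulation can be written in a few lines of Python (an illustrative sketch, not the contents of CoinTossing2.xlsx); it repeats the 5-toss experiment of Example 3.1 many times and compares the simulated mean and standard deviation of the number of heads with the theoretical values np = 2.5 and √(npq) ≈ 1.118.

    import random
    import statistics

    n, p = 5, 0.5              # 5 tosses of a fair coin (Example 3.1)
    num_experiments = 500      # illustrative choice; the text uses 100, then suggests 500

    # Number of heads obtained in each simulated experiment
    heads = [sum(random.random() < p for _ in range(n)) for _ in range(num_experiments)]

    print("Simulated mean       :", round(statistics.mean(heads), 3))
    print("Simulated std. dev.  :", round(statistics.stdev(heads), 3))
    print("Theoretical mean np  :", n * p)                               # 2.5
    print("Theoretical sqrt(npq):", round((n * p * (1 - p)) ** 0.5, 3))  # 1.118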

5. Investment Example
Important results emerge when we alter parameters in our simulation. Rather than
doing this for our coin tossing example where, for example, p = P(Head) = 0.3 is not
what we would expect, we return to our investment example of Unit 5.
Example 5.1: The Excel file Binomial_Investment.xlsx gives instructions for
generating our investment values of Example 8.1 of Unit 5. We have extended the
time period to 5 years, as shown in Fig.5.1, really to obtain more representative
histograms. The parameter values we can vary are the following:
• p = P(investment increases over 1 year) = 0.5 in Figs.5.1 and 5.2
• VUp = % increase in investment = 10% in Figs.5.1 and 5.2
• VDown = % decrease in investment = 10% in Figs.5.1 and 5.2

Fig. 5.1: Tree Diagram for 5 Year Investment Horizon


Fig. 5.2: Probability Distribution and Histogram of 5- Year Investment Values

Varying these values leads to the results displayed in Fig.5.3, and we observe the
following important points:
• With p = 0.5 the investment values (IVs) are symmetric about the mean of
£1000.
• With p > 0.5 the IVs are skewed to the right, with larger values having greater
probability than smaller values. The mean is correspondingly > £1000 and,
importantly, the variation around the mean (as measured by the standard
deviation) also increases.
• Varying p changes the probabilities of the various outcomes (IVs) but not the
outcomes themselves.
• Varying VUp and VDown leaves the probabilities unchanged but changes the
values of the various outcomes.
Fig. 5.3: Varying Parameters in Fig.5.1

(a) p = 0.7 , VUp = 10% , VDown = 10%


(b) p = 0.5 , VUp = 20% , VDown = 10%

(c) p = 0.5 , VUp = 10% , VDown = 20%

(d) p = 0.7 , VUp = 20% , VDown = 10%


(e) p = 0.3 , VUp = 10% , VDown = 20%

• Increasing (decreasing) VUp increases (decreases) the IVs, but the mean and
standard deviation change in unexpected ways. For example in Fig.(b) the
mean has increased by £274.23 above £1000, but in Fig.(c) the decrease is
only £225. In addition the standard deviation is (very much) different in the two
cases, despite the fact that p remains unchanged and the values of VUp and
VDown are “symmetrical”.
• As we alter parameters, the changes in mean and standard deviation are very
important, and you should study them carefully in Fig.5.3.

An Important Point The probability distributions in Fig.5.3 are the (exact)
theoretical ones and, as we can see, their shapes and summary measures depend
on the parameters of the distribution. In some situations we are more interested in
simulating sample paths:
• The actual paths may be too numerous to compute. For example, most
investments do not just have 2 possible future values, but rather hundreds. (A
stock may increase, or decrease, by any amount during each time frame.)
• We may not know how sample paths evolve forward in time. When valuing
options we only have terminal (future) values available, and we have to try and
work out how values evolve backwards in time.

Example 5.2: The Simulation sheet in the Excel file Binomial_Investment.xlsx


also gives instructions for simulating our investment values. In Table 5.2 we have
shown the first 11 of 100 simulated sample paths for the (default) case p = 0.5 and
VUp = VDown = 0.1.
Note the estimated mean and standard deviation in Table 5.2 agree reasonably well
with the exact (theoretical) values in Fig.5.2. As before we can improve accuracy by


using more (than 100) simulations; you are asked to investigate this in Practical
Exercises 5.

Table 5.2: Simulating Sample Paths in Fig.5.1
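The following Python sketch imitates this kind of path simulation. It is an illustrative reconstruction, not the code inside Binomial_Investment.xlsx; the parameters p, v_up and v_down correspond to those described above, and the default values reproduce the setting of Fig.5.1.

    import random
    import statistics

    def simulate_terminal_values(start=1000, years=5, p=0.5, v_up=0.10, v_down=0.10,
                                 n_paths=100):
        """Simulate terminal values of an investment that rises by v_up (probability p)
        or falls by v_down (probability 1 - p) each year."""
        finals = []
        for _ in range(n_paths):
            value = start
            for _ in range(years):
                if random.random() < p:
                    value *= 1 + v_up      # investment increases this year
                else:
                    value *= 1 - v_down    # investment decreases this year
            finals.append(value)
        return finals

    values = simulate_terminal_values()
    print("Estimated mean terminal value:", round(statistics.mean(values), 2))
    print("Estimated standard deviation :", round(statistics.stdev(values), 2))
    # Try p=0.7 or v_up=0.20 to reproduce the kind of changes shown in Fig.5.3.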

6. Transition From Binomial To Normal Distribution


If we return to our coin tossing Example 3.1 and, with p = 0.5, significantly increase
the number of tosses to, say n = 25, an interesting pattern emerges. The theoretical
(binomial) distribution of Fig.6.1 displays a “bell shaped” configuration, and this is
confirmed by the simulation of Fig.6.2. Note that (3a,b) give
Mean = np = 25*0.5 = 12.5

and Standard deviation = √(npq) = √(25*0.5*0.5) = 2.5

As n increases (with p = 0.5) the binomial distribution approaches the so-called


Normal distribution, which has a characteristic “bell shape”.

Fig. 6.1: Theoretical Histogram for n = 25 Fig. 6.2: Simulated Histogram for n = 25


Note The normal distribution is a continuous distribution, whereas the binomial is
discrete. We can see how the discrete binomial “merges” into the continuous normal
by successively increasing n, as in Figs.6.3 and 6.4. The gaps in the histogram become
successively smaller and, to view the histogram more easily, we have redrawn Fig.6.4
(as Fig.6.5) using a smaller x-range by not plotting probabilities that are too small to
be seen. You should also note how the mean, and especially the standard deviation,
increases as n increases.

Fig.6.3: Theoretical Histogram for n = 50 Fig.6.4: Theoretical Histogram for n = 100

Fig.6.5: Redrawn theoretical histogram for n = 100
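If the scipy library is available, this convergence can be checked numerically; the sketch below compares the Binomial(25, 0.5) probabilities near the mean with the corresponding normal curve with mean 12.5 and standard deviation 2.5.

    from scipy.stats import binom, norm

    n, p = 25, 0.5
    mean = n * p                       # 12.5
    sd = (n * p * (1 - p)) ** 0.5      # 2.5

    for k in range(10, 16):
        exact = binom.pmf(k, n, p)      # exact binomial probability P(X = k)
        approx = norm.pdf(k, mean, sd)  # height of the approximating normal curve at k
        print(k, round(exact, 4), round(approx, 4))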

7. The Normal Distribution – Basic Properties


The normal distribution is the most important distribution in the whole of statistics.
This is really due to three fundamental features:


Feature 1 Many practical situations seem to involve random variables which follow a
normal distribution, either exactly or approximately. This situation arises for reasons
connected to Fig.6.5. We can regard the number of heads as the sum of a large
number (here 100) of independent random variables (each variable taking the value
1 if a head occurs, and the value 0 otherwise). In such instances we can give
theoretical arguments that lead to the normal distribution. In general we can expect
the sum of a large number of independent random variables to lead to a normal
distribution, and a lot of practical situations fall into this category.
Feature 2 When we take samples from a population, and examine the average
value in the sample, the normal distribution invariably arises. We shall look at this
situation in detail in Unit 7.
Feature 3 The normal distribution is “very well behaved mathematically”, and this
leads to rather simple general theoretical results. Although we shall not pursue any of
this we give a few details in the next section.
Although we invariably refer to the normal distribution, there are in fact infinitely many
of them! Some of these are depicted in Fig.7.1. However the situation is not nearly as
complicated as it might appear since all normal distributions have a simple relation
connecting them. The key is that, similar to the binomial distribution, the normal
distribution is completely described by just two parameters. These are actually
quantities we have met several times before, the mean and the standard deviation.
Terminology We use the symbolism N(μ, σ²) to denote a normal distribution with
mean μ (pronounced mu) and standard deviation σ (pronounced sigma). The square
of the standard deviation, σ², is termed the variance. Its importance is explained in
Section 8.

(a) μ = 5 and σ = 1 (b) μ = 5 and σ = 2


(c) μ = 2 and σ = 0.5 (d) μ = 2 and σ = 1

Fig.7.1: A Variety of Normal Distributions

Comment Do not let the use of Greek letters (μ and σ) for the mean and standard
deviation confuse you. They are simply used in place of the letters we have so far
been using (x̄ and s) to distinguish between population values and sample values.
We shall discuss this further in Unit 7.
You should observe how the mean and standard deviation affect the location and
shape of the normal curve in Fig.7.1.
• The mean determines the location of the “centre” of the distribution. Since the
curve is symmetric (evident from (4) in the next section), the mean equals the
mode (and the median), so the mean also gives the highest point on the curves.
• As we increase the standard deviation the curves become “more spread out”
about the centre. We shall make this more precise in Section 9.
• Also note how the vertical scale changes as we alter the standard deviation. In
addition observe we do not label the y-axis “Probability” as in, for example, the
discrete binomial distribution in Fig.6.5. As we emphasise in Sections 8 and 9 it
is not the height of the curve, but rather the area under the curve, that
determines the corresponding probability.
(Compare these results to those in Section 6.)

We shall use the normal distribution most of the time from now on. When doing this
you should bear three points in mind:
• We can always think of a normal distribution as arising from data being
collected, compiled into a frequency table and a histogram being drawn; a
“smoothed out” version of the histogram, as in Fig.6.5, yields a normal
distribution.
• There are instances when the normal distribution is not applicable, and these
occur when distributions are skewed. Examples include wages and waiting
times.


• In practice, and where feasible, always draw histograms to check whether the
assumption of a normal distribution is appropriate.

8. Normal Distribution Probabilities - Theoretical Computations


Although we shall always compute normal distribution probabilities using the
approach described in Section 9 below, it is important to understand a couple of
basic (theoretical) points to appreciate what we will actually be doing!
The key idea is that probabilities represent areas under the appropriate distribution
curve. In Fig.6.5 we will obtain a sum of 1 if we add the areas of all the rectangles
comprising the histogram. This is the analogue in the discrete situation of all
probabilities adding to 1; see Fig.5.3.

Fig.8.1: Normal Distribution Curve Fig.8.2: Area Computation?

However when we move to the continuous case, exemplified by Fig.8.1, we no longer


have rectangular areas which we can easily add. (We shall explain the x-scale in
Section 9.) Now we need some more advanced mathematics, in the form of calculus
and integration methods, to calculate areas such as that depicted in Fig.8.2.
Whilst we shall not pursue this, the necessary integration techniques require the
equation of the normal distribution curve. This is essentially
y = k e^(-x²/2) = k*exp(-½x²) --- (4)
where k is a constant (needed to make probabilities add to 1). As an exercise you
should sketch (4) as described in Practical Unit 3, Section 5, choosing a suitable
value of k. Here all we want to do is point out an important mathematical property of
(4). If we multiply two such terms
y1 = k1*exp(-½x1²) and y2 = k2*exp(-½x2²)

we obtain y1*y2 = k1*k2*exp(-½(x1² + x2²)) --- (5)


This is of the same general form as (4), and hence represents another normal
distribution. This in turn means when we add two normal distributions together
(equivalent to multiplying their defining equations) we obtain another normal
distribution. In particular this normal distribution has
• mean the sum of the two individual means, and
• variance equal to the sum of the two individual variances.
It is the fact that the variances (square of the standard deviation) add together that
makes the variance a more important quantity (at least theoretically) than the
standard deviation. These are very important properties of the normal distribution, to
which we shall return in Unit 7.
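This additivity is easy to verify by simulation. The sketch below (using the numpy library, with parameter values chosen purely for illustration) adds two independent normal samples and compares the mean and variance of the sum with μ1 + μ2 and σ1² + σ2².

    import numpy as np

    rng = np.random.default_rng(seed=1)
    mu1, sigma1 = 5.0, 1.0             # illustrative parameter values (assumed)
    mu2, sigma2 = 2.0, 0.5

    x1 = rng.normal(mu1, sigma1, size=100_000)
    x2 = rng.normal(mu2, sigma2, size=100_000)
    total = x1 + x2                    # sum of two independent normal variables

    print("Simulated mean    :", round(total.mean(), 3), "  theory:", mu1 + mu2)
    print("Simulated variance:", round(total.var(), 3), "  theory:", sigma1**2 + sigma2**2)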
Note Moving from a discrete, to a continuous, distribution is actually a little more
complicated than we have described. We shall discuss the source of the difficulties in
Unit 8 when we consider how to draw reliable conclusions from statistical analysis of
data.

9. Normal Distribution Probabilities - Practical Computations


We shall adopt a more practical approach to computing normal (distribution)
probabilities. We adopt a two stage approach:
• Some probabilities have already been calculated, using the theoretical ideas of
Section 8, and are available in tabulated form.
• We then need to be able to infer all probabilities from the tabulated ones.

1. Standard Normal Tables And Their Use


Areas (probabilities) are calculated, and tabulated, for one specific normal
distribution, termed the standard normal distribution. This is defined by
μ = 0 and σ = 1 --- (6)

and corresponds to Fig.8.1. A set of tabulated areas is given in Table 9.1 on the
following page. Note carefully the following points:
• The underlying variable is always labelled Z for a standard normal distribution,
and is often referred to as a standard normal variable.
- Z is used as a reference for computational purposes only.
- Z has no underlying interpretation (such as the number of heads).
- Z is dimensionless, i.e. just a number without any units attached.


• It is not necessary to know the values (6) in order to read the corresponding
tables, but these parameter values are helpful in both understanding, and
remembering, what happens in the general case (discussed below).
• Only one specific type of area is tabulated – between 0 and the value of z
chosen. The figure at the top of Table 9.1 is a reminder of this. (Other tables
you may find in books may tabulate different areas, so some care is needed.)
• Only positive z values, up to about 3.5, are tabulated – see Fig.8.1.

Use of Tables You need to bear in mind the following points:

P1 Total area under curve = 1
P2 The curve is symmetric (about z = 0), so
Area to left of negative z-value = Area to right of positive z-value
P3 P1 and P2 imply
(Area to left of z = 0) = (Area to right of z = 0) = 0.5
Example 9.1 If Z is a standard normal variable, determine the probabilities:
(a) P(Z < 1.75), i.e. the probability that Z is less than 1.75
(b) P(Z < -1.6) (c) P(1.4 < Z < 2.62)

Solution You should always draw a sketch to indicate the area required. In addition
you may need further sketches to actually compute the area.
(a) In pictures: [sketches showing the area to the left of z = 1.75 split into the area 0.5 to the left of z = 0 (P3) plus the tabulated area 0.4599 between 0 and 1.75]

In symbols P(Z < 1.75) = P(Z < 0) + P(0 < Z < 1.75)
= 0.5 + 0.4599 = 0.9599
Here the formulae just reflect the thought processes displayed in the pictures!


Table 9.1: Standard Normal Distribution

[Sketch: the standard normal curve from about z = -3 to z = 3, with the area between 0 and z shaded.]

Entries in the table give the area under the curve between the mean and z standard
deviations above the mean. For example, with z = 1.02, the area under the curve between
the mean (of zero) and z is .3461.

z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09

.0 .0000 .0040 .0080 .0120 .0160 .0199 .0239 .0279 .0319 .0359
.1 .0398 .0438 .0478 .0517 .0557 .0596 .0636 .0675 .0714 .0753
.2 .0793 .0832 .0871 .0910 .0948 .0987 .1026 .1064 .1103 .1141
.3 .1179 .1217 .1255 .1293 .1331 .1368 .1406 .1443 .1480 .1517
.4 .1554 .1591 .1628 .1664 .1700 .1736 .1772 .1808 .1844 .1879

.5 .1915 .1950 .1985 .2019 .2054 .2088 .2123 .2157 .2190 .2224
.6 .2257 .2291 .2324 .2357 .2389 .2422 .2454 .2486 .2518 .2549
.7 .2580 .2612 .2642 .2673 .2704 .2734 .2764 .2794 .2823 .2852
.8 .2881 .2910 .2939 .2967 .2995 .3023 .3051 .3078 .3106 .3133
.9 .3159 .3186 .3212 .3238 .3264 .3289 .3315 .3340 .3365 .3389

1.0 .3413 .3438 .3461 .3485 .3508 .3531 .3554 .3577 .3599 .3621
1.1 .3643 .3665 .3686 .3708 .3729 .3749 .3770 .3790 .3810 .3830
1.2 .3849 .3869 .3888 .3907 .3925 .3944 .3962 .3980 .3997 .4015
1.3 .4032 .4049 .4066 .4082 .4099 .4115 .4131 .4147 .4162 .4177
1.4 .4192 .4207 .4222 .4236 .4251 .4265 .4279 .4292 .4306 .4319

1.5 .4332 .4345 .4357 .4370 .4382 .4394 .4406 .4418 .4429 .4441
1.6 .4452 .4463 .4474 .4484 .4495 .4505 .4515 .4525 .4535 .4545
1.7 .4554 .4564 .4573 .4582 .4591 .4599 .4608 .4616 .4625 .4633
1.8 .4641 .4649 .4656 .4664 .4671 .4678 .4686 .4693 .4699 .4706
1.9 .4713 .4719 .4726 .4732 .4738 .4744 .4750 .4756 .4761 .4767

2.0 .4772 .4778 .4783 .4788 .4793 .4798 .4803 .4808 .4812 .4817
2.1 .4821 .4826 .4830 .4834 .4838 .4842 .4846 .4850 .4854 .4857
2.2 .4861 .4864 .4868 .4871 .4875 .4878 .4881 .4884 .4887 .4890
2.3 .4893 .4896 .4898 .4901 .4904 .4906 .4909 .4911 .4913 .4916
2.4 .4918 .4920 .4922 .4925 .4927 .4929 .4931 .4932 .4934 .4936

2.5 .4938 .4940 .4941 .4943 .4945 .4946 .4948 .4949 .4951 .4952
2.6 .4953 .4955 .4956 .4957 .4959 .4960 .4961 .4962 .4963 .4964
2.7 .4965 .4966 .4967 .4968 .4969 .4970 .4971 .4972 .4973 .4974
2.8 .4974 .4975 .4976 .4977 .4977 .4978 .4979 .4979 .4980 .4981
2.9 .4981 .4982 .4982 .4983 .4984 .4984 .4985 .4985 .4986 .4986

3.0 .4986 .4987 .4987 .4988 .4988 .4989 .4989 .4989 .4990 .4990
3.1 .4990 .4991 .4991 .4991 .4992 .4992 .4992 .4992 .4993 .4993
3.2 .4993 .4993 .4994 .4994 .4994 .4994 .4994 .4995 .4995 .4995
3.3 .4995 .4995 .4995 .4996 .4996 .4996 .4996 .4996 .4996 .4997
3.4 .4997 .4997 .4997 .4997 .4997 .4997 .4997 .4997 .4997 .4998
3.5 .4998 .4998 .4998 .4998 .4998 .4998 .4998 .4998 .4998 .4998
3.6 .4998 .4998 .4998 .4999 .4999 .4999 .4999 .4999 .4999 .4999


(b) In pictures: [sketches showing that, by symmetry, the area to the left of z = -1.6 equals the area to the right of z = 1.6, i.e. 0.5 (P3) minus the tabulated area 0.4452 between 0 and 1.6]

In symbols P(Z < -1.6) = P(Z > 1.6) = P(Z > 0) - P(0 < Z < 1.60)
= 0.5 - 0.4452 = 0.0548
Since this is roughly 5%, we can see the above figures are not drawn to scale.
(c) In pictures: [sketches showing the area between 1.4 and 2.62 as the tabulated area 0.4956 between 0 and 2.62 minus the tabulated area 0.4192 between 0 and 1.4]

In symbols P(1.4 < Z < 2.62) = P(0 < Z < 2.62) - P(0 < Z < 1.4)
= 0.4956 - 0.4192 = 0.0764
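When software is to hand, the areas in Table 9.1 can be generated from the standard normal cumulative distribution function rather than read from the page. The Python sketch below (assuming the scipy library is available) reproduces the three answers of Example 9.1; note that norm.cdf(z) gives the area to the left of z, so the tabulated area between 0 and z is norm.cdf(z) - 0.5.

    from scipy.stats import norm

    def table_area(z):
        """Area under the standard normal curve between 0 and z (as tabulated in Table 9.1)."""
        return norm.cdf(z) - 0.5

    print("P(Z < 1.75)       =", round(norm.cdf(1.75), 4))                  # approx 0.9599
    print("P(Z < -1.6)       =", round(norm.cdf(-1.6), 4))                  # approx 0.0548
    print("P(1.4 < Z < 2.62) =", round(norm.cdf(2.62) - norm.cdf(1.4), 4))  # approx 0.0764
    print("Area for z = 1.02 =", round(table_area(1.02), 4))                # approx 0.3461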

2. General Normal Distributions


Usually we are not dealing (directly) with a normal distribution described by the
parameter values (6), but with a general N(μ, σ²) distribution. In this latter case the
underlying variable is always labelled X, and the fundamental formula connecting X
with Z is

Z = (X - μ) / σ --- (7a)

In words Z = (X value - Mean) / (Standard Deviation) --- (7b)


The procedure defined by (7) is often termed “standardizing a variable” and means
expressing the size of a (random) variable relative to its mean and standard
deviation. Once we have a Z value available probabilities are again found as in
Example 9.1.
Example 9.2 IQ’s (Intelligence Quotients) of schoolchildren are normally distributed
with mean = 100 and standard deviation = 15. If a school child is selected at random,
determine the probability that their IQ is
(a) over 135 (b) between 90 and 120 ?

Solution With μ = 100 and σ = 15, (7) gives Z = (X - 100) / 15

(a) With X = 135: Z = (135 - 100) / 15 = 2.33 (to 2D)
Then P(X > 135) = P(Z > 2.33) = 0.5 - 0.4901 = 0.0099
using the sketches below.
[Sketches: the area to the right of z = 2.33 equals 0.5 minus the tabulated area 0.4901 between 0 and 2.33.]

Note we may also phrase our answer in the form “There is a 1% chance of a
randomly chosen school child having an IQ greater than 135.”
(b) Here we have two X values and need to compute the two corresponding Z values.
With X1 = 90: Z1 = (90 - 100) / 15 = -0.67 (to 2D)
and with X2 = 120: Z2 = (120 - 100) / 15 = 1.33 (to 2D)
Then P(90 < X < 120) = P(-0.67 < Z < 1.33) = 0.2486 + 0.4082 = 0.6568
using the sketches below.


[Sketches: by symmetry, the area between -0.67 and 0 equals the tabulated area 0.2486 between 0 and 0.67; the tabulated area between 0 and 1.33 is 0.4082.]

Notes 1. Although we have quoted results to 4D (decimal places) in Examples 9.1


and 9.2 it should be clear that such accuracy is not real. Although you may work to
4D, since our Z values are only accurate to 2D, we recommend giving final values for
probabilities to 2D.
2. Excel has special commands to generate normal distribution probabilities as they
are required, so that standard normal tables are not (directly) needed - see the
Practical Exercises. But tables are very useful for hand computations, and you
should become familiar with their use. (Another possibility is hand calculators, some
of which provide extensive statistical facilities.)
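In the same spirit as Note 2, the sketch below (Python with scipy, offered as one alternative to Excel's built-in normal distribution functions) standardises via (7a) and reproduces the answers of Example 9.2.

    from scipy.stats import norm

    mu, sigma = 100, 15                    # IQ distribution of Example 9.2

    def z_value(x):
        """Standardise an X value using (7a)."""
        return (x - mu) / sigma

    # (a) P(IQ > 135)
    print("P(IQ > 135)      =", round(1 - norm.cdf(z_value(135)), 4))
    # (b) P(90 < IQ < 120)
    print("P(90 < IQ < 120) =", round(norm.cdf(z_value(120)) - norm.cdf(z_value(90)), 4))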

10. A Practical Illustration


Rather than looking at (somewhat artificial) examples drawn from the social sciences,
our final example is a financial one.
Example 10.1 Exchange rates are of fundamental importance for a whole variety of
reasons; see http://en.wikipedia.org/wiki/Exchange_rate for information and further
references. You can download Excel spreadsheets containing a variety of exchange
rates from the FRED website of the St. Louis Federal Reserve Bank at
http://research.stlouisfed.org/fred2/ - look back at Section 2 of Unit 2. We look at the
monthly dollar-sterling exchange rates from 1971-2009; this is the series denoted
EXUSUK on the website. The first few data values (there are 459 in total) are shown
in Fig.10.1; in addition we have computed monthly changes, and monthly percentage
changes, in the exchange rate. Raw data is available in the file ExchangeRate1.xlsx


on the web page, together with the graphics shown in this section (on the Graphics
sheet).

Fig.10.1: Monthly Exchange Rates and (Percentage) Changes

We are trying to see patterns in the data, hopefully with a future view to modelling
(or predicting) exchange rates. Overleaf we have drawn several graphs (discussed in
previous units).
• The line (times series) plot of Fig.10.2 shows exchange rate fluctuations without
any evident pattern.
• The histogram of Fig.10.3 appears to show two separate “normal like”
distributions. However this is very difficult to interpret since we have no
indication of the “time sequence” when particular exchange rate values
occurred, and this is a crucial feature of the data (as Fig.10.2 shows).
• Since we have found little apparent pattern in the exchange rates themselves,
our next focus is on changes in the rate. In Figs.10.4 and 10.5 we plot the
changes themselves, and also the percentage changes. Look back at Practical
Units 2 and 3 to recall how these changes are computed in Excel.

Fig.10.2: Time Series Plot of Exchange Rate Fig.10.3: Histogram of Exchange Rate


Fig.10.4: Time Series Plot of Monthly Changes in Exchange Rate

Fig.10.5: Time Series Plot of Monthly Percentage Changes in Exchange Rate

Fig.10.6: Histogram of Monthly Changes Fig.10.7: Histogram of % Monthly Changes


• It is difficult to see any patterns in the time series of either set of rate changes.
However, if we look at the corresponding histograms the situation changes
dramatically. In Figs.10.6 and 10.7 we can “clearly see” the presence of normal
distributions. Thus, although successive changes over time show no (apparent)
pattern, the changes “all taken together” over the complete time range show
clear evidence of being normally distributed. Many questions now naturally
arise:
• Why should time series and histograms give such completely different results?
• Do similar patterns appear in other exchange rates?
• Do similar patterns appear in other financial quantities (stock prices, GDP,
financial ratios and so on)?
• Once we convince ourselves a normal distribution is appropriate – see Section
11 below – we can start to do some calculations and estimate various
probabilities.

A Simple Calculation If we look at the changes in exchange rate we can determine


the mean and standard deviation as described in previous units. Excel gives
Mean = - 0.0022 ; Standard deviation = 0.0430 --- (8a)
Suppose we ask for the probability of the exchange rate next month changing by
more than 0.1 (10 cents) from today’s value. Although we do not need the precise
value of the latter we shall assume the exchange rate is 1.4. Using (7b) we calculate,
with X = change in exchange rate,
X = 0.1: Z = (0.1 - (-0.0022)) / 0.0430 = 2.38 (to 2D)
X = -0.1: Z = (-0.1 - (-0.0022)) / 0.0430 = -2.27 (to 2D)
Then P(-0.1 < X < 0.1) = P(-2.27 < Z < 2.38) = 0.4884 + 0.4913 = 0.9797
Look back at the solution to Example 9.2(b) to see why this calculation works. Since
we have computed the probability of the exchange rate changing by less than 0.1
P(exchange rate change is more than 0.1) = 1 - 0.9797 = 0.0203
Note the probability laws of Unit 5 in action here. Our conclusion is that we estimate
a 2% chance of such a “large” change in the exchange rate next month. If we were a
currency trader (or speculator) we would not be prepared to bet on such an outcome.
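The same calculation can be packaged as a short function. The Python sketch below uses only the mean and standard deviation quoted in (8a), together with the working assumption that monthly changes are normally distributed; it does not read the ExchangeRate1.xlsx data itself.

    from scipy.stats import norm

    mean_change, sd_change = -0.0022, 0.0430     # values quoted in (8a)

    def prob_abs_change_exceeds(threshold):
        """P(|monthly change| > threshold), assuming the change is normally distributed."""
        within = (norm.cdf(threshold, mean_change, sd_change)
                  - norm.cdf(-threshold, mean_change, sd_change))
        return 1 - within

    print("P(change of more than 0.1) =", round(prob_abs_change_exceeds(0.1), 4))  # about 0.02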


11. Is Our Data Normally Distributed?


Since much of the statistical theory developed (some of which we shall look at in later
units) applies to normally distributed data, it is important in practice to be able to
assess whether a given data set is normally distributed or not.
• The quickest, though probably least reliable, way is to obtain a histogram of the
data, and see if it resembles a normal distribution. This is precisely what we
have done in Example 10.1 using Figs.10.6 and 10.7.
• Various other methods have been developed to assess “how well” a normal
distribution describes a data set. As an exercise investigate “Q-Q plots” using
Google.

12. Important Numbers


There are a few important values associated with the normal distribution that we shall
need in Unit 7. In addition these numbers give an important “general picture” of any
data set that follows a normal distribution (exactly or approximately). The values
relate to the proportion (or, when multiplied by 100, the percentage) of any normal
distribution within specified limits of the mean. The limits chosen are
(a) 1 standard deviation (b) 2 standard deviations (c) 3 standard deviations
(usually abbreviated 1σ, 2σ and 3σ respectively).


Fig.12.1: Proportion of Normal Distribution Within Specified Limits.

The last result in Fig.12.1 that 99.7% of a normal distribution lies within 3 standard
deviations of the mean allows us to state that, roughly,
• Smallest data value = Mean – 3*Standard Deviation
• Largest data value = Mean + 3*Standard Deviation

These limits are often useful when drawing appropriate normal distributions, or when
trying to assess the range (Largest – Smallest) of a data set when the mean and
standard deviation are known. These limits often go by the name “the 3σ rule”.
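The proportions summarised in Fig.12.1 can be recovered directly from the standard normal distribution, as the short sketch below (Python with scipy) shows.

    from scipy.stats import norm

    for k in (1, 2, 3):
        proportion = norm.cdf(k) - norm.cdf(-k)   # area within k standard deviations of the mean
        print("Within", k, "standard deviation(s):", round(100 * proportion, 1), "%")
    # Prints roughly 68.3%, 95.4% and 99.7%.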


7 Sampling and Sampling Distributions

Learning Outcomes
At the end of this unit you should be familiar with the following:
• Understand the concept of sampling.
• Appreciate how the normal distribution arises in sampling.
• Understand the Central Limit Theorem, and its limitations.
• Recognize that, with unknown variance, the t - distribution is required.
• Understand how t-tables are used to compute probabilities.
• Recognise that the square of a normally distributed variable gives rise to the
chi-square distribution (χ²).

• Understand how χ² tables are used to compute probabilities.

• Recognise that the ratio of two chi-squared variables gives rise to the F-
distribution.
• Understand how F-tables are used to compute probabilities.

“I know of scarcely anything to impress the imagination as the wonderful


form of cosmic order expressed by the Central Limit Theorem”.
Francis Galton (1822-1911)


1. Introduction
From a technical point of view this unit is the most complicated in the entire module,
and centres around two fundamental ideas:
• The need to take samples from a population, and the consequences this has
for the “structure” of the possible samples we may take.
• If a random variable (X) follows a normal distribution, then functions of X (such
as X²) will follow a different distribution.

By the end of the unit we will have met three further probability distributions (t, χ² and
F), all dependent on the normal distribution in one way or another. Taken together
use of these four distributions will account for almost all the statistical analysis you
are likely to encounter. Rather than having detailed technical knowledge of all these
distributions, it is far more useful to understand the general situations in which they
arise, and the uses to which they are put. This will enable you to make sense of the
statistical output of most software packages (Excel included).

2. Why Sample?
In very general terms we wish to deduce information about all “items of a particular
type” by studying just some of these items. The former is termed a population, and
the latter a sample (taken from the population). It is important to realise the term
“population” will not, in general, have its “usual interpretation” as a large group of
people (or possibly animals).

Example 2.1: The following are examples of populations and samples:


• The population consists of all eligible voters in the UK. The sample consists of
all persons contacted by a polling firm hired by, say, the Conservative Party to
assess its chances in the next general election.
• A television manufacturer wants to be sure his product meets certain quality
control standards. The population is all the televisions (possibly of a particular
type) produced by the manufacturer (say over the past week). The sample may
be 100 televisions made yesterday afternoon on the second production line.
• A market research agency is conducting interviews on a new financial product.
The population consists of all future buyers of the product (which is probably not
known very precisely!). The sample is the particular people selected by the
agency to interview.
All these illustrations have three important ideas in common:
• We cannot sample from the entire population for various reasons:


- It is too costly.
- It involves testing an item to destruction (assessing lifetimes).
- We are not really sure who is in the population (future potential buyers).
• We are trying to use sample information in order to estimate population
information. How well we can do this depends on how representative of the
whole population our sample is.
• If we take more than one sample we are almost certain to obtain different
results from the various samples. What we hope is that sample results will not
vary very much from each other if our samples are representative (of the
population). In practice this means we can use the results from any one sample,
and do not need to take more than one sample provided we can guarantee our
sample really is representative of the population.
The situation is summarised in Fig.2.1.
[Diagram: a large population with two possible samples drawn from it, annotated with the questions “Is this sample representative?” and “Would a second sample give similar results?”]
Fig.2.1: Taking samples from a (much larger) population.

3. How To Sample (Sampling Methods)


In order to make sure our sample is representative we need to ensure the way our
sample is chosen does not introduce any bias (for example, by excluding a certain
part of the population). How serious an issue this is can be judged by the large
volume of literature that exists on sampling methods and procedures, and you will
learn more on these ideas in the Research Methods module. Whilst we shall be
largely concerned with how to interpret sample results once they have been
collected, the issues are of sufficient importance that we suggest you read Section
10 of Unit 2 before continuing.

4. Distribution of the Sample Mean


Consider the following example.
Example 4.1: The export manager of a clothing manufacturer is interested in
breaking into a new Far Eastern market. As part of the market research a random
sample of heights of 1000 females in the targeted age range was obtained with a
view to estimating the mean height of females in the targeted age group. The sample


mean of these 1000 values was calculated from the data. The mean height of all
females in the targeted age range is unfortunately unknown. However it can be
thought of as follows: if the height of every female in the targeted age range could be
measured then the population mean would be the mean height of these numbers.
If the manufacturer had been quite lazy and only taken a sample of 50 individuals, or
worse still a sample with only 5 individuals, would you expect the results to be as
reliable? Intuitively we feel that a bigger sample must be better. But (see Fig.4.1) in
what way is the mean calculated from a sample of size 50 better than one calculated
from one of size 5? To answer this question we need to know how the mean
calculated from different samples of the same size can vary.
If more than one sample of the same size was drawn from the population, as in
Fig.4.2, how much variation would you expect amongst the sample means? We know
that because of the sampling process the sample mean is unlikely to be exactly the
same as the true population mean. In some samples just by chance there will be too
high a proportion of tall women, and in others too low a proportion, compared with
the population as a whole.
[Diagram: three copies of the population, with samples of size 1000, 50 and 5 drawn from them, giving sample means x̄1, x̄2 and x̄3 respectively.]

Fig.4.1: Samples of different sizes, and the information they contain.

[Diagram: a single population with mean μ, from which k samples of the same size are drawn; sample no. 1, 2, 3, ..., k have means x̄1, x̄2, x̄3, ..., x̄k, giving a group of k sample means.]

Fig.4.2: Taking repeated samples (k of them) of the same size (n)


We are led to the general situation of Fig.4.3 with four important components:
• Our population comprising all items of interest.
• Our sample comprising those parts of the population we have examined.
• A probability distribution describing our population, as discussed in Unit 6.
• A probability distribution describing our sample, to be discussed in Section 5.

Since we have a normal distribution in mind and this, as we have seen, is
characterised by a mean (μ) and a standard deviation (σ), we concentrate on these
two quantities. These values (x̄ and s) will be known for our sample, and we are
trying to use them to estimate μ and σ. The fundamental problems of how well
(a) x̄ approximates μ, and (b) s approximates σ
will be discussed in this unit and the next.

[Diagram: four linked boxes, summarised as follows.
• Population box: comprises all units; too large to study directly; mean μ usually unknown; StDev σ usually unknown.
• Sample box: comprises randomly selected units; small enough to study directly; mean x̄ can be computed, i.e. known; StDev s is known.
• Probability distribution box: usually characterised by mean μ and StDev σ; both are (usually) unknown.
• Sampling distribution box: What is the distribution? Mean of sampling distribution = ? StDev of sampling distribution = ?]

Fig.4.3: Relationships between Population and Sample.
Our first, and conceptually most difficult, task is to understand the box labelled
“Sampling Distribution” in Fig.4.3. We shall do so via several Excel simulations and,
to keep the ideas as simple as possible, we look at a dice throwing example.

Example 4.2: The Excel file SingleDice.xlsx allows you to simulate either
(a) 36 throws, or (b) 100 throws
of a fair dice. In either case a histogram of the results is produced. Importantly, by
pressing F9 you can repeat the calculation (simulation) as often as you wish.
Table 4.1 lists one possible set of simulated results, together with a frequency table
of the results. For example a score of 5 actually occurred 10 times in the 36 throws.
The corresponding histogram is shown in Fig. 4.4(a), and histograms for three further
simulations are also given.


Table 4.1: Simulation of 36 throws of a fair dice, with corresponding frequency table

Fig.4.4: Histograms for 4 simulations (each representing 36 dice throws)

(a) Simulation 1 (b) Simulation 2

(c) Simulation 3 (d) Simulation 4

The most noticeable feature of Fig.4.4 is the lack of an “obvious pattern” in the
histograms. An equivalent, but more informative, way of expressing this is to say
there is “large variability present”. Another way of stating this is to ask “What will the
next histogram (Simulation 5) look like?” We are forced to conclude, on the basis of
the first four simulations, that we do not know what simulation 5 will produce.
In the Practical Exercises P5, Q5 you are asked to look at this example further.
Example 4.3: The Excel file SampleMean.xlsx allows you to perform simulations
that are superficially similar to Example 4.2, but actually are fundamentally different
in character. In light of the results in Example 4.2 we realise that, when there is large
variability present within the data, there is little point in trying to predict “too
precisely”. We settle for a more modest, but still very important, aim:
We seek to predict the average value (mean) of our sample (simulation).


Once we move from looking at individual values within a sample, and concentrate on
sample means our predictive ability improves dramatically. The spreadsheet
UniformSamp allows you to throw 9 dice 100 times (a grand total of 900 throws).
Each time the 9 dice are thrown the average value is calculated, and this is the only
information retained – the individual 9 values are no longer of any importance. In this
way we build up a set of 100 mean values, and these are shown shaded in Column
J of Table 4.2. From these 100 values we extract the following information:
(a) A frequency table (shown in Columns K and L), and hence a histogram. Fig.
4.5(a) shows the histogram corresponding to the data in Table 4.2.
(b) A mean; this is termed the “mean of the sample means”.
(c) A standard deviation; this is termed the “standard deviation of the sample
means”.
For the data of Table 4.2 these latter two (sample) quantities are
x̄ = 3.5 and s = 0.519 --- (1)
as you can check from the frequency table.

Table 4.2: 100 repetitions of 9 throws of a fair dice, with mean values recorded.

Fig.4.5: Histograms for 4 simulations (each representing 900 dice throws) of sample means

(a) Simulation 1 (b) Simulation 2


(c) Simulation 3 (d) Simulation 4

Some Conclusions: The most obvious feature of Fig. 4.5(a) is the appearance of
what looks like a normal distribution; in words the sample means appear to follow a
normal distribution. We can confirm this by running further simulations, three of which
are illustrated in Fig.4.5. Although there is clearly some variability present, we can
discern a normal distribution starting to appear.
In addition, we know, from our work in Units 5 and 6, that a normal distribution is
characterised by its mean and standard deviation. From (1) we can see what the
sample mean and standard deviation are, but what are they estimates of? To answer
this we need to identify the underlying population – see Fig.4.3. Since we are
throwing a dice (9 times) the population is {1, 2, 3, 4, 5, 6}, each value occurring with
equal probability. This defines the population probability distribution in Fig.4.6.

X 1 2 3 4 5 6
Probability 1/6 1/6 1/6 1/6 1/6 1/6

Fig.4.6: Population Probability Distribution.

Using (3) and (4) in Section 8 of Unit 5 gives us the population mean and standard
deviation as μ = (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5 --- (2a)
and σ² = (1² + 2² + 3² + 4² + 5² + 6²)/6 - 3.5² = 91/6 - 12.25 = 2.9167

Hence σ = √2.9167 = 1.7078 --- (2b)


We need to compare (1) and (2):
Mean Here x̄ = μ = 3.5, but the exact equality is fortuitous; see (4) below. But, in
general, the sample mean x̄ approximates the population mean μ.
StDev However s = 0.519 is much smaller than σ = 1.7078, and it doesn’t look like s
is trying to approximate σ. In fact there are two important points here:


• When we look at the average value in a sample the variation of the sample
mean (as measured by s) is smaller than the variation in the original
population. Intuitively we can see why this is so from the sample means shown
in Table 4.2 and also from the x-scales in Fig.4.5. Although any x value
{1,2,3,4,5,6} is equally likely, the values of x̄ are not equally likely. Indeed the
means in Table 4.2 include nothing smaller than 2.25 and nothing larger than 5.
The probability of obtaining a sample mean of, say, 6 in a sample of size 9 is
(1/6)^9 = 9.9*10^-8 (why?). Compare this with the probability of 1/6 of obtaining
an individual (population) value of 6. This reduced range of x̄ values reduces
the variation in (sample) mean values.
• The precise factor by which the variation is reduced is very important. Since
σ/s = 1.7078/0.519 = 3.29 and the sample size n = 9 = 3² --- (3)
it appears that the reduction is by a factor of √n.

Further evidence for these conclusions (that x̄ approximates μ and s approximates
σ/√n) is provided by the values in (1) for the remaining three simulations depicted in
Fig.4.5. These are:
Simulation 2: x̄ = 3.52 and s = 0.546 (with σ/s = 3.13)
Simulation 3: x̄ = 3.57 and s = 0.584 (with σ/s = 2.92)
Simulation 4: x̄ = 3.50 and s = 0.545 (with σ/s = 3.13) --- (4)
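A Python version of this experiment (an illustrative sketch, not the SampleMean.xlsx workbook itself) makes the same point: the mean of the 100 sample means lies close to μ = 3.5, and their standard deviation lies close to σ/√9 ≈ 0.57.

    import random

    num_samples, sample_size = 100, 9

    # Mean of each sample of 9 dice throws
    sample_means = [sum(random.randint(1, 6) for _ in range(sample_size)) / sample_size
                    for _ in range(num_samples)]

    mean_of_means = sum(sample_means) / num_samples
    var_of_means = sum((m - mean_of_means) ** 2 for m in sample_means) / (num_samples - 1)

    print("Mean of the sample means     :", round(mean_of_means, 3))       # near 3.5
    print("Std. dev. of the sample means:", round(var_of_means ** 0.5, 3)) # near 0.57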

5. The Central Limit Theorem


Our discussions in Section 4 are summarised in the following fundamental result:

Central Limit Theorem We are given a probability distribution with mean μ and
standard deviation σ.
(a) As the sample size n increases, the distribution of the sample mean x̄
approaches a normal distribution with mean μ and standard deviation σ/√n.
(b) If the (original) probability distribution is actually a normal distribution itself, then
the result in (a) holds for any value of n.
More informally
(a) For a “large” sample the sample mean has an (approximate) normal distribution
whatever the original (parent) distribution.


(b) If the original (parent) distribution is normal the sample mean has an exact
normal distribution.
More formally, if the distribution of X has mean μ and variance σ²:

(a) For large n, X̄ = N(μ, σ²/n) (approximately)

(b) If X = N(μ, σ²), then X̄ = N(μ, σ²/n) for any value of n.

Sometimes these results are written in the compact form

μ_X̄ = μ_X and σ_X̄ = σ_X / √n --- (5)

Here the subscript (X or X̄) denotes which distribution we are referring to.

Whichever formulation of CLT you are comfortable with, the result itself is possibly
the most important result in the whole of statistics. The two important points to always
bear in mind are the appearance of the normal distribution, and the reduction of
variability (standard deviation) by a factor of √n.
You should now be able to answer the questions in the final box of Fig.4.3. In
addition there are various additional points you should appreciate about CLT.
However, before discussing these, we look at an example of CLT in use. For this we
return to the theme of Example 4.1.

Example 5.1: Female heights (X) follow a normal distribution with mean 163 cm. and
standard deviation 3.5 cm. A market researcher takes a random sample of 10
females. What is the probability the sample mean:
(a) is less than 162 cm. (b) is more than 165 cm.
(c) is between 162.5 cm. and 163.5 cm.

Solution: Histograms of the two distributions, heights and mean heights, are
depicted in Fig.5.1. The distributions are centred at the same place μ = 163, but the
sampling distribution has standard deviation 3.5/√10 = 1.107. This is approximately
one third of the standard deviation of the height distribution.
From Fig.5.1 we can clearly see that probability values will depend on the distribution
we use. For example P(X < 162) appears (top graph) larger than (bottom graph)
P(X̄ < 162). Computationally we have the two important formulae:

Individual values: Z = (X value - Mean) / (Standard Deviation) = (X - μ) / σ --- (6)

Sample mean values: Z = (X̄ value - Mean) / (Standard Deviation) = (X̄ - μ) / (σ/√n) --- (7)


As a check, if we set the sample size n = 1 in (7) we recover (6).

Fig.5.1: Histograms for (a) Heights and (b) Mean Heights in Example 5.1.

(a) Here, with X̄ = 162: Z = (162 - 163) / (3.5/√10) = -1/1.1068 = -0.904
Hence P(X̄ < 162) = P(Z < -0.90) = 0.5 - 0.3159 = 0.18 (2D)
(See Unit 6 Section 9 if you don’t understand this calculation.)
(b) With X̄ = 165: Z = (165 - 163) / (3.5/√10) = 2/1.1068 = 1.807
Hence P(X̄ > 165) = P(Z > 1.81) = 0.5 - 0.4649 = 0.035 (3D)
(c) Finally, with X̄ = 162.5: Z = (162.5 - 163) / (3.5/√10) = -0.5/1.1068 = -0.4518
Hence P(162.5 < X̄ < 163.5) = P(-0.45 < Z < 0.45) = 2*0.1736 = 0.35 (2D)
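These probabilities are easily checked in code. The sketch below (Python with scipy) applies (7) for the n = 10 case of Example 5.1; changing n to 100 gives the corresponding figures for Example 5.2.

    from scipy.stats import norm

    mu, sigma, n = 163, 3.5, 10          # set n = 100 to reproduce Example 5.2
    se = sigma / n ** 0.5                # standard deviation of the sample mean (CLT)

    print("P(mean < 162)           =", round(norm.cdf(162, mu, se), 3))
    print("P(mean > 165)           =", round(1 - norm.cdf(165, mu, se), 3))
    print("P(162.5 < mean < 163.5) =", round(norm.cdf(163.5, mu, se) - norm.cdf(162.5, mu, se), 3))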

A very important issue is the role played by the sample size in (5). The following
example illustrates the general idea.

Example 5.2: Repeat Example 5.1 with a sample size of n = 100.

Solution: It is of considerable interest to see what happens to the probabilities in


Example 5.1 if we (greatly) increase the sample size n. Repeating the calculations:
(a) With X̄ = 162: Z = (162 - 163) / (3.5/√100) = -1/0.35 = -2.857
Hence P(X̄ < 162) = P(Z < -2.86) = 0.5 - 0.4979 = 0.002 (3D)


(b) With X̄ = 165: Z = (165 - 163) / (3.5/√100) = 2/0.35 = 5.71
Hence P(X̄ > 165) = P(Z > 5.71) = 0.5 - 0.5 = 0.0000 (4D)
This probability is zero to 4D (accuracy of tables).
(c) Finally, with X̄ = 162.5: Z = (162.5 - 163) / (3.5/√100) = -0.5/0.35 = -1.429
Hence P(162.5 < X̄ < 163.5) = P(-1.43 < Z < 1.43) = 2*0.4236 = 0.85 (2D)

The reason for the very considerable change in probabilities is revealed if we sketch
the second graph in Fig.5.1 for n = 100. The result is Fig.5.2 and we can very clearly
see how “much more concentrated” the distribution (of means) is about 163. It
becomes “highly likely” (as (c) shows) to obtain a sample mean “very close” to 163,
and “very unlikely” (as (a) and (b) show) to find a sample mean “very far” from 163.

Fig.5.2: Histograms for (a) Heights and (b) Mean Heights in Example 5.2.

Note The behaviour of sample distributions, as the sample size increases, is of


fundamental importance. The larger the sample we have, the smaller the variation
in the sample mean. How large a sample we need to ensure a
sufficiently small sample variation is an important issue.

6. Theoretical Sampling Distributions


There are various important issues relating to the Central Limit Theorem (CLT).


1. Initial Distribution Normal


If the initial population X is normally distributed, X = N(μ, σ²), then the distribution of
sample means is also (exactly) normally distributed, X̄ = N(μ, σ²/n), for any value of
the sample size n. This is illustrated in Fig.6.1 for μ = 10 and σ = 2, and you should
observe the dramatic reduction in the standard deviation.

(a) Initial distribution: normal (b) Sampling distribution of the mean: normal (n = 9)
Fig.6.1: Illustration of CLT for X Normally Distributed.

2. Initial Distribution Non Normal


If the initial population X is not normally distributed then we require a “large” sample
size (n) for the distribution of the sample mean to approach a normal distribution.
This is illustrated in Fig.6.2 where we start from a uniform distribution and examine
the distribution of X̄ for increasing values of n. These illustrations are taken from
HyperStat Online: An Introductory Statistics Textbook available at
http://davidmlane.com/hyperstat/index.html
This is an online statistics textbook with links to other web resources. In fact this
material is part of the larger website Rice Virtual Lab in Statistics, available at


http://onlinestatbook.com/rvls/ , which you may care to look at. (These links are in
addition to those mentioned in Unit 2 Section 2.)


Fig.6.2: Illustration of CLT for X Uniformly Distributed with n = 1, 4, 7 and 10.

Rule of Thumb Just how large n needs to be (for a normal distribution for X to be
apparent) will depend on how far the distribution of X is from normality, and no
precise rules can be given. However a “rule of thumb”, which is found to work well in
practice, states that n > 30 is considered a “large sample”. By this we mean the
distribution of X will resemble a normal distribution.
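If you would like to experiment with this "rule of thumb" outside Excel, the following short Python sketch (an optional illustration, assuming the numpy and matplotlib libraries are available) simulates the distribution of the sample mean drawn from a uniform population for several sample sizes, in the spirit of Fig.6.2:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    fig, axes = plt.subplots(1, 4, figsize=(12, 3))

    for ax, n in zip(axes, [1, 4, 7, 10]):
        # 10,000 samples, each of size n, from a uniform population on (0, 20)
        means = rng.uniform(0, 20, size=(10_000, n)).mean(axis=1)
        ax.hist(means, bins=30)
        ax.set_title("n = " + str(n))

    plt.tight_layout()
    plt.show()   # the histograms become noticeably more bell-shaped as n increases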

3. Simulated Distributions
There is one very important point that can initially be confusing. In Fig.6.1 we depict a
theoretical population X and the corresponding theoretical distribution for X ; in this
particular case both are normal distributions. In Examples 4.2 and 4.3 we are
simulating these distributions (using various features of Excel) and, in such
circumstances, we only obtain an approximation to the underlying (theoretical)
distributions. Indeed we can clearly see the following:
• In Fig.4.4 we do not obtain an exact uniform distribution (for X).


• In Fig.4.5 we do not obtain the exact distribution for X since this is unique (and
quite difficult to determine theoretically). Clearly Fig.4.5 is giving us four
different approximations to this (unknown) distribution.
In both cases we can obtain better and better approximations to the underlying distributions by increasing indefinitely the number of samplings (the number of dice = 36 in Example 4.2 and the number of samples = 100 in Example 4.3). But, for example, repeating Example 4.3 with 100,000 samples is very time consuming!
This distinction between underlying distributions (represented by curves), and
simulated ones (represented by histograms), should always be borne in mind.

7. Simulating the t-Distribution


Recall from Example 5.1 the computational version (7) of the Central Limit Theorem
(CLT). If we know the population standard deviation σ we “standardise” our data (x-
values) to obtain (normal distribution) z-values via
z = (x̄ - μ)/(σ/√n)   --- (7)

(Recall the purpose of this is so that we have to tabulate the areas under only a
single standard normal distribution.) However, the standard deviation is quite a
complicated quantity (average squared deviation from the mean) and it is entirely
possible (even probable) that the population standard deviation σ is not known. In
such a case the logical step is to estimate it from our sample. Calculating the sample
standard deviation s we then form the “standardised” t-value via
t = (x̄ - μ)/(s/√n)   --- (8)

The difference between (7) and (8) is more than just a matter of changing symbols
(from z to t and σ to s). To see what is involved we perform a simulation.

Example 7.1: The Excel workbook tDist contains the spreadsheet Barrels1 and
illustrates how the t-distribution arises when sampling from a normal population when
the standard deviation is unknown. The title “Barrels” refers to a piece of history. The
t-distribution was derived by W.S.Gossett in 1908 whilst he was conducting tests on
the average strength of barrels of Guinness beer. His employers (Guinness
breweries in Dublin) would not allow employees to publish work under their own
names, and he used the pseudonym ‘Student’. For this reason the distribution is also
often termed ‘Student’s t-distribution’.


Table 7.1: Illustration of Normal and t-distributions for “beer barrel” example.

Spreadsheet Table 7.1 illustrates the calculations, based on samples of size 3 (cell E3) drawn from a N(5, 0.1²) distribution.
• In cells C9-E9 we generate three normally distributed values using the Excel command NORMINV(RAND(),B3,B4). These values are intended to represent the amount of beer in each of the three barrels. (This calculation is a slightly modified version of the original computation.)
• We calculate the sample mean in cell G9, and this allows us to compute a Z-
value using (7). Thus cell H9 contains
Z = (5.10545 - 5)/(0.1/√3) = 1.82642   --- (9a)

• If we assume the value of σ in cell B4 is unknown, we need to compute the best estimate we can via the sample standard deviation. Whilst we can do this using the standard deviation formulae of Unit 4, Excel has the built-in STDEV command. Thus in cell F9 we enter =STDEV(C9:E9) to give 0.0845.
- Note this is meant to be an estimate of σ = 0.1 (and is quite reasonable).
- We shall say more about the STDEV command below – see Example 8.2.
• We can now use (8) to compute a t-value, and cell I9 contains
t = (5.10545 - 5)/(0.05001/√3) = 3.65238   --- (9b)


Obviously the z and t values differ due to the difference between (9a) and (9b).
In addition, we have shown cells C9-I9 in Table 7.1 to the increased accuracy
(5 dp) necessary to obtain z and t values correct to 3D.
• We now repeat these calculations a set number of times – we have chosen 200,
as shown in cells K27 and L27.
• From these 200 z and t-values we form two histograms based on the frequency
tables shown in columns J to L. These are shown superimposed in Fig.7.1.

Fig.7.1: Normal and t-distributions (histograms) for “beer barrel” example.

Observations The most important feature of the above calculations is the following.
Replacing σ (which is constant, but generally unknown)
with s (which is known but varies from sample to sample)
introduces extra variation. Specifically in all calculations of the form (9a) the
denominator contains the same σ = 0.1 factor, but in (9b) the s = 0.05001 factor
changes with each such calculation. This extra variation has the consequence that t-
values are more variable than z-values, i.e. there are more “large” t-values than
“large” z-values. This is clear both from the “tails” of Fig.7.1 and from the columns J-L
in Table 7.1. In particular, whereas the z-values never get larger than 4, the t-values do (see cells K25 and L25).
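The same experiment can be reproduced outside the tDist workbook. Purely as an illustrative sketch in the same spirit as the Barrels1 spreadsheet (assuming Python with numpy), the code below draws 200 samples of size 3 from N(5, 0.1²) and forms the z-values (7) and t-values (8) for each sample; the t-values show the fatter tails described above:

    import numpy as np

    rng = np.random.default_rng(2)
    mu, sigma, n, reps = 5, 0.1, 3, 200

    samples = rng.normal(mu, sigma, size=(reps, n))   # 200 samples of three "barrels"
    xbar = samples.mean(axis=1)
    s = samples.std(axis=1, ddof=1)                   # sample sd, like Excel's STDEV

    z = (xbar - mu) / (sigma / np.sqrt(n))            # known sigma, formula (7)
    t = (xbar - mu) / (s / np.sqrt(n))                # estimated s, formula (8)

    print("proportion of |z| values above 3:", np.mean(np.abs(z) > 3))
    print("proportion of |t| values above 3:", np.mean(np.abs(t) > 3))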

A Second Simulation Observe that, because of our very small sample size n = 3,
there is really a great deal of variation in both the z- and t-values – see Table 7.1.
Intuitively we do not expect to get very reliable information from such a small sample.
We can confirm this by running a second simulation (just recalculate by pressing F9).
Results are shown in Table 7.2 and Fig.7.2. Again observe the behaviour at the
“tails” of the (normal and t) distributions.


Table 7.2: A Second Simulation (n = 3).

Fig.7.2: Histogram for Simulation 2.

Changing the Sample Size We can reduce the variation present by increasing n. In the spreadsheet Barrels2 we have set n = 9. In Fig.7.3 we can see that there are now many fewer values in the tails and, in particular, few with a (z or t) value greater than 3. However there are still more such t-values than z-values. Very importantly, although maybe not apparent from Fig.7.3, we obtain a different t-distribution for each sample size n.

Fig.7.3: Two Simulations with n = 9.


8. Properties of the t-Distribution


Our simulated histograms of Section 7 give rise to the theoretical t-distributions
depicted in Fig.8.1; the latter are obtained on the t-Distribution spreadsheet of the
Excel workbook tDist. Note the following:

Fig.8.1: t-distribution and the limiting normal form

• As remarked earlier, whereas there is only a single standard normal distribution, there are (infinitely) many “standard” t-distributions depending on
- the sample size n or, more conventionally,
- the number of degrees of freedom ν (the Greek letter nu, pronounced “new”)
ν = n – 1   --- (10)
An explanation of the idea of “degrees of freedom” is given below.
• As with the normal distribution, t-distributions are symmetric (about t = 0).
• A t-distribution always has more area (probability) in the “tails” than a normal distribution. The occurrence of such extreme values is often described by the phrase “fat tails”, and is a characteristic feature of stock returns. This feature has already been observed in Figs.7.1 and 7.2.
• Although the difference between the t and normal distributions is never “very
great” – even for small sample sizes – this difference is sufficient to produce
noticeable effects in many “small sample” problems.
• As the number of degrees of freedom (i.e. sample size) increases the t-
distribution tends towards the normal distribution (see Example 8.1 below).
Intuitively the larger the sample the more information (about the population
standard deviation) it contains. In view of this one has the following general rule:

If σ is unknown but the sample size is “large” use the normal distribution
If σ is unknown and the sample size is “small” use the t-distribution


t Tables
For hand computation we require, in place of normal distribution tables (Unit 6 Table
9.1), so-called t-tables. The t-distribution tables differ from the normal tables because
we need a lot of different values for different sample sizes. We cannot include as
much detail in the t-distribution tables as in the normal tables otherwise the tables
would be quite bulky - we would need one complete table for each sample size.
Instead we have t-values for a selection of areas in the right hand tail of the
distribution as depicted in Table 8.1. The following is conventional:


Table 8.1: T Distribution Tables


Entries in the table give t values for an area in the upper tail of the t distribution. For
example, with 5 degrees of freedom and a .05 area in the upper tail, t = 2.015.
Area in Upper Tail
Degrees of Freedom      .10      .05      .025      .01      .005
1 3.078 6.314 12.71 31.82 63.66
2 1.886 2.920 4.303 6.965 9.925
3 1.638 2.353 3.182 4.541 5.841
4 1.533 2.132 2.776 3.747 4.604

5 1.476 2.015 2.571 3.365 4.032


6 1.440 1.943 2.447 3.143 3.707
7 1.415 1.895 2.365 2.998 3.499
8 1.397 1.860 2.306 2.896 3.355
9 1.383 1.833 2.262 2.821 3.250

10 1.372 1.812 2.228 2.764 3.169


11 1.363 1.796 2.201 2.718 3.106
12 1.356 1.782 2.179 2.681 3.055
13 1.350 1.771 2.160 2.650 3.012
14 1.345 1.761 2.145 2.624 2.977

15 1.341 1.753 2.131 2.602 2.947


16 1.337 1.746 2.120 2.583 2.921
17 1.333 1.740 2.110 2.567 2.898
18 1.330 1.734 2.101 2.552 2.878
19 1.328 1.729 2.093 2.539 2.861

20 1.325 1.725 2.086 2.528 2.845


21 1.323 1.721 2.080 2.518 2.831
22 1.321 1.717 2.074 2.508 2.819
23 1.319 1.714 2.069 2.500 2.807
24 1.318 1.711 2.064 2.492 2.797

25 1.316 1.708 2.060 2.485 2.787


26 1.315 1.706 2.056 2.479 2.779
27 1.314 1.703 2.052 2.473 2.771
28 1.313 1.701 2.048 2.467 2.763
29 1.311 1.699 2.045 2.462 2.756

30 1.310 1.697 2.042 2.457 2.750


40 1.303 1.684 2.021 2.423 2.704
60 1.296 1.671 2.000 2.390 2.660
120 1.289 1.658 1.980 2.358 2.617
∞ 1.282 1.645 1.960 2.326 2.576


• Degrees of freedom are tabulated in steps of one up to 30 (the conventional cut-off between “small” and “large” sample sizes as given in the “Rule of Thumb” in Section 6.2).
• Beyond 30 only a “few” values are tabulated, in larger steps (in such regions the normal distribution will often provide a good approximation).
• Only selected tail probabilities are tabulated, i.e. areas in the upper or lower tails of the distribution. Our tables provide only the upper tails, but Excel allows both to be computed.
• One can often use the symmetry of the t-distribution (about 0) to compute two-tailed probabilities from the tabulated one-tailed ones.

Remember the t and normal (z) tables tabulate different types of areas (probabilities).
You may have to do some preliminary manipulation(s) before using the appropriate table.

Example 8.1: Find the t-value for a 2-sided


(a) 95% confidence interval with a sample size of 15
(b) 99% confidence interval with a sample size of 25
Solution: We shall explain the terminology “confidence interval” in Unit 8. At present just take the given probabilities to be the “central” areas depicted in Figs.8.2a and 8.2b.
(a) We need the following simple steps:
• n = sample size = 15 ⇒ df = degrees of freedom = n - 1 = 14 (see (10))
• For the shaded area of 0.95 in Fig.8.2a
Upper tail area = 0.025 (half of 5%)
• From t-tables t = 2.145
Fig.8.2a: t-distribution computations for Example 8.1(a)


(b) In the same way


• n = sample size = 25 ⇒ df = degrees of freedom = n - 1 = 24
• For the shaded area of 0.99 in Fig.8.2b
Upper tail area = 0.005 (half of 1%)
• From t-tables t = 2.797

Fig.8.2b: t-distribution computations for Example 8.1(b)

N.B. Packages such as Excel (and SPSS) give t-values as part of the output of an
appropriate computation; in this context you will not need t-tables. However it is very
important you understand the idea behind the t-distribution, together with its
connection to, and similarity with, the normal distribution. In this sense the
computations of Example 8.1 are important.
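For example, if you happen to have access to Python with the scipy library, the two critical values found above from the t-tables can be checked as follows (an optional aside; the module itself only requires the tables or Excel):

    from scipy.stats import t

    # 95% two-sided interval, n = 15: upper tail area 0.025, df = 14
    print(round(t.ppf(1 - 0.025, df=14), 3))   # 2.145
    # 99% two-sided interval, n = 25: upper tail area 0.005, df = 24
    print(round(t.ppf(1 - 0.005, df=24), 3))   # 2.797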

Degrees of Freedom
The t-distribution is thought to be “difficult” because its precise shape depends on the sample size n. To make matters worse, this dependence on sample size is phrased in terms of what appears, at first sight, to be something rather more complicated.
The term degrees of freedom (df) is a very commonly occurring one in statistics and,
as we shall see in later units, is invariably output by statistical software packages
(such as Excel and SPSS). The essential idea is illustrated by the following example;
you may care to re-read Section 9 of Unit 4 first.

Example 8.2: We shall try and explain the origin of (10), i.e. ν=n–1
The basic argument is the following:
• In order to use our basic result (8) we need to compute s (sample value).
• To compute s we need to know the sample mean. Recall the formula (9) of Unit 4 Section 8:
s² = (1/n) Σ (xᵢ - x̄)²   (sum over i = 1, ..., n)   --- (11a)

• This knowledge (of X ) puts one constraint on our (n) data values. If we know
the mean of a set of data we can “throw away” one of the data values and “lose
nothing”. Thus, if our (sample) data values are
10 , 15 , 20 , 25 , 30


with known mean = 20

then each of the following data sets contain precisely the same information:

10 , 15 , 20 , 25 , * (mean = 20)
10 , 15 , 20 , * , 30 (mean = 20)
10 , 15 , * , 25 , 30 (mean = 20)
10 , * , 20 , 25 , 30 (mean = 20)
* , 15 , 20 , 25 , 30 (mean = 20)

In each case the missing entry * is uniquely determined by the requirement the
mean (of the 5 values) is 20.
• In effect we have “lost” one data value (degree of freedom) from the data once the sample mean x̄ is known. This is often explicitly indicated by changing (11a) into
s² = (1/(n - 1)) Σ (xᵢ - x̄)²   (sum over i = 1, ..., n)   --- (11b)
The difference between (11a) and (11b) is only noticeable for “small samples”.
• Note that none of these difficulties occur when using (7) with σ known.

General Statement
• Estimates of statistical parameters are often based on different amounts of
information (data). The number of independent pieces of information that go into
the estimate of a parameter is called the degrees of freedom (of the parameter).
• Thus, if we use the sample standard deviation (s) as an estimate of the
population standard deviation (σ) the estimate is based on (n – 1) df. This is the
origin of the denominator in (11b) and the result (10).
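The distinction between the divisor n in (11a) and n - 1 in (11b) is exactly the distinction numpy draws with ddof=0 and ddof=1 (and Excel draws with STDEVP and STDEV). A small illustrative sketch, assuming Python with numpy:

    import numpy as np

    data = np.array([10, 15, 20, 25, 30])    # the sample used in Example 8.2

    var_n   = data.var(ddof=0)   # divisor n, as in (11a)     -> 50.0
    var_nm1 = data.var(ddof=1)   # divisor n - 1, as in (11b) -> 62.5

    print(var_n, var_nm1)        # the difference only matters much for small samples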

9. The χ² Distribution
In Units 8 and 9 we shall start squaring and adding data values – indeed we have already done so in (11) above. A very important situation arises when our data is normally distributed, and we want to see what happens when we square. This gives rise to the so-called Chi-square (χ²) distribution.

1. Simulating the Chi-Square Distribution


Example 9.1: The Excel workbook ChiSquare contains the spreadsheet ChiSq_n=3 shown in Table 9.1, used to simulate the sum of squares of 3 random normal variables.


Table 9.1: Computation of Chi-square histograms with n = 3

Spreadsheet: Cells A6-C6 contain random normal variables with the mean and
standard deviation indicated in B2 and B4 respectively. For simplicity we have
chosen standard normal variables.
• The squares of these variables are computed in cells D6-F6, and the sum of
these cells is placed in G6.
• The whole procedure is then repeated a “large” number of times in order to
obtain a representative histogram. As in Example 7.1 we have chosen 200
times.
• A frequency table of the results in column G is then compiled in columns I-J,
and a histogram produced. For the data of Table 9.1 we obtain the first
histogram of Fig.9.1.

Observations The most important feature of the above calculations is the following.
The histogram (and frequency table) is skewed in such a way that “small” values
occur much more frequently than “large” values.
• We can understand the scales involved by recalling that, if z = N(0,1), then (almost always) -3 < z < 3 and hence, when we square, 0 < z² < 9. Adding three such variables together will give us a sum in the interval (0,27). In Fig.9.1 the x-scale is roughly one half of this, indicating the lack of “large” values.
• At first sight it may seem surprising that the histogram is not symmetric, given the symmetry of the underlying normal distribution; the reason is that squaring makes every value non-negative, so the values pile up near zero and the distribution is skewed to the right.


Fig.9.1: Histograms obtained from spreadsheet of Table 9.1.

Further Simulations Performing further simulations (using F9) produces similar results, as Fig.9.1 illustrates. In each case the skewed nature of the histogram is apparent, with small values occurring more frequently than large ones.
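If you want to reproduce this behaviour outside the ChiSquare workbook, the following Python sketch (an optional aside, assuming numpy) sums the squares of three standard normal values 200 times, in the spirit of Table 9.1:

    import numpy as np

    rng = np.random.default_rng(3)
    reps, n = 200, 3

    z = rng.standard_normal(size=(reps, n))   # 200 rows, each of 3 standard normal values
    chi_sq = (z ** 2).sum(axis=1)             # sum of the 3 squares in each row

    # all values are non-negative and the distribution is right-skewed, as in Fig.9.1
    print(chi_sq.min(), np.median(chi_sq), chi_sq.max())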

Changing the Number of Variables The use of 3 (normal) variables in Table 9.1 is
arbitrary. The spreadsheet ChiSq_n=9 increases this to 9 and representative results
are displayed in Fig.9.2. We note similar results to those obtained in Section 7 in
connection with the t-distribution:
• As we increase the number of variables the histograms become less skewed and more closely resemble a normal distribution.
• There is a different (chi-square) distribution for each number of variables
added.


Fig.9.2: Histograms obtained for sum of 9 normal variables

2. Properties of the Chi-Square Distribution


Our simulated histograms give rise to the theoretical Chi-square distributions
depicted in Fig.9.3; the latter are obtained on the ChiDistribution spreadsheet of the
ChiSquare workbook. Note the following:

Fig.9.3: Chi-square distributions and the limiting normal form

• There are (infinitely) many Chi-square distributions depending on the number of (normal) variables n added. Conventionally n is termed the number of degrees of freedom.
• Chi-square distributions are always skewed, but become increasingly less so
as n increases.
• As n increases the Chi-square distribution tends towards the normal
distribution


Chi-Square Tables
We will not usually need to do any explicit calculations involving the Chi-square
distribution. However it is useful to be able to check computer output using
appropriate tables. As shown in Table 9.2 specified right hand tail values (areas) of
the distribution are tabulated in a very similar manner to the t-tables of Section 8.

Example 9.2: Find the χ² values corresponding to
(a) an upper tail probability of 0.1 with 5 degrees of freedom, and
(b) an upper tail probability of 0.05 with 20 degrees of freedom.

Solution: (a) Here we need df = 5 and Area in Upper Tail = 0.1. Table 9.2 gives the value χ² = 9.24. The meaning of this is that 10% of the distribution (specified by df = 5) lies above the value 9.24.
(b) Similarly with df = 20 and Upper Tail Area = 0.05, Table 9.2 gives χ² = 31.41.

Since we are adding more (normal) variables together we would expect the chi-
square value to have increased from its value in (a). Note that df does not go above
20. For larger values than this we would use the appropriate normal distribution.
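As with the t-tables, these tabulated values can be checked in software. A minimal sketch assuming Python with scipy:

    from scipy.stats import chi2

    print(round(chi2.ppf(1 - 0.10, df=5), 2))    # 9.24  (part (a))
    print(round(chi2.ppf(1 - 0.05, df=20), 2))   # 31.41 (part (b))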

We shall return to these tables in Unit 9. For now remember that, when sums of
squares of normally distributed random variables are involved in a calculation, the
Chi-square distribution will be involved (either explicitly or implicitly).


Table 9.2: χ² Distribution


Entries in the table give χ² values (to 2 decimal places) for an area in the upper tail of the χ² distribution. For example, with 5 degrees of freedom and a .05 area in the upper tail, χ² = 11.07.

Area in Upper Tail
Degrees of Freedom      .10      .05      .025      .01      .005
1 2.70 3.84 5.02 6.63 7.88
2 4.61 5.99 7.38 9.21 10.60
3 6.25 7.81 9.35 11.34 12.84
4 7.78 9.49 11.14 13.28 14.86

5 9.24 11.07 12.83 15.09 16.75


6 10.64 12.59 14.45 16.81 18.55
7 12.02 14.07 16.01 18.48 20.28
8 13.36 15.51 17.53 20.09 21.96
9 14.68 16.92 19.02 21.67 23.59

10 15.99 18.31 20.48 23.21 25.19


11 17.28 19.68 21.92 24.73 26.76
12 18.55 21.03 23.34 26.22 28.30
13 19.81 22.36 24.74 27.69 29.82
14 21.06 23.68 26.12 29.14 31.32

15 22.31 25.00 27.49 30.58 32.80


16 23.54 26.30 28.85 32.00 34.27
17 24.77 27.59 30.19 33.41 35.72
18 25.99 28.87 31.53 34.81 37.16
19 27.20 30.14 32.85 36.19 38.58

20 28.41 31.41 34.17 37.57 40.00


10. The F Distribution


One final distribution that occurs frequently in practice is the so called F distribution,
named in honour of the great English statistician Ronald Fisher (see
http://en.wikipedia.org/wiki/Ronald_Fisher for some details). The distribution again
arises when we look at sums of squares of normal variables, but now we are
interested in the ratio of two such sums (for reasons we discuss in Unit 9). Recall
that when we consider standard N(0,1) normal variables
χn² = Z1² + Z2² + ... + Zn²   --- (12)

has a chi-square distribution with n degrees of freedom (since it is the sum of squares of n independent standard normal variables). In a similar way we define the quantity
Fm,n = (Z1² + Z2² + ... + Zm²) / (Z1² + Z2² + ... + Zn²)   --- (13)
to have an F-distribution with (m,n) degrees of freedom (the normal variables in the numerator and denominator being independent of each other).
1. Simulating the F Distribution
Example 10.1: The Excel workbook FDistribution contains the spreadsheet
FDist_(2,4) shown in Table 10.1.
Spreadsheet: Cells A6-E6 contain the computations required to evaluate the
numerator in (13), and cells F6-N6 the corresponding denominator calculations.
• Cell O6 then contains the F-ratio in (13), and is shown in Fig.10.1 overleaf.
• The whole procedure is then repeated a “large” number of times in order to
obtain a representative histogram. As in Example 7.1 we have chosen 200
times.
• A frequency table of the results in column O is then compiled in columns Q-R,
and a histogram produced. For the data of Table 10.1 we obtain the frequency
table and histogram of Fig.10.1.

Table 10.1: Computation of F-distribution histograms with m = 2 and n = 4


Observations The histogram (and frequency table) is even more skewed than for
the chi-square distribution – see Fig.9.1. Again “small” values occur much more
frequently than “large” values.
• We can appreciate the scales involved by recalling that, if z = N(0,1), then -3 < z
< 3 and hence, when we square, 0 < z2 < 9. Adding two such variables together
will give us a numerator sum in the interval (0,18); similarly the denominator
sum lies in (0,36). In general we would expect the ratio to be small since there
are more terms in the denominator.
• In view of this we would not expect the histogram to be symmetric despite the
symmetry of the underlying normal distribution.

Fig.10.1: F ratios, frequency table and histogram for data of Table 10.1

Further Simulations Performing further simulations (using F9) produces similar results, as Fig.10.2 illustrates. In each case the skewed nature of the histogram is apparent, with small values occurring more frequently than large ones.

Fig.10.2: Further simulated histograms
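For completeness, the ratio in (13) is easy to simulate outside the FDistribution workbook. The sketch below (an optional aside, assuming numpy) forms 200 such ratios with m = 2 squared normals in the numerator and n = 4 in the denominator, mirroring the spirit of the FDist_(2,4) spreadsheet:

    import numpy as np

    rng = np.random.default_rng(4)
    reps, m, n = 200, 2, 4

    num = (rng.standard_normal(size=(reps, m)) ** 2).sum(axis=1)   # numerator of (13)
    den = (rng.standard_normal(size=(reps, n)) ** 2).sum(axis=1)   # denominator of (13)
    ratio = num / den

    # small values dominate: the histogram of these ratios is strongly right-skewed
    print(np.median(ratio), ratio.max())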

Changing the Number of Variables The spreadsheet FDist_(5,3) depicts the case
m = 5 and n = 3 with representative results displayed in Fig.10.3. We note:
• Even though the range of the F-ratios has increased the histograms remain
skewed, with smaller values predominating.
• There is a different F- distribution for each set of (m,n) values.


Fig.10.3: Simulated histograms for m = 5 and n = 3.

2. Properties of the F Distribution


Our simulated histograms give rise to the theoretical F-distributions depicted in
Fig.10.4; the latter are obtained on the Distribution spreadsheet of the
FDistribution workbook. Note the following:

Fig.10.4: F-distributions and the limiting normal form

• There are (infinitely) many F-distributions depending on the number of degrees of freedom in the numerator (m) and denominator (n).
• F-distributions are always skewed, but become increasingly less so as both m
and n increase.
• As both m and n increase the F-distribution tends towards the normal
distribution
Compare these comments with those in Section 9 for the chi-square distribution.


F Tables
You will rarely need to do any explicit calculations involving the F-distribution.
However it is useful to be able to check computer output using appropriate tables,
especially since F ratios occur repeatedly when using regression models. (We shall
discuss this in detail in Units 9 and 10.)
As shown in Tables 10.2 and 10.3 F-distribution tables are more complicated than
our previous (normal, t and chi-square) tables since they depend on the two
parameters m and n. It is conventional to select a specific right hand tail probability
(often termed percentage points) and tabulate the F-value corresponding to this area
for selected values of m and n. You need to be careful since m is tabulated across
the top row, and n down the first column.

Example 10.2: Find the F-values corresponding to (a) an upper tail probability of 0.1
with (5,10) degrees of freedom, and (b) an upper tail probability of 0.05 with (20,20)
degrees of freedom.

Solution: (a) Here we need Table 10.2 with m = 5 and n = 10. This gives the value F = 2.52. The meaning of this is that 10% of the F-distribution specified by (m,n) = (5,10) lies above the value 2.52.


Table 10.2: Upper 10% points of the F distribution


ν1 =   1   2   3   4   5   6   7   8   9   10   12   24
ν2
1 39.86 49.50 53.59 55.83 57.24 58.20 58.91 59.44 59.86 60.19 60.71 62.00
2 8.53 9.00 9.16 9.24 9.29 9.33 9.35 9.37 9.38 9.39 9.41 9.45
3 5.54 5.46 5.39 5.34 5.31 5.28 5.27 5.25 5.24 5.23 5.22 5.18
4 4.54 4.32 4.19 4.11 4.05 4.01 3.98 3.95 3.94 3.92 3.90 3.83
5 4.06 3.78 3.62 3.52 3.45 3.40 3.37 3.34 3.32 3.30 3.27 3.19
6 3.78 3.46 3.29 3.18 3.11 3.05 3.01 2.98 2.96 2.94 2.90 2.82
7 3.59 3.26 3.07 2.96 2.88 2.83 2.78 2.75 2.72 2.70 2.67 2.58
8 3.46 3.11 2.92 2.81 2.73 2.67 2.62 2.59 2.56 2.54 2.50 2.40
9 3.36 3.01 2.81 2.69 2.61 2.55 2.51 2.47 2.44 2.42 2.38 2.28
10 3.29 2.92 2.73 2.61 2.52 2.46 2.41 2.38 2.35 2.32 2.28 2.18
11 3.23 2.86 2.66 2.54 2.45 2.39 2.34 2.30 2.27 2.25 2.21 2.10
12 3.18 2.81 2.61 2.48 2.39 2.33 2.28 2.24 2.21 2.19 2.15 2.04
13 3.14 2.76 2.56 2.43 2.35 2.28 2.23 2.20 2.16 2.14 2.10 1.98
14 3.10 2.73 2.52 2.39 2.31 2.24 2.19 2.15 2.12 2.10 2.05 1.94
15 3.07 2.70 2.49 2.36 2.27 2.21 2.16 2.12 2.09 2.06 2.02 1.90
16 3.05 2.67 2.46 2.33 2.24 2.18 2.13 2.09 2.06 2.03 1.99 1.87
17 3.03 2.64 2.44 2.31 2.22 2.15 2.10 2.06 2.03 2.00 1.96 1.84
18 3.01 2.62 2.42 2.29 2.20 2.13 2.08 2.04 2.00 1.98 1.93 1.81
19 2.99 2.61 2.40 2.27 2.18 2.11 2.06 2.02 1.98 1.96 1.91 1.79
20 2.97 2.59 2.38 2.25 2.16 2.09 2.04 2.00 1.96 1.94 1.89 1.77
21 2.96 2.57 2.36 2.23 2.14 2.08 2.02 1.98 1.95 1.92 1.87 1.75
22 2.95 2.56 2.35 2.22 2.13 2.06 2.01 1.97 1.93 1.90 1.86 1.73
23 2.94 2.55 2.34 2.21 2.11 2.05 1.99 1.95 1.92 1.89 1.84 1.72
24 2.93 2.54 2.33 2.19 2.10 2.04 1.98 1.94 1.91 1.88 1.83 1.70
25 2.92 2.53 2.32 2.18 2.09 2.02 1.97 1.93 1.89 1.87 1.82 1.69
26 2.91 2.52 2.31 2.17 2.08 2.01 1.96 1.92 1.88 1.86 1.81 1.68
27 2.90 2.51 2.30 2.17 2.07 2.00 1.95 1.91 1.87 1.85 1.80 1.67
28 2.89 2.50 2.29 2.16 2.06 2.00 1.94 1.90 1.87 1.84 1.79 1.66
29 2.89 2.50 2.28 2.15 2.06 1.99 1.93 1.89 1.86 1.83 1.78 1.65
30 2.88 2.49 2.28 2.14 2.05 1.98 1.93 1.88 1.85 1.82 1.77 1.64
32 2.87 2.48 2.26 2.13 2.04 1.97 1.91 1.87 1.83 1.81 1.76 1.62
34 2.86 2.47 2.25 2.12 2.02 1.96 1.90 1.86 1.82 1.79 1.75 1.61
36 2.85 2.46 2.24 2.11 2.01 1.94 1.89 1.85 1.81 1.78 1.73 1.60
38 2.84 2.45 2.23 2.10 2.01 1.94 1.88 1.84 1.80 1.77 1.72 1.58
40 2.84 2.44 2.23 2.09 2.00 1.93 1.87 1.83 1.79 1.76 1.71 1.57
60 2.79 2.39 2.18 2.04 1.95 1.87 1.82 1.77 1.74 1.71 1.66 1.51
120 2.75 2.35 2.13 1.99 1.90 1.82 1.77 1.72 1.68 1.65 1.60 1.45


Table 10.3: Upper 5% points of the F distribution


ν1 =   1   2   3   4   5   6   7   8   9   10   12   24
ν2
1 161.45 199.50 215.71 224.58 230.16 233.99 236.77 238.88 240.54 241.88 243.90 249.05
2 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37 19.38 19.40 19.41 19.45
3 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79 8.74 8.64
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96 5.91 5.77
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 4.68 4.53
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06 4.00 3.84
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 3.57 3.41
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35 3.28 3.12
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 3.07 2.90
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 2.91 2.74
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85 2.79 2.61
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 2.69 2.51
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 2.60 2.42
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60 2.53 2.35
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 2.48 2.29
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49 2.42 2.24
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45 2.38 2.19
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41 2.34 2.15
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38 2.31 2.11
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35 2.28 2.08
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32 2.25 2.05
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30 2.23 2.03
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27 2.20 2.01
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25 2.18 1.98
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24 2.16 1.96
26 4.23 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22 2.15 1.95
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25 2.20 2.13 1.93
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19 2.12 1.91
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22 2.18 2.10 1.90
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16 2.09 1.89
32 4.15 3.29 2.90 2.67 2.51 2.40 2.31 2.24 2.19 2.14 2.07 1.86
34 4.13 3.28 2.88 2.65 2.49 2.38 2.29 2.23 2.17 2.12 2.05 1.84
36 4.11 3.26 2.87 2.63 2.48 2.36 2.28 2.21 2.15 2.11 2.03 1.82
38 4.10 3.24 2.85 2.62 2.46 2.35 2.26 2.19 2.14 2.09 2.02 1.81
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08 2.00 1.79
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99 1.92 1.70
120 3.92 3.07 2.68 2.45 2.29 2.18 2.09 2.02 1.96 1.91 1.83 1.61


(b) Here we need Table 10.3 with (m,n) = (20,20). Unfortunately this particular combination is not tabulated, and we have two alternatives:
• Go to the nearest value, here (24,20), and use the value F = 2.08.
• Interpolate between the two nearest values, here (12,20) and (24,20). If you understand how to do this it will give
F = 2.28 – (8/12)*0.2 = 2.147
In practice we shall never require F-values to more than 1 decimal place (at most), and here we can safely quote the value F = 2.1. (In fact more accurate tables and computations give the value F = 2.124, so our result is acceptable.)
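Again these tabulated values can be cross-checked in software. A brief sketch assuming Python with scipy (the second line reproduces the more accurate value 2.124 quoted above):

    from scipy.stats import f

    print(round(f.ppf(1 - 0.10, dfn=5, dfd=10), 2))    # 2.52  (part (a))
    print(round(f.ppf(1 - 0.05, dfn=20, dfd=20), 3))   # 2.124 (part (b))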

We shall return to these F-tables in Units 8 and 9. For now remember that, when
ratios of sums of squares of normally distributed random variables are involved in a
calculation, the F-distribution will be involved (either explicitly or implicitly).


Summary
At the end of this rather long, and more theoretical, unit it is useful to bear in mind the
following general results:
• The normal distribution is the most important distribution in the whole of
statistics due to its appearance in the Central Limit Theorem.
• Sampling distributions arise when samples are taken from a population.
• Sample averages are of particular importance, and their distribution (in repeated
sampling) has much smaller variation than individual values.
• There are various (sampling) distributions, all based on the normal distribution,
which are of importance in practice. These arise in the following situations:
- We need to compute a sample standard deviation in addition to a sample
mean (the t-distribution).
- We need to compute sums of squares of (random normal) variables (the
chi-square distribution).
- We need to compute ratios of sums of squares (F-distribution).
• When computing quantities from a sample it is important to know how many
independent data values our computations are based on. This leads to the very
important concept of “degrees of freedom”.

Further Practical Exercises (on the Module Web Page) will give you some practice in
using the various distributions, and exploring their properties, in Excel. The Tutorial
Exercises will make you more familiar with hand computations involving the normal
distribution




8 Inference and Hypothesis Testing

Learning Outcomes
At the end of this unit you should be familiar with the following:
• Understand the concept of proportion, and how the CLT applies.
• Appreciate the idea of estimating population parameters using sample values.
• Understand how confidence intervals are constructed, and their interpretation.
• Recognize how hypothesis testing is carried out.
• Understand the meaning and computation of p-values.

"The mere rejection of a null hypothesis provides only meagre information"


"Whenever possible, the basic statistical report should be in the form of a
confidence interval"
(Collected statistical wisdom)


1. Introduction
In this unit we continue the theme of sampling and ask the general question of how
we can use sample values to estimate corresponding population values. In particular
we wish to investigate how much accuracy/reliability we can assign to sample values.
There are two main (inter-related) ways of doing this:
• Form so called confidence intervals for the population parameters.
• Test whether population parameters have prescribed values.

We shall need the ideas and techniques developed in Units 6 and 7 relating to the
normal and t-distributions to satisfactorily address these issues.
However, before doing this, we shall review some sampling material in a slightly
different context. An important point to bear in mind is that the sample mean is not
the only quantity we can extract from sample data. The sample mean is important
since, as we shall see, we can use it to approximate the population mean. However,
our interest in Unit 7 has been in some variable which can be measured (assigned a
numerical value) and averaged. We cannot always sensibly do this.

2. Distribution of the Sample Proportion


In many circumstances our concern is only whether a particular observation has or
does not have a certain characteristic. For example:
• A political party wants to know whether a voter is going to vote for the party.
• A soft drinks manufacturer wants to know if a consumer prefers their product.
• A doctor is interested in whether a drug does or does not help his patient.
• We may want to know if a stock price crosses a certain level, or does not.

In many such cases there is no meaningful average we can work out, since our
questions may only have “Yes” or “No” answers, with no associated numerical value.
Consider the case of a company sponsoring market research into a new product and
the impact of a TV advertising campaign designed to launch the product. A sample of
the public was selected and asked whether they had seen the recent TV adverts for
the product. Clearly their response would be either Yes or No (we shall discount the
Don’t Knows). At the end of the survey we would be able to calculate the proportion,
or percentage, of respondents who saw the adverts.
As was the case for the sample mean if the sampling process is repeated a number
of times we expect the different samples to provide different sample proportions, as
illustrated in Fig.2.1.


Population (proportion Π)
Sample no. 1    Sample no. 2    Sample no. 3    ...    Sample no. k
Proportion p1   Proportion p2   Proportion p3   ...    Proportion pk
A group of sample proportions p1, p2, p3, ..., pk

Fig.2.1: Sample Proportions

Notation: Because we reserve the letter P for probability we use


• p to denote a sample proportion (which varies from sample to sample)
• П (the capital Greek letter for P) to denote a population proportion (which is
constant, but usually unknown).

Remember: The following distinction is (sometimes) important:


• Proportions lie between 0 and 1.
• Percentages lie between 0% and 100%.

Example 2.1: The spreadsheet SmallSamp in the Excel workbook Proportion.xls is depicted in Fig.2.2. It computes the proportion of sixes in n = 9 throws of a (fair) dice; in the illustration shown we obtain just one six, and a proportion p = 1/9 = 0.11.

Fig.2.2: Simple spreadsheet producing histogram for sample proportion.


Fig.2.3: Two further simulations (n = 9).

As Fig.2.3 shows there is considerable variation in the proportions obtained from different samples (of size 9). However this is “more manageable” than the variation between individual values – look back at Example 4.2 of Unit 7 to see how much
between individual values – look back at Example 4.2 of Unit 7 to see how much
individual values can vary. Indeed by increasing the sample size the variation in
proportions is much reduced. The spreadsheet SmallSamp does this, and
representative results are shown in Fig.2.4.

Fig.2.4: Two simulations with n large (n = 100).

3. Central Limit Theorem for Sample Proportions


The distribution of sample proportions is specified by the following result:
Central Limit Theorem (CLT) for Proportions. Given a population with proportion Π, the distribution of the sample proportion (p) has
Mean(p) = Π   and   Standard Deviation(p) = √(Π(1 - Π)/n)   --- (1a)
In addition the sampling distribution of the proportion approaches a normal distribution as the sample size n increases (“large” sample).
Technical Note For the above normal approximation to hold we require
nΠ > 5 and n(1 – Π) > 5   --- (1b)

Example 3.1: The spreadsheet Simulation_n=25 in the PropDist workbook essentially repeats the calculations in Example 2.1 but now


• collects together results (proportions) from all the samples, then


• forms a frequency table (one of which is shown in Table 3.1) and
• produces a histogram of these proportions (one of which is shown in Fig.3.1).

You should be familiar with this type of computation from the simulations of Unit 7. In
addition we also compute the mean and standard deviation, and these are also
depicted in Fig.3.2. You should be able to check these values from Table 3.1.
Two further simulations are shown in Fig.3.2. We conclude the following:

Table 3.1: Frequency table Fig.3.1: Simulated histogram for proportions.

Fig.3.2: Two further simulations in Example 3.1

• The sampling variation can be accounted for by using the normal distribution
• The parameters of the normal distribution are in accord with CLT. Here
Π = Proportion of sixes (in all possible dice throws) = 1/6 = 0.1666   --- (2a)
and hence √(Π(1 - Π)/n) = √(0.1666 × 0.8333/25) = 0.0745   --- (2b)


CLT says the mean and standard deviation of the sample proportions should
approach (2a) and (2b) respectively (and should be exact for large enough n). You
can see the values in Figs.3.1 and 3.2 agree well with (2a,b).
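You can convince yourself of (2a) and (2b) with a quick simulation outside Excel. A minimal Python sketch (assuming numpy), throwing 25 dice 10,000 times and recording the proportion of sixes each time:

    import numpy as np

    rng = np.random.default_rng(5)
    reps, n = 10_000, 25

    throws = rng.integers(1, 7, size=(reps, n))   # 10,000 samples of 25 dice throws
    p = (throws == 6).mean(axis=1)                # proportion of sixes in each sample

    print(p.mean())   # close to 1/6 = 0.1666, as in (2a)
    print(p.std())    # close to sqrt(0.1666*0.8333/25) = 0.0745, as in (2b)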
Comments on CLT: Proportions often cause a good deal of confusion because
there are some subtle differences with the CLT result for means.
• Notice that (1a) makes no mention of the population standard deviation; for
proportions this is not a relevant concept. (Recall that standard deviations
measure deviations from the mean but, in the context of proportions, the mean
does not make any sense!)
• Nevertheless the sampling distribution does have a standard deviation. You
have to think about this in order not to become too confused! Look at Table 3.1.
• The final conditions for CLT to hold effectively mean the following:
- If our population proportion П is very small (say П = 0.01) we need a very
large sample (n > 500). Otherwise we will not find any member of the
population having the characteristic of interest, and will end up with p = 0.
This will give us no useful information (apart from П is “small”).
- If our population proportion П is very large (say П = 0.99) we again need a
very large sample (n > 500). Otherwise we will find that all members of the
population have the characteristic of interest, and will end up with p = 1
Again this will give us no useful information (apart from П is “large”).
• The form of the standard deviation in (1a) should be familiar. In Unit 6 Section 3 we stated that the standard deviation of X (the number of successes) for a binomial distribution was given by √(np(1 - p)). If we divide this result by n, to turn the number of successes into the proportion of successes, we obtain precisely (1a), but now with Π in place of p! Can you see why the binomial distribution is appropriate in the context of proportions?

We now illustrate the use of the CLT for proportions. You should compare the
following calculations with those in Section 9 of Unit 6.

Example 3.2: An insurance company knows, from records compiled over the
previous 10 years, that on average 5% of its customers will have a car accident in the
current year. In such an event it has to pay out an average of £3000. The actuarially
fair premium would therefore be £150, but the firm charges £200 to cover risk and
profit. The firm will go bankrupt if more than 6% of its customers have accidents.
(a) If the firm has 1,000 customers, calculate the probability of bankruptcy.
(b) Calculate the same probability if the firm has 10,000 customers.


(c) Why should you feel happier dealing with a large insurance company?
(d) Is a large insurance company more profitable in the long run?

Solution: We give the solution together with various explanatory comments.


(a) We set out the (extended) calculation as follows:
• Population values П = population proportion (who have an accident) = 5%
• Sample values p = sample proportion (who have an accident) = 6%
n = 1000

Question: Why are the 1,000 customers regarded as a sample rather than the whole
population (since these are all the company customers)?
• Theory (in words): Since we have a large sample CLT states that the sampling
distribution of proportions follows a normal distribution with

mean = Π   and   standard deviation = √(Π(1 - Π)/n)   --- (3a)
• Theory (in formulas): We have seen how to standardise a variable via
Z = (Sample value - Population value) / (Standard deviation of sample value)

From the above CLT result for proportions this gives


Z = (p - Π) / √(Π(1 - Π)/n)   --- (3b)
The statement in (3a) (together with the normality assumption) is entirely equivalent
to the statement in (3b). In practice one tends to prefer (3b) since this gives a
“computational formula of the traditional type” to work with.

• Computation: Substituting our data values into (3b) gives
Z = (6 - 5)/√(5 × 95/1000) = 1.45 (to 2D)
We now evaluate P(Bankruptcy) = P(p > 6%)
= P(z > 1.45) = 0.5 – 0.4265 = 0.0735
using the normal tables and Fig.3.3.


Fig.3.3: Computation of P(z > 1.45) in Example 3.2

Notes:
1. We can interpret the result more simply by saying there is a 7.35% chance of
the insurance company becoming bankrupt (in the current year).
2. The above calculation is performed in percentages, although the answer appears as a decimal. Why is this?
3. You should check that you obtain the same final probability if the calculation is
performed using proportions (0.05 instead of 5% and so on).

(b) If n = 10,000 we calculate, using (3b),
Z = (6 - 5)/√(5 × 95/10,000) = 4.59 (to 2D)
The value z = 4.59 lies beyond the range of our normal tables, and to the accuracy of the tables this gives
P(z > 4.59) = 0.5 – 0.5000 = 0
If we use a more accurate method of computing normal probabilities we do in fact
obtain P(z > 4.59) = 0.0000022
[You may care to try and verify this number using Excel.]

Conclusion: The chance of bankruptcy is “virtually zero” (about 1 in 500,000).
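The two calculations above can be summarised in a few lines of code. The following sketch (an illustrative aside, assuming Python with scipy) applies (3b) for both company sizes; the probabilities agree with the hand calculations to the accuracy of the tables:

    from math import sqrt
    from scipy.stats import norm

    pi, threshold = 0.05, 0.06     # accident rate and the 6% bankruptcy threshold

    for n in (1_000, 10_000):
        se = sqrt(pi * (1 - pi) / n)            # standard deviation of the sample proportion
        z = (threshold - pi) / se               # standardise, as in (3b)
        print(n, round(z, 2), 1 - norm.cdf(z))  # P(bankruptcy): about 0.073 and 0.0000022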

(c) We may conclude from (a) and (b) that


“a larger company is (very significantly) less likely to go bankrupt”
However, if market conditions change, even large companies can experience great
financial distress (as the ongoing turbulence in the financial sector shows).
(d) In the long run insurance companies, of whatever size, would expect to have 5%
of their customers involved in car accidents. This would give the following results:
Company size   No. customer accidents   Cost to company   Premium to company   Profit     % Profit
1,000          50                       £150,000          £200,000             £50,000    25%
10,000         500                      £1,500,000        £2,000,000           £500,000   25%


Hence we would expect profitability (in percentage terms) to be the same.


However, the larger firm has the advantage that the cost of holding reserves to cover
bad years is relatively small compared to the smaller firm. (Remember the 5%
accident rate is only an average figure and the actual rate in any given year may well
be larger, and this will increase costs.)

4. Estimation
So far our attention has been focused on the sample mean and sample proportion.
• If the mean and standard deviation of the population values are assumed to be
known then it is possible to make probability statements about the sample
mean.
• Similarly if the population proportion is assumed known we can make probability
statements about sample proportions.
However, in practice population values are rarely known and have to be estimated
from (sample) surveys or experiments. For example, to determine the average
height and weight of the adult population of Scotland we would need to survey every
such adult, and this is a practical impossibility. The best we can do is to sample some
of them.
In general we use (known) sample values to estimate (unknown) population values.
Although it may sound obvious
• The “best” estimate of a population mean μ is the sample mean
• The “best” estimate of a population proportion П is the sample proportion p
(The problem in proving these statements rigorously is in giving a definition of what
we mean by “best”.)
There is one further important idea which we illustrate via the following example.

Example 4.1 An inspector takes a random sample of 10 metal bars from a production line and weighs them. He obtains the following weights (in grams):
99.8, 100.7, 100.9, 99.8, 99.5, 99.2, 99.7, 99.8, 100.2, 99.7
What is the best estimate of the population mean weight?

Solution The sample mean provides the best estimate. We calculate
x̄ = (1/10)(99.8 + 100.7 + 100.9 + 99.8 + 99.5 + 99.2 + 99.7 + 99.8 + 100.2 + 99.7)
= 99.93 (grams)


This value is often termed a point estimate (of the population mean weight μ).
Comment The problem with this type of estimate is
• that we have no indication of how accurate it is, given
• the value of x will undoubtedly change if we take another sample.

We suspect the accuracy of x as an estimate of μ is related to how variable (spread out) the data values are.
• If all data values are “close” to each other we would expect x to be a “good”
estimate of μ.
• If the data values are “widely spread” (90, 120, 100,...) we would expect much
less information to be contained in x .
The question now becomes what do we mean by “close” and “widely spread”? We
know variation is assessed by the standard deviation, and CLT tells us how to
evaluate this for a sample.
Before following this up we need some (important) normal distribution numbers. The
ordering of the percentages in Example 4.2 (95%, 99% and 90%) reflects the
frequency with which these values tend to occur in practice.

Example 4.2 Show that for a standardised normal variate:


(a) 95% of all values lie between z = -1.96 and z = +1.96,
(b) 99% of all values lie between z = -2.58 and z = +2.58
(c) 90% of all values lie between z = -1.645 and z = +1.645.

Fig.4.1: Important z-values


Solution (a) We require the z-value for which the shaded area between 0 and z is 0.475 (half of 95%). From the normal tables an area of 0.475 corresponds to a z-value of 1.96. Here we are reading the normal tables “in reverse” – given an area we use the tables to obtain the corresponding z-value.
(b) and (c) are left as an exercise.
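These three critical values are worth memorising; if you happen to have Python with scipy to hand they can also be read off directly (reading the tables “in reverse” corresponds to the inverse cumulative function):

    from scipy.stats import norm

    print(norm.ppf(0.975))   # 1.96  -> 95% of values between -1.96 and +1.96
    print(norm.ppf(0.995))   # 2.58  -> 99% of values between -2.58 and +2.58
    print(norm.ppf(0.950))   # 1.645 -> 90% of values between -1.645 and +1.645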

Comment: The z-values in Example 4.2 are often termed critical values, and
denoted ZC. You should compare them with the values in Unit 6 Section 12.
• The un-shaded regions in Fig.4.1 are often termed critical regions, or tail
regions of the distribution.
• In finance a very common term is “tails of the distribution”. These are precisely
the un-shaded regions above.

5. Confidence Interval for Population Mean (σ known)


• The way in which we get around the difficulty with the point estimate of Example
4.1 is to form what is known as a confidence interval.
• Confidence limits for a population parameter give a range of values in which
the true parameter value is “likely to lie”. They give an indication of the
precision of an estimate.
• It is important to know how reliable an estimate really is if the estimate is to be
used in some sort of decision making.

We have said that the best estimate of the population mean is the sample mean.
Recall the following results (CLT for means Unit 7 Section 5):
• The sample mean has a normal distribution,
- if the underlying variable is normally distributed in the population, and
- an approximate normal distribution, if the variable has a different
distribution (as long as the sample size is fairly large).


• Unless we are dealing with very small samples or extremely skewed distributions it is safe to assume that the sample mean has a normal distribution.
- The mean of the sampling (normal) distribution is μ, the population mean.
- The standard error of the (sampling distribution of the) mean is σ/√n,
where σ is the population standard deviation.
• To illustrate the ideas involved we (arbitrarily) select the 95% value considered
in Example 4.2.

Basic Argument in Words


• We know, from Example 4.2, that for any normal distribution 95% of values lie
between 1.96 standard deviations below the mean and 1.96 standard deviations
above the mean. In particular, for the distribution of sample means, bearing in
mind the above CLT results,
95% of all means calculated from samples of size n taken from any
distribution lie between μ - 1.96 σ/√n and μ + 1.96 σ/√n.
• In probability terms this may be written as

P( μ - 1.96 σ/√n < x̄ < μ + 1.96 σ/√n ) = 0.95
which, on rearranging, gives
P( x̄ - 1.96 σ/√n < μ < x̄ + 1.96 σ/√n ) = 0.95   --- (4a)
• This second statement shows that in 95% of all samples, the sample mean will be
within a distance of 1.96 σ/√n from the true mean. This quantity, 1.96 σ/√n may be
used to indicate how good the estimate of the mean really is by constructing an
interval estimate for the population mean.
• The interval contained in (4a) is termed the 95% confidence interval for the mean

95% confidence interval for the mean = X - 1.96 σ/√n   to   X + 1.96 σ/√n   --- (4b)


Basic Argument in Pictures


• We wish to use the known sample mean x to estimate the unknown
population mean μ
• We want to be 95% confident that x lies “close” to μ
• This means we want x to lie in the shaded (red) area in Fig.5.1.
[Fig.5.1 sketches two possibilities for the observed sample mean X: one where we are 95% certain X lies “close” to the unknown μ, and another where we are NOT 95% confident X lies “close” to μ.]

Fig.5.1: Relation of (known) sample mean X to (unknown) population mean μ

• But we know the lower and upper edges of this region


- In terms of z they are z = -1.96 and z = 1.96.
- In terms of x they are calculated by recalling the fundamental result
Z = (X - μ)/(σ/√n)
• This can be rearranged as Z σ/√n = X - μ, i.e. μ = X - Z σ/√n
• Inserting our two critical z-values gives us the two end points of the interval in (4b).

Comment You should be able to see this argument is really the same as the
previous “argument in words”. The only real difference is that in Fig.5.1 we highlight
the unknown nature of μ; this is disguised somewhat in the argument leading to (4a).

N.B: This confidence interval formula (4b) is very simple to use


BUT we have to assume σ is known
(Recall σ is the original population standard deviation)


Example 4.1 (revisited): Suppose we know, from past records, the following:
• The population (of all metal bars made on this production line) is normal.
• The population has (from past records) standard deviation σ = 1 (gram)
Under these (rather restrictive) circumstances CLT applies and we can use (4b) to
give us our 95% confidence interval for the mean (weight of metal bars produced on
this production line) as
99.93 - 1.96 × 1/√10   to   99.93 + 1.96 × 1/√10 = 99.93 – 0.6198 to 99.93 + 0.6198
= 99.31 to 100.55 (rounded to 2D)

Note: This interval estimate has, by its very nature, a built-in measure of its
reliability. We round the results to 1 decimal place since this is the accuracy of the
original data (and never quote more accuracy than is justified).
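For reference, the whole calculation can be written out in a few lines. A minimal Python sketch under the same assumptions (σ = 1 known, data as in Example 4.1):

    from math import sqrt

    data = [99.8, 100.7, 100.9, 99.8, 99.5, 99.2, 99.7, 99.8, 100.2, 99.7]
    xbar = sum(data) / len(data)            # 99.93
    sigma, n, z = 1, len(data), 1.96        # population sd assumed known

    half_width = z * sigma / sqrt(n)        # 1.96/sqrt(10) = 0.6198
    print(xbar - half_width, xbar + half_width)   # approximately 99.31 and 100.55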

Interpretation: We can phrase the above result in one of two ways:


• We are 95% confident that the true population mean (i.e. the average weight of
a metal bar made on this production line) lies between 99.3 grams and 100.6
grams.
• If we take 100 samples (each comprising 10 such metal bars) from this
production line, we would expect 95 of these samples to give a mean weight in
the range 99.31 grams to 100.55 grams.
• The equivalence of these two lies in the (standard) interpretation of probability
in terms of long run frequencies – see Unit 5 Section 2.

Example 5.1: The spreadsheet CI95 in the Excel workbook ConInt1 gives a demonstration of this latter (long run frequency) interpretation in action. In Table 5.1 cells A7-I7 contain (nine) randomly selected values from a N(9,3²) distribution, with the mean calculated in cell J7.
Table 5.1: Computation of Confidence Intervals in Excel.


• Using (4b) the lower and upper 95% confidence limits are computed in cells K7-
L7. Since we have specified μ (=9) we can check whether this interval does, in
fact, contain μ; a 1 in cell M7 indicates it does.
• We now repeat (100 times) this entire CI calculation, and count how many
constructed intervals actually contain μ. For the values in Fig.5.2 it is 96.
• In Fig.5.2 we give a graphical view of this latter value (96) by joining each of the
(100) lower and upper confidence limits by a straight line. We then note how
many of these lines cross the (horizontal) line μ = 9. Unfortunately this can be a
little difficult to determine due to graphical resolution difficulties!
• Two further simulations are shown in Fig.5.3. In all instances we can see the
95% CI does indeed appear to contain μ 95% of the time. You are asked to
perform further simulations in Practical Exercises 6.

Fig.5.2: Graphical View of Confidence Intervals in Table 5.1

Fig.5.3: Two Further Confidence Interval Simulations.
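
If the ConInt1 workbook is not to hand, the same long run frequency demonstration can be sketched in a few lines of Python (a rough stand-in for the CI95 spreadsheet, not a copy of it): draw 100 samples of 9 values from N(9, 3²), form the 95% interval (4b) for each, and count how many intervals contain μ = 9.

import numpy as np

rng = np.random.default_rng(1)            # fixed seed so the run is repeatable
mu, sigma, n, reps = 9, 3, 9, 100

hits = 0
for _ in range(reps):
    sample = rng.normal(mu, sigma, n)
    half = 1.96 * sigma / np.sqrt(n)      # sigma treated as known, as in (4b)
    lower, upper = sample.mean() - half, sample.mean() + half
    hits += (lower < mu < upper)          # does this interval contain mu?

print(f"{hits} of {reps} intervals contain mu")   # typically close to 95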


General Confidence Intervals

• The 95% referred to in (4) is termed the confidence level.

• The use of this particular value was for illustrative purposes; we could
equally well use any confidence level. In general we can write a more general
form of (4b) as

α% confidence interval for the mean = X - Zα σ/√n  to  X + Zα σ/√n     --- (5)

where the (critical) value Zα depends on the particular value of α.

• In practice the three levels in Example 4.2 (and Fig.4.1) tend to be used. The
particular level chosen is a compromise between the following:
- The higher the confidence level the more certain we are that our
calculated interval contains the true (population) mean.
- The higher the confidence level the wider the confidence interval is. Can
you see why this is the case?

• If our interval is too wide it will be of very limited use.


- Knowing that, with 99.9% certainty, the true mean lies between (say) 5
and 15 is probably not as useful as
- knowing, with 90% certainty, the true mean lies between (say) 9 and 10.

• Remember that we cannot have a 100% confidence interval because nothing is


(absolutely) certain from sample data. To get complete certainty we need to
examine the entire population.

• In practice software packages (such as Excel) will automatically return 95%


confidence intervals, as we shall see in Unit 9.

• What we have been working out so far are two-sided confidence intervals
(extending both sides of the sample mean). In some circumstances one-sided
intervals may be more appropriate. If you ever need these the theory is very
similar to the two-sided case.

The spreadsheet CIAlpha in the ConInt1 workbook allows you to specify the
confidence level α and check the validity of (5). You should explore these issues.
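
As a sketch of the idea behind the CIAlpha sheet, the critical value Zα in (5) can be obtained for any confidence level from the standard normal distribution. The SciPy call below does exist; the data values (a sample mean of 50, σ = 4, n = 25) are invented purely for illustration.

from math import sqrt
from scipy.stats import norm

x_bar, sigma, n = 50.0, 4.0, 25           # illustrative values only

for conf in (0.90, 0.95, 0.99):
    z = norm.ppf(1 - (1 - conf) / 2)      # two-sided critical value Z_alpha
    half = z * sigma / sqrt(n)
    print(f"{conf:.0%} CI: {x_bar - half:.2f} to {x_bar + half:.2f} (z = {z:.3f})")
# Note how the interval widens as the confidence level increases.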


Summary
• If we calculate the mean of a set of data then we have a point estimate of the
population mean. We have a single value with no indication whether it is likely
to be close to the true value or not.
• An interval estimate is one which gives a likely range of values of the
parameter to be estimated rather than just a single value. It is of much more
practical use to know that the true value is likely to lie between two limits.
• A confidence interval states both
- these two limits, and
- precisely how likely the true value will lie between them.

• As an illustration, x ± 1.96 σ/√n is a 95% confidence interval for the population
mean. Use of the ± sign is a compact (and common) way to indicate both ends
of the confidence interval in a single formula.

6. Confidence Interval for Population Mean (σ Unknown; n large)


• Confidence limits are used to give a likely range of values within which a
population parameter lies. In constructing a confidence interval we have
assumed that the true population standard deviation σ is known.
• This is unrealistic. In practice the standard deviation would have to be estimated
as well as the mean. As with the population and sample means, the best
estimate of the population standard deviation (σ) is the sample standard
deviation (s).
• As long as the sample size is reasonably large then confidence limits for the
mean can still be constructed using the normal distribution together with the
estimate s replacing σ. This leads to (4b) being replaced by

Approximate 95% confidence interval for the mean

= X - 1.96 s/√n  to  X + 1.96 s/√n     --- (6)

Example 6.1: An electronics firm is concerned about the length of time it takes to
deliver custom made circuit breaker panels. The firm’s managing director felt it
averaged about three weeks to deliver a panel after receiving the order. A random
sample of 100 orders showed a mean delivery time of 3.4 weeks and a standard
deviation of 1.1 weeks.


Is the estimate of 3 weeks confirmed using a 99% confidence interval?

Solution: Here X = delivery time of a parcel (in weeks)


Population information : μ = population mean (time) = unknown
σ = population standard deviation (time) = unknown
(Population = set of times for delivery of all past/present/future deliveries.)
Sample information : n = sample size = 100 (assumed a “large” sample)
x = sample mean (time) = 3.4 (weeks)
s = sample standard deviation (time) = 1.1 (weeks)

Approximate 99% confidence interval for the mean (delivery time)


= X ± 2.58 s/√n = 3.4 ± 2.58 * 1.1/√100

= 3.4 ± 0.28 = 3.12 to 3.68 (weeks)


(Does it make more sense to convert the result to days?)
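
A minimal Python check of this calculation (all figures taken from Example 6.1; 2.58 is the two-sided 99% z value used above):

from math import sqrt

x_bar, s, n = 3.4, 1.1, 100               # sample mean, sample sd, sample size (weeks)
half = 2.58 * s / sqrt(n)
print(f"Approximate 99% CI: {x_bar - half:.2f} to {x_bar + half:.2f} weeks")
# Prints 3.12 to 3.68 weeks; the postulated 3 weeks lies outside this interval.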

Important Idea : We assess whether the manager’s view, of an average delivery


time of 3 weeks, is borne out by the (sample) evidence using the following criteria
(See Example 5.1.) This idea will reappear in Section 8 on Hypothesis Testing.

• If the "hypothetical" value lies within the confidence interval, accept this value.
• If the "hypothetical" value lies outside the confidence interval, reject this value.

Conclusion: We reject the manager's "preconception" of a 3-week average delivery
time; it is not supported by the sample evidence (at the 1% level of significance –
see Section 8).

7. Confidence Interval for Proportions


Looking at (4) - (6) we can see that confidence intervals have the general form

Confidence interval = Sample estimate ± ZC * standard error of the estimate
(the standard error being computed from the population, or sample, standard deviation)

This leads us to the following result for proportions:

95% confidence interval for the proportion

= p - 1.96 √(Π(1 - Π)/n)  to  p + 1.96 √(Π(1 - Π)/n)     --- (7a)


Again Π is (usually) unknown, and we will estimate it by the sample value to give

Approximate 95% confidence interval for the proportion

= p - 1.96 √(p(1 - p)/n)  to  p + 1.96 √(p(1 - p)/n)     --- (7b)

Example 7.1: Coopers & Lybrand surveyed 210 chief executives of fast growing
small companies. Only 51% of these executives have a management succession
plan in place. A spokesman for Coopers & Lybrand said that many companies do
not worry about management succession unless it is an immediate problem.
Use the data given to compute a 95% confidence interval to estimate the proportion
of all fast growing small companies that have a management succession plan.

Solution : (i) Here X = number of small fast growing companies which have a
management succession plan
Population information : Π = population proportion = unknown
(Population = set of all small fast growing companies.)

Sample information : n = 210 p = sample proportion = 0.51 (no units)

Approximate 95% confidence interval for the proportion = p ± 1.96 √(p(1 - p)/n)

= 0.51 ± 1.96 √(0.51 * 0.49 / 210) = 0.51 ± 1.96 * 0.034
= 0.51 – 0.07 to 0.51 + 0.07
= 0.44 to 0.58 (44% to 58%)

Question : Is a two-sided interval appropriate?


Does the phrase “… do not worry about management succession ….” imply we are
only concerned with ”low” proportions, and hence a one-sided interval?
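
The interval (7b) is equally quick to verify in Python; the sketch below simply re-uses the figures quoted in Example 7.1.

from math import sqrt

p, n = 0.51, 210                          # sample proportion and sample size
se = sqrt(p * (1 - p) / n)                # estimated standard error of p
print(f"Approximate 95% CI: {p - 1.96*se:.2f} to {p + 1.96*se:.2f}")
# Prints 0.44 to 0.58, i.e. 44% to 58%.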

8. Hypothesis tests
In Section 4 we introduced the ideas of estimation and observed how the standard
error of the quantity being estimated is a measure of the precision of the estimate.
(The standard error is a very commonly used term to denote the standard deviation

of the sampling distribution. This usage implies the term “standard deviation” refers to
the entire population.)
In many situations, estimation gives all the information required to make decisions.
However, there are circumstances where it is necessary to see whether the data
supports some previous supposition. Examples include
• comparing the mean level of sample output on a production line against a fixed
target value,
• comparing the efficacy of a new drug with a placebo or
• seeing whether times to failure of a particular component follow a specified
probability distribution.

Although, in all these cases, some quantities will have to be estimated from the data,
the emphasis has switched from pure estimation to that of testing.
• In the first example, this would be testing whether the mean has wandered
away from the target value.
• In the second whether the new drug is better than the placebo and
• In the third whether the data is compatible with the particular distribution.

Although there are a lot of different types of tests that can be carried out on data (a
glance at any more advanced statistics text-book will reveal a frightening array!) the
idea behind all tests is the same. Once you have mastered this basic concept, life
becomes easy. So what is this basic concept?

Basic Concept
• Whenever any sort of test is performed in real life there will be objectives which
are specified. A test might be carried out to
- determine the breaking stress of a metal bar, or
- assess the academic achievement of a pupil.
But in each case there is a clear goal.

• In a statistical test the same criterion applies. A statement must be made as to


what precisely is being tested. The statement will make some assumption about
a particular feature of a population or populations. For example, that the mean
is a particular value.
• How do we test whether this assumption is sensible? The only way to find out
about most populations is to collect a random sample of data and see what
evidence is contained in the sample data.


• The data from the sample is examined to see


- whether it supports the assumption or
- whether it seems to contradict the assumption.
• If the sample data is compatible with the statement then the statement is “not
rejected”. If the evidence contradicts the statement then doubt is cast upon
the validity of the statement. (We need to be very careful with the precise
language we use here, for reasons we shall see later.)
• A statistical hypothesis test is just a formal mechanism for rejecting, or not
rejecting, an assumption about a population or populations.

9. Basic formulation (everyday language)


We wish to test something, and this involves a few standard steps. To avoid
confusion (when situations become more complicated), it is advisable to state
exactly what is involved:
• Step 1 : State precisely what is being tested and any assumptions made.
• Step 2 : Assemble sample information
• Step 3 : Examine sample data
• Step 4 : See whether, on the basis of Step 3, the sample data
- supports the assumption(s) made in Step 1, or
- appears to contradict this assumption.
• Step 5 : If the sample data is
- compatible with our assumption then this assumption is not rejected,
- otherwise doubt is cast on the validity of our assumption, and this
assumption is rejected.
Although some of the assumptions made in the following example may be unrealistic
in practice (normally distributed times and known σ), we are initially more concerned
with the actual computation.

Example 9.1 A programmer has written a new program and postulates that the mean
CPU time to run the program on a particular machine will be 17.5 seconds. From past
experience of running similar programs on his computer set-up he knows that the
standard deviation of times will be about 1.2 seconds. He runs the program eight
times and records the following CPU times (in seconds).
15.8, 15.5, 15.0, 14.8, 15.6, 16.5, 16.7, 17.0.


Assuming that running times are normally distributed, calculate the following:

(a) The sample mean x and the standard error of x,


(b) A 95% confidence interval for the true mean.
(c) What conclusion can you draw regarding the postulated mean run time?

Solution (Informal) The following solution goes through Steps 1-5 above. From this
we can fairly easily put together the more formal calculation.
• The (essential) computational parts of the calculation are starred **.
• The remainder are explanations of the ideas we are using.
First X = CPU time to run program

Step 1 We have the following population information:


Population = set of all past/present/future program runs.
μ = mean = 17.5 secs. σ = standard deviation = 1.2 secs.
These values presumably are based on previous records.
• We would like to test whether the postulated mean time (of 17.5) is true.
• Since our sample size is small we are assuming run times are normal.

Step 2 We have the following sample information:


n = sample size = 8 , Actual sample data (run times) given.

Step 3 We compute the following based on the sample information:

(a) x = (1/8)[15.8 + 15.5 + 15.0 + 14.8 + 15.6 + 16.5 + 16.7 + 17.0] = 15.8625 **

Standard error of x = σ/√n = 1.2/√8 = 0.4243 **

Recall: 1. The standard error of x is just another phrase for the standard
deviation of x. The terminology is common when dealing with sampling
distributions, i.e. distributions of values computed from sample information, rather
than from complete population information (the latter usually being unknown).
2. The sampling distribution of means has standard deviation σ/√n.
3. We shall also shortly use the related facts that the sampling distribution of
means has mean μ, and the sampling distribution of means has a normal
distribution (σ known).


Step 4 We need to assess what our sample computations are telling us.
• It is certainly true that x < μ (15.86 < 17.5).
• But the crucial question is "Is x sufficiently less than μ?"
• To assess this we are asked to compute a 95% confidence interval for the (true)
mean. From (4b) of Section 5 we easily obtain
(b) 95% CI for mean = X ± 1.96 σ/√n = 15.8625 ± 1.96 * 0.4243 (accuracy?)

= 15.8625 ± 0.831 = 15.03 to 16.69 **

Step 5 What conclusion can we draw from this CI?


• We know, from Section 3, that if we take repeated samples we would expect
95% of those to give a CI which do actually contain the true (unknown) μ.
• If we assume μ = 17.5 then our (first) sample produces a CI which does not
contain μ, according to the result in (b).
• We must conclude either
- μ does not equal 17.5 (and appears somewhat smaller), or
- our sample just happens to be one of the “unusual” 5% which will produce,
through sampling fluctuations, a CI not containing μ (= 17.5).
• We decide (on the balance of probabilities)
- the second alternative “is not a very likely state of affairs”, and
- the first situation is “far more likely”.
• But if we reject the assertion that μ = 17.5 what value of μ do we accept?

Comment If you just look at the starred entries you will see this solution is quite
short. It is only because we have given an (extended) discussion of the underlying
ideas that the solution appears rather long. In practice you would just give the salient
points (starred entries) in your solution. We shall do this in the next section in
Example 10.1, after we have introduced a bit more statistical jargon.
The Tutorial Exercises will give you practice in writing down solutions using the
appropriate terminology.
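
The starred computations above can be reproduced with a short Python sketch (the eight run times and σ = 1.2 come from Example 9.1; the rest is routine arithmetic):

from math import sqrt

times = [15.8, 15.5, 15.0, 14.8, 15.6, 16.5, 16.7, 17.0]   # CPU times (seconds)
sigma = 1.2                               # population sd, assumed known
n = len(times)

x_bar = sum(times) / n
se = sigma / sqrt(n)                      # standard error of the sample mean
print(f"x_bar = {x_bar:.4f}, SE = {se:.4f}")
print(f"95% CI: {x_bar - 1.96*se:.2f} to {x_bar + 1.96*se:.2f}")
# Prints x_bar = 15.8625, SE = 0.4243 and a CI of 15.03 to 16.69;
# the postulated mean of 17.5 lies outside the interval.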


10. Basic formulation (formal statistical language):


Terminology To give a slightly more formal solution to Example 9.1 it is conventional
to introduce some new terms:
• The parameters of a probability distribution are the quantities required to
specify its exact form. So, a normal distribution has two parameters, the mean
and the standard deviation.
• The null hypothesis is the specific statement made about the population or
populations that is to be tested. It is always denoted as H0 and the statement is
usually written in mathematical notation. In Example 9.1 we would have
H0 : μ = 17.5 seconds

- The null hypothesis often involves just the parameters of a population but
it can also be concerned with theoretical models or the relationship
between variables in a population.
- Although the null hypothesis is written as a simple statement, other
assumptions may be made implicitly.
• A test may assume that the individuals are chosen at random, or
• the data come from a normal distribution and so on.
• If any of these additional assumptions are not true then the test
results will not be valid.
- The alternative hypothesis is what will be assumed to be true if it is
found subsequently that the data does not support the null hypothesis. So
either H0 is true or H1 is true
• It is denoted by H1 and, unlike the null hypothesis, which is very
specific in nature, the alternative tends to be much more general.
• In the case of Example 9.1, for example, the alternative hypothesis
may take one of three forms:
(i) H1 : μ ≠ 17.5 (ii) H1 : μ > 17.5 (iii) H1 : μ < 17.5

• The form of H0 and H1 must be decided before any data is collected. It may not
be obvious why this should be so – why not have a (quick) look at the data to
get an idea of what is going on, before you decide what to test?
- The problem is that the sample is only one possible sample and, if we
sample again, the results will change.
- We can only put “limited trust” in the values we actually observe.


• If we put too much reliance on the data, in particular in guiding us as to what we


should be testing, we are liable to be led astray and end up testing the wrong
thing. This can be very expensive in some situations! (Of course we do need to
rely on the sample results since they are the only evidence we have. But we
don’t want to over-rely on them.)

Testing Hypotheses (Population Mean)


In order to test whether the population mean is or is not equal to the value specified
in the null hypothesis a sample is drawn from the population, the sample mean
calculated and a decision made.
• The decision will be based on a comparison of the sample mean with the
population mean specified in the null hypothesis.
• Naturally, they will nearly always be different but what we will want to test is
whether the difference is statistically significant. The word significant has a
special meaning in statistics.
• Rejecting the null hypothesis says that we have decided that population mean
specified is probably not the true population mean. In this case we say that the
sample mean is significantly different from that postulated.
• Accepting the null hypothesis says that we have decided that the population
mean could reasonably be equal to that postulated. In this case we say that the
sample mean is not significantly different from that postulated.

Relationship with Confidence Intervals


An important fundamental relationship exists between testing hypotheses about a
population mean value and constructing a confidence interval for the population
mean. This fundamental relationship is as follows:
1. If the postulated value we are testing is inside the confidence interval then we
conclude that the population mean does not differ significantly from the
standard value. In this case, the null hypothesis cannot be rejected.
2. If the postulated value is outside the range of the confidence interval, then we
conclude that the population mean is significantly different from the
postulated value. In this case the null hypothesis is rejected and we conclude
the sample data is consistent with the alternative hypothesis.
(We would like to say that the alternative hypothesis is accepted, but this is not
quite right. We discuss this later.)


Formal Testing Procedure


• Step 1 : State the null and alternative hypotheses, and any assumptions.
• Step 2 : Decide on a confidence level (to be used later in Step 4).
• Step 3 : Assemble sample information.
• Step 4 : Examine the sample data and compute an appropriate CI.
• Step 5 : Decide on whether the CI in Step 4 is compatible with H0 in Step 1.

Example 10.1 Solution (Formal) The following provides a more formal (and
compact) solution to Example 9.1.
• Step 1 : We wish to test the null hypothesis H0 : μ = 17.5
against the (2-sided) alternative hypothesis H1 : μ ≠ 17.5
Since we are told run times are normally distributed we do not need to assume this.
(This is important here since we do not have a large sample to use CLT.)
• Step 2 : We choose a confidence level of 95%. Note this is done before looking
at sample results (or, in practice, before collecting any data).
• Step 3 : This is exactly the same as before with
x = 15.86 and Standard error = 0.42
• Step 4 : This is exactly the same as before with
95% CI for mean = 15.03 to 16.69
• Step 5 : Since, assuming H0 is true, our CI does not contain μ we reject H0
and accept H1 .

Note Here we can say we accept H1 since H0 and H1 include all possibilities (since
H1 is two-sided). If H1 were 1-sided (say H1: μ < 17.5) we could not do this since the
sample data may also be compatible with another H1 (H1: μ > 17.5).

11. p-values
This section is intended to explain why so-called “p-values” are computed, as well as
showing how they are obtained.
Terminology There is one further piece of terminology that is used almost
exclusively in (statistical) computer packages. Rather than focusing on how confident
we are (see Step 2 in the formal testing procedure of Section 9) we highlight the
error we are (potentially) making.


• We replace the confidence level by the significance level, defined by


• Significance level = 100 – Confidence level (in %)
• Thus a 95% confidence level translates into a 5% significance level.
• The significance level refers to the areas in the “tails of the distribution”.

Fig.11.1: The central "confidence region" (area = 95%) and the significance region
in the two tails (total area = 5%)
• In general the significance level can refer to 1-sided regions (areas), although we
shall only look at the 2-sided case.
• The significance level is used to “set a scale” by which we can judge what we
mean by the phrase “unlikely”. We speak of deciding when sample values are
“significant”, i.e. unlikely to have occurred by chance alone. (We equate
“extreme” values with “significant values” in that they are too many standard
deviations away from the mean.)

A Note on Measurements Suppose we ask


• Question 1: “What is the probability of an adult UK male having height 6 ft?” In
practice what we would record as 6 ft. would be any height from, say, 5.99 ft. to
6.01 ft. depending on the resolution of our measuring device. If we ask
• Question 2: “What is the probability of an adult UK male having a recorded
height of 6 ft (i.e. an actual height in the range 5.99 ft. to 6.01 ft.?” then we can
answer this by taking samples, or using previously recorded information
(perhaps data compiled by the NHS or census records).
• But the answer to Question 1 will always be “the probability is zero”. We can
think of the reason why in one of two ways:
- Nobody has a height of precisely 6ft. since we cannot measure such a
height “absolutely accurately”. How can we distinguish a height of 6.00001
ft. from a height of 6.00002 ft.?
- More formally whenever we have a continuous probability distribution, the
probability of any single event is zero. This is because we have an infinite
number of events and, since their combined probability must be 1, all must
have zero probability.


- What all this comes down to is that, when dealing with continuously
varying quantities (such as height) we cannot ask for probabilities of
specific values, but must specify a range (however small) of values.
Criticism of CIs A major criticism of the use of confidence intervals in hypothesis
testing is the need to pre-set the confidence (significance) level.
• It is quite possible to reject H0 using a 99% confidence level, but to accept H0
using a 95% confidence level.
• This means our “entire conclusion” depends on the, somewhat arbitrary, levels
we set at the start of the analysis.

Asking the Right Question The concept of a p-value is designed to overcome the
above difficulty. To explain this we assume we have taken a sample and ask the
following question:
Question A: How likely was our observed sample value?
• The rationale for this is the following. If what we actually observed was a priori
(before the event) an unlikely outcome then its subsequent occurrence casts
doubt on some of our assumptions. We may well need to revise the latter (in
light of our sample evidence).
• The difficulty with asking this precise question is that, as we have just seen, the
likelihood (probability) is usually zero. We could ask instead
Question B: How likely were we to observe values within a “small” interval (say
within 1%) about our observed sample value?
• The difficulty with this question is one we have already alluded to:
- Our sample values will change if we take another sample so
• whilst we attach importance to our sample results,
• we do not wish to give them undue importance.
In effect we are not overly concerned about the particular sample mean of 15.86
in Example 9.1, so the answer to Question B is not particularly useful.
• What we are interested in is how our sample mean will aid us in assessing the
validity of the null hypothesis. In particular, if μ = 17.5 how far away is our
sample value (mean) from this (since the further away the less confidence we
have in H0)? So we (finally) ask the question
Question C: How likely were we to observe something as “extreme” as our
sample value?


Example 9.1 (Revisited) In this context Question C asks us to compute


P( x < 15.8625 )

This corresponds to the shaded (red) area in Fig.11.2.

Fig.11.2: The p-value as the area to the left of the observed sample mean x = 15.86,
under the sampling distribution centred on the assumed μ = 17.5

• Computation:   z = (X - μ)/(σ/√n) = (15.8625 - 17.5)/(1.2/√8) = -1.6375/0.4243 = -3.86

Then P(z < -3.86) = 0.5 – 0.4999 = 0.0001 (from normal tables).

Fig.11.3: Standard normal distribution computations

• Recall that an event 3 standard deviations (3σ) or more from the mean has
“very little” chance of happening, in agreement with the above computation.

Comments We note the following:


• The probability we have calculated is “almost” our p-value.
• Formally a p-value measures the probability of a “more extreme” event than
that actually observed in the sample.
• What qualifies as “more extreme” depends on the alternative hypothesis, since
we are computing p-values to help us in testing the validity of H0 (and H1
indirectly).


• In our case, since we have a 2-sided H1 a more extreme event would be


obtaining a sample mean more than 3.86 standard deviations from the
(assumed) mean given in H0. This includes the two possibilities
P(z < -3.86) and P(z > 3.86)
even though we only actually observed one of these (z = -3.86). (Again we do
not want to lay too much importance on the fact that the observed sample mean
was smaller than that specified in H0. This may just have occurred “by chance”,
and the "next" sample may well give a sample mean larger than in H0.)
• In pictures (Fig.11.4):

p-value = P(z < -3.86) + P(z > 3.86) = 0.0001 + 0.0001 = 0.0002

Fig.11.4: Two-sided p-values
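
In Python the same two-sided p-value can be obtained directly from the standard normal distribution rather than from rounded tables (which is why it comes out slightly below the 0.0002 quoted above); the numbers are those of Example 9.1.

from math import sqrt
from scipy.stats import norm

x_bar, mu0, sigma, n = 15.8625, 17.5, 1.2, 8
z = (x_bar - mu0) / (sigma / sqrt(n))     # test statistic assuming H0 is true
p_value = 2 * norm.sf(abs(z))             # two-sided: both tails beyond |z|
print(f"z = {z:.2f}, two-sided p-value = {p_value:.5f}")
# z is about -3.86 and the p-value about 0.0001.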

• Another way of interpreting a p-value is the following:


A p-value gives the significance level at which we will (just) reject H0.
• Alternatively, if we calculate a (100 – p)% confidence interval this will (just)
contain the value of μ specified in H0. Explicitly in our example (p ≈ 0.02%)
99.98% confidence interval for mean = 15.8625 ± 3.86 * 0.4243
= 14.2247 to 17.5003
• A very important point is that we do not reject (or fail to reject) H0 on the basis of
some arbitrary (standard) confidence level.
- Our p-values gives us the likelihood (probability) of H0 being “okay”, i.e.
compatible with the sample evidence.
- It is then up to us to decide whether this probability is “acceptable” or not,
and this will often depend on the context of the problem.

Final Points You should observe the following:


• The calculations we perform in computing
p-values via z = (X - μ)/(σ/√n), and
confidence intervals via X ± Zc σ/√n

are essentially the same; it is just the emphasis that is different.


• We can set up the following hypothesis testing decision rule (see Fig.11.1).
- Here the critical regions are defined before the sample is taken.
- Afterwards we reject H0 if the sample mean lies in the critical region.

[Figure: BEFORE the sample is taken the central "do not reject H0" region (95%) and
the critical (rejection) regions in the tails are fixed; AFTER the sample is taken, H0 is
rejected if the sample mean falls in a critical region.]

Fig.11.6: Hypothesis testing decision rule

• Using p-values instead we have the following setup:


[Figure: BEFORE the sample is taken nothing is fixed; AFTER the sample is taken the
observed sample mean itself defines the critical region.]

Fig.11.7: p-values define the critical region

• There is one final interpretation of the significance level. At the start of this
section we introduced the idea that the significance level measures how
“uncertain” we are (in contrast to the confidence interval that focuses on how
“confident” we are). The significance level measures the maximum error we
are prepared to make in making our decision as to whether to reject H0 or not.
- Remember that, whatever decision is made, we cannot be 100% certain
we have made the right one. It is possible our sample value is
unrepresentative and, by chance alone, we have been “unlucky” with our
sample results.
- Before taking a sample we need to decide how much error we are
prepared to live with. This is what our choice of significance level does.


• You may like to go back to the Excel demonstration of the idea behind
confidence intervals (ConfInt spreadsheet of Example 5.1). Observe that, in the
95% CI, we can still expect 5 of our intervals to behave “badly” and not contain
the true value of μ; this would correspond to rejecting H0 5% of the time. (This
would be the wrong decision since we know the true value of μ!)

12. Excel output


It is helpful to briefly pause here to put in context the ideas we have seen in the last
few units, and to look ahead to the next unit. In Fig.12.1 we show Excel output which
will appear in Section 7 of Unit 9, and highlight some of the terminology we have so
far met.

Fig.12.1: Excel output for car price data (Unit 9), with the following terms highlighted:
• F-distribution – Unit 7 Section 10
• Significance – see below
• 95% confidence interval – Unit 8 Section 5
• Degrees of freedom – Unit 7 Section 8
• Standard error – Unit 8 Section 9
• t-distribution – Unit 7 Sections 7-8
• P-value – Unit 8 Section 11

Note Sometimes p-values are called significant (or sig.) values. Excel actually uses
both terms in the output of Fig.12.1. As you can see you need a fair amount of
statistical background to understand the output of most (statistical) packages. The
remaining terms in Fig.12.1 will be explained in Unit 9.


13. Hypothesis test for population proportion


The hypothesis testing procedure we have discussed has been in the context of
(population) means, but the same procedures apply with proportions, with one minor
amendment. When working out, say, a 95% confidence interval we use
(Approximate) 95% confidence interval for the proportion

= p - 1.96 √(p(1 - p)/n)  to  p + 1.96 √(p(1 - p)/n)

rather than the
(Exact) 95% confidence interval for the proportion of (7a)

= p - 1.96 √(Π(1 - Π)/n)  to  p + 1.96 √(Π(1 - Π)/n)

even though we have a value for П specified under H0. The theory becomes a lot
simpler if we work with approximate, rather than exact, confidence intervals for the
population proportion. (You can see why if you look at the version of (4a) which
applies to proportions.)

Example 13.1: The catering manager of a large restaurant franchise believes that
37% of their lunch time customers order the “dish of the day”. On a particular day,
of the 50 lunch time customers which were randomly selected, 21 ordered the dish
of the day. Test the catering manager’s claim using a 99% confidence interval.

Solution: Here X = number of customers ordering “dish of the day”


In the following essential steps are shown starred (**); the remainder are
explanations of, and comments on, the calculations.
Population: Π = population proportion = 0.37 (the catering manager's claim)
information (Population = set of all past/present/future restaurant customers.)

Step 1 : We wish to test the null hypothesis H0 : Π = 0.37 **


against the (2-sided) alternative hypothesis H1 : Π ≠ 0.37 **

Step 2 : Here we are given Significance level = 1%

This value measures the chances of making the wrong decision. The catering
manager has decided he is prepared to live with the consequences of this, i.e.
• too many “dish of the day” dishes unsold (if in fact Π < 0.37 and we accept H0)


• not enough “dish of the day” made (if in fact Π > 0.37 and we accept H0)
and the consequent effect on the supplies that need to be ordered, the staff that need
to be deployed, customer dissatisfaction and so on.

Step 3: Sample information is Sample size = n = 50


p = sample proportion = 21/50 = 0.42 (no units) **
(p = proportion of customers in sample ordering “dish of the day”)

Step 4: Examine sample data by computing the 2-sided 99% confidence interval.
Here 99% confidence interval for (true) proportion = p ± 2.58 √(p(1 - p)/n) **

= 0.42 ± 2.58 √(0.42 * 0.58 / 50) = 0.42 ± 0.180 = 0.24 to 0.60 **
or 99% confidence interval for (true) percentage = 24% to 60% **

Step 5 (Conclusion) : Here, on the basis of H0 being true, we have obtained a 99%
confidence interval which does contain Π. Since this will happen 99% of the time, we
regard this as a “very likely” event to happen. We cannot reject H0.
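
A quick Python check of Steps 3-5 (the sample figures are those of Example 13.1, and 2.58 is the two-sided 99% z value):

from math import sqrt

x, n, pi0 = 21, 50, 0.37                  # successes, sample size, value under H0
p = x / n                                 # sample proportion = 0.42
half = 2.58 * sqrt(p * (1 - p) / n)
lower, upper = p - half, p + half
print(f"99% CI: {lower:.2f} to {upper:.2f}")
print("Cannot reject H0" if lower < pi0 < upper else "Reject H0")
# Prints 0.24 to 0.60; this contains 0.37, so H0 cannot be rejected.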

Comments Observe the following:


• Since our confidence interval is so wide we must regard the sample evidence
as only a very weak test of H0. (Indeed the concept of the power of a statistical
test is an important idea, but beyond the scope of an introductory course.)
• Indeed the sample evidence will clearly be compatible with many values of Π
(Π = 0.36, 0.35, 0.38, 0.39,…..). This is precisely why we do not accept H0,
merely conclude that we cannot reject it!
• Does such a wide confidence interval really help the catering manager plan his
menus? (A data set will only contain so much, and no more, useful information!)
• Strictly speaking, we should say
we cannot reject H0 at the 99% confidence level
or we cannot reject H0 at the 1% significance level

It is, of course, possible that we could reject H0 at a different confidence/


significance level (but we would need to do further calculations to check this).


14. Computations using the t-distribution (Small Samples)


Rationale for small samples. Financial restrictions often mean that, in practice, only
“small” samples can be taken, and
• provided the underlying (parent) population is normally distributed, and
• the population standard deviation is unknown (and hence needs to be
estimated from the available sample data)
we can use the t-distribution to perform all the “usual” computations, i.e.
• confidence intervals,
• hypothesis testing and
• p-values
for both means and proportions.

A. Confidence Intervals for the Population Mean (small samples)


Many of our previous results are easily adapted for use with the t-distribution. We
have already seen that, for 2-sided intervals:
• X ± zc σ/√n is a CI for μ if σ is known (Section 5)
• X ± zc s/√n is an approximate CI for μ if σ is unknown and n large (Section 6)
In view of the result (8) of Unit 7 Section 7 we can add the result (recall Example 7.1)
• X ± tc s/√n is a CI for μ if σ is unknown and the sample size n is small,
provided the underlying distribution of X is normal.

Example 14.1: A random sample of 10 items in a sales ledger has a mean value of
£60 and a standard deviation of £8. Find a 95% confidence interval for the population
mean of sale ledger items.

Solution: Here X = value (in £) of items occurring in a sales ledger


Population: μ = mean of population = unknown
Information σ = standard deviation of population = unknown
(Population = set of all items in ledger.)
Sample: n = sample size = 10 (“small”)


information x = sample mean = 60 (in £)


s = sample standard deviation = 8 (in £)
Since df = n – 1 = 9, for a 2-sided 95% CI tc = 2.262

Hence 95% CI for μ = X ± tc s/√n = 60 ± 2.262 * 8/√10 (in £)

= 60 ± 5.72 = £54.28 to £65.72 (2D)
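
The critical value tc = 2.262 can be read from the t-tables or, as in the Python sketch below, taken from SciPy; the data are those of Example 14.1.

from math import sqrt
from scipy.stats import t

x_bar, s, n = 60.0, 8.0, 10               # sample mean, sample sd (pounds), sample size
tc = t.ppf(0.975, df=n - 1)               # two-sided 95% critical value, 9 df
half = tc * s / sqrt(n)
print(f"t_c = {tc:.3f}")
print(f"95% CI: {x_bar - half:.2f} to {x_bar + half:.2f} (pounds)")
# Prints t_c = 2.262 and a CI of 54.28 to 65.72.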

B. Hypothesis Test for the Population Mean (Small Samples)


Example 14.2: Fentol Products Ltd manufactures a power supply with an output
voltage that is believed to be normally distributed with a mean of 10 volts. During the
design stage, the quality engineering staff recorded 18 observations of output voltage
of a particular power unit. The mean voltage of this sample was 10.33 volts with a
standard deviation of 0.77 volts. Is there evidence, at the 5% significance level, that
the average voltage is not 10 volts?

Solution: Here X = Power supply output voltage


Population: μ = population mean = 10 volts (previous records)
information (Population = set of all past/present/future power supply units.)
Step 1: We wish to test the null hypothesis H0 : μ = 10
against the (2-sided) alternative hypothesis H1 : μ ≠ 10

Step 2 : Here we are given Significance level = 5%


(measuring the chances of making the wrong decision).

Step 3 : Sample information is Sample size = n = 18 (“small”)


x = sample mean = 10.33 volts
s = sample standard deviation = 0.77 volts

Step 4a : Examine sample data by computing the 2-sided 95% confidence interval
via 95% CI for the mean = X ± tc s/√n **

Step 4b : Compute tc . Here ν = number of degrees of freedom = n – 1 = 17


(The t-distribution has central area 0.95, leaving area 0.025 in each tail.)

The t-tables (Unit 7 Table 8.1) give tc = 2.110 **

Step 4c : 95% CI for the mean = 10.33 ± 2.11 * 0.77/√18 **
= 10.33 ± 0.38 = 9.95 to 10.71

Step 5 (Conclusion): Here, on the basis of H0 being true, we have obtained a 95%
confidence interval which does contain μ (= 10). We cannot reject H0.

Comments:

1. Although our confidence interval is not very wide we may view with some
concern the fact that μ = 10 is very close to one edge of the interval. In practice
this may prompt us to do some further analysis, i.e.
• think about changing our confidence/significance level, or
• taking another sample, or
• taking some other course of action.

2. If we were to use the normal distribution value zc = 1.96 (pretending σ = 0.77)


we would obtain "95% CI for mean" = 10.33 ± 1.96 * 0.77/√18

= 10.33 ± 0.36 = 9.97 to 10.69


This is little changed from our previous (correct) result, and does not affect our
conclusion (to not reject H0). The effect of a small sample will become more and
more marked the smaller the sample.

3. We can also compute the p-value using (1b):


t = (x - μ)/(s/√n) = (10.33 − 10)/(0.77/√18) = 1.818


• We now need to calculate P(|t| > 1.818), representing the probability of a more
extreme value than the one observed, i.e. the two tail areas beyond -1.818 and 1.818.
Unfortunately this area is not directly obtainable from the t-tables (why?). We need to
use the Excel function TDIST. Explicitly

p-value = TDIST(1.818, 17, 2) = 0.087

(the arguments being the t value, the degrees of freedom (df) = n – 1, and 2 for a
2-tailed value)
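
Both the confidence interval and the p-value of Example 14.2 can be reproduced in Python, with scipy.stats.t playing the role of the t-tables and of Excel's TDIST:

from math import sqrt
from scipy.stats import t

x_bar, s, n, mu0 = 10.33, 0.77, 18, 10.0
df = n - 1
se = s / sqrt(n)

tc = t.ppf(0.975, df)                     # two-sided 95% critical value (about 2.110)
print(f"95% CI: {x_bar - tc*se:.2f} to {x_bar + tc*se:.2f}")   # 9.95 to 10.71

t_stat = (x_bar - mu0) / se               # observed t value (about 1.818)
p_value = 2 * t.sf(abs(t_stat), df)       # two-tailed, as in TDIST(1.818, 17, 2)
print(f"t = {t_stat:.3f}, p-value = {p_value:.3f}")            # about 0.087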

Question: Are two-sided intervals appropriate in our previous examples?

Summary: Look back at the summary at the end of Section 5.


• Since then we have used confidence intervals in a variety of situations:
- Population mean (σ unknown and n large) – Section 6
- Population mean (σ unknown and n small) – Section 14
- Proportions – Sections 7 and 13
• Hypothesis tests (Sections 8-10) provide an important formal framework for
testing population parameter values, again with confidence intervals being the
basic computational tool used.
• A refinement is provided by computing p-values (Section 11), rather than using
pre-selected significance levels.
• All of these procedures have much in common (confidence intervals), but make
different assumptions on the underlying data (population). These can be in
various forms:
- Nature (normal or not) of population.
- Population parameters known or not.
- Specific values of parameters assumed/tested.

The validity of any statistical procedure ultimately rests on how well the data
conforms to the (often implicit) assumptions made. Always bear this in mind.


9 Correlation and (Simple) Regression

Learning Outcomes
At the end of this unit you should be able to:
• Appreciate the concept of covariance and correlation.
• Interpret the coefficient of correlation.
• Plot data to illustrate the relationship between variables.
• Determine the equation of a regression line and interpret the gradient and
intercept of the line.
• Understand the regression (Anova) output from Excel.
• Appreciate the usefulness of residual analysis in testing the assumptions
underlying regression analysis.
• Predict/forecast values using the regression equation.
• Understand how data can be transformed to improve a linear fit.
• Appreciate the importance of using statistical software (Excel) to perform
statistical computations involving correlation and regression.
• Understand the inter-relationships between expected values, variances and
covariances as expressed in the efficient frontier for portfolios.

God writes straight with crooked lines.


Proverb


1. Introductory Ideas
We are very often interested in the nature of any relationship(s) between variables of
interest, such as interest rates and inflation or salary and education.
• Covariance is the fundamental numerical financial quantity used to measure
the mutual variation between variables.
• Correlation is a scaled covariance measure and provides an important first
step in seeking to quantify the relationship between (two) variables. It is used
more often than the covariance in non-financial contexts.
• Regression extends the notion of correlation to many variables, and also
implicitly brings in notions of causality, i.e. whether changes in one variable
cause changes in another variable. In this unit we shall consider only the case
of two variables (simple regression).

Example 1.1: Look back at Problem 2 in Section 2 of Unit 1; here we want to know
how our gold and sterling assets will behave. Specifically if gold prices start to fall:
• Can we expect the value of sterling to fall?
• If so by how much?
We are interested in whether changes in one asset will accompany changes in the
other. Note we do not use the terminology “do changes in gold prices cause changes
in sterling”, rather we are more interested in how they vary together.

Example 1.2: There has always been a great deal of media attention focused on the
“correct” values at which interest rates should be set.
• The manufacturing sector complains interest rates are too high since
- this encourages foreign investment in the UK which in turns
- causes the £ to appreciate (against other currencies); in turn
- a higher £ causes difficulties with UK exports (higher prices), resulting in
- falling export sales, increased layoffs and rising unemployment.

• Other sectors complain interest rates are too low since


- low rates encourage inflation, which in turn
- causes prices to rise, in turn
- leading to higher wages, and reduced competitiveness, resulting in
- rising unemployment.


• Whether you believe in either scenario, it is clear that there are a large number
of variables (interest rates, exchange rates, inflation, manufacturing output,
exports, unemployment and so on) that need to be considered in analysing the
situation.
There is an enormous literature on these ideas, ranging from the non-technical to the
very-technical. A brief overview is given at http://en.wikipedia.org/wiki/Interest_rates

Probably the crucial issue here is causation, i.e. “what causes what?” If all variables
are interlinked, do they all cause each other, or is there some “important” variable
that explains many (or all) the others? Economics seeks to use data, and statistical
analysis, to try and make sense of the relationships between (many) variables.
Correlation (or covariance) is the starting point for doing this. Very roughly speaking:
• In physics causation is often well established (“Gravity causes objects to fall”)
• In finance causation is plausible but not well understood (“High unemployment
causes interest rates to decrease”)
• In the social sciences causation is problematic (“Low educational attainment
causes alcohol abuse”).

A Useful Reference We shall only cover a small fraction of the available material in
the area of regression. If you wish to read further a good introductory text, with a very
modern outlook, is Koop, G. (2006) Analysis of Financial Data
A companion volume Koop, G. (2003) Analysis of Economic Data
covers similar ground, but with a slightly less financial and more economic
orientation. We shall refer to, and use, some of the data sets discussed by Koop.

2. The Concept of Correlation (in pictures)


We are often interested in assessing whether there is any connection between two
(or more) variables, i.e. whether the variables are “correlated”. Our ultimate aim is to
try and use one (or more) variables to predict another variable, but we will not really
be able to do this until the end of this unit.

Example 2.1 A simple example.


The Excel file BusFares_Passengers.xls contains data relating to UK bus fares, as
measured by the fare index, and the level of demand, as measured by the number of
passenger journeys. The data is available from the ONS website, and you are asked
to download this in Practical Exercises 7, Q1; the specific data shown in Table 2.1 is
for the whole of Great Britain.


We can see an inverse relationship between the two variables, i.e. higher fares are
associated with a lower demand. (But note the last three entries.) However we attach
no significance to which variable is plotted on which axis; if you interchanged them in
Fig.2.1 would your conclusion be any different?

Table 2.1: Great Britain data
Fig.2.1: Scatter plot of bus fares and passengers

Example 2.2 A more complicated example.


The Excel file ExecutivePay.xls, adapted from data given by Koop, contains four
items of information on 70 companies as shown in Fig.2.2.

Fig.2.2: Data relating to executive pay from 70 companies

• We have deliberately not used X and Y for any of the variables since this
terminology is almost invariably used in the (mathematical) sense “Y depends
on X”. With correlation we do not want to imply this - see Section 3.
• The scatter plots of Fig.2.2 are indicative of the following:
- There appears to be some relation between executive pay (E) and
company profit (P) in the following sense:
• As P increases E increases “in general”. This does not mean that
every time P increases E increases, just “much” of the time.
• As E increases P increases “in general”. But we have no sense of
“cause and effect” here. We only know that high (low) values of P
tend to be associated with high (low) values of E.


• If we imagine drawing in a “trend line” it will have positive slope. For


this reason we say E and P are positively correlated.
• The points are quite spread out. Again if we imagine drawing in a
“trend line” the points will not be too tightly bunched about the line.
This affects the strength (size) of the correlation, and we expect E
and P to exhibit "weak positive correlation".

We would like to make a statement such as


“Executive pay depends on company profit”
because we believe/feel/have been told this is a “reasonable” state of affairs that
should exist. What we are concerned with is the question
“Does the data support this conclusion (rather than just our feelings)?”

Fig.2.2: Relating Executive Pay and Company Profit

• When we have further variables present there are other correlations we can
look at and, in general, these initially complicate the situation. The scatter plots
of Fig.2.3 are indicative of the following:
- First observe that we can have repeated values, here of D, with
corresponding different values of P. (An important question we could ask
is “Why are these values of P different, given that D is the same?”)
- Would you agree there appears to be some relation between P and D?
Maybe a “weak positive correlation” between P and D?
- There does appear to be some relation between E and D with a “strong
positive correlation” between E and D.
• These ideas raise a fundamental problem.
- It is possible that E does directly depend on D - Fig.2.3 (b)
- It is possible that D does directly depend on P - Fig.2.3 (c)


- It is possible that E does not directly depend on P, despite the evidence of


Fig.2.2(a), but E does depend indirectly on P through the association of P with D.

Fig.2.3: Relating Executive Pay and Debt Change – panels: (a) P against D,
(b) E against D, (c) D against P, (d) P against S

• When we observe a correlation between two variables it is possible that this is


only a side effect of correlations between each of these two variables with a third
variable – see Fig.2.4.
• We can easily mistake indirect correlations for direct correlations (especially if we
are unaware of the existence of this third variable).
• In the statistical jargon these “hidden” variables are often termed “confounding
variables”. We can usually only infer their existence by having some detailed
(economic) knowledge of the underlying issues.
• In effect statistical analysis on its own cannot “guarantee results” if all the
necessary variables have not been identified (from other considerations). In
Example 2.2 do we know E is only influenced by 3 other variables?

[Figure: (a) Incorrect view (direct causation between E and P); (b) Correct view
(indirect causation: E and P linked through D).]

Fig.2.4: Direct and Indirect Correlation/Causation


Notation We use the notation rXY to denote the correlation (coefficient) between
the variables X and Y or, more simply, just r if the variables are obvious from the
context. We shall see how to actually compute a value for rXY in Section 4.
• From Example 2.2 we can (only really) conclude
rEP > 0 , rED > 0 , rPD > 0
but cannot assess the strength of these correlations in any quantitative sense.
• You should note that we would intuitively expect, for example,
rEP = rPE
(although this may not be clear from Figs.2.2). This means that we cannot use
the value of rEP to infer some causal connection between E and P.

Question How would you expect the sales S in Fig.2.2 to fit into all this? In particular
what sign (positive or negative) would you expect for rES, rPS and rDS?

Software The scatter plots in Figs.2.2 and 2.3 are produced individually in Excel. A
weakness of the software is that we cannot obtain a “matrix” of scatter plots, where
every variable is plotted against every other variable apart from itself (Why?). In
Example 2.1, with 4 variables, the result is 12 plots, as shown in Fig.2.5. The latter is
obtained in SPSS (Statistical Package for Social Sciences), and is a more powerful
statistical package than Excel. You may need to learn some SPSS at some stage.

Fig.2.5: Matrix Scatter Plot in SPSS
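
If SPSS is not available, a matrix of scatter plots (and the corresponding matrix of correlations) can be produced with the pandas library in Python. The sketch below assumes the four columns of ExecutivePay.xls are headed E, P, D and S; the actual column headings in the file may well differ, so adjust the names accordingly.

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Column names E, P, D, S are assumed for illustration only.
data = pd.read_excel("ExecutivePay.xls")[["E", "P", "D", "S"]]

print(data.corr())                        # matrix of correlation coefficients r
scatter_matrix(data, figsize=(8, 8))      # every variable plotted against every other
plt.show()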


3. Dependent and Independent Variables


This terminology is used when we know (or strongly believe) that
• changes in the independent variable (generically termed X) cause
• changes in the dependent variable (generically termed Y).
The discussions of Section 2 indicate how careful we need to be in deciding that we
really have causal connections. The following example illustrates some further
issues.

Example 3.1
(a) If we drop an object the time it takes to hit the ground depends on the height
from which it is dropped.
• Although this sounds trivially obvious we can check this by performing
experiments and, indeed, discover the precise relationship between height
and time taken.
• We have control over the various heights we choose and this is
characteristic of an independent variable.
• We cannot directly control the time taken, this depending on our choice of
height, and this is characteristic of a dependent variable.

(b) We can (very plausibly) argue the sales of a product (televisions) depend on the
price we charge for them. Here we would regard sales (termed demand by
economists) as depending on price.
• Note we can vary the price we charge (independent variable), but we do
not have control over the number sold (dependent variable).
• Here we suspect sales do not depend on price alone, but on other factors
(possibly advertising, the economic climate and so on).
• We could complicate the discussion and argue that, if the sales drop, we
can lower the price to try and improve the situation. In this sense may we
regard price as depending on sales (even though we cannot select the
level of sales as we would an independent variable)?
(c) We can argue the exchange rate of the £ (against the $ say) depends on the
level of UK (and US) interest rates.
• We could try and check this by checking data compiled by, say, the Bank
of England. Statistical techniques, or even simple scatter plots, would help
us decide whether there was indeed a connection.


• But maybe we could not say whether changes in one caused changes in
the other (possibly because there were other factors to take into account).
• However, we have no control over either variable and so we cannot do
any experiments as we can in (a) and (b). This is typical of economic
situations where “market forces” determine what occurs, and no “designed
experiments” are possible.
• When this type of situation occurs, where we cannot meaningfully label
anything as an independent variable, the terminology “explanatory
variable" is used. This is meant to indicate that we are trying to use
changes in this variable to “explain” changes in another (dependent)
variable.

(d) If we return to Example 2.2 we can make the following argument:


• One of the responsibilities of a chief executive is to “manage” the
profitability of a company. If he/she can increase profits then their salary
will change to reflect this. We would expect E (executive pay) to depend
on P (company profit), possibly with a “year lag” built in.
However, we can also make the following argument:
• The chief executive’s managerial skills are the key factor in company
profitability, and salary directly reflects management skills. Hence
executive salary determines profitability, i.e. P depends on E. (Is there a
connection with Fig. 2.4(b) here?)
Which argument do you believe, i.e. which variable do we treat as an
“explanatory” one?
(e) In the more general scenario of Example 1.1 we can see some of the potential
difficulties in assigning the terms “explanatory” and “dependent” to variables. In
these more complex (but realistic) situations economists have developed
“simultaneous equation” models in which all variables are really regarded as
explanatory, and equations are formulated connecting all the variables. In a
sense all variables are regarded as depending on all the remaining variables.
But for our purposes we shall stick with situations in which a dependent
variable, and one (or more) explanatory variable(s), can be identified.


4. The Concept of Covariance (in formulae)


Terminology In any particular application we may be working with a discrete
distribution (such as the binomial), or a continuous distribution (such as the normal).
In a general context we can consider both situations by using a more general
(abstract) terminology. We denote by
• E(X) the expected value of the random variable X. This is yet another term for
the mean (average).
• V(X) the variance of the random variable X. Importantly
V(X) = E[(X – E(X))²] --- (1)
since this represents the average of the squared deviations from the mean.
The expected value and the variance are both computed as a sum (discrete case) or
via integration (continuous case). By implication these quantities refer to population
values but, in practice, they may often be approximated by sample values.

Definition
(a) Given two (random) variables X and Y the covariance of X and Y, denoted
Cov(X,Y) or σ(X,Y), is defined by the mean value of the product of their
deviations (from their respective mean values)
Cov(X,Y) = E[(X – E(X))(Y – E(Y))] --- (2)
(b) In particular, if Y = X, (2) and (1) become identical, so that
V(X) = Cov(X,X) --- (3)
In this sense the covariance is a natural generalisation (to two variables) of the
variance (for a single variable). This is really the motivation for taking the
product in (2) to measure the interaction of X and Y, rather than any other
combination.

Example 4.1 Consider the following set of returns for two assets X and Y:
Possible States Probability R(X) = Return on X R(Y) = Return on Y
State 1 0.2 11% -3%
State 2 0.2 9% 15%
State 3 0.2 25% 2%
State 4 0.2 7% 20%
State 5 0.2 -2% 6%

Table 4.1: Hypothetical Asset Returns


• E[R(X)] = 0.2*11% + 0.2*9% + 0.2*25% + 0.2*7% + 0.2*(-2%) = 10%


• E[R(Y)] = 0.2*(-3%) + 0.2*15% + 0.2*2% + 0.2*20% + 0.2*6% = 8%

Here we have used (3) in Unit 5 Section 8, expressed in “expected value formalism”.
Although we might naively prefer X to Y on the basis of expected returns, we must
also look at the “variability/riskiness” of each asset.
• Var[R(X)] = 0.2*(11-10)² + 0.2*(9-10)² + 0.2*(25–10)² + 0.2*(7–10)² + 0.2*(-2-10)²
= 0.2 + 0.2 + 45 + 1.8 + 28.8 = 76

or σ(R(X)) = √76 = 8.72%

• Var[R(Y)] = 0.2*(-3-8)² + 0.2*(15-8)² + 0.2*(2–8)² + 0.2*(20–8)² + 0.2*(6-8)²
= 24.2 + 9.8 + 7.2 + 28.8 + 0.8 = 70.8

or σ(R(Y)) = √70.8 = 8.41%

Here we have essentially used (4) in Unit 5 Section 8. Both assets have similar
variability and, taken in isolation, we would still prefer X to Y (a larger expected
return with about the same degree of risk).
There is essentially nothing in these calculations that we did not cover in Unit 5.
However, there is one new ingredient here, which becomes important if we want to
combine both assets into a portfolio.
• Using (2) Cov[R(X),R(Y)] = E[(R(X) – 10)*(R(Y) – 8)]
= 0.2*(11-10)(-3-8) + 0.2*(9-10)(15-8) +
0.2*(25-10)(2-8) + 0.2*(7-10)(20-8) + 0.2*(-2-10)(6-8)
= -2.2 – 1.4 – 18 – 7.2 + 4.8 = -24
Note that the individual terms in this sum can be positive or negative, unlike in the
variance formulae where all are constrained to be positive. We can see that most of
the terms are negative, as is the resulting sum.
• We need to be careful how we interpret this value of -24. The negative sign tells
us that, in general, as R(X) increases R(Y) decreases, i.e. the assets X and Y
are negatively correlated. We shall need the ideas of Section 5 to label this as a
“weak” correlation, although a scatter plot would be indicative.
• The units of Cov[R(X),R(Y)] are actually squared % - can you see why? To
avoid this happening we really should work with decimal forms of the returns,
rather than percentages. Thus 10% would be used as 0.1, and so on. This gives
Cov[R(X),R(Y)] = -0.0024


But we still have no scale against which we can assess the meaning of this value.
Of course this situation in Example 4.1 is not at all realistic. In practice we would not
have only 5 possible returns, nor would we know associated probabilities. But this
very simplified model does illustrate the essentially simple nature of (2), when
stripped of the symbolism. We shall return to this example in Section 13 to discuss
the fundamental importance of the covariance in a portfolio context. You should also
compare Example 4.1 with Example 8.1 of Unit 5.
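If you would like to reproduce these numbers outside Excel, the following minimal Python sketch (assuming the numpy library is available; the variable names are our own, not part of the module materials) carries out the Example 4.1 calculations in decimal form.

```python
import numpy as np

p  = np.array([0.2, 0.2, 0.2, 0.2, 0.2])        # state probabilities
rx = np.array([0.11, 0.09, 0.25, 0.07, -0.02])  # returns on asset X (decimal form)
ry = np.array([-0.03, 0.15, 0.02, 0.20, 0.06])  # returns on asset Y (decimal form)

ex, ey = np.sum(p * rx), np.sum(p * ry)          # expected returns: 0.10 and 0.08
var_x  = np.sum(p * (rx - ex) ** 2)              # 0.0076
var_y  = np.sum(p * (ry - ey) ** 2)              # 0.00708
cov_xy = np.sum(p * (rx - ex) * (ry - ey))       # definition (2): -0.0024

print(ex, ey, var_x, var_y, cov_xy)
```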

5. The Concept of Correlation (in formulae)


As we have seen, the expression Cov(X,Y) in (2) has units associated with it, since X
and Y do. To give a dimensionless quantity it is usual to divide by
- the standard deviation σ(X) of X (to eliminate the units of X), and also
- the standard deviation σ(Y) of Y (to eliminate the units of Y).
• The resulting quantity is the correlation coefficient, denoted r(X,Y) or rXY:

r(X,Y) = σ(X,Y) / [σ(X) * σ(Y)] = Cov(X,Y) / [StDev(X) * StDev(Y)] --- (4)

Example 5.1 Returning to Example 4.1 we can use (4) to give

r(X,Y) = -24 %² / (8.72% * 8.41%) = -0.33 (to 2D)
Note that the dimensions cancel out and, as stated, our final value is dimensionless.
Because of this we would obtain the same value using the decimal forms of each of
the quantities (-.0024, 0.0872 and 0.0841). However most people prefer to work with
numbers that are “not too small”, otherwise interpretation can become difficult.
However we still have the difficulty of assigning a meaning to our correlation
coefficient value of -0.33. To do this we need the following result:
Property of r(X,Y):   -1 ≤ r(X,Y) ≤ 1 --- (5)

In practice we interpret (5) in the following form:


• r is close to +1 if x and y are strongly positively correlated.
• r is close to -1 if x and y are strongly negatively correlated. --- (6)
• r is close to 0.5 or -0.5 if x and y are weakly (positively or negatively) correlated.
• r is close to 0 if x and y are very weakly correlated, or uncorrelated.


Comments
1. The result (5) is really an algebraic one, following from the form of (2). In words
(5) says that the variation/interaction between X and Y cannot be larger (in
absolute value) than the product of the variation in X and the variation in Y.
2. We can understand how the product form of (2) results in (6) by noting that
sums of products of xy terms will be
• Large and positive if both x and y are positive (or both negative)
• Large and negative if x and y have different signs.
• Small if the signs of x and y have “randomly mixed” signs.
These situations are illustrated in Fig.5.1 where, to keep things simple, we have
not subtracted off the appropriate means in (2).
3. The two equalities in (6), i.e. r(X,Y) = 1 or r(X,Y) = -1, occur only if the (X,Y)
points lie exactly on a straight line. Can you prove this? (See (7) of Section 6.)
4. Because the product in (2) just contains single powers of X and Y, characteristic
of the equation of a straight line, the correlation coefficient measures linear
association between two variables. Nonlinear correspondences (such as y =
x2) are not picked up by r(X,Y).
5. Indeterminate cases occur rather frequently in practice when r is around ±0.5,
in which case we say X and Y are “weakly” correlated; this is precisely the
situation we have seen in Example 5.1.

Fig.5.1: Rationale for interpretation (6)


Note In order to make the subsequent calculations in this unit as transparent as


possible, we look at very simple (small) datasets, rather than the much larger (and
more realistic) one in Example 2.2. For larger data sets we shall use Excel in the
practical sessions.

6. Sample Correlation (and Covariance)


The definition (2), and the result (4), assume we are dealing with a (whole)
population of (X,Y) values. In practice this is unusual, and the best we can do is to
compute the sample covariance and correlation coefficient. In such a case we
interpret the expected value in (2) as a sample mean and, in the discrete case, this
gives the sample covariance SCov(X,Y) as

SCov(X,Y) = [1/(n-1)] ∑i (xi - x̄)(yi - ȳ) --- (7a)
with x̄ = mean of x-values,  ȳ = mean of y-values

The summation sign (Σ) indicates contributions are to be added (summed) over all
possible data values. Recall the “degrees of freedom” argument in Unit 7 Section 8 to
account for the (n – 1) factor in (7a).
Using the expression for the standard deviation of a random variable (Unit 4 Section
8), we can show that (4) can be written in either of the forms

r(X,Y) = ∑i (xi - x̄)(yi - ȳ) / √[ ∑i (xi - x̄)² * ∑i (yi - ȳ)² ] --- (7b)

r(X,Y) = [n∑xy - ∑x∑y] / √{ [n∑x² - (∑x)²] [n∑y² - (∑y)²] } --- (7c)

If you have good algebraic skills you should be able to derive these results. Although
(7b) and (7c) look a bit daunting, they are simple to implement in Excel in several
different ways – see Example 6.1 and the Practical Exercises.

Example 6.1 You have decided that you would like to buy a second hand car, and
have made up your mind which particular make and model you would like. However
you are unsure of what is a sensible price to pay. For a few weeks you have looked
at advertisements in the local paper and recorded the age of cars, using the year of
registration, and the asking price. The data is shown in the table below:


Age (years) Price (£’s) Age (years) Price (£’s)


3 £4,995 6 £1,675
3 £4,950 6 £2,150
3 £4,875 6 £2,725
4 £4,750 6 £1,500
4 £3,755
Table 6.1: Price and Age of Second Hand Cars

(Although you can find car price data on the web, the above device of using local
newspaper information is highlighted in Obrenski, T. (2008) Pricing Models Using
Real data, Teaching Statistics 30(2).)
The scatter plot of Fig.6.1 appears to show a “reasonably strong” negative correlation
between Y = Price of car (in £) and X = Age of car (in years).
We can confirm this conclusion with the calculation of r(X,Y) given in the Excel
spreadsheet Regression1L (CarData_Rsquare tab), and reproduced in Fig.6.2.
• Given the column sums in Fig.6.2 r(X,Y) is computed from (7c) as follows:

rXY = [n∑xy - ∑x∑y] / √{ [n∑x² - (∑x)²] [n∑y² - (∑y)²] }

    = [9*126,780 − 41*31,375] / √{ [9*203 − 41²] [9*126,984,425 − 31,375²] }

    = -145,355 / √(146 * 158,469,200) = -145,355 / 152,106.88 = -0.96

Fig.6.1: Car price data                  Fig.6.2: Correlation coefficient

in agreement with the square root of the value 0.91 given in Fig.6.2.
• You should check (7b) gives the same result. The advantage of the latter is the
intermediate numbers are smaller, although they do become non-integer. This
is not really an issue if you are using Excel, but can be with calculators.


Comments Observe the following:


• It is clear why these calculations are normally automated on software packages,
with the value of r(X,Y) usually available directly from the data without the
intermediate calculations displayed in Fig.6.2.
• But it is important to bear in mind the meaning of r(X,Y) as given in (6).
- The numerator measures the interaction of x and y values in such a way
(product formula) that
• a “small” value is obtained for (random) data “without any pattern”,
• a “large” value is obtained for data “exhibiting a pattern”, with
• only “linear patterns” being picked up.
- The denominator is designed to scale (or normalise) values so that r(X,Y)
lies in the “simple range” (5) to make interpretation of results simpler.
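As an illustration outside Excel, here is a minimal Python sketch (numpy assumed available; variable names are our own) that evaluates (7c) directly for the car data of Example 6.1, with numpy's built-in routine used purely as a cross-check.

```python
import numpy as np

age   = np.array([3, 3, 3, 4, 4, 6, 6, 6, 6], dtype=float)
price = np.array([4995, 4950, 4875, 4750, 3755, 1675, 2150, 2725, 1500], dtype=float)

n = len(age)
numerator   = n * np.sum(age * price) - np.sum(age) * np.sum(price)
denominator = np.sqrt((n * np.sum(age**2) - np.sum(age)**2) *
                      (n * np.sum(price**2) - np.sum(price)**2))
r = numerator / denominator
print(r)                                 # about -0.96, as in the hand calculation
print(np.corrcoef(age, price)[0, 1])     # cross-check using numpy's built-in routine
```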

7. Linear Relationships and “Line of Best Fit”


Returning to the car price data of Example 6.1 we observe the following:
• The asking price for the car will reflect the mileage, condition of the car and
aspirations of the vendor.
• As a rough guide of what we would expect to pay for a car of this make we
would not go far wrong (based on the scatter plot) if we put a straight line
through the data and used the line to give us an average price for cars of each
age.
• This is an example of straight line depreciation, where depreciation of an asset
can be modelled by assuming that items are losing value at a constant rate over
time.
Many other situations can be modelled in this way by making the assumption that
there is a basic linear (straight line) relationship between two variables.
• The relationship will never be exact in any practical business context because
of other factors which influence the data values.
• If these other factors are not known then the distribution of the observed values
about the line can be regarded as random variation.
• The statistical problem is how to determine the best line to fit the data in any
particular situation.


What is the Best Line (Words)?


Simple linear regression is a statistical technique which is used to fit a “best” straight
line to a bivariate set of data. To explain how this is done we need some terminology
(some of which we have already seen). This terminology is important since software
packages (Excel and SPSS) use these terms when outputting results. Refer to
Fig.7.1.
• If the line fits the data well then all the data points will be close to the line.
• The fitted value for an observation is the (y) value on the line which corresponds
to the value of the independent variable (x) for that observation.
• One way of determining which line is best is to look at the vertical distances of
the data points from the line. These vertical distances are called residuals or
errors.
• A residual is the difference between the observed value of the dependent
variable and the fitted value.

Fig.7.1: Terminology used in determining line of best fit.


• Residuals can be positive, negative or zero depending upon whether the


observed value is above, below or on the plotted line.
• An obvious way of determining the best line would be to make the sum of the
residuals zero.
- Unfortunately every straight line does this! This is illustrated on the Excel
spreadsheet Regression1 for the car price data.
- Even if this were not true Fig.7.2 illustrates why a “small” sum of residuals
would not necessarily imply a good fit to the data.
• A more profitable approach is to look at the squared distances from the line
and define the best line to be that line which goes as close as possible to all the
points, in the sense that it makes
the sum of the squared residuals (ESS) as small as possible.
• This method involving fitting a line by making the sum of the squared residuals
as small as possible is called least squares linear regression or simple linear
regression.

(Left panel: sum of residuals = 0 but very poor fit to data. Right panel: sum of residuals = 0 and very good fit to data.)

Fig.7.2: Problem with using residual sum criteria for determining line of best fit.

One Further Point


The following observation, although fairly straightforward to understand in principle,
causes endless confusion in practice when trying to interpret regression results (from
SPSS or Excel):
When we have fitted a straight line to some data we want to know
“How good is the straight line fit?”
This question only makes sense in a relative sense. We also need to answer the
(admittedly odd sounding) question
“How well can we do without fitting a straight line?”
What we really should be asking is
“How much more information does the straight line fit give us?”


• What information does the data itself contain? Bear in mind that what we really
want our straight line for is prediction (or forecasting).
• If we ignore all the x-values in the data the best estimate we can make of any
y-value is the mean y of all the y-values.

• For any value of x we would predict the corresponding y-value as y .

• Of course this provides very poor estimates, but serves the purpose of setting
a “base line” against which we can judge how good our straight line fit is (once
we have derived it!).
• Thus, in Fig.7.1, we need to ask how much better does the regression (red) line
do in “explaining the data” than the horizontal (blue) line. This gives rise to two
important terms:
- Regression sum of squares (RSS) measuring the (squared) difference
between the regression line and the “mean line”.
- Total sum of squares (TSS) measuring the (squared) difference between
the data values and the “mean line”. TSS is independent of the
regression line, and serves as a measure of the situation before
attempting to fit a regression line.
• It should be obvious from Fig.7.1 that, for every observation,
Total “distance” = Error “distance” + Regression “distance”
It is not obvious that a similar relation holds for the squares of these distances
(after we add them together over all data values)
TSS = ESS + RSS --- (8)
• The result (8) is of fundamental (theoretical) importance and should be checked
in any Excel or SPSS output.

What is the Best Line (Formulae)?


Using the “least squares criteria” (of minimising the residual sum of squares) we can
show (after a good deal of relatively straightforward algebra) the following:

The straight line of best fit is y = a + bx


where b = gradient (slope) of the line ; a = intercept of line with y-axis
are calculated form the given (x,y) data values by the formulae:
b = [nΣxy − ΣxΣy] / [nΣx² − (Σx)²]  ;  a = Σy/n − b*Σx/n --- (9)
Here the summation ( Σ) symbol indicates a sum over all data points is required.


Example 7.1 If we return to the car price data of Example 6.1, the column sums
appearing in (9) are computed below.
x     y     xy     x²
3 4995 14985 9
3 4950 14850 9
3 4875 14625 9
4 4750 19000 16
4 3755 15020 16
6 1675 10050 36
6 2150 12900 36
6 2725 16350 36
6 1500 9000 36
Σx = 41   Σy = 31375   Σxy = 126780   Σx² = 203

Table 7.1: Hand computation of best (least squares) straight line in Example 6.1

The gradient of the regression line is first calculated as:

b = [n∑xy − ∑x∑y] / [n∑x² − (∑x)²] = [9*126,780 − 41*31,375] / [9*203 − 41²] = −145,355 / 146 = −995.6

and the intercept via (using our value of b from above)

a = [∑y − b∑x] / n = [31,375 − (−995.6)*41] / 9 = 8,021.6
The least squares regression line (of best fit) is therefore given by:
Price = 8021.6 - 995.6 * Age

Comments Observe the following points:


• The gradient of the line has a negative slope, as expected, indicating that the
price of this make of car decreases as it gets older (Age increases).
• For each additional year the price is estimated to drop by £995.6, nearly one
thousand pounds (a drop since the gradient of the line is negative).
• The intercept gives the estimated price of a new car since a new car has zero
age: the price in this case is approximately £8022. Is this meaningful?
• Observe the similarities between the above calculation and that of the
correlation coefficient in Example 6.1. Indeed, if we compare (7c) with the
expression for b in (9) we see that we can write b = r(X,Y) * σ(Y)/σ(X).


Thus, the slope of the regression line is really a “scaled” version of the
correlation coefficient, the scaling depending on the standard deviations of X
and Y.
• In the form (9) we require the value of b before we can compute a. It is possible
to give a formula for a not involving b, but this produces extra calculations that
are not really needed. The form (9) is computationally the most efficient.
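For comparison with the hand calculation, a minimal Python sketch of (9) for the car data follows (numpy assumed available; np.polyfit is included only as a cross-check).

```python
import numpy as np

age   = np.array([3, 3, 3, 4, 4, 6, 6, 6, 6], dtype=float)
price = np.array([4995, 4950, 4875, 4750, 3755, 1675, 2150, 2725, 1500], dtype=float)

n = len(age)
b = (n * np.sum(age * price) - np.sum(age) * np.sum(price)) / \
    (n * np.sum(age**2) - np.sum(age)**2)     # gradient: about -995.6
a = np.mean(price) - b * np.mean(age)         # intercept: about 8021.5

print(a, b)
print(np.polyfit(age, price, 1))              # returns [slope, intercept] as a check
```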

8. Prediction using the regression line


One important reason for obtaining lines of best fit is to use them in prediction. In
general we can ask the following two questions:
Question 1 “Given a value of x what is our best prediction of the value of y?”
Question 2 “How reliable are our predictions?”
The answer to Question 1 is that the regression line provides our best predictions.
Indeed we can easily forecast using the regression line.

Example 8.1 Use the regression line in Example 7.1 to predict the following:
(a) The price of a 4 year old car (b) The price of a 7 year old car
(c) The price of a 10 year old car

Solution We easily obtain the following results:


(a) When Age = 4 (years) Price = 8021.6 - 995.6 * 4 = £4039.2
(b) When Age = 7 (years) Price = 8021.6 - 995.6 * 7 = £1052.4
(c) When Age = 10 (years) Price = 8021.6 - 995.6 * 10 = -£1934.4
Of course we notice something amiss about this last value!
• The computation in (a) is termed interpolation and produces a reliable
estimate of price. This is because the age of 4 years is within the range of our
data, and hence we know how values behave in this age region.
• The second computation is termed extrapolation and produces a less reliable
estimate of price. This is because the age of 7 years is (just) outside our data
range, and we have to rely on the assumption that car prices in this age region
behave as they do in the 3 – 6 age range (decreasing by roughly £1000 for
each year older).
• In (c) we are just too far outside our data range and clearly prices are behaving
differently! Indeed we can only expect our regression lines to provide adequate
descriptions of the data within certain limits (on x). These limits will vary from
problem to problem.


• Very often we do not want our y-values turning negative. In our case this gives
the largest Age (x) value as
Age = 8021.6/995.6 ≈ 8 years
• However, in general, we can expect car prices to behave in a nonlinear fashion
both when the age is
- Small (new cars commanding substantially higher prices), and
- Large (after a certain age cars have no intrinsic market value)

In Section 12 we look at how we can treat non linearities.

Fig.8.1: Nonlinear behaviour of car prices (3 linear regions?)
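A small Python sketch of the prediction step in Example 8.1 is given below; the helper function predict_price and its crude range check are our own illustrative device, not part of the module materials.

```python
def predict_price(age, a=8021.6, b=-995.6, age_min=3, age_max=6):
    """Predict price from the fitted line, warning when we extrapolate."""
    if not (age_min <= age <= age_max):
        print(f"Warning: age {age} lies outside the data range {age_min}-{age_max} "
              "- this is extrapolation and the estimate is less reliable.")
    return a + b * age

for age in (4, 7, 10):
    print(age, predict_price(age))   # reproduces the three values in Example 8.1
```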

9. Excel Implementation

From a practical point of view this section is the most important one in the Unit,
since you will usually perform regression calculations via Excel rather than by
hand calculation.

We may note that in our (car price data) calculation using (9) we obtain no
indication of the accuracy of our estimates of a and b. To do this we have to make
certain assumptions (relating to the normal distribution) which we discuss in Section
11. This allows us to develop standard error estimates, as in Unit 8, together with
confidence intervals for a and b.
• The necessary calculations, although feasible by hand, are much more
conveniently carried out using statistical software. We illustrate the ideas using
Excel, although different software will give similar output (probably with a
different layout of results).
• The Excel output of Fig.9.1 below refers to the car price data of Example 6.1,
and is obtained via Tools → Data Analysis → Regression


You are asked to perform similar computations in the Practical Exercises.


• The R Square value is discussed in the next section, and is closely related to the
correlation coefficient r. In fact with r = -0.9556 (Example 6.1) we find
r² = (-0.9556)² = 0.91319
(Conversely we can determine r by taking the square root of the given R Square
value via r = √0.91319 = ± 0.9556.)


Fig.9.1: Excel output for car price data

• The a and b coefficients in the regression equation (Example 7.1) are output,
together with indications of their accuracy. Specifically:
• The standard errors enable us to verify the given 95% confidence intervals. For
example, for the “Age coefficient” b:
95% CI for b = -995.582 ± 2.365*116.018 = -1269.9 to -721.2
(Here we need the CI calculation with the t-distribution as in Example 14.1 of
Unit 8. The df = n – 2 since we are estimating the two parameters in the
regression equation from the data – see Unit 7 Section 8.)
• The p-values can be obtained as in Example 14.2 (Unit 8) but require some
effort. For completeness we briefly give the details:
t = (estimate − hypothesised value) / standard error = (−995.582 − 0) / 116.0182 = -8.581 (available above)

P( |t| > 8.581 ) = 5.81×10⁻⁵ (using the Excel TDIST function)

You should now begin to see how much of our previous work is “packaged together”
in Excel output – look back at Unit 8 Section 12.
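If you prefer to verify this part of the Excel output programmatically, the following Python sketch (scipy assumed available) reproduces the confidence interval and p-value for the Age coefficient, taking the point estimate and standard error from Fig.9.1 as given.

```python
from scipy import stats

b, se_b, n = -995.582, 116.0182, 9          # values read from the Excel output
df = n - 2                                  # two parameters (a and b) estimated

t_crit = stats.t.ppf(0.975, df)             # about 2.365
ci = (b - t_crit * se_b, b + t_crit * se_b) # about (-1269.9, -721.2)

t_stat = (b - 0) / se_b                     # about -8.581
p_value = 2 * stats.t.sf(abs(t_stat), df)   # about 5.8e-05, matching Excel's TDIST

print(ci, t_stat, p_value)
```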


Further Comments on Excel Output


Because you will undoubtedly meet Excel regression output outside of this module,
we give some further explanation of some of the calculations summarised in Fig.9.1.

1. ANOVA Table Analysis of Variance (ANOVA) is concerned with splitting up


the “total variation” in a data set into its various components (depending on
precisely the model we are using to describe the data). In our regression
context the appropriate decomposition (splitting up) is described in Section 10
below.
• Degrees of Freedom (df). Look back at Unit 7 Section 8 for a review of the
concept of “degrees of freedom”.
- To compute TSS (Total Sum of Squares) we need to know the sample
mean of the Y-values (see Unit 9 Fig.7.1), and we lose one df because of
this. With n data values TSS has (n – 1) df associated with it.
- To compute RSS (Regression Sum of Squares) we need to know all
independent parameters entering the regression equation. In our case we
have the two parameters a and b. However they are not independent, as
(9) shows (Unit 9). Hence we only have 2 – 1 = 1 independent parameter
needed to calculate the regression equation values, and hence RSS has 1
df.
- ESS (Error, or Residual, Sum of Squares) accounts for the remaining
degrees of freedom. Since TSS = RSS + ESS then ESS must have (n –
2) df.
Check these values in Fig.9.1.
• SS stands for Sum of Squares. Because different sums of squares are based
on different numbers of parameters calculated from the data, it is common to
calculate the Mean Square
MS = Sum of Squares (SS) / Degrees of Freedom

For ESS:  Mean Square for ESS = ESS / Degrees of Freedom = 1,528,483.39 / 7 = 218,354.77

• We have seen (Unit 7 Section 9) that, when the underlying distribution is


normal, sums of squares follow a Chi-square distribution. Further (Unit 7
Section 10) the ratio of two such Chi-square variables follows an F distribution.
In Fig.9.1 we compute the ratio of the two MS values:

F = Mean Square for RSS / Mean Square for ESS = 16,079,205 / 218,354.8 = 73.63799


The p-value (or Significant F-value) corresponding to this F value can be


computed as in Unit 8 (Section 11). With respective df of v1 = 1 and v2 = 7 the
critical 5% F-value is 5.59 using Table 10.3 (Unit 7). Since our F-value of 73 is far
bigger than this critical value we know our p-value is much less than 0.05. In fact
using the Excel command FDIST(73.63799,1,7) gives P(F > 73.638) = 5.809×10⁻⁵
in agreement with Fig.9.1.

• It is important to understand precisely what the above F-values measure. We


are in fact testing whether the regression line has any predictive power beyond
just using the mean Y-value (see Fig.7.1). Formally we are testing (see Unit 8
Section 10) H0 : All b-coefficients = 0
against H1 : Not all b-coefficients = 0
In our case we only have the one b-coefficient, which specifies the slope of the
regression line. (Note that if b = 0 our regression line in (9), y = a + bx, just gives
y = a, which is just approximating all y-values by a constant – the mean.)
Because our p-value is so small (certainly < 0.05) we reject H0 and conclude
that our regression line does have predictive power.
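The ANOVA arithmetic can be checked in the same spirit; the sketch below (scipy assumed available) rebuilds F and its p-value from the sums of squares quoted in Fig.9.1.

```python
from scipy import stats

tss, rss = 17607688.89, 16079205.5   # from the ANOVA table in Fig.9.1
ess = tss - rss                      # about 1,528,483.4, confirming TSS = RSS + ESS
n = 9

ms_rss = rss / 1                     # RSS has 1 degree of freedom
ms_ess = ess / (n - 2)               # ESS has n - 2 = 7 degrees of freedom

F = ms_rss / ms_ess                  # about 73.64
p_value = stats.f.sf(F, 1, n - 2)    # about 5.8e-05, the "Significance F" in Excel

print(ess, F, p_value)
```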

2. Regression Coefficients Table Here we are given the coefficients in the


regression equation together with accuracy (standard error) estimates and
hypothesis test results. Explicitly we have the following information:

• Regression equation is Price = 8021.541 – 995.582*Age


This agrees with the results in Example 7.1.

• Each p-value allows us to test individually whether the corresponding


coefficient is zero. Thus, for the slope, testing
H0 : b = 0 against H1 : b ≠ 0
the p-value 5.81×10⁻⁵ allows us to reject H0. In this particular case (simple
regression) this t-test is equivalent to our F-test above. In fact we can see that
F = 73.63799 = (−8.58126)² = t².

10. Coefficient of Determination (R2)


We have seen in Section 9 how to check the accuracy of our regression coefficients.
But there is another, very important, check we need to perform in order to answer the
following question: Does our linear regression model adequately fit the data?


(If not we would think of nonlinear regression, or transforming the data – see Section
12.) We can try and assess “how good the fit is” using the following reasoning. To be
specific we use the car data (price and age):
• Not all cars of the same age sell for the same price. (Why?)
• The variability in price can be split into two parts:
- The relationship of price with age
- Other factors (make of car, interest rates and so on). These factors may
  • be unknown, or are just
  • not included in the model for a variety of reasons (keep the model
    simple, no data available on other factors, …)
• We can isolate these two contributions by arguing that
- The relationship of price with age is measured by the Regression sum
of squares (RSS) discussed in Section 7.
- Factors not included in the model will result in an “error” in the
regression, measured by the residual (error) sum of squares (ESS).
- We also know that the total sum of squares (TSS) is given by
TSS = RSS + ESS (Check in Fig.9.1)

Definition: The coefficient of determination R² is defined by

R² = Regression sum of squares / Total sum of squares = RSS / TSS

Computation: Both of these quantities are available in the Excel output – look at
the ANOVA (Analysis of Variance) section of Fig.9.1. Here

R² = RSS / TSS = 16,079,205.5 / 17,607,688.89 = 0.9131922764

and we would round this to something like R² = 0.91

Comments:
1. We can clearly see that R2 lies between 0 and 1.
2. We can express the R2 value in words in the following (important) form:
R2 measures the proportion of variability in y that is accounted for by its
straight line dependence on x


3. With this last interpretation we usually quote a percentage. For our example
91% of the variability in price of a car is due to its (linear) dependence on the
car’s age. Hence 9% of the variability is due to other factors.
4. Rule of thumb : There are no precise rules for assessing how high a value of
R2 is needed in order to be able to say we have a “good fit” to the data (in the
sense that we have explained “most” of the variation in y by our choice of x-
variable). Rough guidelines are:
• R2 > 0.8 ⇒ very good fit
• 0.5 < R2 < 0.8 ⇒ reasonable fit
• R2 < 0.5 ⇒ not very good fit

In these terms the car example provides a very good fit of the linear regression
model to the data.
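A short Python check of this R² value, and of its link with the correlation coefficient, is sketched below using the sums of squares quoted in Fig.9.1.

```python
tss, rss = 17607688.89, 16079205.5   # from the ANOVA part of the Excel output

r_squared = rss / tss                # about 0.913
r = -0.9556                          # sample correlation from Example 6.1
print(r_squared, r ** 2)             # r squared reproduces R Square (up to rounding)
```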

11. Assumptions Underlying Linear Regression and Residual Plots


We have discussed how to obtain, and interpret, the regression equation and how
well this fits the data. Now we need to think about what information our data really
contains.
• The given data can only really be regarded as a sample from an underlying
population. (For our car data the set of all used cars ever advertised in the local
paper.)
• The values for the regression coefficients (slope and intercept) are thus just
sample estimates of the true (unknown) population values, and will thus vary
from sample to sample.
• Our point estimates (of slope and intercept) need to be supplemented by
confidence interval estimates to give us some indication of the reliability of our
particular sample values (as we have seen in Section 9).
• As we have previously seen this requires some assumptions to be made on
the underlying population, e.g. it is normally distributed, or the sample size is
large enough that we can use the Central Limit Theorem (and again use the
normal distribution), and so on.
• In regression there are various assumptions (10 in all !!) that need to be
made
- to justify each step in the solution procedure which arrives at the formulae
(9) for the slope and intercept, and


- to allow the derivation of confidence intervals (as in Section 9).


• The most important of these assumptions is the following

The residuals/errors are


• Normally distributed, with
• Mean = 0 and
• Standard deviation = constant (independent of x)
• Independent of each other, i.e. do not depend on x

This is commonly interpreted in the form

Residuals should have the following properties:


• Should show no discernible pattern (random residual pattern)
• Positive and negative values should alternate (on the average)
• Should be “small”
• Should lie within a “constant band”

• It is important to check the validity of these assumptions in any particular


application. If we do not, one of two things can happen:
- Although our regression line may look fine by eye, and appear to be a
good fit to the data, there may be a better fit available (by a nonlinear
function such as a quadratic).
- The standard errors for the coefficients (such as in Fig.9.1) may be overly
optimistic, and the corresponding confidence intervals are much wider
than we believe. The precision of the regression coefficients is lower
than the output indicates.
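One informal way to carry out such checks outside Excel is sketched below for the car data (numpy and matplotlib assumed available): compute the residuals and look at their mean, their pattern and their spread.

```python
import numpy as np
import matplotlib.pyplot as plt

age   = np.array([3, 3, 3, 4, 4, 6, 6, 6, 6], dtype=float)
price = np.array([4995, 4950, 4875, 4750, 3755, 1675, 2150, 2725, 1500], dtype=float)

b, a = np.polyfit(age, price, 1)      # least squares fit
residuals = price - (a + b * age)

print(residuals.mean())               # should be essentially zero
plt.scatter(age, residuals)           # look for patterns and changing spread
plt.axhline(0, color="grey")
plt.xlabel("Age (years)")
plt.ylabel("Residual (£)")
plt.show()
```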

Checking the Normality Assumption


The simplest way to do this is to use the Excel Regression tool and tick the box
marked Residual Plot. Because the car price data of Example 6.1 has only three
different x-values, the corresponding residual plot is difficult to interpret in any
meaningful fashion. For this reason we consider another example.


Example 11.1 In an investigation into the relationship between the number of weekly
loan applications and the mortgage rate, 15 weeks were selected at random from
among the 260 weeks of the past 5 years. The data are shown below:
Week Mortgage Number loan Week Mortgage Number loan
rate (%) applications rate (%) applications
1 11.0 75 9 10.0 87
2 13.5 65 10 11.0 79
3 13.0 62 11 10.5 80
4 12.0 76 12 12.0 72
5 15.0 50 13 12.5 69
6 14.0 58 14 13.0 65
7 14.5 54 15 13.0 61
8 13.5 64
Table 11.1: Mortgage Applications

We consider the mortgage rate data shown in Table 11.1. The residuals, shown in
Fig.11.1, can be found in the Excel spreadsheet Regression1 (DataSet2 tab).
• Residuals appear “small”, judging from the regression line.
• There appears to be no systematic pattern in the residual plot with residuals, by
and large, alternating in sign. This is also reflected in the regression line
“interleaving” the data points, with data values alternately above and below the
regression line.
• We conclude the straight line fit to the data is a “good” one (and this should also
be reflected in the R2 and r values).

Fig.11.1: Regression fit and residuals for Example 11.1

Example 11.2 The sales (in thousands of units) of a small electronics firm for the last
10 years have been as follows:
Year 1 2 3 4 5 6 7 8 9 10
Sales 2.60 2.85 3.02 3.45 3.69 4.26 4.73 5.16 5.91 6.5

The scatter plot, regression line and residual plot are shown in Fig.11.2 below.


• Regression line appears to fit the data well, with “small” residuals and high R2.
• However the residual plot tells a slightly different story, exhibiting a “definite
pattern”.

Fig.11.2: Regression fit and residuals for Example 11.2

Explaining the pattern: Define Residual = Data value – Linear prediction


• A positive residual occurs when the data value is above the regression line.
• A negative residual occurs when the data value is below the regression line.
As depicted in Fig.11.3, a residual pattern indicates “bending” of the data away from
a straight line.

Fig.11.3: Explanation of residuals for Example 11.2

Conclusion
Although we can use linear regression to “adequately” model the data of Example
11.2, we can do better. (See Section 12.)

We can use the precise pattern exhibited by the residuals to infer the type
of curve we should fit to the data (possibly a quadratic in this illustration).


Example 11.3 In neither of the previous examples have we really tested whether the
residuals are normally distributed. (We have just looked at their size, and any
pattern present.) The reason is we do not really have enough data to realistically
assess the distribution of residuals. For this reason we return to Example 2.2 relating
to executive pay. All the results discussed below are available in the Excel file
ExecutivePay.xls
• From Fig.11.4 we observe the regression fit is “not great” in terms of the R2
value, nor from a visual inspection of the scatter plot.
• The residual plot reveals a potential further problem. The residuals appear to be
increasing in size as x (Profit) increases.
- This arises when the data becomes increasingly (or decreasingly) variable
as x increases.
- This means the standard deviation of the data depends on x, i.e. is not
constant (violating one of the assumptions on which linear regression is
based).

Fig.11.4: Regression fit and residuals for Example 2.2

Fig.11.5: Histograms of residuals for Example 2.2


• There are several ways we can check the normality assumption of the
residuals. Fig.11.5 displays conceptually the simplest; we draw the histogram of
residuals and “observe by eye” whether it looks like a normal distribution. (The
residuals are produced automatically in Excel using the Regression tool in
Data Analysis; then using the Histogram tool, again in Data Analysis.)
• Although the residuals appear normally distributed (left histogram) with mean
zero we are not really sure what the x-range should be. To deal with this we
work with the standardised residuals (right histogram).
• “Standardising” a variable X means computing (X − Mean of X) / (Standard deviation of X).
• This is done automatically in Excel (and SPSS, provided you request it!). If X is
normally distributed the standardised variable (which we have called Z in Unit 6)
follows a standard normal distribution. We know that the range of Z is
(practically) from -3 to 3. The right histogram in Fig.11.5 fits into this pattern.
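A minimal Python illustration of the standardising step is given below; lacking the executive pay dataset here, we reuse the car-data residuals purely as an example.

```python
import numpy as np

age   = np.array([3, 3, 3, 4, 4, 6, 6, 6, 6], dtype=float)
price = np.array([4995, 4950, 4875, 4750, 3755, 1675, 2150, 2725, 1500], dtype=float)

b, a = np.polyfit(age, price, 1)
residuals = price - (a + b * age)

z = (residuals - residuals.mean()) / residuals.std(ddof=1)   # standardised residuals
print(z)        # values well outside the range -3 to 3 would be suspect
```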

Conclusions From this analysis we can conclude the following:


• Company Profit does not adequately explain Executive Pay, with only 43% of
the variation in the latter being explained by variation in the former.
• We should not be really surprised by this since one explanatory variable will
rarely be sufficient in many practical situations.
• Here we would expect to need other explanatory variables to adequately model
executive pay structures; the Excel file ExecutivePay gives two further
variables.
• More variables take us into the area of multiple regression.
• The residuals do appear to be normally distributed with mean 0.
• The histogram of residuals appears to indicate a constant standard deviation for
the (normal) distribution of residuals, but the residual plot does not. However we
always need to be careful in using plots alone (not backed up by more
quantitative analysis). For example, if we ignore a few of the larger residuals
(which we can regard as outliers) we can plausibly “redraw” the residual plot in
Fig.11.4 as shown below in Fig.11.6. Here the residuals do appear to stay
within a band (of constant width), indicating a constant standard deviation.


Fig.11.6: Another view of the residual plot


A Few Final Notes
• In statistics one often obtains results which are “slightly contradictory”, for one
(or more) of several reasons:
- Not enough data (for long run frequency interpretations to apply).
- Inapplicable statistical methods used (see below).
- Interpretations of results can be difficult (and ambiguous).
• For this reason it is common practice to try and employ several different
techniques when analysing a dataset, and assess what the “weight of evidence”
implies. It is dangerous to base one’s conclusions on a single piece of evidence,
without supporting (backup) material.
• Adding to the difficulties, in the background there are always assumptions on
which a statistical technique depends (even if you are unaware of what they
are!). Ignoring these will lead you into employing methods that
- either may not apply, or
- definitely do not apply.
You should always carry out any checks you can think of!
• Be aware that sometimes you may not be able to reach any definite conclusion.
This usually means more data is required. In any report you may later write
never be afraid to state that there is really insufficient data available to be able
to reach any firm conclusions.

12. Transformations to linearise data


If the residual plot indicates a straight line is not the most appropriate model, we can
take one of two courses of action:
• Fit a non-linear model, such as a quadratic.
• Transform the data to “look more linear” and then use linear regression on
this “new transformed data”.


Although the first option is of great practical importance, we shall look at the second
possibility only. Before reading further you should review the material in Practical Unit
3 Section 5 relating to the graphs of the various standard elementary functions
(powers of x, exponential and logarithmic functions).
Some General Rules
What transformation will help linearise the data? In order to answer this question we
will make use of Fig.12.1 below. The easiest way to remember what is happening
here is the following:
• The quadrant in which the data lies determines whether we need to increase
or decrease the powers of x or y.
• Remember that, because logs increase slower than powers, reducing powers
(of x or y) is similar to taking logs (of x or y).
You may care to think why there are some negative signs in some of the
transformations in Fig.12.1.

Fig. 12.1: Scatter Plot Determines Choice of Transformation
(Quadrant A: transform x by √x, log x or -1/x, or y by y², y³ or y⁴.
Quadrant B: transform x by x², x³ or x⁴, or y by y², y³ or y⁴.
Quadrant C: transform x by x², x³ or x⁴, or y by √y, log y or -1/y.
Quadrant D: transform x by √x, log x or -1/x, or y by √y, log y or -1/y.)


Note that the choice of a transformation is not unique. Several transformations will
often do a “reasonably good” job of linearising the data, and it is often just a matter of
trial and error to find an “optimum choice”. Experimentation also leads one to an
appreciation of just why the transformations given in Fig.12.1 are useful.

Example 12.1 A market research agency has observed the trends shown in Table
12.1 in Sales (y) and Advertising Expenditure (x) for 10 different firms.
The scatter plot, regression line and residual plot are shown in Fig.12.2. Comparing
this scatter plot with the diagrams A to D in Figure 12.1 we choose transformation C.

Firm Sales Expenditure Firm Sales Expenditure


(£’000,000’s) (£’00,000’s) (£’000,000’s) (£’00,000’s)
1 2.5 1.0 6 9.1 4.6
2 2.6 1.6 7 14.8 5.0
3 2.7 2.5 8 17.5 5.7
4 5.0 3.0 9 23.0 6.0
5 5.3 4.0 10 28.0 7.0

Table 12.1: Sales and Advertising Expenditure

Fig.12.2: Regression fit and residuals for Example 12.1

Thus for our data we might consider initially


• either a transformation of the independent variable of x2 or x3 or
• a transformation of the dependent variable of √y or logy.


[Figs.12.3 and 12.4 show scatter plots of the transformed data, with panels titled: y versus x, y versus x², √y versus x, log(y) versus x, -1/y versus x, and y versus x⁴.]

Fig.12.3: Some transformations in Example 12.1

Fig.12.4: Poor choices of transformations in Example 12.1


• Some of these possibilities are depicted in Fig.12.3, and we can see that most
of them do a reasonable job in linearising the data. You can find all these
results in the Excel spreadsheet Regression1L
(Transformations_AdvertisingData tab).
• However you should appreciate that we cannot just choose (essentially at random)
any transformation. For example, if we decrease the power of x the data
becomes “even less linear”, as Fig.12.4 illustrates.

Example 12.1 (Continued) Does all this really help us in our regression fits?
• We return to the data of Table 12.1 and perform the following transformation
(the last one depicted in Fig.12.3): we replace x by x⁴, i.e. we regress y on x⁴.
• Our new regression fit, and residual plot, are shown in Fig.12.5.
- The new R² (93%) is higher than the old R² (85%).
- The new residual plot, whilst not perfect, appears far more random than
the old residual plot (Fig.12.2).
• We conclude that transforming the data (via x → x⁴) is worthwhile here. This
does not, however, imply that we have found the “best” transformation, and you
may care to experiment to see if any significant improvement is possible.

Fig.12.5: Regression fit for transformed data in Example 12.1
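To see the improvement numerically, the following Python sketch (numpy assumed available; the helper r_squared is our own) fits Sales on Expenditure and then on Expenditure to the fourth power for the Table 12.1 data.

```python
import numpy as np

x = np.array([1.0, 1.6, 2.5, 3.0, 4.0, 4.6, 5.0, 5.7, 6.0, 7.0])      # Expenditure
y = np.array([2.5, 2.6, 2.7, 5.0, 5.3, 9.1, 14.8, 17.5, 23.0, 28.0])  # Sales

def r_squared(x, y):
    """R-squared of the least squares straight line of y on x."""
    b, a = np.polyfit(x, y, 1)
    ess = np.sum((y - (a + b * x)) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - ess / tss

print(r_squared(x, y))       # straight line in x: roughly 0.85
print(r_squared(x**4, y))    # straight line in x to the fourth power: roughly 0.93
```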

Comment Although we have really concentrated on powers (of x and y) to define our
transformations, in economics logarithmic transformations find great use, especially
in the context of elasticity. See Tutorial 9, Q5 for an illustration.


13. Financial Applications


There are many applications of the ideas we have discussed in this unit to a large
range of subject areas. In finance we single out three areas of importance:
• Portfolio theory (mentioned in Section 4).
• Capital Asset Pricing Model (CAPM).
• Modelling stock price behaviour.

The last two are looked at in Assessment 2, and here we just analyse Example 4.1 a
little further.

Two Key Results Given two random variables X and Y, and any two constants a
and b, we look at the linear combination (or transformation)
P = aX + bY --- (10a)
Then E(P) = aE(X) + bE(Y) --- (10b)
and Var(P) = a2Var(X) + b2Var(Y) + 2abCov(X,Y) --- (10c)
“Proofs” (10b) really follows from (3) of Unit 5 when expressed in terms of
expectations. (10c) is more involved and follows from algebraic manipulation of (1)
using the definitions (2) and (3) of Section 4.
(10c) is a result of fundamental importance, and highlights that the variation in a sum
of (random) variables is not just the sum of the variations present within each
variable, but also contains a component relating to the interaction (covariance) of the
two variables. The more two variables influence one another the greater the variation
in their sum. In part the importance of the covariance concept stems from (10c).

Example 13.1 We return to Example 4.1 and recall the following:


E[R(X)] = 0.1 ; E[R(Y)] = 0.08
StDev[R(X)] = 0.0872 ; StDev[R(Y)] = 0.0841
Var[R(X)] = 0.0076 ; Var[R(Y)] = 0.00708 ; Cov(X,Y) = -0.0024
We have expressed all these in decimal form – look back at Example 4.1 to see why.

Portfolio 1 Suppose we invest half of our money in asset X, and the other half in
asset Y. Then, in terms of returns, the portfolio return R(P) is given by
R(P) = 0.5R(X) + 0.5R(Y)
Then, using (10b) E[R(P)] = 0.5E[R(X)] + 0.5E[R(Y)] = 0.5*0.1 + 0.5*0.08 = 0.09


and, using (10c)  Var[R(P)] = 0.5²Var[R(X)] + 0.5²Var[R(Y)] + 2*0.5*0.5*Cov(X,Y)
= 0.25*0.0076 + 0.25*0.00708 – 0.5*0.0024 = 0.00247

Hence StDev[R(P)] = √0.00247 = 0.0497


The importance of this result is the following:

With half of our assets invested in X, and half in Y:


• The expected return is halfway between that offered by X and Y alone, but
• The portfolio risk is considerably less than either that offered by X or Y
alone (nearly half in fact).

If the major concern (of a portfolio manager) is to minimise risk, then Portfolio 1
provides a much better alternative than investing in X or Y alone. (Although
maximising expected return sounds very tempting, it is a very high risk strategy!)
Portfolio 2 Since Y is slightly less risky than X (smaller standard deviation), we may
think of investing more in Y. Suppose we invest 60% of our wealth in Y and 40% in X.
Then, repeating the previous calculations, we obtain
R(P) = 0.4R(X) + 0.6R(Y)
E[R(P)] = 0.4E[R(X)] + 0.6E[R(Y)] = 0.4*0.1 + 0.6*0.08 = 0.088
Var[R(P)] = 0.4²Var[R(X)] + 0.6²Var[R(Y)] + 2*0.4*0.6*Cov(X,Y)
= 0.16*0.0076 + 0.36*0.00708 – 0.48*0.0024 = 0.00261

Hence StDev[R(P)] = √0.00261 = 0.0511

Unfortunately, not only have we reduced our expected return to 8.8%, we have also
increased the portfolio risk to about 5.1% (compared with Portfolio 1's 5.0%).

General Portfolio If we repeat these calculations for various portfolio weights, we


obtain Table 13.1. You are asked to look at such calculations, in the context of two
assets, in Practical Exercises 7.
There are now various (line and scatter) plots of interest. In Fig.13.1 we plot, for the
portfolio, the expected return and the standard deviation against the X-weight, i.e. the
percentage of asset X in the portfolio. As the X weight increases we observe:
• The expected return increases; this is simply because X has the larger
expected return (10%).
• The portfolio risk (as measured by the standard deviation) initially reduces, but
then starts to increase. This is a little unexpected, but of crucial importance.


Table 13.1: Mean and Standard Deviation of Returns in Example 13.1

Fig.13.1: Expected Portfolio Return and Standard Deviation in Example 13.1


The conventional way to look at the portfolio risk is not in terms of the X weight, but
rather in terms of the expected gain, i.e. expected portfolio return. This is shown in
Fig.13.2, and measures the trade-off between portfolio return and risk. The
resulting curve is termed the efficient frontier, and represents one of the major
results in the area of portfolio theory. You will learn a lot more about this in other
modules.

Fig.13.2: The Efficient Frontier in Example 13.1
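The numbers behind Table 13.1 and Figs.13.1 and 13.2 can be regenerated with the short Python sketch below (numpy assumed available), which applies (10b) and (10c) across a grid of X-weights.

```python
import numpy as np

e_x, e_y = 0.10, 0.08                            # expected returns (Example 13.1)
var_x, var_y, cov_xy = 0.0076, 0.00708, -0.0024  # variances and covariance

for w in np.arange(0.0, 1.01, 0.1):              # w = proportion invested in X
    e_p = w * e_x + (1 - w) * e_y                                         # (10b)
    var_p = w**2 * var_x + (1 - w)**2 * var_y + 2 * w * (1 - w) * cov_xy  # (10c)
    print(round(w, 1), round(e_p, 4), round(np.sqrt(var_p), 4))
# Plotting the standard deviation against the expected return traces out
# the efficient frontier of Fig.13.2.
```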

Summary The idea of covariance, and the allied notion of correlation, is central to
many of the more advanced applications of statistics, especially in a financial context.
In a portfolio context the interplay between expected values (returns), variances and
covariances give rise to the concept of an efficient frontier, and this is the starting
point for much of modern portfolio theory. In a regression context correlation
measures the extent to which variables interact, and underlies notions of causality; the
latter is of fundamental importance in modern econometrics.
In addition to all this, regression is probably the most important single
statistical technique used in applications. In practice this tends to be in the context of
multiple regression, and we look at this in Unit 10.
Finally any regression analysis you perform should be accompanied by graphical
output; at least a scatter plot (with fitted line superimposed) and a residual plot to
indicate the quality of the fit achieved.


