MIS BA 20232024 Notes Chapter02
Correlation
CHAPTER 2. CORRELATION
as a point – no lines or bars. In Figure 2.1, each point represents an observation (one
person).
[Scatter plot titled "Weight/Height relationship": Weight (kg) on the horizontal axis (50 to 75), Height (cm) on the vertical axis (150 to 180); one point per person.]
Figure 2.1: Scatter plot showing height and weight of the sample of adults.
A simple visual observation of the scatter plot is already enough to provide some
information with respect to how the two variables are related.
Exercise. What can you conclude from Figure 2.1? Is there a relationship between the
two variables? When one variable goes up, does the other one tend to go up, down, or
stay the same?
The formula returns a value within the range [-1, 1]. A value of 1 or -1 indicates a
perfect linear relationship between the two variables: in a scatter plot, one could draw a
straight line that would pass through every single data point. In general, positive values
for r indicate that when the value of one variable goes up, the value of the other variable
also tends to go up. Negative values for r indicate that when one goes up, the other tends
to go down. If there is no linear relationship at all between the two variables, the value
for r will be 0. See Figure 2.2.
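As a small illustration of the two extremes, the sketch below computes r in Python for two perfectly linear data sets. It assumes the standard Pearson formula (the sum of products of deviations, divided by the square root of the product of the sums of squared deviations), which should agree with equation 2.1. The function name pearson_r and the numbers are made up for this example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's linear correlation for two paired lists (assumed standard formula)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]                 # illustrative data only
y_up = [2 * v + 1 for v in x]       # y rises along a straight line as x rises
y_down = [10 - 3 * v for v in x]    # y falls along a straight line as x rises

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(r_up, r_down)                 # 1.0 -1.0
```

The steepness of the line does not matter: any straight line that slopes upwards gives 1, and any straight line that slopes downwards gives -1.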
Example. Suppose we collect data on all the accidental fires in Singapore for the last
ten years. We correlate the number of fire engines at each fire (x) and the eventual
damages in Euros (y) at each fire. We will probably find that there is a strong positive
correlation. Can we conclude that the presence of more fire engines causes more costly
damage?
Example. Suppose we collect data from 100 randomly-chosen adults: the value of their
car (x) and their income (y). (Anyone who does not own a car is omitted from the
survey.) We will find a strong positive correlation. Does this suggest that if I
buy a more expensive car, my salary will go up? Why or why not?
A weak or zero linear correlation between two variables implies that there is a weak
or non-existent linear relationship between them. That does not imply that there is
no other relationship between them. That is because linear correlation only measures
the strength of any possible linear relationship. There might be a strong non-linear
relationship which nevertheless results in a zero correlation. Look at Figure 2.3. There
are several data-sets where r = 0, despite the fact that there is a clear relationship
between the variables.
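A quick way to convince yourself of this is to build such a data set by hand. The sketch below uses made-up data where y is exactly x squared, so y is completely determined by x, and computes r with the standard Pearson formula (assumed to match equation 2.1). The helper name pearson_r is an illustrative choice.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's linear correlation for two paired lists (assumed standard formula)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Illustrative data: x is symmetric around 0 and y = x squared (a U-shape).
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r = pearson_r(x, y)
print(r)   # 0.0, even though y depends entirely on x
```

The downward-sloping left half of the U-shape and the upward-sloping right half cancel each other out, so the linear measure sees nothing.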
Figure 2.3: More examples of correlation values. The numbers shown are the correlation
coefficients of the x-y data-sets shown. Note that several very strong x-y relationships give
a zero correlation. Note that correlation is undefined in the case where there is no variation in
y, i.e. y is constant (the same holds for x). (Image public domain, from Wikipedia.)
Note that it is not always possible to calculate the correlation between two variables.
In the centre of Figure 2.3, different values of x all correspond to the same value of y.
In this case, the value of r is undefined: we cannot measure the variation of y when the
value of x changes.
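In terms of the arithmetic, "undefined" here means a division by zero. The sketch below assumes the standard Pearson formula (which should agree with equation 2.1) and returns None instead of dividing when one of the variables has no variation. The function name and the data are hypothetical.

```python
from math import sqrt

def pearson_r_or_none(xs, ys):
    """Return Pearson's r, or None when either variable has no variation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        # The denominator sqrt(sxx * syy) would be zero, so r is 0/0: undefined.
        return None
    return sxy / sqrt(sxx * syy)

x = [10, 20, 30, 40]   # hypothetical data
y = [3, 3, 3, 3]       # y never changes

print(pearson_r_or_none(x, y))   # None
print(pearson_r_or_none(y, x))   # None as well: the measure is symmetric
```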
Exercise. Think of it the other way around: given that correlation is a symmetric
measure, how can we measure the corresponding change in x when y changes, if the
value of y never changes?
Exercise. Put this into practice. Using equation 2.1, calculate the linear correlation
between x and y, for the following data:
x 1 2 3 4 5
y 8 8 8 8 8
Some very different data sets can have the same correlation values, as in Figure 2.4.
That is why it is always useful to plot the data and look at it.
Figure 2.4: Anscombe’s Quartet: four data sets with identical statistical values (mean and
variance of x and of y, x-y correlation, and more), but very different properties. From Wikipedia:
"Anscombe’s quartet 3" by Anscombe.svg: Schutz; derivative work (label using subscripts):
Avenue (talk) - Anscombe.svg. Licensed under CC BY-SA 3.0 via Wikimedia Commons -
http://en.wikipedia.org/wiki/Anscombe's_quartet
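The claim about Anscombe's Quartet can be checked numerically. The sketch below uses the quartet's values as they are commonly reproduced (an assumption here, since these notes do not list them) and the standard Pearson formula, assumed to match equation 2.1. The names pearson_r, quartet and r_values are illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's linear correlation for two paired lists (assumed standard formula)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Anscombe's data as commonly reproduced; sets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# One correlation value per data set.
r_values = {name: pearson_r(xs, ys) for name, (xs, ys) in quartet.items()}
for name, r in r_values.items():
    print(f"Set {name}: r = {r:.3f}")
```

All four values come out at roughly 0.82, so the correlation alone cannot tell these data sets apart; only the plots can.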
Exercise. Your friend Bob is the tallest person you know, and he got the best marks in
his class in first year exams. What do you conclude from this data and why?
However, in other cases you might have many, many variables, and you might not
have a strong grasp on what the variables even mean. What if your table looks like
Table 2.3?
Table 2.3: Data on phone customers
In cases like this it becomes very useful to have a fast, automated method of investigating
all the relationships in the data. One method is to calculate the correlation between every
pair of variables and just see which ones are relatively strong. You might then choose
to report something like “Customers with large values for lnd mins tend to have small
values for v x”. This could be useful to your client or your manager, even if you are not
quite sure what v x means.
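The "every pair of variables" method might be sketched in Python as follows. This is only a minimal illustration: the table, its column names (var_a, var_b, var_c) and its values are hypothetical and are not taken from Table 2.3, and the correlation uses the standard Pearson formula, assumed to match equation 2.1.

```python
from itertools import combinations
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's linear correlation for two paired lists (assumed standard formula)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical table: one list per variable, one position per customer.
table = {
    "var_a": [1, 2, 3, 4, 5, 6],
    "var_b": [12, 10, 9, 7, 4, 3],
    "var_c": [5, 1, 4, 2, 6, 3],
}

# Correlate every pair of variables exactly once.
pairs = []
for name_1, name_2 in combinations(table, 2):
    r = pearson_r(table[name_1], table[name_2])
    pairs.append((name_1, name_2, r))

# Strongest relationships first, whether positive or negative.
pairs.sort(key=lambda p: abs(p[2]), reverse=True)
for name_1, name_2, r in pairs:
    print(f"{name_1} vs {name_2}: r = {r:+.2f}")
```

Sorting by the absolute value of r puts strong negative relationships at the top alongside strong positive ones, which is what you want when scanning for anything worth reporting.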
Exercise. Suppose there are four variables. How many relationships are there, i.e. how
many pairs? What about when there are 10 variables? What is the general rule?
less than 20; a weak or non-existent correlation for the group where age is between 20
and 60; and a weak negative correlation for the group where age is above 60. This gives
us a lot of insight into the data.
There are several measures of rank correlation; one of them is Spearman’s rank cor-
relation coefficient, which is equal to Pearson’s linear correlation (r), but applied to the
ranks of the variables, rather than their raw values. This results in a correlation value
which is also within the range [-1, 1], but without the assumption of a linear relationship
between the variables (only between their ranks):
r_s = r(rg_x, rg_y)
where rg_x and rg_y are the ranks of variables x and y, respectively.
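The definition above translates almost directly into code: rank each variable, then apply Pearson's r to the ranks. The sketch below is one way to do that in Python. The data are made up, the function names are illustrative, and giving tied values the average of their ranks is an assumption (a common convention that these notes do not spell out).

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's linear correlation for two paired lists (assumed standard formula)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman_r(xs, ys):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

x = [1, 2, 3, 4, 5]        # illustrative data
y = [v ** 3 for v in x]    # always increasing, but not along a straight line

r_p = pearson_r(x, y)
r_s = spearman_r(x, y)
print(round(r_p, 3), r_s)
```

Here the rank correlation is exactly 1 because y always increases when x does, while Pearson's r falls short of 1 because the points do not lie on a straight line.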
Exercise. Calculate Pearson’s correlation of the data shown in Table 2.4. How would
you characterise the relationship between weight and height? And between height and
weight?
Exercise. Now calculate Spearman’s correlation of the same data. How does it compare
with the linear correlation you calculated previously? What conclusion can you draw
from that comparison?
=CORREL(A2:A6, B2:B6). Notice that each argument is a cell range, i.e. a list of cell
locations. The two cell ranges are of the same length, because it is paired data. If there
is a missing value for some observation, Excel will just ignore that row and calculate the
correlation on the rest of the data.
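Outside Excel, the same "ignore incomplete rows" behaviour has to be written explicitly. The sketch below is a rough Python imitation of what is described above, not Excel's actual implementation: missing values are represented by None (an assumption), any observation with a missing x or y is dropped, and r is computed on the remaining pairs with the standard Pearson formula. The data are made up.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's linear correlation for two paired lists (assumed standard formula)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Illustrative paired data; None marks a missing value.
x = [1.0, 2.0, None, 4.0, 5.0]
y = [2.1, 3.9, 6.2, None, 9.8]

# Keep only the observations where both values are present.
complete = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
xs = [a for a, _ in complete]
ys = [b for _, b in complete]

r = pearson_r(xs, ys)
print(len(complete), "complete observations, r =", round(r, 3))
```

Note that a row is dropped as a whole: the 6.2 in y is discarded because its partner in x is missing, which keeps the data paired.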
Exercise. Enter some data into Excel, and calculate r using CORREL. Now delete some
data, and see what happens.
Exercise. Suppose we collect data on the age and height of a sample of children, and
plot it on a scatter plot. What will the data look like? Draw a scatter plot representing
your best guess. What is your best guess for the correlation value?
Exercise. Consider the data gathered from a sample of 4 companies on the number of
employees and their sales per annum (in thousands of Euros), shown in Table 2.5. Bring
it into Excel and calculate the correlation between them. Is the result what you expect?
Interpret the result – give one sentence explaining the relationship in terms that would
be understood by someone who knows nothing about statistics.
Exercise. What do you think is the likely relationship between a country’s GDP and
its population? Gather the data from below and find out. Use a scatter plot to help
explain the result.
• http://data.worldbank.org/indicator/SP.POP.TOTL/countries
• http://data.worldbank.org/indicator/NY.GDP.MKTP.CD/countries
Exercise. Use Filter (in the Data menu) in Excel to choose a subset of observations
in the World Bank data (above): e.g. filter out countries with population less than 1
million or greater than 500 million. Does the correlation change much?