MIS BA 20232024 Notes Chapter02

Chapter 2

Correlation

2.1 Measuring Variable Relationships


A fundamental technique of analytics is to investigate the relationship between a pair of
variables. In some cases, it is obvious that there is a relationship. For example, we can
say with confidence that there is a positive relationship between a person’s salary and
the price they pay for a new car. This information is probably useful to a car dealer, for
example.
In other cases it is not so obvious. What, if any, is the relationship between a
person’s salary and the amount they spend on international phone calls per week? This
information would probably be useful to a phone company. If it is not obvious, we need
a way to measure it based on data.
Correlation is a simple method of measuring whether two variables are related, and
if so, what type of relationship and how strong it is.
We say that two variables are correlated if, when one goes up, so does the other, or
when one goes up, the other goes down – either is equally useful to know about.
In order to carry out a correlation, we need paired data – that is, we have multiple
observations, and each observation is a pair – one value for each variable. An example is
the height and weight of a sample of adults. For each adult, we have one value of each
variable. We will usually see our data-set in a table. On paper it would look like this:

Table 2.1: Height and weight of a sample of adults

height (cm) weight (kg)


168 60
170 62
155 55
183 70
172 68

2.2 Scatter Plots


A good place to start is a scatter plot. We plot the two variables x and y as a scatter
plot. A scatter plot has two axes, one for each variable, and each observation is plotted
as a point – no lines or bars. In Figure 2.1, each point represents an observation (one
person).

Figure 2.1: Scatter plot showing height and weight of the sample of adults. The plot is titled
“Weight/Height relationship”, with Weight (kg) on the x-axis and Height (cm) on the y-axis.

A simple visual observation of the scatter plot is already enough to provide some
information with respect to how the two variables are related.

Exercise. What can you conclude from Figure 2.1? Is there a relationship between the
two variables? When one variable goes up, does the other one tend to go up, down, or
stay the same?

2.3 Linear Correlation


In practice, we do not always want to plot the data and then judge the relationship
visually. Sometimes it is useful to have a simple automated formula to describe the
relationship.
The correlation between two variables is calculated as follows, for a sample of n
observations:

\[
r = \frac{n \sum xy - \sum x \sum y}
         {\sqrt{n \sum x^2 - \left(\sum x\right)^2} \, \sqrt{n \sum y^2 - \left(\sum y\right)^2}}
\tag{2.1}
\]

where $\sum x$ is the sum of the x values for all observations, etc. This is called Pearson’s
correlation coefficient, and measures the linear correlation between the variables x and
y.
This correlation metric is symmetric: it does not matter which variable is called x
(and put on the x-axis in a scatter plot) and which is y. The value of r for the relationship
between x and y is equal to the value of r for the relationship between y and x.
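
As a minimal sketch, equation 2.1 can be translated term by term into Python. The function name pearson_r and the variable names are our own choices, not part of the notes; the data are the five observations from Table 2.1. The last lines compute r in both directions to illustrate the symmetry described above.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r, computed term by term from equation 2.1."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

heights = [168, 170, 155, 183, 172]  # Table 2.1, cm
weights = [60, 62, 55, 70, 68]       # Table 2.1, kg

r_hw = pearson_r(heights, weights)   # height as x, weight as y
r_wh = pearson_r(weights, heights)   # roles swapped
print(round(r_hw, 3), round(r_wh, 3))
```

Swapping the arguments only swaps the two square-root factors in the denominator, which is why both calls give the same value.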

Nicolau & McDermott, Business Analytics 18



The formula returns a value within the range [-1, 1]. A value of 1 or -1 indicates a
perfect linear relationship between the two variables: in a scatter plot, one could draw a
straight line that would pass through every single data point. In general, positive values
for r indicate that when the value of a variable goes up, the value of the other variable
also goes up. Negative values for r indicate that when one goes up, the other goes down.
If there is no linear relationship at all between the two variables, the value for r will be
0 (or, in a finite sample, close to 0). See Figure 2.2.

Figure 2.2: A perfect positive correlation, r = 1, a perfect negative correlation, r = -1,
and a non-existent correlation, r = 0.

Correlation is not an all-or-nothing measure: two variables can also be imperfectly
correlated. In many real-world data-sets, there is a clear relationship, but unlike the
r = 1 and r = -1 cases above, not perfect. In those cases, r takes on intermediate
values; not zero, but not -1 or 1 either:

If r is in the range . . . then we say there is a . . .


+.70 or higher very strong positive relationship
+.40 to +.69 strong positive relationship
+.30 to +.39 moderate positive relationship
+.20 to +.29 weak positive relationship
+.01 to +.19 no or negligible relationship
close to 0 no relationship
-.01 to -.19 no or negligible relationship
-.20 to -.29 weak negative relationship
-.30 to -.39 moderate negative relationship
-.40 to -.69 strong negative relationship
-.70 or lower very strong negative relationship
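
The table can be applied automatically with a small helper. This is only a sketch: the function name describe_r is hypothetical, and we assume that each band starts at its lower bound (so, for example, any value from 0.40 up to but not including 0.70 counts as "strong"), which closes the small gaps between the rounded ranges in the table.

```python
def describe_r(r):
    """Map a correlation coefficient to the verbal labels in the table above."""
    size = abs(r)
    if size >= 0.70:
        strength = "very strong"
    elif size >= 0.40:
        strength = "strong"
    elif size >= 0.30:
        strength = "moderate"
    elif size >= 0.20:
        strength = "weak"
    else:
        # Everything below 0.20 in absolute value, including 0 itself.
        return "no or negligible relationship"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} relationship"

for value in (0.85, 0.35, 0.05, -0.25, -0.55):
    print(value, "->", describe_r(value))
```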

2.4 Misleading Correlations


When carrying out correlation, we may find that there is a relationship between y and
x. We should not assume that relationship to be causal ; that is, we should not assume
that a change in x actually causes the change in y. It might be the other way around.
Or there might be some other variable z which we have not considered, which causes
related changes in both x and y. Either would lead to a correlation between x and y.
Remember the following slogan:

Correlation is not causation.


Example. Suppose we collect data on all the accidental fires in Singapore for the last
ten years. We correlate the number of fire engines at each fire (x) and the eventual
damages in Euros (y) at each fire. We will probably find that there is a strong positive
correlation. Can we conclude that the presence of more fire engines causes more costly
damage?

Example. Suppose we collect data from 100 randomly-chosen adults: what is the value
of their car (x), and what is their income (y). (Anyone who does not own a car is omitted
from the survey). We will find a strong positive correlation. Does this suggest that if I
buy a more expensive car, my salary will go up? Why or why not?

A weak or zero linear correlation between two variables implies that there is a weak
or non-existent linear relationship between them. That does not imply that there is
no other relationship between them. That is because linear correlation only measures
the strength of any possible linear relationship. There might be a strong non-linear
relationship which nevertheless results in a zero correlation. Look at Figure 2.3. There
are several data-sets where r = 0, despite the fact that there is a clear relationship
between the variables.
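
A small made-up example shows how this can happen. Below, y is completely determined by x (y = x squared), yet the linear correlation is zero because the x values are symmetric around 0, so the sum of x and the sum of xy are both zero. The data and the function name pearson_r are our own, chosen only for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r, following equation 2.1."""
    n = len(xs)
    numerator = n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)
    spread_x = n * sum(x * x for x in xs) - sum(xs) ** 2
    spread_y = n * sum(y * y for y in ys) - sum(ys) ** 2
    return numerator / (sqrt(spread_x) * sqrt(spread_y))

xs = [-3, -2, -1, 0, 1, 2, 3]   # illustrative values, symmetric around 0
ys = [x * x for x in xs]        # a strong, but non-linear, relationship

r = pearson_r(xs, ys)
print(r)
```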

Figure 2.3: More examples of correlation values. The numbers shown are the correlation
coefficients of the x-y data-sets shown. Note that several very strong x-y relationships give
a zero correlation. Note that correlation is undefined in the case where there is no variation in
y, i.e. y is constant (same for x). (Image public domain, from Wikipedia.)

Note that it is not always possible to calculate the correlation between two variables.
In the centre of Figure 2.3, different values of x all correspond to the same value of y.
In this case, the value of r is undefined : we cannot measure the variation of y when the
value of x changes.
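
In terms of equation 2.1, "undefined" means that one of the square-root terms in the denominator is zero, so the formula would divide by zero. The sketch below makes that visible; the helper names and the data values are hypothetical, and returning None for the undefined case is just one possible convention.

```python
from math import sqrt

def variation_term(values):
    """n * sum(v^2) - (sum v)^2: the quantity under each square root in equation 2.1."""
    n = len(values)
    return n * sum(v * v for v in values) - sum(values) ** 2

def pearson_r_or_none(xs, ys):
    """Pearson's r, or None when either variable has no variation."""
    var_x, var_y = variation_term(xs), variation_term(ys)
    if var_x == 0 or var_y == 0:
        return None  # the denominator of equation 2.1 would be zero
    n = len(xs)
    numerator = n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)
    return numerator / (sqrt(var_x) * sqrt(var_y))

xs = [10, 20, 30, 40]  # hypothetical values: x varies
ys = [4, 4, 4, 4]      # hypothetical values: y is constant

print(variation_term(xs))         # 2000
print(variation_term(ys))         # 0
print(pearson_r_or_none(xs, ys))  # None
```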

Exercise. Think of it the other way around: given that correlation is a symmetric
measure, how can we measure the corresponding change in x when y changes, if the
value of y never changes?


Exercise. Put this into practice. Using equation 2.1, calculate the linear correlation
between x and y, for the following data:
x 1 2 3 4 5
y 8 8 8 8 8

Some very different data sets can have the same correlation values, as in Figure 2.4.
That is why it is always useful to plot the data and look at it.
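
To see this numerically, the sketch below computes r for the first two data sets of Anscombe's Quartet. The values are the commonly published ones, typed in from memory of that public data set, so check them against the source before reusing them. The function name pearson_r is our own.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r, following equation 2.1."""
    n = len(xs)
    numerator = n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)
    spread_x = n * sum(x * x for x in xs) - sum(xs) ** 2
    spread_y = n * sum(y * y for y in ys) - sum(ys) ** 2
    return numerator / (sqrt(spread_x) * sqrt(spread_y))

# Anscombe's Quartet, data sets I and II (they share the same x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

r1 = pearson_r(x, y1)
r2 = pearson_r(x, y2)
print(round(r1, 3), round(r2, 3))
```

The two values agree to about three decimal places, even though a scatter plot of the first set looks like a noisy line and the second like a smooth curve.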

Figure 2.4: Anscombe’s Quartet: four data sets with identical statistical values (mean and
variance of x and of y, x-y correlation, and more), but very different properties. From Wikipedia:
“Anscombe’s quartet 3” by Anscombe.svg: Schutz; derivative work (label using subscripts):
Avenue (talk) - Anscombe.svg. Licensed under CC BY-SA 3.0 via Wikimedia Commons -
http://en.wikipedia.org/wiki/Anscombe’s_quartet

2.5 Correlation and Coincidence


Correlation is a statistical relationship between two variables. It involves multiple
observations. Correlation is not the same as coincidence. A coincidence is a single paired
observation (e.g., one particular person has a large height and a large weight). It does
not tell us anything about the relationship in general. Above, we saw that correlation
is not causation. Now, we have a second slogan:

Coincidence is not correlation.

Exercise. Your friend Bob is the tallest person you know, and he got the best marks in
his class in first year exams. What do you conclude from this data and why?


2.6 Correlation Between All Pairs


A scenario we will see over and over again in this course is where you have a table of
numerical data. Each column represents some variable, and each row represents some
instance, just as we had a table earlier where each row represented one person, and the
columns represented their height and weight.
In some cases, it might be very “obvious” what relationships exist in the data. For
example, suppose each row represents one country, and the first column represents area
in square kilometres, the second represents population, the third represents the number
of kilometres of motorways in the country, and the fourth represents mean temperature
in the country. This type of data is freely available online. See Table 2.2; you will
probably observe that there is a positive correlation between population and area.

Table 2.2: Some data on several countries

Country    Area (sq km)    Population    Roadways (km)    Temperature

Ireland    70,273          4,832,765     96,036           10
France     551,500         66,259,012    1,028,446        15
...

However, in other cases you might have many, many variables, and you might not
have a strong grasp on what the variables even mean. What if your table looks like
Table 2.3?
Table 2.3: Data on phone customers

Customer ID lnd mins mob mins l allow mob prop x vx ...


1234 574 252 0.55 15 ...
1235 51 459 0.47 29 ...
...

In cases like this it becomes very useful to have a fast, automated method of investigating
all the relationships in the data. One method is to calculate the correlation between every
pair of variables and just see which ones are relatively strong. You might then choose
to report something like “Customers with large values for lnd_mins tend to have small
values for v_x”. This could be useful to your client or your manager, even if you are not
quite sure what v_x means.
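
One way to sketch this "every pair" approach in Python is with itertools.combinations from the standard library. The column names and values below are invented for illustration (they are not the data behind Table 2.3), and pearson_r is our own helper following equation 2.1.

```python
from itertools import combinations
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r, following equation 2.1."""
    n = len(xs)
    numerator = n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)
    spread_x = n * sum(x * x for x in xs) - sum(xs) ** 2
    spread_y = n * sum(y * y for y in ys) - sum(ys) ** 2
    return numerator / (sqrt(spread_x) * sqrt(spread_y))

# Hypothetical customer table: one list per column, one position per customer.
table = {
    "landline_mins": [574, 51, 300, 120, 410],
    "mobile_mins": [252, 459, 310, 400, 280],
    "intl_calls": [3, 9, 5, 8, 4],
    "age": [61, 23, 40, 35, 52],
}

# Correlate every unordered pair of columns exactly once.
results = []
for name_a, name_b in combinations(table, 2):
    results.append((name_a, name_b, pearson_r(table[name_a], table[name_b])))

# Report the strongest relationships first, whatever their sign.
results.sort(key=lambda item: abs(item[2]), reverse=True)
for name_a, name_b, r in results:
    print(f"{name_a} vs {name_b}: r = {r:+.2f}")
```

Because correlation is symmetric, each pair only needs to be computed once, which is why combinations (rather than all ordered pairs) is used here.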

Exercise. Suppose there are four variables. How many relationships are there, i.e. how
many pairs? What about when there are 10 variables? What is the general rule?

2.7 Correlation after Segmentation


Sometimes, you observe no correlation between two variables, but if you measure the
correlation of a subset of the observations, then you find a strong correlation.
Consider the example of age and height in humans. With a sample of humans
of all ages, there will be very little correlation. But if we segment the population into
three groups, then we will find a strong positive correlation for the group where age is
less than 20; a weak or non-existent correlation for the group where age is between 20
and 60; and a weak negative correlation for the group where age is above 60. This gives
us a lot of insight into the data.
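
The mechanics of segmentation can be sketched as follows: assign each observation to a group, then compute r separately within each group. The (age, height) values are invented purely to demonstrate the steps, and with so few points per group the resulting numbers should not be read as real findings. The names pearson_r and segment are our own.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r, following equation 2.1."""
    n = len(xs)
    numerator = n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)
    spread_x = n * sum(x * x for x in xs) - sum(xs) ** 2
    spread_y = n * sum(y * y for y in ys) - sum(ys) ** 2
    return numerator / (sqrt(spread_x) * sqrt(spread_y))

# Hypothetical (age in years, height in cm) observations.
people = [(2, 88), (6, 115), (10, 138), (14, 160), (18, 172),
          (25, 170), (35, 165), (45, 178), (55, 171),
          (65, 170), (72, 172), (80, 165), (88, 168)]

def segment(age):
    """Assign an age to one of the three groups used in the text."""
    if age < 20:
        return "under 20"
    if age <= 60:
        return "20 to 60"
    return "over 60"

# Collect the observations belonging to each segment.
groups = {}
for age, height in people:
    groups.setdefault(segment(age), []).append((age, height))

# Correlation within each segment, plus the whole sample for comparison.
r_by_group = {}
for label, rows in groups.items():
    ages, heights = zip(*rows)
    r_by_group[label] = pearson_r(ages, heights)

all_ages, all_heights = zip(*people)
r_all = pearson_r(all_ages, all_heights)

print("all ages:", round(r_all, 2))
for label, r in r_by_group.items():
    print(label + ":", round(r, 2))
```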

2.8 Rank Correlation


Two variables can be strongly correlated, but that correlation might not be linear. Take
a look at Figure 2.1 again; in the sample of people analysed, anyone who is taller than
anyone else is also heavier (this is not always the case in real life, but in our sample it is
what we observed - one of the problems of working with very small samples!). So even
though the relationship between weight and height in our sample is not fully linear,
we can perfectly rank all our observations by both weight and height. This is shown in
Table 2.4.
Table 2.4: Height and weight of a sample of adults, and their ranks

height (cm) height rank weight (kg) weight rank


168 2 60 2
170 3 62 3
155 1 55 1
183 5 70 5
172 4 68 4

There are several measures of rank correlation; one of them is Spearman’s rank
correlation coefficient, which is equal to Pearson’s linear correlation (r), but applied to the
ranks of the variables, rather than their raw values. This results in a correlation value
which is also within the range [-1, 1], but without the assumption of a linear relationship
between the variables (only between their ranks):

\[
r_s = r(rg_x, rg_y)
\]

where $rg_x$ and $rg_y$ are the ranks of variables x and y, respectively.
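
The definition above translates directly into code: rank each variable, then apply Pearson's formula to the ranks. This is a sketch only; the function names are our own, and the simple ranks helper assumes there are no tied values (ties need a convention such as averaged ranks, which is not covered here). The data are from Table 2.4.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r, following equation 2.1."""
    n = len(xs)
    numerator = n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)
    spread_x = n * sum(x * x for x in xs) - sum(xs) ** 2
    spread_y = n * sum(y * y for y in ys) - sum(ys) ** 2
    return numerator / (sqrt(spread_x) * sqrt(spread_y))

def ranks(values):
    """Rank 1 is the smallest value. Assumes all values are distinct."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rs(xs, ys):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

heights = [168, 170, 155, 183, 172]  # Table 2.4, cm
weights = [60, 62, 55, 70, 68]       # Table 2.4, kg

print(ranks(heights))  # [2, 3, 1, 5, 4], as in Table 2.4
print(ranks(weights))  # [2, 3, 1, 5, 4], as in Table 2.4
print(spearman_rs(heights, weights))
```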

Exercise. Calculate Pearson’s correlation of the data shown in Table 2.4. How would
you characterise the relationship between weight and height? And between height and
weight?

Exercise. Now calculate Spearman’s correlation of the same data. How does it compare
with the linear correlation you calculated previously? What conclusion can you draw
from that comparison?

2.9 Correlation in Excel


Calculating a linear correlation in Excel is easy: the correlation function is called CORREL.
It takes two arguments, the values of the two variables. Take a look at the data shown
in Figure 2.5: to calculate the correlation between height and weight, use the formula
=CORREL(A2:A6, B2:B6). Notice that each argument is a cell range, i.e. a list of cell
locations. The two cell ranges are of the same length, because it is paired data. If there
is a missing value for some observation, Excel will just ignore that row and calculate the
correlation on the rest of the data.

Figure 2.5: Height and weight in Excel

Exercise. Enter some data into Excel, and calculate r using CORREL. Now delete some
data, and see what happens.

Exercise. Suppose we collect data on the age and height of a sample of children, and
plot it on a scatter plot. What will the data look like? Draw a scatter plot representing
your best guess. What is your best guess for the correlation value?

Exercise. Consider the data gathered from a sample of 4 companies on the number of
employees and their sales per annum (in thousands of Euros), shown in Table 2.5. Bring
it into Excel and calculate the correlation between them. Is the result what you expect?
Interpret the result – give one sentence explaining the relationship in terms that would
be understood by someone who knows nothing about statistics.

Table 2.5: Employees and sales in a small sample of companies.

employees sales (thousands of Euros)


1 15
4 25
5 100
7 120

Exercise. What do you think is the likely relationship between a country’s GDP and
its population? Gather the data from below and find out. Use a scatter plot to help
explain the result.
• http://data.worldbank.org/indicator/SP.POP.TOTL/countries
• http://data.worldbank.org/indicator/NY.GDP.MKTP.CD/countries

Exercise. Use Filter (in the Data menu) in Excel to choose a subset of observations
in the World Bank data (above): e.g. filter out countries with population less than 1
million or greater than 500 million. Does the correlation change much?
