
MIS BA 20232024 Notes Chapter02

Uploaded by

xujie623


Chapter 2

Correlation

2.1 Measuring Variable Relationships

A fundamental technique of analytics is to investigate the relationship between a pair of
variables. In some cases, it is obvious that there is a relationship. For example, we can
say with confidence that there is a positive relationship between a person’s salary and
the price they pay for a new car. This information is probably useful to a car dealer, for
example.
In other cases it is not so obvious. What, if any, is the relationship between a
person’s salary and the amount they spend on international phone calls per week? This
information would probably be useful to a phone company. If it is not obvious, we need
a way to measure it based on data.
Correlation is a simple method of measuring whether two variables are related, and
if so, what type of relationship and how strong it is.
We say that two variables are correlated if, when one goes up, so does the other, or
when one goes up, the other goes down – either is equally useful to know about.
In order to carry out a correlation, we need paired data – that is, we have multiple
observations, and each observation is a pair – one value for each variable. An example is
the height and weight of a sample of adults. For each adult, we have one value of each
variable. We will usually see our data-set in a table. On paper it would look like this:

Table 2.1: Height and weight of a sample of adults

height (cm) weight (kg)

168 60
170 62
155 55
183 70
172 68

2.2 Scatter Plots

A good place to start is a scatter plot. We plot the two variables x and y as a scatter
plot. A scatter plot has two axes, one for each variable, and each observation is plotted
as a point – no lines or bars. In Figure 2.1, each point represents an observation (one
person).

[Scatter plot titled "Weight/Height relationship": Weight (kg) on the x-axis, Height (cm) on the y-axis, one point per person.]

Figure 2.1: Scatter plot showing height and weight of the sample of adults.

A simple visual observation of the scatter plot is already enough to provide some
information with respect to how the two variables are related.

Exercise. What can you conclude from Figure 2.1? Is there a relationship between the
two variables? When one variable goes up, does the other one tend to go up, down, or
stay the same?

2.3 Linear Correlation

In practice, we do not always want to plot the data and then judge the relationship
visually. Sometimes it is useful to have a simple automated formula to describe the
relationship.
The correlation between two variables is calculated as follows, for a sample of n
observations:

$$ r = \frac{n \sum xy - \sum x \sum y}{\sqrt{n \sum x^2 - \left(\sum x\right)^2}\;\sqrt{n \sum y^2 - \left(\sum y\right)^2}} \tag{2.1} $$

where $\sum x$ is the sum of the x values for all observations, etc. This is called Pearson’s
correlation coefficient, and measures the linear correlation between the variables x and
y.
This correlation metric is symmetric: it does not matter which variable is called x
(and put on the x-axis in a scatter plot) and which is y. The value of r for the relationship
between x and y is equal to the value of r for the relationship between y and x.
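
As a rough illustration, equation 2.1 can be translated term by term into a few lines of Python. This is only a sketch: the function name `pearson_r` and the variable names are our own choices, and the data are the five observations from Table 2.1.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r computed directly from the sums in equation 2.1."""
    if len(xs) != len(ys):
        raise ValueError("paired data: xs and ys must have the same length")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    # If x or y never varies, the denominator is zero and r is undefined.
    return numerator / denominator

# Data from Table 2.1
heights = [168, 170, 155, 183, 172]
weights = [60, 62, 55, 70, 68]

r_hw = pearson_r(heights, weights)
r_wh = pearson_r(weights, heights)  # swapping x and y gives the same value
print(r_hw, r_wh)
```

Calling the function with the arguments swapped is a quick way to check the symmetry property described above.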

Nicolau & McDermott, Business Analytics 18

The formula returns a value within the range [-1, 1]. A value of 1 or -1 indicates a
perfect linear relationship between the two variables: in a scatter plot, one could draw a
straight line that would pass through every single data point. In general, positive values
for r indicate that when the value of one variable goes up, the value of the other variable
also tends to go up. Negative values for r indicate that when one goes up, the other tends
to go down. If there is no linear relationship at all between the two variables, the value
for r will be 0. See Figure 2.2.

Figure 2.2: A perfect positive correlation, r = 1, a perfect negative correlation, r = -1,
and a non-existent correlation, r = 0.

Correlation is not an all-or-nothing measure: two variables can also be imperfectly
correlated. In many real-world data-sets there is a clear relationship but, unlike the
r = 1 and r = -1 cases above, it is not perfect. In those cases, r takes on intermediate
values; not zero, but not -1 or 1 either:

If r is in the range . . . then we say there is a . . .

+.70 or higher very strong positive relationship
+.40 to +.69 strong positive relationship
+.30 to +.39 moderate positive relationship
+.20 to +.29 weak positive relationship
+.01 to +.19 no or negligible relationship
close to 0 no relationship
-.01 to -.19 no or negligible relationship
-.20 to -.29 weak negative relationship
-.30 to -.39 moderate negative relationship
-.40 to -.69 strong negative relationship
-.70 or lower very strong negative relationship
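
The table above can be turned into a small lookup function. The sketch below is one possible reading of it, not a standard: the function name `describe_r` is hypothetical, and we assume that the lower bound of each band acts as the cut-off, so that values falling in the gaps between rows (such as 0.695) are assigned to the weaker band.

```python
def describe_r(r):
    """Map a correlation value to the verbal label used in the table above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    size = abs(r)
    if size < 0.20:
        # Covers "close to 0" and the +/-.01 to +/-.19 rows.
        return "no or negligible relationship"
    if size >= 0.70:
        strength = "very strong"
    elif size >= 0.40:
        strength = "strong"
    elif size >= 0.30:
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} relationship"

print(describe_r(0.45))
print(describe_r(-0.25))
```

Labels like these are conventions for reporting; other textbooks draw the boundaries in different places.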

2.4 Misleading Correlations

When carrying out correlation, we may find that there is a relationship between y and
x. We should not assume that relationship to be causal; that is, we should not assume
that a change in x actually causes the change in y. It might be the other way around.
Or there might be some other variable z which we have not considered, which causes
related changes in both x and y. Either would lead to a correlation between x and y.
Remember the following slogan:

Correlation is not causation.

Example. Suppose we collect data on all the accidental fires in Singapore for the last
ten years. We correlate the number of fire engines at each fire (x) and the eventual
damages in Euros (y) at each fire. We will probably find that there is a strong positive
correlation. Can we conclude that the presence of more fire engines causes more costly
damage?

Example. Suppose we collect data from 100 randomly-chosen adults: what is the value
of their car (x), and what is their income (y). (Anyone who does not own a car is omitted
from the survey). We will find a strong positive correlation. Does this suggest that if I
buy a more expensive car, my salary will go up? Why or why not?

A weak or zero linear correlation between two variables implies that there is a weak
or non-existent linear relationship between them. That does not imply that there is
no other relationship between them. That is because linear correlation only measures
the strength of any possible linear relationship. There might be a strong non-linear
relationship which nevertheless results in a zero correlation. Look at Figure 2.3. There
are several data-sets where r = 0, despite the fact that there is a clear relationship
between the variables.

Figure 2.3: More examples of correlation values. The numbers shown are the correlation
coefficients of the x–y data-sets shown. Note that several very strong x–y relationships give
a zero correlation. Note that correlation is undefined in the case where there is no variation in
y, i.e. y is constant (same for x). (Image public domain, from Wikipedia.)

Note that it is not always possible to calculate the correlation between two variables.
In the centre of Figure 2.3, different values of x all correspond to the same value of y.
In this case, the value of r is undefined: we cannot measure the variation of y when the
value of x changes.

Exercise. Think of it the other way around: given that correlation is a symmetric
measure, how can we measure the corresponding change in x when y changes, if the
value of y never changes?

Exercise. Put this into practice. Using equation 2.1, calculate the linear correlation
between x and y, for the following data:
x 1 2 3 4 5
y 8 8 8 8 8

Some very different data sets can have the same correlation values, as in Figure 2.4.
That is why it is always useful to plot the data and look at it.

Figure 2.4: Anscombe’s Quartet: four data sets with identical statistical values (mean and
variance of x and of y, x-y correlation, and more), but very different properties. From Wikipedia:
"Anscombe’s quartet 3" by Anscombe.svg: Schutz; derivative work (label using subscripts):
Avenue (talk) - Anscombe.svg. Licensed under CC BY-SA 3.0 via Wikimedia Commons -
http://en.wikipedia.org/wiki/Anscombe’s_quartet

2.5 Correlation and Coincidence

Correlation is a statistical relationship between two variables. It involves multiple ob-
servations. Correlation is not the same as coincidence. A coincidence is a single paired
observation (e.g., one particular person has a large height and a large weight). It does
not tell us anything about the relationship in general. Above, we saw that correlation
is not causation. Now, we have a second slogan:

Coincidence is not correlation.

Exercise. Your friend Bob is the tallest person you know, and he got the best marks in
his class in first year exams. What do you conclude from this data and why?

2.6 Correlation Between All Pairs

A scenario we will see over and over again in this course is where you have a table of
numerical data. Each column represents some variable, and each row represents some
instance, just as we had a table earlier where each row represented one person, and the
columns represented their height and weight.
In some cases, it might be very “obvious” what relationships exist in the data. For
example, suppose each row represents one country, and the first column represents area
in square kilometres, the second represents population, the third represents the number
of kilometres of motorways in the country, and the fourth represents mean temperature
in the country. This type of data is freely available online. See Table 2.2; you will
probably observe that there is a positive correlation between population and area.

Table 2.2: Some data on several countries

Country Area Population Roadways Temperature

Ireland 70,273 4,832,765 96,036 10
France 551,500 66,259,012 1,028,446 15
...

However, in other cases you might have many, many variables, and you might not
have a strong grasp on what the variables even mean. What if your table looks like
Table 2.3?

Table 2.3: Data on phone customers

Customer ID lnd mins mob mins l allow mob prop x vx ...

1234 574 252 0.55 15 ...
1235 51 459 0.47 29 ...
...

In cases like this it becomes very useful to have a fast, automated method of investigating
all the relationships in the data. One method is to calculate the correlation between every
pair of variables and just see which ones are relatively strong. You might then choose
to report something like “Customers with large values for lnd mins tend to have small
values for v x”. This could be useful to your client or your manager, even if you are not
quite sure what v x means.

Exercise. Suppose there are four variables. How many relationships are there, i.e. how
many pairs? What about when there are 10 variables? What is the general rule?

2.7 Correlation after Segmentation

Sometimes, you observe no correlation between two variables, but if you measure the
correlation of a subset of the observations, then you find a strong correlation.
Consider the example of age and height in humans. With a sample of humans
of all ages, there will be very little correlation. But if we segment the population into
three groups, then we will find a strong positive correlation for the group where age is
less than 20; a weak or non-existent correlation for the group where age is between 20
and 60; and a weak negative correlation for the group where age is above 60. This gives
us a lot of insight into the data.

2.8 Rank Correlation

Two variables can be strongly correlated, but that correlation might not be linear. Take
a look at Figure 2.1 again; in the sample of people analysed, anyone who is taller than
anyone else is also heavier (this is not always the case in real life, but in our sample it is
what we observed - one of the problems of working with very small samples!). So even
though the relationship between weight and height in our sample is not fully linear,
we can perfectly rank all our observations by both weight and height. This is shown in
Table 2.4.

Table 2.4: Height and weight of a sample of adults, and their ranks

height (cm) height rank weight (kg) weight rank

168 2 60 2
170 3 62 3
155 1 55 1
183 5 70 5
172 4 68 4

There are several measures of rank correlation; one of them is Spearman’s rank
correlation coefficient, which is equal to Pearson’s linear correlation (r), but applied to the
ranks of the variables, rather than their raw values. This results in a correlation value
which is also within the range [-1, 1], but without the assumption of a linear relationship
between the variables (only between their ranks):

$$ r_s = r(rg_x, rg_y) $$

where $rg_x$ and $rg_y$ are the ranks of variables x and y, respectively.

Exercise. Calculate Pearson’s correlation of the data shown in Table 2.4. How would
you characterise the relationship between weight and height? And between height and
weight?

Exercise. Now calculate Spearman’s correlation of the same data. How does it compare
with the linear correlation you calculated previously? What conclusion can you draw
from that comparison?

2.9 Correlation in Excel

Calculating a linear correlation in Excel is easy: the correlation function is called CORREL.
It takes two arguments, the values of the two variables. Take a look at the data shown
in Figure 2.5: to calculate the correlation between height and weight, use the formula
=CORREL(A2:A6, B2:B6). Notice that each argument is a cell range, i.e. a list of cell
locations. The two cell ranges are of the same length, because it is paired data. If there
is a missing value for some observation, Excel will just ignore that row and calculate the
correlation on the rest of the data.

Figure 2.5: Height and weight in Excel

Exercise. Enter some data into Excel, and calculate r using CORREL. Now delete some
data, and see what happens.

Exercise. Suppose we collect data on the age and height of a sample of children, and
plot it on a scatter plot. What will the data look like? Draw a scatter plot representing
your best guess. What is your best guess for the correlation value?

Exercise. Consider the data gathered from a sample of 4 companies on the number of
employees and their sales per annum (in thousands of Euros), shown in Table 2.5. Bring
it into Excel and calculate the correlation between them. Is the result what you expect?
Interpret the result – give one sentence explaining the relationship in terms that would
be understood by someone who knows nothing about statistics.

Table 2.5: Employees and sales in a small sample of companies.

employees sales (thousands of Euros)

1 15
4 25
5 100
7 120

Exercise. What do you think is the likely relationship between a country’s GDP and
its population? Gather the data from below and find out. Use a scatter plot to help
explain the result.
• http://data.worldbank.org/indicator/SP.POP.TOTL/countries
• http://data.worldbank.org/indicator/NY.GDP.MKTP.CD/countries

Exercise. Use Filter (in the Data menu) in Excel to choose a subset of observations
in the World Bank data (above): e.g. filter out countries with population less than 1
million or greater than 500 million. Does the correlation change much?
