Correlation Coefficient


Statistics
Dr. Ansam Al-Obaidi

Correlation
The word “correlation” is a term used in everyday conversation to describe some
type of relationship between variables. You know that the amount you eat is
correlated with the amount that you weigh; that how hard you work in school is
correlated with how successful you'll be in life (well, maybe!).

In statistics, correlation has a precise definition. It's a measure of the strength of the
relationship between two variables.

Scatter Plots

Before we take up the discussion of correlation, we need to examine a way to
display the relation between two variables x and y. The most common and easiest
way is a scatter plot. The following example illustrates a scatter plot.

Example

In Europe and Asia, m-commerce is popular. M-commerce users have special mobile
phones that work like electronic wallets as well as provide phone and Internet
services. Users can do everything from paying for parking to buying a TV set or
soda from a machine to banking to checking sports scores on the Internet. For the
years 2000 through 2004, was there a relationship between the year and the number
of m-commerce users? Construct a scatter plot. Let x = the year and let y = the
number of m-commerce users, in millions.

(a) Table showing the number of m-commerce users (in millions) by year

(b) Scatter plot showing the number of m-commerce users (in millions) by year

Question

Amelia plays basketball for her high school. She wants to improve to play at the
college level. She notices that the number of points she scores in a game goes up in
response to the number of hours she practices her jump shot each week. She records
the following data:

Construct a scatter plot and state whether what Amelia thinks appears to be true.

A scatter plot shows the direction of a relationship between the variables. A clear
direction happens when there is either:

• High values of one variable occurring with high values of the other variable or low
values of one variable occurring with low values of the other variable.

• High values of one variable occurring with low values of the other variable.

You can determine the strength of the relationship by looking at the scatter plot and
seeing how close the points are to a line, a power function, an exponential function,
or to some other type of function. For a linear relationship there is an exception.
Consider a scatter plot where all the points fall on a horizontal line providing a
"perfect fit." The horizontal line would in fact show no relationship.

When you look at a scatterplot, you want to notice the overall pattern and any
deviations from the pattern. The following scatterplot examples illustrate these
concepts.
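The horizontal-line exception described above can also be checked numerically. The
sketch below is only an illustration: it uses made-up data and a hypothetical helper
named pearson_r, which computes the correlation coefficient r introduced later in
these notes. When every y value is the same, y has no spread, the denominator of r
is zero, and the helper returns None rather than a number.

```python
import math

def pearson_r(xs, ys):
    """Return Pearson's r, or None when either variable has no spread."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]          # deviation scores for x
    dy = [y - mean_y for y in ys]          # deviation scores for y
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_x2 = sum(a * a for a in dx)
    sum_y2 = sum(b * b for b in dy)
    if sum_x2 == 0 or sum_y2 == 0:
        return None                        # r is undefined (division by zero)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)

xs = [1, 2, 3, 4, 5]                       # made-up x values
flat = [3, 3, 3, 3, 3]                     # all points on a horizontal line
rising = [2, 4, 6, 8, 10]                  # all points on a rising line

horizontal_r = pearson_r(xs, flat)
rising_r = pearson_r(xs, rising)
print(horizontal_r, rising_r)
```

Both data sets lie exactly on a line, but only the rising one expresses a relationship
between x and y; for the flat one, knowing x tells you nothing new about y.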

Pearson's Correlation Coefficient

Inference based on the Pearson correlation coefficient (usually denoted as r) assumes
that X and Y are jointly distributed as bivariate normal, which implies that X and Y
are each normally distributed and that they are linearly related. When these
assumptions are not satisfied, nonparametric alternatives such as the Spearman
correlation coefficient can be used to estimate correlation.

The Pearson correlation coefficient is a statistic that measures the strength and direction
of the linear relationship between two variables. The correlation coefficient does not
tell you anything about the cause-and-effect relationship between the variables. The
Pearson correlation coefficient ranges in value from -1 to +1. The absolute value of
the correlation coefficient tells you how strongly the variables are linearly related.
A value of either +1 or -1 means that you can perfectly predict the values of one
variable from the values of the other.

• If the correlation coefficient is +1, all points fall on a line with values of both
variables increasing together.
• If the correlation coefficient is -1, all points fall on a line but as values of one
variable increase the values of the other variable decrease.
• If the correlation coefficient is 0, there is no linear relationship between the
two variables.
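The three cases above can be reproduced with a few lines of Python. This is a
minimal sketch on made-up data; pearson_r is a hypothetical helper written for
this illustration, not a library function. The last data set is a reminder that
r = 0 rules out only a linear relationship: its points lie exactly on the curve
y = x², yet r is 0.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed from deviation scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    return sum_xy / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

xs = [-2, -1, 0, 1, 2]                 # made-up x values
increasing = [2 * x + 1 for x in xs]   # points on a rising line
decreasing = [-3 * x + 4 for x in xs]  # points on a falling line
u_shaped = [x ** 2 for x in xs]        # perfect but non-linear relationship

r_up = pearson_r(xs, increasing)
r_down = pearson_r(xs, decreasing)
r_curve = pearson_r(xs, u_shaped)
print(r_up, r_down, r_curve)
```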

The figure below shows plots, correlation coefficients and summary lines for
correlations of different sizes.

Scatterplots with Correlation Coefficient

Computing Pearson's correlation coefficient (r)

Let’s take the table below as an example to calculate the correlation coefficient.
Here x and y denote deviation scores, that is, each value minus the mean of its
variable. Pearson's correlation is computed by dividing the sum of the xy column
(Σxy) by the square root of the product of the sum of the x² column (Σx²) and the
sum of the y² column (Σy²). The resulting formula is:

$$ r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} $$

Therefore, r is:

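An alternative computational formula that avoids the step of computing deviation
scores works directly from the raw scores X and Y of the N pairs:

$$ r = \frac{\sum XY - \frac{\sum X \sum Y}{N}}{\sqrt{\left(\sum X^2 - \frac{(\sum X)^2}{N}\right)\left(\sum Y^2 - \frac{(\sum Y)^2}{N}\right)}} $$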
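As a check that the deviation-score and raw-score computations agree, the sketch
below carries out both on a small illustrative data set. The values are assumptions
chosen for easy arithmetic and are not necessarily those of the table above; the
function names r_from_deviations and r_from_raw_scores are likewise hypothetical.

```python
import math

def r_from_deviations(xs, ys):
    """r = Σxy / sqrt(Σx² Σy²), with x and y as deviation scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_x2 = sum(a * a for a in dx)
    sum_y2 = sum(b * b for b in dy)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)

def r_from_raw_scores(xs, ys):
    """Same r, computed from raw sums without deviation scores."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    sum_y2 = sum(b * b for b in ys)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator

X = [1, 3, 5, 5, 6]      # illustrative values (assumed)
Y = [4, 6, 10, 12, 13]   # illustrative values (assumed)

r_dev = r_from_deviations(X, Y)   # Σxy = 30, Σx² = 16, Σy² = 60
r_raw = r_from_raw_scores(X, Y)   # ΣXY = 210, ΣX = 20, ΣY = 45
print(round(r_dev, 3), round(r_raw, 3))   # 0.968 0.968
```

For these values both routes reduce to 30 / √960, which is about 0.968.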

Example

Imagine that you’re studying the relationship between newborns’ weight and length.
You have the weights and lengths of the 10 babies born last month at your local
hospital. You enter the data in a table and find the value of the correlation coefficient
between weight and length.

Exercise:

The table below shows the marks obtained by a group of students, in two separate
tests.

The first test is out of 50 marks while the second test is out of 30 marks. Let x and y
represent the marks obtained in Test 1 and Test 2, respectively. Find the value of
the correlation coefficient between x and y.
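A point worth noting for this exercise: the two tests are marked out of different
totals, but that does not affect r, because the correlation coefficient is unchanged
when either variable is multiplied by a positive constant. The sketch below
illustrates this with hypothetical marks (not the ones in the exercise table) and a
hypothetical helper pearson_r.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed from deviation scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    return sum_xy / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Hypothetical marks for six students (NOT the data from the exercise).
test1 = [30, 42, 25, 48, 36, 40]   # out of 50
test2 = [18, 25, 17, 27, 20, 26]   # out of 30

# Convert both tests to percentages.
test1_pct = [100 * m / 50 for m in test1]
test2_pct = [100 * m / 30 for m in test2]

r_marks = pearson_r(test1, test2)
r_percent = pearson_r(test1_pct, test2_pct)
print(r_marks, r_percent)   # the two values agree up to rounding error
```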

When to use the Pearson correlation coefficient

The Pearson correlation coefficient (r) is one of several correlation coefficients that
you need to choose between when you want to measure a correlation. The Pearson
correlation coefficient is a good choice when all of the following are true:

• Both variables are quantitative.

• The variables are normally distributed: You can create a histogram of each
variable to verify whether the distributions are approximately normal. It’s not a
problem if the variables are a little non-normal.

• The data have no outliers: Outliers are observations that don’t follow the same
patterns as the rest of the data. A scatterplot is one way to check for outliers.

• The relationship is linear: “Linear” means that the relationship between the two
variables can be described reasonably well by a straight line. You can use a
scatterplot to check whether the relationship between two variables is linear or not.
Your scatterplot may look something like one of the following:
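To see why the no-outliers condition matters, the sketch below computes r for a
made-up data set with and without one extreme observation. The data and the helper
name pearson_r are assumptions for illustration only. Six points lie exactly on a
rising line; adding a single point far from the pattern is enough to make r negative.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed from deviation scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    return sum_xy / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Made-up data: six points on the line y = 2x.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 12]

# The same data plus one observation that does not follow the pattern.
xs_outlier = xs + [20]
ys_outlier = ys + [0]

r_clean = pearson_r(xs, ys)
r_with_outlier = pearson_r(xs_outlier, ys_outlier)
print(round(r_clean, 3), round(r_with_outlier, 3))
```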

Pearson vs. Spearman’s rank correlation coefficients

The Spearman rank-order correlation coefficient (Spearman’s correlation) is a
nonparametric measure of the strength and direction of association that exists
between two ranked variables. It is used for either ordinal variables or for
continuous data that have failed the assumptions necessary for conducting the
Pearson's product-moment correlation. It’s a better choice than the Pearson
correlation coefficient when one or more of the following is true:

• At least one variable is ordinal. (Spearman’s correlation only requires that the
two variables be measured on an ordinal, interval or ratio scale.)

• The variables aren’t normally distributed.

• The data include outliers.

• The relationship between the variables is non-linear and monotonic. A monotonic
relationship is a relationship that does one of the following: (1) as the value of
one variable increases, so does the value of the other variable; or (2) as the
value of one variable increases, the other variable value decreases. Examples of
monotonic and non-monotonic relationships are presented in the diagram below.
Although there are several ways to check whether a monotonic relationship exists
between your two variables, a simple one is to create a scatterplot (for example
in SPSS or other statistical software), plot one variable against the other, and
then visually inspect the scatterplot to check for monotonicity. Your scatterplot
may look something like one of the following:
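Spearman’s correlation can be computed by replacing each value with its rank and
then applying Pearson’s formula to the ranks. The sketch below does this for a
made-up monotonic but non-linear data set (y = x³). The helper names pearson_r,
average_ranks and spearman_r are hypothetical; tied values, if any, receive the
average of the ranks they span.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed from deviation scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    return sum_xy / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def average_ranks(values):
    """Rank values from 1 (smallest); ties share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1      # mean of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman_r(xs, ys):
    """Spearman's correlation: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

xs = [1, 2, 3, 4, 5, 6]        # made-up data
ys = [x ** 3 for x in xs]      # monotonic, but not linear

r_pearson = pearson_r(xs, ys)
r_spearman = spearman_r(xs, ys)
print(round(r_pearson, 3), round(r_spearman, 3))
```

Because y always increases with x, the ranks of the two variables match exactly and
Spearman’s correlation is 1, while Pearson’s r is below 1 because the points do not
fall on a straight line.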

References

Lane, David M. (primary author and editor). Introduction to Statistics, Online
Edition. Other authors: David Scott, Mikki Hebl, Rudy Guerra, Dan Osherson (Rice
University) and Heidi Zimmer (University of Houston, Downtown Campus).

Illowsky, Barbara, and Susan Dean (senior contributing authors, De Anza College).
Introductory Statistics.
