Correlation Coefficient
Ansam Al-Obaidi
Correlation
The word “correlation” is a term used in everyday conversation to describe some
type of relationship between variables. You know that the amount you eat is
correlated with the amount that you weigh; that how hard you work in school is
correlated with how successful you'll be in life (well, maybe!).
In statistics, correlation has a precise definition. It's a measure of the strength of the
relationship between two variables.
Scatter Plots
Example
(a) Table showing the number of m-commerce users (in millions) by year
(b) Scatter plot showing the number of m-commerce users (in millions) by year
Question
Amelia plays basketball for her high school. She wants to improve to play at the
college level. She notices that the number of points she scores in a game goes up in
response to the number of hours she practices her jump shot each week. She records
the following data:
Statistics Dr. Ansam Al-Obaidi
Construct a scatter plot and state whether what Amelia thinks appears to be true.
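As a rough illustration of how such a plot can be constructed, the sketch below draws a text-only scatter plot using nothing but the Python standard library (a plotting library would normally be used instead). The function name `text_scatter` and the hours and points values are hypothetical, chosen only for illustration; they are not Amelia's actual records.

```python
def text_scatter(xs, ys, width=30, height=10):
    """Return a rough text scatter plot (x left to right, y bottom to top)."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a column/row index; a constant variable maps to 0.
        col = round((x - x_min) / (x_max - x_min) * (width - 1)) if x_max > x_min else 0
        row = round((y - y_min) / (y_max - y_min) * (height - 1)) if y_max > y_min else 0
        grid[height - 1 - row][col] = "*"  # grid row 0 is the top of the plot
    body = "\n".join("|" + "".join(line) for line in grid)
    return body + "\n+" + "-" * width


# Hypothetical data (NOT Amelia's actual records):
# weekly practice hours and points scored per game.
hours = [1, 2, 3, 4, 5, 6, 7]
points = [4, 6, 7, 10, 11, 14, 15]

plot = text_scatter(hours, points)
print(plot)
```

With these made-up values the points rise from the lower left to the upper right, which is the pattern that would support Amelia's impression.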
A scatter plot shows the direction of a relationship between the variables. A clear
direction happens when there is either:
• High values of one variable occurring with high values of the other variable or low
values of one variable occurring with low values of the other variable.
• High values of one variable occurring with low values of the other variable.
You can determine the strength of the relationship by looking at the scatter plot and
seeing how close the points are to a line, a power function, an exponential function,
or to some other type of function. For a linear relationship there is an exception.
Consider a scatter plot where all the points fall on a horizontal line providing a
"perfect fit." The horizontal line would in fact show no relationship.
When you look at a scatterplot, you want to notice the overall pattern and any
deviations from the pattern. The following scatterplot examples illustrate these
concepts.
The Pearson correlation coefficient (usually denoted as r) assumes that X and Y are
jointly distributed as bivariate normal (which implies that X and Y are each normally
distributed) and that they are linearly related. When these assumptions are not satisfied,
nonparametric alternatives, such as the Spearman correlation coefficient, can be used to
estimate correlation.
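To make the rank-based alternative concrete, here is a minimal sketch of the Spearman coefficient, computed as Pearson's r applied to the ranks of the data, with tied values given their average rank. The function names and the sample data are assumptions made for illustration, not part of these notes.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson_r(xs, ys):
    """Pearson r computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_x2 = sum((x - mean_x) ** 2 for x in xs)
    sum_y2 = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)


def spearman_rho(xs, ys):
    """Spearman coefficient: Pearson r applied to the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Illustrative data: y = x cubed is perfectly monotonic but not linear.
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]
print(pearson_r(xs, ys), spearman_rho(xs, ys))
```

For this monotonic but curved relationship the rank-based coefficient equals 1, while Pearson's r is below 1 because the points do not lie on a straight line.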
The Pearson correlation coefficient is a statistic that measures the strength and direction
of the linear relationship between two variables. The correlation coefficient does not
tell you anything about the cause-and-effect relationship between the variables. The
Pearson correlation coefficient ranges in value from -1 to +1. The absolute value of
the correlation coefficient tells you how strongly the variables are linearly related.
A value of either +1 or -1 means that you can perfectly predict the values of one
variable from the values of the other.
• If the correlation coefficient is +1, all points fall on a line with values of both
variables increasing together.
• If the correlation coefficient is -1, all points fall on a line but as values of one
variable increase the values of the other variable decrease.
• It is 0 when there is no linear relationship between two variables.
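The three cases in the list above can be checked numerically. The sketch below is only an illustration under assumed data: the helper `pearson_r` and the three small data sets are hypothetical. The third data set is a symmetric curve, which shows that r can be 0 even when the variables are clearly related, just not linearly.

```python
import math


def pearson_r(xs, ys):
    """Pearson r computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_x2 = sum((x - mean_x) ** 2 for x in xs)
    sum_y2 = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)


xs = [1, 2, 3, 4, 5]

r_up = pearson_r(xs, [2 * x + 1 for x in xs])        # points on a rising line
r_down = pearson_r(xs, [10 - 3 * x for x in xs])     # points on a falling line
r_curve = pearson_r(xs, [(x - 3) ** 2 for x in xs])  # symmetric U-shaped curve

print(r_up, r_down, r_curve)
```

Up to floating-point rounding, the three values are +1, -1, and 0 respectively.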
The figure below shows plots, correlation coefficients, and summary lines for
correlations of different sizes.
Let’s take the table below as an example to calculate the correlation coefficient.
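Pearson's correlation is computed by dividing the sum of the xy column (Σxy) by the
square root of the product of the sum of the x² column (Σx²) and the sum of the y²
column (Σy²), where x and y are deviation scores (each value minus the mean of its
variable). The resulting formula is:

$$ r = \frac{\sum xy}{\sqrt{\sum x^2 \, \sum y^2}} $$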
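The column-by-column calculation described above can be sketched in Python. This is a minimal sketch under two assumptions: x and y are read as deviation scores (x = X minus the mean of X, y = Y minus the mean of Y), and the five data pairs are illustrative values, not necessarily those of the table. The function names are hypothetical.

```python
import math


def deviation_table(X, Y):
    """Build one row per observation with columns X, Y, x, y, xy, x2, y2."""
    mean_x = sum(X) / len(X)
    mean_y = sum(Y) / len(Y)
    rows = []
    for raw_x, raw_y in zip(X, Y):
        x = raw_x - mean_x  # deviation of X from its mean
        y = raw_y - mean_y  # deviation of Y from its mean
        rows.append({"X": raw_x, "Y": raw_y, "x": x, "y": y,
                     "xy": x * y, "x2": x * x, "y2": y * y})
    return rows


def pearson_from_table(rows):
    """r = sum(xy) / sqrt(sum(x2) * sum(y2)); None if either variable is constant."""
    sum_xy = sum(row["xy"] for row in rows)
    sum_x2 = sum(row["x2"] for row in rows)
    sum_y2 = sum(row["y2"] for row in rows)
    if sum_x2 == 0 or sum_y2 == 0:
        return None  # zero denominator, e.g. all points on a horizontal line
    return sum_xy / math.sqrt(sum_x2 * sum_y2)


# Illustrative data (an assumption, not taken from the table).
rows = deviation_table([1, 3, 5, 5, 6], [4, 6, 10, 12, 13])
for row in rows:
    print(row)

r = pearson_from_table(rows)
print(round(r, 3))
```

For these values the column sums are Σxy = 30, Σx² = 16, and Σy² = 60, so r = 30 / √960 ≈ 0.968. The function returns `None` when a variable is constant, because the denominator is then zero and r is undefined.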
Therefore, r is:-
Example
Imagine that you’re studying the relationship between newborns’ weight and length.
You have the weights and lengths of the 10 babies born last month at your local
hospital. You enter the data in a table and find the value of the correlation coefficient
between weight and length.
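When working from a table of raw measurements like this one, an algebraically equivalent computational form of Pearson's r is often convenient, because it needs only column totals. The notation below is an assumption made for illustration: X is weight, Y is length, and n is the number of babies (here n = 10).

```latex
r = \frac{n\sum XY - \sum X \sum Y}
         {\sqrt{\left[\,n\sum X^{2} - \left(\sum X\right)^{2}\right]
                \left[\,n\sum Y^{2} - \left(\sum Y\right)^{2}\right]}}
```

This follows from the deviation-score form because $\sum (X-\bar{X})(Y-\bar{Y}) = \sum XY - \frac{1}{n}\sum X \sum Y$, and likewise for the squared terms; multiplying numerator and denominator by n gives the expression above.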
Exercise:
The table below shows the marks obtained by a group of students in two separate
tests.
The first test is out of 50 marks while the second test is out of 30 marks. Let x and y
represent the marks obtained in Test 1 and Test 2, respectively. Find the value of
the correlation coefficient between x and y.
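Because the two tests are marked out of different totals, it is worth noting that the correlation coefficient does not change when either variable is rescaled by a positive factor, so the marks do not need to be converted to a common scale first. The sketch below illustrates this with the raw-score form of the formula. The marks are hypothetical, not the values from the exercise table, and the function name is an assumption.

```python
import math


def pearson_r_raw(X, Y):
    """Pearson r from raw scores, using only column totals."""
    n = len(X)
    sum_x, sum_y = sum(X), sum(Y)
    sum_xy = sum(x * y for x, y in zip(X, Y))
    sum_x2 = sum(x * x for x in X)
    sum_y2 = sum(y * y for y in Y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical marks (NOT the exercise data): Test 1 out of 50, Test 2 out of 30.
test1 = [30, 42, 25, 48, 35, 40]
test2 = [18, 25, 14, 28, 20, 26]

r_marks = pearson_r_raw(test1, test2)

# Convert both tests to percentages and recompute.
test1_pct = [100 * m / 50 for m in test1]
test2_pct = [100 * m / 30 for m in test2]
r_percent = pearson_r_raw(test1_pct, test2_pct)

print(r_marks, r_percent)
```

The two printed values agree up to floating-point rounding, which is why x and y can be used directly as raw marks.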
The Pearson correlation coefficient (r) is one of several correlation coefficients that
you need to choose between when you want to measure a correlation. The Pearson
correlation coefficient is a good choice when all of the following are true:
• The variables are normally distributed: You can create a histogram of each
variable to verify whether the distributions are approximately normal. It’s not a
problem if the variables are a little non-normal.
• The data have no outliers: Outliers are observations that don’t follow the same
patterns as the rest of the data. A scatterplot is one way to check for outliers.
• The relationship is linear: “Linear” means that the relationship between the two
variables can be described reasonably well by a straight line. You can use a
scatterplot to check whether the relationship between two variables is linear or not.
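To see why the outlier condition matters, the following sketch compares r for a small, perfectly linear data set with r for the same data plus one point that breaks the pattern. The data and the helper `pearson_r` are hypothetical, chosen only to make the effect easy to follow.

```python
import math


def pearson_r(xs, ys):
    """Pearson r computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_x2 = sum((x - mean_x) ** 2 for x in xs)
    sum_y2 = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)


# Hypothetical data: six points lying exactly on the line y = 2x.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 12]
r_clean = pearson_r(xs, ys)

# The same data with a single outlier appended at (7, 0).
r_outlier = pearson_r(xs + [7], ys + [0])

print(r_clean, r_outlier)
```

With these values a single outlier lowers r from 1 to 0.25, even though the other six points still lie exactly on a line.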
Your scatterplot may look something like one of the following:
References
David M. Lane (primary author and editor), with David Scott, Mikki Hebl, Rudy Guerra,
and Dan Osherson (Rice University) and Heidi Zimmer (University of Houston, Downtown
Campus). Introduction to Statistics, Online Edition.
Barbara Illowsky and Susan Dean (senior contributing authors), De Anza College.
Introductory Statistics.