Regression and Correlation Analysis
Introduction
There are many situations where we are interested in the relationship between two or more
variables occurring together.
The analysis of bivariate numeric data is concerned with the statistical measures of regression and
correlation. Correlation is a statistical method used to determine whether a relationship between
variables exists. Regression is a statistical method used to describe the nature of the relationship
between variables, that is, positive or negative, linear or nonlinear. Together they are the bivariate
equivalents of location and dispersion: regression locates bivariate data in terms of a mathematical
relationship, able to be graphed as a line or curve, while correlation describes the nature of the
spread of the items about that line or curve.
Regression
There are two types of regression: simple linear regression and multiple regression.
In simple regression there are only two variables under study, while in multiple regression more
than two variables are under study.
The simple linear regression model: Regression analysis attempts to establish the nature of the
relationship between variables – that is, to study the functional relationship between the variables
and thereby provide a mechanism for prediction, or forecasting. The variable which is used to
predict the variable of interest is called the independent variable or explanatory variable. The
variable we are trying to predict is called the dependent variable or explained variable.
Bivariate data always involve two distinct variables, and in the majority of cases one variable will
depend naturally on the other. The independent variable is the one that is chosen freely or occurs
naturally; the dependent variable occurs as a consequence of the value of the independent variable.
Sometimes the relation between a dependent and an independent variable is called a causal
relationship, since it can be argued that the value of one variable has been caused by the value of
the other.
The analysis is called simple linear regression analysis – simple because there is only one
predictor or independent variable, and linear because of the assumed linear relationship between the
dependent and independent variables.
The independent and dependent variables can be plotted on a graph called a scatter plot. The
independent variable x is plotted on the horizontal axis, and the dependent variable y is plotted
on the vertical axis.
A scatter plot is a graph of the ordered pairs ( x, y) of numbers consisting of the independent
variable x and the dependent variable y.
The scatter plot is a visual way to describe the nature of the relationship between the independent
and dependent variables.
A. Construct a scatter plot for the following data:

Number of observation   1   2   3   4   5   6   7   8   9   10  11  12
Quantity Y              69  76  52  56  57  77  58  55  67  53  72  64
Price X                 9   12  6   10  9   10  7   8   12  6   11  8
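As an illustration, here is a minimal Python sketch (assuming matplotlib is installed) that draws the scatter plot for the price/quantity data of table A, with the independent variable on the horizontal axis and the dependent variable on the vertical axis.

import matplotlib.pyplot as plt

# Price (independent variable x) and quantity (dependent variable y) from table A above
x = [9, 12, 6, 10, 9, 10, 7, 8, 12, 6, 11, 8]
y = [69, 76, 52, 56, 57, 77, 58, 55, 67, 53, 72, 64]

plt.scatter(x, y)
plt.xlabel("Price X")
plt.ylabel("Quantity Y")
plt.title("Scatter plot of quantity against price")
plt.show()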
B: Construct a scatter plot for the data obtained in a study on the number of absences and the
final grades of seven randomly selected students from a statistics class. The data are shown here.

Number of absences x   6   2   15  9   12  5   8
Final grade y (%)      82  86  43  74  58  90  78
For any set of bivariate data, there are two regression line equations which can be obtained:
a) The y on x regression line is the name given to the regression line which is used for
estimating y given a value of x.
Mathematically: Y = a + bX, where Y is called the dependent variable, explained variable or
predictand, and X is called the independent variable, explanatory variable or predictor; "a" and
"b" can be any numerical values, positive or negative, "a" is the Y intercept and "b" is the
gradient or slope.
b) The x on y regression line is the name given to the regression line which is used for
estimating x given a value of y.
Mathematically: X = a + bY, where X is called the dependent variable, explained variable or
predictand, and Y is called the independent variable, explanatory variable or predictor; "a" and
"b" can be any numerical values, positive or negative, "a" is the X intercept and "b" is the
gradient or slope.
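To make the distinction concrete, here is a small Python sketch (assuming numpy is available) that fits both lines to the price/quantity data of table A above: the y on x line predicts quantity from price, while the x on y line predicts price from quantity, and in general the two lines are different.

import numpy as np

price = np.array([9, 12, 6, 10, 9, 10, 7, 8, 12, 6, 11, 8])            # X
quantity = np.array([69, 76, 52, 56, 57, 77, 58, 55, 67, 53, 72, 64])  # Y

b_yx, a_yx = np.polyfit(price, quantity, 1)   # y on x line: Y = a + b*X
b_xy, a_xy = np.polyfit(quantity, price, 1)   # x on y line: X = a + b*Y

print(f"y on x: Y = {a_yx:.2f} + {b_yx:.2f}X")
print(f"x on y: X = {a_xy:.2f} + {b_xy:.2f}Y")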
There are many methods that can be used for obtaining a regression line, but in this chapter we
will use: the method of least squares, the method of Mayer and the method of three points.
a) Method of least squares
The least squares regression line of y on x, ŷ = a + bx, has slope and intercept given by

b = [nΣXY - (ΣX)(ΣY)] / [nΣX² - (ΣX)²]   and   a = [ΣY·ΣX² - ΣX·ΣXY] / [nΣX² - (ΣX)²]

For the absences data of Example B (n = 7, ΣX = 57, ΣY = 511, ΣXY = 3745, ΣX² = 579):

a = (511·579 - 57·3745) / (7·579 - 57²) = 82404 / 804 ≈ 102.49
b = (7·3745 - 57·511) / (7·579 - 57²) = -2912 / 804 ≈ -3.62

so the regression line is ŷ = 102.49 - 3.62x.
Example: A physician wishes to know whether there is a relationship between a father’s weight
(in kg) and his son’s weight (in kg).
The data are given here.
Father's weight x   65  63  67  64  68  62  70  66  68  67  69  71
Son's weight y      68  66  68  65  69  66  68  65  71  67  68  70

Here n = 12, ΣX = 800, ΣY = 811, ΣXY = 54107 and ΣX² = 53418, so

a = [ΣY·ΣX² - ΣX·ΣXY] / [nΣX² - (ΣX)²] = (811·53418 - 800·54107) / (12·53418 - 800²) = 36398 / 1016 ≈ 35.82
b = [nΣXY - (ΣX)(ΣY)] / [nΣX² - (ΣX)²] = (12·54107 - 800·811) / (12·53418 - 800²) = 484 / 1016 ≈ 0.476

Hence, the equation of the regression line y = a + bx is ŷ = 35.82 + 0.476x.
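The same calculation can be checked with a short Python sketch; it applies the least squares formulas given above to the father/son data and should agree with a ≈ 35.82 and b ≈ 0.476.

# Least squares slope and intercept from the raw-score formulas above
x = [65, 63, 67, 64, 68, 62, 70, 66, 68, 67, 69, 71]  # father's weight
y = [68, 66, 68, 65, 69, 66, 68, 65, 71, 67, 68, 70]  # son's weight

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n  # equivalently Y-bar minus b times X-bar

print(f"a = {a:.2f}, b = {b:.3f}")  # expected: a = 35.82, b = 0.476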
b) Method of Mayer
The data set is divided into two equal parts, and for each part we calculate the mean of the X and
Y values, giving the two points (X̄₁, Ȳ₁) and (X̄₂, Ȳ₂).
These two points are then used to set up a system of two equations which is solved to find the
straight line fitted to the distribution.
Example: Fit a straight line to the following data by the method of Mayer.

X   2   4   6   8   9   13
Y   7   10  13  15  20  28

First half (X = 2, 4, 6):    X̄₁ = (2 + 4 + 6)/3 = 4    and   Ȳ₁ = (7 + 10 + 13)/3 = 10
Second half (X = 8, 9, 13):  X̄₂ = (8 + 9 + 13)/3 = 10  and   Ȳ₂ = (15 + 20 + 28)/3 = 21

Then we have the two points (4, 10) and (10, 21).
The equation of the straight line passing through these two points is Y = a + bX, so we have to solve:
10 = a + 4b
21 = a + 10b
Subtracting the first equation from the second gives 11 = 6b, so b = 11/6 ≈ 1.83 and
a = 10 - 4(11/6) = 8/3 ≈ 2.67. The Mayer line is therefore ŷ ≈ 2.67 + 1.83x.
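A short Python sketch of the method of Mayer, under the assumption stated above that the data (ordered by X) are split into two halves and a line is passed through the two mean points; the helper name mayer_line is only illustrative.

def mayer_line(x, y):
    """Fit Y = a + b*X by Mayer's method: split the data into two halves,
    average each half, and pass a line through the two mean points."""
    n = len(x)
    half = n // 2
    x1, y1 = sum(x[:half]) / half, sum(y[:half]) / half                  # first mean point
    x2, y2 = sum(x[half:]) / (n - half), sum(y[half:]) / (n - half)      # second mean point
    b = (y2 - y1) / (x2 - x1)   # slope through the two mean points
    a = y1 - b * x1             # intercept
    return a, b

a, b = mayer_line([2, 4, 6, 8, 9, 13], [7, 10, 13, 15, 20, 28])
print(f"a = {a:.2f}, b = {b:.2f}")  # expected: a = 2.67, b = 1.83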
Correlation
Correlation may be defined as the degree or strength of relationship existing between two or
more variables.
The variables are said to be correlated, if there exists a change in one variable corresponding to a
change in the other variable.
Types of Correlation
Correlation is classified in several ways. The following are the important types:
a. Positive correlation
b. Negative correlation
c. Simple, partial and multiple correlation
d. Linear and non - linear
a. Positive correlation: correlation is positive (direct), if the variables vary in the same
direction, that is, if they increase or decrease together.
Ex: Height ( x) and weight ( y) of persons are positively correlated.
b. Negative correlation: correlation is negative (inverse), if the variables vary in opposite
directions, that is, if one variable is increasing, the other is decreasing or vice versa.
Ex: Production ( x) and price ( y) of a commodity are negatively correlated.
c. Simple, partial and multiple correlations: The distinction between simple, partial and
multiple correlations is based on the number of variables involved. Simple correlations are
concerned with two variables only while partial and multiple correlations are concerned
with three or more related variables.
d. Linear and nonlinear (curvilinear) correlations: If the amount of change in one
variable tends to bear a constant ratio to the amount of change in the other variable, then the
correlation is said to be linear; otherwise it is nonlinear.
The coefficient of correlation is used to measure the strength and direction of a linear
relationship between two variables.
There are several types of correlation coefficients. The one explained in this section is Pearson's
correlation coefficient, named after the statistician Karl Pearson, who pioneered the research in
this area. The symbol for the sample correlation coefficient is r. The symbol for the population
correlation coefficient is ρ (the Greek letter rho).
Formula:
Correlation coefficient:
r = [NΣXY - (ΣX)(ΣY)] / √([NΣX² - (ΣX)²][NΣY² - (ΣY)²])
where
N = number of pairs of values
X = first score
Y = second score
ΣXY = sum of the products of first and second scores
ΣX = sum of first scores
ΣY = sum of second scores
ΣX² = sum of squared first scores
ΣY² = sum of squared second scores
Correlation coefficient example: find the correlation coefficient for the following values.

X values   Y values
60         3.1
61         3.6
62         3.8
63         4
65         4.1
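Since the example stops at the data, here is a small Python sketch that applies the raw-score formula above to these five pairs; the value it prints (about 0.91) is my own computation, not taken from the text.

from math import sqrt

x = [60, 61, 62, 63, 65]
y = [3.1, 3.6, 3.8, 4, 4.1]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# Raw-score (computational) form of Pearson's r
r = (n * sum_xy - sum_x * sum_y) / sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
print(round(r, 4))  # approximately 0.91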
Example:
The following table shows the Height (x) vs. Femur Length (y) measurements (both in inches)
for 10 men:
X (height)         70.8  66.2  71.7  68.7  67.6  69.2  66.5  67.2  68.3  65.6
Y (femur length)   42.5  40.2  44.4  42.8  40    47.3  43.4  40.1  42.1  36

[Scatter plot: Height, 65 to 72 inches, on the horizontal axis against Length of Femur, 35 to 50 inches, on the vertical axis.]
Example: The following table gives the weight (x) (in 1000 lbs.) and highway fuel efficiency (y)
(in miles/gallon) for a sample of 13 cars.
Vehicle   Weight X (1000 lbs)   MPG Highway Y
1         3.545                 30
2         2.6                   32
3         3.245                 30
4         3.93                  24
5         3.995                 26
6         3.115                 30
7         3.235                 33
8         3.225                 27
9         2.44                  37
10        3.24                  32
11        2.29                  37
12        2.5                   34
13        4.02                  26

[Scatter plot: weight (1000 lbs), 2 to 4.5, on the horizontal axis against MPG Highway, 20 to 40, on the vertical axis.]
The coefficient can be used to test for a linear relationship between two variables.
Formula for the correlation coefficient r:

r = [nΣxy - (Σx)(Σy)] / √([nΣx² - (Σx)²][nΣy² - (Σy)²])     (1)

Equivalently, in terms of deviations from the means,

r(x, y) = Cov(x, y) / √(Var(x)·Var(y)) = Σ(xᵢ - x̄)(yᵢ - ȳ) / √(Σ(xᵢ - x̄)²·Σ(yᵢ - ȳ)²)

where Cov(x, y) = (1/n)Σ(xᵢ - x̄)(yᵢ - ȳ), Var(x) = (1/n)Σ(xᵢ - x̄)², Var(y) = (1/n)Σ(yᵢ - ȳ)²,
and n is the number of data pairs.
Example Compute the value of the correlation coefficient for the data obtained in the study of
the number of absences and the final grade of the seven students in the statistics class given in
example above.
Solution
Student   Number of absences X   Final grade Y (%)   XY    X²    Y²
A 6 82 492 36 6724
B 2 86 172 4 7396
C 15 43 645 225 1849
D 9 74 666 81 5476
E 12 58 696 144 3364
F 5 90 450 25 8100
G 8 78 624 64 6084
Σx = 57;  Σy = 511;  Σxy = 3745;  Σx² = 579;  Σy² = 38993
r = [nΣxy - (Σx)(Σy)] / √([nΣx² - (Σx)²][nΣy² - (Σy)²])
  = [7(3745) - (57)(511)] / √([7(579) - 57²][7(38993) - 511²]) ≈ -0.944
The value of r suggests a strong negative relationship between a student’s final grade and the
number of absences a student has. That is, the more absences a student has, the lower is his or
her grade.
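The same value can be reproduced with the covariance form of the formula; a minimal Python sketch (standard library only) for the absences data is given below, and it should print approximately -0.944.

from math import sqrt

absences = [6, 2, 15, 9, 12, 5, 8]
grades = [82, 86, 43, 74, 58, 90, 78]

n = len(absences)
x_bar = sum(absences) / n
y_bar = sum(grades) / n

cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(absences, grades)) / n
var_x = sum((x - x_bar) ** 2 for x in absences) / n
var_y = sum((y - y_bar) ** 2 for y in grades) / n

r = cov / sqrt(var_x * var_y)
print(round(r, 3))  # approximately -0.944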
Example.
Find the equation of the regression line and compute the value of the correlation coefficient for
the following data.
Income x 80 100 120 140 160 180
Consumption y 325 462 445 707 678 750
Here n = 6, Σx = 780, Σy = 3367, Σxy = 468060, Σx² = 108400 and Σy² = 2039127, so

b = [6(468060) - (780)(3367)] / [6(108400) - 780²] = 182100 / 42000 ≈ 4.34
a = [Σy·Σx² - Σx·Σxy] / [nΣx² - (Σx)²] = (3367·108400 - 780·468060) / 42000 = -104000 / 42000 ≈ -2.48

giving the regression line ŷ ≈ -2.48 + 4.34x, and

r(x, y) = [6(468060) - (780)(3367)] / √([6(108400) - 780²][6(2039127) - 3367²]) = 182100 / 194214 ≈ 0.938
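As a cross-check of the values above, a minimal Python sketch (assuming numpy is available) that fits the line and computes r for the income/consumption data:

import numpy as np

x = np.array([80, 100, 120, 140, 160, 180])      # income
y = np.array([325, 462, 445, 707, 678, 750])     # consumption

b, a = np.polyfit(x, y, 1)      # slope and intercept of the least squares line
r = np.corrcoef(x, y)[0, 1]     # Pearson correlation coefficient

print(f"y-hat = {a:.2f} + {b:.2f}x,  r = {r:.3f}")  # roughly a = -2.48, b = 4.34, r = 0.938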
Example: From the following data of marks in Accountancy and Statistics obtained by 6
students (out of 50), calculate the correlation coefficient:
Marks in Accountancy: 35 30 28 29 13 45
Marks in Statistics : 40 27 35 26 24 40
Solution
Let marks in Accountancy be denoted by x and marks in Statistics by y
xᵢ       yᵢ       xᵢyᵢ      xᵢ²      yᵢ²
35       40       1400      1225     1600
30       27       810       900      729
28       35       980       784      1225
29       26       754       841      676
13       24       312       169      576
45       40       1800      2025     1600
Σx = 180  Σy = 192  Σxy = 6056  Σx² = 5944  Σy² = 6406

r = [6(6056) - (180)(192)] / √([6(5944) - 180²][6(6406) - 192²]) = 1776 / √((3264)(1572)) ≈ 0.78
NOTE: The sign of the correlation coefficient and the sign of the slope of the regression line will
always be the same.
The sum of each column is found, and these sums can then be substituted into the formula above
to find r.
Example: Using our previous data set of height vs femur length for 10 men, we get the table:
X        Y        XY         X²          Y²
70.8 42.5 3009 5012.64 1806.25
66.2 40.2 2661.24 4382.44 1616.04
71.7 44.4 3183.48 5140.89 1971.36
68.7 42.8 2940.36 4719.69 1831.84
67.6 40 2704 4569.76 1600
69.2 47.3 3273.16 4788.64 2237.29
66.5 43.4 2886.1 4422.25 1883.56
67.2 40.1 2694.72 4515.84 1608.01
68.3 42.1 2875.43 4664.89 1772.41
65.6 36 2361.6 4303.36 1296
Σx = 681.8   Σy = 418.8   Σxy = 28589.09   Σx² = 46520.4   Σy² = 17622.76

r = [10(28589.09) - (681.8)(418.8)] / √([10(46520.4) - 681.8²][10(17622.76) - 418.8²])
  = 353.06 / √((352.76)(834.16)) = 353.06 / 542.4558 ≈ 0.651
Exercise: Calculate the coefficient of correlation for the vehicle weight and miles per gallon
data sets. The table of variables is given below:
Weight X   MPG Y   XY        X²          Y²
3.545 30 106.35 12.567025 900
2.6 32 83.2 6.76 1024
3.245 30 97.35 10.530025 900
3.93 24 94.32 15.4449 576
3.995 26 103.87 15.960025 676
3.115 30 93.45 9.703225 900
3.235 33 106.755 10.465225 1089
3.225 27 87.075 10.400625 729
2.44 37 90.28 5.9536 1369
3.24 32 103.68 10.4976 1024
2.29 37 84.73 5.2441 1369
2.5 34 85 6.25 1156
4.02 26 104.52 16.1604 676
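If you want to check your hand computation, a short Python sketch (assuming numpy is available) computes the same coefficient directly from the two columns:

import numpy as np

weight = [3.545, 2.6, 3.245, 3.93, 3.995, 3.115, 3.235, 3.225, 2.44, 3.24, 2.29, 2.5, 4.02]
mpg = [30, 32, 30, 24, 26, 30, 33, 27, 37, 32, 37, 34, 26]

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(weight, mpg)[0, 1]
print(round(r, 3))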
If a pair of variables has a significant linear correlation, then the relationship between the data
values can be roughly approximated by a linear equation. The process of finding the linear
equation which best fits the data values is known as linear regression and the line of best fit is
called the regression line.
It is a fact of linear algebra and analysis that the least squares line of best fit to a set of data
values has an equation of the form ŷ = mx + b, where

m = [nΣxy - (Σx)(Σy)] / [nΣx² - (Σx)²]   and   b = (Σy - mΣx) / n = ȳ - mx̄
Example: For the vehicle weight vs. highway mileage data set, we have n = 13, Σx = 41.38,
Σy = 398, Σxy = 1240.58 and Σx² = 135.937, so

m = [13(1240.58) - (41.38)(398)] / [13(135.937) - (41.38)²] = -341.70 / 54.88 ≈ -6.23

and

b = [398 - (-6.23)(41.38)] / 13 = 655.797 / 13 ≈ 50.45,

so our regression line is given by the equation ŷ = -6.23x + 50.45. The graph of this line is
shown on the scatter diagram for the data set below.
[Scatter diagram: Vehicle Weight vs. MPG Highway, with weight (1000 lbs), 2 to 4.5, on the horizontal axis and MPG Highway, 20 to 40, on the vertical axis, showing the fitted regression line.]
A line of best fit can be roughly determined using an eyeball method by drawing a straight line
on a scatter plot so that the number of points above the line and below the line is about equal
(and the line passes through as many points as possible).
A more accurate way of finding the line of best fit is the least squares method.
Use the following steps to find the equation of the line of best fit for a set of ordered pairs.
Step 1: Calculate the mean of the x-values and the mean of the y-values; call them x̄ and ȳ, the
means of the x- and y-coordinates of the data points respectively.
Step 2: For each data point, calculate the deviations x - x̄ and y - ȳ.
Step 3: Calculate the products (x - x̄)(y - ȳ) and the squares (x - x̄)², and sum each of them.
Step 4: Calculate the slope m = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² (equivalent to the formula for m above).
Step 5: Calculate the y-intercept b = ȳ - mx̄.
Step 6: Use the slope and the y-intercept to form the equation of the line, ŷ = mx + b.
Example:
Use the least squares method to determine the equation of the line of best fit for a given set of
ordered pairs, and then plot the line.
Solution:
First, calculate the mean of the x-values and that of the y-values, and then follow Steps 2 to 6 above.
The primary use for the regression equation is to predict values for one variable given a value for
the other variable.
Example: Using our regression equation for the car data, we could estimate that a car that
weighed 3000 lbs. (x = 3) would have a highway mpg of ŷ = -6.23(3) + 50.45 = 31.76.
Likewise, if we knew a car's highway mpg was 36 mpg, then we would estimate its weight by
solving 36 = -6.23x + 50.45 to get x ≈ 2.319, or a car that weighs about 2319 lbs.
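To close, a tiny Python sketch of both directions of prediction with this fitted line (the coefficients -6.23 and 50.45 are the ones derived above):

M, B = -6.23, 50.45   # slope and intercept of the fitted line y-hat = -6.23x + 50.45

def predict_mpg(weight_thousands):
    """Predict highway mpg from weight in thousands of pounds."""
    return M * weight_thousands + B

def predict_weight(mpg):
    """Invert the line to estimate weight (in thousands of pounds) from mpg."""
    return (mpg - B) / M

print(predict_mpg(3))      # about 31.76 mpg for a 3000 lb car
print(predict_weight(36))  # about 2.319, i.e. roughly 2319 lbs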