11august2010 - Correlation and Regression
Objectives:
At the end of the chapter, given a set of data, the students should be able to:
1. Make a scatter plot;
2. Find the correlation coefficient;
3. Test the significance of r for a given level of significance;
4. Find the coefficients of determination and non-determination; and
5. Find the equation for the regression line.
CORRELATION ANALYSIS
Simple Correlation
In simple correlation, only two variables are studied at once: the independent variable and the
dependent variable. The independent variable, x, is the variable that can be controlled or picked.
The dependent variable, y, is the variable that is assumed to depend on the other variable. The
independent variable is used to predict the dependent variable if there is a correlation between
the two variables.
One way to determine the type of linear correlation between two variables is by means of a scatter
plot. The scatter plot is a graph with the independent variable at the bottom (along the x-axis) and
the dependent variable along the side (the y-axis). For each pair of numbers (x, y), we plot a point, but the
points are not connected with a line.
The scatter plot shows if there is a linear correlation between two variables. We can then
determine the type of linear correlation as follows:
1. Positive Linear Correlation - general trend in the plotted points is from bottom left to top right.
2. Negative Linear Correlation - general trend in the plotted points is from top left to bottom right.
3. No Linear Correlation - No general trend in plotted points, or a non-linear trend.
The strength of the linear correlation can be judged by looking at how closely the points approximate
a straight line.
Example 1: The following table shows the Height (X) vs. Weight (Y) measurements (both in inches) for
10 men:
Prepared by MJDP Page - 1 -
x 70.8 66.2 71.7 68.7 67.6 69.2 66.5 67.2 68.3 65.6
y 42.5 40.2 44.4 42.8 40.0 47.3 43.4 40.1 42.1 36.0
Interpretation: The scatter plot (made in Excel) shows a positive linear correlation between the
variables.
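Where Excel is not at hand, a rough scatter plot can be drawn as text. The sketch below is only an illustration using the Example 1 data: the function name `text_scatter` and the grid size are assumptions made here, and it assumes the x values (and the y values) are not all equal.

```python
def text_scatter(xs, ys, width=40, height=12):
    """Return a list of text rows that plot (x, y) pairs as '*' marks.

    The first row is the top of the plot (largest y values).
    Assumes xs and ys each contain at least two distinct values.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a column or row index inside the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # flip so larger y is drawn higher
    return ["|" + "".join(r) for r in grid] + ["+" + "-" * width]


# Data from Example 1
x = [70.8, 66.2, 71.7, 68.7, 67.6, 69.2, 66.5, 67.2, 68.3, 65.6]
y = [42.5, 40.2, 44.4, 42.8, 40.0, 47.3, 43.4, 40.1, 42.1, 36.0]

plot_rows = text_scatter(x, y)
print("\n".join(plot_rows))
```

The points are not joined by a line; the general drift of the marks from bottom left to top right is what suggests a positive linear correlation.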
Example 2: The following table gives the resale value of a car bought in 1970 at Php200,000.00.
x (year) 1970 1973 1976 1979 1982 1985 1988 1991 1994 1997
y (Php '000) 200 150 145 135 120 100 79 65 54 35
Interpretation: The diagram (Excel) indicates a negative linear correlation between the variables.
Example 3: Below are the scores in an examination. Make a scatter plot and interpret the
data.
Test scores
Mid-Term Final
73 70
86 80
93 96
92 85
72 68
65 68
58 62
75 78
[Figure: scatter plot of Final Term Score (vertical axis, 50 to 100) against Mid-Term Score (horizontal axis, 50 to 100)]
Interpretation: There is a fairly strong positive correlation between the scores in the mid-term examination and the
final examination.
Coefficient of Correlation
A more precise method of determining the type and strength of a linear correlation is to calculate
the coefficient of linear correlation for the two variables using the formula:
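$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$$
where n is the number of (x, y) pairs.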
The coefficient of linear correlation will always be a number between -1.00 and 1.00, with a
positive value indicating a positive correlation and a negative value a negative correlation. A coefficient
of $r = 1$ for a data set indicates perfect positive linear correlation, $r = -1$ indicates perfect
negative linear correlation, while $r = 0$ would indicate no linear correlation. The closer the value of r is
to $\pm 1$, the stronger the correlation, and the closer to zero, the weaker the correlation.
The coefficient of correlation between two variables is most easily calculated by constructing a
table (see example below).
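The table method amounts to accumulating five sums and substituting them into the formula for r. The following is a minimal sketch of that calculation using the test scores of Example 3; the function name `correlation_coefficient` is chosen here for illustration.

```python
from math import sqrt


def correlation_coefficient(xs, ys):
    """Coefficient of linear correlation r from the computational (sums) formula."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Mid-term and final scores from Example 3
midterm = [73, 86, 93, 92, 72, 65, 58, 75]
final = [70, 80, 96, 85, 68, 68, 62, 78]

r = correlation_coefficient(midterm, final)
print(round(r, 3))  # 0.933
```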
Coefficient of Determination
The coefficient of determination tells us how much of the variation in the dependent variable is
explained by the independent variable. The coefficient of determination is $r^2$ and is usually expressed as a
percent.
On the other hand, the coefficient of non-determination, $1 - r^2$, is the proportion of the variation in the dependent
variable that is not explained by the independent variable.
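Example 4: Using the test scores in Example 3 (n = 8), we have Σx = 614, Σy = 607, Σx² = 48236, Σy² = 46917 and Σxy = 47500, so
$$r = \frac{8(47500) - (614)(607)}{\sqrt{8(48236) - (614)^2}\,\sqrt{8(46917) - (607)^2}} = \frac{7302}{\sqrt{8892}\,\sqrt{6887}} \approx 0.933$$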
a) Test of significance: From Table 9 at the 0.05 level of significance, the table value is 0.707. Note that r = 0.933 is not between -0.707
and 0.707, that is, |r| > 0.707. Interpretation: There is a significant correlation, and r = 0.933 indicates a very strong
positive correlation.
b) Coefficient of determination: $r^2 = (0.933)^2 \approx 0.87$. Interpretation: 87% of the variation in the
final grades can be explained by the variation in the mid-term grades.
c) Coefficient of non-determination: $1 - r^2 = 1 - 0.87 = 0.13$. Interpretation: 13% of the
variation in the final grades cannot be explained by the variation in the mid-term grades.
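The three parts above can be checked numerically. Table 9 is not reproduced here, so the sketch below assumes the standard relationship between r and Student's t with n - 2 degrees of freedom, and hard-codes the two-tailed 0.05 critical value t = 2.447 for 6 degrees of freedom as taken from a standard t table. The function names are illustrative only.

```python
from math import sqrt


def t_statistic_for_r(r, n):
    """t statistic for testing whether r differs from zero (n - 2 degrees of freedom)."""
    return r * sqrt((n - 2) / (1 - r ** 2))


def critical_r(t_critical, n):
    """Critical value of r that corresponds to a critical t value."""
    df = n - 2
    return t_critical / sqrt(t_critical ** 2 + df)


n = 8
r = 0.933
T_CRIT = 2.447  # assumption: two-tailed 0.05 critical t for 6 df, from a standard t table

t = t_statistic_for_r(r, n)
r_crit = critical_r(T_CRIT, n)
significant = abs(r) > r_crit
print(round(t, 2), round(r_crit, 3), significant)  # 6.35 0.707 True

determination = r ** 2               # proportion of variation explained
non_determination = 1 - r ** 2       # proportion of variation not explained
print(round(determination, 2), round(non_determination, 2))  # 0.87 0.13
```

Under these assumptions the critical r agrees with the table value 0.707 used in part a).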
Example 5: Calculate the coefficient of correlation between vehicle weight (x) and mileage
in miles per gallon (y). The table of variables is given below:
n x y x² y² xy
1 3.55 30 12.60 900.00 106.50
2 2.60 32 6.76 1,024.00 83.20
3 3.25 30 10.56 900.00 97.50
4 3.93 24 15.44 576.00 94.32
5 4.00 26 16.00 676.00 104.00
6 3.12 30 9.73 900.00 93.60
7 3.24 33 10.50 1,089.00 106.92
8 3.23 27 10.43 729.00 87.21
9 2.44 37 5.95 1,369.00 90.28
10 3.24 32 10.50 1,024.00 103.68
11 2.29 37 5.24 1,369.00 84.73
12 2.50 34 6.25 1,156.00 85.00
13 4.02 26 16.16 676.00 104.52
Σ 41.41 398 136.14 12,388.00 1,241.46
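$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$$
$$r = \frac{13(1241.46) - (41.41)(398)}{\sqrt{13(136.14) - (41.41)^2}\,\sqrt{13(12388) - (398)^2}} = \frac{-342.20}{\sqrt{55.03}\,\sqrt{2640}} \approx -0.90$$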
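As a check on Example 5, the column totals from the table can be substituted directly into the formula for r. This is a small sketch that uses only the totals in the Σ row; the variable names are chosen here for illustration.

```python
from math import sqrt

# Column totals from the table in Example 5
n = 13
sum_x, sum_y = 41.41, 398
sum_x2, sum_y2, sum_xy = 136.14, 12388.00, 1241.46

numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
r = numerator / denominator
print(round(r, 3))  # -0.898
```

The negative sign indicates that heavier vehicles tend to travel fewer miles per gallon.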
Linear Regression
If a pair of variables has a significant linear correlation, then the relationship between the data
values can be roughly approximated by a linear equation. The regression line is the equation of the line
that best fits the points of the scatter plot.
The process of finding the linear equation which best fits the data values is known as linear
regression and the line of best fit is called the regression line.
It is a fact of linear algebra and analysis that the least squares line of best fit to a set of data values
has an equation of the form $y' = a + bx$, where $a$ is the y-intercept and $b$ is the slope.
$y'$ will be the predicted value of y for any given x value. If the scatter plot indicates a line that is going
up, the slope will be positive and r will be positive. If the scatter plot indicates a line that is going
down, the slope will be negative and r will be negative.
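To solve for the regression equation $y' = a + bx$, we use the following:
$$a = \frac{\sum y \sum x^2 - \sum x \sum xy}{n\sum x^2 - \left(\sum x\right)^2} \qquad b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$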
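A minimal sketch of the calculation, assuming the regression line is written y' = a + bx with a the y-intercept and b the slope, and using the usual least-squares formulas built from the same sums as r. The data are the test scores of Example 3, and the function name `regression_line` is illustrative.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least-squares line y' = a + b*x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denominator = n * sum_x2 - sum_x ** 2
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator  # y-intercept
    b = (n * sum_xy - sum_x * sum_y) / denominator       # slope
    return a, b


# Mid-term and final scores from Example 3
midterm = [73, 86, 93, 92, 72, 65, 58, 75]
final = [70, 80, 96, 85, 68, 68, 62, 78]

a, b = regression_line(midterm, final)
print(round(a, 2), round(b, 2))  # 12.85 0.82
```

The slope is positive, which agrees with the positive correlation seen in the scatter plot of these scores.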
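The standard error of estimate, $s_{est}$, is the standard deviation of the observed y values about the
predicted value y', or an average of how much error there will be in each predicted value. This can be
computed using any of the following:
$$s_{est} = \sqrt{\frac{\sum (y - y')^2}{n - 2}} \quad\text{or}\quad s_{est} = \sqrt{\frac{\sum y^2 - a\sum y - b\sum xy}{n - 2}}$$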
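The two forms can be compared numerically. The sketch below assumes the line y' = a + bx, the definition form (sum of squared residuals divided by n - 2) and the shortcut form (Σy² - aΣy - bΣxy, divided by n - 2). It uses the test scores of Example 3, and the function names are illustrative.

```python
from math import sqrt


def standard_error(xs, ys, a, b):
    """Definition form: square root of sum((y - y')^2) / (n - 2)."""
    n = len(xs)
    residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return sqrt(residual_ss / (n - 2))


def standard_error_shortcut(xs, ys, a, b):
    """Shortcut form: square root of (sum(y^2) - a*sum(y) - b*sum(xy)) / (n - 2)."""
    n = len(xs)
    sum_y = sum(ys)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return sqrt((sum_y2 - a * sum_y - b * sum_xy) / (n - 2))


midterm = [73, 86, 93, 92, 72, 65, 58, 75]
final = [70, 80, 96, 85, 68, 68, 62, 78]

# Unrounded least-squares coefficients for these data
a_exact = 114252 / 8892
b_exact = 7302 / 8892

s_definition = standard_error(midterm, final, a_exact, b_exact)
s_shortcut = standard_error_shortcut(midterm, final, a_exact, b_exact)
s_rounded = standard_error_shortcut(midterm, final, 12.85, 0.82)
print(round(s_definition, 2), round(s_shortcut, 2), round(s_rounded, 2))  # 4.31 4.31 5.28
```

The two forms agree when a and b are carried at full precision. The shortcut form subtracts large, nearly equal quantities, so rounding a and b to two decimal places changes the result noticeably.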
A confidence interval for a predicted y can be found using the standard error of estimate if the
sample size is larger than 100. Instead of predicting a single value for y, you could be more confident in
saying that y would be between $y' - 1.96\,s_{est}$ and $y' + 1.96\,s_{est}$ for a given x. That is:
A 95% confidence interval, if $n > 100$, where $y'$ is the value obtained from $y' = a + bx$ for the
given x: $y' - 1.96\,s_{est} < y < y' + 1.96\,s_{est}$.
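A minimal sketch of the interval calculation for a large sample. It assumes the line y' = a + bx and takes the large-sample normal multiplier z = 1.96 for 95% confidence; that multiplier is an assumption made here. The coefficients, standard error and x value in the usage lines are hypothetical numbers chosen only to show the arithmetic, and the function name is illustrative.

```python
def approximate_interval(a, b, s_est, x, z=1.96):
    """Approximate interval y' - z*s_est < y < y' + z*s_est for a large sample.

    z = 1.96 is the usual normal multiplier for 95% confidence (assumed here).
    """
    y_predicted = a + b * x
    margin = z * s_est
    return y_predicted - margin, y_predicted + margin


# Hypothetical values, for illustration only (not taken from the examples)
low, high = approximate_interval(a=10.0, b=0.5, s_est=2.0, x=40)
print(round(low, 2), round(high, 2))  # 26.08 33.92
```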
Example 6. Find the equation for the regression line and the standard error of estimate in
Example no. 4.
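Solution: a) For the regression equation $y' = a + bx$, we use the following, with n = 8, Σx = 614, Σy = 607, Σx² = 48236, Σy² = 46917 and Σxy = 47500 from the test scores:
$$a = \frac{(607)(48236) - (614)(47500)}{8(48236) - (614)^2} = \frac{114252}{8892} \approx 12.85$$
$$b = \frac{8(47500) - (614)(607)}{8(48236) - (614)^2} = \frac{7302}{8892} \approx 0.82$$
so the regression line is $y' = 12.85 + 0.82x$.
b) For the standard error of estimate:
$$s_{est} = \sqrt{\frac{46917 - 12.85(607) - 0.82(47500)}{8 - 2}} = \sqrt{\frac{167.05}{6}} \approx 5.28$$
Interpretation: Each predicted final grade will have an error of about 5.28 points. Note that this formula is sensitive to rounding: carrying a and b to more decimal places (a ≈ 12.8489, b ≈ 0.8212) gives $s_{est} \approx 4.31$.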
Example 7. For a sample of 1 , the regression line is and the standard error of
estimate is . Find the confidence interval for .
Solution: Find the predicted value for y: , at , we get
Multiple Regression
Multiple regression is used when there is one dependent variable and two or more independent
variables.
The following are the conditions in using multiple regression:
1. The dependent variable y must be normally distributed;
2. The variances of y must be the same for each value of the independent variable;
3. There must be a linear relationship between the dependent and each independent variable;
4. The independent variables must not be correlated with one another; and
5. The values of the dependent variable y must be independent of one another.
The general form of the multiple regression equation with k independent variables is:
$y' = a + b_1x_1 + b_2x_2 + \cdots + b_kx_k$.
The multiple correlation coefficient, R, will be between 0 and 1. A value close to 0 indicates a weak relationship
and a value close to 1 a strong relationship. R will always be at least as large as the absolute value of each
individual correlation coefficient between y and an independent variable.
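For two independent variables, the coefficients of y' = a + b1*x1 + b2*x2 can be found by solving the least-squares normal equations. The sketch below is one way to do this with plain Python and is not taken from the handout. The data are made up for illustration (they satisfy y = 1 + 2*x1 + 3*x2 exactly, with x1 and x2 uncorrelated), and the function names are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * unknowns = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Move the row with the largest entry in this column into the pivot position.
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def multiple_regression(x1, x2, y):
    """Return (a, b1, b2) for y' = a + b1*x1 + b2*x2 fitted by least squares."""
    rows = [[1.0, u, v] for u, v in zip(x1, x2)]
    # Normal equations: (X^T X) * coefficients = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(3)]
    return tuple(solve_linear_system(xtx, xty))


# Made-up data for illustration only
x1 = [1, 1, 2, 2, 3, 3]
x2 = [1, 2, 1, 2, 1, 2]
y = [6, 9, 8, 11, 10, 13]

a, b1, b2 = multiple_regression(x1, x2, y)
print(round(a, 6), round(b1, 6), round(b2, 6))
```

With real data the fit will not be exact, and the conditions listed above should be checked before the equation is used for prediction.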
Worksheet no. 11
Table 9