11august2010 - Correlation and Regression

Uploaded by

JOCELYN CAMACHO

11. CORRELATION AND REGRESSION

Objectives:
At the end of the chapter, given a set of data, the students should be able to:
1. Make a scatter plot;
2. Find the correlation coefficient;
3. Test the significance of r for a given level of significance;
4. Find the coefficients of determination and non-determination; and
5. Find the equation for the regression line.

CORRELATION ANALYSIS

Correlation analysis is used to determine whether there is a relationship, or correlation, between two variables and to measure the strength of that relationship.
A correlation is a relationship between two statistical variables measured from the same population. In this chapter, we will consider only linear correlation, for which there are three possible outcomes: positive linear correlation, negative linear correlation and no linear correlation.
A Positive Linear Correlation indicates that high values of one variable tend to correspond to high values of the second variable or, simply, as one variable increases, so does the other. For example, height vs. weight for adults (for a normal individual, as the height increases, the weight also increases).
A Negative Linear Correlation indicates that high values of one variable tend to correspond to low values of the second variable; that is, as one variable increases, the other decreases. For instance, the age of a vehicle and its resale price (as the vehicle gets older, the resale price becomes lower).
No Linear Correlation means that there is either no relationship between the variables or a non-linear relationship. For example, height and number of years of education (a person's height has no bearing on the number of years he or she has been in school).
Regression analysis is used to determine what type of relationship exists and to make predictions using that relationship.

Simple Correlation
In simple correlation, only two variables are studied at once. The two variables are the independent and dependent variable. The independent variable, x, is the variable that can be controlled or picked. The dependent variable, y, is the variable that you assume to depend on the other variable. The independent variable is used to predict the dependent variable if there is a correlation between the two variables.
One way to determine the type of linear correlation between two variables is by means of a scatter plot. The scatter plot is a graph with the independent variable at the bottom (or along the x-axis) and the dependent variable along the side (or along the y-axis). For each pair of numbers, we plot a point, but the points are not connected with a line.
The scatter plot shows if there is a linear correlation between two variables. We can then
determine the type of linear correlation as follows:
1. Positive Linear Correlation - general trend in the plotted points is from bottom left to top right.
2. Negative Linear Correlation - general trend in the plotted points is from top left to bottom right.
3. No Linear Correlation - No general trend in plotted points, or a non-linear trend.
The strength of the linear correlation can be judged by looking at how closely the points approximate
a straight line.

Example 1: The following table shows the Height (X) vs. Weight (Y) measurements (both in inches) for
10 men:
Prepared by MJDP Page - 1 -
x 70.8 66.2 71.7 68.7 67.6 69.2 66.5 67.2 68.3 65.6
y 42.5 40.2 44.4 42.8 40.0 47.3 43.4 40.1 42.1 36.0

Interpretation: The scatter plot (made in Excel) shows a positive linear correlation between the variables.

Example 2: The following table gives the resale value of a car bought in 1970 at Php200,000.00.
x (year)     1970  1973  1976  1979  1982  1985  1988  1991  1994  1997
y (Php 000)  200   150   145   135   120   100   79    65    54    35

Interpretation: The diagram (Excel) indicates a negative linear correlation between the variables.

Example 3. Below are the scores in an examination. Make a scatter plot and interpret the data.
Mid-Term   Final
73         70
86         80
93         96
92         85
72         68
65         68
58         62
75         78
[Scatter plot "Test scores": Mid-Term Score on the horizontal axis and Final Term Score on the vertical axis, both scaled from 50 to 100]
Interpretation: There is a fairly strong positive correlation between scores in the mid-term examination and the final examination.

Coefficient of Correlation
A more precise method of determining the type and strength of a linear correlation is to calculate
the coefficient of linear correlation for the two variables using the formula:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$$

The coefficient of linear correlation will always be a number between -1.00 and 1.00, with a positive value indicating a positive correlation and a negative value a negative correlation. A coefficient of $r = 1$ for a data set indicates perfect positive linear correlation, and $r = -1$ indicates perfect negative linear correlation, while $r = 0$ would indicate no linear correlation. The closer the value of $r$ is to $\pm 1$, the stronger the correlation, and the closer to zero, the weaker the correlation.
The coefficient of correlation between two variables is most easily calculated by constructing a
table (see example below).
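
The table method amounts to accumulating five sums and substituting them into the formula for r. A minimal Python sketch of that arithmetic follows; the function name `pearson_r` is chosen here for illustration and is not part of the original notes. It is applied to the mid-term and final scores of Example 3.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Coefficient of linear correlation from the raw sums used in the table method."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

# Mid-term and final scores from Example 3
midterm = [73, 86, 93, 92, 72, 65, 58, 75]
final = [70, 80, 96, 85, 68, 68, 62, 78]

r = pearson_r(midterm, final)
print(round(r, 3))  # 0.933
```

The sketch assumes both lists have the same length and that neither variable is constant, since a constant variable would make the denominator zero.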

Coefficient of Determination
The coefficient of determination tells us how much of the variation in the dependent variable is explained by the independent variable. The coefficient of determination is $r^2$ and is usually expressed as a percent.
On the other hand, the coefficient of non-determination, $1 - r^2$, is the proportion of the variation in the dependent variable that is not explained by the independent variable.

Testing the Coefficient of Correlation


The coefficient of correlation can be tested for significance using a t-test and following the procedures for hypothesis testing, or by comparing $r$ to a value in Table 9. The null hypothesis is that there is no correlation, or that $\rho = 0$. The alternative hypothesis is that there is a correlation, $\rho \neq 0$.
Summarizing, therefore, to test $r$,
1. Find the critical value from Table 9 for the sample size $n$ and the desired level of significance.
2. If $r$ is between the negative and the positive table value, accept the null hypothesis. There may not be a correlation.
3. If $r$ is smaller than the negative table value or greater than the positive table value, reject the null hypothesis. There is a correlation.
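
The decision rule can be sketched in a few lines of Python. The dictionary below holds only three 0.05-level entries copied from Table 9, keyed by sample size; the function names are hypothetical. The t statistic shown is the standard one for testing a zero population correlation, with n - 2 degrees of freedom, and is included on the assumption that this is the test the text refers to.

```python
from math import sqrt

# A few 0.05-level critical values copied from Table 9, keyed by sample size n
TABLE_9_AT_05 = {8: 0.707, 10: 0.632, 13: 0.553}

def is_significant(r, n, table=TABLE_9_AT_05):
    """Reject the null hypothesis when r falls outside +/- the table value."""
    critical = table[n]
    return abs(r) > critical

def t_statistic(r, n):
    """Standard test statistic for a zero population correlation (n - 2 df)."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

print(is_significant(0.933, 8))   # True: reject the null hypothesis
print(is_significant(0.50, 10))   # False: r lies between -0.632 and 0.632
print(t_statistic(0.933, 8))      # compare with a t table at 6 degrees of freedom
```

Comparing r with the table value and comparing t with a critical t value are two routes to the same decision, so either can be used as a check on the other.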

Example 4. Using the data in Example 3, we have


n    Mid-Term (x)   Final Term (y)   x²       y²       xy
1    73             70               5,329    4,900    5,110
2    86             80               7,396    6,400    6,880
3    93             96               8,649    9,216    8,928
4    92             85               8,464    7,225    7,820
5    72             68               5,184    4,624    4,896
6    65             68               4,225    4,624    4,420
7    58             62               3,364    3,844    3,596
8    75             78               5,625    6,084    5,850
Σ    614            607              48,236   46,917   47,500

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} = \frac{8(47500) - (614)(607)}{\sqrt{8(48236) - (614)^2}\,\sqrt{8(46917) - (607)^2}} = \frac{7302}{\sqrt{8892}\,\sqrt{6887}} \approx 0.933$$

a) Test $r$: In Table 9 at the 0.05 level with $n = 8$, the table value is 0.707. Note that 0.933 is not between -0.707 and 0.707, or $0.933 > 0.707$. Interpretation: We reject the null hypothesis, so there is a correlation, and 0.933 is a very strong positive correlation.
b) Coefficient of determination: $r^2 = (0.933)^2 \approx 0.87$. Interpretation: 87% of the variation in the final grades can be explained by the variation in the mid-term grades.
c) Coefficient of non-determination: $1 - r^2 = 1 - 0.87 = 0.13$. Interpretation: 13% of the variation in the final grades cannot be explained by the variation in the mid-term grades.

Example 5: Calculate the coefficient of correlation for the vehicle weight (x) and distance travelled in miles per gallon (y) data sets. The table of values is given below:
n    x       y     x²       y²          xy
1    3.55    30    12.60    900.00      106.50
2    2.60    32    6.76     1,024.00    83.20
3    3.25    30    10.56    900.00      97.50
4    3.93    24    15.44    576.00      94.32
5    4.00    26    16.00    676.00      104.00
6    3.12    30    9.73     900.00      93.60
7    3.24    33    10.50    1,089.00    106.92
8    3.23    27    10.43    729.00      87.21
9    2.44    37    5.95     1,369.00    90.28
10   3.24    32    10.50    1,024.00    103.68
11   2.29    37    5.24     1,369.00    84.73
12   2.50    34    6.25     1,156.00    85.00
13   4.02    26    16.16    676.00      104.52
Σ    41.41   398   136.14   12,388.00   1,241.46

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} = \frac{13(1241.46) - (41.41)(398)}{\sqrt{13(136.14) - (41.41)^2}\,\sqrt{13(12388) - (398)^2}} = \frac{-342.20}{\sqrt{55.03}\,\sqrt{2640}} \approx -0.90$$

a) Test $r$: In Table 9 at the 0.05 level with $n = 13$, the table value is 0.553. Since $-0.90 < -0.553$, $r$ is not between -0.553 and 0.553. Interpretation: There is a very strong negative correlation between vehicle weight and distance travelled in miles per gallon. As the weight of the vehicle increases, the travel distance per gallon of gasoline decreases.
b) Coefficient of determination: $r^2 = (-0.90)^2 = 0.81$. Interpretation: 81% of the variation in the distance travelled in miles per gallon can be explained by the variation in the weight of the vehicle.
c) Coefficient of non-determination: $1 - r^2 = 1 - 0.81 = 0.19$. Interpretation: 19% of the variation in the distance travelled in miles per gallon cannot be explained by the variation in the weight of the vehicle.
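
As a check on the arithmetic in Example 5, the column totals from the table can be substituted directly into the correlation formula. The short Python sketch below does only that; the variable names are chosen here for illustration.

```python
from math import sqrt

# Column totals from the Example 5 table (n = 13 vehicles)
n = 13
sum_x, sum_y = 41.41, 398
sum_x2, sum_y2, sum_xy = 136.14, 12388.00, 1241.46

r = (n * sum_xy - sum_x * sum_y) / (
    sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
)
r_squared = r ** 2

print(round(r, 2))              # -0.9
print(round(r_squared, 2))      # 0.81
print(round(1 - r_squared, 2))  # 0.19
print(abs(r) > 0.553)           # True: significant at the 0.05 level
```

The negative sign of r matches the downward trend described in the interpretation, and the squared value reproduces the 81% and 19% figures.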



REGRESSION ANALYSIS

Linear Regression
If a pair of variables has a significant linear correlation, then the relationship between the data values can be roughly approximated by a linear equation. The process of finding the linear equation which best fits the data values is known as linear regression, and the line that best fits the points of the scatter plot is called the regression line.
It is a fact of linear algebra and analysis that the least squares line of best fit to a set of data values has an equation of the form $y' = a + bx$, where $a$ is the y-intercept and $b$ is the slope. $y'$ will be the predicted value of $y$ for any given $x$ value. If the scatter plot indicates a line that is going up, the slope will be positive and $r$ will be positive. If the scatter plot indicates a line that is going down, the slope will be negative and $r$ will be negative.
To solve for the regression equation $y' = a + bx$, we use the following:

$$a = \frac{\sum y \sum x^2 - \sum x \sum xy}{n\sum x^2 - \left(\sum x\right)^2} \qquad b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$

The standard error of estimate, $s_{est}$, is the standard deviation of the observed $y$ values about the predicted value $y'$, or an average of how much error there will be in each predicted $y'$. This can be computed using any of the following:

$$s_{est} = \sqrt{\frac{\sum (y - y')^2}{n - 2}} \quad \text{or} \quad s_{est} = \sqrt{\frac{\sum y^2 - a\sum y - b\sum xy}{n - 2}}$$

A confidence interval for a predicted $y$ can be found using the standard error of estimate if the sample size is larger than 100. Instead of predicting a single value for $y$, you could be more confident in saying that $y$ would be between two limits for a given $x$. That is:
A 95% confidence interval, if $n > 100$, where $y'$ is the value obtained from the regression equation for the given $x$:

$$y' - 1.96\,s_{est} < y < y' + 1.96\,s_{est}$$
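
The regression coefficients come from the same sums used for the correlation coefficient. The Python sketch below assumes the line is written as y' = a + b*x, with a the intercept and b the slope, and that the large-sample 95% interval uses the normal multiplier 1.96; both are assumptions about the notation intended here, and the function names are hypothetical. The interval call uses made-up numbers purely to show the arithmetic.

```python
def fit_regression_line(xs, ys):
    """Return (a, b) for the least squares line y' = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denominator = n * sum_x2 - sum_x ** 2
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
    b = (n * sum_xy - sum_x * sum_y) / denominator
    return a, b

def interval_95(y_pred, s_est):
    """Large-sample (n > 100) 95% interval: y' +/- 1.96 * s_est (assumed multiplier)."""
    margin = 1.96 * s_est
    return y_pred - margin, y_pred + margin

# Mid-term and final scores from Example 3
midterm = [73, 86, 93, 92, 72, 65, 58, 75]
final = [70, 80, 96, 85, 68, 68, 62, 78]
a, b = fit_regression_line(midterm, final)
print(round(a, 2), round(b, 2))  # 12.85 0.82

# Hypothetical predicted value and standard error, only to show the interval
low, high = interval_95(y_pred=50.0, s_est=2.0)
print(round(low, 2), round(high, 2))  # 46.08 53.92
```

A positive b here agrees with the positive r found for the same data, as the paragraph on slope and sign describes.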

Example 6. Find the equation for the regression line and the standard error of estimate in
Example no. 4.
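Solution: a) For the regression equation $y' = a + bx$, we use the following:

$$a = \frac{\sum y \sum x^2 - \sum x \sum xy}{n\sum x^2 - \left(\sum x\right)^2} = \frac{(607)(48236) - (614)(47500)}{8(48236) - (614)^2} = \frac{114252}{8892} \approx 12.85$$

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} = \frac{8(47500) - (614)(607)}{8(48236) - (614)^2} = \frac{7302}{8892} \approx 0.82$$

The regression line is therefore $y' = 12.85 + 0.82x$.

b) The standard error of estimate is:

$$s_{est} = \sqrt{\frac{\sum y^2 - a\sum y - b\sum xy}{n - 2}} = \sqrt{\frac{46917 - a(607) - b(47500)}{8 - 2}} \approx \sqrt{\frac{111.34}{6}} \approx 4.31$$

Here $a$ and $b$ are carried unrounded ($a = 114252/8892$, $b = 7302/8892$). This formula subtracts nearly equal quantities, so substituting the rounded values $a = 12.85$ and $b = 0.82$ gives about 5.28 instead.
Interpretation: Each predicted final grade will have an error of about 4.31 points.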
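
The standard error of estimate for the Example 4 data can be computed both from its definition (residuals about the fitted line) and from the shortcut formula. The Python sketch below does both, assuming the line has the form y' = a + b*x; the function name is chosen for illustration. It also shows that the shortcut is sensitive to rounding: with unrounded coefficients both routes give about 4.31, while coefficients rounded to two decimals (12.85 and 0.82) give about 5.28.

```python
from math import sqrt

# Mid-term (x) and final (y) scores from Example 4
midterm = [73, 86, 93, 92, 72, 65, 58, 75]
final = [70, 80, 96, 85, 68, 68, 62, 78]
n = len(midterm)

sum_x, sum_y = sum(midterm), sum(final)
sum_x2 = sum(x * x for x in midterm)
sum_y2 = sum(y * y for y in final)
sum_xy = sum(x * y for x, y in zip(midterm, final))

# Least squares intercept a and slope b for y' = a + b*x
denominator = n * sum_x2 - sum_x ** 2
a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
b = (n * sum_xy - sum_x * sum_y) / denominator

def s_est_shortcut(a, b):
    """Shortcut form: sqrt((sum y^2 - a*sum y - b*sum xy) / (n - 2))."""
    return sqrt((sum_y2 - a * sum_y - b * sum_xy) / (n - 2))

# Definition: spread of the observed y about the predicted y'
residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(midterm, final))
s_est_definition = sqrt(residual_ss / (n - 2))

print(round(s_est_definition, 2))                          # 4.31
print(round(s_est_shortcut(a, b), 2))                      # 4.31
print(round(s_est_shortcut(round(a, 2), round(b, 2)), 2))  # 5.28
```

Because the shortcut subtracts quantities that nearly cancel, keeping several extra decimals in a and b (or using the residual definition) is the safer route.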

Example 7. For a sample of 1 , the regression line is and the standard error of
estimate is . Find the confidence interval for .
Solution: Find the predicted value for y: , at , we get



Interpretation: We can be confident that for , will be between and .

Multiple Regression
Multiple regression is used when there is one dependent variable and two or more independent variables.
The following are the conditions in using multiple regression:
1. The dependent variable (y) must be normally distributed;
2. The variances for the y values must be the same for each value of the independent variable;
3. There must be a linear relationship between the dependent and each independent variable;
4. The independent variables must not be correlated; and
5. The values of the dependent variable must be independent of one another.
The general form of the multiple regression equation with $k$ independent variables is:

$$y' = a + b_1x_1 + b_2x_2 + \cdots + b_kx_k$$

The multiple correlation coefficient, $R$, will be between 0 and 1. A value close to 0 indicates a weak relationship and a value close to 1 a strong relationship. $R$ will always be at least as large as the absolute value of each individual correlation coefficient.
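
For the special case of two independent variables, the multiple correlation coefficient can be obtained from the three pairwise correlation coefficients using a standard textbook formula. The Python sketch below applies it; the function name and the three input values are hypothetical, chosen only to illustrate that R is at least as large as each individual coefficient.

```python
from math import sqrt

def multiple_r_two_predictors(r_y1, r_y2, r_12):
    """Multiple correlation R of y on x1 and x2 from the three pairwise r's."""
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return sqrt(r_squared)

# Hypothetical pairwise correlations, chosen only for illustration:
# r_y1 = corr(y, x1), r_y2 = corr(y, x2), r_12 = corr(x1, x2)
R = multiple_r_two_predictors(r_y1=0.6, r_y2=0.5, r_12=0.2)

print(round(R, 3))                    # 0.714
print(R >= max(abs(0.6), abs(0.5)))   # True
```

The formula breaks down when the two independent variables are perfectly correlated, since the denominator becomes zero, which is one reason the independent variables are required not to be correlated.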

Worksheet no. 11

1. Given the following data:


Number of absences 0 1 2 2 3 3 4 5 6
Final grade 96 91 78 83 75 62 70 68 56
a. Draw a scatter plot.
b. Find the correlation coefficient and test its significance at the 0.01 level.
c. Find and interpret the coefficient of determination and non-determination.
d. Find the regression equation and standard error of estimate.
2. For a sample of , the regression equation and the standard error of estimate is
. Find a 95% confidence interval for .
3. Using the data in Example 5,
a. Find and interpret the coefficient of determination and non-determination; and
b. Find the regression equation and standard error of estimate.

Table 9. Critical values of the correlation coefficient r, by sample size n and level of significance

 n    0.05   0.02   0.01       n    0.05   0.02   0.01
 3    0.997                    18   0.468  0.543  0.590
 4    0.950  0.980  0.990      19   0.456  0.529  0.575
 5    0.878  0.934  0.959      20   0.444  0.516  0.561
 6    0.811  0.882  0.917      21   0.433  0.503  0.549
 7    0.754  0.833  0.875      22   0.423  0.492  0.537
 8    0.707  0.789  0.834      27   0.381  0.445  0.487
 9    0.666  0.750  0.798      32   0.349  0.409  0.449
10    0.632  0.715  0.765      37   0.325  0.381  0.418
11    0.602  0.685  0.735      42   0.304  0.358  0.393
12    0.576  0.658  0.708      47   0.288  0.338  0.372
13    0.553  0.634  0.684      52   0.273  0.322  0.354
14    0.532  0.612  0.661      62   0.250  0.295  0.325
15    0.514  0.592  0.641      72   0.232  0.274  0.302
16    0.497  0.574  0.623      82   0.217  0.256  0.283
17    0.482  0.558  0.606      92   0.205  0.242  0.267
Source: This table was abridged from Table VI of R.A. Fisher and F. Yates, Statistical Tables for
Biological, Agricultural, and Medical Research, Longman Group Ltd. London.
