Correlation and Regression Analysis
There are many variables in this world which are related. The amount of rainfall is related to
the amount of production of agricultural products. The grade of a student in mathematics is
related to the number of hours the student spends studying. The amount of savings is
related to the amount of expenditures. In this unit, we shall learn how to describe the relationship
between two variables. We shall also learn how to predict the value of one variable, given the value
of another variable.
Lesson 1
Understanding Correlation Analysis
Why do most students who are good at Mathematics also perform well in Physics? Why
does blood pressure tend to rise with age? Why do students with high IQ have good academic performance?
These questions have something to do with relationships between variables. In this lesson, we shall
learn how to describe the relationship between two variables.
Learning Objective/s:
Definition
So far, we have analyzed data involving only a single variable, for instance, the grades of
students, the weights of grocery products, and the lengths of rods. Such data are called
univariate data because they involve a single variable only. In this lesson we shall analyze data
involving two variables. Data that involve two variables are called bivariate data.
The analysis of bivariate data involves describing the relationship between two variables. The
process or procedure of describing the relationship between two variables is called correlation
analysis.
Example
A company with six branches provides free coffee to its employees. A manager wants to
find out if there is a relationship between the number of cups of coffee provided and the number of
employees in the offices. The table below shows the data needed. Determine if there is a
relationship between the number of employees and the number of cups of coffee.
Notice that the points on the scatter plot do not lie on one line. However, the points closely
follow a straight line. This line is called a trend line.
The relationship between two variables is described in terms of strength and direction.
Positive Correlation
A positive correlation exists if high values in one variable are associated with high values in
another variable. Similarly, low values in one variable are associated with low values in the other
variable.
If a positive correlation exists, then the points on the scatter plot closely follow a straight line
slanting up to the right.
Negative Correlation
A negative correlation exists if high values in one variable are associated with low values in
another variable. Similarly, low values in one variable are associated with high values in the other
variable.
If a negative correlation exists, then the points on the scatter plot closely follow a straight line
slanting down to the right.
Zero Correlation
A zero correlation exists when high values in one variable are associated with either high or low
values in the other variable.
If a zero correlation exists, then the points on the scatter plot are randomly scattered. The points
do not closely follow a straight line.
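The three directions described above can also be told apart numerically, by the sign of the summed products of deviations from the two means. The following is a minimal sketch in Python; the function name `direction` and the sample data are hypothetical and are not part of the lesson.

```python
def direction(xs, ys):
    """Classify the direction of a linear relationship from paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Each product is positive when both values are on the same side of
    # their means (high with high, low with low) and negative otherwise.
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "zero"


# Hypothetical illustrative data (not taken from the lesson).
hours = [1, 2, 3, 4, 5]
grades = [70, 74, 79, 85, 90]
absences = [9, 7, 6, 3, 1]

print(direction(hours, grades))    # high hours go with high grades
print(direction(hours, absences))  # high hours go with low absences
```

With real measurements the sum is almost never exactly zero, so in practice a value close to zero (relative to the spread of the data) is what suggests little or no linear correlation.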
TYPES OF CORRELATION according to Strength
A perfect correlation exists when all the points on the scatter plot lie on a straight line. When
the points on the scatter plot do not lie on a straight line, the correlation may be very high, high,
moderately high, low, negligible, or zero.
Perfect correlation arises when other variables are controlled, as in laboratory experiments.
In chemistry, for example, we learned that pressure and volume are perfectly negatively related
when the temperature is controlled (strictly, this is an inverse proportion, so the points lie on a
curve rather than on a straight line). Likewise, in Physics, under controlled conditions, stress is
directly proportional to strain. Direct proportion is another way of expressing perfect positive
correlation.
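Whether a set of points shows perfect correlation in the sense used here (all points on one straight line) can be checked directly. The sketch below is one possible way to do it in Python; the function name `all_on_one_line`, the tolerance, and the stress and strain values are hypothetical and chosen only for illustration.

```python
def all_on_one_line(xs, ys, tol=1e-9):
    """Return True when every point (x, y) lies on the line through the first two points.

    Assumes at least two points and that the first two points are distinct.
    """
    x0, y0 = xs[0], ys[0]
    dx, dy = xs[1] - x0, ys[1] - y0
    # (x - x0, y - y0) is parallel to (dx, dy) exactly when this cross product is zero.
    return all(abs((x - x0) * dy - (y - y0) * dx) <= tol for x, y in zip(xs, ys))


# Hypothetical values: stress is set to exactly 3 times strain (direct proportion).
strain = [1, 2, 3, 4, 5]
stress = [3 * s for s in strain]
print(all_on_one_line(strain, stress))        # every point is on the line

# Disturb one observation, as happens when conditions are not controlled.
noisy_stress = [3, 6, 10, 12, 15]
print(all_on_one_line(strain, noisy_stress))  # one point is off the line
```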
The next illustrations show the different types of relationship described in terms of direction
and strength.
[Figure: scatter plots illustrating High Positive Correlation, High Negative Correlation, Low Positive Correlation, and Low Negative Correlation]
What pairs of variables in everyday life are positively correlated? What pairs of variables in
everyday life are negatively correlated? What pairs of variables in everyday life do not correlate at
all?
5. The strength of correlation can be perfect, very high, high, moderately high, low, negligible, or
zero.
8. A positive correlation exists between two variables when the points on the scatter plot follow a
straight line slanting up to the right.
9. A negative correlation exists between two variables when the points on the scatter plot follow a
straight line slanting down to the right.
10. A perfect correlation exists between two variables when the points on the scatter plot lie on a
straight line.
Lesson 2
A scatter plot alone is not precise enough to describe the strength and direction of the relationship
between two variables. A more analytical approach to describing the relationship between two
variables is to compute the correlation coefficient.
Do the next activity before going through this lesson.
The following values are the lengths (in minutes) of 25 Philippine Basketball
Association (PBA) games. Compute the mean.
To describe the relationship between two variables, we can compute the correlation coefficient
(r). The correlation coefficient is a number between -1 and 1 that describes both the strength and
the direction of correlation. In symbols, we write
$$-1 \le r \le 1$$
Value of r    Interpretation
r = 1         perfect positive correlation
r = 0         no correlation or zero correlation
r = -1        perfect negative correlation
The following scale is used to interpret the other values of r.
Correlation Scale
Value of r Interpretation
very high correlation
high correlation
moderately high correlation
low correlation
negligible correlation
Employee A B C D E F
Age (X) 18 26 39 48 53 58
Days (Y) 16 12 9 5 6 2
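Employee    X     Y
A          18    16
B          26    12
C          39     9
D          48     5
E          53     6
F          58     2
$\sum X = 242$    $\sum Y = 50$
$\bar{X} = \frac{\sum X}{n} = \frac{242}{6} = 40.33$    $\bar{Y} = \frac{\sum Y}{n} = \frac{50}{6} = 8.33$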
Step 2
Employee    X     Y    $X-\bar{X}$    $Y-\bar{Y}$
A          18    16      -22.33          7.67
B          26    12      -14.33          3.67
C          39     9       -1.33          0.67
D          48     5        7.67         -3.33
E          53     6       12.67         -2.33
F          58     2       17.67         -6.33
$\sum X = 242$    $\sum Y = 50$
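The sums, means, and deviations in the first two tables can be checked with a few lines of code. The sketch below uses the age and sick-day data of the six employees; the variable names are illustrative choices, not part of the lesson.

```python
# Data from the example: age (X) and number of sick days (Y) of six employees.
ages = [18, 26, 39, 48, 53, 58]
sick_days = [16, 12, 9, 5, 6, 2]

n = len(ages)
sum_x, sum_y = sum(ages), sum(sick_days)
mean_x, mean_y = sum_x / n, sum_y / n

# Deviation of each value from its mean, rounded to two decimals as in the tables.
dev_x = [round(x - mean_x, 2) for x in ages]
dev_y = [round(y - mean_y, 2) for y in sick_days]

print(sum_x, sum_y)
print(round(mean_x, 2), round(mean_y, 2))
print(dev_x)
print(dev_y)
```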
Step 3.
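Employee    X     Y    $X-\bar{X}$    $Y-\bar{Y}$    $(X-\bar{X})^2$    $(Y-\bar{Y})^2$
A          18    16      -22.33          7.67           498.63              58.83
B          26    12      -14.33          3.67           205.35              13.47
C          39     9       -1.33          0.67             1.77               0.45
D          48     5        7.67         -3.33            58.83              11.09
E          53     6       12.67         -2.33           160.53               5.43
F          58     2       17.67         -6.33           312.23              40.07
$\sum X = 242$    $\sum Y = 50$    $\sum (X-\bar{X})^2 = 1237.34$    $\sum (Y-\bar{Y})^2 = 129.34$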
Step 4
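Employee    X     Y    $X-\bar{X}$    $Y-\bar{Y}$    $(X-\bar{X})^2$    $(Y-\bar{Y})^2$    $(X-\bar{X})(Y-\bar{Y})$
A          18    16      -22.33          7.67           498.63              58.83              -171.27
B          26    12      -14.33          3.67           205.35              13.47               -52.59
C          39     9       -1.33          0.67             1.77               0.45                -0.89
D          48     5        7.67         -3.33            58.83              11.09               -25.54
E          53     6       12.67         -2.33           160.53               5.43               -29.52
F          58     2       17.67         -6.33           312.23              40.07              -111.85
$\sum X = 242$    $\sum Y = 50$    $\sum (X-\bar{X})^2 = 1237.34$    $\sum (Y-\bar{Y})^2 = 129.34$    $\sum (X-\bar{X})(Y-\bar{Y}) = -391.66$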
Step 5
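$$r = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sqrt{\sum (X-\bar{X})^2 \sum (Y-\bar{Y})^2}} = \frac{-391.66}{\sqrt{(1237.34)(129.34)}} \approx -0.98$$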
Step 6
Using the correlation scale, we interpret the obtained value of $r \approx -0.98$ as a very high
negative correlation. This implies that there is a very high negative correlation between the age of
employees and the number of sick days. This means that older employees tend to have a smaller
number of sick days while younger employees tend to have a greater number of sick days.
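The whole computation, from the means through to r, can be reproduced in a few lines. The sketch below implements the deviation form of the Pearson coefficient, r = Σ(X − X̄)(Y − Ȳ) / √[Σ(X − X̄)² Σ(Y − Ȳ)²], for the employee data; the function name `pearson_r` is an illustrative choice.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_sq_x = sum((x - mean_x) ** 2 for x in xs)
    sum_sq_y = sum((y - mean_y) ** 2 for y in ys)
    return sum_products / math.sqrt(sum_sq_x * sum_sq_y)


ages = [18, 26, 39, 48, 53, 58]
sick_days = [16, 12, 9, 5, 6, 2]

r = pearson_r(ages, sick_days)
print(round(r, 2))  # -0.98
```

Because the code keeps full precision instead of rounding each deviation to two decimals, its intermediate sums can differ slightly from the hand-computed table, but the rounded value of r is the same.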
Another Formula for Computing the Pearson Product-Moment Correlation Coefficient
The procedure for computing the Pearson Product-Moment Correlation coefficient using the
preceding formula is quite tedious. We can use another computing formula which is much shorter
and does not require the use of the mean. This formula uses the raw scores only.
$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}}$$
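The raw-score formula gives the same value as the deviation formula. A short derivation, written here as a sketch in LaTeX with n denoting the number of pairs, shows why:

```latex
\begin{align*}
\sum (X-\bar{X})(Y-\bar{Y}) &= \sum XY - \frac{\sum X \sum Y}{n}, \\
\sum (X-\bar{X})^2 &= \sum X^2 - \frac{(\sum X)^2}{n}, \qquad
\sum (Y-\bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{n}.
\end{align*}
Multiplying the numerator and the denominator of the deviation formula by $n$ gives
\[
r = \frac{n\sum XY - \sum X \sum Y}
         {\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} .
\]
```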
Study the next example to find out how the formula is used. We shall use the same data used
in the previous example.
Step 1
1. Get the sum of the values of X. This is $\sum X$.
Employee    X     Y
A          18    16
B          26    12
C          39     9
D          48     5
E          53     6
F          58     2
$\sum X = 242$    $\sum Y = 50$
Step 2
Employee    X     Y     XY
A          18    16    288
B          26    12    312
C          39     9    351
D          48     5    240
E          53     6    318
F          58     2    116
$\sum X = 242$    $\sum Y = 50$    $\sum XY = 1625$
Step 3
Step 4
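$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}}$$
$$r = \frac{6(1625) - (242)(50)}{\sqrt{\left[6(10998) - (242)^2\right]\left[6(546) - (50)^2\right]}} = \frac{-2350}{\sqrt{[7424][776]}} \approx -0.98$$
Here $\sum X^2 = 10998$ and $\sum Y^2 = 546$ are obtained by squaring each value of X and Y in the table and adding the squares.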
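The raw-score computation can be checked in code as well. This sketch applies the raw-score formula to the same employee data; the function name `pearson_r_raw` is an illustrative choice.

```python
import math


def pearson_r_raw(xs, ys):
    """Pearson r from raw scores only; the means are never computed."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


ages = [18, 26, 39, 48, 53, 58]
sick_days = [16, 12, 9, 5, 6, 2]

r = pearson_r_raw(ages, sick_days)
print(round(r, 2))  # -0.98
```

Since the data are whole numbers, every sum in this version is an exact integer, which is one practical reason the raw-score form is convenient for hand or calculator work.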
LESSON 3
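The existence of correlation between two variables can be ascertained by testing its
significance, using the t-test.
The test statistic for testing the significance of r is given by the following formula.
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
where $r$ is the correlation coefficient, $n$ is the number of pairs of observations, and the degrees of freedom are $df = n - 2$.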
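As a sketch, the usual test statistic for a correlation coefficient, t = r√(n − 2) / √(1 − r²) with n − 2 degrees of freedom, can be written as a small function. The function name `t_statistic` and the values r = 0.5 and n = 27 are hypothetical and chosen only to illustrate the call.

```python
import math


def t_statistic(r, n):
    """Test statistic for the significance of r, with n - 2 degrees of freedom."""
    if n <= 2:
        raise ValueError("at least three pairs of observations are needed")
    if abs(r) >= 1:
        raise ValueError("r must be strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Hypothetical values for illustration only.
t = t_statistic(0.5, 27)
print(round(t, 3))
```

The sign of t follows the sign of r, which is why the decision rule compares absolute values in a two-tailed test.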
Example 1
A soft drink distributor wants to find out if the number of cases of soft drinks ordered is
related to the travel time for their delivery. The following data have been obtained from past
experience.
Solution
1. To compute the correlation coefficient, prepare a table like the one shown below.
$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}}$$
Step 1
Null hypothesis: There is no significant relationship between the number of cases of soft drinks
ordered and the travel time for their delivery.
Alternative hypothesis: There is a significant relationship between the number of cases of soft
drinks ordered and the travel time for their delivery.
Step 2
Use the t-test to test the significance of r. Get the critical value of t at the 0.05 level of
significance, two-tailed test. Since $df = n - 2 = 5$, using the table for the t
distribution, the critical value of t is 2.571.
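Looking up a critical value amounts to reading one cell of the t table by degrees of freedom and level of significance. The sketch below hard-codes a small excerpt of a standard two-tailed t table; the dictionary and function names are hypothetical, and n = 7 is an assumption implied by the critical value 2.571 (df = 5) quoted above.

```python
# Excerpt of a standard t table: two-tailed critical values keyed by degrees of freedom.
T_CRITICAL_TWO_TAILED = {
    4: {0.05: 2.776, 0.01: 4.604},
    5: {0.05: 2.571, 0.01: 4.032},
    6: {0.05: 2.447, 0.01: 3.707},
    7: {0.05: 2.365, 0.01: 3.499},
    8: {0.05: 2.306, 0.01: 3.355},
    9: {0.05: 2.262, 0.01: 3.250},
    10: {0.05: 2.228, 0.01: 3.169},
}


def critical_t(n, alpha):
    """Two-tailed critical value of t for n pairs of observations (df = n - 2)."""
    df = n - 2
    try:
        return T_CRITICAL_TWO_TAILED[df][alpha]
    except KeyError:
        raise ValueError(f"no table entry for df={df}, alpha={alpha}") from None


print(critical_t(7, 0.05))  # df = 5 at the 0.05 level
print(critical_t(7, 0.01))  # df = 5 at the 0.01 level
```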
Step 3
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
Step 5
There is no significant relationship between the number of cases ordered and the travel time
for their delivery.
Example 2
The average normal daily temperature (in degrees Fahrenheit) and the corresponding average
monthly precipitation (in inches) for seven months are shown here. At $\alpha = 0.01$, determine if there is
a relationship between temperature and precipitation.
1. To compute the correlation coefficient, prepare a table like the one shown below.
$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}}$$
Null hypothesis: There is no significant relationship between the average daily temperature and
the average monthly precipitation.
Alternative hypothesis: There is a significant relationship between the average daily temperature
and the average monthly precipitation.
Step 2
Use the t-test to test the significance of r. Get the critical value of t at the 0.01 level of
significance, two-tailed test. Since $df = n - 2 = 7 - 2 = 5$, using the table for the t
distribution, the critical value of t is 4.032.
Step 3
Decide whether to accept or reject the null hypothesis. Since the absolute value of
the computed t value (4.206) is greater than the absolute value of the tabular or critical value
(4.032), reject the null hypothesis.
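The decision rule used here compares absolute values. A minimal sketch of that rule, applied to the two numbers quoted in this example, might look as follows; the function name `decide` and its return strings are illustrative.

```python
def decide(t_computed, t_critical):
    """Two-tailed decision rule: reject when |computed t| exceeds |critical t|."""
    if abs(t_computed) > abs(t_critical):
        return "reject the null hypothesis"
    return "do not reject the null hypothesis"


# Values quoted in the example: computed t = 4.206, critical t = 4.032.
print(decide(4.206, 4.032))
```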
Step 5
There is a significant relationship between the average daily temperature and the average
monthly precipitation.
LESSON 4
Making Predictions Using Regression Analysis
If two variables are significantly correlated, then we can predict the value of one variable in
terms of the other variable. For example, it is believed that the amount of family income is related
to the amount of expenditures. If indeed there is a significant relationship between these two
variables, then we can predict the amount of expenditures in terms of family income or vice versa.
In this lesson, we shall learn to predict the value of one variable in terms of another
variable. Before we proceed, do the next activities to prepare you for the present lesson.
ACTIVITY 4.1
A. Find the value of Y for the given value of X.
1.
2.
3.
4.
5.
B. Find the values of the following, using the data shown below.
1. ∑ 5. ∑
2. ∑ 6. ∑
3. ∑ 7. ∑
4. ∑
X Y
3 4
5 7
7 9
9 12
12 8
There are many instances where we make predictions to make sound decisions.
Businessmen predict future sales of the company based on present productions. Manufacturers
make predictions of their profit based on the production cost. Guidance counselors predict
scholastic or academic success of the students, based on their scores in the entrance examination.
School administrators predict future expansions of physical facilities based on student enrolment
records.
The process of predicting the value of one variable in terms of the other variable is called
regression analysis. In this lesson, we shall discuss only simple linear regression analysis. We
use the word simple because we shall deal only with one dependent variable and one independent
variable. If there is more than one independent variable, the analysis is called multiple linear
regression analysis.
We use the word “linear” because we shall assume that the relationship between the two
variables is linear. There are also relationships which are nonlinear but we shall not deal with
them here in this lesson.
If we are going to predict one variable in terms of the other variable, we have to make sure that
the variables are significantly correlated. We cannot do regression analysis without performing
correlation analysis first.
The Regression Equation
To predict one variable in terms of the other variable, we need to get the regression equation.
The graph of the regression equation is a line because it is assumed that we are dealing with a
linear relationship.
In the regression equation $Y = a + bX$, $Y$ is the dependent variable and $X$ is the independent
variable. The independent variable is sometimes called the predictor variable or the explanatory
variable because it is used to predict or explain the dependent variable. On the other hand, the
dependent variable is sometimes called the response variable. Since the regression equation is
used to predict the value of the dependent variable in terms of the independent variable, we use
$\hat{Y}$ to indicate that it is not the actual value but just a predicted value of Y.
Thus, the regression equation is
$$\hat{Y} = a + bX$$
where $a$ is the Y-intercept and $b$ is the slope of the line.
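A regression line can be fitted from the same sums used for the correlation coefficient. The sketch below assumes the common form Ŷ = a + bX with b = [nΣXY − ΣXΣY] / [nΣX² − (ΣX)²] and a = (ΣY − bΣX) / n; the symbols a and b, the function names, and the tiny data set are assumptions made for illustration and are not taken from the lesson.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b of the line Y-hat = a + b*X."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


def predict(a, b, x):
    """Predicted value of Y for a given X."""
    return a + b * x


# Hypothetical data lying exactly on the line Y = 2 + 3X.
xs = [1, 2, 3, 4]
ys = [5, 8, 11, 14]

a, b = fit_line(xs, ys)
print(a, b)              # intercept and slope recovered from the data
print(predict(a, b, 5))  # predicted Y when X = 5
```

When the points lie exactly on a line, as in this made-up data, the fitted line reproduces it; with real data the fitted line is the one that best follows the trend of the points.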
1. Determine the regression equation for predicting the price of the jeepney in terms of its
years of usage.
2. Predict the price of a jeepney that has been in use for 3 years.
Solution
Step 1
We need to establish first that the age and the depreciated price of a jeepney are significantly
correlated before we can perform regression analysis. We shall compute first the correlation
coefficient.
$X$    $Y$    $XY$    $X^2$    $Y^2$
5      85     425     25       7225
4      103    412     16       10609
6      70     420     36       4900
5      82     410     25       6724
5      89     445     25       7921
5      98     490     25       9604
6      66     396     36       4356
6      95     570     36       9025
2      169    338     4        28561
7      70     490     49       4900
7      48     336     49       2304
$\sum X = 58$    $\sum Y = 975$    $\sum XY = 4732$    $\sum X^2 = 326$    $\sum Y^2 = 96129$
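Column totals like these are easy to get wrong by hand, so it can help to recompute them. The sketch below rebuilds the five sums from the X and Y columns of the table; the variable names are illustrative.

```python
# X = years in use, Y = price in thousand pesos, as listed in the table.
years = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
prices = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]

sums = {
    "sum_x": sum(years),
    "sum_y": sum(prices),
    "sum_xy": sum(x * y for x, y in zip(years, prices)),
    "sum_x2": sum(x * x for x in years),
    "sum_y2": sum(y * y for y in prices),
}

for name, value in sums.items():
    print(name, value)
```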
The correlation coefficient is computed as follows:
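$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}}$$
$$r = \frac{11(4732) - (58)(975)}{\sqrt{\left[11(326) - (58)^2\right]\left[11(96129) - (975)^2\right]}} = \frac{-4498}{\sqrt{[222][106794]}} \approx -0.9238$$
Step 2
We test the significance of r using the t-test, with $df = n - 2 = 11 - 2 = 9$.
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{-0.9238\sqrt{11-2}}{\sqrt{1-(-0.9238)^2}} \approx -7.24$$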
Since the absolute value of the computed value is greater than the absolute value of the critical
value, we conclude that the correlation coefficient is significant.
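The correlation coefficient and its t value for the jeepney data can be recomputed from the sums in the table. This is a sketch under the assumption that the usual raw-score formula for r and the statistic t = r√(n − 2) / √(1 − r²) are the ones intended; the variable names are illustrative.

```python
import math

# Sums taken from the table; n is the number of jeepneys listed.
n = 11
sum_x, sum_y, sum_xy, sum_x2, sum_y2 = 58, 975, 4732, 326, 96129

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

df = n - 2
t = r * math.sqrt(df) / math.sqrt(1 - r * r)

print(round(r, 4))
print(df, round(t, 2))
```

The large negative t reflects a strong tendency for the price to fall as the years of use increase.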
Step 3
Since there is a significant relationship between the age and the depreciated price of the
jeepney, we can proceed to regression analysis to predict the price in terms of age.
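We compute the values of $a$ and $b$.
$$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} = \frac{11(4732) - (58)(975)}{11(326) - (58)^2} = \frac{-4498}{222} \approx -20.26$$
$$a = \frac{\sum Y - b\sum X}{n} = \frac{975 - (-20.2613)(58)}{11} \approx 195.47$$
Thus, the regression equation is $\hat{Y} = 195.47 - 20.26X$.
Step 4
To predict the price of a jeepney which is 3 years old, we substitute X = 3 in the regression
equation $\hat{Y} = 195.47 - 20.26X$.
$$\hat{Y} = 195.47 - 20.26(3) = 134.69$$
Since the prices of the jeepneys are expressed in thousand pesos, we multiply $\hat{Y}$ by 1,000.
Therefore, the predicted price of a jeepney which is 3 years old is approximately 134,690 pesos.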
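The prediction can be reproduced from the table sums. The sketch below assumes the regression line has the form Ŷ = a + bX, with a and b computed by the usual least-squares formulas; these symbols and the variable names are assumptions. The coefficients are rounded to two decimals before predicting, to mirror a hand computation; that rounding is a choice made here.

```python
# Sums taken from the jeepney table (X = years in use, Y = price in thousand pesos).
n = 11
sum_x, sum_y, sum_xy, sum_x2 = 58, 975, 4732, 326

# Least-squares slope and intercept.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

# Round as one would when working by hand.
a_rounded, b_rounded = round(a, 2), round(b, 2)

years_in_use = 3
predicted_thousands = round(a_rounded + b_rounded * years_in_use, 2)
predicted_pesos = round(predicted_thousands * 1000)

print(a_rounded, b_rounded)
print(predicted_thousands, predicted_pesos)
```

Keeping full precision in a and b instead gives about 134.68 thousand pesos, so the difference caused by rounding is small.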
Example 2
It is believed that there is a relationship between a driver’s age and the number of accidents
he or she encounters over a one-year period. The data are shown here.
Driver’s Age (X) Number of Accidents (Y)
63 2
65 3
60 1
62 0
66 3
67 1
59 4
1. Find the regression equation for predicting the number of accidents in terms of age.
2. Predict the number of accidents of a driver who is 64 years old.
Solution
Step 1
We need to establish first that the age and number of accidents are significantly correlated
before we can perform regression analysis. We shall compute first the correlation coefficient.
$X$    $Y$    $XY$    $X^2$    $Y^2$
63     2      126     3969     4
65     3      195     4225     9
60     1      60      3600     1
62     0      0       3844     0
66     3      198     4356     9
67     1      67      4489     1
59     4      236     3481     16
$\sum X = 442$    $\sum Y = 14$    $\sum XY = 882$    $\sum X^2 = 27964$    $\sum Y^2 = 40$
The correlation coefficient is computed as follows:
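$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}}$$
$$r = \frac{7(882) - (442)(14)}{\sqrt{\left[7(27964) - (442)^2\right]\left[7(40) - (14)^2\right]}} = \frac{-14}{\sqrt{[384][84]}} \approx -0.08$$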
Since the absolute value of the computed value is less than the absolute value of the critical
value, we conclude that the correlation coefficient is not significant.
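The rule "test the correlation first, then decide whether to run the regression" can be expressed as a simple gate. The sketch below applies it to the driver data. The function names are hypothetical, and the critical value 2.571 (0.05 level, two-tailed, df = 5) is an assumption, since the level of significance for this example is not stated in the text.

```python
import math


def correlation_and_t(xs, ys):
    """Return Pearson r (raw-score formula) and its t statistic with n - 2 df."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    return r, t


def proceed_to_regression(t_computed, t_critical):
    """Proceed only when the correlation is significant (two-tailed rule)."""
    return abs(t_computed) > abs(t_critical)


ages = [63, 65, 60, 62, 66, 67, 59]
accidents = [2, 3, 1, 0, 3, 1, 4]

r, t = correlation_and_t(ages, accidents)
T_CRITICAL = 2.571  # assumed: 0.05 level, two-tailed, df = 7 - 2 = 5

print(round(r, 3), round(t, 3))
print(proceed_to_regression(t, T_CRITICAL))
```

Because the computed t is far smaller in absolute value than any of the usual critical values for df = 5, the conclusion does not depend on the particular level assumed here.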
Step 3
Since there is no significant relationship between the driver’s age and the number of accidents,
we shall not proceed to regression analysis.
The T-Table
proportions of the areas in the two tails of the t curve.
The t distribution has critical values, which are used in the same way as the z critical values.
Like the z critical values, they are also called confidence coefficients.