Correlation and Regression
Correlation:
Correlation is a measure of the linear relationship between two quantitative variables (for example,
age and weight). It is a statistical technique that shows whether, and how strongly, pairs of
variables are related.
Types of correlation:
a) Positive correlation, b) Negative correlation, c) No (Zero) correlation
Positive correlation:
Positive correlation arises when the two variables move in the same direction: as the values of one
variable increase, the values of the other also increase, and as the values of one decrease, the
values of the other also decrease. The points lie close to a straight line, which has a positive gradient.
Example:
Relation between training and performance of employees in a company
Relation between price and supply of a product
Negative correlation:
Negative correlation arises when the two variables move in opposite directions: as the values of one
variable increase, the values of the other decrease, and as the values of one decrease, the values
of the other increase. The points lie close to a straight line, which has a negative gradient.
Example:
Relation between television viewing and exam grades
Relation between price and demand of a product
No (Zero) correlation:
No (zero) correlation occurs when a change in one variable is not accompanied by any systematic
change in the other variable. There is no pattern to the points.
Example:
Relation between height and exam grades
Correlation coefficient:
Let (𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), …, (𝑥𝑛 , 𝑦𝑛 ) be n pairs of observations of the variables X and Y observed
from n sample points. The linear relationship of X and Y is called simple correlation. The degree
of linear relationship of X and Y is estimated by a quantity r, where
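\[
r = \frac{\operatorname{Cov}(x, y)}{\sqrt{V(x)\,V(y)}}
  = \frac{\frac{1}{n}\sum (x - \bar{x})(y - \bar{y})}
         {\sqrt{\frac{1}{n}\sum (x - \bar{x})^2 \cdot \frac{1}{n}\sum (y - \bar{y})^2}}
  = \frac{\sum xy - \frac{\sum x \sum y}{n}}
         {\sqrt{\left[\sum x^2 - \frac{(\sum x)^2}{n}\right]\left[\sum y^2 - \frac{(\sum y)^2}{n}\right]}}
  = \frac{SP(xy)}{\sqrt{SS(x)\,SS(y)}}.
\]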
The values of r range from −1 to 1. r = −1 means a perfect negative correlation, r = 1 means
a perfect positive correlation, and r = 0 means no linear relationship between the variables.
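As a minimal sketch, the two forms of r given above (the covariance form and the SP/SS form) can be computed side by side to check that they agree. The function names and the four data points are hypothetical, chosen only for illustration.

```python
import math

def corrected_sums(x, y):
    """Return SS(x), SS(y) and SP(xy) from the raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    ss_x = sum(v * v for v in x) - sum_x ** 2 / n
    ss_y = sum(v * v for v in y) - sum_y ** 2 / n
    sp_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    return ss_x, ss_y, sp_xy

def r_computational(x, y):
    """r = SP(xy) / sqrt(SS(x) * SS(y))."""
    ss_x, ss_y, sp_xy = corrected_sums(x, y)
    return sp_xy / math.sqrt(ss_x * ss_y)

def r_definitional(x, y):
    """r = Cov(x, y) / sqrt(V(x) * V(y)), using divisor n throughout."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov / math.sqrt(var_x * var_y)

# Hypothetical illustrative data, not taken from the text.
x = [2, 4, 6, 8]
y = [1, 3, 2, 5]
print(r_computational(x, y), r_definitional(x, y))  # both are about 0.83
```

The factors of 1/n cancel between numerator and denominator, which is why the SP/SS form needs only the five raw sums.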
Correlation of pseudorandom binary codes is what makes GPS work, along with many radar systems and
many CDMA (code division multiple access) systems. This is one reason GPS is power hungry: it takes
a lot of processing to find the correct correlation, with the correct satellite (the correct code)
and the correct delay.
Correlation is also one of the promising methods for analyzing network-based intrusion alerts,
where it is used to find significant relationships among alerts triggered by multiple intrusion
detection sensors. A security administrator (SA) needs to understand and study these alerts, but
they are meaningless when analyzed individually. They must be connected with earlier or later
alerts so that the SA can work out the sequence of attacks launched on the network. This is
important for identifying preventive measures for the future.
Once correlation is known, we can use it to make predictions. If we know a score on one
measure, we can make a more accurate prediction of another measure that is highly related to it.
The stronger the relationship between/among variables the more accurate the prediction.
Show by example, r = 1
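Let x = 1, 2, 3 and y = 1, 2, 3, so that n = 3.
\[
\begin{aligned}
SS(x) &= \sum x^2 - \frac{(\sum x)^2}{n} = 14 - \frac{(6)^2}{3} = 2 \\
SS(y) &= \sum y^2 - \frac{(\sum y)^2}{n} = 14 - \frac{(6)^2}{3} = 2 \\
SP(xy) &= \sum xy - \frac{\sum x \sum y}{n} = 14 - \frac{6 \times 6}{3} = 2 \\
r &= \frac{SP(xy)}{\sqrt{SS(x)\,SS(y)}} = \frac{2}{\sqrt{2 \times 2}} = 1
\end{aligned}
\]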
r = 1 indicates that X and Y are perfectly and positively correlated. This happens when X and Y
change uniformly in the same direction. Here X increases by 1 unit and Y increases by 1 unit.
Show by example, r = −1
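Let x = 1, 2, 3 and y = 9, 6, 3, so that n = 3.
\[
\begin{aligned}
SS(x) &= \sum x^2 - \frac{(\sum x)^2}{n} = 14 - \frac{(6)^2}{3} = 2 \\
SS(y) &= \sum y^2 - \frac{(\sum y)^2}{n} = 126 - \frac{(18)^2}{3} = 18 \\
SP(xy) &= \sum xy - \frac{\sum x \sum y}{n} = 30 - \frac{6 \times 18}{3} = -6 \\
r &= \frac{SP(xy)}{\sqrt{SS(x)\,SS(y)}} = \frac{-6}{\sqrt{2 \times 18}} = -1
\end{aligned}
\]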
r = −1 indicates that X and Y are perfectly negatively correlated. This happens when X and Y
change uniformly but in opposite directions. Here X increases by 1 unit and Y decreases by 3
units.
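The two worked examples can be checked numerically. The sketch below assumes a small helper, `pearson_r` (a hypothetical name), that applies the SS and SP formulas used above to both data sets.

```python
import math

def pearson_r(x, y):
    """Correlation coefficient via SP(xy) / sqrt(SS(x) * SS(y))."""
    n = len(x)
    ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_y = sum(v * v for v in y) - sum(y) ** 2 / n
    sp_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return sp_xy / math.sqrt(ss_x * ss_y)

# Data from the two worked examples.
r_pos = pearson_r([1, 2, 3], [1, 2, 3])  # Y rises by 1 per unit of X
r_neg = pearson_r([1, 2, 3], [9, 6, 3])  # Y falls by 3 per unit of X

print(r_pos, r_neg)
```

The size of the step in Y (1 unit or 3 units) does not affect r; only the direction and the exactness of the linear relationship do.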
Problem 10.1: The following are the data representing age and BP of some selected persons.
Age (x, in years):   25   30   35   30   32   40   45   40   36   35
BP (y, in mm Hg):    75   80   85   90   95   85  100   90   85   80
a) Compute correlation coefficient.
b) Do you think that BP increases significantly with the increase in age?
Solution:
a)
    x      y      xy      x²      y²
   25     75    1875     625    5625
   30     80    2400     900    6400
   35     85    2975    1225    7225
   30     90    2700     900    8100
   32     95    3040    1024    9025
   40     85    3400    1600    7225
   45    100    4500    2025   10000
   40     90    3600    1600    8100
   36     85    3060    1296    7225
   35     80    2800    1225    6400
Totals: Σx = 348, Σy = 865, Σxy = 30350, Σx² = 12420, Σy² = 75325
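\[
\begin{aligned}
SS(x) &= \sum x^2 - \frac{(\sum x)^2}{n} = 12420 - \frac{348^2}{10} = 309.6 \\
SS(y) &= \sum y^2 - \frac{(\sum y)^2}{n} = 75325 - \frac{865^2}{10} = 502.5 \\
SP(xy) &= \sum xy - \frac{\sum x \sum y}{n} = 30350 - \frac{348 \times 865}{10} = 248 \\
r &= \frac{SP(xy)}{\sqrt{SS(x)\,SS(y)}} = \frac{248}{\sqrt{309.6 \times 502.5}} = 0.63
\end{aligned}
\]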
The variables X (age) and Y (BP) are positively correlated.
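The hand calculation for Problem 10.1 (a) can be reproduced with a short script. This is only a sketch using the age and BP data given in the problem; the variable names are illustrative.

```python
import math

# Data from Problem 10.1.
age = [25, 30, 35, 30, 32, 40, 45, 40, 36, 35]   # x, in years
bp = [75, 80, 85, 90, 95, 85, 100, 90, 85, 80]   # y, in mm Hg
n = len(age)

# Corrected sums of squares and products.
ss_x = sum(v * v for v in age) - sum(age) ** 2 / n
ss_y = sum(v * v for v in bp) - sum(bp) ** 2 / n
sp_xy = sum(a * b for a, b in zip(age, bp)) - sum(age) * sum(bp) / n

r = sp_xy / math.sqrt(ss_x * ss_y)
print(round(ss_x, 1), round(ss_y, 1), round(sp_xy, 1), round(r, 2))
# prints: 309.6 502.5 248.0 0.63
```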
Decision rule: With α = 0.05 and df = n − 2, the critical value of t is found from the t table.
We reject H0 if |t| > t_{0.05, n−2}, the tabulated critical value.
If the test concludes that the correlation coefficient is significantly different from zero, we say
that the correlation coefficient is significant. There is a significant linear relationship between x
and y.
If the test concludes that the correlation coefficient is not significantly different from zero (it is
close to zero), we say that correlation coefficient is not significant. There is not a significant
linear relationship between x and y.
b) We need to test,
𝐻0 ∶ 𝜌 = 0 vs 𝐻1 ∶ 𝜌 ≠ 0.
Test statistic:
\[
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.63\sqrt{10-2}}{\sqrt{1-0.63^2}} = 2.29
\]
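The test statistic and the decision rule can be sketched in a few lines. The critical value 2.306 is an assumption supplied here: it is the two-tailed 5% point of the t distribution with 8 degrees of freedom from a standard t table, and the text itself does not quote it.

```python
import math

r = 0.63   # correlation coefficient from part (a)
n = 10     # number of (age, BP) pairs

# t statistic for H0: rho = 0, with n - 2 degrees of freedom.
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Assumed two-tailed critical value for alpha = 0.05, df = 8 (standard t table).
T_CRITICAL = 2.306

reject_h0 = abs(t) > T_CRITICAL
print(round(t, 2), reject_h0)
```

Under this assumed critical value, |t| = 2.29 falls just short of 2.306, so H0 would not be rejected at the 5% level in a two-sided test. A one-sided test or a different α would use a different tabulated value.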
MATLAB code
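The following computes the correlation coefficient matrix between two normally distributed random
vectors of 10 observations each.
A = randn(10,1);    % 10 standard normal observations
B = randn(10,1);    % second, independently generated sample
R = corrcoef(A,B)   % 2-by-2 matrix; the off-diagonal entries are r between A and B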
Regression:
Regression is a method of expressing a dependent variable y as a function of an independent
variable x, so that the value of y can be estimated for any value of x. Mathematically, the linear
regression model is given by,
𝒀 = 𝜶 + 𝜷𝒙 + 𝝐,
where
α = the intercept, that is, the value of y when x = 0.
β = regression coefficient of y on x. It measures the rate of change of y for a unit change in x.
ε = random error. It is included in the model to account for the influence of other variables
that are not included in the model.
The problem is to fit the regression equation in such a way that the sum of squares due to error is
minimum. Let the fitted model be
\[
\hat{y} = a + bx,
\]
where a is the estimate of α and b is the estimate of β. Here,
\[
a = \bar{y} - b\bar{x}, \qquad b = \frac{SP(xy)}{SS(x)}.
\]
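The expressions for a and b follow from minimizing the error sum of squares. The derivation below is a standard least-squares sketch added for completeness; the symbol S for the error sum of squares is introduced here and does not appear in the text.

```latex
\begin{aligned}
S(a, b) &= \sum (y - a - bx)^2 \\
\frac{\partial S}{\partial a} = -2\sum (y - a - bx) = 0
  &\;\Rightarrow\; \sum y = na + b\sum x
  \;\Rightarrow\; a = \bar{y} - b\bar{x} \\
\frac{\partial S}{\partial b} = -2\sum x\,(y - a - bx) = 0
  &\;\Rightarrow\; \sum xy = a\sum x + b\sum x^2 \\
\text{Substituting } a = \tfrac{\sum y}{n} - b\,\tfrac{\sum x}{n}:\quad
  \sum xy - \frac{\sum x \sum y}{n} &= b\left[\sum x^2 - \frac{(\sum x)^2}{n}\right]
  \;\Rightarrow\; b = \frac{SP(xy)}{SS(x)}
\end{aligned}
```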
Show, by example, b = 1
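Let x = 1, 2, 3 and y = 1, 2, 3, so that n = 3.
\[
\begin{aligned}
SS(x) &= \sum x^2 - \frac{(\sum x)^2}{n} = 14 - \frac{(6)^2}{3} = 2 \\
SP(xy) &= \sum xy - \frac{\sum x \sum y}{n} = 14 - \frac{6 \times 6}{3} = 2 \\
b &= \frac{SP(xy)}{SS(x)} = \frac{2}{2} = 1
\end{aligned}
\]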
Show, by example, b = −2
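Let x = 1, 2, 3 and y = 8, 6, 4, so that n = 3.
\[
\begin{aligned}
SS(x) &= \sum x^2 - \frac{(\sum x)^2}{n} = 14 - \frac{(6)^2}{3} = 2 \\
SP(xy) &= \sum xy - \frac{\sum x \sum y}{n} = 32 - \frac{6 \times 18}{3} = -4 \\
b &= \frac{SP(xy)}{SS(x)} = \frac{-4}{2} = -2
\end{aligned}
\]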
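Both slope examples can be verified with a small least-squares helper. The function name `fit_line` is hypothetical; it applies b = SP(xy)/SS(x) and a = ȳ − b x̄, and also returns the intercept, which the examples do not compute.

```python
def fit_line(x, y):
    """Return (a, b) for the fitted line y_hat = a + b * x."""
    n = len(x)
    ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
    sp_xy = sum(p * q for p, q in zip(x, y)) - sum(x) * sum(y) / n
    b = sp_xy / ss_x                     # slope
    a = sum(y) / n - b * sum(x) / n      # intercept: y_bar - b * x_bar
    return a, b

print(fit_line([1, 2, 3], [1, 2, 3]))   # (0.0, 1.0)
print(fit_line([1, 2, 3], [8, 6, 4]))   # (10.0, -2.0)
```

In the second example the fitted line is ŷ = 10 − 2x, which passes exactly through the three points (1, 8), (2, 6) and (3, 4).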
Problem 10.2: The following data give the number of children ever born (y) to mothers with
different levels of education (x, in completed years of schooling):
x: 8, 4, 5, 10, 12, 8, 5, 10, 0, 6, 8, 5, 0
y: 2, 6, 5, 3, 1, 2, 5, 2, 7, 3, 4, 2, 5
a) Fit a regression line of y on x.
b) Estimate the number of children of a mother who completed 14 years of schooling.
c) Test the significance of regression.
Solution:
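a)
\[
\begin{aligned}
SS(x) &= \sum x^2 - \frac{(\sum x)^2}{n} = 663 - \frac{(81)^2}{13} = 158.31 \\
SP(xy) &= \sum xy - \frac{\sum x \sum y}{n} = 228 - \frac{81 \times 47}{13} = -64.85 \\
b &= \frac{SP(xy)}{SS(x)} = \frac{-64.85}{158.31} = -0.41 \\
a &= \bar{y} - b\bar{x} = \frac{\sum y}{n} - b\,\frac{\sum x}{n}
   = \frac{47}{13} - (-0.41)\,\frac{81}{13} = 6.17
\end{aligned}
\]
Fitted line: ŷ = a + bx = 6.17 − 0.41x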
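The fit in part (a) can be reproduced in code, and the same fitted line gives the estimate asked for in part (b). This is a sketch on the problem's data; note that x = 14 lies outside the observed range of 0 to 12 years, so the estimate is an extrapolation.

```python
# Data from Problem 10.2.
x = [8, 4, 5, 10, 12, 8, 5, 10, 0, 6, 8, 5, 0]   # completed years of schooling
y = [2, 6, 5, 3, 1, 2, 5, 2, 7, 3, 4, 2, 5]      # children ever born
n = len(x)

ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
sp_xy = sum(p * q for p, q in zip(x, y)) - sum(x) * sum(y) / n

b = sp_xy / ss_x                     # slope
a = sum(y) / n - b * sum(x) / n      # intercept

# Part (b): estimated number of children at 14 years of schooling.
y_hat_14 = a + b * 14

print(round(a, 2), round(b, 2), round(y_hat_14, 2))
# prints: 6.17 -0.41 0.43
```

Using the rounded coefficients by hand gives the same figure: 6.17 − 0.41 × 14 = 0.43.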
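c) We need to test H0: β = 0 vs H1: β ≠ 0.
Test statistic:
\[
t = \frac{b}{\sqrt{s^2 / SS(x)}} = \frac{-0.41}{\sqrt{1.317 / 158.31}} = -4.5,
\]
where s² = [SS(y) − b·SP(xy)] / (n − 2) = 1.317 is the residual mean square.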
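The slope test can be checked numerically, including the residual mean square s². In this sketch s² is computed as [SS(y) − b·SP(xy)]/(n − 2), and the critical value 2.201 is an assumption taken from a standard t table (two-tailed, α = 0.05, df = 11); neither is stated in the text. With unrounded b the results differ slightly from the hand calculation (s² ≈ 1.32, t ≈ −4.49).

```python
import math

# Data from Problem 10.2.
x = [8, 4, 5, 10, 12, 8, 5, 10, 0, 6, 8, 5, 0]
y = [2, 6, 5, 3, 1, 2, 5, 2, 7, 3, 4, 2, 5]
n = len(x)

ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
ss_y = sum(v * v for v in y) - sum(y) ** 2 / n
sp_xy = sum(p * q for p, q in zip(x, y)) - sum(x) * sum(y) / n

b = sp_xy / ss_x                      # slope estimate
s2 = (ss_y - b * sp_xy) / (n - 2)     # residual mean square
t = b / math.sqrt(s2 / ss_x)          # t statistic for H0: beta = 0

# Assumed two-tailed critical value for alpha = 0.05, df = 11 (standard t table).
T_CRITICAL = 2.201

reject_h0 = abs(t) > T_CRITICAL
print(round(s2, 2), round(t, 2), reject_h0)
```

Since |t| is well above the assumed critical value, H0: β = 0 is rejected, which suggests the regression of y on x is significant at the 5% level.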
Exercise 10
10.1 The following data are given for the inflation rate(x) and the corresponding lending rate(y)
x y
11.8 10.4
12.5 16.5
15.7 22.9
19.2 26.6
21.9 33.8
23.3 42.8
10.2 The following data are given for the educational qualification (year of schooling)(x) of a
person and the corresponding yearly income (in lac) (y)
x y
5 13.6
8 15.6
10 18.7
12 20.8
16 25.2
18 29.5
10.3 The following data are given for the day temperature (in ℃)(x) of Dhaka and the
corresponding humidity (in %) (y)
x y
30 90
32 78
34 84
36 73
38 88
40 72
10.4 The following data are given for the day temperature (in ℃)(x) of Dhaka in December and
the corresponding sales of ice cream (in thousand) (y)
x y
22 83.6
16 61.4
19 72.0
21 78.2
24 87.6
26 98.2
Sample MCQs
1. The following data are given for the day temperature (in ℃) (x) of Chittagong and the
corresponding humidity (in %) (y). Compute correlation coefficient r.
x 35 32 30 40 38
y 85 75 90 70 85
a) 0.73 b) −0.52 c) −0.73 d) 0.52
2. Test the significance of correlation coefficient (𝑟), where r = 0.85 and sample size is 15.
a) Significant b) Not significant c) Inconclusive d) None of the above
3. The following data are given for the inflation rate(x) and the corresponding lending rate(y). Fit
a regression line of y on x.
x 15.5 12.5 11.5 21.5 23.5
y 22.5 17.0 10.5 33.5 42.8
a) 𝑦 = −2.37 + 14.79𝑥
b) 𝑦 = 14.79 − 2.37𝑥
c) 𝑦 = −14.79 + 2.37𝑥
d) 𝑦 = 2.37 − 14.79𝑥