Correlation and Regression
Md. Rabiul Islam
Department of Biomedical Engineering
Islamic University
Correlation
Two variables are said to be correlated if a change in one variable results in a corresponding change in the other variable.
Correlation is a statistical tool that studies the relationship between two variables.
Correlation analysis involves the various methods and techniques used for studying and measuring the extent of the relationship between two variables.
Correlation is concerned with measuring the strength of association between variables.
The degree of association between two or more variables is termed correlation.
Types of Correlation
i. Positive and Negative Correlation: If both variables vary in the same direction, i.e. as one variable increases the other, on average, also increases, or as one decreases the other, on average, also decreases, the correlation is said to be positive. If, on the other hand, one variable increases while the other decreases, or vice versa, the correlation is said to be negative.
Examples of positive correlation: (a) heights and weights, (b) amount of rainfall and yield of crops, (c) price and supply of a commodity, (d) income and expenditure on luxury goods, (e) blood pressure and age.
Examples of negative correlation: (a) price and demand for a commodity, (b) sales of woolen garments and the day's temperature.
Contd…
ii. Simple, Partial and Multiple Correlation: When only two variables are studied, it is a case of simple correlation. In partial and multiple correlation, three or more variables are studied. In multiple correlation, three or more variables are studied simultaneously. In partial correlation, we have more than two variables but consider only two of them to be influencing each other, the effect of the other variables being held constant.
Contd…
Methods of studying correlation:
• Graphical method → Scatter diagram
• Algebraic method → Karl Pearson's coefficient of correlation
Methods of Studying Correlation
The following are the methods of determining correlation:
1. Scatter diagram method
2. Karl Pearson's coefficient of correlation
1. Scatter Diagram:
This is a graphical method of finding the relationship between two variables.
The given data are plotted on graph paper in the form of dots, i.e. for each pair of x and y values we put a dot, obtaining as many points as there are observations.
The greater the scatter of the points over the graph, the weaker the relationship between the variables.
Scatter Diagram
[Figure: scatter diagrams illustrating perfect positive, low-degree positive, perfect negative, and low-degree negative correlation.]
Interpretation
If all the points lie on a straight line, there is either perfect positive or perfect negative correlation.
If all the points lie on a straight line rising from the lower left-hand corner to the upper right-hand corner, the correlation is perfect positive. Perfect positive if r = +1.
If all the points lie on a straight line falling from the upper left-hand corner to the lower right-hand corner, the correlation is perfect negative. Perfect negative if r = -1.
The nearer the points are to the straight line, the higher the degree of correlation.
The farther the points are from the straight line, the lower the degree of correlation.
If the points are widely scattered and no trend is visible, there is little or no correlation between the variables.
Interpretation
The sample covariance is Cov(x, y) = ∑(x − x̄)(y − ȳ)/n.
i. If the covariance is positive, the relationship is positive.
ii. If the covariance is negative, the relationship is negative.
iii. If the covariance is zero, the variables are said to be uncorrelated.
Hence the covariance measures the strength of linear association between the considered numerical variables.
Thus, covariance is an absolute measure of linear association.
In order to have a relative measure of the relationship, it is necessary to compute the correlation coefficient.
The correlation coefficient, a relation developed by Karl Pearson, is computed as follows:
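The sign rules above can be checked with a minimal Python sketch (not part of the slides; the height/weight figures are hypothetical), computing the sample covariance directly from its definition:

```python
# Sample covariance: Cov(x, y) = sum((x - x̄)(y - ȳ)) / n
def covariance(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n

heights = [150, 160, 170, 180]  # hypothetical data moving together
weights = [52, 58, 66, 74]
print(covariance(heights, weights))  # positive → positive relationship
```

Because both variables increase together, every deviation product is positive and the covariance comes out positive, matching rule (i).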
Contd…
The sample correlation coefficient (r) is calculated by the following relation:
r = ∑(x − x̄)(y − ȳ) / √[∑(x − x̄)² · ∑(y − ȳ)²]
  = (n∑xy − ∑x∑y) / √[(n∑x² − (∑x)²)(n∑y² − (∑y)²)]
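The computational form of Pearson's formula translates directly into code. A minimal Python sketch (not from the slides; the data are hypothetical and chosen to be exactly linear, so r should be 1):

```python
import math

# Karl Pearson's r via the computational formula:
# r = (nΣxy − ΣxΣy) / sqrt((nΣx² − (Σx)²)(nΣy² − (Σy)²))
def pearson_r(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly linear data → 1.0
```

With y = 2x the points lie on a rising straight line, so the computed value is +1, the perfect-positive case described earlier.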
Properties of Karl Pearson's Correlation Coefficient
1. The coefficient of correlation r is always a number between -1 and +1 inclusive.
2. If r = +1 or -1, the sample points lie on a straight line.
3. If r is near +1 or -1, there is a strong linear association between the variables.
4. If r is small (close to zero), there is a low degree of correlation between the variables.
5. The coefficient of correlation is the geometric mean of the two regression coefficients. Symbolically: r = √(bₓᵧ · bᵧₓ)
Note: The correlation coefficient is a measure of the degree to which the association between the two variables approaches a linear functional relationship.
Interpretation of Correlation Coefficient
i. The coefficient of correlation, as obtained by the above formula, always lies between -1 and +1.
ii. When r = +1, there is perfect positive correlation between the variables.
iii. When r = -1, there is perfect negative correlation between the variables.
iv. When r = 0, there is no correlation.
v. When r = 0.7 to 0.999, there is a high degree of correlation.
vi. When r = 0.5 to 0.699, there is a moderate degree of correlation.
vii. When r is less than 0.5, there is a low degree of correlation.
viii. The value of the correlation coefficient lies between -1 and +1, i.e. -1 ⩽ r ⩽ +1, and is a pure number independent of the units of measurement.
Coefficient of Determination
The coefficient of determination (r²) is the square of the coefficient of correlation.
It is a measure of the strength of the relationship between two variables.
It is subject to more precise interpretation because it can be presented as a proportion or as a percentage.
The coefficient of determination gives the ratio of the explained variance to the total variance:
r² = Explained variance ÷ Total variance
Thus, the coefficient of determination shows what proportion of the variability or change in the dependent variable is accounted for by the variability of the independent variable.
Example
Example 1: If r = 0.8, then r² = (0.8)² = 0.64, or 64%. This means that 64% of the variation in the dependent variable is explained by the variation in the independent variable.
[Figure: the simple linear regression line annotated with the Y-intercept β₀, the slope β₁ (the change in y per one-unit change in x), the estimated value of y when x = x₀, and the error term.]
Estimation of Regression Equation
Regression model: Y = β₀ + β₁X + Ɛ
Regression equation: Y = β₀ + β₁X, with unknown parameters β₀ and β₁.
Sample data: the pairs (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ) provide the sample statistics b₀ and b₁, which are estimates of β₀ and β₁.
Estimated regression equation: ŷ = b₀ + b₁x
Model
Linear regression model: Y = β₀ + β₁X + Ɛ
Linear regression equation: Y = β₀ + β₁X
Sample regression model: y = b₀ + b₁x + e
Sample regression equation: ŷ = b₀ + b₁x
where b₀ = sample y-intercept,
b₁ = sample slope coefficient,
x = independent variable,
y = dependent variable,
ŷ = estimated value of the dependent variable for a given value of the independent variable,
e = error term = y − ŷ.
Least Squares Graphically
[Figure: observed points with residuals e₁, …, e₅ measured vertically from the fitted line ŷᵢ = b₀ + b₁xᵢ; each observation satisfies yᵢ = b₀ + b₁xᵢ + eᵢ.]
Least Squares Method
• Let ŷ = b₀ + b₁x ………(1) be the estimated linear regression equation of y on x for the regression equation Y = β₀ + β₁X.
• By the principle of least squares, the two normal equations for regression equation (1) are:
• ∑y = nb₀ + b₁∑x ………(2)
• ∑xy = b₀∑x + b₁∑x² ………(3)
• Solving equations (2) and (3) gives the value of the slope b₁ as:
b₁ = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)
• The computational formula for the y-intercept b₀ is:
b₀ = ȳ − b₁x̄
• After finding the values of b₀ and b₁, we get the required fitted regression model of y on x as ŷ = b₀ + b₁x.
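The closed-form solution of the two normal equations can be sketched in a few lines of Python (not from the slides; the data are hypothetical and lie exactly on y = 1 + 2x):

```python
# Solve the normal equations for b0 and b1 in closed form:
# b1 = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²),  b0 = ȳ − b1·x̄
def fit_line(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b0 = sy / n - b1 * sx / n
    return b0, b1

b0, b1 = fit_line([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])  # data from y = 1 + 2x
print(b0, b1)  # → 1.0 2.0
```

Because the points are exactly linear, the fitted intercept and slope recover the generating values 1 and 2.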
Measures of Variation
• There are three measures of variation. They are as follows:
i. Total Sum of Squares (SST): It is a measure of the variation of the values of the dependent variable (y) around their mean value (ȳ). That is,
• SST = ∑(y − ȳ)² = ∑y² − (∑y)²/n = ∑y² − n·ȳ².
• Note: The total sum of squares, or total variation, divides into the sum of two components: the explained variation, due to the relationship between the dependent variable (y) and the independent variable (x), and the unexplained variation, which arises from factors other than the relationship between x and y.
Contd…
ii. Regression Sum of Squares (SSR): The regression sum of squares is the sum of the squared differences between the predicted values of y and the mean value of y:
• SSR = ∑(ŷ − ȳ)² = b₀∑y + b₁∑xy − (∑y)²/n = b₀∑y + b₁∑xy − n·ȳ²
iii. Error Sum of Squares (SSE): The error sum of squares is the sum of the squared differences between the observed values of y and the predicted values of y, i.e.
• SSE = ∑(y − ŷ)² = ∑y² − b₀∑y − b₁∑xy.
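The three sums of squares can be computed directly from their definitions. A minimal Python sketch (not from the slides; the data are hypothetical, and b₀ = 0.5, b₁ = 1.4 are the least-squares values for that data, which is required for the decomposition to hold):

```python
# SST, SSR, SSE from their definitions for a fitted line ŷ = b0 + b1·x
def variation(xs, ys, b0, b1):
    n = len(ys)
    y_bar = sum(ys) / n
    y_hat = [b0 + b1 * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))
    return sst, ssr, sse

sst, ssr, sse = variation([1, 2, 3, 4], [2, 3, 5, 6], b0=0.5, b1=1.4)
print(sst, ssr, sse)
print(abs(sst - (ssr + sse)) < 1e-9)  # the identity SST = SSR + SSE holds
```

Here SST = 10, SSR = 9.8, SSE = 0.2, illustrating the split of total variation into explained and unexplained parts.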
Contd…
[Figure: the deviation of each observed y from ȳ (SST) split into the part explained by the fitted line (SSR) and the residual part about the line (SSE).]
Contd…
Relationship: From the above figures, SST, SSR, and SSE are related as follows:
SST = SSR + SSE ………(i)
where: SST = total sum of squares
SSR = regression sum of squares
SSE = error sum of squares
• The fit of the estimated regression line is best when every value of the dependent variable y falls on the regression line.
Contd…
• If SSE = 0, i.e. e = (y − ŷ) = 0 for every observation, then SST = SSR.
• For a perfect fit of the regression model, the ratio of SSR to SST must equal unity, i.e. if SSE = 0 then the model is perfect.
• The larger the SSE, the poorer the fit of the regression line.
• Note: a large SSE means the regression line fits poorly, and SSE = 0 means the regression line fits perfectly.
Coefficient of Determination (r²)
• The coefficient of determination measures the strength or extent of the association between the dependent variable (y) and the independent variable (x).
• It measures the proportion of variation in the dependent variable (y) that is explained by the independent variable of the regression line.
• The coefficient of determination measures the part of the total variation in the dependent variable due to variation in the independent variable, and it is denoted by r².
• r² = SSR/SST, but SST = SSE + SSR,
• so SSR = SST − SSE and
• r² = 1 − (SSE/SST) = (b₀∑y + b₁∑xy − n·ȳ²)/(∑y² − n·ȳ²).
Contd…
• Note:
i. The coefficient of determination is the square of the coefficient of correlation, so r = ±√r².
ii. If the regression coefficient (b₁) is negative, take the negative sign.
iii. If the regression coefficient (b₁) is positive, take the positive sign.
• Adjusted coefficient of determination: For simple linear regression, the adjusted coefficient of determination is calculated by the following relation:
• r²(adj) = 1 − (1 − r²)(n − 1)/(n − 2)
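Both r² = 1 − SSE/SST and the adjusted version for simple regression (n − 2 in the denominator, one predictor) can be sketched in Python. The SSE and SST values below are hypothetical, not from the slides:

```python
# Coefficient of determination and its adjusted form (simple regression)
def r_squared(sse, sst):
    return 1 - sse / sst

def adjusted_r_squared(r2, n):
    # penalizes r² for sample size; for one predictor, df = n - 2
    return 1 - (1 - r2) * (n - 1) / (n - 2)

r2 = r_squared(sse=0.2, sst=10.0)
print(r2)                         # ≈ 0.98, i.e. 98% of variation explained
print(adjusted_r_squared(r2, 4))  # slightly smaller than the raw r²
```

The adjusted value is always at most the raw r², reflecting the loss of degrees of freedom in a small sample.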
The Standard Error of Estimate
The standard error of estimate of Y on X, denoted by Sᵧᵪ, measures the average variation or scatter of the observed data points around the regression line. It is used to measure the reliability of the regression equation. It is calculated by the following relation:
Sᵧᵪ = √(SSE/(n − 2)) = √((∑y² − b₀∑y − b₁∑xy)/(n − 2))
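A minimal Python sketch of Sᵧᵪ (not from the slides), computing SSE from residuals of a fitted line; the data and fitted values below are hypothetical (ŷ = 0.5 + 1.4x at x = 1…4):

```python
import math

# Standard error of estimate: S_yx = sqrt(SSE / (n - 2))
def std_error_estimate(ys, y_hat):
    n = len(ys)
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))
    return math.sqrt(sse / (n - 2))

ys = [2, 3, 5, 6]
y_hat = [1.9, 3.3, 4.7, 6.1]  # fitted values from ŷ = 0.5 + 1.4x
print(std_error_estimate(ys, y_hat))  # small value → points lie close to the line
```

The n − 2 divisor reflects the two estimated parameters (b₀ and b₁) used up in fitting the line.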
Test of Significance of the Regression Coefficient in the Simple Linear Regression Model
• To test the significance of the regression coefficient of the simple linear regression model Y = β₀ + β₁X + Ɛ, the following statistical tests are applied:
i. t-test for significance in simple linear regression.
ii. F-test for significance in simple linear regression.
(i) t-test for Significance in Simple Linear Regression
• The t-test is applied to determine whether the regression coefficient β₁ is statistically significant or not.
• The hypotheses are set as follows:
• Null hypothesis, H₀: β₁ = 0 (the population slope between the two variables X and Y is zero).
• Alternative hypothesis, H₁: β₁ ≠ 0 (the population slope between the two variables X and Y is not zero), or H₁: β₁ > 0, or H₁: β₁ < 0.
Contd…
• Test statistic: Under H₀, the test statistic is:
t = b₁ / S(b₁), which follows the t-distribution with (n − 2) degrees of freedom,
• where S(b₁) = Sᵧᵪ / √(∑x² − n·x̄²) is the standard error of b₁.
• Decisions (F-test):
i. If Fcal < Ftab at the α% level of significance, with 1 d.f. in the numerator and (n − 2) d.f. in the denominator, then we accept H₀.
ii. If Fcal > Ftab at the α% level of significance, with 1 d.f. in the numerator and (n − 2) d.f. in the denominator, then we reject H₀.
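The t statistic for the slope, t = b₁/S(b₁) with S(b₁) = Sᵧᵪ/√(∑x² − n·x̄²), can be sketched as follows (not from the slides; the data are hypothetical, and the computed t would still need to be compared with a t-table value at the chosen α with n − 2 d.f.):

```python
import math

# t statistic for H0: β1 = 0, given a least-squares fit (b0, b1)
def slope_t_statistic(xs, ys, b0, b1):
    n = len(xs)
    x_bar = sum(xs) / n
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_yx = math.sqrt(sse / (n - 2))                      # standard error of estimate
    s_b1 = s_yx / math.sqrt(sum(x * x for x in xs) - n * x_bar ** 2)
    return b1 / s_b1

# b0 = 0.5, b1 = 1.4 are the least-squares estimates for this data
t = slope_t_statistic([1, 2, 3, 4], [2, 3, 5, 6], b0=0.5, b1=1.4)
print(round(t, 3))
```

A large |t| relative to the tabulated critical value leads to rejecting H₀: β₁ = 0, i.e. concluding the slope is significant.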
Interval Estimates for Different Values of x
[Figure: plotted against X, the confidence interval for the mean of Y and the wider prediction interval for an individual Y, with x̄ marked on the axis for a given X.]
“THANK YOU”