
CORRELATION AND REGRESSION

Session 17 - 24
OVERVIEW

• Regression and correlation analyses are based on the relationship, or association,
between two (or more) variables. The known variable (or variables) is called the
independent variable(s). The variable we are trying to predict is the dependent
variable.
• Regression analysis provides a “best-fit” mathematical equation for the values of
the two variables.
• The equation may be linear (a straight line) or curvilinear, but we will be
concentrating on the linear type.
• Correlation analysis measures the strength of the relationship between the
variables.
• Let the two variables be y and x. These are called the dependent (y) and
independent (x) variables, since a typical purpose for this type of analysis is to
estimate or predict what y will be for a given value of x.
• Relationships can be direct where the dependent variable increases as the independent
variable increases.
• Relationships can also be inverse rather than direct. In these cases, the dependent variable
decreases as the independent variable increases.
• There can be a causal relationship between variables; that is, the
independent variable “causes” the dependent variable to change.

• This is the case in some situations, but in many cases other factors cause
the changes in both the dependent and the independent variables.

• For this reason, it is important that you consider the relationships found by
regression to be relationships of association but not necessarily of cause
and effect.
Scatter Diagrams
• A scatter diagram can give us two types of information.
• Visually, we can look for patterns that indicate that the variables are related.
• Then, if the variables are related, we can see what kind of line, or
estimating equation, describes this relationship.
• If the relationship shown by the data points is well described by a straight
line, we say that the relationship is linear.
• The relationship between the X and Y variables can also take the form of a
curve. Statisticians call such a relationship curvilinear.
ESTIMATION USING THE
REGRESSION LINE

• The equation for a straight line where the dependent variable Y is
determined by the independent variable X is:

Ŷ = a + bX        (Equation 12-1)

where a is the Y-intercept and b is the slope of the line.
• Suppose we know that a is 3 and b is 2. Let us determine what Y would be for an X equal to 5.
Substituting the values of a, b, and X into Equation 12-1, we find the corresponding value of Y to be
Ŷ = 3 + 2(5) = 13.
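As a minimal illustration, the substitution above can be written as a small Python function; the values a = 3, b = 2, and X = 5 are the ones from the worked example.

# Evaluate the estimating equation Y-hat = a + b*X for given a, b, and X.
def estimate_y(a: float, b: float, x: float) -> float:
    return a + b * x

print(estimate_y(a=3, b=2, x=5))  # prints 13, matching the worked example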
USING THE ESTIMATION EQUATION FOR A STRAIGHT LINE
FINDING THE VALUES FOR a AND b

• The value of a (the Y-intercept) can be found by locating the point where the line crosses the Y-axis.
• The value of b (the slope) can be found by taking any two points (X₁, Y₁) and (X₂, Y₂) on the line and using:

b = (Y₂ − Y₁) / (X₂ − X₁)
THE METHOD OF LEAST SQUARES
• How can we fit a line mathematically if none of the points lies on the line?
• The line will have a good fit, if it minimizes the error between the estimated points on the line and the
actual observed points that were used to draw it.
• The points that lie on the estimating line are represented as Ŷ (read “Y-hat”).
THE LEAST-SQUARES CRITERION

• The least-squares criterion requires that the sum of the squared deviations between the Y values in the
scatter diagram and the Ŷ values predicted by the equation be minimized. In symbolic terms:

minimize Σ(Yᵢ − Ŷᵢ)²
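To make the criterion concrete, here is a minimal Python sketch (assuming NumPy and SciPy are available; the x and y arrays are made-up illustration values, not data from the slides). It minimizes the sum of squared deviations numerically and compares the result with the closed-form least-squares slope and intercept given on the next slide.

# Least-squares criterion: choose a and b to minimize SSE = sum((y_i - (a + b*x_i))**2).
import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def sse(params):
    a, b = params
    return np.sum((y - (a + b * x)) ** 2)

result = minimize(sse, x0=[0.0, 0.0])                        # numerical minimization of SSE
b_closed = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)    # closed-form slope
a_closed = y.mean() - b_closed * x.mean()                    # closed-form intercept
print(result.x, (a_closed, b_closed))                        # both give (approximately) the same line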
LEAST SQUARES REGRESSION LINE
DETERMINING THE LEAST-SQUARES
REGRESSION LINE

yᵢ = a₁ + b₁xᵢ + eᵢ        xᵢ = a₂ + b₂yᵢ + eᵢ

b (y on x) (= b₁) = cov(x, y) / σx²        b (x on y) (= b₂) = cov(x, y) / σy²

b (y on x) = (Σxᵢyᵢ − n x̄ ȳ) / (Σxᵢ² − n x̄²)        b (x on y) = (Σxᵢyᵢ − n x̄ ȳ) / (Σyᵢ² − n ȳ²)
[Figure: Scatter diagram and least-squares regression line]
EXAMPLE: LEAST SQUARE METHOD

Year    Sales (Crores)
2015    76
2016    80
2017    130
2018    144
2019    138
2020    120
2021    174
2022    190
EXAMPLE: LEAST SQUARE METHOD

Year (tᵢ)    Sales Yᵢ (Crores)    Xᵢ = tᵢ − t̄    XᵢYᵢ     Xᵢ²
2015         76                   -3.5           -266     12.25
2016         80                   -2.5           -200      6.25
2017         130                  -1.5           -195      2.25
2018         144                  -0.5            -72      0.25
2019         138                   0.5             69      0.25
2020         120                   1.5            180      2.25
2021         174                   2.5            435      6.25
2022         190                   3.5            665     12.25
t̄ = 2018.5                        ΣXᵢ = 0        ΣXᵢYᵢ = 616    ΣXᵢ² = 42
EXAMPLE: LEAST SQUARE METHOD

• Solution: Ŷᵢ = a + bXᵢ = 131.5 + 14.67 Xᵢ
• b = ΣXᵢYᵢ / ΣXᵢ² = 616 / 42 = 14.67
• a = Ȳ = 131.5
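A short Python sketch reproducing this example (assuming NumPy is available); the year coding Xᵢ = tᵢ − t̄ follows the table above.

# Reproduce the least-squares trend for the sales example: Y-hat = 131.5 + 14.67 * X.
import numpy as np

years = np.arange(2015, 2023)                   # 2015 .. 2022
sales = np.array([76, 80, 130, 144, 138, 120, 174, 190], dtype=float)

x = years - years.mean()                        # deviations -3.5, -2.5, ..., 3.5
b = np.sum(x * sales) / np.sum(x ** 2)          # slope = 616 / 42 = 14.67
a = sales.mean()                                # intercept = Y-bar = 131.5 (since sum of x is 0)
print(f"Y-hat = {a:.1f} + {b:.2f} * X")         # Y-hat = 131.5 + 14.67 * X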
EXAMPLE: STOCK PRICE DATA

Month     Stock Price (00)
Oct-21    4.8
Nov-21    4.1
Dec-21    6.0
Jan-22    6.5
Feb-22    5.8
Mar-22    5.2
Apr-22    6.8
May-22    7.4
Jun-22    6.0
Jul-22    5.6
Aug-22    7.5
Sep-22    7.8
Oct-22    6.3
Nov-22    5.9
Dec-22    8.0
Jan-23    8.4
REGRESSION EQUATION

The regression equation is:

Stock price = 4.8525 + 0.17985 × t

where t is the time period, coded t = 1 for Oct-21 through t = 16 for Jan-23.
R, R², AND ADJUSTED R²
MEASURES OF ASSOCIATION

• The sample covariance measures the strength of the linear relationship between two
variables (called bivariate data)

• The sample covariance:

cov(X, Y) = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1)
• Only concerned with the strength of the relationship
• No causal effect is implied.
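A minimal Python sketch of the sample covariance formula above; the two arrays are made-up illustration values, not data from the slides.

# Sample covariance: cov(X, Y) = sum((X_i - X_bar)(Y_i - Y_bar)) / (n - 1).
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.5, 3.1, 4.9, 7.2, 8.8])

cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
cov_numpy = np.cov(x, y, ddof=1)[0, 1]          # same quantity via NumPy's covariance matrix
print(cov_manual, cov_numpy)                    # the two values agree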
INTERPRETING COVARIANCE

• Covariance between two random variables:

 cov(X, Y) > 0: X and Y tend to move in the same direction
 cov(X, Y) < 0: X and Y tend to move in opposite directions
 cov(X, Y) = 0: X and Y have no linear relationship (they are uncorrelated, though not necessarily independent)

• It is not possible to determine the relative strength of the relationship
from the size of the covariance, since it depends on the units of X and Y.
CORRELATION

• Correlation refers to the sympathetic movement of variables, either in the
same or in opposite directions.
• Measures the relative strength of the linear relationship between two
variables.
• Simple correlation deals with co-variation of two variables
• Multiple and partial correlations involve a study of co-variation between
more than two variables.
• The relationship between variables is established and measured
quantitatively with a view to making estimates based on them.
CORRELATION

• Correlation between variables may be of varying degrees: from perfect at one
extreme down to high, moderate, low, and no correlation at the other.
• Correlation may be linear or non-linear.
• Graphically, correlation is studied by means of a scatter diagram.
• If the dots representing pairs of data values fall on a straight line, the
correlation is perfect. The degree of correlation decreases as the points lie
farther and farther away from the line. An upward movement of the points with a
rightward movement along the horizontal axis indicates positive correlation, while
a downward movement indicates negative correlation. Widely scattered dots with no
clear direction, or dots in a line parallel to either axis, mean an absence of
correlation.
KARL PEARSON’S COEFFICIENT OF
CORRELATION

• Numerically, the correlation is measured and expressed in terms of Karl
Pearson's coefficient of correlation.
• It is defined as the ratio of covariance to the product of standard
deviations of the two series involved.
• Its sign indicates the direction, and its magnitude measures the degree
of correlation.
• The coefficient of correlation varies between ±1
• It is independent of the change of origin and scale.
COEFFICIENT OF CORRELATION

• Sample coefficient of correlation:

R = cov(X, Y) / (Sx · Sy)

• where

cov(X, Y) = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1)

Sx = √[ Σᵢ₌₁ⁿ (Xᵢ − X̄)² / (n − 1) ]        Sy = √[ Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² / (n − 1) ]
FEATURES OF CORRELATION
COEFFICIENT, R

• Unit free
• Ranges between –1 and 1
• The closer to –1, the stronger the negative linear
relationship
• The closer to 1, the stronger the positive linear relationship
• Equal to 1, perfect correlation
• Equal to 0, no correlation
For a given series of paired data, the following information is available:
Covariance between X and Y series = -17.8
Standard deviation of X series = 6.6
Standard deviation of Y series = 4.2
No. of pairs of observations = 20
Calculate the coefficient of correlation.

r = cov(X, Y) / (Sx · Sy) = −17.8 / (6.6 × 4.2) = −17.8 / 27.72 = −0.642
Thus, the variables are negatively correlated.
RANK CORRELATION

• Rank correlation is calculated essentially where the variables under
consideration cannot be quantified, being measured on an ordinal scale.
• However, it can be calculated even where the variables are objectively
quantifiable.
• This is done by ranking the given data on the basis of the values involved.
• Like Karl Pearson's coefficient of correlation, the rank correlation
coefficient also varies between ±1.
• The presence of extreme observations in the data does not distort the
value of the rank correlation coefficient.
COEFFICIENT OF RANK CORRELATION

COMPARISON OF THE RANKS OF FIVE STUDENTS

GENERATING INFORMATION TO COMPUTE THE RANK-CORRELATION COEFFICIENT
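The rank tables from the slides are not reproduced here, but the standard rank-correlation (Spearman) formula is rₛ = 1 − 6Σd² / (n(n² − 1)), where d is the difference between the paired ranks and n is the number of pairs. A minimal Python sketch follows; the two rank lists are made-up illustration values, not the five-student comparison from the slides.

# Spearman rank correlation: r_s = 1 - 6*sum(d^2) / (n*(n^2 - 1)) (formula for untied ranks).
import numpy as np

rank_x = np.array([1, 2, 3, 4, 5])              # e.g. ranks given by one judge
rank_y = np.array([2, 1, 4, 3, 5])              # e.g. ranks given by a second judge

d = rank_x - rank_y                             # rank differences
n = len(rank_x)
r_s = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))
print(r_s)                                      # 0.8 for this illustration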
COEFFICIENT OF DETERMINATION, R²

• The coefficient of determination is the portion of the total variation in the dependent
variable that is explained by variation in the independent variable.
• The coefficient of determination is also called r-squared and is denoted as R².

R² = SSR / SST = regression sum of squares / total sum of squares

0 ≤ R² ≤ 1
EXAMPLES OF APPROXIMATE R² VALUE

[Figure: scatter plots of Y against X with all points lying exactly on a line]

• R² = 1
• Perfect linear relationship between X and Y:
• 100% of the variation in Y is explained by variation in X
EXAMPLES OF APPROXIMATE R² VALUE

[Figure: scatter plots of Y against X with points loosely clustered around a line]

• 0 < R² < 1
• Weaker linear relationships between X and Y:
• Some but not all of the variation in Y is explained by variation in X
EXAMPLES OF APPROXIMATE R² VALUE

[Figure: scatter plot of Y against X with no pattern]

• R² = 0
• No linear relationship between X and Y.
• The value of Y does not depend on X. (None of the variation in Y is
explained by variation in X.)
ADJUSTED R²

• R-squared increases every time you add an independent variable to the model,
but adjusted R-squared does not always increase.
• The adjusted R-squared value actually decreases when the added term
doesn’t improve the model fit by a sufficient amount.
• It shows how well a regression model makes predictions.

Adjusted R² = 1 − [((1 − R²) × (n − 1)) / (n − k − 1)]
where
n – number of points in your data set
k – number of independent variables in the model, excluding the constant
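A minimal Python sketch of the adjusted R-squared formula above; the R², n, and k values are made-up illustration numbers, not from the slides.

# Adjusted R^2 = 1 - ((1 - R^2) * (n - 1)) / (n - k - 1).
def adjusted_r_squared(r_squared: float, n: int, k: int) -> float:
    return 1 - ((1 - r_squared) * (n - 1)) / (n - k - 1)

print(adjusted_r_squared(r_squared=0.90, n=30, k=3))  # ~0.888, slightly below the plain R^2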
POINT ESTIMATES USING THE
REGRESSION LINE

• Making point estimates based on the regression line is simply a matter of substituting a
known or assumed value of x into the equation, then calculating the estimated value of
y.
• For example, if a job applicant were to score x = 15 on the manual dexterity test, we
would predict this person would be capable of producing 64.2 units per hour on the
assembly line.
DEGREES OF FREEDOM

One independent variable: DF = 1
Total DF = N − 1

In linear regression, the degrees of freedom refer to the number of independent
observations available for estimating the parameters of the regression model.
DOF IN SIMPLE LINEAR REGRESSION

In linear regression, the degrees of freedom refer to the number of independent
observations available for estimating the parameters of the regression model.

Total DOF is N − 1.
Independent variables have k DOF.
DOF for error is N − k − 1.


MEASURES OF VARIATION

SSR = Regression Sum of Squares
SSE = Error Sum of Squares
SST = Total Sum of Squares

MS = SS / DF        F = MSR / MSE
MEASURES OF VARIATION

• Total variation is made up of two parts:

SST = SSR + SSE
(Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares)

SST = Σ(Yᵢ − Ȳ)²        SSR = Σ(Ŷᵢ − Ȳ)²        SSE = Σ(Yᵢ − Ŷᵢ)²

where
Ȳ  = mean value of the dependent variable
Yᵢ = observed value of the dependent variable
Ŷᵢ = predicted value of Y for the given Xᵢ value
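A minimal Python sketch of this decomposition; the x and y arrays are made-up illustration values. It fits the least-squares line, computes SST, SSR, and SSE, and confirms that SST = SSR + SSE and that SSR/SST equals R².

# Decompose the total variation into explained (SSR) and unexplained (SSE) parts.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x                               # predicted values on the regression line

sst = np.sum((y - y.mean()) ** 2)               # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)           # explained variation
sse = np.sum((y - y_hat) ** 2)                  # unexplained variation
print(sst, ssr + sse, ssr / sst)                # SST equals SSR + SSE; SSR/SST is R-squared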


MEASURES OF VARIATION

[Figure: the deviations at a point Xᵢ shown around the regression line — SSE = Σ(Yᵢ − Ŷᵢ)², SST = Σ(Yᵢ − Ȳ)², SSR = Σ(Ŷᵢ − Ȳ)²]
Measures of Variation
• SST = total sum of squares (Total Variation)
• Measures the variation of the Yᵢ values around their mean Ȳ

• SSR = regression sum of squares (Explained Variation)


• Variation attributable to the relationship between X and Y

• SSE = error sum of squares (Unexplained Variation)


• Variation in Y attributable to factors other than X
