Correlation Regression
Session 17 - 24
OVERVIEW
• This is the case in the antipollution example above. But in many cases,
other factors cause the changes in both the dependent and the independent
variables.
• For this reason, it is important that you consider the relationships found by
regression to be relationships of association but not necessarily of cause
and effect.
Scatter Diagrams
• A scatter diagram can give us two types of information.
• Visually, we can look for patterns that indicate that the variables are related.
• Then, if the variables are related, we can see what kind of line, or
estimating equation, describes this relationship.
• When the relationship described by the data points is well described by a
straight line, we say that it is a linear relationship.
• The relationship between the X and Y variables can also take the form of a
curve. Statisticians call such a relationship curvilinear.
ESTIMATION USING THE REGRESSION LINE
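• The value of a (the Y-intercept of the estimating line Y = a + bX) can be found by locating the point where the line crosses the Y-axis.
• The value of b (the slope) can be found from any two points $(X_1, Y_1)$ and $(X_2, Y_2)$ on the line by using this equation:
\[ b = \frac{Y_2 - Y_1}{X_2 - X_1} \]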
THE METHOD OF LEAST SQUARES
• How can we fit a line mathematically if none of the points lies on the line?
• The line has a good fit if it minimizes the error between the estimated points on the line and the
actual observed points that were used to draw it.
• The points that lie on the estimating line are represented as $\hat{Y}$ (Y hat).
THE LEAST-SQUARES CRITERION
• The least-squares criterion requires that the sum of the squared deviations between the y values in the
scatter diagram and the y values predicted by the equation be minimized. In symbolic terms:
\[ \text{minimize } \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
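The criterion can be checked numerically. The sketch below is only an illustration: the data are a hypothetical toy set (not from these slides), and the function name `sum_squared_error` is made up for the example. It compares the sum of squared deviations for the least-squares line of the toy data with that of an arbitrary alternative line.

```python
def sum_squared_error(a, b, xs, ys):
    """Sum of squared vertical deviations between observed y and a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# a = 2.2, b = 0.6 is the least-squares line for this toy data;
# a = 2.0, b = 0.7 is an arbitrary nearby line used for comparison.
sse_least_squares = sum_squared_error(2.2, 0.6, xs, ys)
sse_other = sum_squared_error(2.0, 0.7, xs, ys)

print(sse_least_squares < sse_other)
```

Any other choice of a and b for this data should likewise give a sum of squared deviations no smaller than that of the least-squares line.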
LEAST SQUARES REGRESSION LINE
DETERMINING THE LEAST-SQUARES REGRESSION LINE
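\[ y_i = a_1 + b_1 x_i + e_i \qquad\qquad x_i = a_2 + b_2 y_i + e_i \]
\[ b_{y \text{ on } x} \; (= b_1) = \frac{\operatorname{cov}(x, y)}{\sigma_x^2} \qquad\qquad b_{x \text{ on } y} \; (= b_2) = \frac{\operatorname{cov}(x, y)}{\sigma_y^2} \]
\[ b_{y \text{ on } x} = \frac{\left(\sum x_i y_i\right) - n\,\bar{x}\,\bar{y}}{\left(\sum x_i^2\right) - n\,\bar{x}^2} \qquad\qquad b_{x \text{ on } y} = \frac{\left(\sum x_i y_i\right) - n\,\bar{x}\,\bar{y}}{\left(\sum y_i^2\right) - n\,\bar{y}^2} \]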
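The two slope formulas can be sketched in code as follows. This is a minimal illustration rather than a reference implementation: the function name `regression_coefficients` and the toy data are assumptions, and the intercepts use the standard relation that each least-squares line passes through the point of means.

```python
def regression_coefficients(xs, ys):
    """Slopes and intercepts of the y-on-x and x-on-y least-squares lines."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator shared by both slopes: sum(x*y) - n * x_bar * y_bar
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    sxx = sum(x * x for x in xs) - n * x_bar ** 2
    syy = sum(y * y for y in ys) - n * y_bar ** 2
    b_y_on_x = sxy / sxx
    b_x_on_y = sxy / syy
    return {
        "b_y_on_x": b_y_on_x,
        "a_y_on_x": y_bar - b_y_on_x * x_bar,  # line passes through the means
        "b_x_on_y": b_x_on_y,
        "a_x_on_y": x_bar - b_x_on_y * y_bar,
    }


# Hypothetical toy data, chosen only for illustration.
coeffs = regression_coefficients([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(coeffs)
```

Note that the two lines differ unless the points lie exactly on one straight line: regressing y on x minimizes vertical deviations, while regressing x on y minimizes horizontal ones.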
• Figure: scatter diagram and least-squares regression line
EXAMPLE: LEAST-SQUARES METHOD
PRICE DATA
Dec-21 6
Jan-22 6.5
Feb-22 5.8
Mar-22 5.2
Apr-22 6.8
May-22 7.4
Jun-22 6
Jul-22 5.6
Aug-22 7.5
Sep-22 7.8
Oct-22 6.3
Nov-22 5.9
Dec-22 8
Jan-23 8.4
DATA
REGRESSION EQUATION
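As a rough sketch, a least-squares trend line can be fitted to the price data listed above. The independent variable is not stated in the surviving text, so the code assumes a simple time index t = 1, 2, ..., 14 (Dec-21 through Jan-23). That coding is an assumption made for this illustration, and the variable names are made up.

```python
# Monthly prices from the PRICE DATA table (Dec-21 through Jan-23).
prices = [6, 6.5, 5.8, 5.2, 6.8, 7.4, 6, 5.6, 7.5, 7.8, 6.3, 5.9, 8, 8.4]

# Assumption: time is coded as 1, 2, ..., 14; the slides do not state the coding.
t = list(range(1, len(prices) + 1))

n = len(prices)
t_bar = sum(t) / n
p_bar = sum(prices) / n

# Slope: (sum(t*p) - n * t_bar * p_bar) / (sum(t^2) - n * t_bar^2)
slope = (sum(ti * pi for ti, pi in zip(t, prices)) - n * t_bar * p_bar) / (
    sum(ti * ti for ti in t) - n * t_bar ** 2
)
# Intercept: the fitted line passes through (t_bar, p_bar).
intercept = p_bar - slope * t_bar

print(f"price_hat = {intercept:.3f} + {slope:.3f} * t")
```

A different coding of time (for example, centering t at zero) would change the intercept but not the fitted values.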
• The sample covariance measures the strength of the linear relationship between two
variables (called bivariate data).
\[ \operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} \]
• Only concerned with the strength of the relationship
• No causal effect is implied.
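A minimal sketch of the sample covariance with the n - 1 divisor is shown below. The function name `sample_covariance` and the toy data are illustrative assumptions, not part of the slides.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical toy data, chosen only for illustration.
cov_xy = sample_covariance([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(cov_xy)
```

The sign shows the direction of the linear relationship, but the magnitude depends on the units of X and Y, which is why the unit-free coefficient of correlation is usually reported alongside it.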
INTERPRETING COVARIANCE
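• The coefficient of correlation:
\[ R = \frac{\operatorname{cov}(X, Y)}{S_X S_Y} \]
• where
\[ \operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}, \qquad S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}, \qquad S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}} \]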
• Unit free
• Ranges between –1 and 1
• The closer to –1, the stronger the negative linear
relationship
• The closer to 1, the stronger the positive linear relationship
• Equal to 1, perfect correlation
• Equal to 0, no correlation
For a given series of paired data, the following information is available:
Covariance between X and Y series = -17.8
Standard deviation of X series = 6.6
Standard deviation of Y series = 4.2
No. of pairs of observations = 20
Calculate the coefficient of correlation.
r = cov(X, Y) / (S_X × S_Y) = -17.8 / (6.6 × 4.2) = -17.8 / 27.72 ≈ -0.642
Thus, the variables are negatively correlated.
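The same arithmetic can be sketched in code. The function name `correlation_from_summary` is an assumption made for this example. The number of pairs (20) is not needed once the covariance and standard deviations are given.

```python
def correlation_from_summary(cov_xy, sd_x, sd_y):
    """Coefficient of correlation from a covariance and two standard deviations."""
    if sd_x <= 0 or sd_y <= 0:
        raise ValueError("standard deviations must be positive")
    return cov_xy / (sd_x * sd_y)


# Values from the worked example: cov = -17.8, SD of X = 6.6, SD of Y = 4.2.
r = correlation_from_summary(-17.8, 6.6, 4.2)
print(round(r, 3))  # -0.642
```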
RANK CORRELATION
• The coefficient of determination is the portion of the total variation in the dependent
variable that is explained by variation in the independent variable.
• The coefficient of determination is also called r-squared and is denoted as R².
EXAMPLES OF APPROXIMATE R² VALUES
• R² = 1
• 0 < R² < 1
• R² = 0
(Figure: scatter plots of Y against X, one panel for each case.)
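The three cases can be reproduced with small data sets. The sketch below assumes simple linear regression, where R² equals the square of the coefficient of correlation. The function name `r_squared` and all three toy series are hypothetical.

```python
def r_squared(xs, ys):
    """R^2 for simple linear regression, computed as the squared correlation."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined when either series is constant")
    return sxy ** 2 / (sxx * syy)


# Hypothetical toy data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
r2_perfect = r_squared(xs, [3, 5, 7, 9, 11])  # points exactly on y = 1 + 2x
r2_partial = r_squared(xs, [2, 4, 5, 4, 5])   # upward trend with scatter
r2_zero = r_squared(xs, [1, 2, 3, 2, 1])      # no linear relationship

print(r2_perfect, r2_partial, r2_zero)
```

The last series has a clear pattern (it rises and then falls), yet its R² is zero: R² measures only how much variation a straight line explains.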
• Making point estimates based on the regression line is simply a matter of substituting a
known or assumed value of x into the equation, then calculating the estimated value of
y.
• For example, if a job applicant were to score x = 15 on the manual dexterity test, we
would predict that this person would be capable of producing 64.2 units per hour on the
assembly line.
DEGREES OF FREEDOM
Total DF = N - 1
MS = SS / DF
F = MSR / MSE
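These relations can be sketched as a small helper. Only the total degrees of freedom, MS = SS / DF, and F = MSR / MSE come from the slide. The split into regression DF = k and error DF = N - k - 1 (with k = 1 predictor for simple regression) is the standard convention and is assumed here, as are the function name `anova_summary` and the input values.

```python
def anova_summary(ssr, sse, n, k=1):
    """Degrees of freedom, mean squares, and F for a regression with k predictors."""
    df_regression = k          # assumed standard convention
    df_error = n - k - 1       # assumed standard convention
    df_total = n - 1           # Total DF = N - 1
    if df_error <= 0:
        raise ValueError("not enough observations for the error degrees of freedom")
    msr = ssr / df_regression  # MS = SS / DF
    mse = sse / df_error
    return {
        "df_regression": df_regression,
        "df_error": df_error,
        "df_total": df_total,
        "MSR": msr,
        "MSE": mse,
        "F": msr / mse,        # F = MSR / MSE
    }


# Hypothetical sums of squares for a data set with 5 observations.
table = anova_summary(ssr=3.6, sse=2.4, n=5)
print(table)
```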
MEASURES OF VARIATION
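\[ SSE = \sum (Y_i - \hat{Y}_i)^2 \qquad SST = \sum (Y_i - \bar{Y})^2 \qquad SSR = \sum (\hat{Y}_i - \bar{Y})^2 \]
(Figure: for a given $X_i$, the observed value $Y_i$, the fitted value $\hat{Y}_i$ on the regression line, and the mean $\bar{Y}$, plotted with X on the horizontal axis and Y on the vertical axis.)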
Measures of Variation
• SST = total sum of squares (Total Variation)
• Measures the variation of the $Y_i$ values around their mean $\bar{Y}$
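The sums of squares can be computed directly from a fitted line. The sketch below is an illustration under assumptions: the function name `variation_measures` and the toy data are hypothetical, and the line is fitted by least squares of y on x. For such a fit the total variation splits into an explained and an unexplained part, SST = SSR + SSE.

```python
def variation_measures(xs, ys):
    """Return (SST, SSR, SSE) for the least-squares line of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    y_hat = [intercept + slope * x for x in xs]

    sst = sum((y - y_bar) ** 2 for y in ys)               # total variation
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained by the line
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # unexplained
    return sst, ssr, sse


# Hypothetical toy data, chosen only for illustration.
sst, ssr, sse = variation_measures([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(sst, ssr, sse, ssr / sst)
```

The final ratio SSR / SST is the coefficient of determination for the fitted line.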