Lecture2 Xy 2025
Cost (C)
e.g. C = 24q + 500, i.e. C = aq + b with a = 24 and b = 500
What if the samples do not fit a straight line perfectly?
How do we find the best straight line to fit the data?
Year  Output  Actual Costs
1     50      2500
2     95      1800
3     200     5500
4     120     2800
5     150     4500
[Figure: scatter plot of Cost (0 to 6,000) against Output (0 to 250)]
Learning outcomes for today
Covariance
Correlation
The “best fit” line (Regression)
Further considerations
Recap
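Variance:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Standard deviation:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
Example:
$s = \sqrt{\frac{6^2 + 3^2 + 3^2 + 2^2 + 1^2 + 2^2 + 6^2 + 7^2}{7}} = \sqrt{\frac{148}{7}} \approx 4.60$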
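As a quick check of the arithmetic in this recap, here is a minimal Python sketch. It assumes the eight numbers in the worked example (6, 3, 3, 2, 1, 2, 6, 7) are the deviations of each observation from the mean; the raw data are not shown, and the variable names are illustrative only.

```python
import math

# Deviations (x_i - x_bar) taken from the worked example.
# Assumption: these are deviations from the mean; the raw data are not shown.
deviations = [6, 3, 3, 2, 1, 2, 6, 7]
n = len(deviations)  # 8 observations, so the divisor n - 1 is 7

sum_sq = sum(d ** 2 for d in deviations)  # sum of squared deviations
variance = sum_sq / (n - 1)               # sample variance
std_dev = math.sqrt(variance)             # sample standard deviation

print(f"sum of squares = {sum_sq}, s = {std_dev:.2f}")
```

Dividing by n - 1 rather than n matches the formula used throughout this lecture.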
Excel 1:
=
Covariance
[Figure: three scatter plots of y against x]
Covariance
$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
It is calculated as the sum of the products of the deviations of each pair of data points from their respective means, divided by the number of data points minus one.
[Figure: x-y axes with regions marked + and -]
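The definition can be sketched in a few lines of Python. This example applies it to the output and cost figures from the table at the start of the lecture; the function name `sample_covariance` is an illustrative choice, not something from the slides.

```python
# Output (x) and actual cost (y) for the five years in the lecture table.
output = [50, 95, 200, 120, 150]
cost = [2500, 1800, 5500, 2800, 4500]


def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


cov_xy = sample_covariance(output, cost)
print("cov(output, cost) =", cov_xy)
```

A positive result indicates that output and cost tend to move in the same direction.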
Example -- Calculating covariance
John does not want to increase the unsystematic risk of his portfolio, so he
does not want to own securities that move in the same direction.
Calculating covariance
4. Multiply the results (a and b) to obtain the product of the deviations, $(x_i - \bar{x})(y_i - \bar{y})$, for each pair of data points, then sum the products.
Excel 2:
Quiz 1
Correlations
Correlation coefficient: -1 ≤ R ≤ 1
[Figure: three scatter plots showing R ≈ 1, R ≈ -1 and R ≈ 0]
Calculating correlation R
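$R_{xy} = R_{yx} = R = \frac{\mathrm{cov}(x, y)}{s_x s_y} = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} \, \sqrt{n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}}$
where
$s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
$s_y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$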
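The formula for R is given in two forms: the covariance divided by the two standard deviations, and an expression in raw sums. The step connecting them is not shown, so here is a short derivation sketch. It uses only the definitions already given; all sums run over i = 1 to n.

```latex
\begin{align*}
\sum (x_i - \bar{x})(y_i - \bar{y})
  &= \sum x_i y_i - \frac{1}{n} \sum x_i \sum y_i
   = \frac{1}{n}\Bigl[\, n \sum x_i y_i - \sum x_i \sum y_i \Bigr] \\
\sum (x_i - \bar{x})^2
  &= \frac{1}{n}\Bigl[\, n \sum x_i^2 - \bigl(\textstyle\sum x_i\bigr)^2 \Bigr] \\
\frac{\mathrm{cov}(x, y)}{s_x s_y}
  &= \frac{\dfrac{1}{n(n-1)}\Bigl[\, n \sum x_i y_i - \sum x_i \sum y_i \Bigr]}
          {\dfrac{1}{n(n-1)}\sqrt{n \sum x_i^2 - \bigl(\sum x_i\bigr)^2}\,
           \sqrt{n \sum y_i^2 - \bigl(\sum y_i\bigr)^2}}
\end{align*}
```

The factor 1/(n(n-1)) cancels, leaving the sum form. This is why only the column totals of x, y, x*x, y*y and x*y are needed to compute R by hand.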
Example -- Correlation
In the last 5 years, the actual costs of manufacturing a product at various levels of
output have been recorded in the table below. Determine the correlation between output
and cost assuming that the actual cost (y) depends on the output (x).
Year Output Actual Costs
1 50 2500
2 95 1800
3 200 5500
4 120 2800
5 150 4500
i    x    y      x*x    y*y       x*y
1    50   2500   2500   6250000   125000
2    95   1800   9025   3240000   171000
3    200  5500   40000  30250000  1100000
4    120  2800   14400  7840000   336000
5    150  4500   22500  20250000  675000
Sum  615  17100  88425  67830000  2407000
$R = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} \, \sqrt{n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}}$
[Figures: Graph3, Graph4 and Graph5]
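The column totals in the table can be plugged into the sum form of R. The following Python sketch rebuilds the totals from the raw data and evaluates the formula; the variable names are illustrative.

```python
import math

# Output (x) and actual cost (y) from the example table.
output = [50, 95, 200, 120, 150]
cost = [2500, 1800, 5500, 2800, 4500]
n = len(output)

# Column totals, matching the "Sum" row of the table.
sum_x = sum(output)
sum_y = sum(cost)
sum_xx = sum(x * x for x in output)
sum_yy = sum(y * y for y in cost)
sum_xy = sum(x * y for x, y in zip(output, cost))

# Sum form of the correlation coefficient.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt(n * sum_xx - sum_x ** 2) * math.sqrt(n * sum_yy - sum_y ** 2)
r = numerator / denominator

print(f"R = {r:.2f}")
```

The value is close to 1, which suggests a fairly strong positive linear relationship between output and cost.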
Motivation
In the last 5 years, the actual costs of manufacturing a product at various levels of output
have been recorded as:
Year Output Actual Costs
1 50 2500
2 95 1800
3 200 5500
4 120 2800
5 150 4500
How do we find the best straight line to fit the data?
[Figure: scatter plot of Cost against Output with a red line drawn through the points]
By ‘eye’ the red line seems to be the closest to all the points and looks like it has a slope of 1200/50 = 24 and an intercept of 500.
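To put a number on how close the line drawn by eye is to the points, one option is to compare each actual cost with the cost the line predicts. The sketch below does this in Python. Using the sum of squared residuals as the yardstick is an assumption made here; the slide does not say how "closest" is measured.

```python
# Output (x) and actual cost (y) from the table above.
output = [50, 95, 200, 120, 150]
cost = [2500, 1800, 5500, 2800, 4500]

# Slope and intercept read off the red line by eye.
eye_slope = 24
eye_intercept = 500

# Residual = actual cost minus the cost predicted by the line.
predicted = [eye_slope * q + eye_intercept for q in output]
residuals = [c - p for c, p in zip(cost, predicted)]

# Sum of squared residuals: a smaller value means the line sits closer to the points.
sse_eye = sum(r ** 2 for r in residuals)

for q, c, p, r in zip(output, cost, predicted, residuals):
    print(f"output={q:>3}  actual={c:>4}  predicted={p:>4}  residual={r:>5}")
print("sum of squared residuals:", sse_eye)
```

Some points lie above the line and some below, which is what we would expect from a line that passes through the middle of the data.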
Line of Best Fit
• A line of best fit is a straight line drawn through the centre of a group of data points
plotted on a scatter plot of data from two variables.
• It is used to identify trends in the dataset shown on the scatter plot.
• It tells us whether the changes in two variables are related.
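$\text{slope } a = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$
$\text{intercept } b = \frac{\sum_{i=1}^{n} y_i}{n} - a \, \frac{\sum_{i=1}^{n} x_i}{n}$
$R_{xy} = R_{yx} = R = \frac{\mathrm{cov}(x, y)}{s_x s_y} = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} \, \sqrt{n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}}$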
Regression example
Table 1.
Year Output Actual Costs
1 50 2500
2 95 1800
3 200 5500
4 120 2800
5 150 4500
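Slope and intercept of the ‘best’ fit line (Regression)
$\text{slope } a = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$
$\text{slope} = \frac{5 \times 2407000 - 615 \times 17100}{5 \times 88425 - 615 \times 615} = 23.76$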
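The slope calculation, together with the matching intercept formula b = (Σy)/n - a(Σx)/n, can be checked with a short Python sketch using the data in Table 1. Variable names are illustrative.

```python
# Output (x) and actual cost (y) from Table 1.
output = [50, 95, 200, 120, 150]
cost = [2500, 1800, 5500, 2800, 4500]
n = len(output)

# Totals needed by the slope and intercept formulas.
sum_x = sum(output)
sum_y = sum(cost)
sum_xx = sum(x * x for x in output)
sum_xy = sum(x * y for x, y in zip(output, cost))

# slope a = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
# intercept b = mean(y) - a * mean(x)
intercept = sum_y / n - slope * sum_x / n

print(f"Cost = {slope:.2f} * Output + {intercept:.2f}")
```

The result is close to the line estimated by eye (slope 24, intercept 500).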
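$R^2 = \frac{\sum_{i=1}^{n} (y_i' - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{\left(n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i\right)^2}{\left[n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right] \left[n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}$
[Figure: data points $y_i$ and the corresponding values $y_i'$ on the fitted line]
$R_{xy} = R_{yx} = R = \frac{\mathrm{cov}(x, y)}{s_x s_y} = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} \, \sqrt{n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}}$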
Calculating R^2 manually - Example
i xi yi xi^2 xi*yi yi^2
1 50 2500 2500 125000 6250000
2 95 1800 9025 171000 3240000
3 200 5500 40000 1100000 30250000
4 120 2800 14400 336000 7840000
5 150 4500 22500 675000 20250000
Sums = 615 17100 88425 2407000 67830000
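$R^2 = \frac{\left(n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i\right)^2}{\left[n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right] \left[n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}$
$R^2 = \frac{(5 \times 2407000 - 615 \times 17100)^2}{(5 \times 88425 - 615 \times 615)(5 \times 67830000 - 17100 \times 17100)}$
$R^2 = 0.77$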
So, we’ve ‘explained’ 77% of the variation in cost using our linear regression model.
It follows that the correlation coefficient is $R = \sqrt{R^2} = \sqrt{0.77} \approx 0.88$ (taking the positive root, since the slope is positive).
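The phrase "explained variation" can be made concrete by computing R^2 from its other definition: the variation of the fitted values y_i' about the mean, divided by the variation of the actual values y_i about the mean. The Python sketch below does this for the example data. The slope is computed as Σ(x - x̄)(y - ȳ) / Σ(x - x̄)^2, which is algebraically the same as the sum formula used in the lecture; the names are illustrative.

```python
import math

# Output (x) and actual cost (y) from the example table.
output = [50, 95, 200, 120, 150]
cost = [2500, 1800, 5500, 2800, 4500]
n = len(output)
mean_x = sum(output) / n
mean_y = sum(cost) / n

# Best fit line (deviation form of the slope formula).
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(output, cost))
s_xx = sum((x - mean_x) ** 2 for x in output)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Fitted values y_i' on the line.
fitted = [slope * x + intercept for x in output]

# Explained variation over total variation.
explained = sum((f - mean_y) ** 2 for f in fitted)
total = sum((y - mean_y) ** 2 for y in cost)
r_squared = explained / total
r = math.sqrt(r_squared)  # positive root because the slope is positive

print(f"R^2 = {r_squared:.3f}, R = {r:.2f}")
```

Both routes give the same R^2, so the manual calculation with column totals is usually the quicker one by hand.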
Excel 7:
Further considerations
1. Y = f(X) or X = f(Y) …?
[Figure: Costs against Output over the last 5 years, with fitted line Cost = 23.764*Output + 497.07 and R^2 = 0.772]
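One way to read the Y = f(X) or X = f(Y) question is that the choice of dependent variable changes the fitted line. The Python sketch below fits the example data both ways using the lecture's slope and intercept formulas. The helper name `fit_line` is illustrative, and this interpretation of the question is an assumption.

```python
# Output and actual cost from the example table.
output = [50, 95, 200, 120, 150]
cost = [2500, 1800, 5500, 2800, 4500]


def fit_line(xs, ys):
    """Return (slope, intercept) of the best fit line ys = slope * xs + intercept."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n
    return slope, intercept


# Cost as a function of Output, and Output as a function of Cost.
a_cost_on_output, b_cost_on_output = fit_line(output, cost)
a_output_on_cost, b_output_on_cost = fit_line(cost, output)

print(f"Cost   = {a_cost_on_output:.3f} * Output + {b_cost_on_output:.2f}")
print(f"Output = {a_output_on_cost:.5f} * Cost + {b_output_on_cost:.2f}")
print(f"1 / (slope of Cost on Output) = {1 / a_cost_on_output:.5f}")
print(f"product of the two slopes     = {a_cost_on_output * a_output_on_cost:.3f}")
```

The second line is not simply the first one rearranged: its slope differs from 1/23.764, and the product of the two slopes equals R^2. The two fits would only coincide if the points lay exactly on a straight line.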
2. Outliers can influence the slope and intercept of the best fit line
and give poor correlations, e.g.
[Figure: scatter plot with one outlier highlighted. Removing this outlier moves the line and increases the correlation.]
Review the scattergram for possible outliers, question if they are really
outliers or valid data points and, if necessary, exclude them.
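A simple way to see how much a single point influences the fit is to drop each point in turn and refit. The Python sketch below does this for the cost and output data used earlier in the lecture, since the data behind the outlier chart are not given. The helper name `fit_and_r` is illustrative, and the sketch does not claim that any of these five points is an outlier.

```python
import math

# Output and actual cost from the example table used earlier in the lecture.
output = [50, 95, 200, 120, 150]
cost = [2500, 1800, 5500, 2800, 4500]


def fit_and_r(xs, ys):
    """Return (slope, intercept, R) using the sum formulas from the lecture."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    num = n * sum_xy - sum_x * sum_y
    slope = num / (n * sum_xx - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n
    r = num / (math.sqrt(n * sum_xx - sum_x ** 2) * math.sqrt(n * sum_yy - sum_y ** 2))
    return slope, intercept, r


full = fit_and_r(output, cost)
print(f"all points:     slope={full[0]:.2f}  intercept={full[1]:.2f}  R={full[2]:.3f}")

# Refit with each year left out in turn.
leave_one_out = []
for i in range(len(output)):
    xs = output[:i] + output[i + 1:]
    ys = cost[:i] + cost[i + 1:]
    result = fit_and_r(xs, ys)
    leave_one_out.append(result)
    print(f"without year {i + 1}: slope={result[0]:.2f}  intercept={result[1]:.2f}  R={result[2]:.3f}")
```

A point whose removal changes the slope or R sharply deserves a closer look, but as noted above, it should only be excluded if it is not a valid data point.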
Excel 9:
Reading: Dewhurst Sections 8.3 and 8.4 & Oakshott Section 10
Next week:
Quadratic functions
$y = ax^2 + bx + c$