Module 3 - SimpleLinearRegression - Afterclass1b

This document provides an overview of using linear regression to predict wine quality. It describes how Orley Ashenfelter, a Princeton economics professor, used variables such as weather conditions (average growing season temperature and rainfall), wine age, and French population to build a linear regression model for predicting the prices of the 1952-1978 Bordeaux vintages, which serve as a proxy for wine quality. The document defines key concepts such as the regression function, intercept, slope, residuals, and the ordinary least squares criterion used to estimate the regression coefficients.


IIMT 2641 Introduction to Business Analytics

Module 3: Linear Regression


Topic 1: Simple Linear Regression

Bordeaux wine

§ Large differences in price and quality between years, although the wine is produced in a similar way
§ Meant to be aged, so hard to tell if the wine will be good when it is on the market
§ Expert tasters predict which ones will be good
§ Can analytics be used to come up with a different system for judging wine?
Predicting the quality of wine

§ March 1990 - Orley Ashenfelter, a Princeton economics professor, claims he can predict wine quality without tasting the wine

Building a model

§ Ashenfelter used a method called linear regression
– Predicts an outcome variable, or dependent variable
– Predicts using a set of independent variables

Building a model
§ Dependent variable:
– Typical price in 1990-1991 wine auctions (approximates quality)
– Conduct a logarithmic transformation
q Gives a better linear fit

§ Independent variables:
– Age of wine (in 1990)
q Older wines are more expensive
– Weather
q Average Growing Season Temperature (AGST)
q Harvest Rain
q Winter Rain
– Population of France

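To make this concrete, here is a minimal Python sketch of assembling these variables; the file name wine.csv and the column names are assumptions for illustration, not something specified on the slides.

```python
# Sketch: prepare the dependent and independent variables (assumed file/columns).
import numpy as np
import pandas as pd

wine = pd.read_csv("wine.csv")                # hypothetical file, one row per vintage

# Dependent variable: log of the auction price (the log gives a better linear fit).
wine["LogPrice"] = np.log(wine["Price"])

# Independent variables named on this slide (assumed column names).
predictors = ["Age", "AGST", "HarvestRain", "WinterRain", "FrancePop"]
print(wine[["LogPrice"] + predictors].head())
```
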
The wine data (1952 - 1978)

Quick Question: What is the relationship between harvest rain, average growing season temperature, and wine prices?
Baseline model (?)

Baseline model (Take the mean)
(Figure: the data points with a horizontal line at the mean ȳ, the baseline prediction.)
One-Variable Linear Regression
(Figure: scatter plot of the data with a fitted regression line.)
Simple Regression Model
The population model of y with one predictor variable x is:

y = β0 + β1·x + ε

§ y is the dependent variable (DV)
§ x is the independent variable (IV)
§ Regression Function
– E[Y|x] = β0 + β1·x is the mean of Y given x
§ β0 is the y-intercept (the value of E[Y|x] when x = 0)
§ β1 is the slope for x, which is the change in E[Y|x] for a unit increase in x
§ Random errors ε (not required)
– The random errors are a random sample from N(0, σ²), i.e. i.i.d. (independent and identically distributed) random variables
– Each observation has its own random error
– The regression output does not show the random errors, but it does estimate their standard deviation σ
– The random errors ε and the IV (x) are uncorrelated
– These assumptions are important for effective business analytics
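
As a quick illustration of these assumptions, the sketch below simulates the population model with i.i.d. normal errors; the values β0 = 1, β1 = 2, and σ = 1 are arbitrary choices for the demo, not from the slides.

```python
# Simulate y = beta0 + beta1*x + eps with eps ~ N(0, sigma^2), i.i.d.
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 1.0, 2.0, 1.0       # arbitrary demo values
x = rng.uniform(0, 10, size=100)          # independent variable
eps = rng.normal(0, sigma, size=100)      # each observation gets its own random error
y = beta0 + beta1 * x + eps               # dependent variable

print(y[:5])
```
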
Estimated Regression Function
§ Estimate the regression model with n observations (xi, yi) for i = 1, …, n
§ The estimated or predicted value of y given x is:

ŷ = b0 + b1·x

§ b0 is the sample estimate of the population intercept β0
§ b1 is the sample estimate of the population slope β1
– b0 and b1 are sample statistics (similar to x̄) and have sampling distributions
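
A minimal sketch of computing the sample estimates b0 and b1 on simulated data; np.polyfit with degree 1 fits a least squares line.

```python
# Estimate b0 and b1 from a sample generated by the population model above.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, size=100)   # demo data with beta0=1, beta1=2

b1, b0 = np.polyfit(x, y, deg=1)          # least squares slope and intercept
y_hat = b0 + b1 * x                       # predicted value of y given x
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")    # estimates of the population values 1 and 2
```

Rerunning with a different seed gives slightly different b0 and b1, which is the sampling-distribution point made above.
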
One-Variable Linear Regression

(Figure: data points with the estimated regression line ŷ = b0 + b1·x.)
Data and Predicted Values
§ What is the observed y when x = 1?


§ What is the predicted y when x = 1?


§ What is the observed y when x = 4?



§ What is the predicted y when x = 4?

Data and Predicted Values
§ What is the observed y when x = 1?

y=6

§ What is the predicted y when x = 1?


ŷ = 1 + (2)(1) = 3

§ What is the observed y when x = 4?

y=4

§ What is the predicted y when x = 4?

ŷ = 1 + (2)(4) = 9

Estimated Model and Residuals

§ Residuals are the difference between the observed values of y and the predicted values ŷ
– r = y − ŷ
– Each observation has one observed y, one predicted ŷ, and one residual r.
§ The residuals are errors between the observed and predicted values.

(Figure: scatter plot with the fitted line; each residual ri = yi − ŷi is the vertical distance from the observed point to the line.)
Computing Residuals

(Figure: the four data points and the line ŷ = 1 + 2x, with residuals r1, …, r4 marked.)

§ What is the residual r2 at x = 2?

§ What is the residual r3 at x = 3?

Computing Residuals

§ What is the residual r2 at x = 2?

r2 = y2 − ŷ2 = 3 − (1 + 2∗2) = 3 − 5 = −2

§ What is the residual r3 at x = 3?

r3 = y3 − ŷ3 = 11 − (1 + 2∗3) = 11 − 7 = 4
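
These answers can be checked in a few lines; the four data points and the line ŷ = 1 + 2x come from the worked example on these slides.

```python
# Residuals for the worked example: the line y_hat = 1 + 2x over four observations.
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([6, 3, 11, 4])               # observed values from the example
y_hat = 1 + 2 * x                         # predicted values on the line
r = y - y_hat                             # one residual per observation

print(r)                                  # [ 3 -2  4 -5]; r2 = -2 and r3 = 4 as above
```
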
Ordinary Least Squares (OLS) Criterion
The least squares line finds the estimates b0 and b1 of the coefficients to minimize the sum of squared errors (SSE) for a sample {(xi, yi)} with n observations:

SSE(b0, b1) = Σi (yi − ŷi)², where ŷi = b0 + b1·xi for i = 1, …, n

Why squared? Because positive and negative residuals cancel, the plain sum of residuals could be zero even for a poor fit.

Setting both partial derivatives to zero,

∂SSE(b0, b1)/∂b0 = 0 and ∂SSE(b0, b1)/∂b1 = 0,

yields the least squares estimates:

b1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²
b0 = ȳ − b1·x̄

where x̄ is the sample average of the independent variable and ȳ is the sample average of the dependent variable.

(Do not need to memorize.)
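
The closed-form formulas translate directly into numpy; this sketch reuses the four example points, so it is a demonstration of the algebra rather than anything to memorize.

```python
# Closed-form OLS estimates b0, b1 that minimize SSE (mirrors the formulas above).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([6.0, 3.0, 11.0, 4.0])       # example data from the earlier slides

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

sse = np.sum((y - (b0 + b1 * x)) ** 2)    # the minimized sum of squared errors
print(f"b1 = {b1:.3f}, b0 = {b0:.3f}, SSE = {sse:.3f}")
```

Note that this least squares line differs from the illustrative line ŷ = 1 + 2x used earlier: OLS picks the intercept and slope with the smallest SSE for this sample.
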
Estimate a linear model (One Variable)

Test H0: AGST Coefficient = 0 versus HA: AGST Coefficient ≠ 0.

The regression output reports:
• Estimated standard errors for the estimated intercept and slope coefficients
• t-score = (Estimated Coefficient − 0)/(Standard Error)
• Two-tail test: p-value = 2∗P(T < −|t-score|)
• Coefficient of Determination: R-Squared
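
Output of this kind can be reproduced with, for example, statsmodels; this is a sketch assuming the hypothetical wine.csv file and column names used earlier.

```python
# Sketch: one-variable regression output (coefficients, std. errors, t, p, R-squared).
import numpy as np
import pandas as pd
import statsmodels.api as sm

wine = pd.read_csv("wine.csv")            # hypothetical file, as before
y = np.log(wine["Price"])                 # LogPrice
X = sm.add_constant(wine["AGST"])         # adds the intercept column

model = sm.OLS(y, X).fit()
print(model.summary())                    # std. errors, t-scores, p-values, R-squared
```
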
One-Variable Linear Regression

ŷ = −3.4178 + 0.6351∗AGST
Estimate a linear model
(One Variable )
• Estimated model for price:
ŷ = −3.4178 + 0.6351∗AGST

• The predicted LogPrice increases by 0.6351 for every 1 degree increase in average growing season temperature.
• If AGST = 15, then ŷ = ?
• If AGST = 18, then ŷ = ?
• If AGST = 20, then ŷ = ?
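
As a check on the three questions above, each AGST value can be plugged into the estimated equation:

```python
# Predictions from the estimated model LogPrice_hat = -3.4178 + 0.6351 * AGST.
for agst in (15, 18, 20):
    log_price_hat = -3.4178 + 0.6351 * agst
    print(f"AGST = {agst}: predicted LogPrice = {log_price_hat:.4f}")
```
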
T-Tests for the Coefficients: H0: bj = 0 versus HA: bj ≠ 0
Two-Tail Test for the Slope (Very important. Can you predict Y from X?)

H0: b1 = 0 versus HA: b1 ≠ 0
• t-score = (coefficient − 0)/(std. error)
• t-score = (0.6351 − 0)/0.1509 = 4.208
• p-value = 2∗P(T < −|4.208|) = 2∗t.dist(−4.208, 23, 1) < 0.001
• df = n − 1 − #IV = 25 − 1 − 1 = 23 (the error df shown under "Sum of Squares" in the output)
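
The t-score and two-tail p-value can be reproduced with scipy.stats; the coefficient 0.6351, standard error 0.1509, and df = 23 are the values from this slide.

```python
# Two-tail t-test for the slope: H0: b1 = 0 versus HA: b1 != 0.
from scipy import stats

coef, se, df = 0.6351, 0.1509, 23         # slope estimate, std. error, n - 1 - #IV
t_score = (coef - 0) / se                 # = 4.208
p_value = 2 * stats.t.cdf(-abs(t_score), df)

print(f"t = {t_score:.3f}, p = {p_value:.5f}")  # p < 0.001, so reject H0
```
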
How well the model fits data
§ The simplest commonly used measure of fit is R² (the coefficient of determination): R² = 1 − SSE/SST
– SSE = Σi (yi − ŷi)²: sum of squared errors
q Variation of Y that cannot be explained by the regression
– SST = Σi (yi − ȳ)²: total sum of squares
q Total amount of variation of Y around its mean
q "Error" generated by a baseline model without any inputs
– Decomposition of variation of Y:

Σi (yi − ȳ)² = Σi (yi − ŷi)² + Σi (ŷi − ȳ)²
Total variation = Unexplained variation + Explained variation

§ R² is the proportion of the variance in the DV explained by the regression model.
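
R² follows directly from these definitions; this sketch reuses the four-point example from the residual slides.

```python
# R^2 = 1 - SSE/SST, computed from the definitions above.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([6.0, 3.0, 11.0, 4.0])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)            # unexplained variation
sst = np.sum((y - y.mean()) ** 2)         # total variation around the mean (baseline)
print(f"R^2 = {1 - sse / sst:.3f}")       # share of variation explained by the model
```

For these four scattered points R² comes out close to 0, i.e. the regression explains almost none of the variation, which is consistent with how weak the linear pattern is.
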
Coefficient of Determination: R-Squared
• R-Squared is a measure of fit
• A bigger R-Squared indicates a better fit, all else being equal
• 43.5% of the variation of prices is explained by the simple regression on AGST
• 0 ≤ R-Squared ≤ 1
Use each variable on its own
§ R² = 0.44 using Average Growing Season Temperature (variable significant at the 0.001 level)
§ R² = 0.32 using Harvest Rain (variable significant at the 0.01 level)
§ R² = 0.22 using France Population (variable significant at the 0.05 level)
§ R² = 0.20 using Age (variable significant at the 0.05 level)
§ R² = 0.02 using Winter Rain (not significant)
§ Multivariate linear regression allows us to use more than one variable to potentially improve our predictive ability.
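
These one-variable comparisons could be reproduced with a short loop; as before, the wine.csv file and column names are assumptions for illustration.

```python
# Fit LogPrice on each candidate variable by itself and compare R-squared.
import numpy as np
import pandas as pd
import statsmodels.api as sm

wine = pd.read_csv("wine.csv")            # hypothetical file, as before
y = np.log(wine["Price"])

for var in ["AGST", "HarvestRain", "FrancePop", "Age", "WinterRain"]:
    fit = sm.OLS(y, sm.add_constant(wine[var])).fit()
    print(f"{var}: R^2 = {fit.rsquared:.2f}, p-value = {fit.pvalues[var]:.3f}")
```
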
