Linear Regression

This document discusses linear regression and its application to predicting wine quality. It describes how Orley Ashenfelter used variables like weather and age to predict wine prices through linear regression models, achieving better results than expert wine critics. The document outlines the key aspects of linear regression, including selecting variables, measuring error, and testing predictive ability. It shows how Ashenfelter's models were able to accurately predict highly rated vintages of Bordeaux wine.

Uploaded by Reza Barkhordari
Chapter

Linear Regression

Shirin Aslani, Fall 2021

Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
The Statistical Sommelier

• Large differences in price and quality between years, although wine is produced in a similar way
• Meant to be aged, so hard to tell if wine will be good when it is on the market
• Expert tasters predict which ones will be good
• Can analytics be used to come up with a different system for judging wine?
• March 1990 – Orley Ashenfelter, a Princeton economics professor, claims he can predict wine quality without tasting the wine

• Ashenfelter used a method called linear regression
  • Predicts an outcome variable, or dependent variable
  • Predicts using a set of independent variables
• Dependent variable: typical price in 1990-1991 wine auctions (approximates quality)
• Independent variables:
  • Age – older wines are more expensive
  • Weather
    • Average Growing Season Temperature
    • Harvest Rain
    • Winter Rain

• Robert Parker, the world's most influential wine expert: “Ashenfelter is an absolute total sham”
• “rather like a movie critic who never goes to see the movie but tells you how good it is based on the actors and the director”

One-Variable Linear Regression

[Figure: Logarithm of Price (vertical axis, 6 to 8.5) plotted against Average Growing Season Temp (horizontal axis, 14.5 to 18)]

The Regression Model

• The best model (choice of coefficients) has the smallest error terms

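In symbols, a one-variable model of this kind is usually written as below. The notation ($\beta_0$, $\beta_1$, $\varepsilon_i$, and the index $i$ over $N$ observations) is an assumption, not taken from the slide text.

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \dots, N
```

Here $y_i$ would be the dependent variable (logarithm of price), $x_i$ the independent variable (average growing season temperature), and $\varepsilon_i$ the error term for observation $i$.
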
Selecting the Best Model

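A minimal sketch of how a "best" line can be chosen, assuming the criterion is the sum of squared errors (SSE) and using the standard closed-form least-squares solution. The function names and the toy points are hypothetical and are not the wine data.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the line that minimizes the SSE."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sse(xs, ys, intercept, slope):
    """Sum of squared errors of a given line on the given points."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Toy points, made up for illustration only.
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [2.0, 4.1, 5.9, 8.2]

b0, b1 = fit_line(toy_x, toy_y)
best = sse(toy_x, toy_y, b0, b1)

# Any other line, here the fitted line shifted upward, has a larger SSE.
other = sse(toy_x, toy_y, b0 + 0.5, b1)
print(best <= other)
```
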
Other Error Measures

• SSE can be hard to interpret
  • Depends on N
  • Units are hard to understand
• Root-Mean-Square Error (RMSE)
  • Normalized by N, units of dependent variable

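The two error measures can be sketched as follows, assuming the usual definition RMSE = sqrt(SSE / N). The numbers are toy values chosen for illustration, not results from the wine model.

```python
import math


def sse(actual, predicted):
    """Sum of squared errors: grows with the number of observations N."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def rmse(actual, predicted):
    """Root-mean-square error: normalized by N, in the units of the outcome."""
    return math.sqrt(sse(actual, predicted) / len(actual))


# Toy values, made up for illustration only.
actual = [7.0, 8.0, 6.5, 7.5]
predicted = [7.2, 7.7, 6.9, 7.4]

print(sse(actual, predicted), rmse(actual, predicted))
```
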
R²

• Compares the best model to a “baseline” model
• The baseline model does not use any variables
  • Predicts same outcome (price) regardless of the independent variable (temperature)

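A sketch of the comparison against the baseline, assuming the baseline predicts the mean of the observed outcomes and that R² = 1 - SSE / SST, where SST is the baseline's sum of squared errors. The function name is hypothetical.

```python
def r_squared(actual, predicted):
    """1 - SSE/SST, where SST is the SSE of the mean-only baseline."""
    mean = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean) ** 2 for a in actual)
    return 1 - sse / sst


# Toy outcomes, made up for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(actual, actual)        # model with no error
baseline = r_squared(actual, [2.5] * 4)    # model that always predicts the mean
print(perfect, baseline)
```

Under these assumptions a perfect model scores 1 and the baseline itself scores 0, so R² reads as the share of the baseline's error that the model removes.
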
Available Independent Variables

• Many different independent variables could be used
  • Average Growing Season Temperature
  • Harvest Rain
  • Winter Rain
  • Age of Wine (in 1990)
  • Population of France

The Regression Model

• Multiple linear regression model with k variables
• Best model coefficients selected to minimize SSE

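Written out, a model with k variables and its fitting criterion would typically look like the following. The symbols ($\beta_j$, $x_{ij}$, $\varepsilon_i$, $N$) are assumed notation, not reproduced from the slide.

```latex
\begin{align*}
y_i &= \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i, \qquad i = 1, \dots, N \\
\mathrm{SSE} &= \sum_{i=1}^{N} \varepsilon_i^{2}
\end{align*}
```
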
Adding Variables

• Adding more variables can improve the model
• Diminishing returns as more variables are added
• Not all available variables should be used
  • Each new variable requires more data
  • Using too many variables causes overfitting: high R² on data used to create the model, but bad performance on unseen data

Correlation

• Correlation measures linear association, not causation.
• The correlation (usually denoted by r) between two variables (call them x and y) is a unit-free measure of the strength of the linear relationship between x and y.
• The correlation between any two variables is always between –1 and +1.
• Although the exact formula used to compute the correlation between two variables isn’t very important, interpreting the correlation between the variables is.
• A correlation near +1 means that x and y have a strong positive linear relationship.

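For reference, a minimal sketch of the Pearson correlation coefficient, which is the usual meaning of r. The function name and the toy inputs are hypothetical.

```python
import math


def correlation(xs, ys):
    """Pearson correlation: unit-free and always between -1 and +1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Toy inputs: an exactly increasing and an exactly decreasing linear relationship.
r_up = correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
r_down = correlation([1.0, 2.0, 3.0], [6.0, 4.0, 2.0])
print(r_up, r_down)
```
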
Predictive Ability

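One common way to check predictive ability on unseen data is an out-of-sample R². The sketch below assumes the baseline is the mean of the training outcomes, which is a convention chosen here rather than something stated in the slides. The function name and toy values are hypothetical.

```python
def out_of_sample_r_squared(train_actual, test_actual, test_predicted):
    """R² on held-out data, with the training mean as the baseline prediction."""
    baseline = sum(train_actual) / len(train_actual)
    sse = sum((a - p) ** 2 for a, p in zip(test_actual, test_predicted))
    sst = sum((a - baseline) ** 2 for a in test_actual)
    return 1 - sse / sst


# Toy values, made up for illustration only.
train = [1.0, 2.0, 3.0]
test = [2.0, 4.0]

good = out_of_sample_r_squared(train, test, [2.0, 4.0])    # exact predictions
same = out_of_sample_r_squared(train, test, [2.0, 2.0])    # same as the baseline
worse = out_of_sample_r_squared(train, test, [5.0, 0.0])   # worse than the baseline
print(good, same, worse)
```

Unlike R² on the training data, this quantity can be negative: a model that predicts unseen data worse than the baseline scores below 0.
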
The Result

• Parker:
  • 1986 is “very good to sometimes exceptional”
• Ashenfelter:
  • 1986 is mediocre
  • 1989 will be “the wine of the century” and 1990 will be even better!
• In wine auctions,
  • 1989 sold for more than twice the price of 1986
  • 1990 sold for even higher prices!
• Later, Ashenfelter predicted 2000 and 2003 would be great
• Parker has stated that “2000 is the greatest vintage Bordeaux has ever produced”

Effects of Light on Meadowfoam Flowering

• Meadowfoam is a small plant found growing in moist meadows of the U.S. Pacific Northwest.
• It has been domesticated at Oregon State University for its seed oil, which is unique among vegetable oils for its long carbon strings and is nongreasy and highly stable.
• This was one study in a series designed to find out how to elevate meadowfoam production to a profitable crop.
• In a controlled growth chamber, the researchers focused on the effects of two light-related factors:
  • light intensity: 150, 300, 450, 600, 750, and 900 µmol/m²/sec;
  • the timing of the onset of the light treatment: either at photoperiodic floral induction (PFI), the time at which the photoperiod was increased from 8 to 16 hours per day to induce flowering, or 24 days before PFI.

• 12 treatment groups: 6 light intensities at 2 timing levels.
• Ten seedlings randomly assigned to each treatment group.
• The number of flowers per plant is the primary measure of production, measured by averaging the numbers of flowers produced by the 10 seedlings in each group.
• The entire experiment was repeated.
• No difference was found between the first and second runs of the experiment.
• The 2 observations in each cell are thought of as independent replicates under the specified conditions.
• What are the effects of differing light intensity levels?
• What is the effect of the timing?

Statistical Conclusion

• Increasing light intensity decreased the mean number of flowers per plant by an estimated 4.0 flowers per plant per 100 µmol/m²/sec (95% confidence interval from 3.0 to 5.1).
• Beginning the light treatments 24 days prior to PFI increased the mean number of flowers by an estimated 12.2 flowers per plant (95% confidence interval from 6.7 to 17.6).
• The data provide no evidence that the effect of light intensity depends on the timing of its initiation (two-sided p-value = 0.91, from a t-test for interaction, 20 degrees of freedom).

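The point estimates in the statistical conclusion can be turned into simple arithmetic. The sketch below assumes the two effects add without interaction, which is in line with the interaction test reported above. The function and constant names are hypothetical, and only the two quoted estimates are used.

```python
# Point estimates quoted in the statistical conclusion.
SLOPE_PER_100 = -4.0   # change in mean flowers per plant per 100 µmol/m²/sec
EARLY_EFFECT = 12.2    # change in mean flowers per plant when starting 24 days before PFI


def change_in_mean_flowers(delta_intensity, switch_to_early=False):
    """Estimated change in mean flowers per plant for a change in treatment.

    delta_intensity is the change in light intensity in µmol/m²/sec.
    """
    change = SLOPE_PER_100 * delta_intensity / 100.0
    if switch_to_early:
        change += EARLY_EFFECT
    return change


# Raising intensity by 300 µmol/m²/sec (for example from 150 to 450).
print(change_in_mean_flowers(300))
# The same increase combined with switching to the early start.
print(change_in_mean_flowers(300, switch_to_early=True))
```

On these estimates, a 300 µmol/m²/sec increase in intensity lowers the mean by about 12 flowers per plant, which is roughly the size of the gain from starting the light treatment early.
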
Scope of Inference

• The researchers can infer that the effects above were caused by the light intensity and timing manipulations, because this was a randomized experiment.

The Multiple Linear Regression

• In multiple regression there is a single response variable and multiple explanatory variables.
• Although they should not be thought of as the truth, one or two models may adequately approximate the mean of the response as a function of the explanatory variables, and conveniently allow for the questions of interest to be investigated.

Interpretation of Regression Coefficients

An Indicator Variable to Distinguish Between Two Groups

• An indicator variable (or dummy variable) takes on one of two values: “1” (one) indicates that an attribute is present, and “0” (zero) indicates that the attribute is absent.
• Consider the regression model
• If time = 0, then early = 0, and the regression line is
• If time = 24, then early = 1, and the regression line is
• Because the slopes are the same, the model is called the parallel lines regression model

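A sketch of the parallel lines model in symbols, assuming the response is the mean number of flowers, the explanatory variables are light intensity and the indicator early, and the coefficients are named $\beta_0$, $\beta_1$, $\beta_2$. This notation is an assumption and does not come from the slide text.

```latex
\begin{align*}
\mu\{\text{flowers} \mid \text{light}, \text{early}\} &= \beta_0 + \beta_1\,\text{light} + \beta_2\,\text{early} \\
\text{early} = 0:\quad \mu\{\text{flowers} \mid \text{light}, \text{early} = 0\} &= \beta_0 + \beta_1\,\text{light} \\
\text{early} = 1:\quad \mu\{\text{flowers} \mid \text{light}, \text{early} = 1\} &= (\beta_0 + \beta_2) + \beta_1\,\text{light}
\end{align*}
```

Both lines share the slope $\beta_1$ and differ only in their intercepts, by $\beta_2$.
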
MONEYBALL

• Moneyball tells the story of the Oakland A’s in 2002
  • One of the poorest teams in baseball
  • New ownership and budget cuts in 1995
  • But they were improving
• How were they doing it?
  • Was it just luck?
• In 2002, the A’s lost three key players
  • Could they continue winning?

The Problem

• Rich teams have four times the salary of poor teams
• The Oakland A’s can’t afford the all-stars, but they are still making it to the playoffs. How?
• They take a quantitative approach and find undervalued players

A Different Approach

• The A’s started using a different method to select players
• The traditional way was through scouting
  • Scouts would go watch high school and college players
  • Report back about their skills
  • A lot of talk about speed and athletic build
• The A’s selected players based on their statistics, not on their looks
  “The statistics enabled you to find your way past all sorts of sight-based scouting prejudices.”

Billy Beane

• The general manager since 1997
• Played major league baseball, but never made it big
• Billy Beane succeeded in using analytics
  • Had a management position
  • Understood the importance of statistics – hired Paul DePodesta (a Harvard graduate) as his assistant
  • Didn’t care about being ostracized

Taking a Quantitative View

• Paul DePodesta spent a lot of time looking at the data
• His analysis suggested that some skills were undervalued and some skills were overvalued
• If they could detect the undervalued skills, they could find players at a bargain

Making it to the Playoffs

• How many games does a team need to win in the regular season to make it to the playoffs?
• “Paul DePodesta reduced the regular season to a math problem. He judged how many wins it would take to make it to the playoffs: 95.”

Scoring Runs

• How does a team score more runs?
• The A’s discovered that two baseball statistics were significantly more important than anything else
  • On-Base Percentage (OBP)
    • Percentage of time a player gets on base (including walks)
  • Slugging Percentage (SLG)
    • How far a player gets around the bases on his turn (measures power)

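The two statistics can be computed from standard batting counts. The sketch below uses the conventional baseball definitions, which are not spelled out in the slides. The function names and the toy counts are hypothetical.

```python
def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """Share of plate appearances in which the batter reaches base."""
    reached = hits + walks + hit_by_pitch
    chances = at_bats + walks + hit_by_pitch + sacrifice_flies
    return reached / chances


def slugging_percentage(singles, doubles, triples, home_runs, at_bats):
    """Total bases per at-bat: a measure of power."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats


# Toy season line for one batter, made up for illustration only.
obp = on_base_percentage(hits=150, walks=50, hit_by_pitch=5,
                         at_bats=500, sacrifice_flies=5)
slg = slugging_percentage(singles=100, doubles=30, triples=5,
                          home_runs=15, at_bats=500)
print(obp, slg)
```
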
Predicting Runs and Wins

• Can we predict how many games the 2002 Oakland A’s will win using our models?
• The models for runs use team statistics
• Each year, a baseball team is different
• We need to estimate the new team statistics using past player performance
  • Assumes past performance correlates with future performance
  • Assumes few injuries
• We can estimate the team statistics for 2002 by using the 2001 player statistics

Predicting Runs Scored

• At the beginning of the 2002 season, the Oakland A’s had 24 batters on their roster
• Using the 2001 regular season statistics for these players
  • Team OBP is 0.339
  • Team SLG is 0.430
• At the beginning of the 2002 season, the Oakland A’s had 17 pitchers on their roster
• Using the 2001 regular season statistics for these players
  • Team OOBP is 0.307
  • Team OSLG is 0.373

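A sketch of how these team statistics could be plugged into fitted linear models. It assumes runs scored is modeled from OBP and SLG, runs allowed from OOBP and OSLG, and wins from the run difference. That structure, the class and function names, and every coefficient below are placeholders, because the fitted models are not given in this text.

```python
from dataclasses import dataclass


@dataclass
class LinearModel:
    """A fitted linear model: intercept plus one coefficient per feature."""
    intercept: float
    coefficients: tuple

    def predict(self, *features):
        return self.intercept + sum(
            c * f for c, f in zip(self.coefficients, features)
        )


# Team statistics quoted above (from 2001 regular season player statistics).
TEAM_OBP, TEAM_SLG = 0.339, 0.430
TEAM_OOBP, TEAM_OSLG = 0.307, 0.373


def predict_season(runs_scored_model, runs_allowed_model, wins_model):
    """Return (runs scored, runs allowed, wins) predicted by the three models."""
    runs_scored = runs_scored_model.predict(TEAM_OBP, TEAM_SLG)
    runs_allowed = runs_allowed_model.predict(TEAM_OOBP, TEAM_OSLG)
    wins = wins_model.predict(runs_scored - runs_allowed)
    return runs_scored, runs_allowed, wins


# PLACEHOLDER coefficients, not fitted values: they only show the call pattern.
placeholder_rs = LinearModel(intercept=0.0, coefficients=(1.0, 1.0))
placeholder_ra = LinearModel(intercept=0.0, coefficients=(1.0, 1.0))
placeholder_w = LinearModel(intercept=0.0, coefficients=(1.0,))

rs, ra, wins = predict_season(placeholder_rs, placeholder_ra, placeholder_w)
print(rs, ra, wins)
```

With real coefficients estimated from past seasons in place of the placeholders, the same call would give the predicted runs and wins for 2002.
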
The Goal of a Baseball Team

• Why isn’t the goal to win the World Series?
• Billy and Paul see their job as making sure the team makes it to the playoffs – after that all bets are off
• The A’s made it to the playoffs in 2000, 2001, 2002, 2003
• But they didn’t win the World Series
• Why?