Module 3 - Regression

Minimizing the sum of squares of errors is the Ordinary Least Squares (OLS) approach to finding the "best fit" line. OLS chooses the coefficients (b0 and b1) that make the squared errors between the predicted and actual y-values as small as possible. This has desirable statistical properties, such as producing estimates that are BLUE (Best Linear Unbiased Estimators).


Building Predictive Models

ctl.mit.edu
Duffy Industries
• Robin Curtin, the Transportation vice president for Duffy
Industries, a food manufacturer, is trying to understand their
transportation rates. She needs to estimate what rates
should be for full truckload (TL) service from a new facility.
Duffy uses contract “over-the-road” trucking companies for
TL shipments. These are moves directly from one point to
another with no intermediate stops.
• Robin only has a snapshot of data showing the costs and
some other characteristics of about 100 TL shipments.
• Some questions she would like to answer:
 What characteristics are driving the rates for my TL services?
 What rates should I expect if I open new lanes?

2
Duffy Industries
• Let’s take a look at a snapshot of the data
ID Cost Per Load Distance (miles) LeadTime (days) Trailer Length (ft) Weight (lbs) Equipment
1 $3,692 1,579 1 53 20,559 DRY
2 $3,279 1,298 12 48 17,025 REF
3 $3,120 1,382 11 48 26,735 DRY
4 $3,205 1,033 1 53 26,175 DRY
5 $3,188 1,320 3 53 17,994 DRY
6 $2,835 1,103 9 53 32,206 DRY
7 $2,364 743 1 48 18,588 DRY

ID unique identification number for the load


CPL cost per load ($)
Dist distance hauled for shipment (miles)
LdTime lead time from offer to tender to carrier (days) 0 = same day
TrlLng trailer length (feet)
Wgt weight of goods in trailer (lbs)
Eqpt equipment type (Dry Van or Refrigerated)

3
Duffy Industries – Quick Statistics
CPL Dist LdTime TrlLng Wgt
Min $1,660 502 0.00 48 15,100
25th Pct $2,632 904 2.75 48 21,221
Mode $3,730 / $3,171 1273 1.00 53 #N/A
Median $3,166 1273 6.00 53 26,514
Mean $3,132 1207 5.87 51.3 26,708
75th Pct $3,701 1538 9.00 53 32,276
Max $4,301 1793 13.00 53 39,931
Range $2,641 1291 13.00 5 24,831
IQ $1,070 634 6.25 5 11,055
StdDev.s $655 387 3.96 2.4 7,070
CORR(CPL,X) 1.00 0.90 -0.09 0.14 0.08
Eqpt: 60 Ref and 40 Dry shipments

What to do next?
 Explore how CPL is influenced by other variables
 Develop a descriptive model where CPL=f(Dist, LdTime, . . . ?)
 Develop a predictive model for CPL
4
Setting up the Variables

5
Dependent vs. Independent Variables
• We want to relate the movement of one (dependent) variable
to a small set of relevant (independent) variables.
• Dependent variable Y is a function of independent variables X.
• Examples:
 Property Values = f(area, location, # bathrooms, …)
 Sales = f(last month’s sales, advertising budget, price, seasonality, .. .)
 Probability Taking Transit versus Driving = f(income, location, . . .)
 Height = f(age, gender, height of parents, . . . )
 GPA = f(GMAT, age, undergraduate Grades, …)
 Number of Fliers = f(Economic activity, size of origin and destination
cities, competitor’s price, . . . )
 Condominium fees = f(area, story . . .)

6
Variable Types
• Variables have different scales
 Nominal Scale / Categorical – Groupings without related value
 Income status (1=retired, 2=student, 3=two income, etc.)
 Country of origin (1=US, 2=Canada, 3=China, etc.)
 Organizational structure (1=centralized, 2=decentralized, etc.)
 Ordinal Scale – indicates ranking but not proportionality between values
 Job satisfaction scale 1 to 5 (a 2 is not twice as good as a 1)
 Planning versus Response Profile (0 = Planner, 4 = Responder)
 Education level (1=High School, 2=Undergraduate, 3=Masters, etc.)
 Ratio Scale – value indicating ranking and relation
 Examples: Age, Income, Cost, Distance, Weight, . . .

• Form of the Dependent Variable dictates the method used


 Continuous – takes any value
 Discrete – takes only integer values
 Binary – is equal to 0 or 1
We will focus on Linear Regression of continuous, ratio-scaled dependent variables.
7
Duffy Industries
• Dependent variable, Y: CPL or cost per load
• Potential independent variables, Xi:
 Dist – distance
 LdTime – lead time
 TrlLng – length of trailer
 Wgt – weight
 Eqpt – equipment type
• Start with simple linear model
• Draw “best fit” line
• CPL = f(Dist) = β0 + β1X1
[Scatter plot: Cost Per Load ($) vs. Distance (miles) for the ~100 TL shipments]
8
Linear Regression

9
Linear Regression Model
• Formally,
 The relationship is described in terms of a linear model
 The data (xi, yi) are the observed pairs from which we try to estimate the
β coefficients to find the ‘best fit’
 The error term, ε, is the ‘unaccounted’ or ‘unexplained’ portion
 The error terms are assumed to be iid ~N(0,σ) and catch all of the factors
ignored or neglected in the model

Yi = β0 + β1xi + εi   for i = 1, 2, ..., n   (Observed: xi, yi; Unknown: β0, β1, εi)
E(Y | x) = β0 + β1x
StdDev(Y | x) = σ

10
Linear Regression - Residuals
• Residuals
 Predicted or estimated values found by using regression coefficients, b.
 Residuals, ei, are the difference of actual, yi, minus predicted, ŷi, values
 We want to select the “best” b values that “minimize the residuals”

Suppose we find ŷi = b0 + b1xi to be CPLi = 1000 + 1.75 (Disti)
[Scatter plot: Cost Per Load ($) vs. Distance (miles) with the fitted line; Intercept b0 = 1000, Slope b1 = 1.75]
Example, observation 32: observed x32 = 1,007 and y32 = $2,346; estimated value ŷ32 = 1000 + 1.75(1007) = 2,762.25; residual e32 = 2346 − 2762.25 = −416.25
Distance (miles) 11
Linear Regression – Best Fit
• How should I determine the “best fit” with the residuals?
 Min sum of errors? min Σ(yi - b0 - b1xi)
 Min sum of absolute error? min Σ|yi - b0 - b1xi|
 Min sum of squares of error? min Σ(yi - b0 - b1xi)2

We will select the model that minimizes the residual sum of squares . . . WHY?

• Ordinary Least Squares (OLS) Regression


 Finds the optimal value of the coefficients (b0 and b1) that minimize the
sum of the squares of the errors.

12
Formal Definitions
• Where did this come from?
 Unconstrained optimization
(think back to first week!)
 Partial derivatives to find the
first order optimality condition
with respect to each variable.

• We can expand to multiple variables


 So, for k variables we need to find k regression coefficients
Yi = β0 + β1x1i + ... + βkxki + εi   for i = 1, 2, ..., n
E(Y | x1, x2, ..., xk) = β0 + β1x1 + β2x2 + ... + βkxk
StdDev(Y | x1, x2, ..., xk) = σ

Σi=1..n (ei)² = Σi=1..n (yi − ŷi)² = Σi=1..n (yi − b0 − b1x1i − ... − bkxki)²

13
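To make the minimization concrete, here is a minimal sketch in Python using the first seven Duffy observations shown earlier; it solves min Σ(yi − b0 − b1xi)² numerically with NumPy's least-squares routine rather than by hand-derived partial derivatives.

```python
import numpy as np

# First seven observations from the Duffy snapshot (Dist in miles, CPL in $)
dist = np.array([1579, 1298, 1382, 1033, 1320, 1103, 743], dtype=float)
cpl = np.array([3692, 3279, 3120, 3205, 3188, 2835, 2364], dtype=float)

# Design matrix [1, x] so the intercept b0 is estimated along with the slope b1
X = np.column_stack([np.ones_like(dist), dist])

# np.linalg.lstsq solves min_b ||y - Xb||^2, i.e. the OLS objective above
b, _, _, _ = np.linalg.lstsq(X, cpl, rcond=None)
b0, b1 = b

residuals = cpl - X @ b                 # e_i = y_i - yhat_i
print(f"b0 = {b0:.1f}, b1 = {b1:.3f}, RSS = {np.sum(residuals**2):.0f}")
```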
Validating a Model – First Steps

14
Model Evaluation Metrics
• All packages will provide statistics for evaluation
 Names and format will differ slightly package by package
• Model Output
 Model Statistics (Regression Statistics or Summary of Fit)
 Coefficient of Determination or Goodness of Fit (R2)
 Adjusted R2
 Standard Error (Root Mean Squared Error)
 Analysis of Variance (ANOVA)
 Sum of the Squares (Model, Residual/Error, and Total)
 Degrees of Freedom
 Parameter Statistics (Coefficient Statistics)
 Coefficient (b value)
 Standard error
 t-Statistic
 p-value
 Upper and Lower Bounds

15
Model Validation 1 – Overall Fit
• How much variation in dependent variable, y, can we explain?
 If we only have the mean?
 If we can make estimates?
• What is the total variation of CPL?
 Find dispersion around the mean
 Called the Total Sum of Squares: TSS = Σi=1..n (yi − ȳ)²
• What if we now make estimates of y for each x?
 Find variation not accounted for by the estimates: ei = yi − ŷi
 Called the Error or Residual Sum of Squares: RSS = Σi=1..n (yi − ŷi)²
• The regression model “explains” a certain percentage of the total variation of the dependent variable
 Coefficient of Determination or R²: R² = 1 − RSS/TSS
 Ranges between 0 and 1
 The adj R² corrects for additional variables: adj R² = 1 − (1 − R²)(n − 1)/(n − k − 1)
 n = # observations
 k = # independent variables (not b0)
(A sketch of these calculations follows below.)

Data excerpt (ID, CPL (Y), Dist (X)):
1 $3,692 1,579   2 $3,279 1,298   3 $3,120 1,382   4 $3,205 1,033   5 $3,188 1,320
6 $2,835 1,103   7 $2,364 743   8 $2,434 772   9 $3,486 1,389   10 $3,730 1,761
11 $3,735 1,664   12 $4,096 1,542   13 $2,123 641   14 $3,560 1,527   15 $4,041 1,407
16 $3,765 1,720   17 $3,565 1,310   18 $1,686 565   19 $3,045 1,200   20 $2,933 1,207
. . .   100 $4,208 1,687
16
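A minimal sketch of these fit statistics in Python; it assumes you already have arrays of actual values y and predicted values y_hat and know k, the number of independent variables (the adjusted R² formula used is the standard one implied by the n and k definitions above).

```python
import numpy as np

def fit_statistics(y, y_hat, k):
    """TSS, RSS, R^2, and adjusted R^2 for a fitted regression model."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    n = len(y)
    tss = np.sum((y - y.mean()) ** 2)     # total variation around the mean
    rss = np.sum((y - y_hat) ** 2)        # variation the model leaves unexplained
    r2 = 1 - rss / tss
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return tss, rss, r2, adj_r2
```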
Model Validation 2 – Individual Coefficients
• Each independent variable (and b0) will have:
 An estimate of the coefficient (bi)
 A standard error (sbi); for a single independent variable, sb1 = se / (sx √(n − 1)), where
  se = Standard error of the model
  sx = Standard deviation of the independent variable
  n = number of observations
 The t-statistic, t = bi / sbi, distributed ~Student-t with df = n − k − 1, where
  k = number of independent variables
  bi = estimate or coefficient of the independent variable
 A corresponding p-value – Testing the Slope
  If no linear relationship exists between the two variables, we would expect the regression line to be horizontal, that is, to have a slope of zero.
  We want to see if there is a linear relationship, i.e. we want to see if the slope (b1) is something other than zero. So: H0: b1 = 0 and H1: b1 ≠ 0
• Confidence Intervals
 We can estimate an interval for the slope parameter (b1)

17
Model 1: CPL=f(Dist)

18
Model 1: CPL = b0 + b1(Dist)
Model Statistics: R² = 0.818, adj R² = 0.816, se = 281.3, RSS = 7,755,591, TSS = 42,519,984
Estimation Model: CPL = 1282 + 1.532 (Dist)

                  Coefficient   Std Error (sbi)   t-stat   p-value    Lower CI (95%)   Upper CI (95%)
Intercept (b0)    1,282         92.61             13.85    <0.0001    1,099            1,466
Distance (b1)     1.532         0.073             20.96    <0.0001    1.387            1.677

Interpretation:
• Model explains ~82% of total variation in CPL (very good!)
• Both the b0 and b1 terms make sense in terms of magnitude and sign
and are statistically valid (p<0.0001)

19
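This output can be reproduced with any regression package. Below is a minimal sketch using statsmodels; the file name duffy_tl_shipments.csv and the column names CPL and Dist are assumptions based on the data dictionary, not a file distributed with the slides.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file containing the ~100 shipment records described earlier
duffy = pd.read_csv("duffy_tl_shipments.csv")

X = sm.add_constant(duffy["Dist"])          # adds the intercept term b0
model1 = sm.OLS(duffy["CPL"], X).fit()

# Reports R^2, adj R^2, standard errors, t-stats, p-values, and 95% CIs
print(model1.summary())
```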
How can we improve the model?
ID Cost Per Load Distance (miles) LeadTime (days) Trailer Length (ft) Weight (lbs) Equipment
1 $3,692 1,579 1 53 20,559 REF
2 $3,279 1,298 12 48 17,025 DRY
3 $3,120 1,382 11 48 26,736 REF
4 $3,205 1,033 1 53 26,176 REF
5 $3,188 1,320 3 53 17,994 REF
6 $2,835 1,103 9 53 32,207 REF
7 $2,364 743 1 48 18,589 REF

• What potential additions can we make?


 Does the equipment type matter?
 Does lead time have an impact?
 Does the trailer length have an effect?
 Does the weight influence rates?
 Does the CPL have a non-linear relationship with distance? weight?
• Be logical in approach and exploration – always have a
hypothesis going in!

20
Model 2: CPL = f(Dist, Wgt)

21
Model 2: CPL = b0 + b1(Dist) + b2(Wgt)
Model Statistics: R² = 0.819, adj R² = 0.815, se = 281.964, RSS = 7,711,851, TSS = 42,519,984
Estimation Model: CPL = 1354 + 1.538 (Dist) – 0.003 (Wgt)

                  Coefficient   Std Error (sbi)   t-stat    p-value    Lower CI (95%)   Upper CI (95%)
Intercept (b0)    1,354         134.4             10.077    <0.0001    1,088            1,621
Distance (b1)     1.538         0.074             20.850    <0.0001    1.392            1.685
Weight (b2)       -0.003        0.004             -0.74     0.46       -0.011           0.005

Interpretation:
• Model explains ~82% of total variation in CPL (still very good!)
• Note that while R2 improved from Model 1, the adj R2 got worse!
• Both the b0 and b1 terms make sense in terms of magnitude and sign and are
statistically valid (p<0.0001)
• b2 does not make sense (more weight costs less?) and has poor p-value
22
Linear Transformations

23
What about non-linear relationships?
Suppose we think that CPL has some non-
linear relationship with some of our
independent variables . . .

Recall,
e=2.71828 . . .
• ln(1) = 0
• ln(e) = 1
• ln(xy) = ln(x) + ln(y)
• ln(x/y) = ln(x) – ln(y)
• ln(x^a) = a ln(x)
24
Modeling Techniques 1
• Linear Transformations
 We assume a linear model: yi = b0 + b1xi1 + b2xi2 +. . .+ bkxik
 What if we have a non-linear relationship with an independent variable?
 OLS Regression is still OK as long as the estimated equation is linear in its coefficients (the b's) – the independent variables themselves can be transformed

• Let’s Try! Model 3: CPL = f(Dist, Dist2)


 Testing whether the distance effect tapers off at longer distances
 Create new variable: DistSq = DIST^2
 Run Regression: CPL = f(Dist, DistSq) – a sketch of the transformation follows below

25
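A minimal sketch of the transformation step, reusing the hypothetical duffy_tl_shipments.csv file and column names assumed earlier:

```python
import pandas as pd
import statsmodels.api as sm

duffy = pd.read_csv("duffy_tl_shipments.csv")   # hypothetical file name

# Create the squared term; the model stays linear in the coefficients
duffy["DistSq"] = duffy["Dist"] ** 2

X = sm.add_constant(duffy[["Dist", "DistSq"]])
model3 = sm.OLS(duffy["CPL"], X).fit()
print(model3.summary())
```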
Model 3: CPL = b0 + b1(Dist) + b2(DistSq)
Model Statistics: R² = 0.818, adj R² = 0.814, se = 282.714, RSS = 7,752,918, TSS = 42,519,984
Estimation Model: CPL = 1236 + 1.623 (Dist) – 0.00004 (DistSq)

                  Coefficient   Std Error (sbi)   t-stat   p-value    Lower CI (95%)   Upper CI (95%)
Intercept (b0)    1,236         271               4.55     <0.0001    697              1,774
Distance (b1)     1.623         0.503             3.22     0.002      0.624            2.622
Distance² (b2)    -0.00004      0.00022           -0.18    0.855      -0.0005          0.0004

Interpretation:
• Model explains ~82% of total variation in CPL (still very good!), but squared
term did not improve the adj R2 from Model 1.
• b0 and b1 still are good (note slight degradation of p-value for b1)
• b2 sign and magnitude make sense (at 1000 miles, reduces effect by $40) but
has poor p-value
26
Modeling Categorical Variables

27
Modeling Techniques 2
• So far, we assumed that independent variables are continuous &
ratio scalar.

• What if we have a nominal/categorical independent variable?

• Create a Dummy Variable


 Suppose you think equipment type impacts CPL?
 Create a binary dummy variable RefFlag =1 if Refrigerated, =0 o.w.
 Run the Regression CPL = f(Dist, RefFlag) – a sketch follows below
 Coefficient of RefFlag captures the differential impact of refrigerated
trailers versus Dry vans
• Notes:
 You do not need to create two dummy variables (RefFlag and DryFlag)
– in fact it will fail! This is over-specifying.
 If we create a DryFlag variable and run CPL=f(Dist, DryFlag) we will get
the same estimates for each observation!

28
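A minimal sketch of the dummy-variable step, again assuming the hypothetical duffy_tl_shipments.csv file with an Eqpt column coded "REF"/"DRY":

```python
import pandas as pd
import statsmodels.api as sm

duffy = pd.read_csv("duffy_tl_shipments.csv")   # hypothetical file name

# RefFlag = 1 if refrigerated, 0 otherwise; Dry Van is the baseline category
duffy["RefFlag"] = (duffy["Eqpt"] == "REF").astype(int)

X = sm.add_constant(duffy[["Dist", "RefFlag"]])
model4 = sm.OLS(duffy["CPL"], X).fit()

# The RefFlag coefficient is the differential cost of reefers vs. dry vans
print(model4.params)
```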
Model 4: CPL = b0 + b1(Dist) + b2(RefFlag)
Model Statistics: R² = 0.822, adj R² = 0.818, se = 279.650, RSS = 7,585,804, TSS = 42,519,984
Estimation Model: CPL = 1236 + 1.529 (Dist) + 84 (RefFlag)

                  Coefficient   Std Error (sbi)   t-stat   p-value    Lower CI (95%)   Upper CI (95%)
Intercept (b0)    1,236         97.343            12.695   <0.0001    1,042.6          1,428.96
Distance (b1)     1.529         0.073             21.03    <0.0001    1.384            1.673
RefFlag (b2)      84.1          57.11             1.473    0.144      -29.2            197.5

Interpretation:
• Model explains ~82% of total variation in CPL (adj R2 improved from Model 1)
• Both the b0 and b1 terms are still fine
• b2 does make sense in terms of sign (refrigeration costs more) but has a poor p-
value. Perhaps it is more of a function of distance . . . .
• Let’s test – new variable RefDist = RefFlag*Dist
29
Model 5: CPL = b0 + b1(Dist) + b2(RefDist)
Model Statistics: R² = 0.821, adj R² = 0.817, se = 280.317, RSS = 7,622,037, TSS = 42,519,984
Estimation Model: CPL = 1283 + 1.496 (Dist) + 0.06 (RefDist)

                  Coefficient   Std Error (sbi)   t-stat   p-value    Lower CI (95%)   Upper CI (95%)
Intercept (b0)    1,283         92.278            13.9     <0.0001    1,099.9          1,466.2
Distance (b1)     1.496         0.078             19.19    <0.0001    1.341            1.65
RefDist (b2)      0.059         0.045             1.304    0.195      -0.031           0.149

Interpretation:
• Model is good – b0, and b1 are fine.
• b2 is problematic – sign and magnitude are reasonable – but p-value is bad.
• Should we keep it? This is where we get more art than science. If important,
then OK – but always state the p-value so the user of the model understands its
strength/weakness.
30
Model 6: CPL = b0 + b1(Dist) + b2(SameDay)
SameDay = 1 if LdTime=0, =0 o.w.
Model Statistics: R² = 0.828, adj R² = 0.824, se = 274.655, RSS = 7,317,227, TSS = 42,519,984
Estimation Model: CPL = 1237 + 1.552 (Dist) + 233 (SameDay)

                  Coefficient   Std Error (sbi)   t-stat   p-value    Lower CI (95%)   Upper CI (95%)
Intercept (b0)    1,237         92.32             13.4     <0.0001    1,054            1,420
Distance (b1)     1.552         0.072             21.6     <0.0001    1.409            1.694
SameDay (b2)      233           96.61             2.4      0.018      41.1             424.6

Interpretation & Insights:


• There are many ways to model the Lead Time effect:
• Continuous – each day adds a linear cost
• OverWeek = 1 if LdTime>7 days, =0 o.w.
• Quantified potential financial benefit for changing practice (SameDay tenders)

31
Model Validation II

32
Regression Assumptions
1. There is a population regression line that joins the means of the
dependent variable for each value of the explanatory variables.
2. This implies that the mean of the error is 0.
3. The variance of the dependent variable is constant for all
values of the explanatory variables (Homoscedasticity)
4. The dependent variable is normally distributed for any value of
the explanatory variables.
5. The error terms are probabilistically independent.

33
Model Validation
– Multi-Collinearity & Autocorrelation

• Multi-Collinearity – Are the independent variables correlated?


 When two or more independent variables are highly correlated
 Model might have high R2 but explanatory variables might fail t-test
 Can often result in strange results for correlated variables
 Check for correlation and remove correlated independent variables

• Autocorrelation – Are the residuals not independent?


 Errors are supposed to be identical and independently distributed (iid)
 Typically a time series issue – plot variables over time to see trend
 If they are not independent, they are autocorrelated

34
Model Validation – Heteroscedasticity
• Heteroscedasticity – Does the standard deviation of the error terms
differ for different values of the independent variables?
 Observations are supposed to have the same variance

 How do the residuals behave over the ind. var.’s?

 Examine scatter plots and look for “fan-shaped” distributions

 Fixes (weighted least squares regression, others – beyond scope)

35
Model Validation
• Linearity – Is the dependent variable linear with independent variables?
 With one ind. variable, scatter plots work
 More than one ind. variable – look at R2
• Normality of Error Terms – Are the error terms distributed Normally?
 We have assumed that e~N(0,σ)
 Look at a histogram of the residuals
 There are more formal tests; e.g., Chi-Square or Kolmogorov–Smirnov
tests
[Histogram of the residuals, roughly centered on 0 and ranging from about −150 to +175. A sketch of both residual checks follows below.]
36
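A minimal sketch of these two visual checks with matplotlib, assuming the Model 1 fit and the hypothetical file from the earlier sketches:

```python
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

duffy = pd.read_csv("duffy_tl_shipments.csv")               # hypothetical file name
model1 = sm.OLS(duffy["CPL"], sm.add_constant(duffy["Dist"])).fit()
resid = model1.resid

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Normality check: histogram of residuals should look roughly bell-shaped around 0
ax1.hist(resid, bins=15)
ax1.set_title("Histogram of residuals")

# Heteroscedasticity check: look for a "fan" shape as the independent variable grows
ax2.scatter(duffy["Dist"], resid)
ax2.axhline(0, linestyle="--")
ax2.set_xlabel("Distance (miles)")
ax2.set_ylabel("Residual")
ax2.set_title("Residuals vs. Distance")

plt.tight_layout()
plt.show()
```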
Re-Sampling

Based on slides initially prepared by Dr. Tugba Efendigil


What’s the problem?
Robin Curtin is the Transportation VP for Duffy Industries, a food manufacturer. She is trying
to understand their transportation rates. Duffy uses contract “over-the-road” trucking
companies for TL shipments. These are moves directly from one point to another with no
intermediate stops. Robin has a snapshot of data showing the costs and some other
characteristics for about 100 TL shipments. She would like to know what characteristics are
driving the rates for her TL services.
ID Cost Per Load Distance (miles) LeadTime (days) Trailer Length (ft) Weight (lbs) Equipment
1 $3,692 1,579 1 53 20,559 DRY
2 $3,279 1,298 12 48 17,025 REF
3 $3,120 1,382 11 48 26,736 DRY
4 $3,205 1,033 1 53 26,176 DRY
5 $3,188 1,320 3 53 17,994 DRY
6 $2,835 1,103 9 53 32,207 DRY
7 $2,364 743 1 48 18,589 DRY
ID unique identification number for the load
CPL cost per load ($)
Dist distance hauled for shipment (miles) How did we model the relationship between CPL
LdTime lead time from offer to tender (days), 0 = same day and the independent variables?
TrlLng trailer length (feet)
Wgt weight of goods in trailer (lbs)
Eqpt equipment type (Dry Van or Refrigerated)
Duffy Industries – Regression Models
Model 1: CPL = b0 + b1(Dist)   R2=81.6%
Model 2: CPL = b0 + b1(Dist) + b2(Wgt)   R2=81.5%
Model 3: CPL = b0 + b1(Dist) + b2(DistSq)   R2=81.4%
Model 4: CPL = b0 + b1(Dist) + b2(RefFlag)   R2=81.8%
Model 5: CPL = b0 + b1(Dist) + b2(RefDist)   R2=81.7%
Model 6: CPL = b0 + b1(Dist) + b2(SameDay)   R2=82.4%
[Data: 100 observations (records); CPL is the dependent variable, the others are independent variables]
Note that we used all of the data to calibrate the potential models. Is this OK? Is it always OK?

Descriptive vs. Predictive Models
Need to separate the training from the testing data!

We want to see how well the model predicts on new data!

Why does using all of the data pose a problem?
Short Answer: It can lead to over-fitting models, which lowers forecasting accuracy!
[Example: fitting Sales = f(Effort) to four data points with increasingly flexible models; each panel plots Sales vs. Effort with the fitted curve]
Model: Linear Equation      y = 0.4x + 4.5                       R² = 0.16
Model: Quadratic Equation   y = 0.5x² − 2.1x + 7                 R² = 0.36
Model: Cubic Equation       y = 1.3333x³ − 9.5x² + 20.167x − 7   R² = 1
Recall we are balancing between bias and variance!
[2×2 diagram of Bias (low/high) vs. Variance (low/high), illustrating the underfitting and overfitting regimes. Source: The Elements of Statistical Learning, Hastie et al. (2009)]
• Bias – persistent tendency to over- or under-predict – the mean of the errors
• Variance – closeness of predictions to the true value – the variability of the error
The basic idea is to divide your data into sets:
• Training Data: used for developing, calibrating, and fitting the model.
• Testing Data: used for evaluating how well the model works.
Two Questions:
1. How much data should I allocate for training versus testing?
2. Which data should I use for each?
[Diagram: the data set split into train and test portions in different ratios, e.g., 50:50, 90:10, 10:90, 70:30]
No hard and fast rules – more rules of thumb:
• Train on more data than you test and be sure your training data is ‘similar’ to your testing data.
• General consensus is 75:25 for train:test (see the sketch below).
• Time series is special and needs to be treated differently.
• In some Machine Learning applications you may divide your data into three sets: training, validation, and testing.
But why only test once? Two methods for ‘re-sampling’:
• Cross-Validation – make multiple partitions or ‘folds’ of data, run analysis on each fold, and average overall performance.
• Bootstrap – similar to cross-validation except we resample the data each time so that an element of data could be re-used.
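A minimal sketch of a single train:test split at the rule-of-thumb 75:25 ratio, using scikit-learn and the hypothetical Duffy file assumed earlier:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

duffy = pd.read_csv("duffy_tl_shipments.csv")   # hypothetical file name

# 75:25 train:test split; fixing random_state makes the split reproducible
train, test = train_test_split(duffy, test_size=0.25, random_state=42)
print(len(train), "training observations,", len(test), "testing observations")
```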
Wait, why are we re-sampling again?
• We need to understand how well our model performs on new (non-trained) data.
• Challenge:
 We are using a limited set of data to estimate the unknown parameters.
 While we want a strong model, just because a model fits very well for the sample (training) data it is not guaranteed to fit well for other data.
 In fact, this can lead to overfitting a model.
• Resampling avoids overfitting in predictive models:
 Partition the data into several different train:test subsets
 Train and test the model on each of these ‘folds’
 Average the error terms across all of the tests.
• How generalizable is our model across ‘new’ data?
 The model coefficients will not be the same between the folds
 Re-sampling is used to test the structure of the model – not the detailed parameters
[Diagram: Model 1 trained on four folds of the data set (R²=.85, .89, .91, .87), Avg Error = 88%; Model 2 trained on four folds (R²=.94, .82, .76, .80), Avg Error = 83%]
Cross-Validation Methods
k-fold:
1. Divide the sample randomly and equally into k pairwise disjoint subsets.
2. Use k-1 parts for training, and 1 for testing.
3. Repeat the procedure for each fold, changing the test subset.
4. Compute k performance error metrics, one for each fold, and find the average overall error estimate.
What value should I use for k?
• Typical values are 5 or 10 folds – but essentially this is arbitrary.
• A bigger k leads to lower bias, but higher variance and longer run time.
Leave-One-Out Cross-Validation
1. Let k = n, the number of observations.
2. Follow the k-fold method.
Duffy Industries – Regression Models
Model 1: CPL = b0 + b1(Dist)   R2=81.6%
Model 7: CPL = b0 + b1(Dist) + b2(RefFlag) + b3(SameDay) + b4(RefDist) + b5(LT>Week) + b6(Weight) + b7(LeadTime) + b8(TrlLngth)   R2=82.0%
[Data: 100 observations (records); CPL is the dependent variable, the others are independent variables]
Running 4-fold Cross-Validation (a sketch follows below)
• 4 Folds – each trained on 75 observations and tested on the remaining 25
• Compared average Mean Absolute Percent Error (MAPE) across the two models to select the most robust structure.

MAPE by fold:
Fold   Model 1   Model 7
1      9.3%      9.1%
2      6.6%      7.3%
3      9.4%      9.7%
4      7.7%      7.5%
Avg    8.25%     8.40%

Notes:
• The simpler model has a lower average MAPE, but this will not always be the case.
• We select the structure of Model 1.
• We will use the coefficients from the fold with the best fit model.
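A minimal sketch of the 4-fold comparison with scikit-learn and statsmodels. It assumes the hypothetical duffy_tl_shipments.csv file and that the derived columns used by Model 7 (RefFlag, SameDay, RefDist, LTOverWeek) have already been created as in the earlier sketches; the exact column names are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import KFold

duffy = pd.read_csv("duffy_tl_shipments.csv")   # hypothetical file; derived columns assumed present

# Candidate structures: lists of independent-variable columns
models = {
    "Model 1": ["Dist"],
    "Model 7": ["Dist", "RefFlag", "SameDay", "RefDist", "LTOverWeek",
                "Wgt", "LdTime", "TrlLng"],
}

kf = KFold(n_splits=4, shuffle=True, random_state=1)

for name, cols in models.items():
    mapes = []
    for train_idx, test_idx in kf.split(duffy):
        train, test = duffy.iloc[train_idx], duffy.iloc[test_idx]
        fit = sm.OLS(train["CPL"], sm.add_constant(train[cols])).fit()
        pred = fit.predict(sm.add_constant(test[cols]))
        # Mean Absolute Percent Error on the held-out fold
        mapes.append(np.mean(np.abs((test["CPL"] - pred) / test["CPL"])))
    print(f"{name}: average MAPE = {np.mean(mapes):.1%}")
```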
A Quick Note on Bootstrap Sampling . . .
Bootstrap sampling is similar to k-folds cross-validation, except we
“resample with replacement” to create the training data sample.
Method:
1. Determine the sample size, s, needed for training
2. While the number of observations in the sample is less than s,
1. Randomly draw one observation from the whole dataset containing n observations and add it to the sample
2. Replace the observation back into the whole dataset
3. Train on the s observations and test on the remaining (out of bag or OOB) observations
4. Repeat this procedure several times and average the model score.
EXAMPLE: Suppose we have a dataset with 6 items {10, 4, 16, 6, 12, 9} and we want to test a model that requires a training sample size of 3 observations. (A sketch of this procedure follows below.)

Run    Training Sample   Testing Data    Model Performance
1      10, 4, 16         6, 12, 9        MAPE1
2      12, 12, 6         10, 4, 16, 9    MAPE2
3      10, 16, 10        4, 6, 12, 9     MAPE3
...
n      4, 6, 12          10, 16, 9       MAPEn

Overall score = (Σ MAPEi) / n
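A minimal sketch of the resampling mechanics for the toy 6-item dataset above; the model fitting and MAPE scoring are left as comments since no model is specified for the toy data.

```python
import numpy as np

rng = np.random.default_rng(0)

data = np.array([10, 4, 16, 6, 12, 9])   # the 6-item dataset from the example
s = 3                                    # required training sample size

for run in range(1, 5):
    # Draw s observations *with replacement* for the training sample
    train_idx = rng.integers(0, len(data), size=s)
    # Anything never drawn is "out of bag" (OOB) and is used for testing
    oob_mask = ~np.isin(np.arange(len(data)), train_idx)
    train, test = data[train_idx], data[oob_mask]
    print(f"Run {run}: train = {train}, test (OOB) = {test}")
    # ... fit the model on `train`, compute MAPE on `test`, then average across runs
```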
A Longer Note on Time Series . . .
• You cannot blindly use k-folds cross validation for time series. Why not?
• For k-folds cross-validation, we assume the observations are independent.
• This does not hold for time series since there is a temporal element, e.g.,
• We do not want to train on Mon-Fri and test on Sat-Sun
• We do not want to train on Jan-Sep and then test on Oct-Dec
• We need to do something called “backtesting” or “hindsighting”
• Three common backtesting techniques:
• Train-Test
• Split the data in a way that respects the temporal order of observations (e.g., Train on years 1-2, Test on year 3)
• Multiple Train-Test
• Split the data into multiple segments, each respecting the temporal order of observations and having the same test size (e.g., Train on the first two months of a quarter, Test on the last month).
• Walk-Forward Validation (recommended!)
• The model is estimated and updated each time step when new data is received.
• Need to decide on the minimum number of observations to train (width of training window) and
whether to use sliding or expanding windows.
Walk Forward Validation
Suppose I decided I need at least 8 periods to train my data.
• Expanding Test Window – anchor the start of the window and use all data that is available for testing.
• Sliding Test Window – only use the most recent n observations for testing, where n = minimum data required.
Two Questions:
1. What is my minimum number of training periods?
2. Which type of window should we use?
Minimum number of training periods depends on the data . . .
• Using more data improves the model, but decreases the number of tests
• I always try to use something that makes sense to the application (e.g., 4, 8, 12 weeks)
The type of window also depends on the data . . .
• Expanding window will create better fit models, but assumes no ‘concept drift’ or change in the underlying data.
• Sliding windows only use the most current data and therefore are more responsive.
(A sketch of both window types follows below.)
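A minimal sketch of walk-forward validation on a toy series, showing both window types; the series and the 8-period minimum are assumptions for illustration.

```python
import numpy as np

series = np.arange(100, 114)     # 14 periods of toy data
min_train = 8                    # minimum number of training periods
window = "expanding"             # or "sliding"

for t in range(min_train, len(series)):
    if window == "expanding":
        train = series[:t]                    # anchored start, growing window
    else:
        train = series[t - min_train:t]       # only the most recent min_train periods
    test_obs = series[t]                      # one-step-ahead observation to forecast
    # ... re-fit the model on `train`, forecast `test_obs`, and record the error
    print(f"step {t}: {len(train)} training periods -> forecast period {t}")
```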
Key Points from Lesson

37
The Practice of Regression
1. Choose which independent variables to include in the model,
based on common sense and context specific knowledge.
2. Collect data (create dummy variables if necessary).
3. Run regression -- the easy part.
4. Analyze the output and make changes in the model -- this is
where the action is.
 Using fewer independent variables is better than using more
 Always be able to explain or justify inclusion (or exclusion) of a variable
 Always validate individual explanatory variables (p-value)
 There is more art than science to these models

38
Regression Analysis Checklist
• Linearity: scatter plot, common sense, and knowing your
problem
• Signs of Regression Coefficients: do they agree with intuition?
• t-statistics: are the coefficients significantly different from zero?
• adj R2: is it reasonably high in the context?
• Normality: plot histogram of the residuals
• Heteroscedasticity: plot residuals with each x variable
• Autocorrelation: “time series plot”
• Multicollinearity: compute correlations of the x variables
 If |corr| > .70 – you might want to remove one of the variables (see the sketch below)

39
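A minimal sketch of the multicollinearity check, assuming the hypothetical Duffy file and column names from the earlier sketches:

```python
import pandas as pd

duffy = pd.read_csv("duffy_tl_shipments.csv")   # hypothetical file name

# Pairwise correlations of candidate x variables;
# |corr| > 0.70 between two of them suggests dropping one
print(duffy[["Dist", "LdTime", "TrlLng", "Wgt"]].corr().round(2))
```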
Practical Concerns – Beware Over-Fitting
• Suppose you are given the following data,

Sales Effort
5 1
4 3
7 4
6 2

• How would you find the “best” fit curve assuming Sales = f(Effort)? (A sketch follows below.)

40
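A minimal sketch that fits the three candidate curves to these four points with NumPy and reports each model's R²; with only four observations the cubic fits the data exactly, which is the over-fitting warning illustrated on the next slide.

```python
import numpy as np

effort = np.array([1, 3, 4, 2], dtype=float)
sales = np.array([5, 4, 7, 6], dtype=float)

for degree in (1, 2, 3):
    coeffs = np.polyfit(effort, sales, degree)       # least-squares polynomial fit
    fitted = np.polyval(coeffs, effort)
    rss = np.sum((sales - fitted) ** 2)
    tss = np.sum((sales - sales.mean()) ** 2)
    print(f"degree {degree}: coefficients = {np.round(coeffs, 3)}, R^2 = {1 - rss / tss:.2f}")
```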
Results – which is the best fit?
[Three fitted models plotted as Sales vs. Effort:]
Model: Linear Equation      y = 0.4x + 4.5                       R² = 0.16
Model: Quadratic Equation   y = 0.5x² − 2.1x + 7                 R² = 0.36
Model: Cubic Equation       y = 1.3333x³ − 9.5x² + 20.167x − 7   R² = 1

41
Modeling Issues and Tips
• Don’t confuse causality and relationship
 Statistics find and measure relationships – not causality
 User must try to explain the causality
• Don’t be a slave to R2 – model must make sense
 Look at adjusted R2 to compare between models
• Simple is better (avoid over-specifying) – rule of thumb: n ≥ 5(k+2), where n = number of observations and k = number of independent variables
• Avoid extrapolating out of observed range
• Non-linear relationships can be modeled through data transformations (ln, sqrt, 1/x, multiplying variables together to form interaction terms)

42
Questions, Comments, Suggestions?
Use the Discussion Forum!

“Wilson and Dexter waiting patiently to regress their way out of their holiday costumes”
[email protected]
ctl.mit.edu
