Module 3 - Regression
ctl.mit.edu
Duffy Industries
• Robin Curtin, the transportation vice president for Duffy Industries, a food manufacturer, is trying to understand the company's transportation rates. She needs to estimate what rates should be for full truckload (TL) service from a new facility. Duffy uses contract "over-the-road" trucking companies for TL shipments. These are moves directly from one point to another with no intermediate stops.
• Robin only has a snapshot of data showing the costs and
some other characteristics of about 100 TL shipments.
• Some questions she would like to answer:
What characteristics are driving the rates for my TL services?
What rates should I expect if I open new lanes?
Duffy Industries
• Let’s take a look at a snapshot of the data
ID Cost Per Load Distance (miles) LeadTime (days) Trailer Length (ft) Weight (lbs) Equipment
1 $3,692 1,579 1 53 20,559 DRY
2 $3,279 1,298 12 48 17,025 REF
3 $3,120 1,382 11 48 26,735 DRY
4 $3,205 1,033 1 53 26,175 DRY
5 $3,188 1,320 3 53 17,994 DRY
6 $2,835 1,103 9 53 32,206 DRY
7 $2,364 743 1 48 18,588 DRY
Duffy Industries – Quick Statistics
CPL Dist LdTime TrlLng Wgt
Min $1,660 502 0.00 48 15,100
25th Pct $2,632 904 2.75 48 21,221
Mode $3,730 / $3,171 1273 1.00 53 #N/A
Median $3,166 1273 6.00 53 26,514
Mean $3,132 1207 5.87 51.3 26,708
75th Pct $3,701 1538 9.00 53 32,276
Max $4,301 1793 13.00 53 39,931
Range $2,641 1291 13.00 5 24,831
IQR $1,070 634 6.25 5 11,055
StdDev (sample) $655 387 3.96 2.4 7,070
CORR(CPL,X) 1.00 0.90 -0.09 0.14 0.08
Eqpt: 60 Ref and 40 Dry shipments
What to do next?
Explore how CPL is influenced by other variables
Develop a descriptive model where CPL=f(Dist, LdTime, . . . ?)
Develop a predictive model for CPL
Setting up the Variables
Dependent vs. Independent Variables
• We want to relate the movement of one (dependent) variable
to a small set of relevant (independent) variables.
• Dependent variable Y is a function of independent variables X.
• Examples:
Property Values = f(area, location, # bathrooms, …)
Sales = f(last month's sales, advertising budget, price, seasonality, . . .)
Probability Taking Transit versus Driving = f(income, location, . . .)
Height = f(age, gender, height of parents, . . . )
GPA = f(GMAT, age, undergraduate grades, . . .)
Number of Fliers = f(Economic activity, size of origin and destination
cities, competitor’s price, . . . )
Condominium fees = f(area, story, . . .)
Variable Types
• Variables have different scales
Nominal Scale / Categorical – groupings with no inherent order or numeric value
Income status (1=retired, 2=student, 3=two income, etc.)
Country of origin (1=US, 2=Canada, 3=China, etc.)
Organizational structure (1=centralized, 2=decentralized, etc.)
Ordinal Scale – indicates ranking but not proportionality between values
Job satisfaction scale 1 to 5 (a 2 is not twice as good as a 1)
Planning versus Response Profile (0 = Planner, 4 = Responder)
Education level (1=High School, 2=Undergraduate, 3=Masters, etc.)
Ratio Scale – value indicates both ranking and proportional relation
Examples: age, income, cost, distance, weight, . . .
[Figure: scatter plot of Cost Per Load ($) vs. Distance (miles) for the ~100 shipments]
Linear Regression
Linear Regression Model
• Formally,
The relationship is described in terms of a linear model
The data (xi, yi) are the observed pairs from which we try to estimate the β coefficients to find the 'best fit'
The error term, ε, is the ‘unaccounted’ or ‘unexplained’ portion
The error terms are assumed to be iid ~N(0,σ) and catch all of the factors
ignored or neglected in the model
Yi = β0 + β1xi + εi   for i = 1, 2, ..., n
E(Y | x) = β0 + β1x
StdDev(Y | x) = σ
(xi, yi are observed; β0, β1, and σ are unknown)
Linear Regression - Residuals
• Residuals
Predicted or estimated values, ŷi, are found using the estimated regression coefficients, b
Residuals, ei, are the difference of actual, yi, minus predicted, ŷi, values: ei = yi − ŷi
We want to select the "best" b values that "minimize the residuals"
Suppose we find ŷi = b0 + b1xi to be CPLi = 1000 + 1.75 (Disti)
For observation 32: x32 = 1,007 miles (observed distance) and y32 = $2,346 (observed value)
Estimated value: ŷ32 = 1000 + 1.75(1007) = 2762.25
Residual: e32 = 2346 − 2762.25 = −416.25
[Figure: scatter plot of Cost Per Load ($) vs. Distance (miles) with the fitted line, intercept b0 = 1000 and slope b1 = 1.75, highlighting the residual for observation 32]
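This arithmetic is easy to verify in code. A minimal sketch in Python, using the illustrative coefficients and observation-32 values from this slide:

```python
# Fitted line from the slide: CPL_hat = 1000 + 1.75 * Dist
b0, b1 = 1000.0, 1.75
x32, y32 = 1007, 2346          # observation 32: distance (mi) and cost ($)

y32_hat = b0 + b1 * x32        # estimated value: 2762.25
e32 = y32 - y32_hat            # residual: -416.25
print(f"estimate = {y32_hat}, residual = {e32}")
```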
Linear Regression – Best Fit
• How should I determine the “best fit” with the residuals?
Min sum of errors? min Σ(yi - b0 - b1xi)
Min sum of absolute error? min Σ|yi - b0 - b1xi|
Min sum of squares of error? min Σ(yi - b0 - b1xi)2
We will select the model that minimizes the residual sum of squares . . . WHY? The plain sum lets positive and negative errors cancel, the absolute-value sum is hard to optimize, while squared error is differentiable and penalizes large misses.
Formal Definitions
• Where did this come from?
Unconstrained optimization
(think back to first week!)
Partial derivatives to find the
first order optimality condition
with respect to each variable.
RSS = Σi=1..n ei² = Σi=1..n (yi − ŷi)² = Σi=1..n (yi − b0 − b1x1i − ... − bkxki)²
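Solving those first-order conditions gives closed-form estimates in the simple-regression case. A minimal sketch in Python (numpy only; the synthetic data are illustrative, loosely echoing Model 1 later in this module):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(500, 1800, 100)                  # distances, loosely like Duffy's lanes
y = 1282 + 1.532 * x + rng.normal(0, 280, 100)   # linear signal plus noise

# Setting the partial derivatives of sum((y - b0 - b1*x)^2) to zero gives:
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

rss = np.sum((y - (b0 + b1 * x)) ** 2)
print(f"b0 = {b0:.1f}, b1 = {b1:.3f}, RSS = {rss:,.0f}")
```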
Validating a Model – First Steps
Model Evaluation Metrics
• All packages will provide statistics for evaluation
Names and format will differ slightly package by package
• Model Output
Model Statistics (Regression Statistics or Summary of Fit)
Coefficient of Determination or Goodness of Fit (R2)
Adjusted R2
Standard Error (Root Mean Squared Error)
Analysis of Variance (ANOVA)
Sum of the Squares (Model, Residual/Error, and Total)
Degrees of Freedom
Parameter Statistics (Coefficient Statistics)
Coefficient (b value)
Standard error
t-Statistic
p-value
Upper and Lower Bounds
Model Validation 1 – Overall Fit
• How much variation in the dependent variable, y, can we explain?
If we only have the mean?
If we can make estimates?
• What is the total variation of CPL?
Find dispersion around the mean
Called the Total Sum of Squares: TSS = Σi=1..n (yi − ȳ)²
• What if we now make estimates of y for each x?
Find variation not accounted for by estimates, ei = yi − ŷi
Called the Error or Residual Sum of Squares: RSS = Σi=1..n (yi − ŷi)²
• The regression model "explains" a certain percentage of the total variation of the dependent variable
Coefficient of Determination or R²: R² = 1 − RSS/TSS
Ranges between 0 and 1
The adj R² corrects for additional variables: adj R² = 1 − (RSS/(n−k−1)) / (TSS/(n−1))
n = # observations, k = # independent variables (not b0)

ID CPL (Y) Dist (X)
1 $3,692 1,579
2 $3,279 1,298
3 $3,120 1,382
4 $3,205 1,033
5 $3,188 1,320
6 $2,835 1,103
7 $2,364 743
8 $2,434 772
9 $3,486 1,389
10 $3,730 1,761
11 $3,735 1,664
12 $4,096 1,542
13 $2,123 641
14 $3,560 1,527
15 $4,041 1,407
16 $3,765 1,720
17 $3,565 1,310
18 $1,686 565
19 $3,045 1,200
20 $2,933 1,207
...
100 $4,208 1,687
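A minimal sketch of these calculations in Python (numpy; the function and variable names are mine):

```python
import numpy as np

def r_squared(y, y_hat, k):
    """Return (R2, adjusted R2); n observations, k independent variables."""
    n = len(y)
    tss = np.sum((y - np.mean(y)) ** 2)   # total variation around the mean
    rss = np.sum((y - y_hat) ** 2)        # variation left unexplained by the model
    r2 = 1 - rss / tss
    adj_r2 = 1 - (rss / (n - k - 1)) / (tss / (n - 1))
    return r2, adj_r2
```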
Model Validation 2 – Individual Coefficients
• Each independent variable (and b0) will have:
An estimate of the coefficient (bi)
A standard error (sbi); for a single x, sb1 = se / (sx √(n − 1)), where
se = standard error of the model
sx = standard deviation of the independent variable
n = number of observations
The t-statistic, t = bi / sbi, ~ Student-t, df = n − k − 1, where
k = number of independent variables
bi = estimate or coefficient of the independent variable
Corresponding p-value – Testing the Slope
If no linear relationship exists between the two variables, we would expect the
regression line to be horizontal, that is, to have a slope of zero.
We want to see if there is a linear relationship, i.e., we want to see if the slope
(b1) is something other than zero. So: H0: β1 = 0 and H1: β1 ≠ 0
• Confidence Intervals
We can estimate an interval for the slope parameter: b1 ± t(α/2, df) sb1
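A minimal sketch of the slope test and confidence interval for the one-variable case (Python with scipy; names are mine):

```python
import numpy as np
from scipy import stats

def slope_inference(x, y, alpha=0.05):
    """t-test and confidence interval for the slope of a one-variable regression."""
    n = len(x)
    sxx = np.sum((x - np.mean(x)) ** 2)
    b1 = np.sum((x - np.mean(x)) * (y - np.mean(y))) / sxx
    b0 = np.mean(y) - b1 * np.mean(x)
    resid = y - (b0 + b1 * x)
    se = np.sqrt(np.sum(resid ** 2) / (n - 2))   # k = 1, so df = n - k - 1 = n - 2
    s_b1 = se / np.sqrt(sxx)                     # standard error of the slope
    t_stat = b1 / s_b1                           # tests H0: beta1 = 0
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
    half = stats.t.ppf(1 - alpha / 2, df=n - 2) * s_b1
    return b1, t_stat, p_value, (b1 - half, b1 + half)
```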
Model 1: CPL=f(Dist)
18
Model 1: CPL = b0 + b1(Dist)
Estimation Model: CPL = 1282 + 1.532 (Dist)
Model Statistics: R² 0.818, adj R² 0.816, se 281.3, RSS 7,755,591, TSS 42,519,984
Interpretation:
• Model explains ~82% of total variation in CPL (very good!)
• Both the b0 and b1 terms make sense in terms of magnitude and sign
and are statistically valid (p<0.0001)
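In practice a package produces all of these statistics in one call. A sketch with statsmodels; the file name duffy_shipments.csv and the column names Cost and Distance are my assumptions about how the Duffy data might be laid out:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("duffy_shipments.csv")   # hypothetical file and column names

model1 = smf.ols("Cost ~ Distance", data=df).fit()
print(model1.summary())   # R2, adj R2, se, ANOVA, coefficients, t-stats, p-values, CIs
```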
How can we improve the model?
ID Cost Per Load Distance (miles) LeadTime (days) Trailer Length (ft) Weight (lbs) Equipment
1 $3,692 1,579 1 53 20,559 REF
2 $3,279 1,298 12 48 17,025 DRY
3 $3,120 1,382 11 48 26,736 REF
4 $3,205 1,033 1 53 26,176 REF
5 $3,188 1,320 3 53 17,994 REF
6 $2,835 1,103 9 53 32,207 REF
7 $2,364 743 1 48 18,589 REF
Model 2: CPL = f(Dist, Wgt)
Model 2: CPL = b0 + b1(Dist) + b2(Wgt)
Estimation Model: CPL = 1354 + 1.538 (Dist) − 0.003 (Wgt)
Model Statistics: R² 0.819, adj R² 0.815, se 281.964, RSS 7,711,851, TSS 42,519,984
Coefficient  Std Error (sbi)  t-stat  p-value  Lower 95% CI  Upper 95% CI
Intercept (b0)  1,354  134.4  10.077  <0.0001  1,088  1,621
Distance (b1)  1.538  0.074  20.850  <0.0001  1.392  1.685
Weight (b2)  -0.003  0.004  -0.74  0.46  -0.011  0.005
Interpretation:
• Model explains ~82% of total variation in CPL (still very good!)
• Note that while R2 improved from Model 1, the adj R2 got worse!
• Both the b0 and b1 terms make sense in terms of magnitude and sign and are
statistically valid (p<0.0001)
• b2 does not make sense (more weight costs less?) and has poor p-value
Linear Transformations
What about non-linear relationships?
Suppose we think that CPL has some non-
linear relationship with some of our
independent variables . . .
Recall,
e=2.71828 . . .
• ln(1) = 0
• ln(e) = 1
• ln(xy) = ln(x) + ln(y)
• ln(x/y) = ln(x) – ln(y)
• ln(xa) = aln(x)
Modeling Techniques 1
• Linear Transformations
We assume a linear model: yi = b0 + b1xi1 + b2xi2 +. . .+ bkxik
What if we have a non-linear relationship with an independent variable?
OLS regression is still OK as long as the estimated equation is linear in its
coefficients – we can transform an independent variable (e.g., x², ln(x)) and
treat the transformation as a new variable, as sketched below.
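In code, the transformation is just another column. A sketch, continuing the hypothetical pandas/statsmodels setup from Model 1 (this is how Model 3 below could be built):

```python
import statsmodels.formula.api as smf

# Either add the transformed column explicitly (df as in the earlier sketch)...
df["DistSq"] = df["Distance"] ** 2
model3 = smf.ols("Cost ~ Distance + DistSq", data=df).fit()

# ...or transform inline; I() tells the formula parser to treat ** as arithmetic
model3 = smf.ols("Cost ~ Distance + I(Distance ** 2)", data=df).fit()
```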
Model 3: CPL = b0 + b1(Dist) + b2(DistSq)
Estimation Model: CPL = 1236 + 1.623 (Dist) − 0.00004 (DistSq)
Model Statistics: R² 0.818, adj R² 0.814, se 282.714, RSS 7,752,918, TSS 42,519,984
Interpretation:
• Model explains ~82% of total variation in CPL (still very good!), but squared
term did not improve the adj R2 from Model 1.
• b0 and b1 still are good (note slight degradation of p-value for b1)
• b2 sign and magnitude make sense (at 1000 miles, reduces effect by $40) but
has poor p-value
Modeling Categorical Variables
Modeling Techniques 2
• So far, we assumed that independent variables are continuous and ratio scaled.
Categorical variables can enter the model as 0/1 dummy variables, as sketched below.
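A sketch of building such a dummy from the Equipment column, continuing the hypothetical pandas/statsmodels setup (this is the RefFlag variable used in Model 4 below):

```python
import statsmodels.formula.api as smf

# df as in the earlier sketch; Equipment takes the values "DRY" and "REF"
df["RefFlag"] = (df["Equipment"] == "REF").astype(int)   # 1 = refrigerated, 0 = dry van

model4 = smf.ols("Cost ~ Distance + RefFlag", data=df).fit()
# A categorical variable with m levels needs m - 1 dummies (one level is the baseline)
```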
Model 4: CPL = b0 + b1(Dist) + b2(RefFlag)
Estimation Model: CPL = 1236 + 1.529 (Dist) + 84 (RefFlag)
Model Statistics: R² 0.822, adj R² 0.818, se 279.650, RSS 7,585,804, TSS 42,519,984
Coefficient  Std Error (sbi)  t-stat  p-value  Lower 95% CI  Upper 95% CI
Intercept (b0)  1,236  97.343  12.695  <0.0001  1,042.6  1,428.96
Distance (b1)  1.529  0.073  21.03  <0.0001  1.384  1.673
RefFlag (b2)  84.1  57.11  1.473  0.144  -29.2  197.5
Interpretation:
• Model explains ~82% of total variation in CPL (adj R2 improved from Model 1)
• Both the b0 and b1 terms are still fine
• b2 does make sense in terms of sign (refrigeration costs more) but has a poor p-
value. Perhaps it is more of a function of distance . . . .
• Let’s test – new variable RefDist = RefFlag*Dist
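That interaction variable is just the product of the dummy and the distance. A sketch, continuing the same hypothetical setup (this builds Model 5 below):

```python
# Interaction term: lets refrigeration add cost per mile rather than a flat amount
df["RefDist"] = df["RefFlag"] * df["Distance"]
model5 = smf.ols("Cost ~ Distance + RefDist", data=df).fit()
print(model5.summary())
```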
Model 5: CPL = b0 + b1(Dist) + b2(RefDist)
Estimation Model: CPL = 1283 + 1.496 (Dist) + 0.06 (RefDist)
Model Statistics: R² 0.821, adj R² 0.817, se 280.317, RSS 7,622,037, TSS 42,519,984
Interpretation:
• Model is good – b0, and b1 are fine.
• b2 is problematic – sign and magnitude are reasonable – but p-value is bad.
• Should we keep it? This is where we get more art than science. If important,
then OK – but always state the p-value so the user of the model understands its
strength/weakness.
Model 6: CPL = b0 + b1(Dist) + b2(SameDay)
SameDay = 1 if LdTime = 0, = 0 otherwise
Estimation Model: CPL = 1237 + 1.552 (Dist) + 233 (SameDay)
Model Statistics: R² 0.828, adj R² 0.824, se 274.655, RSS 7,317,227, TSS 42,519,984
Coefficient  Std Error (sbi)  t-stat  p-value  Lower 95% CI  Upper 95% CI
Intercept (b0)  1,237  92.32  13.4  <0.0001  1,054  1,420
Distance (b1)  1.552  0.072  21.6  <0.0001  1.409  1.694
SameDay (b2)  233  96.61  2.4  0.018  41.1  424.6
Model Validation II
Regression Assumptions
1. There is a population regression line that joins the means of the
dependent variable for all values of the explanatory variables.
2. This implies that the mean of the error is 0.
3. The variance of the dependent variable is constant for all
values of the explanatory variables (Homoscedasticity)
4. The dependent variable is normally distributed for any value of
the explanatory variables.
5. The error terms are probabilistically independent.
Model Validation
– Multi-Collinearity & Autocorrelation
Model Validation - Heteroscedasticity
• Heteroscedasticity – Does the standard deviation of the error terms
differ for different values of the independent variables?
Observations are supposed to have the same variance
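A common diagnostic is to plot the residuals against the fitted values (or against each x) and look for non-constant spread. A sketch with matplotlib, reusing the hypothetical fitted model from the earlier sketches:

```python
import matplotlib.pyplot as plt

# model1 is the fitted statsmodels result from the earlier sketch
plt.scatter(model1.fittedvalues, model1.resid, s=12)
plt.axhline(0, color="gray", linewidth=1)
plt.xlabel("Fitted CPL ($)")
plt.ylabel("Residual ($)")
plt.title("Look for a fan/funnel shape (non-constant spread)")
plt.show()
```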
Model Validation
• Linearity – Is the dependent variable linear with independent variables?
With one ind. variable, scatter plots work
More than one ind. variable – look at R2
• Normality of Error Terms – Are the error terms distributed Normally?
We have assumed that e~N(0,σ)
Look at a histogram of the residuals
There are more formal tests; e.g., Chi-Square or Kolmogorov–Smirnov
tests
[Figure: histogram of the residuals, roughly bell-shaped and centered near 0]
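A sketch of a quick formal check with scipy, reusing the hypothetical fitted model from earlier:

```python
from scipy import stats

resid = model1.resid                        # residuals from the fitted model
z = (resid - resid.mean()) / resid.std()    # standardize before comparing to N(0,1)
stat, p = stats.kstest(z, "norm")           # H0: residuals are normally distributed
print(f"K-S statistic = {stat:.3f}, p-value = {p:.3f}")
# Note: estimating the mean/std from the same data makes this approximate
# (strictly, a Lilliefors-type correction applies); fine as a quick check.
```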
Re-Sampling
Descriptive vs. Predictive Models
Need to separate the training data from the testing data!
[Figure: three fits to the same (Effort, Sales) data – linear (y = 0.4x + 4.5, R² = 0.16), quadratic (y = 0.5x² − 2.1x + 7, R² = 0.36), and cubic (y = 1.3333x³ − 9.5x² + 20.167x − 7, R² = 1)]
Recall we are balancing between bias and variance!
• Bias – persistent tendency to over- or under-predict; the mean of the errors
• Variance – closeness of predictions to the true value; the variability of the errors
[Diagram: 2×2 grid of low/high bias vs. low/high variance, with underfitting and overfitting regions]
The basic idea is to divide your data into sets:
Training data: used for developing, calibrating, and fitting the model.
Testing data: used for evaluating how well the model works.
Two questions:
1. How much data should I allocate for training versus testing?
2. Which data should I use for each?
No hard and fast rules – more rules of thumb:
• Train on more data than you test, and be sure your training data is 'similar' to your testing data.
• General consensus is 75:25 for train:test (see the sketch below).
• Time series is special and needs to be treated differently.
• In some machine learning applications you may divide your data into three sets: training, validation, and testing.
[Diagram: alternative train:test partitions of a data set, e.g., 50:50, 90:10, 10:90, 70:30]
But why only test once? Two methods for 're-sampling':
• Cross-Validation – make multiple partitions or 'folds' of the data, run the analysis on each fold, and average overall performance.
• Bootstrap – similar to cross-validation except we resample the data each time, so an element of data could be re-used.
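A sketch of the basic split with scikit-learn, using the rule-of-thumb 75:25 ratio and the hypothetical DataFrame from the earlier sketches:

```python
from sklearn.model_selection import train_test_split

# 75:25 train:test, the rule of thumb mentioned above
train, test = train_test_split(df, test_size=0.25, random_state=42)
```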
Wait, why are we re-sampling again?
• We need to understand how well our model performs on new (non-trained) data.
• Challenge: we are using a limited set of data to estimate the unknown parameters.
While we want a strong model, just because a model fits very well for the sample (training) data it is not guaranteed to fit well for other data.
In fact, this can lead to overfitting a model.
• Resampling avoids overfitting in predictive models:
Partition the data into several different train:test subsets
Train and test the model on each of these 'folds'
Average the error terms across all of the tests
• How generalizable is our model across 'new' data?
The model coefficients will not be the same between the folds
Re-sampling is used to test the structure of the model – not the detailed parameters
[Diagram: two candidate models, each trained and tested on four folds of the data set. Model 1 fold R² values: .85, .89, .91, .87 (average 88%); Model 2 fold R² values: .94, .82, .76, .80 (average 83%)]
Cross-Validation Methods
k-fold:
1. Divide the sample randomly and equally into k pairwise disjoint subsets.
2. Use k − 1 parts for training, and 1 for testing.
3. Repeat the procedure for each fold, changing the test subset.
4. Compute k performance error metrics, one per fold, and find the average overall error estimate.
What value should I use for k?
• Typical values are 5 or 10 folds – but essentially this is arbitrary.
• A bigger k leads to lower bias, but higher variance and longer run time.
Leave-One-Out Cross-Validation:
1. Let k = n, the number of observations.
2. Follow the k-fold method.
Duffy Industries – Regression Models
[Diagram: k-fold cross-validation applied to the Duffy regression models (Model 1: CPL = b0 + b1(Dist), R² = 81.6%, ...), computing a MAPEn for each fold/run and averaging]
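A sketch of the k-fold procedure with scikit-learn (k = 5), refitting the hypothetical Model 1 formula on each fold and scoring MAPE on the held-out part:

```python
import numpy as np
import statsmodels.formula.api as smf
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_mape = []
for train_idx, test_idx in kf.split(df):          # df as in the earlier sketches
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    fit = smf.ols("Cost ~ Distance", data=train).fit()
    pred = fit.predict(test)
    fold_mape.append(np.mean(np.abs((test["Cost"] - pred) / test["Cost"])))

print(f"fold MAPEs: {np.round(fold_mape, 3)}, average: {np.mean(fold_mape):.3f}")
```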
A Longer Note on Time Series . . .
• You cannot blindly use k-folds cross validation for time series. Why not?
• For k-folds cross-validation, we assume the observations are independent.
• This does not hold for time series since there is a temporal element, e.g.,
• We do not want to train on Mon-Fri and test on Sat-Sun
• We do not want to train on Jan-Sep and then test on Oct-Dec
• We need to do something called “backtesting” or “hindsighting”
• Three common backtesting techniques:
• Train-Test
• Split the data that respects the temporal order of observations, (e.g. Train on years 1-2, Test year 3)
• Multiple Train-Test
• Split the data into multiple segments, each respecting the temporal order of observations and
having the same test size, (e.g., Train on first two months of a quarter, Test on last month).
• Walk-Forward Validation (recommended!)
• The model is estimated and updated each time step when new data is received.
• Need to decide on the minimum number of observations to train (width of training window) and
whether to use sliding or expanding windows.
Walk-Forward Validation
Suppose I decided I need at least 8 periods to train my data.
Expanding training window – anchor the start of the window and use all data available so far for training; a sliding window instead keeps the training width fixed.
Two questions:
1. What is my minimum number of training periods?
2. Which type of window should we use?
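A minimal sketch of walk-forward validation in Python (the one-step forecast model here is a naive placeholder, not a recommendation):

```python
import numpy as np

def walk_forward(y, min_train=8, expanding=True):
    """One-step-ahead errors; y is a time-ordered numpy array."""
    errors = []
    for t in range(min_train, len(y)):
        start = 0 if expanding else t - min_train   # expanding vs. sliding window
        train = y[start:t]
        forecast = train.mean()                     # placeholder model: naive mean
        errors.append(y[t] - forecast)
    return np.array(errors)
```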
The Practice of Regression
1. Choose which independent variables to include in the model,
based on common sense and context specific knowledge.
2. Collect data (create dummy variables if necessary).
3. Run regression -- the easy part.
4. Analyze the output and make changes in the model -- this is
where the action is.
Using fewer independent variables is better than using more
Always be able to explain or justify inclusion (or exclusion) of a variable
Always validate individual explanatory variables (p-value)
There is more art than science to these models
Regression Analysis Checklist
• Linearity: scatter plot, common sense, and knowing your
problem
• Signs of Regression Coefficients: do they agree with intuition?
• t-statistics: are the coefficients significantly different from zero?
• adj R2: is it reasonably high in the context?
• Normality: plot histogram of the residuals
• Heteroscedasticity: plot residuals with each x variable
• Autocorrelation: “time series plot”
• Multicollinearity: compute correlations of the x variables
If |corr| > .70 – you might want to remove one of the variables
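A sketch of that correlation check with pandas, reusing the hypothetical DataFrame and column names from the earlier sketches:

```python
# Pairwise correlations among candidate x variables (column names are assumptions)
x_vars = ["Distance", "LeadTime", "TrailerLength", "Weight"]
corr = df[x_vars].corr()
print(corr.round(2))
# Rule of thumb from the checklist: if |corr| > 0.70 for a pair,
# consider dropping one of the two variables
```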
Practical Concerns – Beware Over-Fitting
• Suppose you are given the following data,
Sales Effort
5 1
4 3
7 4
6 2
Results – which is the best fit?
[Figure: three models fit to the four (Effort, Sales) points]
• Linear: y = 0.4x + 4.5, R² = 0.16
• Quadratic: y = 0.5x² − 2.1x + 7, R² = 0.36
• Cubic: y = 1.3333x³ − 9.5x² + 20.167x − 7, R² = 1
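Those three fits can be reproduced directly. A sketch with numpy's polyfit on the four points above:

```python
import numpy as np

effort = np.array([1, 3, 4, 2])
sales = np.array([5, 4, 7, 6])

for degree in (1, 2, 3):
    coeffs = np.polyfit(effort, sales, degree)   # least-squares polynomial fit
    pred = np.polyval(coeffs, effort)
    r2 = 1 - np.sum((sales - pred) ** 2) / np.sum((sales - sales.mean()) ** 2)
    print(f"degree {degree}: R^2 = {r2:.2f}")    # 0.16, 0.36, 1.00
```

With four points, the cubic (four parameters) fits exactly – R² = 1 – which is precisely the overfitting the slide warns about.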
Modeling Issues and Tips
• Don’t confuse causality and relationship
Statistics find and measure relationships – not causality
User must try to explain the causality
• Don’t be a slave to R2 – model must make sense
Look at adjusted R2 to compare between models
• Simple is better (avoid over-specifying)
Rule of thumb n≥5(k+2) where
n= num obs and k=num of ind variables
• Avoid extrapolating out of observed range
• Non-linear relationships can be modeled through data
transformations (ln, sqrt, 1/x, products of variables)
Questions, Comments, Suggestions?
Use the Discussion Forum!