Module 3 - Regression
ctl.mit.edu
Duffy Industries
• Robin Curtin, the transportation vice president for Duffy Industries, a food manufacturer, is trying to understand the company's transportation rates. She needs to estimate what rates should be for full truckload (TL) service from a new facility. Duffy uses contract "over-the-road" trucking companies for TL shipments. These are moves directly from one point to another with no intermediate stops.
• Robin only has a snapshot of data showing the costs and
some other characteristics of about 100 TL shipments.
• Some questions she would like to answer:
What characteristics are driving the rates for my TL services?
What rates should I expect if I open new lanes?
Duffy Industries
• Let’s take a look at a snapshot of the data
ID Cost Per Load Distance (miles) LeadTime (days) Trailer Length (ft) Weight (lbs) Equipment
1 $3,692 1,579 1 53 20,559 DRY
2 $3,279 1,298 12 48 17,025 REF
3 $3,120 1,382 11 48 26,735 DRY
4 $3,205 1,033 1 53 26,175 DRY
5 $3,188 1,320 3 53 17,994 DRY
6 $2,835 1,103 9 53 32,206 DRY
7 $2,364 743 1 48 18,588 DRY
Duffy Industries – Quick Statistics
CPL Dist LdTime TrlLng Wgt
Min $1,660 502 0.00 48 15,100
25th Pct $2,632 904 2.75 48 21,221
Mode $3,730 / $3,171 1273 1.00 53 #N/A
Median $3,166 1273 6.00 53 26,514
Mean $3,132 1207 5.87 51.3 26,708
75th Pct $3,701 1538 9.00 53 32,276
Max $4,301 1793 13.00 53 39,931
Range $2,641 1291 13.00 5 24,831
IQR $1,070 634 6.25 5 11,055
StdDev (sample) $655 387 3.96 2.4 7,070
CORR(CPL,X) 1.00 0.90 -0.09 0.14 0.08
Eqpt: 60 Ref and 40 Dry shipments
What to do next?
Explore how CPL is influenced by other variables
Develop a descriptive model where CPL=f(Dist, LdTime, . . . ?)
Develop a predictive model for CPL
Setting up the Variables
Dependent vs. Independent Variables
• We want to relate the movement of one (dependent) variable
to a small set of relevant (independent) variables.
• Dependent variable Y is a function of independent variables X.
• Examples:
Property Values = f(area, location, # bathrooms, …)
Sales = f(last month's sales, advertising budget, price, seasonality, . . .)
Probability Taking Transit versus Driving = f(income, location, . . .)
Height = f(age, gender, height of parents, . . . )
GPA = f(GMAT, age, undergraduate grades, . . .)
Number of Fliers = f(Economic activity, size of origin and destination
cities, competitor’s price, . . . )
Condominium fees = f(area, story, . . .)
Variable Types
• Variables have different scales
Nominal Scale / Categorical – groupings with no inherent order or numeric value
Income status (1=retired, 2=student, 3=two income, etc.)
Country of origin (1=US, 2=Canada, 3=China, etc.)
Organizational structure (1=centralized, 2=decentralized, etc.)
Ordinal Scale – indicates ranking but not proportionality between values
Job satisfaction scale 1 to 5 (a 2 is not twice as good as a 1)
Planning versus Response Profile (0 = Planner, 4 = Responder)
Education level (1=High School, 2=Undergraduate, 3=Masters, etc.)
Ratio Scale – value indicates both ranking and proportional relation
Examples: age, income, cost, distance, weight, . . .
[Figure: scatter plot of Cost Per Load ($) vs. Distance (miles) for the ~100 shipments]
Linear Regression
Linear Regression Model
• Formally,
The relationship is described in terms of a linear model
The data (xi, yi) are the observed pairs from which we try to estimate the β coefficients to find the 'best fit'
The error term, ε, is the ‘unaccounted’ or ‘unexplained’ portion
The error terms are assumed to be iid ~N(0,σ) and catch all of the factors
ignored or neglected in the model
Yi = β0 + β1xi + εi   for i = 1, 2, ..., n
E(Y | x) = β0 + β1x
StdDev(Y | x) = σ
(xi, yi are observed; β0, β1, and σ are unknown)
Linear Regression - Residuals
• Residuals
Predicted or estimated values, ŷi, are found using the estimated regression coefficients, b
Residuals, ei, are the difference of actual, yi, minus predicted, ŷi, values: ei = yi − ŷi
We want to select the "best" b values that "minimize the residuals"
Suppose we find ŷi = b0 + b1xi to be CPLi = 1000 + 1.75 (Disti)
For observation 32: x32 = 1,007 miles (observed distance) and y32 = $2,346 (observed value)
Estimated value: ŷ32 = 1000 + 1.75(1007) = 2762.25
Residual: e32 = 2346 − 2762.25 = −416.25
[Figure: scatter plot of Cost Per Load ($) vs. Distance (miles) with the fitted line, intercept b0 = 1000 and slope b1 = 1.75, highlighting the residual for observation 32]
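This arithmetic is easy to verify in code. A minimal sketch in Python, using the illustrative coefficients and observation-32 values from this slide:

```python
# Fitted line from the slide: CPL_hat = 1000 + 1.75 * Dist
b0, b1 = 1000.0, 1.75
x32, y32 = 1007, 2346          # observation 32: distance (mi) and cost ($)

y32_hat = b0 + b1 * x32        # estimated value: 2762.25
e32 = y32 - y32_hat            # residual: -416.25
print(f"estimate = {y32_hat}, residual = {e32}")
```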
Linear Regression – Best Fit
• How should I determine the “best fit” with the residuals?
Min sum of errors? min Σ(yi - b0 - b1xi)
Min sum of absolute error? min Σ|yi - b0 - b1xi|
Min sum of squares of error? min Σ(yi - b0 - b1xi)2
We will select the model that minimizes the residual sum of squares . . . WHY? The plain sum lets positive and negative errors cancel, the absolute-value sum is hard to optimize, while squared error is differentiable and penalizes large misses.
Formal Definitions
• Where did this come from?
Unconstrained optimization
(think back to first week!)
Partial derivatives to find the
first order optimality condition
with respect to each variable.
RSS = Σi=1..n ei² = Σi=1..n (yi − ŷi)² = Σi=1..n (yi − b0 − b1x1i − ... − bkxki)²
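Solving those first-order conditions gives closed-form estimates in the simple-regression case. A minimal sketch in Python (numpy only; the synthetic data are illustrative, loosely echoing Model 1 later in this module):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(500, 1800, 100)                  # distances, loosely like Duffy's lanes
y = 1282 + 1.532 * x + rng.normal(0, 280, 100)   # linear signal plus noise

# Setting the partial derivatives of sum((y - b0 - b1*x)^2) to zero gives:
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

rss = np.sum((y - (b0 + b1 * x)) ** 2)
print(f"b0 = {b0:.1f}, b1 = {b1:.3f}, RSS = {rss:,.0f}")
```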
Validating a Model – First Steps
Model Evaluation Metrics
• All packages will provide statistics for evaluation
Names and format will differ slightly package by package
• Model Output
Model Statistics (Regression Statistics or Summary of Fit)
Coefficient of Determination or Goodness of Fit (R2)
Adjusted R2
Standard Error (Root Mean Squared Error)
Analysis of Variance (ANOVA)
Sum of the Squares (Model, Residual/Error, and Total)
Degrees of Freedom
Parameter Statistics (Coefficient Statistics)
Coefficient (b value)
Standard error
t-Statistic
p-value
Upper and Lower Bounds
Model Validation 1 – Overall Fit
• How much variation in the dependent variable, y, can we explain?
If we only have the mean?
If we can make estimates?
• What is the total variation of CPL?
Find dispersion around the mean
Called the Total Sum of Squares: TSS = Σi=1..n (yi − ȳ)²
• What if we now make estimates of y for each x?
Find variation not accounted for by estimates, ei = yi − ŷi
Called the Error or Residual Sum of Squares: RSS = Σi=1..n (yi − ŷi)²
• The regression model "explains" a certain percentage of the total variation of the dependent variable
Coefficient of Determination or R²: R² = 1 − RSS/TSS
Ranges between 0 and 1
The adj R² corrects for additional variables: adj R² = 1 − (RSS/(n−k−1)) / (TSS/(n−1))
n = # observations, k = # independent variables (not b0)

ID CPL (Y) Dist (X)
1 $3,692 1,579
2 $3,279 1,298
3 $3,120 1,382
4 $3,205 1,033
5 $3,188 1,320
6 $2,835 1,103
7 $2,364 743
8 $2,434 772
9 $3,486 1,389
10 $3,730 1,761
11 $3,735 1,664
12 $4,096 1,542
13 $2,123 641
14 $3,560 1,527
15 $4,041 1,407
16 $3,765 1,720
17 $3,565 1,310
18 $1,686 565
19 $3,045 1,200
20 $2,933 1,207
...
100 $4,208 1,687
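A minimal sketch of these calculations in Python (numpy; the function and variable names are mine):

```python
import numpy as np

def r_squared(y, y_hat, k):
    """Return (R2, adjusted R2); n observations, k independent variables."""
    n = len(y)
    tss = np.sum((y - np.mean(y)) ** 2)   # total variation around the mean
    rss = np.sum((y - y_hat) ** 2)        # variation left unexplained by the model
    r2 = 1 - rss / tss
    adj_r2 = 1 - (rss / (n - k - 1)) / (tss / (n - 1))
    return r2, adj_r2
```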
Model Validation 2 – Individual Coefficients
• Each independent variable (and b0) will have:
An estimate of the coefficient (bi)
A standard error (sbi); for a single x, sb1 = se / (sx √(n − 1)), where
se = standard error of the model
sx = standard deviation of the independent variable
n = number of observations
The t-statistic, t = bi / sbi, ~ Student-t, df = n − k − 1, where
k = number of independent variables
bi = estimate or coefficient of the independent variable
Corresponding p-value – Testing the Slope
If no linear relationship exists between the two variables, we would expect the
regression line to be horizontal, that is, to have a slope of zero.
We want to see if there is a linear relationship, i.e., we want to see if the slope
(b1) is something other than zero. So: H0: β1 = 0 and H1: β1 ≠ 0
• Confidence Intervals
We can estimate an interval for the slope parameter: b1 ± t(α/2, df) sb1
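A minimal sketch of the slope test and confidence interval for the one-variable case (Python with scipy; names are mine):

```python
import numpy as np
from scipy import stats

def slope_inference(x, y, alpha=0.05):
    """t-test and confidence interval for the slope of a one-variable regression."""
    n = len(x)
    sxx = np.sum((x - np.mean(x)) ** 2)
    b1 = np.sum((x - np.mean(x)) * (y - np.mean(y))) / sxx
    b0 = np.mean(y) - b1 * np.mean(x)
    resid = y - (b0 + b1 * x)
    se = np.sqrt(np.sum(resid ** 2) / (n - 2))   # k = 1, so df = n - k - 1 = n - 2
    s_b1 = se / np.sqrt(sxx)                     # standard error of the slope
    t_stat = b1 / s_b1                           # tests H0: beta1 = 0
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
    half = stats.t.ppf(1 - alpha / 2, df=n - 2) * s_b1
    return b1, t_stat, p_value, (b1 - half, b1 + half)
```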
Model 1: CPL=f(Dist)
18
Model 1: CPL = b0 + b1(Dist)
Estimation Model: CPL = 1282 + 1.532 (Dist)
Model Statistics: R² 0.818, adj R² 0.816, se 281.3, RSS 7,755,591, TSS 42,519,984
Interpretation:
• Model explains ~82% of total variation in CPL (very good!)
• Both the b0 and b1 terms make sense in terms of magnitude and sign
and are statistically valid (p<0.0001)
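In practice a package produces all of these statistics in one call. A sketch with statsmodels; the file name duffy_shipments.csv and the column names Cost and Distance are my assumptions about how the Duffy data might be laid out:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("duffy_shipments.csv")   # hypothetical file and column names

model1 = smf.ols("Cost ~ Distance", data=df).fit()
print(model1.summary())   # R2, adj R2, se, ANOVA, coefficients, t-stats, p-values, CIs
```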
How can we improve the model?
ID Cost Per Load Distance (miles) LeadTime (days) Trailer Length (ft) Weight (lbs) Equipment
1 $3,692 1,579 1 53 20,559 REF
2 $3,279 1,298 12 48 17,025 DRY
3 $3,120 1,382 11 48 26,736 REF
4 $3,205 1,033 1 53 26,176 REF
5 $3,188 1,320 3 53 17,994 REF
6 $2,835 1,103 9 53 32,207 REF
7 $2,364 743 1 48 18,589 REF
Model 2: CPL = f(Dist, Wgt)
Model 2: CPL = b0 + b1(Dist) + b2(Wgt)
Estimation Model: CPL = 1354 + 1.538 (Dist) − 0.003 (Wgt)
Model Statistics: R² 0.819, adj R² 0.815, se 281.964, RSS 7,711,851, TSS 42,519,984
Coefficient  Std Error (sbi)  t-stat  p-value  Lower 95% CI  Upper 95% CI
Intercept (b0)  1,354  134.4  10.077  <0.0001  1,088  1,621
Distance (b1)  1.538  0.074  20.850  <0.0001  1.392  1.685
Weight (b2)  -0.003  0.004  -0.74  0.46  -0.011  0.005
Interpretation:
• Model explains ~82% of total variation in CPL (still very good!)
• Note that while R2 improved from Model 1, the adj R2 got worse!
• Both the b0 and b1 terms make sense in terms of magnitude and sign and are
statistically valid (p<0.0001)
• b2 does not make sense (more weight costs less?) and has poor p-value
Linear Transformations
What about non-linear relationships?
Suppose we think that CPL has some non-
linear relationship with some of our
independent variables . . .
Recall,
e=2.71828 . . .
• ln(1) = 0
• ln(e) = 1
• ln(xy) = ln(x) + ln(y)
• ln(x/y) = ln(x) – ln(y)
• ln(xa) = aln(x)
Modeling Techniques 1
• Linear Transformations
We assume a linear model: yi = b0 + b1xi1 + b2xi2 +. . .+ bkxik
What if we have a non-linear relationship with an independent variable?
OLS regression is still OK as long as the estimated equation is linear in its
coefficients – we can transform an independent variable (e.g., x², ln(x)) and
treat the transformation as a new variable, as sketched below.
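In code, the transformation is just another column. A sketch, continuing the hypothetical pandas/statsmodels setup from Model 1 (this is how Model 3 below could be built):

```python
import statsmodels.formula.api as smf

# Either add the transformed column explicitly (df as in the earlier sketch)...
df["DistSq"] = df["Distance"] ** 2
model3 = smf.ols("Cost ~ Distance + DistSq", data=df).fit()

# ...or transform inline; I() tells the formula parser to treat ** as arithmetic
model3 = smf.ols("Cost ~ Distance + I(Distance ** 2)", data=df).fit()
```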
Model 3: CPL = b0 + b1(Dist) + b2(DistSq)
Estimation Model: CPL = 1236 + 1.623 (Dist) − 0.00004 (DistSq)
Model Statistics: R² 0.818, adj R² 0.814, se 282.714, RSS 7,752,918, TSS 42,519,984
Interpretation:
• Model explains ~82% of total variation in CPL (still very good!), but squared
term did not improve the adj R2 from Model 1.
• b0 and b1 still are good (note slight degradation of p-value for b1)
• b2 sign and magnitude make sense (at 1000 miles, reduces effect by $40) but
has poor p-value
Modeling Categorical Variables
Modeling Techniques 2
• So far, we assumed that independent variables are continuous and ratio scaled.
Categorical variables can enter the model as 0/1 dummy variables, as sketched below.
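A sketch of building such a dummy from the Equipment column, continuing the hypothetical pandas/statsmodels setup (this is the RefFlag variable used in Model 4 below):

```python
import statsmodels.formula.api as smf

# df as in the earlier sketch; Equipment takes the values "DRY" and "REF"
df["RefFlag"] = (df["Equipment"] == "REF").astype(int)   # 1 = refrigerated, 0 = dry van

model4 = smf.ols("Cost ~ Distance + RefFlag", data=df).fit()
# A categorical variable with m levels needs m - 1 dummies (one level is the baseline)
```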
Model 4: CPL = b0 + b1(Dist) + b2(RefFlag)
Estimation Model: CPL = 1236 + 1.529 (Dist) + 84 (RefFlag)
Model Statistics: R² 0.822, adj R² 0.818, se 279.650, RSS 7,585,804, TSS 42,519,984
Coefficient  Std Error (sbi)  t-stat  p-value  Lower 95% CI  Upper 95% CI
Intercept (b0)  1,236  97.343  12.695  <0.0001  1,042.6  1,428.96
Distance (b1)  1.529  0.073  21.03  <0.0001  1.384  1.673
RefFlag (b2)  84.1  57.11  1.473  0.144  -29.2  197.5
Interpretation:
• Model explains ~82% of total variation in CPL (adj R2 improved from Model 1)
• Both the b0 and b1 terms are still fine
• b2 does make sense in terms of sign (refrigeration costs more) but has a poor p-
value. Perhaps it is more of a function of distance . . . .
• Let’s test – new variable RefDist = RefFlag*Dist
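That interaction variable is just the product of the dummy and the distance. A sketch, continuing the same hypothetical setup (this builds Model 5 below):

```python
# Interaction term: lets refrigeration add cost per mile rather than a flat amount
df["RefDist"] = df["RefFlag"] * df["Distance"]
model5 = smf.ols("Cost ~ Distance + RefDist", data=df).fit()
print(model5.summary())
```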
Model 5: CPL = b0 + b1(Dist) + b2(RefDist)
Estimation Model: CPL = 1283 + 1.496 (Dist) + 0.06 (RefDist)
Model Statistics: R² 0.821, adj R² 0.817, se 280.317, RSS 7,622,037, TSS 42,519,984
Interpretation:
• Model is good – b0, and b1 are fine.
• b2 is problematic – sign and magnitude are reasonable – but p-value is bad.
• Should we keep it? This is where we get more art than science. If important,
then OK – but always state the p-value so the user of the model understands its
strength/weakness.
Model 6: CPL = b0 + b1(Dist) + b2(SameDay)
SameDay = 1 if LdTime = 0, = 0 otherwise
Estimation Model: CPL = 1237 + 1.552 (Dist) + 233 (SameDay)
Model Statistics: R² 0.828, adj R² 0.824, se 274.655, RSS 7,317,227, TSS 42,519,984
Coefficient  Std Error (sbi)  t-stat  p-value  Lower 95% CI  Upper 95% CI
Intercept (b0)  1,237  92.32  13.4  <0.0001  1,054  1,420
Distance (b1)  1.552  0.072  21.6  <0.0001  1.409  1.694
SameDay (b2)  233  96.61  2.4  0.018  41.1  424.6
Model Validation II
Regression Assumptions
1. There is a population regression line that joins the means of the
dependent variable for all values of the explanatory variables.
2. This implies that the mean of the error is 0.
3. The variance of the dependent variable is constant for all
values of the explanatory variables (Homoscedasticity)
4. The dependent variable is normally distributed for any value of
the explanatory variables.
5. The error terms are probabilistically independent.
Model Validation
– Multi-Collinearity & Autocorrelation
Model Validation - Heteroscedasticity
• Heteroscedasticity – Does the standard deviation of the error terms
differ for different values of the independent variables?
Observations are supposed to have the same variance
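A common diagnostic is to plot the residuals against the fitted values (or against each x) and look for non-constant spread. A sketch with matplotlib, reusing the hypothetical fitted model from the earlier sketches:

```python
import matplotlib.pyplot as plt

# model1 is the fitted statsmodels result from the earlier sketch
plt.scatter(model1.fittedvalues, model1.resid, s=12)
plt.axhline(0, color="gray", linewidth=1)
plt.xlabel("Fitted CPL ($)")
plt.ylabel("Residual ($)")
plt.title("Look for a fan/funnel shape (non-constant spread)")
plt.show()
```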
Model Validation
• Linearity – Is the dependent variable linear with independent variables?
With one ind. variable, scatter plots work
More than one ind. variable – look at R2
• Normality of Error Terms – Are the error terms distributed Normally?
We have assumed that e~N(0,σ)
Look at a histogram of the residuals
There are more formal tests; e.g., Chi-Square or Kolmogorov–Smirnov
tests
[Figure: histogram of the residuals, roughly bell-shaped and centered near 0]
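A sketch of a quick formal check with scipy, reusing the hypothetical fitted model from earlier:

```python
from scipy import stats

resid = model1.resid                        # residuals from the fitted model
z = (resid - resid.mean()) / resid.std()    # standardize before comparing to N(0,1)
stat, p = stats.kstest(z, "norm")           # H0: residuals are normally distributed
print(f"K-S statistic = {stat:.3f}, p-value = {p:.3f}")
# Note: estimating the mean/std from the same data makes this approximate
# (strictly, a Lilliefors-type correction applies); fine as a quick check.
```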
Re-Sampling
Descriptive vs. Predictive Models
Need to separate the training data from the testing data!
[Figure: three fits to the same (Effort, Sales) data – linear (y = 0.4x + 4.5, R² = 0.16), quadratic (y = 0.5x² − 2.1x + 7, R² = 0.36), and cubic (y = 1.3333x³ − 9.5x² + 20.167x − 7, R² = 1)]
Recall we are balancing between bias and variance!
• Bias – persistent tendency to over- or under-predict; the mean of the errors
• Variance – closeness of predictions to the true value; the variability of the errors
[Diagram: 2×2 grid of low/high bias vs. low/high variance, with underfitting and overfitting regions]
The basic idea is to divide your data into sets:
Training data: used for developing, calibrating, and fitting the model.
Testing data: used for evaluating how well the model works.
Two questions:
1. How much data should I allocate for training versus testing?
2. Which data should I use for each?
No hard and fast rules – more rules of thumb:
• Train on more data than you test, and be sure your training data is 'similar' to your testing data.
• General consensus is 75:25 for train:test (see the sketch below).
• Time series is special and needs to be treated differently.
• In some machine learning applications you may divide your data into three sets: training, validation, and testing.
[Diagram: alternative train:test partitions of a data set, e.g., 50:50, 90:10, 10:90, 70:30]
But why only test once? Two methods for 're-sampling':
• Cross-Validation – make multiple partitions or 'folds' of the data, run the analysis on each fold, and average overall performance.
• Bootstrap – similar to cross-validation except we resample the data each time, so an element of data could be re-used.
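A sketch of the basic split with scikit-learn, using the rule-of-thumb 75:25 ratio and the hypothetical DataFrame from the earlier sketches:

```python
from sklearn.model_selection import train_test_split

# 75:25 train:test, the rule of thumb mentioned above
train, test = train_test_split(df, test_size=0.25, random_state=42)
```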
Wait, why are we re-sampling again?
• We need to understand how well our model performs on new (non-trained) data.
• Challenge: we are using a limited set of data to estimate the unknown parameters.
While we want a strong model, just because a model fits very well for the sample (training) data it is not guaranteed to fit well for other data.
In fact, this can lead to overfitting a model.
• Resampling avoids overfitting in predictive models:
Partition the data into several different train:test subsets
Train and test the model on each of these 'folds'
Average the error terms across all of the tests
• How generalizable is our model across 'new' data?
The model coefficients will not be the same between the folds
Re-sampling is used to test the structure of the model – not the detailed parameters
[Diagram: two candidate models, each trained and tested on four folds of the data set. Model 1 fold R² values: .85, .89, .91, .87 (average 88%); Model 2 fold R² values: .94, .82, .76, .80 (average 83%)]
Cross-Validation Methods
k-fold:
1. Divide the sample randomly and equally into k pairwise disjoint subsets.
2. Use k − 1 parts for training, and 1 for testing.
3. Repeat the procedure for each fold, changing the test subset.
4. Compute k performance error metrics, one per fold, and find the average overall error estimate.
What value should I use for k?
• Typical values are 5 or 10 folds – but essentially this is arbitrary.
• A bigger k leads to lower bias, but higher variance and longer run time.
Leave-One-Out Cross-Validation:
1. Let k = n, the number of observations.
2. Follow the k-fold method.
Duffy Industries – Regression Models
[Diagram: k-fold cross-validation applied to the Duffy regression models (Model 1: CPL = b0 + b1(Dist), R² = 81.6%, ...), computing a MAPEn for each fold/run and averaging]
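A sketch of the k-fold procedure with scikit-learn (k = 5), refitting the hypothetical Model 1 formula on each fold and scoring MAPE on the held-out part:

```python
import numpy as np
import statsmodels.formula.api as smf
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_mape = []
for train_idx, test_idx in kf.split(df):          # df as in the earlier sketches
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    fit = smf.ols("Cost ~ Distance", data=train).fit()
    pred = fit.predict(test)
    fold_mape.append(np.mean(np.abs((test["Cost"] - pred) / test["Cost"])))

print(f"fold MAPEs: {np.round(fold_mape, 3)}, average: {np.mean(fold_mape):.3f}")
```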
A Longer Note on Time Series . . .
• You cannot blindly use k-folds cross validation for time series. Why not?
• For k-folds cross-validation, we assume the observations are independent.
• This does not hold for time series since there is a temporal element, e.g.,
• We do not want to train on Mon-Fri and test on Sat-Sun
• We do not want to train on Jan-Sep and then test on Oct-Dec
• We need to do something called “backtesting” or “hindsighting”
• Three common backtesting techniques:
• Train-Test
• Split the data that respects the temporal order of observations, (e.g. Train on years 1-2, Test year 3)
• Multiple Train-Test
• Split the data into multiple segments, each respecting the temporal order of observations and
having the same test size, (e.g., Train on first two months of a quarter, Test on last month).
• Walk-Forward Validation (recommended!)
• The model is estimated and updated each time step when new data is received.
• Need to decide on the minimum number of observations to train (width of training window) and
whether to use sliding or expanding windows.
Walk-Forward Validation
Suppose I decided I need at least 8 periods to train my data.
Expanding training window – anchor the start of the window and use all data available so far for training; a sliding window instead keeps the training width fixed.
Two questions:
1. What is my minimum number of training periods?
2. Which type of window should we use?
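A minimal sketch of walk-forward validation in Python (the one-step forecast model here is a naive placeholder, not a recommendation):

```python
import numpy as np

def walk_forward(y, min_train=8, expanding=True):
    """One-step-ahead errors; y is a time-ordered numpy array."""
    errors = []
    for t in range(min_train, len(y)):
        start = 0 if expanding else t - min_train   # expanding vs. sliding window
        train = y[start:t]
        forecast = train.mean()                     # placeholder model: naive mean
        errors.append(y[t] - forecast)
    return np.array(errors)
```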
The Practice of Regression
1. Choose which independent variables to include in the model,
based on common sense and context specific knowledge.
2. Collect data (create dummy variables if necessary).
3. Run regression -- the easy part.
4. Analyze the output and make changes in the model -- this is
where the action is.
Using fewer independent variables is better than using more
Always be able to explain or justify inclusion (or exclusion) of a variable
Always validate individual explanatory variables (p-value)
There is more art than science to these models
Regression Analysis Checklist
• Linearity: scatter plot, common sense, and knowing your
problem
• Signs of Regression Coefficients: do they agree with intuition?
• t-statistics: are the coefficients significantly different from zero?
• adj R2: is it reasonably high in the context?
• Normality: plot histogram of the residuals
• Heteroscedasticity: plot residuals with each x variable
• Autocorrelation: “time series plot”
• Multicollinearity: compute correlations of the x variables
If |corr| > .70 – you might want to remove one of the variables
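A sketch of that correlation check with pandas, reusing the hypothetical DataFrame and column names from the earlier sketches:

```python
# Pairwise correlations among candidate x variables (column names are assumptions)
x_vars = ["Distance", "LeadTime", "TrailerLength", "Weight"]
corr = df[x_vars].corr()
print(corr.round(2))
# Rule of thumb from the checklist: if |corr| > 0.70 for a pair,
# consider dropping one of the two variables
```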
Practical Concerns – Beware Over-Fitting
• Suppose you are given the following data,
Sales Effort
5 1
4 3
7 4
6 2
Results – which is the best fit?
[Figure: three models fit to the four (Effort, Sales) points]
• Linear: y = 0.4x + 4.5, R² = 0.16
• Quadratic: y = 0.5x² − 2.1x + 7, R² = 0.36
• Cubic: y = 1.3333x³ − 9.5x² + 20.167x − 7, R² = 1
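Those three fits can be reproduced directly. A sketch with numpy's polyfit on the four points above:

```python
import numpy as np

effort = np.array([1, 3, 4, 2])
sales = np.array([5, 4, 7, 6])

for degree in (1, 2, 3):
    coeffs = np.polyfit(effort, sales, degree)   # least-squares polynomial fit
    pred = np.polyval(coeffs, effort)
    r2 = 1 - np.sum((sales - pred) ** 2) / np.sum((sales - sales.mean()) ** 2)
    print(f"degree {degree}: R^2 = {r2:.2f}")    # 0.16, 0.36, 1.00
```

With four points, the cubic (four parameters) fits exactly – R² = 1 – which is precisely the overfitting the slide warns about.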
Modeling Issues and Tips
• Don’t confuse causality and relationship
Statistics find and measure relationships – not causality
User must try to explain the causality
• Don’t be a slave to R2 – model must make sense
Look at adjusted R2 to compare between models
• Simple is better (avoid over-specifying)
Rule of thumb n≥5(k+2) where
n= num obs and k=num of ind variables
• Avoid extrapolating out of observed range
• Non-linear relationships can be modeled through data
transformations (ln, sqrt, 1/x, products of variables)
Questions, Comments, Suggestions?
Use the Discussion Forum!