7 Regression
Applied forecasting
[Figure: Quarterly percentage changes in US consumption, income, production, savings, and unemployment, 1980 Q1–2020 Q1.]
Example: US consumption expenditure
[Figure: Scatterplot matrix of the five series. Pairwise correlations:]
              Income     Production   Savings     Unemployment
Consumption   0.384***   0.529***     −0.257***   −0.527***
Income                   0.269***     0.720***    −0.224**
Production                            −0.059      −0.768***
Savings                                           0.106
Example: US consumption expenditure
fit_consMR <- us_change %>%
model(lm = TSLM(Consumption ~ Income + Production + Unemployment + Savings))
report(fit_consMR)
## Series: Consumption
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.906 -0.158 -0.036 0.136 1.155
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.25311 0.03447 7.34 5.7e-12 ***
## Income 0.74058 0.04012 18.46 < 2e-16 ***
## Production 0.04717 0.02314 2.04 0.043 *
## Unemployment -0.17469 0.09551 -1.83 0.069 .
## Savings -0.05289 0.00292 -18.09 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.31 on 193 degrees of freedom
## Multiple R-squared: 0.768, Adjusted R-squared: 0.763
Example: US consumption expenditure
[Figure: Time plot of actual US consumption (Data) and fitted values from the regression (Fitted).]
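The fitted-versus-data plot can be reproduced from the model object; a minimal sketch using augment() (the title text is an assumption):
augment(fit_consMR) %>%
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Consumption, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  labs(y = "% change", title = "US consumption: data and fitted values")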
Example: US consumption expenditure
[Figure: Fitted (predicted) values plotted against the data (actual values).]
Example: US consumption expenditure
[Figure: Residual diagnostics for fit_consMR: residual time plot (1980 Q1–2020 Q1), ACF (lag in quarters), and histogram of residuals.]
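A residual diagnostic plot like the one above can be produced directly from the mable; a minimal sketch:
fit_consMR %>% gg_tsresiduals()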
Some useful predictors
Linear trend
A linear trend uses time itself as a predictor:
xt = t,   t = 1, 2, . . . , T
Strong assumption that the trend will continue. (In TSLM, this is specified with trend(); see the beer production example below.)
Nonlinear trend
A higher-order polynomial trend (e.g., adding xt = t²) can be fitted, but it extrapolates wildly beyond the sample.
NOT RECOMMENDED! A piecewise linear trend with knots is usually safer; see the Boston marathon example below.
Dummy variables
If a categorical variable takes only two values (e.g., ‘Yes’ or ‘No’), then an equivalent numerical variable can be constructed taking value 1 if yes and 0 if no. This is called a dummy variable.
Dummy variables
If there are more than two categories, then the variable can be coded using several dummy variables (one fewer than the total number of categories).
Beware of the dummy variable trap!
Using one dummy for each category gives too many dummy variables!
The regression will then be singular and inestimable.
Either omit the constant, or omit the dummy for one category.
The coefficients of the dummies are relative to the omitted category, as illustrated in the sketch below.
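A minimal base-R illustration of the trap with hypothetical quarterly dummies: an intercept plus one dummy per quarter makes the design matrix rank-deficient.
qtr <- factor(rep(1:4, times = 3))                  # three years of quarters
X <- cbind(intercept = 1, model.matrix(~ qtr - 1))  # intercept + 4 dummies
qr(X)$rank                                          # rank 4 < 5 columns: singular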
Uses of dummy variables
Seasonal dummies
For quarterly data: use 3 dummies
For monthly data: use 11 dummies
For daily data: use 6 dummies
What to do with weekly data? (With roughly 52 weeks per year, seasonal dummies become impractical; Fourier terms, introduced below, work better.)
Outliers
If there is an outlier, you can use a dummy variable to remove its effect.
Public holidays
For daily data: if it is a public holiday, dummy = 1, otherwise dummy = 0.
Beer production revisited
[Figure: Australian quarterly beer production, in megalitres.]
Regression model
yt = β0 + β1 t + β2 d2,t + β3 d3,t + β4 d4,t + εt
di,t = 1 if t is quarter i and 0 otherwise.
Beer production revisited
fit_beer <- recent_production %>% model(TSLM(Beer ~ trend() + season()))
report(fit_beer)
## Series: Beer
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.9 -7.6 -0.5 8.0 21.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 441.8004 3.7335 118.33 < 2e-16 ***
## trend() -0.3403 0.0666 -5.11 2.7e-06 ***
## season()year2 -34.6597 3.9683 -8.73 9.1e-13 ***
## season()year3 -17.8216 4.0225 -4.43 3.4e-05 ***
## season()year4 72.7964 4.0230 18.09 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.2 on 69 degrees of freedom
Beer production revisited
augment(fit_beer) %>%
ggplot(aes(x = Quarter)) +
geom_line(aes(y = Beer, colour = "Data")) +
geom_line(aes(y = .fitted, colour = "Fitted")) +
labs(y="Megalitres",title ="Australian quarterly beer production") +
scale_colour_manual(values = c(Data = "black", Fitted = "#D55E00"))
[Figure: Beer data and fitted values over time (black = Data, orange = Fitted), and a scatterplot of fitted versus actual values coloured by quarter.]
Beer production revisited
[Figure: Residual diagnostics: residual time plot (1995 Q1–2010 Q1), ACF (lag in quarters), and histogram of residuals.]
Beer production revisited
[Figure: Forecasts of beer production with 80% and 95% prediction intervals.]
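Forecasts like those shown above can be generated from the fitted model; a sketch (the 3-year horizon is an assumption):
fit_beer %>%
  forecast(h = "3 years") %>%
  autoplot(recent_production)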
Fourier series
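The output below is consistent with a harmonic regression containing a trend and two Fourier harmonics; a plausible reconstruction (the object name fourier_beer is an assumption):
fourier_beer <- recent_production %>%
  model(TSLM(Beer ~ trend() + fourier(K = 2)))
report(fourier_beer)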
## Series: Beer
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.9 -7.6 -0.5 8.0 21.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 446.8792 2.8732 155.53 < 2e-16 ***
## trend() -0.3403 0.0666 -5.11 2.7e-06 ***
## fourier(K = 2)C1_4 8.9108 2.0112 4.43 3.4e-05 ***
## fourier(K = 2)S1_4 -53.7281 2.0112 -26.71 < 2e-16 ***
## fourier(K = 2)C2_4 -13.9896 1.4226 -9.83 9.3e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.2 on 69 degrees of freedom
Harmonic regression: eating-out expenditure
aus_cafe <- aus_retail %>% filter(
Industry == "Cafes, restaurants and takeaway food services",
year(Month) %in% 2004:2018
) %>% summarise(Turnover = sum(Turnover))
aus_cafe %>% autoplot(Turnover)
[Figure: Monthly turnover of Australian cafes, restaurants and takeaway food services, 2004–2018.]
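The comparison table below can be built by fitting six harmonic regressions and collecting fit statistics with glance(); a sketch (the model names K1–K6 match the table):
fit <- aus_cafe %>%
  model(
    K1 = TSLM(log(Turnover) ~ trend() + fourier(K = 1)),
    K2 = TSLM(log(Turnover) ~ trend() + fourier(K = 2)),
    K3 = TSLM(log(Turnover) ~ trend() + fourier(K = 3)),
    K4 = TSLM(log(Turnover) ~ trend() + fourier(K = 4)),
    K5 = TSLM(log(Turnover) ~ trend() + fourier(K = 5)),
    K6 = TSLM(log(Turnover) ~ trend() + fourier(K = 6))
  )
glance(fit) %>% select(.model, r_squared, adj_r_squared, AICc)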
## # A tibble: 6 x 4
## .model r_squared adj_r_squared AICc
## <chr> <dbl> <dbl> <dbl>
## 1 K1 0.962 0.962 -1085.
## 2 K2 0.966 0.965 -1099.
## 3 K3 0.976 0.975 -1160.
## 4 K4 0.980 0.979 -1183.
## 5 K5 0.985 0.984 -1234.
## 6 K6 0.985 0.984 -1232.
Harmonic regression: eating-out expenditure
[Figures: Log transformed TSLM, trend() + fourier(K) for K = 1, . . . , 6, each showing the fit and forecasts with 80% and 95% prediction intervals. AICc: K = 1: −1085; K = 2: −1099; K = 3: −1160; K = 4: −1183; K = 5: −1234; K = 6: −1232. AICc is minimised at K = 5.]
Intervention variables
Spikes
Equivalent to a dummy variable for handling an outlier.
Steps
Variable takes value 0 before the intervention and 1 afterwards.
Change of slope
Variable takes value 0 before the intervention and values {1, 2, 3, . . . } afterwards. A sketch of constructing all three follows.
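A minimal base-R sketch of the three regressor types (the intervention time tau is hypothetical):
tau <- 100; t <- seq_len(200)
spike <- as.integer(t == tau)    # pulse dummy for an outlier
step  <- as.integer(t >= tau)    # level shift
slope <- pmax(0, t - tau + 1)    # change of slope: 0, ..., 0, 1, 2, 3, ...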
Holidays
For monthly data, a holiday fixed to one month (e.g., Christmas) is already captured by the seasonal dummies; a moving holiday such as Easter needs its own dummy variable, equal to 1 in any period containing part of the holiday.
Distributed lags
Lagged values of a predictor can be included when its effect lasts beyond the current period, e.g., advertising: x1 = advertising this month, x2 = advertising last month, and so on.
Example: Boston marathon winning times
marathon <- boston_marathon %>%
filter(Event == "Men's open division") %>%
select(-Event) %>%
mutate(Minutes = as.numeric(Time)/60)
marathon %>% autoplot(Minutes) + labs(y="Winning times in minutes")
[Figure: Boston marathon winning times (men's open division), in minutes, declining from about 170 to under 130.]
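The mable printed below implies fit_trends holds three trend specifications; a plausible reconstruction (the knot years are assumptions):
fit_trends <- marathon %>%
  model(
    linear = TSLM(Minutes ~ trend()),
    exponential = TSLM(log(Minutes) ~ trend()),
    piecewise = TSLM(Minutes ~ trend(knots = c(1950, 1980)))
  )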
fit_trends
## # A mable: 1 x 3
## linear exponential piecewise
## <model> <model> <model>
## 1 <TSLM> <TSLM> <TSLM>
Example: Boston marathon winning times
fit_trends %>% forecast(h=10) %>% autoplot(marathon)
[Figure: Ten-year forecasts from the linear, exponential, and piecewise models, with 95% prediction intervals.]
[Figure: Residual diagnostics for the trend models: residual time plot, ACF (lag in years), and histogram.]
Multiple regression and forecasting
Residual plots
Useful for spotting outliers and whether the linear model was appropriate.
Scatterplots of residuals εt against each predictor xj,t.
Scatterplot of residuals against the fitted values ŷt.
Expect to see scatterplots resembling a horizontal band, with no values too far from the band and no patterns such as curvature or increasing spread. A sketch of the first plot follows.
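A minimal sketch of the residuals-versus-predictors plot for the consumption model fitted earlier:
us_change %>%
  left_join(residuals(fit_consMR), by = "Quarter") %>%
  pivot_longer(Income:Unemployment, names_to = "regressor", values_to = "x") %>%
  ggplot(aes(x = x, y = .resid)) +
  geom_point() +
  facet_wrap(. ~ regressor, scales = "free_x") +
  labs(y = "Residuals", x = "")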
Residual patterns
Comparing regression models
Computer output for regression will always give the R² value. This is a useful summary of the model.
It is equal to the square of the correlation between y and ŷ.
It is often called the “coefficient of determination”.
It can also be calculated as follows:
R² = ∑(ŷt − ȳ)² / ∑(yt − ȳ)²
It is the proportion of variance accounted for (explained) by the predictors. (A quick check with glance() follows.)
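Both statistics are reported by glance(); a quick check on the consumption model fitted earlier:
glance(fit_consMR) %>% select(r_squared, adj_r_squared)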
Comparing regression models
However . . .
R² does not allow for “degrees of freedom”.
Adding any variable tends to increase the value of R², even if that variable is irrelevant.
To overcome this problem, we can use adjusted R²:
R̄² = 1 − (1 − R²)(T − 1)/(T − k − 1)
where k = no. of predictors and T = no. of observations. (A small simulation illustrating this follows.)
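A small base-R simulation of this effect (hypothetical data): adding an irrelevant predictor raises R² but not necessarily adjusted R².
set.seed(1)
y <- rnorm(50); x <- rnorm(50); junk <- rnorm(50)
s1 <- summary(lm(y ~ x))
s2 <- summary(lm(y ~ x + junk))
c(s1$r.squared, s2$r.squared)          # R² cannot decrease
c(s1$adj.r.squared, s2$adj.r.squared)  # adjusted R² may fall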
Akaike’s Information Criterion
In this formulation, AIC = T log(SSE/T) + 2(k + 2), where SSE is the sum of squared errors. For small values of T, the AIC tends to select too many predictors, and so a bias-corrected version of the AIC has been developed:
AICc = AIC + 2(k + 2)(k + 3) / (T − k − 3)
Bayesian Information Criterion
BIC = T log(SSE/T) + (k + 2) log(T)
BIC penalises additional predictors more heavily than the AIC; the model with the lowest BIC is preferred.
Cross-validation
Traditional evaluation
[Diagram: the sample is split into training data followed by test data along the time axis.]
Leave-one-out cross-validation (h = 1)
[Diagram: each observation in turn is held out and predicted from the remaining data.]
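For linear regression, leave-one-out residuals have a closed form via the hat matrix, so no refitting is needed; a base-R sketch on simulated data:
set.seed(1)
x <- rnorm(30); y <- 2 + 3 * x + rnorm(30)
fit <- lm(y ~ x)
h <- hatvalues(fit)                  # leverage of each observation
mean((residuals(fit) / (1 - h))^2)   # LOOCV mean squared error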
Choosing regression variables
Scenario based forecasting
Building a predictive regression model
US Consumption
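The object fc plotted below holds scenario forecasts; one plausible construction with fabletools::scenarios(), using the model fitted earlier (the scenario values are assumptions):
future_scenarios <- scenarios(
  Increase = new_data(us_change, 4) %>%
    mutate(Income = 1, Savings = 0.5, Production = 0, Unemployment = 0),
  Decrease = new_data(us_change, 4) %>%
    mutate(Income = -1, Savings = -0.5, Production = 0, Unemployment = 0),
  names_to = "Scenario"
)
fc <- forecast(fit_consMR, new_data = future_scenarios)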
us_change %>% autoplot(Consumption) +
  autolayer(fc) +
  labs(title = "US consumption", y = "% change")
[Figure: Scenario forecasts of US consumption under Increase and Decrease income scenarios, with 80% and 95% prediction intervals.]
Matrix formulation
y = Xβ + ε
β̂ = (X′X)⁻¹X′y
σ̂² = (y − Xβ̂)′(y − Xβ̂) / (T − k − 1)
Note: If you fall for the dummy variable trap, (X′X) is a singular matrix.
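A base-R sketch of these formulas on simulated data:
set.seed(1)
n <- 100; k <- 2
X <- cbind(1, matrix(rnorm(n * k), n, k))   # design matrix with intercept
beta <- c(1, 2, -1)
y <- X %*% beta + rnorm(n)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^(-1) X'y
sum((y - X %*% beta_hat)^2) / (n - k - 1)   # sigma-hat squared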
Likelihood
If the errors are iid N(0, σ²), then maximising the likelihood is equivalent to minimising the sum of squared errors, so the maximum likelihood estimator of β equals the least squares estimator.
Multiple regression forecasts
Optimal forecasts
ŷ* = E(y*|y, X, x*) = x*β̂ = x*(X′X)⁻¹X′y
where x* is a row vector containing the values of the predictors for the forecasts.
Multicollinearity
If multicollinearity exists . . .
the numerical estimates of coefficients may be wrong (worse in Excel than in a statistics package)
don’t rely on the p-values to determine significance.
there is no problem with model predictions provided the predictors used for forecasting are within the range used for fitting.
omitting variables can help.
combining variables can help.
A quick diagnostic sketch follows.
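One quick diagnostic is the condition number of the design matrix; a base-R sketch with two nearly collinear predictors (hypothetical data):
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01)   # nearly identical to x1
X <- cbind(1, x1, x2)
kappa(X)                           # very large value flags multicollinearity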