

Maastricht University
School of Business and Economics

August, 2015

Master Thesis

Machine Learning and Econometrics


A survey of techniques

Author: Peter Thesling
Supervisors: Dr. Stephan Smeekes, Prof. Dr. Jean-Pierre Urbain
Education is one of the blessings of life - and one of its necessities. That has been my experience during the 17 years of my life. In my paradise home, Swat, I always loved learning and discovering new things. I remember when my friends and I would decorate our hands with henna on special occasions. And instead of drawing flowers and patterns I would paint our hands with mathematical formulas and equations.

MALALA YOUSAFZAI
Contents

1 Introduction 1

2 Supporting methods 4
  2.1 Cross-validation 4
  2.2 Bootstrapping 5

3 Machine learning techniques 7
  3.1 Variable selection in its early stages 7
  3.2 Penalized regression 9
    3.2.1 Ridge 10
    3.2.2 Lasso 14
    3.2.3 Elastic net 18
  3.3 Dimensionality reduction 20
    3.3.1 Factor models 20
    3.3.2 Principal component analysis 23
    3.3.3 Sparse Loadings 28
  3.4 Support vector machine 29
    3.4.1 Derivation 29
    3.4.2 Kernels 34
      3.4.2.1 Theoretical work 34
      3.4.2.2 Examples of kernels 38
  3.5 k-nearest neighbor 41

4 Ensemble machine learning techniques 47
  4.1 Boosting 47
    4.1.1 AdaBoost 48
    4.1.2 Gradient boosting 54
    4.1.3 Component-wise boosting and other extensions 56
  4.2 Trees and random forest 58
  4.3 Bagging 61
  4.4 A comparison of ensemble techniques and the variable importance plot 62

5 Machine learning's future in econometrics 65

6 Conclusion 67

A Datasets 73
Notation

X.i : i-th column of X. Similarly, Xi. is the i-th row.
λij : entry in the i-th row and j-th column of Λ. In general, lower case letters refer to entries of the matrix denoted in upper case letters.
zi : variable i, which depending on the section is either a column or a row of the data matrix X.
xi : observation i, which depending on the section is either a column or a row of the data matrix X.
⌊x⌋ : floor function, ⌊x⌋ = max{n ∈ Z | n ≤ x}
⟨·, ·⟩_H : inner product in the space H
span : closed linear span
→ a.s. : converges almost surely
||a||_p : Lp-norm of a, (Σ_{i=1}^N |a_i|^p)^(1/p)
1(x ∈ A) : indicator function, equal to 1 if x ∈ A and 0 otherwise
|S| : cardinality or size of a set S
1 Introduction
The term big data has received a lot of attention in the media recently and
sometimes it appears to have caused a revolution in statistics. Certainly the fact
that several corporations can now monitor far more parts of their customers’
behavior has had a great influence on doing business. However, big data’s effect
on economics is highly debated, since at this point a revolution can certainly
not be observed. Economists’ opinions range from big data - big hype to the
possibility to reach a more empirically driven view on economics. This could
imply that economists will be able to understand phenomena, such as the drivers
of inflation with the help of aggregated micro data. Einav and Levin (2013)
conducted a survey study addressing the question of big data’s influence on
economics. They hope for new insights from more publicly available data and
information on people’s sentiment, such that not only average treatment effects,
but whole distributions can be estimated. They find two major problems in
these novel types of datasets. The first is dimensionality: with more variables
than observations it is easy to obtain a perfect in-sample fit, which is obviously
not the right approach for inference or any kind of forecasting exercise. Next
to that more data means there is more to retrieve, such that the typical linear
relationships that are common in econometrics might be too modest. Einav
and Levin (2013) point at the need for new methods to be employed, but don’t
elaborate further.
Varian (2014) picks up from there and introduces the field of machine
learning. Machine learning originated from computer science and covers all techniques whose performance improves with the amount of data used, such that, e.g., any regression belongs to it. A large group within machine learning
techniques are ensemble algorithms. An ensemble algorithm generates not only
one, but a group of models and merges those into one final model. The most
popular representative of this class is boosting. Furthermore machine learning
categorizes algorithms into the groups unsupervised and supervised learning. A
supervised learning algorithm needs labeled data, and this label, which economists would call a response variable, is what the algorithm attempts to model. If these
labels do not exist, an unsupervised algorithm has to be applied that explores
the data instead of predicting some outcome. From my point of view the super-
vised learning algorithms are by far more applicable to econometrics than their
counterpart. Therefore I will restrict my survey to those.1
1 Excluded from this restriction are factor models and principal component analysis, which, although belonging to the class of unsupervised algorithms, can easily be extended to be used as supervised methods.

With the emergence of new, large datasets machine learning algorithms have
been used extensively in academia and the private sector. Despite their success
they have hardly been employed in econometrics. I will suggest a subset of
machine learning techniques to be applied in econometrics and will go far beyond
Varian’s rather introductory level of technical detail. Some of the techniques,
such as principal component analysis, will sound rather familiar and should
alleviate the entry into the field. Others, such as the support vector machine,
should be rather new to econometrics.2 The fact that both groups of techniques
have mainly originated from computer science makes me face several issues.
Whereas economists would build theories around the data generating process
and would like to test their theories on the data, computer scientists care more
about e.g. the algorithm’s speed. Hence I will attempt to shed light on what
the transition from a pure machine learning method into an econometric tool
could look like.
Before continuing with the outline I would like to add some notes on overfitting, a crucial topic in machine learning. Overfitting occurs when an algorithm does not only model the actual relationship between variables, but also the error terms. Usually this implies bad out-of-sample forecasts, as shown in Figure 1. The unfilled points were employed to fit a polynomial regression of degree five. The corresponding result is the dashed line, which does a poor job both at modeling the true x-y relationship, illustrated as the solid line, and at predicting out-of-sample observations, shown as filled points. Since most machine learning methods are non-linear and often deal with datasets where the number of variables is larger than the number of observations, overfitting will be a recurring topic in this thesis.

Figure 1: Overfitting illustrated
2 One might notice in the table of contents that no section on neural networks exists. Neural networks are certainly one of the most prominent techniques in machine learning. Nevertheless, the technique has been introduced before and can be extremely technical. Its adoption into econometrics appears to be a long way off, such that I decided not to include it in my survey.

This thesis tries to be self-contained, such that issues will be explained in
detail, including derivations and, where applicable, some geometric interpretation.3 Next to that, for all introduced techniques I will attempt to continuously answer two questions: firstly, how could the method be employed in econometrics, and secondly, whether dependent data can be analyzed with it. Since
some methods have been more adopted in econometrics than others, solutions
to these problems might already exist for some algorithms, whereas for others
more research is still required. The thesis will proceed as follows: To start I will
introduce supporting methods that will appear on several occasions in the thesis: cross-validation and the bootstrap. This is followed by the main part on machine learning algorithms, which are separated based on whether the technique is an ensemble algorithm. The thesis will finish with a short
section on the possible paths that machine learning could take in econometrics
and a conclusion.

3 This approach is adhered to as long as explanations do not get off topic. For example for extensions of methods that manage to handle time-series data it would require a vast amount of analysis to be as exhaustive in those sections as I aim to be in general.
2 Supporting methods
2.1 Cross-validation
Cross-validation is a model evaluation method, like the information criteria that are more commonly used in econometrics. It addresses the issue of sparsely available data by using the existing sample in an elegant way and is computed as follows:
Say there is a dataset X: n×p with n observations and p variables. A row of X
represents one observation and is denoted by xi . Similarly yi is an observation
for the response variable y. First the data is split into K roughly equally sized
groups by some assignment function κ : {1, . . . , n} → {1, . . . , K}. K is often
chosen as 5 or 10 (Hastie et al., 2009). Next a model is run on the data while
withholding one of the K groups, such that formally fˆ−k is computed, which
denotes the model fitted without the k-th group. Group k, which was omitted,
now takes the role of a testing set for my model, as visualized in the table below
for K = 5.

Group:  1    2    3    4    5
Role:   Fit  Fit  Fit  Fit  Test

This method is repeated for all k ∈ {1, . . . , K}. Then the cross-validation error is

$$CV(\hat{f}) = \frac{1}{n}\sum_{i=1}^{n} L\big(y_i, \hat{f}^{-\kappa(i)}(x_i)\big)$$

for some loss function L. A common choice for L would be the squared loss function, such that $CV(\hat{f})$ turns into the mean squared error: $MSE = \frac{1}{n}\sum_{i=1}^{n}\big(\hat{f}^{-\kappa(i)}(x_i) - y_i\big)^2$. Hence cross-validation tests the model out-of-sample and thus obtains a number that is more robust to overfitting than other
goodness of fit measures. Several different versions exist, such as leave one
out cross-validation, where each group consists of only one element and thus
K = n. Another type is robust against dependent data and thus appealing for
time series econometrics: Again the sample is split into K groups. Then the first group, hence the one with the first ⌊n/K⌋ elements, is used for fitting and the
second one containing the subsequent observations is employed for testing. In
the next step the first two groups are used for the computation of the model,
whereas the third one now provides the out-of sample prediction and so on. The
errors are averaged weighted by the relative in-sample size. Since this procedure
takes the chronological order of the data within the group into account, the de-
pendency structure is maintained. Cross-validation is a recurring topic within

this thesis, as it is often used for parameter selection, such as the stopping time
of an algorithm. Compared to information criteria its idea is in my eyes more
intuitive, applicable to more problems and robust against overfitting.
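To make this concrete, here is a small sketch of my own (not taken from the thesis) that computes the K-fold cross-validation error of a least-squares fit with numpy; the squared loss plays the role of L and κ is a random assignment into K roughly equally sized groups.

```python
import numpy as np

def kfold_cv_error(X, y, K=5, seed=0):
    """K-fold cross-validation error of a least-squares fit (squared loss)."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    kappa = rng.permutation(n) % K          # assignment function kappa: {1,...,n} -> {0,...,K-1}
    losses = np.empty(n)
    for k in range(K):
        test = kappa == k
        train = ~test
        # fit f^{-k} on all groups except group k (least squares via lstsq)
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        losses[test] = (y[test] - X[test] @ beta) ** 2   # squared loss on the held-out group
    return losses.mean()                    # CV = (1/n) * sum of out-of-sample losses

# usage on simulated data with n = 100 observations and p = 3 variables
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=100)
print(kfold_cv_error(X, y, K=5))
```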

2.2 Bootstrapping
In econometrics asymptotic analysis often simplifies calculations to a great ex-
tent. Already being allowed to take advantage of the central limit theorem
justifies the tendency to work with asymptotic results. However, the assump-
tion of a large sample is often far away from reality. A popular way to overcome
this issue is called bootstrapping. Introduced in Efron (1979) it is a resampling
technique, which is also widely used in econometrics by now. Since several ma-
chine learning techniques extend or build on bootstrapping, I will provide a short
introduction to it here that will be sufficient to understand those techniques later
on.
Suppose I have a sample of n observations on the variable x and summarize
this dataset in X .4 Furthermore let the CDF and PDF of x be F (x) and f (x)
respectively. Say I would like to conduct hypothesis tests on some estimator θ̂.
Usually I start with an assumption on the distribution and say f (x) = f (x|θ):
the distribution is parametrized by θ. I can then employ the maximum likelihood
to find θ̂. Next the central limit theorem provides me with the possibility to
construct a confidence interval around θ̂, which I can use for hypothesis testing.
However, I can also follow a different approach. Let me first turn the situation around and write the estimator in terms of the distribution. For example, if x is continuous and I am interested in the mean, then $\theta = \int_{-\infty}^{\infty} x f(x)\,dx = \int_{-\infty}^{\infty} x\,dF(x)$. The distribution of my statistic obviously depends on the dataset
X and the true value of the estimator (θ) such that I refer to the distribution as
D(X , θ). As given in the example before θ is a function of the CDF, such that
I can rewrite the distribution into D(X , F ). Another difference is that now I do
not make an assumption about F, but instead will try to estimate it. Consider $\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n} I_{(-\infty,x]}(x_i)$, which is called the empirical distribution function or EDF. Note that the corresponding PDF is

$$\hat{f}_n(x) = \begin{cases} \frac{1}{n} & \text{if } x \in \{x_1, \ldots, x_n\} \\ 0 & \text{otherwise.} \end{cases}$$

Then by the weak law of large numbers $\hat{F}_n(x) \xrightarrow{p} E[I_{(-\infty,x]}(x_i)] = P(x_i \leq x) = F(x)$.

4 In this explanation I will only consider one variable. More variables do not change the underlying idea, but complicate the notation.

This only states the weak version of the fact that the distribution converges pointwise to the CDF. What I would actually need is convergence
of the entire EDF to the CDF. This result is presented in the Glivenko-Cantelli
Theorem (Tucker, 1959), which provides me with the tools to show consistency
of F̂ . Now I can replace D(X , F ) with D(X , F̂ ). This way of constructing an
estimator is called plug-in principle. Bootstrapping makes use of all this. Since
F̂ consistently estimates the CDF of the population, I can use it to draw from
the population. As the corresponding PDF is equal to 1/n for each x = xi for
i ∈ {1, . . . , n}, I simply draw observations from the sample with replacement.
The replacing part also assures that I do not obtain the exact same dataset.
After collecting B of those samples I compute θ̂ for each of them and thus get
a distribution for the statistic.
Especially nowadays the method of the bootstrap is very easy to implement and compute. Nevertheless, the background that I just sketched and which jus-
tifies the bootstrap, is certainly more involved. For more details on the method
I refer to Smeekes (2009). Since this analysis can be done independent of the
distribution function and these functions are generally grouped into parametric
families, this type of analysis is called non-parametric statistics. As any other
non-parametric technique it relies on a strong trust in the data, since the
outcome entirely depends on the data and not on the user’s input.
For the bootstrap the user only needs to decide on B, the number of repli-
cations. Roughly it can be said that the larger B, the better, since more
replications can only increase the randomization of the different samples. Boot-
strapping can be easily extended to dependent data: the so-called
moving block bootstrap splits the data into blocks that contain a roughly equal
number of observations. It is important here that these elements are consecu-
tive. Then the method picks blocks with replacement, such that the dynamic
within the block remains the same. The bootstrap is a powerful tool, whose advantages will be exploited within this thesis in Section 4.2 on the random forest and in Section 4.3, which deals with the bagging technique.
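As a minimal illustration of the resampling idea (my own sketch, with an arbitrary choice of the statistic and of B), the following numpy code draws B bootstrap samples with replacement and returns the resulting distribution of the statistic.

```python
import numpy as np

def bootstrap_statistic(x, stat=np.mean, B=1000, seed=0):
    """Nonparametric bootstrap: draw B samples with replacement from x
    and return the B realizations of the statistic."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # each row is one bootstrap sample drawn from the EDF, i.e. uniformly from {x_1,...,x_n}
    samples = rng.choice(x, size=(B, n), replace=True)
    return np.apply_along_axis(stat, 1, samples)

# usage: bootstrap distribution of the mean and a simple percentile interval
x = np.random.default_rng(1).exponential(scale=2.0, size=50)
theta_hat = bootstrap_statistic(x, stat=np.mean, B=2000)
print(theta_hat.mean(), np.percentile(theta_hat, [2.5, 97.5]))
```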

3 Machine learning techniques
3.1 Variable selection in its early stages
"And the biggest problem is that all of them are significant" is often a state-
ment that results in starting to research variable selection algorithms. Several
problems occur, if I use common methods, such as t-tests and least-squares
regression, in datasets with many explanatory variables. First of all, the t-test is designed to answer whether a particular variable should be included in a model or
not. Such a model could have been constructed from e.g. economic theory.
However, in large datasets it might be preferable to let the data decide on a
model, simply because manually it would be too time-consuming. If I now run a
regression with all available variables included, t- and F -tests are not capable of
helping any more, as the model is not built in advance. Moving on to the spe-
cial case of high-dimensional data, where the number of explanatory variables
is larger than the number of observations, hence p > n, I encounter another
problem: Say {x1 , . . . , xn } is my sample with xi ∈ Rp ∀i, summarized as rows
in the matrix X. Then X T X is not invertible.
Lemma 1. If p > n, the empirical covariance matrix

$$S = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$$

with $x_i \in \mathbb{R}^p$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, is singular.
Proof: S can be regarded as (1/n) X̃^T X̃ with X̃ being a standardized matrix X. Now it is known that rank(X) = rank(X^T X) ≤ min(n, p), since X : n × p. If n < p, it is obvious that rank(X) < p. With rank(S) = rank(X̃^T X̃) ≤ rank(X^T X), since the standardization will if at all decrease the rank of the matrix, rank(S) < p. Hence S with dimensions p × p is singular and thus non-invertible.
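A quick numerical check of Lemma 1 (my own sketch): for simulated data with p > n, numpy reports a rank of the empirical covariance matrix, and of X^T X, that is strictly smaller than p.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 25                      # more variables than observations
X = rng.normal(size=(n, p))
S = np.cov(X, rowvar=False)        # p x p empirical covariance matrix

print(np.linalg.matrix_rank(S))          # at most n - 1 < p, hence S is singular
print(np.linalg.matrix_rank(X.T @ X))    # also smaller than p
# inverting X'X here would be meaningless: the matrix is rank deficient
```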

Direct computation of e.g. the least-squares estimator is hereby impossible.


Intuitively in high-dimensional data under the assumption of a full rank matrix
X I can regard Xβ = y as a system of linear equations with fewer equations (n)
than variables (p). All this justifies the conclusion that in datasets with many
explanatory variables I have to consider different methods. In times of increas-
ingly cheap data collection this problem has received more and more attention.
My objective is not to list all possible techniques, since this would result in a
thesis by itself. For the same reason I will also exclude all Bayesian techniques.5
5 A summary of Bayesian variable selection can be found in Varian (2014).

I rather see my contribution in this section as providing the reader with a complement to the techniques introduced in more detail later, which essentially also belong to the class of variable selection methods. In particular these are the
lasso (Section 3.2.2), the elastic net (Section 3.2.3) and component-wise boost-
ing (Section 4.1.3). In the following paragraph I will keep mentioning criteria
that evaluate the performance of a certain model. Among others those are the
adjusted R2 , information criteria, such as the AIC and BIC, or cross-validation
measures.
The simplest method among variable selection techniques is to test all mod-
els and rank them according to one of those criteria. This is called all-subsets
regression and assures that the best model is found. However this method is
computationally costly considering that there are 2^p different models. It would
be more elegant to organize the search for the best model more efficiently. One
group of methods that tackles this issue is forward stepwise selection: I be-
gin with a regression of the response variable on a constant and keep adding
the variable to the model that improves a specific criterion the most. Similarly, backwards stepwise selection considers a large model first, possibly containing all explanatory variables, and deletes the variables that worsen the model's performance the least. Both techniques stop if updating the model improves the criterion by too little or if the current model contains a sufficient number
of variables.
An improvement in the evolution of variable selection techniques was the
so-called forward stagewise selection. The method looks for the variable that is
most correlated with the current residuals and only takes small steps in the di-
rection of this variable. The corresponding coefficient is then in case of positive
(negative) correlation increased (decreased) by a small ε > 0. If there is only
one important variable, its coefficient is modified several times in a row. Oth-
erwise the method alternates between increasing or decreasing the coefficients
of several variables. Since forward stagewise selection is related to the lasso, a
graphical illustration of this approach is provided in Section 3.2.2. The algo-
rithm is stopped under similar conditions as in forward or backwards selection.
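The following sketch of my own (assuming standardized regressors and a demeaned response, with ε and the number of steps chosen arbitrarily) implements the forward stagewise idea just described: repeatedly find the variable most correlated with the current residuals and move its coefficient by a small ε in the direction of that correlation.

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_steps=5000):
    """Incremental forward stagewise selection (assumes standardized columns of X)."""
    n, p = X.shape
    beta = np.zeros(p)
    residual = y - y.mean()                 # start from the constant-only model
    for _ in range(n_steps):
        corr = X.T @ residual               # proportional to the correlations for standardized data
        j = np.argmax(np.abs(corr))         # variable most correlated with the residuals
        step = eps * np.sign(corr[j])       # small step with the sign of the correlation
        beta[j] += step
        residual -= step * X[:, j]          # update the residuals of the current model
    return beta

# usage on simulated data: only the first two variables matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X = (X - X.mean(0)) / X.std(0)
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)
print(np.round(forward_stagewise(X, y), 2))
```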
Several drawbacks remain for all introduced techniques: if variables only con-
tribute to a model as a group, it is possible that none of the techniques selects
them and instead proposes a local optimum. Furthermore the estimation of the
coefficients is usually done with established regression techniques. Abandoning
this idea might open up improvements. This leads me to penalized regression:
a group of techniques that combines estimation and variable selection.

3.2 Penalized regression
Penalized regression techniques are inherently biased, which allows them to
outperform traditional least-squares regression in many situations. What might
sound counter-intuitive will hopefully be clear by the end of this section: Let me
begin with a familiar transformation of the mean squared error of an estimator
β̂ of β (Hamprecht, 2011):

$$\begin{aligned}
MSE(\hat\beta) &= E[(\hat\beta - \beta)^2] \\
&= E[(\hat\beta - E(\hat\beta) + E(\hat\beta) - \beta)^2] \\
&= E[(\hat\beta - E(\hat\beta))^2 + (E(\hat\beta) - \beta)^2 + 2(\hat\beta - E(\hat\beta))(E(\hat\beta) - \beta)] \\
&= \underbrace{E[(\hat\beta - E(\hat\beta))^2]}_{\text{Variance}} + \underbrace{(E(\hat\beta) - \beta)^2}_{\text{Bias}^2}
\end{aligned}$$

Graphically this is illustrated in Figure 2. If I increase the complexity of


a technique, say I add depth to a tree (see Section 4.2) or more variables
to a regression, then the bias and hence also the squared bias will decrease.
However, at the same time the variance increases, such that the best technique
in mean-squared error terms is not the one with the least bias. This insight
led statisticians to consider biased methods, whose goal is to increase the squared bias by less than the corresponding decrease in variance.

Figure 2: The Bias-Variance Tradeoff
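To make the tradeoff tangible, here is a small Monte Carlo sketch of my own (the simulated relationship, noise level and sample size are arbitrary choices): polynomial regressions of increasing degree are fitted to repeated samples, and the bias² and variance of the prediction at a single test point are estimated.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)            # true relationship (arbitrary example)
x0, n, reps = 0.5, 30, 500             # test point, sample size, Monte Carlo replications

for degree in (1, 3, 7):
    preds = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(-1, 1, n)
        y = f(x) + rng.normal(scale=0.3, size=n)
        coefs = np.polyfit(x, y, degree)           # least-squares polynomial fit
        preds[r] = np.polyval(coefs, x0)           # prediction at the test point
    bias2 = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```

As the degree grows, the squared bias tends to fall while the variance rises, which is exactly the pattern sketched in Figure 2.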

3.2.1 Ridge
Hoerl and Kennard (1970) suggested adding the squared L2-norm of the coefficients as a penalty and called the technique ridge regression. Mathematically ridge
solves
$$\hat\beta_r = \arg\min_{\beta} ||y - X\beta||_2^2 + \lambda||\beta||_2^2 \qquad (1)$$

with $||\beta||_2 = \sqrt{\sum_{i=1}^{p}\beta_i^2}$, λ ≥ 0 chosen beforehand and as usual X : n × p and
y : n×1.6 n and p denote the number of observations and variables respectively.
zi denotes variable i and is thus a column of X. This problem can be solved
analytically. With the help of some matrix calculus the first order condition is

2X T y = 2X T X β̂r + 2λβ̂r ,

which yields
β̂r = (X T X + λI)−1 X T y.
As expected β̂r is again the least squares estimator if λ = 0. Given standardized
data X^T X is the covariance matrix of the variables:

$$X^T X = \begin{pmatrix} 1 & \rho_{1,2} & \cdots & \rho_{1,p} \\ & \ddots & & \vdots \\ & & 1 & \rho_{p-1,p} \\ & & & 1 \end{pmatrix},$$

where ρi,j is the correlation between the variables zi and zj . Obviously X T X


is symmetric. Then X T X + λI adds variance to the regressors and its inverse
can be interpreted as:

$$(X^T X + \lambda I)^{-1} = \frac{1}{1+\lambda}\begin{pmatrix} 1 & \frac{\rho_{1,2}}{1+\lambda} & \cdots & \frac{\rho_{1,p}}{1+\lambda} \\ & \ddots & & \vdots \\ & & 1 & \frac{\rho_{p-1,p}}{1+\lambda} \\ & & & 1 \end{pmatrix}^{-1}.$$
1

Note that this is the first part of the ridge coefficient. It illustrates that compared
to least-squares regression ridge shrinks the coefficients and decorrelates the
data both by a factor 1 + λ. In the special case that X is orthogonal and hence no correlation exists between the variables, only the shrinking takes place: β̂r = (1 + λ)^{-1} β̂_LS.

6 Note that in the upcoming sections there is never a penalty on the intercept, neither in ridge nor in lasso. For notational convenience I just assume a demeaned response vector, such that the intercept is zero.
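As a sketch of my own (assuming a standardized X and a demeaned y, so that no intercept is needed), the closed-form ridge estimator β̂r = (X^T X + λI)^{-1} X^T y can be computed directly:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimator: (X'X + lambda*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# usage: the coefficients shrink toward zero as lambda grows
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = (X - X.mean(0)) / X.std(0)
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=100)
y = y - y.mean()
for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, np.round(ridge(X, y, lam), 3))
```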
As a next step I would like to show ridge graphically. To do so let me first
demonstrate that the initial optimization problem can be regarded as:

$$\min_{\beta} ||y - X\beta||_2^2 \quad \text{s.t.} \quad ||\beta||_2^2 \leq k \qquad (2)$$

Instead of adding the norm in the objective function, I employ it as a constraint.


The newly constructed problem is a constrained optimization problem that I can
solve with a Lagrangian:

L(β, k, λc ) = ||y − Xβ||22 + λc (||β||22 − k),

such that the Karush-Kuhn-Tucker (KKT) conditions are

β̂c = (X T X + λc I)−1 X T y
||β̂c ||22 ≤ k
λc ≥ 0
λc (||β̂c ||22 − k) = 0.

If I take λc = λ and k = ||β̂c ||22 , all conditions are satisfied. Not only that: If I
remember that the solution to (1) was β̂r = (X T X + λI)−1 X T y together with
λ ≥ 0, I also found a correspondence between the two optimization problems
(1) and (2). Therefore I can from now on use either representation.
Next let me consider ||y − Xβ||22 , the first term of the objective function
in (1) and show that it represents an ellipsoid. In general an ellipsoid centered
around x∗ is defined as all x that satisfy (x − x∗ )T A(x − x∗ ) = c for some
positive semi-definite A ∈ Rm×m , c ∈ R, x, x∗ ∈ Rm and m ∈ N. With
β̂LS = (X T X)−1 X T y

||y − Xβ||22 = ||X β̂LS + ε̂LS − Xβ||22


= ||X(β̂LS − β) + ε̂LS ||22
= (X(β̂LS − β) + ε̂LS )T (X(β̂LS − β) + ε̂LS )
= ((β̂LS − β)T X T + ε̂TLS )(X(β̂LS − β) + ε̂LS )
= (β̂LS − β)T X T X(β̂LS − β) + 2(β̂LS − β)T X T ε̂LS + ε̂TLS ε̂LS
= (β − β̂LS )T X T X(β − β̂LS ) + ε̂TLS ε̂LS .

Figure 3: Ridge Regression
Figure 4: L0 regression

In the last equality I used that the definition of β̂LS yields X T X β̂LS =
X T (X β̂LS + ε̂LS ) and thus X T ε̂LS = 0. A level set of the sum of squared
residuals is ||y − Xβ||22 = c or (β − β̂LS )T X T X(β − β̂LS ) = c − ε̂TLS ε̂LS . The
minimal c of ε̂TLS ε̂LS is then reached at β = β̂LS and thus ||y − Xβ||22 must be
an ellipsoid centered around β̂LS . Now it remains to add the L2 -norm, which
is just a circle centered at the origin. All information is summarized in Figure
3. The optimization problem implies to find the ellipsoid with the smallest sum
of squared residuals, that lies inside or on the circle. This is obviously found,
when the two functions touch, which is reminiscent of the typical economic problem of
maximizing utility given a budget line. Here the sum of squared residuals level
sets replace the indifference curve and the L2 -norm fills the role of the budget
line.
Certainly ridge regression is an improvement over least-squares regression
and has often proven to reduce the mean-squared error. Next to that it provides
coefficients that tend to be equal among a group of highly correlated variables.
For strong negative correlation the coefficients are similar up to their signs. In
the extreme case that two variables are exactly equal, I will now show that their
coefficients must be equal and will start with a rather basic result:
Lemma 2. The set, say Sc , of the Lp -norm defined as Sc = {x ∈ Rm : ||x||p ≤
c} with c ∈ R is convex for 1 ≤ p < ∞ and strictly convex for 1 < p < ∞.
Proof: Say a, b ∈ Sc and a ≠ b. Take λ ∈ [0, 1]. Then

||λa + (1 − λ)b||p ≤ λ||a||p + (1 − λ)||b||p


≤ λc + (1 − λ)c
= c,

where the first step follows from the Minkowski inequality for 1 ≤ p < ∞.
Hence λa + (1 − λ)b ∈ Sc , such that Sc must be convex for all c ∈ R.
Let me now show the strict convexity for 1 < p < ∞. Note that if at least
one of the two points a and b is an interior point of Sc , hence e.g. ||a|| < c,
the second inequality of the earlier calculation is strict, such that I can focus on
the boundary points. Thus assume that ||a|| = c and ||b|| = c and additionally
λ ∈ (0, 1). Note that for 1 < p < ∞ the Minkowski inequality turns into an
equality, if and only if a and b are linearly dependent. So say b = ta for some
t ≥ 0. However, if ||a|| = c = ||b|| = ||ta||, t must be equal to one, which
violates that a ≠ b. Hence a and b cannot be boundary points and be linearly
dependent at the same time for 1 < p < ∞, such that the inequality must be
strict. Since the same is true for all interior points, Sc is strictly convex for
1 < p < ∞.

Next:7

Lemma 3. Consider β̂ = arg minβ ||y − Xβ||22 + λf (β) with λ ≥ 0. If f (β) is


strictly convex and the two columns of X, zi and zj , are identical, then β̂i = β̂j .

Proof: Assume the opposite, namely β̂i ≠ β̂j, and define

$$\hat\beta_k^* = \begin{cases} \frac{1}{2}(\hat\beta_i + \hat\beta_j) & \text{if } k \in \{i, j\} \\ \hat\beta_k & \text{otherwise.} \end{cases}$$

Then writing out X β̂ ∗ will show that it is equal to X β̂, since zi = zj . Then
their sum of squared residuals must be equal. However, f (β̂ ∗ ) < f (β̂) due to
the strict convexity of f . Hence for a given λ ≥ 0 β̂ cannot be a minimum and
β̂i = β̂j .

Applying Lemma 2 with p = 2 and Lemma 3 together suggests that ridge


regression is able to handle groups of variables. One goal ridge did not achieve,
however, is sparsity or in other words a parsimonious model. Whereas ridge finds
coefficients close to zero, sparse models obtain coefficients that are actually
equal to zero. The obvious next step would be to change the constraint in
such a way that only sparse models are allowed. Those are located on the
axes. Mathematically this can be described by the L0 -norm8 shown in Figure
4 and defined as ||x||_0 = |{i : x_i ≠ 0}|, hence the number of non-zero elements.
7 Lemma 3 is a slightly adapted version of Lemma 2(a) in Zou and Hastie (2005).
8 Technically this is abuse of notation, since any Lp-norm with 0 ≤ p < 1 does not fulfill the requirements for a norm.

The solutions are now guaranteed to be sparse, but according to Lemma 2 the
corresponding optimization function of the type in equation (1) is not necessarily
convex. Hence minimizing the sum of squared residuals subject to a L0 -norm
might not attain a minimum. My requirement is thus a constraint that induces
sparsity like the L0-norm, but also provides a convex optimization problem, as the L2-norm does. The solution is found in between: the L1-norm
penalty, whose corresponding method is called lasso regression.

3.2.2 Lasso
Figure 5: Lasso Regression

There is a tremendous literature on the least absolute shrinkage and selection operator, or short lasso. The best known reference is probably the rather technical book by Bühlmann and Van De Geer (2011), which is almost exclusively dedicated to the lasso. Whereas other variable selection methods see the topic of selection and estimation as two separate issues, the lasso is able to fit a sparse model in one step. The lasso solves:

$$\min_{\beta} ||y - X\beta||_2^2 + \lambda||\beta||_1, \qquad (3)$$

where $||\beta||_1 = \sum_i |\beta_i|$ and the notation for X and y corresponds to the definition in the preceding section.
Again I can transform this into a con-
strained problem, which is visualized in Figure 5.9 The non-differentiability at
β1 = 0 and β2 = 0 is used to induce sparsity. If I restrict my data matrix
to be orthogonal, the ellipsoids turn into circles and I can identify those areas
that would map the least-squares fit to a sparse model. Part of those areas are
sketched in gray in Figure 5.
Figure 6a and 6b show the coefficients for a changing ||β||22 and ||β||1 in ridge
and lasso regression respectively. The graphs were produced with a dataset on
house prices in Boston. More details on the dataset are provided in Appendix A.
Note that moving from left to right in the graph decreases λ and thus increases the L1-norm of the coefficients. As explained earlier the most penalized, non-trivial ridge model has rather small coefficients, whereas the lasso counterpart excludes several variables.

9 For a general proof that will allow me to also transform the lasso optimization problem into a constrained problem I refer to Kloft et al. (2009).

Figure 6: Coefficients
(a) Ridge (b) Lasso
There is another interesting feature Figure 6b shows: the lines of the lasso
coefficients are piecewise linear. When the lasso was introduced, its estimation
was rather slow, which made applying the method unattractive. One problem in
finding a fast algorithm was the fact that as shown in Lemma 2 the lasso problem
(3) is only convex and not strictly convex. Efron et al. (2004) recognized the
chance to use the piecewise linearity to speed up the algorithm. They managed
to compute those points, where one subfunction ends and a new one begins.
The number of these points, as will be clear later, is more or less bounded by
the number of variables.
I will explain the estimation of the lasso via the lars algorithm (least angle regression) with the help of Figure 7. I begin with assuming a standardized
matrix X containing linearly independent variables z1 , . . . , zp . The technique
starts with a model that is just a constant in R: fˆ0 (z) = c, where c = 0 in
case of a demeaned y. Next I orthogonally project the response variable into
the space spanned by the explanatory variables z1 , . . . , zp and thus find the
fitted value under a least squares regression, ŷLS . As a next step I compute the
correlation between the residuals of the current model, ε̂1 = ŷLS − fˆ0 (z), and
each explanatory variable. Since the covariance is equal to the correlation here,
I can compute it as X T (ŷLS − fˆ0 (z)). Geometrically this is equal to the angle

15
between the vector ε̂1 and the explanatory variables. The variable with the
highest correlation and thus smallest angle is selected for the first step, which in
Figure 7 is z1. The important part now is how far I proceed in the direction of z1. One possibility would be to make small steps and keep asking which variable has the highest correlation with the current residuals. It turns
out that this is the approach the forward stagewise selection technique takes.
Alternatively I could just proceed to the least-squares component of z1 . This is
actually what the forward stepwise selection technique does.10 In Lars, however,
I take a value along z1 between those two procedures: I look for a, such that the
residual induced by az1 is equally correlated with at least one other variable. In
Figure 7 this point is fˆ1 (z) = c + az1 , the model after z1 has been added. The
angle between the induced residual ε̂2 and z1 and z2 respectively is the same:
α = β2 . ε̂2 is the current residual and at the same time the way the model
is changed as a next step. For p = 2 the algorithm quits at the least-squares
estimate. If p > 2, I again proceed until the angle is the same with a third
variable and so on. Hence lars always proceeds along the least angle, which has
given the method its name.

Figure 7: Illustration of the lars algorithm

It can be shown that this algorithm under minor changes reproduces the
lasso and the closely related forward stagewise results. In the lasso this minor
change means that there is the possibility that variables are removed from the
model again, whereas for the forward stagewise algorithm lars first projects each
vector that is added to the model onto a different space. In this way the lars
algorithm is able to improve the estimation of the lasso and forward stagewise
selection. It turns out that the running time of the lars algorithm is of the same order as the running time of computing least-squares coefficients. For more mathematical details and the implementation of the lars algorithm I refer to the original paper, which goes beyond a graphical explanation (Efron et al., 2004).

10 For more details on forward stagewise and stepwise selection see Section 3.1.
Figure 8: Selection of λ for the lasso

The lasso can be tuned with the selection of λ, which as shown in Figure 8 can greatly affect the resulting model. The data here was simulated with an orthogonal normal error term around the true x-y relationship represented by the solid line. Then I fitted models of polynomial degree 5 to the data. For a large λ the solution is a constant, hence no explanatory variables are included in the model. For λ = 0.1 the line appears to be a close approximation of the true line, whereas λ = 0.001 is close to the least-squares fit and provides an over-complicated, overfitting model. Hence selecting λ is crucial. Cross-validation or information criteria can support the selection process.
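As an illustration of selecting λ by cross-validation (my own sketch; it relies on scikit-learn's LassoCV, which is not used in the thesis, and on simulated data), K-fold cross-validation picks λ over a grid and the resulting coefficient vector shows which variables are set exactly to zero.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# simulated data: only 3 of 20 regressors are relevant
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 20))
beta_true = np.zeros(20)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + rng.normal(size=150)

# 5-fold cross-validation over an automatically chosen grid of penalty values
model = LassoCV(cv=5).fit(X, y)
print("chosen penalty:", model.alpha_)
print("non-zero coefficients:", np.flatnonzero(model.coef_))
```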
Due to its popularity there are several different versions of the lasso tech-
nique: The relaxed lasso is especially designed for variable selection. It begins
with a normal lasso regression to find the relevant variables. Then another
lasso regression with a smaller λ is computed on the reduced dataset. This is
supposed to reduce the bias (Vidaurre et al., 2013).
Furthermore there is the group lasso that is appealing for a dataset, where
groups of explanatory variables can be established. The idea is to either include
all variables from a group or none of them. Introduced in Yuan and Lin (2006)
for some J groups the optimization function is:

$$\min_{\beta} ||y - X\beta||_2^2 + \lambda\sum_{j=1}^{J} ||\beta_j||_{K_j},$$

where $||\beta_j||_{K_j} = (\beta_j^T K_j \beta_j)^{1/2}$. The authors suggest $K_j = p_j I_j$ with $p_j$ being the group size. Then $||\beta_j||_{K_j} = \sqrt{p_j}\,||\beta_j||_2$. This is at first sight counter-intuitive,
since I showed that the L2 -norm does not induce sparsity, but note that here
the L2-norm is only applied within the group. Across groups there is actually an L1-norm: just consider the special case where pj = 1 ∀j. Geometrically the group lasso produces a two-sided cone, where the peaks imply that all members
of one group only have coefficients taking the value zero.
There are also methods designed for time series data. First of all for ARMA
models the most common idea is to consider all lags as a group and estimate
a sometimes more, sometimes less adapted version of the group lasso. For
multivariate response the smoothed lasso is an alternative, which estimates time-
varying parameters (Meier et al., 2007). Instead of fitting a lasso regression for
each point in time, the smoothed lasso finds for all t

$$\min_{\beta(t)} \sum_{s=1}^{n} w(s, t)\,||y(s) - X\beta(t)||_2^2 + \lambda||\beta(t)||_1,$$

where w(s, t) is a kernel density centered around t.11 The estimates are penal-
ized most for wrong predictions at the point in time they represent.

3.2.3 Elastic net


The elastic net is another extension to the lasso that has recently received
increased attention. It combines both ridge and lasso regression:

$$\min_{\beta} ||y - X\beta||_2^2 + \lambda_1||\beta||_1 + \lambda_2||\beta||_2^2 \qquad (4)$$

For different values of λ1, λ2 ≥ 0 the special cases are least-squares, lasso and ridge regression. Zou and Hastie (2005) managed to show that (4) can be rewritten as

$$\left\| \begin{pmatrix} y \\ 0 \end{pmatrix} - \frac{1}{\sqrt{1+\lambda_2}}\begin{pmatrix} X \\ \sqrt{\lambda_2}\, I \end{pmatrix}\sqrt{1+\lambda_2}\,\beta \right\|_2^2 + \frac{\lambda_1}{\sqrt{1+\lambda_2}}\left\|\sqrt{1+\lambda_2}\,\beta\right\|_1.$$

This provides me with the opportunity to define $y^* = \begin{pmatrix} y \\ 0 \end{pmatrix}$, $X^* = \frac{1}{\sqrt{1+\lambda_2}}\begin{pmatrix} X \\ \sqrt{\lambda_2}\, I \end{pmatrix}$ and $\beta^* = \sqrt{1+\lambda_2}\,\beta$. Then with $\lambda^* = \frac{\lambda_1}{\sqrt{1+\lambda_2}}$ I have found a lasso problem and can thus use the lars algorithm for the corresponding estimation. The elastic net
has several advantages over the lasso: First in the case of high-dimensional data
rank(X) ≤ min(n, p) = n, whereas rank(X ∗ ) ≤ p. This implies that the elas-
tic net can in this scenario include a number of explanatory variables larger than
the number of observations. Secondly the inclusion of the ridge penalty implies that the objective function is strictly convex, which according to Lemma 3 suggests that the method is able to handle groups of variables.

11 For more details on kernel densities I refer to the upcoming Section 3.5 on the k-nearest neighbor or to Chapter 20 of Hansen (2000).
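A small numerical check of the augmentation in (4)'s lasso form above (my own sketch): solving the lasso problem on the augmented data (y*, X*) and transforming back yields the naive elastic net coefficients. The sketch relies on scikit-learn's Lasso solver, which is an assumption of the example rather than part of the thesis; since that solver scales the squared error by 1/(2m) for m observations, the penalty is rescaled accordingly.

```python
import numpy as np
from sklearn.linear_model import Lasso

def naive_elastic_net(X, y, lam1, lam2):
    """Naive elastic net via the lasso on augmented data (Zou and Hastie, 2005)."""
    n, p = X.shape
    X_star = np.vstack([X, np.sqrt(lam2) * np.eye(p)]) / np.sqrt(1 + lam2)
    y_star = np.concatenate([y, np.zeros(p)])
    lam_star = lam1 / np.sqrt(1 + lam2)
    # sklearn's Lasso minimizes (1/(2m))||y - Xb||^2 + alpha*||b||_1 with m samples,
    # so alpha is rescaled to match min ||y* - X*b||^2 + lam_star*||b||_1
    m = n + p
    fit = Lasso(alpha=lam_star / (2 * m), fit_intercept=False, max_iter=50000).fit(X_star, y_star)
    beta_star = fit.coef_
    return beta_star / np.sqrt(1 + lam2)   # transform back: beta = beta* / sqrt(1 + lam2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = X @ np.array([2, -2, 1, 0, 0, 0, 0, 0]) + rng.normal(size=100)
print(np.round(naive_elastic_net(X, y, lam1=5.0, lam2=1.0), 3))
```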
Zou and Hastie (2005) call (4) the naive elastic net and suggest multiplying the minimizing estimator by 1 + λ2. The reason is that the first term of the earlier transformation,

$$\left\| \begin{pmatrix} y \\ 0 \end{pmatrix} - \frac{1}{\sqrt{1+\lambda_2}}\begin{pmatrix} X \\ \sqrt{\lambda_2}\, I \end{pmatrix}\sqrt{1+\lambda_2}\,\beta \right\|_2^2,$$

equals

$$||y - X\beta||_2^2 + \lambda_2||\beta||_2^2.$$
Figure 9: elastic net penalty

Hence transforming the data is the same as having already applied ridge regression. This implies that the elastic net is a two-step procedure consisting of a ridge and a lasso regression and can be compared to other two-step approaches like the instrumental variables technique. As found earlier ridge applies decorrelation and shrinkage. The elastic net is already shrunk to zero due to the lasso regression, such that Zou and Hastie (2005) propose to revoke the ridge shrinkage by multiplying the corresponding estimator with 1 + λ2. In this way the good part of the ridge - the decorrelation - remains, but the ridge shrinkage vanishes.
Both the lasso and the elastic net are biased techniques that deal with the
increasingly relevant topic of variable selection. One example of an economic
reference is a paper by Bai and Ng (2008), which demonstrates applying all
three techniques for forecasting purposes. Lasso and elastic net are especially
popular in applied macroeconomics, forecasting or any other economic area,
where a large number of variables is available for statistical modeling. To close
this section Figure 9 shows the constraints of the lasso, ridge and elastic net
in one graph. It is easy to see that ridge and elastic net are strictly convex,
whereas lasso is only convex.

3.3 Dimensionality reduction
In this section I will introduce algorithms that belong to the group of dimen-
sionality reduction techniques. They are closely related to variable selection
methods. Instead of selecting a subset of the variables, dimensionality reduc-
tion techniques summarize the information contained in the variables in a smaller
set of variables. There are different ways of doing this. Most prominent in econometrics are factor models, which are introduced in the upcoming section.

3.3.1 Factor models


The best known econometric application of factor models is a paper by Stock
and Watson (2002). They gathered a dataset consisting of 215 macroeconomic
variables with several interest rates, measures of economic activity, employment
data and so on. Each set of variables exhibits strong correlation and it can be
assumed that there exists a small number of driving factors that generate the
whole dataset. One example would be that the interest rates are generated by
the combination of a key interest rate and a measure of economic uncertainty.
Conceptually factor models start exactly with this type of idea: there is a
large dataset, which is likely to be generated by a smaller number of variables,
so-called factors. Those factors are unobserved. More formally a Factor Model
with k factors looks as follows:
zi = µi + li1 F1 + . . . + lik Fk + εi ∀i = 1, . . . , p
or in matrix notation
X = µι^T + LF + ε,

where ι^T = (1, . . . , 1). p is the number of explanatory variables and L is
called the loadings matrix. The dimensions here are X : p × n, L : p × k,
F : k × n, ε : p × n and µ : p × 1. For convenience in the upcoming calculations
X now contains the variables as rows and individuals as columns. Again xi
denotes observation i and zi characterizes variable i. Immediately visible is
the difference in the response variable: all variables are modeled instead of
being used as regressors. This implies that factor models belong to the group of unsupervised learning algorithms described in the introduction, where the goal
is not to predict an outcome, but to explore the dataset. Like in a regression
the data is projected orthogonally onto a hyperplane, which is simply given by
µi + li1 F1 + . . . + lik Fk .
However, compared to regression the factors, which take the role of explanatory
variables here, have to be estimated first. Those factors form a basis of the

hyperplane and without changing the hyperplane can be transformed in such a
way that they form an orthonormal basis. The resulting uncorrelatedness assures
sparsity of the model and that each factor covers different aspects of the data
generating process. If I now assume a certain structure of the error term’s
covariance matrix, the list of assumptions for the Factor Model is complete:

1. V ar(ε) = Ψ = diag(ψ11 , . . . , ψpp )

2. E(F ) = 0

3. Cov(F ) = I

4. Cov(F, ε) = 0

Say Cov(X) = Σ. Then I can easily show one of the key equations of factor
models:
Σ = LLT + Ψ.
Hence the variance of the data is the sum of the variance induced by the factors
and the variance originating from the errors. Obviously Ψ should be minimal.
In order to find L and F one estimation method is to assume Gaussian error
terms and to maximize the likelihood with respect to the two matrices. Having
estimated those factors a natural question would be to understand their relation
to the observable data. Intuitively the loadings should provide this connection,
but let me approach this question by finding the correlation between factors and
data:
$$Cov(X, F) = E[(LF + \varepsilon)F^T] = L\,E[FF^T] + E[\varepsilon F^T] = L$$

$$Corr(z_i, F_{.j}) = \frac{l_{ij}}{\sqrt{\sigma_{ii}}},$$

where σii is the i-th diagonal element of Σ. Hence in matrix form

$$Corr(X, F) = D^{-1/2}L,$$

where D = diag(σ11 , . . . , σpp ). As expected there is a connection between the


loadings matrix L and the correlation between data and factors. To eliminate D
it is common to standardize the data, such that each variable has a zero mean and a variance of one. Say $\tilde{X} = D^{-1/2}(X - \mu\iota^T)$. Then

$$Var(\tilde{X}) = D^{-1/2}\Sigma D^{-1/2} = D^{-1/2}LL^T D^{-1/2} + D^{-1/2}\Psi D^{-1/2}.$$

If I now define a new loadings matrix and a residual covariance matrix as $L_{\tilde{X}} = D^{-1/2}L$ and $\Psi_{\tilde{X}} = D^{-1/2}\Psi D^{-1/2}$ respectively, then $Corr(\tilde{X}, F) = Corr(X, F) = L_{\tilde{X}}$.
Hence the correlation now equals the loadings matrix.
Another important characteristic of factor models is that their loadings and
factors are unique up to rotation. This follows, since X = µιT +LF +ε = µιT +
LQQT F +ε for any orthogonal matrix Q. The axes are rotated, but the new L̃ =
LQ and F̃ = QT F still satisfy all the assumptions stated before. This additional
degree of freedom complicates the estimation procedure, but can be used
to the user’s advantage. Since after standardization of the data the loadings
matrix is the same as the correlation matrix, I could pick a transformation,
which provides advantageous correlation patterns. A convenient correlation
matrix could for example produce a situation, where each variable exhibits strong
correlation to only one specific factor. This is exactly the solution the varimax
rotation offers, which is the most popular rotation technique. It achieves firstly
that each factor either has a strong relationship to a variable or a correlation
close to zero and secondly that each variable has at most one strong correlation
to a factor. Following Kaiser (1958) the criterion that is maximized is

$$\sum_{l=1}^{k}\left[\frac{1}{p}\sum_{j=1}^{p}(\tilde{l}_{jl})^4 - \left(\frac{1}{p}\sum_{j=1}^{p}\tilde{l}_{jl}^2\right)^2\right].$$

At first sight this looks rather complicated, but can be explained in a few steps:
the part within the square brackets is the empirical variance of $\tilde{l}_{jl}^2$.12 $\tilde{l}_{jl}^2$ is the squared correlation between variable j and factor l. The variance is now
computed across each column, thus with a fixed factor for all variables and
summed over all factors. Hence varimax maximizes the variance of the squared correlations. It
is maximal, if the factors are constructed in such a way, that each variable either
has a strong correlation to a factor or none.
Factor analysis is a compelling method to work with large datasets. It
is mainly used for two different purposes: The first applies if researchers
build theoretical models on how the data could have been generated by a certain
number of factors and in a next step test this idea on the data. On the other
hand one can also employ factor analysis simply for dimensionality reduction.
Here the method compresses information contained in a lot of variables into
just a few variables. This is also the approach Stock and Watson (2002) opted
for: Their purpose was to use the reduced dataset for the prediction of real
economic activity and inflation, such that they focused on the forecasts and not
on the factors themselves. This use is related to principal component analysis, which
is introduced in the next section. Therefore more details on shared issues, such as how many factors to compute, will be explained there.

12 Remember Var(X²) = E(X⁴) − (E(X²))².
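As a brief practical illustration (my own sketch, not part of the thesis), scikit-learn's FactorAnalysis estimates L and Ψ by Gaussian maximum likelihood; recent scikit-learn versions also offer a varimax rotation option. The simulated data and the choice of k = 2 factors are arbitrary for the example.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# simulate p = 10 variables driven by k = 2 common factors plus idiosyncratic noise
rng = np.random.default_rng(0)
n, p, k = 500, 10, 2
F = rng.normal(size=(n, k))                  # unobserved factors
L = rng.normal(size=(p, k))                  # loadings
X = F @ L.T + 0.5 * rng.normal(size=(n, p))  # note: rows are individuals here, not variables

fa = FactorAnalysis(n_components=k).fit(X)
loadings = fa.components_.T                  # estimated p x k loadings matrix
psi = fa.noise_variance_                     # estimated diagonal of Psi
factors = fa.transform(X)                    # estimated factor scores
print(loadings.shape, psi.shape, factors.shape)
```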

3.3.2 Principal component analysis


Principal component analysis (PCA) is closely related to factor models and has undeniably inspired their development. In the course of this section it will become clear that several differences remain, among them the estimation technique and the different mindset. Factor analysis is rather used when the researcher wants to understand the data, whereas PCA is often used simply to compress data. PCA is a rather old technique that is used in econometrics for dimensionality reduction. It was introduced in Pearson (1901) and, just like factor analysis, it squeezes information into
a smaller set of variables.
The method tries to find a subspace such that the data, when mapped into it, has the largest variance. First I subtract the mean of each of the p
variables. The variance is not standardized. Then I consider the data mapped
into the hyperplane as wT X, which from now on I will refer to as score. Here
X : p × n and w : p × 1. Due to the zero mean the variance is then simply the squared L2-norm of the scores, ||w^T X||_2^2. Together with w,
which I constrain to be of length one, I formally face the following constrained
optimization problem (Hamprecht, 2011):

$$w = \arg\max_{||w||_2 = 1} ||w^T X||_2^2 = \arg\max_{||w||_2 = 1} w^T XX^T w,$$

where the equality follows from the definition of a norm. If I use the Lagrangian
to find the solution

L(w, λ) = wT XX T w − λwT w,

I obtain as optimality condition

XX T w = λw.

Hence w has to be an eigenvector of XX T . XX T is the empirical covariance


matrix of the data, since X is demeaned and now contains rows as variables
instead of observations. As the idea is to extract variance of the data, I obviously
have to consider the covariance matrix. But let me go back and note that

$$w = \arg\max_{||w||_2 = 1} w^T XX^T w = \arg\max_{w} \frac{w^T XX^T w}{w^T w}.$$

Since XX T is square and symmetric, thus Hermitian, I know that it must
possess real eigenvalues, whose corresponding eigenvectors are able to form an
orthonormal basis. Thus I can rewrite w = U c, where U is a matrix whose i-th
column is the i-th eigenvector. c ∈ Rp exists for every w ∈ Rp, since U must
have full rank. Then

$$\frac{w^T XX^T w}{w^T w} = \frac{c^T U^T XX^T U c}{c^T U^T U c} = \frac{c^T U^T XX^T U c}{c^T c},$$
since U T U = I due to orthogonality. Note that again as XX T is Hermitian,
XX T = U ΛU T with Λ containing the eigenvalues of XX T on its diagonal.
Then
$$\frac{w^T XX^T w}{w^T w} = \frac{c^T \Lambda c}{c^T c} = \frac{\lambda_1 c_1^2 + \ldots + \lambda_p c_p^2}{c_1^2 + \ldots + c_p^2} \in [\lambda_p, \lambda_1],$$
where λ1 is the largest and λp is the smallest eigenvalue. This insight already
provides me with the range of the objective function. Now it only remains to
find the w which achieves the maximal value λ1. If I pick some eigenvector for w, the objective function equals the corresponding eigenvalue. Hence if I aim to maximize
this term, w should equal the eigenvector with the highest eigenvalue. With
the help of this eigenvector I am now able to compute the first coordinate in
the lower-dimensional space for each observation. This coordinate is called first
principal component (PC).
As a next step I would like to maximize the remaining variance under the
constraint that the following w is orthogonal to the first component. The
residuals, after the first PC has been applied, are computed as X̂1 = X−wwT X.
Note that wT X is the dataset in the lower dimensional subspace. Hence for
the computation of the residuals it is necessary to map the data back into the
space of the data, which is done by applying w on the left hand side. wwT X
reduces the rank of X by one, such that if I now maximize the variance, hence the term $\frac{w^T \hat{X}_1 \hat{X}_1^T w}{w^T w}$,13 the eigenvector with the second highest eigenvalue will be the maximizing w. This eigenvector is obviously also orthogonal to the first
eigenvector, such that all conditions are fulfilled. The orthogonality property
can be shown by considering the sample covariance between two PCs over the
dataset:

$$\begin{aligned}
Cov(PC1, PC2) &= (w_1^T X)(w_2^T X)^T \\
&= (w_2^T X)(w_1^T X)^T \\
&= w_2^T XX^T w_1 \\
&= w_2^T \lambda_1 w_1
\end{aligned}$$

The second equality follows, since the term is a scalar and thus equal to its transpose. At the last step I make use of the fact that w1 is an eigenvector of XX^T with eigenvalue λ1. This term is zero if w2 is orthogonal to w1, which holds if both w1 and w2 are eigenvectors. The procedure described above for the first two PCs can be continued, until all PCs have been found.

13 The details on how the rank is reduced are rather technical. To understand the algorithm it is enough to imagine that ww^T X, a so-called rank one approximation, removes the highest eigenvalue from X.
The algorithm described above is a more conceptual approach that is cer-
tainly stimulating in the process of understanding PCA, but its implementation
is actually simpler. Most software packages compute all eigenvectors and eigen-
values, rank them according to their eigenvalues and then let the user choose,
how many of the eigenvectors should be kept for the analysis. This raises another
question: how many principal components do I actually need?
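This is essentially what a direct numpy implementation looks like (my own sketch, with observations stored as rows rather than columns): demean the data, eigendecompose the covariance matrix, sort by eigenvalue and keep the leading eigenvectors. The explained-variance share it returns corresponds to the screeplot discussed next.

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition; X has observations as rows (n x p)."""
    Xc = X - X.mean(axis=0)                       # demean each variable
    cov = Xc.T @ Xc / Xc.shape[0]                 # empirical covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)          # eigh returns ascending eigenvalues for symmetric matrices
    order = np.argsort(eigval)[::-1]              # rank by eigenvalue, largest first
    eigval, eigvec = eigval[order], eigvec[:, order]
    scores = Xc @ eigvec[:, :k]                   # first k principal components (scores)
    explained = eigval[:k].sum() / eigval.sum()   # screeplot quantity: share of variance explained
    return scores, eigvec[:, :k], explained

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
scores, W, share = pca(X, k=2)
print(scores.shape, round(share, 3))
```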
The most common way to pick the dimension of the projected data and thus
the number of PCs is a measure of how much variance is already explained by
the first, say k, PCs. This is shown in the so-called screeplot, where the sum
of the first k eigenvalues is divided by the sum of all eigenvalues. I provided
a screeplot in Figure 10a of data on housing in Boston. By deleting two non-
numerical variables I reduced the corresponding initial dataset from 19 to 17
variables (see Appendix A). It can be easily seen that one PC already captures
almost all of the dataset’s variation, something that can be observed quite often
when applying PCA.
In Figure 10b I plotted the first two PCs of the data against each other. It is obvious that clusters can be visualized. Clusters are groups of observations that have little variation inside the group and hence graphically form a cloud of points. This makes them unlikely to be torn apart by the principal components, which attempt to find the directions where most of the variance is located. Since it is impossible to visualize a dataset with more than three dimensions directly, Figure 10b illustrates that another application of PCA is to visualize clusters.
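To make these steps concrete, the following minimal R sketch computes the principal components, the share of variance captured by the first k of them (the screeplot) and the score plot. The matrix housing is only a placeholder for the 17 numeric Boston housing variables described in Appendix A.

    housing <- matrix(rnorm(100 * 5), 100, 5)     # placeholder for the numeric data
    pca <- prcomp(housing)                        # the eigenvectors w are the columns of pca$rotation
    share_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
    plot(share_explained, type = "b", xlab = "number of PCs",
         ylab = "share of variance explained")    # screeplot as in Figure 10a
    plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")  # scores as in Figure 10b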
For econometricians it might be relevant to show the relation between PCA
and least squares (LS) regression. I will begin with a theorem.

Theorem 1. The minimizing vector of the sum of squared residuals is the same
as the maximizing vector of the variance.

Figure 10: (a) Screeplot (b) Scores of the data

Proof:
Again wwT X provides the fitted observations. Hence I would like to find the
unit vector w that minimizes the squared difference between X and wwT X:

$$w = \arg\min_{||w||_2=1} ||X - ww^T X||_F^2,$$

where $||A||_F^2 = \sum_i \sum_j [A]_{ij}^2 = \mathrm{tr}(AA^T)$. This norm is also called the Frobenius norm. Then

$$\begin{aligned}
w &= \arg\min_{||w||_2=1} \mathrm{tr}(XX^T - XX^Tww^T - ww^TXX^T + ww^TXX^Tww^T) \\
&= \arg\min_{||w||_2=1} -\mathrm{tr}(w^TXX^Tw) - \mathrm{tr}(w^TXX^Tw) + \mathrm{tr}(w^Tww^TXX^Tw) \\
&= \arg\min_{||w||_2=1} -\mathrm{tr}(w^TXX^Tw) \\
&= \arg\min_{||w||_2=1} -w^TXX^Tw \\
&= \arg\max_{||w||_2=1} w^TXX^Tw.
\end{aligned}$$

In the first step I omit XX T , since it is irrelevant for the choice of w. Then
several times I make use of the cyclic permutations and linearity property of
the trace operator. Finally I realize that wT XX T w is a scalar and can thus

neglect the trace operator. Since the two objective functions are the same, their
solution must be the same.

Apparently both PCA and LS minimize the sum of squared residuals, but still
they are not the same. The difference lies in where the distances are minimized: LS minimizes distances in the space of the response variable, hence in a one-dimensional space, whereas PCA minimizes them in the space of the data, thus in p dimensions. Visually this is
explained in Figure 11. Since the whole dataset takes the role of the response
variable in PCA, this technique belongs to the group of unsupervised algorithms,
just like factor models. PCA is e.g. not able to build a model that forecasts
recessions.
Even if PCA is mostly applied for dimensionality reduction, it is worth consid-
ering the connection between data and principal components. With $\widetilde{PC} = \tilde{U}^T X$ and X assumed to have mean zero,
$$\begin{aligned}
\mathrm{Cov}(X, \widetilde{PC}) &= E(X\widetilde{PC}^T) - E(X)E(\widetilde{PC})^T \\
&= E(XX^T\tilde{U}) \\
&= \tilde{U}\tilde{\Lambda}\tilde{U}^T\tilde{U} \\
&= \tilde{U}\tilde{\Lambda}.
\end{aligned}$$
$\widetilde{PC}$, $\tilde{U}$ and $\tilde{\Lambda}$ are derived from the theoretical and not the empirical covariance matrix of X, say $\Sigma$, such that they are denoted with a tilde. Then $\mathrm{Corr}(X_i, \widetilde{PC}_j) = \frac{\tilde{u}_{ij}\tilde{\lambda}_j}{\sqrt{\sigma_{ii}\tilde{\lambda}_j}} = \tilde{u}_{ij}\sqrt{\tilde{\lambda}_j/\sigma_{ii}}$, where $\sigma_{ii}$ is the i-th diagonal element of $\Sigma$ and $\tilde{u}_{ij}$ is the
element of Ũ at row i and column j. For the Boston Housing data the corre-
lations are illustrated in Figure 12. The further away the points are from the
origin, the bigger their influence on the principal component.
Due to its popularity PCA has several different extensions for certain types of data. In case the variables have different scales, say distances and temperatures, their variances are inherently different. Then PCA prefers those variables with a high variance. Apart from the interpretations, which would become tedious, it makes sense to standardize the data, such that all variables are treated equally. This is why the normalized PCA standardizes each variable to have a variance of one. Further PCA extensions are among others the robust PCA (Candès et al., 2011), which is less sensitive to outliers, or the sparse PCA (Zou et al., 2006) that is explained in more detail in the upcoming section.
PCA has been used in various fields and has thus attracted a lot of research.
Therefore this overview is by far not complete, such that for further details I
recommend Jolliffe (2002).

Figure 11: Differences of OLS and PCA fitting
Figure 12: Interpretation of PCs

3.3.3 Sparse Loadings


As the name already suggests sparse principal components address the issue that
principal components tend to compress the data with the help of all variables,
which hinders interpretation especially of large models. The idea is to add a lasso
penalty to the model. However, the principal component model is not written
yet in the form of a regression problem. I will now provide the underlying idea
of how to solve this problem. For a rigorous treatment I refer to Zou et al.
(2006). Note that the singular value decomposition (SVD) is employed to find
eigenvectors for the PCA. SVD says that X = U DV T , where U and V are
orthogonal and D is diagonal. Then X T X = V DU T U DV T = V D2 V T and
similarly XX T = U D2 U T . Hence this decomposition contains the earlier given
eigenvalues and eigenvectors. U D then contains the principal components.14
Now consider15

$$\hat\beta = \arg\min_\beta ||U_{.i}D_{ii} - X\beta||_2^2 + \lambda||\beta||_2^2.$$

Then

$$\hat\beta = (X^TX + \lambda I)^{-1}X^TU_{.i}D_{ii} = (X^TX + \lambda I)^{-1}X^TXV_{.i},$$
^14 The naming can get confusing at this point. PCA allows one to map from the high-dimensional space of the data to lower dimensions. These new dimensions are called principal components. However, if I map an observation into this space its new coordinates are called the principal components of this observation. Here I refer to the dimensions.
^15 The upcoming result can also be found in Theorem 1 of Zou et al. (2006).

since XV = U D and thus XV.i = U.i Dii .

$$\begin{aligned}
\hat\beta &= (X^TX + \lambda I)^{-1}X^TXV_{.i} \\
&= (VD^2V^T + \lambda VIV^T)^{-1}VD^2V^TV_{.i} \\
&= (V(D^2+\lambda I)V^T)^{-1}VD^2V^TV_{.i} \\
&= V(D^2+\lambda I)^{-1}V^TVD^2V^TV_{.i} \\
&= V\,\mathrm{diag}\left(\frac{D_{11}^2}{D_{11}^2+\lambda}, \ldots, \frac{D_{nn}^2}{D_{nn}^2+\lambda}\right)e_i \\
&= V_{.i}\,\frac{D_{ii}^2}{D_{ii}^2+\lambda},
\end{aligned}$$

where ei is the i-th column of the identity matrix. Thus β̂ ∝ V.i , such that
finding V is equivalent to solving a ridge equation. Then Zou et al. (2006)
continue and explicitly show the general connection. This now allows me to
add a lasso penalty to the PCA problem, which in turn causes sparsity in the
loadings contained in V . Since the final model contains the ridge as well as the lasso penalty, I can employ the lars algorithm adjusted for the elastic net for estimation. One recent application of the sparse PCA to macroeconomic data
can be found in Kristensen et al. (2013).
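The ridge representation of the loadings can be verified numerically. The following base-R sketch (with simulated data, not the macroeconomic application) regresses the i-th principal component on X with a ridge penalty and confirms that the normalized coefficient vector equals the i-th loading up to sign:

    set.seed(1)
    X <- scale(matrix(rnorm(100 * 6), 100, 6), center = TRUE, scale = FALSE)
    s <- svd(X)                                  # X = U D V', V contains the loadings
    i <- 1; lambda <- 10
    y <- s$u[, i] * s$d[i]                       # the i-th principal component U_{.i} D_{ii}
    beta <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
    cbind(ridge = beta / sqrt(sum(beta^2)), loading = s$v[, i])  # equal up to sign

Note that here X stores the observations in rows, the usual R convention, whereas the derivation above stacks them as columns; the result is the same.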

3.4 Support vector machine


The support vector machine (SVM) has become one of the most popular ma-
chine learning techniques. One reason for its success is that it regularly out-
performs other machine learning techniques, such as e.g. neural networks and
nearest neighbor (e.g. Byvatov et al. (2003) or Yoon et al. (2011)). The SVM
is programmed in a variety of statistical software packages and delivers promis-
ing results even with default settings. Hence not much experience is required
to employ the technique. On the other hand the derivation and the so-called
kernels, the SVM makes use of, are rather technical and far from trivial. This is
why the technique can certainly be labelled as a black box. Opening this black
box is the subject of this section.

3.4.1 Derivation
For the upcoming derivation I will focus on binary data. Say i ∈ 1, . . . , n,
xi ∈ Rp , x1 , . . . , xn ∈ X and yi ∈ {−1, +1}. This type of response variable
is uncommon in econometrics, however it will soon be obvious why {−1, 1} is
preferred over {0, 1}. Furthermore X = [x1 , . . . , xn ] = [z1T , . . . , zpT ]. Hence

X : p × n and contains observations x as columns and variables z as rows. Let
me begin with the assumption that my data is linearly separable. Then there
exists a hyperplane, such that

hxi , wiRp + b > 0 for all i with yi = 1

and

hxi , wiRp + b < 0 for all i with yi = −1,


where b ∈ R and w ∈ Rp is a normal vector perpendicular to the hyperplane.
An intuitive classifier16 would be g(xi ) = sign(hxi , wi + b). But how should
I select w and b? What the support vector machine does is to look for the
"widest street" between the two classes, as shown in Figure 13. Hence I look
for a hyperplane, such that when I shift out the hyperplane in either direction
the distance until I touch the first point is maximal. These parallel hyperplanes
that just touch the first point are called margins and are the two solid lines in
Figure 13. Say I scale w and b in such a way that the minimum distance of all
points to the hyperplane is ||w||2 . Then

hxi , wiRp + b ≥ 1 if yi = 1 (5)

and
hxi , wiRp + b ≤ −1 if yi = −1, (6)
which multiplied with the outcome variable yi turns all points on the margin
into
yi (hxi , wiRp + b) = 1.
Those points on the margins support the classification mechanism and are
thus called support vectors, which in turn has given the support vector machine
its name. The "width of the street" now is simply the distance of the two parallel
hyperplanes given by the boundary of the two half-spaces (5) and (6): $\frac{2}{||w||_2}$. By construction the decision boundary is situated right in the middle of the two margins. Hence my optimization problem is $\max_{w,b}\frac{2}{||w||_2}$ s.t. $y_i(\langle x_i,w\rangle_{\mathbb{R}^p}+b)\geq 1\ \forall i$. To avoid scalars in the first order conditions (FOC) I adjust the objective
function and optimize:
$$\min_{w,b}\ \frac{1}{2}||w||_2^2$$
$$\text{s.t. } y_i(\langle x_i, w\rangle_{\mathbb{R}^p} + b) \geq 1 \quad \forall i.$$

^16 In machine learning language a model that provides a binary outcome is called a classifier. One example would be the logit model. The counterpart is then called regression, where the response variable is continuous.
This program is called the Hard Margin SVM. After forming the Lagrangian
$L(w, b)$, the FOCs with respect to w and b are $w = \sum_i \alpha_iy_ix_i$ and $\sum_i \alpha_iy_i = 0$. Some manipulation of the Lagrangian yields^17
$$L(w,b) = \sum_{i=1}^n\alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_jy_iy_jx_i^Tx_j$$
$$\text{s.t. } \alpha_i \geq 0\ \forall i, \qquad \sum_{i=1}^n\alpha_iy_i = 0.$$

Note that according to the Karush-Kuhn-Tucker (KKT) conditions the support


vectors are characterized by αi > 0.

Figure 13: Linearly separable data

The solution looks very promising, however the setup is rather artificial, since
real data will most likely contain outliers and non-linearly separable classes. I
^17 $\alpha_i$ is the i-th Lagrangian multiplier.

neglected both problems until now. The problem of outliers is easier to solve,
such that I will proceed with this.
Logically the program that allows for outliers is called Soft Margin SVM.
This is implemented with slack variables that are introduced in the constraints
and are also represented in the objective function as a penalty:
$$\min_{w,b}\ \frac{1}{2}||w||_2^2 + C\sum_{i=1}^n \xi_i$$
$$\text{s.t. } y_i(\langle x_i,w\rangle_{\mathbb{R}^p} + b) \geq 1 - \xi_i,\ \forall i, \qquad \xi_i \geq 0,\ \forall i.$$
This concept is related to penalized regression (see Section 3.2). Since $\xi_i \geq 0$, $\sum_i \xi_i = ||\xi||_1$ and thus the problem is similar to lasso regression. C takes the role of λ and can also be selected with the help of cross-validation. The dual of the Lagrangian is then
$$L(\alpha) = \sum_{i=1}^n\alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_jy_iy_jx_i^Tx_j$$
$$\text{s.t. } 0 \leq \alpha_i \leq C\ \forall i, \qquad \sum_{i=1}^n\alpha_iy_i = 0.$$

Suppose I am in a situation as illustrated in Figure 14a, where the space of the observations does not provide me with linearly separable data. The idea is now to map the data into a higher-dimensional space, say E, where the data is linearly separable. I will call this space variable space.^18 To provide an example, I could have as observational space $(z_1, z_2)$ and make use of E in the form $(z_1, z_2, z_1z_2, z_1^2, z_2^2)$. Hence the variable space includes interaction terms and squared variables. Note that a hyperplane in E corresponds to a decision boundary that is non-linear in the observational space, as soon as the input space is not linearly separable.
Hence I apply some mapping Φ(x) : Rp → E to each input vector xi and
replace each xi with Φ(xi ) in my optimization problem. Then my Lagrangian
looks as follows:

^18 Machine learning usually refers to the variables that are used for the model as features and hence refers to this space as feature space instead. Similarly the data is named input, whereas I will refer to it as observational space.
Figure 14: Mapping from 2D into 3D
(a) 2D (b) 3D
$$L(w,b) = \sum_{i=1}^n\alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_jy_iy_j\Phi(x_i)^T\Phi(x_j)$$
$$\text{s.t. } 0 \leq \alpha_i \leq C\ \forall i, \qquad \sum_{i=1}^n\alpha_iy_i = 0.$$

3.4.2 Kernels
Working in high-dimensional spaces can be computationally very costly. How costly this can be is illustrated by the following example: Consider a dataset with 215 variables as found in Stock and Watson (2002). If I now restrict myself to, say, monomials^19 of 3rd degree, the resulting variable space has
$$\binom{215}{1} + 2\binom{215}{2} + \binom{215}{3} \approx 1.7\cdot 10^6$$

variables (Ng, 2008). Any kind of data manipulation will take a long time,
such that I need to look for shortcuts. Looking back at the Lagrangian the
computation in the variable space is essentially a sum of inner products. Luckily,
there are so-called kernel functions that are able to compute inner products in
certain high-dimensional spaces without ever entering this space. Let me get
into some more technical detail on how these kernel functions should look.

3.4.2.1 Theoretical work


The formal analysis of the requirements on kernel functions, summarized in Mercer's Theorem, is beyond the scope of this thesis, as the theorem generalizes the requirements for kernels to infinite dimensional spaces (Mercer, 1909). However, it might be worth understanding its intuition. Kernels are among others able to compute inner products in high-dimensional spaces: $K(s,t) = \langle\Phi(s),\Phi(t)\rangle$. Roughly speaking the theorem tells me that if the kernel function is symmetric, continuous and non-negative definite, then $K(s,t) = \sum_{j=1}^\infty \lambda_j v_j(s)v_j(t)$, where $v_j(x)$ is an eigenfunction and $\lambda_j$ its corresponding eigenvalue. In a finite dimensional space eigenfunctions will turn out to be eigenvectors.^20 Hence in a finite dimensional space Mercer tells me that $K(s,t) = \sum_{j=1}^n \lambda_j v_{sj}v_{tj}$.

^19 Monomials are polynomials that only contain one term.
^20 Intuitively functions can be seen as infinite dimensional vectors. Hence eigenvectors of an infinite dimensional matrix are actually functions.

This might look familiar, since placing the eigenvalue in between the eigenvec-
tors, provides me with the scalar form of an eigenvalue decomposition. In fact,
this sum is an entry in a matrix V ΛV T , where V contains the eigenvectors
as columns and Λ the eigenvalues on its diagonal. The decomposed matrix in
this case is called Gram matrix and contains all the information of the kernel function: $K_{ij} = k(x_i, x_j)$. If I then define $\Phi(x_i) = [\sqrt{\lambda_1}v_{i1}, \ldots, \sqrt{\lambda_n}v_{in}]^T$, I get that $\Phi(x_i)^T\Phi(x_j) = K_{ij}$. If $\Phi(x) = x$, $\Phi(x_i)^T\Phi(x_j) = [X^TX]_{ij}$, which shows the connection to the singular value decomposition.^21
It is well known that the aforementioned eigendecomposition constrains the
decomposed matrix to be positive semi-definite and symmetric, which is exactly
the same, what Mercer found in his setting of possibly infinite dimensional
spaces. This intuitive argumentation leads me to the important result that
will be proven next. According to Theorem 4.16 in Steinwart and Christmann
(2008):
Theorem 2. A function k : X × X → R is a kernel if and only if it is symmetric
and positive definite.
For the proof I first need a definition of inner products, (pre-)Hilbert spaces and
kernels.
Definition 1. Given a vector space X an inner product h., .iX has to satisfy the
following conditions:
1. Bilinearity: hλx + µy, ziX = λhx, ziX + µhy, ziX ∀λ, µ ∈ R, x, y, z ∈ X
2. Symmetry hx, yiX = hy, xiX , ∀x, y ∈ X
3. Strict positive definiteness: hx, xiX ≥ 0 ∀x ∈ X, hx, xiX = 0 ⇔ x = 0
Definition 2. A complete vector space H combined with an inner product is
called a Hilbert space. Its counterpart that is not complete is called a pre-Hilbert
space.
One trivial example of a Hilbert space would be the vector space Rn with
hu, viRn = uT v and u, v ∈ Rn .

Definition 3. Let X be a non-empty set. Then a function k : X × X → R


is called a kernel on X , if there exists a Hilbert space called H and a map
Φ : X → H such that ∀x1 , x2 ∈ X I have
k(x1 , x2 ) = hΦ(x1 ), Φ(x2 )iH .
^21 More details on the singular value decomposition can be found in Section 3.3.3 on the Sparse PCA.

I can now get to the actual task of proving Theorem 2:
⇒: If I know that k is a kernel function, k(x1 , x2 ) = hΦ(x1 ), Φ(x2 )iH ∀x1 , x2 ∈
X . Since the inner product is symmetric, the kernel must be symmetric. Simi-
larly I inherit the positive definite property from the inner product:
$$\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j k(x_i,x_j) = \sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j\langle\Phi(x_i),\Phi(x_j)\rangle_H = \left\langle\sum_{i=1}^n\alpha_i\Phi(x_i),\ \sum_{j=1}^n\alpha_j\Phi(x_j)\right\rangle_H \geq 0.
$$

⇐: Take any symmetric and positive definite function k. According to the


definition of a kernel, I need to find a Hilbert space first. I can begin with
considering the span of the kernel functions:
$$H \equiv \left\{\sum_{i=1}^n \alpha_i k(\cdot, x_i) : n\in\mathbb{N},\ \alpha_1,\ldots,\alpha_n\in\mathbb{R},\ x_1,\ldots,x_n\in\mathcal{X}\right\}.$$

To transform H into a pre-Hilbert space I need to add an inner product, which I define as follows: Say $f = \sum_{i=1}^m \alpha_i k(\cdot, x_i^{(1)})$ and $g = \sum_{j=1}^n \beta_j k(\cdot, x_j^{(2)})$ for some $\alpha_i, \beta_j \in \mathbb{R}$ and $x_i^{(1)}, x_j^{(2)} \in \mathcal{X}$:^22

$$\langle f, g\rangle_H = \sum_{i=1}^m\sum_{j=1}^n \alpha_i\beta_j k(x_i^{(1)}, x_j^{(2)})$$

^22 I am considering a more general $\mathcal{X}$, which can also include vectors of different length, hence length m and n. The superscript (k) for some $k\in\mathbb{N}$ denotes this.

This makes intuitive sense, if I remember that eventually $k(\cdot, x) = \Phi(x)$ for all $x\in\mathcal{X}$ and $\langle\Phi(x_1),\Phi(x_2)\rangle_H = \Phi(x_1)^T\Phi(x_2)$. Positive definiteness needs $\langle f,f\rangle_H \geq 0$ and $\langle f,f\rangle_H = 0 \Leftrightarrow f = 0$. Computed, $\langle f,f\rangle_H = \sum_{i=1}^m\sum_{j=1}^m \alpha_i\alpha_j k(x_i^{(1)}, x_j^{(1)}) \geq 0$, as k is positive definite. For bilinearity imagine the function $k(\cdot,x)$ to be an infinite dimensional vector: Say $f_1 = \sum_{i=1}^{m_1}\alpha_i k(\cdot, x_i^{(1)})$, $f_2 = \sum_{j=1}^{m_2}\beta_j k(\cdot, x_j^{(2)})$, $g = \sum_{k=1}^{m_3}\gamma_k k(\cdot, x_k^{(3)})$. Then

$$\begin{aligned}
\langle\lambda f_1 + \mu f_2, g\rangle_H &= \left(\lambda\sum_{i=1}^{m_1}\alpha_i k(\cdot,x_i^{(1)})^T + \mu\sum_{j=1}^{m_2}\beta_j k(\cdot,x_j^{(2)})^T\right)\sum_{k=1}^{m_3}\gamma_k k(\cdot,x_k^{(3)}) \\
&= \lambda\sum_{i=1}^{m_1}\sum_{k=1}^{m_3}\alpha_i\gamma_k k(\cdot,x_i^{(1)})^Tk(\cdot,x_k^{(3)}) + \mu\sum_{j=1}^{m_2}\sum_{k=1}^{m_3}\beta_j\gamma_k k(\cdot,x_j^{(2)})^Tk(\cdot,x_k^{(3)}) \\
&= \lambda\langle f_1, g\rangle_H + \mu\langle f_2, g\rangle_H,
\end{aligned}$$

which proves bilinearity of my inner product. For the next step I need the
Cauchy-Schwarz inequality to hold: The inequality says
$$\langle x,y\rangle_H^2 \leq \langle x,x\rangle_H\langle y,y\rangle_H \quad \forall x,y\in H.$$
For the proof first assume $\langle y,y\rangle_H \neq 0$. Then
$$\begin{aligned}
0 &\leq \left\langle x - \frac{\langle x,y\rangle_H}{\langle y,y\rangle_H}y,\ x - \frac{\langle x,y\rangle_H}{\langle y,y\rangle_H}y\right\rangle_H \\
&= \langle x,x\rangle_H - \frac{\langle x,y\rangle_H}{\langle y,y\rangle_H}\langle x,y\rangle_H - \frac{\langle x,y\rangle_H}{\langle y,y\rangle_H}\langle x,y\rangle_H + \frac{\langle x,y\rangle_H^2}{\langle y,y\rangle_H^2}\langle y,y\rangle_H,
\end{aligned}$$
which implies
$$\langle x,x\rangle_H\langle y,y\rangle_H \geq \langle x,y\rangle_H^2.$$
For $\langle y,y\rangle_H = 0$ and $n\in\mathbb{N}$,
$$\begin{aligned}
&\left\langle x - n\langle x,y\rangle_Hy,\ x - n\langle x,y\rangle_Hy\right\rangle_H \geq 0 \\
\Leftrightarrow\ &\langle x,x\rangle_H - 2n\langle x,y\rangle_H^2 + n^2\langle x,y\rangle_H^2\langle y,y\rangle_H \geq 0 \\
\Leftrightarrow\ &\langle x,x\rangle_H \geq 2n\langle x,y\rangle_H^2,
\end{aligned}$$
which holds for all n. This is only possible if $\langle x,y\rangle_H^2 = 0$. But then again $\langle x,y\rangle_H^2 \leq 0 = \langle x,x\rangle_H\langle y,y\rangle_H$. This proves the Cauchy-Schwarz inequality. I can now use it to complete my proof: Say $f = \sum_{i=1}^m \alpha_ik(\cdot, x_i)$ with $\langle f,f\rangle_H = 0$. Then
$$|f(x)|^2 = \left|\sum_{i=1}^m \alpha_ik(x, x_i)\right|^2 = |\langle f, k(\cdot,x)\rangle_H|^2 \leq \langle k(\cdot,x),k(\cdot,x)\rangle_H\langle f,f\rangle_H = 0.$$

Note that after the second equality I consider f and not f (x) and that the
inequality follows from Cauchy-Schwarz. Hence I have established that my
definition of $\langle\cdot,\cdot\rangle_H$ is actually an inner product. It remains to transform the current space into a Hilbert space by assuring that the space is complete. This is done by adding the norm
$$||f||_{H_k} = \sqrt{\langle f,f\rangle_{H_k}}$$
to the Hilbert space. Hence I now consider
$$H_k = \left\{f : f = \sum_i \alpha_ik(\cdot, x_i)\right\}.$$

Having digested the proof it is worth looking back at one equality found
while proving the positive definiteness of the inner product:

f (x) = hf, k(., x)iH

Reading from right to left the inner product manages to reproduce the functional
value of f at x given two functions. This property is called reproducing and
provides the corresponding Hilbert space its name: Reproducing Kernel Hilbert
space (RKHS).

Definition 4. H is a Reproducing Kernel Hilbert Space if there exists a $k : \mathcal{X}\times\mathcal{X} \to \mathbb{R}$, such that

1. k has the reproducing property: f (x) = hf (.), k(., x)iH

2. k spans H: H = span{k(., x) : x ∈ X }

3.4.2.2 Examples of kernels


I restrict myself to the kernels that are by default included in the software
package e1071 in R.
The polynomial kernel in general looks as follows:

kpoly (s, t) = (sT t + c)d ,

where c and d are parameters to be chosen a priori. To consider the variable


space let me begin with a proof by induction and show that P (n) holds:

$$\left(\sum_{i=1}^n s_it_i\right)^2 = \sum_{i=1}^n s_i^2t_i^2 + \sum_{i=2}^n\sum_{j=1}^{i-1} 2s_is_jt_it_j$$
P(1) is obvious, as $(s_1t_1)^2 = s_1^2t_1^2$. Now assume P(n) holds. Then
$$\begin{aligned}
\left(\sum_{i=1}^{n+1} s_it_i\right)^2 &= \left(\sum_{i=1}^n s_it_i\right)^2 + 2\left(\sum_{i=1}^n s_it_i\right)s_{n+1}t_{n+1} + s_{n+1}^2t_{n+1}^2 \\
&= \sum_{i=1}^{n+1} s_i^2t_i^2 + \sum_{i=2}^n\sum_{j=1}^{i-1} 2s_is_jt_it_j + 2\left(\sum_{i=1}^n s_it_i\right)s_{n+1}t_{n+1} \\
&= \sum_{i=1}^{n+1} s_i^2t_i^2 + \sum_{i=2}^{n+1}\sum_{j=1}^{i-1} 2s_is_jt_it_j.
\end{aligned}$$

By the method of induction this proves P (n). I can now use this to illustrate the
variable space for d = 2, which is the most common choice for the parameter:
$$\left(\sum_{i=1}^n s_it_i + c\right)^2 = \sum_{i=1}^n s_i^2t_i^2 + \sum_{i=2}^n\sum_{j=1}^{i-1}\left(\sqrt{2}s_is_j\right)\left(\sqrt{2}t_it_j\right) + \sum_{i=1}^n\left(\sqrt{2c}\,s_i\right)\left(\sqrt{2c}\,t_i\right) + c^2.$$
It is then straightforward to show that for a vector x the corresponding mapping is
$$\Phi(x) = \left(x_1^2,\ldots,x_n^2,\ \sqrt{2}x_nx_{n-1},\ldots,\sqrt{2}x_nx_1,\ \sqrt{2}x_{n-1}x_{n-2},\ldots,\sqrt{2}x_{n-1}x_1,\ \ldots,\ \sqrt{2}x_2x_1,\ \sqrt{2c}x_n,\ldots,\sqrt{2c}x_1,\ c\right).$$

Translated into econometric language this kernel maps the data into a space including as dimensions the squares of the data ($x_i^2$), interaction terms ($\sqrt{2}x_ix_j$) and the scaled data itself ($\sqrt{2c}x_i$). Hence for this type of kernel function the support vector machine takes the role of an automatic mechanism that determines a best fit to the model by considering interaction terms and polynomials of the data.
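A quick numerical check (my own toy example, not taken from the thesis data) confirms that the polynomial kernel with d = 2 indeed equals the inner product in the explicit variable space derived above:

    poly_kernel <- function(s, t, c = 1) (sum(s * t) + c)^2
    phi <- function(x, c = 1)                 # explicit feature map for n = 2
      c(x[1]^2, x[2]^2, sqrt(2) * x[2] * x[1], sqrt(2 * c) * x[2], sqrt(2 * c) * x[1], c)
    s <- c(1, 2); t <- c(3, -1)
    poly_kernel(s, t)                         # 4
    sum(phi(s) * phi(t))                      # 4 as well: the kernel computes this inner product without forming phi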
The next and by far simplest kernel is the usual inner product itself, called
the linear kernel:
klinear (s, t) = sT t.
It turns out that this is just a special case of the polynomial kernel with c = 0
and d = 1.

The most popular kernel in general is the radial basis function kernel or
gaussian kernel. It looks as follows

$$k(s,t) = \exp\left(-\frac{||s-t||_2^2}{2\sigma^2}\right)$$
and maps the data into an infinite-dimensional space, which again justifies requiring kernels to fulfill Mercer's Theorem. Let me show that the space is actually infinite-dimensional. I start with transforming the kernel function and then expand the last term with a Taylor series. For n = 1 and $\sigma^2 = \frac{1}{2}$:
$$\exp\left(-\frac{||s-t||_2^2}{2\sigma^2}\right) = \exp(-||s||_2^2)\exp(-||t||_2^2)\sum_{i=0}^\infty \frac{(2s^Tt)^i}{i!} = \sum_{i=0}^\infty\left(\sqrt{\frac{2^i}{i!}}\frac{s^i}{\exp(s^2)}\right)\left(\sqrt{\frac{2^i}{i!}}\frac{t^i}{\exp(t^2)}\right).$$
This infinite sum can only be an inner product in an infinite-dimensional space. Accordingly
$$\Phi(s) = \left(\frac{1}{\exp(s^2)},\ \frac{\sqrt{2}\,s}{\exp(s^2)},\ \sqrt{\frac{2^2}{2!}}\frac{s^2}{\exp(s^2)},\ \ldots\right).$$

To conclude this section I introduce the sigmoid kernel:

$$k_{sigmoid}(s,t) = \tanh(\alpha s^Tt + r).$$

This kernel is often used in the neural networks literature, even though it is not
guaranteed to be positive semi-definite. For the arising theoretical problems I
refer to Lin and Lin (2003).
A section on kernels is not too uncommon in a paper on econometric tech-
niques. Especially the gaussian kernel rings a bell and is related to the gaussian
kernel for density estimation. This is not a coincidence. Kernel density estima-
tion uses the fact that inner products provide a similarity measure. The fact that
inner products have this characteristic can be illustrated by their connection to
correlations. For two vectors the correlation is
$$\mathrm{Corr}(x,y) = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i(x_i-\bar{x})^2}\sqrt{\sum_i(y_i-\bar{y})^2}} = \frac{(x-\bar{x})^T(y-\bar{y})}{||x-\bar{x}||\,||y-\bar{y}||}$$

So kernel densities make use of the similarity measure of inner products, whereas
for kernels in the SVM technique I use the fact that the kernels are inner products
in a higher-dimensional space. For a kernel density estimation I first compute
similarity measures between each point in the dataset and a given point. This
similarity is utilized as a weight (Hansen, 2000). Those similarity measures
appear to be in certain cases inner products in higher dimensions and can thus
also be used in the SVM technique.
Steinwart et al. (2009) managed to establish consistency of the SVM for
binary and continuous response variables even in the case of dependent data.
The assumptions are rather standard and for more details I refer to the original
paper. Still, several publications can be found on time-series type of data even
before this date. The main obstacle for the SVM's adoption into econometrics will certainly be the economists' need to conduct inference and interpret the model's results. This simply requires more research into the method. A starting point would be the polynomial kernel, which appears to provide a variable space that can be interpreted.
I would like to finish this section with a short simulation on dependent data to provide a glimpse of the huge potential the SVM offers. For a simulated AR(1) process I depict in Figure 15a the fit of the SVM with a polynomial kernel next to the fit based on the true model class (an ARMA model). It is obvious at first sight that the SVM outperforms this benchmark by far, even without calibrating much of the technique itself. Even more astonishing is the out-of-sample performance (Figure 15b), which is again better than the one-step ahead forecasts of the ARMA model. The difference is remarkable and should be a spark to continue researching and employing the technique.
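For readers who want to experiment themselves, a sketch along these lines (my own simplified setup, not the exact simulation behind Figure 15) can be run with the e1071 package:

    library(e1071)
    set.seed(42)
    y <- as.numeric(arima.sim(model = list(ar = 0.7), n = 250))
    Z <- embed(y, 2)                          # column 1: y_t, column 2: y_{t-1}
    train <- 1:200; test <- 201:nrow(Z)
    fit <- svm(x = Z[train, 2, drop = FALSE], y = Z[train, 1],
               kernel = "polynomial", degree = 2, cost = 1)
    pred <- predict(fit, Z[test, 2, drop = FALSE])  # one-step ahead fits
    mean((Z[test, 1] - pred)^2)                     # out-of-sample MSE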

3.5 k-nearest neighbor


k-nearest neighbor (k-NN) is a non-parametric, entry level machine learning
technique. It is very intuitive and can thus be introduced to a non-technical
audience. Local regression, an approach more common in econometrics, generalizes it. Both k-NN and local regression start with the idea that predicting
an observation x∗ should only be done with observations that are similar to it.
Similarity is regarded here as proximity in the dataset's space. It would be best,
if I could sample sufficiently many times for x = x∗ and record the outcome. I
could use this data to compute mean and variance at x∗ and thus also find a
prediction interval. Since the dataset almost always contains continuous vari-
ables, I hardly encounter this situation in practice. Usually I obtain a dataset,
where maybe some points are closer to others, but will hardly ever be the same.
However, what can be done, is to take all observations in a neighborhood around

Figure 15: SVM versus ARMA

(a) in-sample (b) out-of-sample

x∗ and use those instead.


k-NN makes predictions based on a neighborhood that includes k obser-
vations. If I consider the Euclidian distance, k-NN blows up a ball around
x∗ , until it has swallowed k observations. This obviously implies standardizing
the mean and variance of each regressor to prevent discrimination among the
explanatory variables. In the case of a binary response variable the predicted
value corresponds to the majority’s outcome in the neighborhood, whereas for
a continuous response the y-values of the k neighbors are averaged to obtain a
prediction. Because now similar points are treated as if they are actually equal
to each other, this causes some bias in the estimate. Nevertheless this might be
preferable to treating all observations, independent of their similarity, as equally
important.
k-NN can be tuned with the selection of k, which refers back to Figure 2:
the bias-variance trade off. A small k implies a small bias, but on the other
hand also provides a large variance. Figure 16a illustrates the result of 1-nearest
neighbor for classification. The model is very sensitive to outliers and overfits
the data. For the continuous response case the function makes a jump at every
point, where the neighborhood changes. A small k forces those jumps to be
larger, as displayed in Figure 16c and 16d. On the other hand choosing k too
large can lead to an oversimplified model, which in the extreme case of k = n
is just a constant. If p ≤ 2, the model can be visualized and this can facilitate
selecting k. Otherwise cross-validation can be consulted.
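A minimal base-R implementation of k-NN regression with Euclidean distances, standardizing the regressors as discussed above, could look as follows (the function and variable names are my own):

    knn_predict <- function(X, y, x0, k = 10) {
      Xs <- scale(X)
      x0s <- (x0 - attr(Xs, "scaled:center")) / attr(Xs, "scaled:scale")
      d <- sqrt(colSums((t(Xs) - x0s)^2))     # distances of all observations to x0
      mean(y[order(d)[1:k]])                  # average outcome of the k nearest neighbors
    }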

Figure 16: k-NN

(a) 1-NN Classification (b) 10-NN Classification

(c) 1-NN Regression (d) 10-NN Regression

k-NN started with the idea that only similar points should be considered
for the prediction. Hence it is a logical extension to weight the selected points
based on their similarity. This is done with the help of kernel densities. In
general for a continuous response variable they are employed as follows:
$$\hat{y}^* = \frac{\sum_{i=1}^n k\left(\frac{x^*-x_i}{h}\right)y_i}{\sum_{i=1}^n k\left(\frac{x^*-x_i}{h}\right)},$$

where h, the so-called bandwidth, is the distance to the k-th nearest neighbor.
There are several different kernels that I can make use of, among them the

Epanechnikov kernel
$$k_E\left(\frac{x^*-x_i}{h}\right) = \frac{3}{4}\left(1-\left(\frac{x^*-x_i}{h}\right)^2\right)\mathbb{1}(|x^*-x_i| < h)$$
or the Bartlett kernel
$$k_B\left(\frac{x^*-x_i}{h}\right) = \left(1-\left|\frac{x^*-x_i}{h}\right|\right)\mathbb{1}(|x^*-x_i| < h).$$
The uniform kernel
$$k_U\left(\frac{x^*-x_i}{h}\right) = \mathbb{1}(|x^*-x_i| < h)$$

is just equal to the uniformly weighted k-NN that I considered before: the predicted value is an average of the neighbors' outcome variable. A comparison of the kernels is plotted in Figure 17. The kernels differ in how much weight they give to points close to $x^*$.

Figure 17: Kernels

So far I modeled the data within a neighborhood with a constant, which can
be extended to linear models. This is called local regression and thus solves:
$$\min_{\beta(x^*)} \sum_{i=1}^n w_i\left(y_i - x_i\beta(x^*)\right)^2,$$
where $w_i = \frac{k\left(\frac{x^*-x_i}{h}\right)}{\sum_i k\left(\frac{x^*-x_i}{h}\right)}$. Let X : n × p with the i-th row being $x_i$. If I take $W = \mathrm{diag}(w_1, \ldots, w_n)$, I can rewrite the preceding problem into
$$\min_{\beta(x^*)} ||W^{\frac{1}{2}}(y - X\beta(x^*))||_2^2.$$
This is a weighted least squares regression problem, whose estimator is $\hat\beta(x^*) = (X^TWX)^{-1}X^TWy$. Selecting h to be the distance to the furthest neighbor in the dataset in combination with a uniform kernel reduces $\hat\beta$ to the least squares estimator.
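For a single regressor, the weighted least squares step can be sketched in a few lines of base R, here with Epanechnikov weights and h chosen as the distance to the k-th nearest neighbor (names are my own):

    local_fit <- function(x, y, x0, k = 30) {
      d <- abs(x - x0)
      h <- sort(d)[k]
      w <- pmax(0, 0.75 * (1 - (d / h)^2))    # Epanechnikov kernel weights
      fit <- lm(y ~ x, weights = w)           # weighted least squares within the neighborhood
      unname(predict(fit, data.frame(x = x0)))
    }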
Figure 18: Share of a unit sphere in unit cube

Over time it has been observed that the k-nearest neighbor method performs worse in higher dimensions. Especially concerning have been situations where, in the binary case, the classes only spanned a lower dimensional subspace. In this case inflating a sphere around $x^*$ might not be the best option. Another problem in these spaces is the increasing distance and size of the neighborhoods. If I just consider a ball with radius r and a corresponding cube around it with edge length 2r, then the volume the ball takes as a share of the cube decreases in higher dimensions, as displayed in Figure 18. Hence if the points are uniformly distributed inside the cube, the chance of the closest point lying outside the sphere with radius r tends to one with larger dimensions. Hastie
and Tibshirani (1996) perceived these issues and found a more efficient method,
which instead of a sphere inflates an ellipsoid around x∗ . This ellipsoid takes
the variation within and between classes of y = 1 and y = 0 into account. In
particular, for the binary response case the distance is $(x-x^*)^T\Sigma(x-x^*)$, where
$$\Sigma = W^{-\frac{1}{2}}\left(W^{-\frac{1}{2}}BW^{-\frac{1}{2}} + \lambda I\right)W^{-\frac{1}{2}}.$$
Here B is the between-class and W the within-class covariance
matrix of X. Under λ = 0, Σ yields a standardized between covariance matrix.
It turned out that for λ = 0 this matrix under some circumstances gave rise to

an ellipsoid collapsed to a line. To prevent this Hastie and Tibshirani (1996)
introduced additional equal spread into each direction with λ. It emerged that
λ = 1 is the preferred choice.
k-nearest neighbor is next to SVM, boosting etc. another non-linear model-
ing technique, which is in some situations able to extract more information from
large datasets than e.g. least-squares regression. The question is of course, if
this information is extracted from the true model or if it is picked up from ran-
dom error terms, as occurred in Figure 16c and 16d. Provided that the former
is the case the final model does not necessarily offer much understanding of
the relationships within the dataset. k-NN - according to Boente and Fraiman
(1989) also applicable to dependent data - should probably be regarded as an
easy step into non-parametric regression, which is however outperformed by its
successor local regression.

4 Ensemble machine learning techniques
Instead of creating a single model that relates variables and outcome, ensemble
algorithms create several models and utilize those for prediction. They differ in the way the separate models are created and in how the models are grouped in the end. E.g. bagging, which is introduced in Section 4.3, computes bootstrap samples and fits a model to each of those samples. In the end the average of these models is employed for prediction. More promising, but also more
technical is boosting, which I will introduce next.

4.1 Boosting
What can a machine actually learn? Or: How should a dataset look, so that
I can learn its patterns? PAC concept classes provide a formal answer to this
question. PAC stands for Probably Approximately Correct and was introduced
in Valiant (1984). The main idea is that a good algorithm implies having a good
approximation of a data generating process (DGP) with high probability. Instead
of a data generating process computer scientists use the more mathematical
construct called concept, which is simply a boolean function, hence outputs
{0, 1} for each element in the domain. A concept class is then a collection of
concepts. The idea is based on binary data, however can easily be extended to
datasets containing a continuous response variable.
For the approximate part Valiant introduced an unknown distribution D
from which the data is drawn and which takes the role of nature supplying me with n examples: for each x in $\mathcal{X}$, $x \overset{i.i.d.}{\sim} D$. Additionally he defined the
hypothesis h, a model in econometric terms, that tries to approximate c, where
c(x) is again the data generating process. Then the error of the hypothesis is
$$\mathrm{error}(h) = \sum_{x:\,h(x)\neq c(x)} P(x),$$
where P(x) is the probability measure of the distribution D. A small error implies a good approximation. Empirically this then boils down to $\frac{1}{n}\sum_{i=1}^n I(h(x_i)\neq c(x_i))$ under a uniform distribution. Now define $C = \{C_n\}_{n\geq 1}$ as the set of
concepts and H = {Hn }n≥1 as the set of hypothesis functions. According to
Haussler (1992):
Definition 5. Probably approximately correct (simplified) The concept class
C is PAC-learnable by the hypothesis space H, if there exists an algorithm A,
such that for all n ≥ 1, all concepts c ∈ Cn , all probability distributions D and
all ε > 0, δ < 1, with samples drawn from D, the probability of A returning a
hypothesis h ∈ Hn with error(h) ≤ ε is at least 1 − δ.

In a less sophisticated language a DGP is PAC-learnable, if I can find an algo-
rithm that will generate an arbitrarily close approximation to this DGP with a
probability arbitrarily close to one. The definition of PAC reminds of the statis-
tical notion of consistency, where the error-function replaces the bias. However,
no relation has been found so far (Haussler, 1992).
A PAC algorithm is a very ambitious construct. Slightly less restrictive is
the definition of weakly learnable, which replaces ε with $\frac{1}{2} - \frac{1}{p(n,s)}$ (Kearns and Valiant, 1988). p(n, s) is a polynomial with parameters n, the number of concepts in C, and s, the input size of c. Input size is again a term originating from computer science, however less relevant for my analysis. Rather crucial here is that p(n, s) > 0 and thus $\mathrm{error}(h) < \frac{1}{2}$. This implies that my model
has to be slightly better than random guessing. In his seminal paper Schapire
(1990) managed to prove that the weakly learnable class is equivalent to the
PAC (or strongly) learnable class. This makes finding PAC class algorithms
more realistic. Building upon this result Schapire found a method to transform
an algorithm consistent with the weakly learnable definition to an algorithm
adhering to the PAC class. He called this transformation boosting, which relies
on an additive model.

4.1.1 AdaBoost
The version of boosting that is introduced now is the father of all boosting
techniques and is called AdaBoost. It iteratively fits a model, fm (x) to a
reweighted sample and keeps adding it to a composite function Fm (x). fm (x)
is also called base learner. Say x1 , . . . , xn ∈ X and y1 , . . . , yn ∈ R for n ∈ N.
Then I fit some model fm (x) and add this model to my general function. Hence
Fm (x) = Fm−1 (x) + αm fm (x), where 0 < αm < 1 weights the function. αm
is usually referred to as step length. This is an iterative process, which starts
with F0 (x) = 0 and stops at some finite m = M to avoid overfitting. M is
commonly found with the help of cross-validation.
Before digging into the technical details of boosting let me make a slight
detour via loss functions. A loss function aims to provide a measure of the
predictions suggested by a certain model. More formally:

Definition 6. Let (X , A) be a measurable space and Y ⊂ R be a closed


subset. Then a function L : X × Y × R → [0, ∞) is called a loss function, if it
is measurable.

E.g. OLS minimizes the squared error loss function: L(x, y) = (y − xβ̂)2 . For
later use it makes sense to also introduce the term risk.

Definition 7. Let L : X × Y × R → [0, ∞) be a loss function and P be a
probability measure on X × Y . Then for a function f : X → R the L− risk is
defined by:
$$R_{L,P} = E(L(x,y,f(x))) = \int_{X\times Y} L(x,y,f(x))\,dP(x,y) = \int_{X\times Y} L(x,y,f(x))\,dP(y|x)\,dP_X(x),$$

where the last equality follows with Bayes’ Theorem.

In the following sections the term empirical risk will be used, which simply puts
equal weight on each observation in the sample and thus equals an average of
the empirical loss function.
For the boosting algorithm I will employ the exponential loss function on
yFm (x) and focus on binary data with y ∈ {−1, 1} for the upcoming cal-
culations. Remember that Fm (x) is the model after m improvement steps.
yFm (x) is then equal to one, if prediction and actual value coincide. Other-
wise the term equals −1. Let me now consider the total loss, say C(f), which is essentially n times the empirical risk: $C(F_m) = \sum_{i=1}^n L(x_i, y_i, F_m(x_i)) = \sum_{i=1}^n \exp(-y_iF_m(x_i))$. As mentioned before boosting keeps fitting to a reweighted sample. Those weights I introduce now as $w_i^{(m)} = \exp(-y_iF_{m-1}(x_i))$. Then with $F_m(x) = F_{m-1}(x) + \alpha_m f_m(x)$ I get the total loss $C(f_m) = \sum_{i=1}^n w_i^{(m)}\exp(-y_i\alpha_m f_m(x_i))$, which I try to minimize. Before solv-

ing this optimization problem let me try to simplify the objective function:
$$\begin{aligned}
C(f_m) &= \sum_{i=1}^n w_i^{(m)}\exp(-y_i\alpha_m f_m(x_i)) \\
&= \sum_{i:y_i = f_m(x_i)} w_i^{(m)}\exp(-\alpha_m) + \sum_{i:y_i \neq f_m(x_i)} w_i^{(m)}\exp(\alpha_m) \\
&= \sum_{i=1}^n w_i^{(m)}\exp(-\alpha_m) + \sum_{i:y_i \neq f_m(x_i)} w_i^{(m)}\left(\exp(\alpha_m) - \exp(-\alpha_m)\right)
\end{aligned}$$

If I want to locate the minimal error in terms of $f_m$ for a given $\alpha_m$, $f_m$ should minimize $\sum_{i:y_i\neq f_m(x_i)} w_i^{(m)} = \sum_{i=1}^n w_i^{(m)}I(y_i \neq f_m(x_i))$. Hence in the language of the PAC learning class the error is minimized, while the distribution is replaced by weights. Minimizing with respect to $\alpha_m$:
$$\frac{dC(f_m)}{d\alpha_m} = \sum_{i:y_i\neq f_m(x_i)} w_i^{(m)}\exp(\alpha_m) - \sum_{i:y_i = f_m(x_i)} w_i^{(m)}\exp(-\alpha_m),$$

which yields
$$\alpha_m = \frac{1}{2}\log\left(\frac{\sum_{i:y_i=f_m(x_i)} w_i^{(m)}}{\sum_{i:y_i\neq f_m(x_i)} w_i^{(m)}}\right) = \frac{1}{2}\log\left(\frac{1-\varepsilon_m}{\varepsilon_m}\right).$$
$\varepsilon_m$ is defined as $\frac{\sum_{i:y_i\neq f_m(x_i)} w_i^{(m)}}{\sum_{i=1}^n w_i^{(m)}}$. Hence the boosting algorithm at each stage finds $f_m = \arg\min_{f_m}\sum_{i=1}^n w_i^{(m)}I(y_i\neq f_m(x_i))$, calculates $\alpha_m = \frac{1}{2}\log\left(\frac{1-\varepsilon_m}{\varepsilon_m}\right)$ and computes
$$F_m(x) = F_{m-1}(x) + \alpha_m f_m(x).$$
Taking a step back boosting keeps fitting a function to the data, such that it
minimizes an exponential loss function. This already proves that boosting with
a binary response variable can be interpreted as a forward stagewise additive
modeling technique with an exponential loss function (Hastie et al., 2009). The
function AdaBoost implements the AdaBoost algorithm in pseudocode. The
only new parts are the uniform distribution of the observations at the start of
the algorithm and standardizing the weights with Zm , such that they sum up
to one:
$$\begin{aligned}
\sum_i w_i^{(m+1)} &= \sum_i \frac{w_i^{(m)}}{Z_m}\exp\left(-\alpha_m y_i f_m(x_i)\right) \\
&= \frac{1}{Z_m}\left(\sum_{i:y_i=f_m(x_i)} w_i^{(m)}\sqrt{\frac{\varepsilon_m}{1-\varepsilon_m}} + \sum_{i:y_i\neq f_m(x_i)} w_i^{(m)}\sqrt{\frac{1-\varepsilon_m}{\varepsilon_m}}\right) \\
&= \frac{2}{Z_m}\sqrt{\varepsilon_m(1-\varepsilon_m)}.
\end{aligned}$$
If I plug the last equation into the definition of $w_i^{(m+1)}$, I find $w_i^{(m+1)} = \frac{w_i^{(m)}}{Z_m}\exp\left(-\alpha_m y_i f_m(x_i)\right)$. This means that, up to the normalization by $Z_m$, the weights are updated with $w_i^{(m+1)} \propto w_i^{(m)}\exp(\alpha_m)$ for all wrongly predicted observations and $w_i^{(m+1)} \propto w_i^{(m)}\exp(-\alpha_m)$ for the correctly classified. In the scenario of an observation that
is often misclassified the weight is sequentially increased and thus receives more
attention by the upcoming base learner.
Friedman et al. (2000) proved that the AdaBoost algorithm is approximately
equal to stagewise additive logistic regression and thus was able to establish a
link between machine learning and statistics theory. Since earlier I showed that

function AdaBoost(x)
    $w_i^{(1)} = \frac{1}{n}$ for all i = 1, . . . , n and $F_0(x) = 0$
    for m = 1, . . . , M do
        $f_m(x) = \arg\min_{f_m} \sum_{i:y_i\neq f_m(x_i)} w_i^{(m)}$
        $\varepsilon_m = \sum_{i:y_i\neq f_m(x_i)} w_i^{(m)}$
        if $\varepsilon_m < 0.5$ then, for all i = 1, . . . , n:
            $\alpha_m = \frac{1}{2}\log\left(\frac{1-\varepsilon_m}{\varepsilon_m}\right)$
            $F_m(x_i) = F_{m-1}(x_i) + \alpha_m f_m(x_i)$
            $Z_m = 2\sqrt{\varepsilon_m(1-\varepsilon_m)}$
            $w_i^{(m+1)} = \frac{w_i^{(m)}}{Z_m}\exp\left(-\alpha_m y_i f_m(x_i)\right)$
        end if
    end for
    return $F_M(x)$
end function

boosting is equivalent to forward stagewise regression, it remains to prove the
logistic part. Consider again boosting’s exponential loss function of yF (x):

$$\min_F E[\exp(-yF(x))] = \min_F\ P(y=1|x)\exp(-F(x)) + P(y=-1|x)\exp(F(x)).$$
The first order condition tells me that $F(x) = \frac{1}{2}\log\left(\frac{P(y=1|x)}{P(y=-1|x)}\right)$. Inverted this implies $P(y=1|x) = \frac{\exp(2F(x))}{1+\exp(2F(x))} = \frac{1}{1+\exp(-2F(x))}$. In a logistic regression the probability is the same up to the factor 2.
As a next step let me consider the minimizer of the corresponding logistic
regression under the assumption of the preceding probability. As is well known
I will need to employ a binomial distribution, such that with $y^* = \frac{y+1}{2} \in \{0,1\}$ the log-likelihood is
$$\begin{aligned}
\log(L(F)) &= \log\left(\prod_{i=1}^n P(y_i^*=1|x_i)^{y_i^*}P(y_i^*=0|x_i)^{1-y_i^*}\right) \\
&= \sum_{i=1}^n \frac{y_i+1}{2}\log(P(y_i=1|x_i)) + \frac{1-y_i}{2}\log(P(y_i=-1|x_i)).
\end{aligned}$$
Then since
$$P(y_i=1|x_i) = 1 - P(y_i=-1|x_i) = \frac{1}{1+\exp(-2F(x_i))},$$
$$\begin{aligned}
\log(L(F)) &= \sum_{i=1}^n -\frac{y_i+1}{2}\log(1+\exp(-2F(x_i))) - \frac{1-y_i}{2}\log(1+\exp(2F(x_i))) \\
&= \sum_{i=1}^n -\log(1+\exp(-2y_iF(x_i))).
\end{aligned}$$

Figure 19: Loss functions

The last equality simply follows by plugging in yi = 1 and yi = −1 and observing


that only the sign of the exponential part changes. Finally I minimize the
conditional expectation of − log(L(F )):
$$\min_F E_{y|x}(-\log(L(F))) = \min_F \sum_{i=1}^n P(y_i=1|x_i)\log(1+\exp(-2F(x_i))) + (1-P(y_i=1|x_i))\log(1+\exp(2F(x_i))).$$
Setting the derivative with respect to $F(x_i)$ equal to zero combined with some transformations implies
$$P(y_i=1|x_i) = \frac{\exp(2F(x_i))}{1+\exp(2F(x_i))} = \frac{1}{1+\exp(-2F(x_i))}.$$
Hence the negative log-likelihood is minimized at the true probability. This also
implies that the minimizers of Ey|x (− log(1 + exp(−2yF (x)))) and
Ey|x (exp(−yF (x))) are the same.
Let me now plot the two functions − log(1+exp(−2yF (x))) and exp(−yF (x))
(Figure 19), where positive values of yF (x) illustrate correctly classified obser-
vations. It is worth noting that the plot shows that both functions still punish
correct predictions in contrast to the 0-1 loss and push them to be "more cor-
rect". I can see that they touch at yF (x) = 0 and are able to quantify their

similarity with a Taylor expansion around $yF(x) = 0$:
$$\exp(-yF(x)) \approx 1 - yF(x) + \frac{(yF(x))^2}{2}$$
and
$$\log(1+\exp(-2F(x)y)) \approx \log(2) - yF(x) + \frac{(yF(x))^2}{2}.$$
To summarize boosting approximately fits a stagewise additive logistic regression
and the population minimizers of − log(1 + exp(−2yF (x))) and exp(−yF (x))
are the same.

4.1.2 Gradient boosting


In the evolution of boosting algorithms the next development was the so-called
Gradient boosting algorithm. The main difference to AdaBoost is that I fit my
base learner to the negative gradient of the loss function, which takes the spot
of reweighting the data. It is commonly implemented as follows:

function Gradient Boosting(x)
    $F_0(x) = \arg\min_\gamma \sum_{i=1}^n L(y_i, \gamma)$
    for m = 1, . . . , M do
        $r_{im} = -\frac{\partial L(y_i, F)}{\partial F}\Big|_{F = F_{m-1}(x_i)}$
        Fit $f_m(x)$ to $r_{im}$
        $\gamma_m = \arg\min_\gamma \sum_{i=1}^n L(y_i, F_{m-1}(x_i) + \gamma f_m(x_i))$
        $F_m(x_i) = F_{m-1}(x_i) + \gamma_m f_m(x_i)$
    end for
    return $F_M(x)$
end function

Note that with $L(y_i, F(x_i)) = (y_i - F(x_i))^2$ I find $-\frac{\partial L(y_i,F(x_i))}{\partial F(x_i)} = 2(y_i - F(x_i))$ and keep fitting $f_m$ to the residuals of the preceding regression, such that $r_{im}$
is sometimes referred to as pseudo-residuals. This squared error loss turns out
to be the most common choice for a continuous response variable. This again
shows that boosting builds upon several well established statistic techniques.
Minimizing the conditional expectation of the squared loss function - more pre-
cisely half the squared loss function - leads to the intuitive result that f (x)
should approximate E[y|x]. This can easily be obtained by mimicking the ear-
lier derivation for finding the minimizer of the exponential loss function. Let me
get into more intuitive detail of the technique.
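Before turning to those details, a bare-bones version of this algorithm for the squared error loss can be written in a few lines of R; the rpart stumps and the fixed step length nu are my own simplifying choices:

    library(rpart)
    l2_boost <- function(df, M = 200, nu = 0.1) {   # df must contain the response column y
      F_hat <- rep(mean(df$y), nrow(df))            # F_0(x) = arg min_gamma sum_i L(y_i, gamma)
      learners <- vector("list", M)
      for (m in 1:M) {
        df$r <- df$y - F_hat                        # pseudo-residuals r_im (negative gradient up to a factor 2)
        learners[[m]] <- rpart(r ~ . - y, data = df, control = rpart.control(maxdepth = 1))
        F_hat <- F_hat + nu * predict(learners[[m]], df)
      }
      list(f0 = mean(df$y), learners = learners, nu = nu)
    }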
The objective function of the algorithm is the earlier introduced empirical
risk function, as illustrated at the initialization stage of the technique. At each
stage in the algorithm the empirical risk is C(Fm−1 ) for some m, which will be
decreased in the following step as C(Fm−1 + γfm ). Then it is obvious that the
negative directional derivative of the function provides me with the direction
fm , in which C decreases most. However, the domain of the empirical risk
function is a function itself, such that I need to consider directional derivatives
of functionals. This is taken care of by the Gâteaux derivative. The Gâteaux
derivative of F at u in the direction ψ is defined as

$$dF(u,\psi) = \lim_{\tau\to 0}\frac{F(u+\tau\psi) - F(u)}{\tau}.$$
Then I can show that for a real-valued function F with $F' = f$ and $u(x) \in L^2$,^23 defined as
$$C(u) = \int F(u(x))\,dx,$$
^23 $L^2$ is defined as the set of square-integrable functions on the Lebesgue measurable set Ω, concretely all functions such that the second norm is finite: $||f||_2 = \sqrt{\int_\Omega |f|^2\,d\mu}$ for some measure µ.

the Gâteaux derivative in the direction ψ is

$$\begin{aligned}
dC(u,\psi) &= \lim_{\tau\to 0}\frac{1}{\tau}\left(C(u(x)+\tau\psi(x)) - C(u(x))\right) \\
&= \lim_{\tau\to 0}\frac{1}{\tau}\left(\int_\Omega F(u(x)+\tau\psi(x))\,dx - \int_\Omega F(u(x))\,dx\right) \\
&= \lim_{\tau\to 0}\frac{1}{\tau}\int_\Omega\left(\int_0^1 dF(u(x)+s\tau\psi(x))\right)dx \\
&= \lim_{\tau\to 0}\frac{1}{\tau}\int_\Omega\int_0^1\frac{dF(u(x)+s\tau\psi(x))}{ds}\,ds\,dx \\
&= \lim_{\tau\to 0}\frac{1}{\tau}\int_\Omega\int_0^1\tau\psi(x)f(u(x)+s\tau\psi(x))\,ds\,dx \\
&= \int_\Omega f(u(x))\psi(x)\,dx = \langle f(u(x)),\psi(x)\rangle.
\end{aligned}$$

Note that in the last step I employ the usual definition of the inner product. If I now apply this to my discrete case, where $C(f) = n^{-1}\sum_{i=1}^n L(y_i, f(x_i))$ and $\delta_{x_i}$ is the indicator function at $x_i$, I get that for $i = 1,\ldots,n$ (up to the constant factor $n^{-1}$, which does not affect the direction)
$$\begin{aligned}
dC(F_{m-1},\delta_{x_i}) &= \left\langle \frac{\partial L(y,F)}{\partial F}\bigg|_{F=F_{m-1}},\ \delta_{x_i}\right\rangle \\
&= \sum_{j=1}^n \frac{\partial L(y_j,F)}{\partial F}\bigg|_{F=F_{m-1}(x_j)}\delta_{x_i}(x_j) \\
&= \frac{\partial L(y_i,F)}{\partial F}\bigg|_{F=F_{m-1}(x_i)} \\
&= -r_{im}.
\end{aligned}$$
Hence $r_{im}$ is the negative of the derivative of the empirical risk in the direction of the corresponding data point. Going back to my algorithm, gradient boosting tries to fit
fm (x) as close as possible to this negative gradient. The next step is then to
find the step size that minimizes the empirical risk.

4.1.3 Component-wise boosting and other extensions


An extension of boosting applicable for variable selection is called component-
wise boosting and was introduced by Bühlmann and Hothorn (2007). It restricts
the base learners to be a function of only one variable. Hence in each step only
one regressor is selected, such that component-wise boosting offers an alter-
native method for variable selection. Commonly used and likely most familiar

to econometrics would be the combination of component-wise boosting and
least-squares base learners. For X : n × p and X = [x1 , . . . , xn ] with xi ∈ Rp :
$$\hat\beta_j = \frac{\sum_{i=1}^n X_{ij}r_{im}}{\sum_{i=1}^n X_{ij}^2}$$
$$\hat{S} = \arg\min_{1\leq j\leq p}\sum_{i=1}^n \left(r_{im} - \hat\beta_jX_{ij}\right)^2$$
$$f_m(x) = \hat\beta_{\hat{S}}\,x_{\hat{S}}$$


Expressed in econometric language I compute the regression coefficient of
the pseudo-residual on every single variable separately and select the one with
the lowest sum of squared residuals. This technique was already introduced to
econometrics by Serena Ng in a series of two papers: The first one using the
famous Stock and Watson dataset of diffusion indexes (Bai and Ng, 2009) and
the second recent one to forecast recessions (Ng, 2013). In the former paper
they employ the quadratic loss function and extend the method by regarding a
variable including its lags as a single component (block component-wise). Hence
each base learner is not only a function of a variable, but also all the variables’
lags. They find that block boosting outperforms component-wise boosting and
illustrate the main difference with the example of a DGP. Consider

yt = β11 yt−1 + β21 Xt−2,1 + β32 Xt−3,2 + εt+h ,


where β11 ≈ 0 and β21 ≈ 0. Then the component-wise boosting algorithm
will only add Xt−3,2 to the model, whereas the block component-wise boosting
will add all three variables to the model. Hence block boosting is less parsimo-
nious. In the end the superiority is decided, if the additional variance induced
by the block algorithm is worth the decrease in bias. A different application to
component-wise boosting is the study of Wohlrabe and Teresa (2014), where
macroeconomic variables are forecasted and it is found that boosting after some
calibration of the stopping criterion outperforms autoregressive models.
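To make the selection step explicit, here is a small base-R sketch of component-wise L2-boosting; the standardization, the initialization at the mean and the fixed step length nu are my own choices and differ from the exact implementations in the cited papers:

    cw_boost <- function(X, y, M = 200, nu = 0.1) {
      X <- scale(X); n <- nrow(X); p <- ncol(X)
      beta <- numeric(p); F_hat <- rep(mean(y), n)
      for (m in 1:M) {
        r <- y - F_hat                                  # pseudo-residuals under squared loss
        b <- colSums(X * r) / colSums(X^2)              # beta_j for every single component
        sse <- sapply(1:p, function(j) sum((r - b[j] * X[, j])^2))
        S <- which.min(sse)                             # component with the lowest SSR
        beta[S] <- beta[S] + nu * b[S]
        F_hat <- F_hat + nu * b[S] * X[, S]
      }
      list(intercept = mean(y), beta = beta)
    }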
Another special case of component-wise boosting is the component-wise
likelihood-based boosting, where instead of considering gradients to select the
new base learner the log-likelihood decides on which function to add. Again the
function is restricted to have only one element. Last I would like to mention a
technique that might be interesting to consider from an econometric perspec-
tive: stochastic gradient boosting expands on gradient boosting by using only
a random subsample of the data. It has been used by Ng (2013) in her paper
on recessions.

There is a whole literature on selecting the base learners and the step length,
which I will not cover. Base learners are neglected for the simple reason that
those investigations tend to suggest more complex algorithms that certainly
depart from the ultimate goal of making the models interpretable for inference.
One example would be the popular application of picking splines or trees as base
learners. Next to that none of these methods is unique to boosting, but can also
be solely applied. The research on step length selection is certainly relevant,
however hardly adds anything to the intuitive understanding of boosting.
Boosting certainly has potential to be used in econometrics. It is extremely
popular in computer science and statistics, not only on the academic side, but
also in the emerging business field of data science. According to Kulkarni et al.
(2005) boosting is consistent in the case of dependent data.

4.2 Trees and random forest


Trees are probably the most prominent entry level machine learning technique.
They are a very intuitive method that is appealing to employ for a non-technical
audience. Trees split the data into groups based on the explanatory variables
and then fit a simple model, usually a constant, to each group. Each split is
connected to a so-called node. The first node is called the root and the nodes
that are at the bottom of the tree are termed terminal nodes. All but the
terminal nodes have two branches or arcs connected to them. The model then
looks for each xi ∈ X as follows:
$$\hat{f}(x_i) = \sum_{m=1}^K c_m I(x_i\in R_m),$$
where $c_m$ is a scalar for region $R_m$. The regions are mutually exclusive and each belongs to one terminal node. Trees can be visualized either with branches or in a rectangle that is split into several smaller rectangles, as shown in Figure 20. A rectangle or cube can only be used if the number of variables is at most three. I used data on the kyphosis disease, which is described in more detail in Appendix A.
If I try to find the best model under a squared loss function, $c_m = \frac{1}{|S_m|}\sum_{i\in S_m} y_i$ with $S_m = \{i : x_i\in R_m\}$ gives the best fit; in words $c_m$ is then the average value of the dependent variable in the group. Trees tend to
suffer from high variance and overfitting. If for example a change in the data
affects a split close to the root of the tree, it is likely that several other splits of
the tree are affected. For overfitting just imagine a tree, where each region only
contains one observation. This is similar to adding a dummy variable for each

Figure 20: Visualizing a tree
(a) With branches (b) In a rectangle

observation. Such a tree has a perfect in-sample fit, however adds little value
for forecasting or understanding the data generating process. One method to
avoid overfitting is to prune the tree down. First a criterion is established, which, minimized over all possible subtrees, provides a solution. One possible
criterion is
$$C_\alpha(T) = \sum_{m=1}^{|T|}\sum_{x_i\in R_m}(y_i - c_m)^2 + \alpha|T|$$

with α ≥ 0. |T | is the number of terminal nodes of the tree T , hence also the
number of existing regions. The criterion reminds of a lasso regression, where
the number of variables is replaced by the number of terminal nodes in T and
α takes the role of λ.
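In R the rpart package implements exactly this kind of cost-complexity pruning; assuming the kyphosis data it ships with corresponds to the dataset of Appendix A, a short session looks as follows:

    library(rpart)
    fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")
    printcp(fit)                     # cross-validated error for each complexity value cp (~ alpha)
    pruned <- prune(fit, cp = 0.05)  # cut the tree back at the chosen complexity penalty
    plot(pruned); text(pruned)       # visualization with branches, as in Figure 20a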
There are several methods to reduce the variance of trees, two of them
being bagging (Section 4.3) and random forest. Random forest is explained in
the upcoming paragraphs.

function Random Forest(x)
    for b = 1, . . . , B do
        Draw a bootstrap sample Z of size N
        endnodes = 0
        while endnodes < nmin do
            Select m out of p variables at random
            Pick the best predictor out of the m candidates and the corresponding best split
            Add this split to the tree
            endnodes = endnodes + 2
        end while
    end for
    return prediction = mean(Tree)
end function

Random forest was introduced in Breiman (2001) and extends trees by first collecting a bootstrap sample and then, at each split, picking a random subset of the variables from which the best split may be chosen. Hence the remaining variables are excluded from that selection. In the end the trees' predictions are averaged.
Analytically I will now show that this algorithm reduces the variance of trees.
I know that each tree is constructed in the same way, but the trees are not independent of each other. Thus I can model the group of trees as B dependent, but identically
distributed random variables xi with variance σ 2 . If the correlation among the
trees is ρ,
$$\begin{aligned}
\mathrm{Var}\left(\frac{1}{B}\sum_{i=1}^B x_i\right) &= \frac{1}{B^2}\mathrm{Var}\left(\sum_{i=1}^B x_i\right) \\
&= \frac{1}{B^2}\left(\sum_{i=1}^B \mathrm{Var}(x_i) + 2\sum_{1\leq i<j\leq B}\mathrm{Cov}(x_i,x_j)\right) \\
&= \frac{\sigma^2(1-\rho)}{B} + \sigma^2\rho.
\end{aligned}$$
With B → ∞ it can be observed that the total variance is the variance of a
single tree multiplied with the correlation among two trees. Since 0 ≤ ρ ≤ 1, the
variance of the random forest estimator is at most the variance of a single tree.

This calculation also implies that the less correlated the explanatory variables
are, the more random forest improves on the trees.
The bias, which is rather a minor concern for a tree, is necessarily increased
by the restrictions imposed by the random forest. In contrast to early findings
of research on the random forest it is by now known that the technique overfits
the data eventually. This overfitting takes place rather slowly, since the final
model is an average of an ever larger set of models.
One of the tuning parameters of the random forest technique is m, the
number of variables that are picked to be possibly chosen for the model. Es-
pecially in the case of datasets with a large number of variables, selecting the
number m is crucial. If there are only few relevant variables, then m has to
be sufficiently large to ensure a high chance of picking some of those variables.
Next to that a lower value of m will cause less correlation, but also a lower fit.
In simulation studies one seems to find that a rather small m hardly increases the bias of the individual trees, while already providing a very low correlation among the trees (Breiman, 2001).
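Choosing m in practice usually means comparing out-of-bag errors over a grid. The sketch below uses the randomForest package; the data frame credit and its binary factor response card are placeholders standing in for the credit card data of Section 4.4:

    library(randomForest)
    oob_error <- sapply(1:8, function(m) {
      fit <- randomForest(card ~ ., data = credit, ntree = 500, mtry = m)
      tail(fit$err.rate[, "OOB"], 1)            # out-of-bag error after 500 trees
    })
    best_m <- which.min(oob_error)
    fit <- randomForest(card ~ ., data = credit, ntree = 500, mtry = best_m,
                        importance = TRUE)
    varImpPlot(fit)                             # variable importance plot, cf. Section 4.4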
Since the random forest is again non-linear and difficult to interpret, applica-
tions are sparse and it seems that more theoretical work needs to be conducted to
popularize the method in econometrics. Applications can be found in the fields
of finance (Rodriguez and Rodriguez (2004), Kumar and Thenmozhi (2006)),
marketing (Larivière and Van den Poel (2005)) and microeconomics (Keely and
Tan (2008)).

4.3 Bagging
Bagging is a variance reducing technique that can be applied in combination with
almost any other algorithm. The name is short for Bootstrap Aggregation and
originates as an attempt to reduce the variance of trees. In machine learning it is
also mainly applied to trees. The underlying idea is comparably straightforward:
Assume some DGP with the underlying distribution P that draws a subsample
{X , Y} from the population. Bagging then selects B bootstrap samples from
the subsample and fits a model for each bootstrap sample. Now with the help
of the underlying distribution and some independence assumption on the data
generating process, the probability of each bootstrap sample occurring can be
derived and the bagging estimate calculated as
$$f_{ag}(x) = E_{\hat{P}}(\hat{f}^*(x)).$$
Empirically this corresponds to averaging each model:24
^24 More details on the bootstrap and why I need to average the model can be found in Section 2.2.
$$\hat{f}_{ag}(x) = \frac{1}{B}\sum_{b=1}^B \hat{f}^{*b}(x),$$

where $\hat{f}^{*b}$ is the model fitted on the b-th bootstrap sample. Since each
model is unbiased, the bagged model must also be unbiased. In addition, the
averaging reduces the mean squared error, which I will now show with a simple
transformation:

\begin{align*}
E_P\big(Y - \hat{f}^{*}(x)\big)^2
&= E_P\big(Y - f_{ag}(x) + f_{ag}(x) - \hat{f}^{*}(x)\big)^2 \\
&= E_P\Big[\big(Y - f_{ag}(x)\big)^2 + 2\big(Y - f_{ag}(x)\big)\big(f_{ag}(x) - \hat{f}^{*}(x)\big)
   + \big(f_{ag}(x) - \hat{f}^{*}(x)\big)^2\Big] \\
&= E_P\big(Y - f_{ag}(x)\big)^2 + E_P\big(f_{ag}(x) - \hat{f}^{*}(x)\big)^2 \\
&\ge E_P\big(Y - f_{ag}(x)\big)^2,
\end{align*}
where the cross term vanishes because $E_P\big(\hat{f}^{*}(x)\big) = f_{ag}(x)$.

Having established a theoretical justification of bagging, the technique is
illustrated graphically with simulated data in Figure 21. Each dashed line shows
one of 20 models, each fitted on a separate bootstrap sample. The solid line is
the bagged model. It can be observed that the bagged model exhibits little
uncertainty in areas where data is only sparsely available.
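To make the procedure concrete, the following sketch (again Python with
scikit-learn; the one-dimensional sine-shaped DGP is purely illustrative and not
the design behind Figure 21) bags B = 20 regression trees by hand, mirroring the
empirical averaging formula above.

# A minimal bagging sketch, assuming a simple illustrative DGP: draw B bootstrap
# samples, fit a regression tree on each, and average their predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n, B = 100, 20
x = np.sort(rng.uniform(-3, 3, size=n))
y = np.sin(x) + rng.normal(scale=0.3, size=n)
X = x.reshape(-1, 1)

x_grid = np.linspace(-3, 3, 200).reshape(-1, 1)
predictions = np.empty((B, len(x_grid)))

for b in range(B):
    idx = rng.integers(0, n, size=n)          # bootstrap sample (with replacement)
    tree = DecisionTreeRegressor(max_depth=4, random_state=b)
    tree.fit(X[idx], y[idx])
    predictions[b] = tree.predict(x_grid)     # one individual model, a dashed line

f_bagged = predictions.mean(axis=0)           # the bagged model, the solid line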
In a series of two papers Inoue and Kilian (2004, 2008) introduced bagging
to econometrics, in particular to time series forecasting. In general they find
that bagging, despite its simple setup, can in several situations keep up with
other, more established methods such as factor models or penalized regression.

4.4 A comparison of ensemble techniques and the variable importance plot
To illustrate the performance of boosting, the random forest and bagging, I
applied the three algorithms to a dataset on credit card applications. The
dataset is described in Appendix A. Figure 22 shows their respective performance,
measured as the cross-validated share of false predictions. Bagging, the simplest
algorithm, levels off first and appears to perform worst at predicting credit
card rejections. As more bootstrap samples are added, they seem to contribute
only little new information, so the bagged model hardly changes. Boosting - here
employed with trees as base learners - takes more iterations to converge, but
eventually predicts as well as the random forest. This performance is consistent
with the experience from other applications, where boosting and the random forest
offer promising results. Due to its flexibility in the choice of base learners,
boosting is often preferred over the random forest.

Figure 21: Bagging illustrated

Figure 22: Performance on Credit Card dataset
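A comparison along these lines can be sketched as follows (Python with
scikit-learn; a simulated binary outcome serves as a stand-in for the credit card
data of Appendix A, so the resulting numbers are only illustrative and will not
reproduce Figure 22).

# A minimal sketch of the comparison in Figure 22, using simulated data as a
# stand-in for the credit card applications described in Appendix A.
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=12, n_informative=6,
                           random_state=0)

models = {
    "bagging":       BaggingClassifier(n_estimators=200, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "boosting":      GradientBoostingClassifier(n_estimators=200, random_state=0),
}

for name, model in models.items():
    accuracy = cross_val_score(model, X, y, cv=10)   # 10-fold cross-validation
    print(f"{name:13s} share of false predictions: {1 - accuracy.mean():.3f}")

All three classifiers use trees as base learners, so the comparison isolates the
effect of the ensemble scheme rather than the choice of base learner.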
All of the introduced ensemble techniques are appealing for statistical modeling
in certain scenarios. However, their main obstacle to adoption in econometrics is
probably their limited interpretability. Given the techniques' complexity,
improving this is not an easy task and is the goal of current research efforts.
Hypothesis testing is probably a long way off. What can be done already, however,
is to consider a variable importance plot. An example of such a plot is given in
Figure 23; the data are taken from the random forest fitted in the preceding
comparison study. Such a graph ranks the variables included in the model by their
respective contribution, measured by how much each variable improves a certain
criterion. These criteria range from information criteria and the cross-validated
mean squared error to more elaborate measures of fit. Figure 23 uses the mean
decrease in the Gini impurity, an impurity measure especially used for trees. In
general, at each iteration - e.g. each bootstrap sample for bagging and the
random forest - I compute the reduction in the criterion and attribute it to the
corresponding variable. In the end this value is averaged for each regressor and
scaled to lie between zero and one hundred. This idea is certainly not
revolutionary, but it is a first step towards understanding the models. For
techniques that employ the bootstrap, the result is unique only up to sampling,
so results may differ for the same dataset; a large number of bootstrap
replications is therefore recommended. The variable importance plot contains far
less information than a t-statistic. Hence it is not a measure of significance,
but rather a value describing a variable's contribution to the model.
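Such a ranking can be extracted directly from a fitted forest, as in the
following sketch (Python with scikit-learn on simulated data; the attribute
feature_importances_ reports the mean decrease in impurity normalised to sum to
one, which is rescaled here to the zero-to-hundred range used above).

# A minimal sketch of a variable importance ranking, assuming simulated data;
# feature_importances_ is the mean decrease in (Gini) impurity, normalised to
# sum to one, and is rescaled below to the 0-100 range used in the text.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           random_state=0)
names = [f"x{j}" for j in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
importance = 100 * forest.feature_importances_

for name, value in sorted(zip(names, importance), key=lambda pair: -pair[1]):
    print(f"{name}: {value:5.1f}")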

Figure 23: Variable importance plot

5 Machine learning’s future in econometrics
After introducing several machine learning techniques in Sections 3 and 4, the
main question that remains is whether the techniques can and will be adopted
into econometrics. This depends on several factors. First of all, economics often
draws upon economic theory before analysing data. With data becoming more cheaply
available, economics might develop into a more empirically founded discipline.
This in turn could boost machine learning's level of acceptance in econometrics.
But even then, the type of questions researchers ask will determine which methods
are most useful. Currently machine
learning algorithms - at least the ones unfamiliar to econometrics - offer rather
strong out-of-sample performance but less insight into the data. This might
hinder their acceptance into econometrics. Looking at the techniques themselves,
Figure 24 attempts to summarize my personal evaluation of all algorithms in
one graph. The figure plots flexibility against interpretability with a focus on the
scenario of many regressors and possibly high-dimensional datasets. It shows
that the elastic net and lasso are the "stars" of this survey. Solving inference
problems would boost their interpretability even more and is denoted here by
the dashed arrow. In my eyes the SVM and boosting have the most potential. Their
ability to capture the data in great detail in all kinds of applications is well
known, but it could be enhanced by a better understanding of their final models
and the ability to conduct inference. Considering that for the lasso - a
generalization of the least-squares regression that appears to be relatively
uncomplicated - there is still no procedure for hypothesis testing, one would
certainly have to be patient before a solution to this problem is found for
boosting or the SVM. Solving these issues
could promote their adoption as a member of the econometric toolbox. All
other techniques either already have a long history in economic research (e.g.
PCA) or simply show little potential, such as the k-NN or bagging. Often these
methods are far from offering the complete package of being applicable in diverse
settings while at the same time providing a lot of insight into the data.

Figure 24: Comparison of the algorithms

6 Conclusion
As econometrics continues to work with large datasets, demand will increase for
techniques that are able to extract more information from the existing data and
go beyond linear modeling. One field researchers will then certainly consult is
machine learning. This encouraged me to conduct a survey of machine learning
techniques that have potential to find their way into econometrics. I particularly
considered - among others - penalized regression (including lasso and the elastic
net), dimensionality reduction techniques (represented by principal component
analysis (PCA)), the support vector machine (SVM) and boosting. My thesis
attempted to contribute to the discussion and to enhance the understanding of
those techniques by explaining them in an econometric language.
A lot of work remains to be done. First of all, theoretical research needs to
offer a better understanding of the models produced by the algorithms. For too
many techniques the final model demonstrates remarkable performance but does not
reveal how this performance comes about. Especially in economics, a simple,
understandable model is often preferred over a better performing black-box
technique. This theoretical work should eventually also be targeted at being able
to conduct inference on the models. Furthermore, experience with applications of
emerging techniques has shown that, regardless of the field an algorithm
originates from, the algorithm not only has to be understood but also adjusted in
order to evolve from a pure machine learning technique into an econometric tool.
This has already taken place for the PCA, where adjustments have produced
techniques that are, for example, able to forecast macroeconomic variables. It is
the task of empirical researchers to employ the techniques and evaluate how well
specific questions can
be answered. One example of how such an application can be conducted is the
recent paper by Ng (2013), where boosting is employed to forecast recessions.
It remains to conclude that machine learning can advance, and certainly has
advanced, econometric techniques, but a lot of work remains to speed up their
introduction into econometrics. Despite their impressive performance in
understanding and forecasting data, the most persistent problem is that some of
the methods remain a black box. The future will tell whether we can shed light
into this black box. I am curious what is inside.

Acknowledgements
I would like to acknowledge Kaya Verbooy for her support on proofreading and
Guillem Collell Talleda for helping out with one derivation. Furthermore I would
like to thank my family for continuous financial and emotional support. My
classmates Rasmus Lönn and Henrik Zaunbrecher supported me not only in the
process of this thesis, but throughout the whole master program, for which I
am very grateful. Last but certainly not least, my supervisors Stephan and
Jean-Pierre have always been there to guide me through this thesis, whether in
person, via phone or email.
References
Jushan Bai and Serena Ng. Forecasting economic time series using targeted
predictors. Journal of Econometrics, 146(2):304–317, 2008.

Jushan Bai and Serena Ng. Boosting diffusion indices. Journal of Applied
Econometrics, 24(4):607–629, 2009.

Graciela Boente and Ricardo Fraiman. Robust nonparametric regression estimation
for dependent observations. The Annals of Statistics, pages 1242–1256, 1989.

Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.

Peter Bühlmann and Torsten Hothorn. Boosting algorithms: Regularization,
prediction and model fitting. Statistical Science, pages 477–505, 2007.

Peter Bühlmann and Sara Van De Geer. Statistics for high-dimensional data:
methods, theory and applications. Springer Science & Business Media, 2011.

Evgeny Byvatov, Uli Fechner, Jens Sadowski, and Gisbert Schneider. Com-
parison of support vector machine and artificial neural network systems for
drug/nondrug classification. Journal of Chemical Information and Computer
Sciences, 43(6):1882–1889, 2003.

Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal
component analysis? Journal of the ACM (JACM), 58(3):11, 2011.

Bradley Efron. Bootstrap methods: another look at the jackknife. The Annals of
Statistics, pages 1–26, 1979.

Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani, et al. Least
angle regression. The Annals of Statistics, 32(2):407–499, 2004.

Liran Einav and Jonathan D Levin. The data revolution and economic analysis.
Technical report, National Bureau of Economic Research, 2013.

Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al. Additive logistic
regression: a statistical view of boosting (with discussion and a rejoinder by
the authors). The Annals of Statistics, 28(2):337–407, 2000.

Fred Hamprecht. Pattern recognition class 2011, 2011. URL
https://www.youtube.com/watch?v=ZGUlaomeJ-k&list=PLE91541A982BEA7CD.

Bruce Hansen. Econometrics. 2000.

Trevor Hastie and Robert Tibshirani. Discriminant adaptive nearest neighbor
classification. Pattern Analysis and Machine Intelligence, IEEE Transactions
on, 18(6):607–616, 1996.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of
statistical learning, volume 2. Springer, 2009.

David Haussler. Overview of the probably approximately correct (PAC) learning
framework. Information and Computation, 100(1):78–150, 1992.

Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for
nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

Atsushi Inoue and Lutz Kilian. Bagging time series models. 2004.

Atsushi Inoue and Lutz Kilian. How useful is bagging in forecasting economic
time series? a case study of us consumer price inflation. Journal of the
American Statistical Association, 103(482):511–522, 2008.

Ian Jolliffe. Principal component analysis. Wiley Online Library, 2002.

Henry F Kaiser. The varimax criterion for analytic rotation in factor analysis.
Psychometrika, 23(3):187–200, 1958.

Michael J Kearns and Leslie G Valiant. Learning Boolean formulae or finite
automata is as hard as factoring. Harvard University, Center for Research in
Computing Technology, Aiken Computation Laboratory, 1988.

Louise C Keely and Chih Ming Tan. Understanding preferences for income
redistribution. Journal of Public Economics, 92(5):944–961, 2008.

Marius Kloft, Ulf Brefeld, Pavel Laskov, Klaus-Robert Müller, Alexander Zien, and
Sören Sonnenburg. Efficient and accurate lp-norm multiple kernel learning. In
Y. Bengio, D. Schuurmans, J.D. Lafferty, C.K.I. Williams, and A. Culotta,
editors, Advances in Neural Information Processing Systems 22, pages 997–1005.
Curran Associates, Inc., 2009. URL
http://papers.nips.cc/paper/3675-efficient-and-accurate-lp-norm-multiple-kernel-learning.pdf.

Johannes Tang Kristensen et al. Diffusion indexes with sparse loadings. Tech-
nical report, School of Economics and Management, University of Aarhus,
2013.

Sanjeev Kulkarni, Aurelie C Lozano, and Robert E Schapire. Convergence and
consistency of regularized boosting algorithms with stationary β-mixing
observations. In Advances in Neural Information Processing Systems, pages
819–826, 2005.

Manish Kumar and M Thenmozhi. Forecasting stock index movement: A comparison of
support vector machines and random forest. In Indian Institute of Capital
Markets 9th Capital Markets Conference Paper, 2006.

Bart Larivière and Dirk Van den Poel. Predicting customer retention and prof-
itability by using random forests and regression forests techniques. Expert
Systems with Applications, 29(2):472–484, 2005.

Hsuan-Tien Lin and Chih-Jen Lin. A study on sigmoid kernels for svm and
the training of non-psd kernels by smo-type methods. submitted to Neural
Computation, pages 1–32, 2003.

Lukas Meier, Peter Bühlmann, et al. Smoothing ℓ1-penalized estimators for
high-dimensional time-course data. Electronic Journal of Statistics, 1:597–615,
2007.

James Mercer. Functions of positive and negative type, and their connection
with the theory of integral equations. Philosophical transactions of the royal
society of London. Series A, containing papers of a mathematical or physical
character, pages 415–446, 1909.

A. Ng. Machine learning (Stanford).
https://www.youtube.com/watch?v=UzxYlbK2c7E&list=PLA89DCFA6ADACE599, 2008.
Accessed: 2015-04-19.

Serena Ng. Boosting recessions. Canadian Journal of Economics, forthcoming, 2013.

Karl Pearson. LIII. On lines and planes of closest fit to systems of points in
space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of
Science, 2(11):559–572, 1901.

Pedro N Rodriguez and Arnulfo Rodriguez. Predicting stock market indices
movements. Computational Finance and its Applications, 2004.

Robert E Schapire. The strength of weak learnability. Machine learning,
5(2):197–227, 1990.

Stephan Smeekes. Bootstrapping Non-stationary Time Series. PhD thesis, Maastricht
University, May 2009.

Ingo Steinwart and Andreas Christmann. Support vector machines. Springer
Science & Business Media, 2008.

Ingo Steinwart, Don Hush, and Clint Scovel. Learning from dependent obser-
vations. Journal of Multivariate Analysis, 100(1):175–194, 2009.

James H Stock and Mark W Watson. Macroeconomic forecasting using diffusion
indexes. Journal of Business & Economic Statistics, 20(2):147–162, 2002.

Howard G. Tucker. A generalization of the Glivenko-Cantelli theorem. The Annals
of Mathematical Statistics, 30(3):828–830, 1959. ISSN 00034851. URL
http://www.jstor.org/stable/2237422.

Leslie G Valiant. A theory of the learnable. Communications of the ACM,
27(11):1134–1142, 1984.

Hal R Varian. Big data: New tricks for econometrics. The Journal of Economic
Perspectives, pages 3–27, 2014.

Diego Vidaurre, Concha Bielza, and Pedro Larrañaga. A survey of l1 regression.
International Statistical Review, 81(3):361–387, 2013.

Klaus Wohlrabe and Teresa Buchen. Assessing the macroeconomic forecasting
performance of boosting. 2014.

Heesung Yoon, Seong-Chun Jun, Yunjung Hyun, Gwang-Ok Bae, and Kang-
Kun Lee. A comparative study of artificial neural networks and support vector
machines for predicting groundwater levels in a coastal aquifer. Journal of
Hydrology, 396(1):128–138, 2011.

Ming Yuan and Yi Lin. Model selection and estimation in regression with
grouped variables. Journal of the Royal Statistical Society: Series B (Statis-
tical Methodology), 68(1):49–67, 2006.

Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic
net. Journal of the Royal Statistical Society: Series B (Statistical Methodol-
ogy), 67(2):301–320, 2005.

Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component
analysis. Journal of computational and graphical statistics, 15(2):265–286,
2006.

A Datasets

Table 1

Boston Housing: Housing data for 506 census tracts of Boston from the 1970
census. The dataset is contained in the R package mlbench and its variables are
described in Table 2.

Credit Card: Cross-section data on the credit history for a sample of applicants
for a type of credit card. I used this dataset to predict the binary outcome
"card", which indicates whether the application for the credit card was accepted.
Demographic and economic characteristics of the individual are given as
explanatory variables. The dataset is contained in the AER package in R.

Kyphosis: The dataset contains data on children who have had corrective spinal
surgery. Next to the binary response, which indicates whether a person had a
surgery, there are three explanatory variables: the age, the number of the first
vertebra operated on (Start) and the number of vertebrae involved (Number).

Table 2: Boston Housing data

Name Description
tract census tract
lon longitude of census tract
lat latitude of census tract
medv median value of owner-occupied homes in USD 1000’s
cmedv corrected median value of owner-occupied homes in USD 1000’s
crim per capita crime rate by town
zn proportion of residential land zoned for lots over 25,000 sq.ft
indus proportion of non-retail business acres per town
nox nitric oxides concentration (parts per 10 million)
rm average number of rooms per dwelling
