Continuous Time 2
y = f (x)
• Easy to generalize to the case where y is a vector (or a probability distribution), but notation
becomes cumbersome.
• In economics, f (x) can be a value function, a policy function, a pricing kernel, a conditional
expectation, a classifier, ...
1
A neural network
• An artificial neural network (a.k.a. ANN or connectionist system) is an approximation to f (x) built as
a linear combination of M generalized linear models of x of the form:
\[ y \cong g^{NN}(x; \theta) = \theta_0 + \sum_{m=1}^{M} \theta_m \phi(z_m) \]
• We can select θ such that g NN (x; θ) is as close to f (x) as possible given some relevant metric (e.g.,
L2 norm).
• Compare:
\[ y \cong g^{NN}(x; \theta) = \theta_0 + \sum_{m=1}^{M} \theta_m \phi\left(\theta_{0,m} + \sum_{n=1}^{N} \theta_{n,m} x_n\right) \]
• We exchange the rich parameterization of coefficients for the parsimony of basis functions.
• How we determine the coefficients will also be different, but this is somewhat less important.
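• A minimal sketch of this single-layer approximation in Python (illustrative only: the sigmoid basis φ, the dimensions N and M, and the parameter names theta0, theta_m, theta0_m, and W for θ_{n,m} are assumptions, not part of the slides):
```python
import numpy as np

def phi(z):
    return 1.0 / (1.0 + np.exp(-z))          # sigmoid basis function

def g_nn(x, theta0, theta_m, theta0_m, W):
    """g^NN(x; theta) = theta_0 + sum_m theta_m phi(theta_{0,m} + sum_n theta_{n,m} x_n)."""
    z = theta0_m + W @ x                      # z_m, one per neuron
    return theta0 + theta_m @ phi(z)

rng = np.random.default_rng(0)
N, M = 3, 8                                   # inputs and neurons (illustrative)
x = rng.normal(size=N)
print(g_nn(x, 0.0, rng.normal(size=M), rng.normal(size=M), rng.normal(size=(M, N))))
```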
3
Deep learning
where the M^(1), M^(2), ... and φ_1(·), φ_2(·), ... are possibly different across each layer of the network.
• “Feedforward” comes from the fact that the composition of neural networks can be represented as a
directed acyclic graph, which lacks feedback. We can have more general recurrent structures.
• J is known as the depth of the network. The case J = 1 is a standard neural network.
• As before, we can select θ such that g DL (x; θ) approximates a target function f (x) as closely as
possible under some relevant metric.
4
Why are neural networks a good solution method in economics?
• From now on, I will refer to neural networks as including both single and multilayer networks.
• With suitable choices of activation functions, neural networks can efficiently approximate extremely
complex functions.
• Furthermore, neural networks are easy to code, stable, and scalable for multiprocessing.
• Thus, neural networks have considerable option value as solution methods in economics.
5
Current interest
• Currently, neural networks are among the most active areas of research in computer science and
applied math.
• While the original idea goes back to the 1940s, neural networks were rediscovered in the second half of
the 2000s.
• Why?
1. Suddenly, the large computational resources and data required to train the networks efficiently
became available at a reasonable cost.
2. New algorithms such as back propagation through gradient descent became popular.
6
7
AlphaGo
• Silver et al. (2018): now applied to chess, shogi, Go, and StarCraft II.
• Check also:
1. https://deepmind.com/research/alphago/.
2. https://www.alphagomovie.com/
3. https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii
[Figure (Silver et al., 2016, Figure 1): Neural network training pipeline and architecture. a, A fast rollout policy p_π and supervised learning (SL) policy network p_σ are trained to predict human expert moves in a data set of positions. A reinforcement learning (RL) policy network p_ρ is initialized to the SL policy network, and is then improved by policy gradient learning to maximize the outcome (that is, winning more games) against previous versions of the policy network. A new data set is generated by playing games of self-play with the RL policy network. Finally, a value network v_θ is trained by regression to predict the expected outcome (that is, whether the current player wins) in positions from the self-play data set. b, Schematic representation of the neural network architecture used in AlphaGo. The policy network takes a representation of the board position s as its input, passes it through many convolutional layers with parameters σ (SL policy network) or ρ (RL policy network), and outputs a probability distribution p_σ(a | s) or p_ρ(a | s) over legal moves a, represented by a probability map over the board. The value network similarly uses many convolutional layers with parameters θ, but outputs a scalar value v_θ(s′) that predicts the expected outcome in position s′.]
9
[Excerpt from Silver et al. (2018), Fig. 1: Training AlphaZero for 700,000 steps. Elo ratings were computed from games between different players where each player was given 1 s per move. (A) Performance of AlphaZero in chess compared with the 2016 TCEC world champion program Stockfish. (B) Performance of AlphaZero in shogi compared with the 2017 CSA world champion program Elmo. (C) Performance of AlphaZero in Go compared with AlphaGo Lee and AlphaGo Zero (20 blocks over 3 days).]
10
Further advantages
• Neural networks and deep learning often require less “inside knowledge” by experts on the area.
• More recently, the development of dedicated hardware (TPUs, AI accelerators, FPGAs) is likely to
maintain the area's edge.
11
12
Limitations of neural networks and deep learning
• While neural networks and deep learning can work extremely well, there is no such thing as a silver
bullet.
• A rule of thumb in the industry is that one needs around 10^7 labeled observations to properly train a
complex ANN, with around 10^4 observations in each relevant group.
• Of course, sometimes “observations” are endogenous (we can simulate them), but if your goal is to
forecast GDP next quarter, it is unlikely a neural network will beat an ARIMA(p,d,q) (at least with
only macro variables).
• Issues of interpretation.
13
14
Digging deeper
More details on neural networks
• We will follow a much more sober formal treatment (which, in any case, agrees with the approach of
state-of-the-art researchers).
• In particular, we will highlight connections with econometrics (e.g., NOLS, semiparametric regression,
and sieves).
15
A neuron
Theoretically, we could build non-linear combinations of neurons, but this is unlikely to be a fruitful idea in general.
y = g (x; θ) = φ (z)
[Diagram: a perceptron. Inputs x_1, ..., x_n with weights θ_1, ..., θ_n are combined into the net input z = Σ_{i=1}^{n} θ_i x_i, which is passed through the activation function to produce the classification output.]
17
The biological analog
18
Activation functions I
• Traditionally:
1. Identity function:
\[ \phi(z) = z \]
2. A sigmoidal function:
\[ \phi(z) = \frac{1}{1 + e^{-z}} \]
3. Hyperbolic tangent:
\[ \phi(z) = \frac{e^{2z} - 1}{e^{2z} + 1} \]
19
[Plot: the sigmoidal function, with values between 0 and 1, for z ∈ [−4, 4].]
20
[Plot: the hyperbolic tangent, with values between −1 and 1, for z ∈ [−4, 4].]
21
Activation functions II
1. Rectified linear unit (ReLU):
\[ \phi(z) = \max(0, z) \]
2. Parametric ReLU:
\[ \phi(z) = \max(z, az) \]
3. Softplus:
\[ \phi(z) = \log\left(1 + e^{z}\right) \]
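• The activation functions above, written out directly in Python (the parametric ReLU slope a = 0.01 is an illustrative default):
```python
import numpy as np

def identity(z):
    return z

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hyperbolic_tangent(z):
    return (np.exp(2 * z) - 1) / (np.exp(2 * z) + 1)   # same as np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def parametric_relu(z, a=0.01):
    return np.maximum(z, a * z)

def softplus(z):
    return np.log(1.0 + np.exp(z))

z = np.linspace(-4.0, 4.0, 9)
print(relu(z))
print(softplus(z))
```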
22
[Plot: the ReLU and softplus activation functions for z ∈ [−4, 4].]
23
Interpretation
• The levels of the θ_i's for i > 0 control the activation rate (the higher the θ_i's, the harder the
activation).
• Some textbooks separate the activation threshold and scaling coefficients from θ as different
coefficients in φ, but such separation moves notation farther away from standard econometrics.
• Potential identification problem between θ and more general activation functions with their own
parameters.
• But in practice θ does not have a structural interpretation, so the identification problem is of
secondary importance.
25
Combining neurons into a neural network
27
28
29
Two classic (yet remarkable) results II
• Assume, as well, that we are dealing with the class of functions for which the Fourier transform of
their gradient is integrable.
30
Training the network
• Where do the observations Y come from? Observed data vs. simulated epochs.
31
Back propagation
• In particular, for the gradient, we can use back propagation (Rumelhart et al., 1986):
\[ \frac{\partial E(\theta; y_j, \hat{y}_j)}{\partial \theta_0} = y_j - g(x_j; \theta) \]
\[ \frac{\partial E(\theta; y_j, \hat{y}_j)}{\partial \theta_m} = \left(y_j - g(x_j; \theta)\right) \phi(z_m), \quad \forall m \]
\[ \frac{\partial E(\theta; y_j, \hat{y}_j)}{\partial \theta_{0,m}} = \left(y_j - g(x_j; \theta)\right) \theta_m \phi'(z_m), \quad \forall m \]
\[ \frac{\partial E(\theta; y_j, \hat{y}_j)}{\partial \theta_{n,m}} = \left(y_j - g(x_j; \theta)\right) \theta_m x_n \phi'(z_m), \quad \forall n, m \]
where φ'(z) is the derivative of the activation function.
• Back propagation will be particularly important below when we introduce multiple layers.
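• A minimal sketch of these gradient expressions for one observation (x_j, y_j), assuming a sigmoid φ (so that φ'(z) = φ(z)(1 − φ(z))) and following the sign convention above; parameter names are illustrative:
```python
import numpy as np

def phi(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_gradients(x, y, theta0, theta_m, theta0_m, W):
    z = theta0_m + W @ x                     # z_m for each neuron
    a = phi(z)
    y_hat = theta0 + theta_m @ a             # g(x_j; theta)
    err = y - y_hat                          # y_j - g(x_j; theta)
    dphi = a * (1.0 - a)                     # phi'(z_m) for the sigmoid
    d_theta0 = err
    d_theta_m = err * a                      # (y_j - g) phi(z_m)
    d_theta0_m = err * theta_m * dphi        # (y_j - g) theta_m phi'(z_m)
    d_W = np.outer(err * theta_m * dphi, x)  # (y_j - g) theta_m x_n phi'(z_m)
    return d_theta0, d_theta_m, d_theta0_m, d_W
```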
32
An approach to minimization
• One approach to optimization is to minimize a local model that approximates the true objective
function.
• The local model can be a first- or second-order Taylor approximation of the objective function.
• We can use this result to build a descent direction iteration if we know A and b (or we have
approximations to them).
33
Descent direction iteration
• Starting at point θ^(1), a descent direction algorithm generates a sequence of steps (called iterates) that
converge to a local minimum.
1. At iteration k, check whether θ^(k) satisfies the termination condition. If so, stop; otherwise go to step 2.
2. Determine the descent direction d^(k) using local information such as the gradient or the Hessian.
34
Gradient descent method
• A natural choice for d is the direction of steepest descent (first proposed by Cauchy).
• The direction of steepest descent is given by the direction opposite the gradient ∇E(θ). Thus, a.k.a.
steepest descent.
• If the function is smooth and the step size is small, the method leads to improvement (as long as the
gradient is not zero).
• Under this step size choice, it can be shown d(k+1) and d(k) are orthogonal.
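• A minimal gradient descent sketch (the quadratic objective, A, b, the fixed step size, and the tolerance are illustrative assumptions):
```python
import numpy as np

def gradient_descent(grad, theta, alpha=0.1, tol=1e-8, max_iter=10_000):
    for _ in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) < tol:   # termination condition
            return theta
        theta = theta - alpha * g     # step in the direction opposite the gradient
    return theta

# Example: minimize E(theta) = 0.5 theta'A theta - b'theta, whose gradient is A theta - b
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
theta_star = gradient_descent(lambda t: A @ t - b, np.zeros(2))
print(theta_star, np.linalg.solve(A, b))   # the two should agree
```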
35
Steepest descent method
36
Conjugate descent method
• Gradient descent can perform poorly in narrow valleys (it may require many steps to make progress).
• The conjugate gradient method overcomes this problem by constructing each new search direction to be
conjugate to the old gradient and to all previous directions traversed.
• In the first iteration, set d^(1) = −g(θ^(1)) and θ^(2) = θ^(1) + α^(1) d^(1). Here, α^(1) is arbitrary.
37
Conjugate descent method
38
Conjugate descent method
1. Fletcher–Reeves:
\[ \beta^{(k)} = \frac{g^{(k)T} g^{(k)}}{g^{(k-1)T} g^{(k-1)}} \]
2. Polak–Ribière:
\[ \beta^{(k)} = \frac{g^{(k)T} \left(g^{(k)} - g^{(k-1)}\right)}{g^{(k-1)T} g^{(k-1)}} \]
• If the function to minimize has flat areas, one can introduce a momentum update equation:
\[ v^{(k+1)} = \beta v^{(k)} - \alpha g^{(k)} \]
\[ \theta^{(k+1)} = \theta^{(k)} + v^{(k+1)} \]
• Intuitively, the momentum update is like a ball rolling down an almost horizontal surface.
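• A one-step sketch of the momentum update (β, α, and the gradient function g are placeholders):
```python
def momentum_step(theta, v, g, alpha=0.01, beta=0.9):
    v_new = beta * v - alpha * g(theta)   # v^(k+1) = beta v^(k) - alpha g^(k)
    return theta + v_new, v_new           # theta^(k+1) = theta^(k) + v^(k+1)
```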
Stochastic gradient descent and minibatch
• Even with back propagation, evaluating the gradient for the whole training set can be costly.
• An additional advantage: the noise in the stochastic gradients helps the algorithm escape saddle points and navigate flat regions (see the figure below).
• A compromise between using the whole training set and pure stochastic gradient descent: minibatch
gradient descent.
• You can offload the algorithm to a graphics processing unit (GPU) or a tensor processing unit (TPU)
instead of a standard CPU.
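• A minimal sketch of minibatch gradient descent (the helper grad, assumed to return the average gradient of the loss over a batch, and all hyperparameters are illustrative):
```python
import numpy as np

def minibatch_sgd(grad, theta, data, alpha=0.01, batch_size=32, epochs=10, seed=0):
    # data: array with one observation per row; grad(theta, batch) returns the
    # average gradient of the loss over the observations in `batch`.
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)                      # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = data[order[start:start + batch_size]]
            theta = theta - alpha * grad(theta, batch)  # step on the mini-batch gradient
    return theta
```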
40
41
[Figure 2-7: the stochastic error surface fluctuates with respect to the batch error surface, enabling saddle-point avoidance and significantly improving our ability to navigate flat regions.]
42
Alternative minimization algorithms
1. More sophisticated stochastic gradient descent: Adam (Adaptive Moment Estimation). It uses
running averages of both the gradients and the second moments of the gradients (see the sketch after this list).
2. Newton and quasi-Newton methods are unlikely to be of much use in practice. Why?
3. MCMC/simulated annealing.
4. Genetic algorithms:
• In fact, much of the research in deep learning incorporates some flavor of genetic selection.
• Basic idea.
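• A sketch of one Adam update (Kingma and Ba, 2015), using the usual default hyperparameters; m and v hold the running averages of the gradients and of the squared gradients:
```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g            # running average of gradients
    v = beta2 * v + (1 - beta2) * g**2         # running average of squared gradients
    m_hat = m / (1 - beta1**t)                 # bias corrections (t starts at 1)
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```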
43
Further ideas
44
45
Multiple layers I
\[ z_m^1 = \theta_{0,m}^1 + \sum_{n=1}^{N} \theta_{n,m}^1 x_n \]
and
\[ z_m^2 = \theta_{0,m}^2 + \sum_{m=1}^{M} \theta_m^2 \phi\left(z_m^1\right) \]
...
\[ y \cong g(x; \theta) = \theta_0^K + \sum_{m=1}^{M} \theta_m^K \phi\left(z_m^{K-1}\right) \]
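• A sketch of the forward pass implied by this recursion (using the same activation φ in every layer and illustrative widths; in general the M^(j) and the φ_j can differ across layers, and the output layer here is linear):
```python
import numpy as np

def phi(z):
    return np.tanh(z)

def forward(x, layers):
    # layers: list of (bias, weight matrix) pairs, one per layer
    a = x
    for b, W in layers[:-1]:
        a = phi(b + W @ a)        # z^j = theta^j_{0,m} + weighted sum, then phi
    b, W = layers[-1]
    return b + W @ a              # y = theta_0^K + sum_m theta_m^K phi(z_m^{K-1})

rng = np.random.default_rng(1)
widths = [3, 8, 8, 1]             # N inputs, two hidden layers, scalar output
layers = [(rng.normal(size=m), rng.normal(size=(m, n)))
          for n, m in zip(widths[:-1], widths[1:])]
print(forward(rng.normal(size=3), layers))
```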
46
[Diagram: a multilayer network with inputs x_1, x_2, x_3.]
1. It works! Our brains have six layers. AlphaGo has 12 layers with ReLUs.
• We can have different M’s in each layer ⇒ fewer neurons in higher layers allow for compression of
learning into fewer features.
• Or even to produce, as output, a probability distribution, for example, using a softmax layer:
\[ y_m = \frac{e^{z_m^{K-1}}}{\sum_{m=1}^{M} e^{z_m^{K-1}}} \]
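• A minimal softmax sketch (subtracting the maximum before exponentiating is a standard numerical safeguard and does not change the result):
```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))   # a probability distribution: sums to one
```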
48
Application to Economics
Solving high-dimensional dynamic programming problems using Deep Learning
• Our goal is to solve the recursive continuous-time Hamilton-Jacobi-Bellman (HJB) equation globally:
\[ \rho V(x) = \max_{\alpha} \; r(x, \alpha) + \nabla_x V(x) f(x, \alpha) + \frac{1}{2} \operatorname{tr}\left(\sigma(x)^T \Delta_x V(x) \sigma(x)\right) \]
s.t. G(x, α) ≤ 0 and H(x, α) = 0
49
Neural networks
• We could think about the approach as just one large neural network with multiple outputs.
50
Error criterion I
\[ \operatorname{err}_{\alpha}(x; \Theta) \equiv \frac{\partial r(x, \tilde{\alpha}(x; \Theta_\alpha))}{\partial \alpha} + D_\alpha f(x, \tilde{\alpha}(x; \Theta_\alpha))^T \nabla_x \tilde{V}(x; \Theta_V) - D_\alpha G(x, \tilde{\alpha}(x; \Theta_\alpha))^T \tilde{\mu}(x; \Theta_\mu) - D_\alpha H(x, \tilde{\alpha}(x; \Theta_\alpha))^T \tilde{\lambda}(x; \Theta_\lambda), \]
where D_α G ∈ R^{L_1×M}, D_α H ∈ R^{L_2×M}, and D_α f ∈ R^{N×M} are the submatrices of the Jacobian
matrices of G, H, and f, respectively, containing the derivatives with respect to α.
51
Error criterion II
• We combine these four errors by using the squared error as our loss criterion:
\[ \mathcal{E}(x; \Theta) \equiv \left\| \operatorname{err}_{HJB}(x; \Theta) \right\|_2^2 + \left\| \operatorname{err}_{\alpha}(x; \Theta) \right\|_2^2 + \left\| \operatorname{err}_{PF_1}(x; \Theta) \right\|_2^2 + \left\| \operatorname{err}_{PF_2}(x; \Theta) \right\|_2^2 + \left\| \operatorname{err}_{DF}(x; \Theta) \right\|_2^2 + \left\| \operatorname{err}_{CS}(x; \Theta) \right\|_2^2 \]
52
Training
• We train our neural networks by minimizing the above error criterion through mini-batch gradient
descent over points drawn from the ergodic distribution of the state vector.
• The efficient implementation of this last step is the key to the success of our algorithm.
• We start by initializing our network weights and we perform K learning steps called epochs, where K
can be chosen in a variety of ways.
• For each epoch, we draw I points from the state space by simulating from the ergodic distribution.
• Then, we randomly split this sample into B mini-batches of size S. For each mini-batch, we define
the mini-batch error by averaging the loss function over the batch.
• Finally, we perform mini-batch gradient descent for all network weights, with ηk being the learning
rate in the k-th epoch.
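• A high-level sketch of this training loop; simulate_ergodic, minibatch_grad, and the parameter vector Theta are placeholders for the model-specific pieces, and K, I, S, and the learning-rate schedule are illustrative:
```python
import numpy as np

def train(Theta, simulate_ergodic, minibatch_grad,
          K=100, I=4096, S=128, eta=lambda k: 1e-3, seed=0):
    rng = np.random.default_rng(seed)
    for k in range(K):                                 # K epochs
        X = simulate_ergodic(I)                        # I draws from the ergodic distribution
        order = rng.permutation(I)
        for start in range(0, I, S):                   # B = I / S mini-batches of size S
            batch = X[order[start:start + S]]
            Theta = Theta - eta(k) * minibatch_grad(Theta, batch)  # gradient step
    return Theta
```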
53
An Example
The continuous-time neoclassical growth model I
• We start with the continuous-time neoclassical growth model because it has closed-form solutions for
the policy functions, which allows us to focus our attention on the analysis of the value function
approximation.
• We can then back out the policy function from this approach and compare it to the results of the
next step in which we approximate the policy functions themselves with a neural net.
• A single agent decides whether to save in capital or to consume, with the HJB equation:
\[ \rho V(k) = \max_{c} \; U(c) + V'(k) \left( F(k) - \delta k - c \right) \]
• Notice that c = (U')^{-1}(V'(k)). With CRRA utility, this simplifies further to c = (V'(k))^{-1/γ}.
54
The continuous-time neoclassical growth model II
• We approximate the value function V (k) with a neural network, Ṽ (k; Θ) with an “HJB error”:
\[ \operatorname{err}_{HJB} = \rho \tilde{V}(k; \Theta) - U\left( (U')^{-1}\left( \frac{\partial \tilde{V}(k; \Theta)}{\partial k} \right) \right) - \frac{\partial \tilde{V}(k; \Theta)}{\partial k} \left[ F(k) - \delta k - (U')^{-1}\left( \frac{\partial \tilde{V}(k; \Theta)}{\partial k} \right) \right] \]
• Details:
1. 3 layers.
3. tanh(x) activation.
4. Normal initialization N(0, √(2/(n_input + n_output))) with input normalization.
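• A sketch of the HJB error above for illustrative parameter values, assuming CRRA utility with coefficient γ and F(k) = k^α; V_tilde stands in for the neural-network approximation, and its derivative is taken here by a central finite difference:
```python
import numpy as np

rho, alpha, delta, gamma = 0.04, 0.33, 0.05, 2.0   # illustrative parameters

def F(k):
    return k**alpha

def U(c):
    return c**(1 - gamma) / (1 - gamma)             # CRRA utility

def V_tilde(k):
    # Placeholder candidate value function; in the algorithm this is V(k; Theta).
    return np.log(k + 1.0)

def hjb_error(k, h=1e-5):
    dV = (V_tilde(k + h) - V_tilde(k - h)) / (2 * h)   # dV/dk by central difference
    c = dV**(-1.0 / gamma)                             # c = (U')^{-1}(V'(k)) under CRRA
    return rho * V_tilde(k) - U(c) - dV * (F(k) - delta * k - c)

print(hjb_error(np.linspace(0.5, 10.0, 5)))
```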
55
(a) Value with closed-form policy 56
(c) Consumption with closed-form policy 57
(e) HJB error with closed-form policy 58
Approximating the policy function
• Let us not use the closed-form consumption policy function but rather approximate said policy
function directly with a policy neural network C̃ (k; ΘC ).
59
(b) Value with policy approximation 60
(d) Consumption with policy approximation 61
(f) HJB error with policy approximation 62
(g) Policy error with policy approximation 63
Alternative ANNs
64
[Figure: a 2-D convolution.
Input (3×4):        Kernel (2×2):
a b c d             w x
e f g h             y z
i j k l
Output (2×3):
aw+bx+ey+fz   bw+cx+fy+gz   cw+dx+gy+hz
ew+fx+iy+jz   fw+gx+jy+kz   gw+hx+ky+lz]
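• The operation in the figure, written as a 2-D “valid” convolution (implemented, as in the figure, as a cross-correlation): each output entry is the kernel multiplied elementwise with the corresponding input patch and summed.
```python
import numpy as np

def conv2d_valid(inp, kernel):
    H, W = inp.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(inp[i:i + kh, j:j + kw] * kernel)
    return out

inp = np.arange(12, dtype=float).reshape(3, 4)     # plays the role of a, b, ..., l
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])       # plays the role of w, x, y, z
print(conv2d_valid(inp, kernel))                   # a 2 x 3 output, as in the figure
```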
65
66
67
Reinforcement learning
• Main idea: algorithms that use training information that evaluates the actions taken rather than
instructing the learner with the correct actions.
• Purely evaluative feedback assesses how good the action taken was, but not whether it was the best
feasible action.
• Useful when:
1. The dynamics of the state are unknown but simulation is easy: model-free vs. model-based reinforcement
learning.
2. Or the dimensionality is so high that we cannot store the information about the DP in a table.
• These methods work surprisingly well in a wide range of situations, although none is guaranteed to
work.
• Key for success in economic applications: the ability to simulate fast (link with massive parallelization).
Also, it complements neural networks very well.
68
Comparison with alternative methods
• Similar (same?) ideas are called approximate dynamic programming or neuro-dynamic programming.
• Supervised learning: purely instructive feedback that indicates best feasible action regardless of
action actually taken.
69
70
71
Example: Multi-armed bandit problem
• But you do not know which action is best, you only have estimates of your value function (dual
control problem of identification and optimization).
• These ideas go back to the study of “sequential design of experiments” by Thompson (1933, 1934) and Bellman
(1956).
72
73
Theory vs. practice
1. Follow greedy actions: actions with highest expected value. This is known as exploiting.
2. Follow non-greedy actions: actions with dominated expected value. This is known as exploring.
• This should remind you of a basic dynamic programming problem: what is the optimal mix of pure
strategies?
• If we impose enough structure on the problem (i.e., distributions of payoffs belong to some family,
stationarity, etc.), we can solve (either theoretically or applying standard solution techniques) the
optimal strategy (at least, up to some upper bound on computational capabilities).
• But these structures are too restrictive for practical purposes outside the pages of Econometrica.
74
A policy-based method I
• A very simple method uses the sample averages Q_n(a) of the rewards R_i(a), i = 1, ..., n − 1, actually received:
\[ Q_n(a) = \frac{1}{n-1} \sum_{i=1}^{n-1} R_i(a) \]
• We start with Q_0(a) = 0 for all a. Here (and later), we randomize among ties.
• We update Q_n(a) thanks to a nice recursive update based on the linearity of means:
\[ Q_{n+1}(a) = Q_n(a) + \frac{1}{n} \left[ R_n(a) - Q_n(a) \right] \]
Averages of actions not picked are not updated.
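• A sketch of an ε-greedy bandit using this sample-average update on a 10-armed testbed with Gaussian rewards (mirroring the figures that follow); ε, the horizon, and the reward distribution are illustrative:
```python
import numpy as np

def run_bandit(q_star, steps=1000, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    k = len(q_star)
    Q = np.zeros(k)                                       # Q_0(a) = 0 for all a
    N = np.zeros(k)                                       # times each action was picked
    for _ in range(steps):
        if rng.random() < eps:
            a = rng.integers(k)                           # explore: random action
        else:
            a = rng.choice(np.flatnonzero(Q == Q.max()))  # exploit, randomizing ties
        R = q_star[a] + rng.normal()                      # noisy reward around q*(a)
        N[a] += 1
        Q[a] += (R - Q[a]) / N[a]                         # Q_{n+1} = Q_n + (R_n - Q_n)/n
    return Q

q_star = np.random.default_rng(1).normal(size=10)
print(np.round(run_bandit(q_star), 2))
```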
75
A policy-based method II
76
[Figure: the 10-armed testbed. Violin plots of the reward distribution of each action, centered at q*(1), ..., q*(10); rewards on the vertical axis, actions on the horizontal axis.]
77
[Figure: average performance of ε-greedy action-value methods on the 10-armed testbed, for ε = 0 (greedy), ε = 0.01, and ε = 0.1. Top panel: average reward over 1000 steps; bottom panel: % optimal action over 1000 steps.]
78
A more general update rule
• More generally, we can use a constant step size α ∈ (0, 1]:
\[ Q_{n+1}(a) = Q_n(a) + \alpha \left[ R_n(a) - Q_n(a) \right] \]
• We can also have a time-varying α_n(a), but, to ensure convergence with probability 1, we need:
\[ \sum_{n=1}^{\infty} \alpha_n(a) = \infty \]
\[ \sum_{n=1}^{\infty} \alpha_n^2(a) < \infty \]
79
Improving the algorithm
[Figure 2.3: the effect of optimistic initial action-value estimates on the 10-armed testbed. Both methods used a constant step-size parameter α = 0.1.]
81
[Figure 2.4: average performance of UCB action selection on the 10-armed testbed. UCB generally performs better than ε-greedy action selection (ε = 0.1), except in the first k steps.]
82
[Figure 2.6: a parameter study of the various bandit algorithms (ε-greedy, gradient bandit, UCB, greedy with optimistic initialization, α = 0.1): average reward over the first 1000 steps as a function of ε, α, c, and Q_0.]
83
Other algorithms
• Value-based methods.
• Actor-critic methods.
84