
Deep learning and reinforcement learning

Jesús Fernández-Villaverde¹ and Galo Nuño²

October 15, 2021

¹ University of Pennsylvania
² Banco de España
A short introduction
The problem

• Let us suppose we want to approximate an unknown function:

y = f (x)

where y is a scalar and x = {x1 , x2 , ..., xN } a vector.

• We care about the case when N is large.

• Easy to generalize to the case where y is a vector (or a probability distribution), but notation
becomes cumbersome.

• In economics, f (x) can be a value function, a policy function, a pricing kernel, a conditional
expectation, a classifier, ...

1
A neural network

• An artificial neural network (a.k.a. ANN or connectionist system) is an approximation to f (x) built as
a linear combination of M generalized linear models of x of the form:
  y ≈ g^{NN}(x; θ) = θ_0 + \sum_{m=1}^{M} θ_m φ(z_m)

  where φ(·) is an arbitrary activation function and:

  z_m = θ_{0,m} + \sum_{n=1}^{N} θ_{n,m} x_n

• M is known as the width of the model.

• We can select θ such that g NN (x; θ) is as close to f (x) as possible given some relevant metric (e.g.,
L2 norm).

• This is known as “training” the network.


2
Comparison with other approximations

• Compare:

  y ≈ g^{NN}(x; θ) = θ_0 + \sum_{m=1}^{M} θ_m φ(θ_{0,m} + \sum_{n=1}^{N} θ_{n,m} x_n)

  with a standard projection:

  y ≈ g^{CP}(x; θ) = θ_0 + \sum_{m=1}^{M} θ_m φ_m(x)

  where φ_m is, for example, a Chebyshev polynomial.

• We exchange the rich parameterization of coefficients for the parsimony of basis functions.

• Later, we will explain why this is often a good idea.

• How we determine the coefficients will also be different, but this is somewhat less important.

3
Deep learning

• A deep learning network is an acyclic multilayer composition of J > 1 neural networks:


   
  y ≈ g^{DL}(x; θ) = g^{NN(1)}(g^{NN(2)}(...; θ^{(2)}); θ^{(1)})

  where the M^{(1)}, M^{(2)}, ... and φ_1(·), φ_2(·), ... are possibly different across each layer of the network.

• Sometimes known as deep feedforward neural networks or multilayer perceptrons.

• “Feedforward” comes from the fact that the composition of neural networks can be represented as a
directed acyclic graph, which lacks feedback. We can have more general recurrent structures.

• J is known as the depth of the network. The case J = 1 is a standard neural network.

• As before, we can select θ such that g DL (x; θ) approximates a target function f (x) as closely as
possible under some relevant metric.

4
Why are neural networks a good solution method in economics?

• From now on, I will refer to neural networks as including both single and multilayer networks.

• With suitable choices of activation functions, neural networks can efficiently approximate extremely
complex functions.

• In particular, under certain (relatively weak) conditions:

1. Neural networks are universal approximators.

2. Neural networks break the “curse of dimensionality.”

• Furthermore, neural networks are easy to code, stable, and scalable for multiprocessing.

• Thus, neural networks have considerable option value as solution methods in economics.

5
Current interest

• Currently, neural networks are among the most active areas of research in computer science and
applied math.

• While the original idea goes back to the 1940s, neural networks were rediscovered in the second half of
the 2000s.

• Why?

1. Suddenly, the large amounts of computation and data required to train the networks efficiently
became available at a reasonable cost.

2. New algorithms such as back propagation through gradient descent became popular.

• Some well-known successes and industrial applications.

6
AlphaGo

• Big splash: AlphaGo vs. Lee Sedol in March 2016.

• Silver et al. (2018): now applied to chess, shogi, Go, and StarCraft II.

• Check also:

1. https://deepmind.com/research/alphago/.

2. https://www.alphagomovie.com/

3. https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii

• Very different from Deep Blue against Kasparov.

• New and surprising strategies.

• However, you need to keep this accomplishment in perspective.


8
[Figure 1 (Silver et al.): AlphaGo's neural network training pipeline and architecture. A fast rollout policy p_π and a supervised learning (SL) policy network p_σ are trained to predict human expert moves; a reinforcement learning (RL) policy network p_ρ is initialized to the SL network and improved by policy gradient learning against previous versions of itself; a value network v_θ is trained by regression on self-play positions to predict the expected outcome.]

[Figure (Silver et al., Science 362, 1140–1144, 2018): AlphaZero trained for 700,000 steps, with Elo ratings computed from games at 1 s per move. Panels compare AlphaZero with the 2016 TCEC champion Stockfish (chess), the 2017 CSA champion Elmo (shogi), and AlphaGo Lee / AlphaGo Zero (Go).]
Further advantages

• Neural networks and deep learning often require less “inside knowledge” from experts in the area.

• Results can be highly counter-intuitive and yet deliver excellent performance.

• Outstanding open source libraries: Tensorflow, Pytorch, Flux.

• More recently, the development of dedicated hardware (TPUs, AI accelerators, FPGAs) is likely to
maintain an edge for the area.

• The width of an ecosystem is key for its long-run success.

11
Limitations of neural networks and deep learning

• While neural networks and deep learning can work extremely well, there is no such thing as a silver
bullet.

• Clear and serious trade-offs in real-life applications.

• A rule of thumb in the industry is that one needs around 10^7 labeled observations to properly train a
complex ANN, with around 10^4 observations in each relevant group.

• Of course, sometimes “observations” are endogenous (we can simulate them), but if your goal is to
forecast GDP next quarter, it is unlikely a neural network will beat an ARIMA(p,d,q) (at least with only
macro variables).

• Issues of interpretation.

13
Digging deeper
More details on neural networks

• Non-linear functional approximation method.

• Much hype around them and an over-emphasis on the biological interpretation.

• We will follow a more sober formal treatment (which, in any case, agrees with the approach of
state-of-the-art researchers).

• In particular, we will highlight connections with econometrics (e.g., nonlinear least squares,
semiparametric regression, and sieves).

• We will start by describing the simplest possible neural network.

15
A neuron

• N observables: x1 , x2 ,...,xN . We stack them in x.

• Coefficients (or weights): θ0 (a constant), θ1 , θ2 , ...,θN . We stack them in θ.

• We build a linear combination of observations:

  z = θ_0 + \sum_{n=1}^{N} θ_n x_n

  Theoretically, we could build non-linear combinations, but it is unlikely to be a fruitful idea in general.

• We transform such a linear combination with an activation function:

  y = g(x; θ) = φ(z)

  The activation function might have some coefficients γ of its own.

• Why do we need an activation function?


16
Flow representation

[Figure: flow representation of a single neuron. Inputs x_1, ..., x_n with weights θ_1, ..., θ_n are combined into the net input \sum_{i=1}^{n} θ_i x_i, passed through the activation function (with coefficient γ), and produce the perceptron classification output.]
17
The biological analog

18
Activation functions I

• Traditionally:

1. Identity function:

   φ(z) = z

   Used in linear regression.

2. A sigmoidal function:

   φ(z) = 1 / (1 + e^{-z})

   A particular limiting case, as z grows quickly, is the step function.

3. Hyperbolic tangent:

   φ(z) = (e^{2z} - 1) / (e^{2z} + 1)

19
[Figure: the sigmoidal activation function over z ∈ [-4, 4].]
[Figure: the hyperbolic tangent activation function over z ∈ [-4, 4].]
Activation functions II

• Some activation functions that have gained popularity recently:

1. Rectified linear unit (ReLU):


φ (z) = max(0, z)

2. Parametric ReLU:
φ (z) = max(z, az)

3. Softplus:
   φ(z) = log(1 + e^z)

22
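As a quick reference, here is a minimal NumPy sketch of the activation functions listed above; the slope a in the parametric ReLU is an illustrative choice and is not taken from the slides.

import numpy as np

def identity(z):
    # Identity activation: used in linear regression.
    return z

def sigmoid(z):
    # Sigmoidal activation: 1 / (1 + e^{-z}).
    return 1.0 / (1.0 + np.exp(-z))

def hyperbolic_tangent(z):
    # Hyperbolic tangent: (e^{2z} - 1) / (e^{2z} + 1).
    return np.tanh(z)

def relu(z):
    # Rectified linear unit: max(0, z).
    return np.maximum(0.0, z)

def parametric_relu(z, a=0.1):
    # Parametric ReLU: max(z, a*z), with slope a < 1 on the negative side.
    return np.maximum(z, a * z)

def softplus(z):
    # Softplus: log(1 + e^z); log1p keeps accuracy when e^z is very small.
    return np.log1p(np.exp(z))

z = np.linspace(-4.0, 4.0, 9)
print(relu(z))
print(softplus(z))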
[Figure: the ReLU and softplus activation functions over z ∈ [-4, 4].]
Interpretation

• θ0 controls the activation threshold.

• The level of the θ_i's for i > 0 controls the activation rate (the higher the θ_i's, the harder the
activation).

• Some textbooks separate the activation threshold and scaling coefficients from θ as different
coefficients in φ, but such separation moves notation farther away from standard econometrics.

• Potential identification problem between θ and more general activation functions with their own
parameters.

• But in practice θ does not have a structural interpretation, so the identification problem is of
secondary importance.

• As mentioned in the introduction, a neuron closely resembles a generalized linear model in
econometrics.
24
Combining neurons into a neural network

• As before, we have N observables: x1 , x2 ,...,xN .

• Coefficients (or weights): θ0,m (a constant), θ1,m , θ2,m , ...,θN,m .

• We build M linear combinations of observations:

  z_m = θ_{0,m} + \sum_{n=1}^{N} θ_{n,m} x_n

• We transform and add such linear combinations with an activation function:

  y ≈ g(x; θ) = θ_0 + \sum_{m=1}^{M} θ_m φ(z_m)

• Also, quasi-linear structure in terms of vectors of observables and coefficients.

• This is known as a single layer network.


26
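To make the notation concrete, below is a minimal NumPy sketch of the single-layer network g(x; θ) defined above, with a sigmoid activation. The parameter shapes, the random initialization, and the example sizes are illustrative choices, not part of the slides.

import numpy as np

def single_layer_network(x, theta0, theta, Theta0, Theta):
    """Evaluate y ≈ g(x; θ) = θ0 + Σ_m θ_m φ(z_m), with z_m = θ_{0,m} + Σ_n θ_{n,m} x_n.

    x      : (N,)   vector of observables
    theta0 : scalar outer constant θ_0
    theta  : (M,)   outer coefficients θ_m
    Theta0 : (M,)   inner constants θ_{0,m}
    Theta  : (N, M) inner coefficients θ_{n,m}
    """
    z = Theta0 + x @ Theta          # M linear combinations of the observables
    phi = 1.0 / (1.0 + np.exp(-z))  # sigmoid activation applied element-wise
    return theta0 + phi @ theta     # linear combination of the M activated units

# Illustrative example with N = 3 inputs and width M = 5.
rng = np.random.default_rng(0)
N, M = 3, 5
x = rng.normal(size=N)
y_hat = single_layer_network(x, rng.normal(), rng.normal(size=M),
                             rng.normal(size=M), rng.normal(size=(N, M)))
print(y_hat)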
Two classic (yet remarkable) results I

Borel measurable function


A map f : X → Y between two topological spaces is called Borel measurable if f^{-1}(A) is a Borel set for
any open set A on Y (the Borel sets are all the sets that can be built from the open sets through the
operations of countable union, countable intersection, and relative complement).

Universal approximation theorem: Hornik, Stinchcombe, and White (1989)


A neural network with at least one hidden layer can approximate any Borel measurable function mapping
finite-dimensional spaces to any desired degree of accuracy.

• Intuition of the result.

• Comparison with other results in series approximations.

27
Two classic (yet remarkable) results II

• Assume, as well, that we are dealing with the class of functions for which the Fourier transform of
their gradient is integrable.

Breaking the curse of dimensionality: Barron (1993)


A one-layer NN achieves integrated square errors of order O(1/M), where M is the number of nodes. In
comparison, for series approximations, the integrated square error is of order O(1/M^{2/N}), where N is
the dimension of the function to be approximated.

• More general theorems by Leshno et al. (1993) and Bach (2017).

• What about Chebyshev polynomials? Splines? Problems of convergence and extrapolation.

• There is another, yet more subtle curse of dimensionality.

30
Training the network

• θ is selected to minimize the quadratic error function E(θ; Y, ŷ):

  θ* = arg min_θ E(θ; Y, ŷ)
     = arg min_θ \sum_{j=1}^{J} E(θ; y_j, ŷ_j)
     = arg min_θ (1/2) \sum_{j=1}^{J} ||y_j - g(x_j; θ)||^2

• Where do the observations Y come from? Observed data vs. simulated epochs.

• How do we solve this minimization problem?

• Other objective functions are possible.

31
Back propagation

• In general, we can easily calculate E(θ*; Y, ŷ) and ∇E(θ*; Y, ŷ) for a given θ*.

• In particular, for the gradient, we can use back propagation (Rumelhart et al., 1986):

  ∂E(θ; y_j, ŷ_j)/∂θ_0     = y_j - g(x_j; θ)

  ∂E(θ; y_j, ŷ_j)/∂θ_m     = (y_j - g(x_j; θ)) φ(z_m),           for ∀m

  ∂E(θ; y_j, ŷ_j)/∂θ_{0,m} = (y_j - g(x_j; θ)) θ_m φ'(z_m),      for ∀m

  ∂E(θ; y_j, ŷ_j)/∂θ_{n,m} = (y_j - g(x_j; θ)) θ_m x_n φ'(z_m),  for ∀n, m

  where φ'(z) is the derivative of the activation function.

• The derivative φ0 (z) will be trivial to evaluate if we use a ReLU.

• Back propagation will be particularly important below when we introduce multiple layers.
32
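The derivatives above can be coded directly. Below is a minimal sketch for the single-layer network with a quadratic loss on one observation; it uses the sign convention implied by E = ½(g(x; θ) − y)², and the sigmoid (with φ'(z) = φ(z)(1 − φ(z))) is an illustrative choice of activation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_single_layer(x, y, theta0, theta, Theta0, Theta):
    """Gradients of E = 0.5 * (g(x; θ) - y)^2 for the single-layer network."""
    z = Theta0 + x @ Theta            # (M,) pre-activations z_m
    phi = sigmoid(z)                  # (M,) activations φ(z_m)
    g = theta0 + phi @ theta          # scalar prediction g(x; θ)
    resid = g - y                     # dE/dg

    dphi = phi * (1.0 - phi)          # φ'(z_m) for the sigmoid
    grad_theta0 = resid                               # dE/dθ_0
    grad_theta = resid * phi                          # dE/dθ_m
    grad_Theta0 = resid * theta * dphi                # dE/dθ_{0,m}
    grad_Theta = resid * np.outer(x, theta * dphi)    # dE/dθ_{n,m}
    return grad_theta0, grad_theta, grad_Theta0, grad_Theta

rng = np.random.default_rng(1)
N, M = 3, 4
grads = backprop_single_layer(rng.normal(size=N), 0.7, rng.normal(),
                              rng.normal(size=M), rng.normal(size=M),
                              rng.normal(size=(N, M)))
print([np.shape(g) for g in grads])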
An approach to minimization

• One approach to optimization is to minimize a local model that approximates the true objective
function.

• The local model can be a first- or second-order Taylor approximation of the objective function.

• For example, suppose a function E is roughly approximated as a quadratic form:


  E(θ) ≈ (1/2) θ^T A θ - b^T θ + c

  where A is a square, symmetric, positive-definite matrix.

• Then E(θ) is minimized by the solution to:


Aθ = b

• We can use this result to build a descent direction iteration if we know A and b (or we have
approximations to them).
33
Descent direction iteration

• Starting at point θ^{(1)}, a descent direction algorithm generates a sequence of steps (called iterates)
that converge to a local minimum.

• The descent direction iteration algorithm:

1. At iteration k, check whether θ^{(k)} satisfies the termination condition. If so, stop; otherwise go to step 2.

2. Determine the descent direction d(k) using local information such as gradient or Hessian.

3. Compute step size α(k) .

4. Compute the next candidate point: θ(k+1) ← θ(k) + α(k) d(k) .

• Choice of α and d determines the flavor of the algorithm.

34
Gradient descent method

• A natural choice for d is the direction of steepest descent (first proposed by Cauchy).

• The direction of steepest descent is given by the direction opposite the gradient ∇E(θ). Thus, a.k.a.
steepest descent.

• If the function is smooth and the step size small, the method leads to improvement (as long as the
gradient is not zero).

• The normalized direction of steepest descent is:

  d^{(k)} = -∇E(θ^{(k)}) / ||∇E(θ^{(k)})||

• One way to set the step size is to solve:

  α^{(k)} = arg min_α E(θ^{(k)} + α d^{(k)})

• Under this step size choice, it can be shown d(k+1) and d(k) are orthogonal.
35
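As an illustration, here is a minimal sketch of steepest descent on the quadratic model E(θ) = ½θᵀAθ − bᵀθ from the previous slides, using the exact line search available for a quadratic; the matrix, vector, and starting point are illustrative choices.

import numpy as np

def steepest_descent(A, b, theta0, tol=1e-10, max_iter=1_000):
    """Steepest descent on E(θ) = 0.5 θ'Aθ - b'θ with exact line search.

    For this quadratic model ∇E(θ) = Aθ - b, and the step size solving
    arg min_α E(θ + α d) with d = -∇E(θ) is α = (g'g) / (g'Ag).
    """
    theta = theta0.copy()
    for _ in range(max_iter):
        g = A @ theta - b                      # gradient ∇E(θ^(k))
        if np.linalg.norm(g) < tol:            # termination condition
            break
        d = -g                                 # steepest-descent direction
        alpha = (g @ g) / (g @ (A @ g))        # exact line search for the quadratic
        theta = theta + alpha * d              # θ^(k+1) = θ^(k) + α^(k) d^(k)
    return theta

A = np.array([[3.0, 0.5], [0.5, 1.0]])         # symmetric, positive definite
b = np.array([1.0, -2.0])
print(steepest_descent(A, b, np.zeros(2)))
print(np.linalg.solve(A, b))                   # the exact minimizer solves Aθ = b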
Steepest descent method

36
Conjugate descent method

• Gradient descent can perform poorly in narrow valleys (it may require many steps to make progress).

• Famous example: Rosenbrock function.

• The conjugate gradient method overcomes this problem by constructing each new direction to be
conjugate to the old gradient and to all previous directions traversed.

• Define g(θ) = ∇E(θ).

• In the first iteration, set d^{(1)} = -g(θ^{(1)}) and θ^{(2)} = θ^{(1)} + α^{(1)} d^{(1)}. Here, α^{(1)} is arbitrary.

• Subsequent iterations set d^{(k+1)} = -g^{(k+1)} + β^{(k)} d^{(k)}.

37
Conjugate descent method

38
Conjugate descent method

• There are two approaches to set β:

1. Fletcher-Reeves:

   β^{(k)} = (g^{(k)T} g^{(k)}) / (g^{(k-1)T} g^{(k-1)})

2. Polak-Ribière:

   β^{(k)} = (g^{(k)T} (g^{(k)} - g^{(k-1)})) / (g^{(k-1)T} g^{(k-1)})

• The Polak-Ribière version requires an automatic reset at every iteration: β ← max(β, 0).

• If the function to minimize has flat areas, one can introduce a momentum update equation:

  v^{(k+1)} = β v^{(k)} - α g^{(k)}
  θ^{(k+1)} = θ^{(k)} + v^{(k+1)}

• The modification reverts to the gradient descent version if β = 0.

• Intuitively, the momentum update is like a ball rolling down an almost horizontal surface.
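Below is a minimal sketch of the momentum update equations above, applied to the same quadratic model used earlier; the objective, β, α, and iteration count are illustrative choices, and with β = 0 the loop reverts to plain gradient descent.

import numpy as np

def momentum_descent(grad, theta0, alpha=0.05, beta=0.9, n_iter=500):
    """Momentum update: v^(k+1) = β v^(k) - α g^(k),  θ^(k+1) = θ^(k) + v^(k+1)."""
    theta = theta0.copy()
    v = np.zeros_like(theta)
    for _ in range(n_iter):
        g = grad(theta)           # gradient at the current iterate
        v = beta * v - alpha * g  # accumulate "velocity"
        theta = theta + v         # take the momentum step
    return theta

A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
grad = lambda theta: A @ theta - b           # ∇E for E(θ) = 0.5 θ'Aθ - b'θ
print(momentum_descent(grad, np.zeros(2)))
print(np.linalg.solve(A, b))                 # exact minimizer, for comparison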
Stochastic gradient descent and minibatch

• Even with back propagation, evaluating the gradient for the whole training set can be costly.

• Stochastic gradient descent: intuition from Monte Carlo methods.

• An additional advantage.

• A compromise between using the whole training set and pure stochastic gradient descent: minibatch
gradient descent.

• This is the most popular algorithm to train neural networks.

• Intuition from GMM. Notice also resilience to scaling.

• In practice, we do not need a global minimum (≠ likelihood estimation).

• You can offload the algorithm to a graphics processing unit (GPU) or a tensor processing unit (TPU)
instead of a standard CPU.
40
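The following is a minimal sketch of minibatch gradient descent for a generic parameterized model. The gradient oracle, batch size, learning rate, and the linear-regression example are illustrative; in practice one would rely on a library optimizer (e.g., in Pytorch or Tensorflow) rather than a hand-rolled loop.

import numpy as np

def minibatch_sgd(grad_fn, theta0, X, Y, lr=0.01, batch_size=32, n_epochs=100, seed=0):
    """Minibatch stochastic gradient descent.

    grad_fn(theta, X_batch, Y_batch) must return the gradient of the average
    loss over the batch with respect to theta.
    """
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    n = X.shape[0]
    for _ in range(n_epochs):
        perm = rng.permutation(n)                      # reshuffle the training set
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]       # one minibatch
            theta -= lr * grad_fn(theta, X[idx], Y[idx])
    return theta

# Illustrative use: linear regression y = Xθ + noise, squared loss.
rng = np.random.default_rng(1)
X = rng.normal(size=(1_000, 3))
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true + 0.1 * rng.normal(size=1_000)
grad_fn = lambda th, Xb, Yb: Xb.T @ (Xb @ th - Yb) / len(Yb)
print(minibatch_sgd(grad_fn, np.zeros(3), X, Y))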
[Figure 2-7: the stochastic error surface fluctuates with respect to the batch error surface, enabling saddle-point avoidance and significantly improving our ability to navigate flat regions.]
Alternative minimization algorithms

1. More sophisticated stochastic gradient descent: Adam (Adaptive Moment Estimation). It uses
running averages of both the gradients and the second moments of the gradients.

2. Newton and Quasi-Newton methods are unlikely to be of much use in practice. Why?

3. MCMC / simulated annealing.

4. Genetic algorithms:

• In fact, much of the research in deep learning incorporates some flavor of genetic selection.

• Basic idea.

43
Further ideas

• Design of the network architecture:

1. Trade-off error/computational time.

2. Better to err on the side of too many M.

• Double descent phenomenon.

44
Multiple layers I

• The hidden layers can be multiplied without limit in a feed-forward ANN.


• We build K layers:

  z_m^1 = θ_{0,m}^1 + \sum_{n=1}^{N} θ_{n,m}^1 x_n

  and

  z_{m'}^2 = θ_{0,m'}^2 + \sum_{m=1}^{M} θ_{m,m'}^2 φ(z_m^1)

  ...

  y ≈ g(x; θ) = θ_0^K + \sum_{m=1}^{M} θ_m^K φ(z_m^{K-1})

46
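A minimal sketch of the K-layer composition above, written as a loop over layers with weight matrices and a shared activation; the layer widths, the tanh activation, and the random initialization are illustrative choices.

import numpy as np

def deep_forward(x, weights, biases, phi=np.tanh):
    """Forward pass of a feed-forward network.

    weights[k] has shape (n_k, n_{k+1}) and biases[k] has shape (n_{k+1},);
    the activation φ is applied in every layer except the last (linear output).
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = phi(b + h @ W)                   # hidden layer: z = θ_0 + θ'h, then φ(z)
    return biases[-1] + h @ weights[-1]      # linear output layer

# Illustrative architecture: 3 inputs, two hidden layers of width 8, scalar output.
rng = np.random.default_rng(2)
sizes = [3, 8, 8, 1]
weights = [rng.normal(scale=1 / np.sqrt(m), size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
x = rng.normal(size=3)
print(deep_forward(x, weights, biases))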
[Figure: a feed-forward network with input layer (x_1, x_2, x_3), two hidden layers, and an output layer.]
Multiple layers II

• Why do we want to introduce hidden layers?

1. It works! Our brains have six layers. AlphaGo has 12 layers with ReLUs.

2. Hidden layers induce highly nonlinear behavior.

3. Allow for clustering of variables.

• We can have different M’s in each layer ⇒ fewer neurons in higher layers allow for compression of
learning into fewer features.

• We can also add multidimensional outputs.

• Or even to produce, as output, a probability distribution, for example, using a softmax layer:

  y_m = e^{z_m^{K-1}} / \sum_{m=1}^{M} e^{z_m^{K-1}}

48
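A minimal sketch of a softmax output layer; subtracting the maximum before exponentiating is a standard numerical-stability device, not something in the slides.

import numpy as np

def softmax(z):
    # y_m = e^{z_m} / Σ_m e^{z_m}; subtracting max(z) leaves the result unchanged
    # but avoids overflow in the exponentials.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, -0.5])
print(softmax(z), softmax(z).sum())   # probabilities that sum to one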
Application to Economics
Solving high-dimensional dynamic programming problems using Deep Learning

• Our goal is to solve the recursive continuous-time Hamilton-Jacobi-Bellman (HJB) equation globally:

  ρV(x) = max_α  r(x, α) + ∇_x V(x) f(x, α) + (1/2) tr[σ(x)^T ∆_x V(x) σ(x)]

  s.t.  G(x, α) ≤ 0  and  H(x, α) = 0

• Think about the cases where we have many state variables.

• Alternatives for this solution?

49
Neural networks

• We define four neural networks:

1. Ṽ(x; Θ_V) : R^N → R to approximate the value function V(x).

2. α̃(x; Θ_α) : R^N → R^M to approximate the policy function α.

3. µ̃(x; Θ_µ) : R^N → R^{L_1} and λ̃(x; Θ_λ) : R^N → R^{L_2} to approximate the Karush-Kuhn-Tucker (KKT)
multipliers µ and λ.

• To simplify notation, we accumulate all weights in the matrix Θ = (ΘV , Θα , Θµ , Θλ ).

• We could think about the approach as just one large neural network with multiple outputs.

50
Error criterion I

• The HJB error:

  err_HJB(x; Θ) ≡ r(x, α̃(x; Θ_α)) + ∇_x Ṽ(x; Θ_V) f(x, α̃(x; Θ_α))
                  + (1/2) tr[σ(x)^T ∆_x Ṽ(x; Θ_V) σ(x)] - ρ Ṽ(x; Θ_V)

• The policy function error:

  err_α(x; Θ) ≡ ∂r(x, α̃(x; Θ_α))/∂α + D_α f(x, α̃(x; Θ_α))^T ∇_x Ṽ(x; Θ_V)
                - D_α G(x, α̃(x; Θ_α))^T µ̃(x; Θ_µ) - D_α H(x, α̃(x; Θ_α))^T λ̃(x; Θ_λ),

  where D_α G ∈ R^{L_1×M}, D_α H ∈ R^{L_2×M}, and D_α f ∈ R^{N×M} are the submatrices of the Jacobian
  matrices of G, H, and f, respectively, containing the derivatives with respect to α.

51
Error criterion II

• The constraint error is itself composed of the primal feasibility errors:

  err_PF1(x; Θ) ≡ max{0, G(x, α̃(x; Θ_α))}
  err_PF2(x; Θ) ≡ H(x, α̃(x; Θ_α))

  the dual feasibility error:

  err_DF(x; Θ) ≡ max{0, -µ̃(x; Θ_µ)}

  and the complementary slackness error:

  err_CS(x; Θ) ≡ µ̃(x; Θ_µ)^T G(x, α̃(x; Θ_α))

• We combine these four errors by using the squared error as our loss criterion:

  E(x; Θ) ≡ ||err_HJB(x; Θ)||_2^2 + ||err_α(x; Θ)||_2^2 + ||err_PF1(x; Θ)||_2^2
            + ||err_PF2(x; Θ)||_2^2 + ||err_DF(x; Θ)||_2^2 + ||err_CS(x; Θ)||_2^2
52
Training

• We train our neural networks by minimizing the above error criterion through mini-batch gradient
descent over points drawn from the ergodic distribution of the state vector.

• The efficient implementation of this last step is the key to the success of our algorithm.

• We start by initializing our network weights and we perform K learning steps called epochs, where K
can be chosen in a variety of ways.

• For each epoch, we draw I points from the state space by simulating from the ergodic distribution.

• Then, we randomly split this sample into B mini-batches of size S. For each mini-batch, we define
the mini-batch error, by averaging the loss function over the batch.

• Finally, we perform mini-batch gradient descent for all network weights, with ηk being the learning
rate in the k-th epoch.

53
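A schematic sketch of this training loop is below. It is only scaffolding, not the authors' code: simulate_ergodic and grad_loss are placeholders (the first draws states from the ergodic distribution, the second returns the gradient of the average criterion E(x; Θ) over a batch), Θ is treated as one flattened parameter vector, and the hyperparameters and learning-rate schedule are arbitrary.

import numpy as np

def train(theta, grad_loss, simulate_ergodic, n_epochs=1_000, n_points=4_096,
          batch_size=128, lr0=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    for k in range(n_epochs):                       # K learning steps ("epochs")
        x = simulate_ergodic(n_points)              # draw I points from the ergodic distribution
        perm = rng.permutation(n_points)            # random split into B mini-batches of size S
        eta_k = lr0 / (1.0 + 0.001 * k)             # learning rate η_k in the k-th epoch
        for start in range(0, n_points, batch_size):
            batch = x[perm[start:start + batch_size]]
            theta = theta - eta_k * grad_loss(theta, batch)   # mini-batch gradient descent step
    return theta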
An Example
The continuous-time neoclassical growth model I

• We start with the continuous-time neoclassical growth model because it has closed-form solutions for
the policy functions, which allows us to focus our attention on the analysis of the value function
approximation.

• We can then back out the policy function from this approach and compare it to the results of the
next step in which we approximate the policy functions themselves with a neural net.

• A single agent decides either to save in capital or to consume, with an HJB equation:

  ρV(k) = max_c  U(c) + V'(k)[F(k) - δk - c]

• Notice that c = (U')^{-1}(V'(k)). With CRRA utility, this simplifies further to c = (V'(k))^{-1/γ}.

• We set γ = 2, ρ = 0.04, F(k) = 0.5 k^{0.36}, δ = 0.05.

54
The continuous-time neoclassical growth model II

• We approximate the value function V(k) with a neural network Ṽ(k; Θ), with an “HJB error”:

  err_HJB = ρṼ(k; Θ) - U((U')^{-1}(∂Ṽ(k; Θ)/∂k))
            - (∂Ṽ(k; Θ)/∂k) [F(k) - δk - (U')^{-1}(∂Ṽ(k; Θ)/∂k)]

• Details:

1. 3 layers.

2. 8 neurons per layer.

3. tanh(x) activation.

4. Normal initialization N(0, 4 \sqrt{2 / (n_input + n_output)}) with input normalization.
55
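For concreteness, here is a minimal PyTorch sketch in the spirit of this setup; it is not the authors' implementation. It approximates V(k) with a small tanh network and minimizes the squared HJB error at randomly sampled capital levels. The sampling interval for k, the Adam optimizer (instead of plain mini-batch gradient descent), and the clamping of V'(k) away from zero are illustrative choices.

import torch

gamma, rho, delta = 2.0, 0.04, 0.05
F = lambda k: 0.5 * k ** 0.36
U = lambda c: c ** (1 - gamma) / (1 - gamma)          # CRRA utility

V = torch.nn.Sequential(                              # Ṽ(k; Θ): 3 hidden layers, 8 neurons, tanh
    torch.nn.Linear(1, 8), torch.nn.Tanh(),
    torch.nn.Linear(8, 8), torch.nn.Tanh(),
    torch.nn.Linear(8, 8), torch.nn.Tanh(),
    torch.nn.Linear(8, 1),
)
opt = torch.optim.Adam(V.parameters(), lr=1e-3)

for epoch in range(5_000):
    k = torch.rand(128, 1) * 9.0 + 0.5                # sample capital on an illustrative interval [0.5, 9.5]
    k.requires_grad_(True)
    v = V(k)
    dv = torch.autograd.grad(v.sum(), k, create_graph=True)[0]   # ∂Ṽ/∂k by automatic differentiation
    c = dv.clamp(min=1e-6) ** (-1.0 / gamma)          # c = (U')^{-1}(V'(k)) under CRRA utility
    err_hjb = rho * v - U(c) - dv * (F(k) - delta * k - c)
    loss = (err_hjb ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(loss.item())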
[Figure (a): value function with closed-form policy.]
[Figure (c): consumption with closed-form policy.]
[Figure (e): HJB error with closed-form policy.]
Approximating the policy function

• Let us not use the closed-form consumption policy function but rather approximate said policy
function directly with a policy neural network C̃ (k; ΘC ).

• The new HJB error:

  err_HJB = ρṼ(k; Θ_V) - U(C̃(k; Θ_C)) - (∂Ṽ(k; Θ_V)/∂k) [F(k) - δk - C̃(k; Θ_C)]

• Now we have a policy function error:

  err_C = (U')^{-1}(∂Ṽ(k; Θ_V)/∂k) - C̃(k; Θ_C)

59
[Figure (b): value function with policy approximation.]
[Figure (d): consumption with policy approximation.]
[Figure (f): HJB error with policy approximation.]
[Figure (g): policy error with policy approximation.]
Alternative ANNs
Alternative ANNs

• Convolutional neural networks.

• Feedback ANN such as the Hopfield network.

• Self-organizing maps (SOM).

• ANN and reinforcement learning.

64
[Figure: 2-D convolution. A 3×4 input [[a, b, c, d], [e, f, g, h], [i, j, k, l]] convolved with a 2×2 kernel [[w, x], [y, z]] produces a 2×3 output whose entries are, e.g., aw + bx + ey + fz in the top-left position and gw + hx + ky + lz in the bottom-right position.]
Reinforcement learning
Reinforcement learning

• Main idea: Algorithms that use training information that evaluates the actions taken instead of
deciding whether the action was correct.

• Purely evaluative feedback to assess how good the action taken was, but not whether it was the best
feasible action.

• Useful when:
1. The dynamics of the state is unknown but simulation is easy: model-free vs. model-based reinforcement
learning.

2. Or the dimensionality is so high that we cannot store the information about the DP in a table.

• These methods work surprisingly well in a wide range of situations, although there are no methods that
are guaranteed to work.

• Key for success in economic applications: ability to simulate fast (link with massive parallelization).
Also, reinforcement learning complements neural networks very well.
68
Comparison with alternative methods

• Similar (same?) ideas are called approximate dynamic programming or neuro-dynamic programming.

• Traditional dynamic programming: we optimize over best feasible actions.

• Supervised learning: purely instructive feedback that indicates best feasible action regardless of
action actually taken.

• Unsupervised learning: hard to use for optimal control problems.

• In practice, we mix different methods.

• Current research challenge: how do we handle associative behavior effectively?

69
Example: Multi-armed bandit problem

• You need to choose action a among k available options.

• Each option is associated with a probability distribution of payoffs.

• You want to maximize the expected (discounted) payoffs.

• But you do not know which action is best; you only have estimates of your value function (a dual
control problem of identification and optimization).

• You can observe actions and period payoffs.

• The problem goes back to the study of the “sequential design of experiments” by Thompson (1933, 1934)
and Bellman (1956).

72
Theory vs. practice

• You can follow two pure strategies:

1. Follow greedy actions: actions with highest expected value. This is known as exploiting.

2. Follow non-greedy actions: actions with dominated expected value. This is known as exploring.

• This should remind you of a basic dynamic programming problem: what is the optimal mix of pure
strategies?

• If we impose enough structure on the problem (i.e., distributions of payoffs belong to some family,
stationarity, etc.), we can solve (either theoretically or applying standard solution techniques) the
optimal strategy (at least, up to some upper bound on computational capabilities).

• But these structures are too restrictive for practical purposes outside the pages of Econometrica.

74
A policy-based method I

• Proposed by Thathachar and Sastry (1985).

• A very simple method that uses the averages Q_n(a) of the rewards R_i(a) actually received:

  Q_n(a) = (1 / (n-1)) \sum_{i=1}^{n-1} R_i(a)

• We start with Q_0(a) = 0 for all a. Here (and later), we randomize among ties.

• We update Q_n(a) thanks to the nice recursive update based on the linearity of means:

  Q_{n+1}(a) = Q_n(a) + (1/n) [R_n(a) - Q_n(a)]

  Averages of actions not picked are not updated.

75
A policy-based method II

• How do we pick actions?

1. Pure greedy method: arg max_a Q_t(a).

2. ε-greedy method: mix the best action with a random trembling.

• Easy to generalize to more sophisticated strategies.

• In particular, we can connect with genetic algorithms (AlphaGo).

76
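Below is a minimal sketch of the sample-average method with ε-greedy action selection, run on a 10-armed testbed in the spirit of Thompson's problem; the Gaussian reward noise and the testbed itself are illustrative assumptions, not part of the slides.

import numpy as np

def epsilon_greedy_bandit(q_star, epsilon=0.1, n_steps=1_000, seed=0):
    """Sample-average ε-greedy agent on a k-armed bandit with true values q_star."""
    rng = np.random.default_rng(seed)
    k = len(q_star)
    Q = np.zeros(k)                      # action-value estimates, Q_0(a) = 0
    N = np.zeros(k)                      # number of times each action was taken
    rewards = np.empty(n_steps)
    for t in range(n_steps):
        if rng.random() < epsilon:       # explore: random (possibly non-greedy) action
            a = rng.integers(k)
        else:                            # exploit: greedy action, ties broken at random
            a = rng.choice(np.flatnonzero(Q == Q.max()))
        r = q_star[a] + rng.normal()     # noisy reward around the true value
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]        # incremental update Q_{n+1} = Q_n + (R_n - Q_n)/n
        rewards[t] = r
    return Q, rewards

q_star = np.random.default_rng(1).normal(size=10)      # a 10-armed testbed
Q, rewards = epsilon_greedy_bandit(q_star, epsilon=0.1)
print(np.argmax(q_star), np.argmax(Q), rewards.mean())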
[Figure: reward distributions q*(a), a = 1, ..., 10, for a 10-armed bandit testbed; rewards roughly range from -3 to 3.]
[Figure: average reward and % optimal action over 1,000 steps on the 10-armed testbed for ε-greedy action selection with ε = 0 (greedy), ε = 0.01, and ε = 0.1.]
A more general update rule

• Let’s think about a modified update rule:

  Q_{n+1}(a) = Q_n(a) + α [R_n(a) - Q_n(a)]

  for α ∈ (0, 1].

• This is equivalent, by recursive substitution, to:

  Q_{n+1}(a) = (1 - α)^n Q_1(a) + \sum_{i=1}^{n} α(1 - α)^{n-i} R_i(a)

• We can also have a time-varying α_n(a); convergence with probability 1 is ensured as long as:

  \sum_{n=1}^{∞} α_n(a) = ∞   and   \sum_{n=1}^{∞} α_n^2(a) < ∞

79
Improving the algorithm

• We can start with “optimistic” Q0 to induce exploration.

• We can implement an upper-confidence-bound action selection:

  arg max_a [ Q_n(a) + c \sqrt{ (log n) / N_n(a) } ]

• We can have a gradient bandit algorithm based on a softmax choice:

  π_n(a) = P(A_n = a) = e^{H_n(a)} / \sum_{b=1}^{k} e^{H_n(b)}

  where

  H_{n+1}(A_n) = H_n(A_n) + α (1 - π_n(A_n)) (R_n(a) - R̄_n)
  H_{n+1}(a)   = H_n(a) - α π_n(a) (R_n(a) - R̄_n)   for all a ≠ A_n

  This is a slightly hidden version of a stochastic gradient algorithm that we will see soon when we talk
  about deep learning.
80
[Figure 2.3 (Sutton and Barto): the effect of optimistic initial action-value estimates on the 10-armed testbed. An optimistic, greedy method (Q_1 = 5, ε = 0) is compared with a realistic ε-greedy method (Q_1 = 0, ε = 0.1) in terms of % optimal action over 1,000 plays; both methods use a constant step-size parameter α = 0.1.]
[Figure 2.4 (Sutton and Barto): average performance of UCB action selection (c = 2) on the 10-armed testbed. UCB generally performs better than ε-greedy action selection (ε = 0.1), except in the first k steps.]
[Figure 2.6 (Sutton and Barto): a parameter study of the bandit algorithms (ε-greedy, UCB, gradient bandit, and greedy with optimistic initialization, α = 0.1), showing average reward over the first 1,000 steps as a function of each algorithm’s parameter (ε, α, c, Q_0).]
Other algorithms

• Monte Carlo prediction.

• Temporal-difference (TD) learning:

  V^{n+1}(s_t) = V^n(s_t) + α (r_{t+1} + β V^n(s_{t+1}) - V^n(s_t))

• SARSA ⇒ on-policy TD control:

  Q^{n+1}(a_t, s_t) = Q^n(a_t, s_t) + α (r_{t+1} + β Q^n(a_{t+1}, s_{t+1}) - Q^n(a_t, s_t))

• Q-learning ⇒ off-policy TD control:

  Q^{n+1}(a_t, s_t) = Q^n(a_t, s_t) + α (r_{t+1} + β max_{a_{t+1}} Q^n(a_{t+1}, s_{t+1}) - Q^n(a_t, s_t))

• Value-based methods.

• Actor-critic methods.
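To make the last update rule concrete, here is a minimal sketch of tabular Q-learning with ε-greedy exploration; the environment interface (env.reset() returning a state index and env.step(s, a) returning the next state, the reward, and a done flag) and the hyperparameters are illustrative assumptions.

import numpy as np

def q_learning(env, n_states, n_actions, alpha=0.1, beta=0.95,
               epsilon=0.1, n_episodes=500, seed=0):
    """Tabular off-policy TD control:
    Q(s,a) <- Q(s,a) + α (r + β max_a' Q(s',a') - Q(s,a))."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # ε-greedy behavior policy.
            if rng.random() < epsilon:
                a = rng.integers(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(s, a)          # assumed environment interface
            target = r + beta * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])     # Q-learning update
            s = s_next
    return Q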
