
Deep learning and reinforcement learning

Jesús Fernández-Villaverde¹ and Galo Nuño²

October 15, 2021

¹ University of Pennsylvania
² Banco de España
A short introduction
The problem

• Let us suppose we want to approximate an unknown function:

y = f (x)

where y is a scalar and x = {x1 , x2 , ..., xN } a vector.

• We care about the case when N is large.

• Easy to generalize to the case where y is a vector (or a probability distribution), but notation
becomes cumbersome.

• In economics, f (x) can be a value function, a policy function, a pricing kernel, a conditional
expectation, a classifier, ...

1
A neural network

• An artificial neural network (a.k.a. ANN or connectionist system) is an approximation to f (x) built as
a linear combination of M generalized linear models of x of the form:
  y ≈ g^{NN}(x; θ) = θ_0 + \sum_{m=1}^{M} θ_m φ(z_m)

  where φ(·) is an arbitrary activation function and:

  z_m = θ_{0,m} + \sum_{n=1}^{N} θ_{n,m} x_n

• M is known as the width of the model.

• We can select θ such that g NN (x; θ) is as close to f (x) as possible given some relevant metric (e.g.,
L2 norm).

• This is known as “training” the network.


2
Comparison with other approximations

• Compare:

  y ≈ g^{NN}(x; θ) = θ_0 + \sum_{m=1}^{M} θ_m φ(θ_{0,m} + \sum_{n=1}^{N} θ_{n,m} x_n)

  with a standard projection:

  y ≈ g^{CP}(x; θ) = θ_0 + \sum_{m=1}^{M} θ_m φ_m(x)

  where φ_m is, for example, a Chebyshev polynomial.

• We exchange the rich parameterization of coefficients for the parsimony of basis functions.

• Later, we will explain why this is often a good idea.

• How we determine the coefficients will also be different, but this is somewhat less important.

3
Deep learning

• A deep learning network is an acyclic multilayer composition of J > 1 neural networks:


   
  y ≈ g^{DL}(x; θ) = g^{NN(1)}(g^{NN(2)}(...; θ^{(2)}); θ^{(1)})

  where the M^{(1)}, M^{(2)}, ... and φ_1(·), φ_2(·), ... are possibly different across each layer of the network.

• Sometimes known as deep feedforward neural networks or multilayer perceptrons.

• “Feedforward” comes from the fact that the composition of neural networks can be represented as a
directed acyclic graph, which lacks feedback. We can have more general recurrent structures.

• J is known as the depth of the network. The case J = 1 is a standard neural network.

• As before, we can select θ such that g DL (x; θ) approximates a target function f (x) as closely as
possible under some relevant metric.

4
Why are neural networks a good solution method in economics?

• From now on, I will refer to neural networks as including both single and multilayer networks.

• With suitable choices of activation functions, neural networks can efficiently approximate extremely
complex functions.

• In particular, under certain (relatively weak) conditions:

1. Neural networks are universal approximators.

2. Neural networks break the “curse of dimensionality.”

• Furthermore, neural networks are easy to code, stable, and scalable for multiprocessing.

• Thus, neural networks have considerable option value as solution methods in economics.

5
Current interest

• Currently, neural networks are among the most active areas of research in computer science and
applied math.

• While the original idea goes back to the 1940s, neural networks were rediscovered in the second half of
the 2000s.

• Why?

1. Suddenly, the large amounts of computation and data required to train the networks efficiently
became available at a reasonable cost.

2. New algorithms such as back propagation through gradient descent became popular.

• Some well-known successes and industrial applications.

6
AlphaGo

• Big splash: AlphaGo vs. Lee Sedol in March 2016.

• Silver et al. (2018): now applied to chess, shogi, Go, and StarCraft II.

• Check also:

1. https://deepmind.com/research/alphago/.

2. https://www.alphagomovie.com/

3. https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii

• Very different from Deep Blue against Kasparov.

• New and surprising strategies.

• However, you need to keep this accomplishment in perspective.


8
[Figure 1 (Silver et al.): AlphaGo's neural network training pipeline and architecture. A fast rollout policy p_π and a supervised learning (SL) policy network p_σ are trained to predict human expert moves; a reinforcement learning (RL) policy network p_ρ is initialized to the SL network and improved by policy gradient learning against previous versions of itself; a value network v_θ is trained by regression on self-play positions to predict the expected outcome.]

[Figure (Silver et al., Science 362, 1140–1144, 2018): AlphaZero trained for 700,000 steps, with Elo ratings computed from games at 1 s per move. Panels compare AlphaZero with the 2016 TCEC champion Stockfish (chess), the 2017 CSA champion Elmo (shogi), and AlphaGo Lee / AlphaGo Zero (Go).]
Further advantages

• Neural networks and deep learning often require less “inside knowledge” from experts in the area.

• Results can be highly counter-intuitive and yet deliver excellent performance.

• Outstanding open source libraries: Tensorflow, Pytorch, Flux.

• More recently, the development of dedicated hardware (TPUs, AI accelerators, FPGAs) is likely to
maintain an edge for the area.

• The width of an ecosystem is key for its long-run success.

11
Limitations of neural networks and deep learning

• While neural networks and deep learning can work extremely well, there is no such thing as a silver
bullet.

• Clear and serious trade-offs in real-life applications.

• A rule of thumb in the industry is that one needs around 10^7 labeled observations to properly train a
complex ANN, with around 10^4 observations in each relevant group.

• Of course, sometimes “observations” are endogenous (we can simulate them), but if your goal is to
forecast GDP next quarter, it is unlikely a neural network will beat an ARIMA(p,d,q) (at least with only
macro variables).

• Issues of interpretation.

13
Digging deeper
More details on neural networks

• Non-linear functional approximation method.

• Much hype around them and an over-emphasis on the biological interpretation.

• We will follow a more sober formal treatment (which, in any case, agrees with the approach of
state-of-the-art researchers).

• In particular, we will highlight connections with econometrics (e.g., nonlinear least squares,
semiparametric regression, and sieves).

• We will start by describing the simplest possible neural network.

15
A neuron

• N observables: x1 , x2 ,...,xN . We stack them in x.

• Coefficients (or weights): θ0 (a constant), θ1 , θ2 , ...,θN . We stack them in θ.

• We build a linear combination of observations:

  z = θ_0 + \sum_{n=1}^{N} θ_n x_n

  Theoretically, we could build non-linear combinations, but it is unlikely to be a fruitful idea in general.

• We transform such a linear combination with an activation function:

  y = g(x; θ) = φ(z)

  The activation function might have some coefficients γ of its own.

• Why do we need an activation function?


16
Flow representation

[Figure: flow representation of a single neuron. Inputs x_1, ..., x_n with weights θ_1, ..., θ_n are combined into the net input \sum_{i=1}^{n} θ_i x_i, passed through the activation function (with coefficient γ), and produce the perceptron classification output.]
17
The biological analog

18
Activation functions I

• Traditionally:

1. Identity function:

   φ(z) = z

   Used in linear regression.

2. A sigmoidal function:

   φ(z) = 1 / (1 + e^{-z})

   A particular limiting case, as z grows quickly, is the step function.

3. Hyperbolic tangent:

   φ(z) = (e^{2z} - 1) / (e^{2z} + 1)

19
[Figure: the sigmoidal activation function over z ∈ [-4, 4].]
[Figure: the hyperbolic tangent activation function over z ∈ [-4, 4].]
Activation functions II

• Some activation functions that have gained popularity recently:

1. Rectified linear unit (ReLU):


φ (z) = max(0, z)

2. Parametric ReLU:
φ (z) = max(z, az)

3. Softplus:
   φ(z) = log(1 + e^z)

22
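As a quick reference, here is a minimal NumPy sketch of the activation functions listed above; the slope a in the parametric ReLU is an illustrative choice and is not taken from the slides.

import numpy as np

def identity(z):
    # Identity activation: used in linear regression.
    return z

def sigmoid(z):
    # Sigmoidal activation: 1 / (1 + e^{-z}).
    return 1.0 / (1.0 + np.exp(-z))

def hyperbolic_tangent(z):
    # Hyperbolic tangent: (e^{2z} - 1) / (e^{2z} + 1).
    return np.tanh(z)

def relu(z):
    # Rectified linear unit: max(0, z).
    return np.maximum(0.0, z)

def parametric_relu(z, a=0.1):
    # Parametric ReLU: max(z, a*z), with slope a < 1 on the negative side.
    return np.maximum(z, a * z)

def softplus(z):
    # Softplus: log(1 + e^z); log1p keeps accuracy when e^z is very small.
    return np.log1p(np.exp(z))

z = np.linspace(-4.0, 4.0, 9)
print(relu(z))
print(softplus(z))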
[Figure: the ReLU and softplus activation functions over z ∈ [-4, 4].]
Interpretation

• θ0 controls the activation threshold.

• The level of the θ_i's for i > 0 controls the activation rate (the higher the θ_i's, the harder the
activation).

• Some textbooks separate the activation threshold and scaling coefficients from θ as different
coefficients in φ, but such separation moves notation farther away from standard econometrics.

• Potential identification problem between θ and more general activation functions with their own
parameters.

• But in practice θ does not have a structural interpretation, so the identification problem is of
secondary importance.

• As mentioned in the introduction, a neuron closely resembles a generalized linear model in
econometrics.
24
Combining neurons into a neural network

• As before, we have N observables: x1 , x2 ,...,xN .

• Coefficients (or weights): θ0,m (a constant), θ1,m , θ2,m , ...,θN,m .

• We build M linear combinations of observations:

  z_m = θ_{0,m} + \sum_{n=1}^{N} θ_{n,m} x_n

• We transform and add such linear combinations with an activation function:

  y ≈ g(x; θ) = θ_0 + \sum_{m=1}^{M} θ_m φ(z_m)

• Also, quasi-linear structure in terms of vectors of observables and coefficients.

• This is known as a single layer network.


26
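To make the notation concrete, below is a minimal NumPy sketch of the single-layer network g(x; θ) defined above, with a sigmoid activation. The parameter shapes, the random initialization, and the example sizes are illustrative choices, not part of the slides.

import numpy as np

def single_layer_network(x, theta0, theta, Theta0, Theta):
    """Evaluate y ≈ g(x; θ) = θ0 + Σ_m θ_m φ(z_m), with z_m = θ_{0,m} + Σ_n θ_{n,m} x_n.

    x      : (N,)   vector of observables
    theta0 : scalar outer constant θ_0
    theta  : (M,)   outer coefficients θ_m
    Theta0 : (M,)   inner constants θ_{0,m}
    Theta  : (N, M) inner coefficients θ_{n,m}
    """
    z = Theta0 + x @ Theta          # M linear combinations of the observables
    phi = 1.0 / (1.0 + np.exp(-z))  # sigmoid activation applied element-wise
    return theta0 + phi @ theta     # linear combination of the M activated units

# Illustrative example with N = 3 inputs and width M = 5.
rng = np.random.default_rng(0)
N, M = 3, 5
x = rng.normal(size=N)
y_hat = single_layer_network(x, rng.normal(), rng.normal(size=M),
                             rng.normal(size=M), rng.normal(size=(N, M)))
print(y_hat)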
Two classic (yet remarkable) results I

Borel measurable function


A map f : X → Y between two topological spaces is called Borel measurable if f^{-1}(A) is a Borel set for
any open set A on Y (the Borel sets are all the sets that can be built from the open sets through the
operations of countable union, countable intersection, and relative complement).

Universal approximation theorem: Hornik, Stinchcombe, and White (1989)


A neural network with at least one hidden layer can approximate any Borel measurable function mapping
finite-dimensional spaces to any desired degree of accuracy.

• Intuition of the result.

• Comparison with other results in series approximations.

27
Two classic (yet remarkable) results II

• Assume, as well, that we are dealing with the class of functions for which the Fourier transform of
their gradient is integrable.

Breaking the curse of dimensionality: Barron (1993)


A one-layer NN achieves integrated square errors of order O(1/M), where M is the number of nodes. In
comparison, for series approximations, the integrated square error is of order O(1/M^{2/N}), where N is
the dimension of the function to be approximated.

• More general theorems by Leshno et al. (1993) and Bach (2017).

• What about Chebyshev polynomials? Splines? Problems of convergence and extrapolation.

• There is another, yet more subtle curse of dimensionality.

30
Training the network

• θ is selected to minimize the quadratic error function E(θ; Y, ŷ):

  θ* = arg min_θ E(θ; Y, ŷ)
     = arg min_θ \sum_{j=1}^{J} E(θ; y_j, ŷ_j)
     = arg min_θ (1/2) \sum_{j=1}^{J} ||y_j - g(x_j; θ)||^2

• Where do the observations Y come from? Observed data vs. simulated epochs.

• How do we solve this minimization problem?

• Other objective functions are possible.

31
Back propagation

• In general, we can easily calculate E(θ*; Y, ŷ) and ∇E(θ*; Y, ŷ) for a given θ*.

• In particular, for the gradient, we can use back propagation (Rumelhart et al., 1986):

  ∂E(θ; y_j, ŷ_j)/∂θ_0     = y_j - g(x_j; θ)

  ∂E(θ; y_j, ŷ_j)/∂θ_m     = (y_j - g(x_j; θ)) φ(z_m),           for ∀m

  ∂E(θ; y_j, ŷ_j)/∂θ_{0,m} = (y_j - g(x_j; θ)) θ_m φ'(z_m),      for ∀m

  ∂E(θ; y_j, ŷ_j)/∂θ_{n,m} = (y_j - g(x_j; θ)) θ_m x_n φ'(z_m),  for ∀n, m

  where φ'(z) is the derivative of the activation function.

• The derivative φ0 (z) will be trivial to evaluate if we use a ReLU.

• Back propagation will be particularly important below when we introduce multiple layers.
32
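The derivatives above can be coded directly. Below is a minimal sketch for the single-layer network with a quadratic loss on one observation; it uses the sign convention implied by E = ½(g(x; θ) − y)², and the sigmoid (with φ'(z) = φ(z)(1 − φ(z))) is an illustrative choice of activation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_single_layer(x, y, theta0, theta, Theta0, Theta):
    """Gradients of E = 0.5 * (g(x; θ) - y)^2 for the single-layer network."""
    z = Theta0 + x @ Theta            # (M,) pre-activations z_m
    phi = sigmoid(z)                  # (M,) activations φ(z_m)
    g = theta0 + phi @ theta          # scalar prediction g(x; θ)
    resid = g - y                     # dE/dg

    dphi = phi * (1.0 - phi)          # φ'(z_m) for the sigmoid
    grad_theta0 = resid                               # dE/dθ_0
    grad_theta = resid * phi                          # dE/dθ_m
    grad_Theta0 = resid * theta * dphi                # dE/dθ_{0,m}
    grad_Theta = resid * np.outer(x, theta * dphi)    # dE/dθ_{n,m}
    return grad_theta0, grad_theta, grad_Theta0, grad_Theta

rng = np.random.default_rng(1)
N, M = 3, 4
grads = backprop_single_layer(rng.normal(size=N), 0.7, rng.normal(),
                              rng.normal(size=M), rng.normal(size=M),
                              rng.normal(size=(N, M)))
print([np.shape(g) for g in grads])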
An approach to minimization

• One approach to optimization is to minimize a local model that approximates the true objective
function.

• The local model can be a first- or second-order Taylor approximation of the objective function.

• For example, suppose a function E is roughly approximated as a quadratic form:


  E(θ) ≈ (1/2) θ^T A θ - b^T θ + c

  where A is a square, symmetric, positive-definite matrix.

• Then E(θ) is minimized by the solution to:


Aθ = b

• We can use this result to build a descent direction iteration if we know A and b (or we have
approximations to them).
33
Descent direction iteration

• Starting at point θ^{(1)}, a descent direction algorithm generates a sequence of steps (called iterates)
that converge to a local minimum.

• The descent direction iteration algorithm:

1. At iteration k, check whether θ^{(k)} satisfies the termination condition. If so, stop; otherwise go to step 2.

2. Determine the descent direction d(k) using local information such as gradient or Hessian.

3. Compute step size α(k) .

4. Compute the next candidate point: θ(k+1) ← θ(k) + α(k) d(k) .

• Choice of α and d determines the flavor of the algorithm.

34
Gradient descent method

• A natural choice for d is the direction of steepest descent (first proposed by Cauchy).

• The direction of steepest descent is given by the direction opposite the gradient ∇E(θ). Thus, a.k.a.
steepest descent.

• If the function is smooth and the step size small, the method leads to improvement (as long as the
gradient is not zero).

• The normalized direction of steepest descent is:

  d^{(k)} = -∇E(θ^{(k)}) / ||∇E(θ^{(k)})||

• One way to set the step size is to solve:

  α^{(k)} = arg min_α E(θ^{(k)} + α d^{(k)})

• Under this step size choice, it can be shown d(k+1) and d(k) are orthogonal.
35
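As an illustration, here is a minimal sketch of steepest descent on the quadratic model E(θ) = ½θᵀAθ − bᵀθ from the previous slides, using the exact line search available for a quadratic; the matrix, vector, and starting point are illustrative choices.

import numpy as np

def steepest_descent(A, b, theta0, tol=1e-10, max_iter=1_000):
    """Steepest descent on E(θ) = 0.5 θ'Aθ - b'θ with exact line search.

    For this quadratic model ∇E(θ) = Aθ - b, and the step size solving
    arg min_α E(θ + α d) with d = -∇E(θ) is α = (g'g) / (g'Ag).
    """
    theta = theta0.copy()
    for _ in range(max_iter):
        g = A @ theta - b                      # gradient ∇E(θ^(k))
        if np.linalg.norm(g) < tol:            # termination condition
            break
        d = -g                                 # steepest-descent direction
        alpha = (g @ g) / (g @ (A @ g))        # exact line search for the quadratic
        theta = theta + alpha * d              # θ^(k+1) = θ^(k) + α^(k) d^(k)
    return theta

A = np.array([[3.0, 0.5], [0.5, 1.0]])         # symmetric, positive definite
b = np.array([1.0, -2.0])
print(steepest_descent(A, b, np.zeros(2)))
print(np.linalg.solve(A, b))                   # the exact minimizer solves Aθ = b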
Steepest descent method

36
Conjugate descent method

• Gradient descent can perform poorly in narrow valleys (it may require many steps to make progress).

• Famous example: Rosenbrock function.

• The conjugate gradient method overcomes this problem by constructing each new direction to be
conjugate to the old gradient and to all previous directions traversed.

• Define g(θ) = ∇E(θ).

• In the first iteration, set d^{(1)} = -g(θ^{(1)}) and θ^{(2)} = θ^{(1)} + α^{(1)} d^{(1)}. Here, α^{(1)} is arbitrary.

• Subsequent iterations set d^{(k+1)} = -g^{(k+1)} + β^{(k)} d^{(k)}.

37
Conjugate descent method

38
Conjugate descent method

• There are two approaches to set β:

1. Fletcher-Reeves:

   β^{(k)} = (g^{(k)T} g^{(k)}) / (g^{(k-1)T} g^{(k-1)})

2. Polak-Ribière:

   β^{(k)} = (g^{(k)T} (g^{(k)} - g^{(k-1)})) / (g^{(k-1)T} g^{(k-1)})

• The Polak-Ribière version requires an automatic reset at every iteration: β ← max(β, 0).

• If the function to minimize has flat areas, one can introduce a momentum update equation:

  v^{(k+1)} = β v^{(k)} - α g^{(k)}
  θ^{(k+1)} = θ^{(k)} + v^{(k+1)}

• The modification reverts to the gradient descent version if β = 0.

• Intuitively, the momentum update is like a ball rolling down an almost horizontal surface.
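Below is a minimal sketch of the momentum update equations above, applied to the same quadratic model used earlier; the objective, β, α, and iteration count are illustrative choices, and with β = 0 the loop reverts to plain gradient descent.

import numpy as np

def momentum_descent(grad, theta0, alpha=0.05, beta=0.9, n_iter=500):
    """Momentum update: v^(k+1) = β v^(k) - α g^(k),  θ^(k+1) = θ^(k) + v^(k+1)."""
    theta = theta0.copy()
    v = np.zeros_like(theta)
    for _ in range(n_iter):
        g = grad(theta)           # gradient at the current iterate
        v = beta * v - alpha * g  # accumulate "velocity"
        theta = theta + v         # take the momentum step
    return theta

A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
grad = lambda theta: A @ theta - b           # ∇E for E(θ) = 0.5 θ'Aθ - b'θ
print(momentum_descent(grad, np.zeros(2)))
print(np.linalg.solve(A, b))                 # exact minimizer, for comparison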
Stochastic gradient descent and minibatch

• Even with back propagation, evaluating the gradient for the whole training set can be costly.

• Stochastic gradient descent: intuition from Monte Carlo methods.

• An additional advantage.

• A compromise between using the whole training set and pure stochastic gradient descent: minibatch
gradient descent.

• This is the most popular algorithm to train neural networks.

• Intuition from GMM. Notice also resilience to scaling.

• In practice, we do not need a global minimum (≠ likelihood estimation).

• You can offload the algorithm to a graphics processing unit (GPU) or a tensor processing unit (TPU)
instead of a standard CPU.
40
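The following is a minimal sketch of minibatch gradient descent for a generic parameterized model. The gradient oracle, batch size, learning rate, and the linear-regression example are illustrative; in practice one would rely on a library optimizer (e.g., in Pytorch or Tensorflow) rather than a hand-rolled loop.

import numpy as np

def minibatch_sgd(grad_fn, theta0, X, Y, lr=0.01, batch_size=32, n_epochs=100, seed=0):
    """Minibatch stochastic gradient descent.

    grad_fn(theta, X_batch, Y_batch) must return the gradient of the average
    loss over the batch with respect to theta.
    """
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    n = X.shape[0]
    for _ in range(n_epochs):
        perm = rng.permutation(n)                      # reshuffle the training set
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]       # one minibatch
            theta -= lr * grad_fn(theta, X[idx], Y[idx])
    return theta

# Illustrative use: linear regression y = Xθ + noise, squared loss.
rng = np.random.default_rng(1)
X = rng.normal(size=(1_000, 3))
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true + 0.1 * rng.normal(size=1_000)
grad_fn = lambda th, Xb, Yb: Xb.T @ (Xb @ th - Yb) / len(Yb)
print(minibatch_sgd(grad_fn, np.zeros(3), X, Y))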
[Figure 2-7: the stochastic error surface fluctuates with respect to the batch error surface, enabling saddle-point avoidance and significantly improving our ability to navigate flat regions.]
Alternative minimization algorithms

1. More sophisticated stochastic gradient descent: Adam (Adaptive Moment Estimation). It uses
running averages of both the gradients and the second moments of the gradients.

2. Newton and Quasi-Newton methods are unlikely to be of much use in practice. Why?

3. MCMC / simulated annealing.

4. Genetic algorithms:

• In fact, much of the research in deep learning incorporates some flavor of genetic selection.

• Basic idea.

43
Further ideas

• Design of the network architecture:

1. Trade-off error/computational time.

2. Better to err on the side of too many M.

• Double descent phenomenon.

44
Multiple layers I

• The hidden layers can be multiplied without limit in a feed-forward ANN.


• We build K layers:

  z_m^1 = θ_{0,m}^1 + \sum_{n=1}^{N} θ_{n,m}^1 x_n

  and

  z_{m'}^2 = θ_{0,m'}^2 + \sum_{m=1}^{M} θ_{m,m'}^2 φ(z_m^1)

  ...

  y ≈ g(x; θ) = θ_0^K + \sum_{m=1}^{M} θ_m^K φ(z_m^{K-1})

46
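A minimal sketch of the K-layer composition above, written as a loop over layers with weight matrices and a shared activation; the layer widths, the tanh activation, and the random initialization are illustrative choices.

import numpy as np

def deep_forward(x, weights, biases, phi=np.tanh):
    """Forward pass of a feed-forward network.

    weights[k] has shape (n_k, n_{k+1}) and biases[k] has shape (n_{k+1},);
    the activation φ is applied in every layer except the last (linear output).
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = phi(b + h @ W)                   # hidden layer: z = θ_0 + θ'h, then φ(z)
    return biases[-1] + h @ weights[-1]      # linear output layer

# Illustrative architecture: 3 inputs, two hidden layers of width 8, scalar output.
rng = np.random.default_rng(2)
sizes = [3, 8, 8, 1]
weights = [rng.normal(scale=1 / np.sqrt(m), size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
x = rng.normal(size=3)
print(deep_forward(x, weights, biases))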
[Figure: a feed-forward network with input layer (x_1, x_2, x_3), two hidden layers, and an output layer.]
Multiple layers II

• Why do we want to introduce hidden layers?

1. It works! Our brains have six layers. AlphaGo has 12 layers with ReLUs.

2. Hidden layers induce highly nonlinear behavior.

3. Allow for clustering of variables.

• We can have different M’s in each layer ⇒ fewer neurons in higher layers allow for compression of
learning into fewer features.

• We can also add multidimensional outputs.

• Or even to produce, as output, a probability distribution, for example, using a softmax layer:

  y_m = e^{z_m^{K-1}} / \sum_{m=1}^{M} e^{z_m^{K-1}}

48
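A minimal sketch of a softmax output layer; subtracting the maximum before exponentiating is a standard numerical-stability device, not something in the slides.

import numpy as np

def softmax(z):
    # y_m = e^{z_m} / Σ_m e^{z_m}; subtracting max(z) leaves the result unchanged
    # but avoids overflow in the exponentials.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, -0.5])
print(softmax(z), softmax(z).sum())   # probabilities that sum to one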
Application to Economics
Solving high-dimensional dynamic programming problems using Deep Learning

• Our goal is to solve the recursive continuous-time Hamilton-Jacobi-Bellman (HJB) equation globally:

  ρV(x) = max_α  r(x, α) + ∇_x V(x) f(x, α) + (1/2) tr[σ(x)^T ∆_x V(x) σ(x)]

  s.t.  G(x, α) ≤ 0  and  H(x, α) = 0

• Think about the cases where we have many state variables.

• Alternatives for this solution?

49
Neural networks

• We define four neural networks:

1. Ṽ(x; Θ_V) : R^N → R to approximate the value function V(x).

2. α̃(x; Θ_α) : R^N → R^M to approximate the policy function α.

3. µ̃(x; Θ_µ) : R^N → R^{L_1} and λ̃(x; Θ_λ) : R^N → R^{L_2} to approximate the Karush-Kuhn-Tucker (KKT)
multipliers µ and λ.

• To simplify notation, we accumulate all weights in the matrix Θ = (ΘV , Θα , Θµ , Θλ ).

• We could think about the approach as just one large neural network with multiple outputs.

50
Error criterion I

• The HJB error:

  err_HJB(x; Θ) ≡ r(x, α̃(x; Θ_α)) + ∇_x Ṽ(x; Θ_V) f(x, α̃(x; Θ_α))
                  + (1/2) tr[σ(x)^T ∆_x Ṽ(x; Θ_V) σ(x)] - ρ Ṽ(x; Θ_V)

• The policy function error:

  err_α(x; Θ) ≡ ∂r(x, α̃(x; Θ_α))/∂α + D_α f(x, α̃(x; Θ_α))^T ∇_x Ṽ(x; Θ_V)
                - D_α G(x, α̃(x; Θ_α))^T µ̃(x; Θ_µ) - D_α H(x, α̃(x; Θ_α))^T λ̃(x; Θ_λ),

  where D_α G ∈ R^{L_1×M}, D_α H ∈ R^{L_2×M}, and D_α f ∈ R^{N×M} are the submatrices of the Jacobian
  matrices of G, H, and f, respectively, containing the derivatives with respect to α.

51
Error criterion II

• The constraint error is itself composed of the primal feasibility errors:

  err_PF1(x; Θ) ≡ max{0, G(x, α̃(x; Θ_α))}
  err_PF2(x; Θ) ≡ H(x, α̃(x; Θ_α))

  the dual feasibility error:

  err_DF(x; Θ) ≡ max{0, -µ̃(x; Θ_µ)}

  and the complementary slackness error:

  err_CS(x; Θ) ≡ µ̃(x; Θ_µ)^T G(x, α̃(x; Θ_α))

• We combine these four errors by using the squared error as our loss criterion:

  E(x; Θ) ≡ ||err_HJB(x; Θ)||_2^2 + ||err_α(x; Θ)||_2^2 + ||err_PF1(x; Θ)||_2^2
            + ||err_PF2(x; Θ)||_2^2 + ||err_DF(x; Θ)||_2^2 + ||err_CS(x; Θ)||_2^2
52
Training

• We train our neural networks by minimizing the above error criterion through mini-batch gradient
descent over points drawn from the ergodic distribution of the state vector.

• The efficient implementation of this last step is the key to the success of our algorithm.

• We start by initializing our network weights and we perform K learning steps called epochs, where K
can be chosen in a variety of ways.

• For each epoch, we draw I points from the state space by simulating from the ergodic distribution.

• Then, we randomly split this sample into B mini-batches of size S. For each mini-batch, we define
the mini-batch error, by averaging the loss function over the batch.

• Finally, we perform mini-batch gradient descent for all network weights, with ηk being the learning
rate in the k-th epoch.

53
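A schematic sketch of this training loop is below. It is only scaffolding, not the authors' code: simulate_ergodic and grad_loss are placeholders (the first draws states from the ergodic distribution, the second returns the gradient of the average criterion E(x; Θ) over a batch), Θ is treated as one flattened parameter vector, and the hyperparameters and learning-rate schedule are arbitrary.

import numpy as np

def train(theta, grad_loss, simulate_ergodic, n_epochs=1_000, n_points=4_096,
          batch_size=128, lr0=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    for k in range(n_epochs):                       # K learning steps ("epochs")
        x = simulate_ergodic(n_points)              # draw I points from the ergodic distribution
        perm = rng.permutation(n_points)            # random split into B mini-batches of size S
        eta_k = lr0 / (1.0 + 0.001 * k)             # learning rate η_k in the k-th epoch
        for start in range(0, n_points, batch_size):
            batch = x[perm[start:start + batch_size]]
            theta = theta - eta_k * grad_loss(theta, batch)   # mini-batch gradient descent step
    return theta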
An Example
The continuous-time neoclassical growth model I

• We start with the continuous-time neoclassical growth model because it has closed-form solutions for
the policy functions, which allows us to focus our attention on the analysis of the value function
approximation.

• We can then back out the policy function from this approach and compare it to the results of the
next step in which we approximate the policy functions themselves with a neural net.

• A single agent decides either to save in capital or to consume, with an HJB equation:

  ρV(k) = max_c  U(c) + V'(k)[F(k) - δk - c]

• Notice that c = (U')^{-1}(V'(k)). With CRRA utility, this simplifies further to c = (V'(k))^{-1/γ}.

• We set γ = 2, ρ = 0.04, F(k) = 0.5 k^{0.36}, δ = 0.05.

54
The continuous-time neoclassical growth model II

• We approximate the value function V(k) with a neural network Ṽ(k; Θ), with an “HJB error”:

  err_HJB = ρṼ(k; Θ) - U((U')^{-1}(∂Ṽ(k; Θ)/∂k))
            - (∂Ṽ(k; Θ)/∂k) [F(k) - δk - (U')^{-1}(∂Ṽ(k; Θ)/∂k)]

• Details:

1. 3 layers.

2. 8 neurons per layer.

3. tanh(x) activation.

4. Normal initialization N(0, 4 \sqrt{2 / (n_input + n_output)}) with input normalization.
55
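For concreteness, here is a minimal PyTorch sketch in the spirit of this setup; it is not the authors' implementation. It approximates V(k) with a small tanh network and minimizes the squared HJB error at randomly sampled capital levels. The sampling interval for k, the Adam optimizer (instead of plain mini-batch gradient descent), and the clamping of V'(k) away from zero are illustrative choices.

import torch

gamma, rho, delta = 2.0, 0.04, 0.05
F = lambda k: 0.5 * k ** 0.36
U = lambda c: c ** (1 - gamma) / (1 - gamma)          # CRRA utility

V = torch.nn.Sequential(                              # Ṽ(k; Θ): 3 hidden layers, 8 neurons, tanh
    torch.nn.Linear(1, 8), torch.nn.Tanh(),
    torch.nn.Linear(8, 8), torch.nn.Tanh(),
    torch.nn.Linear(8, 8), torch.nn.Tanh(),
    torch.nn.Linear(8, 1),
)
opt = torch.optim.Adam(V.parameters(), lr=1e-3)

for epoch in range(5_000):
    k = torch.rand(128, 1) * 9.0 + 0.5                # sample capital on an illustrative interval [0.5, 9.5]
    k.requires_grad_(True)
    v = V(k)
    dv = torch.autograd.grad(v.sum(), k, create_graph=True)[0]   # ∂Ṽ/∂k by automatic differentiation
    c = dv.clamp(min=1e-6) ** (-1.0 / gamma)          # c = (U')^{-1}(V'(k)) under CRRA utility
    err_hjb = rho * v - U(c) - dv * (F(k) - delta * k - c)
    loss = (err_hjb ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(loss.item())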
[Figure (a): value function with closed-form policy.]
[Figure (c): consumption with closed-form policy.]
[Figure (e): HJB error with closed-form policy.]
Approximating the policy function

• Let us not use the closed-form consumption policy function but rather approximate said policy
function directly with a policy neural network C̃ (k; ΘC ).

• The new HJB error:

  err_HJB = ρṼ(k; Θ_V) - U(C̃(k; Θ_C)) - (∂Ṽ(k; Θ_V)/∂k) [F(k) - δk - C̃(k; Θ_C)]

• Now we have a policy function error:

  err_C = (U')^{-1}(∂Ṽ(k; Θ_V)/∂k) - C̃(k; Θ_C)

59
[Figure (b): value function with policy approximation.]
[Figure (d): consumption with policy approximation.]
[Figure (f): HJB error with policy approximation.]
[Figure (g): policy error with policy approximation.]
Alternative ANNs
Alternative ANNs

• Convolutional neural networks.

• Feedback ANN such as the Hopfield network.

• Self-organizing maps (SOM).

• ANN and reinforcement learning.

64
[Figure: 2-D convolution. A 3×4 input [[a, b, c, d], [e, f, g, h], [i, j, k, l]] convolved with a 2×2 kernel [[w, x], [y, z]] produces a 2×3 output whose entries are, e.g., aw + bx + ey + fz in the top-left position and gw + hx + ky + lz in the bottom-right position.]
Reinforcement learning
Reinforcement learning

• Main idea: Algorithms that use training information that evaluates the actions taken instead of
deciding whether the action was correct.

• Purely evaluative feedback to assess how good the action taken was, but not whether it was the best
feasible action.

• Useful when:
1. The dynamics of the state is unknown but simulation is easy: model-free vs. model-based reinforcement
learning.

2. Or the dimensionality is so high that we cannot store the information about the DP in a table.

• These methods work surprisingly well in a wide range of situations, although there are no methods that
are guaranteed to work.

• Key for success in economic applications: ability to simulate fast (link with massive parallelization).
Also, reinforcement learning complements neural networks very well.
68
Comparison with alternative methods

• Similar (same?) ideas are called approximate dynamic programming or neuro-dynamic programming.

• Traditional dynamic programming: we optimize over best feasible actions.

• Supervised learning: purely instructive feedback that indicates best feasible action regardless of
action actually taken.

• Unsupervised learning: hard to use for optimal control problems.

• In practice, we mix different methods.

• Current research challenge: how do we handle associative behavior effectively?

69
Example: Multi-armed bandit problem

• You need to choose action a among k available options.

• Each option is associated with a probability distribution of payoffs.

• You want to maximize the expected (discounted) payoffs.

• But you do not know which action is best; you only have estimates of your value function (a dual
control problem of identification and optimization).

• You can observe actions and period payoffs.

• The problem goes back to the study of the “sequential design of experiments” by Thompson (1933, 1934)
and Bellman (1956).

72
Theory vs. practice

• You can follow two pure strategies:

1. Follow greedy actions: actions with highest expected value. This is known as exploiting.

2. Follow non-greedy actions: actions with dominated expected value. This is known as exploring.

• This should remind you of a basic dynamic programming problem: what is the optimal mix of pure
strategies?

• If we impose enough structure on the problem (i.e., distributions of payoffs belong to some family,
stationarity, etc.), we can solve (either theoretically or applying standard solution techniques) the
optimal strategy (at least, up to some upper bound on computational capabilities).

• But these structures are too restrictive for practical purposes outside the pages of Econometrica.

74
A policy-based method I

• Proposed by Thathachar and Sastry (1985).

• A very simple method that uses the averages Q_n(a) of the rewards R_i(a) actually received:

  Q_n(a) = (1 / (n-1)) \sum_{i=1}^{n-1} R_i(a)

• We start with Q_0(a) = 0 for all a. Here (and later), we randomize among ties.

• We update Q_n(a) thanks to the nice recursive update based on the linearity of means:

  Q_{n+1}(a) = Q_n(a) + (1/n) [R_n(a) - Q_n(a)]

  Averages of actions not picked are not updated.

75
A policy-based method II

• How do we pick actions?

1. Pure greedy method: arg max_a Q_t(a).

2. ε-greedy method: mix the best action with a random trembling.

• Easy to generalize to more sophisticated strategies.

• In particular, we can connect with genetic algorithms (AlphaGo).

76
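Below is a minimal sketch of the sample-average method with ε-greedy action selection, run on a 10-armed testbed in the spirit of Thompson's problem; the Gaussian reward noise and the testbed itself are illustrative assumptions, not part of the slides.

import numpy as np

def epsilon_greedy_bandit(q_star, epsilon=0.1, n_steps=1_000, seed=0):
    """Sample-average ε-greedy agent on a k-armed bandit with true values q_star."""
    rng = np.random.default_rng(seed)
    k = len(q_star)
    Q = np.zeros(k)                      # action-value estimates, Q_0(a) = 0
    N = np.zeros(k)                      # number of times each action was taken
    rewards = np.empty(n_steps)
    for t in range(n_steps):
        if rng.random() < epsilon:       # explore: random (possibly non-greedy) action
            a = rng.integers(k)
        else:                            # exploit: greedy action, ties broken at random
            a = rng.choice(np.flatnonzero(Q == Q.max()))
        r = q_star[a] + rng.normal()     # noisy reward around the true value
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]        # incremental update Q_{n+1} = Q_n + (R_n - Q_n)/n
        rewards[t] = r
    return Q, rewards

q_star = np.random.default_rng(1).normal(size=10)      # a 10-armed testbed
Q, rewards = epsilon_greedy_bandit(q_star, epsilon=0.1)
print(np.argmax(q_star), np.argmax(Q), rewards.mean())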
[Figure: reward distributions q*(a), a = 1, ..., 10, for a 10-armed bandit testbed; rewards roughly range from -3 to 3.]
[Figure: average reward and % optimal action over 1,000 steps on the 10-armed testbed for ε-greedy action selection with ε = 0 (greedy), ε = 0.01, and ε = 0.1.]
A more general update rule

• Let’s think about a modified update rule:

  Q_{n+1}(a) = Q_n(a) + α [R_n(a) - Q_n(a)]

  for α ∈ (0, 1].

• This is equivalent, by recursive substitution, to:

  Q_{n+1}(a) = (1 - α)^n Q_1(a) + \sum_{i=1}^{n} α(1 - α)^{n-i} R_i(a)

• We can also have a time-varying α_n(a); convergence with probability 1 is ensured as long as:

  \sum_{n=1}^{∞} α_n(a) = ∞   and   \sum_{n=1}^{∞} α_n^2(a) < ∞

79
Improving the algorithm

• We can start with “optimistic” Q0 to induce exploration.

• We can implement an upper-confidence-bound action selection:

  arg max_a [ Q_n(a) + c \sqrt{ (log n) / N_n(a) } ]

• We can have a gradient bandit algorithm based on a softmax choice:

  π_n(a) = P(A_n = a) = e^{H_n(a)} / \sum_{b=1}^{k} e^{H_n(b)}

  where

  H_{n+1}(A_n) = H_n(A_n) + α (1 - π_n(A_n)) (R_n(a) - R̄_n)
  H_{n+1}(a)   = H_n(a) - α π_n(a) (R_n(a) - R̄_n)   for all a ≠ A_n

  This is a slightly hidden version of a stochastic gradient algorithm that we will see soon when we talk
  about deep learning.
80
[Figure 2.3 (Sutton and Barto): the effect of optimistic initial action-value estimates on the 10-armed testbed. An optimistic, greedy method (Q_1 = 5, ε = 0) is compared with a realistic ε-greedy method (Q_1 = 0, ε = 0.1) in terms of % optimal action over 1,000 plays; both methods use a constant step-size parameter α = 0.1.]
[Figure 2.4 (Sutton and Barto): average performance of UCB action selection (c = 2) on the 10-armed testbed. UCB generally performs better than ε-greedy action selection (ε = 0.1), except in the first k steps.]
[Figure 2.6 (Sutton and Barto): a parameter study of the bandit algorithms (ε-greedy, UCB, gradient bandit, and greedy with optimistic initialization, α = 0.1), showing average reward over the first 1,000 steps as a function of each algorithm’s parameter (ε, α, c, Q_0).]
Other algorithms

• Monte Carlo prediction.

• Temporal-difference (TD) learning:

  V^{n+1}(s_t) = V^n(s_t) + α (r_{t+1} + β V^n(s_{t+1}) - V^n(s_t))

• SARSA ⇒ on-policy TD control:

  Q^{n+1}(a_t, s_t) = Q^n(a_t, s_t) + α (r_{t+1} + β Q^n(a_{t+1}, s_{t+1}) - Q^n(a_t, s_t))

• Q-learning ⇒ off-policy TD control:

  Q^{n+1}(a_t, s_t) = Q^n(a_t, s_t) + α (r_{t+1} + β max_{a_{t+1}} Q^n(a_{t+1}, s_{t+1}) - Q^n(a_t, s_t))

• Value-based methods.

• Actor-critic methods.
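To make the last update rule concrete, here is a minimal sketch of tabular Q-learning with ε-greedy exploration; the environment interface (env.reset() returning a state index and env.step(s, a) returning the next state, the reward, and a done flag) and the hyperparameters are illustrative assumptions.

import numpy as np

def q_learning(env, n_states, n_actions, alpha=0.1, beta=0.95,
               epsilon=0.1, n_episodes=500, seed=0):
    """Tabular off-policy TD control:
    Q(s,a) <- Q(s,a) + α (r + β max_a' Q(s',a') - Q(s,a))."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # ε-greedy behavior policy.
            if rng.random() < epsilon:
                a = rng.integers(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(s, a)          # assumed environment interface
            target = r + beta * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])     # Q-learning update
            s = s_next
    return Q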
