
Department of Electrical Engineering, IIT Bombay

EE 451 - Supervised Research Exposition

On The Theory of Policy Gradient Methods

Prepared by: Kunal Randad (20D070049)


Instructor: Prof. Vivek Borkar
Date: November 30, 2023
Contents

1 Introduction
2 Notations
3 Tabular Class
3.1 Direct Parameterization
3.2 Softmax Parameterization
3.2.1 Asymptotic Convergence Without Regularization
3.2.2 Polynomial Convergence with Log-Barrier Regularization
3.2.3 Dimension-free Convergence of Natural Policy Gradient Ascent
3.3 Tabular Methods Summary
4 Function Approximation and Distribution Shift
4.1 Log Linear Policy Class And Soft Policy Iteration
4.2 Q-NPG Performance bounds for Log-Linear Policies
4.2.1 Algorithm 1
4.2.2 Algorithm 2
4.3 NPG: Performance Bounds for Smooth Policy Class
4.4 Summary for Approximate Methods

1 Introduction
This work provides provable characterizations of the computational, approximation,
and sample size properties of policy gradient methods in the context of discounted Markov
Decision Processes (MDPs). The policy parameterizations studied fall into two classes:

1) Tabular Policy Parameterization: The optimal policy is contained within the class.
In this case, results are provided for convergence to the optimal policy.

2) Approximate Policies / Parametric Policy Class: The optimal policy may not be
included in the set of policies that can be represented by this class. In this case, results
are provided with respect to the best policy available in the class, or some other
comparator policy (agnostic learning results).

2 Notations
Markov Decision Process (MDP) Definitions:

• A (finite) MDP M = (S, A, P, r, γ, ρ) is specified by:

– a finite state space S;


– a finite action space A;
– a transition model P where P (s′ | s, a) is the probability of transitioning into
state s′ upon taking action a in state s;
– a reward function r : S × A → [0, 1] where r(s, a) is the immediate reward
associated with taking action a in state s;
– a discount factor γ ∈ [0, 1);
– a starting state distribution ρ over S.

Policy Definitions:

• A deterministic, stationary policy π : S → A specifies a decision-making strategy


where at = π (st ).
• A stochastic policy π : S → ∆(A), where at ∼ π (· | st ).
• A policy induces a distribution over trajectories $\tau = (s_t, a_t, r_t)_{t=0}^{\infty}$, with $s_0$ drawn
from the starting state distribution ρ.

Value Function and Related Equations:

• The value function V π : S → R is the discounted sum of future rewards:


"∞ #
X
V π (s) := E γ t r (st , at ) | π, s0 = s
t=0

• 0 ≤ V π (s) ≤ 1
1−γ
, and V π (ρ) := Es0 ∼ρ [V π (s0 )].

Action-Value (Q-value) and Advantage Functions:

• The action-value function Qπ : S × A → R:


"∞ #
X
Qπ (s, a) = E γ t r (st , at ) | π, s0 = s, a0 = a
t=0

• The advantage function Aπ : S × A → R:

Aπ (s, a) := Qπ (s, a) − V π (s)

Policy Gradient Definitions:

• In order to introduce policy gradients, define the discounted state visitation distri-
bution dπs0 of a policy π as:

$$d^{\pi}_{s_0}(s) := (1 - \gamma) \sum_{t=0}^{\infty} \gamma^{t}\, \mathrm{Pr}^{\pi}(s_t = s \mid s_0)$$

where Prπ (st = s | s0 ) is the state visitation probability that st = s, after executing
π starting at state s0 .
• Overloading notation, write:

$$d^{\pi}_{\rho}(s) = \mathbb{E}_{s_0 \sim \rho}\left[d^{\pi}_{s_0}(s)\right]$$
where dπρ is the discounted state visitation distribution under the initial distribution
ρ.
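For a finite MDP, the discounted state visitation distribution has a closed form in terms of the
policy-induced transition matrix. Below is a minimal NumPy sketch under that assumption; the
array shapes and the helper name discounted_state_visitation are illustrative, not from the source.

```python
import numpy as np

def discounted_state_visitation(P, pi, s0, gamma):
    """Exact d^pi_{s0} for a finite MDP.

    P:  transition tensor, shape (S, A, S); P[s, a, s'] = P(s' | s, a)
    pi: stochastic policy, shape (S, A);    pi[s, a]    = pi(a | s)
    s0: starting state index
    """
    S = P.shape[0]
    # State-to-state transition matrix under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
    P_pi = np.einsum("sa,sat->st", pi, P)
    # d = (1 - gamma) * e_{s0}^T (I - gamma * P_pi)^{-1}
    e0 = np.zeros(S)
    e0[s0] = 1.0
    d = (1.0 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, e0)
    return d  # nonnegative, sums to 1
```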

3 Tabular Class
3.1 Direct Parameterization
The policies are parameterized by

$$\pi_{\theta}(a \mid s) = \theta_{s,a}$$

where $\theta \in \Delta(A)^{|S|}$, i.e. θ is subject to $\theta_{s,a} \ge 0$ for all s ∈ S, a ∈ A, and
$\sum_{a \in A} \theta_{s,a} = 1$ for all s ∈ S.

• In discussing the proof, consider the performance measure ρ and the optimization
measure µ.
• Notably, although the gradient concerns V π (µ), the guarantee extends to all distri-
butions ρ.
• Optimal performance under ρ might benefit from optimization under µ.

• The lemma implies that a small gradient in feasible directions indicates near-
optimality in value, but this holds if dπµ sufficiently covers the state distribution
of some optimal policy π ⋆ .
• Recall Bellman and Dreyfus’ theorem (1959) affirming a single policy π ⋆ optimal
for all starting states s0 .
• The exploration problem’s challenge is captured through the distribution mismatch
coefficient.

Distribution Mismatch Coefficient: Given a policy π and measures ρ, µ ∈ ∆(S), we refer to
$\left\| \frac{d^{\pi}_{\rho}}{\mu} \right\|_{\infty}$ as the distribution mismatch coefficient of π relative to µ. Here, $\frac{d^{\pi}_{\rho}}{\mu}$
denotes componentwise division.

Projected Gradient Ascent Algorithm:

• Updates: $\pi^{(t+1)} = P_{\Delta(A)^{|S|}}\left(\pi^{(t)} + \eta \nabla_{\pi} V^{(t)}(\mu)\right)$, where $P_{\Delta(A)^{|S|}}$ denotes the Euclidean projection onto the product of simplices.

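As a concrete illustration, here is a minimal NumPy sketch of one projected gradient ascent step
under direct parameterization. It assumes access to a gradient oracle grad_V returning
∇π V^π(µ) as an |S| × |A| array; the oracle, function names, and shapes are assumptions for
illustration, not part of the source.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a vector v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    cond = u + (1.0 - css) / k > 0
    rho = k[cond][-1]
    tau = (1.0 - css[cond][-1]) / rho
    return np.maximum(v + tau, 0.0)

def projected_ga_step(pi, grad_V, eta):
    """One update pi <- Proj(pi + eta * grad), projecting each row onto Delta(A).

    pi:     current policy, shape (S, A), rows sum to 1
    grad_V: callable returning dV/dpi as an (S, A) array (assumed oracle)
    eta:    step size
    """
    candidate = pi + eta * grad_V(pi)
    return np.apply_along_axis(project_to_simplex, 1, candidate)
```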
Theorem 5:

• For all ρ ∈ ∆(S), the algorithm satisfies:


$$\min_{t < T}\left\{ V^{\star}(\rho) - V^{(t)}(\rho) \right\} \le \epsilon \quad \text{if} \quad T > \frac{64\,\gamma\, |S|\,|A|}{(1-\gamma)^{6}\, \epsilon^{2}} \left\| \frac{d^{\pi^{\star}}_{\rho}}{\mu} \right\|_{\infty}^{2}$$

Explanation for Proof:

• A function f (θ) satisfies a gradient domination property if f (θ⋆ ) − f (θ) = O(G(θ))


for all θ, where G(θ) measures first-order stationarity.
• This condition ensures that finding a (near) first-order stationary point implies near
optimality in function value.
• The lemma establishes gradient domination for direct policy parameterization, a
standard device in global convergence analysis.
• Despite interest in V π (ρ), considering the gradient with respect to another state
distribution µ is helpful.

Lemma 4 (Gradient Domination):



$$V^{\pi^{\star}}(\rho) - V^{\pi}(\rho) \;\le\; \frac{1}{1-\gamma} \left\| \frac{d^{\pi^{\star}}_{\rho}}{\mu} \right\|_{\infty} \max_{\bar{\pi}} (\bar{\pi} - \pi)^{\top} \nabla_{\pi} V^{\pi}(\mu)$$

where the max is over all policies π̄ ∈ ∆(A)|S| .

Full Proof in Appendix B.1

3.2 Softmax Parameterization
Softmax Parameterization:
$$\pi_{\theta}(a \mid s) = \frac{\exp(\theta_{s,a})}{\sum_{a' \in A} \exp(\theta_{s,a'})}$$

Gradient for Softmax Parameterization:


$$\frac{\partial V^{\pi_{\theta}}(\mu)}{\partial \theta_{s,a}} = \frac{1}{1-\gamma}\, d^{\pi_{\theta}}_{\mu}(s)\, \pi_{\theta}(a \mid s)\, A^{\pi_{\theta}}(s, a)$$

Gradient Ascent Update Rule:


θ(t+1) = θ(t) + η∇θ V (t) (µ)
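To make the formula concrete, here is a minimal NumPy sketch that assembles the softmax policy
gradient from its ingredients; the visitation distribution d_mu and the advantage table A are assumed
to be supplied (e.g. computed exactly for a small tabular MDP), which is an illustrative assumption.

```python
import numpy as np

def softmax_policy(theta):
    """pi_theta(a|s) from logits theta of shape (S, A)."""
    z = theta - theta.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def softmax_policy_gradient(theta, d_mu, A, gamma):
    """dV^{pi_theta}(mu)/dtheta_{s,a} = d_mu(s) * pi(a|s) * A(s,a) / (1 - gamma).

    d_mu: discounted state visitation distribution under mu, shape (S,)
    A:    advantage table A^{pi_theta}(s, a), shape (S, A)
    """
    pi = softmax_policy(theta)
    return d_mu[:, None] * pi * A / (1.0 - gamma)

# One gradient ascent step: theta <- theta + eta * softmax_policy_gradient(theta, d_mu, A, gamma)
```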

3.2.1 Asymptotic Convergence Without Regularization

Theorem 10 (Global Convergence): Assuming gradient ascent with the softmax
parameterization and $\eta \le \frac{(1-\gamma)^{3}}{8}$, if µ(s) > 0 for all states s, then V^(t)(s) → V^⋆(s) as
t → ∞. The convergence rate, left as a question for future work, may be exponentially slow in the
size of the state space. The proof is given in detail in the paper.

In leveraging the gradient domination property (Lemma 4), our aim is to demonstrate
∇π V π (µ) → 0. However, employing the softmax parameterization (refer to Lemma 40
and (7)), we find that:

$$\frac{\partial V^{\pi_{\theta}}(\mu)}{\partial \theta_{s,a}} = \frac{1}{1-\gamma}\, d^{\pi_{\theta}}_{\mu}(s)\, \pi_{\theta}(a \mid s)\, A^{\pi_{\theta}}(s, a) = \pi_{\theta}(a \mid s)\, \frac{\partial V^{\pi_{\theta}}(\mu)}{\partial \pi_{\theta}(a \mid s)}$$

Hence, even if ∇θ V^{πθ}(µ) → 0, there is no assurance that ∇π V^{πθ}(µ) → 0. A
regularization-based approach is explored to obtain polynomial-rate convergence in all relevant quantities.

3.2.2 Polynomial Convergence with Log-Barrier Regularization


The log barrier regularizer helps avoid the issue of gradients becoming vanishingly
small at near-deterministic, sub-optimal policies.

• Recall the relative-entropy KL(p, q) := Ex∼p [− log q(x)/p(x)].


• Denote the uniform distribution over set X as Unif X .
• Define the log barrier regularized objective Lλ (θ) as follows (a code sketch of this
objective appears after this list):

$$L_{\lambda}(\theta) := V^{\pi_{\theta}}(\mu) - \lambda\, \mathbb{E}_{s \sim \mathrm{Unif}_{S}}\left[ \mathrm{KL}\!\left(\mathrm{Unif}_{A},\, \pi_{\theta}(\cdot \mid s)\right) \right]
= V^{\pi_{\theta}}(\mu) + \frac{\lambda}{|S||A|} \sum_{s,a} \log \pi_{\theta}(a \mid s) + \lambda \log |A|$$

where λ is a regularization parameter.

• Policy gradient ascent updates for Lλ (θ):

θ(t+1) = θ(t) + η∇θ Lλ (θ(t) )

• Theorem 12 (Log Barrier Regularization): If ∥∇θ Lλ (θ)∥2 ≤ ϵopt and ϵopt ≤


λ/(2|S||A|), then for all starting state distributions ρ:

$$V^{\pi_{\theta}}(\rho) \ge V^{\star}(\rho) - \frac{2\lambda}{1-\gamma} \left\| \frac{d^{\pi^{\star}}_{\rho}}{\mu} \right\|_{\infty}$$

• Corollary 13 (Iteration Complexity with Log Barrier Regularization):


Let $\beta_{\lambda} := \frac{8\gamma}{(1-\gamma)^{3}} + \frac{2\lambda}{|S|}$. Starting from any initial $\theta^{(0)}$, consider updates (13) with
$\lambda = \frac{\epsilon(1-\gamma)}{2 \left\| d^{\pi^{\star}}_{\rho} / \mu \right\|_{\infty}}$ and $\eta = 1/\beta_{\lambda}$. Then for all starting state distributions ρ:

$$\min_{t < T}\left\{ V^{\star}(\rho) - V^{(t)}(\rho) \right\} \le \epsilon \quad \text{whenever} \quad T \ge \frac{320\, |S|^{2} |A|^{2}}{(1-\gamma)^{6}\, \epsilon^{2}} \left\| \frac{d^{\pi^{\star}}_{\rho}}{\mu} \right\|_{\infty}^{2}$$

• See Appendix C.2 for the proof. The corollary emphasizes balancing λ with desired
accuracy ϵ and the importance of the initial distribution µ for global optimality.
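As promised above, here is a minimal sketch of the log-barrier regularized objective for a tabular
softmax policy. The scalar V_mu (the unregularized value under µ) is assumed to be supplied by a
separate policy evaluation routine; that routine and the function name are illustrative assumptions.

```python
import numpy as np

def log_barrier_objective(theta, V_mu, lam):
    """L_lambda(theta) = V^{pi_theta}(mu) + lam/(|S||A|) * sum_{s,a} log pi_theta(a|s) + lam*log|A|.

    theta: softmax logits, shape (S, A)
    V_mu:  scalar value V^{pi_theta}(mu), assumed precomputed elsewhere
    lam:   regularization strength lambda
    """
    S, A = theta.shape
    log_pi = theta - theta.max(axis=1, keepdims=True)
    log_pi = log_pi - np.log(np.exp(log_pi).sum(axis=1, keepdims=True))  # log-softmax per state
    return V_mu + lam / (S * A) * log_pi.sum() + lam * np.log(A)
```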

Remark 14 (Entropy vs. Log Barrier Regularization):

• Entropy (Mnih et al. 2016) regularizer:


$$\frac{1}{|S|} \sum_{s} H\!\left(\pi_{\theta}(\cdot \mid s)\right) = \frac{1}{|S|} \sum_{s} \sum_{a} -\pi_{\theta}(a \mid s) \log \pi_{\theta}(a \mid s)$$

• Entropy is less aggressive in penalizing small probabilities than the log barrier
regularizer, which is equivalent to a relative entropy.
• Entropy is bounded between 0 and log |A|, while relative entropy is between 0 and
infinity, approaching infinity as probabilities tend to 0.
• Open question: Achieving polynomial convergence with the more common entropy
regularizer.
• Polynomial convergence using KL regularizer relies on relative entropy’s aggressive
prevention of small probabilities.

3.2.3 Dimension-free Convergence of Natural Policy Gradient Ascent


NPG can be considered a quasi second-order method due to its use of a particular
preconditioner. Its fast, dimension-free convergence shows how the variable preconditioner
in the natural gradient improves over standard gradient ascent algorithms. The NPG algorithm is:

$$F_{\rho}(\theta) = \mathbb{E}_{s \sim d^{\pi_{\theta}}_{\rho}}\, \mathbb{E}_{a \sim \pi_{\theta}(\cdot \mid s)}\left[ \nabla_{\theta} \log \pi_{\theta}(a \mid s)\, \left(\nabla_{\theta} \log \pi_{\theta}(a \mid s)\right)^{\top} \right]$$
$$\theta^{(t+1)} = \theta^{(t)} + \eta\, F_{\rho}\!\left(\theta^{(t)}\right)^{\dagger} \nabla_{\theta} V^{(t)}(\rho)$$

Lemma 15 (NPG as Soft Policy Iteration):

$$\theta^{(t+1)} = \theta^{(t)} + \frac{\eta}{1-\gamma} A^{(t)} \quad \text{and} \quad \pi^{(t+1)}(a \mid s) = \pi^{(t)}(a \mid s)\, \frac{\exp\!\left(\eta A^{(t)}(s, a)/(1-\gamma)\right)}{Z_{t}(s)}$$

where $Z_{t}(s) = \sum_{a \in A} \pi^{(t)}(a \mid s) \exp\!\left(\eta A^{(t)}(s, a)/(1-\gamma)\right)$.
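In the tabular softmax case, Lemma 15 says that one NPG step is a multiplicative-weights style
update on the policy. A minimal NumPy sketch follows; the advantage table A is assumed to be
supplied by an evaluation routine (an illustrative assumption).

```python
import numpy as np

def npg_soft_policy_iteration_step(pi, A, eta, gamma):
    """One NPG step as soft policy iteration (Lemma 15).

    pi: current policy pi^{(t)}, shape (S, A)
    A:  advantage table A^{(t)}(s, a), shape (S, A)
    """
    weights = pi * np.exp(eta * A / (1.0 - gamma))
    Z = weights.sum(axis=1, keepdims=True)  # per-state normalizer Z_t(s)
    return weights / Z
```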

Moore-Penrose Pseudoinverse: The Moore-Penrose pseudoinverse M † of a matrix


M is a generalization of the matrix inverse for non-square matrices. It satisfies the fol-
lowing properties:

• M M †M = M ,
• M †M M † = M †,
• (M M † )⊤ = M M † , and
• (M † M )⊤ = M † M .

It plays a crucial role in the NPG algorithm by allowing for effective updates in the
parameter space. The processing complexity in FLOPs can be expressed as 2np² + 2n³
for an n × p matrix when using the common Singular Value Decomposition (SVD)
approach. The complexity of SVD (from the PyTorch documentation) is O(nm²),
where m is the larger dimension of the matrix and n the smaller.
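A minimal NumPy sketch of computing the pseudoinverse via SVD (equivalently, numpy.linalg.pinv),
with near-zero singular values truncated:

```python
import numpy as np

def pinv_via_svd(M, rcond=1e-10):
    """Moore-Penrose pseudoinverse via SVD: M = U diag(s) V^T  =>  M^dagger = V diag(1/s) U^T."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_inv = np.where(s > rcond * s.max(), 1.0 / s, 0.0)  # drop near-zero singular values
    return Vt.T @ np.diag(s_inv) @ U.T

# Preconditioned NPG direction: F(theta)^dagger @ grad
# direction = pinv_via_svd(F) @ grad
```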

Theorem 16 (Global Convergence for NPG):


$$V^{(T)}(\rho) \ge V^{\star}(\rho) - \frac{\log |A|}{\eta T} - \frac{1}{(1-\gamma)^{2}\, T}$$

Setting $\eta \ge (1-\gamma)^{2} \log |A|$, NPG finds an ϵ-optimal policy in at most $T \le \frac{2}{(1-\gamma)^{2}\, \epsilon}$ iterations.

3.3 Tabular Methods Summary

Algorithm (iteration complexity):

• Projected Gradient Ascent on the Simplex (Thm 5): $O\!\left( \frac{D_{\infty}^{2}\, |S||A|}{(1-\gamma)^{6}\, \epsilon^{2}} \right)$
• Policy Gradient, softmax parameterization (Thm 10): asymptotic
• Policy Gradient + log barrier regularization, softmax parameterization (Cor 13): $O\!\left( \frac{D_{\infty}^{2}\, |S|^{2}|A|^{2}}{(1-\gamma)^{6}\, \epsilon^{2}} \right)$
• Natural Policy Gradient (NPG), softmax parameterization (Thm 16): $\frac{2}{(1-\gamma)^{2}\, \epsilon}$

Table summary: the table lists iteration complexities for finding a policy π with V⋆(s0) − V^π(s0) ≤ ϵ.
The algorithms optimize $\mathbb{E}_{s \sim \mu}\left[V^{\pi}(s)\right]$, with |S| states, |A| actions, and 0 ≤ γ < 1.
$D_{\infty} := \max_{s} \frac{d^{\pi^{\star}}_{s_{0}}(s)}{\mu(s)}$ measures the distribution mismatch. NPG has no $D_{\infty}$ dependence and works for any s0.

4 Function Approximation and Distribution Shift
In contrast to the tabular scenarios explored earlier, the analysis extends to the
realm of function approximation where policy classes lack full expressiveness (e.g., when
d ≪ |S||A|). The investigation revolves around variants of the Natural Policy Gradient
(NPG) update rule:
θ ← θ + ηFρ (θ)† ∇θ V θ (ρ)
The analytical framework establishes a close connection between the NPG update and
the concept of compatible function approximation, formalized by Kakade (2001). This
relationship links the NPG update to a regression problem where the optimal weight
vector w⋆ is given by
$$w_{\star} \in \operatorname*{argmin}_{w}\; \mathbb{E}_{s \sim d^{\pi_{\theta}}_{\rho},\, a \sim \pi_{\theta}(\cdot \mid s)}\left[ \left( w^{\top} \nabla_{\theta} \log \pi_{\theta}(a \mid s) - A^{\pi_{\theta}}(s, a) \right)^{2} \right]$$

The regression problem embodies a form of "compatible" function approximation,


approximating Aπθ (s, a) using features derived from ∇θ log πθ (· | s). Additionally, a
variant of the update rule, termed Q-NPG, utilizes Q-values instead of advantages in the
regression.

This approach allows for approximate updates, solving relevant regression problems
with samples. The main results demonstrate the effectiveness of NPG updates in scenarios
where errors arise from both statistical estimation (when exact gradients are unavailable)
and function approximation (due to using a parameterized function class). Notably, a
novel estimation/approximation decomposition tailored for the NPG algorithm is pro-
vided.

To delve into specific policy classes, the analysis begins with log-linear policies as
a special case. Subsequently, the investigation extends to more general policy classes,
including neural policy classes.

Importantly, these results represent a significant advancement in providing provable


approximation guarantees without explicit worst-case dependencies over the state space.

4.1 Log Linear Policy Class And Soft Policy Iteration


Log-Linear Policy Class: Each policy in the log-linear class is defined as:
$$\pi_{\theta}(a \mid s) = \frac{\exp(\theta \cdot \phi_{s,a})}{\sum_{a' \in A} \exp(\theta \cdot \phi_{s,a'})}$$

where θ ∈ Rd and ϕs,a ∈ Rd is a feature mapping.

Compatible Function Approximation: For the log-linear policy class, the gradient
of the log-policy is expressed as:

∇θ log πθ (a | s) = ϕ̄θs,a , where ϕ̄θs,a = ϕs,a − Ea′ ∼πθ (·|s) [ϕs,a′ ]

Here, ϕ̄θs,a is the centered version of ϕs,a . Additionally, ϕ̄π is defined similarly for any
policy π.

NPG Update Rule: The NPG update rule, using the centered features, is equiva-
lent to:

NPG: $\theta \leftarrow \theta + \eta w_{\star}$, $\quad w_{\star} \in \operatorname*{argmin}_{w}\; \mathbb{E}_{s,a \sim d^{\pi_{\theta}}_{\rho}}\left[ \left( A^{\pi_{\theta}}(s, a) - w \cdot \bar{\phi}^{\theta}_{s,a} \right)^{2} \right]$

Q-NPG Variant: We also consider a variant, Q-NPG, without centering the features:

Q-NPG: $\theta \leftarrow \theta + \eta w_{\star}$, $\quad w_{\star} \in \operatorname*{argmin}_{w}\; \mathbb{E}_{s,a \sim d^{\pi_{\theta}}_{\rho}}\left[ \left( Q^{\pi_{\theta}}(s, a) - w \cdot \phi_{s,a} \right)^{2} \right]$

Note that, unlike the advantage function, Qπ (s, a) does not have zero expectation under π(· | s). If
the compatible function approximation error is 0, then it is easy to verify that NPG and Q-NPG
are equivalent, since their corresponding policy updates coincide.
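The inner step of Q-NPG is a constrained least-squares fit of Q-values onto features. A minimal
NumPy sketch follows, assuming sampled (feature, Q-estimate) pairs drawn from the appropriate
distribution (the sampling itself is Algorithm 1 below); handling the norm constraint ∥w∥ ≤ W by
rescaling the unconstrained solution is an illustrative simplification, not the exact method from the source.

```python
import numpy as np

def fit_qnpg_weights(Phi, Q_hat, W):
    """Approximately solve  argmin_{||w|| <= W}  mean_i (Q_hat[i] - w . Phi[i])^2.

    Phi:   feature matrix, shape (N, d), rows phi_{s_i, a_i}
    Q_hat: unbiased Q-value estimates, shape (N,)
    W:     norm bound on w (enforced here by rescaling; a simplification)
    """
    w, *_ = np.linalg.lstsq(Phi, Q_hat, rcond=None)
    norm = np.linalg.norm(w)
    if norm > W:
        w = w * (W / norm)
    return w

# Q-NPG update: theta <- theta + eta * w
```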

4.2 Q-NPG Performance bounds for Log-Linear Policies


For a state-action distribution v, define:

$$L(w; \theta, v) := \mathbb{E}_{s,a \sim v}\left[ \left( Q^{\pi_{\theta}}(s, a) - w \cdot \phi_{s,a} \right)^{2} \right]$$

The iterates of the Q-NPG algorithm minimize this loss under a changing distribution v.
Now, an approximate version of Q-NPG is defined with a starting state-action distribution
ν. The visitation measure over states and actions, dπν , is given by:

$$d^{\pi}_{\nu}(s, a) := (1 - \gamma)\, \mathbb{E}_{s_0, a_0 \sim \nu}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, \mathrm{Pr}^{\pi}(s_t = s, a_t = a \mid s_0, a_0) \right]$$

Define $d^{(t)} := d^{\pi^{(t)}}_{\nu}$. The approximate Q-NPG algorithm is:

$$\text{Approx. Q-NPG:} \quad \theta^{(t+1)} = \theta^{(t)} + \eta w^{(t)}, \qquad w^{(t)} \approx \operatorname*{argmin}_{\|w\|_2 \le W} L\!\left(w; \theta^{(t)}, d^{(t)}\right)$$




The exact minimizer is denoted as:

$$w^{(t)}_{\star} \in \operatorname*{argmin}_{\|w\|_2 \le W} L\!\left(w; \theta^{(t)}, d^{(t)}\right)$$




Assumption 6.1 (Estimation/Transfer errors): Fix a state distribution ρ, a state-


action distribution ν, and an arbitrary comparator policy π ⋆ . Define the state-action
measure d⋆ as

$$d^{\star}(s, a) = d^{\pi^{\star}}_{\rho}(s) \circ \mathrm{Unif}_{A}(a)$$

where d⋆ samples states from the comparator's state visitation measure $d^{\pi^{\star}}_{\rho}$ and actions
from the uniform distribution.

Suppose for all t < T :

1. (Excess risk) The estimation error (because of sampling like in Algorithm 1) is


bounded:
$$\mathbb{E}\left[ L\!\left(w^{(t)}; \theta^{(t)}, d^{(t)}\right) - L\!\left(w^{(t)}_{\star}; \theta^{(t)}, d^{(t)}\right) \right] \le \epsilon_{\mathrm{stat}}$$

2. (Transfer error) The best predictor $w^{(t)}_{\star}$ has an error bounded by ϵbias with respect
to the comparator's measure d⋆:

$$\mathbb{E}\left[ L\!\left(w^{(t)}_{\star}; \theta^{(t)}, d^{\star}\right) \right] \le \epsilon_{\mathrm{bias}}$$

Assumption 6.2 (Relative condition number): Consider the same ρ, ν, and π ⋆ as


in Assumption 6.1. With respect to any state-action distribution v, define:

$$\Sigma_{v} = \mathbb{E}_{s,a \sim v}\left[ \phi_{s,a}\, \phi_{s,a}^{\top} \right]$$

and define:

$$\kappa = \sup_{w \in \mathbb{R}^{d}} \frac{w^{\top} \Sigma_{d^{\star}}\, w}{w^{\top} \Sigma_{\nu}\, w}$$
Assume that κ is finite.
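Numerically, κ is the largest generalized eigenvalue of the pair (Σd⋆, Σν). A minimal sketch using
SciPy follows; the two covariance matrices are assumed to be estimated from samples, and Σν is
assumed positive definite.

```python
import numpy as np
from scipy.linalg import eigh

def relative_condition_number(Sigma_dstar, Sigma_nu):
    """kappa = sup_w (w^T Sigma_dstar w) / (w^T Sigma_nu w),
    i.e. the largest generalized eigenvalue of (Sigma_dstar, Sigma_nu).
    Assumes Sigma_nu is positive definite."""
    eigvals = eigh(Sigma_dstar, Sigma_nu, eigvals_only=True)
    return float(eigvals[-1])  # eigenvalues are returned in ascending order
```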
Theorem 20 (Agnostic learning with Q-NPG): Fix a state distribution ρ, a state-
action distribution ν, and a comparator policy π ⋆ . Suppose Assumption 6.2 holds and
$\|\phi_{s,a}\|_{2} \le B$ for all s, a. Suppose the Q-NPG update rule (in (20)) starts with $\theta^{(0)} = 0$,
$\eta = \sqrt{2 \log |A| / (B^{2} W^{2} T)}$, and the (random) sequence of iterates satisfies Assumption
6.1. Then,

$$\mathbb{E}\left[ \min_{t < T}\left\{ V^{\pi^{\star}}(\rho) - V^{(t)}(\rho) \right\} \right]
\le \frac{BW}{1-\gamma} \sqrt{\frac{2 \log |A|}{T}} + \sqrt{\frac{4|A|\, \kappa\, \epsilon_{\mathrm{stat}}}{(1-\gamma)^{3}}} + \frac{\sqrt{4|A|\, \epsilon_{\mathrm{bias}}}}{1-\gamma}$$

The proof is provided in Section 6.4.

The analysis uses the decomposition of the per-iterate loss into an excess risk and an approximation error:

$$L\!\left(w^{(t)}; \theta^{(t)}, d^{(t)}\right) = \underbrace{L\!\left(w^{(t)}; \theta^{(t)}, d^{(t)}\right) - L\!\left(w^{(t)}_{\star}; \theta^{(t)}, d^{(t)}\right)}_{\text{Excess risk}} + \underbrace{L\!\left(w^{(t)}_{\star}; \theta^{(t)}, d^{(t)}\right)}_{\text{Approximation error}}$$

Corollary 21 (Estimation error / approximation error bound for Q-NPG):
Consider the same setting as in Theorem 20. Instead of assuming the transfer error is
bounded (part 2 in Assumption 6.1), suppose that, for all t ≤ T ,

$$\mathbb{E}\left[ L\!\left(w^{(t)}_{\star}; \theta^{(t)}, d^{(t)}\right) \right] \le \epsilon_{\mathrm{approx}}$$

Then,

$$\mathbb{E}\left[ \min_{t < T}\left\{ V^{\pi^{\star}}(\rho) - V^{(t)}(\rho) \right\} \right]
\le \frac{BW}{1-\gamma} \sqrt{\frac{2 \log |A|}{T}} + \sqrt{\frac{4|A|}{(1-\gamma)^{3}} \left( \kappa \cdot \epsilon_{\mathrm{stat}} + \left\| \frac{d^{\star}}{\nu} \right\|_{\infty} \cdot \epsilon_{\mathrm{approx}} \right)}$$

Proof: We have the following crude upper bound on the transfer error:

$$L\!\left(w^{(t)}_{\star}; \theta^{(t)}, d^{\star}\right) \le \left\| \frac{d^{\star}}{d^{(t)}} \right\|_{\infty} L\!\left(w^{(t)}_{\star}; \theta^{(t)}, d^{(t)}\right) \le \frac{1}{1-\gamma} \left\| \frac{d^{\star}}{\nu} \right\|_{\infty} L\!\left(w^{(t)}_{\star}; \theta^{(t)}, d^{(t)}\right),$$

where the last step uses the definition of $d^{(t)}$ (see (19)). This implies $\epsilon_{\mathrm{bias}} \le \frac{1}{1-\gamma} \left\| \frac{d^{\star}}{\nu} \right\|_{\infty} \epsilon_{\mathrm{approx}}$,
and the corollary follows.

Remark 24 (ϵbias = 0 for "linear" MDPs): In the recent linear MDP model of Jin
et al. (2019); Yang and Wang (2019); Jiang et al. (2017), where the transition dynamics
are low rank, we have that ϵbias = 0 provided we use the features of the linear MDP.
Our guarantees also permit model misspecification of linear MDPs, with non worst-case
approximation error where ϵbias ̸= 0.

Remark 25 (Comparison with POLITEX and EE-POLITEX): Compared with POLITEX


(Abbasi-Yadkori et al., 2019a), Assumption 6.2 is substantially milder, in that it just as-
sumes a good relative condition number for one policy rather than all possible policies
(which cannot hold in general even for tabular MDPs). Changing this assumption to an
analog of Assumption 6.2 is the main improvement in the analysis of the EE-POLITEX
(Abbasi-Yadkori et al. 2019b) algorithm.

4.2.1 Algorithm 1
An analysis of these algorithms can be found in the Algorithms section (Section C) of Linear
Convergence of Natural Policy Gradient Methods with Log-Linear Policies [2], published as a
conference paper at ICLR 2023. A Python sketch of the sampler is given after the listing below.
Input: Starting state-action distribution ν.
1. Sample s0 , a0 ∼ ν.
2. Sample s, a ∼ dπν as follows: at every timestep h, with probability γ, act according
to π; else, accept (sh , ah ) as the sample and proceed to Step 3. See (19).
3. From sh , ah , continue to execute π, and use a termination probability of 1 − γ.
Upon termination, set Q̂π (sh , ah ) as the undiscounted sum of rewards from time h
onwards.
4. return (sh , ah ) and Q̂π (sh , ah ).
Algorithm 1: Sampler for s, a ∼ dπν and unbiased estimate of Qπ (s, a)
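Below is a minimal Python sketch of this sampler, assuming a generic environment interface
env.step_from(s, a) returning (next_state, reward), a sampling policy policy.sample(s), and
nu() returning an initial state-action pair; these interface names are illustrative assumptions,
not an API from the source.

```python
import random

def sample_q_estimate(env, policy, nu, gamma, rng=random):
    """Algorithm 1 sketch: draw (s, a) ~ d^pi_nu and an unbiased estimate of Q^pi(s, a)."""
    # Step 1: start from (s0, a0) ~ nu.
    s, a = nu()
    # Step 2: at every step, with probability gamma keep acting under pi; else accept (s, a).
    while rng.random() < gamma:
        s, _ = env.step_from(s, a)        # assumed: transition to s' ~ P(.|s, a)
        a = policy.sample(s)
    # Step 3: from the accepted (s, a), roll out pi with termination probability 1 - gamma,
    # accumulating the *undiscounted* sum of rewards as the Q estimate.
    q_hat = 0.0
    cur_s, cur_a = s, a
    while True:
        next_s, r = env.step_from(cur_s, cur_a)
        q_hat += r
        if rng.random() < 1.0 - gamma:
            break
        cur_s, cur_a = next_s, policy.sample(next_s)
    # Step 4: return the sample and the Q estimate.
    return (s, a), q_hat
```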

4.2.2 Algorithm 2

Figure 4.1: Algorithm 2 (see the referenced papers for the full pseudocode).

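In lieu of the missing figure, here is a hedged sketch of an outer Q-NPG loop consistent with the
updates in Section 4.2, reusing sample_q_estimate and fit_qnpg_weights from the sketches above.
The per-iteration sample count N, the feature map phi, and the log-linear policy interface are
illustrative assumptions, not the exact procedure from the figure.

```python
import numpy as np

def q_npg(env, policy, nu, phi, gamma, eta, W, T, N):
    """Sketch of sample-based Q-NPG: at each iteration, sample (s, a, Q_hat) triples,
    fit w by (approximately) constrained least squares, and update theta <- theta + eta * w.

    phi(s, a) -> feature vector in R^d (assumed feature map)
    policy    -> log-linear policy object exposing .theta and .sample(s) (assumed interface)
    """
    for t in range(T):
        feats, targets = [], []
        for _ in range(N):
            (s, a), q_hat = sample_q_estimate(env, policy, nu, gamma)
            feats.append(phi(s, a))
            targets.append(q_hat)
        w = fit_qnpg_weights(np.array(feats), np.array(targets), W)
        policy.theta = policy.theta + eta * w   # Q-NPG step
    return policy
```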
4.3 NPG: Performance Bounds for Smooth Policy Class

$$L_{A}(w; \theta, v) := \mathbb{E}_{s,a \sim v}\left[ \left( A^{\pi_{\theta}}(s, a) - w \cdot \nabla_{\theta} \log \pi_{\theta}(a \mid s) \right)^{2} \right]$$

$$\text{Approx. NPG:} \quad \theta^{(t+1)} = \theta^{(t)} + \eta w^{(t)}, \qquad w^{(t)} \approx \operatorname*{argmin}_{\|w\|_2 \le W} L_{A}\!\left(w; \theta^{(t)}, d^{(t)}\right)$$

Let $w^{(t)}_{\star}$ denote the exact minimizer, i.e. $w^{(t)}_{\star} \in \operatorname*{argmin}_{\|w\|_2 \le W} L_{A}\!\left(w; \theta^{(t)}, d^{(t)}\right)$.

Assumption 6.4 (Policy Smoothness): Assume that for all s ∈ S and a ∈ A, log πθ (a | s)
is a β-smooth function of θ.

Theorem 29 (Agnostic learning with NPG) Fix a state distribution ρ; a state-


action distribution ν; an arbitrary comparator policy π ⋆ (not necessarily an optimal
policy). Suppose Assumption 6.4 holds. Suppose the NPG update rule (in (21)) starts
with π(0) being the uniform distribution (at each state), $\eta = \sqrt{2 \log |A| / (\beta W^{2} T)}$, and
the (random) sequence of iterates satisfies Assumption 6.5. We have that

$$\mathbb{E}\left[ \min_{t < T}\left\{ V^{\pi^{\star}}(\rho) - V^{(t)}(\rho) \right\} \right]
\le \frac{W}{1-\gamma} \sqrt{\frac{2\beta \log |A|}{T}} + \sqrt{\frac{\kappa\, \epsilon_{\mathrm{stat}}}{(1-\gamma)^{3}}} + \frac{\sqrt{\epsilon_{\mathrm{bias}}}}{1-\gamma}.$$

The proof is provided in Section 6.4.

Remarks: Observe there is no polynomial dependence on |A| in the rate for NPG
(in contrast to Theorem 20). The main difference in the analysis is that, for Q-NPG,
the error in fitting the Q-values must be converted into an error in the advantages; this
leads to the dependence on |A| (which can be removed with a path-dependent bound,
i.e. a bound that depends on the sequence of iterates produced by the algorithm). For
NPG, directly fitting the advantage function sidesteps this conversion step.

Compared with the result of Theorem 16 in the noiseless, tabular case, we see two
main differences. In the setting of Theorem 16, we have ϵstat = ϵbias = 0, so that the
last two terms vanish. This leaves the first term, where we observe a slower $1/\sqrt{T}$ rate
compared with Theorem 16, with an additional dependence on W (which grows as
O(|S||A|/(1 − γ)) to approximate the advantages in the tabular setting). Both differences
arise from the additional monotonicity property (Lemma 17) on the per-step improve-
ments in the tabular case, which is not easily generalized to the function approximation
setting.

4.4 Summary for Approximate Methods


• Table 2: Overview of Approximate Methods: The suboptimality, V ⋆ (s0 ) − V π (s0 ),
after T iterations for various approximate algorithms, which use different notions
of approximation error (sample complexities are not directly considered but instead
may be thought of as part of ϵ1 and ϵstat ). Order notation is used to drop constants,
and we assume |A| = 2 for ease of exposition.
• For approximate dynamic programming methods, the relevant error is the worst-
case ℓ∞ error in approximating a value function, e.g. $\epsilon_{\infty} = \max_{s,a} \left| Q^{\pi}(s, a) - \widehat{Q}^{\pi}(s, a) \right|$,
where $\widehat{Q}^{\pi}$ is what an estimation oracle returns during the course of the algorithm.

Figure 4.2: Summary table for Approximate methods

• The second row (see Lemma 12 in Antos et al. (2008)) is a refinement of this ap-
proach, where ϵ1 is an ℓ1 -average error in fitting the value functions under the fitting
(state) distribution µ, and, roughly, C∞ is a worst case density ratio between the
state visitation distribution of any non-stationary policy and the fitting distribution
µ.
• For Conservative Policy Iteration, ϵ1 is a related ℓ1 -average case fitting error with
respect to a fitting distribution µ, and D∞ is defined as before in the caption of
Table 1 (see also Kakade and Langford, 2002); here, D∞ ≤ C∞ (e.g. see Scherrer
(2014)).
• For NPG, ϵstat and ϵapprox measure the excess risk (the regret) and approximation
errors in fitting the values. Roughly speaking, ϵstat is the excess squared loss rel-
ative to the best fit (among an appropriately defined parametric class) under our
fitting distribution (defined with respect to the state distribution µ ). Here, ϵapprox
is the approximation error: the minimal possible error (in our parametric class)
under our fitting distribution. The condition number κ is a relative eigenvalue con-
dition between appropriately defined feature covariances with respect to the state

visitation distribution of an optimal policy, $d^{\pi^{\star}}_{s_0}$, and the state fitting distribution
µ.

References
[1] Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. On the Theory
of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift. Journal
of Machine Learning Research, 22 (2021), 1-76.

[2] Rui Yuan, Simon S. Du, Robert M. Gower, Alessandro Lazaric, and Lin Xiao. Linear
Convergence of Natural Policy Gradient Methods with Log-Linear Policies. Published as
a conference paper at ICLR 2023.

[3] Amir M. Ahmadian, Wolfgang Zirwas, Rakash Sivasiva Ganesan, and Berthold Panzner.
Low Complexity Moore-Penrose Inverse for Large CoMP Areas with Sparse Massive
MIMO Channel Matrices. Nokia Bell Labs, Radio Systems Research, Munich, Germany.
