
Department of Electrical Engineering, IIT Bombay

EE 451 - Supervised Research Exposition

On The Theory of Policy Gradient Methods

Prepared by: Kunal Randad (20D070049)


Instructor: Prof. Vivek Borkar
Date: November 30, 2023
Contents

1 Introduction
2 Notations
3 Tabular Class
3.1 Direct Parameterization
3.2 Softmax Parameterization
3.2.1 Asymptotic Convergence Without Regularization
3.2.2 Polynomial Convergence with Log-Barrier Regularization
3.2.3 Dimension-free Convergence of Natural Policy Gradient Ascent
3.3 Tabular Methods Summary
4 Function Approximation and Distribution Shift
4.1 Log Linear Policy Class And Soft Policy Iteration
4.2 Q-NPG Performance bounds for Log-Linear Policies
4.2.1 Algorithm 1
4.2.2 Algorithm 2
4.3 NPG: Performance Bounds for Smooth Policy Class
4.4 Summary for Approximate Methods

1 Introduction
This work provides provable characterizations of the computational, approximation,
and sample size properties of policy gradient methods in the context of discounted Markov
Decision Processes (MDPs). The policy parameterizations studied fall into two classes:

1) Tabular Policy Parameterization: The optimal policy is contained within the class.
In this case, results are provided for convergence to the optimal policy.

2) Approximate Policies / Parametric Policy Class: The optimal policy may not be
included in the set of policies that can be represented by this class. In this case, results
are provided with respect to the best policy available in the class, or some other
comparator policy (agnostic learning results).

2 Notations
Markov Decision Process (MDP) Definitions:

• A (finite) MDP M = (S, A, P, r, γ, ρ) is specified by:

– a finite state space S;


– a finite action space A;
– a transition model P where P (s′ | s, a) is the probability of transitioning into
state s′ upon taking action a in state s;
– a reward function r : S × A → [0, 1] where r(s, a) is the immediate reward
associated with taking action a in state s;
– a discount factor γ ∈ [0, 1);
– a starting state distribution ρ over S.

Policy Definitions:

• A deterministic, stationary policy π : S → A specifies a decision-making strategy


where at = π (st ).
• A stochastic policy π : S → ∆(A), where at ∼ π (· | st ).
• A policy induces a distribution over trajectories $\tau = (s_t, a_t, r_t)_{t=0}^{\infty}$, with $s_0$ drawn
from the starting state distribution ρ.

Value Function and Related Equations:

• The value function V π : S → R is the discounted sum of future rewards:


"∞ #
X
V π (s) := E γ t r (st , at ) | π, s0 = s
t=0

• 0 ≤ V π (s) ≤ 1
1−γ
, and V π (ρ) := Es0 ∼ρ [V π (s0 )].

Action-Value (Q-value) and Advantage Functions:

• The action-value function Qπ : S × A → R:


"∞ #
X
Qπ (s, a) = E γ t r (st , at ) | π, s0 = s, a0 = a
t=0

• The advantage function Aπ : S × A → R:

Aπ (s, a) := Qπ (s, a) − V π (s)

Policy Gradient Definitions:

• In order to introduce policy gradients, define the discounted state visitation distri-
bution dπs0 of a policy π as:

$$d^{\pi}_{s_0}(s) := (1 - \gamma) \sum_{t=0}^{\infty} \gamma^{t}\, \mathrm{Pr}^{\pi}(s_t = s \mid s_0)$$

where Prπ (st = s | s0 ) is the state visitation probability that st = s, after executing
π starting at state s0 .
• Overloading notation, write:

$$d^{\pi}_{\rho}(s) = \mathbb{E}_{s_0 \sim \rho}\left[d^{\pi}_{s_0}(s)\right]$$
where dπρ is the discounted state visitation distribution under the initial distribution
ρ.
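For a finite MDP, the discounted state visitation distribution has a closed form in terms of the
policy-induced transition matrix. Below is a minimal NumPy sketch under that assumption; the
array shapes and the helper name discounted_state_visitation are illustrative, not from the source.

```python
import numpy as np

def discounted_state_visitation(P, pi, s0, gamma):
    """Exact d^pi_{s0} for a finite MDP.

    P:  transition tensor, shape (S, A, S); P[s, a, s'] = P(s' | s, a)
    pi: stochastic policy, shape (S, A);    pi[s, a]    = pi(a | s)
    s0: starting state index
    """
    S = P.shape[0]
    # State-to-state transition matrix under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
    P_pi = np.einsum("sa,sat->st", pi, P)
    # d = (1 - gamma) * e_{s0}^T (I - gamma * P_pi)^{-1}
    e0 = np.zeros(S)
    e0[s0] = 1.0
    d = (1.0 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, e0)
    return d  # nonnegative, sums to 1
```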

3 Tabular Class
3.1 Direct Parameterization
The policies are parameterized by

$$\pi_{\theta}(a \mid s) = \theta_{s,a}$$

where $\theta \in \Delta(A)^{|S|}$, i.e. θ is subject to $\theta_{s,a} \ge 0$ for all s ∈ S, a ∈ A, and
$\sum_{a \in A} \theta_{s,a} = 1$ for all s ∈ S.

• In discussing the proof, consider the performance measure ρ and the optimization
measure µ.
• Notably, although the gradient concerns V π (µ), the guarantee extends to all distri-
butions ρ.
• Optimal performance under ρ might benefit from optimization under µ.

• The lemma implies that a small gradient in feasible directions indicates near-
optimality in value, but this holds if dπµ sufficiently covers the state distribution
of some optimal policy π ⋆ .
• Recall Bellman and Dreyfus’ theorem (1959) affirming a single policy π ⋆ optimal
for all starting states s0 .
• The exploration problem’s challenge is captured through the distribution mismatch
coefficient.

Distribution Mismatch Coefficient: Given a policy π and measures ρ, µ ∈ ∆(S), we refer to
$\left\| \frac{d^{\pi}_{\rho}}{\mu} \right\|_{\infty}$ as the distribution mismatch coefficient of π relative to µ. Here, $\frac{d^{\pi}_{\rho}}{\mu}$
denotes componentwise division.

Projected Gradient Ascent Algorithm:

• Updates: $\pi^{(t+1)} = P_{\Delta(A)^{|S|}}\left(\pi^{(t)} + \eta \nabla_{\pi} V^{(t)}(\mu)\right)$, where $P_{\Delta(A)^{|S|}}$ denotes the Euclidean projection onto the product of simplices.

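As a concrete illustration, here is a minimal NumPy sketch of one projected gradient ascent step
under direct parameterization. It assumes access to a gradient oracle grad_V returning
∇π V^π(µ) as an |S| × |A| array; the oracle, function names, and shapes are assumptions for
illustration, not part of the source.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a vector v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    cond = u + (1.0 - css) / k > 0
    rho = k[cond][-1]
    tau = (1.0 - css[cond][-1]) / rho
    return np.maximum(v + tau, 0.0)

def projected_ga_step(pi, grad_V, eta):
    """One update pi <- Proj(pi + eta * grad), projecting each row onto Delta(A).

    pi:     current policy, shape (S, A), rows sum to 1
    grad_V: callable returning dV/dpi as an (S, A) array (assumed oracle)
    eta:    step size
    """
    candidate = pi + eta * grad_V(pi)
    return np.apply_along_axis(project_to_simplex, 1, candidate)
```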
Theorem 5:

• For all ρ ∈ ∆(S), the algorithm satisfies:


$$\min_{t < T}\left\{ V^{\star}(\rho) - V^{(t)}(\rho) \right\} \le \epsilon \quad \text{if} \quad T > \frac{64\,\gamma\, |S|\,|A|}{(1-\gamma)^{6}\, \epsilon^{2}} \left\| \frac{d^{\pi^{\star}}_{\rho}}{\mu} \right\|_{\infty}^{2}$$

Explanation for Proof:

• A function f (θ) satisfies a gradient domination property if f (θ⋆ ) − f (θ) = O(G(θ))


for all θ, where G(θ) measures first-order stationarity.
• This condition ensures that finding a (near) first-order stationary point implies near
optimality in function value.
• The lemma establishes gradient domination for direct policy parameterization, a
standard device in global convergence analysis.
• Despite interest in V π (ρ), considering the gradient with respect to another state
distribution µ is helpful.

Lemma 4 (Gradient Domination):



$$V^{\pi^{\star}}(\rho) - V^{\pi}(\rho) \;\le\; \frac{1}{1-\gamma} \left\| \frac{d^{\pi^{\star}}_{\rho}}{\mu} \right\|_{\infty} \max_{\bar{\pi}} (\bar{\pi} - \pi)^{\top} \nabla_{\pi} V^{\pi}(\mu)$$

where the max is over all policies π̄ ∈ ∆(A)|S| .

Full Proof in Appendix B.1

3.2 Softmax Parameterization
Softmax Parameterization:
$$\pi_{\theta}(a \mid s) = \frac{\exp(\theta_{s,a})}{\sum_{a' \in A} \exp(\theta_{s,a'})}$$

Gradient for Softmax Parameterization:


$$\frac{\partial V^{\pi_{\theta}}(\mu)}{\partial \theta_{s,a}} = \frac{1}{1-\gamma}\, d^{\pi_{\theta}}_{\mu}(s)\, \pi_{\theta}(a \mid s)\, A^{\pi_{\theta}}(s, a)$$

Gradient Ascent Update Rule:


θ(t+1) = θ(t) + η∇θ V (t) (µ)
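To make the formula concrete, here is a minimal NumPy sketch that assembles the softmax policy
gradient from its ingredients; the visitation distribution d_mu and the advantage table A are assumed
to be supplied (e.g. computed exactly for a small tabular MDP), which is an illustrative assumption.

```python
import numpy as np

def softmax_policy(theta):
    """pi_theta(a|s) from logits theta of shape (S, A)."""
    z = theta - theta.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def softmax_policy_gradient(theta, d_mu, A, gamma):
    """dV^{pi_theta}(mu)/dtheta_{s,a} = d_mu(s) * pi(a|s) * A(s,a) / (1 - gamma).

    d_mu: discounted state visitation distribution under mu, shape (S,)
    A:    advantage table A^{pi_theta}(s, a), shape (S, A)
    """
    pi = softmax_policy(theta)
    return d_mu[:, None] * pi * A / (1.0 - gamma)

# One gradient ascent step: theta <- theta + eta * softmax_policy_gradient(theta, d_mu, A, gamma)
```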

3.2.1 Asymptotic Convergence Without Regularization

Theorem 10 (Global Convergence): Assuming gradient ascent with the softmax
parameterization and $\eta \le \frac{(1-\gamma)^{3}}{8}$, if µ(s) > 0 for all states s, then V^(t)(s) → V^⋆(s) as
t → ∞. The convergence rate, left as a question for future work, may be exponentially slow in the
size of the state space. The proof is given in detail in the paper.

In leveraging the gradient domination property (Lemma 4), our aim is to demonstrate
∇π V π (µ) → 0. However, employing the softmax parameterization (refer to Lemma 40
and (7)), we find that:

$$\frac{\partial V^{\pi_{\theta}}(\mu)}{\partial \theta_{s,a}} = \frac{1}{1-\gamma}\, d^{\pi_{\theta}}_{\mu}(s)\, \pi_{\theta}(a \mid s)\, A^{\pi_{\theta}}(s, a) = \pi_{\theta}(a \mid s)\, \frac{\partial V^{\pi_{\theta}}(\mu)}{\partial \pi_{\theta}(a \mid s)}$$

Hence, even if ∇θ V^{πθ}(µ) → 0, there is no assurance that ∇π V^{πθ}(µ) → 0. A
regularization-based approach is explored to obtain polynomial-rate convergence in all relevant quantities.

3.2.2 Polynomial Convergence with Log-Barrier Regularization


The log barrier regularizer helps avoid the issue of gradients becoming vanishingly
small at near-deterministic, sub-optimal policies.

• Recall the relative-entropy KL(p, q) := Ex∼p [− log q(x)/p(x)].


• Denote the uniform distribution over set X as Unif X .
• Define the log barrier regularized objective Lλ (θ) as follows (a code sketch of this
objective appears after this list):

$$L_{\lambda}(\theta) := V^{\pi_{\theta}}(\mu) - \lambda\, \mathbb{E}_{s \sim \mathrm{Unif}_{S}}\left[ \mathrm{KL}\!\left(\mathrm{Unif}_{A},\, \pi_{\theta}(\cdot \mid s)\right) \right]
= V^{\pi_{\theta}}(\mu) + \frac{\lambda}{|S||A|} \sum_{s,a} \log \pi_{\theta}(a \mid s) + \lambda \log |A|$$

where λ is a regularization parameter.

• Policy gradient ascent updates for Lλ (θ):

θ(t+1) = θ(t) + η∇θ Lλ (θ(t) )

• Theorem 12 (Log Barrier Regularization): If ∥∇θ Lλ (θ)∥2 ≤ ϵopt and ϵopt ≤


λ/(2|S||A|), then for all starting state distributions ρ:

$$V^{\pi_{\theta}}(\rho) \ge V^{\star}(\rho) - \frac{2\lambda}{1-\gamma} \left\| \frac{d^{\pi^{\star}}_{\rho}}{\mu} \right\|_{\infty}$$

• Corollary 13 (Iteration Complexity with Log Barrier Regularization):


Let $\beta_{\lambda} := \frac{8\gamma}{(1-\gamma)^{3}} + \frac{2\lambda}{|S|}$. Starting from any initial $\theta^{(0)}$, consider updates (13) with
$\lambda = \frac{\epsilon(1-\gamma)}{2 \left\| d^{\pi^{\star}}_{\rho} / \mu \right\|_{\infty}}$ and $\eta = 1/\beta_{\lambda}$. Then for all starting state distributions ρ:

$$\min_{t < T}\left\{ V^{\star}(\rho) - V^{(t)}(\rho) \right\} \le \epsilon \quad \text{whenever} \quad T \ge \frac{320\, |S|^{2} |A|^{2}}{(1-\gamma)^{6}\, \epsilon^{2}} \left\| \frac{d^{\pi^{\star}}_{\rho}}{\mu} \right\|_{\infty}^{2}$$

• See Appendix C.2 for the proof. The corollary emphasizes balancing λ with desired
accuracy ϵ and the importance of the initial distribution µ for global optimality.
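As promised above, here is a minimal sketch of the log-barrier regularized objective for a tabular
softmax policy. The scalar V_mu (the unregularized value under µ) is assumed to be supplied by a
separate policy evaluation routine; that routine and the function name are illustrative assumptions.

```python
import numpy as np

def log_barrier_objective(theta, V_mu, lam):
    """L_lambda(theta) = V^{pi_theta}(mu) + lam/(|S||A|) * sum_{s,a} log pi_theta(a|s) + lam*log|A|.

    theta: softmax logits, shape (S, A)
    V_mu:  scalar value V^{pi_theta}(mu), assumed precomputed elsewhere
    lam:   regularization strength lambda
    """
    S, A = theta.shape
    log_pi = theta - theta.max(axis=1, keepdims=True)
    log_pi = log_pi - np.log(np.exp(log_pi).sum(axis=1, keepdims=True))  # log-softmax per state
    return V_mu + lam / (S * A) * log_pi.sum() + lam * np.log(A)
```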

Remark 14 (Entropy vs. Log Barrier Regularization):

• Entropy (Mnih et al. 2016) regularizer:


$$\frac{1}{|S|} \sum_{s} H\!\left(\pi_{\theta}(\cdot \mid s)\right) = \frac{1}{|S|} \sum_{s} \sum_{a} -\pi_{\theta}(a \mid s) \log \pi_{\theta}(a \mid s)$$

• Entropy is less aggressive in penalizing small probabilities than the log barrier
regularizer, which is equivalent to a relative entropy.
• Entropy is bounded between 0 and log |A|, while relative entropy is between 0 and
infinity, approaching infinity as probabilities tend to 0.
• Open question: Achieving polynomial convergence with the more common entropy
regularizer.
• Polynomial convergence using KL regularizer relies on relative entropy’s aggressive
prevention of small probabilities.

3.2.3 Dimension-free Convergence of Natural Policy Gradient Ascent


NPG can be considered a quasi second-order method due to its use of a particular
preconditioner. Its fast, dimension-free convergence shows how the variable preconditioner
in the natural gradient improves over standard gradient ascent algorithms. The NPG algorithm is:

$$F_{\rho}(\theta) = \mathbb{E}_{s \sim d^{\pi_{\theta}}_{\rho}}\, \mathbb{E}_{a \sim \pi_{\theta}(\cdot \mid s)}\left[ \nabla_{\theta} \log \pi_{\theta}(a \mid s)\, \left(\nabla_{\theta} \log \pi_{\theta}(a \mid s)\right)^{\top} \right]$$
$$\theta^{(t+1)} = \theta^{(t)} + \eta\, F_{\rho}\!\left(\theta^{(t)}\right)^{\dagger} \nabla_{\theta} V^{(t)}(\rho)$$

Lemma 15 (NPG as Soft Policy Iteration):

$$\theta^{(t+1)} = \theta^{(t)} + \frac{\eta}{1-\gamma} A^{(t)} \quad \text{and} \quad \pi^{(t+1)}(a \mid s) = \pi^{(t)}(a \mid s)\, \frac{\exp\!\left(\eta A^{(t)}(s, a)/(1-\gamma)\right)}{Z_{t}(s)}$$

where $Z_{t}(s) = \sum_{a \in A} \pi^{(t)}(a \mid s) \exp\!\left(\eta A^{(t)}(s, a)/(1-\gamma)\right)$.
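In the tabular softmax case, Lemma 15 says that one NPG step is a multiplicative-weights style
update on the policy. A minimal NumPy sketch follows; the advantage table A is assumed to be
supplied by an evaluation routine (an illustrative assumption).

```python
import numpy as np

def npg_soft_policy_iteration_step(pi, A, eta, gamma):
    """One NPG step as soft policy iteration (Lemma 15).

    pi: current policy pi^{(t)}, shape (S, A)
    A:  advantage table A^{(t)}(s, a), shape (S, A)
    """
    weights = pi * np.exp(eta * A / (1.0 - gamma))
    Z = weights.sum(axis=1, keepdims=True)  # per-state normalizer Z_t(s)
    return weights / Z
```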

Moore-Penrose Pseudoinverse: The Moore-Penrose pseudoinverse M † of a matrix


M is a generalization of the matrix inverse for non-square matrices. It satisfies the fol-
lowing properties:

• M M †M = M ,
• M †M M † = M †,
• (M M † )⊤ = M M † , and
• (M † M )⊤ = M † M .

It plays a crucial role in the NPG algorithm by allowing for effective updates in the
parameter space. The processing complexity in FLOPs can be expressed as 2np² + 2n³
for an n × p matrix when using the common Singular Value Decomposition (SVD)
approach. The complexity of SVD (from the PyTorch documentation) is O(nm²),
where m is the larger dimension of the matrix and n the smaller.
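A minimal NumPy sketch of computing the pseudoinverse via SVD (equivalently, numpy.linalg.pinv),
with near-zero singular values truncated:

```python
import numpy as np

def pinv_via_svd(M, rcond=1e-10):
    """Moore-Penrose pseudoinverse via SVD: M = U diag(s) V^T  =>  M^dagger = V diag(1/s) U^T."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_inv = np.where(s > rcond * s.max(), 1.0 / s, 0.0)  # drop near-zero singular values
    return Vt.T @ np.diag(s_inv) @ U.T

# Preconditioned NPG direction: F(theta)^dagger @ grad
# direction = pinv_via_svd(F) @ grad
```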

Theorem 16 (Global Convergence for NPG):


$$V^{(T)}(\rho) \ge V^{\star}(\rho) - \frac{\log |A|}{\eta T} - \frac{1}{(1-\gamma)^{2}\, T}$$

Setting $\eta \ge (1-\gamma)^{2} \log |A|$, NPG finds an ϵ-optimal policy in at most $T \le \frac{2}{(1-\gamma)^{2}\, \epsilon}$ iterations.

3.3 Tabular Methods Summary

Algorithm (iteration complexity):

• Projected Gradient Ascent on the Simplex (Thm 5): $O\!\left( \frac{D_{\infty}^{2}\, |S||A|}{(1-\gamma)^{6}\, \epsilon^{2}} \right)$
• Policy Gradient, softmax parameterization (Thm 10): asymptotic
• Policy Gradient + log barrier regularization, softmax parameterization (Cor 13): $O\!\left( \frac{D_{\infty}^{2}\, |S|^{2}|A|^{2}}{(1-\gamma)^{6}\, \epsilon^{2}} \right)$
• Natural Policy Gradient (NPG), softmax parameterization (Thm 16): $\frac{2}{(1-\gamma)^{2}\, \epsilon}$

Table summary: the table lists iteration complexities for finding a policy π with V⋆(s0) − V^π(s0) ≤ ϵ.
The algorithms optimize $\mathbb{E}_{s \sim \mu}\left[V^{\pi}(s)\right]$, with |S| states, |A| actions, and 0 ≤ γ < 1.
$D_{\infty} := \max_{s} \frac{d^{\pi^{\star}}_{s_{0}}(s)}{\mu(s)}$ measures the distribution mismatch. NPG has no $D_{\infty}$ dependence and works for any s0.

4 Function Approximation and Distribution Shift
In contrast to the tabular scenarios explored earlier, the analysis extends to the
realm of function approximation where policy classes lack full expressiveness (e.g., when
d ≪ |S||A|). The investigation revolves around variants of the Natural Policy Gradient
(NPG) update rule:
θ ← θ + ηFρ (θ)† ∇θ V θ (ρ)
The analytical framework establishes a close connection between the NPG update and
the concept of compatible function approximation, formalized by Kakade (2001). This
relationship links the NPG update to a regression problem where the optimal weight
vector w⋆ is given by
$$w_{\star} \in \operatorname*{argmin}_{w}\; \mathbb{E}_{s \sim d^{\pi_{\theta}}_{\rho},\, a \sim \pi_{\theta}(\cdot \mid s)}\left[ \left( w^{\top} \nabla_{\theta} \log \pi_{\theta}(a \mid s) - A^{\pi_{\theta}}(s, a) \right)^{2} \right]$$

The regression problem embodies a form of "compatible" function approximation,


approximating Aπθ (s, a) using features derived from ∇θ log πθ (· | s). Additionally, a
variant of the update rule, termed Q-NPG, utilizes Q-values instead of advantages in the
regression.

This approach allows for approximate updates, solving relevant regression problems
with samples. The main results demonstrate the effectiveness of NPG updates in scenarios
where errors arise from both statistical estimation (when exact gradients are unavailable)
and function approximation (due to using a parameterized function class). Notably, a
novel estimation/approximation decomposition tailored for the NPG algorithm is pro-
vided.

To delve into specific policy classes, the analysis begins with log-linear policies as
a special case. Subsequently, the investigation extends to more general policy classes,
including neural policy classes.

Importantly, these results represent a significant advancement in providing provable


approximation guarantees without explicit worst-case dependencies over the state space.

4.1 Log Linear Policy Class And Soft Policy Iteration


Log-Linear Policy Class: Each policy in the log-linear class is defined as:
$$\pi_{\theta}(a \mid s) = \frac{\exp(\theta \cdot \phi_{s,a})}{\sum_{a' \in A} \exp(\theta \cdot \phi_{s,a'})}$$

where θ ∈ Rd and ϕs,a ∈ Rd is a feature mapping.

Compatible Function Approximation: For the log-linear policy class, the gradient
of the log-policy is expressed as:

∇θ log πθ (a | s) = ϕ̄θs,a , where ϕ̄θs,a = ϕs,a − Ea′ ∼πθ (·|s) [ϕs,a′ ]

Here, ϕ̄θs,a is the centered version of ϕs,a . Additionally, ϕ̄π is defined similarly for any
policy π.

NPG Update Rule: The NPG update rule, using the centered features, is equiva-
lent to:

NPG: $\theta \leftarrow \theta + \eta w_{\star}$, $\quad w_{\star} \in \operatorname*{argmin}_{w}\; \mathbb{E}_{s,a \sim d^{\pi_{\theta}}_{\rho}}\left[ \left( A^{\pi_{\theta}}(s, a) - w \cdot \bar{\phi}^{\theta}_{s,a} \right)^{2} \right]$

Q-NPG Variant: We also consider a variant, Q-NPG, without centering the features:

Q-NPG: $\theta \leftarrow \theta + \eta w_{\star}$, $\quad w_{\star} \in \operatorname*{argmin}_{w}\; \mathbb{E}_{s,a \sim d^{\pi_{\theta}}_{\rho}}\left[ \left( Q^{\pi_{\theta}}(s, a) - w \cdot \phi_{s,a} \right)^{2} \right]$

Note that, unlike the advantage function, Qπ (s, a) does not have zero expectation under π(· | s). If
the compatible function approximation error is 0, then it is easy to verify that NPG and Q-NPG
are equivalent, since their corresponding policy updates coincide.
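The inner step of Q-NPG is a constrained least-squares fit of Q-values onto features. A minimal
NumPy sketch follows, assuming sampled (feature, Q-estimate) pairs drawn from the appropriate
distribution (the sampling itself is Algorithm 1 below); handling the norm constraint ∥w∥ ≤ W by
rescaling the unconstrained solution is an illustrative simplification, not the exact method from the source.

```python
import numpy as np

def fit_qnpg_weights(Phi, Q_hat, W):
    """Approximately solve  argmin_{||w|| <= W}  mean_i (Q_hat[i] - w . Phi[i])^2.

    Phi:   feature matrix, shape (N, d), rows phi_{s_i, a_i}
    Q_hat: unbiased Q-value estimates, shape (N,)
    W:     norm bound on w (enforced here by rescaling; a simplification)
    """
    w, *_ = np.linalg.lstsq(Phi, Q_hat, rcond=None)
    norm = np.linalg.norm(w)
    if norm > W:
        w = w * (W / norm)
    return w

# Q-NPG update: theta <- theta + eta * w
```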

4.2 Q-NPG Performance bounds for Log-Linear Policies


For a state-action distribution v, define:

$$L(w; \theta, v) := \mathbb{E}_{s,a \sim v}\left[ \left( Q^{\pi_{\theta}}(s, a) - w \cdot \phi_{s,a} \right)^{2} \right]$$

The iterates of the Q-NPG algorithm minimize this loss under a changing distribution v.
Now, an approximate version of Q-NPG is defined with a starting state-action distribution
ν. The visitation measure over states and actions, dπν , is given by:

$$d^{\pi}_{\nu}(s, a) := (1 - \gamma)\, \mathbb{E}_{s_0, a_0 \sim \nu}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, \mathrm{Pr}^{\pi}(s_t = s, a_t = a \mid s_0, a_0) \right]$$

Define $d^{(t)} := d^{\pi^{(t)}}_{\nu}$. The approximate Q-NPG algorithm is:

$$\text{Approx. Q-NPG:} \quad \theta^{(t+1)} = \theta^{(t)} + \eta w^{(t)}, \qquad w^{(t)} \approx \operatorname*{argmin}_{\|w\|_2 \le W} L\!\left(w; \theta^{(t)}, d^{(t)}\right)$$




The exact minimizer is denoted as:

$$w^{(t)}_{\star} \in \operatorname*{argmin}_{\|w\|_2 \le W} L\!\left(w; \theta^{(t)}, d^{(t)}\right)$$




Assumption 6.1 (Estimation/Transfer errors): Fix a state distribution ρ, a state-


action distribution ν, and an arbitrary comparator policy π ⋆ . Define the state-action
measure d⋆ as

$$d^{\star}(s, a) = d^{\pi^{\star}}_{\rho}(s) \circ \mathrm{Unif}_{A}(a)$$

where d⋆ samples states from the comparator's state visitation measure $d^{\pi^{\star}}_{\rho}$ and actions
from the uniform distribution.

Suppose for all t < T :

1. (Excess risk) The estimation error (because of sampling like in Algorithm 1) is


bounded:
$$\mathbb{E}\left[ L\!\left(w^{(t)}; \theta^{(t)}, d^{(t)}\right) - L\!\left(w^{(t)}_{\star}; \theta^{(t)}, d^{(t)}\right) \right] \le \epsilon_{\mathrm{stat}}$$

2. (Transfer error) The best predictor $w^{(t)}_{\star}$ has an error bounded by ϵbias with respect
to the comparator's measure d⋆:

$$\mathbb{E}\left[ L\!\left(w^{(t)}_{\star}; \theta^{(t)}, d^{\star}\right) \right] \le \epsilon_{\mathrm{bias}}$$

Assumption 6.2 (Relative condition number): Consider the same ρ, ν, and π ⋆ as


in Assumption 6.1. With respect to any state-action distribution v, define:

$$\Sigma_{v} = \mathbb{E}_{s,a \sim v}\left[ \phi_{s,a}\, \phi_{s,a}^{\top} \right]$$

and define:

$$\kappa = \sup_{w \in \mathbb{R}^{d}} \frac{w^{\top} \Sigma_{d^{\star}}\, w}{w^{\top} \Sigma_{\nu}\, w}$$
Assume that κ is finite.
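Numerically, κ is the largest generalized eigenvalue of the pair (Σd⋆, Σν). A minimal sketch using
SciPy follows; the two covariance matrices are assumed to be estimated from samples, and Σν is
assumed positive definite.

```python
import numpy as np
from scipy.linalg import eigh

def relative_condition_number(Sigma_dstar, Sigma_nu):
    """kappa = sup_w (w^T Sigma_dstar w) / (w^T Sigma_nu w),
    i.e. the largest generalized eigenvalue of (Sigma_dstar, Sigma_nu).
    Assumes Sigma_nu is positive definite."""
    eigvals = eigh(Sigma_dstar, Sigma_nu, eigvals_only=True)
    return float(eigvals[-1])  # eigenvalues are returned in ascending order
```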
Theorem 20 (Agnostic learning with Q-NPG): Fix a state distribution ρ, a state-
action distribution ν, and a comparator policy π ⋆ . Suppose Assumption 6.2 holds and
$\|\phi_{s,a}\|_{2} \le B$ for all s, a. Suppose the Q-NPG update rule (in (20)) starts with $\theta^{(0)} = 0$,
$\eta = \sqrt{2 \log |A| / (B^{2} W^{2} T)}$, and the (random) sequence of iterates satisfies Assumption
6.1. Then,

$$\mathbb{E}\left[ \min_{t < T}\left\{ V^{\pi^{\star}}(\rho) - V^{(t)}(\rho) \right\} \right]
\le \frac{BW}{1-\gamma} \sqrt{\frac{2 \log |A|}{T}} + \sqrt{\frac{4|A|\, \kappa\, \epsilon_{\mathrm{stat}}}{(1-\gamma)^{3}}} + \frac{\sqrt{4|A|\, \epsilon_{\mathrm{bias}}}}{1-\gamma}$$

The proof is provided in Section 6.4.

The analysis uses the decomposition of the per-iterate loss into an excess risk and an approximation error:

$$L\!\left(w^{(t)}; \theta^{(t)}, d^{(t)}\right) = \underbrace{L\!\left(w^{(t)}; \theta^{(t)}, d^{(t)}\right) - L\!\left(w^{(t)}_{\star}; \theta^{(t)}, d^{(t)}\right)}_{\text{Excess risk}} + \underbrace{L\!\left(w^{(t)}_{\star}; \theta^{(t)}, d^{(t)}\right)}_{\text{Approximation error}}$$

Corollary 21 (Estimation error / approximation error bound for Q-NPG):
Consider the same setting as in Theorem 20. Instead of assuming the transfer error is
bounded (part 2 in Assumption 6.1), suppose that, for all t ≤ T ,

$$\mathbb{E}\left[ L\!\left(w^{(t)}_{\star}; \theta^{(t)}, d^{(t)}\right) \right] \le \epsilon_{\mathrm{approx}}$$

Then,

$$\mathbb{E}\left[ \min_{t < T}\left\{ V^{\pi^{\star}}(\rho) - V^{(t)}(\rho) \right\} \right]
\le \frac{BW}{1-\gamma} \sqrt{\frac{2 \log |A|}{T}} + \sqrt{\frac{4|A|}{(1-\gamma)^{3}} \left( \kappa \cdot \epsilon_{\mathrm{stat}} + \left\| \frac{d^{\star}}{\nu} \right\|_{\infty} \cdot \epsilon_{\mathrm{approx}} \right)}$$

Proof: We have the following crude upper bound on the transfer error:

$$L\!\left(w^{(t)}_{\star}; \theta^{(t)}, d^{\star}\right) \le \left\| \frac{d^{\star}}{d^{(t)}} \right\|_{\infty} L\!\left(w^{(t)}_{\star}; \theta^{(t)}, d^{(t)}\right) \le \frac{1}{1-\gamma} \left\| \frac{d^{\star}}{\nu} \right\|_{\infty} L\!\left(w^{(t)}_{\star}; \theta^{(t)}, d^{(t)}\right),$$

where the last step uses the definition of $d^{(t)}$ (see (19)). This implies $\epsilon_{\mathrm{bias}} \le \frac{1}{1-\gamma} \left\| \frac{d^{\star}}{\nu} \right\|_{\infty} \epsilon_{\mathrm{approx}}$,
and the corollary follows.

Remark 24 (ϵbias = 0 for "linear" MDPs): In the recent linear MDP model of Jin
et al. (2019); Yang and Wang (2019); Jiang et al. (2017), where the transition dynamics
are low rank, we have that ϵbias = 0 provided we use the features of the linear MDP.
Our guarantees also permit model misspecification of linear MDPs, with non worst-case
approximation error where ϵbias ̸= 0.

Remark 25 (Comparison with POLITEX and EE-POLITEX): Compared with POLITEX


(Abbasi-Yadkori et al., 2019a), Assumption 6.2 is substantially milder, in that it just as-
sumes a good relative condition number for one policy rather than all possible policies
(which cannot hold in general even for tabular MDPs). Changing this assumption to an
analog of Assumption 6.2 is the main improvement in the analysis of the EE-POLITEX
(Abbasi-Yadkori et al. 2019b) algorithm.

4.2.1 Algorithm 1
An analysis of these algorithms can be found in the Algorithms section (Section C) of Linear
Convergence of Natural Policy Gradient Methods with Log-Linear Policies [2], published as a
conference paper at ICLR 2023. A Python sketch of the sampler is given after the listing below.
Input: Starting state-action distribution ν.
1. Sample s0 , a0 ∼ ν.
2. Sample s, a ∼ dπν as follows: at every timestep h, with probability γ, act according
to π; else, accept (sh , ah ) as the sample and proceed to Step 3. See (19).
3. From sh , ah , continue to execute π, and use a termination probability of 1 − γ.
Upon termination, set Q̂π (sh , ah ) as the undiscounted sum of rewards from time h
onwards.
4. return (sh , ah ) and Q̂π (sh , ah ).
Algorithm 1: Sampler for s, a ∼ dπν and unbiased estimate of Qπ (s, a)
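Below is a minimal Python sketch of this sampler, assuming a generic environment interface
env.step_from(s, a) returning (next_state, reward), a sampling policy policy.sample(s), and
nu() returning an initial state-action pair; these interface names are illustrative assumptions,
not an API from the source.

```python
import random

def sample_q_estimate(env, policy, nu, gamma, rng=random):
    """Algorithm 1 sketch: draw (s, a) ~ d^pi_nu and an unbiased estimate of Q^pi(s, a)."""
    # Step 1: start from (s0, a0) ~ nu.
    s, a = nu()
    # Step 2: at every step, with probability gamma keep acting under pi; else accept (s, a).
    while rng.random() < gamma:
        s, _ = env.step_from(s, a)        # assumed: transition to s' ~ P(.|s, a)
        a = policy.sample(s)
    # Step 3: from the accepted (s, a), roll out pi with termination probability 1 - gamma,
    # accumulating the *undiscounted* sum of rewards as the Q estimate.
    q_hat = 0.0
    cur_s, cur_a = s, a
    while True:
        next_s, r = env.step_from(cur_s, cur_a)
        q_hat += r
        if rng.random() < 1.0 - gamma:
            break
        cur_s, cur_a = next_s, policy.sample(next_s)
    # Step 4: return the sample and the Q estimate.
    return (s, a), q_hat
```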

4.2.2 Algorithm 2

Figure 4.1: Algorithm 2 (see the referenced papers for the full pseudocode).

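In lieu of the missing figure, here is a hedged sketch of an outer Q-NPG loop consistent with the
updates in Section 4.2, reusing sample_q_estimate and fit_qnpg_weights from the sketches above.
The per-iteration sample count N, the feature map phi, and the log-linear policy interface are
illustrative assumptions, not the exact procedure from the figure.

```python
import numpy as np

def q_npg(env, policy, nu, phi, gamma, eta, W, T, N):
    """Sketch of sample-based Q-NPG: at each iteration, sample (s, a, Q_hat) triples,
    fit w by (approximately) constrained least squares, and update theta <- theta + eta * w.

    phi(s, a) -> feature vector in R^d (assumed feature map)
    policy    -> log-linear policy object exposing .theta and .sample(s) (assumed interface)
    """
    for t in range(T):
        feats, targets = [], []
        for _ in range(N):
            (s, a), q_hat = sample_q_estimate(env, policy, nu, gamma)
            feats.append(phi(s, a))
            targets.append(q_hat)
        w = fit_qnpg_weights(np.array(feats), np.array(targets), W)
        policy.theta = policy.theta + eta * w   # Q-NPG step
    return policy
```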
4.3 NPG: Performance Bounds for Smooth Policy Class

$$L_{A}(w; \theta, v) := \mathbb{E}_{s,a \sim v}\left[ \left( A^{\pi_{\theta}}(s, a) - w \cdot \nabla_{\theta} \log \pi_{\theta}(a \mid s) \right)^{2} \right]$$

$$\text{Approx. NPG:} \quad \theta^{(t+1)} = \theta^{(t)} + \eta w^{(t)}, \qquad w^{(t)} \approx \operatorname*{argmin}_{\|w\|_2 \le W} L_{A}\!\left(w; \theta^{(t)}, d^{(t)}\right)$$

Let $w^{(t)}_{\star}$ denote the exact minimizer, i.e. $w^{(t)}_{\star} \in \operatorname*{argmin}_{\|w\|_2 \le W} L_{A}\!\left(w; \theta^{(t)}, d^{(t)}\right)$.

Assumption 6.4 (Policy Smoothness): Assume that for all s ∈ S and a ∈ A, log πθ (a | s)
is a β-smooth function of θ.

Theorem 29 (Agnostic learning with NPG) Fix a state distribution ρ; a state-


action distribution ν; an arbitrary comparator policy π ⋆ (not necessarily an optimal
policy). Suppose Assumption 6.4 holds. Suppose the NPG update rule (in (21)) starts
with π(0) being the uniform distribution (at each state), $\eta = \sqrt{2 \log |A| / (\beta W^{2} T)}$, and
the (random) sequence of iterates satisfies Assumption 6.5. We have that

$$\mathbb{E}\left[ \min_{t < T}\left\{ V^{\pi^{\star}}(\rho) - V^{(t)}(\rho) \right\} \right]
\le \frac{W}{1-\gamma} \sqrt{\frac{2\beta \log |A|}{T}} + \sqrt{\frac{\kappa\, \epsilon_{\mathrm{stat}}}{(1-\gamma)^{3}}} + \frac{\sqrt{\epsilon_{\mathrm{bias}}}}{1-\gamma}.$$

The proof is provided in Section 6.4.

Remarks: Observe there is no polynomial dependence on |A| in the rate for NPG
(in contrast to Theorem 20). The main difference in the analysis is that, for Q-NPG,
the error in fitting the Q-values must be converted into an error in the advantages; this
leads to the dependence on |A| (which can be removed with a path-dependent bound,
i.e. a bound that depends on the sequence of iterates produced by the algorithm). For
NPG, directly fitting the advantage function sidesteps this conversion step.

Compared with the result of Theorem 16 in the noiseless, tabular case, we see two
main differences. In the setting of Theorem 16, we have ϵstat = ϵbias = 0, so that the
last two terms vanish. This leaves the first term, where we observe a slower $1/\sqrt{T}$ rate
compared with Theorem 16, with an additional dependence on W (which grows as
O(|S||A|/(1 − γ)) to approximate the advantages in the tabular setting). Both differences
arise from the additional monotonicity property (Lemma 17) on the per-step improve-
ments in the tabular case, which is not easily generalized to the function approximation
setting.

4.4 Summary for Approximate Methods


• Table 2: Overview of Approximate Methods: The suboptimality, V ⋆ (s0 ) − V π (s0 ),
after T iterations for various approximate algorithms, which use different notions
of approximation error (sample complexities are not directly considered but instead
may be thought of as part of ϵ1 and ϵstat ). Order notation is used to drop constants,
and we assume |A| = 2 for ease of exposition.
• For approximate dynamic programming methods, the relevant error is the worst-
case ℓ∞ error in approximating a value function, e.g. $\epsilon_{\infty} = \max_{s,a} \left| Q^{\pi}(s, a) - \widehat{Q}^{\pi}(s, a) \right|$,
where $\widehat{Q}^{\pi}$ is what an estimation oracle returns during the course of the algorithm.

Figure 4.2: Summary table for Approximate methods

• The second row (see Lemma 12 in Antos et al. (2008)) is a refinement of this ap-
proach, where ϵ1 is an ℓ1 -average error in fitting the value functions under the fitting
(state) distribution µ, and, roughly, C∞ is a worst case density ratio between the
state visitation distribution of any non-stationary policy and the fitting distribution
µ.
• For Conservative Policy Iteration, ϵ1 is a related ℓ1 -average case fitting error with
respect to a fitting distribution µ, and D∞ is defined as before in the caption of
Table 1 (see also Kakade and Langford, 2002); here, D∞ ≤ C∞ (e.g. see Scherrer
(2014)).
• For NPG, ϵstat and ϵapprox measure the excess risk (the regret) and approximation
errors in fitting the values. Roughly speaking, ϵstat is the excess squared loss rel-
ative to the best fit (among an appropriately defined parametric class) under our
fitting distribution (defined with respect to the state distribution µ ). Here, ϵapprox
is the approximation error: the minimal possible error (in our parametric class)
under our fitting distribution. The condition number κ is a relative eigenvalue con-
dition between appropriately defined feature covariances with respect to the state

visitation distribution of an optimal policy, $d^{\pi^{\star}}_{s_0}$, and the state fitting distribution
µ.

References
[1] Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. On the Theory
of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift. Journal
of Machine Learning Research, 22 (2021), 1-76.

[2] Rui Yuan, Simon S. Du, Robert M. Gower, Alessandro Lazaric, and Lin Xiao. Linear
Convergence of Natural Policy Gradient Methods with Log-Linear Policies. Published as
a conference paper at ICLR 2023.

[3] Amir M. Ahmadian, Wolfgang Zirwas, Rakash Sivasiva Ganesan, and Berthold Panzner.
Low Complexity Moore-Penrose Inverse for Large CoMP Areas with Sparse Massive
MIMO Channel Matrices. Nokia Bell Labs, Radio Systems Research, Munich, Germany.
