SRE Report Merged

Contents
1 Introduction
2 Notations
3 Tabular Class
  3.1 Direct Parameterization
  3.2 Softmax Parameterization
    3.2.1 Asymptotic Convergence Without Regularization
    3.2.2 Polynomial Convergence with Log-Barrier Regularization
    3.2.3 Dimension-free Convergence of Natural Policy Gradient Ascent
  3.3 Tabular Methods Summary
4 Function Approximation and Distribution Shift
1 Introduction
This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs). The policy parameterizations studied fall into two classes: tabular (complete) parameterizations, in which the parameters can represent every stochastic policy (Section 3), and restricted parametric policy classes, which may not contain all policies and therefore require function approximation (Section 4).
2 Notations
Markov Decision Process (MDP) Definitions: An MDP M = (S, A, P, r, γ, ρ) consists of a finite state space S, a finite action space A, transition probabilities P(s′ | s, a), a reward function r(s, a) ∈ [0, 1], a discount factor γ ∈ [0, 1), and a starting-state distribution ρ over S.
Policy Definitions: A (stochastic) policy π maps each state s to a distribution π(· | s) over actions. Its value function is V^π(s) := E[Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s], where the expectation is over trajectories generated by π.
• Since rewards lie in [0, 1], we have 0 ≤ V^π(s) ≤ 1/(1 − γ), and V^π(ρ) := E_{s_0∼ρ}[V^π(s_0)].
Action-Value (Q-value) and Advantage Functions: Q^π(s, a) := E[Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s, a_0 = a], and the advantage function is A^π(s, a) := Q^π(s, a) − V^π(s).
• In order to introduce policy gradients, define the discounted state visitation distribution d^π_{s_0} of a policy π as
  d^π_{s_0}(s) := (1 − γ) Σ_{t=0}^∞ γ^t Pr^π(s_t = s | s_0),
  where Pr^π(s_t = s | s_0) is the probability that s_t = s after executing π starting at state s_0 (a small computational sketch is given after this list).
• Overloading notation, write d^π_ρ(s) := E_{s_0∼ρ}[d^π_{s_0}(s)], where d^π_ρ is the discounted state visitation distribution under the initial state distribution ρ.
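The following is a minimal sketch (not part of the original report) of how d^π_{s_0} can be computed in a small tabular MDP; the arrays P, pi, gamma, and s0 are illustrative assumptions.

# Sketch of computing the discounted state visitation distribution d^pi_{s0}
# for a small tabular MDP. The MDP arrays (P, pi, gamma, s0) are placeholders.
import numpy as np

def discounted_visitation(P, pi, gamma, s0):
    """P[s, a, s'] are transition probabilities, pi[s, a] is the policy."""
    S = P.shape[0]
    # State-to-state matrix induced by pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s,a)
    P_pi = np.einsum("sa,sap->sp", pi, P)
    # d^pi_{s0} = (1 - gamma) * e_{s0}^T (I - gamma * P_pi)^{-1}
    e = np.zeros(S); e[s0] = 1.0
    return (1 - gamma) * np.linalg.solve((np.eye(S) - gamma * P_pi).T, e)

# Example with a random 3-state, 2-action MDP (values are arbitrary).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))
pi = rng.dirichlet(np.ones(2), size=3)
print(discounted_visitation(P, pi, gamma=0.9, s0=0))  # sums to 1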
3 Tabular Class
3.1 Direct Parameterization
The policies are parameterized directly by their action probabilities:
π_θ(a | s) = θ_{s,a},
where θ ∈ Δ(A)^{|S|}; that is, θ_{s,·} lies in the probability simplex over actions for every state s.
• In discussing the results, distinguish the performance measure ρ (under which the learned policy is evaluated) from the optimization measure µ (under which gradient ascent is performed).
• Notably, although gradient ascent is performed on V^π(µ), the guarantee extends to every performance measure ρ.
• In fact, achieving good performance under ρ may be easier when optimizing under a different, better-covering measure µ.
• The gradient domination lemma implies that a small gradient in feasible directions indicates near-optimality in value, but this holds only if d^π_µ sufficiently covers the state distribution of some optimal policy π^⋆.
• Recall the theorem of Bellman and Dreyfus (1959) affirming that a single policy π^⋆ is simultaneously optimal for all starting states s_0.
• The challenge of exploration is captured through the distribution mismatch coefficient ∥d^{π^⋆}_ρ/µ∥_∞.
Theorem 5 (informal): Projected gradient ascent on V^{π_θ}(µ) over the simplex, with a suitable constant step size, finds a policy π^{(t)} with V^⋆(ρ) − V^{(t)}(ρ) ≤ ε for some t < T, provided T is polynomial in |S|, |A|, 1/(1 − γ), 1/ε, and the distribution mismatch coefficient ∥d^{π^⋆}_ρ/µ∥_∞.
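Below is a minimal sketch (not from the report) of projected gradient ascent for the direct parameterization. The policy-gradient oracle grad_V and the step size are assumed inputs; only the simplex projection and the update loop are illustrated.

# Sketch of projected gradient ascent for the direct parameterization pi(a|s) = theta_{s,a}.
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a vector v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + tau, 0.0)

def projected_gradient_ascent(theta, grad_V, eta, T):
    """theta[s, a]: direct policy parameters; grad_V(theta)[s, a]: dV^{pi_theta}(mu)/dtheta (assumed oracle)."""
    for _ in range(T):
        theta = theta + eta * grad_V(theta)
        # Project each row back onto the simplex over actions.
        theta = np.apply_along_axis(project_to_simplex, 1, theta)
    return theta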
3.2 Softmax Parameterization
The softmax policy class is parameterized by
π_θ(a | s) = exp(θ_{s,a}) / Σ_{a′∈A} exp(θ_{s,a′}).
3.2.1 Asymptotic Convergence Without Regularization
In order to leverage the gradient domination property (Lemma 4), we would like to show that ∇_π V^π(µ) → 0. However, under the softmax parameterization (refer to Lemma 40 and (7)), the gradient takes the form
∂V^{π_θ}(µ)/∂θ_{s,a} = 1/(1 − γ) · d^{π_θ}_µ(s) π_θ(a | s) A^{π_θ}(s, a),
so each component of ∇_θ V^{π_θ}(µ) is attenuated by the factor π_θ(a | s): the gradient in θ can be vanishingly small even when the policy is far from optimal, which is why only asymptotic convergence is obtained without regularization.
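A minimal sketch (with placeholder arrays, not from the report) of evaluating this gradient expression for a tabular softmax policy:

# Sketch: the softmax policy-gradient formula
#   dV/dtheta_{s,a} = 1/(1-gamma) * d_mu(s) * pi(a|s) * A(s,a)
# for a tabular problem. d_mu, pi and A are assumed precomputed arrays.
import numpy as np

def softmax_policy_gradient(d_mu, pi, A, gamma):
    """d_mu[s]: discounted visitation; pi[s, a]: policy; A[s, a]: advantages."""
    return (d_mu[:, None] * pi * A) / (1.0 - gamma)

# Example with random placeholder values (3 states, 2 actions).
rng = np.random.default_rng(1)
g = softmax_policy_gradient(rng.dirichlet(np.ones(3)), rng.dirichlet(np.ones(2), size=3),
                            rng.normal(size=(3, 2)), gamma=0.9)
print(g)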
3.2.2 Polynomial Convergence with Log-Barrier Regularization
• Policy gradient ascent is performed on the log-barrier regularized objective L_λ(θ), which (up to an additive constant) equals V^{π_θ}(µ) + (λ/(|S||A|)) Σ_{s,a} log π_θ(a | s), via the updates θ^{(t+1)} = θ^{(t)} + η ∇_θ L_λ(θ^{(t)}). With an appropriate choice of η and λ, the iterates satisfy
  min_{t<T} {V^⋆(ρ) − V^{(t)}(ρ)} ≤ ε whenever T ≥ (320 |S|^2 |A|^2)/((1 − γ)^6 ε^2) · ∥d^{π^⋆}_ρ/µ∥_∞^2.
• See Appendix C.2 for the proof. The corollary emphasizes balancing λ with the desired accuracy ε, and the importance of the initial distribution µ for global optimality.
• The entropy regularizer is less aggressive in penalizing small probabilities than the log barrier, which is equivalent (up to constants) to the relative entropy from the uniform distribution.
• Entropy is bounded between 0 and log |A|, while relative entropy ranges from 0 to infinity, approaching infinity as probabilities tend to 0 (see the short numerical comparison after this list).
• Open question: achieving polynomial convergence with the more common entropy regularizer.
• The polynomial convergence result with the KL (log-barrier) regularizer relies on relative entropy's aggressive prevention of small probabilities.
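A small numerical sketch (illustrative values only, not from the report) contrasting the two regularizers on a distribution whose smallest probability shrinks toward zero:

# Sketch: entropy H(pi) stays bounded by log|A| as a probability tends to 0,
# while the relative entropy KL(Unif_A, pi) blows up. Values are illustrative.
import numpy as np

def entropy(pi):
    return -np.sum(pi * np.log(pi))

def kl_uniform_to_pi(pi):
    A = len(pi)
    return np.sum((1.0 / A) * np.log((1.0 / A) / pi))

for eps in [1e-1, 1e-3, 1e-6]:
    pi = np.array([eps, 1.0 - eps])  # two actions, one nearly deterministic
    print(f"eps={eps:.0e}  H={entropy(pi):.4f}  KL(Unif, pi)={kl_uniform_to_pi(pi):.4f}")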
3.2.3 Dimension-free Convergence of Natural Policy Gradient Ascent
Lemma 15 (NPG as Soft Policy Iteration): The natural policy gradient updates for the softmax parameterization take the form
θ^{(t+1)} = θ^{(t)} + (η/(1 − γ)) A^{(t)}  and  π^{(t+1)}(a | s) = π^{(t)}(a | s) · exp(η A^{(t)}(s, a)/(1 − γ)) / Z_t(s),
where Z_t(s) = Σ_{a∈A} π^{(t)}(a | s) exp(η A^{(t)}(s, a)/(1 − γ)).
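A minimal tabular sketch (with assumed input arrays) of one soft-policy-iteration step from Lemma 15:

# Sketch of the NPG / soft policy iteration update from Lemma 15:
#   pi_{t+1}(a|s) is proportional to pi_t(a|s) * exp(eta * A_t(s,a) / (1 - gamma)).
# pi and A are assumed tabular arrays of shape [num_states, num_actions].
import numpy as np

def npg_soft_policy_iteration_step(pi, A, eta, gamma):
    weights = pi * np.exp(eta * A / (1.0 - gamma))
    return weights / weights.sum(axis=1, keepdims=True)  # normalize by Z_t(s)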
The natural gradient direction involves the Moore–Penrose pseudoinverse of the Fisher information matrix. The pseudoinverse M^† of a matrix M is the unique matrix satisfying:
• M M^† M = M,
• M^† M M^† = M^†,
• (M M^†)^⊤ = M M^†, and
• (M^† M)^⊤ = M^† M.
It plays a crucial role in the NPG algorithm by allowing for effective updates in the parameter space. The processing complexity in number of FLOPs can be expressed as 2np^2 + 2n^3 for a matrix of size n × p using the common Singular Value Decomposition (SVD) method. The complexity of SVD (as stated in the PyTorch documentation) is O(nm^2), where m is the larger dimension of the matrix and n the smaller.
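As a small illustration (not part of the original report), the pseudoinverse can be computed from the SVD and checked against the four Penrose conditions above; the test matrix is arbitrary.

# Sketch: compute the Moore-Penrose pseudoinverse via SVD and verify the
# four Penrose conditions listed above.
import numpy as np

def pinv_via_svd(M, rcond=1e-12):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_inv = np.where(s > rcond * s.max(), 1.0 / s, 0.0)  # invert nonzero singular values
    return Vt.T @ np.diag(s_inv) @ U.T

M = np.random.default_rng(0).normal(size=(5, 3))
Mp = pinv_via_svd(M)
print(np.allclose(M @ Mp @ M, M),          # M M† M = M
      np.allclose(Mp @ M @ Mp, Mp),        # M† M M† = M†
      np.allclose((M @ Mp).T, M @ Mp),     # (M M†)^T = M M†
      np.allclose((Mp @ M).T, Mp @ M))     # (M† M)^T = M† M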
3.3 Tabular Methods Summary
Table Summary: The table describes iteration complexities for finding a policy π with V^⋆(s_0) − V^π(s_0) ≤ ε. The algorithms optimize E_{s∼µ}[V^π(s)], with |S| states, |A| actions, and 0 ≤ γ < 1. The quantity D_∞ := max_s d^{π^⋆}_{s_0}(s)/µ(s) measures the distribution mismatch. NPG has no D_∞ dependence and works for any s_0.
4 Function Approximation and Distribution Shift
In contrast to the tabular scenarios explored earlier, the analysis extends to the
realm of function approximation where policy classes lack full expressiveness (e.g., when
d ≪ |S||A|). The investigation revolves around variants of the Natural Policy Gradient
(NPG) update rule:
θ ← θ + η F_ρ(θ)^† ∇_θ V^θ(ρ),
where F_ρ(θ) := E_{s∼d^{π_θ}_ρ, a∼π_θ(·|s)}[∇_θ log π_θ(a | s) (∇_θ log π_θ(a | s))^⊤] is the Fisher information matrix induced by ρ and π_θ, and F_ρ(θ)^† denotes its Moore–Penrose pseudoinverse.
The analytical framework establishes a close connection between the NPG update and
the concept of compatible function approximation, formalized by Kakade (2001). This
relationship links the NPG update to a regression problem where the optimal weight
vector w⋆ is given by
w_⋆ ∈ argmin_w E_{s∼d^{π_θ}_ρ, a∼π_θ(·|s)}[ (w^⊤ ∇_θ log π_θ(a | s) − A^{π_θ}(s, a))^2 ].
This approach allows for approximate updates, solving relevant regression problems
with samples. The main results demonstrate the effectiveness of NPG updates in scenarios
where errors arise from both statistical estimation (when exact gradients are unavailable)
and function approximation (due to using a parameterized function class). Notably, a
novel estimation/approximation decomposition tailored for the NPG algorithm is provided.
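The following is a minimal sketch (the sampled score vectors and advantage estimates are assumed inputs, not from the source) of solving this compatible function approximation regression from samples:

# Sketch: estimate the NPG direction by regressing advantages on score vectors,
#   w* ~ argmin_w (1/N) sum_i (w^T grad_log_pi_i - A_i)^2,
# using ordinary least squares on sampled (s, a) pairs. Assumed inputs:
#   score[i, :] = grad_theta log pi_theta(a_i | s_i),  adv[i] = A^{pi_theta}(s_i, a_i).
import numpy as np

def compatible_fa_weights(score, adv):
    # lstsq returns the minimum-norm least-squares solution, mirroring the
    # pseudoinverse F^dagger appearing in the exact NPG update.
    w, *_ = np.linalg.lstsq(score, adv, rcond=None)
    return w

def npg_step(theta, score, adv, eta):
    return theta + eta * compatible_fa_weights(score, adv)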
To delve into specific policy classes, the analysis begins with log-linear policies as
a special case. Subsequently, the investigation extends to more general policy classes,
including neural policy classes.
Compatible Function Approximation: For the log-linear policy class, where π_θ(a | s) ∝ exp(θ^⊤ ϕ_{s,a}) for feature vectors ϕ_{s,a} ∈ R^d, the gradient of the log-policy is expressed as
∇_θ log π_θ(a | s) = ϕ_{s,a} − E_{a′∼π_θ(·|s)}[ϕ_{s,a′}] =: ϕ̄^θ_{s,a}.
Here, ϕ̄^θ_{s,a} is the centered version of ϕ_{s,a}. Analogously, ϕ̄^π_{s,a} is defined in the same way for any policy π.
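A small sketch (the feature array is an assumed placeholder) of the log-linear policy and its centered features:

# Sketch: log-linear policy pi_theta(a|s) proportional to exp(theta . phi_{s,a}) and the
# centered features phi_bar used by compatible function approximation.
# phi has shape [num_states, num_actions, d]; values are placeholders.
import numpy as np

def log_linear_policy(theta, phi):
    logits = phi @ theta                          # [S, A]
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    pi = np.exp(logits)
    return pi / pi.sum(axis=1, keepdims=True)

def centered_features(theta, phi):
    pi = log_linear_policy(theta, phi)            # [S, A]
    mean_phi = np.einsum("sa,sad->sd", pi, phi)   # E_{a'~pi(.|s)}[phi_{s,a'}]
    return phi - mean_phi[:, None, :]             # phi_bar^theta_{s,a}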
NPG Update Rule: Using the centered features, the NPG update rule is equivalent to (see [1])
θ^{(t+1)} = θ^{(t)} + η w^{(t)}, where w^{(t)} ∈ argmin_w E_{s∼d^{π^{(t)}}_ρ, a∼π^{(t)}(·|s)}[ (w^⊤ ϕ̄^{(t)}_{s,a} − A^{(t)}(s, a))^2 ].
Q-NPG Variant: We also consider a variant, Q-NPG, which does not center the features:
θ^{(t+1)} = θ^{(t)} + η w^{(t)}, where w^{(t)} ∈ argmin_w E_{s∼d^{π^{(t)}}_ρ, a∼π^{(t)}(·|s)}[ (w^⊤ ϕ_{s,a} − Q^{(t)}(s, a))^2 ].
Note that Q^π(s, a) is not expected to be 0 under π(· | s), unlike the advantage function. If the compatible function approximation error is 0, then it is easy to verify that NPG and Q-NPG are equivalent, as their corresponding policy updates will be identical.
The iterates of the Q-NPG algorithm minimize this loss under a distribution that changes from iteration to iteration. We now define an approximate version of Q-NPG with a starting state-action distribution ν. The visitation measure over states and actions, d^π_ν, is given by
d^π_ν(s, a) := (1 − γ) E_{s_0,a_0∼ν} Σ_{t=0}^∞ γ^t Pr^π(s_t = s, a_t = a | s_0, a_0).
Define d^{(t)} := d^{π^{(t)}}_ν. The approximate Q-NPG algorithm iterates
θ^{(t+1)} = θ^{(t)} + η w^{(t)}, where w^{(t)} is an approximate minimizer, over ∥w∥_2 ≤ W, of the regression loss L(w; θ^{(t)}, d^{(t)}) := E_{s,a∼d^{(t)}}[ (w^⊤ ϕ_{s,a} − Q^{(t)}(s, a))^2 ], computed from samples.
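A minimal sketch of the approximate Q-NPG loop. The batch sampler sample_sa_and_qhat is a hypothetical helper (e.g., built from the sampler of Algorithm 1 below), and the norm constraint is enforced here by a simple rescaling rather than by exact constrained minimization.

# Sketch of the approximate Q-NPG loop: at each iteration, fit w to regress sampled
# Q-value estimates on (uncentered) features, then take a step in theta.
# `sample_sa_and_qhat(theta)` is a hypothetical helper returning (phi_batch, qhat_batch)
# drawn under d^{(t)}.
import numpy as np

def approximate_q_npg(theta0, sample_sa_and_qhat, eta, T, W):
    theta = theta0
    for _ in range(T):
        phi_batch, qhat_batch = sample_sa_and_qhat(theta)   # [N, d], [N]
        w, *_ = np.linalg.lstsq(phi_batch, qhat_batch, rcond=None)
        norm = np.linalg.norm(w)
        if norm > W:                                         # crude way to keep ||w||_2 <= W
            w = w * (W / norm)
        theta = theta + eta * w
    return theta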
The analysis uses two conditions on the regression steps, stated informally here (they correspond to Assumptions 6.1 and 6.2 in [1]):
1. (Excess risk) The fitted weights satisfy E[L(w^{(t)}; θ^{(t)}, d^{(t)}) − L(w_⋆^{(t)}; θ^{(t)}, d^{(t)})] ≤ ε_stat.
2. (Transfer error) The best predictor w_⋆^{(t)} has an error bounded by ε_bias with respect to the comparator's measure d^⋆: E[L(w_⋆^{(t)}; θ^{(t)}, d^⋆)] ≤ ε_bias.
For a state-action distribution v, define the feature covariance
Σ_v := E_{s,a∼v}[ϕ_{s,a} ϕ^⊤_{s,a}],
and define the relative condition number
κ := sup_{w∈R^d} (w^⊤ Σ_{d^⋆} w)/(w^⊤ Σ_ν w).
Assume that κ is finite.
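A numerical sketch (with random placeholder features) of computing κ as the largest generalized eigenvalue of the pair (Σ_{d^⋆}, Σ_ν), assuming Σ_ν is positive definite:

# Sketch: kappa = sup_w (w^T Sigma_dstar w)/(w^T Sigma_nu w) equals the largest
# eigenvalue of L^{-1} Sigma_dstar L^{-T}, where Sigma_nu = L L^T (Cholesky).
import numpy as np

def relative_condition_number(phi_dstar, phi_nu):
    """phi_dstar, phi_nu: [N, d] feature samples drawn from d* and nu respectively."""
    Sigma_dstar = phi_dstar.T @ phi_dstar / len(phi_dstar)
    Sigma_nu = phi_nu.T @ phi_nu / len(phi_nu)
    L = np.linalg.cholesky(Sigma_nu)          # Sigma_nu assumed positive definite
    Linv = np.linalg.inv(L)
    return np.linalg.eigvalsh(Linv @ Sigma_dstar @ Linv.T).max()

rng = np.random.default_rng(0)
print(relative_condition_number(rng.normal(size=(1000, 4)), rng.normal(size=(1000, 4))))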
Theorem 20 (Agnostic learning with Q-NPG): Fix a state distribution ρ, a state-action distribution ν, and a comparator policy π^⋆. Suppose Assumption 6.2 holds and ∥ϕ_{s,a}∥_2 ≤ B for all s, a. Suppose the Q-NPG update rule (in (20)) starts with θ^{(0)} = 0 and uses η = √(2 log|A|/(B^2 W^2 T)), and that the (random) sequence of iterates satisfies Assumption 6.1. Then,
E[min_{t<T} {V^{π^⋆}(ρ) − V^{(t)}(ρ)}] ≤ (BW/(1 − γ)) √(2 log|A|/T) + √(4|A| κ ε_stat/(1 − γ)^3) + √(4|A| ε_bias)/(1 − γ).
The proof is provided in Section 6.4.
The per-iteration regression loss decomposes as
L(w^{(t)}; θ^{(t)}, d^{(t)}) = [L(w^{(t)}; θ^{(t)}, d^{(t)}) − L(w_⋆^{(t)}; θ^{(t)}, d^{(t)})] + L(w_⋆^{(t)}; θ^{(t)}, d^{(t)}),
where the first bracketed term is the excess risk and the second term is the approximation error.
If, for every iteration t, the excess-risk term is at most ε_stat and the approximation-error term is at most ε_approx, then
E[min_{t<T} {V^{π^⋆}(ρ) − V^{(t)}(ρ)}] ≤ (BW/(1 − γ)) √(2 log|A|/T) + √( (4|A|/(1 − γ)^3) ( κ · ε_stat + ∥d^⋆/ν∥_∞ · ε_approx ) ).
Proof. We have the following crude upper bound on the transfer error:
L(w_⋆^{(t)}; θ^{(t)}, d^⋆) ≤ ∥d^⋆/d^{(t)}∥_∞ L(w_⋆^{(t)}; θ^{(t)}, d^{(t)}) ≤ (1/(1 − γ)) ∥d^⋆/ν∥_∞ L(w_⋆^{(t)}; θ^{(t)}, d^{(t)}),
where the last step uses the definition of d^{(t)} (see (19)). This implies ε_bias ≤ (1/(1 − γ)) ∥d^⋆/ν∥_∞ ε_approx, and the corollary follows.
Remark 24 (ε_bias = 0 for "linear" MDPs): In the recent linear MDP model of Jin et al. (2019), Yang and Wang (2019), and Jiang et al. (2017), where the transition dynamics are low rank, we have ε_bias = 0 provided we use the features of the linear MDP. Our guarantees also permit model misspecification of linear MDPs, with a non-worst-case approximation error in which ε_bias ≠ 0.
4.2.1 Algorithm 1
Proofs for the algorithms can be found in the Algorithms section (Section C) of Linear Convergence of Natural Policy Gradient Methods with Log-Linear Policies, published as a conference paper at ICLR 2023 [2].
Input: Starting state-action distribution ν.
1. Sample s_0, a_0 ∼ ν.
2. Sample s, a ∼ d^π_ν as follows: at every timestep h, with probability γ, act according to π; else, accept (s_h, a_h) as the sample and proceed to Step 3. See (19).
3. From (s_h, a_h), continue to execute π, using a termination probability of 1 − γ. Upon termination, set Q̂^π(s_h, a_h) to the undiscounted sum of rewards from time h onwards.
4. Return (s_h, a_h) and Q̂^π(s_h, a_h).
Algorithm 1: Sampler for s, a ∼ d^π_ν and an unbiased estimate of Q^π(s, a).
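A minimal sketch of this sampler. The environment helpers sample_nu(), sample_action(pi, s), and step(s, a) (which returns the reward and next state) are hypothetical and not defined in the report.

# Sketch of Algorithm 1: sample (s, a) ~ d^pi_nu and an unbiased estimate of Q^pi(s, a).
import numpy as np

def sample_sa_and_q(sample_nu, sample_action, step, pi, gamma, rng=np.random.default_rng()):
    s, a = sample_nu()                        # Step 1: (s0, a0) ~ nu
    while rng.random() < gamma:               # Step 2: with probability gamma, keep acting with pi
        r, s = step(s, a)
        a = sample_action(pi, s)
    # Step 3: from the accepted (s_h, a_h), estimate Q with termination probability 1 - gamma
    q_hat, (s_h, a_h) = 0.0, (s, a)
    while True:
        r, s = step(s, a)
        q_hat += r                            # undiscounted sum of rewards from time h onwards
        if rng.random() < 1.0 - gamma:
            break
        a = sample_action(pi, s)
    return (s_h, a_h), q_hat                  # Step 4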
4.2.2 Algorithm 2
Figure 4.1 shows Algorithm 2 (the figure is not reproduced here).
4.3 NPG: Performance Bounds for Smooth Policy Class
Let w_⋆^{(t)} denote the minimizer, i.e., w_⋆^{(t)} ∈ argmin_{∥w∥_2≤W} L_A(w; θ^{(t)}, d^{(t)}).
Assumption 6.4 (Policy Smoothness): Assume for all s ∈ S and a ∈ A that log π_θ(a | s) is a β-smooth function of θ.
Compared with the result of Theorem 16 in the noiseless, tabular case, we see two main differences. In the setting of Theorem 16, we have ε_stat = ε_bias = 0, so the last two terms vanish. This leaves the first term, where we observe a slower 1/√T rate compared with Theorem 16, together with an additional dependence on W (which grows as O(|S||A|/(1 − γ)) when approximating the advantages in the tabular setting). Both differences arise from the additional monotonicity property (Lemma 17) on the per-step improvements in the tabular case, which is not easily generalized to the function approximation setting.
Figure 4.2: Summary table for Approximate methods
• The second row (see Lemma 12 in Antos et al. (2008)) is a refinement of this approach, where ε_1 is an ℓ_1-average error in fitting the value functions under the fitting (state) distribution µ, and, roughly, C_∞ is a worst-case density ratio between the state visitation distribution of any non-stationary policy and the fitting distribution µ.
• For Conservative Policy Iteration, ε_1 is a related ℓ_1-average-case fitting error with respect to a fitting distribution µ, and D_∞ is defined as before, in the caption of Table 1 (see also Kakade and Langford, 2002); here, D_∞ ≤ C_∞ (e.g., see Scherrer (2014)).
• For NPG, ε_stat and ε_approx measure the excess risk (the regret) and the approximation error in fitting the values. Roughly speaking, ε_stat is the excess squared loss relative to the best fit (among an appropriately defined parametric class) under our fitting distribution (defined with respect to the state distribution µ), while ε_approx is the approximation error: the minimal possible error (in our parametric class) under our fitting distribution. The condition number κ is a relative eigenvalue condition between appropriately defined feature covariances with respect to the state visitation distribution of an optimal policy, d^{π^⋆}_{s_0}, and the state fitting distribution µ.
References
[1] Alekh Agarwal, Sham M. Kakade, Jason D. Lee, Gaurav Mahajan. On the Theory of
Policy Gradient Methods: Optimality, Approximation, and Distribution Shift. Journal
of Machine Learning Research, 22 (2021), 1-76.
[2] Rui Yuan, Simon S. Du, Robert M. Gower, Alessandro Lazaric, Lin Xiao. Linear Convergence of Natural Policy Gradient Methods with Log-Linear Policies. Published as a conference paper at ICLR 2023.
[3] Amir M. Ahmadian, Wolfgang Zirwas, Rakash Sivasiva Ganesan, Berthold Panzner.
Low Complexity Moore-Penrose Inverse for Large CoMP Areas with Sparse Massive
MIMO Channel Matrices. Nokia Bell Labs, Radio Systems Research, Munich, Ger-
many.