On Policy Gradients, Momentum, and Learning with Adversaries: Algorithms and Convergence Analysis
Abstract
This thesis comprises five works, organized into three parts: the first focuses on average-reward
Reinforcement Learning (RL), the second on distributed learning under adversaries in heterogeneous
and asynchronous setups, and the third on momentum in stochastic optimization.
Reinforcement Learning, a subfield of machine learning, trains agents to develop optimal
strategies for interacting with their environments. Policy gradient (PG) methods have a long
history in RL and remain an attractive class of algorithms due to their flexibility: they are applicable
to any differentiable policy parameterization, support function approximation, and can
easily incorporate structured state and action spaces. Additionally, these methods are straightforward
to implement in a simulation-based, model-free manner. However, PG methods still present
several open challenges, some of which this thesis addresses:
1. Variance-Reduced Policy Gradient for Average-Reward RL: We study policy gradient
methods with general policy parameterization for average-reward infinite-horizon Markov Decision
Processes (MDPs). Our results demonstrate that variance-reduced policy gradient techniques
achieve an expected regret of $\tilde{O}(\sqrt{T})$, a significant improvement over the prior
state-of-the-art bound of $\tilde{O}(T^{3/4})$ (the underlying objective and regret are sketched after this list).
2. Natural Actor-Critic with Multi-Level Monte Carlo: We propose a natural actor-critic
algorithm with a linear critic and general parameterized policy that achieves the optimal global
convergence rate of $O(1/\sqrt{T})$. The method applies to large or even infinite state–action spaces
and, by leveraging Multi-Level Monte Carlo (MLMC) to reduce bias in Markovian estimates,
removes the need to partition samples according to mixing or hitting times.
3. Federated Reinforcement Learning Under Adversaries: We establish the first global sample
complexity result for federated reinforcement learning with general policy parameterization
in the presence of adversaries, achieving an optimal-order sample complexity of
\[
\tilde{O}\!\left(\frac{1}{N\epsilon^{2}}\left(1 + \frac{f^{2}}{N}\right)\right)
\]
for finding an $\epsilon$-optimal policy, where $N$ is the total number of agents and $f < N/2$ is the number of
adversarial agents.
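For reference, a minimal sketch of the average-reward objective and the regret notion underlying items 1 and 2, in notation introduced here rather than taken from the abstract:

\[
J(\theta) \;=\; \lim_{T\to\infty}\frac{1}{T}\,\mathbb{E}_{\pi_\theta}\!\Big[\sum_{t=0}^{T-1} r(s_t,a_t)\Big],
\qquad
\mathrm{Reg}(T) \;=\; \sum_{t=0}^{T-1}\big(J^{*}-r(s_t,a_t)\big),
\quad J^{*}=\max_{\theta} J(\theta).
\]

Under this convention, a regret of $\tilde{O}(\sqrt{T})$ corresponds to an average optimality gap of order $\tilde{O}(1/\sqrt{T})$, matching the global convergence rate stated in item 2.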
Distributed systems are vulnerable to challenges such as random failures and adversarial
attacks, which may arise from hardware malfunctions, inaccurate training data, or intentional
malicious actions. Adversarial attacks in particular, where attackers may possess significant
system knowledge or collaborate, can severely disrupt the learning process. Analysis of such
attacks in the heterogeneous and asynchronous settings that are typical in practice remains
limited, and this thesis explores the following problem:
4. Distributed Asynchronous and Heterogeneous Learning in Adversarial Settings: We
study a distributed linear problem in an adversarial setting, with applications to network
tomography. Despite the challenges posed by heterogeneity and asynchrony, we develop
an algorithm that ensures almost sure convergence to the true solution.
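As a concrete, hypothetical instance of the setup in item 4 (the precise model is given in the thesis, not in this abstract): each honest agent $i$ holds a local, possibly underdetermined linear system, the stacked system has a unique solution $x^{*}$, updates arrive asynchronously, and a subset of adversarial agents may report arbitrary values. The guarantee is almost sure convergence of every honest agent's iterate to $x^{*}$:

\[
A_i\,x^{*} = b_i \ \ \text{for every honest agent } i,
\qquad
x^{(i)}_t \xrightarrow{\ \mathrm{a.s.}\ } x^{*} \ \ \text{as } t\to\infty .
\]

In network tomography, the rows of $A_i$ would encode which links a measured path traverses, and $b_i$ the corresponding path-level measurements.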
Stochastic optimization focuses on maximizing or minimizing functions using noisy gradient
estimates and has broad applications across domains. Improving its sample complexity is a
central challenge. Momentum-based methods, such as the Heavy Ball (HB) and Nesterov’s
Accelerated Gradient (NAG), are known to accelerate exact gradient descent for quadratic
objectives. However, their impact on convergence in stochastic settings remains unclear. This
thesis investigates this question:
5. Momentum in Stochastic Optimization: We show that Stochastic Heavy Ball (SHB)
and Nesterov’s Accelerated Stochastic Gradient (ASG) cannot outperform standard Stochastic
Gradient Descent (SGD) in terms of sample complexity, even in simple stochastic settings. This
is demonstrated by deriving a lower bound on the sample complexity of SHB and ASG and
proving that this bound is achievable by vanilla SGD.
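For completeness, the standard update rules of the two momentum methods compared in item 5, written with a stochastic gradient $g(\cdot)$, step size $\alpha$, and momentum parameter $\beta$ (notation chosen here, not fixed by the abstract):

\begin{align*}
\text{SHB:}\quad & x_{t+1} = x_t - \alpha\, g(x_t) + \beta\,(x_t - x_{t-1}),\\
\text{ASG:}\quad & y_t = x_t + \beta\,(x_t - x_{t-1}), \qquad x_{t+1} = y_t - \alpha\, g(y_t).
\end{align*}

Setting $\beta = 0$ in either rule recovers vanilla SGD; the result in item 5 shows that, in terms of sample complexity, the extra momentum term cannot improve on this special case.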

