Exploration and Misspecification in Reinforcement Learning
Abstract
Among the fundamental challenges confronting reinforcement learning are exploration, the need
to search effectively over large and complex state-action spaces, and misspecification, which
arises when function approximation is used to mitigate the curse of dimensionality inherent in
these spaces. In this thesis, we study three central problems, each motivated by observations
pertaining to these aspects of reinforcement learning.
First, we examine exploration in linear bandits whose actions lie on smooth, curved manifolds.
We prove that any algorithm achieving sublinear regret necessarily performs sufficient
exploration, in stark contrast to the behavior observed and theoretically justified in standard
multi-armed bandits.
Next, we investigate model misspecification in greater depth. We characterize a class of
problems as robust and show that, despite arbitrary model error, these problems can be learned
efficiently using standard, vanilla algorithms. These results extend the existing literature,
which has primarily focused on worst-case analyses.
Finally, we study the effect of noise in reward models trained on preference datasets. We show
that alignment procedures for large language models (LLMs), when based on such noisy reward
estimates, can suffer from performance degradation. To address this, we propose variance-aware
policy updates, which we prove to be less susceptible to degradation and support with empirical
evidence.
Together, these studies illustrate distinct aspects of exploration and of misspecification,
whether arising from model approximation error or observation noise, and provide a theoretical
foundation that explains the causes of phenomena observed in prior work, along with mitigation
strategies where applicable.

