
dc.contributor.advisor        Gopalan, Aditya
dc.contributor.author         Banerjee, Debangshu
dc.date.accessioned           2025-11-12T04:32:09Z
dc.date.available             2025-11-12T04:32:09Z
dc.date.submitted             2025
dc.identifier.uri             https://etd.iisc.ac.in/handle/2005/7380
dc.description.abstract       Among the basic challenges that confront reinforcement learning are exploration (the need to search effectively over large and complex state-action spaces) and misspecification, which arises from using function approximation to mitigate the curse of dimensionality inherent in these large state-action spaces. In this thesis, we study three central problems, each motivated by observations pertaining to these aspects of reinforcement learning. First, we examine exploration in linear bandits whose actions lie on smooth, curved manifolds. We prove that any algorithm achieving sublinear regret must inherently perform sufficient exploration. This phenomenon stands in stark contrast to the behavior observed and theoretically justified in standard multi-armed bandits. Next, we undertake a deeper investigation into model misspecification. We characterize a class of problems as robust and show that, despite arbitrary model error, these problems can be learned efficiently using standard, vanilla algorithms. These results extend the existing literature, which has primarily focused on worst-case analyses. Finally, we study the effect of noise in reward models trained on preference datasets. We find that alignment procedures for large language models (LLMs), when based on such noisy reward estimates, can suffer from performance degradation. To address this, we propose variance-aware policy updates, which we prove to be less susceptible to such degradation and support with empirical evidence. Together, these studies illustrate distinct aspects of exploration and misspecification, arising from either model approximation errors or observation noise, and provide a theoretical foundation that explains phenomena already observed in prior work, along with mitigation strategies wherever applicable.
dc.language.iso               en_US
dc.relation.ispartofseries    ;ET01140
dc.rights                     I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.
dc.subject                    Bandits
dc.subject                    Exploration
dc.subject                    Misspecification
dc.subject                    Reinforcement Learning with Human Feedback
dc.subject                    linear bandits
dc.subject                    large language models
dc.subject.classification     Research Subject Categories::TECHNOLOGY::Electrical engineering, electronics and photonics::Other electrical engineering, electronics and photonics
dc.title                      Exploration and Misspecification in Reinforcement Learning
dc.type                       Thesis
dc.degree.name                PhD
dc.degree.level               Doctoral
dc.degree.grantor             Indian Institute of Science
dc.degree.discipline          Engineering

