
dc.contributor.advisor: Gopalan, Aditya
dc.contributor.author: Ray Chowdhury, Sayak
dc.date.accessioned: 2022-06-07T04:42:48Z
dc.date.available: 2022-06-07T04:42:48Z
dc.date.submitted: 2021
dc.identifier.uri: https://etd.iisc.ac.in/handle/2005/5744
dc.description.abstract: In a reinforcement learning (RL) problem, a learner takes actions to control the state of an initially unknown environment so as to maximize the sum of the rewards it obtains. This has applications in many practical sequential decision-making systems, such as recommendation systems, sequential investment and portfolio allocation, dynamic resource allocation in communication systems, and targeted advertising and pricing. In these online learning settings, there is no separate time set aside for purely exploring the unknown environment; rather, the learner must carefully trade off exploration and exploitation. The learner seeks to minimize the regret, i.e., the shortfall in total reward earned from all its decisions compared to an optimal decision. In this thesis, we study online RL algorithms for large and structured environments, where generalization across unseen states and actions is crucial to achieving low regret.

In the first part of the thesis, we study non-parametric approaches to modelling reward structure in the multi-armed bandit problem -- a stateless environment with i.i.d. rewards -- with a very large (potentially infinite) number of arms. This problem can be formulated as learning to optimize a fixed but unknown function over a continuous domain, where function evaluations are noisy and expensive. It is a generic problem appearing in several applications, such as hyper-parameter tuning of machine learning models, sensor selection, and experimental design. We model the function's uncertainty using the reproducing kernel Hilbert space (RKHS) compatible with a kernel on the function's domain. Adapting techniques from Gaussian process regression, we develop upper confidence bound and Thompson sampling-based algorithms for continuous bandit optimization and derive corresponding regret guarantees assuming sub-Gaussian reward distributions. Along the way, we derive a self-normalized concentration inequality for RKHS-valued martingales of arbitrary, possibly infinite, dimension, which might be of independent interest. We further extend these results to kernelized bandit environments with heavy-tailed reward distributions: by employing an efficient reward-truncation strategy, we design an algorithm that is robust to heavy-tailed corruptions in rewards, and we show that its regret is (almost) optimal in a widely used setting by deriving a fundamental lower bound for this problem. We conclude this part of the thesis by extending the kernelized bandit problem to the multi-objective optimization setting and designing low-regret strategies for it.

In the second part of the thesis, we study regret minimization strategies for episodic and average-reward Markov decision processes (MDPs) with infinite state and action spaces. We consider both parametric and non-parametric MDPs. In the non-parametric setting, taking inspiration from kernelized bandits, we model uncertainty in mean rewards and transitions using RKHSs compatible with kernels defined jointly on the state-action space. In the parametric setting, we introduce an MDP transition model defined by bilinear exponential families with an unknown finite-dimensional parameter. For both settings, we design variants of upper confidence RL and posterior sampling RL strategies, and we prove finite-time regret guarantees for the proposed algorithms that depend on the structure of the underlying models. For bilinear exponential family MDPs, the regret bounds are proportional to the dimension of the parameter vector, and for kernelizable MDPs, the bounds are expressed in terms of the effective dimensions of the respective RKHSs. Moreover, all the results generalize existing approaches for tabular finite-state, finite-action MDPs and for parametric linear quadratic regulators.
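An informal illustration of the kernelized upper confidence bound idea summarized in the abstract: the sketch below runs a GP-UCB-style selection loop over a discretized one-dimensional arm domain, where a kernel ridge posterior supplies mean and variance estimates and the arm with the largest optimistic score is pulled. The reward function, kernel lengthscale, noise level, and confidence-width parameter are hypothetical placeholders chosen for the example; this is a generic optimism-in-the-face-of-uncertainty sketch under those assumptions, not the thesis's own algorithm or analysis.

    # Illustrative GP-UCB-style bandit loop on a discretized 1-D domain.
    # All functions and parameters below are assumed for the example only.
    import numpy as np

    def rbf(a, b, lengthscale=0.2):
        """Squared-exponential kernel between two 1-D point sets."""
        d = a[:, None] - b[None, :]
        return np.exp(-0.5 * (d / lengthscale) ** 2)

    rng = np.random.default_rng(0)
    domain = np.linspace(0.0, 1.0, 200)          # candidate arms (discretized)
    f = lambda x: np.sin(6 * x) * x              # unknown reward function (assumed)
    noise_sd, lam, beta = 0.1, 0.1, 2.0          # noise level, ridge parameter, confidence width

    X, y = [], []                                # queried arms and observed noisy rewards
    for t in range(30):
        if X:
            Xa = np.array(X)
            K_inv = np.linalg.inv(rbf(Xa, Xa) + lam * np.eye(len(X)))
            k_star = rbf(domain, Xa)             # kernel between candidates and past queries
            mean = k_star @ K_inv @ np.array(y)  # posterior-mean reward estimate
            var = np.ones(len(domain)) - np.einsum(
                "ij,jk,ik->i", k_star, K_inv, k_star)  # prior k(x, x) = 1 for RBF
            ucb = mean + beta * np.sqrt(np.maximum(var, 0.0))
            x_t = domain[np.argmax(ucb)]         # optimistic (UCB) arm choice
        else:
            x_t = rng.choice(domain)             # first pull: arbitrary arm
        X.append(x_t)
        y.append(f(x_t) + noise_sd * rng.standard_normal())

    print("best queried arm:", X[int(np.argmax(y))])

In practice the confidence width beta is set from a concentration argument (the thesis derives such bounds via a self-normalized inequality for RKHS-valued martingales); the constant used here is purely illustrative.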
dc.description.sponsorship: Google India PhD Fellowship
dc.language.iso: en_US
dc.rights: I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.
dc.subject: Reinforcement Learning, Multi-armed Bandits
dc.subject: Reinforcement Learning
dc.subject: Multi-armed Bandits
dc.subject: Markov decision processes
dc.subject.classification: Research Subject Categories::TECHNOLOGY::Electrical engineering, electronics and photonics
dc.title: Reinforcement Learning in Large and Structured Environments
dc.type: Thesis
dc.degree.name: PhD
dc.degree.level: Doctoral
dc.degree.grantor: Indian Institute of Science
dc.degree.discipline: Engineering

