
dc.contributor.advisor: Gopalan, Aditya
dc.contributor.author: Ray Chowdhury, Sayak
dc.date.accessioned: 2022-06-07T04:42:48Z
dc.date.available: 2022-06-07T04:42:48Z
dc.date.submitted: 2021
dc.identifier.uri: https://etd.iisc.ac.in/handle/2005/5744
dc.description.abstract: In a reinforcement learning (RL) problem, a learner takes actions to control the state of an initially unknown environment so as to maximize the sum of the rewards it obtains. This has applications in many practical sequential decision-making systems, such as recommendation systems, sequential investment and portfolio allocation, dynamic resource allocation in communication systems, and targeted advertising and pricing. In these online learning settings, there is no separate time set aside for purely exploring the unknown environment; rather, the learner must carefully trade off exploration and exploitation. The learner seeks to minimize the regret, i.e., the shortfall in total reward earned from all its decisions compared to an optimal decision. In this thesis, we study online RL algorithms for large and structured environments, where generalization across unseen states and actions is crucial to achieving low regret.

In the first part of the thesis, we study non-parametric approaches to modelling reward structure in the multi-armed bandit problem -- a stateless environment with i.i.d. rewards -- with a very large (potentially infinite) number of arms. This problem can be formulated as learning to optimize a fixed but unknown function over a continuous domain, where function evaluations are noisy and expensive. It is a generic problem appearing in several applications, such as hyper-parameter tuning of machine learning models, sensor selection, and experimental design. We model the function's uncertainty using the reproducing kernel Hilbert space (RKHS) compatible with a kernel on the function's domain. Adapting techniques from Gaussian process regression, we develop upper confidence bound and Thompson sampling-based algorithms for continuous bandit optimization and derive corresponding regret guarantees assuming sub-Gaussian reward distributions. Along the way, we derive a self-normalized concentration inequality for RKHS-valued martingales of arbitrary, possibly infinite, dimension, which might be of independent interest. We further extend these results to kernelized bandit environments with heavy-tailed reward distributions: by employing an efficient reward-truncation strategy, we design an algorithm that is robust to heavy-tailed corruptions in rewards, and we show that its regret is (almost) optimal in a widely used setting by deriving a fundamental lower bound for this problem. We conclude this part of the thesis by extending the kernelized bandit problem to the multi-objective optimization setting and designing low-regret strategies for it.

In the second part of the thesis, we study regret minimization strategies for episodic and average-reward Markov decision processes (MDPs) with infinite state and action spaces. We consider both parametric and non-parametric MDPs. In the non-parametric setting, taking inspiration from kernelized bandits, we model uncertainty in mean rewards and transitions using RKHSs compatible with kernels defined jointly on the state-action space. In the parametric setting, we introduce an MDP transition model defined by bilinear exponential families with an unknown finite-dimensional parameter. For both settings, we design variants of upper confidence RL and posterior sampling RL strategies, and we prove finite-time regret guarantees for the proposed algorithms that depend on the structure of the underlying models. For bilinear exponential family MDPs, the regret bounds are proportional to the dimension of the parameter vector, and for kernelizable MDPs, the bounds are expressed in terms of the effective dimensions of the respective RKHSs. Moreover, all the results generalize existing approaches for tabular finite-state, finite-action MDPs and for parametric linear quadratic regulators.
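An informal illustration of the kernelized upper confidence bound idea summarized in the abstract: the sketch below runs a GP-UCB-style selection loop over a discretized one-dimensional arm domain, where a kernel ridge posterior supplies mean and variance estimates and the arm with the largest optimistic score is pulled. The reward function, kernel lengthscale, noise level, and confidence-width parameter are hypothetical placeholders chosen for the example; this is a generic optimism-in-the-face-of-uncertainty sketch under those assumptions, not the thesis's own algorithm or analysis.

    # Illustrative GP-UCB-style bandit loop on a discretized 1-D domain.
    # All functions and parameters below are assumed for the example only.
    import numpy as np

    def rbf(a, b, lengthscale=0.2):
        """Squared-exponential kernel between two 1-D point sets."""
        d = a[:, None] - b[None, :]
        return np.exp(-0.5 * (d / lengthscale) ** 2)

    rng = np.random.default_rng(0)
    domain = np.linspace(0.0, 1.0, 200)          # candidate arms (discretized)
    f = lambda x: np.sin(6 * x) * x              # unknown reward function (assumed)
    noise_sd, lam, beta = 0.1, 0.1, 2.0          # noise level, ridge parameter, confidence width

    X, y = [], []                                # queried arms and observed noisy rewards
    for t in range(30):
        if X:
            Xa = np.array(X)
            K_inv = np.linalg.inv(rbf(Xa, Xa) + lam * np.eye(len(X)))
            k_star = rbf(domain, Xa)             # kernel between candidates and past queries
            mean = k_star @ K_inv @ np.array(y)  # posterior-mean reward estimate
            var = np.ones(len(domain)) - np.einsum(
                "ij,jk,ik->i", k_star, K_inv, k_star)  # prior k(x, x) = 1 for RBF
            ucb = mean + beta * np.sqrt(np.maximum(var, 0.0))
            x_t = domain[np.argmax(ucb)]         # optimistic (UCB) arm choice
        else:
            x_t = rng.choice(domain)             # first pull: arbitrary arm
        X.append(x_t)
        y.append(f(x_t) + noise_sd * rng.standard_normal())

    print("best queried arm:", X[int(np.argmax(y))])

In practice the confidence width beta is set from a concentration argument (the thesis derives such bounds via a self-normalized inequality for RKHS-valued martingales); the constant used here is purely illustrative.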
dc.description.sponsorship: Google India PhD Fellowship
dc.language.iso: en_US
dc.rights: I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.
dc.subject: Reinforcement Learning, Multi-armed Bandits
dc.subject: Reinforcement Learning
dc.subject: Multi-armed Bandits
dc.subject: Markov decision processes
dc.subject.classification: Research Subject Categories::TECHNOLOGY::Electrical engineering, electronics and photonics
dc.title: Reinforcement Learning in Large and Structured Environments
dc.type: Thesis
dc.degree.name: PhD
dc.degree.level: Doctoral
dc.degree.grantor: Indian Institute of Science
dc.degree.discipline: Engineering

