Algorithms for Online Learning in Structured Environments
Online learning studies sequential decision making using information gathered along the way. Typical goals of an online learning agent are to maximize the reward accumulated during learning or to identify the action with the maximum expected reward. We study this problem in settings where the environment has some inherent structure, which the learning agent can exploit while making decisions to accelerate learning from data. We study a number of such problems in this dissertation.

We begin with regret minimization for multi-user online recommendation, where the expected user-item reward matrix has low rank (much smaller than the number of users or items). We address the cold-start problem in recommendation systems, where the agent initially has no information about the reward matrix and gathers information only gradually through interactions (i.e., recommending items) with arriving users. We use results from low-rank matrix estimation to design an efficient online algorithm that exploits the low rank of the underlying reward matrix. We analyze this algorithm and show that it enjoys better regret than algorithms that ignore the low-rank structure.

We then study the problem of pure exploration (or best arm identification) in linear bandits. In this problem, each time the learner chooses an action vector (arm), she receives a noisy realization of a reward whose mean depends linearly on an unknown parameter vector and the chosen action. The aim is to identify, as quickly as possible, the arm that yields the maximum expected reward. We are specifically interested in the situation where the ambient dimension of the unknown parameter vector is much smaller than the number of arms. We show that by exploiting this inherent problem structure, one can design provably optimal and efficient algorithms to identify the best arm quickly.
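To make the linear bandit model concrete, the following is a minimal sketch in Python. The specific instance (dimension, arm set, noise level) and the uniform-exploration strategy are illustrative assumptions, not the dissertation's algorithms: each arm's mean reward is the inner product of its feature vector with an unknown parameter, and by pulling arms the learner can estimate that parameter and guess the best arm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance (assumed for this sketch): K arms in a low ambient dimension d << K.
d, K = 3, 20
theta = np.array([1.0, -0.5, 0.2])     # unknown parameter vector
arms = rng.standard_normal((K, d))     # fixed set of action vectors

def pull(arm_index, noise=0.1):
    """Noisy reward whose mean is linear in the unknown parameter theta."""
    return arms[arm_index] @ theta + noise * rng.standard_normal()

# Naive uniform exploration, then least-squares estimation of theta.
n_rounds = 2000
X, y = [], []
for t in range(n_rounds):
    a = t % K
    X.append(arms[a])
    y.append(pull(a))
theta_hat, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)

# Guess the best arm from the estimated parameter.
best_arm_hat = int(np.argmax(arms @ theta_hat))
best_arm_true = int(np.argmax(arms @ theta))
```

A pure-exploration algorithm of the kind studied here would replace the uniform allocation with an adaptive one that concentrates pulls on the directions needed to separate the top arms, which is where the statistical savings from the low ambient dimension come in.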
We study how exploiting the intrinsic geometry of the problem leads to statistically and computationally efficient algorithms for best arm identification.

Finally, we formulate and study the problem of improper reinforcement learning: for a given (unknown) Markov Decision Process (MDP), we are given a bag of pre-designed controllers, or policies, for the MDP. We study how the agent can leverage (combine) these pre-trained base controllers, instead of just the rudimentary actions of the MDP, to accelerate learning. This can be useful for tuning across ensembles of controllers, learning in mismatched or simulated environments, and similar tasks, in order to obtain a good controller for a given target environment with relatively few trials. This differs from the usual reinforcement learning setup, where the learner observes the current state of the environment and chooses an action to play. In contrast, the improper learner chooses one of the given base controllers and plays whichever action that controller recommends. This indirect selection of actions via the base controllers allows the learned policy to inherit desirable properties of the controllers (e.g., interpretability, principled design, and safety).
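To illustrate the improper learning protocol, here is a minimal sketch assuming a toy chain MDP and a hypothetical bag of three hand-written base controllers (the environment, the controllers, and the simple rank-by-empirical-return strategy are all assumptions of this sketch, not the dissertation's method): the learner never picks raw actions; it picks a controller and executes whatever actions that controller recommends.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy chain MDP (assumed): states 0..4, actions {-1, +1}, reward 1 at the right end.
N_STATES = 5

def step(state, action):
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

# A "bag" of pre-designed base controllers, each mapping state -> action.
controllers = [
    lambda s: -1,                    # always move left
    lambda s: +1,                    # always move right
    lambda s: rng.choice((-1, 1)),   # uniformly random action
]

def rollout(controller, horizon=10):
    """Total reward of one episode played entirely through one base controller."""
    s, total = 0, 0.0
    for _ in range(horizon):
        s, r = step(s, controller(s))
        total += r
    return total

# The improper learner treats each controller as an arm and compares empirical returns.
means = np.array([np.mean([rollout(c) for _ in range(20)]) for c in controllers])
best = int(np.argmax(means))
```

In the actual problem the comparison must be done online while controlling regret, but the sketch shows the key structural point: the action space the learner reasons about is the (small) set of controllers, not the raw action set of the MDP.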