Battle of Bandits: Online Learning from Subsetwise Preferences and Other Structured Feedback

Saha, Aadirupa

dc.contributor.advisor	Bhattacharyya, Chiranjib
dc.contributor.advisor	Gopalan, Aditya
dc.contributor.author	Saha, Aadirupa
dc.date.accessioned	2021-07-05T04:14:31Z
dc.date.available	2021-07-05T04:14:31Z
dc.date.submitted	2020
dc.identifier.uri	https://etd.iisc.ac.in/handle/2005/5184
dc.description.abstract	The elicitation and aggregation of preferences is often the key to making better decisions. Be it a perfume company wanting to relaunch their 5 most popular fragrances, a movie recommender system trying to rank the most favoured movies, or a pharmaceutical company testing the relative efficacies of a set of drugs, learning from preference feedback is a widely applicable problem to solve. One can model the sequential version of this problem using the classical multiarmed-bandit (MAB) (e.g., Auer, 2002) by representing each decision choice as one bandit-arm, or more appropriately as a Dueling-Bandit (DB) problem (Yue \& Joachims, 2009). Although DB is similar to MAB in that it is an online decision making framework, DB is different in that it specifically models learning from pairwise preferences. In practice, it is often much easier to elicit information, especially when humans are in the loop, through relative preferences: `Item A is better than item B' is easier to elicit than its absolute counterpart: `Item A is worth 7 and B is worth 4'. However, instead of pairwise preferences, a more general $k$-subset-wise preference model $(k \ge 2)$ is more relevant in various practical scenarios, e.g. recommender systems, search engines, crowd-sourcing, e-learning platforms, design of surveys, ranking in multiplayer games. Subset-wise preference elicitation is not only more budget friendly, but also flexible in conveying several types of feedback. For example, with subset-wise preferences, the learner could elicit the best item, a partial preference of the top 5 items, or even an entire rank ordering of a subset of items, whereas all these boil down to the same feedback over pairs (subsets of size 2). The problem of how to learn adaptively with subset-wise preferences, however, remains largely unexplored; this is primarily due to the computational burden of maintaining a combinatorially large, $O(n^k)$, size of preference information in general (for a decision problem with $n$ items and subsetsize $k$). We take a step in the above direction by proposing ``Battling Bandits (BB)''---a new online learning framework to learn a set of optimal ('good') items by sequentially, and adaptively, querying subsets of items of size up to $k$ ($k\ge 2$). The preference feedback from a subset is assumed to arise from an underlying parametric discrete choice model, such as the well-known Plackett-Luce model, or more generally any random utility (RUM) based model. It is this structure that we leverage to design efficient algorithms for various problems of interest, e.g. identifying the best item, set of top-k items, full ranking etc., for both in PAC and regret minimization setting. We propose computationally efficient and (near-) optimal algorithms for above objectives along with matching lower bound guarantees. Interestingly this leads us to finding answers to some basic questions about the value of subset-wise preferences: Does playing a general $k$-set really help in faster information aggregation, i.e. is there a tradeoff between subsetsize-$k$ vs the learning rate? Under what type of feedback models? How do the performance limits (performance lower bounds) vary over different combinations of feedback and choice models? And above all, what more can we achieve through BB where DB fails? We proceed to analyse the BB problem in the contextual scenario – this is relevant in settings where items have known attributes, and allows for potentially infinite decision spaces. This is more general and of practical interest than the finite-arm case, but, naturally, on the other hand more challenging. Moreover, none of the existing online learning algorithms extend straightforwardly to the continuous case, even for the most simple Dueling Bandit setup (i.e. when $k=2$). Towards this, we formulate the problem of ``Contextual Battling Bandits (C-BB)'' under utility based subsetwise-preference feedback, and design provably optimal algorithms for the regret minimization problem. Our regret bounds are also accompanied by matching lower bound guarantees showing optimality of our proposed methods. All our theoretical guarantees are corroborated with empirical evaluations. Lastly, it goes without saying, that there are still many open threads to explore based on BB. These include studying different choice-feedback model combinations, performance objectives, or even extending BB to other useful frameworks like assortment selection, revenue maximization, budget-constrained bandits etc. Towards the end we will also discuss some interesting combinations of the BB framework with other, well-known, problems, e.g. Sleeping / Rotting Bandits, Preference based Reinforcement Learning, Learning on Graphs, Preferential Bandit-Convex-Optimization etc.	en_US
dc.language.iso	en_US	en_US
dc.rights	I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation	en_US
dc.subject	Bandit Algorithms	en_US
dc.subject	Contextual	en_US
dc.subject	Multiarmed Bandits	en_US
dc.subject	Online Sequential Learning From Preferences	en_US
dc.subject	Complexity Analysis	en_US
dc.subject	Regret	en_US
dc.subject	Tightness of Performance	en_US
dc.subject	Dueling	en_US
dc.subject	Battling	en_US
dc.subject	Subsetwise Preference	en_US
dc.subject	Relation Graphs	en_US
dc.subject	Sleeping Bandits	en_US
dc.subject	Ranking	en_US
dc.subject.classification	Research Subject Categories::TECHNOLOGY::Information technology::Computer science	en_US
dc.title	Battle of Bandits: Online Learning from Subsetwise Preferences and Other Structured Feedback	en_US
dc.type	Thesis	en_US
dc.degree.name	PhD	en_US
dc.degree.level	Doctoral	en_US
dc.degree.grantor	Indian Institute of Science	en_US
dc.degree.discipline	Engineering	en_US

Files in this item

Name:: PhD-Thesis-AadirupaSaha-CSA(II ...
Size:: 8.440Mb
Format:: PDF
Description:: Thesis full text

View/Open

This item appears in the following Collection(s)

Computer Science and Automation (CSA) [545]

Show simple item record