|dc.description.abstract||The elicitation and aggregation of preferences is often the key to making better decisions. Be it a perfume company wanting to relaunch their 5 most popular fragrances, a movie recommender system trying to rank the most favoured movies, or a pharmaceutical company testing the relative efficacies of a set of drugs, learning from preference feedback is a widely applicable problem to solve. One can model the sequential version of this problem using the classical multiarmed-bandit (MAB) (e.g., Auer, 2002) by representing each decision choice as one bandit-arm, or more appropriately as a Dueling-Bandit (DB) problem (Yue \& Joachims, 2009). Although DB is similar to MAB in that it is an online decision making framework, DB is different in that it specifically models learning from pairwise preferences. In practice, it is often much easier to elicit information, especially when humans are in the loop, through relative preferences: `Item A is better than item B' is easier to elicit than its absolute counterpart: `Item A is worth 7 and B is worth 4'.
However, instead of pairwise preferences, a more general $k$-subset-wise preference model $(k \ge 2)$ is more relevant in various practical scenarios, e.g. recommender systems, search engines, crowd-sourcing, e-learning platforms, design of surveys, ranking in multiplayer games. Subset-wise preference elicitation is not only more budget friendly, but also flexible in conveying several types of feedback. For example, with subset-wise preferences, the learner could elicit the best item, a partial preference of the top 5 items, or even an entire rank ordering of a subset of items, whereas all these boil down to the same feedback over pairs (subsets of size 2). The problem of how to learn adaptively with subset-wise preferences, however, remains largely unexplored; this is primarily due to the computational burden of maintaining a combinatorially large, $O(n^k)$, size of preference information in general (for a decision problem with $n$ items and subsetsize $k$).
We take a step in the above direction by proposing ``Battling Bandits (BB)''---a new online learning framework to learn a set of optimal ('good') items by sequentially, and adaptively, querying subsets of items of size up to $k$ ($k\ge 2$). The preference feedback from a subset is assumed to arise from an underlying parametric discrete choice model, such as the well-known Plackett-Luce model, or more generally any random utility (RUM) based model. It is this structure that we leverage to design efficient algorithms for various problems of interest, e.g. identifying the best item, set of top-k items, full ranking etc., for both in PAC and regret minimization setting. We propose computationally efficient and (near-) optimal algorithms for above objectives along with matching lower bound guarantees. Interestingly this leads us to finding answers to some basic questions about the value of subset-wise preferences: Does playing a general $k$-set really help in faster information aggregation, i.e. is there a tradeoff between subsetsize-$k$ vs the learning rate? Under what type of feedback models? How do the performance limits (performance lower bounds) vary over different combinations of feedback and choice models? And above all, what more can we achieve through BB where DB fails?
We proceed to analyse the BB problem in the contextual scenario – this is relevant in settings where items have known attributes, and allows for potentially infinite decision spaces. This is more general and of practical interest than the finite-arm case, but, naturally, on the other hand more challenging. Moreover, none of the existing online learning algorithms extend straightforwardly to the continuous case, even for the most simple Dueling Bandit setup (i.e. when $k=2$). Towards this, we formulate the problem of ``Contextual Battling Bandits (C-BB)'' under utility based subsetwise-preference feedback, and design provably optimal algorithms for the regret minimization problem. Our regret bounds are also accompanied by matching lower bound guarantees showing optimality of our proposed methods. All our theoretical guarantees are corroborated with empirical evaluations.
Lastly, it goes without saying, that there are still many open threads to explore based on BB. These include studying different choice-feedback model combinations, performance objectives, or even extending BB to other useful frameworks like assortment selection, revenue maximization, budget-constrained bandits etc. Towards the end we will also discuss some interesting combinations of the BB framework with other, well-known, problems, e.g. Sleeping / Rotting Bandits, Preference based Reinforcement Learning, Learning on Graphs, Preferential Bandit-Convex-Optimization etc.||en_US