Show simple item record

dc.contributor.advisor | Agarwal, Shivani
dc.contributor.author | Siddartha, Y R
dc.date.accessioned | 2020-12-01T05:50:41Z
dc.date.available | 2020-12-01T05:50:41Z
dc.date.submitted | 2017
dc.identifier.uri | https://etd.iisc.ac.in/handle/2005/4698
dc.description.abstract | We consider the dueling bandits problem, a sequential decision task where the goal is to learn to pick 'good' arms out of an available pool by actively querying for and observing relative preferences between selected pairs of arms. The noisy observed preferences are assumed to be generated by a fixed but unknown stochastic preference model. Motivated by applications in information retrieval, e-commerce, crowdsourcing, etc., a number of bandit algorithms have been proposed in recent years for this task. These have mostly addressed restricted settings wherein the underlying preference model satisfies various structural assumptions, such as being based on a random utility function or a feature-space embedding, or satisfying transitivity or sparsity properties, or at least possessing a Condorcet winner: a single 'best' arm that is preferred to all others. In seeking to move beyond such restricted settings, there has been a recent shift towards alternative notions of 'good' arms (including Borda, Copeland and von Neumann winners). In this work, we extend the dueling bandits problem by adopting, as the desired target set of good (or 'winning') arms, a number of tournament solutions that have been proposed in the social choice and voting theory literature as embodying principled and natural criteria for identifying good arms based on preference relations. We then propose a family of upper confidence bound (UCB) based dueling bandit algorithms that learn to play winning arms from several popular tournament solutions: the top cycle, uncovered set, Banks set and Copeland set. We derive these algorithms by first proposing a generic UCB-based framework algorithm that can be instantiated for different tournament solutions by means of appropriately designed 'selection procedures'. We show sufficiency conditions for the resulting dueling bandit algorithms to satisfy distribution-dependent, horizon-free bounds on natural regret measures defined w.r.t. the target tournament solutions. In contrast to previous work, these bounds do not require restrictive structural assumptions on the preference model and hold for a range of different tournament solutions. We develop selection procedures that satisfy the sufficiency conditions for a number of popular tournament solutions, yielding dueling bandit algorithms UCB-TC, UCB-UC, UCB-BA and UCB-CO for the top cycle, uncovered set, Banks set and Copeland set respectively. The O((K^2 ln T)/γ^2) bounds we derive are optimal in their dependence on the time horizon T. We show that for all of these tournament solutions, the distribution-dependent 'margin' γ is lower bounded by the separation, or relative advantage, of top cycle arms over non-top cycle arms. While O(K ln T) bounds are known for Condorcet models, our O(K^2 ln T) bounds extend to more general models as well as other tournament solutions. We empirically validate these claims and evaluate the proposed algorithms, comparing them to the dueling bandit algorithms RUCB, SAVAGE and BTM over synthetic and real-world preference models. We show that the UCB-TS algorithms perform competitively over models that possess a Condorcet winner, but outperform the other algorithms over more general models that do not possess one. | en_US
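The abstract's core ingredients can be illustrated with a minimal sketch: optimistic (UCB) estimates of pairwise preference probabilities define an estimated tournament (arm i 'beats' j if its optimistic preference probability is at least 1/2), from which a tournament solution such as the top cycle can be extracted. This is only an illustrative sketch, not the thesis's UCB-TC algorithm; the function names, the exploration constant `alpha`, and the closure routine for the top cycle are my own illustrative choices.

```python
import math

def ucb_matrix(wins, t, alpha=0.51):
    """Optimistic estimates of pairwise preference probabilities.
    wins[i][j] = number of duels so far in which arm i beat arm j."""
    K = len(wins)
    U = [[0.5] * K for _ in range(K)]
    for i in range(K):
        for j in range(K):
            if i == j:
                continue
            n = wins[i][j] + wins[j][i]
            if n == 0:
                U[i][j] = 1.0  # untried pair: fully optimistic
            else:
                # empirical preference plus a UCB-style confidence radius
                U[i][j] = wins[i][j] / n + math.sqrt(alpha * math.log(t) / n)
    return U

def top_cycle(beats):
    """Top cycle of a tournament: the smallest nonempty set S such that
    every arm in S beats every arm outside S. beats[i][j] is True iff
    arm i beats arm j. Computed as the backward closure of a Copeland
    winner: start from an arm with the most wins and repeatedly add any
    arm that beats some member of the current set."""
    K = len(beats)
    copeland = [sum(beats[i]) for i in range(K)]
    S = {max(range(K), key=lambda i: copeland[i])}
    changed = True
    while changed:
        changed = False
        for j in range(K):
            if j not in S and any(beats[j][i] for i in S):
                S.add(j)
                changed = True
    return sorted(S)

if __name__ == "__main__":
    # A 4-arm tournament with a 3-cycle on top: 0>1, 1>2, 2>0, all beat 3.
    beats = [[False, True,  False, True],
             [False, False, True,  True],
             [True,  False, False, True],
             [False, False, False, False]]
    print(top_cycle(beats))  # the cyclic arms {0, 1, 2}
```

With a Condorcet winner (one arm beating all others) the closure never grows past that single arm, so the top cycle collapses to the Condorcet winner, matching the abstract's remark that these solutions generalize the Condorcet setting.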
dc.language.iso | en_US | en_US
dc.relation.ispartofseries | G28207;
dc.rights | I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. | en_US
dc.subject | Multi-armed Bandits | en_US
dc.subject | Stochastic Preference Model | en_US
dc.subject | Upper Confidence Bound | en_US
dc.subject | Dueling Bandits | en_US
dc.subject | Dueling Bandits Problem | en_US
dc.subject.classification | Computer Science | en_US
dc.title | Learning Tournament Solutions from Preference-based Multi-Armed Bandits | en_US
dc.type | Thesis | en_US
dc.degree.name | MTech (Res) | en_US
dc.degree.level | Masters | en_US
dc.degree.grantor | Indian Institute of Science | en_US
dc.degree.discipline | Engineering | en_US

