Show simple item record

dc.contributor.advisor	Narahari, Y
dc.contributor.author	Chatterjee, Aritra
dc.date.accessioned	2018-05-29T14:41:36Z
dc.date.accessioned	2018-07-31T04:40:22Z
dc.date.available	2018-05-29T14:41:36Z
dc.date.available	2018-07-31T04:40:22Z
dc.date.issued	2018-05-29
dc.date.submitted	2017
dc.identifier.uri	https://etd.iisc.ac.in/handle/2005/3631
dc.identifier.abstract	http://etd.iisc.ac.in/static/etd/abstracts/4501/G28478-Abs.pdf	en_US
dc.description.abstract	The multi-armed bandit (MAB) problem provides a convenient abstraction for many online decision problems arising in modern applications, including Internet display advertising, crowdsourcing, online procurement, and smart grids. Several variants of the MAB problem have been proposed to extend the basic model to a variety of practical and general settings. The sleeping multi-armed bandit (SMAB) problem is one such variant, in which the set of available arms varies with time. This study analyzes the efficacy of the Thompson Sampling algorithm for solving the SMAB problem. An algorithm for the classical MAB problem must choose one of K available arms (actions) in each of T consecutive rounds. Each choice of an arm generates a stochastic reward from an unknown but fixed distribution. The goal of the algorithm is to maximize the expected sum of rewards over the T rounds (or, equivalently, to minimize the expected total regret) relative to the best fixed action in hindsight. In many real-world settings, however, not all arms may be available in a given round. For example, in Internet display advertising, some advertisers might stay away from the auction due to budget constraints; in crowdsourcing, some workers may be unavailable at a given time due to timezone differences. Such situations give rise to the sleeping MAB abstraction. In the literature, several upper confidence bound (UCB)-based approaches have been proposed and investigated for the SMAB problem. Our contribution is to investigate the efficacy of a Thompson Sampling-based approach. Our key result is a logarithmic regret bound, which non-trivially generalizes a similar bound known for this approach in the classical MAB setting. Our bound also matches (up to constants) the best-known lower bound for the SMAB problem.
Furthermore, we show via detailed simulations that the Thompson Sampling approach in fact outperforms the known algorithms for the SMAB problem.	en_US
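The setting described in the abstract can be illustrated with a minimal sketch of Thompson Sampling restricted to the awake arms in each round. This is an illustrative assumption-laden example, not the thesis's algorithm as stated: it assumes Bernoulli rewards with Beta(1, 1) priors, and the function name `ts_smab` and the `available_sets` availability schedule are hypothetical.

```python
import random

def ts_smab(T, n_arms, true_means, available_sets):
    """Thompson Sampling sketch for a sleeping MAB.

    Assumes Bernoulli rewards and Beta(1, 1) priors per arm.
    available_sets[t] is the set of awake arms in round t
    (a hypothetical input; real availability is adversarial or stochastic).
    Returns the total reward collected over T rounds.
    """
    successes = [0] * n_arms
    failures = [0] * n_arms
    total_reward = 0
    for t in range(T):
        awake = available_sets[t]
        # Draw one posterior sample per awake arm; play the arm
        # with the highest sampled mean.
        samples = {a: random.betavariate(successes[a] + 1, failures[a] + 1)
                   for a in awake}
        arm = max(samples, key=samples.get)
        # Observe a Bernoulli reward and update that arm's posterior.
        reward = 1 if random.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    return total_reward
```

When every arm is awake in every round, this reduces to classical Thompson Sampling, consistent with the abstract's remark that the regret bound generalizes the classical-MAB result.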
dc.language.iso	en_US	en_US
dc.relation.ispartofseries	G28478	en_US
dc.subject	Thompson Sampling	en_US
dc.subject	Multi-Armed Bandit Problem	en_US
dc.subject	Upper Confidence Bound (UCB)	en_US
dc.subject	Awake Upper Estimated Reward	en_US
dc.subject	Multi-Armed Bandit Algorithms	en_US
dc.subject	Sleeping Multi-Armed Bandit Model	en_US
dc.subject	TS-SMAB	en_US
dc.subject	Sleeping Multi-Armed Bandit (SMAB) Problem	en_US
dc.subject.classification	Computer Science	en_US
dc.title	A Study of Thompson Sampling Approach for the Sleeping Multi-Armed Bandit Problem	en_US
dc.type	Thesis	en_US
dc.degree.name	MSc Engg	en_US
dc.degree.level	Masters	en_US
dc.degree.discipline	Faculty of Engineering	en_US

