Sequential Transfer in Multi-Armed Bandits using Reward Samples
Abstract
We consider a sequential multi-task problem, where each task is modeled as a stochastic multi-armed bandit with K arms. We study transfer learning in this setting and propose UCB-based algorithms that transfer reward samples from previous tasks to reduce the total regret across all tasks. We consider two notions of similarity among tasks: (i) universal similarity and (ii) adjacent similarity. Under universal similarity, all tasks encountered in the sequence are similar; under adjacent similarity, tasks close to one another in the sequence are more similar than those farther apart. We provide transfer algorithms and their regret upper bounds for both similarity notions and highlight the benefit of transfer. Our regret bounds show that performance improves as the sequential tasks become closer to each other. Finally, we provide empirical results for our algorithms, which show performance improvement over the standard UCB algorithm without transfer.
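To make the idea of transferring reward samples concrete, the following minimal Python sketch shows one simple (hypothetical) way a UCB index could incorporate samples carried over from earlier tasks: pooled samples shrink the exploration bonus and sharpen the empirical mean. This is only an illustration under that pooling assumption, not the specific transfer algorithms proposed in the paper, which may weight or discard old samples depending on the similarity notion.

    import numpy as np

    def ucb_with_transfer(own_counts, own_sums, transfer_counts, transfer_sums, t, c=2.0):
        # Pool the current task's samples with transferred samples (illustrative rule only).
        n = own_counts + transfer_counts                        # effective sample count per arm
        means = (own_sums + transfer_sums) / np.maximum(n, 1)   # empirical mean over pooled samples
        bonus = np.sqrt(c * np.log(t) / np.maximum(n, 1))       # exploration bonus shrinks as pooled count grows
        return means + bonus

    # Toy usage: 3 arms, a few pulls in the current task plus transferred samples.
    own_counts = np.array([2, 1, 0])
    own_sums = np.array([1.3, 0.4, 0.0])
    transfer_counts = np.array([10, 10, 10])
    transfer_sums = np.array([6.1, 4.8, 5.5])
    t = 4  # current round within the task
    print(ucb_with_transfer(own_counts, own_sums, transfer_counts, transfer_sums, t))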