Achieving Fairness in the Stochastic Multi-Armed Bandit Problem
Abstract
The classical Stochastic Multi-armed Bandit (MAB) problem provides an abstraction for many
real-world decision-making problems such as sponsored-search auctions, crowd-sourcing, and wireless
communication. In this work, we study FAIR-MAB, a variant of the MAB problem in which, in
addition to maximizing the sum of expected rewards, the algorithm must ensure
that each arm is pulled for at least a given fraction of the total number of rounds; this imposes an
additional fairness constraint on the algorithm. The non-trivial aspect of FAIR-MAB arises when
the time horizon T is unknown to the algorithm. The literature on fairness in the MAB setting
centers on procedural fairness, where the fairness guarantee concerns the decision-making
process followed by the algorithm rather than the outcomes of the decisions it makes.
In contrast, we consider an outcome-based fairness notion that can be
validated from the decisions made by the algorithm.
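One way to formalize this constraint is the following sketch, where the fraction r_i guaranteed to arm i, the pull count N_i(t), and the tolerance \alpha are notation introduced here for illustration (with \alpha corresponding to the unfairness tolerance discussed below): for every arm i and every round t,
\[
N_i(t) \;\ge\; \lfloor r_i \cdot t \rfloor - \alpha,
\]
where N_i(t) denotes the number of times arm i has been pulled in the first t rounds.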
Our primary contribution is to characterize a class of algorithms for the FAIR-MAB problem by
two parameters: the unfairness tolerance and the learning algorithm used as a black box. We define
an appropriate notion of fairness and show that our algorithm guarantees fairness independently
of the choice of the learning algorithm. We define the notion of fairness-aware regret, which
naturally extends the conventional notion of regret, and show that the fairness-aware regret of
our algorithm matches, in order, the regret of the black-box learning algorithm in the absence of
fairness constraints. Finally, we show via detailed simulations that our algorithm outperforms
the best-known algorithm for the FAIR-MAB problem in terms of the fairness guarantee it
provides, while performing comparably in terms of fairness-aware regret. We also evaluate the
cost of fairness in the MAB setting in terms of the conventional notion of regret.
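To make the wrapper idea concrete, the following is a minimal Python sketch of a fairness-aware meta-algorithm of the kind described above: whenever some arm has fallen more than the unfairness tolerance behind its guaranteed fraction of pulls, that arm is pulled; otherwise the choice is delegated to a black-box learner. The names (fair_mab, EpsilonGreedy, pull, fractions, alpha, horizon) and the epsilon-greedy learner are illustrative assumptions, not the paper's exact algorithm or notation.

```python
import random


class EpsilonGreedy:
    """A simple black-box learner, included only to make the sketch runnable."""

    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms

    def select_arm(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.means))
        return max(range(len(self.means)), key=lambda i: self.means[i])

    def update(self, arm, reward):
        # Incremental update of the empirical mean reward of the pulled arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def fair_mab(pull, learner, fractions, alpha, horizon):
    """Pull the most fairness-deficient arm whenever some arm has fallen more
    than `alpha` pulls behind its target fraction of rounds; otherwise defer
    to the black-box learner.  (Illustrative sketch only.)"""
    k = len(fractions)
    pulls = [0] * k
    for t in range(1, horizon + 1):
        # Fairness deficit of each arm at the current round.
        deficits = [fractions[i] * t - pulls[i] for i in range(k)]
        i_star = max(range(k), key=lambda i: deficits[i])
        arm = i_star if deficits[i_star] > alpha else learner.select_arm()
        reward = pull(arm)
        pulls[arm] += 1
        learner.update(arm, reward)
    return pulls


# Usage: three Bernoulli arms, each guaranteed roughly 20% of the pulls.
means = [0.3, 0.5, 0.7]
pull = lambda arm: 1.0 if random.random() < means[arm] else 0.0
print(fair_mab(pull, EpsilonGreedy(3), fractions=[0.2, 0.2, 0.2],
               alpha=5, horizon=10_000))
```

Because the learner is consulted only when no fairness deficit exceeds the tolerance, the fairness guarantee of such a wrapper does not depend on which learning algorithm is plugged in, mirroring the black-box property claimed in the abstract.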