Reinforcement Learning Algorithms for Off-Policy, Multi-Agent Learning and Applications to Smart Grids
Abstract
Reinforcement Learning (RL) algorithms are a popular class of algorithms for training an agent to
learn desired behavior through interaction with an environment whose dynamics are unknown to the
agent. RL algorithms combined with neural network architectures have enjoyed much success in various
disciplines such as games, medicine, energy management, economics, and supply chain management.
In our thesis, we study important extensions of the standard single-agent RL setting, namely the off-policy and
multi-agent settings. We discuss the motivation and importance of these settings and propose convergent
algorithms for solving the resulting problems. Finally, we consider one of the important applications of
RL, namely smart grids. A smart grid is a power grid that intelligently
manages its energy resources. In our thesis, we propose RL models for efficient smart grid design.
Learning the value function of a given policy (target policy) from the data samples obtained from
a different policy (behavior policy) is an important problem in Reinforcement Learning (RL). This
problem is studied under the setting of off-policy prediction. Temporal Difference (TD) learning
algorithms are a popular class of algorithms for solving prediction problems. TD algorithms with
linear function approximation are convergent when the data samples are generated from the target
policy itself (known as on-policy prediction). However, it has been well established in the literature
that off-policy TD algorithms under linear function approximation may diverge. In the first part of the
thesis, we propose a convergent online off-policy TD algorithm under linear function approximation.
The main idea is to penalize updates of the algorithm to ensure convergence of the iterates. We provide
a convergence analysis of our algorithm. Through numerical evaluations, we further demonstrate the
effectiveness of our proposed scheme.
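To make the setting concrete, the sketch below shows an off-policy TD(0) update with linear function approximation, where the importance-sampling ratio corrects for the mismatch between the target and behavior policies. The specific penalty used here, an L2 damping term with coefficient eta, is only an illustrative assumption; the exact penalization analyzed in the thesis may take a different form.

```python
import numpy as np

def penalized_off_policy_td0(trajectory, phi, pi, mu, theta0,
                             alpha=0.01, gamma=0.99, eta=0.1):
    """Sketch of one pass of penalized off-policy TD(0) with linear features.

    trajectory : iterable of (s, a, r, s_next) tuples generated by the behavior
                 policy mu; phi(s) returns the feature vector of state s;
                 pi(a, s) and mu(a, s) return action probabilities.
    The L2 penalty (coefficient eta) is an assumed form of the penalization.
    """
    theta = np.array(theta0, dtype=float)
    for (s, a, r, s_next) in trajectory:
        rho = pi(a, s) / mu(a, s)                                  # importance-sampling ratio
        delta = r + gamma * phi(s_next) @ theta - phi(s) @ theta   # TD error
        theta += alpha * (rho * delta * phi(s) - eta * theta)      # penalized update
    return theta
```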
Subsequently, we consider the “off-policy control” setup in RL, where an agent’s objective is to
compute an optimal policy based on the data obtained from a behavior policy. As the optimal policy
can be very different from the behavior policy, learning optimal behavior is much harder in the “off-policy”
setting than in the “on-policy” setting, wherein the data is collected by following the most recent policy
update. In this work, we propose the first deep off-policy natural actor-critic algorithm that utilizes
state-action distribution correction for handling the off-policy behavior and the natural policy gradient
for sample efficiency. Unlike the existing natural gradient-based actor-critic algorithms that use only
fixed features for policy and value function approximation, the proposed natural actor-critic algorithm
can utilize the representational power of deep neural networks to approximate both the policy and the value function. We illustrate
the benefit of the proposed off-policy natural gradient algorithm by comparing it with the Euclidean
gradient actor-critic algorithm on benchmark RL tasks.
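Schematically, the corrected natural policy gradient underlying this part combines a state-action distribution correction ratio with the inverse Fisher information matrix; the display below is only an illustrative form, and the estimators used in the thesis may differ in detail.

```latex
% w(s,a): state-action distribution correction ratio between target and behavior policies
w(s,a) \;=\; \frac{d^{\pi_\theta}(s,a)}{d^{\mu}(s,a)}, \qquad
\widetilde{\nabla} J(\theta) \;=\; F(\theta)^{-1}\,
  \mathbb{E}_{(s,a)\sim d^{\mu}}\!\left[\, w(s,a)\, \nabla_\theta \log \pi_\theta(a\mid s)\, A^{\pi_\theta}(s,a) \right],
% with the Fisher information matrix
F(\theta) \;=\; \mathbb{E}\!\left[ \nabla_\theta \log \pi_\theta(a\mid s)\, \nabla_\theta \log \pi_\theta(a\mid s)^{\top} \right].
```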
In the third part of the thesis, we consider the problem of two-player zero-sum games. In this
setting, there are two agents, each of whom aims to optimize its own payoff. Both agents observe
the same state of the game, and each aims to compute a strategy that maximizes
its payoff. However, the payoff of the second agent is the negative of the payoff obtained by the
first agent; therefore, the objective of the second agent is to minimize the total payoff obtained by
the first agent. This problem is formulated as a min-max Markov game in the literature. In this work,
we compute the solution of the two-player zero-sum game utilizing the technique of successive relaxation.
Successive relaxation has been successfully applied in the literature to derive faster value
iteration algorithms in the context of Markov Decision Processes (MDPs). We extend the concept of successive
relaxation to two-player zero-sum games. We then derive a generalized minimax Q-learning
algorithm that computes the optimal policy when the model information is unknown. Finally, we
prove the convergence of the proposed generalized minimax Q-learning algorithm utilizing stochastic
approximation techniques. Through experiments, we demonstrate the advantages of our proposed
algorithm.
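For illustration, a successive-relaxation form of the minimax Q-learning update, with relaxation parameter w and step sizes \alpha_k, can be written as below, where val(Q(s,·,·)) denotes the minimax value of the matrix game with payoff matrix Q(s,·,·); the exact update and the admissible range of w analyzed in the thesis may differ.

```latex
% val(Q(s,\cdot,\cdot)) = \max_{p \in \Delta(A)} \min_{b \in B} \sum_{a \in A} p(a)\, Q(s,a,b)
Q_{k+1}(s,a,b) \;=\; Q_k(s,a,b) \;+\; \alpha_k \Big[\, w\big( r(s,a,b) + \gamma\, \mathrm{val}\big(Q_k(s',\cdot,\cdot)\big)\big)
  \;+\; (1-w)\, \mathrm{val}\big(Q_k(s,\cdot,\cdot)\big) \;-\; Q_k(s,a,b) \Big].
```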
Next, we consider a cooperative stochastic game framework where multiple agents work towards
learning optimal joint actions in an unknown environment to achieve a common goal. In many real-world
applications, however, constraints are often imposed on the actions that the agents can jointly
take. In such scenarios, the agents aim to learn joint actions to achieve a common goal (minimizing a
specified cost function) while meeting the given constraints (specified via certain penalty functions).
Our work considers the relaxation of the constrained optimization problem by constructing the Lagrangian
of the cost and penalty functions. We propose a nested actor-critic solution approach to
solve this relaxed problem. In this approach, an actor-critic scheme is employed on the faster timescale to improve the policy
for a given Lagrange parameter, as in the classical actor-critic architecture.
Using this faster-timescale policy update, a meta actor-critic scheme is employed to improve the
Lagrange parameters on the slower timescale. Utilizing the proposed nested actor-critic scheme, we
develop three Nested Actor-Critic (N-AC) algorithms.
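The relaxation underlying the nested scheme can be sketched as follows, with cost objective J_c, penalty functions J_{p_i}, and illustrative constraint thresholds u_i; the slower-timescale update is shown here as projected gradient ascent only for brevity, whereas the thesis employs a meta actor-critic scheme for it.

```latex
L(\pi, \lambda) \;=\; J_{c}(\pi) \;+\; \sum_{i} \lambda_i \big( J_{p_i}(\pi) - u_i \big), \qquad \lambda_i \ge 0,
% faster timescale: policy improvement for a fixed \lambda; slower timescale: Lagrange multiplier update
\pi_{k+1} \;\leftarrow\; \pi_k - a_k\, \nabla_{\pi} L(\pi_k, \lambda_k), \qquad
\lambda_{k+1} \;\leftarrow\; \big[ \lambda_k + b_k\, \nabla_{\lambda} L(\pi_{k+1}, \lambda_k) \big]^{+},
\qquad \text{with } b_k / a_k \to 0 .
```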
In recent times, actor-critic algorithms with attention mechanisms have been successfully applied
to obtain optimal actions for RL agents in multi-agent environments. In the fifth part of our thesis,
we extend this approach to the constrained multi-agent RL setting considered above. The idea here
is that optimizing the common goal and satisfying the constraints may require different modes of
attention. Thus, by incorporating different attention modes, the agents can select useful information
required for optimizing the objective and satisfying the constraints separately, thereby yielding better
actions. Through experiments on benchmark multi-agent environments, we demonstrate the advantages of
our proposed attention-based actor-critic algorithm.
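The structural idea, separate attention heads feeding the objective critic and the constraint critic, can be sketched as below. All names, shapes, and the parameter layout are hypothetical and omit training; the sketch only illustrates that the two critics attend to the other agents' information independently.

```python
import numpy as np

def scaled_dot_product_attention(query, keys, values):
    """Standard scaled dot-product attention over the other agents' encodings."""
    scores = keys @ query / np.sqrt(query.shape[-1])   # one score per other agent
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values                            # attended summary vector

def dual_attention_critics(own_enc, other_encs, params):
    """Minimal sketch of separate attention modes for objective and constraints.

    `params` holds hypothetical projection matrices: one attention query feeds
    the reward (objective) critic and a separate query feeds the constraint
    critic, so each critic weights the other agents' information differently.
    """
    q_r = params["Wq_r"] @ own_enc                     # query for the reward-critic head
    q_c = params["Wq_c"] @ own_enc                     # query for the constraint-critic head
    keys = other_encs @ params["Wk"].T
    vals = other_encs @ params["Wv"].T
    ctx_r = scaled_dot_product_attention(q_r, keys, vals)
    ctx_c = scaled_dot_product_attention(q_c, keys, vals)
    q_reward = params["w_r"] @ np.concatenate([own_enc, ctx_r])   # value of the objective
    q_cost = params["w_c"] @ np.concatenate([own_enc, ctx_c])     # value of the constraint
    return q_reward, q_cost
```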
In the last part of our thesis, we study the applications of RL algorithms to Smart Grids. We
consider two important problems, one on the supply side and one on the demand side, and study both in
a unified framework. On the supply side, we study the problem of energy trading among microgrids to
maximize profit obtained from selling power while at the same time satisfying the customer demand.
On the demand side, we consider the optimal scheduling of time-adjustable demand, i.e., loads
with flexible time windows within which they can be scheduled. While previous works have treated these
two problems in isolation, we combine them and provide a unified Markov decision process
(MDP) framework.
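A minimal sketch of how such a unified MDP can be organized is given below; the state and action fields and the reward weighting are illustrative assumptions, not the exact formulation developed in the thesis.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class MicrogridState:
    """Illustrative per-slot state of the unified MDP (field names are assumptions)."""
    battery_level: float          # energy currently stored (kWh)
    renewable_gen: float          # renewable generation in this slot (kWh)
    base_demand: float            # non-adjustable demand in this slot (kWh)
    pending_flexible: Tuple[Tuple[float, int], ...]  # (load size, slots left in its window)
    price: float                  # market price for traded energy

@dataclass
class MicrogridAction:
    energy_traded: float            # >0: sell to other microgrids/main grid, <0: buy
    schedule_now: Tuple[bool, ...]  # which pending flexible loads to serve in this slot

def reward(state: MicrogridState, action: MicrogridAction,
           unmet_penalty: float = 10.0) -> float:
    """Sketch of a per-slot reward: trading profit minus a penalty for unmet demand.
    The actual reward in the thesis may weight these terms differently."""
    scheduled = sum(load for (load, _), on
                    in zip(state.pending_flexible, action.schedule_now) if on)
    supplied = state.renewable_gen + state.battery_level - action.energy_traded
    shortfall = max(0.0, state.base_demand + scheduled - supplied)
    return state.price * action.energy_traded - unmet_penalty * shortfall
```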