Reinforcement Learning Algorithms for Off-Policy, Multi-Agent Learning and Applications to Smart Grids
Abstract
Reinforcement Learning (RL) algorithms are a popular class of algorithms for training an agent to
learn desired behavior through interaction with an environment whose dynamics are unknown to the
agent. RL algorithms combined with neural network architectures have enjoyed much success in various
disciplines such as games, medicine, energy management, economics, and supply chain management.
In our thesis, we study important extensions of the standard single-agent RL setting, namely the off-policy and
multi-agent settings. We discuss the motivation and importance of these settings and propose convergent
algorithms for solving the resulting problems. Finally, we consider one of the important applications of
RL, namely smart grids. A smart grid is a power grid that intelligently
manages its energy resources. In our thesis, we propose RL models for efficient smart grid design.
Learning the value function of a given policy (target policy) from the data samples obtained from
a different policy (behavior policy) is an important problem in Reinforcement Learning (RL). This
problem is studied under the setting of off-policy prediction. Temporal Difference (TD) learning
algorithms are a popular class of algorithms for solving prediction problems. TD algorithms with
linear function approximation are convergent when the data samples are generated from the target
policy itself (known as on-policy prediction). However, it has been well established in the literature
that off-policy TD algorithms under linear function approximation may diverge. In the first part of the
thesis, we propose a convergent online off-policy TD algorithm under linear function approximation.
The main idea is to penalize updates of the algorithm to ensure convergence of the iterates. We provide
a convergence analysis of our algorithm. Through numerical evaluations, we further demonstrate the
effectiveness of our proposed scheme.
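To make the setting concrete, the sketch below shows an off-policy TD(0) update with linear function approximation, where the importance-sampling ratio corrects for the mismatch between the target and behavior policies. The specific penalty used here, an L2 damping term with coefficient eta, is only an illustrative assumption; the exact penalization analyzed in the thesis may take a different form.

```python
import numpy as np

def penalized_off_policy_td0(trajectory, phi, pi, mu, theta0,
                             alpha=0.01, gamma=0.99, eta=0.1):
    """Sketch of one pass of penalized off-policy TD(0) with linear features.

    trajectory : iterable of (s, a, r, s_next) tuples generated by the behavior
                 policy mu; phi(s) returns the feature vector of state s;
                 pi(a, s) and mu(a, s) return action probabilities.
    The L2 penalty (coefficient eta) is an assumed form of the penalization.
    """
    theta = np.array(theta0, dtype=float)
    for (s, a, r, s_next) in trajectory:
        rho = pi(a, s) / mu(a, s)                                  # importance-sampling ratio
        delta = r + gamma * phi(s_next) @ theta - phi(s) @ theta   # TD error
        theta += alpha * (rho * delta * phi(s) - eta * theta)      # penalized update
    return theta
```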
Subsequently, we consider the “off-policy control” setup in RL, where an agent’s objective is to
compute an optimal policy based on the data obtained from a behavior policy. As the optimal policy
can be very different from the behavior policy, learning optimal behavior is much harder in the “off-policy”
setting than in the “on-policy” setting, wherein the data is collected by following the most recent policy
update. In this work, we propose the first deep off-policy natural actor-critic algorithm that utilizes
state-action distribution correction for handling the off-policy behavior and the natural policy gradient
for sample efficiency. Unlike the existing natural gradient-based actor-critic algorithms that use only
fixed features for policy and value function approximation, the proposed natural actor-critic algorithm
can utilize the representational power of deep neural networks to approximate both the policy and the value function. We illustrate
the benefit of the proposed off-policy natural gradient algorithm by comparing it with the Euclidean
gradient actor-critic algorithm on benchmark RL tasks.
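Schematically, the corrected natural policy gradient underlying this part combines a state-action distribution correction ratio with the inverse Fisher information matrix; the display below is only an illustrative form, and the estimators used in the thesis may differ in detail.

```latex
% w(s,a): state-action distribution correction ratio between target and behavior policies
w(s,a) \;=\; \frac{d^{\pi_\theta}(s,a)}{d^{\mu}(s,a)}, \qquad
\widetilde{\nabla} J(\theta) \;=\; F(\theta)^{-1}\,
  \mathbb{E}_{(s,a)\sim d^{\mu}}\!\left[\, w(s,a)\, \nabla_\theta \log \pi_\theta(a\mid s)\, A^{\pi_\theta}(s,a) \right],
% with the Fisher information matrix
F(\theta) \;=\; \mathbb{E}\!\left[ \nabla_\theta \log \pi_\theta(a\mid s)\, \nabla_\theta \log \pi_\theta(a\mid s)^{\top} \right].
```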
In the third part of the thesis, we consider the problem of two-player zero-sum games. In this
setting, there are two agents, each of whom aims to optimize its own payoff. Both agents observe
the same state of the game, and each aims to compute a strategy that maximizes
its payoff. However, the payoff of the second agent is the negative of the payoff obtained by the
first agent; therefore, the objective of the second agent is to minimize the total payoff obtained by
the first agent. This problem is formulated as a min-max Markov game in the literature. In this work,
we compute the solution of the two-player zero-sum game utilizing the technique of successive relaxation.
Successive relaxation has been successfully applied in the literature to derive faster value
iteration algorithms in the context of Markov Decision Processes (MDPs). We extend the concept of successive
relaxation to two-player zero-sum games. We then derive a generalized minimax Q-learning
algorithm that computes the optimal policy when the model information is unknown. Finally, we
prove the convergence of the proposed generalized minimax Q-learning algorithm utilizing stochastic
approximation techniques. Through experiments, we demonstrate the advantages of our proposed
algorithm.
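For illustration, a successive-relaxation form of the minimax Q-learning update, with relaxation parameter w and step sizes \alpha_k, can be written as below, where val(Q(s,·,·)) denotes the minimax value of the matrix game with payoff matrix Q(s,·,·); the exact update and the admissible range of w analyzed in the thesis may differ.

```latex
% val(Q(s,\cdot,\cdot)) = \max_{p \in \Delta(A)} \min_{b \in B} \sum_{a \in A} p(a)\, Q(s,a,b)
Q_{k+1}(s,a,b) \;=\; Q_k(s,a,b) \;+\; \alpha_k \Big[\, w\big( r(s,a,b) + \gamma\, \mathrm{val}\big(Q_k(s',\cdot,\cdot)\big)\big)
  \;+\; (1-w)\, \mathrm{val}\big(Q_k(s,\cdot,\cdot)\big) \;-\; Q_k(s,a,b) \Big].
```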
Next, we consider a cooperative stochastic game framework where multiple agents work towards
learning optimal joint actions in an unknown environment to achieve a common goal. In many real-world
applications, however, constraints are often imposed on the actions that the agents can jointly
take. In such scenarios, the agents aim to learn joint actions to achieve a common goal (minimizing a
specified cost function) while meeting the given constraints (specified via certain penalty functions).
Our work considers the relaxation of the constrained optimization problem by constructing the Lagrangian
of the cost and penalty functions. We propose a nested actor-critic solution approach to
solve this relaxed problem. In this approach, an actor-critic scheme is employed on the faster timescale to improve the policy
for a given Lagrange parameter, as in the classical actor-critic architecture.
Using this faster-timescale policy update, a meta actor-critic scheme is employed to improve the
Lagrange parameters on the slower timescale. Utilizing the proposed nested actor-critic scheme, we
develop three Nested Actor-Critic (N-AC) algorithms.
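The relaxation underlying the nested scheme can be sketched as follows, with cost objective J_c, penalty functions J_{p_i}, and illustrative constraint thresholds u_i; the slower-timescale update is shown here as projected gradient ascent only for brevity, whereas the thesis employs a meta actor-critic scheme for it.

```latex
L(\pi, \lambda) \;=\; J_{c}(\pi) \;+\; \sum_{i} \lambda_i \big( J_{p_i}(\pi) - u_i \big), \qquad \lambda_i \ge 0,
% faster timescale: policy improvement for a fixed \lambda; slower timescale: Lagrange multiplier update
\pi_{k+1} \;\leftarrow\; \pi_k - a_k\, \nabla_{\pi} L(\pi_k, \lambda_k), \qquad
\lambda_{k+1} \;\leftarrow\; \big[ \lambda_k + b_k\, \nabla_{\lambda} L(\pi_{k+1}, \lambda_k) \big]^{+},
\qquad \text{with } b_k / a_k \to 0 .
```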
In recent times, actor-critic algorithms with attention mechanisms have been successfully applied
to obtain optimal actions for RL agents in multi-agent environments. In the fifth part of our thesis,
we extend this approach to the constrained multi-agent RL setting considered above. The idea here
is that optimizing the common goal and satisfying the constraints may require different modes of
attention. Thus, by incorporating different attention modes, the agents can select useful information
required for optimizing the objective and satisfying the constraints separately, thereby yielding better
actions. Through experiments on benchmark multi-agent environments, we demonstrate the advantages of
our proposed attention-based actor-critic algorithm.
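The structural idea, separate attention heads feeding the objective critic and the constraint critic, can be sketched as below. All names, shapes, and the parameter layout are hypothetical and omit training; the sketch only illustrates that the two critics attend to the other agents' information independently.

```python
import numpy as np

def scaled_dot_product_attention(query, keys, values):
    """Standard scaled dot-product attention over the other agents' encodings."""
    scores = keys @ query / np.sqrt(query.shape[-1])   # one score per other agent
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values                            # attended summary vector

def dual_attention_critics(own_enc, other_encs, params):
    """Minimal sketch of separate attention modes for objective and constraints.

    `params` holds hypothetical projection matrices: one attention query feeds
    the reward (objective) critic and a separate query feeds the constraint
    critic, so each critic weights the other agents' information differently.
    """
    q_r = params["Wq_r"] @ own_enc                     # query for the reward-critic head
    q_c = params["Wq_c"] @ own_enc                     # query for the constraint-critic head
    keys = other_encs @ params["Wk"].T
    vals = other_encs @ params["Wv"].T
    ctx_r = scaled_dot_product_attention(q_r, keys, vals)
    ctx_c = scaled_dot_product_attention(q_c, keys, vals)
    q_reward = params["w_r"] @ np.concatenate([own_enc, ctx_r])   # value of the objective
    q_cost = params["w_c"] @ np.concatenate([own_enc, ctx_c])     # value of the constraint
    return q_reward, q_cost
```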
In the last part of our thesis, we study the applications of RL algorithms to Smart Grids. We
consider two important problems, one on the supply side and one on the demand side, and study both in
a unified framework. On the supply side, we study the problem of energy trading among microgrids to
maximize profit obtained from selling power while at the same time satisfying the customer demand.
On the demand side, we consider the optimal scheduling of time-adjustable demand, i.e., loads
with flexible time windows within which they can be scheduled. While previous works have treated these
two problems in isolation, we combine them and provide a unified Markov decision process
(MDP) framework.
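A minimal sketch of how such a unified MDP can be organized is given below; the state and action fields and the reward weighting are illustrative assumptions, not the exact formulation developed in the thesis.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class MicrogridState:
    """Illustrative per-slot state of the unified MDP (field names are assumptions)."""
    battery_level: float          # energy currently stored (kWh)
    renewable_gen: float          # renewable generation in this slot (kWh)
    base_demand: float            # non-adjustable demand in this slot (kWh)
    pending_flexible: Tuple[Tuple[float, int], ...]  # (load size, slots left in its window)
    price: float                  # market price for traded energy

@dataclass
class MicrogridAction:
    energy_traded: float            # >0: sell to other microgrids/main grid, <0: buy
    schedule_now: Tuple[bool, ...]  # which pending flexible loads to serve in this slot

def reward(state: MicrogridState, action: MicrogridAction,
           unmet_penalty: float = 10.0) -> float:
    """Sketch of a per-slot reward: trading profit minus a penalty for unmet demand.
    The actual reward in the thesis may weight these terms differently."""
    scheduled = sum(load for (load, _), on
                    in zip(state.pending_flexible, action.schedule_now) if on)
    supplied = state.renewable_gen + state.battery_level - action.energy_traded
    shortfall = max(0.0, state.base_demand + scheduled - supplied)
    return state.price * action.energy_traded - unmet_penalty * shortfall
```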