    Multi-timescale and Multi-agent Reinforcement Learning Algorithms

    Thesis full text (3.735Mb)
    Author
    Mandal, Lakshmi
    Abstract
    This thesis presents six novel works spanning several research domains: reinforcement learning (RL), both with and without function approximators (including deep neural networks); multi-agent RL; stochastic optimization; and natural language processing (NLP). All the works presented in the thesis fall under either multi-timescale or multi-agent RL algorithms. RL deals with the problem of dynamic decision-making under uncertainty and often requires multi-layer decision-making; such problems can be handled by incorporating as many timescales as there are layers of decision-making. Further, many applications involve multiple agents rather than a single agent. These agents can work together to accomplish a common goal, and single-agent learning algorithms are hard to extend directly to multi-agent settings because of scalability and coordination issues. We therefore consider both multi-timescale and multi-agent settings. We provide theoretical convergence guarantees for all proposed algorithms using asymptotic analyses, finite-time analyses, or both. Moreover, our experimental results on standard RL tasks demonstrate that our algorithms outperform the corresponding state-of-the-art (SOTA) algorithms.

    First, we consider widely used value-based RL algorithms, namely n-step Temporal Difference (TD) methods, as they combine the advantages of both Monte Carlo (MC) and TD methods and require n state transitions before beginning to perform updates. However, finding the optimal n for various problem settings is challenging. We therefore propose a purely data-driven, two-step (i.e., two-timescale) systematic procedure based on stochastic gradient search to find the optimal value of n, and demonstrate 60%-80% performance improvements in terms of root-mean-squared error (RMSE) compared to SOTA algorithms.
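    For readers unfamiliar with the n-step TD family referred to above, the sketch below shows plain tabular n-step TD prediction with a fixed n, applied offline after one episode. It is meant only to make the role of the hyper-parameter n concrete; it does not reproduce the thesis's two-timescale, data-driven procedure for selecting n, and the episode representation (integer state ids, per-step reward list) is an assumption made for illustration.

        import numpy as np

        def n_step_td_update(V, states, rewards, n, alpha=0.1, gamma=0.99):
            # Tabular n-step TD prediction, applied after one episode for simplicity.
            # states  : [s_0, ..., s_T]  visited integer state ids
            # rewards : [r_1, ..., r_T]  rewards[t] is received on leaving states[t]
            # V       : numpy array of state-value estimates, updated in place
            T = len(rewards)
            for t in range(T):
                horizon = min(t + n, T)
                # n-step return: discounted rewards plus a bootstrapped tail value
                G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
                if horizon < T:  # bootstrap unless the episode has ended
                    G += gamma ** (horizon - t) * V[states[horizon]]
                V[states[t]] += alpha * (G - V[states[t]])
            return V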
    Next, we consider a popular variant of Q-learning (QL), namely Successive Over-Relaxation QL (SOR-QL). The existing SOR-QL has several limitations, such as requiring knowledge of the transition probabilities and a positive self-transition probability for every state. To overcome these limitations, we propose a two-timescale stochastic approximation scheme that dispenses with both assumptions. Our algorithm outperforms the SOTA algorithms and is nearly 2 to 6 times (resp. 3 to 10 times) better than the next best-performing algorithm in terms of the average error (resp. standard deviation) metric.

    Then, we consider the widely popular policy gradient algorithms, since value-based RL works well in tabular settings but can diverge under function approximation. Regular MC policy gradient algorithms, however, require aggregating data until a termination state is reached. In real-world applications involving large state and action spaces, goal states may be hit only rarely, resulting in long episodes of unpredictable length. We therefore present an RL algorithm and a safe-RL algorithm, called the actor-only algorithm (AOA) and the safe actor-only algorithm (SAOA), both of which aggregate data over a fixed (deterministic) number of epochs. We show that both AOA and SAOA achieve a sample complexity of O(ε⁻²) and outperform SOTA algorithms by achieving the highest mean and the lowest standard deviation of total rewards.

    Next, we turn our attention to Actor-Critic (AC) algorithms. AC algorithms combined with deep learning architectures have achieved tremendous success; however, the policies obtained by many deep AC algorithms suffer from high variance, making them less useful in safety-critical applications. We therefore propose a three-timescale, data-driven, L-step deep AC algorithm that incorporates smoothed functional (SF) estimates. Experimental results demonstrate that the proposed L-step deep AC algorithm outperforms multiple SOTA algorithms, achieving up to 84.74% and 95.88% improvement in terms of average reward and variance reduction, respectively.

    Next, we consider a cooperative multi-agent Markov decision process (MDP) involving m (> 1) agents, as this is the underlying setting for cooperative multi-agent RL. Recent works consider decentralized policy improvement within policy iteration, but with exact value functions, which is computationally expensive for a large number of agents with high-dimensional state-action spaces. We therefore propose approximate decentralized policy iteration algorithms (for both finite- and infinite-horizon discounted MDPs), using approximate linear programming with function approximation to compute the approximate value function for decentralized policy improvement. Experimental results show that our proposed algorithms are 9 and 19 times faster than the corresponding SOTA algorithms in terms of computational time.

    Finally, most practical applications of multi-agent RL involve solving multiple tasks, each dealing with varied information types such as visual and text-based inputs. We propose a multimodal multi-agent deep actor-critic (MOMA-DAC) algorithm that successfully handles such complex settings. Experimental results show 14%-49% performance improvement with MOMA-DAC in terms of average cost across different RL tasks.
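    The smoothed functional (SF) estimates incorporated in the L-step deep AC algorithm above belong to a known family of zeroth-order (gradient-free) methods that estimate the gradient of a black-box objective by averaging Gaussian perturbations. The generic one-sided SF estimator below is only an illustration of that idea; the objective J, the perturbation size beta, and the sample count are assumptions for illustration, and the sketch is not the thesis's three-timescale algorithm.

        import numpy as np

        def sf_gradient_estimate(J, theta, beta=0.05, num_samples=32, rng=None):
            # One-sided smoothed functional gradient estimate:
            # grad J(theta) ≈ E[(J(theta + beta * eta) - J(theta)) * eta / beta],
            # with eta ~ N(0, I).  J maps a parameter vector to a scalar
            # (e.g. an estimate of the average reward of the induced policy).
            rng = rng or np.random.default_rng()
            base = J(theta)
            grad = np.zeros_like(theta)
            for _ in range(num_samples):
                eta = rng.standard_normal(theta.shape)
                grad += (J(theta + beta * eta) - base) / beta * eta
            return grad / num_samples

        # Illustrative gradient-ascent step on policy parameters (J assumed given):
        # theta = theta + step_size * sf_gradient_estimate(J, theta)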
    URI
    https://etd.iisc.ac.in/handle/2005/7055
    Collections
    • Computer Science and Automation (CSA) [397]

    Related items

    Showing items related by title, author, creator and subject.

    • Image Reconstruction Based On Hilbert And Hybrid Filtered Algorithms With Inverse Distance Weight And No Backprojection Weight 

      Narasimhadhan, A V (2014-07-16)
      Filtered backprojection (FBP) reconstruction algorithms are very popular in the field of X-ray computed tomography (CT) because they give advantages in terms of the numerical accuracy and computational complexity. Ramp ...
    • Online Learning and Simulation Based Algorithms for Stochastic Optimization 

      Lakshmanan, K (2018-03-07)
      In many optimization problems, the relationship between the objective and parameters is not known. The objective function itself may be stochastic such as a long-run average over some random cost samples. In such cases ...
    • Performance Evaluation Of Fan-beam And Cone-beam Reconstruction Algorithms With No Backprojection Weight On Truncated Data Problems 

      Sumith, K (2014-07-16)
      This work focuses on using the linear prediction based projection completion for the fan-beam and cone-beam reconstruction algorithm with no backprojection weight. The truncated data problems are addressed in the computed ...

    etd@IISc is a joint service of SERC & J R D Tata Memorial (JRDTML) Library || Powered by DSpace software || DuraSpace