    Multi-timescale and Multi-agent Reinforcement Learning Algorithms

    Thesis full text (3.735Mb)
    Author
    Mandal, Lakshmi
    Abstract
    This thesis presents six novel works spanning several research domains: reinforcement learning (RL), both with and without function approximators (including deep neural networks); multi-agent RL; stochastic optimization; and natural language processing (NLP). All the works presented in the thesis fall under either multi-timescale or multi-agent RL algorithms. RL deals with the problem of dynamic decision-making under uncertainty and often requires multi-layer decision-making; such problems can be handled by incorporating as many timescales as there are layers of decision-making. Further, many applications involve multiple agents rather than a single agent. These agents can work together to accomplish a common goal, and single-agent learning algorithms are hard to extend directly to multi-agent settings because of scalability and coordination issues. We therefore consider both multi-timescale and multi-agent settings. We provide theoretical convergence guarantees for all proposed algorithms using asymptotic analyses, finite-time analyses, or both. Moreover, our experimental results on standard RL tasks demonstrate that our algorithms outperform the corresponding state-of-the-art (SOTA) algorithms.

    First, we consider widely used value-based RL algorithms, namely n-step Temporal Difference (TD) methods, as they combine the advantages of both Monte Carlo (MC) and TD methods and require n state transitions before beginning to perform updates. However, finding the optimal n for various problem settings is challenging. We therefore propose a purely data-driven, two-step (i.e., two-timescale) systematic procedure based on stochastic gradient search to find the optimal value of n, and demonstrate 60%-80% performance improvements in terms of root-mean-squared error (RMSE) compared to SOTA algorithms.
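    For readers unfamiliar with the n-step TD family referred to above, the sketch below shows plain tabular n-step TD prediction with a fixed n, applied offline after one episode. It is meant only to make the role of the hyper-parameter n concrete; it does not reproduce the thesis's two-timescale, data-driven procedure for selecting n, and the episode representation (integer state ids, per-step reward list) is an assumption made for illustration.

        import numpy as np

        def n_step_td_update(V, states, rewards, n, alpha=0.1, gamma=0.99):
            # Tabular n-step TD prediction, applied after one episode for simplicity.
            # states  : [s_0, ..., s_T]  visited integer state ids
            # rewards : [r_1, ..., r_T]  rewards[t] is received on leaving states[t]
            # V       : numpy array of state-value estimates, updated in place
            T = len(rewards)
            for t in range(T):
                horizon = min(t + n, T)
                # n-step return: discounted rewards plus a bootstrapped tail value
                G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
                if horizon < T:  # bootstrap unless the episode has ended
                    G += gamma ** (horizon - t) * V[states[horizon]]
                V[states[t]] += alpha * (G - V[states[t]])
            return V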
    Next, we consider a popular variant of Q-learning (QL), namely Successive Over-Relaxation QL (SOR-QL). The existing SOR-QL has several limitations, such as requiring knowledge of the transition probabilities and a positive self-transition probability for every state. To overcome these limitations, we propose a two-timescale stochastic approximation scheme that dispenses with both assumptions. Our algorithm outperforms the SOTA algorithms and is nearly 2 to 6 times (resp. 3 to 10 times) better than the next best-performing algorithm in terms of the average error (resp. standard deviation) metric.

    Then, we consider the widely popular policy gradient algorithms, since value-based RL works well in tabular settings but can diverge under function approximation. Regular MC policy gradient algorithms, however, require aggregating data until a termination state is reached. In real-world applications involving large state and action spaces, goal states may be hit only rarely, resulting in long episodes of unpredictable length. We therefore present an RL algorithm and a safe-RL algorithm, called the actor-only algorithm (AOA) and the safe actor-only algorithm (SAOA), both of which aggregate data over a fixed (deterministic) number of epochs. We show that both AOA and SAOA achieve a sample complexity of O(ε⁻²) and outperform SOTA algorithms by achieving the highest mean and the lowest standard deviation of total rewards.

    Next, we turn our attention to Actor-Critic (AC) algorithms. AC algorithms combined with deep learning architectures have achieved tremendous success; however, the policies obtained by many deep AC algorithms suffer from high variance, making them less useful in safety-critical applications. We therefore propose a three-timescale, data-driven, L-step deep AC algorithm that incorporates smoothed functional (SF) estimates. Experimental results demonstrate that the proposed L-step deep AC algorithm outperforms multiple SOTA algorithms, achieving up to 84.74% and 95.88% improvement in terms of average reward and variance reduction, respectively.

    Next, we consider a cooperative multi-agent Markov decision process (MDP) involving m (> 1) agents, as this is the underlying setting for cooperative multi-agent RL. Recent works consider decentralized policy improvement within policy iteration, but with exact value functions, which is computationally expensive for a large number of agents with high-dimensional state-action spaces. We therefore propose approximate decentralized policy iteration algorithms (for both finite- and infinite-horizon discounted MDPs), using approximate linear programming with function approximation to compute the approximate value function for decentralized policy improvement. Experimental results show that our proposed algorithms are 9 and 19 times faster than the corresponding SOTA algorithms in terms of computational time.

    Finally, most practical applications of multi-agent RL involve solving multiple tasks, each dealing with varied information types such as visual and text-based inputs. We propose a multimodal multi-agent deep actor-critic (MOMA-DAC) algorithm that successfully handles such complex settings. Experimental results show 14%-49% performance improvement with MOMA-DAC in terms of average cost across different RL tasks.
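    The smoothed functional (SF) estimates incorporated in the L-step deep AC algorithm above belong to a known family of zeroth-order (gradient-free) methods that estimate the gradient of a black-box objective by averaging Gaussian perturbations. The generic one-sided SF estimator below is only an illustration of that idea; the objective J, the perturbation size beta, and the sample count are assumptions for illustration, and the sketch is not the thesis's three-timescale algorithm.

        import numpy as np

        def sf_gradient_estimate(J, theta, beta=0.05, num_samples=32, rng=None):
            # One-sided smoothed functional gradient estimate:
            # grad J(theta) ≈ E[(J(theta + beta * eta) - J(theta)) * eta / beta],
            # with eta ~ N(0, I).  J maps a parameter vector to a scalar
            # (e.g. an estimate of the average reward of the induced policy).
            rng = rng or np.random.default_rng()
            base = J(theta)
            grad = np.zeros_like(theta)
            for _ in range(num_samples):
                eta = rng.standard_normal(theta.shape)
                grad += (J(theta + beta * eta) - base) / beta * eta
            return grad / num_samples

        # Illustrative gradient-ascent step on policy parameters (J assumed given):
        # theta = theta + step_size * sf_gradient_estimate(J, theta)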
    URI
    https://etd.iisc.ac.in/handle/2005/7055
    Collections
    • Computer Science and Automation (CSA) [397]

    Related items

    Showing items related by title, author, creator and subject.

    • Image Reconstruction Based On Hilbert And Hybrid Filtered Algorithms With Inverse Distance Weight And No Backprojection Weight 

      Narasimhadhan, A V (2014-07-16)
      Filtered backprojection (FBP) reconstruction algorithms are very popular in the field of X-ray computed tomography (CT) because they give advantages in terms of the numerical accuracy and computational complexity. Ramp ...
    • Online Learning and Simulation Based Algorithms for Stochastic Optimization 

      Lakshmanan, K (2018-03-07)
      In many optimization problems, the relationship between the objective and parameters is not known. The objective function itself may be stochastic such as a long-run average over some random cost samples. In such cases ...
    • Performance Evaluation Of Fan-beam And Cone-beam Reconstruction Algorithms With No Backprojection Weight On Truncated Data Problems 

      Sumith, K (2014-07-16)
      This work focuses on using the linear prediction based projection completion for the fan-beam and cone-beam reconstruction algorithm with no backprojection weight. The truncated data problems are addressed in the computed ...

    etd@IISc is a joint service of SERC & J R D Tata Memorial (JRDTML) Library || Powered by DSpace software || DuraSpace