dc.contributor.advisor   Bhatnagar, Shalabh
dc.contributor.author   Jayant, Ashish
dc.date.accessioned   2022-09-13T04:40:10Z
dc.date.available   2022-09-13T04:40:10Z
dc.date.submitted   2022
dc.identifier.uri   https://etd.iisc.ac.in/handle/2005/5849
dc.description.abstract   During the initial iterations of training, most Reinforcement Learning (RL) algorithms have the agent take a significant number of random exploratory steps. This limits their practicality in the real world, since such exploration can lead to potentially dangerous behavior; safe exploration is therefore a critical issue in applying RL algorithms to real-world problems. This problem is well studied in the literature under the Constrained Markov Decision Process (CMDP) framework, in which state transitions incur single-stage costs in addition to single-stage rewards. The prescribed cost functions map undesirable behavior at any given time step to a scalar value. The aim is then to find a feasible policy that maximizes reward returns while keeping cost returns below a prescribed threshold during both training and deployment. We propose a novel on-policy model-based Safe Deep RL algorithm in which we learn the transition dynamics of the environment online while finding a feasible optimal policy using Lagrangian Relaxation-based Proximal Policy Optimization. This combination of transition-dynamics learning and a safety-promoting RL algorithm requires 3 to 4 times fewer environment interactions and incurs fewer cumulative hazard violations than the model-free approach. We use an ensemble of neural networks with different initializations to tackle the epistemic and aleatoric uncertainty encountered while learning the environment model. We present our results on a challenging Safe Reinforcement Learning benchmark, the OpenAI Safety Gym. In addition, we perform an attribution analysis of the actions taken by the deep neural network-based policy at each time step. This analysis helps us to: 1. identify the feature in the state representation that is chiefly responsible for the current action; and 2. provide empirical evidence of the safety-aware agent's ability to deal with hazards in the environment, provided that hazard information is present in the state representation. To perform this analysis, we assume the state representation carries meaningful information about hazards and goals. We then compute an attribution vector of the same dimension as the state using a well-known attribution technique, Integrated Gradients; the resulting attribution vector gives the importance of each state feature for the current action.   en_US
dc.description.sponsorship   NA   en_US
dc.language.iso   en_US   en_US
dc.relation.ispartofseries   masters;0019.R1
dc.rights   I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.   en_US
dc.subject   Safe RL   en_US
dc.subject   Planning   en_US
dc.subject   Reinforcement Learning   en_US
dc.subject   Safety   en_US
dc.subject   Explainability   en_US
dc.subject.classification   Safe Reinforcement Learning   en_US
dc.subject.classification   Model-based RL   en_US
dc.subject.classification   Constrained Reinforcement Learning   en_US
dc.subject.classification   Explainability   en_US
dc.title   Model-based Safe Deep Reinforcement Learning and Empirical Analysis of Safety via Attribution   en_US
dc.type   Thesis   en_US
dc.degree.name   MTech (Res)   en_US
dc.degree.level   Masters   en_US
dc.degree.grantor   Indian Institute of Science   en_US
dc.degree.discipline   Engineering   en_US
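
The abstract above frames safety as a constrained optimization problem solved with Lagrangian Relaxation-based Proximal Policy Optimization. For reference, the sketch below shows the standard CMDP objective and its Lagrangian relaxation in generic textbook notation (discount factor gamma, cost limit d, multiplier lambda); it is not copied from the thesis.

```latex
% Generic CMDP objective and its Lagrangian relaxation (illustrative notation).
\begin{aligned}
\max_{\pi}\;\; & J_R(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{subject to} \quad
J_C(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\right] \;\le\; d \\[4pt]
\text{relaxation:}\;\; & \min_{\lambda \ge 0}\, \max_{\pi}\; \mathcal{L}(\pi, \lambda)
  \;=\; J_R(\pi) \;-\; \lambda\,\bigl(J_C(\pi) - d\bigr)
\end{aligned}
```

In PPO-Lagrangian style methods, the policy is updated on the clipped PPO surrogate of this Lagrangian, while lambda is adjusted by gradient ascent on the constraint violation J_C(pi) - d: it grows while the cost return exceeds the threshold and decays toward zero once the policy is feasible.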
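The attribution analysis described in the abstract uses Integrated Gradients to score each state feature's contribution to the chosen action. Below is a minimal sketch of that computation for a PyTorch policy network, assuming a zero baseline and a Riemann-sum approximation of the path integral; the model, dimensions, and function names are illustrative and not taken from the thesis code.

```python
# Sketch of Integrated Gradients for a policy network (illustrative, not the thesis code).
import torch

def integrated_gradients(policy, state, baseline=None, action_dim=0, steps=50):
    """Approximate IG attributions of one action component w.r.t. state features.

    attribution_i ~= (x_i - x'_i) * mean over alpha of dF/dx_i, with gradients
    taken along the straight line from the baseline x' to the input x.
    """
    if baseline is None:
        baseline = torch.zeros_like(state)      # common choice: all-zero baseline
    alphas = torch.linspace(0.0, 1.0, steps)
    grads = []
    for alpha in alphas:
        point = (baseline + alpha * (state - baseline)).clone().detach().requires_grad_(True)
        output = policy(point)[action_dim]      # scalar: one component of the action
        grad, = torch.autograd.grad(output, point)
        grads.append(grad)
    avg_grad = torch.stack(grads).mean(dim=0)
    return (state - baseline) * avg_grad        # same shape as the state vector

# Usage sketch: attributions for the first action dimension of a toy policy.
if __name__ == "__main__":
    torch.manual_seed(0)
    policy = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.Tanh(),
                                 torch.nn.Linear(64, 2))
    state = torch.randn(8)
    attr = integrated_gradients(policy, state)
    print(attr)  # importance of each of the 8 state features for action[0]
```

Because the returned vector has the same dimension as the state, attributions on hazard-related features can be compared directly with those on goal-related features, which is how the abstract's claim about safety-aware behavior is examined empirically.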

