Model-based Safe Deep Reinforcement Learning and Empirical Analysis of Safety via Attribution
During the initial iterations of training, agents in most Reinforcement Learning (RL) algorithms perform a significant number of random exploratory steps. In the real world this limits the practicality of these algorithms, since such steps can lead to potentially dangerous behavior. Hence, safe exploration is a critical issue in applying RL algorithms in the real world. This problem is well studied in the literature under the Constrained Markov Decision Process (CMDP) framework, where, in addition to single-stage rewards, state transitions also incur single-stage costs. The prescribed cost functions map undesirable behavior at any given time step to a scalar value. The aim is then to find a feasible policy that maximizes reward returns while keeping cost returns below a prescribed threshold, during training as well as deployment. We propose a novel On-policy Model-based Safe Deep RL algorithm in which we learn the transition dynamics of the environment in an online manner and find a feasible optimal policy using Lagrangian Relaxation-based Proximal Policy Optimization. This combination of transition dynamics learning and a safety-promoting RL algorithm requires 3-4 times fewer environment interactions and incurs fewer cumulative hazard violations than the model-free approach. We use an ensemble of neural networks with different initializations to tackle the epistemic and aleatoric uncertainty issues faced during environment model learning. We present our results on a challenging Safe Reinforcement Learning benchmark, the OpenAI Safety Gym. In addition, we perform an attribution analysis of the actions taken by the Deep Neural Network-based policy at each time step. This analysis helps us to:
1. Identify the feature in the state representation that is most responsible for the current action.
2. Empirically provide evidence of the safety-aware agent's ability to deal with hazards in the environment, provided that hazard information is present in the state representation.
To perform this analysis, we assume that the state representation contains meaningful information about hazards and goals. We then calculate an attribution vector of the same dimension as the state using a well-known attribution technique, Integrated Gradients. The resulting attribution vector gives the importance of each state feature for the current action.
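To make the attribution step concrete, the following is a minimal sketch of Integrated Gradients applied to a policy's action output. It is not the authors' implementation: the toy two-layer policy (`make_toy_policy`), the scalar action, the all-zeros baseline state, and the number of integration steps are all assumptions chosen for illustration. The path integral of the gradient from the baseline to the state is approximated with a midpoint Riemann sum.

```python
import numpy as np


def make_toy_policy(state_dim, hidden_dim=16, seed=0):
    """Hypothetical stand-in for a trained policy: a tanh MLP with a scalar action."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(hidden_dim, state_dim))
    b1 = rng.normal(scale=0.1, size=hidden_dim)
    w2 = rng.normal(scale=0.5, size=hidden_dim)

    def policy(state):
        # Scalar action (e.g. the mean of a one-dimensional Gaussian policy).
        return float(w2 @ np.tanh(W1 @ state + b1))

    def grad(state):
        # Analytic gradient of the action with respect to the state.
        h = np.tanh(W1 @ state + b1)
        return W1.T @ (w2 * (1.0 - h ** 2))

    return policy, grad


def integrated_gradients(grad_fn, state, baseline=None, steps=200):
    """Return an attribution vector with the same dimension as `state`."""
    state = np.asarray(state, dtype=float)
    if baseline is None:
        # Assumption: an all-zeros state serves as the reference input.
        baseline = np.zeros_like(state)
    # Midpoint Riemann sum over the straight path baseline -> state.
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.zeros_like(state)
    for alpha in alphas:
        avg_grad += grad_fn(baseline + alpha * (state - baseline))
    avg_grad /= steps
    return (state - baseline) * avg_grad


policy, grad = make_toy_policy(state_dim=4)
state = np.array([0.8, -0.3, 0.5, 0.1])  # hypothetical state features
attributions = integrated_gradients(grad, state)

# The feature with the largest absolute attribution is the one most
# responsible for the current action under this approximation.
most_important = int(np.argmax(np.abs(attributions)))
print(attributions, most_important)
```

A useful sanity check is the completeness property of Integrated Gradients: the attributions should sum, up to integration error, to the difference between the action at the state and the action at the baseline. With a real deep policy the gradient would come from automatic differentiation rather than the hand-written `grad` above.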