Algorithms for Challenges to Practical Reinforcement Learning

Sindhu, P R

Abstract

Reinforcement learning (RL) in real-world applications faces major hurdles, foremost among them the safety of the physical system controlled by the learning agent and the varying environmental conditions in which the autonomous agent operates. An RL agent learns to control a system by exploring the available actions. In some operating states, an exploratory action may drive the system into unsafe operation, creating hazards both for the system and for the humans supervising it. RL algorithms therefore need to respect safety constraints, and must do so with limited available information. Additionally, RL agents learn optimal decisions under the assumption that the environment is stationary. This assumption is very restrictive. In many real-world problems, such as traffic signal control and robotic applications, the environment is non-stationary, and in these scenarios RL algorithms yield sub-optimal decisions.

The first part of this thesis develops algorithmic solutions to the challenges of safety and non-stationary environmental conditions. To handle safety restrictions and facilitate safe exploration during learning, the thesis proposes a sample-efficient learning algorithm based on the cross-entropy method. The algorithm is built on a constrained optimization framework and uses very limited information to learn feasible policies. During the learning iterations, exploration is guided in a manner that minimizes safety violations. The first part also describes an algorithm for the second challenge. Its goal is to maximize the long-term discounted reward accrued when the latent model of the environment changes with time. To achieve this, the algorithm uses a change point detection algorithm to detect changes in the statistics of the environment. The results of this statistical algorithm are used to reset the learning of policies.

The second part of this thesis describes the application of RL in networked intelligent systems. We consider two such systems: aerial quadrotor navigation and an industrial internet-of-things (IIoT) system. In the quadrotor navigation problem, our proposed method makes better use of machine learning computational frameworks and improves upon previously proposed obstacle avoidance algorithms for aerial vehicles. Obstacle avoidance in quadrotor navigation brings additional challenges compared to ground vehicles, because an aerial vehicle has to navigate around more types of obstacles. For example, decorative items, furnishings, ceiling fans, sign-boards and tree branches are all potential obstacles for a quadrotor. Obstacle avoidance methods developed for ground robots are therefore inadequate for UAV navigation. Our algorithm improves the efficiency of learning by inferring navigation decisions from temporal information about the ambient surroundings. This information is represented using monocular camera images collected by the quadrotor.

An IIoT system has multiple IoT devices and a user equipment (UE), together with a base station (BS) that receives the UE and IoT data. To avoid numerous IoT-to-BS connections and to conserve the energy of the IoT devices, the UE serves as a relay that forwards the IoT data to the BS. In this thesis, we consider a specific multi-objective optimization problem that arises in this simple IIoT setup. The UE employs frame-based uplink transmissions, in which it shares a few slots of every frame to relay the IoT data. The IIoT system experiences a transmission failure, called an outage, when IoT data is not transmitted. Unsent UE data is stored in the UE's buffer and is discarded once its storage time exceeds the age threshold. Because the UE and the IoT devices share the transmission slots, there is a trade-off between system outages and the loss of aged UE data. To resolve this trade-off, we adapt the Q-learning algorithm to slot-sharing between UE and IoT data and present numerical results.
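
For illustration only, the slot-sharing idea can be sketched as tabular Q-learning on a toy frame model. Nothing specific below is taken from the thesis: the frame length, the arrival ranges, the reward weights, the function names `step` and `train`, and the use of buffer overflow as a stand-in for the age threshold are all assumptions made to keep the sketch short. The state is the UE buffer length, and the action is the number of slots in the next frame handed to IoT relaying.

```python
import random

# All constants are hypothetical; the abstract does not specify them.
FRAME_SLOTS = 4                    # slots per uplink frame
MAX_BUFFER = 6                     # buffer cap, a crude proxy for the age threshold
ACTIONS = range(FRAME_SLOTS + 1)   # action = slots given to IoT relaying


def step(buffer_len, iot_slots, rng):
    """Simulate one frame of the toy model; return (next_buffer_len, reward)."""
    ue_arrivals = rng.randint(0, 3)   # new UE packets this frame
    iot_demand = rng.randint(0, 2)    # IoT packets waiting to be relayed
    outage = iot_demand > iot_slots   # some IoT data was not transmitted
    backlog = buffer_len + ue_arrivals
    sent = min(backlog, FRAME_SLOTS - iot_slots)
    remaining = backlog - sent
    dropped = max(0, remaining - MAX_BUFFER)  # overflow stands in for aged-out data
    # Equal unit penalties for an outage and for each dropped UE packet (assumed).
    reward = -(1.0 if outage else 0.0) - 1.0 * dropped
    return remaining - dropped, reward


def train(episodes=2000, frames=50, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Standard epsilon-greedy tabular Q-learning over buffer-length states."""
    rng = random.Random(seed)
    q = [[0.0] * (FRAME_SLOTS + 1) for _ in range(MAX_BUFFER + 1)]
    for _ in range(episodes):
        s = 0
        for _ in range(frames):
            if rng.random() < epsilon:
                a = rng.randrange(FRAME_SLOTS + 1)        # explore
            else:
                a = max(ACTIONS, key=lambda x: q[s][x])   # exploit
            s2, r = step(s, a, rng)
            # Q-learning update toward the one-step bootstrapped target.
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q


if __name__ == "__main__":
    q_table = train()
    for s, row in enumerate(q_table):
        best = max(ACTIONS, key=lambda a: row[a])
        print(f"buffer={s}: give {best} slot(s) to IoT")
```

The greedy action per buffer length gives a simple slot-sharing rule. Changing the relative size of the two penalty terms in `step` shifts that rule toward fewer outages or toward less UE data loss, which is the trade-off described above.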

URI

https://etd.iisc.ac.in/handle/2005/4983

Collections

Computer Science and Automation (CSA) [547]