Imitation Learning Techniques for Robot Manipulation
Abstract
Robots that can operate in unstructured environments and collaborate with humans will play a major role in raising productivity and living standards as societies age. Unlike the robots currently used in industrial settings for repetitive tasks, they will have to be capable of perceiving the novel environments they encounter, dealing with the ambiguities of natural and intuitive communication with non-expert human operators, and manipulating the objects in the environment in complex ways. This problem can be broadly divided into two parts: how to specify the task to the robot, and how to execute the specified task.
In the first part of this thesis, a Siamese neural network with a modified spatial attention layer is proposed for specifying, through visual cues, novel objects that the robot has not seen during training. Although Siamese networks have been used for detecting novel objects, the prevalent architectures require a cropped image of the object and cannot support natural and intuitive visual cues for indicating the object of interest in the scene. The proposed network enables non-expert human operators to specify new objects with a laser pointer, by pointing with a finger, or through a video demonstration of the task. This is a weakly supervised learning problem in which the proposed architecture learns the visual cue implicitly as part of the training process, without additional labels for the cue.
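To make the idea concrete, the following is a minimal sketch (in PyTorch, not the thesis implementation) of a Siamese comparison between a cue-bearing query image and the scene, with an attention-pooled query branch; the backbone, feature sizes, and the particular attention mechanism are illustrative assumptions.

```python
# Minimal sketch: compare a query image containing a visual cue (e.g. a laser
# dot) against a scene image using a shared backbone. Layer sizes and the
# attention mechanism are illustrative assumptions, not the thesis architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Learns a per-location weighting so the cue region can dominate the embedding."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat):                                   # feat: (B, C, H, W)
        attn = torch.softmax(self.score(feat).flatten(2), dim=-1)
        attn = attn.view(feat.size(0), 1, *feat.shape[2:])
        return (feat * attn).sum(dim=(2, 3))                   # pooled embedding (B, C)

class SiameseMatcher(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.backbone = nn.Sequential(                         # shared by both branches
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.attend = SpatialAttention(channels)

    def forward(self, query_img, scene_img):
        q = self.attend(self.backbone(query_img))              # cue-weighted query embedding
        s = self.backbone(scene_img)                           # scene feature map (B, C, H, W)
        # correlate the query embedding with every scene location
        return torch.einsum('bc,bchw->bhw',
                            F.normalize(q, dim=1),
                            F.normalize(s, dim=1))             # similarity heat map
```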
In the second part of the thesis, instructions in natural language are interpreted in the context of the visual scene so that the robot can understand which object to manipulate. A U-Net architecture combined with an LSTM for language processing is proposed to handle spatial relationships expressed in the instruction in the context of the scene. Although the U-Net architecture has been successfully applied to several computer vision problems, we show that it is useful not only for object detection but also in the stages after object detection, for grounding the natural language instruction in the visual scene. We then go beyond merely specifying the object with natural language to specifying more complex tasks. Most current work on imitation learning for neural robot control drives the actuators directly from a policy network trained on expert demonstrations collected with an input device such as a game controller or a virtual reality rig. In industrial settings, however, expert robot programmers write short programs to control the robot rather than teaching it with an analog controller. We investigate whether such expert-authored programs better capture the intention of the expert and whether they can be generated by a neural network. We propose using neural machine translation to translate instructions in English into Python code, which in turn accesses the objects detected in the scene and controls the robot to accomplish the specified task. We evaluate how such a translation model compares with current imitation learning methods on a variety of tasks specified in natural language.
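As an illustration of how language and vision might be fused in such a grounding pipeline, the sketch below conditions a toy U-Net on an LSTM sentence embedding by tiling it over the bottleneck features; the layer sizes and the fusion-by-concatenation choice are assumptions rather than the exact architecture used in the thesis.

```python
# Illustrative sketch only: ground an instruction in the scene by injecting an
# LSTM sentence embedding at the bottleneck of a small U-Net and decoding a
# per-pixel mask over the referred object.
import torch
import torch.nn as nn

class LanguageGroundingUNet(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, lang_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, lang_dim, batch_first=True)
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.dec1 = nn.Sequential(
            nn.ConvTranspose2d(64 + lang_dim, 32, 4, stride=2, padding=1), nn.ReLU())
        self.out = nn.Conv2d(32 + 32, 1, 1)              # skip connection from enc1

    def forward(self, image, tokens):
        _, (h, _) = self.lstm(self.embed(tokens))        # h: (1, B, lang_dim)
        lang = h[-1]                                     # sentence embedding (B, lang_dim)
        f1 = self.enc1(image)                            # (B, 32, H, W)
        f2 = self.enc2(f1)                               # (B, 64, H/2, W/2)
        lang_map = lang[:, :, None, None].expand(-1, -1, *f2.shape[2:])
        fused = torch.cat([f2, lang_map], dim=1)         # tile language over the grid
        up = self.dec1(fused)                            # back to (B, 32, H, W)
        return self.out(torch.cat([up, f1], dim=1))      # per-pixel grounding logits
```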
The third part of the thesis addresses how to perform complex manipulation tasks. Imitation learning has emerged in recent years as a potent method for training neural networks to control robot actuators; however, many existing methods ignore stochasticity in the training data. We discuss several ways in which stochasticity in teleoperated expert demonstrations can be accounted for when training policy networks and evaluate how they perform on several tasks. Most current visuomotor policy networks for imitation learning use convolutional layers to process the camera input. By construction, convolution is translation invariant, which poses a problem for the subsequent control layers when there are multiple instances of an object in the scene. We propose a modified spatial-softmax layer and use it in a policy network to learn, from teleoperated demonstrations, how to manipulate objects in the presence of multiple instances of the object of interest. We show that the proposed modification is essential to prevent the network from becoming confused when multiple instances of an object are present. Subsequently, we consider the high-precision task of inserting a peg into a hole with a clearance of less than 10 µm. The current literature on imitation learning has largely focused on visuomotor manipulation tasks that require far less precision; for high-precision tasks, it becomes necessary to rely on force sensors rather than visual feedback. We propose using generative adversarial reinforcement learning to learn this task from only a handful of teleoperated expert demonstrations.
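For reference, the sketch below shows a standard spatial-softmax layer of the kind used in many visuomotor policy networks: it returns the softmax-weighted mean image coordinate per feature channel, so two instances of the same object collapse into a single keypoint between them, which is the failure mode motivating the proposed modification (the modification itself is not reproduced here).

```python
# Standard spatial softmax: each channel's activation map is converted to a
# single expected (x, y) keypoint. Multimodal activations caused by several
# instances of the same object are averaged into one point between them.
import torch
import torch.nn as nn

class SpatialSoftmax(nn.Module):
    def __init__(self, height, width):
        super().__init__()
        ys, xs = torch.meshgrid(
            torch.linspace(-1.0, 1.0, height),
            torch.linspace(-1.0, 1.0, width),
            indexing='ij')
        self.register_buffer('xs', xs.reshape(-1))
        self.register_buffer('ys', ys.reshape(-1))

    def forward(self, feat):                             # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        probs = torch.softmax(feat.view(b, c, -1), dim=-1)
        ex = (probs * self.xs).sum(-1)                   # expected x per channel
        ey = (probs * self.ys).sum(-1)                   # expected y per channel
        return torch.stack([ex, ey], dim=-1)             # (B, C, 2) keypoints
```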
This thesis contributes to the growing body of knowledge on using neural networks for robot control, with contributions to both task specification and task execution. It proposes neural network architectures for specifying tasks intuitively through visual cues and natural language. It presents carefully designed experiments that reveal shortcomings in prevalent visuomotor policy networks for executing tasks and proposes architectural changes to address them. It also explores high-precision scenarios where force sensing is more appropriate than vision.