    •   etd@IISc
    • Division of Mechanical Sciences
    • Aerospace Engineering (AE)
    • View Item

    Learning Action Priors for Deep Visual Predictions

    Thesis full text (69.65Mb)
    Author
    Sarkar, Meenakshi
    Abstract
    This thesis addresses the critical challenge of visual prediction in mobile robotics, particularly focusing on scenarios where cameras mounted on autonomous robots must navigate dynamic environments with human presence. While recent advances in artificial intelligence and machine learning have revolutionized natural language processing and generative AI, similar breakthroughs in video prediction for mobile platforms remain elusive due to the inherent complexity of disentangling robot motion from environmental dynamics. We identify a fundamental gap in existing approaches: the failure to explicitly incorporate robot control actions into visual prediction frameworks. In our initial work, we proposed the Velocity Acceleration Network (VANet), designed to extract and disentangle robot motion dynamics from visual data using motion flow encoders. While VANet represented progress in understanding the interplay between robot movement and visual information, it still relied on inferring motion effects from raw data rather than directly incorporating control signals. To address this limitation, we introduce the Robot Autonomous Motion Dataset (RoAM), a novel open-source stereo-image dataset captured using a Turtlebot3 Burger robot equipped with a Zed mini stereo camera. Unlike existing datasets, RoAM provides synchronized control action data alongside visual information, 2D LiDAR scans, IMU readings, and odometry data, creating a comprehensive multimodal resource for action-conditioned prediction tasks. Building on this dataset, we present two complementary approaches to action-conditioned video prediction. First, we develop deterministic frameworks, ACPNet and ACVG (Action Conditioned Video Generation), that explicitly condition predicted image frames on robot control actions, resulting in more physically consistent video predictions.
    We rigorously benchmark these architectures against state-of-the-art models, demonstrating superior performance when leveraging control action data. We then advance two theoretical frameworks for learning stochastic priors that simultaneously predict future images and actions. (i) Conditional Independence: Under this assumption, we model image-action pairs as extended system states generated from a shared latent stochastic process. We implement this approach through two models: VG-LeAP, a variational generative framework, and RAFI, built on sparsely conditioned flow matching, demonstrating the versatility of this principle across different architectural paradigms. (ii) Causal Dependence: This framework models images and actions as causally interlinked nodes, reflecting real-world scenarios where robots take actions based on current observations and then observe subsequent states as consequences. We implement this approach through Causal-LeAP, a variational generative framework that learns separate but conditionally dependent latent priors for images and actions. Our comprehensive evaluation demonstrates that explicitly incorporating control actions significantly improves prediction accuracy while maintaining computational efficiency suitable for deployment on resource-constrained robotic platforms. This research bridges critical gaps between visual perception, motion planning, and control theory, establishing new foundations for intelligent autonomous systems that can effectively navigate and interact in dynamic human environments.
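    The core idea of action-conditioned prediction described above can be illustrated with a toy sketch: encode the current frame, concatenate the robot's control action (e.g., linear and angular velocity), and decode a prediction of the next frame. This is only a minimal illustration with randomly initialised weights and hypothetical dimensions; it is not the thesis's actual ACPNet/ACVG architecture.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    H, W = 8, 8          # toy image resolution (hypothetical)
    D_z, D_a = 16, 2     # latent size; action = (linear, angular velocity)

    # Random matrices stand in for learned encoder/decoder networks.
    W_enc = rng.normal(size=(H * W, D_z)) * 0.1
    W_dec = rng.normal(size=(D_z + D_a, H * W)) * 0.1

    def predict_next_frame(frame, action):
        """Encode the frame, concatenate the control action, decode a prediction."""
        z = np.tanh(frame.reshape(-1) @ W_enc)     # visual latent
        z_a = np.concatenate([z, action])          # action conditioning
        return np.tanh(z_a @ W_dec).reshape(H, W)  # predicted next frame

    frame_t = rng.random((H, W))
    action_t = np.array([0.22, 0.0])               # forward velocity, no turn
    frame_t1 = predict_next_frame(frame_t, action_t)
    print(frame_t1.shape)  # (8, 8)
    ```

    The point of the sketch is the conditioning step: the action vector enters the decoder directly, so the prediction depends on the commanded motion rather than having to infer it from pixels alone.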
    URI
    https://etd.iisc.ac.in/handle/2005/7112
    Collections
    • Aerospace Engineering (AE) [435]

    etd@IISc is a joint service of SERC & J R D Tata Memorial (JRDTML) Library || Powered by DSpace software || DuraSpace
    Contact Us | Send Feedback | Thesis Templates
    Theme by Atmire NV