dc.description.abstract | This thesis addresses the critical challenge of visual prediction in
mobile robotics, focusing on scenarios in which camera-equipped
autonomous robots must navigate dynamic environments shared
with humans. While recent advances in artificial intelligence
and machine learning have revolutionized natural language processing
and generative AI, similar breakthroughs in video prediction for
mobile platforms remain elusive due to the inherent complexity of
disentangling robot motion from environmental dynamics. We identify
a fundamental gap in existing approaches: the failure to explicitly
incorporate robot control actions into visual prediction frameworks.
In our initial work, we proposed the Velocity Acceleration
Network (VANet), which uses motion flow encoders to extract and
disentangle robot motion dynamics from visual data. While VANet
represented progress in understanding the interplay between robot
movement and visual information, it still relied on inferring motion
effects from raw data rather than directly incorporating control signals.
To address this limitation, we introduce the Robot Autonomous
Motion Dataset (RoAM), a novel open-source stereo-image dataset
captured using a Turtlebot3 Burger robot equipped with a Zed mini
stereo camera. Unlike existing datasets, RoAM provides synchronized
control action data alongside visual information, 2D LiDAR
scans, IMU readings, and odometry data, creating a comprehensive
multimodal resource for action-conditioned prediction tasks. Building
on this dataset, we present two complementary approaches to
action-conditioned video prediction. First, we develop deterministic
frameworks, ACPNet and ACVG (Action Conditioned Video Generation),
that explicitly condition predicted image frames on robot control
actions, resulting in more physically consistent video predictions.
We rigorously benchmark these architectures against state-of-the-art
models, demonstrating superior performance when leveraging
control action data.
We then advance two theoretical frameworks for learning stochastic
priors that simultaneously predict future images and actions. (i)
Conditional Independence: Under this assumption, we model image-action
pairs as extended system states generated from a shared latent
stochastic process. We implement this approach through two models:
VG-LeAP, a variational generative framework, and RAFI, built
on sparsely conditioned flow matching, demonstrating the versatility
of this principle across different architectural paradigms. (ii) Causal
Dependence: This framework models images and actions as causally
interlinked nodes, reflecting real-world scenarios where robots take
actions based on current observations and then observe subsequent
states as consequences. We implement this approach through
Causal-LeAP, a variational generative framework that learns separate but
conditionally dependent latent priors for images and actions.
Our comprehensive evaluation demonstrates that explicitly incorporating
control actions significantly improves prediction accuracy while
maintaining computational efficiency suitable for deployment on
resource-constrained robotic platforms. This research bridges critical
gaps between visual perception, motion planning, and control theory,
establishing new foundations for intelligent autonomous systems that
can effectively navigate and interact within dynamic human environments. | en_US |