Show simple item record

dc.contributor.advisor  Ghose, Debasish
dc.contributor.author  Sarkar, Meenakshi
dc.date.accessioned  2025-10-03T10:59:47Z
dc.date.available  2025-10-03T10:59:47Z
dc.date.submitted  2025
dc.identifier.uri  https://etd.iisc.ac.in/handle/2005/7112
dc.description.abstract  This thesis addresses the critical challenge of visual prediction in mobile robotics, particularly focusing on scenarios where cameras mounted on autonomous robots must navigate dynamic environments with human presence. While recent advances in artificial intelligence and machine learning have revolutionized natural language processing and generative AI, similar breakthroughs in video prediction for mobile platforms remain elusive due to the inherent complexities of disentangling robot motion from environmental dynamics. We identify a fundamental gap in existing approaches: the failure to explicitly incorporate robot control actions into visual prediction frameworks.

In our initial work, we proposed the Velocity Acceleration Network (VANet), designed to extract and disentangle robot motion dynamics from visual data using motion flow encoders. While VANet represented progress in understanding the interplay between robot movement and visual information, it still relied on inferring motion effects from raw data rather than directly incorporating control signals. To address this limitation, we introduce the Robot Autonomous Motion Dataset (RoAM), a novel open-source stereo-image dataset captured using a Turtlebot3 Burger robot equipped with a Zed mini stereo camera. Unlike existing datasets, RoAM provides synchronized control action data alongside visual information, 2D LiDAR scans, IMU readings, and odometry data, creating a comprehensive multimodal resource for action-conditioned prediction tasks.

Building on this dataset, we present two complementary approaches to action-conditioned video prediction. First, we develop deterministic frameworks, ACPNet and ACVG (Action Conditioned Video Generation), that explicitly condition predicted image frames on robot control actions, resulting in more physically consistent video predictions. We rigorously benchmark these architectures against state-of-the-art models, demonstrating superior performance when leveraging control action data. We then advance two theoretical frameworks for learning stochastic priors that simultaneously predict future images and actions (a notational sketch of the two factorizations follows this record). (i) Conditional Independence: under this assumption, we model image-action pairs as extended system states generated from a shared latent stochastic process. We implement this approach through two models: VG-LeAP, a variational generative framework, and RAFI, built on sparsely conditioned flow matching, demonstrating the versatility of this principle across different architectural paradigms. (ii) Causal Dependence: this framework models images and actions as causally interlinked nodes, reflecting real-world scenarios where robots take actions based on current observations and then observe subsequent states as consequences. We implement this approach through Causal-LeAP, a variational generative framework that learns separate but conditionally dependent latent priors for images and actions.

Our comprehensive evaluation demonstrates that explicitly incorporating control actions significantly improves prediction accuracy while maintaining computational efficiency suitable for deployment on resource-constrained robotic platforms. This research bridges critical gaps between visual perception, motion planning, and control theory, establishing new foundations for intelligent autonomous systems that can effectively navigate and interact in dynamic human environments.  en_US
dc.language.iso  en_US  en_US
dc.relation.ispartofseries  ;ET01093
dc.rights  I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.  en_US
dc.subject  Velocity Acceleration Network  en_US
dc.subject  visual prediction  en_US
dc.subject  Deep Learning  en_US
dc.subject  Artificial Intelligence  en_US
dc.subject  Diffusion Models  en_US
dc.subject  Deep Visual Forecasting  en_US
dc.subject  Optical Flow Maps  en_US
dc.subject  LiDAR  en_US
dc.subject  Robot Autonomous Motion Dataset  en_US
dc.subject  moving autonomous agent  en_US
dc.subject  mobile robotics  en_US
dc.subject  RoAM dataset  en_US
dc.subject.classification  Research Subject Categories::TECHNOLOGY::Engineering mechanics::Other engineering mechanics  en_US
dc.title  Learning Action Priors for Deep Visual Predictions  en_US
dc.type  Thesis  en_US
dc.degree.name  PhD  en_US
dc.degree.level  Doctoral  en_US
dc.degree.grantor  Indian Institute of Science  en_US
dc.degree.discipline  Engineering  en_US
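A minimal notational sketch of the two latent-prior assumptions named in the abstract. The symbols used here are illustrative assumptions, not quoted from the thesis: x_t denotes the image frame at time t, a_t the robot control action, and z_t a latent variable.

    % (i) Conditional Independence (VG-LeAP, RAFI): the image-action pair
    % is an extended state driven by one shared latent process, so given
    % z_t the frame and the action factorize independently.
    p(x_t, a_t \mid z_t) = p(x_t \mid z_t)\, p(a_t \mid z_t)

    % (ii) Causal Dependence (Causal-LeAP): the robot acts on the current
    % observation and then observes the consequence of that action,
    % giving separate but conditionally dependent priors.
    p(a_t \mid x_{\le t}) \quad \text{and} \quad p(x_{t+1} \mid x_{\le t}, a_{\le t})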

