dc.description.abstract | This thesis addresses the critical challenge of visual prediction in
mobile robotics, focusing on scenarios in which camera-equipped
autonomous robots must navigate dynamic environments shared
with humans. While recent advances in artificial intelligence
and machine learning have revolutionized natural language processing
and generative AI, similar breakthroughs in video prediction for
mobile platforms remain elusive due to the inherent complexity of
disentangling robot motion from environmental dynamics. We identify
a fundamental gap in existing approaches: the failure to explicitly
incorporate robot control actions into visual prediction frameworks.
In our initial work, we proposed the Velocity Acceleration
Network (VANet), which uses motion flow encoders to extract and
disentangle robot motion dynamics from visual data. While VANet
represented progress in understanding the interplay between robot
movement and visual information, it still relied on inferring motion
effects from raw data rather than directly incorporating control signals.
To address this limitation, we introduce the Robot Autonomous
Motion Dataset (RoAM), a novel open-source stereo-image dataset
captured using a Turtlebot3 Burger robot equipped with a Zed mini
stereo camera. Unlike existing datasets, RoAM provides synchronized
control action data alongside visual information, 2D LiDAR
scans, IMU readings, and odometry data, creating a comprehensive
multimodal resource for action-conditioned prediction tasks. Building
on this dataset, we present two complementary approaches to
action-conditioned video prediction. First, we develop deterministic
frameworks, ACPNet and ACVG (Action Conditioned Video Generation),
that explicitly condition predicted image frames on robot control
actions, resulting in more physically consistent video predictions.
We rigorously benchmark these architectures against state-of-the-art
models, demonstrating superior performance when leveraging
control action data.
We then advance two theoretical frameworks for learning stochastic
priors that simultaneously predict future images and actions. (i)
Conditional Independence: Under this assumption, we model image-action
pairs as extended system states generated from a shared latent
stochastic process. We implement this approach through two models:
VG-LeAP, a variational generative framework, and RAFI, built
on sparsely conditioned flow matching, demonstrating the versatility
of this principle across different architectural paradigms. (ii) Causal
Dependence: This framework models images and actions as causally
interlinked nodes, reflecting real-world scenarios where robots take
actions based on current observations and then observe subsequent
states as consequences. We implement this approach through
Causal-LeAP, a variational generative framework that learns separate but
conditionally dependent latent priors for images and actions.
Our comprehensive evaluation demonstrates that explicitly incorporating
control actions significantly improves prediction accuracy while
maintaining computational efficiency suitable for deployment on
resource-constrained robotic platforms. This research bridges critical
gaps between visual perception, motion planning, and control theory,
establishing new foundations for intelligent autonomous systems that
can effectively navigate and interact within dynamic human environments. | en_US |