Self-Supervised Domain Adaptation Frameworks for Computer Vision Tasks
Abstract
There is a strong incentive to build intelligent machines that can understand and adapt to changes in the visual world without human supervision. While humans and animals learn to perceive the world on their own, almost all state-of-the-art vision systems rely heavily on external supervision from millions of manually annotated training examples. Gathering such large-scale annotations for structured vision tasks, such as monocular depth estimation, scene segmentation, and human pose estimation, faces several practical limitations. Annotations are usually gathered in two broad ways: (1) via specialized instruments (sensors) or laboratory setups, and (2) via manual annotation. Both processes have drawbacks. Human annotations are expensive, scarce, or error-prone, while instrument-based annotations are often noisy or limited to specific laboratory environments. Such limitations not only stand as a major bottleneck in our efforts to gather unambiguous ground-truth but also limit the diversity of the collected labeled datasets. This motivates us to develop innovative ways to utilize synthetic environments to create labeled synthetic datasets with noise-free, unambiguous ground-truth. However, the performance of models trained on such synthetic data degrades markedly when tested on real-world samples due to input distribution shift (a.k.a. domain shift). Unsupervised domain adaptation (DA) seeks learning techniques that can minimize the domain discrepancy between a labeled source domain and an unlabeled target domain. However, it remains largely unexplored for challenging structured-prediction vision tasks.
Motivated by the above observations, my research focuses on the following key aspects: (1) developing algorithms that support improved transferability under domain and task shifts, (2) leveraging inter-entity or cross-modal relationships to develop self-supervised objectives, and (3) instilling natural priors to constrain the model output within the realm of natural distributions.
First, we present AdaDepth, an unsupervised domain adaptation strategy for the pixel-wise regression task of monocular depth estimation. Mode collapse is a common phenomenon observed during adversarial training in the absence of paired supervision. Without access to target depth-maps, we address this challenge using a novel content-congruent regularization technique. In a follow-up work, we introduce UM-Adapt, a unified framework that addresses two distinct objectives in a multi-task adaptation setting: (a) achieving balanced performance across all tasks and (b) performing domain adaptation in an unsupervised setting. This is realized using two novel regularization strategies: contour-based content regularization and exploitation of inter-task coherency through a novel cross-task distillation module. Moving forward, we identify certain key issues in existing domain adaptation algorithms that hinder their practical deployability to a large extent. Existing approaches demand the coexistence of source and target data, which is highly impractical in scenarios where data-sharing is restricted due to proprietary or privacy concerns. To address this, we propose a new setting termed Source-Free DA and develop tailored learning protocols for the dense prediction task of semantic segmentation and for image classification, both with and without category shift.
Further, we investigate the problem of self-supervised domain adaptation for the challenging task of monocular 3D human pose estimation. The key differentiating factor in our approach is the idea of infusing a model-based structural prior as a means to constrain the pose predictions within the realm of natural pose and shape distributions. Towards self-supervised learning, our contribution lies in the effective use of new inter-entity relationships to discern the co-salient foreground appearance, and thereby the corresponding pose, from just a pair of images having diverse backgrounds. Unlike self-supervised solutions that aim for better generalization, self-adaptive solutions aim for target-specific adaptation, i.e., adaptation to deployment-specific environmental attributes. To this end, we propose a self-adaptive method that aligns the latent space of human pose obtained from unpaired image-to-latent and pose-to-latent mappings by enforcing well-formed non-local latent space rules available for unpaired image (or video) and pose (or motion) domains. Compared with the broadly employed general contrastive learning techniques, this idea of non-local relation distillation shows significant improvements in self-adaptation performance. Finally, in a recent work, we propose a novel way to effectively utilize uncertainty estimation for out-of-distribution (OOD) detection, thereby enabling inference-time self-adaptation. The ability to discern OOD samples allows a model to assess when to perform re-adaptation while deployed in a continually changing environment. Such solutions are in high demand for enabling effective real-world deployment across various industries, from virtual and augmented reality to gaming and healthcare applications.
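To make the idea of uncertainty-gated re-adaptation concrete, the following is a minimal sketch and not the method developed in this work. It assumes that uncertainty is measured as the disagreement (mean per-coordinate variance) among several predictions for the same input, for example from repeated stochastic forward passes, and that re-adaptation is triggered when the fraction of recent OOD-flagged samples exceeds a ratio. The function names, the threshold, and the ratio are all hypothetical.

```python
from statistics import pvariance
from typing import Sequence


def prediction_uncertainty(samples: Sequence[Sequence[float]]) -> float:
    """Mean per-coordinate variance across several predictions for one input.

    Assumption: `samples` holds K >= 2 predictions (e.g. flattened joint
    coordinates) for the same image; this is an illustrative uncertainty
    measure, not the estimator proposed in the source.
    """
    if len(samples) < 2:
        raise ValueError("need at least two predictions to measure disagreement")
    num_coords = len(samples[0])
    per_coord = [pvariance([s[i] for s in samples]) for i in range(num_coords)]
    return sum(per_coord) / num_coords


def is_ood(samples: Sequence[Sequence[float]], threshold: float) -> bool:
    """Flag an input as out-of-distribution when uncertainty exceeds a threshold."""
    return prediction_uncertainty(samples) > threshold


def should_readapt(recent_flags: Sequence[bool], max_ood_ratio: float = 0.5) -> bool:
    """Trigger re-adaptation when too many recent inputs were flagged as OOD."""
    if not recent_flags:
        return False
    return sum(recent_flags) / len(recent_flags) > max_ood_ratio


if __name__ == "__main__":
    # Hypothetical predictions: three passes over the same input, two coordinates each.
    in_domain = [[0.50, 0.20], [0.51, 0.19], [0.49, 0.21]]
    shifted = [[0.10, 0.90], [0.80, 0.20], [0.50, 0.50]]
    flags = [is_ood(in_domain, 0.01), is_ood(shifted, 0.01), is_ood(shifted, 0.01)]
    print(flags, should_readapt(flags))
```

In a deployed system the threshold would typically be calibrated on held-out in-domain data; the values used here are placeholders chosen only for illustration.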