dc.contributor.advisor  Venkatesh Babu, R
dc.contributor.author  Kundu, Jogendra Nath
dc.date.accessioned  2022-07-15T05:37:36Z
dc.date.available  2022-07-15T05:37:36Z
dc.date.submitted  2022
dc.identifier.uri  https://etd.iisc.ac.in/handle/2005/5782
dc.description.abstract  There is a strong incentive to build intelligent machines that can understand and adapt to changes in the visual world without human supervision. While humans and animals learn to perceive the world on their own, almost all state-of-the-art vision systems rely heavily on external supervision from millions of manually annotated training examples. Gathering such large-scale annotations for structured vision tasks, such as monocular depth estimation, scene segmentation, and human pose estimation, faces several practical limitations. Annotations are usually gathered in two broad ways: 1) via specialized instruments (sensors) or laboratory setups, or 2) via manual annotation. Both processes have drawbacks: human annotations are expensive, scarce, or error-prone, while instrument-based annotations are often noisy or limited to specific laboratory environments. These limitations not only stand as a major bottleneck in our efforts to gather unambiguous ground-truth but also limit the diversity of the collected labeled datasets. This motivates us to develop innovative ways to utilize synthetic environments to create labeled synthetic datasets with noise-free, unambiguous ground-truth. However, the performance of models trained on such synthetic data degrades markedly when tested on real-world samples due to input distribution shift (a.k.a. domain shift). Unsupervised domain adaptation (DA) seeks learning techniques that can minimize the domain discrepancy between a labeled source and an unlabeled target. However, it remains mostly unexplored for challenging structured-prediction vision tasks. Motivated by these observations, my research focuses on addressing the following key aspects: (1) developing algorithms that support improved transferability under domain and task shifts, (2) leveraging inter-entity or cross-modal relationships to develop self-supervised objectives, and (3) instilling natural priors to constrain the model output within the realm of natural distributions. First, we present AdaDepth, an unsupervised domain adaptation strategy for the pixel-wise regression task of monocular depth estimation. Mode collapse is a common phenomenon observed during adversarial training in the absence of paired supervision. Without access to target depth maps, we address this challenge using a novel content-congruent regularization technique. In a follow-up work, we introduce UM-Adapt, a unified framework that addresses two distinct objectives in a multi-task adaptation setting: a) achieving balanced performance across all tasks, and b) performing domain adaptation in an unsupervised setting. This is realized using two novel regularization strategies: contour-based content regularization and the exploitation of inter-task coherency via a novel cross-task distillation module. Moving forward, we identify certain key issues in existing domain adaptation algorithms that largely hinder their practical deployability. Existing approaches demand the coexistence of source and target data, which is highly impractical in scenarios where data sharing is restricted due to proprietary or privacy concerns. To address this, we propose a new setting termed Source-Free DA, with tailored learning protocols for the dense prediction task of semantic segmentation and for image classification, in scenarios both with and without category shift.
Further, we investigate the problem of self-supervised domain adaptation for the challenging task of monocular 3D human pose estimation. The key differentiating factor in our approach is the idea of infusing a model-based structural prior as a means to constrain the pose estimation predictions within the realm of natural pose and shape distributions. Towards self-supervised learning, our contribution lies in the effective use of new inter-entity relationships to discern the co-salient foreground appearance, and thereby the corresponding pose, from just a pair of images with diverse backgrounds. Unlike self-supervised solutions that aim for better generalization, self-adaptive solutions aim for target-specific adaptation, i.e., adaptation to deployment-specific environmental attributes. To this end, we propose a self-adaptive method that aligns the latent spaces of unpaired image-to-latent and pose-to-latent mappings by enforcing well-formed non-local latent-space relations available in the unpaired image (or video) and pose (or motion) domains. This idea of non-local relation distillation, as opposed to the broadly employed contrastive learning techniques, shows significant improvements in self-adaptation performance. Further, in a recent work, we propose a novel way to effectively utilize uncertainty estimation for out-of-distribution (OOD) detection, thereby enabling inference-time self-adaptation. The ability to discern OOD samples allows a model to assess when to perform re-adaptation while deployed in a continually changing environment. Such solutions are in high demand for effective real-world deployment across various industries, from virtual and augmented reality to gaming and healthcare applications.  en_US
dc.language.iso  en_US  en_US
dc.rights  I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.  en_US
dc.subject  Self-Supervised Learning  en_US
dc.subject  Domain Adaptation  en_US
dc.subject  Computer Vision  en_US
dc.subject  Convolutional Neural Network  en_US
dc.subject  Deep Learning  en_US
dc.subject  Semantic Segmentation  en_US
dc.subject  3D Human Pose Estimation  en_US
dc.subject  AdaDepth  en_US
dc.subject  Dense Vision Tasks  en_US
dc.subject.classification  Computer Science  en_US
dc.title  Self-Supervised Domain Adaptation Frameworks for Computer Vision Tasks  en_US
dc.type  Thesis  en_US
dc.degree.name  PhD  en_US
dc.degree.level  Doctoral  en_US
dc.degree.grantor  Indian Institute of Science  en_US
dc.degree.discipline  Engineering  en_US

