Learning Across Domains: Applications to Text-based Person Search and Multi-Source Domain Adaptation
Abstract
With the rapid development of technology and the ubiquitous presence of diverse sensors, large
amounts of data from different modalities (e.g., text, audio, images) describing the same person,
object, or event have become easily available. Similarly, multiple datasets targeting the same
task but exhibiting different data distributions are often available. The ability to learn and utilize
the complementary information present across such diverse domains can be immensely valuable for
building more intelligent models. Cross-modal learning and domain adaptation techniques are closely
related to learning under such scenarios. In this thesis, we investigate and provide novel algorithms
for two applications of learning across domains - namely Text-based Person Search and Multi-Source
Domain Adaptation.
Person search in a camera network is an important problem in the field of intelligent video surveillance.
Often the search query comes in the form of an unstructured textual description of the target of
interest, and the goal is to retrieve the pedestrian images that best match this description. In the first
part of the thesis, we investigate methods for this cross-modal retrieval problem of Text-based Person
Search. Existing methods utilize class-id information to obtain discriminative and identity-preserving
features. However, whether it is also beneficial to explicitly retain the semantics of the data has not
been well explored. In the proposed work, we aim to create semantics-preserving embeddings through
an additional task of attribute prediction. Since attribute annotations are typically unavailable in
text-based person search, we first mine them from the text corpus. These attributes are then used
both to bridge the modality gap between the image and text inputs and to improve representation
learning. In summary, we propose an approach for text-based person search that learns an
attribute-driven space along with a class-information-driven space, and utilizes both for obtaining
the retrieval results. Our experiments show that learning the attribute space not only improves
performance but also yields human-interpretable features.
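To make the dual-space idea concrete, the following PyTorch-style sketch shows one way the
class-information-driven and attribute-driven spaces, together with the auxiliary attribute-prediction
task, could fit together. The module names, dimensions, shared projection heads, and the fusion
weight alpha are illustrative assumptions, not the exact architecture proposed in the thesis.

import torch.nn as nn
import torch.nn.functional as F

class DualSpaceHead(nn.Module):
    """Projects a backbone feature into an ID space and an attribute space."""
    def __init__(self, feat_dim=2048, embed_dim=512, num_attributes=40):
        super().__init__()
        self.id_proj = nn.Linear(feat_dim, embed_dim)    # class-information-driven space
        self.attr_proj = nn.Linear(feat_dim, embed_dim)  # attribute-driven space
        self.attr_head = nn.Linear(embed_dim, num_attributes)

    def forward(self, feats):
        id_emb = F.normalize(self.id_proj(feats), dim=-1)
        attr_emb = F.normalize(self.attr_proj(feats), dim=-1)
        return id_emb, attr_emb, self.attr_head(attr_emb)

def training_loss(img_feats, txt_feats, head, id_classifier, ids, mined_attrs):
    """ID classification in the ID space plus BCE attribute prediction in the
    attribute space; sharing the heads across modalities (a simplification
    assumed here) ties both modalities to the same two spaces."""
    loss = 0.0
    for feats in (img_feats, txt_feats):
        id_emb, _, attr_logits = head(feats)
        loss = loss + F.cross_entropy(id_classifier(id_emb), ids)
        loss = loss + F.binary_cross_entropy_with_logits(attr_logits, mined_attrs)
    return loss

def retrieval_scores(txt_feats, gallery_img_feats, head, alpha=0.5):
    """Rank gallery images for a text query by fusing cosine similarities
    from both spaces."""
    q_id, q_attr, _ = head(txt_feats)
    g_id, g_attr, _ = head(gallery_img_feats)
    return alpha * q_id @ g_id.T + (1 - alpha) * q_attr @ g_attr.T

At test time, retrieval_scores fuses the similarities of the two spaces, which is how the attribute
space can contribute interpretable evidence to the final ranking.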
In the second part of the thesis, we address Multi-Source Domain Adaptation, a problem involving
multiple data sources which are of the same modality but follow different distributions. Domain
adaptation is a field of machine learning that aims to learn a model from a labelled source dataset,
such that the model performs well on samples drawn from an unlabelled target domain which has
a related but different distribution. The problem of single-source unsupervised domain adaptation
has been explored quite extensively. However, in practice, labelled data is often available from multiple,
differently distributed sources - giving rise to the problem of multi-source domain adaptation
(MSDA). Recent works in MSDA propose to learn a domain-invariant space for the sources and the
target. However, such methods treat each source as equally relevant and are not sensitive to the
intrinsic relations amongst domains. In this work, we provide a novel algorithm for multi-source
domain adaptation which utilizes the multiple sources based on their relative importance to the target.
Our objective is to dynamically estimate the relevance of each source, and then to perform a weighted
alignment of the domains. We experimentally validate the performance of our method on benchmark
datasets, and achieve state-of-the-art results on Office-Home and Office-Caltech.
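As a minimal illustration of relevance-weighted alignment, the sketch below scores each source by a
proxy distance to the target and weights the per-source alignment losses accordingly. The
linear-kernel MMD proxy, the softmax weighting, and stopping gradients through the weights are
assumptions made for this sketch, not the exact formulation of our algorithm.

import torch
import torch.nn.functional as F

def mmd_linear(x, y):
    """Squared linear-kernel MMD: ||mean(x) - mean(y)||^2. Smaller values
    indicate that the two feature distributions are closer."""
    return (x.mean(dim=0) - y.mean(dim=0)).pow(2).sum()

def source_weights(source_feats, target_feats, temperature=1.0):
    """Score each source by its proxy distance to the target and normalize;
    a smaller distance yields a larger weight."""
    dists = torch.stack([mmd_linear(s, target_feats) for s in source_feats])
    return F.softmax(-dists / temperature, dim=0)

def weighted_alignment_loss(source_feats, target_feats):
    """Relevance-weighted alignment: more target-like sources dominate."""
    weights = source_weights(source_feats, target_feats).detach()
    losses = torch.stack([mmd_linear(s, target_feats) for s in source_feats])
    return (weights * losses).sum()

# Usage with features from three sources and the target, e.g. drawn from a
# shared feature extractor during training:
sources = [torch.randn(64, 256) for _ in range(3)]
target = torch.randn(64, 256)
loss = weighted_alignment_loss(sources, target)

The temperature controls how sharply the alignment focuses on the most target-like sources, and
detaching the weights is one simple way to keep the relevance estimate from being gamed by the
feature extractor.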