Learning Across Domains: Applications to Text-based Person Search and Multi-Source Domain Adaptation
With the rapid development of technology and the ubiquitous presence of diverse types of sensors, large amounts of data from different modalities (e.g., text, audio, images) describing the same person/object/event have become easily available. Similarly, multiple datasets targeted towards the same task but exhibiting different data distributions are often available. The ability to learn and utilize the complementary information present across diverse domains can be immensely valuable for building more intelligent models. Cross-modal learning and domain adaptation techniques are closely related to learning under such scenarios. In this thesis, we investigate and provide novel algorithms for two applications of learning across domains, namely Text-based Person Search and Multi-Source Domain Adaptation.

Person search in a camera network is an important problem in the field of intelligent video surveillance. Often the search query comes in the form of an unstructured textual description of the target of interest, and the goal is to retrieve the pedestrian images that best match this description. In the first part of the thesis, we investigate methods for this cross-modal retrieval problem of text-based person search. Existing methods utilize class-id information to obtain discriminative and identity-preserving features. However, whether it is also beneficial to explicitly ensure that the semantics of the data are retained has not been well explored. In the proposed work, we aim to create semantics-preserving embeddings through an additional task of attribute prediction. Since attribute annotations are typically unavailable in text-based person search, we first mine them from the text corpus. These attributes are then used as a means to bridge the modality gap between the image-text inputs, as well as to improve the representation learning.
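To make the attribute-mining step concrete, the following is a minimal sketch of how attribute labels could be extracted from person-description captions via a keyword lexicon. The lexicon and function names here are illustrative assumptions, not the exact procedure used in the thesis.

```python
# Hypothetical attribute lexicon: attribute label -> trigger words.
# Illustrative only; the thesis mines attributes from its own text corpus.
ATTRIBUTE_LEXICON = {
    "gender:female": {"woman", "girl", "female", "lady"},
    "gender:male": {"man", "boy", "male"},
    "carrying:backpack": {"backpack", "rucksack"},
    "wearing:jeans": {"jeans", "denim"},
}


def mine_attributes(caption: str) -> set:
    """Return the attribute labels whose trigger words appear in the caption."""
    tokens = set(caption.lower().replace(".", " ").replace(",", " ").split())
    return {attr for attr, triggers in ATTRIBUTE_LEXICON.items() if tokens & triggers}


# Usage: mined labels can supervise an auxiliary attribute-prediction head.
labels = mine_attributes("A woman in blue jeans carrying a backpack.")
```

Because the same mined attributes can be predicted from both the image and the text, they provide a shared, modality-agnostic supervisory signal.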
In summary, we propose an approach for text-based person search that learns an attribute-driven space along with a class-information-driven space, and utilizes both to obtain the retrieval results. Our experiments show that learning the attribute space not only improves performance but also yields humanly interpretable features.

In the second part of the thesis, we work on Multi-Source Domain Adaptation, a problem involving multiple data sources which are of the same modality but follow different distributions. Domain adaptation is a field of machine learning that aims at learning a model from a labelled source dataset such that the model performs well on samples drawn from an unlabelled target domain with a related but different distribution. The problem of single-source unsupervised domain adaptation has been explored quite extensively. However, in practice, labelled data is often available from multiple, differently distributed sources, giving rise to the problem of multi-source domain adaptation (MSDA). Recent works in MSDA propose to learn a domain-invariant space for the sources and the target. However, such methods treat each source as equally relevant and are not sensitive to the intrinsic relations amongst domains. In this work, we provide a novel algorithm for multi-source domain adaptation that utilizes the multiple sources based on their relative importance to the target. Our objective is to dynamically estimate the relevance of the sources, and then to perform a weighted alignment of domains. We experimentally validate the performance of our method on benchmark datasets, and achieve state-of-the-art results on Office-Home and Office-Caltech.
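The idea of weighting sources by their relevance to the target can be sketched as follows: score each source by the distance between its mean feature vector and the target's, and turn those scores into alignment weights via a softmax. This particular weighting rule is an assumption for illustration, not the thesis's exact formulation.

```python
import math


def domain_mean(features):
    """Mean feature vector of a domain (features: list of equal-length vectors)."""
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]


def source_weights(source_domains, target_features, temperature=1.0):
    """Softmax over negative source-to-target mean distances.

    Sources whose feature statistics sit closer to the target receive larger
    weights, so their alignment terms dominate the adaptation objective.
    Illustrative sketch; the thesis estimates relevance dynamically during training.
    """
    t_mean = domain_mean(target_features)
    dists = [math.dist(domain_mean(src), t_mean) for src in source_domains]
    exps = [math.exp(-d / temperature) for d in dists]
    z = sum(exps)
    return [e / z for e in exps]
```

The resulting weights can then scale each source's domain-alignment loss, so that a weakly related source cannot drag the shared representation away from the target.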