
dc.contributor.advisor: Chakraborty, Anirban
dc.contributor.author: Tripathi, Aditay
dc.date.accessioned: 2024-11-04T04:39:32Z
dc.date.available: 2024-11-04T04:39:32Z
dc.date.submitted: 2024
dc.identifier.uri: https://etd.iisc.ac.in/handle/2005/6663
dc.description.abstract:
Coarse semantic features are abstract descriptors capturing broad semantic information in an image, including scene labels, crude contextual relationships between objects in the scene, or even objects described using hand-drawn sketches. Derived from external sources or pre-trained models, these features complement fine-grained representations from deep neural networks, enhancing overall image understanding. In this thesis, we explore applications where we integrate coarse semantic cues with deep image representations to address novel visual analytics tasks and propose significantly improved solutions to existing challenges. In the first part of the thesis, we present novel query-guided object localization frameworks, where concepts like a hand-drawn sketch of an object, a ‘gloss’ delivering a crude description of the object of interest, or a scene graph representing objects and their coarse relationships provide the necessary cues to localize all instances of the corresponding object(s) in a complex natural scene. Next, we utilize the information contained in edge maps via intelligent augmentation and shuffling to improve the robustness of computer vision models against the adverse effects of the texture bias prevalent in such models. Lastly, in the realm of self-supervised learning, we use coarse representations derived from a compact proxy model to schedule and sequence training data for efficient training of larger models.

Locating objects in a scene through image queries is a key problem in computer vision, with recent work highlighting the challenge of localizing both seen and unseen object categories at test time. A possible solution is to use object images as queries; however, practical obstacles such as copyright, privacy constraints, and difficulties in obtaining annotated data for emerging categories pose challenges. Instead, we propose ‘sketch-guided object localization’, which utilizes crude hand-drawn sketches to localize the corresponding objects. We employ cross-modal attention to integrate query information into the image feature representation, facilitating the generation of region proposals relevant to the query sketch. The region proposals are then scored against the sketch query for localization. Our method outperforms baselines in both single-query and multi-query localization tasks on object detection benchmarks (MS-COCO and PASCAL-VOC) using abstract hand-drawn sketches as queries.

Challenges in scaling sketch-guided object localization include the abstract appearance of hand-drawn sketches, style variations, and a significant domain gap with respect to natural images. Existing solutions using attention-based frameworks show weak alignment and inaccurate localization because they integrate query features only after extracting image features independently. To mitigate this, we introduce a novel sketch-guided vision transformer encoder that uses cross-attention after each transformer-based image encoder block to integrate sketch information into the image representation. We thus learn query-conditioned image features with stronger alignment to the sketch query. At the decoder’s output, cross-attention is again used to integrate sketch information into the object-level image features, further improving their semantic alignment. The model generalizes to unseen object categories and achieves state-of-the-art performance on both open-set and closed-set localization tasks.
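The following is a minimal PyTorch-style sketch of the query-conditioned encoder idea described above, not the thesis implementation: cross-attention to sketch-query tokens is interleaved after the self-attention of each image encoder block. The class name, feature dimension, head count, and pre-norm layout are illustrative assumptions.

    import torch
    import torch.nn as nn

    class SketchConditionedEncoderBlock(nn.Module):
        """One encoder block: self-attention over image tokens, then cross-attention
        to sketch-query tokens so the image representation becomes query-conditioned."""
        def __init__(self, dim=256, heads=8):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

        def forward(self, img_tokens, sketch_tokens):
            # img_tokens: (B, N, dim) patch features; sketch_tokens: (B, S, dim) sketch features.
            x = self.norm1(img_tokens)
            img_tokens = img_tokens + self.self_attn(x, x, x)[0]
            # Inject the coarse sketch cue: image tokens attend to the sketch tokens.
            x = self.norm2(img_tokens)
            img_tokens = img_tokens + self.cross_attn(x, sketch_tokens, sketch_tokens)[0]
            return img_tokens + self.ffn(self.norm3(img_tokens))

Stacking several such blocks would yield sketch-conditioned image features that a proposal or detection head can then score against the sketch embedding.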
The aforementioned works pioneered and optimized the use of hand-drawn sketches for one-shot object localization. However, relying solely on crude hand-drawn sketches may introduce ambiguity; for instance, a rough sketch of a laptop could be confused with a sofa. One approach to addressing this is to use a coarse linguistic definition of the category, e.g., ‘a small portable computer small enough to use in your lap’, to complement the sketch query. We therefore propose a multimodal integration of sketches with linguistic category definitions, called ‘gloss’, for a comprehensive representation of visual and semantic cues. We use cross-modal attention to integrate information from the multimodal queries into the image representation, thus generating region proposals relevant to the queries. Further, we propose a novel orthogonal projection-based proposal scoring technique that evaluates each proposal with respect to the multimodal queries. Experiments on the MS-COCO dataset using ‘Quick, Draw!’ sketches and ‘WordNet’ glosses as queries demonstrate superior performance over related baselines for both seen and unseen categories.

In natural scenes, single-query object localization is uncertain due to factors like underrepresentation, occlusion, or the unavailability of suitable training data. Scenes with multiple objects exhibit visual relationships, offering strong contextual cues for improved grounding. Scene graphs efficiently represent objects and the coarse semantic relationships between them (e.g., laptop ‘on’ table). In this work, we study the problem of grounding scene graphs on natural images to improve object localization. To this end, we propose a novel graph neural network-based approach referred to as the Visio-Lingual Message PAssing Graph Neural Network (VL-MPAG Net). We first construct a directed graph with object proposals as nodes and edges representing plausible relations. Next, we employ a three-step framework involving inter-graph and intra-graph message passing. Through inter-graph message passing, the model integrates scene-graph information into the proposal representations, facilitating the learning of query-conditioned proposal representations. Subsequently, intra-graph message passing refines them to learn context-dependent representations of these proposals as well as of the query objects. The refined query representations are used to score the proposals for object localization, outperforming baselines on public benchmark datasets.

Deep vision models often exhibit an overreliance on texture features, resulting in poor generalization. To address this texture bias, we propose a lightweight adversarial augmentation technique called ELeaS that explicitly incentivizes the network to learn holistic shapes for accurate prediction in an object classification setting. Our augmentations superpose coarser descriptors, namely edge maps, from one image onto another image with shuffled patches, using a randomly determined mixing proportion. To classify these augmented images with the label of the edge-map image, the model must not only detect and focus on edges but also distinguish between relevant and spurious edges. We show that our augmentations significantly improve classification accuracy and robustness measures on a range of datasets and neural architectures. Analysis using multiple probe datasets shows substantially increased shape sensitivity in our trained models, explaining the observed improvements.
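A minimal sketch of the edge-map augmentation described above, under stated assumptions rather than the actual ELeaS implementation: the edge map of a labeled image is superposed onto a patch-shuffled distractor image with a random mixing proportion, and the result keeps the label of the edge-map image. The function name, Canny edge extraction, patch size, and mixing range are illustrative choices.

    import random
    import numpy as np
    import cv2

    def edge_shuffle_augment(labeled_img, distractor_img, patch=32):
        """Superpose the edge map of labeled_img onto a patch-shuffled distractor_img.
        Both inputs are HxWx3 uint8 arrays of the same size; H and W are assumed to be
        multiples of `patch`. The augmented image keeps labeled_img's class label."""
        h, w = labeled_img.shape[:2]
        assert h % patch == 0 and w % patch == 0
        # Coarse descriptor: edge map of the image whose label is retained.
        edges = cv2.Canny(cv2.cvtColor(labeled_img, cv2.COLOR_BGR2GRAY), 100, 200)
        edges = cv2.cvtColor(edges, cv2.COLOR_GRAY2BGR)
        # Destroy the distractor's global shape (but keep its textures) by shuffling patches.
        tiles = [distractor_img[i:i + patch, j:j + patch].copy()
                 for i in range(0, h, patch) for j in range(0, w, patch)]
        random.shuffle(tiles)
        cols = w // patch
        shuffled = np.vstack([np.hstack(tiles[r * cols:(r + 1) * cols])
                              for r in range(h // patch)])
        # Random mixing proportion between the edge map and the shuffled background.
        lam = random.uniform(0.3, 0.7)
        return np.clip(lam * edges + (1 - lam) * shuffled, 0, 255).astype(np.uint8)

A classifier trained on such samples can recover the edge-map image's label only by attending to shape-bearing edges rather than texture, which is the behavior this style of augmentation rewards.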
Self-supervised learning (SSL) is vital for acquiring high-quality representations from unlabeled image collections, but growing dataset sizes increase the demand for computational resources when training SSL models. To address this, we propose ‘DYSCOF’, a Dynamic data Selection method that scores and prioritizes essential samples using a Coarse-to-Fine schedule for optimized data selection. It employs a small proxy model pre-trained via contrastive learning to identify a data subset based on the score difference between the representations learned by the larger target model and the coarse representations obtained from this proxy. The selected subset is iteratively updated in a coarse-to-fine schedule that initially selects a large fraction of the data and gradually reduces the proportion of selected data as training progresses. To further enhance efficiency, we introduce a distillation loss that leverages the coarse representations obtained from the proxy model to guide the target model’s learning. Validated on public benchmark datasets, our method achieves a large reduction in computational load, enhancing SSL training efficiency while maintaining classification performance.
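A minimal sketch of the coarse-to-fine dynamic selection idea described above, not the thesis code: samples are scored by how much the target model's representation disagrees with the frozen proxy's coarse representation, the kept fraction shrinks linearly over training, and a cosine-based distillation term lets the proxy guide the target. The function names, the linear schedule, the cosine scoring (which assumes the two feature sets share a comparable embedding space, e.g., via a projection), and the loss form are all assumptions made for illustration.

    import torch
    import torch.nn.functional as F

    def keep_fraction(epoch, total_epochs, start=1.0, end=0.3):
        """Coarse-to-fine schedule: keep most of the data early, less as training progresses."""
        t = epoch / max(total_epochs - 1, 1)
        return start + t * (end - start)

    @torch.no_grad()
    def select_subset(target_feats, proxy_feats, frac):
        """Keep the `frac` of samples where target and proxy representations disagree most.
        target_feats, proxy_feats: (N, D) embeddings of the same N samples."""
        scores = 1.0 - F.cosine_similarity(target_feats, proxy_feats, dim=-1)
        k = max(1, int(frac * scores.numel()))
        return torch.topk(scores, k).indices  # indices of the selected training samples

    def proxy_distillation_loss(target_feats, proxy_feats):
        """Pull the target's representations toward the proxy's coarse representations."""
        t = F.normalize(target_feats, dim=-1)
        p = F.normalize(proxy_feats.detach(), dim=-1)
        return (1.0 - (t * p).sum(dim=-1)).mean()

In an actual training loop, the subset would be re-scored periodically with the current target features, and the distillation term would be added to the SSL objective with a weighting coefficient.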
dc.language.iso: en_US
dc.relation.ispartofseries: ;ET00673
dc.rights: I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.
dc.subject: Coarse-semantic features
dc.subject: Query-guided localization
dc.subject: Sketch-guided object localization
dc.subject: Multimodal query-guided localization
dc.subject: Detection Transformer
dc.subject: Robust vision models
dc.subject: Efficient self-supervised learning
dc.subject: Multi-modal Computer Vision
dc.subject: Sketch
dc.subject: Gloss
dc.subject: Shape bias
dc.subject.classification: Research Subject Categories::TECHNOLOGY::Information technology::Computer science
dc.title: Integrating Coarse Semantic Information with Deep Image Representations for Object Localization, Model Generalization and Efficient Training
dc.type: Thesis
dc.degree.name: PhD
dc.degree.level: Doctoral
dc.degree.grantor: Indian Institute of Science
dc.degree.discipline: Engineering

