Integrating Coarse Semantic Information with Deep Image Representations for Object Localization, Model Generalization and Efficient Training
Abstract
Coarse semantic features are abstract descriptors capturing broad semantic information in an
image, including scene labels, crude contextual relationships between objects in the scene, or
even objects described using hand-drawn sketches. Derived from external sources or
pre-trained models, these features complement fine-grained representations from deep neural
networks, enhancing overall image understanding. In this thesis, we explore applications where
we integrate coarse semantic cues with deep image representations to address novel visual
analytics tasks and propose significantly improved solutions to existing challenges. In the first
part of the thesis, we present novel query-guided object localization frameworks, where
concepts like a hand-drawn sketch of an object, a ‘gloss’ delivering a crude description of the
object of interest, or a scene-graph representing objects and their coarse relationships provide
necessary cues to localize all instances of the corresponding object(s) in a complex natural
scene. Next, we utilize the information contained in edge maps via intelligent augmentation and
shuffling to improve the robustness of computer vision models against the adverse effects of
texture bias prevalent in such models. Lastly, in the realm of self-supervised learning, we use
coarse representations derived from a compact proxy model to schedule and sequence training
data for efficient training of larger models.
Locating objects in a scene through image queries is a key problem in computer vision, with
recent work highlighting the challenge of localizing both seen and unseen object categories at
test time. A possible solution is to use object images as queries; however, copyright and privacy constraints, along with the difficulty of obtaining annotated data for emerging categories, make this impractical. Instead, we propose ‘sketch-guided object localization,’ which
utilizes crude hand-drawn sketches to localize corresponding objects. We employ cross-modal
attention to integrate query information into the image feature representation, facilitating the
generation of region proposals relevant to the query sketch. The region proposals are then
scored along with the sketch query for localization. Our method outperforms baselines in both
single-query and multi-query localization tasks on object detection benchmarks (MS-COCO and
PASCAL-VOC) using abstract hand-drawn sketches as queries.
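To make the fusion step concrete, the following minimal PyTorch sketch shows one way cross-modal attention could inject a pooled sketch embedding into backbone feature maps before proposal generation; the module names, feature dimension, and residual gating are illustrative assumptions, not the exact architecture used in this work.

```python
# Minimal sketch (PyTorch) of cross-modal attention between a sketch query and
# image features. Dimensions and the gating scheme are illustrative assumptions.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """Attends over image feature-map locations using a pooled sketch query."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # projects the sketch query
        self.k_proj = nn.Linear(dim, dim)   # projects image locations (keys)
        self.scale = dim ** -0.5

    def forward(self, img_feats: torch.Tensor, sketch_feat: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, C, H, W) backbone features; sketch_feat: (B, C) pooled sketch embedding
        b, c, h, w = img_feats.shape
        keys = self.k_proj(img_feats.flatten(2).transpose(1, 2))            # (B, H*W, C)
        query = self.q_proj(sketch_feat).unsqueeze(1)                       # (B, 1, C)
        attn = torch.softmax((query @ keys.transpose(1, 2)) * self.scale, dim=-1)  # (B, 1, H*W)
        # Re-weight each spatial location by its relevance to the sketch query,
        # so the downstream proposal generator favours regions related to the queried category.
        gated = img_feats * attn.view(b, 1, h, w)
        return gated + img_feats  # residual keeps the original image evidence
```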
Challenges in scaling sketch-guided object localization include the abstract appearance of
hand-drawn sketches, style variations, and a significant domain gap with respect to natural
images. Existing solutions using attention-based frameworks show weak alignment and
inaccurate localization because they integrate query features only after image features have been extracted independently of the query. To mitigate this, we introduce a novel sketch-guided vision transformer encoder
that uses cross-attention after each transformer-based image encoder block to integrate sketch
information into the image representation. Thus, we learn query-conditioned image features for
stronger alignment with the sketch query. At the decoder’s output, cross-attention is used to
integrate sketch information into the object-level image features, thus improving their semantic alignment. The model generalizes to unseen object categories and achieves state-of-the-art
performance across both open-set and closed-set localization tasks.
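As a rough illustration of this interleaving, the sketch below alternates standard image self-attention blocks with cross-attention layers that let image tokens attend to sketch tokens; the depth, dimensions, and layer choices are assumptions made only for exposition.

```python
# Illustrative sketch (PyTorch) of cross-attention inserted after every image
# encoder block, producing query-conditioned image tokens.
import torch
import torch.nn as nn


class SketchGuidedEncoder(nn.Module):
    def __init__(self, dim: int = 256, depth: int = 6, heads: int = 8):
        super().__init__()
        self.self_blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True) for _ in range(depth)]
        )
        self.cross_blocks = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(depth)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(depth)])

    def forward(self, img_tokens: torch.Tensor, sketch_tokens: torch.Tensor) -> torch.Tensor:
        # img_tokens: (B, N, C) patch tokens; sketch_tokens: (B, M, C) sketch tokens
        x = img_tokens
        for self_blk, cross_blk, norm in zip(self.self_blocks, self.cross_blocks, self.norms):
            x = self_blk(x)                                    # standard image self-attention
            # Inject sketch information after every block: image tokens query the sketch.
            attended, _ = cross_blk(query=x, key=sketch_tokens, value=sketch_tokens)
            x = norm(x + attended)                             # residual fusion
        return x  # query-conditioned image representation
```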
The aforementioned works pioneered and optimized the use of hand-drawn sketches for
one-shot object localization. However, relying solely on crude hand-drawn sketches may
introduce ambiguity; for instance, a rough sketch of a laptop could be confused with a sofa. One
approach towards addressing this is to use a coarse linguistic definition of the category, e.g., ‘a
small portable computer small enough to use in your lap’, to complement the sketch query. We,
therefore, propose a multimodal integration of sketches with linguistic category definitions,
called ‘gloss’, for a comprehensive representation of visual and semantic cues. We use
cross-modal attention to integrate information from the multi-modal queries into the image
representation, thus generating region proposals relevant to the queries. Further, we propose a
novel orthogonal projection-based proposal scoring technique that evaluates each proposal with
respect to the multi-modal queries. Experiments on the MS-COCO dataset using ‘Quick, Draw!’ sketches and ‘WordNet’ glosses as queries demonstrate superior performance over related
baselines for both seen and unseen categories.
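One plausible (assumed) reading of the orthogonal projection-based scoring is sketched below: each proposal feature is projected onto the subspace spanned by the sketch and gloss embeddings, and the proposal is scored by how much of its energy lies in that subspace. The exact scoring rule in this work may differ.

```python
# Assumption-laden illustration of projection-based proposal scoring against
# multimodal (sketch + gloss) queries; not the exact rule from the thesis.
import torch


def score_proposals(proposals: torch.Tensor, sketch: torch.Tensor, gloss: torch.Tensor) -> torch.Tensor:
    # proposals: (N, D) proposal features; sketch, gloss: (D,) query embeddings
    Q = torch.stack([sketch, gloss], dim=1)            # (D, 2) query subspace basis
    Q, _ = torch.linalg.qr(Q)                          # orthonormalize the basis
    projected = proposals @ Q @ Q.T                    # (N, D) projection onto span{sketch, gloss}
    # Score = fraction of each proposal's norm explained by the query subspace.
    return projected.norm(dim=1) / proposals.norm(dim=1).clamp(min=1e-8)
```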
In natural scenes, single-query object localization can be unreliable due to factors such as underrepresentation, occlusion, or the unavailability of suitable training data. Scenes with multiple
objects exhibit visual relationships, offering strong contextual cues for improved grounding.
Scene graphs efficiently represent objects and the coarse semantic relationships (e.g., laptop
‘on’ table) between them. In this work, we study the problem of grounding scene graphs on
natural images to improve object localization. To this end, we propose a novel graph neural
network-based approach referred to as Visio-Lingual Message PAssing Graph neural Network
(VL-MPAG Net). We first construct a directed graph with object proposals as nodes and edges
representing plausible relations. Next, we employ a three-step framework involving inter-graph
and intra-graph message passing. Through inter-graph message passing, the model integrates
scene-graph information into the proposal representations, facilitating the learning of
query-conditioned proposal representations. Subsequently, intra-graph message passing refines
them to learn context-dependent representations of these proposals as well as those of the query
objects. These refined query representations are used to score the proposals for object
localization, outperforming baselines on public benchmark datasets.
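The following compact sketch illustrates the two kinds of message passing in simplified form: inter-graph messages flow from scene-graph query nodes to proposal nodes, and intra-graph messages flow between neighboring nodes of the same graph. The graph construction, affinity computation, and aggregation here are simplifying assumptions rather than the exact VL-MPAG Net design.

```python
# Simplified sketch (PyTorch) of inter-graph and intra-graph message passing.
import torch
import torch.nn as nn


class VisioLingualMessagePassing(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.inter_msg = nn.Linear(2 * dim, dim)   # query-node -> proposal-node messages
        self.intra_msg = nn.Linear(2 * dim, dim)   # messages between nodes of the same graph

    def inter_graph(self, proposals, queries, cross_adj):
        # proposals: (P, C), queries: (Q, C), cross_adj: (P, Q) proposal-to-query affinity.
        # Each proposal aggregates messages from the scene-graph query nodes it may depict,
        # yielding query-conditioned proposal representations.
        msgs = self.inter_msg(torch.cat(
            [proposals.unsqueeze(1).expand(-1, queries.size(0), -1),
             queries.unsqueeze(0).expand(proposals.size(0), -1, -1)], dim=-1))
        return proposals + (cross_adj.unsqueeze(-1) * msgs).sum(dim=1)

    def intra_graph(self, nodes, adj):
        # nodes: (N, C), adj: (N, N) directed adjacency among proposals (or among query objects).
        # Nodes exchange messages with their neighbors to become context dependent.
        msgs = self.intra_msg(torch.cat(
            [nodes.unsqueeze(1).expand(-1, nodes.size(0), -1),
             nodes.unsqueeze(0).expand(nodes.size(0), -1, -1)], dim=-1))
        return nodes + (adj.unsqueeze(-1) * msgs).sum(dim=1)
```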
Deep vision models often exhibit overreliance on texture features, resulting in poor
generalization. To address this texture bias, we propose a lightweight adversarial augmentation
technique called ELeaS that explicitly incentivizes the network to learn holistic shapes for
accurate prediction in an object classification setting. Our augmentations superpose coarser
descriptors, namely edge maps, from one image onto another image whose patches have been shuffled, using a randomly determined mixing proportion. To classify these augmented images with the label of the edge-map image, the model must not only detect and focus on edges but
also distinguish between relevant and spurious edges. We show that our augmentations
significantly improve classification accuracy and robustness measures on a range of datasets
and neural architectures. Analysis using multiple probe datasets shows substantially increased
shape sensitivity in our trained models, explaining these observed improvements.
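The augmentation can be pictured with the following illustrative sketch, which superposes the edge map of one image onto a patch-shuffled version of another using a random mixing proportion; the edge extractor, patch size, and mixing range are assumptions chosen for clarity rather than the exact ELeaS recipe.

```python
# Illustrative sketch of the edge-map-on-shuffled-patches augmentation.
import torch


def shuffle_patches(img: torch.Tensor, patch: int = 16) -> torch.Tensor:
    # img: (C, H, W) with H, W divisible by `patch`; randomly permutes the patch grid.
    c, h, w = img.shape
    patches = img.unfold(1, patch, patch).unfold(2, patch, patch)   # (C, H/p, W/p, p, p)
    patches = patches.reshape(c, -1, patch, patch)                  # (C, N, p, p)
    patches = patches[:, torch.randperm(patches.size(1))]           # shuffle patch order
    grid = patches.reshape(c, h // patch, w // patch, patch, patch)
    return grid.permute(0, 1, 3, 2, 4).reshape(c, h, w)


def edge_shuffle_augment(edge_map: torch.Tensor, other_img: torch.Tensor) -> torch.Tensor:
    # edge_map: (C, H, W) edge map of the image whose label is kept;
    # other_img: (C, H, W) distractor image whose patches get shuffled.
    lam = torch.empty(1).uniform_(0.3, 0.9).item()   # random mixing proportion (assumed range)
    background = shuffle_patches(other_img)
    # Superpose the edge map onto the patch-shuffled distractor; the training label
    # is that of the edge-map image, forcing the model to rely on holistic shape.
    return lam * edge_map + (1.0 - lam) * background
```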
Self-supervised learning (SSL) is vital for acquiring high-quality representations from unlabeled
image collections, but the growing dataset sizes increase the demand for computational
resources in training SSL models. To address this, we propose ‘DYSCOF’, a Dynamic data
Selection method. DYSCOF scores and prioritizes essential samples using a Coarse-to-Fine schedule for optimized data selection. It employs a small proxy model pre-trained via contrastive
learning to identify a data subset, scoring each sample by the difference between the representations learned by the larger target model and the coarse representations obtained from this proxy. The selected subset is iteratively updated in a coarse-to-fine schedule that initially selects a larger
data fraction and gradually reduces the proportion of selected data as training progresses. To
further enhance efficiency, we introduce a distillation loss that leverages coarse representations
obtained from the proxy model to guide the target model’s learning. Validated on public
benchmark datasets, our method achieves a substantial reduction in computational load across all of them, enhancing SSL training efficiency while maintaining classification performance.
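A schematic sketch of the selection loop is given below: a coarse-to-fine schedule sets the fraction of data retained each epoch, samples are scored by the disagreement between target and proxy representations, and a small distillation term nudges the target toward the proxy's coarse features. The specific scoring rule, schedule endpoints, and distillation weight are assumptions made for illustration.

```python
# Schematic sketch of proxy-guided dynamic data selection with a coarse-to-fine
# schedule and a representation-distillation term. Values are assumed for clarity.
import torch
import torch.nn.functional as F


def selection_fraction(epoch: int, total_epochs: int, start: float = 0.9, end: float = 0.3) -> float:
    # Coarse-to-fine schedule: keep a large fraction early, shrink it as training proceeds.
    t = epoch / max(total_epochs - 1, 1)
    return start + t * (end - start)


def select_subset(target_feats: torch.Tensor, proxy_feats: torch.Tensor, keep_frac: float) -> torch.Tensor:
    # Score each sample by how far the target model's representation is from the
    # proxy's coarse representation; prioritize samples the target has not yet captured.
    scores = 1.0 - F.cosine_similarity(target_feats, proxy_feats, dim=1)   # (N,)
    k = max(1, int(keep_frac * scores.numel()))
    return scores.topk(k).indices                                          # indices of selected samples


def distill_loss(target_feats: torch.Tensor, proxy_feats: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    # Auxiliary distillation: nudge target representations toward the proxy's coarse ones.
    return alpha * (1.0 - F.cosine_similarity(target_feats, proxy_feats.detach(), dim=1)).mean()
```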