Deep Learning for Hand-drawn Sketches: Analysis, Synthesis and Cognitive Process Models
Abstract
Deep Learning-based object category understanding is an important and active area of research
in Computer Vision. Work in this area has predominantly focused on the portion of the
depiction spectrum consisting of photographic images. However, depictions at the other end of
the spectrum, freehand sketches, are a fascinating visual representation worthy of study
in their own right. In this thesis, we present deep-learning approaches for sketch analysis, sketch
synthesis and modelling sketch-driven cognitive processes.
On the analysis front, we first focus on the problem of recognizing hand-drawn line sketches
of objects. We propose a deep Recurrent Neural Network architecture with a novel loss formulation
for sketch object recognition. Our approach achieves state-of-the-art results on a
large-scale sketch dataset. We also show that the inherently online nature of our framework is
especially suitable for on-the-fly recognition of objects as they are being drawn.
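As an illustrative aside, the online nature of such a recurrent recognizer can be sketched in a few lines of PyTorch; the (dx, dy, pen-state) input encoding, layer sizes and per-timestep readout below are assumptions for exposition, not the architecture proposed in the thesis.

```python
import torch
import torch.nn as nn

class OnlineSketchClassifier(nn.Module):
    """Toy recurrent recognizer over (dx, dy, pen-state) stroke points.
    Emits class logits at every timestep, so a guess is available
    mid-drawing. Sizes and input encoding are illustrative assumptions."""
    def __init__(self, num_classes: int, hidden: int = 128):
        super().__init__()
        self.rnn = nn.GRU(input_size=3, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, points):               # points: (batch, T, 3)
        states, _ = self.rnn(points)         # one hidden state per point
        return self.head(states)             # (batch, T, num_classes)

model = OnlineSketchClassifier(num_classes=160)
partial_sketch = torch.randn(1, 25, 3)       # 25 points drawn so far
logits = model(partial_sketch)
print(logits[:, -1].argmax(dim=-1))          # current best guess, mid-drawing
```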
We then move beyond object-level label prediction to the harder problem of parsing
sketched objects, i.e., given a freehand object sketch, determining its salient attributes (e.g.,
category, semantic parts, pose). To this end, we propose SketchParse, the first deep-network
architecture for fully automatic parsing of freehand object sketches. We subsequently demonstrate
SketchParse's abilities (i) on two challenging large-scale sketch datasets, (ii) in parsing
unseen, semantically related object categories and (iii) in improving fine-grained sketch-based image
retrieval. As a novel application, we also illustrate how SketchParse's output can be used
to generate caption-style descriptions for hand-drawn sketches.
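A minimal sketch of the multi-attribute idea, assuming a shared convolutional backbone with one dense head for part labels and one global head for pose; all layer choices below are placeholders, not the SketchParse architecture described in the thesis.

```python
import torch.nn as nn

class ToySketchParser(nn.Module):
    """Shared backbone with attribute-specific heads: per-pixel part
    labels plus a global pose label. Purely illustrative sizes."""
    def __init__(self, num_parts: int, num_poses: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.part_head = nn.Conv2d(64, num_parts, 1)   # dense: label per pixel
        self.pose_head = nn.Linear(64, num_poses)      # global: one label per sketch

    def forward(self, sketch):                         # sketch: (B, 1, H, W)
        f = self.backbone(sketch)
        parts = self.part_head(f)                      # (B, num_parts, H, W)
        pose = self.pose_head(f.mean(dim=(2, 3)))      # global average pool
        return parts, pose
```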
On the synthesis front, we design generative models for sketches via Generative Adversarial
Networks (GANs). Keeping the limited size of sketch datasets in mind, we propose DeLiGAN,
a novel architecture for diverse and limited training data scenarios. In our approach, we
reparameterize the latent generative space as a mixture model and learn the mixture model's
parameters along with those of the GAN. This seemingly simple modification to the vanilla GAN
framework is surprisingly effective and yields models that produce diverse samples despite
being trained on limited data. We show that DeLiGAN generates diverse samples
not just for hand-drawn sketches but for other image modalities as well. To quantitatively
characterize the intra-class diversity of generated samples, we also introduce a modified version of
"inception-score", a measure which has been found to correlate well with human assessment of
generated samples.
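The reparameterization itself is compact enough to sketch directly: draw a mixture component uniformly at random and sample z = mu_i + sigma_i * eps, so that gradients flow into the mixture parameters. The component count and initialization below are illustrative choices, not the settings used in the thesis.

```python
import torch
import torch.nn as nn

class MixtureLatent(nn.Module):
    """Mixture-of-Gaussians latent space, learned jointly with the GAN:
    z = mu_i + sigma_i * eps with component i drawn uniformly.
    Component count and initialization are illustrative choices."""
    def __init__(self, num_components: int = 50, latent_dim: int = 100):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(num_components, latent_dim))
        self.sigma = nn.Parameter(torch.full((num_components, latent_dim), 0.2))

    def sample(self, batch_size: int):
        idx = torch.randint(0, self.mu.size(0), (batch_size,))
        eps = torch.randn(batch_size, self.mu.size(1))
        return self.mu[idx] + self.sigma[idx] * eps   # differentiable in mu, sigma

latent = MixtureLatent()
z = latent.sample(64)   # feed z to any generator; gradients reach mu and sigma
```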
We subsequently present an approach for synthesizing minimally discriminative sketch-based
object representations, which we term category-epitomes. The synthesis procedure concurrently
provides a natural measure for quantifying the sparseness underlying the original sketch, which
we term epitome-score. We show that the category-level distribution of epitome-scores can be
used to characterize the level of detail generally required for recognizing object categories.
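One way to picture the idea, assuming a greedy stroke-dropping procedure and a score defined as the fraction of strokes removed; both are simplifications for illustration, not the procedure developed in the thesis.

```python
def category_epitome(strokes, classifier, category):
    """Greedily drop strokes while a recognizer still predicts the
    correct category; what remains is a minimal, still-recognizable
    depiction. `classifier` is any callable mapping a stroke list to a
    category label. Greedy order and score definition are assumptions."""
    kept = list(strokes)
    for stroke in list(strokes):
        trial = [s for s in kept if s is not stroke]
        if trial and classifier(trial) == category:
            kept = trial                              # stroke was dispensable
    sparseness = 1.0 - len(kept) / len(strokes)       # one possible epitome-score
    return kept, sparseness
```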
On the cognitive process modelling front, we analyze the results of a free-viewing eye fixation
study conducted on freehand sketches. The analysis reveals that eye fixation sequences exhibit
marked consistency within a sketch, across sketches of a category and even across suitably
grouped sets of categories. This multi-level consistency is remarkable given the variability in
depiction and extreme image content sparsity that characterizes hand-drawn object sketches.
We show that the multi-level consistency in the fixation data can be exploited to predict a
sketch's category given only its fixation sequence and to build a computational model which
predicts part-labels underlying the eye fixations on objects.
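A hypothetical minimal version of the first predictor, treating each fixation as an (x, y, duration) triple fed to a sequence model; the encoding and all sizes are assumptions for illustration.

```python
import torch.nn as nn

class FixationCategoryModel(nn.Module):
    """Toy classifier: predict a sketch's category from its fixation
    sequence alone. Input encoding and sizes are assumptions."""
    def __init__(self, num_categories: int, hidden: int = 64):
        super().__init__()
        self.rnn = nn.LSTM(input_size=3, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_categories)

    def forward(self, fixations):        # fixations: (B, T, 3) = (x, y, duration)
        _, (h, _) = self.rnn(fixations)
        return self.head(h[-1])          # category logits from final state
```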
The ability of machine-based agents to play games in human-like fashion is considered a
benchmark of progress in AI. Motivated by this observation, we introduce the first computational
model for Pictionary, the popular word-guessing social game. As a first step, we introduce
Sketch-QA, an elementary version of the Visual Question Answering task. Styled after Pictionary,
Sketch-QA uses incrementally accumulated sketch stroke sequences as visual data and gathers
open-ended guess-words from human guessers. To mimic humans playing Pictionary, we
propose a deep neural model which generates guess-words in response to temporally evolving
human-drawn sketches. The model even makes human-like mistakes while guessing, reinforcing
its mimicry of human players. We evaluate the model on the large-scale guess-word dataset
generated via the Sketch-QA task and compare it with various baselines. We also conduct a Visual
Turing Test to obtain human impressions of the guess-words generated by humans and our
model. The promising experimental results demonstrate the challenges and opportunities in
building computational models for Pictionary and similarly themed games.
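As a rough structural sketch of such a guesser, assuming (for simplicity) a closed guess-word vocabulary rather than the open-ended guesses gathered in Sketch-QA:

```python
import torch.nn as nn

class GuessWordModel(nn.Module):
    """Toy Pictionary guesser: consume the stroke sequence as it
    accumulates and emit a guess-word distribution at every step.
    The closed vocabulary is a simplifying assumption; Sketch-QA
    gathers open-ended guess-words."""
    def __init__(self, vocab_size: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.GRU(input_size=3, hidden_size=hidden, batch_first=True)
        self.guesser = nn.Linear(hidden, vocab_size)

    def forward(self, strokes):          # strokes: (B, T, 3)
        states, _ = self.encoder(strokes)
        return self.guesser(states)      # guess logits per timestep
```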