Deep Learning for Hand-drawn Sketches: Analysis, Synthesis and Cognitive Process Models
Abstract
Deep Learning-based object category understanding is an important and active area of research
in Computer Vision. Work in this area has predominantly focused on the portion of the
depiction spectrum consisting of photographic images. However, depictions at the other end of
the spectrum, freehand sketches, are a fascinating visual representation worthy of study
in their own right. In this thesis, we present deep-learning approaches for sketch analysis, sketch
synthesis and modelling sketch-driven cognitive processes.
On the analysis front, we first focus on the problem of recognizing hand-drawn line sketches
of objects. We propose a deep Recurrent Neural Network architecture with a novel loss formulation
for sketch object recognition. Our approach achieves state-of-the-art results on a
large-scale sketch dataset. We also show that the inherently online nature of our framework is
especially suitable for on-the-fly recognition of objects as they are being drawn.
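As an illustrative aside, the online nature of such a recurrent recognizer can be sketched in a few lines of PyTorch; the (dx, dy, pen-state) input encoding, layer sizes and per-timestep readout below are assumptions for exposition, not the architecture proposed in the thesis.

```python
import torch
import torch.nn as nn

class OnlineSketchClassifier(nn.Module):
    """Toy recurrent recognizer over (dx, dy, pen-state) stroke points.
    Emits class logits at every timestep, so a guess is available
    mid-drawing. Sizes and input encoding are illustrative assumptions."""
    def __init__(self, num_classes: int, hidden: int = 128):
        super().__init__()
        self.rnn = nn.GRU(input_size=3, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, points):               # points: (batch, T, 3)
        states, _ = self.rnn(points)         # one hidden state per point
        return self.head(states)             # (batch, T, num_classes)

model = OnlineSketchClassifier(num_classes=160)
partial_sketch = torch.randn(1, 25, 3)       # 25 points drawn so far
logits = model(partial_sketch)
print(logits[:, -1].argmax(dim=-1))          # current best guess, mid-drawing
```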
We then move beyond object-level label prediction to the harder problem of parsing
sketched objects, i.e., given a freehand object sketch, determining its salient attributes (e.g.,
category, semantic parts, pose). To this end, we propose SketchParse, the first deep-network
architecture for fully automatic parsing of freehand object sketches. We subsequently demonstrate
SketchParse's abilities (i) on two challenging large-scale sketch datasets, (ii) in parsing
unseen, semantically related object categories and (iii) in improving fine-grained sketch-based image
retrieval. As a novel application, we also illustrate how SketchParse's output can be used
to generate caption-style descriptions for hand-drawn sketches.
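A minimal sketch of the multi-attribute idea, assuming a shared convolutional backbone with one dense head for part labels and one global head for pose; all layer choices below are placeholders, not the SketchParse architecture described in the thesis.

```python
import torch.nn as nn

class ToySketchParser(nn.Module):
    """Shared backbone with attribute-specific heads: per-pixel part
    labels plus a global pose label. Purely illustrative sizes."""
    def __init__(self, num_parts: int, num_poses: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.part_head = nn.Conv2d(64, num_parts, 1)   # dense: label per pixel
        self.pose_head = nn.Linear(64, num_poses)      # global: one label per sketch

    def forward(self, sketch):                         # sketch: (B, 1, H, W)
        f = self.backbone(sketch)
        parts = self.part_head(f)                      # (B, num_parts, H, W)
        pose = self.pose_head(f.mean(dim=(2, 3)))      # global average pool
        return parts, pose
```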
On the synthesis front, we design generative models for sketches via Generative Adversarial
Networks (GANs). Keeping the limited size of sketch datasets in mind, we propose DeLiGAN,
a novel architecture for diverse and limited training data scenarios. In our approach, we
reparameterize the latent generative space as a mixture model and learn the mixture model's
parameters along with those of the GAN. This seemingly simple modification to the vanilla GAN
framework is surprisingly effective and yields models that produce diverse samples despite
being trained on limited data. We show that DeLiGAN generates diverse samples
not just for hand-drawn sketches but for other image modalities as well. To quantitatively
characterize the intra-class diversity of generated samples, we also introduce a modified version of
"inception-score", a measure which has been found to correlate well with human assessment of
generated samples.
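The reparameterization itself is compact enough to sketch directly: draw a mixture component uniformly at random and sample z = mu_i + sigma_i * eps, so that gradients flow into the mixture parameters. The component count and initialization below are illustrative choices, not the settings used in the thesis.

```python
import torch
import torch.nn as nn

class MixtureLatent(nn.Module):
    """Mixture-of-Gaussians latent space, learned jointly with the GAN:
    z = mu_i + sigma_i * eps with component i drawn uniformly.
    Component count and initialization are illustrative choices."""
    def __init__(self, num_components: int = 50, latent_dim: int = 100):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(num_components, latent_dim))
        self.sigma = nn.Parameter(torch.full((num_components, latent_dim), 0.2))

    def sample(self, batch_size: int):
        idx = torch.randint(0, self.mu.size(0), (batch_size,))
        eps = torch.randn(batch_size, self.mu.size(1))
        return self.mu[idx] + self.sigma[idx] * eps   # differentiable in mu, sigma

latent = MixtureLatent()
z = latent.sample(64)   # feed z to any generator; gradients reach mu and sigma
```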
We subsequently present an approach for synthesizing minimally discriminative sketch-based
object representations, which we term category-epitomes. The synthesis procedure concurrently
provides a natural measure for quantifying the sparseness underlying the original sketch, which
we term epitome-score. We show that the category-level distribution of epitome-scores can be
used to characterize the level of detail generally required for recognizing object categories.
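One way to picture the idea, assuming a greedy stroke-dropping procedure and a score defined as the fraction of strokes removed; both are simplifications for illustration, not the procedure developed in the thesis.

```python
def category_epitome(strokes, classifier, category):
    """Greedily drop strokes while a recognizer still predicts the
    correct category; what remains is a minimal, still-recognizable
    depiction. `classifier` is any callable mapping a stroke list to a
    category label. Greedy order and score definition are assumptions."""
    kept = list(strokes)
    for stroke in list(strokes):
        trial = [s for s in kept if s is not stroke]
        if trial and classifier(trial) == category:
            kept = trial                              # stroke was dispensable
    sparseness = 1.0 - len(kept) / len(strokes)       # one possible epitome-score
    return kept, sparseness
```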
On the cognitive process modelling front, we analyze the results of a free-viewing eye fixation
study conducted on freehand sketches. The analysis reveals that eye fixation sequences exhibit
marked consistency within a sketch, across sketches of a category and even across suitably
grouped sets of categories. This multi-level consistency is remarkable given the variability in
depiction and extreme image content sparsity that characterizes hand-drawn object sketches.
We show that the multi-level consistency in the fixation data can be exploited to predict a
sketch's category given only its fixation sequence and to build a computational model which
predicts part-labels underlying the eye fixations on objects.
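A hypothetical minimal version of the first predictor, treating each fixation as an (x, y, duration) triple fed to a sequence model; the encoding and all sizes are assumptions for illustration.

```python
import torch.nn as nn

class FixationCategoryModel(nn.Module):
    """Toy classifier: predict a sketch's category from its fixation
    sequence alone. Input encoding and sizes are assumptions."""
    def __init__(self, num_categories: int, hidden: int = 64):
        super().__init__()
        self.rnn = nn.LSTM(input_size=3, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_categories)

    def forward(self, fixations):        # fixations: (B, T, 3) = (x, y, duration)
        _, (h, _) = self.rnn(fixations)
        return self.head(h[-1])          # category logits from final state
```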
The ability of machine-based agents to play games in human-like fashion is considered a
benchmark of progress in AI. Motivated by this observation, we introduce the first computational
model for Pictionary, the popular word-guessing social game. As a first step, we introduce
Sketch-QA, an elementary version of the Visual Question Answering task. Styled after Pictionary,
Sketch-QA uses incrementally accumulated sketch stroke sequences as visual data and gathers
open-ended guess-words from human guessers. To mimic humans playing Pictionary, we
propose a deep neural model which generates guess-words in response to temporally evolving
human-drawn sketches. The model even makes human-like mistakes while guessing, reinforcing
its mimicry of human players. We evaluate the model on the large-scale guess-word dataset
generated via the Sketch-QA task and compare it with various baselines. We also conduct a Visual
Turing Test to obtain human impressions of the guess-words generated by humans and our
model. The promising experimental results demonstrate the challenges and opportunities in
building computational models for Pictionary and similarly themed games.
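As a rough structural sketch of such a guesser, assuming (for simplicity) a closed guess-word vocabulary rather than the open-ended guesses gathered in Sketch-QA:

```python
import torch.nn as nn

class GuessWordModel(nn.Module):
    """Toy Pictionary guesser: consume the stroke sequence as it
    accumulates and emit a guess-word distribution at every step.
    The closed vocabulary is a simplifying assumption; Sketch-QA
    gathers open-ended guess-words."""
    def __init__(self, vocab_size: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.GRU(input_size=3, hidden_size=hidden, batch_first=True)
        self.guesser = nn.Linear(hidden, vocab_size)

    def forward(self, strokes):          # strokes: (B, T, 3)
        states, _ = self.encoder(strokes)
        return self.guesser(states)      # guess logits per timestep
```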