Landmark Estimation and Image Synthesis Guidance using Self-Supervised Networks
Abstract
The exponential rise in the availability of data over the past decade has fuelled research in deep learning. While supervised deep learning models achieve near-human performance using annotated data, they come at the additional cost of annotation, and annotations may be ambiguous due to human error. Whereas an image classification task assigns a single label to the whole image, increasing the granularity of the task to landmark estimation requires the annotator to pinpoint each landmark accurately. The self-supervised learning (SSL) paradigm overcomes these concerns by using pretext-task-based objectives to learn from large-scale unannotated data. In this work, we show how to extract relevant signals from pretrained self-supervised networks for (a) a discriminative task of landmark estimation under limited annotations, and (b) improving the perceptual quality of images generated by generative adversarial networks.
In the first part, we demonstrate the emergent correspondence-tracking properties of the non-contrastive SSL framework. Using this as supervision, we propose LEAD, an approach to discover landmarks from an unannotated collection of category-specific images. Existing works in self-supervised landmark detection learn dense (pixel-level) feature representations from an image, which are then used to learn landmarks in a semi-supervised manner. While there have been advances in self-supervised learning of image features for instance-level tasks like classification, these methods do not ensure dense equivariant representations. Equivariance is of particular interest for dense prediction tasks like landmark estimation. In this work, we introduce an approach to enhance the learning of dense equivariant representations in a self-supervised fashion. We follow a two-stage training approach: first, we train a network using the BYOL objective, which operates at the instance level; the correspondences obtained through this network are then used to train a dense and compact representation of the image using a lightweight network (both stages are sketched below). We show that having such a prior in the feature extractor helps in landmark detection even under a drastically limited number of annotations, while also improving generalization across scale variations.
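To make the two-stage recipe concrete, the following is a minimal PyTorch sketch. The names (byol_loss, dense_distillation_loss) and the affinity-matching form of the second loss are illustrative assumptions, not the exact LEAD implementation: stage one trains with BYOL's negative-cosine objective at the instance level, and stage two distills the resulting pixel-level correspondences from the frozen stage-one network into a lightweight dense network.

    import torch
    import torch.nn.functional as F

    def byol_loss(online_pred, target_proj):
        # Stage 1 (instance level): BYOL's negative-cosine objective between
        # the online network's prediction for one augmented view and the
        # stop-gradient target projection of the other view.
        online_pred = F.normalize(online_pred, dim=-1)
        target_proj = F.normalize(target_proj.detach(), dim=-1)
        return 2 - 2 * (online_pred * target_proj).sum(dim=-1).mean()

    def dense_distillation_loss(teacher_feats, student_feats, tau=0.1):
        # Stage 2 (pixel level): the frozen stage-1 network (teacher) supplies
        # pixel-to-pixel correspondences that supervise a lightweight dense
        # network (student). Feature maps: (B, C, H, W) -> (B, H*W, C).
        t = F.normalize(teacher_feats.flatten(2).transpose(1, 2), dim=-1)
        s = F.normalize(student_feats.flatten(2).transpose(1, 2), dim=-1)
        # Soft correspondence (affinity) matrices over all pixel pairs.
        t_aff = torch.softmax(t @ t.transpose(1, 2) / tau, dim=-1)
        s_aff = torch.log_softmax(s @ s.transpose(1, 2) / tau, dim=-1)
        # Cross-entropy between teacher and student affinity distributions.
        return -(t_aff * s_aff).sum(dim=-1).mean()

The key point the sketch illustrates is that the correspondences come for free from the stage-one network, so the dense network is trained without any landmark annotations.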
Next, we utilize the rich feature space of the SSL framework as a “naturalness” prior to alleviate unnatural image generation from Generative Adversarial Networks (GANs), a popular class of generative models. Progress in GANs has enabled the generation of high-resolution photorealistic images of astonishing quality. StyleGANs allow compelling attribute modification on such images via mathematical operations on the latent style vectors in the W/W+ space, which effectively modulate the rich hierarchical representations of the generator. Such operations have since been generalized beyond the attribute swapping of the original StyleGAN paper to include interpolations. Despite many significant improvements, StyleGANs are still seen to generate unnatural images. The quality of the generated images is a function of (a) the richness of the hierarchical representations learned by the generator, and (b) the linearity and smoothness of the style spaces. In this work, we propose the Hierarchical Semantic Regularizer (HSR), which aligns the hierarchical representations learned by the generator to the corresponding powerful features learned by networks pretrained on large amounts of data. HSR improves not only the generator representations but also the linearity and smoothness of the latent style spaces, leading to the generation of more natural-looking style-edited images. To demonstrate the improved linearity, we propose a novel metric: the Attribute Linearity Score (ALS). A significant reduction in the generation of unnatural images is corroborated by a 15% improvement in the Perceptual Path Length (PPL) metric across different standard datasets, while simultaneously improving the linearity of attribute change in attribute-editing tasks. Sketches of an HSR-style regularizer and an ALS-style measurement follow.
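The following is a hedged PyTorch sketch of an HSR-style regularizer, assuming a generator that exposes intermediate feature maps and a frozen pretrained network whose features serve as alignment targets; all names (HSRRegularizer, gen_feats, and so on) are illustrative rather than taken from the released code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HSRRegularizer(nn.Module):
        def __init__(self, gen_channels, tgt_channels):
            super().__init__()
            # Lightweight 1x1 convs map each generator feature map into the
            # pretrained feature space so the two can be compared.
            self.predictors = nn.ModuleList(
                nn.Conv2d(gc, tc, kernel_size=1)
                for gc, tc in zip(gen_channels, tgt_channels)
            )

        def forward(self, gen_feats, tgt_feats):
            # gen_feats / tgt_feats: lists of (B, C_i, H_i, W_i) feature maps
            # from matching depths of the generator and the frozen pretrained
            # network (the latter run on the generated image).
            loss = 0.0
            for proj, g, t in zip(self.predictors, gen_feats, tgt_feats):
                g = proj(g)
                if g.shape[-2:] != t.shape[-2:]:
                    g = F.interpolate(g, size=t.shape[-2:], mode='bilinear',
                                      align_corners=False)
                # Cosine matching of channel-normalized features; the target
                # network stays frozen (detach).
                loss = loss + (1.0 - F.cosine_similarity(g, t.detach(), dim=1)).mean()
            return loss / len(self.predictors)

In the same hedged spirit, an ALS-style measurement can be sketched as the deviation of attribute scores along a straight-line traversal of the style space from the linear reference between the endpoint scores; generator and attr_clf are assumed callables, and lower values indicate a more linear space.

    def attribute_linearity_score(w1, w2, generator, attr_clf, steps=8):
        # Interpolate between two style vectors and score attributes along
        # the path with a pretrained attribute classifier.
        alphas = torch.linspace(0.0, 1.0, steps + 1)
        scores = torch.stack([
            attr_clf(generator((1 - a) * w1 + a * w2)) for a in alphas
        ])  # (steps + 1, B, num_attributes)
        # Linear reference between the endpoint attribute scores.
        ref = ((1 - alphas).view(-1, 1, 1) * scores[0]
               + alphas.view(-1, 1, 1) * scores[-1])
        # Mean absolute deviation from the linear path.
        return (scores - ref).abs().mean()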