Learning to Perceive Humans From Appearance and Pose
Abstract
Analyzing humans and their activities plays a central role in computer vision, and requires machine learning models to capture both the diverse poses and the diverse appearances that humans exhibit. Estimating the 3D pose of a highly deformable human body from monocular RGB images remains an important, challenging, and unsolved problem, with applications in human-robot interaction, augmented reality, and gaming. Another important task is identifying the same human targets across camera viewpoints in a wide-area video-surveillance setup, which requires learning discriminative and robust representations of human appearance under large variations in pose, background, and illumination. In this thesis, we study several computer vision problems under the theme of estimating human pose and modeling human appearance from monocular images.
Estimating 3D pose from a single image is a classical, ill-posed inverse problem because the image lacks depth information. Supervised approaches perform well in this setting by guiding the model towards plausible poses, but assuming the availability of a labeled 3D dataset is itself impractical, and such approaches also tend to generalize poorly to unseen datasets. We therefore formulate the problem as an unsupervised learning task and propose a novel framework in which a series of differentiable transformations acts as a suitable bottleneck, encouraging effective pose disentanglement. Furthermore, the proposed adaptation technique enables learning from in-the-wild videos beyond laboratory settings, resulting in superior generalizability across diverse and unseen environments.
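To make the idea of a differentiable geometric bottleneck concrete, the following is a minimal sketch, not the thesis architecture: it assumes 2D keypoint inputs, a weak-perspective camera, and self-supervision obtained by reprojecting the predicted 3D pose back onto the input keypoints. All names (e.g., Lifter) and dimensions are illustrative.

```python
# Minimal sketch: a differentiable geometric bottleneck for unsupervised 3D lifting.
# Assumptions (not from the thesis): J detected 2D keypoints per image, a
# weak-perspective camera, and a reprojection loss as the training signal.
import torch
import torch.nn as nn

J = 17  # number of body joints (assumed)

class Lifter(nn.Module):
    """Encodes 2D keypoints into a canonical 3D pose plus camera parameters."""
    def __init__(self, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * J, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 * J + 3),  # 3D joints + (scale, tx, ty)
        )

    def forward(self, kp2d):                      # kp2d: (B, J, 2)
        out = self.net(kp2d.flatten(1))
        pose3d = out[:, :3 * J].view(-1, J, 3)    # canonical 3D pose
        cam = out[:, 3 * J:]                      # weak-perspective camera
        return pose3d, cam

def project(pose3d, cam):
    """Differentiable weak-perspective projection acting as the bottleneck."""
    scale = cam[:, :1].unsqueeze(1)               # (B, 1, 1)
    trans = cam[:, 1:].unsqueeze(1)               # (B, 1, 2)
    return scale * pose3d[..., :2] + trans        # drop depth, scale, translate

# Self-supervised reprojection loss: the 3D estimate must explain the 2D input.
lifter = Lifter()
kp2d = torch.randn(8, J, 2)                       # stand-in for detected keypoints
pose3d, cam = lifter(kp2d)
loss = nn.functional.mse_loss(project(pose3d, cam), kp2d)
loss.backward()
```

Because the projection is differentiable, the 2D reconstruction error can be backpropagated through the 3D pose, so the bottleneck encourages the network to commit to a plausible 3D interpretation without 3D labels.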
Models that estimate only 3D pose discard other variations of the human body, such as shape and appearance, which could help solve related tasks such as body-part segmentation. As a next step, we design a single part-based 2D puppet model that relies on human pose articulation constraints and a set of unpaired 3D poses to estimate both 3D poses and part segments from human-centric images. Unlike our previous work, the proposed part-based model allows us to operate on videos with diverse camera movements.
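The following is a toy illustration of how articulation constraints can be built into such a puppet by construction; the 6-joint skeleton, parent indices, and bone lengths are invented for the example and do not reflect the actual part-based model.

```python
# Minimal sketch of a differentiable 2D "puppet": joint locations are produced by
# forward kinematics from per-bone angles, so any prediction respects articulation
# constraints by construction. The skeleton below is illustrative only.
import torch

# A toy 6-joint chain: pelvis -> spine -> neck -> head, plus two arm joints.
PARENT = [-1, 0, 1, 2, 1, 4]                         # parent index per joint (assumed)
BONE_LEN = torch.tensor([0.0, 0.5, 0.4, 0.2, 0.3, 0.3])

def forward_kinematics(angles, root):
    """angles: (B, 6) bone rotations in radians; root: (B, 2) pelvis position."""
    joints = [root]                                   # joint 0 sits at the root
    abs_angles = [torch.zeros_like(angles[:, 0])]
    for j in range(1, len(PARENT)):
        p = PARENT[j]
        a = abs_angles[p] + angles[:, j]              # accumulate rotation along the chain
        abs_angles.append(a)
        offset = BONE_LEN[j] * torch.stack([torch.cos(a), torch.sin(a)], dim=-1)
        joints.append(joints[p] + offset)
        # A body part (e.g., a mask spanning joints[p] and joints[j]) could be
        # rendered here, yielding part segments tied to the same articulated pose.
    return torch.stack(joints, dim=1)                 # (B, 6, 2)

pose2d = forward_kinematics(torch.zeros(4, 6), torch.zeros(4, 2))
```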
The approaches above cast 3D pose estimation as a task of disentangling human pose and appearance. In our subsequent work, we instead cast 3D pose learning as a cross-modal alignment problem. We assume access to unpaired pools of short natural action videos (the input modality) and 3D pose sequences (the output modality). We introduce a novel technique for self-supervised alignment across these modalities that preserves higher-order non-local relations in a pre-learned latent pose space, attaining superior generalizability over the state of the art.
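One common way to encode "preserve non-local relations" is to require that the pairwise relations within a batch agree across two embeddings of the same samples. The sketch below illustrates such a relation-consistency loss under that assumption; it is not the exact alignment objective used in the thesis.

```python
# Minimal sketch of a relation-preserving alignment loss. Assumption (not stated
# in the abstract): each sample has two latents -- one from the video branch and
# one from the pre-learned latent pose space -- and alignment asks the batch-wise
# (non-local) relation maps of the two to agree.
import torch
import torch.nn.functional as F

def relation_matrix(z):
    """Pairwise cosine similarities within a batch: a 'non-local' relation map."""
    z = F.normalize(z, dim=-1)
    return z @ z.t()                                  # (B, B)

def relation_consistency_loss(z_video, z_pose):
    """Penalize disagreement between the relation maps of the two modalities."""
    return F.mse_loss(relation_matrix(z_video), relation_matrix(z_pose))

z_video = torch.randn(16, 128, requires_grad=True)    # video-branch latents
z_pose = torch.randn(16, 128)                         # pre-learned pose latents
loss = relation_consistency_loss(z_video, z_pose)
loss.backward()
```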
Unsupervised person re-identification (re-ID) aims to match identities across non-overlapping cameras without assuming any labels during training. We propose a two-stage training strategy for this task. First, we train a deep network on a carefully designed pose-transformed dataset, obtained by generating multiple perturbations in the pose space for each original image. Next, using the proposed discriminative clustering algorithm, the network learns to satisfy the fundamental requirements of feature learning: compact clusters with low intra-cluster and high inter-cluster variation, thereby mapping similar features closer together. Experiments on large-scale re-ID datasets demonstrate the superiority of our method over state-of-the-art approaches.
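As a rough illustration of the clustering objective described above, the sketch below pulls each feature toward its assigned centroid (low intra-cluster variation) while pushing it away from all other centroids (high inter-cluster variation). The use of k-means pseudo-labels and a temperature-scaled softmax is an assumption for illustration, not the thesis recipe.

```python
# Minimal sketch of a discriminative clustering loss for unsupervised re-ID.
# Assumptions: cluster assignments (pseudo-labels) and centroids come from an
# off-the-shelf clustering step such as k-means on current embeddings.
import torch
import torch.nn.functional as F

def discriminative_cluster_loss(features, centroids, pseudo_labels, temperature=0.1):
    """Cross-entropy over centroid similarities: attracts a feature to its own
    centroid and repels it from every other centroid."""
    features = F.normalize(features, dim=-1)
    centroids = F.normalize(centroids, dim=-1)
    logits = features @ centroids.t() / temperature   # (N, K) similarities
    return F.cross_entropy(logits, pseudo_labels)

feats = torch.randn(64, 256, requires_grad=True)      # embeddings of unlabeled images
cents = torch.randn(10, 256)                          # e.g., k-means centroids
labels = torch.randint(0, 10, (64,))                  # cluster assignments (pseudo-labels)
loss = discriminative_cluster_loss(feats, cents, labels)
loss.backward()
```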