A Study on Deep Learning Approaches, Architectures and Training Methods for Crowd Analysis
Abstract
Analyzing large crowds quickly is a highly sought-after capability, especially for public security and planning. However, automated reasoning over crowd images or videos is a challenging Computer Vision task. The difficulty is so extreme in dense crowds that the task is typically narrowed down to estimating the number of people. Since the count or distribution of people in a scene is itself valuable information, this field of research has gained traction. The difficulty stems mostly from the drastic variability in crowd density, as any prospective approach has to scale across crowds formed by a few tens to many thousands of people. This results in large diversity in the way people appear in crowded scenes: in highly dense crowds, people are often seen only as blobs, whereas facial or body features might be visible in less dense gatherings. Hence, the visibility and scale of features useful for discriminating people vary drastically with the density of the crowd. Severe occlusion, pose changes and view-point variations further compound the problem. Typical head or body detection based methods fail to adapt to such huge diversity, paving the way for simpler crowd density regression models. Added to these is the practical difficulty of annotating millions of head locations in dense crowds, which makes creating large-scale labeled crowd data expensive and directly takes a toll on the performance of existing CNN-based counting models.
Given these challenges, this thesis tackles the problem of crowd counting from multiple perspectives. A detailed inquiry is made into the three major issues: diversity, data scarcity and localization.
**Addressing Diversity**: First, the diversity issue is considered, as it causes significant prediction errors when models fail to scale across density categories. In such diverse scenarios, discriminating persons requires larger spatial context and scene-level semantics rather than local crowd patterns alone. A set of brain-inspired top-down feedback connections from high-level layers is proposed; this feedback is shown to deliver global context to the initial layers of the CNN and to help correct prediction errors iteratively. Next, an alternative mixture-of-experts approach is devised, where a differential training regime jointly clusters and fine-tunes a set of experts to capture the huge diversity seen in crowd images. This approach yields a significant boost in counting performance, as different regions of an image are processed by the appropriate expert regressor based on the local density. Further improvement is obtained through a growing CNN that progressively increases its capacity depending on the diversity exhibited by the given crowd dataset.
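To make the mixture-of-experts idea concrete, the following PyTorch sketch shows one minimal way such a model could be organized: a small switch classifier routes each image patch to one of several density-specific expert regressors. All module names, layer sizes and the hard-routing scheme are illustrative assumptions and do not reproduce the thesis architecture or its differential training regime.

```python
# Minimal mixture-of-experts sketch for patch-level crowd density regression
# (illustrative only; not the thesis architecture).
import torch
import torch.nn as nn

class ExpertRegressor(nn.Module):
    """One density regressor; separate experts handle different density levels."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 9, padding=4), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 1),  # single-channel density map
        )

    def forward(self, x):
        return self.net(x)

class SwitchClassifier(nn.Module):
    """Predicts which expert should process a given patch."""
    def __init__(self, num_experts=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(16, num_experts)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))  # logits over experts

class MixtureOfExperts(nn.Module):
    def __init__(self, num_experts=3):
        super().__init__()
        self.switch = SwitchClassifier(num_experts)
        self.experts = nn.ModuleList([ExpertRegressor() for _ in range(num_experts)])

    def forward(self, patches):
        # Hard routing: each patch goes to the expert with the highest switch score.
        idx = self.switch(patches).argmax(dim=1)
        return torch.cat([self.experts[int(i)](p.unsqueeze(0))
                          for i, p in zip(idx, patches)])

# Usage: the count of a patch is the sum over its predicted density map.
model = MixtureOfExperts()
patches = torch.rand(4, 3, 112, 112)   # a batch of crowd patches
density = model(patches)               # (4, 1, 112, 112)
counts = density.sum(dim=(1, 2, 3))
```

In the differential training described above, expert assignment and expert fine-tuning would be performed jointly, so that each expert gradually specializes on the density range it handles best.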
**Addressing Data Scarcity**: Dense crowd counting demands millions of head annotations for training models. This annotation difficulty can be mitigated using a Grid Winner-Take-All autoencoder, which is designed to learn almost 99% of the model parameters from unlabeled crowd images. The model achieves superior results compared to other unsupervised methods and beats fully supervised baselines in limited-data scenarios. In an alternate approach, a binary labeling scheme is conceived: every image is simply labeled as belonging to either the dense or the sparse crowd category, instead of annotating every single person in the scene. This leads to a dramatic reduction in the amount of annotation required and delivers good performance at a very low labeling cost. The objective is pushed further to fully eliminate the dependency on instance-level labeled data. The proposed completely self-supervised architecture does not require any annotation for training, but instead uses a distribution matching technique to learn the required features. The only input needed for training, apart from a large set of unlabeled crowd images, is the approximate upper limit of the crowd count for the given dataset. Experiments show that the model learns crowd features effectively and delivers strong counting performance; its superiority is further established in limited-data settings.
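The central ingredient of a Grid Winner-Take-All autoencoder is a sparsity constraint applied over grid cells of the feature map, so that only the strongest activation in each cell survives. The PyTorch sketch below shows one possible form of such an operation; the cell size, tensor shapes and the surrounding autoencoder are assumptions made for illustration rather than the exact formulation used in the thesis.

```python
# Illustrative Grid Winner-Take-All (GWTA) operation: within every
# (cell x cell) block of a feature map, keep only the maximum activation
# and zero the rest (assumed formulation for illustration).
import torch
import torch.nn.functional as F

def grid_wta(features: torch.Tensor, cell: int = 8) -> torch.Tensor:
    """Zero all activations in each (cell x cell) block except the winner.

    features: (N, C, H, W) with H and W divisible by `cell`.
    """
    n, c, h, w = features.shape
    # Unfold each grid cell into its own column so the per-cell max is easy to take.
    cols = F.unfold(features.reshape(n * c, 1, h, w), kernel_size=cell, stride=cell)
    # cols: (N*C, cell*cell, num_cells); mark the winning position in every cell.
    mask = (cols == cols.max(dim=1, keepdim=True).values).float()
    # Fold the masked columns back to the original spatial layout (non-overlapping).
    sparse = F.fold(cols * mask, output_size=(h, w), kernel_size=cell, stride=cell)
    return sparse.reshape(n, c, h, w)

# Usage: apply GWTA to encoder features before reconstruction, so only the
# winning activations carry the reconstruction error during unsupervised training.
feats = torch.rand(2, 16, 64, 64)
sparse_feats = grid_wta(feats, cell=8)
assert sparse_feats.shape == feats.shape
```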
**Addressing Localization**: Typical counting models predict crowd density for an image rather than detecting every person. These regression methods, in general, fail to localize persons accurately enough for most applications other than counting. Hence, two detection frameworks for dense crowd counting are developed, obviating the need for the prevalent density regression paradigm. The first approach reformulates the task as localized dot prediction in dense crowds, where the model is trained for pixel-wise binary classification to pinpoint people instead of regressing local crowd density. The second dense detection architecture not only locates persons but also sizes the detected heads with bounding boxes. This approach detects individual persons consistently across the diversity spectrum. Moreover, the improved localization is achieved without requiring any additional bounding box annotations for training.
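As an illustration of the dot-prediction formulation, the sketch below trains a tiny fully convolutional head with a pixel-wise binary classification loss against a point-annotated dot map. The backbone, the positive-class weighting and the decision threshold are assumptions made for the sake of the example, not the models developed in the thesis.

```python
# Illustrative dot prediction as dense binary classification: a per-pixel
# head predicts whether a person (dot) is present; positives are rare,
# so the loss is reweighted (assumed setup for illustration).
import torch
import torch.nn as nn

class DotPredictor(nn.Module):
    """Tiny fully convolutional head producing one logit per pixel."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 1),  # per-pixel logit for "head present"
        )

    def forward(self, x):
        return self.net(x)

model = DotPredictor()
image = torch.rand(1, 3, 128, 128)
dot_map = torch.zeros(1, 1, 128, 128)   # ground-truth dots from point annotations
dot_map[0, 0, 40, 52] = 1.0             # one annotated head (illustrative)

# Pixel-wise BCE with a large positive weight to counter the extreme class
# imbalance between head pixels and the vast background.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(100.0))
loss = criterion(model(image), dot_map)
loss.backward()

# At inference, thresholded sigmoid outputs give head locations; the count is
# the number of predicted dots rather than an integrated density map.
with torch.no_grad():
    num_people = (torch.sigmoid(model(image)) > 0.5).sum()
```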