Visual Flow Analysis and Saliency Prediction
Abstract
Millions of cameras now operate in public places such as traffic junctions and railway stations, capturing video data around the clock. This enormous volume of data has increased the need for automated visual surveillance. Analysis of crowd and traffic flows is an important step towards this goal. In the first part of this work, we present our algorithms for identifying and segmenting dominant flows in surveillance scenarios. In the second part, we present our work on predicting visual saliency. The ability of humans to discriminate between regions of a scene and selectively attend to a few of them is a key attentional mechanism. Here, we present our algorithms for predicting human eye fixations and segmenting salient objects in the scene.
(i) Flow Analysis in Surveillance Videos: We propose unsupervised algorithms for segmenting flows of static and dynamic nature in surveillance videos. In static flow scenarios, we assume the motion patterns to be consistent over the entire duration of the video and analyze them in the compressed domain using H.264 motion vectors. Our approach models the motion vector field as a Conditional Random Field (CRF) to obtain oriented motion segments, which are then merged into the final flow segments. This compressed-domain approach is shown to be both accurate and computationally efficient. In the case of dynamic flow videos (e.g. flows at a traffic junction), we propose a method for segmenting the individual object flows over long durations. This long-term flow segmentation is achieved in the CRF framework using local color and motion features. We propose a Dynamic Time Warping (DTW) based distance measure between flow segments, which we use to cluster them and generate representative dominant flow models. Using these dominant flow models, we perform path prediction for vehicles entering the camera's field-of-view and detect anomalous motions.
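To illustrate the kind of DTW-based distance mentioned above, the following is a minimal sketch of the standard DTW recurrence. It assumes, purely for illustration, that each flow segment is summarized as a sequence of 2-D points (e.g. centroids over time) compared with Euclidean distance; the function name `dtw_distance` and this representation are assumptions and not necessarily the measure used in this work.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW distance between two point sequences.

    a: array-like of shape (n, d); b: array-like of shape (m, d).
    Assumption: each flow is represented as a sequence of d-dimensional points.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            # extend the cheapest of: step in a only, step in b only, step in both
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# Hypothetical flows: the same path sampled at different rates, and a shifted path.
flow_a = [[0, 0], [1, 0], [2, 0]]
flow_b = [[0, 0], [0, 0], [1, 0], [2, 0]]  # flow_a with a repeated first sample
flow_c = [[0, 1], [1, 1], [2, 1]]          # flow_a shifted by one unit
print(dtw_distance(flow_a, flow_b), dtw_distance(flow_a, flow_c))
```

Because DTW allows samples to be repeated during alignment, two flows that follow the same path at different speeds stay close, which is a useful property when clustering flows of varying duration.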
(ii) Visual Saliency Prediction using Deep Convolutional Neural Networks: We propose DeepFix, a deep fully convolutional neural network (CNN) for accurately predicting eye fixations in the form of saliency maps. Unlike classical works that characterize the saliency map using various hand-crafted features, our model automatically learns features in a hierarchical fashion and predicts the saliency map in an end-to-end manner. DeepFix is designed to capture visual semantics at multiple scales while taking global context into account. Generally, fully convolutional nets are spatially invariant, which prevents them from modeling location-dependent patterns (e.g. centre-bias). Our network overcomes this limitation by incorporating a novel Location Biased Convolutional layer. We experimentally show that our network outperforms other recent approaches by a significant margin.
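The spatial-invariance limitation can be made concrete with a small sketch. The code below assumes one plausible way of making a convolution location-aware: appending a fixed, position-dependent channel (here a Gaussian centre prior) to the input before convolving. The helper names (`conv2d_same`, `centre_prior`, `location_biased_conv`), the Gaussian form, and the `sigma` value are illustrative assumptions and should not be read as the exact design of the Location Biased Convolutional layer.

```python
import numpy as np

def conv2d_same(x, w):
    """Naive 'same' cross-correlation. x: (C, H, W), w: (C, k, k) with odd k."""
    _, h, wd = x.shape
    k = w.shape[-1]
    p = k // 2
    # Edge padding so that a constant input gives a constant output everywhere.
    xp = np.pad(x, ((0, 0), (p, p), (p, p)), mode="edge")
    out = np.zeros((h, wd))
    for i in range(h):
        for j in range(wd):
            out[i, j] = np.sum(xp[:, i:i + k, j:j + k] * w)
    return out

def centre_prior(h, w, sigma=0.3):
    """Fixed Gaussian map peaking at the image centre (assumed form)."""
    ys = np.linspace(-1.0, 1.0, h)[:, None]
    xs = np.linspace(-1.0, 1.0, w)[None, :]
    return np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma ** 2))

def location_biased_conv(x, w_feat, w_loc, sigma=0.3):
    """Convolve features together with an appended location channel."""
    _, h, wd = x.shape
    x_aug = np.concatenate([x, centre_prior(h, wd, sigma)[None]], axis=0)
    w = np.concatenate([w_feat, w_loc[None]], axis=0)
    return conv2d_same(x_aug, w)

# A constant feature map carries no location information.
x = np.ones((2, 7, 7))
w_feat = np.ones((2, 3, 3))
w_loc = np.ones((3, 3))

plain = conv2d_same(x, w_feat)                     # identical at every position
biased = location_biased_conv(x, w_feat, w_loc)    # varies with position
print(np.ptp(plain), biased[3, 3] > biased[0, 0])
```

On the constant input, the plain convolution cannot distinguish the centre from the corners, while the location-augmented version responds more strongly near the centre, since its weights on the extra channel can encode a centre-bias.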
In general, human eye fixations correlate with the locations of salient objects in the scene. However, only a handful of approaches have attempted to address the related problems of eye fixation prediction and object saliency simultaneously. In our work, we also propose a deep convolutional network capable of predicting eye fixations and segmenting salient objects in a unified framework. We design the initial network layers, shared between both tasks, such that they capture the global contextual aspects of saliency, while the deeper layers of the network address task-specific aspects. Our network shows a significant improvement over the current state-of-the-art for both eye fixation prediction and salient object segmentation across a number of challenging datasets.