Bayesian Nonparametric Modeling of Temporal Coherence for Entity-Driven Video Analytics

Mitra, Adway

dc.contributor.advisor	Bhattacharyya, Chiranjib
dc.contributor.author	Mitra, Adway
dc.date.accessioned	2018-05-14T04:28:37Z
dc.date.accessioned	2018-07-31T04:39:09Z
dc.date.available	2018-05-14T04:28:37Z
dc.date.available	2018-07-31T04:39:09Z
dc.date.issued	2018-05-14
dc.date.submitted	2015
dc.identifier.uri	https://etd.iisc.ac.in/handle/2005/3527
dc.identifier.abstract	https://etd.iisc.ac.in/static/etd/abstracts/4395/G27575-Abs.pdf	en_US
dc.description.abstract	In recent times there has been an explosion of online user-generated video content. This has generated significant research interest in video analytics. Human users understand videos based on high-level semantic concepts. However, most current research in video analytics is driven by low-level features and descriptors, which often lack semantic interpretation. Existing attempts in semantic video analytics are specialized and require additional resources like movie scripts, which are not available for most user-generated videos. There are no general-purpose approaches to understanding videos through semantic concepts. In this thesis we attempt to bridge this gap. We view videos as collections of entities, which are semantic visual concepts such as the persons in a movie or the cars in an F1 race video. We focus on two fundamental tasks in video understanding, namely summarization and scene discovery. Entity-driven video summarization and entity-driven scene discovery are important open problems. They are challenging due to the spatio-temporal nature of videos, and also due to the lack of a priori information about entities. We use Bayesian nonparametric methods to solve these problems. In the absence of external resources like scripts, we utilize fundamental structural properties of videos such as temporal coherence, which means that adjacent frames should contain the same set of entities and have similar visual features. There have been no focussed attempts to model this important property. This thesis makes several contributions in Computer Vision and Bayesian nonparametrics by addressing entity-driven video understanding through temporal coherence modeling. Temporal coherence (TC) in a video is observed across its frames at the level of features/descriptors, and also at the semantic level. We start with an attempt to model TC at the level of features/descriptors.
A tracklet is a spatio-temporal fragment of a video: a set of spatial regions in a short sequence (5-20) of consecutive frames, each of which encloses a particular entity. We attempt to find a representation of tracklets to aid tracking of entities. We explore region descriptors like covariance matrices of spatial features in individual frames. Due to temporal coherence, such matrices from corresponding spatial regions in successive frames have nearly identical eigenvectors. We utilize this property to model a tracklet using a covariance matrix, and use it for region-based entity tracking. We propose a new method to estimate such a matrix. Our method is found to be much more efficient and effective than alternative covariance-based methods for entity tracking. Next, we move to modeling temporal coherence at a semantic level, with special emphasis on videos of movies and TV-series episodes. Each tracklet is associated with an entity (say, a particular person). Spatio-temporally close but non-overlapping tracklets are likely to belong to the same entity, while tracklets that overlap in time can never belong to the same entity. Our aim is to cluster the tracklets based on the entities associated with them, with the goal of discovering the entities in a video along with all their occurrences. We argue that Bayesian nonparametrics is the most convenient approach to this task. We propose a temporally coherent version of the Chinese Restaurant Process (TC-CRP) that can encode such constraints easily, discovers pure clusters of tracklets, and also filters out tracklets resulting from false detections. TC-CRP shows excellent performance on person discovery from TV-series videos. We also discuss semantic video summarization, based on entity discovery. Next, we consider entity-driven temporal segmentation of a video into scenes, where each scene is characterized by the entities present in it.
This is a novel application, as existing work on temporal segmentation has focussed on low-level features of frames, rather than entities. We propose EntScene, a generative model for videos based on entities and scenes, and propose an inference algorithm based on Blocked Gibbs Sampling for simultaneous entity discovery and scene discovery. We compare it to alternative inference algorithms, and show significant improvements in terms of segmentation and scene discovery. Video representation by low-rank matrices has gained popularity recently, and has been used for various tasks in Computer Vision. In such a representation, each column corresponds to a frame or a single detection. Such matrices are likely to have contiguous sets of identical columns due to temporal coherence, and hence they should be low-rank. However, we discover that none of the existing low-rank matrix recovery algorithms are able to preserve such structures. We study regularizers to encourage these structures for low-rank matrix recovery through convex optimization, but note that TC-CRP-like Bayesian modeling is better for enforcing them. We then focus our attention on modeling temporal coherence in hierarchically grouped sequential data, such as word-tokens grouped into sentences, paragraphs, documents, etc., in a text corpus. We attempt Bayesian modeling for such data, with application to multi-layer segmentation. We first make a detailed study of existing models for such data. We present a taxonomy for such models called Degree-of-Sharing (DoS), based on how various mixture components are shared by the groups of data in these models. We propose the Layered Dirichlet Process, which generalizes the Hierarchical Dirichlet Process to multiple layers and can also handle sequential information easily through a Markovian approach. This is applied to hierarchical co-segmentation of a set of news transcripts into broad categories (such as politics, sports, etc.) and individual stories.
We also propose an explicit-duration (semi-Markov) approach for this purpose, and provide an efficient inference algorithm for it. We also discuss generative processes for distribution matrices, where each column is a probability distribution. For these, we discuss an application: inferring the correct answers to questions on online answering forums from opinions provided by different users.	en_US
dc.language.iso	en_US	en_US
dc.relation.ispartofseries	G27575	en_US
dc.subject	Video Analytics	en_US
dc.subject	Bayesian Nonparametrics	en_US
dc.subject	Entity Tracking in Videos	en_US
dc.subject	Computer Vision	en_US
dc.subject	Temporal Coherence (Videos)	en_US
dc.subject	Entity-Driven Video Analytics	en_US
dc.subject	Video Tracklets	en_US
dc.subject	Video Modeling and Analysis	en_US
dc.subject	Video Representation	en_US
dc.subject.classification	Computer Science	en_US
dc.title	Bayesian Nonparametric Modeling of Temporal Coherence for Entity-Driven Video Analytics	en_US
dc.type	Thesis	en_US
dc.degree.name	PhD	en_US
dc.degree.level	Doctoral	en_US
dc.degree.discipline	Faculty of Engineering	en_US

Files in this item

Name:: G27575.pdf
Size:: 38.28Mb
Format:: PDF

This item appears in the following Collection(s)

Computer Science and Automation (CSA) [559]

Related items

Perceptual Criterion Based Rate Control And Fast Mode Search For Spatial Intra Prediction In Video Coding

Bitrate Reduction Techniques for Low-Complexity Surveillance Video Coding

Techniques For Low Power Motion Estimation In Video Encoders