
    Label Efficient and Generalizable No-reference Video Quality Assessment

    View/Open
    Thesis full text (25.27Mb)
    Author
    Mitra, Shankhanil
    Abstract
    No-reference (NR) video quality assessment (VQA) refers to the study of the quality of degraded videos without the need for reference pristine videos. The problem has wide applications ranging from the quality assessment of camera-captured videos to other user-generated content such as gaming, animation, and screen sharing. While several successful NR-VQA methods have been designed in the last decade, they all require a large amount of human annotation to learn effective models. This poses significant challenges, since large-scale human subjective studies must be conducted repeatedly as distortions and camera pipelines evolve. In this thesis, we address these problems through label-efficient and generalizable NR-VQA methods.

    We first propose a semi-supervised learning (SSL) framework exploiting a large number of unlabelled videos and a small number of labelled ones. Our main contributions are twofold. Leveraging the benefits of consistency regularization and pseudo-labelling, our SSL model generates reliable pairwise pseudo-ranks for unlabelled video pairs using a student-teacher model on strong-weak augmented videos. We design the strong-weak augmentations to be quality invariant so that the unlabelled videos can be used effectively in SSL. The generated pseudo-ranks are used along with the limited labels to train our SSL model. While our SSL framework improves performance with several existing VQA features, we also present a spatial and temporal feature extraction method based on capturing spatial and temporal entropic differences. We show that these features achieve an even better performance with our SSL framework.

    In the second part, we further improve the features for superior generalization across varied VQA tasks. In this context, we learn self-supervised quality-aware features without relying on any reference videos, human opinion scores, or training videos from the target database. In particular, we present a self-supervised multiview contrastive learning framework to learn spatio-temporal quality representations. We capture the common information between frame differences and frames by treating them as a pair of views, and similarly obtain the shared representations between frame differences and optical flow. Further, we evaluate the self-supervised features in an opinion-unaware setup to test their relevance to VQA. In this regard, we compare the representations of a degraded video given by our module with a corpus of pristine natural video patches to predict the quality of the distorted video. Detailed experiments on multiple camera-captured VQA datasets reveal the superior performance of our method over other features when evaluated without training on human scores.

    To further improve the self-supervised VQA features, we seek to learn spatio-temporal features from video clips instead of merely operating on video frames and frame differences. In particular, we leverage the attention mechanism in 3D transformers to model spatio-temporal dependencies. We first design a self-supervised Spatio-Temporal Visual Quality Representation Learning (ST-VQRL) framework to generate robust quality-aware features using a novel statistical contrastive loss for videos. We then propose a dual-model-based SSL method specifically designed for the video quality assessment (SSL-VQA) task through a novel knowledge transfer of quality predictions between the two models. Despite being learned with limited human-annotated videos, our SSL-VQA method uses the ST-VQRL backbone to produce robust performance across various VQA datasets, including cross-database settings.

    Finally, we address the problem of generalization in VQA. Recent works have shown the remarkable generalizability of text-to-image latent diffusion models (LDMs) for various discriminative computer vision tasks. In this work, we leverage the denoising process of an LDM for generalizable NR-VQA by understanding the degree of alignment between perceptually relevant visual concepts and quality-aware text prompts. Since applying text-to-image LDMs to every video frame is computationally expensive, we estimate the quality of only a frame-rate sub-sampled version of the original video. To compensate for the loss in motion information due to frame-rate sub-sampling, we propose a novel temporal quality modulator (TQM). Our TQM adjusts the quality prediction by computing the cross-attention between the diffusion model's representation and the motion features of the original and sub-sampled videos. Extensive cross-database experiments across various user-generated, frame-rate variation, Ultra-HD, and streaming content-based databases show that our model achieves superior generalization in VQA.
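    The pairwise pseudo-ranking step of the SSL framework can be read as follows: the teacher scores weakly augmented versions of an unlabelled video pair, and confident score differences become pseudo-ranks that supervise the student on strongly augmented versions. The sketch below is a minimal illustration assuming PyTorch and a scalar quality score per video; the margin, confidence threshold, and student/teacher interfaces are hypothetical choices, not the exact formulation in the thesis.

    import torch
    import torch.nn.functional as F

    def pseudo_rank_loss(student, teacher, weak_a, weak_b, strong_a, strong_b,
                         margin=0.5, conf_threshold=0.1):
        """Pairwise pseudo-rank loss for a batch of unlabelled video pairs (a, b).

        The teacher scores weakly augmented versions of both videos; when its
        predictions differ by more than a confidence threshold, the sign of the
        difference is used as a pseudo-rank to supervise the student on the
        strongly augmented versions via a margin ranking loss.
        """
        with torch.no_grad():
            diff = teacher(weak_a) - teacher(weak_b)    # (batch,) quality gaps
            confident = diff.abs() > conf_threshold     # keep reliable pairs only
            pseudo_rank = torch.sign(diff)              # +1: a better, -1: b better

        q_a = student(strong_a)                         # student scores, (batch,)
        q_b = student(strong_b)

        if confident.sum() == 0:
            # no reliable pseudo-ranks in this batch; contribute zero loss
            return q_a.sum() * 0.0

        return F.margin_ranking_loss(q_a[confident], q_b[confident],
                                     pseudo_rank[confident], margin=margin)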
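    Similarly, the multiview contrastive stage can be viewed as a contrastive objective between two views of the same clip, for example frames and frame differences. The sketch below uses a standard InfoNCE-style loss with an assumed temperature and pre-computed encoder outputs; the thesis's statistical contrastive loss and exact view pairings may differ.

    import torch
    import torch.nn.functional as F

    def multiview_contrastive_loss(frame_emb, diff_emb, temperature=0.1):
        """InfoNCE-style loss between frame and frame-difference embeddings.

        Embeddings coming from the same video form a positive pair; all other
        pairs in the batch act as negatives. Both inputs have shape (batch, dim).
        """
        z1 = F.normalize(frame_emb, dim=1)              # frame view
        z2 = F.normalize(diff_emb, dim=1)               # frame-difference view
        logits = z1 @ z2.t() / temperature              # pairwise similarities
        targets = torch.arange(z1.size(0), device=z1.device)
        # symmetric cross-entropy: frames -> differences and differences -> frames
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))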
    URI
    https://etd.iisc.ac.in/handle/2005/6729
    Collections
    • Electrical Communication Engineering (ECE) [402]

    etd@IISc is a joint service of SERC & J R D Tata Memorial (JRDTML) Library || Powered by DSpace software || DuraSpace
    Contact Us | Send Feedback | Thesis Templates
    Theme by Atmire NV