dc.description.abstract | No-reference (NR) video quality assessment (VQA) refers to assessing the quality of degraded videos without access to pristine reference videos. The problem has wide applications, ranging from the quality assessment of camera-captured videos to other
user-generated content such as gaming, animation, and screen sharing. While several
successful NR-VQA methods have been designed in the last decade, they all require a large number of human annotations to learn effective models. This poses significant
challenges, since large-scale human subjective studies must be conducted repeatedly as distortions and camera pipelines evolve. In this thesis, we focus on addressing these
problems through label-efficient and generalizable NR-VQA methods.
We first propose a semi-supervised learning (SSL) framework that exploits a large number of unlabelled videos and a small number of labelled videos. Our main contributions are twofold. Leveraging
the benefits of consistency regularization and pseudo-labelling, our SSL model generates
reliable pairwise pseudo-ranks for unlabelled video pairs using a student-teacher
model on strong-weak augmented videos. We design the strong-weak augmentations to
be quality invariant so that the unlabelled videos can be used effectively in SSL. The generated
pseudo-ranks are used along with the limited labels to train our SSL model. While
our SSL framework helps improve performance with several existing VQA features, we
also present a feature extraction method based on capturing spatial and temporal entropic differences. We show that these features achieve even better performance with our SSL framework.
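To make the pseudo-ranking idea concrete, the following minimal PyTorch sketch has a teacher rank weakly augmented views of an unlabelled video pair and trains a student to reproduce that ranking on the strongly augmented views, alongside regression on the few labelled videos; the feature dimensions, network sizes, and margin-ranking formulation here are illustrative assumptions rather than the thesis implementation.

import torch
import torch.nn as nn

class QualityHead(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.mlp(x).squeeze(-1)

student, teacher = QualityHead(), QualityHead()
teacher.load_state_dict(student.state_dict())  # teacher tracks the student (e.g., via EMA)

def pseudo_rank_loss(weak_a, weak_b, strong_a, strong_b, margin=0.0):
    # Teacher ranks a weakly augmented pair; the student is trained to reproduce
    # that ranking on the strongly augmented views of the same unlabelled videos.
    with torch.no_grad():
        sign = torch.sign(teacher(weak_a) - teacher(weak_b))  # pairwise pseudo-rank
    return nn.functional.margin_ranking_loss(student(strong_a), student(strong_b), sign, margin=margin)

def supervised_loss(feats, mos):
    # Regression on the small labelled set (MOS = mean opinion scores).
    return nn.functional.mse_loss(student(feats), mos)

# One illustrative training step on random stand-in feature tensors.
u_wa, u_wb = torch.randn(8, 256), torch.randn(8, 256)  # weakly augmented pair features
u_sa, u_sb = torch.randn(8, 256), torch.randn(8, 256)  # strongly augmented pair features
l_x, l_y = torch.randn(4, 256), torch.rand(4) * 100    # labelled features and their MOS
loss = supervised_loss(l_x, l_y) + pseudo_rank_loss(u_wa, u_wb, u_sa, u_sb)
loss.backward()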
In the second part, we further improve the features for superior generalization across varied VQA tasks. In this context, we learn self-supervised quality-aware features without
relying on any reference videos, human opinion scores, or training videos from the
target database. In particular, we present a self-supervised multiview contrastive learning
framework to learn spatio-temporal quality representations. We capture the common
information between frame differences and frames by treating them as a pair of views
and similarly obtain the shared representations between frame differences and optical
flow. Further, we evaluate the self-supervised features in an opinion-unaware setup to test
their relevance to VQA. In this regard, we compare the representations of the degraded video given by our module with those of a corpus of pristine natural video patches to predict the quality of the distorted video. Detailed experiments on multiple camera-captured
VQA datasets reveal the superior performance of our method over other features when
evaluated without training on human scores.
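A minimal sketch of the multiview contrastive objective is given below, assuming hypothetical view-specific encoders that embed frames and frame differences into a common space; the same symmetric InfoNCE-style loss would also pair frame differences with optical flow, although the exact loss and encoders used in the thesis may differ.

import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    # Matching frame / frame-difference embeddings of the same clip are positives;
    # all other pairs in the batch act as negatives.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(z1.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Stand-in embeddings from two hypothetical view-specific encoders.
frame_emb = torch.randn(16, 128, requires_grad=True)      # encoder(frames)
framediff_emb = torch.randn(16, 128, requires_grad=True)  # encoder(frame differences)
loss = info_nce(frame_emb, framediff_emb)
loss.backward()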
To further improve the self-supervised VQA features, we seek to learn spatio-temporal
features from video clips instead of merely operating on video frames and frame differences.
In particular, we leverage the benefits of the attention mechanism in 3D transformers
to model spatio-temporal dependencies. Thus, we first design a self-supervised
Spatio-Temporal Visual Quality Representation Learning (ST-VQRL) framework to generate
robust quality-aware features using a novel statistical contrastive loss for videos.
Then, we propose a dual-model based SSL method specifically designed for video quality assessment (SSL-VQA), which performs a novel knowledge transfer of quality predictions between the two models. Despite being learned with limited human-annotated videos, our SSL-VQA method uses the ST-VQRL backbone to produce robust performance across various VQA datasets, including cross-database settings.
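The knowledge-transfer step can be sketched roughly as follows, assuming two quality heads on a shared stand-in backbone and a simple co-training-style consistency term on unlabelled clips; the actual SSL-VQA transfer rule and the ST-VQRL backbone are only approximated here.

import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(512, 256), nn.ReLU())  # stand-in for ST-VQRL features
head_a, head_b = nn.Linear(256, 1), nn.Linear(256, 1)     # the two quality predictors

def step(labelled, mos, unlabelled):
    fl, fu = backbone(labelled), backbone(unlabelled)
    # Supervised term on the limited labelled clips (MOS regression) for both heads.
    sup = nn.functional.mse_loss(head_a(fl).squeeze(-1), mos) \
        + nn.functional.mse_loss(head_b(fl).squeeze(-1), mos)
    # Knowledge transfer: each head treats the other's prediction on the
    # unlabelled clips as a soft target.
    pa, pb = head_a(fu).squeeze(-1), head_b(fu).squeeze(-1)
    transfer = nn.functional.mse_loss(pa, pb.detach()) + nn.functional.mse_loss(pb, pa.detach())
    return sup + transfer

loss = step(torch.randn(4, 512), torch.rand(4) * 100, torch.randn(16, 512))
loss.backward()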
Finally, we address the problem of generalization in VQA. Recent works have shown
the remarkable generalizability of text-to-image latent diffusion models (LDMs) for various
discriminative computer vision tasks. In this part, we leverage the denoising process
of an LDM for generalizable NR-VQA by understanding the degree of alignment
between perceptually relevant visual concepts and quality-aware text prompts. Since applying text-to-image LDMs to every video frame is computationally expensive, we estimate the quality of only a frame-rate sub-sampled version of the original video. To
compensate for the loss in motion information due to frame-rate sub-sampling, we propose
a novel temporal quality modulator (TQM). Our TQM adjusts the quality prediction by computing cross-attention between the diffusion model’s representation and the motion features of the original and sub-sampled videos. Our extensive cross-database
experiments across user-generated, frame-rate-varying, Ultra-HD, and streaming content databases show that our model achieves superior generalization in VQA. | en_US |
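As a rough illustration of the temporal quality modulator, the sketch below cross-attends stand-in LDM frame representations (queries) to motion features of the original and sub-sampled videos (keys and values) and regresses a modulation of the frame-level quality; the tensor shapes, dimensions, and fusion rule are assumptions, not the thesis architecture.

import torch
import torch.nn as nn

class TemporalQualityModulator(nn.Module):
    # Cross-attends LDM representations to motion features and predicts a
    # per-video modulation term for the frame-level quality estimate.
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, ldm_feats, motion_feats):
        fused, _ = self.attn(query=ldm_feats, key=motion_feats, value=motion_feats)
        return self.score(fused.mean(dim=1)).squeeze(-1)

tqm = TemporalQualityModulator()
ldm_feats = torch.randn(2, 8, 256)      # (batch, sub-sampled frames, dim) from the LDM
motion_feats = torch.randn(2, 32, 256)  # (batch, temporal tokens, dim) from original + sub-sampled videos
frame_quality = torch.rand(2)           # stand-in LDM-based frame-level quality scores
video_quality = frame_quality + tqm(ldm_feats, motion_feats)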