Generalizable No-Reference Image Quality Assessment: Multi-Modal Models and Human Preference Analysis for AI-Generated Images
Abstract
One of the major challenges in no-reference (NR) image quality assessment (IQA) is the ability to generalize to diverse quality assessment applications. Recently, multi-modal vision-language models have been found to be very promising in this direction and are beginning to form part of several state-of-the-art NR IQA methods. At the same time, multi-modal large language models (LLMs) are increasingly being studied for various computer vision applications, including IQA. In this work, we perform a thorough study of the ability of multi-modal LLMs to perform NR IQA by training some of their components and testing their generalizability. In particular, we keep the LLM frozen and learn parameters corresponding to the querying transformer, the LLM prompt, and some layers that process the embeddings output by the LLM. We observe that training some of these components yields generalization performance far superior to that of any existing NR IQA algorithm.
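The following is a minimal PyTorch sketch of the training setup described above, not the authors' implementation: the LLM is kept frozen while a querying transformer, a learnable soft prompt, and a small head over the LLM's output embeddings are trained. All module names, stand-in architectures, and dimensions (e.g., vis_dim, llm_dim, num_query_tokens) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FrozenLLMQualityModel(nn.Module):
    def __init__(self, vis_dim=1408, llm_dim=2560, num_query_tokens=32, prompt_len=8):
        super().__init__()
        # Trainable query tokens and querying transformer (stand-in: a small
        # transformer decoder that cross-attends to frozen image features).
        self.query_tokens = nn.Parameter(torch.randn(1, num_query_tokens, llm_dim) * 0.02)
        dec_layer = nn.TransformerDecoderLayer(
            d_model=llm_dim, nhead=8, dim_feedforward=2048, batch_first=True)
        self.q_former = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.vis_proj = nn.Linear(vis_dim, llm_dim)       # project image features

        # Trainable soft prompt prepended to the LLM input.
        self.prompt = nn.Parameter(torch.randn(1, prompt_len, llm_dim) * 0.02)

        # Frozen LLM stand-in (in practice a pretrained decoder-only LLM).
        llm_layer = nn.TransformerEncoderLayer(
            d_model=llm_dim, nhead=8, dim_feedforward=4096, batch_first=True)
        self.llm = nn.TransformerEncoder(llm_layer, num_layers=2)
        for p in self.llm.parameters():
            p.requires_grad = False                        # keep the LLM frozen

        # Trainable layers mapping the LLM's output embeddings to a quality score.
        self.head = nn.Sequential(nn.Linear(llm_dim, 256), nn.GELU(), nn.Linear(256, 1))

    def forward(self, image_feats):                        # (B, N_patches, vis_dim)
        vis = self.vis_proj(image_feats)
        queries = self.query_tokens.expand(image_feats.size(0), -1, -1)
        q_out = self.q_former(tgt=queries, memory=vis)     # querying transformer
        prompt = self.prompt.expand(image_feats.size(0), -1, -1)
        llm_out = self.llm(torch.cat([prompt, q_out], dim=1))   # frozen forward pass
        return self.head(llm_out.mean(dim=1)).squeeze(-1)  # predicted quality score

model = FrozenLLMQualityModel()
scores = model(torch.randn(2, 257, 1408))                  # dummy ViT patch features
print(scores.shape)                                        # torch.Size([2])
```

Only the parameters outside the frozen LLM receive gradients, so a standard regression loss against subjective quality scores trains the querying transformer, prompt, and head under this assumed setup.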
With the rapid emergence of artificial intelligence (AI)-generated images, there is also a need to understand human preferences for these images. We explore the fundamental dimensions of AI-generated image quality assessment, particularly the relationship between alignment (how well images match their text prompts) and quality (both low-level artifacts and high-level structural coherence). We analyze how these dimensions interact and contribute to overall perceived quality, examining whether separate assessment of alignment and quality yields better results than holistic evaluation approaches. Through comparative analysis of existing and novel assessment models, we provide insights into effective strategies for evaluating AI-generated images.
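As a purely illustrative sketch (an assumption, not a model from the text), the contrast between the two evaluation strategies discussed above can be expressed as two small predictors over precomputed image and prompt embeddings: one scores alignment and quality separately and fuses them, the other predicts a single holistic score.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparateAssessor(nn.Module):
    """Scores prompt alignment and perceptual quality with separate heads, then fuses them."""
    def __init__(self, dim=512):
        super().__init__()
        self.align_head = nn.Bilinear(dim, dim, 1)    # image-text interaction term
        self.quality_head = nn.Sequential(nn.Linear(dim, 128), nn.GELU(), nn.Linear(128, 1))
        self.fuse = nn.Linear(2, 1)                   # learned weighting of the two scores

    def forward(self, img_emb, txt_emb):
        align = self.align_head(img_emb, txt_emb)     # how well the image matches its prompt
        quality = self.quality_head(img_emb)          # artifacts / structural coherence
        return self.fuse(torch.cat([align, quality], dim=-1)).squeeze(-1)

class HolisticAssessor(nn.Module):
    """Predicts one overall score directly from the joint image-text representation."""
    def __init__(self, dim=512):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(2 * dim, 128), nn.GELU(), nn.Linear(128, 1))

    def forward(self, img_emb, txt_emb):
        return self.head(torch.cat([img_emb, txt_emb], dim=-1)).squeeze(-1)

img_emb = F.normalize(torch.randn(4, 512), dim=-1)    # dummy image embeddings
txt_emb = F.normalize(torch.randn(4, 512), dim=-1)    # dummy prompt embeddings
print(SeparateAssessor()(img_emb, txt_emb).shape)     # torch.Size([4])
print(HolisticAssessor()(img_emb, txt_emb).shape)     # torch.Size([4])
```

Comparing how well each assumed predictor correlates with human preference scores is one way to test whether decomposing alignment and quality helps relative to holistic evaluation.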