Explainable Image Quality Evaluation For Text-to-Image Generation With Visual Large Language Model
Chris Ding
CUHK-SZ
[email protected]
Abstract
This paper introduces a novel explainable image quality evaluation approach called
X-IQE, which leverages visual large language models (LLMs) to evaluate text-
to-image generation methods by generating textual explanations. X-IQE utilizes
a hierarchical Chain of Thought (CoT) to enable MiniGPT-4 to produce self-
consistent, unbiased texts that are highly correlated with human evaluation. It
offers several advantages, including the ability to distinguish between real and
generated images, evaluate text-image alignment, and assess image aesthetics
without requiring model training or fine-tuning. X-IQE is more cost-effective
and efficient compared to human evaluation, while significantly enhancing the
transparency and explainability of deep image quality evaluation models. We
validate the effectiveness of our method as a benchmark using images generated
by prevalent diffusion models. X-IQE demonstrates similar performance to state-
of-the-art (SOTA) evaluation methods on COCO Caption, while overcoming the
limitations of previous evaluation models on DrawBench, particularly in handling
ambiguous generation prompts and text recognition in generated images. Project
website: https://github.com/Schuture/Benchmarking-Awesome-Diffusion-Models
1 Introduction
Image quality evaluation has long been a practical and crucial technique employed in various
applications, including photo enhancement [2], image stream ranking [5], and album thumbnail
composition [8]. With the advent of artificial intelligence (AI) and generative models, such as
diffusion models [47; 33], there is an increasing demand for effective evaluation methods to assess
the large volume of images generated. The evaluation process of image quality by the human visual
system encompasses multiple factors, such as the rationality of the image content, the alignment with
text descriptions [51], and aesthetics [45]. However, some of these evaluation factors are inherently
subjective and challenging to quantify, presenting a significant obstacle in this task.
Existing solutions for image quality evaluation can be broadly categorized into two groups: human
evaluation and model evaluation. Human evaluation, despite its widespread use, suffers from
significant drawbacks, such as high cost (e.g., crowd-sourcing) and limited reproducibility across
different evaluation groups [35].
∗ Corresponding Author
Figure 1: The paradigm shift from human/model evaluation to LLM explanation of image quality.
On the other hand, model evaluation provides a more desirable alternative; however, it often relies on complex models, including CNN-LSTM [5], CLIP [39], and
BLIP [26], along with specially labeled data and features. This is also expensive and falls short of
the strong generalization capabilities of humans (e.g., on AI-generated images). Furthermore, most
previous model-based evaluations only focus on predicting image quality scores [45; 52], which
makes it challenging to explain the biases and deficiencies in their training data, ultimately resulting
in flawed model performance. Consequently, the question of how to develop cheap, generalizable,
and explainable quality evaluation models remains an open challenge.
In this work, we propose a novel approach that leverages pre-trained visual large language models
(LLMs) to generate analysis texts for images in a conversational style (see Fig. 1). We utilize the
state-of-the-art (SOTA) visual LLM, MiniGPT-4 [56], as our foundational model for implementing
explainable image quality evaluation. Leveraging the in-context learning (ICL) capability [3; 31]
of LLMs, we can inject knowledge into them without further training. Specifically, we propose an
effective prompt set as the standard for LLMs to evaluate image quality. Our method addresses
several key drawbacks associated with existing image quality evaluation models. The advantages of
our proposed explainable image quality evaluation method, X-IQE, are as follows:
• Explainability: Models can describe their reasoning process using Chain of Thought
(CoT) [50] based on the provided prompts, making the results explainable and insightful.
• Comprehensiveness: Carefully designed prompts allow LLMs to perform comprehensive
evaluations, rather than only assessing specifically labeled features (e.g., color, composition [25; 54]).
• Powerful Performance: Advanced LLMs are trained on vast amounts of data and possess
more generalizable image understanding capabilities than most task-specific models.
• Unbiasedness: By utilizing objective prompt text, our model can conduct unbiased evalua-
tions, eliminating the biases that may arise from models trained on datasets annotated by
specific annotation groups, such as crowd-sourcing [35] or annotation companies [52].
• Training-free: Our method harnesses the capabilities of pre-trained LLMs, eliminating the
need for data collection and training efforts required by existing methods [25; 52].
Recent studies on evaluating AI-generated images [52; 51; 35] have predominantly concentrated on
assessing fidelity and text-image alignment scores. Conversely, earlier works assessing real photos or
paintings have primarily emphasized aesthetics [5; 45]. In contrast to existing methods that directly
output evaluation scores, we establish a step-by-step text-to-image method evaluation pipeline. This
approach enables us to generate all three textual evaluations incrementally, allowing subsequent steps
to leverage the information and conclusions derived from the reasoning of previous steps.
To provide a comprehensive analysis of X-IQE and evaluate SOTA text-to-image generative models,
including Stable Diffusion [41], Openjourney², and DeepFloyd-IF³, we conduct extensive experiments. Remarkably, X-IQE performs comparably in a zero-shot manner with specialized SOTA scoring models trained on AI-generated images, and even better in unusual scenarios. These results
demonstrate the efficacy of our method as a versatile text-to-image evaluation protocol.
The contributions of this work are summarized as follows:
² https://openjourney.art/
³ https://huggingface.co/DeepFloyd/IF-I-XL-v1.0
1. We propose X-IQE, an explainable image quality evaluation method based on visual LLMs.
To our knowledge, this is the first application of LLMs for text-to-image evaluation.
2. To enhance the performance and stability of X-IQE, we integrate the expertise of art
professionals into a hierarchical CoT with well-defined conditions and criteria for evaluation.
3. We perform comprehensive experiments on both real and AI-generated images, validating
the explanatory power as well as the quantitative scoring capabilities of X-IQE.
2 Related Work
Human evaluation is widely regarded as the benchmark method for assessing text-to-image generative models, including rule-based methods [20], GANs [15], and diffusion models [18], and is adopted by most SOTA text-to-image works [13; 53; 34; 42]. However, the lack of a
consistent evaluation standard and varying protocols often yield different conclusions among the
works. To overcome this limitation, the first approach employs standardized evaluation protocols, which mitigate issues such as annotators prioritizing monetary returns [1; 21] and introducing biases [35]. The second approach focuses on automatic evaluation metrics, including Inception
Score [43], Frechet Inception Distance [17], and Precision-Recall [24]. Some recent studies have
trained evaluation models on human scores to align with human preferences, but achieve only 65% consistency [52; 51]. Notably, both approaches consider two metrics: overall image quality and
text-image alignment.
Prior aesthetic prediction models have predominantly focused on real images, employing various
approaches for aesthetic assessment. One common practice involves concatenating vector represen-
tations derived from multiple input image patches [30; 45]. Another approach aims to capture the
relationships between different objects or regions within the image to evaluate composition [28; 54].
Notably, the widely used text-to-image generation method, Stable Diffusion, has also been evaluated
using aesthetic predictors trained on AVA [32] and LAION [44] datasets, yielding favorable results.
Large language models (LLMs) have achieved remarkable success in recent years. Three main
paradigms emerged: encoder-only (BERT [12]), encoder-decoder (T5 [40]), and decoder-only (GPT-2 [38]). GPT-3 [3] demonstrated the scaling advantages of the decoder-only paradigm, leading to a
surge in related research, including Megatron-Turing NLG [46], Chinchilla [19], PaLM [10], and
LLaMA [48]. InstructGPT [36] and ChatGPT [4] showed coherent multi-turn conversation skills
through fine-tuning GPT-3 with aligned feedback data. LLMs can generate expected outputs for
test instances without additional training when provided with natural language instructions and/or
task demonstrations [55]. This in-context learning ability [3] allows LLMs to learn new tasks with
minimal overhead during inference. Another important capability is Chain of Thought, where LLMs
solve tasks using a prompting mechanism involving intermediate reasoning steps. CoT prompting
has shown performance gains for models larger than 60B [50]. In this work, we leverage ICL and
CoT to enable smaller LLMs to explain their reasoning process in image quality evaluation.
LLMs, such as VisualGPT [6] and Frozen [49], serve as powerful decoders for visual features. They
enable cross-modal transfer, aligning visual and linguistic knowledge to describe visual information
using language. BLIP-2 [27] effectively aligns visual features with language models through Flan-
T5 [11], demonstrating strong visual QA capabilities. GPT-4 [4], a recent breakthrough, accomplishes
diverse language tasks based on images by aligning an advanced LLM with human preferences and
intentions. Successful visual-language conversation models require robust conversational language
models (e.g., ChatGPT, LLaMA [48], Vicuna [9]), visual encoders (e.g., VIT [14]), and visual-
language alignment training. MiniGPT-4 [56], chosen for its versatility and simplicity, achieves
dialogue-based image understanding by training a linear layer with ample image-text alignment data.
Our proposed method can be flexibly adapted to more advanced visual language models in the future.
3 Method
X-IQE utilizes MiniGPT-4 as its foundational model and incorporates the expertise of art professionals
to design standardized and unbiased prompts. The effectiveness of X-IQE is further enhanced by
its well-structured hierarchical CoT scheme and the application of format constraints for accurate
execution. The overall methodology is depicted in Fig. 2.
3.1 MiniGPT-4
MiniGPT-4 [56] combines a pretrained vision encoder (ViT with Q-Former [27]) and the advanced
LLM Vicuna [9]. ViT enables image parsing, while Vicuna enhances generative capabilities through
conversation rounds. MiniGPT-4 keeps both ViT and Vicuna frozen; only a linear projection layer mapping visual features into Vicuna's embedding space is trained. The model is trained on 5 million text-image pairs
for general vision-language knowledge and fine-tuned with 3.5k aligned conversations for natural
responses. In this work, we perform ICL on the frozen MiniGPT-4 for image quality evaluation.
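The following is a minimal PyTorch-style sketch of this bridge architecture; the class, dimensions, and module interfaces are illustrative assumptions rather than the actual MiniGPT-4 code, but it captures the key point that only the linear projection is trainable while the vision encoder and the LLM stay frozen.

```python
import torch
import torch.nn as nn

class VisualPrefixProjector(nn.Module):
    """Sketch of a MiniGPT-4-style bridge: frozen vision encoder -> trainable
    linear projection -> frozen LLM. Names and shapes are illustrative."""

    def __init__(self, vision_encoder, llm, vis_dim=768, llm_dim=5120):
        super().__init__()
        self.vision_encoder = vision_encoder.eval()   # frozen ViT + Q-Former
        self.llm = llm.eval()                         # frozen Vicuna
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False
        self.proj = nn.Linear(vis_dim, llm_dim)       # the only trainable part

    def forward(self, image, text_embeds):
        # Visual tokens from the frozen encoder, e.g. (batch, num_tokens, vis_dim).
        vis_tokens = self.vision_encoder(image)
        prefix = self.proj(vis_tokens)                # map into the LLM embedding space
        # Prepend the projected visual tokens to the text embeddings and decode.
        inputs = torch.cat([prefix, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```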
Challenges. Using pretrained LLMs for image quality assessment presents challenges that require
careful ICL strategy design. First, the training data of MiniGPT-4 lacks samples from SOTA visual generative models, so the LLM tends to describe generated images as normal images based on its pre-training preferences. We
address this challenge by incorporating explicit quality evaluation aspects, standards, and reasoning
processes inspired by art professionals’ knowledge. Additionally, Vicuna in MiniGPT-4 has a limited
parameter count (7B/13B), which may affect analysis rationality and result accuracy in complex
scenarios. To mitigate this, we impose strict constraints on the answer structure.
3.2 Expert Knowledge for Image Quality Evaluation
Our art industry consultant, experienced in using Stable Diffusion and Midjourney, identified several
empirical discrimination methods for AI image generation. When generating rare and complex
objects, AI produces blurred or distorted details, particularly with human hands and text. Occluded
objects often have inconsistent details on both sides of the occluder. AI-generated images may exhibit
oversaturation, especially when unusual colors are present. Lastly, in photo scenes, AI-generated
images can sometimes have a partially greasy feeling.
Aesthetics judgment indicators are well-studied and less subjective than commonly believed. Aesthet-
ically pleasing images should exhibit harmonious and bright colors, moderate saturation, appropriate
lighting and shadow, well-composed scenes, and a sense of space through perspective [37]. Mem-
orable images should also feature intricate details, expressiveness, and strong visual impact [22].
X-IQE incorporates expertise for discriminating AI images and aesthetic evaluation as prompts.
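As a concrete illustration, such expert criteria can be folded directly into the evaluation prompt as plain text; the snippet below is a simplified sketch of this idea (the exact prompts used by X-IQE are shown in Fig. 2 and Appendix A).

```python
# Simplified sketch of expressing expert criteria as prompt text.
# The actual X-IQE prompts (Fig. 2, Appendix A) are more detailed.
AI_ARTIFACT_CUES = [
    "Imperfect details: distorted, blurry, or irrational faces, limbs, fingers, objects, or texts.",
    "Improper composition: some misplaced object relationships.",
    "Strange colors: overly bright, saturated colors.",
    "Artificial look: looks like a real image but has an unclear rendering or other artificial look.",
]

def build_fidelity_prompt() -> str:
    criteria = "\n".join(f"{i + 1}. {cue}" for i, cue in enumerate(AI_ARTIFACT_CUES))
    return (
        "You are my assistant to evaluate the image quality. "
        "Briefly describe the type and content of this image, and analyze whether "
        "it meets the following conditions of an AI-generated image.\n" + criteria
    )
```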
3.3 Hierarchical Chain of Thought
We design a hierarchical CoT strategy for X-IQE to enhance the coherence and quality of its responses.
X-IQE evaluates image quality in terms of fidelity, alignment, and aesthetics, recognizing that these
attributes are interconnected. The evaluation results of certain attributes can serve as priors for
assessing other attributes. Fidelity, which measures the realism of an image, influences the assessment
of text-to-image alignment since AI-generated images often exhibit lower alignment compared to
real images [35]. Additionally, the identification of an image as AI-generated impacts the aesthetic
analysis by highlighting specific defects identified in the fidelity evaluation. The alignment evaluation
benefits from knowing the true prompt used for image generation, which in turn aids the assessment of visual impact and composition in the aesthetic evaluation. Our CoT is designed as follows:
• fidelity evaluation → alignment evaluation → aesthetic evaluation
Within each evaluation task, a dedicated CoT is established, encompassing:
• (image description) → task-specific analysis → scoring
The image description is included in the first task and reused in the subsequent tasks.
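The sketch below shows how this hierarchical CoT could be driven programmatically. The `chat(history, prompt, image)` interface is an assumed stand-in for MiniGPT-4's conversation API, not the released code; the important property is that all three stages share one conversation so later stages can reuse earlier analyses.

```python
from typing import Callable, Dict, List

def evaluate_image(chat: Callable, image, caption: str,
                   prompts: Dict[str, str]) -> Dict[str, str]:
    """Run fidelity -> alignment -> aesthetics in order on one shared conversation.
    `chat(history, prompt, image)` is an assumed interface that returns the reply."""
    history: List[Dict[str, str]] = []
    results: Dict[str, str] = {}
    # The fidelity stage produces the image description; because every stage
    # appends to the same history, later stages can build on that analysis.
    for stage in ("fidelity", "alignment", "aesthetics"):
        prompt = prompts[stage].format(caption=caption)
        reply = chat(history, prompt, image)
        history.append({"prompt": prompt, "reply": reply})
        results[stage] = reply
    return results
```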
Chain of Thought stages: Image Description → Task-specific Analysis → Scoring.

Fidelity prompt:
You are my assistant to evaluate the image quality. Briefly describe (within 50 words) the type (e.g., photo, painting) and content of this image, and analyze whether this image meets the following conditions of an AI-generated image (within 30 words per point).
1. Imperfect details: distorted, blurry, or irrational faces, limbs, fingers, objects, or texts.
2. Improper composition: some misplaced object relationships.
3. Strange colors: overly bright, saturated colors.
4. Artificial look: looks like a real image but has an unclear rendering or other artificial look.
Provide your analysis in JSON format with the following keys: Image description, Imperfect details, Improper composition, Strange colors, Artificial look, Fidelity (e.g., 6/10). The fidelity scoring criteria are as follows:
Definitely AI-generated (0-1), Very likely AI-generated (2-3), Certain probability of AI generation (4), Unsure (5), Certain probability being real (6), Very real (7-8), Definitely real (9-10).

Example fidelity response:
{
"Image description": "A yellow and blue subway train at a station.",
"Imperfect details": "The train's windows are a bit blurry and the wheels are not very detailed.",
"Improper composition": "The positioning of the train in the frame is slightly off-center.",
"Strange colors": "The colors are bright and vibrant, but not necessarily strange.",
"Artificial look": "The overall image looks quite realistic, but there is a slight artificiality to the lighting and the way the train is rendered.",
"Fidelity": "5/10"
}

Alignment prompt:
According to the image and your previous description, how well does the image align with the following description?
Does not match at all (1), Has significant discrepancies (2), Has several minor discrepancies (3), Has a few minor discrepancies (4), Matches exactly (5).
Provide your analysis in JSON format with the following keys: Alignment analysis (within 100 words), Alignment score (e.g., 4/5).

Example alignment response:
{
"Alignment analysis": "The image depicts a yellow and blue subway train stationary at a train station, with a clear and detailed representation of the train, the platform, and the surrounding environment. However, there are some minor discrepancies in the positioning of the train, which is slightly tilted to the left, and the platform is uneven.",
"Alignment score": "4/5"
}

Aesthetics prompt:
Briefly analyze the aesthetic elements of this image (each item within 20 words) and score its aesthetics. The scoring criteria for each item are as follows.
Extremely bad (0-1), Poor quality (2-3), Below average (4-5), Above average (5-6), Good (7-8), Excellent (9), Wonderful (10).
Provide your analysis in JSON format with the following keys: Color harmony, Color brightness, Color saturation, Composition, Perspective, Light and shadow, Detailed expression, Vivid posture, Visual impact, Overall aesthetic score (e.g., 6/10).

Example aesthetics response (truncated):
{
"Color harmony": "The color scheme is bright and vibrant, with a balance of yellow and blue.",
"Color brightness": "The colors are bright, but not overly so.",
"Color saturation": "The colors are saturated, but not overly so.",
"Composition": "The composition is well-balanced, with a clear focal point.",
"Perspective": "The perspective is realistic, with a slight tilt to the left.",
"Light and shadow": "The lighting is bright and natural, with clear shadows.",
"Detailed expression": "The details are clear and well-defined.",
"Vivid posture": "The posture of the train is vivid and dynamic.",
"Visual impact": "The visual impact is good, with a clear focal point.",
"Overall aesthetic

Continue prompt:
Give the rating of aesthetics (in the format like n/10), do not repeat what you have said.
Figure 2: The illustration showcases X-IQE using an image generated by Stable Diffusion 1.4. In
X-IQE, MiniGPT-4 with Vicuna serves as the chatbot. Carefully designed prompts are employed
for image quality analysis, incorporating a hierarchical CoT strategy to ensure response logic and
accuracy. The output format and scoring criteria constraints contribute to stable and coherent
responses that are easy to follow.
3.4 Constraints of Output Formats and Scoring Criteria
During our preliminary experiments, we aimed to ensure that MiniGPT-4 produces content that
adheres to the CoT structure. However, we encountered issues with unstable outputs. The two most
common unexpected outputs were: 1) content being generated before the analysis and 2) scoring
presented in varying styles, such as percentages or ratings based on different scales. To enhance
the likelihood of CoT-compliant output and establish an objective and unified scoring standard, we
implemented two improvements: 1) requiring the model to provide output in JSON format and 2)
explicitly defining the conditions associated with different scores.
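A minimal sketch of how such constrained outputs might be parsed is given below; the follow-up question mirrors the "continue" prompts listed in Appendix A, while the helper names, the JSON key handling, and the retry logic are our own illustrative assumptions.

```python
import json
import re
from typing import Optional

SCORE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*/\s*10")

def parse_score(reply: str) -> Optional[float]:
    """Extract a score such as '6/10' from a (hopefully) JSON-formatted reply."""
    try:
        data = json.loads(reply)
        reply = str(data.get("Fidelity", reply))
    except (json.JSONDecodeError, AttributeError):
        pass  # fall back to a regex search on the raw text
    match = SCORE_PATTERN.search(reply)
    return float(match.group(1)) if match else None

def ask_until_scored(chat, history, prompt, image, max_retries: int = 2):
    """Re-ask with the continue prompt when the model refuses to give a number."""
    reply = chat(history, prompt, image)
    score = parse_score(reply)
    retries = 0
    while score is None and retries < max_retries:
        reply = chat(history, "Give the fidelity rating (in a format like n/10), "
                              "do not repeat what you have said.", image)
        score = parse_score(reply)
        retries += 1
    return score, reply
```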
4 Experiments
Datasets: To evaluate X-IQE’s capability, we utilize COCO Captions [7] and DrawBench texts [42]
for AI image generation. We randomly sample 1000 prompts from COCO Captions while all 200
prompts from DrawBench are used. All diffusion models generate images using DPM Solver [29]
with 25 steps and a guidance scale of 7.5.
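For reference, this sampling configuration can be reproduced with the diffusers library roughly as follows; the model identifier and prompt are examples, and the snippet is a sketch of the setup (DPM-Solver, 25 steps, guidance scale 7.5) rather than the exact script used in our experiments.

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

# Example: Stable Diffusion 1.4 with the DPM-Solver scheduler (25 steps, CFG 7.5).
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

prompt = "A yellow and blue subway train at a station."  # e.g., a COCO caption
image = pipe(prompt, num_inference_steps=25, guidance_scale=7.5).images[0]
image.save("sample.png")
```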
Models: Our experiments involve both MiniGPT-4 models with 7B and 13B Vicuna parameters. No
major modifications are made to the model code base⁴. During inference, we set beam=1, and the temperature ranges from 0.01 to 1.0. The models are executed on an NVIDIA RTX A6000 48GB GPU.
Metrics: To validate X-IQE’s performance in fidelity evaluation, we employ the recall of the
AI-generated images. For alignment and aesthetics validation, we compare the Pearson correlation
coefficient between human evaluation and task-specific models (such as CLIPScore [16] for alignment
and Stable Diffusion Aesthetic Predictor⁵ for aesthetics) or X-IQE.
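Concretely, these validation metrics can be computed as in the following SciPy-based sketch; the function names and the fidelity threshold used for recall are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def generated_recall(fidelity_scores, threshold=5):
    """Fraction of AI-generated images that are flagged as generated.
    Assumption for illustration: scores at or below `threshold` count as 'generated'."""
    scores = np.asarray(fidelity_scores, dtype=float)
    return float(np.mean(scores <= threshold))

def human_correlation(model_scores, human_scores):
    """Pearson correlation between model scores and human evaluation scores."""
    r, p_value = stats.pearsonr(model_scores, human_scores)
    return r, p_value

def distribution_gap(real_scores, fake_scores):
    """Two-sample Kolmogorov-Smirnov test between the fidelity score
    distributions of real and generated images (cf. Fig. 4 (a))."""
    return stats.ks_2samp(real_scores, fake_scores)
```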
Figure 3: The stability analysis of X-IQE. (a) The models output more consistent results when the temperature is lower, with Vicuna 13B always better than the 7B variant. (b) The larger LLM gives more precise and valid responses to the questions for all three tasks.
One question regarding LLMs is whether repeated experiments yield judgments as consistent as those of traditional prediction models, and whether their responses always address our questions. We
investigate model consistency in relation to model size and temperature, as depicted in Fig. 3.
The temperature parameter is crucial in LLMs as it controls the randomness and creativity of generated
text. We conduct three repeated experiments with fixed temperatures to examine the consistency
of models’ fidelity prediction (Fig. 3 (a)). Krippendorff’s α [23] decreases as the temperature
increases, indicating that models yield more divergent results with higher temperatures. General
annotator groups typically achieve 0.11 Krippendorff’s α, while expert annotators can reach 0.53 [35].
Moreover, Vicuna 13B consistently exhibits better consistency than Vicuna 7B.
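Krippendorff's α over repeated runs can be computed, for example, with the third-party krippendorff package; the snippet below is a sketch in which the data layout (runs as coders, images as units) and the example numbers are assumptions consistent with the setup described above.

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Fidelity scores from three repeated runs (rows) over the same images (columns);
# np.nan marks an image for which a run returned no valid score.
runs = np.array([
    [7, 5, 5, 6, np.nan, 4],
    [7, 5, 4, 6, 5,      4],
    [6, 5, 5, 7, 5,      3],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=runs, level_of_measurement="ordinal")
print(f"Krippendorff's alpha across runs: {alpha:.3f}")
```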
⁴ https://github.com/Vision-CAIR/MiniGPT-4
⁵ https://github.com/christophschuhmann/improved-aesthetic-predictor
Table 1: Ablation study of CoTs within and between tasks. The baseline is directly asking MiniGPT-4
for separate evaluation scores. Fidelity is measured using the recall of generated images, while
alignment and aesthetics are assessed using Pearson correlation coefficients with human evaluations.
It’s worth noting that the human evaluation exhibits a correlation coefficient of 0.137 with the
CLIPScore and 0.067 with the Aesthetic Predictor, serving as a reference for comparison.
Subtask     | baseline | +prompt | +prompt+format | +fidelity | +alignment | +fidelity+alignment
Fidelity    | 0.021    | 0.0     | 0.698          | –         | –          | –
Alignment   | 0.118    | 0.082   | 0.263          | 0.381     | –          | –
Aesthetics  | 0.030    | -0.162  | 0.259          | 0.369     | 0.351      | 0.418
(The first three result columns vary the CoT within tasks; the last three vary the CoT between tasks.)
Example descriptions from Fig. 4 (b): "A black and white photograph of a steam locomotive with the number 717 on the side, surrounded by trees and a mountainous landscape." / "The girl's face is distorted and the color of her hoodie is a bit off. The placement of the hands and the phone could be improved."
(a) The fidelity distributions of real/generated images (b) Examples of right and wrong judgments
Figure 4: The results and examples of fidelity evaluation on COCO Caption. (a) Real and fake images
mainly differ in the distribution of fidelity scores. (b) The examples illustrate the capability of X-IQE
to perform image analysis during inference, although some judgments may be inaccurate.
X-IQE aims to derive quantitative evaluations for three tasks, but the models do not always provide
numerical results as expected. For fidelity evaluation, a common response is "It’s a bit unclear if the
image is AI-generated or not", even after two rounds of questioning "Give the rating of fidelity (in
the format like n/10)". The success rates of answering are presented in Fig. 3 (b). Generally, larger
Vicuna performs better in this experiment. Both models have a probability over 10% of providing
unexpected fidelity answers. However, refusal to answer for alignment and aesthetics evaluation is
rare, with Vicuna 13B failing to answer properly only in exceptional cases.
In all subsequent experiments, we employ Vicuna 13B with a temperature of 0.1 for accuracy and
reproducibility. Smaller temperatures are avoided due to model instability, as discussed in Section 5.
The ablation study results of CoT are shown in Tab. 1. We observe that without CoT within tasks,
the performance of X-IQE is unsatisfactory. Simply asking the model for a fidelity or alignment
score typically yields high scores, as no consistent standards are provided. Incorporating specific
requirements for task-specific analysis, including scoring criteria and evaluation aspects, improves the
model’s ability to generate reasonable analysis. However, obtaining reliable numerical scores remains
challenging due to Vicuna’s tendency to provide answers before thorough analysis. The inclusion of
formatting requirements ultimately enables the model to produce desirable quantitative results based
on its analysis. Regarding the CoT between tasks, the alignment and aesthetics scores match the human evaluations better when the analyses from previous stages are available. The model frequently utilizes
earlier analysis when responding to subsequent questions. For example, fidelity analysis considering
image details often aids in evaluating aesthetics such as composition and expressive details.
The results and examples of fidelity inference using X-IQE are presented in Fig. 4. Though the mean
fidelity scores for real and generated images are similar, it does not imply that X-IQE lacks strong
discriminative ability. The fidelity score distributions of real and generated images (Fig. 4 (a)) exhibit
significant differences with a p-value of < 10⁻⁵ in the Kolmogorov–Smirnov test. X-IQE tends to assign more scores of 7 to real images and more scores of 5 to SD-generated images. Interestingly, SD 1.4 can deceive both X-IQE and human observers in quite a few cases. Fig. 4 (b) demonstrates examples of correct and incorrect judgments for detecting AI generation. Furthermore, X-IQE assigns very few scores of 6, possibly due to the language bias learned during pretraining.
(a) Alignment score distributions (r=0.381) (b) Aesthetics score distributions (r=0.418)
Figure 5: The distributions of alignment and aesthetics scores on COCO Caption. In contrast to humans, who tend to provide score judgments centered around the mean, X-IQE exhibits a broader range of judgments that include both extremely high and extremely low values.
The distributions of alignment and aesthetics scores from X-IQE and human evaluations are depicted
in Fig. 5. These scores demonstrate higher correlation coefficients of 0.381 and 0.418 with human
evaluations compared to task-specific models like CLIPScore (0.137) and Aesthetic Predictor (0.067),
highlighting the generalizability of X-IQE on the generated dataset. CLIPScore, trained on text-image
pairs collected before 2021 [16], and Aesthetic Predictor, trained on the AVA dataset [32], tend to un-
derperform when dealing with open-scene texts and images, often producing predictions inconsistent
with human expectations [52]. For example, the Aesthetic Predictor outputs values within the narrow
range of [4.8, 5.6], indicating its poor predictive ability within this data domain (the model’s output
range is [0, 10]). X-IQE addresses this issue by incorporating domain-specific knowledge, such as
scoring criteria, into the inference process. But unlike the unimodal nature of human evaluations,
X-IQE’s score distributions are bimodal, suggesting further room for improvement.
One of the significant applications of X-IQE is its role as an impartial referee in evaluating various
generative models. We evaluate four SOTA text-to-image models, including Stable Diffusion 1.4
and 2.1, Openjourney, and DeepFloyd-IF. Furthermore, we incorporate the results from two leading
evaluation models specifically trained on AI-generated images (i.e., ImageReward [52] and HPS [51]).
Tab. 2 displays the outcomes for images generated using COCO Caption prompts. Notably, CLIPScore and Aesthetic Predictor do not align with human perception, whereas X-IQE exhibits strong agreement with ImageReward and HPS, selecting DeepFloyd-IF as the best among the four methods. The overall score
of X-IQE exhibits the same ranking as HPS. X-IQE also demonstrates its ability to extract reasonable detailed item scores. DeepFloyd-IF, serving as a public reproduction of Imagen [42], generates images in the pixel space guided by the text encoder T5 XXL [40], enabling it to produce highly photorealistic images that align well with the provided texts. In terms of aesthetics, Openjourney outperforms the other models due to its high-quality training data generated by Midjourney.
Table 3: Benchmarking text-to-image models on DrawBench.
Model                | CLIP  | Aes. Pred. | ImgRwd | HPS    | X-IQE Fidelity | X-IQE Alignment | X-IQE Aesthetics | X-IQE Overall
Stable Diffusion 1.4 | 0.793 | 5.09       | -0.029 | 0.1945 | 5.32           | 2.72            | 5.40             | 13.44
Stable Diffusion 2.1 | 0.817 | 5.31       | 0.163  | 0.1955 | 5.10           | 2.50            | 5.04             | 12.64
Openjourney          | 0.787 | 5.35       | 0.056  | 0.1972 | 5.14           | 2.62            | 5.21             | 12.97
DeepFloyd-IF         | 0.827 | 5.10       | 0.541  | 0.1977 | 5.32           | 2.96            | 5.64             | 13.92
Vague prompt: "A large plant-eating domesticated mammal with solid hoofs and a flowing mane and tail, used for riding, racing, and to carry and pull loads"
X-IQE scores for the four generated images (left to right in Fig. 6): Fidelity 7 / Alignment 4 / Aesthetics 6; Fidelity 5 / Alignment 3 / Aesthetics 5; Fidelity 5 / Alignment 3 / Aesthetics 5; Fidelity 7 / Alignment 4 / Aesthetics 8
Text prompt:
X-IQE scores for the four generated images (left to right in Fig. 6): Fidelity 5 / Alignment 2 / Aesthetics 4; Fidelity 5 / Alignment 1 / Aesthetics 1; Fidelity 5 / Alignment 4 / Aesthetics 7; Fidelity 7 / Alignment 4 / Aesthetics 6
Figure 6: Comparison of images generated with DrawBench prompts and their X-IQE scoring.
Tab. 3 shows the evaluation results on DrawBench, a challenging benchmark including unconventional
prompts like vague descriptions and text generation. In this scenario, X-IQE provides conclusions
that differ from those of ImageReward and HPS. Fig. 6 illustrates the images generated with the four
models. X-IQE accurately ranks them, especially for the poor alignment and aesthetics of SD2.1.
This can be attributed to the robust capabilities of X-IQE in parsing long and ambiguous texts and
recognizing texts within the images, which traditional models struggle to accomplish.
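The overall X-IQE score in Tab. 3 is the sum of the fidelity, alignment, and aesthetics sub-scores; the short sketch below simply re-aggregates the reported numbers to reproduce the ranking discussed above (the helper code is ours, not part of X-IQE).

```python
# DrawBench X-IQE sub-scores copied from Tab. 3; the overall score equals
# fidelity + alignment + aesthetics for every row, so we aggregate that way here.
scores = {
    "Stable Diffusion 1.4": (5.32, 2.72, 5.40),
    "Stable Diffusion 2.1": (5.10, 2.50, 5.04),
    "Openjourney":          (5.14, 2.62, 5.21),
    "DeepFloyd-IF":         (5.32, 2.96, 5.64),
}

overall = {model: round(sum(parts), 2) for model, parts in scores.items()}
ranking = sorted(overall.items(), key=lambda kv: kv[1], reverse=True)
for rank, (model, total) in enumerate(ranking, start=1):
    print(f"{rank}. {model}: {total}")
# DeepFloyd-IF (13.92) > Stable Diffusion 1.4 (13.44) > Openjourney (12.97) > Stable Diffusion 2.1 (12.64)
```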
5 Limitations
Model capability. Most visual encoders, including ViT in MiniGPT-4, were pre-trained using low-
resolution images (e.g., 224²), whereas images generated with modern diffusion techniques often range from 512² to 1024². The significant downsampling of images during inference could negatively
impact the model’s ability to discriminate fine details. Additionally, it has been shown that CoTs
typically work best for models with a size larger than 60B. When using smaller models like MiniGPT-
4 with 7B/13B parameters, we observed various unexpected responses such as hallucinations, repeated
answers, and incorrect output formats, especially with temperatures smaller than 0.1.
CoT design. The CoT strategy proposed in this work is concise and includes only necessary
information for the LLM’s inference. Longer and more detailed CoTs might be more effective but were not explored due to the limited context length that current LLMs can support, typically shorter than
1,000 tokens. It has also been observed that X-IQE typically rates the image only with integers and
tends to produce extreme quantitative results. This could be mitigated in future research by improving
the prompts.
6 Conclusion
In this study, we have introduced X-IQE, the first explainable method for image quality evaluation
with LLMs. This approach was developed to address the limitations of existing image evaluation
models and to align the evaluation results more closely with human judgments. By employing a
meticulously designed CoT strategy, X-IQE is able to differentiate AI-generated images, assess
the alignment between text and image, and analyze aesthetic elements. Our experimental results
demonstrate the reproducibility and effectiveness of X-IQE, as well as its capability to serve as a
benchmark for current SOTA text-to-image generation methods without requiring additional training
data. The ability of X-IQE to provide explanations for its reasoning process enables researchers to
efficiently identify and target its weaknesses for further improvement.
References
[1] Herman Aguinis, Isabel Villamor, and Ravi S Ramani. Mturk research: Review and recommen-
dations. Journal of Management, 47(4):823–837, 2021.
[2] Subhabrata Bhattacharya, Rahul Sukthankar, and Mubarak Shah. A holistic approach to aesthetic
enhancement of photographs. ACM Transactions on Multimedia Computing, Communications,
and Applications (TOMM), 7(1):1–21, 2011.
[3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[4] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece
Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general
intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
[5] Kuang-Yu Chang, Kung-Hung Lu, and Chu-Song Chen. Aesthetic critiques generation for
photos. In Proceedings of the IEEE international conference on computer vision, pages 3514–
3523, 2017.
[6] Jun Chen, Han Guo, Kai Yi, Boyang Li, and Mohamed Elhoseiny. Visualgpt: Data-efficient
adaptation of pretrained language models for image captioning. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 18030–18040, 2022.
[7] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár,
and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv
preprint arXiv:1504.00325, 2015.
[8] Yi-Ling Chen, Jan Klopp, Min Sun, Shao-Yi Chien, and Kwan-Liu Ma. Learning to compose
with professional photographs on the web. In Proceedings of the 25th ACM international
conference on Multimedia, pages 37–45, 2017.
[9] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng,
Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot
impressing gpt-4 with 90%* chatgpt quality, 2023.
[10] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam
Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm:
Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
[11] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li,
Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned
language models. arXiv preprint arXiv:2210.11416, 2022.
[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,
2018.
[13] Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. Cogview2: Faster and better text-to-image
generation via hierarchical transformers. arXiv preprint arXiv:2204.14217, 2022.
[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al.
An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint
arXiv:2010.11929, 2020.
[15] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications
of the ACM, 63(11):139–144, 2020.
[16] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A
reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference
on Empirical Methods in Natural Language Processing, pages 7514–7528, 2021.
[17] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in
neural information processing systems, 30, 2017.
[18] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances
in Neural Information Processing Systems, 33:6840–6851, 2020.
[19] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza
Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al.
Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
[20] Qixin Hu, Yixiong Chen, Junfei Xiao, Shuwen Sun, Jieneng Chen, Alan Yuille, and Zongwei
Zhou. Label-free liver tumor segmentation. arXiv preprint arXiv:2303.14869, 2023.
[21] Ryan Kennedy, Scott Clifford, Tyler Burleigh, Philip D Waggoner, Ryan Jewell, and Nicholas JG
Winter. The shape of and solutions to the mturk quality crisis. Political Science Research and
Methods, 8(4):614–629, 2020.
[22] Aditya Khosla, Akhil S Raju, Antonio Torralba, and Aude Oliva. Understanding and predicting
image memorability at a large scale. In Proceedings of the IEEE international conference on
computer vision, pages 2390–2398, 2015.
[23] Klaus Krippendorff. Content analysis: An introduction to its methodology. Sage publications,
2018.
[24] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved
precision and recall metric for assessing generative models. Advances in Neural Information
Processing Systems, 32, 2019.
[25] Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter
Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models
using human feedback. arXiv preprint arXiv:2302.12192, 2023.
[26] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-
image pre-training for unified vision-language understanding and generation. In International
Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
[27] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-
image pre-training with frozen image encoders and large language models. arXiv preprint
arXiv:2301.12597, 2023.
[28] Dong Liu, Rohit Puri, Nagendra Kamath, and Subhabrata Bhattacharya. Composition-aware im-
age aesthetics assessment. In Proceedings of the IEEE/CVF Winter Conference on Applications
of Computer Vision, pages 3569–3578, 2020.
[29] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A
fast ode solver for diffusion probabilistic model sampling in around 10 steps. In Advances in
Neural Information Processing Systems, 2022.
[30] Xin Lu, Zhe Lin, Xiaohui Shen, Radomir Mech, and James Z Wang. Deep multi-patch
aggregation network for image style, aesthetics, and quality estimation. In Proceedings of the
IEEE international conference on computer vision, pages 990–998, 2015.
[31] Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and
Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning
work? arXiv preprint arXiv:2202.12837, 2022.
[32] Naila Murray, Luca Marchesotti, and Florent Perronnin. Ava: A large-scale database for
aesthetic visual analysis. In 2012 IEEE conference on computer vision and pattern recognition,
pages 2408–2415. IEEE, 2012.
[33] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic
models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
[34] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin,
Bob Mcgrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image genera-
tion and editing with text-guided diffusion models. In International Conference on Machine
Learning, pages 16784–16804. PMLR, 2022.
[35] Mayu Otani, Riku Togashi, Yu Sawai, Ryosuke Ishigami, Yuta Nakashima, Esa Rahtu, Janne
Heikkilä, and Shin’ichi Satoh. Toward verifiable and reproducible human evaluation for text-to-
image generation. arXiv preprint arXiv:2304.01816, 2023.
[36] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin,
Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to
follow instructions with human feedback. Advances in Neural Information Processing Systems,
35:27730–27744, 2022.
[37] Stephen E Palmer, Karen B Schloss, and Jonathan Sammartino. Visual aesthetics and human
preference. Annual review of psychology, 64:77–107, 2013.
[38] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al.
Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[39] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal,
Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual
models from natural language supervision. In International conference on machine learning,
pages 8748–8763. PMLR, 2021.
[40] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena,
Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified
text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
[41] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-
resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[42] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton,
Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al.
Photorealistic text-to-image diffusion models with deep language understanding. Advances in
Neural Information Processing Systems, 35:36479–36494, 2022.
[43] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen.
Improved techniques for training gans. Advances in neural information processing systems, 29,
2016.
[44] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman,
Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-
5b: An open large-scale dataset for training next generation image-text models. In Thirty-sixth
Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
[45] Kekai Sheng, Weiming Dong, Chongyang Ma, Xing Mei, Feiyue Huang, and Bao-Gang Hu.
Attention-based multi-patch aggregation for image aesthetic assessment. In Proceedings of the
26th ACM international conference on Multimedia, pages 879–886, 2018.
[46] Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari,
Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. Using
deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language
model. arXiv preprint arXiv:2201.11990, 2022.
[47] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and
Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv
preprint arXiv:2011.13456, 2020.
[48] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo-
thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open
and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[49] Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill.
Multimodal few-shot learning with frozen language models. Advances in Neural Information
Processing Systems, 34:200–212, 2021.
[50] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le,
Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In
Advances in Neural Information Processing Systems, 2022.
[51] Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Better aligning text-to-
image models with human preference. arXiv preprint arXiv:2303.14420, 2023.
[52] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao
Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation.
arXiv preprint arXiv:2304.05977, 2023.
[53] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay
Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive
models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.
[54] Bo Zhang, Li Niu, and Liqing Zhang. Image composition assessment with saliency-augmented
multi-pattern pooling. arXiv preprint arXiv:2104.03133, 2021.
[55] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min,
Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv
preprint arXiv:2303.18223, 2023.
[56] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: En-
hancing vision-language understanding with advanced large language models. arXiv preprint
arXiv:2304.10592, 2023.
A All Prompts Used in the Experiments
Fidelity Question: […] Provide your analysis in JSON format with the following keys: Image description, Imperfect details, Improper composition, Strange colors, Artificial look, Fidelity (e.g., 6/10). The fidelity scoring criteria are as follows: […]
Alignment Question: […] <Image Caption> […] Provide your analysis in JSON format with the following keys: Alignment analysis (within 100 words), Alignment score (e.g., 4/5).
Aesthetics Question: Briefly analyze the aesthetic elements of this image (each item within 20 words) and score its aesthetics. The scoring criteria for each item are as follows. […] Provide your analysis in JSON format with the following keys: Color harmony, Color brightness, Color saturation, Composition, Perspective, Light and shadow, Detailed expression, Vivid posture, Visual impact, Overall aesthetic score (e.g., 6/10).
(See Fig. 2 for the full prompt wording and scoring criteria.)
Table 5: The ablation prompts used for X-IQE.
Fidelity question baseline: You are my assistant to evaluate the image quality. Whether this image seems like a real image instead of AI-generated? Give your fidelity score out of 10 in a format like n/10.
Fidelity question w/o format: You are my assistant to evaluate the image quality. Whether this image seems like a real image instead of AI-generated? Give your fidelity score out of 10 in a format like n/10. Consider the following factor for AI images.
<Image Caption>
<Image Caption>
Table 6: The continue prompts used for X-IQE.
Fidelity Question (cont.): Give the fidelity rating (in a format like n/10), do not repeat what you have said.
Alignment Question (cont.): Give the rating of text-image alignment (in a format like n/5), do not repeat what you have said.
Aesthetics Question (cont.): Give the rating of aesthetics (in a format like n/10), do not repeat what you have said.
B Failure Situations
You are my assistant to evaluate the image quality. Briefly describe (within 50 words) the type (e.g., photo,
painting) and content of this image, and analyze whether this image meets the following conditions …
{
"Image description": "A white bench with a book and a can of paint on it, surrounded by
trees and greenery.",
"Imperfect details": "The bench has a smooth and shiny surface, and the book and can of
paint are placed on it in a neat way.",
"Improper composition": "The bench is placed in the center of the image, and the book and
can of paint are positioned symmetrically on either side of it.",
"Strange colors": "The colors used in the image are natural and not too bright or saturated.",
"Artificial look": "The image does not have an artificial look, it looks like a real photograph.",
"Fidelity": "It is difficult to say if the image is AI generated or not, it could be real but it is
hard to tell."
}
Give the fidelity rating (in a format like n/10), do not repeat what you have said.
{
"Image description": "A white bench with a book and a can of paint on it, surrounded by
trees and greenery.",
"Imperfect details": "The bench has a smooth and shiny surface, and the book and can of
paint are placed on it in a neat way.",
"Improper composition": "The bench is placed in the center of the image, and the book and
can of paint are positioned symmetrically on either side of it.",
"Strange colors": "The colors used in the image are natural and not too bright or saturated.",
"Artificial look": "The image does not have an artificial look, it looks like a real photograph.",
"Fidelity": "It is difficult to say if the image is AI generated or not, it could be real but it is
hard to tell."
}
Give the fidelity rating (in a format like n/10), do not repeat what you have said.
{
"Image description": "A white bench with a book and a can of paint on it, surrounded by
trees and greenery.",
"Imperfect details": "The bench has a smooth and shiny surface, and the book and can of
paint are placed on it in a neat way.",
"Improper composition": "The bench is placed in the center of the image, and the book and
can of paint are positioned symmetrically on either side of it.",
"Strange colors": "The colors used in the image are natural and not too bright or saturated.",
"Artificial look": "The image does not have an artificial look, it looks like a real photograph.",
"Fidelity": "It is difficult to say if the image is AI generated or not, it could be real but it is
hard to tell."
}
Figure 7: An example of repeated answers without scoring, which is a common bug of Vicuna.
Briefly analyze the aesthetic elements of this image (each item within 20 words)
and score its aesthetics. The scoring criteria for each item are as follows …
{
"Color harmony": "The colors in the image are are not not cohesive.",
"Color brightness": "The brightness of the colors is not not not not not not not not not
not not not not not not not not not not not not not not not not not not not not not
not not not not not not not not not not not not not not not not not not not not not
not not not not not not not not not not not not not not not not not not not not not
not not not not not not not not not not not not not not not not not not not not not
not not not not not not not not not not not not not not not not not not not not not
not not not not not not not not not not not not not not not not not not not not not
not not not not not not not not not not not not not not not not not not not not not
not not not not not not not not not not not not not not not not not not not not not
not not not not not not not not not not not not not not not not not not not not not
not not not not not not not not not not not not not not not not not not not not not
not not not not not not not not not not not not not not not not not not not not not
not not not not not not not not not not not not not not not not not not not not not
not not not not
Figure 8: An example of repeating a single word. This situation often happens when the temperature is lower than 0.1.
Briefly describe (within 50 words) the type (e.g., photo, painting) and content of
this image, how well does the image align with the following description
"A Chihuahua dog is standing next to a toy giraffe and looking up. "
{
"Image description": "A small Chihuahua dog is standing next to a toy giraffe
and looking up.",
"Alignment analysis": "The image shows a small Chihuahua dog standing next
to a toy giraffe, but the alignment between the two elements is poor. The dog
is looking up, but the giraffe is not looking at the dog, and their positions
do not suggest any interaction between them.",
"Alignment score": "2"
}
Figure 9: An example of a hallucination error. The image caption does not mention that the giraffe is looking at the dog or describe any interaction between them.
Briefly analyze the aesthetic elements of this image (each item within 20 words)
and score its aesthetics. The scoring criteria for each item are as follows …
{
"Color harmony": "The colors in the image are harmonious and natural",
"Color brightness": "The brightness of the colors is not too bright or too dull",
"Color saturation": "The saturation of the colors is not too high or too low",
"Composition": "The composition of the image is not too bad",
"Perspective": "The perspective of the image is not too distorted",
"Light and shadow": "The light and shadow in the image is not too harsh or too soft",
"Detailed expression": "The detailed expression of the cows is not too poor",
"Vivid posture": "The posture of the cows is not too stiff",
"Visual impact": "The visual impact of the image is not too weak",
"Overall aesthetic score": "8"
}
Give the rating of aesthetics (in a format like n/10), do not repeat what you have said.
7.5
Give the rating of aesthetics (in a format like n/10), do not repeat what you have said.
8.2
Figure 10: An example of inconsistent ratings. When asked repeatedly for the aesthetics score, the model returns different values that also deviate from the score in its own JSON analysis.
Briefly describe (within 50 words) the type (e.g., photo, painting) and content of this
image, how well does the image align with the following description …
I'm sorry, but I cannot provide an alignment score without the image.
Please provide the image so that I can evaluate it.
Figure 11: An example of giving no answer. This situation sometimes happens for the separate alignment evaluation. If fidelity evaluation has been conducted previously, this error does not happen.