Explainable Image Quality Evaluation For Text-to-Image Generation With Visual Large Language Model
Chris Ding
CUHK-SZ
[email protected]
Abstract
This paper introduces a novel explainable image quality evaluation approach called
X-IQE, which leverages visual large language models (LLMs) to evaluate text-
to-image generation methods by generating textual explanations. X-IQE utilizes
a hierarchical Chain of Thought (CoT) to enable MiniGPT-4 to produce self-
consistent, unbiased texts that are highly correlated with human evaluation. It
offers several advantages, including the ability to distinguish between real and
generated images, evaluate text-image alignment, and assess image aesthetics
without requiring model training or fine-tuning. X-IQE is more cost-effective
and efficient compared to human evaluation, while significantly enhancing the
transparency and explainability of deep image quality evaluation models. We
validate the effectiveness of our method as a benchmark using images generated
by prevalent diffusion models. X-IQE demonstrates similar performance to state-
of-the-art (SOTA) evaluation methods on COCO Caption, while overcoming the
limitations of previous evaluation models on DrawBench, particularly in handling
ambiguous generation prompts and text recognition in generated images. Project
website: https://github.com/Schuture/Benchmarking-Awesome-Diffusion-Models
1 Introduction
Image quality evaluation has long been a practical and crucial technique employed in various
applications, including photo enhancement [2], image stream ranking [5], and album thumbnail
composition [8]. With the advent of artificial intelligence (AI) and generative models, such as
diffusion models [47; 33], there is an increasing demand for effective evaluation methods to assess
the large volume of images generated. The evaluation process of image quality by the human visual
system encompasses multiple factors, such as the rationality of the image content, the alignment with
text descriptions [51], and aesthetics [45]. However, some of these evaluation factors are inherently
subjective and challenging to quantify, presenting a significant obstacle in this task.
Existing solutions for image quality evaluation can be broadly categorized into two groups: human
evaluation and model evaluation. Human evaluation, despite its widespread use, suffers from
significant drawbacks, such as high cost (e.g., crowd-sourcing) and limited reproducibility across
different evaluation groups [35].
∗ Corresponding Author
Figure 1: The paradigm shift from human/model evaluation to LLM explanation of image quality.
On the other hand, model evaluation provides a more desirable alternative; however, it often relies on complex models, including CNN-LSTM [5], CLIP [39], and
BLIP [26], along with specially labeled data and features. This is also expensive and falls short of
the strong generalization capabilities of humans (e.g., on AI-generated images). Furthermore, most
previous model-based evaluations only focus on predicting image quality scores [45; 52], which
makes it challenging to explain the biases and deficiencies in their training data, ultimately resulting
in flawed model performance. Consequently, the question of how to develop cheap, generalizable,
and explainable quality evaluation models remains an open challenge.
In this work, we propose a novel approach that leverages pre-trained visual large language models
(LLMs) to generate analysis texts for images in a conversational style (see Fig. 1). We utilize the
state-of-the-art (SOTA) visual LLM, MiniGPT-4 [56], as our foundational model for implementing
explainable image quality evaluation. Leveraging the in-context learning (ICL) capability [3; 31]
of LLMs, we can inject knowledge into them without further training. Specifically, we propose an
effective prompt set as the standard for LLMs to evaluate image quality. Our method addresses
several key drawbacks associated with existing image quality evaluation models. The advantages of
our proposed explainable image quality evaluation method, X-IQE, are as follows:
• Explainability: Models can describe their reasoning process using Chain of Thought
(CoT) [50] based on the provided prompts, making the results explainable and insightful.
• Comprehensiveness: Carefully designed prompts allow LLMs to perform comprehensive
evaluations, rather than only assessing specifically labeled features (e.g., color, composition [25; 54]).
• Powerful Performance: Advanced LLMs are trained on vast amounts of data and possess
more generalizable image understanding capabilities than most task-specific models.
• Unbiasedness: By utilizing objective prompt text, our model can conduct unbiased evalua-
tions, eliminating the biases that may arise from models trained on datasets annotated by
specific annotation groups, such as crowd-sourcing [35] or annotation companies [52].
• Training-free: Our method harnesses the capabilities of pre-trained LLMs, eliminating the
need for data collection and training efforts required by existing methods [25; 52].
Recent studies on evaluating AI-generated images [52; 51; 35] have predominantly concentrated on
assessing fidelity and text-image alignment scores. Conversely, earlier works assessing real photos or
paintings have primarily emphasized aesthetics [5; 45]. In contrast to existing methods that directly
output evaluation scores, we establish a step-by-step text-to-image method evaluation pipeline. This
approach enables us to generate all three textual evaluations incrementally, allowing subsequent steps
to leverage the information and conclusions derived from the reasoning of previous steps.
To provide a comprehensive analysis of X-IQE and evaluate SOTA text-to-image generative models,
including Stable Diffusion [41], Openjourney², and DeepFloyd-IF³, we conduct extensive experiments. Remarkably, X-IQE performs comparably in a zero-shot manner with specialized SOTA scoring models trained on AI-generated images, and even better in unusual scenarios. These results
demonstrate the efficacy of our method as a versatile text-to-image evaluation protocol.
The contributions of this work are summarized as follows:
² https://openjourney.art/
³ https://huggingface.co/DeepFloyd/IF-I-XL-v1.0
1. We propose X-IQE, an explainable image quality evaluation method based on visual LLMs.
To our knowledge, this is the first application of LLMs for text-to-image evaluation.
2. To enhance the performance and stability of X-IQE, we integrate the expertise of art
professionals into a hierarchical CoT with well-defined conditions and criteria for evaluation.
3. We perform comprehensive experiments on both real and AI-generated images, validating
the explanatory power as well as the quantitative scoring capabilities of X-IQE.
2 Related Work
Human evaluation is widely regarded as the benchmark method for assessing text-to-image generative models, including rule-based methods [20], GANs [15], and diffusion models [18], and is adopted by most SOTA text-to-image works [13; 53; 34; 42]. However, the lack of a
consistent evaluation standard and varying protocols often yield different conclusions among the
works. To overcome this limitation, the first approach employs standardized evaluation protocols, which mitigate issues such as annotators prioritizing monetary returns [1; 21] and introducing biases [35]. The second approach focuses on automatic evaluation metrics, including Inception
Score [43], Frechet Inception Distance [17], and Precision-Recall [24]. Some recent studies have
trained evaluation models on human scores to align with human preferences, but achieve only 65% consistency [52; 51]. Notably, both approaches consider two metrics: overall image quality and
text-image alignment.
Prior aesthetic prediction models have predominantly focused on real images, employing various
approaches for aesthetic assessment. One common practice involves concatenating vector represen-
tations derived from multiple input image patches [30; 45]. Another approach aims to capture the
relationships between different objects or regions within the image to evaluate composition [28; 54].
Notably, the widely used text-to-image generation method, Stable Diffusion, has also been evaluated
using aesthetic predictors trained on AVA [32] and LAION [44] datasets, yielding favorable results.
Large language models (LLMs) have achieved remarkable success in recent years. Three main
paradigms emerged: encoder-only (BERT [12]), encoder-decoder (T5 [40]), and decoder-only (GPT-2 [38]). GPT-3 [3] demonstrated the scaling advantages of the decoder-only paradigm, leading to a
surge in related research, including Megatron-Turing NLG [46], Chinchilla [19], PaLM [10], and
LLaMA [48]. InstructGPT [36] and ChatGPT [4] showed coherent multi-turn conversation skills
through fine-tuning GPT-3 with aligned feedback data. LLMs can generate expected outputs for
test instances without additional training when provided with natural language instructions and/or
task demonstrations [55]. This in-context learning ability [3] allows LLMs to learn new tasks with
minimal overhead during inference. Another important capability is Chain of Thought, where LLMs
solve tasks using a prompting mechanism involving intermediate reasoning steps. CoT prompting
has shown performance gains for models larger than 60B [50]. In this work, we leverage ICL and
CoT to enable smaller LLMs to explain their reasoning process in image quality evaluation.
LLMs, such as VisualGPT [6] and Frozen [49], serve as powerful decoders for visual features. They
enable cross-modal transfer, aligning visual and linguistic knowledge to describe visual information
using language. BLIP-2 [27] effectively aligns visual features with language models through Flan-
T5 [11], demonstrating strong visual QA capabilities. GPT-4 [4], a recent breakthrough, accomplishes
diverse language tasks based on images by aligning an advanced LLM with human preferences and
intentions. Successful visual-language conversation models require robust conversational language
models (e.g., ChatGPT, LLaMA [48], Vicuna [9]), visual encoders (e.g., VIT [14]), and visual-
language alignment training. MiniGPT-4 [56], chosen for its versatility and simplicity, achieves
dialogue-based image understanding by training a linear layer with ample image-text alignment data.
Our proposed method can be flexibly adapted to more advanced visual language models in the future.
3 Method
X-IQE utilizes MiniGPT-4 as its foundational model and incorporates the expertise of art professionals
to design standardized and unbiased prompts. The effectiveness of X-IQE is further enhanced by
its well-structured hierarchical CoT scheme and the application of format constraints for accurate
execution. The overall methodology is depicted in Fig. 2.
3.1 MiniGPT-4
MiniGPT-4 [56] combines a pretrained vision encoder (ViT with Q-Former [27]) and the advanced
LLM Vicuna [9]. ViT enables image parsing, while Vicuna enhances generative capabilities through
conversation rounds. MiniGPT-4 keeps both ViT and Vicuna frozen; only a linear projection layer mapping visual features into Vicuna's embedding space is trained. The model is trained on 5 million text-image pairs
for general vision-language knowledge and fine-tuned with 3.5k aligned conversations for natural
responses. In this work, we perform ICL on the frozen MiniGPT-4 for image quality evaluation.
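The following is a minimal PyTorch-style sketch of this bridge architecture; the class, dimensions, and module interfaces are illustrative assumptions rather than the actual MiniGPT-4 code, but it captures the key point that only the linear projection is trainable while the vision encoder and the LLM stay frozen.

```python
import torch
import torch.nn as nn

class VisualPrefixProjector(nn.Module):
    """Sketch of a MiniGPT-4-style bridge: frozen vision encoder -> trainable
    linear projection -> frozen LLM. Names and shapes are illustrative."""

    def __init__(self, vision_encoder, llm, vis_dim=768, llm_dim=5120):
        super().__init__()
        self.vision_encoder = vision_encoder.eval()   # frozen ViT + Q-Former
        self.llm = llm.eval()                         # frozen Vicuna
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False
        self.proj = nn.Linear(vis_dim, llm_dim)       # the only trainable part

    def forward(self, image, text_embeds):
        # Visual tokens from the frozen encoder, e.g. (batch, num_tokens, vis_dim).
        vis_tokens = self.vision_encoder(image)
        prefix = self.proj(vis_tokens)                # map into the LLM embedding space
        # Prepend the projected visual tokens to the text embeddings and decode.
        inputs = torch.cat([prefix, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```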
Challenges. Using pretrained LLMs for image quality assessment presents challenges that require
careful ICL strategy design. First, the training data of MiniGPT-4 lacks samples from SOTA visual generative models, so the LLM tends to describe generated images as normal images based on its pre-training preferences. We
address this challenge by incorporating explicit quality evaluation aspects, standards, and reasoning
processes inspired by art professionals’ knowledge. Additionally, Vicuna in MiniGPT-4 has a limited
parameter count (7B/13B), which may affect analysis rationality and result accuracy in complex
scenarios. To mitigate this, we impose strict constraints on the answer structure.
3.2 Expert Knowledge for Image Quality Evaluation
Our art industry consultant, experienced in using Stable Diffusion and Midjourney, identified several
empirical discrimination methods for AI image generation. When generating rare and complex
objects, AI produces blurred or distorted details, particularly with human hands and text. Occluded
objects often have inconsistent details on both sides of the occluder. AI-generated images may exhibit
oversaturation, especially when unusual colors are present. Lastly, in photo scenes, AI-generated
images can sometimes have a partially greasy feeling.
Aesthetics judgment indicators are well-studied and less subjective than commonly believed. Aesthet-
ically pleasing images should exhibit harmonious and bright colors, moderate saturation, appropriate
lighting and shadow, well-composed scenes, and a sense of space through perspective [37]. Mem-
orable images should also feature intricate details, expressiveness, and strong visual impact [22].
X-IQE incorporates expertise for discriminating AI images and aesthetic evaluation as prompts.
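As a concrete illustration, such expert criteria can be folded directly into the evaluation prompt as plain text; the snippet below is a simplified sketch of this idea (the exact prompts used by X-IQE are shown in Fig. 2 and Appendix A).

```python
# Simplified sketch of expressing expert criteria as prompt text.
# The actual X-IQE prompts (Fig. 2, Appendix A) are more detailed.
AI_ARTIFACT_CUES = [
    "Imperfect details: distorted, blurry, or irrational faces, limbs, fingers, objects, or texts.",
    "Improper composition: some misplaced object relationships.",
    "Strange colors: overly bright, saturated colors.",
    "Artificial look: looks like a real image but has an unclear rendering or other artificial look.",
]

def build_fidelity_prompt() -> str:
    criteria = "\n".join(f"{i + 1}. {cue}" for i, cue in enumerate(AI_ARTIFACT_CUES))
    return (
        "You are my assistant to evaluate the image quality. "
        "Briefly describe the type and content of this image, and analyze whether "
        "it meets the following conditions of an AI-generated image.\n" + criteria
    )
```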
3.3 Hierarchical Chain of Thought
We design a hierarchical CoT strategy for X-IQE to enhance the coherence and quality of its responses.
X-IQE evaluates image quality in terms of fidelity, alignment, and aesthetics, recognizing that these
attributes are interconnected. The evaluation results of certain attributes can serve as priors for
assessing other attributes. Fidelity, which measures the realism of an image, influences the assessment
of text-to-image alignment since AI-generated images often exhibit lower alignment compared to
real images [35]. Additionally, the identification of an image as AI-generated impacts the aesthetic
analysis by highlighting specific defects identified in the fidelity evaluation. The alignment evaluation
benefits from knowing the true prompt used for image generation, which in turn aids the assessment of visual impact and composition in the aesthetic evaluation. Our CoT is designed as follows:
• fidelity evaluation → alignment evaluation → aesthetic evaluation
Within each evaluation task, a dedicated CoT is established, encompassing:
• (image description) → task-specific analysis → scoring
The image description is included in the first task and reused in the subsequent tasks.
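The sketch below shows how this hierarchical CoT could be driven programmatically. The `chat(history, prompt, image)` interface is an assumed stand-in for MiniGPT-4's conversation API, not the released code; the important property is that all three stages share one conversation so later stages can reuse earlier analyses.

```python
from typing import Callable, Dict, List

def evaluate_image(chat: Callable, image, caption: str,
                   prompts: Dict[str, str]) -> Dict[str, str]:
    """Run fidelity -> alignment -> aesthetics in order on one shared conversation.
    `chat(history, prompt, image)` is an assumed interface that returns the reply."""
    history: List[Dict[str, str]] = []
    results: Dict[str, str] = {}
    # The fidelity stage produces the image description; because every stage
    # appends to the same history, later stages can build on that analysis.
    for stage in ("fidelity", "alignment", "aesthetics"):
        prompt = prompts[stage].format(caption=caption)
        reply = chat(history, prompt, image)
        history.append({"prompt": prompt, "reply": reply})
        results[stage] = reply
    return results
```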
Chain of Thought stages: Image Description → Task-specific Analysis → Scoring.

Fidelity prompt:
You are my assistant to evaluate the image quality. Briefly describe (within 50 words) the type (e.g., photo, painting) and content of this image, and analyze whether this image meets the following conditions of an AI-generated image (within 30 words per point).
1. Imperfect details: distorted, blurry, or irrational faces, limbs, fingers, objects, or texts.
2. Improper composition: some misplaced object relationships.
3. Strange colors: overly bright, saturated colors.
4. Artificial look: looks like a real image but has an unclear rendering or other artificial look.
Provide your analysis in JSON format with the following keys: Image description, Imperfect details, Improper composition, Strange colors, Artificial look, Fidelity (e.g., 6/10). The fidelity scoring criteria are as follows:
Definitely AI-generated (0-1), Very likely AI-generated (2-3), Certain probability of AI generation (4), Unsure (5), Certain probability being real (6), Very real (7-8), Definitely real (9-10).

Example fidelity response:
{
"Image description": "A yellow and blue subway train at a station.",
"Imperfect details": "The train's windows are a bit blurry and the wheels are not very detailed.",
"Improper composition": "The positioning of the train in the frame is slightly off-center.",
"Strange colors": "The colors are bright and vibrant, but not necessarily strange.",
"Artificial look": "The overall image looks quite realistic, but there is a slight artificiality to the lighting and the way the train is rendered.",
"Fidelity": "5/10"
}

Alignment prompt:
According to the image and your previous description, how well does the image align with the following description?
Does not match at all (1), Has significant discrepancies (2), Has several minor discrepancies (3), Has a few minor discrepancies (4), Matches exactly (5).
Provide your analysis in JSON format with the following keys: Alignment analysis (within 100 words), Alignment score (e.g., 4/5).

Example alignment response:
{
"Alignment analysis": "The image depicts a yellow and blue subway train stationary at a train station, with a clear and detailed representation of the train, the platform, and the surrounding environment. However, there are some minor discrepancies in the positioning of the train, which is slightly tilted to the left, and the platform is uneven.",
"Alignment score": "4/5"
}

Aesthetics prompt:
Briefly analyze the aesthetic elements of this image (each item within 20 words) and score its aesthetics. The scoring criteria for each item are as follows.
Extremely bad (0-1), Poor quality (2-3), Below average (4-5), Above average (5-6), Good (7-8), Excellent (9), Wonderful (10).
Provide your analysis in JSON format with the following keys: Color harmony, Color brightness, Color saturation, Composition, Perspective, Light and shadow, Detailed expression, Vivid posture, Visual impact, Overall aesthetic score (e.g., 6/10).

Example aesthetics response (truncated):
{
"Color harmony": "The color scheme is bright and vibrant, with a balance of yellow and blue.",
"Color brightness": "The colors are bright, but not overly so.",
"Color saturation": "The colors are saturated, but not overly so.",
"Composition": "The composition is well-balanced, with a clear focal point.",
"Perspective": "The perspective is realistic, with a slight tilt to the left.",
"Light and shadow": "The lighting is bright and natural, with clear shadows.",
"Detailed expression": "The details are clear and well-defined.",
"Vivid posture": "The posture of the train is vivid and dynamic.",
"Visual impact": "The visual impact is good, with a clear focal point.",
"Overall aesthetic

Continue prompt:
Give the rating of aesthetics (in the format like n/10), do not repeat what you have said.
Figure 2: The illustration showcases X-IQE using an image generated by Stable Diffusion 1.4. In
X-IQE, MiniGPT-4 with Vicuna serves as the chatbot. Carefully designed prompts are employed
for image quality analysis, incorporating a hierarchical CoT strategy to ensure response logic and
accuracy. The output format and scoring criteria constraints contribute to stable and coherent
responses that are easy to follow.
3.4 Constraints of Output Formats and Scoring Criteria
During our preliminary experiments, we aimed to ensure that MiniGPT-4 produces content that
adheres to the CoT structure. However, we encountered issues with unstable outputs. The two most
common unexpected outputs were: 1) content being generated before the analysis and 2) scoring
presented in varying styles, such as percentages or ratings based on different scales. To enhance
the likelihood of CoT-compliant output and establish an objective and unified scoring standard, we
implemented two improvements: 1) requiring the model to provide output in JSON format and 2)
explicitly defining the conditions associated with different scores.
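A minimal sketch of how such constrained outputs might be parsed is given below; the follow-up question mirrors the "continue" prompts listed in Appendix A, while the helper names, the JSON key handling, and the retry logic are our own illustrative assumptions.

```python
import json
import re
from typing import Optional

SCORE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*/\s*10")

def parse_score(reply: str) -> Optional[float]:
    """Extract a score such as '6/10' from a (hopefully) JSON-formatted reply."""
    try:
        data = json.loads(reply)
        reply = str(data.get("Fidelity", reply))
    except (json.JSONDecodeError, AttributeError):
        pass  # fall back to a regex search on the raw text
    match = SCORE_PATTERN.search(reply)
    return float(match.group(1)) if match else None

def ask_until_scored(chat, history, prompt, image, max_retries: int = 2):
    """Re-ask with the continue prompt when the model refuses to give a number."""
    reply = chat(history, prompt, image)
    score = parse_score(reply)
    retries = 0
    while score is None and retries < max_retries:
        reply = chat(history, "Give the fidelity rating (in a format like n/10), "
                              "do not repeat what you have said.", image)
        score = parse_score(reply)
        retries += 1
    return score, reply
```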
4 Experiments
Datasets: To evaluate X-IQE’s capability, we utilize COCO Captions [7] and DrawBench texts [42]
for AI image generation. We randomly sample 1000 prompts from COCO Captions while all 200
prompts from DrawBench are used. All diffusion models generate images using DPM Solver [29]
with 25 steps and a guidance scale of 7.5.
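For reference, this sampling configuration can be reproduced with the diffusers library roughly as follows; the model identifier and prompt are examples, and the snippet is a sketch of the setup (DPM-Solver, 25 steps, guidance scale 7.5) rather than the exact script used in our experiments.

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

# Example: Stable Diffusion 1.4 with the DPM-Solver scheduler (25 steps, CFG 7.5).
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

prompt = "A yellow and blue subway train at a station."  # e.g., a COCO caption
image = pipe(prompt, num_inference_steps=25, guidance_scale=7.5).images[0]
image.save("sample.png")
```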
Models: Our experiments involve both MiniGPT-4 models with 7B and 13B Vicuna parameters. No
major modifications are made to the model code base⁴. During inference, we set beam=1, and the temperature ranges from 0.01 to 1.0. The models are executed on an NVIDIA RTX A6000 48GB GPU.
Metrics: To validate X-IQE’s performance in fidelity evaluation, we employ the recall of the
AI-generated images. For alignment and aesthetics validation, we compare the Pearson correlation
coefficient between human evaluation and task-specific models (such as CLIPScore [16] for alignment
and Stable Diffusion Aesthetic Predictor⁵ for aesthetics) or X-IQE.
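Concretely, these validation metrics can be computed as in the following SciPy-based sketch; the function names and the fidelity threshold used for recall are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def generated_recall(fidelity_scores, threshold=5):
    """Fraction of AI-generated images that are flagged as generated.
    Assumption for illustration: scores at or below `threshold` count as 'generated'."""
    scores = np.asarray(fidelity_scores, dtype=float)
    return float(np.mean(scores <= threshold))

def human_correlation(model_scores, human_scores):
    """Pearson correlation between model scores and human evaluation scores."""
    r, p_value = stats.pearsonr(model_scores, human_scores)
    return r, p_value

def distribution_gap(real_scores, fake_scores):
    """Two-sample Kolmogorov-Smirnov test between the fidelity score
    distributions of real and generated images (cf. Fig. 4 (a))."""
    return stats.ks_2samp(real_scores, fake_scores)
```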
Figure 3: The stability analysis of X-IQE. (a) The models output more consistent results when the temperature is lower, with Vicuna 13B always better than the 7B variant. (b) The larger LLM gives more precise and valid responses to the questions for all three tasks.
One question regarding LLMs is whether repeated experiments yield judgments as consistent as those of traditional prediction models, and whether their responses always address our questions. We
investigate model consistency in relation to model size and temperature, as depicted in Fig. 3.
The temperature parameter is crucial in LLMs as it controls the randomness and creativity of generated
text. We conduct three repeated experiments with fixed temperatures to examine the consistency
of models’ fidelity prediction (Fig. 3 (a)). Krippendorff’s α [23] decreases as the temperature
increases, indicating that models yield more divergent results with higher temperatures. General
annotator groups typically achieve 0.11 Krippendorff’s α, while expert annotators can reach 0.53 [35].
Moreover, Vicuna 13B consistently exhibits better consistency than Vicuna 7B.
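Krippendorff's α over repeated runs can be computed, for example, with the third-party krippendorff package; the snippet below is a sketch in which the data layout (runs as coders, images as units) and the example numbers are assumptions consistent with the setup described above.

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Fidelity scores from three repeated runs (rows) over the same images (columns);
# np.nan marks an image for which a run returned no valid score.
runs = np.array([
    [7, 5, 5, 6, np.nan, 4],
    [7, 5, 4, 6, 5,      4],
    [6, 5, 5, 7, 5,      3],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=runs, level_of_measurement="ordinal")
print(f"Krippendorff's alpha across runs: {alpha:.3f}")
```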
⁴ https://github.com/Vision-CAIR/MiniGPT-4
⁵ https://github.com/christophschuhmann/improved-aesthetic-predictor
Table 1: Ablation study of CoTs within and between tasks. The baseline is directly asking MiniGPT-4
for separate evaluation scores. Fidelity is measured using the recall of generated images, while
alignment and aesthetics are assessed using Pearson correlation coefficients with human evaluations.
It’s worth noting that the human evaluation exhibits a correlation coefficient of 0.137 with the
CLIPScore and 0.067 with the Aesthetic Predictor, serving as a reference for comparison.
Subtask     | baseline | +prompt | +prompt+format | +fidelity | +alignment | +fidelity+alignment
Fidelity    | 0.021    | 0.0     | 0.698          | –         | –          | –
Alignment   | 0.118    | 0.082   | 0.263          | 0.381     | –          | –
Aesthetics  | 0.030    | -0.162  | 0.259          | 0.369     | 0.351      | 0.418
(The first three result columns vary the CoT within tasks; the last three vary the CoT between tasks.)
Example descriptions from Fig. 4 (b): "A black and white photograph of a steam locomotive with the number 717 on the side, surrounded by trees and a mountainous landscape." / "The girl's face is distorted and the color of her hoodie is a bit off. The placement of the hands and the phone could be improved."
(a) The fidelity distributions of real/generated images (b) Examples of right and wrong judgments
Figure 4: The results and examples of fidelity evaluation on COCO Caption. (a) Real and fake images
mainly differ in the distribution of fidelity scores. (b) The examples illustrate the capability of X-IQE
to perform image analysis during inference, although some judgments may be inaccurate.
X-IQE aims to derive quantitative evaluations for three tasks, but the models do not always provide
numerical results as expected. For fidelity evaluation, a common response is "It’s a bit unclear if the
image is AI-generated or not", even after two rounds of questioning "Give the rating of fidelity (in
the format like n/10)". The success rates of answering are presented in Fig. 3 (b). Generally, larger
Vicuna performs better in this experiment. Both models have a probability over 10% of providing
unexpected fidelity answers. However, refusal to answer for alignment and aesthetics evaluation is
rare, with Vicuna 13B failing to answer properly only in exceptional cases.
In all subsequent experiments, we employ Vicuna 13B with a temperature of 0.1 for accuracy and
reproducibility. Smaller temperatures are avoided due to model instability, as discussed in Section 5.
The ablation study results of CoT are shown in Tab. 1. We observe that without CoT within tasks,
the performance of X-IQE is unsatisfactory. Simply asking the model for a fidelity or alignment
score typically yields high scores, as no consistent standards are provided. Incorporating specific
requirements for task-specific analysis, including scoring criteria and evaluation aspects, improves the
model’s ability to generate reasonable analysis. However, obtaining reliable numerical scores remains
challenging due to Vicuna’s tendency to provide answers before thorough analysis. The inclusion of
formatting requirements ultimately enables the model to produce desirable quantitative results based
on its analysis. Regarding the CoT between tasks, the alignment and aesthetics scores match the human evaluations better when the analyses from previous stages are available. The model frequently utilizes
earlier analysis when responding to subsequent questions. For example, fidelity analysis considering
image details often aids in evaluating aesthetics such as composition and expressive details.
The results and examples of fidelity inference using X-IQE are presented in Fig. 4. Though the mean
fidelity scores for real and generated images are similar, it does not imply that X-IQE lacks strong
discriminative ability. The fidelity score distributions of real and generated images (Fig. 4 (a)) exhibit
significant differences with a p-value of < 10⁻⁵ in the Kolmogorov–Smirnov test. X-IQE tends to assign more scores of 7 to real images and more scores of 5 to SD-generated images. Interestingly, SD 1.4 can deceive both X-IQE and human observers in quite a few cases. Fig. 4 (b) demonstrates examples of correct and incorrect judgments for detecting AI generation. Furthermore, X-IQE assigns very few scores of 6, possibly due to the language bias learned during pretraining.
(a) Alignment score distributions (r=0.381) (b) Aesthetics score distributions (r=0.418)
Figure 5: The distributions of alignment and aesthetics scores on COCO Caption. In contrast to humans, who tend to provide score judgments centered around the mean, X-IQE exhibits a broader range of judgments that include both extremely high and extremely low values.
The distributions of alignment and aesthetics scores from X-IQE and human evaluations are depicted
in Fig. 5. These scores demonstrate higher correlation coefficients of 0.381 and 0.418 with human
evaluations compared to task-specific models like CLIPScore (0.137) and Aesthetic Predictor (0.067),
highlighting the generalizability of X-IQE on the generated dataset. CLIPScore, trained on text-image
pairs collected before 2021 [16], and Aesthetic Predictor, trained on the AVA dataset [32], tend to un-
derperform when dealing with open-scene texts and images, often producing predictions inconsistent
with human expectations [52]. For example, the Aesthetic Predictor outputs values within the narrow
range of [4.8, 5.6], indicating its poor predictive ability within this data domain (the model’s output
range is [0, 10]). X-IQE addresses this issue by incorporating domain-specific knowledge, such as
scoring criteria, into the inference process. But unlike the unimodal nature of human evaluations,
X-IQE’s score distributions are bimodal, suggesting further room for improvement.
One of the significant applications of X-IQE is its role as an impartial referee in evaluating various
generative models. We evaluate four SOTA text-to-image models, including Stable Diffusion 1.4
and 2.1, Openjourney, and DeepFloyd-IF. Furthermore, we incorporate the results from two leading
evaluation models specifically trained on AI-generated images (i.e., ImageReward [52] and HPS [51]).
Tab. 2 displays the outcomes for images generated using COCO Caption prompts. Notably, CLIPScore and Aesthetic Predictor do not align with human perception, whereas X-IQE exhibits strong agreement with ImageReward and HPS, selecting DeepFloyd-IF as the best among the four methods. The overall score
of X-IQE exhibits the same ranking as HPS. X-IQE also demonstrates its ability to extract reasonable detailed item scores. DeepFloyd-IF, serving as a public reproduction of Imagen [42], generates images in the pixel space guided by the text encoder T5 XXL [40], enabling it to produce highly photorealistic images that align well with the provided texts. In terms of aesthetics, Openjourney outperforms the other models due to its high-quality training data generated by Midjourney.
Table 3: Benchmarking text-to-image models on DrawBench.
Model                | CLIP  | Aes. Pred. | ImgRwd | HPS    | X-IQE Fidelity | X-IQE Alignment | X-IQE Aesthetics | X-IQE Overall
Stable Diffusion 1.4 | 0.793 | 5.09       | -0.029 | 0.1945 | 5.32           | 2.72            | 5.40             | 13.44
Stable Diffusion 2.1 | 0.817 | 5.31       | 0.163  | 0.1955 | 5.10           | 2.50            | 5.04             | 12.64
Openjourney          | 0.787 | 5.35       | 0.056  | 0.1972 | 5.14           | 2.62            | 5.21             | 12.97
DeepFloyd-IF         | 0.827 | 5.10       | 0.541  | 0.1977 | 5.32           | 2.96            | 5.64             | 13.92
Vague prompt: "A large plant-eating domesticated mammal with solid hoofs and a flowing mane and tail, used for riding, racing, and to carry and pull loads"
X-IQE scores for the four generated images (left to right in Fig. 6): Fidelity 7 / Alignment 4 / Aesthetics 6; Fidelity 5 / Alignment 3 / Aesthetics 5; Fidelity 5 / Alignment 3 / Aesthetics 5; Fidelity 7 / Alignment 4 / Aesthetics 8
Text prompt:
X-IQE scores for the four generated images (left to right in Fig. 6): Fidelity 5 / Alignment 2 / Aesthetics 4; Fidelity 5 / Alignment 1 / Aesthetics 1; Fidelity 5 / Alignment 4 / Aesthetics 7; Fidelity 7 / Alignment 4 / Aesthetics 6
Figure 6: Comparison of images generated with DrawBench prompts and their X-IQE scoring.
Tab. 3 shows the evaluation results on DrawBench, a challenging benchmark including unconventional
prompts like vague descriptions and text generation. In this scenario, X-IQE provides conclusions
that differ from those of ImageReward and HPS. Fig. 6 illustrates the images generated with the four
models. X-IQE accurately ranks them, especially for the poor alignment and aesthetics of SD2.1.
This can be attributed to the robust capabilities of X-IQE in parsing long and ambiguous texts and
recognizing texts within the images, which traditional models struggle to accomplish.
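The overall X-IQE score in Tab. 3 is the sum of the fidelity, alignment, and aesthetics sub-scores; the short sketch below simply re-aggregates the reported numbers to reproduce the ranking discussed above (the helper code is ours, not part of X-IQE).

```python
# DrawBench X-IQE sub-scores copied from Tab. 3; the overall score equals
# fidelity + alignment + aesthetics for every row, so we aggregate that way here.
scores = {
    "Stable Diffusion 1.4": (5.32, 2.72, 5.40),
    "Stable Diffusion 2.1": (5.10, 2.50, 5.04),
    "Openjourney":          (5.14, 2.62, 5.21),
    "DeepFloyd-IF":         (5.32, 2.96, 5.64),
}

overall = {model: round(sum(parts), 2) for model, parts in scores.items()}
ranking = sorted(overall.items(), key=lambda kv: kv[1], reverse=True)
for rank, (model, total) in enumerate(ranking, start=1):
    print(f"{rank}. {model}: {total}")
# DeepFloyd-IF (13.92) > Stable Diffusion 1.4 (13.44) > Openjourney (12.97) > Stable Diffusion 2.1 (12.64)
```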
5 Limitations
Model capability. Most visual encoders, including ViT in MiniGPT-4, were pre-trained using low-
resolution images (e.g., 224²), whereas images generated with modern diffusion techniques often range from 512² to 1024². The significant downsampling of images during inference could negatively
impact the model’s ability to discriminate fine details. Additionally, it has been shown that CoTs
typically work best for models with a size larger than 60B. When using smaller models like MiniGPT-
4 with 7B/13B parameters, we observed various unexpected responses such as hallucinations, repeated
answers, and incorrect output formats, especially with temperatures smaller than 0.1.
CoT design. The CoT strategy proposed in this work is concise and includes only necessary
information for the LLM’s inference. Longer and more detailed CoTs might be more effective but were not explored due to the limited context length that current LLMs can support, typically shorter than
1,000 tokens. It has also been observed that X-IQE typically rates the image only with integers and
tends to produce extreme quantitative results. This could be mitigated in future research by improving
the prompts.
6 Conclusion
In this study, we have introduced X-IQE, the first explainable method for image quality evaluation
with LLMs. This approach was developed to address the limitations of existing image evaluation
models and to align the evaluation results more closely with human judgments. By employing a
meticulously designed CoT strategy, X-IQE is able to differentiate AI-generated images, assess
the alignment between text and image, and analyze aesthetic elements. Our experimental results
demonstrate the reproducibility and effectiveness of X-IQE, as well as its capability to serve as a
benchmark for current SOTA text-to-image generation methods without requiring additional training
data. The ability of X-IQE to provide explanations for its reasoning process enables researchers to
efficiently identify and target its weaknesses for further improvement.
References
[1] Herman Aguinis, Isabel Villamor, and Ravi S Ramani. Mturk research: Review and recommen-
dations. Journal of Management, 47(4):823–837, 2021.
[2] Subhabrata Bhattacharya, Rahul Sukthankar, and Mubarak Shah. A holistic approach to aesthetic
enhancement of photographs. ACM Transactions on Multimedia Computing, Communications,
and Applications (TOMM), 7(1):1–21, 2011.
[3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[4] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece
Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general
intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
[5] Kuang-Yu Chang, Kung-Hung Lu, and Chu-Song Chen. Aesthetic critiques generation for
photos. In Proceedings of the IEEE international conference on computer vision, pages 3514–
3523, 2017.
[6] Jun Chen, Han Guo, Kai Yi, Boyang Li, and Mohamed Elhoseiny. Visualgpt: Data-efficient
adaptation of pretrained language models for image captioning. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 18030–18040, 2022.
[7] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár,
and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv
preprint arXiv:1504.00325, 2015.
[8] Yi-Ling Chen, Jan Klopp, Min Sun, Shao-Yi Chien, and Kwan-Liu Ma. Learning to compose
with professional photographs on the web. In Proceedings of the 25th ACM international
conference on Multimedia, pages 37–45, 2017.
[9] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng,
Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot
impressing gpt-4 with 90%* chatgpt quality, 2023.
[10] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam
Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm:
Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
[11] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li,
Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned
language models. arXiv preprint arXiv:2210.11416, 2022.
[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,
2018.
[13] Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. Cogview2: Faster and better text-to-image
generation via hierarchical transformers. arXiv preprint arXiv:2204.14217, 2022.
[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al.
An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint
arXiv:2010.11929, 2020.
[15] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications
of the ACM, 63(11):139–144, 2020.
[16] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A
reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference
on Empirical Methods in Natural Language Processing, pages 7514–7528, 2021.
[17] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in
neural information processing systems, 30, 2017.
[18] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances
in Neural Information Processing Systems, 33:6840–6851, 2020.
[19] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza
Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al.
Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
[20] Qixin Hu, Yixiong Chen, Junfei Xiao, Shuwen Sun, Jieneng Chen, Alan Yuille, and Zongwei
Zhou. Label-free liver tumor segmentation. arXiv preprint arXiv:2303.14869, 2023.
[21] Ryan Kennedy, Scott Clifford, Tyler Burleigh, Philip D Waggoner, Ryan Jewell, and Nicholas JG
Winter. The shape of and solutions to the mturk quality crisis. Political Science Research and
Methods, 8(4):614–629, 2020.
[22] Aditya Khosla, Akhil S Raju, Antonio Torralba, and Aude Oliva. Understanding and predicting
image memorability at a large scale. In Proceedings of the IEEE international conference on
computer vision, pages 2390–2398, 2015.
[23] Klaus Krippendorff. Content analysis: An introduction to its methodology. Sage publications,
2018.
[24] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved
precision and recall metric for assessing generative models. Advances in Neural Information
Processing Systems, 32, 2019.
[25] Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter
Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models
using human feedback. arXiv preprint arXiv:2302.12192, 2023.
[26] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-
image pre-training for unified vision-language understanding and generation. In International
Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
[27] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-
image pre-training with frozen image encoders and large language models. arXiv preprint
arXiv:2301.12597, 2023.
[28] Dong Liu, Rohit Puri, Nagendra Kamath, and Subhabrata Bhattacharya. Composition-aware im-
age aesthetics assessment. In Proceedings of the IEEE/CVF Winter Conference on Applications
of Computer Vision, pages 3569–3578, 2020.
[29] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A
fast ode solver for diffusion probabilistic model sampling in around 10 steps. In Advances in
Neural Information Processing Systems, 2022.
[30] Xin Lu, Zhe Lin, Xiaohui Shen, Radomir Mech, and James Z Wang. Deep multi-patch
aggregation network for image style, aesthetics, and quality estimation. In Proceedings of the
IEEE international conference on computer vision, pages 990–998, 2015.
[31] Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and
Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning
work? arXiv preprint arXiv:2202.12837, 2022.
[32] Naila Murray, Luca Marchesotti, and Florent Perronnin. Ava: A large-scale database for
aesthetic visual analysis. In 2012 IEEE conference on computer vision and pattern recognition,
pages 2408–2415. IEEE, 2012.
[33] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic
models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
[34] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin,
Bob Mcgrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image genera-
tion and editing with text-guided diffusion models. In International Conference on Machine
Learning, pages 16784–16804. PMLR, 2022.
[35] Mayu Otani, Riku Togashi, Yu Sawai, Ryosuke Ishigami, Yuta Nakashima, Esa Rahtu, Janne
Heikkilä, and Shin’ichi Satoh. Toward verifiable and reproducible human evaluation for text-to-
image generation. arXiv preprint arXiv:2304.01816, 2023.
[36] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin,
Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to
follow instructions with human feedback. Advances in Neural Information Processing Systems,
35:27730–27744, 2022.
[37] Stephen E Palmer, Karen B Schloss, and Jonathan Sammartino. Visual aesthetics and human
preference. Annual review of psychology, 64:77–107, 2013.
[38] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al.
Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[39] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal,
Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual
models from natural language supervision. In International conference on machine learning,
pages 8748–8763. PMLR, 2021.
[40] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena,
Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified
text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
[41] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-
resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[42] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton,
Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al.
Photorealistic text-to-image diffusion models with deep language understanding. Advances in
Neural Information Processing Systems, 35:36479–36494, 2022.
[43] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen.
Improved techniques for training gans. Advances in neural information processing systems, 29,
2016.
[44] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman,
Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-
5b: An open large-scale dataset for training next generation image-text models. In Thirty-sixth
Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
[45] Kekai Sheng, Weiming Dong, Chongyang Ma, Xing Mei, Feiyue Huang, and Bao-Gang Hu.
Attention-based multi-patch aggregation for image aesthetic assessment. In Proceedings of the
26th ACM international conference on Multimedia, pages 879–886, 2018.
[46] Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari,
Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. Using
deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language
model. arXiv preprint arXiv:2201.11990, 2022.
[47] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and
Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv
preprint arXiv:2011.13456, 2020.
[48] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo-
thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open
and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[49] Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill.
Multimodal few-shot learning with frozen language models. Advances in Neural Information
Processing Systems, 34:200–212, 2021.
[50] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le,
Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In
Advances in Neural Information Processing Systems, 2022.
[51] Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Better aligning text-to-
image models with human preference. arXiv preprint arXiv:2303.14420, 2023.
[52] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao
Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation.
arXiv preprint arXiv:2304.05977, 2023.
[53] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay
Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive
models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.
[54] Bo Zhang, Li Niu, and Liqing Zhang. Image composition assessment with saliency-augmented
multi-pattern pooling. arXiv preprint arXiv:2104.03133, 2021.
[55] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min,
Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv
preprint arXiv:2303.18223, 2023.
[56] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: En-
hancing vision-language understanding with advanced large language models. arXiv preprint
arXiv:2304.10592, 2023.
A All Prompts Used in the Experiments
Fidelity Question: […] Provide your analysis in JSON format with the following keys: Image description, Imperfect details, Improper composition, Strange colors, Artificial look, Fidelity (e.g., 6/10). The fidelity scoring criteria are as follows: […]
Alignment Question: […] <Image Caption> […] Provide your analysis in JSON format with the following keys: Alignment analysis (within 100 words), Alignment score (e.g., 4/5).
Aesthetics Question: Briefly analyze the aesthetic elements of this image (each item within 20 words) and score its aesthetics. The scoring criteria for each item are as follows. […] Provide your analysis in JSON format with the following keys: Color harmony, Color brightness, Color saturation, Composition, Perspective, Light and shadow, Detailed expression, Vivid posture, Visual impact, Overall aesthetic score (e.g., 6/10).
(See Fig. 2 for the full prompt wording and scoring criteria.)
Table 5: The ablation prompts used for X-IQE.
Fidelity question baseline: You are my assistant to evaluate the image quality. Whether this image seems like a real image instead of AI-generated? Give your fidelity score out of 10 in a format like n/10.
Fidelity question w/o format: You are my assistant to evaluate the image quality. Whether this image seems like a real image instead of AI-generated? Give your fidelity score out of 10 in a format like n/10. Consider the following factor for AI images.
<Image Caption>
<Image Caption>
Table 6: The continue prompts used for X-IQE.
Fidelity Question (cont.): Give the fidelity rating (in a format like n/10), do not repeat what you have said.
Alignment Question (cont.): Give the rating of text-image alignment (in a format like n/5), do not repeat what you have said.
Aesthetics Question (cont.): Give the rating of aesthetics (in a format like n/10), do not repeat what you have said.
B Failure Situations
You are my assistant to evaluate the image quality. Briefly describe (within 50 words) the type (e.g., photo,
painting) and content of this image, and analyze whether this image meets the following conditions …
{
"Image description": "A white bench with a book and a can of paint on it, surrounded by
trees and greenery.",
"Imperfect details": "The bench has a smooth and shiny surface, and the book and can of
paint are placed on it in a neat way.",
"Improper composition": "The bench is placed in the center of the image, and the book and
can of paint are positioned symmetrically on either side of it.",
"Strange colors": "The colors used in the image are natural and not too bright or saturated.",
"Artificial look": "The image does not have an artificial look, it looks like a real photograph.",
"Fidelity": "It is difficult to say if the image is AI generated or not, it could be real but it is
hard to tell."
}
Give the fidelity rating (in a format like n/10), do not repeat what you have said.
{
"Image description": "A white bench with a book and a can of paint on it, surrounded by
trees and greenery.",
"Imperfect details": "The bench has a smooth and shiny surface, and the book and can of
paint are placed on it in a neat way.",
"Improper composition": "The bench is placed in the center of the image, and the book and
can of paint are positioned symmetrically on either side of it.",
"Strange colors": "The colors used in the image are natural and not too bright or saturated.",
"Artificial look": "The image does not have an artificial look, it looks like a real photograph.",
"Fidelity": "It is difficult to say if the image is AI generated or not, it could be real but it is
hard to tell."
}
Give the fidelity rating (in a format like n/10), do not repeat what you have said.
{
"Image description": "A white bench with a book and a can of paint on it, surrounded by
trees and greenery.",
"Imperfect details": "The bench has a smooth and shiny surface, and the book and can of
paint are placed on it in a neat way.",
"Improper composition": "The bench is placed in the center of the image, and the book and
can of paint are positioned symmetrically on either side of it.",
"Strange colors": "The colors used in the image are natural and not too bright or saturated.",
"Artificial look": "The image does not have an artificial look, it looks like a real photograph.",
"Fidelity": "It is difficult to say if the image is AI generated or not, it could be real but it is
hard to tell."
}
Figure 7: An example of repeated answers without scoring, which is a common bug of Vicuna.
Briefly analyze the aesthetic elements of this image (each item within 20 words)
and score its aesthetics. The scoring criteria for each item are as follows …
{
"Color harmony": "The colors in the image are are not not cohesive.",
"Color brightness": "The brightness of the colors is not not not not not not not not not
not not not not not not not not not not not not not not not not not not not not not
not not not not not not not not not not not not not not not not not not not not not
not not not not not not not not not not not not not not not not not not not not not
not not not not not not not not not not not not not not not not not not not not not
not not not not not not not not not not not not not not not not not not not not not
not not not not not not not not not not not not not not not not not not not not not
not not not not not not not not not not not not not not not not not not not not not
not not not not not not not not not not not not not not not not not not not not not
not not not not not not not not not not not not not not not not not not not not not
not not not not not not not not not not not not not not not not not not not not not
not not not not not not not not not not not not not not not not not not not not not
not not not not not not not not not not not not not not not not not not not not not
not not not not
Figure 8: An example of repeating a single word. This situation often happens when the temperature is lower than 0.1.
Briefly describe (within 50 words) the type (e.g., photo, painting) and content of
this image, how well does the image align with the following description
"A Chihuahua dog is standing next to a toy giraffe and looking up. "
{
"Image description": "A small Chihuahua dog is standing next to a toy giraffe
and looking up.",
"Alignment analysis": "The image shows a small Chihuahua dog standing next
to a toy giraffe, but the alignment between the two elements is poor. The dog
is looking up, but the giraffe is not looking at the dog, and their positions
do not suggest any interaction between them.",
"Alignment score": "2"
}
Figure 9: An example of a hallucination error. The image caption does not mention that the giraffe is looking at the dog or describe any interaction between them.
Briefly analyze the aesthetic elements of this image (each item within 20 words)
and score its aesthetics. The scoring criteria for each item are as follows …
{
"Color harmony": "The colors in the image are harmonious and natural",
"Color brightness": "The brightness of the colors is not too bright or too dull",
"Color saturation": "The saturation of the colors is not too high or too low",
"Composition": "The composition of the image is not too bad",
"Perspective": "The perspective of the image is not too distorted",
"Light and shadow": "The light and shadow in the image is not too harsh or too soft",
"Detailed expression": "The detailed expression of the cows is not too poor",
"Vivid posture": "The posture of the cows is not too stiff",
"Visual impact": "The visual impact of the image is not too weak",
"Overall aesthetic score": "8"
}
Give the rating of aesthetics (in a format like n/10), do not repeat what you have said.
7.5
Give the rating of aesthetics (in a format like n/10), do not repeat what you have said.
8.2
Figure 10: An example of inconsistent ratings. When asked repeatedly for the aesthetics score, the model returns different values that also deviate from the score in its own JSON analysis.
Briefly describe (within 50 words) the type (e.g., photo, painting) and content of this
image, how well does the image align with the following description …
I'm sorry, but I cannot provide an alignment score without the image.
Please provide the image so that I can evaluate it.
Figure 11: An example of giving no answer. This situation sometimes happens for the separate alignment evaluation. If fidelity evaluation has been conducted previously, this error does not happen.