Comprehensive Evaluation of Multimodal AI Models in Medical Imaging Diagnosis: From Data Augmentation to Preference-Based Comparison

Cailian Ruan* (Medical School of Yan'an University, Yan'an, Shaanxi, China), [email protected]
Chengyue Huang (University of Iowa), [email protected]
Yahe Yang (George Washington University), [email protected]
* Corresponding author

arXiv:2412.05536v1 [eess.IV] 7 Dec 2024

Abstract—This study introduces an evaluation framework for multimodal models in medical imaging diagnostics. We developed a pipeline incorporating data preprocessing, model inference, and preference-based evaluation, expanding an initial set of 500 clinical cases to 3,000 through controlled augmentation. Our method combined medical images with clinical observations to generate assessments, using Claude 3.5 Sonnet for independent evaluation against physician-authored diagnoses. The results indicated varying performance across models: Llama 3.2-90B outperformed human diagnoses in 85.27% of cases, whereas the specialized vision models BLIP2 and Llava were preferred in only 41.36% and 46.77% of cases, respectively. This framework highlights the potential of large multimodal models to outperform human diagnostics in certain tasks.

Index Terms—Multimodal, Medical Imaging, Diagnostic Evaluation, Preference Assessment, Large Language Models (LLMs)

I. INTRODUCTION

With the rapid advancement of deep learning technologies, particularly the innovative application of large multimodal models in medical image analysis, AI-assisted diagnosis is reshaping traditional medical practice. This study introduces a novel evaluation framework to assess the diagnostic capabilities of the latest generation of multimodal AI models in interpreting complex abdominal CT images, focusing on cirrhosis and its complications, liver tumors, and multi-system lesions.

A core challenge in medical imaging AI is accurately interpreting and integrating multi-dimensional clinical information. In our collected clinical data, a comprehensive abdominal CT diagnosis often requires simultaneous evaluation of multiple dimensions: liver parenchymal changes (such as cirrhosis and multiple nodules), vascular system abnormalities (such as portal hypertension and cavernous transformation of the portal vein), secondary changes (including splenomegaly and ascites), and related complications (such as esophageal varices). Traditional computer vision models often struggle with such complex medical scenarios, prompting us to explore the potential of new-generation multimodal AI technology.

We developed a systematic evaluation framework to assess various multimodal models' capabilities in medical image interpretation. Our pipeline incorporates data preprocessing, standardized model evaluation, and preference-based assessment. Starting with 500 clinical cases, each containing four sequential CT images paired with a detailed diagnostic report, we employed controlled augmentation techniques to expand the dataset to 3,000 samples while preserving critical diagnostic features. Our methodology uses standardized inputs combining medical images and detailed observations to generate diagnostic assessments, enabling direct comparison between different models and human expertise.

The results demonstrate remarkable capabilities across various multimodal AI systems, particularly among general-purpose models. Llama 3.2-90B achieved superior performance in 85.27% of cases compared to human diagnoses, with only 1.39% rated as equivalent. Similarly strong performance was observed in the other general-purpose models, with GPT-4, GPT-4o, and Gemini-1.5 showing AI superiority in 83.08%, 81.72%, and 79.35% of cases, respectively. This advantage manifests in their ability to simultaneously evaluate multiple anatomical structures, track disease progression, and integrate clinical information into a comprehensive diagnosis. In contrast, the specialized vision models BLIP2 and Llava demonstrated more modest results, with AI superiority in 41.36% and 46.77% of cases, respectively, highlighting the challenges vision-specific approaches face in complex diagnostic scenarios.

Our evaluation framework employed Claude 3.5 Sonnet as an independent assessor, implementing a three-way preference classification (AI superior, physician superior, or equivalent quality) to systematically compare model-generated and physician-authored diagnoses. This approach provides valuable insight into the current capabilities of different multimodal architectures in medical diagnostics, suggesting that general-purpose large multimodal models may significantly outperform both specialized vision models and human physicians in certain diagnostic tasks.

This study not only advances our understanding of AI capabilities in medical diagnosis but also introduces an efficient framework for model evaluation through preference-based assessment. By analyzing a large set of diverse clinical cases, we demonstrate the potential of multimodal AI technology to handle complex medical scenarios, potentially heralding a new paradigm in medical imaging diagnosis. These findings have significant implications for improving diagnostic efficiency, standardizing diagnostic processes, and reducing the risk of missed diagnoses and misdiagnoses. Furthermore, our evaluation methodology offers valuable insights for assessing AI capabilities in other complex medical applications.
II. RELATED WORK

A. Multimodal AI in Medical Diagnostics

The integration of multimodal AI systems into medical diagnostics has demonstrated significant advances over traditional single-modality approaches. Early work introduced frameworks that combined radiology images with electronic health records (EHRs), enhancing diagnostic accuracy by leveraging heterogeneous data sources [1], [2]. The idea was further extended by integrating chest X-ray imaging with patient demographic data, which improved the detection of acute respiratory failure [3], [4].

In liver disease diagnosis, traditional computer vision models such as CNNs have primarily focused on tasks like tumor detection [5]–[9]. The efficacy of CNNs in detecting diabetic retinopathy was demonstrated by Ghosh et al. [10], laying the groundwork for their application in liver imaging analysis. However, these models are often limited in their ability to synthesize multi-dimensional clinical data, prompting the need for more advanced multimodal approaches.

Recent advances in large language models (LLMs) such as GPT-4 have added new dimensions to multimodal diagnostics. Studies have showcased LLMs' capabilities in processing and synthesizing medical language tasks, forming a basis for their integration into multimodal frameworks [11]–[13]. Similarly, combining imaging data and text with multimodal AI has been shown to yield superior performance in breast cancer diagnostics [14].

B. Specialized Vision Models and Evaluation Frameworks

Vision-specific models such as BLIP2 [15] and Llava [16] have been widely used for focused medical imaging tasks [17]–[19]. While effective at detecting individual pathologies, these models often underperform in complex, multi-dimensional diagnostic scenarios compared to general-purpose multimodal systems. Lee et al. [18] evaluated vision-specific models for diagnosing chest pathologies, finding them proficient in single-dimension tasks but limited in handling broader diagnostic contexts.

The incorporation of standardized evaluation frameworks has also been pivotal in advancing medical AI systems. The use of independent assessors, such as Claude 3.5 Sonnet in this study, represents a novel approach to comparing AI and human diagnoses. This aligns with the recommendations of Crossnohere et al. [20], who emphasized the critical need for standardized protocols to benchmark AI systems in healthcare.

By systematically integrating independent assessments with preference-based evaluation, this study builds upon these prior works, providing a structured methodology for evaluating the capabilities of multimodal AI systems in complex medical scenarios.
III. METHODOLOGY

A. Data Preprocessing

Our preprocessing pipeline consists of three main components: data de-identification, anomaly handling, and data augmentation, specifically designed to process paired CT image sequences and their corresponding diagnostic reports.

The de-identification process was implemented to ensure patient privacy while preserving clinically relevant information. For CT images, we developed an automated system to remove burned-in patient identifiers and replace DICOM header information with anonymized identifiers. The corresponding diagnostic reports underwent a similar process in which personal identifiers, hospital names, and specific dates were systematically replaced with standardized codes while maintaining the temporal relationships between examinations. This process preserved the diagnostic value of both images and reports while ensuring compliance with privacy regulations.
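To make the header de-identification step concrete, the sketch below shows one way such a step could be implemented with the pydicom library. It is a minimal illustration, not the authors' actual system: the tag selection and the anonymized coding scheme are assumptions, and removing burned-in identifiers from the pixel data itself would require a separate image-processing step.

import pydicom

def deidentify_dicom(in_path, out_path, anon_id):
    # Read the DICOM file and overwrite direct identifiers in the header.
    ds = pydicom.dcmread(in_path)
    ds.PatientName = anon_id        # e.g. "ANON-0042" (assumed coding scheme)
    ds.PatientID = anon_id
    ds.PatientBirthDate = ""        # blank out the date of birth
    ds.InstitutionName = "ANONYMIZED"
    ds.remove_private_tags()        # drop vendor-specific private tags
    ds.save_as(out_path)            # write the de-identified copy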
Anomaly handling addressed both image and text irregularities. For CT images, we implemented automated detection and correction of common artifacts, including beam hardening, motion artifacts, and metal artifacts. Image quality metrics were established to identify scans with suboptimal contrast enhancement or incomplete anatomical coverage. The text processing pipeline identified and corrected common reporting inconsistencies, standardized medical terminology, and ensured proper formatting of measurements and anatomical descriptions. Cases with severe anomalies that could not be automatically corrected were flagged for expert review.

Data augmentation strategies were carefully designed to maintain the paired relationship between image sequences and reports. For images, spatial transformations included minor rotations (±10°), translations (within 10% of image boundaries), and subtle elastic deformations (controlled within 5% to preserve anatomical relationships). Intensity-based augmentations comprised contrast adjustments (±10%), brightness variations (±5%), and minimal Gaussian noise injection (σ = 0.01) to simulate imaging system variations. For the corresponding reports, we employed text augmentation techniques including synonym substitution for anatomical terms and standardized rephrasing of pathological findings. Each augmented case maintained the original format of four sequential CT images paired with one comprehensive diagnostic report, preserving the temporal and spatial relationships within the image series and their corresponding textual descriptions. This process generated ten augmented samples for each original case, with synchronized modifications in both the image sequences and their reports.
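The stated parameter ranges translate directly into code. The following NumPy/SciPy sketch illustrates the image-side transformations with those ranges (±10° rotation, translation within 10% of the image bounds, ±10% contrast, ±5% brightness, Gaussian noise with σ = 0.01); it is a simplified illustration of the described pipeline, not the authors' implementation, and it omits the elastic deformation and the report-rephrasing step.

import numpy as np
from scipy import ndimage

rng = np.random.default_rng(seed=42)

def augment_slice(img):
    # img: 2-D CT slice with intensities normalized to [0, 1].
    angle = rng.uniform(-10, 10)                        # rotation within +/-10 degrees
    img = ndimage.rotate(img, angle, reshape=False, mode="nearest")
    shift = rng.uniform(-0.10, 0.10, size=2) * np.array(img.shape)  # <= 10% of bounds
    img = ndimage.shift(img, shift, mode="nearest")
    img = img * rng.uniform(0.90, 1.10)                 # contrast adjustment +/-10%
    img = img + rng.uniform(-0.05, 0.05)                # brightness variation +/-5%
    img = img + rng.normal(0.0, 0.01, size=img.shape)   # Gaussian noise, sigma = 0.01
    return np.clip(img, 0.0, 1.0)

def augment_case(case_images, n_copies=10):
    # Produce n_copies augmented versions of one four-slice case
    # (ten per original case in the paper's pipeline).
    return [[augment_slice(s) for s in case_images] for _ in range(n_copies)]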
The effectiveness of our preprocessing pipeline was validated through both automated quality metrics and expert review. A random subset of 8% of the processed cases underwent detailed evaluation by experienced radiologists to ensure the preservation of diagnostic features and the accuracy of the image-report relationships. This comprehensive approach resulted in a high-quality dataset that maintained clinical relevance while providing sufficient variability for robust model training.
B. Workflow Design

Input Processing. The evaluation pipeline begins with carefully curated CT image sequences consisting of multiple cross-sectional views of the abdominal region. These images undergo standardized preprocessing to ensure consistent input quality across all models. Each case is paired with a detailed image overview capturing essential anatomical and pathological observations, providing standardized context for both AI models and human diagnosticians to ensure fair comparison.

Multi-model Analysis. We evaluate various state-of-the-art multimodal models for their medical diagnostic capabilities. The input for each model consists of paired CT image sequences and corresponding text descriptions, where the text provides detailed anatomical and pathological observations visible in the images. The general-purpose multimodal models (Llama 3.2-90B, GPT-4, and GPT-4o) process the combined image-text pairs to leverage both visual features and contextual information in generating diagnostic assessments. Similarly, the specialized vision models (BLIP2 and Llava) analyze the CT image sequences while incorporating the textual descriptions to provide comprehensive diagnostic interpretations. Each model generates independent diagnostic reports based on the same standardized input, enabling direct comparison of their capabilities in integrating visual and textual information for medical diagnosis.
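Conceptually, this step reduces to dispatching one standardized payload to several model backends and collecting their reports. The sketch below shows that harness shape; the wrapper functions are hypothetical placeholders, since the paper does not describe how each model was served.

from typing import Callable, Dict, List

# Hypothetical wrappers: each maps (image_paths, observations) -> report text,
# e.g. MODELS = {"llama-3.2-90b": call_llama, "gpt-4": call_gpt4, ...}.
MODELS: Dict[str, Callable[[List[str], str], str]] = {}

def run_case(image_paths: List[str], observations: str) -> Dict[str, str]:
    # Send the same standardized image+text input to every model under test
    # and collect one independent diagnostic report per model.
    return {name: infer(image_paths, observations) for name, infer in MODELS.items()}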
Diagnostic Generation. Each model generates structured diagnostic outputs encompassing primary findings, secondary observations, and clinical recommendations. The standardized output format allows direct comparison of diagnostic comprehensiveness and accuracy across different models and human experts. This structured approach ensures consistent evaluation criteria while maintaining the unique analytical capabilities of each model.

Preference-based Evaluation. We implement an innovative preference-based evaluation approach using Claude 3.5 Sonnet as an independent assessor. Through carefully crafted prompting strategies, we enable automated comparison between AI-generated and physician-authored diagnoses without requiring extensive manual review. The evaluation framework employs a three-way classification system (AI Superior, Physician Superior, or Equivalent), considering factors such as diagnostic accuracy, comprehensiveness, and clinical relevance. This prompt-based approach significantly reduces the need for human evaluation resources while maintaining objective assessment standards.
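A minimal version of this prompting strategy is sketched below using the Anthropic Python SDK. The prompt wording, the label set, and the model snapshot name are illustrative assumptions; the paper does not publish its exact evaluation prompt.

import anthropic

client = anthropic.Anthropic()  # API key is read from the environment

JUDGE_PROMPT = """You are an impartial radiology evaluator. Compare the two
diagnoses below against the imaging observations, weighing diagnostic accuracy,
comprehensiveness, and clinical relevance. Reply with exactly one label:
AI_SUPERIOR, PHYSICIAN_SUPERIOR, or EQUIVALENT.

Observations: {obs}
Diagnosis A (model-generated): {ai}
Diagnosis B (physician-authored): {md}"""

def judge_case(obs: str, ai: str, md: str) -> str:
    # Ask the assessor model for a single three-way preference label.
    reply = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed snapshot name
        max_tokens=10,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(obs=obs, ai=ai, md=md)}],
    )
    return reply.content[0].text.strip()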
Quality Assurance. The framework incorporates systematic quality monitoring through automated metrics and selective expert validation. Our prompt-based evaluation approach allows for efficient processing of large-scale comparisons while maintaining high clinical standards. The system automatically identifies edge cases and error patterns for targeted expert review, optimizing human expert involvement while ensuring reliable evaluation results. This efficient quality assurance process enables comprehensive model assessment across diverse clinical scenarios without requiring extensive manual validation.

Figure 2 demonstrates our comparative analysis framework through a representative case study. The figure presents a comprehensive diagnostic scenario consisting of four components: the original CT image sequence showing multiple abdominal cross-sections, a detailed image overview documenting key observations, the human physician's diagnostic report, and the AI model's diagnostic assessment. In this particular case, both the human and the AI diagnostics focus on complex hepatic conditions, with the AI system providing a detailed analysis of potential viral hepatitis, portal hypertension, and associated complications. This side-by-side comparison enables direct evaluation of diagnostic comprehensiveness, accuracy, and clinical reasoning between human experts and AI systems. Such comparative analysis reveals that while AI models can achieve high accuracy in identifying specific pathological features, human expertise remains crucial for contextual interpretation and complex clinical decision-making.

This systematic comparison approach enables objective evaluation of AI diagnostic capabilities against human expertise while maintaining rigorous clinical standards. Through careful documentation and analysis of comparative cases, we can better understand both the potential and the limitations of AI systems in medical diagnosis, ultimately working toward a complementary relationship between artificial and human intelligence in clinical practice.

IV. EXPERIMENTS & RESULTS

We conduct comprehensive experiments to evaluate the diagnostic capabilities of different multimodal models in medical image interpretation. The experiments are structured around two main aspects: comparative analysis of model performance against human expertise, and systematic evaluation of diagnostic accuracy across different pathological conditions. Our evaluation framework employs a dataset of CT image sequences with corresponding clinical assessments, enabling detailed comparison of AI and human diagnostic capabilities.

A. Experimental Setup

Our experimental dataset consists of 500 original clinical cases, each comprising a sequence of 4 cross-sectional CT images accompanied by a comprehensive diagnostic report. Through our data augmentation pipeline, we expanded this dataset to 3,000 samples while preserving the critical diagnostic features and relationships between images and reports. The augmentation process included controlled spatial transformations of images (±10° rotations, translations within 10% of image boundaries), intensity adjustments (±10% contrast, ±5% brightness), and minimal Gaussian noise injection (σ = 0.01).
Fig. 1. Comparative Evaluation Framework for Multimodal Medical Diagnosis

Corresponding reports underwent text augmentation using anatomical term substitution and standardized rephrasing of pathological findings, ensuring that clinical accuracy and relevance were maintained.

The evaluation framework incorporates six state-of-the-art multimodal models: four general-purpose models (Llama 3.2-90B, GPT-4, GPT-4o, and Gemini-1.5) and two specialized vision models (BLIP2 and Llava). Each model processes identical pairs of CT image sequences and their corresponding text descriptions. Additionally, we employ Claude 3.5 Sonnet as an independent assessor for preference-based evaluation, implementing a three-way classification system (AI Superior, Physician Superior, or Equivalent) to compare model-generated diagnoses with physician-authored reports. To ensure quality control, a random subset of 8% of all cases underwent detailed review by experienced radiologists to validate the preservation of diagnostic features and the accuracy of image-report relationships.
B. Experimental Evaluation

The evaluation of model performance focused on quantitative metrics to ensure a thorough assessment of model capabilities. We analyzed the preference-based evaluation results using a standardized scoring system. The three-way classification (AI Superior, Physician Superior, or Equivalent) was applied consistently across all 3,000 cases, with each case receiving an independent assessment through Claude 3.5 Sonnet. The evaluation criteria emphasized accurate identification of primary pathologies, recognition of secondary complications, and proper integration of multi-dimensional clinical information. Statistical analysis of the results employed chi-square tests to determine the significance of performance differences between models, with p-values adjusted for multiple comparisons using the Bonferroni correction. Additionally, we calculated confidence intervals for preference ratios to ensure robust interpretation of model performance differences.
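The paper does not spell out the exact contingency design behind these chi-square tests, so the sketch below makes one concrete assumption: each model's three-way preference counts are tested against a uniform split, with a Bonferroni adjustment across the six models. The counts are reconstructed from the rates later reported in Table I, and the p-values this particular design produces will not reproduce those in the table.

from scipy.stats import chisquare

# Approximate counts over 3,000 cases, reconstructed from Table I:
# (AI superior, physician superior, equivalent)
counts = {
    "Llama 3.2-90B": [2558, 400, 42],
    "BLIP2": [1241, 1597, 162],
}

N_MODELS = 6  # number of models compared in the study

for name, observed in counts.items():
    stat, p = chisquare(observed)    # null hypothesis: uniform split over the labels
    p_adj = min(p * N_MODELS, 1.0)   # Bonferroni correction for multiple tests
    print(f"{name}: chi2 = {stat:.1f}, Bonferroni-adjusted p = {p_adj:.3g}")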
Fig. 2. Comparative Analysis Between Human and AI Diagnostic Assessments

C. Results and Discussion

Our experimental evaluation reveals significant variations in diagnostic capabilities across the different multimodal AI architectures, as detailed in Table I. The results demonstrate a clear performance distinction between general-purpose models and specialized vision models in medical diagnostic tasks. General-purpose models consistently demonstrated superior performance, with Llama 3.2-90B achieving the highest preference rate of 85.27% over human diagnoses. This exceptional performance was particularly evident in complex cases involving multiple pathologies and cross-system interactions. The other general-purpose models (GPT-4, GPT-4o, and Gemini-1.5) showed similarly strong results, all achieving preference rates above 79%. The consistently low equivalence rates (around 1.39% for most models) suggest clear differentiation in diagnostic capabilities rather than ambiguous comparisons. In contrast, the specialized vision models BLIP2 and Llava demonstrated more modest performance levels, with preference rates of 41.36% and 46.77%, respectively. The higher physician-superiority rates for these models (53.25% and 48.84%) indicate that while they possess competence in specific pathology detection, they may struggle with comprehensive diagnostic assessments requiring the integration of multiple clinical indicators. Statistical analysis confirms the significance of these performance differences (p < 0.001 for general-purpose models), suggesting that this superiority is not due to chance. The performance gap was most pronounced in cases requiring integration of multiple anatomical observations and clinical findings, where general-purpose models exhibited superior capability in synthesizing complex clinical information.

Several factors may contribute to the superior performance of general-purpose models. First, their architecture enables better integration of visual and textual information, allowing for more comprehensive interpretation of clinical data. Second, their broader training potentially enables better understanding of complex medical relationships and dependencies. Third, their ability to process and synthesize multiple types of information simultaneously appears to more closely mirror the cognitive processes involved in medical diagnosis. These findings suggest important implications for the future of medical imaging diagnosis.
TABLE I
COMPARATIVE ANALYSIS OF MULTIMODAL MODEL PERFORMANCE IN MEDICAL DIAGNOSIS

Model             AI Superior (%)   Physician Superior (%)   Equivalent (%)   Total Cases   p-value
General-Purpose Models
Llama 3.2-90B          85.27                13.34                 1.39            3,000      < 0.001
GPT-4                  83.08                15.53                 1.39            3,000      < 0.001
GPT-4o                 81.72                16.89                 1.39            3,000      < 0.001
Gemini-1.5             79.35                13.51                 7.14            3,000      < 0.001
Specialized Vision Models
BLIP2                  41.36                53.25                 5.39            3,000        0.047
Llava                  46.77                48.84                 4.39            3,000        0.052

* p-values calculated using a chi-square test with Bonferroni correction
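As one example of the confidence intervals mentioned in the evaluation setup, the sketch below computes a 95% Wilson interval for Llama 3.2-90B's 85.27% AI-superior rate, assuming the statsmodels library; the underlying count is reconstructed from Table I.

from statsmodels.stats.proportion import proportion_confint

successes, n = 2558, 3000  # ~85.27% AI-superior preferences over 3,000 cases
low, high = proportion_confint(successes, n, alpha=0.05, method="wilson")
print(f"Preference rate {successes / n:.4f}, 95% CI [{low:.4f}, {high:.4f}]")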

While specialized vision models demonstrate competence in specific tasks, the superior performance of general-purpose models indicates their potential for transforming diagnostic processes. However, it is crucial to note that these results should inform the development of complementary human-AI diagnostic systems rather than suggest the wholesale replacement of human expertise.

V. CONCLUSION

This study introduces a novel evaluation framework for assessing multimodal AI models in medical imaging diagnosis, showcasing the significant potential of general-purpose models in handling complex diagnostic tasks. The superior performance of models such as Llama 3.2-90B and GPT-4 highlights a paradigm shift in medical imaging, where AI systems can surpass human experts in certain diagnostic scenarios.

Our findings demonstrate that general-purpose multimodal models exhibit remarkable capabilities in synthesizing complex medical information, achieving preference rates exceeding 80% compared to human diagnoses. This exceptional performance is particularly evident in cases requiring the integration of multiple clinical indicators and cross-system analyses.

However, these results should be considered within the context of certain limitations. While our evaluation framework provides a reliable methodology, future studies should explore its applicability across diverse clinical contexts and varied healthcare settings. Moreover, human expertise remains indispensable for complex decision-making and patient care.

The implications of this research extend beyond performance metrics. For clinical decision support systems, the proposed framework enhances clinical decision-making by integrating AI diagnostics with human expertise. For quality assurance in diagnostic processes, it establishes a reliable methodology to improve consistency and reduce diagnostic errors. For medical education and training, the framework offers a valuable tool for educating healthcare professionals through AI-assisted diagnostics. Finally, this research contributes to the standardization of diagnostic procedures, aiding the development of uniform processes for integrating AI into clinical workflows.

Future research should expand the evaluation framework to include other medical imaging modalities and explore its integration into clinical workflows. Furthermore, developing hybrid approaches that combine the strengths of AI and human expertise will be essential for maximizing the potential of these technologies.

The proposed workflow for comparing human- and AI-generated diagnostic results can also be integrated into clinical decision-making processes. Such a system would process diagnoses from both human physicians and AI models, with a dedicated committee making the final diagnosis and treatment plan. This design leverages the strengths of both human expertise and AI, while the committee ensures the accuracy and reliability of the proposed treatments. By incorporating this integrated information system, medical diagnostics could achieve enhanced accuracy and efficiency, ultimately improving patient care outcomes.

This study significantly advances our understanding of AI capabilities in medical diagnosis and establishes a robust framework for evaluating future developments in this rapidly evolving field. The findings underscore the promising potential of multimodal AI systems to enhance diagnostic accuracy and efficiency through careful integration into clinical practice.

REFERENCES

[1] S. Chaganti, L. Mawn, H. Kang, et al., "Electronic medical record context signatures improve diagnostic classification using medical image computing," IEEE Journal of Biomedical and Health Informatics, vol. 23, pp. 2052–2062, 2019. DOI: 10.1109/JBHI.2018.2890084.
[2] L. Mao, J. Li, T. J. Schwedt, et al., "Questionnaire and structural imaging data accurately predict headache improvement in patients with acute post-traumatic headache attributed to mild traumatic brain injury," Cephalalgia, vol. 43, no. 5, Art. no. 03331024231172736, 2023.
[3] S. Jabbour, D. Fouhey, E. Kazerooni, J. Wiens, and M. Sjoding, "Combining chest X-rays and electronic health record (EHR) data using machine learning to diagnose acute respiratory failure," Journal of the American Medical Informatics Association (JAMIA), 2021. DOI: 10.1093/jamia/ocac030.
[4] C. Huang, A. Bandyopadhyay, W. Fan, A. Miller, and S. Gilbertson-White, "Mental toll on working women during the COVID-19 pandemic: An exploratory study using Reddit data," PLoS ONE, vol. 18, no. 1, e0280049, 2023.
[5] W. J. Li, F. Jia, and Q. Hu, "Automatic segmentation of liver tumor in CT images with deep convolutional neural networks," Journal of Computer and Communications, vol. 3, pp. 146–151, 2015. DOI: 10.4236/jcc.2015.311023.
[6] K. Yasaka, H. Akai, O. Abe, and S. Kiryu, "Deep learning with convolutional neural network for differentiation of liver masses at dynamic contrast-enhanced CT: A preliminary study," Radiology, vol. 286, no. 3, pp. 887–896, 2017. DOI: 10.1148/radiol.2017170706.
[7] W. Xu, J. Xiao, and J. Chen, "Leveraging large language models to enhance personalized recommendations in e-commerce," arXiv preprint arXiv:2410.12829, 2024.
[8] Z. Fu, K. Wang, W. Xin, et al., "Detecting misinformation in multimedia content through cross-modal entity consistency: A dual learning approach," PACIS 2024 Proceedings, 2024. [Online]. Available: https://aisel.aisnet.org/pacis2024/track07_secprivacy/track07_secprivacy/2.
[9] W. Xin, K. Wang, Z. Fu, and L. Zhou, "Let community rules be reflected in online content moderation," arXiv preprint arXiv:2408.12035, 2024. [Online]. Available: https://arxiv.org/abs/2408.12035.
[10] R. Ghosh, K. Ghosh, and S. Maitra, "Automatic detection and classification of diabetic retinopathy stages using CNN," in Proc. 4th International Conference on Signal Processing and Integrated Networks (SPIN), pp. 550–554, 2017. DOI: 10.1109/SPIN.2017.8050011.
[11] A. Belyaeva, J. Cosentino, F. Hormozdiari, et al., "Multimodal LLMs for health grounded in individual-specific data," pp. 86–102, 2023. DOI: 10.48550/arXiv.2307.09018.
[12] Y. Ge, G. Chen, J. A. Waltz, L. E. Hong, P. Kochunov, and S. Chen, "An integrated cluster-wise significance measure for fMRI analysis," Human Brain Mapping, vol. 43, no. 8, pp. 2444–2459, 2022.
[13] Y. Ge, S. Hare, G. Chen, et al., "Bayes estimate of primary threshold in clusterwise functional magnetic resonance imaging inferences," Statistics in Medicine, vol. 40, no. 25, pp. 5673–5689, 2021.
[14] M. Jiang, S. Lei, J. Zhang, L. Hou, M. Zhang, and Y. Luo, "Multimodal imaging of target detection algorithm under artificial intelligence in the diagnosis of early breast cancer," Journal of Healthcare Engineering, vol. 2022, 2022. DOI: 10.1155/2022/9322937.
[15] J. Li, D. Li, S. Savarese, and S. C. H. Hoi, "BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models," pp. 19730–19742, 2023. DOI: 10.48550/arXiv.2301.12597.
[16] H. Liu, C. Li, Q. Wu, and Y. J. Lee, "Visual instruction tuning," arXiv preprint arXiv:2304.08485, 2023. DOI: 10.48550/arXiv.2304.08485.
[17] C. Li, C. Wong, S. Zhang, et al., "LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day," arXiv preprint arXiv:2306.00890, 2023. DOI: 10.48550/arXiv.2306.00890.
[18] S. Lee, J. Youn, M. Kim, and S. H. Yoon, "CXR-LLaVA: Multimodal large language model for interpreting chest X-ray images," arXiv preprint arXiv:2310.18341, 2023. DOI: 10.48550/arXiv.2310.18341.
[19] C. Jiang, P. Kilcullen, Y. Lai, T. Ozaki, and J. Liang, "High-speed dual-view band-limited illumination profilometry using temporally interlaced acquisition," Photonics Research, 2020. DOI: 10.1364/PRJ.399492.
[20] N. L. Crossnohere, M. I. Elsaid, J. Paskett, S. Bose-Brill, and J. F. P. Bridges, "Guidelines for artificial intelligence in medicine: Literature review and content analysis of frameworks," Journal of Medical Internet Research, vol. 24, 2022. DOI: 10.2196/36823.
