INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs’ Performance in Insurance

Chenwei Lin
Fudan University
[email protected]
Hanjia Lyu
University of Rochester
[email protected]
Xian Xu
Fudan University
[email protected]
Jiebo Luo
University of Rochester
[email protected]

Abstract

Large Vision-Language Models (LVLMs) have demonstrated outstanding performance in various general multimodal applications such as image recognition and visual reasoning, and have also shown promising potential in specialized domains. However, the application potential of LVLMs in the insurance domain, which is characterized by rich application scenarios and abundant multimodal data, has not been effectively explored. There is no systematic review of multimodal tasks in the insurance domain, nor a benchmark specifically designed to evaluate the capabilities of LVLMs in insurance. This gap hinders the development of LVLMs within the insurance domain. In this paper, we systematically review and distill multimodal tasks for four representative types of insurance: auto insurance, property insurance, health insurance, and agricultural insurance. We propose INS-MMBench, the first comprehensive LVLM benchmark tailored for the insurance domain. INS-MMBench comprises a total of 2.2K thoroughly designed multiple-choice questions, covering 12 meta-tasks and 22 fundamental tasks. Furthermore, we evaluate multiple representative LVLMs, including closed-source models such as GPT-4o and open-source models like BLIP-2. This evaluation not only validates the effectiveness of our benchmark but also provides an in-depth performance analysis of current LVLMs on various multimodal tasks in the insurance domain. We hope that INS-MMBench will facilitate the further application of LVLMs in the insurance domain and inspire interdisciplinary development. Our dataset and evaluation code are available at https://github.com/FDU-INS/INS-MMBench.

Figure 1: Overview of INS-MMBench. INS-MMBench constructs 12 meta-tasks (represented in the inner circle) and 22 fundamental tasks (represented in the outer circle) across four types of insurance, distinguished by four primary colors: blue, red, yellow, and green. For each fundamental task, we provide an example image-question pair.

1 Introduction

In recent years, Large Language Models (LLMs) have demonstrated remarkably powerful semantic understanding and conversational capabilities [66, 33, 81, 58, 80], profoundly impacting human work and life. Building on this foundation, Large Vision-Language Models (LVLMs) have taken a further step by mapping and aligning visual and textual features, enabling the processing of, and interaction with, multimodal data [7, 83, 63, 74]. Researchers have found that LVLMs exhibit exceptional performance in general tasks such as image recognition, document parsing, and OCR processing [72, 37, 69]. Beyond exploring general capabilities, researchers have also begun to apply LVLMs to various specialized domains such as healthcare [30, 61], autonomous driving [20, 41], and social media content analysis [50, 79]. By exploring the capabilities of LVLMs in specialized domains through qualitative and quantitative methods, these studies have demonstrated various application potentials.

Insurance, as a discipline encompassing numerous multimodal application scenarios, involves extensive use of multimodal data and computer vision algorithms in its actual operations [27, 57, 77, 40]. This offers vast potential for the integration of LVLMs with the insurance industry. For instance, in auto insurance, analyzing images of damaged vehicles can enable quick assessments and accurate estimations of damage [51]. Similarly, in property insurance, analyzing images of buildings can help evaluate potential risks [70]. However, existing research [44] has only qualitatively analyzed the application of LVLMs in the insurance domain, without systematically organizing related multimodal tasks or constructing domain-specific benchmarks. This has hindered the in-depth evaluation and promotion of LVLMs’ capabilities within the insurance domain.

To address this challenge, we introduce INS-MMBench, the first comprehensive LVLM benchmark for the insurance domain (see Figure 1). For task design, we systematically organize and refine multimodal tasks across four representative types of insurance: auto, property, health, and agricultural insurance. Using a bottom-up hierarchical task definition methodology, we construct a total of 12 meta-tasks and 22 fundamental tasks, covering key insurance stages such as underwriting, risk monitoring, and claim processing. For data collection, we search and process datasets from multiple open-source channels, selecting datasets with high scenario relevance, task relevance, and data availability. For benchmark construction, INS-MMBench includes a total of 2.2K thoroughly designed multiple-choice visual questions. This format facilitates convenient and objective analysis of evaluation results. The questions are formulated manually, and the distractor options are generated with the help of GPT-4o.

Furthermore, we select 10 LVLMs for evaluation and conduct a comprehensive analysis of the results. The key findings from the evaluation are as follows: (1) GPT-4o performs the best among all models, scoring 72.91/100. It is also the only model to score over 70, reflecting the challenging nature of the INS-MMBench; (2) There are significant differences in LVLMs’ performance across different insurance types, with better results in auto insurance and health insurance compared to property insurance and agricultural insurance; (3) LVLMs exhibit marked differences in performance across different meta-tasks, closely related to the task type and the image type; (4) The gap between open-source and closed-source LVLMs is narrowing, with some open-source models now approaching or even surpassing the capabilities of closed-source models in some tasks; (5) The primary reasons for LVLMs’ errors on the INS-MMBench are lack of knowledge and understanding in the insurance field, as well as perception errors.

Overall, the contributions of our work are as follows:

• We propose INS-MMBench, the first LVLMs benchmark for the insurance domain, which includes a total of 2.2K multiple-choice visual questions covering four types of insurance (auto, property, health, and agricultural insurance), 12 meta-tasks and 22 fundamental tasks.
• We conduct an in-depth evaluation of 10 LVLMs, including 7 proprietary and 3 open-source models, representing the first quantitative assessment of LVLMs’ capabilities in the insurance domain.
• We conduct a further analysis of the evaluation results, providing insights into the potential applications of LVLMs in the insurance domain. This analysis also offers a reference for understanding the opportunities and challenges associated with LVLMs in this sector.

2 Related works

2.1 Large Vision-Language Models

With the rapid development of Large Language Models (LLMs) [9, 66, 31], researchers are leveraging the powerful generalization capabilities of these pre-trained LLMs for processing and understanding multimodal data [73, 82, 19]. A key area of focus is the use of Large Vision-Language Models (LVLMs) for visual inputs. LVLMs employ visual encoders and visual-to-language adapters to encode the visual features from image data and align these features with textual features. The combined features are then processed by pre-trained LLMs, leading to significant advancements in visual recognition and understanding [74, 68].

Various open-source and closed-source LVLMs are continuously emerging. In the realm of open-source models, notable examples include LLaMA-Adapter [76], LLaVA [47], BLIP-2 [38], MiniGPT-4 [83], and InternVL [12]. These models have successfully integrated visual and textual modalities, achieving commendable results. In the closed-source domain, representative models include GPT-4o [53], GPT-4V [2], GeminiProVision [29], and Qwen-VL [60], all of which have demonstrated outstanding performance in numerous tests and evaluations [72, 28, 43]. We intend to evaluate both open-source and closed-source LVLMs to verify the capability of different models in the insurance domain.

2.2 Benchmarks for Large Vision-Language Models

As research into LVLMs intensifies, an increasing number of researchers are proposing benchmarks to evaluate the capabilities of models [73, 78, 45, 11]. Based on the scope of capability evaluation, these studies can be categorized into three types: task-specific benchmarks, comprehensive benchmarks, and domain-specific benchmarks.

Comprehensive benchmarks

are characterized by their breadth and generality. Researchers construct these benchmarks by defining and categorizing the general capabilities and tasks of LVLMs, resulting in a comprehensive and wide-ranging evaluation. Representative studies include LVLM-eHub [69], SEED-Bench [37, 36], MMBench [48], MME, and MMT-Bench [75].

Task-specific benchmarks

focus on particular tasks and types of visual data, providing detailed task definitions. Examples include SciFIBench [55] for scientific images, MMC-Benchmark [46] for charts, MVBench [39] for videos (using video frames as input), and SEED-Bench-2-Plus [35] for web pages, charts, and maps.

Domain-specific benchmarks

are designed for visual tasks within specific professional domains. Due to the specialized knowledge and unique tasks of these domains, general benchmarks cannot fully meet the needs of evaluating LVLMs in these areas. As a result, researchers have begun proposing specialized benchmarks for domains such as healthcare (OmniMedVQA [30]), mathematics [49, 62], autonomous driving (Talk2BEV-Bench [20]), and geography [56]. However, as mentioned previously, the insurance domain and even the finance domain currently lack corresponding domain-specific benchmarks for LVLMs [10, 42, 44]. Our work introduces INS-MMBench to address this gap, aiming for a significant advancement in the application of LVLMs in the insurance domain.

3 INS-MMBench

3.1 Tasks

Given the differences in workflows among various types of insurance in practical operations, we select four core types for building this benchmark: auto insurance, commercial/household property insurance, health insurance, and agricultural insurance. These categories cover both life and property insurance, which are the most prevalent in the insurance market and highly representative [65, 21].

To ensure that our evaluation tasks closely align with real-world applications in the insurance domain and fully demonstrate the capabilities of LVLMs in this context, we have developed a bottom-up hierarchical task definition methodology. Using this methodology, we construct a systematic visual task framework specifically tailored for the insurance sector. As an example, we discuss the detailed task construction process for auto insurance (see Figure 2). Initially, based on the insurance value chain theory [23, 24], we select three key stages rich in multimodal data and tasks: vehicle underwriting, vehicle risk monitoring, and vehicle claim processing. At each stage, we identify the key visual elements that insurance operators need to extract. For instance, during the vehicle underwriting stage, operators must confirm elements such as license plate information, vehicle model, dashboard readings, and vehicle condition, which are crucial for information collection, condition verification, and underwriting decision-making. Further, based on these key visual elements, we define the fundamental tasks. For example, the need to extract license plate information led to the definition of the License Plate Recognition task, while the need to monitor risky driving behavior resulted in the In-car Driving Behavior Detection task. By following this process, we define a total of nine fundamental tasks for auto insurance. Finally, we cluster these fundamental tasks based on their characteristics, forming four meta-tasks. Through this approach, we have constructed a comprehensive set of 12 meta-tasks and 22 fundamental tasks across the four types of insurance.
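The resulting hierarchy can be written down as a nested mapping from insurance type to meta-task to fundamental tasks. The sketch below transcribes the task names from Table 1; the dictionary layout and the `count_tasks` helper are illustrative assumptions, not part of the released benchmark code.

```python
# Task hierarchy transcribed from Table 1:
# insurance type -> meta-task -> list of fundamental tasks.
# The data structure itself is an illustrative assumption.
TAXONOMY = {
    "Auto insurance": {
        "Vehicle information extraction": [
            "License plate recognition",
            "Vehicle mileage reading",
            "Vehicle warning indicator recognition",
        ],
        "Vehicle appearance recognition": [
            "Vehicle make and model identification",
            "Vehicle modification detection",
        ],
        "Driving behavior detection": [
            "In-car driving behavior detection",
        ],
        "Vehicle damage detection": [
            "Vehicle damage part detection",
            "Vehicle damage type detection",
            "Vehicle damage severity detection",
        ],
    },
    "Property insurance": {
        "Property risk assessment": [
            "Roof condition assessment",
            "Workplace risk assessment",
        ],
        "Property anomaly detection": ["House fire detection"],
        "Property damage detection": [
            "House damage type detection",
            "House damage level detection",
        ],
    },
    "Health insurance": {
        "Health risk monitoring": ["Fall detection", "Health device reading"],
        "Medical image recognition": [
            "Medical image organ recognition",
            "Medical image abnormality recognition",
        ],
    },
    "Agricultural insurance": {
        "Crop type identification": [
            "Field image crop type identification",
            "Satellite image crop type identification",
        ],
        "Crop growth status identification": ["Crop growth stage recognition"],
        "Farmland damage detection": ["Farmland damage type detection"],
    },
}


def count_tasks(taxonomy):
    """Return (number of meta-tasks, number of fundamental tasks)."""
    n_meta = sum(len(metas) for metas in taxonomy.values())
    n_fundamental = sum(
        len(tasks) for metas in taxonomy.values() for tasks in metas.values()
    )
    return n_meta, n_fundamental


print(count_tasks(TAXONOMY))  # (12, 22)
```

Counting the entries reproduces the 12 meta-tasks and 22 fundamental tasks stated in the text, with four meta-tasks and nine fundamental tasks under auto insurance.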

3.2 Dataset collection

Once the task definition is complete, we start collecting data and constructing the multi-choice visual questions. Our data collection and benchmark construction process (see Figure 3) is as follows:

Data sources.

We search for datasets using keywords related to the fundamental tasks in several popular data sources, including Google, Kaggle, GitHub, and Roboflow. For tasks where multiple public datasets are available, we download and compare these datasets to perform an initial screening. We select datasets with high adaptability and usability for insurance scenarios, as detailed in Table 1.

Data processing.

To facilitate LVLMs evaluation, we set the number of images and questions for each fundamental task to 100. These 100 images are randomly sampled from our selected datasets, and considering the balance of test sample types, we perform balanced sampling on datasets with categorical labels. For example, in the vehicle damage severity detection task, we ensure that the number of labels - undamaged, minor damage, moderate damage and severe damage - is balanced to maintain the validity of the evaluation. Meanwhile, we process the annotation content, converting it to text-based labels, in preparation for subsequent question and answer generation.
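A minimal sketch of the balanced sampling step, assuming each candidate image comes with a single categorical label. The function name `balanced_sample`, the fixed seed, and the toy file names are hypothetical and serve only to illustrate the idea.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(items, n_total=100, seed=0):
    """Sample n_total (image, label) pairs with label counts as equal as possible.

    items: iterable of (image_id, label) pairs.
    Raises ValueError if some label has too few images to fill its quota.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for image_id, label in items:
        by_label[label].append(image_id)

    labels = sorted(by_label)
    base, extra = divmod(n_total, len(labels))
    sample = []
    for i, label in enumerate(labels):
        # The first `extra` labels receive one additional image when
        # n_total is not divisible by the number of labels.
        quota = base + (1 if i < extra else 0)
        if len(by_label[label]) < quota:
            raise ValueError(f"not enough images for label {label!r}")
        sample.extend((img, label) for img in rng.sample(by_label[label], quota))
    rng.shuffle(sample)
    return sample


# Toy pool with the four severity labels named in the text (file names are made up).
severity_labels = ["undamaged", "minor damage", "moderate damage", "severe damage"]
pool = [(f"img_{k}_{i}.jpg", label)
        for k, label in enumerate(severity_labels) for i in range(50)]
subset = balanced_sample(pool)
print(Counter(label for _, label in subset))
```

With four labels and 100 images, each label receives 25 images; datasets without categorical labels would instead be sampled uniformly at random.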

Question and answer generation.

For each fundamental task, we set questions that are directly and unambiguously related to the task. For example, the question for the license plate recognition task is “What is the number plate of the vehicle in the picture?” The number of options for the questions is set to 2 to 4. For tasks with yes/no labels, we keep the yes/no labels as options. For other tasks, we generate distractor options using the GPT-4o model, and finally combine these options into a multiple-choice visual question format. In each fundamental task we ensure a balanced distribution of correct option positions.
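One simple way to obtain a balanced distribution of correct option positions is to rotate the position of the correct answer across questions. The sketch below assumes the distractors are already available as plain strings; the rotation scheme, the `build_questions` function, and the placeholder records are assumptions for illustration, since the text only states that positions are balanced.

```python
import random
from collections import Counter

LETTERS = "ABCD"


def build_questions(records, seed=0):
    """Turn (question, correct_answer, distractors) records into multiple-choice items.

    Each record must yield 2 to 4 options in total. The correct answer is
    placed at position i mod n_options so that positions are used evenly.
    """
    rng = random.Random(seed)
    questions = []
    for i, (question, correct, distractors) in enumerate(records):
        n_options = len(distractors) + 1
        if not 2 <= n_options <= 4:
            raise ValueError("each question needs between 2 and 4 options")
        position = i % n_options
        shuffled = list(distractors)
        rng.shuffle(shuffled)
        options = shuffled[:position] + [correct] + shuffled[position:]
        questions.append({
            "question": question,
            "options": dict(zip(LETTERS, options)),
            "answer": LETTERS[position],
        })
    return questions


# Placeholder records; real distractors would come from the generation step.
demo_records = [
    (f"Question {i}", f"right-{i}", [f"wrong-{i}-{j}" for j in range(3)])
    for i in range(8)
]
demo_questions = build_questions(demo_records)
print(Counter(q["answer"] for q in demo_questions))
```

For the eight placeholder records, each of the four letters is the correct answer exactly twice. Yes/no tasks fit the same function by passing a single distractor.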

Table 1: An overview of the datasets used in INS-MMBench.

\begin{tabular}{lllll}
\hline
Insurance type & Meta-tasks & Fundamental tasks & Dataset & Access \\
\hline
Auto insurance & Vehicle information extraction & License plate recognition & CCPD [71], mjdfodf-qmbuf [67] & Open Access \\
 & & Vehicle mileage reading & TRODO [52] & Open Access \\
 & & Vehicle warning indicator recognition & dataset\_dashboard [18] & Open Access \\
 & Vehicle appearance recognition & Vehicle make and model identification & Stanford Cars [34] & Open Access \\
 & & Vehicle modification detection & tuning-car-detection [26] & Open Access \\
 & Driving behavior detection & In-car driving behavior detection & Driver-Distraction-Dataset [25] & Open Access \\
 & Vehicle damage detection & Vehicle damage part detection & car\_dent\_scratch\_detection-1 [59] & Open Access \\
 & & Vehicle damage type detection & Cardd [64] & Open Access \\
 & & Vehicle damage severity detection & car-crash-severity-detection [6] & Open Access \\
\hline
Property insurance & Property risk assessment & Roof condition assessment & damages-svll3 [8] & Open Access \\
 & & Workplace risk assessment & worker-safety [16] & Open Access \\
 & Property anomaly detection & House fire detection & fire-detection-cta61 [15] & Open Access \\
 & Property damage detection & House damage type detection & damage-type [4] & Open Access \\
 & & House damage level detection & damage-level [3] & Open Access \\
\hline
Health insurance & Health risk monitoring & Fall detection & Fall Detection Dataset [32] & Open Access \\
 & & Health device reading & blood-pressure-monitor-display [54] & Open Access \\
 & Medical image recognition & Medical image organ recognition & VQA-Med 2019 [1] & Open Access \\
 & & Medical image abnormality recognition & VQA-Med 2019 [1] & Open Access \\
\hline
Agricultural insurance & Crop type identification & Field image crop type identification & agricultural crop images [5] & Open Access \\
 & & Satellite image crop type identification & Drone Imagery Classification Training Dataset for Crop Types in Rwanda [13] & Open Access \\
 & Crop growth status identification & Crop growth stage recognition & wheat-growth-stage-challenge [22] & Open Access \\
 & Farmland damage detection & Farmland damage type detection & agriculture-vision [14] & Open Access \\
\hline
\end{tabular}

4 Experiment

4.1 Experimental setting

Selected LVLMs.

We select a representative set of 10 LVLMs for our evaluation. This set includes seven closed-source LVLMs (GPT-4o, GPT-4V, GeminiProVision, Qwen-VL-Plus, Qwen-VL-Max, Claude3V_Sonnet, and Claude3V_Haiku) and three open-source LVLMs (LLaVA, BLIP-2, and Qwen-VL-Chat).

Evaluation methods.

We employ VLMEvalKit, an open-source evaluation toolkit for LVLMs developed by OpenCompass [17], to conduct our evaluations. This toolkit supports integrated testing of both closed-source and open-source LVLMs and is adaptable to custom benchmark datasets. VLMEvalKit provides two methods for evaluating responses to multi-choice visual questions: exact matching (finding "A", "B", "C", "D" in the output strings) and LLM-based answer extraction which analyzes the answer outputs using a Large Language Model (we use GPT-3.5 here). These methods help mitigate the issue of uncontrolled free-form content generation by LVLMs. The accuracy metric is used as the evaluation criterion.
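The exact-matching idea and the accuracy metric can be illustrated with a few lines of standalone Python. This is a simplified sketch and not VLMEvalKit's implementation: the function names `extract_choice` and `accuracy`, the regular expression, and the rule of rejecting ambiguous outputs are all assumptions made for illustration.

```python
import re


def extract_choice(response, valid_letters="ABCD"):
    """Return the option letter found in a model response, or None.

    A letter counts only when it stands alone as a word, e.g. "C", "(B)"
    or "Answer: A". If no letter or several different letters appear,
    the response is treated as unresolved. Known limitation: the English
    article "A" at the start of a sentence is also matched, which is one
    reason a fallback extractor is useful in practice.
    """
    found = set(re.findall(rf"\b([{valid_letters}])\b", response))
    return found.pop() if len(found) == 1 else None


def accuracy(responses, answers):
    """Percentage of responses whose extracted letter equals the ground truth."""
    if len(responses) != len(answers):
        raise ValueError("responses and answers must have the same length")
    correct = sum(extract_choice(r) == a for r, a in zip(responses, answers))
    return 100.0 * correct / len(answers)


demo_responses = ["The answer is C.", "B", "Either A or B", "D"]
demo_answers = ["C", "B", "A", "A"]
print(accuracy(demo_responses, demo_answers))  # 50.0
```

In this sketch, unresolved responses simply count as wrong; in the evaluation described above, such cases are instead handed to the LLM-based answer extraction step.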

4.2 Main results

Tables 2 and 3 present the evaluation results of LVLMs across various insurance types and meta-tasks, respectively, using random guessing as the baseline. The results are organized into two sections: the first seven rows feature proprietary LVLMs, while the subsequent rows cover open-source LVLMs. Overall, GPT-4o outperforms all other models, emerging as the top-performing LVLM on the INS-MMBench with a score of 72.91. This is the only model with an overall score exceeding 70, underscoring the challenging nature of the INS-MMBench. Most LVLMs scored below 60, and some even underperformed relative to a random guess baseline of 25 in certain insurance categories, indicating significant potential for improvement in applying LVLMs within the insurance domain. Based on the data in Tables 2 and 3, the following observations can be made.

Table 2: Evaluation results of the LVLMs across different insurance types. The values in the table represent the average accuracy. The highest and second-highest results are highlighted in bold and underlined, respectively.

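\begin{tabular}{lccccc}
\hline
Model & Overall & Auto insurance & Household/commercial property insurance & Health insurance & Agricultural insurance \\
\hline
GPT-4o & \textbf{72.91} & \textbf{85.33} & \textbf{65.00} & \textbf{82.00} & \textbf{45.50} \\
GeminiProVision & \underline{68.14} & \underline{82.78} & 52.00 & \underline{80.00} & 43.25 \\
Qwen-VL-Max & 67.72 & 82.89 & 53.40 & 77.25 & 42.00 \\
GPT-4V & 65.37 & 78.11 & \underline{57.60} & 76.25 & 34.75 \\
Qwen-VL-Plus & 54.45 & 63.78 & 42.40 & 76.00 & 27.00 \\
Claude3V\_Sonnet & 54.27 & 67.22 & 47.60 & 64.50 & 23.25 \\
Claude3V\_Haiku & 54.10 & 67.22 & 46.20 & 64.75 & 23.75 \\
\hline
Qwen-VL-Chat & 51.82 & 63.11 & 39.60 & 68.50 & 25.00 \\
LLaVA & 48.86 & 49.11 & 49.60 & 52.75 & \underline{43.50} \\
BLIP-2 & 40.50 & 41.67 & 39.20 & 43.50 & 36.50 \\
\hline
Random guess & 25.00 & 25.00 & 25.00 & 25.00 & 25.00 \\
\hline
\end{tabular}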

Table 3: Evaluation results of the LVLMs across different meta-tasks. The values in the table represent the average accuracy. Specifically, VIE denotes vehicle information extraction, VAR denotes vehicle appearance recognition, DBD denotes driving behavior detection, VDD denotes vehicle damage detection, HPAD denotes household/commercial property anomaly detection, HPDD denotes household/commercial property damage detection, HPRA denotes household/commercial property risk assessment, HRM denotes health risk monitoring, MIR denotes medical image recognition, CGSI denotes crop growth stage identification, CTI denotes crop type identification, FDD denotes farmland damage detection. The highest and second-highest results are highlighted in bold and underlined, respectively.

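\begin{tabular}{lcccccccccccc}
\hline
Model & VIE & VAR & DBD & VDD & HPAD & HPDD & HPRA & HRM & MIR & CGSI & CTI & FDD \\
\hline
GPT-4o & \textbf{82.67} & \underline{98.00} & \textbf{85.00} & \underline{79.67} & 83.00 & \textbf{57.00} & \textbf{64.50} & \textbf{97.50} & \textbf{66.50} & \textbf{32.00} & \underline{53.00} & \textbf{44.00} \\
GeminiProVision & 79.33 & \textbf{98.50} & \underline{79.00} & 77.67 & \underline{86.00} & 40.00 & 47.00 & \underline{94.50} & \underline{65.50} & 28.00 & \textbf{58.50} & 28.00 \\
Qwen-VL-Max & \underline{80.00} & 96.50 & 71.00 & \textbf{80.67} & 79.00 & 42.50 & 51.50 & 91.50 & 63.00 & \underline{30.00} & 52.50 & \underline{33.00} \\
GPT-4V & 76.00 & 96.00 & 63.00 & 73.67 & \underline{86.00} & \underline{43.50} & \underline{57.50} & \underline{94.50} & 58.00 & 22.00 & 46.50 & 24.00 \\
Qwen-VL-Plus & 73.67 & 58.00 & 63.00 & 58.00 & 56.00 & 40.50 & 37.50 & 89.50 & 62.50 & 18.00 & 37.50 & 15.00 \\
Claude3V\_Sonnet & 62.00 & 89.50 & 51.00 & 63.00 & 80.00 & 36.50 & 42.50 & 78.00 & 51.00 & 23.00 & 26.50 & 17.00 \\
Claude3V\_Haiku & 60.33 & 93.50 & 55.00 & 60.67 & 73.00 & 37.50 & 41.50 & 80.00 & 49.50 & 20.00 & 31.00 & 13.00 \\
\hline
Qwen-VL-Chat & 53.33 & 96.00 & 53.00 & 54.33 & 62.00 & 29.50 & 38.50 & 75.50 & 61.50 & 12.00 & 39.00 & 10.00 \\
LLaVA & 32.33 & 77.00 & 52.00 & 46.33 & \textbf{88.00} & 42.00 & 38.00 & 51.50 & 54.00 & 28.00 & \textbf{58.50} & 29.00 \\
BLIP-2 & 34.33 & 71.50 & 36.00 & 31.00 & 80.00 & 26.00 & 32.00 & 46.50 & 40.50 & 24.00 & 50.50 & 21.00 \\
\hline
Random guess & 25.00 & 25.00 & 25.00 & 25.00 & 25.00 & 25.00 & 25.00 & 25.00 & 25.00 & 25.00 & 25.00 & 25.00 \\
\hline
\end{tabular}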

LVLMs show significant variance across different types of insurance. Experimental results reveal that both open-source and proprietary LVLMs perform better in tasks related to auto insurance and health insurance compared to those involving property and agricultural insurance. For instance, GPT-4o, which exhibits the best performance, scores 85.33 and 82.00 in auto and health insurance tasks respectively; however, its scores drop to 65.00 and 45.50 in property and agricultural insurance tasks, indicating a gap from practical application. This discrepancy may stem from the availability of datasets. Our data collection process highlights that publicly available datasets are more plentiful and comprehensive in the automotive and medical fields. Based on these observations, we suggest that the future deployment of LVLMs in the insurance sector should be a progressive process, initially focusing on areas like auto and health insurance where they are most effective.
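To relate the insurance-type scores to the meta-task scores, one plausible reading is that each insurance-type score is the mean over its fundamental tasks, so every meta-task is weighted by the number of fundamental tasks it contains (taken from Table 1). This weighting is an assumption rather than something the text states, and it need not hold for every cell. The sketch below checks it for GPT-4o on auto insurance, using the Table 3 values.

```python
def weighted_mean(scores, weights):
    """Mean of scores weighted by the number of fundamental tasks per meta-task."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# GPT-4o meta-task accuracies for auto insurance, copied from Table 3.
gpt4o_auto_scores = {"VIE": 82.67, "VAR": 98.00, "DBD": 85.00, "VDD": 79.67}
# Number of fundamental tasks in each auto-insurance meta-task, from Table 1.
auto_task_counts = {"VIE": 3, "VAR": 2, "DBD": 1, "VDD": 3}

gpt4o_auto = weighted_mean(gpt4o_auto_scores, auto_task_counts)
print(f"{gpt4o_auto:.2f}")
```

Under this assumption the result agrees with the 85.33 reported for GPT-4o on auto insurance, up to rounding of the two-decimal inputs.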

LVLMs show significant variance across different meta-tasks. Experimental results reveal that LVLMs demonstrate considerable performance variability across various meta-tasks, likely influenced by the specific nature of each task and the characteristics of the images involved. Most models excel in tasks like vehicle information extraction (VIE), vehicle appearance recognition (VAR), and health risk monitoring (HRM), which primarily depend on visual element perception and object detection. In contrast, performance dips in more complex tasks such as household/commercial property damage detection (HPDD) and crop growth stage identification (CGSI), which demand additional domain-specific knowledge or reasoning abilities. Furthermore, LVLMs generally struggle with tasks involving satellite or drone aerial imagery, including household/commercial property risk assessment (HPRA), crop type identification (CTI), and farmland damage detection (FDD), where unique imaging perspectives and data complexities pose additional challenges.

Narrowing gap between open-source and closed-source LVLMs. A comparison of the overall performance of open-source and closed-source LVLMs on INS-MMBench indicates that, while there is still a notable gap between the two, some open-source LVLMs are nearing the performance levels of their closed-source counterparts. This trend suggests that as open-source models grow stronger and domain-specific data becomes more abundant, focusing on training high-performance, domain-specific LVLMs could become a key development strategy in the application of LVLMs within the insurance domain.

4.3 Error analysis

To provide further insights into the limitations of LVLMs in the insurance domain, we conduct an in-depth analysis of the errors made by selected models on the INS-MMBench. We examine the error patterns of three models: GPT-4o, GeminiProVision, and Qwen-VL-Max, categorizing the errors into four types: perception errors (where LVLMs do not recognize or detect objects or content within the image), lack of insurance knowledge or reasoning ability (where LVLMs can recognize and perceive visual content but lack the necessary insurance knowledge or reasoning skills to answer the question), refusal to answer (where LVLMs decline to respond to questions they deem sensitive or illegal), and failure to follow instructions (where LVLMs do not adhere to the provided instructions, resulting in irrelevant responses).

The error analysis results for these models are illustrated in Figure 5. The most common error type is the lack of insurance knowledge or reasoning ability, which accounts for 59.5%, 64.0%, and 57.2% of the errors in GPT-4o, GeminiProVision, and Qwen-VL-Max, respectively. Due to insufficient specialized knowledge and analytical skills in the insurance field, LVLMs struggle to accurately assess and judge factors such as risk conditions and the extent of damage. Therefore, optimizing LVLMs for the insurance domain should primarily focus on enriching domain-specific knowledge and enhancing professional capabilities. Perception errors are the second most significant error type. Limited by the capabilities of the visual encoder, LVLMs often fail to fully recognize and capture detailed content in images, leading to misinterpretations. For instance, GPT-4o misidentifies a damaged farmland image as ‘an abstract or close-up view of a textured surface with blue and purple hues’. This type of error is common across LVLMs. Additionally, due to built-in safety monitoring functions, GPT-4o and GeminiProVision sometimes incorrectly flag images as illegal and refuse to respond. Qwen-VL-Max, on the other hand, struggles with following instructions, occasionally outputting content in Chinese, which compromises result accuracy.

5 Discussions and conclusions

In this paper, we introduce INS-MMBench, a multimodal benchmark tailored for the insurance domain, designed to evaluate Large Vision-Language Models (LVLMs). To the best of our knowledge, this is the first initiative to systematically review multimodal tasks within this sector and establish a benchmark specifically for it. INS-MMBench comprises 2.2K multiple-choice visual questions, covering four types of insurance, 12 meta-tasks, and 22 fundamental tasks, effectively supporting the assessment of LVLMs’ applications in insurance. Additionally, we evaluate several mainstream LVLMs and provide a detailed analysis of the results, offering an initial exploration into the feasibility of employing LVLMs in the insurance sector. We hope our benchmark and findings will guide future research in this field and enhance the integration of insurance academia with AI advancements, promoting interdisciplinary exchanges within the sector.

However, this study has limitations. A significant constraint is the lack of open-source image datasets specific to the insurance domain, largely due to privacy concerns. The data used in this study, sourced from publicly available datasets, has been manually curated but may still harbor biases that do not fully align with real-world insurance scenarios. This issue underscores the need for collaborative efforts between insurance companies and the academic community to develop dedicated open-source image datasets for the insurance domain.

Another limitation is that INS-MMBench disaggregates the tasks of LVLMs into various fundamental tasks, assessing LVLM performance from a micro perspective based on task-specific accuracy. In reality, visual tasks in insurance often entail complex integration of multiple capabilities and comprehensive analysis. Addressing this, our next objective is to construct a more complex, integrated application benchmark to enable a deeper evaluation of LVLM applications in the insurance domain.

References

[1] Asma Ben Abacha, Sadid A Hasan, Vivek V Datla, Joey Liu, Dina Demner-Fushman, and Henning Müller. Vqa-med: Overview of the medical visual question answering task at imageclef 2019. CLEF (working notes), 2(6), 2019.
[2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[3] Isaac Agyemang. Damage level dataset. https://universe.roboflow.com/isaac-agyemang/damage-level, dec 2021. visited on 2024-05-28.
[4] Isaac Agyemang. Damage type dataset. https://universe.roboflow.com/isaac-agyemang/damage-type, jan 2022. visited on 2024-05-28.
[5] AMAN2000JAISWAL. Agriculture crop images. https://www.kaggle.com/datasets/aman2000jaiswal/agriculture-crop-images, 2021. visited on 2024-05-21.
[6] [email protected]. Car crash severity detection dataset. https://universe.roboflow.com/ansonlau1325-gmail-com/car-crash-severity-detection, apr 2022. visited on 2024-05-28.
[7] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
[8] Capstone2. Damages dataset. https://universe.roboflow.com/capstone2/damages-svll3, nov 2022. visited on 2024-05-28.
[9] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45, 2024.
[10] Jian Chen, Peilin Zhou, Yining Hua, Yingxin Loh, Kehui Chen, Ziyuan Li, Bing Zhu, and Junwei Liang. Fintextqa: A dataset for long-form financial question answering. arXiv preprint arXiv:2405.09980, 2024.
[11] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330, 2024.
[12] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
[13] Robert Chew, Jay Rineer, Robert Beach, Maggie O’Neil, Noel Ujeneza, Daniel Lapidus, Thomas Miano, Meghan Hegarty-Craver, Jason Polly, and Dorota S Temple. Deep neural networks and transfer learning for food crop identification in uav images. Drones, 4(1):7, 2020.
[14] Mang Tik Chiu, Xingqian Xu, Yunchao Wei, Zilong Huang, Alexander G Schwing, Robert Brunner, Hrant Khachatrian, Hovnatan Karapetyan, Ivan Dozier, Greg Rose, et al. Agriculture-vision: A large aerial image database for agricultural pattern analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2828–2838, 2020.
[15] College. fire detection dataset. https://universe.roboflow.com/college-pbetq/fire-detection-cta61, oct 2023. visited on 2024-05-28.
[16] computer vision. Worker-safety dataset. https://universe.roboflow.com/computer-vision/worker-safety, jul 2022. visited on 2024-05-28.
[17] OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
[18] Dashboarddataset. dataset dashboard dataset. https://universe.roboflow.com/dashboarddataset/dataset_dashboard_, apr 2024. visited on 2024-05-28.
[19] Soham Deshmukh, Benjamin Elizalde, Rita Singh, and Huaming Wang. Pengi: An audio language model for audio tasks. Advances in Neural Information Processing Systems, 36:18090–18108, 2023.
[20] Vikrant Dewangan, Tushar Choudhary, Shivam Chandhok, Shubham Priyadarshan, Anushka Jain, Arun K Singh, Siddharth Srivastava, Krishna Murthy Jatavallabhula, and K Madhava Krishna. Talk2bev: Language-enhanced bird’s-eye view maps for autonomous driving. arXiv preprint arXiv:2310.02251, 2023.
[21] Tania Driver, Mark Brimble, Brett Freudenberg, and Katherine Hunt. Insurance literacy in australia: Not knowing the value of personal insurance. Financial Planning Research Journal, 4(1):53–75, 2018.
[22] GAURAV DUTTA. Wheat growth stage challenge. https://www.kaggle.com/datasets/gauravduttakiit/wheat-growth-stage-challenge, 2023. visited on 2024-05-21.
[23] Martin Eling and Martin Lehmann. The impact of digitalization on the insurance value chain and the insurability of risks. The Geneva Papers on Risk and Insurance-Issues and Practice, 43:359–396, 2018.
[24] Martin Eling, Davide Nuessle, and Julian Staubli. The impact of artificial intelligence along the insurance value chain and on the insurability of risks. The Geneva Papers on Risk and Insurance-Issues and Practice, 47(2):205–241, 2022.
[25] Amal Ezzouhri, Zakaria Charouh, Mounir Ghogho, and Zouhair Guennoun. Robust deep learning-based driver distraction detection and classification. IEEE Access, 9:168080–168092, 2021.
[26] f-rid nagiyev. Tuning car detection dataset. https://universe.roboflow.com/f-rid-nagiyev/tuning-car-detection, dec 2023. visited on 2024-05-28.
[27] Nisaja Fernando, Abimani Kumarage, Vithyashagar Thiyaganathan, Radesh Hillary, and Lakmini Abeywardhana. Automated vehicle insurance claims processing using computer vision, natural language processing. In 2022 22nd International Conference on Advances in ICT for Emerging Regions (ICTer), pages 124–129. IEEE, 2022.
[28] Chaoyou Fu, Renrui Zhang, Haojia Lin, Zihan Wang, Timin Gao, Yongdong Luo, Yubo Huang, Zhengye Zhang, Longtian Qiu, Gaoxiang Ye, et al. A challenger to gpt-4v? early explorations of gemini in visual expertise. arXiv preprint arXiv:2312.12436, 2023.
[29] Google. Gemini pro. https://deepmind.google/technologies/gemini/pro/, 2024. Accessed: 2024-05-23.
[30] Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, and Ping Luo. Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm. arXiv preprint arXiv:2402.09181, 2024.
[31] Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. arXiv preprint arXiv:2210.11610, 2022.
[32] UTTEJ KUMAR KANDAGATLA. Fall detection dataset. https://www.kaggle.com/datasets/uttejkumarkandagatla/fall-detection-dataset, 2022. visited on 2024-05-20.
[33] Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. Chatgpt for good? on opportunities and challenges of large language models for education. Learning and individual differences, 103:102274, 2023.
[34] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013.
[35] Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, and Ying Shan. Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension. arXiv preprint arXiv:2404.16790, 2024.
[36] Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench-2: Benchmarking multimodal large language models. arXiv preprint arXiv:2311.17092, 2023.
[37] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.
[38] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.
[39] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. arXiv preprint arXiv:2311.17005, 2023.
[40] Pei Li, Bingyu Shen, and Weishan Dong. An anti-fraud system for car insurance claim based on visual evidence. arXiv preprint arXiv:1804.11207, 2018.
[41] Yanze Li, Wenhua Zhang, Kai Chen, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong, Meng Tian, Xinhai Zhao, Zhenguo Li, et al. Automated evaluation of large vision-language models on self-driving corner cases. arXiv preprint arXiv:2404.10595, 2024.
[42] Yinheng Li, Shaofei Wang, Han Ding, and Hang Chen. Large language models in finance: A survey. In Proceedings of the Fourth ACM International Conference on AI in Finance, pages 374–382, 2023.
[43] Yunxin Li, Longyue Wang, Baotian Hu, Xinyu Chen, Wanqi Zhong, Chenyang Lyu, and Min Zhang. A comprehensive evaluation of gpt-4v on knowledge-intensive visual question answering. arXiv preprint arXiv:2311.07536, 2023.
[44] Chenwei Lin, Hanjia Lyu, Jiebo Luo, and Xian Xu. Harnessing gpt-4v (ision) for insurance: A preliminary exploration. arXiv preprint arXiv:2404.09690, 2024.
[45] Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Mitigating hallucination in large multi-modal models via robust instruction tuning. In The Twelfth International Conference on Learning Representations, 2023.
[46] Fuxiao Liu, Xiaoyang Wang, Wenlin Yao, Jianshu Chen, Kaiqiang Song, Sangwoo Cho, Yaser Yacoob, and Dong Yu. Mmc: Advancing multimodal chart understanding with large-scale instruction tuning. arXiv preprint arXiv:2311.10774, 2023.
[47] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
[48] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.
[49] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
[50] Hanjia Lyu, Jinfa Huang, Daoan Zhang, Yongsheng Yu, Xinyi Mou, Jinsheng Pan, Zhengyuan Yang, Zhongyu Wei, and Jiebo Luo. Gpt-4v (ision) as a social media analysis engine. arXiv preprint arXiv:2311.07547, 2023.
[51] Dimitrios Mallios, Li Xiaofei, Niall McLaughlin, Jesus Martinez Del Rincon, Clare Galbraith, and Rory Garland. Vehicle damage severity estimation for insurance operations using in-the-wild mobile images. IEEE Access, 2023.
[52] Kaouther Mouheb, Ali Yürekli, and Burcu Yılmazel. Trodo: A public vehicle odometers dataset for computer vision. Data in Brief, 38:107321, 2021.
[53] OpenAI. Hello gpt-4o. https://openai.com/index/hello-gpt-4o/, 2024. Accessed: 2024-05-23.
[54] Final Project. blood-pressure-monitor-display dataset. https://universe.roboflow.com/final-project-cwtfb/blood-pressure-monitor-display, apr 2024. visited on 2024-05-28.
[55] Jonathan Roberts, Kai Han, Neil Houlsby, and Samuel Albanie. Scifibench: Benchmarking large multimodal models for scientific figure interpretation. arXiv preprint arXiv:2405.08807, 2024.
[56] Jonathan Roberts, Timo Lüddecke, Rehan Sheikh, Kai Han, and Samuel Albanie. Charting new territories: Exploring the geographic and geospatial capabilities of multimodal llms. arXiv preprint arXiv:2311.14656, 2023.
[57] Srishti Sahni, Anmol Mittal, Farzil Kidwai, Ajay Tiwari, and Kanak Khandelwal. Insurance fraud identification using computer vision and iot: a study of field fires. Procedia Computer Science, 173:56–63, 2020.
[58] Yiqiu Shen, Laura Heacock, Jonathan Elias, Keith D Hentel, Beatriu Reig, George Shih, and Linda Moy. Chatgpt and other large language models are double-edged swords, 2023.
[59] Sindhu. Car dent scratch detection(1) dataset. https://universe.roboflow.com/sindhu/car_dent_scratch_detection-1, dec 2022. visited on 2024-05-28.
[60] Qwen Team. Introducing qwen-vl. https://qwenlm.github.io/blog/qwen-vl/, 2024. Accessed: 2024-05-23.
[61] Guankun Wang, Long Bai, Wan Jun Nah, Jie Wang, Zhaoxi Zhang, Zhen Chen, Jinlin Wu, Mobarakol Islam, Hongbin Liu, and Hongliang Ren. Surgical-lvlm: Learning to adapt large vision-language model for grounded visual question answering in robotic surgery. arXiv preprint arXiv:2405.10948, 2024.
[62] Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. arXiv preprint arXiv:2402.14804, 2024.
[63] Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. Advances in Neural Information Processing Systems, 36, 2024.
[64] Xinkuang Wang, Wenjing Li, and Zhongcheng Wu. Cardd: A new dataset for vision-based car damage detection. IEEE Transactions on Intelligent Transportation Systems, 2023.
[65] Sampath Sanjeewa Weedige, Hongbing Ouyang, Yao Gao, and Yaqing Liu. Decision making in personal insurance: Impact of insurance literacy. Sustainability, 11(23):6795, 2019.
[66] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.
[67] workspace. mjdfodf-qmbuf dataset. https://universe.roboflow.com/workspace-luixd/mjdfodf-qmbuf, mar 2023. visited on 2024-05-28.
[68] Jiayang Wu, Wensheng Gan, Zefeng Chen, Shicheng Wan, and Philip S Yu. Multimodal large language models: A survey. In 2023 IEEE International Conference on Big Data (BigData), pages 2247–2256. IEEE, 2023.
[69] Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023.
[70] Shuyuan Xu, Jun Wang, Wenchi Shou, Tuan Ngo, Abdul-Manan Sadick, and Xiangyu Wang. Computer vision techniques in construction: a critical review. Archives of Computational Methods in Engineering, 28:3383–3397, 2021.
[71] Zhenbo Xu, Wei Yang, Ajin Meng, Nanxue Lu, Huan Huang, Changchun Ying, and Liusheng Huang. Towards end-to-end license plate detection and recognition: A large dataset and baseline. In Proceedings of the European conference on computer vision (ECCV), pages 255–271, 2018.
[72] Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v (ision). arXiv preprint arXiv:2309.17421, 2023.
[73] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
[74] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549, 2023.
[75] Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, et al. Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi. arXiv preprint arXiv:2404.16006, 2024.
[76] Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023.
[77] Wei Zhang, Yuan Cheng, Xin Guo, Qingpei Guo, Jian Wang, Qing Wang, Chen Jiang, Meng Wang, Furong Xu, and Wei Chu. Automatic car damage assessment system: Reading and understanding videos as professional insurance inspectors. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 13646–13647, 2020.
[78] Wenxuan Zhang, Mahani Aljunied, Chang Gao, Yew Ken Chia, and Lidong Bing. M3exam: A multilingual, multimodal, multilevel benchmark for examining large language models. Advances in Neural Information Processing Systems, 36, 2024.
[79] Xinnong Zhang, Haoyu Kuang, Xinyi Mou, Hanjia Lyu, Kun Wu, Siming Chen, Jiebo Luo, Xuanjing Huang, and Zhongyu Wei. Somelvlm: A large vision language model for social media processing. arXiv preprint arXiv:2402.13022, 2024.
[80] Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022.
[81] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
[82] Yue Zhao, Ishan Misra, Philipp Krähenbühl, and Rohit Girdhar. Learning video representations from large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6586–6597, 2023.
[83] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

Appendix A Example Cases

To illustrate the task settings in INS-MMBench, this section presents sample cases for each core task, together with the responses from GPT-4o, GeminiProVision, and Qwen-VL-Max.