Automation in Construction
journal homepage: www.elsevier.com/locate/autcon
https://doi.org/10.1016/j.autcon.2024.105863
Received 12 March 2024; Received in revised form 30 October 2024; Accepted 30 October 2024; Available online 22 November 2024

Keywords: Image captioning, Safety inspection, Construction safety, CLIP, Visual attention

Abstract: Traditional safety inspections require significant human effort and time to capture site photos and textual descriptions. While standardized forms and image captioning techniques have been explored to improve inspection efficiency, compiling reports with both visual and text data remains challenging due to the multiplicity of safety-related knowledge. To assist inspectors in evaluating violations more efficiently, this paper presents an image-language model, utilizing Contrastive Language-Image Pre-training (CLIP) fine-tuning and prefix captioning to automatically generate safety observations. A user-friendly mobile phone application has been created to streamline safety report documentation for site engineers. The language model successfully classifies nine violation types with an average accuracy of 73.7%, outperforming the baseline model by 41.8%. Experiment participants confirmed that the mobile application is helpful for safety inspections. This automated framework simplifies safety documentation by identifying violation scenes through images, improves overall safety performance, and supports the digital transformation of construction sites.
1. Introduction

Safety is a core value in construction project management. Currently, a mixture of tools, such as paper-based safety reports, tablet-based safety observations, and mobile camera photologs, are employed to document onsite safety non-compliance. However, consolidating these data from different storage silos and analyzing them in a scalable way often requires considerable effort and is prone to human error [1]. This fragmentation can lead to delays in identifying and addressing safety issues, potentially compromising onsite safety. In recent years, there has been a shift towards smartphone apps aimed at streamlining documentation and reducing human errors during data conversion [2,3]. These apps provide a more integrated approach to data collection but still face challenges in ensuring consistency and comprehensiveness in safety reporting. Despite differing workflows among these tools, a common objective is to populate safety reports with rich site photologs and detailed text descriptions, essential for clear communication and compliance verification.

Textual descriptions elucidate safety non-compliance by typically specifying the objects, actions, and contextual factors directly related to the safety regulations of companies or official agencies [3,4]. While there are particular formats to adhere to, the actual descriptions often vary from person to person, leading to inconsistencies. Recent studies have developed language models capable of comprehending these descriptions and establishing a knowledge base for automated mapping of safety regulations [5]. Although initial results indicate promising performance, challenges remain due to the diverse expressions found in textual descriptions. For example, both "the openings should be covered by the safety net" and "the openings need falling prevention" indicate non-compliance regarding falling hazards but use different expressions. Photographs provide visual references and concrete evidence of non-compliance, significantly enhancing language understanding. However, manually linking these photographs with appropriate textual descriptions remains labor-intensive. Multiple research endeavors have demonstrated promising results in integrating visual and textual data [6]. This paper aims to automate the generation of construction safety observations by leveraging image-language embedding techniques, thereby reducing the workload of the manual reporting process and improving the accuracy and consistency of safety documentation.

Recent research has led to the development of vision-language models designed to comprehend the relationships between images and
text. Among the key applications of these models is image captioning, or image-to-text generation, where the model predicts corresponding text based on an input image [7]. This technique enables the model to identify connections between images and text and generate sentences that describe them. However, most of these methods primarily focus on describing general construction photos [8,9], necessitating adaptation for construction safety-specific scenarios. Numerous pre-trained vision-language models, built on different frameworks and modalities, have emerged to address tasks involving image encoding and text generation [10]. By applying pre-trained vision-language models, we can build our models on a solid foundation and fine-tune them for construction safety purposes. Conversely, ChatGPT [11] can generate corresponding text according to the provided prompt. However, user scenarios are limited to prompt conversation, and we cannot access the model directly to modify it to process image information.

One robust vision-language pre-trained model is Contrastive Language-Image Pre-training (CLIP) [12], which features a dual encoder for both text and image data. The core objective of contrastive learning in this context is to discern the types of captions and violations associated with safety violation images. In our research, we employ CLIP prefix tuning and incorporate image features encoded by CLIP as a prefix [13,14]. This modification enables the language model to generate text descriptions and violation information more accurately and specifically.

In practice, this research develops a framework that streamlines safety documentation in construction. The framework automatically generates descriptions of image violations. These descriptions comprise three key aspects: violation status, violation type, and object relationships. This information serves as an indication for further safety analysis during inspections. A mobile application has also been developed to simplify real-world inspections for site engineers by assisting with unsafe condition analysis and non-compliance reporting.

This paper makes the following contributions: (1) integration of CLIP with safety reporting: we demonstrate how CLIP, a state-of-the-art vision-language model, can be adapted and fine-tuned for the specific purpose of generating construction safety observations; (2) automated text and image embedding: by leveraging CLIP prefix tuning, we show how image features can be used as a prefix to improve the generation of detailed text descriptions, violations, and relevant information; (3) reduction of manual workload: our approach aims to significantly reduce the manual effort involved in safety reporting, thereby improving efficiency and accuracy; and (4) enhanced consistency in safety documentation: by automating the generation of safety observations, we aim to enhance the consistency and comprehensiveness of safety reports, leading to better compliance and safer construction sites.

The paper commences with an introduction of the challenges in the current safety inspection and documentation process. This is followed by a review of state-of-the-art approaches, with a detailed description of image captioning to understand construction scenes and the translation between different embedding features. Subsequently, the research methodology and proposed framework are presented. Our framework includes four key modules: dataset development, Construction CLIP fine-tuning, CLIP prefix captioning, and the CLIP attention model. After that, the mobile application is developed with our model to support the practical usage scenario. This is then validated through a quantitative experiment and user experience testing.

2. Related work

Numerous Natural Language Processing (NLP) techniques and algorithms have emerged to address a wide range of text-related challenges, including tasks like knowledge extraction from documents and information management within the construction and safety domains. In the following discussion, we will delve into the realm of vision-based Natural Language Generation in the context of construction. The primary approach within Natural Language Generation (NLG) for extracting textual information from images is the Vision Language Model (VLM). VLM accomplishes this by encoding images to obtain features and then generating textual descriptions that elucidate the contextual information within the image. This approach aligns closely with the fundamental concept of image captioning. Throughout our discussion, we will explore various facets of VLM and its applications in image captioning.

2.1. Construction scene understanding by image captioning

Liu et al. [9] tackled the challenge of providing a structured linguistic depiction of construction activity scenes and conveying these scenes through the application of image captioning techniques. This linguistic description breaks down the sentences of a scene into main objects, primary actions, and primary attributes, primarily focusing on five distinct construction activities: cart transportation, masonry work, rebar work, plastering, and tiling. Each construction scene should include at least one of these construction activities, along with five descriptions following the formatting guidelines established by the MS COCO caption format [15]. The captioning model used for this purpose comprises a visual encoder and a sequential decoder. The visual encoder leverages well-known CNN frameworks like VGG-16 and ResNet-50 to process images, while the sequential decoder is implemented using LSTM as the recurrent neural network (RNN). The generated test results, assessed by human evaluators, demonstrate the feasibility of applying image captioning in practical scenarios. This research endeavor seeks to bridge the gap between visual information and natural language sentences, focusing primarily on a set of five common construction activities. However, it is worth noting that the scope of construction activities addressed in this research is limited to these five common types and certain types of linguistic schema [16,17]. In reality, there exists a wide array of diverse scenarios necessitating natural language descriptions for construction activities, which may not yet have real-world application scenes to implement this captioning method effectively.

Xiao et al. [18] conducted a feasibility study to explore the potential of image captioning within construction scenarios. They achieved this by developing a vision language model that draws from the computer vision community. The dataset created for this study adheres to a linguistic schema and primarily focuses on images related to construction equipment. In contrast to basic CNN-RNN methods, more advanced vision language models were employed, specifically combining ResNet101 [19] and Transformer [20]. This approach led to the creation of a state-of-the-art image captioning model that incorporates a detailed attention mechanism. The outcomes of this research were promising and demonstrated the practicality of image captioning within the construction domain. Moreover, due to the growth of deep-learning approaches applied in construction, several researchers are using the image captioning technique to extract knowledge related to construction. However, there is still a lack of focus on the diversity of non-compliance descriptions related to construction safety [21,22].

2.2. Text generation with different embedding features

Natural Language Generation (NLG) has found diverse applications across different scenarios for extracting information from both documents and images. These applications span a range of tasks, including question answering based on building regulations [23], comprehending construction scenes through UAV-acquired images [24], and employing textual and visual encoder–decoder models for generating image captions that explain activities and scene objects [25]. In recent years, the evolution of language models, particularly those based on the Transformer architecture such as BERT, GPT-2, and the Text-To-Text Transfer Transformer (T5) [26], has taken center stage as dominant solutions for both Natural Language Processing (NLP) and Computer Vision (CV) tasks [27]. These models are pre-trained and fine-tuned to enhance
performance in downstream tasks, including question answering and image captioning. Although language models have been utilized in various aspects, there is a notable absence of research discussing their application in the context of safety inspection reports, which typically contain rich text and image pairs. Furthermore, the relationships between images captured for safety inspection and the violation types have yet to be explored.

To link an image with the content it depicts, CLIP, a vision-language model introduced by OpenAI, is adopted to handle this task [12]. What sets CLIP apart is its ability to establish connections between images and text automatically, without the need for labeled data. This is achieved by encoding a sequence of image and text features into a shared dimensional space. Within this space, the cosine similarity between each image and text feature is computed, serving as a measure of the similarity between every image and text pair. The core concept underlying CLIP is contrastive learning, which becomes evident when comparing the similarity across the matching image or text [28]. When CLIP makes a prediction, it essentially selects the image-text pair with the highest similarity, thereby implying a contrastive relationship between the candidates of image and text or text and image. This contrastive learning approach has broad applications, including attribute classification and recognition within images [29]. CLIP has also been applied in construction scenarios to classify the activity of an excavator. Chen et al. [30] chose CLIP as the model to perform zero-shot activity learning to identify the activity status of an excavator. Ghelmani et al. [31] combine CLIP and a 3D-based Res-Net to recognize excavator activities within a limited dataset. Evaluations of CLIP's performance [32] show that it performs considerably better than popular image encoders trained on specific domain data, such as BottomUp-Topdown. Other studies [33] have demonstrated that CLIP often outperforms other networks using one or two modalities. Compared with the well-known language model BERT, the CLIP text encoder, although it lags behind BERT on purely textual tasks, still shows better performance in multimodal association [34].

The recent growth of natural language processing techniques across various fields presents a compelling opportunity to improve documentation processes in construction. While image captioning has provided valuable insights into construction scenes, current research does not attempt to analyze non-compliance situations during safety inspections. This paper addresses this gap by proposing a novel approach for documenting and analyzing unsafe conditions and violation types. This approach leverages the power of language models, specifically CLIP, which excels due to its innovative contrastive learning approach and zero-shot learning. Understanding the unsafe conditions in images will help inspectors analyze the safety situation, take necessary action, and assist in report generation.

3. Methodology

This study proposes an innovative construction safety inspection workflow to document and analyze construction safety images, utilizing Construction CLIP and CLIP prefix captioning to assimilate domain knowledge from safety reports effectively. The safety reports were provided by associated construction companies and written by professional safety inspectors. Fig. 1 provides an illustrative overview of our framework development, including four principal modules: Construction CLIP fine-tuning, CLIP prefix captioning, CLIP attention, and the mobile application. It involves acquiring contrastive features from images collected from construction sites, along with attribute annotations. The attention map generated by the CLIP attention model visualizes how the model understands the input image. Subsequently, the framework is integrated into a mobile application that safety inspectors can seamlessly adopt. After developing our system, we conducted experiments to validate the proposed model's performance and the mobile application's usability. The following sections delve into a comprehensive exposition of the methodologies and procedures for each module.
3.1. Construction CLIP fine-tuning

CLIP uses the self-attention Transformer and the Vision Transformer as its text and vision encoders. Transformers have improved on traditional recurrent layers using an encoder–decoder architecture and achieved exceptional performance among language models. In particular, the Vision Transformer (ViT) was introduced as an alternative to conventional CNN-based image encoders by employing Transformer blocks [35]. This involves segmenting an image into patches and acquiring sequential positional embeddings to record the position of each patch meticulously. Subsequently, these patches are channeled into a Transformer block, mirroring the treatment of tokenized words within NLP applications. ViT showcases notable performance across numerous image classification datasets with minimal adjustments, demonstrating its robust ability to extract image features for supplementary objectives. Through this architecture integrating a Transformer and ViT, CLIP attains remarkable performance in vision-language modeling, spanning a spectrum of tasks in the context of extensive dataset refinement [12]. Fig. 2 depicts the proposed architecture for the fine-tuning process of Construction CLIP. First, the attribute list, including caption type, violation type, and images, is fed into the text and image encoders. The caption type refers to the safety status presented in the images, while the violation type provides the violation classification information. We obtain T′ and I′ in the form of feature embeddings. Next, the model calculates the cosine similarity between the image embeddings I′ and the different type embeddings T′ and tries to find the type with the highest similarity corresponding to the image. The attribute of the pairing result is the output of Construction CLIP. In addition, the label of the violation type is also fine-tuned after the fine-tuning process of the caption type. During the fine-tuning process, the images within the caption datasets are categorized into subsets based on the labels of caption type and violation type associated with each image.
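To make the similarity-based pairing step concrete, the sketch below matches an image against candidate caption-type and violation-type descriptions by cosine similarity. It uses the public "openai/clip-vit-base-patch32" checkpoint as a stand-in for the fine-tuned Construction CLIP, and the prompt wording and file name are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal sketch of CLIP-style type pairing by cosine similarity.
# Assumption: the public CLIP checkpoint stands in for Construction CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

caption_types = ["a compliant construction scene", "a construction safety violation"]
violation_types = [
    "falling hazard", "missing personal protective equipment", "electric hazard",
    "workspace hazard", "material hazard", "explosion hazard",
    "protruding object hazard", "mechanical hazard", "transport hazard",
]

def classify(image_path: str, labels: list[str]) -> str:
    """Encode the image and candidate labels, then return the label whose text
    embedding has the highest (scaled) cosine similarity with the image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=-1)  # similarities -> probabilities
    return labels[int(probs.argmax())]

site_photo = "site_photo.jpg"  # hypothetical input image
print(classify(site_photo, caption_types))
print(classify(site_photo, violation_types))
```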
3.2. CLIP prefix captioning

The additional prefix-tuning attributes allow us to enrich the language model with further information and contextual parameters for text generation. Building on the concept of conditional generation, we adopt the CLIP prefix framework pioneered by Ron et al. [13] and Wang et al. [14]. The language model produces sentences corresponding to the input words, which serve as conditions for text generation [26,36,37]. This objective is represented by Formula (1), where the prediction of the subsequent word is derived from the existing contents y_{1:i−1} and the other input x. The generated token is integrated into the original input sequence. This iteration mechanism, characteristic of RNNs, is known as autoregression [38].

P(y_i | y_{1:i−1}, x)    (1)

To incorporate additional attributes with embedded captions, the language model must generate captions that correspond more closely to both the images and the original captions. Ultimately, the attributes predicted by CLIP are encoded using the GPT-2 tokenizer and fused with the image and caption embeddings. These embedded attributes and images are treated as the prefix of the concatenated embeddings, subsequently serving as inputs to the language model. The right part of Fig. 3 describes the architecture of CLIP prefix captioning. Based on this architecture, the additional attribute is automatically generated through the fine-tuned Construction CLIP model. The concatenated vector includes the caption and attribute embeddings produced by the GPT-2 tokenizer using Byte-Pair Encoding (BPE), and CLIP-embedded images transformed by Multi-Layer Perceptrons (MLPs).

The decoding methods are pivotal in determining how the final text outputs are generated from the model. Through auto-regressive language generation, GPT-2 generates sequences of words based on predictions made during each iteration. The generation incorporates the original sequence and extends it by appending new predictions.
Fig. 2. Architecture of CLIP fine-tuning with caption type and violation type.
As Eq. (2) explains, autoregressive language generation operates through the multiplication of probabilities. The probabilities are associated with each decomposed conditional distribution within the overarching probability distribution. The initial context, represented as x, stems from the input embeddings. The length N of the output sequence is dynamically ascertained by the point at which the time step i yields the End Of Sentence (EOS) token in real time. Several prominent decoding strategies have emerged, including Greedy search, Beam search, Top-K sampling, Top-p sampling, and Temperature. In this context, we employed beam search with N set to 3 and a temperature of 0.5 as decoding strategies. Finally, the model outputs a caption describing the safety-related information obtained from the image and the caption and violation types.

P(y_{1:N} | x) = ∏_{i=1}^{N} P(y_i | y_{1:i−1}, x), with y_{1:0} = ∅    (2)
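As an illustration of this decoding setup (beam width 3, temperature 0.5), the snippet below runs Hugging Face GPT-2 generation with beam-sample decoding. The text prompt stands in for the concatenated prefix embeddings used in the actual framework, and the length limit is a placeholder.

```python
# Sketch of the decoding configuration described above (beam search, N = 3,
# temperature = 0.5). A plain text prompt stands in for the CLIP prefix.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "violation, falling:"  # illustrative stand-in for the encoded prefix
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        num_beams=3,            # beam width N = 3
        do_sample=True,         # temperature only takes effect when sampling
        temperature=0.5,
        max_new_tokens=30,
        eos_token_id=tokenizer.eos_token_id,  # stop at the EOS token
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```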
Table 1
Image quantities with different caption types.
              Status   Violation   Total
Caption type  279      985         1264

Table 2
Image quantities with different violation types.
                Falling    PPE         Electric    Workspace   Material
Violation type  653        192         76          123         66
                Explosion  Protruding  Mechanical  Transport   Total
Violation type  49         49          35          21          1264
3.3. CLIP attention
Table 3
Caption data with additional attribute.
Sample images   Ground truth caption data
∙ caption_type: violation
∙ violation_type: falling
∙ violation_list: Openings lack a safety net or are non-compliant.
∙ caption: Opening without guardrails
∙ file_name: data1.jpg
∙ objects: openings, safety net
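For reference, one annotated image of this kind can be stored as a single JSON record; the snippet below is a minimal illustration that simply restates the Table 3 example using the field names shown in the table.

```python
# One annotated record following the schema in Table 3 (values from the example row).
import json

record = {
    "file_name": "data1.jpg",
    "caption_type": "violation",
    "violation_type": "falling",
    "violation_list": "Openings lack a safety net or are non-compliant.",
    "caption": "Opening without guardrails",
    "objects": ["openings", "safety net"],
}

print(json.dumps(record, indent=2, ensure_ascii=False))
```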
We utilized the AdamW optimizer [45]. This optimization algorithm, an extension of the Adam optimizer, incorporates a distinct weight decay regularization term [46] to prevent overfitting. We implemented a linear learning rate scheduler that gradually increases the learning rate during an initial warm-up period.
After fine-tuning the Construction CLIP model, the CLIP prefix captioning model is developed. The dataset also needs to be pre-processed. For textual data, including captions and attributes, tokenization is executed using a GPT-2-based tokenizer implemented with Byte-Pair Encoding (BPE). The captions are taken from the original dataset, while the embedded images and attributes, including caption type and violation type, are obtained from the fine-tuning process. Both caption and attribute tokens are padded to uniform length and masked to establish which tokens require attention. An end-of-sentence marker is appended to each caption to signify sentence termination. Meanwhile, visual data is encoded by the image encoder from the fine-tuned Construction CLIP model. For text generation, the GPT-2 language model is selected. The input caption and attribute tokens are concatenated and subsequently embedded using word embeddings sourced from GPT-2. The encoded input image is transformed into the GPT-2 space through diverse mapping strategies, namely MLP and Transformer. Following this, both textual and visual embeddings are concatenated to serve as the input for the language model to generate captions. The loss function applied is the cross-entropy loss, comparing output logits against labeled captions. The optimization procedure and scheduler mirror those employed in the Construction CLIP fine-tuning, including the AdamW optimizer and the linear scheduler featuring warm-up. Consisting of the image and attribute embeddings, the concatenated embeddings are presented as a prefix. The prefix is similar to a controller steering the output of the language model. In the implemented approach, we extended the lengths of the image and attribute prefix embeddings to 20 and 10, respectively, from their original implementation lengths. This augmentation of prefix length enables the storage of greater informational depth, consequently enhancing the language model's precision and fluency in generating text.
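The sketch below shows one way the pieces described above could fit together: a CLIP image embedding is mapped into GPT-2's embedding space by an MLP to form a 20-position visual prefix, BPE-tokenized attributes contribute a 10-position prefix, and the caption tokens follow, with cross-entropy computed only on the caption positions. The MLP size, padding scheme, loss masking, and all variable names are assumptions in the spirit of the described approach, not the authors' exact implementation.

```python
# Sketch of prefix assembly for CLIP prefix captioning (illustrative shapes/names).
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
embed_dim = gpt2.config.n_embd        # 768 for the base GPT-2
clip_dim = 512                        # CLIP ViT-B/32 image embedding size
image_prefix_len, attr_prefix_len = 20, 10

# MLP that maps one CLIP image embedding to a sequence of GPT-2-sized vectors.
image_mapper = nn.Sequential(
    nn.Linear(clip_dim, embed_dim * image_prefix_len // 2),
    nn.Tanh(),
    nn.Linear(embed_dim * image_prefix_len // 2, embed_dim * image_prefix_len),
)

def build_inputs(clip_image_emb, attribute_text, caption_text):
    """Concatenate [image prefix | attribute tokens | caption tokens] and build
    labels that score only the caption positions (prefix positions get -100)."""
    wte = gpt2.transformer.wte  # GPT-2 word embedding table

    # Visual prefix: (1, 20, embed_dim)
    img_prefix = image_mapper(clip_image_emb).view(1, image_prefix_len, embed_dim)

    # Attribute prefix: BPE-tokenize, pad/truncate to 10 tokens, then embed.
    attr_ids = tokenizer(attribute_text, return_tensors="pt").input_ids[:, :attr_prefix_len]
    pad = attr_prefix_len - attr_ids.shape[1]
    if pad > 0:
        attr_ids = torch.cat([attr_ids, torch.full((1, pad), tokenizer.eos_token_id)], dim=1)
    attr_emb = wte(attr_ids)

    # Caption tokens, terminated with the end-of-sentence marker.
    cap_ids = tokenizer(caption_text + tokenizer.eos_token, return_tensors="pt").input_ids
    cap_emb = wte(cap_ids)

    inputs_embeds = torch.cat([img_prefix, attr_emb, cap_emb], dim=1)
    ignore = torch.full((1, image_prefix_len + attr_prefix_len), -100)
    labels = torch.cat([ignore, cap_ids], dim=1)  # loss only on caption tokens
    return inputs_embeds, labels

clip_image_emb = torch.randn(1, clip_dim)  # placeholder for a real CLIP embedding
inputs_embeds, labels = build_inputs(clip_image_emb, "violation, falling", "Opening without guardrails")
loss = gpt2(inputs_embeds=inputs_embeds, labels=labels).loss  # cross-entropy loss
loss.backward()
```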
Before integrating with the mobile application, the attention layer is embedded in a CLIP model to calculate the relevancy between image and text. In the initial step, the Construction CLIP model, which was fine-tuned with caption type and violation type, did not understand the context between the textual and visual information from the datasets. To gain knowledge from the captions, we fine-tuned another CLIP model for the attention model. This CLIP attention model is fine-tuned with images and captions. In each batch of the training process, the loss function is calculated from the correctness of the matching between the images and the captions. The model's parameters are updated during the training process according to the correctness of each image and caption in each batch. The model was trained for 1000 epochs, and a linear scheduler was also applied to adjust the learning rate. Finally, after fine-tuning this CLIP attention model with captions, the objects and the relationships between the objects in the images are interpreted. The model can generate an attention map for the captions using the relevance layer in the CLIP attention model to explain the connection between the generated caption and the image.
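The batch-wise matching objective described above corresponds to CLIP's standard in-batch contrastive loss (symmetric cross-entropy over the image-caption similarity matrix). A minimal sketch of one fine-tuning step under that assumption is shown below; the checkpoint, batch contents, and learning rate are illustrative placeholders.

```python
# One contrastive fine-tuning step: the loss rewards correct image-caption
# matches within the batch (symmetric cross-entropy over similarity logits).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

# Illustrative mini-batch of image paths and their ground-truth captions.
image_paths = ["data1.jpg", "data2.jpg"]
captions = ["Opening without guardrails", "Worker without a safety helmet"]

images = [Image.open(p).convert("RGB") for p in image_paths]
inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)

outputs = model(**inputs, return_loss=True)   # CLIP's in-batch contrastive loss
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```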
4.3. Evaluation metrics

For CLIP fine-tuning evaluation purposes, we use the accuracy metric for multi-class classification assessment. The fine-tuning process of Construction CLIP operates in two distinct settings based on the number of elements in the violation type (9 elements). The configurations involve combinations of 2 elements and 9 elements: the former configuration helps the model consider contrasting features between pairs of data types more accurately, while the latter configuration mirrors the real-world scenario during inference, where all violation types are present. Furthermore, two additional configurations are employed to interlink fine-tuned weights and capitalize on the advantages arising from diverse element numbers (N) for the combinations. Sequential fine-tuning with N = 9 following N = 2, and vice versa, for violation types equips the model with the capability to consider features contrastively. The fine-tuning progress is overseen through loss and accuracy metrics computed for the training and testing sets after each epoch. Across both fine-tuning phases with different N values, the model with the highest accuracy on the testing set over the epochs is designated as the final model. After fine-tuning under the two violation-type settings, a conclusive model with N = 9 following N = 2 is chosen specifically. This selection ensures the model's proficiency in distinguishing all 9 violation types.
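For reference, per-class and overall accuracy of the kind reported later in Tables 4 and 5 can be computed as follows; the label and prediction lists below are dummy values for illustration only.

```python
# Per-class and overall accuracy, as reported for caption and violation types.
from collections import defaultdict

def accuracy_report(y_true, y_pred):
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    per_class = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / len(y_true)
    return per_class, overall

# Dummy labels for illustration only.
y_true = ["falling", "falling", "PPE", "electric"]
y_pred = ["falling", "PPE", "PPE", "electric"]
print(accuracy_report(y_true, y_pred))
```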
To evaluate the CLIP prefix captioning model trained on various datasets, standard image captioning metrics are often used. However, many of these metrics rely on English-specific resources like lexical databases or graph-based semantic representations. Given this language-specific limitation, we selected the BLEU (Bilingual Evaluation Understudy) [47] score as our evaluation metric. While widely adopted in natural language processing for assessing machine-generated text against human-written references, BLEU still has shortcomings and may not perfectly correlate with human judgment. BLEU primarily measures n-gram precision, focusing on the overlap between the generated and reference texts. This approach does not capture the information's semantic meaning, context, or correctness, which are crucial for accurate and reliable safety inspection reports. Furthermore, BLEU does not consider synonyms, paraphrasing, or the overall meaning of the text. As a result, a generated caption might convey the same meaning as the reference but use different words and phrases, leading to a lower BLEU score. Other evaluation metrics for image captioning, such as METEOR and ROUGE, also correlate poorly with human judgments. While the SPICE metric exhibits a stronger correlation with human judgments compared to other metrics, it falls short in capturing the syntactic structure of generated sentences [48]. Given these considerations, we employ BLEU as a preliminary evaluation metric for our initial model development, acknowledging its limitations and planning to explore more robust evaluation methods in future research.
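BLEU-1 through BLEU-4 scores of the kind reported later can be computed with NLTK as sketched below; the candidate and reference sentences are placeholders, and smoothing is applied only to avoid degenerate zero scores on short captions.

```python
# BLEU-n computation for a generated caption against a reference caption.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["opening", "without", "guardrails", "near", "the", "stairwell"]]
candidate = ["opening", "without", "a", "safety", "net"]

smooth = SmoothingFunction().method1  # avoids zero scores for short sentences
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform weights over 1..n-grams
    score = sentence_bleu(reference, candidate, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```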
Finally, we conducted a user experience test with ten safety engineers from various construction companies to evaluate the effectiveness and usability of the mobile application. Participants received training on the application's functionalities before using it during their daily safety inspections. Following the testing period, users were asked to assess the application's performance based on two criteria: (1) Effectiveness: Is the generated information significantly helpful for determining safety violations and reducing documentation time? (2) Usability: Was the application suitable for daily safety inspection routines? User feedback was thoroughly analyzed to inform subsequent development iterations.

5. Results and discussion

5.1. Model predictions and evaluation results

This section details the model performance and evaluation outcomes. The accuracy of the CLIP fine-tuning model is summarized in Tables 4 and 5. The model achieved an average accuracy of 82.9% for caption types and 73.7% for violation types. The falling from height category, which had more training images, reached 95% accuracy. The loss and accuracy experienced substantial changes within limited training epochs. The experiment results indicate the model's strong capacity to fit the training data. The Construction CLIP model effectively classifies caption and violation types, extracting valuable information from images that can inform further analysis.
Table 4
Caption types classification accuracy.
          Status   Violation   Total/Average
Accuracy  0.44     0.94        0.829

Table 5
Violation types classification accuracy.
          Falling    PPE         Electric    Workspace   Material
Accuracy  0.95       0.82        0.75        0.28        0.43
          Explosion  Protruding  Mechanical  Transport   Total/Average
Accuracy  0.10       0.10        0.43        0.20        0.737

Table 6
Image captioning model evaluation.
Mapping type   BLEU-1   BLEU-2   BLEU-3   BLEU-4
Baseline (a)   0.606    0.506    0.398    0.320
Transformer    0.249    0.242    0.333    0.249
MLP            0.454    0.447    0.525    0.454
(a) Baseline score was reported by Xiao et al. [18].

Table 6 displays the outcomes of the CLIP prefix captioning model, alongside those of a baseline model employing a CNN and LSTM architecture trained on the ACID caption dataset [18]. This dataset follows a linguistic schema, implying elevated scores for image captioning tasks. Although the transformer-based approach was anticipated to yield superior performance due to its attention mechanism, the MLP method attained higher BLEU and ROUGE scores. This outcome suggests that overly complex models may overfit due to insufficient data volume. The BLEU-1 and BLEU-2 scores show mild performance, around 0.45, compared to the baseline score. However, the BLEU-3 and BLEU-4 scores reach 0.525 and 0.454, indicating that our CLIP prefix captioning model can generate more accurate sentences with longer phrases. While automatic metrics expedite objective evaluation, a substantial gap persists between these metrics and human judgment. BLEU does not account for synonyms, paraphrasing, or the overall meaning of the text. A generated caption might convey the same meaning as the reference but use different words and phrases, leading to a low BLEU score. Consequently, we will incorporate human evaluation to address the complexity inherent in our target images.

The CLIP attention model can indicate, by color, the areas related to the images and text. This color demonstration can enhance explainability, helping the safety inspector understand the reasoning behind the generated caption. Table 7 shows the results of the predicted captions and the corresponding attention according to the visual and textual data. The color red shows higher relevance between the captions and a certain area, which can indicate where the captions are focusing on the image. On the other hand, the blue color shows that there might not be any relevance between this area and the generated captions.

5.2. Mobile application user experience testing

Testing results indicate that the automatically generated captions effectively assisted safety engineers in identifying violations without referring to external regulations. By eliminating the need for manual data entry, the system significantly reduced documentation time. The user interface was found to be intuitive, allowing for easy interaction with images and review of safety report information before submission. Participants confirmed the application's suitability for daily safety inspections, aiding in both documentation and violation detection.

A challenge identified was the integration of the system with existing company software platforms. As many construction companies utilize exclusive systems for document creation, submission, and storage, integrating the automated image captioning model into these platforms would be beneficial.

5.3. Discussion

5.3.1. Framework objectives and outcomes

Image captioning offers promising value for automated safety inspections in determining violations through images. Recent advancements in computer vision and natural language processing have enabled models to accurately generate textual descriptions from construction images. Fine-tuning modules further enable image classification, including compliance or non-compliance, based on the training process, which acknowledges available information from safety reports and regulations. Our research demonstrates the effectiveness of this approach by achieving robust performance in recognizing, classifying, and describing violations in construction images. On the other hand, the mobile application brings a seamless process for safety inspectors to create daily safety reports, simplifying the overall documentation process and facilitating the digital transformation of construction sites.

5.3.2. Model development challenges

Despite our model's promise, several challenges arose during the experiment phase that should be critically considered for further improvement. A primary issue was the overfitting exhibited by the fine-tuned Construction CLIP model. This was demonstrated by a significant discrepancy between training and testing accuracy, indicating the model's inability to generalize to unseen data. This difference is attributed to the model's tendency to grasp complex details and noise inherent in the training data [49]. The model struggles to generalize to unseen data excluded from the training set. This divergence between training and testing accuracies signifies the presence of overfitting during the fine-tuning phase, characterized by low bias and high variance.

Several factors are known to shape overfitting, such as insufficient training data, excessive model complexity, and noise in the training data. The relatively small and imbalanced dataset primarily amplified the model's tendency to memorize training examples rather than learn underlying patterns. Additionally, the complex architecture of the model exacerbated the issue by allowing it to capture noise and irrelevant details from the training data. The model may learn these incorrect patterns if the training data contains noise or irrelevant information. Besides, the model's predictions vary considerably with minor changes in the input data.

In contrast, the prefix model demonstrated a more stable training trajectory, suggesting better generalization. The training loss consistently decreases, indicating that the prefix captioning model can generate captions resembling the ground truth. While testing loss was not evaluated in this context, the model's overall performance met expectations.

Table 8 showcases examples of accurate attribute and caption generation. The model effectively identifies violation types and caption types and provides descriptive details. However, when applied to diverse construction site scenarios, the model's performance falls short of practical requirements. Table 9 depicts cases of failure with incorrect violation types and captions. For example, an unprotected opening was misclassified as a fall protection net issue, and a falling hazard without a net was incorrectly labeled as protruding rebar. In application, safety inspectors need to review the captions before submitting the safety reports. Additionally, specific captions replicate ground truth captions using the exact words originally used, across disparate generation conditions. This situation is indicative of model collapse, which occurs after prefix tuning and causes the model to generate a restricted set of patterns.
Table 7
Attention map with generated captions.
Sample image Attention map Ground truth Prediction
Table 8
Attributes and captions prediction.
Sample image Ground truth Prediction
Table 9
Errors of attributes and captions prediction.
Sample image   Ground truth   Prediction

5.3.3. Practical implication

This framework revolutionizes construction safety inspection by eliminating paper-based processes. All inspection data is captured digitally and stored centrally, streamlining information transfer and analysis. By generating descriptive captions for images, inspectors can efficiently review and document safety violations, accelerating the identification and resolution of hazards. This digital transformation empowers inspectors and managers to proactively address complex safety challenges, improving overall site safety performance. In the same way, our approach represents a significant step towards a fully automated safety management system. Automating violation identification through image captioning enables faster, more accurate safety assessments and decision-making. Implementing this framework will transform construction sites into more digital and efficient environments, ultimately enhancing worker safety and project outcomes. Beyond safety inspections, image captioning techniques have potential applications in various construction management aspects. For instance, detailed image descriptions can support a well-organized quality control process or activity recognition. This will assist in confirming the reliability between manual observation reports and image monitoring records.

5.3.4. Limitations and future work

Our research identified several key limitations that warrant further investigation. The overfitting of the Construction CLIP model is a primary concern. Various strategies can be employed to mitigate overfitting based on the causes of this problem. Firstly, it is essential to expand the scale of the dataset. One of the popular methods is augmenting the training dataset. Given the relatively limited size of the image captioning dataset (1264 pairs) compared to the large size of the pre-trained CLIP dataset, increasing the training dataset could solve the problem of overfitting (a generic augmentation example is sketched at the end of this subsection). Secondly, reducing the number of layers within the transformer block is another solution to address model complexity [49]. Thirdly, regarding joint fine-tuning, we observed that training different element combinations led to a sharp increase in training accuracy and a corresponding decline in loss. Nevertheless, the impact on testing accuracy was uncertain. This underscores the limitations of these strategies in effectively learning from a relatively small dataset with different labels. The dataset's imbalance across violation classes also impacted model performance. We recommend creating a more balanced dataset by collecting additional images for underrepresented categories. Despite these challenges, the model demonstrated promising capabilities in generating informative captions. Exploring various fine-tuning and decoding strategies can assist in boosting its accuracy and generalizability, including implementing different language models. Furthermore, to increase the usability of the mobile application, the model should be well-packaged for easy transfer and integration with existing management platforms. Future research should focus on overcoming these limitations to develop a more robust and reliable image captioning model for construction safety inspections.
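As one concrete, generic way to enlarge the image set mentioned above, standard photometric and geometric augmentations can be applied to site photos before fine-tuning. The torchvision pipeline below is an illustrative sketch, not the configuration used in this study, and the file name is a placeholder.

```python
# Example augmentation pipeline for expanding the image-caption training set.
# This is a generic torchvision sketch, not the configuration used in the paper.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
])

image = Image.open("site_photo.jpg").convert("RGB")    # hypothetical input
augmented_views = [augment(image) for _ in range(4)]   # 4 extra training views
```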
6. Conclusion

This paper introduces a framework for automatically identifying safety violations in construction images through image captioning. The primary goal was to develop a method for processing construction site images to accurately detect and classify violations. We employed a transformer-based vision-language model to categorize captions indicating violation status and types. Our model generates descriptive captions with multiple embeddings to extract information from images using image-to-text techniques. The framework achieved 82.9% accuracy for caption types and 73.7% for violation types, with an impressive 95% accuracy for the "falling from height" category. Additionally, BLEU-3 and BLEU-4 scores of 0.525 and 0.454, respectively, demonstrate the model's ability to generate coherent and informative captions.

This research highlights the potential of natural language processing to streamline safety inspections in construction. By leveraging safety inspection records, our framework can identify violations often missed or time-consuming to detect manually, improving safety documentation and prevention. This digital approach also promotes paperless safety inspections, which is the first step in the digital transformation of the construction area.

While our study provides a strong foundation, limitations include dataset size, imbalance, and caption data source constraints. As NLP in construction advances, continuous model refinement is essential to address complex construction environments and safety challenges. Future work will focus on increasing the dataset, enhancing the language model to prevent overfitting, and exploring the combined use of various language models for more precise safety inspection.

CRediT authorship contribution statement

Wei-Lun Tsai: Writing – review & editing, Writing – original draft, Visualization, Validation, Methodology, Formal analysis, Data curation, Conceptualization. Phuong-Linh Le: Writing – review & editing. Wang-Fat Ho: Writing – review & editing, Visualization, Validation, Software, Methodology. Nai-Wen Chi: Validation, Writing – review & editing. Jacob J. Lin: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Software, Resources, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization. Shuai Tang: Writing – review & editing, Methodology, Conceptualization. Shang-Hsien Hsieh: Writing – review & editing, Project administration, Funding acquisition.

Declaration of competing interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Jacob J. Lin reports financial support was provided by National Science and Technology Council. If there are other authors, they declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The project is supported in part by MOST, Taiwan 110-2622-E-002-039, 110-2222-E-002-002-MY3. The support and help of contractors, subcontractors, and service companies involved in collecting data and implementing the system are greatly appreciated, including CECI Engineering Consultants, Inc., Chien Kuo Construction Co. Ltd., Feng Yu Construction Co. Ltd., and Ruiju Construction Co. Ltd.

Data availability

Data will be made available on request.