Automation in Construction
journal homepage: www.elsevier.com/locate/autcon
https://doi.org/10.1016/j.autcon.2024.105863
Received 12 March 2024; Received in revised form 30 October 2024; Accepted 30 October 2024; Available online 22 November 2024

Keywords: Image captioning, Safety inspection, Construction safety, CLIP, Visual attention

Abstract: Traditional safety inspections require significant human effort and time to capture site photos and textual descriptions. While standardized forms and image captioning techniques have been explored to improve inspection efficiency, compiling reports with both visual and text data remains challenging due to the multiplicity of safety-related knowledge. To assist inspectors in evaluating violations more efficiently, this paper presents an image-language model, utilizing Contrastive Language-Image Pre-training (CLIP) fine-tuning and prefix captioning to automatically generate safety observations. A user-friendly mobile phone application has been created to streamline safety report documentation for site engineers. The language model successfully classifies nine violation types with an average accuracy of 73.7%, outperforming the baseline model by 41.8%. Experiment participants confirmed that the mobile application is helpful for safety inspections. This automated framework simplifies safety documentation by identifying violation scenes through images, improves overall safety performance, and supports the digital transformation of construction sites.
1. Introduction

Safety is a core value in construction project management. Currently, a mixture of tools, such as paper-based safety reports, tablet-based safety observations, and mobile camera photologs, are employed to document onsite safety non-compliance. However, consolidating these data from different storage silos and analyzing them in a scalable way often requires considerable effort and is prone to human error [1]. This fragmentation can lead to delays in identifying and addressing safety issues, potentially compromising onsite safety. In recent years, there has been a shift towards smartphone apps aimed at streamlining documentation and reducing human errors during data conversion [2,3]. These apps provide a more integrated approach to data collection but still face challenges in ensuring consistency and comprehensiveness in safety reporting. Despite differing workflows among these tools, a common objective is to populate safety reports with rich site photologs and detailed text descriptions, essential for clear communication and compliance verification.

Textual descriptions elucidate safety non-compliance by typically specifying the objects, actions, and contextual factors directly related to the safety regulations of companies or official agencies [3,4]. While there are particular formats to adhere to, the actual descriptions often vary from person to person, leading to inconsistencies. Recent studies have developed language models capable of comprehending these descriptions and establishing a knowledge base for automated mapping of safety regulations [5]. Although initial results indicate promising performance, challenges remain due to the diverse expressions found in textual descriptions. For example, both "the openings should be covered by the safety net" and "the openings need falling prevention" indicate non-compliance regarding falling hazards but use different expressions. Photographs provide visual references and concrete evidence of non-compliance, significantly enhancing language understanding. However, manually linking these photographs with appropriate textual descriptions remains labor-intensive. Multiple research endeavors have demonstrated promising results in integrating visual and textual data [6]. This paper aims to automate the generation of construction safety observations by leveraging image-language embedding techniques, thereby reducing the workload of the manual reporting process and improving the accuracy and consistency of safety documentation.

Recent research has led to the development of vision-language models designed to comprehend the relationships between images and
text. Among the key applications of these models is image captioning, or image-to-text generation, where the model predicts corresponding text based on an input image [7]. This technique enables the model to identify connections between images and text and generate sentences that describe them. However, most of these methods primarily focus on describing general construction photos [8,9], necessitating adaptation for construction safety-specific scenarios. Numerous pre-trained vision-language models, built on different frameworks and modalities, have emerged to address tasks involving image encoding and text generation [10]. By applying pre-trained vision-language models, we can build our models on a solid foundation and fine-tune them for construction safety purposes. Conversely, ChatGPT [11] can generate corresponding text according to the provided prompt. However, user scenarios are limited to prompt conversation, and we cannot access the model directly to modify it to process image information.

One robust vision-language pre-trained model is Contrastive Language-Image Pre-training (CLIP) [12], which features a dual encoder for both text and image data. The core objective of contrastive learning in this context is to discern the types of captions and violations associated with safety violation images. In our research, we employ CLIP prefix tuning and incorporate image features encoded by CLIP as a prefix [13,14]. This modification enables the language model to generate text descriptions and violation information more accurately and specifically.

In practice, this research develops a framework that streamlines safety documentation in construction. The framework automatically generates descriptions of image violations. These descriptions comprise three key aspects: violation status, violation type, and object relationships. This information serves as an indication for further safety analysis during inspections. A mobile application has also been developed to simplify real-world inspections for site engineers by assisting with unsafe condition analysis and non-compliance reporting.

This paper makes the following contributions: (1) integration of CLIP with safety reporting: we demonstrate how CLIP, a state-of-the-art vision-language model, can be adapted and fine-tuned for the specific purpose of generating construction safety observations; (2) automated text and image embedding: by leveraging CLIP prefix tuning, we show how image features can be used as a prefix to improve the generation of detailed text descriptions, violations, and relevant information; (3) reduction of manual workload: our approach aims to significantly reduce the manual effort involved in safety reporting, thereby improving efficiency and accuracy; and (4) enhanced consistency in safety documentation: by automating the generation of safety observations, we aim to enhance the consistency and comprehensiveness of safety reports, leading to better compliance and safer construction sites.

The paper commences with an introduction of the challenges in the current safety inspection and documentation process. This is followed by a review of state-of-the-art approaches, with a detailed description of image captioning to understand construction scenes and the translation between different embedding features. Subsequently, the research methodology and proposed framework are presented. Our framework includes four key modules: dataset development, Construction CLIP fine-tuning, CLIP prefix captioning, and the CLIP attention model. After that, the mobile application is developed with our model to support the practical usage scenario. This is then validated through a quantitative experiment and user experience testing.

2. Related work

Numerous Natural Language Processing (NLP) techniques and algorithms have emerged to address a wide range of text-related challenges, including tasks like knowledge extraction from documents and information management within the construction and safety domains. In the following discussion, we will delve into the realm of vision-based Natural Language Generation in the context of construction. The primary approach within Natural Language Generation (NLG) for extracting textual information from images is the Vision Language Model (VLM). VLM accomplishes this by encoding images to obtain features and then generating textual descriptions that elucidate the contextual information within the image. This approach aligns closely with the fundamental concept of image captioning. Throughout our discussion, we will explore various facets of VLM and its applications in image captioning.

2.1. Construction scene understanding by image captioning

Liu et al. [9] tackled the challenge of providing a structured linguistic depiction of construction activity scenes and conveying these scenes through the application of image captioning techniques. This linguistic description breaks down the sentences of a scene into main objects, primary actions, and primary attributes, primarily focusing on five distinct construction activities: cart transportation, masonry work, rebar work, plastering, and tiling. Each construction scene should include at least one of these construction activities, along with five descriptions following the formatting guidelines established by the MS COCO caption format [15]. The captioning model used for this purpose comprises a visual encoder and a sequential decoder. The visual encoder leverages well-known CNN frameworks like VGG-16 and ResNet-50 to process images, while the sequential decoder is implemented using LSTM as the recurrent neural network (RNN). The generated test results, assessed by human evaluators, demonstrate the feasibility of applying image captioning in practical scenarios. This research endeavor seeks to bridge the gap between visual information and natural language sentences, focusing primarily on a set of five common construction activities. However, it is worth noting that the scope of construction activities addressed in this research is limited to these five common types and certain types of linguistic schema [16,17]. In reality, there exists a wide array of diverse scenarios necessitating natural language descriptions for construction activities, which may not yet have real-world application scenes to implement this captioning method effectively.

Xiao et al. [18] conducted a feasibility study to explore the potential of image captioning within construction scenarios. They achieved this by developing a vision language model that draws from the computer vision community. The dataset created for this study adheres to a linguistic schema and primarily focuses on images related to construction equipment. In contrast to basic CNN-RNN methods, more advanced vision language models were employed, specifically combining ResNet101 [19] and Transformer [20]. This approach led to the creation of a state-of-the-art image captioning model that incorporates a detailed attention mechanism. The outcomes of this research were promising and demonstrated the practicality of image captioning within the construction domain. Moreover, due to the growth of deep-learning approaches applied in construction, several researchers are using the image captioning technique to extract knowledge related to construction. However, there is still a lack of focus on the diversity of non-compliance descriptions related to construction safety [21,22].

2.2. Text generation with different embedding features

Natural Language Generation (NLG) has found diverse applications across different scenarios for extracting information from both documents and images. These applications span a range of tasks, including question answering based on building regulations [23], comprehending construction scenes through UAV-acquired images [24], and employing textual and visual encoder–decoder models for generating image captions that explain activities and scene objects [25]. In recent years, the evolution of language models, particularly those based on the Transformer architecture such as BERT, GPT-2, and the Text-To-Text Transfer Transformer (T5) [26], has taken center stage as dominant solutions for both Natural Language Processing (NLP) and Computer Vision (CV) tasks [27]. These models are pre-trained and fine-tuned to enhance
performance in downstream tasks, including question answering and image captioning. Although language models have been utilized in various aspects, there is a notable absence of research discussing their application in the context of safety inspection reports, which typically contain rich text and image pairs. Furthermore, the relationships between images captured for safety inspection and the violation types have yet to be explored.

To link an image with the content it depicts, CLIP, a vision-language model introduced by OpenAI, is adopted to handle this task [12]. What sets CLIP apart is its ability to establish connections between images and text automatically, without the need for labeled data. This is achieved by encoding a sequence of image and text features into a shared dimensional space. Within this space, the cosine similarity between each image and text feature is computed, serving as a measure of the similarity between every image and text pair. The core concept underlying CLIP is contrastive learning, which becomes evident when comparing the similarity across the matching image or text [28]. When CLIP makes a prediction, it essentially selects the image-text pair with the highest similarity, thereby implying a contrastive relationship between the candidates of image and text or text and image. This contrastive learning approach has broad applications, including attribute classification and recognition within images [29]. CLIP has also been applied in construction scenarios to classify the activity of an excavator. Chen et al. [30] chose CLIP as the model to perform zero-shot activity learning to identify the activity status of an excavator. Ghelmani et al. [31] combine CLIP and a 3D-based Res-Net to recognize excavator activities within a limited dataset. Evaluations of CLIP's performance [32] show that it performs considerably better than popular image encoders trained on specific domain data, such as BottomUp-Topdown. Other studies [33] have demonstrated that CLIP often outperforms other networks using one or two modalities. Compared with the well-known language model BERT, the CLIP text encoder, although it lags behind BERT on purely textual tasks, still shows better performance in multimodal association [34].

The recent growth of natural language processing techniques across various fields presents a compelling opportunity to improve documentation processes in construction. While image captioning has provided valuable insights into construction scenes, current research does not attempt to analyze non-compliance situations during safety inspections. This paper addresses this gap by proposing a novel approach for documenting and analyzing unsafe conditions and violation types. This approach leverages the power of language models, specifically CLIP, which excels due to its innovative contrastive learning approach and zero-shot learning. Understanding the unsafe conditions in images will help inspectors analyze the safety situation, take necessary action, and assist in report generation.

3. Methodology

This study proposes an innovative construction safety inspection workflow to document and analyze construction safety images, utilizing Construction CLIP and CLIP prefix captioning to assimilate domain knowledge from safety reports effectively. The safety reports were provided by associated construction companies and written by professional safety inspectors. Fig. 1 provides an illustrative overview of our framework development, including four principal modules: Construction CLIP fine-tuning, CLIP prefix captioning, CLIP attention, and the mobile application. It involves acquiring contrastive features from images collected from construction sites, along with attribute annotations. The attention map generated by the CLIP attention model visualizes how the model understands the input image. Subsequently, the framework is integrated into a mobile application that safety inspectors can seamlessly adopt. After developing our system, we conducted experiments to validate the proposed model's performance and the mobile application's usability. The following sections delve into a comprehensive exposition of the methodologies and procedures for each module.
3.1. Construction CLIP fine-tuning

CLIP uses the self-attention Transformer and the Vision Transformer as its text and vision encoders. Transformers have improved on traditional recurrent layers using an encoder–decoder architecture and achieved exceptional performance among language models. In particular, the Vision Transformer (ViT) was introduced as an alternative to conventional CNN-based image encoders by employing Transformer blocks [35]. This involves segmenting an image into patches and acquiring sequential positional embeddings to record the position of each patch meticulously. Subsequently, these patches are channeled into a Transformer block, mirroring the treatment of tokenized words within NLP applications. ViT showcases notable performance across numerous image classification datasets with minimal adjustments, demonstrating its robust ability to extract image features for supplementary objectives. Through this architecture integrating a Transformer and ViT, CLIP attains remarkable performance in vision-language modeling, spanning a spectrum of tasks in the context of extensive dataset refinement [12]. Fig. 2 depicts the proposed architecture for the fine-tuning process of Construction CLIP. First, the attribute list, including caption type, violation type, and images, is fed into the text and image encoders. The caption type refers to the safety status presented in the images, while the violation type provides the violation classification information. We obtain T′ and I′ in the form of feature embeddings. Next, the model calculates the cosine similarity between the image embeddings I′ and the different type embeddings T′ and tries to find the type with the highest similarity corresponding to the image. The attribute of the pairing result is the output of Construction CLIP. In addition, the label of the violation type is also fine-tuned after the fine-tuning process of the caption type. During the fine-tuning process, the images within the caption datasets are categorized into subsets based on the labels of caption type and violation type associated with each image.
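To make the similarity-based pairing step concrete, the sketch below matches an image against candidate caption-type and violation-type descriptions by cosine similarity. It uses the public "openai/clip-vit-base-patch32" checkpoint as a stand-in for the fine-tuned Construction CLIP, and the prompt wording and file name are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal sketch of CLIP-style type pairing by cosine similarity.
# Assumption: the public CLIP checkpoint stands in for Construction CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

caption_types = ["a compliant construction scene", "a construction safety violation"]
violation_types = [
    "falling hazard", "missing personal protective equipment", "electric hazard",
    "workspace hazard", "material hazard", "explosion hazard",
    "protruding object hazard", "mechanical hazard", "transport hazard",
]

def classify(image_path: str, labels: list[str]) -> str:
    """Encode the image and candidate labels, then return the label whose text
    embedding has the highest (scaled) cosine similarity with the image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=-1)  # similarities -> probabilities
    return labels[int(probs.argmax())]

site_photo = "site_photo.jpg"  # hypothetical input image
print(classify(site_photo, caption_types))
print(classify(site_photo, violation_types))
```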
3.2. CLIP prefix captioning

The additional prefix-tuning attributes allow us to enrich the language model with further information and contextual parameters for text generation. Building on the concept of conditional generation, we adopt the CLIP prefix framework pioneered by Ron et al. [13] and Wang et al. [14]. The language model produces sentences corresponding to the input words, which serve as conditions for text generation [26,36,37]. This objective is represented by Formula (1), where the prediction of the subsequent word is derived from the existing contents y_{1:i−1} and the other input x. The generated token is integrated into the original input sequence. This iteration mechanism, characteristic of RNNs, is known as autoregression [38].

P(y_i | y_{1:i−1}, x)    (1)

To incorporate additional attributes with embedded captions, the language model must generate captions that correspond more closely to both the images and the original captions. Ultimately, the attributes predicted by CLIP are encoded using the GPT-2 tokenizer and fused with the image and caption embeddings. These embedded attributes and images are treated as the prefix of the concatenated embeddings, subsequently serving as inputs to the language model. The right part of Fig. 3 describes the architecture of CLIP prefix captioning. Based on this architecture, the additional attribute is automatically generated through the fine-tuned Construction CLIP model. The concatenated vector includes the caption and attribute embeddings produced by the GPT-2 tokenizer using Byte-Pair Encoding (BPE), and CLIP-embedded images transformed by Multi-Layer Perceptrons (MLPs).

The decoding methods are pivotal in determining how the final text outputs are generated from the model. Through auto-regressive language generation, GPT-2 generates sequences of words based on predictions made during each iteration. The generation incorporates the original sequence and extends it by appending new predictions.
Fig. 2. Architecture of CLIP fine-tuning with caption type and violation type.
As Eq. (2) explains, autoregressive language generation operates through the multiplication of probabilities. The probabilities are associated with each decomposed conditional distribution within the overarching probability distribution. The initial context, represented as x, stems from the input embeddings. The length N of the output sequence is dynamically ascertained by the point at which the time step i yields the End Of Sentence (EOS) token in real time. Several prominent decoding strategies have emerged, including Greedy search, Beam search, Top-K sampling, Top-p sampling, and Temperature. In this context, we employed beam search with N set to 3 and a temperature of 0.5 as decoding strategies. Finally, the model outputs a caption describing the safety-related information obtained from the image and the caption and violation types.

P(y_{1:N} | x) = ∏_{i=1}^{N} P(y_i | y_{1:i−1}, x), with y_{1:0} = ∅    (2)
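As an illustration of this decoding setup (beam width 3, temperature 0.5), the snippet below runs Hugging Face GPT-2 generation with beam-sample decoding. The text prompt stands in for the concatenated prefix embeddings used in the actual framework, and the length limit is a placeholder.

```python
# Sketch of the decoding configuration described above (beam search, N = 3,
# temperature = 0.5). A plain text prompt stands in for the CLIP prefix.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "violation, falling:"  # illustrative stand-in for the encoded prefix
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        num_beams=3,            # beam width N = 3
        do_sample=True,         # temperature only takes effect when sampling
        temperature=0.5,
        max_new_tokens=30,
        eos_token_id=tokenizer.eos_token_id,  # stop at the EOS token
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```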
Table 1
Image quantities with different caption types.
              Status   Violation   Total
Caption type  279      985         1264

Table 2
Image quantities with different violation types.
                Falling    PPE         Electric    Workspace   Material
Violation type  653        192         76          123         66
                Explosion  Protruding  Mechanical  Transport   Total
Violation type  49         49          35          21          1264
3.3. CLIP attention
Table 3
Caption data with additional attribute.
Sample images   Ground truth caption data
∙ caption_type: violation
∙ violation_type: falling
∙ violation_list: Openings lack a safety net or are non-compliant.
∙ caption: Opening without guardrails
∙ file_name: data1.jpg
∙ objects: openings, safety net
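For reference, one annotated image of this kind can be stored as a single JSON record; the snippet below is a minimal illustration that simply restates the Table 3 example using the field names shown in the table.

```python
# One annotated record following the schema in Table 3 (values from the example row).
import json

record = {
    "file_name": "data1.jpg",
    "caption_type": "violation",
    "violation_type": "falling",
    "violation_list": "Openings lack a safety net or are non-compliant.",
    "caption": "Opening without guardrails",
    "objects": ["openings", "safety net"],
}

print(json.dumps(record, indent=2, ensure_ascii=False))
```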
We utilized the AdamW optimizer [45]. This optimization algorithm, an extension of the Adam optimizer, incorporates a distinct weight decay regularization term [46] to prevent overfitting. We implemented a linear learning rate scheduler that gradually increases the learning rate during an initial warm-up period.
After fine-tuning the Construction CLIP model, the CLIP prefix captioning model is developed. The dataset also needs to be pre-processed. For textual data, including captions and attributes, tokenization is executed using a GPT-2-based tokenizer implemented with Byte-Pair Encoding (BPE). The captions are taken from the original dataset, while the embedded images and attributes, including caption type and violation type, are obtained from the fine-tuning process. Both caption and attribute tokens are padded to uniform length and masked to establish which tokens require attention. An end-of-sentence marker is appended to each caption to signify sentence termination. Meanwhile, visual data is encoded by the image encoder from the fine-tuned Construction CLIP model. For text generation, the GPT-2 language model is selected. The input caption and attribute tokens are concatenated and subsequently embedded using word embeddings sourced from GPT-2. The encoded input image is transformed into the GPT-2 space through diverse mapping strategies, namely MLP and Transformer. Following this, both textual and visual embeddings are concatenated to serve as the input for the language model to generate captions. The loss function applied is the cross-entropy loss, comparing output logits against labeled captions. The optimization procedure and scheduler mirror those employed in the Construction CLIP fine-tuning, including the AdamW optimizer and the linear scheduler featuring warm-up. Consisting of the image and attribute embeddings, the concatenated embeddings are presented as a prefix. The prefix is similar to a controller steering the output of the language model. In the implemented approach, we extended the lengths of the image and attribute prefix embeddings to 20 and 10, respectively, from their original implementation lengths. This augmentation of prefix length enables the storage of greater informational depth, consequently enhancing the language model's precision and fluency in generating text.
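The sketch below shows one way the pieces described above could fit together: a CLIP image embedding is mapped into GPT-2's embedding space by an MLP to form a 20-position visual prefix, BPE-tokenized attributes contribute a 10-position prefix, and the caption tokens follow, with cross-entropy computed only on the caption positions. The MLP size, padding scheme, loss masking, and all variable names are assumptions in the spirit of the described approach, not the authors' exact implementation.

```python
# Sketch of prefix assembly for CLIP prefix captioning (illustrative shapes/names).
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
embed_dim = gpt2.config.n_embd        # 768 for the base GPT-2
clip_dim = 512                        # CLIP ViT-B/32 image embedding size
image_prefix_len, attr_prefix_len = 20, 10

# MLP that maps one CLIP image embedding to a sequence of GPT-2-sized vectors.
image_mapper = nn.Sequential(
    nn.Linear(clip_dim, embed_dim * image_prefix_len // 2),
    nn.Tanh(),
    nn.Linear(embed_dim * image_prefix_len // 2, embed_dim * image_prefix_len),
)

def build_inputs(clip_image_emb, attribute_text, caption_text):
    """Concatenate [image prefix | attribute tokens | caption tokens] and build
    labels that score only the caption positions (prefix positions get -100)."""
    wte = gpt2.transformer.wte  # GPT-2 word embedding table

    # Visual prefix: (1, 20, embed_dim)
    img_prefix = image_mapper(clip_image_emb).view(1, image_prefix_len, embed_dim)

    # Attribute prefix: BPE-tokenize, pad/truncate to 10 tokens, then embed.
    attr_ids = tokenizer(attribute_text, return_tensors="pt").input_ids[:, :attr_prefix_len]
    pad = attr_prefix_len - attr_ids.shape[1]
    if pad > 0:
        attr_ids = torch.cat([attr_ids, torch.full((1, pad), tokenizer.eos_token_id)], dim=1)
    attr_emb = wte(attr_ids)

    # Caption tokens, terminated with the end-of-sentence marker.
    cap_ids = tokenizer(caption_text + tokenizer.eos_token, return_tensors="pt").input_ids
    cap_emb = wte(cap_ids)

    inputs_embeds = torch.cat([img_prefix, attr_emb, cap_emb], dim=1)
    ignore = torch.full((1, image_prefix_len + attr_prefix_len), -100)
    labels = torch.cat([ignore, cap_ids], dim=1)  # loss only on caption tokens
    return inputs_embeds, labels

clip_image_emb = torch.randn(1, clip_dim)  # placeholder for a real CLIP embedding
inputs_embeds, labels = build_inputs(clip_image_emb, "violation, falling", "Opening without guardrails")
loss = gpt2(inputs_embeds=inputs_embeds, labels=labels).loss  # cross-entropy loss
loss.backward()
```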
Before integrating with the mobile application, the attention layer is embedded in a CLIP model to calculate the relevancy between image and text. In the initial step, the Construction CLIP model, which was fine-tuned with caption type and violation type, did not understand the context between the textual and visual information from the datasets. To gain knowledge from the captions, we fine-tuned another CLIP model for the attention model. This CLIP attention model is fine-tuned with images and captions. In each batch of the training process, the loss function is calculated from the correctness of the matching between the images and the captions. The model's parameters are updated during the training process according to the correctness of each image and caption in each batch. The model was trained for 1000 epochs, and a linear scheduler was also applied to adjust the learning rate. Finally, after fine-tuning this CLIP attention model with captions, the objects and the relationships between the objects in the images are interpreted. The model can generate an attention map for the captions using the relevance layer in the CLIP attention model to explain the connection between the generated caption and the image.
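The batch-wise matching objective described above corresponds to CLIP's standard in-batch contrastive loss (symmetric cross-entropy over the image-caption similarity matrix). A minimal sketch of one fine-tuning step under that assumption is shown below; the checkpoint, batch contents, and learning rate are illustrative placeholders.

```python
# One contrastive fine-tuning step: the loss rewards correct image-caption
# matches within the batch (symmetric cross-entropy over similarity logits).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

# Illustrative mini-batch of image paths and their ground-truth captions.
image_paths = ["data1.jpg", "data2.jpg"]
captions = ["Opening without guardrails", "Worker without a safety helmet"]

images = [Image.open(p).convert("RGB") for p in image_paths]
inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)

outputs = model(**inputs, return_loss=True)   # CLIP's in-batch contrastive loss
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```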
4.3. Evaluation metrics

For CLIP fine-tuning evaluation purposes, we use the accuracy metric for multi-class classification assessment. The fine-tuning process of Construction CLIP operates in two distinct settings based on the number of elements in the violation type (9 elements). The configurations involve combinations of 2 elements and 9 elements: the former configuration helps the model consider contrasting features between pairs of data types more accurately, while the latter configuration mirrors the real-world scenario during inference, where all violation types are present. Furthermore, two additional configurations are employed to interlink fine-tuned weights and capitalize on the advantages arising from diverse element numbers (N) for the combinations. Sequential fine-tuning with N = 9 following N = 2, and vice versa, for violation types equips the model with the capability to consider features contrastively. The fine-tuning progress is overseen through loss and accuracy metrics computed for the training and testing sets after each epoch. Across both fine-tuning phases with different N values, the model with the highest accuracy on the testing set over the epochs is designated as the final model. After fine-tuning under the two violation-type settings, a conclusive model with N = 9 following N = 2 is chosen specifically. This selection ensures the model's proficiency in distinguishing all 9 violation types.
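For reference, per-class and overall accuracy of the kind reported later in Tables 4 and 5 can be computed as follows; the label and prediction lists below are dummy values for illustration only.

```python
# Per-class and overall accuracy, as reported for caption and violation types.
from collections import defaultdict

def accuracy_report(y_true, y_pred):
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    per_class = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / len(y_true)
    return per_class, overall

# Dummy labels for illustration only.
y_true = ["falling", "falling", "PPE", "electric"]
y_pred = ["falling", "PPE", "PPE", "electric"]
print(accuracy_report(y_true, y_pred))
```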
To evaluate the CLIP prefix captioning model trained on various datasets, standard image captioning metrics are often used. However, many of these metrics rely on English-specific resources like lexical databases or graph-based semantic representations. Given this language-specific limitation, we selected the BLEU (Bilingual Evaluation Understudy) [47] score as our evaluation metric. While widely adopted in natural language processing for assessing machine-generated text against human-written references, BLEU still has shortcomings and may not perfectly correlate with human judgment. BLEU primarily measures n-gram precision, focusing on the overlap between the generated and reference texts. This approach does not capture the information's semantic meaning, context, or correctness, which are crucial for accurate and reliable safety inspection reports. Furthermore, BLEU does not consider synonyms, paraphrasing, or the overall meaning of the text. As a result, a generated caption might convey the same meaning as the reference but use different words and phrases, leading to a lower BLEU score. Other evaluation metrics for image captioning, such as METEOR and ROUGE, also correlate poorly with human judgments. While the SPICE metric exhibits a stronger correlation with human judgments compared to other metrics, it falls short in capturing the syntactic structure of generated sentences [48]. Given these considerations, we employ BLEU as a preliminary evaluation metric for our initial model development, acknowledging its limitations and planning to explore more robust evaluation methods in future research.
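BLEU-1 through BLEU-4 scores of the kind reported later can be computed with NLTK as sketched below; the candidate and reference sentences are placeholders, and smoothing is applied only to avoid degenerate zero scores on short captions.

```python
# BLEU-n computation for a generated caption against a reference caption.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["opening", "without", "guardrails", "near", "the", "stairwell"]]
candidate = ["opening", "without", "a", "safety", "net"]

smooth = SmoothingFunction().method1  # avoids zero scores for short sentences
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform weights over 1..n-grams
    score = sentence_bleu(reference, candidate, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```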
Finally, we conducted a user experience test with ten safety engineers from various construction companies to evaluate the effectiveness and usability of the mobile application. Participants received training on the application's functionalities before using it during their daily safety inspections. Following the testing period, users were asked to assess the application's performance based on two criteria: (1) Effectiveness: Is the generated information significantly helpful for determining safety violations and reducing documentation time? (2) Usability: Was the application suitable for daily safety inspection routines? User feedback was thoroughly analyzed to inform subsequent development iterations.

5. Results and discussion

5.1. Model predictions and evaluation results

This section details the model performance and evaluation outcomes. The accuracy of the CLIP fine-tuning model is summarized in Tables 4 and 5. The model achieved an average accuracy of 82.9% for caption types and 73.7% for violation types. The falling from height category, which had more training images, reached 95% accuracy. The loss and accuracy experienced substantial changes within limited training epochs. The experiment results indicate the model's strong capacity to fit the training data. The Construction CLIP model effectively classifies caption and violation types, extracting valuable information from images that can inform further analysis.
Table 4
Caption types classification accuracy.
          Status   Violation   Total/Average
Accuracy  0.44     0.94        0.829

Table 5
Violation types classification accuracy.
          Falling    PPE         Electric    Workspace   Material
Accuracy  0.95       0.82        0.75        0.28        0.43
          Explosion  Protruding  Mechanical  Transport   Total/Average
Accuracy  0.10       0.10        0.43        0.20        0.737

Table 6
Image captioning model evaluation.
Mapping type   BLEU-1   BLEU-2   BLEU-3   BLEU-4
Baseline (a)   0.606    0.506    0.398    0.320
Transformer    0.249    0.242    0.333    0.249
MLP            0.454    0.447    0.525    0.454
(a) Baseline score was reported by Xiao et al. [18].

Table 6 displays the outcomes of the CLIP prefix captioning model, alongside those of a baseline model employing a CNN and LSTM architecture trained on the ACID caption dataset [18]. This dataset follows a linguistic schema, implying elevated scores for image captioning tasks. Although the transformer-based approach was anticipated to yield superior performance due to its attention mechanism, the MLP method attained higher BLEU and ROUGE scores. This outcome suggests that overly complex models may overfit due to insufficient data volume. The BLEU-1 and BLEU-2 scores show mild performance, around 0.45, compared to the baseline score. However, the BLEU-3 and BLEU-4 scores reach 0.525 and 0.454, indicating that our CLIP prefix captioning model can generate more accurate sentences with longer phrases. While automatic metrics expedite objective evaluation, a substantial gap persists between these metrics and human judgment. BLEU does not account for synonyms, paraphrasing, or the overall meaning of the text. A generated caption might convey the same meaning as the reference but use different words and phrases, leading to a low BLEU score. Consequently, we will incorporate human evaluation to address the complexity inherent in our target images.

The CLIP attention model can indicate, by color, the areas related to the images and text. This color demonstration can enhance explainability, helping the safety inspector understand the reasoning behind the generated caption. Table 7 shows the results of the predicted captions and the corresponding attention according to the visual and textual data. The color red shows higher relevance between the captions and a certain area, which can indicate where the captions are focusing on the image. On the other hand, the blue color shows that there might not be any relevance between this area and the generated captions.

5.2. Mobile application user experience testing

Testing results indicate that the automatically generated captions effectively assisted safety engineers in identifying violations without referring to external regulations. By eliminating the need for manual data entry, the system significantly reduced documentation time. The user interface was found to be intuitive, allowing for easy interaction with images and review of safety report information before submission. Participants confirmed the application's suitability for daily safety inspections, aiding in both documentation and violation detection.

A challenge identified was the integration of the system with existing company software platforms. As many construction companies utilize exclusive systems for document creation, submission, and storage, integrating the automated image captioning model into these platforms would be beneficial.

5.3. Discussion

5.3.1. Framework objectives and outcomes

Image captioning offers promising value for automated safety inspections in determining violations through images. Recent advancements in computer vision and natural language processing have enabled models to accurately generate textual descriptions from construction images. Fine-tuning modules further enable image classification, including compliance or non-compliance, based on the training process, which acknowledges available information from safety reports and regulations. Our research demonstrates the effectiveness of this approach by achieving robust performance in recognizing, classifying, and describing violations in construction images. On the other hand, the mobile application brings a seamless process for safety inspectors to create daily safety reports, simplifying the overall documentation process and facilitating the digital transformation of construction sites.

5.3.2. Model development challenges

Despite our model's promise, several challenges arose during the experiment phase that should be critically considered for further improvement. A primary issue was the overfitting exhibited by the fine-tuned Construction CLIP model. This was demonstrated by a significant discrepancy between training and testing accuracy, indicating the model's inability to generalize to unseen data. This difference is attributed to the model's tendency to grasp complex details and noise inherent in the training data [49]. The model struggles to generalize to unseen data excluded from the training set. This divergence between training and testing accuracies signifies the presence of overfitting during the fine-tuning phase, characterized by low bias and high variance.

Several factors are known to shape overfitting, such as insufficient training data, excessive model complexity, and noise in the training data. The relatively small and imbalanced dataset primarily amplified the model's tendency to memorize training examples rather than learn underlying patterns. Additionally, the complex architecture of the model exacerbated the issue by allowing it to capture noise and irrelevant details from the training data. The model may learn these incorrect patterns if the training data contains noise or irrelevant information. Besides, the model's predictions vary considerably with minor changes in the input data.

In contrast, the prefix model demonstrated a more stable training trajectory, suggesting better generalization. The training loss consistently decreases, indicating that the prefix captioning model can generate captions resembling the ground truth. While testing loss was not evaluated in this context, the model's overall performance met expectations.

Table 8 showcases examples of accurate attribute and caption generation. The model effectively identifies violation types and caption types and provides descriptive details. However, when applied to diverse construction site scenarios, the model's performance falls short of practical requirements. Table 9 depicts cases of failure with incorrect violation types and captions. For example, an unprotected opening was misclassified as a fall protection net issue, and a falling hazard without a net was incorrectly labeled as protruding rebar. In application, safety inspectors need to review the captions before submitting the safety reports. Additionally, specific captions replicate ground truth captions using the exact words originally used, across disparate generation conditions. This situation is indicative of model collapse, which occurs after prefix tuning and causes the model to generate a restricted set of patterns.
Table 7
Attention map with generated captions.
Sample image Attention map Ground truth Prediction
Table 8
Attributes and captions prediction.
Sample image Ground truth Prediction
Table 9
Errors of attributes and captions prediction.
Sample image   Ground truth   Prediction

5.3.3. Practical implication

This framework revolutionizes construction safety inspection by eliminating paper-based processes. All inspection data is captured digitally and stored centrally, streamlining information transfer and analysis. By generating descriptive captions for images, inspectors can efficiently review and document safety violations, accelerating the identification and resolution of hazards. This digital transformation empowers inspectors and managers to proactively address complex safety challenges, improving overall site safety performance. In the same way, our approach represents a significant step towards a fully automated safety management system. Automating violation identification through image captioning enables faster, more accurate safety assessments and decision-making. Implementing this framework will transform construction sites into more digital and efficient environments, ultimately enhancing worker safety and project outcomes. Beyond safety inspections, image captioning techniques have potential applications in various construction management aspects. For instance, detailed image descriptions can support a well-organized quality control process or activity recognition. This will assist in confirming the reliability between manual observation reports and image monitoring records.

5.3.4. Limitations and future work

Our research identified several key limitations that warrant further investigation. The overfitting of the Construction CLIP model is a primary concern. Various strategies can be employed to mitigate overfitting based on the causes of this problem. Firstly, it is essential to expand the scale of the dataset. One of the popular methods is augmenting the training dataset. Given the relatively limited size of the image captioning dataset (1264 pairs) compared to the large size of the pre-trained CLIP dataset, increasing the training dataset could solve the problem of overfitting (a generic augmentation example is sketched at the end of this subsection). Secondly, reducing the number of layers within the transformer block is another solution to address model complexity [49]. Thirdly, regarding joint fine-tuning, we observed that training different element combinations led to a sharp increase in training accuracy and a corresponding decline in loss. Nevertheless, the impact on testing accuracy was uncertain. This underscores the limitations of these strategies in effectively learning from a relatively small dataset with different labels. The dataset's imbalance across violation classes also impacted model performance. We recommend creating a more balanced dataset by collecting additional images for underrepresented categories. Despite these challenges, the model demonstrated promising capabilities in generating informative captions. Exploring various fine-tuning and decoding strategies can assist in boosting its accuracy and generalizability, including implementing different language models. Furthermore, to increase the usability of the mobile application, the model should be well-packaged for easy transfer and integration with existing management platforms. Future research should focus on overcoming these limitations to develop a more robust and reliable image captioning model for construction safety inspections.
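As one concrete, generic way to enlarge the image set mentioned above, standard photometric and geometric augmentations can be applied to site photos before fine-tuning. The torchvision pipeline below is an illustrative sketch, not the configuration used in this study, and the file name is a placeholder.

```python
# Example augmentation pipeline for expanding the image-caption training set.
# This is a generic torchvision sketch, not the configuration used in the paper.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
])

image = Image.open("site_photo.jpg").convert("RGB")    # hypothetical input
augmented_views = [augment(image) for _ in range(4)]   # 4 extra training views
```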
6. Conclusion

This paper introduces a framework for automatically identifying safety violations in construction images through image captioning. The primary goal was to develop a method for processing construction site images to accurately detect and classify violations. We employed a transformer-based vision-language model to categorize captions indicating violation status and types. Our model generates descriptive captions with multiple embeddings to extract information from images using image-to-text techniques. The framework achieved 82.9% accuracy for caption types and 73.7% for violation types, with an impressive 95% accuracy for the "falling from height" category. Additionally, BLEU-3 and BLEU-4 scores of 0.525 and 0.454, respectively, demonstrate the model's ability to generate coherent and informative captions.

This research highlights the potential of natural language processing to streamline safety inspections in construction. By leveraging safety inspection records, our framework can identify violations often missed or time-consuming to detect manually, improving safety documentation and prevention. This digital approach also promotes paperless safety inspections, which is the first step in the digital transformation of the construction area.

While our study provides a strong foundation, limitations include dataset size, imbalance, and caption data source constraints. As NLP in construction advances, continuous model refinement is essential to address complex construction environments and safety challenges. Future work will focus on increasing the dataset, enhancing the language model to prevent overfitting, and exploring the combined use of various language models for more precise safety inspection.

CRediT authorship contribution statement

Wei-Lun Tsai: Writing – review & editing, Writing – original draft, Visualization, Validation, Methodology, Formal analysis, Data curation, Conceptualization. Phuong-Linh Le: Writing – review & editing. Wang-Fat Ho: Writing – review & editing, Visualization, Validation, Software, Methodology. Nai-Wen Chi: Validation, Writing – review & editing. Jacob J. Lin: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Software, Resources, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization. Shuai Tang: Writing – review & editing, Methodology, Conceptualization. Shang-Hsien Hsieh: Writing – review & editing, Project administration, Funding acquisition.

Declaration of competing interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Jacob J. Lin reports financial support was provided by National Science and Technology Council. If there are other authors, they declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The project is supported in part by MOST, Taiwan 110-2622-E-002-039, 110-2222-E-002-002-MY3. The support and help of contractors, subcontractors, and service companies involved in collecting data and implementing the system are greatly appreciated, including CECI Engineering Consultants, Inc., Chien Kuo Construction Co. Ltd., Feng Yu Construction Co. Ltd., and Ruiju Construction Co. Ltd.

Data availability

Data will be made available on request.