Computer Aided Civil Eng - 2022 - Yong - Prompt Engineering For Zero Shot and Few Shot Defect Detection and Classification
INDUSTRIAL APPLICATION
as an alternative to the traditional data-augmentation approaches (Y. Li et al., 2021) for their superior performance (J. J. Zhao et al., 2017). Y. Gao et al. (2021) augmented the crack and spalling image data using GAN to overcome the low- and imbalanced-class data environment. Y. Li et al. (2021) expanded their defect data using GAN for pavement distress detection. Maeda et al. (2021) generated road images, including potholes, by deploying GAN, and the results were used to train their detection model. However, GAN-based data augmentation possesses limitations. The images generated through GAN inevitably inherit biases from their original dataset (Hu & Li, 2019), and thus may produce nonsensical images that contradict the laws of physics (Y. Gao et al., 2019).

In contrast, transfer learning leverages the initial architecture and parameters of an existing model, pretrained with a significantly large dataset, with a new dataset from a target domain, enabling the pretrained model to learn new features of the target domain easily (Pan & Yang, 2010). Owing to this, the effort to collect and label a large dataset for training has been greatly reduced. Nevertheless, this method still requires hundreds to thousands of images. For vision-based defect detection tasks, Y. Gao and Mosalam (2018) used 1600 images for retraining the last several blocks of a pretrained visual geometry group-16 (VGG-16) model. Bang et al. (2019) selected 427 images from 2.1 TB of video data, consisting of 289 h of playback time, to retrain residual networks (ResNet). J. Zhu et al. (2020) deployed 1180 images for retraining Inception-v3.

To eliminate the requirement for the customization of a model using an additional dataset, few-shot learning and zero-shot learning using colossal pretrained language models have been introduced. Few-shot learning is a deep learning technique aiming to deploy transferable knowledge to identify the features of classes, where an extremely small number of samples are given per class (Cui et al., 2022). Similarly, zero-shot learning is a deep learning technique aiming to identify unseen objects without any given samples (Lampert et al., 2009). Zero-shot learning mimics the human ability to categorize unseen classes based on previous experiences and knowledge without further training (Chang et al., 2008; Lampert et al., 2009). One of the key strategies of zero-shot learning is leveraging semantic information, such as word embeddings, from seen/unseen classes (Pourpanah et al., 2022). In the natural language processing (NLP) field, enlarging pretraining data superabundantly enables a model to consider semantic information, resulting in the recent success of zero-shot and few-shot downstream tasks (X. Liu et al., 2021).

Similarly, vision-language pretrained (VLP) models have been developed to perform a variety of tasks without any customization. Qi et al. (2020) introduced ImageBERT, which was pretrained for four tasks: masked language modeling (MLM), masked object classification, masked region feature regression, and image–text matching (ITM), with approximately 5 million images and corresponding descriptions. Chen et al. (2020) used 5.6 million image–text pairs to pretrain their model, called universal image–text representation learning (UNITER), for four tasks, including MLM, ITM, masked region modeling, and word-region alignment. Among several VLP models, contrastive language-image pretraining (CLIP) has received attention owing to its task-agnostic design and robust zero-shot performance (Agarwal et al., 2021). CLIP, trained with 400 million (image, text) pairs, outperformed existing supervised learning models (e.g., ResNet50) in diverse tasks (Radford et al., 2021). However, CLIP still shows considerably low performance on specialized or complex tasks, including the German traffic sign recognition benchmark (GTSRB) task, counting objects in synthetic scenes (the CLEVRCounts task), and lymph node tumor detection (the PatchCamelyon task) (Radford et al., 2021).

To meet the demands of harnessing VLP models, including CLIP, Dall-E (Ramesh et al., 2021), and Midjourney (Midjourney, 2022), even a market (Promptbase, 2022) for buying and selling prompts (inputs) has been formed (Nine, 2022). However, not all prompts yield desirable performance. Several studies have identified that the performance of VLP models, including CLIP, fluctuates depending on how prompts are constructed (Brown et al., 2020; T. Gao et al., 2020; X. Liu et al., 2021; T. Z. Zhao et al., 2021; Zhou et al., 2021).

The CLIP developer team identified that prompts constructed in the form of "a {category} photo of {label}" yielded results with higher accuracy than those in the form of "{label}" (Radford et al., 2021). Thus, the optimization of prompts, which is referred to as prompt engineering, is important for eliciting the best zero-shot performance from pretrained models. Nevertheless, studies on prompt engineering, particularly for VLP models, have rarely been conducted.

This study aims to identify the features of a prompt that can yield the best performance in classifying and detecting building defects from a video clip or a pile of images using zero-shot learning based on CLIP. Specific research questions are discussed in Section 2.4. The scope of this study is limited to the defects in residential buildings, particularly the five most frequently occurring defects described in previous literature (Y. Gao & Mosalam, 2020; Guo et al., 2021; Paton-Cole & Aibinu, 2021; Wali & Ali, 2019; Zalejska & Hungria, 2019): crack, mildew, nail popping, peeling, and wrinkling.

The remainder of this paper is organized as follows: Section 2 introduces related work and describes the research gaps and questions. Section 3 describes the proposed method and hypotheses. Section 4 presents the datasets for this study, the overall experimental processes, and the performance indices. Section 5 presents the results of the study. Finally, Section 6 concludes the paper with contributions and limitations.
2 RELATED WORKS

2.1 Defect detection with insufficient dataset

In this section, this study reviews previous deep learning-based defect detection studies, particularly those with an insufficient dataset. This study uses the term "insufficiency" to denote deficiency in the quantity, as well as the balance, between datasets. This section enumerates the methods and the number of images deployed in each study and discusses the need for zero-shot capability in detail.

Since the advent of advanced computer vision technology, deep learning-based defect detection has been studied extensively. As a rule of thumb, deep learning requires 1000 images per class (Goodfellow et al., 2016). However, not all studies satisfied this rule of thumb, as it is often difficult to collect high-quality and reliably labeled (gold-labeled) data in the construction industry (Maeda et al., 2021). Previous studies bypassed the process of mining such big data through transfer learning. Transfer learning is the most commonly deployed method for deep learning-based technology, as it decreases the requirement for building a bespoke architecture. Y. Gao and Mosalam (2018) used 1600 images for training VGG-16 and performed four classification tasks: component type, spalling condition, damage level, and damage type. Liang (2019) trained three pretrained models, including AlexNet, GoogLeNet, and VGG-16, with 1154 images and deployed them to identify system-level major failures, bridge column defects, and pinpoint damages. S. Jiang and Zhang (2020) proposed a real-time crack assessment method using the combination of single-shot detector lite and MobileNetV2 (SSDLite-MobileNetV2). Out of 1330 images, 1030 images were employed for training.

Data augmentation through conventional image preprocessing or GAN has been deployed as another strategy to work with small quantities of data. J. Zhu et al. (2020) expanded 243 raw images to a total of 1458 images through brightness, saturation, and flip adjustments. Subsequently, they classified five defects (crack, intact, pockmark, exposed rebar, and spalling) through a convolutional neural network (CNN). Y. Li et al. (2021) comparatively studied their detection model, "you only look once v4," trained with 2500, 5000, 7500, and 10,000 augmented images through GAN. They applied these models for pavement distress detection and discovered that the greater the image augmentation for training, the greater the accuracy of the model. Maeda et al. (2021) measured the performance of their damage detection model, based on SSD MobileNet, when using 1200 original images and 1800 and 2400 augmented images through GAN. They showed a performance improvement for increasing training data. At least hundreds of images per class were thus required for defect detection studies in the recent 3 years.

In addition to the challenges of acquiring gold-labeled data, the imbalanced number of images per class is a crucial issue with respect to the construction industry (Y. Gao et al., 2021). The imbalanced data environment is natural during the monitoring of a structure, not only because defects are relatively rare compared to undamaged components (Y. Gao et al., 2021), but also because the occurrence frequency of defects varies depending on the type and location (D. Li et al., 2019). Several approaches have been used to detect or classify defects using insufficient datasets. One approach is oversampling. Oversampling refers to the expansion of a small portion of the class by tuning the original feature or generating a similar feature. Meijer et al. (2019) applied oversampling with a class-weighted loss function to enlarge their training data from 17,663 images to five times as much. Subsequently, they conducted the defect classification task using a CNN. Another approach is hierarchical task performance. D. Li et al. (2019) first developed a CNN model to identify defects and then classified these defects in detail at the end. Meta-learning is another method to overcome imbalanced data environments. Meta-learning is generally referred to as learning to learn, which connotes an outer (or meta) algorithm that refurbishes the inner learning algorithm to yield a desired outcome for an outer task (Hospedales et al., 2020). This enables a deep learning model to be trained with small datasets (Nichol et al., 2018). Guo et al. (2020) classified façade defects into blistering, cracking, peeling, delamination, spalling, and biological growth through a meta-learning-based CNN trained with 21,259 images comprising 63% non-defect data. In addition to meta-learning, semi-supervised learning is also an emerging method for imbalanced datasets. Compared to meta-learning, semi-supervised learning aims to update the learning process by adjusting the interaction between labeled and unlabeled data (Engelen & Hoos, 2020; X. Zhu & Goldberg, 2009). This has alleviated the labeling requirement for end-users, particularly with respect to the construction industry, as there are not many labeled datasets. Guo et al. (2021) leveraged semi-supervised learning based on a CNN to classify façade defects into the same classes as their previous study (Guo et al., 2020), however, with a smaller size; a total of 5621 images were employed for training. Y. Gao et al. (2021) harnessed oversampling through GAN and semi-supervised learning to classify defects in low and imbalanced datasets. A total of 10,500 images was used to train the CNN model. However, only 900 images were used for defects.

Thus far, zero-shot and few-shot concepts have barely been used for defect classification/detection.
Recently, a study leveraged 1-shot, 2-shot, 5-shot, and 10-shot methods to classify façade defects (Cui et al., 2022). During the training stage, five defect classes (that is, blistering, crack, delamination, peeling, and no defects) with approximately thousands of images per class were used. However, only two novel classes (efflorescence and spalling with exposed rebars) were tested (Cui et al., 2022).

Notwithstanding, the existing approaches for insufficient defect datasets, such as data augmentation, meta-learning, and semi-supervised learning, still require hundreds of images per class. In addition, these bespoke models cannot identify unlearned/untrained defects.

2.2 CLIP

Scaling up a dataset for pretraining has advanced in deep learning communities while suggesting a new horizon for zero-shot and few-shot transfer. The recent success of pretrained language models, such as bidirectional encoder representations from transformers (BERT; Devlin et al., 2018) and the generative pretrained transformer (GPT) series (Brown et al., 2020; Radford et al., 2018, 2019), inspired researchers to extend these models to perform vision-related tasks (Chen et al., 2020; Kim et al., 2021; L. H. Li et al., 2019; X. Li et al., 2020; Qi et al., 2020; Radford et al., 2021). By training visual and textual features simultaneously and bridging the semantic gap between them, these models can identify a wide range of information in an image (Chen et al., 2020). Beyond the detection of an object, they can recognize attributes, spatial relationships, actions, and intentions (L. H. Li et al., 2019).

CLIP has been widely used in previous studies (W. Wang et al., 2021), as it allows end-users to harness a VLP model to accomplish diverse tasks without any customization (Radford et al., 2021). Inspired by the results that pretraining with web-scale collections of text surpasses that with gold-labeled NLP datasets, CLIP was trained with 400 million (image, text) pairs under natural language supervision—learning visual features from corresponding natural language without human intervention (Radford et al., 2021).

Such a scaling-up approach enabled CLIP to be task-agnostic, operating under various scenarios. The zero-shot capability of CLIP outperformed typical supervised learning models (e.g., ResNet50; K. He et al., 2015) and VLP models (e.g., Visual N-grams, A. Li et al., 2017), PixelBERT (Huang et al., 2020), learning cross-modality encoder representations from transformers (LXMERT) (Tan & Bansal, 2019), UNITER (Chen et al., 2020), and object-semantics aligned pre-training (OSCAR) (X. Li et al., 2020), not only in image classification tasks (Radford et al., 2021), including ImageNet (Deng et al., 2009), SUN397 (Xiao et al., 2010), and Food101 (Kaur et al., 2017), but also in vision-language tasks (Shen et al., 2021), including visual question answering (Goyal et al., 2017).

To be concrete, CLIP includes two encoders as a VLP model. Either a ResNet50 (K. He et al., 2016; T. He et al., 2019; R. Zhang, 2019) or a vision transformer architecture (Dosovitskiy et al., 2021) was adopted as the image encoder, and a masked self-attention transformer (Vaswani et al., 2017) was used as the text encoder. During the training phase, both the image and text encoders extract visual and textual features, respectively, from a set of (image, text) pairs. Subsequently, these features are mapped into a multimodal embedding space. CLIP computes the cosine similarities of all possible (image embedding, text embedding) pairs. CLIP is trained to maximize the cosine similarities of true (image, text) pairs while minimizing those of false pairs (Radford et al., 2021). CLIP adopts the learning process of contrastive visual representation learning from text (ConVIRT) (Yuhao Zhang et al., 2020) and then simplifies it. The process of vectorizing inputs and the loss functions of CLIP are explained as follows.

Given an input image $x_I$, a transformed view $\tilde{x}_I$ is generated via random resizing and cropping. Subsequently, the augmented image is converted into a fixed-dimensional vector, and a linear projection is applied to obtain the final vector $\mathcal{J}$. Similarly, a text input $x_T$ is transformed into a vector $\mathcal{T}$. When training CLIP, a batch of $(x_I, x_T)$ pairs is vectorized as $(\mathcal{J}, \mathcal{T})$ pairs. Here, $N$ denotes the batch size and $(\mathcal{J}_i, \mathcal{T}_i)$ is the $i$th of the $N$ $(\mathcal{J}, \mathcal{T})$ pairs. Then, the image-to-text loss function is described as

$$ l_i^{(\mathcal{J}\to\mathcal{T})} = -\log \frac{e^{\langle \mathcal{J}_i,\, \mathcal{T}_i\rangle/\tau}}{\sum_{k=1}^{N} e^{\langle \mathcal{J}_i,\, \mathcal{T}_k\rangle/\tau}} \quad (1) $$

where $\langle \mathcal{J}, \mathcal{T}\rangle$ is the cosine similarity of the vector pair, and $\tau$ is a temperature parameter (Yuhao Zhang et al., 2020). This parameter adjusts penalties on negative pairs. Notably, as the temperature decreases, the distribution of embeddings becomes more uniform. Similarly, the text-to-image loss function is described as

$$ l_i^{(\mathcal{T}\to\mathcal{J})} = -\log \frac{e^{\langle \mathcal{T}_i,\, \mathcal{J}_i\rangle/\tau}}{\sum_{k=1}^{N} e^{\langle \mathcal{T}_i,\, \mathcal{J}_k\rangle/\tau}} \quad (2) $$

The final training loss is defined using the weighted combination of the two losses (Equations 1 and 2) as follows:

$$ \mathcal{L} = \frac{1}{N}\sum_{x=1}^{N}\left(\lambda\, l_x^{(\mathcal{J}\to\mathcal{T})} + (1-\lambda)\, l_x^{(\mathcal{T}\to\mathcal{J})}\right) \quad (3) $$

where $\lambda \in [0, 1]$ is a scalar weight (Yuhao Zhang et al., 2020).
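Equations (1)–(3) translate directly into a few lines of code. The following is a minimal PyTorch sketch of this symmetric contrastive objective, written for illustration only; the function and variable names are ours and are not taken from the CLIP or ConVIRT implementations.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_vecs, text_vecs, tau=0.07, lam=0.5):
    """Symmetric contrastive loss of Equations (1)-(3) for one batch.

    image_vecs, text_vecs: (N, d) projected vectors J and T.
    tau: temperature; lam: the weight lambda of Equation (3).
    """
    image_vecs = F.normalize(image_vecs, dim=-1)   # so dot products become cosine similarities
    text_vecs = F.normalize(text_vecs, dim=-1)
    logits = image_vecs @ text_vecs.t() / tau      # (N, N) matrix of <J_i, T_k> / tau

    targets = torch.arange(len(image_vecs), device=image_vecs.device)  # true pairs on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)    # Equation (1), averaged over the batch
    loss_t2i = F.cross_entropy(logits.t(), targets)  # Equation (2), averaged over the batch
    return lam * loss_i2t + (1 - lam) * loss_t2i   # Equation (3)
```

With lam = 0.5 the two directions contribute equally, which corresponds to the symmetric loss used by CLIP.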
Owing to its dominant zero-shot performance, previous studies deployed CLIP in artwork classification (Conde & Turgutlu, 2021), video summarization by detecting key video frames (Narasimhan et al., 2021), and vehicle retrieval tasks (Khorramshahi et al., 2021). However, it does not perform effectively in domain-specific tasks or complex tasks, such as traffic sign recognition and lymph node tumor detection (Radford et al., 2021). This indicates that the use of CLIP for zero-shot tasks in a domain-specific area requires performance improvement.

2.3 Prompt engineering

Several studies have revealed that paraphrasing or rephrasing an input query, known as a prompt, can bolster the performance of CLIP-like pretrained models (Z. Jiang et al., 2020; V. Liu & Chilton, 2022; X. Liu et al., 2021; Radford et al., 2019; Schick & Schütze, 2021). Prompt engineering is the process of modifying a query to facilitate a pretrained model to identify target information with high accuracy (Brown et al., 2020; T. Gao et al., 2020; P. Liu et al., 2021; T. Z. Zhao et al., 2021). Some studies manually converted an initial query into a set format of the query. Radford et al. (2019) illustrated that GPT-2 could acquire advanced zero-shot competency by inserting a task description, such as "translate English to French," into a prompt. Schick and Schütze (2021) changed the labels to task-relevant descriptions applicable to a masked-language model and showed an improvement in robustly optimized BERT performance. Radford et al. (2021) described that the elaboration of a query in the form of "a {category} photo of {label}" performs better than a single "{label}" query in fine-grained image classification using CLIP. In contrast, other studies have focused on automatically tuning a prompt. Z. Jiang et al. (2020) identified that knowledge-contained prompts, generated by mining-based and paraphrasing-based approaches, can enhance the performance of pretrained language models, that is, BERT-base and BERT-large. Compared to the vocabularic approach, Liu et al. (2021) presented p-tuning, which is a method to automatically search for better prompts with numerical embedding tensors beyond the scope of the original vocabulary, thereby boosting GPT-2's performance. Zhou et al. (2021) introduced context optimization—generating a prompt consisting of context vectors—to not only improve the zero-shot performance of CLIP but also to avoid tuning a prompt manually.

However, a prerequisite for any prompt tuning method, whether automated or not, is that the result must be known first to optimize a prompt. That is, prompt tuning methods do not explain how to configure the prompt in the first place to improve the zero-shot performance. The main difference between previous studies and this study is that this study focuses on initial "prompt construction" rather than "prompt tuning."
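To make concrete how such a wording change enters the zero-shot pipeline, the sketch below scores one image against a bare label prompt and the "a defect photo of {label}" format used as the baseline in this study, assuming the open-source clip package; the image path is a placeholder. Only the text side changes between the two runs; the image embedding and the cosine-similarity scoring stay the same.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["crack", "mildew", "nail popping", "peeling", "wrinkling"]
prompt_sets = {
    "bare label": labels,
    "engineered": [f"a defect photo of {label}" for label in labels],
}

image = preprocess(Image.open("defect.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    for name, prompts in prompt_sets.items():
        text = clip.tokenize(prompts).to(device)
        logits_per_image, _ = model(image, text)        # scaled cosine similarities
        probs = logits_per_image.softmax(dim=-1)[0]
        print(name, {l: round(p.item(), 3) for l, p in zip(labels, probs)})
```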
2.4 Research gap and questions

Previous studies proposed deep learning-based defect detection, which requires large datasets of defect images that are difficult to acquire. However, owing to recent advancements in task-agnostic VLP models (e.g., CLIP), zero-shot transfer, which does not require training data, has become possible. The task-agnostic VLP models do not work effectively with domain-specific tasks, whereas they work effectively with generic tasks. Inspired by the success of prompt engineering for pretrained models in the NLP field, several VLP models have embraced prompt engineering. However, prompt engineering of a VLP model has not reached the stage where it can help us detect defects with high accuracy. Therefore, this study aims to identify the characteristics of a prompt enabling CLIP to detect building defects with high accuracy by addressing the following research questions:

1. Does a generic dictionary definition of a defect or a domain-specific definition of a defect perform better as a prompt?
2. Does CLIP perform better with a complete sentence including stopwords as a prompt, or does it perform better with a prompt consisting only of core terms without stopwords?
3. Which has stronger descriptive power, a textual prompt or a visual one?
4. Does CLIP detect defects more accurately with a multimodal prompt?

The significance of these four research questions is discussed in detail in the next section.

3 PROPOSED PROMPT ENGINEERING METHOD

3.1 Construction of DK reflected in prompts

Hypothesis 1 (H1): A DK-based definition of a defect performs better as a prompt than a GK-based definition.

Expanding the information of input data quantitatively and qualitatively by referring to a dictionary can boost the performance of a deep learning model (Kupi et al., 2021; Peng et al., 2020; Qiu et al., 2020). Motivated by this technique, this study tailored prompts by looking up dictionaries. In previous prompt engineering approaches, Radford et al. (2021) added category information to instruct CLIP to classify objects within a provided category. This approach cannot augment the information for individual target labels. Zhou et al. (2021) discovered an optimal prompt in vector space. However, these vectors could not be converted into real-world vocabulary (Zhou et al., 2021), which results in an inharmonious human-AI interaction (Shin et al., 2020; Wu et al., 2022). Compared to earlier approaches, our dictionary-enhanced prompt is a "human-readable" prompt, as well as "knowledge-contained."

In addition, this study distinguished between the enhanced prompts obtained through domain-specific dictionaries and those obtained from commonly used dictionaries (e.g., the Oxford English Dictionary), to determine the difference between prompts based on domain-specific information and those based on general information. This study considered definitions in construction jargon dictionaries as domain knowledge reflected in prompts (DK prompts). This was based on the rationale that they include clear and concise explanations of the most commonly encountered terms, phrases, and abbreviations used throughout the construction industry (Tolson, 2012). Conversely, this study considered definitions from the commonly used and renowned dictionaries as general knowledge reflected in prompts (GK prompts). In addition, this study created ensemble features for these two prompt groups in the embedding space by averaging their embedding vectors, as Radford et al. (2021) suggested that this increases the accuracy of CLIP. This study set the prompt format "a {category} photo of {label}," suggested by the CLIP development team in its latest publication (Radford et al., 2021), as a baseline. Table 1 displays the prompts used in this study, and Table 2 lists the prompt sources.
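As an illustration of the ensembling step just described, the sketch below averages the CLIP text embeddings of several definitions per class and classifies an image by cosine similarity. It assumes the open-source clip package, and the definition strings are abbreviated stand-ins for the prompts in Table 1.

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Abbreviated stand-ins for the DK prompts of Table 1 (one list per defect class)
dk_prompts = {
    "crack": ["Crack is a fissure or fracture in a material", "Crack is a fissure"],
    "peeling": ["Peeling is a paint defect where the paint debonds from the surface and peels off"],
}

class_vectors = []
with torch.no_grad():
    for label, definitions in dk_prompts.items():
        tokens = clip.tokenize(definitions).to(device)
        text_vecs = model.encode_text(tokens)
        text_vecs = text_vecs / text_vecs.norm(dim=-1, keepdim=True)
        class_vectors.append(text_vecs.mean(dim=0))      # ensemble = average embedding per class
ensemble = torch.stack(class_vectors)                     # (num_classes, 512)

def classify(image_tensor):
    """Return the index of the class whose ensemble vector is most similar to the image."""
    with torch.no_grad():
        image_vec = model.encode_image(image_tensor.to(device))
        image_vec = image_vec / image_vec.norm(dim=-1, keepdim=True)
    return (image_vec @ ensemble.t()).argmax(dim=-1)
```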
Hypothesis 2 (H2): A list of core terms in a definition performs better as a prompt than a complete sentence definition.

Stopwords (e.g., a, the, it, he, she) are functional and frequently appearing words that show low discrimination power (Brants, 2003; Lo et al., 2005; Rijsbergen, 1979; Saif et al., 2014). Considering that the removal of stopwords facilitates the improvement of data quality and decreases the dimension of data (Silva & Ribeiro, 2003; Wilbur & Sirotkin, 1992), it has become the mainstream preprocessing method in NLP (Fan & Mostafavi, 2019; Roy et al., 2020). Along with this trend, this study discarded stopwords from prompts through the natural language toolkit (Bird et al., 2009), anticipating performance strengthening. Zhou et al. (2021) discovered that there is no golden rule for determining the optimal context length. However, their approach was based on tokens that could not be converted into the existing vocabulary. Thus, it is still an uncharted area to reveal the relationship between the number of words in a prompt and the performance of the VLP model.
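The stopword-removal step can be reproduced roughly as follows with the natural language toolkit; the tokenizer choice and the example prompt are ours.

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords")
nltk.download("punkt")

def to_core_terms(prompt: str) -> str:
    """Strip stopwords from a full-sentence prompt, keeping only core terms."""
    stops = set(stopwords.words("english"))
    tokens = word_tokenize(prompt.lower())
    return " ".join(t for t in tokens if t.isalpha() and t not in stops)

print(to_core_terms("Crack is a fissure or fracture in a material"))
# -> "crack fissure fracture material"
```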
3.3 Descriptive power of a visual prompt

Hypothesis 3 (H3): A defect image is better than the defect definition as a prompt.

The majority of defect reports include descriptions of defects and corresponding images (Dong et al., 2009; Park et al., 2013), as humans are more receptive to visual information than other types of information (Colavita, 1974; Colavita et al., 1976; Posner et al., 1976; Sinnett et al., 2007). Based on this claim, this study comparatively measures the descriptive powers of visual and textual information.

As described in Section 2.2, there are two encoders in CLIP. The text encoder places a textual query, primarily used as a prompt, in the 512-dimensional space. In contrast, visual data (e.g., images) are encoded as 512-, 640-, 768-, or 1024-dimensional vectors by the image encoder. Considering that CLIP can convert both data types into vectors in the same-dimensional embedding space (the 512-dimension), an image can be deployed as a prompt. Moreover, multiple vectors from multiple images can be merged into a single vector by calculating the average, which is similar to a few-shot transfer (Radford et al., 2021). M. Wang et al. (2021) employed eight images as a prompt to carry out few-shot action recognition. In a similar way, this study used a small number of images as a prompt and compared it with the textual prompt. Through this work, it is possible to optimize the prompt toward classifying or detecting defects and, by extension, to identify which format is better for delivering information to a VLP model, either linguistic or visual.
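A sketch of how a visual prompt, and by extension a multimodal prompt, can be assembled in the shared embedding space is given below (clip package assumed; the image paths are placeholders, and the simple averaging fusion shown at the end is one possible choice rather than necessarily the exact scheme used in the experiments).

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_images(paths):
    """Average a handful of defect images into a single visual prompt vector."""
    batch = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
    with torch.no_grad():
        vecs = model.encode_image(batch)
    vecs = vecs / vecs.norm(dim=-1, keepdim=True)
    return vecs.mean(dim=0)

def embed_texts(definitions):
    """Average a group of textual definitions into a single text prompt vector."""
    tokens = clip.tokenize(definitions).to(device)
    with torch.no_grad():
        vecs = model.encode_text(tokens)
    vecs = vecs / vecs.norm(dim=-1, keepdim=True)
    return vecs.mean(dim=0)

# k-shot visual prompt and a multimodal prompt for one defect class (placeholder inputs)
visual_prompt = embed_images(["crack_01.jpg", "crack_02.jpg", "crack_03.jpg", "crack_04.jpg"])
text_prompt = embed_texts(["Crack is a fissure or fracture in a material"])
multimodal_prompt = (visual_prompt + text_prompt) / 2   # one way to fuse the two modalities
```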
Prior to testing H4, this study performed the principal component analysis (PCA) to verify the existence of a gap between defect images and descriptions when a VLP model processes information. PCA is a widely used technique for analyzing, compressing, and visualizing high-dimensional data (Amiri et al., 2012; Bishop, 2006; Nabian & Meidani, 2018; Yulong Zhang et al., 2021). PCA is the orthogonal projection of data onto a lower-dimensional linear space, that is, the principal subspace, such that the variance of the projected data is maximized (Hotelling, 1933). This study extracted principal components from features in 512 dimensions and visualized them to identify the position of image embedding vectors from defect images and text embedding vectors from descriptions.
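A minimal sketch of that PCA check with scikit-learn is shown below, assuming the 512-dimensional image and text embeddings have already been collected; the arrays here are random stand-ins for real CLIP features.

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

image_embs = np.random.randn(50, 512)   # stand-ins for defect-image embeddings
text_embs = np.random.randn(7, 512)     # stand-ins for definition embeddings

pca = PCA(n_components=2)
projected = pca.fit_transform(np.vstack([image_embs, text_embs]))

plt.scatter(projected[:50, 0], projected[:50, 1], label="image embeddings")
plt.scatter(projected[50:, 0], projected[50:, 1], label="text embeddings")
plt.legend()
plt.show()
```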
Generally, there is a feature difference between an image and a text when a deep learning model interprets the input data. Owing to this, earlier studies revealed that merg-
TABLE 1 Prompts and their sources for each prompt group and defect type
Defect type (synonyms)
[Prompt group] Prompt (Source)
Crack (cracks)
[BL] A defect photo of crack (R1)
[DK1] Crack is a fissure or fracture in a material (R2)
[DK2] Crack is a building defect consisting of complete or incomplete separation within a single element or between
contiguous elements of construction (R3)
[DK3] Crack is a fissure (R4)
[GK1] Crack is a fissure or opening formed by the cracking, breaking, or bursting of a hard substance (R5)
[GK2] Crack is a narrow break or opening (R6)
[GK3] Crack is a thin line on the surface of something when it is broken but has not actually come apart (R7)
Mildew (mold; mould)
[BL] A defect photo of mildew (R1)
[DK1] Mildew is a fungus growth that is enhanced by dampness (R2)
[DK2] Mildew is a fungus that grows and feeds on paint, cotton, and linen fabric, and so forth, which are exposed to moisture; causes discoloration and decomposition of the surface (R3)
[DK3] Mildew is a fungus that stains materials but does not rot wood (R8)
[GK1] Mildew is a woolly, furry, or staining growth now recognized as consisting of fungus, such as that which forms on
food, textile, and so forth (R5)
[GK2] Mildew is a fungus producing mildew (R6)
[GK3] Mildew is a white or gray substance that grows on walls or other surfaces in wet, slightly warm conditions (R7)
Nail popping (nail pops; nail + popping)
[BL] A defect photo of nail popping (R1)
[DK1] Nail popping is a problem that appears both in decking and in gypsum wallboard finishes where the heads of nails
pull or work themselves out of the framing members and pop through the surface (R2)
[DK2] Nail popping is a protrusion of nailhead and compound directly over the nailhead, caused by outward movement of
the nail relative to the gypsum board (R9)
[DK3] Nail popping is a nail head that protrudes above the surrounding surface (R10)
[GK1] Nail popping is the displacement, dislodgement, and dislocation of a small metal spike (R5)
[GK2] Nail popping is a slender, usually pointed and headed fastener designed to be pounded in, escaping or breaking away from something usually suddenly or unexpectedly (R6)
[GK3] Nail popping is unexpectedly out of or away from a thin pointed piece of metal with a flat top, which you hit into a
surface with a hammer (R7)
Peeling
[BL] A defect photo of peeling (R1)
[DK1] Peeling is a paint defect where the paint debonds from the surface and peels off (R2)
[DK2] Peeling is finishing such as paint that has not properly adhered to the surface and has started to come away from the
substrate (R11)
[DK3] Peeling is the defect of dislodgement of paint or plaster from a surface due to lack of adhesion or a weak backing (R8)
[GK1] Peeling is the removal of the external layer or outer covering of something (R5)
[GK2] Peeling is a peeled-off piece or strip (R6)
[GK3] Peeling is paint peels, it comes off, usually in small pieces (R7)
Wrinkling
[BL] A defect photo of wrinkling (R1)
[DK1] Wrinkling is a paint defect in which the surface becomes wrinkled (R2)
[DK2] Wrinkling is the distortion in a paint film appearing as ripples, may be produced intentionally as a decorative effect,
or may be a defect caused by drying conditions or an excessively thick film (R3)
[DK3] Wrinkling is the development of ridges and furrows in a paint film during drying (R12)
[GK1] Wrinkling is the action of creasing, puckering, or contracting into wrinkles (R5)
[GK2] Wrinkling is a small ridge or furrow especially when formed on a surface by the shrinking or contraction of a smooth
substance (R6)
[GK3] Wrinkling is a small untidy fold in a piece of clothing or paper (R7)
Abbreviations: BL, baseline; DK, domain knowledge; GK, general knowledge.
TABLE 4 Comparison of crack detection performances between the proposed method and the results from Y. Gao et al. (2021)

Source | Experimental group | Training data | Test data | Accuracy | F2 score
The proposed method | Baseline | 0 | 5100 | 0.985 | 0.877
The proposed method | DK1 | 0 | 5100 | 0.987 | 0.890
The proposed method | DK2 | 0 | 5100 | 0.986 | 0.880
The proposed method | DK3 | 0 | 5100 | 0.989 | 0.907
The proposed method | GK1 | 0 | 5100 | 0.988 | 0.897
The proposed method | GK2 | 0 | 5100 | 0.983 | 0.857
The proposed method | GK3 | 0 | 5100 | 0.988 | 0.900
Benchmark (Y. Gao et al., 2021) | BSL | 10,200 | 5100 | 0.954 | 0.352
Benchmark (Y. Gao et al., 2021) | BUS | 10,200 | 5100 | 0.940 | 0.466
Benchmark (Y. Gao et al., 2021) | BOS-DA | 10,200 | 5100 | 0.962 | 0.501
Benchmark (Y. Gao et al., 2021) | BOS-GAN | 10,200 | 5100 | 0.954 | 0.336
Benchmark (Y. Gao et al., 2021) | BSL-SDF | 10,200 | 5100 | 0.902 | 0.434
Benchmark (Y. Gao et al., 2021) | BSS-GAN | 10,200 | 5100 | 0.921 | 0.728
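For reference, the F2 score reported for the detection task weights recall twice as heavily as precision. A minimal sketch of computing these indices with scikit-learn is shown below; the label vectors are illustrative only.

```python
from sklearn.metrics import accuracy_score, fbeta_score

y_true = [1, 0, 1, 1, 0, 1]   # illustrative ground truth (1 = defect present)
y_pred = [1, 0, 1, 0, 0, 1]   # illustrative predictions

accuracy = accuracy_score(y_true, y_pred)
f1 = fbeta_score(y_true, y_pred, beta=1)   # balances precision and recall
f2 = fbeta_score(y_true, y_pred, beta=2)   # emphasizes recall, i.e., missed defects
print(accuracy, f1, f2)
```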
FIGURE 8 Relationship between the defect classification performance (F1-score) and the number of words in a prompt
FIGURE 10 Comparison of the defect detection performances between textual and visual prompts
TABLE 5 Defect classification performances in ensembling visual and textual prompts

Prompt | Accuracy | F1-score
DK_ensemble (text) | 0.834 | 0.835
GK_ensemble (text) | 0.736 | 0.727
1-shot (image) | 0.770 | 0.762
DK_ensemble + 1-shot (text + image) | 0.825 | 0.821
GK_ensemble + 1-shot (text + image) | 0.837 | 0.834
2-shot (image) | 0.817 | 0.807
DK_ensemble + 2-shot (text + image) | 0.864 | 0.861
GK_ensemble + 2-shot (text + image) | 0.870 | 0.868
4-shot (image) | 0.864 | 0.860
DK_ensemble + 4-shot (text + image) | 0.896 | 0.894
GK_ensemble + 4-shot (text + image) | 0.904 | 0.904
8-shot (image) | 0.891 | 0.890
DK_ensemble + 8-shot (text + image) | 0.917 | 0.917
GK_ensemble + 8-shot (text + image) | 0.927 | 0.928
16-shot (image) | 0.902 | 0.901
DK_ensemble + 16-shot (text + image) | 0.925 | 0.925
GK_ensemble + 16-shot (text + image) | 0.933 | 0.934

TABLE 6 Defect detection performances in ensembling visual and textual prompts

Prompt | Accuracy | F2-score
DK_ensemble (text) | 0.975 | 0.591
GK_ensemble (text) | 0.976 | 0.605
1-shot (image) | 0.966 | 0.433
DK_ensemble + 1-shot (text + image) | 0.974 | 0.570
GK_ensemble + 1-shot (text + image) | 0.974 | 0.567
2-shot (image) | 0.970 | 0.512
DK_ensemble + 2-shot (text + image) | 0.976 | 0.610
GK_ensemble + 2-shot (text + image) | 0.976 | 0.611
4-shot (image) | 0.975 | 0.595
DK_ensemble + 4-shot (text + image) | 0.979 | 0.656
GK_ensemble + 4-shot (text + image) | 0.979 | 0.659
8-shot (image) | 0.978 | 0.634
DK_ensemble + 8-shot (text + image) | 0.980 | 0.674
GK_ensemble + 8-shot (text + image) | 0.980 | 0.671
16-shot (image) | 0.979 | 0.648
DK_ensemble + 16-shot (text + image) | 0.981 | 0.679
GK_ensemble + 16-shot (text + image) | 0.981 | 0.683

of images and text performed better than single-modal prompts, deploying only images or text, for both classification and detection. Table 5 compares the accuracies and F1-scores from classification, whereas Table 6 compares the accuracies and F2-scores from detection. The highest accuracy and F1-score of the classification task were obtained from the multimodal prompt, leveraging general knowledge-based definitions with 16 images (accuracy = 0.933; F1-score = 0.934). Furthermore, this multimodal prompt showed the highest performance on the defect detection task (accuracy = 0.981; F2-score = 0.683).

Unlike the previous tasks, combinations of images and a GK prompt appeared to perform marginally better than the combinations of images and a DK prompt, not only in the classification task but also in the detection task. However, the t-test showed that the differences in the detection task were not statistically significant (all p-values > 0.05). This study cannot explain the better performance of the combination of GK prompts and images than the others. However, this study assumes that the GK + image prompts capture the core semantic features of a defect from images and general features from descriptions. This is because the PCA results of image embeddings are completely different from those of text embeddings, as shown in Figure 11.

6 CONCLUSION

Previous studies on deep learning-based defect inspection have required a large number of defect images for applicability because they deployed supervised learning. Zero-shot or few-shot transfer using VLP models such as CLIP is expected to be the alternative to traditional supervised learning, particularly for data-insufficient tasks. Defect inspection is one such task, where data acquisition is difficult owing to its sensitivity. The zero-shot performance of a VLP model varies according to the input data (prompts). However, the identification of an optimal prompt for the improvement of zero-shot performance has not been widely studied. This study experimented with four different groups of prompts to construct an optimal prompt to bolster the zero-shot performance in classifying and detecting five types of building defects (crack, mildew, nail popping, peeling, and wrinkling). The major results acquired from testing the hypotheses are as follows:

H1: A domain knowledge-based definition of a defect performs better as a prompt than a general knowledge-based definition.

The results indicated domain knowledge to perform better (average accuracy = 0.786; average F1-score = 0.783) than general knowledge as a prompt for classifying defects. Domain knowledge also returned the most reliable classification performance across different defect types, exhibiting the smallest standard deviations (< 0.2), whereas the baseline and general knowledge exhibited standard deviations of 0.6. In the zero-shot defect detection, both domain knowledge and general knowledge improved performance; however, the difference was not statistically significant (p > 0.05). Furthermore, zero-shot crack detection using the proposed prompt methods outperformed the supervised learning, transfer learning, and GAN-based approaches experimented with in a previous study (Y. Gao et al., 2021).
H2: A list of core terms in a definition performs better as a prompt than a complete sentence definition.

A complete sentence definition with stopwords performed better as a prompt than a set of core terms that were extracted from the complete sentence definition after removing the stopwords. To examine whether this result is because a complete sentence definition includes more words than a set of core terms, the correlation between the number of words in a prompt and its performance was measured. The Spearman's rank correlation coefficient test indicated that there was no statistically significant correlation between the number of words in a prompt and performance (rS = −0.04, p = 0.81 > 0.05).

H3: A defect image is better than the defect definition as a prompt.

For defect classification, when the number of images was <4, domain knowledge-based definitions exhibited stronger descriptive power as a prompt than images. However, for defect detection, when the number of images was >4, images were better prompts than knowledge-based definitions.

H4: A multimodal prompt with the combination of defect images and definitions performs better than a single-modal prompt.

When the combination of images and texts was used as a prompt, instead of using either only images or text, zero-shot transfer through CLIP performed best in both classifying (highest accuracy = 0.933; highest F1-score = 0.934) and detecting defects (highest accuracy = 0.981; F2-score = 0.683). Moreover, in all cases, the proposed prompt methods performed better than the baseline case (the accuracy and F1-score for the classification task: 0.736 and 0.713; the accuracy and F2-score for the detection task: 0.956 and 0.275), which is a benchmark prompt in the computer vision field proposed by Radford et al. (2021). The results of this study confirm that domain knowledge-based prompts enhance the performance of zero-shot detection and classification.

The primary contributions of this study are as follows: (1) There have not been many studies on prompt engineering yet, although the importance of prompt engineering is rapidly recognized as the performance of zero-shot learning improves exponentially. Among them, this study made a unique contribution in that it focused on the initial "prompt construction" method, while previous studies focused on "prompt tuning (word tuning)." (2) It demonstrated the possibility of replacing traditional supervised learning with zero-shot transfer, which does not require a training process with a very large amount of data. The required data are "construction domain knowledge" and a small number of images. (3) It revealed that a careful selection of a prompt is required to elicit the full advantage of a VLP model. (4) It showed the feasibility of multimodal information in vision-related tasks in the construction industry. The results of this study may be used as a baseline for future zero-shot or few-shot transfer studies and to classify or detect construction-specific tasks or objects, including construction activity recognition, construction equipment detection, or construction method classification.

To extend the contributions of this study, two frameworks are suggested for future research. The first involves building a construction-VLP model with a plurality of construction (image, text) pairs. Considering the large scale of parameters, initializing the parameters of a model and training it with a domain-specific dataset can be a better option than fine-tuning the model with a relatively smaller dataset. After realizing a construction-specific VLP model, more accurate and useful zero-shot capabilities will be available in many construction areas. Another potential area is one that predicts maintenance cost with multimodal information, for example, a defect report with an image and its description as a solution. If the maintenance cost information regarding a safety report is acquired, a cost estimation model, which assumes a defect report as an input and a maintenance cost as an output, can be built. This model may be more practical as it considers a visual and a textual feature. Further, this model can connect image–text features with a financial feature.

However, a few issues remain for further studies. First, the zero-shot approach used in this study possesses the advantage of not requiring the collection and labeling of large datasets. However, the proposed methods have not been tested in a large-scale real-world project. Second, using the proposed zero-shot transfer method with appropriate prompts, a defect type can be identified with considerably high accuracy. However, it is still challenging to identify detailed defect information, such as the severity of defects, which can be measured by the width or length of a crack or the angle between building components. These remain as future work. Overall, despite the scope for improvement, the proposed method is expected to enable practitioners with only introductory knowledge of AI to apply AI technology in everyday construction work.
ORCID
Gunwoo Yong https://orcid.org/0000-0003-0912-4520

REFERENCES
Agarwal, S., Krueger, G., Clark, J., Radford, A., Kim, J. W., & Brundage, M. (2021). Evaluating CLIP: Towards characterization of broader capabilities and downstream implications. arXiv:2108.02818 [cs].
Amezquita-Sanchez, J. P., & Adeli, H. (2015). Synchrosqueezed wavelet transform-fractality model for locating, detecting, and quantifying damage in smart highrise building structures. Smart Materials and Structures, 24, 065034. https://doi.org/10.1088/0964-1726/24/6/065034
Amiri, G. G., Abdolahi Rad, A., Aghajari, S., & Khanmohamadi Hazaveh, N. (2012). Generation of near-field artificial ground motions compatible with median-predicted spectra using PSO-based neural network and wavelet analysis. Computer-Aided Civil and Infrastructure Engineering, 27, 711–730. https://doi.org/10.1111/j.1467-8667.2012.00783.x
Audebert, N., Herold, C., Slimani, K., & Vidal, C. (2019). Multimodal deep networks for text and image-based document classification. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 427–443). Springer, Cham.
Azimi, M., & Pekcan, G. (2020). Structural health monitoring using extremely compressed data through deep learning. Computer-Aided Civil and Infrastructure Engineering, 35, 597–614. https://doi.org/10.1111/mice.12517
Bang, S., Park, S., Kim, H., & Kim, H. (2019). Encoder–decoder network for pixel-level road crack detection in black-box images. Computer-Aided Civil and Infrastructure Engineering, 34, 713–727. https://doi.org/10.1111/mice.12440
Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python (1st ed.). O'Reilly.
Bishop, C. M. (2006). Pattern recognition and machine learning, information science and statistics. Springer.
Brants, T. (2003). Natural language processing in information retrieval. CLIN.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., . . . Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
Chang, M.-W., Ratinov, L., Roth, D., & Srikumar, V. (2008). Importance of semantic representation: Dataless classification. Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, Chicago, IL (pp. 830–835).
Chen, Y. C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., & Liu, J. (2020). Uniter: Universal image-text representation learning. In European Conference on Computer Vision (pp. 104–120). Springer, Cham.
Chun, P., Yamane, T., & Maemura, Y. (2022). A deep learning-based image captioning method to automatically generate comprehensive explanations of bridge damage. Computer-Aided Civil and Infrastructure Engineering, 37, 1387–1401. https://doi.org/10.1111/mice.12793
Colavita, F. B. (1974). Human sensory dominance. Perception & Psychophysics, 16, 409–412. https://doi.org/10.3758/BF03203962
Colavita, F. B., Tomko, R., & Weisberg, D. (1976). Visual prepotency and eye orientation. Bulletin of the Psychonomic Society, 8, 25–26. https://doi.org/10.3758/BF03337062
Conde, M. V., & Turgutlu, K. (2021). CLIP-art: Contrastive pre-training for fine-grained art classification. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN (pp. 3951-395).
Crestwoodpainting. (n.d.). Nail pops: What you should know. https://crestwoodpainting.com/nail-pops/
Cui, Z., Wang, Q., Guo, J., & Lu, N. (2022). Few-shot classification of façade defects based on extensible classifier and contrastive learning. Automation in Construction, 141, 104381. https://doi.org/10.1016/j.autcon.2022.104381
D'Addario, J. (2020). New survey finds British businesses are reluctant to proactively share data. https://theodi.org/article/new-survey-finds-just-27-of-british-businesses-are-sharing-data/
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL (pp. 248–255). https://doi.org/10.1109/CVPR.2009.5206848
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dong, A., Maher, M. L., Kim, M. J., Gu, N., & Wang, X. (2009). Construction defect management using a telematic digital workbench. Automation in Construction, 18, 814–824. https://doi.org/10.1016/j.autcon.2009.03.005
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An image is worth 16×16 words: Transformers for image recognition at scale. arXiv:2010.11929 [cs]. https://doi.org/10.48550/arXiv.2010.11929
Fan, C., & Mostafavi, A. (2019). A graph-based method for social sensing of infrastructure disruptions in disasters. Computer-Aided Civil and Infrastructure Engineering, 34, 1055–1070. https://doi.org/10.1111/mice.12457
Fu, J., Huang, C., Xing, J., & Zheng, J. (2012). Pattern classification using an olfactory model with PCA feature selection in electronic noses: Study and application. Sensors, 12, 2818–2830. https://doi.org/10.3390/s120302818
Gallo, I., Calefati, A., Nawaz, S., & Janjua, M. K. (2018). Image and encoded text fusion for multi-modal classification. 2018 Digital Image Computing: Techniques and Applications (DICTA), Canberra, Australia (pp. 1–7). https://doi.org/10.1109/DICTA.2018.8615789
Gao, T., Fisch, A., & Chen, D. (2021). Making pre-trained language models better few-shot learners. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 1, (pp. 3816–3830).
Gao, Y., Kong, B., & Mosalam, K. M. (2019). Deep leaf-bootstrapping generative adversarial network for structural image data augmentation. Computer-Aided Civil and Infrastructure Engineering, 34, 755–773. https://doi.org/10.1111/mice.12458
Gao, Y., & Mosalam, K. M. (2018). Deep transfer learning for image-based structural damage recognition. Computer-Aided Civil and Infrastructure Engineering, 33, 748–768. https://doi.org/10.1111/mice.12363
Gao, Y., & Mosalam, K. M. (2020). PEER hub ImageNet: A large-scale multiattribute benchmark data set of structural images. Journal of Structural Engineering, 146, 04020198. https://doi.org/10.1061/(ASCE)ST.1943-541X.0002745
Gao, Y., Zhai, P., & Mosalam, K. M. (2021). Balanced semisupervised generative adversarial network for damage assessment from low-data imbalanced-class regime. Computer-Aided Civil and Infrastructure Engineering, 36, 1094–1113. https://doi.org/10.1111/mice.12741
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
Gorse, C. A., Johnston, D., & Pritchard, M. (2012). A dictionary of construction, surveying, and civil engineering (1st ed.). Oxford University Press.
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., & Parikh, D. (2017). Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6904–6913).
Guo, J., Wang, Q., & Li, Y. (2021). Semi-supervised learning based on convolutional neural network and uncertainty filter for façade defects classification. Computer-Aided Civil and Infrastructure Engineering, 36, 302–317. https://doi.org/10.1111/mice.12632
Guo, J., Wang, Q., Li, Y., & Liu, P. (2020). Façade defects classification from imbalanced dataset using meta learning-based convolutional neural network. Computer-Aided Civil and Infrastructure Engineering, 35, 1403–1418. https://doi.org/10.1111/mice.12578
Harris, C. M. (2006). Dictionary of architecture and construction. McGraw-Hill Education.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. arXiv:1512.03385 [cs].
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778).
He, T., Zhang, Z., Zhang, H., Zhang, Z., Xie, J., & Li, M. (2019). Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 558–567).
Hospedales, T., Antoniou, A., Micaelli, P., & Storkey, A. (2021). Meta-learning in neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9), 5149–5169.
Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24, 417–441. https://doi.org/10.1037/h0071325
Hu, M., & Li, J. (2019). Exploring bias in GAN-based data augmentation for small samples. arXiv:1905.08495 [cs, stat].
Huang, Z., Zeng, Z., Liu, B., Fu, D., & Fu, J. (2020). Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers. arXiv:2004.00849 [cs].
InspectApedia. (n.d.). Construction Dictionary Section 9 Finishes Terminology. https://inspectapedia.com/Design/Construction-Terms-9-Finishes.txt
Jiang, S., & Zhang, J. (2020). Real-time crack assessment using deep neural networks with wall-climbing unmanned aerial system. Computer-Aided Civil and Infrastructure Engineering, 35, 549–564. https://doi.org/10.1111/mice.12519
Jiang, Z., Xu, F. F., Araki, J., & Neubig, G. (2020). How can we know what language models know? Transactions of the Association for Computational Linguistics, 8, 423–438. https://doi.org/10.1162/tacl_a_00324
Kaur, P., Sikka, K., & Divakaran, A. (2017). Combining weakly and webly supervised learning for classifying food images. arXiv:1712.08730 [cs].
Khorramshahi, P., Rambhatla, S. S., & Chellappa, R. (2021). Towards accurate visual and natural language-based vehicle retrieval systems. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN (pp. 4178–4187). https://doi.org/10.1109/CVPRW53098.2021.00472
Kim, W., Son, B., & Kim, I. (2021). ViLT: Vision-and-language transformer without convolution or region supervision. Proceedings of the 38th International Conference on Machine Learning (pp. 5583–5594).
Kupi, M., Bodnar, M., Schmidt, N., & Posada, C. E. (2021). dictNN: A dictionary-enhanced CNN approach for classifying hate speech on Twitter. arXiv:2103.08780 [cs.CL] 1–8.
Lampert, C. H., Nickisch, H., & Harmeling, S. (2009). Learning to detect unseen object classes by between-class attribute transfer. 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL (pp. 951–958). https://doi.org/10.1109/CVPR.2009.5206594
Lan, M., Zhang, Y., Zhang, L., & Du, B. (2018). Defect detection from UAV images based on region-based CNNs. 2018 IEEE International Conference on Data Mining Workshops (ICDMW), Singapore, Singapore (pp. 385–390). https://doi.org/10.1109/ICDMW.2018.00063
Li, A., Jabri, A., Joulin, A., & Van Der Maaten, L. (2017). Learning visual n-grams from web data. In Proceedings of the IEEE International Conference on Computer Vision (pp. 4183–4192).
Li, D., Cong, A., & Guo, S. (2019). Sewer damage detection from imbalanced CCTV inspection data using deep convolutional neural networks with hierarchical classification. Automation in Construction, 101, 199–208. https://doi.org/10.1016/j.autcon.2019.01.017
Li, L. H., Yatskar, M., Yin, D., Hsieh, C.-J., & Chang, K.-W. (2019). VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557. 1–14.
Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., Wang, L., Hu, H., Dong, L., Wei, F., Choi, Y., & Gao, J. (2020). OSCAR: Object-semantics aligned pre-training for vision-language tasks. In A. Vedaldi, H. Bischof, T. Brox, & J.-M. Frahm (Eds.), European conference on computer vision 2020, lecture notes in computer science (pp. 121–137). Springer International Publishing. https://doi.org/10.1007/978-3-030-58577-8_8
Li, Y., Che, P., Liu, C., Wu, D., & Du, Y. (2021). Cross-scene pavement distress detection by a novel transfer learning framework. Computer-Aided Civil and Infrastructure Engineering, 36, 1398–1415. https://doi.org/10.1111/mice.12674
Liang, X. (2019). Image-based post-disaster inspection of reinforced concrete bridge systems using deep learning with Bayesian optimization. Computer-Aided Civil and Infrastructure Engineering, 34, 415–430. https://doi.org/10.1111/mice.12425
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2021). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv:2107.13586 [cs].
Liu, V., & Chilton, L. B. (2022). Design guidelines for prompt engineering text-to-image generative models. In CHI Conference on Human Factors in Computing Systems (pp. 1–23).
Liu, X., Zheng, Y., Du, Z., Ding, M., Qian, Y., Yang, Z., & Tang, J. (2021). GPT understands, too. arXiv:2103.10385 [cs].
Lo, R. T. W., He, B., & Ounis, I. (2005). Automatically building a stopword list for an information retrieval system. In Journal on Digital Information Management: Special Issue on the 5th Dutch-Belgian Information Retrieval Workshop (DIR), 5 (pp. 17–24).
Maeda, H., Kashiyama, T., Sekimoto, Y., Seto, T., & Omata, H. (2021). Generative adversarial network for road damage detection. Computer-Aided Civil and Infrastructure Engineering, 36, 47–60. https://doi.org/10.1111/mice.12561
Maeda, H., Sekimoto, Y., Seto, T., Kashiyama, T., & Omata, H. (2018). Road damage detection and classification using deep neural networks with smartphone images. Computer-Aided Civil and Infrastructure Engineering, 33, 1127–1141. https://doi.org/10.1111/mice.12387
Meijer, D., Scholten, L., Clemens, F., & Knobbe, A. (2019). A defect classification methodology for sewer image sets with convolutional neural networks. Automation in Construction, 104, 281–298. https://doi.org/10.1016/j.autcon.2019.04.013
Merriam-Webster (2019). The Merriam-Webster dictionary (Newest ed.). Merriam-Webster Inc.
Midjourney (2022). Midjourney. https://github.com/midjourney/docs
Montague, D. (2017). Dictionary of building and civil engineering: English/French French/English (2nd ed.). Routledge. https://doi.org/10.4324/9780203851227
Nabian, M. A., & Meidani, H. (2018). Deep learning for accelerated seismic reliability analysis of transportation networks. Computer-Aided Civil and Infrastructure Engineering, 33, 443–458. https://doi.org/10.1111/mice.12359
Narasimhan, M., Rohrbach, A., & Darrell, T. (2021). CLIP-It! Language-guided video summarization. Advances in Neural Information Processing Systems, 34, 13988–14000.
Nichol, A., Achiam, J., & Schulman, J. (2018). On first-order meta-learning algorithms. arXiv:1803.02999 [cs].
Nine, A. (2022). People have begun to sell their prompts for AI-generated artwork. https://www.extremetech.com/internet/339304-people-have-begun-to-sell-their-prompts-for-ai-generated-artwork
Özgenel, Ç. F. (2019). Concrete crack images for classification. Mendeley Data, V2. https://doi.org/10.17632/5y9wdsg2zt.2
Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22, 1345–1359. https://doi.org/10.1109/TKDE.2009.191
Park, C.-S., Lee, D.-Y., Kwon, O.-S., & Wang, X. (2013). A framework for proactive construction defect management using BIM, augmented reality and ontology-based data collection template. Automation in Construction: Augmented Reality in Architecture, Engineering, and Construction, 33, 61–71. https://doi.org/10.1016/j.autcon.2012.09.010
Paton-Cole, V. P., & Aibinu, A. A. (2021). Construction defects and disputes in low-rise residential buildings. Journal of Legal Affairs and Dispute Resolution in Engineering Construction, 13, 05020016. https://doi.org/10.1061/(ASCE)LA.1943-4170.0000433
Pearson Education (2014). Longman dictionary of contemporary English. Pearson Education.
Peng, W., Huang, C., Li, T., Chen, Y., & Liu, Q. (2020). Dictionary-based data augmentation for cross-domain neural machine translation. arXiv:2004.02577 [cs].
Perez, H., & Tah, J. H. M. (2021). Deep learning smartphone application for real-time detection of defects in buildings. Structural Control and Health Monitoring, 28, e2751. https://doi.org/10.1002/stc.2751
Perez, H., Tah, J. H. M., & Mosavi, A. (2019). Deep learning for detecting building defects using convolutional neural networks. Sensors, 19, 3556. https://doi.org/10.3390/s19163556
Posner, M., Nissen, M., & Klein, R. (1976). Visual dominance: An information-processing account of its origins and significance. Psychological Review, 83, 157–171. https://doi.org/10.1037/0033-295X.83.2.157
Pourpanah, F., Abdar, M., Luo, Y., Zhou, X., Wang, R., Lim, C. P., Wang, X.-Z., & Wu, Q. M. J. (2022). A review of generalized zero-shot learning methods. IEEE Transactions on Pattern Analysis and Machine Intelligence. Advance online publication. https://doi.org/10.1109/TPAMI.2022.3191696
Promptbase (2022). Promptbase. https://promptbase.com/
Qi, D., Su, L., Song, J., Cui, E., Bharti, T., & Sacheti, A. (2020). ImageBERT: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv:2001.07966 [cs].
Qiu, Q., Xie, Z., Wu, L., & Tao, L. (2020). Dictionary-based automated information extraction from geological documents using a deep learning algorithm. Earth and Space Science, 7, 1–18. https://doi.org/10.1029/2019EA000993
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748–8763). PMLR.
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
Rafiei, M. H., & Adeli, H. (2017). A novel machine learning-based algorithm to detect damage in high-rise building structures. Structural Design of Tall and Special Buildings, 26(18), e1400. https://doi.org/10.1002/tal.1400
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., & Sutskever, I. (2021). Zero-shot text-to-image generation. International Conference on Machine Learning (pp. 8821–8831).
Rijsbergen, C. J. V. (1979). Information retrieval (2nd ed.). Butterworth-Heinemann.
Rotimi, F. E., Tookey, J., & Rotimi, J. O. (2015). Evaluating defect reporting in new residential buildings in New Zealand. Buildings, 5, 39–55. https://doi.org/10.3390/buildings5010039
Roy, K. C., Hasan, S., & Mozumder, P. (2020). A multilabel classification approach to identify hurricane-induced infrastructure disruptions using social media data. Computer-Aided Civil and Infrastructure Engineering, 35, 1387–1402. https://doi.org/10.1111/mice.12573
Saif, H., Fernandez, M., He, Y., & Alani, H. (2014). On stopwords, filtering and data sparsity for sentiment analysis of Twitter. LREC 2014, Ninth International Conference on Language Resources and Evaluation. Proceedings, Reykjavik, Iceland (pp. 810–817).
Schick, T., & Schütze, H. (2021). Exploiting cloze questions for few shot text classification and natural language inference. Proceedings of the 16th Conference of the European Chapter of the
Association for Computational Linguistics (pp. 255–269). https://doi.org/10.18653/v1/2021.eacl-main.20
Scott, J. S., & Maclean, J. H. (2000). Dictionary of building (4th UK ed.). Penguin UK.
Sedgwick, P. (2014). Spearman's rank correlation coefficient. BMJ: British Medical Journal, 349, g7327.
Shen, S., Li, L. H., Tan, H., Bansal, M., Rohrbach, A., Chang, K.-W., Yao, Z., & Keutzer, K. (2021). How much can CLIP benefit vision-and-language tasks? arXiv:2107.06383 [cs].
Shibata, T., Kato, N., & Kurohashi, S. (2007). Automatic object model acquisition and object recognition by integrating linguistic and visual information. Proceedings of the 15th International Conference on Multimedia—MULTIMEDIA '07, Augsburg, Germany. https://doi.org/10.1145/1291233.1291327
Shin, T., Razeghi, Y., Logan, R. L. IV, Wallace, E., & Singh, S. (2020). AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online (pp. 4222–4235). https://doi.org/10.18653/v1/2020.emnlp-main.346
Silva, C., & Ribeiro, B. (2003). The importance of stop word removal on recall values in text categorization. Proceedings of the International Joint Conference on Neural Networks, 2003(3), 1661–1666. https://doi.org/10.1109/IJCNN.2003.1223656
Simpson, J., & Weiner, E. (Eds.). (1989). The Oxford English dictionary (2nd ed.). Oxford University Press.
Sinnett, S., Spence, C., & Soto-Faraco, S. (2007). Visual dominance and attention: The Colavita effect revisited. Perception & Psychophysics, 69, 673–686. https://doi.org/10.3758/BF03193770
Sokolova, M., Japkowicz, N., & Szpakowicz, S. (2006). Beyond accuracy, F-score and ROC: A family of discriminant measures for performance evaluation. Australasian Joint Conference on Artificial Intelligence, Canberra, Australia (pp. 1015–1021). https://doi.org/10.1007/11941439_114
Srihari, R. K. (1994). Computational models for integrating linguistic and visual information: A survey. Artificial Intelligence Review, 8, 349–369.
Standards Australia. (n.d.). National dictionary of building & plumbing terms. https://www.constructiondictionary.com.au/
Tan, H., & Bansal, M. (2019). LXMERT: Learning cross-modality encoder representations from transformers. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 5100–5111). https://doi.org/10.18653/v1/D19-1514
Tolson, S. (2012). Dictionary of construction terms. Informa Law from Routledge. https://doi.org/10.4324/9781315850320
van Engelen, J. E., & Hoos, H. H. (2020). A survey on semi-supervised learning. Machine Learning, 109, 373–440. https://doi.org/10.1007/s10994-019-05855-6
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, Long Beach, CA (pp. 5998–6008).
Wali, K. I., & Ali, N. S. (2019). Diagnosis and evaluation of defects encountered in newly constructed houses in Erbil City, Kurdistan, Iraq. Engineering and Technology Journal, 37, 70–77. https://doi.org/10.30684/etj.37.2A.5
Wang, M., Xing, J., & Liu, Y. (2021). ActionCLIP: A new paradigm for video action recognition. arXiv:2109.08472 [cs].
Wang, W., Bao, H., Dong, L., & Wei, F. (2021). VLMo: Unified vision-language pre-training with mixture-of-modality-experts. arXiv:2111.02358 [cs].
Wilbur, W. J., & Sirotkin, K. (1992). The automatic identification of stop words. Journal of Information Science, 18(1), 45–55.
Wu, T., Terry, M., & Cai, C. J. (2022). AI chains: Transparent and controllable human-AI interaction by chaining large language model prompts. In CHI Conference on Human Factors in Computing Systems (pp. 1–22).
Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA (pp. 3485–3492). https://doi.org/10.1109/CVPR.2010.5539970
Zalejska, J. A., & Hungria, G. R. (2019). Defects in newly constructed residential buildings: Owners' perspective. International Journal of Building Pathology and Adaptation, 37, 163–185. https://doi.org/10.1108/IJBPA-09-2018-0077
Zhang, R. (2019). Making convolutional networks shift-invariant again. In International Conference on Machine Learning (pp. 7324–7334). PMLR.
Zhang, Y., Jiang, H., Miura, Y., Manning, C. D., & Langlotz, C. P. (2020). Contrastive learning of medical visual representations from paired images and text. arXiv:2010.00747 [cs].
Zhang, Y., Macdonald, J. H. G., Liu, S., & Harper, P. W. (2021). Damage detection of nonlinear structures using probability density ratio estimation. Computer-Aided Civil and Infrastructure Engineering, 37(7), 878–893. https://doi.org/10.1111/mice.12772
Zhao, J. J., Mathieu, M., & LeCun, Y. (2017). Energy-based generative adversarial networks. 5th International Conference on Learning Representations, ICLR 2017, Toulon, France.
Zhao, T. Z., Wallace, E., Feng, S., Klein, D., & Singh, S. (2021). Calibrate before use: Improving few-shot performance of language models. In International Conference on Machine Learning (pp. 12697–12706). PMLR.
Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022). Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9), 2337–2348.
Zhu, J., Zhang, C., Qi, H., & Lu, Z. (2020). Vision-based defects detection for bridges using transfer learning and convolutional neural networks. Structure and Infrastructure Engineering, 16, 1037–1049. https://doi.org/10.1080/15732479.2019.1680709
Zhu, X., & Goldberg, A. B. (2009). Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3, 1–130. https://doi.org/10.2200/S00196ED1V01Y200906AIM006

How to cite this article: Yong, G., Jeon, K., Gil, D., & Lee, G. (2023). Prompt engineering for zero-shot and few-shot defect detection and classification using a visual-language pretrained model. Computer-Aided Civil and Infrastructure Engineering, 38, 1536–1554. https://doi.org/10.1111/mice.12954