
Automation in Construction xxx (xxxx) 104499

Contents lists available at ScienceDirect

Automation in Construction
journal homepage: www.elsevier.com/locate/autcon

SODA: A large-scale open site object detection dataset for deep learning in
construction

Rui Duan a, Hui Deng a, b, Mao Tian c, Yichuan Deng a, b, ⁎, Jiarui Lin d
a School of Civil Engineering and Transportation, South China University of Technology, Guangzhou, China
b State Key Laboratory of Subtropical Building Science, Guangzhou 510641, China
c Sonny Astani Department of Civil and Environmental Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, USA
d Department of Civil Engineering, Tsinghua University, Beijing 100084, China

ARTICLE INFO

Keywords:
Dataset
Object detection
Construction site
Deep learning
Computer vision

ABSTRACT

Comprehensive image datasets can benefit the construction industry by serving as the basis for generating deep-learning-based object detection models and for testing the performance of object detection algorithms, but building such datasets is complex and requires vast professional knowledge. This paper develops and publicly releases a new large-scale image dataset specifically collected and annotated for the construction site, called the Site Object Detection Dataset (SODA), which contains 15 object classes categorized as worker, material, machine, and layout. More than 20,000 images were collected from multiple construction sites in different situations, weather conditions, and construction phases, covering different angles and perspectives. Statistical analysis shows that the dataset is well developed in terms of diversity and volume. Further evaluation with two widely adopted deep-learning-based object detection algorithms also illustrates the feasibility of the dataset, achieving a maximum mAP of 81.47%. This research contributes a large-scale open image dataset for the construction industry and sets up a performance benchmark for further evaluation of relevant algorithms.

1. Introduction

The construction industry is still a labor-intensive industry, with most management and interventions of on-site activities relying on manual judgment [1], which makes on-site management difficult and inefficient. Although the emergence of high-resolution monitoring cameras makes remote and dynamic monitoring of the construction site possible, it still requires a lot of manual intervention [2]. The rapid development of computer vision technology makes it possible to automate tasks that cannot be completed by manpower, improving safety management and production efficiency [3]. The importance of cameras in construction management has become increasingly recognized, and practitioners have begun to adopt automated applications powered by computer vision [4]. For example, video surveillance can detect workers' unsafe behaviors and the risks of construction activities [5], and computer vision technology is used to identify workers who do not wear personal protective equipment (PPE) [6-11]. The potential of computer vision technology in construction automation has thus attracted wide attention from academia and industry.

In recent years, deep-learning-based object detection algorithms have developed rapidly, and their detection speed and accuracy have greatly improved; in appropriate application scenarios, recognition accuracy can reach 98% or even higher. Computer vision based on deep learning has significant advantages over traditional image processing and recognition methods in terms of detection speed, algorithm robustness, and feature extraction without manual design [12]. Therefore, introducing deep-learning-based object detection into construction site management is a promising new direction [13]. However, deep learning algorithms are data-hungry, which means that applying object detection on construction sites requires a customized image dataset for the construction field. The complexity and dynamic nature of construction activities bring challenges for image collection and annotation, which is why well-annotated image datasets designed for the construction industry are hardly found in popular open databases such as ImageNet [14].

⁎ Corresponding author at: School of Civil Engineering and Transportation, South China University of Technology, Guangzhou, China.
E-mail address: [email protected] (Y. Deng).

https://doi.org/10.1016/j.autcon.2022.104499
Received 19 February 2022; Received in revised form 17 July 2022; Accepted 23 July 2022
0926-5805/© 20XX


To promote research on object detection in the construction industry, it is necessary to build a large-scale image dataset containing specific objects from the construction site (i.e., worker, material, machine, layout). The existing construction site image datasets are relatively small in scale and have few categories, concentrating on people, PPE, and some machines. This is because: (1) Capturing images of the construction site is more challenging than capturing ordinary objects. Due to security concerns, the construction site is generally closed to the public, and the online resources of construction site images are less common than those of daily objects. (2) It is difficult to obtain data from different angles on construction sites using the conventional monocular cameras installed on-site, which easily causes overfitting of the detection model. (3) The site environment is usually chaotic and dynamic, which makes data collection difficult and increases the cost of annotation. Moreover, professional knowledge is required to correctly annotate the objects in images taken from the construction site.

A comprehensive image dataset for the construction site will benefit the construction industry by serving as the basis for generating deep-learning-based object detection models and for testing the performance of object detection algorithms. Considering the professional knowledge required to build such a dataset, it remains a task to be resolved by people in the industry. The purpose of this research is thus to construct a comprehensive object detection image dataset for the construction site, as shown in Fig. 1. The research process includes category selection, data acquisition, data cleaning, data annotation, dataset analysis, experimental analysis, and benchmarking. A total of 19,846 images were collected, including 286,201 objects across 15 classes. More than 20,000 images of different construction sites were initially obtained using various equipment, including monocular cameras, unmanned aerial vehicles (UAVs), and hook visualization equipment. Thirty-five students majoring in civil engineering were recruited and trained to annotate the images, and the statistics of the data were then analyzed. Finally, the dataset was tested with mainstream object detection algorithms, achieving good performance and providing a benchmark for the selection of algorithms. The images and annotations were then published online. In the future, the dataset will be upgraded regularly, increasing the variety of construction site objects to promote research related to computer vision in construction.

Fig. 1. Example images selected from SODA.

The remainder of this paper is structured as follows: the second section introduces related work on object detection and image datasets. The third section elaborates on the process of building the dataset. In the fourth section, the statistics of the dataset are presented. The fifth section introduces the experimental results of two mainstream one-stage object detection algorithms on the proposed dataset, which provides the benchmark for subsequent research.

2. Related work

2.1. Computer vision and deep learning in construction

Research into computer vision technologies for construction has aroused intense interest in both academia and industry, in areas such as safety monitoring, productivity analysis, and personnel management. Koch et al. [15] reviewed the application of computer vision technology in defect detection and condition assessment of concrete and asphalt civil infrastructure. Xiang et al. [16] proposed an intelligent monitoring method based on deep learning for locating and identifying intruding engineering vehicles, which can prevent damage to buildings. To prevent construction workers from falling from height, Fang et al. [17] developed an automatic method based on two convolutional neural network (CNN) models to determine whether workers wear safety belts. Yang et al. [18] used cameras on tower cranes to record video data and used Mask R-CNN to identify the pixel coordinates of workers and dangerous areas. Chen et al. [19] used construction site surveillance videos to detect, track, and identify excavator activities using three different convolutional neural networks. Deng et al. [20] proposed a method combining computer vision with building information modeling (BIM) to realize automatic progress monitoring of tile installation; this method can automatically and accurately measure the construction progress of the construction site. Fang et al. [21] realized real-time positioning of construction-related entities by using deep learning algorithms combined with semantic information and prior knowledge. Zhang et al. [22] proposed an automatic identification method based on deep learning and ontology, which can effectively identify risks in the construction process and prevent construction accidents. Luo et al. [23] used computer vision and deep learning to automatically estimate the posture of different construction equipment in videos taken at the construction site. Pan et al. [24] proposed a novel framework named Video2entities, which combines visual data and a common knowledge graph as prior information and uses zero-shot learning (ZSL) to realize the detection of unknown targets and effectively improve the self-learning ability of the object detection algorithm.


For the reference of interested readers, this study also collected several reviews [25-27] of computer vision in the construction industry. The literature review shows that there is a lack of image datasets of sufficient scale, which hinders the continuous development of computer vision in the construction field. Moreover, defect inspection, safety monitoring, and performance analysis with computer vision in the construction field require further investigation.

2.2. Datasets for deep learning based object detection

According to domain coverage, existing computer vision datasets can be divided into general image datasets and domain image datasets. General datasets include categories from daily life, whereas domain datasets contain the categories of specific fields. General datasets are mainly composed of natural categories, such as people, animals, and vehicles. The MNIST dataset [28] contains 70,000 handwritten digit images and is an entry-level dataset for deep learning. The PASCAL VOC dataset [29], widely used for testing deep learning algorithms, contains 11,000 images of 20 classes. Microsoft COCO [30] is a dataset containing 160,000 images with 91 categories, with text descriptions of category and location. Created by Professor Li Feifei's team, ImageNet [14] contains over 14 million images, covering more than 20,000 categories; most research on image classification, localization, and object detection has benefited from this dataset. The capacity and variety of general datasets are constantly increasing and improving, giving birth to a large number of excellent computer vision models (especially deep-learning-based ones) that promote the rapid development of computer vision technology.

However, the general image datasets introduced above often lack categories related to the construction field. In recent years, some image datasets have been developed for the construction industry. Tajeen et al. [31] collected thousands of images of construction machines covering five kinds of construction equipment (excavators, loaders, bulldozers, rollers, and backhoe diggers). Kim et al. [32] proposed a construction object detection method combining a deep convolutional network and transfer learning to accurately identify construction equipment; to evaluate the proposed method, a benchmark dataset called AIM was created, containing 2920 construction machine images. Kolar et al. [33] focused on the detection of guardrails on construction sites; by adding background images to a three-dimensional model of the guardrail, the authors created an augmented dataset containing 6000 images. Li et al. [6] established and released a dataset containing 3261 images of safety helmets and used SSD-MobileNet to detect unsafe operations on construction sites. An et al. [34] created the Moving Objects in Construction Sites (MOCS) dataset by collecting more than 40,000 images from 174 construction sites and annotating 13 types of moving objects; they used pixel segmentation to precisely annotate objects and tested the dataset on 15 different deep neural networks. Wang et al. [7] constructed a dedicated image dataset composed of 1330 images for Personal Protective Equipment (PPE) called Color Helmet and Vest (CHV). Xiao et al. [35] developed the Alberta Construction Image Dataset (ACID), an image dataset specifically for identifying construction machinery, and manually collected and annotated 10,000 images of 10 kinds of construction machines; four object detection algorithms, YOLO v3, Inception SSD, R-FCN-ResNet, and Faster R-CNN-ResNet, were trained on the dataset, obtaining satisfactory mAP and average detection speed.

As mentioned above, computer-vision-related research in construction has aroused intense interest in both academia and industry. However, few related datasets are available, and most of them concentrate only on workers, PPE, or machines. According to the literature review, there are few studies on the identification and detection of construction materials [36,37] and layouts on the construction site. Therefore, it is necessary to broaden the coverage by developing image datasets containing worker, machine, material, and layout for research on deep learning in the construction industry.

2.3. Deep learning based object detection algorithms

Object detection is an important application of deep learning, which focuses on determining the category and pixel location of targets [36]. As shown in Fig. 2, object detection algorithms are divided into two-stage and one-stage object detection. The two-stage model is also called the region-based method because of its two-stage processing of images: it first extracts a series of candidate boxes and then uses convolutional neural networks for classification. Two-stage object detection has higher precision but relatively slow speed; representative algorithms are R-CNN [38], Fast R-CNN [39], and Faster R-CNN [40]. YOLO [41] is a one-stage object detection framework that obtains prediction results directly from the image; detection is achieved by extracting features only once, so it is faster than two-stage algorithms. Object detection in the construction field has a long history of exploration, and many outstanding research results have emerged in defect inspection [15], safety monitoring [26], and performance analysis [27].

3. Methodology

This study constructs a new large-scale construction site image dataset, called the Site Object Detection Dataset (SODA), which contains 15 classes of objects in four categories. More than 20,000 images were collected on several construction sites in the Greater Bay Area, China, using different equipment, at different angles and times of day. Thirty-five students majoring in civil engineering were trained in image processing and annotation. Each student was responsible for about 600 images, which were then checked by several graduate students and experts from the construction industry. As Fig. 3 shows, the building process of SODA mainly includes four steps: object category selection, image data acquisition, image data cleaning, and image annotation.

Fig. 2. Flowchart of the object detection algorithm.


Fig. 3. Building process of SODA.

3.1. Category selection

Zhou et al. [42] used the principle called 4M1E, which divides risk factors into (hu)man, material, machine, method, and environment, to assess and manage risks during the construction process. According to the 4M1E principle, the common visible objects on the construction site are classified into four categories: worker, material, machine, and layout. These four categories (person, material, machine, and layout) correspond to the physical entities in 4M1E (man, material, machine, and environment). This paper further expands these four categories into 15 common object classes suitable for deep learning object detection in the first edition of SODA. The worker category includes person, helmet, and reflective vest; the material category includes board, wood, rebar, brick, and scaffold; the machine category includes handcart, cutter, electric box, hopper, and hook; the layout category includes fence and slogan. Table 1 summarizes the definition of each class, and Fig. 4 shows examples of the corresponding categories. (See Table 2.)

3.2. Data acquisition

The construction of a conventional dataset often collects images from online resources [43]. Therefore, this study initially applied web crawlers and other methods to obtain online data, but the collected images did not meet the requirements. Since the construction site is often messy and target objects are frequently mixed with unrelated objects, the images in SODA are all collected on actual construction sites. Data acquisition on construction sites mainly adopts three methods: UAVs, handheld monocular cameras, and construction site monitoring video (hook visualization). All collected images are eventually converted into JPG format. Ten construction sites covering various construction stages, from the foundation pit stage to the decoration stage, were visited. As a result, a total of 21,863 images were collected.

In the process of data collection, there are some noteworthy problems that may benefit researchers with the same interest. (1) Compared with the worker and material categories, the capacity of the machine and layout categories is relatively small, so it is necessary to capture a full range of shots of the limited targets from multiple angles. (2) Due to the confusion, visual blind spots, and occlusion on construction sites, it is challenging to collect positive samples using a single shooting method, so different shooting methods should be integrated. As shown in Fig. 5, the hook is shot from different angles using different devices to provide comprehensive data. Using UAVs and handheld shooting methods on construction sites involves security risks, and careful operation is needed. (3) The data obtained from high-altitude cameras can record the panoramic view of the construction site and can also be used to photograph large-scale machines such as tower cranes; however, these video data are usually too vague to annotate for small objects.

3.3. Data cleaning

After data acquisition is completed, the data cannot be annotated directly but should first be processed to eliminate invalid data. Four objectives of data cleaning are proposed in this study: removing duplicates, removing ambiguities, removing non-targets, and privacy protection. The removal criteria and examples are shown in Fig. 6.

3.3.1. Removing duplicates
The production of the dataset requires that each image be significantly different from other images in terms of angle, position, and illumination. Therefore, repetitive images should be manually removed. When shooting with the handheld monocular camera, many similar images may be collected accidentally. Moreover, some images of SODA are obtained by extracting video frames: although frames are sampled from the video every 30 frames, relatively repetitive images still need to be manually removed because of slow progress in some construction activities.

3.3.2. Removing ambiguities
In the shooting process, some videos and images are blurred due to weather, lighting, human factors, and other reasons. These images are not only challenging to annotate but also affect the training effect of the deep learning model, so they are deleted before or during annotation.
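The frame-sampling and blur-screening steps described in Sections 3.3.1 and 3.3.2 can be scripted. The following is a minimal sketch using OpenCV; the 30-frame interval follows the text, while the Laplacian-variance threshold (100) and the file-naming scheme are illustrative assumptions rather than the authors' exact settings.

import cv2

def extract_clear_frames(video_path, out_dir, interval=30, blur_threshold=100.0):
    """Save every `interval`-th frame whose Laplacian variance suggests it is not blurred."""
    cap = cv2.VideoCapture(video_path)
    index, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % interval == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()  # low variance -> likely blurred
            if sharpness >= blur_threshold:
                cv2.imwrite(f"{out_dir}/frame_{index:06d}.jpg", frame)
                saved += 1
        index += 1
    cap.release()
    return saved

Candidate frames that pass this automatic screening still go through the manual duplicate and ambiguity checks described above.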


3.3.3. Removing non-targets
During on-site shooting, it is inevitable to capture some images that meet the conditions of non-repetition and non-ambiguity but do not contain any target of the dataset. Such images are of no significance for the research and need to be manually removed.

3.3.4. Privacy protection
Privacy is an essential issue when publishing public datasets. This study spent considerable effort on privacy processing (200 working hours for all annotators, accounting for about two-thirds of the data cleaning time). Considering engineering ethics, the proposed privacy protection method is divided into two parts: company information (logo) processing and human characteristics processing. To avoid infringing on companies' commercial secrets or triggering property rights issues, all company logos have been blurred. Meanwhile, to avoid ethical issues, the faces of on-site workers are blurred accordingly. These operations are performed manually by the recruited students.

The data cleaning process took about 300 h, accounting for 16.16% of the total dataset construction time. In the removal process, 2017 invalid images were deleted. In data privacy processing, after one round of privacy processing and a joint inspection by the authors and experts, the students still had 1307 omissions (37 omissions per person on average).

Table 1
Related description of selected objects in the construction site.

Object | Description | Example
person, vest, helmet | Workers on the construction site should wear PPE (safety helmet, reflective vest). | Fig. 3(1), 3(2), 3(3)
board | Boards in construction engineering are used to support the weight and lateral pressure of a concrete mixture with plastic flow properties so that it can solidify according to the design requirements. | Fig. 3(4)
wood | Wood is cut into square strips according to actual processing needs. It is generally used for decoration, door and window materials, template support, and roof truss materials in structural construction. | Fig. 3(5)
rebar | Rebar refers to the steel used for reinforced concrete and prestressed reinforced concrete. Its cross-section is circular, and sometimes square with rounded corners. | Fig. 3(6)
brick | Concrete brick is a new building material that is lightweight and porous, with good thermal insulation, fire resistance, plasticity, and seismic capacity. | Fig. 3(7)
scaffold | A scaffold is a working platform built to ensure the smooth progress of each construction process. | Fig. 3(8)
handcart | A handcart on the construction site is a two-wheeled, manually pushed and pulled handling vehicle. | Fig. 3(9)
cutter | A cutter is a machine used for material processing on the construction site. Commonly used machines are semi-automatic cutting machines and CNC cutting machines. | Fig. 3(10)
electric box | All electrical equipment on the construction site must have its own dedicated electric switch box, which allows convenient switching of the circuit and reasonable distribution of electric energy. | Fig. 3(11)
hopper | Hoppers (tower crane hoppers, ash hoppers, sand hoppers, concrete hoppers) are mainly used in building foundations, pouring concrete, piling, and material transportation in high-rise building construction. | Fig. 3(12)
hook | The hook of a tower crane is used to connect objects and ropes. | Fig. 3(13)
fence | A fence is a protective facility that prevents accidental intrusion into the construction site. | Fig. 3(14)
slogan | Slogans are used to alert workers to civilized and safe construction, spread enterprise culture, etc. | Fig. 3(15)

3.4. Data annotation

The Visual Object Classes (VOC) format is the standard of the PASCAL VOC Challenge [29], which is widely used in the field of computer vision and is an industry-recognized dataset format. To ensure high annotation quality, four standards are strictly followed when annotating an image: (1) the annotation box must enclose the target without cutting into it; (2) frame selection of irrelevant background is minimized; (3) for similar targets that are close to each other, a single frame must not be used to select multiple targets; (4) when a target is partially occluded and its full extent cannot be annotated, it is omitted. As shown in Fig. 7, the annotation of the blue box is accepted instead of the red box.

As shown in Fig. 8, the software labelImg is used by the students to annotate the 15 categories of images. LabelImg is an open-source image annotation tool written in Python [44]. The labeled image data are saved as XML files in PASCAL VOC format. The XML format is shown in Fig. 9 and contains information such as the storage path, image name, and bounding-box coordinates. The 35 students were trained as annotators after receiving tutorials that explained how to annotate and frame targets. To ensure annotation accuracy, the labels completed by the students were regularly examined by a graduate student and the authors.

After obtaining the annotation documents, a round of inspection with another expert was conducted. Although the students had been well trained, some unexpected errors remained: 1) spelling errors in label words, 2) plural errors in labels, and 3) unknown labels. As Table 3 shows, all spelling and plural errors were recorded. In addition, 48 'w' and 2 'wwww' labels appeared; the reason is that the shortcut key of the annotation software is 'w', and annotators occasionally typed it into the label name by mistake. All of these errors have been corrected in the officially released dataset.

After completing the above four processes, the annotations of 19,846 images and the corresponding XML files were obtained. A statistic of the resources (time, manpower, tools, sites) used in the whole process is presented. In terms of manpower and time, three graduate students took pictures with various equipment on the construction sites during data acquisition, which took about 24 working hours. The data cleaning and data annotation processes were carried out by 35 undergraduates, and all students' cleaning and annotation time was counted. As Fig. 10 shows, data cleaning took 300 working hours (16.16%) and data annotation took 1246 working hours (67.13%). During the inspection process, two graduate students and two construction industry experts conducted two rounds of sampling inspection, taking a total of 286 h (15.41%). UAVs, surveillance cameras, mobile phones, single-lens reflex cameras (SLRs), and hook visualization equipment were utilized as image capturing tools. A total of 10 construction sites in Guangzhou, Shenzhen, and Dongguan in the Greater Bay Area, China were visited.


Table 2
Categories and each class of the object label.

Category | Labels
Person   | person, helmet, vest
Material | board, wood, rebar, brick, scaffold
Machine  | handcart, cutter, ebox, hopper, hook
Layout   | fence, slogan

Fig. 4. Example of the corresponding classes.

4. Statistics of the dataset

In this section, the proposed dataset is analyzed and compared with the current open object detection image datasets in the construction industry. The proposed SODA contains a total of 19,846 images, and the dominant image size in the dataset is 1920 * 1080, accounting for 86%. A total of 286,201 objects are annotated, and the object distribution is shown in Fig. 11 and Fig. 12, providing a quantitative understanding of the dataset. Among the categories, worker has the largest number of labels, while machine and layout have fewer, which is consistent with the situation on construction sites. Each class has over 1000 targets, and each category has more than 20,000 objects.

Some object detection models specify the length-width ratio and range of the detected objects, so a statistical clustering of the sample lengths and widths in the dataset is needed. The k-means clustering algorithm [45] is thus used to analyze the sample data by visualizing the length-width ratio and range (shown in Fig. 13).

To reflect the characteristics of the original bounding-box data distribution of SODA and to compare the distribution characteristics of each class, box-plots of the width and length distributions of the bounding boxes of each class, with outliers removed, are shown in Fig. 14 and Fig. 15.

It is also worth noting that the images are taken from different angles, including the hand-held camera short-range shooting perspective, the hand-held camera long-range shooting perspective, the UAV perspective, and the tower crane hook visual system perspective, to achieve full coverage. The distribution is shown in Fig. 16.

Table 4 shows a comparison of the proposed dataset with the current popular open object detection datasets in the construction industry. The result shows that the proposed SODA not only contains the largest number of objects and categories but also is the first to realize full coverage of the four categories of worker, material, machine, and layout. Moreover, this study is the first to use image data gathered from hook visualization equipment, which has been a neglected part of previous studies.
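As a rough illustration of the bounding-box clustering mentioned above (Fig. 13), the sketch below runs k-means on the annotated box widths and heights to obtain representative anchor sizes. The number of clusters and the plain Euclidean distance are assumptions for illustration, not the authors' exact settings.

import numpy as np

def kmeans_anchors(box_wh, k=9, iters=100, seed=0):
    """Cluster (width, height) pairs into k representative anchor sizes with plain k-means."""
    rng = np.random.default_rng(seed)
    boxes = np.asarray(box_wh, dtype=float)                      # shape (N, 2)
    centers = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # assign each box to the nearest center
        dists = np.linalg.norm(boxes[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([boxes[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers[np.argsort(centers.prod(axis=1))]             # sorted by box area

# Example call with hypothetical (width, height) pairs in pixels:
# anchors = kmeans_anchors([(35, 90), (120, 60), (300, 240), (15, 20)], k=2)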

Fig. 5. Example of different angles using different devices.


Fig. 6. Example of images that need to be removed and processed.


Fig. 7. Example of annotation standard.



Fig. 8. Interface of labelImg.

In summary, the following points distinguish SODA from other published datasets. (1) SODA not only contains the largest number of objects and categories but also is the first to realize full coverage of the four categories of worker, material, machine, and layout. SODA intentionally avoids most categories already covered by existing mature published datasets and is committed to complementing existing datasets rather than completely replacing them. (2) SODA also achieves authenticity and diversity by collecting images with a variety of equipment from different angles at the construction sites. Although some datasets have achieved multi-view image data collection, this study is the first to use image data gathered from hook visualization equipment, which has been a neglected part of previous studies.


Fig. 9. XML format files.
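Fig. 9 shows the PASCAL VOC XML layout used by SODA. As an illustration, the sketch below reads one such annotation file with Python's standard library; the file path is hypothetical, and only the standard VOC fields (filename, size, object/name, bndbox) are assumed.

import xml.etree.ElementTree as ET

def read_voc_annotation(xml_path):
    """Return the image name, (width, height), and a list of (label, xmin, ymin, xmax, ymax)."""
    root = ET.parse(xml_path).getroot()
    filename = root.findtext("filename")
    size = root.find("size")
    width, height = int(size.findtext("width")), int(size.findtext("height"))
    objects = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        objects.append((
            obj.findtext("name"),
            int(float(box.findtext("xmin"))), int(float(box.findtext("ymin"))),
            int(float(box.findtext("xmax"))), int(float(box.findtext("ymax"))),
        ))
    return filename, (width, height), objects

# e.g. read_voc_annotation("Annotations/000001.xml")  # hypothetical path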

(3) SODA elaborates on the construction process of the dataset, the problems encountered, and their solutions, which provides convenience for subsequent research in establishing such image datasets and training deep learning models. (4) SODA is continuously updated, with new categories and objects enriched in subsequent updates. Moreover, capability expansions including instance segmentation and image captioning will be uploaded in the near future.

5. Experiments on the dataset

This study also aims to provide benchmarks for researchers to select appropriate algorithms for relevant studies. The authors also welcome other researchers to use different deep learning algorithms to verify SODA and compare their performance with the results of this study. The experiment proves the feasibility of using deep learning object detection algorithms to detect construction-related worker, material, machine, and layout objects in images and videos.

5.1. Algorithm selection

Xiao et al. [34] concluded that one-stage detection algorithms are more suitable for identification tasks in construction engineering than two-stage detection algorithms. Although a two-stage algorithm can reach higher accuracy than a one-stage algorithm, the difference is insignificant, and the one-stage algorithm is much faster. It can also avoid background errors and learn the generalization characteristics of objects, which makes it more suitable for complex scene recognition on construction sites.

Table 3
Examples of the spelling and plural mistakes in the labels.

Right  | scaffold | hopper | helmet | board        | person | vest  | brick
Error  | seaffold | hooper | helemt | borad/boadrd | preson | vests | bricks
Number | 14       | 8      | 8      | 4/1          | 1      | 1     | 345
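A minimal sketch of the kind of post-processing implied by Table 3 is given below: rewriting mistyped class names inside the VOC XML files. The correction map reflects the errors listed in Table 3; the directory layout is a hypothetical assumption.

import xml.etree.ElementTree as ET
from pathlib import Path

# typo -> canonical label, following Table 3
FIXES = {"seaffold": "scaffold", "hooper": "hopper", "helemt": "helmet",
         "borad": "board", "boadrd": "board", "preson": "person",
         "vests": "vest", "bricks": "brick"}

def fix_labels(annotation_dir):
    for xml_file in Path(annotation_dir).glob("*.xml"):
        tree = ET.parse(xml_file)
        changed = False
        for name in tree.getroot().iter("name"):
            if name.text in FIXES:
                name.text = FIXES[name.text]
                changed = True
        if changed:
            tree.write(xml_file)  # overwrite the file with corrected labels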

Fig. 10. Time distribution for each process of building the dataset.


Fig. 11. Number of objects and images in each category.

Fig. 12. Number of objects and images in each class.

Considering that YOLO is developed with 'flexible' properties and is widely used, YOLO v3 and YOLO v4 are selected to show their performance on SODA.

As a one-stage detector, YOLO does not generate proposal regions but directly divides the image into S × S grid cells, and each grid cell detects the objects whose centers fall into it. The YOLO v3 network is shown in Fig. 17; it is mainly composed of the backbone (Darknet53), the neck (FPN), and the head (YOLO Head). The residual network in Darknet53 performs successive convolutions to extract image features. Each convolution layer is a unique convolution structure in which L2 regularization is applied, followed by batch normalization and LeakyReLU. The neck is responsible for constructing the Feature Pyramid Network (FPN), using the image features for up-sampling and feature fusion. The processed, enhanced features are output to three YOLO heads for prediction; the three heads are designed for large, medium, and small-scale object prediction. Each YOLO head uses a 3 × 3 convolution to integrate the features and then a 1 × 1 convolution to adjust the number of output channels and predict the results.


Fig. 13. Anchors statistical results of the k-means clustering algorithm.

Fig. 14. Box-plot of the width distribution of each class.

YOLO v4 can be seen as an improved version of YOLO v3: it adds a series of tricks to the YOLO v3 network to improve training and prediction. The YOLO v4 network is shown in Fig. 18. In the backbone, DarkNet53 is replaced by CSPDarkNet53, the activation function in the convolution layers is changed from LeakyReLU to the Mish function, and the normal residual block is changed into a CSPNet structure. SPP and PAN structures are used to form a new FPN in the neck. The SPP structure is added after the last convolution layer of the backbone and performs parallel max pooling with kernels of 13 × 13, 9 × 9, 5 × 5, and 1 × 1, so that the network can enlarge the receptive field and separate the salient features.
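As an illustration of the SPP block just described, a minimal PyTorch-style sketch is given below. The kernel sizes follow the text; the exact layer arrangement in the authors' implementation may differ.

import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial pyramid pooling: parallel max pooling at several scales, then concatenation."""
    def __init__(self, kernel_sizes=(13, 9, 5)):
        super().__init__()
        # stride 1 and symmetric padding keep the spatial size unchanged
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):
        # the 1 x 1 branch is simply the identity feature map
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

# e.g. SPP()(torch.randn(1, 512, 19, 19)).shape  ->  torch.Size([1, 2048, 19, 19])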


Fig. 15. Box-plot of the length distribution of each class.

Fig. 16. Distribution of image resolution and shooting perspective.


Table 4
Comparison of the SODA dataset with other datasets in the construction industry.

Dataset  | Images | Objects | Category (Person/Material/Machine/Layout) | Classes | Image size           | Year
SODA     | 19,846 | 286,201 | 3 / 5 / 5 / 2                             | 15      | 1920*1080 and higher | 2022
CHV      | 1330   | 9209    | 6 / - / - / -                             | 6       | 608*608              | 2021
MOCS     | 41,668 | 222,861 | 1 / - / 12 / -                            | 13      | 1200*--              | 2020
ACID     | 10,000 | 15,767  | - / - / 10 / -                            | 10      | >608*608             | 2020
AIM      | 2920   | 2920    | - / - / 5 / -                             | 5       | 500*-                | 2017
Tajeen's | 2000   | -       | - / - / 5 / -                             | 5       | -                    | 2014

Fig. 17. Network structure of YOLO v3.



The PANet structure adds top-to-bottom down-sampling feature extraction on the basis of the traditional up-sampling feature extraction steps of FPN, strengthening feature extraction by combining up-sampling and down-sampling.

5.2. Evaluation metrics

In this study, the mean average precision (mAP) [46] is applied to evaluate the performance of the model. Using mAP eliminates the limitation of a single evaluation index by combining Precision and Recall. Intersection over Union (IoU) is a basic indicator for evaluating the performance of an object detection algorithm; it measures the degree of overlap between the detection box and the ground truth box. The calculation of IoU is shown in Fig. 19: the numerator is the overlap area between the detection box and the ground truth box, and the denominator is the total area covered by the two boxes. After obtaining the IoU index, TP (True Positives), TN (True Negatives), FP (False Positives), and FN (False Negatives) need to be counted. T or F indicates whether the sample is correctly predicted, and P or N indicates whether the sample is predicted to be positive or negative: TP means a positive sample is predicted correctly, TN means a negative sample is predicted correctly, FP means a negative sample is predicted as positive, and FN means a positive sample is predicted as negative. From these four indicators, Precision and Recall can be calculated, as shown in formulas (1) and (2):

$\mathrm{Precision} = \dfrac{TP}{TP + FP}$    (1)

$\mathrm{Recall} = \dfrac{TP}{TP + FN}$    (2)

Average precision (AP) considers the combination of different Precision and Recall points; it is calculated as the area under the Precision-Recall curve, as shown in formula (3). By averaging the AP over all classes, the mean average precision (mAP) of the overall performance of the model is obtained, as shown in formula (4):

$AP = \displaystyle\int_{0}^{1} P(R)\,\mathrm{d}R$    (3)

$mAP = \dfrac{1}{N} \displaystyle\sum_{i=1}^{N} AP_i$    (4)
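As a compact numerical illustration of Eqs. (1)-(4), the sketch below computes the IoU of two boxes in (xmin, ymin, xmax, ymax) form and approximates AP as the area under the precision-recall points by trapezoidal integration. It is a generic sketch, not the exact evaluation protocol behind the reported results.

import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two (xmin, ymin, xmax, ymax) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def average_precision(recall, precision):
    """Area under the precision-recall curve (Eq. (3)) by trapezoidal integration."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    return float(np.trapz(p, r))

# mAP (Eq. (4)) is then the mean of the per-class AP values:
# mAP = sum(ap_per_class.values()) / len(ap_per_class)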


Fig. 18. Network structure of YOLO v4.

Fig. 19. Calculation process of the IOU.

5.3. Result

The experiment is performed on a computer with the following configuration: an NVIDIA GeForce RTX 2060 GPU, an Intel(R) Core(TM) i7-10750H CPU @ 2.60 GHz, and 16.0 GB of RAM. All algorithms are implemented using the PyTorch framework. SODA is randomly divided into a training set (17,861 images) and a test set (1985 images) according to a 9:1 ratio. Training runs for 100 epochs, divided into two stages: a freezing stage and a thawing stage. The first 50 epochs freeze the main parameters of the model, which allows a larger learning rate and helps the training jump out of local optima. In the second 50 epochs, the backbone parameters are thawed and the learning rate is reduced, so the backbone parameters change greatly in this stage.

Fig. 20 shows the loss curves (training and validation loss) of the two deep learning algorithms on SODA, in which the red line is the training loss and the blue line is the validation loss. It can be seen that both deep learning algorithms fit the SODA dataset well. It should also be noted that the loss values of different algorithms are not necessarily comparable, because the tested algorithms implement different loss functions. The training loss and validation loss keep decreasing and finally reach stable values as the number of epochs increases, which shows that the models have converged. They decrease again after the rise at the 50th (thawing) epoch, and the validation loss is higher than the training loss. The learning curves show the robustness and universality of the dataset.
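A minimal PyTorch-style sketch of the two-stage schedule described above is given below (freeze the backbone for the first 50 epochs, then unfreeze it and lower the learning rate). The backbone attribute name, optimizer, and learning-rate values are illustrative assumptions, not the authors' exact configuration.

import torch

def train_two_stage(model, train_loader, loss_fn, device="cuda",
                    freeze_epochs=50, total_epochs=100):
    model.to(device)
    for epoch in range(total_epochs):
        freeze = epoch < freeze_epochs
        # stage 1: backbone frozen, larger learning rate; stage 2: everything trainable, smaller rate
        for p in model.backbone.parameters():   # assumes the model exposes a `backbone` module
            p.requires_grad = not freeze
        lr = 1e-3 if freeze else 1e-4
        optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=lr)
        for images, targets in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images.to(device)), targets)
            loss.backward()
            optimizer.step()

Rebuilding the optimizer at each epoch is a simplification for clarity; in practice the optimizer would typically be reconstructed only once at the stage boundary.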

Fig. 20. Training loss and validation loss of YOLO (v3, v4).


After training, the detectors are evaluated on the test set. The overall performance results on the test set for YOLO v3 and YOLO v4 are shown in Table 5 and Figs. 21-22. The results show good detection performance, with 71.22% mAP for YOLO v3 and 81.47% mAP for YOLO v4. Due to the differences in network structure, each category performs differently, but the results of both models show that the mAP of the material category is higher while the results for the worker category are lower. The highest AP is obtained for the hook (92.81% in YOLO v3) and the hopper (95.18% in YOLO v4); the lowest AP is obtained for the fence (50.99% in YOLO v3) and the helmet (55.62% in YOLO v4). In terms of detection speed, YOLO v4 performed better at 31.94 FPS than the 25.06 FPS of YOLO v3. Table 6 compares the object detection models trained on SODA with those trained on other image datasets in the construction industry. With the same detector and backbone, the mAP of SODA reached 81.47%, lower than the highest ACID score of 87.8. As shown in Figs. 14-15, the SODA dataset contains numerous objects with different pixel sizes.

Table 5
Performance comparison on the SODA dataset for YOLO v3 and YOLO v4 (per-class AP and mAP, %).*

Algorithm | person | helmet | vest  | board | wood  | rebar | brick | scaffold | handcart | cutter | ebox  | hopper | hook  | fence | slogan | mAP
YOLO v3   | 73.71  | 52.88  | 57.67 | 83.34 | 83.32 | 54.73 | 76.02 | 86.43    | 74.34    | 64.61  | 69.05 | 89.42  | 91.76 | 50.99 | 60.07  | 71.22
YOLO v4   | 72.42  | 55.62  | 66.09 | 92.79 | 91.41 | 73.21 | 90.89 | 92.92    | 89.45    | 87.96  | 76.11 | 95.18  | 92.81 | 74.11 | 71.08  | 81.47

(Worker: person, helmet, vest. Material: board, wood, rebar, brick, scaffold. Machine: handcart, cutter, ebox, hopper, hook. Layout: fence, slogan.)
* The best and worst per-class results of each model are highlighted in red and green in the original chart.


Fig. 21. mAP of YOLO (v3, v4).



Fig. 22. Log-average miss rate of YOLO (v3, v4).

Table 6
Comparison of object detection results by different algorithms between SODA and other datasets.

Dataset  | Detector                | Backbone      | Input size | mAP   | AP (highest)         | AP (lowest)          | Speed (fps)
SODA     | YOLO v3                 | Darknet53     | 416 × 416  | 71.22 | 91.76 (hook)         | 50.99 (fence)        | 25.06
SODA     | YOLO v4                 | CSPDarknet53  | 416 × 416  | 81.47 | 95.18 (hopper)       | 55.62 (helmet)       | 31.94
MOCS     | YOLO v3                 | Darknet53     | 608 × 608  | 33    | 56.809 (worker)      | 16.782 (hook)        | 30
ACID     | YOLO v3                 | Darknet53     | 608 × 608  | 87.8  | 94.9 (truck)         | 62.0 (tower crane)   | 26.3
CHV      | YOLO v3                 | Darknet53     | 416 × 416  | 82.65 | 88.19 (white helmet) | 77.51 (blue helmet)  | 27.15
CHV      | YOLO v4                 | CSPDarknet53  | 608 × 608  | 84.16 | 90.57 (white helmet) | 78.14 (blue helmet)  | 30.18
AIM      | R-FCN                   | ResNet50      | 400-700    | 96.33 | 99.20 (mixer truck)  | 94.43 (dump truck)   | 14
Tajeen's | Torralba et al. [46]    | -             | 375 × 250  | 96.8  | 98.0 (dozer)         | 90.0 (excavator)     | 18.6 s per picture
Tajeen's | Felzenszwalb et al. [47]| -             | 375 × 250  | -     | 99.0 (roller)        | 95.0 (excavator)     | 1.71 s per picture


Therefore, it is reasonable to believe that the detection task on SODA is more complex than detecting a single category of objects. The proposed SODA also yields the highest recognition speed of 31.94 fps, the fastest among current construction industry datasets. Moreover, SODA and CHV reach a similar conclusion: the performance (mAP and speed) of YOLO v4 is better and more comprehensive than that of YOLO v3.

Fig. 23 shows examples of detection results. The images are selected from the validation set, which is never seen during training. The four images in the first row were taken by cameras and mobile phones; the second row of images was captured by hook visualization devices, UAVs, and surveillance cameras. From the comparisons in Fig. 23, it is found that the identified objects are correctly classified, and there are few false detections in both YOLO v3 and YOLO v4. The detection results of YOLO v3 and YOLO v4 are not significantly different when the background is simple and there are few objects. In the comparison of Fig. 23(2), YOLO v4 detects some remote, small, and partially occluded objects better, and in large-scale images such as Fig. 23(7), YOLO v4 can identify more small objects. In complex and crowded scenes, YOLO v3 has some missed detections compared with YOLO v4. In summary, YOLO v3 and YOLO v4 have similar recognition performance in simple scenes, but YOLO v4 has higher recognition ability in complex scenes. In the comparison of Fig. 23(9), neither YOLO v3 nor YOLO v4 identified the people sheltered by steel cages, but this is reasonable because the annotation criteria of this study do not annotate occluded workers. The reason why YOLO v4 performs better than YOLO v3 is that YOLO v4 makes improvements on the basis of the YOLO v3 network, which broadens the receptive field and strengthens feature extraction, so its detection of small remote targets and partially occluded targets is better. Some other results also met expectations: for example, in the worker category, the accuracy ranking is person, vest, and helmet, which is consistent with the sizes of the targets; and the highest AP scores of both YOLO v3 and YOLO v4 appear in the machine category, which is less challenging to detect. For small targets such as the helmet, the AP of YOLO v4 is nearly 3% higher than that of YOLO v3. Choosing a more advanced backbone feature extraction network may yield better detection results in practical engineering. Owing to space limitations and the theme of publishing the dataset, further analysis of the algorithms is not discussed in this paper.

6. Conclusions

6.1. Conclusion

In this research, a dataset called SODA is developed for object detection on the construction site. SODA is an image dataset in VOC format containing 19,846 images and annotation information for 286,201 objects. SODA has 15 classes, which cover most of the common objects on the construction site. SODA is tested on two mainstream one-stage object detection algorithms, after which a series of training results and benchmarks are obtained.

In summary, the contributions of SODA are as follows. (1) The SODA dataset is built specifically for the construction industry with high-quality image data. All images of SODA are collected from actual construction sites at different construction stages, angles, and times. Training on the proposed dataset will improve the practicability and effect of object detection in the construction industry more than general datasets do. In addition, SODA has more images and 4M1E categories than the existing datasets in the construction industry, which broadens the detection range beyond common construction workers, PPE, and machines. (2) We share relevant experience in building such a dataset, according to which researchers can increase the number of categories and images to update the proposed dataset iteratively. A detailed analysis of the image data is carried out, including statistical analysis of the number of images and identified objects per category, statistical analysis of image resolution and shooting perspective, clustering analysis of all bounding boxes, and box-plot analysis of the pixel length and width of each class. A comparison of the SODA dataset with other open datasets in the construction industry is presented and analyzed. Researchers can not only choose specific categories of images for further in-depth research but also expand data on the basis of SODA. (3) To explore the applicability of deep-learning-based object detection on the construction site using SODA, two one-stage object detection algorithms,
Fig. 23. Identification result of YOLO v3 and YOLO v4.


YOLO v3 and YOLO v4 have been selected to conduct the benchmark ing. Yichuan Deng : Conceptualization, Methodology, Writing
tests. Experimental results show that both YOLO v3 and YOLO v4 spend – review & editing, Funding acquisition. Jiarui Lin : Conceptu-
less time to train the model and reach a relatively high accuracy rate, alization, Resources, Validation, Writing – review & editing.
which prove their well performance in terms of speed and accuracy.
The experiment conducted in this study is also the first case of detecting Declaration of Competing Interest
multiple 4M1E categories in a single object detection model. This work
can provide insights for research on integrating multiple tasks regard- The authors declared that they have no conflicts of interest to this
ing construction site safety monitoring, construction progress analysis, work. We declare that we do not have any commercial or associative in-
emergency response, and civilized construction. terest that represents a conflict of interest in connection with the work
This study is the first open dataset to cover the largest scale of object submitted.

F
categories of the construction site, and elaborates on the construction
process of the dataset about the problems encountered and the solu-
tions. Meanwhile, it also establishes the benchmark to provide insights
for other researchers to select detectors. Notably, SODA is continuously
updated, with new categories and objects enriched in subsequent up-
dates. Moreover, capability expansion including instance segmentation Acknowledgments
and image caption will be uploaded in the near future.
The authors would like to acknowledge the support by
6.2. Limitations and future work Guangdong Science Foundation (Grant No.
2022A1515010174); the support by the State Key Lab of Sub-

PR
Several limitations of research should be mentioned. (1) It is recom- tropical Building Science, South China University of Technol-
mended to add more categories and more data to enrich the SODA ogy (No. 2022ZB19); the support by the Guangzhou Science
dataset. Although the category and capacity of SODA are larger than and Technology Program (No. 202201010338); and support by
other datasets in the construction industry, they are still relatively the National Natural Science Foundation of China (No.
smaller than other datasets in the general computer vision community. 51908323, No. 72091512).
(2) Further research on annotation tasks is needed. In this study, the an- The author would also like to pay special tribute to students who
notation of the SODA dataset is object-level, and only the boundary contribute to the data cleaning and annotation process of SODA at the
frame of the object is annotated rather than pixel-level. This indicates South China University of Technology. Their names are Junxiong
ED
that SODA can only be trained on deep learning object detection algo- Zhang, Fengning Chen, Hongfeng Chen, Jianhe Chen, Jingjun Chen,
rithms instead of the object segmentation algorithms. Regarding this, Zhentao Chen, Yina E, Jie Fan, Xingyu Gao, Jiaxuan He, Jiayi Huang,
SODA will not only continue to increase new categories and capacity in Jingyuan Huang, Ying Huang, Yuefan Huang, Jiaxi Jiang, Liki Lei, Ju-
subsequent updates but also make some update iterations regarding in- fang Lin, Rui Liu, Junjie Ma, Yinchao Qiu, Wanxi Su, Ying Sun, Jiaquan
stance segmentation and image caption. More annotation methods such Wang, Xinyuan Wang, Jide Wu, Haopeng Yan, Yuqi Zeng, Aiwaner
as crowdsourcing [49] and automatic annotation should also be ex- Zeng, Xiaolan Jan, Yang Zhang, Honglong Zheng, Yuxian Zhu, Junze
plored. (3) Images of SODA are collected in the Greater Bay Area of Zheng, Zhu Chao, and Yelin Ru.
CT

China, and model trained on SODA may not be directly applicable to


other countries. The application and feasibility of the SODA should be References
further investigated on construction sites of different countries and re-
[1] J. Teizer, Status quo and open challenges in vision-based sensing and tracking of
gions considering the variety of construction practices. Several previous
temporary resources on infrastructure construction sites, Adv. Eng. Inform. 29 (2)
published datasets [32,34] have also encountered this problem and (2015) 225–238, https://doi.org/10.1016/j.aei.2015.03.006.
they all agree that transfer learning could be used to solve the problem [2] A. Assadzadeh, M. Arashpour, A. Bab-Hadiashar, T. Ngo, H. Li, Automatic far-
RE

of application across different countries and regions. What's more, co- field camera calibration for construction scene analysis, Comp.-Aided Civ.
Infrastruct. Eng. 36 (8) (2021) 1073–1090, https://doi.org/10.1111/mice.12660.
operation with more scholars and companies in different countries and [3] B.A.S. Oliveira, A.P.D.F. Neto, R.M.A. Fernandino, R.F. Carvalho, A.L. Fernandes,
regions is needed. F.G. Guimaraes, Automated monitoring of construction sites of electric power
substations using deep learning, IEEE Access. 9 (2021) 19195–19207, https://
doi.org/10.1109/access.2021.3054468.
Data availability

A permanent homepage has been built for SODA at http://www2.scut.edu.cn/bim/2022/0223/c12265a460930/page.htm. The images and annotations of SODA can be downloaded directly from https://scut-scet-academic.oss-cn-guangzhou.aliyuncs.com/SODA/2022.2/VOCv1.zip. Data regarding the process of the study can be obtained from the corresponding author upon reasonable request.
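
The archive name VOCv1.zip indicates that the annotations are distributed in the Pascal VOC XML format. The short sketch below shows one way to read the downloaded annotations; the folder name VOCv1/Annotations follows the usual VOC convention and is an assumption about the extracted archive rather than a documented layout of the SODA release.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumed location of the Pascal VOC XML files after extracting VOCv1.zip;
# the actual folder layout of the SODA release may differ.
ANNOTATION_DIR = Path("VOCv1/Annotations")

def load_voc_annotation(xml_path):
    """Parse one VOC XML file into (image filename, [(class name, box), ...])."""
    root = ET.parse(xml_path).getroot()
    filename = root.findtext("filename")
    objects = []
    for obj in root.iter("object"):
        label = obj.findtext("name")  # one of the 15 SODA object classes
        bb = obj.find("bndbox")
        box = tuple(
            int(float(bb.findtext(tag))) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        objects.append((label, box))
    return filename, objects

if __name__ == "__main__":
    # Print the annotations of the first few images as a quick sanity check.
    for xml_file in sorted(ANNOTATION_DIR.glob("*.xml"))[:5]:
        print(load_voc_annotation(xml_file))
```
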


CRediT authorship contribution statement

Rui Duan: Methodology, Formal analysis, Investigation, Writing – original draft. Hui Deng: Resources, Validation, Writing – review & editing, Funding acquisition. Mao Tian: Methodology, Validation, Writing – review & editing.

Acknowledgements

The authors would also like to thank the contributors from the South China University of Technology. Their names are Junxiong Zhang, Fengning Chen, Hongfeng Chen, Jianhe Chen, Jingjun Chen, Zhentao Chen, Yina E, Jie Fan, Xingyu Gao, Jiaxuan He, Jiayi Huang, Jingyuan Huang, Ying Huang, Yuefan Huang, Jiaxi Jiang, Liki Lei, Jufang Lin, Rui Liu, Junjie Ma, Yinchao Qiu, Wanxi Su, Ying Sun, Jiaquan Wang, Xinyuan Wang, Jide Wu, Haopeng Yan, Yuqi Zeng, Aiwaner Zeng, Xiaolan Jan, Yang Zhang, Honglong Zheng, Yuxian Zhu, Junze Zheng, Zhu Chao, and Yelin Ru.

References

[1] J. Teizer, Status quo and open challenges in vision-based sensing and tracking of temporary resources on infrastructure construction sites, Adv. Eng. Inform. 29 (2) (2015) 225–238, https://doi.org/10.1016/j.aei.2015.03.006.
[2] A. Assadzadeh, M. Arashpour, A. Bab-Hadiashar, T. Ngo, H. Li, Automatic far-field camera calibration for construction scene analysis, Comp.-Aided Civ. Infrastruct. Eng. 36 (8) (2021) 1073–1090, https://doi.org/10.1111/mice.12660.
[3] B.A.S. Oliveira, A.P.D.F. Neto, R.M.A. Fernandino, R.F. Carvalho, A.L. Fernandes, F.G. Guimaraes, Automated monitoring of construction sites of electric power substations using deep learning, IEEE Access 9 (2021) 19195–19207, https://doi.org/10.1109/access.2021.3054468.
[4] S.V.T. Tran, T.L. Nguyen, H.L. Chi, D. Lee, C. Park, Generative planning for construction safety surveillance camera installation in 4D BIM environment, Autom. Constr. 134 (2022) 104103, https://doi.org/10.1016/j.autcon.2021.104103.
[5] R. Xiong, Y. Song, H. Li, Y. Wang, Onsite video mining for construction hazards identification with visual relationships, Adv. Eng. Inform. 42 (2019) 100966, https://doi.org/10.1016/j.aei.2019.100966.
[6] Y. Li, H. Wei, Z. Han, J. Huang, W. Wang, Deep learning-based safety helmet detection in engineering management based on convolutional neural networks, Adv. Civ. Eng. 6 (2020) 1–10, https://doi.org/10.1155/2020/9703560.
[7] Z. Wang, Y. Wu, L. Yang, A. Thirunavukarasu, C. Evison, Y. Zhao, Fast personal protective equipment detection for real construction sites using deep learning approaches, Sensors 21 (10) (2021) 3478, https://doi.org/10.3390/s21103478.
[8] S. Kumar, H. Gupta, D. Yadav, I.A. Ansari, O.P. Verma, YOLO v4 algorithm for the real-time detection of fire and personal protective equipments at construction sites, Multimed. Tools Appl. (2021) 1–21, https://doi.org/10.1007/s11042-021-11280-6.
[9] R. Cheng, X. He, Z. Zheng, Z. Wang, Multi-scale safety helmet detection based on SAS-YOLO v3-tiny, Appl. Sci. 11 (8) (2021) 3652, https://doi.org/10.3390/app11083652.
[10] D. Benyang, X.C. Luo, Y. Miao, Safety helmet detection method based on YOLO v4, in: 2020 IEEE Conference on Computational Intelligence and Security (CIS), IEEE, 2020, pp. 155–158, https://doi.org/10.1109/CIS52066.2020.00041.
[11] H. Wang, Z. Hu, Y. Guo, Z. Yang, F. Zhou, P. Xu, A real-time safety helmet wearing detection approach based on CSYOLOv3, Appl. Sci. 10 (19) (2020) 6732, https://doi.org/10.3390/app10196732.
[12] P. Wang, E. Fan, P. Wang, Comparative analysis of image classification algorithms based on traditional machine learning and deep learning, Pattern Recogn. Lett. 141 (2021) 61–67, https://doi.org/10.1016/j.patrec.2020.07.042.
[13] Y. Zhang, Safety management of civil engineering construction based on artificial intelligence and machine vision technology, Adv. Civ. Eng. (2021) 1–14, https://doi.org/10.1155/2021/3769634.
[14] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255, https://doi.org/10.1109/CVPR.2009.5206848.
[15] C. Koch, K. Georgieva, V. Kasireddy, B. Akinci, P. Fieguth, A review on computer vision based defect detection and condition assessment of concrete and asphalt civil infrastructure, Adv. Eng. Inform. 29 (2) (2015) 196–210, https://doi.org/10.1016/j.aei.2015.01.008.
[16] X. Xiang, N. Lv, X. Guo, S. Wang, A. El Saddik, Engineering vehicles detection based on modified faster R-CNN for power grid surveillance, Sensors 18 (7) (2018) 2258, https://doi.org/10.3390/s18072258.
[17] W. Fang, L. Ding, H. Luo, P.E. Love, Falls from heights: a computer vision-based approach for safety harness detection, Autom. Constr. 91 (2018) 53–61, https://doi.org/10.1016/j.autcon.2018.02.018.
[18] Z. Yang, Y. Yuan, M. Zhang, X. Zhao, Y. Zhang, B. Tian, Safety distance identification for crane drivers based on mask R-CNN, Sensors 19 (12) (2019) 2789, https://doi.org/10.3390/s19122789.
[19] C. Chen, Z. Zhu, A. Hammad, Automated excavators activity recognition and productivity analysis from construction site surveillance videos, Autom. Constr. 110 (2020) 103045, https://doi.org/10.1016/j.autcon.2019.103045.
[20] H. Deng, H. Hao, D. Luo, Y. Deng, J.C. Cheng, Automatic indoor construction process monitoring for tiles based on BIM and computer vision, J. Constr. Eng. Manag. 146 (1) (2020), https://doi.org/10.1061/(ASCE)CO.1943-7862.0001744.
[21] Q. Fang, H. Li, X. Luo, C. Li, W. An, A sematic and prior-knowledge-aided monocular localization method for construction-related entities, Comp.-Aided Civ. Infrastruct. Eng. 35 (9) (2020) 979–996, https://doi.org/10.1111/mice.12541.
[22] M. Zhang, M. Zhu, X. Zhao, Recognition of high-risk scenarios in building construction based on image semantics, J. Comput. Civ. Eng. 34 (4) (2020) 04020019, https://doi.org/10.1061/(ASCE)cp.1943-5487.0000900.
[23] H. Luo, M. Wang, P.K.Y. Wong, J.C. Cheng, Full body pose estimation of construction equipment using computer vision and deep learning techniques, Autom. Constr. 110 (2020) 103016, https://doi.org/10.1016/j.autcon.2019.103016.
[24] Z. Pan, C. Su, Y. Deng, J.C. Cheng, Video2entities: a computer vision-based entity extraction framework for updating the architecture, engineering and construction industry knowledge graphs, Autom. Constr. 125 (2021) 103617, https://doi.org/10.1016/j.autcon.2021.103617.
[25] M. Zhang, R. Shi, Z. Yang, A critical review of vision-based occupational health and safety monitoring of construction site workers, Saf. Sci. 126 (2020) 104658, https://doi.org/10.1016/j.ssci.2020.104658.
[26] J. Seo, S. Han, S. Lee, H. Kim, Computer vision techniques for construction safety and health monitoring, Adv. Eng. Inform. 29 (2) (2015) 239–251, https://doi.org/10.1016/j.aei.2015.02.001.
[27] B. Zhong, H. Wu, L. Ding, P.E. Love, H. Li, H. Luo, L. Jiao, Mapping computer vision research in construction: developments, knowledge gaps and implications for research, Autom. Constr. 107 (2019) 102919, https://doi.org/10.1016/j.autcon.2019.102919.
[28] L. Deng, The MNIST database of handwritten digit images for machine learning research, IEEE Signal Process. Mag. 29 (6) (2012) 141–142, https://doi.org/10.1109/msp.2012.2211477.
[29] M. Everingham, L. Van Gool, C.K. Williams, J. Winn, A. Zisserman, The PASCAL visual object classes (VOC) challenge, Int. J. Comput. Vis. 88 (2) (2010) 303–338, https://doi.org/10.1007/s11263-009-0275-4.
[30] T.Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, C.L. Zitnick, Microsoft COCO: common objects in context, in: European Conference on Computer Vision, Springer, Cham, 2014, pp. 740–755, https://doi.org/10.1007/978-3-319-10602-1_48.
[31] H. Tajeen, Z. Zhu, Image dataset development for measuring construction equipment recognition performance, Autom. Constr. 48 (2014) 1–10, https://doi.org/10.1016/j.autcon.2014.07.006.
[32] H. Kim, H. Kim, Y.W. Hong, H. Byun, Detecting construction equipment using a region-based fully convolutional network and transfer learning, J. Comput. Civ. Eng. 32 (2) (2018) 04017082, https://doi.org/10.1061/(ASCE)CP.1943-5487.0000731.
[33] Z. Kolar, H. Chen, X. Luo, Transfer learning and deep convolutional neural networks for safety guardrail detection in 2D images, Autom. Constr. 89 (2018) 58–70, https://doi.org/10.1016/j.autcon.2018.01.003.
[34] X.H. An, L. Zhou, C.Z. Wang, P.F. Li, Z.W. Li, Dataset and benchmark for detecting moving objects in construction sites, Autom. Constr. 122 (2021) 103482, https://doi.org/10.1016/j.autcon.2020.103482.
[35] B. Xiao, S.C. Kang, Development of an image data set of construction machines for deep learning object detection, J. Comput. Civ. Eng. 35 (2) (2021) 05020005, https://doi.org/10.1061/(ASCE)cp.1943-5487.0000945.
[36] J. Song, C.T. Haas, C.H. Caldas, Tracking the location of materials on construction job sites, J. Constr. Eng. Manag. 132 (9) (2006) 911–918, https://doi.org/10.1061/(ASCE)0733-9364(2006)132:9(911).
[37] A. Dimitrov, M. Golparvar-Fard, Vision-based material recognition for automated monitoring of construction progress and generating building information modeling from unordered site image collections, Adv. Eng. Inform. 28 (1) (2014) 37–49, https://doi.org/10.1016/j.aei.2013.11.002.
[38] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2014, pp. 580–587, https://arxiv.org/abs/1311.2524.
[39] R. Girshick, Fast R-CNN, in: Proceedings of the IEEE International Conference on Computer Vision, IEEE, 2015, pp. 1440–1448, https://doi.org/10.1109/ICCV.2015.169.
[40] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, Adv. Neural Inf. Proces. Syst. (2015) 91–99, https://doi.org/10.1109/TPAMI.2016.2577031.
[41] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: unified, real-time object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788, https://arxiv.org/abs/1506.02640.
[42] H. Zhou, Y. Zhao, Q. Shen, L. Yang, H. Cai, Risk assessment and management via multi-source information fusion for undersea tunnel construction, Autom. Constr. 111 (2020) 103050, https://doi.org/10.1016/j.autcon.2019.103050.
[43] X.C. Luo, H. Li, Recognizing diverse construction activities in site images via relevance networks of construction-related objects detected by convolutional neural networks, J. Comput. Civ. Eng. 32 (3) (2018) 04018012, https://doi.org/10.1061/(asce)cp.1943-5487.0000756.
[44] T. Lin, LabelImg, https://github.com/tzutalin/labelImg, 2015.
[45] J.A. Hartigan, M.A. Wong, Algorithm AS 136: a K-means clustering algorithm, J. R. Stat. Soc. 28 (1) (1979) 100–108, https://doi.org/10.2307/2346830.
[46] Z. Zou, Z. Shi, Y. Guo, J. Ye, Object detection in 20 years: a survey, arXiv preprint arXiv:1905.05055 (2019), https://arxiv.org/abs/1905.05055v2.
[47] A. Torralba, K. Murphy, W.T. Freeman, Sharing features: efficient boosting procedures for multiclass object detection, in: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), vol. 2, IEEE, 2004, https://doi.org/10.1109/CVPR.2004.1315241.
[48] P.F. Felzenszwalb, R.B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, IEEE Trans. Pattern Anal. Mach. Intell. 32 (9) (2010) 1627–1645, https://doi.org/10.1109/TPAMI.2009.167.
[49] S. Standing, C. Standing, The ethical use of crowdsourcing, Bus. Ethics: A Eur. Rev. 27 (1) (2018) 72–80, https://doi.org/10.1111/beer.12173.