
Formwork detection in UAV pictures of construction sites

Katrin Jahr, Alexander Braun & André Borrmann


Chair of Computational Modeling and Simulation, Technical University of Munich, Germany

ABSTRACT:

The monitoring of construction progress is an essential task on construction sites, which nowadays is conducted mostly by hand. Recent image processing techniques provide a promising approach for reducing manual labor on site. While modern machine learning algorithms such as convolutional neural networks have proven to be of significant value in other application fields, they have been widely neglected by the CAE industry so far. In this paper, we propose a strategy to set up a machine learning routine to detect construction elements on UAV photographs of construction sites. In an accompanying case study using 750 photographs containing nearly 10,000 formwork elements, we reached accuracies of 90% when classifying single-object images and 40% when locating formwork on multi-object images.

1 INTRODUCTION

The digitization of the construction industry offers various new possibilities for the planning, monitoring, and design processes of buildings. In recent years, many research projects have focused on using methods of computer-aided engineering, such as building information modeling or structural simulations, to facilitate and enhance the planning process. However, as of now, few of the advantages of digital support are used after the planning of a construction project has been finished. Monitoring the construction progress by comparing planned conditions to the actual situation is a labor-intensive task, yet it is still mostly conducted by the workforce on site with little technical support.

During the last decades, image processing techniques have been increasingly adopted by the construction industry, greatly improving and facilitating the process of construction monitoring. These methods gained new potential due to more affordable and precise acquisition devices such as unmanned aerial vehicles (UAVs) and laser scanners. Using the resulting 3D point clouds and information retrieved from the building information model, the possibility arises to track the progress of construction sites (Golparvar-Fard, Pena-Mora, and Savarese 2009; Braun et al. 2015).

A detailed geometric as-planned vs. as-built comparison allows tracking the current progress of a construction site, assessing the quality of the construction work, and checking for construction defects such as cracks.

To generate high-quality point clouds, a significant number of consecutive photographs covering the monitored area is needed, requiring extensive image capturing and processing. However, most monitoring tasks do not entail the need for detailed 3D information. These include monitoring the quantity and positions of site equipment, of externally stored construction material, and of major construction phases. Image analysis and object detection on aerial photographs, which can be taken with relatively low effort, offer an alternative to expensively generating 3D point clouds. The scientific field of computer vision provides different solutions to process and, to a certain extent, understand images.

In this contribution, we use two state-of-the-art techniques of image processing to analyze aerial photography of construction sites. Using the example of formwork elements, we demonstrate an artificial intelligence approach to recognize and locate construction elements on site.
In the first part of the paper, we give an overview of the state of the art in image analysis as used on construction sites today, followed by a description of the methodology used. We conclude the paper with a proof of concept and a summary of our results.

2 STATE OF THE ART

Computer vision is a heavily researched topic that has gained even more attention through recent advances in autonomous driving and related machine learning topics. Image analysis on construction sites, on the other hand, is a rather new topic. Since one of the key aspects of machine learning is the collection of large datasets, current approaches focus on data gathering. In the scope of automated progress monitoring, Han et al. published an approach for Amazon Mechanical Turk based labeling (Han and Golparvar-Fard 2017). Kropp, Koch, and König (2018) tried to detect indoor construction elements based on similarities, focusing on radiators.

For effective and efficient image analysis and object recognition, machine learning algorithms have been used increasingly during the last decades. In 2012, the convolutional neural network (CNN) "AlexNet" (Krizhevsky, Sutskever, and Hinton 2017) achieved a top-5 error of 15.3% in the prestigious ImageNet Large Scale Visual Recognition Challenge (Russakovsky et al. 2015). These results were surprisingly accurate at the time, proving the advantages of using CNNs. On this account, the software industry shifted towards using CNNs for most machine learning based image processing tasks (LeCun, Bengio, and Hinton 2015).

There are different tasks to be solved by image processing algorithms. Well-known problems include classification, where single-object images are analyzed; object detection, where several objects in one image may be classified and localized within the image; and image segmentation, where each pixel of an image is classified (Buduma 2017). In this paper, we focus on image classification and object detection.

CNNs are structured in locally interconnected layers with shared weights. Each layer comprises multiple calculation units (called neurons). The neurons of the first layer (input layer) represent the pixels of the analyzed image; the last layer (output layer) comprises the predictable object classes. In between the input and output layers, any number of hidden layers can be arranged. While AlexNet contained 8 hidden layers, GoogLeNet (Szegedy et al. 2015) and Microsoft ResNet (He et al. 2016) use more than 100 hidden layers. The layers are usually convolution layers (sharpening features), pooling layers (discarding unnecessary information), or fully connected layers (enabling classification) (Buduma 2017; Albelwi and Mahmood 2017).

To adapt to different problems, such as recognizing formwork elements on images, CNNs must be trained. During training, the connections between certain neurons are strengthened, while the connections between other neurons are weakened; that is, the weights connecting consecutive layers are adjusted. The training is usually carried out using supervised backpropagation, meaning that the network is fed with example input-output pairs (Buduma 2017). The correct solution for each input is called the ground truth. To train a CNN towards reliable predictions, a significant amount of training data is required, which has to be prepared in a preprocessing step. ImageNet, for example, provides around 1,000 images per class (Russakovsky et al. 2015). To accelerate the training process, the weights of previously trained CNNs can be reused: to adapt a pretrained CNN, its fully connected layers are replaced with layers representing the new classes and trained on the new data.
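As a hedged illustration of this adaptation step (the paper's own Caffe/DIGITS configuration is not reproduced here), the following Python sketch uses PyTorch/torchvision as a stand-in: the fully connected layer of an ImageNet-pretrained GoogLeNet is replaced by a new classifier head. The class count is an assumption based on Table 1 of this paper.

# Illustrative sketch (not the authors' Caffe/DIGITS setup): adapt a
# pretrained GoogLeNet by replacing its fully connected classifier layer.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 8  # assumption: formwork plus the seven Caltech 256 classes

# Load GoogLeNet pretrained on ImageNet and swap the classifier head.
model = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Optionally freeze the pretrained layers so only the new head is trained.
for name, param in model.named_parameters():
    if not name.startswith("fc"):
        param.requires_grad = False

# Train the new head with the Adam solver, as the paper does for its CNNs.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()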
3 METHODOLOGY

In the context of the introduced research topics, this paper focuses on the image-based detection of temporary construction elements such as formwork. The detection of recurring, similar objects can be solved by machine-learning approaches. Several tools support image analysis regarding the automated detection of pretrained image sets.

3.1 Image classification using CNN

During image classification, which is also known as image recognition, images that contain exactly one object are classified. Each class that the CNN can detect is represented by one output neuron. The activity of each neuron is read as the probability that the image contains an object of the corresponding class (see the sketch following Figure 1). Image classification algorithms will fail on images containing multiple objects. As images of construction sites contain more than one object, image classification algorithms can only be applied after preprocessing of the data. However, they can be very useful to answer certain questions, e.g. whether a wall with a known position is missing, currently shuttered, or finished.

Figure 1: Structure of a sample CNN containing convolutional, pooling and fully connected layers.
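To illustrate how the output neurons described in Section 3.1 are read as class probabilities, here is a minimal Python sketch applying a softmax to the raw activities of the output layer; the class names are illustrative only.

# Minimal sketch: reading output-neuron activities as class probabilities.
import numpy as np

CLASSES = ["formwork", "bulldozer", "car", "barrel"]  # one neuron per class

def class_probabilities(logits):
    """Softmax over raw output activities, shifted for numerical stability."""
    e = np.exp(logits - np.max(logits))
    return dict(zip(CLASSES, e / e.sum()))

print(class_probabilities(np.array([2.1, 0.3, -0.5, 0.1])))
# highest probability -> "formwork"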
3.2 Object detection using CNN

The evident solution for analyzing multi-object images is to slide a window across the image and run an image classification on each window, which is computationally very expensive. Different proposals have been made to reduce the computational effort, e.g. region-proposal networks (such as R-CNN and its successors (Girshick et al. 2014; Girshick 2015; Ren et al. 2017)), which intelligently detect regions of interest within an image and analyze those further, and single-shot detectors (such as DetectNet (Tao, Barker, and Sarathy 2016) and YOLO (Redmon et al. 2016; Redmon and Farhadi 2017; Redmon and Farhadi 2018)), which overlay the image with a grid and analyze each cell.
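The following Python sketch illustrates the naive sliding-window approach and why it is expensive; classify stands in for an arbitrary image classification CNN, and the window and stride sizes are illustrative assumptions.

# Sketch of sliding-window detection: classify a window at every position.
def sliding_window_detect(image, classify, win=256, stride=64):
    """image: HxWx3 array; returns (box, class_probabilities) pairs."""
    h, w = image.shape[:2]
    detections = []
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            probs = classify(image[y:y + win, x:x + win])
            detections.append(((x, y, x + win, y + win), probs))
    return detections

# On a 4000 x 3000 px site photograph this already means roughly
# 59 * 43 = 2,537 classifier runs, which motivates region-proposal
# networks and single-shot detectors.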
3.3 Evaluation of CNN

To measure the performance of an image classifying CNN, the top-1 error and top-5 error are used. The top-1 error represents the fraction of images for which the correct class has been predicted with the highest probability. The top-5 error, accordingly, is the fraction of images for which the correct class is among the 5 classes predicted with the highest probability.

To measure the performance of an object detecting CNN, the precision p, recall r, and mean average precision mAP can be used. They are calculated using the number of true positives TP, false positives FP, and false negatives FN:

p = TP / (TP + FP),    r = TP / (TP + FN)

In object detection tasks, a prediction is counted as a true positive if it has an intersection over union IoU above a distinct value, usually 0.5, meaning that more than 50% of the predicted bounding box should overlap the ground truth bounding box (see Figure 2):

IoU = area of overlap / area of union

For object detection, the mAP is the average of the possible precision at different recall values across all classes. To calculate the AP for each class, Russakovsky et al. (2015) propose to consider 11 recall values, following the ImageNet evaluation protocol:

AP = (1/11) · Σ_{r ∈ {0.0, 0.1, …, 1.0}} p_i(r)

with p_i(r) = the maximum precision at any recall value exceeding r.
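The metrics defined above can be made concrete with a short Python sketch covering IoU for axis-aligned boxes, precision and recall from the TP/FP/FN counts, and the 11-point interpolated AP; this is a generic illustration, not the evaluation code used in the study.

# Sketch of the evaluation metrics; boxes are (x1, y1, x2, y2) tuples.
def iou(a, b):
    """Intersection over union of two axis-aligned bounding boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / float(union)

def precision_recall(tp, fp, fn):
    """p = TP / (TP + FP), r = TP / (TP + FN)."""
    return tp / float(tp + fp), tp / float(tp + fn)

def ap_11point(curve):
    """11-point interpolated AP. curve: list of (recall, precision) pairs;
    p_i(r) is the maximum precision at any recall value >= r."""
    recalls = [i / 10.0 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    return sum(
        max((p for rec, p in curve if rec >= r), default=0.0)
        for r in recalls
    ) / 11.0

# A prediction counts as a true positive if iou(pred, ground_truth) > 0.5.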

Figure 2: Area of overlap and area of union for predicted and labeled (ground truth) bounding boxes.

3.4 Labeling

Labeling defines the approach of marking all regions of interest in a set of pictures and defining the type of each marked region. A subset of labeled pictures is depicted in Figure 4 a); the labels are marked with green bounding boxes.

As the labeling work takes a lot of time, a novel approach for automated labeling has been introduced by Braun et al. (2018). In the frame of the research project ProgressTrack, which focuses on automated progress monitoring with photogrammetric point clouds, an algorithm has been developed to validate detection results of the as-built vs. as-planned comparison. As depicted in Figure 3, the projected 2D geometry of construction elements can be transformed from the building information model's coordinate system into the 2D coordinate system of each picture the element is included in. This is possible because the pictures were aligned and oriented during the photogrammetric process, making it possible to know their exact position in relation to the building information model (see the projection sketch below).

Figure 3: Reprojected bounding box of a column on a picture gathered during acquisition.

Figure 4: Sample data from a) labeling, b) image snippets for classification, as well as c) snippets for DetectNet.
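A schematic Python sketch of the reprojection idea from Section 3.4 follows. It assumes a simple pinhole camera model with intrinsics K and pose (R, t) recovered during the photogrammetric alignment; it does not reproduce the actual algorithm of Braun et al. (2018).

# Sketch: project a BIM element's 3D corner points into 2D picture
# coordinates to obtain a label bounding box automatically.
import numpy as np

def project_point(point_bim, K, R, t):
    """Project one 3D point (BIM world coordinates) onto the image plane."""
    p_cam = R @ point_bim + t      # world -> camera coordinates
    p_img = K @ p_cam              # camera -> homogeneous image coordinates
    return p_img[:2] / p_img[2]    # perspective division -> pixel coordinates

def bounding_box_2d(corners_bim, K, R, t):
    """2D label box enclosing all projected 3D corners of an element."""
    pts = np.array([project_point(c, K, R, t) for c in corners_bim])
    return pts.min(axis=0), pts.max(axis=0)  # (x_min, y_min), (x_max, y_max)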
The process of labeling can benefit from this work because, by this method, labels for all building elements can be marked in all pictures that were taken and aligned accordingly. Future research will focus on this method to extract labels for all construction elements and train a CNN accordingly, without the time-consuming, manual labeling work.

4 CASE STUDY

In the following sections, we present an image analysis routine including data preparation as well as the training of convolutional neural networks to recognize formwork elements. We focus on two different image analysis tasks: image classification and object detection.

4.1 Data preparation

As an initial dataset, 9,956 formwork elements were labeled manually on pictures of three construction sites that were collected during different case studies in recent years. The images contain formwork elements from two different German manufacturers and vary in size (30 cm up to 2.70 m length) as well as color (red, yellow, black, grey). They were taken under varying weather conditions on partly cloudy as well as sunny days. The images were acquired both as aerial photography by different UAVs and from the ground with regular digital cameras, resulting in image sizes from 4000 x 3000 px up to 6000 x 4000 px. The manual labeling process for this dataset took around 130 h to complete.

The gathered data is stored as plain text files for each picture and processed for the various neural networks according to their respective requirements.

4.2 Image analysis

For image analysis, we used the Nvidia Deep Learning GPU Training System DIGITS (Yeager 2015), which provides a graphical web interface to the widespread machine learning frameworks TensorFlow, Caffe, and Torch (NVIDIA 2018). It enables data management, network design, and visualization of the training process.

4.2.1 Image classification

We used a standard GoogLeNet CNN implemented in Caffe for the image classification task. The training is performed using the Adam solver (Kingma and Ba 2014). We retrieved a classification dataset of formwork elements from the labeled data (Section 4.1) by automatically trimming the images around the bounding boxes of the labeled formwork elements (see a subset in Figure 4 b)). The automation was achieved by a self-written tool that takes all labeled data and images as input and crops them automatically. The tool is made available on GitHub as an open-source solution (https://github.com/tumcms/Labelbox2DetectNet). To assure relatively even image sizes with sufficient detail, we removed all images with resulting dimensions under 200 x 200 pixels (these steps are sketched in code below).

To train the algorithm not only on formwork elements but on several classes, we added seven classes (see Table 1) that are related to construction sites from the Caltech 256 dataset (Griffin, Holub, and Perona 2007). Caltech 256 provides single-object images of 256 classes that need no further preprocessing for image classification.

Table 1: Classes and number of images per class used for training of an image classification CNN

Class        Origin       Number of images
Barrel       Caltech 256  47
Bulldozer    Caltech 256  110
Car          Caltech 256  123
Chair        Caltech 256  62
Formwork     Own dataset  1410
Screwdriver  Caltech 256  102
Wheelbarrow  Caltech 256  91
Wrench       Caltech 256  39

As GoogLeNet requires input images of 256 x 256 pixels, all images are resized to these dimensions by DIGITS. For image classification, DIGITS automatically splits the data into training and validation data.

The CNN converged quickly towards high accuracies (top-1 accuracy) around 85% (Figure 5) and stagnated at 90% after 100 epochs, which is a satisfying result. To achieve even higher accuracies throughout all classes, the number of images per class could be evened out in future work by adding images to the underrepresented classes of the training data.

Figure 5: Loss and accuracy of the GoogLeNet after 30 epochs of training for classifying images of formwork elements and objects typically found on construction sites.
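The following Python sketch illustrates the data preparation steps described in Section 4.2.1: cropping the labeled bounding boxes, discarding crops under 200 x 200 pixels, and resizing for GoogLeNet. It is a simplified stand-in for the authors' published tool, and the paths and label format are hypothetical.

# Simplified sketch of the cropping step (see the GitHub link above for
# the authors' actual implementation).
from PIL import Image

MIN_SIZE = 200      # discard smaller crops (insufficient detail)
TARGET_SIZE = 256   # GoogLeNet input dimensions

def crop_labels(image_path, boxes, out_dir):
    """boxes: list of (x1, y1, x2, y2) label coordinates for one picture."""
    img = Image.open(image_path)
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        if x2 - x1 < MIN_SIZE or y2 - y1 < MIN_SIZE:
            continue  # too small for a useful training snippet
        crop = img.crop((x1, y1, x2, y2)).resize((TARGET_SIZE, TARGET_SIZE))
        crop.save(f"{out_dir}/formwork_{i:04d}.png")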
4.3 Object detection

As a next step, an object detection algorithm is introduced to detect certain elements in images and also precisely find the position of these elements. For this purpose, the dataset depicted in Figure 4 c) is used. To detect several formwork elements within an image of a construction site, we used a CNN with the DetectNet architecture, implemented in Caffe. To reduce training time, we used the weights of the "BVLC GoogleNet model" (released for unrestricted use at https://github.com/NVIDIA/DIGITS/tree/master/examples/object-detection), which has been pretrained on ImageNet data. The training again is performed using the Adam solver.

We split the labeled images into 85% training data and 15% validation data. The images were recorded at a high resolution between 4000 x 3000 and 6000 x 4000 pixels. To minimize the necessary computational effort, we split the images into smaller patches with a size of 1248 x 384 pixels, as sketched below.
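A minimal Python sketch of this patching step is given below; the handling of image borders is simplified (leftover margins are dropped), which may differ from the authors' implementation.

# Sketch: split a high-resolution photograph into 1248 x 384 px patches.
from PIL import Image

PATCH_W, PATCH_H = 1248, 384

def split_into_patches(image_path):
    img = Image.open(image_path)
    w, h = img.size  # e.g. 4000 x 3000 up to 6000 x 4000 pixels
    patches = []
    for top in range(0, h - PATCH_H + 1, PATCH_H):
        for left in range(0, w - PATCH_W + 1, PATCH_W):
            patches.append(img.crop((left, top, left + PATCH_W, top + PATCH_H)))
    return patches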
We trained the CNN twice with 300 epochs each. Both precision and recall reached values around 63%, and the mAP stagnated around 44% (Figure 6). The network manages to detect most formwork elements correctly with low rates of false detections. In Figure 7, the resulting bounding boxes for one example image are depicted; for this image, a very good result was retrieved.

Further steps to improve the object detection algorithm entail more extensive preprocessing of the data, longer training periods, and adjustments of both the network architecture and the solving algorithms.

Table 2: Number of images and number of formwork elements contained in these images for training and validation of the object detection

Purpose     Nr. of images  Nr. of formwork elements
Training    646            8429
Validation  99             1487

Figure 6: Precision, recall and mAP of the DetectNet after one round of 300 epochs of training for detecting formwork on images of construction sites.

Figure 7: Detected bounding boxes for formwork elements on a photograph of a construction site.

5 SUMMARY

The presented research focuses on the image analysis of construction site images. To make automated assumptions about the construction elements depicted in an image, machine learning tools need to be trained. First, the current state of the art of machine learning approaches is introduced and examined for its suitability for application in the construction domain. Then, these approaches are tested on construction site elements. For the training, 750 images of construction sites were labeled, resulting in nearly 10,000 labeled formwork elements. The images were used as input to various classification and detection algorithms, resulting in very high success rates for the classification of single-object images and mediocre success rates for object detection on multi-object images. However, as object detection is a highly demanding task engaging a large community of researchers, the results give a promising starting point for future improvements.

6 ACKNOWLEDGMENTS

This work is supported by the Bavarian Research Foundation under grant 1156-15. We thank the Leibniz Supercomputing Centre (LRZ) of the Bavarian Academy of Sciences and Humanities (BAdW) for the support and provisioning of high-performance computing infrastructure essential to this publication.

7 REFERENCES

Albelwi, Saleh, and Ausif Mahmood. 2017. "A Framework for Designing the Architectures of Deep Convolutional Neural Networks." Entropy 19 (6): 242. doi:10.3390/e19060242.

Braun, Alexander, Sebastian Tuttas, André Borrmann, and Uwe Stilla. 2015. "A Concept for Automated Construction Progress Monitoring Using BIM-Based Geometric Constraints and Photogrammetric Point Clouds." ITcon 20: 68–79.
Buduma, Nikhil. 2017. Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms. Vol. 44. doi:10.1007/s13218-012-0198-z.

Girshick, Ross. 2015. "Fast R-CNN." In 2015 IEEE International Conference on Computer Vision (ICCV), 1440–48. IEEE. doi:10.1109/ICCV.2015.169.

Girshick, Ross, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation." In 2014 IEEE Conference on Computer Vision and Pattern Recognition, 580–87. IEEE. doi:10.1109/CVPR.2014.81.

Golparvar-Fard, Mani, F. Pena-Mora, and S. Savarese. 2009. "D4AR – a 4 Dimensional Augmented Reality Model for Automating Construction Progress Monitoring Data Collection, Processing and Communication." Journal of Information Technology in Construction 14 (June): 129–53.

Griffin, G., A. Holub, and P. Perona. 2007. "Caltech-256 Object Category Dataset." http://www.vision.caltech.edu/Image_Datasets/Caltech256/.

Han, Kevin K., and Mani Golparvar-Fard. 2017. "Potential of Big Visual Data and Building Information Modeling for Construction Performance Analytics: An Exploratory Study." Automation in Construction 73 (January): 184–98. doi:10.1016/j.autcon.2016.11.004.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. "Deep Residual Learning for Image Recognition." In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–78. IEEE. doi:10.1109/CVPR.2016.90.

Kingma, Diederik P., and Jimmy Ba. 2014. "Adam: A Method for Stochastic Optimization," December.

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. 2017. "ImageNet Classification with Deep Convolutional Neural Networks." Communications of the ACM 60 (6): 84–90. doi:10.1145/3065386.

Kropp, Christopher, Christian Koch, and Markus König. 2018. "Interior Construction State Recognition with 4D BIM Registered Image Sequences." Automation in Construction 86 (February): 11–32. doi:10.1016/j.autcon.2017.10.027.

LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. 2015. "Deep Learning." Nature 521 (7553): 436–44. doi:10.1038/nature14539.

NVIDIA. 2018. "Nvidia DIGITS – Deep Learning DIGITS Documentation," no. May.

Redmon, Joseph, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. "You Only Look Once: Unified, Real-Time Object Detection." In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 779–88. IEEE. doi:10.1109/CVPR.2016.91.

Redmon, Joseph, and Ali Farhadi. 2017. "YOLO9000: Better, Faster, Stronger." In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6517–25. IEEE. doi:10.1109/CVPR.2017.690.

Redmon, Joseph, and Ali Farhadi. 2018. "YOLOv3: An Incremental Improvement," April.

Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. 2017. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6): 1137–49. doi:10.1109/TPAMI.2016.2577031.

Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. 2015. "ImageNet Large Scale Visual Recognition Challenge." International Journal of Computer Vision 115 (3): 211–52. doi:10.1007/s11263-015-0816-y.

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. "Going Deeper with Convolutions." In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1–9. IEEE. doi:10.1109/CVPR.2015.7298594.

Tao, Andrew, Jon Barker, and Sriya Sarathy. 2016. "DetectNet: Deep Neural Network for Object Detection in DIGITS." https://devblogs.nvidia.com/detectnet-deep-neural-network-object-detection-digits/.

Yeager, Luke. 2015. "DIGITS: The Deep Learning GPU Training System." ICML AutoML Workshop.
