
This document is downloaded from DR-NTU (https://dr.ntu.edu.sg), Nanyang Technological University, Singapore.

Visual servo control of robot manipulator with applications to construction automation

Jin, Yuxin

2022

Jin, Y. (2022). Visual servo control of robot manipulator with applications to construction
automation. Master's thesis, Nanyang Technological University, Singapore.
https://hdl.handle.net/10356/160020

https://doi.org/10.32657/10356/160020

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).



Visual Servo Control of Robot Manipulator with
Applications to Construction Automation

Jin Yuxin

School of Electrical & Electronic Engineering

A thesis submitted to the Nanyang Technological University


in partial fulfilment of the requirement for the degree of
Master of Engineering

2022
Statement of Originality

I hereby certify that the work embodied in this thesis is the result of original

research, is free of plagiarised materials, and has not been submitted for a

higher degree to any other University or Institution.

19-Jan-22

................. .........................
Date Jin Yuxin
Supervisor Declaration Statement

I have reviewed the content and presentation style of this thesis and declare it is

free of plagiarism and of sufficient grammatical clarity to be examined. To the

best of my knowledge, the research and writing are those of the candidate except

as acknowledged in the Author Attribution Statement. I confirm that the

investigations were conducted in accord with the ethics policies and integrity

standards of Nanyang Technological University and that the research data are

presented honestly and without prejudice.

19-Jan-22
................. ...........................
Date Cheah Chien Chern
Authorship Attribution Statement

Please select one of the following; *delete as appropriate:

(A) This thesis does not contain any materials from papers published in peer-reviewed
journals or from papers accepted at conferences in which I am listed as an author.

(B) This thesis contains material from [x number] paper(s) published in the following
peer-reviewed journal(s) / from papers accepted at conferences in which I am listed as
an author.

19-Jan-22
................. ..........................
Date Jin Yuxin
Acknowledgements
First, I wish to express my greatest gratitude to my professor, Cheah Chien Chern.
He taught me my very first lesson in control theory and robotics and patiently guided
me onto the right path of research and study. We would meet and exchange our
thoughts every week to prepare for the research project. He helped me solve many
problems, from the design of the experiments to imperfections in the algorithm.

I also want to thank my friends and my schoolmates. Xinge, Nithish, and I have had
many discussions about projects, questions, or just random thoughts. I will never
forget every meal we had together at the NTU canteen. My boyfriend, Shuailong,
is my source of energy and the light of my life. He never failed to cheer me up during
my darkest moments and to encourage me to move forward. I have only met my
childhood friends in China face-to-face a few times since I went overseas to study,
but every week we would still chat via WeChat and have a phone call. Without their
help and company, I could not have made it to today.

Finally, I want to address my deepest thanks to my family members. They mean
the whole world to me. During my childhood, my grandparents took care of me
most of the time. They always prepared my favorite dishes and waited for me
to come back home from school. My grandmother was a physics teacher in high
school. She was the first person to teach me many interesting scientific facts before
I attended school. My parents are the ones who support me mentally. My mother
never made negative comments about my dreams and thoughts. She always tries to
cheer me up and help me achieve what I want. My father taught me to believe in
myself and never give up easily. They made me who I am today.

Two years of my master's journey is not a very long time, but what it has brought me
and taught me will impact me for the rest of my life.

Jin Yuxin, January 2022



To my dear family
Abstract
The construction industry has long been a labor-intensive sector. The gap between
the continuously increasing demand for housing and the shrinking workforce is grow-
ing wider day by day. In addition, the workplace fatality and injury rate in the
construction sector remains stubbornly high compared to other industries.
Construction companies are therefore seeking robotics and automation technologies to
balance safety, accuracy, and efficiency.

The vision system is a crucial part of the robotic system in construction automation. By
deploying a vision system on a robot, we can automate the detection of construc-
tion materials, installation components, and defects. Current state-of-the-art models
for object detection extract image feature vectors using a deep neural network,
which consists of a series of convolutional layers and max-pooling layers, to gener-
ate the final output. After training a deep neural network model on a suitable
dataset, it can classify and localize numerous classes. Its detection performance
is fixed throughout the prediction process. Typically, detection results
include the object class, confidence level, bounding box size, and bounding box co-
ordinates. The confidence level indicates how confident the model is that an object
is present. It can vary dramatically due to different lighting conditions and
changes in the distance and angle of the camera.

This thesis aims to explore the use of robot visual servoing techniques to improve
detection performance during real-time inspection. The proposed method utilizes
object detection information to guide the robot system to achieve a better view of
the target object. A region-based visual servoing controller is developed to position
the target object in the center of the field of view (FOV) while also maximizing
the coverage of the object within the FOV. A case study is performed on tile
crack inspection using the proposed technique. The inspection process is an
important step to evaluate the current stage of a construction project as well
as to alert the supervisors if there is an error. It is also a tedious job, as it requires
close observation of every wall in every room among all the units. Tile cracks
commonly occur during the transportation or installation process, and the
cracks are usually tiny and therefore not easily detected by human workers. By
combining the visual servo control technique with a deep-learning-based object
detector, we aim to achieve a higher confidence level for the detection of tile
cracks. Experimental results are presented to illustrate the performance.
Contents

Acknowledgements ix

Abstract xiii

List of Figures xvii

List of Tables xix

Symbols and Acronyms xxi

1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Outline of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Literature Review 7
2.1 Robotic Solutions For Building Construction Automation . . . . . . 8
2.1.1 Interior Finishing Robot . . . . . . . . . . . . . . . . . . . . 8
2.1.2 Quality Inspection and Assessment Robot . . . . . . . . . . 9
2.1.3 Site Monitoring Unmanned Aerial Vehicle (UAV) . . . . . . 10
2.2 Machine Learning Basics . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Data Split . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.3 Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.4 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Classic Neural Network Models for Object Detection . . . . . . . . 15
2.3.1 Convolutional Neural Network(CNN) . . . . . . . . . . . . . 15
2.3.2 Region based Convolutional Neural Network (R-CNN) . . . 17
2.3.3 You Only Look Once (YOLO) Version One and Version Two 18
2.3.4 Single Shot Detector (SSD) . . . . . . . . . . . . . . . . . . 19
2.3.5 Convolutional Neural Networks Applications in Construction
Industry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19


2.4 Vision-based Controller . . . . . . . . . . . . . . . . . . . . . . . . . 21


2.4.1 Traditional Vision-based Control . . . . . . . . . . . . . . . 21
2.4.2 Visual Servo Controller with CNN . . . . . . . . . . . . . . . 22

3 Methodology 25
3.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Object Detection Algorithm . . . . . . . . . . . . . . . . . . . . . . 27
3.2.1 YOLOv3 Architecture . . . . . . . . . . . . . . . . . . . . . 27
3.3 Control Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.1 Preliminaries and Definitions . . . . . . . . . . . . . . . . . 29
3.3.2 Image-Based Visual Servoing and Region-Based Control . . 33
3.3.3 Lyapunov Stability Analysis . . . . . . . . . . . . . . . . . . 39

4 Experimental Setup and Experimental Results 45


4.1 Vision System: Hardware Description and Object Detection Model 45
4.1.1 Hardware Description . . . . . . . . . . . . . . . . . . . . . . 45
4.1.2 Data Collection, Training and Results . . . . . . . . . . . . . 46
4.1.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Robot System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.1 Hardware Description . . . . . . . . . . . . . . . . . . . . . . 48
4.2.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . 50

5 Conclusion and Recommendation 59


5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.2 Recommendation for Future Research . . . . . . . . . . . . . . . . . 60

List of Author’s Publications 63

Bibliography 65
List of Figures

2.1 An Example of Five-Fold Cross-Validation. . . . . . . . . . . . . . 11


2.2 A Basic Neural Network Architecture . . . . . . . . . . . . . . . . . 13
2.3 A Simple Illustration of Basic CNN Architecture. . . . . . . . . . . 16
2.4 Faster R-CNN is a single, unified network for object detection. . . . 17

3.1 The basic workflow for proposed controller. . . . . . . . . . . . . . . 26


3.2 The coordinate frame for the camera/lens system. . . . . . . . . . . . 29
3.3 Different camera configuration systems: eye-in-hand and eye-on-hand 31
3.4 Illustration of the definition of the region error function . . . . . . . 35

4.1 Basic user interface of the labeling tool LabelImg. The ground truth
bounding box is drawn in green . . . . . . . . . . . . . . . . . . . . . 47
4.2 YOLO detection results of testing images. electrical telecom and
electrical power are shown in (a), door and electrical switch are
shown in (b), electrical light is shown in (c), electrical switch is
shown in (d), window installed is shown in (e) and tile crack is
shown in (f) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 UR5e Robot Manipulator in the Nanyang Technological University
Robotics Lab Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4 (a) shows the change of confidence level for bounding box length.
The y-axis is the YOLO confidence level and the x-axis is the YOLO
bounding box width. The blue line is the linear regression model
with a 95% confidence interval as the boundary shadowed in blue.
(b) demonstrates the change of confidence level to distance between
the bounding box center and image frame center. The x-axis is
the pixel distance between the YOLO bounding box center and the
image center and the y-axis is the YOLO confidence level.The blue
line is the linear regression model with a 95% confidence interval as
the boundary shadowed in blue. . . . . . . . . . . . . . . . . . . . . 51
4.5 demonstrates change of the YOLO bounding box location, length
and confidence level during the experiment using YOLO model trained
with 1000 epochs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.6 demonstrates change of the YOLO bounding box location, length
and confidence level during the experiment using YOLO model trained
with 2000 epochs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53


4.7 demonstrates change of the YOLO bounding box location, length


and confidence level during the experiment using YOLO model trained
with 3000 epochs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.8 demonstrates change of the bounding box location and confidence
level during the experiment of tile cracks . . . . . . . . . . . . . . . 55
4.9 plots the change of the bounding box location, bounding box length
and confidence level during the experiment of tile cracks . . . . . . 56
4.10 plots change of the bounding box location and confidence level dur-
ing the experiment of electrical powers . . . . . . . . . . . . . . . . 56
4.11 demonstrates change of the bounding box location, bounding box
length and confidence level during the experiment of electrical powers 57
List of Tables

3.1 Object detection results on the COCO dataset [1]. The table shows
the mean average precision (mAP) and inference time for different
object detection models. mAP-50 indicates that the mAP value is
calculated at the 50% IoU threshold. The number after the model
name indicates the input image size. . . . . . . . . . . . . . . . . 27

4.1 Technical specification about Intel Realsense Depth Camera D435 . 46


4.2 YOLO detection performance on 8 classes in the testing dataset. . . 48
4.3 Tech specification of UR5e Robot Manipulator from Universal Robots 49
4.4 Comparison of Average Confidence Level with Object Detection
Model Trained with Di↵erent Numbers of Epochs. . . . . . . . . . . 54
4.5 records ten sets of YOLO bounding box center coordinates and their
confidence levels. The end confidence level displays the confidence level
when the corresponding controller reaches the center of the image frame. 58

Symbols and Acronyms

Symbols
R^n      the n-dimensional Euclidean space
∇f       the gradient vector
∂Y/∂x    the partial differentiation of function Y with respect to variable x
ẋ        the first differentiation of function x
x̄        the vector with the average of all components of x as each element
1        the all-ones column vector with proper dimension
x_{i,k}  the i-th component of a vector x at time k
X^T      the transpose matrix of matrix X
N_i      the index set of the neighbors of agent i

Acronyms
DOF Degree of Freedom
YOLO You Only Look Once
PPVC Prefabricated Prefinished Volumetric Construction
IMU Inertial Measurement Unit
GPS Global Positioning System
UAV Unmanned Aerial Vehicle
mAP Mean Average Precision
FoV Field of View
RoI Region of Interest

Chapter 1

Introduction

1.1 Background

The construction industry has long been a supporting pillar of the economy for all
nations. In Singapore, the construction sector contributed about 4% of Singapore's
total gross domestic product in 2019.¹ According to the Building and Construction
Authority of Singapore (BCA), the forecast average annual demand for construc-
tion will reach 32 billion Singapore dollars in 2025.² With the increasingly high
demand and the limited local population, Singapore's construction sector relies
heavily on foreign workers from neighboring Southeast Asian countries. Data from
the Ministry of Manpower of Singapore (MOM) shows that in 2020, 311,000 work
permit holders were working in the construction and marine sectors, accounting for
one-fourth of the total foreign workforce.³ This situation leads to many inevitable
problems, which occur not only in Singapore.

On the one hand, the construction site environment can be rather harmful or even
dangerous. According to the annual report from the MOM in Singapore, there
were 13 deaths, 135 major non-fatal injuries, and 1,674 minor injuries at
construction sites in 2019.⁴ This indicates that the construction
sector is still the main contributor to workplace injury and death among all the
1. https://www.statista.com/statistics/1122999/singapore-nominal-gdp-breakdown-by-sector/
2. https://www1.bca.gov.sg/docs/default-source/docs-corp-form/free-stats.pdf
3. https://www.mom.gov.sg/documents-and-publications/foreign-workforce-numbers
4. https://www.mom.gov.sg/-/media/mom/documents/press-releases/2021/0319-annex-a---workplace-safety-and-health-report-2020.pdf

industries. Furthermore, data from the United States Bureau of Labor Statistics state that every
year the rate at which construction workers suffer fatal injuries is the fourth-highest
among all industries.⁵ The causes of construction workplace injuries range
from falls from height to vehicle accidents. Besides, most construction site jobs
are repetitive and tiring. Construction work usually involves concrete workers,
stoneworkers, flooring installers, glaziers, tile setters, ironworkers, and electricians.
Most of these jobs require long hours of exhausting physical activity, which can lead
to many chronic illnesses such as lumbar diseases or arthritis. Thus, the average
retirement age for a construction worker is 42.5, which is much younger than the
standard retirement age. Using robots for construction work can ease the tiring
work done by workers. Deploying more autonomous machines on construction
sites means more people can work in a safer place.

On the other hand, there is a vast technology gap between the construction indus-
try and other industries. Data from the Bureau of Economic Analysis in the United
States show that labor productivity in the construction industry remains at the
lowest level among the agriculture, transportation, manufacturing, and utility
industries.⁶ Likewise, a survey in Japan in 2012 even indicates a declining trend in
construction labor productivity from 1990 to 2010, while that of industry as a whole
rose continuously [2]. In 2015, Naoum [3] from London, UK, listed the top factors
that influence productivity on construction sites. Among the 46 listed factors, he
ranks ineffective project planning, delays caused by design errors and variation orders,
the communication system, the work environment, and constraints on a worker's
performance as the top five factors. The first three arise mainly during the prepara-
tion stage, where researchers cannot intervene much, while the last two show that
environmental and human factors are essential for efficiency and productivity. A good
environment affects how smoothly the construction work can proceed and how
efficiently the workers perform. A robot system is capable of helping to monitor
the current on-site progress as well as to speed up the construction process.

Moreover, the occurrence of COVID-19 has worsened the situation for hiring construction
workers from overseas. In Singapore, the outbreak of COVID did not start in the
community but in the workers' dormitories. In total, there were 54,518 dorm
residents infected, contributing to almost 90% of the overall cases in Singapore
5. https://www.bls.gov/news.release/pdf/cfoi.pdf
6. https://www.curt.org/committees/managing-construction-productivity/

as of 18th May 2021.⁷ The reason behind this was probably the crowded living
conditions among workers in the past.⁸ It has also become challenging for
construction companies to recruit new workers abroad, as the pandemic situation
remains unpredictable and varies worldwide. The pandemic has significantly altered
the timelines of existing construction projects. As reported by The Straits Times,
85 percent of the 89 ongoing build-to-order projects face delays of six to nine
months due to the pandemic, with 43,000 households affected.⁹ GDP from the
construction sector in Singapore also dropped dramatically from 4,957.8 million
Singapore dollars in the first quarter of 2020 to 1,681.6 million in the third quarter
because of the COVID outbreak and the lockdown of the construction dormitories.¹⁰
This severely influenced economic circulation and people's daily lives.

Given the above situation, we recognize a need to aid construction work by im-
plementing robotics solutions. This will be the initial step in shifting the nature of
the construction industry from labor-intensive to technology-intensive. It would speed
up project schedules with computer-integrated progress monitoring and lessen
workplace injuries and deaths every year.

1.2 Motivation

In Section 1.1, we discussed the current difficulties and limitations faced by the
conventional construction industry. Thus, we endeavor to develop an integrated
robot system that can automate the construction process and assist construction
laborers.

A robot system usually consists of two parts: the vision system and the robot hardware sys-
tem. The vision system is a crucial component of the robotic system in construction
automation. As construction sites are generally disordered and unstructured,
visual information helps us see and comprehend the surrounding conditions. The
vision system should recognize and localize various construction materials, instal-
lation components, and defects. In terms of vision algorithms, the convolutional
7. https://covidsitrep.moh.gov.sg
8. https://www.straitstimes.com/singapore/manpower/workers-describe-crowded-cramped-living-conditions
9. https://www.straitstimes.com/singapore/spore-will-see-further-delays-in-housing-projects-due-to-tightening-of-covid-19-measures
10. https://tradingeconomics.com/singapore/gdp-from-construction

neural network has gradually replaced traditional machine vision techniques
and become the primary tool for fast and accurate object detection tasks. It
consists of a series of convolutional layers and max-pooling layers to produce the
final output. If we can collect a suitable dataset and train a deep neural network
model, it can classify and localize numerous classes. Typically, detection results
include the object class, confidence level, bounding box size, and bounding box co-
ordinates. The confidence level indicates how confident the machine is that an
object is present.

However, there are still many limitations and drawbacks of existing object
detectors. The detection performance is fixed after training, throughout the pre-
diction process. The confidence level can vary dramatically due to different lighting
conditions and changes in the camera's distance and angle. If the camera faces the
object at an inappropriate angle, it will significantly reduce the performance of the
object detector and may even lead to false detections.

Thus, we see the possibility of applying a robot solution to assist the detection of
the CNN model. This thesis investigates the usage of robot vision-based control
techniques to enhance detection performance during real-time inspection. We aim
to develop a control algorithm that can position the target object in the center
of the field of view (FOV) while also maximizing the coverage of the target ob-
ject within the FOV. A case study is performed on tile crack inspection
and construction installation checks to illustrate the performance of the proposed
technique.

Inspection work is generally an essential step in the construction process. It
is intended to evaluate the current stage of the construction project and alert the su-
pervisors if there is an error. It is also a tiresome job, as it requires close observation
of every wall in every room among all the units. Tile cracks commonly occur
during the transportation or installation process, and the cracks are usually tiny
and hence not easily detected by human workers. Counting the installed items and
updating the checklist are time-consuming tasks as well. Combining the
visual servo control technique with a deep-learning-based object detector would
help to achieve a higher confidence level for detection of the tile cracks and other
installation components. Experimental results are presented to illustrate the per-
formance.

1.3 Outline of the Thesis

Chapter 1 introduces the background of the construction industry and the moti-
vation of the project.

Chapter 2 reviews the robotics solutions used in the construction industry as well
as the basics of machine learning. Classic neural network models are also presented
to find out the most suitable one for real-time application.

Chapter 3 elaborates the details of the proposed algorithm. It is divided into two
parts: the vision algorithm and the control algorithm. For the vision algorithm, we
give a detailed explanation of the machine learning algorithm used. For the control
algorithm, we illustrate the mathematical definition and usage.

Chapter 4 explains the hardware of the robot and vision system. It demonstrates
the experimental setup and the experimental results.

Chapter 5 concludes the thesis and provides possible future research directions.

Chapter 2

Literature Review

In the following chapter, we review previous research work on construction robots
and other related topics.

Firstly, we will review different types of construction robots and robot control tech-
niques for robot design and control. It is crucial to figure out what kinds of func-
tionalities and capabilities are essential for specific construction tasks. Knowing
the trend of construction automation tells us what is needed from construction robots,
and it gives us the direction for further improvement by understanding the possible
drawbacks of existing products.

For robot vision, both basic machine learning knowledge and modern neural net-
work models will be illustrated. Grasping the machine learning basics is valu-
able for understanding the architecture and function of the neural network. Compar-
isons will be made between various neural network algorithms. Determining the
most suitable neural network model for construction site work is the principal focus
throughout the process.

The subsequent literature review will cover the techniques and algorithms used for
object detection, construction robots with diversified purposes and functionalities,
and vision-based control algorithms.

2.1 Robotic Solutions For Building Construction
Automation

We need to consider several aspects when designing the overall process of construc-
tion work to achieve full construction automation. In the book Robot Oriented
Design [4], the author listed five key technologies and methodologies that play an
essential role in accomplishing real construction automation: (1) robot-oriented
design, (2) robotic industrialization, (3) construction robots, (4) site automation,
and (5) ambient robotics. In the following subsections, robot applications are listed
according to their functionalities and purposes.

2.1.1 Interior Finishing Robot

Interior finishing work, including painting, tiling, masonry, and plastering, is im-
perative before handover. Tiling and painting work can be quite risky, as it
often involves working at height and carries a higher chance of falling and getting
injured. Meanwhile, it is also a time-consuming and labor-intensive job. Thus,
providing a robotic solution to ensure workers' safety and boost productivity is
crucial to the industry.

One of the earliest interior-finishing robots, called the Technion Autonomous Multipur-
pose Interior Robot (TAMIR), was introduced in 1994 [5]. A robot manipulator
with six DoF and a 1.62-meter nominal reach is fixed on a three-wheeled mobile
carriage. It is designed to perform all types of interior finishing work mentioned
above in a construction environment. A spray gun with a controllable nozzle is
programmed to switch on/off at the desired distance from the wall for the painting and
plastering task, ensuring coverage. It can also perform a wall tiling process, which
involves a vacuum gripper that picks up the tile, receives the glue or cement, and
places it at the designated position.

Moreover, in 2018, a research group from Nanyang Technological University (NTU)
developed a robot called Pictobot [6], which is more suitable for rooms with high
ceilings. A 3-DoF mobile base enables the system to move to the target place in-
side the room and elevate to the desired height if needed. Although full autonomy
is not achieved in this product, human intelligence combined with robot capability

exhibits better results. A human operator can navigate the robot to a different
workstation and set painting requirements such as the proper nozzle and spray pressure
using the remote controller with a screen. A 6-DoF robot arm with a 3D scanning
and reconstruction system can detect the surrounding terrain, plan the spraying trajec-
tory, and perform the spraying task on an uneven surface. The authors compared manual
spray painting and human-Pictobot joint operation in terms of working time,
transfer efficiency, quality, convenience, safety, and human resources needed. The
robot can finish 100 m² within 2 hours, compared to 3 hours for manual spraying,
and the coating is believed to be more even and consistent in terms of thickness.

By applying human-robot interaction, the robot can accomplish more accurate


movements and attain more reliable results with assistance from human beings.
Robots can improve the productivity and sustainability of interior finishing work
and diminish the dependency on skilled labor while maintaining high quality.

2.1.2 Quality Inspection and Assessment Robot

After completing all the construction works, it is imperative to inspect and assess
them regularly and ensure there are no defects or cracks.

Pack et al. [7] have proposed a structure-climbing robot for building inspection
named ROBIN. A four-DoF articulated mechanism with two vacuum fixtures en-
ables the robot to walk across surfaces or transition between adjacent surfaces per-
pendicular to each other. ROBIN can climb onto high-rise buildings, bridges, and
other artificial structures for inspection work using cameras or other sensors.

In 2009, a research group from Hanyang University developed a robotic inspection


system [8] which is capable of automatically detecting cracks. The robot consists
of a specially designed car, a mobile control system, and a vision system. The
inspection robot connects to a multi-linkage system fixed onto the bridge, and it
can extend to cover the length of the bridge. The experiments report that
the detection accuracy of the proposed method can reach 94.1%, which is much
higher than that of other machine vision algorithms such as the Canny operator [9].

Moreover, Gibb et al. [10] proposed a multi-functional inspection robot for civil
infrastructure evaluation and maintenance in 2017. With the integration of ground-
penetrating radar, electrical resistivity, and a stereo camera, the robot can perform

the detection and assessment of the concrete rebar, concrete corrosion, and cracks
at the same time. Meanwhile, an onboard computer enables the system to process
the data and conduct the navigation in real-time. It can output the width of the
cracks and produce a concrete condition map based on that.

Autonomous inspection work relieves workers from repetitive and tiresome checks
and prevents work-related injuries. Meanwhile, the robotic solution can also reduce
the cost and time of maintenance.

2.1.3 Site Monitoring Unmanned Aerial Vehicle (UAV)

With the increasing popularity and development of UAVs, many researchers have also
explored UAV usage in building inspection and monitoring. UAVs can fly without
a crew and inspect target building facades at heights that standard ground robots
cannot reach.

In 2015, Pereira and Pereira [11] evaluated two different machine vision algorithms
used in UAV applications and their respective performance. Both the Sobel filter
algorithm and the particle filter algorithm were tested on a Linux PC and a Raspberry
Pi for their accuracy and processing time in crack detection.

Meanwhile, UAV-based laser scanning for building inspection was explored by a
research group from Germany in 2016 [12]. With 470,000 preset points on the UAV
trajectory, it obtains data on the building surface from the laser scanner and
reconstructs a 3D point cloud. The paper also evaluated using UAVs to check for
cracks and other defects using RGB cameras and thermal sensors.

2.2 Machine Learning Basics

In Section 2.1, it is shown that the principal trend in construction robots is to
develop a mechatronic system consisting of a mobile robot or drone and integrated
sensors such as a vision system. Deep learning and neural networks provide us
with excellent adaptability and accuracy for designing a high-performance vision sys-
tem. More details and background knowledge about machine learning need to be
reviewed before looking into the neural network and its applications.

In the following subsections, basic machine learning techniques and information


will be illustrated to better understand the process of training on a self-generated
data set and using neural networks in a real-time application.

2.2.1 Data Split

The quality of the data directly influences the performance of a model. Generally
speaking, a more extensive dataset means higher accuracy. Every dataset is split
into a training dataset, a validation dataset, and a test dataset. While the training
dataset is only used to train the model, the score on the validation dataset is used to
tune hyper-parameters such as the learning rate. The test dataset is used to evaluate
the performance of the model on unseen data.

When dealing with real-life problems, we often lack enough training data, which
may lead to an under-fitting problem. Thus, resampling procedures such as cross-
validation become extremely important to give a more accurate estimate of the
current model and thus adapt the parameter values.

Figure 2.1: An Example of Five-Fold Cross-Validation.

Figure 2.1 illustrates an example of k-fold cross-validation where k is five. The
general procedure is listed below:

1. Randomly shuffle the dataset so that the different classes are distributed equally.

2. Divide the whole dataset into k same-size groups.

3. Assign one group as the validation data set.

4. Assign the remaining groups as the training data set.

5. Fit a model on the training set and evaluate it on the validation set.

6. Record the evaluation score.

7. Repeat steps 3-6 k times and calculate the summarized performance based on
the average of all the evaluation scores.

By applying the k-fold cross-validation procedure, we can still obtain a precise
evaluation of the model using limited samples, as sketched below.
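For illustration, the following Python sketch (assuming scikit-learn and NumPy are available, and using a placeholder logistic regression model on a placeholder dataset; none of these choices come from this thesis) follows the procedure above:

    from sklearn.datasets import load_iris                # placeholder dataset
    from sklearn.linear_model import LogisticRegression   # placeholder model
    from sklearn.model_selection import KFold
    import numpy as np

    X, y = load_iris(return_X_y=True)

    kf = KFold(n_splits=5, shuffle=True, random_state=0)    # steps 1-2: shuffle and split into k folds
    scores = []
    for train_idx, val_idx in kf.split(X):                  # steps 3-4: one fold for validation, the rest for training
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])               # step 5: fit on the training folds
        scores.append(model.score(X[val_idx], y[val_idx]))  # step 6: record the evaluation score

    print("mean validation accuracy:", np.mean(scores))     # step 7: average over all k iterations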

2.2.2 Neural Network

When talking about the origin of the neural network, its connection with zoology
and anatomy cannot be bypassed.

One of the very first papers about brain architecture and neuron mechanisms was
published in 1968 by Hubel and Wiesel [13]. By studying the architecture of the
monkey striate cortex, the scientists found that the brain is organized such that simple
cells appear in the deeper layers while complex cells are contained in the upper
layers. Simple cells are more sensitive to lines and edges, and their outputs
converge in the complex or hyper-complex cells. Inspired by Hubel's work, in 1980,
the term 'neocognitron' first appeared in Fukushima and Miyake's paper [14]
to present their self-organizing neural network model. The model arranged each
module in a cascade connection. Each module consists of 'S-cells', similar to the
simple cells, and 'C-cells', similar to the complex cells. In the paper, the authors
claimed that the network could self-learn the characteristics of the input patterns, and
one of the C-cells in the last layer would respond to them.

A modern artificial neural network is illustrated in Figure 2.2 to show its basic
architecture. Usually, it consists of input layers, hidden layers, and output layers.

Figure 2.2: A Basic Neural Network Architecture

The neural network can be interpreted as an estimator of the complicated inter-
action and relationship between the input and the output. Given an input feature vector
x ∈ R^d and an output label vector y ∈ R^C with C different classes, a simple
classifier can be expressed as

ŷ = f(x; θ)     (2.1)

where θ denotes the model parameters and f is the function that maps the input feature
vector x to the output label space, ŷ ∈ R^C.
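As a minimal sketch (assuming PyTorch and arbitrarily chosen layer sizes, which are not taken from this thesis), the classifier f(x; θ) of Eq. (2.1) with one hidden layer can be written as:

    import torch
    import torch.nn as nn

    d, C, hidden = 16, 4, 32        # assumed input dimension, number of classes, hidden width

    # f(x; theta): input layer -> hidden layer -> output layer, as in Figure 2.2;
    # theta is the set of weights and biases inside the two Linear layers
    f = nn.Sequential(
        nn.Linear(d, hidden),
        nn.ReLU(),
        nn.Linear(hidden, C),
    )

    x = torch.randn(1, d)           # a single input feature vector x in R^d
    y_hat = f(x)                    # unnormalized class scores; a softmax turns them into probabilities
    print(y_hat.shape)              # torch.Size([1, 4])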

2.2.3 Loss Functions

Typically, machine learning problems seek to find a solution (a
set of weights) that minimizes the error. An objective function, often called the loss
function or cost function, is used to evaluate the candidate solutions. The loss
function summarizes the neural network's performance and scales it down to a
single value that allows us to improve, rank, and compare different models.

A commonly used framework for loss functions is maximum likelihood estimation.
Maximum likelihood seeks to find the optimum values for the parameters by max-
imizing a likelihood function derived from the training data [15]. After the model
makes a prediction, the loss function estimates the difference between the
predicted distribution and the target distribution in the training data under max-
imum likelihood. Importantly, different loss functions are used for different types
of machine learning problems.
For binary and multi-class classification problems, we commonly use the Cross-Entropy
loss. For a classification problem with C classes, using the logistic model and the
one-hot ground-truth label y ∈ R^C, the loss function is defined as

L(θ) = −∑_{c=1}^{C} y_c log p_c     (2.2)

where p_c is the predicted probability of class c.

For regression problems, which predict a specific quantity, the Mean-Square loss
is more suitable. Given n data points, we can define the loss function as

MSE = (1/n) ∑_{i=1}^{n} (Y_i − Ŷ_i)²     (2.3)

where Y_i are the observed values and Ŷ_i are the predicted values.
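A minimal NumPy sketch of the two losses (with made-up numbers, purely for illustration) is given below:

    import numpy as np

    def cross_entropy(y_onehot, p):
        # Eq. (2.2): L = -sum_c y_c * log(p_c) for a one-hot label and predicted probabilities
        return -np.sum(y_onehot * np.log(p))

    def mse(y, y_hat):
        # Eq. (2.3): mean of squared differences between observed and predicted values
        return np.mean((y - y_hat) ** 2)

    print(cross_entropy(np.array([0, 1, 0]), np.array([0.2, 0.7, 0.1])))  # about 0.357
    print(mse(np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 3.2])))      # 0.02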

2.2.4 Optimization

As its name suggests, optimization means updating the set of weights to minimize the
loss function.

One of the simplest optimization methods is Gradient Descent (GD). It is like going
downhill: by calculating the derivative of the loss function, we can update the
weights in the direction of the negative gradient. The gradient calculation and the
update step can be expressed in the following equations:

g_t = ∇_{θ_t} L(θ_t)     (2.4)

θ_{t+1} = θ_t − η g_t     (2.5)

where t is the number of steps of updating the parameters, g_t is the gradient of the
loss function L(θ_t) at step t, θ_t is the weights at step t, and η is the learning rate.
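A minimal sketch of Eqs. (2.4)-(2.5) (using NumPy and a made-up quadratic loss, not a loss from this thesis) is the following:

    import numpy as np

    target = np.array([1.0, -2.0])      # the minimizer of the made-up loss

    def loss(theta):
        return np.sum((theta - target) ** 2)

    def grad(theta):                    # analytic gradient of the loss above, Eq. (2.4)
        return 2.0 * (theta - target)

    eta = 0.1                           # learning rate
    theta = np.zeros(2)                 # initial weights
    for t in range(100):
        g_t = grad(theta)               # Eq. (2.4)
        theta = theta - eta * g_t       # Eq. (2.5)

    print(theta, loss(theta))           # theta approaches [1, -2] and the loss approaches 0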

The learning rate controls how much the weights are updated at each step, and it is
a crucial hyper-parameter we need to define. The learning rate is usually a small
value, such as 0.01 or 0.001, and it is fundamental to choose the correct value. If
the learning rate is too large, the learning process will be unstable and oscillate,
while if it is too small, it will take a longer time to converge.

Stochastic Gradient Descent (SGD) comes into play if the dataset is too large for
the calculation. Instead of calculating the gradient over the entire dataset, it estimates
it with a randomly chosen subset of the data. Thus, it not only reduces the
computational time but also achieves faster convergence.

Although GD is a straightforward method, it cannot solve the saddle point and
local minima problems. The gradient is equal to zero at these points, but the loss
function can still be improved. To tackle these problems, more sophisticated meth-
ods such as Adam and the Adaptive Gradient Algorithm (AdaGrad) [16] were later
proposed. Unlike SGD with a fixed learning rate, AdaGrad automatically adjusts
the learning rate for different components during training by using past observations.
Due to this characteristic, AdaGrad is well suited to sparse data.

2.3 Classic Neural Network Models for Object


Detection

In the following section, neural network models widely used for object detection
will be introduced. With the development of computer vision and the improve-
ment of computational power, more complex models have started to gain an advantage
over traditional machine vision algorithms. Many new models are proposed every year
with increased accuracy and shortened processing time. We will look into different
models and find out which one is more suitable for real-time object detection.

2.3.1 Convolutional Neural Network(CNN)

The history of CNN started in the 1980s. One of the very first papers was published
by LeCun et al. [17] in 1995. It laid the foundation of architecture and usage of
convolutional neural networks. Three years after that, the same author proposed
LeNet-5 [18], a 7-layer convolutional network that can process images and recognize
hand-written numbers.

Although the invention of CNNs can be traced back to the end of the last
century, they never received great attention until the extensive use of graphics pro-
cessing units (GPUs) in the 2000s. A research group from the University of Toronto
presented an extensive, deep convolutional neural network designed with
60 million parameters and 650,000 neurons in 2012 [19]. It is able to classify 1.2
million images into 1000 classes with top-1 and top-5 error rates of 37.5% and
17.0%, respectively.

Figure 2.3: A Simple Illustration of Basic CNN Architecture.

The typical CNN architecture consists of multiple max-pooling layers, convolution
layers, and fully connected layers. Figure 2.3 demonstrates a simple representation
of a CNN. In machine learning, convolution means the operation that multiplies
the image pixels with a filter or kernel to generate a feature map by averaging or
summing the output. Hyper-parameters define the convolutional filters/kernels. In
the pooling layer, average pooling or maximum pooling is regularly used to sum-
marize the values present. This is done by calculating the average
or maximum value within a small cluster of the input pixels. The fully connected
layers connect every neuron of the flattened feature map to all the neurons in the
hidden and output layers. They usually appear in the classification stage.

Nowadays, CNNs are widely used for grid-like datasets such as images because of their
architecture and capability to separate and extract essential features using spatial
relationships.
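A minimal PyTorch sketch of the architecture in Figure 2.3 (with assumed channel counts n1 and n2, unit count n3, and a 28×28 single-channel input; these sizes are illustrative and not taken from this thesis) could look like:

    import torch
    import torch.nn as nn

    n1, n2, n3, num_classes = 8, 16, 64, 10          # assumed sizes

    cnn = nn.Sequential(
        nn.Conv2d(1, n1, kernel_size=3, padding=1),  # Conv_1
        nn.ReLU(),
        nn.MaxPool2d(2),                             # max-pooling: 28x28 -> 14x14
        nn.Conv2d(n1, n2, kernel_size=3, padding=1), # Conv_2
        nn.ReLU(),
        nn.MaxPool2d(2),                             # max-pooling: 14x14 -> 7x7
        nn.Flatten(),                                # flattened feature map
        nn.Linear(n2 * 7 * 7, n3),                   # fully connected layer fc_3
        nn.ReLU(),
        nn.Linear(n3, num_classes),                  # fully connected layer fc_4 -> outputs
    )

    x = torch.randn(1, 1, 28, 28)                    # a single-channel 28x28 input image
    print(cnn(x).shape)                              # torch.Size([1, 10])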

2.3.2 Region based Convolutional Neural Network (R-CNN)

More complex questions were brought up during the development of computer


vision, such as object detection, which required location information instead of
just image classification.

Thus, the region-based approach, called R-CNN, was first introduced by Girshick et al.
[20] in 2014. R-CNN used a CNN as the backbone to extract features for
each proposal and classified the output using a Support Vector Machine (SVM). Its
performance on the PASCAL VOC 2012 dataset improved by 30% compared to the best
previous results, achieving a mean average precision (mAP) of 53.7%.

One year after that, the same author proposed a modified method named Fast
Region-based Convolutional Neural Network (Fast R-CNN) [21]. R. Girshick wished
to fix the problems that occurred in R-CNN, such as multi-stage training, expen-
sive training time and storage, and slow real-time detection speed. By using deep
convolutional layers and max-pooling layers, the model produces a feature map.
Then, a feature vector is extracted for each object. After feeding the feature
vector into the fully connected layers, it generates the output containing the Softmax
probability and the bounding box position. This end-to-end model has higher detec-
tion performance compared to R-CNN and does not need storage for feature caching.
The training time is nine times faster than R-CNN, with an mAP of 66%.

Figure 2.4: Faster R-CNN is a single, unified network for object detection.

Although Fast R-CNN is a great success in terms of speed and accuracy, it also
reveals that the remaining bottleneck is the region proposal computation. Thus, it was
further improved by Ren et al. [22] in 2015. The novelty of this method is the
invention of Region Proposal Networks (RPNs), which significantly lessen the
cost of computing proposals and accelerate the test-time operation of the model.
In a single model design, these regions are integrated with a Fast R-CNN model.
The RPN proposes potential regions of interest (RoIs) and types of objects, while Fast
R-CNN extracts the features and produces the final output containing bounding
boxes and class labels, as shown in Figure 2.4. In short, the RPN tells the following
neural network which areas need more attention. The model has been evaluated
on the union of the PASCAL VOC 2007 trainval and 2012 trainval sets and achieves
an mAP of 73.2%. Using a deep VGG-16 model [23], it has a frame rate of 5
fps on a GPU.
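To show how such a two-stage detector is typically invoked in practice, a short sketch using a pre-trained model from torchvision (assuming a recent torchvision release; this is a generic example, not the detector used in this thesis) is given below:

    import torch
    import torchvision

    # pre-trained Faster R-CNN with a ResNet-50 FPN backbone: the RPN proposes RoIs and the
    # Fast R-CNN head returns bounding boxes, class labels, and scores, as in Figure 2.4
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    model.eval()

    image = torch.rand(3, 480, 640)             # stand-in for an RGB input image
    with torch.no_grad():
        prediction = model([image])[0]          # a list of images in, a list of result dicts out
    print(prediction["boxes"].shape, prediction["labels"].shape, prediction["scores"].shape)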

2.3.3 You Only Look Once (YOLO) Version One and Ver-
sion Two

With the idea of creating a more straightforward and faster neural network
for real-time applications, Redmon et al. [24] presented a real-time detector that
achieves decent accuracy with only 24 convolutional layers and two fully connected
layers. The faster version of YOLOv1 was able to run at 150 fps, which meant
real-time video processing was possible with minimal latency.

Without sliding-window or region-based techniques, YOLO brought up a new
concept called 'Unified Detection', which means the result is generated from the
whole image across all the objects. Splitting the input image into an S×S grid, B
bounding boxes, a confidence score indicating how confident the model is that an
object is present, and C conditional class probabilities are calculated for each grid
cell. Using this technique, YOLO can decrease the background error to 4.75%,
compared to Fast R-CNN with a 13.6% error. The output contains five predictions
for every bounding box, including the center coordinates of the bounding box (x, y),
the width and height (w, h) with respect to the entire image, and the confidence.
The above information is encoded as an S × S × (B · 5 + C) tensor in the output
layer. Overall, YOLO still obtained lower accuracy than the state-of-the-art
algorithms, with an mAP of 57.9% on the VOC 2012 test dataset, and did not
perform well on small-size objects. Meanwhile, it suffers from significantly more
localization errors and a lower recall value.
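To make the output encoding concrete, a small NumPy sketch (using the YOLOv1 defaults S = 7, B = 2, C = 20, and a random tensor standing in for the network output) shows the shape and per-cell layout:

    import numpy as np

    S, B, C = 7, 2, 20                         # grid size, boxes per cell, number of classes
    output = np.random.rand(S, S, B * 5 + C)   # stand-in for the network's output tensor

    # each grid cell holds B boxes of (x, y, w, h, confidence) followed by C class probabilities
    cell = output[3, 4]                        # an arbitrary grid cell
    boxes = cell[:B * 5].reshape(B, 5)         # B rows of (x, y, w, h, confidence)
    class_probs = cell[B * 5:]                 # C conditional class probabilities

    print(output.shape)                        # (7, 7, 30)
    print(boxes.shape, class_probs.shape)      # (2, 5) (20,)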

Because of the weaknesses stated before, one year after the publication of YOLOv1,
the same author improved it and presented YOLOv2 (YOLO9000) [25]. More new
ideas are included in this model while it still maintains a relatively simple
architecture. Adding batch normalization and a higher-resolution classifier gives
an increase in mAP of 4%. Different from the idea of the RPN in Faster R-CNN,
YOLOv2 comes up with convolutional anchor boxes, which predict bounding box
coordinates directly from image features. Meanwhile, k-means clustering is applied
to obtain better prior box sizes automatically. Instead of using sophisticated
classification models such as VGG-16, YOLOv2 uses a custom network called
Darknet-19, consisting of only 19 convolutional layers and five max-pooling layers.
It can reach an mAP of 78.6% at 40 FPS, which is a notable advancement compared
to SSD, which has a similar mAP but only 19 FPS.

2.3.4 Single Shot Detector (SSD)

By observing the slow frame rate of Faster R-CNN and the low accuracy of YOLO
version one, Liu et al. [26] designed a new method for object detection called Single
Shot MultiBox Detector.

Like Fast R-CNN, a feed-forward convolutional network is selected as the base
network to produce a cluster of fixed-size bounding boxes and corresponding confidence
scores. Auxiliary structures, such as extra convolutional layers, are then added to
enhance detection at different scales, producing a fixed collection of detection results
and discretizing the space of output bounding box shapes. SSD outperformed
all other methods on the COCO dataset, including Fast R-CNN, Faster R-CNN,
and YOLOv1, with an mAP of 72.4 and 74.9 for input sizes 300×300 and 512×512,
respectively. Removing the bounding box proposals from the network architecture
achieves 59 frames per second on the VOC2007 test dataset with high-accuracy
detection.

2.3.5 Convolutional Neural Networks Applications in Con-


struction Industry

Although CNNs have been a prevalent topic for decades, their application to real-
time construction projects is still relatively fresh. Due to the chaotic background
environment and the numerous classes of objects, it is still challenging to train a high-
performance model. The following paragraphs list several construction-related prob-
lems that can be solved by applying CNNs.

Because of the difficulty of window detection in urban planning, Neuhausen and
König [27] present a window detection pipeline. The algorithm
consists of preprocessing, detection, and postprocessing. Preprocessing rectifies
the image so that the quadrilateral facade becomes a rectangle, which boosts
the detector performance. During detection, the input image is cut into patches,
which are classified, and the detection results are merged. In
the postprocessing step, the existing bounding boxes from the previous step are refined
by extracting the edges. It also scans the images based on the detected positions
and finds missed windows, since windows are usually distributed along a
column or row. By performing the mentioned steps, the entire model yields a
precision of 97%, which is ideal for building detection.

Nhat-Duc et al. [28] proposed a method called CNN-CDM for pavement crack
detection. In total, they collected 400 images of pavement surfaces for two different
classes. The authors compared the performance of two different algorithms. In the
first, the collected dataset goes through a crack recognition model that integrates the
Canny edge extraction algorithm with the DFP optimization algorithm. The other
method, called CNN-CDM, is a multi-layer neural network containing a feature
extraction network and a classification network. The experimental results on the
training dataset reach 92.08% for the CNN method, while the DFP-Canny method
only achieves 76.69%.

Similarly, to detect and localize moisture damage in bridge deck asphalt pavement,
Zhang et al. [29] develop a mixed deep CNN comprising a ResNet50 network for
feature extraction and a YOLOv2 network for recognition. The input data are ob-
tained from Ground Penetrating Radar (GPR), and an IRS algorithm is deployed
to generate relevant data fed into the CNN. The team removes the original base
network of YOLOv2 and adds a ResNet50 network on top of it. Instead of using
the original YOLO anchor numbers, the k-means clustering method is used here
for small object detection. According to the experimental results, the detection
CNN model reaches 91% precision. The outcome demonstrates that this is a novel
method for automatically detecting moisture damage.

Moreover, to guide a facade-cleaning robot in avoiding dangerous areas, re-
searchers from the Singapore University of Technology and Design (SUTD) propose
a crack detection algorithm based on a convolutional neural network [30]. They
compare the performance of two different optimizers. After training the model for
700 epochs, both CNN models reach around 90% accuracy regardless of vary-
ing illumination and resolution of the input images. Simulation and experimental
results demonstrate that the system is robust for crack detection.

2.4 Vision-based Controller

2.4.1 Traditional Vision-based Control

How visual information can guide robot movement has long been a popular topic
in robotics. Hutchinson et al. [31] give a detailed tutorial on image-based
visual servoing (IBVS). The core of this algorithm is to guide the robot's
movement purely using image feature points. It extracts the features from
the image space and calculates the difference with the desired features. The output
of the controller is the end-effector velocity and orientation, which is related to the
error. The IBVS pipeline consists of image feature extraction, image
Jacobian calculation, and final velocity calculation. Among these, extracting the
desired features from the image space is the most challenging step.
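A minimal sketch of the classical IBVS law described above (assuming a known image Jacobian L_s and a hand-picked gain λ; this is the standard textbook form, not the region-based controller developed later in this thesis) is:

    import numpy as np

    def ibvs_velocity(s, s_star, L_s, lam=0.5):
        # classical IBVS law: camera velocity v = -lam * pinv(L_s) @ (s - s_star)
        error = s - s_star                               # current minus desired image features
        return -lam * np.linalg.pinv(L_s) @ error

    # toy example: two image points (4 features) controlling a 6-DOF camera velocity
    s      = np.array([120.0, 80.0, 200.0, 150.0])       # current pixel features
    s_star = np.array([160.0, 120.0, 160.0, 120.0])      # desired pixel features
    L_s    = np.random.randn(4, 6)                       # stand-in for the 4x6 image Jacobian
    print(ibvs_velocity(s, s_star, L_s))                 # commanded camera twist (vx, vy, vz, wx, wy, wz)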

Thus, later research has tried to modify this method by substituting machine vision
algorithms for the manually selected feature points.

Pop et al. [32] demonstrate the use of color coding to guide a robot to a
target location. In order to perform the pick-and-place task, it is crucial to
determine the size and position of the object. Taking images at a fixed distance,
they first detect the object based on its color in the HSV image. Then, the
algorithm extracts the edges of the object and computes its height using a
scaling factor. After that, the center of gravity can be computed using the center
x and y pixel coordinates. This information is used during the pickup process. By
doing this, the robot manipulator can grasp the object with less than 5 mm error.
However, the experimental results are strongly influenced by ambient light,
shadows, reflections, and the camera settings.

Wang et al. [33] modified the vision system to solve the problem of object grasping
in an unstructured environment. Instead of a single camera, the proposed mobile
manipulation system uses a hybrid camera configuration. One is a monocular
camera installed on the end-effector, and the other is a stereo camera installed on
the robot body. By doing this, they believe the system can provide the visual servoing
controller with more stable depth data while also having a large field of view (FOV).
The experimental results show that the pixel error stays within 10 pixels over 30
repeated experiments.

The paper [34] focuses on solving the problem of visual control of a leader-follower
mobile robot system. The intrinsic and extrinsic parameters of the pinhole camera
are uncalibrated, and its position and orientation are unknown. Wang et al. [34]
designed a special marker to estimate the difference between the leader and follower
and calculate the velocity based on that. Under this controller design, the follower
robot is able to track the leader robot.

2.4.2 Visual Servo Controller with CNN

As mentioned in Section 2.3, the Convolutional Neural Network (CNN) has demon-
strated its capability in terms of accuracy, adaptive capacity, and processing speed.
That is why, recently, numerous CNN applications have been developed in the
robotics area for grasping and tracking tasks. The following subsection reviews
the literature on CNN applications in robotics.

Researchers from the University of Seville, Spain, develop an algorithm for grasping
by integrating visual servoing with an object detection algorithm [35]. They build a
UAV with a pair of 3-DoF arms for manipulation. The UAV is equipped with
an Intel RealSense D435 depth camera for depth estimation. The robot needs
to know the exact location and orientation of the object to perform the grasping
task. An object detection algorithm is deployed to achieve this, and then the point
cloud is used in the alignment process to estimate the object pose. After
generating the grasping point based on the estimated pose, a pose-based
visual servoing (PBVS) technique is used to approach the targeted object. The
error is calculated based on the target position and the current one. By computing
the inverse kinematics, the end-effector is able to grasp the object with minimum
error.

Liu and Li [36] propose a control algorithm integrated with a CNN to reduce the
difficulty of extracting image features. A two-stream convolutional neural network
is applied to extract the image features of the current scene automatically. The
neural network's output is the pose parameters and orientation: the translation along
the x-, y-, and z-axes and the rotation around the z-axis of the robot base frame.
These are compared with the image features at the optimal position, and the
corresponding offset is input to the control algorithm for manipulating the robot arm.
After training the model with 400 images, the absolute error is reduced to within 4 mm
along the x-, y-, and z-axes and 3.02 degrees for rotation about the z-axis. The robot
manipulator can reach the target pose within 15 steps and remains stable after that.
This work demonstrates the possibility of integrating CNNs with vision-based control.

A CNN-based control algorithm was developed by Ahlin et al. [37] to solve the
leaf-picking problem using a robot manipulator. The network is based on the
AlexNet architecture [38], and the dataset is classified into two classes, leaves and
background. The leaf-picking task consists of computing the leaf's position in the
image frame and converting it into Cartesian space. IBVS and the Monocular Depth
Approach (MDA) are deployed for searching, approaching, and grasping the target
leaf precisely. The experimental results show that all ten searches performed were
successful, and 16 out of 23 approaches succeeded.

The same group of researchers further improved this robotic system by substituting
the 2D image information with a 3D point cloud [39]. They use a monocular camera
and a 6-DoF robotic manipulator to detect, track, and pick healthy and unhealthy
leaves. The Faster R-CNN [22] architecture is applied for object detection and
Mask R-CNN [40] is used for instance-based semantic segmentation; leaves are
classified as healthy or unhealthy with an mAP of 0.753. The IBVS method is
applied to move the bounding box to the center of the frame, and the MDA then
takes over to minimize the accumulated error. In experiments on real plants,
about 92% of the leaves were grabbed.

Anwar et al. [41] designed a quality inspection process for remote radio units
(RRUs) using image-based visual servo control. They modified the image Jacobian
by deriving the depth using projective geometry. The feature vector f is defined
as f = [u_c, v_c, √A, θ_ij]^T, where (u_c, v_c) are the center coordinates of the
region of interest (RoI), A is its area, and θ_ij is the angle around the z-axis.

The selected features obtained by the computer vision algorithm are able to guide
the robot to track the power port, and experiments show better performance
compared with the traditional CAMShift tracking algorithm.

The research works listed above show the flexibility and robustness of CNNs and
how they can boost the performance of robotic systems.

Chapter 3

Methodology

Among all types of construction work, Cai et al. [42] point out that there are fewer
research papers and products related to inspection work than to climbing, cleaning,
and maintenance. Tile defect inspection and installation checking can be considered
among the most tiresome tasks on a construction site, because they usually require
quality engineers to closely examine every part of the wall or tile to verify
misalignment or damage for a whole room or building on a day-to-day basis. Based
on the literature review in Section 2.1.2, it is clear that most current inspection
robots employ traditional machine vision techniques such as the Canny edge
extraction algorithm for crack detection. Such methods are rigid and cannot
distinguish tile cracks from other gaps, and their results can be severely influenced
by lighting conditions, background texture, and distance. There is therefore a need
to develop a neural network-based crack detector. Besides an excellent object
detector, many other problems still stand between the construction field and full
automation. Sometimes cracks are not visible due to their small size or poor
lighting, and other objects such as wires or markers on the wall can occasionally
be misclassified as wall cracks. A control algorithm is needed to move the camera
to a position where it can capture suitable images.

We propose a robotic solution for more precise installation and crack inspection
by combining an object detection algorithm with a traditional controller.

3.1 System Overview

Two primary components of the system are the image-based controller and object
detection model as shown in Figure 3.1.

In the beginning, a predefined target feature vector, marked as f* in the diagram,
is passed into the algorithm as the ground truth. In our case, we require the
center of the bounding box to be located at the center of the image frame so that
the object is fully in view. Meanwhile, the height and width of the bounding box
should reach at least 60% of the corresponding image frame dimension, to prevent
the object from appearing too small or too large. The object detection model
YOLO is chosen here for real-time detection. It outputs the bounding box
information, labeled f, for the input image. The current output f is compared
with the target f* to produce the error e. The image-based controller is a
combination of a visual servoing controller and a region-reaching controller. It
calculates the end-effector velocity for the robot manipulator based on the
difference between the present value and the desired value. The robotic arm then
executes the commanded velocity, which is sent via a socket connection.

Figure 3.1: The basic workflow of the proposed controller.

The control objective is to utilize object detection information to guide the robot
system for achieving a better view of the target object. A region-based visual
servoing controller is developed to position the target object in the center of the
field of view (FOV). At the same time, it also maximizes the coverage of the object
within the FOV.
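To make this workflow concrete, a minimal sketch of the loop in Figure 3.1 is given below. All the interfaces here (camera, detector, robot, jacobian_fn) are hypothetical placeholders used only for illustration; they are not the actual implementation, which is described in Chapter 4.

```python
import numpy as np

def servo_loop(camera, detector, robot, jacobian_fn, f_star, K, tol=5.0):
    """Hypothetical closed-loop sketch of Figure 3.1: detect, compare with the
    target features f_star, convert the error into an end-effector velocity,
    and send it to the robot over the socket interface."""
    while True:
        frame, depth = camera.read()          # current RGB image and depth
        f = detector.predict(frame)           # YOLO bounding-box features
        if f is None:                         # no detection: hold position
            robot.send_velocity(np.zeros(6))
            continue
        e = f - f_star                        # feature error e
        J = jacobian_fn(f, depth)             # image Jacobian at current features
        v = -np.linalg.pinv(J) @ (K @ e)      # end-effector velocity command
        robot.send_velocity(v)                # executed by the manipulator
        if np.linalg.norm(e) < tol:           # close enough to the target
            break
```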

3.2 Object Detection Algorithm

In Section 2.3, we described several commonly used deep learning models for object
detection, which have different features and architectures. Here we compare three
of the most widely used models, Faster R-CNN, SSD, and YOLO, to determine which
one is most suitable for our real-time application.

Method               mAP-50   Time (ms)
Faster R-CNN [22]    51.9     85
SSD-513 [26]         50.4     125
YOLOv3-416 [1]       55.3     29
YOLOv3-608 [1]       57.9     51

Table 3.1: Object detection results on the COCO dataset [1]. The table shows the
mean average precision (mAP) and inference time for different object detection
models. mAP-50 means the mAP is calculated with a 50% IOU threshold. The
number after the model name indicates the input image size.

Based on the information reported by Redmon and Farhadi [1], we compile
Table 3.1. The mAP measures the average precision over all classes and can range
from 0 to 100. We define a detection to be a true positive if its intersection
over union (IOU, the overlapping area divided by the union area) with the ground
truth box is greater than the threshold (50% in our case). Time indicates the
inference time for a single input image; a smaller value means less delay between
the current frame and the detection result. From these figures, we can observe
that YOLOv3-416 achieves the shortest inference time among the four models, while
YOLOv3-608 is the best in terms of detection accuracy. Considering both speed and
accuracy for real-time detection in robot control, we conclude that YOLOv3-608 is
the most suitable detector.
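For clarity, the IOU test behind the mAP-50 metric can be computed as in the short sketch below. The boxes are assumed to be given as corner coordinates (x1, y1, x2, y2); this is an illustration only, not code from the thesis implementation.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A detection counts as a true positive at mAP-50 if iou(pred, gt) >= 0.5.
```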

3.2.1 YOLOv3 Architecture

Some background on the YOLOv3 architecture helps to explain this performance.

YOLOv3 takes a different approach from its peers for feature extraction. Typically,
ResNet [43] with some level of modification is selected as the backbone; a deep
neural network with a residual learning framework aims to reduce the training
time and increase accuracy. YOLOv3 instead utilizes a new network with 53
convolutional layers called Darknet-53. Darknet-53 is a hybrid of Darknet-19
from YOLOv2 and the residual blocks of ResNet. It has accuracy comparable to
ResNet-152 while being about 2× faster.

The previous version, YOLOv2, performs poorly on small objects, with only 5%
average precision. YOLOv3 therefore introduces multi-scale predictions to address
this issue: it extracts features at three different scales. The feature map from
two layers earlier is up-sampled and merged with an earlier feature map by
concatenation. This operation provides the network with both more semantic
information and more low-level information.

The above two improvements help YOLOv3 gain a decent detection accuracy while
still maintaining a fast prediction speed.

3.3 Control Algorithm

The control algorithm serves as the brain of the overall system. Based on the
current input information, it calculates and decides where to go next. Since we
plan to move the camera to capture a better view of the object, we first need to
define what a 'good position' means for object detection.

On the one hand, a central location in the image frame is better than a corner
location. Putting the object in the center of the FOV gives a clear frontal view
and helps the camera capture the image without distortion or focus problems. Once
the center of the target object is within the FOV, we can carry out other
operations without worrying about the object leaving the frame.

On the other hand, increasing the size of the object in the image frame is
essential. Since tile cracks are usually tiny and hardly noticeable, the object
detector cannot perform well when the object appears small.

Based on the above two assumptions, a task-priority visual servoing controller is
developed to position the target object in the center of the FOV while also
maximizing the coverage of the object within the FOV.

3.3.1 Preliminaries and Definitions

Before discussing the controller design in detail, some background knowledge and
definitions require further explanation.

Camera Projection Models

Image information is essential for controlling the robot in visual servoing and
other vision-based control algorithms. We can relate a point in the image plane to
the real object by knowing the intrinsic camera parameters and the depth
information.

Assume that the x-axis and y-axis form the fundamental plane of the image plane
and that the z-axis is perpendicular to this plane, along the optical axis. This
setting defines the camera coordinate system shown in Figure 3.2.

Figure 3.2: The coordinate frame for the camera/lens system.

The perspective projection model is commonly used in computer vision to calculate
the transformation between an image point and the object location in the camera
coordinates. According to Hartley and Zisserman [44], a point P^c = [X, Y, Z]^T
expressed in the camera coordinate frame is projected onto the image plane at
coordinates p = [u, v]^T by
\[
\pi(X, Y, Z) = \begin{bmatrix} u \\ v \end{bmatrix} = \frac{\lambda}{Z}\begin{bmatrix} X \\ Y \end{bmatrix} \tag{3.1}
\]

where λ is the focal length of the camera lens, which corresponds to the distance
between the origin and the image plane.
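As a small illustration, (3.1) can be applied directly as in the sketch below; it is a minimal example assuming the point is expressed in the camera frame with Z > 0.

```python
def project(point, focal_length):
    """Perspective projection (3.1): map a camera-frame point (X, Y, Z)
    to image-plane coordinates (u, v) = (focal_length / Z) * (X, Y)."""
    X, Y, Z = point
    return focal_length * X / Z, focal_length * Y / Z
```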

The Velocity of the Rigid Body

To understand the relationship between the base frame of the robot manipulator
and the end-effector, we need to define the angular and translational velocity of
the end-effector. The motion with respect to the base coordinates can be separated
into the angular velocity Ω(t) = [ω_x(t), ω_y(t), ω_z(t)]^T and the translational
velocity T(t) = [T_x(t), T_y(t), T_z(t)]^T. Let P be a point rigidly attached to
the end-effector whose base-frame coordinates are [x, y, z]^T. According to
Hutchinson et al. [31], taking the derivative of the coordinates of P gives
\[
\dot{x} = z\omega_y - y\omega_z + T_x \tag{3.2}
\]
\[
\dot{y} = x\omega_z - z\omega_x + T_y \tag{3.3}
\]
\[
\dot{z} = y\omega_x - x\omega_y + T_z \tag{3.4}
\]

which can be rewritten as
\[
\dot{P} = \Omega \times P + T \tag{3.5}
\]
If we represent the cross product in (3.5) using the skew-symmetric matrix
\[
\mathrm{sk}(P) = \begin{bmatrix} 0 & -z & y \\ z & 0 & -x \\ -y & x & 0 \end{bmatrix} \tag{3.6}
\]
we can write it as
\[
\dot{P} = -\,\mathrm{sk}(P)\,\Omega + T \tag{3.7}
\]

Combining the above equations, we can define the velocity screw as
\[
\dot{r} = \begin{bmatrix} T_x & T_y & T_z & \omega_x & \omega_y & \omega_z \end{bmatrix}^T \tag{3.8}
\]
where r denotes the coordinates of the robot end-effector frame in the task space.
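A minimal numerical sketch of these relations is given below; the helper names are illustrative only and the sign convention follows the equations above.

```python
import numpy as np

def skew(p):
    """Skew-symmetric matrix sk(P) of (3.6): skew(p) @ w equals the cross product p x w."""
    x, y, z = p
    return np.array([[0.0, -z,  y],
                     [ z, 0.0, -x],
                     [-y,  x, 0.0]])

def point_velocity(p, screw_vel):
    """Velocity of a point p rigidly attached to the end-effector, given the
    velocity screw [Tx, Ty, Tz, wx, wy, wz] of (3.8); implements (3.5)/(3.7)."""
    t, omega = np.asarray(screw_vel[:3]), np.asarray(screw_vel[3:])
    return np.cross(omega, p) + t          # equivalently -skew(p) @ omega + t
```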

Camera Configuration

Typically, there are two common camera configurations for visual servo systems:
the camera is either mounted on the end-effector or fixed in the workspace [31].

Figure 3.3: Different camera configuration systems: eye-in-hand and eye-to-hand.

The first configuration is usually called the eye-in-hand configuration. In our
experimental setup, we adopt this setting and mount the camera on the end-effector
because we want the camera to move together with the robot manipulator; the
relative pose between the camera and the end-effector is always constant. Under
this setting, we can control the camera by controlling the motion of the
end-effector.

Image Jacobian Matrix

The image Jacobian matrix defines the relationship between differential changes
in the image feature parameters and differential changes in the manipulator
position. The image Jacobian matrix J_img ∈ R^{k×m} satisfies [31]

\[
\dot{X} = J_{img}\,\dot{r} \tag{3.9}
\]

where X is the image feature parameter vector, Ẋ is the corresponding image
feature velocity, and ṙ is the end-effector velocity in the task space r
defined in (3.8).

\[
J_{img} = \frac{\partial X}{\partial r} = \begin{bmatrix} \frac{\partial v_1(r)}{\partial r_1} & \cdots & \frac{\partial v_1(r)}{\partial r_m} \\ \vdots & \ddots & \vdots \\ \frac{\partial v_k(r)}{\partial r_1} & \cdots & \frac{\partial v_k(r)}{\partial r_m} \end{bmatrix} \tag{3.10}
\]
where v_i is the i-th image feature velocity in Ẋ, k is the dimension of the
image feature vector, and m is the dimension of the task space r (i.e. of the
velocity screw ṙ).

The definition of the image Jacobian matrix was first presented by Weiss et al. [45]
in 1987. Equation (3.9) describes how changes in the image features are related to
changes in the robot end-effector position.

An eye-in-hand camera configuration model is adopted in our project as we want
to find a better position for image capturing. According to Kelly et al. [46], the
image Jacobian matrix for an eye-in-hand camera configuration with a single
feature point X = [u, v]^T is given as
\[
J_{img}(X, Z) = \begin{bmatrix} \frac{f_x}{Z} & 0 & -\frac{u}{Z} & -\frac{uv}{f_x} & \frac{f_x^2 + u^2}{f_x} & -v \\ 0 & \frac{f_y}{Z} & -\frac{v}{Z} & -\frac{f_y^2 + v^2}{f_y} & \frac{uv}{f_y} & u \end{bmatrix} \tag{3.11}
\]
where f_x, f_y are the constant focal lengths of the camera and Z is the depth of
the selected feature point in the depth frame.
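As a numerical illustration, the single-point Jacobian of (3.11) can be evaluated as in the sketch below; the function name is illustrative and not part of the proposed method, and the signs follow the convention written above.

```python
import numpy as np

def point_jacobian(u, v, Z, fx, fy):
    """Image Jacobian of one feature point, as in (3.11): maps the 6-DOF
    end-effector velocity screw [Tx, Ty, Tz, wx, wy, wz] to the pixel
    velocity [u_dot, v_dot]."""
    return np.array([
        [fx / Z, 0.0, -u / Z, -u * v / fx, (fx**2 + u**2) / fx, -v],
        [0.0, fy / Z, -v / Z, -(fy**2 + v**2) / fy, u * v / fy, u],
    ])
```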

It can be further extended to multiple feature points X = [u_1, v_1, ..., u_k, v_k]^T ∈ R^{2k}
with different depths Z ∈ R^k:
\[
J_{img}(X, Z) = \begin{bmatrix} J_{img}(X_1, Z_1) \\ \vdots \\ J_{img}(X_k, Z_k) \end{bmatrix} = \begin{bmatrix} \frac{f_x}{Z_1} & 0 & -\frac{u_1}{Z_1} & -\frac{u_1 v_1}{f_x} & \frac{f_x^2 + u_1^2}{f_x} & -v_1 \\ 0 & \frac{f_y}{Z_1} & -\frac{v_1}{Z_1} & -\frac{f_y^2 + v_1^2}{f_y} & \frac{u_1 v_1}{f_y} & u_1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ \frac{f_x}{Z_k} & 0 & -\frac{u_k}{Z_k} & -\frac{u_k v_k}{f_x} & \frac{f_x^2 + u_k^2}{f_x} & -v_k \\ 0 & \frac{f_y}{Z_k} & -\frac{v_k}{Z_k} & -\frac{f_y^2 + v_k^2}{f_y} & \frac{u_k v_k}{f_y} & u_k \end{bmatrix} \tag{3.12}
\]
where (u_k, v_k) is the k-th selected image feature point and Z_k is its depth in
the depth frame.

3.3.2 Image-Based Visual Servoing and Region-Based Control

Visual servoing is the technique of using machine vision information in closed-loop
position control of the robot end-effector. Visual servoing is used here to control
the robot system using the image features provided by the YOLO bounding box.

As mentioned in Section 3.3.1, there are two commonly used camera configurations.
In both cases, the motion of the robot manipulator contributes to changes in the
image feature parameters. The idea of visual servoing is to design an error
function e that measures the difference between the desired image feature
parameters and the current ones, and to drive it to zero; when the task is
finished, e = 0.

As mentioned at the beginning of Section 3.3, we want to design a controller which
moves the bounding box towards the image frame center while maintaining the size
of the bounding box. To fulfil these requirements, the selected feature vector is
defined as
\[
X_{yolo} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} x^c_{yolo} \\ y^c_{yolo} \\ x^{tl}_{yolo} \\ y^{tl}_{yolo} \end{bmatrix} \tag{3.13}
\]
where (x^c_{yolo}, y^c_{yolo}) are the center coordinates of the YOLO bounding box
and (x^{tl}_{yolo}, y^{tl}_{yolo}) are the coordinates of its top-left corner.

Thus, according to (3.12), the overall Jacobian matrix for the image feature
X_{yolo} can be written as
\[
J^{yolo}_{img} = \begin{bmatrix} \frac{f_x}{Z} & 0 & -\frac{x^c_{yolo}}{Z} & -\frac{x^c_{yolo} y^c_{yolo}}{f_x} & \frac{f_x^2 + (x^c_{yolo})^2}{f_x} & -y^c_{yolo} \\ 0 & \frac{f_y}{Z} & -\frac{y^c_{yolo}}{Z} & -\frac{f_y^2 + (y^c_{yolo})^2}{f_y} & \frac{x^c_{yolo} y^c_{yolo}}{f_y} & x^c_{yolo} \\ \frac{f_x}{Z} & 0 & -\frac{x^{tl}_{yolo}}{Z} & -\frac{x^{tl}_{yolo} y^{tl}_{yolo}}{f_x} & \frac{f_x^2 + (x^{tl}_{yolo})^2}{f_x} & -y^{tl}_{yolo} \\ 0 & \frac{f_y}{Z} & -\frac{y^{tl}_{yolo}}{Z} & -\frac{f_y^2 + (y^{tl}_{yolo})^2}{f_y} & \frac{x^{tl}_{yolo} y^{tl}_{yolo}}{f_y} & x^{tl}_{yolo} \end{bmatrix} \tag{3.14}
\]
Since the feature points (x^c_{yolo}, y^c_{yolo}) and (x^{tl}_{yolo}, y^{tl}_{yolo})
belong to the same object in the image frame, the depth Z is the same for both
of them.
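A direct numerical counterpart of (3.14) is sketched below. The box layout [x^c, y^c, x^tl, y^tl] and the single shared depth Z are as defined in the text; the function name and interface are illustrative assumptions.

```python
import numpy as np

def yolo_jacobian(box, Z, fx, fy):
    """4x6 image Jacobian of (3.14) for the YOLO features
    X_yolo = [xc, yc, xtl, ytl]: the box centre and its top-left corner,
    which share the same depth Z because they lie on the same object."""
    def rows(u, v):
        return [[fx / Z, 0.0, -u / Z, -u * v / fx, (fx**2 + u**2) / fx, -v],
                [0.0, fy / Z, -v / Z, -(fy**2 + v**2) / fy, u * v / fy, u]]
    xc, yc, xtl, ytl = box
    return np.array(rows(xc, yc) + rows(xtl, ytl))
```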

Based on (3.9), if we want to calculate the end-effector velocity ṙ from the image
feature velocity Ẋ, we need to solve
\[
\dot{r} = J^{yolo+}_{img}\,\dot{X} \tag{3.15}
\]
where J^{yolo+}_{img} is the pseudoinverse of J^{yolo}_{img} ∈ R^{k×m}. It can be
calculated as
\[
J^{yolo+}_{img} = J^{yolo\,T}_{img}\left(J^{yolo}_{img} J^{yolo\,T}_{img}\right)^{-1} \tag{3.16}
\]

Furthermore, if we define the control input u(t) to be the end-effector velocity
of the robot manipulator, then we have
\[
u(t) = \dot{r} \tag{3.17}
\]

The region-reaching control technique is deployed here to guide the robot system
in the direction that grows the bounding box to the desired size. We believe that
increasing the object size in the image to an optimal value will help the machine
learning algorithm.

The region-reaching controller was first introduced by Cheah et al. [47] in 2007.
Instead of using a desired point or boundary as the target, a desired region is
predefined. The region is specified by an objective function f(X), and the region
error converges to zero once the robot enters this area.

The desired region is defined by scalar functions with continuous first partial
derivatives. In our approach, it is specified by the following inequalities:
\[
f_1(x_3) = (x^c_{yolo} - x_3)^2 - \frac{w_t^2}{4} \geq 0 \tag{3.18}
\]
\[
f_2(x_4) = (y^c_{yolo} - x_4)^2 - \frac{h_t^2}{4} \geq 0 \tag{3.19}
\]

where [x_3, x_4]^T = [x^{tl}_{yolo}, y^{tl}_{yolo}]^T ∈ R^2, (x^c_{yolo}, y^c_{yolo})
are the center coordinates of the YOLO bounding box, and (x^{tl}_{yolo}, y^{tl}_{yolo})
are the coordinates of its top-left corner, as shown in Figure 3.4. w_t and h_t
are the desired width and height, set proportionally to the size of the image frame.


Figure 3.4: Illustration of the definition of the region error function

In (3.18) and (3.19), (x^c_{yolo} - x_3) is half of the width of the YOLO bounding
box, w_{yolo}/2, and (y^c_{yolo} - x_4) is half of its height, h_{yolo}/2.

The potential energy functions for x_3 and x_4 are specified as
\[
P_3(x_3) = \frac{k_{ptl3}}{2}\,[\min(0, f_1(x_3))]^2 \tag{3.20}
\]
\[
P_4(x_4) = \frac{k_{ptl4}}{2}\,[\min(0, f_2(x_4))]^2 \tag{3.21}
\]

That is,
\[
P_3(x_3) = \begin{cases} 0 & \text{if } f_1(x_3) \geq 0 \\ \frac{k_{ptl3}}{2} f_1(x_3)^2 & \text{if } f_1(x_3) < 0 \end{cases} \tag{3.22}
\]
\[
P_4(x_4) = \begin{cases} 0 & \text{if } f_2(x_4) \geq 0 \\ \frac{k_{ptl4}}{2} f_2(x_4)^2 & \text{if } f_2(x_4) < 0 \end{cases} \tag{3.23}
\]

If we take the partial derivative of the potential energy function (3.20) with
respect to x_3, we have
\[
\left(\frac{\partial P_3(x_3)}{\partial x_3}\right)^T = \begin{cases} 0 & \text{if } f_1(x_3) \geq 0 \\ k_{ptl3}\, f_1(x_3) \left(\frac{\partial f_1(x_3)}{\partial x_3}\right)^T & \text{if } f_1(x_3) < 0 \end{cases} \tag{3.24}
\]
which can be written as
\[
\left(\frac{\partial P_3(x_3)}{\partial x_3}\right)^T = k_{ptl3}\, \min(0, f_1(x_3)) \left(\frac{\partial f_1(x_3)}{\partial x_3}\right)^T \tag{3.25}
\]

Similarly, for x_4 we can write the partial derivative of the potential energy
function (3.21) as
\[
\left(\frac{\partial P_4(x_4)}{\partial x_4}\right)^T = k_{ptl4}\, \min(0, f_2(x_4)) \left(\frac{\partial f_2(x_4)}{\partial x_4}\right)^T \tag{3.26}
\]

Thus, if w_{yolo} or h_{yolo} is larger than the desired value, the corresponding
term (∂P_3(x_3)/∂x_3)^T or (∂P_4(x_4)/∂x_4)^T becomes zero. These terms will be
used to calculate the end-effector velocity later in the thesis.

Meanwhile, as we want to move the bounding box center towards the image frame
center, we define the potential energy functions for x_1 and x_2 as
\[
P_1(x_1) = \frac{k_{px}}{2}(x_1 - x^c_{img})^2 \tag{3.27}
\]
\[
P_2(x_2) = \frac{k_{py}}{2}(x_2 - y^c_{img})^2 \tag{3.28}
\]

Combining (3.20), (3.21), (3.27) and (3.28), we can write the overall potential
energy function as
\[
\begin{aligned}
P_{com}(X) &= P_1(x_1) + P_2(x_2) + P_3(x_3) + P_4(x_4) \\
&= \frac{k_{px}}{2}(x_1 - x^c_{img})^2 + \frac{k_{py}}{2}(x_2 - y^c_{img})^2 \\
&\quad + \frac{k_{ptl3}}{2}[\min(0, f_1(x_3))]^2 + \frac{k_{ptl4}}{2}[\min(0, f_2(x_4))]^2
\end{aligned} \tag{3.29}
\]
where (x^c_{img}, y^c_{img}) are the center coordinates of the image frame.

Taking the partial derivative of (3.29), we get
\[
\left(\frac{\partial P_{com}(X)}{\partial X}\right)^T = \begin{bmatrix} \frac{\partial P_{com}(X)}{\partial x_1} \\ \frac{\partial P_{com}(X)}{\partial x_2} \\ \frac{\partial P_{com}(X)}{\partial x_3} \\ \frac{\partial P_{com}(X)}{\partial x_4} \end{bmatrix} \tag{3.30}
\]
\[
= \begin{bmatrix} k_{px}(x_1 - x^c_{img}) \\ k_{py}(x_2 - y^c_{img}) \\ k_{ptl3}\min(0, f_1(x_3))(x_3 - x^c_{yolo}) \\ k_{ptl4}\min(0, f_2(x_4))(x_4 - y^c_{yolo}) \end{bmatrix} \tag{3.31}
\]

Thus, ∂P_{com}(X)/∂X can be further written as
\[
\frac{\partial P_{com}(X)}{\partial X} = K \begin{bmatrix} x_1 - x^c_{img} \\ x_2 - y^c_{img} \\ \min(0, f_1(x_3))(x_3 - x^c_{yolo}) \\ \min(0, f_2(x_4))(x_4 - y^c_{yolo}) \end{bmatrix} \tag{3.32}
\]

where K is a diagonal gain matrix with positive entries,
\[
K = \begin{bmatrix} k_{px} & 0 & 0 & 0 \\ 0 & k_{py} & 0 & 0 \\ 0 & 0 & k_{ptl3} & 0 \\ 0 & 0 & 0 & k_{ptl4} \end{bmatrix} \tag{3.33}
\]

At the same time, we define the error function e(X) as
\[
e(X) = \begin{bmatrix} x_1 - x^c_{img} \\ x_2 - y^c_{img} \\ \min(0, f_1(x_3))(x_3 - x^c_{yolo}) \\ \min(0, f_2(x_4))(x_4 - y^c_{yolo}) \end{bmatrix} \tag{3.34}
\]

Since we want to minimize the error by sending a desired velocity command to the
robot, the relationship between the end-effector velocity in the task space and
the error in the visual space can be described as
\[
\dot{r} = -J^{yolo+}_{img} K e(X) \tag{3.35}
\]

By substituting (3.34) into (3.35), the overall controller is
\[
u(t) = \dot{r} = -J^{yolo+}_{img} K \begin{bmatrix} x_1 - x^c_{img} \\ x_2 - y^c_{img} \\ \min(0, f_1(x_3))(x_3 - x^c_{yolo}) \\ \min(0, f_2(x_4))(x_4 - y^c_{yolo}) \end{bmatrix} \tag{3.36}
\]

By applying this controller to the robot, we aim to control the YOLO bounding
box center coordinates and area simultaneously to improve the YOLO detection
results.
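To make the computation concrete, the sketch below (Python with NumPy) evaluates the error vector (3.34) and the control law (3.36). It is a minimal illustration under stated assumptions: the feature layout is [x^c, y^c, x^tl, y^tl], K is the diagonal gain of (3.33) (e.g. np.diag([k_px, k_py, k_ptl3, k_ptl4])), and J_yolo is the 4×6 image Jacobian of (3.14), for instance produced by the earlier yolo_jacobian sketch.

```python
import numpy as np

def region_error(box, img_center, wt, ht):
    """Error vector e(X) of (3.34): centring terms for the bounding-box
    centre, plus region terms that are non-zero only while the box is
    smaller than the desired width wt and height ht (see (3.18)-(3.19))."""
    xc, yc, xtl, ytl = box
    f1 = (xc - xtl) ** 2 - wt ** 2 / 4.0     # f1(x3) >= 0 inside the region
    f2 = (yc - ytl) ** 2 - ht ** 2 / 4.0     # f2(x4) >= 0 inside the region
    return np.array([
        xc - img_center[0],
        yc - img_center[1],
        min(0.0, f1) * (xtl - xc),
        min(0.0, f2) * (ytl - yc),
    ])

def control_input(J_yolo, box, img_center, wt, ht, K):
    """Overall control law (3.36): u = -pinv(J_yolo) K e(X)."""
    e = region_error(box, img_center, wt, ht)
    return -np.linalg.pinv(J_yolo) @ (K @ e)
```

Note that the min(0, ·) factors make the size terms vanish as soon as the bounding box is at least as large as the desired region, so near convergence the controller reduces to pure centering.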

3.3.3 Lyapunov Stability Analysis

Next, the stability analysis is conducted on the closed-loop system using both a
kinematic approach and a dynamic approach.

Kinematic Analysis

First, we propose the Lyapunov-like function
\[
\begin{aligned}
V = P_{com}(X) &= \frac{k_{px}}{2}(x_1 - x^c_{img})^2 + \frac{k_{py}}{2}(x_2 - y^c_{img})^2 \\
&\quad + \frac{k_{ptl3}}{2}[\min(0, f_1(x_3))]^2 + \frac{k_{ptl4}}{2}[\min(0, f_2(x_4))]^2
\end{aligned} \tag{3.37}
\]

where Pcom (X) is defined in (3.29).

P_{com}(X) is a continuous scalar function with continuous first partial
derivatives. Differentiating (3.37) with respect to time gives
\[
\dot{V} = \dot{X}^T \frac{\partial P_{com}(X)}{\partial X} \tag{3.38}
\]

Combining (3.30) and (3.34), we have
\[
\frac{\partial P_{com}(X)}{\partial X} = K e(X) \tag{3.39}
\]

Combining (3.9) with (3.35), we obtain the relationship between the pixel error
and the image feature velocity:
\[
\dot{X} = J^{yolo}_{img}\dot{r} = J^{yolo}_{img}\left(-J^{yolo+}_{img} K e(X)\right) = -J^{yolo}_{img} J^{yolo+}_{img} K e(X) = -K e(X) \tag{3.40}
\]

which means the image feature velocity is proportional to the error function e(X).

Substituting (3.40) and (3.39) into (3.38) gives
\[
\dot{V} = (-K e(X))^T (K e(X)) = -e(X)^T K^T K e(X) \leq 0 \tag{3.41}
\]
Since K in (3.33) is a diagonal matrix with all positive entries, its eigenvalues
are all positive and K is a positive definite matrix, so V̇ ≤ 0 with V̇ = 0 only
when e(X) = 0. The system is therefore stable in the sense of Lyapunov, and by
the invariance argument the state converges to the set where
\[
e(X) = \begin{bmatrix} x_1 - x^c_{img} \\ x_2 - y^c_{img} \\ \min(0, f_1(x_3))(x_3 - x^c_{yolo}) \\ \min(0, f_2(x_4))(x_4 - y^c_{yolo}) \end{bmatrix} = 0 \tag{3.42}
\]

This shows that, at the stable state,
\[
\begin{aligned}
x_1 &= x^c_{img} \\
x_2 &= y^c_{img} \\
\min(0, f_1(x_3))(x_3 - x^c_{yolo}) &= 0 \\
\min(0, f_2(x_4))(x_4 - y^c_{yolo}) &= 0
\end{aligned} \tag{3.43}
\]

For x_1 and x_2, this means they equal the image frame center coordinates at the
stable state. For x_3 and x_4, note that whenever the bounding box exists,
\[
x_3 - x^c_{yolo} \neq 0, \qquad x_4 - y^c_{yolo} \neq 0 \tag{3.44}
\]

For the third and fourth equations of (3.43) to equal zero, we therefore need
\[
\min(0, f_1(x_3)) = 0, \qquad \min(0, f_2(x_4)) = 0 \tag{3.45}
\]

This implies f_1(x_3) ≥ 0 and f_2(x_4) ≥ 0. Substituting (3.18) and (3.19) into
these conditions, we get
\[
(x^c_{yolo} - x_3)^2 - \frac{w_t^2}{4} \geq 0, \qquad (y^c_{yolo} - x_4)^2 - \frac{h_t^2}{4} \geq 0 \tag{3.46}
\]

That is, the YOLO bounding box height and width are larger than or equal to the
preset thresholds. Hence, when the system reaches the stable state, all of the
variables meet our preset requirements.

Dynamic Analysis

Furthermore, we want to prove the system's stability using the dynamic approach.

Let r ∈ R^n represent the position vector of the robot in the task space [48]; then
\[
r = h(q) \tag{3.47}
\]
where q ∈ R^n is the vector of joint coordinates and h(·): R^n → R^n describes the
transformation from the joint space to the task space.

Thus, the velocity vector ṙ is related to q̇ by
\[
\dot{r} = J(q)\,\dot{q} \tag{3.48}
\]
where J(q) is the Jacobian matrix mapping joint velocities to task-space velocities.

The equations of motion of a robot with n degrees of freedom are given in the joint
space as
\[
M(q)\ddot{q} + \left(\tfrac{1}{2}\dot{M}(q) + S(q, \dot{q})\right)\dot{q} + g(q) = \tau \tag{3.49}
\]
where M(q) is a symmetric, positive definite inertia matrix, S(q, q̇) is a
skew-symmetric matrix, g(q) denotes the gravitational force vector and τ denotes
the control input.

Meanwhile, we propose a task-space region-reaching controller
\[
\tau = -K_v \dot{q} - J^T(q)\, J^{yolo\,T}_{img}\, \frac{\partial P_{com}}{\partial X} + g(q) \tag{3.50}
\]
where K_v ∈ R^{n×n} is a positive definite velocity feedback gain matrix, J^T(q)
is the transpose of the robot Jacobian matrix, and J^{yolo T}_{img} is the
transpose of the image Jacobian matrix.

Combining (3.49) and (3.50) gives
\[
M(q)\ddot{q} + \left(\tfrac{1}{2}\dot{M}(q) + S(q, \dot{q})\right)\dot{q} + K_v \dot{q} + J^T(q)\, J^{yolo\,T}_{img}\, \frac{\partial P_{com}}{\partial X} = 0 \tag{3.51}
\]

To carry out the stability analysis, a Lyapunov-like function is defined as
\[
V = \frac{1}{2}\dot{q}^T M(q)\dot{q} + P_{com} \tag{3.52}
\]

Differentiating (3.52) gives
\[
\dot{V} = \dot{q}^T M(q)\ddot{q} + \frac{1}{2}\dot{q}^T \dot{M}(q)\dot{q} + \dot{X}^T\frac{\partial P_{com}}{\partial X} \tag{3.53}
\]

Substituting (3.51) into (3.53) and using the skew-symmetry of S(q, q̇),
\[
\dot{V} = -\dot{q}^T K_v \dot{q} - \dot{q}^T J^T(q)\, J^{yolo\,T}_{img}\, \frac{\partial P_{com}}{\partial X} + \dot{X}^T \frac{\partial P_{com}}{\partial X} \tag{3.54}
\]

Since Ẋ^T = ṙ^T J^{yolo T}_{img} = q̇^T J^T(q) J^{yolo T}_{img}, the last two terms
in (3.54) cancel and it simplifies to
\[
\dot{V} = -\dot{q}^T K_v \dot{q} \leq 0 \tag{3.55}
\]

Since K_v is a positive definite matrix, V̇ ≤ 0 and the system remains stable.

From LaSalle's invariance theorem [49], we have q̇ → 0 as t → ∞, and from (3.51)
the largest invariant set satisfies
\[
J^T(q)\, J^{yolo\,T}_{img}\, \frac{\partial P_{com}}{\partial X} = 0 \tag{3.56}
\]

If the Jacobian matrices are non-singular, (3.56) further simplifies to
\[
\frac{\partial P_{com}}{\partial X} = 0 \tag{3.57}
\]

The singularity of the Jacobian matrix can be monitored by checking the
manipulability of the manipulator [50]. Singularity avoidance can be achieved by
using a redundant robot with task-priority control [51], exploiting the null space
of the Jacobian matrix.
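One simple way to perform such monitoring is to track Yoshikawa's manipulability measure along the trajectory; the sketch below is a generic check under that assumption and is not part of the proposed controller.

```python
import numpy as np

def manipulability(J):
    """Yoshikawa manipulability w = sqrt(det(J J^T)) of a robot Jacobian J;
    values close to zero indicate that the arm is approaching a singularity."""
    return float(np.sqrt(max(np.linalg.det(J @ J.T), 0.0)))
```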

Substituting (3.30) into (3.57) gives
\[
K \begin{bmatrix} x_1 - x^c_{img} \\ x_2 - y^c_{img} \\ \min(0, f_1(x_3))(x_3 - x^c_{yolo}) \\ \min(0, f_2(x_4))(x_4 - y^c_{yolo}) \end{bmatrix} = 0 \tag{3.58}
\]

Similar to the kinematic analysis, we can then show that
\[
\begin{aligned}
x_1 &= x^c_{img} \\
x_2 &= y^c_{img} \\
(x^c_{yolo} - x_3)^2 - \frac{w_t^2}{4} &\geq 0 \\
(y^c_{yolo} - x_4)^2 - \frac{h_t^2}{4} &\geq 0
\end{aligned} \tag{3.59}
\]

It means that all of the variables reach the desired value or are within the desired
region when the system is stable.

Chapter 4

Experimental Setup and Experimental Results

Experiments are the final step to verify the proposed control algorithm, and it is
essential to choose the correct hardware and experimental setup before conducting
them. In this chapter, the content is split into the vision system and the robot
system. The vision system section includes the camera description and the
experimental results of the object detector. The robot system section presents the
robot manipulator description and the final experimental results.

4.1 Vision System: Hardware Description and


Object Detection Model

4.1.1 Hardware Description

The Intel RealSense Depth Camera D435 is a stereo camera whose dimensions are
90 mm × 25 mm × 25 mm. It consists of one IR projector, two imagers, and one RGB
module. Its technical specifications are shown in Table 4.1.

Tech Specs        Depth                RGB
FOV               87° × 58°            69° × 42°
Resolution        Up to 1280 × 720     Up to 1920 × 1080
Frame Rate        Up to 90 fps         Up to 30 fps
Depth Accuracy    < 2% at 2 m          -

Table 4.1: Technical specifications of the Intel RealSense Depth Camera D435

Moreover, as demonstrated on the official website¹, its ideal working range is
from 0.3 m to 3 m, which means the camera is capable of performing daily
inspection work. In addition, with a weight of only 72 grams, it is easy to mount
the camera on top of the robot system without worrying about the added load.

4.1.2 Data Collection, Training and Results

The last step before training an object detection model is preparing a suitable
dataset. A high-quality dataset is a solution for well-performing object detection.
A good dataset should contain more diversified images, including di↵erent lighting
conditions, distance, and di↵erent types of objects. The more we input into the
model, the better it could learn and predict in real-time. In our case, we select
eight classes that appear on the construction site frequently. They are doors,
windows, electrical switches, electrical powers, electrical mains, electrical telecom
port, electrical lights, and tile cracks.

Among the 1098 images collected, 1042 were taken at the TeamBuild construction
site in Sengkang, Singapore. The rest were collected in the robotics lab at
Nanyang Technological University as supplemental data.

After obtaining these images, we use the open-source tool LabelImg to create
annotations for the following training as shown in Figure 4.1. LabelImg enables us
to draw the ground truth bounding boxes around the objects. It will automatically
save the labels in TXT format, which YOLOv3 requires.
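In this darknet-style TXT format, each line typically stores one object as "class x_center y_center width height", with the coordinates normalised by the image size. A minimal parsing sketch is shown below; the function name and return layout are illustrative assumptions.

```python
def load_yolo_labels(txt_path, img_w, img_h):
    """Parse a darknet-style TXT annotation file into pixel-space boxes:
    each line is 'class x_center y_center width height' with values
    normalised by the image width and height."""
    boxes = []
    with open(txt_path) as f:
        for line in f:
            cls, xc, yc, w, h = line.split()
            boxes.append((int(cls),
                          float(xc) * img_w, float(yc) * img_h,
                          float(w) * img_w, float(h) * img_h))
    return boxes
```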

When all images are labeled with correct annotations, training can begin. The
learning rate is chosen as 0.00025, and the maximum number of training batches is
3000. We deploy a Tesla V100 DGX Station server to train the model; it takes
around 6 hours to obtain the final weights.
1
https://www.intelrealsense.com/depth-camera-d435/

Figure 4.1: Basic user interface of the labeling tool LabelImg. The ground truth
bounding box is drawn in green.

4.1.3 Experimental Results

We proceed to test the trained model on the testing dataset. Two hundred and
twenty images are randomly picked from the entire dataset; the model has never
seen them before. Some sample detection results are shown in Figure 4.2, and the
mAP for each class is given in Table 4.2.


Figure 4.2: YOLO detection results of testing images. electrical telecom and
electrical power are shown in (a), door and electrical switch are shown in (b),
electrical light is shown in (c), electrical switch is shown in (d), window installed
is shown in (e) and tile crack is shown in (f)

Class Name               mAP (%)
doors installed          82.13
windows installed        84.18
electrical switch        91.02
electrical power         92.62
electrical telecommute   96.03
electrical lights        86.78
tile cracks              78.17
Table 4.2: YOLO detection performance on 8 classes in the testing dataset.

From Table 4.2 we can see that most classes reach a decent mAP of around 90% on
the testing dataset, while some classes, such as tile cracks and doors installed,
have slightly lower accuracy. Moreover, we expect a decrease during the real-time
experiments, as the robot's movement leads to blurry and jerky images. Thus, more
assistance should come from the robotic side to boost the performance of the
machine learning model: the control algorithm should be designed to enable the
camera to position itself at an optimal location for viewing the object.

4.2 Robot System

4.2.1 Hardware Description

• Universal Robots UR5e Robot Manipulator


UR5e is a robotic manipulator with 6 DoF, as shown in Figure 4.3. As demonstrated
in Table 4.3, each of its revolute joints can rotate ±360°. The pose repeatability
is ±0.03 mm, which is highly precise: if we command the robot to reach the same
location multiple times, the difference between trials stays within 0.03 mm,
which ensures consistent performance across repeated experiments. With three DoF
a robot can already reach any position in 3D space within its reach limit; having
more than three rotating joints offers the robot additional freedom in orientation
when moving towards a commanded target. The 500 Hz system update frequency
ensures there is nearly no delay in communication between the robot and the
processor.

On top of the technical specifications listed in Table 4.3, there are also some
limitations of the robot due to safety settings.
The depth camera is mounted on the end-effector of the manipulator, as shown in
Figure 4.3. We call this an eye-in-hand model, which means the camera moves
together with the robot end-effector. The robot and the processor communicate
with each other over an Ethernet socket.

Figure 4.3: UR5e robot manipulator in the Nanyang Technological University
robotics lab setup

Weight                      20.7 kg
Reach                       850 mm
Maximum Payload             5 kg
Joint Ranges                ±360° for all joints
Speed                       Joints: max 180°/s, Tool: approx. 1 m/s
System Update Frequency     500 Hz
Pose Repeatability          ±0.03 mm
Degrees of Freedom          6 rotating joints
Communication               Ethernet socket, MODBUS TCP & EtherNet/IP adapter, Profinet

Table 4.3: Technical specification of the UR5e robot manipulator from Universal
Robots

4.2.2 Experimental Results

1. Proof of Validity of the Selected Controlling Features


To design our control algorithm stated in Section 3.3, we choose the center of
the bounding box and the length of the bounding box to be the controlling
image features.
Before conducting the experiments, we want to prove the validity of the
selected features and demonstrate their influence on the confidence level.
Firstly, we fix the camera orientation so that the YOLO bounding box stays at
the center of the image frame, in order to test the influence of changing the
bounding box length. Then, we move the camera towards the object so that the
bounding box length increases gradually, and record the confidence level
throughout the experiment. The results are plotted in Figure 4.4(a); the
y-axis is the YOLO confidence level and the x-axis is the YOLO bounding box
width. When the bounding box length is relatively short, the confidence level
remains fairly low, clustering around 60%. When the bounding box length
increases to 180 pixels, the confidence level improves to around 85%.
To further prove this mathematically, we fit a linear regression model to the
data (a minimal fitting sketch in Python is given at the end of this part). The
linear regression model with a single regressor is defined as [50]
\[
Y_i = \beta_0 + \beta_1 X_i + u_i \tag{4.1}
\]
where the index i runs over the observations, Y_i is the dependent variable (the
regressand), X_i is the independent variable (the regressor), β_0 + β_1 X_i is
the population regression line (also called the population regression function),
β_0 is its intercept, β_1 is its slope, and u_i is the error term.
Thus, we fit a linear regression model with a 95% confidence interval as
shown in Figure 4.4 to display the relationship between confidence level and
bounding box length.

Figure 4.4: (a) shows the change of confidence level for bounding box length.
The y-axis is the YOLO confidence level and the x-axis is the YOLO bounding
box width. The blue line is the linear regression model with a 95% confidence
interval as the boundary shadowed in blue. (b) demonstrates the change of
confidence level to distance between the bounding box center and image frame
center. The x-axis is the pixel distance between the YOLO bounding box center
and the image center and the y-axis is the YOLO confidence level. The blue line
is the linear regression model with a 95% confidence interval as the boundary
shadowed in blue.

As per the calculation, the regression model for the bounding box length and the
confidence level is
\[
y^c_i = 0.3\,x^{len}_i + 35.9 + 0.02699 = 0.3\,x^{len}_i + 35.92699 \tag{4.2}
\]
where y^c_i is the YOLO confidence level for frame i and x^{len}_i is the
corresponding YOLO bounding box length.
The slope of the regression line is 0.3, the intercept is 35.9, and the error is
0.02699. The regression model shows that the confidence level and the bounding
box length are positively correlated, which means that increasing the bounding
box length improves the confidence level.
In the second experiment, we fixed the distance between the camera and the
object to minimize changes in the bounding box length. By controlling only the
orientation of the robot end-effector, we move the bounding box across the whole
image frame and record the confidence level accordingly. The experimental result
is shown in Figure 4.4(b); the x-axis is the pixel distance between the YOLO
bounding box center and the image center, and the y-axis is the YOLO confidence
level. When the distance is close to zero, the confidence level reaches more than
80%, and as the box moves away the confidence level gradually drops below 60%.
We also calculate the corresponding regression model:
\[
y^c_i = -x^{dist}_i + 83.9 + 0.009574 = -x^{dist}_i + 83.909574 \tag{4.3}
\]
where y^c_i is the YOLO confidence level for frame i and x^{dist}_i is the
corresponding distance from the YOLO bounding box center to the image frame
center.
The slope of the regression line is -1, the intercept is 83.9, and the error is
0.00957. The regression model shows that the confidence level and the distance
of the bounding box center from the image center are negatively correlated, which
means that moving the bounding box center closer to the image center improves
the confidence level.
However, these experimental results cannot be generalized; they are specific to
the customized YOLO model, the experimental setup, and the object features.
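The fitting sketch referred to above is given here. It is a minimal ordinary least squares fit, assuming the per-frame measurements have been logged into NumPy arrays; the function name is illustrative and not part of the experimental code.

```python
import numpy as np

def fit_confidence_model(x, y):
    """Fit y = b0 + b1 * x as in (4.1), where x holds the logged bounding-box
    length (or centre distance) per frame and y the YOLO confidence level."""
    slope, intercept = np.polyfit(x, y, deg=1)   # least-squares line
    residuals = y - (intercept + slope * x)      # estimated error terms u_i
    return slope, intercept, residuals
```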

2. Comparison of the Performance of the Controller Using YOLO Models Trained
with Different Numbers of Epochs

The following experimental results demonstrate the performance of the controller
when using YOLO models trained for different numbers of epochs. The aim is to
determine how many epochs the model needs to be trained for the controller to
perform properly.
In total, our YOLO model is trained for 3000 epochs. We select the weights at
500, 1000, 2000, and 3000 epochs to compare their performance. The robot's
starting position and orientation, as well as the controller gain, are fixed to
ensure that the starting conditions are the same for every experiment.
When epochs only reach 500, the model is not properly trained so there is no
detection throughout the testing period.
When trained for 1000 epochs, there are a few detections at certain locations,
but they are very unstable. Figure 4.5(a) shows the YOLO confidence level during
the experiment. We can observe that the detections only appear in the first few
steps; after the robot manipulator starts moving, the model cannot detect the
crack. Also, most of the confidence level values are below 20%. Similarly, the
bounding box length in Figure 4.5(b) and the center location shown in
Figure 4.5(c) cannot be computed properly and change dramatically during the
experiment. For these reasons, the controller cannot continue to move towards
the target location. One thousand epochs are therefore not enough for the
controller to perform properly.

Figure 4.5: Change of the YOLO bounding box location, length and confidence
level during the experiment using the YOLO model trained with 1000 epochs.

Figure 4.6: Change of the YOLO bounding box location, length and confidence
level during the experiment using the YOLO model trained with 2000 epochs.

When the model is trained for 2000 epochs, the YOLO detections appear in most of
the frames and are quite stable, as we can see in Figure 4.6. The robot
manipulator can move towards the center of the image frame, as displayed in
Figure 4.6(c). However, the bounding box is excessively larger than the actual
object size and exceeds the preset threshold from the beginning, as shown in
Figure 4.6(b). Thus, the controller is not able to further increase the bounding
box size. For this reason, the YOLO confidence level does not improve either, as
demonstrated in Figure 4.6(a); it is below 50% at the end of the experiment even
though the bounding box moves towards the center of the image frame.

Figure 4.7: Change of the YOLO bounding box location, length and confidence
level during the experiment using the YOLO model trained with 3000 epochs.

When training reaches 3000 epochs, which is also the final weight used in our
experiments, the center coordinates of the YOLO bounding box converge gradually
to the center of the image frame (640, 360), as shown in Figure 4.7(c).
Meanwhile, the height of the YOLO bounding box reaches the preset threshold and
maintains that level, as displayed in Figure 4.7(b). Following that, the
confidence level increases from below 50% to above 70%.
In summary, the average confidence level after reaching the desired target is
listed in Table 4.4 below for models trained with different numbers of epochs.
The average confidence level is calculated by averaging the recorded YOLO
confidence level after the YOLO bounding box center has reached the image frame
center. If the robot manipulator is unable to approach the target or there is
no detection, we mark the confidence level as 0%.

No. of Epochs    500    1000    2000    3000
Average Conf.    0%     0%      25%     87%

Table 4.4: Comparison of average confidence level with object detection models
trained with different numbers of epochs.

According to Table 4.4, when the number of epochs is 1000 or fewer, there is
either no detection or the detection is not stable enough for the controller to
operate. When the model is trained with 2000 epochs, the robot can reach the
desired target but with a low confidence level. When the number of epochs
reaches 3000, the controller can approach the image frame center with a high
confidence level while maintaining the length of the bounding box at a certain
level. At least 3000 epochs are therefore needed for the controller to perform
properly.

3. Experimental Results
After obtaining decent YOLO detection results, we proceed to test the controller
with the target objects. As we can see in Figure 4.8, in the sequence from (a)
to (d) the bounding box moves from the bottom-right corner towards the center of
the image frame while still maintaining the desired size during the movement.
The confidence level increases from 24.65% to 84.13%.

Figure 4.8: Change of the bounding box location and confidence level during the
tile crack experiment

We can observe the same trend in the corresponding plots. In Figure 4.9(a), the
coordinates of the bounding box center start at (870, 510) and converge to the
center of the image frame (640, 320) within 40 steps, which is around 20 seconds.
Since the bounding box height is larger than its width, we only control the
height; it stays above the 180-pixel threshold throughout the experiment. The
confidence level in Figure 4.9(c) shows an increase when comparing the values at
the start and the end of the experiment.

Figure 4.9: Change of the bounding box location, bounding box length and
confidence level during the tile crack experiment


Figure 4.10: Change of the bounding box location and confidence level during the
electrical power experiment

We also repeated the experiments with the electrical power class to show the
generality of the control algorithm. In Figure 4.10, it is observed that the
bounding box moves towards the center of the image frame and the confidence
level improves from 35% to 92.6%.
We can observe the same trend in the corresponding plots. In Figure 4.11(a), the
coordinates of the bounding box center begin at (1000, 600) and converge to the
center of the image frame (640, 320) within 80 steps, which is around 40 seconds.
Since the bounding box height is larger than the width, we only control the
height; it is 105 pixels in the beginning and increases to more than 90%. The
confidence level in Figure 4.11(c) shows a significant improvement when comparing
the values at the start and the end of the experiment.

Figure 4.11: Change of the bounding box location, bounding box length and
confidence level during the electrical power experiment
These two experiments demonstrate the ability to improve the YOLO confi-
dence level by moving towards the image center and maintaining the bound-
ing box length.

4. Comparison of the Bounding Box Centering Controller with the Proposed
Controller

To show the superiority of the proposed controller, we compare its experimental
results with those of the centering-only controller. Ten different starting
points are chosen to ensure that the two sets of experiments have the same
starting conditions and to show the generality of the results.
We first conduct the experiments on the UR5e robot manipulator introduced in
Section 4.2.1 by purely applying the visual servoing algorithm for centering the
bounding box, as described in equation (3.15). We choose tile cracks as the
target object.
Table 4.5 shows the experimental results, including the confidence level at the
start and end of the experiments, for 10 trials using each of the two
controllers. Among the ten experiments, Nos. 3, 6, and 7 in Table 4.5 show a
decreasing trend in confidence level. This happens because the centering
controller sometimes moves the target object towards the center of the image
frame without considering the size of the object. It also points out the
necessity of deploying a region-reaching controller alongside the bounding box
centering controller.

No. Exp   Start Coord.        Start Conf.   End Conf. (Centering)   End Conf. (Region)
1         (328.59, 202.37)    34.65         87.79                   97.81
2         (324.57, 1002.16)   27.73         38.24                   93.25
3         (340.44, 257.20)    46.36         29.33                   83.53
4         (349.37, 471.18)    43.82         66.41                   93.77
5         (427.91, 965.89)    35.74         89.33                   87.76
6         (368.01, 419.78)    57.29         44.39                   92.36
7         (328.59, 202.37)    38.54         32.27                   84.17
8         (789.88, 427.46)    62.81         87.38                   93.24
9         (344.9, 355.41)     52.29         69.44                   92.39
10        (329.00, 484.15)    37.75         89.76                   91.23
Avg       -                   43.70         63.43                   90.95

Table 4.5: Ten sets of YOLO bounding box starting center coordinates and their
confidence levels. The end confidence level is the confidence level recorded when
the corresponding controller has brought the bounding box to the center of the
image frame.

Meanwhile, we repeated the experiments from the same starting points using the
region-based controller proposed in Section 3.3, which utilizes both the center
of the bounding box and the length of the bounding box to compute the velocity of
the robot manipulator and improve the confidence level. Based on Table 4.5, all
ten experiments show an increasing trend in the YOLO confidence level by the end
of the experiment.
On average, the region-based controller reaches a 90.95% confidence level at the
end of the experiments, while the centering-only controller reaches only 63.43%;
taking the bounding box size into consideration boosts the confidence level.
Within the ten experiments conducted, the region controller shows better
performance, with a higher confidence level at the end of the test, in nine of
them. These results compare the performance of the centering-only controller and
the proposed controller over ten starting points, and show the importance of
keeping an appropriate object size in the image frame while moving towards the
center.
This control algorithm will be beneficial for automated inspection work on
construction sites, where the surrounding environment may differ from the
training conditions and the pre-trained model may not perform well in real-time
testing. The algorithm enhances the confidence level by moving towards a better
viewing point and thus allows the detection algorithm to further confirm the
presence of the objects.

Chapter 5

Conclusion and Recommendation

5.1 Conclusion

In this thesis, we discuss the application of vision-based robotic control and object
detection in the construction inspection process.

Chapter 1 introduces the background of the research work. It lists the reasons
why construction automation is an essential component of modernization. First,
data from the BCA in Singapore show that a large number of construction workers
are needed every year. However, the injury and fatality rates in the construction
industry remain among the highest due to various types of accidents. Moreover,
labor productivity shows a decreasing trend in the construction industry, which
suggests that few new technologies have been applied in the field. Starting from
early 2020, COVID-19 has further worsened the situation of recruiting new workers
from overseas. For these reasons, the industry has been seeking robotic and
automation solutions, which also explains our motivation for exploring
construction inspection robots. We aim to improve productivity by applying neural
networks and robotic technologies.

Chapter 2 presents the literature review for the research work. Firstly, to
understand the current status of construction automation, we reviewed different
types of construction robots such as excavation robots, interior finishing
robots, and quality inspection and assessment robots. Following that, since a
basic understanding of neural networks is essential for object detection, the
different steps of training a neural network are covered, including splitting the
data, constructing the network architecture, and tuning the loss function.
Meanwhile, we survey various types of object detection models. Lastly, some
vision-based control algorithms are reviewed.

Chapter 3 describes the methodology of the proposed controller. It is split into
two parts: the vision algorithm and the control algorithm. It starts with
background information on the camera configuration, rigid body velocity, and the
Jacobian matrices. After that, an image-based visual servoing algorithm is
introduced and we describe how it is integrated with the region-based controller.
Finally, we prove the stability of the system using a kinematic and a dynamic
approach.

Chapter 4 presents the experimental results of the proposed algorithm. First, we
demonstrate the performance of the object detector on a customized dataset. Then,
the ability to improve the confidence level is shown by applying the proposed
controller to different types of objects. We compare the success rate and results
of the centering-only controller and the proposed controller to show the
superiority of the proposed method.

The situation on a real construction site is complicated and dynamic. Positioning
the camera at an appropriate location and distance can improve detection
accuracy. This thesis demonstrates the possibility of integrating a control
algorithm with object detection and applying it to the construction process.

5.2 Recommendation for Future Research

We list three main recommendations with potential for future research.

1. Modify the Object Detection Model


Although YOLO is an advanced object detector, some drawbacks still appeared when
conducting the experiments. One of the most important is that the YOLO bounding
box center coordinates and size vary significantly during the experiment: the
center coordinates and the size of the bounding boxes fluctuate around the ground
truth due to the unpredictable behavior of the neural network. Such sudden
changes in the image frame may lead to sudden movements of the robotic arm and
can even cause the experiment to fail.
To improve the stability of the detection results, a Long Short Term Memory
(LSTM) model could be combined with the existing YOLO object detector.
LSTM is a recurrent neural network (RNN) architecture that can remember
past information. After the detection model generates the bounding box
information, it will become the input of the LSTM model. After collecting
the real-time bounding box information and labelling the ground truth at the
same time stamp, we can train an LSTM model to minimize the difference.
By doing this, we are able to predict the bounding box based on the current
image information and the bounding box information from the past frames.
Moreover, we can also consider deploying a newer object detection model such as
YOLOv5¹, which was released in 2020. It is reported to have higher average
precision and a faster frame rate than YOLOv3 on the test dataset.
After modifying the existing object detection model, we aim to improve the
stability of the proposed controller.

2. Take Rotational Information into Account


Currently, the deployed object detector can only provide a bounding box whose
sides are parallel to the image frame borders, and we only use the center
coordinates and the length of the bounding box to control the robot manipulator.
These features provide enough information to guide the robot in most
circumstances. However, there are still situations in which the center
coordinates alone are not enough to determine the pose of the target object.
If the object is rotated about the z-axis of the camera coordinates, the object
detector may not perform well, because the current YOLOv3 detector can only
provide an axis-aligned bounding box and is not capable of predicting a skewed
one. The axis-aligned box then includes extra background, which can significantly
affect the confidence level and the bounding box location.
1
https://github.com/ultralytics/yolov5

To tackle the stated problem, an object detector with rotated bounding boxes can
be considered, for example the rotated-object detector from NVIDIA². It provides
an additional rotation angle for the detected bounding box.
After obtaining the angle information from the object detector, we can add it to
the current controller as a fifth control feature. In this case, the bounding box
center location, size, and rotation angle can all be controlled by the robot
manipulator to obtain a better view of the target object.

3. Hardware Modification: Combining it with a Wheeled Robot


As stated in the Section 1.2, we aim to develop a robotic solution for construc-
tion automation. To better suit the need of working in a real construction
site, the robot manipulator can be mounted onto a wheeled robot.
First, we can generate a list of components with location information based
on the BIM (Building Information Model) and also provide a goal point list
according to component locations. Then, this wheeled robot can navigate
using the goal point list from room to room and stop at a predefined loca-
tion. After that, the robot manipulator can perform the installation check by
applying the proposed algorithm. It will try to approach the tar-
get component and put it in the center of the image frame with no rotation.
The image of the component will be captured and processed using object
detector. If the confidence level is above the threshold, the component will
be marked as installed. After finishing checking all the listed components,
the system can generate a progress report with the status of all the compo-
nents. The quality engineer can use the progress report to check if there is
any component missing in the room and fix it afterwards.
By doing this, the whole building inspection process can be automated with much
less manual work.

2
https://github.com/NVIDIA/retinanet-examples

List of Author’s Publications1

Journal Article

• Muhammad Ilyas*, Hui Ying Khaw, Nithish Muthuchamy Selvaraj, Yuxin Jin,
Xinge Zhao, I-Ming Chen, and Chien Chern Cheah (2021), Robot-Assisted Object
Detection for Construction Automation: Data and Information-Driven Approach,
IEEE/ASME Transactions on Mechatronics.

¹ The superscript * indicates joint first authors.

Bibliography

[1] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv
preprint arXiv:1804.02767, 2018. xix, 27

[2] Construction Industry Handbook. Construction industry handbook 2012 [research
report]. Japan Federation of Construction Contractors, Tokyo, 2012. 2

[3] Shamil George Naoum. Factors influencing labor productivity on construction


sites. International Journal of Productivity and Performance Management,
2016. 2

[4] Thomas Bock and Thomas Linner. Robot-Oriented Design: design and man-
agement tools for the deployment of automation and robotics in construction.
Cambridge University Press, 2015. 8

[5] Abraham Warszawski and Yehiel Rosenfeld. Robot for interior-finishing


works in building: feasibility analysis. Journal of construction engineering and
management, 120(1):132–151, 1994. 8

[6] Ehsan Asadi, Bingbing Li, and I-Ming Chen. Pictobot: a cooperative paint-
ing robot for interior finishing of industrial developments. IEEE Robotics &
Automation Magazine, 25(2):82–94, 2018. 8

[7] Robert T Pack, Joe L Christopher, and Kazuhiko Kawamura. A rubbertuator-
based structure-climbing inspection robot. In Proceedings of International
Conference on Robotics and Automation, volume 3, pages 1869–1874. IEEE,
1997. 9

[8] Je-Keun Oh, Giho Jang, Semin Oh, Jeong Ho Lee, Byung-Ju Yi, Young Shik
Moon, Jong Seh Lee, and Youngjin Choi. Bridge inspection robot system with
machine vision. Automation in Construction, 18(7):929–941, 2009. 9

[9] John Canny. A computational approach to edge detection. IEEE Transactions
on pattern analysis and machine intelligence, (6):679–698, 1986. 9

[10] Spencer Gibb, Tuan Le, Hung Manh La, Ryan Schmid, and Tony Berend-
sen. A multi-functional inspection robot for civil infrastructure evaluation
and maintenance. In 2017 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS), pages 2672–2677. IEEE, 2017. 9

[11] Fábio Celestino Pereira and Carlos Eduardo Pereira. Embedded image
processing systems for automatic recognition of cracks using uavs. IFAC-
PapersOnLine, 48(10):16–21, 2015. 10

[12] David Mader, Robert Blaskow, Patrick Westfeld, and Cornell Weller. Po-
tential of uav-based laser scanner and multispectral camera data in building
inspection. International Archives of the Photogrammetry, Remote Sensing &
Spatial Information Sciences, 41, 2016. 10

[13] David H Hubel and Torsten N Wiesel. Receptive fields and functional archi-
tecture of monkey striate cortex. The Journal of physiology, 195(1):215–243,
1968. 12

[14] Kunihiko Fukushima and Sei Miyake. Neocognitron: A self-organizing neural
network model for a mechanism of visual pattern recognition. In Competition
and cooperation in neural nets, pages 267–285. Springer, 1982. 12

[15] Christopher M Bishop et al. Neural networks for pattern recognition. Oxford
university press, 1995. 13

[16] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods
for online learning and stochastic optimization. Journal of machine learning
research, 12(7), 2011. 15

[17] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech,
and time series. The handbook of brain theory and neural networks, 3361(10):
1995, 1995. 15

[18] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-
based learning applied to document recognition. Proceedings of the IEEE, 86
(11):2278–2324, 1998. 15

[19] Dan Claudiu Cireşan, Ueli Meier, Luca Maria Gambardella, and Jürgen
Schmidhuber. Deep, big, simple neural nets for handwritten digit recogni-
tion. Neural computation, 22(12):3207–3220, 2010. 16

[20] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature
hierarchies for accurate object detection and semantic segmentation. In Pro-
ceedings of the IEEE conference on computer vision and pattern recognition,
pages 580–587, 2014. 17

[21] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference
on computer vision, pages 1440–1448, 2015. 17

[22] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: To-
wards real-time object detection with region proposal networks. arXiv preprint
arXiv:1506.01497, 2015. 18, 23, 27

[23] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks
for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 18

[24] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only
look once: Unified, real-time object detection. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
18

[25] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. In Pro-
ceedings of the IEEE conference on computer vision and pattern recognition,
pages 7263–7271, 2017. 19

[26] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed,
Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector.
In European conference on computer vision, pages 21–37. Springer, 2016. 19,
27

[27] Marcel Neuhausen and Markus König. Automatic window detection in facade
images. Automation in Construction, 96:527–539, 2018. 20

[28] Hoang Nhat-Duc, Quoc-Lam Nguyen, and Van-Duc Tran. Automatic recogni-
tion of asphalt pavement cracks using metaheuristic optimized edge detection
algorithms and convolution neural network. Automation in Construction, 94:
203–213, 2018. 20

[29] Jun Zhang, Xing Yang, Weiguang Li, Shaobo Zhang, and Yunyi Jia. Auto-
matic detection of moisture damages in asphalt pavements from gpr data with
deep cnn and irs method. Automation in Construction, 113:103119, 2020. 20

[30] Maryam Kouzehgar, Yokhesh Krishnasamy Tamilselvam, Manuel Vega Here-
dia, and Mohan Rajesh Elara. Self-reconfigurable façade-cleaning robot
equipped with deep-learning-based crack detection based on convolutional
neural networks. Automation in Construction, 108:102959, 2019. 21

[31] Seth Hutchinson, Gregory D Hager, and Peter I Corke. A tutorial on visual
servo control. IEEE transactions on robotics and automation, 12(5):651–670,
1996. 21, 30, 31

[32] Cristian Pop, Sanda M Grigorescu, and Arjana Davidescu. Colored object
detection algorithm for visual-servoing application. In 2012 13th International
Conference on Optimization of Electrical and Electronic Equipment (OPTIM),
pages 1539–1544. IEEE, 2012. 21

[33] Ying Wang, Guan-lu Zhang, Haoxiang Lang, Bashan Zuo, and Clarence W
De Silva. A modified image-based visual servo controller with hybrid camera
configuration for robust robotic grasping. Robotics and Autonomous Systems,
62(10):1398–1407, 2014. 21, 22

[34] Hesheng Wang, Dejun Guo, Xinwu Liang, Weidong Chen, Guoqiang Hu, and
Kam K Leang. Adaptive vision-based leader–follower formation control of
mobile robots. IEEE Transactions on Industrial Electronics, 64(4):2893–2902,
2016. 22

[35] Pablo Ramon-Soria, Begoña C Arrue, and Anibal Ollero. Grasp planning and
visual servoing for an outdoors aerial dual manipulator. Engineering, 6(1):
77–88, 2020. 22

[36] Jingshu Liu and Yuan Li. An image based visual servo approach with deep
learning for robotic manipulation. arXiv preprint arXiv:1909.07727, 2019. 22,
23

[37] Konrad Ahlin, Benjamin Joffe, Ai-Ping Hu, Gary McMurray, and Nader
Sadegh. Autonomous leaf picking using deep learning and visual-servoing.
IFAC-PapersOnLine, 49(16):177–183, 2016. 23

[38] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf,
William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy
with 50x fewer parameters and <0.5 mb model size. arXiv preprint
arXiv:1602.07360, 2016. 23

[39] Benjamin Joffe, Konrad Ahlin, Ai-Ping Hu, and Gary McMurray. Vision-
guided robotic leaf picking. EasyChair Preprint, 250:1–6, 2018. 23

[40] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn.
In Proceedings of the IEEE international conference on computer vision, pages
2961–2969, 2017. 23

[41] Ali Anwar, Weiyang Lin, Xiaoke Deng, Jianbin Qiu, and Huijun Gao. Quality
inspection of remote radio units using depth-free image-based visual servo with
acceleration command. IEEE Transactions on Industrial Electronics, 66(10):
8214–8223, 2018. 23

[42] Shiyao Cai, Zhiliang Ma, Miroslaw J Skibniewski, and Song Bao. Construc-
tion automation and robotics for high-rise buildings over the past decades: A
comprehensive review. Advanced Engineering Informatics, 42:100989, 2019.
25

[43] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 770–778, 2016. 27

[44] Richard Hartley and Andrew Zisserman. Camera Models, page 153–177. Cam-
bridge University Press, 2 edition, 2004. doi: 10.1017/CBO9780511811685.
010. 29

[45] Lee E Weiss, Arthur C Sanderson, and Charles P Neuman. Dynamic sensor-based
control of robots with visual feedback. IEEE Journal on Robotics and Automa-
tion, 3(5):404–417, 1987. 32

[46] Rafael Kelly, Ricardo Carelli, Oscar Nasisi, Benjamín Kuchen, and Fernando
Reyes. Stable visual servoing of camera-in-hand robotic systems. IEEE/ASME
transactions on mechatronics, 5(1):39–48, 2000. 32

[47] Chien-Chern Cheah, De Qun Wang, and Yeow Cheng Sun. Region-reaching
control of robots. IEEE Transactions on Robotics, 23(6):1260–1264, 2007. 34

[48] Suguru Arimoto. Control theory of nonlinear mechanical systems. A Passivity-
based and Circuit-theoretic Approach, 1996. 41

[49] Jean-Jacques E Slotine, Weiping Li, et al. Applied nonlinear control, volume
199. Prentice Hall, Englewood Cliffs, NJ, 1991. 42

[50] Odd O Aalen. A linear regression model for the analysis of life times. Statistics
in medicine, 8(8):907–925, 1989. 50

