Nanyang Technological University, Singapore.
Jin, Yuxin
2022
Jin, Y. (2022). Visual servo control of robot manipulator with applications to construction
automation. Master's thesis, Nanyang Technological University, Singapore.
https://hdl.handle.net/10356/160020
https://doi.org/10.32657/10356/160020
Visual Servo Control of Robot Manipulator with Applications to Construction Automation
Jin Yuxin
2022
Statement of Originality
I hereby certify that the work embodied in this thesis is the result of original
research, is free of plagiarised materials, and has not been submitted for a
higher degree to any other University or Institution.

19-Jan-22
Date                                            Jin Yuxin
Supervisor Declaration Statement
I have reviewed the content and presentation style of this thesis and declare it is
free of plagiarised materials and of sufficient grammatical clarity to be examined.
To the best of my knowledge, the research and writing are those of the candidate
except as acknowledged in the Author Attribution Statement. I confirm that the
investigations were conducted in accord with the ethics policies and integrity
standards of Nanyang Technological University and that the research data are
presented honestly and without prejudice.

19-Jan-22
Date                                            Cheah Chien Chern
Authorship Attribution Statement
(A) This thesis does not contain any materials from papers published in peer-reviewed
journals or from papers accepted at conferences in which I am listed as an author.
(B) This thesis contains material from [x number] paper(s) published in the following
peer-reviewed journal(s) / from papers accepted at conferences in which I am listed as
an author.
19-Jan-22
................. ..........................
Date Jin Yuxin
Acknowledgements
First, I wish to express my greatest gratitude to my supervisor, Professor Cheah Chien Chern.
He taught me my very first lesson in control theory and robotics and patiently guided me
onto the right path of research and study. We met every week to exchange thoughts and
prepare for the research project, and he helped me solve many problems, from the design
of the experiments to imperfections in the algorithm.
I also want to thank my friends and schoolmates. Xinge, Nithish, and I had many
discussions about projects, questions, or just random thoughts, and I will never
forget every meal we shared at the NTU canteen. My boyfriend, Shuailong, is my
source of energy and the light of my life. He never failed to cheer me up during my
darkest moments and encourage me to move forward. My childhood friends in China,
although I have only met them face-to-face a few times since going overseas to study,
still chat with me on WeChat and call me every week. Without their help and company,
I could not have made it to today.
Two years of a master's journey is not a very long time, but what it has brought me
and taught me will stay with me for the rest of my life.
To my dear family
Abstract
The construction industry has long been a labor-intensive sector. The gap between
the continuously increasing demand for housing and the shrinking workforce is growing
wider day by day. In addition, the fatality and injury rate in the construction sector
remains stubbornly high compared to other industries. Construction companies are
therefore seeking robotics and automation technologies to balance safety, accuracy,
and efficiency.
This thesis explores the use of robot visual servoing techniques to improve detection
performance during real-time inspection. The proposed method utilizes object detection
information to guide the robot system toward a better view of the target object. A
region-based visual servoing controller is developed to position the target object at
the center of the field of view (FOV) while also maximizing the coverage of the object
within the FOV. A case study is performed on tile crack inspection using the proposed
technique. The inspection process is an important step to evaluate the current stage
of a construction project and to alert supervisors if there is an error. It is also a
tedious job, as it requires close observation of every wall in every room across all
the units. Tile cracks commonly occur during transportation or installation, and the
cracks are usually tiny and therefore not easily detected by human workers. By
combining the visual servo control technique with a deep-learning-based object
detector, we aim to achieve a higher confidence level for the detection of tile
cracks. Experimental results are presented to illustrate the performance.
Contents
Acknowledgements ix
Abstract xiii
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Outline of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Literature Review 7
2.1 Robotic Solutions For Building Construction Automation . . . . . . 8
2.1.1 Interior Finishing Robot . . . . . . . . . . . . . . . . . . . . 8
2.1.2 Quality Inspection and Assessment Robot . . . . . . . . . . 9
2.1.3 Site Monitoring Unmanned Aerial Vehicle (UAV) . . . . . . 10
2.2 Machine Learning Basics . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Data Split . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.3 Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.4 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Classic Neural Network Models for Object Detection . . . . . . . . 15
2.3.1 Convolutional Neural Network(CNN) . . . . . . . . . . . . . 15
2.3.2 Region based Convolutional Neural Network (R-CNN) . . . 17
2.3.3 You Only Look Once (YOLO) Version One and Version Two 18
2.3.4 Single Shot Detector (SSD) . . . . . . . . . . . . . . . . . . 19
2.3.5 Convolutional Neural Networks Applications in Construction
Industry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3 Methodology 25
3.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Object Detection Algorithm . . . . . . . . . . . . . . . . . . . . . . 27
3.2.1 YOLOv3 Architecture . . . . . . . . . . . . . . . . . . . . . 27
3.3 Control Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.1 Preliminaries and Definitions . . . . . . . . . . . . . . . . . 29
3.3.2 Image-Based Visual Servoing and Region-Based Control . . 33
3.3.3 Lyapunov Stability Analysis . . . . . . . . . . . . . . . . . . 39
Bibliography 65
List of Figures
4.1 Basic user interface of the labeling tool LabelImg. The ground truth
bounding box is drawn in green . . . . . . . . . . . . . . . . . . . . 47
4.2 YOLO detection results of testing images. electrical telecom and
electrical power are shown in (a), door and electrical switch are
shown in (b), electrical light is shown in (c), electrical switch is
shown in (d), window installed is shown in (e) and tile crack is
shown in (f) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 UR5e Robot Manipulator in the Nanyang Technological University
Robotics Lab Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4 (a) shows the change of confidence level for bounding box length.
The y-axis is the YOLO confidence level and the x-axis is the YOLO
bounding box width. The blue line is the linear regression model
with a 95% confidence interval as the boundary shadowed in blue.
(b) demonstrates the change of confidence level to distance between
the bounding box center and image frame center. The x-axis is
the pixel distance between the YOLO bounding box center and the
image center and the y-axis is the YOLO confidence level. The blue
line is the linear regression model with a 95% confidence interval as
the boundary shadowed in blue. . . . . . . . . . . . . . . . . . . . . 51
4.5 Change of the YOLO bounding box location, length, and confidence
level during the experiment using the YOLO model trained for
1000 epochs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.6 Change of the YOLO bounding box location, length, and confidence
level during the experiment using the YOLO model trained for
2000 epochs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
List of Tables
3.1 Object detection results on the COCO dataset [1]. The table shows
the mean average precision (mAP) and inference time for different
object detection models. mAP-50 indicates that the mAP value is
calculated based on a 50% IOU metric. The number after the model
name indicates the input image size. . . . . . . . . . . . . . . . . . 27
Symbols and Acronyms
Symbols
$\mathbb{R}^n$        the n-dimensional Euclidean space
$\nabla f$             the gradient vector
$\partial Y / \partial x$   the partial derivative of function Y with respect to variable x
$\dot{x}$              the first time derivative of x
$\bar{x}$              the vector with the average of all components of x as each element
$\mathbf{1}$           the all-ones column vector of appropriate dimension
$x_{i,k}$              the i-th component of a vector x at time k
$X^T$                  the transpose of matrix X
$\mathcal{N}_i$        the index set of the neighbors of agent i
Acronyms
DOF Degree of Freedom
YOLO You Only Look Once
PPVC Prefabricated Prefinished Volumetric Construction
IMU Inertial Measurement Unit
GPS Global Positioning System
UAV Unmanned Aerial Vehicle
mAP Mean Average Precision
FoV Field of View
RoI Region of Interest
Chapter 1
Introduction
1.1 Background
The construction industry has long been a supporting pillar of the economy for all
nations. In Singapore, the construction sector contributed about 4% of the total
gross domestic product in 2019.¹ According to the Building and Construction
Authority of Singapore (BCA), the forecast average annual demand for construction
will reach 32 billion Singapore dollars in 2025.² With the increasingly high demand
and the limited local population, Singapore's construction sector relies heavily on
foreign workers from neighboring Southeast Asian countries. Data from the Ministry
of Manpower of Singapore (MOM) show that in 2020, 311,000 work permit holders were
working in the construction and marine sectors, accounting for one-fourth of the
total foreign workforce.³ This situation leads to many unavoidable problems, and
not only in Singapore.
On the one hand, the construction site environment can be rather harmful or even
dangerous. According to the annual report from the MOM, there were 13 deaths, 135
major non-fatal injuries, and 1,674 minor injuries on construction sites in 2019.⁴
This indicates that the construction sector is still the main contributor to workplace
injury and death among all industries. Furthermore, United States Bureau of Labor data
show that the rate of construction workers who suffer a fatal injury each year is the
fourth-highest among all industries.⁵ The causes of construction workplace injuries
range from falls from height to vehicle accidents. Besides, most construction site
jobs are repetitive and tiring. Construction work usually involves concrete workers,
stoneworkers, flooring installers, glaziers, tile setters, ironworkers, and
electricians, and most of these jobs require long hours of exhausting physical
activity. This can lead to many chronic illnesses such as lumbar disease or arthritis.
Thus, the average retirement age for a construction worker is 42.5, which is much
younger than the standard retirement age. Using robots for construction work can ease
the tiring work done by workers, and deploying more autonomous machines on the
construction site means more people can work in a safer place.

¹ https://www.statista.com/statistics/1122999/singapore-nominal-gdp-breakdown-by-sector/
² https://www1.bca.gov.sg/docs/default-source/docs-corp-form/free-stats.pdf
³ https://www.mom.gov.sg/documents-and-publications/foreign-workforce-numbers
⁴ https://www.mom.gov.sg/-/media/mom/documents/press-releases/2021/0319-annex-a—workplace-safety-and-health-report-2020.pdf
On the other hand, there is a vast technology gap between the construction industry
and other industries. Bureau of Economic Analysis data in America show that labor
productivity in the construction industry remains at the lowest level among the
agriculture, transportation, manufacturing, and utility industries.⁶ Likewise, a
survey in Japan in 2012 even indicates a declining trend in construction labor
productivity from 1990 to 2010, while that of industry as a whole rose continuously [2].
In 2015, Naoum [3] from London, UK, listed the top factors influencing productivity on
construction sites. Among the 46 listed factors, he ranks ineffective project planning,
delays caused by design errors and variation orders, the communication system, the work
environment, and constraints on a worker's performance as the top five. The first three
arise mainly during preparation, where researchers cannot intervene much, while the
last two show that environmental and human factors are essential for efficiency and
productivity. A good environment affects how smoothly construction work can proceed
and how efficiently the workers perform. A robot system is capable of helping to
monitor the current on-site progress as well as speed up the current construction
process.
as of 18th May 2021.⁷ The reason behind it was probably the crowded living conditions
among workers in the past.⁸ It has also become challenging for construction companies
to recruit new workers from abroad, as the pandemic situation remains unpredictable
and varies worldwide. The pandemic has significantly altered the timelines of existing
construction projects. As reported by The Straits Times, 85 percent of the 89 ongoing
build-to-order projects face delays of six to nine months due to the pandemic, with
43,000 households affected.⁹ GDP from the construction sector in Singapore also dropped
dramatically, from 4,957.8 million Singapore dollars in the first quarter of 2020 to
1,681.6 million in the third quarter, because of the COVID-19 outbreak and the lockdown
of the construction dormitories.¹⁰ This severely influenced economic circulation and
people's daily lives.
Under these circumstances, we recognize a need to aid construction work by implementing
robotics solutions. This will be the initial step in shifting the nature of the
construction industry from labor-intensive to technology-intensive. It would speed up
project schedules with computer-integrated progress monitoring and lessen workplace
injuries and deaths every year.
1.2 Motivation
In Section 1.1, we discussed the current difficulties and limitations faced by the
conventional construction industry. Thus, we endeavor to develop an integrated robot
system that can automate the construction process and assist construction laborers.
A robot system usually consists of two parts: the vision system and the robot hardware
system. The vision system is a crucial component of a robotic system in construction
automation. As construction sites are generally disordered and unstructured, visual
information helps us see and comprehend the surrounding conditions. The vision system
should recognize and localize various construction materials, installation components,
and defects.
⁷ https://covidsitrep.moh.gov.sg
⁸ https://www.straitstimes.com/singapore/manpower/workers-describe-crowded-cramped-living-conditions
⁹ https://www.straitstimes.com/singapore/spore-will-see-further-delays-in-housing-projects-due-to-tightening-of-covid-19-measures
¹⁰ https://tradingeconomics.com/singapore/gdp-from-construction
In terms of vision algorithms, the convolutional neural network has gradually replaced
traditional machine vision techniques and has become the primary tool for fast and
accurate object detection tasks. It consists of a series of convolutional layers and
max-pooling layers that produce the final output. If we can collect a suitable dataset
and train a deep neural network model, it can classify and localize numerous classes.
Typically, the detection results include the object class, confidence level, bounding
box size, and bounding box coordinates. The confidence level indicates how confident
the machine is in confirming the presence of an object.
However, the existing object detectors still have many limitations and drawbacks. The
detection performance is fixed after training and remains so throughout the prediction
process. The confidence level can vary dramatically due to different lighting conditions
and changes in the camera's distance and angle. If the camera faces the object at an
inappropriate angle, the performance of the object detector is significantly reduced,
and false detections may even occur.
Thus, we see the possibility of applying a robotic solution to assist the detection of
the CNN model. This thesis investigates the use of robot vision-based control techniques
to enhance detection performance during real-time inspection. We aim to develop a
control algorithm that positions the target object at the center of the field of view
(FOV) while also maximizing the coverage of the target object within the FOV. A case
study on tile crack inspection and construction installation checks is performed to
illustrate the performance of the proposed technique.
1.3 Outline of the Thesis
Chapter 1 introduces the background of the construction industry and the moti-
vation of the project.
Chapter 2 reviews the robotics solutions used in the construction industry as well
as the basics of machine learning. Classic neural network models are also presented
to find out the most suitable one for real-time application.
Chapter 3 elaborates on the details of the proposed algorithm, which is divided into two
parts: the vision algorithm and the control algorithm. For the vision algorithm, we give
a detailed explanation of the machine learning model used; for the control algorithm,
we present the mathematical definitions and usage.
Chapter 4 describes the hardware of the robot and vision systems and presents the
experimental setup and experimental results.
Chapter 5 concludes the thesis and provides possible future research directions.
Chapter 2
Literature Review
In this chapter, we review previous research on construction robots and other related
topics.
Firstly, we review different types of construction robots and the control techniques
used in their design and operation. It is crucial to figure out what kinds of
functionalities and capabilities are essential for specific construction tasks. Knowing
the trend of construction automation tells us what construction robots are needed, and
understanding the possible drawbacks of existing products points out directions for
further improvement.
For robot vision, both basic machine learning knowledge and modern neural network models
are covered. Grasping the machine learning basics is valuable for understanding the
architecture and function of a neural network. Comparisons are made between various
neural network algorithms, and determining the most suitable neural network model for
construction site work is the principal focus throughout.
The subsequent literature review will cover the techniques and algorithms used for
object detection, construction robots with diversified purposes and functionalities,
and vision-based control algorithms.
2.1 Robotic Solutions For Building Construction
Automation
We need to consider several aspects when designing the overall process of construc-
tion work to achieve full construction automation. In the book Robot Oriented
Design [4], the author listed five key technologies and methodologies that play an
essential role in accomplishing real construction automation: (1) robot-oriented
design, (2) robotic industrialization, (3) construction robots, (4) site automation,
and (5) ambient robotics. In the following subsections, robot applications are listed
according to their functionalities and purposes.
2.1.1 Interior Finishing Robot
Interior finishing work, including painting, tiling, masonry, and plastering, is
imperative before handover. Tiling and painting work can be quite risky, as it often
involves working at height with a higher chance of falling and getting injured.
Meanwhile, it is also a time-consuming and labor-intensive job. Thus, providing a
robotic solution to ensure workers' safety and boost productivity is crucial to the
industry.
The Pictobot spray-painting robot exhibits better results. A human operator can navigate
the robot to a different workstation and set the painting requirements, such as the
proper nozzle and spray pressure, using a remote controller with a screen. A 6-DoF robot
arm with a 3D scanning and reconstruction system can detect the surrounding terrain,
plan the spraying trajectory, and perform the spraying task on an uneven surface. The
authors compared manual spray painting with joint human-Pictobot operation in terms of
working time, transfer efficiency, quality, convenience, safety, and the human resources
needed. The robot can finish 100 m² within 2 hours, compared to 3 hours for manual
spraying, and the coating is believed to be more even and consistent in thickness.
2.1.2 Quality Inspection and Assessment Robot
Pack et al. [7] proposed a structure-climbing robot for building inspection named ROBIN.
A four-DoF articulated mechanism with two vacuum fixtures enables the robot to walk
across surfaces or transition between adjacent surfaces perpendicular to each other.
ROBIN can climb high-rise buildings, bridges, and other man-made structures to perform
inspection work using cameras or other sensors.
Moreover, Gibb et al. [10] proposed a multi-functional inspection robot for civil
infrastructure evaluation and maintenance in 2017. With the integration of
ground-penetrating radar, electrical resistivity sensing, and a stereo camera, the
robot can detect and assess concrete rebar, concrete corrosion, and cracks at the same
time. Meanwhile, an onboard computer enables the system to process the data and perform
navigation in real time. It can output the width of the cracks and produce a concrete
condition map based on them.
Autonomous inspection work relieves workers from repeated and tiresome checks and
prevents work-related injuries. Meanwhile, the robotic solution can also reduce the
cost and time of maintenance.
2.1.3 Site Monitoring Unmanned Aerial Vehicle (UAV)
With the increasing popularity and development of UAVs, many researchers have also
explored their use in building inspection and monitoring. UAVs can fly without a crew
and inspect target building facades at heights that standard ground robots cannot reach.
In 2015, Pereira and Pereira [11] evaluated two different machine vision algorithms used
in UAV applications and their respective performance. Both a Sobel filter algorithm and
a particle filter algorithm were tested on a Linux PC and a Raspberry Pi for their
accuracy and processing time in crack detection.
Meanwhile, UAV-based laser scanning for building inspection was explored by a research
group from Germany in 2016 [12]. With 470,000 preset points along the UAV trajectory,
the system obtains data on the building surface from the laser scanner and reconstructs
a 3D point cloud. The paper also evaluated using UAVs to check for cracks and other
defects with RGB cameras and thermal sensors.
with excellent adaptability and accuracy to design a high-performance vision sys-
tem. More details and background knowledge about machine learning need to be
reviewed before looking into the neural network and its applications.
2.2.1 Data Split
The quality of the data directly influences the performance of a model. Generally
speaking, a more extensive dataset means higher accuracy. Every dataset is split into a
training dataset, a validation dataset, and a test dataset. While the training dataset
is only used to train the model, the score on the validation dataset is used to tune
hyper-parameters such as the learning rate. The test dataset is used to evaluate the
performance of the model on unseen data.
When dealing with real-life problems, we often lack sufficient training data, which may
lead to an over-fitting problem. Thus, resampling procedures such as cross-validation
become extremely important to give a more accurate estimate of the current model and
thus adapt the parameter values.
Figure 2.1 illustrates an example of k-fold cross validation where k is five. The
general procedure is listed below:
1. Randomly shuffle the dataset so that the different classes are distributed evenly.
2. Split the dataset into k groups (folds) of roughly equal size.
3. Take one fold as the validation set.
4. Take the remaining k-1 folds as the training set.
5. Fit a model on the training set and evaluate it on the validation set.
6. Record the evaluation score and discard the model.
7. Repeat steps 3-6 k times and compute the summarized performance as the average of
all the evaluation scores.
We can still obtain a precise evaluation of the model using limited samples by
applying the k-fold cross-validation procedure.
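The procedure above can be sketched with scikit-learn; the classifier and fold count here are illustrative only:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

def k_fold_score(X, y, k=5):
    kf = KFold(n_splits=k, shuffle=True, random_state=0)   # steps 1-2: shuffle and split into k folds
    scores = []
    for train_idx, val_idx in kf.split(X):                  # steps 3-4: pick training and validation folds
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])               # step 5: fit on the training folds
        scores.append(model.score(X[val_idx], y[val_idx]))  # steps 5-6: evaluate and record
    return np.mean(scores)                                  # step 7: average the scores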
2.2.2 Neural Network
When talking about the origin of the neural network, its connection with zoology and
anatomy cannot be bypassed.
One of the very first papers about brain architecture and neuron mechanisms was
published in 1968 by Hubel and Wiesel [13]. By studying the architecture of the monkey
striate cortex, the scientists found that the brain is organized such that simple cells
lie in the deep layers while complex cells are contained in the upper layers. Simple
cells are more sensitive to lines and edges, and their outputs converge in the complex
or hyper-complex cells. Inspired by Hubel's work, the term 'neocognitron' first appeared
in 1980 in Fukushima and Miyake's paper [14] presenting their self-organizing neural
network model. The model arranges its modules in a cascade connection; each module
consists of 'S-cells', similar to the simple cells, and 'C-cells', similar to the
complex cells. In the paper, the authors claimed that the network could self-learn the
characteristics of the input patterns, and one of the C-cells in the last layer would
respond to them.
A modern artificial neural network is illustrated in Figure 2.2 to show its basic
architecture. Usually, it consists of input layers, hidden layers, and output layers.
[Figure 2.2: Basic architecture of an artificial neural network, consisting of an input layer, hidden layers, and an output layer.]
The neural network can be interpreted as an estimator of the complicated interaction and
relationship between input and output. Given an input feature vector $x \in \mathbb{R}^d$
and an output label vector $y \in \mathbb{R}^C$ with $C$ different classes, a simple
classifier can be expressed as
$$\hat{y} = f(x; \theta) \tag{2.1}$$
where $\theta$ denotes the model parameters and $f$ is the function that maps the input
feature vector $x$ to the output label space $\hat{y} \in \mathbb{R}^C$.
2.2.3 Loss Functions
For a classification problem, the cross-entropy loss can be used:
$$L(\theta) = -\sum_{c=1}^{C} y_c \log p_c \tag{2.2}$$
where $p_c$ is the predicted probability of class $c$.
For a regression problem, which predicts a specific quantity, the mean-squared error
(MSE) loss is more suitable. Given $n$ data points, we can define the loss function as
$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2 \tag{2.3}$$
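Both loss functions translate directly into code; a minimal NumPy sketch of (2.2) and (2.3), added here for illustration:

import numpy as np

def cross_entropy(y_onehot, p, eps=1e-12):
    # L = -sum_c y_c * log(p_c), with clipping for numerical safety.
    return -np.sum(y_onehot * np.log(np.clip(p, eps, 1.0)))

def mse(y_true, y_pred):
    # MSE = (1/n) * sum_i (Y_i - Yhat_i)^2
    return np.mean((y_true - y_pred) ** 2)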
2.2.4 Optimization
As its name suggests, optimization is the process of updating the set of weights so as
to minimize the loss function.
One of the simplest optimization methods is Gradient Descent (GD). It is like walking
downhill: by calculating the derivative of the loss function, we update the weights in
the direction of the negative gradient. The gradient calculation and the update step
can be expressed as
$$g_t = \nabla_{\theta} L(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta\, g_t$$
where $\eta$ is the learning rate. The learning rate controls how much the weights are
updated at each step, and it is a crucial hyper-parameter that we need to define. It is
usually a small value, such as 0.01 or 0.001, and it is fundamental to choose a suitable
value.
If the learning rate is too large, the learning process will be unstable and oscillate,
while if it is too small, convergence will take much longer.
Stochastic Gradient Descent (SGD) comes into play when the dataset is too large for
full-batch computation. Instead of calculating the gradient over the entire dataset, it
estimates the gradient with a randomly chosen subset (mini-batch) of the data. Thus, it
not only reduces the computational time but also often achieves faster convergence.
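A minimal sketch of the mini-batch SGD update; the loss_grad function, learning rate, and batch size are hypothetical placeholders:

import numpy as np

def sgd(theta, data, loss_grad, lr=0.01, batch_size=32, epochs=10):
    # loss_grad(theta, batch) is assumed to return the gradient of the loss
    # with respect to theta, evaluated on the given mini-batch.
    n = len(data)
    for _ in range(epochs):
        np.random.shuffle(data)
        for i in range(0, n, batch_size):
            batch = data[i:i + batch_size]
            theta = theta - lr * loss_grad(theta, batch)  # step in the negative gradient direction
    return theta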
2.3 Classic Neural Network Models for Object Detection
In the following section, neural network models widely used for object detection will be
introduced. With the development of computer vision and the improvement of computational
power, more complex models have begun to gain an advantage over traditional machine
vision algorithms. Many new models are proposed every year with increased accuracy and
shortened processing time. We will look into different models and find out which one is
more suitable for real-time object detection.
2.3.1 Convolutional Neural Network (CNN)
The history of the CNN started in the 1980s. One of the very first papers was published
by LeCun et al. [17] in 1995. It laid the foundation for the architecture and usage of
convolutional neural networks. Three years after that, the same author proposed LeNet-5
[18], a 7-layer convolutional network that can process images and recognize hand-written
digits.
Although the invention of CNNs can be traced back to the end of the last century, they
never received great attention until the extensive use of graphics processing units
(GPUs) in the 2000s. A research group from the University of Toronto presented a large,
deep convolutional neural network with 60 million parameters and 650,000 neurons [19].
It is able to classify 1.2 million images into 1000 classes with top-1 and top-5 error
rates of 37.5% and 17.0%, respectively.
[Figure 2.3: A typical CNN architecture: an input image passes through convolution and max-pooling stages (Conv_1, Conv_2), is flattened, and is fed into fully connected layers (fc_3, fc_4) to produce the output.]
Nowadays, CNNs are widely used for grid-like data such as images because their
architecture is able to separate and extract essential features using spatial
relationships.
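A minimal PyTorch sketch of such an architecture; the layer sizes and input resolution are illustrative and do not correspond to any network used in this thesis:

import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution extracts local features
            nn.ReLU(),
            nn.MaxPool2d(2),                             # max-pooling downsamples the feature map
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # fully connected output layer

    def forward(self, x):               # x: (batch, 3, 32, 32)
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)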
2.3.2 Region based Convolutional Neural Network (R-CNN)
The region-based approach was first introduced by Girshick et al. [20] in 2014 and
called R-CNN. R-CNN uses a CNN as the backbone to extract features for each region
proposal and classifies the output using a Support Vector Machine (SVM). Its performance
on the PASCAL VOC 2012 dataset improved by 30% compared with the best previous results,
achieving a mean average precision (mAP) of 53.7%.
One year after that, the same author proposed a modified method named Fast Region-based
Convolutional Neural Network (Fast R-CNN) [21]. R. Girshick wished to fix the problems
of R-CNN, such as multi-stage training, expensive training time and storage, and slow
detection speed. By using deep convolutional layers and max-pooling layers, the model
produces a feature map, and a feature vector is then extracted for each object proposal.
After feeding the feature vector into fully connected layers, the model generates an
output containing softmax class probabilities and bounding box positions. This
end-to-end model achieves better detection results than R-CNN and does not need storage
for feature caching. The training time is nine times faster than R-CNN, with an mAP of
66%.
[Figure 2.4: Faster R-CNN is a single, unified network for object detection. An input image passes through convolutional layers to produce feature maps; a Region Proposal Network generates proposals, which are combined with the feature maps through RoI pooling and passed to the final classifier.]
Although Fast R-CNN was a great success in terms of speed and accuracy, it also revealed
that the bottleneck had become region proposal computation. Thus, it was further
improved by Ren et al. [22] in 2015. The novelty of this method is the invention of the
Region Proposal Network (RPN), which significantly lessens the cost of computing
proposals and accelerates the test-time operation of the model. In a single model
design, the RPN is integrated with a Fast R-CNN model. The RPN proposes potential
regions of interest (RoIs) and object types, while Fast R-CNN extracts the features and
produces the final output containing bounding boxes and class labels, as shown in
Figure 2.4. In short, the RPN tells the following network which areas need more
attention. The model has been evaluated on the union of the PASCAL VOC 2007 trainval and
2012 trainval sets and achieves an mAP of 73.2%. Using the deep VGG-16 model [23], it
runs at a frame rate of 5 fps on a GPU.
2.3.3 You Only Look Once (YOLO) Version One and Version Two
With the idea of creating a more straightforward and faster neural network for real-time
applications, Redmon et al. [24] presented a real-time detector that achieves decent
accuracy with only 24 convolutional layers and two fully connected layers. The faster
version of YOLOv1 was able to run at 150 fps, which means real-time video processing was
possible with minimal latency.
Because of the weaknesses stated above, one year after the publication of YOLOv1, the
same author improved it and presented YOLOv2 (YOLO9000) [25]. More new ideas were
included in this model while it still maintains a relatively simple architecture. Adding
batch normalization and a higher-resolution classifier gives an mAP increase of 4%.
Different from the RPN idea in Faster R-CNN, YOLO uses convolutional anchor boxes that
predict bounding box coordinates directly from image features. Meanwhile, k-means
clustering is applied to obtain better prior box sizes automatically. Instead of using a
sophisticated classification backbone such as VGG-16, YOLO uses a custom network called
Darknet-19, consisting of only 19 convolutional layers and five max-pooling layers. It
can reach an mAP of 78.6% at 40 FPS, which is a notable advance compared to SSD, which
has a similar mAP but only 19 FPS.
2.3.4 Single Shot Detector (SSD)
Observing the slow frame rate of Faster R-CNN and the lower accuracy of YOLO version
one, Liu et al. [26] designed a new method for object detection called the Single Shot
MultiBox Detector.
Like Fast R-CNN, a feed-forward convolutional network is selected as the base network to
produce a collection of fixed-size bounding boxes and corresponding confidence scores.
Auxiliary structures, such as extra convolutional feature layers, are then added to
enhance detection at different scales, producing a fixed collection of detection results
and discretizing the space of output bounding box shapes. SSD outperformed the other
methods on the COCO dataset, including Fast R-CNN, Faster R-CNN, and YOLOv1, with an mAP
of 72.4 and 74.9 for input sizes 300×300 and 512×512, respectively. Removing the
bounding box proposal stage from the network architecture allows it to achieve 59 frames
per second on the VOC2007 test dataset with high-accuracy detection.
2.3.5 Convolutional Neural Network Applications in the Construction Industry
Although CNNs have been a prevalent topic for decades, their application to real-time
construction projects is still relatively fresh. Due to the chaotic background
environment and the numerous classes of objects, it is still challenging to train a
high-performance model. The following section lists several construction-related
problems that can be solved by applying CNNs.
Nhat-Duc et al. [28] proposed a method called CNN-CDM for pavement crack detection. In
total, they collected 400 images of pavement surfaces for two different classes. The
authors compared the performance of two different algorithms. In the first, the
collected dataset goes through a crack recognition model that integrates the Canny edge
extraction algorithm with the DFP optimization algorithm. The other method, CNN-CDM, is
a multi-layer neural network containing a feature extraction network and a
classification network. The experimental accuracy on the training dataset reaches 92.08%
for the CNN method, while the DFP-Canny method only achieves 76.69%.
Similarly, to detect and localize moisture damage in bridge deck asphalt pavement, Zhang
et al. [29] developed a hybrid deep CNN comprising a ResNet50 network for feature
extraction and a YOLOv2 network for recognition. The input data are obtained from
ground-penetrating radar (GPR), and an IRS algorithm is deployed to generate relevant
data to feed into the CNN. The team removed the original base network of YOLOv2 and
added a ResNet50 network on top of it. Instead of using the original YOLO anchor
numbers, the k-means clustering method is used for small object detection. According to
the experimental results, the detection CNN model reaches 91% precision. The outcome
demonstrates a novel method for automatically detecting moisture damage.
Moreover, to guide a facade-cleaning robot away from dangerous areas, researchers from
the Singapore University of Technology and Design (SUTD) proposed a crack detection
algorithm based on a convolutional neural network [30]. They compared the performance of
two different optimizers. After training for 700 epochs, both CNN models reach around
90% accuracy regardless of varying illumination and resolution of the input images.
Simulation and experimental results demonstrate that the system is robust for crack
detection.
How visual information can guide robot movement has long been a popular topic in
robotics. Hutchinson et al. [31] give a detailed tutorial on image-based visual servoing
(IBVS). The core idea of this approach is to guide the robot's movement purely using
image feature points. It extracts features from the image space and calculates the
difference from the desired features. The output of the controller is the end-effector
velocity, which is related to this error. The IBVS pipeline consists of image feature
extraction, image Jacobian calculation, and final velocity calculation. Among these,
extracting the desired features from the image space is the most challenging step.
Thus, subsequent research has tried to modify this method by substituting the manually
selected feature points with a machine vision algorithm.
Pop et al. [32] demonstrate the use of color coding to guide a robot to a target
location. In order to perform a pick-and-place task, it is crucial to determine the size
and position of the object. By taking images at a fixed distance, they first detect the
object based on its color in the HSV images. Then, the algorithm extracts the edges of
the object and computes its height using a scaling factor. After that, the center of
gravity is computed from the center x and y in pixels; this information is used during
the pickup process. With this approach, the robot manipulator can grasp the object with
less than 5 mm error. However, the experimental results are strongly influenced by
ambient light, shadows, reflections, and camera settings.
Wang et al. [33] modified the vision system to solve the problem of object grasping in
an unstructured environment. Instead of a single camera, the proposed mobile
manipulation system uses a hybrid camera configuration: a monocular camera installed on
the end-effector and a stereo camera installed on the robot body. They argue that this
provides the visual servoing controller with more stable depth data while also
maintaining a large field of view (FOV). The experimental results show that the pixel
error stays within 10 pixels across 30 experimental trials.
The work in [34] focuses on the visual control of a leader-follower mobile robot system
in which the intrinsic and extrinsic parameters of the pinhole camera are uncalibrated
and its position and orientation are unknown. Wang et al. [34] designed a special marker
to estimate the difference between the leader and the follower and calculate the
velocity based on it. Under this controller design, the follower robot is able to track
the leader robot.
Researchers from the University of Seville, Spain, developed an algorithm for grasping
by integrating visual servoing with an object detection algorithm [35]. They built a UAV
with a pair of 3-DoF arms for manipulation, equipped with an Intel RealSense D435 depth
camera for depth estimation. The robot needs to know the exact location and orientation
of the object to perform the grasping task. An object detection algorithm is deployed to
achieve this; the point cloud is then used in an alignment process to estimate the
object pose. After generating the grasping point based on the calculated pose, a
pose-based visual servoing (PBVS) technique is used to approach the target object. The
error is calculated between the target position and the current one, and by computing
the inverse kinematics, the end-effector is able to grasp the object with minimal error.
Liu and Li [36] propose a control algorithm integrated with a CNN to reduce the
difficulty of extracting image features. A two-stream convolutional neural network is
applied to extract the image features automatically in the current scene. The neural
network's output is the pose parameters: the translation along the x-, y-, and z-axes
and the rotation around the z-axis of the robot base frame. These are compared with the
image features at the optimal position, and the corresponding offset is input to the
control algorithm for manipulating the robot arm. After training the model with 400
images, the absolute error is reduced to within 4 mm in the x-, y-, and z-axes and 3.02
degrees for rotation about the z-axis. The robot manipulator can reach the target pose
within 15 steps and remains stable after that. This work demonstrates the possibility of
integrating CNNs with vision-based control.
The same group of researchers further improved the aforementioned robotic system by
substituting the 2D image information with a 3D point cloud [39]. They use a monocular
camera and a 6-DOF robotic manipulator to detect, track, and pick healthy and unhealthy
leaves. The Faster R-CNN [22] architecture is applied for object detection, and Mask
R-CNN [40] is used for instance-based semantic segmentation. It can classify leaves into
healthy and unhealthy with 0.753 mAP. The IBVS method is applied to move the bounding
box to the center of the frame, and the MDA then takes over to minimize the accumulated
error. In experiments on real plants, about 92% of the leaves were grabbed after the
attempts.
Anwar et al. [41] designed a quality inspection process for remote radio units (RRUs)
using image-based visual servo control. They modified the image Jacobian by deriving the
depth using projective geometry. The feature vector $f$ is defined as
$f = [u_c, v_c, \sqrt{A}, \theta_{ij}]^T$, where $(u_c, v_c)$ are the center coordinates
of the region of interest (RoI), $A$ is its area, and $\theta_{ij}$ is the angle about
the z-axis. The selected features obtained by the computer vision algorithm are able to
guide the robot to track the power port. Experiments show that it has better performance
than the traditional CamShift tracking algorithm.
The research works listed above show the flexibility and robustness of CNNs and how they
can boost the performance of a robotic system.
Chapter 3
Methodology
Among all types of construction work, Cai et al. [42] point out that there are fewer
research papers and products related to inspection work than to climbing, cleaning, and
maintenance. Tile defect inspection and installation checking can be considered among
the most tiresome tasks on a construction site, because they usually require quality
engineers to look closely through every part of the wall or tile to verify misalignment
or damage for a whole room or building on a day-to-day basis. Based on the literature
review in Section 2.1.2, it is clear that most current inspection robots employ
traditional machine vision techniques, such as the Canny edge extraction algorithm, for
crack detection. Such methods are rigid and cannot distinguish a tile crack from other
gaps, and their results can be severely influenced by lighting conditions, background
texture, and distance. Therefore, we see a need to develop a neural-network-based crack
detector. Beyond an excellent object detector, many problems still stand between the
construction field and full automation. Sometimes the cracks are not visible due to
their small size or poor lighting, and other objects such as wires or markers on the
wall can occasionally be wrongly classified as cracks. A control algorithm is therefore
needed to move the camera to a desired position to capture suitable images.
We propose a robotic solution that performs installation and crack inspection more
precisely by combining an object detection algorithm with a traditional controller.
3.1 System Overview
The two primary components of the system are the image-based controller and the object
detection model, as shown in Figure 3.1.
The control objective is to utilize object detection information to guide the robot
system toward a better view of the target object. A region-based visual servoing
controller is developed to position the target object at the center of the field of
view (FOV) while, at the same time, maximizing the coverage of the object within the
FOV.
3.2 Object Detection Algorithm
In Section 2.3, we elaborated on several commonly used deep learning models for object
detection, which have different features and architectures. In the following paragraphs,
we compare the three most well-known models, Faster R-CNN, SSD, and YOLO, and determine
which one is more suitable for our real-time application.
Based on the information reported by Redmon and Farhadi [1], we compile Table 3.1. The
mAP measures the average precision over all classes and ranges from 0 to 100. We define
a detection to be a true positive if its intersection over union (IOU, the overlapping
area) with the ground truth box is greater than the threshold (50% in our case). Time
indicates the inference time after inputting one image into the model; the smaller the
value, the less delay there is between the current frame and the detection frame. From
the table, we observe that YOLOv3-416 achieves the shortest inference time among the
four models, and YOLOv3-608 is the best in terms of detection accuracy. Considering both
speed and accuracy for fast real-time detection in robot control, we select YOLOv3-608
as the most suitable detector.
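For reference, the IOU between a detected box and a ground-truth box (both in corner format) can be computed as in the following sketch:

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A detection counts as a true positive if iou(det, gt) > 0.5.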
More background knowledge helps explain the reasons for this excellent performance.
3.2.1 YOLOv3 Architecture
YOLOv3 chooses a different way from its peers for feature extraction. Typically, ResNet
[43] with a certain level of modification is selected as the backbone: a deep neural
network with a residual learning framework that aims to reduce the training time and
increase accuracy. YOLOv3 instead utilizes a new network with 53 convolutional layers
called Darknet-53. Darknet-53 applies a hybrid architecture of Darknet-19 from YOLOv2
and the residual connections from ResNet. It has accuracy comparable to ResNet-152 while
being about 2× faster.
In the previous version, YOLOv2 performed poorly on small objects, with only 5% average
precision. Thus, YOLOv3 introduces multi-scale predictions to fix this issue: it
extracts features at three different scales. The feature map from two layers earlier is
up-sampled and merged with an earlier feature map by concatenation. This operation
provides the network with both more semantic information and lower-level information.
The above two improvements help YOLOv3 gain a decent detection accuracy while
still maintaining a fast prediction speed.
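As an illustrative sketch (not the exact pipeline used in this work), YOLOv3 inference can be run through OpenCV's DNN module given the standard Darknet configuration and weight files; the file names and thresholds below are assumptions:

import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
layer_names = net.getUnconnectedOutLayersNames()

def detect(image, conf_threshold=0.5):
    # Resize to the network input size and scale pixel values to [0, 1].
    blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (608, 608), swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(layer_names)
    detections = []
    for output in outputs:
        for det in output:                       # det = [cx, cy, w, h, objectness, class scores...]
            scores = det[5:]
            class_id = int(np.argmax(scores))
            score = float(scores[class_id])
            if score > conf_threshold:
                detections.append((det[0:4], class_id, score))
    return detections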
3.3 Control Algorithm
The control algorithm serves as the brain of the overall system. Based on the current
input information, it calculates and decides where to go next. Since we plan to move the
camera to capture a better view of the object, we first need to decide what a 'good
position' means for object detection.
On the one hand, the center of the image frame is a better location than the corners.
Placing the object at the center of the FOV gives a clear frontal view and helps the
camera capture the image without distortion or focus problems. Once we ensure that the
center of the target object is within the FOV, we can carry out other operations without
worrying about it leaving the frame.
On the other hand, increasing the size of the object in the image frame is essential.
Since a tile crack is usually relatively tiny and not noticeable, the object detector
cannot perform well when the object appears small.
3.3.1 Preliminaries and Definitions
Before discussing the controller design in more detail, some background knowledge and
definitions require further explanation.
Image information is essential for controlling the robot in visual servoing and all
other vision-based control algorithms. We can form the relationship between a point in
the image plane and the real object by knowing the intrinsic camera parameters and the
depth information.
Assume that the x-axis and y-axis form the fundamental plane of the image and that the
z-axis is perpendicular to that plane, along the optical axis. This setting defines the
camera coordinate system as shown in Figure 3.2.
[Figure 3.2: The camera coordinate system: a 3D point P = (X, Y, Z) is projected through the viewing point onto the image plane at p = (u, v).]
Using perspective projection, a point $P = (X, Y, Z)$ in the camera frame is mapped onto
the image plane as
$$\pi(X, Y, Z) = \begin{bmatrix} u \\ v \end{bmatrix} = \frac{\lambda}{Z}\begin{bmatrix} X \\ Y \end{bmatrix} \tag{3.1}$$
where $\lambda$ is the focal length of the camera lens, which indicates the distance
between the origin and the image plane.
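A small sketch of (3.1) as a function; the focal length value is an arbitrary placeholder, and real cameras additionally add the principal point offset:

def project(X, Y, Z, focal_length=615.0):
    # pi(X, Y, Z) = (focal_length / Z) * (X, Y), as in (3.1).
    u = focal_length * X / Z
    v = focal_length * Y / Z
    return u, v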
To understand the relationship between the base frame of the robot manipulator and the
end-effector, we need to define the angular and translational velocity of the
end-effector. The motion with respect to the base coordinates can be separated into the
angular velocity $\Omega(t) = [\omega_x(t), \omega_y(t), \omega_z(t)]^T$ and the
translational velocity $T(t) = [T_x(t), T_y(t), T_z(t)]^T$. Let $P$ be a point rigidly
attached to the end-effector whose base frame coordinates are $[x, y, z]^T$. According
to Hutchinson et al. [31], if we take the time derivative of the coordinates of $P$ and
define the skew-symmetric matrix
$$sk(P) = \begin{bmatrix} 0 & -z & y \\ z & 0 & -x \\ -y & x & 0 \end{bmatrix} \tag{3.6}$$
we can write
$$\dot{P} = sk(P)\,\Omega + T \tag{3.7}$$
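Equations (3.6) and (3.7) can be written directly as a small NumPy sketch:

import numpy as np

def sk(P):
    # Skew-symmetric matrix of a 3D point P = [x, y, z], as in (3.6).
    x, y, z = P
    return np.array([[0.0, -z,   y],
                     [z,   0.0, -x],
                     [-y,  x,   0.0]])

def point_velocity(P, omega, T):
    # P_dot = sk(P) @ omega + T, as in (3.7).
    return sk(P) @ omega + T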
30
2 3
Tx
6 7
6T 7
6 y7
6 7
6 Tz 7
ṙ = 6
6! 7
7 (3.8)
6 x7
6 7
6!y 7
4 5
!z
where r is the coordinates of the robot end-e↵ector coordinate frame in the task
space.
Camera Configuration
Typically, there are two common camera configurations for visual servo systems: the
camera can be mounted on the end-effector or fixed in the workspace [31].
The first configuration is usually called the eye-in-hand configuration. In our
experimental setup, we adopt this setting and mount the camera on the end-effector,
because we want the camera to move together with the robot manipulator. The relative
pose between the camera and the end-effector is always constant, so under this setting
we can control the camera by controlling the motion of the end-effector.
The image Jacobian matrix relates differential changes in the image feature parameters
to differential changes in the manipulator position. The image Jacobian matrix
$J_{img} \in \mathbb{R}^{k \times m}$ satisfies [31]
$$\dot{X} = J_{img}\,\dot{r} \tag{3.9}$$
where $X$ is the image feature parameter vector, $\dot{X}$ is the corresponding image
feature velocity, and $\dot{r}$ represents the end-effector velocity in the task space
$r$ defined in (3.8).
In component form,
$$J_{img} = \frac{\partial v}{\partial r} = \begin{bmatrix} \dfrac{\partial v_1(r)}{\partial r_1} & \cdots & \dfrac{\partial v_1(r)}{\partial r_m} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial v_k(r)}{\partial r_1} & \cdots & \dfrac{\partial v_k(r)}{\partial r_m} \end{bmatrix} \tag{3.10}$$
where $v_i$ is the $i$-th component of the image feature velocity $\dot{X}$, $k$ is the
dimension of the image feature vector, and $m$ is the dimension of the task space $r$.
The definition of the image Jacobian matrix was first presented by Weiss et al. [45] in
1987. Equation (3.9) demonstrates how changes in the image features are related to
changes in the robot end-effector position.
For a single image feature point $(u, v)$ with depth $Z$, the image Jacobian is
$$J_{img}(X, Z) = \begin{bmatrix} \dfrac{f_x}{Z} & 0 & -\dfrac{u}{Z} & -\dfrac{uv}{f_x} & \dfrac{f_x^2 + u^2}{f_x} & -v \\[6pt] 0 & \dfrac{f_y}{Z} & -\dfrac{v}{Z} & -\dfrac{f_y^2 + v^2}{f_y} & \dfrac{uv}{f_y} & u \end{bmatrix} \tag{3.11}$$
where $f_x$ and $f_y$ are the constant focal lengths of the camera and $Z$ is the depth
of the selected feature point in the depth frame.
For multiple feature points, the individual Jacobians are stacked row-wise:
$$J_{img} = \begin{bmatrix} J_{img}(X_1, Z_1) \\ \vdots \\ J_{img}(X_n, Z_n) \end{bmatrix} \tag{3.12}$$
where $u_k$ and $v_k$ are the coordinates of the $k$-th selected image feature point and
$Z_k$ is the depth of the $k$-th feature point in the depth frame.
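A sketch of (3.11) and the stacking in (3.12), assuming NumPy; variable names are illustrative:

import numpy as np

def point_jacobian(u, v, Z, fx, fy):
    # 2x6 image Jacobian of a single feature point, as in (3.11).
    return np.array([
        [fx / Z, 0.0, -u / Z, -u * v / fx, (fx**2 + u**2) / fx, -v],
        [0.0, fy / Z, -v / Z, -(fy**2 + v**2) / fy, u * v / fy,  u],
    ])

def stacked_jacobian(points, depths, fx, fy):
    # Stack the 2x6 Jacobians of all feature points, as in (3.12).
    return np.vstack([point_jacobian(u, v, Z, fx, fy)
                      for (u, v), Z in zip(points, depths)])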
3.3.2 Image-Based Visual Servoing and Region-Based Control
Visual servoing is the technique of applying machine vision information to closed-loop
position control of the robot end-effector. It is used here to control the robot system
using image features provided by the YOLO bounding box.
As mentioned in Section 3.3.1, there are two commonly used camera configurations. In
both cases, the motion of the robot manipulator contributes to changes in the image
feature parameters. The idea of visual servoing is to design a suitable error function
$e$ that minimizes the difference between the desired and current image feature
parameters; when the task is finished, $e = 0$.
We select the center $(x^c_{yolo}, y^c_{yolo})$ and the top-left corner
$(x^{tl}_{yolo}, y^{tl}_{yolo})$ of the YOLO bounding box as the image feature vector:
$$X_{yolo} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} x^c_{yolo} \\ y^c_{yolo} \\ x^{tl}_{yolo} \\ y^{tl}_{yolo} \end{bmatrix} \tag{3.13}$$
Thus, according to (3.12), the overall Jacobian matrix for the image feature vector
$X_{yolo}$ can be written as
33
2 3
fx xcyolo xcyolo yyolo
c fx 2 +xcyolo 2 c
0 yyolo
6Z Z fx fx 7
6 fy
c
yyolo fy 2 +yyolo
c 2 xcyolo yyolo
c
7
60 Z Z fy fy
xcyolo 7
yolo
Jimg =6
6 fx xtl xtl tl fx 2 +xtl
2 7
7 (3.14)
yolo yyolo tl
6Z 0 yolo
Z fx fx
yolo
yyolo 7
4 tl 2
5
fy
tl
yyolo fy 2 +yyolo xtl tl
yolo yyolo
0 Z Z fy fy
xtlyolo
yolo +
ṙ = Jimg Ẋ (3.15)
yolo+ yolo
where Jimg is the pseudoinverse of Jimg 2 <k⇥m .
yolo+ yolo T
yolo yolo T 1
Jimg = Jimg (Jimg Jimg ) (3.16)
u(t) = ṙ (3.17)
The region-reaching control technique is deployed here to guide the robot system toward
the direction that extends the bounding box area to the desired size. We believe that
increasing the object size to an optimal value will help the machine learning algorithm.
34
The desired region is defined by scalar functions which are di↵erentiable at all
points for its first partial derivatives. In our approach, it is specified by the following
inequality:
wt2
f1 (x3 ) = ((xcyolo x3 ) 2 ) 0 (3.18)
4
c h2t
f2 (x4 ) = ((yyolo x4 ) 2 ) 0 (3.19)
4
[Figure: the image frame and the YOLO bounding box, showing the bounding box center and the desired width and height thresholds.]
In (3.18) and (3.19), $(x^c_{yolo} - x_3)$ is half of the width of the YOLO bounding
box, $w_{yolo}/2$, and $(y^c_{yolo} - x_4)$ is half of its height, $h_{yolo}/2$.
The corresponding potential energy functions are defined as
$$P_3(x_3) = \frac{k_{ptl3}}{2}\left[\min\left(0, f_1(x_3)\right)\right]^2 \tag{3.20}$$
$$P_4(x_4) = \frac{k_{ptl4}}{2}\left[\min\left(0, f_2(x_4)\right)\right]^2 \tag{3.21}$$
where $k_{ptl3}$ and $k_{ptl4}$ are positive constants.
That is,
$$P_3(x_3) = \begin{cases} 0 & \text{if } f_1(x_3) \geq 0 \\[4pt] \dfrac{k_{ptl3}}{2} f_1(x_3)^2 & \text{if } f_1(x_3) < 0 \end{cases} \tag{3.22}$$
$$P_4(x_4) = \begin{cases} 0 & \text{if } f_2(x_4) \geq 0 \\[4pt] \dfrac{k_{ptl4}}{2} f_2(x_4)^2 & \text{if } f_2(x_4) < 0 \end{cases} \tag{3.23}$$
Taking the partial derivative of the potential energy function (3.20) with respect to
$x_3$ gives
$$\left(\frac{\partial P_3(x_3)}{\partial x_3}\right)^T = \begin{cases} 0 & \text{if } f_1(x_3) \geq 0 \\[4pt] k_{ptl3}\, f_1(x_3)\left(\dfrac{\partial f_1(x_3)}{\partial x_3}\right)^T & \text{if } f_1(x_3) < 0 \end{cases} \tag{3.24}$$
Similarly, for $x_4$ the partial derivative of the potential energy function (3.21) is
$$\left(\frac{\partial P_4(x_4)}{\partial x_4}\right)^T = \begin{cases} 0 & \text{if } f_2(x_4) \geq 0 \\[4pt] k_{ptl4}\, f_2(x_4)\left(\dfrac{\partial f_2(x_4)}{\partial x_4}\right)^T & \text{if } f_2(x_4) < 0 \end{cases} \tag{3.25}$$
Thus, if $w_{yolo}$ or $h_{yolo}$ is larger than the desired value, the corresponding
term $\left(\partial P_3(x_3)/\partial x_3\right)^T$ or
$\left(\partial P_4(x_4)/\partial x_4\right)^T$ becomes zero. These terms will be used
to calculate the end-effector velocity later in the thesis.
Meanwhile, as we want to move the bounding box center towards the image frame center, we
define the potential energy functions for $x_1$ and $x_2$ as
$$P_1(x_1) = \frac{k_{px}}{2}\left(x_1 - x^c_{img}\right)^2 \tag{3.27}$$
$$P_2(x_2) = \frac{k_{py}}{2}\left(x_2 - y^c_{img}\right)^2 \tag{3.28}$$
where $(x^c_{img}, y^c_{img})$ are the coordinates of the image frame center and
$k_{px}$, $k_{py}$ are positive constants.
Combining (3.20), (3.21), (3.27), and (3.28), we can describe our overall potential
function as
$$P_{com}(X) = P_1(x_1) + P_2(x_2) + P_3(x_3) + P_4(x_4) \tag{3.29}$$
Its gradient with respect to the image feature vector is
$$\left(\frac{\partial P_{com}(X)}{\partial X}\right)^T = \begin{bmatrix} \dfrac{\partial P_{com}(X)}{\partial x_1} \\[6pt] \dfrac{\partial P_{com}(X)}{\partial x_2} \\[6pt] \dfrac{\partial P_{com}(X)}{\partial x_3} \\[6pt] \dfrac{\partial P_{com}(X)}{\partial x_4} \end{bmatrix} \tag{3.30}$$
$$= \begin{bmatrix} k_{px}\left(x_1 - x^c_{img}\right) \\ k_{py}\left(x_2 - y^c_{img}\right) \\ k_{ptl3}\min\left(0, f_1(x_3)\right)\left(x_3 - x^c_{yolo}\right) \\ k_{ptl4}\min\left(0, f_2(x_4)\right)\left(x_4 - y^c_{yolo}\right) \end{bmatrix} \tag{3.31}$$
Thus, $\dfrac{\partial P_{com}(X)}{\partial X}$ can be further written as
$$\frac{\partial P_{com}(X)}{\partial X} = K \begin{bmatrix} x_1 - x^c_{img} \\ x_2 - y^c_{img} \\ \min\left(0, f_1(x_3)\right)\left(x_3 - x^c_{yolo}\right) \\ \min\left(0, f_2(x_4)\right)\left(x_4 - y^c_{yolo}\right) \end{bmatrix} \tag{3.32}$$
where the gain matrix is
$$K = \operatorname{diag}\left(k_{px},\, k_{py},\, k_{ptl3},\, k_{ptl4}\right) \tag{3.33}$$
At the same time, we define the error function $e(X)$ as
$$e(X) = \begin{bmatrix} x_1 - x^c_{img} \\ x_2 - y^c_{img} \\ \min\left(0, f_1(x_3)\right)\left(x_3 - x^c_{yolo}\right) \\ \min\left(0, f_2(x_4)\right)\left(x_4 - y^c_{yolo}\right) \end{bmatrix} \tag{3.34}$$
The relationship between the end-effector velocity in the task space and the error in
the visual space can then be described as
$$\dot{r} = -J^{yolo+}_{img} K\, e(X) \tag{3.35}$$
since we want to minimize the error by sending the desired velocity signal to the robot.
The control input is therefore
$$u(t) = \dot{r} = -J^{yolo+}_{img} K \begin{bmatrix} x_1 - x^c_{img} \\ x_2 - y^c_{img} \\ \min\left(0, f_1(x_3)\right)\left(x_3 - x^c_{yolo}\right) \\ \min\left(0, f_2(x_4)\right)\left(x_4 - y^c_{yolo}\right) \end{bmatrix} \tag{3.36}$$
By applying this controller to the robot, we aim to control the YOLO bounding
box center coordinates and area simultaneously to improve the YOLO detection
results.
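One control step of (3.34)-(3.36) can be sketched as follows; the gain matrix K is the diagonal matrix from (3.33), J_yolo is the 4x6 stacked Jacobian from (3.14) (built, for instance, with the sketch given after (3.12)), and all numerical values are placeholders:

import numpy as np

def region_error(center, top_left, img_center, w_t, h_t):
    x1, x2 = center                           # bounding box center
    x3, x4 = top_left                         # bounding box top-left corner
    f1 = (x1 - x3) ** 2 - w_t ** 2 / 4.0      # region function (3.18)
    f2 = (x2 - x4) ** 2 - h_t ** 2 / 4.0      # region function (3.19)
    return np.array([x1 - img_center[0],      # error vector e(X), as in (3.34)
                     x2 - img_center[1],
                     min(0.0, f1) * (x3 - x1),
                     min(0.0, f2) * (x4 - x2)])

def control_step(J_yolo, e, K):
    # u(t) = -J^+ K e(X), with the right pseudoinverse of the 4x6 Jacobian, (3.16) and (3.36).
    J_pinv = J_yolo.T @ np.linalg.inv(J_yolo @ J_yolo.T)
    return -J_pinv @ (K @ e)

In practice, the depth Z used in the Jacobian would come from the depth frame at the selected feature point, as described in Section 3.3.1.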
3.3.3 Lyapunov Stability Analysis
Next, the stability of the system is analyzed using both a kinematic approach and a
dynamic approach.
Kinematic Analysis
We use the overall potential function as a Lyapunov-like candidate:
$$V = P_{com}(X) = \frac{k_{px}}{2}\left(x_1 - x^c_{img}\right)^2 + \frac{k_{py}}{2}\left(x_2 - y^c_{img}\right)^2 + \frac{k_{ptl3}}{2}\left[\min\left(0, f_1(x_3)\right)\right]^2 + \frac{k_{ptl4}}{2}\left[\min\left(0, f_2(x_4)\right)\right]^2 \tag{3.37}$$
Since $P_{com}(X)$ is a continuous scalar function with continuous first partial
derivatives, differentiating (3.37) with respect to time gives
$$\dot{V} = \dot{X}^T \frac{\partial P_{com}(X)}{\partial X} \tag{3.38}$$
From (3.32) and (3.34), the gradient can be written as
$$\frac{\partial P_{com}(X)}{\partial X} = K e(X) \tag{3.39}$$
Combining (3.9) with (3.35), we obtain the relationship between the pixel error and the
image feature velocity:
$$\dot{X} = J^{yolo}_{img}\dot{r} = J^{yolo}_{img}\left(-J^{yolo+}_{img} K e(X)\right) = -J^{yolo}_{img} J^{yolo+}_{img} K e(X) = -K e(X) \tag{3.40}$$
which means the image feature velocity is proportional to the error function $e(X)$.
Substituting (3.39) and (3.40) into (3.38) yields
$$\dot{V} = -\left(K e(X)\right)^T K e(X) = -e(X)^T K^T K e(X) \leq 0 \tag{3.41}$$
We can therefore see that $\dot{V} \leq 0$, since $K$ in (3.33) is defined as a diagonal
matrix with all positive entries; its eigenvalues are all positive and $K$ is a positive
definite matrix. The system is thus stable in the sense of the Lyapunov-like function.
When the system converges, $\dot{V} = 0$, which implies $e(X) = 0$, that is,
$$\begin{aligned} x_1 &= x^c_{img} \\ x_2 &= y^c_{img} \\ \min\left(0, f_1(x_3)\right)\left(x_3 - x^c_{yolo}\right) &= 0 \\ \min\left(0, f_2(x_4)\right)\left(x_4 - y^c_{yolo}\right) &= 0 \end{aligned} \tag{3.43}$$
For x_1 and x_2, this means that they equal the image frame center coordinates at the stable state. For x_3 and x_4, since the bounding box exists, we have

x_3 - x^c_yolo ≠ 0
x_4 - y^c_yolo ≠ 0    (3.44)

To make the third and fourth equations of (3.43) equal to zero, we therefore need

min(0, f_1(x_3)) = 0
min(0, f_2(x_4)) = 0    (3.45)
This implies f_1(x_3) ≥ 0 and f_2(x_4) ≥ 0. Substituting the definitions of f_1(x_3) and f_2(x_4) into these inequalities, we get

(x^c_yolo - x_3)^2 - w_t^2/4 ≥ 0
(y^c_yolo - x_4)^2 - h_t^2/4 ≥ 0    (3.46)
Hence the YOLO bounding box height and width are larger than or equal to the preset thresholds. In other words, when the system reaches the stable state, all of the variables meet our preset requirements.
Dynamic Analysis
Furthermore, we want to prove the system’s stability by using the dynamic ap-
proach.
Let r ∈ R^n represent the position vector of the robot in task space [48]; then

r = h(q)    (3.47)

where q ∈ R^n is a vector of joint coordinates and h(q): R^n → R^n describes the transformation between the joint space and the task space. Differentiating (3.47) with respect to time gives

ṙ = J(q) q̇    (3.48)

where J(q) is the Jacobian matrix which maps joint-space velocities to task-space velocities.
The equations of motion of a robot with n degrees of freedom are given in joint space as

M(q) q̈ + ((1/2) Ṁ(q) + S(q, q̇)) q̇ + g(q) = τ    (3.49)

where M(q) is the inertia matrix, which is symmetric and positive definite, S(q, q̇) is a skew-symmetric matrix, g(q) denotes the gravitational force vector and τ denotes the control input. The control input is proposed as

τ = -K_v q̇ - J^T(q) J_img^{yolo T} ∂P_com(X)/∂X + g(q)    (3.50)
where K_v ∈ R^{n×n} is a positive definite velocity feedback gain matrix, J^T(q) is the transpose of the Jacobian matrix, and J_img^{yolo T} is the transpose of the image Jacobian matrix. Substituting the control law (3.50) into (3.49) yields the closed-loop equation

M(q) q̈ + ((1/2) Ṁ(q) + S(q, q̇)) q̇ + K_v q̇ + J^T(q) J_img^{yolo T} ∂P_com(X)/∂X = 0    (3.51)
Consider the Lyapunov-like function candidate

V = (1/2) q̇^T M(q) q̇ + P_com    (3.52)

Differentiating (3.52) with respect to time gives

V̇ = q̇^T M(q) q̈ + (1/2) q̇^T Ṁ(q) q̇ + Ẋ^T ∂P_com(X)/∂X    (3.53)

Since Ẋ^T = ṙ^T J_img^{yolo T} = q̇^T J^T(q) J_img^{yolo T}, substituting the closed-loop dynamics (3.51) into (3.53) and using the skew-symmetry of S(q, q̇), the expression can be further simplified to

V̇ = -q̇^T K_v q̇ ≤ 0    (3.55)

Since K_v is a positive definite matrix, V̇ ≤ 0 and the closed-loop system is stable.
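For completeness, the intermediate algebra between (3.53) and (3.55), omitted above, can be sketched as follows, using the skew-symmetry property q̇^T S(q, q̇) q̇ = 0 of the dynamic model (3.49):

V̇ = q̇^T M(q) q̈ + (1/2) q̇^T Ṁ(q) q̇ + Ẋ^T ∂P_com(X)/∂X
  = q̇^T [ -((1/2) Ṁ(q) + S(q, q̇)) q̇ - K_v q̇ - J^T(q) J_img^{yolo T} ∂P_com(X)/∂X ] + (1/2) q̇^T Ṁ(q) q̇ + q̇^T J^T(q) J_img^{yolo T} ∂P_com(X)/∂X
  = -q̇^T K_v q̇

where the second line substitutes M(q) q̈ from the closed-loop equation (3.51), and the last line uses Ẋ^T = q̇^T J^T(q) J_img^{yolo T} so that the gradient terms cancel.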
From LaSalle's invariance theorem [49], we have q̇ → 0 as t → ∞, and from (3.51) the largest invariant set satisfies

J^T(q) J_img^{yolo T} ∂P_com(X)/∂X = 0    (3.56)

Provided that J(q) and J_img^{yolo} are of full rank, this implies

∂P_com(X)/∂X = 0    (3.57)
The singularity of the Jacobian matrix can be monitored by checking the manipulability of the manipulator [50]. Singularity avoidance can also be achieved by using a redundant robot with task-priority control [51], which exploits the null space of the Jacobian matrix.
From (3.57) and the definition of P_com(X), the stable state therefore satisfies

x_1 = x^c_img
x_2 = y^c_img
(x^c_yolo - x_3)^2 - w_t^2/4 ≥ 0
(y^c_yolo - x_4)^2 - h_t^2/4 ≥ 0    (3.59)

It means that all of the variables reach their desired values or lie within the desired regions when the system is stable.
Chapter 4
Experiments are the final step in verifying the proposed control algorithm, and it is essential to choose suitable hardware and an appropriate experimental setup before conducting them. This chapter is therefore split into a vision system part and a robot system part. The vision system section covers the camera and the experimental results of the object detector, while the robot system section presents the robot manipulator and the final experimental results.
Tech Specs          Depth                 RGB
FOV                 87° × 58°             69° × 42°
Resolution          Up to 1280 × 720      Up to 1920 × 1080
Frame Rate          Up to 90 fps          Up to 30 fps
Depth Accuracy      < 2% at 2 m           -
Table 4.1: Technical specifications of the Intel RealSense Depth Camera D435
Moreover, as stated on the official website¹, its ideal working range is from 0.3 m to 3 m, which means the camera is capable of performing daily inspection work. In addition, with a weight of only 72 g, the camera can easily be mounted on the robot system without adding a significant gravity load.
The last step before training an object detection model is preparing a suitable dataset. A high-quality dataset is the foundation of a well-performing object detector: it should contain diverse images covering different lighting conditions, distances, and object types, since the more varied the data fed into the model, the better it can learn and predict in real time. In our case, we select eight classes that appear frequently on construction sites: doors, windows, electrical switches, electrical powers, electrical mains, electrical telecom ports, electrical lights, and tile cracks.
Of the 1098 images collected, 1042 were taken at the TeamBuild construction site in Sengkang, Singapore. The rest were collected in the robotics lab at Nanyang Technological University as supplementary data.
After obtaining these images, we use the open-source tool LabelImg to create annotations for training, as shown in Figure 4.1. LabelImg enables us to draw ground-truth bounding boxes around the objects and automatically saves the labels in the TXT format that YOLOv3 requires.
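For reference, each YOLO TXT label file contains one line per object of the form "class_id x_center y_center width height", with all coordinates normalised by the image size. A minimal Python sketch for converting such a line back to pixel coordinates is shown below; the example values and class index are purely illustrative.

def parse_yolo_label(line, img_w, img_h):
    # one line of a YOLO TXT label: "<class_id> <x_center> <y_center> <width> <height>",
    # where the last four values are normalised to [0, 1]
    cls, xc, yc, w, h = line.split()
    xc, yc = float(xc) * img_w, float(yc) * img_h
    w, h = float(w) * img_w, float(h) * img_h
    return int(cls), xc, yc, w, h

# usage: cls, xc, yc, w, h = parse_yolo_label("3 0.4821 0.5310 0.1250 0.2083", 1280, 720)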
When all images are labeled with correct annotations, training can begin. The learning rate is chosen as 0.00025, while the maximum number of training batches is set to 3000. We use a Tesla V100 DGX Station server to train the model; it takes around 6 hours to obtain the final weights.
¹ https://www.intelrealsense.com/depth-camera-d435/
Figure 4.1: Basic user interface of the labeling tool LabelImg. The ground-truth bounding box is drawn in green.
We proceed to test the trained model on a testing dataset of two hundred and twenty images randomly picked from the entire dataset, none of which the model has seen during training. Some sample detection results are shown in Figure 4.2 and the mAP for each class is given in Table 4.2.
Figure 4.2: YOLO detection results of testing images. electrical telecom and
electrical power are shown in (a), door and electrical switch are shown in (b),
electrical light is shown in (c), electrical switch is shown in (d), window installed
is shown in (e) and tile crack is shown in (f)
Class Name            mAP (%)
doors installed       82.13
windows installed     84.18
electrical switch     91.02
electrical power      92.62
electrical telecom    96.03
electrical lights     86.78
tile cracks           78.17
Table 4.2: mAP of each class on the testing dataset
From Table 4.2 we can see that most classes reach a decent mAP of around 90% on the testing dataset, while some classes, such as tile cracks and doors installed, have slightly lower accuracy. Moreover, we expect a decrease during the real-time experiments, as the robot's movement leads to blurry and jerky images. Thus, additional assistance from the robotic side is needed to boost the performance of the machine learning model: the control algorithm should be designed to enable the camera to position itself in an optimal location for viewing the object.
In addition to the technical specifications listed in Table 4.3, the robot also has some limitations imposed for safety reasons.
The depth camera is mounted on the end-effector of the manipulator, as demonstrated in Figure 4.3. We call this an eye-in-hand configuration, which means the camera moves together with the robot end-effector. The robot and the processor communicate with each other over an Ethernet socket.
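As a rough illustration of this communication scheme, the Python sketch below sends a Cartesian velocity command to the robot as a URScript speedl() string over a TCP socket; the IP address, port choice and velocity values are assumptions for illustration and not necessarily those used in our setup (UR e-Series controllers typically accept URScript strings on port 30002).

import socket

ROBOT_IP = "192.168.0.100"   # placeholder address of the UR5e controller
URSCRIPT_PORT = 30002        # UR interface that accepts URScript strings (assumed)

def send_tool_velocity(v, a=0.2, t=0.1):
    # v = [vx, vy, vz, wx, wy, wz] in m/s and rad/s; a = tool acceleration; t = duration
    cmd = "speedl([{:.4f}, {:.4f}, {:.4f}, {:.4f}, {:.4f}, {:.4f}], {}, {})\n".format(*v, a, t)
    with socket.create_connection((ROBOT_IP, URSCRIPT_PORT), timeout=2.0) as s:
        s.sendall(cmd.encode("utf-8"))

# e.g. move the tool slowly along +x of the base frame for 0.1 s:
# send_tool_velocity([0.02, 0.0, 0.0, 0.0, 0.0, 0.0])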
Weight                     20.7 kg
Reach                      850 mm
Maximum Payload            5 kg
Joint Ranges               ±360° for all joints
Speed                      Joints: max 180°/s, Tool: approx. 1 m/s
System Update Frequency    500 Hz
Pose Repeatability         ±0.03 mm
Degrees of Freedom         6 rotating joints
Communication              Ethernet socket, MODBUS TCP & EtherNet/IP Adapter, Profinet
Table 4.3: Technical specification of the UR5e robot manipulator from Universal Robots
4.2.2 Experimental Results
To quantify how the detection confidence depends on the bounding box, we fit simple linear regression models of the form

Y_i = β_0 + β_1 X_i + u_i    (4.1)

where the index i runs over the observations, Y_i is the dependent variable (the regressand, or simply the left-hand variable) and X_i is the independent variable. Y_i = β_0 + β_1 X_i + u_i is the population regression line, also called the population regression function; β_0 is its intercept, β_1 is its slope, and u_i is the error term. Thus, we fit a linear regression model with a 95% confidence interval, as shown in Figure 4.4, to display the relationship between the confidence level and the bounding box length.
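As an illustration, such a regression and a 95% confidence interval for its slope can be computed with a few lines of Python; the data arrays below are placeholders standing in for the recorded (bounding box length, confidence) pairs, and scipy is assumed to be available.

import numpy as np
from scipy import stats

# placeholder data: bounding box lengths (pixels) and YOLO confidence levels (%)
box_len = np.array([120.0, 150.0, 180.0, 210.0, 240.0])
conf = np.array([72.0, 78.5, 85.1, 88.9, 93.0])

res = stats.linregress(box_len, conf)        # ordinary least-squares fit
print(f"slope = {res.slope:.3f}, intercept = {res.intercept:.3f}")

# two-sided 95% confidence interval for the slope
t_crit = stats.t.ppf(0.975, df=len(box_len) - 2)
ci = (res.slope - t_crit * res.stderr, res.slope + t_crit * res.stderr)
print(f"95% CI for slope: ({ci[0]:.3f}, {ci[1]:.3f})")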
Figure 4.4: (a) shows the change of confidence level with bounding box length. The y-axis is the YOLO confidence level and the x-axis is the YOLO bounding box width. The blue line is the linear regression model, with its 95% confidence interval shaded in blue. (b) shows the change of confidence level with the distance between the bounding box center and the image frame center. The x-axis is the pixel distance between the YOLO bounding box center and the image center and the y-axis is the YOLO confidence level. The blue line is again the linear regression model, with its 95% confidence interval shaded in blue.
As per the calculation, the regression model for the bounding box length and the confidence level is

y_i^c = 0.3 x_i^len + 35.9 + 0.02699 = 0.3 x_i^len + 35.92699    (4.2)

where y_i^c is the YOLO confidence level for frame i and x_i^len is the corresponding YOLO bounding box length. The slope of the regression line is 0.3, the intercept is 35.9 and the error is 0.02699. The regression model shows that the confidence level and the bounding box length are positively correlated, which means that increasing the bounding box length also improves the confidence level.
In the second experiment, we fix the distance between the camera and the object to minimize the change of bounding box length. By controlling only the orientation of the robot end-effector, we move the bounding boxes across the whole image frame and record their confidence levels accordingly. The experimental result is shown in Figure 4.4(b); the x-axis is the pixel distance between the YOLO bounding box center and the image center and the y-axis is the YOLO confidence level. When the distance is close to zero, the confidence level reaches more than 80%, and as the box moves away the confidence level gradually drops to below 60%.
We also calculate the regression model, given by the following equation:

y_i^c = -x_i^dist + 83.9 + 0.009574 = -x_i^dist + 83.909574    (4.3)

where y_i^c is the YOLO confidence level for frame i and x_i^dist is the corresponding distance from the YOLO bounding box center to the image frame center. The slope of the regression line is -1, the intercept is 83.9 and the error is 0.00957. The regression model shows that the confidence level and the distance of the bounding box center from the image center are negatively correlated, which means that moving the bounding box center closer to the image center also improves the confidence level.
However, these experimental results cannot be generalized; they are specific to the customized YOLO model, the experimental setup, and the object features.
Figure 4.5: Change of the YOLO bounding box location, length and confidence level during the experiment using the YOLO model trained with 1000 epochs.
Figure 4.5 shows the results of this experiment. We can observe that detections only appear in the first few steps; after the robot manipulator starts moving, the model can no longer detect the crack, and most of the confidence level values are below 20%. Similarly, the bounding box length in Figure 4.5(b) and the center location shown in Figure 4.5(c) cannot be calculated properly and change dramatically during the experiment. For these reasons, the controller cannot continue to move towards the target location: one thousand epochs are not enough for the controller to perform properly.
Figure 4.6: Change of the YOLO bounding box location, length and confidence level during the experiment using the YOLO model trained with 2000 epochs.
When the model is trained for 2000 epochs, the YOLO detections appear in most of the frames and are quite stable, as we can see in Figure 4.6, and the robot manipulator can move towards the center of the image frame, as displayed in Figure 4.6(c). However, the bounding box is excessively larger than the actual object and exceeds the preset threshold from the very beginning, as shown in Figure 4.6(b). Thus, the controller is not able to increase the bounding box size, and for this reason the YOLO confidence level does not improve either, as demonstrated in Figure 4.6(a): it is still below 50% at the end of the experiment, even though the bounding box moves towards the center of the image frame.
Figure 4.7: Change of the YOLO bounding box location, length and confidence level during the experiment using the YOLO model trained with 3000 epochs.
When the number of epochs finally reaches 3000, which corresponds to the final weights used in the experiments, the center coordinates of the YOLO bounding box converge gradually to the center of the image frame (640, 360), as shown in Figure 4.7(c). Meanwhile, the height of the YOLO bounding box reaches the preset threshold and is maintained at a certain level, as displayed in Figure 4.7(b). Following that, the confidence level increases from below 50% to above 70%.
In summary, the average confidence level after reaching the desired target is listed in the table below for models trained with different numbers of epochs. The average confidence level is calculated by averaging the recorded YOLO confidence levels after the YOLO bounding box center has reached the image frame center. If the robot manipulator is unable to approach the target or there is no detection, we mark the confidence level as 0%.
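A minimal sketch of this metric is given below; the helper name and the way the "reached center" flag is obtained are illustrative assumptions rather than the actual logging code.

def average_end_confidence(confidences, reached_center_flags):
    # average the recorded YOLO confidence levels over the frames after the bounding box
    # center has reached the image frame center; 0.0 if the target was never reached
    vals = [c for c, reached in zip(confidences, reached_center_flags) if reached]
    return sum(vals) / len(vals) if vals else 0.0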
According to this table, when the number of epochs is 1000 or fewer, there is either no detection or the detection is not stable enough to run the controller. When the model is trained with 2000 epochs, the robot can reach the desired target but with a low confidence level. When the number of epochs reaches 3000, the controller can approach the image frame center with a high confidence level while maintaining the length of the bounding box at a certain level. At least 3000 epochs are therefore needed for the controller to perform properly.
3. Experimental Results
After obtaining decent YOLO detection results, we proceed to test the controller with the target objects. As we can see in Figure 4.8, in the sequence from (a) to (d) the bounding box moves from the bottom right corner towards the center of the image frame while maintaining the desired size during the movement. The confidence level increases from 24.65% to 84.13%.
Figure 4.8: Change of the bounding box location and confidence level during the tile crack experiment.
We can observe the same trend in the corresponding plots. In Figure 4.9(a), the coordinates of the bounding box center begin at (870, 510) and converge to the center of the image frame (640, 320) within 40 steps, which is around 20 seconds. Since the bounding box height is larger than the width, we only control the height; it stays above the threshold of 180 pixels throughout the experiment. The confidence level in Figure 4.9(c) shows an increase when comparing the values at the start and at the end of the experiment.
Figure 4.9: Change of the bounding box location, bounding box length and confidence level during the tile crack experiment.
Figure 4.10: Change of the bounding box location and confidence level during the electrical power experiment.
We also repeated the experiments for electrical switches to show the generality of the control algorithm. In Figure 4.10, it is observed that the bounding box moves towards the center of the image frame and the confidence level improves from 35% to 92.6%.
We can observe the same trend in the corresponding plots. In Figure 4.11(a), the coordinates of the bounding box center begin at (1000, 600) and converge to the center of the image frame (640, 320) within 80 steps, which is around 40 seconds. Since the bounding box height is larger than the width, we only control the height; it is 105 pixels at the beginning and increases by more than 90%. The confidence level in Figure 4.11(c) shows a significant improvement when comparing the values at the start and at the end of the experiment.
Figure 4.11: Change of the bounding box location, bounding box length and confidence level during the electrical power experiment.
These two experiments demonstrate the ability of the controller to improve the YOLO confidence level by moving the bounding box towards the image center while maintaining the bounding box length.
No. Exp    Start Coord.         Start Conf. (%)    End Conf. (Centring) (%)    End Conf. (Region) (%)
1          (328.59, 202.37)     34.65              87.79                       97.81
2          (324.57, 1002.16)    27.73              38.24                       93.25
3          (340.44, 257.20)     46.36              29.33                       83.53
4          (349.37, 471.18)     43.82              66.41                       93.77
5          (427.91, 965.89)     35.74              89.33                       87.76
6          (368.01, 419.78)     57.29              44.39                       92.36
7          (328.59, 202.37)     38.54              32.27                       84.17
8          (789.88, 427.46)     62.81              87.38                       93.24
9          (344.9, 355.41)      52.29              69.44                       92.39
10         (329.00, 484.15)     37.75              89.76                       91.23
Avg        -                    43.70              63.43                       90.95
Table 4.5: Ten sets of YOLO bounding box center coordinates and their confidence levels. The end confidence level shows the confidence level when the corresponding controller reaches the center of the image frame.
Chapter 5
5.1 Conclusion
In this thesis, we discuss the application of vision-based robotic control and object
detection in the construction inspection process.
Chapter 1 introduces the background of the research work. It lists the reasons why construction automation is an essential component of the modernization process. First, data from the BCA in Singapore demonstrate that a large number of construction workers is needed every year, yet the injury and fatality rate of the construction industry remains among the highest due to various types of accidents. Moreover, labor productivity in the construction industry shows a decreasing trend, which suggests that few new technologies have been applied in the field. Since early 2020, the COVID-19 pandemic has further worsened the situation of recruiting new workers from overseas. For these reasons, the industry has been seeking robotic and automation solutions, which also explains our motivation for exploring construction inspection robots: we aim to improve productivity by applying neural networks and robotic technologies.
Chapter 2 conducts the necessary literature review for the research work. Firstly, to understand the current status of construction automation, we reviewed different types of construction robots such as excavation robots, interior finishing robots, and quality inspection and assessment robots. Following that, since a basic understanding of neural networks is essential for performing object detection work, the different steps of training a neural network are listed, including splitting the data, constructing the network architecture, and tuning the loss function. Meanwhile, we review various types of object detection models. Lastly, some vision-based control algorithms are discussed.
Chapter 3 describes the methodology of the proposed controller. It is split into two parts, the vision algorithm and the control algorithm. It starts with background information on the camera configuration, rigid-body velocity, and the Jacobian matrix. After that, an image-based visual servoing algorithm is introduced and we describe how it is integrated with the region-based controller. Finally, we prove the stability of the system using a kinematic and a dynamic approach.
The situation on a real construction site is complicated and dynamic. Placing the camera at an appropriate location and distance can improve detection accuracy. This thesis demonstrates the possibility of integrating the control algorithm with object detection and applying it to the construction process.
We will list three main recommendations that offer possibilities for future research.
In real-time detection, the predicted bounding box can occasionally deviate from the ground truth due to the unpredictable behavior of the neural network. This sudden change in the image frame may lead to sudden movements of the robotic arm and may even cause the experiments to fail.
To improve the stability of the detection results, a Long Short-Term Memory (LSTM) model could be combined with the existing YOLO object detector. LSTM is a recurrent neural network (RNN) architecture that can remember past information. After the detection model generates the bounding box information, this information becomes the input of the LSTM model. By collecting the real-time bounding box information and labelling the ground truth at the same time stamp, we can train an LSTM model to minimize the difference between them. In this way, we are able to predict the bounding box based on the current image information and the bounding box information from past frames.
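A minimal PyTorch sketch of such an LSTM predictor is given below; the network size, loss function and tensors are illustrative placeholders rather than a tested design.

import torch
import torch.nn as nn

class BoxLSTM(nn.Module):
    # predicts the next bounding box [xc, yc, w, h] from a short history of detections
    def __init__(self, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=4, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 4)

    def forward(self, box_seq):            # box_seq: (batch, T, 4)
        out, _ = self.lstm(box_seq)
        return self.head(out[:, -1, :])    # predicted box for the next frame

# training sketch: minimise the difference to the labelled ground-truth box
model = BoxLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.SmoothL1Loss()

boxes = torch.rand(8, 10, 4)      # placeholder sequences of past detections
target = torch.rand(8, 4)         # placeholder ground-truth boxes for the next frame
optimizer.zero_grad()
loss = loss_fn(model(boxes), target)
loss.backward()
optimizer.step()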
Moreover, we can also consider deploying a newer object detection model such as YOLOv5, released in 2020, which is reported to achieve higher average precision and a faster frame rate than YOLOv3 on test datasets. By modifying the existing object detection model in these ways, we aim to improve the stability of the proposed controller.
In order to tackle the stated problem, an object detector with rotated bounding boxes can be considered, for example the rotated-object detector released by NVIDIA², which provides an extra rotation angle for each detected bounding box. After obtaining the angle information from the object detector, we can add it to the current controller as a fifth control feature. In this case, the bounding box center location, size and rotation angle can all be controlled by the robot manipulator to obtain a better view of the target object.
² https://github.com/NVIDIA/retinanet-examples
List of Author's Publications¹
Journal Article
¹ The superscript * indicates joint first authors
Bibliography
[1] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[4] Thomas Bock and Thomas Linner. Robot-Oriented Design: design and management tools for the deployment of automation and robotics in construction. Cambridge University Press, 2015.
[6] Ehsan Asadi, Bingbing Li, and I-Ming Chen. Pictobot: a cooperative painting robot for interior finishing of industrial developments. IEEE Robotics & Automation Magazine, 25(2):82–94, 2018.
[8] Je-Keun Oh, Giho Jang, Semin Oh, Jeong Ho Lee, Byung-Ju Yi, Young Shik Moon, Jong Seh Lee, and Youngjin Choi. Bridge inspection robot system with machine vision. Automation in Construction, 18(7):929–941, 2009.
[10] Spencer Gibb, Tuan Le, Hung Manh La, Ryan Schmid, and Tony Berendsen. A multi-functional inspection robot for civil infrastructure evaluation and maintenance. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2672–2677. IEEE, 2017.
[11] Fábio Celestino Pereira and Carlos Eduardo Pereira. Embedded image processing systems for automatic recognition of cracks using UAVs. IFAC-PapersOnLine, 48(10):16–21, 2015.
[12] David Mader, Robert Blaskow, Patrick Westfeld, and Cornell Weller. Potential of UAV-based laser scanner and multispectral camera data in building inspection. International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences, 41, 2016.
[13] David H Hubel and Torsten N Wiesel. Receptive fields and functional architecture of monkey striate cortex. The Journal of Physiology, 195(1):215–243, 1968.
[15] Christopher M Bishop et al. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[16] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.
[17] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10):1995, 1995.
[18] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[19] Dan Claudiu Cireşan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber. Deep, big, simple neural nets for handwritten digit recognition. Neural Computation, 22(12):3207–3220, 2010.
[20] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[21] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[22] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497, 2015.
[23] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[24] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[25] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7263–7271, 2017.
[26] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[27] Marcel Neuhausen and Markus König. Automatic window detection in facade images. Automation in Construction, 96:527–539, 2018.
[28] Hoang Nhat-Duc, Quoc-Lam Nguyen, and Van-Duc Tran. Automatic recognition of asphalt pavement cracks using metaheuristic optimized edge detection algorithms and convolution neural network. Automation in Construction, 94:203–213, 2018.
[29] Jun Zhang, Xing Yang, Weiguang Li, Shaobo Zhang, and Yunyi Jia. Automatic detection of moisture damages in asphalt pavements from GPR data with deep CNN and IRS method. Automation in Construction, 113:103119, 2020.
[31] Seth Hutchinson, Gregory D Hager, and Peter I Corke. A tutorial on visual servo control. IEEE Transactions on Robotics and Automation, 12(5):651–670, 1996.
[32] Cristian Pop, Sanda M Grigorescu, and Arjana Davidescu. Colored object detection algorithm for visual-servoing application. In 2012 13th International Conference on Optimization of Electrical and Electronic Equipment (OPTIM), pages 1539–1544. IEEE, 2012.
[33] Ying Wang, Guan-lu Zhang, Haoxiang Lang, Bashan Zuo, and Clarence W De Silva. A modified image-based visual servo controller with hybrid camera configuration for robust robotic grasping. Robotics and Autonomous Systems, 62(10):1398–1407, 2014.
[34] Hesheng Wang, Dejun Guo, Xinwu Liang, Weidong Chen, Guoqiang Hu, and Kam K Leang. Adaptive vision-based leader–follower formation control of mobile robots. IEEE Transactions on Industrial Electronics, 64(4):2893–2902, 2016.
[35] Pablo Ramon-Soria, Begoña C Arrue, and Anibal Ollero. Grasp planning and visual servoing for an outdoors aerial dual manipulator. Engineering, 6(1):77–88, 2020.
[36] Jingshu Liu and Yuan Li. An image based visual servo approach with deep learning for robotic manipulation. arXiv preprint arXiv:1909.07727, 2019.
[37] Konrad Ahlin, Benjamin Joffe, Ai-Ping Hu, Gary McMurray, and Nader Sadegh. Autonomous leaf picking using deep learning and visual-servoing. IFAC-PapersOnLine, 49(16):177–183, 2016.
[39] Benjamin Joffe, Konrad Ahlin, Ai-Ping Hu, and Gary McMurray. Vision-guided robotic leaf picking. EasyChair Preprint, 250:1–6, 2018.
[40] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[41] Ali Anwar, Weiyang Lin, Xiaoke Deng, Jianbin Qiu, and Huijun Gao. Quality inspection of remote radio units using depth-free image-based visual servo with acceleration command. IEEE Transactions on Industrial Electronics, 66(10):8214–8223, 2018.
[42] Shiyao Cai, Zhiliang Ma, Miroslaw J Skibniewski, and Song Bao. Construction automation and robotics for high-rise buildings over the past decades: A comprehensive review. Advanced Engineering Informatics, 42:100989, 2019.
[43] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[44] Richard Hartley and Andrew Zisserman. Camera Models, pages 153–177. Cambridge University Press, 2nd edition, 2004. doi: 10.1017/CBO9780511811685.010.
[45] Lee E Weiss, Arthur C Sanderson, and Charles P Neuman. Dynamic sensor-based control of robots with visual feedback. IEEE Journal on Robotics and Automation, 3(5):404–417, 1987.
[46] Rafael Kelly, Ricardo Carelli, Oscar Nasisi, Benjamín Kuchen, and Fernando Reyes. Stable visual servoing of camera-in-hand robotic systems. IEEE/ASME Transactions on Mechatronics, 5(1):39–48, 2000.
[47] Chien-Chern Cheah, De Qun Wang, and Yeow Cheng Sun. Region-reaching control of robots. IEEE Transactions on Robotics, 23(6):1260–1264, 2007.
[49] Jean-Jacques E Slotine, Weiping Li, et al. Applied Nonlinear Control, volume 199. Prentice Hall, Englewood Cliffs, NJ, 1991.
[50] Odd O Aalen. A linear regression model for the analysis of life times. Statistics in Medicine, 8(8):907–925, 1989.