Report V1.1.4.final
Mechanical Engineering
Special Project Report
01/10/2023
1. Introduction
Figure 1 illustrates the proposed algorithm for the task. The architecture is
composed of two main stages. In the first stage, the fixed-camera image is analyzed in
search of the object of interest in the workspace using a pre-trained CNN. The neural
network returns a set of possible targets with their confidence scores and bounding-box
corner coordinates. The center coordinates of the highest-confidence target are converted
into real-world Cartesian robot coordinates, which are subsequently sent to the manipulator
through Ethernet communication. The second stage begins once the end effector is brought
close to the object, with the in-hand camera detecting the orientation of the ChAruco
board on which the object stands. This orientation is converted to roll-pitch-yaw
angles and is used to position the end effector normal to the ChAruco board surface.
Finally, the ChAruco board itself is used to acquire the depth of the target, and the object
position is detected by a CNN model, allowing the robot's gripper to grasp the target.
In the first stage of the algorithm, the approach to the target relies on object
detection techniques. Object detection is the task of detecting instances of
objects of a specific class within an image or video [1]. It locates the objects present in an
image and encloses each of them in a bounding box with its corresponding class label
attached.
Object detection algorithms combine two tasks: image classification and object
localization.
Image classification algorithms predict the class or type of an object in an image
based on a predefined set of classes on which the algorithm was previously trained. For
example, given an image containing a single object as input, as seen in Figure 2, the output
is the class label of that object together with the probability of the prediction.
Figure 2. Differences between image classification, object localization, and object detection,
respectively.
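To make this distinction concrete, the short snippet below contrasts the kind of output an image classifier produces with the output of an object detector. The labels, scores, and box coordinates are invented purely for illustration and do not come from the trained model described later.

```python
# Invented example outputs, for illustration only.

# Image classification: one label and its probability for the whole image.
classification_output = {"label": "bottle", "probability": 0.94}

# Object detection: every instance found, each with a class, a confidence score,
# and a bounding box given by its corner coordinates (x1, y1, x2, y2) in pixels.
detection_output = [
    {"label": "bottle", "confidence": 0.91, "bbox": (112, 80, 198, 260)},
    {"label": "cup",    "confidence": 0.87, "bbox": (240, 150, 310, 235)},
]
```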
The problem of detecting and localizing the object can be solved using object
detection algorithms such as R-CNN [2], Fast R-CNN [3] or YOLO [4]. In the present
work, a variation of the YOLO network is employed to perform this task. YOLO stands
for You Only Look Once and is one of the most popular models used in object detection
and computer vision. The algorithm uses a neural-network-based approach to make
predictions on the input images, achieving high accuracy while running faster than
other approaches.
YOLO combines the separate components of object detection into a single neural
network. The network predicts each bounding box using features from the entire image.
Additionally, it simultaneously predicts all bounding boxes for an image across all
classes. This implies that the network considers the entire image and all its objects when
making decisions. The YOLO design maintains excellent average precision while
enabling end-to-end training and real-time speeds. The system divides the input image
into an S × S grid. If the center of an object falls within a grid cell, that cell is
responsible for detecting the object. Each grid cell predicts B bounding boxes and their
corresponding confidence scores. These confidence scores reflect how confident the
model is that the box contains an object and how accurate it believes the predicted box
to be. The confidence score should be zero if there is no object present in that cell.
Otherwise, the desired confidence score is given by the intersection over union (IoU)
between the predicted box and the ground truth (Figure 3). A simplified diagram of the
overall process can be seen in Figure 4 [4].
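Since the confidence target is defined through the IoU, a minimal Python sketch of the IoU computation between two axis-aligned boxes is given below; the (x1, y1, x2, y2) corner format and the example coordinates are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) corners."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: a predicted box compared against a ground-truth box
print(iou((50, 50, 150, 150), (60, 60, 170, 160)))  # approximately 0.63
```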
After labeling, the dataset was divided into 324 images for training and 40 images
for validation. The training was carried out using the YOLOv5 custom training
notebook available in Google Colab [5]. The performance of the trained model is
measured by mAP, or mean Average Precision. mAP is the average of the Average
Precision metric across all classes in a model; it can be used to compare both different
models on the same task and different versions of the same model, and it ranges between
0 and 1 [8]. The following chart summarizes the results of our model.
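As a reminder of how this metric is assembled before presenting the results, the short sketch below averages per-class Average Precision values into a single mAP score; the class names and AP values are invented for illustration and are not the values reported in the chart.

```python
# Invented per-class Average Precision values, for illustration only.
average_precision = {"bottle": 0.92, "cup": 0.88, "box": 0.81}

# mAP is simply the mean of the per-class AP values, and lies between 0 and 1.
mAP = sum(average_precision.values()) / len(average_precision)
print(f"mAP = {mAP:.3f}")  # mAP = 0.870
```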
The model was loaded using the method in [9], which retrieves the detected targets
with their class, confidence score, and bounding-box corner coordinates.
With the coordinates obtained from the model, we were able to draw the bounding
boxes of the objects and roughly locate the centroid of each object in pixels. The
centroid position is then transformed into real-world coordinates with a technique
described later in this paper. An overall visual representation of the data obtained can be
seen in the figure below.
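The report does not reproduce the loading code itself; the sketch below assumes the PyTorch Hub interface commonly used to load custom YOLOv5 weights (the weights file name best.pt and the image path are placeholders) and shows how the bounding-box corners of the best detection can be turned into a pixel centroid.

```python
import torch

# Load custom-trained YOLOv5 weights via PyTorch Hub (weights file name is a placeholder).
model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")

results = model("workspace_frame.jpg")   # run inference on the fixed-camera image
detections = results.xyxy[0]             # columns: x1, y1, x2, y2, confidence, class

if len(detections):
    # Keep the highest-confidence detection and compute its centroid in pixels.
    best = detections[detections[:, 4].argmax()]
    x1, y1, x2, y2, conf, cls = best.tolist()
    u, v = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    print(f"class {int(cls)}, confidence {conf:.2f}, centroid ({u:.1f}, {v:.1f}) px")
```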
Once the center of the object of interest is detected in the first stage, the pixel
coordinates given by the fixed camera need to be converted into real-world
measurements in robot coordinates. The projection from 3D points in the world to 2D
points in the image plane of a camera can be represented as
$$
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} =
\begin{bmatrix} f_x & 0 & p_x & 0 \\ 0 & f_y & p_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} R_W^C & t_W^C \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} X_W \\ Y_W \\ Z_W \\ 1 \end{bmatrix}
\qquad (1)
$$
where $u$ and $v$ are the pixel coordinates given by the camera, and $f_x$, $f_y$, $p_x$, and $p_y$
are the focal lengths of the camera along the x-axis and y-axis and the coordinates of the
principal point along the x-axis and y-axis, respectively. All the parameters inside this
matrix are called the intrinsic camera parameters and are known from a previous camera
calibration using the method explained in [10]. The matrix
$$
\begin{bmatrix} R_W^C & t_W^C \\ 0 & 1 \end{bmatrix}
$$
represents the extrinsic camera parameters, being a linear transformation that maps
points expressed in the world frame into the camera frame. This transformation is
obtained by solving the hand-eye calibration problem,
$$
\bigl(T_g^W(2)\bigr)^{-1}\, T_g^W(1)\, T_c^g \;=\; T_c^g\, T_t^c(2)\, \bigl(T_t^c(1)\bigr)^{-1} \qquad (2)
$$
which has the classical form $AX = XB$.
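The report does not show how equation (2) was solved in code; one common option is OpenCV's calibrateHandEye routine, sketched below under the assumption that several gripper poses (from the robot controller) and the corresponding ChAruco board poses (from the camera) have already been collected.

```python
import cv2

def solve_hand_eye(R_gripper2base, t_gripper2base, R_target2cam, t_target2cam):
    """Solve the AX = XB problem of equation (2) for the camera pose in the gripper frame.

    Inputs are lists of 3x3 rotation matrices and 3x1 translation vectors collected at
    several robot poses: gripper poses in the robot base frame (from the controller) and
    ChAruco board poses in the camera frame (from PnP).
    """
    R_cam2gripper, t_cam2gripper = cv2.calibrateHandEye(
        R_gripper2base, t_gripper2base, R_target2cam, t_target2cam,
        method=cv2.CALIB_HAND_EYE_TSAI,
    )
    return R_cam2gripper, t_cam2gripper
```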
Once the manipulator brings the hand-eye camera close to the target, the ChAruco
board below the target is detected and its pose is estimated using Perspective-n-Point
(PnP). Since the object of interest can be at any point on the ChAruco surface, and its
frame orientation is at fixed angles with respect to the ChAruco board, we are only
interested in the surface orientation of the board; the object position is acquired with our
trained CNN model. Once the ChAruco orientation is obtained by solving the PnP
problem, we acquire the rotation between the hand-eye camera frame and the target
frame, $R_C^T$. We transform this rotation to relate the gripper (end-effector) frame with the
target frame through the expression $R_G^T = R_C^T R_G^C$. Since we cannot move the robot in
end-effector frames, we have to apply the so-called similarity transform, which allows a
given linear transformation expressed in the camera frame to be converted into the same
linear transformation in the robot world frame; the similarity transform is expressed in (3).
Once the rotation matrix that positions the gripper normal to the surface of the ChAruco
board is obtained, we parametrize it with the roll-pitch-yaw representation.
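The detection code is not listed in the report; the sketch below assumes the classic cv2.aruco interface from opencv-contrib-python, with placeholder board dimensions, dictionary, and camera parameters, to estimate the board pose via PnP and convert the resulting rotation into roll-pitch-yaw angles.

```python
import cv2
import numpy as np

# Placeholder board definition and calibration data (not the values used in the project).
dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
board = cv2.aruco.CharucoBoard_create(5, 7, 0.04, 0.03, dictionary)  # sizes in metres
camera_matrix = np.array([[615.0, 0, 320.0], [0, 615.0, 240.0], [0, 0, 1]], dtype=float)
dist_coeffs = np.zeros(5)

def charuco_rpy(gray_image):
    """Estimate the ChAruco board pose (PnP) and return roll-pitch-yaw angles in radians."""
    corners, ids, _ = cv2.aruco.detectMarkers(gray_image, dictionary)
    if ids is None:
        raise RuntimeError("No ArUco markers detected")
    _, ch_corners, ch_ids = cv2.aruco.interpolateCornersCharuco(corners, ids, gray_image, board)
    ok, rvec, tvec = cv2.aruco.estimatePoseCharucoBoard(
        ch_corners, ch_ids, board, camera_matrix, dist_coeffs, None, None)
    if not ok:
        raise RuntimeError("ChAruco board pose could not be estimated")
    R, _ = cv2.Rodrigues(rvec)  # rotation of the board expressed in the camera frame
    roll = np.arctan2(R[2, 1], R[2, 2])
    pitch = np.arctan2(-R[2, 0], np.hypot(R[2, 1], R[2, 2]))
    yaw = np.arctan2(R[1, 0], R[0, 0])
    return roll, pitch, yaw
```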
Once the gripper is normal to the surface of the ChAruco, as shown in Figure 10,
the CNN model is used again to detect the position of the target in pixels. Using the
previously introduced method, the same mapping is done from (u, v) pixel coordinates
into $(X_C, Y_C, Z_C)$ camera coordinates, where the $Z_C$ component is extracted from the
ChAruco board. The linear transformation from camera coordinates into world
coordinates is now used, as shown below:
$$P^W = T_C^W P^C$$
$$P^W = T_G^W T_C^G P^C$$
where $P^C$ represents the camera coordinates of the target, $P^W$ the world coordinates of
the target, and $T_C^G$ is the homogeneous transformation from camera coordinates to
end-effector coordinates, which is known after solving the hand-eye calibration problem.
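As an illustration of this mapping, the sketch below back-projects a pixel with known depth into camera coordinates using the intrinsic parameters of equation (1) and then applies the homogeneous transformations into the robot world frame; all numerical values and the identity transforms are placeholders, not the project's calibration results.

```python
import numpy as np

# Placeholder intrinsic parameters (fx, fy, px, py) from a prior calibration.
fx, fy, px, py = 615.0, 615.0, 320.0, 240.0

def pixel_to_camera(u, v, Zc):
    """Back-project pixel (u, v) with known depth Zc into homogeneous camera coordinates."""
    Xc = (u - px) * Zc / fx
    Yc = (v - py) * Zc / fy
    return np.array([Xc, Yc, Zc, 1.0])

# Placeholder homogeneous transforms: end-effector to world (from the robot pose)
# and camera to end-effector (from the hand-eye calibration).
T_W_G = np.eye(4)
T_G_C = np.eye(4)

P_C = pixel_to_camera(410.0, 255.0, 0.42)  # depth Zc taken from the ChAruco board, in metres
P_W = T_W_G @ T_G_C @ P_C                  # P^W = T_G^W T_C^G P^C, world coordinates of the target
print(P_W[:3])
```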
Figure 10. Orientation of the gripper normal to the surface and reposition towards the object of interest.
Since the orientation and coordinates of the object are now known, the object
can be grasped and moved without much trouble.
The computer and the iRX6 digital servo controller (robot controller) are linked
together by Ethernet communication. The computer runs a Python program that
implements the object detection model and the 2D pose estimation algorithm, and
interacts with external devices such as the fixed and hand-eye cameras, the 6-DOF robot
manipulator, and the gripper. The iRX6 receives commands or messages from the
computer to drive the manipulator towards the target with the desired pose to perform
the grasp. Figure 11 shows the overall interface between the computer and the robot
controller.
Figure 11. Computer-Robot Interface Schematic diagram.
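The report does not specify the message format used over this link; purely as an illustrative sketch, the snippet below sends a Cartesian target pose to the controller over a TCP socket, with the IP address, port, packet layout, and units all assumed for the example rather than taken from the iRX6 documentation.

```python
import socket
import struct

# Assumed controller address and a hypothetical packet layout of six floats:
# X, Y, Z (mm) followed by roll, pitch, yaw (degrees). The real iRX6 protocol may differ.
ROBOT_IP, ROBOT_PORT = "192.168.0.10", 5000

def send_target_pose(x, y, z, roll, pitch, yaw):
    """Send one Cartesian target pose to the robot controller over TCP."""
    payload = struct.pack("<6f", x, y, z, roll, pitch, yaw)
    with socket.create_connection((ROBOT_IP, ROBOT_PORT), timeout=2.0) as sock:
        sock.sendall(payload)

# Example: approach pose computed from the fixed-camera detection.
send_target_pose(350.0, 120.0, 80.0, 180.0, 0.0, 90.0)
```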
3. Results
As seen in the figure above, the manipulator initially waited for instructions from
the computer. Once the image from the fixed camera was processed and the object was
detected, its robot coordinates were acquired and sent to the robot servo controller to
approach the target. Immediately after, the ChAruco board pose was detected from the
in-hand camera image and used to orient the end effector. Then, the 3D position of the
object was determined, with its depth acquired by solving the PnP problem using
landmarks from the ChAruco board. With the pose of the object defined, the manipulator
was able to grasp the target and place it at the desired position. Finally, the robot arm
returned to its initial position, ending the process.
4. Conclusion
The employment of the ChAruco board alongside CNN object detection proved
feasible for performing pick-and-place tasks. However, the sequence of motions is not
yet smooth enough to compete with methods that use depth-sensing devices, such as the
one in [13].
In the near future we hope to extend the method presented in this project so that it
requires less intrusive landmarks near the target, enhancing its versatility and range of
applications while retaining its performance and low cost.
References
[4] J. Redmon, S. Divvala, R. Girshick and A. Farhadi, "You only look once:
Unified, real-time object detection," 2016 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2016.
[10] Z. Zhang, "A flexible new technique for camera calibration," IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 11,
2000.
[11] G. An, S. Lee, M.-W. Seo, K. Yun, W.-S. Cheong and S.-J. Kang,
"Charuco board-based omnidirectional camera calibration method,"
Electronics, vol. 7, no. 12, 2018.
[13] T.-T. Le, T.-S. Le, Y.-R. Chen, J. Vidal and C.-Y. Lin, "6D pose estimation
with combined deep learning and 3D vision techniques for a fast and accurate
object grasping," Robotics and Autonomous Systems, vol. 141, 2021.