Report V1.1.4.final
Mechanical Engineering
Special Project Report
01/10/2023
1. Introduction
Figure 1 illustrates the proposed algorithm for the task. The architecture is
composed of two main stages. In the first stage, the fixed-camera image is analyzed in
search of the object of interest in the workspace using a pre-trained CNN. The neural
network returns a set of possible targets with their confidence scores and bounding-box
corner coordinates. The center coordinates of the highest-confidence target are converted
into real-world Cartesian robot coordinates, which are subsequently sent to the manipulator
through Ethernet communication. The second stage begins once the end effector is brought
close to the object, with the in-hand camera detecting the orientation of the ChAruco
board on which the object stands. This orientation is converted to roll-pitch-yaw
angles and is used to position the end effector normal to the ChAruco board surface.
Finally, the ChAruco board itself is used to acquire the depth of the target, and the object
position is detected by a CNN model, allowing the robot's gripper to grasp the target.
In the first stage of the algorithm, the approach to the target relies on object
detection techniques. Object detection is the task of detecting instances of
objects of a specific class within an image or video [1]. It locates the objects present in an
image and encloses each of them in a bounding box with its corresponding class label
attached.
Object detection algorithms combine two tasks: image classification and object
localization.
Image classification algorithms predict the class or type of an object in an image
based on a predefined set of classes on which the algorithm was previously trained. For
example, given an image containing a single object as input, as seen in Figure 2, the output
is the class label of that object together with the probability of the prediction.
Figure 2. Differences between image classification, object localization, and object detection,
respectively.
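To make this distinction concrete, the short snippet below contrasts the kind of output an image classifier produces with the output of an object detector. The labels, scores, and box coordinates are invented purely for illustration and do not come from the trained model described later.

```python
# Invented example outputs, for illustration only.

# Image classification: one label and its probability for the whole image.
classification_output = {"label": "bottle", "probability": 0.94}

# Object detection: every instance found, each with a class, a confidence score,
# and a bounding box given by its corner coordinates (x1, y1, x2, y2) in pixels.
detection_output = [
    {"label": "bottle", "confidence": 0.91, "bbox": (112, 80, 198, 260)},
    {"label": "cup",    "confidence": 0.87, "bbox": (240, 150, 310, 235)},
]
```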
The problem of detecting and localizing the object can be solved using object
detection algorithms such as R-CNN [2], Fast R-CNN [3] or YOLO [4]. In the present
work, a variation of the YOLO network is employed to perform this task. YOLO stands
for You Only Look Once and is one of the most popular models used in object detection
and computer vision. The algorithm uses a neural-network-based approach to make
predictions on the input images, achieving high accuracy while running faster than
other approaches.
YOLO combines the separate components of object detection into a single neural
network. The network predicts each bounding box using features from the entire image.
Additionally, it simultaneously predicts all bounding boxes for an image across all
classes. This implies that the network considers the entire image and all its objects when
making decisions. The YOLO design maintains excellent average precision while
enabling end-to-end training and real-time speeds. The system divides the input image
into an S × S grid. If the center of an object falls within a grid cell, that cell is
responsible for detecting the object. Each grid cell predicts B bounding boxes and their
corresponding confidence scores. These confidence scores reflect how confident the
model is that the box contains an object and how accurate it believes the predicted box
to be. The confidence score should be zero if there is no object present in that cell.
Otherwise, the desired confidence score is given by the intersection over union (IoU)
between the predicted box and the ground truth (Figure 3). A simplified diagram of the
overall process can be seen in Figure 4 [4].
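Since the confidence target is defined through the IoU, a minimal Python sketch of the IoU computation between two axis-aligned boxes is given below; the (x1, y1, x2, y2) corner format and the example coordinates are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) corners."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: a predicted box compared against a ground-truth box
print(iou((50, 50, 150, 150), (60, 60, 170, 160)))  # approximately 0.63
```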
After labeling, the dataset was divided into 324 images for training and 40 images
for validation. The training was carried out using the YOLOv5 custom training
notebook available in Google Colab [5]. The performance of the trained model is
measured by mAP, or mean Average Precision. mAP is the average of the Average
Precision metric across all classes in a model; it can be used to compare both different
models on the same task and different versions of the same model, and it ranges between
0 and 1 [8]. The following chart summarizes the results of our model.
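As a reminder of how this metric is assembled before presenting the results, the short sketch below averages per-class Average Precision values into a single mAP score; the class names and AP values are invented for illustration and are not the values reported in the chart.

```python
# Invented per-class Average Precision values, for illustration only.
average_precision = {"bottle": 0.92, "cup": 0.88, "box": 0.81}

# mAP is simply the mean of the per-class AP values, and lies between 0 and 1.
mAP = sum(average_precision.values()) / len(average_precision)
print(f"mAP = {mAP:.3f}")  # mAP = 0.870
```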
The model was loaded using the method in [9], which retrieves the detected targets
with their class, confidence score, and bounding-box corner coordinates.
With the coordinates obtained from the model, we were able to draw the bounding
boxes of the objects and roughly locate the centroid of each object in pixels. The
centroid position is then transformed into real-world coordinates with a technique
described later in this paper. An overall visual representation of the data obtained can be
seen in the figure below.
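The report does not reproduce the loading code itself; the sketch below assumes the PyTorch Hub interface commonly used to load custom YOLOv5 weights (the weights file name best.pt and the image path are placeholders) and shows how the bounding-box corners of the best detection can be turned into a pixel centroid.

```python
import torch

# Load custom-trained YOLOv5 weights via PyTorch Hub (weights file name is a placeholder).
model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")

results = model("workspace_frame.jpg")   # run inference on the fixed-camera image
detections = results.xyxy[0]             # columns: x1, y1, x2, y2, confidence, class

if len(detections):
    # Keep the highest-confidence detection and compute its centroid in pixels.
    best = detections[detections[:, 4].argmax()]
    x1, y1, x2, y2, conf, cls = best.tolist()
    u, v = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    print(f"class {int(cls)}, confidence {conf:.2f}, centroid ({u:.1f}, {v:.1f}) px")
```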
Once the center of the object of interest is detected in the first stage, the pixel
coordinates given by the fixed camera need to be converted into real-world
measurements in robot coordinates. The projection from 3D points in the world to 2D
points in the image plane of a camera can be represented as
$$
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} =
\begin{bmatrix} f_x & 0 & p_x & 0 \\ 0 & f_y & p_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} R_W^C & t_W^C \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} X_W \\ Y_W \\ Z_W \\ 1 \end{bmatrix}
\qquad (1)
$$
where $u$ and $v$ are the pixel coordinates given by the camera, and $f_x$, $f_y$, $p_x$, and $p_y$
are the focal lengths of the camera along the x-axis and y-axis and the coordinates of the
principal point along the x-axis and y-axis, respectively. All the parameters inside this
matrix are called the intrinsic camera parameters and are known from a previous camera
calibration using the method explained in [10]. The matrix
$$
\begin{bmatrix} R_W^C & t_W^C \\ 0 & 1 \end{bmatrix}
$$
represents the extrinsic camera parameters, being a linear transformation that maps
points expressed in the world frame into the camera frame. This transformation is
obtained by solving the hand-eye calibration problem,
$$
\bigl(T_g^W(2)\bigr)^{-1}\, T_g^W(1)\, T_c^g \;=\; T_c^g\, T_t^c(2)\, \bigl(T_t^c(1)\bigr)^{-1} \qquad (2)
$$
which has the classical form $AX = XB$.
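The report does not show how equation (2) was solved in code; one common option is OpenCV's calibrateHandEye routine, sketched below under the assumption that several gripper poses (from the robot controller) and the corresponding ChAruco board poses (from the camera) have already been collected.

```python
import cv2

def solve_hand_eye(R_gripper2base, t_gripper2base, R_target2cam, t_target2cam):
    """Solve the AX = XB problem of equation (2) for the camera pose in the gripper frame.

    Inputs are lists of 3x3 rotation matrices and 3x1 translation vectors collected at
    several robot poses: gripper poses in the robot base frame (from the controller) and
    ChAruco board poses in the camera frame (from PnP).
    """
    R_cam2gripper, t_cam2gripper = cv2.calibrateHandEye(
        R_gripper2base, t_gripper2base, R_target2cam, t_target2cam,
        method=cv2.CALIB_HAND_EYE_TSAI,
    )
    return R_cam2gripper, t_cam2gripper
```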
Once the manipulator brings the hand-eye camera close to the target, the ChAruco
board below the target is detected and its pose is estimated using Perspective-n-Point
(PnP). Since the object of interest can be at any point on the ChAruco surface, and its
frame orientation is at fixed angles with respect to the ChAruco board, we are only
interested in the surface orientation of the board; the object position is acquired with our
trained CNN model. Once the ChAruco orientation is obtained by solving the PnP
problem, we acquire the rotation between the hand-eye camera frame and the target
frame, $R_C^T$. We transform this rotation to relate the gripper (end-effector) frame with the
target frame through the expression $R_G^T = R_C^T R_G^C$. Since we cannot move the robot in
end-effector frames, we have to apply the so-called similarity transform, which allows a
given linear transformation expressed in the camera frame to be converted into the same
linear transformation in the robot world frame; the similarity transform is expressed in (3).
Once the rotation matrix that positions the gripper normal to the surface of the ChAruco
board is obtained, we parametrize it with the roll-pitch-yaw representation.
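The detection code is not listed in the report; the sketch below assumes the classic cv2.aruco interface from opencv-contrib-python, with placeholder board dimensions, dictionary, and camera parameters, to estimate the board pose via PnP and convert the resulting rotation into roll-pitch-yaw angles.

```python
import cv2
import numpy as np

# Placeholder board definition and calibration data (not the values used in the project).
dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
board = cv2.aruco.CharucoBoard_create(5, 7, 0.04, 0.03, dictionary)  # sizes in metres
camera_matrix = np.array([[615.0, 0, 320.0], [0, 615.0, 240.0], [0, 0, 1]], dtype=float)
dist_coeffs = np.zeros(5)

def charuco_rpy(gray_image):
    """Estimate the ChAruco board pose (PnP) and return roll-pitch-yaw angles in radians."""
    corners, ids, _ = cv2.aruco.detectMarkers(gray_image, dictionary)
    if ids is None:
        raise RuntimeError("No ArUco markers detected")
    _, ch_corners, ch_ids = cv2.aruco.interpolateCornersCharuco(corners, ids, gray_image, board)
    ok, rvec, tvec = cv2.aruco.estimatePoseCharucoBoard(
        ch_corners, ch_ids, board, camera_matrix, dist_coeffs, None, None)
    if not ok:
        raise RuntimeError("ChAruco board pose could not be estimated")
    R, _ = cv2.Rodrigues(rvec)  # rotation of the board expressed in the camera frame
    roll = np.arctan2(R[2, 1], R[2, 2])
    pitch = np.arctan2(-R[2, 0], np.hypot(R[2, 1], R[2, 2]))
    yaw = np.arctan2(R[1, 0], R[0, 0])
    return roll, pitch, yaw
```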
Once the gripper is normal to the surface of the ChAruco, as shown in Figure 10,
the CNN model is used again to detect the position of the target in pixels. Using the
previously introduced method, the same mapping is done from (u, v) pixel coordinates
into $(X_C, Y_C, Z_C)$ camera coordinates, where the $Z_C$ component is extracted from the
ChAruco board. The linear transformation from camera coordinates into world
coordinates is now used, as shown below:
$$P^W = T_C^W P^C$$
$$P^W = T_G^W T_C^G P^C$$
where $P^C$ represents the camera coordinates of the target, $P^W$ the world coordinates of
the target, and $T_C^G$ is the homogeneous transformation from camera coordinates to
end-effector coordinates, which is known after solving the hand-eye calibration problem.
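As an illustration of this mapping, the sketch below back-projects a pixel with known depth into camera coordinates using the intrinsic parameters of equation (1) and then applies the homogeneous transformations into the robot world frame; all numerical values and the identity transforms are placeholders, not the project's calibration results.

```python
import numpy as np

# Placeholder intrinsic parameters (fx, fy, px, py) from a prior calibration.
fx, fy, px, py = 615.0, 615.0, 320.0, 240.0

def pixel_to_camera(u, v, Zc):
    """Back-project pixel (u, v) with known depth Zc into homogeneous camera coordinates."""
    Xc = (u - px) * Zc / fx
    Yc = (v - py) * Zc / fy
    return np.array([Xc, Yc, Zc, 1.0])

# Placeholder homogeneous transforms: end-effector to world (from the robot pose)
# and camera to end-effector (from the hand-eye calibration).
T_W_G = np.eye(4)
T_G_C = np.eye(4)

P_C = pixel_to_camera(410.0, 255.0, 0.42)  # depth Zc taken from the ChAruco board, in metres
P_W = T_W_G @ T_G_C @ P_C                  # P^W = T_G^W T_C^G P^C, world coordinates of the target
print(P_W[:3])
```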
Figure 10. Orientation of the gripper normal to the surface and reposition towards the object of interest.
Since the orientation and coordinates of the object are now known, the object
can be grasped and moved without much trouble.
The computer and the iRX6 digital servo controller (robot controller) are linked
together by Ethernet communication. The computer runs a Python program that
implements the object detection model and the 2D pose estimation algorithm, and
interacts with external devices such as the fixed and hand-eye cameras, the 6-DOF robot
manipulator, and the gripper. The iRX6 receives commands or messages from the
computer to drive the manipulator towards the target with the desired pose to perform
the grasp. Figure 11 shows the overall interface between the computer and the robot
controller.
Figure 11. Computer-Robot Interface Schematic diagram.
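The report does not specify the message format used over this link; purely as an illustrative sketch, the snippet below sends a Cartesian target pose to the controller over a TCP socket, with the IP address, port, packet layout, and units all assumed for the example rather than taken from the iRX6 documentation.

```python
import socket
import struct

# Assumed controller address and a hypothetical packet layout of six floats:
# X, Y, Z (mm) followed by roll, pitch, yaw (degrees). The real iRX6 protocol may differ.
ROBOT_IP, ROBOT_PORT = "192.168.0.10", 5000

def send_target_pose(x, y, z, roll, pitch, yaw):
    """Send one Cartesian target pose to the robot controller over TCP."""
    payload = struct.pack("<6f", x, y, z, roll, pitch, yaw)
    with socket.create_connection((ROBOT_IP, ROBOT_PORT), timeout=2.0) as sock:
        sock.sendall(payload)

# Example: approach pose computed from the fixed-camera detection.
send_target_pose(350.0, 120.0, 80.0, 180.0, 0.0, 90.0)
```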
3. Results
As seen in the figure above, the manipulator initially waited for instructions from
the computer. Once the image from the fixed camera was processed and the object was
detected, its robot coordinates were acquired and sent to the robot servo controller to
approach the target. Immediately after, the ChAruco board pose was detected from the
in-hand camera image and used to orient the end effector. Then, the 3D position of the
object was determined, with its depth acquired by solving the PnP problem using
landmarks from the ChAruco board. With the pose of the object defined, the manipulator
was able to grasp the target and place it at the desired position. Finally, the robot arm
returned to its initial position, ending the process.
4. Conclusion
The employment of the ChAruco board alongside CNN object detection proved
feasible for performing pick-and-place tasks. However, the sequence of motions is not
yet smooth enough to compete with methods that use depth-sensing devices, such as the
one in [13].
In the near future we hope to extend the method presented in this project so that it
requires less intrusive landmarks near the target, enhancing its versatility and range of
applications while retaining its performance and low cost.
References
[4] J. Redmon, S. Divvala, R. Girshick and A. Farhadi, "You only look once:
Unified, real-time object detection," 2016 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2016.
[10] Z. Zhang, "A flexible new technique for camera calibration," IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 11,
2000.
[11] G. An, S. Lee, M.-W. Seo, K. Yun, W.-S. Cheong and S.-J. Kang,
"Charuco board-based omnidirectional camera calibration method,"
Electronics, vol. 7, no. 12, 2018.
[13] T.-T. Le, T.-S. Le, Y.-R. Chen, J. Vidal and C.-Y. Lin, "6D pose estimation
with combined deep learning and 3D vision techniques for a fast and accurate
object grasping," Robotics and Autonomous Systems, vol. 141, 2021.