
Object detection based on improved RT-DETR for human-robot collaboration manufacturing system


Haili Lv, Qi Xiang, Jinhua Xiao*
School of Transportation and Logistics Engineering, Wuhan University of Technology, Wuhan
430070, China
* [email protected]

ABSTRACT

In modern intelligent manufacturing, human-robot collaboration is essential for combining the advantages of robots and humans to facilitate mass-customized production. To improve the robot's understanding of the operator's movements and the working environment, an intelligent recognition method based on an improved RT-DETR is proposed. The method introduces CBAM modules before the multi-scale recognition layers to improve the recognition accuracy of the model. Meanwhile, to counter the increase in parameter count and computational complexity caused by the added CBAM modules, the Conv modules of the backbone network are replaced by GhostConv modules. Experimental results show that these two enhancements (CBAM and GhostConv) effectively improve the detection performance of the original RT-DETR model on the examined dataset.
Keywords: Human-robot collaboration, RT-DETR model, Detection performance, GhostConv method

1 INTRODUCTION
To meet the diverse demands of today's market more efficiently, modern companies strive for large-scale customized production [1]. Human-robot collaboration is an effective approach to this goal: it combines the robot's capability for repetitive, high-precision work with the human worker's advantage in executing less standardized, flexible and versatile operations [2], thus improving productivity. To enable robots to assist operators more proactively, researchers have used deep learning to recognize human gestures [3], postures [4], and voice. The recognition results are then fed back to the robots, enabling them to realign and plan their movements accordingly. However, current research often focuses solely on the worker, neglecting the crucial interaction between the worker and the objects being processed. It is important to note that most actions of the worker start with approaching and contacting an object. Therefore, recognizing the worker's contact with the object helps robots predict the worker's next movement, facilitating collaboration with the worker in an effective, efficient and safe way.
An intelligent monitoring method for human-robot collaboration is proposed in this paper. The remainder of this paper is organized as follows. Section 2 analyses the research problem addressed in this paper. Section 3 outlines the proposed methodology, which uses an improved RT-DETR algorithm to achieve real-time detection of the worker's hand and the object. Section 4 verifies the feasibility of the proposed method through experiments. A conclusion is given in Section 5.

2 RESEARCH PROBLEMS
As discussed above, the rapid development of mass customization has led to the widespread adoption of human-robot collaboration in production [6-8]. To enhance the intelligence and flexibility of robots, researchers are focusing on the automatic and intelligent recognition of operators' movements and intentions [5], with the aim of enhancing collaboration between robots and human workers. In recent years, deep learning has made significant progress in computer vision, providing new ways to study the actions and intentions of workers. However, using deep learning methods to give robots real-time awareness of workers' status faces several challenges. First, training the algorithm requires a large amount of data, yet datasets for specific scenarios and environments are limited. Second, accurate detection is not an easy task, as the objects to be identified have different shapes, sizes, and visual characteristics.



3 METHOD
3.1 RT-DETR network structure
To address the slow training convergence of the DETR model, Wenyu Lv et al. proposed the RT-DETR model, an end-to-end detector based on the Transformer [9]. The RT-DETR network comprises three main parts: the backbone, the neck, and the decoder. Fig. 1 shows the structure of the RT-DETR backbone network, which comprises four stages. The first stage includes the HGStem and HGBlock modules, with HGStem processing input features via convolution and max pooling. The following three stages comprise HGBlock and Conv modules, where the HGBlock module uses lightweight convolution to reduce parameters and computation time. Following feature extraction by the backbone network, the model obtains outputs at four scales, corresponding to 4x, 8x, 16x, and 32x downsampling of the input image, which enhances the model's multi-scale feature representation capability.
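To make the stage layout concrete, below is a minimal PyTorch sketch that reproduces only the four-scale output shapes; the plain strided convolutions are stand-ins for the actual HGStem/HGBlock internals, and the channel widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyBackbone(nn.Module):
    """Stand-in for the RT-DETR backbone: four stages whose outputs
    correspond to 4x, 8x, 16x and 32x downsampling of the input."""
    def __init__(self, channels=(64, 128, 256, 512)):
        super().__init__()
        # Stride-4 stem (placeholder for HGStem: convolution + pooling).
        self.stem = nn.Sequential(
            nn.Conv2d(3, channels[0], 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels[0], channels[0], 3, stride=2, padding=1), nn.ReLU())
        # Three further stages (placeholders for HGBlock + Conv), strides 8/16/32.
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels[i], channels[i + 1], 3, stride=2, padding=1),
                          nn.ReLU())
            for i in range(3)])

    def forward(self, x):
        feats = [self.stem(x)]
        for stage in self.stages:
            feats.append(stage(feats[-1]))
        return feats                          # four scales; the last is S5

for f in ToyBackbone()(torch.randn(1, 3, 640, 640)):
    print(tuple(f.shape))                     # spatial sizes 160, 80, 40, 20
```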
To obtain more effective information, the RT-DETR neck performs intra-scale interaction only on S5, the backbone output with the richest semantic information. The neck comprises the Attention-based Intra-scale Feature Interaction (AIFI) module and the CNN-based Cross-scale Feature-fusion Module (CCFM). The former consists of multi-head self-attention (MHSA) and a feed-forward network (FFN), which automatically capture dependencies between the input vectors to better exploit contextual information; the latter fuses multi-scale features. In the neck, S5 from the backbone is first flattened into a sequence of vectors that serves as input to the AIFI module, whose output is then reshaped back into a two-dimensional feature map F5. F5 is used as input to the subsequent CCFM module to complete the cross-scale feature fusion.
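As an illustration, here is a minimal sketch of the AIFI flatten-attend-reshape step, assuming a standard PyTorch transformer encoder layer as the MHSA+FFN block; the real AIFI additionally adds 2D positional encodings, which are omitted here.

```python
import torch
import torch.nn as nn

class AIFISketch(nn.Module):
    """Flatten the stride-32 map S5 into tokens, run one MHSA+FFN
    encoder layer, and reshape the result back into a 2D map F5."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=1024, batch_first=True)

    def forward(self, s5):                             # s5: (B, C, H, W)
        b, c, h, w = s5.shape
        tokens = s5.flatten(2).permute(0, 2, 1)        # (B, H*W, C): one token per pixel
        tokens = self.encoder(tokens)                  # intra-scale self-attention + FFN
        return tokens.permute(0, 2, 1).reshape(b, c, h, w)  # F5, same shape as S5

f5 = AIFISketch()(torch.randn(1, 256, 20, 20))         # -> (1, 256, 20, 20)
```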

Figure 1. Backbone of RT-DETR.


3.2 GhostConv
To obtain redundant features while consuming fewer computational resources, Kai Han et al. proposed GhostNet [10]. The GhostConv module is the central component of GhostNet. It extracts features from the input in two stages. First, it uses a small number of convolution kernels to produce a reduced set of feature maps, lowering the computational complexity. Then, the feature maps obtained in the first stage undergo identity mapping and cheap linear transformations before being concatenated along the channel dimension to form the final output feature maps. GhostConv requires fewer parameters and less computation than traditional convolution, even when the number of output feature maps is the same. In conclusion, replacing the Conv modules with GhostConv modules yields a more lightweight network and reduces the training effort.
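The two-stage computation can be sketched as follows; the half/half channel split, the 5x5 depthwise kernel for the cheap transform, and the SiLU activation are common implementation choices, not details stated in this paper.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Sketch of GhostConv: a cheap primary convolution produces half the
    output channels; depthwise 'cheap operations' generate the other half."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        c_hidden = c_out // 2
        # Stage 1: a small number of ordinary convolution kernels.
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_hidden, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_hidden), nn.SiLU())
        # Stage 2: cheap linear transform (depthwise convolution).
        self.cheap = nn.Sequential(
            nn.Conv2d(c_hidden, c_hidden, 5, 1, 2, groups=c_hidden, bias=False),
            nn.BatchNorm2d(c_hidden), nn.SiLU())

    def forward(self, x):
        y = self.primary(x)
        # Concatenate primary and 'ghost' maps along the channel dimension.
        return torch.cat([y, self.cheap(y)], dim=1)

out = GhostConv(64, 128, k=3, s=2)(torch.randn(1, 64, 80, 80))  # (1, 128, 40, 40)
```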
3.3 Introduce CBAM
Inspired by the human visual system, Sanghyun Woo et al. proposed the Convolutional Block Attention Module (CBAM) [11]. CBAM attends to the most relevant spatial and channel features and considerably increases recognition capability. It comprises two parts, a Channel Attention Module and a Spatial Attention Module, as shown in Fig. 2.

Figure 2. Overview of the convolutional block attention module.


The channel attention module gathers information from each channel of the feature map F using both max pooling and average pooling. This information is then combined by a shared multilayer perceptron and passed through a Sigmoid activation function to produce the channel weights Mc(F). In the spatial attention module, the channel-refined feature map is pooled along the channel axis by max pooling and average pooling, then processed by a convolution and a Sigmoid activation function to obtain the spatial weights Ms(F). Ultimately, the CBAM output is obtained by first multiplying the feature map F element-wise with Mc(F), and then multiplying the result element-wise with the spatial weights Ms(F).
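A compact PyTorch rendition of the two modules follows, using the defaults from Woo et al. (reduction ratio 16, 7x7 spatial kernel); these are published CBAM defaults, not values stated in this paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, c, r=16):
        super().__init__()
        # Shared MLP applied to average- and max-pooled channel descriptors.
        self.mlp = nn.Sequential(nn.Conv2d(c, c // r, 1, bias=False), nn.ReLU(),
                                 nn.Conv2d(c // r, c, 1, bias=False))

    def forward(self, x):                                      # Mc(F)
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)                         # (B, C, 1, 1)

class SpatialAttention(nn.Module):
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2, bias=False)

    def forward(self, x):                                      # Ms(F)
        # Concatenate channel-wise mean and max maps, then convolve.
        pooled = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(pooled))                # (B, 1, H, W)

class CBAM(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.ca, self.sa = ChannelAttention(c), SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)        # reweight channels first
        return x * self.sa(x)     # then reweight spatial positions

y = CBAM(256)(torch.randn(1, 256, 40, 40))   # output keeps the input's shape
```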
In this paper, CBAM modules are introduced in front of all three scale-specific detection layers of the head network, as shown in Fig. 3. Placing the attention module there enhances the salient features on the feature maps of the different scales, leading to more accurate predictions from the three detection layers.

Figure 3. Adding CBAM to RT-DETR.

4 EXPERIMENT
4.1 Dataset
Datasets for detecting the relevant targets in human-robot collaboration are limited, and the targets to be detected vary with the task. To address this issue, we selected additional images of operators' hands from the 11k Hands dataset [12] to expand the sample set, and further enlarged it using data augmentation methods. The expanded dataset comprises 2500 images, labelled with the categories worker_hand, robot_hand and object using the MakeSense platform [13]. The dataset is divided into training, testing, and validation sets in an 8:1:1 ratio.
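A minimal sketch of such an 8:1:1 split (the file names and seed are hypothetical):

```python
import random

def split_dataset(paths, seed=0):
    """Shuffle image paths and split them 8:1:1 into train/test/val."""
    random.seed(seed)
    paths = paths[:]                     # avoid mutating the caller's list
    random.shuffle(paths)
    n_train, n_test = int(0.8 * len(paths)), int(0.1 * len(paths))
    return (paths[:n_train],
            paths[n_train:n_train + n_test],
            paths[n_train + n_test:])

train, test, val = split_dataset([f"img_{i:04d}.jpg" for i in range(2500)])
print(len(train), len(test), len(val))   # 2000 250 250
```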
4.2 Model training
The experiments in this paper are conducted on the Ubuntu 18.04 operating system with an NVIDIA GeForce RTX 3080 graphics card. PyTorch 1.7.0 is used as the deep learning framework, with CUDA 11.0 and Python 3.8 as the programming environment. The left side of Fig. 4 displays the training-loss curve over 250 epochs. Because the model employs position coding, the loss is high at the beginning; as the number of epochs increases, the training loss levels off after the 50th epoch and converges after 100 epochs, with no sign of underfitting or overfitting during training.
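The paper does not publish its training script. Purely as an assumption, a comparable run for a stock RT-DETR via the ultralytics package might look like the sketch below; the modified model with GhostConv and CBAM would require a customized model definition, and the dataset YAML name is hypothetical.

```python
from ultralytics import RTDETR

model = RTDETR("rtdetr-l.pt")            # pretrained RT-DETR-L weights
model.train(data="hrc_dataset.yaml",     # hypothetical dataset config
            epochs=250, imgsz=640, device=0)
```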
4.3 Analysis of test results
To verify the effectiveness of the improved algorithm, three sets of ablation experiments are performed; the results are shown in Table 1. Precision (P), recall (R), and mean average precision (mAP@0.5) are employed as the evaluation criteria. As can be seen from Table 1, after using GhostConv (model A), P improves by 3.55% compared to RT-DETR. After introducing the CBAM attention mechanism (model B), although the precision is lower than with GhostConv, the feature extraction ability of the model improves and the recall increases by 1.4% compared to RT-DETR. Combining GhostConv and the CBAM attention mechanism (model C) increases the precision, recall, and mAP of the improved RT-DETR by 1.2%, 0.6%, and 0.8%, respectively, compared to RT-DETR, indicating that the combination of the two methods contributes to the improvement of detection performance.
Table 1. Ablation experiment results.

Model      GhostConv   CBAM   P/%     R/%     mAP@0.5/%
RT-DETR    –           –      88.35   72.40   84.20
A          ✓           –      91.90   69.25   82.35
B          –           ✓      88.85   73.80   85.85
C          ✓           ✓      89.55   73.00   85.00
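For reference, the evaluation metrics used in Table 1 follow the standard detection definitions:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
\mathrm{mAP@0.5} = \frac{1}{N} \sum_{i=1}^{N} AP_i
```

where TP, FP and FN denote true positives, false positives and false negatives, AP_i is the average precision of class i at an IoU threshold of 0.5, and N is the number of classes (three in this dataset).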
To compare the detection effectiveness of the improved model more intuitively, detection results of the base model (RT-DETR) and the improved model (model C in Table 1) are given on the right side of Fig. 4. It can clearly be seen from Fig. 4 that the original RT-DETR model misses small targets (marked with blue boxes) and misdetects the operator's hand (marked with purple boxes). The improved RT-DETR model achieves accurate recognition of the operator's hands and the operation objects and is also more accurate in localizing targets, which lays an accurate data foundation for subsequent in-depth research.

Figure 4. Improved RT-DETR detection results.

5 CONCLUSION
In order to improve robots' understanding of workers' actions in human-robot collaboration, this paper proposes an intelligent recognition method based on an improved RT-DETR to recognize the operator's hand and the manipulated object. First, the Conv modules of the backbone network are replaced by GhostConv modules to reduce the number of model parameters and the computational complexity. Meanwhile, a CBAM attention mechanism is added before the detection layers to highlight important information on the feature maps of different scales, making the predictions more accurate. Experimental results show that, compared with the original RT-DETR model, the detection precision of the improved model increases by 1.2% and the recall by 0.6%, providing a more accurate basis for subsequent in-depth research.

REFERENCES

[1] Zheng P, Wang Z, Chen C H, et al. A Survey of Smart Product-Service Systems: Key Aspects, Challenges and
Future Perspectives[J]. Advanced Engineering Informatics, 2019. DOI:10.1016/j.aei.2019.100973.
[2] Liu S, Wang L, Wang X V. Symbiotic human-robot collaboration: multimodal control using function blocks[J].
Procedia CIRP, 2020, 93: 1188-1193. DOI:10.1016/j.procir.2020.03.022.
[3] Wang W, Li R, Chen Y, et al. Predicting human intentions in human–robot hand-over tasks through multimodal
learning[J]. IEEE Transactions on Automation Science and Engineering, 2021, 19(3): 2339-2353.
DOI:10.1109/TASE.2021.3074873.
[4] Costanzo M, De Maria G, Lettera G, et al. A multimodal approach to human safety in collaborative robotic workcells[J]. IEEE Transactions on Automation Science and Engineering, 2021, PP(99): 1-15. DOI:10.1109/TASE.2020.3043286.
[5] Bo Wang, Jinhua Xiao, "Exploring human intention recognition based on human robot collaboration
manufacturing toward safety production," Proc. SPIE 12803, Fifth International Conference on Artificial
Intelligence and Computer Science (AICS 2023), 128033Q (16 October 2023);
https://doi.org/10.1117/12.3009583.
[6] Yin H, Xiao J, Wang G. Human-robot collaboration re-manufacturing for uncertain disassembly in retired battery recycling[C]. 2022 5th World Conference on Mechanical Engineering and Intelligent Manufacturing (WCMEIM), Ma'anshan, China, 2022: 595-598. DOI:10.1109/WCMEIM56910.2022.10021388.
[7] Xiao, J., Gao, J., Anwer, N., and Eynard, B. (July 21, 2023). "Multi-Agent Reinforcement Learning Method for
Disassembly Sequential Task Optimization Based on Human–Robot Collaborative Disassembly in Electric
Vehicle Battery Recycling." ASME. J. Manuf. Sci. Eng. December 2023; 145(12): 121001.
https://doi.org/10.1115/1.4062235
[8] Xiao J, Anwer N, Li W, et al. Dynamic Bayesian network-based disassembly sequencing optimization for electric
vehicle battery[J]. CIRP Journal of Manufacturing Science and Technology, 2022, 38: 824-835.



[9] Lv W, Xu S, Zhao Y, et al. DETRs beat YOLOs on real-time object detection[J]. arXiv preprint arXiv:2304.08069, 2023. https://arxiv.org/pdf/2304.08069.pdf.
[10] Han K, Wang Y, Tian Q, et al. GhostNet: more features from cheap operations[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020: 1580-1589. https://openaccess.thecvf.com/content_CVPR_2020/html/Han_GhostNet_More_Features_From_Cheap_Operations_CVPR_2020_paper.html.
[11] Woo S, Park J, Lee J Y, et al. CBAM: convolutional block attention module[C]. Proceedings of the European Conference on Computer Vision (ECCV), 2018: 3-19. https://openaccess.thecvf.com/content_ECCV_2018/html/Sanghyun_Woo_Convolutional_Block_Attention_ECCV_2018_paper.html.
[12] Afifi M. 11K Hands: gender recognition and biometric identification using a large dataset of hand images[J]. Multimedia Tools and Applications, 2019. DOI:10.1007/s11042-019-7424-8.
[13] Piotr Skalski, 2019. Make Sense. https://github.com/SkalskiP/make-sense.
