Autonomous Unmanned Vehicle Automatic Visual Tracking Based On SLAM and YOLO Algorithm
Xiaolei Qu1,2 , Jiaxing Wang1 , Xiulin Zhang1 , Xinyu Feng1 , Yang Du4 , Ke Li3(B) ,
and Lijing Wang3
1 AVIC Shenyang Aircraft Design and Research Institute, Beijing, China
2 Northwestern Polytechnical University, Xi'an, China
3 School of Aeronautical Science and Engineering, Beihang University, Beijing, China
[email protected]
4 91001 Army, Beijing, China
Abstract. In a UAV distribution center, using unmanned vehicles as the UAV carrier platform compensates for the limited time that UAVs can stay in the air while saving manpower and material resources. The vehicle autonomously completes navigation and obstacle avoidance in known environments and, through its human-computer interaction function, provides target tracking capability, which greatly facilitates the centralized and distributed management of UAVs. In this paper, SLAM and YOLO algorithms are used to implement automatic visual tracking on an autonomous unmanned vehicle and to achieve collaboration between UAVs and the unmanned vehicle in a known experimental environment.
1 Introduction
1.1 Visual Tracking History
Visual tracking started from statistical pattern recognition in the 1950s. At that time, it
mainly focused on the analysis and recognition of two-dimensional images. By the 1980s,
a relatively complete system had been formed. After that, many new methods of visual
tracking appeared, which can be divided into four categories: region-based tracking,
feature-based tracking, deformation template-based tracking, and model-based tracking
[5]. With the rapid development of machine learning, tracking models increasingly use
machine learning methods. In 1998, Isard regarded the visual tracking problem as a
non-linear problem, and introduced the particle filter method to propose the CONDEN-
SATION algorithm, which is the basis of the online update model [6]. In 2009, Mei
et al. combined the particle filter framework with sparse representation and proposed the L1 tracker (L1T) [7]. The model uses an online update mechanism to capture the dynamic information of the tracking target, which improves the robustness of the algorithm and the tracking accuracy. In 2013, Wang et al. introduced dictionary representation into this framework, which can express occlusion and complex backgrounds during tracking; the algorithm also uses the Huber loss function to optimize the model, which substantially improves its robustness and accuracy [8]. In 2011, Hare et al. proposed the Struck algorithm [9]. The algorithm expresses the transition
of the target state between two frames of images by designing a structured output, and
learns the joint distribution of images and state transitions through the SVM algorithm.
In order to solve the model drift problem of the tracking algorithm, Zhang et al. proposed
the MEEM algorithm in 2014 [10]. The algorithm designs multiple expert trackers based
on the SVM algorithm, and selects appropriate experts to track and update the expert
model through specific criteria. In 2012, Kalal et al. proposed the TLD algorithm [11]. The algorithm divides the entire visual tracking process into three components: tracking, learning, and detection. Through the novel P-N learning algorithm, the robustness and accuracy of the model are ensured.
In the tracking problem there are many unknown factors, such as occlusion, changes in lighting conditions, large deformations of the target, and similar distracting objects of the same class, so in different situations different discriminative features may be selected to obtain a more accurate match. Tracking models based on ensemble algorithms therefore appeared, such as the MEEM algorithm [10]. In 2018, Wang et al. proposed the MCCT model, which designs multiple discriminators based on correlation filtering with different feature combinations and then selects the most suitable discriminator according to corresponding selection criteria to achieve high-precision tracking [21].
pixel. The YOLO network model uses max pooling, replacing each image region with the maximum value of that region after convolution, which reduces redundant data and helps prevent overfitting.
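To make this pooling step concrete, the following is a minimal sketch (our own illustration, not the configuration used in the prototype network) of 2 × 2 max pooling on a feature map using NumPy; the function name max_pool_2x2 and the 4 × 4 example input are assumptions for demonstration only.

import numpy as np

def max_pool_2x2(feature_map):
    """Downsample a 2D feature map by keeping the maximum of each non-overlapping 2x2 block."""
    h, w = feature_map.shape
    h2, w2 = h // 2, w // 2
    # Group pixels into 2x2 blocks, then take the block-wise maximum.
    blocks = feature_map[:h2 * 2, :w2 * 2].reshape(h2, 2, w2, 2)
    return blocks.max(axis=(1, 3))

# Example: a 4x4 map shrinks to 2x2, each output cell holding the maximum of its block.
fm = np.arange(16, dtype=np.float32).reshape(4, 4)
print(max_pool_2x2(fm))  # [[ 5.  7.] [13. 15.]]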
The final output layer of the YOLO network model is similar to the SoftMax classifier in a CNN: both classify the data from the fully connected layer, and the number of output feature maps equals the number of target classes. The difference is that the YOLO output layer produces a 7 × 7 × 30 tensor, where 7 × 7 corresponds to the 7 × 7 grid over the input image and the 30 values encode the classification result and position information of the object in the image. Finally, this tensor is decoded according to an agreed encoding convention to draw the detection results on the original image.
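As an illustration of this decoding step, the sketch below unpacks a 7 × 7 × 30 prediction tensor under the common YOLOv1-style assumption of two boxes per grid cell (each encoded as x, y, w, h and a confidence score) followed by 20 class probabilities; the function decode_yolo_output, the confidence threshold, and this exact encoding agreement are assumptions for illustration and may differ from the convention used in our implementation.

import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes (assumed YOLOv1-style layout)

def decode_yolo_output(pred, conf_threshold=0.2):
    """Unpack an S x S x (B*5 + C) tensor into (class_id, score, box) detections."""
    detections = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col]
            class_probs = cell[B * 5:]                   # last C values: class probabilities
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                class_id = int(np.argmax(class_probs))
                score = conf * class_probs[class_id]     # class-specific confidence
                if score > conf_threshold:
                    # x, y are offsets within the cell; w, h are relative to the whole image.
                    cx, cy = (col + x) / S, (row + y) / S
                    detections.append((class_id, float(score), (cx, cy, float(w), float(h))))
    return detections

# Example usage on a random tensor standing in for the network output.
boxes = decode_yolo_output(np.random.rand(S, S, B * 5 + C))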
In formula (1), Pr(Object) represents the probability that a target exists within the grid cell's bounding box, Object denotes the target object, and IoU (Intersection over Union) measures how accurately the current model predicts the position of the target box; its expression is shown in formula (2).
$\mathrm{IoU}^{\mathrm{pre}}_{\mathrm{true}} = \dfrac{box(pre) \cap box(true)}{box(pre) \cup box(true)}$   (2)
In formula (2), box(pre) represents the predicted target box, and box(true) represents the ground-truth target box.
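Formula (2) can be computed directly for axis-aligned boxes. The following is a minimal sketch assuming each box is given as (x1, y1, x2, y2) corner coordinates; the function name iou is chosen only for illustration.

def iou(box_pre, box_true):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_pre[0], box_true[0])
    iy1 = max(box_pre[1], box_true[1])
    ix2 = min(box_pre[2], box_true[2])
    iy2 = min(box_pre[3], box_true[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # overlap area, 0 if the boxes are disjoint
    area_pre = (box_pre[2] - box_pre[0]) * (box_pre[3] - box_pre[1])
    area_true = (box_true[2] - box_true[0]) * (box_true[3] - box_true[1])
    union = area_pre + area_true - inter
    return inter / union if union > 0 else 0.0

# Example: two unit squares offset by half a side overlap with IoU = 1/3.
print(iou((0, 0, 1, 1), (0.5, 0, 1.5, 1)))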
2 Results
Fig. 2. Prototype
As shown in Fig. 3, the unmanned car is able to build a 2D map with its laser radar. However, in our project the vehicles operate in a known field, so the laser radar is mainly used for obstacle avoidance rather than mapping.
In Fig. 4, the unmanned car is required to move from right to left, and the map is given in advance. As can be seen, the car encounters two obstacles, one square and the other round. The car moves smoothly, passes through the obstacles, and reaches the destination successfully.
In Fig. 5, the car encounters a moving obstacle (a human). It keeps a safe distance from the human, and its re-planning reaction is sufficiently quick.
3 Conclusion
The intelligent unmanned car is equipped with a depth camera and a laser radar, which meets the demands of our project. The prototype was challenged with multiple tasks, including mapping, guiding, obstacle avoidance, and visual following, showing that this technology serves well in the mission of transporting UAVs and interacting with human operators.
According to the decomposition of the task process of the UAV distribution center, the unmanned vehicle needs to participate in three links: transporting the UAV out of the warehouse to the take-off site, retrieving the UAV at the landing site and transporting it back to the designated location (the warehouse), and responding to command intervention from the recipient.
From the results, our unmanned vehicle meets the demands of autonomous navigation and realizes autonomous path planning based on known map information. It also achieves autonomous obstacle avoidance, maintaining awareness of the surrounding environment to avoid moving objects and plan routes reasonably. In addition, it provides human-targeted visual tracking to improve the performance of human-computer interaction.
Acknowledgements. This work is supported by the National Natural Science Foundation of China (No. 61773039), the Aeronautical Science Foundation of China (No. 2017ZDXX1043), and the Aeronautical Science Foundation of China (No. 2018XXX).
References
1. Guilmartin, J.F.: Unmanned aerial vehicle. Encyclopedia Britannica, 15 July 2020. https://
www.britannica.com/technology/unmanned-aerial-vehicle. Accessed 15 May 2021
2. Dean, J., Mixter, J., Barr, J.: Multi-Rotor Unmanned Aerial Vehicle. Western Michigan Univer-
sity, 12 August 2015. https://scholarworks.wmich.edu/cgi/viewcontent.cgi?article=3658&
context=honors_theses
3. https://www.droneomega.com/what-is-a-quadcopter/
4. Lukmana, M.A., Nurhadi, H.: Preliminary study on Unmanned Aerial Vehicle (UAV) Quad-
copter using PID controller. In: 2015 International Conference on Advanced Mechatronics,
Intelligent Manufacture, and Industrial Automation (ICAMIMIA), pp. 34–37 (2015). https://
doi.org/10.1109/ICAMIMIA.2015.7507997
5. . (1), 72–74
(2014)
6. Isard, M., Blake, A.: Condensation—conditional density propagation for visual tracking. Int.
J. Comput. Vis. 29(1), 5–28 (1998)
7. Mei, X., Ling, H.: Robust visual tracking using ℓ1 minimization. In: IEEE International Conference on Computer Vision. IEEE (2009)
8. Wang, N., Wang, J., Yeung, D.: Online robust non-negative dictionary learning for visual
tracking. In: IEEE International Conference on Computer Vision. IEEE (2013)
9. Hare, S., Saffari, A., Torr, P.H.S.: Struck: structured output tracking with kernels. In: IEEE
International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, 6–13 November
2011. IEEE (2011)
10. Zhang, J., Ma, S., Sclaroff, S.: MEEM: robust tracking via multiple experts using entropy
minimization (2014)
11. Kalal, Z., Mikolajczyk, K., Matas, J.: Tracking-learning-detection. IEEE Trans. Pattern Anal. Mach. Intell. 34(7), 1409–1422 (2012)
12. Bolme, D.S., Beveridge, J.R., Draper, B.A., et al.: Visual object tracking using adaptive
correlation filters. In: CVPR, pp. 2544–2550 (2010)
13. Henriques, J.F., Caseiro, R., Martins, P., et al.: High-speed tracking with kernelized correlation
filters. IEEE Trans. Pattern Anal. Mach. Intell. 37(3), 583–596 (2015)
14. Danelljan, M., Häger, G., Khan, F.S., et al.: Learning spatially regularized correlation filters
for visual tracking (2016)
15. Li, F., Tian, C., Zuo, W., et al.: Learning spatial-temporal regularized correlation filters for
visual tracking (2018)
16. Danelljan, M., Hager, G., Khan, F.S., et al.: Convolutional features for correlation filter based
visual tracking. In: 2015 IEEE International Conference on Computer Vision Workshop
(ICCVW). IEEE (2015)
17. Lukezic, A., Vojir, T., Zajc, L.C., et al.: Discriminative correlation filter with channel and
spatial reliability. In: The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 6309–6318 (2017)
18. Sun, C., Wang, D., Lu, H., et al.: Learning spatial-aware regressions for visual tracking (2017)
19. Danelljan, M., Robinson, A., Khan, F.S., et al.: Beyond correlation filters: learning continuous
convolution operators for visual tracking (2016)
20. Danelljan, M., Bhat, G., Khan, F.S., et al.: ECO: efficient convolution operators for tracking
(2016)
21. Wang, N., Zhou, W., Tian, Q., et al.: Multi-cue correlation filters for robust visual tracking.
In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2018)
22. Nam, H., Han, B.: Learning multi-domain convolutional neural networks for visual tracking
(2015)
23. Held, D., Thrun, S., Savarese, S.: Learning to track at 100 FPS with deep regression networks.
In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 749–
765. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_45
24. Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., Torr, P.H.S.: Fully-convolutional
siamese networks for object tracking. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol.
9914, pp. 850–865. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-48881-3_56
25. Li, B., Yan, J., Wu, W., et al.: High performance visual tracking with siamese region pro-
posal network. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR). IEEE (2018)
26. Wang, Q., Zhang, L., Bertinetto, L.: Fast online object tracking and segmentation: a unifying
approach. In: IEEE/CVF Conference on Computer Vision & Pattern Recognition. IEEE (2019)
27. Li, B., Wu, W., Wang, Q., et al.: SiamRPN++: evolution of siamese visual tracking with very
deep networks. In: IEEE/CVF Conference on Computer Vision & Pattern Recognition. IEEE
(2019)
28. Guo, Q., Feng, W., Zhou, C., et al.: Learning dynamic siamese network for visual object
tracking. In: International Conference on Computer Vision (ICCV 2017). IEEE Computer
Society (2017)
29. Yang, T., Chan, A.B.: Learning dynamic memory networks for object tracking. In: Ferrari, V.,
Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11213, pp. 153–169.
Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01240-3_10
30. Zhu, Z., Wang, Q., Li, B., et al.: Distractor-aware siamese networks for visual object tracking
(2018)
31. Danelljan, M., Bhat, G., Khan, F.S., et al.: ATOM: accurate tracking by overlap maximization.
In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE
(2019)
32. Bhat, G., Danelljan, M., Gool, L.V., et al.: Learning discriminative model prediction for
tracking. In: ICCV (2019)
33. : .
(2019)
34. Bolme, D.S., Beveridge, J.R., Draper, B.A., et al.: Visual object tracking using adaptive
correlation filters. In: Proceedings of IEEE Conference on Computer Vision and Pattern
Recognition, San Francisco, USA, pp. 2544–2550. IEEE (2010)