Vision-based Vehicle Detection and Distance Estimation
Abstract—Real-time vehicle detection is one of the most important topics in Autonomous Vehicles (AVs) research and traffic surveillance. Detecting vehicles and estimating their distances are essential so that vehicles can keep a safe distance and run safely on the roads. The technology can also be utilized to determine traffic flow and estimate vehicle speed. In this paper, we apply two different deep learning models and compare their performance in detecting vehicles such as cars and trucks for deployment on self-driving cars to ensure road safety. Our models are based on YOLOv4 and Faster R-CNN, which are efficient and accurate in object detection within a given distance. We also propose a vision-based distance estimation algorithm to estimate other vehicles' distances. In detecting vehicles within 100 meters, the two variations of our models, YOLOv4 and Faster R-CNN, achieved 99.16% and 95.47% mean precision, and 79.36% and 85.54% F1-measure respectively on a two-way road. The detection speed is 68 fps for YOLOv4 and 14 fps for Faster R-CNN.

Keywords—Autonomous Vehicle, Computer Vision, Vehicle Detection, YOLO, Faster R-CNN.

I. INTRODUCTION

In recent years, with a number of technological breakthroughs around the world, Artificial Intelligence (AI) for self-driving vehicles has drawn much attention and has a significant impact on our lives. Although Autonomous Vehicles (AVs) are still progressing, more and more driver-assistance functions such as Lane Keeping Assist (LKA), Adaptive Cruise Control (ACC) and Emergency Brake Assist (EBA) are being developed and equipped to create smarter and more secure self-driving technology. Vehicles currently equipped with such techniques run proprietary software [19]. Vehicles equipped with an ACC system can follow the nearest front vehicle in the driving lane and adjust their speed automatically to maintain a safe distance [20]. The EBA system is an automobile braking technology that increases the braking pressure in the case of an emergency [21]. Both of these systems rely on object detection and distance estimation.

AVs use multiple kinds of sensors, such as radar, lidar and video cameras, to gather real-time traffic information. Radar is good at measuring the distance to an object from radio waves, but it cannot recognize the object. Lidar is highly accurate and can create 3D maps, which is safer and more convenient than 2D images, but it is considerably more expensive and can cost as much as a car. The video camera is a good alternative: it is cheaper and produces visible, colorful images from which useful traffic information can be extracted to increase driver and road safety. Computer vision has made significant progress, but it needs further advancement in the AV domain to achieve full automation. In this paper, we focus on detecting the vehicles in front with a video camera, which is essential for developing an Advanced Driver-Assistance System (ADAS) including the ACC and EBA functions. The workflow is shown in Fig. 1. First, we calibrated the input images, which are distorted by the camera. Then, we applied our models to detect the vehicles in the images; the bottom-right image in Fig. 1 shows the predicted vehicles with the corresponding bounding boxes and confidence scores. Finally, we estimated the world distance based on the image coordinates to compute the vehicle detection accuracy within a specified distance, which gives a more reasonable performance measure. The top-right image in Fig. 1 displays the predicted vehicles with their corresponding distances.

Fig. 1. The workflow of the model.

For object detection, we first explored classical methods such as the Scale Invariant Feature Transform (SIFT) [13] or Histograms of Oriented Gradients (HOG) [14] for feature extraction, combined with machine learning algorithms such as SVM or boosting algorithms for object recognition and classification. In the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), AlexNet [15] achieved a breakthrough result with a deep convolutional neural network.

This research is funded by the Canadian Urban Transit Research and Innovation Consortium (CUTRIC), Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery, and Canada Foundation for Innovation (CFI) grants.
We will compare the performance of the one-stage detector YOLO [1] and the two-stage detector R-CNN [5] in vehicle detection. The latest versions, YOLOv4 [3] and Faster R-CNN [7], are selected due to their high precision and short inference time in object detection, which are two critical factors for AVs detecting the surrounding environment.
III. METHODOLOGY

In this section, we explain our methodology in detail. First, we calibrated the camera with OpenCV. Then we introduce the object detectors we applied: we first looked into the YOLO [1] algorithms due to their proven performance and speed in real-time object detection, and then utilized the two-stage object detector Faster R-CNN [7], which can achieve high precision in object detection. Our vision-based distance estimation method is described at the end of the section.
A. Camera Calibration

Some pinhole cameras introduce significant distortion to images. Fig. 3 shows an example with a chessboard image. The first image is the original, distorted image: the expected straight lines bulge out and appear curved. In order to calibrate the camera, we use the calibration functions provided by OpenCV [26] and a set of chessboard images. We found the corners of the chessboard with the findChessboardCorners() function, as shown in the second image, and calibrated the camera with the calibrateCamera() function. After calibration, the function returns the distortion coefficients and the camera matrix, which can be used to undistort other images. The right image shows the undistorted/corrected chessboard. The camera matrix in Eq. (1) maps 3D camera coordinates to the coordinates of the image pixels; we will use it later on in distance estimation.

$$\text{camera matrix } M = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \qquad (1)$$

where (fx, fy) is the focal length and (cx, cy) is the optical center.
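As a concrete illustration of this step, here is a minimal Python/OpenCV sketch of chessboard calibration. It is not the authors' code; the 9 × 6 pattern size, the file paths and the variable names are assumptions made for the example.

```python
import glob
import cv2
import numpy as np

# Reference 3D corner grid for a 9x6 chessboard (assumed size);
# z = 0 because all corners lie on the board plane.
pattern = (9, 6)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in glob.glob("calibration/*.jpg"):      # hypothetical image folder
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern, None)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# calibrateCamera() returns the camera matrix M of Eq. (1) and the
# distortion coefficients, which can then undistort any road frame.
ret, M, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)

frame = cv2.imread("road_frame.jpg")             # hypothetical input frame
undistorted = cv2.undistort(frame, M, dist)
```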
B. YOLOv4

YOLOv4 [3] extends YOLOv3 [2] with additional processing to improve detection precision and shorten the inference time. The network structure of YOLOv4 consists of three parts: head, backbone and neck. An overview of the model architecture is shown in Fig. 2.

Fig. 4. YOLO model architecture [1].

Fig. 5. The Darknet53 model. The model's layers are shown with filter and kernel sizes and output dimensions.
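Since the paper uses pretrained detectors (Section IV-A), the sketch below is a hedged illustration of the detection step: it runs a pretrained YOLOv4 with COCO weights through OpenCV's DNN module (OpenCV 4.4 or newer). This is not the authors' pipeline; the file names, the 608 × 608 input size and the thresholds are assumptions.

```python
import cv2

# Load pretrained YOLOv4 (Darknet cfg/weights trained on MS COCO);
# "yolov4.cfg" and "yolov4.weights" are hypothetical local file names.
net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")
model = cv2.dnn_DetectionModel(net)
# Scale pixels to [0, 1] and convert BGR->RGB, matching Darknet preprocessing.
model.setInputParams(size=(608, 608), scale=1 / 255.0, swapRB=True)

frame = cv2.imread("road_frame.jpg")
class_ids, scores, boxes = model.detect(frame, confThreshold=0.5, nmsThreshold=0.4)

for class_id, score, box in zip(class_ids.flatten(), scores.flatten(), boxes):
    # COCO class 2 = car, 7 = truck; keep only vehicle classes.
    if int(class_id) in (2, 7):
        x, y, w, h = box
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
```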
After getting the RoI (region of interest), we applied the Canny edge detector and the Hough Transform (HT) [12] to detect the lane lines, as shown in Fig. 8. The HT returns the coordinates of the two ends of each line. The vanishing point is the point nearest to all of these detected lines, and we can get its coordinates using Eq. (2). This equation was obtained by geometric derivation and can be used to calculate the nearest point to multiple lines in a 2D image.

Fig. 8. The yellow dot in the figure is the vanishing point of the lane. The red lines are the lane lines generated by the Hough Transform.

$$p(x_0, y_0) = \left(\sum_{i=1}^{k} n_i n_i^T\right)^{-1} \left(\sum_{i=1}^{k} n_i n_i^T p_i\right) \qquad (2)$$

where $n_i$ is the unit normal vector of the $i$-th line and $p_i$ is a point on that line.
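To make Eq. (2) concrete, the following NumPy sketch (our own illustration, not code from the paper) computes the least-squares nearest point to a set of Hough line segments, each given by its two endpoints.

```python
import numpy as np

def vanishing_point(segments):
    """Least-squares nearest point to 2D lines, Eq. (2).

    segments: array of shape (k, 4) holding (x1, y1, x2, y2) per Hough line.
    """
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for x1, y1, x2, y2 in segments:
        d = np.array([x2 - x1, y2 - y1], dtype=float)
        d /= np.linalg.norm(d)
        n = np.array([-d[1], d[0]])           # unit normal n_i of the line
        p = np.array([x1, y1], dtype=float)   # a point p_i on the line
        N = np.outer(n, n)                    # n_i n_i^T
        A += N
        b += N @ p
    return np.linalg.solve(A, b)              # p(x0, y0)

# Example: two lane lines meeting near (640, 360) in a 1280x720 frame.
lines = np.array([[200, 720, 640, 360], [1080, 720, 640, 360]])
print(vanishing_point(lines))                 # approximately [640. 360.]
```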
$$\begin{bmatrix} u_w \\ v_w \\ 1 \end{bmatrix} = H_M \begin{bmatrix} X_w \\ Y_w \\ 1 \end{bmatrix} = \begin{bmatrix} r_x & 0 & c_x \\ 0 & r_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ 1 \end{bmatrix} \qquad (4)$$

where uw and vw are the coordinates in the bird's-eye view and Xw and Yw are the camera coordinates. The right-hand side of Eq. (4) expresses the same transformation as a scaling matrix, where rx and ry are the pixels per meter along the x-axis and y-axis. Therefore, we can get the relationship between rx and ry shown in Eq. (5):

$$r_y = r_x \, \frac{\lVert h_1 \rVert}{\lVert h_2 \rVert} \qquad (5)$$

where h1 and h2 are the first and second columns of the inverse matrix HM^-1.
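As a worked illustration of Eqs. (4) and (5), the sketch below builds an image-to-bird's-eye-view homography with OpenCV and derives ry from rx. The four source/destination points, the lane-edge pixel columns and the direction of the homography are assumptions for the example, not values from the paper.

```python
import numpy as np
import cv2

# Assumed example: four image points on the lane (a trapezoid) mapped to a
# rectangle in the bird's-eye view; the coordinates are illustrative only.
src = np.float32([[560, 470], [720, 470], [1100, 700], [200, 700]])
dst = np.float32([[400, 0], [880, 0], [880, 720], [400, 720]])
H_M = cv2.getPerspectiveTransform(src, dst)   # image -> bird's-eye view
H_inv = np.linalg.inv(H_M)

lane_width_px = 880 - 400                     # lane width in BEV pixels
lane_width_m = 3.658                          # standard 12 ft lane width
r_x = lane_width_px / lane_width_m            # pixels per meter along x

# Eq. (5): r_y = r_x * ||h1|| / ||h2||, with h1 and h2 the first two
# columns of the inverse homography.
r_y = r_x * np.linalg.norm(H_inv[:, 0]) / np.linalg.norm(H_inv[:, 1])
```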
We transformed the images into bird's-eye-view images to estimate the vehicle distance, as shown in Fig. 9, without binarization. We used the standard lane width of 12 feet (3.658 m) to estimate the distance: in Fig. 9, this known road width gives us the pixels per meter along the x-axis, and we can then estimate the distance along the y-axis with Eq. (5).
After getting the predicted bounding boxes from the object detectors, we can calculate the Euclidean distance between a detected vehicle and our vehicle. The distance is measured from the midpoint of the bottom edge of the bounding box to the midpoint of our car's front (the red lines in Fig. 10).
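Putting these pieces together, the following sketch (ours, reusing the assumed H_M, r_x and r_y from the previous example, plus a hypothetical ego_point_bev) projects the bottom-center of a detected bounding box into the bird's-eye view and converts the offset from the car front into meters.

```python
import numpy as np
import cv2

def estimate_distance(box, H_M, r_x, r_y, ego_point_bev):
    """Euclidean distance (meters) from our car front to a detected vehicle.

    box: (x, y, w, h) bounding box in the undistorted image.
    ego_point_bev: midpoint of our car front in bird's-eye-view pixels
                   (an assumed reference point for this example).
    """
    x, y, w, h = box
    bottom_mid = np.float32([[[x + w / 2.0, y + h]]])        # image pixels
    u, v = cv2.perspectiveTransform(bottom_mid, H_M)[0, 0]   # BEV pixels
    dx_m = (u - ego_point_bev[0]) / r_x                      # meters along x
    dy_m = (ego_point_bev[1] - v) / r_y                      # meters along y (forward)
    return float(np.hypot(dx_m, dy_m))
```

Detections whose estimated distance exceeds 100 m can then be discarded before computing the evaluation metrics, matching the bounded-distance evaluation used in Section IV.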
Fig. 10. Bird's-eye view of the road for distance estimation.

IV. EXPERIMENTS

A. Dataset

We used pretrained YOLOv4 and Faster R-CNN models, which were trained and validated on the MS COCO dataset [8]. We tested our vehicle detection models on a more challenging 16-second video clip [27]. The video has 30 frames per second and a resolution of 1,280 × 720. It offers a greater challenge because the divider between the two opposite sides of the road covers parts of the vehicles on the other side.

Precision, recall, and F1-measure are utilized to evaluate our vehicle detectors, as shown in Eq. (6), (7) and (8) respectively.

$$\mathrm{precision} = \frac{\mathrm{True\ Positive}}{\mathrm{True\ Positive} + \mathrm{False\ Positive}} \qquad (6)$$

$$\mathrm{recall} = \frac{\mathrm{True\ Positive}}{\mathrm{True\ Positive} + \mathrm{False\ Negative}} \qquad (7)$$

$$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (8)$$
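For completeness, here is a small sketch of Eqs. (6)-(8) (ours, not the authors' evaluation script), assuming the detections have already been matched against ground truth to obtain true-positive, false-positive and false-negative counts.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F1 from Eq. (6)-(8)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example with made-up counts:
print(detection_metrics(tp=430, fp=4, fn=180))
```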
B. Implementation Details

We used Ubuntu 18.04 with an Intel Xeon Gold 6130 CPU, a Tesla V100 GPU with 32 GB of memory, CUDA v10.1 and cuDNN v9.1. The speed is evaluated with batch size 1. We obtained a vehicle detection speed of around 60 fps with YOLOv4 and 14 fps with Faster R-CNN.

C. Vehicle Detection and Distance Estimation

In order to evaluate the detection results, we saved the frames of the video as images. From the 16-second video clip, we obtained 473 images. One frame with the object recognition results is shown in Fig. 11; the left and right images in the figure show the results of Faster R-CNN and YOLOv4 respectively.

The evaluation results are shown in Table I. YOLOv4 was evaluated with three different input shapes: 416 × 416, 512 × 512 and 608 × 608. We can see that YOLOv4 has higher precision, while Faster R-CNN has significantly higher recall and F1-score than YOLOv4. When we only count the vehicles travelling in the same direction, YOLOv4 achieves 98.04% precision and 70.49% recall, while Faster R-CNN achieves 95.00% precision and 87.55% recall. Counting the vehicles in both directions on both sides of the road, the mean recall values are 67.86% and 79.50%, and the mean precisions are 99.16% and 95.47% respectively for the two models.
Fig. 11. Examples of the results. Left image is the result of Faster R-CNN, and the right image is the result of YOLOv4.
TABLE I. VEHICLE DETECTION RESULTS WITHIN 100 METERS

One Side of the Road
Model           Precision   Recall    F1        FPS
YOLOv4 - 416    100.00%     58.13%    71.54%    68
YOLOv4 - 512    98.75%      66.04%    76.56%    63
YOLOv4 - 608    98.04%      70.49%    79.68%    60
Faster R-CNN    95.00%      87.55%    89.43%    14

Both Sides of the Road
Model           Precision   Recall    F1        FPS
YOLOv4 - 416    98.18%      56.32%    70.67%    68
YOLOv4 - 512    98.07%      64.91%    77.15%    63
YOLOv4 - 608    99.16%      67.86%    79.36%    60
Faster R-CNN    95.47%      79.50%    85.54%    14
In general, object recognition tasks aim to predict fewer false negatives and thereby obtain a higher recall. Faster R-CNN consistently achieves higher recall scores but has a longer inference time than YOLOv4. We can see that, when counting the vehicles on both sides of the road, YOLOv4 with a 608 × 608 input size achieves a comparable score at a real-time detection speed.

V. CONCLUSION
In this study, we applied two models, YOLOv4 and Faster R-CNN, for vehicle detection in the autonomous vehicle paradigm. We also proposed a vision-based approach to estimate the distance of the vehicles in the forward direction. We evaluated both models for a bounded distance of 100 m, which is practical and acceptable to avoid collisions for autonomous vehicles. In detecting vehicles within 100 meters, YOLOv4 and Faster R-CNN achieved 99.16% and 95.47% mean precision as well as 79.36% and 85.54% F1-measure, with detection speeds of 68 fps and 14 fps respectively, on a two-way road. Besides, the models achieved greater accuracy when detecting vehicles on the same side of the road: YOLOv4 and Faster R-CNN achieved 98.04% and 95.00% precision as well as 70.49% and 87.55% recall respectively. The middle road divider reduced the accuracy when we tried to detect vehicles on both sides of the road. YOLOv4 detected vehicles at 68 fps, which is suitable for real-time vehicle detection. We also tested YOLOv3, which runs in real time at 78 fps, but its recall was 5% lower than that of YOLOv4.

Our ongoing work focuses on training the YOLOv4 model on other autonomous vehicle datasets containing traffic information such as traffic signs, pedestrians and cyclists. We are also aiming to find other datasets that can help evaluate our distance estimation method, and to compare it with deep learning-based depth estimation algorithms.
REFERENCES

[1] Redmon, J., Divvala, S., Girshick, R. and Farhadi, A., 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 779-788).
[2] Redmon, J. and Farhadi, A., 2018. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.
[3] Bochkovskiy, A., Wang, C.Y. and Liao, H.Y.M., 2020. YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934.
[4] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y. and Berg, A.C., 2016, October. SSD: Single shot multibox detector. In European conference on computer vision (pp. 21-37). Springer, Cham.
[5] Girshick, R., Donahue, J., Darrell, T. and Malik, J., 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 580-587).
[6] Girshick, R., 2015. Fast R-CNN. In Proceedings of the IEEE international conference on computer vision (pp. 1440-1448).
[7] Ren, S., He, K., Girshick, R. and Sun, J., 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (pp. 91-99).
[8] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P. and Zitnick, C.L., 2014, September. Microsoft COCO: Common objects in context. In European conference on computer vision (pp. 740-755). Springer, Cham.
[9] He, K., Zhang, X., Ren, S. and Sun, J., 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
[10] He, K., Gkioxari, G., Dollár, P. and Girshick, R., 2017. Mask R-CNN. In Proceedings of the IEEE international conference on computer vision (pp. 2961-2969).
[11] Cai, Z. and Vasconcelos, N., 2018. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6154-6162).
[12] Duda, R.O. and Hart, P.E., 1972. Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM, 15(1), pp.11-15.
[13] Lowe, D.G., 2004. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2), pp.91-110.
[14] Dalal, N. and Triggs, B., 2005, June. Histograms of oriented gradients for human detection. In 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05) (Vol. 1, pp. 886-893). IEEE.
[15] Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).
[16] Zhang, S., Wen, L., Bian, X., Lei, Z. and Li, S.Z., 2018. Single-shot refinement neural network for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4203-4212).
[17] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B. and Belongie, S., 2017. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2117-2125).
[18] Everingham, M., Van Gool, L., Williams, C.K., Winn, J. and Zisserman, A., 2010. The PASCAL visual object classes (VOC) challenge. International journal of computer vision, 88(2), pp.303-338.
[19] Self-driving car, Wikipedia, 2 March 2020, accessed March 2020. <https://en.wikipedia.org/wiki/Self-driving_car>.
[20] Adaptive cruise control, Wikipedia, 20 February 2020, accessed March 2020. <https://en.wikipedia.org/wiki/Adaptive_cruise_control>.
[21] Emergency brake assist, Wikipedia, 29 September 2019, accessed March 2020. <https://en.wikipedia.org/wiki/Emergency_brake_assist>.
[22] Wang, C.Y., Mark Liao, H.Y., Wu, Y.H., Chen, P.Y., Hsieh, J.W. and Yeh, I.H., 2020. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 390-391).
[23] He, K., Zhang, X., Ren, S. and Sun, J., 2015. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence, 37(9), pp.1904-1916.
[24] Liu, S., Qi, L., Qin, H., Shi, J. and Jia, J., 2018. Path aggregation network for instance segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 8759-8768).
[25] Simonyan, K. and Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[26] OpenCV dev team, 31 Dec 2019, accessed March 2020. <https://docs.opencv.org/master/index.html>.
[27] Udacity, accessed July 2020. <https://github.com/udacity/CarND-Advanced-Lane-Lines>.