


Vision-based Vehicle Detection and Distance Estimation

Donghao Qiao, School of Computing, Queen's University, Kingston, Canada, [email protected]
Farhana Zulkernine, School of Computing, Queen's University, Kingston, Canada, [email protected]

Abstract—Real-time vehicle detection is one of the most important topics under the Autonomous Vehicles (AVs) research paradigm and in traffic surveillance. Detecting vehicles and estimating their distances are essential to ensure that vehicles can keep a safe distance and run safely on the road. The technology can also be utilized to determine traffic flow and estimate vehicle speed. In this paper, we apply two different deep learning models and compare their performance in detecting vehicles such as cars and trucks for deployment on self-driving cars to ensure road safety. Our models are based on YOLOv4 and Faster R-CNN, which are efficient and accurate in object detection within a given distance. We also propose a vision-based distance estimation algorithm to estimate other vehicles' distances. In detecting vehicles within 100 meters, the two variations of our models, YOLOv4 and Faster R-CNN, achieved 99.16% and 95.47% mean precision, and 79.36% and 85.54% F1-measure respectively on a two-way road. The detection speed is 68 fps and 14 fps for YOLOv4 and Faster R-CNN respectively.

Keywords—Autonomous Vehicle, Computer Vision, Vehicle Detection, YOLO, Faster R-CNN.

This research is funded by the Canadian Urban Transit Research and Innovation Consortium (CUTRIC), Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery, and Canada Foundation for Innovation (CFI) grants.

I. INTRODUCTION

In recent years, with a number of technological breakthroughs around the world, Artificial Intelligence (AI) for self-driving vehicles has drawn much attention and has a significant impact on our lives. Although Autonomous Vehicles (AVs) are still evolving, more and more driver assistance functions such as Lane Keeping Assistant (LKA), Adaptive Cruise Control (ACC) and Emergency Brake Assist (EBA) are being developed and deployed to create smarter and more secure self-driving technology. Vehicles currently equipped with such techniques run proprietary software [19]. Vehicles equipped with an ACC system can follow the nearest front vehicle in the driving lane and adjust their speed automatically to maintain a safe distance [20]. The EBA system is an automobile braking technology that increases the braking pressure in the case of an emergency [21]. Both of these systems rely on object detection and distance estimation.

AVs use multiple kinds of sensors such as radar, lidar and video cameras to gather real-time traffic information. Radar is good at measuring the distance to an object based on radio waves, but it cannot recognize the object. Lidar is highly accurate and can create 3D maps, which is safer and more convenient than 2D images, but it is considerably more expensive and can cost as much as a car. The video camera is a cheaper alternative that provides visible, colorful images suitable for extracting useful traffic information to increase driver and road safety. Computer vision has made significant progress, but it needs further advancements in the area of AVs to achieve full automation. In this paper, we focus on detecting the front vehicles with a video camera, which is essential to develop an Advanced Driver-Assistance System (ADAS) including the ACC and EBA functions. The workflow is shown in Fig. 1. First, we calibrated the input images, which are distorted by the camera. Then, we applied our models to detect the vehicles in the images; the bottom-right image in Fig. 1 shows the predicted vehicles with their bounding boxes and confidence scores. Finally, we estimated the real-world distance based on the image coordinates to compute the vehicle detection accuracy within a specified distance, which yields a more meaningful performance measure. The top-right image in Fig. 1 displays the predicted vehicles with the corresponding distances.

Fig. 1. The workflow of the model.



For object detection, we first explored the classical methods such as the Scale Invariant Feature Transform (SIFT) [13] and Histograms of Oriented Gradients (HOG) [14] for feature extraction, combined with machine learning algorithms such as SVM or boosting for object recognition and classification. In the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), AlexNet [15] achieved an error of 15.3%, which was more than 10 percentage points lower than the state-of-the-art approach at that time. After that, Convolutional Neural Networks (CNNs) became more popular in classification and object detection, as they have demonstrated greater accuracy and generalizability. Therefore, we studied several popular object detectors: the anchor-based one-stage object detection algorithms such as You Only Look Once (YOLO) [1] and the Single Shot Detector (SSD) [4], as well as anchor-based two-stage region proposal algorithms such as the R-CNN [5], Fast R-CNN [6], and Faster R-CNN [7] models. The two-stage object detectors generate region proposals in the first stage and then classify objects from the region proposals and regress bounding boxes. These algorithms can reach high precision but take a longer inference time. One-stage object detection methods detect objects directly, which saves a considerable amount of computing time and resources while achieving a Mean Average Precision (mAP) comparable to the two-stage methods.

In this paper, we detect vehicles with two deep learning models: the one-stage object detector YOLOv4 and the two-stage object detector Faster R-CNN. YOLOv4 is well known for being less time-consuming in object detection, which is crucial for real-time object detection on AVs. Faster R-CNN generates region proposals in the first stage; it takes longer to detect objects but can achieve higher accuracy, which is also essential to detect all vehicles and ensure that vehicles run safely on the road. We also propose a computer vision method to estimate the distance in the image and evaluate the models for a specified distance of 100 meters in the forward direction. We set this distance as a threshold because objects more than 100 meters away are difficult to recognize accurately in video frames, and it is more important to correctly recognize vehicles at a closer distance to avoid collisions and ensure road safety. Even humans cannot recognize objects far away on the road or estimate their distance precisely. Moreover, 100 meters is enough for an intelligent system to take action (speeding up or braking) in an emergency. For vehicles within a distance boundary of 100 meters, our YOLOv4 and Faster R-CNN models achieved 99.16% and 95.47% precision, 67.86% and 79.50% recall, and 79.36% and 85.54% F1-score respectively when detecting vehicles on both sides of the road. When detecting vehicles only on our side of the road within the 100-meter boundary, we achieved mean precisions of 98.04% and 95.00%, mean recalls of 70.49% and 87.55%, and F1-scores of 79.68% and 89.43% for YOLOv4 and Faster R-CNN respectively.

The rest of the paper is structured as follows. Section II presents the related work on object detection. The implementation of the vehicle detection models and the distance estimation algorithm are introduced in Section III. In Section IV we describe the video dataset, implementation details and the results. Finally, we conclude in Section V with a discussion of future work.

II. RELATED WORK

In this section, we compare multiple object detectors: the classical detectors such as SIFT and HOG, the two-stage approaches, and the single shot detectors.

A. SIFT and HOG

SIFT [13] and HOG [14] are classical algorithms used in object detection with high efficiency. Lowe [13] first extracted local invariant features from images and then used the extracted features in object detection with clustering algorithms. Dalal et al. [14] extracted features by creating a histogram based on the gradients of an image and used a linear SVM as the baseline classifier. However, both SIFT and HOG are prone to generating false positives, because the features they extract are low-level features such as edges and color that do not use hierarchical layer-wise representation learning. In contrast, a CNN is a hierarchical deep learning architecture which is able to learn features by gradually composing lower-level features into higher-level, more abstract representations through multiple layers.

B. Two-stage Detectors

R-CNN [5] was the first benchmark to apply a CNN in object detection and achieved 58.5 mAP on VOC 2007 [18]. Girshick et al. [5] used selective search to generate around 2,000 region proposals to predict the bounding boxes of the objects. Both R-CNN [5] and Fast R-CNN [6] use selective search to create region proposals, which is time-consuming (it takes around 2 seconds to generate 2,000 region proposals). Faster R-CNN [7] uses an RPN that shares convolutional layers with the object detection network to make proposal computation nearly cost-free. The RPN is a kind of fully convolutional network (FCN) that can be trained end-to-end specifically for the task of generating detection proposals. Because of the high precision of Faster R-CNN, a number of variations of the model have been proposed, such as Mask R-CNN [10] and Cascade R-CNN [11], which are state-of-the-art instance segmentation algorithms. However, the region-based algorithms are too slow for real-time detection. The detection speed of Faster R-CNN is about 7 fps on PASCAL VOC 2007.

C. Single Shot Detectors

Liu et al. [4] proposed a single deep convolutional neural network model that achieved a detection speed of 59 fps with an accuracy of 74.3% mAP on VOC 2007. The Single Shot Multibox Detector (SSD) [4] detects objects from multiple feature maps instead of the fixed grids used in YOLO. For each detected object, SSD predicts the offsets of default bounding boxes and the confidence scores that indicate the class probability of the detected object. SSD significantly improves the speed of detection and still achieves high-quality detection results. However, SSD is not good at detecting small objects appearing in groups, because it predicts bounding boxes after multiple convolution layers; after a few layers the resolution decreases and it is hard to detect small objects from low-resolution feature maps. Based on SSD, Zhang et al. [16] presented another single shot detector, called RefineDet, by adding an anchor refinement module; it is better at detecting small objects. The YOLO series detectors are also single shot detectors. YOLOv3 [2] uses Darknet53 to perform feature extraction and a Feature Pyramid Network (FPN) [17] to detect objects at multiple scales.
Fig. 2. YOLOv4 model architecture.

We compare the performance of the one-stage detector YOLO [1] and the two-stage detector R-CNN [5] in vehicle detection. The latest versions, YOLOv4 [3] and Faster R-CNN [7], are selected due to their high precision and short inference time in object detection, which are two critical factors for AVs when sensing the surrounding environment.

III. METHODOLOGY

In this section, we explain our methodology in detail. First, we calibrated the camera with OpenCV. Then we introduce the object detectors we applied. We first looked into the YOLO [1] algorithms due to their proven performance and speed in real-time object detection. Then, we utilized the two-stage object detector Faster R-CNN [7], which can achieve high precision in object detection. Our vision-based distance estimation method is illustrated at the end of the section.

A. Camera Calibration

Some pinhole cameras introduce significant distortion into images. Fig. 3 shows an example of a chessboard image. The first image is the original, distorted image: all the expected straight lines are bulged out and appear curved. In order to calibrate the image, we use the calibration functions provided by OpenCV [26] and a set of chessboard images. We found the corners of the chessboard with the findChessboardCorners() function, as shown in the second image, and calibrated the image into an undistorted image with the calibrateCamera() function. After calibration, the function returns the distortion coefficients and the camera matrix, which can be used to calibrate other images. The right image shows the undistorted/corrected chessboard.

Fig. 3. Camera calibration.

The camera matrix in Eq. (1) is an intrinsic parameter of the camera. It maps the coordinates of points in 3D space to coordinates on the image pixels. We will use it later in distance estimation.

$$\text{camera matrix } M = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \qquad (1)$$

where (fx, fy) is the focal length and (cx, cy) is the optical center.
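The calibration step described above can be reproduced with a few OpenCV calls. The sketch below is a minimal example written for illustration; the chessboard size (9 × 6 inner corners) and the image paths are assumptions, not values reported in the paper.

```python
import glob
import cv2
import numpy as np

# Object points for a 9x6 chessboard (assumed board size), lying in the z = 0 plane.
objp = np.zeros((9 * 6, 3), np.float32)
objp[:, :2] = np.mgrid[0:9, 0:6].T.reshape(-1, 2)

objpoints, imgpoints = [], []  # 3D points and their detected 2D projections
for path in glob.glob("camera_cal/*.jpg"):  # hypothetical calibration images
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, (9, 6), None)
    if found:
        objpoints.append(objp)
        imgpoints.append(corners)

# Returns the camera matrix M of Eq. (1) and the distortion coefficients.
ret, mtx, dist, rvecs, tvecs = cv2.calibrateCamera(
    objpoints, imgpoints, gray.shape[::-1], None, None)

# Undistort any road frame with the recovered intrinsics.
frame = cv2.imread("test_frame.jpg")          # hypothetical input frame
undistorted = cv2.undistort(frame, mtx, dist, None, mtx)
```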
B. YOLOv4

YOLOv4 [3] extends YOLOv3 [2] with additional processing to improve detection precision and shorten the inference time. The network structure of YOLOv4 consists of three parts: the head, the backbone and the neck. An overview of the model architecture is shown in Fig. 2.

Fig. 4. YOLO model architecture [1].

Head: YOLOv3. The head of an object detector is responsible for classification and localization. YOLO is a unified real-time object detection framework as shown in Fig. 4. YOLO divides the input image into S × S grid cells. If the center of an object falls in a grid cell, that cell is responsible for predicting the object. For example, the center of the dog in Fig. 4 is located in the second column and the fifth row, so that grid cell is used for detecting the dog. Each grid cell predicts B bounding boxes, and each bounding box is associated with five parameters: the (x, y) coordinates, width (w), height (h) and a confidence score (c). The confidence score is equal to Pr(Object) × IoU_pred^truth, where the first part reflects how likely the box is to contain an object and the second part, the Intersection over Union (IoU), reflects how accurately the bounding box fits the object.
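The confidence score above involves the Intersection over Union. The small helper below shows how the IoU of two axis-aligned boxes is computed; it is a generic sketch, not code from the paper, and assumes boxes in (x1, y1, x2, y2) corner format.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# Example: a predicted box against a ground-truth box.
print(iou((50, 50, 150, 150), (60, 60, 170, 160)))  # ~0.63
```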
Fig. 5. The Darknet53 model. The model's layers are shown with filter and kernel sizes and output dimensions.

Backbone: CSPDarknet53. The backbone is used to extract features from the input images. YOLOv4 applies the Cross Stage Partial Network (CSPNet) [22] to Darknet53. YOLOv3 uses Darknet53, which has 53 convolutional layers as shown in Fig. 5. The network applies residual blocks and is mainly composed of 3 × 3 and 1 × 1 filters with skip connections like the residual network in ResNet [9]. Darknet53 performs as well as ResNet-101 but is more efficient in terms of detection speed and memory usage. The final output layer has a dimension of S × S × [B × (5 + num_class)], because the image is split into S × S grid cells and each cell predicts B bounding boxes, each having five parameters, together with the probabilities of each class.
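As a concrete check of the output dimension S × S × [B × (5 + num_class)] quoted above, the following lines compute it for one commonly used YOLO configuration (S = 13, B = 3, 80 COCO classes); these particular values are illustrative assumptions, not settings reported in the paper.

```python
S, B, num_class = 13, 3, 80            # assumed grid size, boxes per cell, classes
per_box = 5 + num_class                # x, y, w, h, confidence + class probabilities
output_shape = (S, S, B * per_box)
print(output_shape)                    # (13, 13, 255)
```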
In a convolution block, CSPNet splits the feature map into two parts, x' and x''. x' is directly linked to the end of the stage, while x'' goes through the convolution block and is then concatenated with x'. This reduces the computational complexity, because only part of the feature maps goes through the convolution blocks.

Neck: SPP and PAN. The neck is added between the backbone and the head so that objects can be detected at different scales and spatial resolutions. YOLOv4 applies Spatial Pyramid Pooling (SPP) [23] to obtain a multi-scale perception. As shown in Fig. 2 (green rectangle), the SPP applies maximum pooling with pool sizes of 1 × 1, 5 × 5, 9 × 9 and 13 × 13. The Path Aggregation Network (PAN) [24] is applied to extract features in a hierarchical structure. It has the same top-down path with lateral connections as the Feature Pyramid Network (FPN) [17] to propagate a hierarchy of features, and it additionally augments the network with a bottom-up path so that bottom-layer features propagate to the upper layers. However, instead of adding the feature maps, YOLOv4 concatenates them after the PAN.
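The SPP block described above can be sketched in a few lines of PyTorch: parallel max pooling at the three larger window sizes (stride 1, with padding so the spatial size is preserved) plus the identity 1 × 1 branch, followed by concatenation along the channel axis. This is a minimal illustration under those assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    """Spatial Pyramid Pooling as used in YOLOv4-style necks (sketch)."""
    def __init__(self, pool_sizes=(5, 9, 13)):
        super().__init__()
        # Stride-1 max pooling with padding k//2 keeps the feature-map size unchanged.
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in pool_sizes)

    def forward(self, x):
        # The 1x1 branch is the input itself; concatenate all branches along channels.
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

features = torch.randn(1, 512, 19, 19)      # assumed backbone output shape
print(SPPBlock()(features).shape)           # torch.Size([1, 2048, 19, 19])
```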
extract features in a hierarchical structure. It has the same top- of the top-left corner in the feature map and (h, w) are the height
down path with lateral connections as the Feature Pyramid and width respectively. The proposal with size (h, w) is divided
Network (FPN) [17] to propagate a hierarchy of features. It also into an (H, W) grid, and then the maximum value of each grid
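To make the "9 anchors per location" concrete, the snippet below enumerates anchor widths and heights for one feature-map location from 3 scales and 3 aspect ratios. The specific scale values (128, 256, 512 pixels) follow the common Faster R-CNN convention and are an assumption here, not a number taken from the paper.

```python
import itertools

scales = (128, 256, 512)           # assumed anchor side lengths in pixels
aspect_ratios = (0.5, 1.0, 2.0)    # height / width ratios

anchors = []
for scale, ratio in itertools.product(scales, aspect_ratios):
    # Keep the anchor area close to scale^2 while changing its aspect ratio.
    w = scale / ratio ** 0.5
    h = scale * ratio ** 0.5
    anchors.append((round(w), round(h)))

print(len(anchors))  # 9 anchors per feature-map location
print(anchors)       # e.g. (181, 91), (128, 128), (91, 181), ...
```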
RoI Pooling. After the RPN, the generated region proposals are fed into two separate fully connected networks for further classification and bounding box regression. The region proposals generated by the RPN are of different sizes, but the inputs of the fully connected networks must have the same size. Therefore, Region of Interest (RoI) pooling is applied to warp the region proposals into a fixed size of H × W. Each proposal has four localization parameters (r, c, h, w), where (r, c) is the coordinate of the top-left corner in the feature map and (h, w) are the height and width respectively. The proposal of size (h, w) is divided into an (H, W) grid, and then the maximum value of each grid cell is computed.
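This RoI pooling step can be reproduced directly with torchvision's roi_pool operator, which performs exactly this grid-wise max over each proposal. The feature-map shape, the proposal coordinates and the output size (H, W) = (7, 7) below are illustrative assumptions.

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 256, 50, 50)          # assumed backbone output (N, C, H, W)
# Proposals as (batch_index, x1, y1, x2, y2) in feature-map coordinates.
proposals = torch.tensor([[0, 4.0, 4.0, 28.0, 20.0],
                          [0, 10.0, 15.0, 45.0, 40.0]])

# Each proposal is divided into a 7x7 grid and max-pooled per grid cell.
pooled = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=1.0)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])
```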
Finally, the modified proposals, now all of size (H, W), are fed into the classifier and regressor networks. The classifier predicts which class the proposal belongs to and the regressor computes a precise bounding box.

D. Distance Estimation

To estimate the vehicle distance, we were inspired by a Udacity project [27]. We extended that project and propose a vision-based distance estimation method.

Fig. 7. Region-of-Interest selection for calculating the vanishing point.

We estimate the distance of vehicles in the bird's-eye view, because there each pixel is equally spaced in world coordinates. First, we find the vanishing point, which can be used to warp the image into the bird's-eye view precisely. We chose a Region of Interest (RoI) and masked the rest of the background as shown in Fig. 7. We kept the RoI which contains the driving lane in front of the car. The reason we chose this area is that the lane lines can be approximated as straight lines, which helps in calculating more accurate vanishing point coordinates.

Fig. 8. The yellow dot in the figure is the vanishing point of the lane. The red lines are the lane lines generated by the Hough Transform.

After getting the RoI, we applied the Canny edge detector and the Hough Transform (HT) [12] to detect the lane lines as shown in Fig. 8. The HT returns the coordinates of the two end points of each line. The vanishing point is the point nearest to all these detected lines. We can get the coordinates of the vanishing point by using Eq. (2), which was obtained through a geometric derivation and can be used to calculate the nearest point to multiple lines in a 2D image.

$$p(x_0, y_0) = \Big(\sum_{i=1}^{k} n_i n_i^T\Big)^{-1} \Big(\sum_{i=1}^{k} n_i n_i^T p_i\Big) \qquad (2)$$

where p(x0, y0) is the vanishing point coordinate of the lane, k is the number of lines detected by the HT, ni are the unit normal vectors of the lines, and pi are points on the lines returned by the HT. As shown in Fig. 8, the yellow dot is the vanishing point of the lane. Using the vanishing point, we can transfer the image to the bird's-eye view precisely.
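The lane-line detection and the least-squares solution of Eq. (2) can be sketched as follows with OpenCV and NumPy. The Canny thresholds, the Hough parameters and the RoI mask are placeholders chosen for illustration; the paper does not report the exact values.

```python
import cv2
import numpy as np

def vanishing_point(image, roi_mask):
    """Estimate the lane vanishing point from Hough line segments (Eq. (2))."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)                      # assumed thresholds
    edges = cv2.bitwise_and(edges, roi_mask)              # keep only the driving-lane RoI
    segments = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=50,
                               minLineLength=40, maxLineGap=20)

    A = np.zeros((2, 2))
    b = np.zeros(2)
    for x1, y1, x2, y2 in segments[:, 0]:
        d = np.array([x2 - x1, y2 - y1], dtype=float)
        n = np.array([-d[1], d[0]]) / np.linalg.norm(d)   # unit normal of the line
        p = np.array([x1, y1], dtype=float)               # a point on the line
        A += np.outer(n, n)                               # accumulate n_i n_i^T
        b += np.outer(n, n) @ p                           # accumulate n_i n_i^T p_i
    return np.linalg.solve(A, b)                          # (x0, y0)
```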

Fig. 9. Transfer of the Region-of-Interest to the bird's-eye view.

Based on the vanishing point, we transferred the area in front of the vehicle, the blue trapezoid (left image in Fig. 9), to a rectangle (right image in Fig. 9) by using the getPerspectiveTransform() function in OpenCV [26]. From this transformation we get a 3 × 3 homography matrix H, as shown in Eq. (3), which is then used to transform points from the image view to the bird's-eye view.

$$\begin{bmatrix} u_w \\ v_w \\ 1 \end{bmatrix} = H \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \qquad (3)$$

where u and v are the coordinates in the image view, and uw and vw are the coordinates in the bird's-eye view.
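A minimal version of this warping step is shown below. The four source points defining the trapezoid would normally be derived from the vanishing point and the image size; the specific pixel values and file names here are placeholders, not values from the paper.

```python
import cv2
import numpy as np

frame = cv2.imread("test_frame.jpg")               # hypothetical undistorted road frame

# Trapezoid in the original view (placeholder corners) and the rectangle it maps to
# in the bird's-eye view, both given as (x, y) pixel points.
src = np.float32([[560, 460], [720, 460], [1180, 720], [100, 720]])
dst = np.float32([[300, 0], [980, 0], [980, 720], [300, 720]])

H = cv2.getPerspectiveTransform(src, dst)          # 3x3 homography of Eq. (3)
birds_eye = cv2.warpPerspective(frame, H, (1280, 720))

# Map a single image point (e.g. the bottom midpoint of a bounding box).
point = np.array([[[640.0, 600.0]]], dtype=np.float32)
u_w, v_w = cv2.perspectiveTransform(point, H)[0, 0]
```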
The camera matrix M obtained in the camera calibration maps the coordinates of points in 3D space to coordinates on the image pixels. Therefore, using the homography matrix and the camera matrix, we can relate world coordinates to pixel coordinates in the bird's-eye view using Eq. (4).

$$\begin{bmatrix} u_w \\ v_w \\ 1 \end{bmatrix} = HM \begin{bmatrix} X_w \\ Y_w \\ 1 \end{bmatrix} = \begin{bmatrix} r_x & 0 & c_x \\ 0 & r_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ 1 \end{bmatrix} \qquad (4)$$

where uw and vw are the coordinates in the bird's-eye view and Xw and Yw are the camera coordinates. We can also write the transformation in the second form of Eq. (4), where rx and ry are the pixels-per-meter along the x-axis and y-axis. Therefore, we can get the relationship between rx and ry shown in Eq. (5),

$$r_y = r_x \, \frac{\lVert h_1 \rVert}{\lVert h_2 \rVert} \qquad (5)$$

where h1 and h2 are the first and second columns of HM^-1.

We transformed the images into bird's-eye-view images to estimate the vehicle distance as shown in Fig. 9, without binarization. We used the standard lane width of 12 feet (3.658 m) to estimate the distance. In Fig. 9, the road width in the images corresponds to this 12 feet (3.658 m), from which we can get the pixels-per-meter along the x-axis. Therefore, we can estimate the distance along the y-axis with Eq. (5).

After getting the predicted bounding boxes from the object detectors, we can calculate the Euclidean distance between a detected vehicle and our vehicle. The distance is measured from the midpoint of the bottom edge of the bounding box to the midpoint of our car's front (the red lines in Fig. 10).

Fig. 10. Bird's-eye view of the road for distance estimation.
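Putting the pieces together, the sketch below estimates the forward distance to a detected vehicle from its bounding box, the homography H of Eq. (3) and the pixels-per-meter scales of Eq. (5). The function signature, variable names and the ego reference point are assumptions made for illustration, not the authors' code.

```python
import cv2
import numpy as np

def vehicle_distance(bbox, H, r_x, r_y, ego_point_bev):
    """Distance in meters from our car front to a detected vehicle.

    bbox: (x1, y1, x2, y2) in image coordinates.
    H: 3x3 homography from image view to bird's-eye view.
    r_x, r_y: pixels-per-meter in the bird's-eye view (Eq. (5)).
    ego_point_bev: midpoint of our car front in bird's-eye-view pixels.
    """
    x1, y1, x2, y2 = bbox
    # Midpoint of the bounding box's bottom edge, warped to the bird's-eye view.
    bottom_mid = np.array([[[(x1 + x2) / 2.0, y2]]], dtype=np.float32)
    u_w, v_w = cv2.perspectiveTransform(bottom_mid, H)[0, 0]

    # Convert the pixel offsets to meters and take the Euclidean distance.
    dx = (u_w - ego_point_bev[0]) / r_x
    dy = (v_w - ego_point_bev[1]) / r_y
    return float(np.hypot(dx, dy))
```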
IV. EXPERIMENTS

A. Dataset

We used pretrained YOLOv4 and Faster R-CNN models, which were trained and validated on the MS COCO dataset [8]. We tested our vehicle detection models on a more challenging 16-second video clip [27]. The video has 30 frames per second and a resolution of 1,280 × 720. It offers a greater challenge because the divider between the two opposite sides of the road covers parts of the vehicles on the other side.
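The paper does not state which framework was used to run the pretrained detectors. As one concrete possibility, the sketch below loads a COCO-pretrained Faster R-CNN from torchvision and keeps only the vehicle classes; the confidence threshold of 0.5 is an assumption.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# COCO-pretrained two-stage detector (one possible stand-in for the paper's model).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True).eval()

VEHICLE_CLASSES = {3: "car", 6: "bus", 8: "truck"}  # COCO category ids

def detect_vehicles(image_bgr, score_threshold=0.5):
    """Return [(label, score, (x1, y1, x2, y2)), ...] for vehicles in one frame."""
    image_rgb = image_bgr[:, :, ::-1].copy()            # OpenCV frames are BGR
    with torch.no_grad():
        pred = model([to_tensor(image_rgb)])[0]
    detections = []
    for label, score, box in zip(pred["labels"], pred["scores"], pred["boxes"]):
        if label.item() in VEHICLE_CLASSES and score.item() >= score_threshold:
            detections.append((VEHICLE_CLASSES[label.item()], score.item(),
                               tuple(box.tolist())))
    return detections
```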
Precision, recall, and F1-measure are utilized to evaluate our vehicle detectors, as defined in Eq. (6), (7) and (8) respectively.

$$\text{precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \qquad (6)$$

$$\text{recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \qquad (7)$$

$$F1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (8)$$
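Eqs. (6)-(8) translate directly into a small helper that turns true/false positive and false negative counts into the scores reported in Table I; it is a generic utility written here for clarity, not the authors' evaluation script.

```python
def detection_metrics(true_pos, false_pos, false_neg):
    """Precision, recall and F1 from raw detection counts (Eqs. (6)-(8))."""
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

# Example with arbitrary counts.
print(detection_metrics(true_pos=180, false_pos=4, false_neg=60))
# (0.978..., 0.75, 0.849...)
```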
B. Implementation Details

We used Ubuntu 18.04 with an Intel Xeon Gold 6130 CPU, a Tesla V100 GPU with 32 GB RAM, CUDA v10.1 and cuDNN v9.1. The speed is evaluated with batch size 1. We obtained a vehicle detection speed of around 60 fps and 14 fps with YOLOv4 and Faster R-CNN respectively.

C. Vehicle Detection and Distance Estimation

In order to evaluate the detection results, we saved the frames from the video. For the 16-second video clip, we got 473 images. One frame with the object recognition results is shown in Fig. 11. The left and right images in the figure show the results of object recognition from Faster R-CNN and YOLOv4 respectively.

The evaluation results are shown in Table I. YOLOv4 was run with three different input shapes: 416 × 416, 512 × 512 and 608 × 608. We can see that YOLOv4 has higher precision, while Faster R-CNN has significantly higher recall and F1-score than YOLOv4. When we only count the vehicles travelling in the same direction, YOLOv4 gets 98.04% precision and 70.49% recall, while Faster R-CNN achieves 95.00% precision and 87.55% recall. Counting the vehicles in both directions on both sides of the road, the mean recall values are 67.86% and 79.50%, and the mean precisions are 99.16% and 95.47% respectively for these two models.

Fig. 11. Examples of the results. Left image is the result of Faster R-CNN, and the right image is the result of YOLOv4.
TABLE I. VEHICLE DETECTION RESULTS WITHIN 100 METERS

One Side of the Road
Model           Precision   Recall    F1        FPS
YOLOv4 - 416    100.00%     58.13%    71.54%    68
YOLOv4 - 512    98.75%      66.04%    76.56%    63
YOLOv4 - 608    98.04%      70.49%    79.68%    60
Faster R-CNN    95.00%      87.55%    89.43%    14

Both Sides of the Road
Model           Precision   Recall    F1        FPS
YOLOv4 - 416    98.18%      56.32%    70.67%    68
YOLOv4 - 512    98.07%      64.91%    77.15%    63
YOLOv4 - 608    99.16%      67.86%    79.36%    60
Faster R-CNN    95.47%      79.50%    85.54%    14

In general, object recognition tasks aim to produce fewer false negatives and thus obtain a higher recall. Faster R-CNN consistently gets higher recall scores but has a longer inference time than YOLOv4. We can see that if we count the vehicles on both sides of the road, YOLOv4 with a 608 × 608 input size can achieve a comparable score at a real-time detection speed.

V. CONCLUSION

In this study, we applied two models, YOLOv4 and Faster R-CNN, for vehicle detection in the autonomous vehicle paradigm. We also proposed a vision-based approach to estimate the distance of the vehicles in the forward direction. We evaluated both models for a bounded distance of 100 m, which is practical and acceptable for avoiding collisions for autonomous vehicles. In detecting vehicles within 100 meters, YOLOv4 and Faster R-CNN achieved 99.16% and 95.47% mean precision as well as 79.36% and 85.54% F1-measure, with a detection speed of 68 fps and 14 fps respectively, on a two-way road. The models achieved greater accuracy when detecting vehicles only on the same side of the road: YOLOv4 and Faster R-CNN achieved 98.04% and 95.00% precision as well as 70.49% and 87.55% recall respectively. The middle road divider reduced the accuracy when we tried to detect vehicles on both sides of the road. YOLOv4 detected vehicles at 68 fps, which is suitable for real-time vehicle detection. We also tested YOLOv3, a real-time detector with a 78 fps detection speed, but its recall was 5% lower than that of YOLOv4.

Our ongoing work focuses on training the YOLOv4 model on other autonomous vehicle datasets containing traffic information such as traffic signs, pedestrians and cyclists. We also aim to find other datasets that can help evaluate our distance estimation method and to compare it with deep learning-based depth estimation algorithms.

REFERENCES

[1] Redmon, J., Divvala, S., Girshick, R. and Farhadi, A., 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 779-788).
[2] Redmon, J. and Farhadi, A., 2018. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.
[3] Bochkovskiy, A., Wang, C.Y. and Liao, H.Y.M., 2020. YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934.
[4] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y. and Berg, A.C., 2016. SSD: Single shot multibox detector. In European conference on computer vision (pp. 21-37). Springer, Cham.
[5] Girshick, R., Donahue, J., Darrell, T. and Malik, J., 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 580-587).
[6] Girshick, R., 2015. Fast R-CNN. In Proceedings of the IEEE international conference on computer vision (pp. 1440-1448).
[7] Ren, S., He, K., Girshick, R. and Sun, J., 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (pp. 91-99).
[8] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P. and Zitnick, C.L., 2014. Microsoft COCO: Common objects in context. In European conference on computer vision (pp. 740-755). Springer, Cham.
[9] He, K., Zhang, X., Ren, S. and Sun, J., 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
[10] He, K., Gkioxari, G., Dollár, P. and Girshick, R., 2017. Mask R-CNN. In Proceedings of the IEEE international conference on computer vision (pp. 2961-2969).
[11] Cai, Z. and Vasconcelos, N., 2018. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6154-6162).
[12] Duda, R.O. and Hart, P.E., 1972. Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM, 15(1), pp. 11-15.
[13] Lowe, D.G., 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), pp. 91-110.
[14] Dalal, N. and Triggs, B., 2005. Histograms of oriented gradients for human detection. In 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05) (Vol. 1, pp. 886-893). IEEE.
[15] Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).
[16] Zhang, S., Wen, L., Bian, X., Lei, Z. and Li, S.Z., 2018. Single-shot refinement neural network for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4203-4212).
[17] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B. and Belongie, S., 2017. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2117-2125).
[18] Everingham, M., Van Gool, L., Williams, C.K., Winn, J. and Zisserman, A., 2010. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), pp. 303-338.
[19] Self-driving car, Wikipedia, 2 March 2020, accessed March 2020. <https://en.wikipedia.org/wiki/Self-driving_car>.
[20] Adaptive cruise control, Wikipedia, 20 February 2020, accessed March 2020. <https://en.wikipedia.org/wiki/Adaptive_cruise_control>.
[21] Emergency brake assist, Wikipedia, 29 September 2019, accessed March 2020. <https://en.wikipedia.org/wiki/Emergency_brake_assist>.
[22] Wang, C.Y., Mark Liao, H.Y., Wu, Y.H., Chen, P.Y., Hsieh, J.W. and Yeh, I.H., 2020. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops (pp. 390-391).
[23] He, K., Zhang, X., Ren, S. and Sun, J., 2015. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9), pp. 1904-1916.
[24] Liu, S., Qi, L., Qin, H., Shi, J. and Jia, J., 2018. Path aggregation network for instance segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 8759-8768).
[25] Simonyan, K. and Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[26] OpenCV dev team, 31 Dec 2019, accessed March 2020. <https://docs.opencv.org/master/index.html>.
[27] Udacity, CarND Advanced Lane Lines project, accessed July 2020. <https://github.com/udacity/CarND-Advanced-Lane-Lines>.

