Detecting Vehicles Using YOLOv8n in Edge Computing Dashcam

1st Aura Syafa Aprillia Radim, 2nd Muchammad 'Irfan Chanif Rusydi, 3rd Surya Michrandi Nasution
School of Electrical Engineering, Telkom University, Bandung, Indonesia
[email protected], [email protected], [email protected]
Abstract—A dashcam is a camera placed on the dashboard of a vehicle. This device's function is to capture footage of all events in front of the vehicle. Security and safety have become a significant concern in various sectors, including transportation and public roads. Traffic accidents caused by drivers' ignorance of objects around the vehicle are still a severe problem on the highway. In this study, a simple dashcam built from an edge computer was developed. By adding a camera, the dashcam is able to detect vehicles ahead: whenever vehicles appear in view, they are detected using an object detection method called YOLOv8. This research is expected to be one step in a proof-of-concept of the development of an Intelligent Transportation System suited to traffic conditions in Indonesia. This paper simulates and tests the GPU usage of the edge computing device. Even though YOLOv8n scores 6.29, 9.11, 6.05, and 0.24 points lower than YOLOv7-tiny in precision, recall, mAP50, and mAP50-95 respectively, it uses only about half the computational cost of YOLOv7-tiny. This shows that YOLOv8n is suitable as a detection method on an edge computing device. In the inference time tests, objects in an image can be detected in 65-500 ms depending on the power supplied to the computer, which means the system is able to infer objects at 2 to 15.38 frames per second.

Keywords—ADAS, dashcam, object detection, You Only Look Once, YOLOv8
I. INTRODUCTION

Security and safety have become a significant concern in various sectors, including transportation and public security. Traffic accidents caused by drivers' ignorance of objects around the vehicle are still serious problems on the highway. Intelligent and effective object detection technology is increasingly important in monitoring ahead and approaching traffic [1].

A dashboard camera (dashcam) is placed on a vehicle's dashboard. This device usually serves to record all events in front of the vehicle. The dashcam is one of the devices whose demand proliferates in the market. Currently, a dashcam serves as a device to record events in front of the vehicle, and the recording is then used as evidence of an accident [2], so that it can later serve as proof for insurance claims. However, dashcams could also be used for other purposes, such as one of the components of an Advanced Driver Assistance System (ADAS).

Many researchers have been incorporating dashcam footage in order to classify crash and near-crash conditions [3] and road damage [4]. Some researchers have also applied deep learning methods to traffic scenes [5], [6]. However, the detection process was mainly carried out on on-premises computers or a smartphone and had high latency. To prevent accidents among vehicles, the detection process needs to be near real-time with low latency.

In this paper, a solution to prevent accidents among vehicles is proposed by implementing an object detection system on an edge computing device, or single-board computer (SBC). This SBC works as the main processing unit using a GPU that is preinstalled in it. Objects in front of the dashboard camera are detected by using the Convolutional Neural Network method.

The content of this paper is organized as follows. Section II presents the literature review. Section III presents the system design. Section IV presents experimentation results and a discussion of the process of developing the system. Finally, Section V explains the conclusion and future work of this study.

II. LITERATURE REVIEW

A. Dashcam

Dashcams have been widely used by drivers to record traffic on the road. Many believe that a dashcam is an essential part of a vehicle. There are several reasons to install a dashcam, including insurance corporations' favorable rates for drivers and the collection of material that can be used as evidence in legal procedures [7]. To such a degree, both the Chinese and South Korean governments oblige public transportation and commercial vehicles to install a dashcam to assist in investigating traffic accidents [8]. Despite this widely known importance, only some utilize dashcams for purposes other than recording the road and traffic conditions.

B. Object Detection

Object detection is one of the essential tasks in the computer vision field, mainly dealing with detecting instances of visual objects and then categorizing them into several classes [9]. With this kind of identification and localization, object detection can be used to count objects in a scene and determine and track their precise locations, all while accurately labeling them. It has been widely used for recognizing faces [10], vehicles [11], counting pedestrians [12], securing systems [13], implementing autonomous cars [14], etc. Object detection has undergone many changes and
developments in the past twenty years [9]. It is commonly divided into two periods: (1) traditional object detection and (2) deep learning-based object detection.

In 2012, Krizhevsky et al. proposed a deep convolutional network trained on a subset of ImageNet [15], called AlexNet. It was the forerunner of the YOLO model. A year later, Girshick et al. proposed a new object detection framework called R-CNN [16]. It combined region proposals with CNNs to detect objects in images. Since then, the object detection research and development field has been advancing rapidly, with new models, datasets, and techniques constantly emerging.
Following the lineage started by AlexNet, the YOLO (You Only Look Once) model was introduced in 2015 [17]. The original base YOLO model can achieve 45 frames per second. Redmon et al. also released a smaller version of YOLO, called Fast YOLO, which can achieve 155 frames per second. According to Redmon et al., YOLO outperforms DPM and R-CNN on the Picasso Dataset and People-Art Dataset [17]. In 2022, Wang et al. released YOLOv7 [18]. A year later, as of January 2023, YOLOv8 was introduced by Ultralytics [19], the same software company that released YOLOv3 and YOLOv5. As of now, YOLOv8 is one of the latest state-of-the-art (SOTA) open-source object detection models.
C. COCO Dataset

The COCO dataset is a large-scale object detection, segmentation, and keypoint dataset. In total, the Microsoft Common Objects in COntext dataset contains 91 common object categories, with 82 of them having more than 5,000 labeled instances [20]. The first version of the COCO dataset had 124,000 images, divided into training and validation sets: the training set consists of 83,000 images, and the remaining images are used as the validation set. Later, the COCO dataset grew in size; in 2017, the dataset contained more than 330,000 images. Owing to its size, the COCO dataset is widely used by state-of-the-art object detection models for training and evaluating model performance.

Despite its popularity, the COCO dataset has some drawbacks. Although the COCO dataset contains quite a large number of image classes, it has an imbalanced class distribution. The total number of annotated objects for the person class is 64,115, while the hair dryer class has only around 100. Additionally, for this research, the model does not need to detect anything other than objects that relate to traffic.
D. Performance Metrics

The performance of YOLOv8n is measured by calculating the Precision (P), Recall (R), and mean Average Precision (mAP). The mAP is the average of the Average Precision (AP) over all classes, where AP is the area under the precision-recall curve. The mAP is compared both averaged over Intersection over Union (IoU) thresholds from .50 to .95 with .05 increments (the MS COCO standard metric, abbreviated as mAP50-95) and at a single IoU threshold of .50 (the PASCAL VOC metric, abbreviated as mAP50) [7].

Precision and Recall in this experiment are relative measures because they depend on the threshold value. Precision is calculated by dividing the True Positives (TP) by the sum of True Positives (TP) and False Positives (FP). Meanwhile, Recall is calculated by dividing the True Positives (TP) by the sum of True Positives (TP) and False Negatives (FN). The formulations for precision and recall are shown in (1) and (2), respectively.

$$P = \frac{TP}{TP + FP} \quad (1)$$

$$R = \frac{TP}{TP + FN} \quad (2)$$

The threshold value determines whether a prediction is Positive or Negative. For example, if the threshold is 0.5, then a prediction is positive if the Intersection over Union (IoU) is greater than 0.5; otherwise, the prediction is negative. Both Precision and Recall are relative to the threshold value and are usually presented in the form of a Precision-Recall Curve. The Average Precision is obtained by calculating the area under this curve, and it must be calculated for each class. In (3), N is the number of classes in the COCO dataset.

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \quad (3)$$
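To make these definitions concrete, the following minimal Python sketch (illustrative only, not the paper's implementation; all function names are assumptions) evaluates (1)-(3) from matched detection counts and per-class AP values.

# Minimal sketch of the metrics in (1)-(3); names are illustrative,
# not taken from the paper's code.

def precision(tp: int, fp: int) -> float:
    # (1): P = TP / (TP + FP)
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp: int, fn: int) -> float:
    # (2): R = TP / (TP + FN)
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

def iou(box_a, box_b) -> float:
    # Boxes are (x1, y1, x2, y2); IoU against a threshold (e.g. 0.5)
    # decides whether a detection counts as TP or FP.
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def mean_average_precision(ap_per_class):
    # (3): mAP = (1/N) * sum of AP_i over the N classes.
    return sum(ap_per_class) / len(ap_per_class)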
E. Edge Computing Devices

Nasution et al. proposed a system that can detect vehicles and street lanes [21]. The image feed was captured using a smartphone and then processed using the ImageAI library with RetinaNet for COCO as the object detection model. In their study, the process of object detection was done by using an on-premises computer.

Nasution and Dirgantara improved the system by building the ADAS on an edge computing device, using a Raspberry Pi 4 as the main processing unit [22]. Unfortunately, their results only managed to reach 0.9 FPS. Based on this lack of FPS, this study focused on finding a more robust edge computer.

III. SYSTEM DESIGN

In this section, the system designed for this study is discussed. Overall, the proposed system consists of 3 stages, as follows: (1) the image stream collection from the camera's feed, (2) the image processing stage, using an edge computer, and (3) the display of the detection results on the ADAS's screen. Fig. 1 shows the flowchart of the proposed method of detecting traffic objects.

Fig. 1. System Flowchart
First, the camera captures images frame by frame, followed by preprocessing of the images before feeding them to the YOLO model. The YOLO model provides the bounding box coordinates and the probability of each class for all detected objects. The bounding box and class are then drawn on the image, and the annotated image is displayed on the screen.
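As a rough illustration of this loop, the sketch below uses OpenCV together with the Ultralytics YOLOv8 API; the camera index, weight file, and window name are assumptions, not the authors' actual code.

# Illustrative capture -> detect -> draw -> display loop; assumes the
# ultralytics and opencv-python packages are installed.
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")    # pretrained YOLOv8n weights (assumed file)
cap = cv2.VideoCapture(0)     # dashcam feed, assumed on camera index 0

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model(frame)           # bounding boxes + class probabilities
    annotated = results[0].plot()    # draw boxes and class labels
    cv2.imshow("ADAS", annotated)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()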
A. Image Feed

The image is gathered by using the camera mentioned earlier. Fig. 2 shows a sample of the image gathered from the camera. In the proposed system, there are several classes (vehicles) to be recognized by using an object detection method. Later, the dataset is filtered into classes that are related to traffic conditions.

Fig. 2. Image Sample from Collected Footage
As mentioned in the previous section, the COCO dataset contains 80 classes. Only traffic-related object classes are needed in this research, namely cars, trucks, buses, motorcycles, bicycles, traffic lights, stop signs, trains, hydrants, cats, and dogs. There are 78,663 images in the training split, across 12 classes, as shown in Fig. 3. This number was reduced from the 122,125 images of the original dataset. The training time of the model is expected to be reduced by decreasing the size of the dataset. As seen in Fig. 3, the number of motorcycle instances is lower than the number of car instances.

Fig. 3. Filtered COCO Dataset Class Count
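One possible way to derive such a filtered subset is sketched below with pycocotools; the annotation path is an assumption, and since the text lists eleven class names while Fig. 3 counts twelve, the exact class list here is hypothetical.

# Illustrative COCO class filtering (assumes pycocotools and a local
# COCO 2017 annotation file at the path below).
from pycocotools.coco import COCO

TRAFFIC_CLASSES = [
    "car", "truck", "bus", "motorcycle", "bicycle", "traffic light",
    "stop sign", "train", "fire hydrant", "cat", "dog",
]  # hypothetical list; the paper's exact 12-class set may differ

coco = COCO("annotations/instances_train2017.json")
cat_ids = coco.getCatIds(catNms=TRAFFIC_CLASSES)

# getImgIds() with several catIds returns their intersection, so take
# the union over single categories instead.
img_ids = set()
for cat_id in cat_ids:
    img_ids.update(coco.getImgIds(catIds=[cat_id]))

print(f"kept {len(img_ids)} of {len(coco.getImgIds())} images")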
C. Training The Model

Before training the model on YOLOv8n-Traffic with a high number of epochs, the optimal hyperparameter configuration must be defined first. Various training configurations were run with a small number of epochs to save time; each training configuration needs up to 2 hours, even though the number of epochs is only 8. In training the model, various batch sizes and optimizers were tested.

The batch sizes used in this paper are 16 and 32, and the optimizers used are SGD and Adam. Fig. 4 shows the results of training with different batch sizes and optimizers. It can be seen that batch size 32 performs better than batch size 16, especially in the stage of model training. The best combination of batch size and optimizer is training with 8 epochs using batch size 32 and the SGD optimizer. Meanwhile, Fig. 5 shows a comparison of the GPU utilization of the SGD optimizer when using batch sizes of 16 and 32.
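A minimal sketch of such a configuration sweep with the Ultralytics trainer is shown below; the dataset YAML file and run names are hypothetical, not the paper's actual configuration.

# Illustrative hyperparameter sweep over batch size and optimizer;
# assumes ultralytics is installed and "traffic.yaml" describes the
# filtered dataset (hypothetical file).
from ultralytics import YOLO

for batch in (16, 32):
    for optimizer in ("SGD", "Adam"):
        model = YOLO("yolov8n.pt")
        model.train(
            data="traffic.yaml",
            epochs=8,                 # short runs, as in the paper
            batch=batch,
            optimizer=optimizer,
            name=f"yolov8n_b{batch}_{optimizer}",
        )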
D. Edge Processing Unit

In order to improve the frame rate, the edge computer used as the main processing unit in this paper is the Jetson Nano. It is powered by a quad-core ARM Cortex-A57 CPU with a 128-core Maxwell GPU. The GPU's CUDA cores allow the system to accelerate the deep learning model. The availability of a GPU is another factor that needs to be considered in order to improve the frame rate: according to Pandey et al., the GPU is a crucial component in deep learning [24], which means that an edge computer with a GPU is preferred. The object detection system runs on Python 3 and uses the PyTorch library as the deep learning framework.

There are two stages in creating the proposed system, namely the training and the inferencing stages. The model training stage was conducted on a computer with a GeForce RTX 3090 with 24 GB of video RAM, 32 GB of RAM, and an Intel Core i3-12100 (4 cores, 8 threads), running Windows 10. Meanwhile, the model inferencing and testing stage was performed on a Jetson Nano 4GB running Ubuntu 20.04.02 LTS.
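Before inferencing, it is worth confirming that PyTorch can actually reach the Jetson's GPU; a minimal check (not from the paper) is shown below.

# Quick environment check (illustrative): verify that PyTorch sees the
# Maxwell GPU before running inference on the Jetson Nano.
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))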
As shown in Fig. 6, the prototype of the dashcam is implemented in a car. It can be seen that the camera is placed behind the Jetson Nano. The main power for the system comes
from the car battery, which is connected through the cigarette lighter and a USB car charger. Whenever the camera of the dashcam is ready to capture images, testing is conducted by driving around the city of Bandung.
IV. RESULTS AND DISCUSSION

A. Object Detection

According to the filtered dataset, there are 12 kinds of objects to be detected. As shown in Fig. 7, the detected objects are limited to the filtered classes of the COCO dataset. According to the figure, several cars and a motorcycle were detected, while another motorcycle failed to be detected by the system.

Fig. 7. Object Detection Result
B. Dataset Comparison

The filtered dataset is compared to the original based on GPU utilization (%) and memory usage (%). As seen in Fig. 8 and Fig. 9, the filtered dataset has lower GPU and memory utilization, which means the filtered dataset is more efficient to train. Based on these results, the longer the model is trained, the better the results it will deliver.

Based on the formula mentioned in (3), the filtered COCO dataset also achieves a better result, as shown in Fig. 10. The calculation was conducted by simulating model training using 8 epochs. It can be seen in the figure that the mAP of the filtered dataset reaches 0.298, while the original COCO dataset (train2017) only reaches 0.135. In this training simulation, the small number of epochs was chosen to save training time.

Fig. 10. Comparison of mAP50-95 between Model Training (Filtered and COCO's Original Dataset)

C. Training Result

In this section, the results of the trained model for detecting objects are discussed. This simulation uses more epochs than the previous one: the model is trained using 80 epochs. The object detection method used in this paper is YOLOv8. YOLOv8 offers several model sizes, such as YOLOv8n as the smallest model, YOLOv8m as the medium model, and YOLOv8x as the largest model. YOLOv8n was chosen due to its small size and compatibility with edge computing devices. YOLOv8n uses 168 layers with over 3 million parameters and a computational cost of around 8 GFLOPs.

Along with the model training using YOLOv8n, the comparable model from the previous version (YOLOv7-tiny) was also trained to compare performance. Both YOLOv7-tiny and YOLOv8n are the smallest models in their respective YOLO versions. As seen in Fig. 11, YOLOv8n has the lower computational cost: in terms of GFLOPs, YOLOv8n needs only 8.7, while YOLOv7-tiny needs 13.1. From the parameter aspect, YOLOv8n needs almost half of YOLOv7-tiny's parameters. Meanwhile, YOLOv7-tiny scores 9.11, 6.29,
6.05, and 0.24 greater points for recall, precision, mAP50, and mAP50-95 respectively, compared with YOLOv8n. Though YOLOv8n has lower performance, it uses fewer GFLOPs and parameters and needs fewer layers than YOLOv7-tiny.
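The layer, parameter, and GFLOPs figures above can be reproduced with a short check against the pretrained weights; the sketch below assumes the Ultralytics package and weight file are available.

# Illustrative model-size check; model.info() prints layer count,
# parameter count, and GFLOPs for the loaded model.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.info()  # prints layers, parameters, and GFLOPs

# Equivalent raw parameter count via PyTorch:
n_params = sum(p.numel() for p in model.model.parameters())
print(f"YOLOv8n parameters: {n_params:,}")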
TABLE I. COMPARISON BETWEEN YOLOV7-TINY AND YOLOV8

Model       | Recall (%) | Precision (%) | mAP50 (%) | mAP50-95 (%)
YOLOv7-tiny | 65.7       | 77.54         | 69.85     | 46.55
YOLOv8n     | 56.59      | 71.25         | 63.8      | 46.31
Based on the results, both methods have almost similar performances. The comparison of mAP50-95 between the two smallest versions of YOLO is shown in Fig. 12. It can be seen in the figure that the mean average precision of YOLOv8n is almost similar to that of YOLOv7-tiny, as mentioned before. YOLOv8n has lower performance since it is the smallest version of YOLOv8; a bigger version may improve the performance.

Fig. 12. mAP50-95 Comparison between YOLOv8n and YOLOv7-tiny with 80 Epochs
Based on Fig. 13, it can be inferred that achieving mAP50-95 on COCO val2017 above 50% can only be reached by bigger models such as YOLOv7 and YOLOv8m. However, both of them have 4 to 8 times the parameters of YOLOv8n and YOLOv7-tiny and require 10 times the computational cost based on GFLOPs values. Adopting those bigger YOLO models was not an option for us, since we meant to do the detection process on the edge processing unit with limited computational resources.
On the other hand, the COCO dataset has an imbalanced instance count problem. The chosen model was trained on this data, which affected the performance of the model, and incorporating data collected and annotated from traffic footage only slightly fixed the problem. Although filtering the COCO dataset is essential to get better results in terms of mAP, since it removes a number of image classes that are unnecessary in this study, most of the filtered image classes still have fewer than 1,000 instances. In short, fine-tuning the model would be beneficial for the performance results.
D. Prototype Testing

Based on the prototype that has been implemented in a vehicle, the testing describes the real conditions of how the dashcam works. The power consumption of the prototype must be tested in order to know its usage; after all, the power source in a vehicle may also need to supply other devices that are charged at the same time as the edge computing device.

The device may need more power in order to infer the images faster. As shown in Table II, higher power consumption (W) yields a faster inference time. In the first power consumption test (5 W), the fastest, medium, and slowest inference times are 341, 374, and 500 ms. By increasing the power by 2 watts, the inference time becomes almost twice as fast: the fastest, medium, and slowest inference times are 158, 178, and 220 ms. Meanwhile, when the power is doubled, the inference time is almost 5 times faster: the inference times in this test (10 W) are all under 100 ms.

TABLE II. INFERENCE TIME FOR EACH POWER CONSUMPTION

Power (W) | Fastest (ms) | Medium (ms) | Slowest (ms)
5         | 341          | 374          | 500
7         | 158          | 178          | 220
10        | 65           | 70           | 81

Object detection using the Jetson Nano as the edge computing device has better results compared with the previous research mentioned earlier. The formulation for measuring the number of frames per second (FPS) is shown in (4): the FPS value is defined as 1000 divided by the inference time in milliseconds.

$$FPS = \frac{1000}{t_{inference}} \quad (4)$$
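A small helper reproducing (4) against the measured inference times is sketched below; the function name and dictionary layout are illustrative.

# FPS = 1000 / inference time (ms), as in (4).

def fps(inference_time_ms: float) -> float:
    return 1000.0 / inference_time_ms

# Measured inference times (ms) per power budget, from Table II.
measurements = {5: (341, 374, 500), 7: (158, 178, 220), 10: (65, 70, 81)}

for watts, times in measurements.items():
    rates = ", ".join(f"{fps(t):.2f}" for t in times)
    print(f"{watts} W -> FPS (fastest/medium/slowest): {rates}")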
As seen in Table III, the FPS value varies with each power consumption. The FPS is between 2 and 2.93 when the edge computing device is powered using 5 W. The FPS then rises along with the power: when the Jetson Nano receives 7 W, the system can detect objects at 4.55 to 6.33 frames per second. In the end, when the power consumption is doubled from the first test, the FPS range is between 12.35 and 15.38 frames per second.

TABLE III. FRAMES PER SECOND BASED ON THE INFERENCE TIME FOR EACH POWER CONSUMPTION

Power (W) | Fastest | Medium | Slowest
5         | 2.93    | 2.67   | 2.00
7         | 6.33    | 5.62   | 4.55
10        | 15.38   | 14.29  | 12.35

V. CONCLUSION

According to the tests conducted in this study, YOLOv7-tiny and YOLOv8n have almost similar performances based on precision, recall, mAP50, and mAP50-95.