FPGA-Based_Real-Time_Object_Detection_and_Classification_System_Using_YOLO_for_Edge_Computing

This paper presents an FPGA-based real-time object detection and classification system utilizing YOLO v3 Tiny for edge computing, specifically targeting traffic light detection in Advanced Driving Assistance Systems (ADAS). The system achieves 99% accuracy at 15 frames per second while consuming only 3.5W of power, demonstrating its efficiency for edge devices. The implementation leverages the Xilinx Kria KV260 FPGA development board and utilizes the Bosch Small Traffic Light Dataset for training the YOLO model.


Received 17 May 2024, accepted 20 May 2024, date of publication 23 May 2024, date of current version 31 May 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3404623

FPGA-Based Real-Time Object Detection and Classification System Using YOLO for Edge Computing
RASHED AL AMIN (Graduate Student Member, IEEE), MEHRAB HASAN, VEIT WIESE, AND ROMAN OBERMAISSER
Institute for Embedded Systems, University of Siegen, 57076 Siegen, Germany
Corresponding author: Rashed Al Amin ([email protected])
This work was supported in part by the TElepresence and collaboration of students in a mixed physical/virtual laboratory for CYber-physical systems (TECY) Project, funded by the Freiraum 2022 Program of Stiftung Innovation in der Hochschullehre.

ABSTRACT Research progress in real-time object detection and classification has been dramatically boosted by Embedded Artificial Intelligence (EAI) and Deep Learning (DL). Real-time object detection and classification with deep learning require many resources and considerable computational power, which makes it difficult to use deep learning methods on edge devices. This paper proposes a new, highly efficient Field Programmable Gate Array (FPGA) based real-time object detection and classification system using You Only Look Once (YOLO) v3 Tiny for edge computing. For evaluation, the proposed system has been instantiated within Advanced Driving Assistance Systems (ADAS), where traffic light detection and classification are crucial to ensure drivers' safety. The proposed system uses a camera connected to the Kria KV260 FPGA development board to detect and classify traffic lights. The Bosch Small Traffic Light Dataset (BSTLD) has been used to train the YOLO model, and Xilinx Vitis AI has been used to quantize and compile the YOLO model. The proposed system can detect and classify traffic light signals from a high-definition (HD) video stream at 15 frames per second (FPS) with 99% accuracy. In addition, it consumes only 3.5W of power, demonstrating its ability to work on edge devices. The on-road experimental results show fast, precise, and reliable detection and classification of traffic lights by the proposed system. Overall, this paper demonstrates a low-cost and highly efficient FPGA-based system for real-time object detection and classification.

INDEX TERMS FPGAs, object detection and classification, YOLO, edge computing.

The associate editor coordinating the review of this manuscript and approving it for publication was Abdallah Kassem.

I. INTRODUCTION
Deep Learning has been widely used for object detection and classification tasks [1], and the most common application of object detection and classification is in the domain of Advanced Driving Assistance Systems (ADAS). The continual advancements in Computer Vision (CV) with integrated Artificial Intelligence (AI) decision-making and control have brought intelligent driving to the forefront of discussions in the realm of ADAS [2], [3], [4]. These systems are designed to assist drivers in their decision-making processes, offering coordination and notifications during unforeseen events while fulfilling the fundamental requirements of autonomous driving. Given the absence of vehicle-to-infrastructure communication in contemporary transportation systems, the detection and categorization of traffic lights have gained significant importance within the ADAS framework [5], [6].

Various techniques have been developed to enhance detection algorithms based on deep learning technologies, aiming to create a robust and comprehensive object detector [7]. Convolutional Neural Networks (CNN), rooted in deep learning, have made significant strides in object recognition and detection [8]. Ouyang et al. [5] employed CNN for traffic light detection using NVidia Jetson TX1/TX2, and Zhang et al. [9] implemented CNN for traffic light classification with FPGA. On the other hand, two-stage detectors

2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
73268 VOLUME 12, 2024
R. A. Amin et al.: FPGA-Based Real-Time Object Detection and Classification System

offer superior performance in both localization and recognition accuracy [9]. However, a significant drawback is their demand for substantial amounts of annotated data for training. Another algorithm, the Single Shot MultiBox Detector (SSD) [10], exhibits relatively poor performance in handling small objects [7]. The most noteworthy algorithm in this category is You Only Look Once (YOLO) [11], which has several versions. Compact iterations of the YOLO algorithm have been developed to ensure efficient execution on hardware-constrained devices [8]. Wu et al. [12] and Abraham et al. [13] have presented traffic light detection systems using YOLO. In addition, Fernando et al. [14] and Huy et al. [15] have also demonstrated traffic light detection and classification using YOLO. Furthermore, various methods based on Lightweight [16], PID Controller [17], Salience-Sensitive Loss [18], vehicular ad-hoc networks [19], support vector machine [20], and Adaptive Background Suppression Filter [21] have been reported for traffic light detection and classification.

Edge computing and object detection are two critical components of modern technology, working hand in hand to provide efficient and real-time solutions in various applications using DL frameworks. However, DL models require high computational power and network bandwidth. Accelerating the DL model (e.g., YOLO) in edge devices can improve the system performance. Therefore, using field-programmable gate arrays (FPGAs) in object detection and classification systems enhances operational security and real-time computing ability while reducing cost.

FPGAs have proven able to support safety-critical systems with their remarkable computational ability. FPGAs are experiencing rapid growth in the domain of Artificial Intelligence (AI) acceleration, driven by their capacity for parallel processing and architectural optimizations [22]. FPGA-based implementations of deep learning models yield higher speed and accuracy with lower power consumption than software and image-based systems. To address the limitations associated with the low flexibility and accuracy of software and image-based algorithms, as well as the high power consumption of such methods, this paper presents a real-time and lightweight system for the detection and classification of traffic lights in autonomous vehicles. The YOLO v3 Tiny algorithm proves its suitability, mainly when deployed on FPGA boards, and the newly introduced Xilinx Kria KV260 FPGA development board demonstrates outstanding computational ability to accelerate deep learning algorithms [23], [24], [25], [26], [27], [28]. The pairing of the Xilinx Kria KV260 development board with the YOLO v3 algorithm is a compelling choice for real-time traffic light detection and classification on resource-constrained devices. In tandem with the detection and classification of traffic signals, the system also includes a critical module for speed control in compliance with the classification results.

The main contributions of this paper are summarized as follows:
1. This study presents a new system on the Xilinx Kria KV260 FPGA board, enhancing real-time object detection and classification with an optimized YOLO v3 Tiny deep learning model.
2. The proposed object detection system is instantiated to effectively identify and classify traffic light signals with precision and accuracy.
3. The performance of the proposed system is evaluated by comparing it with state-of-the-art object detection systems.

II. MOTIVATION AND RELATED WORK
Implementing YOLO-based object detection on FPGA brings forth both practical and theoretical implications. From a practical standpoint, it enables real-time processing of object detection tasks with minimal latency, rendering it suitable for applications necessitating swift decision-making, such as autonomous vehicles and surveillance systems. The FPGA's customizable hardware architecture facilitates optimized performance and resource efficiency, resulting in reduced power consumption and cost-effective solutions, particularly beneficial for embedded systems and IoT devices. On a theoretical level, this implementation underscores the potential synergy between deep learning algorithms like YOLO and hardware acceleration techniques such as FPGA, thereby laying the groundwork for further advancements in edge computing, where computational resources are constrained yet real-time processing is imperative. Furthermore, optimizing deep learning models for FPGA architectures can drive progress in hardware-software co-design methodologies, ultimately fostering the development of efficient and scalable AI systems across diverse applications.

Numerous methods have endeavored to develop real-time object detection systems tailored for edge devices. This section explores contemporary state-of-the-art object detection methodologies specifically designed for deployment on edge devices. Table 1 summarizes these object detection systems, providing a comparative overview of their respective features and performance metrics.

Several studies have proposed real-time object detection systems for edge devices utilizing Graphics Processing Unit (GPU)-based architectures. However, the significant power consumption associated with GPU-based systems presents a considerable bottleneck for their practical deployment [6], [12], [13], [16]. Notably, Abraham et al. [13] achieved the lowest power consumption among GPU-based traffic light detection systems. Their study focused on a traffic light detection system utilizing the YOLO architecture and implemented on an Nvidia Tesla T4 GPU. The research introduces a modified YOLO model tailored for detecting traffic lights and signs. This model, based on a modified cross-stage partial YOLO v4 architecture, processes images captured by a camera sensor, leveraging a dataset comprising 1360 training data and


TABLE 1. State-of-the-art Object detection system.

340 testing data, encompassing six types of traffic lights and 39 types of traffic signs. The network, implemented using the Darknet framework, attains a mean average precision of 79.77% at a processing speed of 29 FPS, accompanied by a power consumption of 70W.

Alternatively, CNN-based object detection serves as another viable approach, characterized by notably low power consumption and high throughput [9], [29], [30]. However, a key bottleneck of CNN-based real-time object detection systems lies in their requirement for low input image sizes, since processing larger images necessitates substantial computational power. Addressing this challenge, Wang et al. [29] proposed a highly efficient CNN-based object detection system. Their work introduces an adaptive CNN edge computing platform tailored for target detection tasks, leveraging FPGA technology. This research capitalizes on the inherent pipeline architecture of FPGAs to expedite network computations, utilizing off-chip memory for storing network models and thereby obviating the need for resource-intensive tiling techniques. Moreover, an online reconfigurable design is introduced, enabling real-time adjustments to network structure and parameters to accommodate diverse target recognition objectives. Implemented on a Spartan-6 FPGA platform, the system undergoes evaluation through pedestrian and vehicle classification tasks, achieving a detection speed of 16 frames per second (FPS) and a power consumption of 0.79 W, while attaining a classification accuracy of 96%. However, it is noteworthy that the system's evaluation was conducted with very low image sizes, resulting in comparatively lower accuracy.

Because YOLO-based object detection on FPGA offers a multitude of advantages, including real-time processing, tailored optimization for FPGA architecture, low power consumption, high throughput, low latency, flexibility for various YOLO models, and resource efficiency, numerous YOLO-based object detection systems have emerged in recent years. To discern the state-of-the-art among these systems, several notable implementations warrant attention. Heller et al. [31] introduced an object detection system utilizing deep learning on embedded edge devices, focusing on maritime object detection with the Xilinx Kria KV260 Vision AI Kit. Their study involved training and evaluating multiple YOLO neural networks of varying sizes and architectural specifications, incorporating structured pruning techniques such as sparsifying to reduce network size while preserving detection performance. The proposed deployments showcased promising outcomes, achieving an inference speed of 90 FPS with only a marginal 2.4% degradation in mean average precision for high-definition input images. Nonetheless, while exhibiting enhanced throughput and efficiency, the accuracy of this method necessitates further refinement.

Corcoran et al. [32] proposed a streaming architecture toolflow to accelerate YOLO models on FPGA devices, employing a deeply pipelined on-chip design for YOLO accelerators. These accelerators, generated using an


automated toolflow, incorporate novel hardware components supporting YOLO operations in a dataflow manner, along with off-chip memory buffering to mitigate on-chip memory limitations. Their approach achieves a throughput of 69 FPS for input images of 416 × 416, consuming 15.4W of power. However, the power consumption of this study remains relatively high, and accuracy metrics are not provided, posing potential limitations.

Nguyen et al. [33] present an FPGA-based design for a YOLOv4 network tailored for flying-object detection, addressing challenges of limited floating-point resources while aiming to maintain accuracy, real-time performance, and energy efficiency. They curated a suitable dataset of flying objects for network implementation, training, and fine-tuning, adapting YOLO networks for FPGA deployment. Evaluating on the ZCU104 FPGA kit, they achieved 125 FPS for HD input images, consuming 26.4W of power. However, the high throughput is accompanied by elevated power consumption and potentially compromised accuracy, suggesting potential bottlenecks.

Valadanzoj et al. [34] introduced a high-speed YOLO hardware accelerator tailored for self-driving automotive applications on FPGA. Their approach involved utilizing 8-bit and 5-bit fixed-point formats for data and weights to conserve resources and memory. To address accuracy concerns, a Genetic Algorithm was employed to optimize the decimal point positions across different network layers. Furthermore, a technique enabling simultaneous multiplications with distinct operands using a single DSP block was presented, enhancing network execution speed. Evaluation on the Xilinx Zynq ZC706 FPGA platform yielded a throughput of 55 FPS with 79% accuracy for a given input image, consuming 13.6W of power. Nonetheless, similar to previous studies, the high throughput was accompanied by elevated power consumption and compromised accuracy, potentially posing a limitation.

Besides these large NN-based architectures, several alternative methods have been reported for real-time object detection, including lightweight approaches [16] and combinations of Blob, Histogram of Oriented Gradients (HOG), and Support Vector Machine (SVM) [20]. However, the performance of these methods still requires improvement compared to other state-of-the-art object detection systems.

The majority of prior research has predominantly focused on throughput and accuracy metrics, often neglecting the critical aspects of power consumption and cost, which are pivotal considerations for edge computing systems. Moreover, some studies have utilized highly expensive FPGA boards, limiting their practical applicability, particularly for large-scale systems or ADAS. Current state-of-the-art object detection systems still require improvement across all performance metrics, including throughput, accuracy, power consumption, and cost-effectiveness. To address these constraints, this paper introduces a novel object detection system for FPGA utilizing YOLO. Leveraging a $250 FPGA board (Xilinx Kria KV260) with minimal power consumption, the proposed system aims to maintain accuracy and throughput while operating within cost constraints. However, achieving a trade-off among these performance evaluation metrics with a low-cost FPGA board presents significant challenges. The Kria KV260, being a newly launched FPGA board with limited memory and computational power, may pose a bottleneck for deploying large neural network models. Thus, the proposed system optimizes the YOLO model to fit within the FPGA board's constraints. Additionally, evaluating the proposed system through on-road tests and using street photos and videos is a crucial aspect of this study, providing real-world validation of its effectiveness.

III. IMPLEMENTATION
The implementation process begins with setting up the host machine environment and the FPGA development board. The procedure comprises multiple sequential stages: environment configuration, dataset preparation, model training, model conversion, and deployment of the trained model onto the FPGA board, all aimed at creating a robust and dependable object detection and classification system. The overall architecture of the proposed system is illustrated in Figure 1.

A. DATASET
The dataset has been prepared according to the YOLO format to train a YOLO model in the darknet framework. The images from the collected dataset were annotated with the bounding boxes surrounding the traffic lights that must be detected. The annotations consist of the traffic light class and coordinates of the center on the X-axis, coordinates of the center on the Y-axis, height, and width. A class name was allocated to every annotation according to the traffic light class (Red, Green, Yellow, Left Green, Right Green, Left Red, Right Red, and No Light). The pixel values of bounding box coordinates (x, y, width, height) were transformed to normalized values between 0 and 1 relative to the dimensions of the traffic light image to improve the model's convergence and robustness. To train the YOLO model, a corresponding text file has been generated for each traffic light image in the BSTLD dataset. The information about the traffic light in the text file maintains a predefined sequence so that during the training, the darknet framework can extract the information correctly and train the YOLO model with accurate data.

The Bosch Small Traffic Light Dataset (BSTLD) [42] contains a total of 13,427 camera photos, each with a size of 1280 × 720 pixels. Additionally, the collection includes around 24,000 annotated traffic light signals. The images in the training set depict a variety of obstacles that may be encountered when driving in urban environments. In this study, 5094 images from the BSTLD were used for training purposes, and among those images, 1019 images were split off to make a test set.
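The annotation normalization described in the dataset section can be sketched as follows. This is a minimal illustration, not the authors' preprocessing script: it assumes the source annotations are pixel-space corner boxes, it uses the field order commonly seen in Darknet label files (class index, x center, y center, width, height), and the class indices in `CLASSES` are hypothetical because the paper lists the eight class names but not their numbering.

```python
# Hypothetical class order: the paper names the eight classes
# but does not state which index each one receives.
CLASSES = ["Red", "Green", "Yellow", "Left Green", "Right Green",
           "Left Red", "Right Red", "No Light"]


def to_yolo_line(label, x_min, y_min, x_max, y_max, img_w=1280, img_h=720):
    """Convert one pixel-space corner box into a normalized label line.

    The defaults for img_w and img_h match the 1280 x 720 BSTLD images.
    """
    class_id = CLASSES.index(label)
    # Box center and size, each divided by the image dimension
    # so that every value falls between 0 and 1.
    x_center = (x_min + x_max) / 2.0 / img_w
    y_center = (y_min + y_max) / 2.0 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


if __name__ == "__main__":
    # One made-up green light occupying pixels (600, 300) to (620, 350).
    print(to_yolo_line("Green", 600, 300, 620, 350))
```

One such line would be written per traffic light, with all lines for an image collected in the text file that accompanies it.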


FIGURE 1. Overall architecture of the proposed system.

B. MODEL PREPARATION AND FPGA DEPLOYMENT
1) MODEL TRAINING THROUGH DARKNET
The Darknet framework serves the dual purpose of training neural networks and subsequently processing images or video frames using deep neural networks. This study uses the darknet framework to train the YOLO model because of its real-time object detection architecture. Darknet mainly employs 53 convolutional layers with 3×3 and 1×1 convolutional filters to extract features and reduce output [43]. In addition, it makes predictions by utilizing max pooling. The YOLO framework comes with a configuration file that needs to be modified according to the project-specific requirements. These configurations play a crucial role in determining the detection capabilities of the model. The YOLOv3 configuration files have been changed for traffic light detection and classification, including the number of classes, anchor boxes, and other hyperparameters. Initially, both the batch and subdivisions were set to one. To train efficiently, the batch was set to 64, and the subdivision was set to 8 by considering the volume of images in the BSTLD and the capacity of the host machine.

The proposed YOLOv3 Tiny model utilizes 3 × 3 convolution layers with a stride size of 1 to extract features through a feedforward structure. The initial convolution layer takes as input an image of size 416 × 416, and each convolution layer incorporates a padding size of 1. Furthermore, it employs max pooling layers to downsample data within the convolution layers. Bounding box predictions are made at two distinct feature map scales and combined with an upsampled 13 × 13 feature map. YOLOv3 Tiny predicts 3 boxes per grid cell, each described by 5 box attributes (four coordinates and an objectness score) plus the class scores. Since this study considers 8 traffic light classes, the number of filters within the convolutional layer has been set to 3 × (5 + 8) = 39 in the configuration file. The YOLO model was trained with weights from a pre-trained model from the darknet to maximize efficiency. This step is followed by fine-tuning the YOLOv3 Tiny network with the dataset by running the dataset through the neural network and modifying the parameters to minimize the loss. Table 2 represents the optimized YOLO v3 Tiny model architecture for the proposed system.

2) GENERATION OF FROZEN GRAPH
Freezing the graph is a process that combines the architecture of the model and weights into a single file called a frozen graph. The darknet binary weights were converted into TensorFlow variables. The YOLOv3 Tiny architecture has been built in TensorFlow with the proper configuration file and the appropriate weight file generated from the Darknet framework.

The architectural framework of the model has been replicated with input and output nodes along with other associated parameters. Using the configuration and weights files, TensorFlow has built the YOLO model and generated the frozen graph with the neural network layers, the activation functions, and the configuration of input-output functions. In addition, the frozen graph has been optimized by the TensorFlow API. The procedure for generating a frozen graph is illustrated in Fig. 2.

3) QUANTIZATION
The Vitis-AI quantizer provides a function to convert the 32-bit floating-point weights and activations to a fixed-point format, decreasing computational complexity while maintaining prediction accuracy. The frozen graph has been investigated by the Vitis-AI to find out the number of input and output nodes of the trained YOLO


TABLE 2. Modified YOLO architecture of the proposed system.

model and their names. To calibrate the trained YOLO model before quantization, 250 random images have been selected from the test set. The model has been quantized into the INT8 fixed-point format, which requires less memory bandwidth, improving power efficiency and computing speed.

FIGURE 2. Frozen graph generation process.

4) MODEL COMPILATION
The Vitis-AI compiler has been used to compile the quantized model for the Kria KV260 FPGA board using the Xilinx DPUCZDX8G. The compilation process has been performed using the Xilinx Intermediate Representation (XIR) based compiler of Vitis-AI. Initially, the quantized model was used as the input for the compiler, and then the compiler transformed the model into the XIR graph. The graph was broken into different subgraphs by the compiler, and optimization was applied to the subgraphs. In addition, the compiler generated the instruction stream and attached it to the DPU subgraph. The instructions and required information for Vitis-AI Runtime (VART) have been serialized to compile the model into the '.xmodel' format. Fig. 3 illustrates the interconnection between the DPU and the processor of the Kria KV260, with a 32-bit memory-mapped AXI interface employed to establish the links between the processor and the DPU. In the compilation process, the proper fingerprint of the DPU architecture has been specified to make the compiled model compatible with the DPU. The compiled model has been deployed to the Kria KV260 FPGA board to detect and classify the traffic lights.

IV. EXPERIMENTAL EVALUATION AND DISCUSSION
The evaluation of the proposed system commenced with a comprehensive software analysis. This preliminary phase involved rigorous testing and validation of the system's algorithms and functionalities in TensorFlow. Once the software results met the predefined criteria and demonstrated satisfactory performance, the proposed system was transitioned to hardware deployment for further evaluation. The hardware assessment phase involved implementing the system on the target FPGA platform and subjecting it to real-world testing scenarios. This stage aimed to validate the system's performance under actual hardware constraints and operational conditions, ensuring its effectiveness and reliability in practical applications. By adopting a systematic approach that encompasses both software analysis and hardware evaluation, the proposed system can undergo thorough validation across various dimensions, thereby instilling confidence in its capabilities and suitability for deployment in real-world scenarios.
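The INT8 fixed-point quantization mentioned in Section III-B3 can be illustrated with a generic sketch. This is not the Vitis-AI quantizer and makes no claim about its internal algorithm: it assumes a simple symmetric scheme with a power-of-two scale, and the helper names `choose_frac_bits`, `quantize`, and `dequantize` are hypothetical.

```python
import math


def choose_frac_bits(values, total_bits=8):
    """Pick the most fractional bits that still let max |value| fit.

    Assumption: a signed fixed-point format with a power-of-two scale.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return total_bits - 1
    qmax = 2 ** (total_bits - 1) - 1  # 127 for 8 bits
    return math.floor(math.log2(qmax / max_abs))


def quantize(values, frac_bits, total_bits=8):
    """Scale by 2**frac_bits, round, and clamp to the signed integer range."""
    qmin = -(2 ** (total_bits - 1))
    qmax = 2 ** (total_bits - 1) - 1
    scale = 2.0 ** frac_bits
    return [max(qmin, min(qmax, round(v * scale))) for v in values]


def dequantize(q_values, frac_bits):
    """Map the integers back to real values to inspect the rounding error."""
    scale = 2.0 ** frac_bits
    return [q / scale for q in q_values]


if __name__ == "__main__":
    weights = [0.5, -1.2, 0.03]  # made-up floating-point weights
    bits = choose_frac_bits(weights)
    q = quantize(weights, bits)
    print(bits, q, dequantize(q, bits))
```

In a calibration-based flow such as the one described in the paper, the value range would be estimated from the calibration images rather than from a fixed list; the sketch only shows why 8-bit integers need less memory bandwidth than 32-bit floats while keeping the rounding error bounded by the chosen scale.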


FIGURE 3. Interconnection between Xilinx DPU and the Kria KV260 processor.

A. EXPERIMENTAL SETUP
The YOLO model preparation, training, and FPGA deployment are carried out on a Linux host machine with an Intel(R) Core(TM) i7-9750H, 2.66 GHz, and 16GB of RAM. The simulation environment is implemented using the Darknet framework and Xilinx Vitis-AI. One environment is for training the model, and another is for the model conversion according to the requirements of the development board. For the hardware evaluation phase, the Xilinx Kria KV260 FPGA board was utilized. This board features the Xilinx Zynq UltraScale+ MPSoC, which integrates a quad-core ARM Cortex-A53 processor alongside a programmable FPGA fabric. Additionally, the board offers a diverse array of interfaces and peripherals tailored for camera and display connectivity, facilitating seamless integration into object detection systems. Leveraging the computational capabilities of the ARM processor and the flexibility of the FPGA fabric, the Xilinx Kria KV260 FPGA board serves as an ideal hardware platform for implementing and evaluating the proposed object detection system. Its versatile architecture and robust feature set enable comprehensive hardware assessment, ensuring the system's performance and functionality align with the intended objectives and requirements.

B. SOFTWARE EVALUATION
The training process extended to approximately 50,200 iterations. Initially, the loss exceeded 700%, signifying a high error rate. Subsequently, as the training progressed, the loss exhibited a rapid decline. Until around 1,000 iterations, there was a notable fluctuation in the loss. After approximately 1,500 iterations, the loss experienced a significant reduction, and the fluctuation rate decreased substantially compared to the earlier stages of training. Notably, the generated weight file at the 50,000th iteration has been utilized for traffic light detection and classification, designating this segment as the region of interest in the graph. The loss and iteration graph, as presented in Fig. 4, showcases this specific region, demonstrating that the loss stabilizes at approximately 1%.

FIGURE 4. Loss vs. Iteration graph of the trained YOLO model.

The accuracy versus iteration graph is represented in Fig. 5. At the outset of the training process, the accuracy was notably low. However, with an increase in the number of iterations, accuracy improved considerably, particularly up to around 1,000 iterations. After approximately 1,500 iterations, the accuracy exhibited minimal fluctuation. Fig. 5 highlights the region of interest with the model's accuracy, which stabilizes at a very high level of approximately 99%.

This study has investigated the efficacy of training and optimizing a YOLO v3 Tiny model for real-time traffic light detection and classification. With an outstanding 99% overall accuracy achieved during model training, the model showcases remarkable proficiency in accurately identifying


objects amidst complex backgrounds. An essential step in this process was the successful conversion of the Darknet weights into TensorFlow format, which enabled the application of quantization and compilation techniques. This conversion not only optimized system memory utilization but also preserved accuracy. Notably, the model’s loss remained at approximately 1% within the region of interest, particularly at 50,000 iterations, underscoring its suitability for efficient operational deployment.

FIGURE 5. Accuracy vs. iteration graph of the trained YOLO model.

C. HARDWARE EVALUATION
The YOLO model was tested and evaluated on the Kria KV260 FPGA board, and the proposed system yielded satisfactory results. During an on-road experiment conducted on the streets of Siegen, Germany, the system demonstrated real-time capability by successfully detecting and classifying traffic lights from a high-definition video stream. In addition to the evaluation on the BSTLD test images, the system was tested on 30 on-road experimental images with a resolution of 720 × 1280. Furthermore, five videos with HD resolution were used, each with a duration of 2 minutes and a frame rate of 60 fps, resulting in a total of 7200 frames captured from street environments. These real-world on-road images and videos provide valuable insights into the system’s performance under varying environmental conditions and dynamic scenarios, further validating its effectiveness and reliability in practical deployment scenarios. During the experiment, the system took 1.996 seconds to detect and classify 30 images, corresponding to a speed of 15 FPS. Fig. 6 shows the outcomes of traffic light detection and classification by the proposed system. To further assess the accuracy of annotation, segmentation, and object detection, the average Intersection over Union (IOU) was calculated for various regions. Notably, the proposed model attained an average IOU of 36%, signifying its robust performance in these critical aspects.

FIGURE 6. Detection and classification results of the proposed system.

D. COMPARISON WITH OTHER RELATED WORKS
To evaluate the performance of the architecture presented in this study against other heterogeneous architectures, the results of the proposed system were compared with those of similar works reported in recent literature. Table 3 summarizes the results of various FPGA-based object detection systems alongside the proposed system. The findings show that the CNN-based approach exhibits higher efficiency owing to the small size of the test images. Montgomerie-Corcoran et al. [32] and Valadanzoj et al. [34] report higher throughput, albeit accompanied by increased power consumption and accuracy. Conversely, Heller et al. [31] and Valadanzoj et al. [34] show a balanced trade-off among throughput, power consumption, and efficiency. For a fair comparison, the proposed system is compared with the work of Heller et al. [31], given the use of comparable FPGA boards and test image sizes in the experiments. Fig. 7 shows the performance evaluation of the proposed system against state-of-the-art systems. The proposed system performs better across all evaluation metrics, with a 24% improvement in accuracy, a 56% reduction in power consumption, and a 55% improvement in efficiency. Overall, the proposed system demonstrates satisfactory performance compared to other systems. The robustness and reliability of the YOLOv3 Tiny model in real-time scenarios are underscored by its deployment on the Kria KV260 and its efficient utilization of the Xilinx DPU. However, isolated misdetections were encountered during the testing phase, although their incidence was minimal relative to the total number of objects tested.

TABLE 3. Comparison of the proposed system with different FPGA-based object detection systems.

FIGURE 7. Performance evaluation of the proposed system.

V. CONCLUSION
An FPGA-based general YOLO model, tailored for object detection and classification in edge computing devices, has been introduced. This model has been carefully optimized to meet the stringent requirements of edge systems, encompassing limitations in hardware resources, the need for high accuracy, and real-time processing speed. In light of the system’s inherent resource constraints and its overall performance, it is justifiable to conclude that both the model and the entire system have demonstrated their efficiency, reliability, and speed, particularly within the domain of Advanced Driver Assistance Systems (ADAS) and edge computing.

One noteworthy feature of the developed system is its versatility. It can detect not only traffic lights but also a wide range of other object types. This flexibility arises from the system’s adaptability through the training of the YOLO model with suitable datasets. Consequently, ample room remains for further exploration and enhancement of the YOLO model. Architectural modifications and optimization efforts can be undertaken to achieve improved performance.

Additionally, it is worth considering the integration of a fault tolerance mechanism within the system to handle unexpected issues and ensure robustness. A valuable avenue for future work is the assessment of system reliability through long-term operation, observing its behavior and patterns across various real-world scenarios. Such investigations will contribute to the continuous development and refinement of the system’s performance and reliability.

REFERENCES
[1] X. Xu, X. Zhang, B. Yu, X. S. Hu, C. Rowen, J. Hu, and Y. Shi, “DAC-SDC low power object detection challenge for UAV applications,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 2, pp. 392–403, Feb. 2021.
[2] H. Xu, M. Guo, N. Nedjah, J. Zhang, and P. Li, “Vehicle and pedestrian detection algorithm based on lightweight YOLOv3-promote and semi-precision acceleration,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 10, pp. 19760–19771, Oct. 2022.
[3] L. Fang, Q. Jiang, J. Shi, and B. Zhou, “TPNet: Trajectory proposal network for motion prediction,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 6796–6805.
[4] X. Chang, H. Pan, W. Sun, and H. Gao, “YolTrack: Multitask learning based real-time multiobject tracking and segmentation for autonomous vehicles,” IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 12, pp. 5323–5333, Dec. 2021.
[5] Z. Ouyang, J. Niu, Y. Liu, and M. Guizani, “Deep CNN-based real-time traffic light detector for self-driving vehicles,” IEEE Trans. Mobile Comput., vol. 19, no. 2, pp. 300–313, Feb. 2020.
[6] T. H. P. Tran and J. W. Jeon, “Accurate real-time traffic light detection using YOLOv4,” in Proc. IEEE Int. Conf. Consum. Electron. Asia (ICCE-Asia), Nov. 2020, pp. 1–4.
[7] H. Naimi, “Traffic sign and light detection using deep learning for automotive applications,” Ph.D. thesis, Dept. Elect. Comput. Eng., Univ. Windsor, Windsor, ON, Canada, 2021.
[8] P. Adarsh, P. Rathi, and M. Kumar, “YOLOv3-tiny: Object detection and recognition using one stage improved model,” in Proc. 6th Int. Conf. Adv. Comput. Commun. Syst. (ICACCS), Mar. 2020, pp. 687–694.
[9] J. Zhang, F. Zhang, M. Xie, X. Liu, and T. Feng, “Design and implementation of CNN traffic lights classification based on FPGA,” in Proc. IEEE 4th Int. Conf. Electron. Inf. Commun. Technol. (ICEICT), Aug. 2021, pp. 445–449.
[10] W. Liu, “SSD: Single shot multibox detector,” in Proc. 14th Eur. Conf. Amsterdam, The Netherlands: Springer, Oct. 2016, pp. 21–37.
[11] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[12] S. Wu, N. Amenta, J. Zhou, S. Papais, and J. Kelly, “AUToLights: A robust multi-camera traffic light detection and tracking system,” 2023, arXiv:2305.08673.
[13] A. Abraham, D. Purwanto, and H. Kusuma, “Traffic lights and traffic signs detection system using modified you only look once,” in Proc. Int. Seminar Intell. Technol. Appl. (ISITIA), Surabaya, Indonesia, 2021, pp. 141–146.
[14] W. H. D. Fernando and S. Sotheeswaran, “Automatic road traffic signs detection and recognition using ‘you only look once’ version 4 (YOLOv4),” in Proc. Int. Res. Conf. Smart Comput. Syst. Eng. (SCSE), vol. 4, Sep. 2021, pp. 38–43.
[15] H. K. Hua, K. H. Nguyen, L.-D. Quach, and H. N. Tran, “Traffic lights detection and recognition method using deep learning with improved YOLOv5 for autonomous vehicle in ROS2,” in Proc. 8th Int. Conf. Intell. Inf. Technol., Feb. 2023, pp. 117–122.
[16] Z. Yao, Q. Liu, Q. Xie, and Q. Li, “TL-detector: Lightweight based real-time traffic light detection model for intelligent vehicles,” IEEE Trans. Intell. Transp. Syst., vol. 24, no. 9, pp. 9736–9750, Sep. 2023.
[17] S. Shrivastava, A. Somthankar, V. Pandya, and M. Patil, “Implementation of a PID controller for autonomous vehicles with traffic light detection in CARLA,” in Intelligent Computing and Networking. Cham, Switzerland: Springer, 2023.
[18] R. Greer, A. Gopalkrishnan, J. Landgren, L. Rakla, A. Gopalan, and M. Trivedi, “Robust traffic light detection using salience-sensitive loss: Computational framework and evaluations,” 2023, arXiv:2305.04516.
[19] E. Al-Ezaly, H. M. El-Bakry, A. Abo-Elfetoh, and S. Elhishi, “An innovative traffic light recognition method using vehicular ad-hoc networks,” Sci. Rep., vol. 13, no. 1, p. 4009, Mar. 2023.
[20] Y. Zhou, Z. Chen, and X. Huang, “A system-on-chip FPGA design for real-time traffic signal recognition system,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2016, pp. 1778–1781.
[21] Z. Shi, Z. Zou, and C. Zhang, “Real-time traffic light detection with adaptive background suppression filter,” IEEE Trans. Intell. Transp. Syst., vol. 17, no. 3, pp. 690–700, Mar. 2016.
[22] K. Sruthi and R. Nandakumar, “AI/ML-based object detection on FPGA SoC,” in Proc. Int. Conf. Commun., Electron. Digit. Technol. Cham, Switzerland: Springer, 2023, pp. 479–487.
[23] S. Kalapothas, G. Flamis, and P. Kitsos, “Efficient edge-AI application deployment for FPGAs,” Information, vol. 13, no. 6, p. 279, May 2022.
[24] K. Shi, M. Wang, X. Tan, Q. Li, and T. Lei, “Efficient dynamic reconfigurable CNN accelerator for edge intelligence computing on FPGA,” Information, vol. 14, no. 3, p. 194, Mar. 2023.
[25] G. Rathod, P. Shah, R. Gajjar, M. I. Patel, and N. Gajjar, “Implementation of real-time object detection on FPGA,” in Proc. 7th Int. Conf. Trends Electron. Informat. (ICOEI), Apr. 2023, pp. 235–240.
[26] P. Dhilleswararao, S. Boppu, M. S. Manikandan, and L. R. Cenkeramaddi, “Efficient hardware architectures for accelerating deep neural networks: Survey,” IEEE Access, vol. 10, pp. 131788–131828, 2022.
[27] S. C. Magalhães, F. N. dos Santos, P. Machado, A. P. Moreira, and J. Dias, “Benchmarking edge computing devices for grape bunches and trunks detection using accelerated object detection single shot multibox deep learning models,” Eng. Appl. Artif. Intell., vol. 117, Jan. 2023, Art. no. 105604.
[28] M. Baczmanski, M. Wasala, and T. Kryjak, “Implementation of a perception system for autonomous vehicles using a detection-segmentation network in SoC FPGA,” in Proc. Int. Symp. Appl. Reconfigurable Comput. Cham, Switzerland: Springer, 2023, pp. 200–211.
[29] Y. Wang, Y. Liao, J. Yang, H. Wang, Y. Zhao, C. Zhang, B. Xiao, F. Xu, Y. Gao, M. Xu, and J. Zheng, “An FPGA-based online reconfigurable CNN edge computing device for object detection,” Microelectron. J., vol. 137, Jul. 2023, Art. no. 105805.
[30] Z. Zhang, M. A. P. Mahmud, and A. Z. Kouzani, “FitNN: A low-resource FPGA-based CNN accelerator for drones,” IEEE Internet Things J., vol. 9, no. 21, pp. 21357–21369, Nov. 2022.
[31] D. Heller, M. Rizk, R. Douguet, A. Baghdadi, and J.-P. Diguet, “Marine objects detection using deep learning on embedded edge devices,” in Proc. IEEE Int. Workshop Rapid Syst. Prototyping (RSP), Shanghai, China, Oct. 2022, pp. 1–7.
[32] A. Montgomerie-Corcoran, P. Toupas, Z. Yu, and C.-S. Bouganis, “SATAY: A streaming architecture toolflow for accelerating YOLO models on FPGA devices,” in Proc. Int. Conf. Field Program. Technol. (ICFPT), Dec. 2023, pp. 179–187.
[33] D.-D. Nguyen, D.-T. Nguyen, M.-T. Le, and Q.-C. Nguyen, “FPGA-SoC implementation of YOLOv4 for flying-object detection,” J. Real-Time Image Process., vol. 21, no. 3, p. 63, May 2024.
[34] Z. Valadanzoj, H. Daryanavard, and A. Harifi, “High-speed YOLOv4-tiny hardware accelerator for self-driving automotive,” J. Supercomput., vol. 80, no. 5, pp. 6699–6724, Mar. 2024.
[35] E. Rzaev, A. Khanaev, and A. Amerikanov, “Neural network for real-time object detection on FPGA,” in Proc. Int. Conf. Ind. Eng., Appl. Manuf. (ICIEAM), Sochi, Russia, May 2021, pp. 719–723.
[36] G. Wang, H. Ding, B. Li, R. Nie, and Y. Zhao, “Trident-YOLO: Improving the precision and speed of mobile device object detection,” IET Image Process., vol. 16, no. 1, pp. 145–157, Jan. 2022.
[37] M. Liu, S. Luo, K. Han, B. Yuan, R. F. DeMara, and Y. Bai, “An efficient real-time object detection framework on resource-constricted hardware devices via software and hardware co-design,” in Proc. IEEE 32nd Int. Conf. Application-Specific Syst., Archit. Processors (ASAP), Jul. 2021, pp. 77–84.
[38] D. Pestana, P. R. Miranda, J. D. Lopes, R. P. Duarte, M. P. Véstias, H. C. Neto, and J. T. De Sousa, “A full featured configurable accelerator for object detection with YOLO,” IEEE Access, vol. 9, pp. 75864–75877, 2021.


[39] J. Zhang, L. Cheng, C. Li, Y. Li, G. He, N. Xu, and Y. Lian, “A low-latency FPGA implementation for real-time object detection,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2021, pp. 1–5.
[40] F. Esen, A. Degirmenci, and O. Karal, “Implementation of the object detection algorithm (YOLOV3) on FPGA,” in Proc. Innov. Intell. Syst. Appl. Conf. (ASYU), Elazig, Turkey, Oct. 2021, pp. 1–6.
[41] J. Xin, M. Cha, L. Shi, C. Long, H. Li, F. Wang, and P. Wang, “Lightweight convolutional neural network of YOLOv3-tiny algorithm on FPGA for target detection,” in Proc. IEEE Int. Conf. Comput. Sci., Artif. Intell. Electron. Eng. (CSAIEE), Aug. 2021, pp. 65–70.
[42] K. Behrendt, L. Novak, and R. Botros, “A deep learning approach to traffic lights: Detection, tracking, and classification,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2017, pp. 1370–1377.
[43] T. Diwan, G. Anirudh, and J. V. Tembhurne, “Object detection using YOLO: Challenges, architectural successors, datasets and applications,” Multimedia Tools Appl., vol. 82, no. 6, pp. 9243–9275, Mar. 2023.

RASHED AL AMIN (Graduate Student Member, IEEE) received the B.Sc. degree in electrical and electronics engineering from the University of Dhaka, Bangladesh, and the M.Sc. degree in mechatronics from the University of Siegen, Germany, where he is currently pursuing the Ph.D. degree with the Chair of Embedded Systems. He is also a Scientific Associate and a Lecturer with the Chair of Embedded Systems, University of Siegen. His research interests include hardware for artificial intelligence, reconfigurable computing, and FPGA architecture. His goal is to design the next generation FPGA architecture to optimize deep learning workloads. He received the Dean’s Award during the master’s study.

MEHRAB HASAN received the B.Sc. degree from American International University, Bangladesh, and the M.Sc. degree in mechatronics from the University of Siegen, Germany, in 2023. His research interests include FPGA design, hardware accelerator, and multiprocessor architectures.

VEIT WIESE received the Dipl.-Ing. degree in electrical engineering from the University of Siegen, Germany, in 2015. Later on, he was a Project Manager with Qualigon GmbH, until 2017. He is a Project Manager and a Lecturer with the Institute of Embedded Systems, University of Siegen. His research interests include fault tolerance, safety-critical embedded systems, structural health monitoring, field-programmable gate arrays, and artificial intelligence.

ROMAN OBERMAISSER received the master’s degree in computer sciences from Vienna University of Technology, in 2001, and the Ph.D. degree in computer science from Vienna University of Technology, in February 2004, with Prof. Hermann Kopetz, as a Research Advisor. In July 2009, he received the Habilitation (“Venia docendi”) Certificate for technical computer science. He is a Full Professor with the Institute for Embedded Systems, University of Siegen. He wrote a book on an integrated time-triggered architecture published by Springer-Verlag, USA. He is the author of several journal articles and conference publications. He has also participated in numerous EU research projects (e.g., SAFE POWER, universal, DECOS, and NextTTA). He was the coordinator of the European research projects DREAMS, GENESYS, and ACROSS. His research work focuses on system architectures for distributed embedded real-time systems.
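
The quantities quoted in the hardware evaluation (processing speed from a timed batch, frame count of a video clip, and Intersection over Union) can be recomputed with a minimal Python sketch. It is illustrative only and is not taken from the authors' implementation: the names `Box`, `frames_per_second`, `frames_in_video`, and `iou`, as well as the example box coordinates, are assumptions introduced here. Only the inputs 30 images, 1.996 s, 2 minutes, and 60 fps come from the text.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in pixels: (x1, y1) top-left, (x2, y2) bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float


def frames_per_second(num_images: int, elapsed_s: float) -> float:
    """Average throughput for a batch processed in elapsed_s seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return num_images / elapsed_s


def frames_in_video(duration_s: float, fps: float) -> int:
    """Number of frames in one clip of the given duration and frame rate."""
    return round(duration_s * fps)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two boxes; 0.0 when they do not overlap."""
    inter_w = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    inter_h = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = inter_w * inter_h
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Reported timing: 30 on-road images processed in 1.996 s.
    print(f"{frames_per_second(30, 1.996):.2f} FPS")
    # One 2-minute clip at 60 fps.
    print(frames_in_video(2 * 60, 60), "frames per clip")
    # Hypothetical ground-truth and predicted boxes (not from the paper).
    truth = Box(0, 0, 10, 10)
    pred = Box(5, 5, 15, 15)
    print(f"IoU = {iou(truth, pred):.3f}")
```

With these definitions, 30 images in 1.996 s corresponds to roughly 15 FPS, and a single 2-minute clip at 60 fps contains 7200 frames. An average IOU is then the mean of `iou` over matched ground-truth and predicted boxes.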