
Alexandria Engineering Journal 118 (2025) 482–495

Original article

Object detection in real-time video surveillance using attention-based Transformer-YOLOv8 model

Divya Nimma a,*, Omaia Al-Omari b, Rahul Pradhan c, Zoirov Ulmas d, R.V.V. Krishna e, Ts. Yousef A. Baker El-Ebiary f, Vuda Sreenivasa Rao g
a Computational Science, University of Southern Mississippi, UMMC, USA
b Information Systems Department, College of Computer and Information Sciences, Prince Sultan University, Riyadh, Saudi Arabia
c Department of Computer Engineering & Applications, GLA University, Mathura, India
d Artificial Intelligence Department, Tashkent State University of Economics, Uzbekistan
e ECE Department, Aditya University, Surampalem, AP, India
f Faculty of Informatics and Computing, UniSZA University, Malaysia
g Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Vaddeswaram, AP 522302, India

ARTICLE INFO

Keywords: Object detection; YOLOv8; Attention mechanism; Transformer architecture; Real-time processing

ABSTRACT

Object detection plays a crucial role in various applications, including surveillance, autonomous driving, and industrial automation, where accurate and timely identification of objects is essential. This research proposes a novel framework that combines the YOLOv8 backbone network with an attention mechanism and a Transformer-based detection head, significantly enhancing object detection performance on real-time images and video. The incorporation of attention mechanisms refines feature extraction from complex scenes, enabling the model to focus on relevant regions within images. Through the integration of the Transformer architecture, the model leverages long-range dependencies and global context, leading to more accurate bounding box predictions. The proposed system effectively processes real-time data, demonstrating superior classification performance with precision rates reaching 96.78 % and recall rates of 96.89 %. The mean average precision (mAP) is calculated at 89.67 %, showcasing the framework's robustness across various practical scenarios. The framework is developed to address challenges in object detection, such as detecting multiple objects in crowded environments and varying lighting conditions. A Python architecture supports the implementation of the proposed model. The results section assesses the Attention Transformer-YOLOv8 model against established algorithms such as Faster R-CNN, YOLOv3, YOLOv5n, and SSD, using standard evaluation metrics.

1. Introduction

Systems for finding, detecting, and monitoring have become more prevalent in many contemporary applications in recent decades [1]. These applications include autonomous driving systems, medical applications, security and traffic monitoring systems, etc. To create a functional system that operates in real time, several studies and publications have been published, employing various strategies and techniques. It is now challenging for many employees in control centers to operate with the same productivity throughout the entire day and handle several cameras at once, due to the growth of CCTV cameras and surveillance systems [2]. This has made efficient real-time detection and tracking technologies necessary [3]. Among other things, such technology may be applied to violent crime and the early identification of alien species. One of the most important aspects of computer vision is object detection [4]. It plays a vital role in enabling relationships between pictures and text, in addition to aiding the monitoring of separate entities. The capacity of object detection to provide insightful data highlights its many uses in a variety of fields, such as deep-sea visual monitoring systems, machine vision, and the detection of abnormalities in medical imaging [5]. Object detection algorithms have advanced quite quickly in the field of deep learning. Opportunities have arisen in several fields, including renewable energy, safety, healthcare, and education, thanks to artificial intelligence. Nonetheless, the

* Corresponding author.
E-mail addresses: [email protected] (D. Nimma), [email protected] (O. Al-Omari), [email protected] (R. Pradhan), [email protected] (Z. Ulmas),
[email protected] (R.V.V. Krishna), [email protected] (Ts.Y.A.Baker El-Ebiary), [email protected] (V.S. Rao).

https://doi.org/10.1016/j.aej.2025.01.032
Received 16 November 2024; Received in revised form 25 December 2024; Accepted 7 January 2025
Available online 24 January 2025
1110-0168/© 2025 The Authors. Published by Elsevier B.V. on behalf of Faculty of Engineering, Alexandria University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

industrial sector has a lot of potential for significant automation with computer vision [6].

This research proposes and experimentally verifies the performance of a real-time attention-based Transformer-YOLOv8 object detector that can successfully adapt to complicated scenes. Improving its detection accuracy under more severe challenges, like partial or total occlusion, blur, or varying sizes, is a specific goal of the model improvement objective. This paper integrates the strengths of YOLOv8 with a Transformer-based attention mechanism to further enhance precision, recall, and location detection in surveillance, autonomous cars, and industrial automation-related applications.

Quality inspection is extremely important in manufacturing since it provides customers with reassurance regarding the authenticity and caliber of the goods being produced [7,8]. Although there are many prospects for automation in production, there are difficulties in the area of surface inspection since flaws can take complex shapes. Because of this intricacy, human-driven quality inspection is a difficult task fraught with problems including bias, fatigue, expense, and lost production time [9]. Because of these inefficiencies, computer vision-driven systems have a great opportunity to provide automated quality checking [10]. Doing so avoids the bottlenecks that come with traditional inspection methods and increases efficiency by integrating these solutions smoothly into existing surface defect inspection procedures [11]. However, for CV designs to be successful, they must meet a strict set of deployment requirements that might differ among manufacturing business sectors. For most applications, the goal goes beyond recognizing a single fault; it typically includes identifying many problems together with their unique spatial information. This work uses an attention-based transformer and an improved version of YOLOv8 to identify moving objects for image categorization [12]. The latter focuses only on item identification within an image, providing no information on the exact location of the objects. Single-stage and two-stage detectors are the two main categories into which object detection technologies fall [13]. When using two-stage detection systems, the detection process is split into two stages: first, features are extracted or proposed, and then the final output is obtained through regression and classification. Although this solution offers great accuracy, it comes with a heavy computational cost that makes it unsuitable for real-time deployment on edge devices with limited resources. On the other hand, classification and regression may happen in a single pass using single-stage detectors, as they combine both procedures into a single step [14]. This lowers the computing needs significantly and makes the case for deployment in production systems stronger [15,16].

The introduction of Transformer-YOLOv8 in this research brings a new methodology based on the strengths of transformer models and the YOLO architecture, overcoming the challenges of traditional object detection methods. YOLOv5 and YOLOv4 focus mostly on spatial feature extraction and often fail to learn the long-range dependencies or capture the temporal patterns necessary for dynamic environments. Transformer-based models, like DETR, outperform the others in spatial-temporal relationship modeling, but they are computationally expensive and lack real-time efficiency. With the incorporation of transformer modules into YOLOv8, the study brings a balanced framework that increases accuracy and precision without becoming computationally infeasible. The method proposed here contrasts with existing methods that are either slow but very accurate or fast but inaccurate. It provides synergy between speed and accuracy and hence is suitable for resource-constrained environments in real-time applications. It addresses important issues such as mode collapse in transformers and the limited generalization capability of YOLO models, thus providing a more robust solution for object detection in diverse and dynamic scenarios. The novelty lies in this hybrid design, which leverages the spatial-temporal modeling capabilities of transformers but retains the lightweight efficiency of YOLOv8, a distinction that distinguishes this study from previous works.

As seen in Fig. 4, we offer a straightforward technique for oriented item recognition by end-to-end matching of angled boxes with oriented objects. In particular, we use the cross-attention technique to extract angle-dimensional information and pre-set and refine fixed-length object queries with angle dimensions for interaction with the encoded characteristics. During training, the collection of angle-aware object queries uses bipartite matching to match ground truths. Multi-scale feature maps are required for object recognition because of the notable size differences between several types of oriented objects (little ones like small vehicles and vast ones like ground-track fields). Nevertheless, for multi-scale characteristics, the global-reasoning attention mechanism of the original Transformer encoder is extremely demanding computationally. Moreover, we contend that global reasoning is not truly required, particularly in cases where oriented items belonging to the same category are consistently packed closely together, and the object query only communicates with the visual characteristics around the object, as opposed to the entire global image. These observations motivate us to offer depth-wise separable convolutions for local aggregation, which can outperform the Transformer's original self-attention mechanism in terms of performance. Compared with the traditional attention mechanism, replacing it with convolutions results in shorter Transformer training epochs and faster convergence, since, during feature extraction, information exchange only occurs between neighboring pixels.
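As a concrete illustration of this idea, the following is a minimal PyTorch sketch of a depth-wise separable convolution block used for local feature aggregation; the module name, activation, and channel sizes are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Local aggregation block: a per-channel spatial convolution followed by
    a 1x1 pointwise convolution that mixes information across channels."""
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        # groups=in_channels makes the first convolution depth-wise:
        # each channel is filtered independently over its spatial neighborhood.
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2, groups=in_channels)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.pointwise(self.depthwise(x)))

# Example: aggregate a 256-channel feature map locally, in place of
# a global self-attention layer.
features = torch.randn(1, 256, 40, 40)
local_agg = DepthwiseSeparableConv(256, 256)
out = local_agg(features)  # shape: (1, 256, 40, 40)
```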
By incorporating novel components of an attention-based transformer into YOLOv8, this work considerably increases the detection accuracy of YOLOv8. First, the initial feature extraction of objects in the images is done via the basic YOLOv8 model. To help the model learn as many complicated object features as possible, so as to handle occlusion, the integrated Transformer modules are also in charge of encoding and decoding complex multi-scale object features. The research contributions are:

• Integration of YOLOv8 with Transformer-based attention mechanisms: introduced a new framework combining the efficient YOLOv8 backbone with advanced Transformer-based detection heads, thereby enhancing spatial feature extraction as well as long-range dependency modeling for real-time object detection.
• Improved precision and robustness in complex environments: better detection performance in crowded, occluded, and low-visibility conditions was demonstrated using attention mechanisms to focus on critical regions of an image, resulting in a precision of 96.78 % and a recall of 96.89 %.
• Real-time efficiency for resource-constrained applications: designed a computationally optimized model with an inference time of 5.2 ms per frame, suitable for real-time applications like surveillance, autonomous systems, and industrial automation.
• New hybrid design for improved detection: introduces a balanced framework that synergizes the lightweight architecture of YOLOv8 with the contextual modeling capabilities of Transformers to address challenges like generalization in dynamic environments, achieving a mAP of 89.67 %.

The layout of the paper is as follows. Section 1 discusses the research problem, objectives, and significance. Section 2 presents a literature review that contextualizes the current study and identifies the research gaps. Section 4 describes the specific objectives of the study, the method used to gather and analyze data, and the rationale for this method. Section 5 presents the results, where the study findings are described in detail together with tables and figures for easy understanding; the discussion contextualizes these findings with regard to the research questions and past literature, covering implications and potential limitations. Last, in Section 6, the conclusion restates the key points of this research, discusses implications for practice, and presents ideas for further research.


2. Related works

Wang et al. [17] provide cross-stage partial networks as a remedy, from the perspective of network architecture, for the heavy inference computations required by previous research. The article attributes the problem to superfluous gradient information included in network optimization. The proposed networks preserve gradient variability by including feature maps from the beginning and end of a network stage. In comparison to state-of-the-art approaches on the MS COCO object identification dataset, the studies conducted on the ImageNet image set demonstrate that this decreases computations by 20 % with equivalent or even better precision, and performs substantially better in terms of AP50. One of the research's main contributions is the identification of the duplicated gradient information problem, which causes costly inference computations and inefficient optimization. Wang et al. [17] proposed employing the cross-stage feature combination approach and truncated gradient flow to enhance the variety of the learned features at several stages. Because CSPNet designs are sufficiently general and easy to develop, they are compatible with ResNet, ResNeXt, and DenseNet architectures. For tasks like object recognition and image classification, picture-level and box-level labels may, nonetheless, exclude some important data.

Saponara, Elhanashi, and Gagliardi [18] demonstrate the use of the YOLOv2 convolutional neural network for real-time video-based fire and smoke detection in antifire monitoring systems. The lightweight design of the neural network was used in the creation of YOLOv2 to accommodate embedded-platform needs. Using fire and smoke picture sets from various indoor and outdoor environments, the training stage proceeds in real time. The real-world data is produced from the initial data set using a ground-truth labeler program. The trained model underwent testing and was contrasted with other cutting-edge techniques. Saponara, Elhanashi, and Gagliardi [18] employed a wide variety of negative and fire/smoke videos in both indoor and outdoor settings. YOLOv2 is a superior real-time fire/smoke detection method when compared to the previous methods. One fixed camera per scene, operating in the visible spectral region, makes up the low-cost embedded device used for this experiment, the Jetson Nano. The video camera does not need to meet any particular specifications. Therefore, cameras already placed in closed-circuit television monitoring systems may be utilized when the suggested solution is employed for safety onboard cars, in transportation infrastructures, or in smart cities. The successful trial outcomes demonstrate that the suggested approach is appropriate for developing an intelligent, real-time video surveillance device for identifying instances of fire or smoke.

Shorfuzzaman, Hossain, and Alhamid [19] note that the COVID-19 pandemic exposed the shortcomings of the smart city implementation strategies then in place, making it critical to design systems and architectures that can quickly and effectively stop the virus from spreading. Slowing the transmission of this fatal virus can be achieved by actively tracking and enforcing social separation between individuals using an active surveillance system. The study proposes a data-driven deep learning system for the sustainable growth of smart cities, providing a mass video surveillance-based prompt reaction to the COVID-19 epidemic. The researchers employed three real-time, deep-learning-based object identification models to identify people in videos shot with a monocular lens in order to execute social distance monitoring. The article used real-world surveillance video datasets to test the system's effectiveness in preparation for a successful deployment.

Pérez-Hernández et al. [20] aim to enable small object detection and improve robustness, accuracy, and quality with the help of binarization techniques. This research presents a two-stage deep learning technique, referred to as Object Detection through Binary Classifiers, that can enhance identification in videos. The candidate areas are selected from the input frame at the first level, and at the second level binary classification is applied, depending on the result of the CNN classifier, using One-Versus-All or One-Versus-One. The paper is solely concerned with video surveillance for object/weapon identification problems, namely objects that resemble a knife or a handgun when held in the hand. For this purpose, a database of six objects (knife, smartphone, bill, purse, and card) is generated. The evaluation results shown in this work indicate that the proposed approach reduces the false positive rate compared to the baseline multi-class identification approach.

Ingle and Kim [21] observe that, since many places have adopted video surveillance for object detection, it becomes important to monitor the feeds of individual camera operators across cameras in order to pick up on any odd movement, which is a time-consuming exercise. Real-time identification and recognition of various forms of firearms and knives from surveillance footage is further complicated by the use of multiview interface cameras. Most of the detecting cameras are low-resource, computationally limited devices. To this end, they established a resource-constrained, lightweight subclass detection technique using convolutional neural networks that can classify, segment, and detect several gun and knife types with precision in real time. The multiclass subclass-identification convolutional neural network used in the paper's detecting classifier segments object frames into several sub-classes, such as abnormal and normal. The best state-of-the-art method delivers a mean average accuracy of 84.21 % or 90.20 % when attempting to detect a knife or a pistol using a single camera view. Across the various types of guns and knives, the suggested method achieved 97.50 % on the ImageNet and IMFDB datasets, 90.50 % on other open-image data, 93 % on all Olmos datasets, and 90.7 % on the multiview interface cameras after detailed testing. It has been evidenced that, despite being as resource-limited as possible, this device is quite efficient and effective, yielding an 85.5 % precision rate for multiview camera recognition.

Deshpande et al. [22] note that surveillance videos contain many realistic outliers, for which optimal and scalable anomaly detection models are needed. This is especially so given that annotating the anomalous segments in training videos can be painstaking; hence, a weakly supervised strategy has been suggested. To this end, video anomaly detection (VAD) achieves frame-level abnormality scores by bootstrapping from video-level labels to relieve the annotation task. However, weakly supervised video anomaly detection tends to misclassify abnormal and normal cases during the training process, which affects the detection rate. The paper proposes transformer-based Video Swin features together with an attention layer containing dilated convolution and self-attention. It captures the video features effectively and better accounts for short-range and long-range dependencies, improving the eventual understanding of the video content. A comparison of the performance of the proposed framework on the real-world ShanghaiTech Campus dataset against other benchmark techniques shows promising results. Nevertheless, this method has drawbacks similar to those of other transformer-based models, where the quality of the features essentially scales the cost of the entire method, especially for large-scale applications. Moreover, the weakly supervised approach is still plagued with obstacles to producing stable instance-level anomaly detection, which may hinder applications requiring the exact location of the anomaly.

Wang et al. [23] present a progressive technique incorporating temporal and spatial awareness that gives comprehensive enhancement for video object detection. The temporal knowledge acquisition involves a Temporal Feature Aggregation Model (TFAM), which applies attention between related frames to bring context to the target frame. At the same time, a Spatial Transition Awareness Model (STAM) conveys knowledge of location changes between context frames and a target frame. Built on the DETR transformer-based detector, PTSEFormer works end-to-end without costly post-processing and achieves an impressive 88.1 mean average precision (mAP) on the ImageNet VID


dataset. This approach essentially integrates temporal and spatial observations into one model, boosting the development of real-time video object detection. Although PTSEFormer attains high accuracy, it is likely to be a high-computational-cost transformer-based model, which could limit its scalability and real-time performance in resource-constrained environments.

Juvan Tervan [24] made a comprehensive analysis of the evolution of YOLO, focusing on how it is an essential component for real-time object detection in applications such as robotics, driverless cars, and video surveillance. The review of the advancement of YOLO models starts from the original YOLO and moves to more recent sophisticated models like YOLOv8, YOLO-NAS, and YOLO integrated with transformers. The authors described the standard evaluation metrics and post-processing techniques widely used in YOLO-based systems. They also discussed architectural advancements and training strategies introduced in each version, highlighting innovations that improved performance and efficiency. The study concluded by summarizing key insights gained from YOLO's development and exploring potential future research directions aimed at further enhancing real-time object detection systems. While providing a detailed analysis of YOLO's developments, the paper primarily focuses on architectural and methodological improvements but leaves out the practical difficulties involved in deploying such models on resource-constrained devices or against real-world variability.

Yuhai et al. [25] proposed a lightweight vehicle detection algorithm to improve the design of intelligent traffic management systems by enhancing the YOLOv5 framework with perceptual attention. The design mainly aims to achieve high accuracy without compromising the model's small size and low computational complexity. The authors provided two main modules: the Integrated Perceptual Attention (IPA) module and the Multiscale Spatial Channel Reconstruction (MSCCR) module. The IPA module, constructed using a Transformer encoder, reduces parameters while capturing global dependencies for richer contextual information. The MSCCR module enables efficient feature learning without increasing parameters or computational complexity. In the YOLOv5s backbone network, the model obtained a 9 % reduction in parameters and a 3.1 % improvement in mAP@50 without any increase in FLOPs. Despite these advances, the method's reliance on Transformer-based components may cause difficulties in deploying it to extremely low-power or real-time applications where computational resources are very constrained.

The related works focus on advancements in object detection for real-time video surveillance, anomaly detection, and vehicle detection using deep learning models like YOLO, ResNet, and transformers. One approach presents a network design that reduces computational costs while maintaining accuracy, but it still faces challenges in complex environments. Another study used YOLO for real-time smoke detection but has limited robustness across varied environments. A different study addresses social distancing monitoring using deep learning, but the system suffers from scalability issues. Several studies focus on small object detection and weapon detection but face difficulties in complex recognition tasks and real-time performance. Some methods use transformers for anomaly detection, but the approach is computationally intensive. Another study introduces a model that combines temporal and spatial awareness but suffers from high computational costs. While YOLO's evolution has improved performance, challenges persist in deployment and resource-constrained environments. A lightweight version of YOLO for vehicle detection improves efficiency but faces limitations in low-power, real-time applications. These studies depict the current challenges that prevail in object detection: computational complexity, scalability, and generalization across diverse environments.

3. Problem statement

The current advancements achieved in moving object detection have shown satisfactory results but are not exempt from certain disadvantages, such as limited versatility in handling complex motion, heavy dependence on offline preprocessing, the usual constraints of supervised learning methods, sensitivity to adversarial attacks, and high resource utilization, especially in edge computing settings [26]. To address these challenges, our study suggests an advanced attention-based Transformer-YOLOv8 model that tackles the lack of flexibility when dealing with situations involving dynamic motion, adopts real-time preprocessing solutions, and reduces sensitivity to adversarial threats. The model also has low latency and low memory utilization for efficient usage in real-time edge computing applications. This approach serves to overcome the limitations of current detection systems and provide a new direction in object detection and tracking.

4. Proposed attention-based Transformer-YOLOv8 model for object detection

The proposed framework for object detection considers input data as real-time video or images that are preprocessed before analysis. Next, the YOLOv8 backbone network extracts the important features from the preprocessed data. These feature maps are passed to an attention mechanism module that helps select areas of importance for object detection. The attention-refined features are then sent to a detection head based on the Transformer network, which provides the probability of the location and class of the objects. The bounding box prediction unit detects the locations of the found objects as the final step, and the system classifies the objects to provide them with labels. The real-time object detection pipeline, using the YOLOv8 backbone network and Transformer-based attention mechanisms with a detection head for accurate bounding box regression and object classification, is depicted in Fig. 1.

4.1. Data

Identifying and localizing elements in a stream of video or a series of images in real time or near real time is known as real-time object detection. In short, it recognizes and categorizes items while they are being recorded or viewed within video or image frames. For many applications, including video surveillance, robotics, self-driving vehicles, and medical care, object detection is essential. In recent years, researchers have been employing deep learning and computer vision techniques primarily for real-time object recognition. This is a difficult endeavor because, in addition to considering the available computing resources, they must balance speed and precision to produce timely, correct results. High-quality image data is the primary requirement for training classical vision models or deep learning algorithms in the targeted region for object recognition. In this case, road traffic images from Kaggle were used as the data source. These images depict a variety of traffic situations, such as different road settings, vehicle kinds, and lighting conditions. They may be used to train and assess object identification models, since the data has bounding boxes labeled around a variety of things, including vehicles, trucks, buses, and people. In real-time video surveillance applications, the model intends to improve object recognition accuracy by concentrating on critical regions, utilizing the attention-based Transformer combined with the YOLOv8 architecture. This method guarantees effective road traffic object identification and detection even in difficult situations, enhancing roadway security and monitoring.

The dataset referenced for the real-time object detector is designed to train models that are fitted and then evaluated on images or frames where objects are identified with position information. Such datasets typically come as labeled data consisting of an image with bounding boxes marked for the objects of interest as well as corresponding class labels. Many of these datasets are curated specifically for object detection applications, from surveillance to


autonomous driving, as well as generic object recognition. The annotations typically involve information about the positions and sizes of objects in images. Models can learn how to perform both classification and localization using these kinds of annotations. For real-time applications, the dataset may contain a wide variety of objects captured under different conditions, such as lighting, backgrounds, and occlusion levels, to ensure the robustness of the model. The structure of the dataset is critical in training models to detect objects rapidly and accurately in dynamic environments [27].

Fig. 1. Proposed framework.
4.2. Data pre-processing

The pre-processing stage in our object detection project involves a series of systematic steps to ensure that the input video frames are properly prepared for efficient detection by the YOLOv8 model. These steps are designed to enhance the quality of the frames, reduce noise, and isolate the relevant moving objects, setting the foundation for the model to perform accurate detection in real time.

Decoding the Video: the process begins by decoding the video into its individual frames. Each frame is extracted from the video sequence to enable further processing. This is critical because object detection models like YOLOv8 operate on individual images, and treating the video as a sequence of images makes it easier to apply detection techniques frame by frame. Once isolated, each frame is ready for size adjustments, color transformation, and further analysis.

Adjusting Frame Size: one of the primary requirements of the YOLO architecture is that the input image must adhere to specific dimensions for optimal performance. The standard YOLOv8 model requires input images of either 416 × 416 or 608 × 608 pixels, so each frame is resized accordingly. This resizing is necessary because neural networks perform best with fixed input sizes, ensuring uniformity and allowing the network to apply its learned convolutional filters effectively.

Adjusting Color Values: next, the pixel color values of the frames need to be adjusted to align with the pre-trained model's expectations. This involves normalizing the pixel intensity values, which originally range from 0 to 255, to a smaller range of either [0, 1] or [-1, 1]. Normalization helps the neural network converge faster during inference and training by ensuring that the pixel values are small, reducing the computational load. The transformation can be described by Eq. (1):

I_norm = (I_resized − μ) / σ   (1)

where I_norm is the normalized data after standardization, I_resized is the resized input data, μ is the mean pixel intensity of the resized input, and σ is the standard deviation of the pixel intensity.
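The pre-processing chain above can be sketched in a few lines of Python. The following is a minimal illustration using OpenCV and NumPy, assuming a 416 × 416 input and per-frame mean/standard-deviation normalization as in Eq. (1); it is not the paper's exact implementation.

```python
import cv2
import numpy as np

def preprocess_video(path: str, size: int = 416):
    """Decode a video, resize each frame, and standardize it as in Eq. (1)."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()                        # decode the next frame
        if not ok:
            break
        frame = cv2.resize(frame, (size, size))       # fixed YOLO input size
        frame = frame.astype(np.float32)
        mu, sigma = frame.mean(), frame.std()         # Eq. (1) statistics
        frames.append((frame - mu) / (sigma + 1e-8))  # standardized frame
    cap.release()
    return frames
```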

4.3. YOLOv8 and Transformer-based tracking and detection of objects

YOLOv8 and Transformer-based tracking and detection of objects is an innovative approach that unifies the latest YOLOv8 architecture for real-time object detection with transformers for improved tracking and context perception. This combination ensures a durable approach for the identification and tracking of objects in diverse and changing environments, such as video surveillance, self-driving cars, and many other real-time monitoring settings. YOLOv8 is a real-time object detection system that isolates objects in images and video feeds and is built using the latest technology. The key feature of YOLOv8 is its single-neural-network design, which enables efficient computation of bounding boxes and object classes and real-time operation. It partitions the input image into regions and directly predicts the object classes and their locations. Compared to the previous v7, YOLOv8 has optimized both precision and speed alongside detection accuracy, which is particularly important where safety and response urgency matter, such as in security system monitoring and self-driving cars.

Transformers, with the help of their attention mechanism, add to YOLOv8 a valuable capability: the ability to attend to important details in the input data, especially in sequences of video frames. As used in object tracking, the mechanism is effective at tracking objects across frames by learning the links between the frames that contain the objects. This is beneficial for maintaining tracks in situations such as occlusion, appearance variation, or cluttered scenes. It can retain long-range dependencies, which allows it to analyze sequences of video frames in order to predict the characteristics of a moving object. YOLOv8 and transformer networks are independent models, and object detection and tracking applications inherit the desirable properties of both: YOLOv8 provides fast and accurate detection of the object, while transformers help track the object through the attention mechanism applied over the sequence of frames. This integration expands the ability to follow multiple items within environments where object appearance, lighting, or position may alter. The transformers attend to different parts of the scene in order to select which visual features are appropriate for object tagging while reducing tracking mistakes. Fig. 2 presents the YOLOv8 architecture, showcasing its real-time object detection capabilities through a streamlined convolutional network. This architecture optimizes both precision and speed, enabling effective detection in diverse scenarios. The following sections provide an explanation of the model, the equations used, and a visual diagram to illustrate the process.
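As a point of reference for the detection half of this pairing, here is a minimal sketch of running a pretrained YOLOv8 detector frame by frame with the ultralytics package. The weights file, video name, and confidence threshold are illustrative assumptions, and the paper's added attention and Transformer modules are not part of this off-the-shelf call.

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # pretrained YOLOv8 nano weights (illustrative)

# Run detection over a video stream; each result holds boxes, classes, scores.
for result in model.predict(source="traffic.mp4", stream=True, conf=0.25):
    for box in result.boxes:
        cls_id = int(box.cls)                  # predicted class index
        x1, y1, x2, y2 = box.xyxy[0].tolist()  # corner-format bounding box
        print(result.names[cls_id], float(box.conf), (x1, y1, x2, y2))
```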


Fig. 2. YOLOv8 architecture.

The embedding layer turns low-level image features, that is, images and pixel-level information, into dense high-dimensional vectors. This transformation is significant because it provides a way of handling high-dimensional visual information in a form that is easier for the object detection model to process. Every pixel or area of the image is translated into a vector of the same constant size, representing its raw appearance, color, texture, or spatial distribution. These embeddings act as the input to subsequent layers, so the model learns to detect objects by identifying intricate structures in the data itself. Let X be the input image data:

E = Embedding(X)   (2)

Here, E represents the embedded input sequence.

Positional Encoding: subsequently, positional encoding is incorporated into the embedding process. Since transformers do not have inherent spatial understanding, positional encoding introduces positional information so the model can distinguish different areas of the image. This encoding assists the model in knowing the absolute positions of objects in the image by incorporating spatial context into the pixel or feature map, thus making the model able to classify objects based on their position in the visual space:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),   PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))   (3)

The positional encoding is then added to the embedding:

E′ = E + PE   (4)
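A minimal PyTorch sketch of Eqs. (3)-(4) follows, assuming an even embedding dimension; the function name and shapes are illustrative, not the paper's code.

```python
import torch

def sinusoidal_positional_encoding(num_positions: int, d_model: int) -> torch.Tensor:
    """Eq. (3): fixed sine/cosine positional encodings (d_model assumed even)."""
    positions = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angles = positions / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(num_positions, d_model)
    pe[:, 0::2] = torch.sin(angles)  # even dimensions
    pe[:, 1::2] = torch.cos(angles)  # odd dimensions
    return pe

# Eq. (4): add positions to embeddings E of shape (num_positions, d_model)
E = torch.randn(400, 256)
E_prime = E + sinusoidal_positional_encoding(E.shape[0], E.shape[1])
```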
Multi-Head Self-Attention: the multi-head self-attention mechanism is the core component of the proposed model. It enables the model to concentrate on many input sequence segments concurrently, capturing complex connections and interactions that are frequently essential for precise predictions. The multi-head attention mechanism trains each head to pay attention to distinct sequence features, resulting in a thorough comprehension of the information about objects. By focusing on distinct segments of the input sequence concurrently, this approach enables the model to capture a variety of linkages and dependencies:

Attention(Q, K, V) = softmax(QK^T / √d_k) V   (5)

Here, Q (query), K (key), and V (value) are projections of the input sequence. The multi-head attention is defined as:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O   (6)

where each head is a separate attention mechanism.
stage is the feed-forward network through which the data that has


Feed-Forward Neural Networks: the attention layers are succeeded by feed-forward neural networks. These networks further refine the information by performing non-linear transformations, enabling the model to capture the geometrical properties of the image data and an understanding of the intricate relationships inherent in it. The subsequent stage is the feed-forward network, through which the data that has undergone the attention mechanism is sent for further analysis:

FFN(x) = max(0, xW_1 + b_1) W_2 + b_2   (7)

Output Layer: this fuses the processed feature maps to produce the predictions. This layer generates the areas of interest, in the form of bounding boxes that point to the objects in an image, along with the class label of each detected object. The classification part of the final layer usually uses a softmax activation function to output a probability; in other words, the probability that an object belongs to a particular class. Besides classification, the layer also gives the coordinates of the boxes, completing the detection step whereby objects in the image are identified and classified:

ŷ = softmax(W_o h)   (8)

Fig. 3 represents the architecture of a Transformer model, a widely used neural network architecture in natural language processing and computer vision tasks. The architecture consists of an encoder and a decoder. The input sequence is processed by the encoder, and the decoder generates the output sequence. Both the encoder and the decoder use a similar architecture consisting of multiple layers of self-attention and feed-forward neural networks. The self-attention mechanism allows the model to weigh the importance of different input tokens when processing each token. The positional encoding layer adds positional information to the input and output embeddings, which enables the model to understand the order of the tokens in the sequence.
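A small sketch of Eqs. (7)-(8) as a detection head module is shown below; the module name, hidden width, and the auxiliary box regressor are illustrative assumptions layered on the equations, not the paper's exact head.

```python
import torch
import torch.nn as nn

class TransformerDetectionHead(nn.Module):
    """Sketch of Eqs. (7)-(8): position-wise FFN followed by class/box outputs."""
    def __init__(self, d_model: int, d_ff: int, num_classes: int):
        super().__init__()
        self.ffn = nn.Sequential(          # Eq. (7): max(0, x W1 + b1) W2 + b2
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.class_head = nn.Linear(d_model, num_classes)  # W_o in Eq. (8)
        self.box_head = nn.Linear(d_model, 4)              # (x, y, w, h) per query

    def forward(self, h: torch.Tensor):
        h = self.ffn(h)
        class_probs = self.class_head(h).softmax(dim=-1)   # Eq. (8)
        boxes = self.box_head(h).sigmoid()                 # normalized box coords
        return class_probs, boxes
```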

Fig. 3. Transformer architecture.


Algorithm. Transformer-YOLOv8 model.

5. Results and discussion

The results section provides a detailed analysis and comparison of the proposed Attention Transformer-YOLOv8 model with previously existing techniques such as Faster R-CNN, YOLOv3, YOLOv5n, and SSD. Each model's outcome is compared using various measures of efficiency, including inference time, precision, recall, and mean average precision (mAP). The accompanying tables and figures support the paper's central finding: the outperformance of the proposed model, especially in terms of real-time runtime and accuracy. The work focuses on significant strides in object recognition across a range of complex urban surveillance contexts and highlights the value of introducing attention mechanisms and transformer-based architectures into the YOLOv8 architecture.

The Transformer-YOLOv8 model achieves real-time object detection in 5.2 ms using lightweight attention mechanisms such as depth-wise separable convolutions. In terms of accuracy and efficiency, it is well suited to surveillance, autonomous systems, and low-resource environments. The Attention Transformer-YOLOv8 model surpasses existing models such as Faster R-CNN, YOLOv3, and YOLOv5n in terms of precision (96.78 %), recall (96.89 %), and mAP (89.67 %). It handles real-world challenges, such as diverse backgrounds, variable object sizes, and occlusions in dynamic scenes, well. This ensures robust performance in practical applications like urban surveillance and autonomous systems.

The proposed method optimizes real-time detection by reducing false positives and false negatives by a significant amount. Compared to earlier YOLO-based methods, this model shows better localization and classification performance for more accurate detection. It integrates attention mechanisms with Transformer-based detection heads that provide better precision in real-world surveillance scenarios.

Fig. 4 depicts a bustling urban street captured for real-time video surveillance object detection analysis using the attention-based Transformer-YOLOv8 model. The scene features multiple objects of interest, including vehicles (cars, a taxi, and buses), pedestrians, and surrounding urban infrastructure (high-rise buildings, street signs, and trees). These objects are crucial for training and evaluating the performance of object detection models like YOLOv8, as they represent common entities in a real-world urban setting. This image, sourced from Kaggle, serves as a test case for detecting various dynamic and static elements in a complex

Fig. 4. Frames of real-time video data (Input Images).

Fig. 5. Detection results on attention-based transformer YOLOv8 architecture.

environment, aiding in the study of the model's efficiency in identifying and localizing objects in real-time surveillance scenarios.

Fig. 5 presents object detection on city streets using the novel Transformer-YOLOv8 architecture with attention. The model has systematically identified different scenes containing objects such as cars, buses, taxis, people, and other aspects of a city, like signs. Different colors are used to distinguish each of the detected objects in the picture. The colors help with understanding the localization and classification of the YOLOv8 model, which uses attention mechanisms to enhance detection precision. In this real-time surveillance environment, YOLOv8 takes advantage of its efficient architecture to handle the complex visual data rapidly, including static scene features like buildings and signs as well as moving vehicles and pedestrians. Through the attention mechanism added to the Transformer, the model can attend to the important areas of the image to increase the detection rate. Thus, this figure shows the accuracy of the proposed Transformer-YOLOv8 architecture for object detection, which is applicable to real-time video surveillance applications where multiple targets must be detected and tracked in real time for safety, traffic control, etc. The model's ability to handle entities at various scales simultaneously, as well as overlapping objects, demonstrates its flexibility in detail.

Fig. 6 presents a confusion matrix that provides a detailed breakdown of the model's classification performance across four categories: person, motorcycle, car, and background. Each data point in the matrix signifies the share of predictions for a particular combination of true and predicted labels, revealing crucial insights into the model's accuracy and mistakes. The first row accurately classifies 88 % of cases as "person" (true positives). Nevertheless, it categorizes 12 % of the "person" instances as "motorcycle" and an estimated 40 % as "background," suggesting that the model is sometimes confounded by background noise or other features. The motorcycle row accurately recognizes 67 % of motorcycles. Nonetheless, 12 % of motorcycles are misclassified as "person" and 32 % as "car," indicating substantial confusion between motorcycles and cars. The third row (car) shows strong performance, accurately classifying cars up to 92 %. Yet 32 % of the true "motorcycle" instances are inaccurately labeled as "car," confirming that motorcycles and cars are a source of confusion for the model. Also, 55 % of what was labeled background was classified as "car," indicating a misinterpretation of background objects as cars. In the fourth row, the background category misclassifies 12 % of actual person occurrences and 32 % of


motorcycle occurrences as "background," but this category, overall, raises fewer problems than the other categories. The matrix illustrates that although the model performs effectively with cars and people, it has greater difficulty correctly identifying motorcycles and background items. The confusion between motorcycles and cars, along with the confusion between people and the background, indicates opportunities for model development to boost accuracy across all classifications.

Fig. 6. Confusion matrix of the proposed model.

The results presented in Fig. 7 showcase the successful completion of the training and validation steps for the introduced Transformer-based YOLOv8 model. The training losses, including box loss, objectness loss, and classification loss, all demonstrate a distinct downward trajectory, indicating that the model is successfully learning and enhancing its ability to precisely localize and classify objects. A similar trend in the validation losses reveals the model's ability to generalize to unseen data while not overfitting. Moreover, significant improvements are observed in precision and mean average precision (mAP). The precision increases gradually, evidence that the model suppresses numerous false positives while increasing its capacity to detect objects. A large increase in the mAP is observed at an IoU threshold of 0.5, which means that bounding boxes with a moderate degree of overlap are handled well by the model. Also, the mean average precision (mAP) increases across a range of IoU thresholds (0.5–0.95), indicating that the model localizes precisely across various magnitudes of bounding box overlap. All in all, the findings emphasize the effectiveness of the model for real-time object detection, showcasing an enhancement in accuracy accompanied by a decrease in losses during training.

Fig. 8 represents the precision-recall curve for a variety of object classes (person, motorcycle, car, and all classes) detected by the Transformer-based YOLOv8 model. Precision and recall serve as vital assessment measures in the evaluation of object detection models, where precision gauges the accuracy of the positive predictions and recall gauges the model's capability to identify all relevant instances. The person class (depicted by the light blue curve) shows a high degree of precision at lower recall levels, but the precision falls sharply as recall rises, indicating that although it detects a person accurately when encountered, it struggles to maintain precision when trying to find all instances of people. The motorcycle curve (orange) manages to keep a relatively stable equilibrium between precision and recall in contrast to the "person" class, indicating that the model performs well in motorcyclist detection, with only a gradual drop in precision. The car curve (green) begins with a lower precision relative to the others, indicating that the model initially finds it more difficult to accurately detect cars, particularly at higher recall values where the trade-off between recall and precision is pronounced. The all-classes curve (blue) stands for the general efficiency of the model across all identified classes. It begins with high precision at low recall levels and gradually transitions into a balance of precision and recall across all categories. The precision-recall curve reveals this trade-off and evaluates model performance over a range of object categories.

Fig. 9 illustrates the ROC curve for the Transformer-YOLOv8 model, demonstrating a high area under the curve (AUC) value of 0.98, indicating excellent classification performance. The model achieves a remarkable accuracy of 99.80 %, showcasing its reliability in distinguishing between classes.
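To make these metrics concrete, here is a small sketch of how IoU, precision, and recall are typically computed when scoring detections against ground truth. The greedy matching strategy and threshold are illustrative conventions, not the paper's evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(box_a) + area(box_b) - inter + 1e-9)

def precision_recall(pred_boxes, gt_boxes, iou_thresh=0.5):
    """Greedy matching: a prediction is a true positive if it overlaps an
    unmatched ground-truth box with IoU >= iou_thresh."""
    matched, tp = set(), 0
    for p in pred_boxes:
        for i, g in enumerate(gt_boxes):
            if i not in matched and iou(p, g) >= iou_thresh:
                matched.add(i)
                tp += 1
                break
    fp = len(pred_boxes) - tp
    fn = len(gt_boxes) - tp
    return tp / (tp + fp + 1e-9), tp / (tp + fn + 1e-9)
```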
Table 1 illustrates the inference times (in milliseconds) of different object detection methods employed across a range of datasets. The assessment involves Faster R-CNN, Yolo-v3, Yolo-v5n, SSD, and the suggested Attention Transformer-Yolov8 model. Faster R-CNN and Yolo-v3 use the FLIR and OTCBVS datasets, respectively, with inference times of 32.5 ms and 36.8 ms. Yolo-v5n, evaluated on the VEDAI dataset, shows an inference time of 27.5 ms, while the SSD model manages 26.3 ms on the FLIR dataset. With results from a real-time dataset, the proposed Attention Transformer-Yolov8 model drastically surpasses these models, capable of inference in only 5.2 ms, confirming its effectiveness in real-time applications.

Fig. 10 represents the inference time comparison of the various object detection algorithms. The size of each slice in the pie chart corresponds to the relative inference time of the respective algorithm. From the chart, it is evident that the proposed Attention Transformer-YOLOv8 model shows the least inference time and is therefore the most efficient for real-time applications. Such improvements in inference speed are achieved by the efficient architecture and optimization techniques employed in the model.

Table 2 provides a comparison of precision, recall, and mean average precision (mAP) at a 0.5 threshold for several object detection algorithms. The models evaluated include Yolo-V8, Yolo-V7, Yolo-V5, and Yolo-V5 large, along with the proposed Attention Transformer-Yolov8. Yolo-V8 achieves 90.5 % precision, 80.7 % recall, and 86.4 % mAP, while Yolo-V7 performs with 83.73 % precision, 83.14 % recall, and 85.78 % mAP. Yolo-V5 and its larger variant show precision scores of 85.25 % and 84.05 %, respectively, with slightly lower recall values. The proposed Attention Transformer-Yolov8 outperforms all other models, showing the highest precision (96.78 %), recall (96.89 %), and mAP (89.67 %), highlighting its superior performance in object detection tasks.

Fig. 11 provides a comparison of the performance of five different object detection algorithms using three key metrics: precision, recall, and mean average precision (mAP). Each algorithm is represented by a group of three bars, and the height of each bar indicates the performance on the corresponding metric. It shows that the Attention Transformer-YOLOv8 model outperforms the rest of the algorithms in all three metrics. This indicates that integrating Transformer attention mechanisms into the YOLOv8 architecture significantly enhances object detection accuracy.

5.1. Discussion

To evaluate the performance of the newly developed Attention Transformer-YOLOv8 model, it was compared with traditional object detection models, namely Faster R-CNN, YOLOv3, YOLOv5n, and SSD. The evaluation was made with metrics such as inference time, precision, recall, and mean average precision, providing the necessary comparison measures.


Fig. 7. Training and validation metrics of the proposed transformer-based YOLOv8 model.

5.1. Discussion

To assess the newly developed Attention Transformer-YOLOv8 model, it was compared with traditional object detection models, namely Faster R-CNN, YOLOv3, YOLOv5n, and SSD, using inference time, precision, recall, and mean average precision as the comparison measures. In the case of inference time, as shown in Table 1, the proposed model is far superior: it takes only 5.2 ms per frame, achieving real-time performance compared with the others. This underscores its real-time relevance, which is especially important in areas such as urban security surveillance. The other performance metrics, presented in Table 2, also support the proposed model. The Attention Transformer-YOLOv8 model surpasses the standard YOLOv8, YOLOv7, and YOLOv5 models on all measures, with a precision of 96.78 percent, a recall of 96.89 percent, and a mean average precision of 89.67 percent.

The practical implications of the Attention Transformer-YOLOv8 model are of tremendous importance in areas like security surveillance, autonomous driving, and industrial automation. It demonstrates real-time performance with a processing time as low as 5.2 ms per frame, where rapid and accurate detection is required. In security surveillance, the model's ability to spot objects such as vehicles, pedestrians, and even infrastructure in crowded environments serves to ensure public safety as well as effective threat detection. Besides that, it achieves high precision and recall due to the attention mechanism; this ability to focus on relevant parts of complex and crowded scenes makes it suitable for use in smart cities and autonomous driving. Its accuracy at classifying a wide variety of objects supports its suitability for industrial automation tasks like quality control and monitoring [31]. Despite challenges in distinguishing between certain objects, its overall high performance makes it highly suitable for real-world applications requiring efficient, real-time object detection.

Most of the credit for this enhanced performance is attributed to the attention mechanisms and the Transformer model, which greatly enhance the model's capability to locate objects and focus on detailed image features, even in congested and fast-moving urban scenes. In Figs. 4 and 5, the visualizations depict the model's live detection functionality, highlighting its skillful classification of multiple items such as vehicles, pedestrians, and infrastructure. The model performs well overall according to the confusion matrix, but the difficulty of distinguishing motorcycles from cars highlights an area that needs improvement as we continue our research. Despite this, the overall high accuracy across object classes points to the model's suitability for real-world operation.
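As a reference point for how such per-frame figures are typically obtained, the following is a minimal timing sketch under stated assumptions: `detector` and `frames` are stand-ins of our own, since the paper does not publish its benchmarking script.

import time

def mean_latency_ms(detector, frames, warmup=10):
    # Warm-up passes are run but excluded from timing, so one-off costs
    # (allocation, kernel compilation, cache fills) do not skew the mean.
    for frame in frames[:warmup]:
        detector(frame)
    start = time.perf_counter()
    for frame in frames[warmup:]:
        detector(frame)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / max(1, len(frames) - warmup)

At the reported 5.2 ms per frame, throughput is roughly 1000 / 5.2 ≈ 192 frames per second, comfortably above the 25–30 FPS usually required for real-time surveillance video.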


Fig. 10. Inference time comparison of object detection algorithms.

Table 2
Performance comparison of object detection algorithms.

Algorithm                                  Precision (%)   Recall (%)   mAP 0.5 (%)
Yolo-V8 [29]                               90.5            80.7         86.4
Yolo-V7 [30]                               83.73           83.14        85.78
Yolo-V5 [30]                               85.25           78.66        83.73
Yolo-V5 large [30]                         84.05           79.94        84.06
Proposed Attention Transformer-Yolov8      96.78           96.89        89.67

Fig. 8. Precision-recall curve.

Fig. 9. ROC curve for transformer-YOLOv8 model.

Fig. 11. Performance comparison of object detection algorithms.

Table 1
Inference time comparison of various object detection models.

Methods                                  Dataset                      Inference Time (ms)
Faster R-CNN [28]                        FLIR                         32.5
Yolo-v3 [28]                             OTCBVS                       36.8
Yolo-v5n [28]                            VEDAI                        27.5
SSD                                      FLIR                         26.3
Proposed Attention Transformer-Yolov8    Proposed real-time dataset   5.2

The use of Transformer-based detection heads significantly improves the accuracy of bounding box localization compared with traditional YOLOv8 implementations. The model achieves better precision in object localization by making use of the Transformer's ability to capture long-range dependencies and global context, which leads to more accurate and refined bounding box predictions.
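To make that idea concrete, the sketch below shows one common way a Transformer-style head can sit on backbone features: the feature map is flattened into tokens, self-attention mixes global context across them, and box offsets are regressed per token. The class name, layer sizes, and head count here are illustrative assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn

class AttentionBoxHead(nn.Module):
    def __init__(self, channels=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)
        self.box_reg = nn.Linear(channels, 4)   # (dx, dy, dw, dh) per token

    def forward(self, feats):                   # feats: (B, C, H, W)
        b, c, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)     # (B, H*W, C) tokens
        ctx, _ = self.attn(tokens, tokens, tokens)    # global self-attention
        tokens = self.norm(tokens + ctx)              # residual + layer norm
        return self.box_reg(tokens)                   # (B, H*W, 4) box offsets

head = AttentionBoxHead()
boxes = head(torch.randn(1, 256, 20, 20))   # -> torch.Size([1, 400, 4])

Because every token attends to every other token, a prediction at one location can be refined by evidence anywhere in the frame, which is the long-range-dependency property the paragraph above credits for the tighter boxes.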
The model is very good at adapting to changes in lighting, weather, and other environmental factors through the Transformer's attention mechanism. In this way, the model can focus on relevant features even in challenging conditions such as low lighting or adverse weather. This adaptability ensures robust performance in diverse real-world surveillance environments [32].

To mitigate ethical concerns, privacy-preserving techniques are integrated into the model, including the anonymization of sensitive information, ensuring that no personally identifiable information is used within the object detection process. In addition, bias in detection is reduced by training the model on a diverse and balanced dataset.

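One way such anonymization is commonly realized is to blur detected person regions before frames are stored or displayed. The sketch below is our illustration rather than the authors' implementation; it uses OpenCV and assumes a simple (x1, y1, x2, y2, class_name) detection format.

import cv2

def anonymize(frame, detections, kernel=(51, 51)):
    # detections: iterable of (x1, y1, x2, y2, class_name) with pixel coords.
    # Gaussian-blur only the regions classified as "person"; kernel sizes
    # must be odd, and larger kernels blur more aggressively.
    for x1, y1, x2, y2, cls in detections:
        if cls == "person":
            roi = frame[y1:y2, x1:x2]
            frame[y1:y2, x1:x2] = cv2.GaussianBlur(roi, kernel, 0)
    return frame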

The proposed system is scalable and can be deployed on a large scale, including citywide surveillance networks. Its real-time processing capability, along with efficient resource utilization, makes it possible to scale up across multiple cameras and locations. High accuracy with low inference time allows it to work effectively in dynamic environments at large scale with minimal computational overhead.

The framework further provides potential expansion into other areas, including autonomous driving and industrial automation. It is designed for real-time object detection and adaptability to dynamic environments; it is, therefore, highly suitable for those applications and could have even wider applicability beyond video surveillance.

The proposed method shows high accuracy in real-time detection; however, it faces challenges such as high computational requirements, especially on devices with limited resources. Its performance may also be less robust in extreme lighting or weather conditions. Future versions could focus on optimizing computational efficiency and enhancing robustness to varying environmental factors.

6. Conclusion and future works

The Attention Transformer-YOLOv8 model has seen significant advancements: the loss function decreased steadily during training, and the model attains a precision of 96.78 %, a recall of 96.89 %, and an mAP of 89.67 %. These results validate its strong detection and classification capabilities, and its localization also remains very good. It holds tremendous prospects for real-time video surveillance and self-governing systems, where strict and critical detection, along with tracking under low-visibility conditions, is required. However, the attention mechanisms, combined with the Transformer-based detection head, incur a high computational cost, with an inference time of 5.2 ms per frame on high-end hardware. Latency on edge devices underlines an additional need for optimization methods, such as model pruning, quantization, or knowledge distillation, to balance performance and efficiency.
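Of the optimization methods named above, post-training dynamic quantization is often the cheapest to try. The sketch below shows the generic PyTorch call on an arbitrary model; whether the full Attention Transformer-YOLOv8 model quantizes cleanly this way is untested here, so treat it as a starting point rather than the authors' procedure.

import torch

def quantize_dynamic_int8(model):
    # Replaces supported float32 layers (e.g. nn.Linear) with int8 versions,
    # shrinking the checkpoint and often cutting CPU inference latency at a
    # small accuracy cost; convolution layers typically need static
    # quantization with calibration instead.
    return torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )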
Adaptive preprocessing techniques, such as real-time data augmentation, could help enhance robustness under varying conditions. Semi-supervised learning would reduce reliance on labeled datasets, easing scalability concerns. Datasets can be extended to cover complex environments, with boundary refinement networks helping to enhance generalization and localization accuracy. Research on adversarial defenses and real-time feedback loops will further advance adaptability. All these improvements would allow the Transformer-YOLOv8 framework to be more robust, adaptable, and efficient in solving real-world application tasks.

CRediT authorship contribution statement

Divya Nimma: Formal analysis, Data curation, Conceptualization. Ts. Yousef A. Baker El-Ebiary: Formal analysis, Data curation, Conceptualization. Vuda Sreenivasa Rao: Methodology, Investigation, Funding acquisition. Zoirov Ulmas: Validation, Supervision, Software. R.V.V. Krishna: Writing – review & editing, Writing – original draft, Visualization. Omaia Al-Omari: Methodology, Investigation, Funding acquisition. Rahul Pradhan: Resources, Project administration, Methodology.

Declaration of Competing Interest

The authors declare no conflict of interest. The research was conducted independently and without any commercial or financial relationships that could be construed as a potential conflict of interest. All interpretations and findings presented in this paper are solely those of the authors and were not influenced by any external funding or personal relationships.

Acknowledgements

Omaia Al-Omari, one of the co-authors, would like to thank Prince Sultan University for their support.

References

[1] E. Arkin, N. Yadikar, Y. Muhtar, K. Ubul, A survey of object detection based on CNN and transformer, in: 2021 IEEE 2nd International Conference on Pattern Recognition and Machine Learning (PRML), IEEE, 2021, pp. 99–108.
[2] L. He, S. Todorovic, Destr: Object detection with split transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9377–9386.
[3] Y. Tian, Effective image enhancement and fast object detection for improved UAV applications, 2023.
[4] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, in: European Conference on Computer Vision, Springer, 2020, pp. 213–229.
[5] G. Lavanya, S. Pande, Enhancing real-time object detection with YOLO algorithm, EAI Endorsed Trans. Internet Things 10 (Dec. 2023), https://doi.org/10.4108/eetiot.4541.
[6] Q. Chen, et al., LW-DETR: a transformer replacement to YOLO for real-time detection, arXiv preprint arXiv:2406.03459, 2024.
[7] "MH_1901182_Final_EDDY_LAI_THIN_JUN.pdf." Accessed: Sep. 24, 2024. [Online]. Available: 〈http://eprints.utar.edu.my/6556/1/MH_1901182_Final_EDDY_LAI_THIN_JUN.pdf〉.
[8] Z. Guo, C. Wang, G. Yang, Z. Huang, G. Li, Msft-yolo: improved yolov5 based on transformer for detecting defects of steel surface, Sensors 22 (9) (2022) 3467.
[9] T. Wang, Z. Ma, T. Yang, S. Zou, PETNet: a YOLO-based prior enhanced transformer network for aerial image detection, Neurocomputing 547 (2023) 126384.
[10] S. Jha, C. Seo, E. Yang, G.P. Joshi, Real time object detection and tracking system for video surveillance system, Multimed. Tools Appl. 80 (3) (Jan. 2021) 3981–3996, https://doi.org/10.1007/s11042-020-09749-x.
[11] L. Cheng, D. Zhang, Y. Zheng, Road object detection in foggy complex scenes based on improved YOLOv8, IEEE Access (2024).
[12] N. Yunusov, B.M.S. Islam, A. Abdusalomov, W. Kim, Robust forest fire detection method for surveillance systems based on you only look once version 8 and transfer learning approaches, Processes 12 (5) (2024) 1039.
[13] H. Yi, B. Liu, B. Zhao, E. Liu, Small object detection algorithm based on improved YOLOv8 for remote sensing, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. (2023).
[14] Z. Zhang, X. Lu, G. Cao, Y. Yang, L. Jiao, F. Liu, ViT-YOLO: transformer-based YOLO for object detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2799–2808.
[15] Y. Fang, et al., You only look at one sequence: rethinking transformer in vision through object detection, Adv. Neural Inf. Process. Syst. 34 (2021) 26183–26197.
[16] L. Min, Z. Fan, Q. Lv, M. Reda, L. Shen, B. Wang, Yolo-dcti: small object detection in remote sensing based on contextual transformer enhancement, Remote Sens. 15 (16) (2023) 3970.
[17] C.-Y. Wang, H.-Y.M. Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, I.-H. Yeh, CSPNet: a new backbone that can enhance learning capability of CNN, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 390–391.
[18] S. Saponara, A. Elhanashi, A. Gagliardi, Real-time video fire/smoke detection based on CNN in antifire surveillance systems, J. Real-Time Image Process. 18 (2021) 889–900.
[19] M. Shorfuzzaman, M.S. Hossain, M.F. Alhamid, Towards the sustainable development of smart cities through mass video surveillance: a response to the COVID-19 pandemic, Sustain. Cities Soc. 64 (2021) 102582.
[20] F. Pérez-Hernández, S. Tabik, A. Lamas, R. Olmos, H. Fujita, F. Herrera, Object detection binary classifiers methodology based on deep learning to identify small objects handled similarly: application in video surveillance, Knowl.-Based Syst. 194 (2020) 105590.
[21] P.Y. Ingle, Y.-G. Kim, Real-time abnormal object detection for video surveillance in smart cities, Sensors 22 (10) (Jan. 2022), Art. no. 10, https://doi.org/10.3390/s22103862.
[22] K. Deshpande, N.S. Punn, S.K. Sonbhadra, S. Agarwal, Anomaly detection in surveillance videos using transformer based attention model, Jun. 06, 2022, arXiv:2206.01524, https://doi.org/10.48550/arXiv.2206.01524.
[23] H. Wang, J. Tang, X. Liu, S. Guan, R. Xie, L. Song, PTSEFormer: progressive temporal-spatial enhanced transformer towards video object detection, Sep. 06, 2022, arXiv:2209.02242, https://doi.org/10.48550/arXiv.2209.02242.
[24] J. Terven, "A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS." Accessed: Dec. 09, 2024. [Online]. Available: 〈https://www.mdpi.com/2504-4990/5/4/83〉.
[25] Y. Wang, et al., Lightweight vehicle detection based on improved YOLOv5s, Sensors 24 (4) (Feb. 2024) 1182, https://doi.org/10.3390/s24041182.
[26] M. Safaldin, N. Zaghden, M. Mejdoub, An improved YOLOv8 to detect moving objects, IEEE Access (2024).
[27] "Real-Time Object Detection." Accessed: Sep. 23, 2024. [Online]. Available: 〈https://kaggle.com/code/roshinifernando/real-time-object-detection〉.
[28] L. Shen, B. Lang, Z. Song, Infrared object detection method based on DBD-YOLOv8, IEEE Access 11 (2023) 145853–145868, https://doi.org/10.1109/ACCESS.2023.3345889.
[29] H. Yi, B. Liu, B. Zhao, E. Liu, Small object detection algorithm based on improved YOLOv8 for remote sensing, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 17 (2024) 1734–1747, https://doi.org/10.1109/JSTARS.2023.3339235.

[30] J. Yan, et al., Enhanced object detection in pediatric bronchoscopy images using YOLO-based algorithms with CBAM attention mechanism, Heliyon 10 (12) (2024).
[31] A.I. Taloba, R.T. Matoog, Detecting respiratory diseases using machine learning-based pattern recognition on spirometry data, Alex. Eng. J. 113 (2025) 44–59.
[32] A.S. Alghawli, A.I. Taloba, An enhanced ant colony optimization mechanism for the classification of depressive disorders, Comput. Intell. Neurosci. 2022 (2022) 1332664.
