Object Detection in Real-Time Video Surveillance Using an Attention-Based Transformer-YOLOv8 Model
Original article
Keywords: Object detection; YOLOv8; Attention mechanism; Transformer architecture; Real-time processing

Abstract

Object detection plays a crucial role in various applications, including surveillance, autonomous driving, and industrial automation, where accurate and timely identification of objects is essential. This research proposes a novel framework that combines the YOLOv8 backbone network with an attention mechanism and a Transformer-based detection head, significantly enhancing object detection performance on real-time images and video. The incorporation of attention mechanisms refines feature extraction from complex scenes, enabling the model to focus on relevant regions within images, and the integration of the Transformer architecture lets the model leverage long-range dependencies and global context, leading to more accurate bounding box predictions. The proposed system effectively processes real-time data, demonstrating superior classification performance with a precision of 96.78 % and a recall of 96.89 %; the mean average precision (mAP) is 89.67 %, showcasing the framework's robustness across various practical scenarios. The framework is developed to address challenges in object detection, such as detecting multiple objects in crowded environments and under varying lighting conditions, and the proposed model is implemented in Python. The results section assesses the Attention Transformer-YOLOv8 model against established algorithms such as Faster R-CNN, YOLOv3, YOLOv5n, and SSD using standard evaluation metrics.
…industrial sector has a lot of potential for significant automation with computer vision [6].

This research proposes and experimentally verifies the performance of a real-time attention-based Transformer-YOLOv8 object detector that can successfully adapt to complicated scenes. A specific goal of the model improvement objective is to raise its detection accuracy under severe challenges such as partial or total occlusion, blur, and varying object sizes. The paper integrates the strengths of YOLOv8 with a Transformer-based attention mechanism to further enhance precision, recall, and localization in surveillance, autonomous driving, and industrial automation applications.

Quality inspection is extremely important in manufacturing, since it reassures customers of the authenticity and caliber of the goods being produced [7,8]. Although there are many prospects for automation in production, surface inspection remains difficult because flaws can take complex shapes. Owing to this intricacy, human-driven quality inspection is a difficult task fraught with problems including bias, fatigue, cost, and lost production time [9]. These inefficiencies give computer vision-driven systems a great opportunity to provide automated quality checking [10]. Such systems avoid the bottlenecks of traditional inspection methods and increase efficiency when integrated smoothly into existing surface defect inspection procedures [11]. However, for computer vision designs to be successful, they must meet a strict set of deployment requirements that can differ among manufacturing sectors. For most applications, the goal goes beyond recognizing a single fault; it typically includes identifying many problems together with their spatial information. This work uses an attention-based transformer and an improved version of YOLOv8 to identify moving objects for image categorization [12]; plain classification, by contrast, only identifies items within an image and provides no information on their exact location. Object detection technologies fall into two main categories: single-stage and two-stage detectors [13]. Two-stage systems split the detection process in two: features are first extracted or proposed, and the final output is then obtained through regression and classification. Although this approach offers great accuracy, its heavy computational cost makes it unsuitable for real-time deployment on edge devices with limited resources. Single-stage detectors, on the other hand, perform classification and regression in a single pass [14]. This lowers the computing needs significantly and strengthens the case for deployment in production systems [15,16].

The introduction of Transformer-YOLOv8 in this research brings a new methodology based on the strengths of transformer models and the YOLO architecture, overcoming the challenges of traditional object detection methods. YOLOv5 and YOLOv4 focus mostly on spatial feature extraction and often fail to learn the long-range dependencies or capture the temporal patterns necessary for dynamic environments. Transformer-based models such as DETR, in turn, excel at spatial-temporal relationship modeling but are computationally expensive and lack real-time efficiency. By incorporating transformer modules into YOLOv8, the study delivers a balanced framework that increases accuracy and precision without becoming computationally infeasible. The method proposed here stands in contrast to existing methods that are either slow but very accurate or fast but inaccurate: it provides synergy between speed and accuracy and is therefore suitable for resource-constrained, real-time environments. It also addresses important issues such as mode collapse in transformers and the limited generalization capability of YOLO models, providing a more robust solution for object detection in diverse and dynamic scenarios. The novelty lies in the hybrid design, which leverages the spatial-temporal modeling capabilities of transformers while retaining the lightweight efficiency of YOLOv8, a distinction that distinguishes this study from previous works.

As seen in Fig. 4, we offer a straightforward technique for oriented item recognition by end-to-end matching of angled boxes with oriented objects. In particular, we use the cross-attention technique to extract angle-dimensional information, and we pre-set and refine fixed-length object queries with angle dimensions for interaction with the encoded features. During training, the set of angle-aware object queries is matched to ground truths using bipartite matching. Multi-scale feature maps are required for object recognition because of the notable size differences among oriented objects (small ones like vehicles and vast ones like ground-track fields). For multi-scale features, however, the global attention of the original Transformer encoder is computationally and operationally very demanding. Moreover, we contend that global reasoning is not truly required, particularly when oriented items of the same category are consistently packed closely together: an object query only needs to communicate with the visual features around the object, as opposed to the entire image. These observations motivate us to offer depth-wise separable convolutions for local aggregation, which can outperform the Transformer's original self-attention mechanism in terms of performance (a minimal sketch of this substitution is given at the end of this section). Compared with the traditional attention mechanism, replacing it with convolutions yields shorter Transformer training epochs and faster convergence, since information exchange during feature extraction occurs only between neighboring pixels. By incorporating novel components of an attention-based transformer into YOLOv8, this work considerably increases the detection accuracy of YOLOv8: the basic YOLOv8 model performs the initial feature extraction on the images, while the integrated Transformer modules encode and decode complex multi-scale object features, helping the model learn as many complicated object features as possible and handle occlusion. The research contributions are:

• Integration of YOLOv8 with Transformer-based attention mechanisms: a new framework combining the efficient YOLOv8 backbone with advanced Transformer-based detection heads, enhancing spatial feature extraction as well as long-range dependency modeling for real-time object detection.
• Improved precision and robustness in complex environments: better detection performance in crowded, occluded, and low-visibility conditions, demonstrated using attention mechanisms that focus on critical regions of an image, resulting in a precision of 96.78 % and a recall of 96.89 %.
• Real-time efficiency for resource-constrained applications: a computationally optimized model with an inference time of 5.2 ms per frame, suitable for real-time applications like surveillance, autonomous systems, and industrial automation.
• A new hybrid design for improved detection: a balanced framework that synergizes the lightweight architecture of YOLOv8 with the contextual modeling capabilities of the Transformer to address challenges like generalization in dynamic environments, achieving a mAP of 89.67 %.

The paper is organized as follows. Section 1 discusses the research problem, objectives, and significance. Section 2 provides a literature review that contextualizes the current study and identifies the research gaps. Section 3 states the problem. Section 4 describes the specific objectives of the study, the data, the proposed method, and the rationale for this methodology. Section 5 presents the study findings in detail, together with tables and figures for easy understanding, and discusses them with regard to the research questions and past literature, including implications and potential limitations. Section 6 concludes by restating the key points of this research, discussing implications for practice, and presenting ideas for further research.
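Since the implementation is not published here, the following is a minimal PyTorch sketch of the substitution described above: a depth-wise 3x3 convolution (information exchange only between neighboring pixels) followed by a point-wise 1x1 convolution, standing in for global self-attention. The module name, normalization, and activation are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class LocalAggregation(nn.Module):
    """Depth-wise separable convolution as a local mixing block."""

    def __init__(self, channels: int):
        super().__init__()
        # groups=channels -> one 3x3 filter per channel (depth-wise):
        # each output pixel only sees its 3x3 neighborhood.
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels, bias=False)
        # 1x1 convolution mixes information across channels (point-wise).
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1,
                                   bias=False)
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.norm(self.pointwise(self.depthwise(x))))

# Example: refine a backbone feature map in place of a self-attention layer.
feats = torch.randn(1, 256, 40, 40)          # (batch, channels, H, W)
print(LocalAggregation(256)(feats).shape)    # torch.Size([1, 256, 40, 40])
```

Unlike global self-attention, whose cost grows quadratically with the number of feature-map positions, this block is linear in the number of pixels, which is the source of the faster convergence claimed above.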
…dataset. This approach essentially integrates temporal and spatial observations into one model, boosting the development of real-time video object detection. Although PTSEFormer attains high accuracy, it is a transformer-based model with a likely high computational cost, which could limit its scalability and real-time performance in resource-constrained environments.

Terven [24] made a comprehensive analysis of the evolution of YOLO, focusing on its role as an essential component of real-time object detection in applications such as robotics, driverless cars, and video surveillance. The review traces the advancement of YOLO models from the original YOLO to more recent sophisticated variants such as YOLOv8, YOLO-NAS, and YOLO integrated with transformers. The author described the standard evaluation metrics and post-processing techniques widely used in YOLO-based systems, and discussed the architectural advancements and training strategies introduced in each version, highlighting innovations that improved performance and efficiency. The study concluded by summarizing key insights gained from YOLO's development and exploring potential future research directions aimed at further enhancing real-time object detection systems. While providing a detailed analysis of YOLO's developments, the paper primarily focuses on architectural and methodological improvements and leaves out the practical difficulties of deploying such models on resource-constrained devices or under real-world variability.

Wang et al. [25] proposed a lightweight vehicle detection algorithm to improve the design of intelligent traffic management systems by enhancing the YOLOv5 framework with perceptual attention. The design mainly aims to achieve high accuracy without compromising the model's small size and low computational complexity. The authors introduced two main modules: the Integrated Perceptual Attention (IPA) module and the Multiscale Spatial Channel Reconstruction (MSCCR) module. The IPA module, constructed using a Transformer encoder, reduces parameters while capturing global dependencies for richer contextual information; the MSCCR module enables efficient feature learning without increasing parameters or computational complexity. On the YOLOv5s backbone network, the model obtained a 9 % reduction in parameters and a 3.1 % improvement in mAP@50 without any increase in FLOPS. Despite these advances, the method's reliance on Transformer-based components may hinder deployment in extremely low-power or real-time applications where computational resources are very constrained.

The related works focus on advancements in object detection for real-time video surveillance, anomaly detection, and vehicle detection using deep learning models like YOLO, ResNet, and transformers. One approach presents a network design that reduces computational costs while maintaining accuracy, but it still faces challenges in complex environments. Another study used YOLO for real-time smoke detection but has limited robustness across varied environments. A different study deals with social distancing monitoring using deep learning, but the system suffers from scalability issues. Several studies focus on small object detection and weapon detection but face difficulties in complex recognition tasks and real-time performance. Some methods use transformers for anomaly detection, but the approach is computationally intensive. Another study introduces a model that combines temporal and spatial awareness, which suffers from high computational costs. While YOLO's evolution has improved performance, challenges persist in deployment and in resource-constrained environments, and the lightweight YOLO variant for vehicle detection faces limitations in low-power, real-time applications. These studies depict the current challenges that prevail in object detection: computational complexity, scalability, and generalization across diverse environments.

3. Problem statement

The current advancements in moving object detection have shown satisfactory results but are not exempt from certain disadvantages, such as limited versatility in handling complex dynamic movements, heavy dependence on static preprocessing, the usual constraints of supervised learning methods, sensitivity to adversarial attacks, and high resource utilization, especially in edge computing settings [26]. To address these challenges, our study proposes an advanced attention-based Transformer-YOLOv8 model that tackles the lack of flexibility in situations involving dynamic motion, adopts real-time preprocessing solutions, and reduces sensitivity to adversarial threats. The model also has low latency and low memory utilization, making it efficient for real-time edge computing applications. This approach overcomes the limitations of current detection systems and provides a new direction in object detection and tracking.

4. Proposed attention-based transformer-YOLOv8 model for object detection

The proposed framework for object detection takes real-time video or images as input, which are preprocessed for analysis. The YOLOv8 backbone network then extracts the important features from the preprocessed data. The resulting feature maps pass through an attention mechanism module that selects the regions of importance for object detection. The attention-refined features are then sent to a detection head based on a Transformer network, which provides the probable locations and classes of the objects. As the final step, the bounding box prediction unit localizes the detected objects and the system classifies them to assign labels. The real-time object detection pipeline, using the YOLOv8 backbone network and Transformer-based attention mechanisms with a detection head for accurate bounding box regression and object classification, is depicted in Fig. 1.

4.1. Data

Real-time object detection is the identification and localization of elements in a video stream or a series of images in real time or near real time: it recognizes and categorizes items while they are being recorded or viewed within video or image frames. Object detection is essential for many applications, including video surveillance, robotics, self-driving vehicles, and medical care. In recent years, researchers have primarily employed deep learning and computer vision techniques for real-time object recognition. This is a difficult endeavor because, in addition to considering the available computing resources, they must balance speed and precision to produce timely, correct results. High-quality image data is the primary requirement for training classical vision models or deep learning algorithms in the targeted domain. In this case, road traffic images from Kaggle were used as the data source [27]. These images depict a variety of traffic situations, covering different road settings, vehicle kinds, and lighting conditions, and they can be used to train and assess object detection models since a variety of objects, including cars, trucks, buses, and people, are labeled with bounding boxes. In real-time video surveillance applications, the model aims to improve object recognition accuracy by concentrating on critical regions, utilizing the attention-based Transformer combined with the YOLOv8 architecture. This method ensures effective road traffic object identification and detection even in difficult situations, enhancing roadway security and monitoring.

The dataset referenced for the real-time object detector is designed for training and then evaluating models on images or frames where objects are identified with position information. Such datasets typically come as labeled data consisting of an image with bounding boxes marked around the objects of interest, together with the corresponding class labels (a minimal loading sketch is given after this subsection). Many of these datasets are curated specifically for object detection applications, from surveillance to…
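The paper does not specify the exact annotation format of the Kaggle road-traffic data [27]. Assuming the common YOLO convention (one .txt file per image, each line holding "class x_center y_center width height" normalized to [0, 1]), a minimal loading sketch might look as follows; paths and file layout are hypothetical.

```python
from pathlib import Path
import cv2  # opencv-python

def load_yolo_sample(image_path: Path, label_dir: Path):
    """Load one image and its YOLO-format boxes as pixel coordinates."""
    image = cv2.imread(str(image_path))
    height, width = image.shape[:2]
    boxes = []
    label_file = label_dir / (image_path.stem + ".txt")
    if label_file.exists():
        for line in label_file.read_text().splitlines():
            cls, xc, yc, bw, bh = line.split()
            xc, yc, bw, bh = (float(v) for v in (xc, yc, bw, bh))
            # convert normalized center format to pixel corner format
            boxes.append((int(cls),
                          int((xc - bw / 2) * width),
                          int((yc - bh / 2) * height),
                          int((xc + bw / 2) * width),
                          int((yc + bh / 2) * height)))
    return image, boxes

# Example with hypothetical paths:
# image, boxes = load_yolo_sample(Path("images/frame_001.jpg"), Path("labels"))
```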
…visual diagram to illustrate the process.

The Embedding Layer turns low-level image features, that is, images and pixel-level information, into dense vectors. This transformation is significant because it gives the high-dimensional visual information a representation that the object detection model can process more easily. Every pixel or area of the image is translated into a vector of the same constant size, representing its raw appearance, color, texture, or spatial distribution. These embeddings act as the input to subsequent layers, so the model learns to detect objects by identifying intricate structures in the data itself. Let X be the input image data:

E = Embedding(X) (2)

Here, E represents the embedded input sequence.

Positional Encoding: subsequently, positional encoding is incorporated into the embedding. Since transformers have no built-in spatial understanding, positional encoding introduces positional information so the model can distinguish different areas of the image. By incorporating spatial context into each pixel or feature-map location, this encoding helps the model know the absolute positions of objects in the image and classify them based on their position in the visual space:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) (3)

The positional encoding is then added to the embedding:

E′ = E + PE (4)

Multi-Head Self-Attention: the multi-head self-attention mechanism is the core component of the proposed model. It enables the model to concentrate on many input sequence segments concurrently, capturing complex connections and interactions that are essential for precise predictions. The multi-head attention mechanism trains each head to attend to distinct sequence features, resulting in a thorough comprehension of the information in objects. By focusing on distinct segments of the input sequence concurrently, this approach lets the model capture a variety of linkages and dependencies:

Attention(Q, K, V) = softmax(QK^T / √d_k) V (5)

Here, Q (query), K (key), and V (value) are projections of the input sequence. The multi-head attention is defined as:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O (6)

where each head is a separate attention mechanism. A minimal sketch of Eqs. (2)–(6) is given at the end of this subsection.

Feed-Forward Neural Networks: the attention layers are followed by feed-forward neural networks. These networks further refine the information by performing non-linear transformations, enabling the model to capture geometrical properties of the image data and the intricate relationships inherent in it. The subsequent stage is the feed-forward network, through which the data that has…
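Eqs. (2)–(6) can be exercised with standard PyTorch components. A minimal sketch follows; the token count, model width, and head count are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn

def sinusoidal_encoding(num_positions: int, d_model: int) -> torch.Tensor:
    """Eq. (3): sin at even indices, cos at odd indices, with
    frequency 1 / 10000^(2i / d_model)."""
    pos = torch.arange(num_positions).unsqueeze(1).float()
    i = torch.arange(0, d_model, 2).float()
    angles = pos / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(num_positions, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# Eq. (2) stands in for any embedding of the flattened feature map;
# Eq. (4) adds the positional term to the embedded tokens E.
E = torch.randn(1, 400, 256)                    # (batch, tokens, d_model)
E_prime = E + sinusoidal_encoding(400, 256)

# Eqs. (5)-(6): scaled dot-product attention inside a multi-head block.
attention = nn.MultiheadAttention(embed_dim=256, num_heads=8,
                                  batch_first=True)
out, weights = attention(E_prime, E_prime, E_prime)  # self-attention: Q=K=V
print(out.shape)                                     # torch.Size([1, 400, 256])
```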
…enables the model to understand the order of the tokens in the sequence.

Algorithm for the Transformer-YOLOv8 model:
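A schematic sketch of this algorithm in PyTorch-style Python; the class name, layer sizes, two-layer encoder, and stand-in backbone are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class AttentionTransformerYOLOv8(nn.Module):
    """Backbone features -> attention refinement -> Transformer head."""

    def __init__(self, backbone: nn.Module, num_classes: int,
                 d_model: int = 256):
        super().__init__()
        self.backbone = backbone                 # YOLOv8-style feature extractor
        self.attention = nn.MultiheadAttention(d_model, num_heads=8,
                                               batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.box_head = nn.Linear(d_model, 4)            # bounding-box regression
        self.cls_head = nn.Linear(d_model, num_classes)  # object classification

    def forward(self, images: torch.Tensor):
        feats = self.backbone(images)              # (B, C, H, W) feature map
        tokens = feats.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
        tokens, _ = self.attention(tokens, tokens, tokens)  # refine salient regions
        tokens = self.encoder(tokens)              # long-range / global context
        return self.box_head(tokens), self.cls_head(tokens)

# Smoke test with a stand-in backbone (stride-8 projection to 256 channels):
toy_backbone = nn.Conv2d(3, 256, kernel_size=8, stride=8)
model = AttentionTransformerYOLOv8(toy_backbone, num_classes=4)
boxes, logits = model(torch.randn(1, 3, 320, 320))
print(boxes.shape, logits.shape)  # torch.Size([1, 1600, 4]) for both
```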
5. Results and discussion

The results section provides a detailed analysis and comparison of the proposed Attention Transformer-YOLOv8 model with previously existing techniques such as Faster R-CNN, YOLOv3, YOLOv5n, and SSD. Each model's outcome is compared using various measures of efficiency, including inference time, precision, recall, and mean average precision (mAP). The accompanying tables and figures support the paper's central finding, the outperformance of the proposed model, especially in terms of real-time runtime and accuracy. The work demonstrates significant strides in object recognition across a range of complex urban surveillance contexts and highlights the value of introducing attention mechanisms and transformer-based architectures into the YOLOv8 architecture.

The Transformer-YOLOv8 model achieves real-time object detection in 5.2 ms using lightweight attention mechanisms such as depth-wise separable convolutions. In terms of accuracy and efficiency, it is well suited to surveillance, autonomous systems, and low-resource environments. The Attention Transformer-YOLOv8 model surpasses existing models such as Faster R-CNN, YOLOv3, and YOLOv5n in terms of precision (96.78 %), recall (96.89 %), and mAP (89.67 %). It handles real-world challenges well, such as diverse backgrounds, variable object sizes, and occlusions in dynamic scenes, ensuring robust performance in practical applications like urban surveillance and autonomous systems.

The proposed method optimizes real-time detection by reducing false positives and false negatives by a significant amount. Compared with earlier YOLO-based methods, this model shows better localization and classification performance, yielding more accurate detection. It integrates attention mechanisms with Transformer-based detection heads that provide better precision when detecting in real-world surveillance scenarios.

Fig. 4 depicts a bustling urban street captured for real-time video surveillance object detection analysis using the attention-based Transformer-YOLOv8 model. The scene features multiple objects of interest, including vehicles (cars, a taxi, and buses), pedestrians, and surrounding urban infrastructure (high-rise buildings, street signs, and trees). These objects are crucial for training and evaluating the performance of object detection models like YOLOv8, as they represent common entities in a real-world urban setting. This image, sourced from Kaggle, serves as a test case for detecting various dynamic and static elements in a complex environment, aiding in the study of the model's efficiency in identifying and localizing objects in real-time surveillance scenarios.
Fig. 5 presents object detection on city streets using the novel Transformer-YOLOv8 architecture with attention. The model has systematically identified different scenes containing objects such as cars, buses, taxis, people, and other aspects of a city, like signs. Different colors are used to distinguish each detected object in the picture; the colors aid understanding of the localization and classification performed by the YOLOv8 model, which uses attention mechanisms to enhance detection precision. In this real-time surveillance environment, YOLOv8 takes advantage of its architecture to rapidly handle the complex visual data, including static scene features like buildings and signs as well as moving vehicles and pedestrians. Through the attention mechanism added to the Transformer, the model can attend to the important areas of the image to increase the detection rate. The figure thus shows the accuracy of the proposed Transformer-YOLOv8 architecture for object detection, which is applicable to real-time video surveillance applications where multiple targets must be detected and tracked in real time for safety, traffic control, and similar purposes. It can be concluded that the model's versatility lies in its integration of simultaneous entities at various scales, as well as overlapping objects, with flexibility in detail.

Fig. 6 presents a confusion matrix that provides a detailed breakdown of the model's classification performance across four categories: person, motorcycle, car, and background. Each entry in the matrix signifies the share of predictions for a particular combination of true and predicted labels, revealing crucial insights into the model's accuracy and mistakes. The first row (person) accurately classifies 88 % of cases as "person" (true positives); nevertheless, it assigns 12 % of the "person" instances to "motorcycle" and an estimated 40 % to "background," suggesting that the model is sometimes confounded by background noise or other features. The second row (motorcycle) accurately recognizes 67 % of motorcycles, but misclassifies 12 % of motorcycles as "person" and 32 % as "car," indicating substantial confusion between motorcycles and cars. The third row (car) performs strongly, classifying cars correctly up to 92 %; yet 32 % of the true "motorcycle" instances are labeled "car," again showing that motorcycles and cars are a source of confusion for the model, and 55 % of what was truly background was labeled "car," indicating misinterpretation of background objects as cars. The fourth row (background) misclassifies 12 % of actual person occurrences and 32 % of…
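The percentages quoted above can be reproduced as a row-normalized confusion matrix. In the sketch below the raw counts are illustrative stand-ins chosen to echo the reported figures, not the authors' data.

```python
import numpy as np

classes = ["person", "motorcycle", "car", "background"]

# Illustrative counts (rows = true class, columns = predicted class).
counts = np.array([
    [88, 12,  0, 40],   # person
    [12, 67, 32,  0],   # motorcycle
    [ 0, 32, 92,  0],   # car
    [12,  0, 55, 33],   # background
], dtype=float)

# Row-normalize so each row shows the fraction of a true class assigned
# to each predicted class, the form usually plotted as a confusion matrix.
row_norm = counts / counts.sum(axis=1, keepdims=True)
for name, row in zip(classes, row_norm.round(2)):
    print(f"{name:>10}: {row}")
```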
Fig. 7. Training and validation metrics of the proposed transformer-based YOLOv8 model.
…inference time, precision, recall, and mean average precision, providing the necessary comparison measures. In the case of inference time, as shown in Table 1, the proposed model is far superior, completing in only 5.2 ms and thus achieving real-time performance compared to the others. This shows its real-time relevance, which is especially important in areas such as urban security surveillance. The other performance metrics, presented in Table 2, also support the proposition of this research: the Attention Transformer-YOLOv8 model exceeds the standard YOLOv8, YOLOv7, and YOLOv5 models in all measures, with a precision of 96.78 %, a recall of 96.89 %, and a mean average precision of 89.67 %.

The practical implications of the Attention Transformer-YOLOv8 model are of tremendous importance in areas like security surveillance, autonomous driving, and industrial automation. It demonstrates real-time performance with a processing time as low as 5.2 ms per frame, where rapid and accurate detection is required. In security surveillance, the model's ability to spot objects such as vehicles, pedestrians, and even infrastructure in crowded environments serves to ensure public safety as well as effective threat detection. Besides that, it has high precision and recall due to the attention mechanism; this ability to focus on relevant parts of complex and crowded scenes makes it suitable for use in smart cities and autonomous driving. Its accuracy at classifying a wide variety of objects supports its suitability for industrial automation tasks like quality control and monitoring [31]. Despite challenges in distinguishing between certain objects, its overall high performance makes it highly suitable for real-world applications requiring efficient, real-time object detection.

Most of the credit for this enhanced performance goes to the attention mechanisms and the transformer model, which greatly enhance its capability to locate objects and focus on detailed image features even within congested and dynamic urban scenes. In Figs. 4 and 5, the visualizations depict the model's live detection functionality, highlighting its skillful classification of multiple items such as vehicles, pedestrians, and infrastructure. The model performs well according to the confusion matrix, but the difficulty of distinguishing motorcycles from cars highlights areas that need improvement as…
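For reference, the precision and recall reported in Tables 1–2 follow the standard detection definitions: precision = TP / (TP + FP) and recall = TP / (TP + FN), where a prediction counts as a true positive when its IoU with an unmatched ground truth exceeds a threshold. A minimal sketch with greedy matching (the helper names are ours):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_recall(preds, truths, iou_thr=0.5):
    """Greedy one-to-one matching of predicted boxes to ground truths."""
    matched, tp = set(), 0
    for p in preds:
        for j, t in enumerate(truths):
            if j not in matched and iou(p, t) >= iou_thr:
                matched.add(j)
                tp += 1
                break
    fp = len(preds) - tp   # unmatched predictions
    fn = len(truths) - tp  # missed ground truths
    return tp / (tp + fp + 1e-9), tp / (tp + fn + 1e-9)

p, r = precision_recall([(10, 10, 50, 50), (60, 60, 90, 90)],
                        [(12, 11, 49, 52)])
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=1.00
```

mAP then averages, over classes, the area under the precision-recall curve obtained by sweeping the confidence threshold.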
Table 2
Performance comparison of object detection algorithms.
Algorithm | Precision (%) | Recall (%) | mAP@0.5 (%)

Fig. 9. ROC curve for transformer-YOLOv8 model.
Fig. 11. Performance comparison of object detection algorithms.
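Per-frame inference times such as those in Table 1 are commonly measured with a warmed-up timing loop. A generic sketch follows; the 5.2 ms figure in the text is the authors' measurement on their hardware and is not reproduced by this harness.

```python
import time
import torch

@torch.no_grad()
def mean_latency_ms(model, input_shape=(1, 3, 640, 640),
                    warmup=10, runs=100):
    """Average per-frame inference latency in milliseconds."""
    x = torch.randn(*input_shape)
    for _ in range(warmup):            # let caches and allocators settle
        model(x)
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    return (time.perf_counter() - start) / runs * 1000.0

# Example with a stand-in model:
print(f"{mean_latency_ms(torch.nn.Conv2d(3, 16, 3)):.2f} ms")
```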
…including citywide surveillance networks. Its real-time processing capability, along with efficient resource utilization, makes it possible to scale up across multiple cameras and locations. High accuracy with low inference time allows it to work effectively in dynamic environments at large scale with minimal computational overhead.

The framework further offers potential expansion into other areas, including autonomous driving and industrial automation. It is designed for real-time object detection and adaptability to dynamic environments; it is, therefore, highly suitable for those applications and could have even wider applicability beyond video surveillance.

The proposed method shows high accuracy in real-time detection; however, it faces challenges such as high computational requirements, especially on devices with limited resources, and its performance may be less robust in extreme lighting or weather conditions. Future versions could focus on optimizing computational efficiency and enhancing robustness to varying environmental factors.

6. Conclusion and future works

The Attention Transformer-YOLOv8 model has achieved significant advancements, with a continually decreasing loss function, a precision of 96.78 %, a recall of 96.89 %, and a mAP of 89.67 %. These results validate its strong detection and classification capabilities, and its localization also remains very good. Real-time video surveillance and self-governing systems hold tremendous prospects for it wherever strict, critical detection and tracking under low-visibility conditions are required. However, the attention mechanisms combined with the Transformer-based detection head incur a high computational cost, with an inference time of 5.2 ms per frame on high-end hardware. Latency on edge devices underlines an additional need for optimization methods, such as model pruning, quantization, or knowledge distillation, to balance performance and efficiency.

Adaptive preprocessing techniques, such as real-time data augmentation, could help enhance robustness in varying conditions. Semi-supervised learning would decrease reliance on labeled datasets and reduce scalability issues. Datasets can be extended to cover complex environments, and boundary refinement networks can help enhance generalization and localization accuracy. Research on adversarial defenses and real-time feedback loops will further advance adaptability. Together, these improvements would make the Transformer-YOLOv8 framework more robust, adaptable, and efficient for real-world application tasks.

CRediT authorship contribution statement

Divya Nimma: Formal analysis, Data curation, Conceptualization. Ts. Yousef A. Baker El-Ebiary: Formal analysis, Data curation, Conceptualization. Vuda Sreenivasa Rao: Methodology, Investigation, Funding acquisition. Zoirov Ulmas: Validation, Supervision, Software. R.V.V. Krishna: Writing – review & editing, Writing – original draft, Visualization. Omaia Al-Omari: Methodology, Investigation, Funding acquisition. Rahul Pradhan: Resources, Project administration, Methodology.

Declaration of Competing Interest

The authors declare no conflict of interest. The research was conducted independently and without any commercial or financial relationships that could be construed as a potential conflict of interest. All interpretations and findings presented in this paper are solely those of the authors and were not influenced by any external funding or personal relationships.

Acknowledgements

Omaia Al-Omari, one of the co-authors, would like to thank Prince Sultan University for their support.

References

[1] E. Arkin, N. Yadikar, Y. Muhtar, K. Ubul, A survey of object detection based on CNN and transformer, in: 2021 IEEE 2nd International Conference on Pattern Recognition and Machine Learning (PRML), IEEE, 2021, pp. 99–108.
[2] L. He, S. Todorovic, DESTR: object detection with split transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9377–9386.
[3] Y. Tian, Effective image enhancement and fast object detection for improved UAV applications, 2023.
[4] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, in: European Conference on Computer Vision, Springer, 2020, pp. 213–229.
[5] G. Lavanya, S. Pande, Enhancing real-time object detection with YOLO algorithm, EAI Endorsed Trans. Internet Things 10 (Dec. 2023), https://doi.org/10.4108/eetiot.4541.
[6] Q. Chen, et al., LW-DETR: a transformer replacement to YOLO for real-time detection, arXiv preprint arXiv:2406.03459, 2024.
[7] MH_1901182_Final_EDDY_LAI_THIN_JUN.pdf. Accessed: Sep. 24, 2024. [Online]. Available: 〈http://eprints.utar.edu.my/6556/1/MH_1901182_Final_EDDY_LAI_THIN_JUN.pdf〉.
[8] Z. Guo, C. Wang, G. Yang, Z. Huang, G. Li, MSFT-YOLO: improved YOLOv5 based on transformer for detecting defects of steel surface, Sensors 22 (9) (2022) 3467.
[9] T. Wang, Z. Ma, T. Yang, S. Zou, PETNet: a YOLO-based prior enhanced transformer network for aerial image detection, Neurocomputing 547 (2023) 126384.
[10] S. Jha, C. Seo, E. Yang, G.P. Joshi, Real time object detection and tracking system for video surveillance system, Multimed. Tools Appl. 80 (3) (Jan. 2021) 3981–3996, https://doi.org/10.1007/s11042-020-09749-x.
[11] L. Cheng, D. Zhang, Y. Zheng, Road object detection in foggy complex scenes based on improved YOLOv8, IEEE Access (2024).
[12] N. Yunusov, B.M.S. Islam, A. Abdusalomov, W. Kim, Robust forest fire detection method for surveillance systems based on you only look once version 8 and transfer learning approaches, Processes 12 (5) (2024) 1039.
[13] H. Yi, B. Liu, B. Zhao, E. Liu, Small object detection algorithm based on improved YOLOv8 for remote sensing, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. (2023).
[14] Z. Zhang, X. Lu, G. Cao, Y. Yang, L. Jiao, F. Liu, ViT-YOLO: transformer-based YOLO for object detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2799–2808.
[15] Y. Fang, et al., You only look at one sequence: rethinking transformer in vision through object detection, Adv. Neural Inf. Process. Syst. 34 (2021) 26183–26197.
[16] L. Min, Z. Fan, Q. Lv, M. Reda, L. Shen, B. Wang, YOLO-DCTI: small object detection in remote sensing based on contextual transformer enhancement, Remote Sens. 15 (16) (2023) 3970.
[17] C.-Y. Wang, H.-Y.M. Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, I.-H. Yeh, CSPNet: a new backbone that can enhance learning capability of CNN, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 390–391.
[18] S. Saponara, A. Elhanashi, A. Gagliardi, Real-time video fire/smoke detection based on CNN in antifire surveillance systems, J. Real-Time Image Process. 18 (2021) 889–900.
[19] M. Shorfuzzaman, M.S. Hossain, M.F. Alhamid, Towards the sustainable development of smart cities through mass video surveillance: a response to the COVID-19 pandemic, Sustain. Cities Soc. 64 (2021) 102582.
[20] F. Pérez-Hernández, S. Tabik, A. Lamas, R. Olmos, H. Fujita, F. Herrera, Object detection binary classifiers methodology based on deep learning to identify small objects handled similarly: application in video surveillance, Knowl.-Based Syst. 194 (2020) 105590.
[21] P.Y. Ingle, Y.-G. Kim, Real-time abnormal object detection for video surveillance in smart cities, Sensors 22 (10) (Jan. 2022), Art. no. 10, https://doi.org/10.3390/s22103862.
[22] K. Deshpande, N.S. Punn, S.K. Sonbhadra, S. Agarwal, Anomaly detection in surveillance videos using transformer based attention model, Jun. 06, 2022, arXiv:2206.01524, https://doi.org/10.48550/arXiv.2206.01524.
[23] H. Wang, J. Tang, X. Liu, S. Guan, R. Xie, L. Song, PTSEFormer: progressive temporal-spatial enhanced transformer towards video object detection, Sep. 06, 2022, arXiv:2209.02242, https://doi.org/10.48550/arXiv.2209.02242.
[24] J. Terven, A comprehensive review of YOLO architectures in computer vision: from YOLOv1 to YOLOv8 and YOLO-NAS. Accessed: Dec. 09, 2024. [Online]. Available: 〈https://www.mdpi.com/2504-4990/5/4/83〉.
[25] Y. Wang, et al., Lightweight vehicle detection based on improved YOLOv5s, Sensors 24 (4) (Feb. 2024) 1182, https://doi.org/10.3390/s24041182.
[26] M. Safaldin, N. Zaghden, M. Mejdoub, An improved YOLOv8 to detect moving objects, IEEE Access (2024).
[27] Real-Time Object Detection. Accessed: Sep. 23, 2024. [Online]. Available: 〈https://kaggle.com/code/roshinifernando/real-time-object-detection〉.
[28] L. Shen, B. Lang, Z. Song, Infrared object detection method based on DBD-YOLOv8, IEEE Access 11 (2023) 145853–145868, https://doi.org/10.1109/ACCESS.2023.3345889.
[29] H. Yi, B. Liu, B. Zhao, E. Liu, Small object detection algorithm based on improved YOLOv8 for remote sensing, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 17 (2024) 1734–1747, https://doi.org/10.1109/JSTARS.2023.3339235.
[30] J. Yan, et al., Enhanced object detection in pediatric bronchoscopy images using YOLO-based algorithms with CBAM attention mechanism, Heliyon 10 (12) (2024).
[31] A.I. Taloba, R.T. Matoog, Detecting respiratory diseases using machine learning-based pattern recognition on spirometry data, Alex. Eng. J. 113 (2025) 44–59.
[32] A.S. Alghawli, A.I. Taloba, An enhanced ant colony optimization mechanism for the classification of depressive disorders, Comput. Intell. Neurosci. 2022 (2022) 1332664.