
SN Computer Science (2024) 5:605
https://doi.org/10.1007/s42979-024-02944-9

ORIGINAL RESEARCH

Design and Augmentation of a Deep Learning Based Vehicle Detection Model for Low Light Intensity Conditions

Pramod Kumar Vishwakarma · Nitin Jain
Computer Science and Engineering, Chandigarh University, Gharuan, Mohali, Punjab 140413, India

Received: 11 December 2023 / Accepted: 30 April 2024

© The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd. 2024

Abstract
The development of autonomous vehicles and Advanced Driver Assistance Systems (ADAS) has accelerated recently, and effective traffic management and road safety depend heavily on reliable vehicle identification. However, dependable vehicle detection in low-light situations, at night or in bad weather, remains a persistent difficulty in real-world scenarios. This study addresses the urgent requirement for enhanced vehicle detection in low-light circumstances by developing and enhancing a deep learning-based model. An alternative method is proposed that integrates state-of-the-art Convolutional Neural Networks (CNNs) with data augmentation approaches designed specifically for low-light situations. Most object detection models do not perform efficiently under low-light or poorly illuminated conditions, partly because of inappropriate labeling. When objects occupy only a small number of pixels and simple visual cues are rare, conventional CNNs can struggle to analyse the data accurately because of the large number of convolutional operations involved. This study introduces the collection and labeling of low-light data to cover different kinds of circumstances for vehicle detection. In addition, this work proposes an explicitly upgraded model based on the YOLO framework.

Keywords Object detection · Deep learning · YOLOv8 · Algorithms · CNN · Evaluation metrics

Introduction

Vehicle recognition and estimation for highway video scenes are highly important for the intelligent management and regulation of the roadway. Due to the widespread use of surveillance cameras, a large dataset of video footage capturing human activities has been collected for analysis [1]. Typically, a far road surface is observed from an elevated vantage point. The apparent size of a vehicle varies greatly at such a survey site, and it is hard to detect a small object far down the road with any degree of accuracy [2]. In order to successfully overcome these challenges and facilitate their implementation, it is crucial to handle them adequately, despite the complex camera sequences involved. This article suggests a pragmatic approach that entails the recognition of cars, followed by the tracking of multiple objects and the enumeration of vehicles.

At present, the identification of objects is achieved by utilising traditional machine vision techniques and complex deep-learning algorithms. Conventional machine vision methods utilise the movement of a vehicle to distinguish it from the background picture. This approach may be classified into three distinct categories: the first involves background subtraction, the second leverages the contrast between continuous video frames, and the third utilises optical flow [3]. The video frame differencing method uses the pixel values of successive video frames to establish the contrast, and the moving foreground region is then delineated by its boundary [4]. By implementing this strategy and suppressing disruptions, it is also possible to identify the rear of the vehicle [5]. After stabilising the original picture in the video, the background information is used to create the background model [5]. At that point, each silhouette picture is compared with the background model, and the moving item can be segmented. The optical flow method may be employed to identify the area of motion in a video. The optical flow field that is formed captures the direction and velocity of movement for each individual pixel [6]. The vehicles are categorised into three groups, automobile, bus, and motorbike, depending on the connecting bends of the 3D ridges on their outside surfaces [7].
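To make the classical pipeline described above concrete, the sketch below illustrates frame differencing and background-subtraction-based motion segmentation with OpenCV. It is a minimal, illustrative example rather than an implementation from any of the cited works; the video path, thresholds, and minimum blob area are placeholder values.

```python
import cv2

# Hypothetical input; replace with an actual traffic video path.
cap = cv2.VideoCapture("highway.mp4")

# Background model for the background-subtraction variant.
bg_subtractor = cv2.createBackgroundSubtractorMOG2(history=300, varThreshold=25)

ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # (1) Frame differencing: pixel-wise contrast between successive frames.
    diff = cv2.absdiff(gray, prev_gray)
    _, motion_mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)

    # (2) Background subtraction: compare each frame with the learned background model.
    fg_mask = bg_subtractor.apply(frame)

    # Outline the moving regions (candidate vehicles) found by frame differencing.
    contours, _ = cv2.findContours(motion_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        if cv2.contourArea(c) > 500:          # ignore small noise blobs
            x, y, w, h = cv2.boundingRect(c)
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)

    prev_gray = gray
```

An optical-flow variant (the third classical category) could be built analogously around cv2.calcOpticalFlowFarneback, which returns a per-pixel motion vector field.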

The use of Deep Convolutional Neural Networks (DCNNs) has resulted in significant advancements in the field of vehicle object identification. Convolutional Neural Networks (CNNs) have a strong capacity to gather information about the features of images and can effectively carry out various tasks, including classification and bounding box regression [8]. Detection methods fall into two distinct categories. The one-stage technique transforms the object-bounding-box placement problem into a regression problem for processing, which eliminates the need for candidate regions to be addressed directly. The two-stage procedure begins with the creation of a representation of the item via the application of several algorithms, and then moves on to the definition of the object through the use of a convolutional neural network. In the two-stage approach, the R-CNN (Region-based Convolutional Neural Network) uses a dedicated method to search for specific regions within the image [9, 10]. The convolutional network requires input pictures of a consistent size. Additionally, the network's complex structure necessitates a lengthy training period and utilises a substantial amount of processing resources. Notably, the Single Shot Multibox Detector (SSD) [11] and You Only Look Once (YOLO) frameworks are particularly remarkable among the one-stage techniques [12]. Advanced methods like Multibox, the Region Proposal Network (RPN), and multi-scale feature representation approaches are incorporated into SSD. It employs a set of anchor boxes with varying aspect ratios to accurately position objects. In contrast to SSD, the YOLO network partitions the image into a suitable number of grid cells [4]. Each grid cell is tasked with forecasting the objects that are mostly centred within it [13].

Vision-based approaches for detecting vehicles have yielded abundant results. The Multivariate Adjustment Detection (MAD) technique [14] was employed in Bangalore, India to identify discrepancies between two images taken on a roadway. The photos were taken with a small temporal gap between them. A modified image is used to measure the vehicle density of the road, which includes the mobile vehicles. The histogram displays the frequency distribution of edge gradients. Next, the k-means technique is used to split the gradient magnitude statistics into thirds, and those thirds are used to find a closed vehicle model. A differential approach was utilised to create a shading model for identifying and reducing shadow areas generated by vehicles, thereby mitigating the impact of scene variations [15]. Eliminating the shadow region can greatly improve the effectiveness of vehicle detection.

The assessment was conducted on roadways. The Accumulate and Haar-like characteristics, which were introduced in [16], were integrated to develop a vehicle detection method. This algorithm was evaluated using photos of Indian automobiles. Nevertheless, when employing the aforementioned approach for vehicle detection, it is not possible to identify the specific type of vehicle. In addition, the absence of illumination makes it difficult to extract the vehicle's edge or distinguish the moving vehicle, resulting in low accuracy of vehicle detection and therefore impacting the outcome of the detection for subsequent use. The research articles [17, 18] employed aerial perspectives to capture images, but these images fail to accurately depict the attributes of each vehicle and may lead to inaccurate vehicle identifications.

Multi-object tracking is a crucial task in Intelligent Transportation Systems (ITS) that involves advanced applications in vehicle object recognition [19]. When establishing relationships between items in a frame, it is imperative to guarantee that an object is limited to appearing on only one track, and that each track is restricted to corresponding to only one object [20]. The problem can be addressed by implementing detection-level rejection or direction-level avoidance. In order to tackle the difficulties arising from differences in size and intensity of moving objects, SIFT feature points were utilised for the purpose of object tracking. Nevertheless, this technique has subpar performance [21]. In this study, we suggest the utilisation of the point detection calculation from the circle; the Sphere algorithm extracts feature centres at a faster rate than SIFT. Vehicle object detection has transitioned from classical approaches to deep convolutional network algorithms. Furthermore, there is a limited availability of datasets specifically focused on explicit activity situations. Convolutional neural networks are prone to inaccurately detecting small objects because of their sensitivity to fluctuations in scale [22].

Video analysis has attracted considerable interest in the computer vision field and is considered a difficult task since it involves both spatial and temporal information. A vast array of YouTube videos encompassing 487 gaming genres is employed in an initial study to train a Convolutional Neural Network (CNN) algorithm. This model employs a multi-objective approach to assess movement data in neighbourhoods captured in recordings. It consists of distinct modules for modelling low-resolution images and processing high-resolution images in order to classify films. A method for identifying event locations in sports videos using deep learning is presented. The technique utilises Convolutional Neural Networks (CNNs) to encode both spatial and temporal data. It also incorporates regularised Autoencoders to perform feature combinations. A novel method called Recurrent Convolution Networks (RCNs) has been recently developed for video processing [23].

The method utilises Convolutional Neural Networks (CNNs) to process video frames and extract visual patterns. It then feeds the frames into Recurrent Neural Networks (RNNs) to analyse temporal information in the videos. Another proposed approach employs Recurrent Neural Networks (RNNs) in the intermediate layers of Convolutional Neural Networks (CNNs). Similarly, a Gated Recurrent Unit is utilised to leverage the sparsity and locality inside the RNN modules. The advancement in picture and video processing is contingent not only upon the creation of novel learning algorithms and the utilisation of robust hardware, but also largely relies on the accessibility of extensive public datasets. Table 1 contains a selection of large-scale visual datasets that are commonly used for training deep learning algorithms. When it comes to deep learning resources, ImageNet is generally considered to be the most important and prominent. Because of its vast assortment of tagged photos, popular networks such as ResNet, VGG Net, AlexNet, and GoogLeNet are trained using it. CIFAR10/100 is a compact dataset commonly utilised in various research investigations to analyse and gather visual information. It is also used to assess numerous deep neural networks (DNNs) in the image classification task. When it comes to object detection and semantic segmentation, two popular resources are PASCAL VOC and Microsoft COCO [24–27]. Recently created by Google, YouTube-8M is a dataset that plays a role for videos similar to that of ImageNet for images. For several types of video analysis, including event detection, comprehension, and classification, this resource can be a one-stop shop.

Table 1  Well-known datasets for DL

Dataset | Data type | Total images | Total categories | Ground truth | Usage
ImageNet [24] | Images | 14 million | 21,841 | Yes | Detection, localization and categorization of objects
CIFAR10/100 [25] | Images | 60,000 | 10/100 | Yes | Classification of images
Pascal VOC [26] | Images | 46,000 | 20 | Yes | Detection of objects, semantic segmentation, classification of images
Microsoft COCO [27] | Images | 330,000 | 80 | Yes | Detection of objects, semantic segmentation

The Background Study of YOLO

The YOLOv8 model is the most recent addition to the YOLO series of models. This family of models is referred to as YOLO, an acronym that stands for "You Only Look Once": the models can predict every object that is present in an image with just one forward pass. The YOLO models introduced a significant difference in the way the work at hand was framed, which was their primary distinction. The researchers reconsidered the task of object identification by approaching it as a regression problem, where the goal is to forecast the coordinates of the bounding box, rather than as a classification problem. YOLO models are pre-trained on massive datasets like COCO and ImageNet during the training process. Because of this, they can simultaneously fulfill the roles of both the master and the student: they make extremely accurate predictions on classes that they have already been trained on (master ability), and they are also highly capable of learning new classes with relative ease.

YOLO models are also capable of producing high accuracy with lower model sizes, and they can be trained more quickly than other neural networks. Because they can be trained on a single GPU, they are more accessible to developers. At the beginning of the year 2023, the most recent generation of these YOLO models is YOLOv8. In comparison to its predecessors, it has undergone several significant modifications, including the implementation of C3 convolutions, the addition of mosaic augmentation, and anchor-free detection. Ultralytics is responsible for the development and ongoing maintenance of the open-source SOTA model known as YOLOv8. Because it is offered under the GNU General Public License, the user is granted permission to freely share, modify, and distribute the software. The community of YOLOv8 is thriving and has been expanding over time.

It is the Ultralytics team that is responsible for writing and maintaining YOLOv8. YOLO models were initially developed by the computer scientist Joseph Redmon. He cycled through three different incarnations of YOLO, the third of which was YOLOv3, all developed in the Darknet architecture. Glenn Jocher re-implemented YOLOv3 in PyTorch and, after a few minor adjustments to the previous version, released it as YOLOv5. Following that, the architecture of YOLOv5 was updated to develop YOLOv8. The YOLOv8 version was made available to the public on January 10th, 2023. It is still being actively developed as of this writing.

Performance Comparison of YOLO

Every YOLO model that has been officially released since YOLOv5 has improved the speed-accuracy ratio. They offer a variety of model scales to meet the specific requirements and hardware constraints of each user. These iterations often provide simplified versions specifically tailored for edge devices, prioritising faster processing rates and lower computational complexity at the expense of precision. The comparison of the YOLOv5 and YOLOv8 model scales is shown in Fig. 1. The correlation between the mean Average Precision (mAP) and the number of parameters (in millions) on the COCO validation set is shown in the graph on the left. The IoU thresholds considered range from 50 to 95% [28]. The data demonstrates a distinct pattern in which an increase in the number of parameters improves the accuracy of the model. Nano, small, medium, large, and extra-large are the scales that each model family covers.

Using the same mAP performance metric, the second graph compares the inference times on an NVIDIA A100 GPU with TensorRT FP16. In this case, the trade-off between speed and accuracy stands out. Lower latency values, which indicate faster model inference, generally come with lower accuracy. Conversely, models with a longer latency typically exhibit enhanced performance when evaluated using the mAP metric on COCO. Any application requiring real-time processing must respect this relationship, so choosing a model is motivated by the necessity to strike a balance between precision and speed.

Fig. 1  A comparison of YOLO's performance

Framework Using YOLOv8

The eighth version of the YOLO model family is YOLOv8. The YOLO models can identify every object in a picture with just one forward pass; the name comes from the fact that they only look at the image once. One key difference among the YOLO models was how they framed the task at hand. In order to circumvent problems related to categorization, the researchers approached the object identification challenge from a different perspective: they reframed it as a regression problem, specifically focusing on predicting the coordinates of the bounding box.

Big datasets such as COCO and ImageNet are used to pre-train YOLO models. They may play both the role of student and master at the same time because of this. They can learn new classes very rapidly and give quite accurate predictions on the classes they have been trained on (master ability). In addition to being able to train quickly, YOLO models may achieve great accuracy with relatively modest model sizes. They are more approachable for developers since they can be trained on single GPUs. In early 2023, the most recent version of these YOLO models is YOLOv8. Its predecessors were significantly altered, and it now features anchor-free detection, C3 convolutions, and mosaic augmentation, among other changes.

The YOLOv8 model was developed and is maintained by the Ultralytics team; it is an open-source SOTA model. Sharing, modifying, and distributing the program is made possible by the GNU General Public License under which it is distributed. There is a thriving and constantly expanding YOLOv8 community. The original developer of YOLO models was the computer scientist Joseph Redmon. Using the Darknet architecture, he went through three versions of YOLO, the most recent of which was YOLOv3. With some small tweaks and a PyTorch re-implementation, Glenn Jocher released the successor of YOLOv3 as YOLOv5. Next, YOLOv8 was built by modifying the architecture of YOLOv5. On January 10th, 2023, YOLOv8 was formally launched. It is currently still under active development.
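As a rough illustration of how the speed-accuracy comparison discussed above (Fig. 1) can be reproduced, the sketch below evaluates several YOLOv8 scales with the Ultralytics API and reports their mAP and per-image inference time. The dataset YAML shown is a small stand-in dataset shipped with Ultralytics; the curves in Fig. 1 come from the published Ultralytics benchmarks on the full COCO validation set, not from this snippet.

```python
from ultralytics import YOLO

# Model scales compared in Fig. 1 (nano ... extra-large); weights download on first use.
scales = ["yolov8n.pt", "yolov8s.pt", "yolov8m.pt", "yolov8l.pt", "yolov8x.pt"]

for weights in scales:
    model = YOLO(weights)
    # 'coco128.yaml' is a small stand-in; the paper's comparison refers to full COCO val.
    metrics = model.val(data="coco128.yaml", imgsz=640, half=True, verbose=False)
    map50_95 = metrics.box.map                 # mAP averaged over IoU 0.50-0.95
    latency_ms = metrics.speed["inference"]    # per-image inference time in ms
    print(f"{weights}: mAP50-95={map50_95:.3f}, inference={latency_ms:.1f} ms")
```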

Proposed Model

The Dataset

The dataset contains information about vehicles. Surveillance cameras in public areas have been widely used globally; however, images of traffic are rarely made available to the public due to concerns of copyright, privacy, and security. In terms of safety concerns, there are three main types of traffic photo datasets: those taken by in-car cameras, those taken using surveillance cameras, and those taken by cameras that are not specifically designed for monitoring [29]. Images of highway and typical street scenes are part of the KITTI benchmark dataset [30], which is used for autonomous vehicle driving and can handle tasks like 3D object tracking and detection. The Stanford Car Dataset [31] is a collection of images of vehicles captured by unattended cameras in well-lit conditions. This collection contains 19,620 distinct categories of cars, providing comprehensive information about the various brands. The Comprehensive Cars Dataset [32] includes a larger number of images. The collection of 27,618 images encompasses several aspects of the vehicle, such as its top speed, handling, and classification, and a further 136,727 photos capture the overall appearance of the cars. Other datasets consist of images taken by security cameras. One instance is the BIT-Vehicle Dataset [33], which has a total of 9850 photographs. The shooting angle is high, and the size of the car object is small in each picture, making it challenging to generalise for training a Convolutional Neural Network (CNN).

The dataset contains images extracted from videos of highway inspections (see Fig. 3). The highway and market images are captured so that they include vehicles at different intervals of time. The mobile camera, which has a flexible field of view and is not fixed in place, was mounted on the side of the road. The photographs taken from this vantage point depict the longest stretch of highway and feature vehicles whose apparent sizes vary widely. The images in the collection come from 12 separate observation cameras set up in various environments with varying degrees of illumination. The automobiles are categorized into five distinct groups in this dataset. A text file containing the object class's numerical code and the bounding box's normalized coordinates stores the label record. The small objects were made clearer in the surrounding street area; in this way the dataset incorporates vehicle objects that have been heavily resized. The number of features in an example that is far from the camera is much lower than in one that is close to it. The process of image annotation involves the utilisation of the following annotation tools: LabelImg and the Visual Object Tagging Tool (VOTT). Image annotation refers to the process of manually labeling or marking objects or regions of interest within an image. These annotations provide valuable information about the content and characteristics of the objects present in the image, which is essential for training and evaluating machine learning models, particularly in computer vision tasks.

Utilising annotated instances of different sizes helps enhance the detection accuracy of small vehicle objects. Separate sets, designated as "training" and "test", make up this dataset. On average, there are 5.15 annotated instances in every picture. Potentially applicable in many different countries, including India, our dataset has the potential to be a comprehensive vehicle-focused set. The dataset offers a plethora of high-quality images, proper lighting, and detailed descriptions, in contrast to the existing vehicle datasets. The road surface area is divided so that subsequent vehicle detection can make an accurate contribution. An image without rotation is created for the extracted road surface by constructing a rectangle around a baseline. A quarter of the processed image is the near-to-distant space of the road surface, and the other four-fifths of the image is the near-to-proximal space of the road surface; these two regions are defined with respect to the origin of the arranged axis. To fix the problem where a car in the picture could be split in half by the previous method, the near-to-proximal and near-to-distant regions overlap by 100 pixels. The pixel values of the near-to-proximal and near-to-distant regions are examined column by column. If all of the pixel values in a column are zero, the road surface is not visible in that column of the image; in this case, the column is deleted. The distant and proximal regions of the street surface are the regions saved after the non-road-surface regions are discarded.

Vehicle Detection Using Deep Learning

The YOLOv8 model is made up of different types of layers that work together to understand images and find objects within them. These layers include convolutional layers, bottleneck layers, spatial pyramid pooling, up-sample layers, concatenation layers, and a detection layer. Each type of layer has its own specific job in helping the model learn about images and spot objects.

The model's architecture, shown in Table 2, details how these layers are arranged. Convolutional layers are the building blocks, helping the model see different parts of the image. They start by transforming the input image to help the model recognize features better. As the model goes deeper, these layers keep doubling the number of feature channels the model works with while attending to different details in the image.
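For reference, the annotation format and the overlapping road-surface split described in the dataset subsection above can be handled as in the following sketch. It assumes the standard YOLO text label format (class index plus normalized centre, width, and height); the one-quarter/four-fifths split and the 100-pixel overlap follow the description, while the file paths and helper names are purely illustrative.

```python
import cv2

def load_yolo_labels(label_path, img_w, img_h):
    """Each line: <class_id> <x_center> <y_center> <width> <height>, all normalized to [0, 1]."""
    boxes = []
    with open(label_path) as f:
        for line in f:
            cls, xc, yc, w, h = line.split()
            xc, yc, w, h = (float(xc) * img_w, float(yc) * img_h,
                            float(w) * img_w, float(h) * img_h)
            x1, y1 = int(xc - w / 2), int(yc - h / 2)
            x2, y2 = int(xc + w / 2), int(yc + h / 2)
            boxes.append((int(cls), x1, y1, x2, y2))
    return boxes

def split_road_regions(image, overlap_px=100):
    """Top quarter ~ distant road surface, bottom four-fifths ~ proximal road surface,
    overlapping by 100 pixels so a vehicle on the boundary is not cut in half."""
    h = image.shape[0]
    distant = image[: h // 4 + overlap_px, :]
    proximal = image[h - (4 * h) // 5 - overlap_px :, :]
    return distant, proximal

img = cv2.imread("frames/000123.jpg")                       # hypothetical frame
labels = load_yolo_labels("labels/000123.txt", img.shape[1], img.shape[0])
distant_roi, proximal_roi = split_road_regions(img)
```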

The bottleneck layer is another important part of this model. It is used multiple times and helps the model become more expressive while keeping the required computation under control. Think of it as an efficient way to handle large amounts of information. The spatial pyramid pooling layer helps the model understand the size of objects in the picture. It combines information from different parts of the image to make sure the model does not miss objects that might be big or small.

Layers like up-sample and concatenation are tools that help the model see more details. They increase the resolution of what the model understands and allow it to combine different pieces of information from various parts of the image. This helps find objects that might be of different sizes or in different parts of the picture. Finally, the detection layer is responsible for the actual job of finding objects. It figures out what objects are in the image and where they are located. It looks at different sizes and shapes of objects to make sure it finds them accurately.

The way YOLOv8 is designed helps it do a great job of finding objects in pictures of all sizes and shapes. It is also fast, which makes it well suited for tasks where objects must be spotted in real time, like video or live camera feeds. Our goal is to verify the placement of the object by revisiting the original image, using the automobile bounding boxes identified in the two zones. The location and classification data obtained from object tracking using the vehicle object detection approach can be quite significant. The vehicle detection method does not take into account the precise attributes and condition of the car, as the provided data is adequate for a vehicle-level inspection.

Pseudocode of the Model

Step 1: Initialize the architecture and parameters of the model: initialize the YOLOv8 model
Step 2: Load pre-trained weights (if available) or train the model on the vehicle dataset
Step 3: Preprocess input images:
  3.1 Resize the image to the required input size
  3.2 Normalize pixel values
  3.3 Convert the image to the appropriate format for model input
Step 4: Forward pass through the network for each input image:
  4.1 Pass the image through the YOLOv8 model to obtain bounding boxes, confidence scores, and class probabilities
Step 5: Post-process the predictions:
  5.1 Apply non-max suppression to remove redundant bounding boxes
  5.2 Filter detections based on a confidence threshold
  5.3 Extract vehicle bounding boxes and corresponding classes
Step 6: Output the results for each detected vehicle:
  6.1 Display bounding box coordinates
  6.2 Display vehicle class (car, truck, etc.)
Step 7: Evaluation and performance metrics: evaluate model performance using metrics like precision, recall, and mAP
Step 8: Model fine-tuning or improvement (optional): based on evaluation results, fine-tune the model or make improvements
Step 9: End

The Process Flow of the Proposed Model

The proposed methodology's process flow encompasses data preparation, model training, model evaluation, and model deployment. The details of the methodology are explained in the following sub-sections (Fig. 2).

Fig. 2  The workflow of the proposed deep learning based model

Explanation of the Process Flow

1. Data Preparation
   Data preparation involves two activities, viz. dataset preparation and augmentation. Under the dataset preparation task, images are collected and annotated with the objects to be detected. The images are preprocessed by resizing and normalizing them. Next, the dataset is divided into three parts: training, validation, and test. Data augmentation techniques like colour jittering, random cropping, and flipping are then applied. This is done to increase the variety of the training data and enhance the generalizability of the model.
2. Model Training
   For model training, the configuration is done by choosing the desired YOLOv8 model variant (e.g., YOLOv8s, YOLOv8m, YOLOv8l). Then hyperparameters like learning rate, optimizer, and training epochs are set. Training of the model is done on the prepared data using PyTorch or TensorFlow. Training progress is monitored by tracking metrics like loss, accuracy, and mAP (mean Average Precision), and hyperparameters are fine-tuned to achieve optimal performance (an end-to-end sketch of this configuration is given below).
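A minimal end-to-end sketch of the pseudocode above and of items 1, 2, and 4 of this workflow, assuming the Ultralytics YOLOv8 API, is given below. The dataset YAML, augmentation strengths, hyperparameters, file names, and thresholds are illustrative placeholders rather than the exact configuration used in this study; Ultralytics handles resizing, normalization, and non-max suppression internally.

```python
from ultralytics import YOLO

# Steps 1-2 / item 2: initialize a chosen model variant with COCO pre-trained weights.
model = YOLO("yolov8s.pt")                 # alternatives: yolov8n/m/l/x

# Items 1-2: train on the prepared vehicle dataset with low-light-oriented augmentation.
model.train(
    data="vehicles_lowlight.yaml",         # hypothetical dataset config (paths + class names)
    epochs=100, imgsz=640,
    lr0=0.01, optimizer="SGD",
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.6,     # colour jittering; hsv_v also varies brightness
    fliplr=0.5, mosaic=1.0,                # flipping and mosaic augmentation
)

# Step 7: evaluate precision, recall, and mAP on the validation split.
metrics = model.val()
print(metrics.box.map50, metrics.box.map)  # mAP50 and mAP50-95

# Steps 3-6: preprocessing, forward pass, confidence filtering, and NMS via predict().
results = model.predict("night_highway.jpg", conf=0.25, iou=0.45)
for r in results:
    for box in r.boxes:
        name = model.names[int(box.cls)]
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"{name}: ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}) conf={float(box.conf):.2f}")

# Item 4: export the trained model for deployment.
model.export(format="onnx")
```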

Table 2  The architectural model of YOLOv8

Layer | Type of layer | Feature maps outputted | Kernel size (K)/stride (S)/padding (P)
1 | Conv (convolutional layer) | 3–16 | 3 × 3 / 2 / 1
2 | Conv (convolutional layer) | 16–32 | 3 × 3 / 2 / 1
3 | C2f (bottleneck) | 32–32 | Various
4 | Conv (convolutional layer) | 32–64 | 3 × 3 / 2 / 1
5 | C2f (bottleneck) | 64–64 | Various
6 | Conv (convolutional layer) | 64–128 | 3 × 3 / 2 / 1
7 | C2f (bottleneck) | 128–128 | Various
8 | Conv (convolutional layer) | 128–256 | 3 × 3 / 2 / 1
9 | C2f (bottleneck) | 256–256 | Various
10 | SPPF (spatial pyramid pooling fusion) | 256–128 | Various
11 | Upsample (upsampling layer) | – | 2.0 scale
12 | Concat (concatenation layer) | – | –
13 | C2f (bottleneck) | 384–128 | Various
14 | Upsample (upsampling layer) | – | 2.0 scale
15 | Concat (concatenation layer) | – | –
16 | C2f (bottleneck) | 192–64 | Various
17 | Conv (convolutional layer) | 64–64 | 3 × 3 / 2 / 1
18 | Concat (concatenation layer) | – | –
19 | C2f (bottleneck) | 192–128 | Various
20 | Conv (convolutional layer) | 128–128 | 3 × 3 / 2 / 1
21 | Concat (concatenation layer) | – | –
22 | C2f (bottleneck) | 384–256 | Various
23 | Detect (detection layer) | Various | Various

3. Model Evaluation
   The performance of the trained model is assessed on the validation set to evaluate how well it performs on data it has not seen before. The model's strengths and weaknesses are analyzed to identify areas for improvement. The final model is then tested on the test set to obtain a final performance measure, and its performance is compared to other object detection models on benchmark datasets.
4. Model Deployment
   Exporting: export the trained model to a format suitable for inference (e.g., ONNX, TorchScript). Integration: integrate the model into the desired application (e.g., web app, mobile app, security system). Inference: run the model on new images to detect objects in real time or offline. Monitoring: monitor the model's performance in deployment and retrain it if necessary.

YOLOv8 Life Cycle Mathematically

YOLOv8 involves several mathematical components and steps in its lifecycle, as described below.

1. Input Data:
   The input image is usually a 3D tensor with the dimensions (H, W, C), where H is the height of the image, W is its width, and C represents the number of color channels (commonly 3 in RGB images). YOLOv8 uses this input to produce its output predictions.
2. Anchor Boxes:
   YOLOv8 anticipates bounding boxes of varying sizes and aspect ratios using anchor boxes. The dataset provides the parameters for these anchor boxes, which are specified in advance as width and height pairs (wi, hi).
3. Grid Cells:
   A grid of cells is used to partition the input image. Everything inside the borders of a given grid cell is subject to that cell's own set of predictions. The design of the network dictates the vertical and horizontal numbers of grid cells.
4. Convolutional Neural Network (CNN):
   YOLOv8 employs a deep CNN architecture to process the input image and extract features. Convolutional layers, down-sampling layers (such as strided convolution or max-pooling), and feature extraction modules make up the architecture.
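To make Table 2 and the layer descriptions concrete, the following PyTorch sketch expresses the first backbone stages as 3 × 3, stride-2 convolution blocks (Conv, BatchNorm, SiLU) that double the channel count, with a simplified residual stage standing in for the C2f bottleneck blocks. This is a schematic re-implementation for illustration only, not the Ultralytics source code.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """3x3 convolution with stride 2 and padding 1 (as in Table 2), then BatchNorm and SiLU."""
    def __init__(self, c_in, c_out, k=3, s=2, p=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, p, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SimpleBottleneck(nn.Module):
    """Stand-in for a C2f stage: two 3x3 convolutions with a residual connection."""
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(ConvBlock(c, c, s=1), ConvBlock(c, c, s=1))

    def forward(self, x):
        return x + self.body(x)

# Layers 1-5 of Table 2: 3 -> 16 -> 32 -> 64 channels with bottleneck stages in between.
stem = nn.Sequential(
    ConvBlock(3, 16),          # layer 1: halves spatial size, 3 -> 16 feature maps
    ConvBlock(16, 32),         # layer 2: 16 -> 32 feature maps
    SimpleBottleneck(32),      # layer 3: bottleneck stage at 32 channels
    ConvBlock(32, 64),         # layer 4: 32 -> 64 feature maps
    SimpleBottleneck(64),      # layer 5: bottleneck stage at 64 channels
)

x = torch.randn(1, 3, 640, 640)
print(stem(x).shape)           # torch.Size([1, 64, 80, 80])
```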

   In the backbone, each convolutional layer transforms the feature map produced by the previous layer:

   Z_i = Conv_i(Z_{i-1})

   where Z_i represents the feature map produced by the i-th convolutional layer, and Conv_i() denotes the corresponding convolutional operation.

5. Predictions:
   YOLOv8 estimates class probabilities and bounding boxes for every grid cell. Each bounding-box prediction consists of four numbers describing its centre and dimensions, x, y, w, and h, together with a confidence score Conf indicating how confident the model is that an object is present within the box. The class probability P(Class_i) represents the likelihood of the object belonging to a particular class i.

   Prediction = (x, y, w, h, Conf, P(Class_1), P(Class_2), ..., P(Class_n))

6. Non-Maximum Suppression (NMS):
   To filter out redundant bounding box predictions, YOLOv8 uses Non-Maximum Suppression. This involves:

   • Discarding boxes with low confidence scores (Conf < confidence_threshold).
   • Suppressing overlapping boxes by keeping the one with the highest confidence score.

   Bounding boxes are compared using their Intersection over Union (IoU), and the NMS method discards boxes whose IoU with a higher-scoring box exceeds a specific threshold.

   IoU = Area of Intersection / Area of Union

7. Model Evaluation Metrics/Indicators:

   • The IoU measures the degree to which the predicted and ground-truth bounding boxes overlap.
   • The precision and recall metrics are used to calculate the F1 score. Precision counts how many of the predicted positives are true positives, and recall measures how many of all actual positives are correctly predicted.
   • The F1 score quantifies the balance between recall and precision by combining them into a single value. Recall and precision are two key performance indicators of a model's ability to accurately identify relevant objects and to retrieve all relevant items simultaneously.

   F1 = 2 · (Precision · Recall) / (Precision + Recall)
   Precision = TP / (TP + FP)
   Recall = TP / (TP + FN)

   • mAP (mean Average Precision) takes into account the average accuracy across precision-recall curves for different types of objects. It is a frequently used measure in the domains of object identification and information retrieval. This statistic is especially valuable for evaluating the efficiency of models in accurately identifying several classes or categories. It is expressed as a percentage and is determined by averaging the Average Precision (AP) values that are computed separately for each class or category in the dataset.

   mAP = (1/N) · Σ_{i=1}^{N} AP_i

   where N is the total number of object classes.
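A compact, self-contained illustration of the IoU computation, the greedy NMS procedure, and the evaluation metrics defined above is sketched below. It illustrates the idea rather than reproducing the exact routines used inside YOLOv8 or its validator; the thresholds, counts, and per-class AP values are hypothetical, and in practice the per-class AP comes from the precision-recall curve produced during validation.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in [x1, y1, x2, y2] format."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, conf_thresh=0.25, iou_thresh=0.45):
    """Greedy NMS: drop low-confidence boxes, then suppress overlaps of the best remaining box."""
    keep = scores >= conf_thresh
    boxes, scores = boxes[keep], scores[keep]
    order = np.argsort(-scores)                 # highest confidence first
    selected = []
    while order.size:
        best = order[0]
        selected.append(best)
        if order.size == 1:
            break
        order = order[1:][iou(boxes[best], boxes[order[1:]]) <= iou_thresh]
    return boxes[selected], scores[selected]

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def mean_average_precision(ap_per_class):
    """mAP = (1/N) * sum of per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical numbers, purely for illustration.
p, r, f1 = precision_recall_f1(tp=950, fp=27, fn=50)
map_value = mean_average_precision([0.91, 0.88, 0.95])
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f} mAP={map_value:.3f}")
```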

Results and Discussion

The procedures used to test the methods outlined in the preceding sections are described in full below. For our testing, we utilized the pre-existing collection of vehicle objects in the vehicle dataset described above. To capture three separate scenes, our study made use of high-quality highway recordings, as shown in Figs. 3, 4 and 5.

Fig. 3  Identification of vehicles

Table 3 presents the evaluation metrics for object detection and instance segmentation on a dataset containing 35,000 images. The metrics are divided into categories for bounding boxes and instance segmentation masks (Fig. 5).

For this dataset, the model's performance metrics for both bounding boxes and instance segmentation masks are evaluated. The precision for bounding boxes is notably high at 0.972, indicating a strong ability to correctly identify and localize objects. The recall for bounding boxes is also commendable at 0.95, indicating that the model effectively captures a substantial proportion of the actual instances in the dataset. The mean Average Precision at 50% overlap for bounding boxes is 0.957, signifying robust performance in terms of object localization accuracy.

Fig. 4  Identification of vehicles

Table 3  Performance metrics for vehicle detection and instance segmentation of the model

Images | Box (P) | R | mAP50 | mAP50-95 | Mask (P) | R | mAP50 | mAP50-95
15,000 | 0.972 | 0.95 | 0.957 | 0.924 | 0.971 | 0.95 | 0.957 | 0.921
15,000 | 0.981 | 0.98 | 0.989 | 0.965 | 0.971 | 0.98 | 0.989 | 0.963
15,000 | 0.962 | 0.98 | 0.995 | 0.995 | 0.952 | 0.98 | 0.995 | 0.991
15,000 | 0.951 | 0.99 | 0.995 | 0.926 | 0.941 | 0.99 | 0.995 | 0.926
15,000 | 0.974 | 0.97 | 0.941 | 0.995 | 0.973 | 0.97 | 0.941 | 0.995

Fig. 5  Identification of vehicles

Additionally, the model demonstrates consistency in performance across a broader range of overlap thresholds, with a mean Average Precision from 50 to 95% overlap for bounding boxes of 0.924 (Table 3).

Moving to the instance segmentation metrics, the precision for masks is impressively high at 0.971, indicating the model's proficiency in accurately delineating object boundaries. The recall for masks is also notable at 0.95, suggesting that the model effectively captures a significant portion of the object instances in the dataset. The mean Average Precision at 50% overlap for masks is 0.957, showcasing the model's excellence in instance segmentation accuracy. Furthermore, the model maintains a high level of performance across varying overlap thresholds, with a mean Average Precision from 50 to 95% overlap for masks of 0.921.

Overall, these metrics collectively demonstrate the model's robust and accurate performance in both bounding box localization and instance segmentation tasks on the given dataset (Table 3).

These metrics indicate high performance, with high precision, recall, and mAP scores for both bounding boxes and instance segmentation masks. The model appears to be effective in accurately detecting and segmenting objects in the given dataset. Further fine-tuning or experimentation may still be conducted to optimize the model or adapt it to specific requirements (Table 3).

Our results also include an assessment of the framework's night-time detection ability. Images captured under the artificial illumination of city nightlights can still be processed by our system. No matter how dim the light is, the system can still make out the general shapes of objects as if they were fully illuminated, even in very low light. Objects with few pixels, hazy objects, and occluded objects are the focus of our labeling. The findings demonstrate that our strategy for feature extraction surpasses traditional techniques [7, 8], although it is strongly affected by the light conditions provided by car front and rear lights under extremely low lighting conditions. We found that systems with straightforward backbones, like ResNet101, also exhibit satisfactory performance in detecting night scenes (Table 4).

Our framework made use of ResNet101 to extract features. During the night, frameworks that use networks with alternate paths (skip connections) as feature extractors may accurately identify partially transparent and somewhat small objects. Interestingly, the ResNet101 framework needed roughly twice as much processing time as the VGG16 framework for images with a 500 × 375-pixel dimension. To achieve continuous detection on a system that has limited processing capability, the picture size should be reduced for embedded frameworks. Objects that are occluded, tiny, or hazy, especially in extremely dark environments, are the focus of our review, which also covers advanced labeling methods and frameworks. We observed significant simplification and improvements, including in the performance levels. According to the testing results, whether it is dark outside or there is simply not enough light, the models trained using our evening datasets, which were labelled by our approach, appear to distinguish small, veiled objects (Table 4).

Table 4  The aggregate count of vehicles identified through various methodologies

Video name | Video frames | Vehicle category | Proposed method (Remote_Area / Proximal_Area) | Image detection method (Remote_Area / Proximal_Area) | Total vehicle count in the video (Remote_Area / Proximal_Area)
Video-1 | 10,000 | Car | 9728 / 11,510 | 693 / 8616 | 9840 / 11,550
Video-1 | 10,000 | Bus | 1062 / 569 | 72 / 379 | 1082 / 580
Video-1 | 10,000 | Motorcycle | 11,701 / 7390 | 40,040 / 3703 | 11,792 / 8471
Video-2 | 10,000 | Car | 6890 / 5515 | 1192 / 3356 | 6914 / 5654
Video-2 | 10,000 | Bus | 594 / 874 | 102 / 295 | 607 / 882
Video-2 | 10,000 | Motorcycle | 9097 / 7509 | 3122 / 2738 | 9169 / 7731
Video-3 | 10,000 | Car | 5804 / 4136 | 1024 / 1188 | 5834 / 4352
Video-3 | 10,000 | Bus | 755 / 316 | 126 / 195 | 783 / 329
Video-3 | 10,000 | Motorcycle | 9708 / 7900 | 3231 / 3266 | 9726 / 8007

Table 5  The real vehicle numbers are compared using a variety of methodologies

Vehicle category (number of vehicles in video) | Remote_Area: Proposed method (%) | Remote_Area: Input-image detection method (%) | Proximal_Area: Proposed method (%) | Proximal_Area: Input-image detection method (%) | Average: Proposed method (%) | Average: Input-image detection method (%)
Car | 99.26 | 11.58 | 99.39 | 43.54 | 99.175 | 30.06
Bus | 98.94 | 22.08 | 98.05 | 63.86 | 98.995 | 42.97
Motorcycle | 99.11 | 18.21 | 98.61 | 80.99 | 98.56 | 49.6
Overall correct percentage | 98.64 | 31.96 | 98.82 | 72.80 | 98.976 | 48.86

In settings that were extremely dim, with almost no brightening or with extremely weak lighting, our methods provided detection performance levels that were worthy of praise. The detection performance achieved by the proposed techniques was higher than that achieved by the initial tactics. When handling images at a resolution of 500 × 375 pixels, the mAP values increased from around 0.2 to 0.8497, achieving a frame rate of 16 frames per second. The visual comparison of the output photographs provided both an external and a qualitative confirmation of this conclusion. The technique that we have suggested is capable of accurately identifying automobiles in a variety of urban evening settings, including those that are extremely dim. In future work, we plan to improve the implementation of our system by employing several optional standardization methodologies, and we anticipate focusing on photographs taken under extremely bright lighting conditions. In particular, for the data collected during the evening hours, we suggest that suitable measurements be adopted to appropriately quantify the model performance for enclosing objects (Tables 4 and 5).

Conclusion and Future Works

Our model has the ability to accurately identify obscured objects in photographs captured under urban nighttime illumination. Even in extremely dim circumstances where there are no lights from vehicle headlights or taillights to brighten the surroundings, the system is capable of perceiving the shapes of objects as long as they are distinct. The labeling we use is associated with obstructed objects, hazy objects, and objects with low pixel counts. Models that incorporate networks with alternate pathways as feature extractors may accurately identify both large and small objects in low-light conditions. In order to achieve consistent detection on a computer that has very limited processing capability, the size of the picture should be reduced for embedded frameworks. Our review discusses improved labeling algorithms and frameworks for obstructed items, small objects, and objects in foggy environments, especially in extremely low-light conditions. We achieved significant enhancements and optimization, including in the performance levels. Our proposed technique is capable of accurately detecting automobiles in various urban evening conditions and extremely low-light conditions.

For our upcoming research, we will employ human–computer interface and computer vision technologies to analyse the automobile violation detection system. Computer vision technology enables the prompt and precise analysis and comprehension of gathered picture data, facilitating rapid image detection and early identification of violation information. Intelligent vehicle infraction detection systems can achieve various kinds of information fusion through human–computer interaction technologies. Implementing this will enhance the precision, dependability, and resilience of the detection system, hence mitigating issues arising from the failure of a single sensor or incorrect assessments.

Author Contributions (1) Pramod Kumar Vishwakarma: Research Scholar, (2) Nitin Jain: Research Supervisor.

Funding Not applicable.

Data Availability Not applicable.

Declarations

Conflict of interest No financial support, directly or indirectly, is related to the research.

Informed Consent I hereby give my consent to publish the research paper.

Research Involving Humans and/or Animals Not applicable.

References

1. Yilmaz AA, et al. A vehicle detection approach using deep learning methodologies. ArXiv. 2018;abs/1804.00429. https://doi.org/10.48550/arXiv.1804.00429.
2. Tas S, et al. Deep learning-based vehicle classification for low quality images. Sensors. 2022;22(13):4740. https://doi.org/10.3390/s22134740.
3. Trivedi J, Devi MS, Dhara D. Vehicle classification using the convolution neural network approach. Zeszyty Naukowe. Transport/Politechnika Śląska; 2021.
4. Vijayaraghavan V, Laavanya M. Vehicle classification and detection using deep learning. Int J Eng Adv Technol. 2019;9:24–8.
5. Hassaballah M, et al. Vehicle detection and tracking in adverse weather using a deep learning framework. IEEE Trans Intell Transp Syst. 2020;22(7):4230–42.
6. Meimetis D, et al. Real-time multiple object tracking using deep learning methods. Neural Comput Appl. 2023;35(1):89–118.
7. Trivedi J, Devi MS, Dhara D. Vehicle classification using the convolution neural network approach. Series Transport; 2021.
8. Chen Y, Zhenjin L. An effective approach of vehicle detection using deep learning. Comput Intell Neurosci. 2022;2022.
9. Karungaru S, Lyu D, Kenji T. Vehicle detection and type classification based on CNN-SVM. Int J Mach Learn Comput. 2021;11(4):304–10.
10. Prasad M, et al. Multi-view vehicle detection based on part model with active learning. In: 2018 International Joint Conference on Neural Networks (IJCNN). IEEE; 2018.
11. Yaraş N. Vehicle type classification with deep learning. MS thesis. Izmir Institute of Technology (Turkey); 2020.

12. Faruque MO, Hadi G, Chengjun L. Vehicle classification in video using deep learning. Mach Learn Data Min Pattern Recognit MLDM. 2019;117–31.
13. Maungmai W, Chaiwat N. Vehicle classification with deep learning. In: 2019 IEEE 4th International Conference on Computer and Communication Systems (ICCCS). IEEE; 2019.
14. Jagannathan P, et al. Moving vehicle detection and classification using Gaussian mixture model and ensemble deep learning technique. Wireless Commun Mob Comput. 2021;2021:1–15.
15. Tsourounis D, et al. SIFT-CNN: when convolutional neural networks meet dense SIFT descriptors for image and sequence classification. J Imag. 2022;8(10):256.
16. Sowmya V, Radha R. Efficiency-optimized approach-vehicle classification features transfer learning and data augmentation utilizing deep convolutional neural networks. Int J Appl Eng Res. 2020;15(4):372–6.
17. Sathyanarayana N, Anand MN. Vehicle type classification using hybrid features and a deep neural network. Int J Appl Metaheuristic Comput (IJAMC). 2022;13(1):1–22.
18. Koga Y, Hiroyuki M, Ryosuke S. A CNN-based method of vehicle detection from aerial images using hard example mining. Remote Sens. 2018;10(1):124.
19. Arinaldi A, Jaka AP, Arlan AG. Detection and classification of vehicles for traffic video analytics. Proc Comput Sci. 2018;144:259–68.
20. Păvăloi I, Anca I. Iris image classification using SIFT features. Proc Comput Sci. 2019;159:241–50.
21. Xie L, et al. Image classification with Max-SIFT descriptors. In: International Conference on Acoustics, Speech and Signal Processing; 2015.
22. Yaraş N. Vehicle type classification with deep learning. MS thesis. Izmir Institute of Technology (Turkey); 2020.
23. Bukała A, et al. Classification of histopathological images using scale-invariant feature transform. In: VISAPP 2020 – 15th International Conference on Computer Vision Theory and Applications; 2022.
24. ImageNet. 2017. https://www.image-net.org/index.php. Accessed 1 Sept 2023.
25. CIFAR. 2009. CIFAR-10 and CIFAR-100 datasets. https://www.cs.toronto.edu/~kriz/cifar.html. Accessed 3 Sept 2023.
26. Pascal VOC. 2012. The PASCAL visual object classes. http://host.robots.ox.ac.uk/pascal/VOC/. Accessed 7 Oct 2023.
27. Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollar P, Zitnick CL. Microsoft COCO: common objects in context. In: European Conference on Computer Vision. London: Springer; 2014. p. 740–55.
28. Jiang P, Ergu D, Liu F, Cai Y, Ma B. A review of YOLO algorithm developments. Proc Comput Sci. 2022;199:1066–73.
29. Al-refai G, Hisham E, Mutaz R. In-vehicle data for predicting road conditions and driving style using machine learning. Appl Sci. 2022;12(18):8928.
30. Ahmad AB, et al. Vehicle auto-classification using machine learning algorithms based on seismic fingerprinting. Computers. 2022;11(10):148.
31. Liu H. Vehicle verification using deep learning for connected vehicle sharing systems. In: The ACM MobiSys 2019 Rising Stars Forum; 2019.
32. Prytz R. Machine learning methods for vehicle predictive maintenance using off-board and on-board data. Dissertation, Halmstad University Press; 2014.
33. Lee HJ, Ullah I, Wan W, Gao Y, Fang Z. Real-time vehicle make and model recognition with the residual SqueezeNet architecture. Sensors. 2019;19:982. https://doi.org/10.3390/s19050982.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.