Final Report
EXTRACTION IN VIDEO
A PROJECT REPORT
Submitted by
NIVETHA N 211720243037
CHARUNITHYA P 211720243009
JEEVANANDHU V 211720243301
of
BACHELOR OF TECHNOLOGY
IN
MAY 2024
ANNA UNIVERSITY: CHENNAI 600 025
BONAFIDE CERTIFICATE
under my supervision.
SIGNATURE SIGNATURE
Dr. N. KANAGAVALLI, B.E., M.E., Ph.D.    Dr. K. REGIN BOSE, M.E., Ph.D.
We thank our principal, Dr. N. BHALAJI, B.E., M.E., Ph.D., and our
Head of the Department Dr. N. KANAGAVALLI, B.E., M.E., Ph.D., for their
suggestions for the development and completion of this project. Words fail to
express our gratitude to our project coordinator Dr. B. N. KARTHIK, B.Tech,
M.Tech, Ph.D., and guide Dr. K. REGIN BOSE, M.E., Ph.D., who took a
special interest in our project and gave their consistent support and guidance
during all stages of this project.
ABSTRACT
Object detection has witnessed significant advancements with the advent of deep learning techniques. Existing object detection systems such as YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector) use deep learning models to detect and localize objects within images or video frames in real time with high accuracy and efficiency. In this study, we present a system that spans two hardware architectures: CPU and GPU. The proposed system takes video data as input, segments it into frames of images, and extracts features from the pre-processed images on the CPU. The features are then passed to the classification model, You Only Look Once version 7 (YOLOv7), which runs on the GPU. YOLOv7 is an evolution of the YOLO object detection architecture, offering improved speed and accuracy through advances in model design and training techniques. Combining YOLOv7 with ResNet-101, known for its superior feature extraction, enhances the system's ability to accurately detect and classify objects in video streams.
TABLE OF CONTENTS
CHAPTER
1.2.2 INTEGRATION OF MACHINE LEARNING IN OBJECT DETECTION SYSTEM
1.5 SIGNIFICANCE OF INTELLIGENT OBJECT DETECTION
1.6 PROBLEM DEFINITION
2. LITERATURE SURVEY
2.1 LITERATURE SURVEY WORKS
2.2 RELATED WORKS
2.3 KEY AREAS OF FOCUS
2.4 LIMITATIONS
3. SYSTEM DESIGN
3.1 SYSTEM ARCHITECTURE
3.2 PROCESS WORKFLOW
3.3 DATA FLOW MODEL
3.3.1 DFD LEVEL 0
3.3.2 DFD LEVEL 1
3.3.3 DFD LEVEL 2
3.4 UML DIAGRAMS
3.4.1 USE CASE DIAGRAM
3.4.2 ACTIVITY DIAGRAM
3.4.3 SEQUENCE DIAGRAM
3.5 WORKFLOW
4. SYSTEM IMPLEMENTATION
4.1 TRACED YOLO V7 MODEL
4.2 WORKING PRINCIPLES OF TRACED YOLO V7 MODEL
4.3 APPROACH
4.4 ADVANTAGES OF USING TRACED YOLO V7 MODEL CLASSIFIER OVER OTHER CLASSIFIERS
4.5.6 CLASSIFICATION
4.6 EVALUATION METRICS
4.7 LIMITATIONS OF YOLO V7
5. TESTING
5.1 TEST CASES
6. FUTURE WORK
APPENDICES
APPENDIX A SAMPLE SCREENSHOTS
APPENDIX B SAMPLE CODING
APPENDIX C SYSTEM REQUIREMENTS
REFERENCES
LIST OF FIGURES
FIGURE NO TITLE
2.1 EXISTING SYSTEM
2.2 SAMPLE OBJECT DETECTION SYSTEM
3.1 SYSTEM ARCHITECTURE
3.3 DFD LEVEL 0
3.4 DFD LEVEL 1
3.5 USE CASE DIAGRAM
3.6 ACTIVITY DIAGRAM
3.7 SEQUENCE DIAGRAM
4.1 YOLO V7 MODEL
4.2 APPROACH
4.3 OBJECT DETECTION SYSTEM
A.1 LANDING PAGE
A.2 UPLOAD A VIDEO
A.3 MODEL OUTPUT DOWNLOAD
A.4 OUTPUT 1
A.5 OUTPUT 2
A.6 OUTPUT 3
A.7 OUTPUT 4
LIST OF TABLES
TABLE NO TITLE
2.1 RELATED WORKS
5.1 TEST CASES
CHAPTER 1
INTRODUCTION
Modern one-stage detectors predict bounding boxes and class probabilities in a single pass. Despite
advancements, challenges persist, including variations in object scale and
achieving real-time accuracy under resource constraints. Ongoing research
focuses on developing efficient architectures and improving robustness against
adversarial attacks. Object detection stands apart from image recognition and
segmentation by its unique ability to locate objects, enabling tracking and
counting. Its applications span crowd counting, self-driving cars, surveillance,
and anomaly detection, highlighting its versatility and significance in modern
technologies.
1.1.2 EVOLUTION OF OBJECT DETECTION TECHNIQUES
The evolution of object detection techniques begins with classical methods like
Haar cascades and HOG, which relied on handcrafted features but faced
challenges with complex visual patterns. Region-based approaches introduced
the concept of generating region proposals before classification, with methods
like Selective Search efficiently identifying potential object regions. The advent
of deep learning, particularly CNNs, revolutionized object detection by learning
hierarchical representations directly from raw pixel data, eliminating the need
for manual feature engineering. Two-stage detectors like Faster R-CNN
integrated region proposal networks into the network architecture, improving
efficiency by generating region proposals from feature maps. In contrast, one-
stage detectors like YOLO prioritized speed but sometimes sacrificed accuracy.
Recent advancements include anchor-free approaches and hybrid architectures,
aiming to strike a balance between accuracy and speed, thus making object
detection more accessible and practical for diverse applications.
Deep learning frameworks such as PyTorch simplify algorithm implementation, while real-time processing
techniques optimize system performance for various applications, ensuring
efficient use of computational resources without compromising accuracy or
speed.
Deep learning enables automatic feature extraction from raw image data. Architectural advancements,
like two-stage and one-stage detectors, optimize performance for specific tasks,
enhancing efficiency. Training strategies, such as data augmentation and transfer
learning, improve model generalization, while evaluation metrics gauge
accuracy. Optimization techniques and hardware acceleration ensure real-time
inference, vital for applications like autonomous vehicles and surveillance
systems. Evaluation and optimization are integral, ensuring efficient deployment
on various platforms and devices. Performance metrics, including precision,
recall, and mAP, provide insights into model accuracy and effectiveness.
Efficient model architectures, coupled with GPUs and TPUs, enable rapid
analysis and decision-making. Machine learning facilitates diverse training
strategies, enhancing model performance and robustness. Transfer learning
repurposes models trained on large-scale datasets, saving time and resources
while improving performance. Advanced algorithms, particularly CNNs,
empower object detection systems to discern relevant information
autonomously, enhancing detection accuracy and reliability.
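To make the transfer-learning strategy described above concrete, the following minimal PyTorch sketch (an illustration, assuming the torchvision package; the class count of 80 is a stand-in matching the COCO label set used later in this report) repurposes an ImageNet-pretrained ResNet-101 by replacing its final layer:

import torch
import torch.nn as nn
from torchvision import models

# Load a ResNet-101 backbone pre-trained on ImageNet (transfer learning).
model = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)

# Freeze the pre-trained feature extractor to save time and resources.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer for a hypothetical 80-class task.
model.fc = nn.Linear(model.fc.in_features, 80)

# Only the new head is trained; the backbone representations are reused.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)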
Challenges such as occlusion, varying scales, lighting conditions, and complex backgrounds present
substantial obstacles to accurate detection.
With these datasets, the project will focus on designing and implementing
robust machine learning models, particularly leveraging deep learning
architectures like Convolutional Neural Networks (CNNs). These models will
be trained on the collected datasets to learn intricate patterns and features
directly from raw pixel data.
CHAPTER 2
LITERATURE SURVEY
The evolution of object detection in images and videos has been shaped
by groundbreaking research in machine learning and deep learning. Ren et al.'s
Faster R-CNN introduced Region Proposal Networks (RPNs), streamlining
detection and enabling real-time performance. Redmon et al.'s YOLO reframed
detection as a regression problem, reducing computational complexity and
enhancing speed, ideal for applications like video surveillance. Liu et al.'s SSD
introduced single-stage detection, improving efficiency and enabling
deployment on resource-constrained devices. Lin et al.'s FPN addressed scale
variation by integrating multi-scale feature maps, enhancing object detection
across sizes. He et al.'s Mask R-CNN extended Faster R-CNN with instance
segmentation capabilities, broadening applications in medical imaging and
video editing. Recent advancements include YOLOv5, offering improved
performance and efficiency, and EfficientDet, achieving superior performance
with fewer parameters. DETR introduces an end-to-end Transformer-based
architecture for accurate and interpretable detections. CenterNet simplifies
detection by directly predicting object centers, suitable for resource-limited
scenarios. Cascade R-CNN iteratively refines object proposals for enhanced
detection quality, particularly in challenging scenarios. NAS-FPN automates
feature pyramid network design, enhancing performance and scalability. PANet
introduces a path aggregation network for precise instance segmentation,
improving object understanding.
Fig 2.1 illustrates how seminal works in object detection have advanced
performance and expanded applicability across diverse domains.
2.2 RELATED WORKS
Table 2.1 Related Works

Title | Authors | Year | Venue | Related Work
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks | Ren, S., He, K., Girshick, R., & Sun, J. | 2016 | IEEE TPAMI | Fast R-CNN (Girshick, R.)
YOLOv3: An Incremental Improvement | Redmon, J., & Farhadi, A. | 2018 | arXiv | YOLOv4 (Bochkovskiy, A., Wang, C. Y., & Liao, H. Y. M.)
EfficientDet: Scalable and Efficient Object Detection | Tan, M., Pang, R., & Le, Q. | 2020 | CVPR | EfficientNet (Tan, M., & Le, Q.)
DETR: End-to-End Object Detection with Transformers | Carion, N., et al. | 2020 | ECCV | Vision Transformers (Dosovitskiy, A., et al.)
CenterNet: Object Detection with Center-Aspects Estimation | Mo, K., et al. | 2019 | CVPR | FCOS (Tian, Z., et al.)
SSD: Single Shot MultiBox Detector | Liu, W., et al. | 2016 | ECCV | YOLOv4 (Bochkovskiy, A., Wang, C. Y., & Liao, H. Y. M.)
2.3 KEY AREAS OF FOCUS
Let us delve deeper into each key area of focus for object detection systems using machine learning and deep learning approaches, as shown in Fig 2.2:
2.4 LIMITATIONS
CHAPTER 3
SYSTEM DESIGN
3.1 SYSTEM ARCHITECTURE
The system architecture for an object detection system using deep learning on videos, shown in Fig 3.1, is a comprehensive framework designed to
manage the intricate processes involved in detecting objects within video
streams. At its core, the Data Acquisition Module serves as the system's entry
point, facilitating the capture of video data from various sources such as
cameras, video files, or live streams. Once acquired, the Frame Segmentation Module extracts individual frames from the continuous video stream, employing techniques like keyframe extraction or motion-based segmentation to divide the video into manageable units. Following segmentation, the Image Pre-processing Module prepares each frame for analysis by applying a suite of transformations to enhance its quality
and standardize its format. This involves resizing frames, normalizing pixel
values, and reducing noise or artifacts that could impede accurate object
detection.
Fig 3.1 System Architecture
● Data Acquisition
● Frame Segmentation
● Data Pre-processing
● Feature Extraction
● Classification
Data Acquisition: This step involves gathering video data from various sources such as surveillance cameras or video files. Collecting videos from different viewpoints and lighting conditions, together with data augmentation techniques, can enhance system robustness. Metadata such as timestamps may also be collected for further analysis.
Image Pre-processing: Pre-processing prepares frames for input into the object
detection model by resizing them to a uniform size, normalizing pixel values, and
applying filters to enhance quality. This ensures consistency and optimization of
input frames for the detection model's requirements.
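As an illustration of these pre-processing steps, a minimal sketch assuming OpenCV (cv2), NumPy, and a 640x640 model input size could be:

import cv2
import numpy as np

def preprocess_frame(frame, size=(640, 640)):
    """Resize, denoise, and normalize one video frame for the detector."""
    frame = cv2.resize(frame, size)             # uniform input size
    frame = cv2.GaussianBlur(frame, (3, 3), 0)  # mild noise reduction
    frame = frame.astype(np.float32) / 255.0    # normalize pixels to [0, 1]
    return frame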
3.3 DATA FLOW MODEL
The diagrams below are the data flow diagrams that explain our work.
3.3.1 DFD LEVEL 0
Level 0 of the data flow diagram, shown in Fig 3.3, presents the basic flow of the project. The input data comes from the COCO dataset, and the result is the output data. The cleaned data is used for the detection process, and a detailed analysis of object detection from video is then performed over various parameters. The intermediate process involves analysis and detection, producing annotated frames used to detect objects.
3.3.2 DFD LEVEL 1
At Level 1 of the Data Flow Diagram (DFD) for an object detection system using deep learning on videos, shown in Fig 3.4, the central module is the
Object Detection System (Main), which oversees the entire process. This system
interfaces with two primary components: the Input Video File and the User
Interaction Interface. The Input Video File represents the source of video data,
which could be a video file stored locally or a video stream obtained from a
camera or other sources. On the other hand, the User Interaction Interface
facilitates interaction with the system, enabling users to provide input
parameters or view the system's output. Once the video data is received, it
undergoes Frame Pre-processing, a module responsible for preparing each
frame of the video for object detection.
The Display/Save module handles the presentation or storage of the detected objects.
It may display the detected objects to the user in real-time or save them to a file
for later analysis. This evaluation provides valuable insights into the system's
performance and helps in its refinement and optimization. Overall, these
interconnected components form a cohesive system for object detection from
videos using deep learning techniques.
3.4 UML DIAGRAMS
The following UML diagrams are used to describe the project:
3.4.1 USE CASE DIAGRAM
There are three actors involved in the use case diagram, shown in Fig 3.5: User, Detection System, and Database. The use cases associated with the User are uploading the video and receiving the output from the detection system. The use cases associated with the Database are data storage and data management. The use cases associated with the Detection System are frame segmentation, image processing, feature extraction, and classification.
3.4.2 ACTIVITY DIAGRAM
3.4.3 SEQUENCE DIAGRAM
The sequence diagram in Fig 3.7 shows the interactions between the various components involved in the object detection process. The lifelines are the User, the Detection System, and the Database. The User uploads the video, which is stored in the database. The system then processes the video file through frame segmentation and data pre-processing. The extracted features are used for machine learning classification, which applies algorithms to detect the objects in the video supplied by the user. Finally, a detection image is shown based on the machine learning results, summarizing the findings and providing recommendations for further detection.
3.5 WORKFLOW
Step 1: Gather video analysis parameters and objectives from users to tailor the
process.
Step 2: Read the video file or stream, ensuring compatibility and smooth data
ingestion.
Step 4: Initialize the object detection model, such as YOLO, ensuring proper
configuration and optimization.
Step 5: Train the model if required, leveraging labeled datasets for fine-tuning
and improved accuracy.
Step 6: Detect objects within each pre-processed frame using the initialized
model, capturing relevant features.
Step 8: Optionally visualize detected objects overlaid on the original frame for
intuitive analysis.
Step 10: Output detected objects' labels, confidence scores, and bounding box coordinates, facilitating interpretation and action (see the sketch following this list).
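As a sketch of the workflow above, assuming OpenCV for video input and a hypothetical detect_objects helper standing in for the initialized detection model:

import cv2

def run_video_detection(video_path, detect_objects, show=True):
    """Steps 2-10 in miniature: read frames, detect, visualize, report."""
    cap = cv2.VideoCapture(video_path)  # Step 2: open the video file/stream
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        detections = detect_objects(frame)  # Step 6: model inference per frame
        for label, score, (x1, y1, x2, y2) in detections:  # Step 10: outputs
            print(label, score, (x1, y1, x2, y2))
            if show:  # Step 8: optional visualization overlay
                cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        if show:
            cv2.imshow('detections', frame)
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break
    cap.release()
    cv2.destroyAllWindows()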
CHAPTER 4
SYSTEM IMPLEMENTATION
The seamless integration of the YOLOv7 model into the video processing
pipeline equips the system with the capability to accurately detect and classify
objects within video streams, bolstering its efficacy in domains like
surveillance, traffic monitoring, and video analytics. Through these steps, the
object detection system achieves a holistic understanding of the video content,
facilitating informed decision-making and enhancing situational awareness in
diverse scenarios.
Fig 4.1 YOLO V7 Model
Feature Extraction: The Traced YOLOv7 model employs a convolutional
neural network (CNN) backbone to automatically extract pertinent features
from pre-processed frames. These features capture critical object characteristics
such as edges, textures, and shapes.
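As a concrete illustration of backbone feature extraction (using a torchvision ResNet as a stand-in; the actual Traced YOLOv7 backbone differs), feature maps can be obtained as follows:

import torch
import torch.nn as nn
from torchvision import models

# Use a pre-trained ResNet as a stand-in CNN backbone: keep everything
# before the pooling/classification head so the output is a feature map.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone = nn.Sequential(*list(resnet.children())[:-2])
backbone.eval()

frame = torch.rand(1, 3, 640, 640)  # one pre-processed frame (toy input)
with torch.no_grad():
    features = backbone(frame)      # shape: (1, 2048, 20, 20)
print(features.shape)               # spatial map encoding edges/textures/shapes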
4.3 APPROACH
The approach of the Traced YOLOv7 model, shown in Fig 4.2, involves a systematic process geared towards enabling efficient deployment and real-time object detection in videos. Initially, the YOLOv7 model undergoes comprehensive training on a diverse dataset of annotated images and videos, where it learns to accurately detect and classify objects within the visual data through optimization of its parameters based on defined loss functions.
Following training, the model is subjected to optimization techniques aimed at
streamlining its architecture for deployment on edge devices. This optimization
phase typically encompasses model quantization, pruning, and compression to
reduce its size and computational complexity while maintaining performance
integrity.
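The "traced" part of the Traced YOLOv7 model refers to converting the trained network into a static TorchScript graph. A minimal sketch using torch.jit.trace, assuming a trained model object and a representative input size, is:

import torch

# Assume `model` is a trained YOLOv7 nn.Module in eval mode.
model.eval()
example = torch.rand(1, 3, 640, 640)  # representative input for tracing

# Tracing records the operations executed on the example input and
# freezes them into a static TorchScript graph for efficient deployment.
traced = torch.jit.trace(model, example)
traced.save('yolov7_traced.pt')

# The traced module can be reloaded without the original Python class,
# which suits edge deployment after quantization/pruning.
loaded = torch.jit.load('yolov7_traced.pt')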
Fig 4.2 Approach
4.4 ADVANTAGES OF USING TRACED YOLO V7
MODEL CLASSIFIER OVER OTHER CLASSIFIERS
Accurate Object Detection: The YOLOv7 architecture, upon which the Traced
YOLOv7 model is based, is known for its high accuracy in object detection
tasks. By leveraging advanced deep learning techniques and multi-scale feature
extraction, the model can accurately detect objects of varying sizes and aspect
ratios within video frames.
Multi-Object Detection: The Traced YOLOv7 model excels at detecting
multiple objects within a single frame simultaneously. This capability is
particularly advantageous in crowded scenes or scenarios where numerous
objects need to be detected and classified concurrently.
The steps involved in the object detection system, explained in Fig 4.3, are:
4.5.1 DATA ACQUISITION
Data acquisition for an object detection system using deep learning from
videos is a crucial process that involves gathering video data from various
sources to create a comprehensive dataset for model training and evaluation.
Initially, the identification of suitable data sources, including surveillance
cameras, online repositories, or recorded video streams, sets the stage for
collecting the necessary footage. Prior to acquisition, it's imperative to address
legal considerations and obtain permissions if the videos contain sensitive
information. Once obtained, the videos undergo annotation and labelling to
mark the presence and location of objects within frames, a vital step for
supervised learning. Following annotation, the dataset may undergo pre-
processing tasks such as resizing, format conversion, or noise reduction to
ensure consistency and quality. Subsequently, the dataset is divided into
training, validation, and testing sets to facilitate model training and evaluation.
Additionally, data augmentation techniques may be applied to increase dataset
diversity and improve model generalization. Ultimately, well-structured storage
and management of the annotated and pre-processed video dataset are
paramount for seamless access during model development and deployment.
Through meticulous data acquisition, an object detection system can be trained
effectively, enhancing its ability to accurately identify objects within video
streams.
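For the augmentation step mentioned above, a small torchvision-based sketch (one possible choice of library; for detection data the bounding-box annotations must be transformed consistently as well) could be:

from torchvision import transforms

# Augmentations that mimic different viewpoints and lighting conditions,
# increasing dataset diversity and improving model generalization.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.3, contrast=0.3),
    transforms.RandomRotation(degrees=10),
    transforms.ToTensor(),
])
# Applied to a PIL image: tensor = augment(pil_frame)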
Through keyframe selection, representative frames are extracted while minimizing the computational load.
Adjusting the frame rate balances resources with temporal information.
Segmented frames are then organized for further processing, facilitating the
application of object detection algorithms in tasks like surveillance and
autonomous navigation.
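One simple way to balance temporal information against computational load is fixed-interval frame sampling; a minimal sketch assuming OpenCV:

import cv2

def sample_frames(video_path, every_n=10):
    """Keep every n-th frame, trading temporal detail for compute."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n == 0:  # adjustable effective frame rate
            frames.append(frame)
        index += 1
    cap.release()
    return frames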
4.5.4 FEATURE EXTRACTION
As frames pass through the CNN backbone, feature maps are generated at
multiple levels, encoding representations of object presence and spatial
relationships. Additionally, feature pyramid networks (FPNs) aggregate these
maps, enhancing detection across various scales and sizes.
Utilizing anchor boxes and prediction layers, the model generates bounding
boxes and confidence scores for detected objects based on extracted features,
facilitating object localization and presence indication.
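This decoding step can be sketched as follows, mirroring the YOLO-style parameterization of sigmoid offsets relative to a grid cell and anchor-scaled width/height; the exact constants vary between YOLO versions, so this is illustrative:

import torch

def decode(pred, grid, anchors, stride):
    """pred: (..., 4+) raw outputs; returns boxes as (cx, cy, w, h) in pixels."""
    y = pred.sigmoid()
    xy = (y[..., 0:2] * 2.0 - 0.5 + grid) * stride  # center offsets per cell
    wh = (y[..., 2:4] * 2.0) ** 2 * anchors         # anchor-scaled width/height
    return torch.cat((xy, wh), dim=-1)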
Before building the model, the data is separated into two parts: training data and test data. Fig 4.4 shows a diagrammatic representation of how the data is split into training and testing sets, which are used for detecting objects in videos.
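A minimal sketch of this split, assuming scikit-learn and parallel lists of frame paths and labels built during annotation, could be:

from sklearn.model_selection import train_test_split

# frame_paths and labels are assumed, parallel lists built during annotation.
train_x, test_x, train_y, test_y = train_test_split(
    frame_paths, labels,
    test_size=0.2,    # e.g., 80% training / 20% testing
    random_state=42,  # reproducible split
)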
4.5.6 CLASSIFICATION
In the classification step of object detection using the Traced YOLOv7 model,
features are first extracted from pre-processed video frames, capturing spatial
information, textures, and patterns. These features are then passed through
detection heads within the Traced YOLOv7 architecture.
Utilizing anchor boxes and prediction layers, the model generates bounding
boxes enclosing detected objects, alongside confidence scores indicating object
presence likelihood. Subsequently, the classifier assigns class probabilities to
each detected object based on learned feature representations, ensuring accurate
classification into predefined categories.
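In practice, overlapping boxes predicted for the same object are also reduced to a single detection. The sketch below combines objectness and class probabilities and applies non-maximum suppression using torchvision; this is a standard post-processing choice rather than a step prescribed by the report:

import torch
from torchvision.ops import nms

def postprocess(boxes, obj_scores, class_probs, iou_thresh=0.45, conf_thresh=0.25):
    """boxes: (N, 4) xyxy; obj_scores: (N,); class_probs: (N, C)."""
    conf, cls = (obj_scores[:, None] * class_probs).max(dim=1)  # per-box best class
    keep = conf > conf_thresh                                   # drop weak detections
    boxes, conf, cls = boxes[keep], conf[keep], cls[keep]
    kept = nms(boxes, conf, iou_thresh)                         # suppress overlaps
    return boxes[kept], conf[kept], cls[kept]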
After the classification step with the Traced YOLOv7 model classifier,
the detected objects and their corresponding bounding boxes undergo further
evaluation using a pre-trained ResNet-101 model. ResNet-101, a deep
convolutional neural network architecture known for its exceptional
performance in various computer vision tasks, is employed to perform several
key tasks. Firstly, it recognizes and classifies the detected objects within the
video frames, leveraging its deep architecture and learned feature
representations to accurately identify objects based on visual characteristics.
ResNet-101 facilitates fine-grained classification, distinguishing closely
related categories and offering detailed insights into video frame content.
Additionally, it extracts high-level features from detected objects, capturing
crucial semantic and contextual cues for comprehensive understanding.
Furthermore, ResNet-101 performs semantic segmentation, assigning each pixel
to a specific object class, enabling detailed spatial analysis. Finally, evaluation
of the ResNet-101 output utilizes metrics like Mean Average Precision (mAP)
to assess the performance of the object detection system accurately. By combining the
capabilities of Traced YOLOv7 for initial detection and ResNet-101 for detailed
evaluation, the object detection system achieves robust and reliable object
recognition in diverse video scenarios.
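A sketch of this second stage, cropping each detected box from the frame and classifying the crop with a pre-trained torchvision ResNet-101, is given below; the exact pipeline wiring is illustrative:

import torch
from torchvision import models, transforms

resnet = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1).eval()
prep = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),  # ResNet's expected input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def classify_detections(frame, boxes):
    """Re-classify each YOLOv7 detection crop with ResNet-101."""
    results = []
    for (x1, y1, x2, y2) in boxes:
        crop = prep(frame[y1:y2, x1:x2])  # crop the detected region
        with torch.no_grad():
            probs = resnet(crop.unsqueeze(0)).softmax(dim=1)
        results.append(probs.argmax(dim=1).item())  # fine-grained class id
    return results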
CHAPTER 5
TESTING
Testing for object detection systems using deep learning from videos is a
crucial phase in assessing the performance and reliability of such systems
before deployment in real-world scenarios. This testing process involves
evaluating the model's ability to accurately detect and classify objects within
video frames, ensuring robustness and effectiveness across diverse conditions
and scenarios. By subjecting the object detection system to rigorous testing,
developers can identify potential weaknesses, optimize model parameters, and
enhance overall performance.
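One common way to quantify detection performance during testing is mean Average Precision (mAP); the sketch below uses the torchmetrics package (an assumption, as the report does not name a metrics library):

import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

metric = MeanAveragePrecision()

# Toy example: one frame with one prediction compared against ground truth.
preds = [{
    'boxes': torch.tensor([[50.0, 50.0, 200.0, 200.0]]),
    'scores': torch.tensor([0.9]),
    'labels': torch.tensor([0]),
}]
targets = [{
    'boxes': torch.tensor([[55.0, 55.0, 195.0, 195.0]]),
    'labels': torch.tensor([0]),
}]

metric.update(preds, targets)
print(metric.compute()['map'])  # overall mAP across IoU thresholds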
Our work is divided into five primary modules for testing, as shown in Table 5.1: loading the dataset, frame segmentation, image pre-processing, feature extraction, and classification. The test cases for the modules and submodules have been checked and passed. The test case ID, test case scenario, test case secondary considerations, and test case state are all
considered when testing our work. Test cases are validated, and the outcomes
and status for the scenarios are handled.
TC ID | SCENARIO | SECONDARY CONSIDERATION | STATE
CHAPTER 6
FUTURE WORK
The following are some potential future directions for improving the object detection system based on the YOLOv7 model:
APPENDIX A
SAMPLE SCREENSHOTS
Fig A.2 Upload a video
Fig A.6 Output 3
APPENDIX B
SAMPLE CODING
In this appendix, we outline the code snippets employed in our
development and evaluation of an AI-powered object detection system for video
analysis. Users are guided to input essential parameters and upload videos for
object detection. Noteworthy is the incorporation of a specialized deep learning
model that allows users to interact with the system's output, enabling detailed
analysis of detected objects. Moreover, the implementation includes robust
error-handling mechanisms to ensure a smooth user experience and precise
detection results, thereby improving the system's robustness and user
satisfaction.
Yolo.py:

import argparse
import logging
import sys

import torch
import torch.nn as nn

sys.path.append('./')  # allow running from the repository root

from models.common import Contract, Expand, ReOrg
from utils.general import set_logging
from utils.torch_utils import time_synchronized, fuse_conv_and_bn, \
    model_info, scale_img, initialize_weights, select_device, copy_attr

try:
    import thop  # optional dependency for FLOPs profiling
except ImportError:
    thop = None

logger = logging.getLogger(__name__)


class Detect(nn.Module):
    # Detection head: turns backbone feature maps into box/class predictions.
    stride = None        # strides computed during model build
    export = False       # ONNX export mode
    end2end = False
    include_nms = False
    concat = False

    def __init__(self, nc=80, anchors=(), ch=()):
        super().__init__()
        self.nc = nc                     # number of classes
        self.no = nc + 5                 # outputs per anchor: box(4) + obj(1) + classes
        self.nl = len(anchors)           # number of detection layers
        self.na = len(anchors[0]) // 2   # anchors per layer
        self.grid = [torch.zeros(1)] * self.nl  # grids are built lazily at inference
        a = torch.tensor(anchors).float().view(self.nl, -1, 2)
        self.register_buffer('anchors', a)  # shape(nl,na,2)
        self.register_buffer('anchor_grid', a.clone().view(self.nl, 1, -1, 1, 1, 2))  # shape(nl,1,na,1,1,2)
        self.m = nn.ModuleList(nn.Conv2d(x, self.no * self.na, 1) for x in ch)  # output convolutions

    @staticmethod
    def _make_grid(nx=20, ny=20):
        yv, xv = torch.meshgrid([torch.arange(ny), torch.arange(nx)])
        return torch.stack((xv, yv), 2).view((1, 1, ny, nx, 2)).float()

    def forward(self, x):
        z = []  # inference output
        self.training |= self.export
        for i in range(self.nl):
            x[i] = self.m[i](x[i])  # prediction conv
            bs, _, ny, nx = x[i].shape
            x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()
            if not self.training:  # decode boxes only at inference time
                if self.grid[i].shape[2:4] != x[i].shape[2:4]:
                    self.grid[i] = self._make_grid(nx, ny).to(x[i].device)
                y = x[i].sigmoid()
                if not torch.onnx.is_in_onnx_export():
                    y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i]  # xy
                    y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
                else:
                    xy, wh, conf = y.split((2, 2, self.nc + 1), 4)
                    xy = xy * (2. * self.stride[i]) + (self.stride[i] * (self.grid[i] - 0.5))  # new xy
                    wh = wh ** 2 * (4 * self.anchor_grid[i].data)  # new wh
                    y = torch.cat((xy, wh, conf), 4)
                z.append(y.view(bs, -1, self.no))
        if self.training:
            out = x
        elif self.end2end:
            out = torch.cat(z, 1)
        else:
            out = (torch.cat(z, 1), x)
        return out


# Excerpt from parse_model(): computing each layer's output channels (c2).
#         elif m is ReOrg:
#             c2 = ch[f] * 4
#         elif m is Contract:
#             c2 = ch[f] * args[0] ** 2
#         elif m is Expand:
#             c2 = ch[f] // args[0] ** 2
#         else:
#             c2 = ch[f]
#         ...
#         layers.append(m_)
#         if i == 0:
#             ch = []
#         ch.append(c2)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--cfg', type=str, default='yolov7.yaml', help='model.yaml')
    parser.add_argument('--device', default='', help='cuda device, i.e. 0 or cpu')
    parser.add_argument('--profile', action='store_true', help='profile model speed')
    opt = parser.parse_args()
    set_logging()
    device = select_device(opt.device)

    # Create model (Model, defined in the full yolo.py, builds the network from YAML)
    model = Model(opt.cfg).to(device)
    model.train()

    if opt.profile:
        img = torch.rand(1, 3, 640, 640).to(device)
        y = model(img, profile=True)
APPENDIX C
SYSTEM REQUIREMENTS
HARDWARE SPECIFICATION
SOFTWARE SPECIFICATION
● Windows 10 or Higher
● Visual Studio Code
BROWSERS
● Chrome
● Edge
● Mozilla Firefox
● Internet Explorer
● Safari
DATASETS
● COCO
REFERENCES
11. "A Review of Machine Learning and Deep Learning for Object Detection
and Semantic Segmentation" by Manakitsa et al., providing insights into
machine and deep learning applications for object detection and semantic
segmentation."
12. "Object Detection and Tracking using Deep Learning for Video
Surveillance" by Mohana and RAVISH ARADHYA H V, 2019."
17. "Object Detection with Deep Learning: A Review" by Zhao, Zheng, Xu, and
Wu, offering an in-depth analysis of deep learning-based object detection
frameworks.