SPECIFIED OBJECT DETECTION AND FRAME EXTRACTION IN VIDEO
A PROJECT REPORT

Submitted by

NIVETHA N 211720243037
CHARUNITHYA P 211720243009
JEEVANANDHU V 211720243301

in partial fulfillment for the award of the degree

of

BACHELOR OF TECHNOLOGY
IN

ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

RAJALAKSHMI INSTITUTE OF TECHNOLOGY

ANNA UNIVERSITY: CHENNAI 600 025

MAY 2024
ANNA UNIVERSITY: CHENNAI 600 025

BONAFIDE CERTIFICATE

Certified that this project report “SPECIFIED OBJECT DETECTION AND

FRAME EXTRACTION IN VIDEO” is the bonafide work of

“CHARUNITHYA.P(211720243009)”, “NIVETHA.N (211720243037)”,

“JEEVANANDHU.V (211720243301)” who carried out the project work

under my supervision.

SIGNATURE SIGNATURE
Dr. N. KANAGAVALLI, B.E., M.E., Ph.D.          Dr. K. REGIN BOSE, M.E., Ph.D.,

HEAD OF THE DEPARTMENT SUPERVISOR

Department of AI & DS Department of AI & DS

Rajalakshmi Institute of Technology Rajalakshmi Institute of Technology


ACKNOWLEDGEMENT

We express our sincere gratitude to our honorable Chairperson Dr. (Mrs.)


THANGAM MEGANATHAN, M.A., M.Phil., Ph.D., and Chairman Thiru. S.
MEGANATHAN, B.E., F.I.E., for their constant encouragement to do this
project and also during the entire course period.

We thank our principal, Dr. N. BHALAJI, B.E., M.E., Ph.D., and our
Head of the Department Dr. N. KANAGAVALLI, B.E., M.E., Ph.D., for their
suggestions for the development and completion of this project. Words fail to
express our gratitude to our project coordinator Dr. B. N. KARTHIK, B.Tech,
M.Tech, Ph.D. and guide Dr. K. REGIN BOSE,M.E., Ph.D., who took a
special interest in our project and gave their consistent support and guidance
during all stages of this project.

Finally, we thank all the teaching and non-teaching faculty members of


the Department of ARTIFICIAL INTELLIGENCE AND DATA SCIENCE
at our college who helped us to complete our project successfully. We would
also like to express our deep appreciation to our families and friends for their
affection, trust, and support. They have meant very much to us; we could not
have succeeded without them.

Last but not least, we are honored to express our gratitude to our parents and
friends who were with us throughout the course.
CERTIFICATE OF EVALUATION

College Name 2117 – Rajalakshmi Institute of Technology

Branch Artificial Intelligence and Data Science

Semester 8th Semester

Subject AD8811 PROJECT WORK


S.NO. | NAME OF THE STUDENTS         | TITLE OF THE PROJECT                                      | NAME OF THE SUPERVISOR WITH DESIGNATION
1.    | NIVETHA N (211720243037)     | SPECIFIED OBJECT DETECTION AND FRAME EXTRACTION IN VIDEO | Dr. K. REGIN BOSE, M.E., Ph.D., Department of AI & DS
2.    | CHARUNITHYA P (211720243009) |                                                           |
3.    | JEEVANANDHU V (211720243301) |                                                           |

The report of this project work submitted by the above students in


partial fulfillment for the award of Bachelor of Technology in Artificial
Intelligence and Data Science under Anna University was evaluated and
confirmed to be the report about the work done by the above students.
The University viva-voce is held on

INTERNAL EXAMINER EXTERNAL EXAMINER


ABSTRACT

Object detection, a pivotal task in computer vision, has witnessed significant advancements with the advent of deep learning techniques. Existing object detection systems such as YOLO (You Only Look Once) and SSD (Single Shot Multibox Detector) use deep learning models to detect and localize objects within images or video frames in real time with high accuracy and efficiency. In this study, we present a robust object detection system that splits the pipeline across two processing units, the CPU and the GPU. The input video data is segmented into frames of images, and features are extracted from the pre-processed images on the CPU. These features are then passed to the classification model, You Only Look Once version 7 (YOLOv7). YOLOv7 is an evolution of the YOLO object detection architecture, featuring improved speed and accuracy through advancements in model design and training techniques, making it a popular choice for real-time computer vision applications. On the GPU, combining YOLOv7 with ResNet-101, known for its superior feature extraction, enhances the system's ability to accurately detect and classify objects in complex and varied environments, aiming for improved performance in localization and categorization tasks. To empirically validate the efficacy of our system, we conduct comprehensive experiments on benchmark datasets such as COCO.
TABLE OF CONTENTS

CHAPTER NO.  TITLE                                                              PAGE NO.

             ABSTRACT                                                           v
             LIST OF FIGURES                                                    ix
             LIST OF TABLES                                                     x
1.           INTRODUCTION                                                       1
             1.1 OVERVIEW OF OBJECT DETECTION AND MACHINE LEARNING              1
                 1.1.1 IMPORTANCE OF OBJECT DETECTION IN VARIOUS FIELDS         2
                 1.1.2 EVOLUTION OF OBJECT DETECTION TECHNIQUES                 3
             1.2 ROLE OF TECHNOLOGY IN OBJECT DETECTION                         3
                 1.2.1 ADVANCEMENTS IN MACHINE LEARNING AND COMPUTER VISION     4
                 1.2.2 INTEGRATION OF MACHINE LEARNING IN OBJECT DETECTION SYSTEMS  4
             1.3 CHALLENGES IN OBJECT DETECTION SYSTEMS                         5
             1.4 PURPOSE AND SCOPE OF THE PROJECT                               6
             1.5 SIGNIFICANCE OF INTELLIGENT OBJECT DETECTION                   7
             1.6 PROBLEM DEFINITION                                             7
2.           LITERATURE SURVEY                                                  8
             2.1 LITERATURE SURVEY WORKS                                        9
             2.2 RELATED WORKS                                                  10
             2.3 KEY AREAS OF FOCUS                                             12
             2.4 LIMITATIONS                                                    14
3.           SYSTEM DESIGN                                                      15
             3.1 SYSTEM ARCHITECTURE                                            15
             3.2 PROCESS WORKFLOW                                               16
             3.3 DATA FLOW MODEL                                                18
                 3.3.1 DFD LEVEL 0                                              18
                 3.3.2 DFD LEVEL 1                                              18
                 3.3.3 DFD LEVEL 2                                              20
             3.4 UML DIAGRAMS                                                   21
                 3.4.1 USE CASE DIAGRAM                                         21
                 3.4.2 ACTIVITY DIAGRAM                                         22
                 3.4.3 SEQUENCE DIAGRAM                                         23
             3.5 WORKFLOW                                                       24
4.           SYSTEM IMPLEMENTATION                                              25
             4.1 TRACED YOLO V7 MODEL                                           25
             4.2 WORKING PRINCIPLES OF TRACED YOLO V7 MODEL                     26
             4.3 APPROACH                                                       27
             4.4 ADVANTAGES OF USING TRACED YOLO V7 MODEL CLASSIFIER OVER OTHER CLASSIFIERS  29
             4.5 STEPS INVOLVED IN OBJECT DETECTION SYSTEM                      30
                 4.5.1 DATA ACQUISITION                                         31
                 4.5.2 FRAME SEGMENTATION                                       31
                 4.5.3 IMAGE PRE-PROCESSING                                     32
                 4.5.4 FEATURE EXTRACTION                                       33
                 4.5.5 SPLITTING THE DATA INTO TRAINING AND TESTING DATASETS    33
                 4.5.6 CLASSIFICATION                                           34
             4.6 EVALUATION METRICS                                             34
             4.7 LIMITATIONS OF YOLO V7                                         35
5.           TESTING                                                            36
             5.1 TEST CASES                                                     36
6.           FUTURE WORK                                                        38
             APPENDICES
             APPENDIX A SAMPLE SCREENSHOTS                                      39
             APPENDIX B SAMPLE CODING                                           43
             APPENDIX C SYSTEM REQUIREMENTS                                     50
             REFERENCES                                                         51
LIST OF FIGURES

FIGURE NO.   TITLE                             PAGE NO.
2.1          EXISTING SYSTEM                   9
2.2          SAMPLE OBJECT DETECTION SYSTEM    13
3.1          SYSTEM ARCHITECTURE               16
3.3          DFD LEVEL 0                       18
3.4          DFD LEVEL 1                       19
3.5          USE CASE DIAGRAM                  21
3.6          ACTIVITY DIAGRAM                  22
3.7          SEQUENCE DIAGRAM                  23
4.1          YOLO V7 MODEL                     26
4.2          APPROACH                          28
4.3          OBJECT DETECTION SYSTEM           30
A.1          LANDING PAGE                      39
A.2          UPLOAD A VIDEO                    40
A.3          MODEL OUTPUT DOWNLOAD             40
A.4          OUTPUT 1                          41
A.5          OUTPUT 2                          41
A.6          OUTPUT 3                          42
A.7          OUTPUT 4                          42
LIST OF TABLES

TABLE NO.    TITLE            PAGE NO.
2.1          RELATED WORKS    11
5.1          TEST CASES       37
CHAPTER 1

INTRODUCTION

In today's digitally immersive world, the demand for efficient object


detection systems in both images and videos has surged exponentially. Whether
it's for security surveillance, autonomous vehicles, medical imaging, or
augmented reality applications, the ability to accurately identify and locate
objects within visual data is paramount. This introduction explores the
evolution, challenges, and advancements in object detection systems, shedding
light on their significance in various domains.

Object detection is a computer vision technique that works to identify and


locate objects within an image or video. Specifically, object detection draws
bounding boxes around these detected objects, which allow us to locate where
said objects are in (or how they move through) a given scene.

1.1 OVERVIEW OF OBJECT DETECTION AND


MACHINE LEARNING

Object detection, a cornerstone of computer vision, automates the


identification and localization of objects in images or videos, serving critical
roles in various domains. For instance, in autonomous vehicles, it ensures safe
navigation by recognizing pedestrians, vehicles, and traffic signs. Surveillance
systems leverage object detection for real-time threat detection, while in
medical imaging, it aids in diagnosing tumors or abnormalities. Deep learning,
particularly Convolutional Neural Networks (CNNs), has revolutionized object
detection by enabling models to learn features directly from raw data, enhancing
accuracy and efficiency. Algorithms such as Faster R-CNN and YOLO (You
Only Look Once) exhibit different methodologies: two-stage detectors propose
regions before classification and refinement, whereas one-stage detectors

predict bounding boxes and class probabilities in a single pass. Despite
advancements, challenges persist, including variations in object scale and
achieving real-time accuracy under resource constraints. Ongoing research
focuses on developing efficient architectures and improving robustness against
adversarial attacks. Object detection stands apart from image recognition and
segmentation by its unique ability to locate objects, enabling tracking and
counting. Its applications span crowd counting, self-driving cars, surveillance,
and anomaly detection, highlighting its versatility and significance in modern
technologies.

1.1.1 IMPORTANCE OF OBJECT DETECTION IN VARIOUS FIELDS

Object detection is a critical technology with broad applications. In


autonomous vehicles, it ensures safe navigation by accurately identifying
pedestrians, cyclists, and other vehicles, contributing to safer transportation
systems. Surveillance systems rely on object detection to augment human
monitoring efforts, enabling proactive responses to security threats and
enhancing public safety, particularly in crowded spaces like airports and train
stations. In healthcare, object detection aids clinicians in diagnosing and treating
medical conditions by identifying anatomical structures, detecting abnormalities
such as tumors, and tracking disease progression. In retail and manufacturing,
object detection optimizes operations, tracks inventory, and improves efficiency,
leading to streamlined workflows and enhanced customer experiences.
Moreover, in industrial applications like robotics, object detection plays a
crucial role in automating processes and ensuring quality control, enabling tasks
such as pick-and-place operations and assembly line automation. Lastly, in
augmented reality applications, object detection enables the seamless integration
of virtual content with the real world, offering interactive and immersive
experiences across various domains.

1.1.2 EVOLUTION OF OBJECT DETECTION TECHNIQUES

The evolution of object detection techniques begins with classical methods like
Haar cascades and HOG, which relied on handcrafted features but faced
challenges with complex visual patterns. Region-based approaches introduced
the concept of generating region proposals before classification, with methods
like Selective Search efficiently identifying potential object regions. The advent
of deep learning, particularly CNNs, revolutionized object detection by learning
hierarchical representations directly from raw pixel data, eliminating the need
for manual feature engineering. Two-stage detectors like Faster R-CNN
integrated region proposal networks into the network architecture, improving
efficiency by generating region proposals from feature maps. In contrast, one-
stage detectors like YOLO prioritized speed but sometimes sacrificed accuracy.
Recent advancements include anchor-free approaches and hybrid architectures,
aiming to strike a balance between accuracy and speed, thus making object
detection more accessible and practical for diverse applications.

1.2 ROLE OF TECHNOLOGY IN OBJECT DETECTION

Technology is the foundation of modern object detection systems, driving


their development, deployment, and enhancement. Advanced algorithms like
Convolutional Neural Networks (CNNs) revolutionize object detection by
automatically learning patterns from raw data, eliminating manual feature
engineering. Technology facilitates handling vast datasets, empowering
researchers with high-performance computing resources and specialized
hardware accelerators for efficient model training. Sensor integration benefits
from technological advancements in cameras, LiDAR, radar, and infrared
sensors, enhancing detection accuracy. Software frameworks like TensorFlow

and PyTorch simplify algorithm implementation, while real-time processing
techniques optimize system performance for various applications, ensuring
efficient use of computational resources without compromising accuracy or
speed.

1.2.1 ADVANCEMENTS IN MACHINE LEARNING AND


COMPUTER VISION

Recent years have seen significant advancements in object detection, with


one-stage detectors like YOLO offering faster inference times, while two-stage
detectors such as Faster R-CNN prioritize accuracy. Integration of attention
mechanisms, feature pyramids, and optimization techniques has further
improved detection performance. Machine learning and computer vision have
seen groundbreaking developments, with neural networks learning intricate
patterns directly from raw data, revolutionizing tasks like image classification
and object detection. Transfer learning has made models like ResNet invaluable
for various applications, reducing the need for extensive labelled data.
Generative modelling has enabled the creation of realistic synthetic data,
benefiting domains with scarce labelled data like medical imaging. Attention
mechanisms and self-supervised learning techniques have enhanced model
interpretability and generalization, addressing the need for transparency in AI
systems. Federated learning and edge computing preserve data privacy while
enabling collaborative model training across distributed devices, particularly
relevant for healthcare and IoT applications. Advancements in 3D vision have
expanded computer vision capabilities, opening new opportunities in augmented
reality and robotics.

1.2.2 INTEGRATION OF MACHINE LEARNING IN


OBJECT DETECTION SYSTEMS
Integration of machine learning transforms object detection, enabling

automatic feature extraction from raw image data. Architectural advancements,
like two-stage and one-stage detectors, optimize performance for specific tasks,
enhancing efficiency. Training strategies, such as data augmentation and transfer
learning, improve model generalization, while evaluation metrics gauge
accuracy. Optimization techniques and hardware acceleration ensure real-time
inference, vital for applications like autonomous vehicles and surveillance
systems. Evaluation and optimization are integral, ensuring efficient deployment
on various platforms and devices. Performance metrics, including precision,
recall, and mAP, provide insights into model accuracy and effectiveness.
Efficient model architectures, coupled with GPUs and TPUs, enable rapid
analysis and decision-making. Machine learning facilitates diverse training
strategies, enhancing model performance and robustness. Transfer learning
repurposes models trained on large-scale datasets, saving time and resources
while improving performance. Advanced algorithms, particularly CNNs,
empower object detection systems to discern relevant information
autonomously, enhancing detection accuracy and reliability.

1.3 CHALLENGES IN OBJECT DETECTION SYSTEMS

Object detection systems grapple with diverse challenges such as object


appearance variations and class imbalances in datasets. Real-time processing
demands high accuracy with minimal latency, posing computational hurdles for
deep learning models. Adversarial attacks undermine model robustness and
reliability, highlighting security concerns. Domain adaptation is essential for
models to generalize across diverse environments, necessitating robust
techniques for adaptation. Deploying object detection in resource-constrained
environments requires optimization for efficient inference while maintaining
high performance. Balancing accuracy and speed remains a significant
challenge, particularly in critical applications like autonomous driving. Object

occlusion, varying scales, lighting conditions, and complex backgrounds present
substantial obstacles to accurate detection.

1.4 PURPOSE AND SCOPE OF THE PROJECT

The project's goal is to develop an advanced object detection system for


accurately identifying and localizing objects in images or video streams. It
involves researching state-of-the-art algorithms, collecting diverse datasets, and
designing robust machine learning models, with a focus on deep learning
architectures. Techniques will address challenges like class imbalance,
adversarial attacks, and domain adaptation, ensuring resilience and efficiency.
Real-time processing optimization and deployment on resource-constrained
platforms are priorities, with rigorous evaluation using standard metrics like
precision, recall, and mAP. The project aims to deliver an innovative solution
advancing computer vision, applicable to domains like surveillance,
autonomous vehicles, and healthcare, ultimately driving progress in the field.

1.4.1 OBJECTIVES OF THE OBJECT DETECTION SYSTEM

The primary objective of the project is to develop an advanced object


detection system capable of accurately identifying and localizing objects within
images or video feeds. To achieve this, the project will begin with an extensive
review of the latest research and methodologies in the field. Diverse datasets
will be acquired, covering a wide range of scenarios and variations in object
appearance, scale, orientation, and environmental conditions.

With these datasets, the project will focus on designing and implementing
robust machine learning models, particularly leveraging deep learning
architectures like Convolutional Neural Networks (CNNs). These models will
be trained on the collected datasets to learn intricate patterns and features
directly from raw pixel data.

1.5 SIGNIFICANCE OF INTELLIGENT OBJECT DETECTION

Intelligent object detection stands at the forefront of technological


innovation, offering transformative benefits across diverse domains.
By harnessing advanced machine learning algorithms and computer vision
techniques, these systems automate tasks, enhance safety and security, optimize
resource allocation, and deliver personalized user experiences. In industries
such as manufacturing, logistics, and agriculture, intelligent object detection
streamlines processes, reduces labour costs, and accelerates throughput by
swiftly and accurately identifying objects within images or video streams.
Moreover, in applications like surveillance and security, these systems play a
pivotal role in threat detection and response, improving public safety and
security measures. Intelligent object detection also contributes to personalized
user experiences in augmented reality, gaming, and e-commerce, enabling
tailored recommendations and interactive content based on recognized objects.

1.6 PROBLEM DEFINITION

The project aims to develop an intelligent object detection system


capable of accurately identifying and localizing objects within images or video
streams, addressing numerous challenges inherent in object detection. To tackle
these challenges, extensive research into state-of-the-art algorithms and
methodologies will be conducted, coupled with the collection of diverse
datasets for training and evaluation. Robust machine learning models leveraging
advanced deep learning architectures will be designed to achieve high accuracy
and efficiency. Techniques will be developed to handle challenges like class
imbalance, adversarial attacks, and domain adaptation.
CHAPTER 2

LITERATURE SURVEY

Object detection in images and videos is a fundamental task in computer


vision with numerous practical applications ranging from autonomous driving
and surveillance to healthcare and augmented reality. Over the years, significant
progress has been made in the development of object detection systems, driven
by advancements in machine learning and deep learning approaches. These
approaches have revolutionized the field by enabling automated and accurate
detection of objects within visual data. In this literature survey, we review the
state-of-the-art techniques and methodologies in object detection, with a focus
on machine learning and deep learning approaches. We explore the evolution of
object detection algorithms, the challenges faced in the field, and the recent
advancements that have propelled the performance of object detection systems
to unprecedented levels. By synthesizing the existing body of research, we aim
to provide insights into the current landscape of object detection and identify
potential avenues for future research and development.

2.1 LITERATURE SURVEY WORKS

The evolution of object detection in images and videos has been shaped
by groundbreaking research in machine learning and deep learning. Ren et al.'s
Faster R-CNN introduced Region Proposal Networks (RPNs), streamlining
detection and enabling real-time performance. Redmon et al.'s YOLO reframed
detection as a regression problem, reducing computational complexity and
enhancing speed, ideal for applications like video surveillance. Liu et al.'s SSD
introduced single-stage detection, improving efficiency and enabling
deployment on resource-constrained devices. Lin et al.'s FPN addressed scale
variation by integrating multi-scale feature maps, enhancing object detection
across sizes. He et al.'s Mask R-CNN extended Faster R-CNN with instance
segmentation capabilities, broadening applications in medical imaging and
video editing. Recent advancements include YOLOv5, offering improved
performance and efficiency, and EfficientDet, achieving superior performance
with fewer parameters. DETR introduces an end-to-end Transformer-based
architecture for accurate and interpretable detections. CenterNet simplifies
detection by directly predicting object centers, suitable for resource-limited
scenarios. Cascade R-CNN iteratively refines object proposals for enhanced
detection quality, particularly in challenging scenarios. NAS-FPN automates
feature pyramid network design, enhancing performance and scalability. PANet
introduces a path aggregation network for precise instance segmentation,
improving object understanding.

Fig 2.1 illustrates how seminal works in object detection have advanced
performance and expanded applicability across diverse domains.

Fig 2.1 Existing system

2.2 RELATED WORKS

Literature Survey | Authors | Year | Venue | Related Works
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks | Ren, S., He, K., Girshick, R., & Sun, J. | 2016 | IEEE TPAMI | Fast R-CNN: Girshick, R.
YOLOv3: An Incremental Improvement | Redmon, J., & Farhadi, A. | 2018 | arXiv | YOLOv4: Bochkovskiy, A., Wang, C. Y., & Liao, H. Y. M.
EfficientDet: Scalable and Efficient Object Detection | Tan, M., Pang, R., & Le, Q. | 2020 | CVPR | EfficientNet: Tan, M., & Le, Q.
DETR: End-to-End Object Detection with Transformers | Carion, N., et al. | 2020 | ECCV | Vision Transformers: Dosovitskiy, A., et al.
CenterNet: Object Detection with Center-Aspects Estimation | Mo, K., et al. | 2019 | CVPR | FCOS: Tian, Z., et al.
Mask R-CNN | He, K., et al. | 2017 | ICCV | PANet: Liu, S., et al.
SSD: Single Shot MultiBox Detector | Liu, W., et al. | 2016 | ECCV | YOLOv4: Bochkovskiy, A., Wang, C. Y., & Liao, H. Y. M.
RetinaNet: Focal Loss for Dense Object Detection | Lin, T. Y., et al. | 2017 | ICCV | DSSD: Fu, C. Y., et al.

Table 2.1 Related works


2.3 KEY AREAS OF FOCUS

Let us delve deeper into each key area of focus for object detection systems
using machine learning and deep learning approaches, as shown in Fig 2.2:

Model Architectures: Object detection research delves into designing networks


that balance accuracy and efficiency by exploring various depths, widths, and
connectivity patterns. Architectural innovations like skip connections and
attention mechanisms optimize performance for challenges such as scale
variation and occlusion.

Data Collection and Annotation: Researchers emphasize collecting diverse


datasets covering various object categories, poses, and lighting conditions.
Augmentation techniques like rotation and flipping increase dataset diversity,
while precise annotation with bounding boxes ensures model efficacy.

Training Strategies: Effective strategies like transfer learning and semi-


supervised learning optimize model performance and minimize training time.
Techniques such as hyperparameter tuning and adaptive learning rates play
pivotal roles in enhancing training effectiveness.

Feature Representation: Feature extraction methods, including CNNs and


FPNs, capture hierarchical representations from input images effectively.
Attention mechanisms enhance feature representation by focusing on
informative regions, while spatial transformers improve localization accuracy.

Object Localization: Anchor-based methods utilize predefined anchor boxes


for accurate object localization, while anchor-free methods simplify the process
by directly regressing object centroids. Techniques for handling scale variation
and occlusion enhance localization performance in challenging scenarios.

Inference Speed and Efficiency: Real-time object detection systems demand


swift and efficient inference algorithms to adhere to strict latency requirements.
Techniques like network pruning, quantization, and model distillation reduce
model complexity without compromising accuracy. Hardware accelerators such
as GPUs and TPUs leverage parallel processing to expedite inference, while
distributed techniques further enhance efficiency.

Evaluation Metrics and Benchmarking: Standardized metrics like precision,


recall, mAP, and IoU, along with benchmark datasets like COCO and PASCAL
VOC, are essential for objectively evaluating object detection models. These
metrics ensure the development of reliable systems that meet the evolving
demands of diverse applications by driving advancements in the field.

Fig 2.2 Sample object detection system

2.4 LIMITATIONS

Data Dependency: Object detection models heavily rely on annotated training


data, which can be costly and time-consuming to collect. Moreover, inadequate
representation of the target domain in training data can hinder real-world
performance.
Computationally Intensive: Deep learning models for object detection require
significant computational resources during training and inference, posing
challenges for deployment on resource-constrained devices.
Overfitting: Deep learning models are prone to overfitting, memorizing noise or
irrelevant patterns in training data. Techniques like dropout and weight decay
mitigate overfitting, but balancing model capacity and dataset size remains
challenging.
Difficulty with Small Objects and Occlusions: Object detection systems may
struggle with detecting small objects or those partially occluded by other objects,
impacting localization and classification accuracy.
Limited Interpretability: Deep learning models are often considered "black
box," making it difficult to interpret their decisions, especially in safety-critical
applications.
Domain Adaptation: Models trained on one dataset may not generalize well to
new environments. Techniques for domain adaptation aim to transfer knowledge
to new domains, but challenges remain in adapting to diverse environments.
Runtime Performance: Real-time object detection systems require fast
inference speeds while maintaining accuracy, necessitating optimization
techniques and hardware acceleration.
Handling Imbalanced Data: Imbalanced datasets can bias models, affecting
performance on minority classes. Techniques like class re-weighting and data
augmentation address class imbalance issues during training.

CHAPTER 3

SYSTEM DESIGN

3.1 SYSTEM ARCHITECTURE

The system architecture shown in Fig 3.1 for an object detection system
using deep learning from videos is a comprehensive framework designed to
manage the intricate processes involved in detecting objects within video
streams. At its core, the Data Acquisition Module serves as the system's entry
point, facilitating the capture of video data from various sources such as
cameras, video files, or live streams. Once acquired, the Frame Segmentation
Module meticulously extracts individual frames from the continuous video
stream, employing techniques like keyframe extraction or motion-based
segmentation to efficiently segment the video into manageable units. Following
segmentation, the Image Pre-processing Module meticulously prepares each
frame for analysis by applying a suite of transformations to enhance its quality
and standardize its format. This involves resizing frames, normalizing pixel
values, and reducing noise or artifacts that could impede accurate object
detection.

Subsequently, the Feature Extraction Module employs sophisticated deep


learning techniques, particularly convolutional neural networks (CNNs), to
automatically extract salient features from the pre-processed frames. These
features capture both low-level details like edges and textures and high-level
features related to object shapes and structures, enabling precise detection.
Finally, the Classification Module assigns labels or categories to the extracted
features, indicating the presence of specific objects within the frames. By
integrating these modules into a coherent architecture, the system can efficiently
process video data and accurately detect objects within video streams, paving
the way for a wide range of applications across domains such as surveillance,
traffic monitoring, and video analytics.
Fig 3.1 System Architecture

3.2 PROCESS WORKFLOW

These computer-aided detection methods are more efficient, reliable, accurate,
and less time-consuming than manual detection methods. The process flow,
explained in Fig 3.2, includes the following steps:

● Data Acquisition
● Frame Segmentation
● Data Pre-processing
● Feature Extraction
● Classification

Data Acquisition: This step involves gathering video data from various sources
like surveillance cameras or video files. Data augmentation techniques may be
employed to enhance system robustness by collecting videos from different
viewpoints or lighting conditions. Metadata such as timestamps may also be
collected for further analysis.

Frame Segmentation: Frame segmentation divides the continuous video stream


into individual frames, typically represented as images. Techniques may include
frame extraction at fixed intervals or advanced methods like keyframe selection
based on scene changes or motion detection algorithms.

Image Pre-processing: Pre-processing prepares frames for input into the object
detection model by resizing them to a uniform size, normalizing pixel values, and
applying filters to enhance quality. This ensures consistency and optimization of
input frames for the detection model's requirements.

Feature Extraction: Convolutional neural networks (CNNs) automatically learn


and extract relevant features from pre-processed frames. These networks capture
hierarchical representations of input data, enabling the system to represent objects
in video frames accurately.

Classification: This step assigns labels to extracted features, indicating the


presence of specific objects within frames. A classification model, such as a
SoftMax classifier, computes the probability distribution over different classes,
with the highest probability assigned to each detected object.
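
To make this last step concrete, the short sketch below computes a softmax distribution over class scores for one detected object; the class names and raw scores are made-up illustrative values, not outputs of the actual system.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

# Hypothetical raw scores for one detected object over three classes.
class_names = ["person", "car", "dog"]
logits = np.array([2.1, 0.3, -1.2])

probs = softmax(logits)
print(dict(zip(class_names, probs.round(3))))
# The class with the highest probability is assigned to the detection.
print("predicted:", class_names[int(np.argmax(probs))])
```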

3.3 DATA FLOW MODEL

The below represented diagrams are the data flow diagrams which
explains our work.

3.3.1 DFD LEVEL 0

Level 0 of the data flow diagram, shown in Fig 3.3, presents the basic flow of
the project. The input data contains the COCO database and the result is the
output data. The cleaned data will be used for the detection process. Then a
detailed analysis of detecting the object from video will be made on various
parameters. The process in between involves analysis and detection, which
results in frames of images that serve many purposes in detecting the object.

Fig 3.3 DFD Level 0

3.3.2 DFD LEVEL 1

At Level 1 of the Data Flow Diagram (DFD), shown in Fig 3.4, for an
object detection system using deep learning from videos, the central module is

Object Detection System (Main), which oversees the entire process. This system
interfaces with two primary components: the Input Video File and the User
Interaction Interface. The Input Video File represents the source of video data,
which could be a video file stored locally or a video stream obtained from a
camera or other sources. On the other hand, the User Interaction Interface
facilitates interaction with the system, enabling users to provide input
parameters or view the system's output. Once the video data is received, it
undergoes Frame Pre-processing, a module responsible for preparing each
frame of the video for object detection.

Fig 3.4 DFD Level 1

This process involves tasks such as resizing, normalization, or applying


other transformations to enhance the quality of input frames for subsequent
analysis. After pre-processing, the frames are fed into the Object Detection
module, where deep learning techniques like YOLO (You Only Look Once) are
employed to detect objects present within each frame.

Display/Save module handles the presentation or storage of the detected objects.
It may display the detected objects to the user in real-time or save them to a file
for later analysis. This evaluation provides valuable insights into the system's
performance and helps in its refinement and optimization. Overall, these
interconnected components form a cohesive system for object detection from
videos using deep learning techniques.

3.3.3 DFD LEVEL 2

At Level 2 of the Data Flow Diagram (DFD) for an object detection


system using deep learning from videos, each module introduced at Level 1 is
expanded to detail its internal processes and interactions. The Input Video File
Module remains the source of video data, while the User Interaction Interface
Module continues to facilitate user interaction with the system. Within the
Frame Pre- processing Module, specific tasks such as resizing, normalization,
and metadata extraction from each frame are outlined to optimize frame quality.
The Object Detection Module is further elaborated to include the deep learning
model architecture employed, detailing the process of loading the pre-trained
model, executing object detection on each frame, and extracting relevant object
information. The Output Display/Save Module expands to specify supported
output formats and may include post-processing tasks for refining detected
object information. Additionally, the optional Performance Evaluation Module
may be detailed further to specify evaluation metrics and the comparison of
ground truth annotations with detected objects for quantitative analysis.
Interactions among these modules are delineated, illustrating the flow of data
and control within the system, including feedback loops for iterative refinement
and optimization. Overall, Level 2 of the DFD provides a comprehensive
blueprint of the internal workings and interactions of the object detection
system, enhancing understanding and facilitating system development and
refinement.

3.4 UML DIAGRAMS

The following UML diagrams are used to describe the project, they are:

• Use case diagram


• Activity diagram
• Sequence diagram

3.4.1 USE CASE DIAGRAM

There are three actors involved in the use case diagram shown in Fig 3.5: User,
Detection system, and Database. The use cases associated with the user are
uploading the object video and receiving the output from the detection system.
The use cases associated with the database are data storage and data
management. The use cases associated with the detection system are frame
segmentation, image processing, feature extraction, and classification.

Fig 3.5 Use case Diagram

3.4.2 ACTIVITY DIAGRAM

The activity diagram shown in Fig 3.6 is a flowchart that represents the flow
from one activity to another. It helps to visualize the overall workflow of the
object detection process and highlights the key steps involved in detecting
objects using deep learning. The diagram can be used to guide the development
of a software system that supports the object detection process and helps to
improve the accuracy and efficiency of detection. The process flow starts with
the input data containing object images, from which the required features are
extracted for effective modelling.

Fig 3.6 Activity Diagram

3.4.3 SEQUENCE DIAGRAM

The sequence diagram shown in Fig 3.7 depicts the interactions between the
various components involved in the object detection process. The lifelines are
the User, the Detection system, and the Database. The user uploads the video,
which is stored in the database. The system processes the image file by frame
segmentation and data pre-processing. The extracted features are then used for
machine learning classification, which uses algorithms to detect the object,
given by the user, in the video. Finally, an image of the detection is shown
based on the machine learning results, summarizing the findings and providing
recommendations for further detection.

Fig 3.7 Sequence Diagram

3.5 WORKFLOW

Step 1: Gather video analysis parameters and objectives from users to tailor the
process.

Step 2: Read the video file or stream, ensuring compatibility and smooth data
ingestion.

Step 3: Pre-process frames by applying resizing, normalization, and other


transformations to enhance model performance.

Step 4: Initialize the object detection model, such as YOLO, ensuring proper
configuration and optimization.

Step 5: Train the model if required, leveraging labeled datasets for fine-tuning
and improved accuracy.

Step 6: Detect objects within each pre-processed frame using the initialized
model, capturing relevant features.

Step 7: Perform post-processing on detected objects, filtering out


low-confidence detections and redundant bounding boxes.

Step 8: Optionally visualize detected objects overlaid on the original frame for
intuitive analysis.

Step 9: Aggregate detected objects if tracking or temporal analysis is needed,


ensuring analysis coherence.

Step 10: Output detected objects' labels, confidence scores, and bounding box
coordinates, facilitating interpretation and action.
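
As an illustration of how Steps 2 to 10 fit together, the sketch below runs a per-frame detection loop with OpenCV and a generic traced PyTorch model; the weights file, input size, and confidence threshold are placeholders rather than the project's actual configuration, and the output format of the model depends on how it was exported.

```python
import cv2
import torch

MODEL_PATH = "yolov7_traced.pt"   # hypothetical traced-model file (Step 4)
INPUT_SIZE = 640                  # assumed network input resolution
CONF_THRESHOLD = 0.25             # assumed confidence cut-off (Step 7)

model = torch.jit.load(MODEL_PATH).eval()     # Step 4: initialize the detector

cap = cv2.VideoCapture("input_video.mp4")     # Step 2: read the video
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Step 3: pre-process (resize, BGR -> RGB, scale to [0, 1], NCHW layout).
    resized = cv2.resize(frame, (INPUT_SIZE, INPUT_SIZE))
    rgb = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)
    tensor = torch.from_numpy(rgb).permute(2, 0, 1).float().unsqueeze(0) / 255.0

    # Step 6: detect objects within the pre-processed frame.
    with torch.no_grad():
        predictions = model(tensor)

    # Steps 7-10: filter detections below CONF_THRESHOLD, apply NMS, and emit
    # labels, confidence scores, and bounding-box coordinates (the exact output
    # layout depends on the exported model, so it is only sketched here).
cap.release()
```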

CHAPTER 4

SYSTEM IMPLEMENTATION

4.1 TRACED YOLO V7 MODEL

Implementing the YOLOv7 model shown in Fig 4.1 for object detection in


videos entails a systematic approach beginning with data acquisition. Video
data, sourced from cameras, files, or streams, undergoes segmentation into
individual frames, representing discrete moments within the video sequence.
These frames undergo image pre-processing, where techniques like resizing,
normalization, and noise reduction are applied to standardize their format and
enhance their quality, optimizing them for subsequent analysis. The crux of the
YOLOv7 model lies in feature extraction, achieved through a convolutional
neural network (CNN) backbone.

This backbone network extracts hierarchical features from the frames,


encompassing both fine-grained details and high-level semantics crucial for
precise object detection. Simultaneously, the model conducts classification,
wherein feature maps are processed by detection heads to predict bounding
boxes and class probabilities for objects within the frames. This integrated
approach enables YOLOv7 to efficiently detect objects across various scales
and aspect ratios, ensuring robust performance in real-time applications.

The seamless integration of the YOLOv7 model into the video processing
pipeline equips the system with the capability to accurately detect and classify
objects within video streams, bolstering its efficacy in domains like
surveillance, traffic monitoring, and video analytics. Through these steps, the
object detection system achieves a holistic understanding of the video content,
facilitating informed decision-making and enhancing situational awareness in
diverse scenarios.

Fig 4.1 YOLO V7 Model

4.2 WORKING PRINCIPLES OF TRACED YOLO V7 MODEL

The working principles of the Traced YOLOv7 model for object


detection in videos align with the specified steps of the object detection system
using deep learning from videos:

Data Acquisition: Initially, video data is obtained from diverse sources


like cameras or video files, serving as the input for object detection. This step is
crucial for identifying objects within the video frames accurately.

Frame Segmentation: Following data acquisition, the video stream undergoes


segmentation into individual frames. Each frame represents a distinct snapshot of
the video, enabling independent analysis for object detection purposes.

Image Pre-processing: Before inputting frames into the Traced YOLOv7


model, pre-processing steps are executed to improve their quality and usability.
Tasks may include resizing frames, normalizing pixel values, and reducing noise
to enhance object detection accuracy.

Feature Extraction: The Traced YOLOv7 model employs a convolutional
neural network (CNN) backbone to automatically extract pertinent features
from pre-processed frames. These features capture critical object characteristics
such as edges, textures, and shapes.

Classification: Alongside object detection, the Traced YOLOv7 model performs


classification, predicting bounding boxes and class probabilities for detected
objects within frames. This enables accurate identification and classification of
objects in the video stream.

4.3 APPROACH

The approach of the Traced YOLOv7 model, shown in Fig 4.2, involves a
systematic process geared towards enabling efficient deployment and real-time
object detection in videos. Initially, the YOLOv7 model undergoes
comprehensive training using a diverse dataset of annotated images and videos,
where it learns to accurately detect and classify objects within the visual
data through optimization of its parameters based on defined loss functions.
Following training, the model is subjected to optimization techniques aimed at
streamlining its architecture for deployment on edge devices. This optimization
phase typically encompasses model quantization, pruning, and compression to
reduce its size and computational complexity while maintaining performance
integrity.

Fig 4.2 Approach

Following optimization, the YOLOv7 model undergoes tracing to


TensorFlow Lite for edge device compatibility, enabling real-time object
detection. Integrated into video processing pipelines, it efficiently operates on
individual frames, performing preprocessing, feature extraction, and
classification for object detection. With its optimized design and edge device
integration, the Traced YOLOv7 model ensures low-latency inference, crucial
for applications such as surveillance, video analytics, and autonomous vehicles.
Finally, the model is deployed on edge devices, where its lightweight and
efficient architecture make it well-suited for resource-constrained environments,
ensuring accurate and timely object detection in real-world scenarios. Through
this methodical approach, the Traced YOLOv7 model empowers various
applications with robust and efficient object detection capabilities in video data.
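
As one concrete way to produce the "traced" model referred to above, the sketch below uses torch.jit.trace on a loaded PyTorch detection model; the checkpoint layout, file names, and input resolution are assumptions, and exporting to other runtimes such as TensorFlow Lite would be an additional conversion step.

```python
import torch

# Hypothetical: a YOLOv7-style checkpoint that stores the model object under a
# "model" key (the exact layout varies between checkpoints and exports).
ckpt = torch.load("yolov7.pt", map_location="cpu")
model = ckpt["model"].float().eval()

# Dummy input with the assumed network resolution (1 x 3 x 640 x 640).
example = torch.zeros(1, 3, 640, 640)

# torch.jit.trace records the operations executed for this example input and
# produces a static, serializable graph suitable for deployment.
traced = torch.jit.trace(model, example, strict=False)
traced.save("yolov7_traced.pt")

# The traced module can later be reloaded without the original class definition.
reloaded = torch.jit.load("yolov7_traced.pt")
```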

4.4 ADVANTAGES OF USING TRACED YOLO V7
MODEL CLASSIFIER OVER OTHER CLASSIFIERS

The Traced YOLOv7 model classifier offers several advantages over


other classifiers, making it a preferred choice for object detection tasks in
videos:

Real-Time Performance: The Traced YOLOv7 model is optimized for real-


time performance, enabling it to process video frames and detect objects with
low latency. This real-time capability is crucial for applications such as
surveillance, where timely detection of objects is essential.

Efficient Deployment on Edge Devices: The Traced YOLOv7 model is


designed for deployment on edge devices such as mobile phones, drones, or
embedded systems. Its lightweight architecture and efficient inference make it
well-suited for resource-constrained environments, allowing for on-device
processing without reliance on cloud servers.

Accurate Object Detection: The YOLOv7 architecture, upon which the Traced
YOLOv7 model is based, is known for its high accuracy in object detection
tasks. By leveraging advanced deep learning techniques and multi-scale feature
extraction, the model can accurately detect objects of varying sizes and aspect
ratios within video frames.

Single Stage Detection: Unlike traditional two-stage detectors, such as Faster
R-CNN, which require separate region proposal and object classification stages,
YOLOv7 performs object detection in a single stage. This simplifies the
detection pipeline and reduces inference time, resulting in faster processing of
video data.

Multi-Object Detection: The Traced YOLOv7 model excels at detecting
multiple objects within a single frame simultaneously. This capability is
particularly advantageous in crowded scenes or scenarios where numerous
objects need to be detected and classified concurrently.

End-to-End Training: YOLOv7 models, including the Traced variant, can be


trained end-to-end on large datasets, facilitating seamless integration of domain-
specific features, and improving overall detection performance.

Flexibility and Customization: The Traced YOLOv7 model offers flexibility


and customization options, allowing developers to fine-tune model parameters,
optimize performance, and adapt to specific application requirements.

4.5 STEPS INVOLVED IN OBJECT DETECTION SYSTEM

The steps involved in the object detection system, explained in Fig 4.3, are:

Fig 4.3 Object Detection System

4.5.1 DATA ACQUISITION

Data acquisition for an object detection system using deep learning from
videos is a crucial process that involves gathering video data from various
sources to create a comprehensive dataset for model training and evaluation.
Initially, the identification of suitable data sources, including surveillance
cameras, online repositories, or recorded video streams, sets the stage for
collecting the necessary footage. Prior to acquisition, it's imperative to address
legal considerations and obtain permissions if the videos contain sensitive
information. Once obtained, the videos undergo annotation and labelling to
mark the presence and location of objects within frames, a vital step for
supervised learning. Following annotation, the dataset may undergo pre-
processing tasks such as resizing, format conversion, or noise reduction to
ensure consistency and quality. Subsequently, the dataset is divided into
training, validation, and testing sets to facilitate model training and evaluation.
Additionally, data augmentation techniques may be applied to increase dataset
diversity and improve model generalization. Ultimately, well- structured storage
and management of the annotated and pre-processed video dataset are
paramount for seamless access during model development and deployment.
Through meticulous data acquisition, an object detection system can be trained
effectively, enhancing its ability to accurately identify objects within video
streams.

4.5.2 FRAME SEGMENTATION

Frame segmentation, a key step in video-based object detection systems,


involves dividing a continuous video stream into individual frames, each
representing a distinct moment for analysis. This process enables efficient
object detection by allowing the model to focus on one frame at a time.
Techniques like uniform sampling or motion-based segmentation ensure

representative frames are extracted while minimizing computational load.
Adjusting the frame rate balances resources with temporal information.
Segmented frames are then organized for further processing, facilitating the
application of object detection algorithms in tasks like surveillance and
autonomous navigation.
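
A minimal sketch of uniform-interval frame extraction with OpenCV is given below; the video path, output directory, and sampling interval are placeholders rather than values used in the project.

```python
import os
import cv2

VIDEO_PATH = "input_video.mp4"   # hypothetical input video
OUTPUT_DIR = "frames"            # hypothetical output directory
EVERY_N_FRAMES = 10              # assumed uniform sampling interval

os.makedirs(OUTPUT_DIR, exist_ok=True)
cap = cv2.VideoCapture(VIDEO_PATH)

index, saved = 0, 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Keep one frame out of every EVERY_N_FRAMES to limit the processing load.
    if index % EVERY_N_FRAMES == 0:
        cv2.imwrite(os.path.join(OUTPUT_DIR, f"frame_{saved:05d}.jpg"), frame)
        saved += 1
    index += 1

cap.release()
print(f"extracted {saved} frames from {index} frames read")
```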

4.5.3 IMAGE PRE-PROCESSING

Image pre-processing is a critical step in preparing video frames for


object detection using deep learning techniques. It involves a series of
operations aimed at enhancing the quality of the frames and standardizing their
format to optimize subsequent processing. In the context of object detection
systems from videos, image pre-processing typically includes the following
steps:

Resizing frames to a uniform size ensures consistency and efficient


processing by the deep learning model, preventing issues from varying aspect
ratios or resolutions. Additionally, normalization standardizes pixel values
across frames, stabilizing the training process by bringing values within a
specific range, typically 0 to 1 or -1 to 1. This consistency ensures consistent
weight updates during training, optimizing model performance.

Additionally, reducing noise and artifacts in the frames is crucial for


improving the quality of input data. Techniques such as denoising filters or
image smoothing algorithms may be applied to remove unwanted noise and
enhance the clarity of objects in the frames. Furthermore, adjusting the color
space or applying color corrections may be necessary to ensure consistency in
color representation across frames. This step helps in mitigating variations in
lighting conditions or camera settings, which can affect the performance of the
object detection model. Finally, data augmentation techniques such as rotation,
flipping, or adding random perturbations may be applied to augment the dataset.
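
The resizing, denoising, and normalization operations described above might look like the sketch below; the target resolution and the [0, 1] value range are assumptions, and the Gaussian blur and horizontal flip stand in for the broader families of denoising and augmentation techniques mentioned in the text.

```python
import cv2
import numpy as np

TARGET_SIZE = (640, 640)   # assumed uniform frame size

def preprocess(frame_bgr):
    # Resize every frame to one fixed resolution for the detector.
    resized = cv2.resize(frame_bgr, TARGET_SIZE)
    # Optional denoising to suppress sensor noise and compression artifacts.
    denoised = cv2.GaussianBlur(resized, (3, 3), 0)
    # Convert BGR (OpenCV default) to RGB and scale pixel values to [0, 1].
    rgb = cv2.cvtColor(denoised, cv2.COLOR_BGR2RGB)
    return rgb.astype(np.float32) / 255.0

def augment(frame):
    # Simple augmentation example: horizontal flip of a pre-processed frame.
    return np.fliplr(frame).copy()
```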

4.5.4 FEATURE EXTRACTION

Feature extraction in an object detection system with the Traced YOLOv7


model involves several key steps. Firstly, pre-processed video frames are
inputted into the model, which acts as both a feature extractor and a detector.
Leveraging a convolutional neural network (CNN), the model extracts
hierarchical features capturing spatial information, textures, and patterns.

As frames pass through the CNN backbone, feature maps are generated at
multiple levels, encoding representations of object presence and spatial
relationships. Additionally, feature pyramid networks (FPNs) aggregate these
maps, enhancing detection across various scales and sizes.

Utilizing anchor boxes and prediction layers, the model generates bounding
boxes and confidence scores for detected objects based on extracted features,
facilitating object localization and presence indication.

Moreover, temporal information from consecutive frames can be integrated using


recurrent neural networks (RNNs) or 3D convolutional networks, enabling
motion pattern recognition and improved object tracking within videos.
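
A sketch of extracting ResNet-101 feature maps with torchvision is shown below (it assumes a recent torchvision with the weights API); the input is a single pre-processed frame tensor, and the 640 x 640 resolution is an assumption carried over from the earlier sketches.

```python
import torch
import torchvision

# ResNet-101 backbone with ImageNet weights; drop the average-pooling and
# classification layers so the network outputs feature maps, not class scores.
weights = torchvision.models.ResNet101_Weights.DEFAULT
backbone = torchvision.models.resnet101(weights=weights)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()

# Dummy pre-processed frame: batch of 1, 3 channels, 640 x 640 (assumed size).
frame = torch.rand(1, 3, 640, 640)

with torch.no_grad():
    features = feature_extractor(frame)

# For a 640 x 640 input, the last convolutional stage yields a
# 1 x 2048 x 20 x 20 feature map (overall stride of 32).
print(features.shape)
```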

4.5.5 SPLITTING THE DATA INTO TRAINING AND


TESTING DATASETS

Before building the model, the data is separated into two parts: training data
and test data. Fig 4.4 shows a diagrammatic representation of how the data is
split into training and testing sets, which are used to obtain the detection of
objects from videos.
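
A minimal sketch of the split is given below, assuming the annotated frames are referenced as file paths and using scikit-learn's train_test_split; the 1000-frame dataset and the 80/20 ratio are made-up values for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical annotated dataset: frame images and their label files.
frame_paths = [f"frames/frame_{i:05d}.jpg" for i in range(1000)]
label_paths = [f"labels/frame_{i:05d}.txt" for i in range(1000)]

# 80% of the frames for training, 20% held out for testing (assumed ratio).
train_frames, test_frames, train_labels, test_labels = train_test_split(
    frame_paths, label_paths, test_size=0.2, random_state=42, shuffle=True
)

print(len(train_frames), "training frames,", len(test_frames), "test frames")
```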

4.5.6 CLASSIFICATION

In the classification step of object detection using the Traced YOLOv7 model,
features are first extracted from pre-processed video frames, capturing spatial
information, textures, and patterns. These features are then passed through
detection heads within the Traced YOLOv7 architecture.

Utilizing anchor boxes and prediction layers, the model generates bounding
boxes enclosing detected objects, alongside confidence scores indicating object
presence likelihood. Subsequently, the classifier assigns class probabilities to
each detected object based on learned feature representations, ensuring accurate
classification into predefined categories.

To refine detections, post-processing techniques like non-maximum suppression


(NMS) may be employed to filter redundant bounding boxes. Additionally, the
classifier can utilize temporal information from consecutive frames to enhance
classification accuracy by considering motion patterns.
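
For the non-maximum suppression step mentioned above, a short sketch using torchvision.ops.nms follows; the boxes, scores, and IoU threshold are illustrative values only.

```python
import torch
from torchvision.ops import nms

# Hypothetical detections: boxes in (x1, y1, x2, y2) format with scores.
boxes = torch.tensor([
    [100., 100., 210., 220.],   # detection A
    [105., 102., 215., 225.],   # near-duplicate of A
    [400., 300., 480., 380.],   # a separate object
])
scores = torch.tensor([0.90, 0.75, 0.80])

# Keep the highest-scoring box among overlapping ones (IoU > 0.5 assumed).
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)          # indices of the boxes that survive suppression
print(boxes[keep])
```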

Following this classification process, model evaluation proceeds to ResNet-101,
a deep neural network architecture, for further assessment and refinement.

4.6 EVALUATION METRICS

After the classification step with the Traced YOLOv7 model classifier,
the detected objects and their corresponding bounding boxes undergo further
evaluation using a pre-trained ResNet-101 model. ResNet-101, a deep
convolutional neural network architecture known for its exceptional
performance in various computer vision tasks, is employed to perform several
key tasks. Firstly, it recognizes and classifies the detected objects within the
video frames, leveraging its deep architecture and learned feature
representations to accurately identify objects based on visual characteristics.

ResNet-101 facilitates fine-grained classification, distinguishing closely
related categories and offering detailed insights into video frame content.
Additionally, it extracts high-level features from detected objects, capturing
crucial semantic and contextual cues for comprehensive understanding.
Furthermore, ResNet-101 performs semantic segmentation, assigning each pixel
to a specific object class, enabling detailed spatial analysis. Finally, evaluation
of the ResNet-101 output utilizes metrics like Mean Average Precision (mAP)
to assess the performance of the object detection system accurately. By combining the
capabilities of Traced YOLOv7 for initial detection and ResNet-101 for detailed
evaluation, the object detection system achieves robust and reliable object
recognition in diverse video scenarios.
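
Since mAP is built on the intersection-over-union between predicted and ground-truth boxes, a small IoU sketch is included below; the box format (x1, y1, x2, y2) and the sample coordinates are illustrative assumptions.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction is typically counted as a true positive when IoU >= 0.5;
# precision/recall over all detections then gives AP per class, and mAP is
# the mean of AP over all classes.
print(iou([100, 100, 200, 200], [150, 150, 250, 250]))  # ~0.143
```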

4.7 LIMITATIONS OF YOLO V7


YOLO v7 is a powerful and effective object detection algorithm, but it does
have a few limitations.

• YOLO v7, like many object detection algorithms, struggles to detect


small objects. It might fail to accurately detect objects in
crowded scenes or when objects are far away from the camera.
• YOLO v7 is also not perfect at detecting objects at different scales. This
can make it difficult to detect objects that are either very large or very
small compared to the other objects in the scene.
• YOLO v7 can be sensitive to changes in lighting or other
environmental conditions, so it may be less reliable in
real-world applications where lighting conditions vary.
• YOLO v7 can be computationally intensive, which can make it difficult
to run in real time on resource-constrained devices like smartphones.

CHAPTER 5

TESTING

Testing for object detection systems using deep learning from videos is a
crucial phase in assessing the performance and reliability of such systems
before deployment in real-world scenarios. This testing process involves
evaluating the model's ability to accurately detect and classify objects within
video frames, ensuring robustness and effectiveness across diverse conditions
and scenarios. By subjecting the object detection system to rigorous testing,
developers can identify potential weaknesses, optimize model parameters, and
enhance overall performance.

Testing encompasses various aspects, including dataset preparation,


evaluation metrics, and validation procedures, all aimed at validating the
system's efficacy and ensuring its suitability for practical applications. Through
comprehensive testing, object detection systems can be refined and validated,
paving the way for their deployment in critical domains such as surveillance,
video analytics, and autonomous systems. This introduction sets the stage for
understanding the importance and objectives of testing in object detection
systems using deep learning from videos.

5.1 TEST CASES

Our work is divided into five primary sections for testing, as shown in Table
5.1. The loading dataset, Frame segmentation, Image pre-processing, Feature
extraction, and Classification are the five modules. The test cases for the
modules and submodules have been checked and passed. The test case id, test
case scenario, test case secondary considerations, and test case state are all
considered when testing our work. Test cases are validated, and the outcomes
and status for the scenarios are handled.

TC ID | Scenario           | Secondary Consideration                                                        | State
TC01  | Loading dataset    | A large amount of object video data is loaded into the object detection system | Pass
TC02  | Frame segmentation | Video data is converted into frames of images                                  | Pass
TC03  | Pre-processing     | The framed image data is pre-processed                                         | Pass
TC05  | Feature extraction | Features are extracted from the pre-processed data and checked                 | Pass
TC06  | Classification     | The classifier is checked for detecting the specified object                   | Pass

Table 5.1 Test cases
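
To make the pre-processing row (TC03) concrete, the following is a minimal sketch of
how an extracted frame might be letterboxed and normalized before detection. The
640x640 target size follows the common YOLOv7 input resolution, and the helper name
and padding colour are our illustrative choices, not the project's exact code.

# Hedged sketch of frame pre-processing: letterbox-resize a BGR frame to
# 640x640, convert to RGB, and scale pixel values to [0, 1].
import cv2
import numpy as np


def preprocess_frame(frame, size=640):
    h, w = frame.shape[:2]
    scale = size / max(h, w)                                 # keep aspect ratio
    resized = cv2.resize(frame, (int(w * scale), int(h * scale)))
    canvas = np.full((size, size, 3), 114, dtype=np.uint8)   # grey letterbox padding
    canvas[:resized.shape[0], :resized.shape[1]] = resized
    rgb = cv2.cvtColor(canvas, cv2.COLOR_BGR2RGB)
    return rgb.astype(np.float32) / 255.0                    # normalized model input


if __name__ == "__main__":
    dummy = np.zeros((480, 854, 3), dtype=np.uint8)          # stand-in for a video frame
    print(preprocess_frame(dummy).shape)                     # (640, 640, 3)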

CHAPTER 6

FUTURE WORK

Some potential future directions for improving the object detection system
based on the YOLO V7 model:

Real-Time Performance Optimization: Further optimize the model for real-time
performance across different hardware platforms to enable faster inference speeds.

Multi-Object Tracking: Integrate multi-object tracking algorithms to detect and
track objects across consecutive frames, which is useful in surveillance and
sports analytics.

Improved Accuracy and Generalization: Continuously refine the model architecture
and training process to enhance detection accuracy and generalization in diverse
environments.

Semantic Segmentation Integration: Explore integrating semantic segmentation
techniques to improve object localization and scene understanding.

Efficient Deployment: Develop strategies such as model compression and
quantization for deployment on resource-constrained devices (a brief sketch
follows this list).

Domain Adaptation: Investigate techniques to adapt the model to specific target
domains for improved performance.

Uncertainty Estimation: Incorporate uncertainty estimation techniques to provide
confidence scores for detected objects, aiding decision-making.

Spatiotemporal Analysis: Extend the model to analyze spatiotemporal information
in videos for tasks such as action recognition and anomaly detection.
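
As one possible starting point for the deployment direction above, the following hedged
sketch applies PyTorch post-training dynamic quantization to a placeholder network. The
layer sizes and the model itself are illustrative assumptions, not the project's trained
detector; a real deployment of a convolutional backbone would more likely use static or
graph-mode quantization.

# Hedged sketch: post-training dynamic quantization with PyTorch.
import torch
import torch.nn as nn

# Placeholder classifier head standing in for a trained model (assumption,
# not the project's YOLOv7 weights).
model = nn.Sequential(
    nn.Linear(2048, 512),
    nn.ReLU(),
    nn.Linear(512, 80),  # e.g. 80 COCO classes
)

# Dynamic quantization: weights of nn.Linear layers are stored in int8 and
# dequantized on the fly during inference, reducing model size.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 2048)
print(quantized(x).shape)  # torch.Size([1, 80])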

APPENDIX A

SAMPLE SCREENSHOTS

The Object Detection and Tracking System UI presents an intuitive
interface for users interested in analyzing objects within videos. Users initiate
the analysis by selecting the "detect" option, which triggers the object
detection mechanism illustrated in Fig. A.1 to A.7. Upon selection, users are
prompted to upload a video for object detection. Using deep learning and machine
learning algorithms, the system identifies and visually displays the detected
objects within the uploaded video. Once the objects are recognized, users can
enter specific inquiries related to the detected objects. These inquiries are
processed by the system's algorithms, which incorporate natural language
processing methods, and tailored outputs are generated in response, providing
users with detailed insights, tracking information, classifications, or other
relevant data pertaining to the detected objects. Through this process, the
Object Detection and Tracking System offers users a comprehensive and
interactive platform for exploring and understanding objects within videos.

Fig A.1 Landing Page

Fig A.2 Upload a video

Fig A.3 Model output download

OUTPUT

Fig A.4 Output 1

Fig A.5 Output 2

Fig A.6 Output 3

Fig A.7 Output 4

APPENDIX B
SAMPLE CODING
In this appendix, we outline the code snippets employed in our
development and evaluation of an AI-powered object detection system for video
analysis. Users are guided to input essential parameters and upload videos for
object detection. Noteworthy is the incorporation of a specialized deep learning
model that allows users to interact with the system's output, enabling detailed
analysis of detected objects. Moreover, the implementation includes robust
error-handling mechanisms to ensure a smooth user experience and precise
detection results, thereby improving the system's robustness and user
satisfaction.

Yolo.py:

import argparse
import logging
import sys
from copy import deepcopy

sys.path.append('./')  # to run '$ python *.py' files in subdirectories
logger = logging.getLogger(__name__)

import torch
import torch.nn as nn  # nn is also pulled in by the star imports below

from models.common import *
from models.experimental import *
from utils.autoanchor import check_anchor_order
from utils.general import make_divisible, check_file, set_logging
from utils.torch_utils import time_synchronized, fuse_conv_and_bn, model_info, \
    scale_img, initialize_weights, select_device, copy_attr
from utils.loss import SigmoidBin

try:
    import thop  # for FLOPS computation
except ImportError:
    thop = None

class Detect(nn.Module):
    # Detection head: turns feature maps into box/objectness/class predictions.
    stride = None  # strides computed during build
    export = False  # onnx export
    end2end = False
    include_nms = False
    concat = False

    def __init__(self, nc=80, anchors=(), ch=()):  # detection layer
        super(Detect, self).__init__()
        self.nc = nc  # number of classes
        self.no = nc + 5  # number of outputs per anchor (box + objectness + classes)
        self.nl = len(anchors)  # number of detection layers
        self.na = len(anchors[0]) // 2  # number of anchors
        self.grid = [torch.zeros(1)] * self.nl  # init grid
        a = torch.tensor(anchors).float().view(self.nl, -1, 2)
        self.register_buffer('anchors', a)  # shape(nl,na,2)
        self.register_buffer('anchor_grid', a.clone().view(self.nl, 1, -1, 1, 1, 2))  # shape(nl,1,na,1,1,2)
        self.m = nn.ModuleList(nn.Conv2d(x, self.no * self.na, 1) for x in ch)  # output conv

    def forward(self, x):
        # x = x.copy()  # for profiling
        z = []  # inference output
        self.training |= self.export
        for i in range(self.nl):
            x[i] = self.m[i](x[i])  # conv
            bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
            x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

            if not self.training:  # inference
                if self.grid[i].shape[2:4] != x[i].shape[2:4]:
                    self.grid[i] = self._make_grid(nx, ny).to(x[i].device)
                y = x[i].sigmoid()
                if not torch.onnx.is_in_onnx_export():
                    y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i]  # xy
                    # ... (remainder of Detect.forward omitted in this excerpt)

class IDetect(nn.Module):  # class header restored from context; attributes defined above this point are omitted in the excerpt
    include_nms = False
    concat = False

    def __init__(self, nc=80, anchors=(), ch=()):  # detection layer
        super(IDetect, self).__init__()
        self.nc = nc  # number of classes
        self.no = nc + 5  # number of outputs per anchor
        self.nl = len(anchors)  # number of detection layers
        self.na = len(anchors[0]) // 2  # number of anchors
        self.grid = [torch.zeros(1)] * self.nl  # init grid
        a = torch.tensor(anchors).float().view(self.nl, -1, 2)
        self.register_buffer('anchors', a)  # shape(nl,na,2)
        self.register_buffer('anchor_grid', a.clone().view(self.nl, 1, -1, 1, 1, 2))  # shape(nl,1,na,1,1,2)
        self.m = nn.ModuleList(nn.Conv2d(x, self.no * self.na, 1) for x in ch)  # output conv
        self.ia = nn.ModuleList(ImplicitA(x) for x in ch)  # implicit addition modules
        self.im = nn.ModuleList(ImplicitM(self.no * self.na) for _ in ch)  # implicit multiplication modules

    def forward(self, x):
        # x = x.copy()  # for profiling
        z = []  # inference output
        self.training |= self.export
        for i in range(self.nl):
            # ... (per-layer convolution and shape unpacking omitted in this excerpt) ...
            x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

            if not self.training:  # inference
                if self.grid[i].shape[2:4] != x[i].shape[2:4]:
                    self.grid[i] = self._make_grid(nx, ny).to(x[i].device)
                y = x[i].sigmoid()
                if not torch.onnx.is_in_onnx_export():
                    y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i]  # xy
                    y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
                else:
                    xy, wh, conf = y.split((2, 2, self.nc + 1), 4)  # y.tensor_split((2, 4, 5), 4)  # torch 1.8.0
                    xy = xy * (2. * self.stride[i]) + (self.stride[i] * (self.grid[i] - 0.5))  # new xy
                    wh = wh ** 2 * (4 * self.anchor_grid[i].data)  # new wh
                    y = torch.cat((xy, wh, conf), 4)
                z.append(y.view(bs, -1, self.no))

        if self.training:
            out = x
        elif self.end2end:
            out = torch.cat(z, 1)
        # ... (remaining output branches omitted in this excerpt)

# Excerpt from the layer-construction loop of parse_model(), which builds the
# network layer by layer from the YAML configuration (preceding branches omitted).
                args[1] = [list(range(args[1] * 2))] * len(f)
        elif m is ReOrg:
            c2 = ch[f] * 4
        elif m is Contract:
            c2 = ch[f] * args[0] ** 2
        elif m is Expand:
            c2 = ch[f] // args[0] ** 2
        else:
            c2 = ch[f]

        m_ = nn.Sequential(*[m(*args) for _ in range(n)]) if n > 1 else m(*args)  # module
        t = str(m)[8:-2].replace('__main__.', '')  # module type
        np = sum([x.numel() for x in m_.parameters()])  # number params
        m_.i, m_.f, m_.type, m_.np = i, f, t, np  # attach index, 'from' index, type, number params
        logger.info('%3s%18s%3s%10.0f  %-40s%-30s' % (i, f, n, np, t, args))  # print
        save.extend(x % i for x in ([f] if isinstance(f, int) else f) if x != -1)  # append to savelist
        layers.append(m_)
        if i == 0:
            ch = []
        ch.append(c2)

    return nn.Sequential(*layers), sorted(save)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--cfg', type=str, default='yolor-csp-c.yaml', help='model.yaml')
    parser.add_argument('--device', default='', help='cuda device, i.e. 0 or 0,1,2,3 or cpu')
    parser.add_argument('--profile', action='store_true', help='profile model speed')
    opt = parser.parse_args()
    opt.cfg = check_file(opt.cfg)  # check file
    set_logging()
    device = select_device(opt.device)

    # Create model
    model = Model(opt.cfg).to(device)
    model.train()

    if opt.profile:
        img = torch.rand(1, 3, 640, 640).to(device)
        y = model(img, profile=True)
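
For reference, the script above would be invoked as, for example,
"python yolo.py --cfg yolor-csp-c.yaml --device 0 --profile", which builds the model
from the given YAML configuration, moves it to the selected device, and profiles one
forward pass on a random 640x640 input, following the argument definitions in the
block above.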

APPENDIX C

SYSTEM REQUIREMENTS

This appendix lists the system requirements needed for our work. The
hardware specification, the software specification, the supported browsers, and
the source of the object data we used are given below.

HARDWARE SPECIFICATION

● Processor (CPU) with 8 GB RAM
● Internet connection
● Keyboard and mouse or another compatible pointing device

SOFTWARE SPECIFICATION

● Windows 10 or Higher
● Visual Studio Code

BROWSERS

● Chrome
● Edge
● Mozilla Firefox
● Internet Explorer
● Safari

DATASETS

● COCO

REFERENCES

1. "YOLOv4: Optimal Speed and Accuracy of Object Detection" by


Bochkovskiy et al. introduces YOLOv4 for enhanced object detection.
2. "Deep Learning for Object Detection, Classification, and Tracking in
Industry" addressing the significance and challenges of object detection in
industry applications.
3. "Deep Learning for Object Detection, Classification, and Tracking in
Industry" discusses the importance and challenges of computer vision
techniques in industry applications.
4. "Deep Learning for Object Detection, Classification, and Tracking in
Industry Applications" by discussing the importance of computer vision
techniques.
5. “Deep Learning for Object Detection, Classification, and Tracking in
Industry Applications" by discussing the importance of computer vision
techniques.
6. "Real-time Object Detection with YOLO" by Joseph Redmon and
Santosh Divvala presenting the YOLO algorithm for real-time detection.
7. "Real-time Object Detection with YOLO" by Joseph Redmon and
Santosh Divvala presents the YOLO algorithm for fast and accurate object
detection in real-time applications.
8. "Mask R-CNN" by Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross
Girshick extending Faster R-CNN to include instance segmentation
capabilities.
9. "Temporal Object Detection in Videos with Tubelet Proposal Networks"
by Kang et al. introduces tubelet proposal networks for temporal object
detection.
10. "Single Shot MultiBox Detector" by Liu et al. proposes SSD for efficient
51
object detection.

11. "A Review of Machine Learning and Deep Learning for Object Detection
and Semantic Segmentation" by Manakitsa et al., providing insights into
machine and deep learning applications for object detection and semantic
segmentation."

12. "Object Detection and Tracking using Deep Learning for Video
Surveillance" by Mohana and RAVISH ARADHYA H V, 2019."

13. "Faster R-CNN" by Ren et al. introduces a framework enhancing object


detection speed and accuracy through region proposal networks."

14. "Introduction to Object Detection with Deep Learning" by SuperAnnotate,


exploring the evolution of object detection models and the importance of deep
learning."

15. "DeepSORT: Simple Online Realtime Tracking with Deep Association


Metric" by Wojke et al. presents real-time object tracking.

16. CenterNet: Keypoint Triplets for Object Detection" by Zhang et al.


proposes the CenterNet framework for precise object detection.

17. "Object Detection with Deep Learning: A Review" by Zhao, Zheng, Xu, and
Wu, offering an in-depth analysis of deep learning-based object detection
frameworks.

