NN Report (3) 4
Abstract
This project addresses the urgent need for effective social distancing enforcement in the wake of the
COVID-19 pandemic through the development of a real-time social distance detection system using
deep learning. Leveraging the YOLO (You Only Look Once) object detection algorithm, the system
accurately identifies and locates individuals within a video feed. By calculating the Euclidean
distance between the centroids of the detected bounding boxes, the system assesses whether
individuals are maintaining a safe distance from each other as per predefined thresholds. To ensure
precise real-world distance measurements, the system incorporates camera calibration techniques
that account for perspective distortions. The model is fine-tuned using a custom dataset to enhance
detection accuracy in specific environments, achieving high performance in both controlled and real-
world settings. The implementation includes optimizations for real-time processing, enabling
deployment on various hardware platforms, including edge devices. Visual and auditory alerts are
generated when social distancing violations are detected, providing an immediate response
mechanism for monitoring compliance in public spaces, workplaces, and other critical areas. This
project demonstrates the potential of combining deep learning and computer vision technologies to
create scalable and efficient solutions for public health challenges, with future enhancements aimed
at improving accuracy and adaptability in more complex environments.
TABLE OF CONTENTS
CHAPTER NO.  TITLE
1  INTRODUCTION
   1.1 Problem Statement
   1.2 Aim
   1.3 Objectives
2  LITERATURE SURVEY
   2.1 Object Detection
3  SYSTEM REQUIREMENT SPECIFICATION
   3.1 Purpose
4  SYSTEM DESIGN
5  IMPLEMENTATION
6  TESTING
   6.1 Objectives of Testing
7  CONCLUSIONS
8  FUTURE ENHANCEMENT
9  REFERENCES
CHAPTER 1:
INTRODUCTION
In the wake of the global COVID-19 pandemic, the necessity of effective social distancing measures
has become paramount in curbing the spread of the virus. As societies worldwide grapple with the
challenges of maintaining public health while reopening economies and resuming daily activities,
technology emerges as a crucial ally in enforcing social distancing protocols. This project introduces
a cutting-edge approach to address this pressing need through the utilization of deep learning and
computer vision techniques. By harnessing the power of the YOLO (You Only Look Once) object
detection algorithm, this system aims to provide real-time monitoring and enforcement of social
distancing guidelines in various settings. The introduction of such a system is particularly timely
and relevant, as governments, businesses, and communities seek innovative solutions to mitigate the
risks associated with gatherings and interactions in public spaces. This project endeavors to
contribute to the broader effort of safeguarding public health by empowering stakeholders with the
tools necessary to monitor and enforce social distancing effectively. Through a combination of
advanced technology and practical implementation, this project aims to offer a scalable and
adaptable solution that can be deployed across diverse environments, from retail stores and public
transit systems to schools and workplaces. As the world continues to navigate the complexities of
the ongoing pandemic, the development of robust and reliable social distance detection systems
represents a critical step towards ensuring the safety and well-being of individuals and communities
alike.
1.1 PROBLEM STATEMENT:
The COVID-19 pandemic has underscored the critical importance of maintaining social distancing
to mitigate the spread of infectious diseases. However, enforcing social distancing protocols in
public spaces poses significant challenges due to the need for constant monitoring and intervention.
Traditional methods of manual enforcement are labor-intensive, prone to human error, and often
ineffective in ensuring compliance on a large scale. Thus, there is a pressing need for automated
systems capable of accurately detecting and alerting authorities to instances of social distancing
violations in real-time.
1.2 AIM:
The aim of this project is to develop an automated social distance detection system using deep
learning and computer vision techniques. By leveraging state-of-the-art object detection algorithms
such as YOLO (You Only Look Once), the system aims to provide robust and reliable monitoring
of social distancing compliance in various environments. The ultimate goal is to empower
authorities, businesses, and individuals with a scalable and effective tool for enforcing social
distancing protocols and reducing the risk of disease transmission.
The existing systems for monitoring social distancing often rely on manual enforcement or
rudimentary technologies that lack the accuracy and scalability required for widespread deployment.
Manual enforcement is resource-intensive and impractical for continuous monitoring of large
crowds or public spaces. Some existing automated systems utilize basic computer vision algorithms
for detecting people and measuring distances, but they often suffer from limitations in accuracy, speed,
and robustness. There is a clear need for more advanced and reliable solutions capable of
overcoming these challenges.
The proposed system will employ the YOLO object detection algorithm to accurately detect and
localize individuals within a video feed in real-time. By analyzing the spatial relationships between
detected individuals, the system will calculate the distances between them and compare these
distances to predefined thresholds to determine social distancing compliance. The system will be
designed to provide visual and auditory alerts when violations are detected, enabling authorities to
take immediate corrective action. Additionally, the system will incorporate optimizations for real-
time performance and scalability, making it suitable for deployment in diverse environments and
settings.
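The distance check described above can be sketched in a few lines of Python. The function names and the 50-pixel threshold are illustrative assumptions, not the project's actual code; a real deployment would map the pixel threshold to a calibrated real-world distance via camera calibration.

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)

def centroid(box):
    """Centre point of a bounding box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def find_violations(boxes, min_distance=50.0):
    """Return index pairs of detections closer than min_distance (pixels)."""
    centroids = [centroid(b) for b in boxes]
    return [
        (i, j)
        for (i, ci), (j, cj) in combinations(enumerate(centroids), 2)
        if dist(ci, cj) < min_distance
    ]

# Two people 30 px apart and one far away: only the first pair violates.
boxes = [(0, 0, 20, 40), (30, 0, 50, 40), (300, 0, 320, 40)]
print(find_violations(boxes))  # [(0, 1)]
```

In the full system the list of boxes would come from the YOLO detector's per-frame output, and each violating pair would trigger the visual and auditory alerts.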
1.3 OBJECTIVES:
• Develop a custom dataset for training the YOLO object detection model, including annotated
images of individuals in various social distancing scenarios.
• Fine-tune the YOLO model using the custom dataset to enhance detection accuracy and
robustness in real-world environments.
• Implement real-time social distance detection algorithms that accurately measure the distances
between detected individuals and identify violations of predefined social distancing thresholds.
• Integrate visual and auditory alert mechanisms into the system to notify authorities and
individuals of social distancing violations in real time.
• Evaluate the performance of the system using metrics such as detection accuracy, false positive
rate, and real-time processing speed, and validate its effectiveness in diverse environments and
scenarios.
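The metrics named in the last objective can be derived from confusion-matrix counts; a minimal sketch, where the example counts are made up for illustration:

```python
def detection_metrics(tp, fp, fn, tn=0):
    """Precision, recall and false-positive rate from raw detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "fpr": fpr}

# e.g. 90 correct detections, 10 false alarms, 5 missed people, 95 true rejections
print(detection_metrics(tp=90, fp=10, fn=5, tn=95))
```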
CHAPTER 2
LITERATURE REVIEW
The restriction imposed during the pandemic is widely, but imprecisely, referred to as "social distancing" (SD),
since prevention of the virus diffusion does not require us to weaken our social bonds. The likely
reason for the SD naming is that, from a cognitive point of view, the physical and social aspects of distance
are deeply intertwined [47], a phenomenon that popular wisdom captures through a proverb that, in
slightly different versions, appears in different languages and cultures, namely "far from eyes, far
from heart". Not surprisingly, the time spent in physical proximity with others, in opposition to the
time spent in individual activities, is a crucial factor in the "social brain hypothesis", one of the most
successful theories of human evolution [26]. Similarly, Attachment Theory, probably the
developmental model most widely accepted in child psychiatry, revolves around the ability of children
and parents to establish and maintain physical proximity [13]. Finally, the different modulation of
interpersonal distances is known to be one of the main obstacles in intercultural communication
[37]. The above suggests that dealing with interpersonal distances means dealing with evolutionary,
developmental and cultural forces that shape, to a significant extent, our everyday life. As a
consequence, the role of technologies for the analysis of such distances becomes crucial during
pandemics, given that they must mediate between the forces above, responsible for the human
tendency to get close to one another, and the pressure of prophylactic measures, artificially
designed to fight a pathogen inaccessible to our senses and cognition. One possible solution is to go
beyond simply measuring how far we are from one another, as most of the applications on the
market do (see Sec. II-C), and try to make sense of what distances mean. In other words, to
inform technologies with the principles and laws of proxemics, the area of psychology showing how
people convey social meaning through interpersonal distances and, ultimately, how the social and
physical dimensions of space interplay with one another [74]. Proxemics is strictly linked to the
definition of people gatherings, namely groups, and as such it depends on their spatial organization
and the number of people involved. In general, the space surrounding a person is
characterized by interpersonal distance classes [38], namely: intimate, personal, peri-personal or
social, and public spaces (see Fig. 3), all associated with different SDs, which in turn also depend on the
degree of kinship and familiarity between the subjects and on the geometrical configuration and size
of the environment in which an interplay occurs. A blind application of social distancing rules,
encouraging people to stay further than 1-2 meters apart, would eliminate an entire interpersonal distance class and
all of the social interactions which take place within it, including for example those between children
and relatives. As can be noticed, behavior, social interactions, and space arrangements are tightly
coupled and affect each other. This is why it is important to take all these aspects into consideration
when constraints in this respect are to be imposed, particularly when people's health is at stake.
CHAPTER-3
SOFTWARE REQUIREMENT SPECIFICATION:
This chapter gives an overview of the whole Software Requirements Specification (SRS), covering its
purpose, scope, definitions, abbreviations, references and outline. The aim of this document is to gather,
analyze, and provide an in-depth understanding of the complete "Social Distance Detection" system by
defining the problem statement in detail. The detailed requirements of the system's user-related
functions are given in this document.
3.1 PURPOSE
The purpose of the Software Requirements Specification is to give the technical, functional and
non-functional features required to develop the application. In short, the
purpose of this SRS document is to give a detailed overview of our software product, its parameters
and goals. This document describes the project's target audience and its user interface, hardware and
software requirements. It defines how our client, team and audience see the product and its
functionality.
Scope
The scope of this system is to detect social distancing violations in video feeds using deep
learning-based object detection, and to evaluate how effectively such techniques perform in
real-world monitoring scenarios.
PYTHON:
Python is an interpreted, high-level, general-purpose programming language. Created by Guido van
Rossum and first released in 1991, Python's design philosophy emphasizes code readability with
its notable use of significant whitespace. Its language constructs and object-oriented
approach aim to help programmers write clear, logical code for small and
large-scale projects. Python is often described as a "batteries included" language because of its
comprehensive standard library. Python is a multi-paradigm programming language: object-oriented
programming and structured programming are fully supported, and many of its features support
functional programming and aspect-oriented programming (including metaprogramming and metaobjects
(magic methods)). Many other paradigms are supported via extensions, including design by contract
and logic programming.
FLASK:
Flask is a micro web framework written in Python. It is classified as a microframework because it
does not require particular tools or libraries.[3] It has no database abstraction layer, form
validation, or any other components where pre-existing third-party libraries provide common
functions. However, Flask supports extensions that can add application features as if
they were implemented in Flask itself. Extensions exist for object-relational mappers, form
validation, upload handling, various open authentication technologies and several common
framework-related tools. Extensions are updated far more frequently than the core Flask program.
ANACONDA:
Anaconda is a free and open-source distribution of the programming languages Python and R. The
distribution comes with the Python interpreter and various packages related to machine learning and
data science.
Essentially, the idea behind Anaconda is to make it easy for people interested in those fields
to install all (or most) of the packages needed with a single installation. It includes:
Conda, an open-source package and environment management system, which makes it easy to
install/update packages and create/load environments; and
Jupyter Notebook, a shareable notebook that combines live code, visualizations and text.
NUMPY:
NumPy is the fundamental package for scientific computing with Python. Among other things, it
contains:
• useful linear algebra, Fourier transform, and random number capabilities
PANDAS:
pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data
structures and data analysis tools for the Python programming language.
pandas is a NumFOCUS sponsored project. This helps ensure the success of the
development of pandas as a world-class open-source project, and makes it possible to donate to the
project.
YOLO
When it comes to deep learning-based object detection, there are three primary object detectors you'll
encounter:
• R-CNN and its variants, including the original R-CNN, Fast R-CNN, and Faster R-CNN
• Single Shot Detectors (SSDs)
• YOLO
R-CNNs are among the first deep learning-based object detectors and are an example of a two-stage
detector.
1. In the first R-CNN publication, Rich feature hierarchies for accurate object
detection and semantic segmentation (2013), Girshick et al. proposed an object detector that required an
algorithm such as Selective Search (or equivalent) to propose candidate bounding boxes
that could contain objects.
2. These regions were then passed into a CNN for classification, ultimately leading to one of the
first deep learning-based object detectors.
The problem with the standard R-CNN method was that it was painfully slow and not a complete
end-to-end object detector.
Girshick et al. published a second paper in 2015, entitled Fast R-CNN. The Fast R-CNN algorithm
made considerable improvements to the original R-CNN, namely increasing accuracy and reducing
the time it took to perform a forward pass; however, the model still relied on an external
region proposal algorithm.
It wasn't until Girshick et al.'s follow-up 2015 paper, Faster R-CNN: Towards Real-Time Object
Detection with Region Proposal Networks, that R-CNNs became a true end-to-end deep
learning object detector by removing the Selective Search requirement and instead relying on a
Region Proposal Network (RPN) that is (1) fully convolutional and (2) can predict the object
bounding boxes and "objectness" scores (i.e., a score quantifying how likely it is that a region of an image
contains an object). The outputs of the RPNs are then passed into the R-CNN component for final
classification and labeling.
While R-CNNs tend to be very accurate, the biggest problem with the R-CNN family of
networks is their speed: they were incredibly slow, obtaining only 5 FPS on a GPU.
To help increase the speed of deep learning-based object detectors, both Single Shot Detectors (SSDs) and YOLO
use a one-stage detector strategy.
These algorithms treat object detection as a regression problem, taking a given input image and
simultaneously learning bounding box coordinates and the corresponding class label probabilities.
In general, single-stage detectors tend to be less accurate than two-stage detectors
but are significantly faster.
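Both one-stage and two-stage detectors score candidate boxes against one another using Intersection over Union (IoU), for example during non-maximum suppression; a small self-contained sketch:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the overlapping region (empty if the boxes are disjoint).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Two 10x10 boxes overlapping by half: IoU = 50 / 150 = 1/3.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

An IoU near 1 means two predictions cover the same object, which is why detectors keep only the highest-scoring box among heavily overlapping candidates.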
Feasibility Study
The feasibility study helps to find solutions to the problems of the project. It also outlines what
the new system will look like.
Technical Feasibility
The project entitled "Social Distance Detection" is technically feasible because of the below-
mentioned features. The project is developed in Python. The web server used to develop "Social
Distance Detection" is a local server. The local server neatly coordinates the design and
coding parts. It provides a graphical user interface to design the application while the coding is done
in Python. At the same time, it provides a high level of reliability, availability, and compatibility.
Economic Feasibility
In economic feasibility, a cost-benefit analysis is done in which costs and benefits are evaluated.
Economic analysis is used to assess the effectiveness of the proposed system. The
system "Social Distance Detection" is economically feasible because it does not exceed the estimated
cost and the estimated benefits justify it.
Operational Feasibility
The project entitled "Social Distance Detection" is operationally feasible because of the below-
mentioned features. The system detects social distancing violations in the video input, and the
detection results are stored in the database. The performance of the detection techniques is compared
based on their execution time and displayed through a graph.
Behavioral Feasibility
The project entitled "Social Distance Detection" is beneficial because it satisfies the objectives when
developed and installed.
3.4 OVERVIEW
The following section of this document focuses on describing the system in terms of product
functions. The section after that addresses the specific requirements of the system, covering both
functional requirements and non-functional requirements.
Product Functions
Pre-Processing
The extracted datasets and the Google pre-trained datasets are compared using TensorFlow.
General Constraints
The results generated have to be entered into the system, and any error or any value entered outside
the expected range will not be understood by the system. If the database crashes, the whole of the
information collected and the results generated will be of no use.
Specific Requirements
This section provides a detailed description of all inputs into and outputs from the system. It also
gives a description of the hardware, software and communication interfaces and provides basic
prototypes of the user interface.
Hardware Requirements
RAM: 4 GB
Software Requirements
Framework: Flask
CHAPTER 4
SYSTEM DESIGN
The Software Design will be used to aid software development for the application by
providing the details of how the application should be built. Within the Software Design,
specifications are narrative and graphical documentation of the software design for the project,
including use case models, sequence diagrams, and other supporting requirement information.
Scope
The Design Document is for a primary-level system, which will work as a foundation for building a
system that provides a base level of functionality to show feasibility for large-scale production use.
In the Software Design Document, the focus is placed on the generation and modification of the
documents. The system will be used in conjunction with other pre-existing systems and will consist
largely of a document interaction facade that abstracts document interactions and handling of the
document objects. This document provides the design specifications of "Social Distance Detection".
4.1 SYSTEM ARCHITECTURE:
LEVEL 0 DFD:
LEVEL 1 DFD:
Sequence diagrams depict interactions among classes in terms of an exchange of messages over time.
They are also called event diagrams. A sequence diagram is a good way to visualize and validate
various runtime scenarios. These can help predict how a system will behave and to discover the
responsibilities a class may need to have in the process of modeling a new system.
The purpose of a use case diagram is to capture the dynamic aspect of a system. However,
this definition is too generic to describe the purpose, as the other four diagrams
(activity, sequence, collaboration, and statechart) also have the same purpose. We will look at
some specific purposes which distinguish it from the other four diagrams.
Use case diagrams are used to gather the requirements of a system, including internal and
external influences. These requirements are mostly design requirements. Hence, when a
system is analyzed to gather its functionalities, use cases are prepared and actors
are identified.
CHAPTER-5
IMPLEMENTATION
Introduction
The project is implemented using Python, which is both an object-oriented and a procedure-oriented
programming language. Object-oriented programming is an approach that provides a way of
modularizing programs by creating partitioned memory areas for both data and functions that can be
used as a template for creating copies of such modules on demand.
Implementation of software refers to the final installation of the package in its real environment,
to the satisfaction of the intended users and the operation of the system. Users are often not sure
that the software is meant to make their job easier, so:
The active user must be aware of the benefits of using the system
Their confidence in the software must be built up
Proper guidance must be imparted to the user so that he is comfortable in using the application
Before going ahead and viewing the system, the user must know that for viewing the result, the server
program should be running on the server. If the server object is not running on the server, the actual
processes will not take place.
User Training
To achieve the objectives and benefits expected from the proposed system, it is essential for the people
who will be involved to be confident of their role in the new system. As systems become more complex,
the need for education and training becomes more and more important. Education is complementary to
training: it brings formal training to life by explaining its background and providing the right
resources for the users. Education involves creating the right atmosphere and motivating user staff.
Educational information can make training more interesting and more understandable.
After providing the necessary basic training on the computer awareness, the users will have to be
trained on the new application software. This will give the underlying philosophy of the use of the new system
such as the screen flow, screen design, type of help on the screen, type of errors while entering the data, the
corresponding validation check at each entry and the ways to correct the data entered. This training may be
different across different user groups and across different levels of hierarchy.
Operational Documentation
Once the implementation plan is decided, it is essential that the user of the system is made familiar and
comfortable with the environment. Documentation covering the whole operation of the system is
developed. Useful tips and guidance are given inside the application itself. The system is developed to be
user friendly so that the user can operate it from the tips given in the application itself.
System Maintenance
The maintenance phase of the software cycle is the time in which the software performs useful work. After
a system is successfully implemented, it should be maintained in a proper manner. System maintenance is an
important aspect of the software development life cycle. The need for system maintenance is to make the
system adaptable to changes in its environment. There may be social, technical and other environmental
changes which affect a system that is being implemented. Software product enhancements may involve
providing new functional capabilities, improving user displays and modes of interaction, or upgrading the
performance characteristics of the system. Only through proper system maintenance procedures can the
system be adapted to cope with these changes. Software maintenance is, of course, far more than
"finding mistakes".
Corrective Maintenance
The first maintenance activity occurs because it is unreasonable to assume that software testing will
uncover all latent errors in a large software system. During the use of any large program, errors will occur and
be reported to the developer. The process that includes the diagnosis and correction of one or more errors
is called Corrective Maintenance.
Adaptive Maintenance
The second activity that contributes to a definition of maintenance occurs because of the rapid change
that is encountered in every aspect of computing. Adaptive maintenance, the activity that modifies
software to properly interface with a changing environment, is therefore both necessary and
commonplace.
Perfective Maintenance
The third activity that may be applied to a definition of maintenance occurs when a software package is
successful. As the software is used, recommendations for new capabilities, modifications to existing functions,
and general enhancements are received from users. To satisfy requests in this category, perfective maintenance
is performed. This activity accounts for the majority of all effort expended on software maintenance.
Preventive Maintenance
The fourth maintenance activity occurs when software is changed to improve future maintainability or
reliability, or to provide a better basis for future enhancements. Often called preventive maintenance, this
activity is characterized by reverse engineering and re-engineering techniques.
Experts in machine learning and deep learning have not yet reached consensus on these concepts; in
this context, new ideas are being discussed almost every day. Machine learning is an older concept
than deep learning. Deep learning can also be called a technique that performs machine learning.
The differences are listed below:
1) In deep learning, a large amount of data is needed to bring the algorithm structure to the ideal. In
machine learning, the problem can be solved with much less data because a person gives specific
features to the algorithm.
2) Deep learning algorithms try to extract features from the data. In machine learning, the features are
determined by an expert.
3) While deep learning algorithms work on high-performance machines, machine learning
algorithms can work on ordinary CPUs.
4) In machine learning, the problem is usually divided into pieces, these parts are solved one by one,
and the overall solution is then formed by combining the partial solutions. In deep learning, the
problem is solved end-to-end.
Deep Learning overview
The term Deep Learning or Deep Neural Network refers to Artificial Neural Networks (ANN)
with multiple layers. Over the last few decades, it has been considered one of the most
powerful tools, and has become very popular in the literature as it is able to handle a huge
amount of data. The interest in deeper hidden layers has recently begun to surpass the
performance of classical methods in different fields, especially in pattern recognition. One of the
most popular deep neural networks is the Convolutional Neural Network (CNN). It takes its
name from the mathematical linear operation between matrices called convolution. CNNs have
multiple layers, including a convolutional layer, a non-linearity layer, a pooling layer and a
fully-connected layer. The convolutional and fully-connected layers have parameters, but the
pooling and non-linearity layers do not. CNNs have excellent performance
in machine learning problems, especially in applications that deal with image data, such as the
largest image classification data set (ImageNet), computer vision, and natural language
processing (NLP), where the results achieved were very impressive. In this section we explain
and define all the elements and important issues related to CNNs, and how these elements work.
In addition, we also state the parameters that affect CNN efficiency. This section assumes
that the readers have adequate knowledge about both machine learning and artificial neural
networks.
Convolutional Neural Networks have had groundbreaking results over the past decade in a
variety of fields related to pattern recognition, from image processing to voice recognition.
The most beneficial aspect of CNNs is reducing the number of parameters in an ANN. This
achievement has prompted both researchers and developers to approach larger models in order
to solve complex tasks, which was not possible with classic ANNs. The most important
assumption about problems solved by CNNs is that they should not have spatially
dependent features. In other words, for example, in a face detection application, we do not
need to pay attention to where the faces are located in the images; the only concern is to detect
them regardless of their position in the given images. Another important aspect of CNNs is
obtaining abstract features as the input propagates toward the deeper layers.
Deep learning algorithms are trained to learn progressively using data. Large data sets are needed to
make sure that the machine delivers the desired results. Just as the human brain needs a lot of experience
to learn and deduce information, the analogous artificial neural network requires a copious amount of data. The
more powerful the abstraction you want, the more parameters need to be tuned, and more parameters require
more data.
At times, there is a sharp difference between the error observed on the training data set and the error
encountered on a new, unseen data set. This occurs in complex models, such as those having too many
parameters relative to the number of observations. The efficacy of a model is judged by its ability to
perform well on an unseen data set and not by its performance on the training data fed to it.
Hyperparameter Optimization
Hyperparameters are the parameters whose value is defined prior to the commencement of the learning
process. Changing the value of such parameters by a small amount can invoke a large change in the
performance of your model.
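As a toy illustration of that sensitivity, consider gradient descent on a one-dimensional objective; the objective function and the learning rates below are arbitrary assumptions chosen only to show the effect:

```python
def final_loss(lr, steps=50):
    """Run gradient descent on f(w) = (w - 3)^2 from w = 0; return final loss."""
    w = 0.0
    for _ in range(steps):
        w -= lr * 2 * (w - 3)   # gradient of (w - 3)^2 is 2(w - 3)
    return (w - 3) ** 2

# Sweep a few learning rates: too small barely moves, too large diverges.
for lr in (0.001, 0.1, 1.1):
    print(lr, final_loss(lr))
```

With lr = 0.1 the loss drops to nearly zero; with lr = 0.001 it barely improves in 50 steps; with lr = 1.1 the iterates diverge. The same qualitative behaviour is why learning-rate tuning matters for deep networks.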
Training a data set for a Deep Learning solution requires a lot of data. To perform a task to solve real
world problems, the machine needs to be equipped with adequate processing power. To ensure better
efficiency and less time consumption, data scientists switch to multi-core high performing GPUs and
similar processing units. These processing units are costly and consume a lot of power.
Deep Learning models, once trained, can deliver tremendously efficient and accurate solution to a
specific problem. However, in the current landscape, the neural network architectures are highly
specialized to specific domains of application
Introduction
This section will investigate the distinctive viewpoints concerned with the
implementation of the developed system. This task was concerned with the
development and usage of the video object detection component.
Methodology
CNN Model
Module 1: Region Proposal. Generate and extract category-independent region proposals, e.g. candidate bounding boxes.
Module 2: Feature Extractor. Extract features from each candidate region, e.g. using a deep convolutional neural network.
Module 3: Classifier. Classify the features as one of the known classes, e.g. using a linear SVM classifier model.
Convolutional Neural Networks have a different architecture than regular Neural Networks. Regular
Neural Networks transform an input by putting it through a series of hidden layers. Every layer is made
up of a set of neurons, where each layer is fully connected to all neurons in the layer before. Finally, there is a last fully-connected layer — the output layer — that represents the predictions.
Convolutional Neural Networks are a bit different. First of all, the layers are organised in 3 dimensions:
width, height and depth. Further, the neurons in one layer do not connect to all the neurons in the next
layer but only to a small region of it. Lastly, the final output will be reduced to a single vector of
probability scores, organized along the depth dimension.
Feature Extraction: Convolution
Convolution in a CNN is performed on an input image using a filter or kernel. To understand filtering and convolution, imagine scanning a screen starting from the top left to the right, moving down a bit after covering the width of the screen, and repeating the same process until the whole screen has been scanned; the filter slides over the input image in the same way.
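The scanning process described above can be sketched as a small valid-mode 2D convolution (strictly, a cross-correlation, which is what most CNN libraries compute); the image and kernel values here are illustrative only.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a kernel over an image (stride 1, no padding) and
    return the resulting feature map."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1   # output height
    ow = image.shape[1] - kw + 1   # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # element-wise multiply the covered patch, then sum
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])  # toy diagonal-difference filter
fmap = conv2d(image, kernel)
print(fmap.shape)  # (3, 3): a 2x2 filter over a 4x4 image gives a 3x3 map
```

Note how the output shrinks by (filter size - 1) in each dimension when no padding is used.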
Once the feature maps are obtained from a convolution layer, it is common to add a pooling or sub-sampling layer. Similar to the convolutional layer, the pooling layer is responsible for reducing the spatial size of the convolved feature. This decreases the computational power required to process the data through dimensionality reduction. Furthermore, it is useful for extracting dominant features which are rotationally and positionally invariant, thus helping the model train effectively. Pooling shortens the training time and controls over-fitting.
Adding a fully-connected layer is a (usually) cheap way of learning non-linear combinations of the high-level features represented by the output of the convolutional layers. The fully-connected layer learns a possibly non-linear function in that space.
Now that we have converted our input image into a suitable form, we flatten it into a column vector. The flattened output is fed to a feed-forward neural network, and backpropagation is applied on every iteration of training. Over a series of epochs, the model is able to distinguish between dominating and certain low-level features in images and classify them using the softmax classification technique.
So now we have all the pieces required to build a CNN: convolution, ReLU and pooling. The output of max pooling is fed into the classifier we discussed initially, which is usually a multi-layer perceptron. Usually in CNNs these layers are used more than once, i.e. Convolution -> ReLU -> Max-Pool -> Convolution -> ReLU -> Max-Pool and so on. We won't discuss the fully connected layer further right now.
In this step we initialize the parameters of the convolutional neural network. You will be using 10 filters
of dimension 9x9, and a non-overlapping, contiguous 2x2 pooling region.
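As a quick sanity check of the sizes just stated, the shapes can be sketched as follows (the 28x28 input size is an assumption for illustration; only the 10 filters of 9x9 and the 2x2 pooling region come from the text):

```python
import numpy as np

num_filters, filter_dim, pool_dim = 10, 9, 2  # values stated above
image_dim = 28                                # assumed input size

# randomly initialize 10 filters of dimension 9x9 (small random weights)
W = 0.01 * np.random.randn(num_filters, filter_dim, filter_dim)
b = np.zeros(num_filters)

conv_dim = image_dim - filter_dim + 1         # 28 - 9 + 1 = 20
pooled_dim = conv_dim // pool_dim             # non-overlapping 2x2 -> 10
print(conv_dim, pooled_dim)  # 20 10
```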
Implement the CNN cost and gradient computation in this step. Your network will have two layers: the first is a convolutional layer followed by mean pooling, and the second is a densely connected layer feeding into softmax regression.
Learn Parameters
Using a batch method such as L-BFGS to train a convolutional network of this size, even on a relatively small object dataset, can be computationally slow. A single iteration of calculating the cost and gradient for the full training set can take several minutes or more. Thus you will use stochastic gradient descent (SGD) to learn the parameters of the network.
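The difference from a full-batch method is that each SGD update uses only a small minibatch. A minimal sketch (plain linear regression stands in for the CNN here; the data, learning rate and batch size are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))      # toy inputs
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true                      # toy noise-free targets

w = np.zeros(3)
lr, batch_size = 0.1, 32
for epoch in range(20):
    idx = rng.permutation(len(X))   # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        # gradient of mean squared error on this minibatch only
        grad = 2 * X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)
        w -= lr * grad              # SGD update
print(np.round(w, 2))  # close to [ 2. -1.  0.5]
```

Each cheap minibatch step replaces one expensive full-training-set gradient computation, which is why SGD scales to large networks and datasets.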
Test
With the convolutional network and SGD optimizer in hand, you are now ready to test the performance of the model. Code is provided at the end of cnnTrain.py to test the accuracy of your network's predictions on the object test set.
In neural networks, the convolutional neural network (ConvNet or CNN) is one of the main categories used for image recognition and image classification. Object detection, face recognition, etc. are some of the areas where CNNs are widely used. CNN image classification takes an input image, processes it and classifies it under certain categories (e.g., dog, cat, tiger, lion). A computer sees an input image as an array of pixels, whose size depends on the image resolution: h x w x d (h = height, w = width, d = depth). E.g., a 6 x 6 x 3 array of a matrix represents an RGB image (3 refers to the RGB channels), while a 4 x 4 x 1 array represents a grayscale image.
Technically, to train and test deep learning CNN models, each input image is passed through a series of convolution layers with filters (kernels), pooling layers and fully connected (FC) layers, and then a softmax function is applied to classify the object with probabilistic values between 0 and 1. The below figure is a complete flow of a CNN processing an input image and classifying the objects based on these values.
Convolution Layer
Convolution is the first layer used to extract features from an input image. Convolution preserves the relationship between pixels by learning image features using small squares of input data. It is a mathematical operation that takes two inputs: an image matrix and a filter or kernel.
Strides
Stride is the number of pixels by which the filter shifts over the input matrix. When the stride is 1, we move the filter 1 pixel at a time; when the stride is 2, we move the filter 2 pixels at a time, and so on. The below figure shows how convolution works with a stride of 2.
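The output size for a given stride follows the standard formula floor((n - f + 2p) / s) + 1 for an n x n input, f x f filter, padding p and stride s; a quick sketch (the 7x7/3x3 sizes are illustrative):

```python
def conv_output_size(n, f, p=0, s=1):
    """Spatial size of the convolution output for an n x n input,
    f x f filter, padding p and stride s."""
    return (n - f + 2 * p) // s + 1

print(conv_output_size(7, 3, p=0, s=1))  # 5: the filter moves 1 pixel at a time
print(conv_output_size(7, 3, p=0, s=2))  # 3: the filter moves 2 pixels at a time
```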
Padding
Sometimes the filter does not fit the input image perfectly. We have two options:
Pad the image with zeros (zero-padding) so that the filter fits.
Drop the part of the image where the filter did not fit. This is called valid padding, which keeps only the valid part of the image.
ReLU stands for Rectified Linear Unit, a non-linear operation whose output is ƒ(x) = max(0, x). Why is ReLU important? ReLU's purpose is to introduce non-linearity into our ConvNet, since the real-world data we want our ConvNet to learn is non-linear. There are other non-linear functions such as tanh or sigmoid that can also be used instead of ReLU, but most data scientists use ReLU since, performance-wise, it is better than the other two.
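Applied to a feature map, ƒ(x) = max(0, x) is simply:

```python
import numpy as np

def relu(feature_map):
    # replace every negative activation with zero
    return np.maximum(0, feature_map)

fmap = np.array([[-3.0, 2.0],
                 [0.5, -1.0]])
print(relu(fmap))  # [[0.  2. ]
                   #  [0.5 0. ]]
```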
Pooling Layer
The pooling layer section reduces the number of parameters when the images are too large. Spatial pooling, also called subsampling or downsampling, reduces the dimensionality of each map but retains the important information. Spatial pooling can be of different types:
Max Pooling
Average Pooling
Sum Pooling
Max pooling takes the largest element from the rectified feature map. Average pooling instead takes the average of the elements, and taking the sum of all elements in the feature map is called sum pooling.
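The three pooling variants over a non-overlapping 2x2 window can be sketched on a toy 4x4 feature map:

```python
import numpy as np

def pool2d(fmap, size=2, op=np.max):
    """Non-overlapping pooling: apply `op` to each size x size window."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    windows = fmap[:h * size, :w * size].reshape(h, size, w, size)
    return op(windows, axis=(1, 3))

fmap = np.array([[1., 2., 5., 6.],
                 [3., 4., 7., 8.],
                 [9., 10., 13., 14.],
                 [11., 12., 15., 16.]])
print(pool2d(fmap, op=np.max))   # max pooling:     [[ 4.  8.] [12. 16.]]
print(pool2d(fmap, op=np.mean))  # average pooling: [[ 2.5  6.5] [10.5 14.5]]
print(pool2d(fmap, op=np.sum))   # sum pooling:     [[10. 26.] [42. 58.]]
```

In each case the 4x4 map is reduced to 2x2, one value per window.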
In the layer we call the FC layer, we flatten our matrix into a vector and feed it into a fully connected layer like a neural network.
Choose parameters and apply filters with strides, and padding if required. Perform convolution on the image and apply ReLU activation to the resulting matrix.
Flatten the output and feed it into a fully connected (FC) layer.
Output the class using an activation function (logistic regression with cost functions) to classify the images.
Popular CNN architectures include AlexNet, VGGNet, GoogLeNet, and ResNet.
Image Classification
The convolutional neural network (CNN) is a class of deep learning neural networks. CNNs represent a huge breakthrough in image recognition. They're most commonly used to analyze visual imagery and are frequently working behind the scenes in image classification. They can be found at the core of everything from Facebook's photo tagging to self-driving cars.
Image classification is the process of taking an input (like a picture) and outputting a class (like "cat") or a probability that the input is a particular class ("there's a 90% probability that this input is a cat"). You can look at a picture and know that you're looking at a terrible shot of your own face, but how can a computer learn to do that? With a CNN, which has:
Convolutional layers
ReLU layers
Pooling layers
a Fully connected layer
A CNN convolves (not convolutes…) learned features with input data and uses 2D convolutional layers.
This means that this type of network is ideal for processing 2D images. Compared to other image
classification algorithms, CNNs actually use very little preprocessing. This means that they
can learn the filters that have to be hand-made in other algorithms. CNNs can be used in tons of
applications from image and video recognition, image classification, and recommender systems to
natural language processing and medical image analysis.
CNNs are inspired by biological processes. They're based on some cool research done by Hubel and Wiesel in the 60s regarding vision in cats and monkeys. The pattern of connectivity in a CNN comes from their research regarding the organization of the visual cortex. In a mammal's eye, individual neurons respond to visual stimuli only in the receptive field, which is a restricted region. The receptive fields of different regions partially overlap so that the entire field of vision is covered. This is the way a CNN is designed to work as well.
CNNs have an input layer, an output layer, and hidden layers. The hidden layers usually consist of convolutional layers, ReLU layers, pooling layers, and fully connected layers.
Convolutional layers apply a convolution operation to the input. This passes the information on to
the next layer.
Pooling combines the outputs of clusters of neurons into a single neuron in the next layer.
Fully connected layers connect every neuron in one layer to every neuron in the next layer.
In a convolutional layer, neurons only receive input from a subarea of the previous layer. In a fully
connected layer, each neuron receives input from every element of the previous layer.
A CNN works by extracting features from images. This eliminates the need for manual feature extraction. The features are not pre-trained; they're learned while the network trains on a set of images. This makes deep learning models extremely accurate for computer vision tasks. CNNs learn feature detection through tens or hundreds of hidden layers, and each layer increases the complexity of the learned features.
A CNN processes the features through the network, and the final fully connected layer provides the "voting" on the classes. The network trains through forward propagation and backpropagation for many, many epochs. This repeats until we have a well-defined neural network with trained weights and feature detectors.
For a black and white image, the pixels are interpreted as a 2D array (for example, 2x2 pixels). Every pixel has a value between 0 and 255 (zero is completely black and 255 is completely white; the greyscale exists between those numbers). Based on that information, the computer can begin to work on the image data.
For a color image, this is a 3D array with a blue layer, a green layer, and a red layer. Each one of those colors has its own value between 0 and 255. The color of a pixel can be found by combining the values in each of the three layers.
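A quick sketch of both representations (the 2x2 size and sample values are illustrative):

```python
import numpy as np

# black and white: a 2D array of values between 0 (black) and 255 (white)
gray = np.array([[0, 255],
                 [128, 64]], dtype=np.uint8)
print(gray.shape)   # (2, 2)

# color: a 3D array with one layer per channel, each value between 0 and 255
color = np.zeros((2, 2, 3), dtype=np.uint8)
color[0, 0] = [255, 0, 0]  # combining the per-channel values gives the color
print(color.shape)  # (2, 2, 3)
```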
ReLU layer
The ReLU (rectified linear unit) layer is another step after our convolution layer. You're applying an activation function onto your feature maps to increase non-linearity in the network, because images themselves are highly non-linear. It removes negative values from an activation map by setting them to zero.
Convolution is a linear operation involving element-wise matrix multiplication and addition, while the real-world data we want our CNN to learn will be non-linear. We can account for that with an operation like ReLU. You can use other operations like tanh or sigmoid, but ReLU is a popular choice because it can train the network faster without any major penalty to generalization accuracy.
YOLO ("You Only Look Once") is one of the most popular algorithms because it achieves high accuracy while being able to run in real time. The algorithm "only looks once" at the image, i.e. it requires only one forward propagation pass through the network to make its predictions. After non-max suppression, it outputs the names of the recognized objects along with the bounding boxes around them. The diagrams explaining YOLO are from Andrew Ng's video explanation of the same.
Anchor box
When using plain bounding boxes for object detection, only one object can be identified per grid cell. So, for detecting more than one object, we use anchor boxes.
Consider the above picture, in which both the human's and the car's midpoints fall into the same grid cell. For this case, we use the anchor box method. The purple grid cells denote the two anchor boxes for those objects. Any number of anchor boxes can be used for a single image to detect multiple objects; in our case, we have taken two anchor boxes.
The above figure shows the anchor boxes of the image we considered: the vertical anchor box is for the human and the horizontal one is for the car.
Model Details:
The model details are as follows:
• The output is a list of bounding boxes with the recognized classes. Each bounding box is denoted by 6 numbers (p_c, b_x, b_y, b_h, b_w, c). If you expand c, i.e. the classes, into an 80-dimensional vector, each bounding box is then represented by 85 numbers.
If the center or the midpoint of an object falls into a grid cell, then that grid cell is responsible for
detecting that object.
Since the model uses 5 anchor boxes, each of the 19 x 19 cells encodes information about 5 boxes. Anchor boxes are defined by their width and height. For simplicity, the last two dimensions of the (19, 19, 5, 85) encoding are flattened, so the output of the deep CNN is of the form (19, 19, 425). Fig 3 shows the flattening.
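The flattening of the last two dimensions can be sketched with a dummy tensor (random values stand in for the real network output):

```python
import numpy as np

grid, anchors, box_len = 19, 5, 85  # 85 = (p_c, b_x, b_y, b_h, b_w) + 80 classes
encoding = np.random.rand(grid, grid, anchors, box_len)

# flatten the last two dimensions: (19, 19, 5, 85) -> (19, 19, 425)
flat = encoding.reshape(grid, grid, anchors * box_len)
print(flat.shape)             # (19, 19, 425)
print(grid * grid * anchors)  # 1805 boxes predicted in total
```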
Now, for each box of each grid cell, we compute an elementwise product to obtain the probability that the box contains a particular class.
If we plot only the boxes to which the algorithm assigned a high probability, there are still too many boxes, and hence filtering these boxes is very important for accuracy.
Each cell has 5 anchor boxes, so in total the model predicts 19 x 19 x 5 = 1805 boxes. In the figure, different colors denote different classes. We therefore filter the algorithm's output down to a much smaller number of detected objects. To do this we carry out two important steps:
• Get rid of boxes with a low score, i.e. boxes that are not very confident about detecting a class.
• When several overlapping boxes detect the same object, select only one of them.
After the filtering based on the class scores, the second filter applied to the remaining boxes is Non-Maximum Suppression (NMS). It uses the concept of Intersection over Union (IoU): the ratio of the intersection of two boxes to the union of the boxes. This is shown in Fig 7.
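IoU can be sketched as follows for axis-aligned boxes given as (x1, y1, x2, y2) corners:

```python
def iou(box1, box2):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    # corners of the intersection rectangle
    xi1, yi1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    xi2, yi2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    # clamp at zero so disjoint boxes give zero intersection area
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - inter
    return inter / union

print(iou((0, 0, 4, 4), (2, 2, 6, 6)))  # 4 / 28, roughly 0.143
print(iou((0, 0, 2, 2), (3, 3, 5, 5)))  # 0.0: no overlap
```

NMS keeps the highest-scoring box and discards any remaining box whose IoU with it exceeds a chosen threshold.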
These are the fundamental concepts of how YOLO object detection is able to detect an object. The YOLO detector predicts the class of an object, its bounding box, and the probability of the class of the object in the bounding box. Each bounding box has the following parameters:
The center position of the bounding box in the image (bx, by)
The width of the box (bw)
The height of the box (bh)
The class of the object (c)
Codes:
# detection.py -- helper that runs YOLO on a frame and returns person detections
# import the necessary packages
from .social_distancing_config import NMS_THRESH
from .social_distancing_config import MIN_CONF
import numpy as np
import cv2

def detect_people(frame, net, ln, personIdx=0):
    # grab the dimensions of the frame and initialize the list of results
    (H, W) = frame.shape[:2]
    results = []
    # construct a blob from the input frame and then perform a forward
    # pass of the YOLO object detector
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
        swapRB=True, crop=False)
    net.setInput(blob)
    layerOutputs = net.forward(ln)
    # initialize our lists of detected bounding boxes, centroids and confidences
    boxes = []
    centroids = []
    confidences = []
    # loop over each of the layer outputs and detections
    for output in layerOutputs:
        for detection in output:
            # extract the class ID and confidence of the current detection
            scores = detection[5:]
            classID = np.argmax(scores)
            confidence = scores[classID]
            # keep only confident detections of the "person" class
            if classID == personIdx and confidence > MIN_CONF:
                # scale the bounding box back to the frame size
                box = detection[0:4] * np.array([W, H, W, H])
                (centerX, centerY, width, height) = box.astype("int")
                x = int(centerX - (width / 2))
                y = int(centerY - (height / 2))
                boxes.append([x, y, int(width), int(height)])
                centroids.append((centerX, centerY))
                confidences.append(float(confidence))
    # apply non-maxima suppression to suppress weak, overlapping boxes
    idxs = cv2.dnn.NMSBoxes(boxes, confidences, MIN_CONF, NMS_THRESH)
    if len(idxs) > 0:
        for i in np.array(idxs).flatten():
            (x, y, w, h) = boxes[i]
            results.append((confidences[i], (x, y, x + w, y + h), centroids[i]))
    return results

# social_distance_detector.py -- main driver script
from pyimagesearch import social_distancing_config as config
from pyimagesearch.detection import detect_people
from scipy.spatial import distance as dist
import numpy as np
import argparse
import imutils
import cv2
import os

ap = argparse.ArgumentParser()
ap.add_argument("-i", "--input", type=str, default="")
ap.add_argument("-o", "--output", type=str, default="")
ap.add_argument("-d", "--display", type=int, default=1)
args = vars(ap.parse_args())

# load the COCO class labels our YOLO model was trained on
labelsPath = os.path.sep.join([config.MODEL_PATH, "coco.names"])
LABELS = open(labelsPath).read().strip().split("\n")

# load our YOLO object detector trained on COCO dataset (80 classes)
print("[INFO] loading YOLO from disk...")
weightsPath = os.path.sep.join([config.MODEL_PATH, "yolov3.weights"])
configPath = os.path.sep.join([config.MODEL_PATH, "yolov3.cfg"])
net = cv2.dnn.readNetFromDarknet(configPath, weightsPath)

# determine only the *output* layer names that we need from YOLO
ln = net.getLayerNames()
ln = [ln[i - 1] for i in np.array(net.getUnconnectedOutLayers()).flatten()]

vs = cv2.VideoCapture(args["input"] if args["input"] else 0)
writer = None

while True:
    (grabbed, frame) = vs.read()
    # if the frame was not grabbed, then we have reached the end
    # of the stream
    if not grabbed:
        break
    # resize the frame and then detect people (and only people) in it
    frame = imutils.resize(frame, width=700)
    results = detect_people(frame, net, ln,
        personIdx=LABELS.index("person"))
    # initialize the set of indexes that violate the minimum social distance
    violate = set()
    if len(results) >= 2:
        # compute the pairwise Euclidean distances between all centroids
        centroids = np.array([r[2] for r in results])
        D = dist.cdist(centroids, centroids, metric="euclidean")
        for i in range(0, D.shape[0]):
            for j in range(i + 1, D.shape[1]):
                if D[i, j] < config.MIN_DISTANCE:
                    violate.add(i)
                    violate.add(j)
    # draw (1) a bounding box around each person and (2) mark violations in red
    for (i, (prob, bbox, centroid)) in enumerate(results):
        (startX, startY, endX, endY) = bbox
        color = (0, 255, 0)
        if i in violate:
            color = (0, 0, 255)
        cv2.rectangle(frame, (startX, startY), (endX, endY), color, 2)
    # check to see if the frame should be displayed to our screen
    if args["display"] > 0:
        cv2.imshow("Frame", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    # if an output video file path has been supplied and the video
    # writer has not been initialized, do so now
    if args["output"] != "" and writer is None:
        # initialize our video writer
        fourcc = cv2.VideoWriter_fourcc(*"MJPG")
        writer = cv2.VideoWriter(args["output"], fourcc, 25,
            (frame.shape[1], frame.shape[0]), True)
    # if the video writer is not None, write the frame to the output
    # video file
    if writer is not None:
        writer.write(frame)
CHAPTER-6
TESTING
INTRODUCTION
Testing is the process of running a system with the intention of finding errors. Testing enhances the integrity of the system by detecting deviations in design and errors in the system. Testing aims at detecting error-prone areas, which helps in the prevention of errors in the system. Testing also adds value to the product by confirming the user's requirements.
The primary intention is to detect errors and error-prone areas in a system. Testing must be thorough and well planned; a partially tested system is as bad as an untested system, and the cost of an untested or under-tested system is high. The implementation is the final and significant phase. It involves user training and system testing in order to ensure the successful running of the proposed system. The user tests the system, and changes are made according to their requirements. The testing involves exercising the developed system with various kinds of data; while testing, errors are noted and correctness is the goal.
System testing is a stage of implementation aimed at ensuring that the system works accurately and efficiently according to the user's needs before live operation commences. As stated before, testing is vital to the success of a system. System testing makes the logical assumption that if all parts of the system are correct, the goal will be successfully achieved. A series of tests are performed before the system is ready for the user acceptance test.
System testing is a stage of implementation. It checks whether the system works accurately and efficiently before live operation commences. Testing is vital to the success of the system. The candidate system is subject to a variety of tests: online response, volume, stress, recovery, security, and usability tests. This series of tests is performed before the proposed system is ready for user acceptance testing.
The tests were conducted during the code generation phase itself, and all errors were rectified at the moment of their discovery.
Black Box Testing
Black box testing is focused on the functional requirements of the software. It is not an alternative to white box testing; rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. It attempts to find errors in categories such as:
• Interface errors
Unit Testing
Unit testing chiefly centers on the smallest unit of software design; this is known as module testing. The modules are tested separately, and the test is done during the programming stage itself. In this step, every module is found to be working satisfactorily as regards the expected output from the module.
Integration Testing
Integration testing is a systematic methodology for constructing the program structure while, at the same time, conducting tests to uncover errors associated with the interfaces. The objective is to take unit-tested modules and build a program structure. All the modules are combined and tested as a whole.
Output Testing
After performing validation testing, the next step is output testing of the proposed system, since no system could be useful if it does not produce the required output in the specified format. The output format on the screen is found to be correct, as the format was designed at system design time according to the user's needs. For the hard copy also, the output comes as per the requirements specified by the user. Hence, output testing did not result in any correction to the system.
User acceptance of a system is the key factor for the success of any system. The system under consideration is tested for user acceptance by constantly keeping in touch with the prospective system users at the time of development, and making changes whenever required.
VALIDATION
Toward the completion of integration testing, the software is completely assembled as a package, interfacing errors have been uncovered and corrected, and a final series of software tests, validation testing, begins. Validation testing can be defined in many ways, but a simple definition is that validation succeeds when the software functions in a manner that is expected by the user. The following validation test has been conducted:
• The proposed system has been tested using a validation test and found to be working satisfactorily.
Test Cases
• Algorithm available
• Data
• Object detection
CHAPTER-7
CONCLUSION:
This work proposed an AI and monocular-camera based real-time system to monitor social distancing. It presents an efficient real-time deep learning based framework to automate the process of monitoring social distancing via object detection and tracking approaches, where each individual is identified in real time with the help of bounding boxes. The generated bounding boxes aid in identifying the clusters or groups of people satisfying the closeness property, computed with the help of a pairwise vectorized approach. The number of violations is confirmed by computing the number of groups formed and a violation index term, computed as the ratio of the number of people to the number of groups. Extensive trials were conducted with popular state-of-the-art object detection models: Faster RCNN, SSD, and YOLO v3, where YOLO v3 showed the most efficient performance. Since this approach is highly sensitive to the spatial location of the camera, the same approach can be fine-tuned to better adjust to the corresponding field of view.
CHAPTER-8
FUTURE ENHANCEMENT:
In this project we applied YOLO object detection techniques for identifying people, where the generated bounding boxes aid in identifying the clusters or groups of people satisfying the closeness property, computed with the help of a pairwise vectorised approach. In the future we can extend the system to mask detection in public areas, and we can apply different algorithms for better accuracy with lower time complexity. As further future scope, we can make use of the TensorFlow library for object detection.
CHAPTER-9
REFERENCES
[1] S. A. Abbas and A. Zisserman. A geometric approach to obtain a bird's eye view from an image. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 4095–4104, 2019.
[3] C. Bartneck, D. Kulic, E. Croft, and S. Zoghbi. Measurement instruments for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots. International Journal of Social Robotics, 1(1):71–81, 2009.
[4] J.-C. Bazin and M. Pollefeys. 3-line RANSAC for orthogonal vanishing point detection. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4282–4287. IEEE, 2012.
[5] J.-C. Bazin, Y. Seo, C. Demonceaux, P. Vasseur, K. Ikeuchi, I. Kweon, and M. Pollefeys. Globally optimal line clustering and vanishing point estimation in Manhattan world. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 638–645. IEEE, 2012.
[8] C. BenAbdelkader and Y. Yacoob. Statistical body height estimation from a single image. In 2008 8th IEEE International Conference on Automatic Face & Gesture Recognition, pages 1–7. IEEE, 2008.
[9] R. Benenson, M. Omran, J. Hosang, and B. Schiele. Ten years of pedestrian detection, what have we learned? In European Conference on Computer Vision, pages 613–627. Springer, 2014.