Open Lab Report - Group 5
PROJECT REPORT:
Human PokéDex
Submitted by
Faculty in charge
1 - Introduction 4
1.1 – Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 – Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 – What’s New? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 – Project Subdivisions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 – Tech Stack, Services, Languages and Frameworks . . . . . . . . . . . . . . . . . . . . 6
1.6 – Project Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3 – Design and Development 12
3.1 – Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 – Embeddings, Norms and some Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 – Theoretical Foundation for CNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.4 – Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.4.1 – How Triplet Loss Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.4.2 – Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4.3 – Hyperparameter Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.5 – Types of Classification Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.6 – Model Architectures (Face Recognition) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.7 – Video Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.7.1 – Data Preparation and Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . 23
3.7.2 – Choosing the right model for classification . . . . . . . . . . . . . . . . . . . 25
3.8 – Residual Network (ResNet) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.8.1 – Residual Block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.9 – Proposed Video Classification Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.10 – Training the ResNet on the UCF-Crimes Dataset . . . . . . . . . . . . . . . . . . . . . . 27
3.10.1 – Model Architectures (Crime Detection) . . . . . . . . . . . . . . . . . . . . . 29
Chapter 1
Introduction
1.1 Abstract
Automation and autonomous systems are among the key powerhouses of innovation that
drive entire domains forward in leaps and bounds. Many technological advances can be
attributed to tasks made easier and more tractable by automation, and artificial intelligence
makes these automated systems smart enough to perform their tasks with due diligence and
the power of decision-making, thereby greatly reducing human intervention in repetitive
processes.
Our project follows the aforementioned ideals: building a product to minimize manual labour
(both physical and mental) for tasks that can be seamlessly automated and processed, while
solving the main problem statement at hand.
The main aim of the project is to integrate the existing campus management system with the
physical on-campus CCTV network, and to enable the combined system to autonomously
provide surveillance in the truest sense of the word by employing deep learning techniques
for smarter college management and security.
Video classification is a computer vision problem introduced with the purpose of automating
classification tasks on real-time, live video. Because the problem is relatively recent, many
gaps remain to be explored. Nevertheless, its applications are becoming widely varied,
ranging from detecting the type of sport or daily activity happening in a scene to health and
security problems, to name a few.
It is interesting to note that although surveillance cameras play an important role in services
that ensure the safety of citizens, they are plain video-providing devices with no smart
decision-making mechanisms of their own. With the growth of image and video data
collected from surveillance cameras, automated video analysis has become necessary in order
to detect abnormal events automatically.
With our work, CCTV footage will find more meaning than just a video stream; cameras can
autonomously trigger actions to help curb crime in real time, and the integrated system will
bring greater convenience to everyone on campus.
1.2 Objectives
Our project aims at promoting safety on campus by automating the task of monitoring and
reporting crimes by assigning the responsibility of detecting criminal or abnormal activity to
a system which is well-versed in deducing patterns that distinguish criminal activity from
normal activity.
In addition to detecting abnormality from footage, the vast CCTV network intertwined
with the campus management system can be used for further implementations:
• Detecting crimes in a footage fed from a camera and recognizing people involved in
the crime
• Enabling a student tracker system
(Since CCTVs can now recognize faces, the vast CCTV network can maintain
timestamps on a student’s whereabouts at any given instant of time)
• A one-stop app which
▪ leverages the same face recognition model from CCTVs to recognize
criminals from a mobile phone
▪ provides data from student tracker log in order to find the whereabouts of
students/professors
▪ retrieves relevant data on students
(recognized by face or from a dropdown list)
▪ receives alert notifications from the nearest CCTV camera witnessing a crime
▪ stays in sync with the database linked with the CCTV network for better
management of complaints
In order to fulfil these objectives, the overarching technical goal was to generate video
classification inferences from a standard 2D CNN, along with the secondary goal of
recognizing faces using simple vector-based classification algorithms.
1.4 Project Subdivisions
The project has been divided into separate sub-problems, each dealing with a mutually
exclusive aspect of the overall project.
The subdivisions are as follows:
• Develop a robust face recognition model to accurately classify faces from an
incoming video stream. Compare and contrast different face recognition models based
on accuracy and frequency of false positives.
• Collect face datasets from the Android app and create a system that
o asynchronously performs dataset augmentation, face re-alignment and
further pre-processing on datasets
o trains the concerned face recognition models over the newly generated
datasets over the Cloud
o pushes the updated model to the Android application as a package update
• Develop and train a CNN (either pre-trained over an activity classification dataset or
related datasets) over the UCF-Crimes dataset for performing video classification
(after cleaning and annotating datasets according to their corresponding labels).
• Maintain the backend’s structural integrity for asynchronous calls to database, Cloud
Storage and Cloud ML Vision.
1.6 Project Outcomes
The outcomes of the project are listed below as follows:
• Continuous logging of students' whereabouts to a database
• Easier student indexing and campus management
• Efficient anomaly/unrest detection and mitigation
• Access (with clearance level-based abstraction) to relevant on-campus data
• Automation of complaint and grievance management
• An Android application to alert users of unrest in their proximity, access data on
colleagues or students, lodge/resolve complaints, or automatically register a crime when
the app's face recognition feature is used at a crime scene.
CHAPTER 2
Description of Tech Stack
This chapter presents the conceptual fundamentals of image and video classification with
deep learning, and describes the software, frameworks, tools, languages and the mathematics
behind the concepts implemented.
In the following subtopics, we will cover in detail the following:
• Software Used:
▪ Jupyter Notebooks and Command Line
▪ Visual Studio Code, Extensions and Remote Scripting
▪ WayScript with containerized applications
▪ Flask and Docker
• Services Used:
▪ Google Cloud Platform
▪ Firebase Services (Cloud Firestore, Cloud Storage, Crashlytics)
• Languages:
▪ Python
▪ Java
▪ XML
• Frameworks:
▪ TensorFlow
▪ Gradle and Android SDK (for v27: Oreo)
• Concepts and Mathematical Basis:
▪ Basic Linear Algebra and Classification Algorithms
▪ Neural Networks and CNNs
▪ Feature Extraction and Classification
2.1 Software Used
2.1.1 Jupyter Notebooks and Command Line
Jupyter Notebook is a web application developed by Project Jupyter, a non-profit
open-source community.
This application allows developers to create and share documents that contain live
code, equations, visualizations and narrative text. It is used to execute Python code on
a need-to-run basis, and is greatly effective in testing and debugging sections of code
before pushing the codebase to production.
Uses include: Data cleaning and transformation, numerical simulation, statistical
modelling, data visualization, machine learning etc.
Our project used Jupyter Notebooks (and a widely used variant of Jupyter Notebooks:
Google Colaboratory) to execute pattern recognition algorithms on a collaborative
basis. Google Colaboratory was used to initiate Python runtimes on virtual machines
hosted on the Cloud, so as to leverage their compute power and resources to execute
processes such as training and dataset augmentation faster.
Fig. 2.2: Visual Studio Code
2.2 Services Used
2.2.1 Google Cloud Platform
Google Cloud Platform is a suite of cloud computing services that run on the same
infrastructure used by Google for their end-to-end applications. Developers can access
IaaS (infrastructure as a service) using command-line tools, SDKs from GCP or from
the virtual terminal in the Google Cloud console.
We used Google Cloud to run a virtual machine/instance with substantial GPU and
CPU power in order to train our video classification model faster for crime detection.
2.2.2 Firebase
Firebase is a platform built for scaling mobile and web applications, and providing
backend services so that developers can focus on deployment with minimal
maintenance of databases, storage and API callbacks.
We used:
▪ Cloud Firestore as an integrated backend for CCTV network as well as the
Android application
▪ Firebase Storage for storing new datasets (Bitmaps), models, unrecognized
images and files required for execution of Python scripts on Docker
▪ Crashlytics to keep track of faults/crashes reported by the app on any end
user’s device
▪ Firebase ML Kit to leverage advanced detection techniques for obtaining
bounding boxes that distinguish faces from the rest of the image.
2.3 Languages
As the project has a multi-faceted approach with different components and scripts, we’ve
used a couple of high-level languages along with markup languages for frontend purposes.
2.3.1 Python
Our project uses Python for the following purposes:
▪ Implementing face detection and recognition
▪ Training models for image and video classification
▪ Dataset augmentation
▪ Running the Flask server
Since CPython's Global Interpreter Lock (GIL) prevents CPU-bound threads from running in
parallel, true multithreading is limited in Python. Hence, separate processes or distributed
systems are required in order to execute face recognition and crime detection simultaneously.
2.3.2 Java
Java v11 is used for Android App Development and implementing a similarity
classifier for the application.
The CameraX API (responsible for the input stream from the on-device camera module) is
interfaced using Java, and the frontend of the application is made responsive and fluid by
code written in the language.
Furthermore, Java on Android is efficient for this use case: heavy tasks can be dispatched to
background threads, and API calls and database queries are executed asynchronously. Java's
garbage collection and memory-leak management also make it an ideal choice for Android
app development, in contrast to hybrid or web-based apps built with frameworks such as
React Native and Angular.js.
2.4 Frameworks
2.4.1 TensorFlow
TensorFlow is an end-to-end open source platform built for simplifying and
optimizing tasks that involve machine learning and neural networks. Its core is
implemented in C++, which provides fast computation, multithreading and efficient
memory management.
We used the TensorFlow framework to reach our objective through different Keras
models (running on the TensorFlow backend). For face recognition, we evaluated the
InceptionResNetV1 model and the VGGFace2 model, both with the TensorFlow
backend.
For crime detection, we used a ResNet with TensorFlow backend for training and
predicting criminal activity.
2.4.2 Gradle
Gradle is a build system used to automate building, testing and deployment. A
build.gradle file automates all tasks related to dependency management and
downloading requisite libraries for the native Android app.
Gradle runs on the JVM (Java Virtual Machine) and can be extended with custom
tasks as well. Its dependency and build scripts are written in the Groovy language.
Our project uses Gradle to sync the Android app files with the requisite dependencies
and acquire user permissions for access to modules on the device such as camera,
cellular network, Wi-Fi, etc.
CHAPTER 3
Design and Development
3.1 Prerequisites
Our project involves advanced computer vision applications such as video classification and
face recognition. These techniques require knowledge of linear algebra, vector calculus,
neural networks, convolutions and convolutional neural networks, among other mathematical
fundamentals; these are covered in the upcoming sections.
3.2 Embeddings, Norms and some Linear Algebra
Embeddings can be considered as compact entities that quantify faces and extract their
distinct features from photos. They’re generally used to represent faces with a 128-d vector in
a 128 unit hypersphere.
Each embedding is essentially a vector with its corresponding position in the hypersphere.
Clusters of these embeddings represent similar faces or classes huddled together in the 128-d
space. In a nutshell, an embedding is a mapping of discrete – categorical – variable to a
vector of continuous numbers. They’re low dimensional, learned continuous vector
representations of discrete variables.
These embeddings are useful because:
• They help in finding nearest neighbours in the embedding space, which can work for
making recommendations based on user interests
• They reduce dimensionality of categorical variables and meaningfully represent
categories in a transformed space
• They can be used as inputs to a machine learning model for a supervised task
• They provide better visualization of concepts and relations between categories
To compute these embeddings, we use a CNN which can accurately detect and compute an
appropriate representation for the corresponding face.
In order to gauge similarities and differences between embeddings, we need some defined
metrics. These metrics can be derived from distances between two or more embeddings (L2
Norm) or the angle of deviation between two embeddings from a reference vector (cosine
similarity).
The L2 norm gives the Euclidean distance between two vectors. The distance between two
embeddings is therefore a direct measure of how dissimilar the corresponding faces are.
Cosine Similarity is the cosine of the angle between two n-dimensional vectors in an n-
dimensional space.
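As a concrete illustration of these two metrics, the NumPy sketch below computes the L2 distance and cosine similarity between two embeddings; the randomly generated 128-d vectors are stand-ins for embeddings that the face-embedding CNN would normally produce.

import numpy as np

def l2_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Euclidean (L2) distance between two embedding vectors
    return float(np.linalg.norm(a - b))

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_1 = np.random.rand(128)   # hypothetical 128-d face embedding
emb_2 = np.random.rand(128)   # hypothetical 128-d face embedding

print("L2 distance:", l2_distance(emb_1, emb_2))
print("Cosine similarity:", cosine_similarity(emb_1, emb_2))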
3.3 Theoretical Foundation of CNNs
A CNN is a type of artificial neural network that uses convolutional layers to filter inputs for
obtaining useful information for the network such as edges and shapes among other features.
This type of multi-layered network is widely used to recognize visual patterns such as
characters, symbols, figures, etc. from pixel images.
A CNN is commonly composed of repeated convolutional layers, activation functions such
as sigmoid, tanh and ReLU (rectified linear unit), pooling layers and a fully connected layer,
as shown in the figure.
• Input Layer:
Contains image data represented by a 3D matrix. This data needs to be converted into
a single column of dimension (width × height × number of channels)
• Convolutional Layer:
Uses convolutional filters (called kernels) with a defined size to go over the entire
input data and perform convolution operations.
The filter slides over the data with a stride S. This process is done to learn and detect
patterns from previous layers.
The result is a feature map.
• Pooling Layer:
Referred to as the downsampling layer, it is used to reduce spatial dimensions but
not the depth of a CNN.
• Activation Function:
Normally present after the pooling or fully connected layer, its objective is to apply a
non-linear transfer function to encode patterns through transformations.
A few common activation functions include the sigmoid function, the tanh function and the
ReLU (rectified linear unit) function:
sigmoid: f(z) = 1 / (1 + e^{-z})
tanh: f(z) = (e^{z} - e^{-z}) / (e^{z} + e^{-z})
• Output Layer:
The final (fully connected) layer of the network. Its output neurons, passed through an
activation function, give the probability of the given input belonging to each class and
its corresponding label.
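To make these layer types concrete, here is a minimal Keras sketch; the input shape, filter counts and the 10-class output are illustrative and are not the architectures used in this project.

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(64, 64, 3)),                           # input layer: width x height x channels
    layers.Conv2D(32, (3, 3), strides=1, activation="relu"),   # convolutional layer + ReLU activation
    layers.MaxPooling2D((2, 2)),                               # pooling (downsampling) layer
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                                          # flatten feature maps into a single column
    layers.Dense(128, activation="relu"),                      # fully connected layer
    layers.Dense(10, activation="softmax"),                    # output layer: class probabilities
])
model.summary()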
In a nutshell, here is our proposed solution for face recognition:
▪ Detecting faces
▪ Computing 128-d face embeddings to quantify each face
▪ Training a Support Vector Machine (SVM) on top of the embeddings
▪ Recognizing faces in image and video streams
All of these tasks will be achieved with OpenCV, enabling us to obtain a pure OpenCV face
recognition pipeline. This is achieved in two key steps:
• Applying face detection, which detects the presence and location of the face in an
image without identifying it
• Extracting a 128-d feature vector (embedding) that quantifies each face in an image
The model responsible for quantifying each face in an image is from the OpenFace Project,
a Python and Torch implementation of face recognition and deep learning.
First, we input an image or a video frame to the pipeline, which applies face detection to
detect the location of the face in the image.
Face detection is performed by a Caffe model in OpenCV’s DNN module.
OpenCV’s deep learning face detector is based on the Single Shot Detector (SSD)
framework with a ResNet base network. A single shot detection refers to a technique
wherein the model needs only a single shot to detect multiple objects within the image. It
discretizes the output space into a set of default bounding boxes over feature maps at
different scales and generates multiple candidate boxes. The confidence for each of these
boxes is calculated and the box dimensions are adjusted to obtain the best fit for detection.
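Below is a minimal sketch of this detection step with OpenCV's DNN module; the prototxt/caffemodel file names are the ones commonly distributed for this detector, and the 0.5 confidence threshold is illustrative.

import cv2
import numpy as np

# Load OpenCV's Caffe-based SSD face detector (file names assumed)
net = cv2.dnn.readNetFromCaffe("deploy.prototxt",
                               "res10_300x300_ssd_iter_140000.caffemodel")

image = cv2.imread("frame.jpg")
h, w = image.shape[:2]

# The detector expects a 300x300 mean-subtracted blob
blob = cv2.dnn.blobFromImage(cv2.resize(image, (300, 300)), 1.0,
                             (300, 300), (104.0, 177.0, 123.0))
net.setInput(blob)
detections = net.forward()

# Keep only boxes whose confidence exceeds the threshold
for i in range(detections.shape[2]):
    confidence = detections[0, 0, i, 2]
    if confidence > 0.5:
        box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
        x1, y1, x2, y2 = box.astype(int)
        cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)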
Fig 3.4: SSD – Multiple Bounding Boxes for Localization and Confidence
Additionally, we can compute facial landmarks (mouth, right/left eyebrows, eyes, nose,
jawline) using dlib, which will further enable us to preprocess the images and perform face
alignment on datasets for better results.
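A short sketch of this landmark step with dlib is given below; the 68-point predictor file name is the standard dlib download and is assumed here, not a file produced by this project.

import cv2
import dlib

# dlib's frontal face detector and 68-point landmark predictor
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

image = cv2.imread("face.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

for rect in detector(gray, 1):
    shape = predictor(gray, rect)
    landmarks = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
    for (x, y) in landmarks:
        cv2.circle(image, (x, y), 1, (0, 255, 0), -1)   # draw each landmark
    # Landmarks around the eyes can then be used to compute the rotation and
    # scale needed to bring the face into its canonical (aligned) orientation.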
Face alignment is the process of
• Identifying the geometric structure of the face
• Attempting to obtain the canonical alignment of the face based on translation, rotation
and scale.
After applying face alignment and cropping, we pass the input face through the deep neural
network.
Fig. 3.5: How the Deep Learning model computes face embeddings
To train the face recognition model with deep learning, each input batch of data must
contain three images:
• Anchor
• Positive Image
• Negative Image
The anchor is our current image (let’s say of person A), whereas the positive image is also
an image of person A. The negative image in each batch is that of a person other than A.
The point is that the anchor and positive images both belong to the same person whereas the
negative image does not contain the same face.
The neural network computes the face embeddings of each of these faces and tweaks the
weights of the network using triplet loss in such a way that:
• The 128-d embeddings of the anchor and the positive image are pulled closer together
• At the same time, the embedding of the negative image is pushed farther away
In this manner, the network is able to learn to quantify faces and return highly robust and
discriminating embeddings suitable for face recognition.
A CNN model computes the embeddings for all input images, and these embeddings are
sufficiently different to train a machine learning classifier such as SVMs, SGD Classifiers,
Random Forests, etc. on top of the face embeddings, and therefore obtain our face
recognition pipeline.
The network enforces that, for every triplet, the distance between the anchor and positive
embeddings (plus a margin) stays smaller than the distance between the anchor and negative
embeddings, where:
𝛼 = margin enforced between positive and negative pairs
𝒯 = set of all possible triplets in the training set
Hence, the loss function that should be minimized by the network is:
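In the standard FaceNet formulation, with 𝛼 and 𝒯 as defined above, the constraint and the loss can be written as:

||f(x_i^a) - f(x_i^p)||_2^2 + 𝛼 < ||f(x_i^a) - f(x_i^n)||_2^2,  for all (x_i^a, x_i^p, x_i^n) ∈ 𝒯

L = Σ_i [ ||f(x_i^a) - f(x_i^p)||_2^2 - ||f(x_i^a) - f(x_i^n)||_2^2 + 𝛼 ]_+

where f(·) is the embedding network, x_i^a, x_i^p and x_i^n are the anchor, positive and negative images of the i-th triplet, and [·]_+ denotes max(·, 0).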
3.4.2 Data Augmentation
If the neural network keeps growing but the training set does not, training eventually
reaches a point where learning saturates: the network starts memorizing the limited
examples instead of generalizing, and the model hits an accuracy wall.
Dataset Augmentation – the process of applying simple and complex
transformations like flipping and style transfer to your data – can help overcome the
increasingly large requirements of deep learning models.
Deep learning models particularly benefit from data augmentation, as they try to
learn relationships between the training examples at the pixel level. Hence, they need
a lot of examples to recognise and derive patterns from the data they observe.
Dataset Augmentation can multiply the effectiveness of present data.
The augmentation techniques applied in this project include:
• Flipping (both horizontally and vertically)
• Rotating
• Zooming and scaling
• Cropping
• Translating (moving along x and y axis)
• Adding Gaussian noise
• Shearing
• Skewing
• Black and White filters
• Sepia filters
• Blurring images
While most of these transformations can be obtained from fairly simple
implementations like the ImageDataGenerator module of TensorFlow, other
functions have been designed (mentioned forthwith) to perform data augmentation
and upload augmented data to the Cloud for automatic training once new data has
been generated.
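A minimal sketch of the generator-based transformations mentioned above is shown here; the parameter values and the directory layout (one sub-folder per person) are illustrative assumptions.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Simple augmentations from the list above that ImageDataGenerator covers directly
datagen = ImageDataGenerator(
    horizontal_flip=True,
    vertical_flip=True,
    rotation_range=20,       # rotating
    zoom_range=0.15,         # zooming and scaling
    width_shift_range=0.1,   # translating along x
    height_shift_range=0.1,  # translating along y
    shear_range=10,          # shearing
)

# Stream augmented batches from a directory of face images
train_gen = datagen.flow_from_directory("datasets/faces",
                                        target_size=(160, 160),
                                        batch_size=32)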
3.4.3 Hyperparameter Tuning
For training, the number and diversity of hyperparameters such as batch size, learning
rate, number of epochs and number of layers are quite specific to each model. These
hyperparameters must be taken into account while training a CNN to improve
performance.
• Learning Rate:
Controls how large a step the optimizer takes when updating the model's weights.
It is often considered the most important hyperparameter, since it directly affects
both training time and accuracy.
• Batch Size:
Defines the number of training samples worked through before the model
parameters are updated. It must be greater than or equal to 1 and no larger than the
training set size, and it affects the computational resources and time taken per
epoch.
• Number of epochs:
The number of passes the CNN makes through the whole dataset. It can be a fixed
number of iterations, or techniques such as early stopping can be employed, using
the training and validation errors to halt training once the validation error stops
improving (a minimal Keras sketch of this is given after this list).
• Number of layers:
Going deeper does not always change results drastically, but CNNs generally
benefit from additional layers thanks to richer feature extraction.
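As referenced under "Number of epochs", here is a minimal Keras sketch of early stopping; the monitored metric and patience value are illustrative.

from tensorflow.keras.callbacks import EarlyStopping

# Stop training once the validation loss has not improved for 5 epochs,
# restoring the best weights seen so far (values are illustrative)
early_stop = EarlyStopping(monitor="val_loss", patience=5,
                           restore_best_weights=True)

# model.fit(train_data, validation_data=val_data,
#           epochs=50, callbacks=[early_stop])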
3.5 Types of Classification Algorithms
Figure: choosing a classification algorithm by the type of target variable - for a categorical
variable: Logistic Regression, K-Nearest Neighbours, Naive Bayes, Decision Tree; for a
continuous variable: Decision Tree, Random Forest, Support Vector Machines.
Logistic Regression is most useful when you want to understand how several independent
variables affect a single outcome variable. It has limitations: all predictors should be
independent, and there should be no missing values. The algorithm will fail when the classes
are not linearly separable.
For Support Vector Machines, the hyperplane we choose directly affects the accuracy of the
results, so we search for the hyperplane with the maximum margin between the data points of
the different classes. SVMs give accurate results with minimal computation power even when
there are many features. They are effectively trained on the hardest ("worst-case") data points
to obtain the decision boundary that places the embeddings of different classes on opposite
sides of the hyperplane.
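A minimal scikit-learn sketch of training such an SVM on top of pre-computed face embeddings is shown below; the file names and array shapes are assumptions for illustration.

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

embeddings = np.load("embeddings.npy")   # (n_samples, 128) embeddings from the CNN (file assumed)
names = np.load("names.npy")             # one person name per embedding (file assumed)

le = LabelEncoder()
labels = le.fit_transform(names)

# Linear SVM with probability estimates for per-face confidence scores
recognizer = SVC(kernel="linear", probability=True)
recognizer.fit(embeddings, labels)

# Predict the identity of a query embedding (re-using one sample as a stand-in)
query = embeddings[0].reshape(1, -1)
probs = recognizer.predict_proba(query)[0]
print(le.classes_[np.argmax(probs)], probs.max())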
3.6 Model Architectures (Face Recognition)
The models used for the Face Recognition component of the project have been summarised
and illustrated as block diagrams. They cannot be included in the document due to their sizes
which may compromise text quality within the image, and hence their respective clear PNG
images have been hosted and their URLs have been listed forthwith.
3.7 Video Classification for Crime Detection
In this segment of the project, we aim towards detecting anomalies in videos of footage and
normal activity. The dataset at play here used for training is the UCF Crimes Dataset, which
is the only dataset that contains videos of diverse classes of crimes, each replete with valuable
and distinctive features.
Main features of the dataset:
• It contains 13 classes in total: Abuse, Arrest, Arson, Assault, Accidents, Burglary,
Explosion, Fighting, Robbery, Shooting, Stealing, Shoplifting and Vandalism.
• In total, it consists of 1900 real-world surveillance videos, all obtained under different
environments and each video containing a specific realistic anomaly.
merely on the duration when the crime occurred as the irrelevant portions had
been discarded.
Videos with low resolution, and portions where the crime was not clearly apparent,
were sharpened and cropped to highlight the relevant parts of the crime scene.
• Data Labelling:
It is essential to annotate each video in order to differentiate the abnormal scenes
from the normal ones. For this, the parts of each video containing an anomaly are
manually marked; the remaining parts of the video can then be labelled under the
normal class.
• Data Augmentation:
Allows enlargement of the variety of data available to train a specific model
without gathering new information.
Code Snippet for CLAHE:
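The original snippet is included in the report as an image; below is a minimal OpenCV sketch of CLAHE (Contrast Limited Adaptive Histogram Equalization) applied to a frame. Equalizing the L channel in LAB colour space and the clip/tile values are assumptions, not the project's exact settings.

import cv2

def apply_clahe(frame, clip_limit=2.0, tile_grid=(8, 8)):
    # Boost local contrast of a low-quality frame with CLAHE
    lab = cv2.cvtColor(frame, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid)
    l_eq = clahe.apply(l)   # equalize only the lightness channel
    return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)

frame = cv2.imread("low_res_frame.jpg")
enhanced = apply_clahe(frame)
cv2.imwrite("enhanced_frame.jpg", enhanced)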
Moreover, it was also found that the ResNet was less prone to delivering false
positives.
• False Positive: When no crime is happening on the scene, but the model
considers it as one.
• False Negative: When an anomaly is happening on the scene, but the model
does not detect it.
• True Positive: When an anomaly occurs on the scene and the model
accurately considers it as one.
• True Negative: When an anomaly is not happening on the scene and the
model does not detect one.
Fig. 3.13: Classification Rates
Based on these insights, we chose ResNet-50 for our crime detection component.
The ResNet-50 we chose had been pre-trained on an activity dataset, in order to spare
the computational cost of training a model from scratch on the UCF-Crimes dataset.
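A sketch of how a pre-trained ResNet-50 backbone can be given a new classification head in Keras is shown below. The report does not specify the exact head, class count or pre-training weights (Keras ships ImageNet weights, used here as a stand-in), so those details are illustrative.

import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 14   # illustrative: 13 anomaly classes plus a normal class

# Pre-trained ResNet-50 backbone without its original classification head
base = tf.keras.applications.ResNet50(weights="imagenet",
                                      include_top=False,
                                      input_shape=(224, 224, 3))
base.trainable = False   # freeze the backbone and train only the new head first

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])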
3.9 Proposed Video Classification Technique
Since a video is a series of frames, a naïve video classification method would be:
• Loop over all frames in a video file.
• For each frame, pass the frame through the CNN.
• Classify each frame individually and independently of the others.
• Choose the label with the largest corresponding probability.
• Label the frame and write the output frame to disk.
However, this approach has problems. One of them, which matters less for small-scale crime
detection, is that no temporal correlation is preserved between subsequent frames. Capturing
that correlation would require a sequence model (RNNs, LSTMs, GRUs) that feeds previous
outputs back into the recurrent network; as this demands heavy computation and adds
complexity, such a solution was not implemented.
Another problem, which occurs even in plain image classification, is prediction flickering,
wherein the label in the output rapidly switches between classes whenever the model's
per-frame prediction changes.
A simple yet elegant solution is to use rolling prediction averaging. Under this technique, we
slightly modify the processing at the output as follows:
• Obtain the prediction from the CNN
• Maintain a list of the last K predictions
• Compute the average of the last K predictions and choose the label with the largest
corresponding probability (see the sketch after this list)
• Label the frame and write the output to disk
• Store the model and weights in the required formats (SavedModel, .pb, .h5)
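A minimal sketch of this rolling average is given below; the window size K, the class labels and the model/video file names are illustrative assumptions, and the labels must match the trained model's output order.

from collections import deque
import cv2
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("crime_resnet50.h5")   # trained classifier (path assumed)
labels = ["Normal", "Fighting", "Robbery", "Vandalism"]    # illustrative subset of classes
K = 32                                                     # number of recent predictions to average
recent = deque(maxlen=K)                                   # sliding window of per-frame probabilities

cap = cv2.VideoCapture("cctv_clip.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    inp = cv2.resize(frame, (224, 224)).astype("float32") / 255.0
    preds = model.predict(np.expand_dims(inp, axis=0))[0]   # per-frame class probabilities
    recent.append(preds)
    avg = np.mean(recent, axis=0)                           # average of the last K predictions
    label = labels[int(np.argmax(avg))]                     # label with the largest averaged probability
    cv2.putText(frame, label, (10, 30), cv2.FONT_HERSHEY_SIMPLEX,
                1.0, (0, 0, 255), 2)
    # the annotated frame can then be written to disk with cv2.VideoWriter
cap.release()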
Fig. 3.15: Minimum Accuracy (LR: 10^-4, Batch Size = 32)
Epoch 1/50
205/205 - 119s 512ms/step - loss: 1.4309 - accuracy: 0.5046 - val_loss: 0.8264 - val_accuracy: 0.6967
Epoch 2/50
205/205 - 104s 507ms/step - loss: 0.8809 - accuracy: 0.6794 - val_loss: 0.6620 - val_accuracy: 0.7461
Epoch 3/50
205/205 - 103s 503ms/step - loss: 0.7386 - accuracy: 0.7386 - val_loss: 0.5724 - val_accuracy: 0.8024
Epoch 4/50
205/205 - 105s 513ms/step - loss: 0.6517 - accuracy: 0.7697 - val_loss: 0.5136 - val_accuracy: 0.8235
Epoch 5/50
205/205 - 105s 512ms/step - loss: 0.6065 - accuracy: 0.7881 - val_loss: 0.4645 - val_accuracy: 0.8488
Epoch 45/50
205/205 - 97s 475ms/step - loss: 0.1821 - accuracy: 0.9469 - val_loss: 0.1257 - val_accuracy: 0.9646
Epoch 46/50
205/205 - 98s 476ms/step - loss: 0.1722 - accuracy: 0.9484 - val_loss: 0.1233 - val_accuracy: 0.9651
Epoch 47/50
205/205 - 97s 473ms/step - loss: 0.1741 - accuracy: 0.9486 - val_loss: 0.1224 - val_accuracy: 0.9646
Epoch 48/50
205/205 - 101s 494ms/step - loss: 0.1669 - accuracy: 0.9503 - val_loss: 0.1207 - val_accuracy: 0.9665
Epoch 49/50
205/205 - 105s 514ms/step - loss: 0.1677 - accuracy: 0.9500 - val_loss: 0.1179 - val_accuracy: 0.9678
Epoch 50/50
205/205 - 181s 886ms/step - loss: 0.1581 - accuracy: 0.9535 - val_loss: 0.1168 - val_accuracy: 0.9669
Fig. 3.17: Evaluating the classifier
The clear PNG image of the classifier built on the ResNet-50 architecture and trained on the
UCF-Crimes dataset is linked below.
The full ResNet-50 architecture itself could not be exported to PNG due to its large size; it is,
however, among the most common architectures and can be found online.
• Model Architecture for crime classifier based on ResNet-50 [Clear PNG Image]
CHAPTER 4
Demonstration and Results
4.2 Face Recognition
4.2.1 Face Recognition on Android Application
4.2.2 Face Recognition and logging (from CCTV footage)
Fig. 4.5: Face Recognition from a remote mobile camera (simulating a CCTV) obtained via an RTSP URL
Fig. 4.7: A-Block CCTV recognizes Vrushali at 03:40am on 18th May and logs it into Cloud Firestore database
4.3 Crime Detection
Fig. 4.9: Code snippet responsible for only detecting and classifying the crime without recognition
Application:
When a crime is detected by a CCTV camera, it is immediately logged into Firestore along
with a timestamp and the list of people involved in the frame after performing recognition.
A notification about the occurrence of the crime is sent to the app users in the vicinity of the
camera (as they have been recognised and tracked a couple of minutes earlier).
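The logging itself is performed from the Python backend; below is a minimal sketch using the firebase_admin SDK, in which the collection and field names, the service-account key path and the example values are assumptions rather than the project's actual schema.

import firebase_admin
from firebase_admin import credentials, firestore

# Initialise the Admin SDK with a service-account key (path assumed)
cred = credentials.Certificate("serviceAccountKey.json")
firebase_admin.initialize_app(cred)
db = firestore.client()

def log_crime(camera_id, crime_label, people):
    # Write one detected crime event to Cloud Firestore
    db.collection("crimes").add({
        "camera": camera_id,
        "label": crime_label,
        "people": people,                          # recognised faces in the frame
        "timestamp": firestore.SERVER_TIMESTAMP,   # server-side timestamp
    })

log_crime("A-Block-CCTV", "Vandalism", ["Vrushali"])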
The following screenshots display crime detection, recognition and database logging.
Fig. 4.10: Output after detecting vandalism. Faces unrecognized in the video are saved as Bitmaps for further
recognition by concerned authorities.
Fig. 4.11: Firebase Storage contains the Bitmaps of unrecognized criminals. These photos are displayed to users
on the Android app.
Fig. 4.12: Snippet responsible for face recognition while detecting crimes
Fig. 4.13: Function for keeping Firestore up-to-date and other helper functions
CHAPTER 5
Conclusion and Future Enhancements
In general terms, this chapter presents a summary of every step executed to fulfill the project,
starting from literature review, followed by development of the Android application, setting
up the backend on Firebase, implementing face recognition on Android and PC, and detecting
crimes from CCTV footage.
• In recent years, several deep learning approaches have been proposed for video
classification, building on deep learning's success in object detection, image
classification and pattern recognition. However, these techniques still face
challenges with video classification due to factors such as adequately capturing
spatio-temporal information.
• After literature review, ResNet-50 was selected for crime detection whereas
OpenFace’s FaceNet implementation and InceptionResNetV1 were chosen for
Face Recognition. InceptionResNetV1 was converted to a TFLite model to satisfy
the requirement of face recognition on an edge device.
• A CNN requires a wide range of data to be trained on, and in many cases the data
was either unfit for consumption, was poorly recorded or didn’t match the label it
was uploaded under. Hence, such data had to be processed, cleaned, trimmed,
augmented and re-labelled under requisite criteria and constraints.
• CNN models are not created for general-purpose use; their parameters need to be
adjusted to work adequately for a given specific task. For our project, in line with
the literature, we noticed that ResNet took a higher training time but also returned
an increased accuracy.
• Creating and curating the Android application to share a backend that is used by
the Python scripts for crime and face classification also meant that a lot of
additional features pertaining to the same tech stack could be added. A student
tracker log was implemented to leverage the consistent connection between the
CCTV network and the application along with the face recognition algorithm
executed upon each video stream in the network.
Future Enhancements
The combination of an Android app and an integrated campus management system can only
be utilized to its full potential with further work on the app. Completing the features that
cover campus management proper would help the project achieve its real purpose.
On the computer vision front, it would be worthwhile to obtain better datasets and to try out
other models and hyperparameters to achieve improved accuracy.
Glossary
• Convolutional Layer: It uses convolutional kernels in order to extract features from input
data through an activation map of that kernel.
• CPU: The central processing unit is the computer component responsible for interpreting
and executing most of the commands from the computer's hardware and software.
• Deep Learning: It is a subset of machine learning in artificial intelligence that has networks
capable of learning, without supervision, from data that is unstructured or unlabelled.
• Deep Neural Network: It is a neural network with more than two layers.
• Feature Extraction: It is the generation of derived values coming from an initial dataset,
intended to be informative to facilitate the subsequent learning and generalization steps.
• Feature Map: It is a matrix, or a set of matrices used for the mapping of where a certain
kind of feature is found on the image which can be of interest for the Network.
• Gaussian Mixture Model: It is a probabilistic model that assumes all the data points are
generated from a mixture of a finite number of Gaussian distributions with unknown
parameters.
• Graphics processing unit (GPU): It is a computer chip that performs rapid mathematical
calculations, primarily for the purpose of rendering images. Also widely used for deep
learning purposes.
• FLOPS: Floating point operations per second is a measure of computer performance, useful
in fields of scientific computations that require floating-point calculations.
• Hyperparameters: They are model-specific properties that are fixed before the network
model starts to train.
• Output Layer: It refers to the final layer of the network which produces given outputs for
the program for future interpretation.
• Optimization: It is a process to find an alternative with the most cost effective or highest
achievable performance under the given constraints, by maximizing desired factors and
minimizing undesired ones.
• Over-Fit: A modeling error which occurs when a function is too closely fit to a limited set
of data points.
• Parameter: The parameters of a neural network are typically the weights of the
connections. In this case, these parameters are learned during the training stage.
• Pre-processing: It refers to the transformations applied to input data before feeding it to the
algorithm. It is a technique used to convert the raw data into a clean data set.
• Stride: It corresponds to the number of steps that the filter will move each time on a given
direction.
• Testing dataset: It is a dataset independent of the training dataset, following the same
probability distribution, used to evaluate the final model.
• Training dataset: It is a sample dataset used for learning purposes to fit the parameters
(e.g., weights) of, for example, a classifier.
• Validation dataset: It is a dataset of examples used to tune the hyperparameters (i.e. the
architecture) of a classifier.
References
OpenFace Project:
https://www.pyimagesearch.com/2018/02/26/face-detection-with-opencv-and-deep-learning/
Caffe Framework:
https://caffe.berkeleyvision.org/tutorial/net_layer_blob.html
MobileFaceNets:
https://arxiv.org/pdf/1804.07573.pdf
Pyrebase:
https://github.com/thisbejim/Pyrebase#storage
TensorFlow subroutines:
https://github.com/serengil/tensorflow-101
RecyclerView Primer:
https://www.google.com/search?q=RecyclerView%3A+No+adapter+attached%3B+skipping+layout&rlz=1C1ONGR_enIN929IN929&oq=RecyclerView%3A+No+adapter+attached%3B+skipping+layout&aqs=chrome..69i57j69i58.172j0j7&sourceid=chrome&ie=UTF-8
Dataset Augmentation:
https://algorithmia.com/blog/introduction-to-dataset-augmentation-and-expansion
[1] "UCF-Crime dataset (real-world anomaly detection in videos)," Jun. 2019.
[2] A. R. Zamir, "Action recognition in realistic sports videos," in Computer Vision in Sports.
Springer, 2014, pp. 181–208.
[5] K. O'Shea and R. Nash, "An introduction to convolutional neural networks," ArXiv
e-prints, Nov. 2015.
[6] L. Prechelt, "Early stopping - but when?" in Neural Networks: Tricks of the Trade
(an outgrowth of a 1996 NIPS workshop). Springer, 1998, pp. 55–69.
[7] W. Kay, J. Carreira, S. Vijayanarasimhan, F. Viola, B. Zhang et al., "The Kinetics Human
Action Video Dataset," May 2017.
[8] M. Wang and W. Deng, "Deep Face Recognition: A Survey," Aug. 2020.
[9] K. Doshi and Y. Yilmaz, "Online Anomaly Detection in Surveillance Videos with
Asymptotic Bounds on False Alarm Rate," Oct. 2020.
[10] W. Sultani, C. Chen and M. Shah, "Real-world Anomaly Detection in Surveillance
Videos," Feb. 2019.