
FACE MASK DETECTION USING DEEP CNNS

A Project Report

Submitted by

Maitreyi Bhat 111607038

Neha Bendale 111607045

Shruti Deshmukh 111607058

Submitted in partial fulfilment of the requirements of the degree of

B.Tech. Electronics and Telecommunication

Under the guidance of

Prof. Ashwini Kulkarni

DEPARTMENT OF ELECTRONICS AND

TELECOMMUNICATION ENGINEERING
COLLEGE OF ENGINEERING PUNE

2019-2020

CERTIFICATE

This is to certify that the report entitled ‘FACE MASK DETECTION USING DEEP CNNS’
submitted by Maitreyi Bhat (111607038), Neha Bendale (111607045), and Shruti Deshmukh
(111607058) in partial fulfilment of the requirements for the award of the degree of Bachelor of
Technology (Electronics and Telecommunication Engineering) of College of Engineering Pune,
affiliated to the Savitribai Phule Pune University, is a record of their own work.

Signature Signature
Prof. Ashwini Kulkarni Dr. S. P. Mahajan
Project Guide, Head of the Department,
Dept of Electronics and Telecommunications, Dept of Electronics and Telecommunications,
College of Engineering, Pune College of Engineering, Pune
Report Approval
This report entitled

FACE MASK DETECTION USING DEEP CNNS

By

Maitreyi Bhat 111607038

Neha Bendale 111607045

Shruti Deshmukh 111607058

is approved for the degree of

Bachelor of Technology with

Electronics and Telecommunication Engineering

of

Electronics and Telecommunication Department

College of Engineering Pune

(An Autonomous Institute of Govt. of Maharashtra)

Examiners Name Signature

1. External Examiner __________________ __________________


2. Guide/Supervisor __________________ __________________
Declaration
We declare that this written submission represents our ideas in our own words and where others'
ideas or words have been included, we have adequately cited and referenced the original sources.
We also declare that we have adhered to all principles of academic honesty and integrity and
have not misrepresented or fabricated or falsified any idea/data/fact/source in our submission.
We understand that any violation of the above will be cause for disciplinary action by the
Institute and can also evoke penal action from the sources which have thus not been properly
cited or from whom proper permission has not been taken when needed.

Name MIS No. Signature

Maitreyi Bhat 111607038

Neha Bendale 111607045

Shruti Deshmukh 111607058

Date:

Place: Pune
Acknowledgements

We are grateful to our college and to the Department of Electronics and Telecommunication for
providing us this opportunity to work on this project. It is their visionary objective of encouraging
students to undertake such projects that has blossomed into this extraordinary opportunity.

To begin with, we sincerely thank our project guide Prof. Ashwini Kulkarni for her guidance
through the design and development phases of the project. Her constant motivation and support in
bolstering the requisite theoretical concepts, along with invaluable tips and tricks, proved to be crucial
to the success of our work. We would also like to express our gratitude for the
technical and non-technical guidance provided by the faculty of our department. Special thanks
to our faculty adviser Mrs. Yogita Vaidya for her support. We are also indebted to Mrs. Rajshri
Mahajan for evaluating our work and suggesting invaluable improvements.

The team is immensely grateful to the Head of the Department, Dr. S. P. Mahajan. It is his support
and encouragement to students to work on a project in their field of interest that has made this
possible. We would like to take this opportunity to thank all the respected teachers of this
department for being a perennial source of inspiration and for showing us the right path in these
unprecedented times of COVID-19.
Abstract
In the wake of the burgeoning amount of data and the demand for IoT-based edge computing
devices at the advent of 5G, the research community as well as the industry are taking an
increased interest in deploying AI on the edge. AI-based systems on edge computing devices
provide the best of both worlds, viz. the state-of-the-art accuracies of deep learning models and
the portability and scalability of embedded systems. However, a major bottleneck is the
prohibitively large size of deep learning models and the limited memory and computation
capacities of embedded systems that work on a low power budget. With this background, our
work is at the intersection of computer vision and deep learning.

This project report documents and presents the results of a social experiment motivated by the
safety measures to be taken during the coronavirus situation. Our original project aim was to implement
a women’s safety surveillance robot with pedestrian detection and an IoT database for storage and
identification. Owing to the unprecedented times, this aim was redirected to pedestrian detection as
well as detection of human faces with and without masks. The social purpose of safety remains the same.
We have divided the project into four detection sections:

1) Pedestrian detection in an image

2) Pedestrian detection in a video

3) Face detection without mask

4) Face detection with mask

The system developed under the revised title can be deployed in institutions such as colleges and
schools, in supermarkets, malls and cinema halls, and in public places as well. This will significantly
help in enforcing control measures for coronavirus and in curbing the risk of spread of the disease.

This document contains detailed information about the methodology adopted for each of the four
sections and the predefined machine learning algorithms used to achieve them. The outputs and
results obtained are reported along with their accuracy rates.

This report also outlines the future scope for experimentation with regard to a surveillance UGV that
could integrate all the parts together and be used for women’s safety.
Contents
1. Introduction
2. Literature Survey
   2.1 Convolutional Neural Networks for Face Recognition
   2.2 Model for CNN Implementation
   2.3 Face Detecting Techniques
   2.4 Algorithm for Pedestrian Detection
   2.5 Image Recognition Techniques
3. Proposed Work
   3.1 Objectives
   3.2 Resources Requirements
      3.2.1 Software
      3.2.2 Hardware
4. Theoretical Background
   4.1 CNN
      4.1.1 CNN Architectures
         4.1.1.1 ResNet
         4.1.1.2 MobileNet Architecture
   4.2 Training Parameters
      4.2.1 Convolution Layer Parameters
   4.3 Object Detection
      4.3.1 Object Detection Techniques
         4.3.1.1 Haar Cascade Classifier for Face Mask Detection
            4.3.1.1.1 Cascade Classifier
         4.3.1.2 Histogram of Oriented Gradients for Pedestrian Detection
      4.3.2 SVM
   4.4 Image Recognition
      4.4.1 Algorithms
      4.4.2 Image Data Pre-Processing Steps for Neural Networks
      4.4.3 Limitations of Regular Neural Networks for Image Recognition
   4.5 Datasets
      4.5.1 CNN Dataset
      4.5.2 Pedestrian Detection Input Images
   4.6 Technology Used
      4.6.1 TensorFlow
      4.6.2 Caffe
      4.6.3 Keras
      4.6.4 OpenCV Interface
5. Methodology
   5.1 Face Mask Detection using CNN
   5.2 Pedestrian Detection using HOG
6. Conclusion
7. Future Work
8. References
List of Figures
Figure 1: Deep CNN kernel model
Figure 2: Different types of face detection methods
Figure 3: Different types of face detection techniques
Figure 4: Calculation steps of the HOG descriptor
Figure 5: Feature maps in convolution layer
Figure 6: Platform model
Figure 7: Zero padding at boundary pixels
Figure 8: Haar features
Figure 9: Features from integral images
Figure 10: Stages of a cascade classifier
Figure 11: HOG descriptor extraction steps
Figure 12: (a) Original image, (b) pixel gradient magnitude, (c) cell gradient magnitude
Figure 13: (a) Cell gradient orientation histogram and (b) block gradient orientation histogram
Figure 14: (a) Possible classifiers and (b) hyperplanes and their margins
Figure 15: Traditional neural network vs 3D CNN structure
Figure 16: Fully connected neural networks
Figure 17: without_mask dataset
Figure 18: with_mask dataset
Figure 19: Input images for pedestrian detection
Figure 20: Block diagram of face mask detector system
Figure 21: with_mask dataset
Figure 22: without_mask dataset
Figure 23: Augmented dataset
Figure 24: Splitting of data into training and testing set
Figure 25: Training the model
Figure 26: Testing the model
Figure 27: Accuracy of the model
Figure 28: Plot of # epoch vs accuracy
Figure 29: Output of face mask detection
Figure 30: Flow chart of human detection
Figure 31: Preprocessing the data
Figure 32: Absolute value magnitude of gradient
Figure 33: Horizontal and vertical gradients
Figure 34: Histogram of gradients
Figure 35: Movement of window by 8 pixels
List of Tables
Table 1: Software requirements
Table 2: Parameters for image data preparation

Abbreviations

IoT Internet of Things

HOG Histogram of Oriented Gradients

CNN Convolutional Neural Network

UGV Unmanned Ground Vehicle

VGG Visual Geometry Group

GPU Graphics Processing Unit

SVM Support Vector Machine

L2 Norm L2 Distance Metric

LUT Look Up Table

MSE Mean Square Error

AUC Area Under Curve


1. Introduction
Over the past decade, deep neural networks, especially convolutional neural networks,
have obtained state-of-the-art accuracy in a myriad of domains ranging from computer vision to
speech recognition. Thus, it becomes imperative to deploy these deep learning models on
embedded or edge devices such as portable devices and IoT-based sensor networks to exploit their
high accuracy for real-time inference. With this background, we explore the problem of efficient
embedded implementation of a specific class of deep learning models known as Convolutional
Neural Networks (CNNs), which have not only been shown to achieve state-of-the-art results but
even outperform humans in certain computer vision tasks.

This work is in the domain of face mask detection from images of people with and
without masks using deep convolutional neural networks. Wearing a face mask has now become
paramount due to the outbreak of COVID-19. The primary advantage of wearing a face mask is
that it will prevent further spread of the disease. A significant portion of individuals are
asymptomatic (they lack the symptoms of corona), and a few are pre-symptomatic (they
eventually develop symptoms of corona); both groups can transmit the virus to others before showing
symptoms. The virus can spread between people interacting in proximity — for example,
speaking, coughing, or sneezing — even if those people are not exhibiting symptoms.
Considering this new evidence, it is recommended to wear cloth face coverings in public settings
where other social distancing measures are difficult to maintain (e.g., grocery stores and
pharmacies) especially in areas of significant community-based transmission.

This project report documents and presents the results of a social experiment motivated by the
safety measures to be taken during the coronavirus situation. Our original project aim was to
implement a women’s safety surveillance robot with pedestrian detection and an IoT database for
storage and identification. Owing to the unprecedented times, this aim was redirected to pedestrian
detection as well as detection of human faces with and without masks. The social purpose of safety
remains the same. We have divided the project into four detection sections:

1) Pedestrian detection in an image

2) Pedestrian detection in a video

3) Face detection without mask


4) Face detection with mask

The system developed under the revised title can be deployed in institutions such as colleges and
schools, in supermarkets, malls and cinema halls, and in public places as well. This will significantly
help in enforcing control measures for coronavirus and in curbing the risk of spread of the disease.

This document contains detailed information about the methodology adopted for each of the four
sections and the predefined machine learning algorithms used to achieve them. The outputs and
results obtained are reported along with their accuracy rates.

This report also outlines the future scope for experimentation with regard to a surveillance
UGV that could integrate all the parts together and be used for women’s safety.
2. Literature Survey
Our work explores three areas of research, viz. convolutional neural networks for face mask
detection, pedestrian detection using the histogram of oriented gradients, and image recognition
techniques used in computer vision and image processing for object detection.

2.1 Convolutional Neural Networks for Face Recognition

CNN is mostly used in image and face recognition. A CNN is a kind of artificial neural network
that employs the convolution operation to extract features from the input data. With the
computational power of Graphics Processing Units (GPUs), CNNs have achieved remarkable
cutting-edge results over a number of areas, including image recognition, scene recognition,
semantic segmentation, and edge detection.

CNNs are best known for their ability to recognize patterns present in images, and so the task
chosen for the network was that of image classification. One of the most common benchmarks
for gauging how well a computer vision algorithm performs is to train it on the MNIST
handwritten digit database: a collection of 70,000 handwritten digits and their corresponding
labels. The goal is to train a CNN to be as accurate as possible. CNNs make use of filters (also
known as kernels) to detect what features, such as edges, are present throughout an image. A
filter is just a matrix of values, called weights, that are trained to detect specific features. The
filter moves over each part of the image to check if the feature it is meant to detect is present. To
produce a value representing how confident it is that a specific feature is present, the filter carries
out a convolution operation, which is an element-wise product and sum between two matrices.
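
As a concrete illustration of the classification setting described above, the following is a minimal
Keras sketch of a small CNN trained on the MNIST handwritten digit database; the layer sizes,
epoch count and batch size are illustrative assumptions, not the model used later in this report.

import tensorflow as tf
from tensorflow.keras import layers, models

# Load and normalize the MNIST handwritten digit dataset (70,000 labelled images).
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0   # shape (60000, 28, 28, 1), pixel values in [0, 1]
x_test = x_test[..., None] / 255.0

# A small stack of convolution + pooling layers followed by a dense classifier.
model = models.Sequential([
    layers.Conv2D(16, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),  # one probability per digit class
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, batch_size=128,
          validation_data=(x_test, y_test))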

2.2 Model for CNN implementation

Any CNN mainly has six types of operations, namely convolution, zero padding, batch
normalization, max pooling, average pooling and concatenation. Of these, the most
computationally intensive operation is the convolution operation. The convolution operation
involves two components: an input image and a mask (kernel).
As shown in the figure, the kernel or mask is placed over an area of the image and
element-wise multiplication of the kernel and image pixels is carried out, and the products are
summed to give an output pixel. Here h(i,j) is the output pixel. Similarly, the kernel is moved over
the entire input image and each position of the kernel gives rise to a different output pixel. This is
how the output feature map is generated after convolving an image with a kernel/mask.

Figure 1 : Deep CNN kernel model

When the feature is present in part of an image, the convolution operation between the filter and
that part of the image results in a real number with a high value. If the feature is not present, the
resulting value is low. So that the Convolutional Neural Network can learn filter values that detect
the features present in the input data, the output of the convolution is passed through a non-linear
mapping (an activation function such as ReLU).
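
To make the sliding-kernel operation concrete, here is a minimal NumPy sketch of the convolution
described above; the 6x6 input and the 3x3 edge-detecting kernel are arbitrary illustrative values.

import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image; each position yields one output pixel h(i, j)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    output = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            output[i, j] = np.sum(patch * kernel)  # element-wise product, then sum
    return output

image = np.random.rand(6, 6)                     # toy 6x6 single-channel image
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])                  # a simple vertical-edge filter
feature_map = convolve2d(image, kernel)
relu = np.maximum(feature_map, 0)                # non-linear mapping (ReLU)
print(feature_map.shape)                         # (4, 4) output feature map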

Programming the CNN:

Step 1: Get the data

Step 2: Initialize the parameters

Step 3: Define the backpropagation operations

Step 4: Build the network

Step 5: Train the network

2.3 Face Detecting Techniques

The face plays a major role in social intercourse for conveying the identity and feelings of a person. So,
automatic face detection systems play an important role in face recognition, facial expression
recognition, head-pose estimation, human–computer interaction, etc. Face detection is a computer
technology that determines the location and size of a human face in a digital image. Face
detection has been a standout topic in the computer vision literature. Many novel methods
have been proposed to handle the many variations encountered in practice.

Figure 2: Different types of Face Detection Methods

For example, the template-matching methods are used for face localization and detection by
computing the correlation of an input image to a standard face pattern. The feature invariant
approaches are used for feature detection of eyes, mouth, ears, nose, etc. The appearance-based
methods are used for face detection with eigenface neural network and information theoretical
approach. Nevertheless, implementing the methods altogether is still a great challenge.
Figure 3 : Different types of Face Detection techniques

Fortunately, the images used in this project have some degree of uniformity, thus the detection
algorithm can be simpler: first, all the faces are vertical and have a frontal view; second, they
are under almost the same illumination conditions. This project presents a face detection technique
mainly based on the color segmentation, image segmentation and template matching methods.

2.4 Algorithm for Pedestrian Detection

Pedestrian detection is an important research field of computer vision and pattern recognition,
and has a wide range of applications in intelligent transportation, human-computer interaction, video
search, video surveillance and other fields. By detecting and tracking pedestrians and by analyzing
their trajectories and recognizing their behavior, real-time video surveillance systems can detect
abnormal events, raise alarms, and realize intelligent video surveillance. The accuracy of detection
and localization directly affects the performance of the entire system.

Figure 4 : Calculation steps of the HOG descriptor


The basic idea behind this approach is capturing the object appearance and shape by
characterizing it using local intensity gradients and edge directions. The image is densely
divided into small spatial regions called cells. For each cell, a 1-D histogram of gradient
directions/edge directions is computed, and later all cell data is combined to give a complete HOG
descriptor of the window. The HOG descriptor has a few key advantages over other descriptors.
Since it operates on local cells, it is invariant to geometric and photometric transformations,
except for object orientation. Such changes would only appear in larger spatial regions. HOG
descriptors may be used for object recognition by providing them as features to a machine
learning algorithm. Dalal and Triggs used HOG descriptors as features in a support vector
machine (SVM); however, HOG descriptors are not tied to a specific machine learning
algorithm. The variety of colors and illumination in the surroundings makes normalization
inevitable. We further describe the normalization technique as a part of our approach later in the
report.
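
A minimal sketch of extracting such a HOG descriptor in Python, assuming the scikit-image
library is available; the cell and block sizes follow the common Dalal-Triggs defaults rather than
settings specific to this project, and the image path is hypothetical.

from skimage import io, color
from skimage.feature import hog

image = color.rgb2gray(io.imread("person.jpg"))   # hypothetical input image path

features, hog_image = hog(
    image,
    orientations=9,            # number of gradient orientation bins per cell
    pixels_per_cell=(8, 8),    # cell size in pixels
    cells_per_block=(2, 2),    # cells grouped into blocks for normalization
    block_norm="L2-Hys",       # block-level histogram normalization
    visualize=True,            # also return an image visualizing the descriptor
)
print(features.shape)          # flattened HOG feature vector for the whole window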

2.5 Image Recognition Techniques

Advancements in machine learning and the use of high-bandwidth data services are fueling the growth
of this technology. Image recognition is the ability of a computer-powered camera to identify and
detect objects or features in a digital image or video. It is a method for capturing, processing,
examining, and understanding images. To identify and detect images, computers use machine
vision technology that is powered by an artificial intelligence system.

The major steps in the image recognition process are gathering and organizing data,
building a predictive model and using it to recognize images. In this project we have used the
face detection algorithm of image recognition. This algorithm detects faces from the entire
image or from each frame of a video. We have used Haar feature-based cascade classifiers for
detecting the features of the face. It is a machine learning based approach where a cascade function
is trained from a large number of positive and negative images. It is then used to detect objects in
other images. The model predicts the probability of each of the two classes, ‘without mask’ and
‘with mask’. Based on which probability is higher, the corresponding label is chosen and displayed
around the detected faces.
3. Proposed Work
Our original proposed work was to implement a surveillance UGV for women’s safety that
would follow a path with the help of ultrasonic sensors, detect pedestrians, capture their
images and store them in an IoT database for further use and recognition. After the COVID-19
situation arose, the hardware part could not be implemented, so we decided to keep the social
cause of safety measures the same and molded the work into a purely software project, changing
the direction of approach. The first part of our work comprises detection of pedestrians from still
images and videos with the help of image recognition algorithms. The second and most significant
part of our project consists of recognizing faces with and without masks using a CNN.

3.1 Objectives

1. Study various face recognition CNN models and find the most compact and
efficient one.
2. Study various CNN architectures and pre-trained models to select the one with
minimal parameters and formidable accuracy.
3. Train a smaller model for detection of faces with and without masks.
4. Study image recognition algorithms and find the most accurate and efficient
one.
5. Identify faces live on a webcam with the help of the best-found algorithm.
6. Study various algorithms for image detection and analysis and find the most
befitting one.
7. Identify pedestrians in any given image and video using the best-found algorithm.

3.2 Resources Requirements

3.2.1 Software

Languages: Python3, C
Libraries: TensorFlow, OpenCV, Keras, Caffe
Table 1 : Software Requirements

3.2.2 Hardware

CPU: Intel i5 (32GB)

Camera: Laptop webcam
Table 2 : Hardware Requirements

4. Theoretical Background
4.1 CNN

Unlike a fully connected neural network, in a Convolutional Neural Network (CNN) the neurons
in one layer don’t connect to all the neurons in the next layer. Rather, a convolutional neural
network uses a three-dimensional structure, where each set of neurons analyzes a specific region
or “feature” of the image. CNNs filter connections by proximity (pixels are only analyzed in
relation to pixels nearby), making the training process computationally achievable. In a CNN
each group of neurons focuses on one part of the image. For example, in a cat image, one group
of neurons might identify the head, another the body, another the tail, etc. There may be several
stages of segmentation in which the neural network image recognition algorithm analyzes
smaller parts of the images, for example, within the head, the cat’s nose, whiskers, ears, etc. The
final output is a vector of probabilities, which predicts, for each feature in the image, how likely
it is to belong to a class or category.

4.1.1 CNN Architectures

Over the past few years, a large variety of CNN architectures have been pioneered. In our
experiments, we focus on the MobileNet architecture by Howard et al. at Google. We present the
representative highlights and strengths of the relevant model architectures below.
4.1.1.1 ResNet

A residual neural network (ResNet) is an artificial neural network (ANN) of a kind that builds on
constructs known from pyramidal cells in the cerebral cortex. Residual neural networks do this
by utilizing skip connections, or shortcuts to jump over some layers. ResNet uses Batch
Normalization at its core. The Batch Normalization adjusts the input layer to increase the
performance of the network. The problem of covariate shift is mitigated. ResNet makes use of
the Identity Connection, which helps to protect the network from vanishing gradient problems. A
building block of a ResNet is called a residual block or identity block. A residual block is one in
which the activation of a layer is fast-forwarded and added to a deeper layer in the neural network.
In theory, the training error should monotonically decrease as more layers are added to a neural
network; in practice, plain deep networks show degradation (higher training error with more
layers), which residual connections help avoid.

Residual neural networks do this by utilizing skip connections, or shortcuts to jump over some
layers. Typical ResNet models are implemented with double- or triple- layer skips that contain
nonlinearities (ReLU) and batch normalization in between. An additional weight matrix may be
used to learn the skip weights; these models are known as HighwayNets. Models with several
parallel skips are referred to as DenseNets. In the context of residual neural networks, a non-
residual network may be described as a plain network.

One motivation for skipping over layers is to avoid the problem of vanishing gradients, by
reusing activations from a previous layer until the adjacent layer learns its weights. During
training, the weights adapt to mute the upstream layer, and amplify the previously-skipped layer.
In the simplest case, only the weights for the adjacent layer's connection are adapted, with no
explicit weights for the upstream layer. This works best when a single nonlinear layer is stepped
over, or when the intermediate layers are all linear. If not, then an explicit weight matrix should
be learned for the skipped connection.
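
To make the idea of a skip (identity) connection concrete, the following is a minimal Keras sketch
of a residual block; the filter count, kernel size and input shape are illustrative assumptions rather
than those of any particular ResNet variant.

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """y = F(x) + x : two conv layers whose output is added back to the input (skip)."""
    shortcut = x                                           # the skip connection
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                        # identity connection
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = residual_block(inputs)
model = tf.keras.Model(inputs, outputs)
model.summary()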

4.1.1.2 MobileNet Architecture

MobileNets are a class of deep convolutional neural networks designed for embedded vision
applications that are both memory and computation limited. These models are composed of
depth wise convolution operations that are slightly different from vanilla convolutions as
illustrated in Figure 4.9. These depth-wise separable convolutions introduce 2 hyper-parameters
for model shrinking namely width multiplier and resolution multiplier. These hyper-parameters
give the network designer full control over the exact size of their models in addition to a benefit
in the reduced number of computations. Such flexibility can be suitably exploited by choosing a
model that is well within the constraints of the target embedded device. For instance, in our case,
we used a width multiplier of 0.5 and a depth multiplier of 1 to obtain a model with ~0.8M
parameters. Neural networks have revolutionized many areas of machine intelligence, enabling
superhuman accuracy for challenging image recognition tasks. However, the drive to improve
accuracy often comes at a cost: modern state-of-the-art networks require high computational
resources beyond the capabilities of many mobile and embedded applications. The MobileNetV2
architecture is specifically tailored for mobile and resource-constrained environments. It pushes the
state of the art for mobile-tailored computer vision models by significantly decreasing the number
of operations and the memory needed while retaining the same accuracy. Its main contribution is a
novel layer module: the inverted residual with linear bottleneck. This module takes as input a
low-dimensional compressed representation which is first expanded to a high dimension and
filtered with a lightweight depthwise convolution. Features are subsequently projected back to a
low-dimensional representation with a linear convolution. The official implementation is available
as part of the TensorFlow-Slim model library. This module can be efficiently implemented using
standard operations in any modern framework and allows these models to beat the state of the art
at multiple performance points on standard benchmarks. Furthermore, this convolutional module
is particularly suitable for mobile designs, because it allows the memory footprint needed during
inference to be significantly reduced by never fully materializing large intermediate tensors. This
reduces the need for main memory access in many embedded hardware designs that provide
small amounts of very fast software-controlled cache memory.
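
As a rough illustration of how the width multiplier mentioned above is exposed when building a
model in Keras, one could load a MobileNet backbone as follows; the input size and the small
binary head are illustrative assumptions, and the exact training pipeline used in this project is
described in Section 5.

import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3),
    alpha=0.5,              # width multiplier: shrinks the number of channels per layer
    depth_multiplier=1,     # depthwise resolution multiplier
    include_top=False,      # drop the ImageNet classification head
    weights="imagenet",
)

# A small binary head (mask / no mask) on top of the frozen backbone.
base.trainable = False
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(2, activation="softmax"),
])
model.summary()   # prints the parameter count of the shrunk backbone

Setting alpha to 0.5 halves the number of channels in every layer, which is what brings the
parameter count down to roughly the figure quoted above.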

4.2 Training Parameters

A model parameter is a configuration variable that is internal to the model and whose value can
be estimated from data. Parameters are required by the model when making predictions, and their
values define the skill of the model on your problem. They are estimated or learned from data. The
training of model parameters is one of the most challenging problems when constructing a
learning algorithm. It involves finding the estimates of the parameters that optimize the
performance of the model, based on a set of training data.

4.2.1 Convolution Layer Parameters

In a convolution layer in a CNN there are multiple such masks also known as filters which
operate on the input image and generate multiple feature maps as shown in the figure below.

Figure 5 : Feature Maps in Convolution Layer

The main parameters of a convolutional layer are as follows

1. Input image - Refers to the array of pixels over which the convolution operation is to be
carried out.
2. Number of filters - Each filter operates on the input image independently and generates a
feature map. Thus, the total number of output feature maps generated is equal to the
number of filters in that layer.
3. Mask for each filter - Stores the parameters of each filter.
4. Stride - The number of pixels by which we shift the filter mask while generating each
output pixel. Thus, the higher the stride, the lower the number of output pixels generated
and the smaller the output feature map.
5. Zero Padding - If the filter mask’s centre pixel is placed over the top left corner pixel of
the input image, it can be observed that some elements of the filter mask do not have
any corresponding overlapping elements in the input image. Thus, it is a common
practice to pad the input image with zeros in order to perform the convolution of the filter
mask even with the corresponding boundary pixels of the input image, as shown in the
following figure. A Keras sketch of these layer parameters is given after the figures below.

Figure 6 : Platform Model

Figure 7 : Zero Padding at Boundary Pixels
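
The parameters listed above map directly onto the arguments of a convolution layer in Keras; a
minimal sketch follows, with illustrative values (32 filters, a 3x3 mask, stride 1, 'same' zero
padding).

import tensorflow as tf
from tensorflow.keras import layers

conv = layers.Conv2D(
    filters=32,          # number of filters -> 32 output feature maps
    kernel_size=(3, 3),  # mask (kernel) size for each filter
    strides=(1, 1),      # stride: pixels the mask shifts per output pixel
    padding="same",      # zero padding so boundary pixels are also convolved
    activation="relu",
)

x = tf.random.normal((1, 64, 64, 3))   # a batch of one 64x64 RGB input image
y = conv(x)
print(y.shape)                          # (1, 64, 64, 32): 32 feature maps, size preserved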

4.3 Object Detection

Object detection is a computer technology related to computer vision and image processing that
deals with detecting instances of semantic objects of a certain class such as humans, buildings, or
cars in digital images and videos. Well-researched domains of object detection include face
detection and pedestrian detection. Object detection has applications in many areas of computer
vision, including image retrieval and video surveillance. It is also used in tracking objects, for
example tracking a ball during a football match, tracking movement of a cricket bat, or tracking a
person in a video.

Every object class has its own special features that help in classifying the class – for example, all
circles are round. Object class detection uses these special features. For example, when looking
for circles, objects that lie at a fixed distance from a point (i.e. the center) are sought. Similarly, when
looking for squares, objects whose sides are perpendicular at the corners and of equal length are
needed. A similar approach is used for face identification, where the eyes, nose, and lips can be
located and features such as skin color and the distance between the eyes can be used.
4.3.1 Object detection techniques

There are various object detection techniques used. In our project we have used two machine
learning based approaches.

1. Haar Cascade Classifier for Face Mask Detection.


2. Histogram of Oriented Gradients for Pedestrian Detection.

The algorithms of these techniques are explained as follows.

4.3.1.1 Haar Cascade Classifier for Face Mask Detection


The Haar cascade classifier is a well-known algorithm for detecting faces and body parts in an
image, but it can be trained to identify almost any object. In our project we have used this algorithm
to detect faces without masks. The algorithm has four stages:

1. Haar Feature Selection


2. Creating Integral Images
3. Adaboost Training
4. Cascading Classifiers

Initially, the algorithm needs a large number of positive images of faces with masks and negative
images of faces without masks to train the classifier. Then we need to extract features from them.

First step is to collect the Haar Features. A Haar feature considers adjacent rectangular regions
at a specific location in a detection window, sums up the pixel intensities in each region and
calculates the difference between these sums.

Integral images are used to make this computation very fast. But among all the features calculated,
most are irrelevant. For example, consider the image below. The top row shows two good
features. The first feature selected seems to focus on the property that the region of the eyes is
often darker than the region of the nose and cheeks. The second feature selected relies on the
property that the eyes are darker than the bridge of the nose. But the same windows applied to the
cheeks or any other region are irrelevant.
Figure 8 : HAAR features Figure 9 : Features from Integral Images

Selection of the best features out of 160000+ features is accomplished using a concept called
Adaboost. It selects the best features and trains the classifiers that use them. This algorithm
constructs a “strong” classifier as a linear combination of weighted simple “weak” classifiers.
The process is as follows.

During the detection phase, a window of the target size is moved over the input image,
and Haar features are calculated for each subsection of the image. This difference is then
compared to a learned threshold that separates non-objects from objects. Because each Haar
feature is only a "weak classifier" (its detection quality is slightly better than random guessing), a
large number of Haar features are necessary to describe an object with sufficient accuracy; they
are therefore organized into cascade classifiers to form a strong classifier.

4.3.1.1.1 Cascade Classifier

The cascade classifier consists of a collection of stages, where each stage is an ensemble of weak
learners. The weak learners are simple classifiers called decision stumps. Each stage is trained
using a technique called boosting. Boosting provides the ability to train a highly accurate
classifier by taking a weighted average of the decisions made by the weak learners.
Each stage of the classifier labels the region defined by the current location of the sliding
window as either positive or negative. Positive indicates that an object was found and negative
indicates no objects were found. If the label is negative, the classification of this region is
complete, and the detector slides the window to the next location. If the label is positive, the
classifier passes the region to the next stage. The detector reports an object found at the current
window location when the final stage classifies the region as positive.

Figure 10 : Stages of a Cascade Classifier

The stages are designed to reject negative samples as fast as possible. The assumption is that
most windows do not contain the object of interest. Conversely, true positives are rare and worth
taking the time to verify.

 A true positive occurs when a positive sample is correctly classified.


 A false positive occurs when a negative sample is mistakenly classified as positive.
 A false negative occurs when a positive sample is mistakenly classified as negative.

To work well, each stage in the cascade must have a low false negative rate. If a stage
incorrectly labels an object as negative, the classification stops and that mistake cannot be
corrected. However, each stage can have a high false positive rate. Even if the detector
incorrectly labels a nonobject as positive, the mistake can be corrected in subsequent stages.
Adding more stages reduces the overall false positive rate, but it also reduces the overall true
positive rate.
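
A minimal sketch of running a pretrained Haar cascade with OpenCV is shown below; it uses the
frontal-face cascade bundled with opencv-python, and the image path and detectMultiScale
parameters are illustrative assumptions rather than the exact values used in this project.

import cv2

cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

image = cv2.imread("people.jpg")                    # hypothetical input image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Slide the detection window over the image at multiple scales; each candidate window
# passes through the cascade stages and is reported only if every stage says "positive".
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                      minSize=(30, 30))

for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("faces_detected.jpg", image)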

4.3.1.2 Histogram of Oriented Gradients for Pedestrian Detection

This approach is an association between two methods and works as follows: HOG is a local
descriptor that uses a gradient vector orientation histogram and SVM is a classifier with good
generalization power that uses the features extracted by the descriptor. The main idea of HOG is
that object appearance and shape can be described by pixel gradient distribution. The descriptor
extracting process can be divided into four steps:

1) calculate the vector gradient of each pixel


2) group pixels in cells
3) group cells in blocks
4) assemble the descriptor.
First, one-dimensional masks of point discrete derivatives, [−1, 0, 1] and [−1, 0, 1]T, are applied
along the horizontal and vertical axes in order to calculate each pixel's gradient, as seen in the
figures below.

Figure 11 : HOG descriptor extraction steps: (a) original image, (b) gradient vector calculator, (c) pixel grouping in cells, (d) cell
histogram calculation, (e) cell grouping in blocks and (f) descriptor assembly.

Figure 12 : a) Original image, b) pixel gradient magnitude, c) cell gradient magnitude


Figure 13 : (a) Cell gradient orientation histogram and (b) block gradient orientation histogram

Second, the pixels are grouped into cells, as shown in the figures. In the third step, blocks are
created by grouping the cells, as seen in the figures. In the fourth and final step, Figure 11 (f), the
descriptor is assembled. The descriptor is a list of the cell histograms of all blocks. Local light
variation or high contrast between foreground and background can be an issue in image processing.
The way to attenuate this issue is by normalizing the histograms according to their neighbors.
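
A minimal sketch of this HOG + SVM pedestrian detection pipeline using OpenCV's built-in HOG
descriptor and its pretrained default people detector follows; the image path, window stride,
padding and scale are illustrative assumptions.

import cv2

hog = cv2.HOGDescriptor()                                  # default 64x128 person window
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

image = cv2.imread("street.jpg")                           # hypothetical input image

# Slide the 64x128 detection window over an image pyramid; the SVM scores each
# window's HOG descriptor and returns the boxes classified as "pedestrian".
rects, weights = hog.detectMultiScale(image,
                                      winStride=(8, 8),
                                      padding=(8, 8),
                                      scale=1.05)

for (x, y, w, h) in rects:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 0, 255), 2)

cv2.imwrite("pedestrians_detected.jpg", image)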

4.3.2 SVM

The Support Vector Machine is a supervised learning algorithm used mainly for classification
and regression analysis. SVM is a binary linear classifier, but there are approaches that enable it
to deal with non-linear or multi-class problems. SVM works basically by finding a hyperplane
that fits in the middle of two classes. Mathematically, a training dataset can be represented by X,
where xi, i = 1, 2, ..., N are its feature vectors. Considering a linearly separable problem, these
vectors belong to only two classes, ω1 or ω2. The objective of SVM is to find a hyperplane

g(x) = w^T x + w0 = 0

that classifies the training vectors correctly, where w is the weight vector normal to the hyperplane
and w0 is the bias.


Figure 14 : (a) Possible classifiers and (b) hyperplanes and their margins

There are infinite hyperplanes that can be placed between two points, or in this case, two classes.
Figure 14 (a) shows three hyperplanes that could separate the two classes, classifying the samples
correctly. Thus, the classifier's generalization power must be taken into consideration, i.e. the
classifier's capacity to work satisfactorily with data outside the training dataset. What SVM does
is choose the hyperplane with the largest margin between classes during the training process,
as seen in Figure 14 (b). Kernel functions can be used with SVM in order to enable the classifier
to deal with non-linearly separable classes. These functions modify the feature space, trying to
transform the problem into a linearly separable one. Some of the most popular kernel functions in
the literature are the sigmoidal and the RBF (Radial Basis Function) kernels.
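
To make the training step concrete, here is a minimal scikit-learn sketch of fitting an SVM on
feature vectors of the kind produced by a descriptor such as HOG; the synthetic data and the RBF
kernel choice are illustrative assumptions.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=1.0, size=(100, 50))    # class w1: e.g. "pedestrian" descriptors
X_neg = rng.normal(loc=-1.0, size=(100, 50))   # class w2: e.g. "background" descriptors
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 100 + [0] * 100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = SVC(kernel="rbf", C=1.0)                 # maximum-margin classifier, RBF kernel
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))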

4.4 Image Recognition

The human eye sees an image as a set of signals, interpreted by the brain’s visual cortex. The
outcome is an experience of a scene, linked to objects and concepts that are retained in memory.
Image recognition imitates this process. Computers ‘see’ an image as a set of vectors (color
annotated polygons) or a raster (a canvas of pixels with discrete numerical values for colors).

In the process of neural network image recognition, the vector or raster encoding of the image is
turned into constructs that depict physical objects and features. Computer vision systems can
logically analyze these constructs, first by simplifying images and extracting the most important
information, then by organizing data through feature extraction and classification.
Figure 15 : Traditional Neural Network vs 3D CNN structure

Finally, computer vision systems use classification or other algorithms to make a decision about the
image or a part of it – which category it belongs to, or how it can best be described.

4.4.1 Algorithms

One type of image recognition algorithm is an image classifier. It takes an image (or part of an
image) as an input and predicts what the image contains. The output is a class label, such as dog,
cat or table. The algorithm needs to be trained to learn and distinguish between classes.

In a simple case, to create a classification algorithm that can identify images with dogs, you’ll
train a neural network with thousands of images of dogs, and thousands of images of
backgrounds without dogs. The algorithm will learn to extract the features that identify a “dog”
object and correctly classify images that contain dogs. While most image recognition algorithms
are classifiers, other algorithms can be used to perform more complex activities. For example, a
Recurrent Neural Network can be used to automatically write captions describing the content of
an image.

4.4.2 Image Data Pre-Processing Steps for Neural Networks

Neural network image recognition algorithms rely on the quality of the dataset – the images used
to train and test the model. Here are a few important parameters and considerations for image
data preparation.
Image size: Higher quality images give the model more information but require more neural
network nodes and more computing power to process.
The number of images: The more data you feed to a model, the more accurate it will be, but ensure
the training set represents the real population.
Number of channels: Grayscale images have a single channel, and color images typically have 3
color channels (Red, Green, Blue / RGB), with colors represented in the range [0, 255].
Aspect ratio: Ensure the images have the same aspect ratio and size. Typically, neural network
models assume a square input image.
Image scaling: Once all images are squared you can scale each image. There are many up-scaling
and down-scaling techniques, which are available as functions in deep learning libraries.
Mean, standard deviation: You can look at the ‘mean image’ by calculating the mean values for
each pixel, across all training examples, to obtain information on the underlying structure in the
images.
Normalizing image inputs: Normalization ensures that all input parameters (pixels in this case)
have a uniform data distribution and makes convergence speedier when you train the network. You
can conduct data normalization by subtracting the mean from each pixel and then dividing the
outcome by the standard deviation.
Dimensionality reduction: You can decide to collapse the RGB channels into a single gray-scale
channel. You may want to reduce other dimensions if you intend to make the neural network
invariant to that dimension or to make training less computationally intensive.
Data augmentation: This involves augmenting the existing dataset with perturbed versions of the
current images, including scaling and rotating. This exposes the neural network to a wider variety
of variations, so it is less likely to learn unwanted characteristics of the dataset.
Table 2 : Parameters for image data preparation
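
As a small illustration of the resizing, scaling and normalization steps listed above, the following
sketch prepares a set of images for a neural network; the directory path and the 224x224 target
size are illustrative assumptions.

import glob
import cv2
import numpy as np

paths = glob.glob("dataset/*.jpg")                 # hypothetical image directory
images = []
for p in paths:
    img = cv2.imread(p)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (224, 224))              # square input of a fixed size
    images.append(img.astype("float32") / 255.0)   # scale pixel values to [0, 1]

data = np.stack(images)                            # shape (N, 224, 224, 3)
mean = data.mean(axis=0)                           # the 'mean image' over the set
std = data.std(axis=0) + 1e-7
data_normalized = (data - mean) / std              # subtract mean, divide by std
print(data_normalized.shape)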

4.4.3 Limitations of Regular Neural Networks for Image Recognition

Traditional neural networks use a fully connected architecture, as illustrated below, where every
neuron in one layer connects to all the neurons in the next layer. When it comes to processing
image data:

For an average image with a few hundred pixels on each side and three channels, a traditional
fully connected neural network will generate millions of parameters, which can lead to overfitting.
The model would also be very computationally intensive, and it may be difficult to interpret results,
debug and tune the model to improve its performance.
Figure 16 : Fully connected neural networks
4.5 Datasets

4.5.1 CNN Dataset

Figure 17 : without_mask Dataset

Figure 18 : with_mask Dataset


4.5.2 Pedestrian detection input images

Figure 19 : Input images for Pedestrian detection

4.6 Technology used

4.6.1 TensorFlow

TensorFlow is a free and open-source software library for dataflow and differentiable
programming across a range of tasks. It is a symbolic math library, and is also used for machine
learning applications such as neural networks. It is an open source artificial intelligence library,
using data flow graphs to build models. It allows developers to create large-scale neural
networks with many layers. TensorFlow is mainly used for: Classification, Perception,
Understanding, Discovering, Prediction and Creation.
4.6.2 Caffe

Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is
developed by Berkeley AI Research (BAIR) and by community contributors. Yangqing Jia
created the project during his PhD at UC Berkeley. Caffe is released under the BSD 2-Clause
license.

Expressive architecture encourages application and innovation. Models and optimization are
defined by configuration without hard-coding. Switch between CPU and GPU by setting a single
flag to train on a GPU machine then deploy to commodity clusters or mobile devices.

Extensible code fosters active development. In Caffe’s first year, it was forked by over 1,000
developers and had many significant changes contributed back. Thanks to these contributors the
framework tracks the state-of-the-art in both code and models.

Speed makes Caffe perfect for research experiments and industry deployment. Caffe can process
over 60M images per day with a single NVIDIA K40 GPU*. That’s 1 ms/image for inference
and 4 ms/image for learning and more recent library versions and hardware are faster still. We
believe that Caffe is among the fastest convnet implementations available.

4.6.3 Keras

Keras is an open-source neural-network library written in Python. It is capable of running on top
of TensorFlow, Microsoft Cognitive Toolkit, R, Theano, or PlaidML. Designed to enable fast
experimentation with deep neural networks, it focuses on being user-friendly, modular, and
extensible.

Keras contains numerous implementations of commonly used neural-network building blocks
such as layers, objectives, activation functions and optimizers, as well as a host of tools that make
working with image and text data easier, simplifying the coding necessary for writing deep neural
network code. The code is hosted on GitHub, and community support forums include the GitHub
issues page and a Slack channel.

In addition to standard neural networks, Keras has support for convolutional and recurrent neural
networks. It supports other common utility layers like dropout, batch normalization, and pooling.

4.6.4 OpenCV interface

OpenCV (Open Source Computer Vision Library) is an open source computer vision and
machine learning software library. OpenCV was built to provide a common infrastructure for
computer vision applications and to accelerate the use of machine perception in commercial
products. Being a BSD-licensed product, OpenCV makes it easy for businesses to utilize and
modify the code.

The library has more than 2500 optimized algorithms, which includes a comprehensive set of
both classic and state-of-the-art computer vision and machine learning algorithms. These
algorithms can be used to detect and recognize faces, identify objects, classify human actions in
videos, track camera movements, track moving objects, extract 3D models of objects, produce
3D point clouds from stereo cameras, stitch images together to produce a high resolution image
of an entire scene, find similar images from an image database, remove red eyes from images
taken using flash, follow eye movements, recognize scenery and establish markers to overlay it
with augmented reality, etc.

5. Methodology
This section elaborates on the methodologies followed in our experiments on pedestrian detection and face mask detection using CNNs. A brief overview of this section is as follows: we first describe the CNN procedure and model building using Keras, followed by the implementation of the face mask detector; we then introduce the pedestrian detection algorithm and its implementation.

5.1 Face Mask Detection using CNN

• Our set of tensorflow.keras imports allows for:

1. Data augmentation
2. Loading the MobileNetV2 classifier (we will fine-tune this model with pre-trained ImageNet weights)
3. Building a new fully connected (FC) head
4. Pre-processing
5. Loading image data

• Scikit-learn (sklearn) is used for binarizing class labels, segmenting our dataset, and printing a classification report.
• The imutils paths implementation helps us find and list images in our dataset, and we use matplotlib to plot our training curves. A representative import block is sketched below.
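
A sketch of what such an import block might look like (these are standard tensorflow.keras, scikit-learn and imutils entry points; the exact selection in our final script may differ slightly):

```python
# Representative imports for the face mask detection pipeline (illustrative).
from tensorflow.keras.preprocessing.image import ImageDataGenerator, img_to_array, load_img
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
from tensorflow.keras.layers import AveragePooling2D, Flatten, Dense, Dropout, Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from imutils import paths
import matplotlib.pyplot as plt
import numpy as np
import cv2
```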

Figure 20 : Block Diagram of Face Mask Detector System


Step 1: Data Visualization

In the first step, we visualize the total number of images in our dataset in both categories. We can see that there are 690 images in the ‘yes’ class and 686 images in the ‘no’ class.

Figure 21 : with_mask Dataset


Figure 22 : without_mask Dataset

Step 2: Data Augmentation

In the next step, we augment our dataset to include a larger number of images for training. In this step of data augmentation, we rotate and flip each of the images in our dataset. After data augmentation, we have a total of 2751 images, with 1380 images in the ‘yes’ class and 1371 images in the ‘no’ class.
Figure 23 : Augmented Dataset
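
Whether the augmented images are written out as an enlarged dataset (as the counts above suggest) or generated on the fly during training is an implementation choice. A minimal on-the-fly sketch using Keras' ImageDataGenerator, with illustrative parameter values, is:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rotation and flipping as described above; the exact ranges are illustrative.
aug = ImageDataGenerator(
    rotation_range=20,       # random rotations of up to 20 degrees
    zoom_range=0.15,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.15,
    horizontal_flip=True,    # random horizontal flips
    fill_mode="nearest")

# 'aug' is later passed to model.fit via aug.flow(trainX, trainY, batch_size=32)
```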

Step 3 : Training and testing Split

It is typical to allocate a larger percentage of the data for training and a smaller percentage for testing. Scikit-learn provides a handy train_test_split function which splits the data for us. trainX and testX make up the image data itself, while trainY and testY hold the labels. A call to fit_transform finds all unique class labels in trainY and then transforms them into one-hot encoded labels; a call to just .transform on testY performs only the one-hot encoding, since the unique set of possible class labels was already determined by the call to .fit_transform.

• The number of images with facemask in the training set labelled 'with_mask': 1104
• The number of images with facemask in the test set labelled 'with_mask': 276
• The number of images without facemask in the training set labelled 'without_mask': 1096
• The number of images without facemask in the test set labelled 'without_mask': 275
Figure 24 : Splitting of data into Training and Testing set
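
A minimal sketch of this split and label-encoding step, assuming the data and labels arrays have already been prepared (see the loading loop under Step 5):

```python
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

# 80% of the images for training, 20% held back for testing
(trainX, testX, trainY, testY) = train_test_split(
    data, labels, test_size=0.20, random_state=42)

lb = LabelBinarizer()
# With only two classes LabelBinarizer yields a single 0/1 column,
# so to_categorical expands it into two-column one-hot labels.
trainY = to_categorical(lb.fit_transform(trainY))  # learn the class set, then encode
testY = to_categorical(lb.transform(testY))        # reuse the class set learned above
```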

Step 4.2: Train a deep learning model using our training data and the compiled model. Now that our Keras model is compiled, we can “fit” (i.e., train) it on our training data.

Figure 25 : Training the model
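
A sketch of the compile-and-fit step, assuming the fine-tuned model from Step 6, the augmentation generator aug, and the arrays from the split above; the hyper-parameters are illustrative:

```python
from tensorflow.keras.optimizers import Adam

INIT_LR, EPOCHS, BS = 1e-4, 20, 32   # learning rate, epochs, batch size (illustrative)

model.compile(loss="binary_crossentropy",
              optimizer=Adam(learning_rate=INIT_LR),
              metrics=["accuracy"])

# Train on augmented batches; the held-out test set doubles as validation data.
H = model.fit(aug.flow(trainX, trainY, batch_size=BS),
              steps_per_epoch=len(trainX) // BS,
              validation_data=(testX, testY),
              epochs=EPOCHS)
```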

Step 4.3 : After we fit our model, we can use our testing data to make predictions and generate a classification report. We have trained our model; now we need to evaluate it on our testing data.

Figure 26 : Testing the model


It’s important that we evaluate our testing data so we can obtain an unbiased (or as close to
unbiased as possible) representation of how well our model is performing with data it has never
been trained on.
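
A sketch of this evaluation step, reusing the model, test arrays and label binarizer from the previous steps:

```python
from sklearn.metrics import classification_report
import numpy as np

# Predict class probabilities for every image in the test set
predIdxs = model.predict(testX, batch_size=32)
predIdxs = np.argmax(predIdxs, axis=1)            # index of the most likely class

# Compare predictions against the ground-truth one-hot labels
print(classification_report(testY.argmax(axis=1), predIdxs,
                            target_names=lb.classes_))
```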

Step 4.4 Evaluating our model

Figure 27 : Accuracy of the model

Figure 28 : Plot of # Epoch v/s Accuracy


Step 5 : Implementing the model

1. Initialize lists for our data and labels . These will later become NumPy arrays.
2. Grab imagePaths and randomly shuffle them.
3. Begin looping over all imagePaths in our dataset.
4. For each imagePath , we proceed to:
a. Load the image into memory.
b. Resize the image to 32x32 pixels as well as flatten the image.
c. Append the resized image to data .
d. Extract the class label of the image from the path and add it to the
labels list. The labels list contains the classes that correspond to each
image in the data list.
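
A minimal sketch of this loading loop; the dataset directory name and the 32×32 flattened representation follow the description above (the MobileNetV2 fine-tuning path instead resizes to 224×224 and applies preprocess_input):

```python
import os
import random
import cv2
from imutils import paths

data, labels = [], []

# Grab all image paths under the dataset directory (path is an assumption) and shuffle them
imagePaths = sorted(list(paths.list_images("dataset")))
random.seed(42)
random.shuffle(imagePaths)

for imagePath in imagePaths:
    # Load the image, resize it to 32x32 pixels and flatten it into a 1-D vector
    image = cv2.imread(imagePath)
    image = cv2.resize(image, (32, 32)).flatten()
    data.append(image)

    # The class label is taken from the directory name, e.g. dataset/with_mask/xyz.png
    labels.append(imagePath.split(os.path.sep)[-2])
```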

Step 6 : Fine Tuning

Fine-tuning is a multi-step process:

1. Remove the fully connected nodes at the end of the network (i.e., where the
actual class label predictions are made).
2. Replace the fully connected nodes with freshly initialized ones.
3. Freeze the earlier CONV layers in the network (ensuring that any robust
features previously learned by the CNN are not destroyed).
4. Start training, but only train the FC layer heads.
5. Optionally unfreeze some/all of the CONV layers in the network and perform
a second pass of training.
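
A sketch of how such a fine-tuned network can be assembled in Keras, following the MobileNetV2 setup described earlier (the layer sizes of the new head are illustrative):

```python
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import AveragePooling2D, Dense, Dropout, Flatten, Input
from tensorflow.keras.models import Model

# Load MobileNetV2 with ImageNet weights, dropping its original fully connected head
baseModel = MobileNetV2(weights="imagenet", include_top=False,
                        input_tensor=Input(shape=(224, 224, 3)))

# Build a freshly initialised FC head on top of the convolutional base
headModel = baseModel.output
headModel = AveragePooling2D(pool_size=(7, 7))(headModel)
headModel = Flatten(name="flatten")(headModel)
headModel = Dense(128, activation="relu")(headModel)
headModel = Dropout(0.5)(headModel)
headModel = Dense(2, activation="softmax")(headModel)   # with_mask / without_mask

model = Model(inputs=baseModel.input, outputs=headModel)

# Freeze the earlier CONV layers so that only the new FC head is trained first
for layer in baseModel.layers:
    layer.trainable = False
```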

Step 7: Labeling the Information

After building the model, we label the two output probabilities for our results [‘0’ as ‘without_mask’ and ‘1’ as ‘with_mask’]. We also set the bounding-rectangle colour using BGR values [‘RED’ for ‘without_mask’ and ‘GREEN’ for ‘with_mask’].

labels_dict={0:'without_mask',1:'with_mask'}

color_dict={0:(0,0,255),1:(0,255,0)}
Step 8: Importing the Face detection Program

After this, we intend to use the model to detect whether we are wearing a face mask using our PC’s webcam. For this, we first need to implement face detection. Here we use the Haar feature-based cascade classifier for detecting the features of the face. This cascade classifier, provided by OpenCV for detecting frontal faces, was trained on thousands of images. The corresponding .xml file needs to be downloaded and used for detecting the face.
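
A minimal sketch of loading and applying this cascade classifier (the cascade file ships with the opencv-python package; grabbing a single frame here is only for illustration):

```python
import cv2

# Load OpenCV's bundled frontal-face Haar cascade (the .xml file mentioned above)
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

webcam = cv2.VideoCapture(0)             # 0 = default camera
ret, frame = webcam.read()               # grab one frame for illustration
if ret:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    # 'faces' is a list of (x, y, w, h) rectangles, one per detected face
webcam.release()
```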

Step 9: Detecting the Faces with and without Masks

In the last step, we use the OpenCV library to run an infinite loop over our web camera feed, in which we detect the face using the cascade classifier. The line webcam = cv2.VideoCapture(0) denotes the usage of the webcam. The model predicts the probability of each of the two classes ([without_mask, with_mask]). Based on which probability is higher, the corresponding label is chosen and displayed around our faces. This is executed with the following steps:

1. Load an input image from disk.
2. Detect faces in the image.
3. Apply our face mask detector to classify the face as either with_mask or
without_mask.
4. Upon loading our --image from disk, we make a copy and grab the frame
dimensions for future scaling and display purposes.
5. Pre-processing is handled by OpenCV’s blobFromImage function. We resize
to 300×300 pixels and perform mean subtraction.
6. Face detection is then performed to localize where in the image all the
faces are.
7. Once we know where each face is predicted to be, we ensure it meets the
--confidence threshold before we extract the face ROI.
8. Here, we loop over our detections and extract the confidence to measure
against the --confidence threshold.
9. We then compute the bounding box value for a particular face and ensure that
the box falls within the boundaries of the image.
10. Run the face ROI through our MaskNet model:
11. Extract the face ROI via NumPy slicing.
12. Pre-process the ROI the same way we did during training.
13. Perform mask detection to predict with_mask or without_mask.

Outputs -
Figure 29 : Output of Face Mask Detection
5.2 Pedestrian Detection using HOG

OpenCV has built-in methods to perform pedestrian detection. OpenCV ships with a pre-trained
HOG + Linear SVM model that can be used to perform pedestrian detection in both images and
video streams.
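
A minimal sketch of this built-in detector; the image path and the detectMultiScale parameters are illustrative:

```python
import cv2
import imutils

# Initialise the pre-trained HOG + Linear SVM pedestrian detector that ships with OpenCV
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

image = cv2.imread("pedestrians.jpg")                     # path is an assumption
image = imutils.resize(image, width=min(400, image.shape[1]))

# Sliding-window detection over an image pyramid
(rects, weights) = hog.detectMultiScale(image, winStride=(4, 4),
                                        padding=(8, 8), scale=1.05)

# Draw a bounding box around every detected person
for (x, y, w, h) in rects:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
```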

Figure 30 : Flow Chart of Human detection


Step 1: Preprocessing the data

Preprocessing data is a crucial step in any machine learning project, and that is no different when working with images. We need to preprocess the image and bring the aspect ratio down to 1:2; the image size should preferably be 64×128 pixels. This is because we divide the image into 8×8 and 16×16 patches to extract the features, and having the specified size (64×128) keeps all our calculations simple.

Figure 31 : Preprocessing the data

Step 2: Calculate the Gradient Images


The horizontal and vertical gradients are first calculated, since they are needed to build the histogram of gradients. This is easily achieved by filtering the image with the one-dimensional kernels [-1, 0, 1] and its transpose. The x-gradient fires on vertical lines and the y-gradient fires on horizontal lines; the gradient magnitude fires wherever there is a sharp change in intensity, and none of them fire where the region is smooth.

Figure 32 : Left: Absolute value of x-gradient. Center: Absolute value of y-gradient. Right: Magnitude of gradient.

Figure 33 : Horizontal and vertical gradients

The gradient image removed a lot of non-essential information ( e.g. constant colored
background ), but highlighted outlines. At every pixel, the gradient has a magnitude and a
direction. For color images, the gradients of the three channels are evaluated (as shown in the
figure above). The magnitude of gradient at a pixel is the maximum of the magnitude of
gradients of the three channels, and the angle is the angle corresponding to the maximum
gradient.
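
These gradients can be computed in OpenCV with a one-tap Sobel filter, which applies exactly the [-1, 0, 1] kernels described above (the file name is an assumption):

```python
import cv2
import numpy as np

img = cv2.imread("patch_64x128.png")          # a 64x128 crop; path is an assumption
img = np.float32(img) / 255.0

# Horizontal and vertical gradients using the 1-D kernels [-1, 0, 1]
gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=1)
gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=1)

# Per-pixel gradient magnitude and direction (in degrees), per colour channel
mag, angle = cv2.cartToPolar(gx, gy, angleInDegrees=True)
```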

Step 3: Calculate Histogram of Gradients in 8×8 cells

In this step, the image is divided into 8×8 cells and a histogram of gradients is calculated for each 8×8 cell. We divide the image into 8×8 cells because, in a photo of a pedestrian scaled to 64×128, an 8×8 cell is big enough to capture interesting features (e.g. the face, the top of the head, etc.). An 8×8 image patch contains 8×8×3 = 192 pixel values. The gradient of this patch contains 2 values (magnitude and direction) per pixel, which adds up to 8×8×2 = 128 numbers.
Figure 34 : 8*8 cells of HOG

Figure 35 : Center : The RGB patch and gradients represented using arrows. Right : The gradients in the same patch represented
as numbers

The histogram is essentially a vector (or an array) of 9 bins (numbers) corresponding to angles 0, 20, 40, 60 … 160. The image in the center shows the patch of the image overlaid with arrows showing the gradient: the arrow shows the direction of the gradient and its length shows the magnitude. The direction of the arrows points to the direction of change in intensity and the magnitude shows how big the difference is. On the right, we see the raw numbers representing the gradients in the 8×8 cells, with angles between 0 and 180 degrees. These are called “unsigned” gradients because a gradient and its negative are represented by the same numbers. In other words, a gradient arrow and the one 180 degrees opposite to it are considered the same. Empirically it has been shown that unsigned gradients work better than signed gradients for pedestrian detection.

Step 4: creation of histogram of gradients in these 8×8 cells

The histogram contains 9 bins corresponding to angles 0, 20, 40 … 160. The following figure illustrates the process. We are looking at the magnitude and direction of the gradient of the same 8×8 patch as in the previous figure. A bin is selected based on the direction, and the vote (the value that goes into the bin) is selected based on the magnitude. Consider the pixel encircled in blue: it has an angle (direction) of 80 degrees and a magnitude of 2, so it adds 2 to the 5th bin. The gradient at the pixel encircled in red has an angle of 10 degrees and a magnitude of 4; since 10 degrees is halfway between 0 and 20, the vote by the pixel splits evenly into the two bins. If the angle is greater than 160 degrees, it lies between 160 and 180, and the angle wraps around, making 0 and 180 equivalent. So in the example below, the pixel with angle 165 degrees contributes proportionally to the 0 degree bin and the 160 degree bin.

Figure 36 : Histogram of Gradients

The contributions of all the pixels in the 8×8 cell are added up to create the 9-bin histogram. For the patch above, the histogram has a lot of weight near 0 and 180 degrees (in our representation, the y-axis is 0 degrees), which is just another way of saying that in this patch the gradients are pointing either up or down.
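
The voting scheme above can be expressed as a small helper; this is an illustrative sketch (not the code actually used by OpenCV), with unsigned angles and linear splitting of each vote between the two nearest bins:

```python
import numpy as np

def cell_histogram(mag, angle, nbins=9):
    """9-bin histogram of unsigned gradients for a single 8x8 cell.

    mag and angle are 8x8 arrays of gradient magnitude and direction (degrees).
    Each pixel's vote (its magnitude) is split linearly between the two nearest
    bins (0, 20, ..., 160), and angles wrap around at 180 degrees.
    """
    bin_width = 180.0 / nbins                     # 20 degrees per bin
    hist = np.zeros(nbins)
    for m, a in zip(mag.ravel(), angle.ravel()):
        a = a % 180.0                             # unsigned gradient
        lower = int(a // bin_width) % nbins       # e.g. 80 degrees -> bin 4 (the 5th bin)
        upper = (lower + 1) % nbins               # 165 degrees splits between 160 and 0
        frac = (a - lower * bin_width) / bin_width
        hist[lower] += m * (1.0 - frac)
        hist[upper] += m * frac
    return hist
```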

Step 5: Normalize gradients in 16×16 cell

We “normalize” the histograms so that they are not affected by lighting variations, because normalizing a vector removes the scale. We normalize over a bigger block of 16×16. A 16×16 block contains 4 histograms, which can be concatenated to form a 36×1 element vector, and it can be normalized just the way a 3×1 vector is normalized. The window is then moved by 8 pixels (figure below), a normalized 36×1 vector is calculated over this window, and the process is repeated.
Figure 37 : Movement of window by 8 pixels
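
In code, the block normalization amounts to concatenating the four cell histograms and dividing by the vector's L2 norm; h1 to h4 below are hypothetical names for the histograms of the four cells in one 16×16 block:

```python
import numpy as np

block = np.concatenate([h1, h2, h3, h4])          # 4 cells x 9 bins = 36 values
block = block / (np.linalg.norm(block) + 1e-6)    # L2 normalisation (epsilon avoids /0)
```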

Step 6 : Calculate the HOG feature vector

To calculate the final feature vector for the entire image patch, the 36×1 vectors are concatenated into one giant vector. The calculation is done in the following way:

1. How many positions of the 16×16 blocks do we have? There are 7 horizontal
and 15 vertical positions, making a total of 7 × 15 = 105 positions.
2. Each 16×16 block is represented by a 36×1 vector. So when we concatenate
them all into one giant vector we obtain a 36 × 105 = 3780 dimensional vector.
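
The 3780-dimensional result can be verified with OpenCV's HOGDescriptor, whose default parameters (64×128 window, 16×16 blocks, 8-pixel block stride, 8×8 cells, 9 bins) match the setup described above; the image path is an assumption:

```python
import cv2

hog = cv2.HOGDescriptor()                               # default parameters as above
patch = cv2.resize(cv2.imread("person.png"), (64, 128))
features = hog.compute(patch)
print(features.size)     # 3780 values: 105 block positions x 36 numbers per block
```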
Output for Pedestrian Detection

Figure 38 : Output of pedestrian detection


6. Conclusion
This section summarizes major results, discusses certain choices made in the experiments along
with their justification and also discusses potential future work that can be done as an extension
to our experiments.

We have successfully learned how to create a COVID-19 face mask detector using OpenCV,
Keras/TensorFlow, and Deep Learning. To create our face mask detector, we trained a two-class
model of people wearing masks and people not wearing masks. We fine-tuned MobileNetV2 on
our mask/no mask dataset and obtained a classifier that is ~99% accurate.

We then took this face mask classifier and applied it to both images and real-time video streams
by:

1)Detecting faces in images/video

2)Extracting each individual face

3)Applying our face mask classifier

Our face mask detector is accurate, and since we used the MobileNetV2 architecture, it is also computationally efficient, making it easier to deploy the model to embedded systems (Raspberry Pi, Google Coral, NVIDIA Jetson Nano, etc.).

However, the face detector may fail to detect faces that are heavily obscured by a mask. The reason is that the dataset used to train the face detector did not contain ample images of people wearing face masks.

Therefore, if a large portion of the face is occluded, our face detector will likely fail to detect the
face.

The pedestrian detector detects a person (from images, videos and a real-time webcam) in different postures using HOG features and an SVM classifier. It can help identify correct and incorrect postures, which can further be used to reduce spine and muscle strain, headaches, fatigue, and breathing problems, since many such problems today stem from wrong working posture.

The HOG descriptor of an image patch is usually visualized by plotting the 9×1 normalized histograms in the 8×8 cells. The dominant direction of the histogram captures the shape of the person. Unfortunately, there is no easy way to visualize the HOG descriptor in OpenCV.

7. Future Work

Our work can be extended further along the lines of our originally proposed plan. The software can be deployed on a UGV that acts as a surveillance robot.

The same technology can be combined into a single software system to enable live safety monitoring in public places such as schools, colleges, streets, malls and supermarkets.

As you can see from the results sections above, our face mask detector is working quite well
despite:

1)Having limited training data

2)The with_mask class being artificially generated.

To improve our face mask detection model further, we should gather actual images (rather than
artificially generated images) of people wearing masks.

While our artificial dataset worked well in this case, there’s no substitute for the real thing.

Secondly, we should also gather images of faces that may “confuse” our classifier into thinking
the person is wearing a mask when in fact they are not — potential examples include shirts
wrapped around faces, bandana over the mouth, etc. All of these are examples of something that
could be confused as a face mask by our face mask detector.

Finally, we should consider training a dedicated two-class object detector rather than a simple
image classifier. Our current method of detecting whether a person is wearing a mask or not is a
two-step process:

Step 1: Perform face detection

Step 2: Apply our face mask detector to each face

The problem with this approach is that a face mask, by definition, obscures part of the face. If
enough of the face is obscured, the face cannot be detected, and therefore, the face mask detector
will not be applied. To circumvent that issue, we should train a two-class object detector that
consists of a with_mask class and without_mask class. Combining an object detector with a
dedicated with_mask class will allow improvement of the model in two respects.

First, the object detector will be able to naturally detect people wearing masks that otherwise
would have been impossible for the face detector to detect due to too much of the face being
obscured. Secondly, this approach reduces our computer vision pipeline to a single step — rather
than applying face detection and then our face mask detector model, all we need to do is apply
the object detector to give us bounding boxes for people both with_mask and without_mask in a
single forward pass of the network. Not only is such a method more computationally efficient,
it’s also more “elegant” and end-to-end.

The accuracy can be increased and the loss can be minimized by using a set of different algorithms. For example, after implementing the HOG algorithm, a model compression technique can be used for better results. With this, we conclude our work and hope that the reader has gained sufficient insights into the domains, with ideas to carry our work forward.
8. References

[1] M. Turk and A. Pentland, “Eigenfaces for recognition,” J. of Cognitive Neuroscience, vol.3,
no. 1, pp. 71-86, 1991.

[2] M. Kirby and L. Sirovich, “Application of the Karhunen-Loeve procedure for the
characterization of human faces,” IEEE Trans. Pattern Analysis and Machine Intelligence,
vol.12, no.1, pp. 103-108, Jan. 1990.

[3] Henry A. Rowley, Shumeet Baluja, and Takeo Kanade. “Neural network based face
detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(I), pp.23-38,
1998.

[4] Deep Neural Networks Applications in Handwriting Recognition

[5] Face Mask Detector with OpenCV, Keras/TensorFlow, and Deep Learning

[7] COVID-19: Face Mask Detection using TensorFlow and OpenCV

[8] Q. Cao, L. Shen, W. Xie, O. M. Parkhi and A. Zisserman, "VGGFace2: A Dataset for
Recognising Faces across Pose and Age," 2018 13th IEEE International Conference on
Automatic Face & Gesture Recognition (FG 2018), Xi'an, 2018, pp. 67-74.

[9] Histograms of Oriented Gradients for Human Detection

[10] Musab Coşkun, Ayşegül Uçar, Özal Yıldırım, Yakup Demir.: “Face Recognition Based on
Convolutional Neural Network” Conference Paper. November 2017

[11] O. Obulesu ; M. Mahendra ; M. ThrilokReddy, “Machine Learning Techniques and Tools: A Survey”, 2018 International IEEE Conference on Inventive Research in Computing Applications (ICIRCA)

[12] Saad Albawi ; Tareq Abed Mohammed ; Saad Al-Zawi, “Understanding of a convolutional
neural network”, 2017 International IEEE Conference on Engineering and Technology (ICET)

[13] Rahul Chauhan ; Kamal Kumar Ghanshala ; R.C Joshi, “Convolutional Neural Network
(CNN) for Image Detection and Recognition”, 2018 First International IEEE Conference on
Secure Cyber Computing and Communication (ICSCC)
[14] Nadia Jmour ; Sehla Zayen ; Afef Abdelkrim,”Convolutional neural networks for image
classification”, 2018 International IEEE Conference on Advanced Systems and Electric
Technologies

[15] Jiudong Yang ; Jianping Li, “Application of deep convolutional neural network”, 2017 14th
International IEEE Computer Conference on Wavelet Active Media Technology and Information
Processing

[16] Le Kang ; Jayant Kumar ; Peng Ye ; Yi Li ; David Doermann, “Convolutional Neural Networks for Document Image Classification”, 2014 22nd International IEEE Conference on Pattern Recognition

[17] Harris, C., Stephens, M.: ‘A combined corner and edge detector’. Alvey Vision Conference (1988) 147–151

[18] Mikolajczyk, K., Schmid, C.: An affine invariant interest point detector. 7th European
Conference on Computer Vision 1 (2002) 128–142

[19] Lindeberg, T.: Feature detection with automatic scale selection. International Journal of Computer Vision 30 (1998) 79–116; Lowe, D.G.: Local feature view clustering for 3D object recognition. Conference on Computer Vision and Pattern Recognition (2001) 682–688

[20] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto,
M., & Adam, H. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile
Vision Applications. CoRR, abs/1704.04861.

[21] Cheng, Jian & Wang, Peisong & Li, Gang & Hu, Qinghao & Lu, Hanqing. (2018). A
Survey on Acceleration of Deep Convolutional Neural Networks.

[22] Y. Taigman, M. Yang, M. Ranzato and L. Wolf, "DeepFace: Closing the Gap to Human-
Level Performance in Face Verification," 2014 IEEE Conference on Computer Vision and
Pattern Recognition, Columbus, OH, 2014, pp. 1701-1708

[23] Miscellaneous images from the web

[24] OpenCV Programming Guide
