AI Facial Recognition System
Facial recognition is nowadays one of the most widely used categories of biometric security, distinguishing itself from other categories, such as fingerprint recognition and retina or iris recognition, by its security and speed. The technology is mainly used in electronic devices, airport control, banking, health care, marketing, and advertising.
This thesis project aimed to build a facial recognition system that could recognize people through a camera and unlock door locks. Recognition results were sent to a database and could be analyzed by users after a successful login.
The prototype could successfully recognize human faces and activate the electronic components. It performed fast and could log information about recognized people in the Google database.
With further development, the prototype would implement more extensive algorithms to distinguish between pictures and real faces seen through the camera. These algorithms would make the prototype faster, more secure, and suitable for commercial purposes.
Contents

List of Abbreviations

1 Introduction
2 Artificial Intelligence
3 Facial Recognition
4 Implementation
   4.2.2.4 Database
5 Conclusion

References

Appendices
Appendix 1: The encodings of an image in the dataset
Appendix 2: The circuit diagram of the project
List of Abbreviations
2D: 2-dimensional
3D: 3-dimensional
1 Introduction
The goal of the thesis project was to build a facial recognition system that could recognize people through a camera and unlock door locks. The prototype would be fixed to a door and use the camera to operate the whole circuitry. The results would be logged in the Google database and could be analyzed by users after a successful login.
The implementation of the project was accomplished in three steps. First, the facial recognition system was built using machine learning and deep learning algorithms. In the second step, the data from the facial recognition system were transmitted to the electronics circuitry to make a smart lock system. Finally, the last step was to design a user interface for the Google database that displays the attendance list.
This paper provides the knowledge required to build a facial recognition system and the mathematical formulas behind the algorithms used in the project. After the theory, the practical work is explained, in which those algorithms were put into practice; this was the major stage of the project, recognizing faces and controlling the whole electronics circuitry.
2 Artificial Intelligence
As figure 1 shows, the significant subfields of AI are machine learning, deep learning, and computer vision, which were used in the project for different purposes.
Artificial intelligence is divided into two categories: strong AI and weak AI. Weak AI is a narrow application suitable for specific tasks, for instance, virtual assistants. Strong AI, on the other hand, is a broader application with human-level intelligence. It is mainly used in advanced robotics and automation. [4.]
Supervised learning is one of the three primary methods of machine learning. It uses various algorithms that are trained on datasets to classify data or predict outputs, as illustrated in figure 2. [6.]
The supervised learning algorithms begin by being fed data and adjusting the weights until the model fits appropriately. This process ensures that the model avoids overfitting and underfitting. Over time, the algorithms learn to approximate the relationship between the input data and the labels. Once fully trained, they can observe new objects and predict the proper labels. [6.]
Classification uses an algorithm to assign test data into classes or groups. It identifies particular entities in the dataset and attempts to label those entities. The most familiar classification algorithms are support vector machines, decision trees, k-nearest neighbors, naïve Bayes, and random forests. The support vector machine is one of the most widely used algorithms and gives high test accuracy; it was therefore chosen for the project and is described in detail below. [7.]
The dimension of the hyperplane depends on the dimension of the space. If the space is two-dimensional, the hyperplane is simply a straight line. If it is three-dimensional, the hyperplane becomes a two-dimensional plane. [8.]
There are many possible hyperplanes in a plane, as shown in figure 4. In order to find the optimal hyperplane among them, a mathematical computation of the margin is needed, which is described below.
f(x) = w^T x + b    (1)
In the equation, w and b are the weight vector and the bias, respectively [9].
|w^T x + b| = 1    (2)
where x denotes the data points closest to the hyperplane, called support vectors, which are used to increase the margin and help to build the classifier [9].
The next step is to compute the distance between a point x and the hyperplane using basic geometry:
d = \frac{|w^T x + b|}{\|w\|}    (3)
According to the canonical hyperplane defined in equation (2), the numerator of equation (3) is equal to one. Therefore, the distance to the support vectors is
d_{sv} = \frac{|w^T x + b|}{\|w\|} = \frac{1}{\|w\|}    (4)
The margin is twice the distance between the support vectors and the hyperplane:

M = 2 d_{sv} = \frac{2}{\|w\|}    (5)
The last step is to maximize the margin, which is equivalent to minimizing a function
𝐿(𝑤) subject to some constraints. Those constraints model the requirement for the
hyperplane to classify all the data points 𝑥𝑖 correctly. Formally,
\min_{w, w_0} L(w) = \frac{\|w\|^2}{2}    (6)
In order to find the perpendicular distance between two data points, x and z, the
following formula is used.
\sqrt{\sum_{i=1}^{P} (x_i - z_i)^2} = \frac{|\mu(z)|}{\|w\|_2}    (7)
Equation (7) is the Euclidean distance formula. It is used to calculate the distance between two data points and is applied in face recognition, as described in further sections. [10.]
Clustering
Clustering is one of the main unsupervised learning tasks. Its algorithms group a set of unlabeled data so that data points in the same cluster are more similar to each other than to those in other clusters. [11.]
Association
An association rule is a widely used method that explores a dataset and discovers relationships between its variables. This method is commonly used for market basket analysis, helping companies to understand the relationships between products. [11.]
Dimensionality Reduction
The model gives faulty results in the beginning. However, as long as feedback is provided to the algorithm, it favors correct feedback over incorrect feedback and improves itself for the subsequent trial. Over time, the algorithm learns and makes fewer mistakes than it used to. [6.]
Deep learning, also called deep neural networks, is a subfield of machine learning; it is essentially a neural network with more than two layers. The "deep" in deep learning refers to the depth of the layers in the neural network. The purpose of deep learning is to learn from large amounts of data and perform like a human brain. [12.]
Deep learning algorithms process unstructured data, like text and images, and automate feature extraction. For instance, an algorithm can process a set of photos of different animals and categorize them as cat, dog, and so on. It can determine which features, such as the ears or the nose, are most significant for distinguishing one animal from another. [12.]
Neural networks are at the heart of deep learning algorithms. Their name and structure are inspired by the biological neuron. A neuron in a neural network is a mathematical operation that imitates the functioning of a biological neuron, as schematized in figure 7. [13.]
As figure 7 illustrates, the input is fed into the neuron, which produces the output. To solve more complicated problems, several input neurons are used, as shown in figure 8.
z = \sum_i w_i x_i + b    (8)

a = \psi(z)    (9)
Each neuron multiplies its weights w_i by the inputs x_i, adds the bias b, and passes the sum through the activation function \psi. [14.]
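As a concrete illustration of equations (8) and (9), the following Python sketch (not part of the project code; the input values, weights, and bias are arbitrary examples) computes the output of a single neuron with a sigmoid activation function.

import numpy as np

def neuron(x, w, b):
    # Weighted sum plus bias, equation (8), followed by a sigmoid activation, equation (9)
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # example inputs
w = np.array([0.8, 0.1, -0.4])   # example weights
print(neuron(x, w, b=0.2))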
Neural networks have three main types: artificial neural networks, convolutional neural networks, and recurrent neural networks [14]. The CNN is one of the most widely used neural networks in face recognition, which is the topic of this thesis; it was therefore chosen for the project and is described in detail below.
Artificial neural networks consist of three main layers of interconnected nodes, each building upon the previous layer to optimize the prediction or categorization. These are the input, hidden, and output layers, as shown in figure 9. [13.]
The architecture of an artificial neural network starts with the input layer, which ingests the data for processing and passes it to the hidden layers, which perform all the mathematical computations. Finally, the output layer produces the result for the given inputs. [15.]
The CNN is a type of neural network that is very effective in image recognition and classification. It uses a mathematical operation on two functions that produces a third function, called convolution, as shown in figure 10. [16.]
The CNN starts its operation by converting the input image into pixels and forwarding it to filter processing. The filters used in image processing are the vertical-edge and horizontal-edge filters, and their combination extracts the edges of an object in an image. [16.] The vertical edge filter, VEF, is defined as follows:
VEF = \begin{bmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{bmatrix} = HEF^T    (10)
This filter slides over the input image to extract the vertical edges; each output value is the sum of the elementwise product in each block, as shown in figure 11. [16.]
Figure 11. The feature map after filtering the image [16]
The elementwise multiplication starts from the first 3x3 block, the block slides until all possible blocks are covered, and the output is the edges of the image, also called the feature map. The parameter s in this figure is the stride parameter of the convolutional product. A large stride produces a smaller feature map and vice versa. [16.]
When the VEF is used, the pixels on the edges of the image contribute less than those in the middle, which means that data from the edges are partly ignored. To solve this problem, padding can be added around the image so that the edge pixels are taken into account, as shown in figure 12. [16.]
Figure 12. The output, after adding padding around the image [16]
The padding parameter p in figure 12 is the number of elements added to the four sides
of an image [16].
Once the stride and the padding are defined, the CNN can be constructed layer by layer. A CNN consists of three types of layers: convolutional, pooling, and fully connected layers. [16.]
As mentioned above, the CNN derives its name from the convolution operator. The primary goal of the convolutional layer is to extract features from the input image, which can be mathematically represented as a tensor of dimensions (n_H, n_W, n_C). Here n_H is the height, n_W is the width, and n_C is the number of channels, that is, the depth of the matrices involved in the convolution. Channels refer to a specific component of an image: if the image is grayscale, it has only one channel, with pixel values in the range of 0 to 255, whereas if the image is RGB, the number of channels equals three. In this case, the filter can be represented as a tensor of dimensions (f, f, n_C).
As described above, the convolutional product between an image and a filter is a two-dimensional matrix. In the convolutional layer, each element is the sum of the elementwise multiplication of the filter, which is a cube, with the corresponding block of the image, as illustrated in figure 13. [16.]
The filter has an odd dimension f, so that each pixel can be centered, and the same number of channels as the input image [16].
In order to solve complex tasks, the convolutional product is applied using multiple filters and followed by an activation function \psi. The mathematical formula of the convolutional layer at the l-th layer is

a^{[l]} = \left[ \psi^{[l]}(conv(a^{[l-1]}, K^{(1)})), \psi^{[l]}(conv(a^{[l-1]}, K^{(2)})), \ldots, \psi^{[l]}(conv(a^{[l-1]}, K^{(n_C^{[l]})})) \right]    (15)
with
n_{H/W}^{[l]} = \left\lfloor \frac{n_{H/W}^{[l-1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}} + 1 \right\rfloor    (17)
According to these equations, the convolutional layer with multiple filters can be
summarized in figure 14. [16.]
Figure 14. Illustration of the convolutional layer with multiple filters [16]
In figure 14, p^{[l]} and s^{[l]} are the padding and stride parameters, respectively, and the learned parameters of these convolutional layers are the filters and the bias [16].
The CNN uses the pooling layer to reduce the training time and the dimensionality of each feature map; it is applied to each channel separately while still maintaining the useful information in the image. There are two often-used pooling types: max pooling and average pooling. Max pooling returns the largest element of each block of the feature map, whereas average pooling takes the average of all elements, as illustrated in figure 15, where the stride parameter is equal to two. [16.]
pool(a^{[l-1]})_{x,y,z} = \phi^{[l]}\left( \left( a^{[l-1]}_{x+i-1,\, y+j-1,\, z} \right)_{i,j \in [1, 2, \ldots, f^{[l]}]} \right)    (18)
Here, a^{[l-1]} is the input to the pooling layer, which passes through a pooling function \phi^{[l]} to produce the output a^{[l]}, as shown in figure 16. [16.]
This layer only produces the compressed version of images using the pooling function,
and it has no learned parameters [16].
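To make the pooling operation of equation (18) concrete, the following sketch (illustrative only, not the project code) applies 2x2 max pooling and average pooling with a stride of two to a small feature map.

import numpy as np

def pool2d(a, f=2, stride=2, mode="max"):
    # Apply an f x f max or average pooling window to one channel of a feature map
    out_h = (a.shape[0] - f) // stride + 1
    out_w = (a.shape[1] - f) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            block = a[i * stride:i * stride + f, j * stride:j * stride + f]
            out[i, j] = block.max() if mode == "max" else block.mean()
    return out

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 0],
                        [1, 8, 3, 4]])
print(pool2d(feature_map))                    # max pooling: [[6. 4.] [8. 9.]]
print(pool2d(feature_map, mode="average"))    # average pooling: [[3.75 2.25] [4.5 4.]]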
The fully connected layers are the main layers of the CNN; they connect every neuron in one layer to every neuron in the next layer. The primary purpose of these layers is to take the output of the convolutional and pooling layers and produce the desired result. They are the layers where the actual neural network starts: it takes in a vector a^{[i-1]} and returns a vector a^{[i]}. The formula of the fully connected layer for the j-th node of the i-th layer is
z_j^{[i]} = \sum_{l=1}^{n_{i-1}} w_{j,l}^{[i]} a_l^{[i-1]} + b_j^{[i]}    (19)

a_j^{[i]} = \psi^{[i]}(z_j^{[i]})    (20)

Here, w_{j,l}^{[i]} is the weight, b_j^{[i]} is the bias, and a^{[i-1]} is the output of the pooling layer with the dimensions (n_H^{[i-1]}, n_W^{[i-1]}, n_C^{[i-1]}). [16.]
The fully connected layers can be summarized in the illustration in figure 17.
As can be seen here, the input is flattened to a one-dimensional vector, allowing the
fully connected layers to start the operation. The formula of flattening can be expressed
as
n_{i-1} = n_H^{[i-1]} \times n_W^{[i-1]} \times n_C^{[i-1]}    (21)
This vector feeds into the fully connected layer and generates the output. The learned
parameters from this layer are the weights and the bias. [16.]
Overall, the convolutional neural network is the sequence of all these layers, as illustrated in figure 18.
Initially, the CNN extracts features from the input image by applying the convolutional and pooling layers. These features are then fed to the fully connected layers to produce the output. The output can be a label or other features of the input image, such as the 128 measurements described in further sections.
Data preprocessing
Data preprocessing is the step in which the data is transformed so that the computer can easily read it. It is also applied to increase the number of images in a given dataset. Many techniques are used in data preprocessing, such as cropping, rotation, and flipping. These techniques enable better learning, because the training set becomes larger, and they allow the algorithm to learn from different conditions.
Before the CNN is trained, the dataset is split into a training set and a test set. The training set is used to train the algorithm and consists of 80% of the dataset, while the test set is used to check the algorithm's precision. [14.]
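The sketch below illustrates these preprocessing steps with OpenCV and scikit-learn; it is not the project code, and the placeholder arrays only stand in for a real dataset.

import cv2
import numpy as np
from sklearn.model_selection import train_test_split

def augment(image):
    # Produce extra training images: a horizontal flip, a rotation, and a central crop
    flipped = cv2.flip(image, 1)
    rotated = cv2.rotate(image, cv2.ROTATE_90_CLOCKWISE)
    h, w = image.shape[:2]
    cropped = image[h // 10:-h // 10, w // 10:-w // 10]
    return [flipped, rotated, cropped]

variants = augment((np.random.rand(64, 64, 3) * 255).astype(np.uint8))

# Split a dataset into 80 % training data and 20 % test data
X = np.random.rand(100, 128)        # placeholder feature vectors
y = np.random.randint(0, 5, 100)    # placeholder labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)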
Learning algorithms
Learning algorithms aim to find the parameters that give the best prediction. For this, a loss function J is defined to measure the distance between the real and the predicted values. The training process consists of two steps: forward propagation and backward propagation. [14.]
In forward propagation, the fully connected layers receive the input data, process the information, and generate the predicted output value \hat{y}_i^{\theta} of x_i through the neural network, with some error. In this case, the loss function J is evaluated as
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}_i^{\theta}, y_i)    (22)
Here, m is the size of the training set, \theta denotes the model parameters, \mathcal{L} is the cost function, and y_i are the real values for all i = (1, 2, \ldots, m). The same process is repeated N times, where N is called the epoch number. [14.]
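For example, with a squared-error cost the loss of equation (22) is simply the mean of the per-example costs. The short sketch below (with made-up predictions and labels) shows the computation.

import numpy as np

y_hat = np.array([0.9, 0.2, 0.7])   # predicted values from forward propagation
y = np.array([1.0, 0.0, 1.0])       # real values
cost = (y_hat - y) ** 2             # per-example cost L(y_hat_i, y_i)
J = cost.mean()                     # loss J = (1/m) * sum of the costs, equation (22)
print(J)                            # 0.0466...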
Backward propagation is the method used to train neural networks. It calculates the gradients of \mathcal{L} with respect to all the network parameters and adjusts those parameters based on the error rate obtained in the previous epoch.
The convolutional neural network is fully trained when the parameters are adjusted so that training gives the minimum loss, which makes the model fast and reliable.
Activation functions are an essential part of the neural network. They determine whether a neuron should be activated. These nonlinear functions typically map the output of a given neuron to a value between 0 and 1 or between -1 and 1. The most common activation functions are defined below. [18.]
• ReLU: \psi(x) = x \cdot 1_{x \ge 0} = \max(0, x)    (23)

• Sigmoid: \psi(x) = \frac{1}{1 + e^{-x}}    (24)

• Tanh: \psi(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}    (25)

• LeakyReLU: \psi(x) = \max(\alpha x, x), where \alpha is a small positive constant
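The following sketch (not part of the project code) implements these activation functions with NumPy.

import numpy as np

def relu(x):
    return np.maximum(0, x)                              # equation (23)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))                          # equation (24)

def tanh(x):
    return (1 - np.exp(-2 * x)) / (1 + np.exp(-2 * x))   # equation (25)

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)                # small slope for negative inputs

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), sigmoid(x), tanh(x), leaky_relu(x))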
The RNN is a type of neural network that operates on sequential data and is used for natural language processing, speech recognition, language translation, and similar tasks. RNNs are derived from feedforward neural networks and can use their memory to take information from previous inputs into account when producing the current output, as shown in figure 19. [18.]
The rolled RNN represents the total predicted output, whereas the unrolled RNN represents the individual layers of the neural network, each layer mapping to a single output. [18.]
Computer vision is a field of AI that works like human vision. It uses the deep learning and machine learning algorithms described in sections 2.1 and 2.2 to enable computers to observe and understand images and videos by being fed large amounts of data, which is processed over and over until the images are recognized. [19.]
One of the well-known computer vision applications is autonomous vehicles that need
to identify people, cars, and lanes on the road in order to navigate [19].
3 Facial Recognition
Face recognition is executed in three stages: face detection, face encoding, and face
classification [22].
Face recognition starts by detecting the faces in an image, which is done with the HOG method. HOG stands for histogram of oriented gradients. The method begins by converting the image to black and white. For every pixel in the image, the surrounding pixels are examined to figure out how dark that pixel is compared to its surroundings, and an arrow is then drawn in the direction in which the image gets darker, as shown in figure 20. [22.]
This process is repeated for every single pixel in the image, so that in the end every pixel is replaced by an arrow. These arrows are called gradients, and they are obtained by combining the magnitude and the angle computed from the image. First, the gradients G_x and G_y are calculated for each pixel, and the gradient direction is obtained with the following formula. [23.]
\theta(x, y) = \arctan\left( \frac{G_y(x, y)}{G_x(x, y)} \right)    (30)
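A minimal sketch of this step (not the project code; the file name is only an example) computes the gradients of an input image with OpenCV's Sobel operator, which is one common way to approximate G_x and G_y, and derives the magnitude and the direction of equation (30) for every pixel.

import cv2
import numpy as np

gray = cv2.imread("image1.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)
Gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)   # gradient in the x direction
Gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)   # gradient in the y direction
magnitude = np.sqrt(Gx ** 2 + Gy ** 2)            # gradient strength of each pixel
direction = np.degrees(np.arctan2(Gy, Gx))        # gradient direction, as in equation (30)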
The magnitude and direction maps are divided into several cells. For each cell, a 9-point histogram is calculated, in which each bin represents a gradient intensity.
Once the histogram computation is finished for all cells, four cells are combined to form a block. This combining is done in an overlapping manner, as shown in figure 21. For all four cells in a block, the 9-point histograms are concatenated to form a 36-point feature vector. Normalization is then applied to reduce the effect of changes in contrast between images of the same face. [23.]
Figure 22 below shows the HOG face pattern extracted from many other training faces [22].
In this way, faces can easily be found in any image. If the image size is 128x64 pixels, the total number of HOG features is

T_f = 7 \times 15 \times 36 = 3780    (31)

Here, 36 is the length of the feature vector of one block, and 7 and 15 are the numbers of blocks in the horizontal and vertical directions, respectively. [23.]
The HOG method thus goes through eight steps to collect the feature vectors, which together form the HOG features of the input image.
After the person's face is detected, FaceNet is used to extract features from that face. FaceNet is a convolutional neural network published in 2015 by the Google researchers Florian Schroff, Dmitry Kalenichenko, and James Philbin. Generally, a CNN is trained to recognize pictures, objects, and digits. FaceNet, however, takes an input image of a person's face, extracts features with the convolutional and max-pooling layers described in section 2.2.2, and generates a vector of 128 measurements from the fully connected layers, as shown in figure 24. [24.]
Figure 25. Distances between embeddings of anchor, positive and negative [20]
An anchor is an image of a known person, a positive is another image of the same person, and a negative is an image of a different person. The neural network is trained so that the embedding of the anchor image is close to the positive embedding and far away from the negative embedding. [25.]
When the embeddings give close measurements, the neural network is trained and can
generate 128 measurements for any face [22].
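This training objective is commonly called the triplet loss. The sketch below is only an illustration of the idea; the margin value and the random embeddings are assumptions, not values taken from the project.

import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Penalize the case where the anchor is not closer to the positive
    # than to the negative by at least the margin
    d_pos = np.sum((anchor - positive) ** 2)   # squared distance anchor-positive
    d_neg = np.sum((anchor - negative) ** 2)   # squared distance anchor-negative
    return max(d_pos - d_neg + margin, 0.0)

anchor = np.random.rand(128)                     # 128-measurement embeddings
positive = anchor + 0.01 * np.random.rand(128)   # same person, nearly identical
negative = np.random.rand(128)                   # different person
print(triplet_loss(anchor, positive, negative))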
The last step is to compare the embedding of the test image with the embeddings of the database images. Here, the machine learning algorithm SVM can be used to classify the test image with its closest match. As described in section 2.1.1, equation (7) is used to find the distance between two data points, and the same technique can be applied to the embeddings of images: if the distance between the embeddings is small, the faces belong to the same person, and vice versa. [22.]
Overall, the face recognition system can be summarized in the following figure 26.
Figure 26. Illustration of the face recognition system (Modified from [24])
After FaceNet is trained, the database images and the test image pass through FaceNet, which generates their embeddings. These embeddings are fed into the SVM classifier, which tells whether they match or not.
4 Implementation
This section of the thesis describes the practical use of the theoretical background, the
necessary materials, tools, technologies, and the detailed workflow of the project.
4.1.1 Python
In this project, Python was used for machine learning, deep learning, mathematics, and
computer vision by taking advantage of various Python libraries such as OpenCV,
TensorFlow, and Openface.
4.1.2 OpenCV
In this thesis work, the OpenCV library was used to read the image paths, capture the video, draw the frames, and put the name of the detected face on the image.
4.1.3 TensorFlow
4.1.4 Openface
4.1.5 Firebase
Firebase is a Google backend platform that helps to build and run web and mobile applications. The platform provides tools for analytics, reporting, marketing, fixing app crashes, cloud messaging, a test lab, and authentication, as well as a real-time database, which is used in the project and described in further sections. [30.]
4.1.6 HTML/CSS/JS
HTML, CSS, and JavaScript are the languages to run the web. They all are related but
have specific functions. HTML controls the layout of the content, which provides the
structure for the web page. Then CSS applies to stylize the web page elements, mainly
targets various screen sizes to make web pages responsive. The last step is to use
javascript for adding interactivity to a web page. [31.]
4.1.7 Jetson Nano
Jetson Nano is NVIDIA’s small and powerful computer for AI purposes such as deep learning and computer vision. Figure 27 illustrates the Jetson Nano board. [32.]
The Jetson Nano board has four USB ports, an HDMI port, two connectors for CSI cameras, and a 40-pin GPIO expansion header to control electronics components. The operating voltage of the board is 5 Volts, supplied either through a barrel jack or a micro-USB port. The barrel jack delivers 4 Amps, while the micro-USB port delivers 2.5 Amps. [33.]
Jetson Nano allows running multiple neural networks in parallel for image classification,
segmentation, object detection, speech processing, and face recognition [32].
4.1.8 Arduino
This board can be integrated into electronics projects to control relays, LEDs, servos, and motors as outputs. The operating voltage is 5 Volts, while the input voltage ranges from 6 to 20 Volts. [34.]
4.2.1 Hardware
Various components and sensors were used in this project to build the fully functional
facial recognition system. Some of these components and sensors are attached to the
Arduino UNO board and others to the Jetson Nano board, as illustrated in figure 29.
Table 1 below lists all the necessary components, their quantities, and their values.
Component           Quantity   Value
Resistor            2          330 Ω
Green LED           1          -
Red LED             1          -
Relay               2          5 Volts
Buzzer              1          -
Ultrasonic sensor   1          -
OLED display        2          -
Fan                 1          5 Volts
Webcam              1          -
Wi-Fi dongle        1          -
USB cable           1          -
In this project, the ultrasonic sensor was used to measure distance. When the distance is less than 30 centimeters, the buzzer sounds and the OLED display shows the message “Please, look at the camera,” as shown in figure 30.
Resistors were used to limit the current through the green and red LEDs, which were connected to the Arduino UNO. The green LED lights up when the face is recognized, and the red LED lights up when access is denied, as shown in figure 31.
The relays were used to switch power to the solenoid locks shown in figure 32 below, which lock and unlock the door. These locks operate on 9 to 12 Volts; therefore, an 11.1 V LiPo battery was connected to supply the appropriate voltage to the solenoid locks.
The fan was attached to the Jetson Nano heat sink to cool the processor during the training process, and the webcam was used to capture the video. The Wi-Fi dongle was plugged into a USB port of the Jetson Nano to access the internet, since the Jetson Nano does not have built-in Wi-Fi. The board was powered using the 5V 2.5A Raspberry Pi adapter and shared that power with the Arduino over a USB cable. This USB cable was also used for serial communication between the two boards.
4.2.2 Software
The dataset images and the real-time face pass through the facial recognition stages. When the embeddings give close measurements in the face classification stage, the faces match, and the data is sent to the Google database. All the steps in the block diagram are explained in further sections.
In this project, AI operates to recognize faces. It starts the process by detecting the faces using the HOG method described in section 3.1. After the face image is read, the HOG function is used to generate a face pattern, as shown in listing 1.
import cv2
from skimage.feature import hog

image = cv2.imread('image1.jpg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
fd, hog_image = hog(image, orientations = 8, pixels_per_cell = (16,16),
                    cells_per_block = (1,1), visualize = True, multichannel = True)
Listing 1. A python code that generates the face pattern using the HOG function [36]
Here, the HOG function was applied with 16x16 pixels per cell and 1x1 cells per block with eight vector orientations. The output of the HOG function can be plotted using the matplotlib library, as shown in listing 2 below.
import matplotlib.pyplot as plt
from skimage import exposure

fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (8,4), sharex = True, sharey = True)
ax1.axis('off')
ax1.imshow(image, cmap = plt.cm.gray)
ax1.set_title('Input Image')
hog_image_rescaled = exposure.rescale_intensity(hog_image, in_range = (0,10))
ax2.axis('off')
ax2.imshow(hog_image_rescaled, cmap = plt.cm.gray)
ax2.set_title('Histogram of Oriented Gradients')
plt.show()
Listing 2. A python code that plots the output from the HOG function [36]
The following figure 34 shows the output of the HOG function.
This HOG image was input to the face detection function of the face recognition library, as shown in the Python code in listing 3.
import cv2
import face_recognition

img = cv2.imread("Ogtay_Ahmadli.jpg")
color = (0,0,255)
faceLocationCurrentImage = face_recognition.face_locations(hog_image)
y1,x2,y2,x1 = faceLocationCurrentImage[0]
cv2.rectangle(img,(x1,y1),(x2,y2),color,1)
Listing 3. A python code that draws a rectangle to the detected face
As listing 3 illustrates, face_locations() was used to extract the four coordinates of the detected face. These points were then used with the OpenCV library to draw a rectangle around the face, as illustrated in figure 35.
After the successful detection shown in figure 35, a new Python subroutine called findEncodings() was created to find the encodings of each face image in the dataset. The subroutine goes through the dataset and, for each image, uses the FaceNet method to generate the encodings. When the encoding process is completed, the subroutine returns two lists: the first is the list of encodings of each image in the dataset, as illustrated in Appendix 1, and the second is the list of names in the dataset, as shown in figure 36.
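The following is a minimal sketch of what such a findEncodings() subroutine could look like when built on the face_recognition library; the exact project implementation may differ, and the sketch assumes that every file in the dataset folder contains exactly one face.

import os
import cv2
import face_recognition

def findEncodings(path):
    # Encode every image in the dataset folder and collect the matching names
    encodingList, classNames = [], []
    for fileName in os.listdir(path):
        img = cv2.imread(os.path.join(path, fileName))
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        encoding = face_recognition.face_encodings(img)[0]   # 128-measurement vector
        encodingList.append(encoding)
        classNames.append(os.path.splitext(fileName)[0])     # name taken from the file name
    return encodingList, classNames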
Once the face images were encoded, a subroutine called recognizeFaces() was created to recognize faces using the support vector machine algorithm. This subroutine takes the lists returned by the previous subroutine as inputs, along with the image. It starts by generating the encodings of the real-time face detected by the webcam. Next, the dataset encodings are looped through to calculate the face distance and the result. The result is a list that compares the dataset faces with the real-time face using the compare_faces() function of the face recognition library and produces the output shown in figure 37.
As figure 37 illustrates, the recognized face is labeled as true and others as false,
corresponding to figure 36.
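A minimal sketch of this comparison, reusing the findEncodings() sketch above and assuming a face is visible in the webcam frame, could look as follows (not the exact project code).

import cv2
import face_recognition

encodingList, classNames = findEncodings("ImageAttendance")    # dataset encodings
success, frame = cv2.VideoCapture(0).read()                    # one frame from the webcam
frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
encodingFace = face_recognition.face_encodings(frame)[0]       # real-time face encoding
result = face_recognition.compare_faces(encodingList, encodingFace)
print(result)                                                  # e.g. [False, True, False]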
The face distance is computed using equation (7) in section 2.1.1, the Euclidean distance formula, to find the distance between the encodings of the dataset faces and the real-time face, as shown in listing 3.
from scipy.spatial import distance

faceDistance = [distance.euclidean(encoding, encodingFace) for encoding in encodingList]
Listing 3. A python code to calculate the distance between encodings
As figure 38 shows, the Euclidean distance of the recognized face is small compared to the others. The NumPy library was then used to get the index of the minimum value of this list with the argmin() function, as shown in listing 4.

import numpy as np

matchIndex = np.argmin(faceDistance)

Listing 4. The python code to get the index of the minimum value of a list
The output of this line is equal to one, which is the index of the second element of the list in figure 38.
The following listing 5 checks whether the result in figure 38 is true or false at the
minimum value.
names = []
if result[matchIndex]:
    name = classNames[matchIndex]
    color = (0,255,0)
    sm.sendData(ser,[0,0,1,0], 1)
else:
    name = 'unknown'
    color = (0, 0, 255)
    sm.sendData(ser,[1,1,0,1], 1)
names.append(name)
Listing 5. A python code to recognize faces.
Here, if the result is true, the face is recognized: the name is taken from the list of names in figure 36 at the match index, and the data is sent to the Arduino UNO to unlock the solenoid locks and turn on the green LED.
If the result is false, the name is labeled as ”unknown,” and the Arduino UNO receives the data to keep the locks closed and turn on the red LED.
After these decisions, listing 3 in section 4.2.2.1 was slightly modified according to the recognized and unrecognized faces, as shown in listing 6.
y1,x2,y2,x1 = faceLocation
y1,x2,y2,x1 = int(y1/0.25), int(x2/0.25), int(y2/0.25), int(x1/0.25)
cv2.rectangle(imgFaces,(x1,y1),(x2,y2),color,2)
cv2.putText(imgFaces, name, (x1+6, y1-6),
            cv2.FONT_HERSHEY_COMPLEX,1,color,2)
Listing 6. A python code to draw a rectangle and put text on the recognized face [36]
Due to the reduced image size used for detection in figure 35, the face locations are increased four times to get the proper face frame from the webcam. Then a rectangle and a text label were added around the face using the computer vision library.
4.2.2.4 Database
In this project, Firebase was used to keep the data in Google’s real-time database. First, the Firebase database was created, and then the Python module in listing 7 was designed to handle the communication with Firebase.
After importing the firebase library, the URL of the Firebase database was copied into the code. Then the postData() subroutine was created to post the name and the time to the database.
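Listing 7 itself is not reproduced here. The sketch below shows roughly what such a module could look like; it assumes the python-firebase package and uses a placeholder database URL, neither of which is taken from the actual project code.

from firebase import firebase as fb   # python-firebase package (assumed)

# Placeholder URL; the project's real database URL is not reproduced here
firebaseApp = fb.FirebaseApplication("https://your-project.firebaseio.com/", None)

def postData(name, dateString):
    # Post the recognized name and the time stamp to the real-time database
    firebaseApp.post("/Attendance", {"name": name, "time": dateString})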
from datetime import datetime

def markAttendance(name):
    with open('Attendance.csv','r+') as f:
        myDataList = f.readlines()
        nameList = []
        for line in myDataList:
            entry = line.split(',')
            nameList.append(entry[0])
        if name not in nameList:
            now = datetime.now()
            dateString = now.strftime('%H:%M:%S')
            f.writelines(f'{name},{dateString}\n')
            fbm.postData(name,dateString)
Listing 8. The python subroutine that marks the name and the date [36]
As listing 8 illustrates, an empty CSV file called Attendance was created and is used to check whether the name is already in the list. If the name is not in the list, the subroutine posts the name and the time to the real-time database using the postData() function of the firebase module.
The transmitter function is the combination of all the subroutines mentioned above. It
activates the webcam and uses the returned values of subroutines to generate the
desired output, as illustrated in listing 9.
def main():
    encodingList, classNames = findEncodings("ImageAttendance")
    cap = cv2.VideoCapture(0, cv2.CAP_DSHOW)
    sm.sendData(ser,[1,1,0,0],1)
    while True:
        success, img = cap.read()
        imgFaces, names = recognizeFaces(img, encodingList, classNames)
        for name in names:
            if name == "unknown":
                sleep(0.2)
            else:
                markAttendance(name)
        cv2.imshow("Image", imgFaces)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
Listing 9. The transmitter function
The function starts the operation by taking the returned values of the findEncodings()
function according to the images in the dataset called “ImageAttendance.” Then it
activates the camera and sends the initial lock and LED values to the Arduino UNO
board.
The webcam then captures an image and inputs it to the recognizeFaces() function. Here, a for loop is used to go through the names of the captured faces. If a face is not recognized, the program does not publish anything. Otherwise, the name and the time are sent to the database, as shown in figure 39.
As Figure 39 illustrates, the data contains the name of the recognized person and the
time it is recognized.
In the end, the function displays the output, which can be seen in figure 40.
In this project, the Jetson Nano is responsible for the AI, and the Arduino UNO is responsible for the electronics operation. The Jetson Nano board is in serial communication with the Arduino UNO to transmit the desired data and make the components operate, as shown in figure 41.
As figure 41 illustrates, the Jetson Nano sends four-digit data for the relays and LEDs. Here, the dollar sign is used to mark the data while looping, which avoids any confusion and defines the start and end digits of the signal. This sign is included in both the transmitter and the receiver code.
When the Jetson Nano is connected to the Arduino UNO with the USB cable, the Python subroutine shown in listing 10 checks whether the boards are connected.
import serial

def initConnection(portNo, baudRate):
    try:
        ser = serial.Serial(portNo, baudRate)
        print("Device Connected ")
        return ser
    except:
        print("Not Connected ")
        pass

Listing 10. The python subroutine that initializes the serial connection with the Arduino UNO
Here, the subroutine opens the port number of the Arduino UNO at the given baud rate using the serial library and returns the initialized serial object. When the Arduino UNO is connected, the subroutine prints "Device Connected"; otherwise, it prints "Not Connected".
After the successful connection, a new subroutine was created to send the data to the Arduino UNO, as shown in listing 11 below.
This subroutine takes the initialized serial object, the data, and the number of digits per data value as inputs. It loops through the data, prefixes it with the dollar sign, and sends it to the relevant port. If an issue occurs in the connection, the subroutine prints "Data Transmission Failed."
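Listing 11 is not reproduced here; a minimal sketch of such a sendData() subroutine, reconstructed from the description above, could look as follows (the exact project code may differ).

def sendData(ser, data, digits):
    # Build one message prefixed with the '$' start marker, padding each value
    # to a fixed number of digits, and send it over the serial port
    message = "$"
    for value in data:
        message += str(int(value)).zfill(digits)
    try:
        ser.write(message.encode())    # e.g. b"$0010" for the data [0, 0, 1, 0]
    except:
        print("Data Transmission Failed")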
The next step was to create a receiver function for the Arduino UNO to control the components. This subroutine starts its operation by checking for the dollar sign, as shown in listing 12 below.
#define numOfValsRec 4
#define digitsPerValRec 1
int valsRec[numOfValsRec];
int stringLength = numOfValsRec * digitsPerValRec + 1;
int counter = 0;
bool counterStart = false;
String receivedString;

void receiveData() {
  while (Serial.available()) {
    char c = Serial.read();
    if (c == '$') { counterStart = true; }
    if (counterStart) {
      // Remainder of the listing reconstructed from the description below
      if (counter < stringLength) { receivedString += c; counter++; }
      if (counter >= stringLength) {
        for (int i = 0; i < numOfValsRec; i++) {
          int num = (i * digitsPerValRec) + 1;   // skip the '$' start marker
          valsRec[i] = receivedString.substring(num, num + digitsPerValRec).toInt();
        }
        receivedString = ""; counter = 0; counterStart = false;
      }
    }
  }
}

Listing 12. The Arduino subroutine that receives the data from the Jetson Nano (the part after the dollar-sign check is reconstructed from the description below)
As listing 12 shows, when the dollar sign is detected and the counter is less than the string length, the function collects the incoming characters and increments the counter. It then loops through the received data elements and stores each of them in an array so that they can be used independently in the code.
First, the Arduino pin of each component was defined and set up as an input or output. Then a new function was created to pass the received data to the solenoid locks and LEDs, as shown in listing 13.
void unlock_solenoid() {
  digitalWrite(solenoid1Pin, valsRec[0]);
  digitalWrite(solenoid2Pin, valsRec[1]);
  digitalWrite(greenLed, valsRec[2]);
  digitalWrite(redLed, valsRec[3]);
}
Listing 13. The Arduino subroutine that sends digital values to the components
As listing 13 shows, the valsRec array filled in listing 12 was used to get each element of the signal and assign it to the corresponding component.
Overall, there are three main functions in the code that loop all the time, as shown in listing 14.
void loop() {
  receiveData();
  unlock_solenoid();
  oled();
}
Listing 14. The Looping process of the functions
The first function receives the data from the Jetson Nano. The second is the function above, which passes the data to the components. Finally, the last function displays a status message on the OLED display according to the received data and the distance measured by the ultrasonic sensor.
The web page was created using HTML, CSS, and JavaScript. The first step was to create a login interface for the web page, which can be seen in figure 42.
After a successful login, the Firebase configuration is used to access the data, and the web page displays it, as shown in figure 43.
5 Conclusion
The goal of the project was to build a facial recognition system that could recognize
human faces, log information into the database, and unlock the door.
The thesis project was executed in three steps. In the first step, machine learning and deep learning algorithms were used to recognize faces and send the data to the Google database. In the second step, the AI data was transmitted to the electronics components and sensors to make a smart lock system. Finally, the last step was to design a web page that requires a login and displays the attendance list.
The project’s result was accomplished as expected: the prototype could successfully recognize human faces and activate the electronics components. It performed fast and could log information about recognized people in the Google database.
This prototype can be used on office doors to identify employees, open the door, and send the employer an attendance list that displays each employee’s name and entry time.
A future improvement of the prototype could be to implement more extensive algorithms to distinguish between pictures and real faces seen through the camera. These algorithms would make the prototype faster, more secure, and suitable for commercial purposes.
References
6  Towards Data Science [online]. What are the types of machine learning?
   URL: https://towardsdatascience.com/what-are-the-types-of-machine-learning-e2b9e5d1756f
   Accessed on: 15.10.2021

15 Gavril Obnjanovski [online]. Everything you need to know about neural networks and backpropagation.
   URL: https://towardsdatascience.com/everything-you-need-to-know-about-neural-networks-and-backpropagation-machine-learning-made-easy-e5285bc2be3a
   Accessed on: 23.10.2021

33 Nvidia Developer [online]. Getting started with Jetson Nano Developer Kit.
   URL: https://developer.nvidia.com/embedded/learn/get-started-jetson-nano-devkit#intro
   Accessed on: 17.11.2021