
Human Expression Detection using Computer Vision
B.E. Project Report - ‘A’
Submitted in partial fulfillment of the requirements
For the degree of

Bachelor of Engineering
in
Computer Engineering
by
Ms. Pranjal Ghan 15CE2008
Ms. Kirti Mahale 15CE1016
Mr. Pradnyesh Gumaste 15CE2022
Mr. Tanuj Jain 15CE1097

Supervisor
Prof. Dr. Leena Ragha
Co-Supervisor
Mrs. Harsha Saxena

Department of Computer Engineering


Dr. D. Y. Patil Group’s
Ramrao Adik Institute of Technology
Dr. D. Y. Patil Vidyanagar, Sector-7, Nerul, Navi Mumbai-400706.
(Affiliated to University of Mumbai)
October 2018
Ramrao Adik Institute of Technology
(Affiliated to the University of Mumbai)
Dr. D. Y. Patil Vidyanagar, Sector-7, Nerul, Navi Mumbai-400706.

CERTIFICATE
This is to certify that the project ‘A’ titled
“Human Expression Detection Using Computer Vision”
is a bonafide work done by

Ms. Pranjal Ghan 15CE2008


Ms. Kirti Mahale 15CE1016
Mr. Pradnyesh Gumaste 15CE2022
Mr. Tanuj Jain 15CE1097

and is submitted in the partial fulfillment of the requirement for the


degree of
Bachelor of Engineering
in
Computer Engineering
to the
University of Mumbai

Supervisor Co-Supervisor
(Prof. Dr. Leena Ragha) (Mrs. Harsha Saxena)

Project Co-ordinator Head of Department Principal


(Mrs. Smita Bharne) (Dr. Leena Ragha) (Dr. Ramesh Vasappanavara)
Declaration

We declare that this written submission represents our ideas in our own words and where
others' ideas or words have been included, we have adequately cited and referenced the original
sources. We also declare that we have adhered to all principles of academic honesty and
integrity and have not misrepresented, fabricated, or falsified any idea/data/fact/source in our
submission. We understand that any violation of the above will be cause for disciplinary action
by the Institute and can also evoke penal action from the sources which have thus not been
properly cited or from whom proper permission has not been taken when needed.

Ms. Pranjal Ghan 15CE2008

Ms. Kirti Mahale 15CE1016

Mr. Pradnyesh Gumaste 15CE2022

Mr. Tanuj Jain 15CE1097

Date : . . . /. . . /. . . . . .
Project Report Approval for B.E

This is to certify that the project entitled “ Human Expression Detection using Computer
Vision ” is a bonafide work done by Ms. Pranjal Ghan, Ms. Kirti Mahale, Mr. Pradnyesh
Gumaste and Mr. Tanuj Jain under the supervision of Prof. Dr. Leena Ragha and Mrs. Harsha
Saxena. This dissertation has been approved for the award of Bachelor’s Degree in Computer
Engineering, University of Mumbai.

Examiners :
1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Supervisors :
1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Principal :
..............................

Date : . . . /. . . /. . . . . .

Place : . . . . . . . . . . . .
Acknowledgement

We take this opportunity to express our profound gratitude and deep regards to our supervisor
Prof. Dr. Leena Ragha and co-supervisor Mrs. Harsha Saxena for their exemplary guidance,
monitoring and constant encouragement throughout the completion of this report. We are truly
grateful for their efforts to improve our technical writing skills. The blessing, help and guidance
given by them from time to time shall carry us a long way in the journey of life on which we
are about to embark.
We take this privilege to express our sincere thanks to Dr. Ramesh Vasappanavara, Principal,
RAIT, for providing the necessary facilities. We are also thankful to Dr. Leena Ragha, Head of
the Department of Computer Engineering, Project Co-ordinator Mrs. Smita Bharne and Project
Co-ordinator Mrs. Bhavana Alte, Department of Computer Engineering, RAIT, Nerul, Navi
Mumbai, for their generous support.
Last but not least, we would also like to thank all those who have directly or indirectly
helped us in the completion of this thesis.

Ms. Pranjal Ghan


Ms. Kirti Mahale
Mr. Pradnyesh Gumaste
Mr. Tanuj Jain
Abstract

Classroom environments are affected by a legion of factors that are difficult for college
supervising authorities to detect. Evaluating student-teacher interaction by observing student
behaviour from outside the class provides only a shallow understanding of what is actually
happening within the classroom. In order to gain a greater depth of understanding, the facial
expressions of the students can be evaluated. Facial expressions are one of the most important
cues for sensing human emotions and behavioural aspects amongst humans. Neural networks,
and deep learning in general, are far more effective at categorizing such emotions due to their
robust designs and accuracy in predictions. We also contrast our deep learning approach with
conventional shallow learning based approaches and show that a convolutional neural network
is far more effective at learning representations of facial expression data.

Contents

Abstract

List of Figures

List of Tables

List of Algorithms

1 Introduction
  1.1 Overview
  1.2 Objective
  1.3 Motivation
  1.4 Problem Definition
  1.5 Organization of Report

2 Literature Survey
  2.1 Research Paper Survey
  2.2 Similar Existing Project Comparison
  2.3 Analysis

3 Proposal
  3.1 Problem Statement
  3.2 Proposed Work
  3.3 Proposed Methodology
    3.3.1 Face Detection: Using the Viola-Jones Algorithm
      3.3.1.1 Using Haar Features
      3.3.1.2 AdaBoost Training
      3.3.1.3 Cascade Classifier
    3.3.2 Convolutional Neural Network (CNN)
  3.4 Hardware & Software Requirements
    3.4.1 Software Requirements
    3.4.2 Hardware Requirements

4 Planning & Formulation
  4.1 Schedule for Project / Gantt Chart

5 Design of System
  5.1 Diagrams with Explanation

6 Proposed Results
  6.1 Proposed Results & Analysis
    6.1.1 Dataset
  6.2 Project Outcomes

7 Conclusion

8 Future Work

References

Appendix
List of Figures

2.1 Gabor Filter with various size, length and orientation
3.1 System Diagram of the emotion detection system
3.2 Haar Features
3.3 Haar Features used to recognize eyes and the bridge of nose regions of face [?]
3.4 Cascade Classifier
3.5 Proposed Methodology
4.1 Gantt Chart
5.1 DFD Level 0
5.2 DFD Level 1
5.3 DFD Level 1.1
5.4 DFD Level 2
5.5 Use-Case
5.6 Sequence Diagram
6.1 Single Face Detection
6.2 Multiple Face Detection

List of Tables

2.1 Literature Survey

List of Algorithms

1 Algorithm for Face Detection
Chapter 1

Introduction

Human emotions run a gamut from extreme to almost neutral facial expressions. If this huge
range of emotions is carefully analyzed, a lot of information can be obtained about the people
expressing them. Each emotion has a unique expression style, and with proper interpretation
these styles can be identified to recognize particular emotions. Specifically, in a classroom,
students express a wide range of expressions. These expressions can convey a lot of information
about the lecture and the students' comprehension of it. Computer vision is one technology
that can automate this task by using deep learning models to analyze such data. With the help
of this analyzed data, teachers and the responsible educational committees can figure out
different ways to improve the lectures they conduct.

1.1 Overview
Deep learning is a rapidly emerging technology. Given the wave of automation that is
currently affecting several industries, deep learning combined with computer vision is the apt
technology for our project. As a case in point, consider the live example of computer vision
deployed in the city of Mumbai: a trained model captures the image of a fast-moving vehicle,
and if the vehicle's speed is above the speed limit, a fine is sent directly to the driver. Another
application of computer vision is the automation robots being developed for smart homes.
Robots such as Vector can detect human emotions and react according to environment
variables. It is quite evident that computer vision is becoming prominent both in automating
everyday tasks and in the making of intelligent machines.

1.2 Objective
The main objective of this project is to leverage computer vision technology in order to
understand and evaluate the various emotions displayed by students in a classroom environment.
Using deep learning and computer vision to automate the analysis, we aim to understand the
various expressions displayed by students during a lecture. Using the obtained information,
our project can be used to improve classroom teaching and to devise new ways in which
students can be motivated during lectures. The extracted information can be used by teachers
and college authorities to improve learning. Several parameters, such as the quality of the
lecture and the level of interest among students, can be measured and acted upon using the
designed system.

1.3 Motivation
Nowadays, the classroom environment is affected by several factors. One of the most common
issues is students' lack of attention during the lecture. Students tend to lose attention when
the lecture begins to cover complex topics or intricate details. Another common issue is that
lecturers fail to bring enough energy to the topic they are teaching. As a result, the students
lose interest in the subject. As we can see, there are many such factors that can influence the
overall efficiency of a lecture. Our project was inspired by this issue in the classroom
environment. We aim to reduce it by implementing state-of-the-art neural networks to capture
live images of students in a lecture and to analyze the overall emotions expressed by the
students during the lecture. This information can be further used to devise solutions that
improve the overall lecture efficiency.

1.4 Problem Definition


Emotion is an intricate topic when it comes to understanding the behavioural aspects of
humans. Although computer technology has aimed to solve various social issues in the past,
our project strives to solve the existing problems in the educational environment with the use
of the latest technology. By locating faces in the scene, extracting facial features from the
pre-processed image, and analyzing those features, we can broadly categorize human emotions.
Our project aims to automate this task by using intelligent computer vision.

1.5 Organization of Report
The rest of the report is organized as follows. The literature survey and the related work of
this research are covered in Chapter 2. The proposal that this motivation leads us towards is
discussed in Chapter 3. The planning and formulation for our research are defined in Chapter 4,
followed by the design of the system in Chapter 5 and the proposed results in Chapter 6. We
conclude the report in Chapter 7, and Chapter 8 provides an insight into the future work of
the entire project.

Chapter 2

Literature Survey

There are existing systems that already use certain techniques for detecting emotions in
images. Various techniques are used depending on the environment. Hence we conducted a
survey of these techniques, which is presented in Section 2.1.

2.1 Research Paper Survey


1. Facial Expression Recognition based on Support Vector Machine using Gabor Wavelet
Filter
Sagor Chandro Bakchy et al. [1] propose an optimized approach for face recognition from
digital face images using SVMs (Support Vector Machines). In this paper [1], they decompose
the image into small sets of features. First of all, they create a training dataset against which
results are compared. The input face image is pre-processed and compared with the training
dataset, which has already been computed. The highest matching can be achieved with
multiple face images, but this needs high computation time. The pre-processing is carried out
in two phases: first, shape feature extraction is done from the image. In this method they
detected 34 out of 58 facial points on the human face, and each image was converted into a
68-feature vector. In the next step, expression feature extraction was carried out from this
image. To carry out this phase they used Gabor wavelets. In the spatial domain, a 2D Gabor
filter is a Gaussian kernel function modulated by a sinusoidal plane wave. The Gabor filters
are self-similar: all filters can be generated from one mother wavelet by dilation and rotation.
Gabor filters are efficient in reducing image redundancy and robust to noise. Such filters can
be either convolved with the whole image or applied to a limited range of positions.
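As a minimal sketch of such a filter bank, OpenCV's getGaborKernel can generate the rotated family from one mother wavelet; the parameter values below are our own illustrative choices, not those used in [1]:

```python
import cv2
import numpy as np

# A small Gabor filter bank: one mother wavelet rotated into a family.
# ksize, sigma, lambd and gamma are illustrative, not values from [1].
def gabor_bank(ksize=21, sigma=4.0, lambd=10.0, gamma=0.5, n_orientations=8):
    thetas = np.arange(n_orientations) * np.pi / n_orientations
    return [cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma)
            for theta in thetas]

def gabor_responses(gray_face):
    # Convolve the whole image with each filter in the bank.
    return [cv2.filter2D(gray_face, cv2.CV_32F, kernel) for kernel in gabor_bank()]
```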

Figure 2.1: Gabor Filter with various size, length and orientation

2. Human Facial Expression Recognition from Static Images using Shape and Appearance
Feature
Naveen Kumar H N et al. [6] proposed a different approach for emotion recognition from
static images. They used a HOG detector to detect the expression in the images. HOG feature
vectors extracted from training images are used to train the classifier. Characteristics of local
shape or gradient structure are well captured by HOG features. HOG is relatively invariant to
local geometric and photometric transformations, so small pose variations do not affect the
performance of the FER system. The HOG feature vector provides information about the
shape and appearance of the face, which are better characterized by intensity gradients and
edge directions. The HOG feature extraction algorithm divides the static face image into
small spatial regions known as cells. Cells can be rectangular or circular. The image is
divided into cells of size N × N pixels, and gradients are computed for each cell.
The extracted HOG features are then given as input to a group of Support Vector Machines
(SVMs). An SVM is a discriminative classifier defined through a separating hyperplane.
SVMs are non-parametric and hence offer robustness comparable to Artificial Neural Networks
and other non-parametric classifiers. The purpose of using SVMs is to obtain acceptable
results in a fast, accurate and easy manner. Using this method [6], they reached an accuracy
of 92.56%.
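A sketch of this HOG-plus-SVM pipeline, assuming scikit-image and scikit-learn; the cell and block sizes and the linear kernel are our assumptions, since [6] does not fix them:

```python
from skimage.feature import hog
from sklearn.svm import SVC

def hog_features(gray_face):
    # Divide the face into cells, build a gradient-orientation histogram per
    # cell, normalize over blocks, and concatenate into one feature vector.
    return hog(gray_face, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")

# train_faces: list of grayscale face images; train_labels: expression labels
# X = [hog_features(face) for face in train_faces]
# clf = SVC(kernel="linear").fit(X, train_labels)
# prediction = clf.predict([hog_features(test_face)])
```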

3. Analysis of Facial Expressions from Video Images using PCA


Face recognition and expression classification from video image sequences are explained by
Praseeda Lekshmi V. et al. [7], in which frames were extracted from image sequences. A skin
colour detection method is applied to detect face regions. The whole face was considered for
the construction of the eigenspace. In their methodology [7], they treated face recognition as
a 2-D recognition problem and used Principal Component Analysis (PCA). PCA is a common
technique for finding patterns in high-dimensional data. In PCA, face images are projected
into a feature space, or "face space". Weight vector comparison was done to get the best
match. After the face recognition phase, their system could efficiently identify the expression
from the face. They [7] worked on the FG-NET database and tried to classify emotions.
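A minimal eigenface-style sketch of this projection-and-match step using scikit-learn; the number of components and the nearest-neighbour matching rule are our assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_face_space(train_faces, n_components=50):
    # Flatten each image and learn the principal components ("face space").
    X = np.array([face.ravel() for face in train_faces])
    pca = PCA(n_components=n_components).fit(X)
    return pca, pca.transform(X)   # model + weight vectors of the training set

def best_match(pca, train_weights, query_face):
    # Project the query face and return the index of the closest training face.
    q = pca.transform(query_face.ravel()[None, :])
    return int(np.argmin(np.linalg.norm(train_weights - q, axis=1)))
```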

4. A Deep Neural Network Driven Feature Learning Method for Multi-view Facial
Expression Recognition
In this paper, Tong Zhang et al. [8] used a deep neural network to detect expressions within
the Multi-PIE and BU-3DFE datasets, which contain faces with six different emotions under
different lighting conditions. For facial feature extraction they [8] used the SIFT method,
which annotates fixed points around the nose, mouth and eyes. The scale-invariant feature
transform (SIFT) is a feature detection algorithm in computer vision used to detect and
describe local features in images. These features were then used to form a 128-dimensional
feature vector per image. In order to improve the multi-pose analysis of images, they [8] used
a DNN in which they applied 1D convolution rather than 2D convolution. Finally, they passed
this to a CNN layer which predicts the emotion shown by the face in the image.

5. DeXpression: Deep Convolutional Neural Network for Expression Recognition


In this paper, Peter Burkert et al. [9] proposed a convolutional neural network approach to
tackle the problem of emotion detection from images and videos. The proposed architecture [9]
is independent of any hand-crafted feature extraction and performs better than earlier
convolutional neural network based approaches. In their approach they [9] used a block called
FeatEx [9], a parallel feature extraction block. The block consists of convolutional, pooling,
and ReLU layers. The first convolutional layer in FeatEx reduces the dimension, since it
convolves with a filter of size 1 x 1. It is followed by a ReLU layer, which creates the desired
sparseness. The output is then convolved with a filter of size 3 x 3. In the parallel path, a max
pooling layer is used to reduce information before applying a convolution of size 1 x 1. This
application of differently sized filters reflects the various scales at which faces can appear. The
paths are concatenated for a more diverse representation of the input, and this kind of
architecture yielded good results.

2.2 Similar Existing Project Comparison


Computer vision is revolutionizing every industry in which it is implemented. It holds the
power to automate and react to the surroundings of an intelligent machine. Amazon recently
launched a new project called the Amazon Go Store in the United States and Canada. The
purpose of the initiative was to create a store where shoppers can easily come in, purchase and
leave without any hassle. The entire store is based upon computer vision technology. Amazon
posits that the computer vision technology can determine when an object is taken from a shelf
and who has taken it. If an item is returned to the shelf, the system is also able to remove that
item from the customer's virtual basket. After buyers are finished with their shopping, they
can walk straight out of the store and the money is deducted via a connected bank account.
The system uses facial recognition to ensure that it bills the correct buyer for the respective
items. Amazon also states that this system helps reduce shoplifting by alerting the store
owners to any suspicious activity, such as the passing of items or the hiding of items under
clothes. The Amazon Go store implements state-of-the-art technology to achieve an excellent
accuracy rate at detecting and constantly monitoring the shopping environment.

2.3 Analysis

| Sr. No. | Research Paper Name | Authors | Publication and Year | Description and Accuracy | Limitations |
|---|---|---|---|---|---|
| 1 | Facial Expression Recognition via Deep Learning | Abir Fathallah, Lotfi Abdi, Ali Douik | IEEE 14th International Conference on Computer Systems and Applications, 2017 [2] | CNN was implemented on the CK+ dataset. The accuracy obtained was 97%. | Could not detect complex emotions. |
| 2 | A Deep Neural Network Driven Feature Learning Method for Multi-view Facial Expression Recognition | Tong Zhang, Wenming Zheng, Zhen Cui, Yuan Zong, Jingwei Yan and Keyu Yan | IEEE Transactions on Multimedia, Volume 18, Issue 12, Dec. 2016 [8] | Deep neural network and SIFT transform were implemented on the FER2013 dataset. The accuracy obtained was 96%. | The dataset used was limited in size. |
| 3 | Facial Expression Recognition based on Support Vector Machine using Gabor Wavelet Filter | Sagor Chandro Bakchy, Mst. Jannatul Ferdous, Ananna Hoque Sathi, Krishna Chandro Ray, Faisal Imran, Md. Meraj Ali | 2nd International Conference on Electrical and Electronic Engineering (ICEEE), 2017 [1] | Support Vector Machine with Gabor wavelets was implemented on the FER2013 dataset. The accuracy obtained was 84%. | The dataset was limited; the accuracy was a compromise. |
| 4 | SenTion: A framework for Sensing Facial Expressions | Rahul Islam, Karan Ahuja, Sandip Karmakar, Ferdous Barbhuiya | arXiv, August 2016 [10] | Intervector angles and histogram features were implemented on the CK+ and JAFFE datasets. The accuracy obtained was 95%. | The algorithm lacks compression techniques. |
| 5 | Analysis of Facial Expressions from Video Images using PCA | Praseeda Lekshmi V., Dr. M. Sasikumar, Naveen S | World Congress on Engineering, July 2008, Vol. I [7] | The original paper implementing analysis of expressions from video images. The accuracy obtained was 88%. | It was developed on the basis of only 10 images. |
| 6 | DeXpression: Deep Convolutional Neural Network for Expression Recognition | Peter Burkert, Felix Trier, Muhammad Zeshan Afzal, Andreas Dengel and Marcus Liwicki | arXiv, August 2017 [9] | Used a CNN for emotion detection on the MMI and CKP databases; achieved 99.6% and 98.36% accuracy respectively. | Cannot detect more complex emotions. |

Table 2.1: Literature Survey

After finishing the survey and studying the plethora of techniques relating to emotion
detection, we concluded that Convolutional Neural Networks provide excellent results. It is
observed that the FER-2013 database gives the lowest accuracy because the database
represents images of real-life scenarios: the images were gathered from the results of Google
image searches for each emotion and its synonyms.

Chapter 3

Proposal

The classroom is affected by a lot of distractions and factors. The quality of classroom
teaching can be degraded due to a plethora of causes that may not be visible superficially. We
plan to create an automated system which will assist college authorities and teachers in
evaluating the emotions expressed by students during a lecture. By evaluating this data, the
quality of the lecture and the attention of the students can be extracted and used to improve
overall lecture quality. In the implementation, we will leverage computer vision technology to
detect the faces of students, feed the extracted faces to a neural network, and classify the
emotions into one of six common emotion categories.

3.1 Problem Statement


Emotion is an intricate topic when it comes to understanding the behavioural aspects of
humans. Although computer technology has aimed to solve this issue in the past, our project
strives to solve the existing problems with the use of the latest technology. By locating
student faces in the scene, extracting facial features from the pre-processed image, and
analyzing those features, we can broadly categorize human emotions. A surfeit of projects
related to this topic already exists. However, our project aims to automate this task by using
intelligent computer vision.

3.2 Proposed Work

Figure 3.1: System Diagram of the emotion detection system

In this project, we are implementing computer vision technology in order to detect the
emotions expressed by students in a classroom environment. The first stage is face detection
from a live stream. Frames are captured at intervals using OpenCV. Each image is converted
into a grayscale image of size 48 × 48 pixels. Face detection is done using the Viola-Jones
algorithm, specifically the frontal haar cascade. The cropped face image is then fed to the
convolutional neural network model for feature extraction and emotion detection; the network
trains itself to detect emotions from these images (a minimal sketch of this pipeline follows
the list below). The output is one of six emotion categories, namely:

• Neutral

• Interested

• Bored

• Frustrated

• Confused

• Laughing
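A minimal sketch of this capture-and-detect pipeline, assuming OpenCV's bundled frontal haar cascade and an already-trained Keras model named `model`; the function and variable names are ours:

```python
import cv2
import numpy as np

EMOTIONS = ["Neutral", "Interested", "Bored", "Frustrated", "Confused", "Laughing"]

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def faces_to_cnn_inputs(frame):
    """Detect faces in a BGR frame, crop each one, and resize it to the
    48x48 grayscale format expected by the CNN."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    boxes = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    crops = [cv2.resize(gray[y:y + h, x:x + w], (48, 48)) for x, y, w, h in boxes]
    batch = np.array(crops, dtype="float32") / 255.0   # scale pixels to [0, 1]
    return batch.reshape(-1, 48, 48, 1)

# scores = model.predict(faces_to_cnn_inputs(frame))   # one row per face
# labels = [EMOTIONS[i] for i in scores.argmax(axis=1)]
```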

3.3 Proposed Methodology


We are using a Convolutional Neural Network for facial expression recognition. The following
are the steps we used to build our model:

3.3.1 Face Detection: Using the Viola-Jones Algorithm
In our system we only need the frontal faces, normalized in scale, from the input images.
Therefore it is important to localize and extract the facial region from an image and exclude
the unnecessary parts of the image, to reduce the computation required for feature extraction
and processing. We use the Viola-Jones face detection method as it is a robust algorithm
capable of fast processing for real-time systems.
In Viola-Jones we construct an intermediate representation, known as the integral image,
which is used to reduce the computational complexity and improve the performance of the
face detection algorithm. Viola-Jones combines three techniques for face detection:

3.3.1.1 Using Haar Features

The haar features used for face detection are of a rectangular type and are evaluated using the
integral image. Figure 3.2 shows different types of haar features, which resemble properties
common to human faces. The eye region is darker than the upper cheeks, so the second type
of haar feature in figure 3.2 is used to detect that facial region, while the haar feature for the
nose bridge region, which is brighter than the cheeks, is shown in figure 3.3.

Figure 3.2: Haar Features

Here, using these features we can find the locations of the eyes, the bridge of the nose and the
mouth by calculating

\[ \text{Value of feature} = \sum (\text{pixels in black area}) - \sum (\text{pixels in white area}) \tag{3.1} \]

This is used for facial edge detection, and hence the output is a horizontal high-value line. In
the haar stage we slide a 24 × 24 pixel sub-window over the image window to find such edges;
this yields 162,336 possible features that can be extracted, which are further used for facial
region detection. We create a kernel using haar features to extract this line, then apply the
kernel to the whole image; it produces a high output only where the image values match the
kernel, which is our expected output.

Figure 3.3: Haar Features used to recognize eyes and the bridge of nose regions of face [?]
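To make the rectangle sums concrete, here is a minimal NumPy sketch of the integral image and a two-rectangle feature evaluated with equation (3.1); the function names are ours:

```python
import numpy as np

def integral_image(gray):
    # ii[y, x] = sum of all pixels above and to the left of (y, x), inclusive.
    return gray.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x, y, w, h):
    # Sum over a w x h rectangle with top-left (x, y), using four lookups.
    A = ii[y - 1, x - 1] if x > 0 and y > 0 else 0
    B = ii[y - 1, x + w - 1] if y > 0 else 0
    C = ii[y + h - 1, x - 1] if x > 0 else 0
    D = ii[y + h - 1, x + w - 1]
    return D - B - C + A

def two_rect_feature(ii, x, y, w, h):
    # Equation (3.1): pixels in the dark half minus pixels in the bright half.
    half = h // 2
    return rect_sum(ii, x, y, w, half) - rect_sum(ii, x, y + half, w, h - half)
```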

3.3.1.2 AdaBoost Training

AdaBoost is used to optimize the process of detecting the face. The term "boosting" refers to
building, at each stage, a classifier of increasing complexity out of basic weak classifiers over
different features. As we saw in the previous step, a large number of features is computed
from an image. Therefore, to avoid redundant features that are less important to the classifier,
we use the AdaBoost algorithm. We find the important features using a weighted combination
of weak classifiers, and a feature is considered worth including only if it performs better than
chance. The algorithm constructs a strong classifier as a linear combination of weighted
simple weak classifiers:

\[ F(x) = a_1 f_1(x) + a_2 f_2(x) + a_3 f_3(x) + \dots \tag{3.2} \]

where \(F(x)\) is the strong classifier and the \(f_i(x)\) are weak classifiers.
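A tiny sketch of evaluating such a strong classifier; the weak classifiers and their weights are placeholders to be produced by AdaBoost training:

```python
def strong_classifier(window, weak_classifiers, alphas, threshold=0.0):
    # Equation (3.2): a weighted linear combination of weak classifiers,
    # thresholded into a face / non-face decision.
    score = sum(a * f(window) for a, f in zip(alphas, weak_classifiers))
    return score >= threshold
```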

3.3.1.3 Cascade Classifier

The cascade classifier is used to combine many of the features efficiently. The term "cascade"
refers to the several filter stages that make up the resulting classifier. It works by discarding
non-face images early, to avoid unnecessary work and to spend more time on images with
probable face regions. Therefore a cascade classifier composed of stages, each containing a
strong classifier, is used, so that the output of each stage can discard non-face images, as
shown in figure 3.4.

Figure 3.4: Cascade Classifier
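The early-rejection behaviour can be sketched in a few lines, where each stage is a strong classifier as in equation (3.2):

```python
def cascade_classify(window, stages):
    # Reject as soon as any stage says "not a face"; only windows that pass
    # every stage are reported as faces, so most non-faces exit early.
    for stage in stages:
        if not stage(window):
            return False
    return True
```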

3.3.2 Convolutional Neural Network (CNN)

Convolutional Neural Networks are very similar to ordinary Neural Networks in the following
respects:
1. They are made up of neurons with learnable weights and biases.
2. Each neuron receives some inputs, performs a dot product and optionally follows it with a
non-linearity.
What makes them different is the architecture, which makes the explicit assumption that the
inputs are images; this allows us to encode certain properties into the architecture. These
make the forward function more efficient to implement and greatly reduce the number of
parameters in the network.
Regular neural network:
The dataset contains 48 × 48 grayscale images (48 wide, 48 high), so a single fully-connected
neuron in the first hidden layer of a regular neural network would have 48 × 48 = 2304
weights. This amount still seems manageable, but clearly this fully-connected structure does
not scale to larger images.
Convolutional network:
Unlike a regular neural network, the layers of a convolutional network have neurons arranged
in 3 dimensions: width, height and depth (depth here refers to the third dimension of an
activation volume). As we will soon see, the neurons in a layer will only be connected to a
small region of the layer before it, instead of to all neurons in a fully-connected manner. The
three main types of layers in a convolutional network are:
1. Convolutional Layer
2. Pooling Layer
3. Fully-Connected Layer
1. Convolutional Layer
The convolutional layer's parameters consist of a set of filters that are learned over time.
Every filter is small spatially (along width and height), but extends through the full depth of
the input volume. During the forward pass the filters slide over the input; more precisely, we
convolve each filter across the width and height of the input volume and compute dot products
between the entries of the filter and the input at each position.
As we slide the filter over the width and height of the input volume, we produce a
2-dimensional activation map that gives the responses of that filter at every spatial position.
Intuitively, the network will learn filters that activate when they see some type of visual
feature, such as an edge of some orientation or a blotch of some colour on the first layer, or
eventually entire honeycomb or wheel-like patterns on higher layers of the network. We have
an entire set of filters in each convolutional layer, and each of them produces a separate
2-dimensional activation map. We stack these activation maps along the depth dimension to
produce the output volume. The spatial extent of connectivity is a hyperparameter called the
receptive field of the neuron (equivalently, the filter size). In our dataset the images have a
size of 48 × 48 pixels; the first convolutional layer has 32 filters of size 5 × 5 and the next
convolutional layer has 64 filters of size 5 × 5.
Spatial arrangement: three hyperparameters control the size of the output volume:
1. Depth
2. Stride
3. Zero-padding
The spatial size of the output volume is a function of:
1. the input volume size (W),
2. the receptive field size of the convolutional layer neurons (F),
3. the stride with which they are applied (S), and
4. the amount of zero padding used (P) on the border.
The formula for calculating how many neurons fit is

\[ \frac{W - F + 2P}{S} + 1 \tag{3.3} \]

In our dataset, an input of 48 × 48 with a receptive field size of 5 × 5, zero padding P = 0 and
stride S = 1 generates an output of (48 − 5)/1 + 1 = 44, i.e. 44 × 44.
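A small helper makes these sizes easy to check; this is a sketch and the helper name is ours:

```python
def output_size(w, f, p=0, s=1):
    # Equation (3.3): ((W - F + 2P) / S) + 1 neurons fit along one dimension.
    return (w - f + 2 * p) // s + 1

print(output_size(48, 5))        # 44: first 5x5 conv layer on a 48x48 input
print(output_size(44, 2, s=2))   # 22: 2x2 max pooling with stride 2 (Eq. 3.4)
```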
1.1 Backpropagation:
The backward pass for a convolution operation, for both the data and the weights, is also a
convolution, but with spatially-flipped filters. With the help of backpropagation we obtain the
updated weights of the filters.
2. Pooling Layer
Its function is to progressively reduce the spatial size of the representation, in order to reduce
the number of parameters and the computation in the network, and also to control overfitting.
The pooling layer operates independently on every depth slice of the input and resizes it
spatially. The most common form is a max pooling layer with filters of size 2 × 2 applied
with a stride of 2, which downsamples every depth slice in the input by 2 along both width
and height, discarding 75% of the activations. Every MAX operation in this case takes a max
over 4 numbers (a little 2 × 2 region in some depth slice). The depth dimension remains
unchanged. More generally, the pooling layer accepts a volume of size W1 × H1 × D1 and
requires two hyperparameters:
1. spatial extent F
2. stride S
and produces a volume of size W2 × H2 × D2, where:

\[ W_2 = \frac{W_1 - F}{S} + 1 \tag{3.4} \]

\[ H_2 = \frac{H_1 - F}{S} + 1 \tag{3.5} \]

\[ D_2 = D_1 \tag{3.6} \]

We implement a pooling layer with F = 2, S = 2 and obtain a 22 × 22 output.


3. Fully-Connected Layer
Neurons in a fully-connected layer have full connections to all activations in the previous
layer, as in regular neural networks. Their activations can hence be computed with a matrix
multiplication followed by a bias offset.
In our project we implement 3 fully-connected layers of size 2048, 1024 and 512 respectively.

Figure 3.5: Proposed Methodology
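A minimal Keras sketch of the layers described in this section. The text fixes the filter counts (32 and 64, both 5 × 5), one 2 × 2 stride-2 pooling stage producing 22 × 22, and fully-connected layers of 2048, 1024 and 512; the ReLU activations, the second pooling stage, the softmax output and the optimizer are our assumptions:

```python
from tensorflow.keras import layers, models

def build_expression_cnn(num_classes=6):
    model = models.Sequential([
        layers.Input(shape=(48, 48, 1)),                  # 48x48 grayscale face
        layers.Conv2D(32, (5, 5), activation="relu"),     # 48 -> 44, Eq. (3.3)
        layers.MaxPooling2D((2, 2), strides=2),           # 44 -> 22, Eq. (3.4)
        layers.Conv2D(64, (5, 5), activation="relu"),     # 22 -> 18
        layers.MaxPooling2D((2, 2), strides=2),           # 18 -> 9 (assumed)
        layers.Flatten(),
        layers.Dense(2048, activation="relu"),
        layers.Dense(1024, activation="relu"),
        layers.Dense(512, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),  # six emotion classes
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```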

3.4 Hardware & Software Requirements

3.4.1 Software Requirements


• Jupyter Notebook - an open-source web application that allows the creation and sharing
of documents containing live code, equations, visualizations and narrative text.

• Python - a popular and easy-to-use language for image processing algorithms.

• TensorFlow - an open-source software library for dataflow programming, used here for
building neural networks.

• OpenCV - Open Source Computer Vision, a library of programming functions mainly
aimed at real-time computer vision.

• Keras back-end - an open-source library written in Python that allows fast
implementation of deep neural network models.

• Operating System - Linux, Windows or Mac.

3.4.2 Hardware Requirements


• Minimum RAM - 8 GB

• Processor - minimum Intel i5 (7th generation) or better

• Surveillance webcam - to capture videos

• Storage - 20 GB or more

Chapter 4

Planning & Formulation

4.1 Schedule for Project / Gantt Chart


Planning plays a cardinal role in the execution of a project. It gives an insight into the process
of spreading out the tasks in chronological and sequential order. The activities that need to be
conducted are arranged in the required sequence. The figure below represents the Gantt Chart
for our project over a duration of 3 months.

Figure 4.1: Gantt Chart

Chapter 5

Design of System

5.1 Diagrams with Explanation


DFD DIAGRAMS

• DFD Level 0

The basic flow of the data is shown in the figure below

Figure 5.1: DFD Level 0

The expression detection system takes an image containing a face as input. The system then
tries to predict the emotion expressed by the face in the image.

• DFD Level 1

The expression detection system takes an image containing a face as input and detects face
features that are given to the CNN as input to predict the emotions.

Figure 5.2: DFD Level 1

• DFD Level 1.1

The face features are detected using the Viola-Jones face detection algorithm. The algorithm
detects features of faces that are then passed to the CNN to predict emotions.

Figure 5.3: DFD Level 1.1

• DFD Level 2

The face detection algorithm detects the features of the face, such as the eyes, nose and
mouth. These features are passed to the CNN for feature extraction and emotion grouping,
producing grouped emotions such as angry, happy, sad and laughing.

Figure 5.4: DFD Level 2

USE CASE DIAGRAM
This diagram shows the working of the users and the system in detail.

Figure 5.5: Use-Case

The use cases "Interact with the System" and "Provide Face" are performed by the user. Other
use cases, such as "Detect Face", "Classify Emotions" and "Display Emotions", are performed
by the system.

SEQUENCE DIAGRAM

Figure 5.6: Sequence Diagram

This diagram shows the sequence of the processes in order. The first layer depicts the input
layer. The middle layer represents the core application layer. The third and final layer
represents the model, which performs the various functions of the system such as feature
extraction, feature reduction and classification.

Chapter 6

Proposed Results

6.1 Proposed Results & Analysis

6.1.1 Dataset
We used the FER-2013 database, which was created as part of a larger project that included a
competition on Kaggle.com. We were able to look to that project to understand the dataset
and what sort of results we could hope to achieve. In particular, the winner of the competition
achieved a test accuracy of 71% by cleverly implementing a CNN whose output layer fed into
a linear SVM. We adapted the database labels to suit the classroom scenario: "angry" was
changed to "frustrated", "neutral" was kept as it is, "happy" was changed to "laughing", "sad"
was changed to "bored", "disgust" was interpreted as "confused" and "surprised" was
converted to "interested".
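As a sketch, this relabeling can be expressed as a simple dictionary; FER-2013's seventh class, "fear", is not mentioned in the text, so dropping it is our assumption:

```python
# Relabeling of FER-2013 classes to the classroom emotions used in this
# project ("fear" has no mapping in the report text, so it is dropped here).
FER_TO_CLASSROOM = {
    "angry":    "frustrated",
    "neutral":  "neutral",
    "happy":    "laughing",
    "sad":      "bored",
    "disgust":  "confused",
    "surprise": "interested",
}

def relabel(fer_label):
    # Returns None for classes with no classroom counterpart (e.g. "fear").
    return FER_TO_CLASSROOM.get(fer_label)
```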

Figure 6.1: Single Face Detection

Figure 6.2: Multiple Face Detection

6.2 Project Outcomes


Students' emotions in a classroom fall into a large variety of categories. By simply detecting
whether the students are confused or interested, a lot of value can be gathered regarding the
lecture as well as the lecturer. Educational institutions require an automated system that can
perform this task of facial emotion extraction. A human observer may perceive the emotion
only superficially, but the neural network undergoes a great amount of training on thousands
of samples. As a result, the computer's predictions become much more accurate and reliable.

Chapter 7

Conclusion

Computer vision is an extremely robust technology that can be used in a myriad of
applications. Powered by neural networks, any system implementing computer vision can be
trained to achieve a significantly high level of accuracy at detecting and recognizing objects in
a real-world environment. In our system, the convolutional neural network can achieve
remarkable results compared to other technologies that could be applied to the same problem.
The faster computation and accurate predictions obtained from the CNN model give us better
insight into the emotions expressed by the students. The categories of emotions expressed by
the students can indicate a lot of detail about the lecture in general. This system will prove to
be extremely helpful not only at improving teaching quality, but also at determining the level
of attention displayed by the students during the lecture. Based on the data obtained, further
improvements can be made to the classroom teaching pattern.

Algorithm 1: Algorithm for Face Detection

1: Start
2: Read the live video from the camera
3: while TRUE do
4:   Take a frame from the video
5:   Convert the RGB image into a grayscale image:
6:   for each pixel do
7:     grayscale value = (0.3 * R) + (0.59 * G) + (0.11 * B)
8:     Store this grayscale value in the array
9:   Store the array as an image
10:  Detect face scales from the image
11:  Draw a rectangle around each face using the scales:
12:  for (x, y, w, h) in faces do
13:    rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 2)
14:  Show the image with the detected faces
15: Destroy the image window
16: Release the camera
17: Stop
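A runnable OpenCV sketch of the algorithm above, assuming the bundled frontal haar cascade; cvtColor applies grayscale weights (0.299, 0.587, 0.114), close to the 0.3/0.59/0.11 rule in step 7:

```python
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
camera = cv2.VideoCapture(0)               # step 2: read live video from camera

while True:                                # step 3
    ok, frame = camera.read()              # step 4: take a frame from the video
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # steps 5-9: grayscale
    faces = detector.detectMultiScale(gray, 1.3, 5)  # step 10: detect faces
    for (x, y, w, h) in faces:             # steps 11-13: rectangle around faces
        cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)
    cv2.imshow("faces", frame)             # step 14: show image with faces
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cv2.destroyAllWindows()                    # step 15: destroy the image window
camera.release()                           # step 16: release the camera
```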

Chapter 8

Future Work

Pertaining to the future work on this project, we believe there are two major areas of focus
that would improve our emotion recognition performance. The first improvement is to
calibrate the architecture of the CNN used in the model to accurately match the problem our
project intends to focus on. Examples of this calibration would be removing redundant
parameters and adding new ones. Furthermore, adjusting the dropout probability and
experimenting to find ideal stride sizes can also be a focus of future work. The second
improvement area is the adaptation of the datasets to represent a real-time recognition
environment, in order to generate a model that can be used in real-life application scenarios.
A case in point would be to simulate noisy backgrounds in the images, which can help the
model generate accurate recognitions. Overall, our models achieve satisfactory results on the
FER-2013 dataset with a much simpler CNN architecture. More work is necessary to make
the real-time system robust outside laboratory conditions, and it is possible that a deeper,
more calibrated CNN could improve the results.

References

[1] Sagor Chandro Bakchy, Mst. Jannatul Ferdous, Ananna Hoque Sathi, Krishna Chandro
Ray, Faisal Imran, Md. Meraj Ali, "Facial Expression Recognition based on Support Vector
Machine using Gabor Wavelet Filter", 2nd International Conference on Electrical & Electronic
Engineering (ICEEE), 27-29 December 2017.

[2] Abir Fathallah, Lotfi Abdi, Ali Douik, "Facial Expression Recognition via Deep Learning",
IEEE/ACS 14th International Conference on Computer Systems and Applications, 2017.

[3] P. Viola and M. J. Jones, "Robust real-time face detection", International Journal of
Computer Vision, Vol. 57, No. 2, pp. 137-154, 2004.

[4] R. G. Harper, A. N. Wiens, and J. D. Matarazzo, "Non-verbal communication: The state
of the art", John Wiley & Sons, 1978.

[5] R. Chellappa, C. L. Wilson, S. Sirohey, "Human and Machine Recognition of Faces: A
Survey", Proc. IEEE, Vol. 83, No. 5, pp. 705-741, 1995.

[6] Naveen Kumar H N, Jagadeesha S, Amith K Jain, "Human Facial Expression Recognition
from Static Images using Shape and Appearance Feature", 2nd International Conference on
Applied and Theoretical Computing and Communication Technology (iCATccT), December
2016.

[7] Praseeda Lekshmi V., Dr. M. Sasikumar, Naveen S, "Analysis of Facial Expressions from
Video Images using PCA", Proceedings of the World Congress on Engineering 2008 (WCE
2008), Vol. I, July 2-4, 2008, London, U.K.

[8] Tong Zhang, Wenming Zheng, Zhen Cui, Yuan Zong, Jingwei Yan and Keyu Yan, "A Deep
Neural Network Driven Feature Learning Method for Multi-view Facial Expression
Recognition", IEEE Transactions on Multimedia, Volume 18, Issue 12, December 2016.

[9] Peter Burkert, Felix Trier, Muhammad Zeshan Afzal, Andreas Dengel and Marcus Liwicki,
"DeXpression: Deep Convolutional Neural Network for Expression Recognition",
arXiv:1509.05371v2, 17 August 2016.

[10] Rahul Islam, Karan Ahuja, Sandip Karmakar, Ferdous Barbhuiya, "SenTion: A
framework for Sensing Facial Expressions", arXiv:1608.04489, August 2016.
Appendix

Project Progress Report:

