1.1 Facial Expression: Classification of Human Faces Based On Expression Using ML 2017-2018
CHAPTER 1
INTRODUCTION
1.1 Facial Expression
Interpreting a person's emotional state in nonverbal form usually requires decoding his or her facial
expression. Body language, and facial expressions in particular, often tell us more than words
about [...]. That similarity points to the most important role of facial expression: being a channel of
nonverbal communication.
Expressions and emotions go hand in hand: specific combinations of facial muscle actions reflect a
particular emotion. For certain emotions, it is very hard, and perhaps even impossible, to suppress the
corresponding facial expression. For example, a person who tries to ignore his boss's offensive comment by
keeping a neutral expression might nevertheless show a brief expression of anger. This brief, involuntary
facial expression, shown according to the emotion experienced, is called a 'micro-expression'. A
micro-expression lasts only 1/25 to 1/15 of a second. Nonetheless, capturing it can reveal a person's
real feelings, whether he wants it or not.
Facial expression analysis has been attracting considerable attention in the advancement of
human-machine interfaces, since facial expressions provide a natural and efficient way for humans to
communicate. Some application areas related to the face and its expressions include personal identification
and access control, video phone and teleconferencing, forensic applications, human-computer interaction,
automated surveillance, cosmetology, and so on.
Although humans recognize facial expressions virtually without effort or delay, reliable expression
recognition by machine is still a challenge. There have been several advances in the past few years in
face detection, feature extraction mechanisms and the techniques used for expression classification, but
developing an automated system that accomplishes this task remains difficult. Here we present an approach
based on Convolutional Neural Networks (CNN) for facial expression recognition. The input to our
system is an image; we then use the CNN to predict the facial expression label, which should be one of a
fixed set of labels.
[...] huge memory requirement. In the convolutional layer, the same coefficients are used across different
locations in the space, so the memory requirement is drastically reduced.
In the proposed system, we provide the program with a large training set of images that represent
the emotions to be detected. Pre-processing tasks such as greyscale conversion, data augmentation,
noise removal and normalization are applied to the images to make them more suitable for feature
extraction. Each image is then sliced into smaller pieces, and a range of convolutional filters is applied
to each piece to extract distinct features from the training set. The CNN learns from these features and
tries to draw parallels among the features detected in each set. CNNs are made up of convolutional
layers, activation functions and pooling layers. After the CNN has learned the required features over a
number of iterations, it applies this learning to classify a new input image into one of two options,
happy or not happy, which is presented as the output.
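The pre-processing steps named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's actual code: the function names (`to_greyscale`, `normalize`, `flip_horizontal`), the luminance weights and the tiny 2x2 RGB image are all hypothetical, and images are held as nested lists so that the sketch needs no external library.

```python
def to_greyscale(rgb_image):
    """Convert an RGB image (rows of (r, g, b) tuples, 0-255) to greyscale.

    Uses the common ITU-R BT.601 luminance weights. This is an assumption:
    the report does not say which conversion was used.
    """
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]


def normalize(grey_image):
    """Scale pixel values from the range 0-255 to the range 0.0-1.0."""
    return [[pixel / 255.0 for pixel in row] for row in grey_image]


def flip_horizontal(image):
    """Mirror the image left to right: a simple form of data augmentation."""
    return [list(reversed(row)) for row in image]


# Hypothetical 2x2 RGB image used only to exercise the functions.
rgb = [[(255, 255, 255), (0, 0, 0)],
       [(255, 0, 0), (0, 0, 255)]]

grey = normalize(to_greyscale(rgb))
augmented = [grey, flip_horizontal(grey)]  # original plus mirrored copy
```

Mirroring is only one possible augmentation; small rotations, shifts or added noise could be generated in the same way and appended to `augmented`.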
1.5 Applications
Social Media - Expression detection can be used to gauge how happy or sad a user is and to improve
the user experience by showing content that the user likes.
Marketing - Marketing strategies can be improved if the seller knows more about how customers
feel about its online marketing techniques.
Human Computer Interaction - Expression detection makes it easier for computers to understand
the emotions of the person they are interacting with and makes robots more humanoid in nature.
E-learning platforms - When students opt for e-learning, their interest in the subject usually goes
undetected because no physical teacher is present to monitor their behaviour. Expression detection
can help check whether a student is bored by the lectures or enjoys them.
Surveillance - Facial expression mapping can help keep an eye on the micro-expressions of
[...] fake expression
CHAPTER 2
LITERATURE SURVEY
Facial expressions and uses
Facial expression recognition has attracted considerable attention in the past ten years due to its
potential applications, such as human-robot interaction [1], human-machine interface [2], surveillance [3],
driving safety [4] and health-care [5]. There are many works on recognizing emotions such as sadness,
surprise, anger, happiness, fear, disgust, and neutrality [6-9]. However, each person expresses the same
emotion in a different way, and imaging conditions also change the appearance of a facial expression.
Hence facial expression recognition is still a challenging problem. To address this challenge and achieve
more accurate results, researchers look for robust features.
K. Berns describes the face as one of the complex parts of the human body [1], consisting of a
large number of muscles that enable complex combinations of motions to produce a specific facial
expression. Investigating the biomechanics of the human face may give insight into how to mimic its
performance and behavior using mechanical and synthetic skin components. A number of universities and
research laboratories have developed humanoid robots capable of making facial expressions to create an
intuitive communication interface between robots and humans. In recent times, advances in humanoid
robot design and the realization of human-like expressive robot heads have increased interest in their use
in a number of potential social roles: as a receptionist or information desk robot for different
institutions, as a service robot in hotels, or as a replacement for humans in repetitive operations. They
can also be applied in therapeutic treatments [...], and other assistive tasks in the sales and health sectors.
Computer-animated agents and robots bring a social dimension to human-computer interaction and
force us to think in new ways about how computers could be used in daily life. Face-to-face
communication is a real-time process operating at a time scale on the order of 40 milliseconds. The level
of uncertainty at this time scale is considerable, making it necessary for humans to rely on sensory-rich
perceptual primitives rather than slow symbolic inference processes. Thus, fulfilling the idea of machines
that interact face-to-face with us requires the development of robust real-time perceptive primitives. The
authors of [2] present the first step towards the development of one such primitive: a system that
automatically finds faces in the visual video stream and codes facial expression dynamics in real time. The
performance of the system was evaluated for applications including automatic reading tutors, assessment
of human-robot interaction, and evaluation of psychiatric intervention.
For an e-Healthcare system, detecting the emotion of a patient is vital for the initial assessment of the
patient. [5] proposes a face-based emotion recognition system for e-Healthcare. For features, Weber
local descriptors (WLD) are used. In the proposed system, a static facial image is subdivided into many
blocks. A multi-scale WLD is applied to each of the blocks to obtain a WLD histogram for the image. The
significant bins of the histogram are found using the Fisher discrimination ratio. These bin values
represent the descriptors of the face. The face descriptors are then input to a support vector machine
based classifier to recognize the emotion. Two publicly available databases are used in the experiments.
Experimental results show that the proposed WLD-based system achieves a very high recognition rate,
with the highest recognition accuracy reaching 99.28% on the Cohn–Kanade database.
Methods of recognition
Geometric-based methods use feature vectors that encode facial geometric properties such as
distance, angle, and position to determine the shapes and locations of the invariance points of the face.
For instance, in [12], 34 invariance points belonging to a face image were extracted for facial expression
recognition. The success of these methods depends on powerful face component detection methods to
locate the facial invariance points, which poses some difficulties in real-life applications.
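To make the idea of a geometric feature vector concrete, the sketch below computes one distance and one angle from three facial points. It is only an illustration: the landmark names and pixel coordinates are hypothetical, and it does not reproduce the 34-point scheme of [12].

```python
import math


def distance(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def angle_at(vertex, p, q):
    """Angle in degrees at `vertex` between the rays towards p and q."""
    a1 = math.atan2(p[1] - vertex[1], p[0] - vertex[0])
    a2 = math.atan2(q[1] - vertex[1], q[0] - vertex[0])
    diff = abs(a1 - a2)
    return math.degrees(min(diff, 2 * math.pi - diff))


# Hypothetical landmark coordinates in pixels (not taken from any dataset).
landmarks = {
    "left_mouth_corner": (30.0, 70.0),
    "right_mouth_corner": (70.0, 70.0),
    "upper_lip_center": (50.0, 60.0),
}

mouth_width = distance(landmarks["left_mouth_corner"],
                       landmarks["right_mouth_corner"])
lip_angle = angle_at(landmarks["upper_lip_center"],
                     landmarks["left_mouth_corner"],
                     landmarks["right_mouth_corner"])

# A geometric feature vector is a list of such measurements.
feature_vector = [mouth_width, lip_angle]
```

In a real system the points would come from a face component detector, which is exactly the dependency the paragraph above identifies as the weak spot of geometric methods.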
Appearance-based methods use features extracted directly from the images but do not include any
information relating to the facial points. There are many appearance-based methods. The most important
ones are Local Binary Pattern (LBP) [13], Local Gabor Binary Patterns (LGBP) [14-15], Scale Invariant
Feature Transform (SIFT) [16], Histogram of Oriented Gradient (HOG) [17], and Curvelet Transform [17].
Facial expressions change certain regions of the face, which motivates focusing on just those
regions.
A number of machine learning approaches, such as decision tree learning, artificial neural
networks (ANN), support vector machines (SVM), genetic algorithms and Bayesian networks, have
already been applied to the recognition of facial expressions, and research in each of these fields has
yielded considerable results. The application of CNNs to recognition, however, has its own advantages.
Dept of CSE, RITM 6
Yu and Zhang achieved state-of-the-art results in EmotiW in 2015 using CNNs to perform facial expression recognition (FER). They
used an ensemble of CNNs with five convolutional layers each [18]. Among the insights from their paper
was that randomly perturbing the input images yielded a 2-3% boost in accuracy. Specifically, Yu and
Zhang applied transformations to the input images at train time. At test time, their model generated
predictions for multiple perturbations of each test example and voted on the class label to produce a final
answer. Also interesting is that they used stochastic-pooling rather than max-pooling because of its good
performance on limited training data.
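The test-time voting idea can be sketched as follows. This is a hedged illustration, not Yu and Zhang's implementation: the two perturbations, the helper names and the brightness-based `toy_classifier` (a stand-in for a trained CNN) are all assumptions made for the example.

```python
from collections import Counter


def flip_horizontal(image):
    """Mirror an image (list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]


def shift_right(image):
    """Shift every row one pixel to the right, padding with zero."""
    return [[0] + row[:-1] for row in image]


def predict_with_voting(classify, image, perturbations):
    """Classify the image and each perturbed copy, then take a majority vote."""
    votes = [classify(image)]
    votes += [classify(perturb(image)) for perturb in perturbations]
    return Counter(votes).most_common(1)[0][0]


def toy_classifier(image):
    """Stand-in for a trained CNN: 'happy' if the image is bright on average."""
    pixels = [p for row in image for p in row]
    return "happy" if sum(pixels) / len(pixels) > 0.5 else "not happy"


image = [[0.9, 0.8], [0.7, 0.6]]
label = predict_with_voting(toy_classifier, image,
                            [flip_horizontal, shift_right])
```

Here the shifted copy is misclassified by the toy classifier, but the original and the mirrored copy outvote it, which is the robustness effect the voting scheme is meant to provide.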
Kim et al. achieved a test accuracy of 0.61 in EmotiW 2015 by using an ensemble-based method with
varying network architectures and parameters. They used a hierarchical decision tree and an exponential
rule to combine the decisions of different networks rather than a simple weighted average, and
this improved their results. They initialized weights by training networks on other FER datasets and using
these weights for fine-tuning.
Mollahosseini et al. have also achieved state-of-the-art results in FER [20]. Their network consisted of
two convolutional layers, max-pooling, and 4 Inception layers as introduced by GoogLeNet. The proposed
architecture was tested on many publicly available data sets. It received a lower test accuracy of 0.47 on
the EmotiW data set but state-of-the-art test accuracies on other data sets (e.g. 0.93 on CK+). When
compared to an AlexNet architecture, their proposed architecture improved results by 1-3 percent on most
data sets.
CHAPTER 3
SYSTEM ANALYSIS AND REQUIREMENTS
System development can generally be thought of as having two major components: system analysis
and system design. System design is the process of planning a new business system or one to replace or
complement an existing system. But before this planning can be done, we must thoroughly understand the
old system and determine how computers can best be used to make its operation more effective.
System analysis, then, is the process of gathering and interpreting facts, diagnosing problems, and
using the information to recommend improvements to the system. Analysis specifies what the system
should do. Design states how to accomplish the objective.
3.1 Introduction
3.1.1 Purpose
[...] condition, which will enable better computer-based services. For instance, medical and welfare
services [...] problems and also improve rehabilitation. It might even be possible to diagnose mental
disease automatically. Moreover, integration with voice recognition technology will widen its application
to communication-style robots and realize the healing conversations desired especially by elderly people,
which will create new industries such as robotics, car navigation systems, and call center services and will
eventually realize a barrier-free ubiquitous computing society. Modern information communication has
[...] largely affected by who is in charge of the communication. For example, persuasion by a boyfriend or
girlfriend would be much more effective than persuasion by others. Such real-life examples show the
importance of considering emotion. However, traditional information communication techniques have not
dealt with human emotion.
3.1.2 Scope
While neural networks and other pattern detection methods have been around for the past 50 years,
there has been significant development in the area of convolutional neural networks in the recent past.
CNNs are among the best-performing methods for pattern and image recognition problems and have even
been reported to outperform humans in certain cases. CNNs are often used in image recognition systems.
When applied to facial recognition, CNNs achieved a large decrease in error rate. In 2015 a many-layered
CNN demonstrated the ability to
spot faces from a wide range of angles, including upside down, even when partially occluded, with
competitive performance. Compared to image data domains, there is relatively little work on applying
CNNs to video classification. Video is more complex than images since it has another (temporal)
dimension. However, some extensions of CNNs into the video domain have been explored. One approach
is to treat space and time as equivalent dimensions of the input and perform convolutions in both time and
space. CNNs have also been explored for natural language processing. CNN models are effective for
various NLP problems and have achieved excellent results in semantic parsing, search query retrieval,
sentence modelling, classification, prediction and other traditional NLP tasks. CNNs have also been used
in drug discovery, where predicting the interaction between molecules and biological proteins can identify
potential treatments.
We assume that we can get clear images of faces and that the expression on the face depicts the
emotional state of the person.
Correct operation depends on the correctness of the training dataset and the quality of the images in the set.
Digital camera
CHAPTER 4
SYSTEM DESIGN
4.1 Use Case diagram
A use case diagram at its simplest is a representation of a user's interaction with the system that shows
the relationship between the user and the different use cases in which the user is involved. A use case
diagram can identify the different types of users of a system and the different use cases and will often be
accompanied by other types of diagrams as well.
A sequence diagram is an interaction diagram that shows how objects operate with one another and in
what order. It is a construct of a message sequence chart.
A sequence diagram shows object interactions arranged in time sequence. It depicts the objects and
classes involved in the scenario and the sequence of messages exchanged between the objects needed to
carry out the functionality of the scenario. Sequence diagrams are typically associated with use case
realizations in the Logical View of the system under development. Sequence diagrams are sometimes
called event diagrams or event scenarios.
Activity diagrams are constructed from a limited number of shapes, connected with arrows.[4] The
most important shape types:
CHAPTER 5
Machine learning is a field of computer science that often uses statistical techniques to give computers
the ability to "learn" (i.e., progressively improve performance on a specific task) with data, without being
explicitly programmed. Evolved from the study of pattern recognition and computational learning theory in
artificial intelligence, machine learning explores the study and construction of algorithms that can learn
from and make predictions on data – such algorithms overcome following strictly static program
instructions by making data-driven predictions or decisions, through building a model from sample inputs.
Machine learning is employed in a range of computing tasks where designing and programming explicit
algorithms with good performance is difficult or infeasible; example applications include email filtering,
detection of network intruders or malicious insiders working towards a data breach, optical character
recognition (OCR), learning to rank, and computer vision.
A core objective of a learner is to generalize from its experience. Generalization in this context is the
ability of a learning machine to perform accurately on new, unseen examples/tasks after having
experienced a learning data set. The training examples come from some generally unknown probability
distribution (considered representative of the space of occurrences) and the learner has to build a general
model about this space that enables it to produce sufficiently accurate predictions in new cases.
Bayesian networks
Reinforcement learning
Representation learning
Similarity and metric learning
Sparse dictionary learning
Genetic algorithms
Rule-based machine learning
In machine learning, a convolutional neural network (CNN, or ConvNet) is a class of deep, feed-
forward artificial neural networks that has successfully been applied to analyzing visual imagery. CNNs
use a variation of multilayer perceptrons designed to require minimal preprocessing.
Convolutional networks were inspired by biological processes in that the connectivity pattern between
neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to
stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of
different neurons partially overlap such that they cover the entire visual field.
CNNs use relatively little pre-processing compared to other image classification algorithms. This
means that the network learns the filters that in traditional algorithms were hand-engineered. This
independence from prior knowledge and human effort in feature design is a major advantage.
They have applications in image and video recognition, recommender systems and natural language
processing.
5.1.2.1 History
CNNs are commonly associated with computer vision, with historical roots traced back to the 1980s,
when Kunihiko Fukushima proposed a neural network architecture inspired by the feline visual processing
system. [...]
generalize very well, or learn how those patterns might occur in other parts of the image. For example,
[...]
a puppy standing off-center in the image, the image flipped upside down, or partly obscured by wearing a
hat. [...]
around this problem by creating a mechanism for classifiers to be unaffected by patterns that have been
shifted in position. CNN design follows vision processing in living organisms.
Receptive fields
Work by Hubel and Wiesel in the 1950s and 1960s showed that cat and monkey visual cortexes
contain neurons that individually respond to small regions of the visual field. Provided the eyes are not
moving, the region of visual space within which visual stimuli affect the firing of a single neuron is known
as its receptive field. Neighboring cells have similar and overlapping receptive fields. Receptive field
size and location vary systematically across the cortex to form a complete map of visual space. The
cortex in each hemisphere represents the contralateral visual field. Their 1968 paper identified two basic
visual cell types in the brain:
Simple cells, whose output is maximized by straight edges having particular orientations within their
receptive field
Complex cells, which have larger receptive fields, whose output is insensitive to the exact position of
the edges in the field.
Neocognitron
The neocognitron was introduced in 1980. The neocognitron does not require units located at multiple
network positions to have the same trainable weights. This idea appears in 1986 in the book version of the
original backpropagation paper. Neocognitrons were developed in 1988 for temporal signals.
Their design was improved in 1998, generalized in 2003 and simplified in the same year.
LeNet-5
LeNet-5, a pioneering 7-level convolutional network by LeCun et al. in 1998, that classifies digits, was
applied by several banks to recognise hand-written numbers on checks (cheques) digitized in 32x32 pixel
images. The ability to process higher resolution images requires larger and more convolutional layers, so
this technique is constrained by the availability of computing resources.
Similarly, a shift invariant neural network was proposed for image character recognition in 1988. The
architecture and training algorithm were modified in 1991 and applied for medical image processing and
automatic detection of breast cancer in mammograms.
A different convolution-based design was proposed in 1988 for application to decomposition of one-
dimensional electromyography convolved signals via de-convolution. This design was modified in 1989 to
other de-convolution-based designs.
The feed-forward architecture of convolutional neural networks was extended in the neural abstraction
pyramid by lateral and feedback connections. The resulting recurrent convolutional network allows for the
flexible incorporation of contextual information to iteratively resolve local ambiguities. In contrast to
previous models, image-like outputs at the highest resolution were generated.
GPU implementations
Following the 2005 paper that established the value of GPGPU for machine learning, several
publications described more efficient ways to train convolutional neural networks using GPUs. In 2011,
they were refined and implemented on a GPU, with impressive results. In 2012, Ciresan et al. significantly
improved on the best performance in the literature for multiple image databases, including the MNIST
database, the NORB database, the HWDB1.0 dataset (Chinese characters), the CIFAR10 dataset
(60000 32x32 labeled RGB images), and the ImageNet dataset.
A CNN architecture is formed by a stack of distinct layers that transform the input volume into an
output volume (e.g. holding the class scores) through a differentiable function. A few distinct types of
layers are commonly used. We discuss them further below:
I. Convolutional layer
The convolutional layer is the core building block of a CNN. The layer's parameters consist of a set of
learnable filters (or kernels), which have a small receptive field, but extend through the full depth of the
input volume. During the forward pass, each filter is convolved across the width and height of the input
volume, computing the dot product between the entries of the filter and the input and producing a 2-
dimensional activation map of that filter. As a result, the network learns filters that activate when it detects
some specific type of feature at some spatial position in the input.
Stacking the activation maps for all filters along the depth dimension forms the full output volume of
the convolution layer. Every entry in the output volume can thus also be interpreted as an output of a
neuron that looks at a small region in the input and shares parameters with neurons in the same activation
map.
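The forward pass of a single filter can be sketched as below. The sketch makes several simplifying assumptions that are not from the report: the input has depth 1, the stride is 1, there is no padding, and the 4x4 image and the filter values are hand-picked. As in most CNN libraries, the filter is slid over the input without being flipped (strictly speaking a cross-correlation).

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the
    activation map of dot products between the filter and each patch."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    activation_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the filter and the patch it covers.
            row.append(sum(kernel[a][b] * image[i + a][j + b]
                           for a in range(k_h) for b in range(k_w)))
        activation_map.append(row)
    return activation_map


# Hypothetical 4x4 single-channel input with a vertical edge in the middle.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]

# Hand-picked 2x2 filter that responds to dark-to-bright vertical edges.
vertical_edge = [[-1, 1],
                 [-1, 1]]

activation_map = convolve2d(image, vertical_edge)
```

Every row of `activation_map` is `[0, 2, 0]`: the filter activates only where its window straddles the edge. In a trained network the filter values are learned rather than chosen by hand, and one such map is produced per filter and stacked along the depth dimension.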
Local connectivity
When dealing with high-dimensional inputs such as images, it is impractical to connect neurons to all
neurons in the previous volume because such a network architecture does not take the spatial structure of
the data into account. Convolutional networks exploit spatially local correlation by enforcing a local
connectivity pattern between neurons of adjacent layers: each neuron is connected to only a small region of
the input volume. The extent of this connectivity is a hyperparameter called the receptive field of the
neuron. The connections are local in space (along width and height), but always extend along the entire
depth of the input volume. Such an architecture ensures that the learnt filters produce the strongest
response to a spatially local input pattern.
Fig 6. Neurons of a convolutional layer (blue), connected to their receptive field (red)
Spatial arrangement
Three hyperparameters control the size of the output volume of the convolutional layer: the depth,
stride and zero-padding.
The depth of the output volume controls the number of neurons in a layer that connect to the
same region of the input volume. These neurons learn to activate for different features in the input. For
example, if the first convolutional layer takes the raw image as input, then different neurons along the
depth dimension may activate in the presence of various oriented edges, or blobs of color.
Stride controls how depth columns around the spatial dimensions (width and height) are
allocated. When the stride is 1 then we move the filters one pixel at a time. This leads to heavily
overlapping receptive fields between the columns, and also to large output volumes. When the stride is 2
(or rarely 3 or more) then the filters jump 2 pixels at a time as they slide around. The receptive fields
overlap less and the resulting output volume has smaller spatial dimensions.[33]
Sometimes it is convenient to pad the input with zeros on the border of the input volume. The
size of this padding is a third hyperparameter. Padding provides control of the output volume spatial size.
In particular, sometimes it is desirable to exactly preserve the spatial size of the input volume.
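How stride and zero-padding determine the spatial size of the output can be written as a small helper. The relation used, (W - F + 2P) / S + 1 for input size W, filter size F, padding P and stride S, is the standard one for convolutional layers; it is not stated in the report itself, and the function name and example sizes are illustrative assumptions.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial output size of a convolutional layer along one dimension,
    computed as (W - F + 2P) / S + 1."""
    span = input_size - filter_size + 2 * padding
    if span < 0 or span % stride != 0:
        raise ValueError("filter does not tile the padded input with this stride")
    return span // stride + 1


# Padding of 2 with a 5x5 filter and stride 1 preserves a 32-pixel input.
same_size = conv_output_size(32, 5, stride=1, padding=2)

# Without padding the same filter shrinks the output.
no_padding = conv_output_size(32, 5, stride=1, padding=0)

# A stride of 2 makes the filter jump two pixels at a time.
strided = conv_output_size(7, 3, stride=2, padding=0)
```

The three calls give 32, 28 and 3 respectively, matching the text: padding can preserve the input size exactly, and a larger stride yields smaller spatial dimensions.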
Parameter sharing
A parameter sharing scheme is used in convolutional layers to control the number of free parameters.
It relies on one reasonable assumption: that if a patch feature is useful to compute at some spatial position,
then it should also be useful to compute at other positions. In other words, denoting a single 2-dimensional
slice of depth as a depth slice, we constrain the neurons in each depth slice to use the same weights and
bias.
Since all neurons in a single depth slice share the same parameters, then the forward pass in each depth
slice of the CONV layer can be computed as a convolution of the neuron's weights with the input volume
(hence the name: convolutional layer). Therefore, it is common to refer to the sets of weights as a filter (or
a kernel), which is convolved with the input. The result of this convolution is an activation map, and the set
of activation maps for each different filter are stacked together along the depth dimension to produce the
output volume. Parameter sharing contributes to the translation invariance of the CNN architecture.
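The effect of parameter sharing on the number of free parameters can be checked with simple arithmetic. The layer dimensions below (a 48x48 greyscale input and 32 filters of size 5x5) are hypothetical and chosen only for illustration; they are not the configuration used in this project.

```python
def conv_layer_parameters(filter_size, input_depth, num_filters):
    """Free parameters with sharing: one set of weights plus one bias per filter."""
    return (filter_size * filter_size * input_depth + 1) * num_filters


def unshared_layer_parameters(out_h, out_w, filter_size, input_depth, num_filters):
    """Free parameters when every output neuron has its own weights and bias."""
    per_neuron = filter_size * filter_size * input_depth + 1
    return out_h * out_w * num_filters * per_neuron


# Hypothetical layer: 48x48 greyscale input, 32 filters of size 5x5,
# stride 1 and no padding, so the output volume is 44x44x32.
shared = conv_layer_parameters(5, 1, 32)
unshared = unshared_layer_parameters(44, 44, 5, 1, 32)
```

With sharing the layer has 832 parameters; without it the same connectivity would need 1,610,752, a factor of 44 x 44 = 1936 more.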
Sometimes the parameter sharing assumption may not make sense. This is especially the case when the
input images to a CNN have some specific centered structure, in which we expect completely different
features to be learned on different spatial locations. One practical example is when the input are faces that
have been centered in the image: we might expect different eye-specific or hair-specific features to be
learned in different parts of the image. In that case it is common to relax the parameter sharing scheme, and
instead simply call the layer a locally connected layer.
Another important concept of CNNs is pooling, which is a form of non-linear down-sampling. There
are several non-linear functions to implement pooling among which max pooling is the most common. It
partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs
the maximum. The intuition is that the exact location of a feature is less important than its rough location
relative to other features. The pooling layer serves to progressively reduce the spatial size of the
representation, to reduce the number of parameters and amount of computation in the network, and hence
to also control overfitting. It is common to periodically insert a pooling layer between successive
convolutional layers in a CNN architecture. The pooling operation provides another form of translation
invariance.
The pooling layer operates independently on every depth slice of the input and resizes it spatially. The
most common form is a pooling layer with filters of size 2x2 applied with a stride of 2, which downsamples
every depth slice in the input by 2 along both width and height, discarding 75% of the activations. In this
case, every max operation is over 4 numbers. The depth dimension remains unchanged.
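A minimal sketch of 2x2 max pooling with a stride of 2 on a single depth slice is given below. The function name and the 4x4 feature map are assumptions made for the example, and the sketch expects an even height and width.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on one depth slice (even height/width)."""
    pooled = []
    for i in range(0, len(feature_map), 2):
        row = []
        for j in range(0, len(feature_map[0]), 2):
            # Each output value is the maximum over a block of 4 numbers.
            row.append(max(feature_map[i][j], feature_map[i][j + 1],
                           feature_map[i + 1][j], feature_map[i + 1][j + 1]))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 activation map.
feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]]

pooled = max_pool_2x2(feature_map)
```

The result is `[[4, 2], [2, 8]]`: 16 activations are reduced to 4, so 75% of them are discarded, while the strongest response in each block is kept.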
In addition to max pooling, the pooling units can use other functions, such as average pooling or L2-
norm pooling. Average pooling was often used historically but has recently fallen out of favour compared
to max pooling, which works better in practice.
Due to the aggressive reduction in the size of the representation, the trend is towards using smaller
filters or discarding the pooling layer altogether.
Other functions are also used to increase nonlinearity, for example the saturating hyperbolic tangent
$f(x) = \tanh(x)$, its absolute value $f(x) = |\tanh(x)|$,
and the sigmoid function
$f(x) = (1 + e^{-x})^{-1}$. ReLU, $f(x) = \max(0, x)$, is often preferred to other functions because it
trains the neural network several times faster[39] without a significant penalty to generalisation accuracy.
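The activation functions mentioned here are short enough to write out directly. The sketch below uses only the standard `math` module; the ReLU definition max(0, x) is the usual one, and the sample inputs are arbitrary.

```python
import math


def relu(x):
    """Rectified linear unit: passes positive values, zeroes out negatives."""
    return max(0.0, x)


def tanh(x):
    """Saturating hyperbolic tangent, with outputs in (-1, 1)."""
    return math.tanh(x)


def abs_tanh(x):
    """Absolute value of the hyperbolic tangent, with outputs in [0, 1)."""
    return abs(math.tanh(x))


def sigmoid(x):
    """Logistic sigmoid (1 + e^-x)^-1, with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Evaluate every function on a few arbitrary sample inputs.
inputs = [-2.0, 0.0, 2.0]
table = {f.__name__: [f(x) for x in inputs]
         for f in (relu, tanh, abs_tanh, sigmoid)}
```

Unlike tanh and the sigmoid, ReLU does not saturate for large positive inputs and is cheap to evaluate, which is commonly given as the reason for its faster training.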
Finally, after several convolutional and max pooling layers, the high-level reasoning in the neural
network is done via fully connected layers. Neurons in a fully connected layer have connections to all
activations in the previous layer, as seen in regular neural networks. Their activations can hence be
computed with a matrix multiplication followed by a bias offset.
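The statement that a fully connected layer is a matrix multiplication followed by a bias offset can be made concrete as follows. The activations, weights and biases are made-up numbers for two output classes, not values from a trained network.

```python
def fully_connected(inputs, weights, biases):
    """Compute W x + b, where `weights` has one row per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


# Hypothetical flattened activations from the last pooling layer.
activations = [1.0, 2.0, 3.0]

# Hypothetical weights and biases for two output classes.
weights = [[0.5, -1.0, 0.5],
           [1.0, 0.0, -1.0]]
biases = [0.1, 0.2]

scores = fully_connected(activations, weights, biases)
```

Each output neuron sees every input activation, so the number of weights is the product of the input and output sizes; this is why such layers are usually placed only after pooling has reduced the representation.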
CNNs use more hyperparameters than a standard MLP. While the usual rules for learning rates and
regularization constants still apply, the following should be kept in mind when optimising.
Number of filters
Since feature map size decreases with depth, layers near the input layer will tend to have fewer filters
while higher layers can have more. To equalize computation at each layer, the product of the number of
feature maps and the number of pixel positions is kept roughly constant across layers. Preserving more
information about the input would require
keeping the total number of activations (number of feature maps times number of pixel positions) non-
decreasing from one layer to the next.
The number of feature maps directly controls capacity and depends on the number of available
examples and task complexity.
Filter shape
Common field shapes found in the literature vary greatly, and are usually chosen based on the dataset.
The challenge is thus to find the right level of granularity so as to create abstractions at the proper scale,
given a particular dataset.
Typical pooling window sizes are 2x2. Very large input volumes may warrant 4x4 pooling in the lower layers.
However, choosing larger shapes will dramatically reduce the dimension of the signal, and may result in
excess information loss. Often, non-overlapping pooling windows perform best.
For many applications, little training data is available. Convolutional neural networks usually require
a large amount of training data in order to avoid overfitting. A common technique is to train the network on
a larger data set from a related domain. Once the network parameters have converged an additional training
step is performed using the in-domain data to fine-tune the network weights. This allows convolutional
networks to be successfully applied to problems with small training sets.
Here we make use of the Python programming language and run it using the Spyder IDE.
Python features a dynamic type system and automatic memory management and supports multiple
programming paradigms, including object-oriented, imperative, functional programming, and procedural
styles. It has a large and comprehensive standard library.
Python interpreters are available for many operating systems, allowing Python code to run on a wide
variety of systems. CPython, the reference implementation of Python, is open source software and has a
community-based development model, as do nearly all of its variant implementations. CPython is managed
by the non-profit Python Software Foundation.
Python uses dynamic typing and a mix of reference counting and a cycle-detecting garbage collector
for memory management. An important feature of Python is dynamic name resolution (late binding), which
binds method and variable names during program execution.
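Late binding can be observed directly. In the sketch below (the function names are hypothetical), greet() looks up the name helper each time it runs, so rebinding helper changes what greet() returns.

```python
def helper():
    return "first"


def greet():
    # The name 'helper' is resolved when greet() runs, not when it is defined.
    return helper()


before = greet()


def helper():  # rebinds the name 'helper' to a new function object
    return "second"


after = greet()
print(before, after)  # first second
```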
The design of Python offers some support for functional programming in the Lisp tradition. The
language has map() and filter() functions (reduce() lives in the functools module in Python 3); list
comprehensions, dictionaries, and sets; and generator expressions. The standard library has two modules
(itertools and functools) that implement functional tools borrowed from Haskell and Standard ML.
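A short sketch of these tools, written for Python 3 (where reduce() is imported from functools). The variable names and values are illustrative only.

```python
from functools import reduce
from itertools import count, islice

squares = list(map(lambda n: n * n, range(5)))        # apply a function to every item
evens = list(filter(lambda n: n % 2 == 0, squares))   # keep items that pass a test
total = reduce(lambda a, b: a + b, squares)           # fold the items into one value

comprehension = [n * n for n in range(5) if n % 2 == 0]  # list comprehension
lazy_total = sum(n * n for n in range(5))                # generator expression
first_odds = list(islice(count(1, 2), 3))                # first 3 items of an endless iterator

print(squares, evens, total)                  # [0, 1, 4, 9, 16] [0, 4, 16] 30
print(comprehension, lazy_total, first_odds)  # [0, 4, 16] 30 [1, 3, 5]
```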
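The core philosophy of the language is summarized by the document The Zen of Python (PEP 20),
which includes aphorisms such as "Beautiful is better than ugly", "Explicit is better than implicit",
"Simple is better than complex" and "Readability counts".
A small core language with a large standard library and an easily extensible interpreter was intended by
Van Rossum from the start because of his frustrations with ABC, which espoused the opposite mindset.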
While offering choice in coding methodology, the Python philosophy rejects exuberant syntax, such as
in Perl, in favor of a sparser, less-cluttered grammar. As Alex Martelli put it: "To describe something as
clever is not considered a compliment in the Python culture." Python's philosophy rejects the Perl "there is
more than one way to do it" approach to language design in favor of "there should be one—and preferably
only one—obvious way to do it".
Python's developers strive to avoid premature optimization, and moreover, reject patches to non-
critical parts of CPython that would offer a marginal increase in speed at the cost of clarity. When speed is
important, a Python programmer can move time-critical functions to extension modules written in
languages such as C, or try using PyPy, a just-in-time compiler. Cython is also available, which translates a
Python script into C and makes direct C-level API calls into the Python interpreter.
An important goal of Python's developers is making it fun to use. This is reflected in the origin of the
name, which comes from Monty Python, and in an occasionally playful approach to tutorials and reference
materials, such as using examples that refer to spam and eggs instead of the standard foo and bar.
A common neologism in the Python community is pythonic, which can have a wide range of meanings
related to program style. To say that code is pythonic is to say that it uses Python idioms well, that it is
natural or shows fluency in the language, that it conforms with Python's minimalist philosophy and
emphasis on readability. In contrast, code that is difficult to understand or reads like a rough transcription
from another programming language is called unpythonic.
Users and admirers of Python, especially those considered knowledgeable or experienced, are often
referred to as Pythonists, Pythonistas, and Pythoneers.
Indentation
Python uses whitespace indentation to delimit blocks – rather than curly braces or keywords. An
increase in indentation comes after certain statements; a decrease in indentation signifies the end of the
current block. This feature is also sometimes termed the off-side rule.
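A minimal sketch of indentation-delimited blocks follows. The function name classify, the 0.5 threshold and the label "happy" are assumptions made for illustration; they are not taken from the project code.

```python
def classify(score):
    # The indented lines below form the body of the function.
    if score >= 0.5:
        label = "happy"        # one level deeper: body of the if branch
    else:
        label = "not happy"    # body of the else branch
    return label               # back at function level: the if/else block has ended


print(classify(0.7))  # happy
print(classify(0.2))  # not happy
```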
Python's statements include the following:
● The assignment statement (token '=', the equals sign). This operates differently than in traditional
imperative programming languages, and this fundamental mechanism (including the nature of Python's
version of variables) illuminates many other features of the language. Assignment in C, e.g., x = 2,
translates to "typed variable name x receives a copy of numeric value 2". The (right-hand) value is
copied into an allocated storage location for which the (left-hand) variable name is the symbolic
address. The memory allocated to the variable is large enough (potentially quite large) for the declared
type. In the simplest case of Python assignment, using the same example, x = 2, translates to "(generic)
name x receives a reference to a separate, dynamically allocated object of numeric (int) type of value
2." This is termed binding the name to the object. Since the name's storage location doesn't contain the
indicated value, it is improper to call it a variable. Names may be subsequently rebound at any time to
objects of greatly varying types, including strings, procedures, complex objects with data and methods,
etc. Successive assignments of a common value to multiple names, e.g., x = 2; y = 2; z = 2 result in
allocating storage to (at most) three names and one numeric object, to which all three names are
bound. Since a name is a generic reference holder it is unreasonable to associate a fixed data type with
it. However at a given time a name will be bound to some object, which will have a type; thus there is
dynamic typing.
● The if statement, which conditionally executes a block of code, along with else and elif (a contraction
of else-if).
● The for statement, which iterates over an iterable object, capturing each element to a local variable for
use by the attached block.
● The while statement, which executes a block of code as long as its condition is true.
● The try statement, which allows exceptions raised in its attached code block to be caught and handled
by except clauses; it also ensures that clean-up code in a finally block will always be run regardless of
how the block exits.
● The class statement, which executes a block of code and attaches its local namespace to a class, for use
in object-oriented programming.
● The def statement, which defines a function or method.
● The with statement (from Python 2.5), which encloses a code block within a context manager (for
example, acquiring a lock before the block of code is run and releasing the lock afterwards, or opening
a file and then closing it), allowing Resource Acquisition Is Initialization (RAII)-like behavior.
● The pass statement, which serves as a NOP. It is syntactically needed to create an empty code block.
● The assert statement, used during debugging to check for conditions that ought to apply.
● The yield statement, which returns a value from a generator function. From Python 2.5, yield is also an
operator. This form is used to implement coroutines.
● The import statement, which is used to import modules whose functions or variables can be used in the
current program.
● The print statement was changed to the print() function in Python 3.
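Two of the statements above, assignment and try, are sketched below. The function names and values are hypothetical. The first function shows that assignment binds names to objects rather than copying values; the second shows that a finally block runs after the exception has been handled.

```python
def binding_demo():
    name = 2
    first_type = type(name).__name__    # the name is bound to an int object
    name = "two"
    second_type = type(name).__name__   # the same name is rebound to a str object

    a = [1, 2]
    b = a           # b is bound to the same list object; nothing is copied
    b.append(3)     # so the change is visible through both names
    return first_type, second_type, a, a is b


def cleanup_demo():
    events = []
    try:
        events.append("try")
        raise ValueError("boom")
    except ValueError:
        events.append("except")
    finally:
        events.append("finally")   # runs however the try block exits
    return events


print(binding_demo())   # ('int', 'str', [1, 2, 3], True)
print(cleanup_demo())   # ['try', 'except', 'finally']
```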
Python does not support tail call optimization or first-class continuations, and, according to Guido van
Rossum, it never will. However, better support for coroutine-like functionality is provided in 2.5, by
extending Python's generators. Before 2.5, generators were lazy iterators; information was passed
unidirectionally out of the generator. As of Python 2.5, it is possible to pass information back into a
generator function, and as of Python 3.3, the information can be passed through multiple stack levels.
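The sketch below illustrates both features under Python 3.3 or later: send() passes values back into a generator, and yield from lets an outer generator delegate to an inner one and receive its return value. The generator names are hypothetical.

```python
def running_total():
    """Yield the running sum; new values arrive through send()."""
    total = 0
    while True:
        value = yield total
        if value is None:      # a plain next() ends the accumulation
            return total
        total += value


def report():
    """Delegate to running_total() and pick up its return value."""
    final = yield from running_total()
    yield "done: %d" % final


gen = report()
print(next(gen))     # 0, runs to the first yield
print(gen.send(3))   # 3, the value passes through report() into running_total()
print(gen.send(4))   # 7
print(next(gen))     # done: 7
```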
Expressions
Some Python expressions are similar to languages such as C and Java, while some are not:
● Addition, subtraction, and multiplication are the same, but the behavior of division differs. Python also
added the ** operator for exponentiation.
● As of Python 3.5, it supports matrix multiplication directly with the @ operator, versus C and Java,
which implement these as library functions. Earlier versions of Python also used methods instead of an
infix operator.
● In Python, == compares by value, versus Java, which compares numerics by value and objects by
reference. (Value comparisons in Java on objects can be performed with the equals() method.) Python's
is operator may be used to compare object identities (comparison by reference). In Python,
comparisons may be chained, for example a <= b <= c.
● Python uses the words and, or, not for its boolean operators rather than the symbolic &&, ||, ! used in
Java and C.
● Python has a type of expression termed a list comprehension. Python 2.4 extended list comprehensions
into a more general expression termed a generator expression.
● Anonymous functions are implemented using lambda expressions; however, these are limited in that
the body can only be one expression.
● Conditional expressions in Python are written as x if c else y (different in order of operands from the c
? x : y operator common to many other languages).
● Python makes a distinction between lists and tuples. Lists are written as [1, 2, 3], are mutable, and
cannot be used as the keys of dictionaries (dictionary keys must be immutable in Python). Tuples are
written as (1, 2, 3), are immutable and thus can be used as the keys of dictionaries, provided all
elements of the tuple are immutable. The parentheses around the tuple are optional in some contexts.
Tuples can appear on the left side of an equal sign; hence a statement like x, y = y, x can be used to
swap two variables.
● Python has a "string format" operator %. This functions analogously to printf format strings in C, e.g.
"spam=%s eggs=%d" % ("blah", 2) evaluates to "spam=blah eggs=2". In Python 3 and 2.6+, this was
supplemented by the format() method of the str class, e.g. "spam={0} eggs={1}".format("blah", 2).
● Python has various kinds of string literals:
○ Strings delimited by single or double quote marks. Unlike in Unix shells, Perl and Perl-influenced
languages, single quote marks and double quote marks function identically. Both kinds of string use
the backslash (\) as an escape character. String interpolation (done as "$spam" in Unix shells and Perl-
influenced languages) became available in Python 3.6 as "formatted string literals".
○ Triple-quoted strings, which begin and end with a series of three single or double quote marks. They
may span multiple lines and function like here documents in shells, Perl and Ruby.
○ Raw string varieties, denoted by prefixing the string literal with an r. Escape sequences are not
interpreted; hence raw strings are useful where literal backslashes are common, such as regular
expressions and Windows-style paths. Compare "@-quoting" in C#.
● Python has array index and array slicing expressions on lists, denoted as a[key], a[start:stop] or
a[start:stop:step]. Indexes are zero-based, and negative indexes are relative to the end. Slices take
elements from the start index up to, but not including, the stop index. The third slice parameter, called
step or stride, allows elements to be skipped and reversed. Slice indexes may be omitted, for example
a[:] returns a copy of the entire list. Each element of a slice is a shallow copy.
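Several of the expression forms listed above are collected in the following sketch, written for Python 3.6 or later (the f-string line needs that version). All variable names and values are illustrative.

```python
true_div = 7 / 2            # 3.5: '/' is true division in Python 3
floor_div = 7 // 2          # 3: '//' rounds down
power = 2 ** 10             # 1024

b = 5
in_range = 1 <= b <= 10                   # chained comparison
size = "small" if b < 3 else "large"      # conditional expression

x, y = 1, 2
x, y = y, x                               # tuple assignment swaps the two names

old_style = "spam=%s eggs=%d" % ("blah", 2)
new_style = "spam={0} eggs={1}".format("blah", 2)
f_style = f"spam={'blah'} eggs={2}"       # formatted string literal

raw_path = r"C:\new"                      # raw string: the backslash is kept literally

a = [0, 1, 2, 3, 4, 5]
middle = a[1:4]                           # [1, 2, 3]: stop index is excluded
every_other = a[::2]                      # [0, 2, 4]
backwards = a[::-1]                       # [5, 4, 3, 2, 1, 0]
copy_of_a = a[:]                          # a new list with the same elements

points = {(0, 0): "origin"}               # an immutable tuple used as a dictionary key
```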
In Python, a distinction between expressions and statements is rigidly enforced, in contrast to languages
such as Common Lisp, Scheme, or Ruby. This leads to duplicating some functionality. For example:
● The eval() vs. exec() built-in functions (in Python 2, exec is a statement); the former is for expressions,
the latter is for statements.
Statements cannot be a part of an expression, so list and other comprehensions or lambda expressions,
all being expressions, cannot contain statements. A particular case of this is that an assignment statement
such as a = 1 cannot form part of the conditional expression of a conditional statement. This has the
advantage of avoiding a classic C error of mistaking an assignment operator = for an equality operator ==
in conditions: if (c = 1) { ... } is syntactically valid (but probably unintended) C code but if c = 1: ... causes
a syntax error in Python.
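The split between expressions and statements can be demonstrated with the two built-ins named above. The strings passed to eval() and exec() are illustrative.

```python
# eval() accepts a single expression and returns its value.
value = eval("2 + 3 * 4")

# exec() accepts statements; results are read back from the namespace it ran in.
namespace = {}
exec("total = 0\nfor n in range(4):\n    total += n", namespace)

# An assignment is a statement, so eval() rejects it.
try:
    eval("a = 1")
    rejected = False
except SyntaxError:
    rejected = True

print(value, namespace["total"], rejected)  # 14 6 True
```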
Methods
Methods on objects are functions attached to the object's class; the syntax instance.method(argument)
is, for normal methods and functions, syntactic sugar for Class.method(instance, argument). Python
methods have an explicit self parameter to access instance data, in contrast to the implicit self (or this) in
some other object-oriented programming languages (e.g., C++, Java, Objective-C, or Ruby).
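The equivalence of the two call forms can be checked with a small class. The class Greeter and its values are hypothetical.

```python
class Greeter:
    def __init__(self, name):
        self.name = name            # instance data is reached through the explicit self

    def greet(self, greeting):
        return "%s, %s!" % (greeting, self.name)


g = Greeter("spam")
via_instance = g.greet("Hello")           # usual method call
via_class = Greeter.greet(g, "Hello")     # same call with the instance passed explicitly

print(via_instance)               # Hello, spam!
print(via_instance == via_class)  # True
```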
Typing
Python uses duck typing and has typed objects but untyped variable names. Type constraints are not
checked at compile time; rather, operations on an object may fail, signifying that the given object is not of
a suitable type. Despite being dynamically typed, Python is strongly typed, forbidding operations that are
not well-defined (for example, adding a number to a string) rather than silently attempting to make sense of
them.
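Both properties can be seen in a short sketch. The classes Duck and Robot are hypothetical; any object with a speak() method is accepted, while adding a number to a string raises an error instead of being silently converted.

```python
class Duck:
    def speak(self):
        return "Quack"


class Robot:
    def speak(self):
        return "Beep"


def make_it_speak(thing):
    # No type check: the call works for any object that has a speak() method.
    return thing.speak()


def add_number_to_string():
    try:
        return 1 + "1"
    except TypeError as exc:
        return type(exc).__name__   # strong typing: the operation is refused


print(make_it_speak(Duck()), make_it_speak(Robot()))  # Quack Beep
print(add_number_to_string())                         # TypeError
```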
Python allows programmers to define their own types using classes, which are most often used for
object-oriented programming. New instances of classes are constructed by calling the class (for example,
SpamClass() or EggsClass()), and the classes are instances of the metaclass type (itself an instance of
itself), allowing metaprogramming and reflection.
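The relationship between instances, classes and the metaclass type can be inspected at run time. In this sketch the second class is built by calling type() directly, as a small example of metaprogramming; the attribute name kind is hypothetical.

```python
class SpamClass:
    pass


spam = SpamClass()   # calling the class constructs a new instance

# Build a second class at run time by calling the metaclass directly:
# type(name, bases, namespace).
EggsClass = type("EggsClass", (SpamClass,), {"kind": "eggs"})

print(type(spam) is SpamClass)   # True: the instance's type is its class
print(type(SpamClass) is type)   # True: the class is an instance of type
print(type(type) is type)        # True: type is an instance of itself
print(EggsClass().kind)          # eggs
```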
Before version 3.0, Python had two kinds of classes: old-style and new-style. The syntax of both styles
is the same, the difference being whether the class inherits, directly or indirectly, from the class object
(all new-style classes inherit from object and are instances of type). In versions of Python 2 from Python 2.2
onwards, both kinds of classes can be used. Old-style classes were eliminated in Python 3.0.
The long-term plan is to support gradual typing. As of Python 3.5, the syntax of the language allows
specifying static types, but they are not checked in the default implementation, CPython. An experimental
optional static type checker named mypy supports compile-time type checking.
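The following sketch shows that annotations are recorded but not enforced when the program runs under CPython. The function double is hypothetical.

```python
def double(n: int) -> int:
    return n * 2


print(double(4))               # 8
print(double("ab"))            # abab: the str argument is not rejected at run time
print(double.__annotations__)  # the hints are stored on the function object
```

A static checker such as mypy would be expected to report the second call, since a str is passed where an int is annotated.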
TensorFlow is an open source software library for high performance numerical computation. Its
flexible architecture allows easy deployment of computation across a variety of platforms (CPUs, GPUs,
TPUs), and from desktops to clusters of servers to mobile and edge devices. Originally developed by
researchers and engineers from the Google Brain team within Google's AI organization, it comes with
strong support for machine learning and deep learning, and the flexible numerical computation core is used
across many other scientific domains.
5.2.3 Keras
Keras is a high-level neural networks API, written in Python and capable of running on top of
TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being
able to go from idea to result with the least possible delay is key to doing good research. Keras was initially
developed as part of the research effort of project ONEIROS (Open-ended Neuro-Electronic Intelligent
Robot Operating System).
Keras is a good fit when you need a deep learning library that:
● Allows for easy and fast prototyping (through user friendliness, modularity, and extensibility).
● Supports both convolutional networks and recurrent networks, as well as combinations of the two.
● Runs seamlessly on CPU and GPU.
The core data structure of Keras is a model, a way to organize layers. The simplest type of model is the
Sequential model, a linear stack of layers. For more complex architectures, you should use the Keras
functional API, which allows building arbitrary graphs of layers.
5.3.1 Dataset
The system is trained and tested on a unique dataset created by the Database administrator of the
project. It consists of 1500 images of static faces sourced from the internet and personal repositories with
no age constraints. These images are all 256x256 pixels in size, so that the dataset is uniform. Each image is
labelled as one of the two emotions and stored in the directory for that label.
We use a five-layer convolutional architecture in our proposed system. It consists of two Conv
layers, two Max-pooling layers and one fully connected layer. These layers are initialised using the built-in
functions provided by TensorFlow, which has Keras running on top of it.
Preprocessing converts each image to a standard size of 64x64 pixels. A 3x3 conv filter is then
applied to the raw image by dividing it into small patches. After this, Max-pooling is applied, which reduces
the matrix size from 22x22 to 8x8. Re-applying these layers to the same data yields a 3x3 matrix of
information about each important feature in the image. These 3x3 matrices are then given to the fully
connected layer, where they undergo multiplication, and the neural network is trained on the data generated.
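How the spatial size changes through two Conv and two Max-pooling stages can be traced with simple arithmetic. The sketch below takes the 64x64 input and 3x3 filter from the text, but the stride of 1, the absence of padding and the 2x2 non-overlapping pooling window are assumptions, so the sizes it prints need not match the figures quoted above.

```python
def conv_output(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output(size, window, stride=None):
    """Output width/height of pooling; the stride defaults to the window size."""
    stride = stride or window
    return (size - window) // stride + 1


def trace_sizes(input_size=64, kernel=3, window=2, stages=2):
    """Sizes after each conv and pooling step, starting with the input size."""
    sizes = [input_size]
    for _ in range(stages):
        sizes.append(conv_output(sizes[-1], kernel))
        sizes.append(pool_output(sizes[-1], window))
    return sizes


print(trace_sizes())  # [64, 62, 31, 29, 14]
```

Changing the padding, stride or window arguments shows how strongly the final feature-map size depends on these settings.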
An image given by the user for classification also has to go through each of the above layers. After
the features have been extracted and passed to the fully connected layer, they are evaluated against the
weights learned from the training set (the knowledge stored in the neurons), and the image is then classified
into one of the two categories based on its similarity to features in the training set.
Fig 11. Flow chart for Training step
Fig 12. Flow chart for Testing step
Chapter 6
SCREEN SHOTS
Sample training data set from the directory named not happy
The CNN being trained on Training set iteratively based on number of Epochs
The console asking for the path of image which has to be classified
Chapter 7
CHALLENGES
Posed Expressions - Existing expression classifiers work reasonably well for posed expressions, such as
posed smiles, but their performance drops quite dramatically on spontaneous expressions elicited during
natural conversations. Part of the reason for this difficulty may stem from differing temporal dynamics
between posed and spontaneous expressions: Much of the existing work on automatic expression
recognition focuses on static image analysis. While static images are sufficient for recognizing intense, posed
expressions, facial coding experts rely heavily on expression dynamics when analyzing subtle, spontaneous
expressions. Thus the development of methods to capture spatiotemporal information has become a very
important endeavor as we try to develop automatic systems that approximate the performance levels of
human experts.
Require large data sets - Deep learning algorithms are trained to learn progressively using data. Large
data sets are needed to make sure that the machine delivers the desired results. Just as the human brain needs
a lot of experience to learn and deduce information, the analogous artificial neural network requires copious
amounts of data. The more powerful the abstraction you want, the more parameters need to be tuned, and
more parameters require more data.
Requires high-performance hardware - Training a Deep Learning solution requires a lot of data. To
solve real-world problems, the machine needs to be equipped with adequate processing power. To improve
efficiency and reduce training time, data scientists switch to multi-core, high-performance GPUs and similar
processing units. These processing units are costly and consume a lot of power. Industry-level Deep Learning
systems require high-end data centers, while smart devices such as drones, robots, and other mobile devices
require small but efficient processing units. Deploying a Deep Learning solution in the real world thus
becomes a costly and power-consuming affair.
Neural networks are essentially black boxes - We know our model parameters, the data we feed to
the neural networks, and how they are put together. But we usually do not understand how they arrive at a
particular solution. Neural networks are essentially black boxes, and researchers have a hard time
understanding how they reach their conclusions. The inability of neural networks to reason at an
abstract level makes it difficult to implement high-level cognitive functions. Also, their operation is largely
invisible to humans, rendering them unsuitable for domains in which verification of process is important.
An application built on human facial expression recognition and used in a real environment
could benefit from distinguishing between more emotions, such as nervousness and panic. One such scenario
is a large event, where early detection of panic could help to prevent mass panic. Another way to
enhance emotion recognition is to allow for composed emotions. For example, frustration can be
accompanied by anger, so the system would show not only one emotion but also the reason for it.
Complex emotions could thus be more valuable than basic ones. Besides distinguishing between different
emotions, the strength of an emotion could also be considered. Being able to distinguish between different
levels could improve applications such as evaluating reactions to new products. In this example it could
help predict the number of orders that will be made, and so help produce the right amount of product.
One difficulty in improving this is that the images are very small, and in some cases it is
very hard to distinguish which expression is shown in an image, even for humans. Because we lacked a GPU,
we explored only a limited set of architectures, but we believe that adding more layers and more filters
would further improve the network.
BIBLIOGRAPHY
[6] Y. Tian, T. Kanade, J.F. Cohn, Facial expressions analysis, in: Stan Z. Li, Anil K. Jain (Eds.),
Handbook of Face Recognition, Springer- Verlag, 2004.
[16] U. Tariq, K.H. Lin, Z. Li, X. Zhou, Z. Wang, V. Le, T.S. Huang, X. Lv, T X “E
recognition fro ” : IEEE International Conference on Automatic Face and
Gesture Recognition and Workshops (FG), pp.872–877, 2011.
[18] Z. Yu and C. Zhang. Image based static facial expression recognition with multiple deep network
learning. ICMI Proceedings.
[20] D. C. Ali Mollahosseini and M. H. Mahoor. Going deeper in facial expression recognition using
deep neural networks. IEEE Winter Conference on Applications of Computer Vision, 2016.