ACKNOWLEDGEMENT
It is our privilege to express our sincere regards to our project guide, Mrs. A. S. Shinde, for her valuable inputs, able guidance, encouragement, whole-hearted cooperation and constructive criticism throughout the duration of our project. We deeply express our sincere thanks to our Head of Department, Mr. S. D. Jadhav, for encouraging and allowing us to present the project on the topic "Emotion Based Music System" at our department premises in partial fulfillment of the requirements leading to the award of the Diploma in Information Technology Engineering.
We take this opportunity to thank all our lecturers who have directly or indirectly helped us with our project. We pay our respect and love to our parents and all other family members for their love and encouragement throughout our career. Last but not least, we express our thanks to our friends for their cooperation and support.
ABSTRACT
The face is an important aspect in predicting human emotions and mood. Human emotions are usually extracted with the use of a camera, and many applications are being developed based on the detection of human emotions. A few applications of emotion detection are business notification recommendation, e-learning, mental disorder and depression detection, and criminal behaviour detection. In this proposed system, we develop a prototype of a dynamic music recommendation system based on human emotions. Based on each user's listening pattern, the songs for each emotion are trained. By integrating feature extraction and machine learning techniques, the emotion is detected from the real face, and once the mood is derived from the input image, songs for that specific mood are played to engage the user. In this approach, the application connects with human feelings, giving a personal touch to the users. Therefore, our proposed system concentrates on identifying human feelings to develop an emotion-based music player using computer vision and machine learning techniques. For the experimental results, we use a CNN model architecture for emotion detection and music recommendation.
CHAPTER 1
INTRODUCTION
Overview
Machine learning involves computers discovering how they can perform tasks without being
explicitly programmed to do so. It involves computers learning from data provided so that they
carry out certain tasks. For simple tasks assigned to computers, it is possible to program
algorithms telling the machine how to execute all steps required to solve the problem at hand;
on the computer's part, no learning is needed. For more advanced tasks, it can be challenging for
a human to manually create the needed algorithms. In practice, it can turn out to be more
effective to help the machine develop its own algorithm, rather than having human programmers
specify every needed step.
As of 2020, deep learning has become the dominant approach for much ongoing work in the
field of machine learning.
LITERATURE SURVEY
1. Smart Music Player Integrating Facial Emotion Recognition and Music Mood Recommendation
AUTHORS: Shlok Gilda, Husain Zafar, Chintan Soni and Kshitija Waghurdekar
Songs, as a medium of expression, have always been a popular choice to depict and understand
human emotions. Reliable emotion based classification systems can go a long way in helping us
parse their meaning. However, research in the field of emotion-based music classification has
not yielded optimal results. In this paper, we present an affective cross-platform music player,
EMP, which recommends music based on the real-time mood of the user. EMP provides smart
mood based music recommendation by incorporating the capabilities of emotion context
reasoning within our adaptive music recommendation system. Our music player contains three
modules: Emotion Module, Music Classification Module and Recommendation Module. The
Emotion Module takes an image of the user's face as an input and makes use of deep learning
algorithms to identify their mood with an accuracy of 90.23%. The Music Classification Module
makes use of audio features to achieve a remarkable result of 97.69% while classifying songs
into 4 different mood classes. The Recommendation Module suggests songs to the user by
mapping their emotions to the mood type of the song, taking into consideration the preferences
of the user.
Music is universal at least partly because it expresses emotion and regulates affect. Associations
between music and emotion have been examined regularly by music psychologists. Here, we
review recent findings in three areas: (a) the communication and perception of emotion in
music, (b) the emotional consequences of music listening, and (c) predictors of music
preferences.
In this paper, we propose a music mood classification system that reflects a user's profile based
on a belief that music mood perception is subjective and can vary depending on the user's
profile such as age or gender. To this end, we first define a set of generic mood descriptors.
Secondly, we make up several user profiles according to the age and gender. We then obtain
musical items, for each group, to separately train the statistical models. Using the two different
user models, we verify our hypothesis that the user profiles play an important role in mood
perception by showing that both models achieve higher classification accuracy when the test
data and the mood model are of the same kind. Applying our system to automatic play list
generation, we also demonstrate that considering the difference between the user groups in
mood perception has a significant effect in computing music similarity.
CHAPTER 2
System Requirements
2.1. Python
2.1.1. Introduction
• Python is Interactive − You can actually sit at a Python prompt and interact with the
interpreter directly to write your programs.
Python was developed by Guido van Rossum in the late eighties and early nineties at the
National Research Institute for Mathematics and Computer Science in the Netherlands.
Python is derived from many other languages, including ABC, Modula-3, C, C++, Algol-68,
SmallTalk, and Unix shell and other scripting languages.
Python is copyrighted, and like Perl, its source code is freely available as open source (under the Python Software Foundation License). Python is now maintained by a core development team, with Guido van Rossum having played a vital role in directing its progress.
• Easy-to-learn − Python has few keywords, simple structure, and a clearly defined syntax. This allows the student to pick up the language quickly.
• A broad standard library − The bulk of Python's library is very portable and cross-platform compatible on UNIX, Windows, and Macintosh.
• Interactive Mode − Python has support for an interactive mode which allows
interactive testing and debugging of snippets of code.
• Portable − Python can run on a wide variety of hardware platforms and has the same
interface on all platforms.
• Extendable − You can add low-level modules to the Python interpreter. These modules
enable programmers to add to or customize their tools to be more efficient.
• GUI Programming − Python supports GUI applications that can be created and ported to many system calls, libraries and windowing systems, such as Windows MFC, Macintosh, and the X Window System of Unix.
Chapter 3
SYSTEM DESIGN
SYSTEM ARCHITECTURE:
[Figure: System architecture: Facial Landmarks Extraction -> Facial Expression Detection -> Music Player]
1. The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on this data, and the output data generated by the system.
2. The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components: the system processes, the data used by those processes, the external entities that interact with the system, and the information flows within the system.
3. The DFD shows how information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
4. A DFD may be used to represent a system at any level of abstraction, and it may be partitioned into levels that represent increasing information flow and functional detail.
[Figure: Data flow diagram: Preprocessing, Training dataset]
UML DIAGRAMS
GOALS:
The Primary goals in the design of the UML are as follows:
1. Provide users a ready-to-use, expressive visual modeling Language so that they can
develop and exchange meaningful models.
2. Provide extendibility and specialization mechanisms to extend the core concepts.
3. Be independent of particular programming languages and development processes.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of the OO tools market.
6. Support higher level development concepts such as collaborations, frameworks, patterns
and components.
7. Integrate best practices.
A use case diagram in the Unified Modeling Language (UML) is a type of behavioral
diagram defined by and created from a Use-case analysis. Its purpose is to present a graphical
overview of the functionality provided by a system in terms of actors, their goals (represented as
use cases), and any dependencies between those use cases. The main purpose of a use case
diagram is to show what system functions are performed for which actor. Roles of the actors in
the system can be depicted.
[Figure: Use case diagram: actor User; use cases Preprocessing, Training, Classification]
CLASS DIAGRAM:
In software engineering, a class diagram in the Unified Modeling Language (UML) is a type of
static structure diagram that describes the structure of a system by showing the system's classes,
their attributes, operations (or methods), and the relationships among the classes. It explains
which class contains information.
[Figure: Class diagram: Input, Output, Image Acquisition, Features Extraction, Live Video, Classification]
SEQUENCE DIAGRAM:
A sequence diagram in Unified Modeling Language (UML) is a kind of interaction diagram that
shows how processes operate with one another and in what order. It is a construct of a Message
Sequence Chart. Sequence diagrams are sometimes called event diagrams, event scenarios, and
timing diagrams.
[Figure: Sequence diagram: perform preprocessing; extract the features from the images and send them to the testing stage; predict the expression type using the proposed algorithm and play a song accordingly]
ACTIVITY DIAGRAM:
Activity diagrams are graphical representations of workflows of stepwise activities and actions
with support for choice, iteration and concurrency. In the Unified Modeling Language, activity
diagrams can be used to describe the business and operational step-by-step workflows of
components in a system. An activity diagram shows the overall flow of control.
[Figure: Activity diagram: Preprocessing, Training]
PROPOSED SYSTEM:
• The proposed system is divided into two parts: the front end, which is the user interface, and the back end, which performs all facial-expression-related operations. The whole application is implemented in Python; the front end is implemented using the Tkinter module and the back end using the Keras module.
• When the application is started, it automatically triggers the prediction module, which is used to predict the facial expression. The prediction module calls the system camera to retrieve an image, and the image is then classified using the pre-trained model. This process is repeated N times to get N predictions; from these N predictions the most frequent value is taken and returned to the application (see the sketch after this list).
• The application then suggests a playlist based on the predicted value. The user can either start that playlist or listen to their regular songs, and can perform general music player operations such as play, pause, next and previous. The application will suggest a mood-based playlist after every K songs.
Chapter 4
IMPLEMENTATION
MODULES:
Dataset
Importing the necessary libraries
Retrieving the images
Splitting the dataset
Building the model
Apply the model and plot the graphs for accuracy and loss
Accuracy on test set
Saving the Trained Model
Face expression in Live webcam
MODULES DESCRIPTION:
Dataset:
In the first module, we developed the system to get the input dataset for the training and testing purposes. The dataset for face expression detection is provided in the project folder itself.
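As an illustration, a dataset laid out as one sub-folder per expression class could be loaded with a Keras ImageDataGenerator; the directory names, image size and batch size below are assumptions, not taken from the project folder.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255)      # scale pixel values to [0, 1]

train_data = datagen.flow_from_directory(
    'dataset/train',                 # assumed layout: one sub-folder per expression class
    target_size=(48, 48),
    color_mode='grayscale',
    class_mode='categorical',
    batch_size=64)

test_data = datagen.flow_from_directory(
    'dataset/test',
    target_size=(48, 48),
    color_mode='grayscale',
    class_mode='categorical',
    batch_size=64)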
Computer Vision
Some of the computer vision problems which we will be solving in this article are:
1. Image classification
2. Object detection
3. Neural style transfer
One major problem with computer vision problems is that the input data can get really big.
Suppose an image is of size 64 X 64 X 3. The input feature dimension then becomes 12,288.
This will be even bigger if we have larger images (say, of size 720 X 720 X 3). Now, if we pass
such a big input to a neural network, the number of parameters will swell up to a HUGE number
(depending on the number of hidden layers and hidden units). This will result in more
computational and memory requirements – not something most of us can deal with.
A typical image contains many vertical and horizontal edges, and the first step is to detect these edges by convolving the image with a filter. Convolving a 6 X 6 image with a 3 X 3 filter produces a 4 X 4 output. The first element of the 4 X 4 matrix is calculated as follows:
So, we take the first 3 X 3 matrix from the 6 X 6 image and multiply it with the filter. Now, the
first element of the 4 X 4 output will be the sum of the element-wise product of these values, i.e.
3*1 + 0 + 1*-1 + 1*1 + 5*0 + 8*-1 + 2*1 + 7*0 + 2*-1 = -5. To calculate the second element of
the 4 X 4 output, we will shift our filter one step towards the right and again get the sum of the
element-wise product:
Similarly, we will convolve over the entire image and get a 4 X 4 output:
So, convolving a 6 X 6 input with a 3 X 3 filter gives us an output of 4 X 4. Note that higher pixel values represent the brighter portions of the image and lower pixel values represent the darker portions; this is how we can detect a vertical edge in an image.
A commonly used edge-detection kernel is the Sobel filter, which puts a little more weight on the central pixels. Instead of using such hand-crafted filters, we can also treat the filter values as parameters which the model learns using backpropagation.
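To make the sliding-window computation above concrete, here is a small NumPy sketch of a valid convolution with the standard 3 X 3 vertical-edge kernel; the 6 X 6 input values are placeholders, not the image from the original example.

import numpy as np

def convolve2d(image, kernel):
    # slide the kernel over the image and sum the element-wise products
    f = kernel.shape[0]
    out_size = image.shape[0] - f + 1           # (n - f + 1)
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]])
image = np.random.randint(0, 10, (6, 6))        # placeholder 6 x 6 input
print(convolve2d(image, vertical_edge).shape)   # -> (4, 4)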
Padding
We have seen that convolving an input of 6 X 6 dimension with a 3 X 3 filter results in 4 X 4
output. We can generalize it and say that if the input is n X n and the filter size is f X f, then the
output size will be (n-f+1) X (n-f+1):
• Input: n X n
• Filter size: f X f
• Output: (n-f+1) X (n-f+1)
There are primarily two disadvantages here:
1. Every time we apply a convolutional operation, the size of the image shrinks
2. Pixels at the corners of the image are used only a few times during convolution compared to the central pixels, so information near the corners is under-represented, which can lead to information loss
To overcome these issues, we can pad the image with an additional border, i.e., we add one pixel
all around the edges. This means that the input will be an 8 X 8 matrix (instead of a 6 X 6
matrix). Applying convolution of 3 X 3 on it will result in a 6 X 6 matrix which is the original
shape of the image. This is where padding comes to the fore:
• Input: n X n
• Padding: p
• Filter size: f X f
• Output: (n+2p-f+1) X (n+2p-f+1)
There are two common choices for padding:
1. Valid: It means no padding. If we are using valid padding, the output will be (n-f+1) X
(n-f+1)
2. Same: Here, we apply padding so that the output size is the same as the input size, i.e.,
n+2p-f+1 = n
So, p = (f-1)/2
We now know how to use padded convolution. This way we don’t lose a lot of information and
the image does not shrink either. Next, we will look at how to implement strided convolutions.
Strided Convolutions
Suppose we choose a stride of 2. So, while convoluting through the image, we will take two
steps – both in the horizontal and vertical directions separately. The dimensions for stride s will
be:
• Input: n X n
• Padding: p
• Stride: s
• Filter size: f X f
• Output: [(n+2p-f)/s+1] X [(n+2p-f)/s+1]
Stride helps to reduce the size of the image, a particularly useful feature.
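The output-size formula can be sanity-checked with a few lines of Python (a small illustrative helper, not project code):

def conv_output_size(n, f, p=0, s=1):
    # output = floor((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))         # valid padding, stride 1 -> 4
print(conv_output_size(6, 3, p=1))    # 'same' padding, p = (f - 1) / 2 -> 6
print(conv_output_size(7, 3, s=2))    # stride 2 -> 3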
For colour images the input has three channels, so the filter also has three channels; for example, a 6 X 6 X 3 input is convolved with a 3 X 3 X 3 filter. After convolution, the output is again a 4 X 4 matrix. The first element of the output is the sum of the element-wise product of the first 27 values from the input (9 values from each channel) and the 27 values of the filter. After that we convolve over the entire image.
Instead of using just a single filter, we can use multiple filters as well. How do we do that? Let’s
say the first filter will detect vertical edges and the second filter will detect horizontal edges
from the image. If we use multiple filters, the output dimension will change. So, instead of
having a 4 X 4 output as in the above example, we would have a 4 X 4 X 2 output (if we have
used 2 filters):
Convolution therefore acts just like one layer of a standard neural network: z[1] = w[1] a[0] + b[1] and a[1] = g(z[1]). In our case, the input (6 X 6 X 3) is a[0] and the filters (3 X 3 X 3) are the weights w[1]. These activations
from layer 1 act as the input for layer 2, and so on. Clearly, the number of parameters in case of
convolutional neural networks is independent of the size of the image. It essentially depends on
the filter size. Suppose we have 10 filters, each of shape 3 X 3 X 3. What will be the number of
parameters in that layer? Let’s try to solve this:
• Number of parameters for each filter = 3*3*3 = 27
• There will be a bias term for each filter, so total parameters per filter = 28
• As there are 10 filters, the total parameters for that layer = 28*10 = 280
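The same count can be verified with Keras (assuming a Conv2D layer; the 32 X 32 X 3 input shape is just for illustration):

from tensorflow.keras import layers, models

m = models.Sequential([
    layers.Conv2D(10, (3, 3), input_shape=(32, 32, 3)),   # 10 filters of 3 x 3 x 3
])
m.summary()    # reports 280 trainable parameters: 10 * (3*3*3 weights + 1 bias)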
No matter how big the image is, the parameters depend only on the filter size. The notation for a convolution layer l can be summarised as:
• f[l] = filter size
• p[l] = padding
• s[l] = stride
A CNN is typically built from three types of layers:
1. Convolution layer
2. Pooling layer
3. Fully connected layer
Let's understand the pooling layer in the next section.
Pooling Layers
Pooling layers are generally used to reduce the size of the inputs and hence speed up the
computation. Consider a 4 X 4 matrix as shown below:
For every consecutive 2 X 2 block, we take the max number. Here, we have applied a filter of
size 2 and a stride of 2. These are the hyperparameters for the pooling layer. Apart from max
pooling, we can also apply average pooling where, instead of taking the max of the numbers, we
take their average. In summary, the hyperparameters for a pooling layer are:
1. Filter size
2. Stride
3. Max or average pooling
If the input of the pooling layer is nh X nw X nc, then the output will be [{(nh – f) / s + 1} X {(nw
– f) / s + 1} X nc].
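A short NumPy sketch of max pooling with filter size 2 and stride 2 (the 4 X 4 values are made up for illustration):

import numpy as np

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]])

f, s = 2, 2
pooled = np.array([[x[i:i + f, j:j + f].max()
                    for j in range(0, x.shape[1] - f + 1, s)]
                   for i in range(0, x.shape[0] - f + 1, s)])
print(pooled)    # [[9 2]
                 #  [6 3]]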
CNN Example
We’ll take things up a notch now. Let’s look at how a convolutional neural network with convolution and pooling layers works. Suppose we have an input of shape 32 X 32 X 3: there is a combination of convolution and pooling layers at the beginning, a few fully connected layers at the end, and finally a softmax classifier to classify the input into various categories.
There are a lot of hyperparameters in this network which we have to specify as well. Generally,
we take the set of hyperparameters which have been used in proven research and they end up
doing well. As seen in the above example, the height and width of the input shrinks as we go
deeper into the network (from 32 X 32 to 5 X 5) and the number of channels increases (from 3
to 10).
All of these concepts and techniques bring up a very fundamental question – why convolutions?
Why not something else?
Why Convolutions?
There are primarily two major advantages of using convolutional layers over using just fully
connected layers:
1. Parameter sharing
2. Sparsity of connections
Consider the example below:
If we would have used just the fully connected layer, the number of parameters would be =
32*32*3*28*28*6, which is nearly equal to 14 million! Makes no sense, right?
If we see the number of parameters in case of a convolutional layer, it will be = (5*5 + 1) * 6 (if
there are 6 filters), which is equal to 156. Convolutional layers reduce the number of parameters
and speed up the training of the model significantly.
In convolutions, we share the parameters while convolving through the input. The intuition
behind this is that a feature detector, which is helpful in one part of the image, is probably also
useful in another part of the image. So a single filter is convolved over the entire input and
hence the parameters are shared.
The second advantage of convolution is the sparsity of connections. For each layer, each output
value depends on a small number of inputs, instead of taking into account all the inputs.
Building the model: each convolutional block ends with a max pooling layer, which reduces the height and width of the feature maps by a factor of 2. In the dropout layer we have kept dropout rate = 0.25, which means 25% of the neurons are removed randomly.
We apply these layers again with some change in parameters. Then we apply a flatten layer to convert the 2-D data into a 1-D vector. This layer is followed by a dense layer, a dropout layer and a dense layer again. The last dense layer outputs 7 nodes, one per face expression type. This layer uses the softmax activation function, which gives a probability value and predicts which of the 7 classes has the highest probability.
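A minimal Keras sketch of such an architecture is shown below. Only the dropout rate of 0.25 and the 7-class softmax output come from the description above; the number of filters, the dense layer size and the 48 X 48 X 1 input shape are assumptions.

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(48, 48, 1)),
    layers.MaxPooling2D((2, 2)),          # halves the height and width
    layers.Dropout(0.25),                 # removes 25% of neurons randomly

    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),

    layers.Flatten(),                     # 2-D feature maps -> 1-D vector
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.25),
    layers.Dense(7, activation='softmax'),    # 7 face expression classes
])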
Apply the model and plot the graphs for accuracy and loss:
We compile the model and fit it using the fit function, with a batch size of 64. Then we plot the graphs for accuracy and loss. We obtained an average validation accuracy of 96.6% and an average training accuracy of 95.3%.
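A hedged sketch of the compile, fit and plot steps (the optimizer choice and epoch count are assumptions; the batch size of 64 is the one set on the data generators above):

import matplotlib.pyplot as plt

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(train_data, validation_data=test_data, epochs=30)

plt.plot(history.history['accuracy'], label='training accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()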
Once you’re confident enough to take your trained and tested model into a production-ready environment, the first step is to save it into a .h5 or .pkl file, using a library such as pickle. Next, let’s import the module and dump the model into a .pkl file:
In the live webcam module, we simply search for the face in the captured frame and classify it. The possible results are angry, disgust, fear, happy, neutral, sad and surprise. After capturing the emotion, a list of songs is suggested based on that emotion.
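A hedged sketch of this live-webcam step, assuming an OpenCV Haar cascade for face detection and the trained model from the earlier steps; the playlist mapping and the 48 x 48 input size are illustrative assumptions.

import cv2
import numpy as np

cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
EMOTIONS = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']
PLAYLISTS = {e: 'playlists/' + e + '.m3u' for e in EMOTIONS}   # hypothetical mapping

cam = cv2.VideoCapture(0)
ok, frame = cam.read()
cam.release()

gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5):
    face = cv2.resize(gray[y:y + h, x:x + w], (48, 48)) / 255.0
    emotion = EMOTIONS[np.argmax(model.predict(face.reshape(1, 48, 48, 1), verbose=0))]
    print('Detected emotion:', emotion, '-> suggested playlist:', PLAYLISTS[emotion])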
Chapter 5
INPUT DESIGN
The input design is the link between the information system and the user. It comprises the specifications and procedures for data preparation, that is, the steps necessary to put transaction data into a usable form for processing. This can be achieved by having the computer read data from a written or printed document, or by having people key the data directly into the system. The design of input focuses on controlling the amount of input required, controlling errors, avoiding delay, avoiding extra steps and keeping the process simple. The input is designed in such a way that it provides security and ease of use while retaining privacy. Input design considered the following things:
OBJECTIVES
1. Input Design is the process of converting a user-oriented description of the input into a
computer-based system. This design is important to avoid errors in the data input process and
show the correct direction to the management for getting correct information from the
computerized system.
2. It is achieved by creating user-friendly screens for data entry to handle large volumes of data. The goal of designing input is to make data entry easier and free from errors. The data entry screens are designed in such a way that all required data manipulations can be performed; they also provide record-viewing facilities.
3. When data is entered, it is checked for validity. Data can be entered with the help of screens, and appropriate messages are provided as and when needed so that the user is never left in a maze. Thus the objective of input design is to create an input layout that is easy to follow.
OUTPUT DESIGN
A quality output is one which meets the requirements of the end user and presents the information clearly. In any system, the results of processing are communicated to the users and to other systems through outputs. In output design, it is determined how the information is to be displayed for immediate need and also as hard copy output. It is the most important and direct source of information to the user. Efficient and intelligent output design improves the system's relationship with the user and helps in decision-making.
1. Designing computer output should proceed in an organized, well-thought-out manner; the right output must be developed while ensuring that each output element is designed so that people find the system easy and effective to use. When analysts design computer output, they should identify the specific output that is needed to meet the requirements.
2. Create documents, reports, or other formats that contain information produced by the system.
The output form of an information system should accomplish one or more of the following objectives:
Convey information about past activities, current status or projections of the future.
Signal important events, opportunities, problems, or warnings.
Trigger an action.
Confirm an action.
Chapter 6
Actual output
Chapter 7
7.1 Advantages
The smart music player is an application that runs on the idea that we can detect a person's mood from the expression on their face.
The expression on the face is detected using a convolutional neural network (CNN). A set of images is taken from the camera of the device, and these images are given to a pretrained CNN, which returns the facial expression to the application. Based on the facial expression, a song playlist is suggested.
This is an additional feature on top of an existing music player. A facial expression usually changes within seconds and is not consistent, which may lead to a wrong playlist suggestion; to overcome this problem, the application collects N images when it is started and takes the facial expression that has appeared the maximum number of times.
7.2 Applications
7.3 Conclusion
This application can be added as an additional feature to current advanced music players, which suggest songs based on previous song history. Adding a facial expression detection system to a music player would increase the situations where the system suggests the songs the user needs, and this would increase user satisfaction. This facial recognition model can also be used in several other situations, such as movie suggestions, activity suggestions, etc.
We can add incremental learning to the application so that it learns from new data generated by the application. The application asks the user for feedback on whether its prediction was correct and learns based on that feedback; this process increases model accuracy and results in improved quality. We can also add new features such as heart rate, which is somewhat connected to human emotions, to increase the correctness of the model. We can also consider the background while predicting the emotion; this way we can get better results than with the previous method. For example, if the user is in a gym, the application should detect the objects in the gym and play motivational songs that are suited for the gym.
Chapter 8
REFERENCES
[2] “How music changes your mood”, Examined Existence. [Online]. Available:
http://examinedexistence.com/how-music-changes-yourmood/.
[3] Kyogu Lee and Minsu Cho, “Mood Classification from Musical Audio Using
User Group-dependent Models.”
[5] Mirim Lee and Jun-Dong Cho, “Logmusic: context-based social music
recommendation service on mobile device,” Ubicomp’14 Adjunct, Seattle,
WA, USA, Sep. 13–17, 2014.
[7] Bo Shao, Dingding Wang, Tao Li, and Mitsunori Ogihara, “Music
recommendation based on acoustic features and user access patterns,” IEEE
Transactions on Audio, Speech, and Language Processing, vol. 17, no. 8, Nov.
2009.
[8] Ying-li Tian, T. Kanade, and J. Cohn, "Recognizing lower face action units for facial expression analysis," in Proceedings of the 4th IEEE International Conference on Automatic Face and Gesture Recognition (FG'00), Mar. 2000, pp. 484–490.
[9] Gil Levi and Tal Hassner, "Emotion Recognition in the Wild via Convolutional Neural Networks and Mapped Binary Patterns."
[10] E. E. P. Myint and M. Pwint, "An approach for multi-label music mood classification," in 2010 2nd International Conference on Signal Processing Systems, Dalian, 2010, pp. V1-290–V1-294.
[11] Peter Burkert, Felix Trier, Muhammad Zeshan Afzal, Andreas Dengel, and
Marcus Liwicki, “DeXpression: Deep Convolutional Neural Network for
Expression Recognition.”
[16] Brian McFee, Matt McVicar, Colin Raffel, Dawen Liang, Oriol Nieto, Eric
Battenberg, ..., and Adrian Holovaty, (2015). librosa: 0.4.1 [Data set].
Zenodo. http://doi.org/10.5281/zenodo.32193.
[17] The aubio team, “Aubio, a library for audio labelling,” 2003. [Online].
Available: http://aubio.org/.
[19] Cyril Laurier, Perfecto Herrera, M Mandel and D Ellis, “Audio music mood
classification using support vector machine.”