Emotion-based Music Recommendation System using Deep Learning Model
Abstract— The ability to read a person's emotions from their face is crucial. Using a camera, the necessary input is taken directly from the subject's face. Among other things, this input can be used to extract data from which a person's mood is inferred. Musical selections that fit the "mood" determined from the supplied input can then be retrieved. This removes the time-consuming and laborious effort of manually classifying songs into various categories and assists in creating a playlist suited to a particular person's emotional characteristics. The Music Thespian depends on facial expressions and aims to scout out and comprehend the data before building a playlist according to the specified parameters. To construct an emotion-based music system, the suggested system is designed to identify human emotions. The article explains the techniques currently used by music players to identify emotions, the approach used in our music player, and the most effective way to make use of our emotion detection technology. A brief explanation of how the system operates, creates playlists, and classifies emotions is provided.

Keywords— Emotion, Music Therapy, Deep Neural Network, Facial Expression Recognition
I. INTRODUCTION

The computerized evaluation and interpretation of tunes is a novel possibility in music information retrieval. Due to the diversity and depth of music material, scholars study a wide range of research issues in this area; musicology makes use of computer science, digital signal processing, mathematics, and statistics. Music similarity analysis, audio artist identification, audio-to-score alignment, query by singing or humming, and automatic audio genre/mood categorization are only a few recent developments in music information retrieval.

One of the practical applications that may be offered is content-based music suggestion. More advanced context-based music recommendations are developed by using context information. A content-based music recommendation system requires multidisciplinary work in the areas of emotion description, emotion detection/recognition, feature-based classification, and inference-based recommendation. An emotion descriptor has been successfully used to characterize music taxonomy. A common presumption for emotion representation is that emotions can be translated into a set of continuous real values. In their circumplex model, investigators depicted each affect over two bipolar dimensions as a groundbreaking approach to describing human emotions; arousal and sleep are the two dimensions in question. Later, Russell's model was applied to music by another researcher. In Thayer's approach, "arousal" and "valence" are the two key dimensions: along the arousal dimension the emotion terms range from silent to lively, while along the valence dimension they range from negative to positive. With Thayer's approach, the emotional 2-D plane may be split into four quadrants, and eleven emotion adjectives can be placed over each quadrant [1]. In contrast, Xiang et al. proposed a "mental state transition network" to describe how people move between emotions. The mental states of the network include joyful, sad, angry, disgusted, afraid, surprised, and calm; however, other feelings such as anxiety and excitement are not taken into account.

Automatic emotion identification and recognition in music is developing quickly thanks to advancements in digital signal processing and many efficient feature extraction techniques. Emotion detection and recognition have numerous further potential uses, including human-computer interaction systems and music entertainment [2], [3]. The initial study on emotion recognition using music was presented by Feng, who applied the Computational Media Aesthetics (CMA) viewpoint by assessing two variables, tempo and articulation, that are mapped into four moods: joy, wrath, melancholy, and terror.
II. RELATED WORK

An Automatic Face Detection and Facial Expression Recognition System to identify facial expressions was proposed first. This system consists of three phases: face recognition, expression recognition, and feature extraction. The RGB colour model is employed in the initial stage of face detection, with lighting adjustment to obtain the face and morphological operations to retain the necessary facial features, such as the lips and eyes. For the purpose of extracting facial features, this system also employs the Active Appearance Model (AAM) method, which uses a statistical model of the shape and appearance of the face. In this approach, points on the face, such as the eyes, eyebrows, and mouth, are located, and a data file is created containing information about the detected model points. The method also detects the face and uses the input expression to determine how the AAM model should change.

Bezier curve fitting was used to recognise emotion through the analysis of facial expressions. A system based on Bezier curve fitting was proposed that uses a two-step process: it detects and analyses the facial area in the input image, and then verifies the facial emotion of the distinguishing features in the region of interest [4-6]. To gauge the location of the face and the placement of the mouth and eyes, feature maps were employed after the initial step of face identification, which used colour still images, skin-colour pixels, and spatial filtering. In order to apply a Bezier curve to the eye and mouth, this approach first extracts the region of interest and then derives points from the feature map. The technique employs training and testing to comprehend emotion, using the Hausdorff distance with a Bezier curve between the supplied face image and the database image.

Utilizing a library of photos, the user of an animated music recommendation system can obtain music recommendations based on the genre of each image. The Nokia Research Center created this technique for making music recommendations; audio signal processing and textual meta tags are used in this system to describe the genre. Later, utilising emotion identification from facial expressions in human-computer interaction, a completely automated facial expression and identification system was proposed, built on a three-step procedure of face recognition, facial characteristic extraction, and facial expression categorization.

Discovery of emotion-based music through association with movie music was proposed in [7-9]. With the expansion of electronic music, users can benefit from the development of music recommendations. The users' preferences for music are the basis of current recommendation methods; however, there are occasions when selecting music based on mood is necessary. That research presented a unique approach for association discovery from film music for emotion-based music recommendation. In order to uncover associations between emotions and musical features, the methodology looked into the extraction of musical features and adjusted the affinity graph. According to the experimental findings, the suggested approach averages 85% accuracy.

Interactive, mood-based finding and recommending of music has also been explored. A considerable portion of research in recommender systems focuses on enhancing prediction and ranking, but recent studies have highlighted the importance of other components of recommendation, such as transparency, control, and overall user experience. On the basis of these components, MoodPlay, a hybrid music recommendation engine with a comprehensible interface that combines content- and mood-based filtering, was presented. Users are shown how to browse music collections by latent emotional dimensions, and user input at the moment of recommendation is combined with calculations based on a prior user profile [10]. The findings of a user study (N=240) that examined four conditions with different levels of visualisation, control, and interaction are reviewed.

III. DEEP CONVOLUTIONAL NEURAL NETWORK

A. Convolutional Neural Network

Unlike traditional neural networks, CNN layers contain neurons organised in three dimensions: width, height, and depth. A layer's neurons are only partially linked to the layer before it, with connectivity controlled by the window size, rather than being completely interconnected. The main building layers used in convolutional neural networks are the convolution stage, the pooling stage, and the fully connected layer.

By the end of the architecture, the entire image is compressed into a single vector of class scores, so the final output layer has dimensions equal to the number of classes fixed by the CNN architecture.

A Convolutional Neural Network is a deep learning technique that takes an image as input, assigns importance (learnable weights and biases) to different components and objects in the image, and learns to distinguish between them [11]. Compared to straightforward filter-based methods and other classification systems, a ConvNet requires far less pre-processing. CNNs can be trained to detect and localize faces in images: convolutional layers extract features from the input image, and a fully connected layer classifies the presence of a face. CNNs have shown state-of-the-art performance in various computer vision tasks, including image classification and object detection, and have achieved high accuracy on benchmark datasets such as ImageNet, demonstrating their ability to learn complex features from images.
While such filters are hand-engineered in primitive methods, ConvNets can learn these filters and characteristics with the right training. The organisation of the Visual Cortex and the brain's neuronal interconnection system both influence the design of a ConvNet [12-14]: individual neurons respond to stimuli only within a restricted region of the visual field, known as the Receptive Field, and several overlapping receptive fields together cover the total visual field. CNNs can be adapted to different types of input data and can be used for a wide range of computer vision tasks [15-17]. They can be trained on large datasets and can learn from examples, making them suitable for various applications. The choice of a CNN for a specific task can be justified by its performance, adaptability, interpretability, availability, and scalability.

B. Convolutional Layer

In the standard illustrative example, a 5x5x1 input image is convolved in the first convolutional layer by the Kernel/Filter K, the element that executes the convolution operation, represented as a 3x3x1 matrix. CNNs are capable of automatically extracting relevant features from images without requiring manual feature engineering. The use of convolutional layers allows the network to learn hierarchical representations of the image, in which lower layers extract basic features (e.g. edges and corners) and higher layers extract more complex features (e.g. shapes and textures). CNNs have achieved state-of-the-art performance on a wide range of computer vision tasks, including image classification, object detection, and image segmentation, owing to their ability to learn complex features from images and to generalize well to new, unseen data.
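To make these building blocks concrete, the following Python/Keras sketch assembles a small CNN for the 50x50 grayscale inputs and four emotion classes used later in this paper. It is a minimal sketch: the layer counts, filter numbers, and optimizer are illustrative assumptions, not the exact architecture of the implemented system.

# A minimal Keras sketch of the CNN building blocks described above.
# Layer counts and filter sizes are assumptions for illustration only.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(50, 50, 1)),               # 50x50 grayscale input
    layers.Conv2D(32, (3, 3), activation="relu"),  # convolution stage
    layers.MaxPooling2D((2, 2)),                   # pooling stage
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                              # compress into one vector
    layers.Dense(128, activation="relu"),          # fully connected layer
    layers.Dense(4, activation="softmax"),         # one score per emotion class
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])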
C. Pooling Layer
The Pooling layer, like the Convolutional Layer, is responsible for reducing the spatial size of the Convolved Feature. Diminishing the dimensions decreases the processing resources required to handle the data. Additionally, by enabling the extraction of dominant features that are rotationally and positionally invariant, it contributes to correctly training the model. The two main forms of pooling are Average and Max Pooling. Max Pooling returns the highest value from the region of the picture covered by the kernel, whereas Average Pooling returns the mean of all the values in the region it convolves. Max Pooling tells the Convolutional Neural Network that information is carried forward only if it is the greatest information available in terms of amplitude. On a 4x4 channel, we can perform max pooling with a 2x2 kernel and a stride of 2. If we examine the initial 2x2 patch on which the kernel concentrates, the channel comprises four values: 8, 3, 4, and 7; with Max Pooling, the highest value in that set, "8", is selected.
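The 4x4 example above can be reproduced directly in NumPy, as in the short sketch below; the values outside the first 2x2 patch are made up for illustration.

# Reproduction of the Section III-C example: 2x2 max pooling, stride 2,
# on a 4x4 channel whose first patch holds the values 8, 3, 4 and 7.
import numpy as np

channel = np.array([[8, 3, 1, 0],
                    [4, 7, 5, 2],
                    [6, 1, 9, 4],
                    [2, 0, 3, 8]])

# Split into non-overlapping 2x2 blocks and keep each block's maximum.
pooled = channel.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[8 5]
               #  [6 9]] -> "8" survives from the first patch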
D. Fully Connected Layer

Batch size, number of inputs, and number of outputs are the three factors that characterize a fully connected layer. Forward propagation, computation of the activation gradient, and computation of the weight gradient are all stated directly as matrix multiplications. Different frameworks map these three parameters to the GEMM dimensions differently (General Matrix Multiplication, described in the Matrix Multiplication Background User's Guide), but the fundamental ideas are the same.
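Written out, the forward pass of such a layer is a single GEMM. The sketch below makes this explicit; the sizes (batch 32, 128 inputs, 4 outputs) are illustrative assumptions.

# The fully connected layer's forward propagation expressed directly as
# one matrix multiplication (GEMM); the sizes are illustrative.
import numpy as np

batch, n_in, n_out = 32, 128, 4
x = np.random.randn(batch, n_in)   # activations from the previous layer
W = np.random.randn(n_in, n_out)   # learnable weights
b = np.zeros(n_out)                # learnable biases

y = x @ W + b                      # (batch x n_in) @ (n_in x n_out)
print(y.shape)                     # (32, 4)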
The architecture of the basic CNN is represented in Fig. 1.

Fig. 1. Simple CNN Architecture

IV. IMPLEMENTED METHODOLOGY

The overall structure of the implemented work is explained in Fig. 2; CNN is the classifier used in this research.

A. Image Pre-processing

The captured images are converted to 50x50-pixel grayscale images, and the project consistently uses this dimension.
B. Face Detection

The collected photos are screened for faces. A rectangular box is then drawn around each face found in each frame of the real-time video using the Haar cascade and the CNN algorithm. The technique involves training a classifier on positive and negative samples of the desired feature. This is a preprocessing step carried out before the image is supplied to the model for prediction.
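A minimal OpenCV sketch of this preprocessing step is shown below. The frontal-face cascade file ships with OpenCV; the frame source and the 50x50 crop size follow the description above, and the helper name is hypothetical.

# Haar-cascade face detection as a preprocessing step, sketched with OpenCV.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_faces(frame):
    """Return 50x50 grayscale crops for every face found in one frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    return [cv2.resize(gray[y:y + h, x:x + w], (50, 50))
            for (x, y, w, h) in faces]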
C. Emotion Detection

The processed images are fed to the Keras CNN model, which we have already trained. Once the image has been cropped, the trained deep learning model is used to predict the mood of the image. This occurs 30 to 40 times in 2 to 3 seconds. Once we obtain a list of emotions (which may contain duplicate entries), the list is first sorted according to frequency and the duplicates are then eliminated. After completing these steps, a list of the user's emotions in sorted order is available.
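The frequency-sorting and de-duplication step can be sketched as follows; the helper name and the sample labels (the four classes used in this paper) are hypothetical.

# Sketch of the post-processing described above: the 30-40 per-frame
# predictions are ordered by frequency and duplicates are removed.
from collections import Counter

def ranked_emotions(predictions):
    counts = Counter(predictions)
    # most_common() yields each label once, ordered by frequency,
    # which sorts the list and eliminates duplicates in one step.
    return [label for label, _ in counts.most_common()]

print(ranked_emotions(["Happy", "Sad", "Happy", "Surprise", "Happy"]))
# -> ['Happy', 'Sad', 'Surprise']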
D. Music Recommendation

Nearly 90,000 songs in the database we utilised are arranged according to how pleasant and emotionally significant they are. The database is iterated over with the sorted emotions, and songs are suggested depending on the emotions found in the list. Furthermore, two or more emotions can be recognized in the same frame; in such situations, songs are suggested accordingly.
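The recommendation step can be sketched as below, assuming each song record carries an emotion tag; the record layout, function name, and per-emotion limit are hypothetical.

# Sketch of the recommendation step: iterate over the song database
# (already ordered by pleasantness) and suggest songs matching the
# detected emotions, most frequent emotion first.
def recommend(songs, emotions, per_emotion=10):
    playlist = []
    for emotion in emotions:
        matches = [s["title"] for s in songs
                   if s["emotion"] == emotion][:per_emotion]
        playlist.extend(matches)
    return playlist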
V. COLLECTION OF DATA

Flickr-Faces-HQ Dataset (FFHQ): The Flickr-Faces-HQ Dataset (FFHQ) is a collection of human faces that offers far greater variation in terms of age, ethnicity, and picture backdrop, and better coverage of accessories such as eyeglasses, sunglasses, and hats, than CelebA-HQ. After being retrieved from Flickr, the images were automatically cropped and aligned. There are 70,000 high-quality PNG pictures in the dataset at 1024 x 1024 resolution.

Google Facial Expression Comparison dataset: The triplet face images in this dataset, which was produced by Google, feature human annotations identifying the two faces with the most similar facial expressions. The purpose of the dataset is to assist researchers working on problems connected to facial expression analysis, including expression-based image retrieval, expression-based photo album summarization, emotion categorization, expression synthesis, and so on. 500K triplet pictures and 156K face shots make up the 200MB dataset.

Labelled Faces in the Wild (LFW) Dataset: A collection of face images called the Labelled Faces in the Wild (LFW) dataset was assembled to study the issue of unconstrained face identification. Titled "Faces in the Wild," it is a publicly available standard for pair matching, commonly referred to as face verification. The collection may be used to develop facial recognition technology and other kinds of face identification. Over 13,000 face images, totaling 173MB and compiled from the internet, are included in the collection.

African music dataset: The Royal Museum of Central Africa (RMCA) in Belgium is the source of the African music dataset. Songs can be divided into four categories: country, purpose, ethnic group, and instrumentation. Compared to a typical western music dataset, the characteristics of this collection of songs are very different.

GTZAN Dataset: 10 genres, including hip-hop, rock, classical, blues, country, disco, jazz, reggae, and pop, are represented in the GTZAN dataset, also known as the Genre Collection dataset, from MARSYAS (Music Analysis, Retrieval and Synthesis for Audio Signals). It is the most often used publicly available dataset for machine learning research on music genre recognition (MGR). The experimental results outperformed other well-known methods in terms of classification performance: ten classifiers used majority voting to decide, and on the GTZAN dataset they achieved an average accuracy of 94%.

VI. RESULTS AND DISCUSSIONS

The process of providing suggestions requires careful consideration of many different aspects, including the specific situation, individual preferences, sentiments, and emotions. The gaps in personalization, human emotions, contextual desires, and emotional variables in today's music recommendation algorithms remain challenges. The proposed work employed a CNN from deep learning for the classification of emotions; a multilayered neural network termed a Convolutional Neural Network can recognize intricate details in the data. Performance metrics such as accuracy, recall, and precision are used to analyse the network model. The confusion matrix was used to calculate the true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) with which the model is evaluated.
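These four counts determine every entry in Table I. As a minimal illustration, the sketch below computes the per-class metrics from placeholder counts; the counts themselves are not taken from the experiments.

# Per-class metrics computed from confusion-matrix counts, matching the
# definitions behind Table I. The counts passed in are placeholders.
def class_metrics(tp, fp, tn, fn):
    return {
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # also called recall
        "specificity": tn / (tn + fp),
        "precision":   tp / (tp + fp),
    }

print(class_metrics(tp=97, fp=4, tn=95, fn=3))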
In order to grasp the patterns in the pictures provided to it, the CNN model applies a variety of filters to the image. With the use of CNN layers, the model in this system is trained by passing a certain number of photos for each of the four emotions; once trained, the model can predict the emotions in the test data. By integrating non-linearity through MaxPooling, which increases performance, the CNN model becomes more capable of handling a variety of data. Music was then recommended based on the individual's recognized emotion. Figure 3(a) depicts the developed page for the music recommendation system based on the recognized emotion, (b) shows the recommended music based on the recognized emotion, (c) exhibits the accuracy plot of the CNN model, and (d) illustrates the recognized emotion; Table I presents the performance analysis of the CNN model. The experiments were conducted using the FER database. The accuracy achieved by the neural network model is 96%, 97%, 93%, and 94% for recognising Happy, Fear, Sad, and Surprise, respectively. Increasing the depth of a CNN can improve its performance by allowing the network to learn more complex features; this can be achieved by adding more convolutional and pooling layers to the network.
Fig. 3: (a) Home Page (b) Recommended Music based on the Emotion recognized (c) Accuracy plot using CNN model (d) Emotion Detected

TABLE I. PERFORMANCE ANALYSIS USING CNN MODEL

            Specificity   Sensitivity   Precision   Accuracy
Happy       0.95          0.97          0.96        96%
Fear        0.96          0.96          0.94        97%
Sad         0.95          0.94          0.93        93%
Surprise    0.97          0.92          0.95        94%

CONCLUSION

In this research paper, an emotion-driven recommendation model that effectively addresses personal preferences and particular life and activity conditions is presented. The prime objective of the study's approach is to maximize the benefits that listening to music may provide for individuals. Giving the algorithm feedback on the results of the recommendations will enable it to improve the music selections over time. In order to find the most suitable musical choices for each user, the system is built to listen to each unique user and understand their listening goals, moods, and contextual preferences. For this, the system is given data from a variety of sources. In the context of this study, the key data processing techniques are specified, and the experimental prototype has been developed. However, machine learning models need a lot of data to train on in order to create predictions that are as accurate and relevant as possible. The procedure of collecting data is now underway. In order to fine-tune and test the model for accurate suggestions and to minimize any side effects, this type of system needs extensive clinical study and collaboration with psychologists. The accuracy achieved by the neural network model is 96%, 97%, 93%, and 94% for recognising Happy, Fear, Sad, and Surprise, respectively. The further development of this work might be viewed as the production of music by artificially intelligent systems with specific musical characteristics to affect people's emotional states.

ACKNOWLEDGMENT

We thank the signal processing laboratory, Department of Electronics and Communication, Karunya Institute of Technology and Sciences, for the support extended to this research.

REFERENCES

[1] C. Loconsole, C. R. Miranda, G. Augusto, A. Frisoli, and V. Orvalho, "Real-time emotion recognition: Novel method for geometrical facial features extraction," in VISAPP 2014 - Proceedings of the 9th International Conference on Computer Vision Theory and Applications, 2014, vol. 1, pp. 378-385, doi: 10.5220/0004738903780385.
[2] S. Bhutada and T. Iv, "Emotion Based Music," J. Emerg. Technol. Innov. Res., vol. 7, no. 4, pp. 2170-2175, 2020.
[3] B. T. Nguyen, V. H. Minh Trinh, T. V. Phan, and H. D. Nguyen, "An Efficient Real-Time Emotion Detection Using Camera and Facial Landmarks," in Seventh International Conference on Information Science and Technology, 2017, pp. 251-255.
[4] N. Chouhan, A. Khan, J. Zeb, and M. Hussnain, "Deep convolutional neural network and emotional learning based breast cancer detection using digital mammography," Comput. Biol. Med., vol. 132, p. 104318, 2021, doi: 10.1016/j.compbiomed.2021.104318.
[5] W. Deng, J. Hu, S. Zhang, and J. Guo, "DeepEmo: Real-world Facial Expression Analysis via Deep Learning," IEEE VCIP, 2015.
[6] E. Reinertsen and G. D. Clifford, "Emotional Detection and Music Recommendation System based on User Facial Expression," 2020, doi: 10.1088/1757-899X/912/6/062007.
[7] Z. Liu, W. Xu, W. Zhang, and Q. Jiang, "An emotion-based personalized music recommendation framework for emotion improvement," Inf. Process. Manag., vol. 60, no. 3, p. 103256, 2023, doi: 10.1016/j.ipm.2022.103256.
[8] R. Saranya, "Emotion Based Music Recommendation System," Int. Res. J. Eng. Technol., vol. 6, no. 3, 2019.
[9] M. Athavle, D. Mudale, and U. Shrivastav, "Music Recommendation Based on Face Emotion Recognition," vol. 02, no. 018, pp. 1-11, 2021.
[10] A. Mahadik, "Mood based music recommendation system," 2022.
[11] L. Nwosu, H. Wang, J. Lu, I. Unwala, X. Yang, and T. Zhang, "Deep Convolutional Neural Network for Facial Expression Recognition using Facial Parts," in IEEE 15th Intl Conf on Dependable, Autonomic and Secure Computing, pp. 1318-1321, 2017, doi: 10.1109/DASC-PICom-DataCom-CyberSciTec.2017.213.
[12] M. K. Chowdary, T. N. Nguyen, and D. J. Hemanth, "Deep learning-based facial emotion recognition for human-computer interaction applications," Neural Comput. Appl., 2021, doi: 10.1007/s00521-021-06012-8.
[13] I. More, V. Shirpurkar, Y. Gautam, and N. Singh, "Melomaniac - Emotion Based Music Recommendation System," IJARIIE, no. 3, pp. 1323-1329, 2021.
[14] R. De Prisco, A. Guarino, and D. Malandrino, "Induced Emotion-Based Music Recommendation through Reinforcement Learning," Applied Sciences, 2022.
[15] I. More, V. Shirpurkar, Y. Gautam, and N. Singh, "Melomaniac - Emotion Based Music Recommendation System," IJARIIE, no. 3, pp. 1323-1329, 2021.
[16] R. K. G. A, R. K. Kumar, and G. Sanyal, "Facial Emotion Analysis using Deep Convolution Neural Network," pp. 1-6.
[17] M. Athavle, D. Mudale, and U. Shrivastav, "Music Recommendation Based on Face Emotion Recognition," vol. 02, no. 018, pp. 1-11, 2021.