Shruti Patil
Symbiosis Institute of Technology
Symbiosis International (Deemed University)
Pune, India
[email protected]
https://orcid.org/0000-0003-0881-5477
Abstract—Sentiment analysis, a fundamental domain of Natural Language Processing (NLP), is important in diverse applications such as customer service management, social media monitoring, and product review analysis. Nevertheless, as the use of visual data such as photos and videos in online communication continues to grow, conventional text-based sentiment analysis becomes inadequate. Multimodal sentiment analysis is a sophisticated method that uses textual and visual clues to gain a more comprehensive understanding of sentiment. This study introduces an innovative approach to sentiment analysis that combines pre-trained models with a feature fusion of emoticon-based text and image data. Our algorithm differentiates itself from typical sentiment analysis by using emoticon data and image features to detect more nuanced sentiment variations, such as extreme positive/negative and neutral. The pre-trained ResNet50 model is used to analyse images from categorised datasets, which include emotions such as Angry, Disgust, Fear, Happy, Neutral, Sad, and Surprise, while textual data is processed using the RNN technique in combination with the pre-trained T5 model. To enhance efficiency, we use feature fusion algorithms on both modalities, merging the retrieved features from text and images to provide a more comprehensive representation. These models, trained on massive datasets for specific tasks, serve as a robust starting point, allowing us to refine and adapt them for multimodal sentiment analysis.

Index Terms—RNN, MT-CNN, Pre-Trained Models, T5, Feature Fusion, Feature Extraction, Auto Noise Encoder, CNN Autoencoder

I. INTRODUCTION

Online social media (OSM) has emerged as a powerful tool for expressing thoughts and views. People use popular social networking sites or applications to make short, frequent posts on trending issues and topics; people also provide feedback or reviews on products or services, which inevitably creates a lot of data. So, it becomes necessary to analyse and study the sentiments behind people's thoughts.

Sentiment analysis (SA) is becoming a popular method of analysing a person's feelings or sentiments. It studies the sentiments of people expressed in text and categorises them as positive, negative or neutral. It is used to investigate the impact of OSM on different online trends, the sentiments behind a person's views, and how these affect social trends based on text [1]. It is helpful in many applications, such as social network monitoring, text mining, opinion mining, customer support management, and business monitoring [2], [3]. Recently, however, sentiment analysis has given rise to a groundbreaking idea called multimodal sentiment analysis (MSA) [4]. SA has traditionally been applied only to textual data, but nowadays social media platforms combine text with emojis, images, videos, and audio, so analysing all these attributes with traditional SA becomes difficult. To analyse emoji-based text data, multimodal sentiment analysis has become pivotal [5]. However, research in sentiment analysis has focused mainly on text. Nowadays, users of social networking sites such as Facebook, Instagram, and WhatsApp increasingly use emojis to express their feelings and emotions. MSA allows the detection of sentiments in images, videos, text, and audio [6], [7]; for text, it also covers emoticon text, where users employ emojis to express their feelings and emotions. Moving from standard sentiment analysis to multimodal sentiment analysis is a revolutionary step [8]. Traditional sentiment analysis only allowed users to analyse textual data, and finding out users' sentiments became challenging [9]. In general, in multimodal sentiment analysis, everyone analyses images and text differently and shows output differently, but what if a user
the connections between models. The author presents a novel fusion mechanism based on Transformers and attention that can combine the multimodal features of SSL by considering the multidimensional nature of the SSL function, obtaining cutting-edge results for multimodal emotion recognition tasks. The author uses the IEMOCAP and MELD datasets.
Erik Cambria et al. [18] discuss some critical concerns frequently overlooked in multimodal sentiment analysis research, e.g., the speaker-independent model's role, the importance of diverse modalities, and generalizability. The authors propose a
comprehensive framework based on deep learning to analyse
multimodal emotions and emotional recognition to solve the
above problems. They use convolutional neural networks to
merge visual, text, and audio capabilities, resulting in a 10%
performance improvement over existing methods. They apply
the CNN algorithm for Textual Features extraction, and then
the author applies SVM (Support Vector Machine) for Text
Classification. The author uses IEMOCAP, MOUD Dataset,
and MOSI Dataset.
In [19], the authors discussed AI-based algorithms such
as machine learning, deep learning and pre-trained models
for text-based emotion detection with different vectorization
methods.
In their study, Licai Sun et al. [20] addressed the MuSe 2021 Multimodal Sentiment Analysis Challenge, concentrating on two sub-challenges: MuSe-Wilder and MuSe-Sent. MuSe-Wilder's objective was the continuous detection of emotions, namely arousal and valence. MuSe-Sent, on the other hand, focused on the categorisation of particular
moods. Their methodology consisted of first extracting a wide
range of characteristics from three prevalent modalities: vocal,
visual, and textual. These include low-level handcrafted features and high-level deep representations from unsupervised/supervised pre-trained models. Next, the authors use a recurrent neural network with long short-term memory and a self-attention mechanism to model the complex time dependence of the feature sequences. Continuous regression and discrete classification are driven by concordance correlation coefficient (CCC) losses and F1 losses, respectively.

IV. PROPOSED MODEL

In this section, we introduce our multimodal sentiment analysis model. Figure 1 shows how the proposed model works with two different datasets and how the two results are combined using cross-feature fusion. Feature fusion combines the feature vectors of the training image extracted from the network layer with feature vectors composed of other numerical data into the total weights, so that the proposed model can use as many features as possible for further classification.
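As a rough illustration of this fusion step, the sketch below concatenates an image feature vector with a text feature vector and feeds the result to a small classifier. It is a minimal PyTorch sketch under assumed dimensions (2048-d ResNet50 image features, 512-d text features, seven emotion classes); the layer sizes and names are illustrative, not the exact configuration used in the paper.

import torch
import torch.nn as nn

class CrossFeatureFusion(nn.Module):
    # Concatenation-based fusion of image and text feature vectors (illustrative sketch).
    def __init__(self, img_dim=2048, txt_dim=512, num_classes=7):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 512),  # merge both modalities into one representation
            nn.ReLU(),
            nn.Linear(512, num_classes),        # seven emotion / sentiment classes
        )

    def forward(self, img_feat, txt_feat):
        fused = torch.cat([img_feat, txt_feat], dim=1)  # fused vector = [image | text]
        return self.classifier(fused)

# example with one image feature vector and one text feature vector
img_feat = torch.randn(1, 2048)
txt_feat = torch.randn(1, 512)
logits = CrossFeatureFusion()(img_feat, txt_feat)
print(logits.shape)  # torch.Size([1, 7])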
the node's connections linearly create a directed or undirected graph. Before switching to a T5 network that has previously been trained, this allows for temporary dynamic behaviour. Figure 2 shows the flow diagram of the textual dataset.

The denoising autoencoder (DAE) is trained to identify the fundamental characteristics of the input. The reconstructed matrix is then compared to the original matrix to measure the extent of the reconstruction error. The error signal serves as a guiding mechanism for the training process and facilitates the optimisation of the parameters of both the encoder and decoder. By undergoing repeated training, the DAE gradually improves its capacity to eliminate noise and retrieve resilient characteristics from the masked input, finally producing the intended denoised encoding. The encoding of the bottom layer was done in the same way, with the top layer's encoding properties acting as input to the lower layer. The procedure was then repeated until the desired number of layers was encoded.

Fig. 2. Flow Diagram of Textual Dataset.
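The following is a minimal sketch of the text-side idea described above, assuming PyTorch and the Hugging Face transformers library: sentence embeddings are taken from a pre-trained T5 encoder, and a small denoising autoencoder learns to reconstruct them from a noise-corrupted copy. The checkpoint name "t5-small", the mean pooling, the noise level, and the layer sizes are illustrative assumptions rather than the authors' exact setup.

import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-small")
t5 = T5EncoderModel.from_pretrained("t5-small")

def embed(texts):
    # mean-pool the last hidden states of the pre-trained T5 encoder
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = t5(**batch).last_hidden_state   # (batch, seq_len, 512)
    return hidden.mean(dim=1)                    # (batch, 512)

class DenoisingAutoencoder(nn.Module):
    def __init__(self, dim=512, bottleneck=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU())
        self.decoder = nn.Linear(bottleneck, dim)

    def forward(self, x, noise_std=0.1):
        noisy = x + noise_std * torch.randn_like(x)  # corrupt the input
        code = self.encoder(noisy)                   # robust low-dimensional text features
        return self.decoder(code), code

dae = DenoisingAutoencoder()
feats = embed(["I love this phone :)", "Worst service ever"])
recon, code = dae(feats)
loss = nn.functional.mse_loss(recon, feats)          # reconstruction error drives training
loss.backward()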
B. IMAGE DATASET

First, we employ a CNN autoencoder on the image dataset. An autoencoder is a convolutional neural network that has been trained to replicate an input image at the output layer. The encoder processes the image; it is a ConvNet that uses the CNN autoencoder to build a low-dimensional representation of the image. We use the ResNet50 pre-trained model. ResNet50 is a convolutional neural network with 50 layers; a version pre-trained on more than one million images from the ImageNet database can be downloaded. Using the pre-trained network, we categorise images into different emotion categories, namely Angry, Disgust, Fear, Happy, Neutral, Sad, and Surprise, perform cross-feature fusion, and combine the results with the image dataset. Figure 3 shows the flow diagram for the image dataset.
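As an illustration of how a pre-trained ResNet50 can be adapted to the seven emotion categories, the sketch below (using torchvision with ImageNet weights) replaces the final fully connected layer and also exposes the 2048-dimensional feature vector used later for fusion; input preprocessing is assumed to follow the standard ImageNet resizing and normalisation.

import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

# ResNet50: a 50-layer CNN pre-trained on more than one million ImageNet images
model = resnet50(weights=ResNet50_Weights.DEFAULT)

# replace the 1000-class ImageNet head with a 7-class emotion head
model.fc = nn.Linear(model.fc.in_features, 7)  # Angry, Disgust, Fear, Happy, Neutral, Sad, Surprise

# everything except the final layer acts as a feature extractor for fusion
feature_extractor = nn.Sequential(*list(model.children())[:-1])

dummy = torch.randn(1, 3, 224, 224)              # one normalised RGB image
features = feature_extractor(dummy).flatten(1)   # shape (1, 2048)
logits = model(dummy)                            # shape (1, 7)
print(features.shape, logits.shape)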
Accuracy is ultimately the ratio of TP and TN observations
divided by the total number of instances, representing the
percentage of all correctly classified cases.
V. RESULT ANALYSIS
The authors employed a comprehensive evaluation frame-
work encompassing various performance measures to assess
the proposed systems on balanced and unbalanced datasets.
These metrics included F1-score, accuracy, classification re-
ports, confusion matrices, ROC curves, precision, and recall,
enabling a detailed analysis of emotion class-wise perfor-
mance. Accuracy, a commonly used metric in multi-class clas-
sification, reflects the overall proportion of correctly classified
instances. It is calculated by summing the True Positive (TP)
and True Negative (TN) values from the confusion matrix.
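With the usual false positive (FP) and false negative (FN) counts alongside TP and TN, this corresponds to the standard formula:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}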
However, accuracy can be misleading on unbalanced datasets, as it does not account for class distribution. Therefore, additional metrics such as F1-score and class-specific precision and recall are crucial for a more nuanced evaluation. For clarity, we define the key terms used in the confusion matrix:
• True Positives (TP): Instances correctly classified as the target class.
• True Negatives (TN): Instances correctly classified as not belonging to the target class.

Fig. 7. Confusion Matrix of RNN (Text Dataset).

B. IMAGE DATASET RESULT ANALYSIS

For image classification, the image dataset was processed using the MT-CNN (Multi-task Cascaded Convolutional) algorithm with a CNN autoencoder and the ResNet-50 pre-trained model. MT-CNN achieved 82% accuracy on the image dataset, whereas the ResNet-50 pre-trained network model achieved 84.12%. The Confusion
Fig. 8. ROC Curve of RNN (Text Dataset).
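These measures, and per-class ROC curves like the one in Figure 8, can be reproduced with scikit-learn. The snippet below is a minimal sketch with made-up labels and scores (not the paper's actual predictions), showing accuracy, macro F1, the confusion matrix, a classification report, and one-vs-rest ROC/AUC per emotion class:

import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, confusion_matrix,
                             classification_report, roc_curve, auc)
from sklearn.preprocessing import label_binarize

classes = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]
# illustrative gold labels and predictions (indices into `classes`)
y_true = np.array([0, 1, 2, 3, 4, 5, 6, 3, 5, 4])
y_pred = np.array([0, 1, 4, 3, 4, 5, 6, 3, 3, 4])
y_score = np.random.rand(len(y_true), len(classes))   # stand-in per-class scores

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=classes, zero_division=0))

# one-vs-rest ROC curve and AUC per emotion class
y_bin = label_binarize(y_true, classes=range(len(classes)))
for i, name in enumerate(classes):
    fpr, tpr, _ = roc_curve(y_bin[:, i], y_score[:, i])
    print(f"{name}: AUC = {auc(fpr, tpr):.2f}")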
TABLE I
DATASET EVALUATION WITH RESPECT TO PRE-TRAINED NETWORK AND FEATURE FUSION

Modalities    | Algorithm                     | Accuracy (%)
Text          | RNN                           | 78
Text          | T5 pre-trained network        | 87
Image         | MT-CNN                        | 82
Image         | ResNet-50 pre-trained network | 84
Text + Image  | Cross feature fusion          | 75

profile pictures, text and image contents.

VI. CONCLUSION

This work addresses multimodal sentiment analysis on social network platforms such as Instagram, Twitter, and Facebook. It is not easy to analyse different modalities simultaneously on the basis of attention processes and feature fusion. The proposed model can efficiently reduce noise from text data and produce more accurate text attributes, and it retrieves the picture characteristics essential for sentiment classification with the MTCNN model. To learn interaction information between text and visuals, attention techniques for effective feature matching have been re-introduced in various modes for feature fusion. The model captures the sentimental representation of multimodal data well, leveraging model-specific and modal interaction information to correctly estimate the sentiment polarity of user tweets and the user's genuine feelings, and to better identify and respond to specific social media events. In future work, more modalities such as audio, video or sensor data can be used to recognise emotions.

VII. ABBREVIATIONS

• SA – Sentiment Analysis
• MSA – Multimodal Sentiment Analysis
• RNN – Recurrent Neural Network
• CNN – Convolutional Neural Network
• MTCNN – Multi-Task Cascaded Convolutional Neural Networks
• NLP – Natural Language Processing
• DAE – Denoising Autoencoder
• DNN – Deep Neural Network
• OSM – Online Social Media

REFERENCES

[1] Tulika Saha, Apoorva Upadhyaya, Sriparna Saha, and Pushpak Bhattacharyya. Towards sentiment and emotion aided multi-modal speech act classification in twitter. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5727–5737, 2021.
[2] Louis-Philippe Morency, Rada Mihalcea, and Payal Doshi. Towards multimodal sentiment analysis: Harvesting opinions from the web. In Proceedings of the 13th International Conference on Multimodal Interfaces, pages 169–176, 2011.
[3] Shelley Gupta, Archana Singh, and Jayanthi Ranjan. Multimodal, multiview and multitasking depression detection framework endorsed with auxiliary sentiment polarity and emotion detection. International Journal of System Assurance Engineering and Management, 14(Suppl 1):337–352, 2023.
[4] Sadam Al-Azani and El-Sayed M. El-Alfy. Early and late fusion of emojis and text to enhance opinion mining. IEEE Access, 9:121031–121045, 2021.
[5] Abayomi Bello, Sin-Chun Ng, and Man-Fai Leung. A BERT framework to sentiment analysis of tweets. Sensors, 23(1):506, 2023.
[6] Ha-Nguyen Tran and Erik Cambria. Ensemble application of ELM and GPU for real-time multimodal sentiment analysis. Memetic Computing, 10:3–13, 2018.
[7] Soujanya Poria, Navonil Majumder, Devamanyu Hazarika, Erik Cambria, Alexander Gelbukh, and Amir Hussain. Multimodal sentiment analysis: Addressing key issues and setting up the baselines. IEEE Intelligent Systems, 33(6):17–25, 2018.
[8] Nan Xu, Wenji Mao, and Guandan Chen. A co-memory network for multimodal sentiment analysis. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 929–932, 2018.
[9] Ramandeep Kaur and Sandeep Kautish. Multimodal sentiment analysis: A survey and comparison. Research Anthology on Implementing Sentiment Analysis Across Multiple Disciplines, pages 1846–1870, 2022.
[10] Soujanya Poria, Erik Cambria, and Alexander Gelbukh. Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2539–2544, 2015.
[11] Sheetal Kusal, Shruti Patil, Jyoti Choudrie, Ketan Kotecha, Deepali Vora, and Ilias Pappas. A systematic review of applications of natural language processing and future challenges with special emphasis in text-based emotion detection. Artificial Intelligence Review, 56(12):15129–15215, 2023.
[12] Mohammad Soleymani, David Garcia, Brendan Jou, Björn Schuller, Shih-Fu Chang, and Maja Pantic. A survey of multimodal sentiment analysis. Image and Vision Computing, 65:3–14, 2017.
[13] Kang Zhang, Yushui Geng, Jing Zhao, Jianxin Liu, and Wenxiao Li. Sentiment analysis of social media via multimodal feature fusion. Symmetry, 12(12):2010, 2020.
[14] Yuan Zhang, Lin Cui, Wei Wang, and Yuxiang Zhang. A survey on software defined networking with multiple controllers. Journal of Network and Computer Applications, 103:101–118, 2018.
[15] Ioannis Kitsos. Entity-based Summarization of Web Search Results using MapReduce. PhD thesis, Foundation for Research and Technology-Hellas.
[16] Xian Xiaobing, Bao Chao, and Chen Feng. An insight into traffic safety management system platform based on cloud computing. Procedia - Social and Behavioral Sciences, 96:2643–2646, 2013.
[17] Seyed Nima Khezr and Nima Jafari Navimipour. MapReduce and its applications, challenges, and architecture: a comprehensive review and directions for future research. Journal of Grid Computing, 15:295–321, 2017.
[18] Nenavath Srinivas Naik, Atul Negi, and V. N. Sastry. A review of adaptive approaches to MapReduce scheduling in heterogeneous environments. In 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pages 677–683. IEEE, 2014.
[19] Sheetal D. Kusal, Shruti G. Patil, Jyoti Choudrie, and Ketan V. Kotecha. Understanding the performance of AI algorithms in text-based emotion detection for conversational agents. ACM Transactions on Asian and Low-Resource Language Information Processing, 2024.
[20] Md Wasi-ur Rahman, Nusrat Sharmin Islam, Xiaoyi Lu, Dipti Shankar, and Dhabaleswar K. (DK) Panda. MR-Advisor: A comprehensive tuning, profiling, and prediction tool for MapReduce execution frameworks on HPC clusters. Journal of Parallel and Distributed Computing, 120:237–250, 2018.