
Conference Paper · April 2024
DOI: 10.1109/MITADTSoCiCon60330.2024.10574938


2024 MIT Art, Design and Technology School of Computing International Conference (MITADTSoCiCon)
MIT ADT University, Pune, India. Apr 25-27, 2024

Pre-Trained Networks and Feature Fusion for Enhanced Multimodal Sentiment Analysis

Sheetal Kusal
Symbiosis Institute of Technology
Symbiosis International (Deemed University)
Pune, India
[email protected]
https://orcid.org/0000-0002-9830-6619

Prem Panchal
Symbiosis Institute of Technology
Symbiosis International (Deemed University)
Pune, India
[email protected]

Shruti Patil
Symbiosis Institute of Technology
Symbiosis International (Deemed University)
Pune, India
[email protected]
https://orcid.org/0000-0003-0881-5477

Abstract—Sentiment analysis, a fundamental domain of Natural Language Processing (NLP), is important in diverse applications such as customer service management, social media monitoring, and product review analysis. Nevertheless, as the use of visual data such as photos and videos in online communication continues to grow, conventional text-based sentiment analysis becomes inadequate. Multimodal sentiment analysis is a sophisticated method that uses textual and visual cues to gain a more comprehensive understanding of sentiment. This study introduces an approach to sentiment analysis that combines pre-trained models with a feature fusion of emoticon-based text and image data. Our algorithm differentiates itself from typical sentiment analysis by using emoticon data and image features to detect more nuanced sentiment variations, such as extreme positive/negative and neutral. The pre-trained ResNet50 model is used to analyse images from categorised datasets, which include emotions such as Angry, Disgust, Fear, Happy, Neutral, Sad, and Surprise. Textual data is processed using an RNN in combination with the pre-trained T5 model. To enhance efficiency, we apply feature fusion to both modalities, merging the features retrieved from text and images into a more comprehensive representation. These models, trained on massive datasets for specific tasks, serve as a robust starting point that we refine and adapt for multimodal sentiment analysis.

Index Terms—RNN, MT-CNN, Pre-Trained Models, T5, Feature Fusion, Feature Extraction, Denoising Autoencoder, CNN Autoencoder

I. INTRODUCTION

Online social media (OSM) has emerged as a powerful tool for expressing thoughts and views. People use popular social networking sites and applications to make short, frequent posts on trending issues and topics; people also provide feedback or reviews on products or services, which inevitably creates a lot of data. So, it becomes necessary to analyse and study the sentiments behind people's thoughts. Sentiment analysis (SA) has become a popular method of analysing a person's feelings or sentiments: it studies the sentiments people express in text and categorises them as positive, negative or neutral. It is used to investigate the impact of OSM on different online trends, the sentiments behind a person's views, and how those views affect social trends [1]. It is helpful in many applications, such as social network monitoring, text mining, opinion mining, customer support management, and business monitoring [2] [3]. Recently, sentiment analysis has given rise to a groundbreaking idea called multimodal sentiment analysis (MSA) [4]. SA has traditionally been applied only to textual data, but nowadays social media platforms mix text with emojis, images, videos, and audio, and analysing all these attributes with traditional SA becomes difficult. To analyse emoji-based text data, MSA has become pivotal [5]. However, research in sentiment analysis has focused mainly on text, while users of social networking sites such as Facebook, Instagram, and WhatsApp increasingly use emojis to express their feelings and emotions. MSA allows the detection of sentiment in images, videos, text, and audio [6] [7]; for text, it also handles emoticon text, where users add emojis to express their feelings. The move from standard sentiment analysis to multimodal sentiment analysis is a revolutionary step [8]. Traditional sentiment analysis only allowed users to analyse textual data, so finding out users' sentiments was challenging [9]. In general, multimodal sentiment analysis treats images and text separately and produces separate outputs, but what if a user

979-8-3503-6287-9/24/$31.00 ©2024 IEEE


uploads an image with captions (text)? If we analyse the two datasets separately, we will not be able to determine the user's true sentiment [10]. This research therefore focuses on combining text and image data using feature fusion [11] and reporting how image and text data perform together after applying pre-trained models and feature fusion.

So, in this work, we propose a novel idea that analyses image and text datasets using pre-trained models. We analyse the text and image separately and then merge both results using feature fusion, resulting in proper and precise user emotions that are straightforward to examine. For emoticon textual data, we used social media platform comment datasets [12]. The reason for using images is that they are a simple way to convey a user's behaviour throughout a discussion, and it is beneficial that multimodal sentiment analysis also allows emojis to be analysed. For images, we divide the image dataset into categories such as Angry, Disgust, Fear, Happy, Neutral, Sad, and Surprise. The following is a list of our contributions in this paper:
• To enhance the representation of features, we use a Denoising Autoencoder for textual data and a CNN Autoencoder for visual data. This architecture efficiently captures important information while reducing noise, leading to a more precise and informative representation that may be used for subsequent tasks.
• We propose a new multimodal cross-feature fusion model based on the attention mechanism, in which cross-feature fusion is utilised to merge the text and image results to deliver more effective and accurate sentiment classification.
Finally, the approach was evaluated on two standard datasets. The rest of the article is organised as follows. Section II briefly reviews recent research in multimodal sentiment analysis. Section III describes the datasets. Section IV presents the proposed attention-based multimodal sentiment classification model. Section V displays and discusses the model's experimental results and comparative trials on two separate Instagram datasets. Lastly, Section VI concludes the paper.

II. RELATED WORK

Kang Zhang et al. [13] propose a fine-grained attention mechanism-based multimodal sentiment classification model. First, features that more accurately represent the original text are extracted using a denoising autoencoder; then an improved conversion autoencoder, combined with the attention mechanism, extracts the features of the image. The authors present an attention-based fusion model for interactively exploring the modal representation vectors of the fusion of visual and text information. This model concentrates on the most significant elements of text and images and may incorporate the most helpful information from both sources. The authors use 3D-CNN and openSMILE for visual and acoustic feature extraction, respectively. They use two datasets, the MOUD dataset and the IEMOCAP database, and analyse two scenarios: the first a multimodal sentiment analysis dataset and the second a multimodal sentiment detection dataset. The authors consider different modalities (audio, video, image, and text); combining all of them, they report 81% accuracy on the IEMOCAP dataset, 67.90% on the MOUD dataset and 78.80% on the MOSI dataset.

Soujanya Poria et al. [14] examine three different deep learning-based architectures for multimodal sentiment classification. To begin, the authors extract text characteristics using a CNN: convolution filters of three different sizes, each with 50 feature maps, extract n-gram features from each utterance. Max pooling is then applied to the convolution output, followed by a ReLU activation. These activations are concatenated and passed to a 100-dimensional dense layer that is considered the textual representation of the utterance. Emotion tags are used to train this network at the utterance level. They use the MOUD, MOSI, and IEMOCAP datasets for the video, audio, and text modalities. Combining all the modalities, the accuracy using an SVM is 51.1%, and using an LSTM it is 52.7%.

Mathieu Page Fortin et al. [15] observe that most methods assume text and image modalities are always available at test time. Social media often violates this assumption because users do not consistently post text with images. Therefore, they propose a model that contains one classifier for text analysis, one classifier for image analysis, and one classifier that combines the two modalities for prediction. In addition to addressing the missing-modality problem, their experiments show that this multitasking framework works as a regularisation mechanism that improves generalisation, and that the model can handle missing modalities during training and can train on image-only and text-only examples. The authors use the Flickr sentiment data and the VSO dataset for visual sentiment.

Anthony Hu et al. [16] pursue a goal different from that of standard sentiment analysis: inferring the user's latent emotional state. The authors focus on predicting the verbal sentiment tags that users put on their Tumblr posts and treating them as self-reported sentiments. They propose a new approach to multimodal sentiment analysis using deep neural networks that combine visual analysis and natural language processing, and they use the Tumblr dataset.

Shamane Siriwardhana et al. [17] introduce three input modalities (text, audio, and image), using features extracted from independently trained SSL models. The work focuses primarily on using a pre-trained SSL model as a feature extractor to improve emotion recognition tasks. To achieve this, the authors designed a transformer-based multimodal fusion mechanism that works well by understanding

the connections between models. The authors present a novel fusion mechanism based on Transformers and attention that can combine the multimodal SSL features by considering the multidimensional nature of the SSL representations, obtaining state-of-the-art results for multimodal emotion recognition tasks. The authors use the IEMOCAP and MELD datasets.

Erik Cambria et al. [18] discuss some critical concerns frequently overlooked in multimodal sentiment analysis research, e.g., the role of speaker-independent models, the importance of the different modalities, and generalizability. The authors propose a comprehensive deep learning framework for multimodal sentiment analysis and emotion recognition to address these problems. They use convolutional neural networks to merge visual, text, and audio capabilities, resulting in a 10% performance improvement over existing methods. They apply a CNN for textual feature extraction and then an SVM (Support Vector Machine) for text classification, using the IEMOCAP, MOUD, and MOSI datasets.

In [19], the authors discussed AI-based algorithms such as machine learning, deep learning and pre-trained models for text-based emotion detection with different vectorization methods.

In their study, Licai Sun et al. [20] addressed the MuSe 2021 Multimodal Sentiment Analysis Challenge, concentrating on two sub-challenges: MuSe-Wilder and MuSe-Sent. MuSe-Wilder targets the continuous detection of emotions, namely arousal and valence, while MuSe-Sent focuses on the categorisation of discrete sentiment classes. Their methodology first extracts a wide range of characteristics from three prevalent modalities (vocal, visual, and textual), including low-level handcrafted features, features from unsupervised/supervised pre-trained models, and high-level deep representations. Next, the authors use a recurrent neural network with long short-term memory and a self-attention mechanism to model the complex time dependence of the feature sequences. Continuous regression and discrete classification are driven by concordance correlation coefficient losses and F1 losses, respectively.

III. DATASET DESCRIPTION

We have performed various experiments on two datasets for multimodal sentiment analysis tasks. We used the Instagram dataset, drawn from a social media platform, for both text and images. For the text dataset, we have taken the captions of images uploaded on Instagram, in the form of an emoticon textual dataset. The reason for using emoticons is that social media users describe their feelings and emotions through emojis, and a strong aspect of multimodal sentiment analysis is that it can also analyse emoticons. For the image dataset, we have taken images that carry captions, and we have used the same captions in the text dataset. Each image in the dataset was scaled to 224x224 pixels. These datasets can be found on Kaggle.

IV. PROPOSED MODEL

In this section, we introduce our multimodal sentiment analysis model. Figure 1 shows how the proposed model works with the two different datasets and how we combine both results using cross-feature fusion. Feature fusion combines the feature vectors of the training images extracted from the network layer and the feature vectors composed of other numerical data into the total weights, so that the proposed model can use as many features as possible for further classification.

Fig. 1. Flow Diagram of Proposed Model.

A. TEXT DATASET

For the text dataset, we first apply a denoising autoencoder, which recovers relevant information and deletes words that do not carry sentiment. It then separates different concepts, such as favourable, unfavourable, and neutral, following the text feature vector. After separating the distinct vectors, the authors utilise the RNN algorithm, a widely used method for sequential data. A recurrent neural network is an artificial neural network where

the connections between nodes form a directed or undirected graph along the sequence, which allows temporal dynamic behaviour; the resulting representation is then passed to a previously trained T5 network. Figure 2 shows the flow diagram for the textual dataset.

Fig. 2. Flow Diagram of Textual Dataset.

B. IMAGE DATASET

First, we employ a CNN autoencoder on the image dataset. An autoencoder is a convolutional neural network that has been trained to replicate the input image at the output layer. The encoder processes the image: it is a ConvNet that builds a low-dimensional representation of the image. We use the ResNet50 pre-trained model, a convolutional neural network with 50 layers; a version pre-trained on more than one million photos from the ImageNet database can be downloaded. With the pre-trained network, we categorise images into different emotion categories, such as Angry, Disgust, Fear, Happy, Neutral, Sad, and Surprise, perform cross-feature fusion, and combine the results with the image dataset. Figure 3 shows the flow diagram for the image dataset.

Fig. 3. Flow Diagram of Image Dataset.

C. FEATURE EXTRACTION

Feature extraction saves time by lowering the amount of processing resources needed while maintaining the integrity of vital or relevant data; it also limits the amount of redundant data in each analysis.

1) TEXT FEATURE EXTRACTION: The text feature extraction layer converts each word into a low-dimensional vector, also known as a word embedding. The noise in text data from social networks reduces the accuracy of feature extraction, so text features were recovered using a denoising autoencoder to decrease noise interference and provide more robust features. The Denoising Autoencoder (DAE) uses a data augmentation method in which parts of the input matrix are randomly masked according to a predetermined probability distribution. This process deliberately introduces controlled "noise" into the data by temporarily suppressing specific characteristics, while the vectorised matrix representation of the text keeps its general structure. Figure 4 presents the denoising autoencoder. The hidden portions symbolise the absence of information, similar to data that has been corrupted or contaminated [13]. To tackle this, the DAE is trained to rebuild the initial input from its masked form. The reconstruction primarily involves removing noise from the data while also learning to restore missing information and identify the fundamental characteristics. The reconstructed matrix is then compared to the original matrix to measure the reconstruction error. The error signal guides the training process and facilitates the optimisation of the parameters of both the encoder and decoder. Through repeated training, the DAE gradually improves its capacity to eliminate noise and retrieve resilient characteristics from the masked input, finally producing the intended denoised encoding. The encoding of each lower layer was done in the same way, with the upper layer's encoding properties acting as input to the lower layer; the procedure was repeated until the desired number of layers was encoded.

Fig. 4. Structure Diagram of Denoising Autoencoder.

2) IMAGE FEATURE EXTRACTION: Image features were extracted using an enhanced attention-based convolutional autoencoder (CNN-ATT). The aim is to learn automatically from the data and map raw data to a useful representation. The CNN deep neural network comprises an encoder and a decoder; its essence is to extract prospective features from the data and build a model using them, as shown in Figure 5. The encoder extracts hidden latent variables from the source data, reduces the encoding result with Gaussian noise and converts it to a latent function with a Gaussian distribution. The decoder converts the latent characteristics into a reconstructed probability distribution that is as close to the original as possible.

3) FEATURE FUSION MODEL: To combine the features generated by the text and image feature extractors, we created CFF-ATT, an attention-based feature fusion module. Figure 6 depicts the fusion module in further detail. The outputs of the first two modules are the inputs of the feature fusion module. There are two types of input, primary and secondary, and by merging the two input modes we construct an output target mode. Feature fusion is an essential approach in deep learning that combines many feature representations from multiple sources to improve the model's ability to distinguish and classify distinct tasks.
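As a concrete illustration of the idea behind this fusion step, scaled dot-product cross-attention between the two modalities can be sketched in plain NumPy. This is a minimal sketch, not the exact CFF-ATT module: the feature dimensions, the mean-pooling, and the function names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(text_feats, image_feats):
    """Fuse a text feature matrix (n_t, d) with an image feature matrix
    (n_i, d): text rows act as queries over image rows, and the attended
    image summary is concatenated with the pooled text summary."""
    d = text_feats.shape[1]
    # scaled dot-product attention scores: text = queries, image = keys/values
    scores = text_feats @ image_feats.T / np.sqrt(d)   # (n_t, n_i)
    weights = softmax(scores, axis=1)                  # each row sums to 1
    attended = weights @ image_feats                   # (n_t, d)
    # pool each modality to one vector and concatenate into the fused vector
    fused = np.concatenate([text_feats.mean(axis=0), attended.mean(axis=0)])
    return fused, weights

rng = np.random.default_rng(0)
text = rng.normal(size=(5, 16))    # e.g. 5 token-level text features
image = rng.normal(size=(7, 16))   # e.g. 7 region-level image features
fused, w = cross_attention_fuse(text, image)
print(fused.shape)                       # (32,)
print(np.allclose(w.sum(axis=1), 1.0))   # True: attention rows are normalised
```

In this sketch the text features select the most relevant image features before the two modalities are merged, which mirrors the described idea of attending to the most significant elements of each modality.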

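In the same spirit, the mask-corrupt-and-reconstruct training loop of the denoising autoencoder described in the text feature extraction subsection can be sketched as a one-hidden-layer NumPy model. The data, dimensions, masking probability, and learning rate here are toy assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy stand-in for vectorised text: 200 samples, 20-dimensional features
X = rng.random((200, 20))

def mask_noise(X, p, rng):
    """Randomly zero entries with probability p: the controlled 'noise'."""
    return X * (rng.random(X.shape) >= p)

# one-hidden-layer autoencoder trained by plain gradient descent
d, h, lr = X.shape[1], 8, 0.1
W1 = rng.normal(0.0, 0.1, (d, h)); b1 = np.zeros(h)   # encoder
W2 = rng.normal(0.0, 0.1, (h, d)); b2 = np.zeros(d)   # decoder

losses = []
for step in range(300):
    Xc = mask_noise(X, 0.3, rng)        # corrupt the input ...
    H = np.tanh(Xc @ W1 + b1)           # encoder: noise-robust hidden code
    Xhat = H @ W2 + b2                  # decoder: linear reconstruction
    err = Xhat - X                      # ... but reconstruct the CLEAN X
    losses.append(float((err ** 2).mean()))
    # backpropagate the mean-squared reconstruction error
    g_out = 2.0 * err / err.size
    gW2, gb2 = H.T @ g_out, g_out.sum(axis=0)
    gH = (g_out @ W2.T) * (1.0 - H ** 2)
    gW1, gb1 = Xc.T @ gH, gH.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

print(losses[0] > losses[-1])   # reconstruction error decreases over training
```

The key point the sketch mirrors is that the input is corrupted by random masking while the loss is always computed against the clean matrix, which is what forces the encoder to learn features that survive the noise.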
Fig. 5. Structure Diagram of Image Feature Extraction (CNN Autoencoder).

This method combines or merges feature vectors retrieved from different network levels or branches within the model's architecture. In this instance, our proposed model utilises feature fusion that combines two separate sets of feature vectors: the first set is taken from a common weighted network layer applied to the training pictures, while the second set consists of additional relevant numerical data. This fusion technique exploits the synergistic advantages of image-based features and supplementary numerical data, allowing the model to use a more extensive and holistic representation for enhanced classification accuracy, with as many features as feasible available for the final categorisation.

Fig. 6. Structure Diagram of Feature Fusion Model.

V. RESULT ANALYSIS

The authors employed a comprehensive evaluation framework encompassing various performance measures to assess the proposed systems on balanced and unbalanced datasets. These metrics included F1-score, accuracy, classification reports, confusion matrices, ROC curves, precision, and recall, enabling a detailed analysis of emotion class-wise performance. Accuracy, a commonly used metric in multi-class classification, reflects the overall proportion of correctly classified instances. It is calculated by summing the True Positive (TP) and True Negative (TN) values from the confusion matrix. However, accuracy can be misleading on unbalanced datasets, as it does not account for class distribution. Therefore, additional metrics such as F1-score and class-specific precision and recall are crucial for a more nuanced evaluation. For clarity, we define the key terms used in the confusion matrix:
• True Positives (TP): Instances correctly classified as the target class.
• True Negatives (TN): Instances correctly classified as not belonging to the target class.
Accuracy is ultimately the sum of TP and TN observations divided by the total number of instances, representing the percentage of all correctly classified cases.

A. TEXT DATASET RESULT ANALYSIS

In this work, the authors used MT-CNN for the image classification process to find the accuracy of emotion recognition, using the multi-feature extraction method to extract the features of the images. To classify emotion from text, the authors used Twitter and YouTube comments. The RNN method was applied to find the accuracy of the emotion recognition system; by combining features, the authors achieved an accuracy of 78%. This research used two baseline models, USE and BERT text vectorisers, in conjunction with a Deep Neural Network (DNN) for text categorisation. The authors carried out six experiments to investigate the suggested multimodal framework by expanding on these initial tests. Every experiment used the refined image feature extractor enhanced with a pre-trained USE or BERT text vectoriser. The outcomes of these models on the evaluation dataset are shown in Table I. For text classification, combining a Denoising Autoencoder and a T5 pre-trained model integrated with an RNN achieved the highest accuracy of 87.25% on the text dataset. It surpassed the performance of the other models, demonstrating the effectiveness of combining the denoising capabilities of the autoencoder with the robust language understanding of T5, and highlighting the potential of combining RNN-based architectures with pre-trained models like T5 to achieve superior results in text classification tasks. The confusion matrix is shown in Figure 7; similarly, Figure 8 shows the ROC curve for the text dataset.

Fig. 7. Confusion Matrix of RNN (Text Dataset).

B. IMAGE DATASET RESULT ANALYSIS

For image classification, the image dataset was processed using the MT-CNN (Multitask Cascaded Convolutional Network) algorithm with a CNN autoencoder and the ResNet-50 pre-trained model. MT-CNN achieved 82% accuracy on the image dataset, whereas the ResNet-50 pre-trained network achieved 84.12%. The confusion

matrix is shown in Figure 9, and Figure 10 presents the accuracy curve of MT-CNN on the image dataset.

Fig. 8. ROC Curve of RNN (Text Dataset).

Fig. 9. Confusion Matrix of MT-CNN (Image Dataset).

Fig. 10. Accuracy Curve of MT-CNN (Image Dataset).

C. FEATURE FUSION RESULT ANALYSIS

The features generated by the text and image feature extractors were combined; after fusing the features from both modalities and performing cross-feature fusion, the model achieved 75% accuracy. Figure 11 shows the training and validation accuracy of the model: the red line indicates training accuracy and the blue line indicates validation accuracy. The authors have considered the average results of the models.

Fig. 11. Accuracy of Feature Fusion (Image + Text Dataset).

Models that used both picture and text embeddings performed better than models that used just one kind of input. The strong performance may be related to the complementary relationship between picture and text characteristics, which helps reduce each modality's constraints. Multimodal models have clear benefits, such as enhanced resistance to data noise and more adaptability, and they handle data obtained via unsupervised methodologies well. Our study advances the field of Natural Language Processing (NLP) by extending text categorisation algorithms to include emotion recognition. Utilising current progress in picture recognition and computer vision within this paradigm shows significant potential for many emotion-related tasks, such as sentiment analysis, dialogue modelling, and image processing. Additionally, using insights from both picture and text classification domains enhances the effectiveness of this technique. Integrating statistical techniques to allocate spatial picture attributes inside convolutional layers has greatly improved deep-learning models for these applications. Moreover, sophisticated deep learning methods for natural language processing, which include comprehensive lexical and semantic representations of language structures, have shown a growing level of dependability and efficiency in extracting meaning from vectorised characters and word embeddings in dimensional space. This study can be useful in various application areas, such as sentiment analysis of product reviews, where customers post product images and textual comments, and emotional analysis of social media posts, where users post their thoughts with

profile pictures, text and image contents.

TABLE I
DATASET EVALUATION WITH RESPECT TO PRE-TRAINED NETWORK AND FEATURE FUSION

Modalities     | Algorithm                      | Accuracy (%)
Text           | RNN                            | 78
Text           | T5 pre-trained network         | 87
Image          | MT-CNN                         | 82
Image          | ResNet-50 pre-trained network  | 84
Text + Image   | Cross-feature fusion           | 75

VI. CONCLUSION

This work addressed multimodal sentiment analysis on social network platforms like Instagram, Twitter, and Facebook, where it is not easy to analyse different modalities simultaneously using attention processes and feature fusion. The proposed model can efficiently reduce noise from text data and produce more accurate text attributes, and it retrieves the picture characteristics essential for sentiment classification with the MT-CNN model. To learn the interaction information between text and visuals, attention techniques for effective feature matching have been introduced in various modes of feature fusion. The model captures the sentimental representation of multimodal data well, leveraging model-specific and modal-interaction information to correctly estimate the sentiment polarity of user posts and the user's genuine feelings, and to better identify and respond to specific social media events. In future work, more modalities, such as audio, video or sensor data, can be used to recognise emotions.

VII. ABBREVIATIONS

• SA – Sentiment Analysis
• MSA – Multimodal Sentiment Analysis
• RNN – Recurrent Neural Network
• CNN – Convolutional Neural Network
• MT-CNN – Multi-Task Cascaded Convolutional Neural Network
• NLP – Natural Language Processing
• DAE – Denoising Autoencoder
• DNN – Deep Neural Network
• OSM – Online Social Media

REFERENCES

[1] Tulika Saha, Apoorva Upadhyaya, Sriparna Saha, and Pushpak Bhattacharyya. Towards sentiment and emotion aided multi-modal speech act classification in twitter. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5727–5737, 2021.
[2] Louis-Philippe Morency, Rada Mihalcea, and Payal Doshi. Towards multimodal sentiment analysis: Harvesting opinions from the web. In Proceedings of the 13th International Conference on Multimodal Interfaces, pages 169–176, 2011.
[3] Shelley Gupta, Archana Singh, and Jayanthi Ranjan. Multimodal, multiview and multitasking depression detection framework endorsed with auxiliary sentiment polarity and emotion detection. International Journal of System Assurance Engineering and Management, 14(Suppl 1):337–352, 2023.
[4] Sadam Al-Azani and El-Sayed M El-Alfy. Early and late fusion of emojis and text to enhance opinion mining. IEEE Access, 9:121031–121045, 2021.
[5] Abayomi Bello, Sin-Chun Ng, and Man-Fai Leung. A bert framework to sentiment analysis of tweets. Sensors, 23(1):506, 2023.
[6] Ha-Nguyen Tran and Erik Cambria. Ensemble application of elm and gpu for real-time multimodal sentiment analysis. Memetic Computing, 10:3–13, 2018.
[7] Soujanya Poria, Navonil Majumder, Devamanyu Hazarika, Erik Cambria, Alexander Gelbukh, and Amir Hussain. Multimodal sentiment analysis: Addressing key issues and setting up the baselines. IEEE Intelligent Systems, 33(6):17–25, 2018.
[8] Nan Xu, Wenji Mao, and Guandan Chen. A co-memory network for multimodal sentiment analysis. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 929–932, 2018.
[9] Ramandeep Kaur and Sandeep Kautish. Multimodal sentiment analysis: A survey and comparison. Research Anthology on Implementing Sentiment Analysis Across Multiple Disciplines, pages 1846–1870, 2022.
[10] Soujanya Poria, Erik Cambria, and Alexander Gelbukh. Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2539–2544, 2015.
[11] Sheetal Kusal, Shruti Patil, Jyoti Choudrie, Ketan Kotecha, Deepali Vora, and Ilias Pappas. A systematic review of applications of natural language processing and future challenges with special emphasis in text-based emotion detection. Artificial Intelligence Review, 56(12):15129–15215, 2023.
[12] Mohammad Soleymani, David Garcia, Brendan Jou, Björn Schuller, Shih-Fu Chang, and Maja Pantic. A survey of multimodal sentiment analysis. Image and Vision Computing, 65:3–14, 2017.
[13] Kang Zhang, Yushui Geng, Jing Zhao, Jianxin Liu, and Wenxiao Li. Sentiment analysis of social media via multimodal feature fusion. Symmetry, 12(12):2010, 2020.
[14] Yuan Zhang, Lin Cui, Wei Wang, and Yuxiang Zhang. A survey on software defined networking with multiple controllers. Journal of Network and Computer Applications, 103:101–118, 2018.
[15] Ioannis Kitsos. Entity-based Summarization of Web Search Results using MapReduce. PhD thesis, Foundation for Research and Technology-Hellas.
[16] Xian Xiaobing, Bao Chao, and Chen Feng. An insight into traffic safety management system platform based on cloud computing. Procedia-Social and Behavioral Sciences, 96:2643–2646, 2013.
[17] Seyed Nima Khezr and Nima Jafari Navimipour. Mapreduce and its applications, challenges, and architecture: a comprehensive review and directions for future research. Journal of Grid Computing, 15:295–321, 2017.
[18] Nenavath Srinivas Naik, Atul Negi, and VN Sastry. A review of adaptive approaches to mapreduce scheduling in heterogeneous environments. In 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pages 677–683. IEEE, 2014.
[19] Sheetal D Kusal, Shruti G Patil, Jyoti Choudrie, and Ketan V Kotecha. Understanding the performance of ai algorithms in text-based emotion detection for conversational agents. ACM Transactions on Asian and Low-Resource Language Information Processing, 2024.
[20] Md Wasi-ur Rahman, Nusrat Sharmin Islam, Xiaoyi Lu, Dipti Shankar, and Dhabaleswar K DK Panda. Mr-advisor: A comprehensive tuning, profiling, and prediction tool for mapreduce execution frameworks on hpc clusters. Journal of Parallel and Distributed Computing, 120:237–250, 2018.