
Received 28 November 2023, accepted 7 January 2024, date of publication 16 January 2024, date of current version 25 January 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3354844

A Survey on Multimodal Aspect-Based Sentiment Analysis

HUA ZHAO, MANYU YANG, XUEYANG BAI, AND HAN LIU
College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
Corresponding author: Hua Zhao ([email protected])
This work was supported in part by the National Key Research and Development Program of China under Grant 2022ZD0119501; in part
by the National Natural Science Foundation of China under Grant 52374221; in part by the Science and Technology Development Fund of
Shandong Province of China under Grant ZR2021MG038, Grant ZR2021QG038, Grant ZR2022MF288, and Grant ZR2023MF097; in part
by the Taishan Scholar Program of Shandong Province under Grant ts20190936; and in part by the Shandong University of Science and
Technology (SDUST) Intelligent Science and Security Governance Innovation Team.

ABSTRACT Multimodal Aspect-Based Sentiment Analysis (MABSA), as an emerging task in the field of sentiment analysis, has recently received widespread attention. Its aim is to combine relevant multimodal data to determine the sentiment polarity of a given aspect in text. Researchers have surveyed both aspect-based sentiment analysis and multimodal sentiment analysis, but, to the best of our knowledge, there is no survey on MABSA. Therefore, to help researchers better understand MABSA, we surveyed the research work on MABSA in recent years. Firstly, the relevant concepts of MABSA are introduced. Secondly, the existing research methods for the two main subtasks of MABSA (that is, multimodal aspect sentiment classification and aspect sentiment pairs extraction) are summarized, and the advantages and disadvantages of each type of method are analyzed. Thirdly, the commonly used evaluation corpus and indicators for MABSA are summarized, and the evaluation results of existing research methods on the corpus are compared. Finally, the possible research trends for MABSA are envisioned.

INDEX TERMS Multimodal aspect-based sentiment analysis, multimodal aspect sentiment classification,
aspect sentiment pairs extraction.

The associate editor coordinating the review of this manuscript and approving it for publication was Gangyi Jiang.

© 2024 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

I. INTRODUCTION
Sentiment analysis, also known as sentiment orientation analysis, opinion mining, etc., refers to the process of using natural language processing, machine learning, and deep learning related methods to process and analyze various modalities of data with sentiment orientations, such as text, image, and speech, in order to identify their sentiment tendencies. It has been one of the hot topics in the fields of natural language processing and image and video mining in recent years [1]. Sentiment analysis is widely used in fields such as e-commerce, online public opinion analysis, and intelligent customer service, and plays an important role in many cases. For individuals, it can help us better understand the underlying motivations behind human behavior and attitudes. For enterprises, it can assist them in understanding user satisfaction and needs for products or services, thereby guiding decision-making and improving user experience. For the government, it can play an important role in understanding the public's response and emotions towards policies, in order to improve governance and public services.
Traditional sentiment analysis is mainly based on one of the modalities such as text [2], [3], [4], [5], image [6], [7], [8], and speech [9], [10]. However, when expressing sentiments, humans may synthetically adopt multiple forms such as texts, facial expressions, voices, and body languages to express sentiments. Therefore, there are certain limitations to depending on a single modality for sentiment analysis. Using multimodal data synthesis to determine sentiments is one of the better solutions to this limitation, which has attracted more and more researchers to strive for it. Specifically, various methods such as modality representation [11], [12], modality alignment [13], [14], and modality fusion [15], [16], [17] are

explored to effectively improve the above limitations in single modality sentiment analysis.
Aspect-based sentiment analysis is a subtask in sentiment analysis, which is a fine-grained task aimed at analyzing the sentiment polarity of different aspects (i.e., entities) in the text. In recent years, a large amount of research has been invested in aspect-based sentiment analysis [18], [19], [20], [21], [22], [23], and good results have been achieved. MABSA is based on traditional aspect-based sentiment analysis, integrating information from multiple modalities for sentiment analysis. Its goal is to combine relevant multimodal data to determine the sentiment polarity of a given aspect in the text. MABSA mainly includes three subtasks, namely multimodal aspect term extraction, multimodal aspect sentiment classification, and aspect sentiment pairs extraction.
Some scholars have surveyed the existing work on aspect-based sentiment analysis [24], [25], [26], [27], [28] and multimodal sentiment analysis [29], [30], [31], [32]. However, to our knowledge, the existing work on MABSA has not yet been sorted and summarized. Therefore, this paper focuses on summarizing the existing research methods for MABSA, and analyzing the advantages and disadvantages of the existing methods. In addition, the commonly used evaluation corpus and indicators for MABSA tasks, as well as the evaluation results of existing methods on the corpus, are also summarized. Finally, the possible research trends for MABSA are envisioned.

II. OVERVIEW OF RELATED CONCEPTS
A. SENTIMENT ANALYSIS
Sentiment Analysis (SA), also known as opinion mining, orientation analysis, etc., refers to the usage of computer technology to analyze data such as text, speech, or image to infer the sentiments or emotional states expressed therein [33]. It is an important research task in natural language processing, aiming to enable computers to understand and obtain human sentiment.
Sentiment analysis includes sentiment classification, emotion analysis, opinion extraction, comment mining, etc. Among them, sentiment classification is the most widely studied issue. According to different granularity, sentiment classification can be divided into document level, sentence level, and aspect level [34]. Document level sentiment classification aims to predict the overall sentiment polarity of an entire document or a longer segment of text. Sentence level sentiment classification predicts sentiment polarity for each sentence. Compared to document level or sentence level sentiment classification, aspect level sentiment classification focuses on predicting the sentiment polarity of specific target aspects. It analyzes sentiments related to specific aspects, goals, or themes in the text. See Section II-B for a more detailed introduction.

B. ASPECT-BASED SENTIMENT ANALYSIS
Aspect-Based Sentiment Analysis (ABSA) is a fine-grained sentiment analysis task, which aims at extracting opinions about specific aspects from a large amount of unstructured text. It is usually classified into three sentiment polarities: positive, negative, and neutral.
For example, the text 'The food is great but the service and the environment are dreadful' contains three aspects, namely 'food', 'service' and 'environment', as well as two opinion words 'great' and 'dreadful', corresponding to positive, negative, and negative sentiment polarities. Through fine-grained aspect-based sentiment analysis, we can accurately determine the sentiment polarity of each aspect in specific contexts, thereby obtaining deeper and more detailed sentiment understanding. This is crucial for us to comprehensively evaluate and understand the subtle differences in sentiment expression, as well as people's attitudes and sentiment tendencies towards specific aspects.

C. MULTIMODAL SENTIMENT ANALYSIS
Multimodal Sentiment Analysis (MSA) is a task that utilizes multiple modal data for sentiment analysis, such as text, image, etc. Compared with single-modality sentiment analysis, multimodal sentiment analysis can obtain more comprehensive and accurate sentiment information from different perception channels. For example, the sentence "Today's weather is really nice!" expresses positive sentiment when analyzed solely from the text, but when combined with an image of a rainy day, the overall sentiment is negative sentiment with ironic connotations. For this situation, it is difficult to determine the sentiment solely based on the text modality [35]. Certainly, in some cases, the inclusion of the image modality in multimodal sentiment analysis can introduce a certain amount of noise. For instance, when the text contains words like 'sad' or 'unhappy' while the associated image shows a smiling face expressing positive sentiment, the overall sentiment may actually be negative. In such cases, the image modality adds a certain level of noise that can impact the accurate determination of the overall sentiment.
Although multimodal data contains richer information, how to effectively integrate multimodal information is a key issue in current multimodal sentiment analysis tasks. Modal fusion methods are generally divided into three types, which are feature layer fusion, decision layer fusion, and hybrid fusion [36], and their basic ideas are shown in Figure 1. Feature layer fusion refers to the direct concatenation or weighted connection of feature vectors of different modalities into a new vector, which is then input into the classifier for sentiment analysis, as shown in Figure 1 (a). Decision layer fusion refers to the independent classification of features from different modalities, followed by weighting, voting mechanisms, and other processing to generate the final decision, as shown in Figure 1 (b). Hybrid fusion is the integration of feature layer fusion and decision layer fusion methods. As shown in Figure 1 (c), a classification result is firstly obtained using feature layer fusion, and then it is fused with the result of another classifier using decision layer fusion. These three fusion methods each have their own strengths, and it is necessary to choose the appropriate fusion method based on the actual task requirements.
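To make the three fusion strategies concrete, the following is a minimal PyTorch sketch of feature-layer and decision-layer fusion, assuming pre-extracted text and image feature vectors and illustrative dimensions; hybrid fusion simply chains the two ideas. This is a generic sketch of the schemes in Figure 1, not the design of any particular model surveyed later.

```python
import torch
import torch.nn as nn

class FeatureLevelFusion(nn.Module):
    """Figure 1(a): concatenate modality features, then classify once."""
    def __init__(self, text_dim, img_dim, num_classes):
        super().__init__()
        self.classifier = nn.Linear(text_dim + img_dim, num_classes)

    def forward(self, text_feat, img_feat):
        fused = torch.cat([text_feat, img_feat], dim=-1)  # early (feature-layer) fusion
        return self.classifier(fused)

class DecisionLevelFusion(nn.Module):
    """Figure 1(b): classify each modality separately, then combine the decisions."""
    def __init__(self, text_dim, img_dim, num_classes, text_weight=0.5):
        super().__init__()
        self.text_clf = nn.Linear(text_dim, num_classes)
        self.img_clf = nn.Linear(img_dim, num_classes)
        self.text_weight = text_weight

    def forward(self, text_feat, img_feat):
        p_text = torch.softmax(self.text_clf(text_feat), dim=-1)
        p_img = torch.softmax(self.img_clf(img_feat), dim=-1)
        # weighted vote over the two unimodal decisions (late / decision-layer fusion)
        return self.text_weight * p_text + (1 - self.text_weight) * p_img

# usage with hypothetical 768-d text features and 2048-d image features
text_feat, img_feat = torch.randn(4, 768), torch.randn(4, 2048)
early = FeatureLevelFusion(768, 2048, num_classes=3)(text_feat, img_feat)
late = DecisionLevelFusion(768, 2048, num_classes=3)(text_feat, img_feat)
```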


FIGURE 1. Schematic diagram of three multimodal fusion methods.

FIGURE 2. Example of MABSA task.

D. MULTIMODAL ASPECT-BASED SENTIMENT ANALYSIS
Multimodal aspect-based sentiment analysis is a task that also utilizes multiple modal data for sentiment analysis, but it is oriented towards finer grained sentiment analysis. Its goal is to extract aspects and corresponding sentiment tendencies in the text by combining data from multiple modalities.
At present, MABSA related research mainly integrates two different modalities, which are text and image. Taking an example from the Twitter-17 dataset, as shown in Figure 2 (a), we expect to extract two aspect-sentiment pairs from the text-image pair, namely (Donald Trump, negative) and (Russia, neutral). For Donald Trump, it can be seen from the text that the sentiment expressed is negative, while the image associated with it expresses positive sentiment, which may bring some noise. In Figure 2 (b), we want to extract the (Harry Potter, positive) pair. It is difficult to determine the sentiment polarity of Harry Potter solely through textual information, but the image associated with it expresses positive sentiment, which provides important clues. As an emerging sentiment analysis subtask, how to effectively achieve cross-modal alignment between image and text, and solve the inconsistency between image and text, still faces serious challenges.

III. EXISTING METHODS FOR MABSA
MABSA includes three main subtasks: Multimodal Aspect Terms Extraction (MATE), Multimodal Aspect Sentiment Classification (MASC), and Aspect Sentiment Pairs Extraction (ASPE), which is a combination of MATE and MASC. At present, research on MABSA mainly focuses on MASC and ASPE. Therefore, this section focuses on summarizing the relevant methods of MASC and ASPE, and Figure 3 gives the timelines of them.


FIGURE 3. Development timeline of related research.

A. MULTIMODAL ASPECT SENTIMENT CLASSIFICATION METHODS
Given a text S containing n words S = {w1, w2, ..., wn}, l associated images I = {V1, V2, ..., Vl}, and m aspects A = {A1, A2, ..., Am}, where wi represents the ith word, Vi represents the ith image, and Ai represents the ith aspect in the text. Taking the pair (S, I) and one of the aspects Ai as input, the goal of MASC is to learn a sentiment classifier that maps (S, I, Ai) to Y, i.e., f(S, I, Ai) → Y, where Y ∈ {positive, negative, neutral} or an integer sentiment score from 1 to 10, depending on the dataset.
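Stated as an interface, MASC maps a (sentence, images, aspect) triple to a polarity label. The minimal sketch below only pins down this input/output contract implied by the definition above; the names and the trivial placeholder prediction are illustrative, not taken from any cited model.

```python
from dataclasses import dataclass
from typing import List

POLARITIES = ("positive", "negative", "neutral")

@dataclass
class MASCInstance:
    words: List[str]        # S = {w1, ..., wn}
    image_paths: List[str]  # I = {V1, ..., Vl}
    aspect: str             # one aspect Ai drawn from A

def masc_classify(instance: MASCInstance) -> str:
    """Stand-in for a learned classifier f(S, I, Ai) -> Y; real models replace this."""
    return "neutral"  # trivial placeholder prediction

example = MASCInstance(words="The food is great but the service is dreadful".split(),
                       image_paths=["photo_of_the_dish.jpg"],  # hypothetical file
                       aspect="food")
print(masc_classify(example))  # a real MASC model should output "positive" here
```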
According to existing references, the existing MASC methods can be roughly divided into two categories, namely the attention mechanism-based MASC method and the graph convolutional network-based MASC method.

1) ATTENTION MECHANISM-BASED MASC METHOD


Attention mechanism is a model used to simulate human attention behavior, which was first introduced in computer vision in 2015 to enhance key information extraction in image or video processing [37]. Its main idea is to focus on information related to the current task and ignore irrelevant information when processing information.

FIGURE 4. MIMN model proposed by Xu et al. [38].

At present, attention mechanism has been widely applied in many fields such as computer vision and natural language processing. In MASC research, the use of attention mechanism can help models focus on aspect related information, thereby extracting the most relevant content from text or image. For text, attention mechanism can identify the words or phrases most relevant to the aspect, ignoring content unrelated to the aspect. For image, attention mechanism can assign attention weights to aspect related image regions to capture visual features related to sentiment classification. In addition, in cross-modal interaction between text and image, the use of attention mechanism can help the model select and align important information between text and image, thereby better capturing and understanding aspect related content in cross-modal data.
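As a concrete illustration of this idea, the following minimal PyTorch sketch scores each text token (or image region) against an aspect representation and pools the sequence accordingly. It is a generic aspect-guided attention sketch under these assumptions, not the attention module of any specific model reviewed below.

```python
import torch
import torch.nn as nn

class AspectGuidedAttention(nn.Module):
    """Pool a feature sequence into one vector, weighted by relevance to the aspect."""
    def __init__(self, feat_dim, aspect_dim):
        super().__init__()
        self.score = nn.Bilinear(aspect_dim, feat_dim, 1)  # relevance score per position

    def forward(self, aspect_vec, feats):
        # aspect_vec: (batch, aspect_dim); feats: (batch, seq_len, feat_dim)
        batch, seq_len, _ = feats.shape
        q = aspect_vec.unsqueeze(1).expand(-1, seq_len, -1).contiguous()
        scores = self.score(q, feats.contiguous()).squeeze(-1)   # (batch, seq_len)
        weights = torch.softmax(scores, dim=-1)                  # attention over words / regions
        return torch.bmm(weights.unsqueeze(1), feats).squeeze(1) # aspect-aware summary

# usage: the same module works for text tokens or image region features
attn = AspectGuidedAttention(feat_dim=768, aspect_dim=768)
aspect = torch.randn(2, 768)
tokens = torch.randn(2, 20, 768)   # e.g., token features from a text encoder
regions = torch.randn(2, 49, 768)  # e.g., 7x7 region features from a CNN
text_repr, img_repr = attn(aspect, tokens), attn(aspect, regions)
```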


FIGURE 5. EF-Net model proposed by Wang et al. [41].

Xu et al. [38] proposed a model called Multi-Interactive Memory Network (MIMN), whose basic framework is shown in Figure 4. The MIMN model includes two interactive memory networks to monitor the textual and visual information of a given aspect, thereby learning not only the interactive effects between cross-modal data, but also the self-effects in single modal data. Specifically, after obtaining the features of the aspect, text and images, the aspect-guided attention mechanism is adopted to obtain text and image representations with aspect information. Then, multiple interactive attention modules are used to obtain interactive representations between the two modalities, and a Gate Recurrent Unit (GRU) is used to update new textual and visual features for the next step of operation. Finally, the final output of the GRU is used as the final text and visual features, and concatenated into the SoftMax layer for prediction.
Yu et al. [39] proposed the Entity Sensitive Attention and Fusion Network (ESAFN) model for MABSA tasks. ESAFN adds <e> and </e> flags before and after the target entity, dividing the input text into three parts: left context, right context, and target entity, and using attention mechanism to generate entity sensitive text representations for each left and right context. In the text fusion layer, a low rank bilinear pooling operator is used to model the interaction between entities and contexts (both left and right), and the original context is added as the final text feature. In addition, the ESAFN model learns entity sensitive visual representations through an entity oriented visual attention mechanism, and filters visual noise through a gating mechanism. Finally, in the multimodal fusion layer, another bilinear pooling operator is used to capture the interaction between text and visual modalities, and both the text feature representation and the visual feature representation are introduced as the final multimodal representation input to the SoftMax layer for prediction.
Liu et al. [40] proposed a model called Aspect-Based Attention and Fusion Network (ABAFN). This model utilizes attention mechanism to weight contextual and visual representations based on aspects, and then cascades and fuses the weighted representations of the two modalities to perform sentiment label classification tasks.
Gu et al. [41] designed an Attention Capsule Extraction and Multi-head Fusion Network (EF-Net) for MABSA, the basic framework of which is shown in Figure 5. EF-Net extracts image features using ResNet-152 and inputs them into a single layer capsule network to obtain the position information of the target in the image. Then, it uses a multi-head attention mechanism to obtain target specific textual attention and target specific visual attention, and uses a multi-head attention mechanism to achieve the fusion of multimodal features. Finally, the text features and multimodal features are averaged and cascaded with the target specific visual attention to obtain the final multimodal representation. The final


multimodal feature representation is linearly transformed and then fed into the SoftMax layer for sentiment classification.
Yu et al. [42] proposed a model called Hierarchical Interac-
tive Multimodal Transformer (HIMT). This model proposes
a hierarchical interaction module that first utilizes the aspect
aware Transformer layer to obtain aspect aware text and
image representations, and then models deep modal inter-
actions using the multimodal fusion Transformer layer. This
module also enhances the aspect aware text or image repre-
sentations through self-attention mechanism. Chen et al. [43]
proposed a Hierarchical Cross-modal Transformer (HCT)
neural network model. This model designs a multimodal
interaction module based on a cross-modal Transformer to
model the interaction between text and image, thereby obtain-
ing text related image representations and image related text
representations. These two representations are then concate-
nated into a standard Transformer structure to model the
interaction between these two representations, resulting in
the final multimodal fusion representation and fed into the
SoftMax layer for sentiment classification.
Yu et al. [44] proposed a coarse-to-fine grained
Image-Target Matching (ITM) network. After extracting fea-
tures, ITM firstly uses a coarse-grained matching module
to capture the image-target relevance and alleviate the noise
from unrelated images. Secondly, the fine-grained matching module further identifies the fine-grained visual objects aligned with the input target in those target-related images. Finally, the image-based target representation generated by the fine-grained matching module is cascaded with the text representation, fed into the Transformer layer for multimodal fusion, and then fed into the SoftMax layer for sentiment classification.
Zhao et al. [45] proposed an image, text and aspect sentiment recognition method based on a Joint Aspects Attention Interaction Network (JAAIN). This method addresses the inconsistency and correlation of image and text data. By multi-level fusion of aspect information and image and text information, image and text unrelated to a given aspect are removed, and the sentiment representation of the modal data of a given aspect is enhanced. The sentiment representations of text data, image data, and aspect sentiments are concatenated, fused, and fully connected to achieve sentiment discrimination.

2) GRAPH CONVOLUTIONAL NETWORK-BASED MASC METHOD
Graph Convolutional Network (GCN) is a deep learning model specifically designed for processing graph data. The core idea of GCN is to perform convolution operations on the graph structure, utilizing local neighborhood information between nodes and their neighboring nodes for feature extraction. By performing convolution operations on connected nodes, GCN can encode local information in the graph, and through multi-layer GCN operations, each node can learn global information.

FIGURE 6. MIGNN model proposed by Li et al. [46].

Li and Li [46] proposed a Modal Interaction Graph Neural Network (MIGNN), whose basic framework is shown in Figure 6. This network connects semantic units of different modalities using the aspect to form a multimodal interaction graph, and then utilizes the message passing mechanism in the Graph Attention Network (GAT) to fuse information from different data sources. The nodes in the multimodal interaction graph are fine-grained semantic units of each modal data, such as text words and visual blocks. Among them, the edges between text words are grammatical dependencies, the edges between visual blocks are spatial positional relationships, and the aspect is fully connected to the multimodal semantic units. Finally, the information from text and image is aggregated into the representation of the aspect node through the edges between nodes, and the aspect node representation from the final layer of the GAT output is input into the SoftMax layer for classification.
Wang et al. [47] proposed a Multiview Interaction Learning Network (MVIN) model, whose basic framework is shown in Figure 7. The MVIN model extracts features from both contextual and syntactic views of text, in order to fully utilize the global features of the text during multimodal interaction. It then models the relationship between text, image, and aspect to achieve multimodal interaction, while simultaneously integrating interactive representations of different modalities, dynamically obtaining the contribution of visual information to each word in the text, and fully extracting the correlation between modalities and the aspect. Finally, the fused features are input into the fully connected layer and SoftMax layer for sentiment classification.


FIGURE 7. MVIN model proposed by Wang et al. [47].
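For reference, one graph-convolution step over such a multimodal interaction graph can be sketched as follows. The node features, the adjacency matrix (dependency edges between words, spatial edges between visual blocks, the aspect node connected to all units), and all dimensions are assumed inputs, and this generic layer is not the exact propagation rule of MIGNN, MVIN, or the models discussed next.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One GCN step: each node aggregates its neighbors' features, H' = ReLU(norm(A) H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats, adj):
        # node_feats: (num_nodes, in_dim); adj: (num_nodes, num_nodes) with self-loops
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)  # node degrees
        h = self.linear(adj @ node_feats / deg)           # mean-aggregate neighbors, then project
        return torch.relu(h)

# nodes = text words + visual blocks + one aspect node; edges encode dependencies,
# spatial relations, and aspect-to-all connections (graph construction assumed given)
num_words, num_blocks = 12, 4
n = num_words + num_blocks + 1
feats = torch.randn(n, 768)
adj = torch.eye(n)                   # self-loops
adj[-1, :] = adj[:, -1] = 1.0        # aspect node fully connected
layer = GCNLayer(768, 256)
aspect_repr = layer(feats, adj)[-1]  # stacking layers spreads information further
```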

Zhao and Yang [48] proposed a Fusion with GCN and SE-ResNeXt Network (FGSN) model. This model constructs a graph convolutional network on the dependency tree of a text, utilizes syntactic information and word dependencies to obtain contextual and aspect representations, and utilizes positional attention and channel attention mechanisms to obtain image features. Then, image features and text features are fused for sentiment polarity classification. Wang et al. [49] proposed an Aspect-level Multimodal Co-attention Graph Convolutional (AMCGC) sentiment analysis model, which utilizes a self-attention mechanism with orthogonal constraints to generate semantic maps of each modality. Then, through graph convolution and a bidirectional gated local cross-modal interaction mechanism, fine-grained cross-modal correlation and mutual alignment are gradually achieved.

3) COMPARISONS OF THE EXISTING MASC METHODS
Everything has two sides. The attention mechanism-based MASC method and the GCN-based MASC method have their own advantages and disadvantages in MASC, as follows:
Firstly, the attention mechanism-based MASC method can selectively focus on important information of different modalities and capture the interactions between different modalities, thereby better understanding and modeling the problems of multimodal sentiment analysis. In addition, the attention weights of the model can be used to explain the aspect words or modalities that the model focuses on in classification tasks, which helps improve the interpretability and visualization ability of the model. However, the attention mechanism requires sufficient training data to learn an effective attention weight distribution. If the dataset is small or unbalanced, the model may not be able to accurately learn the appropriate attention distribution, thereby affecting model performance.
Secondly, the GCN-based MASC method can effectively utilize graph structure information to model nodes and edges in multimodal data, making it suitable for processing data with complex relationships and structures. In MABSA, there are rich correlations and dependencies between modalities, and GCN can better handle this complexity. However, GCN involves computing and storing the entire graph data, which may incur significant computational and storage overhead. GCN also has a certain dependence on the quality of the input graph structure. If the connection relationships of the graph data are inaccurate or incomplete, they may affect the performance of the model.

B. ASPECT SENTIMENT PAIRS EXTRACTION METHODS
Given a text containing n words S = {w1, w2, ..., wn} and a related image V, the goal of aspect sentiment pairs extraction is to extract all aspects contained in the text and their corresponding sentiment categories, namely {(a1^s, a1^e, s1), ..., (ai^s, ai^e, si), ..., (am^s, am^e, sm)}, where ai^s, ai^e, and si represent the starting position, the ending position, and the corresponding sentiment category of the ith aspect, respectively, while m represents the number of aspects in the text.
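To make this output structure concrete, the sketch below decodes a unified BIO-with-polarity tag sequence (in the spirit of the B-POS style scheme described later in this section) into the (start, end, sentiment) triples defined above; the sentence and tags are invented examples, not drawn from any dataset.

```python
from typing import List, Tuple

def decode_aspe(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Turn unified tags (e.g., B-POS, I-POS, O) into (start, end, sentiment) triples."""
    triples, start, sentiment = [], None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel flushes the last open span
        if tag.startswith("B-"):
            if start is not None:
                triples.append((start, i - 1, sentiment))
            start, sentiment = i, tag[2:]
        elif not (tag.startswith("I-") and start is not None):
            if start is not None:
                triples.append((start, i - 1, sentiment))
            start, sentiment = None, None
    return triples

# "The food is great but the service is dreadful"
tags = ["O", "B-POS", "O", "O", "O", "O", "B-NEG", "O", "O"]
print(decode_aspe(tags))  # [(1, 1, 'POS'), (6, 6, 'NEG')]
```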


TABLE 1. Example of joint-based ASPE method and unified-based ASPE method.

According to existing references, there are four main research methods for ASPE, namely the pipeline-based ASPE method, the joint-based ASPE method, the unified-based ASPE method, and the text generation-based ASPE method [28].

1) PIPELINE-BASED ASPE METHOD
The pipeline-based ASPE method treats MATE and MASC as two independent subtasks, and implements ASPE in a pipeline manner, that is, executing the MATE task first and then the MASC task. This pipeline approach is simple to implement, but inefficient and prone to error propagation.
Ju et al. [50] jointly performed MATE and MASC for the first time and proposed a Joint Multimodal Learning (JML) method to assist in cross-modal relationship detection. JML constructs an auxiliary text-image relation detection module to control the reasonable utilization of visual information, then uses a layered framework to bridge the multimodal connection between MATE and MASC, and visually guides each sub-module separately. Finally, the sentiment polarities of all aspects are obtained through joint extraction.

2) JOINT-BASED ASPE METHOD
The joint-based ASPE method treats the MATE and MASC subtasks as two sequence labeling problems, and trains these two subtasks through a multi-task learning framework to utilize and interact with their relationship information. The final result is obtained by combining the predicted results of these two subtasks [51].
Zhou et al. [52] proposed a unified framework for MATE and MASC (UMAEC), and the overall framework of the model is shown in Figure 8. UMAEC first establishes a shared feature module to model potential semantic associations between tasks, and then adopts sequence annotation to simultaneously output multiple aspects and their corresponding sentiment categories contained in the text. Finally, a simple algorithm is used to implement ASPE.

FIGURE 8. UMAEC model proposed by Zhou et al. [52].

3) UNIFIED-BASED ASPE METHOD
The unified-based ASPE method treats the MATE and MASC subtasks as sequence labeling problems based on a unified labeling scheme, ignoring the boundaries of these two subtasks and treating them as a single sequence labeling task, using a unified labeling scheme such as B-POS [51]. Table 1 provides an example of the joint-based ASPE method and the unified-based ASPE method.
Yang et al. [53] proposed a multi-task learning framework named Cross-Modal Multitask Transformer (CMMT), which includes two auxiliary tasks to learn intra-modal representations of aspects or sentiment perception, and introduces a text-guided cross-modal interaction module to dynamically control the contribution of visual information to each word representation in inter-modal interactions. The obtained multimodal representation is then fed to a standard CRF layer to predict the label sequence. Yu et al. [54] proposed a Dual-encoder Transformer with Cross-modal Alignment (DTCA), which introduces two auxiliary tasks to enhance cross attention performance, proposes minimizing the Wasserstein distance between the two modalities to align text and image, and feeds the obtained multimodal features to a standard CRF layer to predict the label sequence.

4) TEXT GENERATION-BASED ASPE METHOD
The generative pre-training model Bidirectional and Auto-Regressive Transformers (BART) uses a standard Transformer based sequence-to-sequence structure, which combines a bidirectional Transformer encoder and a unidirectional autoregressive Transformer decoder to pre-train on input text containing noise for denoising reconstruction. It is a typical denoising autoencoder [55]. In recent research, some scholars have successfully applied BART to MABSA tasks, transforming the ASPE task into a text generation task, and achieved good results.
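In generation-based ASPE, the aspect-sentiment pairs are linearized into a target string that a BART-style decoder learns to produce, and the pairs are parsed back from the generated text. The serialization below is an invented, simplified format used only to illustrate the idea; the actual target templates differ across the models discussed next.

```python
from typing import List, Tuple

def linearize(pairs: List[Tuple[str, str]], sep: str = " [SSEP] ") -> str:
    """Serialize (aspect, sentiment) pairs into one decoder target string (illustrative format)."""
    return sep.join(f"{aspect} | {sentiment}" for aspect, sentiment in pairs)

def delinearize(target: str, sep: str = " [SSEP] ") -> List[Tuple[str, str]]:
    """Parse the generated string back into (aspect, sentiment) pairs."""
    pairs = []
    for chunk in target.split(sep):
        aspect, _, sentiment = chunk.partition(" | ")
        pairs.append((aspect.strip(), sentiment.strip()))
    return pairs

target = linearize([("Donald Trump", "negative"), ("Russia", "neutral")])
# 'Donald Trump | negative [SSEP] Russia | neutral'
assert delinearize(target) == [("Donald Trump", "negative"), ("Russia", "neutral")]
```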


FIGURE 9. Example of VLP-MABSA model proposed by Ling et al. [56] in downstream ASPE task.

Ling et al. [56] proposed a task-specific Vision-Language Pre-training framework for MABSA (VLP-MABSA), which was a unified multimodal encoder-decoder architecture based on BART for all pre-training and downstream tasks. In addition to the general Masked Language Modeling (MLM) and Masked Region Modeling (MRM) tasks, three task-specific pre-training tasks were further introduced, including Textual Aspect-Opinion Extraction, Visual Aspect-Opinion Generation, and Multimodal Sentiment Prediction, to identify fine-grained aspects, opinions, and their cross-modal alignments. An example of this model for the ASPE task is shown in Figure 9.
Zhou et al. [57] proposed an Aspect-oriented Method (AoM) to detect semantic and sentiment information related to aspects. This method designs an aspect-aware attention module between the BART based encoder-decoder architecture to simultaneously select text tags and image blocks related to aspect semantics, and explicitly introduces sentiment embedding into AoM. Then, a graph convolutional network is used to model visual-text and text-text interactions. Yang et al. [58] first built diverse and comprehensive multimodal few-shot datasets according to the data distribution, and then proposed a novel Generative Multimodal Prompt (GMP) model for MABSA, which includes the Multimodal Encoder module and the N-Stream Decoders module. Furthermore, a subtask was introduced to predict the number of aspects in each instance to construct the multimodal prompt.

5) COMPARISONS OF THE EXISTING ASPE METHODS
The four ASPE methods mentioned above have their own advantages and disadvantages, which are as follows:
Firstly, the pipeline-based ASPE method is simple and straightforward, easy to implement and understand. However, this method uses two completely independent models to implement ASPE step by step, ignoring the potential semantic associations between the two tasks. In addition, the MATE model extracts multiple aspects of the text at once, while the MASC model can only predict the sentiment polarity of one aspect at a time. The throughput of the former is greater than that of the latter, and MASC must be performed after MATE is completed, resulting in low ASPE efficiency.
Secondly, the joint-based ASPE method can fully utilize the correlation and dependency relationship between the two subtasks, improve the performance and generalization ability of the model, and also share common features for representation and learning. However, the training process of this method may be more complex, requiring the design of appropriate joint loss functions and training strategies.
Thirdly, the unified-based ASPE method can better capture the relationships and interactions between the two subtasks and improve overall performance. However, this method may have conflicts and interferences between tasks, leading to a decrease in model performance. It is necessary to carefully design the model structure and training strategies to balance the trade-offs between the two tasks.
Finally, the text generation-based ASPE method can flexibly generate complex text structures and handle more flexible and diverse inputs and outputs, performing well in new fields and with fewer samples. However, the training and reasoning of generative models may be more complex and time-consuming.

IV. MABSA EVALUATION CORPUS
Currently, the available datasets for MABSA include Multi-ZOL, Twitter-15, Twitter-17, and MASAD, with


the Twitter-15 and Twitter-17 datasets being the most commonly used, followed by the Multi-ZOL dataset.

A. MULTI-ZOL DATASET
Xu et al. [38] crawled pages 1-50 of popular mobile phone reviews on the mobile channel of the ZOL.com website. For each phone, only the comments from the top 20 pages were crawled. The crawled data covers 114 mobile phone brands and 1318 types of phones. The crawled data contains single modal comments, which are necessary to be filtered out, and as a result, 5288 multimodal comment data are retained. In this dataset, each multimodal comment contains a paragraph of text, an image set, and 1-6 aspects. These six aspects are price-performance ratio, performance configuration, battery life, appearance and feeling, photographing effect, and screen. Pairing each aspect with the multimodal comments resulted in a sample of 28469 aspect-comment pairs. For each aspect, the comment has an integer sentiment score from 1 to 10, which is used as the sentiment label.
The Multi-ZOL dataset is divided into train, development, and test sets in an 8:1:1 ratio. The number of comment samples with sentiment labels of 7 and 9 in the dataset is 0, so the task is set to eight classes when performing sentiment label classification. The statistical information of the Multi-ZOL dataset is shown in Table 2.

TABLE 2. Statistical information of the Multi-ZOL dataset.

B. TWITTER-15 AND TWITTER-17 DATASETS
Yu et al. [39] selected two publicly available multimodal named entity recognition datasets to construct the Twitter-15 and Twitter-17 datasets, which were collected by Lu et al. [59] and Zhang et al. [60], respectively. These two datasets include multimodal user posts posted on Twitter from 2014 to 2015 and from 2016 to 2017, retaining only posts of the person, location, organization, and miscellaneous entity types, each containing textual content and a related image. Due to the fact that these two multimodal datasets only contain manually annotated entities, the authors invited three domain experts to annotate the sentiment of each entity based on the text content and associated image. Afterwards, each dataset was randomly divided into three parts in a 3:1:1 ratio: train set, development set, and test set. The statistical information of the Twitter-15 and Twitter-17 datasets is shown in Table 3.

TABLE 3. Statistical information for the Twitter-15 and Twitter-17 datasets.

C. MASAD DATASET
Zhou et al. [61] selected 38532 samples from a partial VSO visual dataset (approximately 120000 samples) that can clearly express sentiments and categorized them into seven domains: food, goods, buildings, animal, human, plant, and scenery, with a total of 57 predefined aspects. They then crawled the text description of each image and cleaned the data for each aspect to ensure the high quality of each sample. The MASAD dataset is divided into a train set and a test set, with both positive and negative sentiment polarities. The statistical information of the MASAD dataset is shown in Table 4.

TABLE 4. Statistical information of the MASAD dataset.

V. MABSA EVALUATION
A. COMMON EVALUATION INDICATORS FOR MABSA
At present, the commonly used evaluation indicators for MABSA include accuracy, precision, recall, and F1 score. The corresponding formulas are as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)
Precision = TP / (TP + FP)    (2)
Recall = TP / (TP + FN)    (3)
F1 = (2 × Precision × Recall) / (Precision + Recall)    (4)

Among them, TP represents the number of correctly predicted positive samples, FP represents the number of incorrectly predicted negative samples, FN represents the number of incorrectly predicted positive samples, and TN represents the number of correctly predicted negative samples. Accuracy refers to the percentage of correctly predicted results in the total sample; Precision refers to the probability of actually being a positive sample among all predicted positive samples; Recall refers to the probability of being predicted as a positive sample among actual positive samples; the F1 score takes into account both precision and recall, achieving a balance between the two.


B. EVALUATION RESULTS
This section mainly summarizes the evaluation results of existing research methods for MABSA on the Twitter-15, Twitter-17, and Multi-ZOL datasets.
Firstly, for the MASC task, the summarized results are shown in Table 5 and Table 6, respectively.

TABLE 5. Evaluation results of the MASC task on the Twitter-15 and Twitter-17 datasets.

TABLE 6. Evaluation results of the MASC task on the Multi-ZOL dataset.

From Table 5, it can be seen that the F1 scores of the coarse-to-fine grained ITM network proposed by Yu et al. [44] on the Twitter-15 and Twitter-17 datasets are higher than those of other models. This indicates that in MASC, it is important to first capture image-target relevance, filter out noise caused by irrelevant images, and then align the target with the visual area.
From the data in Table 6, it can be seen that the JAAIN model proposed by Zhao et al. [45] performs the best on the Multi-ZOL dataset, indicating that multi-level fusion of aspect information and image and text information, removal of text and image unrelated to a given aspect, and enhancement of the sentiment representation of the text modal data for a given aspect can help improve the effectiveness of sentiment classification.
Secondly, for the ASPE task, existing research methods are evaluated only on the Twitter-15 and Twitter-17 datasets, with the evaluation results shown in Table 7.

TABLE 7. Evaluation results of the ASPE task on the Twitter-15 and Twitter-17 datasets.

From Table 7, it can be seen that the DTCA model with cross-modal alignment proposed by Yu et al. [54] generally performs better, indicating that enhancing the performance of cross attention and achieving text and image alignment is beneficial for ASPE.

VI. POSSIBLE RESEARCH TRENDS FOR MABSA
MABSA is an emerging task in the field of sentiment analysis, and there is currently relatively little research related to it. It can be imagined that there will be more and more research on MABSA in the future, which may include but is not limited to the following aspects:

A. FUSING MORE MODALITIES
Currently, MABSA mainly focuses on the fusion of text and image. How to effectively fuse text and image, solve the mismatch between text and image, and achieve cross-modal alignment is still a research focus. Besides, an intuitive trend of MABSA is to introduce more modalities such as audio and video to be integrated with the text and image modalities to obtain richer sentiment information.

B. ENRICHING THE DATASET
Next, we can expect and strive to establish larger and more diverse MABSA datasets, which can contain more text, speech, and image data, and cover more practical application


scenarios, thereby making the MABSA model more accurate and applicable.

C. ACHIEVING MODELS WITH MORE ROBUSTNESS AND INTERPRETABILITY
Models for MABSA are usually complex and require a large amount of data and computational resources for training and inference. One possible future development trend is to improve the robustness and interpretability of models, making them more stable, reliable, and easier to understand. This helps to improve the effectiveness of the model in practical applications and increases user trust and acceptance.

D. MODELING LONG-TERM DEPENDENCY
Sentiment analysis often requires modeling the long-term dependency information of input text, speech, or image. In subsequent research, one challenging trend is that more advanced neural network structures or attention mechanisms can be studied and applied to better capture and utilize long-term dependencies in multimodal data, thereby improving the performance of sentiment analysis.

E. TRACKING FINE-GRAINED SENTIMENT CHANGES
Sentiment is dynamic and changes over time and context. One interesting trend is to track and analyze fine-grained sentiment changes in multimodal data, such as sentiment transitions and changes in intensity. This helps to better understand users' sentiment states at different time points and contexts, thereby providing more personalized sentiment analysis results.

In summary, MABSA will continue to be developed and improved in the future. With the application of technologies such as fusing more modalities, enriching the dataset, achieving models with more robustness and interpretability, modeling long-term dependency, and tracking fine-grained sentiment changes, we can expect the widespread application and higher accuracy of MABSA in various practical scenarios.

VII. CONCLUSION
In conclusion, this article provides a comprehensive summary of the existing research on MABSA. Firstly, the relevant concepts of MABSA are introduced. Secondly, the existing research methods for the MASC and ASPE subtasks are summarized, and the advantages and disadvantages of each type of method are analyzed. Thirdly, the commonly used evaluation indicators and corpus for MABSA are summarized, as well as the evaluation results of existing research methods on the corpus. Finally, the possible research trends for MABSA are envisioned. This paper attempts to establish a relatively complete research view for researchers, hoping to provide some help for further advancing research in this field.
However, it is important to acknowledge the limitations of this article. The literature survey may not have encompassed all the relevant studies on MABSA. In the future, with the deepening of research, we will strive to expand the scope of the literature survey to ensure a more comprehensive coverage of this field. In addition, the discussion of future research trends for MABSA may not be detailed enough. Although we have proposed some possible directions, the future research field is very broad, and there are still many unknown challenges and opportunities waiting for further exploration.

REFERENCES
[1] Z. F. Wang, "Research on multimodal sentiment analysis based on deep learning," M.S. thesis, College Commun. Inf. Eng., Nanjing Univ. Posts Telecommun., Nanjing, China, 2022.
[2] R. M. Zhao, X. Xiong, S. G. Ju, Z. Z. Li, and C. Xie, "Implicit sentiment analysis for Chinese texts based on a hybrid neural network," J. Sichuan Univ., vol. 57, no. 2, pp. 264–270, 2020, doi: 10.3969/j.issn.0490-6756.2020.02.010.
[3] L. Yang and M. X. He, "Chinese text sentiment analysis model based on gated mechanism and convolutional neural network," J. Comput. Appl., vol. 41, no. 10, pp. 2842–2848, 2021, doi: 10.11772/j.issn.1001-9081.2020122043.
[4] Z. Y. Wei, "Research on sentiment analysis of Chinese texts based on BERT," M.S. thesis, College Electron. Eng., Xidian Univ., Xian, China, 2022.
[5] Q. M. Du, N. Li, W. F. Liu, S. D. Yang, and F. Yue, "Sentiment analysis of Chinese short text combining context and dependent syntactic information," Comput. Sci., vol. 50, no. 3, pp. 307–314, 2023, doi: 10.11896/jsjkx.211200189.
[6] K. K. Song, "Research on image sentiment analysis based on deep learning," Ph.D. dissertation, College Inf. Sci. Technol., Univ. Sci. Tech. China, Hefei, China, 2018.
[7] Y. Q. Miao, Q. Q. Lei, W. Z. Zhang, M. Zhou, and Y. M. Wen, "Research on image sentiment analysis based on multi-visual object fusion," Appl. Res. Comput., vol. 38, no. 4, pp. 1250–1255, 2021, doi: 10.19734/j.issn.1001-3695.2020.02.0087.
[8] J. Y. Yang, "Image emotion analysis combining psychological and deep learning models," Ph.D. dissertation, College Electron. Eng., Xidian Univ., Xian, China, 2022.
[9] J. N. Geng, "Emotion recognition using user speech," M.S. thesis, College Comput. Sci. Technol., Univ. Sci. Tech. China, Hefei, China, 2021.
[10] B. W. Cui, "Research on speech emotion analysis algorithm based on PAD emotion 3D model," M.S. thesis, College Comput. Sci., Shaanxi Normal Univ., Xian, China, 2022.
[11] X. Wu, M. T. Hu, and P. Ding, "Multi-modal data representation learning for ceramic coating materials," J. Shanghai Univ., vol. 28, no. 3, pp. 492–503, 2022, doi: 10.12066/j.issn.1007-2861.2383.
[12] B. Dong, "Research on the representation learning of multimodal data," Ph.D. dissertation, College Comput. Sci., Nat. Univ. Defence Tech., Changsha, China, 2023.
[13] P. Yu, "Multi-modal fine-grained image classification based on co-attention alignment mechanism," M.S. thesis, College Comput. Sci. Technol., Shandong Univ., Jinan, China, 2022.
[14] K. Y. Huang, "Research of image-text multimodal representation algorithm based on object-semantics alignment," M.S. thesis, College Comput. Sci. Technol., Huazhong Univ. Sci. and Tech., Wuhan, Hubei, China, 2022.
[15] W. F. Li, "Research on social emotion classification based on multimodal fusion," M.S. thesis, College Softw. Eng., Chongqing Univ. Posts Telecommun., Chongqing, China, 2020.
[16] J. H. Wang, Z. Liu, T. T. Liu, Y. Y. Wang, and Y. J. Cai, "Multimodal sentiment analysis based on multilevel feature fusion attention network," J. Chin. Inf. Process., vol. 36, no. 10, pp. 145–154, 2022.
[17] Y. Q. Miao, S. Yang, T. L. Liu, W. Z. Zhang, and L. Zhu, "Multimodal sentiment analysis based on cross-modal gating mechanism and improved fusion method," Appl. Res. Comput., vol. 40, no. 7, pp. 2025–2030, 2023, doi: 10.19734/j.issn.1001-3695.2022.12.0766.
[18] R. F. Li, H. Chen, F. X. Feng, Z. Y. Ma, X. J. Wang, and E. Hovy, "Dual graph convolutional networks for aspect-based sentiment analysis," in Proc. 59th ACL 11th IJCNLP, 2021, pp. 6319–6329.
[19] K. Zhang, K. Zhang, M. Zhang, H. Zhao, Q. Liu, W. Wu, and E. Chen, "Incorporating dynamic semantics into pre-trained language model for aspect-based sentiment analysis," in Proc. Findings Assoc. Comput. Linguistics, ACL, 2022, pp. 3599–3610.


[20] H. Chen, Z. Zhai, F. Feng, R. Li, and X. Wang, "Enhanced multi-channel graph convolutional network for aspect sentiment triplet extraction," in Proc. 60th Annu. Meeting Assoc. Comput. Linguistics, 2022, pp. 2974–2985.
[21] Y. F. Cheng, J. J. Wu, and F. He, "Aspect level sentiment analysis based on relation gated graph convolutional network," J. Zhejiang Univ., vol. 57, no. 3, pp. 437–445, 2023, doi: 10.3785/j.issn.1008-973X.2023.03.001.
[22] J. Yu, Q. Zhao, and R. Xia, "Cross-domain data augmentation with domain-adaptive language modeling for aspect-based sentiment analysis," in Proc. 61st Annu. Meeting Assoc. Comput. Linguistics, 2023, pp. 1456–1470.
[23] X. Bao, X. Jiang, Z. Wang, Y. Zhang, and G. Zhou, "Opinion tree parsing for aspect-based sentiment analysis," in Proc. Findings Assoc. Comput. Linguistics, ACL, 2023, pp. 7971–7984.
[24] Y. Zhang and T. R. Li, "Review of comment-oriented aspect-based sentiment analysis," Comput. Sci., vol. 47, no. 6, pp. 194–200, 2020, doi: 10.11896/jsjkx.200200127.
[25] L. Wang, H. W. Ma, and H. H. Lv, "Summary of aspect-based sentiment analysis," J. Comput. Appl., vol. 42, no. S2, pp. 1–9, 2022, doi: 10.11772/j.issn.1001-9081.2021122051.
[26] W. Zhang, X. Li, Y. Deng, L. Bing, and W. Lam, "A survey on aspect-based sentiment analysis: Tasks, methods, and challenges," IEEE Trans. Knowl. Data Eng., vol. 35, no. 11, pp. 11019–11038, 2022, doi: 10.1109/TKDE.2022.3230975.
[27] Y. Li, S. Wang, J. W. Zhu, M. X. Liang, X. Gao, and Z. X. Jiao, "Summarization of aspect-level sentiment analysis," Comput. Sci., vol. 50, no. S1, pp. 34–40, 2023, doi: 10.11896/jsjkx.220400077.
[28] Z. Chen, T. Y. Qian, W. L. Li, T. Zhang, S. Zhou, M. Zhong, Y. Y. Zhu, and M. C. Liu, "Low-resource aspect-based sentiment analysis: A survey," Chin. J. Comput., vol. 46, no. 7, pp. 1445–1472, 2023, doi: 10.11897/SP.J.1016.2023.01445.
[29] X. R. Meng, W. Z. Yang, and T. Wang, "Survey of sentiment analysis based on image and text fusion," J. Comput. Appl., vol. 41, no. 2, pp. 307–317, 2021, doi: 10.11772/j.issn.1001-9081.2020060923.
[30] J. M. Liu, P. X. Zhang, Y. Liu, W. D. Zhang, and J. Fang, "Summary of multi-modal sentiment analysis technology," J. Frontiers Comput. Sci. Technol., vol. 15, no. 7, pp. 1165–1182, 2021, doi: 10.3778/j.issn.1673-9418.2012075.
[31] G. W. Chen, P. Z. Zhang, T. Wang, and Q. K. Ye, "Review on multimodal sentiment recognition," J. Commun. Univ. China, vol. 29, no. 2, pp. 70–78, 2022, doi: 10.16196/j.cnki.issn.1673-4793.2022.02.009.
[32] W. X. Li, H. Y. Mei, and Y. T. Li, "Survey of multimodal sentiment analysis based on deep learning," J. Liaoning Univ. Tech., vol. 42, no. 5, pp. 293–298, 2022, doi: 10.15916/j.issn1674-3261.2022.05.003.
[33] M. Meng, "Sentiment analysis of film criticism based on BERT-TextCNN-B," M.S. thesis, College Math. Phys., Shanghai Normal Univ., Shanghai, China, 2021.
[34] L. L. Wang, C. L. Yao, X. Li, and X. Q. Yu, "Combining dependency syntactic parsing with interactive attention mechanism for implicit aspect extraction," Appl. Res. Comput., vol. 39, no. 1, pp. 37–42, 2022, doi: 10.19734/j.issn.1001-3695.2021.06.0249.
[35] H. S. Chen, J. X. An, Q. H. Tao, and J. Zhou, "Multi-modal sentiment analysis model based on BERT-VGG16," J. Chengdu Univ. Inf. Technol., vol. 37, no. 4, pp. 379–385, 2022, doi: 10.16836/j.cnki.jcuit.2022.04.003.
[36] S. Zhang, "Research on sentiment analysis technology for multimodal social data," Ph.D. dissertation, College Electron. Inf. Eng., Nanjing Univ. Inf. Sci. Tech., Nanjing, China, 2022.
[37] K. Xu, J. L. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," Comput. Sci., vol. 10, pp. 2048–2057, Jan. 2015, doi: 10.48550/arXiv.1502.03044.
[38] N. Xu, W. J. Mao, and G. D. Chen, "Multi-interactive memory network for aspect based multimodal sentiment analysis," in Proc. AAAI, 2019, pp. 371–378.
[39] J. Yu, J. Jiang, and R. Xia, "Entity-sensitive attention and fusion network for entity-level multimodal sentiment classification," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 28, pp. 429–439, 2020, doi: 10.1109/TASLP.2019.2957872.
[40] L. L. Liu, Y. Yang, and J. Wang, "ABAFN: Aspect-based sentiment analysis model for multimodal," Comput. Eng. Appl., vol. 58, no. 10, pp. 193–199, 2022, doi: 10.3778/j.issn.1002-8331.2108-0056.
[41] D. Gu, J. Wang, S. Cai, C. Yang, Z. Song, H. Zhao, L. Xiao, and H. Wang, "Targeted aspect-based multimodal sentiment analysis: An attention capsule extraction and multi-head fusion network," IEEE Access, vol. 9, pp. 157329–157336, 2021, doi: 10.1109/ACCESS.2021.3126782.
[42] J. Yu, K. Chen, and R. Xia, "Hierarchical interactive multimodal transformer for aspect-based multimodal sentiment analysis," IEEE Trans. Affect. Comput., vol. 14, no. 3, pp. 1966–1978, Mar. 2022, doi: 10.1109/TAFFC.2022.3171091.
[43] K. Chen, X. G. Dong, and X. S. Zhou, "Research on multimodal fine-grained sentiment analysis method based on cross-modal transformer," Comput. Digit. Eng., vol. 50, no. 10, pp. 2270–2275, 2022, doi: 10.3969/j.issn.1672-9722.2022.10.027.
[44] J. Yu, J. Wang, R. Xia, and J. Li, "Targeted multimodal sentiment classification based on coarse-to-fine grained image-target matching," in Proc. 31st Int. Joint Conf. Artif. Intell., Jul. 2022, pp. 4482–4488.
[45] Y. C. Zhao, S. G. Wang, J. Liao, and D. H. He, "Image-text aspect emotion recognition based on joint aspect attention interaction," Beijing Univ. Aeronaut. Astronaut., vol. 2022, pp. 1–14, Jan. 2022, doi: 10.13700/j.bh.1001-5965.2022.0387.
[46] L. Li and P. Li, "Aspect-level multimodal sentiment analysis based on interaction graph neural network," Appl. Res. Comput., vol. 40, no. 12, pp. 3683–3689, 2023, doi: 10.19734/j.issn.1001-3695.2022.10.0532.
[47] X. Y. Wang, W. Q. Pang, and L. J. Zhao, "Multiview interaction learning network for multimodal aspect-level sentiment analysis," Comput. Eng. Appl., vol. 2023, pp. 1–11, Mar. 2023, doi: 10.3778/j.issn.1002-8331.2210-0288.
[48] J. Zhao and F. Yang, "Fusion with GCN and SE-ResNeXt network for aspect based multimodal sentiment analysis," in Proc. IEEE 6th Inf. Technol., Netw., Electron. Autom. Control Conf. (ITNEC), vol. 6, Feb. 2023, pp. 336–340.
[49] W. Shunjie, C. Guoyong, L. Guangrui, and T. Weibo, "Aspect-level multimodal co-attention graph convolutional sentiment analysis model," J. Image Graph., vol. 28, no. 12, pp. 3838–3854, 2023.
[50] X. Ju, D. Zhang, R. Xiao, J. Li, S. Li, M. Zhang, and G. Zhou, "Joint multi-modal aspect-sentiment analysis with auxiliary cross-modal relation detection," in Proc. Conf. Empirical Methods Natural Lang. Process., 2021, pp. 4395–4405.
[51] J. M. Dai, W. W. Kong, Z. Wang, and P. Z. Li, "End-to-end aspect-based sentiment analysis model for BERT and LSI," Comput. Eng. Appl., vol. 2023, pp. 1–13, Feb. 2023, doi: 10.3778/j.issn.1002-8331.2303-0220.
[52] R. Zhou, H. Z. Zhu, W. Y. Guo, S. L. Yu, and Y. Zhang, "A unified framework for multimodal aspect-term extraction and aspect-level sentiment classification," J. Comput. Res. Develop., vol. 60, no. 12, pp. 2877–2889, Mar. 2023, doi: 10.7544/issn1000-1239.202220441.
[53] L. Yang, J. C. Na, and J. F. Yu, "Cross-modal multitask transformer for end-to-end multimodal aspect-based sentiment analysis," Inf. Process. Manag., vol. 59, no. 5, pp. 1–15, 2022, doi: 10.1016/j.ipm.2022.103038.
[54] Z. W. Yu, J. Wang, L. C. Yu, and X. J. Zhang, "Dual-encoder transformers with cross-modal alignment for multimodal aspect-based sentiment analysis," in Proc. 2nd Conf. AACL 12th IJCNLP, 2022, pp. 414–423.
[55] W. X. Che, J. Guo, and Y. M. Cui, "Advanced pretrained language model," in Natural Language Processing: A Pre-trained Model Approach, 1st ed. Beijing, China: Pub. House Electr. Indu., 2021, ch. 8, sec. 4, pp. 257–260.
[56] Y. Ling, J. Yu, and R. Xia, "Vision-language pre-training for multimodal aspect-based sentiment analysis," in Proc. 60th Annu. Meeting Assoc. Comput. Linguistics, 2022, pp. 2149–2159.
[57] R. Zhou, W. Guo, X. Liu, S. Yu, Y. Zhang, and X. Yuan, "AoM: Detecting aspect-oriented information for multimodal aspect-based sentiment analysis," in Proc. Findings Assoc. Comput. Linguistics, ACL, 2023, pp. 8184–8196.
[58] X. Yang, S. Feng, D. Wang, Q. Sun, W. Wu, Y. Zhang, P. Hong, and S. Poria, "Few-shot joint multimodal aspect-sentiment analysis based on generative multimodal prompt," in Proc. Findings Assoc. Comput. Linguistics, ACL, 2023, pp. 11575–11589.
[59] D. Lu, L. Neves, V. Carvalho, N. Zhang, and H. Ji, "Visual attention model for name tagging in multimodal social media," in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics, 2018, pp. 1990–1999.
[60] Q. Zhang, J. L. Fu, X. Y. Liu, and X. J. Huang, "Adaptive co-attention network for named entity recognition in tweets," in Proc. AAAI, 2018, pp. 5674–5681.


[61] J. Zhou, J. B. Zhao, J. X. Huang, Q. V. Hu, and L. He, "MASAD: A large-scale dataset for multimodal aspect-based sentiment analysis," Neurocomputing, vol. 455, pp. 47–58, Jan. 2021, doi: 10.1016/j.neucom.2021.05.040.

HUA ZHAO received the B.S. degree in computer science and technology from Liaocheng University, China, in 2001, and the M.S. and Ph.D. degrees in computer science and technology from the Harbin Institute of Technology, China, in 2003 and 2008, respectively. She is currently an Associate Professor with the College of Computer Science and Engineering, Shandong University of Science and Technology. Her current research interests include sentiment analysis, natural language processing, and deep learning.

MANYU YANG received the B.S. degree in computer science and technology, in 2021. She is currently pursuing the master's degree in computer science and technology with the Shandong University of Science and Technology, Qingdao, China. Her research interests include natural language processing and multimodal aspect-based sentiment analysis.

XUEYANG BAI received the B.S. degree in computer science and technology, in 2021. She is currently pursuing the master's degree in electronic information with the Shandong University of Science and Technology, Qingdao, China. Her research interests include natural language processing and named entity recognition.

HAN LIU received the B.S. degree in information management and information systems, in 2019. She is currently pursuing the master's degree in library and information with the Shandong University of Science and Technology, Qingdao, China. Her research interests include natural language processing and method entity and relation extraction from scientific and technological literature.
