A Survey On Multimodal Aspect-Based Sentiment Analysis
ABSTRACT Multimodal Aspect-Based Sentiment Analysis (MABSA), as an emerging task in the field of sentiment analysis, has recently received widespread attention. Its aim is to combine relevant multimodal data to determine the sentiment polarity of a given aspect in text. Researchers have surveyed both aspect-based sentiment analysis and multimodal sentiment analysis, but, to the best of our knowledge, there is no survey on MABSA. Therefore, to help interested researchers gain a better understanding of MABSA, we survey the research work on MABSA in recent years. Firstly, the relevant concepts of MABSA are introduced. Secondly, the existing methods for the two main subtasks of MABSA research (that is, multimodal aspect sentiment classification and aspect sentiment pair extraction) are summarized, and the advantages and disadvantages of each type of method are analyzed. Thirdly, the commonly used evaluation corpora and indicators for MABSA are summarized, and the evaluation results of existing methods on these corpora are compared. Finally, possible research trends for MABSA are envisioned.
INDEX TERMS Multimodal aspect-based sentiment analysis, multimodal aspect sentiment classification,
aspect sentiment pairs extraction.
explored to effectively improve the above limitations in single modality sentiment analysis.
Aspect-based sentiment analysis is a subtask of sentiment analysis; it is a fine-grained task aimed at analyzing the sentiment polarity of different aspects (i.e., entities) in the text. In recent years, a large amount of research has been invested in aspect-based sentiment analysis [18], [19], [20], [21], [22], [23], and good results have been achieved. MABSA builds on traditional aspect-based sentiment analysis by integrating information from multiple modalities for sentiment analysis. Its goal is to combine relevant multimodal data to determine the sentiment polarity of a given aspect in the text. MABSA mainly includes three subtasks, namely multimodal aspect term extraction, multimodal aspect sentiment classification, and aspect sentiment pair extraction.
Some scholars have surveyed the existing work on aspect-based sentiment analysis [24], [25], [26], [27], [28] and multimodal sentiment analysis [29], [30], [31], [32]. However, to our knowledge, the existing work on MABSA has not yet been sorted and summarized. Therefore, this paper focuses on summarizing the existing research methods for MABSA and analyzing the advantages and disadvantages of the existing methods. In addition, the commonly used evaluation corpora and indicators for MABSA tasks, as well as the evaluation results of existing methods on these corpora, are also summarized. Finally, the possible research trends for MABSA are envisioned.

II. OVERVIEW OF RELATED CONCEPTS
A. SENTIMENT ANALYSIS
Sentiment Analysis (SA), also known as opinion mining, orientation analysis, etc., refers to the use of computer technology to analyze data such as text, speech, or images to infer the sentiments or emotional states expressed therein [33]. It is an important research task in natural language processing, aiming to enable computers to understand and capture human sentiment.
Sentiment analysis includes sentiment classification, emotion analysis, opinion extraction, comment mining, etc. Among them, sentiment classification is the most widely studied problem. According to granularity, sentiment classification can be divided into document level, sentence level, and aspect level [34]. Document level sentiment classification aims to predict the overall sentiment polarity of an entire document or a longer segment of text. Sentence level sentiment classification predicts the sentiment polarity of each sentence. Compared to document level or sentence level sentiment classification, aspect level sentiment classification focuses on predicting the sentiment polarity of specific target aspects. It analyzes sentiments related to specific aspects, goals, or themes in the text. See Section B for a more detailed introduction.

B. ASPECT-BASED SENTIMENT ANALYSIS
Aspect-Based Sentiment Analysis (ABSA) is a fine-grained sentiment analysis task, which aims at extracting opinions about specific aspects from a large amount of unstructured text. It usually distinguishes three sentiment polarities: positive, negative, and neutral.
For example, the text 'The food is great but the service and the environment are dreadful' contains three aspects, namely 'food', 'service' and 'environment', as well as two opinion words, 'great' and 'dreadful', so the three aspects correspond to positive, negative, and negative sentiment polarities, respectively. Through fine-grained aspect-based sentiment analysis, we can accurately determine the sentiment polarity of each aspect in its specific context, thereby obtaining a deeper and more detailed sentiment understanding. This is crucial for comprehensively evaluating and understanding the subtle differences in sentiment expression, as well as people's attitudes and sentiment tendencies towards specific aspects.

C. MULTIMODAL SENTIMENT ANALYSIS
Multimodal Sentiment Analysis (MSA) is a task that utilizes data from multiple modalities, such as text and image, for sentiment analysis. Compared with single-modality sentiment analysis, multimodal sentiment analysis can obtain more comprehensive and accurate sentiment information from different perception channels. For example, the sentence ''Today's weather is really nice!'' expresses positive sentiment when analyzed solely from the text, but when combined with an image of a rainy day, the overall sentiment is negative, with ironic connotations. In such a situation, it is difficult to determine the sentiment based on the text modality alone [35]. Admittedly, in some cases, the inclusion of the image modality in multimodal sentiment analysis can introduce a certain amount of noise. For instance, when the text contains words like 'sad' or 'unhappy' while the associated image shows a smiling face expressing positive sentiment, the overall sentiment may actually be negative. In such cases, the image modality adds a certain level of noise that can impact the accurate determination of the overall sentiment.
Although multimodal data contains richer information, how to effectively integrate multimodal information is a key issue in current multimodal sentiment analysis tasks. Modal fusion methods are generally divided into three types, namely feature layer fusion, decision layer fusion, and hybrid fusion [36]; their basic ideas are shown in Figure 1. Feature layer fusion refers to the direct concatenation or weighted connection of the feature vectors of different modalities into a new vector, which is then input into the classifier for sentiment analysis, as shown in Figure 1 (a). Decision layer fusion refers to the independent classification of the features of each modality, followed by weighting, voting, or other processing to generate the final decision, as shown in Figure 1 (b). Hybrid fusion is the integration of feature layer fusion and decision layer fusion: as shown in Figure 1 (c), a classification result is first obtained using feature layer fusion, and then it is fused with the result of another classifier using decision layer fusion. These three fusion methods each have their own strengths, and it is necessary to choose the appropriate fusion method based on the actual task requirements.
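To make the three fusion strategies concrete, the following minimal PyTorch-style sketch shows one way feature layer fusion, decision layer fusion, and hybrid fusion could be wired together. It is an illustrative sketch written for this survey: the module names, feature dimensions, and the simple weighting and averaging rules are assumptions, not a reference implementation of any model discussed here.

import torch
import torch.nn as nn

class FeatureLevelFusion(nn.Module):
    # Figure 1 (a): concatenate modality features, then classify the joint vector.
    def __init__(self, text_dim, image_dim, num_classes):
        super().__init__()
        self.classifier = nn.Linear(text_dim + image_dim, num_classes)

    def forward(self, text_feat, image_feat):
        fused = torch.cat([text_feat, image_feat], dim=-1)
        return self.classifier(fused)

class DecisionLevelFusion(nn.Module):
    # Figure 1 (b): classify each modality independently, then weight the decisions.
    def __init__(self, text_dim, image_dim, num_classes, text_weight=0.5):
        super().__init__()
        self.text_clf = nn.Linear(text_dim, num_classes)
        self.image_clf = nn.Linear(image_dim, num_classes)
        self.text_weight = text_weight

    def forward(self, text_feat, image_feat):
        text_logits = self.text_clf(text_feat)
        image_logits = self.image_clf(image_feat)
        return self.text_weight * text_logits + (1 - self.text_weight) * image_logits

class HybridFusion(nn.Module):
    # Figure 1 (c): fuse a feature-level result with another classifier at decision level.
    def __init__(self, text_dim, image_dim, num_classes):
        super().__init__()
        self.feature_branch = FeatureLevelFusion(text_dim, image_dim, num_classes)
        self.text_clf = nn.Linear(text_dim, num_classes)

    def forward(self, text_feat, image_feat):
        feature_logits = self.feature_branch(text_feat, image_feat)
        text_logits = self.text_clf(text_feat)
        return 0.5 * (feature_logits + text_logits)  # simple decision-level averaging

The choice among the three wrappers above mirrors the trade-off described in the text: feature layer fusion exposes cross-modal interactions to a single classifier, while decision layer fusion keeps the modalities independent until the last step.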
D. MULTIMODAL ASPECT-BASED SENTIMENT ANALYSIS
Multimodal aspect-based sentiment analysis is a task that also utilizes data from multiple modalities for sentiment analysis, but it is oriented towards finer-grained sentiment analysis. Its goal is to extract the aspects and the corresponding sentiment tendencies in the text by combining data from multiple modalities.
At present, MABSA-related research mainly integrates two modalities, namely text and image. Taking an example from the Twitter-17 dataset, as shown in Figure 2 (a), we expect to extract two aspect-sentiment pairs from the text-image pair, namely (Donald Trump, negative) and (Russia, neutral). For Donald Trump, it can be seen from the text that the sentiment expressed is negative, while the image associated with it expresses positive sentiment, which may bring some noise. In Figure 2 (b), we want to extract the (Harry Potter, positive) pair. It is difficult to determine the sentiment polarity of Harry Potter solely through textual information, but the image associated with it expresses positive sentiment, which provides important clues. For this emerging sentiment analysis subtask, how to effectively achieve cross-modal alignment between image and text, and how to resolve inconsistencies between image and text, remain serious challenges.

III. EXISTING METHODS FOR MABSA
MABSA includes three main subtasks: Multimodal Aspect Terms Extraction (MATE), Multimodal Aspect Sentiment Classification (MASC), and Aspect Sentiment Pairs Extraction (ASPE), which is a combination of MATE and MASC. At present, research on MABSA mainly focuses on MASC and ASPE. Therefore, this section focuses on summarizing the relevant methods for MASC and ASPE, and Figure 3 gives their timelines.
In MASC, the attention mechanism can be used to extract visual features related to sentiment classification. In addition, in the cross-modal interaction between text and image, the use of the attention mechanism can help the model select and align important information between text and image, thereby better capturing and understanding aspect-related content in cross-modal data.
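As an illustration of how such attention can align a given aspect with textual and visual features, the sketch below computes aspect-guided attention over token features and over image region features and returns an aspect-aware summary of each modality. This is a generic, single-head dot-product formulation written for this survey; the tensor shapes and variable names are assumptions and do not reproduce any specific model described below.

import torch
import torch.nn.functional as F

def aspect_guided_attention(aspect_vec, modality_feats):
    # Weight modality units (text tokens or image regions) by their relevance
    # to the aspect and return an aspect-aware summary vector.
    #   aspect_vec:     (batch, dim)      pooled aspect representation
    #   modality_feats: (batch, n, dim)   n token or region features
    scores = torch.einsum('bd,bnd->bn', aspect_vec, modality_feats)
    weights = F.softmax(scores, dim=-1)            # attention distribution over units
    return torch.einsum('bn,bnd->bd', weights, modality_feats)

# toy usage: one sample with 6 text tokens and 49 image regions, 128-dim features
aspect = torch.randn(1, 128)
text_tokens = torch.randn(1, 6, 128)
image_regions = torch.randn(1, 49, 128)
text_repr = aspect_guided_attention(aspect, text_tokens)      # (1, 128)
image_repr = aspect_guided_attention(aspect, image_regions)   # (1, 128)
fused = torch.cat([text_repr, image_repr], dim=-1)            # fed to a classifier

The attention weights computed in this way are also what the comparison later in this section refers to when discussing interpretability: they indicate which tokens or regions the model relied on for a given aspect.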
Xu et al. [38] proposed a model called Multi-Interactive Memory Network (MIMN), whose basic framework is shown in Figure 4. The MIMN model includes two interactive memory networks that attend to the textual and visual information of a given aspect, thereby learning not only the interactive effects between cross-modal data, but also the self-effects within single-modal data. Specifically, after obtaining the features of the aspect, the text, and the images, an aspect-guided attention mechanism is adopted to obtain text and image representations enriched with aspect information. Then, multiple interactive attention modules are used to obtain interactive representations between the two modalities, and a Gated Recurrent Unit (GRU) is used to update the textual and visual features for the next step. Finally, the final outputs of the GRUs are used as the final text and visual features, which are concatenated and fed into the SoftMax layer for prediction.
Yu et al. [39] proposed the Entity Sensitive Attention and Fusion Network (ESAFN) model for MABSA tasks. ESAFN adds <e> and </e> flags before and after the target entity, dividing the input text into three parts: left context, right context, and target entity, and uses an attention mechanism to generate entity-sensitive text representations for the left and right contexts. In the text fusion layer, a low-rank bilinear pooling operator is used to model the interaction between the entity and the contexts (both left and right), and the original context is added as the final text feature. In addition, the ESAFN model learns entity-sensitive visual representations through an entity-oriented visual attention mechanism, and filters visual noise through a gating mechanism. Finally, in the multimodal fusion layer, another bilinear pooling operator is used to capture the interaction between the text and visual modalities, and both the text feature representation and the visual feature representation are introduced into the final multimodal representation, which is input to the SoftMax layer for prediction.
Liu et al. [40] proposed a model called Aspect-Based Attention and Fusion Network (ABAFN). This model utilizes attention mechanisms to weight the contextual and visual representations based on the aspect, and then cascades and fuses the weighted representations of the two modalities to perform the sentiment label classification task.
Gu et al. [41] designed an Attention Capsule Extraction and Multi-head Fusion Network (EF-Net) for MABSA, the basic framework of which is shown in Figure 5. EF-Net extracts image features using ResNet-152 and inputs them into a single-layer capsule network to obtain the position information of the target in the image. Then, it uses a multi-head attention mechanism to obtain target-specific textual attention and target-specific visual attention, and uses another multi-head attention mechanism to achieve the fusion of multimodal features. Finally, the text features and multimodal features are averaged and cascaded with the target-specific visual attention to obtain the final multimodal representation. The final multimodal representation is input into the fully connected layer and SoftMax layer for sentiment classification.
Zhao and Yang [48] proposed a Fusion with GCN and SE-ResNeXt Network (FGSN) model. This model constructs a graph convolutional network over the dependency tree of the text and utilizes syntactic information and word dependencies to obtain contextual and aspect representations, while positional attention and channel attention mechanisms are used to obtain image features. Then, the image features and text features are fused for sentiment polarity classification. Wang et al. [49] proposed an Aspect-level Multimodal Co-attention Graph Convolutional (AMCGC) sentiment analysis model, which utilizes a self-attention mechanism with orthogonal constraints to generate a semantic graph for each modality. Then, through graph convolution and a bidirectional gated local cross-modal interaction mechanism, fine-grained cross-modal correlation and mutual alignment are gradually achieved.
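To illustrate the basic operation that these GCN-based models build on, the sketch below applies a single layer of graph convolution over a dependency-based adjacency matrix to refine token representations. The normalization scheme, dimensions, and names are generic assumptions for illustration and are not taken from FGSN or AMCGC.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphConvLayer(nn.Module):
    # One GCN layer: each token aggregates its neighbors in the dependency graph.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, token_feats, adj):
        # token_feats: (batch, n_tokens, in_dim); adj: (batch, n_tokens, n_tokens)
        degree = adj.sum(dim=-1, keepdim=True).clamp(min=1)   # row-normalize
        neighborhood = torch.bmm(adj / degree, token_feats)
        return F.relu(self.linear(neighborhood))

# toy usage: 5 tokens whose dependency arcs (plus self-loops) define the graph
tokens = torch.randn(1, 5, 64)
adj = torch.eye(5).unsqueeze(0)   # self-loops
adj[0, 0, 1] = 1.0                # a dependency arc between tokens 0 and 1
adj[0, 1, 0] = 1.0                # (kept symmetric)
gcn = GraphConvLayer(64, 64)
refined = gcn(tokens, adj)        # (1, 5, 64) syntax-aware token features

In an aspect-level model, the refined vectors of the aspect tokens (and, in multimodal variants, of image-region nodes added to the graph) are what feed the sentiment classifier.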
3) COMPARISONS OF THE EXISTING MASC METHODS
Everything has two sides. The attention mechanism-based MASC methods and the GCN-based MASC methods have their own advantages and disadvantages, as follows:
Firstly, the attention mechanism-based MASC methods can selectively focus on important information from different modalities and capture the interactions between modalities, thereby better understanding and modeling the problem of multimodal sentiment analysis. In addition, the attention weights of the model can be used to explain which aspect words or modalities the model focuses on in classification tasks, which helps improve the interpretability and visualization ability of the model. However, the attention mechanism requires sufficient training data to learn an effective attention weight distribution. If the dataset is small or unbalanced, the model may not be able to learn an appropriate attention distribution, thereby affecting model performance.
Secondly, the GCN-based MASC methods can effectively utilize graph structure information to model nodes and edges in multimodal data, making them suitable for processing data with complex relationships and structures. In MABSA, there are rich correlations and dependencies between modalities, and GCN can better handle this complexity. However, GCN involves computing and storing the entire graph, which may incur significant computational and storage overhead. GCN also depends to a certain extent on the quality of the input graph structure: if the connection relationships in the graph are inaccurate or incomplete, the performance of the model may be affected.

B. ASPECT SENTIMENT PAIRS EXTRACTION METHODS
Given a text containing n words, S = {w_1, w_2, ..., w_n}, and a related image V, the goal of aspect sentiment pair extraction is to extract all aspects contained in the text together with their corresponding sentiment categories, namely {(a_1^s, a_1^e, s_1), ..., (a_i^s, a_i^e, s_i), ..., (a_m^s, a_m^e, s_m)}, where a_i^s, a_i^e, and s_i represent the starting position, the ending position, and the corresponding sentiment category of the i-th aspect, respectively, and m represents the number of aspects in the text.
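As a concrete illustration of this formulation, the snippet below writes out the expected ASPE output for the restaurant sentence used in Section II-B. The (start, end, sentiment) layout follows the definition above; the tokenization and variable names are only illustrative, and the related image is omitted because the snippet only shows the target output format.

# Sentence from Section II-B, tokenized into n = 12 words (0-indexed positions)
sentence = "The food is great but the service and the environment are dreadful".split()

# Expected ASPE output: one (start, end, sentiment) tuple per aspect
aspect_sentiment_pairs = [
    (1, 1, "positive"),   # "food"        -> opinion word "great"
    (6, 6, "negative"),   # "service"     -> opinion word "dreadful"
    (9, 9, "negative"),   # "environment" -> opinion word "dreadful"
]

for start, end, sentiment in aspect_sentiment_pairs:
    aspect = " ".join(sentence[start:end + 1])
    print(f"({aspect}, {sentiment})")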
FIGURE 9. Example of VLP-MABSA model proposed by Ling et al. [56] in downstream ASPE task.
The VLP-MABSA model proposed by Ling et al. [56] was a unified multimodal encoder-decoder architecture based on BART for all pre-training and downstream tasks. In addition to the general Masked Language Modeling (MLM) and Masked Region Modeling (MRM) tasks, three task-specific pre-training tasks were further introduced, namely Textual Aspect-Opinion Extraction, Visual Aspect-Opinion Generation, and Multimodal Sentiment Prediction, to identify fine-grained aspects, opinions, and their cross-modal alignments. An example of this model on the ASPE task is shown in Figure 9.
Zhou et al. [57] proposed an Aspect-oriented Method (AoM) to detect semantic and sentiment information related to aspects. This method adds an aspect-aware attention module to a BART-based encoder-decoder architecture to simultaneously select the text tokens and image blocks related to aspect semantics, and explicitly introduces sentiment embeddings into AoM. Then, a graph convolutional network is used to model visual-text and text-text interactions.
Yang et al. [58] first built diverse and comprehensive multimodal few-shot datasets according to the data distribution, and then proposed a novel Generative Multimodal Prompt (GMP) model for MABSA, which includes a Multimodal Encoder module and an N-Stream Decoders module. Furthermore, a subtask was introduced to predict the number of aspects in each instance in order to construct the multimodal prompt.

5) COMPARISONS OF THE EXISTING ASPE METHODS
The four ASPE methods mentioned above have their own advantages and disadvantages, which are as follows:
Firstly, the pipeline-based ASPE method is simple and straightforward, easy to implement and understand. However, this method uses two completely independent models to implement ASPE step by step, ignoring the potential semantic associations between the two tasks. In addition, the MATE model extracts multiple aspects of the text at once, while the MASC model can only predict the sentiment polarity of one aspect at a time. The throughput of the former is greater than that of the latter, and MASC can only be performed after MATE is completed, resulting in low ASPE efficiency.
Secondly, the joint-based ASPE method can fully utilize the correlation and dependency between the two subtasks, improve the performance and generalization ability of the model, and also share common features for representation and learning. However, the training process of this method may be more complex, requiring the design of appropriate joint loss functions and training strategies.
Thirdly, the unified-based ASPE method can better capture the relationships and interactions between the two subtasks and improve overall performance. However, this method may suffer from conflicts and interference between tasks, leading to a decrease in model performance. It is necessary to carefully design the model structure and training strategies to balance the trade-offs between the two tasks.
Finally, the text generation-based ASPE method can flexibly generate complex text structures and handle more flexible and diverse inputs and outputs, performing well in new domains and with fewer samples. However, the training and inference of generative models may be more complex and time-consuming.
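As a concrete illustration of the pipeline setting discussed above, the sketch below chains two independent components, one for MATE and one for MASC, to produce aspect-sentiment pairs step by step. The component names and interfaces are hypothetical stand-ins invented for this example; they are only meant to show why the second stage must wait for the first and runs once per extracted aspect.

from typing import List, Tuple

Span = Tuple[int, int]           # (start, end) token positions of an aspect term

class DummyMATE:
    # Stand-in aspect-term extractor; a real MATE model would predict the spans.
    def extract_aspects(self, words: List[str], image) -> List[Span]:
        return [(1, 1), (6, 6)]  # hard-coded spans for illustration

class DummyMASC:
    # Stand-in sentiment classifier; a real MASC model scores one aspect per call.
    def classify(self, words: List[str], image, span: Span) -> str:
        return "positive" if span == (1, 1) else "negative"

def pipeline_aspe(words, image, mate, masc):
    # Pipeline-based ASPE: MATE extracts all aspects in one pass,
    # then MASC is invoked once per extracted aspect.
    pairs = []
    for span in mate.extract_aspects(words, image):
        pairs.append((span, masc.classify(words, image, span)))
    return pairs

words = "The food is great but the service and the environment are dreadful".split()
print(pipeline_aspe(words, image=None, mate=DummyMATE(), masc=DummyMASC()))
# -> [((1, 1), 'positive'), ((6, 6), 'negative')]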
IV. MABSA EVALUATION CORPUS
Currently, the available datasets for MABSA include Multi-ZOL, Twitter-15, Twitter-17, and MASAD, with the Twitter-15 and Twitter-17 datasets being the most commonly used, followed by the Multi-ZOL dataset.
A. MULTI-ZOL DATASET
Xu et al. [38] crawled pages 1-50 of popular mobile phone reviews on the mobile channel of the ZOL.com website. For each phone, only the comments from the top 20 pages were crawled. The crawled data covers 114 mobile phone brands and 1318 types of phones. The crawled data contains single-modal comments, which need to be filtered out; as a result, 5288 multimodal comments are retained. In this dataset, each multimodal comment contains a paragraph of text, an image set, and 1-6 aspects. The six aspects are price-performance ratio, performance configuration, battery life, appearance and feeling, photographing effect, and screen. Pairing each aspect with the multimodal comments results in a sample of 28469 aspect-comment pairs. For each aspect, the comment has an integer sentiment score from 1 to 10, which is used as the sentiment label.
The Multi-ZOL dataset is divided into train, development, and test sets in an 8:1:1 ratio. The number of comment samples with sentiment labels of 7 and 9 in the dataset is 0, so the sentiment label classification task is set up as an eight-class classification. The statistical information of the Multi-ZOL dataset is shown in Table 2.

TABLE 2. Statistical information of the Multi-ZOL dataset.

B. TWITTER-15 AND TWITTER-17 DATASETS
Yu et al. [39] selected two publicly available multimodal named entity recognition datasets to construct the Twitter-15 and Twitter-17 datasets, which were collected by Lu et al. [59] and Zhang et al. [60], respectively. These two datasets include multimodal user posts posted on Twitter from 2014 to 2015 and from 2016 to 2017, retaining only posts of four entity types (person, location, organization, and miscellaneous), each containing textual content and a related image. Because these two multimodal datasets only contain manually annotated entities, the authors invited three domain experts to annotate the sentiment of each entity based on the text content and the associated image. Afterwards, each dataset was randomly divided into three parts in a 3:1:1 ratio: train set, development set, and test set. The statistical information of the Twitter-15 and Twitter-17 datasets is shown in Table 3.

TABLE 3. Statistical information for the Twitter-15 and Twitter-17 datasets.

C. MASAD DATASET
Zhou et al. [61] selected 38532 samples that can clearly express sentiments from part of the VSO visual dataset (approximately 120000 samples) and categorized them into seven domains: food, goods, buildings, animal, human, plant, and scenery, with a total of 57 predefined aspects. The text descriptions of the images were then crawled, and the data for each aspect was cleaned to ensure the high quality of each sample. The MASAD dataset is divided into a train set and a test set, with both positive and negative sentiment polarities. The statistical information of the MASAD dataset is shown in Table 4.

V. MABSA EVALUATION
A. COMMON EVALUATION INDICATORS FOR MABSA
At present, the commonly used evaluation indicators for MABSA include accuracy, precision, recall, and F1 score. The corresponding formulas are as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)
Precision = TP / (TP + FP)    (2)
Recall = TP / (TP + FN)    (3)
F1 = (2 × Precision × Recall) / (Precision + Recall)    (4)

Among them, TP represents the number of positive samples correctly predicted as positive, FP represents the number of negative samples incorrectly predicted as positive, FN represents the number of positive samples incorrectly predicted as negative, and TN represents the number of negative samples correctly predicted as negative. Accuracy refers to the percentage of correctly predicted results among all samples; Precision refers to the proportion of actual positive samples among all samples predicted as positive; Recall refers to the proportion of samples predicted as positive among the actual positive samples; the F1 score takes both precision and recall into account, seeking a balance between the two.
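The following snippet shows how formulas (1)-(4) can be computed from raw prediction counts for a single sentiment class; the small label lists are made-up toy data used only to exercise the formulas.

def classification_metrics(y_true, y_pred, positive="positive"):
    # Compute accuracy, precision, recall, and F1 following formulas (1)-(4).
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# toy example: five aspect-level predictions
gold = ["positive", "negative", "positive", "neutral", "positive"]
pred = ["positive", "positive", "positive", "neutral", "negative"]
print(classification_metrics(gold, pred))   # approximately (0.6, 0.667, 0.667, 0.667)

In MABSA evaluations, accuracy and F1 are commonly reported for MASC, while precision, recall, and F1 over the extracted aspect-sentiment pairs are commonly reported for ASPE.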
B. EVALUATION RESULTS
This section mainly summarizes the evaluation results of existing research methods for MABSA on the Twitter-15, Twitter-17, and Multi-ZOL datasets.
TABLE 5. Evaluation results of the MASC task on the Twitter-15 and Twitter-17 datasets.
TABLE 7. Evaluation results of the ASPE task on the Twitter-15 and Twitter-17 datasets.
scenarios, thereby making the MABSA model more accurate and applicable.

C. ACHIEVING MODELS WITH MORE ROBUSTNESS AND INTERPRETABILITY
Models for MABSA are usually complex and require a large amount of data and computational resources for training and inference. One possible future development trend is to improve the robustness and interpretability of models, making them more stable, reliable, and easier to understand. This helps to improve the effectiveness of the models in practical applications and increases user trust and acceptance.

D. MODELING LONG-TERM DEPENDENCY
Sentiment analysis often requires modeling the long-term dependency information of the input text, speech, or images. In subsequent research, one challenging direction is to study and apply more advanced neural network structures or attention mechanisms to better capture and utilize long-term dependencies in multimodal data, thereby improving the performance of sentiment analysis.

E. TRACKING FINE-GRAINED SENTIMENT CHANGES
Sentiment is dynamic and changes over time and context. One interesting trend is to track and analyze fine-grained sentiment changes in multimodal data, such as sentiment transitions and changes in intensity. This helps to better understand users' sentiment states at different time points and in different contexts, thereby providing more personalized sentiment analysis results.

In summary, MABSA will continue to be developed and improved in the future. With the application of technologies such as fusing more modalities, enriching the datasets, achieving models with more robustness and interpretability, modeling long-term dependency, and tracking fine-grained sentiment changes, we can expect the widespread application and higher accuracy of MABSA in various practical scenarios.

VII. CONCLUSION
In conclusion, this article provides a comprehensive summary of the existing research on MABSA. Firstly, the relevant concepts of MABSA are introduced. Secondly, the existing research methods for the MASC and ASPE subtasks are summarized, and the advantages and disadvantages of each type of method are analyzed. Thirdly, the commonly used evaluation indicators and corpora for MABSA are summarized, as well as the evaluation results of existing research methods on these corpora. Finally, the possible research trends for MABSA are envisioned. This paper attempts to establish a relatively complete research view for researchers, hoping to provide some help for further advancing research in this field.
However, it is important to acknowledge the limitations of this article. The literature survey may not have encompassed all the relevant studies on MABSA. In the future, with the deepening of research, we will strive to expand the scope of the literature survey to ensure a more comprehensive coverage of this field. In addition, the discussion of future research trends for MABSA may not be detailed enough. Although we have proposed some possible directions, the future research field is very broad, and there are still many unknown challenges and opportunities waiting for further exploration.

REFERENCES
[1] Z. F. Wang, ''Research on multimodal sentiment analysis based on deep learning,'' M.S. thesis, College Commun. Inf. Eng., Nanjing Univ. Posts Telecommun., Nanjing, China, 2022.
[2] R. M. Zhao, X. Xiong, S. G. Ju, Z. Z. Li, and C. Xie, ''Implicit sentiment analysis for Chinese texts based on a hybrid neural network,'' J. Sichuan Univ., vol. 57, no. 2, pp. 264–270, 2020, doi: 10.3969/j.issn.0490-6756.2020.02.010.
[3] L. Yang and M. X. He, ''Chinese text sentiment analysis model based on gated mechanism and convolutional neural network,'' J. Comput. Appl., vol. 41, no. 10, pp. 2842–2848, 2021, doi: 10.11772/j.issn.1001-9081.2020122043.
[4] Z. Y. Wei, ''Research on sentiment analysis of Chinese texts based on BERT,'' M.S. thesis, College Electron. Eng., Xidian Univ., Xian, China, 2022.
[5] Q. M. Du, N. Li, W. F. Liu, S. D. Yang, and F. Yue, ''Sentiment analysis of Chinese short text combining context and dependent syntactic information,'' Comput. Sci., vol. 50, no. 3, pp. 307–314, 2023, doi: 10.11896/jsjkx.211200189.
[6] K. K. Song, ''Research on image sentiment analysis based on deep learning,'' Ph.D. dissertation, College Inf. Sci. Technol., Univ. Sci. Tech. China, Hefei, China, 2018.
[7] Y. Q. Miao, Q. Q. Lei, W. Z. Zhang, M. Zhou, and Y. M. Wen, ''Research on image sentiment analysis based on multi-visual object fusion,'' Appl. Res. Comput., vol. 38, no. 4, pp. 1250–1255, 2021, doi: 10.19734/j.issn.1001-3695.2020.02.0087.
[8] J. Y. Yang, ''Image emotion analysis combining psychological and deep learning models,'' Ph.D. dissertation, College Electron. Eng., Xidian Univ., Xian, China, 2022.
[9] J. N. Geng, ''Emotion recognition using user speech,'' M.S. thesis, College Comput. Sci. Technol., Univ. Sci. Tech. China, Hefei, China, 2021.
[10] B. W. Cui, ''Research on speech emotion analysis algorithm based on PAD emotion 3D model,'' M.S. thesis, College Comput. Sci., Shaanxi Normal Univ., Xian, China, 2022.
[11] X. Wu, M. T. Hu, and P. Ding, ''Multi-modal data representation learning for ceramic coating materials,'' J. Shanghai Univ., vol. 28, no. 3, pp. 492–503, 2022, doi: 10.12066/j.issn.1007-2861.2383.
[12] B. Dong, ''Research on the representation learning of multimodal data,'' Ph.D. dissertation, College Comput. Sci., Nat. Univ. Defence Tech., Changsha, China, 2023.
[13] P. Yu, ''Multi-modal fine-grained image classification based on co-attention alignment mechanism,'' M.S. thesis, College Comput. Sci. Technol., Shandong Univ., Jinan, China, 2022.
[14] K. Y. Huang, ''Research of image-text multimodal representation algorithm based on object-semantics alignment,'' M.S. thesis, College Comput. Sci. Technol., Huazhong Univ. Sci. and Tech., Wuhan, Hubei, China, 2022.
[15] W. F. Li, ''Research on social emotion classification based on multi-modal fusion,'' M.S. thesis, College Softw. Eng., Chongqing Univ. Posts Telecommun., Chongqing, China, 2020.
[16] J. H. Wang, Z. Liu, T. T. Liu, Y. Y. Wang, and Y. J. Cai, ''Multimodal sentiment analysis based on multilevel feature fusion attention network,'' J. Chin. Inf. Process., vol. 36, no. 10, pp. 145–154, 2022.
[17] Y. Q. Miao, S. Yang, T. L. Liu, W. Z. Zhang, and L. Zhu, ''Multimodal sentiment analysis based on cross-modal gating mechanism and improved fusion method,'' Appl. Res. Comput., vol. 40, no. 7, pp. 2025–2030, 2023, doi: 10.19734/j.issn.1001-3695.2022.12.0766.
[18] R. F. Li, H. Chen, F. X. Feng, Z. Y. Ma, X. J. Wang, and E. Hovy, ''Dual graph convolutional networks for aspect-based sentiment analysis,'' in Proc. 59th ACL 11th IJCNLP, 2021, pp. 6319–6329.
[19] K. Zhang, K. Zhang, M. Zhang, H. Zhao, Q. Liu, W. Wu, and E. Chen, ''Incorporating dynamic semantics into pre-trained language model for aspect-based sentiment analysis,'' in Proc. Findings Assoc. Comput. Linguistics, ACL, 2022, pp. 3599–3610.
[20] H. Chen, Z. Zhai, F. Feng, R. Li, and X. Wang, ''Enhanced multi-channel graph convolutional network for aspect sentiment triplet extraction,'' in Proc. 60th Annu. Meeting Assoc. Comput. Linguistics, 2022, pp. 2974–2985.
[21] Y. F. Cheng, J. J. Wu, and F. He, ''Aspect level sentiment analysis based on relation gated graph convolutional network,'' J. Zhejiang Univ., vol. 57, no. 3, pp. 437–445, 2023, doi: 10.3785/j.issn.1008-973X.2023.03.001.
[22] J. Yu, Q. Zhao, and R. Xia, ''Cross-domain data augmentation with domain-adaptive language modeling for aspect-based sentiment analysis,'' in Proc. 61st Annu. Meeting Assoc. Comput. Linguistics, 2023, pp. 1456–1470.
[23] X. Bao, X. Jiang, Z. Wang, Y. Zhang, and G. Zhou, ''Opinion tree parsing for aspect-based sentiment analysis,'' in Proc. Findings Assoc. Comput. Linguistics, ACL, 2023, pp. 7971–7984.
[24] Y. Zhang and T. R. Li, ''Review of comment-oriented aspect-based sentiment analysis,'' Comput. Sci., vol. 47, no. 6, pp. 194–200, 2020, doi: 10.11896/jsjkx.200200127.
[25] L. Wang, H. W. Ma, and H. H. Lv, ''Summary of aspect-based sentiment analysis,'' J. Comput. Appl., vol. 42, no. S2, pp. 1–9, 2022, doi: 10.11772/j.issn.1001-9081.2021122051.
[26] W. Zhang, X. Li, Y. Deng, L. Bing, and W. Lam, ''A survey on aspect-based sentiment analysis: Tasks, methods, and challenges,'' IEEE Trans. Knowl. Data Eng., vol. 35, no. 11, pp. 11019–11038, 2022, doi: 10.1109/TKDE.2022.3230975.
[27] Y. Li, S. Wang, J. W. Zhu, M. X. Liang, X. Gao, and Z. X. Jiao, ''Summarization of aspect-level sentiment analysis,'' Comput. Sci., vol. 50, no. S1, pp. 34–40, 2023, doi: 10.11896/jsjkx.220400077.
[28] Z. Chen, T. Y. Qian, W. L. Li, T. Zhang, S. Zhou, M. Zhong, Y. Y. Zhu, and M. C. Liu, ''Low-resource aspect-based sentiment analysis: A survey,'' Chin. J. Comput., vol. 46, no. 7, pp. 1445–1472, 2023, doi: 10.11897/SP.J.1016.2023.01445.
[29] X. R. Meng, W. Z. Yang, and T. Wang, ''Survey of sentiment analysis based on image and text fusion,'' J. Comput. Appl., vol. 41, no. 2, pp. 307–317, 2021, doi: 10.11772/j.issn.1001-9081.2020060923.
[30] J. M. Liu, P. X. Zhang, Y. Liu, W. D. Zhang, and J. Fang, ''Summary of multi-modal sentiment analysis technology,'' J. Frontiers Comput. Sci. Technol., vol. 15, no. 7, pp. 1165–1182, 2021, doi: 10.3778/j.issn.1673-9418.2012075.
[31] G. W. Chen, P. Z. Zhang, T. Wang, and Q. K. Ye, ''Review on multimodal sentiment recognition,'' J. Commun. Univ. China, vol. 29, no. 2, pp. 70–78, 2022, doi: 10.16196/j.cnki.issn.1673-4793.2022.02.009.
[32] W. X. Li, H. Y. Mei, and Y. T. Li, ''Survey of multimodal sentiment analysis based on deep learning,'' J. Liaoning Univ. Tech., vol. 42, no. 5, pp. 293–298, 2022, doi: 10.15916/j.issn1674-3261.2022.05.003.
[33] M. Meng, ''Sentiment analysis of film criticism based on BERT-TextCNN-B,'' M.S. thesis, College Math. Phys., Shanghai Normal Univ., Shanghai, China, 2021.
[34] L. L. Wang, C. L. Yao, X. Li, and X. Q. Yu, ''Combining dependency syntactic parsing with interactive attention mechanism for implicit aspect extraction,'' Appl. Res. Comput., vol. 39, no. 1, pp. 37–42, 2022, doi: 10.19734/j.issn.1001-3695.2021.06.0249.
[35] H. S. Chen, J. X. An, Q. H. Tao, and J. Zhou, ''Multi-modal sentiment analysis model based on BERT-VGG16,'' J. Chengdu Univ. Inf. Technol., vol. 37, no. 4, pp. 379–385, 2022, doi: 10.16836/j.cnki.jcuit.2022.04.003.
[36] S. Zhang, ''Research on sentiment analysis technology for multimodal social data,'' Ph.D. dissertation, College Electron. Inf. Eng., Nanjing Univ. Inf. Sci. Tech., Nanjing, China, 2022.
[37] K. Xu, J. L. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, ''Show, attend and tell: Neural image caption generation with visual attention,'' Comput. Sci., vol. 10, pp. 2048–2057, Jan. 2015, doi: 10.48550/arXiv.1502.03044.
[38] N. Xu, W. J. Mao, and G. D. Chen, ''Multi-interactive memory network for aspect based multimodal sentiment analysis,'' in Proc. AAAI, 2019, pp. 371–378.
[39] J. Yu, J. Jiang, and R. Xia, ''Entity-sensitive attention and fusion network for entity-level multimodal sentiment classification,'' IEEE/ACM Trans. Audio, Speech, Language Process., vol. 28, pp. 429–439, 2020, doi: 10.1109/TASLP.2019.2957872.
[40] L. L. Liu, Y. Yang, and J. Wang, ''ABAFN: Aspect-based sentiment analysis model for multimodal,'' Comput. Eng. Appl., vol. 58, no. 10, pp. 193–199, 2022, doi: 10.3778/j.issn.1002-8331.2108-0056.
[41] D. Gu, J. Wang, S. Cai, C. Yang, Z. Song, H. Zhao, L. Xiao, and H. Wang, ''Targeted aspect-based multimodal sentiment analysis: An attention capsule extraction and multi-head fusion network,'' IEEE Access, vol. 9, pp. 157329–157336, 2021, doi: 10.1109/ACCESS.2021.3126782.
[42] J. Yu, K. Chen, and R. Xia, ''Hierarchical interactive multimodal transformer for aspect-based multimodal sentiment analysis,'' IEEE Trans. Affect. Comput., vol. 14, no. 3, pp. 1966–1978, Mar. 2022, doi: 10.1109/TAFFC.2022.3171091.
[43] K. Chen, X. G. Dong, and X. S. Zhou, ''Research on multimodal fine-grained sentiment analysis method based on cross-modal transformer,'' Comput. Digit. Eng., vol. 50, no. 10, pp. 2270–2275, 2022, doi: 10.3969/j.issn.1672-9722.2022.10.027.
[44] J. Yu, J. Wang, R. Xia, and J. Li, ''Targeted multimodal sentiment classification based on coarse-to-fine grained image-target matching,'' in Proc. 31st Int. Joint Conf. Artif. Intell., Jul. 2022, pp. 4482–4488.
[45] Y. C. Zhao, S. G. Wang, J. Liao, and D. H. He, ''Image-text aspect emotion recognition based on joint aspect attention interaction,'' Beijing Univ. Aeronaut. Astronaut., vol. 2022, pp. 1–14, Jan. 2022, doi: 10.13700/j.bh.1001-5965.2022.0387.
[46] L. Li and P. Li, ''Aspect-level multimodal sentiment analysis based on interaction graph neural network,'' Appl. Res. Comput., vol. 40, no. 12, pp. 3683–3689, 2023, doi: 10.19734/j.issn.1001-3695.2022.10.0532.
[47] X. Y. Wang, W. Q. Pang, and L. J. Zhao, ''Multiview interaction learning network for multimodal aspect-level sentiment analysis,'' Comput. Eng. Appl., vol. 2023, pp. 1–11, Mar. 2023, doi: 10.3778/j.issn.1002-8331.2210-0288.
[48] J. Zhao and F. Yang, ''Fusion with GCN and SE-ResNeXt network for aspect based multimodal sentiment analysis,'' in Proc. IEEE 6th Inf. Technol., Netw., Electron. Autom. Control Conf. (ITNEC), vol. 6, Feb. 2023, pp. 336–340.
[49] W. Shunjie, C. Guoyong, L. Guangrui, and T. Weibo, ''Aspect-level multimodal co-attention graph convolutional sentiment analysis model,'' J. Image Graph., vol. 28, no. 12, pp. 3838–3854, 2023.
[50] X. Ju, D. Zhang, R. Xiao, J. Li, S. Li, M. Zhang, and G. Zhou, ''Joint multi-modal aspect-sentiment analysis with auxiliary cross-modal relation detection,'' in Proc. Conf. Empirical Methods Natural Lang. Process., 2021, pp. 4395–4405.
[51] J. M. Dai, W. W. Kong, Z. Wang, and P. Z. Li, ''End-to-end aspect-based sentiment analysis model for BERT and LSI,'' Comput. Eng. Appl., vol. 2023, pp. 1–13, Feb. 2023, doi: 10.3778/j.issn.1002-8331.2303-0220.
[52] R. Zhou, H. Z. Zhu, W. Y. Guo, S. L. Yu, and Y. Zhang, ''A unified framework for multimodal aspect-term extraction and aspect-level sentiment classification,'' J. Comput. Res. Develop., vol. 60, no. 12, pp. 2877–2889, Mar. 2023, doi: 10.7544/issn1000-1239.202220441.
[53] L. Yang, J. C. Na, and J. F. Yu, ''Cross-modal multitask transformer for end-to-end multimodal aspect-based sentiment analysis,'' Inf. Process. Manag., vol. 59, no. 5, pp. 1–15, 2022, doi: 10.1016/j.ipm.2022.103038.
[54] Z. W. Yu, J. Wang, L. C. Yu, and X. J. Zhang, ''Dual-encoder transformers with cross-modal alignment for multimodal aspect-based sentiment analysis,'' in Proc. 2nd Conf. AACL 12th IJCNLP, 2022, pp. 414–423.
[55] W. X. Che, J. Guo, and Y. M. Cui, ''Advanced pretrained language model,'' in Natural Language Processing: A Pre-trained Model Approach, 1st ed. Beijing, China: Pub. House Electr. Indu., 2021, ch. 8, sec. 4, pp. 257–260.
[56] Y. Ling, J. Yu, and R. Xia, ''Vision-language pre-training for multimodal aspect-based sentiment analysis,'' in Proc. 60th Annu. Meeting Assoc. Comput. Linguistics, 2022, pp. 2149–2159.
[57] R. Zhou, W. Guo, X. Liu, S. Yu, Y. Zhang, and X. Yuan, ''AoM: Detecting aspect-oriented information for multimodal aspect-based sentiment analysis,'' in Proc. Findings Assoc. Comput. Linguistics, ACL, 2023, pp. 8184–8196.
[58] X. Yang, S. Feng, D. Wang, Q. Sun, W. Wu, Y. Zhang, P. Hong, and S. Poria, ''Few-shot joint multimodal aspect-sentiment analysis based on generative multimodal prompt,'' in Proc. Findings Assoc. Comput. Linguistics, ACL, 2023, pp. 11575–11589.
[59] D. Lu, L. Neves, V. Carvalho, N. Zhang, and H. Ji, ''Visual attention model for name tagging in multimodal social media,'' in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics, 2018, pp. 1990–1999.
[60] Q. Zhang, J. L. Fu, X. Y. Liu, and X. J. Huang, ''Adaptive co-attention network for named entity recognition in tweets,'' in Proc. AAAI, 2018, pp. 5674–5681.
[61] J. Zhou, J. B. Zhao, J. X. Huang, Q. V. Hu, and L. He, ''MASAD: A large-scale dataset for multimodal aspect-based sentiment analysis,'' Neurocomputing, vol. 455, pp. 47–58, Jan. 2021, doi: 10.1016/j.neucom.2021.05.040.

XUEYANG BAI received the B.S. degree in computer science and technology, in 2021. She is currently pursuing the master's degree in electronic information with the Shandong University of Science and Technology, Qingdao, China. Her research interests include natural language processing and named entity recognition.

MANYU YANG received the B.S. degree in computer science and technology, in 2021. She is currently pursuing the master's degree in computer science and technology with the Shandong University of Science and Technology, Qingdao, China. Her research interests include natural language processing and multimodal aspect-based sentiment analysis.

HAN LIU received the B.S. degree in information management and information systems, in 2019. She is currently pursuing the master's degree in library and information with the Shandong University of Science and Technology, Qingdao, China. Her research interests include natural language processing and method entity and relation extraction from scientific and technological literature.