Exploring Diverse Methods in Visual Question Answering
1st Panfeng Li
Department of Electrical and Computer Engineering
University of Michigan
Ann Arbor, USA
[email protected]

2nd Qikai Yang
Department of Computer Science
University of Illinois Urbana-Champaign
Urbana, USA
[email protected]
Abstract—This study explores innovative methods for improving Visual Question Answering (VQA) using Generative Adversarial Networks (GANs), autoencoders, and attention mechanisms. Leveraging a balanced VQA dataset, we investigate three distinct strategies. Firstly, GAN-based approaches aim to generate answer embeddings conditioned on image and question inputs, showing potential but struggling with more complex tasks. Secondly, autoencoder-based techniques focus on learning optimal embeddings for questions and images, achieving results comparable to the GAN thanks to a better ability to handle complex questions. Lastly, attention mechanisms, incorporating Multimodal Compact Bilinear Pooling (MCB), address language priors and attention modeling, albeit with a complexity-performance trade-off. This study underscores the challenges and opportunities in VQA and suggests avenues for future research, including alternative GAN formulations and attentional mechanisms.

Index Terms—Visual Question Answering; Generative Adversarial Networks; Autoencoders; Attention

… the ambitious goal of imbuing machines with human-like perceptual and cognitive abilities. By synthesizing information gleaned from visual stimuli with linguistic cues, VQA systems aspire to emulate the nuanced understanding and reasoning capabilities exhibited by human agents when confronted with multimodal inputs.

In light of the aforementioned considerations, this study embarks on a journey to unravel the intricacies of the VQA conundrum, leveraging insights from one of the most prominent and widely utilized datasets in the field [24]. By delving into the depths of this seminal dataset, we endeavor to shed light on the underlying challenges and opportunities inherent in the VQA paradigm, with the ultimate aim of advancing the frontier of AI research and fostering the development of more intelligent and perceptive machines.
The second simply utilized a single linear layer. Each of the generators outputs a vector of length 1000 encoding the likelihood of each of the 1000 most common answers.
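As a concrete illustration, the two generator heads might be sketched as follows in PyTorch; the fused image-question embedding dimension, hidden sizes, and dropout rate are our assumptions for the sketch, not the exact values used in the experiments.

```python
import torch.nn as nn

NUM_ANSWERS = 1000  # scores over the 1000 most common answers

class SimpleGenerator(nn.Module):
    """Single linear layer over the fused image-question embedding."""
    def __init__(self, embed_dim=1024):
        super().__init__()
        self.fc = nn.Linear(embed_dim, NUM_ANSWERS)

    def forward(self, fused):          # fused: (batch, embed_dim)
        return self.fc(fused)          # (batch, 1000) answer likelihoods

class FullGenerator(nn.Module):
    """Deeper multi-layer head; depth, width, and dropout are illustrative."""
    def __init__(self, embed_dim=1024, hidden=2048, p_drop=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, NUM_ANSWERS),
        )

    def forward(self, fused):
        return self.net(fused)
```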
We coupled the simple, single-layer generator with the simpler of our embedding methods (for future reference we denote this network as Gsimp) and the more complex generator with our more complex embedding method (which we denote Gfull).

We tested training the generators without the discriminator portion and, in order to produce the full GAN, we also fed the generators' outputs into a discriminator network. We attempted several different architectures for the discriminator portion, but each was ultimately several fully connected ReLU layers which output a single number into a sigmoid activation to scale it between 0 and 1. In addition to taking the output of the generator …

12) L_G ← log(s_f)
13) G ← G − α · ∂L_G/∂G (Update Generator)
14) end for

We attempted several variations of this. We experimented with pre-training both the Discriminator and the Generator. When pre-training the Generator we simply trained it as a softmax classifier with noise added to the inputs. This allows us to initialize the weights to values that produce relatively good results at the start of training the full GAN [29–33].

We found that pre-training the Discriminator to optimality could be detrimental to the actual training process of the GAN and could worsen the updates of the Generator over time [34]. [34] shows that if the probability densities are either disjoint or lie on a low-dimensional manifold, then the Discriminator can distinguish between them perfectly. This happens to be the case for the loss functions proposed by [35]. So instead of pre-training our Discriminator to optimality, we follow the suggestion in [36–42] and add noise to its inputs.
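A minimal sketch of this procedure is given below. The discriminator interface D(answer_vector, conditioning), the noise level, and the optimizers are assumptions, and the generator update uses the standard non-saturating form of step 12's L_G = log(s_f).

```python
import torch
import torch.nn as nn

def pretrain_generator(G, loader, noise_std=0.1, lr=1e-3, epochs=1):
    """Pre-train G as a softmax classifier with noise added to its inputs."""
    opt = torch.optim.Adam(G.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for fused, answer_idx in loader:  # fused embeddings and answer labels
            noisy = fused + noise_std * torch.randn_like(fused)
            loss = ce(G(noisy), answer_idx)
            opt.zero_grad()
            loss.backward()
            opt.step()

def gan_step(G, D, fused, real_answer, opt_G, opt_D, noise_std=0.1):
    """One adversarial update. Instead of pre-training D to optimality,
    noise is added to D's inputs, following the suggestion in [36-42]."""
    bce = nn.BCELoss()
    fake = G(fused)
    # Discriminator update on noisy real and fake answer vectors.
    s_r = D(real_answer + noise_std * torch.randn_like(real_answer), fused)
    s_f = D(fake.detach() + noise_std * torch.randn_like(fake), fused)
    loss_D = bce(s_r, torch.ones_like(s_r)) + bce(s_f, torch.zeros_like(s_f))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()
    # Generator update: maximize log s_f (non-saturating form of step 12).
    s_f = D(fake + noise_std * torch.randn_like(fake), fused)
    loss_G = bce(s_f, torch.ones_like(s_f))
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```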
We tested adding normalization to the outputs of the layers in our generator and discriminator modules. In general, however, we found that this yielded worse results than when using unnormalized layers. We utilized a small amount of dropout in training our generator and discriminator modules.

We attempted initializing the weights under several different distributions, noticing altered results when we did so. We first initialized all weights to follow a Gaussian distribution with large values clipped (we denote this initialization method I1) and also tested initializing the weights to follow a uniform distribution (which we denote I2); a sketch of the two schemes appears after Table I.

Method | All | Yes/No | Number | Other
Baseline Methods:
Gsimp-N0-I1 | 11.58 | 23.86 | 5.32 | 1.48
Gfull-N0-I1 | 18.65 | 40.56 | 0.25 | 7.84
Gfull-N1-I1 | 23.41 | 55.46 | 3.49 | 0.61
Gfull-N2-I1 | 14.76 | 35.77 | 1.25 | 0.27
Gfull-N2-I2 | 23.06 | 54.65 | 3.50 | 0.51
Novel Methods:
GANsimp-N0-I1 | 25.27 | 62.49 | 0.32 | 0.59
GANsimp-N0-I2 | 25.57 | 56.67 | 9.05 | 0.61
GANfull-N0-I1 | 27.51 | 51.90 | 22.41 | 0.08
GANfull-N1-I1 | 28.81 | 57.28 | 19.13 | 0.54
GANfull-N2-I1 | 34.57 | 65.38 | 27.36 | 0.71
Autoencoder | 37.65 | 64.01 | 24.37 | 15.77
Attention | 44.32 | 66.64 | 32.15 | 26.72
Attention + MCB | 47.58 | 67.60 | 31.47 | 36.98

TABLE I: Results on the VQA 1.9 validation dataset (accuracy in %). Legend: Gsimp - simple, single-layer generator trained as classifier; Gfull - full, multi-layer generator trained as classifier; GANsimp - simple, single-layer generator trained with discriminator; GANfull - full, multi-layer generator trained with discriminator; N0 - no noise input to the generator; N1 - noise concatenated to the generator conditioning input; N2 - noise added to the generator conditioning input; I1 - weights initialized via a clipped Gaussian distribution; I2 - weights initialized via a uniform distribution.
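The two initialization schemes might be sketched as follows; the clipping threshold and the uniform bounds are illustrative assumptions, as the source does not specify them.

```python
import torch.nn as nn

def init_I1(module, std=0.02, clip=2.0):
    """I1: Gaussian weights with large values clipped (truncated normal)."""
    if isinstance(module, nn.Linear):
        nn.init.trunc_normal_(module.weight, mean=0.0, std=std,
                              a=-clip * std, b=clip * std)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

def init_I2(module, bound=0.05):
    """I2: weights drawn from a uniform distribution."""
    if isinstance(module, nn.Linear):
        nn.init.uniform_(module.weight, -bound, bound)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Apply to a model, e.g. generator.apply(init_I1) or generator.apply(init_I2).
```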
III. AUTOENCODER BASED MECHANISM

We modified the initial GAN technique to give us an autoencoder-based technique, wherein the concatenated features are passed through an autoencoder to generate low-dimensional embeddings. Most existing approaches (such as MCB [43]) utilize a fixed method to embed the question and image features together. By employing an autoencoder, we hoped to learn how best to embed the question and image features into a low-dimensional space. We use this encoding, after passing it through several fully connected layers, to generate the answer.
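A minimal sketch of this pipeline, under assumed feature and bottleneck dimensions: the concatenated question and image features are encoded to a low-dimensional code, which both reconstructs the input and, after several fully connected layers, predicts the answer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerAutoencoder(nn.Module):
    """Encodes concatenated question/image features into a low-dimensional
    code used both for reconstruction and, via fully connected layers,
    for answer prediction. All dimensions are illustrative assumptions."""
    def __init__(self, q_dim=1024, v_dim=2048, code_dim=256, num_answers=1000):
        super().__init__()
        in_dim = q_dim + v_dim
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, code_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 1024), nn.ReLU(),
            nn.Linear(1024, in_dim),
        )
        self.classifier = nn.Sequential(
            nn.Linear(code_dim, 512), nn.ReLU(),
            nn.Linear(512, num_answers),
        )

    def forward(self, q_feat, v_feat):
        x = torch.cat([q_feat, v_feat], dim=1)   # concatenated features
        code = self.encoder(x)                    # low-dimensional embedding
        return self.classifier(code), self.decoder(code), x

def ae_loss(logits, labels, recon, x, beta=0.5):
    """Answer cross-entropy plus a reconstruction term (weighting assumed)."""
    return F.cross_entropy(logits, labels) + beta * F.mse_loss(recon, x)
```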
IV. ATTENTION BASED MECHANISM

To answer a question according to an image, it is critical to model both “where to look” and “what words to listen to”, namely visual attention and question attention [44].

However, [24] shows that language priors make the VQA dataset [1] unbalanced: simply answering “tennis” and “2” will achieve 41% and 39% accuracy for the two types of questions “What sport is” and “How many”, respectively. These language priors bring to light the question of whether machines truly understand the questions and images or whether they simply tend to give an answer that has a higher frequency in the dataset.

Inspired by the strength of Multimodal Compact Bilinear Pooling (MCB) at efficiently and expressively combining multimodal features [43], we use the MCB operation to replace the simple addition operation used in the co-attention mechanism [44] when combining the features learned from the images and questions, which may help to learn more information from the visual part [45–47]. A sketch of the MCB operation follows.
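The sketch below follows the Count Sketch plus FFT construction of [43]: each modality is projected with a signed hash into a d-dimensional sketch, and the two sketches are convolved via an elementwise product in the frequency domain, approximating their outer (bilinear) product without materializing it. The sketch dimension and feature sizes are assumptions.

```python
import torch

def count_sketch(x, h, s, d):
    """Count Sketch: scatter-add the signed features of x (batch, n)
    into d bins using hash h (n,) and signs s (n,)."""
    out = x.new_zeros(x.size(0), d)
    out.index_add_(1, h, x * s)
    return out

def mcb(v, q, d=16000, seed=0):
    """Multimodal Compact Bilinear pooling of image features v (batch, nv)
    and question features q (batch, nq): sketch each modality, then
    convolve the sketches in the frequency domain. The fixed seed keeps
    the random hashes consistent across calls."""
    g = torch.Generator().manual_seed(seed)
    hv = torch.randint(0, d, (v.size(1),), generator=g)
    sv = torch.randint(0, 2, (v.size(1),), generator=g).float() * 2 - 1
    hq = torch.randint(0, d, (q.size(1),), generator=g)
    sq = torch.randint(0, 2, (q.size(1),), generator=g).float() * 2 - 1
    pv = torch.fft.rfft(count_sketch(v, hv, sv, d))
    pq = torch.fft.rfft(count_sketch(q, hq, sq, d))
    return torch.fft.irfft(pv * pq, n=d)  # approximates the outer product

# Example: fuse a 2048-d visual vector with a 1024-d question vector.
fused = mcb(torch.randn(8, 2048), torch.randn(8, 1024))  # (8, 16000)
```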
Table I gives the numerical results obtained by every method we tested. The metric used is the one presented in [1], where an answer is considered correct and given a score of one if at least three of the ten human-given responses match that answer (a minimal implementation is sketched below). As this table illustrates, the baseline methods in general do worse than all of the novel approaches we attempted.
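For concreteness, this scoring rule can be written in a few lines; the min(n/3, 1) form is the standard VQA accuracy of [1], of which the “at least three of ten” description above is the full-credit case.

```python
def vqa_accuracy(pred, human_answers):
    """VQA accuracy from [1]: an answer scores min(#matching humans / 3, 1),
    so it earns full credit once at least 3 of the 10 annotators gave it."""
    matches = sum(a == pred for a in human_answers)
    return min(matches / 3.0, 1.0)

# Example: 3 of 10 annotators agree -> full credit.
print(vqa_accuracy("tennis", ["tennis"] * 3 + ["badminton"] * 7))  # 1.0
```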
Figure 4 illustrates the qualitative performance of one of our baseline approaches, one of the GAN-based approaches, and the attention-based model on several images. While the baseline model is able to correctly answer the simpler questions, it fails miserably at more complex questions such as (c) and (d). For (d) it seems to capture some of the meaning of the question (relating a fridge to a meal) yet still answers the prompt incorrectly. As for the GAN-based approach and the attention-based approach, both are able to correctly capture the meaning of the questions, while the attention-based approach is slightly better than the GAN-based one: in (d) our attention-based approach gets the correct answer while our GAN-based approach returns an almost correct answer.
Fig. 4: Qualitative Results of Visual Question Answering.
(a) Question: How many chairs are in the photo? Baseline Answer: 3. GAN Answer: 1. Attention Answer: 1.
(b) Question: Is it an overcast day? Baseline Answer: yes. GAN Answer: yes. Attention Answer: yes.
(c) Question: What year is the car? Baseline Answer: scarf. GAN Answer: 2010. Attention Answer: 2010.
(d) Question: What color is the fridge? Baseline Answer: dinner. GAN Answer: gray. Attention Answer: silver.