Exploring Diverse Methods in Visual Question Answering
1st Panfeng Li
Department of Electrical and Computer Engineering
University of Michigan
Ann Arbor, USA
[email protected]

2nd Qikai Yang
Department of Computer Science
University of Illinois Urbana-Champaign
Urbana, USA
[email protected]
Abstract—This study explores innovative methods for improving Visual Question Answering (VQA) using Generative Adversarial Networks (GANs), autoencoders, and attention mechanisms. Leveraging a balanced VQA dataset, we investigate three distinct strategies. Firstly, GAN-based approaches aim to generate answer embeddings conditioned on image and question inputs, showing potential but struggling with more complex tasks. Secondly, autoencoder-based techniques focus on learning optimal embeddings for questions and images, achieving results comparable to the GAN thanks to a better ability to handle complex questions. Lastly, attention mechanisms, incorporating Multimodal Compact Bilinear Pooling (MCB), address language priors and attention modeling, albeit with a complexity-performance trade-off. This study underscores the challenges and opportunities in VQA and suggests avenues for future research, including alternative GAN formulations and attentional mechanisms.

Index Terms—Visual Question Answering; Generative Adversarial Networks; Autoencoders; Attention

… the ambitious goal of imbuing machines with human-like perceptual and cognitive abilities. By synthesizing information gleaned from visual stimuli with linguistic cues, VQA systems aspire to emulate the nuanced understanding and reasoning capabilities exhibited by human agents when confronted with multimodal inputs.

In light of the aforementioned considerations, this study embarks on a journey to unravel the intricacies of the VQA conundrum, leveraging insights from one of the most prominent and widely utilized datasets in the field [24]. By delving into the depths of this seminal dataset, we endeavor to shed light on the underlying challenges and opportunities inherent in the VQA paradigm, with the ultimate aim of advancing the frontier of AI research and fostering the development of more intelligent and perceptive machines.
The second simply utilized a single linear layer. Each of the generators outputs a vector of length 1000 encoding the likelihood of each of the 1000 most common answers.
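As a concrete illustration, the two generator heads might be sketched as follows in PyTorch; the fused image-question embedding dimension, hidden sizes, and dropout rate are our assumptions for the sketch, not the exact values used in the experiments.

```python
import torch.nn as nn

NUM_ANSWERS = 1000  # scores over the 1000 most common answers

class SimpleGenerator(nn.Module):
    """Single linear layer over the fused image-question embedding."""
    def __init__(self, embed_dim=1024):
        super().__init__()
        self.fc = nn.Linear(embed_dim, NUM_ANSWERS)

    def forward(self, fused):          # fused: (batch, embed_dim)
        return self.fc(fused)          # (batch, 1000) answer likelihoods

class FullGenerator(nn.Module):
    """Deeper multi-layer head; depth, width, and dropout are illustrative."""
    def __init__(self, embed_dim=1024, hidden=2048, p_drop=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, NUM_ANSWERS),
        )

    def forward(self, fused):
        return self.net(fused)
```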
We coupled the simple, single-layer generator with the simpler of our embedding methods (for future reference we denote this network as Gsimp) and the more complex generator with our more complex embedding method (which we denote Gfull).

We tested training the generators without the discriminator portion and, in order to produce the full GAN, we also fed the generators' outputs into a discriminator network. We attempted several different architectures for the discriminator portion, but each was ultimately several fully connected ReLU layers which output a single number into a sigmoid activation to scale it between 0 and 1. In addition to taking the output of the generator …

12) L_G ← log(s_f)
13) G ← G − α · ∂L_G/∂G (Update Generator)
14) end for

We attempted several variations of this. We experimented with pre-training both the Discriminator and the Generator. When pre-training the Generator we simply trained it as a softmax classifier with noise added to the inputs. This allows us to initialize the weights to values that produce relatively good results at the start of training the full GAN [29–33].

We found that pre-training the Discriminator to optimality could be detrimental to the actual training process of the GAN and could worsen the updates of the Generator over time [34]. [34] shows that if the probability densities are either disjoint or lie on a low-dimensional manifold, then the Discriminator can distinguish between them perfectly. This happens to be the case for the loss functions proposed by [35]. So instead of pre-training our Discriminator to optimality, we follow the suggestion in [36–42] and add noise to its inputs.
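A minimal sketch of this procedure is given below. The discriminator interface D(answer_vector, conditioning), the noise level, and the optimizers are assumptions, and the generator update uses the standard non-saturating form of step 12's L_G = log(s_f).

```python
import torch
import torch.nn as nn

def pretrain_generator(G, loader, noise_std=0.1, lr=1e-3, epochs=1):
    """Pre-train G as a softmax classifier with noise added to its inputs."""
    opt = torch.optim.Adam(G.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for fused, answer_idx in loader:  # fused embeddings and answer labels
            noisy = fused + noise_std * torch.randn_like(fused)
            loss = ce(G(noisy), answer_idx)
            opt.zero_grad()
            loss.backward()
            opt.step()

def gan_step(G, D, fused, real_answer, opt_G, opt_D, noise_std=0.1):
    """One adversarial update. Instead of pre-training D to optimality,
    noise is added to D's inputs, following the suggestion in [36-42]."""
    bce = nn.BCELoss()
    fake = G(fused)
    # Discriminator update on noisy real and fake answer vectors.
    s_r = D(real_answer + noise_std * torch.randn_like(real_answer), fused)
    s_f = D(fake.detach() + noise_std * torch.randn_like(fake), fused)
    loss_D = bce(s_r, torch.ones_like(s_r)) + bce(s_f, torch.zeros_like(s_f))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()
    # Generator update: maximize log s_f (non-saturating form of step 12).
    s_f = D(fake + noise_std * torch.randn_like(fake), fused)
    loss_G = bce(s_f, torch.ones_like(s_f))
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```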
We tested adding normalization to the outputs of the layers in our generator and discriminator modules. In general, however, we found that this yielded worse results than when using unnormalized layers. We utilized a small amount of dropout in training our generator and discriminator modules.

We attempted initializing the weights under several different distributions, noticing altered results when we did so. We first initialized all weights to follow a Gaussian distribution with large values clipped (we denote this initialization method I1) and also tested initializing the weights to follow a uniform distribution (which we denote I2); a sketch of the two schemes appears after Table I.

Method | All | Yes/No | Number | Other
Baseline Methods:
Gsimp-N0-I1 | 11.58 | 23.86 | 5.32 | 1.48
Gfull-N0-I1 | 18.65 | 40.56 | 0.25 | 7.84
Gfull-N1-I1 | 23.41 | 55.46 | 3.49 | 0.61
Gfull-N2-I1 | 14.76 | 35.77 | 1.25 | 0.27
Gfull-N2-I2 | 23.06 | 54.65 | 3.50 | 0.51
Novel Methods:
GANsimp-N0-I1 | 25.27 | 62.49 | 0.32 | 0.59
GANsimp-N0-I2 | 25.57 | 56.67 | 9.05 | 0.61
GANfull-N0-I1 | 27.51 | 51.90 | 22.41 | 0.08
GANfull-N1-I1 | 28.81 | 57.28 | 19.13 | 0.54
GANfull-N2-I1 | 34.57 | 65.38 | 27.36 | 0.71
Autoencoder | 37.65 | 64.01 | 24.37 | 15.77
Attention | 44.32 | 66.64 | 32.15 | 26.72
Attention + MCB | 47.58 | 67.60 | 31.47 | 36.98

TABLE I: Results on the VQA 1.9 validation dataset (accuracy in %). Legend: Gsimp - simple, single-layer generator trained as classifier; Gfull - full, multi-layer generator trained as classifier; GANsimp - simple, single-layer generator trained with discriminator; GANfull - full, multi-layer generator trained with discriminator; N0 - no noise input to the generator; N1 - noise concatenated to the generator conditioning input; N2 - noise added to the generator conditioning input; I1 - weights initialized via a clipped Gaussian distribution; I2 - weights initialized via a uniform distribution.
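The two initialization schemes might be sketched as follows; the clipping threshold and the uniform bounds are illustrative assumptions, as the source does not specify them.

```python
import torch.nn as nn

def init_I1(module, std=0.02, clip=2.0):
    """I1: Gaussian weights with large values clipped (truncated normal)."""
    if isinstance(module, nn.Linear):
        nn.init.trunc_normal_(module.weight, mean=0.0, std=std,
                              a=-clip * std, b=clip * std)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

def init_I2(module, bound=0.05):
    """I2: weights drawn from a uniform distribution."""
    if isinstance(module, nn.Linear):
        nn.init.uniform_(module.weight, -bound, bound)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Apply to a model, e.g. generator.apply(init_I1) or generator.apply(init_I2).
```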
III. AUTOENCODER BASED MECHANISM

We modified the initial GAN technique to give us an autoencoder-based technique, wherein the concatenated features are passed through an autoencoder to generate low-dimensional embeddings. Most existing approaches (such as MCB [43]) utilize a fixed method to embed the question and image features together. By employing an autoencoder, we hoped to learn how best to embed the question and image features into a low-dimensional space. We use this encoding, after passing it through several fully connected layers, to generate the answer.
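A minimal sketch of this pipeline, under assumed feature and bottleneck dimensions: the concatenated question and image features are encoded to a low-dimensional code, which both reconstructs the input and, after several fully connected layers, predicts the answer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerAutoencoder(nn.Module):
    """Encodes concatenated question/image features into a low-dimensional
    code used both for reconstruction and, via fully connected layers,
    for answer prediction. All dimensions are illustrative assumptions."""
    def __init__(self, q_dim=1024, v_dim=2048, code_dim=256, num_answers=1000):
        super().__init__()
        in_dim = q_dim + v_dim
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, code_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 1024), nn.ReLU(),
            nn.Linear(1024, in_dim),
        )
        self.classifier = nn.Sequential(
            nn.Linear(code_dim, 512), nn.ReLU(),
            nn.Linear(512, num_answers),
        )

    def forward(self, q_feat, v_feat):
        x = torch.cat([q_feat, v_feat], dim=1)   # concatenated features
        code = self.encoder(x)                    # low-dimensional embedding
        return self.classifier(code), self.decoder(code), x

def ae_loss(logits, labels, recon, x, beta=0.5):
    """Answer cross-entropy plus a reconstruction term (weighting assumed)."""
    return F.cross_entropy(logits, labels) + beta * F.mse_loss(recon, x)
```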
IV. ATTENTION BASED MECHANISM

To answer a question according to an image, it is critical to model both “where to look” and “what words to listen to”, namely visual attention and question attention [44].

However, [24] shows that language priors make the VQA dataset [1] unbalanced: simply answering “tennis” and “2” will achieve 41% and 39% accuracy for the two types of questions “What sport is” and “How many”, respectively. These language priors bring to light the question of whether machines truly understand the questions and images or whether they simply tend to give an answer that has a higher frequency in the dataset.

Inspired by the strength of Multimodal Compact Bilinear Pooling (MCB) at efficiently and expressively combining multimodal features [43], we use the MCB operation to replace the simple addition operation used in the co-attention mechanism [44] when combining the features learned from the images and questions, which may help to learn more information from the visual part [45–47]. A sketch of the MCB operation follows.
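The sketch below follows the Count Sketch plus FFT construction of [43]: each modality is projected with a signed hash into a d-dimensional sketch, and the two sketches are convolved via an elementwise product in the frequency domain, approximating their outer (bilinear) product without materializing it. The sketch dimension and feature sizes are assumptions.

```python
import torch

def count_sketch(x, h, s, d):
    """Count Sketch: scatter-add the signed features of x (batch, n)
    into d bins using hash h (n,) and signs s (n,)."""
    out = x.new_zeros(x.size(0), d)
    out.index_add_(1, h, x * s)
    return out

def mcb(v, q, d=16000, seed=0):
    """Multimodal Compact Bilinear pooling of image features v (batch, nv)
    and question features q (batch, nq): sketch each modality, then
    convolve the sketches in the frequency domain. The fixed seed keeps
    the random hashes consistent across calls."""
    g = torch.Generator().manual_seed(seed)
    hv = torch.randint(0, d, (v.size(1),), generator=g)
    sv = torch.randint(0, 2, (v.size(1),), generator=g).float() * 2 - 1
    hq = torch.randint(0, d, (q.size(1),), generator=g)
    sq = torch.randint(0, 2, (q.size(1),), generator=g).float() * 2 - 1
    pv = torch.fft.rfft(count_sketch(v, hv, sv, d))
    pq = torch.fft.rfft(count_sketch(q, hq, sq, d))
    return torch.fft.irfft(pv * pq, n=d)  # approximates the outer product

# Example: fuse a 2048-d visual vector with a 1024-d question vector.
fused = mcb(torch.randn(8, 2048), torch.randn(8, 1024))  # (8, 16000)
```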
Table I gives the numerical results obtained by every method we tested. The metric used is the one presented in [1], where an answer is considered correct and given a score of one if at least three of the ten human-given responses match that answer (a minimal implementation is sketched below). As this table illustrates, the baseline methods in general do worse than all of the novel approaches we attempted.
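For concreteness, this scoring rule can be written in a few lines; the min(n/3, 1) form is the standard VQA accuracy of [1], of which the “at least three of ten” description above is the full-credit case.

```python
def vqa_accuracy(pred, human_answers):
    """VQA accuracy from [1]: an answer scores min(#matching humans / 3, 1),
    so it earns full credit once at least 3 of the 10 annotators gave it."""
    matches = sum(a == pred for a in human_answers)
    return min(matches / 3.0, 1.0)

# Example: 3 of 10 annotators agree -> full credit.
print(vqa_accuracy("tennis", ["tennis"] * 3 + ["badminton"] * 7))  # 1.0
```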
Figure 4 illustrates the qualitative performance of one of our baseline approaches, one of the GAN-based approaches, and the attention-based model on several images. While the baseline model is able to correctly answer the simpler questions, it fails miserably at more complex questions such as (c) and (d). For (d) it seems to capture some of the meaning of the question (relating a fridge to a meal) yet still answers the prompt incorrectly. As for the GAN-based approach and the attention-based approach, both are able to correctly capture the meaning of the questions, while the attention-based approach is slightly better than the GAN-based one: in (d) our attention-based approach gets the correct answer while our GAN-based approach returns an almost correct answer.
Fig. 4: Qualitative Results of Visual Question Answering.
(a) Question: How many chairs are in the photo? Baseline Answer: 3. GAN Answer: 1. Attention Answer: 1.
(b) Question: Is it an overcast day? Baseline Answer: yes. GAN Answer: yes. Attention Answer: yes.
(c) Question: What year is the car? Baseline Answer: scarf. GAN Answer: 2010. Attention Answer: 2010.
(d) Question: What color is the fridge? Baseline Answer: dinner. GAN Answer: gray. Attention Answer: silver.