VQA Report
Authors:
BOUHAFA TAHA
LOUBABA MALKI L’HLAIBI
Supervisor:
Prof. BELCAID ANASS
Contents

I Acknowledgments
II Introduction
III Literature
  III.1 Definition
  III.2 Datasets
  III.3 State-of-the-Art Algorithms
IV Datasets
  IV.1 Dataset Overview
    IV.1.1 Key Statistics
    IV.1.2 Data Structure
    IV.1.3 Examples
    IV.1.4 Why COCO-VQA?
  IV.2 Preprocessing
    IV.2.1 Data Preprocessing
    IV.2.2 Dataset Splitting
V Architecture
  V.1 Vision Model (ResNet Model)
    V.1.1 ResNet34
    V.1.2 ResNet50
  V.2 Text Model (BERT)
  V.3 Visual Question Answering (VQA) Model
    V.3.1 Multi-Modal Late Fusion Mechanism
    V.3.2 Architecture of Our VQA Model
VI Evaluation of VQA Model
VII Conclusion
I. Acknowledgments
We take this opportunity to express our heartfelt gratitude to all those who contributed
to the successful completion of this project. Your guidance, support and encouragement
were invaluable throughout this journey.
First, we extend our deepest thanks to Prof. BELCAID ANASS, our project guide, for
their constant support, insightful feedback, and expert guidance. Their deep knowledge
in the field of deep learning helped us navigate the complexities of this project and refine
our approach. Their patience and encouragement kept us motivated, even during the
challenging phases of the project.
We would like to acknowledge the creators of the COCO VQA V2.0 dataset for making
their data publicly available. This dataset was instrumental in training and evaluating
our model, and we appreciate the effort that went into its creation.
Our sincere thanks go to the open-source community for providing access to powerful
tools and libraries such as PyTorch, Hugging Face Transformers, and Matplotlib, which
were essential for the implementation and visualization of our project. Without these
resources, this project would not have been possible.
We also thank our colleagues and peers for their constructive feedback, discussions,
and moral support during the project. Their input helped us refine our ideas and overcome
challenges.
Lastly, we express our gratitude to our family and friends for their unwavering support,
encouragement, and patience throughout this journey. Their belief in us kept us going,
even during the most demanding times.
This project has been a rewarding learning experience, and we are grateful to everyone
who contributed to its success.
II. Introduction
Visual Question Answering (VQA) is a challenging task in the field of artificial intelligence
that combines computer vision and natural language processing. The goal of VQA is to
enable machines to answer questions about images in a way that is both accurate and
contextually relevant. This project focuses on building a VQA system that leverages
state-of-the-art deep learning models to achieve this goal.
The system integrates two key components: a Convolutional Neural Network (CNN)
for image feature extraction and a Bidirectional Encoder Representations from Trans-
formers (BERT) model for processing textual questions. By combining these features,
the model is trained to predict the most appropriate answer from a predefined vocabu-
lary. The project involves several stages, including data preprocessing, model training,
validation, and testing, with the ultimate aim of achieving high accuracy in answering
questions based on visual content.
This report documents the methodology, implementation, and evaluation of the VQA
system, highlighting the challenges faced and the solutions adopted. The results demon-
strate the effectiveness of combining visual and textual features for the VQA task, pro-
viding insights into the potential of multimodal learning in AI applications.
III. Literature
III.1 Definition
Visual Question Answering (VQA) is an interdisciplinary field that combines computer
vision and natural language processing to enable machines to answer natural language
questions about visual content. This capability demands a deep integration of image un-
derstanding, language comprehension, and reasoning processes. A VQA system typically
involves three key stages: extracting visual features from the image, encoding the natural
language question, and fusing the two representations to predict an answer.

Figure III.1: Examples of questions and answers in VQA tasks, showcasing different
question types.
To illustrate how VQA aligns with common computer vision tasks, Table III.1 provides
examples of representative VQA questions associated with specific CV tasks.
Table III.1 (excerpt):
CV task: Spatial relationships among objects
Example question: What is between the cat and the sofa?
III.2 Datasets
Datasets are fundamental to the development and evaluation of Visual Question An-
swering (VQA) systems. Below are some of the most commonly used datasets in the
field:
• COCO VQA V2.0 2017 Dataset: Visual Question Answering (VQA) v2.0 is
a dataset containing open-ended questions about images. These questions require
an understanding of vision, language, and commonsense knowledge to answer. It is
the second version of the VQA dataset.
Key statistics of the COCO VQA v2.0 dataset:

Images: 250,000+
Questions: 1M+
Annotations: 7M+
Questions per Image: 5.4 on average
Split (Training/Testing): 82,000+ / 80,000+
III.3 State-of-the-Art Algorithms

• Large Language Models: Models such as BLIP-2, Flamingo, and OFA build on large
pretrained language and vision-language models (e.g., GPT and CLIP) and leverage
extensive pretraining on vast multimodal datasets. This enables them to achieve enhanced
reasoning capabilities and handle more diverse VQA tasks.
Table III.5: Performance Analysis of VLP Architectures in VQA. The models are evalu-
ated on the test-dev and test-std splits of the VQAv2 dataset.
Source: From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges,
IV. Datasets

IV.1.2 Data Structure:

• Annotations File: Contains the answers for each question, with multiple answers
provided by different annotators.
• Questions File: Contains the questions and their corresponding image IDs.
IV.1.3 Examples:
Below are examples of the data structure in the JSON files and an example image from
the dataset.
The annotations file contains detailed information about each question, including the
multiple-choice answer and a list of answers provided by different annotators. Here is an
example:
{
"question_type": "what is",
"multiple_choice_answer": "sharpening knife",
"answers": [
{"answer": "sewing", "answer_confidence": "no", "answer_id": 1},
{"answer": "sharpening knife", "answer_confidence": "yes", "answer_id": 2},
{"answer": "grinding knife", "answer_confidence": "yes", "answer_id": 3},
{"answer": "sharpening knife", "answer_confidence": "yes", "answer_id": 4},
{"answer": "riding", "answer_confidence": "maybe", "answer_id": 5},
{"answer": "knife sharpening", "answer_confidence": "yes", "answer_id": 6},
{"answer": "sharpening knife", "answer_confidence": "maybe", "answer_id": 7},
{"answer": "sharpening knives", "answer_confidence": "maybe", "answer_id":8},
{"answer": "sharpening knife", "answer_confidence": "yes", "answer_id": 9},
{"answer": "knife sharpening", "answer_confidence": "yes", "answer_id": 10}
],
"image_id": 262136,
"answer_type": "other",
"question_id": 262136003
}
Questions File Example:
The questions file contains the questions and their corresponding image IDs. Here is an
example:
{
"image_id": 524286,
"question": "Is there a computer mouse on the desk?",
"question_id": 524286002
}
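To make the use of these files concrete, the following is a minimal Python sketch of loading
the two JSON files and pairing each question with its annotation by question_id. The file
names are placeholders, and the top-level "annotations"/"questions" keys follow the standard
VQA v2.0 layout.

import json

# Paths are placeholders for the downloaded VQA v2.0 files.
with open("annotations_train.json") as f:
    annotations = json.load(f)["annotations"]
with open("questions_train.json") as f:
    questions = json.load(f)["questions"]

# Index annotations by question_id so each question can be paired with its
# multiple_choice_answer and the list of annotator answers.
ann_by_qid = {ann["question_id"]: ann for ann in annotations}

for q in questions[:3]:
    ann = ann_by_qid[q["question_id"]]
    print(q["question"], "->", ann["multiple_choice_answer"])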
Image Example:
An example image from the dataset accompanies these records, referenced by their
image_id field.

IV.1.4 Why COCO-VQA?

COCO-VQA was chosen for our project due to its large size, diversity of questions, and
real-world applicability. It provides a comprehensive benchmark for evaluating VQA
models and is widely used in the research community.
IV.2 Preprocessing:
To make the dataset manageable for our project, we preprocessed the data and split it
into smaller subsets.
IV.2.1 Data Preprocessing:

Building the Answer Vocabulary: We created a vocabulary of the top 1000 most frequent
answers from the training set, plus an additional <unk> token for unseen answers (1001
classes in total). This vocabulary allowed us to map answers to numerical indices for
training.
Each answer was mapped to its corresponding index in the vocabulary. If an answer
wasn’t in the top 1000, it was assigned the <unk> index. This step converted the textual
answers into numerical form for the model.
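As an illustration of these two steps, here is a minimal sketch of building the 1001-entry
vocabulary and encoding answers; the small train_answers list is a stand-in for the answers
gathered from the training annotations.

from collections import Counter

# Stand-in for the list of training answers (e.g., the multiple_choice_answer
# field of every training record).
train_answers = ["yes", "no", "2", "yes", "sharpening knife", "yes"]

counts = Counter(train_answers)
top_answers = [ans for ans, _ in counts.most_common(1000)]

# Reserve one index for <unk>; the 1000 most frequent answers fill the rest,
# giving 1001 classes in total. Putting <unk> at index 0 is an arbitrary choice.
answer_to_idx = {"<unk>": 0}
for ans in top_answers:
    answer_to_idx[ans] = len(answer_to_idx)

def encode_answer(answer):
    # Answers outside the top 1000 fall back to the <unk> index.
    return answer_to_idx.get(answer, answer_to_idx["<unk>"])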
Tokenizing Questions:
Each question was tokenized with the BERT tokenizer (see Section V.2), which adds the
[CLS] and [SEP] special tokens and pads all questions to a fixed length.

Selecting Target Answers:
For questions with multiple annotator answers, we selected the majority answer as the
target label; when no single answer had a majority, the most frequent answer was chosen.
This ensured consistency in the training data.
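A minimal sketch of both steps, assuming the bert-base-uncased tokenizer and a maximum
question length of 30 tokens (both assumptions, not values stated in this report):

from collections import Counter
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def tokenize_question(question, max_length=30):
    # Adds [CLS]/[SEP], pads with [PAD] to a fixed length, returns PyTorch tensors.
    return tokenizer(question, padding="max_length", truncation=True,
                     max_length=max_length, return_tensors="pt")

def target_answer(annotation):
    # Pick the most frequent of the annotator answers; when several answers tie,
    # the first one encountered is used.
    answers = [a["answer"] for a in annotation["answers"]]
    return Counter(answers).most_common(1)[0][0]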
Image Preprocessing:

• Normalization: The images were normalized using the mean and standard deviation
of the ImageNet dataset. This step improved the model's convergence during training.
A sketch of the transform pipeline is given below.
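A minimal torchvision sketch of this preprocessing, assuming images are resized to the
224 x 224 input size used throughout this report before normalization with the standard
ImageNet statistics:

from torchvision import transforms

# Resize to 224x224, convert to a tensor, and normalize with the ImageNet
# mean and standard deviation.
image_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])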
IV.2.2 Dataset Splitting:

Since the COCO-VQA dataset is very large, we worked with a subset of it (roughly 20%
of the full VQA v2.0 dataset; see Section VI.3.2) and split that subset into training,
validation, and test sets.

This splitting strategy allowed us to train and evaluate our model efficiently while
ensuring that we had enough data for each phase of the project.
V. Architecture

V.1 Vision Model (ResNet Model)

V.1.1 ResNet34
ResNet34 is a convolutional neural network architecture that belongs to the ResNet family
of models, introduced by He et al. in 2015. It was specifically designed to address the
problem of vanishing gradients in deep networks by utilizing residual connections, which
allow gradients to flow more easily through deeper layers.
The ResNet34 model consists of 34 layers, including convolutional layers, batch nor-
malization, and activation functions. The key feature of ResNet34 is the use of residual
blocks, which consist of skip connections that bypass one or more layers. These skip con-
nections allow the model to learn residual mappings rather than trying to directly learn
the underlying mapping, helping the network to train deeper architectures effectively.
The architecture of ResNet34 can be broken down as follows:
• Input: The input is an image with a size of 224 × 224 pixels, typically with 3 color
channels (RGB).
• Residual Blocks: ResNet34 groups most of its 34 layers into four stages of residual
blocks. Each basic block consists of two convolutional layers, each followed by batch
normalization and ReLU activation, and the output of each block is added to its input
via a skip connection.
• Downsampling: To reduce the spatial dimensions of the feature maps, ResNet34
downsamples after the initial convolution (via a stride-2 max pooling layer) and at each
transition between stages, using a stride of 2 in the first convolution of the new stage.
• Fully Connected Layer: After the residual blocks, the feature maps are passed
through a global average pooling layer, which reduces the spatial dimensions to a
single value per channel. This output is then flattened and passed through a fully
connected (dense) layer, which outputs the final predictions.
• Output: In our setting, the final layer outputs a vector of size 1001, corresponding
to the 1000 most frequent answers in our vocabulary plus the <unk> token (the original
ImageNet classifier instead outputs 1000 class scores).
The key innovation of ResNet34 lies in the residual connections, which allow the
model to be much deeper (34 layers in this case) without suffering from the vanishing
gradient problem. This makes ResNet34 capable of achieving high performance on image
classification tasks, such as the ImageNet challenge.
ResNet34 was chosen for our Visual Question Answering (VQA) model due to its
balance of depth, computational efficiency, and strong feature extraction capabilities.
This architecture, with its residual learning framework, is particularly well-suited for
tasks that require robust visual feature representation.
The decision to use 1001 output classes in our model corresponds to the number of
answers in our answer vocabulary (1000 frequent answers plus <unk>).
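As a sketch of how this 1001-way head can be attached (assuming a recent torchvision and
ImageNet-pretrained weights; the exact training code is not reproduced here):

import torch.nn as nn
from torchvision import models

NUM_ANSWERS = 1001  # 1000 most frequent answers + <unk>

# Load an ImageNet-pretrained ResNet34 and replace its 1000-class head with a
# linear layer over our answer vocabulary.
resnet34 = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
resnet34.fc = nn.Linear(resnet34.fc.in_features, NUM_ANSWERS)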
• The ResNet34 model was trained on our dataset with the following configuration:

Input Image Size: 224x224
Labels (answer vocabulary size): 1001
Number of Epochs: 8
Learning Rate: 1e-3
Batch Size: 16
Optimizer: AdamW
V.1.2 ResNet50
• Input Layer: Similar to ResNet34, the input layer processes images of size 224 ×
224 × 3 (height, width, channels).
• Residual Blocks: ResNet50 employs bottleneck blocks, which differ from the basic
blocks used in ResNet34. Each bottleneck block consists of three convolutional layers:
a 1 × 1 convolution that reduces the number of channels, a 3 × 3 convolution, and a
1 × 1 convolution that restores the channel dimension.
These bottleneck blocks allow for a deeper network while maintaining computational
efficiency.
• Global Average Pooling and Fully Connected Layer: The feature maps
from the last residual block are passed through a global average pooling layer. The
resulting feature vector is fed into a fully connected layer that outputs 1001 classes,
corresponding to the answer space of the VQA model.
The additional depth and improved feature extraction capabilities of ResNet50 en-
hance the ability of the VQA model to handle more complex and nuanced questions.
By integrating the image features extracted by ResNet50 with the textual features from
the question, the model is likely to achieve a more comprehensive understanding of the
multimodal input, resulting in better performance across a variety of VQA tasks.
Similarly to ResNet34, the 1001 output classes correspond to the number of answers in
our answer vocabulary.
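Since ResNet50 is later used both as a stand-alone classifier and as a feature extractor
with its FC layer removed (Section V.3.2), a sketch of both variants is shown below, again
assuming torchvision's pretrained weights:

import torch
import torch.nn as nn
from torchvision import models

# Variant 1: 1001-way classifier over the answer vocabulary.
resnet50_cls = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
resnet50_cls.fc = nn.Linear(resnet50_cls.fc.in_features, 1001)

# Variant 2: feature extractor for late fusion; replacing the final FC layer
# with Identity yields a 2048-dimensional embedding per image.
resnet50_feat = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
resnet50_feat.fc = nn.Identity()

with torch.no_grad():
    image_features = resnet50_feat(torch.randn(1, 3, 224, 224))  # shape: (1, 2048)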
The ResNet50 model was trained with the following configuration:

Input Image Size: 224x224
Labels (answer vocabulary size): 1001
Number of Epochs: 5
Learning Rate: 1e-3
Batch Size: 16
Optimizer: AdamW
• Result: The best model achieved a validation accuracy of 19.34% after training.
V.2 Text Model (BERT)

BERT is responsible for interpreting the textual questions in the VQA pipeline. By
leveraging its pretraining on large language corpora, BERT provides robust representations
of the questions, which are integrated with image embeddings using a late fusion
mechanism.

Using BertForSequenceClassification:
Figure V.3: BERT Architecture: 12 transformer blocks and a classification layer.
Source: ResearchGate
1. Input Tokenization
• The input text (e.g., a question) is tokenized into subwords or words using the
BERT tokenizer.
• Special tokens like [CLS] (start of sequence) and [SEP] (separator for multiple
sentences) are added.
• Padding tokens ([PAD]) are used to ensure all inputs have the same length.
2. Transformer Encoding

• The token embeddings are passed through the 12 transformer encoder blocks (see
Figure V.3), producing contextualized representations for every token.

3. Output Representation
• The output of BERT includes embeddings for each token in the input sequence.
• The [CLS] token output is used for classification tasks (e.g., predicting the answer
in VQA).
4. Classification Layer

In BertForSequenceClassification, a linear layer is added on top of the [CLS] token
output to predict the final class (e.g., the answer in VQA).
The input to BERT consists of tokenized questions, while the output is the predicted
answer class.
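A minimal sketch of this question-only classifier using Hugging Face Transformers; the
bert-base-uncased checkpoint and the maximum length of 30 are assumptions:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# A linear classification head over the [CLS] representation, with one logit
# per entry of the 1001-answer vocabulary.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=1001)

inputs = tokenizer("Is there a computer mouse on the desk?",
                   padding="max_length", truncation=True,
                   max_length=30, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits           # shape: (1, 1001)
predicted_idx = logits.argmax(dim=-1).item()  # index into the answer vocabulary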
Training Configuration
Input Type: Tokenized Questions
Labels (answer vocabulary size): 1001
Number of Epochs: 10
Learning Rate: 5e-5
Batch Size: 16
Optimizer: AdamW
Training Results
The model achieved a validation accuracy of 40.48% after training, demonstrating its
effectiveness in processing textual questions for the VQA task.
V.3 Visual Question Answering (VQA) Model

V.3.1 Multi-Modal Late Fusion Mechanism

The VQA multi-modal late fusion mechanism, illustrated in Figure V.5, combines features
extracted from text and images:
• Textual Features: The text input is processed using the BERT model to generate
contextual embeddings, referred to as textual features.
• Image Features: The image input is processed using a convolutional neural net-
work, such as ResNet, to extract visual embeddings, referred to as image features.
• Late Fusion: The textual and image features are processed independently and
fused at a later stage before being passed to the classifier. This approach ensures
that each modality contributes complementary information to the prediction task.
Figure V.5: VQA Multi-Modal Late Fusion Mechanism
V.3.2 Architecture of Our VQA Model

Our VQA model builds upon the multi-modal fusion approach and introduces a specific
pipeline for feature integration and classification, as shown in Figure V.6 and sketched
in code after the list below:
• Input Features:
– BERT Features: Contextual embeddings extracted from BERT (bert feature size)
after removing its FC layer.
– ResNet50 Features: Visual embeddings extracted from ResNet50 (cnn feature size)
after removing its FC layer.
• Fusion and Classification:
– The BERT and ResNet50 features are concatenated and passed through a fully
connected layer with 512 units and ReLU activation to transform and combine them.
– The output is then passed through another fully connected layer with 1001
units, where each unit corresponds to a potential answer in the dataset.
• Output: The model predicts the final answer based on the transformed features.
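A minimal PyTorch sketch of this late-fusion classifier. The feature sizes (768 for BERT,
2048 for ResNet50) are the usual dimensions of these backbones and stand in for
bert_feature_size and cnn_feature_size; they are assumptions rather than values stated
above.

import torch
import torch.nn as nn

class LateFusionVQA(nn.Module):
    def __init__(self, bert_feature_size=768, cnn_feature_size=2048,
                 hidden_size=512, num_answers=1001):
        super().__init__()
        # Concatenated text + image features -> 512-unit hidden layer -> answers.
        self.fusion = nn.Sequential(
            nn.Linear(bert_feature_size + cnn_feature_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_answers),
        )

    def forward(self, bert_features, image_features):
        fused = torch.cat([bert_features, image_features], dim=1)
        return self.fusion(fused)  # logits over the 1001 answers

# Example with dummy feature tensors for a batch of 4 question-image pairs.
logits = LateFusionVQA()(torch.randn(4, 768), torch.randn(4, 2048))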
We trained our VQA model on our dataset with the following configuration:
Input: 224x224 image + tokenized question
Output Vocabulary Size: 1001
Number of Epochs: 10
Learning Rate: 5e-5
Batch Size: 16
Optimizer: AdamW
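A simplified training-loop sketch under these settings. The dummy TensorDataset stands
in for precomputed BERT and ResNet50 features and answer indices; in the real pipeline
these come from the data described in Chapter IV.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Same fusion classifier as sketched in Section V.3.2, written inline here.
model = nn.Sequential(nn.Linear(768 + 2048, 512), nn.ReLU(), nn.Linear(512, 1001))
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
criterion = nn.CrossEntropyLoss()

# Dummy precomputed features and answer indices for illustration only.
dummy = TensorDataset(torch.randn(64, 768), torch.randn(64, 2048),
                      torch.randint(0, 1001, (64,)))
train_loader = DataLoader(dummy, batch_size=16, shuffle=True)

for epoch in range(10):
    for bert_feats, image_feats, answer_idx in train_loader:
        optimizer.zero_grad()
        logits = model(torch.cat([bert_feats, image_feats], dim=1))
        loss = criterion(logits, answer_idx)
        loss.backward()
        optimizer.step()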
• Result: The model achieved a validation accuracy of 43.73% and a test accuracy of
43.46%.
VI. Evaluation of VQA Model

We evaluated the model with the following metrics (a minimal accuracy-computation
sketch follows the list):
• Validation Accuracy: The accuracy achieved on the validation set during train-
ing.
• Test Accuracy: The final accuracy on the test set, reflecting the model’s gener-
alization capability.
• Training and Validation Loss: The loss values during training and validation,
which indicate how well the model is optimizing.
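As a sketch of how the accuracy metrics above can be computed (the model and loader
follow the training sketch in Section V.3.2; this is illustrative rather than the exact
evaluation script):

import torch

@torch.no_grad()
def evaluate(model, data_loader):
    # Accuracy = fraction of examples whose predicted answer index matches the
    # target answer index.
    model.eval()
    correct, total = 0, 0
    for bert_feats, image_feats, answer_idx in data_loader:
        preds = model(torch.cat([bert_feats, image_feats], dim=1)).argmax(dim=-1)
        correct += (preds == answer_idx).sum().item()
        total += answer_idx.size(0)
    return correct / total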
The training configuration for these experiments is summarized below:

Input: 224x224 image + tokenized question
Output Vocabulary Size: 1001
Number of Epochs: 10
Learning Rate: 5e-5
Batch Size: 16
Optimizer: AdamW
VI.1.3 Results
The training and validation loss curves provide insights into the model’s optimization
process. The loss values decrease over epochs, indicating that the model is learning
effectively.
The feature activation distributions show how the model’s internal representations (fea-
tures) are distributed across different dimensions. This helps in understanding the model’s
ability to capture meaningful patterns in the data.
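A sketch of how such distributions can be visualized with Matplotlib, assuming a batch of
fused feature activations has been collected into a 2-D tensor (random values stand in
here):

import matplotlib.pyplot as plt
import torch

# features: (num_examples, feature_dim) activations collected from the model;
# random values are used here purely for illustration.
features = torch.randn(512, 512)

for dim in (0, 1, 2, 3):  # histogram a few feature dimensions
    plt.hist(features[:, dim].numpy(), bins=50, alpha=0.5, label=f"dim {dim}")
plt.xlabel("Activation value")
plt.ylabel("Count")
plt.legend()
plt.title("Feature activation distributions")
plt.show()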
Figure VI.1: Training and Validation Loss Curves
VI.2.2 Insights
• The feature activation distributions indicate that the model is learning diverse
representations, as seen by the varying shapes of the distributions for different
feature dimensions.
• Features with broader distributions suggest that the model is capturing a wide
range of patterns, while narrower distributions may indicate more specific patterns.
The table below compares our model's test accuracy with other state-of-the-art models:

Our VQA Model: 43.46%
ViLBERT: 71.79%
VisualBERT: 70.80%
VI.3.2 Analysis
• Performance Gap: Our model achieves a test accuracy of 43.46%, which is significantly
lower than ViLBERT (71.79%) and VisualBERT (70.80%). Contributing factors include:
– Smaller training dataset (only 20% of the full VQA v2.0 dataset was used).
VI.3.3 Future Improvements
• Larger Datasets: Using the full VQA v2.0 dataset or even larger datasets.
VII. Conclusion