Project Report: Image Captioning Models
Prakhar Dhyani
Team Members
2K22/MC/68, Ishita Sharma
2K22/MC/81, Kumar Utsav
2K22/MC/110, Pradeep Dheenraj Singh
2K22/MC/111, Prakhar Dhyani
2K22/MC/128, Rhythm Narayan
Abstract
The core of this project is the construction of an automatic image captioning system that can produce descriptions for images using deep learning methods. It helps bridge the gap between visual and textual data by combining computer vision with natural language processing. We used the Flickr8K dataset and trained neural network models to learn visual features from images together with their text descriptions. Visual features are extracted with state-of-the-art CNNs, including ResNet, Inception, and VGG networks, which generate rich, high-dimensional representations of images. Caption generation is handled by RNN-based language models, specifically Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks.
We further explore two fusion strategies for combining visual and textual information: early fusion and late fusion. In early fusion, features are combined at an early stage of the network and processed by the subsequent layers, while in late fusion, the modality-specific outputs are combined only at the final stages. Systematic evaluation of these strategies lets us assess which is more effective for generating captions that are both accurate and contextually relevant.
The configurations studied span these fusion strategies, CNN architectures, and RNN-based language models. Captions were evaluated qualitatively for relevance, coherence, and contextual accuracy, rather than through the usual metric-based evaluation. The results provide insight into the relative strengths and weaknesses of the various configurations, as well as their practical applicability in real-life situations.
The findings have practical relevance for accessibility, content generation, and image search, and they deepen our understanding of multimodal integration in image captioning. The report thus demonstrates how a structured comparison of architectures and fusion strategies can advance the development of efficient and effective automated image captioning models.
Introduction
Despite the remarkable achievements of deep learning, captioning images accurately and in context remains a challenging task, owing to the complexity of understanding visual content and translating it into natural language. This report compares the performance of different image captioning models to determine where each is strong and which applications it suits.
This project addresses the problem by qualitatively comparing various image captioning models on the relevance, coherence, and contextual accuracy of their captions. Instead of relying on metric scores, the analysis assesses the strengths and weaknesses of each model through visual inspection and contextual assessment. This approach offers practical insight into how the models behave in real-world conditions, particularly how they handle images containing multiple objects and subtle visual details. By identifying such patterns and limitations in the outputs, the project improves our understanding of the applicability and effectiveness of different image captioning models.
Methodology
Our methodology combines early and late fusion approaches with different neural network architectures for the task of image captioning. Our experiments are based on the Flickr8k dataset. This section describes the dataset, pre-processing steps, feature extraction techniques, model architectures, fusion strategies, and the training process.
Dataset
We use the Flickr8k dataset as our primary dataset; it contains 8,000 images with five human-annotated captions each. The diversity of scenes in the dataset allows the models to learn a wide range of visual-textual relationships. We normalize the images and tokenize and encode the captions so that the inputs are standardized.
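A minimal sketch of the caption preprocessing step is shown below, assuming the Keras text utilities; the sample captions, the startseq/endseq markers, and the variable names are illustrative placeholders rather than the exact setup used in the project.

```python
# Minimal sketch: tokenizing and encoding captions with Keras utilities.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

captions = ["a dog runs across the grass", "two children play in a park"]  # placeholders

# Wrap each caption with explicit start/end markers so the decoder
# knows where a sentence begins and ends.
captions = ["startseq " + c + " endseq" for c in captions]

tokenizer = Tokenizer(oov_token="unk")
tokenizer.fit_on_texts(captions)

# Map words to integer ids and pad all sequences to a common length.
sequences = tokenizer.texts_to_sequences(captions)
max_len = max(len(s) for s in sequences)
padded = pad_sequences(sequences, maxlen=max_len, padding="post")

vocab_size = len(tokenizer.word_index) + 1  # +1 for the padding index 0
```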
All CNN encoders (ResNet, Inception, and VGG) are pre-trained on the ImageNet dataset. We use the outputs of their final convolutional layers as the visual input features for the fusion models.
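The sketch below illustrates this feature extraction step with the Keras applications API; ResNet50 is shown as an example (InceptionV3 or VGG16 would be swapped in the same way), and global average pooling of the final convolutional map is an assumption made for illustration.

```python
# Minimal sketch: extracting visual features with an ImageNet pre-trained CNN.
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image

# include_top=False removes the classification head; pooling="avg"
# collapses the final convolutional map into a single feature vector.
encoder = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def extract_features(img_path):
    """Load one image from disk and return its pooled CNN features."""
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))
    return encoder.predict(x, verbose=0)  # shape (1, 2048) for ResNet50

# Quick shape check with a random stand-in for a real Flickr8k image.
dummy = preprocess_input(np.random.rand(1, 224, 224, 3) * 255.0)
print(encoder.predict(dummy, verbose=0).shape)  # (1, 2048)
```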
For caption generation, we consider two recurrent language models:
• LSTM: renowned for its capacity to capture long-term dependencies, which is advantageous for generating contextually consistent captions.
• GRU: a more straightforward alternative to LSTM that offers computational efficiency while still delivering robust performance on sequential tasks.
The visual features are integrated with the text features produced by either the LSTM or GRU
to create the final captioning model.
Fusion Strategies
This study investigates two methods of fusion:
• Early Fusion: integrates visual and text features in the initial layers, enabling the model to build a shared representation at an early stage.
• Late Fusion: keeps the features separate until the final layers, allowing each modality to preserve its distinct properties before they are combined for caption generation.
Employing three CNNs, two RNNs, and two fusion strategies, we train 12 configurations in total (3 × 2 × 2).
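The sketch below shows how the two fusion strategies can be wired up with the Keras functional API. The embedding size, hidden sizes, concatenation as the merge operation, and the placeholder dimensions (2048-d image features, vocabulary size, maximum caption length) are illustrative assumptions, not necessarily the exact configuration trained in this project.

```python
# Minimal sketch: early vs. late fusion of image and text features.
from tensorflow.keras.layers import (Input, Dense, Embedding, LSTM,
                                     RepeatVector, Concatenate, Dropout)
from tensorflow.keras.models import Model

vocab_size, max_len, feat_dim = 8000, 35, 2048  # placeholder sizes

def build_model(fusion="late"):
    img_in = Input(shape=(feat_dim,), name="image_features")
    txt_in = Input(shape=(max_len,), name="caption_tokens")
    emb = Embedding(vocab_size, 256)(txt_in)

    if fusion == "early":
        # Early fusion: tile the image vector along the time axis and
        # concatenate it with the word embeddings before the LSTM.
        img_seq = RepeatVector(max_len)(Dense(256, activation="relu")(img_in))
        merged = Concatenate()([img_seq, emb])
        x = LSTM(256)(merged)
    else:
        # Late fusion: encode each modality separately and combine the
        # two vectors only just before the output layer.
        img_vec = Dense(256, activation="relu")(img_in)
        txt_vec = LSTM(256)(emb)
        x = Concatenate()([img_vec, txt_vec])

    x = Dense(256, activation="relu")(Dropout(0.5)(x))
    out = Dense(vocab_size, activation="softmax")(x)
    return Model(inputs=[img_in, txt_in], outputs=out)
```

In the early-fusion branch the image vector accompanies every timestep seen by the LSTM, while in the late-fusion branch the two modalities meet only at the final layers, mirroring the distinction described above.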
Training Procedure
Each model is trained with TensorFlow on Google Colab for 50 epochs. We minimize the cross-entropy loss with the Adam optimizer at an initial learning rate of 1 × 10^-4 and a batch size of 32. Early stopping on the validation loss is used to prevent overfitting, and hyper-parameters are tuned to obtain the best results for each configuration.
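A minimal sketch of this training setup is given below, reusing the build_model function from the fusion sketch. The dummy arrays, the sparse categorical cross-entropy variant, and the early-stopping patience of 5 are assumptions made for illustration.

```python
# Minimal sketch of the training configuration: Adam at 1e-4,
# cross-entropy loss, batch size 32, 50 epochs, early stopping
# on the validation loss.
import numpy as np
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

model = build_model(fusion="late")  # from the fusion sketch above
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy")

n = 64  # tiny stand-in for the pre-extracted Flickr8k training set
img_feats = np.random.rand(n, 2048).astype("float32")
seqs = np.random.randint(1, 8000, size=(n, 35))
next_words = np.random.randint(1, 8000, size=(n,))  # next-word targets

early_stop = EarlyStopping(monitor="val_loss", patience=5,  # patience is an assumption
                           restore_best_weights=True)

history = model.fit([img_feats, seqs], next_words,
                    validation_split=0.2, epochs=50, batch_size=32,
                    callbacks=[early_stop])
```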
Results
[Figure: training loss curves for the model configurations]
All models show a consistent decrease in loss over the epochs, indicating successful learning. The decrease is more pronounced in the earlier epochs, which is typical for neural networks: they learn quickly at first and slow down as they converge. ResNet outperforms VGG and Inception, late fusion gives better results than early fusion, and LSTM outperforms GRU. We also note that late fusion takes longer to train than early fusion, because the image and text features are passed through separate branches rather than in a single concatenated pass.
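For reference, loss curves of this kind can be produced directly from the History object returned by model.fit; the sketch below assumes the history variable from the training sketch above.

```python
# Minimal sketch: plotting training and validation loss per epoch.
import matplotlib.pyplot as plt

plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("cross-entropy loss")
plt.legend()
plt.show()
```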
Implications
This report's findings highlight the adaptability of fusion methods and the critical role that feature extraction architectures play in image captioning. Early fusion promotes shared feature representations and thus makes it easier for a model to capture complicated visual-textual relationships. In contrast, late fusion retains modality-specific features and suits situations in which information unique to each modality is significant. The use of ResNet, Inception, and VGG for image feature extraction underscores the need to capture diverse, multiscale, and interpretable characteristics. Similarly, the use of LSTM and GRU for text features reflects the need to capture long-term dependencies; LSTM offers higher capacity while GRU offers greater computational efficiency, so the trade-off can be navigated according to resource availability. Lastly, the diversity of the Flickr8k dataset provides a robust learning environment for varied relationships between the visual and textual modalities, although results may differ on domain-specific datasets. The examination of 12 configurations illustrates the scalability of the methodology and the flexibility of combining different architectures and fusion strategies for different tasks.
Conclusion
Future work will extend the methodology to larger and more diverse datasets such as Flickr30k and MS-COCO, and will test its adaptability on domain-specific datasets. Feature extraction may be further improved with architectures such as EfficientNet or vision transformers, and with transformer-based language models such as BERT for text. Hybrid fusion strategies, attention mechanisms, and optimization for real-time applications could further improve the robustness and scalability of the methodology.
References
• Krizhevsky, A., Sutskever, I., & Hinton, G. E. - ImageNet Classification with Deep
Convolutional Neural Networks
• He, K., Zhang, X., Ren, S., & Sun, J. - Deep Residual Learning for Image Recognition
• Simonyan, K., & Zisserman, A. - Very Deep Convolutional Networks for Large-Scale Image
Recognition
• Hochreiter, S., & Schmidhuber, J. - Long Short-Term Memory
• Karpathy, A., & Fei-Fei, L. - Deep Visual-Semantic Alignments for Generating Image
Descriptions
• Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. - Show and Tell: A Neural Image Caption
Generator
• Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y.
- Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
• Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko,
K., & Darrell, T. - Long-term Recurrent Convolutional Networks for Visual Recognition and
Description
• You, Q., Jin, H., Wang, Z., Fang, C., & Luo, J. - Image Captioning with Semantic Attention
• Anderson, P., Fernando, B., Johnson, M., & Gould, S. - SPICE: Semantic Propositional
Image Caption Evaluation
• Hossain, M. D., Sohel, F., Shiratuddin, M. F., & Laga, H. - A Comprehensive Survey of
Deep Learning for Image Captioning
• Marimuthu K., Aditya M., Marius M., & Dietrich K. - Fusion Models for Improved Visual
Captioning
• Ignazio G., Alessandro C., Shah N., & Muhammad K. J. - Image and Encoded Text Fusion for
Multi-Modal Classification
• https://www.kaggle.com/datasets/shadabhussain/flickr8k
• https://github.com/Div99/Image-Captioning
• https://github.com/yashk2810/Image-Captioning