
Comparative Study of Image Captioning Models

Code available at: https://github.com/Rhythm1002/Analysis-of-Image-Captioning-Models

Team Members
2K22/MC/68, Ishita Sharma
2K22/MC/81, Kumar Utsav
2K22/MC/110, Pradeep Dheenraj Singh
2K22/MC/111, Prakhar Dhyani
2K22/MC/128, Rhythm Narayan

Abstract
The core of this project is the construction of an automatic image captioning system that produces natural-language descriptions of images using deep learning methods. It helps bridge the gap between visual and textual data by combining computer vision with natural language processing. We used the Flickr8K dataset to train neural network models that jointly learn visual features and their corresponding text descriptions. Visual features are extracted with state-of-the-art CNNs, including ResNet, Inception, and VGG, which produce rich, high-dimensional image representations. Caption generation is handled by recurrent neural networks, specifically Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) models.

We further explore two strategies for combining visual and textual information: early fusion and late fusion. In early fusion, features are combined at an early stage of the network and processed jointly by the subsequent layers, while in late fusion the outputs of the separate branches are combined only at the final stages. Systematic evaluation of these strategies lets us assess which is relatively more effective for generating captions that are both accurate and contextually relevant.

The configurations evaluated in this study span the fusion strategies, CNN architectures, and RNN-based language models described above. Rather than relying solely on standard metric-based evaluation, the generated captions were assessed qualitatively for relevance, coherence, and contextual accuracy. The results provide critical insight into the relative strengths and weaknesses of the various configurations and their practical applicability in real-life situations.

The results have practical importance for accessibility, content generation, and image search, as this work deepens the understanding of multimodal integration in image captioning. The structured comparison of architectures and fusion strategies presented here thus supports the development of more efficient and effective automated image captioning models.
Introduction
Despite the remarkable achievements of deep learning, captioning images accurately and in context remains a challenging task because it requires understanding complex visual content and translating it into natural language. This paper compares the performance of different image captioning models to determine where each is strong and for which applications each is best suited.

The project addresses this problem by qualitatively comparing various image captioning models on the relevance, coherence, and contextual accuracy of the captions they produce. Instead of relying on metric scores alone, the analysis provides an overall assessment of the strengths and weaknesses of each model based on visual inspection and contextual judgment. This approach yields practical insights into how the models behave in real-world settings, particularly how they handle images containing multiple objects and subtle visual details. By identifying these patterns and limitations in the outputs, our project improves understanding of the applicability and effectiveness of different image captioning models.

Methodology
Our methodology combines early and late fusion approaches with different neural network architectures for the task of image captioning. Our experiments are based on the Flickr8k dataset. This section describes the dataset, the pre-processing steps, the feature extraction techniques, the model architectures, the fusion strategies, and the training process.

Dataset
We use the Flickr8k dataset as our primary dataset; it contains 8,000 images, each paired with five human-annotated captions. The diversity of scenes in the dataset allows the models to learn a wide range of visual-textual relationships. We normalize the images and tokenize and encode the captions so that the inputs are standardized.
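As a rough illustration, the caption preprocessing described above might look like the following TensorFlow/Keras sketch; the start/end tokens, vocabulary cap, and maximum caption length are illustrative assumptions rather than values taken from this report.

import tensorflow as tf

MAX_VOCAB = 8000   # assumed vocabulary cap, not specified in the report
MAX_LEN = 35       # assumed maximum caption length, not specified in the report

def prepare_captions(raw_captions):
    # Lowercase each caption and wrap it with start/end markers for the decoder.
    return ["startseq " + c.lower().strip() + " endseq" for c in raw_captions]

captions = prepare_captions([
    "A child in a pink dress is climbing up a set of stairs.",
    "A black dog and a spotted dog are fighting.",
])

# Build an integer vocabulary over the training captions and pad to a fixed length.
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=MAX_VOCAB, oov_token="<unk>")
tokenizer.fit_on_texts(captions)
sequences = tokenizer.texts_to_sequences(captions)
padded = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=MAX_LEN, padding="post")
print(padded.shape)  # (2, 35)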

Image Feature Extraction


In the image encoding stage, we conduct experiments with three different CNNs:

• ResNet: based on residual connections, which enable efficient training of deep networks and yield rich feature maps.
• Inception: captures multiscale spatial hierarchies efficiently through factorized convolutions.
• VGG: uses uniform convolutional layers to provide consistent and interpretable features.

All CNNs are pre-trained on the ImageNet dataset. We use final convolutional layer outputs as input
features for fusion models.
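A minimal sketch of this feature-extraction step is shown below, assuming the Keras application variants ResNet50, InceptionV3, and VGG16 with frozen ImageNet weights and global average pooling over the last convolutional maps; the specific variants, input size, and pooling choice are assumptions, since the report does not name them.

import numpy as np
import tensorflow as tf

def build_encoder(name="resnet"):
    # Each backbone is loaded with ImageNet weights and kept frozen; only the
    # captioning head is trained later.
    if name == "resnet":
        base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False, pooling="avg")
        preprocess = tf.keras.applications.resnet50.preprocess_input
    elif name == "inception":
        base = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False, pooling="avg")
        preprocess = tf.keras.applications.inception_v3.preprocess_input
    else:  # "vgg"
        base = tf.keras.applications.VGG16(weights="imagenet", include_top=False, pooling="avg")
        preprocess = tf.keras.applications.vgg16.preprocess_input
    base.trainable = False
    return base, preprocess

def extract_features(image_paths, name="resnet", size=224):
    # Returns one pooled feature vector per image (e.g. 2048-d for ResNet50).
    model, preprocess = build_encoder(name)
    feats = []
    for path in image_paths:
        img = tf.keras.utils.load_img(path, target_size=(size, size))
        x = tf.keras.utils.img_to_array(img)[np.newaxis, ...]
        feats.append(model.predict(preprocess(x), verbose=0)[0])
    return np.stack(feats)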

Text Feature Extraction


For the sequential captions, we use RNNs:

• LSTM: renowned for its capacity to capture long-term dependencies, which is advantageous for generating contextually consistent captions.
• GRU: a simpler alternative to LSTM that offers computational efficiency while still delivering robust performance on sequential tasks.

The visual features are integrated with the text features produced by either the LSTM or GRU
to create the final captioning model.
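As an illustration, the text branch might be built as follows; the embedding dimension and hidden-unit count are assumptions, since the report does not specify them.

import tensorflow as tf

def build_text_branch(vocab_size=8000, max_len=35, embed_dim=256, units=256, cell="lstm"):
    # Token IDs pass through an embedding layer; an LSTM or GRU then summarises the sequence.
    caption_in = tf.keras.Input(shape=(max_len,), name="caption_tokens")
    embedded = tf.keras.layers.Embedding(vocab_size, embed_dim, mask_zero=True)(caption_in)
    rnn = tf.keras.layers.LSTM(units) if cell == "lstm" else tf.keras.layers.GRU(units)
    return caption_in, rnn(embedded)

The returned caption input and sequence summary are what the fusion strategies in the next section combine with the image features.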

Fusion Strategies
This study investigates two methods of fusion:

• Early Fusion: integrates visual and text features in the initial layers, enabling the model to establish a shared representation at an early stage.
• Late Fusion: keeps the features separate until the final layers, allowing each modality to preserve its distinct properties before they are combined for caption generation.

Employing three CNNs, two RNNs, and two fusion strategies, we train 12 configurations in total.
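The sketch below contrasts the two strategies in Keras. The layer sizes, the repeated image vector used for early fusion, and the additive combination used for late fusion are illustrative assumptions, not the exact architectures trained in the experiments.

import tensorflow as tf

def build_caption_model(feat_dim, vocab_size, max_len, units=256, fusion="late", cell="lstm"):
    img_in = tf.keras.Input(shape=(feat_dim,), name="image_features")
    cap_in = tf.keras.Input(shape=(max_len,), name="caption_tokens")
    img_vec = tf.keras.layers.Dense(units, activation="relu")(img_in)
    emb = tf.keras.layers.Embedding(vocab_size, units)(cap_in)
    RNN = tf.keras.layers.LSTM if cell == "lstm" else tf.keras.layers.GRU

    if fusion == "early":
        # Early fusion: tile the image vector along the sequence and let the RNN
        # process image and text features jointly from the first layer.
        img_seq = tf.keras.layers.RepeatVector(max_len)(img_vec)
        merged = tf.keras.layers.Concatenate()([img_seq, emb])
        hidden = RNN(units)(merged)
    else:
        # Late fusion: keep the text branch separate and combine the two modality
        # vectors only just before the output layer.
        txt_vec = RNN(units)(emb)
        hidden = tf.keras.layers.Add()([img_vec, txt_vec])

    next_word = tf.keras.layers.Dense(vocab_size, activation="softmax")(hidden)
    return tf.keras.Model([img_in, cap_in], next_word)

Sweeping the encoder over {resnet, inception, vgg}, the cell over {lstm, gru}, and the fusion mode over {early, late} yields the 12 configurations referred to above.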
Training Procedure
Each model is trained with TensorFlow on Google Colab for 50 epochs. The cross-entropy loss is minimized with the Adam optimizer, using an initial learning rate of 1 x 10^-4 and a batch size of 32. Early stopping based on the validation loss is used to prevent overfitting, and hyper-parameter tuning is applied to achieve the best results across the different configurations.
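A hedged sketch of this training setup is given below, reusing build_caption_model from the fusion sketch above; the random placeholder arrays, the early-stopping patience, and the sparse cross-entropy variant are assumptions made only to keep the example self-contained.

import numpy as np
import tensorflow as tf

model = build_caption_model(feat_dim=2048, vocab_size=8000, max_len=35, fusion="late", cell="lstm")
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy")

# Random stand-ins for the prepared Flickr8k image features, caption prefixes,
# and next-word targets.
img_feats = np.random.rand(64, 2048).astype("float32")
cap_inputs = np.random.randint(1, 8000, size=(64, 35))
next_words = np.random.randint(1, 8000, size=(64,))

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
model.fit([img_feats, cap_inputs], next_words,
          validation_split=0.1, epochs=50, batch_size=32,
          callbacks=[early_stop])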

Experimental Setup and Tools


All models are implemented in TensorFlow and trained on Google Colab using the GPUs available there. The results for each configuration are averaged over multiple runs to confirm stability and reduce variance due to randomness.

Results

[Figure: training loss curves shared across all 12 model configurations]

All models show a consistent decrease in loss over the epochs, indicating successful learning. The rate of decrease is more pronounced in the earlier epochs, which is typical of neural networks: they learn quickly at first and then slow down as they converge. ResNet performs better than VGG and Inception, late fusion gives better results than early fusion, and LSTM outperforms GRU. We also note that late fusion takes longer to train than early fusion, because the image and text features are passed through the network separately rather than in a single concatenated pass.
Implications

This paper's findings highlight the adaptability of fusion methods and the critical role that feature-extraction architectures play in image captioning. Early fusion promotes shared feature representations and thus makes it easier for a model to capture complicated visual-textual relationships. In contrast, late fusion retains modality-specific features and suits situations in which information unique to each modality is significant. The use of ResNet, Inception, and VGG for image feature extraction underlines the value of acquiring diverse, multiscale, and interpretable characteristics. Similarly, the use of LSTM and GRU for text feature extraction reflects the need to capture long-term dependencies, with LSTM offering higher capacity and GRU offering greater computational efficiency; the trade-off between them can be resolved flexibly based on resource availability. Lastly, the diversity of the Flickr8k dataset provides a robust learning environment covering varied relations between the visual and textual modalities, although results may differ when the models are applied to domain-specific datasets. The examination of 12 configurations illustrates the scalability of the methodology and shows that diverse combinations of architectures and fusion strategies can be adapted to different tasks.

Conclusion

This study demonstrates the effectiveness of multi-modal learning in generating meaningful captions by combining visual and textual features. Fusion strategies are crucial for balancing feature blending against feature retention, and no single architecture or fusion strategy is universally superior. The choice of model components is task-specific and depends on what matters most: computational efficiency or interpretability. ResNet and LSTM will most likely outperform the alternatives because they are deeper and produce richer representations, but they will also probably require more computational resources. Inception/GRU configurations provide computational efficiency with reasonable performance.

Future work will extend the methodology to larger and more diverse datasets such as Flickr30k and MS-COCO, and test its adaptability on domain-specific datasets. Feature extraction may be further improved with architectures such as EfficientNet or vision transformers, and text modeling with transformer-based models such as BERT. Hybrid fusion strategies, attention mechanisms, and optimization for real-time applications could improve the robustness and scalability of the methodology.

Dataset

Dataset available at: https://www.kaggle.com/datasets/shadabhussain/flickr8k


Bibliography

• Krizhevsky, A., Sutskever, I., & Hinton, G. E. - ImageNet Classification with Deep
Convolutional Neural Networks
• He, K., Zhang, X., Ren, S., & Sun, J. - Deep Residual Learning for Image Recognition
• Simonyan, K., & Zisserman, A. - Very Deep Convolutional Networks for Large-Scale Image
Recognition
• Hochreiter, S., & Schmidhuber, J. - Long Short-Term Memory
• Karpathy, A., & Fei-Fei, L. - Deep Visual-Semantic Alignments for Generating Image
Descriptions
• Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. - Show and Tell: A Neural Image Caption
Generator
• Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y.
- Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
• Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko,
K., & Darrell, T. - Long-term Recurrent Convolutional Networks for Visual Recognition and
Description
• You, Q., Jin, H., Wang, Z., Fang, C., & Luo, J. - Image Captioning with Semantic Attention
• Anderson, P., Fernando, B., Johnson, M., & Gould, S. - SPICE: Semantic Propositional
Image Caption Evaluation
• Hossain, M. D., Sohel, F., Shiratuddin, M. F., & Laga, H. - A Comprehensive Survey of
Deep Learning for Image Captioning
• Marimuthu K., Aditya M., Marius M., & Dietrich K. - Fusion Models for Improved Visual Captioning
• Ignazio G., Alessandro C., Shah N., & Muhammad K. J. - Image and Encoded Text Fusion for Multi-Modal Classification
• https://www.kaggle.com/datasets/shadabhussain/flickr8k
• https://github.com/Div99/Image-Captioning
• https://github.com/yashk2810/Image-Captioning
