Project Report: Image Captioning Models
Prakhar Dhyani
Team Members
2K22/MC/68, Ishita Sharma
2K22/MC/81, Kumar Utsav
2K22/MC/110, Pradeep Dheenraj Singh
2K22/MC/111, Prakhar Dhyani
2K22/MC/128, Rhythm Narayan
Abstract
The core of this project is the construction of an automatic image captioning system that can produce descriptions for images using deep learning methods. It helps bridge the gap between visual and textual data by combining computer vision with natural language processing. We used the Flickr8K dataset and trained neural network models to learn visual features from images together with their text descriptions. Visual features are extracted with state-of-the-art CNNs, including ResNet, Inception, and VGG networks, which generate rich, high-dimensional representations of images. Caption generation is handled by RNN-based language models, specifically Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks.
We further explore two fusion strategies for combining visual and textual information: early fusion and late fusion. In early fusion, features are combined at an early stage of the network and processed by the subsequent layers, while in late fusion, the modality-specific outputs are combined only at the final stages. Systematic evaluation of these strategies lets us assess which is more effective for generating captions that are both accurate and contextually relevant.
The configurations studied span these fusion strategies, CNN architectures, and RNN-based language models. Captions were evaluated qualitatively for relevance, coherence, and contextual accuracy, rather than through the usual metric-based evaluation. The results provide insight into the relative strengths and weaknesses of the various configurations, as well as their practical applicability in real-life situations.
The findings have practical relevance for accessibility, content generation, and image search, and they deepen our understanding of multimodal integration in image captioning. The report thus demonstrates how a structured comparison of architectures and fusion strategies can advance the development of efficient and effective automated image captioning models.
Introduction
Despite the remarkable achievements of deep learning, captioning images accurately and in context remains a challenging task, owing to the complexity of understanding visual content and translating it into natural language. This report compares the performance of different image captioning models to determine where each is strong and which applications it suits.
This project addresses the problem by qualitatively comparing various image captioning models on the relevance, coherence, and contextual accuracy of their captions. Instead of relying on metric scores, the analysis assesses the strengths and weaknesses of each model through visual inspection and contextual assessment. This approach offers practical insight into how the models behave in real-world conditions, particularly how they handle images containing multiple objects and subtle visual details. By identifying such patterns and limitations in the outputs, the project improves our understanding of the applicability and effectiveness of different image captioning models.
Methodology
Our methodology combines early and late fusion approaches with different neural network architectures for the task of image captioning. Our experiments are based on the Flickr8k dataset. This section describes the dataset, pre-processing steps, feature extraction techniques, model architectures, fusion strategies, and the training process.
Dataset
We use the Flickr8k dataset as our primary dataset; it contains 8,000 images with five human-annotated captions each. The diversity of scenes in the dataset allows the models to learn a wide range of visual-textual relationships. We normalize the images and tokenize and encode the captions so that the inputs are standardized.
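A minimal sketch of the caption preprocessing step is shown below, assuming the Keras text utilities; the sample captions, the startseq/endseq markers, and the variable names are illustrative placeholders rather than the exact setup used in the project.

```python
# Minimal sketch: tokenizing and encoding captions with Keras utilities.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

captions = ["a dog runs across the grass", "two children play in a park"]  # placeholders

# Wrap each caption with explicit start/end markers so the decoder
# knows where a sentence begins and ends.
captions = ["startseq " + c + " endseq" for c in captions]

tokenizer = Tokenizer(oov_token="unk")
tokenizer.fit_on_texts(captions)

# Map words to integer ids and pad all sequences to a common length.
sequences = tokenizer.texts_to_sequences(captions)
max_len = max(len(s) for s in sequences)
padded = pad_sequences(sequences, maxlen=max_len, padding="post")

vocab_size = len(tokenizer.word_index) + 1  # +1 for the padding index 0
```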
All CNN encoders (ResNet, Inception, and VGG) are pre-trained on the ImageNet dataset. We use the outputs of their final convolutional layers as the visual input features for the fusion models.
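The sketch below illustrates this feature extraction step with the Keras applications API; ResNet50 is shown as an example (InceptionV3 or VGG16 would be swapped in the same way), and global average pooling of the final convolutional map is an assumption made for illustration.

```python
# Minimal sketch: extracting visual features with an ImageNet pre-trained CNN.
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image

# include_top=False removes the classification head; pooling="avg"
# collapses the final convolutional map into a single feature vector.
encoder = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def extract_features(img_path):
    """Load one image from disk and return its pooled CNN features."""
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))
    return encoder.predict(x, verbose=0)  # shape (1, 2048) for ResNet50

# Quick shape check with a random stand-in for a real Flickr8k image.
dummy = preprocess_input(np.random.rand(1, 224, 224, 3) * 255.0)
print(encoder.predict(dummy, verbose=0).shape)  # (1, 2048)
```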
For caption generation, we consider two recurrent language models:
• LSTM: renowned for its capacity to capture long-term dependencies, which is advantageous for generating contextually consistent captions.
• GRU: a more straightforward alternative to LSTM that offers computational efficiency while still delivering robust performance on sequential tasks.
The visual features are integrated with the text features produced by either the LSTM or GRU
to create the final captioning model.
Fusion Strategies
This study investigates two methods of fusion:
• Early Fusion: integrates visual and text features in the initial layers, enabling the model to build a shared representation at an early stage.
• Late Fusion: keeps the features separate until the final layers, allowing each modality to preserve its distinct properties before they are combined for caption generation.
Employing three CNNs, two RNNs, and two fusion strategies, we train 12 configurations in total (3 × 2 × 2).
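The sketch below shows how the two fusion strategies can be wired up with the Keras functional API. The embedding size, hidden sizes, concatenation as the merge operation, and the placeholder dimensions (2048-d image features, vocabulary size, maximum caption length) are illustrative assumptions, not necessarily the exact configuration trained in this project.

```python
# Minimal sketch: early vs. late fusion of image and text features.
from tensorflow.keras.layers import (Input, Dense, Embedding, LSTM,
                                     RepeatVector, Concatenate, Dropout)
from tensorflow.keras.models import Model

vocab_size, max_len, feat_dim = 8000, 35, 2048  # placeholder sizes

def build_model(fusion="late"):
    img_in = Input(shape=(feat_dim,), name="image_features")
    txt_in = Input(shape=(max_len,), name="caption_tokens")
    emb = Embedding(vocab_size, 256)(txt_in)

    if fusion == "early":
        # Early fusion: tile the image vector along the time axis and
        # concatenate it with the word embeddings before the LSTM.
        img_seq = RepeatVector(max_len)(Dense(256, activation="relu")(img_in))
        merged = Concatenate()([img_seq, emb])
        x = LSTM(256)(merged)
    else:
        # Late fusion: encode each modality separately and combine the
        # two vectors only just before the output layer.
        img_vec = Dense(256, activation="relu")(img_in)
        txt_vec = LSTM(256)(emb)
        x = Concatenate()([img_vec, txt_vec])

    x = Dense(256, activation="relu")(Dropout(0.5)(x))
    out = Dense(vocab_size, activation="softmax")(x)
    return Model(inputs=[img_in, txt_in], outputs=out)
```

In the early-fusion branch the image vector accompanies every timestep seen by the LSTM, while in the late-fusion branch the two modalities meet only at the final layers, mirroring the distinction described above.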
Training Procedure
Each model is trained with TensorFlow on Google Colab for 50 epochs. We minimize the cross-entropy loss with the Adam optimizer at an initial learning rate of 1 × 10^-4 and a batch size of 32. Early stopping on the validation loss is used to prevent overfitting, and hyper-parameters are tuned to obtain the best results for each configuration.
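A minimal sketch of this training setup is given below, reusing the build_model function from the fusion sketch. The dummy arrays, the sparse categorical cross-entropy variant, and the early-stopping patience of 5 are assumptions made for illustration.

```python
# Minimal sketch of the training configuration: Adam at 1e-4,
# cross-entropy loss, batch size 32, 50 epochs, early stopping
# on the validation loss.
import numpy as np
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

model = build_model(fusion="late")  # from the fusion sketch above
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy")

n = 64  # tiny stand-in for the pre-extracted Flickr8k training set
img_feats = np.random.rand(n, 2048).astype("float32")
seqs = np.random.randint(1, 8000, size=(n, 35))
next_words = np.random.randint(1, 8000, size=(n,))  # next-word targets

early_stop = EarlyStopping(monitor="val_loss", patience=5,  # patience is an assumption
                           restore_best_weights=True)

history = model.fit([img_feats, seqs], next_words,
                    validation_split=0.2, epochs=50, batch_size=32,
                    callbacks=[early_stop])
```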
Results
[Figure: training loss curves for the model configurations]
All models show a consistent decrease in loss over the epochs, indicating successful learning. The decrease is more pronounced in the earlier epochs, which is typical for neural networks: they learn quickly at first and slow down as they converge. ResNet outperforms VGG and Inception, late fusion gives better results than early fusion, and LSTM outperforms GRU. We also note that late fusion takes longer to train than early fusion, because the image and text features are passed through separate branches rather than in a single concatenated pass.
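For reference, loss curves of this kind can be produced directly from the History object returned by model.fit; the sketch below assumes the history variable from the training sketch above.

```python
# Minimal sketch: plotting training and validation loss per epoch.
import matplotlib.pyplot as plt

plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("cross-entropy loss")
plt.legend()
plt.show()
```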
Implications
This report's findings highlight the adaptability of fusion methods and the critical role that feature extraction architectures play in image captioning. Early fusion promotes shared feature representations and thus makes it easier for a model to capture complicated visual-textual relationships. In contrast, late fusion retains modality-specific features and suits situations in which information unique to each modality is significant. The use of ResNet, Inception, and VGG for image feature extraction underscores the need to capture diverse, multiscale, and interpretable characteristics. Similarly, the use of LSTM and GRU for text features reflects the need to capture long-term dependencies; LSTM offers higher capacity while GRU offers greater computational efficiency, so the trade-off can be navigated according to resource availability. Lastly, the diversity of the Flickr8k dataset provides a robust learning environment for varied relationships between the visual and textual modalities, although results may differ on domain-specific datasets. The examination of 12 configurations illustrates the scalability of the methodology and the flexibility of combining different architectures and fusion strategies for different tasks.
Conclusion
Future work will extend the methodology to larger and more diverse datasets such as Flickr30k and MS-COCO, and will test its adaptability on domain-specific datasets. Feature extraction may be further improved with architectures such as EfficientNet or vision transformers, and with transformer-based language models such as BERT for text. Hybrid fusion strategies, attention mechanisms, and optimization for real-time applications could further improve the robustness and scalability of the methodology.
References
• Krizhevsky, A., Sutskever, I., & Hinton, G. E. - ImageNet Classification with Deep
Convolutional Neural Networks
• He, K., Zhang, X., Ren, S., & Sun, J. - Deep Residual Learning for Image Recognition
• Simonyan, K., & Zisserman, A. - Very Deep Convolutional Networks for Large-Scale Image
Recognition
• Hochreiter, S., & Schmidhuber, J. - Long Short-Term Memory
• Karpathy, A., & Fei-Fei, L. - Deep Visual-Semantic Alignments for Generating Image
Descriptions
• Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. - Show and Tell: A Neural Image Caption
Generator
• Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y.
- Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
• Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko,
K., & Darrell, T. - Long-term Recurrent Convolutional Networks for Visual Recognition and
Description
• You, Q., Jin, H., Wang, Z., Fang, C., & Luo, J. - Image Captioning with Semantic Attention
• Anderson, P., Fernando, B., Johnson, M., & Gould, S. - SPICE: Semantic Propositional
Image Caption Evaluation
• Hossain, M. D., Sohel, F., Shiratuddin, M. F., & Laga, H. - A Comprehensive Survey of
Deep Learning for Image Captioning
• Marimuthu K., Aditya M., Marius M., & Dietrich K. - Fusion Models for Improved Visual
Captioning
• Ignazio G., Alessandro C., Shah N., & Muhammad K. J. - Image and Encoded Text Fusion for
Multi-Modal Classification
• https://www.kaggle.com/datasets/shadabhussain/flickr8k
• https://github.com/Div99/Image-Captioning
• https://github.com/yashk2810/Image-Captioning