DL Project Report
DL Project Report
1 Introduction 3
4 Methodology 6
5 Results 7
6 Conclusion 9
7 References 9
8 Appendix 10
2
1. Introduction
Image caption generation is a sophisticated task that connects computer vision
visual information. The increasing volume of image data on the internet has
made this task particularly relevant in domains like digital image management,
and social media automation. This project presents an approach that utilizes
extracting deep visual features from images, and Recurrent Neural Networks
combining these two powerful deep learning models, the system attempts to
3
2. Aim and Scope
The primary aim of this project is to design and implement a deep learning-
based system that can automatically generate accurate and context-aware
captions for images. This objective is approached through the integration of
visual feature extraction using ResNet (a deep CNN pretrained on ImageNet)
and sequence modeling using LSTM networks. The specific goals include:
The scope of the project is limited to English language captions and static
images. The system does not cover real-time captioning or multilingual support
but lays the groundwork for future expansion in these directions.
4
3. Dataset and Experimental Setup
The dataset used in this project is the Flickr8k dataset, which consists of 8,000
images, each annotated with five unique human-generated captions. These
captions describe the content and context of the images in natural language,
making the dataset suitable for supervised learning. The images vary in
complexity, object count, and background, providing a diverse training set.
Preprocessing steps:
• Image Preprocessing: Images are resized to 224x224 pixels and
normalized to match the input requirements of ResNet50.
• Caption Preprocessing: Captions are converted to lowercase,
punctuation is removed, and all words are tokenized and indexed.
Special start (<start>) and end (<end>) tokens are added to each caption.
• Vocabulary Limitation: A threshold frequency is used to limit the
vocabulary to the most commonly used words to reduce computational
complexity.
Experimental Setup:
• Programming Language: Python 3.10
• Frameworks: TensorFlow 2.x and Keras
• Hardware: NVIDIA GPU with CUDA support (e.g., NVIDIA RTX 3060)
• Training Split: 80% training, 10% validation, 10% testing
• Batch Size: 64
• Learning Rate: 0.001 with decay
• Loss Function: Categorical cross-entropy
• Optimizer: Adam
This setup ensures a robust training environment and enables efficient
experimentation with model parameters.
5
4. Methodology
The methodology consists of two core modules: image feature extraction and
caption generation.
A. Image Feature Extraction with ResNet50:
• A pretrained ResNet50 model (trained on ImageNet) is used.
• The final classification layer is removed, and the output of the last
average pooling layer (2048-dimensional vector) is used as the image
representation.
• This vector captures high-level semantic information from the image and
serves as input to the captioning model.
B. Caption Generation using LSTM:
• The image feature vector is first passed through a fully connected layer
to project it into the same embedding space as the text data.
• Captions are embedded using an embedding layer that transforms word
indices into dense vectors.
• An LSTM layer processes these embedded vectors, maintaining temporal
dependencies and learning to generate sequences word by word.
• At each timestep, the model predicts the next word using a softmax
activation over the vocabulary.
• During training, teacher forcing is used where the ground truth word is
fed as the next input to the LSTM.
C. Training Details:
• Epochs: 20–30 depending on convergence
• Loss is computed for each predicted word in the sequence.
• Early stopping and model checkpointing are used to prevent overfitting.
6
D. Evaluation:
• The BLEU (Bilingual Evaluation Understudy) score is used to evaluate
caption quality by comparing generated captions to reference human-
written captions.
• Qualitative analysis is done by visually comparing generated captions to
input images.
7
Quantitative Results:
• BLEU-1: 0.70
• BLEU-2: 0.58
• BLEU-3: 0.48
• BLEU-4: 0.41
These scores suggest a good alignment between generated captions and
human-annotated ones, especially for shorter n-grams.
Observations:
• The model accurately describes common scenes and objects (e.g., dogs,
people, vehicles).
• It occasionally generates repetitive or grammatically awkward captions.
• Performance drops for abstract scenes or those with multiple unrelated
subjects.
Limitations:
• No attention mechanism, so focus on specific image regions is lacking.
• Cannot handle unseen or rare objects due to limited vocabulary.
• Generates one caption per image; no diversity in multiple outputs.
8
6. Conclusion
This project successfully demonstrates that combining a pretrained CNN like
ResNet with an RNN model like LSTM can produce reliable and meaningful
image captions. The CNN captures visual semantics, while the LSTM learns the
sequential structure of natural language. The use of the Flickr8k dataset allows
for effective training and benchmarking.
While results are promising, further improvements can be achieved by
incorporating attention mechanisms to focus on image regions during caption
generation or using Transformer-based architectures such as Vision
Transformers (ViT) and BERT/GPT-style models for language modeling. The
addition of multilingual captioning, user personalization, and real-time
deployment could significantly broaden the applicability of this system.
7. References
• Karpathy, A., & Fei-Fei, L. (2015). Deep Visual-Semantic Alignments for
Generating Image Descriptions. CVPR.
• He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for
Image Recognition. CVPR.
• Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory.
Neural Computation.
• Flickr8k Dataset. University of Illinois at Urbana-Champaign:
https://forms.illinois.edu/sec/1713398
• TensorFlow Documentation: https://www.tensorflow.org/
• Keras API Reference: https://keras.io/
9
8. Appendix
A. Sample Captions:
1. *"A man riding a bicycle through a crowded street."
2. *"A small dog jumping through a hoop."
3. *"A group of friends having a picnic in a park."
B. Model Architecture Summary:
• ResNet50 for feature extraction
C. Hyperparameters:
• Batch size: 64
• Epochs: 40
• Dropout: 0.5
D. Training Graphs:
• Loss vs Epochs (Training and Validation)
10