Project I: Image Captioning with Deep Learning
Overview
The goal of this project is to develop a deep learning model that can generate descriptive
captions from input images. Students are required to design and implement the model
completely from scratch—using only basic libraries (e.g., NumPy, low-level tensor operations
in frameworks like TensorFlow or PyTorch) without leveraging pre-built model architectures or
high-level APIs for the core components. The project will help you understand how to combine
computer vision and natural language processing techniques into one cohesive system.
Objectives
● Understand and implement an encoder-decoder architecture: Use a Convolutional
Neural Network (CNN) as the encoder to extract visual features from an image, and a
Recurrent Neural Network (RNN), or a similar sequential model such as an LSTM or GRU, as
the decoder to generate natural language descriptions.
● Learn feature extraction and sequence modeling: Gain hands-on experience with
designing neural network layers, managing the data flow between CNNs and RNNs, and
implementing sequence generation.
● Practice end-to-end system development: Build, train, and evaluate a model on a
standard image captioning dataset.
● Improve coding and debugging skills: Write all parts of the model from scratch,
thereby gaining deeper insights into how deep learning frameworks function under the
hood.
Project Requirements
1. Model Implementation
● From-Scratch Coding:
○ Important: All parts of your model must be written from scratch. You cannot use
pre-built image captioning architectures or high-level APIs that abstract away the
model’s inner workings. Basic utilities for tensor operations (from libraries such
as TensorFlow or PyTorch) are allowed, but the encoder and decoder
architectures should be manually implemented.
● Encoder:
○ Implement a CNN to extract feature representations from input images.
○ Design your own CNN architecture (e.g., a stack of convolutional and pooling
layers) rather than using a pre-trained network.
● Decoder:
○ Implement a sequential model (RNN, LSTM, or GRU) to generate captions.
○ Optionally, integrate an attention mechanism (for extra credit) that helps the
decoder focus on different parts of the image; a minimal attention sketch appears at the
end of this section.
● Integration:
○ Connect the encoder’s output to the decoder’s input and make sure the data flow
between the two components is handled correctly; a minimal wiring sketch appears at the
end of this section.
○ Preprocess input images and captions appropriately.
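To make the wiring concrete, here is a minimal sketch (in PyTorch, one of the permitted frameworks) of a hand-built CNN encoder feeding an LSTM-cell decoder. The layer sizes, the names ConvEncoder and LSTMDecoder, and the choice of a single pooled image feature are illustrative assumptions, not a required design. Note also that whether nn.LSTMCell counts as a permitted primitive is a judgment call; check with the course staff or implement the gate equations yourself.

```python
# Minimal encoder-decoder sketch (PyTorch). All sizes and names are
# illustrative assumptions, not part of the assignment specification.
import torch
import torch.nn.functional as F
from torch import nn


class ConvEncoder(nn.Module):
    """Small hand-rolled CNN: conv weights are plain parameters and the
    forward pass uses functional ops, so no pre-built architecture is used."""

    def __init__(self, feature_dim=256):
        super().__init__()
        # 3 -> 16 -> 32 channel conv filters, initialised with small random values.
        self.w1 = nn.Parameter(torch.randn(16, 3, 3, 3) * 0.05)
        self.b1 = nn.Parameter(torch.zeros(16))
        self.w2 = nn.Parameter(torch.randn(32, 16, 3, 3) * 0.05)
        self.b2 = nn.Parameter(torch.zeros(32))
        self.proj = nn.Parameter(torch.randn(32, feature_dim) * 0.05)

    def forward(self, images):                        # images: (B, 3, H, W)
        x = F.relu(F.conv2d(images, self.w1, self.b1, stride=2, padding=1))
        x = F.relu(F.conv2d(x, self.w2, self.b2, stride=2, padding=1))
        x = x.mean(dim=(2, 3))                        # global average pool -> (B, 32)
        return x @ self.proj                          # image feature -> (B, feature_dim)


class LSTMDecoder(nn.Module):
    """LSTM-cell decoder: the image feature initialises the hidden state and
    the caption is generated one token per time step."""

    def __init__(self, vocab_size, feature_dim=256, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Parameter(torch.randn(vocab_size, embed_dim) * 0.05)
        self.init_h = nn.Parameter(torch.randn(feature_dim, hidden_dim) * 0.05)
        self.cell = nn.LSTMCell(embed_dim, hidden_dim)  # or your own gate equations
        self.out = nn.Parameter(torch.randn(hidden_dim, vocab_size) * 0.05)

    def forward(self, features, captions):            # captions: (B, T) token ids
        h = torch.tanh(features @ self.init_h)         # (B, hidden_dim)
        c = torch.zeros_like(h)
        logits = []
        for t in range(captions.size(1)):              # teacher forcing over time steps
            x_t = self.embed[captions[:, t]]           # (B, embed_dim)
            h, c = self.cell(x_t, (h, c))
            logits.append(h @ self.out)                # (B, vocab_size)
        return torch.stack(logits, dim=1)              # (B, T, vocab_size)
```

During training the decoder consumes the ground-truth caption one token at a time (teacher forcing); at inference time you would instead feed back the previously predicted token.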
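For the optional attention extension, one common choice is additive (Bahdanau-style) attention over a grid of spatial encoder features rather than a single pooled vector. The sketch below shows only the scoring and weighting step; the dimension names are assumptions, and the module would still need to be wired into the decoder's per-time-step loop and fed a spatial feature map from the encoder.

```python
import torch
import torch.nn.functional as F
from torch import nn


class AdditiveAttention(nn.Module):
    """Scores each spatial location of the encoder feature map against the
    current decoder hidden state and returns a weighted context vector."""

    def __init__(self, feature_dim, hidden_dim, attn_dim=128):
        super().__init__()
        # nn.Linear is used here as a basic building block; replace with plain
        # parameters and matrix products if your from-scratch rules require it.
        self.w_feat = nn.Linear(feature_dim, attn_dim, bias=False)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, features, hidden):
        # features: (B, L, feature_dim) spatial grid flattened to L locations
        # hidden:   (B, hidden_dim) current decoder state
        scores = self.v(torch.tanh(self.w_feat(features)
                                   + self.w_hidden(hidden).unsqueeze(1)))  # (B, L, 1)
        weights = F.softmax(scores, dim=1)             # attention over locations
        context = (weights * features).sum(dim=1)      # (B, feature_dim)
        return context, weights.squeeze(-1)
```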
2. Dataset, Training, and Experimentation
● Dataset:
○ Use a publicly available image captioning dataset (e.g., MSCOCO, Flickr8k, or
Flickr30k). Document how you preprocess both images and text; a caption-preprocessing
sketch appears at the end of this section.
● Training:
○ Train your model on the selected dataset.
○ Implement your own training loop (i.e., avoid high-level training abstractions that
handle everything automatically); a minimal loop is sketched at the end of this section.
○ Implement relevant evaluation metrics (e.g., BLEU, CIDEr) to measure caption
quality; an n-gram precision sketch appears at the end of this section.
● Experimentation:
○ Run experiments and provide an analysis of different architectures,
hyperparameters, or design choices.
○ Report on training and validation losses as well as evaluation metrics.
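As a concrete example of the caption preprocessing mentioned under Dataset, the sketch below builds a word-level vocabulary and converts caption strings into fixed-length id sequences. The special tokens, minimum-frequency cutoff, maximum length, and whitespace tokenizer are illustrative assumptions; real datasets such as MSCOCO usually also call for punctuation handling.

```python
from collections import Counter

PAD, START, END, UNK = "<pad>", "<start>", "<end>", "<unk>"


def build_vocab(captions, min_freq=5):
    """Map every word that appears at least `min_freq` times to an integer id."""
    counts = Counter(word for cap in captions for word in cap.lower().split())
    vocab = {PAD: 0, START: 1, END: 2, UNK: 3}
    for word, freq in counts.items():
        if freq >= min_freq:
            vocab[word] = len(vocab)
    return vocab


def encode_caption(caption, vocab, max_len=20):
    """Turn one caption string into a fixed-length list of token ids."""
    tokens = [START] + caption.lower().split()[: max_len - 2] + [END]
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens]
    return ids + [vocab[PAD]] * (max_len - len(ids))


# Example usage on toy data
caps = ["a dog runs on the grass", "a cat sits on the mat"]
vocab = build_vocab(caps, min_freq=1)
print(encode_caption(caps[0], vocab))
```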
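For the hand-written training loop, the sketch below shows one way to combine an encoder and decoder (the ConvEncoder and LSTMDecoder names are carried over from the earlier model sketch) with a cross-entropy loss and explicit optimizer steps. The data_loader, vocabulary, learning rate, and epoch count are assumptions standing in for your own pipeline and hyperparameters.

```python
import torch
import torch.nn.functional as F

# ConvEncoder, LSTMDecoder, vocab, and data_loader are assumed to come from
# the earlier sketches and your own data pipeline (captions as LongTensors).
encoder = ConvEncoder()
decoder = LSTMDecoder(vocab_size=len(vocab))
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for epoch in range(10):
    total_loss = 0.0
    for images, captions in data_loader:              # captions: (B, T) token ids
        logits = decoder(encoder(images), captions[:, :-1])  # predict the next token
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),      # (B*(T-1), vocab_size)
            captions[:, 1:].reshape(-1),              # targets shifted by one position
            ignore_index=vocab["<pad>"],              # do not penalise padding
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"epoch {epoch}: mean training loss {total_loss / len(data_loader):.4f}")
```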
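Because the evaluation metrics must also be implemented by hand, here is a compact sketch of modified (clipped) n-gram precision, the core ingredient of BLEU. It is a simplified single-candidate, single-reference version with no brevity penalty or corpus-level aggregation, so treat it as a starting point rather than a finished metric.

```python
from collections import Counter


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of one candidate caption against one reference.

    Both arguments are lists of tokens; counts are clipped so that repeating a
    word cannot inflate the score (the 'modified' precision used by BLEU).
    """
    cand_ngrams = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref_ngrams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    if not cand_ngrams:
        return 0.0
    clipped = sum(min(count, ref_ngrams[ng]) for ng, count in cand_ngrams.items())
    return clipped / sum(cand_ngrams.values())


# Example: unigram and bigram precision for a toy caption pair
cand = "a dog runs on grass".split()
ref = "a dog is running on the grass".split()
print(ngram_precision(cand, ref, 1), ngram_precision(cand, ref, 2))
```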
3. Code Quality and Documentation
● Code Quality:
○ Write clean, well-documented, and modular code.
○ Include comments explaining your model architecture, training loop, and any
experimental decisions.
● Documentation:
○ Provide a README that explains the project, the design choices, how to run the
code, and a summary of your findings.
○ Include any challenges encountered and how they were addressed.
4. Deliverables
● Source Code:
○ A fully functional code repository containing your model implementation, training
scripts, and evaluation routines.
● Report:
○ A detailed project report (4-6 pages) that covers:
■ An introduction to the problem and literature review.
■ Description of your model architecture and design rationale.
■ Data preprocessing steps and training strategy.
■ Results, including quantitative evaluation metrics and qualitative
examples (sample captions for given images).
■ Discussion on the limitations of your approach and potential
improvements.
● Presentation:
○ A short presentation (optional) summarizing your approach, results, and lessons
learned.
Grading Criteria
Criteria: Project Report & Presentation
Weight: 15%
Description:
- Depth of literature review, rationale behind design decisions, and comprehensive
results discussion.
- Clarity and professionalism in written and oral presentation (if applicable).
Additional Notes
● All code must be original and written from scratch. Use of pre-built model
architectures or high-level libraries that abstract away the model details (like directly
using a pre-built image captioning model) is not permitted.
● If you encounter any issues or have questions about the project requirements, please
reach out for clarification.
This project is designed to push you to not only implement complex models but also understand
the underlying mechanisms behind image captioning. Good luck, and we look forward to seeing
your innovative solutions!