Final Demo
DISSERTATION
AIMLCZG628T
16 March 2025
Outline
Image Captioning is an AI-based system that generates textual descriptions for images.
The generated text, also known as a caption, should accurately reflect the image's content.
Applications: Social Media, E-commerce, Content Moderation, Medical Imaging, assisting the visually impaired, etc.
Two main approaches: CNN + LSTM and Transformer-based models (Vision Transformer, Swin Transformer for medical image captioning).
Introduction
Background: Image Captioning is a critical AI task that bridges vision and language. It helps in generating meaningful
descriptions for images. Applications include accessibility, content tagging, and automated reporting.
Image captioning has multiple applications. For example, on social media platforms, it automatically suggests image captions, saving time for content creators. In online retail, it generates captions for product images, thus improving the shopping experience.
Autonomous vehicles rely on multiple AI technologies, including computer vision, NLP, and deep learning. Image captioning bridges vision and language processing, enabling more intuitive vehicle responses.
Challenges & Solutions
Social Media: automatically suggests image captions on social platforms, saving time for content creators
E-commerce: generates captions for product images, improving the shopping experience
Content Moderation: detecting and labelling problematic content (e.g., inappropriate image, violence detected)
Detection of Inappropriate Language: Identifying and removing captions that contain vulgar, offensive, or inappropriate
language.
Bias and Fairness: Ensuring that captions do not perpetuate stereotypes, biases, or discrimination against any group or
individual.
Contextual Accuracy: Verifying that the captions accurately describe the content of the image without adding
misleading or false information.
Safety and Compliance: Ensuring that captions comply with legal and regulatory standards, as well as platform-specific
guidelines.
Challenges & Solutions
Medical Imaging: CT scans for kidney disease detection
Accurate identification of kidney conditions (normal, cyst, tumor, stone) from CT scans
Providing an automated way for visually impaired people to experience the world by hearing descriptions of the images they capture
Increasing the ability of visually impaired people to understand their surroundings
It can help with understanding signs and signals, with navigation, and with day-to-day activities
Dataset
The images in Flickr30k/ Flickr8k are sourced from Flickr, covering a wide range
of everyday scenes, objects, and human activities. Compared to MS COCO, it has
shorter and more diverse captions, making it valuable for evaluating fine-grained
image-captioning models.
Dataset
Caption preparation
Raw captions are often noisy and not in a format that is usable by the ML model. During caption preparation, we remove
inappropriate captions and ensure the remaining ones are consistent and tokenized. In particular, we perform the following
steps:
Remove pairs with a non-English caption: We remove image-caption pairs where the caption is not in English, as this model's focus will
be on English.
Remove duplicate images or captions: To ensure the diversity and quality of the training data, we eliminate duplicate images and
captions. Duplicate images are identified using perceptual hashing techniques or image similarity models (e.g., CLIP image encoder), while
duplicate captions are detected by exact match or semantic similarity checks (e.g., CLIP text encoder). Removing duplicates prevents the model
from overfitting to redundant data and helps it learn a broader range of associations between images and text.
Remove irrelevant captions: We use a pretrained vision-language model (e.g., CLIP) to assess the
relevance between images and their corresponding captions. A higher score indicates greater
semantic relevance between the image and the text. We remove pairs with scores below a specific threshold,
such as 0.25. This ensures our model learns from high-quality, relevant pairs; a minimal sketch of this
CLIP-based filter is shown after this list.
Summarize long captions: Captions are often long and detailed. Training the model with these captions
leads to the generation of similarly long captions, which doesn't suit our use case. To address this, we
summarize the captions using a large language model such as Llama [4] to create brief, concise descriptions
that meet our requirements.
Tokenize captions: We use a subword-level tokenization algorithm such as Byte-Pair Encoding (BPE) to tokenize captions into a sequence of
IDs.
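As referenced above, the following is a minimal sketch of the CLIP-based relevance filter, assuming the Hugging Face transformers CLIP checkpoint openai/clip-vit-base-patch32; `pairs` is a hypothetical list of (image_path, caption) tuples, and the 0.25 threshold follows the text.

```python
# Sketch: filter image-caption pairs by CLIP relevance score.
# Assumes the Hugging Face `transformers` CLIP checkpoint below is available;
# `pairs` is a hypothetical list of (image_path, caption) tuples.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def clip_score(image_path: str, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())

def filter_pairs(pairs, threshold=0.25):
    """Keep only pairs whose CLIP score is at or above the threshold."""
    return [(p, c) for p, c in pairs if clip_score(p, c) >= threshold]
```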
Dataset - Data preparation steps
Tokenization and Embeddings
Tokenization:
The process of breaking down text into smaller units, such as words or sub-words, to be processed by language models. Effective tokenization is
crucial for handling diverse vocabularies and languages
Embeddings:
Techniques that convert tokens into dense vector representations, capturing semantic meanings and relationships between words. These
embeddings enable models to understand context and perform tasks like similarity assessments.
Tokenize captions
Text tokenization and token indexing: Text tokenization followed by token indexing converts the raw text into the format the Transformer model expects: a sequence of numbers.
Fig. Converting raw text to a sequence of numbers
Dataset - Data preparation steps
Text tokenization
Text tokenization:
Fig. Example of GPT-4 tokenization
Text tokenization is the process of splitting text into smaller units called tokens.
The figure shows how OpenAI's GPT-4 tokenizes the sentence "Let's go to NYC".
Tokenization can be performed at different levels. For example, "Hello world" can be split into ["Hello",
"world"] or ["H", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"]. Generally, tokenization algorithms
are divided into three categories:
• Character-level tokenization
• Word-level tokenization
• Subword-level tokenization
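As an illustration of subword-level tokenization, the sketch below uses the open-source tiktoken library with the cl100k_base encoding (the encoding associated with GPT-4-era OpenAI models). Any BPE tokenizer could be substituted; the exact token IDs depend on the chosen encoding.

```python
# Sketch: subword-level tokenization with the open-source `tiktoken` library.
# cl100k_base is the encoding used by GPT-4-era OpenAI models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Let's go to NYC"
token_ids = enc.encode(text)                        # list of integer token IDs
tokens = [enc.decode([tid]) for tid in token_ids]   # human-readable subword pieces

print(token_ids)
print(tokens)   # each ID maps back to a subword such as "Let", "'s", " go", ...
```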
Dataset - Data preparation steps for LSTM on Flickr8k
Caption preparation Implementation for LSTM
Text tokenization:
Caption text preprocessing steps
Dataset Splitting – The dataset is split into training (85%) and validation (15%) sets, ensuring that the model is trained and evaluated on separate image-caption pairs (a preprocessing sketch follows below).
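A compact sketch of the caption preprocessing and 85/15 split for the Flickr8k LSTM pipeline. It assumes `captions` is a dict mapping image IDs to lists of raw caption strings; the startseq/endseq markers and the Keras Tokenizer are common choices rather than necessarily the exact ones used here.

```python
# Sketch: caption cleaning, tokenization, and an 85/15 split for the LSTM pipeline.
# Assumes `captions` is a dict {image_id: [raw caption strings]} already loaded.
import re
import random
from tensorflow.keras.preprocessing.text import Tokenizer

def clean_caption(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^a-z ]", "", text)          # keep letters and spaces only
    text = re.sub(r"\s+", " ", text).strip()
    return f"startseq {text} endseq"             # sequence start/end markers

cleaned = {img: [clean_caption(c) for c in caps] for img, caps in captions.items()}

# Fit a word-level tokenizer on all cleaned captions.
all_caps = [c for caps in cleaned.values() for c in caps]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_caps)
vocab_size = len(tokenizer.word_index) + 1
max_len = max(len(c.split()) for c in all_caps)

# 85% / 15% split on image IDs so an image never appears in both sets.
ids = list(cleaned.keys())
random.seed(42)
random.shuffle(ids)
cut = int(0.85 * len(ids))
train_ids, val_ids = ids[:cut], ids[cut:]
```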
Dataset - Data preparation steps for LSTM on Flickr8k
Tokenization Implementation for Transformer
As is the case for captions, not all images are useful. We remove images that might hurt
training and ensure the remaining images are consistent and suitable for model training.
In particular, we perform the following steps:
Feature Extraction: Load DenseNet201 (pretrained on ImageNet) to extract image embeddings from the
last convolutional layer.
Resize and preprocess images to match DenseNet201’s input requirements.
Text Processing: Tokenize captions, convert words to integer sequences, and pad sequences to a uniform length.
Create word-to-index and index-to-word mappings for vocabulary.
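A sketch of the DenseNet201 feature-extraction step, assuming `image_paths` is a hypothetical list of image file paths. It takes the globally pooled output just before the classification head, one common choice, which for DenseNet201 is a 1920-dimensional vector.

```python
# Sketch: extracting image features with DenseNet201 pretrained on ImageNet.
# Assumes `image_paths` is a hypothetical list of Flickr8k image file paths.
import numpy as np
from tensorflow.keras.applications import DenseNet201
from tensorflow.keras.applications.densenet import preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model

base = DenseNet201(weights="imagenet")
# Take the layer just before the classification head (global average pooled features).
feature_extractor = Model(inputs=base.input, outputs=base.layers[-2].output)

def extract_features(path: str) -> np.ndarray:
    img = image.load_img(path, target_size=(224, 224))   # DenseNet201 input size
    arr = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return feature_extractor.predict(arr, verbose=0)[0]   # 1920-dim feature vector

features = {p: extract_features(p) for p in image_paths}
```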
Model Architecture
Feature Extraction:
Use EfficientNetB0 (pretrained on ImageNet) to extract global and local image features.
Apply positional encoding to integrate spatial information into feature embeddings.
Text Preprocessing:
Tokenize captions and map words to integer sequences.
Create word embeddings using a transformer-based embedding layer.
The image feature embeddings are passed to the encoder block, while the word embeddings are fed to the decoder.
Transformer Implementation
Model Architecture
Use EfficientNetB0 to encode image features.
Pass encoded features into a Transformer-based decoder using multi-head attention and positional encoding.
Apply a fully connected output layer with a softmax function to predict words.
Transformer Implementation
Overall Architecture:
Image Input: An image is fed into the EfficientNetB0 CNN.
Feature Extraction: The CNN extracts feature maps from the image.
Feature Reshaping: The feature maps are reshaped into a sequence of feature vectors.
Encoder: The Transformer encoder processes the feature vectors, capturing relationships between
different parts of the image.
Caption Input: A caption (represented as a sequence of token indices) is fed into the decoder.
Positional Embedding: The caption tokens are embedded and combined with positional
embeddings.
Decoder: The Transformer decoder generates the next word in the caption, attending to both the
previous words and the encoded image features.
Output: The decoder outputs a probability distribution over the vocabulary.
Training: The model is trained to minimize the cross-entropy loss between the predicted and actual
captions.
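The following is a minimal sketch of the EfficientNetB0 + Transformer captioning model outlined above. The hyperparameters (VOCAB_SIZE, SEQ_LEN, EMBED_DIM, NUM_HEADS, FF_DIM) are illustrative assumptions, the encoder/decoder use a single attention block each for brevity, and use_causal_mask requires a reasonably recent TensorFlow/Keras.

```python
# Sketch of the EfficientNetB0 + Transformer captioning model described above.
# Hyperparameters are illustrative assumptions, not the exact values used in the project.
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.applications import EfficientNetB0

VOCAB_SIZE, SEQ_LEN = 10000, 25
EMBED_DIM, NUM_HEADS, FF_DIM = 256, 4, 512


class PositionalEmbedding(layers.Layer):
    """Token embedding plus learned positional embedding (decoder input step)."""

    def __init__(self, seq_len, vocab_size, embed_dim):
        super().__init__()
        self.tok = layers.Embedding(vocab_size, embed_dim)
        self.pos = layers.Embedding(seq_len, embed_dim)

    def call(self, tokens):
        positions = tf.range(start=0, limit=tf.shape(tokens)[-1], delta=1)
        return self.tok(tokens) + self.pos(positions)


# 1) Image input -> EfficientNetB0 feature maps -> sequence of patch features.
cnn = EfficientNetB0(include_top=False, weights="imagenet", input_shape=(224, 224, 3))
cnn.trainable = False
image_in = layers.Input(shape=(224, 224, 3))
fmap = cnn(image_in)                                  # (7, 7, 1280) feature maps
img_seq = layers.Reshape((-1, 1280))(fmap)            # 49 patch feature vectors
img_seq = layers.Dense(EMBED_DIM, activation="relu")(img_seq)

# 2) Encoder: self-attention over the image patch features.
enc_attn = layers.MultiHeadAttention(num_heads=NUM_HEADS, key_dim=EMBED_DIM)(img_seq, img_seq)
enc_out = layers.LayerNormalization()(img_seq + enc_attn)

# 3) Decoder: embedded caption tokens attend to previous tokens and to the encoder output.
caption_in = layers.Input(shape=(SEQ_LEN,), dtype="int32")
x = PositionalEmbedding(SEQ_LEN, VOCAB_SIZE, EMBED_DIM)(caption_in)
self_attn = layers.MultiHeadAttention(num_heads=NUM_HEADS, key_dim=EMBED_DIM)(
    x, x, use_causal_mask=True)
x = layers.LayerNormalization()(x + self_attn)
cross_attn = layers.MultiHeadAttention(num_heads=NUM_HEADS, key_dim=EMBED_DIM)(x, enc_out)
x = layers.LayerNormalization()(x + cross_attn)
x = layers.Dense(FF_DIM, activation="relu")(x)

# 4) Output: a probability distribution over the vocabulary at each position.
probs = layers.Dense(VOCAB_SIZE, activation="softmax")(x)

model = tf.keras.Model([image_in, caption_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```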
System Design and Implementation
LSTM based Implementation
Component | Description
Image Encoder | Uses DenseNet201 to extract deep image features.
CNN-Based Encoder | Converts CNN features into sequence embeddings for the decoder.
Text Decoder | Uses an LSTM-based architecture to generate captions one token at a time.
Unsupervised Pretraining | Uses a pretrained CNN (DenseNet201) and pretrains the LSTM on text data.
Supervised Finetuning | Trains the CNN and LSTM together on image-caption pairs using cross-entropy loss.
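Below is a minimal sketch of one common formulation of this LSTM decoder (the "merge" architecture), assuming precomputed 1920-dimensional DenseNet201 features and the vocab_size/max_len values from the caption preprocessing step; layer sizes are illustrative.

```python
# Sketch: a "merge"-style DenseNet201 + LSTM captioning model.
# `vocab_size` and `max_len` come from caption preprocessing; 1920 is the size of
# the DenseNet201 feature vector; hidden sizes are illustrative assumptions.
from tensorflow.keras import layers, Model

def build_lstm_captioner(vocab_size: int, max_len: int, feat_dim: int = 1920):
    # Image branch: precomputed DenseNet201 features -> dense projection.
    img_in = layers.Input(shape=(feat_dim,))
    img_vec = layers.Dense(256, activation="relu")(layers.Dropout(0.4)(img_in))

    # Text branch: partial caption -> embedding -> LSTM state.
    cap_in = layers.Input(shape=(max_len,))
    emb = layers.Embedding(vocab_size, 256, mask_zero=True)(cap_in)
    cap_vec = layers.LSTM(256)(layers.Dropout(0.4)(emb))

    # Merge both branches and predict the next token.
    merged = layers.add([img_vec, cap_vec])
    hidden = layers.Dense(256, activation="relu")(merged)
    out = layers.Dense(vocab_size, activation="softmax")(hidden)

    model = Model(inputs=[img_in, cap_in], outputs=out)
    # Cross-entropy loss over one-hot next-word targets.
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model
```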
Transformer based Implementation
Component | Description
Image Encoder | EfficientNetB0 extracts deep image features.
CNN-Based Encoder | Converts CNN features into sequence embeddings for the decoder.
Unsupervised Pretraining | Pretrains the CNN on ImageNet and the decoder on a large text corpus.
Supervised Finetuning | Trains the CNN and Transformer together using cross-entropy loss.
Sampling : Generative models are trained to capture the underlying distribution of the training
data. Once trained, these models can generate new samples that are similar to the data they
were trained on. Sampling is the process of using a trained generative model to generate new
data.
Beam search is a popular deterministic algorithm for generating text from a trained
model. The core idea is to track multiple potential sequences of tokens simultaneously.
At each step, the model calculates the probabilities for the next possible tokens for each
sequence and selects the "top-k" most probable sequences. The value of k, known as
beam width, is configurable.
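A minimal sketch of beam search as described above; `predict_next` is a hypothetical function that returns log-probabilities over the vocabulary given the image features and the tokens generated so far, and k is the beam width.

```python
# Sketch: beam search decoding for a trained captioner.
# `predict_next(image_feats, seq)` is a hypothetical scoring function returning a
# sequence of log-probabilities indexed by token ID; k is the beam width.
def beam_search(predict_next, image_feats, start_id, end_id, k=3, max_len=25):
    # Each beam is a (token_sequence, cumulative_log_prob) pair.
    beams = [([start_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:            # finished sequences are carried forward
                candidates.append((seq, score))
                continue
            log_probs = predict_next(image_feats, seq)
            # Keep the k best continuations of this beam.
            top = sorted(enumerate(log_probs), key=lambda t: t[1], reverse=True)[:k]
            for tok, lp in top:
                candidates.append((seq + [tok], score + lp))
        # Keep only the k best sequences overall.
        beams = sorted(candidates, key=lambda t: t[1], reverse=True)[:k]
        if all(seq[-1] == end_id for seq, _ in beams):
            break
    return beams[0][0]                        # highest-scoring token sequence
```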
System Design and Implementation
Greedy search
Greedy search is the simplest deterministic algorithm. It always selects the token with the
highest probability as the next token. As shown in the figure, greedy search can lead
to repetitive patterns in the generated text.
This occurs because it follows a narrow path based on the highest-probability tokens without
considering alternative paths that might lead to more coherent sentences. Due to this limitation,
greedy search is rarely used in practice.
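For comparison, the same decoding loop with greedy selection; it reuses the hypothetical `predict_next` scoring function from the beam search sketch above.

```python
# Sketch: greedy decoding - always pick the single most probable next token.
# `predict_next` is the same hypothetical scoring function as in the beam search sketch.
def greedy_search(predict_next, image_feats, start_id, end_id, max_len=25):
    seq = [start_id]
    for _ in range(max_len):
        log_probs = predict_next(image_feats, seq)
        next_tok = max(range(len(log_probs)), key=lambda t: log_probs[t])
        seq.append(next_tok)
        if next_tok == end_id:
            break
    return seq
```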
Evaluation Metrics
To thoroughly evaluate a language translation model, metrics should measure both translation
accuracy and contextual appropriateness. The research community has proposed
several metrics that, over the years, have become widely accepted as standards.
Some commonly used metrics are
- BLEU, ROUGE, METEOR
Task Metrics
BLEU
BP (the brevity penalty) is a term that penalizes candidate translations shorter than the reference translation. The standard formula is shown below.
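For reference, the standard BLEU formulation combines the brevity penalty BP with a weighted geometric mean of the modified n-gram precisions pₙ, where c is the candidate length, r the effective reference length, and the weights wₙ are typically uniform with N = 4:

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big(\sum_{n=1}^{N} w_n \log p_n\Big),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```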
Precision (pₙ)
Precision measures how many n-grams in the candidate translation are present in reference translations. It is calculated by dividing the number of matching n-grams by the total number of n-grams in the candidate translation.
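A short sketch of how BLEU can be computed in practice with NLTK; the reference and hypothesis token lists below are purely illustrative, not results from this project.

```python
# Sketch: corpus-level BLEU with NLTK. References and hypotheses are token lists;
# each hypothesis may have several reference captions.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["a", "dog", "runs", "on", "the", "beach"],
               ["a", "dog", "is", "running", "along", "the", "beach"]]]
hypotheses = [["a", "dog", "runs", "along", "the", "beach"]]

smooth = SmoothingFunction().method1   # avoids zero scores for short captions
score = corpus_bleu(references, hypotheses, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```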
Model Prediction/Result
Online inference takes around 4 seconds once the model and all weights are loaded.
Containerization
Docker File
Containerization
• Guidance
------------------End----------------------