
B.TECH MINOR PROJECT PRESENTATION 2024-25
National Institute of Science & Technology

IMAGE CAPTION GENERATOR USING DEEP LEARNING
Project ID: 31121

By
Student 1: ASHISH KUMAR PANDA        Roll No: CSE202110331
Student 2: B. SOURAV                 Roll No: CSE202110339
Student 3: BISHNU PRASAD MAHARANA    Roll No: CSE202110342
Student 4: SUSHOVIT BADATYA          Roll No: CSE202110348

Under the guidance of
Prof. Charulata Palai

CONTENTS:

• INTRODUCTION
• PROBLEM STATEMENT
• OBJECTIVES
• CHALLENGES
• METHODOLOGY
• FLOWCHART AND ALGORITHM
• CODE & SIMULATION
• POSSIBLE OUTCOMES
• FUTURE ADD-ONS
• REFERENCES
• CONCLUSION


INTRODUCTION:

• What is Image Captioning?


– Image captioning is the task of automatically generating a descriptive sentence for a given image.
– This involves understanding the content of the image, recognizing objects, and generating relevant text.
• Why is it Important?
– Enhances accessibility for the visually impaired by generating descriptions of visual content.
– Valuable for social media, where it can automate image descriptions for enhanced user engagement.
– Useful for image search engines by enabling semantic search based on captions.
• Key Applications:
– Social Media Captioning
– Accessibility Tools
– Image Indexing and Search
– Human-Computer Interaction


PROBLEM STATEMENT:

To develop a deep learning model that can analyze an image and automatically generate a human-like caption.


OBJECTIVES:

• Design and Train an Effective Deep Learning Model


• Use CNNs for Visual Feature Extraction:
– Convolutional Neural Networks (CNNs) are used to analyze images, breaking them down into a set of visual features. This process transforms the image into a feature vector that serves as the foundation for caption generation.
– Example Models: ResNet, InceptionV3, known for their accuracy in capturing details within complex images.
• Incorporate RNNs or Transformers for Language Generation:
– Recurrent Neural Networks (RNNs), particularly LSTMs, and Transformers are essential for handling sequential data and generating sentences that are coherent and context-aware (a minimal CNN + LSTM sketch appears after this list).
– Transformer-based architectures, such as the original Transformer and BERT, are now popular for language generation because of their ability to handle long-range dependencies in sentences.
• Generate Accurate and Descriptive Captions
• Ensure High Relevance to Image Content:
– The generated captions should be directly related to the objects, actions, and context within the image. For instance, distinguishing between similar images, such as a “dog running on the beach” versus a “dog
playing with a ball on grass.”
• Accuracy in Complex Scenarios:
– Train the model to handle scenarios with multiple objects or subtle actions, like “a group of friends hiking on a mountain trail with a dog,” rather than simply “people outdoors.”
– This ensures captions are not only accurate but also nuanced.
• Develop a Model that Balances Accuracy and Grammar
• Generate Grammatically Correct and Coherent Captions:
– The model must follow grammatical rules, making captions understandable and natural.
– Attention mechanisms can enhance coherence by focusing on relevant parts of an image as each word in the caption is generated.
• Achieve Efficient Model Performance for Real-World Use
• Optimize Model for Scalability and Speed:
– The model should be optimized to work efficiently on both high-performance machines (for training) and potentially on lower-powered devices (for inference in real-world applications).
• Balance Between Computation and Accuracy:
– Achieve a balance between computational requirements and output quality to ensure the model is accessible and practical to deploy across various platforms.
• Create a Broadly Applicable Model
• Generalize Across Diverse Image Types:
– Ensure that the model performs well across a wide range of image types and scenarios, from landscapes and indoor scenes to complex actions and events.
• Adaptability to New Data:
– The model should be designed with adaptability in mind, allowing easy retraining with new datasets to improve its versatility and accuracy over time.
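A minimal sketch of the kind of CNN + LSTM encoder-decoder these objectives describe, assuming pre-extracted 4096-dimensional image features; vocab_size and max_length are placeholder values (in practice they come from the tokenizer and the caption lengths):

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size, max_length = 8000, 35  # assumed placeholder values

# Image branch: project the CNN feature vector into the decoder's hidden space
image_input = Input(shape=(4096,))
img = Dropout(0.4)(image_input)
img = Dense(256, activation='relu')(img)

# Text branch: embed the partial caption and run it through an LSTM
caption_input = Input(shape=(max_length,))
txt = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
txt = Dropout(0.4)(txt)
txt = LSTM(256)(txt)

# Merge both branches and predict the next word of the caption
decoder = add([img, txt])
decoder = Dense(256, activation='relu')(decoder)
output = Dense(vocab_size, activation='softmax')(decoder)

model = Model(inputs=[image_input, caption_input], outputs=output)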


CHALLENGES:

• Data Dependency:
– High-quality, labeled datasets are required for training. Datasets need to have a wide variety of images
with accurate captions.
• Computational Complexity:
– Training deep neural networks, especially those involving both visual and language models, requires
significant computational power.
• Contextual and Semantic Understanding:
– Generating relevant captions goes beyond identifying objects—it requires an understanding of the
relationships and context within an image to generate human-like descriptions.


METHODOLOGY:

• Dataset Preparation:
– Datasets like Flickr8k/30k provide thousands of labeled images with human-generated captions.
– Preprocess images and captions, including resizing, normalizing, and tokenizing text (a tokenization sketch follows this list).
• Feature Extraction:
– Use Convolutional Neural Networks (CNNs), such as ResNet or InceptionV3, to extract features from images.
– CNNs transform images into feature vectors that capture essential visual information.
• Text Generation:
– Use Recurrent Neural Networks (RNNs) or Transformers to generate text based on image features.
– LSTMs are commonly used in RNNs to handle sequential data and maintain context over time.
• Integrating CNN and RNN Models:
– Combine CNN feature extraction with RNN/Transformer-based language generation to create an end-to-end
captioning model.
– Image features guide the initial generation, while RNN/Transformer structures help complete sentences.
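A minimal sketch of the caption tokenization step, assuming the training captions have been collected into a Python list named all_captions (a hypothetical variable, not shown on the slides):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

all_captions = ['startseq a dog runs on the beach endseq',
                'startseq two children play football endseq']  # placeholder examples

# Build the word index over every training caption
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_captions)
vocab_size = len(tokenizer.word_index) + 1

# Convert each caption to a sequence of word indices and pad to a uniform length
sequences = tokenizer.texts_to_sequences(all_captions)
max_length = max(len(seq) for seq in sequences)
padded = pad_sequences(sequences, maxlen=max_length, padding='post')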


FLOWCHART:

[Flowchart figure]

ALGORITHM:

• Load and Preprocess Data


• Load the dataset (e.g. Flickr8k/30k) containing images and their associated captions.
• Preprocess images by resizing, normalizing pixel values, and augmenting (if needed).
• Tokenize captions: Convert words in captions to integer sequences (word indices) and pad sequences to ensure uniform length.
• Feature Extraction Using CNN
• Load a pre-trained CNN model (e.g., InceptionV3, ResNet), removing the final classification layer.
• Pass each image through the CNN to obtain a feature vector representing the image.
• Save these image feature vectors for later input to the captioning model.
• Caption Generation Model Setup (Encoder-Decoder)
• Encoder:
– Define an embedding layer for the input captions.
– Use the pre-trained CNN as the encoder, which outputs the feature vector.
• Decoder:
– For RNN/LSTM Decoder:
• Pass image features through a dense layer to initialize the hidden state of the RNN.
• Feed the tokenized caption sequence into the RNN, word by word, using the hidden states to generate the next word.
– For Transformer Decoder:
• Use multi-head attention to attend to the image features and previous words in the sequence, generating contextually relevant words at each step.
• Training the Model
• Define a loss function (e.g., categorical cross-entropy) that compares the generated captions to the true captions.
• Use an optimizer (e.g., Adam) to minimize the loss (a minimal compile-and-fit sketch follows this list).
• Generate Captions for New Images
• Input a new image into the trained encoder to obtain image features.
• Feed the image features into the decoder and initialize the caption generation process.
• Predict one word at a time by sampling from the probability distribution at each timestep until an end-of-sequence token is generated or a maximum caption length is
reached.
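A minimal sketch of the training step, assuming the encoder-decoder model, vocab_size and max_length from the earlier sketches, the image features dictionary produced in Simulation 1 below (each entry a (1, 4096) array), and a hypothetical dict caption_sequences mapping each image ID to its tokenized caption sequences:

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def build_pairs(features, caption_sequences, vocab_size, max_length):
    # Turn each caption into (image feature, partial caption) -> next-word training samples
    X_img, X_txt, y = [], [], []
    for image_id, seqs in caption_sequences.items():
        for seq in seqs:
            for i in range(1, len(seq)):
                X_img.append(features[image_id][0])
                X_txt.append(pad_sequences([seq[:i]], maxlen=max_length)[0])
                y.append(to_categorical([seq[i]], num_classes=vocab_size)[0])
    return np.array(X_img), np.array(X_txt), np.array(y)

X_img, X_txt, y = build_pairs(features, caption_sequences, vocab_size, max_length)
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit([X_img, X_txt], y, epochs=20, batch_size=32)

In practice a data generator is usually used so the training pairs are built batch by batch instead of being held in memory all at once.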


SIMULATION 1:
import os
import pickle
from tqdm import tqdm
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.applications.vgg16 import preprocess_input  # assuming a VGG16-style extractor

# BASE_DIR, WORKING_DIR and the feature-extractor `model` are assumed to be defined earlier
features = {}
directory = os.path.join(BASE_DIR, 'Images')

for img_name in tqdm(os.listdir(directory)):
    # Load the image and preprocess it
    img_path = os.path.join(directory, img_name)
    image = load_img(img_path, target_size=(224, 224))
    image = img_to_array(image)
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    image = preprocess_input(image)

    # Extract features with the pre-trained CNN
    feature = model.predict(image, verbose=0)

    # Store features with the image ID (filename without extension) as key
    image_id = img_name.split('.')[0]
    features[image_id] = feature

if not os.path.exists(WORKING_DIR):
    os.makedirs(WORKING_DIR)
    print(f"Created directory: {WORKING_DIR}")

# Save extracted features for later use
with open(os.path.join(WORKING_DIR, 'image_features.pkl'), 'wb') as f:
    pickle.dump(features, f)

print("Feature extraction complete and saved to 'image_features.pkl'.")


SIMULATION 2:

import os
import matplotlib.pyplot as plt
from PIL import Image

# Train the model
epochs = 20
batch_size = 32
steps = len(train) // batch_size  # `train` is assumed to be the list of training image IDs

# Load a test image (`image_name` is assumed to hold the filename of an image in the Images folder)
image_id = image_name.split('.')[0]
img_path = os.path.join(BASE_DIR, "Images", image_name)
image = Image.open(img_path)

# Display the actual (ground-truth) captions from the dataset mapping
captions = mapping[image_id]
print('---------------------Actual---------------------')
for caption in captions:
    print(caption)

# Predict the caption from the stored 4096-dimensional image features
# (`model` here refers to the trained captioning model, not the CNN feature extractor)
y_pred = predict_caption(model, features[image_id].reshape((1, 4096)), tokenizer, max_length)
print('--------------------Predicted--------------------')
print(y_pred)

# Show the image with matplotlib
plt.imshow(image)
plt.axis('off')  # Hide axis
plt.show()
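predict_caption is called above but not defined on the slides. A minimal greedy-decoding sketch, assuming captions were wrapped with the conventional 'startseq'/'endseq' tokens during preprocessing (an assumption, not shown in the slides):

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def predict_caption(model, image_feature, tokenizer, max_length):
    # Greedy decoding: start from 'startseq' and predict one word at a time
    in_text = 'startseq'
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([in_text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = model.predict([image_feature, seq], verbose=0)
        word_index = int(np.argmax(yhat))
        word = tokenizer.index_word.get(word_index)
        if word is None:
            break
        in_text += ' ' + word
        if word == 'endseq':  # stop once the end-of-sequence token is produced
            break
    return in_text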


POSSIBLE OUTCOMES:

• High-Quality Captions: The model should generate captions that are accurate, descriptive, and
contextually appropriate.
• Improved Accessibility: Enhances accessibility for visually impaired users by providing descriptions
of visual content.
• Enhanced Image Indexing: Automatically generated captions can aid in search engine indexing for
better image retrieval.


FUTURE ADD-ONS:

• Real-Time Captioning: Enable the model to generate captions for images in real-time applications.

• Improved Context and Semantics: Fine-tune the model to capture deeper relationships and context
within images for more nuanced captions.

• Encryption and Decryption of Images and Generated Captions: Encrypt the images and the generated
captions to enhance security in real-time use.


REFERENCES:

• Krizhevsky, A., Sutskever, I., and Hinton, G. E. "ImageNet Classification with Deep Convolutional Neural Networks." International Conference on Neural Information Processing Systems, Curran Associates Inc., 1097-1105 (2012).
• Liu, S., Bai, L., Hu, Y., and Wang, H. "Image Captioning Based on Deep Neural Networks." College of Systems Engineering, National University of Defense Technology, 410073 Changsha, China.
• Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. S., and Bengio, Y. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." University of Toronto; Courant Institute of Mathematical Sciences, New York University (2016).
• Rennie, S. J., Marcheret, E., Mroueh, Y., Ross, J., and Goel, V. "Self-critical Sequence Training for Image Captioning." IBM Research AI (2017).
• Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. "Bottom-Up and Top-Down Attention for Image Captioning." Australian National University, Microsoft Research, University of Edinburgh (2018).
• Yanık, H., and Sahillioglu, Y. "Transformer-based Image Captioning." Middle East Technical University, Turkey (2020).


CONCLUSION:

• In conclusion, developing an image caption generator using deep learning merges the complex
fields of computer vision and natural language processing to create a model capable of
automatically generating human-like captions for images. By integrating Convolutional Neural
Networks (CNNs) for visual feature extraction with Recurrent Neural Networks (RNNs) or
Transformers for language generation, the model interprets the visual content of an image and
produces descriptive, contextually relevant captions. This technology holds tremendous
potential for applications in accessibility, content indexing, and human-computer interaction,
bridging the gap between images and text-based understanding. While challenges remain, such
as ensuring nuanced contextual understanding and optimizing computational demands,
continued advancements in model architecture and data processing promise to make image
captioning more accurate, efficient, and widely applicable across diverse real-world scenarios.


THANK YOU

