
A SUMMER INTERNSHIP TRAINING AT

“MEVI TECHNOLOGIES LLP, MALLATHAHALLI, BENGALURU”

IT3711-SUMMER INTERNSHIP

Submitted by
SANJAY P - 510821205021

SATHIYA NARAYANAN P - 510821205023

VINOD KUMAR S - 510821205029

in partial fulfilment for the award of the degree

of
BACHELOR OF TECHNOLOGY IN
INFORMATION TECHNOLOGY

GANADIPATHY TULSI’S JAIN ENGINEERING COLLEGE


KANIYAMBADI, VELLORE - 632 102

ANNA UNIVERSITY: CHENNAI 600 025


DECEMBER 2024

ANNA UNIVERSITY: CHENNAI 600 025
BONAFIDE CERTIFICATE

Certified that the IT3711-Summer Internship carried out at “MEVI

TECHNOLOGIES LLP, MALLATHAHALLI, BENGALURU” from 15.07.2024
to 15.08.2024 is the bonafide work of
SANJAY P - 510821205021
who carried out the work under my supervision. Certified further that, to the best of
my knowledge, the work reported herein does not form part of any other project
report or dissertation on the basis of which a degree or award was conferred on an
earlier occasion on this or any other candidate.

SIGNATURE

D. DURAI KUMAR

HEAD OF THE DEPARTMENT

Associate Professor

Department of Information Technology

Ganadipathy Tulsi’s Jain Engineering College

Kaniyambadi, Vellore – 632 102.

Submitted for the Project Viva-Voce Examination held on _______________ .

INTERNAL EXAMINER EXTERNAL EXAMINER

DECLARATION

I hereby declare that the IT3711-Summer Internship carried out at

“MEVI TECHNOLOGIES LLP, MALLATHAHALLI, BENGALURU”

submitted for the B.Tech. Information Technology degree is my original work

and the dissertation has not formed the basis for the award of any degree,

associateship, fellowship or any other similar titles.

Place:
Date :
Signature of the student
(SANJAY P)

INTERNSHIP CERTIFICATE

ABSTRACT

This project implements an image captioning system using a pre-trained Vision

Encoder-Decoder model. The model, loaded from a pre-trained checkpoint,

processes an input image and generates a descriptive caption. The image is first

preprocessed to extract relevant features, which are then fed into the model. The

model generates a sequence of tokens representing the caption, which is

subsequently decoded into human-readable text. This approach leverages the

power of deep learning to automatically describe the visual content of an image.

TABLE OF CONTENTS

CHAPTER NO TITLE PAGE NO

INTERNSHIP CERTIFICATE 2

ABSTRACT 3

1 INTRODUCTION

1.1. OBJECTIVE 6

1.2. OVERVIEW 6

2 LITERATURE SURVEY 7

2.1. EXISTING SYSTEM 8

2.2. PROBLEM IDENTIFICATION 10

2.3. PROBLEM ANALYSIS 11

2.4. PROPOSED SYSTEM 12

3 SYSTEM SPECIFICATION 13

3.1. HARDWARE REQUIREMENTS 13

3.2. SOFTWARE REQUIREMENTS 13

3.3. SOFTWARE DESCRIPTION 13

3.3.1. Python 13

3.3.2. Google Colab 17

3.3.3. Recurrent Neural Networks 19

3.3.4. Artificial Neural Networks 24

4 PROJECT DESCRIPTION 27

4.1. MODULES 27

4.2. MODULE DESCRIPTION 27

4.2.1. Pre-Trained Model 27

4.2.2. Model Evaluation 27

4.2.3. Model Development 28

5 SYSTEM IMPLEMENTATION 29

5.1. PRE-TRAINED MODEL 29

5.2. MODEL EVALUATION 34

5.3. MODEL DEVELOPMENT 36

6 CONCLUSION 37

FUTURE SCOPE 37

APPENDICES 38

APPENDIX 1: SOURCE CODE 38

APPENDIX 2: SCREEN SHOTS 40

BIBLIOGRAPHY 44

CHAPTER 1
INTRODUCTION

1.1. OBJECTIVE

To develop a neural network model that can accurately interpret the


visual content of an image and generate a descriptive caption in natural
language.

1.2. OVERVIEW

Image caption generation is a challenging task that involves bridging the


gap between computer vision and natural language processing. The objective is
to develop a model that can accurately interpret the visual content of an image
and generate a descriptive caption in natural language.

A common approach involves using a combination of convolutional


neural networks (CNNs) and recurrent neural networks (RNNs). The CNN acts
as an encoder, extracting relevant features from the input image. These features
are then fed into the decoder, which generates the caption one word at a time,
conditioned on the previous words and the visual features.

The training process involves minimizing a loss function that measures


the discrepancy between the generated caption and the ground truth caption.
This typically involves cross-entropy loss, which penalizes the model for
incorrect word predictions. The model is trained on a large dataset of images
and their corresponding captions, allowing it to learn the mapping between
visual content and language.
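The encoder-decoder pipeline and training objective described above can be sketched
in a few lines of PyTorch. The sketch below is purely illustrative: the CNN backbone
(an untrained resnet18 from torchvision), the vocabulary size, and the dummy tensors
are assumptions chosen for brevity and are not the configuration used in this project.

import torch
import torch.nn as nn
from torchvision import models

class CaptionModel(nn.Module):
    """Minimal CNN encoder + LSTM decoder, for illustration only."""
    def __init__(self, vocab_size=5000, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18()                                   # randomly initialised backbone
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop the classifier head
        self.img_proj = nn.Linear(512, embed_dim)                 # 512 = resnet18 feature size
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)       # (B, 512) visual features
        feats = self.img_proj(feats).unsqueeze(1)     # image feature acts as the first "token"
        words = self.embed(captions[:, :-1])          # teacher forcing: previous ground-truth words
        hidden, _ = self.lstm(torch.cat([feats, words], dim=1))
        return self.out(hidden)                       # (B, T, vocab) word scores

model = CaptionModel()
criterion = nn.CrossEntropyLoss()                     # penalises incorrect word predictions
images = torch.randn(2, 3, 224, 224)                  # dummy image batch
captions = torch.randint(0, 5000, (2, 12))            # dummy ground-truth token ids
logits = model(images, captions)
loss = criterion(logits.reshape(-1, 5000), captions.reshape(-1))
loss.backward()                                       # gradients for one training step

In a real training run this step would be repeated over a large dataset of
image-caption pairs, exactly as described above.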

CHAPTER 2
LITERATURE SURVEY

[1] Show and Tell: A Pioneering Model for Image Captioning [Vinyals et al.,
2014]

Specification:

 CNN (Convolutional Neural Network): This component extracts visual


features from the input image. It processes the image layer by layer,
identifying patterns and hierarchies of features, such as edges, textures,
and objects.
 RNN (Recurrent Neural Network): This component, specifically a
Long Short-Term Memory (LSTM) network, generates the caption one
word at a time. It processes the sequence of words, considering the
current word and the context of the previous words (a decoding loop of
this kind is sketched below).
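To make the word-by-word generation concrete, the following short PyTorch sketch
shows a greedy decoding loop of the kind described above. All names and sizes here
(embed, lstm_cell, out_proj, img_to_h, START_ID, END_ID, the 2048-dimensional CNN
feature) are hypothetical placeholders chosen for illustration; they are not taken
from the Show and Tell release or from this project's code.

import torch
import torch.nn as nn

# Hypothetical stand-ins for the trained model's components.
vocab_size, embed_dim, hidden_dim = 5000, 256, 512
embed = nn.Embedding(vocab_size, embed_dim)
lstm_cell = nn.LSTMCell(embed_dim, hidden_dim)
out_proj = nn.Linear(hidden_dim, vocab_size)
img_to_h = nn.Linear(2048, hidden_dim)           # maps CNN features to the initial state

START_ID, END_ID, MAX_LEN = 1, 2, 20

def greedy_decode(image_features):
    """Generate a caption one token at a time, feeding each prediction back in."""
    h = torch.tanh(img_to_h(image_features))     # initialise the hidden state from the image
    c = torch.zeros_like(h)
    token = torch.tensor([START_ID])
    caption = []
    for _ in range(MAX_LEN):
        h, c = lstm_cell(embed(token), (h, c))   # one LSTM step conditioned on the last word
        token = out_proj(h).argmax(dim=-1)       # pick the most likely next word (greedy)
        if token.item() == END_ID:
            break
        caption.append(token.item())
    return caption

print(greedy_decode(torch.randn(1, 2048)))       # dummy CNN feature vector

With untrained weights the loop produces random token ids; after training, the same
loop yields the caption word by word.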

Advantage:

 Pioneering Deep Learning for Image Captioning: Show and Tell was
one of the first models to successfully apply deep learning techniques to
the task of image captioning. It demonstrated the potential of combining
computer vision and natural language processing.

Disadvantage:

 Limited in Generating Complex and Descriptive Captions: The model


often produced generic and short captions, lacking the ability to generate
detailed and nuanced descriptions. This limitation was due to the
relatively simple architecture and the lack of attention mechanisms.

2.1. EXISTING SYSTEM

Existing image captioning models leverage a combination of computer vision


and natural language processing techniques to generate descriptive captions for
images. These models typically employ deep learning architectures, such as
Convolutional Neural Networks (CNNs) and Recurrent Neural Networks
(RNNs), to extract visual features from the image and generate corresponding
text descriptions.

One popular approach involves using a CNN to extract high-level features from
the image, followed by an RNN to decode these features into a sequence of
words. However, this approach often struggles with generating coherent and
detailed captions, especially for complex images.

Advantages:

 Accurate and Detailed Captions: Advanced models, especially those


using attention mechanisms and transformers, can generate highly
accurate and detailed captions, capturing the nuances of the image
content.
 Versatility: These models can be applied to a wide range of image types,
from simple objects to complex scenes, making them versatile tools for
various applications.

Disadvantages:

 Computational Cost: Training and deploying these models, especially


transformer-based ones, can be computationally expensive, requiring
significant hardware resources.

 Data Dependency: The performance of these models heavily relies on
the quality and quantity of the training data. A lack of diverse and well-
annotated data can limit their capabilities.

2.2. PROBLEM IDENTIFICATION

Major Problems with Existing Image Captioning Models

 Lack of Contextual Understanding: Many models struggle to capture


the broader context of an image, often generating generic and repetitive
captions.
 Limited Creativity and Diversity: Existing models often produce
predictable and formulaic captions, lacking creativity and diversity in
their language.
 Sensitivity to Image Quality and Noise: The performance of these
models can degrade significantly when faced with low-quality or noisy
images.
 Difficulty in Handling Complex Scenes: Models may struggle to
accurately describe complex scenes with multiple objects and intricate
relationships.
 Language Bias and Stereotyping: Some models may exhibit biases in
their generated captions, reflecting societal stereotypes and prejudices.

2.3. PROBLEM ANALYSIS
 Semantic Gap: Bridging the semantic gap between visual and textual
modalities remains a significant challenge. Accurately mapping complex visual
scenes to natural language descriptions requires understanding object
relationships, spatial layouts, and contextual cues.
 Handling Visual Ambiguity: Images often contain multiple
interpretations, and generating accurate captions requires disambiguation based
on contextual clues.
 Generating Diverse and Creative Captions: While accuracy is
crucial, generating diverse and creative captions is equally important to enhance
user experience.
 Handling Noisy and Low-Quality Images: Real-world images can
be noisy, low-resolution, or have occlusions, which can significantly impact the
performance of image captioning models.
 Contextual Understanding: Understanding the broader context of an
image, including cultural and social nuances, is essential for generating accurate
and relevant captions.
 Evaluation Metrics: Developing reliable evaluation metrics to assess
the quality of generated captions is challenging, as it involves both semantic and
syntactic correctness.

2.4. PROPOSED SYSTEM
The proposed system leverages the power of transformer-based
architectures to significantly enhance image captioning capabilities. By
incorporating advanced techniques like attention mechanisms and data
augmentation, the system can generate more accurate, diverse, and contextually
relevant captions. The system's ability to handle complex scenes, noisy images,
and diverse language styles makes it a versatile tool for a wide range of
applications, including image search, accessibility, and content creation.
Additionally, the system can be easily integrated into various applications, such
as social media platforms, e-commerce websites, and educational tools, to
provide a more immersive and informative user experience. Furthermore, the
system has the potential to be extended to other image-related tasks, such as
image question answering and visual storytelling, opening up new possibilities
for human-computer interaction. By addressing the limitations of existing
models, the proposed system aims to advance the state-of-the-art in image
captioning and provide a valuable tool for a variety of applications.

ADVANTAGES:

 Improved Accuracy: The system generates more accurate and


contextually relevant captions.
 Enhanced Diversity: The system produces a wider range of creative and
engaging captions.

CHAPTER 3
SYSTEM SPECIFICATIONS
3.1. HARDWARE REQUIREMENTS
 Hard Disk : 256GB and Above
 RAM : 4GB and Above
 Processor : i3 and Above

3.2. SOFTWARE REQUIREMENTS


 Operating System : Windows 11 (64-bit)
 Tools : Google Colab
 Language : Python (3.9)

3.3. SOFTWARE DESCRIPTION


3.3.1. Python
Python is a widely used, general-purpose, high-level programming
language. It was created by Guido van Rossum, first released in 1991, and is
now developed by the Python Software Foundation. It was designed with an
emphasis on code readability, and its syntax allows programmers to express
concepts in fewer lines of code. Python lets you work quickly and integrate
systems more efficiently. It is used for:
✓ Web development (server-side)
✓ Software development
✓ Mathematics
✓ System scripting

What can Python do?
 Python can be used on a server to create web applications.
 Python can be used alongside software to create workflows.
 Python can connect to database systems. It can also read and modify
files.
 Python can be used to handle big data and perform complex
mathematics.
 Python can be used for rapid prototyping or production-ready software
development.

Why Python?
 Python works on different platforms (Windows, Mac, Linux, Raspberry
Pi, etc.).
 Python has a simple syntax similar to the English language.
 Python has a syntax that allows developers to write programs with fewer
lines than some other programming languages.
 Python runs on an interpreter system, meaning that code can be executed
as soon as it is written. This means that prototyping can be very quick.
 Python can be treated procedurally, in an object-oriented way, or in a
functional way.

Python Features
 Easy to learn − Python has few keywords, a simple structure, and a
clearly defined syntax. This allows the student to pick up the language
quickly.
 Easy to read − Python code is more clearly defined and visible to the
eyes.

 Easy to maintain − Python's source code is fairly easy to maintain.
 A broad standard library − Python's bulk of the library is very portable
and cross-platform compatible on UNIX, Windows, and Macintosh.
 Interactive Mode − Python has support for an interactive mode that
allows interactive testing and debugging of snippets of code.
 Portable − Python can run on a wide variety of hardware platforms and
has the same interface on all platforms.
 Extendable − You can add low-level modules to the Python interpreter.
These modules enable programmers to add to or customize their tools to
be more efficient.
 Databases − Python provides interfaces to all major commercial
databases.
 GUI Programming − Python supports GUI applications that can be
created and ported to many system calls, libraries, and windows systems,
such as Windows MFC, Macintosh, and the X Window system of Unix.
 Scalable − Python provides a better structure and support for large
programs than shell scripting.

Python Syntax compared to other programming languages


 Python was designed for readability, and it has some similarities to the
English language with influence from mathematics.
 Python uses new lines to complete a command, as opposed to other
programming languages which often use semicolons or parentheses.
 Python relies on indentation, using whitespace, to define scope, such as
the scope of loops, functions, and classes. Other programming languages
often use curly brackets for this purpose (see the short example below).
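A short example of these rules (illustrative only; the function and values are made up):

# Newlines end statements and indentation defines scope; no semicolons or braces are needed.
def describe(scores):
    for name, score in scores.items():
        if score >= 0.5:
            print(f"{name}: likely present")
        else:
            print(f"{name}: unlikely")

describe({"cat": 0.91, "dog": 0.12})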

Python Libraries
Machine Learning, as the name suggests, is the science of programming
computers so that they can learn from different kinds of data. A more general
definition, given by Arthur Samuel, is: “Machine Learning is the field of study
that gives computers the ability to learn without being explicitly programmed.”
Machine learning techniques are used to solve a wide range of real-world
problems. In the past, machine learning tasks were performed by manually
coding all the algorithms and the mathematical and statistical formulas, which
made the process time-consuming and inefficient. Today, Python libraries,
frameworks, and modules make this work far easier and more efficient. Python
has become one of the most popular programming languages for this task,
replacing many other languages in the industry, and one of the reasons is its
vast collection of libraries. The Python libraries used in this project are:
 Transformers
 Torch
 PIL
 Requests

Transformers: This powerful library provides state-of-the-art Natural


Language Processing (NLP) models and tools, enabling efficient and effective
handling of text data. It simplifies the process of working with pre-trained
models like Vision Encoder-Decoder models, allowing you to quickly integrate
them into your projects.

Torch: As a fundamental deep learning library, PyTorch offers a flexible and


efficient framework for tensor computations and automatic differentiation. It
provides the necessary tools for training and deploying neural networks,
including image captioning models.

PIL (Pillow): This library is indispensable for image processing tasks, enabling
you to open, manipulate, and preprocess images before feeding them into the
model. It provides a user-friendly interface for common image operations.

Requests: This library simplifies the process of making HTTP requests,
allowing you to fetch images from the web and incorporate them into your
image captioning pipeline. It provides a convenient way to interact with web
APIs and download image data.

3.3.2 Google Colab

Google Colab is an innovative platform developed by Google that


provides a cloud-based environment for research and education in fields such as
machine learning, data analysis, and artificial intelligence. It offers a Jupyter
notebook interface that requires no setup and runs entirely in the cloud.
Features of Google Colab
 Accessibility: Colab is accessible via a web browser, with no installation
required, making it widely accessible to users worldwide.
 Free Access to Hardware: It provides free access to computing
resources including GPUs (Graphics Processing Units) and TPUs (Tensor
Processing Units), which are crucial for processing large datasets and
complex computations.
 Collaboration: Similar to Google Docs, Colab allows multiple users to
collaborate on the same document in real-time, facilitating team projects
and educational environments.
 Integration with Google Drive: Colab is seamlessly integrated with
Google Drive, allowing users to store their notebooks and access them
from anywhere. This integration also facilitates the sharing of notebooks
and resources.
 Compatibility: The platform supports most libraries and frameworks
used in machine learning and data science, making it a versatile tool for
developers and researchers.

Use Cases

Google Colab is used extensively in academia and industry for a variety of
purposes:
 Educational Purposes: Educators use Colab to teach coding, data
science, machine learning, and computational mathematics. The zero-
setup environment means students can start coding without any barriers
related to software installation.
 Research: Researchers utilize the powerful computational resources
provided by Colab to train complex models on large datasets,
significantly reducing the time and cost associated with such
computations.
 Prototype Development: Developers use Colab to prototype new ideas
and algorithms quickly, leveraging its integration with various APIs and
data sources.

Advantages
 Colab removes the barrier of expensive hardware for individuals and
small organizations.
 The platform’s ease of use and no setup requirement allow users to focus
on coding and analysis rather than system configuration.
 Real-time collaboration and easy sharing increase productivity and
facilitate educational and professional teamwork.

3.3.3. Recurrent Neural Networks
A Recurrent Neural Network (RNN) is a type of neural network in which the
output from the previous step is fed as input to the current step. In traditional
neural networks, all the inputs and outputs are independent of each other.
However, when the next word of a sentence must be predicted, the previous
words are required, and hence there is a need to remember them. Thus the
RNN came into existence, which solved this issue with the help of a
Hidden Layer. The main and most important feature of RNN is its Hidden state,
which remembers some information about a sequence. The state is also referred
to as Memory State since it remembers the previous input to the network. It uses
the same parameters for each input as it performs the same task on all the inputs
or hidden layers to produce the output. This reduces the complexity of
parameters, unlike other neural networks.
How does an RNN differ from a feedforward neural network?
Artificial neural networks that do not have looping nodes are called
feedforward neural networks. Because all information is only passed forward, this
kind of neural network is also referred to as a multi-layer neural network.
Information moves from the input layer to the output layer – if any
hidden layers are present – unidirectionally in a feedforward neural network.
These networks are appropriate for image classification tasks, for example,
where input and output are independent. Nevertheless, their inability to retain
previous inputs automatically renders them less useful for sequential data
analysis.

Figure No. 3.1: Recurrent vs Feedforward Networks
Recurrent Neuron and RNN Unfolding
The fundamental processing unit in a Recurrent Neural Network (RNN)
is the recurrent unit, sometimes loosely referred to as a “recurrent neuron”. This
unit has the unique ability to maintain a hidden state, allowing the network to
capture sequential dependencies by remembering previous inputs while
processing. Long Short-Term Memory (LSTM) and Gated Recurrent Unit
(GRU) variants improve the RNN’s ability to handle long-term dependencies.

Recurrent Neural Network Architecture


RNNs have the same input and output architecture as any other deep
neural architecture. However, differences arise in the way information flows
from input to output. Unlike deep feedforward networks, where each dense
layer has its own weight matrix, an RNN reuses the same weights across the
whole sequence. It computes a hidden state h_t for every input x_t using the
following formulas:

h_t = σ(U x_t + W h_{t-1} + b)
y_t = O(V h_t + c)

Hence

y_t = f(x_t, h_{t-1}, W, U, V, b, c)

Here h_t is the state of the network at timestep t, and the parameters
W, U, V, b and c are shared across all timesteps.

Figure No. 3.2: Recurrent Neural Network Architecture

How does RNN work?


The Recurrent Neural Network consists of multiple fixed activation
function units, one for each time step. Each unit has an internal state which is
called the hidden state of the unit. This hidden state signifies the past knowledge
that the network currently holds at a given time step. This hidden state is
updated at every time step to signify the change in the knowledge of the
network about the past. The hidden state is updated using the following
recurrence relation.

The formula for calculating the current state:

h_t = f(h_{t-1}, x_t)

where,
 h_t -> current state
 h_{t-1} -> previous state
 x_t -> input state

The formula after applying the activation function (tanh):

h_t = tanh(W_hh · h_{t-1} + W_xh · x_t)

where,
 W_hh -> weight at the recurrent neuron
 W_xh -> weight at the input neuron

The formula for calculating the output:

y_t = W_hy · h_t

where,
 y_t -> output
 W_hy -> weight at the output layer

These parameters are updated using backpropagation. However, since an RNN

works on sequential data, an extended form of backpropagation, known as
Backpropagation Through Time (BPTT), is used.
Training through RNN
 A single time step of the input is provided to the network.
 Its current state is then calculated using the current input and the
previous state.
 The current ht becomes ht-1 for the next time step.
 One can go back as many time steps as the problem requires and join the
information from all the previous states.
 Once all the time steps are completed, the final current state is used to
calculate the output.
 The output is then compared to the actual output, i.e., the target output,
and the error is generated.
 The error is then back-propagated through the network to update the
weights, and hence the network (RNN) is trained using Backpropagation
Through Time (see the sketch below).
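A minimal sketch of this training procedure in PyTorch, assuming a single-layer RNN
with a tanh activation and randomly generated dummy data (all dimensions and the
squared-error loss are arbitrary choices made only for illustration):

import torch

torch.manual_seed(0)
input_dim, hidden_dim, output_dim, T = 8, 16, 4, 5   # arbitrary sizes, 5 time steps

# The same weights are reused at every time step (the defining property of an RNN).
W_xh = torch.randn(input_dim, hidden_dim, requires_grad=True)
W_hh = torch.randn(hidden_dim, hidden_dim, requires_grad=True)
W_hy = torch.randn(hidden_dim, output_dim, requires_grad=True)

x = torch.randn(T, input_dim)          # dummy input sequence
target = torch.randn(output_dim)       # dummy target output

h = torch.zeros(hidden_dim)
for t in range(T):                     # unroll the network through time
    h = torch.tanh(x[t] @ W_xh + h @ W_hh)   # h_t = tanh(W_xh x_t + W_hh h_{t-1})
y = h @ W_hy                           # output computed from the final hidden state

loss = ((y - target) ** 2).mean()      # error between the output and the target
loss.backward()                        # backpropagation through time (BPTT)
print(W_hh.grad.shape)                 # gradients have flowed through every time step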

Advantages
 An RNN remembers every piece of information through time, which
makes it useful in time-series prediction because it can take previous
inputs into account.
 Recurrent neural networks are even used with convolutional layers to
extend the effective pixel neighborhood.

3.3.4. Artificial Neural Networks

Artificial Neural Networks contain artificial neurons which are


called units. These units are arranged in a series of layers that together
constitute the whole Artificial Neural Network in a system. A layer can have
anywhere from a dozen units to millions of units, depending on how complex
the network needs to be to learn the hidden patterns in the dataset.
Commonly, Artificial Neural Network has an input layer, an output layer as
well as hidden layers. The input layer receives data from the outside world
which the neural network needs to analyze or learn about. Then this data passes
through one or multiple hidden layers that transform the input into data that is
valuable for the output layer. Finally, the output layer provides an output in the
form of a response of the Artificial Neural Networks to input data provided.
In the majority of neural networks, units are interconnected from one layer to
another. Each of these connections has weights that determine the influence of
one unit on another unit. As the data transfers from one unit to another, the
neural network learns more and more about the data which eventually results in
an output from the output layer.

Figure No. 3.3: Neural Network Architecture
The structures and operations of human neurons serve as the basis for artificial
neural networks, which are also known as neural networks or neural nets. The input
layer of an artificial neural network is the first layer, and it receives input from
external sources and releases it to the hidden layer, which is the second layer. In
the hidden layer, each neuron receives input from the previous layer neurons,
computes the weighted sum, and sends it to the neurons in the next layer. These
connections are weighted, meaning the effect of each input from the previous
layer is scaled by the weight assigned to it; these weights are adjusted during
the training process to improve model performance.
How do Artificial Neural Networks learn?
Artificial neural networks are trained using a training set. For example,
suppose you want to teach an ANN to recognize a cat. Then it is shown
thousands of different images of cats so that the network can learn to identify a
cat. Once the neural network has been trained enough using images of cats, then
you need to check if it can identify cat images correctly. This is done by making

the ANN classify the images it is provided by deciding whether they are cat
images or not. The output obtained from the ANN is compared against a human-
provided label of whether the image is a cat image or not. If the ANN
classifies an image incorrectly, backpropagation is used to adjust what it has
learned during training. Backpropagation is done by fine-tuning the weights of
the connections in ANN units based on the error rate obtained. This process
continues until the artificial neural network can correctly recognize a cat in an
image with minimal possible error rates.
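The training loop described above can be sketched compactly in PyTorch. The example
below uses a small fully connected network on randomly generated stand-in data; the
real images, labels, layer sizes, and learning rate are assumptions made only for
illustration:

import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in data: 64 flattened "images" with binary cat / not-cat labels.
features = torch.randn(64, 1024)
labels = torch.randint(0, 2, (64, 1)).float()

# Input layer -> hidden layer -> output layer, as described above.
net = nn.Sequential(nn.Linear(1024, 32), nn.ReLU(), nn.Linear(32, 1))
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

for epoch in range(20):
    optimizer.zero_grad()
    predictions = net(features)            # forward pass through the layers
    loss = criterion(predictions, labels)  # error between prediction and the provided label
    loss.backward()                        # backpropagation: compute how to adjust each weight
    optimizer.step()                       # fine-tune the connection weights

accuracy = ((net(features) > 0).float() == labels).float().mean()
print(f"training accuracy: {accuracy:.2f}")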

CHAPTER 4
PROJECT DESCRIPTION
4.1. MODULES
The following modules are used in our project to develop an image
captioning system using machine learning and neural network techniques:
 Pre-Trained Model
 Model Evaluation
 Model Development

4.2. MODULE DESCRIPTION


4.2.1. Pre-Trained Model

The model effectively utilizes a pre-trained image captioning model,


specifically the "nlpconnect/vit-gpt2-image-captioning" model from the
Hugging Face Transformers library. This model has been trained on a massive
dataset of images and their corresponding captions, enabling it to generate
accurate and descriptive captions for a wide range of images.

4.2.2. Model Evaluation


To evaluate the performance of the image captioning model, metrics like
BLEU, METEOR, ROUGE, and CIDEr can be employed. These metrics assess
factors such as semantic similarity, grammatical correctness, and overall
coherence between the generated and reference captions. To further improve the
model, techniques such as data augmentation, fine-tuning, attention
mechanisms, beam search, and model ensembling can be utilized. By carefully
considering these evaluation metrics and improvement strategies, we can
develop more accurate and robust image captioning models.

4.2.3. Model Development
The proposed image captioning model leverages a pre-trained Vision
Transformer (ViT) as the encoder to extract visual features from input images.
A transformer-based decoder, such as GPT-2, then generates captions
conditioned on these features. To improve performance, techniques like data
augmentation, fine-tuning, and attention mechanisms are employed. The model
is trained using a combination of cross-entropy loss and a language modeling
objective. By leveraging the power of pre-trained models and advanced
techniques, this approach aims to generate accurate, diverse, and contextually
relevant captions.
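For reference, the Transformers library allows such a ViT encoder and GPT-2 decoder
to be paired directly, before any fine-tuning on a captioning dataset. The sketch
below is illustrative only: the checkpoint names ("google/vit-base-patch16-224-in21k"
and "gpt2") and the token settings are assumptions, while the project itself loads the
already fine-tuned "nlpconnect/vit-gpt2-image-captioning" checkpoint shown in Chapter 5.

from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer

# Pair a pre-trained ViT encoder with a pre-trained GPT-2 decoder.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# GPT-2 has no padding token by default; reuse EOS and tell the model how decoding starts.
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# The combined model can now be fine-tuned with cross-entropy loss on (image, caption) pairs.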

CHAPTER 5
SYSTEM IMPLEMENTATION
Implementation is the stage of the project when the theoretical design is
turned into a working system. System implementation generally benefits from
high levels of user involvement and management support. User participation in
the design and operation of information systems has several positive results.
First, if users are heavily involved in systems design, they have more opportunities
to mold the system according to their priorities and business requirements, and
more opportunities to control the outcome. Second, they are more likely to react
positively to the change process. Incorporating user knowledge and expertise
leads to better solutions.

5.1. PRE-TRAINED MODEL


Model training is a crucial step in building the image captioning system. The
goal of model training is to create a model that accurately generates a caption
for the provided image.
1. Installation:
pip install transformers torch

 This line installs two Python libraries:


o Transformers: This library provides pre-trained models for various
Natural Language Processing (NLP) tasks, including image
captioning. It also offers tools for loading, fine-tuning, and using
these models.
o Torch: This library is the core of PyTorch, a deep learning
framework that provides powerful tools for building and training
neural networks. The transformers library relies on PyTorch for its
operations.

2. Importing Libraries:
from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer
import torch
from PIL import Image
import requests

 This block imports the necessary libraries for image captioning:


o VisionEncoderDecoderModel from transformers: This class
represents a pre-trained encoder-decoder model specifically
designed for generating captions from images.
o ViTFeatureExtractor from transformers: This class is responsible
for pre-processing images into a format suitable for the
VisionEncoderDecoderModel.
o AutoTokenizer from transformers: This class helps with
tokenization, which is the process of converting text into sequences
of meaningful units (tokens). It automatically detects the correct
tokenizer based on the pre-trained model name.
o torch: Imported for tensor operations used in deep learning models.
o Image from PIL: Used for loading and manipulating images.
o requests: Used for downloading images from URLs if needed.

3. Loading the Pre-trained Model:
model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
feature_extractor = ViTFeatureExtractor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

 These lines load the pre-trained image captioning model and its
associated components:
o model: This variable stores the actual
VisionEncoderDecoderModel instance. The from_pretrained
method loads the weights and configuration of the pre-trained
model from the specified name ("nlpconnect/vit-gpt2-image-
captioning"). This model takes pre-processed image features as
input and generates captions as output.
o feature_extractor: This variable holds the ViTFeatureExtractor
instance. It knows how to convert images into a format that the
model can understand.
o tokenizer: This variable contains the AutoTokenizer instance. It is
responsible for converting text captions into sequences of tokens
for processing by the model, and vice versa (decoding) after the
model generates the caption.

4. Pre-processing the Image (Function preprocess_image):
def preprocess_image(image_path):
    if image_path.startswith('http'):
        image = Image.open(requests.get(image_path, stream=True).raw)
    else:
        image = Image.open(image_path)
    return feature_extractor(images=image, return_tensors="pt").pixel_values

 This function takes an image path (local or URL) as input and returns the
pre-processed image features. Here's how it works:
o It first checks if the image_path starts with "http" to determine if
it's a URL or a local file path.
o If it's a URL, it uses requests to download the image and then
opens it using Image.open from the PIL library.
o If it's a local file path, it directly opens the image using
Image.open.
o Finally, it uses the feature_extractor to convert the image into a
tensor representation compatible with the model. The
return_tensors="pt" argument ensures that the image is returned as
a PyTorch tensor.
o The function returns the pre-processed image features, which will
be fed into the model for generating a caption.

5. Generating a Caption (Function generate_caption):
def generate_caption(image_path):
    pixel_values = preprocess_image(image_path)
    output_ids = model.generate(pixel_values, max_length=16, num_beams=4, early_stopping=True)
    caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return caption

 This function takes an image path as input and returns a generated caption
for that image. It first pre-processes the image, then calls model.generate with
beam search (num_beams=4) to produce a caption of at most 16 tokens, and
finally decodes the generated token IDs into human-readable text with the
tokenizer. A short usage example is shown below.
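A short usage example (the file name is a placeholder; any local path or image URL
can be passed):

# "example.jpg" is a placeholder path, not a file shipped with this report.
caption = generate_caption("example.jpg")
print(caption)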

5.2. MODEL EVALUATION

To assess the quality of generated captions, we employ a combination of


quantitative and qualitative metrics:

Quantitative Metrics:

1. BLEU (Bilingual Evaluation Understudy):


o Measures the precision of n-grams (sequences of words) between
the generated and reference captions.
o Higher BLEU scores indicate better word-level similarity.
o However, BLEU might overemphasize exact word matches and
overlook semantic similarity.

2. METEOR (Metric for Evaluation of Translation with Explicit Ordering):

o Considers semantic similarity, word order, and stemming to
evaluate captions.
o It is more robust to syntactic variations and provides a more
accurate evaluation than BLEU.
o METEOR is often preferred for image captioning due to its focus
on semantic meaning.

3. CIDEr (Consensus-Based Image Description Evaluation):


o Evaluates captions based on consensus between human ratings and
machine-generated scores.
o It considers factors like semantic similarity, grammatical
correctness, and overall coherence.

o CIDEr is particularly useful for assessing the overall quality and
relevance of generated captions.

Qualitative Evaluation:

 Human Evaluation: Human evaluators can assess the quality of


generated captions based on factors such as fluency, relevance, and
creativity. This subjective evaluation provides insights into the model's
ability to generate natural and engaging captions.
 Visual Inspection: Visually comparing the generated captions with the
original images can help identify potential errors or biases. This can be
especially useful for identifying cases where the model fails to capture
specific details or generates incorrect or irrelevant captions.
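As an illustration of the quantitative metrics above, sentence-level BLEU can be
computed with the NLTK library. NLTK is not part of this project's code, and the
captions below are made-up examples; the snippet only sketches how such a score
would be obtained:

# pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["a", "dog", "runs", "across", "the", "grass"]]   # tokenised ground-truth caption
candidate = ["a", "dog", "is", "running", "on", "grass"]       # tokenised generated caption

smooth = SmoothingFunction().method1          # avoids zero scores for short captions
score = sentence_bleu(reference, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")

METEOR and CIDEr are typically computed in the same manner with dedicated evaluation
packages, comparing each generated caption against one or more reference captions.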

5.3. MODEL DEVELOPMENT


The proposed image captioning model leverages a pre-trained Vision
Transformer (ViT) as the encoder to extract visual features from input images.
A transformer-based decoder, such as GPT-2, then generates captions
conditioned on these features. To improve performance, techniques like data
augmentation, fine-tuning, and attention mechanisms are employed. The model
is trained using a combination of cross-entropy loss and a language modeling
objective. By leveraging the power of pre-trained models and advanced
techniques, this approach aims to generate accurate, diverse, and contextually
relevant captions.

CHAPTER 6
CONCLUSION
The developed model leverages a pre-trained Vision Transformer (ViT) to
extract visual features from input images. A transformer-based decoder, such as
GPT-2, then generates captions conditioned on these features. The model
leverages techniques like attention mechanisms, fine-tuning, and data
augmentation to improve performance. By utilizing this state-of-the-art
approach, the model can generate accurate, diverse, and contextually relevant
captions for a wide range of images, enhancing user experience and enabling
innovative applications in fields like image search, content creation, and
accessibility.

Future Scope

Future advancements in image captioning could explore the following


directions:

 Multimodal Learning: Integrating additional modalities like audio or


video to enhance the richness of generated captions.

 Domain-Specific Adaptation: Tailoring models to specific domains
(e.g., medical, legal, artistic) to improve accuracy and relevance.
 Zero-Shot and Few-Shot Learning: Developing models that can
generate captions for unseen objects or scenes with limited training data.
 Ethical Considerations: Addressing biases and ensuring fairness in the
generation of captions, especially in sensitive contexts.
 Real-time Applications: Optimizing models for real-time applications,
such as live video captioning or image description generation.
 Cross-lingual Captioning: Enabling image captioning across multiple
languages to facilitate global accessibility and understanding.

APPENDICES

APPENDIX 1: SOURCE CODE

pip install transformers torch

from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer
import torch
from PIL import Image
import requests

# Load the pre-trained image captioning model and its pre/post-processing components.
model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
feature_extractor = ViTFeatureExtractor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

def preprocess_image(image_path):
    # Load the image from a URL or from a local file path.
    if image_path.startswith('http'):
        image = Image.open(requests.get(image_path, stream=True).raw)
    else:
        image = Image.open(image_path)
    # Convert the image into pixel-value tensors for the model.
    return feature_extractor(images=image, return_tensors="pt").pixel_values

def generate_caption(image_path):
    pixel_values = preprocess_image(image_path)
    # Generate up to 16 caption tokens using beam search.
    output_ids = model.generate(pixel_values, max_length=16, num_beams=4, early_stopping=True)
    caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return caption

# Example usage
image_path = "/content/image.jpg"
caption = generate_caption(image_path)
print(caption)

APPENDIX 2: SCREEN SHOTS
Installing Library

Importing Libraries

Model, Feature Extractor & Tokenizer Loading

Leveraging Pre-trained Models

Image Captioning with Pre-trained Models

Input Image

Output/Result

BIBLIOGRAPHY

[1] Show and Tell: A Pioneering Model for Image Captioning [Vinyals et al.,
2014].

