Internship Report (Sanjay Final)
TECHNOLOGIES LLP, MALLATHAHALLI, BENGALURU”
IT3711-SUMMER INTERNSHIP
Submitted by
SANJAY P - 510821205021
of
BACHELOR OF TECHNOLOGY IN
INFORMATION TECHNOLOGY
ANNA UNIVERSITY: CHENNAI 600 025
BONAFIDE CERTIFICATE
SIGNATURE
D.DURAI KUMAR
Associate Professor
DECLARATION
and the dissertation has not formed the basis for the award of any degree,
Place:
Date :
Signature of the student
(SANJAY P)
INTERNSHIP CERTIFICATE
ABSTRACT
The proposed image captioning system processes an input image and generates a descriptive caption. The image is first
preprocessed to extract relevant features, which are then fed into the model.
TABLE OF CONTENTS
INTERNSHIP CERTIFICATE 2
ABSTRACT 3
1 INTRODUCTION
1.1. OBJECTIVE 6
1.2. OVERVIEW 6
2 LITERATURE SURVEY 7
3 SYSTEM SPECIFICATION 13
3.3.1. Python 13
4 PROJECT DESCRIPTION 27
4.1. MODULES 27
4.2. MODULE DESCRIPTION 27
5 SYSTEM IMPLEMENTATION 29
6 CONCLUSION 37
FUTURE SCOPE 37
APPENDICES 38
BIBLIOGRAPHY 44
CHAPTER 1
INTRODUCTION
1.1. OBJECTIVE
1.2. OVERVIEW
CHAPTER 2
LITERATURE SURVEY
[1] Show and Tell: A Pioneering Model for Image Captioning [Vinyals et al., 2014]
Specification:
Advantage:
Pioneering Deep Learning for Image Captioning: Show and Tell was
one of the first models to successfully apply deep learning techniques to
the task of image captioning. It demonstrated the potential of combining
computer vision and natural language processing.
Disadvantage:
2.1. EXISTING SYSTEM
One popular approach involves using a CNN to extract high-level features from
the image, followed by an RNN to decode these features into a sequence of
words. However, this approach often struggles with generating coherent and
detailed captions, especially for complex images.
Advantages:
Disadvantages:
Data Dependency: The performance of these models heavily relies on
the quality and quantity of the training data. A lack of diverse and well-
annotated data can limit their capabilities.
2.3. PROBLEM ANALYSIS
Semantic Gap: Bridging the semantic gap between visual and textual
modalities remains a significant challenge. Accurately mapping complex visual
scenes to natural language descriptions requires understanding object
relationships, spatial layouts, and contextual cues.
Handling Visual Ambiguity: Images often contain multiple
interpretations, and generating accurate captions requires disambiguation based
on contextual clues.
Generating Diverse and Creative Captions: While accuracy is
crucial, generating diverse and creative captions is equally important to enhance
user experience.
Handling Noisy and Low-Quality Images: Real-world images can
be noisy, low-resolution, or have occlusions, which can significantly impact the
performance of image captioning models.
Contextual Understanding: Understanding the broader context of an
image, including cultural and social nuances, is essential for generating accurate
and relevant captions.
Evaluation Metrics: Developing reliable evaluation metrics to assess
the quality of generated captions is challenging, as it involves both semantic and
syntactic correctness.
2.4. PROPOSED SYSTEM
The proposed system leverages the power of transformer-based
architectures to significantly enhance image captioning capabilities. By
incorporating advanced techniques like attention mechanisms and data
augmentation, the system can generate more accurate, diverse, and contextually
relevant captions. The system's ability to handle complex scenes, noisy images,
and diverse language styles makes it a versatile tool for a wide range of
applications, including image search, accessibility, and content creation.
Additionally, the system can be easily integrated into various applications, such
as social media platforms, e-commerce websites, and educational tools, to
provide a more immersive and informative user experience. Furthermore, the
system has the potential to be extended to other image-related tasks, such as
image question answering and visual storytelling, opening up new possibilities
for human-computer interaction. By addressing the limitations of existing
models, the proposed system aims to advance the state-of-the-art in image
captioning and provide a valuable tool for a variety of applications.
ADVANTAGES:
CHAPTER 3
SYSTEM SPECIFICATIONS
3.1. HARDWARE REQUIREMENTS
Hard Disk : 256GB and Above
RAM : 4GB and Above
Processor : i3 and Above
What can Python do?
Python can be used on a server to create web applications.
Python can be used alongside software to create workflows.
Python can connect to database systems. It can also read and modify
files.
Python can be used to handle big data and perform complex
mathematics.
Python can be used for rapid prototyping or production-ready software
development.
Why Python?
Python works on different platforms (Windows, Mac, Linux, Raspberry
Pi, etc.).
Python has a simple syntax similar to the English language.
Python has a syntax that allows developers to write programs with fewer
lines than some other programming languages.
Python runs on an interpreter system, meaning that code can be executed
as soon as it is written. This means that prototyping can be very quick.
Python can be treated procedurally, in an object-oriented way, or in a
functional way.
Python Features
Easy to learn − Python has few keywords, a simple structure, and a
clearly defined syntax. This allows the student to pick up the language
quickly.
Easy to read − Python code is more clearly defined and visible to the
eyes.
Easy to maintain − Python's source code is fairly easy to maintain.
A broad standard library − Python's bulk of the library is very portable
and cross-platform compatible on UNIX, Windows, and Macintosh.
Interactive Mode − Python has support for an interactive mode that
allows interactive testing and debugging of snippets of code.
Portable − Python can run on a wide variety of hardware platforms and
has the same interface on all platforms.
Extendable − You can add low-level modules to the Python interpreter.
These modules enable programmers to add to or customize their tools to
be more efficient.
Databases − Python provides interfaces to all major commercial
databases.
GUI Programming − Python supports GUI applications that can be
created and ported to many system calls, libraries, and windows systems,
such as Windows MFC, Macintosh, and the X Window system of Unix.
Scalable − Python provides a better structure and support for large
programs than shell scripting.
Python Libraries
Machine Learning, as the name suggests, is the science of programming
computers so that they can learn from different kinds of data. A more
general definition, given by Arthur Samuel, is: “Machine Learning is the field
of study that gives computers the ability to learn without being explicitly
programmed.” Such techniques are typically used to solve a wide variety of
real-world problems. Earlier, Machine Learning tasks were performed by
manually coding all the algorithms and the mathematical and statistical
formulas, which made the process time-consuming and inefficient. Today,
Python libraries, frameworks, and modules make this work far easier and more
efficient. Python is now one of the most popular programming languages for
this task, and it has replaced many languages in the industry; one of the reasons
is its vast collection of libraries. The Python libraries used in Machine Learning
here are:
Transformers
Torch
PIL
Requests
PIL (Pillow): This library is indispensable for image processing tasks, enabling
you to open, manipulate, and preprocess images before feeding them into the
model. It provides a user-friendly interface for common image operations.
Requests: This library simplifies the process of making HTTP requests,
allowing you to fetch images from the web and incorporate them into your
image captioning pipeline. It provides a convenient way to interact with web
APIs and download image data.
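A small illustration of how these two libraries work together in such a pipeline is given below; the URL is a placeholder assumption, not one used in the project:
Python
import requests
from PIL import Image

# Download an image over HTTP and hand the raw byte stream to PIL
response = requests.get("https://example.com/sample.jpg", stream=True)
image = Image.open(response.raw).convert("RGB")

# Basic preprocessing: resize to the input size a ViT encoder typically expects
image = image.resize((224, 224))
print(image.size)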
Use Cases
Google Colab is used extensively in academia and industry for a variety of
purposes:
Educational Purposes: Educators use Colab to teach coding, data
science, machine learning, and computational mathematics. The zero-
setup environment means students can start coding without any barriers
related to software installation.
Research: Researchers utilize the powerful computational resources
provided by Colab to train complex models on large datasets,
significantly reducing the time and cost associated with such
computations.
Prototype Development: Developers use Colab to prototype new ideas
and algorithms quickly, leveraging its integration with various APIs and
data sources.
Advantages
Colab removes the barrier of expensive hardware for individuals and
small organizations.
The platform’s ease of use and no setup requirement allow users to focus
on coding and analysis rather than system configuration.
Real-time collaboration and easy sharing increase productivity and
facilitate educational and professional teamwork.
3.3.3. Recurrent Neural Networks
A Recurrent Neural Network (RNN) is a type of neural network in which the
output from the previous step is fed as input to the current step. In traditional
neural networks, all the inputs and outputs are independent of each other.
However, in cases where the next word of a sentence must be predicted, the
previous words are required, and hence there is a need to remember them.
RNNs came into existence to solve this issue, with the help of a hidden layer.
The main and most important feature of an RNN is its hidden state, which
remembers some information about a sequence. This state is also referred to as
the Memory State, since it remembers the previous input to the network. An
RNN uses the same parameters for each input, as it performs the same task on
all the inputs or hidden layers to produce the output. This reduces the number
of parameters, unlike other neural networks.
How does an RNN differ from a Feedforward Neural Network?
Artificial neural networks that do not have looping nodes are called feed
forward neural networks. Because all information is only passed forward, this
kind of neural network is also referred to as a multi-layer neural network.
Information moves from the input layer to the output layer – if any
hidden layers are present – unidirectionally in a feedforward neural network.
These networks are appropriate for image classification tasks, for example,
where input and output are independent. Nevertheless, their inability to retain
previous inputs automatically renders them less useful for sequential data
analysis.
Figure No 3.1: Recurrent vs Feedforward Networks
Recurrent Neuron and RNN Unfolding
The fundamental processing unit in a Recurrent Neural Network (RNN)
is a Recurrent Unit, which is not explicitly called a “Recurrent Neuron.” This
unit has the unique ability to maintain a hidden state, allowing the network to
capture sequential dependencies by remembering previous inputs while
processing. Long Short-Term Memory (LSTM) and Gated Recurrent Unit
(GRU) variants improve the RNN's ability to handle long-term dependencies.
The unfolded network can be written compactly as

Y = f(X, h, W, U, V, b, c)

Here h (also written as the state matrix S, whose element s_i is the state of the
network at timestep i) collects the hidden states, and the parameters W, U, V, b, c
are shared across all timesteps. At each timestep t the recurrence is

h_t = activation(U * x_t + W * h_(t-1) + b)
y_t = V * h_t + c

where x_t is the input state at timestep t and h_t is the hidden state carrying
information from earlier timesteps.
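A minimal sketch of this recurrence in code is given below; the layer sizes and random values are illustrative assumptions:
Python
import numpy as np

input_size, hidden_size = 3, 4
U = np.random.randn(hidden_size, input_size)   # input-to-hidden weights
W = np.random.randn(hidden_size, hidden_size)  # hidden-to-hidden weights
b = np.zeros(hidden_size)                      # bias

h = np.zeros(hidden_size)                      # initial hidden (memory) state
sequence = [np.random.randn(input_size) for _ in range(5)]
for x_t in sequence:
    h = np.tanh(U @ x_t + W @ h + b)           # the same U, W, b are reused at every timestep
print(h)                                       # final hidden state summarising the sequence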
Advantages
An RNN remembers each piece of information through time. This makes it
useful for time-series prediction, precisely because of its ability to remember
previous inputs.
Recurrent neural networks are even used with convolutional layers to
extend the effective pixel neighborhood.
3.3.4. Artificial Neural Networks
Figure No 3.3: Neural Network Architecture
The structure and operation of human neurons serve as the basis for artificial
neural networks, which are also known as neural networks or neural nets. The
input layer of an artificial neural network is the first layer; it receives input from
external sources and passes it to the hidden layer, which is the second layer. In
the hidden layer, each neuron receives input from the previous layer's neurons,
computes the weighted sum, and sends it to the neurons in the next layer. These
connections are weighted, meaning that the effect of each input from the
previous layer is scaled by the weight assigned to it; these weights are adjusted
during the training process to improve model performance.
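As a small illustration of the weighted sum computed by a single hidden-layer neuron, consider the sketch below; the numbers are illustrative assumptions:
Python
import numpy as np

inputs = np.array([0.5, 0.2, 0.9])     # outputs of the previous layer
weights = np.array([0.4, -0.6, 0.1])   # learned connection weights
bias = 0.05

weighted_sum = np.dot(inputs, weights) + bias
activation = 1 / (1 + np.exp(-weighted_sum))   # sigmoid activation passed to the next layer
print(weighted_sum, activation)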
How do Artificial Neural Networks learn?
Artificial neural networks are trained using a training set. For example,
suppose you want to teach an ANN to recognize a cat. Then it is shown
thousands of different images of cats so that the network can learn to identify a
cat. Once the neural network has been trained enough using images of cats, then
you need to check if it can identify cat images correctly. This is done by having
the ANN classify the images it is given, deciding whether they are cat
images or not. The output obtained by the ANN is corroborated by a human-
provided description of whether the image is a cat image or not. If the ANN
identifies incorrectly then back-propagation is used to adjust whatever it has
learned during training. Backpropagation is done by fine-tuning the weights of
the connections in ANN units based on the error rate obtained. This process
continues until the artificial neural network can correctly recognize a cat in an
image with minimal possible error rates.
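A minimal sketch of this train-check-backpropagate loop is shown below, using a toy binary classifier in PyTorch; the network size, feature vectors, and labels are illustrative assumptions rather than the actual cat-recognition setup:
Python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(64, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

features = torch.randn(8, 64)                   # stand-in for image features
labels = torch.randint(0, 2, (8, 1)).float()    # human-provided labels: 1 = cat, 0 = not cat

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(net(features), labels)       # error between prediction and provided label
    loss.backward()                             # backpropagation: gradients of the error w.r.t. the weights
    optimizer.step()                            # adjust connection weights to reduce the error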
CHAPTER 4
PROJECT DESCRIPTION
4.1. MODULES
This is the list of modules used in our project to develop the image captioning
system using Machine Learning and Neural Network techniques:
Pre-Trained Model
Model Evaluation
Model Development
4.2.3. Model Development
The proposed image captioning model leverages a pre-trained Vision
Transformer (ViT) as the encoder to extract visual features from input images.
A transformer-based decoder, such as GPT-2, then generates captions
conditioned on these features. To improve performance, techniques like data
augmentation, fine-tuning, and attention mechanisms are employed. The model
is trained using a combination of cross-entropy loss and a language modeling
objective. By leveraging the power of pre-trained models and advanced
techniques, this approach aims to generate accurate, diverse, and contextually
relevant captions.
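A minimal fine-tuning sketch of this setup is given below. It assumes a hypothetical dataloader yielding batches of (PIL images, caption strings) and illustrative hyperparameters; it is not the exact training script used here:
Python
import torch
from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer

model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
feature_extractor = ViTFeatureExtractor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

tokenizer.pad_token = tokenizer.eos_token                      # GPT-2 has no pad token by default
model.config.decoder_start_token_id = tokenizer.bos_token_id   # may already be set in the checkpoint config
model.config.pad_token_id = tokenizer.pad_token_id

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)     # illustrative learning rate
model.train()
for images, captions in dataloader:   # hypothetical dataloader of (image, caption) batches
    pixel_values = feature_extractor(images=images, return_tensors="pt").pixel_values
    labels = tokenizer(captions, return_tensors="pt", padding=True, truncation=True).input_ids
    # The model shifts the labels internally and computes a cross-entropy language-modelling loss;
    # in practice, padded label positions are usually masked out (set to -100).
    outputs = model(pixel_values=pixel_values, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()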
CHAPTER 5
SYSTEM IMPLEMENTATION
Implementation is the stage of the project when the theoretical design is
turned into a working system. System implementation generally benefits from
high levels of user involvement and management support. User participation in
the design and operation of information systems has several positive results.
First, if users are heavily involved in systems design, they have more
opportunities to mold the system according to their priorities and business
requirements, and more opportunities to control the outcome. Second, they are more likely to react
positively to the change process. Incorporating user knowledge and expertise
leads to better solutions.
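Step 1 of the implementation (installing the required libraries, labelled "Installing Library" in the screenshot appendix) is not reproduced in the text. A typical install command in Colab, assuming the packages implied by the imports below, would be:

!pip install transformers torch pillow requests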
2. Importing Libraries:
from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer
import torch
from PIL import Image
import requests
3. Loading the Pre-trained Model:
Python
model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
feature_extractor = ViTFeatureExtractor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
These lines load the pre-trained image captioning model and its
associated components:
o model: This variable stores the actual
VisionEncoderDecoderModel instance. The from_pretrained
method loads the weights and configuration of the pre-trained
model from the specified name ("nlpconnect/vit-gpt2-image-
captioning"). This model takes pre-processed image features as
input and generates captions as output.
o feature_extractor: This variable holds the ViTFeatureExtractor
instance. It knows how to convert images into a format that the
model can understand.
o tokenizer: This variable contains the AutoTokenizer instance. It is
responsible for converting text captions into sequences of tokens
for processing by the model, and vice versa (decoding) after the
model generates the caption.
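As a quick, illustrative check that the three components load and fit together (the image file name below is an assumption):
Python
from PIL import Image

image = Image.open("example.jpg")   # hypothetical local image
pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
print(pixel_values.shape)           # typically torch.Size([1, 3, 224, 224]) for this ViT encoder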
4. Pre-processing the Image (Function preprocess_image):
Python
def preprocess_image(image_path):
    if image_path.startswith('http'):
        image = Image.open(requests.get(image_path, stream=True).raw)
    else:
        image = Image.open(image_path)
    return feature_extractor(images=image, return_tensors="pt").pixel_values
This function takes an image path (local or URL) as input and returns the
pre-processed image features. Here's how it works:
o It first checks if the image_path starts with "http" to determine if
it's a URL or a local file path.
o If it's a URL, it uses requests to download the image and then
opens it using Image.open from the PIL library.
o If it's a local file path, it directly opens the image using
Image.open.
o Finally, it uses the feature_extractor to convert the image into a
tensor representation compatible with the model. The
return_tensors="pt" argument ensures that the image is returned as
a PyTorch tensor.
o The function returns the pre-processed image features, which will
be fed into the model for generating a caption.
5. Generating a Caption (Function generate_caption):
Python
def generate_caption(image_path):
    pixel_values = preprocess_image(image_path)
    output_ids = model.generate(pixel_values, max_length=16, num_beams=4, early_stopping=True)
    caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return caption
This function takes an image path as input and returns a generated caption for that image.
5.2. MODEL EVALUATION
Quantitative Metrics:
o CIDEr is particularly useful for assessing the overall quality and
relevance of generated captions.
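As a small illustration of one widely used quantitative metric, the sketch below computes a corpus BLEU score with NLTK; the library choice and the example captions are assumptions and do not reproduce the project's actual evaluation:
Python
from nltk.translate.bleu_score import corpus_bleu

# One human reference caption per image (each tokenised into words)
references = [[["a", "dog", "runs", "on", "the", "beach"]]]
# The model-generated caption for the same image
candidates = [["a", "dog", "is", "running", "on", "the", "beach"]]

score = corpus_bleu(references, candidates)
print(f"BLEU: {score:.3f}")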
4. Qualitative Evaluation:
CHAPTER 6
CONCLUSION
The provided model leverages a pre-trained Vision Transformer (ViT) to
extract visual features from input images. A transformer-based decoder, such as
GPT-2, then generates captions conditioned on these features. The model
leverages techniques like attention mechanisms, fine-tuning, and data
augmentation to improve performance. By utilizing this state-of-the-art
approach, the model can generate accurate, diverse, and contextually relevant
captions for a wide range of images, enhancing user experience and enabling
innovative applications in fields like image search, content creation, and
accessibility.
Future Scope
Domain-Specific Adaptation: Tailoring models to specific domains
(e.g., medical, legal, artistic) to improve accuracy and relevance.
Zero-Shot and Few-Shot Learning: Developing models that can
generate captions for unseen objects or scenes with limited training data.
Ethical Considerations: Addressing biases and ensuring fairness in the
generation of captions, especially in sensitive contexts.
Real-time Applications: Optimizing models for real-time applications,
such as live video captioning or image description generation.
Cross-lingual Captioning: Enabling image captioning across multiple
languages to facilitate global accessibility and understanding.
APPENDICES
APPENDIX 1: SOURCE CODE
from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer
import torch
from PIL import Image
import requests

model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
feature_extractor = ViTFeatureExtractor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
def preprocess_image(image_path):
    if image_path.startswith('http'):
        image = Image.open(requests.get(image_path, stream=True).raw)
    else:
        image = Image.open(image_path)
    return feature_extractor(images=image, return_tensors="pt").pixel_values

def generate_caption(image_path):
    pixel_values = preprocess_image(image_path)
    output_ids = model.generate(pixel_values, max_length=16, num_beams=4, early_stopping=True)
    caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return caption
# Example usage
image_url = "/content/image.jpg"
caption = generate_caption(image_url)
print(caption)
APPENDIX 2: SCREENSHOTS
Installing Library
Importing Libraries
Model, Feature Extractor & Tokenizer Loading
Leveraging Pre-trained Models
Input Image
Output/Result
BIBLIOGRAPHY
[1] Show and Tell: A Pioneering Model for Image Captioning [Vinyals et al., 2014]