
Image captioning using CNN and LSTM

What is image captioning?

Image caption generation is a task that involves computer vision and natural language processing concepts to recognize the context of an image and describe it in a natural language such as English.

In this project we use the concepts of CNN and LSTM to build an image caption generator: a model that recognizes the context of images and describes them in natural language.

The task of image captioning can be divided logically into two modules:

• An image-based model, which extracts the features of the image; for this we use a CNN.

• A language-based model, which translates the features and objects extracted by the image-based model into a natural sentence; for this we use an LSTM.

What is CNN?

A Convolutional Neural Network (CNN) is a specialized kind of deep neural network, a subfield of deep learning, used for the recognition and classification of images. It processes data represented as a 2D matrix, such as images, and can handle scaled, translated, and rotated imagery. A CNN analyzes visual imagery by scanning it from left to right and top to bottom, extracting relevant features as it goes; finally, it combines all the features to classify the image.
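
As an illustrative sketch (not from the original project), a tiny Keras CNN built along these lines might look as follows; the input size and the ten output classes are assumptions:

from tensorflow.keras import layers, models

# Convolution layers scan the image for local features, pooling layers
# downsample, and dense layers combine the features into a classification.
model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),             # an RGB image as a 2D matrix
    layers.Conv2D(32, (3, 3), activation='relu'),  # low-level features
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),  # higher-level features
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),          # combine the features
    layers.Dense(10, activation='softmax'),        # classify (10 classes assumed)
])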

What is LSTM?

Long Short-Term Memory (LSTM) networks are a type of Recurrent Neural Network (RNN) capable of learning order dependence in sequence prediction problems. They are most commonly used in complex problems such as machine translation and speech recognition.

LSTMs were developed because, as a neural network gets deeper, very small or zero gradients mean that little to no training can take place, leading to poor predictive performance; this vanishing-gradient problem was encountered when training traditional RNNs. LSTM networks are well suited for classifying, processing, and making predictions based on time-series data, since there can be lags of unknown duration between important events in a time series.

An LSTM is considerably more effective than a traditional RNN because it overcomes the RNN's short-term memory limitations: it carries relevant information forward throughout the processing of the input sequence and discards non-relevant information with a forget gate.
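
A minimal sketch of an LSTM next-word predictor in Keras (the vocabulary size and sequence length are assumed values):

from tensorflow.keras import layers, models

vocab_size = 5000   # assumed vocabulary size
max_len = 35        # assumed maximum sequence length

# The LSTM reads the embedded word sequence step by step, carrying relevant
# information forward in its cell state and discarding the rest via its
# forget gate, then predicts a distribution over the next word.
model = models.Sequential([
    layers.Input(shape=(max_len,)),
    layers.Embedding(vocab_size, 256),
    layers.LSTM(256),
    layers.Dense(vocab_size, activation='softmax'),
])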

CNN-LSTM ARCHITECTURE:

The CNN-LSTM architecture involves using CNN layers for feature extraction on input data
combined with LSTMs to support sequence prediction. This model is specifically designed for
sequence prediction problems with spatial inputs.

[Figure: input → CNN model → LSTM model → Dense → output]
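
A minimal Keras sketch of this architecture, assuming 2048-dimensional image features (as produced by ResNet50); the vocabulary size and maximum caption length are placeholder values:

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 5000   # assumed vocabulary size
max_len = 35        # assumed maximum caption length

# Image branch: projected CNN features
img_input = Input(shape=(2048,))
img_feats = Dropout(0.5)(img_input)
img_feats = Dense(256, activation='relu')(img_feats)

# Language branch: partial caption -> embedding -> LSTM
cap_input = Input(shape=(max_len,))
cap_feats = Embedding(vocab_size, 256, mask_zero=True)(cap_input)
cap_feats = Dropout(0.5)(cap_feats)
cap_feats = LSTM(256)(cap_feats)

# Merge the two branches and predict the next word of the caption
merged = add([img_feats, cap_feats])
merged = Dense(256, activation='relu')(merged)
output = Dense(vocab_size, activation='softmax')(merged)

model = Model(inputs=[img_input, cap_input], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam')

Merging the image and text branches with add before a Dense layer matches the diagram above, where the CNN and LSTM outputs meet before the output layer.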

Building the Image Caption Generator

Prerequisites

We use a Jupyter notebook to run our caption generator, and install the following libraries:

pip install tensorflow
pip install keras
pip install pillow
pip install numpy
pip install tqdm

Import all the required packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import string
import json
import pickle
from time import time

import nltk
from nltk.corpus import stopwords

# Pre-trained CNNs for feature extraction
from tensorflow.keras.applications import ResNet50, VGG16
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions

# Model building and utilities
from tensorflow.keras.preprocessing import image
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.utils import to_categorical, plot_model
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add

Prepare Text Data

The dataset contains multiple descriptions for each photograph, and the text of the descriptions requires some minimal cleaning. First, we load the file containing all of the descriptions; we have 600 image descriptions.

Each photo has a unique identifier, which is used in the photo filename and in the text file of descriptions. Next, we step through the list of photo descriptions: each photo identifier maps to a list of textual descriptions, as in the sketch below.
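
A minimal loading sketch, assuming a Flickr8k-style file in which each line holds an image filename followed by one description (the filename descriptions.txt is an assumption):

# Assumed layout: each line is "<image_filename> <description>".
descriptions = {}
with open('descriptions.txt', 'r') as f:
    for line in f:
        tokens = line.strip().split()
        if len(tokens) < 2:
            continue
        image_id, desc = tokens[0], ' '.join(tokens[1:])
        image_id = image_id.split('.')[0]   # drop the file extension
        descriptions.setdefault(image_id, []).append(desc)
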
Next, we need to clean the description text. The descriptions are already tokenized and easy to work with. We will clean the text in the following ways to reduce the size of the vocabulary of words we need to work with (a minimal sketch follows the list):

• Convert all words to lowercase.
• Remove all punctuation.
• Remove all words that are one character or less in length (e.g. 'a').
• Remove all words with numbers in them.
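
A minimal cleaning sketch that applies these four steps to the descriptions dictionary built above:

import string

def clean_descriptions(descriptions):
    # Apply the four cleaning steps to every description, in place.
    table = str.maketrans('', '', string.punctuation)
    for image_id, desc_list in descriptions.items():
        for i, desc in enumerate(desc_list):
            words = desc.split()
            words = [w.lower() for w in words]           # lowercase
            words = [w.translate(table) for w in words]  # strip punctuation
            words = [w for w in words if len(w) > 1]     # drop 1-character words
            words = [w for w in words if w.isalpha()]    # drop words with numbers
            desc_list[i] = ' '.join(words)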

Loading dataset for model training and testing
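
A minimal sketch of the split, assuming the training and testing image identifiers are listed one per line in plain-text files (both file names are assumptions):

def load_image_set(filename):
    # Read one image identifier per line, dropping any file extension.
    with open(filename, 'r') as f:
        return set(line.strip().split('.')[0] for line in f if line.strip())

train_ids = load_image_set('train_images.txt')   # assumed filename
test_ids = load_image_set('test_images.txt')     # assumed filename

# Keep only the descriptions that belong to the training images.
train_descriptions = {k: v for k, v in descriptions.items() if k in train_ids}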


Transfer Learning

Images -> Features
Text -> Features

We use ResNet50, which is already trained on ImageNet, to extract features. ResNet50 is a very deep model: it has 50 layers with skip connections, so it does not suffer from the vanishing-gradient problem. Because of those skip connections, ResNet50 is not a purely sequential model.
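
Setting up the extractor might look like the following sketch; we cut the network just before its final classification layer and keep the 2048-dimensional pooled features:

from tensorflow.keras.applications import ResNet50
from tensorflow.keras.models import Model

# Load ResNet50 with ImageNet weights and drop the final softmax layer,
# keeping the 2048-dimensional output of the global-average-pooling layer.
base = ResNet50(weights='imagenet')
feature_extractor = Model(inputs=base.input, outputs=base.layers[-2].output)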

Preprocessing the image:

For feature extraction we use a pre-trained model; VGG16, from the Visual Geometry Group, is an alternative to ResNet50 and is also available in the Keras library. The input images are resized to 224×224. The features are extracted from the layer just before the final classification layer: that last layer is used to predict a classification for a photo, and since we are not interested in classifying images, we exclude it.
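
A sketch of preprocessing and encoding a single image with the ResNet50 extractor defined above (VGG16 works the same way with its own preprocess_input):

import numpy as np
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet50 import preprocess_input

def encode_image(path):
    # Resize to the 224x224 input the network expects, apply the network's
    # own preprocessing, and extract a 2048-dimensional feature vector.
    img = image.load_img(path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    features = feature_extractor.predict(x)   # shape (1, 2048)
    return features.reshape(2048)
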
Training the model
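
A minimal training sketch, assuming each caption is expanded into (image features, partial caption) → next-word pairs and that tokenizer is a fitted Keras Tokenizer:

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def build_training_pairs(desc_list, photo_feats, tokenizer, max_len, vocab_size):
    # Expand each caption for one photo into (features, partial sequence)
    # inputs with the following word as the target.
    X1, X2, y = [], [], []
    for desc in desc_list:
        seq = tokenizer.texts_to_sequences([desc])[0]
        for i in range(1, len(seq)):
            X1.append(photo_feats)
            X2.append(pad_sequences([seq[:i]], maxlen=max_len)[0])
            y.append(to_categorical([seq[i]], num_classes=vocab_size)[0])
    return np.array(X1), np.array(X2), np.array(y)

# model.fit([X1, X2], y, epochs=20, batch_size=64)   # assumed hyperparameters
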
Testing the model:

Now that the model has been trained, we can test it against random images. The prediction is a sequence of word-index values up to the maximum caption length, so we use the same tokenizer.pkl to map the index values back to words.
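
A greedy decoding sketch, assuming the training captions were wrapped in startseq/endseq tokens and that tokenizer.pkl holds the fitted Keras Tokenizer:

import pickle
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

with open('tokenizer.pkl', 'rb') as f:
    tokenizer = pickle.load(f)
index_to_word = {i: w for w, i in tokenizer.word_index.items()}

def generate_caption(model, photo_feats, max_len):
    # Greedy decoding: repeatedly append the most likely next word.
    in_text = 'startseq'                        # assumed start token
    for _ in range(max_len):
        seq = tokenizer.texts_to_sequences([in_text])[0]
        seq = pad_sequences([seq], maxlen=max_len)
        yhat = model.predict([photo_feats.reshape(1, -1), seq], verbose=0)
        word = index_to_word.get(int(np.argmax(yhat)))
        if word is None or word == 'endseq':    # assumed end token
            break
        in_text += ' ' + word
    return in_text.replace('startseq', '').strip()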
