Image Captioning Using CNN and LSTM
Image captioning is a task that combines computer vision and natural language processing: the model must recognize the context of an image and describe it in a natural language such as English. In this project we build an Image Caption Generator using a CNN together with an LSTM.
The task of image captioning can logically be divided into two modules:
- An image-based model, which extracts the features of the image; for this we use a CNN.
- A language-based model, which translates the features and objects extracted by the image-based model into a natural sentence; for this we use an LSTM.
What is CNN?
A Convolutional Neural Network (CNN) is a specialized deep neural network used for the recognition and classification of images. It processes data represented as a 2D matrix, such as images, and can deal with scaled, translated, and rotated imagery. It analyzes visual imagery by scanning it from left to right and top to bottom, extracting relevant features, and finally combines those features to classify the image.
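To make this concrete, here is a minimal sketch of a small CNN image classifier in Keras (the layer sizes and the 10-class output are illustrative assumptions, not part of this project):

from tensorflow.keras import layers, models

# Minimal CNN sketch: convolution layers scan the image and extract
# local features, pooling layers downsample, and a dense head
# combines the features for classification.
cnn = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),  # 10 classes, illustrative only
])
cnn.summary()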
What is LSTM?
Long Short-Term Memory (LSTM) networks are a type of Recurrent Neural Network (RNN)
capable of learning order dependence in sequence prediction problems. This is most commonly
used in complex problems like Machine Translation, Speech Recognition, and many more.
LSTMs were developed because, as a neural network gets deeper, gradients can become very small or vanish entirely, so little to no training can take place, leading to poor predictive performance; this problem was commonly encountered when training traditional RNNs. LSTM networks are well suited for classifying, processing, and making predictions based on time-series data, since there can be lags of unknown duration between important events in a time series.
LSTM is considerably more effective than the traditional RNN because it overcomes the RNN's short-term memory limitations: an LSTM can carry relevant information throughout the processing of its inputs and discard non-relevant information with a forget gate.
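As a minimal sketch (the sequence length of 34 and the feature size of 256 are assumptions), an LSTM layer in Keras reads a sequence step by step, keeping relevant information in its cell state while the forget gate discards the rest:

from tensorflow.keras import layers, models

# Minimal LSTM sketch: reads a sequence of 34 steps, each a
# 256-dimensional vector, and outputs a single prediction.
seq_model = models.Sequential([
    layers.LSTM(256, input_shape=(34, 256)),
    layers.Dense(1, activation="sigmoid"),
])
seq_model.summary()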
CNN-LSTM ARCHITECTURE:
The CNN-LSTM architecture involves using CNN layers for feature extraction on input data
combined with LSTMs to support sequence prediction. This model is specifically designed for
sequence prediction problems with spatial inputs.
[Figure: CNN-LSTM architecture — input → CNN model → LSTM model → Dense → output]
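A minimal sketch of such a merge architecture in Keras (the 2048-dimensional feature size, the maximum caption length of 34, and the vocabulary size are assumptions for illustration):

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 5000   # assumed vocabulary size
max_length = 34     # assumed maximum caption length

# Image branch: feature vector from a pre-trained CNN (e.g. ResNet50)
inputs1 = Input(shape=(2048,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation="relu")(fe1)

# Language branch: the partial caption so far, processed by an LSTM
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

# Merge both branches and predict the next word of the caption
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation="relu")(decoder1)
outputs = Dense(vocab_size, activation="softmax")(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss="categorical_crossentropy", optimizer="adam")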
Prerequisites
We use Jupyter Notebook to run our caption generator and install the following libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import nltk
from nltk.corpus import stopwords
import string
import json
from time import time
import pickle

from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.applications import ResNet50
# preprocess_input must match the backbone used for feature extraction
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.utils import to_categorical, plot_model
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
Prepare Text Data
The dataset contains multiple descriptions for each photograph, and the text of the descriptions requires some minimal cleaning. First, we load the file containing all of the descriptions; our dataset has 600 image descriptions.
Each photo has a unique identifier. This identifier is used in the photo filename and in the text file of descriptions. Next, we step through the list of photo descriptions: each photo identifier maps to a list of textual descriptions.
Next, we need to clean the description text. The descriptions are already tokenized and easy to work with. We clean the text in the following ways in order to reduce the size of the vocabulary of words we will need to work with:
- Convert all words to lowercase.
- Remove all punctuation.
- Remove all words that are one character or shorter (e.g. 'a').
- Remove all words that contain numbers.
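A minimal sketch of this cleaning step, assuming the steps listed above:

import string

def clean_description(desc):
    # lowercase, strip punctuation, drop one-character and numeric tokens
    table = str.maketrans("", "", string.punctuation)
    tokens = desc.lower().split()
    tokens = [w.translate(table) for w in tokens]
    tokens = [w for w in tokens if len(w) > 1 and w.isalpha()]
    return " ".join(tokens)

clean_description("A dog runs, across the 2 fields .")
# -> 'dog runs across the fields'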
We use ResNet50, already trained on ImageNet, to extract features. ResNet50 is a very deep model: it has 50 layers with skip connections, so it does not suffer from the vanishing-gradient problem. Because of these skip connections, ResNet50 is not a simple sequential model.
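A minimal sketch of extracting a feature vector with ResNet50 (the pooling choice and the example path are assumptions):

from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras.preprocessing import image
import numpy as np

# Load ResNet50 without its classification head; global average
# pooling yields a 2048-dimensional feature vector per image.
feature_model = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def extract_features(img_path):  # img_path is a hypothetical example path
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    return feature_model.predict(x)  # shape (1, 2048)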
For image detection, we use a pre-trained model from the Visual Geometry Group (VGG16). VGG16 ships with the Keras library. For feature extraction, images are resized to 224x224. The features of the image are taken just before the final classification layer, since that layer is used to predict a class for a photo. We are not interested in classifying images, hence we exclude the last layer.
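A minimal sketch of dropping VGG16's final classification layer so the model outputs the features of the penultimate layer instead:

from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.models import Model

vgg = VGG16(weights="imagenet")
# Re-wire the model to stop at the second-to-last layer (fc2),
# which outputs a 4096-dimensional feature vector.
vgg_features = Model(inputs=vgg.input, outputs=vgg.layers[-2].output)
print(vgg_features.output_shape)  # (None, 4096)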
TRAINING THE MODEL:
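The details of this section were not included; as a hedged sketch of the usual training setup for this kind of model, each caption is expanded into (image features, partial word sequence) -> next-word pairs (the helper name create_sequences and the epoch count are illustrative assumptions):

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
import numpy as np

# Illustrative sketch: turn one (photo features, caption) pair into
# training samples of the form [photo, partial sequence] -> next word.
def create_sequences(tokenizer, max_length, caption, photo, vocab_size):
    X1, X2, y = [], [], []
    seq = tokenizer.texts_to_sequences([caption])[0]
    for i in range(1, len(seq)):
        in_seq, out_seq = seq[:i], seq[i]
        in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
        out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
        X1.append(photo)
        X2.append(in_seq)
        y.append(out_seq)
    return np.array(X1), np.array(X2), np.array(y)

# model.fit([X1, X2], y, epochs=20)  # assumed epoch count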
TESTING THE MODEL:
Now that the model has been trained, we can test it against random images. The predictions are sequences of word-index values up to the maximum caption length, so we use the same tokenizer (saved as tokenizer.pkl) to map the index values back to words.
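A minimal sketch of greedy caption generation (the 'startseq'/'endseq' sentinel tokens and the helper name are assumptions):

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo, max_length):
    # Start with the sentinel token and greedily append the most
    # probable next word until 'endseq' or max_length is reached.
    index_word = {i: w for w, i in tokenizer.word_index.items()}
    in_text = "startseq"
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([in_text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = np.argmax(model.predict([photo, seq], verbose=0))
        word = index_word.get(yhat)
        if word is None or word == "endseq":
            break
        in_text += " " + word
    return in_text.replace("startseq", "").strip()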