Practical 3
We are dealing with two types of information, a language one and an image one. The question therefore arises of how, and in what order, we should introduce this information into our model. More concretely, we need a language RNN model to generate a word sequence, so at what point should we feed the image feature vectors into that language model? A paper by Marc Tanti and Albert Gatt [Comparison of Architectures], Institute of Linguistics and Language Technology, University of Malta, presents a comparison study of the possible approaches. A sketch of the two main options is given below.
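As a rough illustration (not the paper's exact models), the sketch below contrasts an "inject" style, where the image conditions the RNN before any words are read, with a "merge" style, where the image is combined with the caption encoding only after the RNN. All names and sizes here (vocab_size, max_length, feat_dim, the 256 units) are illustrative assumptions.

from tensorflow.keras.layers import (Input, Embedding, LSTM, Dense,
                                     Concatenate)

# Illustrative sizes only
vocab_size, max_length, feat_dim = 5000, 20, 512

words = Input(shape=(max_length,))
image = Input(shape=(feat_dim,))
emb = Embedding(vocab_size, 256)(words)

# "Inject": projected image features initialise the LSTM state,
# so the image enters the language model before any words.
state = Dense(256)(image)
inject_out = LSTM(256)(emb, initial_state=[state, state])

# "Merge": the LSTM sees only words; image and caption encodings
# are combined afterwards, just before word prediction.
caption_enc = LSTM(256)(emb)
merge_out = Concatenate()([Dense(256)(image), caption_enc])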
Image Captioning
Task 3.1
Answer the following questions:
Task 3.1.1 - Explain the pros and cons of utilising Concatenation for combining embeddings
You may choose the training data for the images and captions. You will also choose how to combine the embeddings, and answer the questions at the end. The starter code below extracts image features with VGG16 and encodes the caption sequence with an LSTM.
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.layers import Input, Embedding, LSTM

# Extract features from the image with a pretrained CNN
base_model = VGG16(weights='imagenet', include_top=False)
image_features = base_model.predict(img_array)
image_features = image_features.reshape(image_features.shape[0], -1)

# Embed the caption tokens and encode the sequence with an LSTM
caption_input = Input(shape=(max_length,))
caption_embedding = Embedding(input_dim=vocab_size, output_dim=256)(caption_input)
caption_encoding = LSTM(256)(caption_embedding)
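Note that the snippet stops before the two encodings are combined. A minimal sketch of the concatenation approach from Task 3.1.1, assuming image_features, caption_input, caption_encoding, and vocab_size from above (the 256-unit projection and the choice of optimiser are illustrative, not prescribed):

from tensorflow.keras.layers import Input, Dense, Concatenate
from tensorflow.keras.models import Model

# The precomputed image features enter the model through an Input layer
image_input = Input(shape=(image_features.shape[1],))
image_dense = Dense(256, activation='relu')(image_input)

# Concatenate the image and caption representations, then predict
# the next word over the vocabulary
merged = Concatenate()([image_dense, caption_encoding])
output = Dense(vocab_size, activation='softmax')(merged)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam')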
Generating Captions:
Once training is complete, the model should be able to generate captions for the test set; please show them. A greedy decoding sketch is given below.
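A common decoding strategy is greedy search: repeatedly feed the sentence so far and append the most probable next word. A minimal sketch, assuming the model above and a fitted Keras Tokenizer (the tokenizer name and the startseq/endseq boundary tokens are illustrative assumptions):

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, image_feature, max_length):
    # Start from the start token and grow the caption one word at a time
    caption = 'startseq'
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([caption])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        probs = model.predict([image_feature, seq], verbose=0)
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == 'endseq':
            break
        caption += ' ' + word
    return caption.replace('startseq', '').strip()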
Evaluation:
The quality of the generated captions is typically evaluated using metrics like
BLEU, METEOR, ROUGE, or CIDEr, which compare the generated caption
to a set of reference captions.
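As one concrete example, BLEU can be computed with NLTK's corpus_bleu; the captions below are placeholder data, and smoothing is applied because short captions often have zero higher-order n-gram matches:

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of tokenised reference captions per image, and one
# tokenised generated caption per image (placeholder data)
references = [[['a', 'dog', 'runs', 'on', 'the', 'grass']]]
hypotheses = [['a', 'dog', 'is', 'running']]

smooth = SmoothingFunction().method1
bleu1 = corpus_bleu(references, hypotheses, weights=(1, 0, 0, 0),
                    smoothing_function=smooth)
bleu4 = corpus_bleu(references, hypotheses, smoothing_function=smooth)
print(f'BLEU-1: {bleu1:.3f}  BLEU-4: {bleu4:.3f}')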
Note: this evaluation section is optional; as long as you can see the loss decreasing, your model will not be penalized on this.