
Natural language processing

Section 6
Arabic - Image Caption Generator
1
Main idea
An image caption generator produces a caption for a given image by
understanding the image. The challenging part of caption generation is to
understand the image and its context, produce an English description of the
image, and then translate that description into any other language.

2
Our Plan

We will follow these steps to generate the image caption.

3
VGG16
VGG16 is a convolutional neural network model used for image recognition.
It is notable for having only 16 layers with weights, built from small 3×3
convolution filters, rather than relying on a large number of
hyper-parameters. It is considered one of the best vision model architectures.

4
More Information
About VGG_16

• VGG_16 consists of 16 layers: 13 convolutional layers and 3 fully connected layers.
• It is relatively deep compared to earlier CNN architectures such as LeNet and AlexNet.
• The depth of the network allows it to learn more complex features and capture fine details in the input images.
• It has a large number of parameters, about 138 million trainable parameters, making it more computationally expensive to train than shallower networks (see the sketch below).
• Models pretrained on large-scale image classification tasks, such as the ImageNet dataset, are widely available.
• It has been shown to generalize well to other computer vision tasks, such as object detection and semantic segmentation, by utilizing its feature extraction capabilities.
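
As a quick sanity check on these numbers, here is a minimal sketch (assuming the ImageNet weights can be downloaded) that loads the pretrained VGG16 and counts its layers and parameters:

# Sketch: load the pretrained VGG16 and verify the layer/parameter counts above.
from keras.applications.vgg16 import VGG16

model = VGG16(weights='imagenet')
model.summary()  # the summary ends with the total parameter count (~138 million)

conv_layers = [l for l in model.layers if 'conv' in l.name]
fc_layers = [l for l in model.layers if l.name in ('fc1', 'fc2', 'predictions')]
print(len(conv_layers), 'convolutional layers,', len(fc_layers), 'fully connected layers')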
5
How to Implement VGG16 in Keras

8 STEPS FOR IMPLEMENTING VGG16 IN KERAS (a minimal sketch follows the list)
1. Import the libraries for VGG16.
2. Create an object for training and testing data.
3. Initialize the model.
4. Pass the data to the dense layer.
5. Compile the model.
6. Import libraries to monitor and control training.
7. Visualize the training/validation data.
8. Test your model.
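
A minimal sketch of these eight steps for a generic image-classification task; the dataset directories, class count, number of epochs, and callback settings are placeholder assumptions, not taken from the slides:

# 1. Import the libraries for VGG16
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing.image import ImageDataGenerator
from keras.layers import Flatten, Dense
from keras.models import Model
# 6. libraries to monitor and control training
from keras.callbacks import EarlyStopping, ModelCheckpoint
import matplotlib.pyplot as plt

# 2. Create objects for the training and testing data (placeholder directories)
datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
train_data = datagen.flow_from_directory('data/train', target_size=(224, 224))
test_data = datagen.flow_from_directory('data/test', target_size=(224, 224))

# 3. Initialize the model: the frozen VGG16 convolutional base
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False

# 4. Pass the data to the dense layer
x = Flatten()(base.output)
outputs = Dense(train_data.num_classes, activation='softmax')(x)
model = Model(inputs=base.input, outputs=outputs)

# 5. Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# 6. Callbacks that monitor and control training
callbacks = [EarlyStopping(patience=3), ModelCheckpoint('vgg16_best.h5', save_best_only=True)]

# 7. Train and visualize the training/validation accuracy
history = model.fit(train_data, validation_data=test_data, epochs=10, callbacks=callbacks)
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.legend()
plt.show()

# 8. Test your model
model.evaluate(test_data)

Freezing the convolutional base and training only the new dense layer is the usual way to reuse VGG16's pretrained features on a new dataset.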

6
English description to Arabic description

You can use the googletrans library to translate the English description into
Arabic, and then use the gTTS library to read the Arabic caption aloud.

7
gTTS
gTTS (Google Text-to-Speech) is a Python library and CLI tool to interface
with Google Translate's text-to-speech API. We will import the gTTS class
from the gtts module and use it to turn the Arabic caption into speech.

8
Install libraries
We need to install 3 libraries:
!pip install pydotplus
!pip install googletrans
!pip install gTTS

9
Load our plan

Use the pydotplus library to draw a graph of our plan from DOT data.

import pydotplus
from IPython.display import Image, display

myplan = """digraph {
Load_VGG16_Model_Restructure ->
Load_Pretrained_Model ->
Show_image ->
Input_preprocess ->
Generate_English_Description ->
Translate_English_Description_To_Arabic ->
Read_Arabic_Description_by_gTTS
}"""
mygraph = pydotplus.graph_from_dot_data(myplan)
mygraph.write_png("myplan.png")
display(Image(filename='./myplan.png'))
10
Preprocessing and cleaning data

Extract features from photos

To extract features from the photos, we first load the model:

from keras.applications.vgg16 import VGG16
from keras.models import Model

model = VGG16()

Then we re-structure the model to remove the last layer, because we only need the features, not the classification:

model = Model(inputs=model.inputs, outputs=model.layers[-2].output)

11
Load and Prepare Image

• We can load the image as pixel data and prepare it to be presented to the network.
• Keras provides some tools to help with this step.
• First, we can use the load_img() function to load the image and resize it to the required size of 224×224 pixels.

from keras.preprocessing.image import load_img


image = load_img(filename, target_size=(224, 224))

12
Convert from pixel to NumPy array

Next, we can convert the pixels to a NumPy array so that we can work with it in Keras. We
can use the img_to_array() function for this.

from keras.preprocessing.image import img_to_array


# convert the image pixels to a numpy array
image = img_to_array(image)

13
Reshape data for the model

• The network expects one or more images as input; that means the input array needs to be 4-dimensional: samples, rows, columns, and channels.
• We only have one sample (one image). We can reshape the array by calling reshape() and adding the extra dimension.
• Note:
• The input data is reshaped so that it is in a format the network can understand and use.

# reshape data for the model


image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))

14
Prepare the image for the VGG model

• Keras provides a function called preprocess_input() to prepare new input for the network.

from keras.applications.vgg16 import preprocess_input


# prepare the image for the VGG model
image = preprocess_input(image)

15
Get features

• We can call the predict() function on the re-structured model to get the feature vector for the image. (With the original, unmodified model, predict() would instead return the probability of the image belonging to each of the 1,000 known object classes.)
• Note:
• By setting verbose to 0, 1 or 2 you choose how you want to 'see' the progress of the call:
• verbose=0 will show you nothing (silent)
• verbose=1 will show you an animated progress bar
• verbose=2 will just mention the number of the epoch

feature = model.predict(image, verbose=0)
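
A later slide calls an extract_features() helper on an image path. As a sketch, the steps above can be assembled into that helper, rebuilding the re-structured VGG16 inside the function to keep the example self-contained:

# Sketch of the extract_features() helper used later in the slides,
# assembled from the preceding steps.
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.models import Model

def extract_features(filename):
    # load VGG16 and drop the final classification layer
    model = VGG16()
    model = Model(inputs=model.inputs, outputs=model.layers[-2].output)
    # load the photo and resize it to the expected 224x224 input
    image = load_img(filename, target_size=(224, 224))
    # convert the pixels to a NumPy array and add the batch dimension
    image = img_to_array(image)
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    # apply VGG-specific preprocessing and extract the feature vector
    image = preprocess_input(image)
    feature = model.predict(image, verbose=0)
    return feature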

16
Map an integer to a word

• The tokenizer's word_index is a dictionary mapping each word to its integer index; the word_for_id() function searches it to recover the word for a given integer.

def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None
17
Defining the generate_desc Function

After extracting features from each photo in the directory and mapping each
integer to a word, we can now generate a description for the image.

We define a function called generate_desc().

18
The generate_desc Function

def generate_desc(model, tokenizer, photo, max_length):
    # seed the generation process
    in_text = 'startseq'

19
The generate_desc Function (continued)

Next step: iterate over the whole length of the sequence.

for i in range(max_length):
    # integer encode the input sequence
    sequence = tokenizer.texts_to_sequences([in_text])[0]
    # pad the input
    sequence = pad_sequences([sequence], maxlen=max_length)
    # predict the next word
    yhat = model.predict([photo, sequence], verbose=0)

20
The generate_desc Function (continued)

Now we need to convert the probability vector to an integer, then map that integer to a word.

# convert probability to integer
yhat = argmax(yhat)
# map integer to word
word = word_for_id(yhat, tokenizer)

21
The generate_desc Function (continued)
• We should handle the case where we cannot map the word.

# stop if we cannot map the word
if word is None:
    break
# append as input for generating the next word
in_text += ' ' + word
# stop if we predict the end of the sequence
if word == 'endseq':
    break

• Finally, return the generated text from generate_desc:

return in_text

22
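
Assembled from the fragments above, the complete generate_desc() function reads as follows, assuming pad_sequences and argmax are imported and word_for_id() is defined as on the earlier slide:

from keras.preprocessing.sequence import pad_sequences
from numpy import argmax

def generate_desc(model, tokenizer, photo, max_length):
    # seed the generation process
    in_text = 'startseq'
    # iterate over the whole length of the sequence
    for i in range(max_length):
        # integer encode the input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        # pad the input
        sequence = pad_sequences([sequence], maxlen=max_length)
        # predict the next word
        yhat = model.predict([photo, sequence], verbose=0)
        # convert probability to integer
        yhat = argmax(yhat)
        # map integer to word
        word = word_for_id(yhat, tokenizer)
        # stop if we cannot map the word
        if word is None:
            break
        # append as input for generating the next word
        in_text += ' ' + word
        # stop if we predict the end of the sequence
        if word == 'endseq':
            break
    return in_text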
Loading Tokenizer and Model
Last step: load the tokenizer and the model, and define max_length.

from pickle import load
from keras.models import load_model

# load the tokenizer
tokenizer = load(open('/content/drive/MyDrive/imagecaptiongenerator/tokenizer.pkl', 'rb'))
# pre-define the max sequence length (from training)
max_length = 34
# load the model (alternative checkpoint paths from the notebook:
#   "models/{epoch:03d}-{val_loss:.2f}.h5", "training_1/cp.ckpt")
model = load_model('/content/drive/MyDrive/test/VGGmodels/{epoch:03d}-{val_accuracy:.2f}')
# example image from the Flickr8k dataset
path = '/content/drive/MyDrive/flicker8k-dataset/Flickr8k_Dataset/Flicker8k_Dataset/1019077836_6fc9b15408.jpg'
# load and prepare the photograph
photo = extract_features(path)
23
Calling the generate_desc Function

Call generate_desc() to generate the description:

english_text = generate_desc(model, tokenizer, photo, max_length)

24
Display Image

To display the image:

display(Image(filename=path))

Then strip the startseq/endseq tokens from the generated text:

english_text = english_text.replace("startseq", "").replace("endseq", "")
print(english_text)
25
Translation

Translate the English description to Arabic:

import googletrans

translator = googletrans.Translator()
arabic_text = translator.translate(english_text, dest='ar').text

print(arabic_text)

26
Translation Audio

Play the translation as audio:

from gtts import gTTS
import IPython.display as ipd

tts = gTTS(arabic_text, lang='ar')
tts.save('test.mp3')

audio_path = "test.mp3"
ipd.Audio(audio_path, autoplay=True)

27
Try it yourself

Dataset :
https://www.kaggle.com/datasets/ming666/flicker8k-dataset
Code:
https://colab.research.google.com/drive/1BlNUBbSxi0HanGsAkz7L1q9YEvtKu7_B?
usp=sharing#scrollTo=xs_0ccfbTNSN

28
Thank you for your attention!

29
