Article
Arabic Captioning for Images of Clothing Using Deep Learning
Rasha Saleh Al-Malki * and Arwa Yousuf Al-Aama
Computer Science Department, Faculty of Computing and Information Technology, King Abdulaziz University,
Jeddah 21589, Saudi Arabia; [email protected]
* Correspondence: [email protected]
Abstract: Fashion is one of the many fields in which image captioning is being applied. For e-commerce websites holding tens of thousands of images of clothing, automated item descriptions are quite desirable. This paper addresses captioning images of clothing in the Arabic language using deep learning. Image captioning systems are based on Computer Vision and Natural Language Processing techniques because these systems require both visual and textual understanding. Many approaches have been proposed to build such systems. The most widely used methods are deep learning methods, which use an image model to analyze the visual content of the image and a language model to generate the caption. Generating captions in the English language using deep learning algorithms has received great attention from many researchers, but there is still a gap for the Arabic language because public datasets are often not available in Arabic. In this work, we created an Arabic dataset for captioning images of clothing, which we named “ArabicFashionData”, as ours is the first model for captioning images of clothing in the Arabic language. Moreover, we classified the attributes of the images of clothing and used them as inputs to the decoder of our image captioning model to enhance Arabic caption quality. In addition, we used the attention mechanism. Our approach achieved a BLEU-1 score of 88.52. The experiment findings are encouraging and suggest that, with a bigger dataset, the attributes-based image captioning model can achieve excellent results for Arabic image captioning.
Figure 2. Merge architecture [6].
After comparing these two architectures, the findings showed that the merge architecture outperformed the injection architecture [6].
Figure 3. ResNet50 network architecture [10].
Mualla and Alkheir [15] proposed an encoder-decoder model that uses a CNN image encoder and an RNN with LSTM as the language model. They used a subset of the Flickr8k dataset (2000 images) for training, validation, and testing their model (1500, 250, and 250 images, respectively). Their model achieved a BLEU-1 score of 34.4.
An image captioning model for generating Arabic captions has also been proposed
by Al-Muzaini et al. [3]. Their proposed model uses a merged model in which image
features and captions are learned independently. The model is composed of three parts.
First, an RNN with LSTM is used to build a language model that encodes linguistic sequences of varying lengths. Second, a CNN is used to build an image feature extractor that produces a fixed-length feature vector for each image. Third, a decoder model, which takes the vectors output by the two previous models as input, is used to make the final prediction. They used crowdsourcing to build an Arabic version of the MSCOCO dataset (1166 images), and Arabic translators and the Google Translation API to translate parts of the Flickr8k dataset to Arabic (2261 images). Their model achieved a BLEU-1 score of 46, which outperformed the models in [14,15].
Jindal [16] extended his work in [14], where in the second step, instead of generating root words for an image in Modern Standard Arabic (MSA) using Deep Belief Networks, he used a word-based RNN with LSTM. He used Arabic translators to translate the Flickr8k dataset to Arabic, in addition to 405,000 images with captions from newspapers from various Middle Eastern countries. His approach achieved a BLEU-1 score of 65.8 on the Flickr8k dataset, the best among the reviewed Arabic models, and a BLEU-1 score of 55.6 on the Middle Eastern news websites dataset.
ElJundi et al. [17] also proposed a sequence-to-sequence encoder-decoder framework
by using the inject model for generating Arabic captions. They translated the Flickr8k
dataset (8092 images) and used it for training and testing their model (7292 and 800 images, respectively). They were the first to make an Arabic dataset publicly available for researchers
to use. Their model achieved a BLEU-1 score of 33.
Table 1 summarizes existing Arabic image captioning approaches.
Table 1. Summary of existing Arabic image captioning approaches.

| Reference | Architecture | Image Model | Language Model | Attention | Dataset | Evaluation Metric (BLEU-1) |
|---|---|---|---|---|---|---|
| [14] | Compositional | R-CNN | RNN-Deep Belief Network (DBN) | No | ImageNet (10,000), Aljazeera News (100,000) | 34.8 |
| [15] | Encoder-Decoder | CNN | RNN-LSTM | No | Subset of Flickr8k (2000) | 34.4 |
| [3] | Encoder-Decoder | CNN | RNN-LSTM | No | MSCOCO (1166), Flickr8k (2261) | 46 |
| [16] | - | - | RNN-LSTM | No | Flickr8k (8092) | 65.8 |
As can be seen, all the developed Arabic image captioning models depended on gener-
ating the caption by considering the scene as a whole and not using the attention mechanism.
Figure 4. Image of clothing, labels, and descriptive caption examples [19].
Hacheme et al. [20] addressed dataset diversity issues in image captioning for images of clothing. They built an African fashion dataset (InFashAIv1) and used it in addition to the DeepFashion dataset. They used a subset of the DeepFashion dataset including only tops. Using the attributes (gender, style, color, sleeve type, and garment type), they regenerated standardized captions for the DeepFashion images. To generate captions, they used a CNN encoder and an RNN decoder and jointly trained the model on both datasets. They demonstrated that dataset diversity increases caption quality for African-style images of clothing, implying that data from Western styles can be transferred. They trained their model in different ways using both Western and African datasets. The highest scores were achieved when they trained their model using both Western and African datasets, where their model achieved a BLEU score of 65 when tested on both datasets and a better score of 66 when tested only on the African dataset.
Another deep learning model for captioning images of clothing was developed by
Dwivedi & Upadhyaya [1]. They used the merge architecture for their image captioning
model, in which image features and captions are learned independently. For image feature
extraction, they proposed a five-layer deep Convolutional Neural Network (CNN-5), and
for encoding the linguistic sequences, they used an RNN with LSTM. To train their model,
they used the Fashion MNIST dataset (70,000 images). Their model achieved a BLEU-1
score of 53.48 on the test data.
Yang et al. [21] provided a large-scale dataset called FAshion CAptioning Dataset
(FACAD) to study captioning images of fashion. They used a ResNet-101 as an encoder
to extract the image features and LSTM as a decoder. They trained their model based on
the reinforcement learning strategy to enhance the results. Furthermore, they provided
two rewards to capture the semantics at the attribute level and the sentence level. The best score their model achieved was a BLEU-4 score of 6.8.
Cai et al. [22] created a new dataset to caption images of fashion called FACAD170K
from the FACAD dataset. They also used a ResNet-101 as an encoder to extract the image
features and LSTM as a decoder. They proposed a method that depends on allowing users
to input preferred semantic attributes to help generate captions that fit their preferences.
Their model achieved a BLEU-1 score of 46.5 and a BLEU-4 score of 19.6.
Moratelli et al. [23] presented a transformer-based captioning model that included ex-
ternal text memory that could be retrieved using k-nearest neighbor (KNN) searches. They
encoded the images and the text using a vision transformer and a textual transformer, respectively,
and used the FACAD dataset to train their model. Their model achieved a BLEU-1 score of
27.3 and a BLEU-4 score of 10.6.
All these captioning models for images of clothing support only the English language,
use the encoder-decoder architecture, and mostly achieve a low BLEU score compared to
the size of the datasets used.
Table 2 summarizes the existing English captioning models for images of clothing.
Table 2. Summary of existing English captioning models for images of clothing.

| Reference | Architecture | Image Model | Language Model | Attention | Dataset | Evaluation Metric (BLEU-1) |
|---|---|---|---|---|---|---|
| [19] | Encoder-Decoder | CNN | RNN-LSTM | No | DeepFashion (20,096) | - |
| [20] | Encoder-Decoder | CNN | RNN-LSTM | No | DeepFashion + InFashAIv1 | 66 |
| [1] | Encoder-Decoder | CNN | RNN-LSTM | No | Fashion MNIST dataset (70,000) | 53.48 |
| [21] | Encoder-Decoder | CNN | RNN-LSTM | Yes | FACAD dataset (993,000) | - |
| [22] | Encoder-Decoder | CNN | RNN-LSTM | Yes | FACAD170K dataset (178,862) | 46.5 |
| [23] | Encoder-Decoder | Transformer Neural Network (TNN) | Transformer Neural Network (TNN) | Yes | FACAD dataset (993,000) | 27.3 |
3. Proposed Method
This section first provides an overview of the proposed model architecture, then
presents the process of training and evaluation.
Figure 5. Attributes-based Arabic image captioning model architecture.
Here are the details of the attributes-based Arabic image captioning model architecture:
3.1.1. Encoder
• Image Attributes Encoder
The first step of the proposed model was classifying the attributes of the images of clothing using multi-label classification. To do that, we classified and categorized the ArabicFashionData dataset into multiple classes, where each class has multiple labels that are the attributes of the images of clothing it contains. In the training and validation set, we extracted the labels that represent the attributes of the image from the image path, then we converted these labels to a NumPy array and added them as input to the decoder, to be used to train the decoder along with the image features and linguistic features.
• Image Features Encoder
Extracting features from an image is a key element of image captioning since it allows the system to describe a lot of the visual details in it by specifying its major patterns.
A Convolutional Neural Network (CNN) was used for extracting features from images. We used a pre-trained ResNet50 network as the image encoder in our model because it is trained on a very large dataset, has very high accuracy, and is available for public use [8]. Attention was applied by adding an attention layer after the image features extracted by the CNN, which aids in the concentration of selective perception. The images were fed into ResNet50, and the output was delivered to the decoder network.
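To make this concrete, the following is a minimal sketch, not the authors' released code, of how a pre-trained ResNet50 (as provided by TensorFlow/Keras, which we assume here) could be used to extract a spatial feature map from a 100 × 100 clothing image for the attention layer to attend over.

```python
# A minimal sketch (assumed TensorFlow/Keras setup) of ResNet50-based feature extraction.
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

# Load ResNet50 pre-trained on ImageNet, dropping the classification head so the
# output is a spatial feature map rather than class probabilities.
encoder = ResNet50(weights="imagenet", include_top=False, pooling=None)

def extract_features(image_path, image_size=(100, 100)):
    """Read one clothing image and return its ResNet50 feature map."""
    img = tf.keras.utils.load_img(image_path, target_size=image_size)
    x = tf.keras.utils.img_to_array(img)
    x = preprocess_input(x[np.newaxis, ...])        # shape: (1, H, W, 3)
    features = encoder(x, training=False)           # shape: (1, h, w, 2048)
    # Flatten the spatial positions so an attention layer can weight each location.
    return tf.reshape(features, (1, -1, features.shape[-1]))
```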
• Linguistic Sequences Encoder
Linguistic representations needed to be encoded. To encode linguistic sequences of
varying lengths, we used a language model based on Long Short-Term Memory (LSTM).
A single-layer word embedding system was used in the LSTM model to learn the word
representation. Along with the image attributes, the outputs from the image features encoder and the linguistic sequences encoder are provided as inputs to the decoder network, and a final prediction is made.
3.1.2. Decoder
The decoder’s task is to transform the data received from the encoder into a text
sequence (i.e., generate the caption). Our decoder contains a 128-unit dense layer with the Rectified Linear Unit (ReLU) activation function. The image attributes were combined with the output of the image features encoder by concatenation, and this combination was then concatenated with the output of the linguistic sequences encoder and used as input to the dense layer. A softmax prediction over the vocabulary is generated for the next word in the sequence, and the word with the highest probability is selected. This process is repeated until an END token is generated.
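The sketch below, again assuming a TensorFlow/Keras implementation, illustrates the merge design described above: the attribute vector is concatenated with the image features, the result is concatenated with the LSTM-encoded caption prefix, and a 128-unit ReLU dense layer followed by a softmax predicts the next word. The vocabulary size, embedding size, and LSTM size are illustrative assumptions, and the attention layer over spatial image features is omitted for brevity.

```python
# A minimal sketch (assumed sizes) of the attributes-based merge captioning model.
from tensorflow.keras import layers, models

VOCAB_SIZE = 5000      # assumed vocabulary size
MAX_LEN = 12           # 10 caption words plus <START> and <END>
NUM_ATTRIBUTES = 64    # multi-hot attribute vector length
FEATURE_DIM = 2048     # ResNet50 feature size

# Inputs: pooled image features, attribute vector, and the caption generated so far.
image_input = layers.Input(shape=(FEATURE_DIM,), name="image_features")
attr_input = layers.Input(shape=(NUM_ATTRIBUTES,), name="image_attributes")
seq_input = layers.Input(shape=(MAX_LEN,), name="caption_prefix")

# Linguistic sequences encoder: word embedding followed by an LSTM.
seq_features = layers.Embedding(VOCAB_SIZE, 256, mask_zero=True)(seq_input)
seq_features = layers.LSTM(256)(seq_features)

# Merge the image features with the attributes, then with the linguistic features.
merged = layers.Concatenate()([image_input, attr_input])
merged = layers.Concatenate()([merged, seq_features])

# Decoder: dense ReLU layer, then a softmax over the vocabulary for the next word.
hidden = layers.Dense(128, activation="relu")(merged)
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)

captioner = models.Model([image_input, attr_input, seq_input], next_word)
captioner.compile(optimizer="adam", loss="categorical_crossentropy")
```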
Figure 6. Model testing on unseen data.
Because manually entering the attributes would make it impossible to accurately evaluate the model, we have built a multi-label classification model, using the multi-label classified data to train and test the model.
• Multi-Label Image Classification Model (Image Attributes Classification)
The multi-label classification model was built by adopting deep residual neural networks with 50 layers (ResNet50) under the transfer learning approach to accomplish the detection task of many clothing attributes. We extracted the labels that represent the attributes of the image from the image path, then we trained the classifier on these labels with the corresponding images. This model takes the image as an input and outputs all the attributes of this image.
We used ResNet50 as the base model in our method, which had been pre-trained on the ImageNet dataset for object detection. We transferred the first 49 layers of ResNet50, which were left frozen in the multi-label classification model, using transfer learning techniques. We removed the 1000-way fully connected softmax layer from ResNet50 and initialized a new one. Our classifier contains 64 output nodes instead of 1000 because we have 64 unique attributes. We trained a 64-node fully connected sigmoid layer using the labeled images of clothing as input and then replaced the 1000-way fully connected softmax layer from ResNet50 with our trained 64-node fully connected sigmoid layer. The training and validation sets of these attributes that were used to train and validate the multi-label classification model are the same that we used as input to the decoder to train our image captioning model. The Adam optimizer with a learning rate of 0.0001 was used with a binary cross-entropy loss function. The whole experiment was run for 2 epochs with a training batch size of 200 and a validation batch size of 200.
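A minimal sketch of this transfer-learning setup, assuming TensorFlow/Keras, is shown below: the ResNet50 base pre-trained on ImageNet is frozen, its 1000-way softmax is replaced by a 64-node sigmoid layer, and the model is compiled with the Adam optimizer (learning rate 0.0001) and binary cross-entropy loss. Variable names such as x_train and y_train are placeholders.

```python
# A minimal sketch (assumed TensorFlow/Keras setup) of the multi-label attribute classifier.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

NUM_ATTRIBUTES = 64  # unique clothing attributes in ArabicFashionData

base = ResNet50(weights="imagenet", include_top=False, pooling="avg",
                input_shape=(100, 100, 3))
base.trainable = False  # keep the transferred ResNet50 layers frozen

classifier = models.Sequential([
    base,
    layers.Dense(NUM_ATTRIBUTES, activation="sigmoid"),  # one probability per attribute
])

classifier.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="binary_crossentropy",      # multi-label setting: independent binary targets
    metrics=["binary_accuracy"],
)

# x_train: (N, 100, 100, 3) images; y_train: (N, 64) multi-hot attribute vectors
# classifier.fit(x_train, y_train, epochs=2, batch_size=200,
#                validation_data=(x_val, y_val), validation_batch_size=200)
```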
4. Experimental Setup
4.1. Dataset
• ArabicFashionData Dataset (AFD)
Through our research reviews, we were unable to find previous use of deep learning-based Arabic image captioning in the field of clothing, nor did we find an Arabic dataset of captioned images of clothing. Hence, we created an Arabic dataset for images of clothing, which we named “ArabicFashionData”, to evaluate our model. Figure 7 shows an overview of the process of creating the ArabicFashionData dataset.
Figure 7. The process of creating our ArabicFashionData dataset.
The AFD dataset consists of images and a single caption for each image. The images of the AFD dataset were obtained from the DeepFashion dataset (described earlier) without their attributes, captions, or labels. The caption sentences were written in the Arabic language based on the attributes that we chose to use, which are gender, garment type, color, sleeve styles, and garment length. From the DeepFashion dataset, we chose the Category and Attribute Prediction Benchmark, and we selected a subset from it. The AFD dataset included images of different types of tops such as hoodies and blazers, and different types of bottoms such as pants and skirts, in addition to dresses and jumpsuits. For the caption sentences, we translated and processed the captions produced by Hacheme et al. [20] for the images of the DeepFashion dataset to suit our work, because their caption sentences include the same attributes we used. So, instead of writing caption sentences for each image, we ordered and translated their caption sentences to Arabic using the templates below. Since the attribute length was not included in their caption sentences, we added it manually for the garment types dresses and jumpsuits, as shown in Figure 8c,d. For the images of bottoms, we wrote the caption sentences manually, as shown in Figure 8e,f.
Table 3. Arabic translation of each garment type used in the dataset (excerpt).

| Garment Type | Arabic Translation |
|---|---|
| Leggings | بنطلون ليقنز |
| Short pants | بنطلون شورت |
| Jogger | بنطلون رياضي |
| Pants | بنطلون |
| Skirt | تنورة |
The dataset included 15 different garment types shown in Table 3, 17 different color attributes, 2 gender types, 2 garment lengths, and 3 sleeve styles. Based on the attributes we chose to use, we wrote the caption sentences using the templates presented below in Table 4.

Table 4. The templates used to write the caption sentences.

Top clothing (examples: Figure 8a,b)
Arabic: <gender> يرتدي/ترتدي <garment type> <color> اللون و <sleeve style> الأكمام
English: The <gender> is wearing a/an <color> <sleeve style> <garment type>

Dresses and jumpsuits (examples: Figure 8c,d)
Arabic: <gender> ترتدي <garment type> <garment length> <color> اللون و <sleeve style> الأكمام
English: The <gender> is wearing a <garment length> <color> <garment type> with a <sleeve style>

Bottom clothing (examples: Figure 8e,f)
Arabic: <gender> يرتدي/ترتدي <garment type> <garment length> <color> اللون
English: The <gender> is wearing a <garment length> <color> <garment type>
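As a toy illustration of how such a template is filled from attributes (shown in English for readability; the dataset captions themselves are Arabic), the top-clothing template from Table 4 could be instantiated as follows; the function name is hypothetical.

```python
# A toy example of filling the "top clothing" template from a set of attributes.
def fill_top_template(gender: str, color: str, sleeve_style: str, garment_type: str) -> str:
    article = "an" if color[0].lower() in "aeiou" else "a"
    return f"The {gender} is wearing {article} {color} {sleeve_style} {garment_type}"

# fill_top_template("woman", "red", "long sleeve", "coat")
# -> "The woman is wearing a red long sleeve coat"   (matches Figure 8a)
```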
Figure 8. Some images with standardized caption sentences from the dataset. (a) The woman is
wearing a red long sleeve coat. (b) The man is wearing a black long sleeve blazer. (c) The woman is
wearing a long black jumpsuit with a long sleeve. (d) The woman is wearing a long pink dress with a
long sleeve. (e) The woman is wearing a long white skirt. (f) The man is wearing a long blue pants.
To predict captions with vocabulary that people are familiar with, the process of
translating some garment types to Arabic was done based on six online shopping websites
that are popular in Saudi Arabia. They are ZARA, H&M, OUNASS, 6thStreet, SHEIN, and
MANGO. The garment types that were translated based on these online shopping websites
were sweater, hoodie, jacket, blouse, blazer, cardigan, jumpsuit, t-shirt, dress, coat, leggings,
short pants, joggers, pants, and skirt. Table 3 shows the Arabic translation of each garment
type used in the dataset.
After choosing the attributes, we made a multi-label classification by reordering the dataset, where each group of images that share the same attributes was classified and placed in the same class, named with these attributes, to be used to train the image captioning model along with the image features and linguistic features.
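As a small illustration of this step, the snippet below derives a multi-hot attribute vector from an image path, assuming a hypothetical folder naming scheme in which the class directory name concatenates the attribute labels; the attribute list shown is only an excerpt.

```python
# A small illustration with a hypothetical directory layout: the class folder name
# encodes the attribute labels, which are turned into a multi-hot NumPy vector.
import numpy as np

ATTRIBUTES = ["woman", "man", "dress", "skirt", "long", "short", "red", "black"]  # excerpt

def attributes_from_path(image_path: str) -> np.ndarray:
    # e.g. ".../woman_red_long_dress/img_001.jpg" -> labels {"woman", "red", "long", "dress"}
    class_name = image_path.replace("\\", "/").split("/")[-2]
    labels = set(class_name.split("_"))
    return np.array([1.0 if attr in labels else 0.0 for attr in ATTRIBUTES],
                    dtype=np.float32)
```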
Our multi-label classification model has achieved an accuracy of 99% on the training
data and an accuracy of 97.5% on the test data.
Different values of N can be used to calculate the BLEU score, where N is the maximum n-gram order and an n-gram is a group of n consecutive words in a sentence. The default value of N is 4, giving the scores BLEU-1, BLEU-2, BLEU-3, and BLEU-4 [26].
Two methods are commonly used to turn the per-word probability scores into a generated caption: the greedy method and the beam search method. In the greedy method, at each time step the word with the highest probability is chosen. In the beam search technique, which usually yields better sequences, the sequence with the highest overall score is chosen from a set of candidate sequences [27]. The BLEU score can be calculated for both methods.
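For illustration, a BLEU-1 score for a single caption can be computed with NLTK as sketched below, together with a one-line greedy selection step; this is an example of the metric and decoding strategy, not the paper's evaluation script.

```python
# A small illustration of BLEU-1 scoring with NLTK and greedy next-word selection.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import numpy as np

def bleu1(reference_caption: str, generated_caption: str) -> float:
    """BLEU-1 measures unigram overlap between one reference and one hypothesis."""
    reference = [reference_caption.split()]          # list of reference token lists
    hypothesis = generated_caption.split()
    return sentence_bleu(reference, hypothesis,
                         weights=(1.0, 0, 0, 0),
                         smoothing_function=SmoothingFunction().method1)

def greedy_step(word_probabilities, index_to_word):
    """Greedy decoding: at each time step pick the single most probable word."""
    return index_to_word[int(np.argmax(word_probabilities))]
```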
Dataset split:

| | Training (80%) | Validation (10%) | Testing (10%) | Total |
|---|---|---|---|---|
| Images | 63,292 | 7911 | 7912 | 79,115 |
The dataset contains 79,115 images of clothing, each with a single caption sentence.
The number of words in the predicted caption was limited to 10, the length of the longest caption sentence in the dataset. Additionally, the symbols <START> and <END> were
added to the beginning and end of each sentence, respectively.
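A minimal sketch of this caption preparation step, assuming simple whitespace tokenization, is shown below.

```python
# A minimal sketch: cap each caption at 10 words and wrap it with <START>/<END>.
MAX_CAPTION_WORDS = 10  # longest caption sentence in the dataset

def prepare_caption(caption: str) -> str:
    words = caption.split()[:MAX_CAPTION_WORDS]
    return " ".join(["<START>"] + words + ["<END>"])
```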
• The attributes of the images were preprocessed in two steps. First, we extracted
the labels that represent the attributes of the image from the image path, then we
converted these labels to a NumPy array.
• The images of clothing were preprocessed in two steps. We first resized the input
images to 100 × 100 pixels. Then we used data augmentation on the training set of
men's clothing images to avoid the problem of unbalanced data. This was caused by the fact that there were fewer men's images and garment types than women's images and garment types, such as dresses, skirts, and jumpsuits, and we wanted to cover all
We also implemented the common model which uses only image features and linguis-
tic features to train the model and evaluated it using our dataset. Then we compared the
results. Table 7 compares our attributes-based model with the common model.
Table 7. Comparing the proposed model with the common model using the BLEU score.
Based on the previous results, we found that training the model using image attributes
along with image features and linguistic features achieved higher results than using only
image features and linguistic features to train the model.
Table 8. Comparing our model with different models using the BLEU score.
As can be observed, according to BLEU, the proposed approach performs better than
all of the methods that were compared. This demonstrates the success of training the model
utilizing image attributes in addition to the image features and linguistic features and
represents an improvement of more than 30 BLEU points over the model that does not use attributes and over other related models. For example, our model outperforms [22] by more than 40 BLEU points and [23] by more than 60 BLEU points.
6. Conclusions
Image captioning is a difficult computer vision challenge. Future studies will benefit
from this study because there is still a lot of work to be done in order to produce Arabic
captions for digital images. The major goal is to have these automatically generated
captions precise enough to be considered human-like. In this work, we created an Arabic
dataset for captioning and classifying images of clothing and made it publicly available.
Furthermore, we proposed an approach that depends on passing the image attributes
as input along with image features and linguistic features to train the model. For validation,
a model for multi-label classification was developed to output attributes that were used as
input to the trained model to generate the final caption for the input image. In terms of the
BLEU score, the proposed model performs better than state-of-the-art models. Nonetheless,
there is still plenty of potential for further improvement.
In conclusion, our work highlights the advantages of using classification when apply-
ing image captioning to a specialized field, such as fashion. As part of future work,
extra attributes will be added to accurately describe the images of clothing, and textile
attributes such as fabric type will be taken into consideration. Moreover, other methods
will be taken into consideration such as Landmark Detection [28], and other kinds of neural
networks such as Transformers [29].
References
1. Dwivedi, P.; Upadhyaya, A. A Novel Deep Learning Model for Accurate Prediction of Image Captions in Fashion Industry. In
Proceedings of the Confluence 2022–12th International Conference on Cloud Computing, Data Science and Engineering, Noida,
India, 27–28 January 2022; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2022; pp. 207–212.
2. Tran, K.; He, X.; Zhang, L.; Sun, J.; Carapcea, C.; Thrasher, C.; Buehler, C.; Sienkiewicz, C. Rich Image Captioning in the Wild. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Las Vegas, NV, USA, 26 June–1 July
2016; pp. 49–56.
3. Al-Muzaini, H.A.; Al-Yahya, T.N.; Benhidour, H. Automatic Arabic Image Captioning Using RNN-LSTM-Based Language Model
and CNN. Int. J. Adv. Comput. Sci. Appl. 2018, 9, 51692219.
4. Niu, Z.; Zhong, G.; Yu, H. A Review on the Attention Mechanism of Deep Learning. Neurocomputing 2021, 452, 48–62. [CrossRef]
5. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and Tell: A Neural Image Caption Generator. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3156–3164.
6. Tanti, M.; Gatt, A.; Camilleri, K. What Is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator? In
Proceedings of the 10th International Conference on Natural Language Generation; Association for Computational Linguistics: Santiago
de Compostela, Spain, 2017; pp. 51–60.
7. Attai, A.; Elnagar, A. A Survey on Arabic Image Captioning Systems Using Deep Learning Models. In Proceedings of the 2020
14th International Conference on Innovations in Information Technology (IIT), Al Ain, United Arab Emirates, 17–18 November
2020; pp. 114–119.
8. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.;
Farhan, L. Review of Deep Learning: Concepts, CNN Architectures, Challenges, Applications, Future Directions. J. Big Data 2021,
8, 53. [CrossRef]
9. Alam, M.S.; Rahman, M.S.; Hosen, M.I.; Mubin, K.A.; Hossen, S.; Mridha, M.F. Comparison of Different CNN Model Used as
Encoders for Image Captioning. In 2021 International Conference on Data Analytics for Business and Industry, ICDABI 2021; Institute
of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2021; pp. 523–526.
10. Shouman, M.A.; El-Fiky, A.; Hamada, S.; El-Sayed, A.; Karar, M.E. Multi-Label Transfer Learning for Identifying Lung Diseases
Using Chest X-Rays. In Proceedings of the 2021 International Conference on Electronic Engineering (ICEEM), Menouf, Egypt, 3–4
July 2021; pp. 1–6.
11. Sargar, O.; Kinger, S. Image Captioning Methods and Metrics. In 2021 International Conference on Emerging Smart Computing and
Informatics, ESCI 2021; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2021; pp. 522–526.
12. Xu, K.; Ba, J.L.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.S.; Bengio, Y. Show, Attend and Tell: Neural Image
Caption Generation with Visual Attention. In Proceedings of the 32nd International Conference on International Conference on
Machine Learning, Lille, France, 6–11 July 2015; Volume 37, pp. 2048–2057.
13. Shao, L.; Zhu, F.; Li, X. Transfer Learning for Visual Categorization: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2015, 26,
1019–1034. [CrossRef]
14. Jindal, V. A Deep Learning Approach for Arabic Caption Generation Using Roots-Words. In Proceedings of the Thirty-First AAAI
Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31, pp. 4941–4942. [CrossRef]
15. Mualla, R.; Alkheir, J. Development of an Arabic Image Description System. Int. J. Comput. Sci. Trends Technol. 2018, 6, 205–213.
16. Jindal, V. Generating Image Captions in Arabic Using Root-Word Based Recurrent Neural Networks and Deep Neural Networks.
In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; 32,
pp. 144–151.
17. ElJundi, O.; Dhaybi, M.; Mokadam, K.; Hajj, H.M.; Asmar, D.C. Resources and End-to-End Neural Network Models for Arabic
Image Captioning. In Proceedings of the VISIGRAPP (5: VISAPP), Valletta, Malta, 27–29 February 2020; pp. 233–241.
18. Liu, Z.; Luo, P.; Qiu, S.; Wang, X.; Tang, X. Deepfashion: Powering Robust Clothes Recognition and Retrieval with Rich
Annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30
June 2016; pp. 1096–1104.
19. Tateno, K.; Takagi, N.; Sawai, K.; Masuta, H.; Motoyoshi, T. Method for Generating Captions for Clothing Images to Support
Visually Impaired People. In Proceedings of the 2020 Joint 11th International Conference on Soft Computing and Intelligent
Systems and 21st International Symposium on Advanced Intelligent Systems (SCIS-ISIS), Hachijo Island, Japan, 5–8 December
2020; pp. 1–5.
20. Hacheme, G.; Sayouti, N. Neural Fashion Image Captioning: Accounting for Data Diversity. arXiv 2021, arXiv:2106.12154.
21. Yang, X.; Zhang, H.; Jin, D.; Liu, Y.; Wu, C.-H.; Tan, J.; Xie, D.; Wang, J.; Wang, X. Fashion Captioning: Towards Generating
Accurate Descriptions with Semantic Rewards. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August
23–28, 2020, Proceedings, Part XIII 16; Springer: Cham, Switzerland, 2020; pp. 1–17.
22. Cai, C.; Yap, K.-H.; Wang, S. Attribute Conditioned Fashion Image Captioning. In Proceedings of the 2022 IEEE International
Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 1921–1925.
23. Moratelli, N.; Barraco, M.; Morelli, D.; Cornia, M.; Baraldi, L.; Cucchiara, R. Fashion-Oriented Image Captioning with External
Knowledge Retrieval and Fully Attentive Gates. Sensors 2023, 23, 1286. [CrossRef]
24. Grave, E.; Bojanowski, P.; Gupta, P.; Joulin, A.; Mikolov, T. Learning Word Vectors for 157 Languages. In Proceedings of the
International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018.
25. Hossin, M.; Sulaiman, M.N. A Review on Evaluation Metrics for Data Classification Evaluations. Int. J. Data Min. Knowl. Manag.
Process 2015, 5, 49040515.
26. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of
the 40th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Cedarville, OH,
USA, 2002; pp. 311–318.
27. Faiyaz Khan, M.; Sadiq-Ur-Rahman, S.M.; Saiful Islam, M. Improved Bengali Image Captioning via Deep Convolutional Neural
Network Based Encoder-Decoder Model. In Proceedings of International Joint Conference on Advances in Computational Intelligence:
IJCACI 2020; Springer: Berlin/Heidelberg, Germany, 2021; pp. 217–229.
28. Huang, C.-Q.; Chen, J.-K.; Pan, Y.; Lai, H.-J.; Yin, J.; Huang, Q.-H. Clothing Landmark Detection Using Deep Networks with Prior
of Key Point Associations. IEEE Trans. Cybern. 2019, 49, 3744–3754. [CrossRef] [PubMed]
29. Alammar, J. The Illustrated Transformer–Jay Alammar–Visualizing machine learning one concept at a time. Jay Alammar Github
2018, 27.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.