An Artificial Intelligence Based Tool For Eye Disease Classification
International Research Journal in Global Engineering and Sciences (IRJGES)
ISSN: 2456-172X | Vol. 4, No. 3, September - November, 2019 | Pages 87-107
Abstract
The human eye is affected by different eye diseases, notably Diabetic Macular Edema (DME) and Age-related Macular Degeneration (AMD). DME is a common eye disease that causes irreversible vision loss in diabetic patients. AMD is further classified into Early AMD and Late AMD. DRUSEN are deposits that build up due to aging and macular degeneration and destroy sharp central vision; the presence of DRUSEN is the symptom of Early AMD. Choroidal Neovascularization (CNV) is an eye problem caused by the formation of new blood vessels in the choroid layer of the eye, which leads to sudden deterioration of central vision; CNV is the symptom of Late AMD. This work focuses on the design of an Artificial Intelligence based tool for eye disease detection and classification that detects and classifies CNV, DME and DRUSEN effectively from Optical Coherence Tomography (OCT) images. Automatically identifying and describing the symptoms in an OCT image is a complex and challenging task. It is accomplished by using pre-trained convolutional neural network (CNN) models and an image caption generator designed with a Long Short-Term Memory (LSTM) network. The tool assists the Ophthalmologist in classifying three different types of eye disease, namely DME, Early AMD and Late AMD, by using the textual description generated by the image caption generator. The features extracted from the OCT images in the form of feature vectors, together with the partial captions generated for each image, are used to fit the training model that predicts the next word in the sequence. The trained model is then used for eye disease detection and classification based on the text description produced by the LSTM for images not seen by the model. It generates captions that classify the images under diagnosis into four different classes, namely NORMAL, DME, Early AMD and Late AMD. The performance metrics of the image caption generators designed with each of the pre-trained CNN models and LSTMs are evaluated and compared for the four classes independently in order to select the best image caption generator. The test results show that the image caption generators implemented with the pre-trained models DenseNet201 and Xception perform better than those built with all the other pre-trained models.
Keywords: Optical Coherence Tomography (OCT), Diabetic Retinopathy (DR), Deep Learning, Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), Transfer Learning, Ophthalmologist, Choroidal Neovascularization (CNV), Age-related Macular Degeneration (AMD).
1. Introduction
In Ophthalmology, Optical Coherence Tomography (OCT) plays a vital role in the detection and classification of eye diseases for further assessment and treatment. In the present scenario, the diagnosis of eye diseases depends primarily on clinical examination and the subjective analysis of OCT images reported by the referred expert. This paper aims at the automatic detection and classification of three different eye diseases present in OCT images of the human eye by using an image caption generator designed with pre-trained CNN models and an RNN. The image caption generator produces a textual description of the symptoms of each eye disease, which is used for detecting and classifying the OCT images. The model has to identify the relationships between the different symptoms rather than merely identifying the symptoms present in the OCT image. The following are the four classes, covering three eye diseases and the normal eye, for which a text description is generated by the image caption generator to detect and classify the condition present in the OCT image:
1. Early AMD
2. Late AMD
3. DME
4. NORMAL
Choroidal Neovascularization (CNV) involves the growth of new blood vessels that originate from the choroid through a break in Bruch's membrane into the sub-retinal pigment epithelium (sub-RPE) or subretinal space. CNV is also closely associated with excessive amounts of vascular endothelial growth factor (VEGF). It is the symptom of Late AMD and leads to sudden deterioration of central vision. DRUSEN occur due to aging and macular degeneration; they are tiny white or yellow accumulations of extracellular material that build up between Bruch's membrane and the retinal pigment epithelium of the eye. With advancing age, it is normal to have a few small hard DRUSEN. Their presence is the symptom of Early AMD, and a large number of DRUSEN in the macula destroys sharp central vision. Central vision is needed to see objects clearly and to perform tasks such as driving and reading. Diabetic Macular Edema (DME) is a common eye disease that, if left untreated, causes irreversible vision loss in diabetic patients. It is mainly due to leaking blood vessels in the retina.
Precise information about the different components or features of the human retina, such as the blood vessels, the macula, the fovea and the optic disc (OD), extracted from OCT images helps to detect the different kinds of eye diseases or abnormalities. In the traditional method, the Ophthalmologist analyzes the abnormal retinal images and derives the information needed to detect each eye disease. However, this process is time-consuming and can sometimes lead to false predictions due to human error. Hence, an automatic and accurate image caption generator is needed to provide a text description of the OCT images for predicting the different kinds of eye diseases accurately.
Three different approaches, namely image processing techniques, machine learning and deep learning, are used to detect the different eye diseases in OCT images. Many algorithms based on image processing are available for detecting the pathologies present in the human eye from OCT images. In OCT images, the spatial local correlation among neighboring pixels is an important source of information exploited by image processing techniques to detect and classify different types of eye diseases. The application of image processing and computer vision to OCT image interpretation has mainly focused on segmenting the retinal layers and measuring the thickness of the segmented layers for comparison with the corresponding measurements obtained from a database of normal retinal images to identify retinal diseases. Apart from measuring the thickness of the retinal layers, research has also focused on the segmentation of fluid regions seen in retinal OCT images, such as edema or cystic structures, which are observed in advanced stages of DME and AMD or DRUSEN.
The computer-aided diagnosis (CAD) of OCT images uses machine learning (ML) techniques with hand-engineered features for decision-making. However, devising hand-engineered features from OCT images is a very challenging task, as it needs expertise in analyzing the variability of parameters or features in the region of interest (ROI). One ML technique detects the different stages of Age-Related Macular Degeneration (AMD) of the human eye from OCT images by using multi-scale histograms of oriented gradient descriptors to extract the features given to a support vector machine based classifier, which classifies the images into DME, Early AMD and NORMAL eye [1]. Multi-scale Local Binary Pattern (LBP) features are used to perform multi-label classification of retinal OCT images for the detection of macular pathologies by Liu et al. [2]. A support vector machine (SVM) classifier using five distinct features extracted from labeled images is proposed by Hassan et al. [12] to detect and classify DME from abnormal OCT images.
The Deep Learning (DL) model helps to overcome the challenges involved in devising hand-engineered features, since the relevant features are learned directly from the images.
Luhui Wu et al. [21] developed an image captioning model for diabetic retinopathy images using a CNN and an RNN. This model predicted the abnormalities present in normal and abnormal retinal images; a shortcoming of the model is that it does not predict the five different levels of diabetic retinopathy. Min Yang et al. [22] designed an image captioning model for cross-domain learning and prediction. They trained the model on images from one domain and used the same trained model for caption prediction on images belonging to another domain, employing a CNN-LSTM architecture. Yansong Feng et al. [23] used the Scale Invariant Feature Transform algorithm to represent each news image as a bag of words in their proposed model and used recall and mean average precision to evaluate it. Yang et al. [24] proposed a model that uses a generator to produce textual descriptions for the given visual content of a video and a discriminator to control the accuracy of generation. Dong-Jin Kim et al. [25] proposed an image caption generator involving a CNN and a Guided LSTM and found that this model was superior to a plain LSTM in predicting the next word of the sequence accurately; however, they did not perform any classification task related to OCT images. Minsi Wang et al. [26] proposed a novel parallel-fusion RNN-LSTM architecture for generating textual descriptions of images and obtained better results and improved efficiency compared with the dominant architectures. They divided the hidden units of the RNN into equally sized parts and allowed them to run in parallel, but they did not use any specific dataset to substantiate the superior features of their model. Jie Wu et al. [27] proposed a cascaded RNN model for image caption generation and verified its effectiveness and accuracy of caption generation. Chetan Amritkar et al. [28] proposed a model using a CNN for extracting features from the image and an RNN for generating a clear sentence about the image, and it was observed that the model frequently gave accurate descriptions of an image. Jiuxiang Gu et al. [29] introduced a language model based on CNNs and identified its suitability for statistical language modeling tasks; this language model can capture the long-range dependencies in the history of words, which are really important for image captioning. Xiaodong He et al. [30] used deep learning based CNNs and RNNs for image caption generation, with the CNN used for feature extraction in a sub-region of the given image and the RNN used for generating the image captions.
2. Data Set
Table 1 shows the distribution of the different types of images in the original dataset. The training dataset used in this work consists of 6000 labeled high-resolution OCT images, open sourced on the free Kaggle platform for eye disease screening. The dataset covers four different classes corresponding to the four conditions tabulated in Table 1. The datasets used for validation and testing consist of 1000 and 32 images respectively.
In this work, the images are finally classified into one of the four classes, namely Normal, DME, Early AMD with the presence of DRUSEN, and Late AMD with the presence of CNV, by using the text description of the image predicted by the combination of CNN and LSTM. Figures 1 to 4 show images of the different eye conditions classified by our caption generator. The features in them differ for the different eye diseases. Finally, the textual description is displayed on these images to detect and classify the eye disease class.
Figure 1 Normal Figure 2 Early AMD Figure 3 Late AMD Figure 4 DME
3. Proposed Methodology
The proposed methodology shown in Figure 5 consists of four major functional blocks, namely the feature extractor, the sequence generator, the training of the caption generator model, and the new caption generator or classifier model. The feature extractor is designed by using pre-trained CNN models with the top layer removed. 7000 OCT images from the four classes, namely Late AMD having the symptoms of CNV, DME, Early AMD having the symptoms of DRUSEN, and NORMAL, are given to the feature extractor for feature extraction. The extracted features corresponding to the training images and the vocabulary formed for these training images from their textual descriptions are applied as inputs to the training model to create the trained model. During training, the sequence generator, which is responsible for generating the sequence of words, is used together with the LSTM to predict the next word in the sequence. The image to be tested for any one of the eye disease classes is given to the image caption generator as input. The caption generator outputs the result in the form of a textual description of the image. The descriptions generated by the caption generator are used to find and count the number of correct and wrong predictions for the OCT images. The numbers of correct and incorrect captions predicted for the four different classes are used to build a nested dictionary in Python. This nested dictionary is passed as an argument to the Confusion Matrix object in Python to generate the different metrics associated with the caption generator. The results obtained from the Confusion Matrix object are observed and compared to identify the best image caption generator designed with the pre-trained CNN models and LSTM. The image caption generator whose performance metrics are verified as the best can be recommended to eye specialists for diagnosing eye disease.
All the labeled OCT images are pre-processed by resizing them to meet the input requirements of the different pre-trained deep learning models. The OCT images are mostly resized to three different resolutions, namely 299x299, 224x224 and 96x96 pixels. The purpose of the feature extractor in the automatic caption generator is to transform the raw images into a limited set of distinct features in order to reduce the complexity of processing the images without losing meaningful information. Pre-trained neural network models are used for this purpose. The feature extractor extracts the distinctive features of the images to form a unique vector for each image, and these vectors are stored in the features.pkl file. Feature extraction thus transforms the labeled OCT images into feature vectors; it also helps to avoid a large memory requirement and computing power. The feature vectors and the vocabulary formed from the words of the textual descriptions corresponding to these images are used as inputs to the training model to learn the features effectively and generate a trained model with optimum weights. The amount of time needed to train the model varies depending on the pre-trained model used to create the feature extractor. The best trained model, having the maximum validation accuracy or minimum validation loss, is used for creating captions for new images during testing.
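For illustration, a minimal sketch of such a feature extractor is given below; the choice of VGG16, the 224x224 input size, and the directory and file names are assumptions made only for this example, not fixed requirements of this work.

```python
# Minimal sketch: extract one feature vector per OCT image with a pre-trained CNN
# whose top (classification) layer has been removed, and store them in features.pkl.
import os
import pickle
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.image import load_img, img_to_array

def extract_features(image_dir):
    # Load VGG16 with ImageNet weights and drop the final classification layer,
    # keeping the 4096-element fc2 output as the feature vector.
    base = VGG16(weights="imagenet")
    model = Model(inputs=base.inputs, outputs=base.layers[-2].output)
    features = {}
    for name in os.listdir(image_dir):
        img = load_img(os.path.join(image_dir, name), target_size=(224, 224))
        x = preprocess_input(np.expand_dims(img_to_array(img), axis=0))
        features[os.path.splitext(name)[0]] = model.predict(x, verbose=0)[0]
    return features

if __name__ == "__main__":
    feats = extract_features("oct_images")      # hypothetical image directory
    with open("features.pkl", "wb") as f:       # file name used in this work
        pickle.dump(feats, f)
```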
The feature descriptors mentioned earlier extract only certain specific features of the OCT images. These features may not be sufficient to detect a specific eye disease precisely, and the accuracy of such feature extractors is not sufficient for them to be used for feature extraction here. Therefore, there is a need for a deep learning based model that can extract all the features corresponding to a specific eye disease class. Pre-trained CNN models are selected for feature extraction with the aim of extracting all the relevant features in the OCT images through the different convolution layers of the CNN. The lower layers extract the low-level features and the middle layers extract the mid-level features; finally, the end layers use all these features to create the high-level features corresponding to the particular disease class.
The feature extractor is designed by using the pre-trained ImageNet weights and removing the top layer of the original pre-trained CNN model. The pre-trained CNN model is chosen over conventional neural networks and artificial neural networks because its layers are capable of capturing and learning features from the images automatically, at different levels of a hierarchy, in a manner similar to the human brain. During the convolution operation, each output value does not need to be connected to every neuron in the previous layer; it is connected only to the receptive field where the convolution kernel is currently applied. This characteristic of the convolution layer, which drastically reduces the number of interconnections, is called local connectivity. In addition, in the pre-trained CNN, the same weights are applied across the convolution until the next update of the parameters, which is referred to as parameter sharing. Thus, there is a drastic reduction in the number of parameters when compared with an ANN, where there is a connection between every pair of input and output neurons. Clearly, a CNN is more efficient than a conventional neural network or ANN in terms of complexity and memory. It has been shown that pre-trained CNN models are good feature extractors for a completely new task or problem. The designed feature extractor extracts useful attributes from an already pre-trained CNN with its trained weights by feeding in our image data containing the different eye diseases and tuning the CNN slightly for the specific task, namely eye disease classification. Pre-trained CNN models are very efficient at this task. The advantage of pre-training is that there is no need to train the CNN from scratch, which saves memory and time, and the convolution process in the CNN is capable of extracting the relevant information at low computational cost. Thus, the pre-trained CNN is selected as the feature extractor after considering these merits over conventional NNs and ANNs. The pre-trained CNN models detect key points in the image, and the number of key points varies from image to image. A feature vector is then built for each image based on the key points used to represent it. These features form the internal representation of the image just before the classification layer, and they are stored as feature vectors in the form of NumPy arrays. The dimension of the feature vector varies with the pre-trained model used for feature extraction. In this way, the feature vector is computed for each OCT image in the dataset, and these features are stored in the features.pkl file.
The token.txt file is created manually with 5 different captions for each image present in the dataset used for training and validation. The image id and the text description of each image are stored in the token.txt file: the first column contains the image id, and from the second column onwards the file contains the description of the image, i.e. its symptoms, as a text description. The caption identifier, numbered from 0 to 4, follows the image file name. A Python dictionary is used to store the image ids of the different files as keys and the descriptions of the files as the corresponding values. The text description of each image in the entire dataset is cleaned, and the cleaned description is formed for each image present in the dataset. The cleaned descriptions in the text file are then converted into a vocabulary of words. The text is cleaned to reduce the size of the vocabulary; the vocabulary should be expressive and as small as possible, since a small vocabulary gives a small model that trains faster. Finally, the dictionary of image identifiers and cleaned descriptions is saved in the descriptions.txt file, with one image identifier and its corresponding symptom description per line. The cleaned descriptions are now ready for modeling, and their order may vary.
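A possible sketch of this parsing and cleaning step is shown below, assuming a whitespace-separated token.txt with the image id (suffixed by the caption number) in the first column; the helper names are illustrative only.

```python
# Sketch: parse token.txt (image id + caption per line), clean the captions,
# build the vocabulary and write descriptions.txt. Assumed format: "imageid#n caption ...".
import string

def load_descriptions(token_file):
    descriptions = {}
    with open(token_file) as f:
        for line in f:
            tokens = line.strip().split()
            if len(tokens) < 2:
                continue
            image_id = tokens[0].split('#')[0]          # drop the 0-4 caption suffix
            descriptions.setdefault(image_id, []).append(' '.join(tokens[1:]))
    return descriptions

def clean_descriptions(descriptions):
    table = str.maketrans('', '', string.punctuation)
    for captions in descriptions.values():
        for i, caption in enumerate(captions):
            words = caption.lower().translate(table).split()
            # Drop single characters and non-alphabetic tokens to shrink the vocabulary.
            captions[i] = ' '.join(w for w in words if len(w) > 1 and w.isalpha())

descriptions = load_descriptions('token.txt')
clean_descriptions(descriptions)
vocabulary = {w for caps in descriptions.values() for c in caps for w in c.split()}
with open('descriptions.txt', 'w') as f:
    for image_id, caps in descriptions.items():
        for c in caps:
            f.write(f'{image_id} {c}\n')
```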
In this work, the model is trained in each epoch by using partial captions to predict the next word in the sequence, and each training epoch is followed by validation of the model. The training model is designed using three components, namely the image feature extractor, the sequence processor and the decoder. The feature extractor is designed from the different pre-trained models after removing the top layer of the original model. The features extracted by this pre-trained model act as one input to the training model. The feature is a vector of 4096 elements if the pre-trained VGG16 model is used; the size of this feature vector differs for the different pre-trained CNN models. A dense layer processes these features to generate a 256-element representation of each image. The sequence processor is implemented with a word embedding layer followed by a Long Short-Term Memory (LSTM) recurrent neural network layer. An input sequence with a pre-defined length of 34 words is given to the embedding layer, which uses a mask to ignore the padded values. An LSTM layer with 256 memory units follows the embedding layer in the sequence processor. Both input models, namely the photo feature extractor model and the LSTM, produce a 256-element vector, and both use regularization in the form of 50% dropout. This is done to reduce overfitting on the training dataset, as this model configuration learns quickly. The decoder model then merges the two vectors from the feature extractor model and the sequence processor model using an addition operation. The merged vector is fed to a dense layer of 256 neurons followed by a final output dense layer, which makes a softmax prediction over the entire output vocabulary to predict the next word in the sequence. The skill of the model is monitored using the validation dataset, and the whole model is saved to a file whenever its skill on the validation dataset improves at the end of an epoch.
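A minimal Keras sketch of this merge architecture is given below; the vocab_size argument and the 4096-element feature length (the VGG16 case) are placeholders that depend on the vocabulary and the pre-trained model actually chosen.

```python
# Sketch of the merge-style caption model described above: image features and the
# partial caption are encoded separately and added before the softmax decoder.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

def define_model(vocab_size, max_length=34, feature_dim=4096):
    # Photo feature extractor branch: CNN features -> 256-element representation.
    inputs1 = Input(shape=(feature_dim,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # Sequence processor branch: word embedding + LSTM over the partial caption.
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # Decoder: merge the two 256-element vectors and predict the next word.
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model
```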
The model with the best skill is saved as the final model at the end of each run. This is done by defining a ModelCheckpoint in Keras and instructing it to monitor the minimum loss on the validation dataset. The model with the minimum loss is then saved to a file with the .h5 extension, named with the epoch number, training loss and validation loss. The checkpoint is specified in the call to the fit function, with the validation dataset passed as an argument to the fit function. During training, the model is provided one word split from the text description together with the photo features, and it is expected to learn the next word in the sequence. When the trained model is later used to generate descriptions during testing, the generated words are concatenated and recursively provided as input to generate a caption for a new image. Thus, the weights corresponding to each epoch in which there was an improvement in validation accuracy are stored in an .h5 file. Finally, the weights corresponding to the epoch with the maximum validation accuracy or minimum loss are used as the final trained model, which is used to generate image captions for unseen images.
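The checkpointing step could be expressed as in the following sketch; the arrays X1train, X2train and ytrain (photo features, padded partial captions and one-hot next words) and their validation counterparts are assumed to have been prepared beforehand.

```python
# Sketch: save the model whenever the validation loss improves, with the epoch,
# training loss and validation loss encoded in the .h5 file name.
from tensorflow.keras.callbacks import ModelCheckpoint

filepath = 'model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1,
                             save_best_only=True, mode='min')

# X1/X2/y arrays are assumed to be built from the feature vectors and the
# encoded partial captions of the training and validation images.
model.fit([X1train, X2train], ytrain, epochs=20, verbose=2,
          callbacks=[checkpoint],
          validation_data=([X1val, X2val], yval))
```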
Model Evaluation
Once the model is fit by using the features and text descriptions of the OCT images, i.e. the input-output pairs, it is ready to be evaluated for its skill on the holdout test data or validation data. The model is evaluated by generating descriptions for all the photos or images in the validation dataset, and these predictions are assessed using a standard cost function. Generation involves passing in the start-of-description token 'startseq', generating one word, and then calling the trained model recursively with the words generated by the previous predictions as input until the end-of-sequence token 'endseq' or the maximum description length is reached. In this way, the trained model can be evaluated against a given validation dataset of photo descriptions and photo features. The actual and predicted descriptions are collected and evaluated collectively using the corpus BLEU score, which summarizes how close the generated text is to the expected text.
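A sketch of this recursive generation and the BLEU-based evaluation is given below; the word_for_id helper, the maximum length of 34 and the BLEU-1 weights are illustrative choices.

```python
# Sketch: generate a caption word by word starting from 'startseq' and evaluate
# the generated captions against the reference descriptions with corpus BLEU.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from nltk.translate.bleu_score import corpus_bleu

def word_for_id(idx, tokenizer):
    for word, i in tokenizer.word_index.items():
        if i == idx:
            return word
    return None

def generate_caption(model, tokenizer, photo_features, max_length=34):
    in_text = 'startseq'
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([in_text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = model.predict([photo_features.reshape(1, -1), seq], verbose=0)
        word = word_for_id(int(np.argmax(yhat)), tokenizer)
        if word is None:
            break
        in_text += ' ' + word
        if word == 'endseq':
            break
    return in_text

def evaluate_model(model, descriptions, features, tokenizer, max_length=34):
    actual, predicted = [], []
    for image_id, captions in descriptions.items():
        yhat = generate_caption(model, tokenizer, features[image_id], max_length)
        actual.append([c.split() for c in captions])
        predicted.append(yhat.split())
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
```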
The trained model stored in the .h5 file after training contains almost everything needed to generate captions for new OCT images. First, the Tokenizer created during the encoding of the text, which holds the tokens describing all the training images, and the maximum sequence length defined for generating the text descriptions are needed for generating captions for the new OCT images. Next, the photo for which a caption is to be generated has its features extracted and applied to the trained model. This is done by re-defining the model using the LSTM and then adding any one of the pre-trained models to it; the pre-trained model extracts the features, and the extracted features are used as inputs to the trained model. Captions for the OCT images are generated after the model has been successfully validated. After removing the start and end sequence tokens from the generated sequence, the description listing the symptoms of the image, along with its class label, is produced for the new OCT image. This description, together with the name of the class to which the OCT image is assigned, assists the Ophthalmologist in classifying the type of eye disease.
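For a single new OCT image, the pieces above could be combined as in the following sketch; the file names and the reuse of the extract_features and generate_caption helpers from the earlier sketches are assumptions for illustration.

```python
# Sketch: caption a single unseen OCT image with the saved trained model.
import pickle
from tensorflow.keras.models import load_model

model = load_model('model-ep005-loss0.100-val_loss0.200.h5')   # hypothetical checkpoint name
with open('tokenizer.pkl', 'rb') as f:                          # hypothetical saved Tokenizer
    tokenizer = pickle.load(f)

features = extract_features('new_oct_image_dir')                # pre-trained CNN features
caption = generate_caption(model, tokenizer, list(features.values())[0])
# Strip the startseq/endseq tokens before showing the description to the clinician.
print(' '.join(caption.split()[1:-1]))
```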
The trained model, in the form of weights, and the image to be captioned are applied to the image caption generator to predict a caption for the image. 32 OCT images from the 4 classes are used to test the accuracy of the caption generator. The description generated by the caption generator, along with its class label, is used to mark the prediction for the input OCT image as either correct or incorrect. A nested dictionary in Python is designed with the 4 actual classes as the keys of 4 internal dictionaries, and the values of the nested dictionary are formed by counting the wrong and correct predictions for the 4 classes. Figure 6 shows the block diagram of PyCM with its different inputs and outputs in the form of the reports generated by the Confusion Matrix object.
The designed nested dictionary is applied as input to the Confusion Matrix object provided by the Python Confusion Matrix (PyCM) library. The Confusion Matrix object computes all the per-class and overall metrics and stores them in an HTML file. These metrics are computed for all 9 image caption generator models formed by combining 9 different pre-trained CNN models with the LSTM. The comparison across the different caption generation models is analyzed to select the best caption generator model. The image caption generator model whose performance is superior is selected to aid the Ophthalmologist in predicting and classifying the disease type in OCT images.
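A small sketch of this step with the PyCM library is shown below; the counts in the nested dictionary are placeholders, not results from this work.

```python
# Sketch: build the nested dictionary of actual vs. predicted counts for the four
# classes and let PyCM compute the per-class and overall metrics.
from pycm import ConfusionMatrix

# Placeholder counts: outer keys are actual classes, inner keys are predicted classes.
matrix = {
    'NORMAL':   {'NORMAL': 8, 'DME': 0, 'EarlyAMD': 0, 'LateAMD': 0},
    'DME':      {'NORMAL': 0, 'DME': 7, 'EarlyAMD': 1, 'LateAMD': 0},
    'EarlyAMD': {'NORMAL': 0, 'DME': 0, 'EarlyAMD': 8, 'LateAMD': 0},
    'LateAMD':  {'NORMAL': 0, 'DME': 0, 'EarlyAMD': 1, 'LateAMD': 7},
}
cm = ConfusionMatrix(matrix=matrix)
print(cm.Kappa, cm.Overall_ACC, cm.PPV_Macro, cm.TPR_Macro)
cm.save_html('caption_generator_report')   # writes the per-class and overall metrics report
```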
Overall Performance Analysis of the Pre-trained CNN Models for all the Classes
Many performance metrics were considered and computed for the analysis, namely cross entropy, Kappa, Kappa standard error, overall accuracy, random accuracy, positive predictive value (PPV_Macro and PPV_Micro), true positive rate (TPR_Macro and TPR_Micro) and standard error. They are all used to evaluate and assess the prediction performance of the classifier. This performance analysis provides concrete evidence for the best classifier that can be recommended to the Ophthalmologist for assessing the eye disease.
Cross entropy measures how far the predicted probability distribution deviates from the true probability distribution when predicting the label; here the true distribution is that of the different words present in the image captions describing a particular label or disease class. Cross entropy is zero if the prediction capability of the model is perfect. Figure 8 shows the cross entropy for the 9 different caption models.
Figure 8 Cross Entropy for the caption generator designed with 9 different Pre-
trained Models
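For reference, the standard definition of cross entropy assumed in this analysis, with p the true word distribution for a class and q the distribution predicted by the caption generator, is:

```latex
H(p, q) = -\sum_{x} p(x)\,\log q(x)
```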
Interpretation of the Kappa value (strength of agreement):
< 0.20        Poor
0.21 - 0.40   Fair
0.41 - 0.60   Moderate
0.61 - 0.80   Good
0.81 - 1.00   Very good
Figure 9 Kappa value for the caption generator with 9 different Pre-trained
CNN Models
Figure 9 shows that the image caption generators that used DenseNet201 and Xception as feature extractors have a Kappa value of 0.9167, indicating a very good strength of agreement.
Figure 10 Kappa standard error value for the caption generators designed with 9 different pre-trained CNN models
These metrics support the Ophthalmologist's clinical decision in classifying the different types of eye diseases. Figure 11 shows the overall accuracy of the caption generators designed with the 9 pre-trained CNN models.
Figure 11 PPV_Macro for the caption generator designed with 9 different Pre-trained
CNN Models
It is observed that the caption generators designed with DenseNet201 and Xception for feature extraction have the maximum PPV_Macro value of 0.94. This metric therefore supports recommending these caption generators to the Ophthalmologist for assessing the different eye disease classes.
These values indicate how reliably the caption generators support correct clinical decisions. Figure 12 shows the TPR_Micro and TPR_Macro values for the caption generators designed with the 9 different pre-trained CNN models as feature extractors.
Performance Analysis of the Different Pre-trained CNN Models among Different Classes
The image caption generators can be compared with each other to select the best caption generator based on benchmark results. For each of the 4 classes, this work considered and computed 10 different metrics: accuracy, error rate, F1 score, F2 score, geometric mean of precision and sensitivity, Matthews correlation coefficient, precision or positive predictive value, random accuracy, specificity and sensitivity. Our experiments demonstrate that the two caption generators designed with the pre-trained CNN models DenseNet201 and Xception show good performance in predicting the symptoms for all 4 classes.
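These per-class metrics follow the usual confusion-matrix definitions, where TP, FP, TN and FN denote the true positives, false positives, true negatives and false negatives; the standard formulas assumed here are:

```latex
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}, &
\text{Precision (PPV)} &= \frac{TP}{TP + FP},\\[4pt]
\text{Sensitivity (TPR)} &= \frac{TP}{TP + FN}, &
\text{Specificity} &= \frac{TN}{TN + FP},\\[4pt]
F_{\beta} &= \frac{(1+\beta^{2})\,\text{Precision}\cdot\text{Sensitivity}}{\beta^{2}\,\text{Precision} + \text{Sensitivity}}
\;\; (\beta = 1 \text{ for } F_1,\ \beta = 2 \text{ for } F_2), &
\text{MCC} &= \frac{TP\cdot TN - FP\cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
\end{aligned}
```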
If an imbalanced dataset is used in the training process, then the F2 score is more useful than accuracy. The F2 score is particularly effective for classification when the cost of a false negative is much higher than the cost of a false positive. Four image caption generation models, designed with the pre-trained CNN models ResNet50, DenseNet121, DenseNet169 and Xception for extracting the features used during the training of the LSTM, have an F2 score equal to 1 when generating captions for the Normal class. The image caption generation systems based on DenseNet201 and Xception are found to have an F2 score close to one. Figure 16 shows the F2 score of the 9 different caption generators for each eye disease class.
Figure 18 shows the Matthews correlation coefficient of the 9 different caption generator models for each eye disease class.
Performance Analysis using Precision or Positive Predictive Value (PPV) of each Eye Disease Class:
Precision (PREC) is calculated by dividing the number of correct positive predictions by the total number of positive predictions; it is also called the positive predictive value (PPV). The best precision value is 1.0 and the worst is 0.0. Four caption generator models have a precision equal to 1 when predicting the image caption description for the Normal class. The two caption generator models designed with DenseNet201 and Xception as feature extractors have a precision very close to one when predicting the captions for all 4 classes. Figure 19 shows the precision of the different caption generator models for each eye disease class.
The caption generators designed with DenseNet201 and Xception for feature extraction have a specificity value either equal to 1 or close to 1. Figure 20 shows the specificity of the different caption generation models for each eye disease class.
The image caption generators designed with DenseNet201 and Xception provided the best performance in predicting the captions for the eye diseases correctly. Therefore, they can be used in the design of a clinical decision support system to assist the Ophthalmologist. This work can be further extended by fusing the features from two different pre-trained CNN models and using the fused features to design a new model for caption generation. The performance of such a newly created model can then be analyzed, by comparison with the existing models, to identify its suitability for designing a clinical decision support system.
6. References
[1] Pratul P. Srinivasan, Leo A. Kim, Priyatham S. Mettu, Scott W. Cousins, Grant M. Comer, Joseph A. Izatt & Sina Farsiu (2014): Fully Automated Detection of Diabetic Macular Edema and Dry Age-Related Macular Degeneration from Optical Coherence Tomography Images, Biomedical Optics Express, Vol. 5, No. 10, October 2014.
[2] S. P. K. Karri, Debjani Chakraborty & Jyotirmoy Chatterjee (2017): Transfer Learning Based Classification of Optical Coherence Tomography Images with Diabetic Macular Edema and Dry Age-Related Macular Degeneration, Biomedical Optics Express, Vol. 8, No. 2, February 2017.
[3] Thomas Schlegl, Sebastian M. Waldstein, Hrvoje Bogunovic, Franz Endstrasser, Amir Sadeghipour, Ana-Maria Philip, Dominika Podkowinski, Bianca S. Gerendas, Georg Langs & Ursula Schmidt-Erfurth (2018): Fully Automated Detection and Quantification of Macular Fluid in OCT Using Deep Learning, Ophthalmology (American Academy of Ophthalmology), Vol. 125, No. 4, April 2018.
[4] Rui Zhao, Acner Camino, Jie Wang, Ahmed M. Hagag, Yansha Lu, Steven T. Bailey, Christina J. Flaxel, Thomas S. Hwang, David Huang, Dengwang Li & Yali Jia (2017): Automated Drusen Detection in Dry Age-Related Macular Degeneration by Multiple-Depth, En Face Optical Coherence Tomography, Biomedical Optics Express, Vol. 8, No. 11, 1 November 2017.
[5] Sivaramakrishnan Rajaraman, Sameer K. Antani, Mahdieh Poostchi, Kamolrat Silamut, Md. A. Hossain, Richard J. Maude, Stefan Jaeger & George R. Thoma: Pre-Trained Convolutional Neural Networks as Feature Extractors toward Improved Malaria Parasite Detection in Thin Blood Smear Images, PeerJ 6:e4568; DOI 10.7717/peerj.4568.
[6] U. K. Lopes & J. F. Valiati (2017): Pre-Trained Convolutional Neural Networks as Feature Extractors for Tuberculosis Detection, Computers in Biology and Medicine, 89 (2017), 135-143.
[7] Philippe Burlina, Katia D. Pacheco, Neil Joshi, David E. Freund & Neil M. Bressler (2017): Comparing Humans and Deep Learning Performance for Grading AMD: A Study in Using Universal Deep Features and Transfer Learning for Automated AMD Analysis, Computers in Biology and Medicine, 82, 80-86, 2017.
[8] Felix Grassmann, Judith Mengelkamp, Caroline Brandl, Sebastian Harsch, Martina E. Zimmermann, Birgit Linkohr, Annette Peters, Iris M. Heid, Christoph Palm & Bernhard H. F. Weber (2018): A Deep Learning Algorithm for Prediction of Age-Related Eye Disease Study Severity Scale for Age-Related Macular Degeneration from Color Fundus Photography, Ophthalmology (American Academy of Ophthalmology), Vol. 125, No. 9, September 2018.
[9] Min Yang, Wei Zhao, Wei Xu, Yabing Feng, Zhou Zhao, Xiaojun Chen & Kai Lei: Multitask Learning for Cross-Domain Image Captioning, IEEE Transactions on Multimedia, Vol. 21, No. 4, April 2019.
[10] Yansong Feng & Mirella Lapata: Automatic Caption Generation for News Images, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 35, No. 4, pp. 797-811, April 2013.
[11] Yang Yang, Jie Zhou, Jiangbo Ai, Yi Bin, Alan Hanjalic, Heng Tao Shen & Yanli Ji: Video Captioning by Adversarial LSTM, IEEE Transactions on Image Processing, Vol. 27, No. 11, pp. 5600-5612, November 2018.
[12] Jie Wu & Haifeng Hu: Cascade Recurrent Neural Network for Image Caption Generation, Electronics Letters (IET), Vol. 53, No. 25, pp. 1642-1643, 2017.
[13] Xiaodong He & Li Deng: Deep Learning for Image-to-Text Generation: A Technical Review, IEEE Signal Processing Magazine, Vol. 34, No. 6, pp. 109-116, November 2017.
Conference Proceedings
[14] Jingjing Deng, Xianghua Xie, Louise Terry, Ashley Wood, Nick White, Tom H. Margrain & Rachel V. North: Age-Related Macular Degeneration Detection and Stage Classification Using Choroidal OCT Images, International Conference on Image Analysis and Recognition (ICIAR 2016), pp. 707-715, 2016.
[15] Luhui Wu, Cheng Wan, Yiquan Wu & Jiang Liu: Generative Captions for Diabetic Retinopathy Images, 2017 International Conference on Security, Pattern Analysis, and Cybernetics (SPAC), pp. 515-519, 2017.
[16] Genevieve C. Y. Chan, Awais Muhammad, Syed A. A. Shah, Tong B. Tang, Cheng-Kai Lu & Fabrice Meriaudeau: Transfer Learning for Diabetic Macular Edema (DME) Detection on Optical Coherence Tomography (OCT) Images, Proc. of the 2017 IEEE International Conference on Signal and Image Processing Applications (IEEE ICSIPA 2017), Malaysia, September 2017.