Performance Evaluation of Medical Image Captioning Using
Abdul Rafiq, Akurathi HarshaVarshan, Chinta Sai Charan Tej, Bala Nireekshan
PVPSIT, Vijayawada, India.
[email protected], [email protected], [email protected], [email protected]
Abstract:
Medical image captioning is a rapidly developing discipline that combines computer vision and natural language processing, with the goal of automatically generating descriptive text that summarizes a medical image. We build on the process used in prior research and incorporate recent techniques such as the attention mechanism to produce captions that are more meaningful with respect to the underlying disease. The purpose of the attention mechanism is to let the model focus on the most important regions of the medical image so that the generated description is contextually appropriate and diagnostically relevant. Initial findings highlight the usefulness of LSTM networks, a kind of RNN that is more advanced than other RNN implementations and shows better outcomes. In addition, combining encoder-decoder frameworks with attention operations goes a long way toward improving the accuracy and coherence of captions in medical imaging. Our performance assessment departs from BLEU scores, the traditional metric, and instead relies on ROC curves tailored to medical image captioning tasks. The ROC curves provide a detailed assessment of the models' performance, particularly on tasks that involve binary classification, and give a clear picture of the model's ability to distinguish between useful and non-useful image features.
I. INTRODUCTION
Image captioning is an active research area with applications across various fields, such as social media, healthcare, and autonomous vehicles, and it benefits both society and industry. Researchers find this topic compelling because it merges two key AI fields: Natural Language Processing (NLP) and Computer Vision. Describing images accurately requires understanding their content, which is achieved through techniques such as image classification and object recognition.
The adoption of AI in radiology departments surged during the COVID-19 pandemic, when it proved useful for reducing the pressure on doctors by cutting the time needed to analyze X-rays. Chest radiography plays a crucial role in diagnosing thoracic diseases. AI can assist with this work, but reading the images accurately is not simple. Fortunately, advances in deep learning models have improved the accuracy of disease detection, and doctors can now use these algorithms to reduce the burden of X-ray analysis. Deep learning models are making strides in diagnosing diabetic retinopathy, skin cancer, and lymph node metastases. They are also gaining ground in chest radiography, helping with tasks such as spotting tuberculosis and finding lung nodules. Much of this progress owes a great deal to datasets like ChestX-ray8, which have sparked extensive research in this field. New technologies such as AI and deep learning algorithms have improved diagnostic results, but these systems have so far been tested against human radiologists in real-time settings only to a limited extent, so their reliability and effectiveness remain to be confirmed.
CNNs can be hugely useful in medical science. Using a well-suited specialized pre-trained model such as DenseNet, healthcare providers can attain high accuracy in interpreting medical images such as X-rays or MRI scans. In addition, Grad-CAM combined with VGG16 hybrid models can be employed as a visualization technique that shows which areas of the image the model focuses on when generating the caption, which can facilitate better understanding by health professionals. CNNs and RNNs focus primarily on smaller image regions: if the different features of an image are not arranged in a meaningful order, they cannot relate them. Transformers, by contrast, use a self-attention mechanism that can relate different parts of the image even when the features are not in the expected position, and they also capture long-range dependencies more efficiently than other models.
Image captioning generates text describing an image in an automated manner. This technology can be applied in a variety of ways, including creating captions for visually impaired users, improving image search results, and automating tasks such as image annotation. Three common approaches are outlined below.
1. Template-Based Captioning: Pre-designed templates containing blanks are filled in with recognizable objects, actions, and attributes from the given image. Although easy to use and straightforward, this approach lacks flexibility and fails to describe more complicated images.
2. Retrieval-Based Captioning: Captions are copied from visually similar images and reused for the input image. This strategy works well for generic photographs, but it may miss details specific to the exact image.
3. Deep Neural Network (DNN) Based Captioning: This is the most developed and strongest method, relying on the strengths of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).
In image captioning, an encoder converts the input image into a meaningful, fixed-dimensional representation, also known as a "feature vector" or an "embedding." This representation captures important visual information that is passed to the decoder for further processing. The decoder, in turn, produces sequential output: a sequence of words that together form a caption, based on the encoded information provided by the encoder. To generate meaningful and relevant captions, the decoder commonly uses recurrent neural network (RNN) architectures such as Long Short-Term Memory (LSTM) networks or Gated Recurrent Units (GRUs). RNNs as decoders: generating long sequences such as captions is an inherent problem for regular RNNs. LSTM networks were chosen because this type of neural network is well suited to sequential data; LSTMs are a particular kind of RNN that preserves image-conditioned information over long time spans, helping build more accurate and detailed captions.
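To make this decoding loop concrete, the following is a minimal Python sketch (assuming Keras/TensorFlow) of greedy word-by-word caption generation; the names caption_model, tokenizer, startseq, and endseq are hypothetical placeholders for a trained captioning model, a fitted word tokenizer, and the special start and end tokens of a concrete implementation.

import numpy as np

def greedy_decode(caption_model, tokenizer, image_feature, max_len=30):
    # Begin every caption with the special start token and grow it one word at a time.
    caption = ["startseq"]
    for _ in range(max_len):
        # Encode the partial caption as integer indices and pad it to a fixed length.
        seq = tokenizer.texts_to_sequences([" ".join(caption)])[0]
        seq = np.pad(seq, (0, max_len - len(seq)))[None, :]
        # The decoder predicts a probability distribution over the vocabulary for the next word.
        probs = caption_model.predict([image_feature[None, :], seq], verbose=0)[0]
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == "endseq":
            break
        caption.append(word)
    return " ".join(caption[1:])

Greedy decoding always takes the most probable next word; beam search is a common refinement that keeps several candidate captions in parallel.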
Classical RNNs iterate over every word in a sentence in discrete time steps, which consumes more computation as the sentence length increases. Information from the beginning of the sequence may be lost, or "washed away," as the network processes the later elements. LSTMs handle this vanishing-gradient problem through a special memory cell that lets the network retain information for longer durations. The memory cell is the heart of the LSTM architecture: it enables the network to relate words that are far apart and thus learn their dependencies.
Gates: The LSTM manages the information flowing through it with three gates and two states: the input gate, the forget gate, the output gate, the cell state, and the hidden state.
• Input Gate (it): chooses what information from the current input (xt) is written into the cell.
• Forget Gate (ft): decides which data from the previous cell state (Ct-1) to forget.
• Output Gate (ot): determines which part of the current cell state (Ct) is exposed through the hidden state (ht) as output.
• Cell State (Ct): the memory of the LSTM. It carries relevant data from previous time steps, which enables the network to learn long-term dependencies.
• Hidden State (ht): the output of the LSTM cell at the current time step, computed from both the current cell state and the output gate.
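For reference, the standard LSTM update equations corresponding to the gates above can be written as follows, where \sigma is the sigmoid function, \odot denotes element-wise multiplication, and the W and b terms are learned weights and biases:

i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)
\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
h_t = o_t \odot \tanh(C_t)

The additive update of C_t is what allows gradients to flow across many time steps without vanishing.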
A Transformer is a deep learning model that relies on attention mechanisms without using RNNs; the attention mechanism amplifies the important parts of the input data and suppresses the rest. LSTMs, in contrast, are a specific type of RNN introduced to overcome the vanishing gradient problem that limits traditional RNNs. Through their sequential handling mechanism, LSTMs increase accuracy and make image descriptions more precise and detailed. They can capture the relations between the elements in a picture and summarize the whole scene in a single sentence.
The presented model is an encoder-decoder structure comprising a VGG16 network trained on the Places1365 dataset as the encoder and an LSTM network as the decoder. With this configuration, the model extracts visual features from the image (via the CNN) and then uses them to generate a textual description (through the LSTM).
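As an illustration of this encoder-decoder wiring, the following is a minimal Keras/TensorFlow sketch of a CNN-feature-plus-LSTM captioning model; the 4096-dimensional image features, 256-unit layers, vocabulary size, and maximum caption length are illustrative assumptions, and the VGG16 features are assumed to be extracted offline.

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size, max_len, feat_dim = 5000, 30, 4096  # illustrative values

# Encoder branch: project the precomputed VGG16 feature vector.
img_in = Input(shape=(feat_dim,))
img_emb = Dense(256, activation="relu")(Dropout(0.5)(img_in))

# Decoder branch: embed the partial caption and run it through an LSTM.
seq_in = Input(shape=(max_len,))
seq_emb = Embedding(vocab_size, 256, mask_zero=True)(seq_in)
seq_feat = LSTM(256)(Dropout(0.5)(seq_emb))

# Merge both modalities and predict the next word of the caption.
merged = Dense(256, activation="relu")(add([img_emb, seq_feat]))
out = Dense(vocab_size, activation="softmax")(merged)

caption_model = Model(inputs=[img_in, seq_in], outputs=out)
caption_model.compile(loss="categorical_crossentropy", optimizer="adam")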
II. Literature survey
In [1], the authors studied image captioning using deep neural networks. In [2], the authors discussed the advantage of using machine learning for speech recognition with deep recurrent neural networks. In [3], the authors presented an in-depth study of convolutional neural networks. In [4], the authors used a captioning transformer with scene graph guiding. In [5], the authors studied fine-grained control of image caption generation with abstract scene graphs. In [6], the authors discussed image caption generation using a deep architecture. In [7], the authors presented a study on Visual Skeleton and Reparative Attention for a part-of-speech image captioning system. In [8], the authors provided a systematic literature review on image captioning.
The model is trained with a cross-entropy loss. For an example x_i with label y_i, where f(x_i) is the model output, i.e. the predicted probability of being positive, the loss is

L_{\text{cross-entropy}}(x_i) = -\big( y_i \log f(x_i) + (1 - y_i)\log(1 - f(x_i)) \big)   (5)

This formula shows that, under a large class imbalance, when one class has very few representatives, the loss is dominated by the negative examples. Summing the support obtained from all cases of each class, the frequency of each class (positive or negative) is

freq_p = \frac{\text{number of positive examples}}{N}, \qquad freq_n = \frac{\text{number of negative examples}}{N},

where N is the total number of examples; these frequencies can be used to weight the two terms of the loss so that rare positive cases contribute on equal footing.
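A minimal Keras/TensorFlow sketch of such a frequency-weighted binary cross-entropy is given below; y_train (the multi-label training matrix of shape N x number of pathologies) and model are assumed to already exist, and epsilon only guards against log(0).

import numpy as np
import tensorflow.keras.backend as K

def get_weighted_loss(pos_weights, neg_weights, epsilon=1e-7):
    def weighted_loss(y_true, y_pred):
        loss = 0.0
        for i in range(len(pos_weights)):
            # Weight the positive and negative terms of the cross-entropy for pathology i.
            loss += K.mean(
                -(float(pos_weights[i]) * y_true[:, i] * K.log(y_pred[:, i] + epsilon)
                  + float(neg_weights[i]) * (1 - y_true[:, i]) * K.log(1 - y_pred[:, i] + epsilon)))
        return loss
    return weighted_loss

# Class frequencies from the training labels; using each frequency as the opposite
# class's weight makes rare positives contribute as much to the loss as common negatives.
freq_pos = np.mean(y_train, axis=0)
freq_neg = 1.0 - freq_pos
model.compile(optimizer="adam",
              loss=get_weighted_loss(pos_weights=freq_neg, neg_weights=freq_pos))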
Feature Extraction:
Feature extraction is an important part of dimensionality reduction, in which raw data is transformed into a manageable, classifiable form. It is crucial because it shapes the other processes carried out on the data during management and processing. With big data involving many variables, this processing can be very resource intensive. Feature extraction alleviates the problem in two ways. First, it identifies the variables that are most critical and can be handled in parallel. Second, it combines these variables into new aggregate variables, thereby reducing the size of the data. The resulting features are much less complex to process yet retain most of the information contained in the original dataset. The main value of feature extraction is the simplification of complex data into a format that is easy to handle while preserving crucial details. It eases analysis by reducing data dimensionality without distorting the fundamental character of the data. Feature extraction therefore bridges the gap between raw data and applications, and offers further opportunities to improve pattern recognition and prediction across many areas of data analysis and machine learning. It reduces the redundancy incorporated in the representation of data, which improves algorithm performance, lowers memory requirements, and makes the obtained solutions easier to understand. This makes it an important tool in the toolbox of every data scientist working on problems in domains such as bioinformatics, image processing, and natural language processing, where datasets tend to be high dimensional.
Consider an image of the number 8, which is made up of small square boxes called pixels. While we can easily see the image and distinguish its edges and colors, machines cannot do this as easily: they store images as a matrix of numbers whose size is determined by the number of pixels. For example, an image with dimensions of 180 x 200 (n x m) has 180 rows and 200 columns of pixels, representing its height and width. Each number in the matrix corresponds to a pixel value indicating the intensity or brightness of that pixel; smaller numbers closer to zero represent black, while larger numbers closer to 255 represent white. As a smaller example, consider an image with dimensions of 22 x 16, which can be verified by counting the pixels. How do machines store colour images? The example above involves a black-and-white image, but colored images are much more common. Are they also stored as a 2D matrix? A colored image is made up of various colors, most of which can be created from three primary colors: red, green, and blue. A colored image is therefore represented by three matrices, or channels, corresponding to these primary colors, with each matrix containing values between 0 and 255 indicating the intensity of that color at each pixel. (Illustrative matrices are usually shown with simplified values, since the original matrix would be very large and difficult to visualize.) There are different formats for storing images, but the RGB format is the most widely used.
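The pixel-matrix view described above can be checked directly; the following small Python sketch (assuming NumPy and Pillow, with a hypothetical file name) loads an image first as a single grayscale matrix and then as three RGB channels.

import numpy as np
from PIL import Image

gray = np.array(Image.open("chest_xray.png").convert("L"))   # hypothetical file
print(gray.shape)               # e.g. (180, 200): height x width, one intensity per pixel
print(gray.min(), gray.max())   # values between 0 (black) and 255 (white)

rgb = np.array(Image.open("chest_xray.png").convert("RGB"))
print(rgb.shape)                # e.g. (180, 200, 3): one matrix per red, green, and blue channel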
The dense connectivity that characterizes DenseNet architectures has shown remarkable prowess in extracting
features and classifying images, making it a natural fit for the intricate world of medical image analysis. By
opting for a pre-trained model, we're not starting from scratch. This approach is like having a seasoned expert on
our team, one who has seen thousands of images and learned to spot the subtlest of patterns. This new approach
is like giving each layer a pair of super-powered hearing aids. Now, instead of just listening to the layer right
before it, each layer can tune in to all the layers that came before. It's as if we've turned our neural network into
a chatty cocktail party where information flows freely in all directions. By letting layers communicate more
openly, DenseNets help ensure that important information doesn't get lost in translation. It's like making sure
that crucial clue in a mystery novel doesn't get overlooked just because it was mentioned in chapter one. In
essence, DenseNets took the traditional CNN blueprint and gave it a social networking upgrade, creating a more
connected, chatty, and potentially more insightful neural network. DenseNet creates a richer, more
interconnected learning environment. It's not just about passing information along; it's about
creating a network where every piece of knowledge remains accessible and useful throughout
the entire process.
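As a minimal sketch of how such a pre-trained network can be reused (assuming Keras/TensorFlow), a DenseNet121 backbone with ImageNet weights can be turned into a multi-label chest X-ray classifier; the 224 x 224 input size and the 14 sigmoid outputs are illustrative assumptions in the style of ChestX-ray label sets.

from tensorflow.keras.applications import DenseNet121
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

# Pre-trained, densely connected backbone used as a feature extractor.
base = DenseNet121(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
features = GlobalAveragePooling2D()(base.output)
# One sigmoid output per pathology for multi-label prediction.
predictions = Dense(14, activation="sigmoid")(features)
model = Model(inputs=base.input, outputs=predictions)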
In the final step, we evaluate our trained model using a separate testing dataset. This helps us assess how well
the model might perform in real-world medical settings. We compare the model's diagnoses against expert-
verified labels, calculating important metrics like accuracy, sensitivity, and specificity. We don't just look at the
numbers, though. We carefully examine cases where the model's predictions differ from the expert annotations.
This analysis helps us identify potential weaknesses in the model and gives us valuable insights for future
improvements. The goal is to ensure our model is not only statistically sound but also practical and reliable for
supporting medical professionals in their diagnostic work. Each refinement we make has the potential to
enhance patient care and outcomes. Following the training phase, we conduct a comprehensive evaluation of the
model's performance using our designated validation dataset. The primary metric employed for assessment is
the area under the receiver operating characteristic curve (AUC-ROC), calculated independently for each
pathology. The AUC-ROC provides a quantitative measure of the model's discriminative ability, indicating its
capacity to differentiate between positive and negative cases across the spectrum of decision
thresholds. This metric is particularly valuable in medical imaging contexts, where the
balance between sensitivity and specificity is crucial. Based on the validation results, we may
engage in hyperparameter optimization. This process involves fine-tuning key parameters
such as learning rate and dropout rate to maximize the model's generalization capability. The
objective is to achieve optimal performance across all pathologies while mitigating
overfitting. This rigorous evaluation and refinement process ensures that the model not only
performs well on the training data but also demonstrates robust generalization to unseen
cases. Such thorough validation is essential in developing reliable AI-assisted diagnostic tools
for clinical applications.
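A minimal sketch of this per-pathology AUC-ROC evaluation (assuming scikit-learn and matplotlib) is shown below; y_true and y_pred are assumed arrays of shape (N, number of pathologies), and pathologies is a hypothetical list of label names.

import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

for i, name in enumerate(pathologies):
    auc = roc_auc_score(y_true[:, i], y_pred[:, i])
    fpr, tpr, _ = roc_curve(y_true[:, i], y_pred[:, i])
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc:.3f})")

plt.plot([0, 1], [0, 1], linestyle="--")  # chance-level reference line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()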
VI. Visualization for disease detection:
We implement Gradient-weighted Class Activation Mapping (GradCAM) to visualize the model's decision-
making process. This technique generates heat maps highlighting regions of chest X-rays that significantly
influence pathology predictions. By employing this visualization method, we aim to increase the transparency
and trustworthiness of our AI model. This approach bridges the gap between complex deep learning algorithms
and clinical application, potentially improving the model's acceptance and efficacy in real-world diagnostic
settings. Gradient-weighted Class Activation Mapping, or Grad-CAM, serves as a powerful tool in
decoding the complex decision-making process of artificial intelligence, particularly in image recognition
tasks. Think of it as a translator between the intricate language of AI and our human understanding. In
essence, Grad-CAM allows us to peek into the 'mind' of an AI model, showing us what it focuses on when
making decisions. It's akin to asking a master chef to explain which ingredients in a complex dish
contribute most to its final flavor. The chef doesn't change the recipe but instead points out the key
elements that make the dish special. In the realm of medical imaging, Grad-CAM acts like an AI assistant
capable of not just suggesting a diagnosis but also showing its work. Imagine a radiologist working with
an AI that can highlight specific areas of an X-ray, saying, "I've identified potential pneumonia, and these
shadowy areas here and here are what led me to this conclusion." This visual explanation bridges the gap
between the AI's rapid analysis and the doctor's expert interpretation. What makes Grad-CAM
particularly valuable is its non-invasive nature. It doesn't require rebuilding or retraining our AI models.
Instead, it's like giving our existing AI a special pair of glasses that allow it to show us what it sees as
important. The output of Grad-CAM is a heat map overlaid on the original image, much like a weather
map showing hot and cold spots. The 'hottest' areas are those the AI deemed most crucial for its decision.
This visual representation allows both AI researchers and professionals in various fields to 'see' what the
model sees, fostering a deeper understanding of AI decision-making. By providing this level of
transparency, Grad-CAM is helping to build trust in AI-assisted processes across various industries. It's
paving the way for a future where artificial and human intelligence can collaborate more effectively,
combining the rapid processing power of AI with the nuanced understanding of human experts. In a world
where AI is increasingly influencing critical decisions, Grad-CAM stands as a crucial bridge, ensuring
that these powerful tools remain understandable, trustworthy, and aligned with human oversight.
In the intricate landscape of artificial intelligence, particularly when it comes to teaching machines to
"see," Grad-CAM emerges as a powerful tool that helps us peek behind the curtain. This technique tackles
a fundamental challenge in advanced AI: understanding how our digital eyes make sense of the world
around them. Grad-CAM works its magic by tracing the AI's thought process backwards, revealing which
parts of an image tipped the scales for a particular decision. Unlike simpler methods, it offers insights
tailored to specific classifications, giving us a richer understanding of the AI's reasoning. In real-world
scenarios, such as medical imaging, Grad-CAM shines brightly. It allows doctors to see exactly what
caught the AI's eye in an X-ray or MRI when suggesting a diagnosis. This visual feedback fosters a
stronger partnership between AI assistants and human experts, building trust and leading to more
informed choices.
Beyond healthcare, Grad-CAM's usefulness spreads across industries where image analysis is key. From
self-driving cars to quality checks in factories, this technique helps stakeholders understand and verify AI
decisions. By building a bridge between complex AI algorithms and human insight, Grad-CAM plays a
crucial role in making AI more explainable. It helps lift the veil on the often mysterious world of deep
learning, paving the way for AI systems we can better understand and trust. As AI becomes more
integrated into important decision-making processes, techniques like Grad-CAM will be vital. They'll
help ensure that these systems not only perform well but also operate in a way that's transparent and in
harmony with human oversight.
b. Working Principle
Grad-CAM serves as an interpreter for complex AI models, particularly in image recognition. It's like
giving AI the ability to point out what it's looking at when making decisions, similar to how an expert
might highlight key features in an image. This technique works by examining the final stages of the AI's
processing, tracing back from the conclusion to identify the most influential parts of the image. It
generates heat maps that highlight these crucial areas, providing visual explanations for the AI's
decisions. One of Grad-CAM's strengths is its versatility. It can be applied to various AI models without
requiring changes to their structure, making it a valuable tool across different applications. Importantly,
Grad-CAM achieves this transparency without compromising the AI's accuracy. It balances the need for
interpretability with maintaining high performance.
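The working principle can be summarized in a minimal Grad-CAM sketch (assuming Keras/TensorFlow); model, the name of its last convolutional layer, the preprocessed image batch img of shape (1, H, W, 3), and the target class index are all assumptions supplied by a concrete pipeline.

import tensorflow as tf

def grad_cam(model, img, last_conv_name, class_idx):
    # Auxiliary model returning both the last convolutional feature maps and the prediction.
    grad_model = tf.keras.models.Model(
        model.inputs, [model.get_layer(last_conv_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_maps, preds = grad_model(img)
        score = preds[:, class_idx]
    # Gradient of the class score with respect to each feature map, averaged spatially.
    grads = tape.gradient(score, conv_maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))
    # Weighted sum of the feature maps, followed by ReLU and normalisation to [0, 1].
    cam = tf.reduce_sum(weights[:, tf.newaxis, tf.newaxis, :] * conv_maps, axis=-1)
    cam = tf.nn.relu(cam)[0]
    cam = cam / (tf.reduce_max(cam) + 1e-8)
    return cam.numpy()  # low-resolution heat map, resized and overlaid on the X-ray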
In practical terms, Grad-CAM is particularly useful in fields like medical imaging. It allows healthcare
professionals to see which areas of an X-ray or MRI the AI focused on when suggesting a diagnosis,
enhancing collaboration between AI systems and human experts. By providing this level of insight, Grad-
CAM is helping to build trust in AI-assisted analysis across various industries. It's a crucial step towards
creating AI systems that are not just accurate, but also understandable and accountable, aligning advanced
technology with human oversight and interpretation.
VIII. Conclusion:
The pre-trained model's performance, visualized through ROC curves and AUC metrics, shows promising
results across various pathologies, particularly for conditions like Cardiomegaly (0.804 AUC) and Edema. The
model demonstrates high accuracy (99.1%) and recall (86%) in discriminating positive from negative cases.
When tested on modified X-ray scans, the model exhibits proficiency in identifying key pathologies and
providing accurate diagnoses, sometimes surpassing human performance. It's capable of predicting both primary
and secondary issues, adhering to standard radiological practices. The model's ability to correctly identify
conditions like cardiomegaly, masses, and edema showcases its precision. However, some predictions may be
influenced by image artifacts or anatomical variations, highlighting areas for future refinement. The algorithm's
capacity to detect specific conditions in relevant anatomical regions, such as edema near the diaphragm,
demonstrates its potential for targeted diagnostic applications. The visualization techniques employed provide
valuable insights into the model's decision-making process, offering a roadmap for future enhancements.
References
[1] S. Liu, L. Bai, Y. Hu, and H. Wang. Image Captioning Based on Deep Neural Networks. MATEC Web of Conferences, 232, 01052, 2018. doi:10.1051/matecconf/201823201052.
[2] A. Graves, A. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent neural networks. Pages 6645-6649, 2013.
[3] S. Albawi and T. A. Mohammed. Understanding of a Convolutional Neural Network. 2017.
[4] H. Chen, Y. Wang, X. Yang, and J. Li. Captioning transformer with scene graph guiding. In 2021 IEEE International Conference on Image Processing (ICIP), pages 2538-2542. IEEE, 2021. doi:10.1109/ICIP42928.2021.9506193.
[5] S. Chen, Q. Jin, P. Wang, and Q. Wu. Say as you wish: Fine-grained control of image caption generation with abstract scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9962-9971, Manhattan, New York, U.S., 2020. IEEE. doi:10.1109/CVPR42600.2020.00998.
[6] A. Hani, N. Tagougui, and M. Kherallah. Image Caption Generation Using A Deep Architecture. In 2019 International Arab Conference on Information Technology (ACIT), pages 246-251, 2019. doi:10.1109/ACIT47987.2019.8990998.
[7] L. Yang and H. Hu. Visual Skeleton and Reparative Attention for Part-of-Speech image captioning system. Computer Vision and Image Understanding, 189, 102819, 2019.
[8] R. Staniūtė and D. Šešok. A Systematic Literature Review on Image Captioning. Applied Sciences, 9(10), 2024, 2019.