CHAPTER 1: Introduction
1.1 Overview
In today’s e-commerce landscape, where personalization and convenience are essential, our
project addresses a key challenge: the inability to try on products before purchase. With
growing demand for immersive online shopping, we are developing an Augmented Reality
(AR) Glasses Try-On App. This app offers users a realistic virtual try-on experience, helping
them explore how different frames fit and look on their face.
Using real-time AR technology, the app ensures accurate visualization with face sizing tools
and fitting recommendations for a perfect match. It also provides customized lens options
and product filters for easy selection. By bridging the gap between in-store and online
experiences, our solution enables users to make confident purchasing decisions from the
comfort of their home.
Our goal is to create a more engaging and seamless shopping journey that enhances
customer satisfaction and minimizes returns. This project aims to redefine the eyewear
shopping process, improving usability, customer experiences, and boosting sales
conversions for retailers.
1.2 Motivation
Our AR Glasses Try-On App stems from the need to empower users with an immersive,
convenient way to explore eyewear without the limitations of traditional online shopping.
With real-time AR technology, we aim to provide an interactive experience where users can
accurately visualize frames on their face. A clear, intuitive design allows users to easily
browse frames, customize lenses, and find the best fit for their preferences.
Our goal is to enhance user confidence and satisfaction, enabling informed decisions through
a straightforward shopping journey.
1.3 Objectives
The objective of the proposed AR Glasses Try-On App is to simplify and enhance the online
eyewear shopping experience by providing real-time AR-based try-ons. The app aims to
merge the convenience of online shopping with the accuracy of in-store fitting, offering
users an accurate preview of how glasses will fit their face. With tailored recommendations
for frames and lenses, we aim to increase user confidence, reduce returns, and drive higher
sales conversions. Key objectives include:
● Offering multiple lens types, such as single vision, blue light filtering, and transition
lenses.
● Supporting prescription lens customization, ensuring products meet users’ specific needs.
● Combining frame selection, lens customization, and fitting tools into one streamlined
app.
● Simplifying the shopping journey, making it easier for users to find and purchase their
ideal glasses.
1.4 Scope
The proposed AR Glasses Try-On App will enhance the online shopping experience for
eyewear by offering real-time virtual try-ons using AR-based face tracking and visualization.
The app supports lens customization, allowing users to choose from pre-saved or new
prescriptions.
Designed as a multi-platform solution, the app ensures seamless navigation across web and
Android devices. It also offers filters for size charts, material types, and style preferences,
providing a personalized shopping experience tailored to individual needs.
1.5 Constraints
The successful implementation of the AR Glasses Try-On App depends on several critical constraints that must be considered:
● The success of the AR try-on feature depends on access to accurate 3D models of glasses and frames; limited availability of these models could impact the user experience.
● The app relies on advanced AR and face-tracking technologies, which may not perform optimally on older devices, potentially affecting some users' experiences.
● As the app collects facial data for tracking purposes, it must comply with data privacy regulations (e.g., GDPR) and maintain high security standards to protect personal information.
● The lens selection process requires users to provide accurate prescription details, either by uploading pre-saved prescriptions or inputting new ones; inaccurate data could affect the quality of the final product.
1.6 Document Organization
Now that we have discussed the problem in detail, this section describes the content of the following chapters, which explain our solution to the problem.
Chapter 2: Background. This chapter discusses the technical background of our problem, explaining the methods and technologies we used, and explains each method in depth through the mathematical foundations behind our machine learning models.
Chapter 3: Literature Survey. This chapter captures the essence of researchers' efforts on this topic and discusses many research papers that were proposed to solve this problem. Studying, analysing, and discussing these papers helped us introduce our solution. It also covers the medical chatbot applications available in the market and their pros and cons.
Chapter 4: Proposed Architecture. In this chapter we introduce our system prototype and explain the methodology we used in our experiments: functional and non-functional requirements, the use case diagram, sequence diagram, class diagram, and system architecture.
Chapter 5: Implementation and Testing. This chapter covers the implementation code and the tests conducted for the model.
Chapter 6: Conclusion and future work. The final chapter will summarize our achievements
and the outcomes of our project and possible future directions.
CHAPTER 2: Background
2.1 Machine Learning Overview
Supervised Learning:
Supervised machine learning algorithms are designed to learn by example. The name
“supervised” learning originates from the idea that training this type of algorithm is like
having a teacher supervise the whole process.
When training a supervised learning algorithm, the training data will consist of inputs paired
with the correct outputs. During training, the algorithm will search for patterns in the data
that correlate with the desired outputs. After training, a supervised learning algorithm will take in new, unseen inputs and determine which label they should be classified as, based on the prior training data. The objective of a supervised learning model is to predict the correct label for newly presented input data. In its most basic form, a supervised learning algorithm can be written simply as:
Equation (1): Y=f(X)+ε
Where Y is the predicted output, determined by a mapping function f that assigns a class to an input value X, and ε is the irreducible error. The function used to connect input features to a predicted output is learned by the machine learning model during training, using N labeled training examples (x1, y1), ..., (xN, yN).
Supervised learning can be split into two subcategories: Classification and regression.
Regression:
Objective: In regression, the goal is to predict a continuous output variable Y based on input features X, Equation (2): Y = wX + b
Where:
● Y is the predicted output.
● X is the input data.
● w is the weight.
● b is the y-intercept (bias).
Loss Function: The commonly used loss function for regression is the Mean Squared Error (MSE), Equation (3): MSE = (1/n) Σ_{i=1}^{n} (yi − ŷi)²
Where:
● yi is the true value and ŷi is the predicted value for the i-th data point.
● n is the number of data points.
● β0 is the intercept and β1, ..., βn are the coefficients of the input features.
The goal is to find the values of β0, β1, ..., βn that minimize the MSE.
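A minimal Python sketch (using NumPy, with illustrative data) of how the MSE of Equation (3) is computed for the simple linear model of Equation (2):

```python
import numpy as np

# Illustrative data (assumed for this sketch): one input feature.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

# Candidate linear model y_hat = w * x + b (Equation (2)).
w, b = 2.0, 0.1
y_hat = w * X + b

# Mean Squared Error (Equation (3)).
mse = np.mean((y - y_hat) ** 2)
print(f"MSE = {mse:.4f}")
```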
Classification:
Objective: In classification, the goal is to predict a discrete output variable Y that belongs to a specific class or category based on input features X, Equation (5): Y = f(X)
Loss Function: The commonly used loss function for classification is the Cross-Entropy Loss, Equation (6): L = −(1/n) Σ_{i=1}^{n} [yi log(ŷi) + (1 − yi) log(1 − ŷi)]
Where:
● ŷi is the predicted probability of belonging to class 1 for the i-th data point.
● yi is the true class label (0 or 1).
● β0 is the intercept and β1, ..., βn are the model coefficients.
The goal is to find the values of β0, β1, ..., βn that minimize the Cross-Entropy Loss.
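A corresponding sketch of the cross-entropy loss in Equation (6), again with illustrative labels and predicted probabilities:

```python
import numpy as np

# Illustrative true labels (0 or 1) and predicted class-1 probabilities.
y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.7, 0.6])

# Binary cross-entropy loss (Equation (6)).
loss = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(f"Cross-entropy = {loss:.4f}")
```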
Unsupervised:
Unsupervised learning, also known as unsupervised machine learning, uses machine
learning algorithms to analyze and cluster unlabelled datasets. These algorithms discover
hidden patterns or data groupings without the need for human intervention.
Its ability to discover similarities and differences in information makes it the ideal solution
for exploratory data analysis, cross-selling strategies, customer segmentation, and image
recognition.
Given: a set of N unlabelled inputs {x 1, ..., xN} Goal: learn some intrinsic structure in the
inputs (e.g., groups/clusters) Example: Automatically grouping news stories (Google News)
Clustering:
Objective: In clustering, the goal is to group similar data points into clusters, where points
within the same cluster are more similar to each other than to points in other clusters.
Equation (7): Minimize Σ_{i=1}^{k} Σ_{j=1}^{n} ‖x_j^(i) − c_i‖²
Where:
● x_j^(i) is the j-th data point assigned to cluster i, and c_i is the centroid of cluster i.
● k is the number of clusters and n is the number of points assigned to each cluster.
The objective is to minimize the sum of squared distances between data points and their assigned cluster centroids.
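A minimal sketch of this objective using scikit-learn's KMeans; the data points here are illustrative only:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative 2-D points forming two loose groups.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.9, 5.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Centroids:", kmeans.cluster_centers_)
print("Assignments:", kmeans.labels_)
# inertia_ is the minimized sum of squared distances of Equation (7).
print("Objective (inertia):", kmeans.inertia_)
```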
(Figure 2) shows the Machine Learning classification
Figure 2 Machine Learning Classification
Neural networks, also known as artificial neural networks (ANNs) or simulated neural
networks (SNNs), are a subset of machine learning and the core of deep learning algorithms.
Inspired by the human brain, they consist of node layers with input, hidden, and output
layers. These networks rely on training data to improve their accuracy, making them powerful tools in computer science and artificial intelligence for tasks such as speech and image recognition.
Input Layer:
The input layer is the initial layer of a neural network that takes in the raw input data, and its nodes represent the features of the input, X = [x1, x2, ..., xn]
Where:
● x1, ..., xn are the input features and n is the number of features.
Hidden Layers:
Hidden layers are intermediary layers between the input and output layers. Each neuron in
a hidden layer receives inputs from the previous layer, multiplies them by associated
weights, sums them up, and passes the result through an activation function.
Input to Neuron:
Equation (8): z_j = Σ_{i=1}^{n} w_ij x_i + b_j
Where:
● w_ij is the weight connecting input i to neuron j, x_i is the i-th input, and b_j is the bias of neuron j.
Output Layer:
The output layer produces the final output of the neural network. The structure and
activation function of the output layer depend on the task (e.g., classification or regression).
Output: y_k = σ(z_k)
Where:
● z_k is the weighted input to output neuron k and σ is the activation function (e.g., softmax or sigmoid for classification, or a linear function for regression).
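A small NumPy sketch of the forward pass just described (Equation (8) followed by the output activation); the layer sizes and random weights are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative input with 3 features.
x = np.array([0.5, -1.2, 0.3])

# Hidden layer: z_j = sum_i w_ij * x_i + b_j (Equation (8)).
W_hidden = np.random.randn(4, 3)   # 4 hidden neurons, 3 inputs
b_hidden = np.zeros(4)
h = sigmoid(W_hidden @ x + b_hidden)

# Output layer: y_k = sigma(z_k).
W_out = np.random.randn(1, 4)
b_out = np.zeros(1)
y = sigmoid(W_out @ h + b_out)
print("Network output:", y)
```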
A Convolutional Neural Network (CNN) is a deep learning algorithm ideal for image
recognition and processing, consisting of multiple layers including convolutional, pooling,
and fully connected layers.
Convolutional layers in CNN extract features from input images, which are then passed
through pooling layers to reduce spatial dimensions and fully connected layers to predict or
classify the image, retaining key information.
CNNs, trained on large datasets of labeled images, recognize patterns and features
associated with objects or classes, enabling them to classify new images or extract features
for object detection or segmentation.
They are robust for computer vision and can run directly on a raw image without
preprocessing. The strength of a CNN comes from its convolutional layer, which can
recognize sophisticated shapes. With multiple layers, it can recognize handwritten digits and
differentiate human faces. CNNs are used in various fields like image and video recognition,
image inspection, media recreation, recommendation systems, and natural language
processing.
CNN Architecture:
The construction of a CNN involves assembling multiple layers in a sequential, feed-forward fashion. This sequential design allows the CNN to learn hierarchical features. In a CNN, layers are organized with convolutional layers often followed by activation layers; some layers may also include pooling layers for downsampling. The processing in a CNN is akin to the pattern recognition of neurons in the human brain and draws inspiration from the organization of the visual cortex (Figure 6).
Figure 6 CNN layers [32]
Convolutional Layer:
Convolutional layers are the core building blocks of CNNs. These layers use convolutional
operations to scan the input data with learnable filters or kernels. The convolution
operation involves sliding a filter over the input data, element-wise multiplication, and
aggregation to create feature maps.
Activation Function:
After the convolution operation, an activation function (commonly ReLU - Rectified Linear
Unit) is applied elementwise to introduce non-linearity. This helps the network learn
complex patterns and relationships in the data.
Pooling Layer:
Pooling layers downsample the spatial dimensions of the feature maps, reducing the
amount of computation and parameters in the network while retaining important
information. Max pooling is a common technique, which takes the maximum value from a
group of neighbouring pixels.
Flattening:
After several convolutional and pooling layers, the high-level reasoning in the neural
network is often encoded in the spatial dimensions. To feed this information into a fully
connected layer, the data is flattened into a one-dimensional vector.
Fully Connected (Dense) Layer:
The flattened vector is connected to one or more fully connected layers, which perform classification or
regression based on the learned features. (see Figure 7)
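A compact PyTorch sketch of the layer sequence just described (convolution, ReLU, pooling, flattening, fully connected); the input size and channel counts are assumptions for illustration:

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolutional layer
            nn.ReLU(),                                    # activation
            nn.MaxPool2d(2),                              # pooling (downsampling)
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)  # fully connected

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)        # flattening
        return self.classifier(x)

# Example: a batch of four 28x28 grayscale images (e.g., handwritten digits).
logits = SimpleCNN()(torch.randn(4, 1, 28, 28))
print(logits.shape)  # torch.Size([4, 10])
```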
1- Hidden State:
RNNs maintain a hidden state that is updated at each time step. This hidden state serves as a memory, capturing information from previous steps and influencing predictions at the current step.
2- Recurrent Connections:
Allow information to persist across different time steps. The hidden state at a particular
time step is influenced not only by the current input but also by the hidden state from the
previous time step.
3- Input, Output, and Activation Functions:
Similar to other neural networks, RNNs have input and output layers, as well as activation
functions (e.g., tanh, sigmoid) applied to the hidden state and/or output.
4- Backpropagation Through Time (BPTT):
RNNs are trained using an optimization algorithm such as stochastic gradient descent (SGD),
with a variation called Backpropagation Through Time (BPTT). BPTT extends the
backpropagation algorithm to handle sequences by unfolding the network through time.
Challenges and Limitations:
While RNNs are powerful for sequential data, they have some challenges, such as difficulties
in capturing long-term dependencies and the vanishing/exploding gradient problem. As a
result, more advanced architectures like Long Short-Term Memory (LSTM) networks and
Gated Recurrent Units (GRUs) have been introduced to address these issues, (see Figure 8).
The Transformer model converts input text into vectors through an embedding layer, using
learned embeddings instead of traditional methods. The input is represented as one-hot
vectors, multiplied by an embedding matrix to generate input embeddings (X), represented
mathematically as X = E * I.
2- Positional Encoding: Since the Transformer contains no recurrence, positional encodings are added to the input embeddings to give the model information about the order of tokens in the sequence.
3- Encoder: Made of multiple layers, each including:
● Multi-Head Self-Attention: Allows the model to focus on different parts of the input sequence simultaneously, capturing various contextual relationships.
● Position-Wise Feed-Forward Network: Adds non-linearity and depth, applied independently to each position in the sequence.
4- Decoder: Also made of multiple layers, each including masked self-attention, attention over the encoder output, and a position-wise feed-forward network.
5- Self-Attention Mechanism:
The self-attention mechanism enables transformers to weigh the relevance of each word
in a sentence relative to others:
● Query, Key, Value Vectors: Input embeddings are transformed into these vectors.
● Attention Scores: Calculated as the dot product of Query and Key vectors, scaled and
passed through a softmax function to obtain attention weights.
● Weighted Sum: These weights are used to compute a weighted sum of the Value
vectors, producing the self-attention output.
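A minimal NumPy sketch of the scaled dot-product self-attention steps listed above (Query/Key/Value projections, softmax-normalized attention scores, weighted sum of Values); all dimensions here are illustrative:

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, d_k = 5, 16, 8          # illustrative sizes
X = np.random.randn(seq_len, d_model)     # input embeddings

# Learnable projection matrices (random here for illustration).
W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Attention scores: scaled dot product of Queries and Keys, then softmax.
scores = softmax(Q @ K.T / np.sqrt(d_k))

# Weighted sum of the Value vectors gives the self-attention output.
output = scores @ V
print(output.shape)  # (5, 8)
```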
6- Advantages:
Parallelization: Enables faster training and inference by processing entire sequences simultaneously.
Long-Range Dependencies: Captures relationships between distant tokens effectively.
Scalability: Performs well with larger datasets and model sizes, as seen in models like BERT
and GPT-3.
VQA stands for Visual Question Answering, and it refers to a type of task in the field of deep learning and computer vision where a model is trained to answer questions about images.
The goal is to develop models that can understand both the visual content of an image and
the textual information in a question, and then generate accurate textual answers. VQA
involves the integration of computer vision and natural language processing to enable
machines to comprehend and respond to questions about visual content.
Key Components of VQA:
1- Image Input:
The model receives an image as input, usually represented as a grid of pixels. (CNNs) are
commonly used to extract visual features from the image.
2- Text Input (Question):
The model also takes in a textual question related to the content of the image. (RNNs) or
Transformer models are often used to process and understand the textual information.
3- Integration of Visual and Textual Information:
The visual features extracted from the image and the embeddings derived from the question
are combined or fused to create a joint representation. This representation is used to
capture the relationship between the image and the question.
4- Answer Generation:
The joint representation is then used to predict or generate an answer to the given question.
This step often involves the use of a fully connected layer with softmax activation for
multiple- choice questions or a regression layer for open-ended questions.
(Figure 10) shows a visual question answering example.
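A schematic PyTorch sketch of the four components above (image encoder, question encoder, fusion, answer head); the encoders and sizes are placeholders for illustration, not the model actually used in this project:

```python
import torch
import torch.nn as nn

class TinyVQA(nn.Module):
    def __init__(self, vocab_size=1000, num_answers=100):
        super().__init__()
        # 1- Image input: a small CNN stands in for the visual feature extractor.
        self.cnn = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                                 nn.Linear(8 * 4 * 4, 128))
        # 2- Text input: an embedding + GRU stands in for the question encoder.
        self.embed = nn.Embedding(vocab_size, 64)
        self.rnn = nn.GRU(64, 128, batch_first=True)
        # 3- Fusion and 4- Answer generation: joint representation -> answer classes.
        self.classifier = nn.Linear(128 + 128, num_answers)

    def forward(self, image, question_tokens):
        img_feat = self.cnn(image)
        _, q_hidden = self.rnn(self.embed(question_tokens))
        joint = torch.cat([img_feat, q_hidden[-1]], dim=1)  # fuse modalities
        return self.classifier(joint)                        # answer logits

model = TinyVQA()
logits = model(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 100])
```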
Multimodal learning involves integrating and processing multiple types of data (modalities)
to improve the performance of machine learning models. This can include combinations of
text, image, audio, and other data forms. The objective is to leverage the complementary
information from different modalities to enhance understanding and prediction capabilities.
1- Key Concepts
Modalities: Different types of data sources such as text, images, videos, audio, etc.
Fusion Techniques:
● Early Fusion: Combining raw data from different modalities at the input level before
feeding it into the model.
● Late Fusion: Combining the outputs of unimodal models at the decision level.
2- Applications
● Visual Question Answering (VQA): Combining visual and textual data to answer questions about images.
● Speech Recognition: Integrating audio and text data to improve transcription accuracy.
● Healthcare: Combining medical imaging and textual reports for more accurate diagnosis.
3- Challenges
● Data Imbalance: Handling modalities with varying data volumes and qualities.
LoRA (Low-Rank Adaptation)
1- Key Features
Parameter Efficiency: LoRA fine-tuning updates only low-rank matrices added to the original model parameters, significantly reducing the number of parameters that need to be fine-tuned.
Modularity: LoRA modules can be easily integrated into existing models without extensive
modifications.
Memory Footprint: Reduces the memory requirements during training and inference,
making it feasible to fine-tune very large models on limited hardware.
2- Process
1. Insert Low-Rank Matrices: Introduce low-rank matrices into the model's architecture,
typically within the attention layers or feed-forward networks.
2. Freeze Original Parameters: Keep the original pre-trained model parameters fixed.
3. Train Low-Rank Parameters: Train only the newly introduced low-rank matrices on the
task- specific data.
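A hedged sketch of these three steps using Hugging Face's peft library; the base model name and hyperparameters are illustrative, not the exact configuration used in this project:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# 1. Start from a pre-trained model (illustrative choice).
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# 2./3. Insert low-rank matrices into the attention projections; the original
# weights stay frozen and only the LoRA parameters are trained.
lora_config = LoraConfig(
    r=8,                               # rank of the low-rank update matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()     # only a small fraction is trainable
```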
3- Benefits
4- Applications
● NLP Tasks: Commonly used in fine-tuning language models for tasks like text
classification, translation, and summarization.
● Computer Vision: Adapting pre-trained vision models to specific image recognition or
segmentation tasks.
5- Challenges
Fine-Tuning
Fine-tuning involves adapting a pre-trained model to a specific task using additional training
on a smaller, task-specific dataset. This process helps leverage the general knowledge
captured during the pre-training phase while tailoring the model to perform well on a
particular task.
1- Process
1. Pre-Trained Model: Start with a model pre-trained on a large dataset (e.g., BERT, GPT).
2. Task-Specific Data: Gather labeled data relevant to the target task.
3. Training: Further train the model on the task-specific data, usually with a smaller learning
rate to avoid overwriting the pre-trained weights.
4. Evaluation: Assess the model’s performance on the task to ensure it has learned the task-
specific patterns.
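A minimal sketch of steps 1–4 with the Hugging Face Trainer API, assuming a generic text-classification task and a small learning rate; the dataset and model names are placeholders, not those used in this project:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import load_dataset

# 1. Pre-trained model and 2. task-specific labeled data (placeholders).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
dataset = load_dataset("imdb")
tokenized = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True), batched=True)

# 3. Further training with a small learning rate to avoid overwriting
#    the pre-trained weights, then 4. evaluation on held-out data.
args = TrainingArguments(output_dir="out", learning_rate=2e-5, num_train_epochs=1)
trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=tokenized["train"], eval_dataset=tokenized["test"])
trainer.train()
print(trainer.evaluate())
```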
2- Applications
3- Benefits
● Efficiency: Requires less data and computational power than training a model from
scratch.
● Performance: Often yields better performance by combining general and task-specific
knowledge.
● Flexibility: Can be applied to a wide range of tasks with minimal adjustments.
2.2.8 Very Large Language Models (vLLM)
Very Large Language Models (vLLMs) are advanced versions of large language models,
featuring billions to trillions of parameters. They leverage massive datasets and extensive
computational resources to achieve high levels of performance in various natural language
processing tasks.
1- Key Characteristics
2- Examples
3- Applications
● Text Generation: Producing human-like text for chatbots, content creation, and more.
4- Challenges
● Resource Intensive: Requires vast computational resources for training and deployment.
● Ethical Concerns: Potential for misuse and generation of biased or harmful content.
CHAPTER 3: Literature Survey
The following table compares existing symptom-checker applications with the proposed chatbot:

Application | Coverage | Reported Accuracy | Cost
K Health | USA | 85% | Free/Paid
Symptomate | Global | 77% | Free
Isabel – The Symptom Checker | Global | 95% | Free
Symptoma | Global | 90% | Free
Proposed Chatbot | Global | 84.8% | Free
3.1.1 Ada [19]
Ada is a medical AI application that simplifies healthcare journeys and helps people take care of themselves. The patient can ask the application questions about what hurts them or how they feel. When the patient opens the application, there is a page for registration, or for login if they already have an account. On the next page, the patient can ask medical questions by clicking on the start-symptom button; the application then starts asking questions related to the patient's query in order to gather more information. For example, if the patient writes "I have a headache", the application asks how long this has been troubling them, their age, and so on, and the patient must answer. The application then gives a summary of the patient's condition and what they should do, such as taking the appropriate medication or going to the doctor. It also contains a profile page for the patient's personal information, such as weight, height, health background for genetic diseases, basic information (name, date of birth, and sex), and allergy status if the patient has an allergic disease.
This application is not intended to replace the doctor, but it helps the doctor and the patient to detect diseases early. Some scientific research has shown that Ada was able to achieve an accuracy of up to 70.5%.
Pros:
● The application is easy to use and there is a specialization of additional questions only
for Covid-19
Cons:
● The patient is only allowed to ask one question, after which the application asks multiple-choice questions.
(see Figure 11).
Initially, when using it, you are asked to agree to the terms of use, which state that the diagnosis is not final and does not replace a visit to the doctor, and that it should not be used in emergency situations. After that, you are asked to specify the type of diagnosis, either for yourself or for another person; if you choose the diagnosis for another person, the questions will be different. After choosing the type of diagnosis, it asks about the user's identity, whether male or female, then about age, and finally about symptoms and geographical area. When asked about symptoms, Symptomate displays a human-shaped model on which the user can mark the areas that represent their symptoms. After identifying the symptoms, Symptomate asks some questions based on the symptoms, gender, and age. Upon completion, it shows the results of the examination and offers some advice and suggestions to the user. Some scientific research has shown that Symptomate was able to achieve an accuracy of up to 77%.
Pros:
• Simple structure and easy to use
• Available in all countries of the world
• Support 15 languages
• Provides a full report with some tips and advice
Cons:
• low accuracy
• Some cases cannot be diagnosed
● High accuracy.
Cons:
● It does not provide a specific diagnosis for the condition; rather, it provides a general diagnosis, and it does not support CT/MRI scan diagnosis.
Figure 13 shows the user interface of the Isabel symptom checker application.
● The website is easy and fast to use and specializes in additional questions only for
Covid-19 with high accuracy.
● High accuracy
Cons:
● Sometimes the diagnosis process requires a lot of time and a lot of questions, (see
Figure 14).
Results: The results of the experiments show improvements in accuracy on both VQA-RAD
and PathVQA datasets. The model achieves closed and open accuracies of 84.99% and
72.97% for VQA-RAD, and 83.86% and 62.37% for PathVQA. Additionally, the model's
performance is noted to be more significant on open-ended questions compared to yes/no
questions, with improvements observed in both datasets.
Limitations: The proposed PMC-VQA dataset has inherent biases, and the paper
acknowledges the potential presence of biases in the dataset. Biases might arise from the
data collection process, annotation methodology, or underlying distribution of the medical
images and questions. Understanding and addressing these biases is crucial for ensuring fair
and unbiased performance evaluation.
CHAPTER 4: The proposed
architecture
4.1 Machine Learning
In this chapter, we introduce the development process of a chatbot capable of
answering medical questions in both text and visual formats. By combining advanced
language models and custom-built visual modules, the chatbot aims to provide accurate and
context-aware responses to user queries.
The primary aim of this research project is to create a medical question answering chatbot
capable of handling both text and visual queries. The project is segmented into two main
components: text-based medical question answering and visual-based medical question
answering.
Text Medical Question Answering:
Data: For text-based queries, we used the MedQuAD [25] (Medical Question Answering
Dataset) without requiring extensive preprocessing. The dataset was partitioned into 70%
for training and 30% for testing.
Model Integration:
We fine-tuned the LLaMA 2 7B [36] model for the text medical question answering module. The chatbot's architecture involves integrating this module to effectively respond to text-based medical queries.
4.1.1.1 Gemini:[44]
Gemini is an LLM (Large Language Model) developed by Google DeepMind, currently comprising three versions. We are utilizing the Flash version due to its optimized contextual understanding capabilities.
● Name: Idefic_medical_VQA_merged_4bit[39]
● Parameters: 9 billion
● Training Data: Includes a diverse range of medical images and corresponding questions
Model Details:
Architecture:
Model Overview:
● Training Data: Diverse medical images (MRI, CT scans, X-rays) with corresponding
questions
● Performance: Enhanced accuracy and efficiency in interpreting medical imagery
Model Enhancements:
- Optimizer Change:
Original Optimizer: Adam
New Optimizer: AdamW
The switch to AdamW (Adam with Weight Decay) has provided better regularization and
improved convergence, as evidenced by the smoother and more stable training curves
- Attention Mechanism:
Original Mechanism: Multi-Head Attention
New Mechanism: Self-Attention
The integration of self-attention has allowed the model to focus more effectively on relevant
parts of the input, leading to significant improvements in both training and validation
performance.
- Early Stopping:
To prevent overfitting and ensure optimal performance, an early stopping mechanism
was implemented. Training is halted after 70 epochs, as further training resulted in
diminishing returns and unwanted results
Performance Metrics:
The model's performance was evaluated using training and validation loss, as well as top-1
accuracy. The results before and after the improvements are depicted in the provided
plots.
Conclusion:
The proposed architecture, with its enhancements in optimization and attention
mechanisms, demonstrates superior performance in the task of medical VQA. By leveraging
the Idefic_medical_VQA_merged_4bit model fine-tuned with LoRA on the VQARAD_SLAKE
dataset, and integrating self-attention mechanisms, we have significantly improved the
model's ability to understand and interpret medical images. The implementation of early
stopping ensures that the model achieves optimal performance without overfitting.
4.1.3 Datasets
4.1.3.1 RAD-VQA
RAD-VQA[3][4]: The Medical Visual Question Answering (VQA) dataset offers a
comprehensive and specialized resource for advancing the development of VQA systems in
the medical domain, with a specific focus on radiology images. Featuring 3,515 meticulously
curated question-answer pairs and 315 radiology images, including X-rays and CT scans, the
dataset incorporates both open-ended and binary questions generated by clinicians.
Notably, the dataset stands out for its high quality, as questions and answers undergo
manual creation and validation by clinicians, ensuring precision and clinical relevance. The
inclusion of natural language questions, reflecting how clinicians naturally interact with
medical images, enhances
the applicability of the dataset to real-world scenarios. Tailored to the challenges of medical
image interpretation, the dataset serves as a benchmark for evaluating VQA system
performance, encompassing overall accuracy as a key metric. Beyond training VQA models,
this resource holds significant potential for advancing research in medical VQA and related
fields, ultimately contributing to the development of systems that can assist medical
professionals. The dataset's significance lies in its capacity to address the complexity of
medical images and the need for specialized domain knowledge, potentially translating into
improved diagnostic accuracy, workflow efficiency, and patient care in clinical applications,
(see Figure 17).
Figure 22 API
Exception Handling model
We created a model that can determine whether an image is medical or not based on its prior knowledge, and we placed this model before the main models in order to avoid the problem of the user entering a non-medical image, which would cause problems in the main model and thus inaccurate results. This model therefore acts as an exception handler for users who enter incorrect images.
In this method, we relied on prompt engineering with the Google Gemini Pro model, which can differentiate between medical and non-medical images.
Figure 23 API 2
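A hedged sketch of this gating step with the google-generativeai SDK; the prompt wording and the YES/NO decision rule are assumptions for illustration, not the exact prompt used in the project:

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-pro-vision")

def is_medical_image(path: str) -> bool:
    """Ask Gemini whether the uploaded image is a medical scan before
    passing it to the main VQA model."""
    prompt = ("Answer with exactly one word, YES or NO: "
              "is this a medical image such as an MRI, CT, or X-ray scan?")
    response = model.generate_content([prompt, Image.open(path)])
    return response.text.strip().upper().startswith("YES")

# Usage: reject non-medical uploads before they reach the main model.
if not is_medical_image("upload.png"):
    print("Please upload a medical image (MRI, CT, or X-ray).")
```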
4.2 Software
4.2.1 Requirements
2. User Authentication:
● Secure sign-in methods: email/password, Google, and Firebase authentication.
4. Chat Functionality:
● Users can ask medical questions through a chat interface.
5. Medical Image Upload:
● Users can upload MRI and CT images.
6. Diagnosis Information:
● Chatbot provides clear diagnosis based on symptoms and uploaded images.
● User Devices: The application should be compatible with an array of user devices, primarily
focusing on smartphones. It is essential to ensure that the system supports various screen
sizes and resolutions, providing a responsive and user-friendly interface. This adaptability
guarantees a seamless user experience across different devices, enhancing accessibility and
usability.
● AI and NLP Libraries: The application should integrate AI and natural language processing
(NLP) libraries or services to empower the chatbot functionality. This involves implementing
AI algorithms for symptom analysis and accurate interpretation of medical images,
contributing to the overall intelligence of the system.
Image Processing Tools: To facilitate the analysis of MRI and CT scans, the application
should integrate image processing libraries or frameworks. These tools play a pivotal role in
extracting and analysing image metadata, contributing to the accurate interpretation of
medical images. This ensures a comprehensive understanding of users' medical conditions
based on uploaded images.
4.2.1.3 Functional Requirements
Code | Requirement statement | Must/Want | Comments
FR07 | Book an appointment: Users can book an appointment with a doctor. | Must | NA
FR08 | Chat with a doctor: Allow users to send messages to the doctor and receive messages from the doctor in real time. | Must | NA
App Updates
FR09 | Light and dark mode toggle: The application should provide a toggle switch within the user profile settings that allows users to enable or disable light or dark mode. | Must | NA
FR10 | Language selection: The app should provide a language selection feature that allows users to choose between Arabic and English. | Want | NA
Code | Requirement statement | Must/Want | Comments
Performance Requirements
NFR02 | The behaviour of the software must be correct and predictable. | Must | NA
NFR03 | The software must ensure the integrity of the customer account information. | Must | NA
NFR04 | The system must encrypt sensitive data transmitted over the Internet between the server and the app. | Must | NA
Other Requirements
Registration Scenario: when using the application, the system checks for an internet connection. After the check, a first-time user registers by clicking on the "sign up" button and typing a username, e-mail, and password; all of this information is then stored in the database. If false information is entered, an error message appears and the user is required to re-enter the detected false information; this is done by validating the input fields.
Registration is completed right after confirming the confirmation email, which is sent directly to the user's email address. The user can then log in by entering their email and password, and can restore a forgotten password by clicking on the "forgot my password" button. A logout option is available via the "logout" button if a different user wants to log in.
Forget Password Scenario: if the user forgets the password, there is an option called "Forgotten password" through which the user can recover their account by requesting a new password from the system. The system then sends a secure code to verify the user's identity, and with this secure code the user is permitted to enter a new password, (see Figure 19).
Figure 24 Use Case Diagram
Activity Diagram
4.2.3.1 User
The user registers on the application; if it is the first time, he signs up by entering his information and pressing submit. The system then sends a confirmation code, which the user enters as verification; if the correct code is entered, a registration success message is shown, (see Figure 20).
Figure 25 User
4.2.3.2 Doctor
When the user enters his information to access the page, the system confirms a successful login. If he did not enter all the information and pressed the "Forget the password" button, the user changes his password and then logs in again, (see Figure 21).
Figure 26 Doctor
4.2.3.3 Book an appointment with DR Activity
To book an appointment with a doctor, the user opens the app, inputs their details, and chooses a convenient time, then receives a confirmation with all the information they need. The path to personalized healthcare starts with a simple click, (see Figure 22).
Figure 36 UI 1
Figure 37 UI 2
Figure 38 UI 3
Figure 39 UI 4
Figure 40 UI 5
This chapter provides details on the implementation of the proposed VQA model, including the
training process, evaluation metrics, and the results obtained.
This script sets up command-line arguments for training a Visual Question Answering (VQA) model,
specifically for medical images and questions. The `parse_opt` function uses `argparse` to define
parameters like random seed (`--SEED`) for reproducibility, batch sizes (`--BATCH_SIZE` and `--
VAL_BATCH_SIZE`), and the number of output units (`--NUM_OUTPUT_UNITS`). It also includes
settings for the question length (`--MAX_QUESTION_LEN`), image channels (`--IMAGE_CHANNEL`),
learning rate (`--INIT_LEARNING_RATE`), and regularization (`--LAMNDA`). Additional parameters
cover MFB pooling, BERT settings, dropout ratios, and the number of training epochs (`--
NUM_EPOCHS`). These allow easy adjustment of training configurations without changing the code.
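A trimmed sketch of such a `parse_opt` function; the flag names follow those mentioned above, while the default values are illustrative:

```python
import argparse

def parse_opt():
    parser = argparse.ArgumentParser(description="Train a medical VQA model")
    parser.add_argument("--SEED", type=int, default=42,
                        help="random seed for reproducibility")
    parser.add_argument("--BATCH_SIZE", type=int, default=64)
    parser.add_argument("--VAL_BATCH_SIZE", type=int, default=64)
    parser.add_argument("--NUM_OUTPUT_UNITS", type=int, default=500)
    parser.add_argument("--MAX_QUESTION_LEN", type=int, default=20)
    parser.add_argument("--IMAGE_CHANNEL", type=int, default=2048)
    parser.add_argument("--INIT_LEARNING_RATE", type=float, default=1e-4)
    parser.add_argument("--LAMNDA", type=float, default=1e-4,
                        help="regularization weight (flag name as in the original script)")
    parser.add_argument("--NUM_EPOCHS", type=int, default=200)
    return parser.parse_args()

if __name__ == "__main__":
    opt = parse_opt()
    print(opt)
```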
This function processes the question encoding in a neural network to highlight important
parts. First, it applies dropout and reshapes the input. Then, it uses two convolutional layers
and a ReLU activation to generate attention weights. These weights are applied to the
question encoding in a loop, focusing on different parts of the question. Finally, the function
combines these focused features and returns them.
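A possible PyTorch rendering of that question-attention step (dropout, reshape, two 1×1 convolutions with ReLU, softmax attention weights applied in a loop); the layer sizes and number of attention glimpses are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionAttention(nn.Module):
    def __init__(self, hidden_dim=1024, glimpses=2, dropout=0.3):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.conv1 = nn.Conv2d(hidden_dim, 512, kernel_size=1)  # first conv layer
        self.conv2 = nn.Conv2d(512, glimpses, kernel_size=1)    # attention maps
        self.glimpses = glimpses

    def forward(self, q_encoding):            # (batch, seq_len, hidden_dim)
        x = self.dropout(q_encoding)
        x = x.permute(0, 2, 1).unsqueeze(3)   # reshape to (batch, hidden, seq, 1)
        attn = self.conv2(F.relu(self.conv1(x)))       # (batch, glimpses, seq, 1)
        attn = F.softmax(attn.squeeze(3), dim=2)       # weights over the question
        feats = []
        for g in range(self.glimpses):        # apply each glimpse to the encoding
            w = attn[:, g, :].unsqueeze(2)              # (batch, seq, 1)
            feats.append((q_encoding * w).sum(dim=1))   # weighted question feature
        return torch.cat(feats, dim=1)        # combined attended features

out = QuestionAttention()(torch.randn(4, 20, 1024))
print(out.shape)  # torch.Size([4, 2048])
```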
5.2.1 Text-Model
In our implementation, we utilize Gemini, LLaMA-3, and GPT-3-turbo. Here's how the
process unfolds: Gemini is tasked with contextualizing a relevant book related to the topic
of the queried question. Leveraging its prior knowledge, Gemini provides initial answers
based on the contents of the book. Subsequently, LLaMA-3 processes the question and
provides its response. To consolidate these viewpoints, we employ LLaMA-3's
summarization capabilities, integrating both Gemini's insights from the book and LLaMA-3's
direct response into a unified viewpoint. For further details, please refer to the accompanying code implementation.
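A schematic sketch of that flow; the helper functions below are placeholders standing in for the real Gemini and LLaMA-3 calls, not the project's actual API code:

```python
# Hypothetical helpers standing in for the real Gemini and LLaMA-3 calls.
def gemini_answer_from_book(question: str, book_context: str) -> str:
    """Gemini answers the question using the relevant book as context."""
    return "(Gemini answer placeholder)"

def llama3_generate(prompt: str) -> str:
    """LLaMA-3 answers a prompt directly."""
    return "(LLaMA-3 answer placeholder)"

def answer_medical_question(question: str, book_context: str) -> str:
    # Step 1: Gemini answers from the contextualized book.
    gemini_view = gemini_answer_from_book(question, book_context)
    # Step 2: LLaMA-3 answers the same question independently.
    llama_view = llama3_generate(question)
    # Step 3: LLaMA-3 summarizes both viewpoints into one unified answer.
    merge_prompt = (f"Question: {question}\n"
                    f"Answer A (from reference book): {gemini_view}\n"
                    f"Answer B: {llama_view}\n"
                    "Summarize both answers into a single consistent response.")
    return llama3_generate(merge_prompt)
```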
5.2.2 Training and Validation Results for The Proposed Architecture
The model's performance was evaluated using training and validation loss, as well as top-1
accuracy. The results before and after the improvements are depicted in the provided plots.
Before Improvements:
Training loss showed a steady decline, while validation loss started increasing after around 100
epochs, indicating overfitting (see Figure 44).
Figure 46 Training and Validation Loss Before
Enhancement
After Improvements:
Both training and validation losses stabilized significantly earlier, with validation loss
remaining low, demonstrating better generalization (see Figure 45).
After Improvements:
Top-1 accuracy for both training and validation improved consistently, with the validation
accuracy stabilizing at a higher level (see Figure 47).
dineshcr7/MediVQA : The next best performing model was dineshcr7/MediVQA, with an overall
accuracy of 77.55%. It correctly answered 76 questions in total, with 37 correct open-ended and 37
correct closed-ended answers.
dineshcr7/Type_MediVQA : This model had an overall accuracy of 74.49%, with 73 correct answers. It
performed similarly on open-ended (40 correct answers) and closed-ended questions (40 correct
answers).
microsoft/git-base2 : This model had an overall result of 37.76%, with 37 correct answers. It performed
equally on both open-ended (4 correct answers) and closed-ended questions (4 correct answers).
lava-1.5-7b-hf: This model achieved an overall accuracy of 47.96%. It correctly answered 47 questions, performing poorly on open-ended questions (4 correct answers) but better on closed-ended questions (12 correct answers).
Summary of Results:
- Overall Performance: The Idefic_medical_VQA_merged_4bit model significantly
outperformed the other models in terms of overall accuracy. Its performance on both open-
ended and closed-ended questions was superior, indicating its robustness in handling
different types of queries.
- Open-Ended Questions: For open-ended questions, the
Idefic_medical_VQA_merged_4bit and dineshcr7/Type_MediVQA models performed the
best, each correctly answering 40 questions. The dineshcr7/MediVQA model also showed
good performance with 37 correct answers.
- Closed-Ended Questions: The Idefic_medical_VQA_merged_4bit model excelled
in answering closed-ended questions, with 47 correct answers, followed by
dineshcr7/MediVQA with 37 correct answers.
This rigorous human testing process, where each model was evaluated based on its
responses to known questions and images, highlights the effectiveness and accuracy of the
Idefic_medical_VQA_merged_4bit model in real-world applications. The model's ability to
provide correct and contextually accurate answers consistently outperformed the other
tested models, making it a reliable choice for medical visual question answering tasks.
MODEL | TOTAL_TRUE | TOTAL_FALSE | OVERALL | OPEN_T | OPEN_F | CLOSE_T | CLOSE_F
lava-1.5-7b-hf | 47 | 51 | 47.95% | 4 | 45 | 12 | 37
Microsoft/git-base2 | 37 | 61 | 37.75% | 4 | 45 | 4 | 45
qwikQ8/vilt_finetuned_200_med (closed) | 40 | 58 | 40.81% | 12 | 37 | 12 | 37
dineshcr7/Type_MediVQA | 73 | 25 | 74.48% | 40 | 9 | 40 | 9
dineshcr7/MediVQA | 76 | 22 | 77.55% | 37 | 12 | 37 | 12
In the testing phase, we evaluated the performance of our medical report summarization system, built using FastAPI and the fpdf2 library for generating PDF reports. The testing focused on generating a report from the FastAPI service: the generated answers, questions, treatment, and consulting doctor are collected and assembled into a PDF. Below are the details and results of our testing process.
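A condensed sketch of such an endpoint with FastAPI and fpdf2; the field names and report layout are illustrative, not the exact format produced by our system:

```python
from fastapi import FastAPI
from fastapi.responses import Response
from fpdf import FPDF
from pydantic import BaseModel

app = FastAPI()

class Report(BaseModel):
    question: str
    answer: str
    treatment: str
    consulting_doctor: str

@app.post("/report")
def generate_report(report: Report):
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Helvetica", size=12)
    # Collect the generated fields into one PDF document.
    for label, value in [("Question", report.question), ("Answer", report.answer),
                         ("Treatment", report.treatment),
                         ("Consulting Doctor", report.consulting_doctor)]:
        pdf.multi_cell(0, 8, f"{label}: {value}")
    pdf_bytes = bytes(pdf.output())  # fpdf2 returns the document as a bytearray
    return Response(content=pdf_bytes, media_type="application/pdf")
```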
6.1 Conclusion
The development of our AI-powered medical chatbot marks a significant advancement in
the field of healthcare diagnostics and consultation. By integrating cutting-edge AI models like Llama 2, Gemini Pro, and Idefic_medical_VQA_merged_4bit, the chatbot provides
accurate and context-aware responses to both text-based and visual medical queries. Our
system has demonstrated high performance, achieving a human test accuracy of 84.69%
for the VQA model (IDEFIC_9B_Medical) and an overall accuracy of 84.8% for the proposed
architecture on the VQA-RAD dataset. The training loss and validation loss for the proposed
architecture were 0.4157 and 0.3969, respectively, indicating the model's robustness and
reliability.
The chatbot's ability to handle a wide range of medical inquiries, from symptom analysis to
interpreting diagnostic images, significantly enhances its utility. It supports both English
and Arabic, catering to a diverse user base and bridging language barriers in healthcare
access. The user-friendly design ensures that patients can easily interact with the chatbot,
receiving comprehensive medical summaries and reports that aid in making informed
health decisions. Additionally, the application provides a comprehensive summary and
report generation feature, offering patients detailed insights into their health conditions.
This capability, combined with its user-friendly interface, makes the chatbot an invaluable
tool for individuals seeking quick and reliable medical advice. The ability to engage users in natural, conversational dialogue further enhances the accessibility and effectiveness of the chatbot. Additionally, our app is totally free. This project not only addresses the immediate need for accessible medical consultations but also sets a new benchmark for AI applications in healthcare. The integration of advanced machine learning techniques and multimodal data processing enables the chatbot to deliver precise and reliable medical advice, improving patient outcomes and healthcare efficiency.
6.2 Future Work
While our AI-powered medical chatbot has achieved significant milestones, there are
several areas for future enhancement to further improve its capabilities and user
experience:
1. Real-Time Data Integration: Incorporating real-time data from wearable devices and
electronic health records (EHRs) can provide more comprehensive and up-to-date health
insights, enhancing the chatbot's diagnostic accuracy and relevance.
2. Expanded Language Support: Extending the chatbot's language capabilities beyond
English and Arabic to include other widely spoken languages will make it accessible to a
broader global audience, ensuring more inclusive healthcare solutions.
3. Advanced Diagnostic Features: Developing more sophisticated diagnostic algorithms
and integrating additional medical imaging modalities, such as ultrasound and PET scans,
can broaden the chatbot's diagnostic scope and accuracy.
4. Enhanced User Interface: Improving the chatbot's user interface to include voice
recognition capabilities can make interactions more intuitive and engaging for users.
5. Continuous Learning and Updates: Implementing a continuous learning framework that
keeps the chatbot updated with the latest medical research and diagnostic techniques will
ensure it remains at the forefront of healthcare innovation.
7. Patient Education and Support: Enhancing the chatbot to provide educational resources
and support for patients managing chronic conditions can empower users with the
knowledge and tools needed to take proactive control of their health.
By focusing on these areas, we can further enhance the functionality, accuracy, and user
experience of our AI-powered medical chatbot, making it an even more powerful tool
for improving global healthcare access and outcomes.
In response to the growing demand for accessible health solutions, our team developed an AI-powered medical chatbot. This chatbot is distinguished by its ability to handle both text-based medical inquiries and diagnostic images, and it supports both English and Arabic. It was designed to address the limitations of medical resources by integrating advanced models capable of analysing symptoms and interpreting various medical images, including MRI, CT, and X-ray scans. Using state-of-the-art models such as Llama 2 and Idefic_medical_VQA_merged_4bit, the chatbot interprets medical symptoms and diagnostic images. The system's performance, validated through human testing, showed high accuracy, achieving a human test accuracy of 84.69% for the VQA model and an overall accuracy of 84.8% for the proposed architecture on the VQA-RAD dataset. The chatbot is designed to be easy to use and provides comprehensive summaries and medical reports, significantly improving patients' access to preliminary diagnoses and health insights. This innovative tool demonstrates the potential of artificial intelligence to advance healthcare diagnostics by making them more accessible and efficient.
References
[1] Papers with Code. (n.d.). PMC-VQA. Retrieved from https://paperswithcode.com/dataset/pmc-vqa
[2] Hugging Face. (n.d.). PMC-VQA Dataset. Retrieved from https://huggingface.co/datasets/xmcmic/PMC-VQA/tree/main
[3] Papers with Code. (n.d.). Medical Visual Question Answering on VQA-RAD. Retrieved from https://paperswithcode.com/sota/medical-visual-question-answering-on-vqa-rad?p=pmc-vqa-visual-instruction-tuning-for-medical
[4] Hugging Face. (n.d.). VQA-RAD Dataset. Retrieved from https://huggingface.co/datasets/flaviagiammarino/vqa-rad
[5] Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., & Sivic, J. (2015). NetVLAD: CNN architecture for weakly supervised place recognition. arXiv preprint arXiv:1511.07247. Retrieved from https://arxiv.org/abs/1505.00468
[6] Papers with Code. (n.d.). Visual Question Answering. Retrieved from https://paperswithcode.com/task/visual-question-answering
[7] Silva, L. M., & Costa, Y. M. (2021). COVID-19 Detection Algorithm Based on CNN Architecture. Retrieved from https://pubmed.ncbi.nlm.nih.gov/35741521/
[8] Carvalho, T., & Costa, Y. M. (2021). Convolutional Neural Networks in Medical Imaging: A Comprehensive Review. Retrieved from https://pubmed.ncbi.nlm.nih.gov/36253382/
[9] Wikipedia. (n.d.). Machine Learning. Retrieved from https://en.wikipedia.org/wiki/Machine_learning
[10] Google Developers. (n.d.). Machine Learning Crash Course. Retrieved from https://developers.google.com/machine-learning/crash-course/ml-intro
[11] ScienceDirect. (n.d.). Machine Learning. Retrieved from https://www.sciencedirect.com/topics/computer-science/machine-learning
[12] MonkeyLearn. (n.d.). Introduction to Machine Learning. Retrieved from https://monkeylearn.com/machine-learning/
[13] Raj, D. (2018). Convolutional Neural Networks (CNN) Architectures Explained. Medium. Retrieved from https://medium.com/@draj0718/convolutional-neural-networks-cnn-architectures-explained-716fb197b243
[14] Giammarino, F. et al. (2018). The role of convolutional neural networks in medical imaging. Insights into Imaging, 9(4), 611–629. https://doi.org/10.1007/s13244-018-0639-9
[15] ScienceDirect. (n.d.). Convolutional Neural Network. Retrieved from https://www.sciencedirect.com/topics/engineering/convolutional-neural-network
[16] Amidi, S. (n.d.). Cheatsheet – Recurrent Neural Networks. Retrieved from https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks
[17] Springer. (n.d.). Protocol for Recurrent Neural Networks. Retrieved from https://link.springer.com/protocol/10.1007/978-1-0716-3195-9_4
[18] OpenAI. (n.d.). OpenAI Chat. Retrieved from https://chat.openai.com/
[19] Ada. (n.d.). Retrieved from https://ada.com/
[20] K Health. (n.d.). Retrieved from https://khealth.com/
[21] Symptomate. (n.d.). Retrieved from https://symptomate.com/
[22] Isabel Healthcare. (n.d.). Symptom Checker. Retrieved from https://symptomchecker.isabelhealthcare.com/
[23] Symptoma. (n.d.). Retrieved from https://www.symptoma.com/
[24] Amazon Web Services. (n.d.). Machine learning reference architecture. In AWS Well-Architected Framework – Healthcare Industry Lens. Retrieved from https://docs.aws.amazon.com/wellarchitected/latest/healthcare-industry-lens/machine-learning-reference-architecture.html
[25] Ben Abacha, A., & Demner-Fushman, D. (2019). MedQuAD Dataset for Medical Questions.
[26] Papers with Code. (n.d.). MedQuAD. Retrieved from https://paperswithcode.com/dataset/medquad
[27] Jin, Q., Dhingra, B., Liu, Z., Cohen, W., & Lu, X. (2019). PubMedQA Dataset.
[28] Han, T., Adams, L. C., Papaioannou, J.-M., Grundmann, P., Oberhauser, T., Löser, A., Truhn, D., & Bressem, K. K. (2023). MedAlpaca.
[29] Li, Y., Li, Z., Zhang, K., Dan, R., Jiang, S., & Zhang, Y. (2023). ChatDoctor.
[30] Hugging Face. (n.d.). Llama-2-7B. Retrieved from https://huggingface.co/meta-llama/Llama-2-7b
[31] MathWorks. (n.d.). Extreme Learning Machine. Retrieved from https://www.mathworks.com/matlabcentral/fileexchange/93120-extreme-learning-machine?s_tid=FX_rc3_behav
[32] Seseri, R. (2023). AI Atlas #16: Convolutional Neural Networks (CNNs). LinkedIn. Retrieved from https://www.linkedin.com/pulse/ai-atlas-16-convolutional-neural-networks-cnns-rudina-seseri
[33] iCliniq website.
[38] Shah, A. (n.d.). Visual Question Answering. Medium. Retrieved from https://medium.com/@anuj_shah/visual-question-answering-2350eea072df
[39] Shashwath01. (n.d.). IDEFIC 9B Medical VQA 2k [Data set]. Hugging Face. Retrieved from https://huggingface.co/Shashwath01/Idefic_medical_VQA_merged_4bit/blob/main/adapter_model.safetensors
[40] Hugging Face. (n.d.). LoRA: Low-Rank Adaptation [Guide]. Retrieved from https://huggingface.co/docs/peft/main/en/conceptual_guides/lora
[41] Hugging Face. (n.d.). Medical Meadow MedQA [Dataset]. Retrieved from https://huggingface.co/datasets/medalpaca/medical_meadow_medqa
[46] GitHub repository for the text model. https://github.com/abdelrahmanelnabawy/GP/