CHAPTER 1: Introduction
1.1 Overview
In today’s e-commerce landscape, where personalization and convenience are essential, our
project addresses a key challenge: the inability to try on products before purchase. With
growing demand for immersive online shopping, we are developing an Augmented Reality
(AR) Glasses Try-On App. This app offers users a realistic virtual try-on experience, helping
them explore how different frames fit and look on their face.
Using real-time AR technology, the app ensures accurate visualization with face sizing tools
and fitting recommendations for a perfect match. It also provides customized lens options
and product filters for easy selection. By bridging the gap between in-store and online
experiences, our solution enables users to make confident purchasing decisions from the
comfort of their home.
Our goal is to create a more engaging and seamless shopping journey that enhances
customer satisfaction and minimizes returns. This project aims to redefine the eyewear
shopping process, improving usability, customer experiences, and boosting sales
conversions for retailers.
1.2 Motivation
Our AR Glasses Try-On App stems from the need to empower users with an immersive,
convenient way to explore eyewear without the limitations of traditional online shopping.
With real-time AR technology, we aim to provide an interactive experience where users can
accurately visualize frames on their face. A clear, intuitive design allows users to easily
browse frames, customize lenses, and find the best fit for their preferences.
Our goal is to enhance user confidence and satisfaction, enabling informed decisions through
a straightforward shopping journey.
1.3 Objectives
The objective of the proposed AR Glasses Try-On App is to simplify and enhance the online
eyewear shopping experience by providing real-time AR-based try-ons. The app aims to
merge the convenience of online shopping with the accuracy of in-store fitting, offering
users an accurate preview of how glasses will fit their face. With tailored recommendations
for frames and lenses, we aim to increase user confidence, reduce returns, and drive higher
sales conversions. Key objectives include:
● Offering multiple lens types, such as single vision, blue light filtering, and transition
lenses.
● Supporting prescription lens customization, ensuring products meet users’ specific needs.
● Combining frame selection, lens customization, and fitting tools into one streamlined
app.
● Simplifying the shopping journey, making it easier for users to find and purchase their
ideal glasses.
1.4 Scope
The proposed AR Glasses Try-On App will enhance the online shopping experience for
eyewear by offering real-time virtual try-ons using AR-based face tracking and visualization.
The app supports lens customization, allowing users to choose from pre-saved or new
prescriptions.
Designed as a multi-platform solution, the app ensures seamless navigation across web and
Android devices. It also offers filters for size charts, material types, and style preferences,
providing a personalized shopping experience tailored to individual needs.
1.5 Constraints
The successful implementation of the AR Glasses Try-On App depends on several critical constraints that must be considered:
● The success of the AR try-on feature depends on access to accurate 3D models of glasses and frames; limited availability of these models could impact the user experience.
● The app relies on advanced AR and face-tracking technologies, which may not perform optimally on older devices, potentially affecting some users' experiences.
● As the app collects facial data for tracking purposes, it must comply with data privacy regulations (e.g., GDPR) and maintain high security standards to protect personal information.
● The lens selection process requires users to provide accurate prescription details, either by uploading pre-saved prescriptions or inputting new ones; inaccurate data could affect the quality of the final product.
1.6 Document Organization
Now that we have discussed the problem in detail, this section describes the content of the following chapters, which explain our solution to the problem.
Chapter 2: Background. This chapter discusses the technical background of our problem, explaining the methods and technologies we used, and explains each method in depth through the mathematical foundations behind our machine learning models.
Chapter 3: Literature Survey. This chapter captures the essence of researchers' efforts on this topic and discusses many research papers that were proposed to solve this problem. Studying, analysing, and discussing these papers helped us introduce our solution. It also covers the medical chatbot applications available in the market and their pros and cons.
Chapter 4: Proposed Architecture. In this chapter we introduce our system prototype and explain the methodology we used in our experiments: functional and non-functional requirements, the use case diagram, sequence diagram, class diagram, and system architecture.
Chapter 5: Implementation and Testing. This chapter covers the implementation code and the tests conducted for the model.
Chapter 6: Conclusion and future work. The final chapter will summarize our achievements
and the outcomes of our project and possible future directions.
CHAPTER 2: Background
2.1 Machine Learning Overview
Supervised Learning:
Supervised machine learning algorithms are designed to learn by example. The name
“supervised” learning originates from the idea that training this type of algorithm is like
having a teacher supervise the whole process.
When training a supervised learning algorithm, the training data will consist of inputs paired
with the correct outputs. During training, the algorithm will search for patterns in the data
that correlate with the desired outputs. After training, a supervised learning algorithm will take in new, unseen inputs and determine which label they should be classified as, based on the prior training data. The objective of a supervised learning model is to predict the correct label for newly presented input data. In its most basic form, a supervised learning algorithm can be written simply as:
Equation (1): Y=f(X)+ε
Where Y is the predicted output, determined by a mapping function f that assigns a class to an input value X, and ε is the irreducible error. The function used to connect input features to a predicted output is learned by the machine learning model during training, using N labeled training examples (x1, y1), ..., (xN, yN).
Supervised learning can be split into two subcategories: Classification and regression.
Regression:
Objective: In regression, the goal is to predict a continuous output variable Y based on input features X, Equation (2): Y = wX + b
Where:
● Y is the predicted output.
● X is the input data.
● w is the weight.
● b is the y-intercept (bias).
Loss Function: The commonly used loss function for regression is the Mean Squared Error (MSE), Equation (3): MSE = (1/n) Σ_{i=1}^{n} (yi − ŷi)²
Where:
● yi is the true value and ŷi is the predicted value for the i-th data point.
● n is the number of data points.
● β0 is the intercept and β1, ..., βn are the coefficients of the input features.
The goal is to find the values of β0, β1, ..., βn that minimize the MSE.
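A minimal Python sketch (using NumPy, with illustrative data) of how the MSE of Equation (3) is computed for the simple linear model of Equation (2):

```python
import numpy as np

# Illustrative data (assumed for this sketch): one input feature.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

# Candidate linear model y_hat = w * x + b (Equation (2)).
w, b = 2.0, 0.1
y_hat = w * X + b

# Mean Squared Error (Equation (3)).
mse = np.mean((y - y_hat) ** 2)
print(f"MSE = {mse:.4f}")
```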
Classification:
Objective: In classification, the goal is to predict a discrete output variable Y that belongs to a specific class or category based on input features X, Equation (5): Y = f(X)
Loss Function: The commonly used loss function for classification is the Cross-Entropy Loss, Equation (6): L = −(1/n) Σ_{i=1}^{n} [yi log(ŷi) + (1 − yi) log(1 − ŷi)]
Where:
● ŷi is the predicted probability of belonging to class 1 for the i-th data point.
● yi is the true class label (0 or 1).
● β0 is the intercept and β1, ..., βn are the model coefficients.
The goal is to find the values of β0, β1, ..., βn that minimize the Cross-Entropy Loss.
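A corresponding sketch of the cross-entropy loss in Equation (6), again with illustrative labels and predicted probabilities:

```python
import numpy as np

# Illustrative true labels (0 or 1) and predicted class-1 probabilities.
y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.7, 0.6])

# Binary cross-entropy loss (Equation (6)).
loss = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(f"Cross-entropy = {loss:.4f}")
```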
Unsupervised:
Unsupervised learning, also known as unsupervised machine learning, uses machine
learning algorithms to analyze and cluster unlabelled datasets. These algorithms discover
hidden patterns or data groupings without the need for human intervention.
Its ability to discover similarities and differences in information makes it the ideal solution
for exploratory data analysis, cross-selling strategies, customer segmentation, and image
recognition.
Given: a set of N unlabelled inputs {x 1, ..., xN} Goal: learn some intrinsic structure in the
inputs (e.g., groups/clusters) Example: Automatically grouping news stories (Google News)
Clustering:
Objective: In clustering, the goal is to group similar data points into clusters, where points
within the same cluster are more similar to each other than to points in other clusters.
Equation (7): Minimize Σ_{i=1}^{k} Σ_{j=1}^{n} ‖x_j^(i) − c_i‖²
Where:
● x_j^(i) is the j-th data point assigned to cluster i, and c_i is the centroid of cluster i.
● k is the number of clusters and n is the number of points assigned to each cluster.
The objective is to minimize the sum of squared distances between data points and their assigned cluster centroids.
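A minimal sketch of this objective using scikit-learn's KMeans; the data points here are illustrative only:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative 2-D points forming two loose groups.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.9, 5.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Centroids:", kmeans.cluster_centers_)
print("Assignments:", kmeans.labels_)
# inertia_ is the minimized sum of squared distances of Equation (7).
print("Objective (inertia):", kmeans.inertia_)
```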
(Figure 2) shows the Machine Learning classification
Figure 2 Machine Learning Classification
Neural networks, also known as artificial neural networks (ANNs) or simulated neural
networks (SNNs), are a subset of machine learning and the core of deep learning algorithms.
Inspired by the human brain, they consist of node layers with input, hidden, and output
layers. These networks rely on training data to improve their accuracy, making them powerful tools in computer science and artificial intelligence for tasks such as speech and image recognition.
Input Layer:
The input layer is the initial layer of a neural network that takes in the raw input data, and its nodes represent the features of the input, X = [x1, x2, ..., xn]
Where:
● x1, ..., xn are the input features and n is the number of features.
Hidden Layers:
Hidden layers are intermediary layers between the input and output layers. Each neuron in
a hidden layer receives inputs from the previous layer, multiplies them by associated
weights, sums them up, and passes the result through an activation function.
Input to Neuron:
Equation (8): z_j = Σ_{i=1}^{n} w_ij x_i + b_j
Where:
● w_ij is the weight connecting input i to neuron j, x_i is the i-th input, and b_j is the bias of neuron j.
Output Layer:
The output layer produces the final output of the neural network. The structure and
activation function of the output layer depend on the task (e.g., classification or regression).
Output: y_k = σ(z_k)
Where:
● z_k is the weighted input to output neuron k and σ is the activation function (e.g., softmax or sigmoid for classification, or a linear function for regression).
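A small NumPy sketch of the forward pass just described (Equation (8) followed by the output activation); the layer sizes and random weights are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative input with 3 features.
x = np.array([0.5, -1.2, 0.3])

# Hidden layer: z_j = sum_i w_ij * x_i + b_j (Equation (8)).
W_hidden = np.random.randn(4, 3)   # 4 hidden neurons, 3 inputs
b_hidden = np.zeros(4)
h = sigmoid(W_hidden @ x + b_hidden)

# Output layer: y_k = sigma(z_k).
W_out = np.random.randn(1, 4)
b_out = np.zeros(1)
y = sigmoid(W_out @ h + b_out)
print("Network output:", y)
```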
A Convolutional Neural Network (CNN) is a deep learning algorithm ideal for image
recognition and processing, consisting of multiple layers including convolutional, pooling,
and fully connected layers.
Convolutional layers in CNN extract features from input images, which are then passed
through pooling layers to reduce spatial dimensions and fully connected layers to predict or
classify the image, retaining key information.
CNNs, trained on large datasets of labeled images, recognize patterns and features
associated with objects or classes, enabling them to classify new images or extract features
for object detection or segmentation.
They are robust for computer vision and can run directly on a raw image without
preprocessing. The strength of a CNN comes from its convolutional layer, which can
recognize sophisticated shapes. With multiple layers, it can recognize handwritten digits and
differentiate human faces. CNNs are used in various fields like image and video recognition,
image inspection, media recreation, recommendation systems, and natural language
processing.
CNN Architecture:
The construction of a CNN involves assembling multiple layers in a sequential, feed-forward fashion. This sequential design allows the CNN to learn hierarchical features. In a CNN, layers are organized with convolutional layers often followed by activation layers; some layers may also include pooling layers for downsampling. The processing in a CNN is akin to the pattern recognition of neurons in the human brain and draws inspiration from the organization of the visual cortex (Figure 6).
Figure 6 CNN layers [32]
Convolutional Layer:
Convolutional layers are the core building blocks of CNNs. These layers use convolutional
operations to scan the input data with learnable filters or kernels. The convolution
operation involves sliding a filter over the input data, element-wise multiplication, and
aggregation to create feature maps.
Activation Function:
After the convolution operation, an activation function (commonly ReLU - Rectified Linear
Unit) is applied elementwise to introduce non-linearity. This helps the network learn
complex patterns and relationships in the data.
Pooling Layer:
Pooling layers downsample the spatial dimensions of the feature maps, reducing the
amount of computation and parameters in the network while retaining important
information. Max pooling is a common technique, which takes the maximum value from a
group of neighbouring pixels.
Flattening:
After several convolutional and pooling layers, the high-level reasoning in the neural
network is often encoded in the spatial dimensions. To feed this information into a fully
connected layer, the data is flattened into a one-dimensional vector.
Fully Connected (Dense) Layer:
The flattened vector is connected to one or more fully connected layers, which perform classification or
regression based on the learned features. (see Figure 7)
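A compact PyTorch sketch of the layer sequence just described (convolution, ReLU, pooling, flattening, fully connected); the input size and channel counts are assumptions for illustration:

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolutional layer
            nn.ReLU(),                                    # activation
            nn.MaxPool2d(2),                              # pooling (downsampling)
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)  # fully connected

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)        # flattening
        return self.classifier(x)

# Example: a batch of four 28x28 grayscale images (e.g., handwritten digits).
logits = SimpleCNN()(torch.randn(4, 1, 28, 28))
print(logits.shape)  # torch.Size([4, 10])
```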
1- Hidden State:
RNNs maintain a hidden state that is updated at each time step. This hidden state serves as a memory, capturing information from previous steps and influencing predictions at the current step.
2- Recurrent Connections:
Allow information to persist across different time steps. The hidden state at a particular
time step is influenced not only by the current input but also by the hidden state from the
previous time step.
3- Input, Output, and Activation Functions:
Similar to other neural networks, RNNs have input and output layers, as well as activation
functions (e.g., tanh, sigmoid) applied to the hidden state and/or output.
4- Backpropagation Through Time (BPTT):
RNNs are trained using an optimization algorithm such as stochastic gradient descent (SGD),
with a variation called Backpropagation Through Time (BPTT). BPTT extends the
backpropagation algorithm to handle sequences by unfolding the network through time.
Challenges and Limitations:
While RNNs are powerful for sequential data, they have some challenges, such as difficulties
in capturing long-term dependencies and the vanishing/exploding gradient problem. As a
result, more advanced architectures like Long Short-Term Memory (LSTM) networks and
Gated Recurrent Units (GRUs) have been introduced to address these issues, (see Figure 8).
The Transformer model converts input text into vectors through an embedding layer, using
learned embeddings instead of traditional methods. The input is represented as one-hot
vectors, multiplied by an embedding matrix to generate input embeddings (X), represented
mathematically as X = E * I.
2- Positional Encoding: Since the Transformer contains no recurrence, positional encodings are added to the input embeddings to give the model information about the order of tokens in the sequence.
3- Encoder: Made of multiple layers, each including:
● Multi-Head Self-Attention: Allows the model to focus on different parts of the input sequence simultaneously, capturing various contextual relationships.
● Position-Wise Feed-Forward Network: Adds non-linearity and depth, applied independently to each position in the sequence.
4- Decoder: Also made of multiple layers, each including masked self-attention, attention over the encoder output, and a position-wise feed-forward network.
5- Self-Attention Mechanism:
The self-attention mechanism enables transformers to weigh the relevance of each word
in a sentence relative to others:
● Query, Key, Value Vectors: Input embeddings are transformed into these vectors.
● Attention Scores: Calculated as the dot product of Query and Key vectors, scaled and
passed through a softmax function to obtain attention weights.
● Weighted Sum: These weights are used to compute a weighted sum of the Value
vectors, producing the self-attention output.
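A minimal NumPy sketch of the scaled dot-product self-attention steps listed above (Query/Key/Value projections, softmax-normalized attention scores, weighted sum of Values); all dimensions here are illustrative:

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, d_k = 5, 16, 8          # illustrative sizes
X = np.random.randn(seq_len, d_model)     # input embeddings

# Learnable projection matrices (random here for illustration).
W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Attention scores: scaled dot product of Queries and Keys, then softmax.
scores = softmax(Q @ K.T / np.sqrt(d_k))

# Weighted sum of the Value vectors gives the self-attention output.
output = scores @ V
print(output.shape)  # (5, 8)
```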
6- Advantages:
Parallelization: Enables faster training and inference by processing entire sequences simultaneously.
Long-Range Dependencies: Captures relationships between distant tokens effectively.
Scalability: Performs well with larger datasets and model sizes, as seen in models like BERT
and GPT-3.
VQA stands for Visual Question Answering, and it refers to a type of task in the field of deep learning and computer vision where a model is trained to answer questions about images.
The goal is to develop models that can understand both the visual content of an image and
the textual information in a question, and then generate accurate textual answers. VQA
involves the integration of computer vision and natural language processing to enable
machines to comprehend and respond to questions about visual content.
Key Components of VQA:
1- Image Input:
The model receives an image as input, usually represented as a grid of pixels. (CNNs) are
commonly used to extract visual features from the image.
2- Text Input (Question):
The model also takes in a textual question related to the content of the image. (RNNs) or
Transformer models are often used to process and understand the textual information.
3- Integration of Visual and Textual Information:
The visual features extracted from the image and the embeddings derived from the question
are combined or fused to create a joint representation. This representation is used to
capture the relationship between the image and the question.
4- Answer Generation:
The joint representation is then used to predict or generate an answer to the given question.
This step often involves the use of a fully connected layer with softmax activation for
multiple- choice questions or a regression layer for open-ended questions.
(Figure 10) shows a visual question answering example.
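A schematic PyTorch sketch of the four components above (image encoder, question encoder, fusion, answer head); the encoders and sizes are placeholders for illustration, not the model actually used in this project:

```python
import torch
import torch.nn as nn

class TinyVQA(nn.Module):
    def __init__(self, vocab_size=1000, num_answers=100):
        super().__init__()
        # 1- Image input: a small CNN stands in for the visual feature extractor.
        self.cnn = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                                 nn.Linear(8 * 4 * 4, 128))
        # 2- Text input: an embedding + GRU stands in for the question encoder.
        self.embed = nn.Embedding(vocab_size, 64)
        self.rnn = nn.GRU(64, 128, batch_first=True)
        # 3- Fusion and 4- Answer generation: joint representation -> answer classes.
        self.classifier = nn.Linear(128 + 128, num_answers)

    def forward(self, image, question_tokens):
        img_feat = self.cnn(image)
        _, q_hidden = self.rnn(self.embed(question_tokens))
        joint = torch.cat([img_feat, q_hidden[-1]], dim=1)  # fuse modalities
        return self.classifier(joint)                        # answer logits

model = TinyVQA()
logits = model(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 100])
```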
Multimodal learning involves integrating and processing multiple types of data (modalities)
to improve the performance of machine learning models. This can include combinations of
text, image, audio, and other data forms. The objective is to leverage the complementary
information from different modalities to enhance understanding and prediction capabilities.
1- Key Concepts
Modalities: Different types of data sources such as text, images, videos, audio, etc.
Fusion Techniques:
● Early Fusion: Combining raw data from different modalities at the input level before
feeding it into the model.
● Late Fusion: Combining the outputs of unimodal models at the decision level.
2- Applications
● Visual Question Answering (VQA): Combining visual and textual data to answer questions about images.
● Speech Recognition: Integrating audio and text data to improve transcription accuracy.
● Healthcare: Combining medical imaging and textual reports for more accurate diagnosis.
3- Challenges
● Data Imbalance: Handling modalities with varying data volumes and qualities.
LoRA (Low-Rank Adaptation)
1- Key Features
Parameter Efficiency: LoRA fine-tuning updates only low-rank matrices added to the original model parameters, significantly reducing the number of parameters that need to be fine-tuned.
Modularity: LoRA modules can be easily integrated into existing models without extensive
modifications.
Memory Footprint: Reduces the memory requirements during training and inference,
making it feasible to fine-tune very large models on limited hardware.
2- Process
1. Insert Low-Rank Matrices: Introduce low-rank matrices into the model's architecture,
typically within the attention layers or feed-forward networks.
2. Freeze Original Parameters: Keep the original pre-trained model parameters fixed.
3. Train Low-Rank Parameters: Train only the newly introduced low-rank matrices on the
task- specific data.
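A hedged sketch of these three steps using Hugging Face's peft library; the base model name and hyperparameters are illustrative, not the exact configuration used in this project:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# 1. Start from a pre-trained model (illustrative choice).
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# 2./3. Insert low-rank matrices into the attention projections; the original
# weights stay frozen and only the LoRA parameters are trained.
lora_config = LoraConfig(
    r=8,                               # rank of the low-rank update matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()     # only a small fraction is trainable
```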
3- Benefits
4- Applications
● NLP Tasks: Commonly used in fine-tuning language models for tasks like text
classification, translation, and summarization.
● Computer Vision: Adapting pre-trained vision models to specific image recognition or
segmentation tasks.
5- Challenges
Fine-Tuning
Fine-tuning involves adapting a pre-trained model to a specific task using additional training
on a smaller, task-specific dataset. This process helps leverage the general knowledge
captured during the pre-training phase while tailoring the model to perform well on a
particular task.
1- Process
1. Pre-Trained Model: Start with a model pre-trained on a large dataset (e.g., BERT, GPT).
2. Task-Specific Data: Gather labeled data relevant to the target task.
3. Training: Further train the model on the task-specific data, usually with a smaller learning
rate to avoid overwriting the pre-trained weights.
4. Evaluation: Assess the model’s performance on the task to ensure it has learned the task-
specific patterns.
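A minimal sketch of steps 1–4 with the Hugging Face Trainer API, assuming a generic text-classification task and a small learning rate; the dataset and model names are placeholders, not those used in this project:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import load_dataset

# 1. Pre-trained model and 2. task-specific labeled data (placeholders).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
dataset = load_dataset("imdb")
tokenized = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True), batched=True)

# 3. Further training with a small learning rate to avoid overwriting
#    the pre-trained weights, then 4. evaluation on held-out data.
args = TrainingArguments(output_dir="out", learning_rate=2e-5, num_train_epochs=1)
trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=tokenized["train"], eval_dataset=tokenized["test"])
trainer.train()
print(trainer.evaluate())
```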
2- Applications
3- Benefits
● Efficiency: Requires less data and computational power than training a model from
scratch.
● Performance: Often yields better performance by combining general and task-specific
knowledge.
● Flexibility: Can be applied to a wide range of tasks with minimal adjustments.
2.2.8 Very Large Language Models (vLLM)
Very Large Language Models (vLLMs) are advanced versions of large language models,
featuring billions to trillions of parameters. They leverage massive datasets and extensive
computational resources to achieve high levels of performance in various natural language
processing tasks.
1- Key Characteristics
2- Examples
3- Applications
● Text Generation: Producing human-like text for chatbots, content creation, and more.
4- Challenges
● Resource Intensive: Requires vast computational resources for training and deployment.
● Ethical Concerns: Potential for misuse and generation of biased or harmful content.
CHAPTER 3: Literature Survey
The following table compares existing symptom-checker applications with the proposed chatbot:

Application | Coverage | Reported Accuracy | Cost
K Health | USA | 85% | Free/Paid
Symptomate | Global | 77% | Free
Isabel – The Symptom Checker | Global | 95% | Free
Symptoma | Global | 90% | Free
Proposed Chatbot | Global | 84.8% | Free
3.1.1 Ada [19]
Ada is a medical AI application that simplifies healthcare journeys and helps people take care of themselves. The patient can ask the application questions about what hurts them or how they feel. When the patient opens the application, there is a page for registration, or for login if they already have an account. On the next page, the patient can ask medical questions by clicking on the start-symptom button; the application then starts asking questions related to the patient's query in order to gather more information. For example, if the patient writes "I have a headache", the application asks how long this has been troubling them, their age, and so on, and the patient must answer. The application then gives a summary of the patient's condition and what they should do, such as taking the appropriate medication or going to the doctor. It also contains a profile page for the patient's personal information, such as weight, height, health background for genetic diseases, basic information (name, date of birth, and sex), and allergy status if the patient has an allergic disease.
This application is not intended to replace the doctor, but it helps the doctor and the patient to detect diseases early. Some scientific research has shown that Ada was able to achieve an accuracy of up to 70.5%.
Pros:
● The application is easy to use and there is a specialization of additional questions only
for Covid-19
Cons:
● The patient is only allowed to ask one question, after which the application asks multiple-choice questions.
(see Figure 11).
Initially, when using it, you are asked to agree to the terms of use, which state that the diagnosis is not final and does not replace a visit to the doctor, and that it should not be used in emergency situations. After that, you are asked to specify the type of diagnosis, either for yourself or for another person; if you choose the diagnosis for another person, the questions will be different. After choosing the type of diagnosis, it asks about the user's identity, whether male or female, then about age, and finally about symptoms and geographical area. When asked about symptoms, Symptomate displays a human-shaped model on which the user can mark the areas that represent their symptoms. After identifying the symptoms, Symptomate asks some questions based on the symptoms, gender, and age. Upon completion, it shows the results of the examination and offers some advice and suggestions to the user. Some scientific research has shown that Symptomate was able to achieve an accuracy of up to 77%.
Pros:
• Simple structure and easy to use
• Available in all countries of the world
• Support 15 languages
• Provides a full report with some tips and advice
Cons:
• low accuracy
• Some cases cannot be diagnosed
● High accuracy.
Cons:
● It does not provide a specific diagnosis for the condition; rather, it provides a general diagnosis, and it does not support CT/MRI scan diagnosis.
Figure 13 shows the user interface of the Isabel symptom checker application.
● The website is easy and fast to use and specializes in additional questions only for
Covid-19 with high accuracy.
● High accuracy
Cons:
● Sometimes the diagnosis process requires a lot of time and a lot of questions, (see
Figure 14).
Results: The results of the experiments show improvements in accuracy on both VQA-RAD
and PathVQA datasets. The model achieves closed and open accuracies of 84.99% and
72.97% for VQA-RAD, and 83.86% and 62.37% for PathVQA. Additionally, the model's
performance is noted to be more significant on open-ended questions compared to yes/no
questions, with improvements observed in both datasets.
Limitations: The proposed PMC-VQA dataset has inherent biases, and the paper
acknowledges the potential presence of biases in the dataset. Biases might arise from the
data collection process, annotation methodology, or underlying distribution of the medical
images and questions. Understanding and addressing these biases is crucial for ensuring fair
and unbiased performance evaluation.
CHAPTER 4: The proposed
architecture
4.1 Machine Learning
In this chapter, we introduce the development process of a chatbot capable of
answering medical questions in both text and visual formats. By combining advanced
language models and custom-built visual modules, the chatbot aims to provide accurate and
context-aware responses to user queries.
The primary aim of this research project is to create a medical question answering chatbot
capable of handling both text and visual queries. The project is segmented into two main
components: text-based medical question answering and visual-based medical question
answering.
Text Medical Question Answering:
Data: For text-based queries, we used the MedQuAD [25] (Medical Question Answering
Dataset) without requiring extensive preprocessing. The dataset was partitioned into 70%
for training and 30% for testing.
Model Integration:
We fine-tuned the LLaMA 2 7B [36] model for the text medical question answering module. The chatbot's architecture involves integrating this module to effectively respond to text-based medical queries.
4.1.1.1 Gemini:[44]
Gemini is an LLM (Large Language Model) developed by Google DeepMind, currently comprising three versions. We are utilizing the Flash version due to its optimized contextual understanding capabilities.
● Name: Idefic_medical_VQA_merged_4bit[39]
● Parameters: 9 billion
● Training Data: Includes a diverse range of medical images and corresponding questions
Model Details:
Architecture:
Model Overview:
● Training Data: Diverse medical images (MRI, CT scans, X-rays) with corresponding
questions
● Performance: Enhanced accuracy and efficiency in interpreting medical imagery
Model Enhancements:
- Optimizer Change:
Original Optimizer: Adam
New Optimizer: AdamW
The switch to AdamW (Adam with Weight Decay) has provided better regularization and
improved convergence, as evidenced by the smoother and more stable training curves
- Attention Mechanism:
Original Mechanism: Multi-Head Attention
New Mechanism: Self-Attention
The integration of self-attention has allowed the model to focus more effectively on relevant
parts of the input, leading to significant improvements in both training and validation
performance.
- Early Stopping:
To prevent overfitting and ensure optimal performance, an early stopping mechanism
was implemented. Training is halted after 70 epochs, as further training resulted in
diminishing returns and unwanted results
Performance Metrics:
The model's performance was evaluated using training and validation loss, as well as top-1
accuracy. The results before and after the improvements are depicted in the provided
plots.
Conclusion:
The proposed architecture, with its enhancements in optimization and attention
mechanisms, demonstrates superior performance in the task of medical VQA. By leveraging
the Idefic_medical_VQA_merged_4bit model fine-tuned with LoRA on the VQARAD_SLAKE
dataset, and integrating self-attention mechanisms, we have significantly improved the
model's ability to understand and interpret medical images. The implementation of early
stopping ensures that the model achieves optimal performance without overfitting.
4.1.3 Datasets
4.1.3.1 RAD-VQA
RAD-VQA[3][4]: The Medical Visual Question Answering (VQA) dataset offers a
comprehensive and specialized resource for advancing the development of VQA systems in
the medical domain, with a specific focus on radiology images. Featuring 3,515 meticulously
curated question-answer pairs and 315 radiology images, including X-rays and CT scans, the
dataset incorporates both open-ended and binary questions generated by clinicians.
Notably, the dataset stands out for its high quality, as questions and answers undergo
manual creation and validation by clinicians, ensuring precision and clinical relevance. The
inclusion of natural language questions, reflecting how clinicians naturally interact with
medical images, enhances
the applicability of the dataset to real-world scenarios. Tailored to the challenges of medical
image interpretation, the dataset serves as a benchmark for evaluating VQA system
performance, encompassing overall accuracy as a key metric. Beyond training VQA models,
this resource holds significant potential for advancing research in medical VQA and related
fields, ultimately contributing to the development of systems that can assist medical
professionals. The dataset's significance lies in its capacity to address the complexity of
medical images and the need for specialized domain knowledge, potentially translating into
improved diagnostic accuracy, workflow efficiency, and patient care in clinical applications,
(see Figure 17).
Figure 22 API
Exception Handling model
We created a model that can determine whether an image is medical or not based on its prior knowledge, and we placed this model before the main models in order to avoid the problem of the user entering a non-medical image, which would cause problems in the main model and thus inaccurate results. This model therefore acts as an exception handler for users who enter incorrect images.
In this method, we relied on prompt engineering with the Google Gemini Pro model, which can differentiate between medical and non-medical images.
Figure 23 API 2
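A hedged sketch of this gating step with the google-generativeai SDK; the prompt wording and the YES/NO decision rule are assumptions for illustration, not the exact prompt used in the project:

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-pro-vision")

def is_medical_image(path: str) -> bool:
    """Ask Gemini whether the uploaded image is a medical scan before
    passing it to the main VQA model."""
    prompt = ("Answer with exactly one word, YES or NO: "
              "is this a medical image such as an MRI, CT, or X-ray scan?")
    response = model.generate_content([prompt, Image.open(path)])
    return response.text.strip().upper().startswith("YES")

# Usage: reject non-medical uploads before they reach the main model.
if not is_medical_image("upload.png"):
    print("Please upload a medical image (MRI, CT, or X-ray).")
```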
4.2 Software
4.2.1 Requirements
2. User Authentication:
● Secure sign-in methods: email/password, Google, and Firebase authentication.
4. Chat Functionality:
● Users can ask medical questions through a chat interface.
5. Medical Image Upload:
● Users can upload MRI and CT images.
6. Diagnosis Information:
● Chatbot provides clear diagnosis based on symptoms and uploaded images.
● User Devices: The application should be compatible with an array of user devices, primarily
focusing on smartphones. It is essential to ensure that the system supports various screen
sizes and resolutions, providing a responsive and user-friendly interface. This adaptability
guarantees a seamless user experience across different devices, enhancing accessibility and
usability.
● AI and NLP Libraries: The application should integrate AI and natural language processing
(NLP) libraries or services to empower the chatbot functionality. This involves implementing
AI algorithms for symptom analysis and accurate interpretation of medical images,
contributing to the overall intelligence of the system.
Image Processing Tools: To facilitate the analysis of MRI and CT scans, the application
should integrate image processing libraries or frameworks. These tools play a pivotal role in
extracting and analysing image metadata, contributing to the accurate interpretation of
medical images. This ensures a comprehensive understanding of users' medical conditions
based on uploaded images.
4.2.1.3 Functional Requirements
Code | Requirement statement | Must/Want | Comments
FR07 | Book an appointment: Users can book an appointment with a doctor. | Must | NA
FR08 | Chat with a doctor: Allow users to send messages to the doctor and receive messages from the doctor in real time. | Must | NA
App Updates
FR09 | Light and dark mode toggle: The application should provide a toggle switch within the user profile settings that allows users to enable or disable light or dark mode. | Must | NA
FR10 | Language selection: The app should provide a language selection feature that allows users to choose between Arabic and English. | Want | NA
Code | Requirement statement | Must/Want | Comments
Performance Requirements
NFR02 | The behaviour of the software must be correct and predictable. | Must | NA
NFR03 | The software must ensure the integrity of the customer account information. | Must | NA
NFR04 | The system must encrypt sensitive data transmitted over the Internet between the server and the app. | Must | NA
Other Requirements
Registration Scenario: when using the application, the system checks for an internet connection. After the check, a first-time user registers by clicking on the "sign up" button and typing a username, e-mail, and password; all of this information is then stored in the database. If false information is entered, an error message appears and the user is required to re-enter the detected false information; this is done by validating the input fields.
Registration is completed right after confirming the confirmation email, which is sent directly to the user's email address. The user can then log in by entering their email and password, and can restore a forgotten password by clicking on the "forgot my password" button. A logout option is available via the "logout" button if a different user wants to log in.
Forget Password Scenario: if the user forgets the password, there is an option called "Forgotten password" through which the user can recover their account by requesting a new password from the system. The system then sends a secure code to verify the user's identity, and with this secure code the user is permitted to enter a new password, (see Figure 19).
Figure 24 Use Case Diagram
Activity Diagram
4.2.3.1 User
The user registers on the application; if it is the first time, he signs up by entering his information and pressing submit. The system then sends a confirmation code, which the user enters as verification; if the correct code is entered, a registration success message is shown, (see Figure 20).
Figure 25 User
4.2.3.2 Doctor
When the user enters his information to access the page, the system confirms a successful login. If he did not enter all the information and pressed the "Forget the password" button, the user changes his password and then logs in again, (see Figure 21).
Figure 26 Doctor
4.2.3.3 Book an appointment with DR Activity
To book an appointment with a doctor, the user opens the app, inputs their details, and chooses a convenient time, then receives a confirmation with all the information they need. The path to personalized healthcare starts with a simple click, (see Figure 22).
Figure 36 UI 1
Figure 37 UI 2
Figure 38 UI 3
Figure 39 UI 4
Figure 40 UI 5
This chapter provides details on the implementation of the proposed VQA model, including the
training process, evaluation metrics, and the results obtained.
This script sets up command-line arguments for training a Visual Question Answering (VQA) model,
specifically for medical images and questions. The `parse_opt` function uses `argparse` to define
parameters like random seed (`--SEED`) for reproducibility, batch sizes (`--BATCH_SIZE` and `--
VAL_BATCH_SIZE`), and the number of output units (`--NUM_OUTPUT_UNITS`). It also includes
settings for the question length (`--MAX_QUESTION_LEN`), image channels (`--IMAGE_CHANNEL`),
learning rate (`--INIT_LEARNING_RATE`), and regularization (`--LAMNDA`). Additional parameters
cover MFB pooling, BERT settings, dropout ratios, and the number of training epochs (`--
NUM_EPOCHS`). These allow easy adjustment of training configurations without changing the code.
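A trimmed sketch of such a `parse_opt` function; the flag names follow those mentioned above, while the default values are illustrative:

```python
import argparse

def parse_opt():
    parser = argparse.ArgumentParser(description="Train a medical VQA model")
    parser.add_argument("--SEED", type=int, default=42,
                        help="random seed for reproducibility")
    parser.add_argument("--BATCH_SIZE", type=int, default=64)
    parser.add_argument("--VAL_BATCH_SIZE", type=int, default=64)
    parser.add_argument("--NUM_OUTPUT_UNITS", type=int, default=500)
    parser.add_argument("--MAX_QUESTION_LEN", type=int, default=20)
    parser.add_argument("--IMAGE_CHANNEL", type=int, default=2048)
    parser.add_argument("--INIT_LEARNING_RATE", type=float, default=1e-4)
    parser.add_argument("--LAMNDA", type=float, default=1e-4,
                        help="regularization weight (flag name as in the original script)")
    parser.add_argument("--NUM_EPOCHS", type=int, default=200)
    return parser.parse_args()

if __name__ == "__main__":
    opt = parse_opt()
    print(opt)
```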
This function processes the question encoding in a neural network to highlight important
parts. First, it applies dropout and reshapes the input. Then, it uses two convolutional layers
and a ReLU activation to generate attention weights. These weights are applied to the
question encoding in a loop, focusing on different parts of the question. Finally, the function
combines these focused features and returns them.
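A possible PyTorch rendering of that question-attention step (dropout, reshape, two 1×1 convolutions with ReLU, softmax attention weights applied in a loop); the layer sizes and number of attention glimpses are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionAttention(nn.Module):
    def __init__(self, hidden_dim=1024, glimpses=2, dropout=0.3):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.conv1 = nn.Conv2d(hidden_dim, 512, kernel_size=1)  # first conv layer
        self.conv2 = nn.Conv2d(512, glimpses, kernel_size=1)    # attention maps
        self.glimpses = glimpses

    def forward(self, q_encoding):            # (batch, seq_len, hidden_dim)
        x = self.dropout(q_encoding)
        x = x.permute(0, 2, 1).unsqueeze(3)   # reshape to (batch, hidden, seq, 1)
        attn = self.conv2(F.relu(self.conv1(x)))       # (batch, glimpses, seq, 1)
        attn = F.softmax(attn.squeeze(3), dim=2)       # weights over the question
        feats = []
        for g in range(self.glimpses):        # apply each glimpse to the encoding
            w = attn[:, g, :].unsqueeze(2)              # (batch, seq, 1)
            feats.append((q_encoding * w).sum(dim=1))   # weighted question feature
        return torch.cat(feats, dim=1)        # combined attended features

out = QuestionAttention()(torch.randn(4, 20, 1024))
print(out.shape)  # torch.Size([4, 2048])
```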
5.2.1 Text-Model
In our implementation, we utilize Gemini, LLaMA-3, and GPT-3-turbo. Here's how the
process unfolds: Gemini is tasked with contextualizing a relevant book related to the topic
of the queried question. Leveraging its prior knowledge, Gemini provides initial answers
based on the contents of the book. Subsequently, LLaMA-3 processes the question and
provides its response. To consolidate these viewpoints, we employ LLaMA-3's
summarization capabilities, integrating both Gemini's insights from the book and LLaMA-3's
direct response into a unified viewpoint. For further details, please refer to the accompanying code implementation.
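A schematic sketch of that flow; the helper functions below are placeholders standing in for the real Gemini and LLaMA-3 calls, not the project's actual API code:

```python
# Hypothetical helpers standing in for the real Gemini and LLaMA-3 calls.
def gemini_answer_from_book(question: str, book_context: str) -> str:
    """Gemini answers the question using the relevant book as context."""
    return "(Gemini answer placeholder)"

def llama3_generate(prompt: str) -> str:
    """LLaMA-3 answers a prompt directly."""
    return "(LLaMA-3 answer placeholder)"

def answer_medical_question(question: str, book_context: str) -> str:
    # Step 1: Gemini answers from the contextualized book.
    gemini_view = gemini_answer_from_book(question, book_context)
    # Step 2: LLaMA-3 answers the same question independently.
    llama_view = llama3_generate(question)
    # Step 3: LLaMA-3 summarizes both viewpoints into one unified answer.
    merge_prompt = (f"Question: {question}\n"
                    f"Answer A (from reference book): {gemini_view}\n"
                    f"Answer B: {llama_view}\n"
                    "Summarize both answers into a single consistent response.")
    return llama3_generate(merge_prompt)
```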
5.2.2 Training and Validation Results for The Proposed Architecture
The model's performance was evaluated using training and validation loss, as well as top-1
accuracy. The results before and after the improvements are depicted in the provided plots.
Before Improvements:
Training loss showed a steady decline, while validation loss started increasing after around 100
epochs, indicating overfitting (see Figure 44).
Figure 46 Training and Validation Loss Before
Enhancement
After Improvements:
Both training and validation losses stabilized significantly earlier, with validation loss
remaining low, demonstrating better generalization (see Figure 45).
After Improvements:
Top-1 accuracy for both training and validation improved consistently, with the validation
accuracy stabilizing at a higher level (see Figure 47).
dineshcr7/MediVQA : The next best performing model was dineshcr7/MediVQA, with an overall
accuracy of 77.55%. It correctly answered 76 questions in total, with 37 correct open-ended and 37
correct closed-ended answers.
dineshcr7/Type_MediVQA : This model had an overall accuracy of 74.49%, with 73 correct answers. It
performed similarly on open-ended (40 correct answers) and closed-ended questions (40 correct
answers).
microsoft/git-base2 : This model had an overall result of 37.76%, with 37 correct answers. It performed
equally on both open-ended (4 correct answers) and closed-ended questions (4 correct answers).
lava-1.5-7b-hf: This model achieved an overall accuracy of 47.96%. It correctly answered 47 questions, performing poorly on open-ended questions (4 correct answers) but better on closed-ended questions (12 correct answers).
Summary of Results:
- Overall Performance: The Idefic_medical_VQA_merged_4bit model significantly
outperformed the other models in terms of overall accuracy. Its performance on both open-
ended and closed-ended questions was superior, indicating its robustness in handling
different types of queries.
- Open-Ended Questions: For open-ended questions, the
Idefic_medical_VQA_merged_4bit and dineshcr7/Type_MediVQA models performed the
best, each correctly answering 40 questions. The dineshcr7/MediVQA model also showed
good performance with 37 correct answers.
- Closed-Ended Questions: The Idefic_medical_VQA_merged_4bit model excelled
in answering closed-ended questions, with 47 correct answers, followed by
dineshcr7/MediVQA with 37 correct answers.
This rigorous human testing process, where each model was evaluated based on its
responses to known questions and images, highlights the effectiveness and accuracy of the
Idefic_medical_VQA_merged_4bit model in real-world applications. The model's ability to
provide correct and contextually accurate answers consistently outperformed the other
tested models, making it a reliable choice for medical visual question answering tasks.
MODEL | TOTAL_TRUE | TOTAL_FALSE | OVERALL | OPEN_T | OPEN_F | CLOSE_T | CLOSE_F
lava-1.5-7b-hf | 47 | 51 | 47.95% | 4 | 45 | 12 | 37
Microsoft/git-base2 | 37 | 61 | 37.75% | 4 | 45 | 4 | 45
qwikQ8/vilt_finetuned_200_med (closed) | 40 | 58 | 40.81% | 12 | 37 | 12 | 37
dineshcr7/Type_MediVQA | 73 | 25 | 74.48% | 40 | 9 | 40 | 9
dineshcr7/MediVQA | 76 | 22 | 77.55% | 37 | 12 | 37 | 12
In the testing phase, we evaluated the performance of our medical report summarization system, built using FastAPI and the fpdf2 library for generating PDF reports. The testing focused on generating a report from the FastAPI service: the generated answers, questions, treatment, and consulting doctor are collected and assembled into a PDF. Below are the details and results of our testing process.
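A condensed sketch of such an endpoint with FastAPI and fpdf2; the field names and report layout are illustrative, not the exact format produced by our system:

```python
from fastapi import FastAPI
from fastapi.responses import Response
from fpdf import FPDF
from pydantic import BaseModel

app = FastAPI()

class Report(BaseModel):
    question: str
    answer: str
    treatment: str
    consulting_doctor: str

@app.post("/report")
def generate_report(report: Report):
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Helvetica", size=12)
    # Collect the generated fields into one PDF document.
    for label, value in [("Question", report.question), ("Answer", report.answer),
                         ("Treatment", report.treatment),
                         ("Consulting Doctor", report.consulting_doctor)]:
        pdf.multi_cell(0, 8, f"{label}: {value}")
    pdf_bytes = bytes(pdf.output())  # fpdf2 returns the document as a bytearray
    return Response(content=pdf_bytes, media_type="application/pdf")
```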
6.1 Conclusion
The development of our AI-powered medical chatbot marks a significant advancement in
the field of healthcare diagnostics and consultation. By integrating cutting-edge AI models like Llama 2, Gemini Pro, and Idefic_medical_VQA_merged_4bit, the chatbot provides
accurate and context-aware responses to both text-based and visual medical queries. Our
system has demonstrated high performance, achieving a human test accuracy of 84.69%
for the VQA model (IDEFIC_9B_Medical) and an overall accuracy of 84.8% for the proposed
architecture on the VQA-RAD dataset. The training loss and validation loss for the proposed
architecture were 0.4157 and 0.3969, respectively, indicating the model's robustness and
reliability.
The chatbot's ability to handle a wide range of medical inquiries, from symptom analysis to
interpreting diagnostic images, significantly enhances its utility. It supports both English
and Arabic, catering to a diverse user base and bridging language barriers in healthcare
access. The user-friendly design ensures that patients can easily interact with the chatbot,
receiving comprehensive medical summaries and reports that aid in making informed
health decisions. Additionally, the application provides a comprehensive summary and
report generation feature, offering patients detailed insights into their health conditions.
This capability, combined with its user-friendly interface, makes the chatbot an invaluable
tool for individuals seeking quick and reliable medical advice. The ability to engage users in natural, conversational dialogue further enhances the accessibility and effectiveness of the chatbot. Additionally, our app is totally free. This project not only addresses the immediate need for accessible medical consultations but also sets a new benchmark for AI applications in healthcare. The integration of advanced machine learning techniques and multimodal data processing enables the chatbot to deliver precise and reliable medical advice, improving patient outcomes and healthcare efficiency.
6.2 Future Work
While our AI-powered medical chatbot has achieved significant milestones, there are
several areas for future enhancement to further improve its capabilities and user
experience:
1. Real-Time Data Integration: Incorporating real-time data from wearable devices and
electronic health records (EHRs) can provide more comprehensive and up-to-date health
insights, enhancing the chatbot's diagnostic accuracy and relevance.
2. Expanded Language Support: Extending the chatbot's language capabilities beyond
English and Arabic to include other widely spoken languages will make it accessible to a
broader global audience, ensuring more inclusive healthcare solutions.
3. Advanced Diagnostic Features: Developing more sophisticated diagnostic algorithms
and integrating additional medical imaging modalities, such as ultrasound and PET scans,
can broaden the chatbot's diagnostic scope and accuracy.
4. Enhanced User Interface: Improving the chatbot's user interface to include voice
recognition capabilities can make interactions more intuitive and engaging for users.
5. Continuous Learning and Updates: Implementing a continuous learning framework that
keeps the chatbot updated with the latest medical research and diagnostic techniques will
ensure it remains at the forefront of healthcare innovation.
7. Patient Education and Support: Enhancing the chatbot to provide educational resources
and support for patients managing chronic conditions can empower users with the
knowledge and tools needed to take proactive control of their health.
By focusing on these areas, we can further enhance the functionality, accuracy, and user
experience of our AI-powered medical chatbot, making it an even more powerful tool
for improving global healthcare access and outcomes.
In response to the growing demand for accessible health solutions, our team developed an AI-powered medical chatbot. This chatbot is distinguished by its ability to handle both text-based medical inquiries and diagnostic images, and it supports both English and Arabic. It was designed to address the limitations of medical resources by integrating advanced models capable of analysing symptoms and interpreting various medical images, including MRI, CT, and X-ray scans. Using state-of-the-art models such as Llama 2 and Idefic_medical_VQA_merged_4bit, the chatbot interprets medical symptoms and diagnostic images. The system's performance, validated through human testing, showed high accuracy, achieving a human test accuracy of 84.69% for the VQA model and an overall accuracy of 84.8% for the proposed architecture on the VQA-RAD dataset. The chatbot is designed to be easy to use and provides comprehensive summaries and medical reports, significantly improving patients' access to preliminary diagnoses and health insights. This innovative tool demonstrates the potential of artificial intelligence to advance healthcare diagnostics by making them more accessible and efficient.
References
[1] Papers with Code. (n.d.). PMC-VQA. Retrieved from https://paperswithcode.com/dataset/pmc-vqa
[2] Hugging Face. (n.d.). PMC-VQA Dataset. Retrieved from https://huggingface.co/datasets/xmcmic/PMC-VQA/tree/main
[3] Papers with Code. (n.d.). Medical Visual Question Answering on VQA-RAD. Retrieved from https://paperswithcode.com/sota/medical-visual-question-answering-on-vqa-rad?p=pmc-vqa-visual-instruction-tuning-for-medical
[4] Hugging Face. (n.d.). VQA-RAD Dataset. Retrieved from https://huggingface.co/datasets/flaviagiammarino/vqa-rad
[5] Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., & Sivic, J. (2015). NetVLAD: CNN architecture for weakly supervised place recognition. arXiv preprint arXiv:1511.07247. Retrieved from https://arxiv.org/abs/1505.00468
[6] Papers with Code. (n.d.). Visual Question Answering. Retrieved from https://paperswithcode.com/task/visual-question-answering
[7] Silva, L. M., & Costa, Y. M. (2021). COVID-19 Detection Algorithm Based on CNN Architecture. Retrieved from https://pubmed.ncbi.nlm.nih.gov/35741521/
[8] Carvalho, T., & Costa, Y. M. (2021). Convolutional Neural Networks in Medical Imaging: A Comprehensive Review. Retrieved from https://pubmed.ncbi.nlm.nih.gov/36253382/
[9] Wikipedia. (n.d.). Machine Learning. Retrieved from https://en.wikipedia.org/wiki/Machine_learning
[10] Google Developers. (n.d.). Machine Learning Crash Course. Retrieved from https://developers.google.com/machine-learning/crash-course/ml-intro
[11] ScienceDirect. (n.d.). Machine Learning. Retrieved from https://www.sciencedirect.com/topics/computer-science/machine-learning
[12] MonkeyLearn. (n.d.). Introduction to Machine Learning. Retrieved from https://monkeylearn.com/machine-learning/
[13] Raj, D. (2018). Convolutional Neural Networks (CNN) Architectures Explained. Medium. Retrieved from https://medium.com/@draj0718/convolutional-neural-networks-cnn-architectures-explained-716fb197b243
[14] Giammarino, F. et al. (2018). The role of convolutional neural networks in medical imaging. Insights into Imaging, 9(4), 611–629. https://doi.org/10.1007/s13244-018-0639-9
[15] ScienceDirect. (n.d.). Convolutional Neural Network. Retrieved from https://www.sciencedirect.com/topics/engineering/convolutional-neural-network
[16] Amidi, S. (n.d.). Cheatsheet – Recurrent Neural Networks. Retrieved from https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks
[17] Springer. (n.d.). Protocol for Recurrent Neural Networks. Retrieved from https://link.springer.com/protocol/10.1007/978-1-0716-3195-9_4
[18] OpenAI. (n.d.). OpenAI Chat. Retrieved from https://chat.openai.com/
[19] Ada. (n.d.). Retrieved from https://ada.com/
[20] K Health. (n.d.). Retrieved from https://khealth.com/
[21] Symptomate. (n.d.). Retrieved from https://symptomate.com/
[22] Isabel Healthcare. (n.d.). Symptom Checker. Retrieved from https://symptomchecker.isabelhealthcare.com/
[23] Symptoma. (n.d.). Retrieved from https://www.symptoma.com/
[24] Amazon Web Services. (n.d.). Machine learning reference architecture. In AWS Well-Architected Framework – Healthcare Industry Lens. Retrieved from https://docs.aws.amazon.com/wellarchitected/latest/healthcare-industry-lens/machine-learning-reference-architecture.html
[25] Ben Abacha, A., & Demner-Fushman, D. (2019). MedQuAD Dataset for Medical Questions.
[26] Papers with Code. (n.d.). MedQuAD. Retrieved from https://paperswithcode.com/dataset/medquad
[27] Jin, Q., Dhingra, B., Liu, Z., Cohen, W., & Lu, X. (2019). PubMedQA Dataset.
[28] Han, T., Adams, L. C., Papaioannou, J.-M., Grundmann, P., Oberhauser, T., Löser, A., Truhn, D., & Bressem, K. K. (2023). MedAlpaca.
[29] Li, Y., Li, Z., Zhang, K., Dan, R., Jiang, S., & Zhang, Y. (2023). ChatDoctor.
[30] Hugging Face. (n.d.). Llama-2-7B. Retrieved from https://huggingface.co/meta-llama/Llama-2-7b
[31] MathWorks. (n.d.). Extreme Learning Machine. Retrieved from https://www.mathworks.com/matlabcentral/fileexchange/93120-extreme-learning-machine?s_tid=FX_rc3_behav
[32] Seseri, R. (2023). AI Atlas #16: Convolutional Neural Networks (CNNs). LinkedIn. Retrieved from https://www.linkedin.com/pulse/ai-atlas-16-convolutional-neural-networks-cnns-rudina-seseri
[33] iCliniq website.
[38] Shah, A. (n.d.). Visual Question Answering. Medium. Retrieved from https://medium.com/@anuj_shah/visual-question-answering-2350eea072df
[39] Shashwath01. (n.d.). IDEFIC 9B Medical VQA 2k [Data set]. Hugging Face. Retrieved from https://huggingface.co/Shashwath01/Idefic_medical_VQA_merged_4bit/blob/main/adapter_model.safetensors
[40] Hugging Face. (n.d.). LoRA: Low-Rank Adaptation [Guide]. Retrieved from https://huggingface.co/docs/peft/main/en/conceptual_guides/lora
[41] Hugging Face. (n.d.). Medical Meadow MedQA [Dataset]. Retrieved from https://huggingface.co/datasets/medalpaca/medical_meadow_medqa
[46] GitHub repository for the text model. https://github.com/abdelrahmanelnabawy/GP/