Major Project Report(Edited)
By
ABHILASH SINGH
(205121004)
1
BONAFIDE CERTIFICATE
This is to certify that the project titled “MedGraph Navigator” is a bonafide record of the
work done by
in partial fulfilment of the requirements for the award of the degree of Master of Computer
Applications from National Institute of Technology, Tiruchirappalli, during the academic
year 2023-2024 (6th Semester – CA750 Project Work).
2
ABSTRACT
The project, named ‘MedGraph Navigator’, is based on medical entities (chemicals and
diseases) and the relationships among them. Understanding these entities and their
relations is important for anyone working or studying in the medical domain.
In the project, data about medical entities and relations is stored in a knowledge graph
(graph database), and a web application has been developed that takes medical entities as input
from the user and displays results in Q&A as well as graph format by querying the graph
database.
The objective of the project is to make it easier to study and understand medical entities and
their relationships with each other. To achieve this objective, the project starts with a dataset
named ‘BioRED’ which contains medical documents and annotations and uses it to extract
data, convert it and store in graph database (OrientDB) to make it easy to answer questions or
b) Model Training
3. Application Development
The outcome of the project is a web application that takes a medical entity
as input and provides the result in Q&A format and graph format by querying an OrientDB
database (developed as part of the project) that stores data in a graph model.
3
ACKNOWLEDGEMENT
Every project, big or small, is successful largely due to the effort of several wonderful people
who have always given their valuable advice or lent a helping hand. I sincerely appreciate the
inspiration, support, and guidance of all those people who have been instrumental in making
and for arranging the project in a good schedule, and who assisted me in completing the project.
I would like to thank her for duly evaluating my progress and evaluating me.
thankful for its constant support, care, guidance, and regular interaction throughout my project.
I express my sincere thanks to all the faculty members, and scholars of NIT Trichy for their
4
TABLE OF CONTENTS
1. BONAFIDE CERTIFICATE 2
2. ABSTRACT 3
3. ACKNOWLEDGEMENTS 4
4. TABLE OF CONTENTS 5
5. LIST OF FIGURES 6
6. CHAPTERS
a) CHAPTER 1: INTRODUCTION 7
c) CHAPTER 3: PLATFORM 10
e) CHAPTER 5: METHODOLOGY 14
7. APPENDIX 38
8. REFERENCES 39
5
LIST OF FIGURES
2. Project Workflow 14
3. Relation Distribution 15
4. Transformer Model 17
5. BERT Architecture 17
6. System Workflow 1 20
7. System Workflow 2 21
6
CHAPTER 1
INTRODUCTION
The project also includes forming groups of chemicals and diseases based on
commonalities. The grouping allows the knowledge to be represented as a multi-level
knowledge graph, which can be queried to answer questions that require more than one level of
traversal.
MedGraph Navigator presents the medical entities and relations in formats such as questions and
answers and a knowledge graph, to make it easier for users to understand how medical entities
are related to each other.
As part of dataset extraction, another objective is to develop a deep learning model that can
predict the relation between two medical entities in a given text so that the given entities and
predicted relations can be used to create more data to be stored in the knowledge base.
7
Use Case
The project is primarily designed for people who wish to study medical entities.
These may include medical students, doctors and medical research scholars. The project
is not intended for the general public.
8
CHAPTER 2
PROBLEM STATEMENT
The problem statement is to ‘Develop a web application that presents medical entities and
the relations among them in an easy-to-understand manner by developing and querying a
knowledge graph’.
A knowledge graph is a form of knowledge representation in which entities are stored as vertices
and the relations among them are stored as edges.
Although a knowledge graph of medical entities and relations will store and present the medical
data in an efficient manner, the knowledge graph would be very large and complex thus making
it difficult to visualize and understand.
The project aims at solving this issue in two ways –
1. Answering some basic questions related to a medical entity by querying a large
knowledge graph.
2. Visualizing a smaller knowledge graph that presents the relation of given entity with
others.
The project includes two major problem statements –
1. Developing a knowledge graph of medical entities and relations.
2. Developing a web application to interact with the user.
9
CHAPTER 3
PLATFORM
Hardware Requirements
Software Requirements
10
CHAPTER 4
REQUIREMENT GATHERING/ANALYSIS
Overview
The project is a medical application based on a knowledge graph of medical entities and
relations among them. The application developed as part of the project is required to take a
medical entity (chemical/disease) and answer some questions related to it by querying the
knowledge graph. The application is required to present the entity and relation information in
a knowledge graph like visualization.
Stakeholder Analysis
The stakeholders of the application include medical students, medical research scientists,
doctors etc. The information presented by the application is not to be used as medical advice;
the application is intended only for medical study and is not to be used by
anyone seeking medical assistance.
Other than the above-mentioned stakeholders, the developer of the application is also a
stakeholder and is responsible for ensuring that the application fulfils its functional as well as
non-functional requirements.
Functional Requirements
11
Data Requirements
12
HTML, CSS and JavaScript are used to develop the front-end of the application. Python and
Flask are used to develop the server side of the application. OrientDB is used as the knowledge
base for the application.
Safety Requirements
Since the application belongs to the medical domain, it is essential to ensure the safety of the
data stored in the knowledge base. The data must not be altered, as this may
lead to the application presenting wrong information to the user.
Security measures such as password-based authentication should be used to protect the integrity of
the data.
13
CHAPTER 5
METHODOLOGY
Project Workflow
Data was extracted from BioRED into .csv format using a Python script. The data is required to
construct the knowledge base for the application. The BioRED dataset is provided in JSON and XML formats.
It has annotations for medical entities and the relations among them.
Gene
Variant
Disease
Chemical
Organism
Cell Line
So far, the extracted dataset contains 814 records, each of which describes a relationship between a
chemical (Entity1) and a disease (Entity2).
Fig – 3: Relation Distribution (bar chart over the relations Treatment, Induce and Association; bar values 408, 292 and 114)
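A distribution of this kind can be computed directly from the extracted records. The following is a minimal sketch, assuming the .csv file has the columns Entity1, Entity2 and Relation; the sample rows and the function name relation_distribution are made up for illustration and are not taken from the project.

```python
import csv
import io
from collections import Counter

# Illustrative rows only; the real extracted file holds 814 records.
SAMPLE_CSV = """Entity1,Entity2,Relation
aspirin,headache,Treatment
aspirin,gastric ulcer,Induce
metformin,type 2 diabetes,Treatment
lead,anemia,Association
"""


def relation_distribution(csv_text):
    """Count how many records fall under each relation label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["Relation"] for row in reader)


distribution = relation_distribution(SAMPLE_CSV)
for relation, count in distribution.most_common():
    print(f"{relation}: {count}")
```

Run on the full dataset, the counts for the three relations should add up to the total number of records.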
15
Phase – 1(b): Model Training
A BERT model was trained for two classes – Positive Correlation and Negative Correlation. The
objective of training the BERT model is to check whether a relation can be established between two
medical entities in a given sentence. A highly accurate model would allow more
entity pairs to be generated by predicting the relation between them. BERT was trained on 584 records.
Google created the deep learning model BERT (Bidirectional Encoder Representations from
Transformers) for use in natural language processing (NLP) applications. The following are
the main details of BERT:
Architecture:
BERT is based on the transformer architecture. The transformer model was introduced in the paper
‘Attention Is All You Need’. Transformer models rely entirely on self-attention
mechanisms: they use self-attention to weigh the significance of different words in a
sentence, allowing them to capture dependencies across the entire sequence simultaneously.
The original transformer consists of encoder and decoder stacks, where the encoder processes input
sequences and the decoder generates outputs; BERT uses only the encoder stack. Transformers have been pivotal in various language-related
tasks like translation, summarization, question answering, and more.
Most conventional language models process text in only one direction (left to right or right
to left). BERT, on the other hand, processes text in both directions.
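To make the idea of self-attention concrete, the following is a simplified sketch of scaled dot-product self-attention in plain Python. It is an illustration only: real transformers (including BERT) first apply learned query, key and value projections and use several attention heads, which are omitted here, and the toy token vectors are made up.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections.

    Every token attends to every token in the sequence, on both its left
    and its right, which is what makes the encoder bidirectional.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Each output is a weighted mix of all token vectors.
        mixed = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy "token" vectors of dimension 2.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Each row of weights shows how strongly one token attends to every token in the sequence, including itself.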
16
Fig – 4: Transformer Architecture
17
Variants: BERT has many variants. Some of these variants are -
BERT-Base: 110 million parameters, 768 hidden units, and 12 layers.
BERT-Large: 340 million parameters, 1024 hidden units, and 24 layers.
DistilBERT, RoBERTa, and ALBERT: Extensions or optimizations of BERT's
architecture for increased effectiveness or efficiency.
Applications: BERT is used for the following tasks -
Text classification
Named Entity Recognition
Question Answering
Translation
Classification Report -
Confusion Matrix –
Accuracy –
Conclusion –
The accuracy obtained is not sufficient for generating predictions on critical data such as that of the medical
domain. For this reason, the application does not use any entity-relation triples generated by the
model in its knowledge base.
18
Phase – 2: Knowledge Base Construction
The knowledge base is constructed by creating a database in OrientDB and inserting the extracted
data into it. The database is populated from the .csv file using a
Python script.
While creating the knowledge base, related chemicals and diseases are grouped into
chemical and disease groups. This grouping helps to answer queries that require a
multi-level knowledge graph. The entities are stored in vertex classes (Chemical, Disease,
ChemGroup, DisGroup) and the relations among these entities are stored in edge classes (Member,
Induce, Treatment, Association).
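As an illustration of how such a schema could be declared, the sketch below generates OrientDB SQL statements that create the four vertex classes as subclasses of V and the four edge classes as subclasses of E. The function name schema_statements is hypothetical, and how the statements are sent to the server (for example, through a Python OrientDB driver) is left out; this is not necessarily how the project's script creates the classes.

```python
VERTEX_CLASSES = ["Chemical", "Disease", "ChemGroup", "DisGroup"]
EDGE_CLASSES = ["Member", "Induce", "Treatment", "Association"]


def schema_statements():
    """Build one CREATE CLASS statement per vertex and edge class."""
    statements = [f"CREATE CLASS {name} EXTENDS V" for name in VERTEX_CLASSES]
    statements += [f"CREATE CLASS {name} EXTENDS E" for name in EDGE_CLASSES]
    return statements


for statement in schema_statements():
    print(statement)
```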
19
CHAPTER 6
SYSTEM DESIGN/ANALYSIS
o The user selects the entity type and enters the name of the entity.
o Upon submitting, an API call is made to the Flask server and the user input data
is passed to the server.
o The server runs a function that is mapped to the received API call.
o The function runs queries on the connected OrientDB database.
o Upon receiving the result from the database, the server passes it as response to
the client application.
o The client application uses the received data to display the information in Q&A
format.
20
2. Knowledge Graph Format
o The user selects the entity type and enters the name of the entity.
o Upon submitting, an API call is made to the Flask server and the user input data
is passed to the server.
o The server runs a function that is mapped to the received API call.
o The function runs queries on the connected OrientDB database.
o Upon receiving the result from the database, the server passes it as response to
the client application.
o The client application uses the received data and creates a knowledge graph
from it and displays it in a graph like format to the user.
21
The complete application can be understood in terms of three major components –
1. Client Application
2. Flask Server
3. OrientDB Database
Client Application
The client application (Front-end) is developed using HTML, CSS and JavaScript. The client-
side application has two major components –
1. User Input Form – The user input form allows a user to select the entity type and enter
the name of the entity and submit the data to get results in Q&A or Graph format.
2. Output Components – There are two output components. One component displays the
question-and-answer form of information and the other component displays the output
in graph format.
Flask Server
The server application has been developed using the Flask framework for Python. The server
application receives API requests from the client application, runs the function
mapped to each API call and returns the result to the client application.
OrientDB Database
The data is stored in an OrientDB database using the graph model. The medical entities are stored
in vertex classes and the relations among those entities are stored in edge classes.
In OrientDB, a multi-model database, vertex and edge classes are fundamental components for
defining and managing graph structures. These classes allow us to create, store, and query
graph data efficiently.
22
Vertex Classes
A vertex represents an entity or a node in a graph. Vertex classes in OrientDB are used to
define the types of entities and their properties. Each vertex belongs to a specific vertex class.
The base class for all vertex classes is V. All custom vertex classes inherit from this
base class.
Vertex classes can have properties that describe the attributes of the entities they
represent. For example, a Person vertex class might have properties like name, age,
and email.
We can define a schema for vertex classes to enforce constraints and ensure data
integrity.
Vertex classes can inherit properties from other vertex classes, allowing for a
hierarchical structure.
23
Vertex Class    No. of Records
Chemical        403
Disease         379
ChemGroup       505
DisGroup        491
All the vertex classes inherit V class which is pre-defined in OrientDB. All the vertex classes
can be viewed as subclasses of the V class. The V class in this application has a total of 1,778
records.
Edge Classes –
An edge represents a relationship or connection between two vertices in a graph. Edge classes
in OrientDB are used to define the types of relationships and their properties. Each edge
belongs to a specific edge class. Some of the important points about edge classes are as given
below -
The base class for all edge classes is E. All custom edge classes inherit from this base
class.
Edge classes can have properties that describe the attributes of the relationships they
represent. For example, a Friend edge class might have a since property to indicate
when the friendship started.
You can define a schema for edge classes to enforce constraints and ensure data
integrity.
Edges have a direction, with a starting vertex (out-vertex) and an ending vertex (in-
vertex).
The application has the following edge classes –
1. Association - This edge class represents the relation ‘association’ between a chemical
and a disease that are associated with each other in some way.
2. Induce - This edge class represents the relation ‘induce’ between a chemical group and
a disease, where the chemicals of the given chemical group can cause the disease, or the
relation ‘induce’ between a chemical and a disease group, where the chemical can cause
the diseases of the disease group.
3. Treatment - This edge class represents the relation ‘treatment’ between a chemical
group and a disease, where the chemicals of the given chemical group can cure the
disease, or the relation ‘treatment’ between a chemical and a disease group, where the
chemical can cure the diseases of the disease group.
4. Member – This edge class represents the relation between a chemical and a chemical group
in which the chemical is a member of the chemical group, or the relation between a
disease and a disease group in which the disease is a member of the disease group.
Edge Class      No. of Records
Association     181
Induce          445
Treatment       370
Member          1590
All the edge classes inherit E class which is pre-defined in OrientDB. All the edge classes can
be viewed as subclasses of the E class. The E class in this application has a total of 2,586
records.
25
CHAPTER 7
SYSTEM DEVELOPMENT AND IMPLEMENTATION
26
those in ‘Entity2’ column of the dataframe in set named ‘diseases’.
4. Create two dictionaries – ChemG and DisG to store chemical/disease groups and
members of those groups in key-value pairs.
5. Create two lists – Dis_Group_Relations and Chem_Group_Relations to store list of
dictionaries where each dictionary contains three keys – Chemical/Disease, Relation,
Group.
6. For each chemical in chemicals, do the following –
a. For each relation, do the following –
i. Get all the values of entity2 from dataframe where entity1 is given
chemical and relation is given relation and create a list ‘arr’
ii. Create a name of a group that serves as the disease group name.
iii. In DisG dictionary, create the key named as the ‘name of group’ and add
‘arr’ as value.
iv. In list ‘Dis_Group_Relations’, append a dictionary with the following
keys – Chemical, Relation, Group.
7. For each disease in diseases, do the following –
a. For each relation, do the following –
i. Get all the values of entity1 from the dataframe where entity2 is the given
disease and relation is the given relation, and create a list ‘arr’
ii. Create a name of a group that serves as the chemical group name.
iii. In ChemG dictionary, create the key named as the ‘name of group’ and
add ‘arr’ as value.
iv. In list ‘Chem_Group_Relations’, append a dictionary with the following
keys – Disease, Relation, Group.
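The grouping steps above can be sketched in Python as follows. This is a minimal illustration, not the project's script: it works on a plain list of (Entity1, Entity2, Relation) tuples instead of a pandas dataframe, the group naming scheme is hypothetical, and skipping empty groups is an assumption, since the steps do not say how a chemical or disease with no records for a relation is handled.

```python
RELATIONS = ["Induce", "Treatment", "Association"]


def build_groups(records):
    """Group diseases per (chemical, relation) and chemicals per (disease, relation).

    `records` is a list of (Entity1, Entity2, Relation) tuples, where Entity1
    is a chemical and Entity2 is a disease.
    """
    chemicals = sorted({chem for chem, _, _ in records})
    diseases = sorted({dis for _, dis, _ in records})
    ChemG, DisG = {}, {}
    Dis_Group_Relations, Chem_Group_Relations = [], []

    for chemical in chemicals:
        for relation in RELATIONS:
            arr = [d for c, d, r in records if c == chemical and r == relation]
            if not arr:  # assumption: empty groups are not created
                continue
            group = f"{chemical}_{relation}_DisGroup"  # hypothetical naming scheme
            DisG[group] = arr
            Dis_Group_Relations.append(
                {"Chemical": chemical, "Relation": relation, "Group": group})

    for disease in diseases:
        for relation in RELATIONS:
            arr = [c for c, d, r in records if d == disease and r == relation]
            if not arr:  # assumption: empty groups are not created
                continue
            group = f"{disease}_{relation}_ChemGroup"  # hypothetical naming scheme
            ChemG[group] = arr
            Chem_Group_Relations.append(
                {"Disease": disease, "Relation": relation, "Group": group})

    return ChemG, DisG, Chem_Group_Relations, Dis_Group_Relations


# Made-up records for illustration.
records = [
    ("aspirin", "headache", "Treatment"),
    ("aspirin", "fever", "Treatment"),
    ("aspirin", "gastric ulcer", "Induce"),
    ("ibuprofen", "headache", "Treatment"),
]
ChemG, DisG, Chem_Group_Relations, Dis_Group_Relations = build_groups(records)
```

With these sample records, the diseases treated by aspirin form one disease group, and the chemicals that treat headache form one chemical group.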
27
6. Open the created database.
7. Create classes for vertices (Chemical, Disease, ChemGroup, DisGroup) and edges
(Induce, Treatment, Association, Member).
8. Read the Medical.csv file using pandas into a DataFrame with columns: Entity1,
Entity2, Relation.
9. Initialize empty lists for chemicals and diseases.
10. Iterate over the CSV file rows to populate the chemicals and diseases lists.
11. Remove duplicates from both lists.
12. Run the algorithm for ‘Grouping of chemicals and diseases’ as mentioned above and
create dictionaries ChemG and DisG to store name of group and its members and lists
Dis_Group_Relations and Chem_Group_Relations to store list of dictionaries where
each dictionary contains three keys – Chemical/Disease, Relation, Group.
13. Create vertices of chemical class for each unique chemical.
14. Create vertices of disease class for each unique disease.
15. For each chemical group in ChemG, do the following –
a. Create vertex of ChemGroup class with name as the name of chemical group.
b. Create edge of member class from each chemical vertex that is a member of
the given chemical group.
16. For each disease group in DisG, do the following –
a. Create vertex of DisGroup class with name as the name of disease group.
b. Create edge of member class from each disease vertex that is a member of the
given disease group.
17. For each entry in Chem_Group_Relations, create an edge of the specified relation type
(value of the ‘Relation’ key in that entry) from the ChemGroup vertex (value of the
‘Group’ key in that entry) to the Disease vertex (value of the ‘Disease’ key in that
entry).
18. For each entry in Dis_Group_Relations, create an edge of the specified relation type
(value of the ‘Relation’ key in that entry) from the Chemical vertex (value of the
‘Chemical’ key in that entry) to the DisGroup vertex (value of the ‘Group’ key in that
entry).
19. Close the database connection.
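Steps 13 to 18 can be illustrated by generating the corresponding OrientDB SQL statements. The sketch below is an assumption-laden illustration rather than the project's script: the helper names (quote, vertex_stmt, edge_stmt, population_statements) are hypothetical, each vertex is assumed to carry a single name property, the backslash escaping of quotes is an assumption about the target SQL dialect, and executing the statements through a database driver is left out.

```python
def quote(value):
    """Quote a string literal; backslash escaping is assumed here."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def vertex_stmt(cls, name):
    return f"CREATE VERTEX {cls} SET name = {quote(name)}"


def edge_stmt(edge_cls, from_cls, from_name, to_cls, to_name):
    return (
        f"CREATE EDGE {edge_cls} "
        f"FROM (SELECT FROM {from_cls} WHERE name = {quote(from_name)}) "
        f"TO (SELECT FROM {to_cls} WHERE name = {quote(to_name)})"
    )


def population_statements(chemicals, diseases, ChemG, DisG,
                          Chem_Group_Relations, Dis_Group_Relations):
    """Return the statements for steps 13 to 18 in order."""
    stmts = [vertex_stmt("Chemical", c) for c in chemicals]
    stmts += [vertex_stmt("Disease", d) for d in diseases]
    for group, members in ChemG.items():
        stmts.append(vertex_stmt("ChemGroup", group))
        stmts += [edge_stmt("Member", "Chemical", m, "ChemGroup", group)
                  for m in members]
    for group, members in DisG.items():
        stmts.append(vertex_stmt("DisGroup", group))
        stmts += [edge_stmt("Member", "Disease", m, "DisGroup", group)
                  for m in members]
    for entry in Chem_Group_Relations:
        stmts.append(edge_stmt(entry["Relation"], "ChemGroup", entry["Group"],
                               "Disease", entry["Disease"]))
    for entry in Dis_Group_Relations:
        stmts.append(edge_stmt(entry["Relation"], "Chemical", entry["Chemical"],
                               "DisGroup", entry["Group"]))
    return stmts


# Made-up input for illustration.
statements = population_statements(
    chemicals=["aspirin"],
    diseases=["headache"],
    ChemG={"headache_Treatment_ChemGroup": ["aspirin"]},
    DisG={"aspirin_Treatment_DisGroup": ["headache"]},
    Chem_Group_Relations=[{"Disease": "headache", "Relation": "Treatment",
                           "Group": "headache_Treatment_ChemGroup"}],
    Dis_Group_Relations=[{"Chemical": "aspirin", "Relation": "Treatment",
                          "Group": "aspirin_Treatment_DisGroup"}],
)
```

Generating the statements separately from executing them makes it easy to inspect or log exactly what will be written to the knowledge base.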
28
Development of Server Application
The server application is developed using the Flask framework for Python. The server-side code
does the following –
1. Connects to the OrientDB server using the root user credentials.
2. Opens the database named ‘meddb’.
3. Creates an instance of the Flask application and enables CORS.
4. Defines the following routes –
A. ‘/’
This route is the route of the home page of the application. The server
runs a function that renders the ‘index.html’ file when this route is
called.
B. ‘/getEntities’
The server runs a function that gets the list of all the chemicals and
diseases from the Chemical and Disease classes of the OrientDB database and
returns them as a single list.
Following queries are used to fetch the data –
select name from Chemical
select name from Disease
C. ‘/getdata’
The server runs a function that does the following –
i. Extract entity type (ent_type) and entity name (ent_name) from
the POST request JSON data.
ii. Initialize a response dictionary with keys for type, name, induce,
treatment, association, related, and groups.
iii. Depending on the entity type (chemical or disease):
For chemical type:
a) Fetch and store chemical group memberships, induced
diseases, treatments, associations, and related chemicals.
b) Following queries are executed by the server to fetch the
data –
select * from Member where chemical = '{ent_name}'
chemgroup from Member where chemical =
'{ent_name}')
select * from Treatment where disgroup in (select disgroup
from Member where disease = '{ent_name}')
D. ‘/getgraph’
i. Extract entity type (ent_type) and entity name (ent_name) from the
POST request JSON data.
ii. Initialize a response dictionary with keys for name and groups.
iii. Depending on the entity type (chemical or disease):
For chemical type:
Fetch from orientdb and store chemical group memberships and
related information (induced diseases, treatments, associations).
The following query is run to get the list of all chemical groups the
chemical belongs to –
select * from Member where chemical = '{ent_name}'
The following queries are executed by the server for each chemical
group the chemical belongs to –
select * from Member where chemgroup = '{group}'
select * from Induce where chemgroup = '{group}'
select * from Treatment where chemgroup = '{group}'
select * from Association where chemgroup = '{group}'
For disease type:
Fetch and store disease group memberships and related information
(induced chemicals, treatments, associations).
The following query is run to get the list of all disease groups the disease
belongs to –
select * from Member where disease = '{ent_name}'
Following queries are executed by the server for each disease group
the disease belongs to –
select * from Member where disgroup = '{group}'
select * from Induce where disgroup = '{group}'
select * from Treatment where disgroup = '{group}'
select * from Association where disgroup = '{group}'
iv. Return the populated response
5. Run the server application.
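The per-group querying done for the ‘/getgraph’ route can be sketched as a plain function, shown here for the chemical type only. Everything outside the query strings is an assumption made for illustration: the function name build_chemical_graph, the injected run_query callable (standing in for the real database client), the record field names chemical, chemgroup and disease, and the nested shape of the response values.

```python
def build_chemical_graph(ent_name, run_query):
    """Collect group memberships and group-level relations for a chemical.

    `run_query` is a hypothetical callable that takes a query string and
    returns a list of dict-like records.
    """
    response = {"name": ent_name, "groups": {}, "groupinduce": {},
                "grouptreatment": {}, "groupassociation": {}}
    memberships = run_query(
        f"select * from Member where chemical = '{ent_name}'")
    for record in memberships:
        group = record["chemgroup"]
        members = run_query(
            f"select * from Member where chemgroup = '{group}'")
        response["groups"][group] = [m["chemical"] for m in members]
        for edge_class, key in (("Induce", "groupinduce"),
                                ("Treatment", "grouptreatment"),
                                ("Association", "groupassociation")):
            rows = run_query(
                f"select * from {edge_class} where chemgroup = '{group}'")
            response[key][group] = [row["disease"] for row in rows]
    return response


# A stand-in for the database, keyed by query text (made-up data).
FAKE_DB = {
    "select * from Member where chemical = 'aspirin'": [
        {"chemical": "aspirin", "chemgroup": "g1"}],
    "select * from Member where chemgroup = 'g1'": [
        {"chemical": "aspirin", "chemgroup": "g1"},
        {"chemical": "ibuprofen", "chemgroup": "g1"}],
    "select * from Treatment where chemgroup = 'g1'": [
        {"chemgroup": "g1", "disease": "headache"}],
}


def fake_run_query(query):
    return FAKE_DB.get(query, [])


graph = build_chemical_graph("aspirin", fake_run_query)
```

Because the entity name is interpolated directly into the query text, a real server would also need to escape or parameterize user input before running the query.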
32
a. Dynamically create ‘p’ element and change the inner html to the
question corresponding to the key based on the type of entity.
b. Dynamically create ‘p’ element and change the inner html to the data
corresponding to the key.
c. Append the elements to the container element.
2. Display result in Graph format –
To display result in graph form, following algorithm is used in the JavaScript code –
a) Get entity type and name from input field.
b) Create a request object with entity name and type as request body.
c) Make an API call to the flask server by sending a POST request.
d) The response obtained from the server has following keys – name, groups,
groupinduce, grouptreatment, groupassociation and their corresponding values.
e) Get the container element.
f) Create an array nodeList and insert all the vertex/node elements which are
created using the response obtained from server. Vertex/node elements are
JavaScript objects with id and label and are created for each chemical, disease
and group.
g) Create an array edgeList and insert all the edge elements which are created
using the response obtained from server. Edge elements are JavaScript objects
with from, to and label keys and are created for each relation among chemicals,
diseases and groups.
h) Create two vis.DataSet objects by using nodeList and edgeList.
i) Create a network by using the two DataSet objects.
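Steps f) and g) can be illustrated with a short sketch. The application performs this step in JavaScript on the client; the version below is written in Python purely to show the shape of the node and edge objects. The function name build_graph_elements, the numeric ids and the nested layout of the server response are assumptions made for illustration.

```python
import json


def build_graph_elements(response):
    """Build node and edge lists (id/label and from/to/label objects)."""
    nodes = {}
    edges = []

    def add_node(label):
        # Reuse the existing node if this label was already added.
        if label not in nodes:
            nodes[label] = {"id": len(nodes) + 1, "label": label}
        return nodes[label]["id"]

    add_node(response["name"])
    for group, members in response["groups"].items():
        group_id = add_node(group)
        for member in members:
            edges.append({"from": add_node(member), "to": group_id,
                          "label": "Member"})
    for key, label in (("groupinduce", "Induce"),
                       ("grouptreatment", "Treatment"),
                       ("groupassociation", "Association")):
        for group, targets in response.get(key, {}).items():
            group_id = add_node(group)
            for target in targets:
                edges.append({"from": group_id, "to": add_node(target),
                              "label": label})
    return list(nodes.values()), edges


# Made-up response for illustration.
sample_response = {
    "name": "aspirin",
    "groups": {"g1": ["aspirin", "ibuprofen"]},
    "groupinduce": {},
    "grouptreatment": {"g1": ["headache"]},
    "groupassociation": {},
}
node_list, edge_list = build_graph_elements(sample_response)
print(json.dumps({"nodes": node_list, "edges": edge_list}, indent=2))
```

Deduplicating nodes by label ensures that an entity appearing in several groups is drawn once and connected to each of them.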
33
Screenshots
34
35
36
37
APPENDIX
38
REFERENCES
1. https://orientdb.org/
2. https://flask.palletsprojects.com/en/3.0.x/
3. https://machinelearningmastery.com/the-transformer-model/
4. https://www.techtarget.com/searchenterpriseai/definition/BERT-language-model
5. https://academic.oup.com/bib/article/23/5/bbac282/6645993
6. https://www.analyticsvidhya.com/blog/2022/11/comprehensive-guide-to-bert/
39