0% found this document useful (0 votes)

55 views42 pages

Extensive Web Data Extraction

A detailed description of the organisation, followed by the problem objective and the scope of the problem with explanations, is provided in this section. It also gives the explanation of the system environment used in the development of the proposed system and gives inf

Uploaded by

Bhaarathi Solutions

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

55 views42 pages

Extensive Web Data Extraction

Uploaded by

Bhaarathi Solutions

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

PREFACE

CHAPTER I – INTRODUCTION specifies the purpose, the objective, and the

environment under which the application works.

CHAPTER II – SYSTEM ANALYSIS provides detailed description of the proposed

system and the different functionalities involved in the system.

CHAPTER III – SYSTEM DESIGN deals with the design issues of the system. It
presents related diagrams like architectural diagram, data flow diagram, module design
diagram, and activity diagram.

CHAPTER IV – SYSTEM TESTING gives the various test cases for the application.

CHAPTER V – IMPLEMENTATION lays out the installation procedure for the

application.

CHAPTER VI – CONCLUSION puts forth the future enhancements and suggestions

related to the project. And gives a reference note that is used in the development of the
project.

iii
CHAPTER I
INTRODUCTION

A detailed description of the organisation, followed by the problem objective and

the scope of the problem with explanations, is provided in this section. It also gives the
explanation of the system environment used in the development of the proposed system
and gives information about the methods and technologies used in the development of the
system.

1.1 ORGANISATION PROFILE

nference is a product-based knowledge synthesis software platform that analyses

global biomedical siloed datasets. It is headquartered in Cambridge, Massachusetts.
nference employs more than 400 engineers and scientists across five offices: Cambridge,
Massachusetts; Rochester, Minnesota; Bangalore, Coimbatore, India; and Toronto,
Canada, and decreased rates of COVID infection in individuals with recent non-COVID-
19 vaccinations. The firm's software platform “de-identifies” medical records, preserving
patient privacy; harmonises semi-structured data into structured data; transforms
unstructured natural language text into structured data.

nference is making the world's biomedical knowledge computable to solve urgent

healthcare problems. The exponential growth of biomedical data, driven by the adoption
of electronic medical records (EMRs), provides an unprecedented opportunity to create
transformative technologies and products in healthcare.

nference partners with medical centres to turn decades of rich and predominantly
unstructured data captured in EMRs into powerful solutions that enable scientists to
discover and develop the next-generation of personalised diagnostics and treatments for
patients worldwide. The company believes that the greatest opportunity for our generation

1
to elevate human health is by developing technology to curate and synthesise the world’s
biomedical data in order to enable that scientific discovery. The platform uses artificial
intelligence and machine learning to extract insights from unstructured and structured data
from biomedical literature, as well as large scale molecular and real-world datasets. The
software is marketed towards pharmaceutical research and development and life sciences
companies.

1.2 PROBLEM DEFINITION

1.2.1 OBJECTIVE
The objective of this project is to optimise data extraction processes within the
healthcare domain, ensuring efficient scheduling and scalability. By focusing exclusively
on healthcare-related data sources, the project aims to support multiple in-house
applications of the biomedical company.

1.2.2 SCOPE
● Focus on optimising the scheduling of healthcare data sources to ensure timely
and efficient extraction.
● Expand the scope of data extraction to cover a wide range of healthcare-related
sources, catering specifically to the needs of the biomedical company.
● Provide a reliable and comprehensive dataset for various in-house applications,
facilitating informed decision-making and enhancing operational efficiency
within the company's ecosystem

1.2.3 USERS
The users of the product are mostly people who work in the biomedical domain.
Some of the users are as follows:
● Doctors
● Pharmaceutical companies

2
● Biomedical researchers
● Hospitals
● Medical Students

1.2.4 LIMITATIONS
● Users of the tool need to have a domain knowledge of biomedicine in order to make
use of most of the tools.
● The tool that the company provides is not free for all to use, it is a pay per use
Platform.

1.3 SYSTEM ENVIRONMENT

System environment gives the minimal machine and software requirements for the
application to run.

HARDWARE ENVIRONMENT
Device Used – Apple MacBook Pro
● RAM – 8 GB
● Hard Disk – 1 TB
● Processor – Apple M1 chip

SOFTWARE ENVIRONMENT
● Operating System – Macintosh
● Database – MongoDB
● Backend – Python
● Docker, Kubernetes
● IDE: Visual Studio Code

3
CHAPTER II
SYSTEM ANALYSIS

System Analysis is a process of studying the system in order to identify the goals,
purpose of the system and to provide better understanding of the system’s requirements.
This chapter gives a brief discussion about the detailed study of the proposed system and
the different functionalities involved in the system.

2.1 SYSTEM DESCRIPTION

● Data Extraction Pipeline: The system comprises a robust data extraction pipeline
leveraging Python scripts orchestrated within Docker containers. These scripts
systematically crawl healthcare-related websites, extracting structured data using
web scraping techniques.
● Scalable Infrastructure: Kubernetes manages the deployment and scaling of Docker
containers, ensuring efficient resource allocation and high availability. MongoDB
serves as the database for storing extracted data, while Minio provides scalable
cloud storage, accommodating the growing volume of healthcare data.
● Intelligent Scheduling: The system incorporates intelligent scheduling algorithms
to prioritise data sources, optimise crawling frequency, and dynamically adjust
schedules based on the changing relevance and volatility of healthcare data.
● Automated monitoring tools like Grafana track source performance and adjust
schedules in real-time to maintain data freshness and reliability.

2.1.1 EXISTING SYSTEM

The existing system is a sophisticated platform engineered to extract and curate

healthcare-related data with precision and efficiency. It relies heavily on Python for robust
backend development, ensuring flexibility and adaptability to the complex demands of

4
healthcare data processing. Docker containers are utilised for seamless containerization,
enhancing portability and ease of deployment. Kubernetes orchestrates the deployment and
scaling of these containers, optimising resource allocation and guaranteeing high
availability of services. MongoDB acts as the backbone of the system, providing a reliable
and scalable database solution for storing the vast amounts of extracted data.
Complementing this, Minio offers cloud storage capabilities, enabling seamless scalability
and accessibility across distributed environments.

One of the system's most notable features is its intelligent scheduling mechanism,
which dynamically prioritises data sources and adjusts crawling frequencies based on their
relevance and volatility. This ensures that the most critical and up-to-date information is
captured efficiently. Additionally, automated monitoring tools continuously track the
performance of data sources, enabling real-time adjustments to schedules to maintain data
freshness and reliability.

Despite its effectiveness, the system faces ongoing challenges, particularly in

scaling data extraction operations to accommodate the ever-growing volume of healthcare
data. Ensuring the reliability and freshness of extracted data remains a priority,
necessitating ongoing refinement and optimization of the system's scheduling and
monitoring capabilities.

2.1.2 FEATURES OF THE PROPOSED SYSTEM

● The system employs cutting-edge algorithms tailored for healthcare sources,

ensuring accurate extraction of structured data
● Leveraging Docker containers for modularity and Kubernetes for orchestration, the
system ensures seamless deployment and scaling across diverse environments.
● With real-time data processing capabilities, the system empowers healthcare
organisations to access timely insights for decision-making, trend analysis, and
predictive modelling, enhancing operational efficiency and patient care.

5
● User-friendly provides intuitive access to extracted healthcare data and insights,
supported by visualisation tools for exploring trends and correlations, facilitating
informed decision-making and strategic planning

2.1.3 TOOLS AND TECHNOLOGIES

PYTHON

Python is a high-level, general-purpose programming language. Its design

philosophy emphasises code readability with the use of significant indentation. Python is
dynamically-typed and garbage-collected. It supports multiple programming paradigms,
including structured, object-oriented and functional programming.

MONGODB

MongoDB is a popular NoSQL database solution known for its flexibility and
scalability. It stores data in a document-oriented format, allowing for the storage of
structured and unstructured data without requiring a predefined schema. MongoDB's
distributed architecture and horizontal scaling capabilities make it well-suited for handling
large volumes of data and high-throughput applications. It is widely used in modern web
development, big data analytics, and real-time processing applications.

DOCKER

Docker is a containerization platform that simplifies the deployment and

management of applications by packaging into lightweight, portable containers. These
containers encapsulate everything needed to run an application, including code, runtime,
libraries, and dependencies, ensuring consistency across different environments. Docker's
efficiency and isolation make it ideal for microservices architectures and continuous
integration/continuous deployment (CI/CD) workflows.

6
KUBERNETES

Kubernetes is an open-source container orchestration platform designed to

automate the deployment, scaling, and management of containerized applications. It
provides features for load balancing, service discovery, and self-healing, enabling seamless
operation of containerized workloads across clusters of hosts. Kubernetes abstracts away
the underlying infrastructure, allowing developers to focus on application development and
deployment without worrying about the underlying infrastructure complexities. It has
become the de facto standard for managing containerized environments in cloud-native
applications.

VISUAL STUDIO CODE

Visual Studio Code is a lightweight, open-source code editor developed by

Microsoft. It offers a wide range of features designed to enhance the coding experience,
including syntax highlighting, code completion, debugging support, and Git integration.
VS Code supports multiple programming languages and frameworks through extensions,
allowing developers to customise their environment to suit their specific needs. Its intuitive
interface and extensive marketplace of extensions make it a popular choice among
developers for various programming tasks, from simple text editing to full-fledged
software development projects.

2.2 USE CASE MODEL

A use-case model is a model of how different types of users interact with the system
to solve a problem. As such, it describes the goals of the users, the interactions between
the users and the system, and the required behaviour of the system in satisfying these goals.
The use case diagram for the proposed system is mentioned in Figure 2.1 below

7
Crawling Infrastructure System

Figure 2.1 Use Case Diagram

8
EXTENDED USE CASES

Scripts Generation Module

Use case name: Scripts Generation Module

Purpose: To generate python scripts for sources which provide downloadable files like
JSON, XML etc on their website on a monthly or yearly basis
Input Parameters: The file downloaded from the particular source
Output Parameters: Python script is written according to the file structure
Primary actor: Developer
Secondary actor: System
Pre-condition: The user must have access to the website to be crawled
Post-condition: Python script is written
Successful scenario:
● The developer will be able to write a successful script in order to store the data in a
structured format
● The data crawled using the script can be used by multiple in-house applications
within the company
Exceptions: If the developer doesn’t have access to the source website due to some
licensing issues, then this module cannot be executed

Config Generation Module

Use case name: Config Generation Module

Purpose: To generate config scripts to extract data from sources which do not provide
downloads but give access to crawl the HTML content of the website.
Input Parameters: The URL of the website to be crawled
Output Parameters: Config script according to the website structure
Primary actor: Developer
Secondary actor: System

9
Pre-condition: The user must have access to the website to be crawled
Post-condition: Config script is written
Successful scenario:
● The developer will be able to write a successful config script in order to store the
data in a structured format
● Relevant Xpaths are written to handle the complex structure of the website that is
being extracted.
Exceptions: If the website doesn’t allow the user to extract the data from its HTML
content, then this module cannot be executed.

Scheduling Module

Use case name: Scheduling Module

Purpose: To schedule the sources that are created using the previous two modules i.e.
Config Generation and Script Generation Module.
Input Parameters: The code written (either script or config), scheduling interval
Output Parameters: Source gets scheduled in Kubernetes
Primary actor: User
Secondary actor: System
Pre-condition: The user must have setup Kubernetes and Docker in the server
Post-condition: Source gets scheduled in Kubernetes
Successful scenario:
● The developer will be able to deploy the source in Docker container
● Once deployed in Docker, the source will be scheduled in the Kubernetes cluster
according to the website.
Exceptions: If the developer has not configured Docker and Kubernetes in the server that
they want to deploy, then this module cannot be executed.

10
MongoDB Upload Module

Use case name: MongoDB Upload Module

Purpose: The output data must be uploaded to MongoDB from which the other in-house
applications fetch its data to show in the UI.
Input Parameters: The scheduled sources’ script or config
Output Parameters: The data gets stored in MongoDB
Primary actor: Developer
Secondary actor: Database
Pre-condition: The source should have been scheduled in Kubernetes
Post-condition: The latest data gets updated in MongoDB
Successful scenario:
● The latest data from the website will be updated in the MongoDB in a scheduled
manner according to the refresh interval defined in the scheduling module.
● This module restricts redundancy and duplication of data and provides timestamps
in which the data got updated which is useful for backtracking purposes.
Exceptions: The user must have the credentials to the MongoDB and read / write access
to the database.

2.3 SOFTWARE REQUIREMENTS SPECIFICATION

A software requirements specification is a document that describes requirements

for a software product, program or set of programs. This defines the plan for the project
with the Functional and Non-functional requirements needed and will provide the detailed
test plan and its scope.

11
2.3.1 FUNCTIONAL REQUIREMENTS

A functional requirement defines a function of a system or its component, where a

function is described as a specification of behaviour between inputs and outputs. Functional
requirements may involve calculations, technical details, data manipulation and
processing, and other specific functionality that define what a system is supposed to
accomplish.

Website Selection and Configuration:

Select and configure websites for data extraction, including specifying URLs,
crawling parameters, and authentication settings, ensuring precise data retrieval from
targeted sources

● User-defined website configuration

● Dynamic website monitoring
● Error handling and notification

Through this functionality, new websites containing reliable healthcare

information are identified and integrated into the existing crawling infrastructure.

Data Extraction:

The system extracts data from configured websites according to predefined rules,
performing preprocessing, transformation, and integrity checks to ensure accurate and
usable data.

● Customised data extraction rules

● Data preprocessing and transformation
● Data integrity verification

12
Depending on the website’s refresh on updation of data, customised scripts are
written to get the data in a timely and quick manner.

Maintenance of Existing Architecture

The system conducts routine health checks, implements continuous integration and
deployment practices, and maintains version control and documentation to ensure the
smooth operation and evolution of the existing architecture.

● Routine health checks

● Continuous integration and deployment (CI/CD)
● Version control and documentation

Improve Scheduling of Crawling in Docker

Advanced scheduling algorithms dynamically allocate crawling resources, fault-

tolerant job queuing mechanisms ensure reliable task execution, and performance
monitoring tools optimise resource allocation for efficient crawling within Docker
containers.

● Dynamic scheduling algorithms

● Fault-tolerant job queuing
● Performance monitoring and optimization

2.3.2 NON-FUNCTIONAL REQUIREMENTS

Non-functional requirements describe the general characteristics of a system. Non-

functional requirements may also describe aspects of the system that don't relate to the
execution. The non-functional requirements required for the application are given below.

13
Performance:

● The average data extraction time for a single website should not exceed 5 seconds
to ensure timely delivery of information.

Usability:

● The system should provide clear error messages and instructions to assist users in
resolving issues encountered during website configuration.

Security:

● User authentication should be enforced using strong password policies and multi-
factor authentication to prevent unauthorised access.

Scalability:

● Auto-scaling mechanisms should be in place to automatically provision additional

resources during peak usage periods to maintain system performance.

Quality:

● Code reviews should be conducted for all changes to ensure adherence to coding
standards and best practices.

Safety:

● Regular backups of data should be performed daily and stored in geographically

diverse locations to mitigate the risk of data loss due to hardware failures or
disasters.
● Error logging and monitoring should be implemented to track and analyse system
errors, allowing for timely detection and resolution of safety-critical issues.

14
2.4 TEST PLAN

Test Plan is the strategy used to verify and ensure that a product or system meets
its design specifications and other requirements.

TEST SCOPE

The entire application consists of multiple modules which cohesively work toward
getting desired output.

Scripts Generation Module:

● Ensure that scripts are generated accurately based on input data sources.
● Validate that the generated scripts adhere to coding standards and best practices.
● Verify the compatibility of generated scripts with the backend software
environment.
● Test the scalability of script generation for a large number of data sources.

Config Generation Module:

● Validate the accuracy of configuration files generated for different environments.

● Ensure that configurations are tailored to the specific requirements of healthcare
data extraction.
● Verify the robustness of configurations in handling various data source types and
formats.
● Test the efficiency of configuration generation for different deployment scenarios.

15
Scheduling Module:

● Validate the scheduling mechanism's ability to efficiently allocate resources for

data extraction.
● Test the scalability of the scheduling algorithm for a growing number of data
sources.
● Verify the accuracy of scheduled tasks based on priority and resource availability.
● Ensure proper handling of exceptions and errors during the scheduling process.

MongoDB Upload Module:

● Validate the integrity of data uploaded to MongoDB from various sources.

● Test the efficiency of data transfer and storage in MongoDB.
● Verify the scalability of MongoDB operations for handling increasing volumes of
healthcare data.
● Ensure data security and compliance with privacy regulations during the upload
process.

16
CHAPTER III
SYSTEM DESIGN

System design is the process of defining the architecture, product design, modules,
interfaces, and data for a system to satisfy specified requirements. System design could be
seen as the application of systems theory to product development.

3.1 ARCHITECTURAL DESIGN

The architecture of a system describes its major components, their relationships

(structures), and how they interact with each other. Software architecture and design
includes several contributory factors such as Business strategy and Application strategy
and design.

The design of the system is built in such a way that it will be user friendly and easy
to navigate within the system. The system works only with an active connection to the
Internet and it is built on a web server. The data which is present on the various in-house
applications is extracted from the databases such as MongoDB which is a structured
database where the external data from various reliable sources are being crawled.

17
Figure 3.1 Design Architecture of the Crawling Infrastructure System

Figure 3.1 depicts the entire diagram of the system’s architecture. The architecture
of the "Extensive Web Data Extraction" project centres on a scalable and efficient

18
framework tailored for healthcare data extraction. Python serves as the backbone for
backend development, providing flexibility and a rich library ecosystem. Docker
containerization and Kubernetes orchestration ensure seamless deployment and
management of microservices across diverse environments, enhancing portability and
scalability.

MongoDB and Minio offer robust storage solutions for structured and unstructured
healthcare data, while Linux servers provide a stable and reliable environment for data
extraction. The design emphasises optimising extraction processes, enhancing scheduling
efficiency, and ensuring system reliability to meet the needs of in-house applications within
the biomedical company's ecosystem.

3.2 STRUCTURAL DESIGN

Structured Design (SD) is diagrammatic notation which is designed to help people

understand the system. The basic goal of SD is to improve quality and reduce the risk of
system failure. It establishes concrete management specification and documentation. It
focuses on solidity, pliability and maintainability of the system.

MODULE DIAGRAM

The module diagrams are used to show the allocation of classes and objects to
modules in the physical design of a system, that is module diagrams indicate the
partitioning of the system architecture. Through module diagrams, it is possible to
understand the general physical architecture of a system. The proposed system has
following modules:

19
Scripts Generation Module

● There are certain websites which provide downloadable data which are ingested in
the crawling infrastructure system.
● For these types of sources, scripts are written in Python and the sources are hosted
in Docker and scheduled in Kubernetes according to the refresh interval which
varies according to the source.

Config Generation Module

● Some sources update the data on the website itself and do not provide any
downloads. For these types of sources, the content is extracted from the website
HTML itself.
● XPaths are used to extract the HTML content and they are feeded into the crawler
component that is already defined in the crawling infrastructure service
● It is then converted into JSON format which is useful to store it in the MongoDB
database.

Scheduling Module

● The code files which are generated in the previous modules such as Scripts
Generation Module and Config Generation Module are then to be scheduled in the
server to ensure timely delivery of the updated data.
● The sources are scheduled according to the refresh frequency which is specified in
the website.
● There are multiple Kubernetes pods in which the multiple sources are scheduled.
● Pods serve as the basic scheduling unit in Kubernetes, encapsulating an
application's Docker containers and associated resources.

20
MongoDB Upload Module

● The scheduled sources are then to be uploaded to MongoDB which is the primary
database that is used by multiple in-house applications of the company.
● In order to reduce the duplication of the redundant data and updation of existing
data without the deletion of the key parameters of the older version of data, logical
techniques are used to handle it accordingly.
● Timestamps are added for the addition and updation of the data to ensure easy
tracking of the history of updation if any issue occurs in data being loaded to the
MongoDB.

Figure 3.2 depicts the Module Diagram of the proposed system

Figure 3.2 Module Diagram of Entity Knowledge Graph

21
Figure 3.3 illustrates the class diagram of the proposed system

Figure 3.3 Class Diagram of Entity Knowledge Graph

Figure 3.3 depicts the class diagram of the system. It contains various class and they are
explained as follows:

22
Authentication of User

● This class ensures that the developer has authorised access to login and use the
proposed system.
● The developer has to provide username and credentials in order to use the proposed
system.

Identification of the source

● The developer identifies reliable healthcare source providers in order to ingest that
data into the existing system
● The developer identifies the sources according to the client requirement as well.

Check Licensing

● The developer checks for the licensing requirements provided by the source
website.
● [Link] file contains all the licensing information and provides which web pages
are available for crawling or not.

Script or Config Identification

● The developer checks for any downloads that are provided by the website itself
● If there are downloads, the developer utilises the ScriptGeneration class and if there
are not any downloads available, then the developer uses ConfigGeneration class.

23
ScriptGeneration

● According to the format of the downloadable file, python scripts are written to the
structure of the file.
● Python scripts are written in such a way that the data provided in the file is all
extracted and are loaded in a JSON format.

ConfigGeneration

● If there are no downloads, then the data is extracted from the websites’ HTML
content itself.
● XPaths are written to handle complex website HTML structure and data is
extracted.

Deploy In CIS

● Crawling Infrastructure Service (CIS) hosts all the sources that are used to extract
reliable healthcare data.
● The source are stored as Docker containers which contains the script code
● Docker containers and built and deployed in CIS

Schedule Kubernetes

● The deployed sources are then scheduled in the Kubernetes cluster

● The refresh interval varies from each source or website

MongoDBUpload

● The data which is extracted are uploaded in MongoDB in a scheduled manner.

24
● Data duplicates are eliminated and updated accordingly
● Timestamps are added for each document to easily check when the data is being
updated.

3.3 BEHAVIOURAL DESIGN

Behavioural design of a system generally represents the manner in which the

system responds when a particular function is being done or when an input data is being
processed at different points of time according to the system. Behavioural patterns are
concerned with the assignment of responsibilities between objects, or, encapsulating
behaviour in an object and delegating requests to it.

SEQUENCE DIAGRAM

The sequence diagram represents the flow of messages in the system and is also
termed as an event diagram. It helps in explaining several dynamic scenarios. It portrays
the communication between the user and the system as a time-ordered sequence of events,
such that these lifelines took part at the run time.

25
Figure 3.4 depicts the Sequence of the entire system, the activity involved in all the
Modules.

Figure 3.4 Sequence Diagram of the Extensive Web Data Extraction

The sequence diagram of the proposed system as depicted above explains how the
system works seamlessly without any interruption of any sorts.

26
● The developer logins into the proposed system using the username and password
required to access the system
● The source is identified according to the client’s requirements and needs
● Licensing for the source is checked and if the website restricts crawling of the
website in all ways, then the user quits the system.
● If the site provides downloads, the data is downloaded and data is extracted from
the file and stored in JSON format.
● If the site doesn’t provide downloads, data is extracted from the HTML content of
the webpage using XPaths and stored in JSON format.
● Then the source is deployed in the Crawling Infrastructure service and scheduled
in the Kubernetes cluster which requests for the refresh interval frequency for that
particular source from the previous module.
● Then the data that is in the JSON format is loaded into the MongoDB in a scheduled
manner without duplicating the data.

3.4 USER INTERFACE DESIGN

User interface design or user interface engineering is the design of user interfaces
for machines and software, such as computers, home appliances, mobile devices, and other
electronic devices, with the focus on maximising usability and the user experience.

27
Screen 3.1 displays the different types of corpuses present in the Crawling Infrastructure
Service

Screen 3.1 Corpus List of CIS

In this page, the user can view the corpus list that is existing in Crawling
Infrastructure service and the number of documents crawled and latest updated date.

28
Screen 3.2 depicts the graph visualisation of the corpus statistics

Screen 3.2 Visualisation of Documents count

In this page of UI, the user can easily interpret the number of documents present in
each corpus.

29
Screen 3.3 illustrates the status of each source that is deployed in CIS.

Screen 3.3 Crawl Status Tab

In this page, the user can easily identify the status of each source. This will aid in
debugging the source if it is not crawled for any specific reason or issue.

30
Screen 3.4 provides the user to post the crawling request to the developers and provide the
Estimated time of completion according to the priority

Screen 3.4 Crawl Request Tab

3.5 DEPLOYMENT DESIGN

Any software system must be deployed under highly favourable circumstances and
environment to get optimum results. A slight variation in the implementation process may
lead to errors or failure of the system. To get a better understanding of the deployment
environment of the proposed system, the deployment diagram is outlined.

Figure 3.5 displays the deployment diagram of the proposed system.

31
Figure 3.5 Deployment Diagram

3.6 NAVIGATION DESIGN

Flow Diagram

Flow diagrams are used to graphically represent the flow of data in a business
information system. Flow diagram describes the processes that are involved in a system to
transfer data from the input to the file storage and reports generation.

Figure 3.6 depicts the flow of the entire system, once the user gets started in the
proposed system

32
Figure 3.6 Flow Diagram of the System

The developer who will get the requirement needs of the user and identifies new
sources to crawl in order to extract reliable healthcare information that caters to the needs
of the company or a particular in-house application that uses the crawled data. Then the
developer checks for the licensing information that is provided by the website and adheres
to the restrictions that are provided in that particular website or source. Once the analysis

33
of licensing is done and the source is ready to crawl, then the developer checks if there are
any downloads provided in the source itself. The developer follows either of one of the
following steps:

● The developer follows the Script generation module if downloads are found.
● The developer follows the Config Generation module if downloads are not found.

Then the developer deploys either the script or config in the Crawling Infrastructure
Service where the scheduling Kubernetes clusters are hosted. The source is scheduled
according to the refresh frequency provided in the website and data is updated in the
MongoDB accordingly without making any duplications in the database. Through this
MongoDB, multiple in-house applications extract the data according to the requirements
of the applications.

34
CHAPTER IV
SYSTEM TESTING

System testing is testing conducted on a complete integrated system to evaluate the

system's compliance with its specified requirements. A series of systematic procedures are
referred to test how the system should perform and where common mistakes may be found
by entering data that may cause the system to malfunction or return incorrect information.
The purpose of testing is quality assurance, verification and validation, or reliability
estimation.

4.1 TEST CASES AND TEST REPORTS

4.1.1 TEST PLAN

A Test Plan documents the strategy that will be used to verify and ensure that a
product meets its system design specifications. Test cases are built around the requirements
and specifications i.e., what the system is supposed to do.

Each test case contains item criteria such as:

● PASS
○ All expected results are achieved and/or all unexpected events are resolved.
● PASS WITH EXCEPTIONS
○ Unexpected events require alternative procedures that have been
implemented and those events are called Exceptions.
● FAIL
○ Testing process response does not confirm the expected results.

35
Table 4.1 contains the list of test cases and their respective test reports.

Test Case Input Expected Actual result Result

Number result

1 [Link] 2000 URLs 1800 URLs Pass with

eu/[Link] crawled crawled Exceptions

2 [Link] All Drugs are All Drugs are Pass

[Link]/scripts/cder/daf/in to be crawled crawled
[Link]?event=reportsSe from this
[Link] website

3 [Link] Doctor names Doctor names Fail

org/appointments/find-a- are to crawled cannot be
doctor crawled due to
licensing issues

4 [Link] File to be File downloaded Pass

/[Link] downloaded and extracted
and extracted

Table 4.1 - List of Test cases and test reports

36
CHAPTER V
SYSTEM IMPLEMENTATION

Implementation phase is the phase in which the project plan is put into motion and
the work of the project is performed. It is important to maintain control and communicate
as needed during implementation. Progress is continuously monitored and appropriate
adjustments are made and recorded as variances from the original plan. In this phase, one
can build the components either from scratch or by composition. Given the architecture
document from the design phase and the requirement document from the analysis phase,
one can build exactly what has been requested.

5.1 INSTALLATION PROCEDURE

Installing Python:

● Go to [Link] and download the latest version of

Python.
● Click on Continue to proceed with the installation procedure.
● After finishing the installation procedure, go to Terminal and enter the following
command to ensure if Python was installed:
python --version

Installing Visual Studio Code:

● Visit the Visual Studio Code website: Go to [Link] in your

web browser.
● Download the installer: Click on the "Download for Mac" button. Visual Studio
Code is available for Windows, macOS, and Linux.

37
● Run the installer: Once the installer is downloaded, locate the file and run it. Follow
the on-screen instructions to complete the installation process. On Windows, you
may need to confirm any security prompts or user account control dialogs.
● Launch Visual Studio Code: After installation is complete, user can launch Visual
Studio Code from your desktop or application menu. Upon opening, you'll be
greeted with the editor interface, ready for you to start coding.

Installing MongoDB – Studio3T:

● Download the latest Studio 3T .dmg file. Remember to select Intel or Apple Silicon
on the download page to get the correct version for your Mac.
● Drag and drop the .dmg file in the Applications folder.
● Login to the already existing mongo servers by clicking the New Connection
button.

Installing Docker:

● Visit the Docker website at [Link] and download the

appropriate version of Docker for your operating system (Windows, macOS, or
Linux).
● Once the Docker installer is downloaded, run the installer and follow the on-screen
instructions to install Docker on your system. On Windows and macOS, you may
need to restart your computer after installation.
● After installation, open a terminal or command prompt and type docker --version
to verify that Docker is installed correctly. You should see the version number of
Docker displayed in the output.

38
Installing Kubernetes:

● Install Minikube: Minikube is a tool that lets you run Kubernetes locally. Visit the
Minikube GitHub page at [Link] and
download the appropriate version for your operating system.
● Install a Hypervisor (if required): Minikube requires a hypervisor to create a virtual
machine to run Kubernetes. Install a hypervisor such as VirtualBox or Hyper-V
based on your operating system's requirements.
● Start Minikube: Once Minikube is installed, open a terminal or command prompt
and run the command minikube start. This command starts a local Kubernetes
cluster using Minikube. It may take a few minutes to download the required
dependencies and start the cluster.
● Verify Installation: After Minikube starts successfully, you can verify the
installation by running kubectl version. This command should display both the
client and server versions of Kubernetes, indicating that Minikube has been
installed and is running correctly.

39
CHAPTER VI
CONCLUSION

In conclusion, nference's endeavour to enhance human health through the curation

and synthesis of biomedical data is commendable, particularly in an era inundated with
misinformation and fake data. The successful implementation of data crawling, scheduled
via Kubernetes, ensures timely delivery of accurate information, vital for various in-house
applications. The development of a responsive UI facilitates effortless navigation and
request initiation for new data sources, augmenting user experience. Despite encountering
challenges such as comprehending complex code logic, debugging, and adhering to
industry software development standards, nference has navigated through these obstacles,
demonstrating adaptability and resilience. Moreover, overcoming hurdles in the
installation and configuration of Docker and Kubernetes underscores the commitment to
technological excellence.

FUTURE ENHANCEMENT

● Continuous refinement and optimization of the Crawling Infrastructure Service to

ensure seamless data acquisition and processing.
● Implementation of advanced debugging tools and practices to expedite issue
resolution and improve overall system robustness.
● Exploration of emerging technologies and methodologies to further streamline
deployment processes and enhance system scalability and efficiency.

40
BIBLIOGRAPHY

WEB REFERENCES

1. GitHub - [Link]
2. Docker - [Link]
3. Kubernetes - [Link]
4. Domain Knowledge - [Link]

Common questions

Docker provides efficient containerization, enhancing portability and consistency across different environments, while Kubernetes orchestrates the deployment and scaling of these containers, optimizing resource allocation and ensuring high availability and seamless operation across clusters .

nference enhances the usability of EMRs by de-identifying and harmonizing semi-structured data into structured formats which makes it easier to process and analyze. The platform transforms unstructured natural language text into structured data, enabling scientists to leverage this information for developing personalized diagnostics and treatments .

Users need to have domain knowledge in biomedicine to effectively use most of the tools provided by the platform. Additionally, the system is not free; it operates on a pay-per-use basis, which may limit accessibility to some potential users .

The document describes a series of systematic testing procedures designed to evaluate the system's compliance with its specified requirements. Test plans are built around requirements and specifications, with test cases ensuring that all expected results are achieved. Strategies include handling unexpected events and validating the system's data processing and operational integrity .

The deployment environment requires an Apple MacBook Pro with 8 GB RAM, a 1 TB hard disk, and an Apple M1 chip. Software requirements include Macintosh Operating System, MongoDB as the database, Python for backend development, Docker, Kubernetes, and Visual Studio Code as the IDE .

Python scripts within Docker containers systematically crawl healthcare websites to extract structured data. Kubernetes manages the deployment and scaling of these containers, ensuring efficient resource allocation and high availability, thus supporting large-scale operations .

MongoDB serves as a scalable and reliable database solution within the system, storing vast amounts of extracted data in a document-oriented format. This architecture supports the storage of structured and unstructured data without requiring a predefined schema, crucial for handling high-throughput applications in healthcare .

The intelligent scheduling algorithm improves data extraction efficiency by dynamically prioritizing sources and adjusting crawling frequencies based on data relevance and volatility, thereby ensuring the capture of the most current and critical information .

The system employs intelligent scheduling mechanisms to dynamically prioritize data sources and adjust crawling frequencies to capture the most relevant and up-to-date information. Automated monitoring tools track source performance and allow real-time adjustments to maintain data reliability and freshness .

The primary challenges include handling the ever-growing volume of healthcare data, ensuring reliability and freshness of extracted data, and refining the system's scheduling and monitoring capabilities to adapt to varying data relevance and volatility .

Disease Prediction System
No ratings yet
Disease Prediction System
10 pages
COMSATS University Islamabad (CUI) : (SDD Document)
No ratings yet
COMSATS University Islamabad (CUI) : (SDD Document)
40 pages
Hospital ER
No ratings yet
Hospital ER
21 pages
Human Disease5
No ratings yet
Human Disease5
12 pages
Medical Report Management & Distribution System On Blockchain
No ratings yet
Medical Report Management & Distribution System On Blockchain
8 pages
Health Record System SRS Document
No ratings yet
Health Record System SRS Document
10 pages
Efficient Doctor-Patient Portal SRS
No ratings yet
Efficient Doctor-Patient Portal SRS
24 pages
Project D RPM
No ratings yet
Project D RPM
4 pages
Nabimara Charles Cit PGD Report
No ratings yet
Nabimara Charles Cit PGD Report
56 pages
Patient-Zero Home Care Management
No ratings yet
Patient-Zero Home Care Management
226 pages
Synopsis For Final Year Odd Semester Project (GITASRI DAS)
No ratings yet
Synopsis For Final Year Odd Semester Project (GITASRI DAS)
10 pages
Health Monitoring System Srs
No ratings yet
Health Monitoring System Srs
13 pages
Chapter 1. Systems Introduction: 1.1 Description of The Project
No ratings yet
Chapter 1. Systems Introduction: 1.1 Description of The Project
52 pages
Chapter 1. Systems Introduction: Description of The Project
No ratings yet
Chapter 1. Systems Introduction: Description of The Project
52 pages
Blood Bank Management System Design
No ratings yet
Blood Bank Management System Design
99 pages
ICTACT
No ratings yet
ICTACT
5 pages
HDIS Plan2
No ratings yet
HDIS Plan2
7 pages
DBMS Project Prosposal by Ali Ahmad, Sultan
No ratings yet
DBMS Project Prosposal by Ali Ahmad, Sultan
9 pages
Healthcare System Project for BCA
No ratings yet
Healthcare System Project for BCA
8 pages
Se Lab6
No ratings yet
Se Lab6
19 pages
HDIMS Webplan1
No ratings yet
HDIMS Webplan1
5 pages
MedChain: Off-Chain Healthcare Storage
No ratings yet
MedChain: Off-Chain Healthcare Storage
6 pages
Hospital Management System (SRS) - 1
No ratings yet
Hospital Management System (SRS) - 1
6 pages
Hadoop in Healthcare: A Case Study
No ratings yet
Hadoop in Healthcare: A Case Study
9 pages
Medicine Recommendation System Overview
No ratings yet
Medicine Recommendation System Overview
16 pages
SE Lab File
No ratings yet
SE Lab File
8 pages
Ieee (Cvo)
No ratings yet
Ieee (Cvo)
5 pages
Synopsis of Smart Health Prediction
50% (4)
Synopsis of Smart Health Prediction
22 pages
Scalable AWS Healthcare Data System
No ratings yet
Scalable AWS Healthcare Data System
7 pages
Heart Disease Prediction 6
No ratings yet
Heart Disease Prediction 6
60 pages
HealthStack Hospital Management System
No ratings yet
HealthStack Hospital Management System
30 pages
Software Engineering Capstone: Raja Sekhar Velagaleti 1293165
No ratings yet
Software Engineering Capstone: Raja Sekhar Velagaleti 1293165
10 pages
Building Doctor Platform - Research Paper
No ratings yet
Building Doctor Platform - Research Paper
5 pages
Susan SE
No ratings yet
Susan SE
33 pages
Srs 3
No ratings yet
Srs 3
16 pages
IoT Da1
No ratings yet
IoT Da1
5 pages
Mic 3 Report
No ratings yet
Mic 3 Report
70 pages
Ip Project Covid
No ratings yet
Ip Project Covid
47 pages
Midpresentation Report-2024
No ratings yet
Midpresentation Report-2024
21 pages
Advanced EHR Requirements Document AI ML
No ratings yet
Advanced EHR Requirements Document AI ML
4 pages
NHS
No ratings yet
NHS
51 pages
Health SH Pere
No ratings yet
Health SH Pere
17 pages
CISP Project
No ratings yet
CISP Project
24 pages
Health Care Information Systems
No ratings yet
Health Care Information Systems
9 pages
HealthHub: Hospital Management System
No ratings yet
HealthHub: Hospital Management System
34 pages
Smart Patient Information System PDF
No ratings yet
Smart Patient Information System PDF
20 pages
HMS Patient Portal Overview
No ratings yet
HMS Patient Portal Overview
29 pages
Health Predict
No ratings yet
Health Predict
143 pages
Wa0012.
No ratings yet
Wa0012.
9 pages
Covid-19 Data Extraction Project
No ratings yet
Covid-19 Data Extraction Project
17 pages
(Solved) Case Study - GlobalHealth Innovations LTD, A Leading Healthcare... - Course Hero
No ratings yet
(Solved) Case Study - GlobalHealth Innovations LTD, A Leading Healthcare... - Course Hero
6 pages
MPR PDF
No ratings yet
MPR PDF
32 pages
Medical Diagnosis System Project 36
No ratings yet
Medical Diagnosis System Project 36
11 pages
Hospital Record Management Feasibility
No ratings yet
Hospital Record Management Feasibility
40 pages
IoT and NoSQL in Pervasive Healthcare
No ratings yet
IoT and NoSQL in Pervasive Healthcare
31 pages
Essential Project Scope Checklist
No ratings yet
Essential Project Scope Checklist
1 page
AIS Chapter 17
No ratings yet
AIS Chapter 17
15 pages
Chapter 4 Diagnosing Organizations
No ratings yet
Chapter 4 Diagnosing Organizations
14 pages
Applied Epidemiology Competencies
No ratings yet
Applied Epidemiology Competencies
44 pages
Characterization of Brain Tissue Phantom Using An Indentation Device and Inverse Finite Element
No ratings yet
Characterization of Brain Tissue Phantom Using An Indentation Device and Inverse Finite Element
32 pages
System Analysis and Designs Notes
No ratings yet
System Analysis and Designs Notes
63 pages
PI in Compliance
No ratings yet
PI in Compliance
36 pages
BIT 2212 Business Systems Modelling Introduction
No ratings yet
BIT 2212 Business Systems Modelling Introduction
11 pages
Loan Monitoring System Abstract
No ratings yet
Loan Monitoring System Abstract
7 pages
Engineering Management
No ratings yet
Engineering Management
3 pages
Project Requirements 23.24
No ratings yet
Project Requirements 23.24
17 pages
Software Requirements Specification: Version 1.0 Approved
No ratings yet
Software Requirements Specification: Version 1.0 Approved
28 pages
Advanced Project Management - A Structured Approach PDF
83% (6)
Advanced Project Management - A Structured Approach PDF
336 pages
Org Man - Q2 M1
No ratings yet
Org Man - Q2 M1
15 pages
LEVEL I - ATA 45 Onboard Maintenance System
100% (9)
LEVEL I - ATA 45 Onboard Maintenance System
40 pages
Quiz Number 1-2
No ratings yet
Quiz Number 1-2
2 pages
AIS Chapter 3
100% (1)
AIS Chapter 3
38 pages
SE III Unit Notes
No ratings yet
SE III Unit Notes
39 pages
Chapter Eight: Principles, Objectives and General Ap-Proaches Relating To Community-Based Rehabilitation
No ratings yet
Chapter Eight: Principles, Objectives and General Ap-Proaches Relating To Community-Based Rehabilitation
5 pages
Health Governance in Low-Income Nations
No ratings yet
Health Governance in Low-Income Nations
8 pages
Risk-Based Process Safety Guide
No ratings yet
Risk-Based Process Safety Guide
5 pages
Computer Studies Syllabus
No ratings yet
Computer Studies Syllabus
19 pages
Roy's Theory
No ratings yet
Roy's Theory
114 pages
Chemical Process Simulation Guide
No ratings yet
Chemical Process Simulation Guide
74 pages
Rumbaugh's Object Modelling Technique (OMT) - : Unit Ii Object Oriented Methodologies 9
No ratings yet
Rumbaugh's Object Modelling Technique (OMT) - : Unit Ii Object Oriented Methodologies 9
71 pages
Science MYP 1 Grade 6 CO For 23-24
No ratings yet
Science MYP 1 Grade 6 CO For 23-24
3 pages
Software Engineering Exam 2021
No ratings yet
Software Engineering Exam 2021
1 page
APTSkuliah Agustus 2023
No ratings yet
APTSkuliah Agustus 2023
67 pages
Mini Project 2-2
No ratings yet
Mini Project 2-2
18 pages
Sample On Sequence Flow of Project Report
No ratings yet
Sample On Sequence Flow of Project Report
10 pages