Extensive Web Data Extraction
CHAPTER III – SYSTEM DESIGN deals with the design issues of the system. It
presents related diagrams like architectural diagram, data flow diagram, module design
diagram, and activity diagram.
CHAPTER IV – SYSTEM TESTING gives the various test cases for the application.
CHAPTER I
INTRODUCTION
nference partners with medical centres to turn decades of rich and predominantly
unstructured data captured in EMRs into powerful solutions that enable scientists to
discover and develop the next-generation of personalised diagnostics and treatments for
patients worldwide. The company believes that the greatest opportunity for our generation
to elevate human health is by developing technology to curate and synthesise the world’s
biomedical data in order to enable that scientific discovery. The platform uses artificial
intelligence and machine learning to extract insights from unstructured and structured data
from biomedical literature, as well as large scale molecular and real-world datasets. The
software is marketed towards pharmaceutical research and development and life sciences
companies.
1.2.1 OBJECTIVE
The objective of this project is to optimise data extraction processes within the
healthcare domain, ensuring efficient scheduling and scalability. By focusing exclusively
on healthcare-related data sources, the project aims to support multiple in-house
applications of the biomedical company.
1.2.2 SCOPE
● Focus on optimising the scheduling of healthcare data sources to ensure timely
and efficient extraction.
● Expand the scope of data extraction to cover a wide range of healthcare-related
sources, catering specifically to the needs of the biomedical company.
● Provide a reliable and comprehensive dataset for various in-house applications,
facilitating informed decision-making and enhancing operational efficiency
within the company's ecosystem.
1.2.3 USERS
The users of the product are mostly people who work in the biomedical domain.
Some of the users are as follows:
● Doctors
● Pharmaceutical companies
● Biomedical researchers
● Hospitals
● Medical Students
1.2.4 LIMITATIONS
● Users of the tool need to have domain knowledge of biomedicine in order to make
use of most of the tools.
● The tool that the company provides is not free for all to use; it is a pay-per-use
platform.
The system environment gives the minimal hardware and software requirements for the
application to run.
HARDWARE ENVIRONMENT
Device Used – Apple MacBook Pro
● RAM – 8 GB
● Hard Disk – 1 TB
● Processor – Apple M1 chip
SOFTWARE ENVIRONMENT
● Operating System – macOS
● Database – MongoDB
● Backend – Python
● Containerisation and Orchestration – Docker, Kubernetes
● IDE: Visual Studio Code
CHAPTER II
SYSTEM ANALYSIS
System analysis is the process of studying the system in order to identify its goals
and purpose and to provide a better understanding of the system's requirements.
This chapter gives a brief discussion of the detailed study of the proposed system and
the different functionalities involved in the system.
● Data Extraction Pipeline: The system comprises a robust data extraction pipeline
leveraging Python scripts orchestrated within Docker containers. These scripts
systematically crawl healthcare-related websites and extract structured data using
web scraping techniques (a minimal sketch follows this list).
● Scalable Infrastructure: Kubernetes manages the deployment and scaling of Docker
containers, ensuring efficient resource allocation and high availability. MongoDB
serves as the database for storing extracted data, while Minio provides scalable
cloud storage, accommodating the growing volume of healthcare data.
● Intelligent Scheduling: The system incorporates intelligent scheduling algorithms
to prioritise data sources, optimise crawling frequency, and dynamically adjust
schedules based on the changing relevance and volatility of healthcare data.
● Automated Monitoring: Tools such as Grafana track source performance and adjust
schedules in real time to maintain data freshness and reliability.
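As a concrete illustration of the extraction step described in the first bullet, the following Python sketch fetches a page and pulls fields out of its HTML with XPath. The URL, XPath expressions, and field names are hypothetical placeholders, not the system's actual configuration.

# Minimal sketch of an XPath-based extraction step (illustrative only).
# The URL, XPath expressions, and field names below are hypothetical.
import json

import requests
from lxml import html


def extract_articles(url: str) -> list[dict]:
    """Fetch a page and pull article fields via XPath."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    tree = html.fromstring(response.content)

    records = []
    # Each <div class="article"> is assumed to hold one record.
    for node in tree.xpath('//div[@class="article"]'):
        records.append({
            "title": node.xpath('string(.//h2)').strip(),
            "published": node.xpath('string(.//span[@class="date"])').strip(),
            "summary": node.xpath('string(.//p[@class="abstract"])').strip(),
        })
    return records


if __name__ == "__main__":
    data = extract_articles("https://example.org/healthcare-news")
    print(json.dumps(data, indent=2))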
Python serves as the primary backend language for healthcare data processing. Docker containers are utilised for seamless containerisation,
enhancing portability and ease of deployment. Kubernetes orchestrates the deployment and
scaling of these containers, optimising resource allocation and guaranteeing high
availability of services. MongoDB acts as the backbone of the system, providing a reliable
and scalable database solution for storing the vast amounts of extracted data.
Complementing this, Minio offers cloud storage capabilities, enabling seamless scalability
and accessibility across distributed environments.
One of the system's most notable features is its intelligent scheduling mechanism,
which dynamically prioritises data sources and adjusts crawling frequencies based on their
relevance and volatility. This ensures that the most critical and up-to-date information is
captured efficiently. Additionally, automated monitoring tools continuously track the
performance of data sources, enabling real-time adjustments to schedules to maintain data
freshness and reliability.
● User-friendly Interface: Provides intuitive access to extracted healthcare data and
insights, supported by visualisation tools for exploring trends and correlations,
facilitating informed decision-making and strategic planning.
PYTHON
MONGODB
MongoDB is a popular NoSQL database solution known for its flexibility and
scalability. It stores data in a document-oriented format, allowing for the storage of
structured and unstructured data without requiring a predefined schema. MongoDB's
distributed architecture and horizontal scaling capabilities make it well-suited for handling
large volumes of data and high-throughput applications. It is widely used in modern web
development, big data analytics, and real-time processing applications.
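The flexible, schemaless document model described above can be illustrated with a short pymongo sketch. The connection string, database, and collection names below are assumptions made only for the example.

# Illustrative sketch of MongoDB's schemaless document model using pymongo.
# Connection string, database, and collection names are hypothetical.
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["healthcare"]["articles"]

# Two documents with different shapes can live in the same collection,
# since MongoDB does not require a predefined schema.
collection.insert_many([
    {"title": "Trial results", "authors": ["A. Rao"], "year": 2023},
    {"title": "Drug label update", "source_url": "https://example.org/label",
     "fetched_at": datetime.now(timezone.utc)},
])

print(collection.count_documents({}))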
DOCKER
KUBERNETES
A use-case model is a model of how different types of users interact with the system
to solve a problem. As such, it describes the goals of the users, the interactions between
the users and the system, and the required behaviour of the system in satisfying these goals.
The use case diagram for the proposed system is shown in Figure 2.1 below.
Figure 2.1 Use Case Diagram of the Crawling Infrastructure System
EXTENDED USE CASES
Pre-condition: The user must have access to the website to be crawled
Post-condition: Config script is written
Successful scenario:
● The developer will be able to write a successful config script in order to store the
data in a structured format.
● Relevant XPaths are written to handle the complex structure of the website being
extracted (a minimal config sketch follows this use case).
Exceptions: If the website doesn’t allow the user to extract the data from its HTML
content, then this module cannot be executed.
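A config script of the kind described in this use case might take roughly the following shape. The real configuration format of the crawling infrastructure service is internal to the company, so this is only an illustrative sketch with hypothetical field names and XPaths.

# Hypothetical shape of a source configuration; the real config format used
# by the crawling infrastructure service is internal, so this is only a sketch.
SOURCE_CONFIG = {
    "source_name": "example_health_portal",
    "start_url": "https://example.org/articles",
    "refresh_interval_days": 7,
    "xpaths": {
        # One XPath per field to be extracted from the page HTML.
        "title": '//h1[@class="article-title"]/text()',
        "published_date": '//time[@itemprop="datePublished"]/@datetime',
        "body": '//div[@class="article-body"]//p/text()',
    },
    # Pagination handled by following the "next" link until it disappears.
    "next_page_xpath": '//a[@rel="next"]/@href',
}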
Scheduling Module
MongoDB Upload Module
2.3.1 FUNCTIONAL REQUIREMENTS
Select and configure websites for data extraction, including specifying URLs,
crawling parameters, and authentication settings, to ensure precise data retrieval from
targeted sources.
Data Extraction:
The system extracts data from configured websites according to predefined rules,
performing preprocessing, transformation, and integrity checks to ensure accurate and
usable data.
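The preprocessing, transformation, and integrity checks mentioned above could be sketched as follows; the field names and validation rules are hypothetical and stand in for the project's actual rules.

# Minimal sketch of the kind of preprocessing and integrity checks the
# requirement describes; field names and rules here are hypothetical.
from datetime import datetime


def clean_record(raw: dict) -> dict:
    """Normalise whitespace, parse dates, and validate required fields."""
    record = {k: v.strip() if isinstance(v, str) else v for k, v in raw.items()}

    # Integrity check: every record must carry a title and a source URL.
    for required in ("title", "source_url"):
        if not record.get(required):
            raise ValueError(f"missing required field: {required}")

    # Transformation: normalise the published date to ISO format if present.
    if record.get("published"):
        record["published"] = datetime.strptime(
            record["published"], "%d %b %Y"
        ).date().isoformat()
    return record


print(clean_record({"title": " Gene therapy trial ",
                    "source_url": "https://example.org/a/1",
                    "published": "05 Mar 2024"}))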
Depending on how frequently the website refreshes or updates its data, customised
scripts are written to retrieve the data in a timely manner.
The system conducts routine health checks, implements continuous integration and
deployment practices, and maintains version control and documentation to ensure the
smooth operation and evolution of the existing architecture.
Performance:
● The average data extraction time for a single website should not exceed 5 seconds
to ensure timely delivery of information.
Usability:
● The system should provide clear error messages and instructions to assist users in
resolving issues encountered during website configuration.
Security:
● User authentication should be enforced using strong password policies and multi-
factor authentication to prevent unauthorised access.
Scalability:
Quality:
● Code reviews should be conducted for all changes to ensure adherence to coding
standards and best practices.
Safety:
2.4 TEST PLAN
Test Plan is the strategy used to verify and ensure that a product or system meets
its design specifications and other requirements.
TEST SCOPE
The entire application consists of multiple modules which work cohesively to produce
the desired output.
● Ensure that scripts are generated accurately based on input data sources.
● Validate that the generated scripts adhere to coding standards and best practices.
● Verify the compatibility of generated scripts with the backend software
environment.
● Test the scalability of script generation for a large number of data sources.
Scheduling Module:
CHAPTER III
SYSTEM DESIGN
System design is the process of defining the architecture, product design, modules,
interfaces, and data for a system to satisfy specified requirements. System design could be
seen as the application of systems theory to product development.
The system is designed to be user friendly and easy to navigate. It works only with an
active connection to the Internet and is built on a web server. The data used by the
various in-house applications is read from MongoDB, the database into which external
data crawled from various reliable sources is loaded.
Figure 3.1 Design Architecture of the Crawling Infrastructure System
Figure 3.1 depicts the overall architecture of the system. The architecture
of the "Extensive Web Data Extraction" project centres on a scalable and efficient
framework tailored for healthcare data extraction. Python serves as the backbone for
backend development, providing flexibility and a rich library ecosystem. Docker
containerization and Kubernetes orchestration ensure seamless deployment and
management of microservices across diverse environments, enhancing portability and
scalability.
MongoDB and Minio offer robust storage solutions for structured and unstructured
healthcare data, while Linux servers provide a stable and reliable environment for data
extraction. The design emphasises optimising extraction processes, enhancing scheduling
efficiency, and ensuring system reliability to meet the needs of in-house applications within
the biomedical company's ecosystem.
MODULE DIAGRAM
Module diagrams are used to show the allocation of classes and objects to
modules in the physical design of a system; that is, module diagrams indicate the
partitioning of the system architecture. Through module diagrams, it is possible to
understand the general physical architecture of a system. The proposed system has the
following modules:
Scripts Generation Module
● There are certain websites which provide downloadable data that is ingested into
the crawling infrastructure system.
● For these types of sources, scripts are written in Python, and the sources are hosted
in Docker and scheduled in Kubernetes according to the refresh interval, which
varies from source to source (a minimal sketch follows this list).
● Some sources update the data on the website itself and do not provide any
downloads. For these types of sources, the content is extracted from the website
HTML itself.
● XPaths are used to extract the HTML content and are fed into the crawler
component that is already defined in the crawling infrastructure service.
● The extracted content is then converted into JSON format, which is suitable for
storage in the MongoDB database.
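As referenced in the list above, a download-based source script might look roughly like the following sketch, which fetches a hypothetical CSV export and converts it to JSON for later upload.

# Sketch of a download-based source script: fetch a (hypothetical) CSV file,
# convert each row to a JSON document, and write the result to disk.
import csv
import io
import json

import requests

DOWNLOAD_URL = "https://example.org/exports/clinical_trials.csv"  # hypothetical


def download_as_json(url: str, out_path: str) -> int:
    response = requests.get(url, timeout=60)
    response.raise_for_status()

    reader = csv.DictReader(io.StringIO(response.text))
    documents = [dict(row) for row in reader]

    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(documents, fh, indent=2)
    return len(documents)


if __name__ == "__main__":
    count = download_as_json(DOWNLOAD_URL, "clinical_trials.json")
    print(f"wrote {count} documents")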
Scheduling Module
● The code files generated in the previous modules, such as the Scripts Generation
Module and Config Generation Module, are then scheduled on the server to ensure
timely delivery of the updated data (see the sketch after this list).
● The sources are scheduled according to the refresh frequency specified by the
website.
● The sources are scheduled across multiple Kubernetes pods.
● Pods serve as the basic scheduling unit in Kubernetes, encapsulating an
application's Docker containers and associated resources.
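The scheduling described above can be sketched with the official Kubernetes Python client by creating a CronJob whose schedule mirrors the source's refresh interval. The image name, namespace, and schedule below are hypothetical, and the production setup inside CIS may differ.

# Sketch of scheduling a source as a Kubernetes CronJob with the official
# Python client. Image name, namespace, and schedule are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

container = client.V1Container(
    name="example-source-crawler",
    image="registry.example.com/crawlers/example-source:latest",
)
job_spec = client.V1JobSpec(
    template=client.V1PodTemplateSpec(
        spec=client.V1PodSpec(containers=[container], restart_policy="Never")
    )
)
cronjob = client.V1CronJob(
    metadata=client.V1ObjectMeta(name="example-source-crawler"),
    spec=client.V1CronJobSpec(
        schedule="0 2 * * 0",  # weekly, matching the source's refresh interval
        job_template=client.V1JobTemplateSpec(spec=job_spec),
    ),
)

client.BatchV1Api().create_namespaced_cron_job(namespace="crawlers", body=cronjob)

A CronJob is used in the sketch because it maps naturally onto a per-source refresh interval; continuously updated sources could instead be run as long-lived Deployments.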
MongoDB Upload Module
● The data from the scheduled sources is then uploaded to MongoDB, the primary
database used by multiple in-house applications of the company.
● To reduce duplication of redundant data and to update existing data without
deleting key parameters of the older version, upsert-style logic is applied during
upload (see the sketch after this list).
● Timestamps are added when data is inserted or updated so that the history of
changes can be traced easily if any issue occurs while loading data into MongoDB.
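The duplicate handling and timestamping described above can be sketched with pymongo upserts: each document is matched on a stable key, the original creation timestamp is preserved, and the update timestamp is refreshed on every change. The key, field, and collection names are assumptions for the example.

# Sketch of the duplicate-free upload logic: upsert each document on a
# stable key, preserving created_at and recording updated_at on every change.
from datetime import datetime, timezone

from pymongo import MongoClient, UpdateOne

collection = MongoClient("mongodb://localhost:27017")["healthcare"]["articles"]


def upload(documents: list[dict]) -> None:
    now = datetime.now(timezone.utc)
    operations = []
    for doc in documents:
        operations.append(UpdateOne(
            {"source_url": doc["source_url"]},        # de-duplication key
            {
                "$set": {**doc, "updated_at": now},    # refresh fields on change
                "$setOnInsert": {"created_at": now},   # kept from the first load
            },
            upsert=True,
        ))
    if operations:
        result = collection.bulk_write(operations)
        print(f"inserted {result.upserted_count}, modified {result.modified_count}")

Running upload() twice with the same documents leaves the collection unchanged apart from updated_at, which is the behaviour the module requires.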
Figure 3.3 illustrates the class diagram of the proposed system
Figure 3.3 depicts the class diagram of the system. It contains various classes, which
are explained as follows:
Authentication of User
● This class ensures that the developer has authorised access to log in and use the
proposed system.
● The developer has to provide a username and password in order to use the proposed
system.
● The developer identifies reliable healthcare source providers in order to ingest their
data into the existing system.
● The developer also identifies sources according to the client's requirements.
Check Licensing
● The developer checks the licensing requirements provided by the source
website.
● The robots.txt file specifies which web pages are and are not available for
crawling.
● The developer checks for any downloads that are provided by the website itself.
● If downloads are available, the developer utilises the ScriptGeneration class; if no
downloads are available, the developer uses the ConfigGeneration class.
ScriptGeneration
● Python scripts are written according to the format and structure of the
downloadable file.
● The scripts are written in such a way that all of the data provided in the file is
extracted and loaded in JSON format.
ConfigGeneration
● If there are no downloads, then the data is extracted from the website's HTML
content itself.
● XPaths are written to handle complex website HTML structure and data is
extracted.
Deploy In CIS
● The Crawling Infrastructure Service (CIS) hosts all the sources that are used to
extract reliable healthcare data.
● The sources are stored as Docker containers that contain the script code.
● Docker containers are built and deployed in CIS (a minimal build-and-push sketch
follows this list).
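Building and deploying a source container, as described in the last bullet, could be sketched with the Docker SDK for Python. The registry, repository, and build path are hypothetical, and the actual CIS deployment may rely on a CI pipeline instead.

# Sketch of building and pushing a source's Docker image with the Docker SDK
# for Python. Registry, repository, and build path are hypothetical.
import docker

REPOSITORY = "registry.example.com/crawlers/example-source"

client = docker.from_env()

# Build the image from the directory containing the Dockerfile and script code.
image, build_logs = client.images.build(path="./example-source", tag=f"{REPOSITORY}:latest")

# Push the tagged image to the registry used by the crawling infrastructure.
for line in client.images.push(REPOSITORY, tag="latest", stream=True, decode=True):
    print(line.get("status", ""))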
Schedule Kubernetes
MongoDBUpload
● Duplicate data is eliminated and existing documents are updated accordingly.
● Timestamps are added to each document so it is easy to check when the data was
updated.
SEQUENCE DIAGRAM
The sequence diagram represents the flow of messages in the system and is also
termed an event diagram. It helps in explaining several dynamic scenarios. It portrays
the communication between the user and the system as a time-ordered sequence of events
exchanged between the lifelines that take part at run time.
Figure 3.4 depicts the sequence diagram of the entire system and the activity involved
in all the modules.
The sequence diagram of the proposed system, as depicted above, explains how the
system works seamlessly without any interruption of any sort.
● The developer logs into the proposed system using the username and password
required to access it.
● The source is identified according to the client's requirements and needs.
● Licensing for the source is checked, and if the website restricts crawling entirely,
then the user quits the system.
● If the site provides downloads, the file is downloaded, and the data is extracted from
the file and stored in JSON format.
● If the site doesn't provide downloads, data is extracted from the HTML content of
the webpage using XPaths and stored in JSON format.
● The source is then deployed in the Crawling Infrastructure Service and scheduled
in the Kubernetes cluster, which obtains the refresh frequency for that particular
source from the previous module.
● Finally, the data in JSON format is loaded into MongoDB in a scheduled manner
without duplicating the data.
User interface design or user interface engineering is the design of user interfaces
for machines and software, such as computers, home appliances, mobile devices, and other
electronic devices, with the focus on maximising usability and the user experience.
Screen 3.1 displays the different types of corpuses present in the Crawling Infrastructure
Service
On this page, the user can view the list of corpuses existing in the Crawling
Infrastructure Service, along with the number of documents crawled and the latest update date.
Screen 3.2 depicts the graph visualisation of the corpus statistics
On this page of the UI, the user can easily interpret the number of documents present
in each corpus.
Screen 3.3 illustrates the status of each source that is deployed in CIS.
On this page, the user can easily identify the status of each source. This aids in
debugging a source if it is not crawled for any specific reason or issue.
Screen 3.4 allows the user to post a crawling request to the developers and provides the
estimated time of completion according to its priority.
Any software system must be deployed under highly favourable conditions and
environment to obtain optimum results. A slight variation in the implementation process may
lead to errors or failure of the system. To give a better understanding of the deployment
environment of the proposed system, the deployment diagram is outlined below.
Figure 3.5 Deployment Diagram
Flow Diagram
Flow diagrams are used to graphically represent the flow of data in a business
information system. A flow diagram describes the processes involved in a system that
transfer data from the input to file storage and report generation.
Figure 3.6 depicts the flow of the entire system once the user gets started in the
proposed system.
Figure 3.6 Flow Diagram of the System
The developer gathers the requirements of the user and identifies new sources to crawl
in order to extract reliable healthcare information that caters to the needs of the company
or of a particular in-house application that uses the crawled data. The developer then
checks the licensing information provided by the website and adheres to the restrictions
stated for that particular website or source. Once the analysis
of licensing is done and the source is ready to crawl, then the developer checks if there are
any downloads provided in the source itself. The developer then follows one of the
following steps:
● The developer follows the Script generation module if downloads are found.
● The developer follows the Config Generation module if downloads are not found.
The developer then deploys either the script or the config in the Crawling Infrastructure
Service, where the scheduling Kubernetes clusters are hosted. The source is scheduled
according to the refresh frequency provided on the website, and the data is updated in
MongoDB accordingly without creating any duplicates in the database. From this
MongoDB instance, multiple in-house applications extract the data according to their
requirements.
CHAPTER IV
SYSTEM TESTING
A Test Plan documents the strategy that will be used to verify and ensure that a
product meets its system design specifications. Test cases are built around the requirements
and specifications, i.e., what the system is supposed to do.
● PASS
○ All expected results are achieved and/or all unexpected events are resolved.
● PASS WITH EXCEPTIONS
○ Unexpected events require alternative procedures that have been
implemented; such events are called exceptions.
● FAIL
○ The testing process response does not conform to the expected results.
Table 4.1 contains the list of test cases and their respective test reports.
CHAPTER V
SYSTEM IMPLEMENTATION
The implementation phase is the phase in which the project plan is put into motion and
the work of the project is performed. It is important to maintain control and communicate
as needed during implementation. Progress is continuously monitored and appropriate
adjustments are made and recorded as variances from the original plan. In this phase, one
can build the components either from scratch or by composition. Given the architecture
document from the design phase and the requirement document from the analysis phase,
one can build exactly what has been requested.
Installing Python:
● Run the installer: Once the installer is downloaded, locate the file and run it. Follow
the on-screen instructions to complete the installation process. On Windows, you
may need to confirm any security prompts or user account control dialogs.
● Launch Visual Studio Code: After installation is complete, you can launch Visual
Studio Code from your desktop or application menu. Upon opening, you'll be
greeted with the editor interface, ready for you to start coding.
● Download the latest Studio 3T .dmg file. Remember to select Intel or Apple Silicon
on the download page to get the correct version for your Mac.
● Open the .dmg file and drag the Studio 3T application into the Applications folder.
● Log in to the existing MongoDB servers by clicking the New Connection
button.
Installing Docker:
Installing Kubernetes:
● Install Minikube: Minikube is a tool that lets you run Kubernetes locally. Visit the
Minikube GitHub page at https://github.com/kubernetes/minikube/releases and
download the appropriate version for your operating system.
● Install a Hypervisor (if required): Minikube requires a hypervisor to create a virtual
machine to run Kubernetes. Install a hypervisor such as VirtualBox or Hyper-V
based on your operating system's requirements.
● Start Minikube: Once Minikube is installed, open a terminal or command prompt
and run the command minikube start. This command starts a local Kubernetes
cluster using Minikube. It may take a few minutes to download the required
dependencies and start the cluster.
● Verify Installation: After Minikube starts successfully, you can verify the
installation by running kubectl version. This command should display both the
client and server versions of Kubernetes, indicating that Minikube has been
installed and is running correctly (a Python-based connectivity check is sketched
after this list).
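In addition to the kubectl check above, the cluster can also be reached from Python using the official Kubernetes client, which is convenient for smoke-testing the environment the crawler scripts will run in. This is an optional check and an assumption of this report's editing, not part of the original installation steps.

# Optional sketch: confirm that the local Minikube cluster is reachable
# via the official Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()                 # reads the kubeconfig set up by Minikube
version = client.VersionApi().get_code()  # server build information
print(f"Kubernetes server version: {version.git_version}")

for node in client.CoreV1Api().list_node().items:
    print("node:", node.metadata.name)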
CHAPTER VI
CONCLUSION
FUTURE ENHANCEMENT