“Data Engineering”
A project with
Innoplexus Consulting Services Pvt. Ltd
Submitted to
VIT Pune
for the degree of
Bachelor of Technology
in Electronics and Telecommunication
by
Shaunak Dhande
(GR No. 11911401)
Certificate
This is to certify that the internship report titled "Data Engineering" at Innoplexus Consulting
Services Pvt. Ltd, submitted by Shaunak Dhande (GR No. 11911401), is a record of bonafide
work carried out by him under the guidance of industry mentor Mr. Jaimin Mehta and college
mentor Prof. Jyoti Madake, in partial fulfillment of the requirements for the award of the degree of
Bachelor of Technology (Final Year) in Electronics and Telecommunication.
Acknowledgement
I wish to express my gratitude to Prof. (Dr.) R. Jalnekar, Director, VIT Pune, for providing
the facilities of the Institute and for his encouragement during the course of this work. I also
express my deep gratitude to Prof. (Dr.) Shripad Bhatlawande, Head of the Department
of Electronics and Telecommunication Engineering, VIT Pune, for his guidance and support.
I would like to thank my industry mentor Mr. Jaimin Mehta and the employees of
Innoplexus for providing me with guidance and help at every step of the way during the course
of this internship, for imparting invaluable knowledge, and for teaching me the
etiquette of a professional employee.
I would also like to gratefully acknowledge the enthusiastic supervision of my internship
guide, Prof. Jyoti Madake, for her continuous, valuable guidance, patience, constant care, and
kind encouragement throughout the internship work, which enabled me to present this internship report
effectively.
Finally, I wish to thank my family members and my friends, who have always been very
supportive and encouraging.
(Shaunak Dhande)
Date: Dec 2022
INDEX
Chapter 1 Introduction to Innoplexus
1.1 Life Science AI
1.2 Ontosight
1.3 Technology
Chapter 2 Data Extraction, Cleaning and Porting
2.1 Data Extraction
2.2 Data Cleaning
2.3 Data Porting
Chapter 3 Automation
3.1 Apache Airflow
3.2 Pip Package Creation
Chapter 4 Full-Stack Development
4.1 Mongo and ES Comparison
4.2 Data Verification
4.3 Dashboard Frontend
Chapter 5 Conclusion
ABSTRACT
Data engineering is the practice of designing and building systems for collecting, storing, and
analyzing data at scale. It is a broad field with applications in just about every industry. Organizations
have the ability to collect massive amounts of data, and they need the right people and technology to
ensure it is in a highly usable state by the time it reaches data scientists and analysts.
The data engineering team at Innoplexus works on data extraction, data cleaning, data porting, data
validation, and other data-oriented tasks. The company uses MongoDB as its primary database and Elasticsearch
as its search engine. The combination of both is called CLS, where the final cleaned data is stored. I
completed various tasks such as data extraction from web sources, data schema validation, data porting, project
automation, and full-stack development as per the company's requirements. This report describes all
the completed work and the experience gained in detail.
Chapter 1
1. Introduction to Innoplexus Consulting Services
Innoplexus offers Data as a Service (DaaS) and Continuous Analytics as a Service (CAAS) products.
It leverages Artificial Intelligence, proprietary algorithms, and patent-pending technologies to give global Life
Sciences and Pharmaceutical organizations access to relevant data, intelligence, and intuitive insights across the
pre-clinical, clinical, regulatory, and commercial stages of a drug. The company automates the collection and curation of
data using technologies such as natural language processing, network analysis, ontologies, computer vision, and
entity normalization. Its clients are Fortune 500 companies, including some of the world's largest
pharmaceutical companies.
1.1.1 Services
B. Dashboards:
Continuously crawled, real-time information from multiple publicly available sources in one place. Leverage
AI to stay up to date with guidelines, regulatory patents, market scenarios, and other areas of the drug discovery
landscape. Generate relevant insights for success.
C. Blockchain:
Innoplexus provides blockchain-based systems that enable companies to securely gain access to partners'
metadata on a single platform. It also allows full access upon agreement. Get access to more life sciences data
to generate new insights and for informed decision-making.
D. Indication Prioritization:
Prioritize indications based on logic scoring attained using various factors covering biological validation,
clinical trial valuation, and commercial evaluation. Fast-track research and decision-making to save time and
optimize costs for more success.
1.2 Ontosight:
Ontosight offers a holistic approach to research and discovery in pharma and life sciences by leveraging artificial
intelligence and a self-learning life sciences ontology. Innoplexus provides real-time insights by scanning 95%
of the world-wide web connecting trillions of points from structured and unstructured data.
The platform leverages advanced analytics techniques combined with AI technologies such as machine learning,
computer vision, and entity normalization to offer continuously updated data and insights from multiple sources
(publications, clinical trials, congresses, patents, grants, drug profiles, and gene profiles, etc.) to generate
relevant search results.
Ontosight enables pharmaceutical, CRO, biotech, and life sciences professionals to accelerate research and
development, with real-time insights at their fingertips. Ontosight modules and dashboards help to significantly
accelerate processes throughout the drug development cycle, from preclinical and clinical to regulatory and
commercial phases.
Influence helps identify the right KOLs for your business needs through multiple filter options and a
customizable scoring logic based on the research history and affiliations of KOLs. It offers CRM features to
plan, execute, and track KOL interactions, and a deep-dive option to analyze individual and top KOLs with
comprehensive research profiles.
Obtain up-to-date, relevant information and insights on top potential medical project participants and their
network to lead your next clinical research project to success.
The Discover engine leverages real-time, updated data from various scientific sources (publications, clinical
trials, congresses etc.) with the most comprehensive life sciences ontology for concept-based contextual and
relevant search results. It enables researchers to search for dissimilar concepts and discover previously unknown
information. Holistic and detailed deep-dive functionalities allow in-depth research and discovery.
Get summarized snapshots of up-to-date clinical development pipelines across indications, interventions, and
therapies to aid informed decisions for strategizing your clinical research projects.
Explore provides an overview of networks across various biomedical entities that facilitates identification of
strong and weak associations among drugs, targets, diseases, and pathways based on the entire life sciences
digital universe. Researchers can perform deep dives to obtain granular scientific information and predict
potential interventions for indications of interest, based on clinical data and network modularity.
Gain insights into previously hidden connections between biomedical entities and significantly reduce time,
costs, and effort spent on research and in experimental studies to accelerate your drug discovery processes.
Moreover, data from external sources such as IMS and publicly available data sources, including social
websites, patient and physician forums, and blogs, is also used to inform the leadership in real time so that
appropriate actions can be taken in a timely manner.
1.3 Technology
Innoplexus empowers decision-making by leveraging cutting-edge technology. Its innovation-led
technology development has resulted in 100+ patent applications, including 23 grants.
1.3.1 Blockchain
A smart contract system for searching unpublished data and making transactions. Real-time valuation of
unpublished data through AI. TruAgent enables integration of confidential data in a secure way.
1.3.5 Ontology
Mapping all discoverable concepts from content of all major data sources in the base ontology. Connecting
observations from curated sources and literature. Self-learning unseen concepts validated by random checks.
Chapter 2
2. Data Extraction, Cleaning and Porting
2.1 Data Extraction
Data is collected mainly from web sources and PDFs. A decided schema is specified for data crawling. Python
modules such as Beautiful Soup, Requests, and Selenium are used for data crawling. I completed data
extraction from more than 100 web sources. Fig. 1 explains the procedure for data extraction from web sources.
2.1.1 Requests
Python's Requests library is one of its essential components for sending HTTP requests to a given URL.
Requests must be mastered in order to work with these technologies, whether for REST APIs or
web scraping. A URL returns a response to a request. Python Requests comes with built-in tools for
handling both requests and responses.
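As a minimal sketch of how Requests is typically used in such crawling scripts (the URL below is a placeholder, not an actual company source):

import requests

# A hypothetical source URL; real sources are defined by the crawling schema.
response = requests.get("https://example.com/articles", timeout=30)
response.raise_for_status()   # stop early on HTTP errors
html = response.text          # raw HTML, to be parsed in a later step
print(response.status_code, len(html))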
2.1.2 Selenium
Selenium is an effective tool for automating and controlling web browsers through software.
It works across all major browsers and operating systems, and its scripts may be written in a number of
languages, including Python, Java, C#, and others. For our purposes, Python is used. WebDriver,
WebElement, and unit testing are a few of the key Selenium concepts.
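A minimal Selenium sketch along these lines might look as follows; the URL, the CSS selector, and the choice of Chrome are assumptions for illustration only:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Hypothetical example: render a JavaScript-heavy page and read article titles.
driver = webdriver.Chrome()   # assumes a Chrome driver is available on the machine
driver.get("https://example.com/articles")
for element in driver.find_elements(By.CSS_SELECTOR, "h2.title"):   # placeholder selector
    print(element.text)
driver.quit()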
2.1.3 BeautifulSoup
BeautifulSoup is used to extract information from HTML and XML files. It offers a parse tree together with
tools for navigating, searching, and modifying it.
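For illustration, a small BeautifulSoup sketch with a hypothetical HTML snippet and placeholder tag and class names:

from bs4 import BeautifulSoup

# Parse already-downloaded HTML; the tag and class names are placeholders.
html = "<html><body><h2 class='title'>Sample article</h2></body></html>"
soup = BeautifulSoup(html, "html.parser")
for heading in soup.find_all("h2", class_="title"):
    print(heading.get_text(strip=True))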
2.2 Data Cleaning
Data cleaning refers to the act of finding and fixing incomplete, inaccurate, or irrelevant sections of the data and
then replacing, changing, or deleting the dirty or coarse data from a record set, table, or database. Extracted
data for a record is in the form of JSON. This data is stored as a document in a particular schema. Each field value
in this schema is supposed to be non-empty and cleaned for further data usage.
Often, even after carefully extracting data and porting it to the base Mongo, the data is not clean. The
data is therefore cleaned using different machine learning approaches and Python packages. Schema validation scripts are
run over the ported Mongo to identify any schema or data type errors. After the data is finally cleaned, it is
ported and indexed to the further staged Mongo and Elasticsearch. Fig. 2 depicts the data cleaning steps.
Fig. 2: Data cleaning steps — ported Mongo → schema validation scripts → data error fixing.
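As a hedged sketch of what such a schema validation script can look like (the connection string, database and collection names, and the expected schema below are hypothetical, not the company's actual setup):

from pymongo import MongoClient

# Hypothetical expected schema: field name -> expected Python type.
EXPECTED_SCHEMA = {"title": str, "url": str, "published_date": str, "authors": list}

client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
collection = client["base_db"]["articles"]          # placeholder database and collection names

for doc in collection.find({}):
    for field, expected_type in EXPECTED_SCHEMA.items():
        value = doc.get(field)
        if value in (None, "", []):
            print(f"{doc['_id']}: field '{field}' is empty or missing")
        elif not isinstance(value, expected_type):
            print(f"{doc['_id']}: field '{field}' is {type(value).__name__}, expected {expected_type.__name__}")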
2.3 Data Porting
Data porting is the process of loading data documents from the sources to collection centers. Here, Mongo is
the main database used. The data extracted from sources is loaded into the primary Mongo. This data is systematic
and structured as collections. After cleaning the data, it is moved to another Mongo. Loading data from one
Mongo to another is also called porting. When the final confirmed data is loaded into the Elasticsearch engine, it
is called data indexing. A search engine is used by companies for solving data-related bugs, as search engines
work faster than databases. This complete procedure becomes very complex when dealing with millions of data
documents.
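The following is a simplified sketch of a porting and indexing step using pymongo and the Elasticsearch Python client; the hosts, database, collection, and index names are placeholders, and a production pipeline would add batching, retries, and checkpointing:

from pymongo import MongoClient
from elasticsearch import Elasticsearch, helpers

# Placeholder hosts, database, collection, and index names.
source = MongoClient("mongodb://source-host:27017")["base_db"]["articles"]
target = MongoClient("mongodb://target-host:27017")["staged_db"]["articles"]
es = Elasticsearch("http://localhost:9200")

actions = []
for doc in source.find({}):
    # Port the cleaned document into the staged Mongo (upsert keeps the run idempotent).
    target.replace_one({"_id": doc["_id"]}, doc, upsert=True)
    # Prepare the same document for indexing into Elasticsearch.
    doc_id = str(doc.pop("_id"))
    actions.append({"_index": "articles", "_id": doc_id, "_source": doc})

helpers.bulk(es, actions)   # bulk-index into Elasticsearch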
Chapter 3
3. Automation
Data automation is the practice of speeding up and automating the development cycles for data warehouses
while maintaining quality and consistency. Data warehouse automation (DWA) is expected to automate the
lifetime of a data warehouse, from source system analysis to testing to documentation.
3.1 Apache Airflow
3.1.1 Installation
1. Create a Virtual Environment
2. Prerequisites
Starting with Airflow 2.3.0, Airflow is tested with:
Python: 3.7, 3.8, 3.9, 3.10
Databases:
PostgreSQL: 10, 11, 12, 13, 14
MySQL: 5.7, 8
SQLite: 3.15.0+
MSSQL: 2017, 2019
Kubernetes: 1.20.2, 1.21.1, 1.22.0, 1.23.0, 1.24.0
3. Terminal command:
pip3 install apache-airflow
pip3 install typing_extensions
4. Creating an account:
airflow users create --username admin --firstname 'firstname' --lastname 'lastname' --role Admin --email 'email'
Sign in to your account to see all DAGs.
3.1.2 Creating DAGS
A DAG, or Directed Acyclic Graph, is a collection of all the tasks you want to run, organized in a way that
reflects their relationships and dependencies.
After installing Apache Airflow, create a dags folder in the airflow directory created by Airflow in the home
directory. All Python scripts to be automated will be saved here.
To create a DAG in Airflow, we first import the DAG class and then the Operators. Basically, for each Operator
you want to use, you have to make the corresponding import. For example, if you want to execute a Python
function, you have to import the PythonOperator. If you want to execute a bash command, you have to import
the BashOperator. The last import is the datetime class, as we need to specify a start date for the DAG.
A DAG object must have two parameters, a dag_id and a start_date. The dag_id is the unique identifier of the
DAG across all DAGs. Each DAG must have a unique dag_id. The start_date defines the date at which your
DAG starts being scheduled. If the start date is set in the past, the scheduler will try to backfill all the non-
triggered DAG Runs between the start date and the current date. For example, if your start date is defined as
a date 3 years ago, you might end up with many DAG Runs running at the same time.
The schedule_interval defines the interval of time at which your DAG gets triggered: every 10 minutes, every
day, every month, and so on. There are two ways to define it, either with a CRON expression or with a timedelta object.
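For illustration, a minimal DAG object along the lines described above might look like this (the dag_id, dates, and schedule are placeholders):

from datetime import datetime, timedelta
from airflow import DAG

# A minimal DAG object; the dag_id and schedule below are placeholders.
dag = DAG(
    dag_id="example_data_pipeline",        # must be unique across all DAGs
    start_date=datetime(2022, 11, 1),      # a recent start date avoids a large backfill
    schedule_interval=timedelta(days=1),   # or a CRON expression such as "0 0 * * *"
    catchup=False,                         # do not backfill past, non-triggered runs
)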
C. Add Tasks
1. Python Operator: runs a Python callable as an Airflow task.
2. Bash Operator: runs a bash command as an Airflow task.
D. Defining dependencies (see the combined sketch below)
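The following self-contained sketch shows a PythonOperator task, a BashOperator task, and the dependency between them; the callable, the bash command, and the identifiers are placeholders:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator

def extract_data():
    # Placeholder for the actual extraction or cleaning logic.
    print("extracting data")

with DAG(dag_id="example_tasks", start_date=datetime(2022, 11, 1),
         schedule_interval="@daily", catchup=False) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_data)
    notify = BashOperator(task_id="notify", bash_command="echo 'extraction finished'")

    extract >> notify   # notify runs only after extract succeeds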
The script can be triggered from the Airflow UI, and all additional information can be viewed there.
3.2 Pip Package Creation
1. Registration
The Python community maintains a repository, similar to npm, for open source packages. If one wants to make
their package publicly accessible, they can upload it to PyPI. So, the first step is to register on
PyPI: https://pypi.org/account/register/
Setuptools: Setuptools is a package development process library designed for creating and distributing
Python packages.
Wheel: The Wheel package provides a bdist_wheel command for Setuptools. It creates a .whl file that can
be installed directly with the pip install command.
Twine: The Twine package provides a secure, authenticated, and verified connection between your
system and PyPI over HTTPS.
Tqdm: This is a smart progress meter used internally by Twine.
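The compilation step below assumes a setup.py file in the package folder. A minimal sketch built with Setuptools might look like the following; the package name, version, and dependencies are hypothetical:

from setuptools import setup, find_packages

# Hypothetical metadata for an internal utility package.
setup(
    name="example_data_utils",
    version="0.1.0",
    description="Helper functions for data extraction and cleaning",
    packages=find_packages(),
    install_requires=["requests", "beautifulsoup4", "pymongo"],
    python_requires=">=3.7",
)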
4. Package Compilation
Go into your package folder and execute the command: python setup.py bdist_wheel
If you want to test your package on your local machine, you can install the generated .whl file directly with pip install.
1. Create .pypirc: The .pypirc file stores the PyPI repository information. Create the file in the home directory.
2. Add the following content to it. Replace javatechy with your username.
[distutils]
index-servers = pypi

[pypi]
repository = https://upload.pypi.org/legacy/
username = javatechy
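With the .pypirc file in place, a typical build-and-upload sequence is shown below; twine upload pushes the generated wheel in dist/ to PyPI:
python setup.py bdist_wheel
twine upload dist/*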
Fig.: Data comparison flow — data compared and the result displayed.
Chapter 5
5. Conclusion
I can sum it up by saying that this internship was a brilliant experience. I not only gained more technical
understanding, but also benefited personally from this experience. In the past years, we have seen a huge change
in business operations, and companies have realized the value of data as a real asset. Data is tracked, recorded, and
leveraged at various key points to make well-informed decisions. Data is used to unlock new business
opportunities and to increase business growth as well. Digital data is offering countless opportunities for
organizations to innovate and better serve customers. Innoplexus gave me the opportunity to learn and gather
knowledge about how a data-based company functions. This experience has helped me realize my career path
and future. I would like to thank Mr. Ashwin Waknis, Mr. Jaimin Mehta, and the employees of Innoplexus for
giving me this opportunity and for helping me grow as an engineer and as a person.