
An Internship Report on

“Data Engineering”
A project with
Innoplexus Consulting Services Pvt. Ltd
Submitted to

Vishwakarma Institute of Technology, Pune


(An Autonomous Institute Affiliated to Savitribai Phule Pune University)
In partial fulfilment of the requirements for

Bachelor of Technology
In

Electronics and Telecommunication Engineering


By

Shaunak Dhande
(GRNO. 11911401)

Under the guidance of

Mr. Jaimin Mehta


(Technology Architect, Innoplexus Consulting Services Pvt. Ltd)

Department of Electronics and Telecommunication Engineering


Vishwakarma Institute of Technology, Pune - 411037

Academic Year: 2022-23 Sem I


Bansilal Ramnath Agarwal Charitable Trust’s

Vishwakarma Institute of Technology, Pune – 37


(An Autonomous Institute Affiliated to Savitribai Phule Pune University)

Certificate
This is to certify that the internship report titled "Data Engineering at Innoplexus Consulting
Services Pvt. Ltd" submitted by Shaunak Dhande (GR NO. 11911401) is a record of bonafide
work carried out by him under the guidance of industry mentor Mr. Jaimin Mehta and college
mentor Prof. Jyoti Madake, in partial fulfillment of the requirements for the award of the degree of
Final Year Bachelor of Technology in Electronics and Telecommunication Engineering.

Industry Mentor Faculty Mentor


Mr. Jaimin Mehta Prof. Jyoti Madake
(Technology Architect, Innoplexus) (VIT, Pune)

Internship Coordinator Head of the Department


Prof. (Dr.) Ashwinee Pulujkar Prof. (Dr.) Shripad Bhatlawande

Date: Dec 2022


Place: Pune
Acknowledgement

I wish to express my gratitude to Prof. (Dr.) R. Jalnekar, Director, VIT Pune, for providing
the facilities of the Institute and for his encouragement during the course of this work. I also
express my deep gratitude to Prof. (Dr.) Shripad Bhatlawande, Head of the Department
of Electronics and Telecommunication Engineering, VIT Pune, for his guidance and support.
I would like to thank my industry mentor Mr. Jaimin Mehta and the employees of
Innoplexus for providing me with guidance and help at every step of this internship, for
imparting invaluable knowledge, and for teaching me the etiquette of a professional employee.
I would also like to gratefully acknowledge the enthusiastic supervision of my internship
guide, Prof. Jyoti Madake, for her continuous, valuable guidance, patience, constant care, and
kind encouragement throughout the internship work, which enabled me to present this internship
report effectively.
Finally, I wish to thank my family members and my friends, who have always been very
supportive and encouraging.

(Shaunak Dhande)
Date: Dec 2022
INDEX
Chapter 1: Introduction to Innoplexus
  1.1 Life Science AI
  1.2 Ontosight
  1.3 Technology
Chapter 2: Data Extraction, Cleaning and Porting
  2.1 Data Extraction
  2.2 Data Cleaning
  2.3 Data Porting
Chapter 3: Automation
  3.1 Apache Airflow
  3.2 Pip Package Creation
Chapter 4: Full-Stack Development
  4.1 Mongo and ES Comparison
  4.2 Data Verification
  4.3 Dashboard Frontend
Chapter 5: Conclusion
ABSTRACT

Data engineering is the practice of designing and building systems for collecting, storing, and
analyzing data at scale. It is a broad field with applications in just about every industry. Organizations
can collect massive amounts of data, and they need the right people and technology to
ensure it is in a highly usable state by the time it reaches data scientists and analysts.
The data engineering team at Innoplexus works on data extraction, data cleaning, data porting, data
validation and other data-oriented tasks. The company uses MongoDB as its primary database and Elasticsearch
as its search engine. The combination of both is called CLS, where the final cleaned data is stored. I
completed various tasks such as data extraction from web sources, data schema validation, data porting, project
automation and full-stack development as per the company's requirements. This report describes all
the completed work and the experience gained in detail.
Chapter 1
1. Introduction to Innoplexus Consulting Services
Innoplexus offers Data as a Service (DaaS) and Continuous Analytics as a Service (CAAS) products.
Leveraging Artificial Intelligence, proprietary algorithms and patent-pending technologies, we help global Life
Sciences and Pharmaceutical organizations with access to relevant data, intelligence and intuitive insights across
the pre-clinical, clinical, regulatory and commercial stages of a drug. We automate the collection and curation of
data using technologies such as natural language processing, network analysis, ontologies, computer vision and
entity normalization. Our clients are Fortune 500 companies, including some of the world's largest
pharmaceutical companies.

1.1 Life Science AI


Innoplexus harnesses proprietary Artificial Intelligence and Machine Learning to provide deeper and real-time
insights into the Life Sciences data universe. We use natural language processing and computer vision to
understand data, make connections, and accelerate drug development from drug discovery to commercialization.

1.1.1 Services

A. Data as a Service (DaaS):


Innoplexus provides APIs to connect to a data ocean covering 95% of the world-wide web and to combine it
with enterprise or third-party data, aggregated and enriched with a self-learning life sciences ontology
of 31M+ life sciences-related terms and concepts, to surface relevant information.
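
As a rough illustration of how a client might consume such a service, the sketch below calls a hypothetical DaaS-style REST endpoint; the URL, parameters, headers and response fields are placeholders, not the actual Innoplexus API.

import requests

# Hypothetical endpoint and API key -- placeholders, not the real DaaS API.
BASE_URL = "https://api.example.com/daas/v1/search"
API_KEY = "YOUR_API_KEY"

def search_concepts(term, limit=10):
    """Query the (hypothetical) DaaS search endpoint for a life sciences term."""
    response = requests.get(
        BASE_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"q": term, "limit": limit},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    results = search_concepts("EGFR inhibitor")
    print(results)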

B. Dashboards:
Continuously crawled, real-time information from multiple publicly available sources in one place. Leverage
AI to stay up to date with guidelines, regulatory patents, market scenarios, and other areas of the drug discovery
landscape, and generate relevant insights for success.

C. Blockchain:
Innoplexus provides blockchain-based systems that enable companies to securely gain access to partners'
metadata on a single platform. It also allows full access upon agreement, giving access to more life sciences data
to generate new insights and support informed decision-making.

D. Indication Prioritization:
Prioritize indications based on a scoring logic that combines factors covering biological validation,
clinical trial valuation, and commercial evaluation. Fast-track research and decision-making to save time and
optimize costs for greater success.

1.2 Ontosight
Ontosight offers a holistic approach to research and discovery in pharma and life sciences by leveraging artificial
intelligence and a self-learning life sciences ontology. Innoplexus provides real-time insights by scanning 95%
of the world-wide web and connecting trillions of data points from structured and unstructured data.

The platform leverages advanced analytics techniques combined with AI technologies such as machine learning,
computer vision, and entity normalization to offer continuously updated data and insights from multiple sources
(publications, clinical trials, congresses, patents, grants, drug profiles, gene profiles, etc.) to generate
relevant search results.

Ontosight enables pharmaceutical, CRO, biotech, and life sciences professionals to accelerate research and
development, with real-time insights at their fingertips. Ontosight modules and dashboards help to significantly
accelerate processes throughout the drug development cycle, from preclinical and clinical to regulatory and
commercial phases.

1.2.1 Ontosight Influence


Ontosight Influence leverages AI for real-time discovery, management and network analysis of top and
emerging KOLs across all stages of a drug's life cycle. Prioritize KOLs and dive deep into their research profiles
and networks to strategize meaningful engagements.

Influence helps identify the right KOLs for your business needs through multiple filter options and a
customizable scoring logic based on the research history and affiliations of KOLs. It offers CRM features to
plan, execute, and track KOL interactions, and a deep-dive option to analyze individual and top KOLs with
comprehensive research profiles.

Obtain up-to-date, relevant information and insights on top potential medical project participants and their
network to lead your next clinical research project to success.

1.2.2 Ontosight Discover


Ontosight Discover is a discovery engine for life sciences that leverages AI and a self-learning life sciences
ontology created by Innoplexus to generate continuous, real-time insights spanning all therapeutic areas and
indications.

The Discover engine leverages real-time, updated data from various scientific sources (publications, clinical
trials, congresses etc.) with the most comprehensive life sciences ontology for concept-based contextual and
relevant search results. It enables researchers to search for dissimilar concepts and discover previously unknown
information. Holistic and detailed deep-dive functionalities allow in-depth research and discovery.

Get summarized snapshots of up-to-date clinical development pipelines across indications, interventions, and
therapies to aid informed decisions for strategizing your clinical research projects.

1.2.3 Ontosight Explore


Ontosight Explore enables exploration, identification, and building of direct and indirect connections between
various biological entities. The resulting evidence-based network assists in the discovery of potential drugs and
targets for a given disease.

Explore provides an overview of networks across various biomedical entities that facilitates identification of
strong and weak associations among drugs, targets, diseases, and pathways based on the entire life sciences
digital universe. Researchers can perform deep dives to obtain granular scientific information and predict
potential interventions for indications of interest, based on clinical data and network modularity.

Gain insights into previously hidden connections between biomedical entities and significantly reduce time,
costs, and effort spent on research and in experimental studies to accelerate your drug discovery processes.

1.2.4 Ontosight Integrate


Ontosight Integrate is a self-service platform that offers the leadership of an organization a real-time overview
of how products are performing, how sales are going, the status of the budget and expenses, etc. Integrate is a
100% automated platform that amalgamates internal and external data sources, including structured and
unstructured data. Information can be automatically extracted in real time from multiple sources such as CRMs
(Salesforce, Siebel) and ERP systems (SAP, etc.).

Moreover, data from external sources such as IMS and publicly available data sources, including social
websites, patient and physician forums, and blogs, is also used to inform the leadership in real time so that
appropriate actions can be taken in a timely manner.

1.3 Technology
Innoplexus empowers decision making by leveraging cutting-edge technology. Its innovation-led
technology development has resulted in 100+ patent applications, including 23 granted patents.

1.3.1 Blockchain
A smart contract system for searching unpublished data and making transactions. Real-time valuation of
unpublished data through AI. TruAgent enables integration of confidential data in a secure way.

1.3.2 Computer Vision


Leverages image processing to classify and extract relevant info from PDF and image files. Enhanced OCR to
handle ambiguous and special characters with higher precision. Understands page layout and structure in the
same way as humans do.

1.3.3 Entity Normalization


Resolving entities from disparate sources covering name variations and degeneration. Increasing the precision
to discover entities even with sparse metadata. Leveraging crawled data to improve normalization.

1.3.4 Machine Learning & AI


Tapping the wealth of unstructured data from internal and external sources. Understanding domain-specific
contextual information (e.g., our life sciences language processing engine). Building a reasoning system to serve
the intent of user queries.

1.3.5 Ontology
Mapping all discoverable concepts from content of all major data sources in the base ontology. Connecting
observations from curated sources and literature. Self-learning unseen concepts validated by random checks.

1.3.6 Network Analysis


Modelling and persisting (storing) the entire data set as a network in a graph database. Multigraph with networks
from different asset classes as layers. Large scale network analysis to find key insights in real time.
Chapter 2
2. Data Extraction, Cleaning and Porting
2.1 Data Extraction
Data extraction is the act or process of retrieving data from data sources for further processing or storage. After
the data is imported into an intermediate extraction system, it is transformed and possibly enriched with metadata
to prepare it for export to the next stage of the data workflow.

Data is collected mainly from web sources and PDFs. A predefined schema is specified for data crawling. Python
modules such as Beautiful Soup, Requests and Selenium are used for data crawling. I completed data
extraction from more than 100 web sources. Fig. 1 explains the procedure for data extraction from web sources.

[Fig 1. Data Extraction: web source link → hit link using Requests or Selenium → extract all HTML with Beautiful Soup → data cleaning → data porting into MongoDB]

2.1.1 Requests

Python's Requests library is one of the essential components for sending HTTP requests to a given URL.
Whether working with REST APIs or doing web scraping, Requests must be mastered in order to move forward
with these technologies. A request sent to a URL returns a response, and Requests comes with built-in tools for
handling both requests and responses.
2.1.2 Selenium
Selenium is an effective technology for automating and controlling web browsers through software.
It works across all major browsers and operating systems, and its scripts may be written in a number of
languages, including Python, Java, C# and others. For our purposes, Python is used. WebDriver, WebElement
and unit testing are just a few of the features Selenium offers.

2.1.3 BeautifulSoup
BeautifulSoup is used to extract information from HTML and XML files. It offers a parse tree together with
tools for navigating, searching, and modifying it. A combined Requests and BeautifulSoup example is sketched below.
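
A minimal sketch of the Requests plus BeautifulSoup part of the workflow is shown below; the URL and the CSS selectors are hypothetical and would differ for every web source.

import requests
from bs4 import BeautifulSoup

# Hypothetical source URL and selectors -- every web source needs its own schema mapping.
URL = "https://www.example.com/clinical-trials"

response = requests.get(URL, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

records = []
for row in soup.select("div.trial-record"):          # hypothetical container class
    title_tag = row.select_one("h2")
    link_tag = row.select_one("a")
    records.append({
        "title": title_tag.get_text(strip=True) if title_tag else None,
        "link": link_tag["href"] if link_tag else None,
    })

print(f"Extracted {len(records)} records")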

2.1.4 Why use Selenium?

Some web sources have cookies and blockers that do not allow the Requests module to extract all the data
present in the frontend. There are also cases of pagination, where the data is spread across different pages under the
same link. Moreover, some sources require clicking objects or providing input to make the data visible. All of this is
possible using Selenium: one can make the driver wait, click, give an input, etc. using the Selenium WebDriver.
However, Selenium is a time-consuming and heavy procedure, so it should be used only as a last resort.
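
The sketch below illustrates how Selenium can wait for dynamic content, click through pagination and then hand the rendered HTML to BeautifulSoup; the locators and URL are placeholders, and a working chromedriver installation is assumed.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()                            # assumes chromedriver is installed and on PATH
driver.get("https://www.example.com/publications")     # hypothetical source

# Wait until the (hypothetical) results container is rendered by JavaScript.
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CLASS_NAME, "results"))
)

pages = []
while True:
    pages.append(BeautifulSoup(driver.page_source, "html.parser"))
    next_buttons = driver.find_elements(By.LINK_TEXT, "Next")   # hypothetical pagination link
    if not next_buttons:
        break
    next_buttons[0].click()
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CLASS_NAME, "results"))
    )

driver.quit()
print(f"Collected HTML for {len(pages)} pages")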

2.2 Data Cleaning

Data cleaning refers to the act of finding and fixing incomplete, inaccurate, or irrelevant sections of the data and
then replacing, changing, or deleting the dirty or coarse data from a record set, table, or database. The extracted
data for a record is in the form of JSON. This data is stored as a document in a particular schema. Each field value
in this schema is supposed to be non-empty and cleaned for further data usage.

Often, even after carefully extracting data and porting it into the base MongoDB, the data is not clean. The
data is therefore cleaned using different machine learning approaches and Python packages. Schema validation scripts are
run over the ported MongoDB to identify any schema or data type errors. After the data is finally cleaned, it is
ported and indexed into the further staged MongoDB and Elasticsearch. Fig. 2 depicts the data cleaning steps.

[Fig 2. Data Cleaning: MongoDB data → schema validation scripts → error fixing]
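
A minimal sketch of a schema validation pass is given below, using pymongo and the jsonschema package; the connection string, database, collection and schema fields are assumptions for illustration.

from jsonschema import validate, ValidationError
from pymongo import MongoClient

# Hypothetical schema -- the real collection schemas are defined per source.
RECORD_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "minLength": 1},
        "source_url": {"type": "string", "minLength": 1},
        "published_date": {"type": "string"},
    },
    "required": ["title", "source_url"],
}

client = MongoClient("mongodb://localhost:27017")      # assumed connection string
collection = client["base_db"]["extracted_records"]    # assumed database/collection names

bad_ids = []
for doc in collection.find({}):
    try:
        validate(instance=doc, schema=RECORD_SCHEMA)
    except ValidationError as err:
        bad_ids.append((doc["_id"], err.message))

print(f"{len(bad_ids)} documents failed schema validation")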

2.3 Data Porting

Data porting is the process of loading data documents from the sources into collection centers. Here, MongoDB is
the main database used. The data extracted from sources is loaded into the primary MongoDB, where it is organized
and structured as collections. After cleaning, the data is moved to another MongoDB instance; loading data from one
MongoDB to another is also called porting. When the final confirmed data is loaded into the Elasticsearch engine, it
is called data indexing. A search engine is used for investigating data-related bugs because search engines
respond faster than databases. This complete procedure becomes very complex when dealing with millions of data
documents.
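
The sketch below shows one way to port cleaned documents from a MongoDB collection into an Elasticsearch index using the bulk helper; the connection strings, database, collection and index names are assumptions, and the elasticsearch-py 8.x client style is assumed.

from elasticsearch import Elasticsearch, helpers
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")       # assumed connection string
source = mongo["cleaned_db"]["records"]                # assumed cleaned/staged collection

es = Elasticsearch("http://localhost:9200")            # assumed Elasticsearch endpoint

def generate_actions(cursor, index_name="cls_records"):    # assumed index name
    """Yield bulk-index actions, reusing the Mongo _id as the Elasticsearch document id."""
    for doc in cursor:
        doc_id = str(doc.pop("_id"))
        yield {"_index": index_name, "_id": doc_id, "_source": doc}

success, errors = helpers.bulk(es, generate_actions(source.find({})), raise_on_error=False)
print(f"Indexed {success} documents, {len(errors)} errors")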
Chapter 3

3. Automation
Data automation is the practice of speeding up and automating the development cycles for data warehouses
while maintaining quality and consistency. Data warehouse automation (DWA) is intended to cover the whole lifecycle
of a data warehouse, from source system analysis to testing and documentation.

3.1 Apache Airflow


Airflow is a platform created by the community to programmatically author, schedule and monitor workflows.
Many websites from which data is extracted are updated every now and then. To keep data records up to date
for business growth and quality products and services, it is important to automate the process of data extraction
and reduce the time wasted on manual procedures. Apache Airflow provides a platform to automate scripts in
parallel and in series.

3.1.1 Installation
1. Create a Virtual Environment

2. Prerequisites
   Starting with Airflow 2.3.0, Airflow is tested with:
   - Python: 3.7, 3.8, 3.9, 3.10
   - Databases:
     - PostgreSQL: 10, 11, 12, 13, 14
     - MySQL: 5.7, 8
     - SQLite: 3.15.0+
     - MSSQL: 2017, 2019
   - Kubernetes: 1.20.2, 1.21.1, 1.22.0, 1.23.0, 1.24.0

3. Terminal commands:
   - pip3 install apache-airflow
   - pip3 install typing_extensions

   - Initialize the database:
     airflow db init

   - Start the web server (default port is 8080):
     airflow webserver -p 8080

   - Start the scheduler (it is recommended to open a separate terminal):
     airflow scheduler

4. Creating an account:
   - airflow users create --username admin --firstname <firstname> --lastname <lastname> --role Admin --email <email>
   - Sign in to your account to see all DAGs.
3.1.2 Creating DAGs
A DAG, or Directed Acyclic Graph, is a collection of all the tasks you want to run, organized in a way that
reflects their relationships and dependencies.
After installing Apache Airflow, create a dags folder inside the airflow directory created at Home by Airflow. All Python
scripts to automate will be saved here.

A. Make the Imports

To create a DAG in Airflow, we first import the DAG class and then the Operators. Basically, for each Operator
you want to use, you have to make the corresponding import. For example, if you want to execute a Python
function, you have to import the PythonOperator; if you want to execute a bash command, you have to import
the BashOperator. The last import is the datetime class, as we need to specify a start date for the DAG.

- from airflow import DAG
- from airflow.operators.python import PythonOperator, BranchPythonOperator
- from airflow.operators.bash import BashOperator
- from datetime import datetime

B. Create the Airflow DAG object

A DAG object must have two parameters, a dag_id and a start_date. The dag_id is the unique identifier of the
DAG across all DAGs; each DAG must have a unique dag_id. The start_date defines the date at which your
DAG starts being scheduled. If the start date is set in the past, the scheduler will try to backfill all the non-
triggered DAG Runs between the start date and the current date. For example, if your start date is set to
a date three years ago, you might end up with many DAG Runs running at the same time.

The schedule interval defines the interval of time at which your DAG gets triggered: every 10 minutes, every
day, every month and so on. There are two ways to define it, either with a CRON expression or with a timedelta
object, as shown in the sketch below.
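
For example, the two styles might look like this (a minimal sketch; dag ids and dates are placeholders):

from datetime import datetime, timedelta
from airflow import DAG

# Option 1: CRON expression -- run every day at 06:00.
dag_cron = DAG(dag_id="extract_daily", start_date=datetime(2022, 11, 1),
               schedule_interval="0 6 * * *", catchup=False)

# Option 2: timedelta object -- run every 10 minutes.
dag_delta = DAG(dag_id="extract_frequent", start_date=datetime(2022, 11, 1),
                schedule_interval=timedelta(minutes=10), catchup=False)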

C. Add Tasks

1. PythonOperator:

If a task is a Python function, we use the PythonOperator.

Example: task_1 = PythonOperator(task_id='first_task',
                                 python_callable=function_name,
                                 dag=my_first_dag)

2. BashOperator:

If a task is a bash command, we use the BashOperator.

Example: task_2 = BashOperator(task_id='second_task',
                               bash_command='echo 2',
                               dag=my_first_dag)

D. Defining dependencies

Here we declare the flow of our pipeline with '>>'.

Example:
with DAG(...) as dag:
    task1 >> task2

The script can be triggered from the Airflow UI, where all additional information can also be viewed.
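
Putting the pieces together, a minimal end-to-end DAG might look like the sketch below; the dag id, schedule and the extraction function are placeholders for the actual project scripts.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator

def extract_source():
    # Placeholder for one of the actual extraction scripts.
    print("extracting data from web source")

with DAG(dag_id="my_first_dag",
         start_date=datetime(2022, 11, 1),
         schedule_interval="@daily",
         catchup=False) as dag:

    extract_task = PythonOperator(task_id="extract_task", python_callable=extract_source)
    notify_task = BashOperator(task_id="notify_task", bash_command="echo extraction finished")

    extract_task >> notify_task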

3.2 Pip Package Creation


I created a Python pip package as a utility addition task. This package includes different functions that ease data
extraction and data cleaning. Some functionalities included in this package were date-time stamp
conversion, affiliation mapping logic, etc. The following is the documentation for creating a pip package.

1. Registration

The Python community maintains a repository, similar to npm, for open source packages. If one wants to make
a package publicly accessible, one can upload it to PyPI. So, the first step is to register on

PyPI: https://pypi.org/account/register/

2. Checking the Required Tools

- Setuptools: Setuptools is a package development process library designed for creating and distributing
  Python packages.
- Wheel: The Wheel package provides a bdist_wheel command for Setuptools. It creates a .whl file which
  is directly installable through the pip install command.
- Twine: The Twine package provides a secure, authenticated, and verified connection between your
  system and PyPI over HTTPS.
- Tqdm: This is a smart progress meter used internally by Twine.

3. Setup Your Project

- Create a package, say dokr_pkg.
- Create your executable file inside the package, say dokr. Create the script without an extension (dokr).
- Make your script executable.
- Create a setup file, setup.py, in your package. This file will contain all your package metadata
  information (a minimal sketch follows after this list).
- Add a LICENSE to your project by creating a file called LICENSE.
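
A minimal setup.py for the example package might look like the sketch below; the name, version and other metadata are illustrative only.

from setuptools import setup

setup(
    name="dokr",                              # example package name from this section
    version="0.1",
    description="Utility helpers for data extraction and cleaning",  # illustrative
    author="Your Name",
    packages=[],                              # add package directories here if any
    scripts=["dokr"],                         # the extension-less executable script
    install_requires=[],                      # list runtime dependencies here
)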

4. Package Compilation

Go into your package folder and execute this command: python setup.py bdist_wheel. This generates:

- build: build package information.
- dist: contains your .whl file. A WHL file is a package saved in the Wheel format, which is the standard
  built-package format used for Python distributions. You can directly install a .whl file using pip install
  some_package.whl on your system.
- project.egg-info: an egg package contains compiled bytecode, package information, dependency links,
  and captures the info used by the setup.py test command when running tests.

5. Install on Your Local Machine

If you want to test your application on your local machine, you can install the .whl file using pip:

command: pip install dist/dokr-0.1-py3-none-any.whl


6. Upload on pip

1. Create .pypirc: the .pypirc file stores the PyPI repository information. Create the file in the home directory:

   - for Windows: C:\Users\<UserName>\.pypirc
   - for *nix: ~/.pypirc

2. Add the following content to it. Replace javatechy with your username.

   [distutils]
   index-servers = pypi

   [pypi]
   repository = https://upload.pypi.org/legacy/
   username = javatechy

3. To upload your dist/*.whl file to PyPI, use Twine:

   command: twine upload dist/*

This command will upload your package to PyPI.


Chapter 4

4. Full Stack Development


I completed two full-stack development projects and a few frontend development projects as well. The
frameworks used for the backend were Python Flask and Django. The frontend was developed using HTML, CSS
and JavaScript (React).

4.1 Mongo and ES Comparison


Many times, the data ported to MongoDB and the data indexed in Elasticsearch were different. The data records
did not match, which caused errors. To solve this issue, a website for comparing MongoDB and ES data was built.
The data from MongoDB and ES was fetched using a common schema key, the documents were compared, and the errors
were fixed. A frontend was created which displayed both data documents and the comparison result;
the locations of the data changes were displayed in case of unmatched data. Fig. 3 displays the procedure.

[Fig 3. Data Comparison: MongoDB data and ES data → data comparison → result displayed]
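
A simplified sketch of the comparison logic is shown below; the index, database, collection and key names are assumptions, and the Elasticsearch client call assumes the 8.x-style API.

from elasticsearch import Elasticsearch
from pymongo import MongoClient

mongo_coll = MongoClient("mongodb://localhost:27017")["cls_db"]["records"]   # assumed names
es = Elasticsearch("http://localhost:9200")

def diff_documents(mongo_doc, es_doc, prefix=""):
    """Return the field paths where the two documents disagree."""
    mismatches = []
    for key in set(mongo_doc) | set(es_doc):
        path = f"{prefix}{key}"
        a, b = mongo_doc.get(key), es_doc.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            mismatches.extend(diff_documents(a, b, prefix=path + "."))
        elif a != b:
            mismatches.append(path)
    return mismatches

common_key = "REC-0001"                                    # hypothetical shared schema key
mongo_doc = mongo_coll.find_one({"record_id": common_key}, {"_id": 0}) or {}
es_hit = es.search(index="cls_records", query={"term": {"record_id": common_key}})
es_doc = es_hit["hits"]["hits"][0]["_source"] if es_hit["hits"]["hits"] else {}

print(diff_documents(mongo_doc, es_doc))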

4.2 Data Verification


All the data records present in the database have a common key. When dealing with millions of documents, it is
difficult to identify the sources for which data has not been extracted. Therefore, based on the value of the common
key, the data was checked in the final database collection. The result helped confirm the status of each available source.
Fig. 4 displays the data verification process. A frontend using HTML and CSS was made for visualization, which
displayed the data if it was present and notified the user if the data was absent.

[Fig 4. Data Verification: search key → final database lookup → data availability]
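
A minimal sketch of the backend check behind such a tool is given below, using Flask; the database, collection and key field names are hypothetical.

from flask import Flask, jsonify
from pymongo import MongoClient

app = Flask(__name__)
collection = MongoClient("mongodb://localhost:27017")["cls_db"]["records"]   # assumed names

@app.route("/verify/<search_key>")
def verify(search_key):
    """Report whether any document exists for the given common key."""
    doc = collection.find_one({"record_id": search_key}, {"_id": 0})   # hypothetical key field
    if doc is None:
        return jsonify({"status": "absent", "key": search_key}), 404
    return jsonify({"status": "present", "document": doc})

if __name__ == "__main__":
    app.run(debug=True)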


4.3 Dashboard Frontend
Innoplexus provides a few services such as Neuria and Curia. These are applications available for open-source
installation which provide different data-based services to people and institutions for research and awareness.
The company has a well-established internal backend and frontend for dealing with internal issues for these applications.
I updated the frontend of the dashboards for the Neuria application to improve the connection between the website design
and the backend development.
Chapter 5

Conclusion
I can sum it up by saying that this internship was a brilliant experience. I not only gained more technical
understanding, but also benefited personally from this experience. In the past years, we have seen a huge change
in business operations, and companies have realized the value of data as a real asset. Data is tracked, recorded and
leveraged at various key points to make well-informed decisions. Data is used to unlock new business
opportunities and to increase business growth as well. Digital data is offering countless opportunities for
organizations to innovate and better serve customers. Innoplexus gave me the opportunity to learn and gather
knowledge about how a data-based company functions. This experience has helped me realize my career path
and future. I would like to thank Mr. Ashwin Waknis, Mr. Jaimin Mehta and the employees of Innoplexus for
giving me this opportunity and for helping me grow as an engineer and as a person.
