

CoreDB: a Data Lake Service


Amin Beheshti, Boualem Benatallah, Reza Nouri, Van Munin Chhieng, HuangTao Xiong, Xu Zhao
University of New South Wales, Sydney, Australia
ABSTRACT
The continuous improvement in connectivity, storage and data processing capabilities allows access to a data deluge from sensors, social media, news, user-generated, government and private data sources. Accordingly, in a modern data-oriented landscape, with the advent of various data capture and management technologies, organizations are rapidly shifting to datafication of their processes. In such an environment, analysts may need to deal with a collection of datasets, from relational to NoSQL, that holds a vast amount of data gathered from various private/open data islands, i.e. a Data Lake. Organizing, indexing and querying the growing volume of internal data and metadata in a data lake is challenging and requires various skills and experiences to deal with dozens of new database and indexing technologies: How to store information items? What technology to use for persisting the data? How to deal with the large volume of streaming data? How to trace and persist information about data? What technology to use for indexing the data? How to query the data lake? To address the above-mentioned challenges, we present CoreDB, an open source data lake service, which offers researchers and developers a single REST API to organize, index and query their data and metadata. CoreDB manages multiple database technologies and offers a built-in design for security and tracing.

CCS CONCEPTS
• Information systems → Data management systems; Web services;

KEYWORDS
Data Lake, Database Service, Data API

ACM Reference format:
Amin Beheshti, Boualem Benatallah, Reza Nouri, Van Munin Chhieng, HuangTao Xiong, and Xu Zhao. 2017. CoreDB: a Data Lake Service. In Proceedings of CIKM'17, Singapore, Singapore, November 6-10, 2017, 4 pages. https://doi.org/10.1145/3132847.3133171

1 INTRODUCTION
The production of knowledge from an ever-increasing amount of private/open data is seen by many organizations as an increasingly important capability that can complement traditional analytics sources [2]. In this context, modern data-oriented applications deal with various types of data - unstructured, semi-structured and structured - such as emails, tweets, documents, videos and images. For example, consider an analyst who is interested in analyzing the Government Budget by engaging the public's thoughts and opinions on social networks. To achieve this, the analyst may need to deal with a wealth of digital information generated through social networks, blogs, online communities and mobile applications, which forms a complex data lake [6]: a collection of datasets that holds a vast amount of data gathered from various private/open data islands. Organizing and indexing the growing volume of internal data and metadata in the data lake is challenging and requires a vast amount of knowledge to deal with dozens of new database and indexing technologies.

In particular, for an analyst who is dealing with the data layer for organizing, indexing and querying different types of data - from structured entities to be stored in relational databases to large volumes of open data to be organized using appropriate NoSQL databases such as MongoDB or CouchDB - various skills and experiences may be required: How to store information items (from structured entities to unstructured documents)? What technology to use for persisting the data (from relational to NoSQL databases)? How to deal with the large volume of data being generated on a continuous basis (from key-value and document stores to object and graph stores)? How to trace and persist information about data (from descriptive to administrative)? What technology to use for indexing the data/metadata? How to query the data lake (from SQL to full-text search)?

To address the above-mentioned challenges, we present CoreDB, an open source data lake service, which offers researchers and developers a single REST API to organize, index and query their data and metadata. CoreDB manages multiple database technologies (from relational to NoSQL databases), exposes the power of Elasticsearch [5] and weaves them together at the application layer. CoreDB offers a built-in design to support: (i) Security and Access Control: to protect against database security threats through authentication, access control and data encryption; and (ii) Tracing and Provenance [3, 8]: to collect and aggregate tracing metadata, including descriptive, administrative and temporal metadata, and to build a provenance graph.


Figure 1: CoreDB Architecture. (The figure shows the CoreDB REST API layered over components for Security (authentication, access control, data encryption), Tracing and Provenance, Meta-Data, Index, Search (full-text), Query (SQL via Apache Phoenix and Apache Drill) and CRUD (Create, Read, Update, Delete), on top of relational databases such as MySQL, PostgreSQL and SQL Server and NoSQL databases such as MongoDB, CouchDB, HBase and Hive.)

The CoreDB API is available as an open source project on GitHub (https://github.com/unsw-cse-soc/CoreDB). The rest of the paper is organized as follows. In Section 2, we present an overview of CoreDB, while in Section 3 we describe our demonstration scenario.

2 COREDB OVERVIEW
CoreDB is an open source, complete database service that powers multiple relational and NoSQL (key-value, document and graph store) databases-as-a-service for developing Web data applications, i.e. data-driven Web applications. CoreDB enables analysts to build a data lake, create relational and/or NoSQL datasets within the data lake, and CRUD (Create, Read, Update and Delete) and query entities in those datasets. CoreDB exposes the power of Elasticsearch, a search engine based on Apache Lucene (lucene.apache.org/), to support powerful indexing and full-text search. CoreDB has a built-in design to address top database security threats (Authentication, Access Control and Data Encryption), along with Tracing and Provenance support. CoreDB weaves all these services together at the application layer and offers a single REST API to organize, index and query the data and metadata in a data lake. Figure 1 illustrates the architecture and the main components of CoreDB.

2.1 CRUD Data Lake, Dataset and Entity
The top-level organizing concept in CoreDB is the Data Lake: a collection of datasets that holds a vast amount of data gathered from various private/open data islands. Within the Data Lake one can create a dataset of type relational and/or NoSQL database. CoreDB offers a single REST API to create a set of datasets and weave them together at the application layer. To create a relational database, a database connection configuration operation is provided to enable access to many relational databases such as MySQL, PostgreSQL and Oracle. Moreover, CoreDB leverages appropriate NoSQL databases such as MongoDB, HBase and Hive to address key-value, document and graph store requirements. CoreDB persists entities (structured and unstructured) in JSON format, an easy-to-parse structure, owing to its growing adoption in Web data applications. Considering the self-describing nature of JSON documents, in CoreDB we extend JSON with the option of defining a schema for all or part of the data. The following statements illustrate how to call the CoreDB service to create a Data Lake and a Dataset (relational or NoSQL):

Create a Data Lake:
curl -H "Content-Type: application/json" -X POST \
  -d '{"name":"DataLake_NAME"}' http://CoreDB/api/clients

Create a Dataset:
curl -H "Content-Type: application/json" \
  -H "Authorization: Bearer ACCESS_TOKEN" -X POST \
  -d '{"name":"Dataset_NAME", "type":"Database_NAME"}' http://CoreDB/api/databases

When creating a data lake, the 'DataLake_NAME' parameter needs to be replaced by the user. When calling this service, an access token ('ACCESS_TOKEN') will be returned which enables the user to access the data lake, and which will be required for creating, reading, updating and deleting a dataset or an entity in the data lake. For example, to create a dataset (named 'dsTweets') for storing a set of tweets in MongoDB, the 'Dataset_NAME' parameter can be replaced with 'dsTweets' and the 'Database_NAME' parameter should be replaced with 'MongoDB'. CoreDB supports various relational and NoSQL databases such as MySQL, PostgreSQL, Oracle, MongoDB, HBase and Hive. The URL ('http://CoreDB/') indicates the Web address where the CoreDB service is deployed. The next step is to use the CoreDB service to CRUD entities:

Create an Entity:
curl -H "Content-Type: application/json" \
  -H "Authorization: Bearer ACCESS_TOKEN" -X POST \
  -d '{"Param1":"Value1", "Param2":"Value2", ...}' \
  http://CoreDB/api/entity/{Database_NAME}/{Dataset_NAME}/{Entity_TYPE}

Read an Entity:
curl -H "Content-Type: application/json" \
  -H "Authorization: Bearer ACCESS_TOKEN" -X GET \
  http://CoreDB/api/entity/{Database_NAME}/{Dataset_NAME}/{Entity_TYPE}/{id}


Update an Entity:
curl -H "Content-Type: application/json" \
  -H "Authorization: Bearer ACCESS_TOKEN" -X PUT \
  -d '{"Param1":"Value1", "Param2":"Value2"}' \
  http://CoreDB/api/entity/{Database_NAME}/{Dataset_NAME}/{Entity_TYPE}/{id}

Delete an Entity:
curl -H "Content-Type: application/json" \
  -H "Authorization: Bearer ACCESS_TOKEN" -X DELETE \
  http://CoreDB/api/entity/{Database_NAME}/{Dataset_NAME}/{Entity_TYPE}/{id}

When creating, reading, updating or deleting an entity, the 'Database_NAME', 'Dataset_NAME' and 'Entity_TYPE' parameters need to be replaced by the user. For example, to CRUD a tweet in the 'dsTweets' dataset stored in MongoDB, the 'Database_NAME', 'Dataset_NAME' and 'Entity_TYPE' parameters should be replaced with 'MongoDB', 'dsTweets' and 'Tweet' respectively.
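The same calls can be issued from application code with any HTTP client. The following Java sketch is illustrative only and is not part of the CoreDB distribution; the base URL and access token are placeholders, and the JSON payload is whatever entity the application wants to persist.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative wrapper around the CoreDB entity endpoints described above.
// The base URL, token and entity fields are placeholders, not a documented client API.
public class CoreDBEntityClient {
    private final HttpClient http = HttpClient.newHttpClient();
    private final String baseUrl;      // e.g. "http://CoreDB/api"
    private final String accessToken;  // the ACCESS_TOKEN returned when the data lake was created

    public CoreDBEntityClient(String baseUrl, String accessToken) {
        this.baseUrl = baseUrl;
        this.accessToken = accessToken;
    }

    // POST /api/entity/{Database_NAME}/{Dataset_NAME}/{Entity_TYPE}
    public String createEntity(String db, String dataset, String type, String json) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/entity/" + db + "/" + dataset + "/" + type))
                .header("Content-Type", "application/json")
                .header("Authorization", "Bearer " + accessToken)
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
        return http.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // GET /api/entity/{Database_NAME}/{Dataset_NAME}/{Entity_TYPE}/{id}
    public String readEntity(String db, String dataset, String type, String id) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/entity/" + db + "/" + dataset + "/" + type + "/" + id))
                .header("Authorization", "Bearer " + accessToken)
                .GET()
                .build();
        return http.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}

For the running example, new CoreDBEntityClient("http://CoreDB/api", token).createEntity("MongoDB", "dsTweets", "Tweet", tweetJson) would issue the same request as the 'Create an Entity' call above.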
2.2 Index and Query
Index. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references). CoreDB exposes the power of Elasticsearch without imposing the operational burden of managing it on developers. In particular, when the user enables indexing while creating a dataset, the entities will be automatically indexed for powerful Lucene queries.
Query. CoreDB enables the power of standard SQL with full ACID transaction capabilities for querying data held not only in relational databases but also in NoSQL databases. In particular, using a simple REST API, it is possible to send a SQL query to be applied to the datasets created in the data lake. To achieve this, in CoreDB we leverage Apache Phoenix (phoenix.apache.org/) to take the SQL query and compile it into native NoSQL store APIs. Moreover, to support queries that need to join data from multiple datastores in the data lake, we leverage Apache Drill (drill.apache.org/). Considering that Elasticsearch is a search engine based on Lucene, it is also possible to apply Wildcard, Fuzzy, Proximity and Range search queries in CoreDB. The following statement illustrates how to call the CoreDB service to apply a query (SQL or full-text search) to the data lake:

curl -H "Content-Type: application/json" \
  -H "Authorization: Bearer ACCESS_TOKEN" -X GET \
  http://coredbapi/entity/{Database_NAME}/{Dataset_NAME}/{Entity_TYPE}?query={query}

For example, to find the tweets (stored in the 'dsTweets' dataset persisted in the MongoDB database) that contain the keyword 'CIKM', the following query can be used:

curl -g -H "Content-Type: application/json" \
  -H "Authorization: Bearer ACCESS_TOKEN" -X GET \
  'http://coredbapi/entity/dsTweets/MongoDB/Tweet?query={"match":{"text":"CIKM"}}'
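Since the index is backed by Elasticsearch, the other query types mentioned above (wildcard, fuzzy, proximity, range) can in principle be passed through the same ?query= parameter. The following Java sketch is illustrative only: it assumes the parameter accepts Elasticsearch query DSL exactly as in the 'match' example above, and simply URL-encodes a standard 'wildcard' query before issuing the GET request; the host and token are placeholders.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Illustrative full-text search call against the CoreDB query endpoint.
// 'wildcard' is standard Elasticsearch query DSL; the endpoint shape follows
// the 'match' example above.
public class FullTextSearchExample {
    public static void main(String[] args) throws Exception {
        String accessToken = "ACCESS_TOKEN";                     // token returned by CoreDB
        String dsl = "{\"wildcard\": {\"text\": \"diabet*\"}}";  // matches diabetes, diabetic, ...
        String url = "http://coredbapi/entity/dsTweets/MongoDB/Tweet?query="
                + URLEncoder.encode(dsl, StandardCharsets.UTF_8);

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("Content-Type", "application/json")
                .header("Authorization", "Bearer " + accessToken)
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());                      // JSON list of matching tweets
    }
}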
2.3 Security and Access Control
Database servers are among the most important systems in virtually all organizations. They store critical information (e.g. email, financial data, personal data) that is vital for organizations. CoreDB has a built-in design to address top database security threats (e.g. weak authentication and weak system configuration). In particular, CoreDB supports Identification and Authentication requirements, System Privilege and Object Access Control, and Data Encryption. For example, each user may be identified and authenticated by the database system and has different access levels (e.g. create, read, update and delete) to system entities through support for Roles, Responsibilities and Privileges, System Privileges and Object Privileges. In CoreDB, privileges are granted directly to users or through roles. The following statements illustrate how to use the CoreDB service to create a user and get an access token.

Create a User:
curl -H "Content-Type: application/json" -X POST \
  -d '{"userName":"USER_NAME", "password":"PASSWORD", "role":"ROLE", "clientName":"DataLake_NAME", "clientSecret":"DataLake_SECRET"}' \
  http://CoreDB/api/account

GET Access Token:
curl -H "Content-Type: application/json" -X POST \
  -d '{"userName":"USERNAME", "password":"PASSWORD", "grant_type":"PASSWORD", "clientName":"YOUR_CLIENT", "clientSecret":"YOUR_CLIENT_SECRET"}' \
  http://CoreDB/api/oauth

After creating the user and receiving the access token, it is possible to use the following statement to grant an action (create, read, update, delete and query) to a specific role:

Define Access Control:
curl -H "Content-Type: application/json" \
  -H "Authorization: Bearer ACCESS_TOKEN" -X POST \
  -d '{"role":{"action":"TRUE/FALSE"}}' \
  http://coredbapi/api/{Database_NAME}/{Dataset_NAME}/{Entity_TYPE}

2.4 Tracing and Provenance
Tracing entities over time is very important and assists analysts in understanding at what time an entity was created, read, updated, deleted or queried (and who did this), where it happened (e.g. IP address), and on what platform (e.g. mobile or PC). To address this important requirement, the CoreDB API provides a very useful and powerful functionality for tracing historical data back to users. CoreDB offers a built-in design to collect and aggregate tracing metadata, including descriptive, administrative and temporal metadata. We use the tracing metadata to build a provenance graph [3]: a directed acyclic attributed graph where the nodes are users/roles and entities, and the relationships among them represent activities such as created, read, updated, deleted or queried. The relationships are tagged with metadata such as timestamp and location. The following statement illustrates how to call the CoreDB service to receive the provenance graph of a particular entity:

curl -H "Content-Type: application/json" \
  -H "Authorization: Bearer ACCESS_TOKEN" -X GET \
  http://dataapi/api/entity/trace/{Database_NAME}/{Dataset_NAME}/{Entity_TYPE}/{id}

The result will be a JSON file containing a finite set of triples (subject, predicate, object) representing the relationships between entities in the provenance graph. For example, the triple (david, read[ts:20170320;ip:100.101.102.103], tweet123) represents that a user with unique id 'david' read a tweet with the unique id 'tweet123' on the 20th of March 2017 using a computer with IP address '100.101.102.103'.
3 DEMONSTRATION SCENARIO
Governments at all levels are starting to recognize the value in their budgeting process. In this context, the budget is the single most important policy document of governments, where policy objectives are reconciled and implemented in various categories such as 'Health', 'Social-Services', 'Transport' and 'Employment'. In the demonstration scenario we present the requirements of an analyst who is interested in analyzing the Government Budget, specifically the Health program, through engaging the public's thoughts and opinions on social networks. The goal here is to properly link the data objects in social networks (e.g. tweets in Twitter, https://support.twitter.com/articles/215585) to the health category of the budget. The demonstration scenario consists of the following parts:



(i) Data Definition and Manipulation. The budget analyst will use the CoreDB service to create a data lake. Considering that the Australian budget 2016-17 was handed down on Tuesday 3 May 2016, the analyst is interested in persisting all the tweets from one month before to two months after this date. The analyst will create a NoSQL dataset to store these tweets (more than 15 million tweets) in MongoDB. We illustrate that it is possible to fill this dataset entity by entity (a simple Java client that reads the entities and calls the CoreDB service, sketched below) or to read all the tweets (uploaded somewhere on the Web in a JSON format) and persist them in the dataset using the CoreDB service. Then, the analyst will be interested in creating a relational dataset in a MySQL database to store the main entities related to the budget health program, such as registered doctors and nurses in Australia, hospitals and pharmacies, health funds, medical devices, drugs, diseases and keywords related to health. This information can later be used to filter tweets related to health. To build this dataset, the analyst will create a set of users and access tokens: these credentials will be provided to a set of users who will help in populating the dataset. This scenario will help us to illustrate the tracing capability of CoreDB.
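A minimal sketch of such a Java client is shown below. It assumes the tweets are available locally as a file with one JSON object per line and that a valid access token has been obtained as described in Section 2.3; the file name, host and token are placeholders.

import java.io.BufferedReader;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative loader: reads tweets (one JSON object per line) and POSTs each
// one to the CoreDB entity endpoint for the 'dsTweets' dataset in MongoDB.
// The file name, host and token are placeholders.
public class TweetLoader {
    public static void main(String[] args) throws Exception {
        String endpoint = "http://CoreDB/api/entity/MongoDB/dsTweets/Tweet";
        String accessToken = "ACCESS_TOKEN";
        HttpClient http = HttpClient.newHttpClient();

        try (BufferedReader reader = Files.newBufferedReader(Path.of("tweets.json"))) {
            String tweetJson;
            while ((tweetJson = reader.readLine()) != null) {
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create(endpoint))
                        .header("Content-Type", "application/json")
                        .header("Authorization", "Bearer " + accessToken)
                        .POST(HttpRequest.BodyPublishers.ofString(tweetJson))
                        .build();
                HttpResponse<String> response =
                        http.send(request, HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() >= 400) {
                    System.err.println("Failed to persist tweet: " + response.body());
                }
            }
        }
    }
}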
(ii) Index and Query. In this part we illustrate the automatic indexing capability of CoreDB. We propose to the attendee a scenario where she would be able to use full-text search and SQL queries to find tweets that contain keywords such as instances of the hospitals, drugs and diseases stored in the relational dataset. Consider the data lake created in the previous step; the following is a sample query for linking a tweet persisted in MongoDB to a tuple (in the Hospital table) stored in the PostgreSQL database:

SELECT tweets.summary, tweets.user_id, tweets.date
FROM mongo.budget.tweets AS tweets
INNER JOIN postgresql.health AS healthDB
  ON tweets.hospitalId = healthDB.hospital.id
WHERE tweets.summary LIKE '%health%'
  AND tweets.body LIKE '%Sydney Hospital%'
  AND tweets.date BETWEEN '21-05-2016' AND '21-08-2016'
  AND healthDB.hospital.name LIKE '%Sydney Hospital%'

Figure 2 illustrates the performance of this query. The experiments were performed on the Amazon EC2 platform using instances running Ubuntu Server 14.04. The scalability experiment was done on a single machine, four machines and eight machines of type t2.large, which provides 8GB of memory, 2 virtual CPUs and 20GB EBS storage, on 15 million tweets. Notice that query processing comes down to three phases (parsing, plan generation and plan execution), and the scalability study (in Figure 2) shows the impact on the execution phase. The results also show that the parsing phase is costly, especially when there are several joins among different databases in the data lake.

Figure 2: Sample query execution time. (Execution time in milliseconds: 1 machine: 966; 4 machines: 571; 8 machines: 419.)
(iii) Construct relationships among the data objects stored in MongoDB and MySQL. To properly analyze the tweets, the budget analyst may need to link tweets to the health-related entities of the budget health program. For example, as a result of the querying step, the analyst identifies a set of tweets which mention the disease Diabetes. We illustrate how it is possible to create a new graph dataset [7] in the data lake and store the 'tweet --(contains)--> diabetes' relationship (a possible call sequence is sketched below).
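The sketch below realizes part (iii) using only the dataset and entity endpoints from Section 2.1. The paper does not prescribe a specific graph store or relationship schema, so the dataset type 'GraphDB_NAME', the entity type 'Relationship' and the JSON field names are purely illustrative placeholders.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative sketch for part (iii): create a graph dataset and store a
// 'tweet --(contains)--> diabetes' relationship as an entity.
// 'GraphDB_NAME', 'Relationship' and the field names are placeholders.
public class RelationshipExample {
    public static void main(String[] args) throws Exception {
        String accessToken = "ACCESS_TOKEN";
        HttpClient http = HttpClient.newHttpClient();

        // 1. Create a graph dataset in the data lake (see 'Create a Dataset' in Section 2.1).
        post(http, "http://CoreDB/api/databases", accessToken,
                "{\"name\":\"dsBudgetGraph\", \"type\":\"GraphDB_NAME\"}");

        // 2. Store the relationship between an identified tweet and the 'diabetes' entity.
        post(http, "http://CoreDB/api/entity/GraphDB_NAME/dsBudgetGraph/Relationship", accessToken,
                "{\"subject\":\"tweet123\", \"predicate\":\"contains\", \"object\":\"diabetes\"}");
    }

    private static void post(HttpClient http, String url, String token, String json) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("Content-Type", "application/json")
                .header("Authorization", "Bearer " + token)
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}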
(iv) Security and Tracing. In this part we propose to the attendee a scenario where she would be able to see the security (Identification and Authentication requirements, System Privilege and Object Access Control, and Data Encryption), tracing and provenance capabilities of CoreDB.
4 RELATED WORK AND CONCLUSION
The two closest systems to our work are AsterixDB [1] and Orchestrate (orchestrate.io/). The added value of CoreDB compared to these systems includes managing multiple database technologies (from relational to NoSQL) and providing a built-in design for security and tracing. Moreover, CoreDB is available as an open source project and through a single REST API. As ongoing work, we are extending the query component to support SPARQL (https://www.w3.org/TR/rdf-sparql-query/) queries. We also plan to leverage our previous work [4] on data curation to enable CoreDB to automatically curate the data items stored in the data lake, e.g. extracting features such as keywords and named entities and persisting them in the data lake.

ACKNOWLEDGMENTS
This research was partially supported by ARC project LP0669090.

REFERENCES
[1] Apache. 2017. AsterixDB. https://asterixdb.apache.org/. (2017).
[2] Seyed-Mehdi-Reza Beheshti, Boualem Benatallah, Sherif Sakr, Daniela Grigori, Hamid Reza Motahari-Nezhad, Moshe Chai Barukh, Ahmed Gater, and Seung Hwan Ryu. 2016. Process Analytics - Concepts and Techniques for Querying and Analyzing Process Data. Springer.
[3] Seyed-Mehdi-Reza Beheshti, Hamid R. Motahari Nezhad, and Boualem Benatallah. 2012. Temporal Provenance Model (TPM): Model and Query Language. CoRR abs/1211.5009 (2012). http://arxiv.org/abs/1211.5009
[4] Seyed-Mehdi-Reza Beheshti, Alireza Tabebordbar, Boualem Benatallah, and Reza Nouri. 2017. On Automating Basic Data Curation Tasks. In Proceedings of the 26th International Conference on World Wide Web Companion, Perth, Australia, April 3-7, 2017. 165-169.
[5] C. Gormley. 2015. Elasticsearch: The Definitive Guide. O'Reilly.
[6] Alon Y. Halevy, Flip Korn, Natalya Fridman Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang. 2016. Managing Google's data lake: an overview of the Goods system. IEEE Data Eng. Bull. 39, 3 (2016), 5-14.
[7] Mohammad Hammoud, Dania Abed Rabbou, Reza Nouri, Seyed-Mehdi-Reza Beheshti, and Sherif Sakr. 2015. DREAM: Distributed RDF Engine with Adaptive Query Planner and Minimal Communication. PVLDB 8, 6 (2015), 654-665.
[8] OPM. 2017. The Open Provenance Model. http://openprovenance.org/. (2017).

