CoreDB - A Data Lake Service
Demonstration CIKM’17, November 6-10, 2017, Singapore
[Figure 1: architecture and main components of CoreDB. Recoverable labels: Query; Full-Text Search; Index; Meta-Data; Security; Encryption, etc.]
The CoreDB API is available as an open source project on GitHub¹.
¹ https://github.com/unsw-cse-soc/CoreDB

The rest of the paper is organized as follows. In Section 2, we present an overview of CoreDB, while in Section 3 we describe our demonstration scenario.

2 COREDB OVERVIEW
CoreDB is an open source, complete Database Service that powers multiple relational and NoSQL (key/value, document and graph stores) databases-as-a-service for developing Web data applications, i.e. data-driven Web applications. CoreDB enables analysts to build a data lake, create relational and/or NoSQL datasets within the data lake, and CRUD (Create, Read, Update and Delete) and query entities in those datasets. CoreDB exposes the power of Elasticsearch, a search engine based on Apache Lucene (lucene.apache.org/), to support powerful indexing and full-text search. CoreDB has a built-in design to address top database security threats through Authentication, Access Control and Data Encryption, along with Tracing and Provenance support. CoreDB weaves all these services together at the application layer and offers a single REST API to organize, index and query the data and metadata in a data lake. Figure 1 illustrates the architecture and the main components of CoreDB.

2.1 CRUD Data Lake, Dataset and Entity
The top-level organizing concept in CoreDB is the Data Lake: a collection of datasets that holds a vast amount of data gathered from various private/open data islands. Within the Data Lake one can create a dataset of type relational and/or NoSQL database. CoreDB offers a single REST API to create a set of datasets and weave them together at the application layer. To create a relational database, a database connection configuration operation is provided to enable access to many relational databases such as MySQL, PostgreSQL and Oracle. Moreover, CoreDB leverages appropriate NoSQL databases such as MongoDB, HBase and HIVE to organize key-value, document and graph stores requirements. CoreDB persists the entities (structured and unstructured) in JSON, an easy-to-parse format chosen for its growing adoption in Web data applications. Considering the self-describing nature of JSON documents, in CoreDB we extend JSON with the option of defining a schema for all or part of the data. The following statements illustrate how to call the CoreDB service to create a Data Lake and a Dataset (Relational or NoSQL):

Create a Data lake:
  curl -H "Content-Type: application/json" -X POST \
    -d '{"name":"DataLake_NAME"}' http://CoreDB/api/clients

Create a Dataset:
  curl -H "Content-Type: application/json" \
    -H "Authorization: Bearer ACCESS_TOKEN" -X POST \
    -d '{"name":"Dataset_NAME", "type": "Database_NAME"}' \
    http://CoreDB/api/databases

When creating a data lake, the 'DataLake_NAME' parameter must be replaced with a name chosen by the user. Calling this service returns an access token ('ACCESS_TOKEN'), which enables the user to access the data lake and is required for creating, reading, updating and deleting a dataset or an entity in the data lake. For example, to create a dataset (named 'dsTweets') for storing a set of tweets in MongoDB, the 'Dataset_NAME' parameter can be replaced with 'dsTweets' and the 'Database_NAME' parameter should be replaced with 'MongoDB'. CoreDB supports various relational and NoSQL databases such as MySQL, PostgreSQL, Oracle, MongoDB, HBase and HIVE. The URL ('http://CoreDB/') stands for the Web address where the CoreDB service is deployed. The next step is to use the CoreDB service to CRUD entities:

Create an Entity:
  curl -H "Content-Type: application/json" \
    -H "Authorization: Bearer ACCESS_TOKEN" -X POST \
    -d '{"Param1":"Value1", "Param2": "Value2", ...}' \
    http://CoreDB/api/entity/{Database_NAME}/{Dataset_NAME}/{Entity_TYPE}

Read an Entity:
  curl -H "Content-Type: application/json" \
    -H "Authorization: Bearer ACCESS_TOKEN" -X GET \
    http://CoreDB/api/entity/{Database_NAME}/{Dataset_NAME}/{Entity_TYPE}/{id}
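The create and read calls described in this section can also be issued from a program. The following is a minimal Python sketch, not part of the CoreDB distribution, that only builds the HTTP requests without sending them. The endpoint paths, headers and payload keys are taken from the statements in this section; the helper names (`build_request`, `entity_path`, and so on), the `http://CoreDB` base URL placeholder and the `text` field of the sample tweet are assumptions made for illustration.

```python
import json
import urllib.request
from urllib.parse import quote

# Placeholder address of the CoreDB deployment (assumption: replace with the real host).
BASE_URL = "http://CoreDB"


def build_request(method, path, token=None, body=None):
    """Build (but do not send) a JSON request against the CoreDB REST API."""
    headers = {"Content-Type": "application/json"}
    if token is not None:
        headers["Authorization"] = "Bearer " + token
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(BASE_URL + path, data=data,
                                  headers=headers, method=method)


def create_data_lake(name):
    # POST /api/clients with {"name": ...}; no token is needed for this call.
    return build_request("POST", "/api/clients", body={"name": name})


def create_dataset(token, dataset, database):
    # POST /api/databases with the dataset name and the backing database type.
    return build_request("POST", "/api/databases", token=token,
                         body={"name": dataset, "type": database})


def entity_path(database, dataset, entity_type, entity_id=None):
    # /api/entity/{Database_NAME}/{Dataset_NAME}/{Entity_TYPE}[/{id}]
    parts = [database, dataset, entity_type]
    if entity_id is not None:
        parts.append(str(entity_id))
    return "/api/entity/" + "/".join(quote(p, safe="") for p in parts)


def create_entity(token, database, dataset, entity_type, fields):
    return build_request("POST", entity_path(database, dataset, entity_type),
                         token=token, body=fields)


def read_entity(token, database, dataset, entity_type, entity_id):
    return build_request("GET",
                         entity_path(database, dataset, entity_type, entity_id),
                         token=token)


# Hypothetical tweet stored in the 'dsTweets' dataset on MongoDB.
req = create_entity("ACCESS_TOKEN", "MongoDB", "dsTweets", "Tweet",
                    {"text": "budget 2016"})
print(req.get_method(), req.full_url)
```

To execute a request, pass the returned object to `urllib.request.urlopen`. The response format is not specified in this section, so the sketch does not attempt to parse it.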
Update an Entity:
  curl -H "Content-Type: application/json" \
    -H "Authorization: Bearer ACCESS_TOKEN" -X PUT \
    -d '{"Param1":"Value1", "Param2": "Value2"}' \
    http://CoreDB/api/entity/{Database_NAME}/{Dataset_NAME}/{Entity_TYPE}/{id}

Delete an Entity:
  curl -H "Content-Type: application/json" \
    -H "Authorization: Bearer ACCESS_TOKEN" -X DELETE \
    http://CoreDB/api/entity/{Database_NAME}/{Dataset_NAME}/{Entity_TYPE}/{id}

When creating, reading, updating or deleting an entity, the 'Database_NAME', 'Dataset_NAME' and 'Entity_TYPE' parameters need to be replaced by the user. For example, to CRUD a tweet in the 'dsTweets' dataset stored in MongoDB, the 'Database_NAME', 'Dataset_NAME' and 'Entity_TYPE' parameters should be replaced with 'MongoDB', 'dsTweets' and 'Tweet' respectively.

2.2 Index and Query
Index. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references). CoreDB exposes the power of Elasticsearch without the

update, and delete) to system entities by supporting Roles, Responsibilities and Privileges, System Privileges and Object Privileges. In CoreDB, privileges are provided directly to users or through roles. The following statements illustrate how to use the CoreDB service to create a user and get an access token.

Create a User:
  curl -H "Content-Type: application/json" -X POST \
    -d '{"userName": "USER_NAME", "password": "PASSWORD", "role":"ROLE", "clientName": "DataLake_NAME", "clientSecret":"DataLake_SECRET"}' \
    http://CoreDB/api/account

Get Access Token:
  curl -H "Content-Type: application/json" -X POST \
    -d '{"userName": "USERNAME", "password": "PASSWORD", "grant_type": "PASSWORD", "clientName":"YOUR_CLIENT", "clientSecret":"YOUR_CLIENT_SECRET"}' \
    http://CoreDB/api/oauth

After creating the user and receiving the access token, it is possible to use the following statement to grant an action (create, read, update, delete and query) to a specific role:

Define Access Control:
  curl -H "Content-Type: application/json" \
    -H "Authorization: Bearer ACCESS_TOKEN" -X POST \
    -d '{"role":{"action":"TRUE/FALSE"}}' \
    http://coredbapi/api/{Database_NAME}/{Dataset_NAME}/{Entity_TYPE}
[Figure: plot with axis label "Time in Millisecond"; series: 1 Machine, 4 Machines, 8 Machines]

category of the budget. The demonstration scenario consists of the following parts:
(i) Data Definition and manipulation. The budget analyst will use the CoreDB service to create a data lake. Considering that the Australian budget 2016-17 was handed down on Tuesday, 3 May 2016, the analyst is interested in persisting all the tweets from one month before and two months after this date. The analyst will create a NoSQL dataset to store these tweets (more than 15 million tweets)

² https://support.twitter.com/articles/215585
³ https://www.w3.org/TR/rdf-sparql-query/