

CoreDB: a Data Lake Service


Amin Beheshti, Boualem Benatallah, Reza Nouri, Van Munin Chhieng, HuangTao Xiong, Xu Zhao
University of New South Wales, Sydney, Australia
ABSTRACT
The continuous improvement in connectivity, storage and data processing capabilities allows access to a data deluge from sensors, social media, news, user-generated, government and private data sources. Accordingly, in a modern data-oriented landscape, with the advent of various data capture and management technologies, organizations are rapidly shifting to datafication of their processes. In such an environment, analysts may need to deal with a collection of datasets, from relational to NoSQL, that holds a vast amount of data gathered from various private/open data islands, i.e. a Data Lake. Organizing, indexing and querying the growing volume of internal data and metadata in a data lake is challenging and requires various skills and experiences to deal with dozens of new database and indexing technologies: How to store information items? What technology to use for persisting the data? How to deal with the large volume of streaming data? How to trace and persist information about data? What technology to use for indexing the data? How to query the data lake? To address the above-mentioned challenges, we present CoreDB, an open source data lake service, which offers researchers and developers a single REST API to organize, index and query their data and metadata. CoreDB manages multiple database technologies and offers a built-in design for security and tracing.

CCS CONCEPTS
• Information systems → Data management systems; Web services;

KEYWORDS
Data Lake, Database Service, Data API

ACM Reference format:
Amin Beheshti, Boualem Benatallah, Reza Nouri, Van Munin Chhieng, HuangTao Xiong, and Xu Zhao. 2017. CoreDB: a Data Lake Service. In Proceedings of CIKM'17, Singapore, Singapore, November 6-10, 2017, 4 pages. https://doi.org/10.1145/3132847.3133171

1 INTRODUCTION
The production of knowledge from an ever-increasing amount of private/open data is seen by many organizations as an increasingly important capability that can complement traditional analytics sources [2]. In this context, modern data-oriented applications deal with various types of data - unstructured, semi-structured and structured - such as emails, tweets, documents, videos and images. For example, consider an analyst who is interested in analyzing the Government Budget by engaging the public's thoughts and opinions on social networks. To achieve this, the analyst may need to deal with a wealth of digital information generated through social networks, blogs, online communities and mobile applications, which forms a complex data lake [6]: a collection of datasets that holds a vast amount of data gathered from various private/open data islands. Organizing and indexing the growing volume of internal data and metadata in the data lake is challenging and requires a vast amount of knowledge to deal with dozens of new database and indexing technologies.

In particular, for an analyst who is dealing with the data layer for organizing, indexing and querying different types of data - from structured entities to be stored in relational databases to large volumes of open data to be organized using appropriate NoSQL databases such as MongoDB or CouchDB - various skills and experiences may be required: How to store information items (from structured entities to unstructured documents)? What technology to use for persisting the data (from relational to NoSQL databases)? How to deal with the large volume of data being generated on a continuous basis (from key-value and document stores to object and graph stores)? How to trace and persist information about data (from descriptive to administrative)? What technology to use for indexing the data/metadata? How to query the data lake (from SQL to full-text search)?

To address the above-mentioned challenges, we present CoreDB, an open source data lake service, which offers researchers and developers a single REST API to organize, index and query their data and metadata. CoreDB manages multiple database technologies (from relational to NoSQL databases), exposes the power of Elasticsearch [5] and weaves them together at the application layer. CoreDB offers a built-in design to support: (i) Security and Access Control: to protect against database security threats through authentication, access control and data encryption; and (ii) Tracing and Provenance [3, 8]: to collect and aggregate tracing metadata, including descriptive, administrative and temporal metadata, and to build a provenance graph.


Figure 1: CoreDB Architecture. (The figure shows the CoreDB REST API layered over components for Security (authentication, access control, data encryption), Tracing and Provenance, Meta-Data, Index, Search (full-text), Query (SQL via Apache Phoenix and Apache Drill) and CRUD (Create, Read, Update, Delete), on top of relational databases such as MySQL, PostgreSQL and SQL Server and NoSQL databases such as MongoDB, CouchDB, HBase and Hive.)

The CoreDB API is available as an open source project on GitHub (https://github.com/unsw-cse-soc/CoreDB). The rest of the paper is organized as follows. In Section 2, we present an overview of CoreDB, while in Section 3 we describe our demonstration scenario.

2 COREDB OVERVIEW
CoreDB is an open source, complete database service that powers multiple relational and NoSQL (key-value, document and graph store) databases-as-a-service for developing Web data applications, i.e. data-driven Web applications. CoreDB enables analysts to build a data lake, create relational and/or NoSQL datasets within the data lake, and CRUD (Create, Read, Update and Delete) and query entities in those datasets. CoreDB exposes the power of Elasticsearch, a search engine based on Apache Lucene (lucene.apache.org/), to support powerful indexing and full-text search. CoreDB has a built-in design to address top database security threats (Authentication, Access Control and Data Encryption), along with Tracing and Provenance support. CoreDB weaves all these services together at the application layer and offers a single REST API to organize, index and query the data and metadata in a data lake. Figure 1 illustrates the architecture and the main components of CoreDB.

2.1 CRUD Data Lake, Dataset and Entity
The top-level organizing concept in CoreDB is the Data Lake: a collection of datasets that holds a vast amount of data gathered from various private/open data islands. Within the Data Lake one can create a dataset of type relational and/or NoSQL database. CoreDB offers a single REST API to create a set of datasets and weave them together at the application layer. To create a relational database, a database connection configuration operation is provided to enable access to many relational databases such as MySQL, PostgreSQL and Oracle. Moreover, CoreDB leverages appropriate NoSQL databases such as MongoDB, HBase and Hive to address key-value, document and graph store requirements. CoreDB persists entities (structured and unstructured) in JSON format, an easy-to-parse structure, owing to its growing adoption in Web data applications. Considering the self-describing nature of JSON documents, in CoreDB we extend JSON with the option of defining a schema for all or part of the data. The following statements illustrate how to call the CoreDB service to create a Data Lake and a Dataset (relational or NoSQL):

Create a Data Lake:
curl -H "Content-Type: application/json" -X POST \
  -d '{"name":"DataLake_NAME"}' http://CoreDB/api/clients

Create a Dataset:
curl -H "Content-Type: application/json" \
  -H "Authorization: Bearer ACCESS_TOKEN" -X POST \
  -d '{"name":"Dataset_NAME", "type":"Database_NAME"}' http://CoreDB/api/databases

When creating a data lake, the 'DataLake_NAME' parameter needs to be replaced by the user. When calling this service, an access token ('ACCESS_TOKEN') will be returned which enables the user to access the data lake, and which will be required for creating, reading, updating and deleting a dataset or an entity in the data lake. For example, to create a dataset (named 'dsTweets') for storing a set of tweets in MongoDB, the 'Dataset_NAME' parameter can be replaced with 'dsTweets' and the 'Database_NAME' parameter should be replaced with 'MongoDB'. CoreDB supports various relational and NoSQL databases such as MySQL, PostgreSQL, Oracle, MongoDB, HBase and Hive. The URL ('http://CoreDB/') indicates the Web address where the CoreDB service is deployed. The next step is to use the CoreDB service to CRUD entities:

Create an Entity:
curl -H "Content-Type: application/json" \
  -H "Authorization: Bearer ACCESS_TOKEN" -X POST \
  -d '{"Param1":"Value1", "Param2":"Value2", ...}' \
  http://CoreDB/api/entity/{Database_NAME}/{Dataset_NAME}/{Entity_TYPE}

Read an Entity:
curl -H "Content-Type: application/json" \
  -H "Authorization: Bearer ACCESS_TOKEN" -X GET \
  http://CoreDB/api/entity/{Database_NAME}/{Dataset_NAME}/{Entity_TYPE}/{id}


Update an Entity:
curl -H "Content-Type: application/json" \
  -H "Authorization: Bearer ACCESS_TOKEN" -X PUT \
  -d '{"Param1":"Value1", "Param2":"Value2"}' \
  http://CoreDB/api/entity/{Database_NAME}/{Dataset_NAME}/{Entity_TYPE}/{id}

Delete an Entity:
curl -H "Content-Type: application/json" \
  -H "Authorization: Bearer ACCESS_TOKEN" -X DELETE \
  http://CoreDB/api/entity/{Database_NAME}/{Dataset_NAME}/{Entity_TYPE}/{id}

When creating, reading, updating or deleting an entity, the 'Database_NAME', 'Dataset_NAME' and 'Entity_TYPE' parameters need to be replaced by the user. For example, to CRUD a tweet in the 'dsTweets' dataset stored in MongoDB, the 'Database_NAME', 'Dataset_NAME' and 'Entity_TYPE' parameters should be replaced with 'MongoDB', 'dsTweets' and 'Tweet' respectively.
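The same calls can be issued from application code with any HTTP client. The following Java sketch is illustrative only and is not part of the CoreDB distribution; the base URL and access token are placeholders, and the JSON payload is whatever entity the application wants to persist.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative wrapper around the CoreDB entity endpoints described above.
// The base URL, token and entity fields are placeholders, not a documented client API.
public class CoreDBEntityClient {
    private final HttpClient http = HttpClient.newHttpClient();
    private final String baseUrl;      // e.g. "http://CoreDB/api"
    private final String accessToken;  // the ACCESS_TOKEN returned when the data lake was created

    public CoreDBEntityClient(String baseUrl, String accessToken) {
        this.baseUrl = baseUrl;
        this.accessToken = accessToken;
    }

    // POST /api/entity/{Database_NAME}/{Dataset_NAME}/{Entity_TYPE}
    public String createEntity(String db, String dataset, String type, String json) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/entity/" + db + "/" + dataset + "/" + type))
                .header("Content-Type", "application/json")
                .header("Authorization", "Bearer " + accessToken)
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
        return http.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // GET /api/entity/{Database_NAME}/{Dataset_NAME}/{Entity_TYPE}/{id}
    public String readEntity(String db, String dataset, String type, String id) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/entity/" + db + "/" + dataset + "/" + type + "/" + id))
                .header("Authorization", "Bearer " + accessToken)
                .GET()
                .build();
        return http.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}

For the running example, new CoreDBEntityClient("http://CoreDB/api", token).createEntity("MongoDB", "dsTweets", "Tweet", tweetJson) would issue the same request as the 'Create an Entity' call above.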
2.2 Index and Query
Index. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references). CoreDB exposes the power of Elasticsearch without imposing the operational burden of managing it on developers. In particular, when the user enables indexing while creating a dataset, the entities will be automatically indexed for powerful Lucene queries.
Query. CoreDB enables the power of standard SQL with full ACID transaction capabilities for querying data held not only in relational databases but also in NoSQL databases. In particular, using a simple REST API, it is possible to send a SQL query to be applied to the datasets created in the data lake. To achieve this, in CoreDB we leverage Apache Phoenix (phoenix.apache.org/) to take the SQL query and compile it into native NoSQL store APIs. Moreover, to support queries that need to join data from multiple datastores in the data lake, we leverage Apache Drill (drill.apache.org/). Considering that Elasticsearch is a search engine based on Lucene, it is also possible to apply Wildcard, Fuzzy, Proximity and Range search queries in CoreDB. The following statement illustrates how to call the CoreDB service to apply a query (SQL or full-text search) to the data lake:

curl -H "Content-Type: application/json" \
  -H "Authorization: Bearer ACCESS_TOKEN" -X GET \
  http://coredbapi/entity/{Database_NAME}/{Dataset_NAME}/{Entity_TYPE}?query={query}

For example, to find the tweets (stored in the 'dsTweets' dataset persisted in the MongoDB database) that contain the keyword 'CIKM', the following query can be used:

curl -g -H "Content-Type: application/json" \
  -H "Authorization: Bearer ACCESS_TOKEN" -X GET \
  'http://coredbapi/entity/dsTweets/MongoDB/Tweet?query={"match":{"text":"CIKM"}}'
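Since the index is backed by Elasticsearch, the other query types mentioned above (wildcard, fuzzy, proximity, range) can in principle be passed through the same ?query= parameter. The following Java sketch is illustrative only: it assumes the parameter accepts Elasticsearch query DSL exactly as in the 'match' example above, and simply URL-encodes a standard 'wildcard' query before issuing the GET request; the host and token are placeholders.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Illustrative full-text search call against the CoreDB query endpoint.
// 'wildcard' is standard Elasticsearch query DSL; the endpoint shape follows
// the 'match' example above.
public class FullTextSearchExample {
    public static void main(String[] args) throws Exception {
        String accessToken = "ACCESS_TOKEN";                     // token returned by CoreDB
        String dsl = "{\"wildcard\": {\"text\": \"diabet*\"}}";  // matches diabetes, diabetic, ...
        String url = "http://coredbapi/entity/dsTweets/MongoDB/Tweet?query="
                + URLEncoder.encode(dsl, StandardCharsets.UTF_8);

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("Content-Type", "application/json")
                .header("Authorization", "Bearer " + accessToken)
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());                      // JSON list of matching tweets
    }
}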
2.3 Security and Access Control
Database servers are among the most important systems in virtually all organizations. They store critical information (e.g. email, financial data, personal data) that is vital for organizations. CoreDB has a built-in design to address top database security threats (e.g. weak authentication and weak system configuration). In particular, CoreDB supports Identification and Authentication requirements, System Privilege and Object Access Control, and Data Encryption. For example, each user may be identified and authenticated by the database system and has different access levels (e.g. create, read, update and delete) to system entities through support for Roles, Responsibilities and Privileges, System Privileges and Object Privileges. In CoreDB, privileges are granted directly to users or through roles. The following statements illustrate how to use the CoreDB service to create a user and get an access token.

Create a User:
curl -H "Content-Type: application/json" -X POST \
  -d '{"userName":"USER_NAME", "password":"PASSWORD", "role":"ROLE", "clientName":"DataLake_NAME", "clientSecret":"DataLake_SECRET"}' \
  http://CoreDB/api/account

GET Access Token:
curl -H "Content-Type: application/json" -X POST \
  -d '{"userName":"USERNAME", "password":"PASSWORD", "grant_type":"PASSWORD", "clientName":"YOUR_CLIENT", "clientSecret":"YOUR_CLIENT_SECRET"}' \
  http://CoreDB/api/oauth

After creating the user and receiving the access token, it is possible to use the following statement to grant an action (create, read, update, delete and query) to a specific role:

Define Access Control:
curl -H "Content-Type: application/json" \
  -H "Authorization: Bearer ACCESS_TOKEN" -X POST \
  -d '{"role":{"action":"TRUE/FALSE"}}' \
  http://coredbapi/api/{Database_NAME}/{Dataset_NAME}/{Entity_TYPE}

2.4 Tracing and Provenance
Tracing entities over time is very important and assists analysts in understanding at what time an entity was created, read, updated, deleted or queried (and who did this), where it happened (e.g. IP address), and on what platform (e.g. mobile or PC). To address this important requirement, the CoreDB API provides a very useful and powerful functionality for tracing historical data back to users. CoreDB offers a built-in design to collect and aggregate tracing metadata, including descriptive, administrative and temporal metadata. We use the tracing metadata to build a provenance graph [3]: a directed acyclic attributed graph where the nodes are users/roles and entities, and the relationships among them represent activities such as created, read, updated, deleted or queried. The relationships are tagged with metadata such as timestamp and location. The following statement illustrates how to call the CoreDB service to receive the provenance graph of a particular entity:

curl -H "Content-Type: application/json" \
  -H "Authorization: Bearer ACCESS_TOKEN" -X GET \
  http://dataapi/api/entity/trace/{Database_NAME}/{Dataset_NAME}/{Entity_TYPE}/{id}

The result will be a JSON file containing a finite set of triples (subject, predicate, object) representing the relationships between entities in the provenance graph. For example, the triple (david, read[ts:20170320;ip:100.101.102.103], tweet123) represents that a user with unique id 'david' read a tweet with the unique id 'tweet123' on the 20th of March 2017 using a computer with IP address '100.101.102.103'.
3 DEMONSTRATION SCENARIO
Governments at all levels are starting to recognize the value in their budgeting process. In this context, the budget is the single most important policy document of governments, where policy objectives are reconciled and implemented in various categories such as 'Health', 'Social-Services', 'Transport' and 'Employment'. In the demonstration scenario we present the requirements of an analyst who is interested in analyzing the Government Budget, specifically the Health program, through engaging the public's thoughts and opinions on social networks. The goal here is to properly link the data objects in social networks (e.g. tweets in Twitter, https://support.twitter.com/articles/215585) to the health category of the budget. The demonstration scenario consists of the following parts:



(i) Data Definition and Manipulation. The budget analyst will use the CoreDB service to create a data lake. Considering that the Australian budget 2016-17 was handed down on Tuesday 3 May 2016, the analyst is interested in persisting all the tweets from one month before to two months after this date. The analyst will create a NoSQL dataset to store these tweets (more than 15 million tweets) in MongoDB. We illustrate that it is possible to fill this dataset entity by entity (a simple Java client that reads the entities and calls the CoreDB service, sketched below) or to read all the tweets (uploaded somewhere on the Web in a JSON format) and persist them in the dataset using the CoreDB service. Then, the analyst will be interested in creating a relational dataset in a MySQL database to store the main entities related to the budget health program, such as registered doctors and nurses in Australia, hospitals and pharmacies, health funds, medical devices, drugs, diseases and keywords related to health. This information can later be used to filter tweets related to health. To build this dataset, the analyst will create a set of users and access tokens: these credentials will be provided to a set of users who will help in populating the dataset. This scenario will help us to illustrate the tracing capability of CoreDB.
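A minimal sketch of such a Java client is shown below. It assumes the tweets are available locally as a file with one JSON object per line and that a valid access token has been obtained as described in Section 2.3; the file name, host and token are placeholders.

import java.io.BufferedReader;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative loader: reads tweets (one JSON object per line) and POSTs each
// one to the CoreDB entity endpoint for the 'dsTweets' dataset in MongoDB.
// The file name, host and token are placeholders.
public class TweetLoader {
    public static void main(String[] args) throws Exception {
        String endpoint = "http://CoreDB/api/entity/MongoDB/dsTweets/Tweet";
        String accessToken = "ACCESS_TOKEN";
        HttpClient http = HttpClient.newHttpClient();

        try (BufferedReader reader = Files.newBufferedReader(Path.of("tweets.json"))) {
            String tweetJson;
            while ((tweetJson = reader.readLine()) != null) {
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create(endpoint))
                        .header("Content-Type", "application/json")
                        .header("Authorization", "Bearer " + accessToken)
                        .POST(HttpRequest.BodyPublishers.ofString(tweetJson))
                        .build();
                HttpResponse<String> response =
                        http.send(request, HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() >= 400) {
                    System.err.println("Failed to persist tweet: " + response.body());
                }
            }
        }
    }
}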
(ii) Index and Query. In this part we illustrate the automatic indexing capability of CoreDB. We propose to the attendee a scenario where she would be able to use full-text search and SQL queries to find tweets that contain keywords such as instances of the hospitals, drugs and diseases stored in the relational dataset. Consider the data lake created in the previous step; the following is a sample query for linking a tweet persisted in MongoDB to a tuple (in the Hospital table) stored in the PostgreSQL database:

SELECT tweets.summary, tweets.user_id, tweets.date
FROM mongo.budget.tweets AS tweets
INNER JOIN postgresql.health AS healthDB
  ON tweets.hospitalId = healthDB.hospital.id
WHERE tweets.summary LIKE '%health%'
  AND tweets.body LIKE '%Sydney Hospital%'
  AND tweets.date BETWEEN '21-05-2016' AND '21-08-2016'
  AND healthDB.hospital.name LIKE '%Sydney Hospital%'

Figure 2 illustrates the performance of this query. The experiments were performed on the Amazon EC2 platform using instances running Ubuntu Server 14.04. The scalability experiment was done on a single machine, four machines and eight machines of type t2.large, which provides 8GB of memory, 2 virtual CPUs and 20GB EBS storage, on 15 million tweets. Notice that query processing comes down to three phases (parsing, plan generation and plan execution), and the scalability study (in Figure 2) shows the impact on the execution phase. The results also show that the parsing phase is costly, especially when there are several joins among different databases in the data lake.

Figure 2: Sample query execution time. (Execution time in milliseconds: 1 machine: 966; 4 machines: 571; 8 machines: 419.)
(iii) Construct relationships among the data objects stored in MongoDB and MySQL. To properly analyze the tweets, the budget analyst may need to link tweets to the health-related entities of the budget health program. For example, as a result of the querying step, the analyst identifies a set of tweets which mention the disease Diabetes. We illustrate how it is possible to create a new graph dataset [7] in the data lake and store the 'tweet --(contains)--> diabetes' relationship (a possible call sequence is sketched below).
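The sketch below realizes part (iii) using only the dataset and entity endpoints from Section 2.1. The paper does not prescribe a specific graph store or relationship schema, so the dataset type 'GraphDB_NAME', the entity type 'Relationship' and the JSON field names are purely illustrative placeholders.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative sketch for part (iii): create a graph dataset and store a
// 'tweet --(contains)--> diabetes' relationship as an entity.
// 'GraphDB_NAME', 'Relationship' and the field names are placeholders.
public class RelationshipExample {
    public static void main(String[] args) throws Exception {
        String accessToken = "ACCESS_TOKEN";
        HttpClient http = HttpClient.newHttpClient();

        // 1. Create a graph dataset in the data lake (see 'Create a Dataset' in Section 2.1).
        post(http, "http://CoreDB/api/databases", accessToken,
                "{\"name\":\"dsBudgetGraph\", \"type\":\"GraphDB_NAME\"}");

        // 2. Store the relationship between an identified tweet and the 'diabetes' entity.
        post(http, "http://CoreDB/api/entity/GraphDB_NAME/dsBudgetGraph/Relationship", accessToken,
                "{\"subject\":\"tweet123\", \"predicate\":\"contains\", \"object\":\"diabetes\"}");
    }

    private static void post(HttpClient http, String url, String token, String json) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("Content-Type", "application/json")
                .header("Authorization", "Bearer " + token)
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}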
(iv) Security and Tracing. In this part we propose to the attendee a scenario where she would be able to see the security (Identification and Authentication requirements, System Privilege and Object Access Control, and Data Encryption), tracing and provenance capabilities of CoreDB.
4 RELATED WORK AND CONCLUSION
The two closest systems to our work are AsterixDB [1] and Orchestrate (orchestrate.io/). The added value of CoreDB compared to these systems includes managing multiple database technologies (from relational to NoSQL) and providing a built-in design for security and tracing. Moreover, CoreDB is available as an open source project and through a single REST API. As ongoing work, we are extending the query component to support SPARQL (https://www.w3.org/TR/rdf-sparql-query/) queries. We also plan to leverage our previous work [4] on data curation to enable CoreDB to automatically curate the data items stored in the data lake, e.g. extracting features such as keywords and named entities and persisting them in the data lake.

ACKNOWLEDGMENTS
This research was partially supported by ARC project LP0669090.

REFERENCES
[1] Apache. 2017. AsterixDB. https://asterixdb.apache.org/. (2017).
[2] Seyed-Mehdi-Reza Beheshti, Boualem Benatallah, Sherif Sakr, Daniela Grigori, Hamid Reza Motahari-Nezhad, Moshe Chai Barukh, Ahmed Gater, and Seung Hwan Ryu. 2016. Process Analytics - Concepts and Techniques for Querying and Analyzing Process Data. Springer.
[3] Seyed-Mehdi-Reza Beheshti, Hamid R. Motahari Nezhad, and Boualem Benatallah. 2012. Temporal Provenance Model (TPM): Model and Query Language. CoRR abs/1211.5009 (2012). http://arxiv.org/abs/1211.5009
[4] Seyed-Mehdi-Reza Beheshti, Alireza Tabebordbar, Boualem Benatallah, and Reza Nouri. 2017. On Automating Basic Data Curation Tasks. In Proceedings of the 26th International Conference on World Wide Web Companion, Perth, Australia, April 3-7, 2017. 165-169.
[5] C. Gormley. 2015. Elasticsearch: The Definitive Guide. O'Reilly.
[6] Alon Y. Halevy, Flip Korn, Natalya Fridman Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang. 2016. Managing Google's data lake: an overview of the Goods system. IEEE Data Eng. Bull. 39, 3 (2016), 5-14.
[7] Mohammad Hammoud, Dania Abed Rabbou, Reza Nouri, Seyed-Mehdi-Reza Beheshti, and Sherif Sakr. 2015. DREAM: Distributed RDF Engine with Adaptive Query Planner and Minimal Communication. PVLDB 8, 6 (2015), 654-665.
[8] OPM. 2017. The Open Provenance Model. http://openprovenance.org/. (2017).

