
A Big Data Modeling Methodology for Apache Cassandra
Conference Paper · June 2015
DOI: 10.1109/BigDataCongress.2015.41


A Big Data Modeling Methodology
for Apache Cassandra
Artem Chebotko Andrey Kashlev Shiyong Lu
DataStax Inc. Wayne State University Wayne State University
Email: [email protected] Email: [email protected] Email: [email protected]

Abstract—Apache Cassandra is a leading distributed database of choice when it comes to big data management with zero downtime, linear scalability, and seamless multiple data center deployment. With increasingly wider adoption of Cassandra for online transaction processing by hundreds of Web-scale companies, there is a growing need for a rigorous and practical data modeling approach that ensures sound and efficient schema design. This work i) proposes the first query-driven big data modeling methodology for Apache Cassandra, ii) defines important data modeling principles, mapping rules, and mapping patterns to guide logical data modeling, iii) presents visual diagrams for Cassandra logical and physical data models, and iv) demonstrates a data modeling tool that automates the entire data modeling process.

Keywords—Apache Cassandra, data modeling, automation, KDM, database design, big data, Chebotko Diagrams, CQL

I. INTRODUCTION

Apache Cassandra [1], [2] is a leading transactional, scalable, and highly-available distributed database. It is known to manage some of the world's largest datasets on clusters with many thousands of nodes deployed across multiple data centers. Cassandra data management use cases include product catalogs and playlists, sensor data and Internet of Things, messaging and social networking, recommendation, personalization, fraud detection, and numerous other applications that deal with time series data. The wide adoption of Cassandra [3] in big data applications is attributed to, among other things, its scalable and fault-tolerant peer-to-peer architecture [4], its versatile and flexible data model that evolved from the BigTable data model [5], the declarative and user-friendly Cassandra Query Language (CQL), and very efficient write and read access paths that enable critical big data applications to stay always on, scale to millions of transactions per second, and handle node and even entire data center failures with ease. One of the biggest challenges that new projects face when adopting Cassandra is data modeling, which differs significantly from the traditional data modeling approaches used in the past.

Traditional data modeling methodology, which is used in relational databases, defines well-established steps shaped by decades of database research [6], [7], [8]. A database designer typically follows the database schema design workflow depicted in Fig. 1(a) to define a conceptual data model, map it to a relational data model, normalize relations, and apply various optimizations to produce an efficient database schema with tables and indexes. In this process, the primary focus is placed on understanding and organizing data into relations, minimizing data redundancy, and avoiding data duplication. Queries play a secondary role in schema design. Query analysis is frequently omitted at the early design stage because of the expressivity of the Structured Query Language (SQL), which readily supports relational joins, nested queries, data aggregation, and numerous other features that help to retrieve a desired subset of stored data. As a result, traditional data modeling is a purely data-driven process, where data access patterns are only taken into account to create additional indexes and occasional materialized views to optimize the most frequently executed queries.

In contrast, known principles used in traditional database design cannot be directly applied to data modeling in Cassandra. First, the Cassandra data model is designed to achieve superior write and read performance for a specified set of queries that an application needs to run. Data modeling for Cassandra starts with application queries. Thus, designing Cassandra tables based on a conceptual data model alone, without taking queries into consideration, leads to either inefficient queries or queries that cannot be supported by a data model. Second, CQL does not support many of the constructs that are common in SQL, including expensive table joins and data aggregation. Instead, efficient Cassandra database schema design relies on data nesting or schema denormalization to enable complex queries to be answered by accessing only a single table. It is common that the same data is stored in multiple Cassandra tables to support different queries, which results in data duplication. Thus, the traditional philosophy of normalization and minimizing data redundancy is rather opposite to data modeling techniques for Cassandra. To summarize, traditional database design is not suitable for developing correct, let alone efficient, Cassandra data models.

In this paper, we propose a novel query-driven data modeling methodology for Apache Cassandra. A high-level overview of our methodology is shown in Fig. 1(b). A Cassandra solution architect, a role that encompasses both database design and application design tasks, starts data modeling by building a conceptual data model and defining an application workflow to capture all application interactions with a database. The application workflow describes access patterns or queries that a data-driven application needs to run against the database. Based on the identified access patterns, the solution architect maps the conceptual data model to a logical data model. The logical data model specifies Cassandra tables that can efficiently support application queries according to the application workflow. Finally, additional physical optimizations concerning data types, keys, partition sizes, and ordering are applied to produce a physical data model that can be instantiated in Cassandra using CQL.

The most important innovation of our methodology, when compared to relational database design, is that the application workflow and the access patterns become first-class citizens
Fig. 1: Traditional data modeling compared with our proposed methodology for Cassandra.

in the data modeling process. Cassandra database design revolves around both the application workflow and the data, and both are of paramount importance. Another key difference of our approach compared to the traditional strategy is that normalization is eliminated and data nesting is used to design tables for the logical data model. This also implies that joins are replaced with data duplication and materialized views for complex application queries. These drastic differences demand much more than a mere adjustment of data modeling practices. They call for a new way of thinking, a paradigm shift from a purely data-driven approach to a query-driven data modeling process.

To the best of our knowledge, this work presents the first query-driven data modeling methodology for Apache Cassandra. Our main contributions are: (i) a first-of-its-kind data modeling methodology for Apache Cassandra, (ii) a set of modeling principles, mapping rules, and mapping patterns that guide a logical data modeling process, (iii) a visualization technique, called Chebotko Diagrams, for logical and physical data models, and (iv) a data modeling tool, called KDM, that automates Cassandra database schema design according to the proposed methodology. Our methodology has been successfully applied to real-world use cases at a number of companies and is incorporated as part of the DataStax Cassandra training curriculum [9].

The rest of the paper is organized as follows. Section II provides background on the Cassandra data model. Section III introduces conceptual data modeling and application workflows. Section IV elaborates on the query-driven mapping from a conceptual data model to a logical data model. Section V briefly introduces physical data modeling. Section VI illustrates the use of Chebotko Diagrams for visualizing logical and physical data models. Section VII presents our KDM tool that automates the data modeling process. Finally, Sections VIII and IX present related work and conclusions.

II. THE CASSANDRA DATA MODEL

A database schema in Cassandra is represented by a keyspace that serves as a top-level namespace where all other data objects, such as tables, reside¹. Within a keyspace, a set of CQL tables is defined to store and query data for a particular application. In this section, we discuss the table and query models used in Cassandra.

¹Another important function of a keyspace is the specification of a data replication strategy, a topic that lies beyond the scope of this paper.

A. Table Model

The notion of a table in Cassandra is different from the notion of a table in a relational database. A CQL table (hereafter referred to as a table) can be viewed as a set of partitions that contain rows with a similar structure. Each partition in a table has a unique partition key, and each row in a partition may optionally have a unique clustering key. Both keys can be simple (one column) or composite (multiple columns). The combination of a partition key and a clustering key uniquely identifies a row in a table and is called a primary key. While the partition key component of a primary key is always mandatory, the clustering key component is optional. A table with no clustering key can only have single-row partitions because its primary key is equivalent to its partition key and there is a one-to-one mapping between partitions and rows. A table with a clustering key can have multi-row partitions because different rows in the same partition have different clustering keys. Rows in a multi-row partition are always ordered by clustering key values in ascending (default) or descending order.

A table schema defines a set of columns and a primary key. Each column is assigned a data type that can be primitive, such as int or text, or complex (collection data types), such as set, list, or map. A column may also be assigned the special counter data type, which is used to maintain a distributed counter that can be added to or subtracted from by concurrent transactions. In the presence of a counter column, all non-counter columns in a table must be part of the primary key. A column can be defined as static, which only makes sense in a table with multi-row partitions, to denote a column whose value is shared by all rows in a partition. Finally, a primary key is a sequence of columns consisting of partition key columns followed by optional clustering key columns. In CQL, partition key columns are delimited by additional parentheses, which can be omitted if a partition key is simple. A primary key may not include counter, static, or collection columns.

To illustrate some of these notions, Fig. 2 shows two sample tables with CQL definitions and sample rows. In Fig. 2(a), the Artifacts table contains single-row partitions. Its primary key consists of one column, artifact_id, that is also a simple
partition key. This table is shown to have three single-row partitions. In Fig. 2(b), the Artifacts_by_venue table contains multi-row partitions. Its primary key consists of the composite partition key (venue_name, year) and the simple clustering key artifact_id. This table is shown to have three partitions, each one containing multiple rows. For any given partition, its rows are ordered by artifact_id in ascending order. In addition, homepage is defined as a static column, and therefore each partition can only have one homepage value that is shared by all the rows in that partition.

CREATE TABLE artifacts(
    artifact_id INT,
    corresponding_author TEXT,
    email TEXT,
    PRIMARY KEY (artifact_id));

(a) Table Artifacts with single-row partitions. Sample data: three partitions keyed by artifact_id (1, John Doe; 54, Tom Black; 61, Jim White), each holding a single row with a corresponding_author and an email value.

CREATE TABLE artifacts_by_venue(
    venue_name TEXT,
    year INT,
    artifact_id INT,
    title TEXT,
    homepage TEXT STATIC,
    PRIMARY KEY ((venue_name, year), artifact_id));

(b) Table Artifacts_by_venue with multi-row partitions. Sample data: three partitions keyed by (venue_name, year), namely (SCC, 2013), (SCC, 2014), and (ICWS, 2014); each partition contains multiple artifact rows (e.g., Composition, Mashup, Orchestration, Workflow, VM Migration, Scheduling) and one static homepage value shared by all rows of the partition (www.scc2013.org, www.scc2014.org, www.icws2014.org).

(c) An equality search query:

SELECT artifact_id, title, homepage
FROM artifacts_by_venue
WHERE venue_name='SCC' AND year=2013;

(d) A range search query:

SELECT artifact_id, title, homepage
FROM artifacts_by_venue
WHERE venue_name='SCC' AND year=2013
    AND artifact_id>=1 AND artifact_id<=20;

Fig. 2: Sample tables in Cassandra.

B. Query Model

Queries over tables are expressed in CQL, which has an SQL-like syntax. Unlike SQL, CQL supports no binary operations, such as joins, and has a number of rules for query predicates that ensure efficiency and scalability:

• Only primary key columns may be used in a query predicate.

• All partition key columns must be restricted by values (i.e., equality search).

• All, some, or none of the clustering key columns can be used in a query predicate.

• If a clustering key column is used in a query predicate, then all clustering key columns that precede this clustering column in the primary key definition must also be used in the predicate.

• If a clustering key column is restricted by range (i.e., inequality search) in a query predicate, then all clustering key columns that precede this clustering column in the primary key definition must be restricted by values and no other clustering column can be used in the predicate.

Intuitively, a query that restricts all partition key columns by values returns all rows in a partition identified by the specified partition key. For example, the following query over the Artifacts_by_venue table in Fig. 2(b) returns all artifacts published in the venue SCC 2013:

SELECT artifact_id, title
FROM artifacts_by_venue
WHERE venue_name='SCC' AND year=2013;

A query that restricts all partition key columns and some clustering key columns by values returns a subset of rows from a partition that satisfy such a predicate. Similarly, a query that restricts all partition key columns by values and one clustering key column by range (preceding clustering key columns are restricted by values) returns a subset of rows from a partition that satisfy such a predicate. For example, the following query over the Artifacts_by_venue table in Fig. 2(b) returns artifacts with id's from 1 to 20 published in SCC 2013:

SELECT artifact_id, title
FROM artifacts_by_venue
WHERE venue_name='SCC' AND year=2013
    AND artifact_id>=1 AND artifact_id<=20;

Query results are always ordered based on the default order specified for clustering key columns when a table is defined (the CLUSTERING ORDER construct), unless a query explicitly reverses the default order (the ORDER BY construct).

Finally, CQL supports a number of other features, such as queries that use secondary indexes, IN, and ALLOW FILTERING constructs. Our data modeling methodology does not directly rely on such queries as their performance is frequently unpredictable on large datasets. More details on the syntax and semantics of CQL can be found in [10].

III. CONCEPTUAL DATA MODELING AND APPLICATION WORKFLOW MODELING

The first step in the proposed methodology adds a whole new dimension to database design, not seen in the traditional relational approach. Designing a Cassandra database schema requires not only understanding of the to-be-managed data, but also understanding of how a data-driven application needs to access such data. The former is captured via a conceptual data model, such as an entity-relationship model. In particular, we choose to use Entity-Relationship Diagrams in Chen's notation [8] for conceptual data modeling because this notation is truly technology-independent and not tainted with any relational model features. The latter is captured via an application workflow diagram that defines data access patterns for individual application tasks. Each access pattern specifies what attributes to search for, search on, order by, or do aggregation on with a distributed counter. For readability, in this paper, we use verbal descriptions of access patterns. More formally, access patterns can be represented as graph queries written in a language similar to ERQL [11].
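To make the notion of an access pattern concrete, the attributes it identifies (what to search for, what to search on by equality or by range, and what to order by) can be captured in a small data structure. The following Python sketch is purely illustrative; the class and field names are our own and are not part of the paper or of the KDM tool. The example instance describes query Q1 of the digital library workflow (see Fig. 3(b) and Fig. 4): find artifacts featured in a venue with a given name and a year greater than a given value, most recent first.

```python
from dataclasses import dataclass

# Illustrative representation of an access pattern, as characterized in
# Section III: attributes to search for, search on, and order by.
# The class and field names are hypothetical, not taken from the paper.
@dataclass
class AccessPattern:
    search_for: list     # attributes the query returns
    equality_on: list    # attributes restricted by equality (=)
    inequality_on: list  # attributes restricted by range (<, >, ...)
    order_by: list       # (attribute, "ASC" or "DESC") pairs

# Query Q1 from the digital library application workflow:
q1 = AccessPattern(
    search_for=["artifact_id", "artifact_title", "authors", "keywords"],
    equality_on=["name"],
    inequality_on=["year"],
    order_by=[("year", "DESC")],
)
```

Such a structured description is what the mapping rules of Section IV consume when deriving a table schema for a query.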
As a running example, we design a database for a digital library use case. The digital library features a collection of digital artifacts, such as papers and posters, which appeared in various venues. Registered users can leave their feedback for venues and artifacts in the form of reviews, likes, and ratings. Fig. 3 shows a conceptual data model and an application workflow for our use case. The conceptual data model in Fig. 3(a) unambiguously defines all known entity types, relationship types, attribute types, key, cardinality, and other constraints. For example, a part of the diagram can be interpreted as "a user is uniquely identified by id and may post many reviews, while each review is posted by exactly one user". The application workflow in Fig. 3(b) models a web-based application that allows users to interact with various web pages (tasks) to retrieve data using well-defined queries. For example, the uppermost task in the figure is the entry point to the application and allows searching for artifacts in a database based on one of the queries with different properties. As we show in the next section, both the conceptual data model and the application workflow have a profound effect on the design of a logical data model.

IV. LOGICAL DATA MODELING

The crux of the Cassandra data modeling methodology is logical data modeling. It takes a conceptual data model and maps it to a logical data model based on queries defined in an application workflow. A logical data model corresponds to a Cassandra database schema with table schemas defining columns, primary, partition, and clustering keys. We define the query-driven conceptual-to-logical data model mapping via data modeling principles, mapping rules, and mapping patterns.

A. Data Modeling Principles

The following four data modeling principles provide a foundation for the mapping of conceptual to logical data models.

DMP1 (Know Your Data). The first key to successful database design is understanding the data, which is captured with a conceptual data model. The importance of and effort required for conceptual data modeling should not be underestimated. Entity, relationship, and attribute types on an ER diagram (e.g., see Fig. 3(a)) not only define which data pieces need to be stored in a database but also which data properties, such as entity type and relationship type keys, need to be preserved and relied on to organize data correctly.

For example, in Fig. 3(a), name and year constitute a venue key. This is based on our use case assumption that there cannot be two venues (e.g., conferences) with the same name that are held in the same year. If our assumption is false, the conceptual data model and overall design will have to change. Another example is the cardinality of the relationship type features. In this case, our use case assumption is that a venue can feature many artifacts and an artifact can only appear in one venue. Thus, given the one-to-many relationship type, the key of features is the id of an artifact. Again, if our assumption is false, both the cardinalities and the key will have to change, resulting in a substantially different table schema design.

DMP2 (Know Your Queries). The second key to successful database design is queries, which are captured via an application workflow model. Like data, queries directly affect table schema design, and if our use case assumptions about the queries (e.g., see Fig. 3(b)) change, a database schema will have to change, too. In addition to considering queries and ensuring their correct support, we should also take into account the access path of each query to organize data efficiently.

We define three broad access paths: 1) partition per query, 2) partition+ per query, and 3) table or table+ per query. The most efficient option is "partition per query", when a query only retrieves one row, a subset of rows, or all rows from a single partition. For example, both queries presented in Section II-B are examples of the "partition per query" access path. This access path should be the most common in an online transaction processing scenario but, in some cases, may not be possible or desirable (e.g., a partition may have to become very large to satisfy this path for a query). The "partition+ per query" and "table or table+ per query" paths refer to retrieving data from a few partitions in a table or from many partitions in one or more tables, respectively. While these access paths can be valid in some cases, they should be avoided to achieve optimal query performance.

DMP3 (Data Nesting). The third key to successful database design is data nesting. Data nesting refers to a technique that organizes multiple entities (usually of the same type) together based on a known criterion. Such a criterion can be that all nested entities must have the same value for some attribute (e.g., venues with the same name) or that all nested entities must be related to a known entity of a different type (e.g., digital artifacts that appeared in a particular venue). Data nesting is used to achieve the "partition per query" access path, such that multiple nested entities can be retrieved from a single partition. There are two mechanisms in Cassandra to nest data: multi-row partitions and collection types. Our methodology primarily relies on multi-row partitions to achieve the best performance. For example, in Fig. 2(b), the Artifacts_by_venue table nests artifacts (rows) under venues (partitions) that featured those artifacts. In other words, each partition corresponds to a venue and each row in a given partition corresponds to an artifact that appeared in the partition's venue. Tables with multi-row partitions are common in Cassandra databases.

DMP4 (Data Duplication). The fourth key to successful database design is data duplication. Duplicating data in Cassandra across multiple tables, partitions, and rows is a common practice that is required to efficiently support different queries over the same data. It is far better to duplicate data to enable the "partition per query" access path than to join data from multiple tables and partitions. For example, to support queries Q1 and Q2 in Fig. 3(b) via the efficient "partition per query" access path, we should create two separate tables that organize the same set of artifacts using different table primary keys. In the Cassandra world, the trade-off between space efficiency and time efficiency is almost always in favor of the latter.

B. Mapping Rules

Based on the above data modeling principles, we define five mapping rules that guide a query-driven transition from a conceptual data model to a logical data model.

MR1 (Entities and Relationships). Entity and relationship types map to tables, while entities and relationships map to table rows. Attribute types that describe entities and relationships
Fig. 3: A conceptual data model and an application workflow for the digital library use case.

at the conceptual level must be preserved as table columns at the logical level. Violation of this rule may lead to data loss.

MR2 (Equality Search Attributes). Equality search attributes, which are used in a query predicate, map to the prefix columns of a table primary key. Such columns must include all partition key columns and, optionally, one or more clustering key columns. Violation of this rule may result in inability to support query requirements.

MR3 (Inequality Search Attributes). An inequality search attribute, which is used in a query predicate, maps to a table clustering key column. In the primary key definition, a column that participates in an inequality search must follow columns that participate in equality searches. Violation of this rule may result in inability to support query requirements.

MR4 (Ordering Attributes). Ordering attributes, which are specified in a query, map to clustering key columns with ascending or descending clustering order as prescribed by the query. Violation of this rule may result in inability to support query requirements.

MR5 (Key Attributes). Key attribute types map to primary key columns. A table that stores entities or relationships as rows must include key attributes that uniquely identify these entities or relationships as part of the table primary key to uniquely identify table rows. Violation of this rule may lead to data loss.

To design a table schema, it is important to apply these mapping rules in the context of a particular query and a subgraph of the conceptual data model that the query deals with. The rules should be applied in the same order as they are listed above.

For example, Fig. 4 illustrates how the mapping rules are applied to design a table for query Q1 (see Fig. 3(b)) that deals with the relationship Venue-features-Digital Artifact (see Fig. 3(a)). Fig. 4 visualizes a table resulting after each rule application using Chebotko's notation, where K and C denote partition and clustering key columns, respectively. The arrows next to the clustering key columns denote ascending (↑) or descending (↓) order. MR1 results in table Artifacts_by_venue whose columns correspond to the attribute types used in the query to search for, search on, or order by. MR2 maps the equality search attribute to the partition key column venue_name. MR3 maps the inequality search attribute to the clustering key column year, and MR4 changes the clustering order to descending. Finally, MR5 maps the key attribute to the clustering key column artifact_id.

Q1 predicate: name=? AND year>?, ORDER BY year DESC. Columns of Artifacts_by_venue: venue_name, year, artifact_id, artifact_title, [authors], {keywords}. After MR1: no key columns assigned yet. After MR2: venue_name K. After MR3: year C↑. After MR4: year C↓. After MR5: artifact_id C↑.

Fig. 4: Sample table schema design using the mapping rules.

C. Mapping Patterns

Based on the above mapping rules, we design mapping patterns that serve as the basis for automating Cassandra database schema design. Given a query and a conceptual data model subgraph that is relevant to the query, each mapping pattern defines a final table schema design without the need to apply individual mapping rules. While we define a number of different mapping patterns [9], due to space limitations, we only present one mapping pattern and one example.

A sample mapping pattern is illustrated in Fig. 5(a). It is applicable for the case when a given query deals with one-to-many relationships and results in a table schema that nests many entities (rows) under one entity (partition) according to the relationships. When applied to query Q1 (see Fig. 3(b)) and the relationship Venue-features-Digital Artifact (see Fig. 3(a)), this mapping pattern results in the table schema shown in Fig. 5(b). With our mapping patterns, logical data modeling becomes as simple as finding an appropriate mapping pattern and applying it, which can be automated.
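The rule-by-rule derivation visualized in Fig. 4 can also be mimicked in a few lines of code. The Python sketch below is a simplified illustration of ordering primary key columns per MR2 through MR5 for a single query; it is not KDM's actual algorithm, and the function name and signature are our own.

```python
# Illustrative sketch (not KDM's algorithm) of primary key derivation:
# equality-search attributes form the partition key (MR2), followed by
# inequality-search attributes (MR3) and then key attributes (MR5) as
# clustering columns, with per-column order as prescribed (MR4).
def derive_primary_key(equality, inequality, entity_key, descending=()):
    partition_key = list(equality)                                   # MR2
    clustering = list(inequality)                                    # MR3
    clustering += [k for k in entity_key if k not in clustering]     # MR5
    orders = ["DESC" if c in descending else "ASC" for c in clustering]  # MR4
    return partition_key, list(zip(clustering, orders))

# Query Q1 over Venue-features-Digital Artifact (cf. Fig. 4):
pk, ck = derive_primary_key(
    equality=["venue_name"],
    inequality=["year"],
    entity_key=["artifact_id"],
    descending=["year"],
)
print(pk)  # ['venue_name']
print(ck)  # [('year', 'DESC'), ('artifact_id', 'ASC')]
```

The result matches the final schema of Fig. 4: PRIMARY KEY ((venue_name), year, artifact_id) with year clustered in descending order.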
(a) Sample mapping pattern. Query predicate: key1.1=? AND key1.2>?, ORDER BY key1.2 (DESC). Resulting table ET2_by_ET1: key1.1 K; key1.2 C↓; key2.1 C↑; key2.2 C↑; attr2.1; attr2.2.

(b) Example mapping pattern application. Query predicate: name=? AND year>?, ORDER BY year (DESC). Resulting table Artifacts_by_venue: venue_name K; year C↓; artifact_id C↑; artifact_title; [authors]; {keywords}.

Fig. 5: A sample mapping pattern and the result of its application.

V. PHYSICAL DATA MODELING

The final step of our methodology is the analysis and optimization of a logical data model to produce a physical data model. While the modeling principles, mapping rules, and mapping patterns ensure a correct and efficient logical schema, there are additional efficiency concerns related to database engine constraints or finite cluster resources. A typical analysis of a logical data model involves the estimation of table partition sizes and data duplication factors. Some of the common optimization techniques include partition splitting, inverted indexes, data aggregation, and concurrent data access optimizations. These and other techniques are described in [9].

VI. CHEBOTKO DIAGRAMS

It is frequently useful to present logical and physical data model designs visually. To achieve this, we propose a novel visualization technique, called Chebotko Diagrams, which presents a database schema design as a combination of individual table schemas and query-driven application workflow transitions. Some of the advantages of Chebotko Diagrams, when compared to regular CQL schema definition scripts, include improved overall readability, superior intelligibility for complex data models, and better expressivity featuring both table schemas and their supported application queries. Physical-level diagrams contain sufficient information to automatically generate a CQL script that instantiates a database schema, and can serve as reference documents for developers and architects that design and maintain a data-driven solution. The notation of Chebotko Diagrams is presented in Fig. 6.

In a table schema, each entry lists a column name, a CQL type, and an optional role marker:

• column name CQL-Type K: partition key column
• column name CQL-Type C↑: clustering key column (ASC)
• column name CQL-Type C↓: clustering key column (DESC)
• column name CQL-Type S: static column
• column name CQL-Type IDX: secondary index column
• column name CQL-Type ++: counter column
• [column name] CQL-Type: collection column (list)
• {column name} CQL-Type: collection column (set)
• <column name> CQL-Type: collection column (map)
• column name CQL-Type: regular column

Additional notation marks an entry point (as in an application workflow), the one or more queries supported by a table (e.g., Q1, Q2), and transitions between tables (e.g., Table A -Q3-> Table B, as in an application workflow).

Fig. 6: The notation of Chebotko Diagrams.

Sample Chebotko Diagrams for the digital library use case are shown in Fig. 7. The logical-level diagram in Fig. 7(a) is derived from the conceptual data model and application workflow in Fig. 3 using the mapping rules and mapping patterns. The physical-level diagram in Fig. 7(b) is derived from the logical data model after specifying CQL data types for all columns and applying two minor optimizations: 1) a new column avg_rating is introduced into tables Artifacts_by_venue, Artifacts_by_author, and Artifacts to avoid an [...]; and 2) [...] table because a timestamp can be extracted from column review_id of type TIMEUUID.

VII. AUTOMATION AND THE KDM TOOL

To automate our proposed methodology in Fig. 1(b), we design and implement a Web-based data modeling tool, called KDM². The tool relies on the mapping patterns and our proprietary algorithms to automate the most complex, error-prone, and time-consuming data modeling tasks: conceptual-to-logical mapping, logical-to-physical mapping, and CQL generation. KDM's Cassandra data modeling automation workflow is shown in Fig. 8(a). Screenshots of KDM's user interface corresponding to steps 1, 3, and 4 of this workflow are shown in Fig. 8(b).

Our tool was successfully validated for several use cases, including the digital library use case. Based on our experience, KDM can dramatically reduce time, streamline, and simplify the Cassandra database design process. KDM consistently generates sound and efficient data models, which is invaluable for less experienced users. For expert users, KDM supports a number of advanced features, such as automatic schema generation in the presence of type hierarchies, n-ary relationship types, explicit roles, and alternative keys.

VIII. RELATED WORK

Data modeling has always been a cornerstone of data management systems. Conceptual data modeling [8] and relational database design [6], [7] have been extensively studied and are now part of a typical database course. Unfortunately, the vast majority of relational data modeling techniques are not applicable to recently emerged big data (or NoSQL) management solutions. The need for new data modeling approaches for NoSQL databases has been widely recognized in both industry [12], [13] and academic [14], [15], [16] communities. Big data modeling is a challenging and open problem.
additional lookup in the Ratings by artifact table and 2) the
timestamp column is eliminated from the Reviews by user 2 KDM demo can be found at www.cs.wayne.edu/andrey/kdm
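The partition-size estimation mentioned in Section V lends itself to a quick back-of-the-envelope calculation. The sketch below is illustrative only: the Nv formula is a commonly used Cassandra sizing heuristic (of the kind covered in [9]), while the thresholds, row counts, and byte sizes are hypothetical assumptions, not figures from the paper.

```python
# Back-of-the-envelope partition analysis of the kind performed in the
# physical data modeling step (Section V). The Nv formula is a common
# Cassandra sizing heuristic; the thresholds and example numbers below
# are hypothetical assumptions, not figures from the paper.

def values_per_partition(n_rows, n_columns, n_pk_columns, n_static=0):
    """Estimated number of cells in one partition:
    Nv = Nr * (Nc - Npk - Ns) + Ns."""
    return n_rows * (n_columns - n_pk_columns - n_static) + n_static

def partition_is_oversized(n_values, est_bytes,
                           max_values=100_000, max_bytes=100 * 1024 ** 2):
    """Rule-of-thumb limits (~100K cells, ~100 MB per partition); an
    oversized partition is a candidate for partition splitting, one of
    the optimizations listed in Section V."""
    return n_values > max_values or est_bytes > max_bytes

# Hypothetical example: a Reviews_by_user partition for a prolific
# reviewer with 10,000 reviews (7 columns, 3 of them primary key
# columns, ~50 bytes per cell on average).
nv = values_per_partition(n_rows=10_000, n_columns=7, n_pk_columns=3)
print(nv, partition_is_oversized(nv, est_bytes=nv * 50))  # → 40000 False
```

A partition that fails such a check would be restructured, for example by adding another column to the partition key to split it.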
Fig. 7: Chebotko Diagrams for the digital library use case. (a) Logical Chebotko Diagram; (b) Physical Chebotko Diagram. [The diagrams show the tables Artifacts_by_venue, Artifacts_by_author, Artifacts, Users_by_artifact, Experts_by_artifact, Ratings_by_artifact, Venues_by_user, Artifacts_by_user, and Reviews_by_user, each annotated with the queries it supports (Q1–Q9), its partition key columns (K), and its clustering columns (C↑ ascending, C↓ descending); the physical diagram in (b) additionally specifies CQL data types, such as the counter columns num_ratings and sum_ratings (++) in Ratings_by_artifact.]
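As noted in Section VI, a physical-level diagram such as Fig. 7(b) carries enough information to generate a CQL schema automatically. A minimal sketch of that generation step is shown below; the helper function and its spec format are illustrative assumptions, not KDM's actual implementation, while the Artifacts_by_venue columns and key annotations follow Fig. 7(b).

```python
# Illustrative sketch of CQL generation from a physical Chebotko table
# schema: each column carries a CQL type and a role marker, 'K' for a
# partition key column, 'C↑'/'C↓' for an ascending/descending clustering
# column, or '' for a regular column. Not KDM's actual implementation.

def generate_cql(table, columns):
    """Render a physical table schema as a CREATE TABLE statement."""
    col_defs = ",\n  ".join(f"{name} {ctype}" for name, ctype, _ in columns)
    partition_key = [n for n, _, r in columns if r == "K"]
    clustering = [(n, "ASC" if r == "C↑" else "DESC")
                  for n, _, r in columns if r in ("C↑", "C↓")]
    key = (f"(({', '.join(partition_key)})"
           f"{''.join(', ' + n for n, _ in clustering)})")
    cql = (f"CREATE TABLE {table} (\n  {col_defs},\n"
           f"  PRIMARY KEY {key}\n)")
    if clustering:
        order = ", ".join(f"{n} {o}" for n, o in clustering)
        cql += f" WITH CLUSTERING ORDER BY ({order})"
    return cql + ";"

# Artifacts_by_venue as specified in the physical diagram of Fig. 7(b).
artifacts_by_venue = [
    ("venue_name", "TEXT", "K"),
    ("year", "INT", "C↓"),
    ("artifact_id", "INT", "C↑"),
    ("avg_rating", "FLOAT", ""),
    ("artifact_title", "TEXT", ""),
    ("authors", "LIST<TEXT>", ""),
    ("keywords", "SET<TEXT>", ""),
]
print(generate_cql("artifacts_by_venue", artifacts_by_venue))
```

For this spec, the generated statement has the composite primary key `((venue_name), year, artifact_id)` and clustering order `(year DESC, artifact_id ASC)`, mirroring the K and C↑/C↓ annotations in the diagram.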

In the big data world, database systems are frequently classified into four broad categories [17] based on their data models: 1) key-value databases, such as Riak and Redis, 2) document databases, such as Couchbase and MongoDB, 3) column-family databases, such as Cassandra and HBase, and 4) graph databases, such as Titan and Neo4J. Key-value databases model data as key-value pairs. Document databases store JSON documents retrievable by keys. Column-family databases model data as table-like structures with multiple dimensions. Graph databases typically rely on internal ad-hoc data structures to store any graph data. An effort on a system-independent NoSQL database design is reported in [18], where the approach is based on NoSQL Abstract Model to specify an intermediate, system-independent data representation. Both our work and [18] recognize conceptual data modeling and query-driven design as essential activities of the data modeling process. While databases in different categories may share similar high-level data modeling ideas, such as data nesting (also, aggregation or embedding) or data duplication, many practical data modeling techniques rely on low-level features that are unique to a category and, more often, to a particular database.

In the Cassandra world, data modeling insights mostly appear in blog posts and presentations that focus on best practices, common use cases, and sample designs. Among some of the most helpful resources are the DataStax developer blog³, the DataStax data modeling page⁴, and Patrick McFadin's presentations⁵. To the best of our knowledge, this work is the first to propose a systematic and rigorous data modeling methodology for Apache Cassandra. Chebotko Diagrams for visualization and the KDM tool for automation are also novel and unique.

IX. CONCLUSIONS AND FUTURE WORK

In this paper, we introduced a rigorous query-driven data modeling methodology for Apache Cassandra. Our methodology was shown to be drastically different from the traditional relational data modeling approach in a number of ways, such as query-driven schema design, data nesting, and data duplication. We elaborated on the fundamental data modeling principles for Cassandra, and defined mapping rules and mapping patterns to transition from technology-independent conceptual data models to Cassandra-specific logical data models. We also explained the role of physical data modeling and proposed a novel visualization technique, called Chebotko Diagrams, which can be used to capture complex logical and physical data models. Finally, we presented a powerful data modeling tool, called KDM, which automates some of the most complex, error-prone, and time-consuming data modeling tasks, including conceptual-to-logical mapping, logical-to-physical mapping, and CQL generation.

In the future, we plan to extend our work to support new Cassandra features, such as user defined data types and global indexes. We are also interested in exploring data modeling techniques in the context of analytic applications. Finally, we plan to explore schema evolution in Cassandra.

ACKNOWLEDGEMENTS

Artem Chebotko would like to thank Anthony Piazza, Patrick McFadin, Jonathan Ellis, and Tim Berglund for their support at various stages of this effort. This work is partially supported by U.S. National Science Foundation under ACI-1443069.

³ http://www.datastax.com/dev/blog
⁴ http://www.datastax.com/resources/data-modeling
⁵ http://www.slideshare.net/patrickmcfadin
Fig. 8: Automated Cassandra data modeling using KDM. (a) KDM's Cassandra data modeling automation workflow, whose steps alternate between the solution architect and KDM; (b) Data modeling for the digital library use case performed in KDM.

REFERENCES

[1] Apache Cassandra Project, http://cassandra.apache.org/.
[2] Planet Cassandra, http://planetcassandra.org/.
[3] Companies that use Cassandra, http://planetcassandra.org/companies/.
[4] A. Lakshman and P. Malik, "Cassandra: a decentralized structured storage system," Operating Sys. Review, vol. 44, no. 2, pp. 35–40, 2010.
[5] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A distributed storage system for structured data," ACM Transactions on Computer Systems, vol. 26, no. 2, 2008.
[6] E. F. Codd, "A relational model of data for large shared data banks," Commun. ACM, vol. 13, no. 6, pp. 377–387, 1970.
[7] ——, "Further normalization of the data base relational model," IBM Research Report, San Jose, California, vol. RJ909, 1971.
[8] P. P. Chen, "The entity-relationship model - toward a unified view of data," ACM Trans. Database Syst., vol. 1, no. 1, pp. 9–36, 1976.
[9] DataStax Cassandra Training Curriculum, http://www.datastax.com/what-we-offer/products-services/training/apache-cassandra-data-modeling/.
[10] Cassandra Query Language, https://cassandra.apache.org/doc/cql3/CQL.html.
[11] M. Lawley and R. W. Topor, "A query language for EER schemas," in Proceedings of the 5th Australasian Database Conference, 1994, pp. 292–304.
[12] J. Maguire and P. O'Kelly, "Does data modeling still matter, amid the market shift to XML, NoSQL, big data, and cloud?" White paper, https://www.embarcadero.com/phocadownload/new-papers/okelly-whitepaper-071513.pdf, 2013.
[13] D. Hsieh, "NoSQL data modeling," Ebay tech blog, http://www.ebaytechblog.com/2014/10/10/nosql-data-modeling, 2014.
[14] A. Badia and D. Lemire, "A call to arms: revisiting database design," SIGMOD Record, vol. 40, no. 3, pp. 61–69, 2011.
[15] P. Atzeni, C. S. Jensen, G. Orsi, S. Ram, L. Tanca, and R. Torlone, "The relational model is dead, SQL is dead, and I don't feel so good myself," SIGMOD Record, vol. 42, no. 2, pp. 64–68, 2013.
[16] D. Agrawal, P. Bernstein, E. Bertino, S. Davidson, U. Dayal, M. Franklin, J. Gehrke, L. Haas, A. Halevy, J. Han et al., "Challenges and opportunities with big data - a community white paper developed by leading researchers across the United States," 2011.
[17] P. J. Sadalage and M. Fowler, NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Addison-Wesley, 2012.
[18] F. Bugiotti, L. Cabibbo, P. Atzeni, and R. Torlone, "Database design for NoSQL systems," in Proceedings of the 33rd International Conference on Conceptual Modeling, 2014, pp. 223–231.
