Data Modeling in
Apache Cassandra™
Five Steps to an Awesome Data Model
For global-scale applications, Apache Cassandra is a favorite choice among architects
and developers.
It offers many advantages, including zero-downtime high availability, zero lock-in with
easy data portability, and global scale with extreme performance. Today, Cassandra is
among the most successful NoSQL databases. It is used in countless applications, from
online retail and internet portals to time-series workloads and mobile application backends.
Regardless of the database, if the data model isn’t right, or doesn’t suit the underlying
database architecture, users can experience poor performance, downtime, and even data
loss or data corruption. Fixing a poorly designed data model after an application is in
production is an experience that nobody wants to go through. It’s better to take some time
upfront and use a proven methodology to design a data model that will be scalable,
extensible, and maintainable over the application lifecycle.
Zero lock-in, highly portable data – In a relational database, developers and DBAs rely
on ETL to move data around, which can take a significant amount of time when you have
massive amounts of data. While data is being moved, it can be unavailable to your
system, so applications running on relational databases often have scheduled
maintenance windows to handle porting data. The cloud-native architecture of
Cassandra makes data highly portable out of the box, allowing developers to move
faster when building applications. With Cassandra, there’s no need for scheduled
maintenance windows, even during database upgrades.
In Cassandra, writes are (almost) free – Owing to Cassandra’s architecture, writes are
shockingly fast compared to relational databases. Write latency may be in the hundreds
of microseconds, and large production clusters can support millions of
writes per second.[2]
No joins – Relational database users assume that they can reference fields from
multiple tables in a single query, joining tables on the fly, which can slow the performance
of your application. With Cassandra, joins don’t exist, so developers structure their data
models to provide equivalent functionality.
Developers need to think about consistency – Relational databases are ACID compliant
(Atomicity, Consistency, Isolation, Durability), characteristics that help guarantee data
validity with multiple reads and writes. Cassandra supports the notion of tunable
consistency, balancing between Consistency, Availability, and Partition Tolerance (CAP).[3]
Cassandra gives developers the flexibility to manage trade-offs between data
consistency, availability, and application performance when formulating queries.
[1] Database normalization is the process of structuring a relational database to reduce data redundancy and improve data integrity – details are available at https://en.wikipedia.org/wiki/Database_normalization
[2] Cassandra delivers one million writes per second at Netflix – https://medium.com/netflix-techblog/benchmarking-cassandra-scalability-on-aws-over-a-million-writes-per-second-39f45f066c9e
[3] How Apache Cassandra™ Balances Consistency, Availability, and Performance – https://www.datastax.com/2019/05/how-apache-cassandra-balances-consistency-availability-and-performance
Cassandra clusters have multiple nodes running in local data centers or public clouds. Data
is typically stored redundantly across nodes according to a configurable replication factor
so that the database continues to operate even when nodes are down or unreachable.
Tables in Cassandra are much like RDBMS tables. Physical records in the table are spread
across the cluster at a location determined by a partition key. The partition key is hashed
to a 64-bit token that identifies the Cassandra node where data and replicas are stored.[4]
The Cassandra cluster is conceptually represented as a ring, as shown in Figure 1, where
each cluster node is responsible for storing tokens in a range.
Queries that look up records based on the partition key are extremely fast because
Cassandra can immediately determine the host holding required data using the
partitioning function. Since clusters can potentially have hundreds or even thousands of
nodes, Cassandra can handle many simultaneous queries because queries and data are
distributed across cluster nodes.
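For illustration, here is a hedged sketch of a single-partition lookup, assuming a videos table keyed by a videoid partition key (similar to the schema discussed later in this paper); the UUID literal is a placeholder:
-- Illustrative only: the partition key (videoid) is hashed to a token, so
-- Cassandra routes the query directly to the replica nodes that own that token.
SELECT name, description
FROM videos
WHERE videoid = bddf5f6b-f1a2-4624-8005-07f8cd5456de;

-- The token() function exposes the hash value used for data placement.
SELECT token(videoid), name
FROM videos
WHERE videoid = bddf5f6b-f1a2-4624-8005-07f8cd5456de;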
[4] This is a bit of an over-simplification. Recent versions of Cassandra support the notion of vnodes, where each cluster node is responsible for multiple token ranges to better distribute data: https://docs.datastax.com/en/dse/5.1/dse-arch/datastax_enterprise/dbArch/archDataDistributeVnodesUsing.html. The ring is still a useful way to conceptualize how Cassandra stores data, however.
Minimize the number of partitions to read – When Cassandra reads data, it’s best to read
from as few partitions as possible since each partition potentially resides on a different
cluster node. If a query involves multiple partitions, the coordinator node responsible for
the query needs to interact with many nodes, thereby reducing performance.
Anticipate how data will grow, and think about potential bottlenecks in advance –
A particular data model might make sense when you have a few hundred transactions
per user, but what would happen to performance if there were millions of transactions
per user? It’s a good idea to always “think big” when building data models for
Cassandra and avoid knowingly introducing bottlenecks.
A sample application
To illustrate how to build Cassandra applications, DataStax offers a reference application
called KillrVideo,[5] a video sharing service like YouTube™ run by a fictitious company.
When discussing data modeling, it’s helpful to focus on a concrete example. In the sections
that follow, data modeling will be discussed in the context of the KillrVideo application.
In KillrVideo, users perform activities such as creating profiles, submitting, tagging, and
sharing videos, searching for and retrieving videos submitted by others, viewing them,
rating them, and commenting on them. KillrVideo serves as a useful example because it
needs to operate at global scale, supporting millions of users and transaction volumes that
would be all but impossible to achieve using a relational database.
Before thinking about how data will be stored, designers need to know what types of
queries the database will need to support. Figure 3 presents a simplified application
workflow for KillrVideo.com.
[5] Details about KillrVideo including source code and design documents are available at http://killrvideo.com
The sequence of workflow steps matters because it helps us determine what data is available
and required for each query. For example, before we can show basic information about a user
(step 2 above), a userid is required. The user first needs to log in to the site (step 1) supplying
an email address and password in exchange for the required userid. A userid might also be
obtained by searching for a video (step 6 or 7), showing comments for a video (step 9), and
then looking up details about the user who commented. Similarly, before the application can display
details about a video (step 8) the application needs a videoid obtained by selecting from a list
of the latest videos (step 7) or by searching videos by tag (step 6).
Figure 4 shows a simplified entity relationship diagram (ERD) for the KillrVideo application.
The application needs to be able to keep track of entities such as users, videos, and
comments. Users can perform activities such as adding videos, rating videos, and posting
comments. Users can comment on multiple videos, and each video can have multiple user
comments associated, but there is only one owner of each video.
It’s a good idea to iterate between the application workflow and ERD, updating both
as new data items and relationships required by the application are identified. Once
developers have a clear idea of the application workflow and the key data objects
required, it’s possible to start identifying the queries that the application needs to support.
A diagram showing key queries and how they relate to data domains is shown in Figure 5.
Tables with single-row partitions – These types of tables have primary keys that are
also partition keys. They are used to store entities and are usually normalized. Users,
comments, and videos are examples of entities in our application. These tables
should be named based on the entity for clarity (e.g., users or videos).
Tables with multi-row partitions – These types of tables have primary keys that are
composed of partition and clustering keys. They are used to store relationships and
related entities. Remember that Cassandra doesn’t support joins, so developers need
to structure tables to support queries that relate to multiple data items. It’s a good
idea to give tables meaningful names so that people examining the schema can
understand the purpose of different tables (e.g., comments_by_user or
videos_by_tag).
A Cassandra Query Language (CQL) schema for the single-row partition videos table is
shown below. The videos table includes columns, such as the user that uploaded the
video, a description, details about where the video was taken, a timestamp, and a preview
image. A video can have multiple associated tags, so we use a set data type in Cassandra
known as a collection (discussed shortly) to store a variable number of tags per record.
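A hedged sketch of this schema is shown below; the exact column names and types are illustrative, following the columns described above:
CREATE TABLE videos (
    videoid uuid PRIMARY KEY,        -- unique video identifier
    userid uuid,                     -- user that uploaded the video
    name text,
    description text,
    location text,                   -- where the video was taken
    location_type text,
    preview_image_location text,     -- pointer to the preview image
    tags set<text>,                  -- variable number of tags per record
    added_date timestamp             -- upload time
);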
A table with multi-row partitions that supports the query looking up the latest videos (step 7
in the workflow in Figure 3) is shown below.
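A hedged sketch of such a table, consistent with the primary key definition shown later in this paper (the non-key columns are illustrative):
CREATE TABLE latest_videos (
    yyyymmdd text,                   -- day the video was uploaded (partition key)
    added_date timestamp,            -- precise upload time (clustering column)
    videoid uuid,                    -- clustering column guaranteeing uniqueness
    userid uuid,
    name text,
    preview_image_location text,
    PRIMARY KEY (yyyymmdd, added_date, videoid)
) WITH CLUSTERING ORDER BY (added_date DESC, videoid ASC);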
The latest_videos table illustrates what is meant by “query-first design.” The application
will need to query the most recently uploaded videos every time a user visits the KillrVideo
homepage, so this query needs to be very efficient. On a global video sharing service, the
latest dozen or so videos are almost guaranteed to have been uploaded on the same day
(or in some cases the day before). The table stores a timestamp (indicating the precise
upload time), but rather than using the timestamp as a partition key, the developer stores a
“yyyymmdd” value derived from the timestamp that uniquely represents the day the video
was uploaded. By making yyyymmdd the partition key, videos uploaded on the same day will
be stored in the same partition, making lookups fast and efficient.[6]
Additional clustering columns (added_date and videoid) specify how records are grouped
and ordered within each partition. By designing the table in this fashion, queries will touch
only one partition for the current day and possibly another partition for the day before.
There will be no need to sort, filter, or otherwise post-process query results. This level of
optimization and efficiency helps explain how Cassandra can support applications with
enormous numbers of queries over very large data sets.
Chebotko Diagrams
A good tool for mapping the data model that supports an application is known as a
Chebotko diagram,[7] shown in Figure 6. While entity-relationship diagrams are useful in
building a conceptual data model, a Chebotko diagram helps develop the logical and
physical data models required to support the application.
[6] This is a design choice. If millions of uploads per day were anticipated, a better choice of key might be “yyyymmddhh” to store fewer records per partition. This is a good example of the trade-offs that architects need to consider when building a data model.
[7] Chebotko diagrams are named after Dr. Artem Chebotko, Solution Architect at DataStax.
The Chebotko diagram captures the database schema, showing table names, partition key
columns (K), clustering key columns (C) and their ordering (ascending or descending), static
columns (S), and regular columns with data types. The tables are organized based on the
application workflow to support specific workflow steps and application queries. Let’s take a
look at a couple of tables more closely.
Next, we may need to retrieve all comments posted by the user. The comments_by_user
table supports such a query where we retrieve an ordered list of comments made by a
logged-in user. Since comments will be retrieved “by user,” the userid is defined as the
partition key to ensure that comments made by the same user will always reside in the
same partition. The commentid column is defined as a clustering key. Since commentid
is of type timeuuid (a universally unique identifier generated based on a timestamp and
node MAC address), it ensures both uniqueness and ordering of rows in a partition. User
comments will be stored and retrieved in descending order from the same partition,
showing the most recent comments first. Since the application will need to display when
a comment was made, the toTimestamp() function can be used to convert the
commentid stored as a timeuuid value to a timestamp as shown.
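A hedged sketch of the comments_by_user table and a query against it follows; the videoid and comment columns are illustrative, and the UUID literal is a placeholder:
CREATE TABLE comments_by_user (
    userid uuid,                     -- partition key: all of a user's comments together
    commentid timeuuid,              -- clustering key: unique and time-ordered
    videoid uuid,
    comment text,
    PRIMARY KEY (userid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);

-- Retrieve a user's comments, most recent first, converting the timeuuid
-- to a timestamp for display.
SELECT videoid, comment, toTimestamp(commentid) AS comment_time
FROM comments_by_user
WHERE userid = bddf5f6b-f1a2-4624-8005-07f8cd5456de;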
Denormalization is Expected
People experienced with relational databases may be surprised by how data elements are
stored redundantly across tables. For example, in Figure 6, a pointer to the video preview
image (preview_image_location) appears in six different tables. There are limits to how
many redundant copies we want to maintain, but remember that in Cassandra,
denormalization is expected and it is a good trade-off to achieve performance and
scalability goals:
Modern storage is inexpensive. For small data items, it’s less of a concern to store data
redundantly than it once was.
Adding videos and generating preview images is a relatively infrequent activity in the
application. There will be thousands of queries that require the presence of a preview
image for every query that inserts or updates an image preview. It makes sense to
optimize for the most frequent queries.
In the latest_videos table, yyyymmdd is the partition key, and it is followed by two
clustering columns, added_date and videoid, ordered in a fashion that supports retrieving
the latest videos.
..
PRIMARY KEY (yyyymmdd, added_date, videoid)
) WITH CLUSTERING ORDER BY (added_date DESC, videoid ASC);
A partition key in Cassandra can be made up of multiple columns. For example, if we were
to have a query that finds all videos based on a tag and location, we would need to use a
composite partition key consisting of both the tag and location as shown.
..
PRIMARY KEY ((tag, location), videoid)
..
To address this problem of unique keys, Cassandra supports universally unique identifiers[8]
(UUIDs) as a native data type. UUIDs are 128-bit numbers that are guaranteed to be unique
within the scope of an application. Cassandra supports two types of UUIDs. TIMEUUIDs (UUID
version 1) incorporate details such as the MAC address of the generating node and a
60-bit timestamp to help guarantee uniqueness. Cassandra also supports random UUIDs (UUID
version 4). UUIDs are represented in hexadecimal and look something like the following:
bddf5f6b-f1a2-4624-8005-07f8cd5456de
[8] UUIDs are a standard way of generating unique identifiers – details at https://en.wikipedia.org/wiki/Universally_unique_identifier
The name, description, location, and location_type are all UTF-8 strings, so they are
described as type text (essentially the same as a varchar).
The added_date is a date and time that doesn’t need to be unique, so we use the built-in
timestamp type.
In the video tables, there can be multiple tags or keywords associated with each video, so
we use a “set,” one of Cassandra’s useful collection types.
Collections in Cassandra
When modeling the database, some developers might be tempted to store tags associated
with videos in a separate table. When the list of anticipated tags is small, however, using a
collection data type that stores tags inside the database record can be more efficient.
This simplifies the database design and reduces the number of tables required.
Map – a set of key-value pairs, where keys are unique, and both keys and values have
associated data types.
Nested collection – a collection (i.e., set, list, map, or tuple) that is nested inside of
another collection.
When defining a collection, the user needs to provide a data type for its elements.
A simplified version of our videos table is provided below for illustration.
CREATE TABLE videos (videoid uuid PRIMARY KEY, name text, tags set<text>);
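For illustration, a set column can be written and modified in place; the statements below are a hedged sketch against the simplified table above, with placeholder values:
-- Insert a video with an initial set of tags.
INSERT INTO videos (videoid, name, tags)
VALUES (uuid(), 'Intro to data modeling', {'cassandra', 'datamodeling'});

-- Add or remove individual tags without rewriting the whole set.
UPDATE videos SET tags = tags + {'nosql'}
WHERE videoid = bddf5f6b-f1a2-4624-8005-07f8cd5456de;

UPDATE videos SET tags = tags - {'datamodeling'}
WHERE videoid = bddf5f6b-f1a2-4624-8005-07f8cd5456de;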
Let’s assume that the designers of KillrVideo decide to store an optional mailing address
for each user. Rather than add multiple address-related fields, an address type can be
created and leveraged across multiple Cassandra tables.
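A hedged sketch of such a user-defined type (UDT) follows; the specific fields are assumptions:
-- Illustrative user-defined type for a mailing address.
CREATE TYPE address (
    street text,
    city text,
    state_or_province text,
    postal_code text,
    country text
);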
The user-defined address type can now be included in the users table as shown. The
frozen keyword is required to use a UDT inside of a collection. It forces Cassandra to treat
the address as a single value. Individual elements of a frozen address cannot be updated
individually; rather, the entire address must be overwritten.
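A hedged sketch of a users table that stores addresses in a map collection (column names are illustrative):
CREATE TABLE users (
    userid uuid PRIMARY KEY,
    firstname text,
    lastname text,
    email text,
    -- Map of address labels (e.g., 'home', 'work') to frozen address values;
    -- frozen is required because the UDT is nested inside a collection.
    addresses map<text, frozen<address>>
);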
A good table structure to support such a query in Cassandra would use the groupname as a
partition key so that members of the same group reside on the same partition for fast lookup.
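A hedged sketch of such a table, with groupname as the partition key; the remaining columns are illustrative:
CREATE TABLE users_by_group (
    groupname text,                  -- partition key: one group per partition
    username text,                   -- clustering key: orders members within the group
    firstname text,
    lastname text,
    PRIMARY KEY (groupname, username)
);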
Someone with a relational database background might be tempted to normalize the database
schema by creating tables for users, groups, and user-group relationships as shown.
Secondary indexes
Consider a table holding user accounts where a unique username identifies an account.
The table structure below is ideal for looking up users by username (the partition key).
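A hedged reconstruction of such a table; the table name and non-key columns are illustrative:
CREATE TABLE users_by_username (
    username text PRIMARY KEY,       -- partition key: one account per partition
    email text,
    firstname text,
    lastname text,
    country text
);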
Now imagine that we need to generate a list of user accounts in a single country. Normally,
a query like this would require a full table scan. This is a horrible idea in Cassandra because
depending on the size of the table, a query like this may be very expensive and may
consume all the resources on the cluster.
One way to solve this problem is to create another table designed to support the new query,
with country designated as the partition key and username as the clustering key. However,
such a solution is still problematic because country is a low-cardinality column. As our
database scales, we may end up with millions of users in the same country, causing the
partition for a country to become very large. In general, we do not recommend creating
partitions with more than a few hundred thousand rows. A better practice would be to use a
composite partition key combining country, state, city, and so forth to reduce the number of
rows per partition.
Another solution, which works better for low-cardinality columns, is to use a secondary index:
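A hedged example, using the illustrative table sketched above:
-- Secondary index on the low-cardinality country column.
CREATE INDEX users_by_country_idx ON users_by_username (country);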
The secondary index allows users to run queries like the following:
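For example (a hedged sketch against the illustrative table and index above):
-- Look up user accounts in a single country via the secondary index.
SELECT username, email
FROM users_by_username
WHERE country = 'US';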
This is convenient and looks much like a relational database. So why not always use a
secondary index as opposed to creating query-specific tables?
Secondary indexes have their place, but they should only be used when queries are
expected to return tens or perhaps hundreds of rows at most, and when the indexed
column has medium cardinality.
Materialized views
Consider a table holding user accounts where a unique username identifies an account.
The table structure below is ideal for looking up users by username (the partition key).
Now imagine that we also need to retrieve user accounts by email. It would be
straightforward to solve this problem by creating another table designed to support the
new query, with email designated as the partition key and username as the clustering
key. There is also another good option: a materialized view.
A materialized view is essentially a CQL table that is defined using a base table. It is
automatically maintained by Cassandra and queried like regular tables.
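A hedged sketch of such a view, defined over the users_by_username table sketched earlier, with email as the partition key and username as the clustering key:
CREATE MATERIALIZED VIEW users_by_email AS
    SELECT * FROM users_by_username
    WHERE email IS NOT NULL AND username IS NOT NULL
    PRIMARY KEY (email, username);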
Materialized views are great for convenience and have performance comparable to base
tables, but each materialized view is restricted to a single base table. They work best
for high-cardinality columns.
Developers new to Cassandra often make the mistake of assuming that batches will improve
performance. For batches that update data across multiple partitions, the opposite is usually
the case. A single coordinator node is assigned the responsibility of handling all the queries
between the BEGIN BATCH and APPLY BATCH statements. The coordinator most likely won’t
have local replicas of the required data, so it will need to reach out to multiple nodes in the
Cassandra cluster, thereby becoming a bottleneck as it manages traffic for multiple queries
across many nodes. The batch may fail if there are not enough nodes up and responding.
The exception is a batch where all operations write to a single partition. In this single partition
case, the batch coordination and database update activities can all be handled efficiently on a
single Cassandra node and the unlogged batch can outperform discrete queries.
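As a hedged illustration, the unlogged batch below writes two rows to the same partition of the comments_by_user table sketched earlier, so a single node can apply the whole batch; the literal values are placeholders:
BEGIN UNLOGGED BATCH
    -- Both rows share the same userid, so they land in the same partition.
    INSERT INTO comments_by_user (userid, commentid, videoid, comment)
    VALUES (bddf5f6b-f1a2-4624-8005-07f8cd5456de, now(), uuid(), 'Great video!');
    INSERT INTO comments_by_user (userid, commentid, videoid, comment)
    VALUES (bddf5f6b-f1a2-4624-8005-07f8cd5456de, now(), uuid(), 'Thanks for sharing.');
APPLY BATCH;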
Logged batches are roughly 30% slower than unlogged batches, but logged batches provide an
excellent way to enforce consistency when updating denormalized data. Suppose a user
wishes to change the name of a previously uploaded video. The video name appears in multiple
tables. This operation is performed rarely, and in this case, the need for atomicity outweighs
any performance concerns, so grouping queries in a logged batch is a good design decision.
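A hedged sketch of such a logged batch, renaming a video in two of the denormalized tables sketched earlier; the key values are placeholders:
BEGIN BATCH
    -- Logged batch: Cassandra guarantees all statements are eventually applied.
    UPDATE videos
       SET name = 'Updated video title'
     WHERE videoid = bddf5f6b-f1a2-4624-8005-07f8cd5456de;
    UPDATE latest_videos
       SET name = 'Updated video title'
     WHERE yyyymmdd = '20210101'
       AND added_date = '2021-01-01 12:00:00+0000'
       AND videoid = bddf5f6b-f1a2-4624-8005-07f8cd5456de;
APPLY BATCH;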
Learn More
For developers wanting to learn more about data modeling in Cassandra, DataStax offers a
free course (DS220: Data Modeling) available at the DataStax Academy. This is one of the
most popular courses. In addition, explore our growing collection of data models for IoT,
e-commerce, finance, and other domains in the Data Modeling by Example learning series.
The data modeling guidelines offered here apply to Apache Cassandra and DataStax Enterprise.
Summary
A good way to arrive at a correct and robust data model is to work through a proven
methodology for data modeling. Five key steps are to:
1. Map out the application workflow to identify all the functionality that the application
needs to support.
2. Practice “query-first design” thinking about how to structure tables optimally to support
the queries required by each workflow step.
3. Design the tables that will support the required queries using Chebotko diagrams.
4. Pay special attention to primary key, partition key, and clustering key design.
5. Use Cassandra data types effectively and think about where you can use unique
Cassandra features such as collections and user-defined types to simplify the design.
Cassandra meets demanding requirements for scale, performance, and availability,
making it a preferred choice among IT architects, developers, administrators, and
decision-makers.
To learn more about data modeling with Cassandra and DataStax Enterprise, visit
www.datastax.com or send an email to [email protected].
© 2021 DataStax, All Rights Reserved. DataStax, Titan, and TitanDB are registered trademarks of
DataStax, Inc. and its subsidiaries in the United States and/or other countries.
Apache, Apache Cassandra, and Cassandra are either registered trademarks or trademarks of the
Apache Software Foundation or its subsidiaries in Canada, the United States, and/or other countries.