Learning Apache Cassandra - Sample Chapter
Mat Brown

Build an efficient, scalable, fault-tolerant, and highly-available
data layer into your application using Cassandra
Horizontal scalability
Horizontal scalability refers to the ability to expand the storage and processing
capacity of a database by adding more servers to a database cluster. A traditional
single-master database's storage capacity is limited by the capacity of the server that
hosts the master instance. If the data set outgrows this capacity, and a more powerful
server isn't available, the data set must be sharded among multiple independent
database instances that know nothing of each other. Your application bears
responsibility for knowing to which instance a given piece of data belongs.
Cassandra, on the other hand, is deployed as a cluster of instances that are all aware
of each other. From the client application's standpoint, the cluster is a single entity;
the application need not know, nor care, which machine a piece of data belongs to.
Instead, data can be read from or written to any instance in the cluster, referred to as
a node; that node will forward the request to the instance where the data actually belongs.
The result is that Cassandra deployments have an almost limitless capacity to store
and process data; when additional capacity is required, more machines can simply
be added to the cluster. When new machines join the cluster, Cassandra takes care
of rebalancing the existing data so that each node in the expanded cluster has a
roughly equal share.
Cassandra is one of several popular distributed databases
inspired by the Dynamo architecture, originally published in a paper
by Amazon. Other widely used implementations of Dynamo include
Riak and Voldemort. You can read the original paper at
http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf.
High availability
The simplest database deployments are run as a single instance on a single server.
This sort of configuration is highly vulnerable to interruption: if the server is affected
by a hardware failure or network connection outage, the application's ability to
read and write data is completely lost until the server is restored. If the failure is
catastrophic, the data on that server might be lost completely.
A master-follower architecture improves this picture a bit. The master instance
receives all write operations, and then these operations are replicated to follower
instances. The application can read data from the master or any of the follower
instances, so a single host becoming unavailable will not prevent the application
from continuing to read data. A failure of the master, however, will still prevent
the application from performing any write operations, so while this configuration
provides high read availability, it doesn't completely provide high availability.
Cassandra, on the other hand, has no single point of failure for reading or writing
data. Each piece of data is replicated to multiple nodes, but none of these nodes
holds the authoritative master copy. If a machine becomes unavailable, Cassandra
will continue writing data to the other nodes that share data with that machine, and
will queue the operations and update the failed node when it rejoins the cluster. This
means in a typical configuration, two nodes must fail simultaneously for there to be
any application-visible interruption in Cassandra's availability.
How many copies?
When you create a keyspace, Cassandra's version of a database, you
specify how many copies of each piece of data should be stored; this is
called the replication factor. A replication factor of 3 is a common and
good choice for many use cases.
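As a concrete sketch (the CREATE KEYSPACE syntax is covered at the end of this chapter), a keyspace that keeps three copies of each piece of data might be declared as follows; the keyspace name here is hypothetical:

CREATE KEYSPACE "blog"
WITH REPLICATION = {
    'class': 'SimpleStrategy', 'replication_factor': 3
};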
Write optimization
Traditional relational and document databases are optimized for read performance.
Writing data to a relational database typically involves making in-place updates to
complex data structures on disk, in order to keep those structures efficient and
flexible to read. Updating these structures is a very expensive operation from the
standpoint of disk I/O, which is often the limiting factor for
database performance. Since writes are more expensive than reads, you'll typically
avoid any unnecessary updates to a relational database, even at the expense of extra
read operations.
Cassandra, on the other hand, is highly optimized for write throughput, and in fact
never modifies data on disk; it only appends to existing files or creates new ones.
This is much easier on disk I/O and means that Cassandra can provide astonishingly
high write throughput. Since both writing data to Cassandra, and storing data in
Cassandra, are inexpensive, denormalization carries little cost and is a good way to
ensure that data can be efficiently read in various access scenarios.
Because Cassandra is optimized for write volume, you shouldn't shy
away from writing data to the database. In fact, it's most efficient to
write without reading whenever possible, even if doing so might result
in redundant updates.
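As a sketch, assuming a hypothetical users table keyed by username, a "blind" write in CQL simply overwrites whatever is there, with no need to read the row first:

-- No read required: this either updates the existing row or creates it
UPDATE "users" SET "email" = 'alice@example.com' WHERE "username" = 'alice';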
Just because Cassandra is optimized for writes doesn't make it bad at reads; in fact,
a well-designed Cassandra database can handle very heavy read loads with no
problem. We'll cover the topic of efficient data modeling in great depth in the next
few chapters.
[9]
Structured records
The first three database features we looked at are commonly found in distributed
data stores. However, databases like Riak and Voldemort are purely key-value
stores; these databases have no knowledge of the internal structure of a record that's
stored at a particular key. This means that useful operations such as updating only part of a
record, reading only certain fields from a record, or retrieving records that contain a
particular value in a given field are not possible.
Relational databases like PostgreSQL, document stores like MongoDB, and, to a
limited extent, newer key-value stores like Redis do have a concept of the internal
structure of their records, and most application developers are accustomed to taking
advantage of the possibilities this allows. None of these databases, however, offer the
advantages of a masterless distributed architecture.
In Cassandra, records are structured much in the same way as they are in a relational
database, using tables, rows, and columns. Thus, applications using Cassandra
can enjoy all the benefits of masterless distributed storage while also getting all the
advanced data modeling and access features associated with structured records.
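For example, a hypothetical table of users might be declared as follows; each row has a primary key and individually readable and writable columns:

CREATE TABLE "users" (
    "username" text PRIMARY KEY,
    "email" text,
    "location" text
);

-- Read or update only the columns you need
SELECT "email" FROM "users" WHERE "username" = 'alice';
UPDATE "users" SET "location" = 'Brooklyn, NY' WHERE "username" = 'alice';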
Secondary indexes
A secondary index, commonly referred to as an index in the context of a relational
database, is a structure allowing efficient lookup of records by some attribute
other than their primary key. This is a widely useful capability: for instance, when
developing a blog application, you would want to be able to easily retrieve all of the
posts written by a particular author. Cassandra supports secondary indexes; while
Cassandra's version is not as versatile as indexes in a typical relational database, it's
a powerful feature in the right circumstances.
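As a sketch of the blog example, assuming a hypothetical posts table, a secondary index lets us look up rows by the author column rather than by the primary key:

CREATE TABLE "posts" (
    "id" uuid PRIMARY KEY,
    "author" text,
    "body" text
);

CREATE INDEX ON "posts" ("author");

-- The index makes this lookup by a non-primary-key column possible
SELECT * FROM "posts" WHERE "author" = 'alice';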
In Cassandra, secondary indexes can't be used for result ordering, but tables can be
structured such that rows are always kept sorted by a given column or columns, called
clustering columns. Sorting by arbitrary columns at read time is not possible, but the
capacity to efficiently order records in any way, and to retrieve ranges of records based
on this ordering, is an unusually powerful capability for a distributed database.
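To illustrate, here is a hypothetical table in which each author's posts are kept permanently sorted by a timestamp clustering column, newest first, making range queries over that ordering efficient:

CREATE TABLE "posts_by_author" (
    "author" text,
    "created_at" timeuuid,
    "body" text,
    PRIMARY KEY ("author", "created_at")
) WITH CLUSTERING ORDER BY ("created_at" DESC);

-- Returns the ten most recent posts by this author, using the stored ordering
SELECT * FROM "posts_by_author" WHERE "author" = 'alice' LIMIT 10;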
Immediate consistency
When we write a piece of data to a database, we hope that it will be immediately
available to any other process that may wish to read it. From another
point of view, when we read some data from a database, we would like to be
guaranteed that the data we retrieve is the most recently updated version. This
guarantee is called immediate consistency, and it's a property of most common
single-master databases like MySQL and PostgreSQL.
Distributed systems like Cassandra typically do not provide an immediate
consistency guarantee. Instead, developers must be willing to accept eventual
consistency, which means when data is updated, the system will reflect that update
at some point in the future. Developers are willing to give up immediate consistency
precisely because it is a direct tradeoff with high availability.
In the case of Cassandra, that tradeoff is made explicit through tunable consistency.
Each time you design a write or read path for data, you have the option of immediate
consistency with less resilient availability, or eventual consistency with extremely
resilient availability. We'll cover consistency tuning in great detail in Chapter 10, How
Cassandra Distributes Data.
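As a sketch of what this tuning looks like in practice, cqlsh (introduced later in this chapter) lets you set the consistency level for subsequent statements; the drivers expose an equivalent per-query setting. The users table here is hypothetical:

-- Require a majority of replicas to acknowledge the read
CONSISTENCY QUORUM;
SELECT * FROM "users" WHERE "username" = 'alice';

-- Accept an answer from a single replica, for higher availability
CONSISTENCY ONE;
SELECT * FROM "users" WHERE "username" = 'alice';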
Discretely writable collections
Storing a collection of values as a single opaque blob means that adding or removing
one element requires reading and rewriting the whole collection. For this reason, many
databases offer built-in collection structures that can be discretely updated: values can
be added to, and removed from, collections without reading and rewriting the entire
collection. Cassandra is no exception, offering list, set, and map collections, and
supporting operations like "append the number 3 to the end of this list". Neither the
client nor Cassandra itself needs to read the current state of the collection in order to
update it, meaning collection updates are also blazingly efficient.
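A minimal sketch, assuming a hypothetical table with a list column, shows what a discrete collection update looks like in CQL:

CREATE TABLE "high_scores" (
    "username" text PRIMARY KEY,
    "scores" list<int>
);

-- Append the number 3 to the end of the list without reading its current contents
UPDATE "high_scores" SET "scores" = "scores" + [3] WHERE "username" = 'alice';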
Relational joins
In real-world applications, different pieces of data relate to each other in a variety of
ways. Relational databases allow us to perform queries that make these relationships
explicit, for instance, to retrieve a set of events whose location is in the state of New
York (this is assuming events and locations are different record types). Cassandra,
however, is not a relational database, and does not support anything like joins.
Instead, applications using Cassandra typically denormalize data and make clever
use of clustering in order to perform the sorts of data access that would use a join in
a relational database.
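As a sketch of the events example, a denormalized table keyed by state stores each event's details alongside its location, so the question can be answered with a single query; the table and columns are hypothetical:

CREATE TABLE "events_by_state" (
    "state" text,
    "event_id" uuid,
    "event_name" text,
    "venue" text,
    PRIMARY KEY ("state", "event_id")
);

-- All events whose location is in New York, without a join
SELECT * FROM "events_by_state" WHERE "state" = 'NY';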
For data sets that aren't already denormalized, applications can also perform
client-side joins, which mimic the behavior of a relational database by performing
multiple queries and joining the results at the application level. Client-side joins are
less efficient than reading data that has been denormalized in advance, but offer
more flexibility. We'll cover both of these approaches in Chapter 6, Denormalizing
Data for Maximum Performance.
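A client-side join over the same hypothetical data might issue two queries and combine the results in application code, assuming a locations table that is queryable by state and an events table partitioned by location:

-- 1. Find the locations in New York
SELECT "id" FROM "locations" WHERE "state" = 'NY';

-- 2. For each returned id (an example value is shown), fetch that location's
--    events, then merge the result sets in the application
SELECT * FROM "events" WHERE "location_id" = 8a9b2c71-33f0-4b9e-9e0a-2f6d5c4e1b7d;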
MapReduce
MapReduce is a technique for performing aggregate processing on large amounts of
data in parallel; it's a particularly common technique in data analytics applications.
Cassandra does not offer built-in MapReduce capabilities, but it can be integrated
with Hadoop in order to perform MapReduce operations across Cassandra data
sets, or Spark for real-time data analysis. The DataStax Enterprise product provides
integration with both of these tools out-of-the-box.
The following table summarizes how Cassandra compares to several other popular
databases on the features discussed in this chapter:

Feature                         | Cassandra                   | PostgreSQL | MongoDB | Redis   | Riak
Structured records              | Yes                         | Yes        | Yes     | Limited | No
Secondary indexes               | Yes                         | Yes        | Yes     | No      | Yes
Discretely writable collections | Yes                         | Yes        | Yes     | Yes     | No
Relational joins                | No                          | Yes        | No      | No      | No
Built-in MapReduce              | No                          | No         | Yes     | No      | Yes
Fast result ordering            | Yes                         | Yes        | Yes     | Yes     | No
Immediate consistency           | Configurable at query level | Yes        | Yes     | Yes     | Configurable at cluster level
Transparent sharding            | Yes                         | No         | Yes     | No      | Yes
No single point of failure      | Yes                         | No         | No      | No      | Yes
High throughput writes          | Yes                         | No         | No      | Yes     | Yes
Installing Cassandra
Now that you're acquainted with Cassandra's substantial powers, you're no doubt
chomping at the bit to try it out. Happily, Cassandra is free, open source, and very
easy to get running.
Installing on Mac OS X
First, we need to make sure that we have an up-to-date installation of the Java
Runtime Environment. Open the Terminal application, and type the following into
the command prompt:
$ java -version
You will see an output that looks something like the following:
java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)
Pay particular attention to the java version listed: if it's lower than
1.7.0_25, you'll need to install a new version. If you have an older
version of Java or if Java isn't installed at all, head to https://
www.java.com/en/download/mac_download.jsp and follow
the download instructions on the page.
You'll need to set up your environment so that Cassandra knows where to find the
latest version of Java. To do this, set the JAVA_HOME environment variable to the Java
install location, and add the new installation's executables to your PATH,
as follows:
$ export JAVA_HOME="/Library/Internet PlugIns/JavaAppletPlugin.plugin/Contents/Home"
$ export PATH="$JAVA_HOME/bin":$PATH
You should put these two lines at the bottom of your shell startup file (.bashrc, or
.bash_profile depending on your setup) to ensure that things still work when you
open a new terminal.
The installation instructions given earlier assume that you're using the
latest version of Mac OS X (at the time of writing this, 10.10 Yosemite).
If you're running a different version of OS X, installing Java might
require different steps. Check out https://www.java.com/en/
download/faq/java_mac.xml for detailed installation information.
Once you've got the right version of Java, you're ready to install Cassandra. It's very
easy to install Cassandra using Homebrew; simply type the following:
$ brew install cassandra
$ pip install cassandra-driver cql
$ cassandra
With these commands, we have:
- Installed Cassandra using the Homebrew package manager
- Installed the CQL shell and its dependency, the Python Cassandra driver
- Started the Cassandra server
Installing on Ubuntu
First, we need to make sure that we have an up-to-date installation of the Java
Runtime Environment. Open the Terminal application, and type the following
into the command prompt:
$ java -version
Pay particular attention to the java version listed: it should start with
1.7. If you have an older version of Java, or if Java isn't installed at all,
you can install the correct version using the following command:
$ sudo apt-get install openjdk-7-jre-headless
Once you've got the right version of Java, you're ready to install Cassandra. First,
you need to add Apache's Debian repositories to your sources list. Add the following
lines to the file /etc/apt/sources.list:
deb http://www.apache.org/dist/cassandra/debian 21x main
deb-src http://www.apache.org/dist/cassandra/debian 21x main
In the Terminal application, type the following into the command prompt:
$ gpg --keyserver pgp.mit.edu --recv-keys F758CE318D77295D
$ gpg --export --armor F758CE318D77295D | sudo apt-key add -
$ gpg --keyserver pgp.mit.edu --recv-keys 2B5C1B00
$ gpg --export --armor 2B5C1B00 | sudo apt-key add -
$ gpg --keyserver pgp.mit.edu --recv-keys 0353B12C
$ gpg --export --armor 0353B12C | sudo apt-key add -
$ sudo apt-get update
$ sudo apt-get install cassandra
$ cassandra
With these commands, we have:
- Added the Apache repositories for Cassandra 2.1 to our sources list
- Added the public keys for the Apache repo to our system and updated our repository cache
- Installed Cassandra
- Started the Cassandra server
Installing on Windows
The easiest way to install Cassandra on Windows is to use the DataStax Community
Edition. DataStax is a company that provides enterprise-level support for Cassandra;
they also release Cassandra packages at both free and paid tiers. DataStax
Community Edition is free, and does not differ from the Apache package in any
meaningful way.
DataStax offers a graphical installer for Cassandra on Windows, which is available
for download at planetcassandra.org/cassandra. On this page, locate Windows
Server 2008/Windows 7 or Later (32-Bit) from the Operating System menu (you
might also want to look for 64-bit if you run a 64-bit version of Windows), and
choose MSI Installer (2.x) from the version columns.
Download and run the MSI file, and follow the instructions, accepting the defaults.
Once the installer completes this task, you should have an installation of Cassandra
running on your machine.
Here are the CQL binary drivers available for some popular programming languages:

Language             | Driver                  | Available at
Java                 | DataStax Java Driver    | github.com/datastax/java-driver
Python               | DataStax Python Driver  | github.com/datastax/python-driver
Ruby                 | DataStax Ruby Driver    | github.com/datastax/ruby-driver
C++                  | DataStax C++ Driver     | github.com/datastax/cpp-driver
C#                   | DataStax C# Driver      | github.com/datastax/csharp-driver
JavaScript (Node.js) | node-cassandra-cql      | github.com/jorgebay/node-cassandra-cql
PHP                  | phpbinarycql            | github.com/rmcfrazier/phpbinarycql
While you will likely use one of these drivers in your applications, to try out the code
examples in this book, you can simply use the cqlsh tool, which is a command-line
interface for executing CQL queries and viewing the results. To start cqlsh on OS X
or Linux, simply type cqlsh into your command line; you should see something
like this:
$ cqlsh
Connected to Test Cluster at localhost:9160.
[cqlsh 4.1.1 | Cassandra 2.0.7 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Use HELP for help.
cqlsh>
On Windows, you can start cqlsh by finding the Cassandra CQL Shell application in
the DataStax Community Edition group in your applications. Once you open it, you
should see the same output we just saw.
Creating a keyspace
A keyspace is a collection of related tables, equivalent to a database in a relational
system. To create the keyspace for our MyStatus application, issue the following
statement in the CQL shell:
CREATE KEYSPACE "my_status"
WITH REPLICATION = {
'class': 'SimpleStrategy', 'replication_factor': 1
};
Here we created a keyspace called my_status, which we will use for the
remainder of this book. When we create a keyspace, we have to specify replication
options. Cassandra provides several strategies for managing replication of data;
SimpleStrategy is the best strategy as long as your Cassandra deployment does
not span multiple data centers. The replication_factor value tells Cassandra
how many copies of each piece of data are to be kept in the cluster; since we are only
running a single instance of Cassandra, there is no point in keeping more than one
copy of the data. In a production deployment, you would certainly want a higher
replication factor; 3 is a good place to start.
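As a sketch, if you later move this keyspace to a multi-node cluster, you could raise the replication factor with an ALTER KEYSPACE statement like the following:

ALTER KEYSPACE "my_status"
WITH REPLICATION = {
    'class': 'SimpleStrategy', 'replication_factor': 3
};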
A few things at this point are worth noting about CQL's syntax:

- Statements are terminated by a semicolon and can span multiple lines; nothing is executed until the semicolon is reached.
- Double quotes are used around identifiers such as keyspace, table, and column names (for example, "my_status").
- Single quotes are used for string literals, such as 'SimpleStrategy'; the replication options are supplied as a map of keys to values enclosed in curly braces.
Selecting a keyspace
Once you've created a keyspace, you'll want to use it. To do this,
employ the USE command:
USE "my_status";
This tells Cassandra that all future commands will implicitly refer to tables inside the
my_status keyspace. If you close the CQL shell and reopen it, you'll need to reissue
this command.
Summary
In this chapter, you explored the reasons to choose Cassandra from among the many
databases available, and having determined that Cassandra is a great choice, you
installed it on your development machine.
You had your first taste of the Cassandra Query Language when you issued your
first command via the CQL shell in order to create a keyspace. You're now poised to
begin working with Cassandra in earnest.
In the next chapter, we'll begin building the MyStatus application, starting out with a
simple table to model users. We'll cover a lot more CQL commands, and before you
know it, you'll be reading and writing data like a pro.