Cassandra Introduction

Introduction to Apache Cassandra
1
Me
Robert Stupp 
Freelancer, Coder, Architect 
@snazy snazy@snazy.de
Contributor to Apache Cassandra,
3.0 UDFs (CASSANDRA-7395 + related)
Databases, Network, Backend
2
Agenda
Apache Cassandra History
Design Principles
Outstanding differences
CQL Intro
Access C*
Clusters
Cassandra Future
3
Apache Cassandra
History
4
Apache Cassandra started at Facebook
inspired by Amazon Dynamo and Google Bigtable
Note: Facebook initially had two data centers.
5
2.1 released in Sep 2014
6
Apache Cassandra
Design Principles
7
Hardware failures 
can and will occur!
Cassandra handles failures.

From single node to whole data center.
From client to server.
8
The complicated part when learning Cassandra
is to understand Cassandra’s simplicity
9
Keep it simple
all nodes are equal
master-less architecture
no name nodes
no SPOF (single point of failure)
no read before modify (prevents race conditions)
10
Keep it running
No need to take cluster down … e.g.
during maintenance
during software update
Rolling restart is your friend
11
Outstanding
Differences
12
Cassandra
Highly scalable: runs with a few nodes
up to clusters of 1000+ nodes!
Linear scalability (proven!)
Multi datacenter aware (world-wide!)
No SPOF
13
Cassandra @ Apple
14
Linear Scalability
15
Scaling Cassandra
More data? 
-> add more nodes
Faster access? 
-> add more nodes
16
Read / Write
performance
Reads are fast
Writes are even faster
17
Durability
Writes are durable - period.
18
Availability @ Netflix
Chaos Monkey
kills nodes randomly
19
Availability @ Netflix
Chaos Gorilla
kills regions randomly
20
Availability @ Netflix
Chaos Kong
kills whole data centers
21
Availability @ Netflix
http://de.slideshare.net/planetcassandra/active-active-c-behind-the-scenes-at-netflix
22
32-node cluster (Raspberry Pis)
@DataStax
23
Most outstanding
Great documentation
Many blog posts
Many presentations
Many videos
Regular webinars
Huge, active and healthy community
24
Data Distribution
25
DHT
Data is organized in a "Distributed Hash Table"
(hash over row key)
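The idea of hashing a row key onto a ring of nodes can be sketched as follows. This is a minimal illustration, not Cassandra's implementation: the 32-bit token space, the use of MD5 as the hash, the evenly spaced node tokens and the names `token_for` and `Ring` are all assumptions made for the example.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # assumed token space for this sketch


def token_for(row_key: str) -> int:
    """Hash a row key to a token in 0 .. RING_SIZE - 1 (MD5 is a stand-in hash)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class Ring:
    """A ring of nodes 0 .. node_count - 1 with evenly spaced tokens."""

    def __init__(self, node_count: int):
        self.node_tokens = [i * RING_SIZE // node_count for i in range(node_count)]

    def owner_of_token(self, token: int) -> int:
        # The owner is the first node whose token is >= the given token;
        # past the last node, the lookup wraps around to node 0.
        index = bisect.bisect_left(self.node_tokens, token)
        return index % len(self.node_tokens)

    def owner(self, row_key: str) -> int:
        return self.owner_of_token(token_for(row_key))


ring = Ring(8)
print(ring.owner("some-row-key"))  # a node number between 0 and 7
```

Because the owner depends only on the hash of the row key, any node (or client) can compute where a row lives without asking a central directory.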
26
DHT
[Figure: nodes arranged in a ring, labelled by number]
27
Replication
28
Replication Factor 2
[Figure: ring of nodes 0 to 7; Row A and Row B are each stored on two nodes]
29
Replication Factor 3
[Figure: ring of nodes 0 to 7; Row A and Row B are each stored on three nodes]
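One simple way to picture replica placement is to start at the node that owns the row and continue clockwise around the ring until the replication factor is reached. The sketch below assumes exactly that rule on an eight-node ring; the slides do not name the placement strategy, so the rule and the function name `replicas` are assumptions for illustration.

```python
def replicas(primary: int, replication_factor: int, node_count: int = 8) -> list:
    """Return the nodes holding a row: the primary plus the next nodes clockwise."""
    if not 1 <= replication_factor <= node_count:
        raise ValueError("replication factor must be between 1 and node_count")
    return [(primary + i) % node_count for i in range(replication_factor)]


print(replicas(0, 2))  # [0, 1]
print(replicas(7, 3))  # [7, 0, 1]
```

With replication factor 3, losing any single node still leaves two copies of every row it held.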
30
Consistency
Consistency defined per request
Several consistency levels (CLs) for different needs
31
Eventual consistency is not "hopefully consistent"
EC means there’s a time gap until updates are consistently readable
32
Consistency Levels
ANY (only for writes)
ONE, LOCAL_ONE
TWO, THREE (not recommended)
ALL (not recommended)
QUORUM, LOCAL_QUORUM, EACH_QUORUM
SERIAL, LOCAL_SERIAL
33
Consistency
Data is always replicated
CL defines how many replicas must fulfill the request
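The arithmetic behind "how many replicas must fulfill the request" can be sketched for the simplest levels. The snippet covers only ONE, TWO, THREE, QUORUM and ALL within a single data center; the function names are hypothetical, and the overlap rule (write replicas + read replicas > replication factor) is the usual rule of thumb rather than something stated on the slide.

```python
def replicas_required(consistency_level: str, replication_factor: int) -> int:
    """Number of replicas that must respond for the given consistency level."""
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if consistency_level in fixed:
        needed = fixed[consistency_level]
    elif consistency_level == "QUORUM":
        needed = replication_factor // 2 + 1  # strict majority of replicas
    elif consistency_level == "ALL":
        needed = replication_factor
    else:
        raise ValueError("level not covered by this sketch: " + consistency_level)
    if needed > replication_factor:
        raise ValueError("not enough replicas for " + consistency_level)
    return needed


def read_sees_latest_write(write_cl: str, read_cl: str, replication_factor: int) -> bool:
    """True if every read replica set must overlap every write replica set."""
    written = replicas_required(write_cl, replication_factor)
    read = replicas_required(read_cl, replication_factor)
    return written + read > replication_factor


print(replicas_required("QUORUM", 3))                  # 2
print(read_sees_latest_write("QUORUM", "QUORUM", 3))   # True
print(read_sees_latest_write("ONE", "ONE", 3))         # False
```

Since the level is chosen per request, an application can write at ONE for speed and still read at ALL, or use QUORUM on both sides, depending on what each operation needs.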
34
Write
[Figure: a write request arriving at the ring of nodes 0 to 7]
35
Write
[Figure: a write request arriving at the ring of nodes 0 to 7]
36
Multi DC setup
[Figure: two data centers, DC 1 and DC 2]
37
Multi DC replication
[Figure: a write replicated across DC 1 and DC 2]
38
Multi DC replication
[Figure: a write replicated across DC 1 and DC 2]
39
Multi DC replication
[Figure: a write replicated across DC 1 and DC 2]
40
Replication & Consistency
Define # of replicas using replication factor
Define required consistency per request
41
CQL Introduction
CQL = Cassandra Query Language
42
“CQL is SQL minus joins, minus subqueries, plus collections”
(plus user types, plus tuple types)
43
Why CQL?
Introduces a schema to Cassandra
Familiar syntax
Easy to understand
DML operations are atomic (within a partition)
44
Data model 
(hierarchical view)
Keyspace (schema)
Table (column family)
Row
partition key (part of primary key)
static columns
clustering key (part of primary key)
columns
45
CQL / DDL
Similar to SQL
CREATE TABLE …
ALTER TABLE …
DROP TABLE …
46
CQL / DML
Similar to SQL
INSERT …
UPDATE …
DELETE …
SELECT …
47
CQL / BATCH
Group related modifications (INSERT, UPDATE, DELETE)
Atomic operation
48
CQL types
boolean, int (32bit), bigint (64bit),
float, double,
decimal ("BigDecimal"), 
varint ("BigInteger"),
ascii, text (= varchar), blob,
inet, timestamp, uuid, timeuuid
49
CQL collection types
list<foo>
set<foo>
map<foo, bar>
Since C* 2.1 collections can contain any type - even other collections.
50
CQL composite types
user types (C* 2.1) are composite types with named fields
tuple types (C* 2.1) are unstructured lists of values
51
CQL / user types
CREATE TYPE address (
    street text,
    zip int,
    city text);

CREATE TABLE users (
    username text,
    addresses map<text, frozen<address>>,
    ...
52
Cassandra Data Modeling
Access by key
no access by arbitrary WHERE clause
Duplicate data (it’s ok!)
Aggregate data
Build application-maintained indexes
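The points above (access by key, duplicated data, application-maintained indexes) can be illustrated without a database. In the sketch below two plain dictionaries stand in for two tables, each keyed for one access pattern; the class `UserStore` and its fields are hypothetical names, and a real application would issue the two writes as CQL statements instead.

```python
class UserStore:
    """Two key-addressed 'tables' that the application keeps in sync itself."""

    def __init__(self):
        self.users_by_name = {}      # lookup: username -> user record
        self.usernames_by_city = {}  # application-maintained index: city -> usernames

    def add_user(self, username: str, city: str) -> None:
        # Write the same fact twice, once per access pattern (duplication is ok).
        self.users_by_name[username] = {"username": username, "city": city}
        self.usernames_by_city.setdefault(city, set()).add(username)

    def user(self, username: str):
        return self.users_by_name.get(username)

    def users_in_city(self, city: str) -> list:
        # Answered by a single key lookup instead of an arbitrary WHERE clause.
        return sorted(self.usernames_by_city.get(city, set()))


store = UserStore()
store.add_user("alice", "Berlin")
store.add_user("bob", "Berlin")
print(store.users_in_city("Berlin"))  # ['alice', 'bob']
```

Each question the application asks gets its own key-addressed structure, which is the "What questions do I have?" approach in miniature.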
53
RDBMS modeling
54
C* modeling
55
Data Modeling with RDBMS
Driven by
"How can I store something right?"
"What answers do I have?"
56
Data Modeling with NoSQL
Driven by
"How can I access something right?"
"What questions do I have?"
57
Data Modeling
Basics
Work top-down. Think about:
What does the application do?
What are the access patterns?
Now design data model
58
Data Modeling
cassandra-day-sv-2014-fundamentals-of-apache-cassandra-data-modeling
data-modeling-with-travis-price
59
Accessing
Cassandra
60
Command Line
cqlsh: CQL shell
nodetool: node/cluster administration
61
GUI: DevCenter
Visual query tool
62
Stress test?
Cassandra 2.1 comes with an improved stress tool
Simulate read+write workload
Uses configurable data
Works against older C* versions, too
63
DataStax APLv2 
Open Source Drivers
for Java
for Python
for C#
for Scala / Spark
https://github.com/datastax/
or http://www.datastax.com/download
64
Native protocol
C*’s own network protocol for clients
Request multiplexing
Schema change notifications
Cluster change notifications
65
Third Party Drivers
for huge number of languages
66
Mappers
High-level mappers exist at least for Java
Special case: Scala, due to its strong+complex type model
(DataStax OSS Spark driver)
67
Spark + Hadoop
Yes - works really well
Note: Spark is said to be up to about 100x faster than Hadoop MapReduce
68
Clusters
69
Cluster sizes
C* works with a few nodes
C* works with several hundred / thousand nodes
70
Cluster setup
Configure for multiple data centers
Plan for multi-DC setup :)
71
Cluster experience
Remember: A single Cassandra cluster works over multiple data centers all over the world
"Disaster proven"
Hurricanes
Amazon DC outages
72
Apache Cassandra 
Future
73
Cassandra 3.0 (in development)
Subject to change!!!
User Defined Functions
Aggregate functions
Functional indexes
Workload recording + playback
Better SSTables, fully off-heap row cache, better serial consistency
Indexes w/ high cardinality
74
Get active!
75
Cassandra Community
http://cassandra.apache.org/
http://planetcassandra.org/ - Blog
http://www.slideshare.net/planetcassandra/presentations
http://de.slideshare.net/DataStax/presentations
76
Cassandra Community
https://www.youtube.com/user/PlanetCassandra
https://www.youtube.com/user/DataStax
http://www.datastax.com/dev/blog/
http://www.datastax.com/docs/
User mailing list: user@cassandra.apache.org
77
Free C* Training!
http://planetcassandra.org/cassandra-training/
78
Get involved!
Ask questions, submit RFEs or share experiences on the user mailing list
user@cassandra.apache.org
Answers arrive quickly!
79
Live Demo
User Defined Functions
80
C* 3.0 UDFs
Users create functions using
CREATE FUNCTION …
LANGUAGE …
AS …
Java, JavaScript, Scala, Groovy, JRuby, Jython
Functions work on all nodes
81
C* 3.0 UDFs
Example
CREATE FUNCTION sin(input double)
    RETURNS double
    LANGUAGE javascript
    AS 'Math.sin(input)';
This is JavaScript!
82
UDFs for what?
Targeted for C* 3.0
Own aggregation code - e.g.
SELECT sum(value) FROM table
WHERE …;
Functional indexes - e.g.
CREATE INDEX idx
ON table ( myFunction(colname) );
83
Thanks for your attention
Download Apache Cassandra at
http://cassandra.apache.org/
Robert Stupp 
@snazy 
snazy@snazy.de 
de.slideshare.net/RobertStupp
84
Q & A
85
86
BACKUP SLIDES
User-Defined-Functions 
Demo
87