Cassandra Introduction


Introduction to Apache Cassandra

1
Me

Robert Stupp

Freelancer, Coder, Architect

@snazy snazy@snazy.de

Contributor to Apache Cassandra,



3.0 UDFs (CASSANDRA-7395 + related)

Databases, Network, Backend

2
Agenda
Apache Cassandra History

Design Principles

Outstanding differences

CQL Intro

Access C*

Clusters

Cassandra Future

3
Apache Cassandra
History

4
Apache Cassandra

started at Facebook

inspired by Amazon Dynamo and Google Bigtable

Note: Facebook initially had



two data centers.

5
2.1 released in Sep 2014

6
Apache Cassandra
Design Principles

7
Hardware failures

can and will occur!

Cassandra handles failures.


From single node to whole data center.
From client to server.
8
The complicated part

when learning Cassandra,

is to understand
Cassandra’s simplicity

9
Keep it simple
all nodes are equal

master-less architecture

no name nodes

no SPOF (single point of failure)

no read before modify



(prevent race conditions)

10
Keep it running

No need to take cluster down … e.g.

during maintenance

during software update

Rolling restart is your friend

11
Outstanding
Differences

12
Cassandra

Highly scalable

runs with a few nodes

up to clusters of 1000+ nodes!

Linear scalability (proven!)

Multi datacenter aware (world-wide!)

No SPOF

13
Cassandra @ Apple

14
Linear Scalability

15
Scaling Cassandra

More data?

-> add more nodes

Faster access?

-> add more nodes

16
Read / Write
performance

Reads are fast

Writes are even faster

17
Durability

Writes are durable - period.

18
Availability @
Netflix
Chaos Monkey

kills nodes randomly

19
Availability @
Netflix

Chaos Gorilla

kills regions randomly

20
Availability @
Netflix

Chaos Kong

kills whole data centers

21
Availability @
Netflix

http://de.slideshare.net/planetcassandra/active-active-c-behind-the-scenes-at-netflix
22
32-node cluster (Raspberry Pis)
@DataStax

23
Most outstanding
Great documentation

Many blog posts

Many presentations

Many videos

Regular webinars

Huge, active and healthy community

24
Data Distribution

25
DHT

Data is organized in a "Distributed Hash Table"

(hash over row key)
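
The "hash over row key" idea can be sketched in a few lines. This is a minimal illustration, not Cassandra's implementation: MD5 stands in for the real partitioner hash, each node gets exactly one evenly spaced token, and the names `token_for`, `build_ring` and `owner_of` are hypothetical.

```python
import bisect
import hashlib
from typing import List

RING_SIZE = 2 ** 64  # assumed token space for this sketch


def token_for(row_key: str) -> int:
    """Hash a row key to a position (token) on the ring."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % RING_SIZE


def build_ring(node_count: int) -> List[int]:
    """One evenly spaced token per node; the list index is the node number."""
    return [i * (RING_SIZE // node_count) for i in range(node_count)]


def owner_of(row_key: str, ring: List[int]) -> int:
    """The owner is the node with the first token at or after the key's token."""
    index = bisect.bisect_left(ring, token_for(row_key))
    return index % len(ring)  # past the last token, wrap around to node 0


ring = build_ring(8)  # eight nodes, numbered 0 to 7
for key in ("alice", "bob", "carol"):
    print(key, "-> node", owner_of(key, ring))
```

Because the owner depends only on the hash of the key, any node can compute where a row lives without asking a master.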

26
DHT

[Diagram: ring of eight nodes, numbered 0 to 7]

27
Replication

28
Replication Factor 2
[Diagram: ring of nodes 0 to 7 showing the nodes that hold Row A and Row B]

29
Replication Factor 3
[Diagram: ring of nodes 0 to 7 showing the nodes that hold Row A and Row B]
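
A rough sketch of what the replication factor means on the ring. It assumes the simplest placement rule (the owning node plus the next nodes clockwise); the node numbers in the example calls are made up and are not taken from the diagrams.

```python
def replicas_for(primary_node: int, replication_factor: int, node_count: int = 8):
    """Return the nodes holding a row: the owner plus the next nodes clockwise."""
    if not 1 <= replication_factor <= node_count:
        raise ValueError("replication factor must be between 1 and the node count")
    return [(primary_node + i) % node_count for i in range(replication_factor)]


print(replicas_for(0, 2))  # [0, 1]
print(replicas_for(6, 3))  # [6, 7, 0]
```

With replication factor 3 a row survives the loss of any two of its replica nodes.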

30
Consistency

Consistency defined per request

Several consistency levels (CLs)



for different needs

31
Eventual consistency

is not
hopefully consistent

EC means there’s a time gap until updates


are consistently readable

32
Consistency Levels
ANY (only for writes)

ONE, LOCAL_ONE

TWO, THREE (not recommended)

ALL (not recommended)

QUORUM, LOCAL_QUORUM, EACH_QUORUM

SERIAL, LOCAL_SERIAL

33
Consistency

Data is always replicated

CL defines how many replicas must


fulfill the request
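
How many replicas "must fulfill the request" can be written down as a small table. This is a sketch for a single data center; the function name `required_acks` is hypothetical, and ANY, the LOCAL_* levels and the SERIAL levels are left out on purpose.

```python
def required_acks(consistency_level: str, replication_factor: int) -> int:
    """Number of replicas that must respond before the request succeeds."""
    needed = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": replication_factor // 2 + 1,  # strict majority of replicas
        "ALL": replication_factor,
    }
    if consistency_level not in needed:
        raise ValueError("level not covered by this sketch: " + consistency_level)
    acks = needed[consistency_level]
    if acks > replication_factor:
        raise ValueError("not enough replicas for " + consistency_level)
    return acks


for level in ("ONE", "QUORUM", "ALL"):
    print(level, required_acks(level, replication_factor=3))
```

With replication factor 3, QUORUM needs 2 replicas, so one replica may be down and requests still succeed.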

34
Write
[Diagram: a write arriving at the ring of nodes 0 to 7]

35
Write
[Diagram: a write arriving at the ring of nodes 0 to 7]

36
Multi DC setup
[Diagram: one cluster spanning two data centers, DC 1 and DC 2]

37
Multi DC replication
[Diagram: a write arriving in a cluster that spans DC 1 and DC 2]

38
Multi DC replication
[Diagram: a write arriving in a cluster that spans DC 1 and DC 2]

39
Multi DC replication
[Diagram: a write arriving in a cluster that spans DC 1 and DC 2]
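
For a multi-DC cluster, the quorum-style consistency levels differ in where the acknowledgements must come from. A small sketch, assuming a replication factor configured per data center; the DC names, the factors and the helper `acks_needed` are illustrative only.

```python
def quorum(replica_count: int) -> int:
    """A quorum is a strict majority of the replicas."""
    return replica_count // 2 + 1


def acks_needed(level: str, rf_per_dc: dict, local_dc: str) -> dict:
    """Map each scope to the number of replica acknowledgements it must deliver."""
    if level == "LOCAL_QUORUM":   # majority in the coordinator's own DC only
        return {local_dc: quorum(rf_per_dc[local_dc])}
    if level == "EACH_QUORUM":    # majority in every DC
        return {dc: quorum(rf) for dc, rf in rf_per_dc.items()}
    if level == "QUORUM":         # majority of all replicas, wherever they are
        return {"all DCs combined": quorum(sum(rf_per_dc.values()))}
    raise ValueError("level not covered by this sketch: " + level)


rf = {"DC1": 3, "DC2": 3}  # hypothetical: three replicas per data center
for level in ("LOCAL_QUORUM", "EACH_QUORUM", "QUORUM"):
    print(level, acks_needed(level, rf, local_dc="DC1"))
```

LOCAL_QUORUM never waits for the remote data center, which is why it is the usual choice when latency between DCs is high.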

40
Replication &

Consistency

Define # of replicas

using replication factor

Define required consistency



per request

41
CQL Introduction

CQL = Cassandra Query Language

42
“CQL is SQL

minus joins,

minus subqueries,

plus collections”


(plus user types,

plus tuple types)

43
Why CQL?

Introduces a schema to Cassandra

Familiar syntax

Easy to understand

DML operations are atomic

44
Data model

(hierarchical view)
Keyspace (schema)

Table (column family)

Row

partition key (part of primary key)

static columns

clustering key (part of primary key)

columns
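
The hierarchy above can be pictured as nested maps. The following is only a mental model with made-up class and column names, not how Cassandra stores data: a table maps partition keys to partitions, and a partition holds static columns plus rows ordered by clustering key.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    static_columns: dict = field(default_factory=dict)  # shared by all rows
    rows: dict = field(default_factory=dict)             # clustering key -> columns


@dataclass
class Table:
    partitions: dict = field(default_factory=dict)       # partition key -> Partition

    def upsert(self, partition_key, clustering_key, **columns):
        partition = self.partitions.setdefault(partition_key, Partition())
        partition.rows.setdefault(clustering_key, {}).update(columns)

    def set_static(self, partition_key, **columns):
        partition = self.partitions.setdefault(partition_key, Partition())
        partition.static_columns.update(columns)

    def read_partition(self, partition_key):
        """Rows come back sorted by clustering key, each with the static columns."""
        partition = self.partitions[partition_key]
        return [
            {**partition.static_columns, "clustering_key": key, **partition.rows[key]}
            for key in sorted(partition.rows)
        ]


keyspace = {"readings": Table()}  # keyspace: table name -> table
readings = keyspace["readings"]
readings.set_static("sensor-1", location="roof")
readings.upsert("sensor-1", 20, temperature=21.5)
readings.upsert("sensor-1", 10, temperature=20.0)
print(readings.read_partition("sensor-1"))
```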

45
CQL / DDL

Similar to SQL

CREATE TABLE …

ALTER TABLE …

DROP TABLE …

46
CQL / DML

Similar to SQL

INSERT …

UPDATE …

DELETE …

SELECT …

47
CQL / BATCH

Group related modifications



(INSERT, UPDATE, DELETE)

Atomic operation

48
CQL types
boolean, int (32bit), bigint (64bit),

float, double,

decimal ("BigDecimal"),

varint ("BigInteger"),

ascii, text (= varchar), blob,

inet, timestamp, uuid, timeuuid

49
CQL collection
types
list < foo >

set < foo >

map < foo , bar >

Since C* 2.1 collections can contain

any type - even other collections.


50
CQL composite
types

user types (C* 2.1)



are composite types with named fields

tuple types (C* 2.1)



are unstructured lists of values

51
CQL / user types

CREATE TYPE address (
    street text,
    zip int,
    city text);

CREATE TABLE users (
    username text,
    addresses map<text, frozen<address>>,
    ...

52
Cassandra

Data Modeling
Access by key

no access by arbitrary WHERE clause

Duplicate data (it’s ok!)

Aggregate data

Build application maintained indexes
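
A toy illustration of "duplicate data" and "application maintained indexes", using plain dictionaries in place of tables. The table and field names are invented for the example; the point is that the application writes every record once per access pattern.

```python
users_by_username = {}  # access pattern 1: look up a user by name
users_by_city = {}      # access pattern 2: list users in a city (duplicated data)


def add_user(username: str, city: str, email: str) -> None:
    """Write the same facts to both 'tables' so each query is a single key lookup."""
    users_by_username[username] = {"city": city, "email": email}
    users_by_city.setdefault(city, {})[username] = {"email": email}


def users_in(city: str) -> list:
    """Answer the second access pattern without scanning users_by_username."""
    return sorted(users_by_city.get(city, {}))


add_user("alice", "Cologne", "alice@example.org")
add_user("bob", "Cologne", "bob@example.org")
add_user("carol", "Berlin", "carol@example.org")
print(users_in("Cologne"))  # ['alice', 'bob']
```

The cost is extra writes and the duty to keep both copies in step; the benefit is that every read is access by key.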

53
RDBMS modeling

54
C* modeling

55
Data Modeling

with RDBMS
Driven by

"How can I store


something right?"
"What answers

do I have?"
56
Data Modeling

with NoSQL
Driven by

"How can I access


something right?"


"What questions

do I have?"
57
Data Modeling
Basics

Work top-down. Think about:

What does the application do?

What are the access patterns?

Now design data model

58
Data Modeling

http://de.slideshare.net/planetcassandra/cassandra-day-sv-2014-fundamentals-of-apache-cassandra-data-modeling

http://de.slideshare.net/planetcassandra/data-modeling-with-travis-price

59
Accessing
Cassandra

60
Command Line

cqlsh

CQL shell

nodetool

node/cluster administration

61
GUI: DevCenter

Visual query tool

62
Stress test?

Cassandra 2.1 comes with improved


stress tool

Simulate read+write workload

Uses configurable data

Works against older C* versions, too

63
DataStax APLv2

Open Source Drivers
for Java

for Python

for C#

for Scala / Spark

https://github.com/datastax/
or http://www.datastax.com/download
64
Native protocol

C*’s own network protocol for clients

Request multiplexing

Schema change notifications

Cluster change notifications

65
Third Party Drivers

for huge number of languages

66
Mappers

High level mappers exist at least for


Java

Special case: Scala



due to its strong+complex type
model (DataStax OSS Spark driver)

67
Spark + Hadoop

Yes - works really well

Note: Spark is claimed to be up to 100x faster than Hadoop MapReduce (for in-memory workloads)

68
Clusters

69
Cluster sizes

C* works with a few nodes

C* works with several hundred /


thousand nodes

70
Cluster setup

Configure for multiple data centers

Plan for multi-DC setup :)

71
Cluster experience

Remember: A single Cassandra cluster works over multiple data centers all over the world

"Disaster proven"

Hurricanes

Amazon DC outages

72
Apache Cassandra

Future

73
Cassandra 3.0

(in development, subject to change!)
User Defined Functions

Aggregate functions

Functional indexes

Workload recording + playback

Better SSTables, Fully off-heap row cache, Better


serial consistency

Indexes w/ high cardinality

74
Get active!

75
Cassandra Community

http://cassandra.apache.org/

http://planetcassandra.org/ - Blog

http://www.slideshare.net/planetcassandra/presentations

http://de.slideshare.net/DataStax/presentations

76
Cassandra Community
https://www.youtube.com/user/PlanetCassandra

https://www.youtube.com/user/DataStax

http://www.datastax.com/dev/blog/

http://www.datastax.com/docs/

Users Mailing List

user@cassandra.apache.org

77
Free C* Training!

http://planetcassandra.org/cassandra-training/
78
Get involved!

Ask questions,

submit RFEs or experiences to

user mailing list

user@cassandra.apache.org

Answers arrive quickly!

79
Live Demo
User Defined Functions

80
C* 3.0 UDFs

Users create functions using



CREATE FUNCTION …

LANGUAGE … 

AS …

Java, JavaScript, Scala, Groovy,


JRuby, Jython

Functions work on all nodes

81
C* 3.0 UDFs

Example

CREATE FUNCTION sin(input double)
    RETURNS double
    LANGUAGE javascript
    AS 'Math.sin(input)';

This is JavaScript!

82
UDFs for what?
Targeted for C* 3.0

Own aggregation code - e.g.



SELECT sum(value) FROM table

WHERE …;

Functional indexes - e.g.



CREATE INDEX idx

ON table ( myFunction(colname) );
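
One way to picture "own aggregation code" is as two small functions: a state function folded over every selected value, and a final function that turns the state into the result. The slides do not specify this structure (the feature is marked as subject to change), so the split and the names below are assumptions made for illustration.

```python
from functools import reduce


def avg_state(state, value):
    """State function: called once per value, carries (running total, count)."""
    total, count = state
    return (total + value, count + 1)


def avg_final(state):
    """Final function: turns the accumulated state into the aggregate result."""
    total, count = state
    return total / count if count else None


def aggregate(values, state_func, final_func, initial_state):
    """Fold the state function over the values, then apply the final function."""
    return final_func(reduce(state_func, values, initial_state))


print(aggregate([1.0, 2.0, 3.0], avg_state, avg_final, (0.0, 0)))  # 2.0
```

Keeping the state small matters, because the fold runs over every row the query selects.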

83
Thanks

for your attention

Download Apache Cassandra at


http://cassandra.apache.org/

Robert Stupp

@snazy

snazy@snazy.de

de.slideshare.net/RobertStupp
84
Q & A

85
86
BACKUP SLIDES
User-Defined-Functions

Demo

87