Cassandra Data Modeling Best Practices
Cassandra Modeling
Best Practices and Examples
Jay Patel
Architect, Platform Systems
@pateljay3001
That’s me
Technical Architect @ eBay
Passion for building high-scale systems
Architecting eBay’s next-gen data platform
Prototyped first version of eBay’s cloud platform
Built various social applications at IBM & eBay
Implemented telecom software at an early startup
Pursuing machine-learning at Stanford
Entrepreneurial minded
http://www.jaykumarpatel.com
eBay Marketplaces
$75 billion+ per year in goods are sold on eBay
112 million active users
2 billion+ page views/day
Petabytes of data
400+ million items for sale
Thousands of servers
[Figure: three data platforms side by side; platform names not recovered]
• Thousands of nodes; > 2K sharded logical hosts; > 16K tables; > 27K indexes; > 140 billion SQLs/day; > 5 PB provisioned
• Hundreds of nodes; > 50 TB; > 2 billion ops/day
• Hundreds of nodes; > 250 TB provisioned (local HDD + shared SSD); > 6 billion writes/day; > 5 billion reads/day
[Figure: Aug 2011, Aug 2012, May 2013. Doesn’t predict business]
eBay’s Real Time Big Data on Cassandra
Social Signals on eBay Product & Item pages
Mobile notification logging and tracking
Tracking for fraud detection
SOA request/response payload logging
Metrics collection and real-time reporting for thousands of servers
Personalization Data Service
NextGen Recommendation System with real-time taste graph for eBay users
Cloud CMS change history storage
Order Payment Management logging
Shipment tracking
RedLaser server logs and analytics
Intro to Cassandra Data Model
Non-relational, sparse model designed for high scale distributed storage
Data Model – Super & Composite column
• Grouping using Super Column
• Grouping using Composite column name
Why?
• Physical model is more similar to sorted map than relational
How?
• Map gives efficient key lookup & sorted nature gives efficient scans
• Unbounded no. of column keys
• Key can itself hold value
Each column has timestamp associated. Ignore it during modeling
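The nested map idea can be sketched in a few lines of Python. This is only a toy illustration, not a Cassandra API: the `ColumnFamily` class and its `put`, `get`, and `scan` methods are hypothetical names, and the data is made up. The outer dictionary gives the key lookup, and the sorted list of column keys gives the range scan.

```python
import bisect


class ColumnFamily:
    """Toy model of Map<RowKey, SortedMap<ColumnKey, ColumnValue>>."""

    def __init__(self):
        # Outer map: row key -> (sorted list of column keys, key -> value dict).
        self._rows = {}

    def put(self, row_key, column_key, value):
        keys, values = self._rows.setdefault(row_key, ([], {}))
        if column_key not in values:
            bisect.insort(keys, column_key)  # keep column keys sorted
        values[column_key] = value

    def get(self, row_key, column_key):
        # Key lookup goes straight through the two maps.
        return self._rows[row_key][1][column_key]

    def scan(self, row_key, start, end):
        """Return (column_key, value) pairs with start <= column_key < end."""
        keys, values = self._rows.get(row_key, ([], {}))
        lo = bisect.bisect_left(keys, start)
        hi = bisect.bisect_left(keys, end)
        return [(k, values[k]) for k in keys[lo:hi]]


cf = ColumnFamily()
cf.put("user1", "email", "a@example.com")
cf.put("user1", "city", "San Jose")
cf.put("user1", "name", "Ann")
cf.put("user1", "like:item42", "")  # the column key itself holds the data

columns_c_to_m = cf.scan("user1", "c", "m")
```

Here `columns_c_to_m` contains the `city`, `email`, and `like:item42` columns in sorted order, while `name` falls outside the requested range.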
1 Refinement - Think of outer map as unsorted
Why?
• Row keys are sorted in natural order only if the order-preserving partitioner (OPP)
is used. OPP is not recommended!
1 How about super column?
Map<RowKey, SortedMap<SuperColumnKey, SortedMap<ColumnKey, ColumnValue>>>
Storing value in column name is perfectly ok
• 64KB is max for column key. Don't store long text fields, such
as item descriptions!
• 2 GB is max for column value. But limit the size to only a few
MBs as there is no streaming.
Use wide row for ordering, grouping and filtering
• Since column names are stored sorted, wide rows enable ordering of data and
hence efficient filtering.
• Group data queried together in a wide row to read back efficiently, in one
query.
• Wide rows are heavily used with composite columns to build custom indexes.
Example: Store time series event log data & retrieve them hourly.
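The time series example can be sketched as follows, with a sorted Python list standing in for one wide row. The helper names `log_event` and `events_in_hour` and the sample events are assumptions made for illustration. Because the column names (timestamps) are kept sorted, one hour of data is a contiguous slice.

```python
import bisect
from datetime import datetime, timedelta

# One wide row: each column is (timestamp, payload), kept sorted by timestamp.
row = []


def log_event(ts, payload):
    bisect.insort(row, (ts, payload))


def events_in_hour(hour_start):
    """Return the contiguous slice for [hour_start, hour_start + 1 hour)."""
    hour_end = hour_start + timedelta(hours=1)
    # A 1-tuple sorts before any 2-tuple that starts with the same timestamp.
    lo = bisect.bisect_left(row, (hour_start,))
    hi = bisect.bisect_left(row, (hour_end,))
    return row[lo:hi]


log_event(datetime(2013, 5, 1, 10, 40), "bid")
log_event(datetime(2013, 5, 1, 9, 59, 30), "login")
log_event(datetime(2013, 5, 1, 11, 0), "logout")
log_event(datetime(2013, 5, 1, 10, 5), "view")

ten_oclock_events = events_in_hour(datetime(2013, 5, 1, 10))
```

The slice for the 10:00 hour holds the `view` and `bid` events in time order, without touching the rest of the row.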
4 But, don’t go too wide!
Traffic:
All of the traffic related to one row is handled by only one
node/shard (by a single set of replicas, to be more precise).
Size:
Data for a single row must fit on disk within a single node in the
cluster.
Choose proper row key – It’s your “shard key”
Example:
Bad row key: “ddmmyyhh”
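A rough sketch of why an hour-only row key is a poor shard key: every event in the same hour lands in the same row, and therefore on the same set of replicas. The alternative key shown here, which appends the event type, is an assumption for illustration and not necessarily the key the author recommends; the function names and sample events are also hypothetical.

```python
from collections import Counter
from datetime import datetime

events = [
    (datetime(2013, 5, 1, 10, 5), "login"),
    (datetime(2013, 5, 1, 10, 7), "view"),
    (datetime(2013, 5, 1, 10, 20), "view"),
    (datetime(2013, 5, 1, 10, 45), "bid"),
]


def hour_key(ts):
    # The "ddmmyyhh" row key from the slide.
    return ts.strftime("%d%m%y%H")


def hour_and_type_key(ts, event_type):
    # Hypothetical alternative: add a second component to spread the load.
    return f"{hour_key(ts)}|{event_type}"


rows_by_hour = Counter(hour_key(ts) for ts, _ in events)
rows_by_hour_and_type = Counter(hour_and_type_key(ts, t) for ts, t in events)
```

With the hour-only key all four writes hit a single row; with the extra component they are spread over three rows.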
Make sure column key and row key are unique
Example:
• Timestamp alone as a column name can cause collisions
• Use TimeUUID to avoid collisions.
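The collision can be illustrated with plain Python dictionaries standing in for a row's columns. Python's `uuid.uuid1()` produces a time-based (version 1) UUID, used here as a stand-in for Cassandra's TimeUUID; the timestamp value and event strings are made up.

```python
import uuid

ts = 1367402400000  # both events arrive in the same millisecond (made-up value)

# Timestamp alone as the column name: the second write replaces the first.
by_timestamp = {}
by_timestamp[ts] = "user1 viewed item42"
by_timestamp[ts] = "user2 viewed item42"

# Time-based UUID as the column name: each event gets its own column.
by_timeuuid = {}
by_timeuuid[uuid.uuid1()] = "user1 viewed item42"
by_timeuuid[uuid.uuid1()] = "user2 viewed item42"
```

After these writes `by_timestamp` holds one event and `by_timeuuid` holds both.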
Define correct comparator & validator
Favor composite column over super column
Order of sub-columns in composite column matters
<state|city>
Ordered by State first and then by City. Cities will be grouped by state
physically on the disk.
<city|state>
The other way around, which you don’t want.
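A small sketch of the effect, using Python tuples as stand-ins for composite column names. The sample locations are made up.

```python
locations = [
    ("TX", "Austin"),
    ("CA", "San Jose"),
    ("TX", "Dallas"),
    ("CA", "Fresno"),
]

# <state|city>: sorted by state first, so cities of one state sit together.
state_city = sorted(locations)

# <city|state>: sorted by city first, so states end up interleaved.
city_state = sorted((city, state) for state, city in locations)
```

In `state_city` the two CA entries are adjacent, followed by the two TX entries. In `city_state` the order is Austin, Dallas, Fresno, San Jose, so a single state's cities are no longer contiguous.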
9 Order affects your queries
Example: User activities
Efficient to query data for a given activity type and time range.
Also efficient for activity type alone, but not for time range alone.
9 It’s like compound index!
Assume,
CF with composite column name as <subcolumn1 | subcolumn2 | subcolumn3>
Not all the sub-columns need to be present, but you can’t skip one in the middle.
A query on ‘subcolumn1|subcolumn2’ is fine, but not one on ‘subcolumn2’ alone.
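The compound-index behavior can be sketched with a sorted list of tuples. The column layout `<activity_type | timestamp>`, the helper names, and the integer timestamps are assumptions for illustration. A query that fixes the leading sub-column maps to one contiguous slice, while a query on the second sub-column alone has to scan every column.

```python
import bisect

# Column names in one user's row: <activity_type | timestamp>, stored sorted.
columns = sorted([
    ("bid", 100), ("bid", 250),
    ("like", 120), ("like", 300),
    ("view", 90), ("view", 260),
])


def slice_by_prefix(activity, t_start=None, t_end=None):
    """Contiguous slice: activity type, optionally narrowed by time range."""
    lo_key = (activity,) if t_start is None else (activity, t_start)
    hi_key = (activity, float("inf")) if t_end is None else (activity, t_end)
    lo = bisect.bisect_left(columns, lo_key)
    hi = bisect.bisect_right(columns, hi_key)
    return columns[lo:hi]


def scan_by_time_only(t_start, t_end):
    """No usable prefix: every column must be examined."""
    return [c for c in columns if t_start <= c[1] <= t_end]


likes_in_range = slice_by_prefix("like", 100, 200)
all_bids = slice_by_prefix("bid")
anything_in_range = scan_by_time_only(100, 260)
```

The first two queries touch only the matching slice. The third returns matches scattered across the row, which is why it does not scale with row width.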
10 Model column families around query patterns
But start your design with entities and relationships, if you can
• Not easy to tune or introduce new query patterns later by simply creating
indexes or building complex queries using join, group by, etc.
• Think how you can organize data into the nested sorted map to satisfy
your query requirements of fast lookup/ordering/grouping/filtering/aggregation/etc.
Identify the most frequent query patterns and isolate the less frequent.
Identify which queries are sensitive to latency and which are not.
11 De-normalize and duplicate for read performance
Example 1 “Likes” relationship between User & Item
Example 1
Option 1: Exact replica of relational model
How many queries does the current model need? Can that number grow if a user becomes
active or an item becomes hot?
Example 1
Option 4: Partially de-normalized entities
Example 1
Best Option for this use case – Option 3
Example 2 Semi-structured event log data
Collecting time series event log, and doing real-time aggregation & roll ups
Example 2
Example Cassandra model
12 Keep read-heavy data separate from write-heavy
Row cache caches the whole row. Be cautious before enabling for wide rows.
13 Isolate more frequent from the less frequent
14 Manual sharding can help compaction
15 Design such that operations are idempotent
Retrying after a write failure can yield unexpected results if the model isn’t update idempotent.
15 But may not be efficient
• Counting users requires reading all user ids (millions?) - Can’t scale.
• Can we live with an approximate count for this use case? - Yes.
• If needed, the counter value can be corrected asynchronously by counting the
user ids from the update-idempotent CF.
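The trade-off can be sketched with in-memory stand-ins: a plain integer for the counter and a set of user ids for the update-idempotent CF. The retry helper and all names here are hypothetical; it simulates a write that was applied on the server but whose acknowledgement was lost, so the client sends it again.

```python
like_counter = {"item42": 0}   # stand-in for a counter column
liked_by = {"item42": set()}   # stand-in for the update-idempotent CF


def increment_counter():
    like_counter["item42"] += 1


def add_user_id():
    liked_by["item42"].add("user1")


def write_with_retry(apply_write):
    """Simulate a lost acknowledgement followed by one client retry."""
    apply_write()  # first attempt: applied, but the client sees a timeout
    apply_write()  # retry: applied a second time


write_with_retry(increment_counter)  # not idempotent: counts the like twice
write_with_retry(add_user_id)        # idempotent: the set still has one user

counter_after_retry = like_counter["item42"]
users_after_retry = len(liked_by["item42"])

# Asynchronous correction: recount from the idempotent data.
like_counter["item42"] = len(liked_by["item42"])
```

The counter drifts to 2 after the retry while the set correctly holds one user, and the final line shows the kind of background correction described above.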
16 Keep column names short
For example:
Favor ‘fname’ over ‘firstname’, and ‘lname’ over ‘lastname’.
17 Favor built-in composite type over manual
18 Keep all data in a single CF of the same type
19 Don’t use the Counter Column for surrogate keys
20 Indexing is not an afterthought, anymore
20.1 Primary (or Row key) Index
20.2 Built-in Secondary Index
• It's an index on the column values, and not on the column keys.
• Column keys are always indexed & stored physically sorted.
20.2
Best used when
20.3
Best used when
20.3
Select Title from Item where Seller = 'sellerid1'
20.3
where Seller = 'sellerid1' order by Price
But,
- How to get old price in order to delete? Read before write?
- Any race condition? What if Consistency Level is eventual?
- Repair on read?
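The first question can be made concrete with a sketch of a price-ordered index for one seller. The layout `<price | item_id>`, the `set_price` helper, and the in-memory stand-ins are assumptions for illustration. The sketch is single-threaded and deliberately ignores the race-condition and consistency questions raised above.

```python
import bisect

# Index row for one seller: column names <price | item_id>, kept sorted.
seller_index = []
# Stand-in for the item data: item_id -> current price.
current_price = {}


def set_price(item_id, new_price):
    old_price = current_price.get(item_id)  # the read before the write
    if old_price is not None:
        # The stale index column can only be removed if the old price is known.
        seller_index.remove((old_price, item_id))
    bisect.insort(seller_index, (new_price, item_id))
    current_price[item_id] = new_price


set_price("item1", 30)
set_price("item2", 10)
set_price("item1", 5)  # price change: old column removed, new one inserted
```

After the price change the index reads `item1` at 5, then `item2` at 10, with no leftover column for the old price of 30.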
20.3
where Seller='sellerid1' and ListingDate > 10-11-2011 and ListingDate < 11-12-2011 Order by Price
Won’t work
Data won’t be ordered by ‘Price’ across dates.
What to do then?
Key Takeaways
Cassandra at eBay:
http://www.slideshare.net/jaykumarpatel/cassandra-at-ebay-13920376
Meet our operations:
Feng Qu
Principal DBA @ eBay
Cassandra Prod. operations expert
Baba Krishnankutty
Staff DBA @ eBay
Cassandra QA operations expert
Are you excited? Come Join Us!
Thank You
@pateljay3001