Kafka & Redis for Big Data Solutions
• Christopher Curtin
• Head of Technical Research
• @ChrisCurtin
About Me
25+ years in technology
Head of Technical Research at Silverpop, an IBM Company (14+ years at Silverpop)
Built a SaaS platform before the term 'SaaS' was being used
Prior to Silverpop: real-time control systems, factory automation and warehouse management
Always looking for technologies and algorithms to help with our challenges
About Silverpop
Provider of Internet Marketing Solutions
15+ years old, Acquired by IBM in May 2014
Multi-tenant Software as a Service
Thousands of clients with tens of billions of events per month
Agenda
Kafka
Redis
Bloom Filters
Batch is Bad
Data egress and ingress are very expensive
Thousands of customers mean thousands of hourly or daily export and import requests
Often the same exports are requested to feed different business systems
Prefer event pushes to subscribed endpoints
Apache Kafka
Apache Kafka is a distributed publish-subscribe messaging system. It is designed to support the following:
Persistent messaging with O(1) disk structures that provide constant time
performance even with many TB of stored messages.
High-throughput: even with very modest hardware Kafka can support hundreds
of thousands of messages per second.
Explicit support for partitioning messages over Kafka servers and distributing
consumption over a cluster of consumer machines while maintaining per-
partition ordering semantics.
Support for parallel data load into Hadoop.
Apache Kafka – Why?
Data Integration
Point to Point Integration (thanks to LinkedIn for slide)
What we'd really like (thanks to LinkedIn for slide)
Apache Kafka
Kafka changes the messaging paradigm
Kafka doesn't keep track of who consumed which message
Kafka keeps all messages until you tell it not to
A consumer can ask for the same message over and over again
Consumption Management
Kafka leaves the management of what was consumed to the
consumer
Each message has a unique identifier (within a topic and partition)
Consumers ask for a specific message (by offset) or simply the next one
All messages are written to disk, so asking for an old message means
finding it on disk and starting to stream from there
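As an illustration of consumer-managed offsets (not from the original slides), here is a minimal sketch using the current Kafka Java consumer API, which may postdate the talk; the broker address, topic name, partition, and offset are made up. The consumer assigns itself a partition, seeks to an explicit offset, and streams from there, which is exactly the "ask for an old message" pattern above.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        // No consumer group management: this consumer tracks its own position.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("events", 0); // hypothetical topic
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, 42_000L);                          // ask for an old, specific offset

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
            }
        }
    }
}
```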
What is a commit log? (thanks to LinkedIn for slide)
Big Data Use Cases
Data Pipeline
Buffer between event driven systems and batch
Batch Load
Parallel Streaming
Event Replay
Data Pipeline (thanks to LinkedIn for slide)
Buffer
Event generating systems MAY give you a small downtime buffer
before they start losing events
Most will not.
How do you not lose these events when you need downtime?
Batch Load
Sometimes you only need to load data once an hour or once a day
Have a task wake up and stream the events since the last run to a file, mark where it left off, and exit
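A minimal sketch of the "mark where it left off" step (illustrative only; the checkpoint file, path, and format are assumptions): the hourly or daily task reads its last position from a small file, consumes from there, and records the new position before exiting.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical checkpoint helper for the hourly/daily batch job:
// read the last consumed offset, stream everything after it to a file,
// then record the new position and exit.
public class OffsetCheckpoint {
    private final Path file;

    public OffsetCheckpoint(String path) {
        this.file = Paths.get(path);
    }

    public long lastOffset() throws IOException {
        if (!Files.exists(file)) {
            return 0L; // first run: start from the beginning of the topic
        }
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return Long.parseLong(text.trim());
    }

    public void save(long nextOffset) throws IOException {
        // Mark where this run stopped so the next run picks up from here.
        Files.write(file, Long.toString(nextOffset).getBytes(StandardCharsets.UTF_8));
    }
}
```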
Parallel Streaming
A lot of use cases require different systems to process the same raw
data
Spark, Storm, InfoSphere Streams, Spring XD …
Write to a different Kafka Topic and Partition
Batch load into HDFS hourly
Batch load into a traditional data warehouse or database daily
Push to external systems daily
Example: abandoned shopping carts
Event Replay
What if you could take 'production' data and run it through a
debugger?
Or through a non-optimized, 'SPEW' mode logging to see what it
does?
Or through a new version of your software that has /dev/null sinks
Note: developers HAVE NOT made a local copy of the data
Though with a sanitized copy, QA 'load testing' is simplified
What about Flume etc?
Very similar use cases, very different implementation
Kafka keeps the events around
Flume can do much more (almost a CEP system)
Kafka feeding Flume is a reasonable combination
Big Data Egress
Big data applications often produce lots of data
Nightly batch jobs
Spark Streaming alerts
Write to Kafka
The same benefits we get when feeding big data apply to consumers of our output
Or ...
What about HBase or other 'Hadoop' answers for egress?
Perfect for many cases
Real-world concerns though:
Is the YARN cluster five 9s?
Is it cost-effective to build a fast-response HBase, HDFS, etc. for these needs?
Will the security teams allow real-time access to YARN clusters from a web-tier business application?
Redis – What is it?
From redis.io:
"Redis is an open source, BSD licensed, advanced key-value cache and store.
It is often referred to as a data structure server since keys can contain strings,
hashes, lists, sets, sorted sets, bitmaps and hyperloglogs."
Features
• Unlike typical key-value stores, you can send commands to edit the value on the server, instead of reading it back to the client, updating it, and pushing it back
• Pub/sub
• TTL on keys
• Clustering and automatic fail-over
• Lua scripting
• Client libraries for just about any language you can think of
So why did we start looking at NoSQL?
“For the cost of an Oracle Enterprise license I can give you 64 cores
and 3 TB of memory”
Redis Basics
In-memory-only key-value store
Single-threaded. Yes, single-threaded
No paging, no reading from disk
CS 101 data structures and operations
Tens of millions of keys isn't a big deal
Available RAM defines how big the store can get
Hashes
- a collection of key-value pairs stored under a single name
- useful for grouping related data under a common name
- values can only be strings or numbers; no hash of lists
http://redis.io/commands/hget
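A minimal sketch of hash commands using the Jedis client (the key, field names, and values are made up, not from the slides): the value is edited on the server rather than being read back, modified on the client, and rewritten.

```java
import redis.clients.jedis.Jedis;

public class HashExample {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // One hash per user, fields edited in place on the server.
            jedis.hset("user:1001", "lastPage", "/pricing");
            jedis.hset("user:1001", "recommendation", "sku-4421");
            jedis.hincrBy("user:1001", "visits", 1); // numeric fields can be incremented server-side

            String rec = jedis.hget("user:1001", "recommendation");
            System.out.println(rec);
        }
    }
}
```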
Web-time access
Write the output of a Spark job to Redis so it can be accessed in web-
time
Publish the results of recommendation engines to Redis in a hash keyed by user or current page
Using a non-persistent storage model, it is very low cost to load millions of matches into memory
Something goes wrong? Load it again from the big data outputs
New information for one user or page? Replace the value for that one only
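One way this publishing step might look (a sketch under assumed names, not the production code): a Spark job writes its (user, recommendation) pairs straight into per-user Redis hashes, opening one connection per partition rather than per record.

```java
import org.apache.spark.api.java.JavaPairRDD;
import redis.clients.jedis.Jedis;
import scala.Tuple2;

public class PublishRecommendations {
    // Hypothetical: (userId -> recommended item) pairs produced by a Spark job.
    public static void publish(JavaPairRDD<String, String> recs) {
        recs.foreachPartition(iter -> {
            // One Redis connection per partition, reused for every record in it.
            try (Jedis jedis = new Jedis("redis-host", 6379)) { // assumed host
                while (iter.hasNext()) {
                    Tuple2<String, String> rec = iter.next();
                    jedis.hset("recs:" + rec._1(), "item", rec._2()); // hash keyed by user
                }
            }
        });
    }
}
```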
Spark RDDs
Over time Spark can generate a lot of RDDs.
Where are they?
What is in them?
Some people use a naming convention, but that gets problematic
quickly
Instead, create a Redis Set per 'key' you will want to find the RDDs by later: date, source system, step in the algorithm on a date, etc.
Since a Redis Set holds only references to the RDDs, you can easily create lots of Sets pointing to the same data (almost like an index)
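A sketch of that idea (the paths and key names are made up): every time a job saves an RDD, add its path to one Redis Set per attribute you may want to search by later.

```java
import java.util.Set;
import redis.clients.jedis.Jedis;

public class RddCatalog {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Index the saved RDD's path under every "key" we may search by later.
            String path = "hdfs://nn/warehouse/carts/2015-03-17/step3"; // hypothetical path
            jedis.sadd("rdd:date:2015-03-17", path);
            jedis.sadd("rdd:source:webstore", path);
            jedis.sadd("rdd:step:abandoned-cart-scoring", path);

            // Later: which saved RDDs came from the webstore feed?
            Set<String> paths = jedis.smembers("rdd:source:webstore");
            paths.forEach(System.out::println);
        }
    }
}
```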
Bloom Filters
From Wikipedia (don't tell my kid's teacher!)
"A Bloom filter is a space-efficient probabilistic data structure,
conceived by Burton Howard Bloom in 1970, that is used to test
whether an element is a member of a set. False positive matches are
possible, but false negatives are not, thus a Bloom filter has a 100%
recall rate"
Hashing
Apply 'x' hash functions to the key to be stored/queried
Each function returns a bit to set in the bitset
Mathematical formulas determine how big to make the bitset and how many functions to use for your acceptable error level
http://hur.st/bloomfilter?n=4&p=1.0E-20
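The calculator at that link implements the standard sizing formulas; here is a small sketch of the same math (the n and p values below are just examples): m = -n·ln(p)/(ln 2)² bits and k = (m/n)·ln 2 hash functions.

```java
public class BloomSizing {
    public static void main(String[] args) {
        long n = 10_000_000L; // expected number of keys (example value)
        double p = 0.001;     // acceptable false-positive rate (example value)

        double ln2 = Math.log(2);
        long m = (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2)); // bits in the bitset
        int k = (int) Math.round((double) m / n * ln2);            // number of hash functions

        System.out.printf("m = %,d bits (%.1f MB), k = %d%n",
                m, m / 8.0 / 1024 / 1024, k);
    }
}
```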
Example
False Positives
Perfect hash functions aren't worth the cost to develop
Sometimes all of a key's bits have already been set by other keys
Make sure you understand the business impact of a false positive
Remember, never a false negative
Why were we interested in Bloom Filters?
Found a lot of places where we went to the database only to find the data didn't exist
Found lots of places where we want to know if a user DIDN'T do something
Persistent Bloom Filters
We needed persistent Bloom Filters for lots of user stories
Found Orestes-Bloomfilter on GitHub, which uses Redis as a store, and enhanced it
Added population filters
Fixed a few bugs
Benefits
Filters are stored in Redis
• Only SETBIT/GETBIT calls to the server
Reads and updates of the filter from a set of application servers
Persistence has a cost, but a fraction of the RDBMS cost
Can load a Bloom filter created offline and begin using it
Remember “For the cost of an Oracle License”
Thousands of filters
Dozens of Redis instances
TTL on a Redis key makes cleanup of old filters trivial
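To make the "only SETBIT/GETBIT calls to the server" point concrete, here is a toy Redis-backed Bloom filter. It is an illustration of the idea, not the Orestes-Bloomfilter API: the bitset lives in one Redis key, so every application server reads and updates the same filter, and a TTL on that key cleans it up.

```java
import java.nio.charset.StandardCharsets;
import redis.clients.jedis.Jedis;

// Toy Redis-backed Bloom filter: the bitset is a single Redis string key,
// shared by every application server. Not the Orestes-Bloomfilter API.
public class RedisBloom {
    private final Jedis jedis;
    private final String key; // Redis key holding the bitset
    private final long m;     // bits in the filter
    private final int k;      // number of hash functions

    public RedisBloom(Jedis jedis, String key, long m, int k) {
        this.jedis = jedis; this.key = key; this.m = m; this.k = k;
    }

    public void add(String value) {
        for (long bit : bits(value)) jedis.setbit(key, bit, true);   // SETBIT
    }

    public boolean mightContain(String value) {
        for (long bit : bits(value)) {
            if (!jedis.getbit(key, bit)) return false;               // GETBIT: a clear bit means a definite "no"
        }
        return true;                                                 // all bits set: a possible "yes"
    }

    // Derive k bit positions from two base hashes (double hashing).
    private long[] bits(String value) {
        byte[] data = value.getBytes(StandardCharsets.UTF_8);
        long h1 = 0xcbf29ce484222325L;                 // FNV-1a over the key bytes
        for (byte b : data) {
            h1 = (h1 ^ (b & 0xff)) * 0x100000001b3L;
        }
        long h2 = Long.rotateLeft(h1, 31) | 1;         // second hash, forced odd
        long[] out = new long[k];
        for (int i = 0; i < k; i++) {
            out[i] = Math.floorMod(h1 + i * h2, m);
        }
        return out;
    }
}
```

A real deployment would use the Orestes-Bloomfilter library from the references rather than hand-rolling this, but the Redis traffic is the same handful of bit operations, and `EXPIRE` on the filter's key gives the trivial cleanup mentioned above.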
Use Cases
White List/Black List
Nightly rebuilds, real-time updates as CEP finds something
Ad suggestions
Does the system already know about this visitor?
Maybe? More expensive processing. No? Defaults
Content Recommendations
Batch identification hourly/nightly
Real-time updates as 'hot' pages/content change the recommendation
Use Case – Cardinality Estimation
Client-side joins are really difficult.
Hadoop, Spark, MongoDB – how do you know which side to 'drive' from?
We created a Population Bloom Filter that counts unique occurrences of a key using a Bloom Filter. Build a filter per 'source' file as the data is being collected (Kafka batches, Spark RDD saves, etc.)
Now query the filter to see if the keys you are looking for are (possibly) in the data. Then get the counts and see which side to drive from.
Not ideal; a low population may not mean it is the best driver, but it can be
Use Case – A/B Testing
Use different algorithms and produce different result sets
Load into Redis under different keys (remember, 10 million keys is no big deal)
Have the application tier A/B test the algorithm results
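A sketch of how the application tier might pick a variant (the key layout and field name are made up): each algorithm's results are loaded under its own key prefix, and a stable hash of the user id decides which result set a given user sees.

```java
import redis.clients.jedis.Jedis;

public class AbLookup {
    // Hypothetical layout: algorithm A's results live under "recs:A:<userId>",
    // algorithm B's under "recs:B:<userId>", each as a hash with an "item" field.
    public static String recommendationFor(Jedis jedis, String userId) {
        String variant = (userId.hashCode() & 1) == 0 ? "A" : "B"; // stable 50/50 split
        return jedis.hget("recs:" + variant + ":" + userId, "item");
    }
}
```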
Conclusion
Batch exports from SaaS (or any high-volume system) are bad
Traditional messaging systems are being stressed by the volume of data
Redis is a very fast, very simple, and very powerful name-value store, a "data structure server"
Bloom filters have lots of applications when you want to quickly look up whether one of millions of 'things' happened
Redis-backed Bloom filters make updatable Bloom filters trivial to use
References
redis.io
kafka.apache.org
https://github.com/Baqend/Orestes-Bloomfilter
http://www.slideshare.net/chriscurtin
@ChrisCurtin on Twitter
Questions?