Putting Apache Kafka To Use: Building A Real-Time Data Platform For Event Streams
Relational database changes
[Diagram: Apps and Services, OLTP Queries, Relational Databases, Poll For Changes, ODS, Hadoop, Transforms, Relational Data Warehouse, Caches & Derived Stores, NoSQL Key-value Store, ETL Load]
User events
[Diagram: three groups of Apps and Services, HTTP, Log Aggregation, NFS, rsync, Hadoop, Relational Data Warehouse, Transform]
Application Logs
Splunk
Messaging
[Diagram: five Apps and three Brokers]
Monitoring
This is a giant mess!
[Diagram: three groups of Apps and Services, OLTP Queries, HTTP, ActiveMQ, Monitoring, Relational Databases, Apps, Log Aggregation, Splunk, Key-value Store, Data Guard, NFS, CSV Dump, Cache, rsync, Transforms, Transform & Load, Relational Data Warehouse, Caches & Derived Stores]
Impossible ideas
• Publish data from Hadoop to a search index
• Run a SQL query to find the biggest latency bottleneck
• Run a SQL query to find common error patterns
• Low-latency monitoring of database changes or user activity
• Incorporate popularity in real-time display and relevance algorithms
• Products that incorporate user activity
An infrastructure solution?
Idea: Stream Data Platform
[Diagram: Search, Apps, Monitoring, RDBMS, NoSQL, Stream Processing, and Real-time Analytics around a central "Stream Data Platform: ?", alongside "HADOOP: Offline Data" with Impala, Hive, DWH, Map-Reduce, and Spark]
Synchronous Req/Response: 0 - 100s ms
Near real time: > 100s ms
Offline batch: > 1 hour
First Attempt: Messaging systems!
Problems
• Throughput
• Batch systems
• Persistence
• Stream Processing
• Ordering guarantees
• Partitioning
Second Attempt: Build Kafka!
What does it do?
[Diagram: Kafka Cluster holding a log with offsets 0 to 12, ordered from Old to New, with Writes at the New end]
Logs & Publish-Subscribe Messaging
[Diagram: a Source System writes to a Log with offsets 0 to 12; Destination System A and Destination System B each read from the Log]
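The log-as-messaging idea on this slide can be sketched in a few lines. This is only an in-memory illustration, not Kafka's API: the `Log` and `Reader` classes and their methods are hypothetical names introduced here. The point it tries to show is that reading does not remove records, so each destination system keeps its own offset and moves through the log at its own pace.

```python
class Log:
    """Append-only sequence of records; each record gets the next offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Write a record to the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records starting at offset, without removing them."""
        return self._records[offset:offset + max_records]


class Reader:
    """A destination system that remembers its own position in the log."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self, max_records=10):
        """Fetch the next batch and advance this reader's offset only."""
        batch = self._log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = Log()
for i in range(13):  # offsets 0..12, matching the diagram
    log.append(f"event-{i}")

reader_a = Reader(log)  # stands in for Destination System A
reader_b = Reader(log)  # stands in for Destination System B
first_a = reader_a.poll(max_records=10)
first_b = reader_b.poll(max_records=3)
```

After this runs, the two readers sit at different offsets in the same log, and neither has affected what the other will see next.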
A Kafka Topic
[Diagram: Partition 0 with offsets 0 to 12, Partition 1 with offsets 0 to 9, and Partition 2 with offsets 0 to 12, each ordered from Old to New, with Writes at the New end]
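A topic split into partitions can be sketched as a list of independent logs, with a record's key deciding which partition it lands in. This is a toy model under stated assumptions: the `Topic` class is hypothetical, and CRC32 is used only as a convenient stable hash for illustration, not as Kafka's actual partitioner.

```python
import zlib


class Topic:
    """A topic modeled as a fixed number of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Illustrative stable hash; an assumption, not Kafka's partitioner.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def send(self, key, value):
        """Append to the partition chosen by key; return (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1


topic = Topic(num_partitions=3)
placements = [topic.send(f"user-{i % 4}", f"click-{i}") for i in range(12)]
```

Because the partition depends only on the key, all records for one key share a partition and keep their relative order there. Order across different partitions is not defined in this model.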
Replication
[Diagram: Server 1, Server 2, Server 3]
[Diagram: partitions P0, P3, P1, and P2 across Server 1 and Server 2; consumers C1 to C6]
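The partition and consumer labels in the diagram (P0 to P3, C1 to C6) suggest partitions being divided among consumers. A minimal sketch of that idea follows, assuming the consumers form two groups (C1 and C2, then C3 to C6) and that partitions are handed out round-robin. Both the grouping and the `assign` helper are assumptions made for illustration, not something the slide states.

```python
def assign(partitions, consumers):
    """Round-robin: each partition is owned by exactly one consumer in the group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


partitions = ["P0", "P1", "P2", "P3"]
group_a = assign(partitions, ["C1", "C2"])              # hypothetical group A
group_b = assign(partitions, ["C3", "C4", "C5", "C6"])  # hypothetical group B
```

In this model every group sees every partition, so each group receives the full stream, while inside a group the work is split so that no partition has two owners.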
Scalability of a filesystem
◦ Hundreds of MB/sec/server throughput
◦ Many TB per server
Guarantees of a database
◦ Messages strictly ordered
◦ All data persistent
Distributed by default
◦ Replication
◦ Partitioning model
Producers, Consumers, and Brokers all fault tolerant and horizontally scalable
Stream Data Platform
[Diagram: Search, Apps, Monitoring, RDBMS, NoSQL, Stream Processing, and Real-time Analytics around a central "KAFKA: Stream Data Platform", alongside "HADOOP: Offline Data" with Impala, Hive, DWH, Map-Reduce, and Spark]
Synchronous Req/Response: 0 - 100s ms
Near real time: > 100s ms
Offline batch: > 1 hour
Batch Data => Batch Processing
Stream processing is a generalization of batch processing and request/response processing
Request/Response processing: One input => One output
Batch processing: All inputs => All outputs
Stream Processing: Some inputs => some outputs (you choose how much “some” is)
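One way to see the generalization is to run the same computation over chunks of different sizes. In this sketch the `process` function and the sample `events` list are made up for illustration. A chunk size of one behaves like request/response, a chunk size equal to the whole input behaves like batch, and anything in between is a choice of how much "some" is.

```python
from itertools import islice


def process(records, batch_size):
    """Apply the same computation (a sum) to successive chunks of the input."""
    it = iter(records)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield sum(chunk)


events = [3, 1, 4, 1, 5, 9, 2, 6]

request_response = list(process(events, 1))   # one input => one output
batch = list(process(events, len(events)))    # all inputs => all outputs
stream = list(process(events, 3))             # some inputs => some outputs
```

Here `request_response` has one result per event, `batch` has a single result for the whole input, and `stream` has one result per group of three events (the last group is shorter).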
Stream Processing a la carte
[Diagram: Input Kafka Topic, Output Kafka Topic, Hadoop, Live Data Store]
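Without a framework, a stream job reduces to a loop that reads from an input topic, transforms each record, and appends results to an output topic. The sketch below models both topics as plain Python lists. The `run_job` function, the `only_slow` transform, and the sample page-view records are hypothetical stand-ins, not a real client API.

```python
def run_job(input_topic, output_topic, start_offset, transform):
    """Read from start_offset to the end of the input, writing transformed records."""
    offset = start_offset
    while offset < len(input_topic):
        result = transform(input_topic[offset])
        if result is not None:  # a transform may drop a record by returning None
            output_topic.append(result)
        offset += 1
    return offset  # the position to resume from on the next run


def only_slow(event):
    """Keep page views slower than 500 ms (an arbitrary threshold for the example)."""
    return event if event["ms"] > 500 else None


page_views = [
    {"user": "a", "ms": 120},
    {"user": "b", "ms": 950},
    {"user": "a", "ms": 40},
]
slow_views = []
next_offset = run_job(page_views, slow_views, 0, only_slow)
```

Returning the next offset lets the job be rerun later from where it stopped, which is the piece of state a real consumer would need to store somewhere durable.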
Stream Processing with Frameworks
[Diagram: an equation of the form "+ = Stream Processing"; the other terms are not recoverable]
Unix Pipes, Modernized
cat /usr/share/dict/words | wc -l
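The pipe analogy carries over to code as a chain of small stages, each consuming a stream of lines and passing a stream on. The sketch below mirrors `cat ... | wc -l` with Python generators over an in-memory word list, so it does not depend on a dictionary file being present. The `grep` stage is an extra added here for illustration and is not part of the command above.

```python
def cat(lines):
    """Source stage: emit each line unchanged."""
    for line in lines:
        yield line


def grep(pattern, lines):
    """Filter stage: pass through only lines containing the pattern."""
    for line in lines:
        if pattern in line:
            yield line


def wc_l(lines):
    """Sink stage: count the lines that reach the end of the pipeline."""
    return sum(1 for _ in lines)


words = ["apple", "banana", "cherry", "apricot"]

total = wc_l(cat(words))               # like: cat words | wc -l
with_ap = wc_l(grep("ap", cat(words))) # like: cat words | grep ap | wc -l
```

Each stage pulls one line at a time from the stage before it, so nothing needs to hold the whole input in memory, which is the property that makes the pipe model a natural fit for unbounded streams.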
On Schemas
[Diagram: Oracle, Social Graph, Key-Value Storage, Search, Newsfeed, OLAP, Apps, Log Search, Monitoring, Security & Fraud, Real-time Analytics, Samza, Hadoop, and Teradata around Kafka]
At LinkedIn
• Everything in the company is a real-time stream
• > 800 billion messages written per day
• > 2.9 trillion messages read per day
• ~ 1 PB of stream data
• Tens of thousands of producer processes
• Backbone for data stores
• Search
• Social Graph
• Newsfeed
• Primary storage (in progress)
• Basis for stream processing
Elsewhere
Why this is the future