Spark Streaming
Ahmad Alkilani
DATA ARCHITECT
@akizl
Streaming Ingest with Kafka and Spark Streaming
§ Introduction to Kafka
§ Architecture
§ Producers and Consumers
The Kafka Broker
[Diagram: a broker hosting topic WebLogs with Partitions 2 and RF 2. The producer first gets topic metadata from the cluster, then writes to each partition's leader (P2: Leader on this broker); the leader replicates the write to its follower (P1: Replica) and acks flow back up to the producer.]
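The metadata-then-write flow in the diagram maps directly onto the client API. A minimal producer sketch, assuming the kafka-clients Java API used from Scala; the broker address, key, and payload are illustrative:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")  // illustrative broker address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("acks", "all")  // wait for leader and replicas to acknowledge, as in the diagram

val producer = new KafkaProducer[String, String](props)
// The client fetches topic metadata on first send, then routes each record to the partition leader
producer.send(new ProducerRecord[String, String]("WebLogs", "some-key", "some log line"))
producer.close()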
Partition Assignment & Consumers
Kafka Consumers
[Diagram: topic weblogs with Partitions 3 and RF 1, the leader replicas (LR) spread across Broker 1, Broker 2, and Broker 3. Consumers in Consumer Group A split the partitions among themselves (e.g., C1 reads P1 and P3 while C2 reads P2); Consumer Group B independently receives every partition across its own consumers.]
[Diagram: ZooKeeper coordinates partition assignment. Partition 1, Partition 2, and Partition 3 are each assigned to exactly one consumer (C1, C2, C3) within Consumer Group A, and likewise within Consumer Group B.]
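Partition assignment is driven entirely by the consumer's group: consumers sharing a group.id divide the topic's partitions between them. A minimal sketch, assuming the newer kafka-clients consumer API rather than the ZooKeeper-based high-level consumer the diagram depicts; broker address and group name are illustrative:

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "A")  // all consumers with group.id "A" share the weblogs partitions
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(java.util.Arrays.asList("weblogs"))
while (true) {
  // poll returns only records from the partitions assigned to this consumer
  for (record <- consumer.poll(1000).asScala)
    println(s"p=${record.partition} offset=${record.offset} value=${record.value}")
}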
Messaging Models
Publish-Subscribe
[Diagram: topic weblogs, Partitions 3, RF 1. Consumer C1 in group A (CG: A) and consumer C2 in group B (CG: B) each receive the full message stream; distinct consumer groups give publish-subscribe semantics.]
Messaging Models
Queue Semantics
[Diagram: topic weblogs, Partitions 3, RF 1. Consumers C1, C2, and C3 all share one consumer group (CG: MyQueue), so each message is delivered to exactly one of them; a single shared group gives queue semantics.]
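In other words, the same topic serves both models and the only knob is group.id. A hedged illustration, reusing the group names from the diagrams:

// Publish-subscribe: each logical subscriber uses its own group
// consumer C1: props.put("group.id", "A")   -> C1 sees every message
// consumer C2: props.put("group.id", "B")   -> C2 also sees every message
// Queue semantics: competing workers share a single group
// consumers C1, C2, C3: props.put("group.id", "MyQueue")  -> each message goes to exactly one consumer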
Receiver Model
// Two socket receivers, unioned into a single DStream (ssc is a StreamingContext)
val lines1 = ssc.socketTextStream("localhost", 9999)
val lines2 = ssc.socketTextStream("localhost", 9998)
val linesUnion = lines1.union(lines2)
val words = linesUnion.flatMap(_.split(" "))
[Diagram: each receiver runs as a long-running task on a Spark executor; incoming stream data is stored in the executor's cache (block manager) and then processed by tasks across the executors.]
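For completeness, a minimal runnable wrapper around the snippet above; the app name, master, and batch interval are illustrative. Note that each socket receiver occupies an executor core, so the local master needs more cores than receivers:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("ReceiverModel").setMaster("local[4]")  // 2 receivers + cores for tasks
val ssc = new StreamingContext(conf, Seconds(2))
val lines1 = ssc.socketTextStream("localhost", 9999)
val lines2 = ssc.socketTextStream("localhost", 9998)
val words = lines1.union(lines2).flatMap(_.split(" "))
words.print()
ssc.start()
ssc.awaitTermination()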
Spark Streaming Kafka Integration
§ Receiver Approach: high-level Kafka consumer APIs
§ Direct Approach: simple (low-level) Kafka API
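A minimal sketch of the direct approach, assuming the spark-streaming-kafka 0.8 artifact; the broker list and topic set are illustrative:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topics = Set("weblogs")
// No receivers: each batch reads an offset range per Kafka partition directly via the simple API
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)
val values = directStream.map(_._2)  // keep only the message values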
Receiver-based Approach
Option 2: Create a Kafka stream per topic-partition

import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

// One receiver per stream; union them so downstream code sees a single DStream
val receiverCount = 3
val kafkaStreams = (1 to receiverCount).map { _ =>
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_AND_DISK)
}
val kafkaStream = ssc.union(kafkaStreams)
  .map(_._2)  // keep only the message value, dropping the key
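The resiliency module below covers this in depth, but as a preview: with receivers, received blocks live only in executor memory unless the write-ahead log is enabled. A hedged sketch using Spark's configuration key for this; the checkpoint path is illustrative:

conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")  // persist received blocks before acknowledging
ssc.checkpoint("hdfs://namenode:8020/checkpoints")  // the WAL requires a checkpoint directory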
Offsets?
With the direct stream, each batch can be written to HDFS under a path like ../KafkaTopic/KafkaPartition, storing the data alongside its fromOffset and untilOffset.
For each RDD, offsetRanges(partitionNumber) exposes:
.topic
.partition
.fromOffset
.untilOffset
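The offset-range fields listed above come from the HasOffsetRanges interface implemented by the direct stream's RDDs. A minimal sketch following the pattern in the Spark Streaming + Kafka integration docs; the ranges must be captured in the first transformation, before any shuffle, and directStream is the raw stream from the earlier sketch:

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

var offsetRanges = Array.empty[OffsetRange]
directStream.transform { rdd =>
  // Only the direct input stream's RDDs carry offset ranges
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.foreachRDD { rdd =>
  for (o <- offsetRanges)
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
}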
Streaming Resiliency
§ Spark Streaming
o Receiver-based
o Direct Stream
§ Resiliency
§ Direct Stream Offsets
§ Recover from Upgrades

Receiver-based Approach