5 Spark Kafka Cassandra Slides PDF

Apache Kafka is a distributed publish-subscribe messaging system. The document discusses using Kafka with Spark Streaming to ingest streaming data. It covers Kafka concepts like brokers, producers, consumers and partitions. It then summarizes the receiver-based and direct stream approaches in Spark Streaming for integrating with Kafka, and how to build resiliency through checkpoints, offsets and recovery from failures or upgrades. Finally, it mentions saving streaming data from Kafka to HDFS and integrating the batch and streaming layers.


Streaming Ingest with Kafka and Spark Streaming

Ahmad Alkilani
Data Architect
@akizl

Streaming Ingest with Kafka and Spark Streaming
§ Introduction to Kafka
  § Architecture
  § Producers and Consumers
§ Create a Kafka Producer
§ Spark Streaming Integration with Kafka
§ Integrate Batch and Streaming
Introduction to Kafka

A distributed publish-subscribe messaging system.

[Diagram: producers push messages to brokers; consumers pull messages from the brokers at their own pace, each tracking its position by offset (the diagram shows offset markers 65, 320 and 951).]
The Kafka Broker

Topic: WebLogs, Partitions: 2, Replication Factor (RF): 2

[Diagram: the producer first gets the topic meta-data, and its partitioner routes each message to a partition. One broker holds P1 (leader) and P2 (replica); another holds P1 (replica) and P2 (leader). The partition leader acknowledges (ack) each write back to the producer.]
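The agenda item "Create a Kafka Producer" is shown in the course demo; as a hedged sketch only (not the course's code), a minimal Scala producer using the org.apache.kafka.clients.producer API could look like the following. The object name, message payload and broker address are illustrative assumptions; the topic "weblogs" and acks-from-the-leader behaviour come from the slides.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object WebLogProducer extends App {
  // Assumed broker address; adjust for your environment.
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("acks", "1") // wait for the partition leader to acknowledge the write

  val producer = new KafkaProducer[String, String](props)

  // The key (if any) feeds the partitioner to pick a partition;
  // a null key spreads messages across partitions.
  producer.send(new ProducerRecord[String, String]("weblogs", null, "GET /index.html 200"))

  producer.close()
}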
Partition Assignment & Consumers

[Diagram: topic weblogs (3 partitions, RF 1) and topic telemetry (3 partitions, RF 2) are spread across Broker 1, Broker 2 and Broker 3, with one leader replica (LR) per partition. Consumer Group A (C1, C2) and Consumer Group B (C1-C4) each get their own assignment of partitions to consumers, e.g. C1 -> P1,P3 and C2 -> P2 within a group.]

Kafka Consumers

[Diagram: consumers coordinate partition ownership through Zookeeper; Partition 1, Partition 2 and Partition 3 are divided among C1, C2, C3 of Consumer Group A and, independently, among C1, C2, C3 of Consumer Group B.]
Messaging Models

Publish-Subscribe
[Diagram: topic weblogs (3 partitions, RF 1) spread over Broker 1, Broker 2 and Broker 3 with P1, P2 and P3 leader replicas (LR). Consumers in different consumer groups (C1 in CG: A, C2 in CG: B) each receive every message, giving publish-subscribe semantics.]

Queue Semantics
[Diagram: the same topic weblogs consumed by C1, C2 and C3, all in a single consumer group (CG: MyQueue); the partitions are split among the group's members, so each message is processed by only one consumer, giving queue semantics.]
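As a hedged illustration of the partition-assignment and messaging-model slides above (not code from the course), the choice of group.id is what switches between the two models. A minimal sketch using the org.apache.kafka.clients.consumer API; the broker address, object name and the group name "CG-A" are assumptions, while "weblogs" and "MyQueue" come from the slides.

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

object ConsumerGroupsExample extends App {

  // Consumers that share a group.id split the topic's partitions among
  // themselves (queue semantics); consumers in different groups each see
  // every message (publish-subscribe semantics).
  def newConsumer(groupId: String): KafkaConsumer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker address
    props.put("group.id", groupId)
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(List("weblogs").asJava)
    consumer
  }

  // Two members of "MyQueue" share the weblogs partitions between them;
  // "CG-A" is a separate group, so it receives its own full copy of the stream.
  val queueMember1 = newConsumer("MyQueue")
  val queueMember2 = newConsumer("MyQueue")
  val subscriber   = newConsumer("CG-A")
  // (polling loops omitted)
}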
Receiver Model

val lines1 = ssc.socketTextStream("localhost", 9999)
val lines2 = ssc.socketTextStream("localhost", 9998)

val linesUnion = lines1.union(lines2)
val words = linesUnion.flatMap(_.split(" "))

[Diagram: the input data stream feeds receivers running as long-lived tasks inside the Spark executors; received data is held in the executors' cache and processed by the other tasks.]
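The snippet above assumes an existing StreamingContext named ssc. A minimal, self-contained sketch of the surrounding boilerplate (my assumption, not shown on the slide; the master, app name and batch interval are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReceiverModelExample extends App {
  // Local master and 10-second batches are illustrative choices.
  val conf = new SparkConf().setMaster("local[4]").setAppName("receiver-model")
  val ssc  = new StreamingContext(conf, Seconds(10))

  // Each socketTextStream creates a receiver that occupies one executor core.
  val lines1 = ssc.socketTextStream("localhost", 9999)
  val lines2 = ssc.socketTextStream("localhost", 9998)

  val words = lines1.union(lines2).flatMap(_.split(" "))
  words.print()

  ssc.start()
  ssc.awaitTermination()
}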
Spark Streaming Kafka Integration

Receiver Approach (high-level Kafka consumer API)
• Uses receivers to receive data
• Data stored in Spark executors
• Zero-data loss requires the write-ahead log
• Allows for at-least-once semantics

Direct Approach (simple Kafka API)
• No receivers; queries Kafka each batch for the offset range
• Simplifies parallelism at the expense of latency
• Zero-data loss without a write-ahead log; relies on Kafka’s retention to replay messages. Better at processing larger datasets
• Allows for exactly-once semantics
Receiver-based Approach
Option 1: Create a single Kafka stream

val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_AND_DISK)
  .map(_._2)

[Diagram: a single receiver task (R) in one Spark executor pulls data from Kafka; the remaining tasks across the executors process the cached data.]
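The snippet assumes kafkaParams and topic are defined elsewhere. A hedged sketch of what they might contain for the receiver-based (high-level consumer) approach; the ZooKeeper address is an assumption, while the group name and topic follow the values used elsewhere in the slides:

import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

// The receiver-based approach uses the high-level consumer, which
// coordinates through ZooKeeper rather than a broker list.
val kafkaParams = Map(
  "zookeeper.connect" -> "localhost:2181", // assumed ZooKeeper address
  "group.id"          -> "lambda"
)
val topic = "weblogs"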
Receiver-based Approach
Option 2: Create a Kafka stream per topic-partition

val receiverCount = 3
val kafkaStreams = (1 to receiverCount).map { _ =>
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_AND_DISK)
}
val kafkaStream = ssc.union(kafkaStreams)
  .map(_._2)

[Diagram: three receiver tasks (R), one per topic partition, spread across the Spark executors and pulling from Kafka in parallel.]
Direct Approach
Driver determines offsets since the last batch

val params = Map(
  "metadata.broker.list" -> "localhost:9092",
  "group.id" -> "lambda",
  "auto.offset.reset" -> "smallest"
)

KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, params, Set(topic))
  .map(_._2)

[Diagram: the driver asks Kafka for the available offsets each batch and records the resulting topic|partition|offset ranges; tasks are scheduled on the Spark executors to consume the data for that batch and are then released for other operations.]
Demo
- Save Data from Kafka to HDFS
- Build Resiliency into the Application
  - Recover from complete failures
  - Allow for application updates

Kafka Direct Stream to HDFS

A direct Kafka stream means there’s a 1-1 mapping between Kafka partitions and Spark partitions, so each Spark partition can be written out to HDFS under ../KafkaTopic/KafkaPartition together with its data, fromOffset and untilOffset.

val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

offsetRanges(partitionNumber)
  .topic
  .partition
  .fromOffset
  .untilOffset
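A hedged sketch of the save-to-HDFS idea above (not the course's demo code): use foreachRDD on the direct stream, read its offset ranges, and write the data out with its provenance. As a simplification it writes one directory per batch and embeds topic, partition and offsets in each record, rather than one directory per topic-partition as on the slide; the output path is an assumption, and kafkaStream is assumed to be the (key, value) direct stream before any .map(_._2).

import org.apache.spark.streaming.kafka.HasOffsetRanges

kafkaStream.foreachRDD { rdd =>
  // One offset range per RDD partition, in partition order.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.mapPartitionsWithIndex { (partitionIdx, records) =>
    val range = offsetRanges(partitionIdx)
    // Tag every record with where it came from so the batch layer can
    // deduplicate / reconcile on (topic, partition, offset range).
    records.map { case (_, value) =>
      (range.topic, range.partition, range.fromOffset, range.untilOffset, value)
    }
  }.saveAsTextFile(s"hdfs:///data/weblogs/batch_${System.currentTimeMillis()}") // assumed path
}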
Streaming Resiliency

[Diagram: Spark checkpoints provide the recovery mechanism for both the receiver-based approach and the direct stream approach.]
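A hedged sketch of the checkpoint-based recovery pattern referenced above (standard Spark Streaming API, though not necessarily the course's exact demo): StreamingContext.getOrCreate rebuilds the context, including the direct stream's consumed offsets, from the checkpoint directory after a failure. The checkpoint path and app name are assumptions.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/lambda" // assumed checkpoint location

// Builds a fresh context the first time; on restart, state (including the
// direct stream's offsets) is recovered from the checkpoint instead.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("kafka-direct-resilient")
  val ssc  = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... define the Kafka direct stream and its transformations here ...
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()

Note that checkpoints generally do not survive application code upgrades, which is one reason the slides also track the direct stream's offsets explicitly (for example alongside the data written to HDFS).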


Summary
§ Apache Kafka
  § Broker
  § Producer
  § Consumers and Partitions
§ Spark Streaming
  o Receiver-based
  o Direct Stream
§ Resiliency
  § Direct Stream Offsets
  § Recover from Upgrades
§ HDFS and Batch Layer Integration