5a - Streaming Data Analytics
[Diagram: data pipeline — Data Ingest → Processor → Output]
• Apache Kafka is an open-source message broker. Some other popular message brokers are RabbitMQ and ActiveMQ.
• It provides:
o Distributed queuing
o Produce/consume
• It is a well-designed distributed system for high scalability and fault tolerance
• Producer and consumer
[Diagram: Producer 1 and Producer 2 publish to Topic 1, Topic 2, …, Topic N, which are read by Consumer 1, Consumer 2, …, Consumer N]
• Example of producer - consumer
[Diagrams #1 and #2: Producer 1 and Producer 2 publish to the topics "Purchases" and "Error"; the consumers include the Billing service, User service, Product service, and Monitoring and Alerting]
• Topic and partition
o Kafka topic partitioning allows us to scale a topic horizontally
o More partitions in a topic means higher parallelism
o A topic with more messages can have more partitions, and vice versa
o A topic with multiple partitions does not preserve a global message order
[Diagram: one topic written by Producer 1, with Partition 0 holding messages 0–4, Partition 1 holding messages 0–1, and Partition 2 holding messages 0–3]
• Maintaining global order or not
o For example, in a restaurant:
Customer A orders food earlier than customer B. Customer A will be angry if customer B gets the food earlier.
But, customers don't really care about the order of the meal.
Thus, which topics should be used by the consumers "food handling" and "meal handling"?
[Diagram: Waiters 1–3 produce to Topic 1 (Partitions 0–1) and Topic 2 (Partition 0); which topic feeds "Food handling" and which feeds "Meal handling" is left as a question]
• Maintaining global order or not (cont.)
[Diagram (answer): Waiters 1–3 produce to the two topics; Topic 1 (Partitions 0–1) is consumed by "Food handling", and Topic 2 (Partition 0) is consumed by "Meal handling"]
• Grouping
o A message in a topic will only be consumed by one consumer within the same group.
o Thus, Kafka also acts as a load balancer in this case.
[Diagram: Producer 1 → Topic 1 → Consumers 1–3, all in Group 1]
o If several consumers want to subscribe to messages from the same topic, you can do this by setting them to different consumer group IDs (a minimal sketch follows below).
[Diagram: Producer 1 → Topic 1 → Consumer 1 (Group 1), Consumer 2 (Group 2), Consumer 3 (Group 3)]
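A minimal sketch of the group behaviour, using the 'kafka-python' library that is introduced later in this deck; the broker address localhost:9092, the topic name my_topic, and the group id are assumptions, not from the slide:

from kafka import KafkaConsumer

# Consumers started with the SAME group_id split the topic's partitions
# between them (load balancing); a consumer started with a DIFFERENT
# group_id receives its own full copy of every message.
consumer = KafkaConsumer(
    'my_topic',                          # hypothetical topic name
    bootstrap_servers='localhost:9092',  # assumed local broker
    group_id='group-1',                  # change this to read independently
    auto_offset_reset='earliest')

for message in consumer:
    print(message.value)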
• Scalability
o Let's say we have a topic with a high number of messages
o We can run several Kafka servers (brokers) and create several topic partitions
o By adding more brokers, we can increase the Kafka topic capacity
o The message broker parallelism in a single machine ~= number of cores
[Diagram: Producers 1–5 write to Topic 1, whose Partitions 0–3 are spread over Brokers 0–3; Consumers 1–2 in Group 1 read from the topic]
• Scalability (cont.)
o With more partitions, we can add more consumers
o Number of partitions ~= maximum unit of parallelism for a Kafka topic
o However, again, with more partitions the messages become less ordered
• Handling Failure
o Each Kafka topic is configured with a replication factor
o A replication factor N means each partition is replicated by N Kafka brokers
o Thus, if a node fails, we still have the replica.
[Diagram: example with replication factor of 2 — Producers 1–5 write to Topic 1; Broker 0 and Broker 1 each hold copies of Partition 0 and Partition 1; Consumer 1 reads from the topic]
• Handling Failure (cont.)
o For each partition, only one broker acts as the partition leader. Other brokers are followers.
o The leader takes all reads and writes
o The followers replicate the partition data to stay in sync with the leader
o To handle all the cluster coordination, Kafka needs ZooKeeper
[Diagram: example with replication factor of 2 — Topic 1's Partition 0 and Partition 1 are replicated across Broker 0 and Broker 1; Producers 1–4 write and Consumer 1 reads]
• Kafka in Practice
o Run the ZooKeeper service
o Run the Kafka server
o Create a Kafka topic:
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic my_topic
* All the setups can be seen, e.g., [here] and [here] (additional for Ubuntu users)
• Kafka in Practice (Python): needs the additional 'kafka-python' library.
o Example of producer code and example of consumer code (minimal sketches below)
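Minimal sketches of both, assuming a local broker at localhost:9092 and the hypothetical topic name my_topic created above:

Example of producer code (sketch):

from kafka import KafkaProducer

# Connect to the assumed local broker and send a few messages to the topic
producer = KafkaProducer(bootstrap_servers='localhost:9092')
for i in range(5):
    producer.send('my_topic', value=('message %d' % i).encode('utf-8'))
producer.flush()  # make sure everything is actually delivered

Example of consumer code (sketch):

from kafka import KafkaConsumer

# Read the topic from the beginning and print each message value
consumer = KafkaConsumer('my_topic',
                         bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest')
for message in consumer:
    print(message.value.decode('utf-8'))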
[Diagram: data pipeline — Data Ingest → Stream Processor → Output]
• Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
• Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets.
• Then, the data can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window.
• Finally, processed data can be pushed out to filesystems, databases, and live dashboards.
• You can apply Spark's machine learning and graph processing algorithms on data streams.
• One of the ways to stream from Kafka, in pyspark
o Spark Streaming is actually not a "true" stream but micro-batching. In the example sketched below (after the setup notes), the batch interval is 2 s.
• One of the ways to stream from Kafka, in pyspark (cont.)
o First, you need to download "spark-streaming-kafka-0-8-assembly_2.11-2.4.5.jar" (from the Maven repository) and add it to "your_spark_folder/jars"
o Then, type "pyspark" in the terminal
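A minimal sketch of such a stream, run inside the pyspark shell (so the SparkContext sc already exists) with the 0-8 assembly jar in place; the ZooKeeper address localhost:2181, the consumer group id 'spark-group', and the topic name my_topic are assumptions here, and the batch interval is the 2 s mentioned above:

from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# 2-second micro-batches on top of the shell's SparkContext
ssc = StreamingContext(sc, 2)

# Subscribe to the topic via ZooKeeper (assumed at localhost:2181);
# 'spark-group' is a hypothetical consumer group id, and {'my_topic': 1}
# means read the topic with one receiver thread.
kafkaStream = KafkaUtils.createStream(ssc, 'localhost:2181', 'spark-group', {'my_topic': 1})

# Each record is a (key, value) pair; keep only the message value
lines = kafkaStream.map(lambda kv: kv[1])
lines.pprint()

# All DStream operations (including the ones sketched later) must be
# declared before this point.
ssc.start()
ssc.awaitTermination()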
• Word counting on the streaming data
• Word counting on the streaming data (cont.)
o Spark Streaming provides a high-level abstraction called a discretized stream, or DStream
o Internally, a DStream is represented as a sequence of RDDs
o Thus, we can apply transformations and actions similar to those of RDDs, e.g., the flatMap and map operations (see the sketch below)
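A minimal word-count sketch over the Kafka DStream from the previous sketch (the variable lines is carried over from there, so it is an assumption, not part of the original slide; these lines would go before the ssc.start() call):

# Split every message into words, pair each word with 1, and sum per batch
words = lines.flatMap(lambda line: line.split(' '))
pairs = words.map(lambda word: (word, 1))
word_counts = pairs.reduceByKey(lambda x, y: x + y)

# Print the first elements of each batch's result
word_counts.pprint()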
• Several transformations on DStream (e.g., transform and updateStateByKey, described next)
• Transformations on DStream: transform
o Example of transform: sortByKey is not available for DStream, so we can use transform to apply it on each underlying RDD instead (a minimal sketch follows below)
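A minimal sketch, reusing the word_counts DStream from the word-count sketch above (an assumption, not from the slide):

# transform() applies an RDD-to-RDD function to every micro-batch,
# so RDD-only operations such as sortByKey become usable on a DStream.
sorted_counts = word_counts.transform(lambda rdd: rdd.sortByKey())
sorted_counts.pprint()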
• Transformations on DStream: updateStateByKey
o Example of updateStateByKey: the current state values are updated using the previous state (a minimal sketch follows below).
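A minimal sketch of a stateful running word count with updateStateByKey, again reusing the pairs DStream from the word-count sketch (an assumption); note that stateful operations require a checkpoint directory, and the path used here is hypothetical:

# Stateful operations need checkpointing (hypothetical local path)
ssc.checkpoint('/tmp/spark-checkpoint')

def update_count(new_values, previous_state):
    # new_values: the counts arriving in the current batch for one key
    # previous_state: the running total from earlier batches (None at first)
    return sum(new_values) + (previous_state or 0)

# Running total per word across all batches seen so far
running_counts = pairs.updateStateByKey(update_count)
running_counts.pprint()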
• Actions (output operations) on DStream, e.g., pprint, saveAsTextFiles, and foreachRDD (minimal sketches below)
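A minimal sketch of common output operations on the word_counts DStream from the earlier sketch (the output path prefix is hypothetical):

# Print the first ten elements of every batch to the driver console
word_counts.pprint()

# Persist each batch as text files named <prefix>-<timestamp> (hypothetical prefix)
word_counts.saveAsTextFiles('/tmp/wordcounts')

# foreachRDD gives full access to the RDD of each batch,
# e.g., to push results to an external database
word_counts.foreachRDD(lambda rdd: print(rdd.take(5)))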
• Until now, we know how to handle streaming data in general. But why do we need this streaming processing/analytics?
o The two big data characteristics volume and variety we have already addressed, e.g., by using Hadoop and the RDD concepts that can handle a huge amount of data, including unstructured data.
o With streaming data, we deal with velocity, which we have no control over. Thus, we work on how fast we can analyze the data and gain insight from it.
We can never really hope to get a "global" view of a data stream, since it is unbounded. Hence, to get any value from the data, we must somehow partition it.
It might be the last 5 minutes, or the last 2 hours. Or maybe the last 256 events? That's of course use-case-dependent. But this "fragment" is what's called a window.
• Consider the example of a traffic sensor that counts, every 15 seconds, the number of vehicles passing a certain location. The resulting stream is a sequence of such counts.
• To know how many vehicles passed that location, you would simply sum the individual counts, resulting in rolling sums. This would yield a new stream of partial sums.
• Spark Streaming provides windowed computations, which allow you to apply transformations over a sliding window of data
• In the illustration (not reproduced here), the window length = 3 and the sliding interval = 2
• Those two parameters must be multiples of the batch interval of the source DStream
• window(windowLength, slideInterval): return a new DStream which is computed based on windowed batches of the source DStream.
• countByWindow(windowLength, slideInterval): return a sliding window count of elements in the stream.
• reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]): similar to reduceByKey but calculated within the defined window.
• reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]): a more efficient version of reduceByKeyAndWindow, where the inverse function invFunc removes the contribution of data leaving the window instead of recomputing the whole window.
Example: pairs.reduceByKeyAndWindow(lambda x, y: x + y, lambda x, y: x - y, 30, 10)
(a fuller sketch follows below)
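A minimal windowed word-count sketch built on the pairs DStream from the earlier sketches (an assumption); with the 2 s batch interval, a 30 s window and a 10 s slide are valid since both are multiples of the batch interval, and the inverse-function form requires checkpointing to be enabled:

# Counts per word over the last 30 seconds, recomputed every 10 seconds.
# The second lambda is the inverse function: it subtracts the counts of
# batches that slide out of the window instead of re-reducing everything.
windowed_counts = pairs.reduceByKeyAndWindow(
    lambda x, y: x + y,   # add new values entering the window
    lambda x, y: x - y,   # remove old values leaving the window
    30,                   # window length in seconds
    10)                   # sliding interval in seconds
windowed_counts.pprint()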
--done--