5a - Streaming Data Analytics

This document describes a data streaming pipeline. Data is ingested from various sources like Twitter, credit card transactions, and sensors into message brokers like Apache Kafka or Apache Flume. The data is then processed using stream processing engines like Apache Spark Streaming, Apache Flink, or Apache Storm. Finally, the processed data is output to targets such as dashboards, databases, and applications.

Stream Data → Data Ingest → Stream Processor → Output

• Stream data, e.g., Twitter data, credit card transactions, sensor data, user queries, etc.
• Data ingest, e.g., Apache Kafka, Apache Flume, Amazon Kinesis, etc.
• Stream processor, e.g., Spark Streaming, Apache Flink, Apache Storm, etc.
• Output, e.g., dashboard, database, application, etc.
• Apache Kafka is an open-source message broker. Other popular
  message brokers include RabbitMQ and ActiveMQ.
• It provides:
  o Distributed queuing
  o Produce/consume (publish/subscribe)
• It is a well-designed distributed system built for high
  scalability and fault tolerance.
• Producer and consumer

  Diagram: Producer 1 and Producer 2 publish messages to Topic 1, Topic 2, ..., Topic N;
  Consumer 1, Consumer 2, ..., Consumer N subscribe to those topics and read the messages.
• Example of producer - consumer

  Diagram: Producer 1 publishes to a "Purchases" topic, which is consumed by the
  Billing service, the User service, and the Product service; Producer 2 publishes
  to an "Error" topic, which is consumed by Monitoring and Alerting.
• Topic and partition
  o Kafka topic partitioning allows us to scale a topic horizontally
  o More partitions in a topic → higher parallelism
  o A topic with more messages can have more partitions, and vice versa
  o A topic with multiple partitions does not have a global order

  Diagram: Producer 1 writes to a topic with three partitions; Partition 0 holds
  messages 0-4, Partition 1 holds messages 0-1, and Partition 2 holds messages 0-3.
• Maintaining global order or not
  o For example, in a restaurant:
    ▪ Customer A orders food earlier than customer B. Customer A will be
      angry if customer B gets the food earlier.
    ▪ But, customers don't really care about the order of the meal.
    Thus, which topic should be used by the "food handling" consumer, and
    which by the "meal handling" consumer?

  Diagram: Waiters 1-3 produce to Topic 1 (Partition 0 and Partition 1) and to
  Topic 2 (a single Partition 0); the mapping to the "food handling" and
  "meal handling" consumers is left as a question.
• Maintaining global order or not (cont.)
  o Same restaurant example:
    ▪ Customer A orders food earlier than customer B. Customer A will be
      angry if customer B gets the food earlier.
    ▪ But, customers don't really care about the order of the meal.
    So: the order-sensitive consumer should read from the single-partition
    topic, since only a single partition preserves a global order; the
    order-insensitive consumer can read from the multi-partition topic and
    benefit from its higher parallelism.

  Diagram: Waiters 1-3 produce to Topic 1 (Partitions 0 and 1) and Topic 2
  (Partition 0), now with each consumer connected to its topic.
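A complementary kafka-python sketch of a related point: messages sent with the same key are hashed to the same partition, so per-key order is preserved even when the topic has multiple partitions (the broker address, topic, and key names are assumptions, not from the slides):

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
# Both messages carry the key "customer-A", so they land in the same partition
# and are consumed in the order they were produced.
producer.send("orders", key=b"customer-A", value=b"order #1")
producer.send("orders", key=b"customer-A", value=b"order #2")
producer.flush()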
• Grouping
  o A message in a topic will only be consumed by one consumer within the
    same consumer group.
  o Thus, Kafka also acts as a load balancer in this case.

    Diagram: Producer 1 → Topic 1 → Group 1 (Consumer 1, Consumer 2, Consumer 3).

  o If several consumers each want to receive all messages from the same
    topic, you can do this by placing them in different consumer groups,
    i.e., giving them different group IDs (see the sketch after this slide).

    Diagram: Producer 1 → Topic 1 → Group 1 (Consumer 1), Group 2 (Consumer 2),
    Group 3 (Consumer 3).
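A minimal kafka-python consumer sketch illustrating grouping; the topic, broker address, and group name are assumptions:

from kafka import KafkaConsumer

# Several processes running this code with the same group_id split the topic's
# partitions between them (load balancing); a consumer with a different
# group_id receives its own full copy of the messages.
consumer = KafkaConsumer("purchases",
                         bootstrap_servers="localhost:9092",
                         group_id="billing-service")
for message in consumer:
    print(message.partition, message.offset, message.value)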
• Scalability
  o Let's say we have a topic with a high number of messages
  o We can run several Kafka servers (brokers) and create several partitions
    for the topic
  o By adding more brokers, we can increase the Kafka topic's capacity
  o The message-broker parallelism on a single machine is roughly the number
    of cores

  Diagram: Producers 1-5 write to Topic 1, whose Partitions 0-3 are spread
  across Brokers 0-3; Group 1 (Consumer 1, Consumer 2) reads from the topic.
• Scalability (cont.)
  o More partitions → we can add more consumers
  o The number of partitions ~= the maximum degree of parallelism for a
    Kafka topic
  o However, again, more partitions → the messages become less ordered
• Handling Failure
  o Each Kafka topic is configured with a replication factor
  o A replication factor of N means each partition is replicated across
    N Kafka brokers
  o Thus, if a node fails, we still have a replica.

  Diagram (example with a replication factor of 2): Producers 1-5 write to
  Topic 1; Broker 0 holds Partition 0 and Partition 1, and Broker 1 holds
  copies of Partition 1 and Partition 0; Consumer 1 reads from the topic.
• Handling Failure (cont.)
  o For each partition, only one broker acts as the partition leader; the
    other brokers are followers.
  o The leader takes all reads and writes
  o The followers replicate the partition data to stay in sync with the leader
  o To handle all the cluster coordination, Kafka needs ZooKeeper

  Diagram (example with a replication factor of 2): Producers 1-4 write to
  Topic 1; Broker 0 holds Partition 0 and Partition 1, Broker 1 holds copies
  of Partition 1 and Partition 0; Consumer 1 reads from the topic.
• Kafka in Practice
  o Run the ZooKeeper service
  o Run the Kafka server
  o Create a Kafka topic

    kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic my_topic

  o Publish messages to a topic
  o Consume messages from a topic

  * All the setup steps can be seen, e.g., [here] and [here] (additional notes for Ubuntu users)
• Kafka in Practice (Python)
  o Needs the additional 'kafka-python' library
  o Example of producer code and example of consumer code (a minimal sketch
    of both is shown below)
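A minimal kafka-python sketch of a producer and a consumer, assuming a broker at localhost:9092 and the my_topic topic created above (the exact code on the original slides may differ):

from kafka import KafkaProducer, KafkaConsumer

# Producer: send a few messages to the topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(5):
    producer.send("my_topic", value=("message %d" % i).encode("utf-8"))
producer.flush()  # make sure everything is actually sent

# Consumer: read messages from the beginning of the topic
consumer = KafkaConsumer("my_topic",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.partition, message.offset, message.value.decode("utf-8"))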
Stream Data → Data Ingest → Stream Processor → Output

• Stream data, e.g., Twitter data, credit card transactions, sensor data, user queries, etc.
• Data ingest, e.g., Apache Kafka, Apache Flume, Amazon Kinesis, etc.
• Stream processor, e.g., Spark Streaming, Apache Flink, Apache Storm, etc.
• Output, e.g., dashboard, database, application, etc.
• An extension of the core Spark API that enables scalable, high-throughput,
  fault-tolerant stream processing of live data streams.
• Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets.
• Then, the data can be processed using complex algorithms expressed with
  high-level functions like map, reduce, join and window.
• Finally, processed data can be pushed out to filesystems, databases, and
  live dashboards.
• You can apply Spark's machine learning and graph processing algorithms on
  data streams.
• One of the ways to stream from Kafka in PySpark (a sketch is shown below)
  o Note: Spark Streaming is actually not a "true" stream but micro-batching;
    in this example, a new micro-batch is processed every 2 s.
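A minimal PySpark sketch of such a receiver-based Kafka stream, assuming ZooKeeper on localhost:2181 and the my_topic topic (names and addresses are assumptions; the original slide's code may differ):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # needs the spark-streaming-kafka-0-8 assembly jar

sc = SparkContext(appName="KafkaStreamExample")
ssc = StreamingContext(sc, 2)  # 2-second micro-batches

# Receiver-based stream: (ZooKeeper quorum, consumer group id, {topic: receiver threads})
kafka_stream = KafkaUtils.createStream(ssc, "localhost:2181", "spark-group", {"my_topic": 1})
lines = kafka_stream.map(lambda kv: kv[1])  # each record is a (key, value) pair; keep the value
lines.pprint()

ssc.start()
ssc.awaitTermination()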
• One of the ways to stream from Kafka in PySpark (cont.)
  o First, download "spark-streaming-kafka-0-8-assembly_2.11-2.4.5.jar"
    (from the Maven repository) and add it to "your_spark_folder/jars"
  o Then, type "pyspark" in the terminal
  o A producer keeps publishing messages to the topic used in the previous
    slide (a sketch of such a producer is shown below)
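A small kafka-python producer that could drive this demo, sending a line of words every second (the topic name and the words themselves are assumptions):

import time
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
while True:
    # one space-separated line per second, to be word-counted by the streaming job
    producer.send("my_topic", b"spark streaming kafka spark")
    producer.flush()
    time.sleep(1)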
• Word counting on the streaming data (a sketch is shown below)
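A sketch of per-batch word counting, continuing from the lines DStream assumed above:

# Split each line into words, map to (word, 1) pairs, and sum the counts per batch
words = lines.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word, 1))
word_counts = pairs.reduceByKey(lambda a, b: a + b)
word_counts.pprint()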
• Word counting on the streaming data (cont.)
  o Spark Streaming provides a high-level abstraction called a discretized
    stream, or DStream
  o Internally, a DStream is represented as a sequence of RDDs
  o Thus, we can apply transformations and actions to it, similar to RDDs

  Diagram: RDD @time 1, RDD @time 2, RDD @time 3, RDD @time 4; the flatMap
  and map operations are applied to each RDD in the sequence.
• Several transformations on DStream

• Transformations on DStream: transform
  o sortByKey is not available for DStream, so we can use transform instead,
    which applies an arbitrary RDD-to-RDD function to each micro-batch
    (a sketch is shown below)
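A sketch of transform, sorting each batch of word counts with the RDD-level sortByKey (word_counts is assumed from the earlier sketch):

# transform exposes the underlying RDD of every micro-batch,
# so RDD-only operations such as sortByKey become usable
sorted_counts = word_counts.transform(lambda rdd: rdd.sortByKey())
sorted_counts.pprint()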
• Transformations on DStream: updateStateByKey
  o The current state values are updated using the new values of each batch
    and the previous state (a sketch is shown below)
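A sketch of updateStateByKey keeping a running word count across batches, using the pairs DStream and ssc from the earlier sketches; the checkpoint directory is an assumption and is required for stateful operations:

ssc.checkpoint("/tmp/spark_checkpoint")  # stateful transformations need checkpointing

def update_count(new_values, previous_count):
    # new_values: counts seen for this key in the current batch
    # previous_count: the accumulated count so far (None on the first batch)
    return sum(new_values) + (previous_count or 0)

running_counts = pairs.updateStateByKey(update_count)
running_counts.pprint()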
• Actions (output operations) on DStream

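A few common DStream output operations, sketched on the word_counts stream from above (the output path is an assumption):

word_counts.pprint()                            # print the first elements of each batch
word_counts.saveAsTextFiles("/tmp/wordcounts")  # write one output directory per batch interval

# foreachRDD runs arbitrary per-batch code, e.g., pushing results to a database or dashboard
def push_batch(rdd):
    for word, count in rdd.take(5):
        print(word, count)

word_counts.foreachRDD(push_batch)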
• Until now, we know how to handle streaming data in general. But why do we
  need this stream processing/analytics?
  o Two of the big data characteristics, volume and variety, we have already
    addressed, e.g., with Hadoop and the RDD concept, which can handle huge
    amounts of data, including unstructured data.
  o With streaming data, we deal with velocity, which we have no control
    over. Thus, we work on how fast we can analyze the data and gain insight
    from it.
    ▪ We can never really hope to get a "global" view of a data stream,
      since it is unbounded. Hence, to get any value from the data, we must
      somehow partition it.
    ▪ It might be the last 5 minutes, or the last 2 hours. Or maybe the last
      256 events? That is of course use-case-dependent. But this "fragment"
      is what's called a window.
• Consider the example of a traffic sensor that counts, every 15 seconds,
  the number of vehicles passing a certain location. The resulting stream
  is a sequence of such counts.
• To know how many vehicles passed that location, you would simply sum the
  individual counts, producing rolling sums. This yields a new stream of
  partial sums.

  *Pictures from https://flink.apache.org/
• However, a stream of partial sums might not be what we are looking for: it
  constantly updates the count, and some information, such as variation over
  time, is lost.
• Hence, we might want to rephrase our question and ask for the number of
  cars that pass the location every minute.
• This requires us to group the elements of the stream into finite sets, each
  set corresponding to sixty seconds. This operation is called a tumbling
  window operation.

  *Pictures from https://flink.apache.org/
• Tumbling windows discretize a stream into non-overlapping windows.
• For certain applications, it is important that the windows are not
  disjoint, because an application might require smoothed aggregates.
• For example, we can compute, every thirty seconds, the number of cars that
  passed in the last minute. Such windows are called sliding windows.

  *Pictures from https://flink.apache.org/
• For many applications, a data stream needs to be grouped into multiple
  logical streams, on each of which a window operator can be applied.
• Consider, for example, a stream of vehicle counts from multiple traffic
  sensors, where each sensor monitors a different location.
• By grouping the stream by sensor id, we can compute windowed traffic
  statistics for each location in parallel.

  *Pictures from https://flink.apache.org/
• Generally speaking, a window defines a finite set of
elements on an unbounded stream.
• This set can be based on:
o time,
o element counts,
o a combination of counts and time,
o or some custom logic to assign elements to
windows.

• Spark Streaming provides windowed computations, which allow you to apply
  transformations over a sliding window of data (a sketch is shown below)
• For example, with a window length of 3 and a sliding interval of 2 (both in
  units of the batch interval), each window covers the last 3 batches and a
  new windowed result is produced every 2 batches
• Those two parameters must be multiples of the batch interval of the source
  DStream
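A sketch of a windowed word count, assuming the pairs DStream from the earlier word-count example and the 2 s batch interval:

# Count words over the last 30 seconds, recomputed every 10 seconds
# (both durations must be multiples of the batch interval)
windowed_counts = pairs.reduceByKeyAndWindow(lambda a, b: a + b,  # add counts entering the window
                                             None,                # no inverse function in this simple form
                                             30, 10)
windowed_counts.pprint()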
• window(windowLength, slideInterval): returns a new DStream which is
  computed based on windowed batches of the source DStream.
• countByWindow(windowLength, slideInterval): returns a sliding-window count
  of the elements in the stream.
• reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]):
  similar to reduceByKey, but calculated within the defined window.
• reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval,
  [numTasks]): a more efficient version of reduceByKeyAndWindow, where each
  new window's result is computed incrementally from the previous one: values
  entering the window are added with func, and values leaving it are removed
  with the inverse function invFunc.

  Example: pairs.reduceByKeyAndWindow(lambda x, y: x + y,
                                      lambda x, y: x - y, 30, 10)

  For an input stream (k,1), (k,2), (k,3), (k,4), (k,5), (k,6), (k,7), (k,8):
  if one window sums to (k,14) = (k, 2+3+4+5), the next window is computed as
  (k,18) = (k,14) + (k,6) - (k,2), i.e., add the value entering the window and
  subtract the value leaving it.
• Computing variance on a stream
  o The (population) variance can be written as
    var(x) = (1/n) ∑ xᵢ² − ((1/n) ∑ xᵢ)²
  ▪ Thus, while the data streams in, we only need 3 variables to store:
    ∑ xᵢ, ∑ xᵢ², and n.
  ▪ This week's hands-on is implementing this computation in Spark Streaming
    with a Kafka producer as the source (a minimal sketch of the idea is
    shown below).
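A minimal sketch of this idea (not the hands-on solution itself), assuming a DStream named numbers containing one float per message and checkpointing enabled as in the earlier sketch:

def update_stats(new_values, state):
    # state is the running (n, sum, sum of squares); None on the first batch
    n, s, s2 = state or (0, 0.0, 0.0)
    for x in new_values:
        n += 1
        s += x
        s2 += x * x
    return (n, s, s2)

stats = numbers.map(lambda x: ("stats", x)).updateStateByKey(update_stats)

def print_variance(rdd):
    for _, (n, s, s2) in rdd.collect():
        if n > 0:
            mean = s / n
            variance = s2 / n - mean * mean  # population variance from the three running values
            print("n=%d  mean=%.3f  variance=%.3f" % (n, mean, variance))

stats.foreachRDD(print_variance)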
• Online K-Means

  (Source: Princeton Uni.)

  o There are several online/streaming machine learning algorithms
    implemented in Spark MLlib, e.g., streaming K-Means and streaming
    regression (a sketch is shown below).
  ▪ Or you may train the learning algorithm offline and do the inference
    online (on the stream).
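A sketch of MLlib's streaming K-Means, assuming DStreams training_stream and test_stream of numeric vectors already exist (names are assumptions):

from pyspark.mllib.clustering import StreamingKMeans

# 2 clusters over 3-dimensional points, starting from random initial centers
model = StreamingKMeans(k=2, decayFactor=1.0).setRandomCenters(3, 1.0, 0)

model.trainOn(training_stream)              # update the cluster centers with each micro-batch
predictions = model.predictOn(test_stream)  # assign each incoming point to a cluster
predictions.pprint()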
• There are several useful resources to further explore stream processing
  o Spark Structured Streaming → [here]
  o Again, Spark Streaming is not "true" streaming but micro-batching with
    time-based windowing. For "true" stream processing capable of item-based
    windowing, you may explore Apache Flink, Apache Storm, etc.
--done--
