5a - Streaming Data Analytics

This document describes a data streaming pipeline. Data is ingested from various sources like Twitter, credit card transactions, and sensors into message brokers like Apache Kafka or Apache Flume. The data is then processed using stream processing engines like Apache Spark Streaming, Apache Flink, or Apache Storm. Finally, the processed data is output to targets such as dashboards, databases, and applications.

Stream Data → Data Ingest → Stream Processor → Output

• Stream data, e.g., Twitter data, credit card transactions, sensor data, user queries, etc.
• Data ingest, e.g., Apache Kafka, Apache Flume, Amazon Kinesis, etc.
• Stream processor, e.g., Spark Streaming, Apache Flink, Apache Storm, etc.
• Output, e.g., dashboard, database, application, etc.
• Apache Kafka is an open-source message broker. Other popular
  message brokers include RabbitMQ and ActiveMQ.
• It provides:
  o Distributed queuing
  o Produce/consume (publish/subscribe)
• It is a well-designed distributed system built for high
  scalability and fault tolerance.
• Producer and consumer

  Diagram: Producer 1 and Producer 2 publish messages to Topic 1, Topic 2, ..., Topic N;
  Consumer 1, Consumer 2, ..., Consumer N subscribe to those topics and read the messages.
• Example of producer - consumer

  Diagram: Producer 1 publishes to a "Purchases" topic, which is consumed by the
  Billing service, the User service, and the Product service; Producer 2 publishes
  to an "Error" topic, which is consumed by Monitoring and Alerting.
• Topic and partition
  o Kafka topic partitioning allows us to scale a topic horizontally
  o More partitions in a topic → higher parallelism
  o A topic with more messages can have more partitions, and vice versa
  o A topic with multiple partitions does not have a global order

  Diagram: Producer 1 writes to a topic with three partitions; Partition 0 holds
  messages 0-4, Partition 1 holds messages 0-1, and Partition 2 holds messages 0-3.
• Maintaining global order or not
  o For example, in a restaurant:
    ▪ Customer A orders food earlier than customer B. Customer A will be
      angry if customer B gets the food earlier.
    ▪ But, customers don't really care about the order of the meal.
    Thus, which topic should be used by the "food handling" consumer, and
    which by the "meal handling" consumer?

  Diagram: Waiters 1-3 produce to Topic 1 (Partition 0 and Partition 1) and to
  Topic 2 (a single Partition 0); the mapping to the "food handling" and
  "meal handling" consumers is left as a question.
• Maintaining global order or not (cont.)
  o Same restaurant example:
    ▪ Customer A orders food earlier than customer B. Customer A will be
      angry if customer B gets the food earlier.
    ▪ But, customers don't really care about the order of the meal.
    So: the order-sensitive consumer should read from the single-partition
    topic, since only a single partition preserves a global order; the
    order-insensitive consumer can read from the multi-partition topic and
    benefit from its higher parallelism.

  Diagram: Waiters 1-3 produce to Topic 1 (Partitions 0 and 1) and Topic 2
  (Partition 0), now with each consumer connected to its topic.
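A complementary kafka-python sketch of a related point: messages sent with the same key are hashed to the same partition, so per-key order is preserved even when the topic has multiple partitions (the broker address, topic, and key names are assumptions, not from the slides):

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
# Both messages carry the key "customer-A", so they land in the same partition
# and are consumed in the order they were produced.
producer.send("orders", key=b"customer-A", value=b"order #1")
producer.send("orders", key=b"customer-A", value=b"order #2")
producer.flush()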
• Grouping
  o A message in a topic will only be consumed by one consumer within the
    same consumer group.
  o Thus, Kafka also acts as a load balancer in this case.

    Diagram: Producer 1 → Topic 1 → Group 1 (Consumer 1, Consumer 2, Consumer 3).

  o If several consumers each want to receive all messages from the same
    topic, you can do this by placing them in different consumer groups,
    i.e., giving them different group IDs (see the sketch after this slide).

    Diagram: Producer 1 → Topic 1 → Group 1 (Consumer 1), Group 2 (Consumer 2),
    Group 3 (Consumer 3).
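A minimal kafka-python consumer sketch illustrating grouping; the topic, broker address, and group name are assumptions:

from kafka import KafkaConsumer

# Several processes running this code with the same group_id split the topic's
# partitions between them (load balancing); a consumer with a different
# group_id receives its own full copy of the messages.
consumer = KafkaConsumer("purchases",
                         bootstrap_servers="localhost:9092",
                         group_id="billing-service")
for message in consumer:
    print(message.partition, message.offset, message.value)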
• Scalability
  o Let's say we have a topic with a high number of messages
  o We can run several Kafka servers (brokers) and create several partitions
    for the topic
  o By adding more brokers, we can increase the Kafka topic's capacity
  o The message-broker parallelism on a single machine is roughly the number
    of cores

  Diagram: Producers 1-5 write to Topic 1, whose Partitions 0-3 are spread
  across Brokers 0-3; Group 1 (Consumer 1, Consumer 2) reads from the topic.
• Scalability (cont.)
  o More partitions → we can add more consumers
  o The number of partitions ~= the maximum degree of parallelism for a
    Kafka topic
  o However, again, more partitions → the messages become less ordered
• Handling Failure
  o Each Kafka topic is configured with a replication factor
  o A replication factor of N means each partition is replicated across
    N Kafka brokers
  o Thus, if a node fails, we still have a replica.

  Diagram (example with a replication factor of 2): Producers 1-5 write to
  Topic 1; Broker 0 holds Partition 0 and Partition 1, and Broker 1 holds
  copies of Partition 1 and Partition 0; Consumer 1 reads from the topic.
• Handling Failure (cont.)
  o For each partition, only one broker acts as the partition leader; the
    other brokers are followers.
  o The leader takes all reads and writes
  o The followers replicate the partition data to stay in sync with the leader
  o To handle all the cluster coordination, Kafka needs ZooKeeper

  Diagram (example with a replication factor of 2): Producers 1-4 write to
  Topic 1; Broker 0 holds Partition 0 and Partition 1, Broker 1 holds copies
  of Partition 1 and Partition 0; Consumer 1 reads from the topic.
• Kafka in Practice
  o Run the ZooKeeper service
  o Run the Kafka server
  o Create a Kafka topic

    kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic my_topic

  o Publish messages to a topic
  o Consume messages from a topic

  * All the setup steps can be seen, e.g., [here] and [here] (additional notes for Ubuntu users)
• Kafka in Practice (Python)
  o Needs the additional 'kafka-python' library
  o Example of producer code and example of consumer code (a minimal sketch
    of both is shown below)
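A minimal kafka-python sketch of a producer and a consumer, assuming a broker at localhost:9092 and the my_topic topic created above (the exact code on the original slides may differ):

from kafka import KafkaProducer, KafkaConsumer

# Producer: send a few messages to the topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(5):
    producer.send("my_topic", value=("message %d" % i).encode("utf-8"))
producer.flush()  # make sure everything is actually sent

# Consumer: read messages from the beginning of the topic
consumer = KafkaConsumer("my_topic",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.partition, message.offset, message.value.decode("utf-8"))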
Stream Data → Data Ingest → Stream Processor → Output

• Stream data, e.g., Twitter data, credit card transactions, sensor data, user queries, etc.
• Data ingest, e.g., Apache Kafka, Apache Flume, Amazon Kinesis, etc.
• Stream processor, e.g., Spark Streaming, Apache Flink, Apache Storm, etc.
• Output, e.g., dashboard, database, application, etc.
• An extension of the core Spark API that enables scalable, high-throughput,
  fault-tolerant stream processing of live data streams.
• Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets.
• Then, the data can be processed using complex algorithms expressed with
  high-level functions like map, reduce, join and window.
• Finally, processed data can be pushed out to filesystems, databases, and
  live dashboards.
• You can apply Spark's machine learning and graph processing algorithms on
  data streams.
• One of the ways to stream from Kafka in PySpark (a sketch is shown below)
  o Note: Spark Streaming is actually not a "true" stream but micro-batching;
    in this example, a new micro-batch is processed every 2 s.
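A minimal PySpark sketch of such a receiver-based Kafka stream, assuming ZooKeeper on localhost:2181 and the my_topic topic (names and addresses are assumptions; the original slide's code may differ):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # needs the spark-streaming-kafka-0-8 assembly jar

sc = SparkContext(appName="KafkaStreamExample")
ssc = StreamingContext(sc, 2)  # 2-second micro-batches

# Receiver-based stream: (ZooKeeper quorum, consumer group id, {topic: receiver threads})
kafka_stream = KafkaUtils.createStream(ssc, "localhost:2181", "spark-group", {"my_topic": 1})
lines = kafka_stream.map(lambda kv: kv[1])  # each record is a (key, value) pair; keep the value
lines.pprint()

ssc.start()
ssc.awaitTermination()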
• One of the ways to stream from Kafka in PySpark (cont.)
  o First, download "spark-streaming-kafka-0-8-assembly_2.11-2.4.5.jar"
    (from the Maven repository) and add it to "your_spark_folder/jars"
  o Then, type "pyspark" in the terminal
  o A producer keeps publishing messages to the topic used in the previous
    slide (a sketch of such a producer is shown below)
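A small kafka-python producer that could drive this demo, sending a line of words every second (the topic name and the words themselves are assumptions):

import time
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
while True:
    # one space-separated line per second, to be word-counted by the streaming job
    producer.send("my_topic", b"spark streaming kafka spark")
    producer.flush()
    time.sleep(1)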
• Word counting on the streaming data (a sketch is shown below)
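A sketch of per-batch word counting, continuing from the lines DStream assumed above:

# Split each line into words, map to (word, 1) pairs, and sum the counts per batch
words = lines.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word, 1))
word_counts = pairs.reduceByKey(lambda a, b: a + b)
word_counts.pprint()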
• Word counting on the streaming data (cont.)
  o Spark Streaming provides a high-level abstraction called a discretized
    stream, or DStream
  o Internally, a DStream is represented as a sequence of RDDs
  o Thus, we can apply transformations and actions to it, similar to RDDs

  Diagram: RDD @time 1, RDD @time 2, RDD @time 3, RDD @time 4; the flatMap
  and map operations are applied to each RDD in the sequence.
• Several transformations on DStream

• Transformations on DStream: transform
  o sortByKey is not available for DStream, so we can use transform instead,
    which applies an arbitrary RDD-to-RDD function to each micro-batch
    (a sketch is shown below)
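A sketch of transform, sorting each batch of word counts with the RDD-level sortByKey (word_counts is assumed from the earlier sketch):

# transform exposes the underlying RDD of every micro-batch,
# so RDD-only operations such as sortByKey become usable
sorted_counts = word_counts.transform(lambda rdd: rdd.sortByKey())
sorted_counts.pprint()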
• Transformations on DStream: updateStateByKey
  o The current state values are updated using the new values of each batch
    and the previous state (a sketch is shown below)
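A sketch of updateStateByKey keeping a running word count across batches, using the pairs DStream and ssc from the earlier sketches; the checkpoint directory is an assumption and is required for stateful operations:

ssc.checkpoint("/tmp/spark_checkpoint")  # stateful transformations need checkpointing

def update_count(new_values, previous_count):
    # new_values: counts seen for this key in the current batch
    # previous_count: the accumulated count so far (None on the first batch)
    return sum(new_values) + (previous_count or 0)

running_counts = pairs.updateStateByKey(update_count)
running_counts.pprint()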
• Actions (output operations) on DStream

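A few common DStream output operations, sketched on the word_counts stream from above (the output path is an assumption):

word_counts.pprint()                            # print the first elements of each batch
word_counts.saveAsTextFiles("/tmp/wordcounts")  # write one output directory per batch interval

# foreachRDD runs arbitrary per-batch code, e.g., pushing results to a database or dashboard
def push_batch(rdd):
    for word, count in rdd.take(5):
        print(word, count)

word_counts.foreachRDD(push_batch)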
• Until now, we know how to handle streaming data in general. But why do we
  need this stream processing/analytics?
  o Two of the big data characteristics, volume and variety, we have already
    addressed, e.g., with Hadoop and the RDD concept, which can handle huge
    amounts of data, including unstructured data.
  o With streaming data, we deal with velocity, which we have no control
    over. Thus, we work on how fast we can analyze the data and gain insight
    from it.
    ▪ We can never really hope to get a "global" view of a data stream,
      since it is unbounded. Hence, to get any value from the data, we must
      somehow partition it.
    ▪ It might be the last 5 minutes, or the last 2 hours. Or maybe the last
      256 events? That is of course use-case-dependent. But this "fragment"
      is what's called a window.
• Consider the example of a traffic sensor that counts, every 15 seconds,
  the number of vehicles passing a certain location. The resulting stream
  is a sequence of such counts.
• To know how many vehicles passed that location, you would simply sum the
  individual counts, producing rolling sums. This yields a new stream of
  partial sums.

  *Pictures from https://flink.apache.org/
• However, a stream of partial sums might not be what we are looking for: it
  constantly updates the count, and some information, such as variation over
  time, is lost.
• Hence, we might want to rephrase our question and ask for the number of
  cars that pass the location every minute.
• This requires us to group the elements of the stream into finite sets, each
  set corresponding to sixty seconds. This operation is called a tumbling
  window operation.

  *Pictures from https://flink.apache.org/
• Tumbling windows discretize a stream into non-overlapping windows.
• For certain applications, it is important that the windows are not
  disjoint, because an application might require smoothed aggregates.
• For example, we can compute, every thirty seconds, the number of cars that
  passed in the last minute. Such windows are called sliding windows.

  *Pictures from https://flink.apache.org/
• For many applications, a data stream needs to be grouped into multiple
  logical streams, on each of which a window operator can be applied.
• Consider, for example, a stream of vehicle counts from multiple traffic
  sensors, where each sensor monitors a different location.
• By grouping the stream by sensor id, we can compute windowed traffic
  statistics for each location in parallel.

  *Pictures from https://flink.apache.org/
• Generally speaking, a window defines a finite set of
elements on an unbounded stream.
• This set can be based on:
o time,
o element counts,
o a combination of counts and time,
o or some custom logic to assign elements to
windows.

• Spark Streaming provides windowed computations, which allow you to apply
  transformations over a sliding window of data (a sketch is shown below)
• For example, with a window length of 3 and a sliding interval of 2 (both in
  units of the batch interval), each window covers the last 3 batches and a
  new windowed result is produced every 2 batches
• Those two parameters must be multiples of the batch interval of the source
  DStream
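A sketch of a windowed word count, assuming the pairs DStream from the earlier word-count example and the 2 s batch interval:

# Count words over the last 30 seconds, recomputed every 10 seconds
# (both durations must be multiples of the batch interval)
windowed_counts = pairs.reduceByKeyAndWindow(lambda a, b: a + b,  # add counts entering the window
                                             None,                # no inverse function in this simple form
                                             30, 10)
windowed_counts.pprint()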
• window(windowLength, slideInterval): returns a new DStream which is
  computed based on windowed batches of the source DStream.
• countByWindow(windowLength, slideInterval): returns a sliding-window count
  of the elements in the stream.
• reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]):
  similar to reduceByKey, but calculated within the defined window.
• reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval,
  [numTasks]): a more efficient version of reduceByKeyAndWindow, where each
  new window's result is computed incrementally from the previous one: values
  entering the window are added with func, and values leaving it are removed
  with the inverse function invFunc.

  Example: pairs.reduceByKeyAndWindow(lambda x, y: x + y,
                                      lambda x, y: x - y, 30, 10)

  For an input stream (k,1), (k,2), (k,3), (k,4), (k,5), (k,6), (k,7), (k,8):
  if one window sums to (k,14) = (k, 2+3+4+5), the next window is computed as
  (k,18) = (k,14) + (k,6) - (k,2), i.e., add the value entering the window and
  subtract the value leaving it.
• Computing variance on a stream
  o The (population) variance can be written as
    var(x) = (1/n) ∑ xᵢ² − ((1/n) ∑ xᵢ)²
  ▪ Thus, while the data streams in, we only need 3 variables to store:
    ∑ xᵢ, ∑ xᵢ², and n.
  ▪ This week's hands-on is implementing this computation in Spark Streaming
    with a Kafka producer as the source (a minimal sketch of the idea is
    shown below).
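A minimal sketch of this idea (not the hands-on solution itself), assuming a DStream named numbers containing one float per message and checkpointing enabled as in the earlier sketch:

def update_stats(new_values, state):
    # state is the running (n, sum, sum of squares); None on the first batch
    n, s, s2 = state or (0, 0.0, 0.0)
    for x in new_values:
        n += 1
        s += x
        s2 += x * x
    return (n, s, s2)

stats = numbers.map(lambda x: ("stats", x)).updateStateByKey(update_stats)

def print_variance(rdd):
    for _, (n, s, s2) in rdd.collect():
        if n > 0:
            mean = s / n
            variance = s2 / n - mean * mean  # population variance from the three running values
            print("n=%d  mean=%.3f  variance=%.3f" % (n, mean, variance))

stats.foreachRDD(print_variance)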
• Online K-Means

  (Source: Princeton Uni.)

  o There are several online/streaming machine learning algorithms
    implemented in Spark MLlib, e.g., streaming K-Means and streaming
    regression (a sketch is shown below).
  ▪ Or you may train the learning algorithm offline and do the inference
    online (on the stream).
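A sketch of MLlib's streaming K-Means, assuming DStreams training_stream and test_stream of numeric vectors already exist (names are assumptions):

from pyspark.mllib.clustering import StreamingKMeans

# 2 clusters over 3-dimensional points, starting from random initial centers
model = StreamingKMeans(k=2, decayFactor=1.0).setRandomCenters(3, 1.0, 0)

model.trainOn(training_stream)              # update the cluster centers with each micro-batch
predictions = model.predictOn(test_stream)  # assign each incoming point to a cluster
predictions.pprint()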
• There are several useful resources to further explore stream processing
  o Spark Structured Streaming → [here]
  o Again, Spark Streaming is not "true" streaming but micro-batching with
    time-based windowing. For "true" stream processing capable of item-based
    windowing, you may explore Apache Flink, Apache Storm, etc.
--done--
