
IEEE Third International Conference on Data Stream Mining & Processing

August 21-25, 2020, Lviv, Ukraine

Spark Structured Streaming: Customizing Kafka Stream Processing

Yuriy Drohobytskiy
Ternopil Volodymyr Hnatyuk National Pedagogical University, Ternopil, Ukraine
Dataengi UAB, Vilnius, Lithuania
orcid.org/0000-0002-7055-9905

Vitaly Brevus
Dataengi UAB, Vilnius, Lithuania
Ternopil Ivan Puluj National Technical University, Ternopil, Ukraine
orcid.org/0000-0002-3333-1573

Yuriy Skorenkyy
Ternopil Ivan Puluj National Technical University, Ternopil, Ukraine
orcid.org/0000-0002-4809-9025
Abstract — The aim of the present paper is to develop an improved solution for large-scale multi-party data exchange and stream processing. The method of choice uses Apache Kafka streams as well as HDFS file granulation, and is exemplified in a real project of data ingestion into the Hadoop ecosystem. Management and conditional stream-controlling procedures are proposed. Various ways to manage Kafka offsets during stream processing are considered.

Keywords—Distributed systems; Stream processing; Kafka streams; Spark v 2.3.2; HDFS file granulation

I. INTRODUCTION

Heterogeneous interconnected systems which produce a wide variety of data are nowadays common [1-3] for a range of applications, from energy generation and precision agriculture grids to an increasing number of small-scale and dispersed portable smart devices acquiring and transmitting diverse types of data. To support data-driven business models these data flows have to be properly processed in real time regardless of their variety and fragmentation [4]. Ingesting data from outer sources into HDFS in a cost-effective way is a complicated task to be solved by dedicated services that collect data from various sources outside of the Hadoop cluster, retrieve data from many databases, transform [5, 6] and enrich data, and finally push it into a Kafka topic. On the other end, a Spark application interacts with the Kafka topic and puts the data into HDFS. The 'external' service runs periodically, with a periodicity chosen appropriately for the specific features of the task being solved. To conclude this short outline of the system's concept, we note that the system's efficacy depends critically on details discussed in subsequent sections.

The current paper develops a solution to the large-scale multi-party data exchange and stream processing problem. The considered Spark app reads data from Kafka, processes it and stores the data in HDFS files. A standard streaming app runs permanently and processes data on the fly, which is an ineffective use of resources.

Typically, the data load that needs processing lasts minutes to hours, so the resources of the Hadoop cluster will not be used effectively by the Spark containers. The solution may be to start an application at a defined moment to process all records from the Kafka distributed streaming platform, and to put it down for the idle time interval (see Fig. 1).

Fig. 1. Irregular stream server load.

A problem is to determine that the service has already completed the data processing. For this problem, a universal solution is absent as no alarm can be set for the last message arrival. A viable solution to this stream processing problem may be the enactment of a dedicated service application for stream processing.

II. STREAM PROCESSING CUSTOMIZATION

A. Solution idea

To optimally customize the stream processing for an unpredictable flow of data from diverse producers one has to choose and wait a proper time interval while avoiding wasting paid cloud computing services.

If during this interval there are no incoming messages in the Kafka distributed streaming platform (see Fig. 2), then the waiting phase is terminated, the consumed data are post-processed and the application is stopped until the next predefined moment. This way, cluster resources are not wasted and all processing is timely performed in an appropriate way. The practical realization of this simple idea is, however, not so straightforward as it may seem.

B. Implementation of the solution

To test the above idea we split the whole processing time into equal intervals T and monitor periodically whether a new message arrives. To make sure that there is no incoming data in this application run we wait for N intervals and stop the application only if there were no messages in NT time. Let us examine a typical Apache Spark streaming application for big data processing that takes data from a Kafka cluster and stores it in the Parquet binary file format in the Hadoop Distributed File System (HDFS).
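The snippets that follow assume a Spark session and a few configuration values that are not shown explicitly in the paper. A minimal sketch of the assumed setup is given below; the names spark, bootstrapServers, topicIn, triggerInterval, checkpointLocation, outFilePath, queryName and maxRetrives, as well as their values, are illustrative only.

import org.apache.spark.sql.SparkSession

// Hypothetical setup assumed by the following snippets; names and values are illustrative.
val spark: SparkSession = SparkSession.builder()
  .appName("kafka-to-hdfs-ingestion")
  .getOrCreate()

import spark.implicits._   // enables the $"column" syntax and .as[T] conversions

val bootstrapServers   = "broker1:9092,broker2:9092"
val topicIn            = "ingestion-topic"
val triggerInterval    = "60 seconds"
val checkpointLocation = "/data/checkpoints/ingestion"
val outFilePath        = "/data/out/ingestion"
val queryName          = "kafka-to-parquet"
val maxRetrives        = 3   // consecutive empty intervals to tolerate before stopping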

Fig. 2. Architecture concept of the data processing platform.

More details should be taken into consideration [7-9] in real use cases; however, these are irrelevant for the problem being solved. The presented code runs on Spark v 2.3.2, nevertheless it is fully compatible with the latest Spark 2.4.x version. The following code starts listening to and consuming messages from the Kafka topic:

val kafkaMessages: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("subscribe", topicIn)
  .option("startingOffsets", "latest")
  .load()

Here bootstrapServers specifies the address of the Kafka brokers, topicIn is the name of the topic, and "latest" ensures that only the new messages in the topic are listened to. The last option is unwise, as it forces the application to be started before the data-producing service outside Hadoop; the solution to this problem will be given later. For our purposes we require the message payload as a string, and the Kafka message metadata, namely the partition and the offset. Accordingly, we cast the message from Kafka to the following model:

case class KafkaMessage(
  partition: Int,
  offset: Long,
  value: String
)

with every field being self-explanatory.

val message = kafkaMessages.selectExpr("partition",
  "offset", "CAST(value AS STRING)").as[KafkaMessage]

Now, message is a Dataset consisting of KafkaMessage records. To write the messages into a file we run the following stream:

val fileStream: StreamingQuery = message.writeStream
  .format("parquet")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime(triggerInterval))
  .option("checkpointLocation", checkpointLocation)
  .option("path", outFilePath)
  .queryName(queryName)
  .start()

Here, checkpointLocation is the path where the Spark Streaming checkpoint data are stored. This is necessary as Spark Streaming is fault-tolerant, and Spark needs to store its metadata there. queryName is the arbitrary name of the streaming query, and outFilePath is the path to the file on HDFS. triggerInterval is the period of time during which a Spark micro-batch is compiled and then processed by the Parquet writer, one at a time. So, at the moment, we have a stream that reads messages from Kafka and stores them into an HDFS file, and we have temporal granularity of the stream. According to the solution idea, we should be able to check the number of messages ingested during each time interval and, if for N intervals we consumed no messages, stop the stream.

An interface to listen to the stream events (org.apache.spark.sql.streaming.StreamingQueryListener) is:

abstract class StreamingQueryListener {

  import StreamingQueryListener._

  /**
   * Called when a query is started.
   * @note This is called synchronously with
   * [[org.apache.spark.sql.streaming.DataStreamWriter `DataStreamWriter.start()`]],
   * that is, `onQueryStart` will be called on all listeners before
   * `DataStreamWriter.start()` returns the corresponding [[StreamingQuery]]. Please
   * don't block this method as it will block your query.
   * @since 2.0.0
   */
  def onQueryStarted(event: QueryStartedEvent): Unit

  /**
   * Called when there is some status update (ingestion rate updated, etc.)
   *
   * @note This method is asynchronous. The status in [[StreamingQuery]] will always be
   * latest no matter when this method is called. Therefore, the status of [[StreamingQuery]]
   * may be changed before/when you process the event. E.g., you may find [[StreamingQuery]]
   * is terminated when you are processing `QueryProgressEvent`.
   * @since 2.0.0
   */
  def onQueryProgress(event: QueryProgressEvent): Unit

  /**
   * Called when a query is stopped, with or without error.
   * @since 2.0.0
   */
  def onQueryTerminated(event: QueryTerminatedEvent): Unit
}

To provide for the required functionality we create the listener as follows:

class StreamQueryListener(val query: StreamingQuery, val maxEmptyTicks: Int = 3)
    extends StreamingQueryListener {

  private val queryId = query.id
  private var currentEmptyCount = 0
  private var totalCount: Long = 0
  override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = {
    if (event.id == queryId) {
      !s"Query started. (id = $queryId)"
    }
  }

  override def onQueryProgress(event: StreamingQueryListener.QueryProgressEvent): Unit = {
    if (event.progress.id == queryId) {
      !s"Query progress. (id = $queryId)\n\tNumber of input rows = ${event.progress.numInputRows}, currentEmptyCount = $currentEmptyCount (total count = ${totalCount + event.progress.numInputRows})"
      event.progress.numInputRows match {
        case 0 =>
          currentEmptyCount += 1
          checkCounterLimit()
        case x =>
          currentEmptyCount = 0
          totalCount += x
      }
    }
  }

  private def checkCounterLimit(): Unit = {
    if (currentEmptyCount >= maxEmptyTicks) {
      !s"Query will be STOPPED! (id = $queryId)"
      query.stop()
    }
  }

  override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = {
    if (event.id == queryId) {
      !s"Query terminated. (id = $queryId)\n\tTotal rows processed = $totalCount"
    }
  }
}

We add this listener as follows:

spark.streams.addListener(new StreamQueryListener(fileStream, maxRetrives))

Here maxRetrives is the number of retrieves with no messages to wait for before the stream is stopped.

The main logic is in the onQueryProgress method. We look at the event.progress.numInputRows value, which equals the number of rows obtained during the batch time window (set by .trigger(Trigger.ProcessingTime(triggerInterval))). If there are no messages in the stream we increment the currentEmptyCount counter. When it reaches the maximum allowed value, we can gracefully stop the stream by query.stop(). If we have got any messages during the time window, we clear the counter and start monitoring from the beginning. We also count the total number of processed messages here. (The !"string" interpolator used above puts the string into the logs.)

To complete the application workflow we wait for the stream to terminate:

fileStream.awaitTermination()

That is all for our task. We have got a streaming application which reads all data from the Kafka topic and stops when the topic becomes 'empty'. So it does the job for our scheduled task.
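For readability, the fragments above can be wrapped into a single driver routine for one scheduled run. The sketch below is one possible wiring under the assumptions made earlier; the runIngestionOnce name and the final spark.stop() call are our additions, the latter releasing the cluster containers once the query has stopped.

// A possible wrapper for one scheduled run; assumes spark, fileStream and maxRetrives
// are defined as in the snippets above (names are illustrative, not from the paper).
def runIngestionOnce(): Unit = {
  // stop the query after maxRetrives consecutive empty trigger intervals
  spark.streams.addListener(new StreamQueryListener(fileStream, maxRetrives))

  // block until the listener calls query.stop() or the query fails
  fileStream.awaitTermination()

  // post-processing (file aggregation, offset storage) would follow here, see Section IV

  // release cluster resources between scheduled runs
  spark.stop()
}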
III. MANUAL KAFKA OFFSETS MANAGEMENT

Having in mind that our goal is to optimally schedule the streaming application runs, it is unwise to re-read the topic from the beginning each time. Instead, one should start reading from the point where the application stopped last time. Using .option("startingOffsets", "earliest") for the kafkaMessages stream we will always read the topic messages from the beginning. If the starting offsets are specified as "latest", then the reading sequence begins from the end and our goal will not be met, as there could be new (unread) messages in Kafka before the application starts. To solve this problem, let us consider how Spark manages offsets for the Kafka stream consumer (Fig. 3).

There are several options here:

1) Offset information could be stored by Kafka (usually Kafka uses Zookeeper for this). For this purpose each consumer should specify its own group.id, and offsets are stored "per group" (this can be done automatically with the autoCommit option, or manually). This option is completely useless with the Structured Streaming API, as there is no possibility to specify the group.id option (see the documentation); Spark will assign a different group.id to the consumer each time the application starts.

2) The second option is what Spark proposes to do by default: offsets and the information about the output file writes are stored in the directory called checkpoint.

Fig. 3. Process flow segment for offset management.

Checkpoints store intermediate information to ensure fault tolerance. If any sort of exception occurs, i.e. a JVM error, a container fault or any other error takes place, then the application recovers from that point automatically. This powerful mechanism is to be used for critical applications.

However, there are pitfalls in this approach. Firstly, the offsets are stored there too, so this folder cannot be removed without the critical loss of the offset information. Secondly, the output files cannot be removed; if this is done, the next application run will end with an error, as the information in the checkpoint will not match the output files. In our case, we would like to remove the output files between subsequent application runs. The output will physically consist of many parts (each of those parts is the data obtained during one processing time window), so after the application reads all data from the topic we want to aggregate all files into one big file and delete the intermediate files. To this end we have to remove the checkpoint directory as well. For this reason we consider other options to store Kafka offsets.

3) Finally, offsets can be manually stored and specified when creating a stream. This requires more effort but may represent the most flexible solution.

IV. IMPLEMENTATION

Multiple solutions may be attempted. We defined above the case class KafkaMessage for the received messages; it contains the partition and offset information. Therefore, after we have saved all messages in the files and stopped the stream, one can post-process the messages. The goal is to aggregate the useful information into one large file and also store the Kafka offsets for further usage. We split the data into offsets information and useful data in the following way:

val offsets: DataFrame =
  fragmentedMsgs
    .select($"partition", $"offset")
    .groupBy($"partition")
    .agg(
      max($"offset").alias("offset")
    )

for the offsets information. And:

val entities: Dataset[DataLakeEntity] =
  fragmentedMsgs
    .select(from_json(col("value"),
      DataLakeEntitySchemas.dataLakeEntitySchema).as("json_str"))
    .select($"json_str.*")
    .as[DataLakeEntity]
for our DataLakeEntity information. (We use JSON for the messages in the Kafka topic; this could be different for another application, i.e. protobuf or other formats could be used.)
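The aggregation of the intermediate Parquet parts into one large file is not spelled out in the paper. One possible sketch is given below; it assumes the outFilePath and checkpointLocation values from the earlier snippets and an illustrative aggregatedPath target directory.

import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical post-processing step: compact the per-batch Parquet parts into a single file
// and clean up the intermediate output and the checkpoint directory (names are illustrative).
val aggregatedPath = "/data/out/ingestion-aggregated"

spark.read.parquet(outFilePath)      // read back all part files written by the stopped stream
  .coalesce(1)                       // a single output file; acceptable for moderate volumes
  .write
  .mode("append")
  .parquet(aggregatedPath)

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.delete(new Path(outFilePath), true)         // remove the intermediate part files
fs.delete(new Path(checkpointLocation), true)  // remove the checkpoint; offsets are stored elsewhere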
As for the offsets, the work is almost done:

val offsetsList =
  offsets.as[PartitionOffset].collect().toList

if (offsetsList.nonEmpty) {
  // Store offsets somewhere, i.e.:
  offsetStore.insertOrUpdateOffsets(topicIn, offsetsList)
}

Now we need some storage for the offsets. We can write them into a Spark table, into an HBase DB, into a PostgreSQL DB, etc. When starting the stream we should read the offsets information back and pass it as an option for the stream, like .option("startingOffsets", "latest"), but instead of "latest" we should use a special form, like:

{"topicA":{"0":23,"1":-1},"topicB":{"0":-1}}

Further information may be easily found in the documentation.
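A minimal sketch of this last step, under the assumption that PartitionOffset exposes partition and offset fields and that a hypothetical offsetStore.readOffsets helper returns the previously saved offsets, could look as follows:

// Hypothetical helpers; PartitionOffset and offsetStore.readOffsets are illustrative.
def startingOffsetsJson(topic: String, stored: List[PartitionOffset]): String = {
  // The stored offset is the last one consumed, hence "+ 1" to resume with the next record.
  val perPartition = stored
    .map(po => "\"" + po.partition + "\":" + (po.offset + 1))
    .mkString(",")
  "{\"" + topic + "\":{" + perPartition + "}}"
}

val storedOffsets = offsetStore.readOffsets(topicIn)   // e.g. List(PartitionOffset(0, 23))

// The stream is then created as before, only the startingOffsets option changes.
val resumedMessages: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("subscribe", topicIn)
  .option("startingOffsets",
    if (storedOffsets.nonEmpty) startingOffsetsJson(topicIn, storedOffsets) else "earliest")
  .load()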
V. CONCLUSIONS

On the basis of the analysis of practically applicable solutions to large-scale multi-party data exchange and stream processing problems [10, 11] we suggest a mechanism for periodic monitoring and reading of data from a Kafka stream. The proposed method allows us to monitor the data stream and manipulate it effectively under defined conditions. Multiple approaches to operating Kafka offsets in stream processing may be adopted depending on the peculiarities of the system and the processed data. While there are still options for offsets management which have not been discussed in detail in the present paper, this particular question can be addressed separately if a use case is specified.

For platforms driven by huge volumes of data coming live from multiple origins (IoT sensors, mobile devices, etc.), the ability to handle irregular data streams with unpredictable temporal distribution provides an important competitive edge; therefore the proposed architecture improvement of stream processing can lead to a dramatic cost reduction and performance increase for cloud services.

REFERENCES

[1] J. Gama and P. P. Rodrigues, "An Overview on Mining Data Streams," in Foundations of Computational Intelligence Volume 6: Data Mining, A. Abraham, A.-E. Hassanien, A. P. de Leon F. de Carvalho, and V. Snášel, Eds. Berlin, Heidelberg: Springer, 2009, pp. 29–45.
[2] J. Gama and P. P. Rodrigues, "Data Stream Processing," in Learning from Data Streams: Processing Techniques in Sensor Networks, J. Gama and M. M. Gaber, Eds. Berlin, Heidelberg: Springer, 2007, pp. 25–39.
[3] G. Pal, G. Li, and K. Atkinson, "Big Data Real Time Ingestion and Machine Learning," in 2018 IEEE Second International Conference on Data Stream Mining & Processing (DSMP), 2018, pp. 25–31.
[4] A. Batyuk and V. Voityshyn, "Apache Storm based on topology for real-time processing of streaming data from social networks," in 2016 IEEE First International Conference on Data Stream Mining & Processing (DSMP), 2016, pp. 345–349.
[5] P. J. Haas, "Data-Stream Sampling: Basic Techniques and Results," in Data Stream Management: Processing High-Speed Data Streams, M. Garofalakis, J. Gehrke, and R. Rastogi, Eds. Berlin, Heidelberg: Springer, 2016, pp. 13–44.
[6] A. Batyuk, V. Voityshyn, and V. Verhun, "Software Architecture Design of the Real-Time Processes Monitoring Platform," in 2018 IEEE Second International Conference on Data Stream Mining & Processing (DSMP), 2018, pp. 98–101.
[7] D. Vohra, "Apache Parquet," in Practical Hadoop Ecosystem: A Definitive Guide to Hadoop-Related Frameworks and Tools, D. Vohra, Ed. Berkeley, CA: Apress, 2016, pp. 325–335.
[8] C. C. Aggarwal, "An Introduction to Data Streams," in Data Streams: Models and Algorithms, C. C. Aggarwal, Ed. Boston, MA: Springer US, 2007, pp. 1–8.
[9] M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy, "Data Stream Mining," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds. Boston, MA: Springer US, 2010, pp. 759–787.
[10] T. Dunning and E. Friedman, Streaming Architecture: New Designs Using Apache Kafka and MapR Streams. O'Reilly Media, Inc., 2016.
[11] M. Armbrust et al., "Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark," in Proceedings of the 2018 International Conference on Management of Data, Houston, TX, USA, 2018, pp. 601–613.
