
IEEE Third International Conference on Data Stream Mining & Processing

August 21-25, 2020, Lviv, Ukraine

Spark Structured Streaming: Customizing Kafka Stream Processing

Yuriy Drohobytskiy
Ternopil Volodymyr Hnatyuk National Pedagogical University, Ternopil, Ukraine
Dataengi UAB, Vilnius, Lithuania
orcid.org/0000-0002-7055-9905

Vitaly Brevus
Dataengi UAB, Vilnius, Lithuania
Ternopil Ivan Puluj National Technical University, Ternopil, Ukraine
orcid.org/0000-0002-3333-1573

Yuriy Skorenkyy
Ternopil Ivan Puluj National Technical University, Ternopil, Ukraine
orcid.org/0000-0002-4809-9025
Abstract — The aim of the present paper is to develop an improved solution for large-scale multi-party data exchange and stream processing. The method of choice uses Apache Kafka streams as well as HDFS file granulation, and is exemplified in a real project of data ingestion into the Hadoop ecosystem. Management and conditional stream-controlling procedures are proposed. Various ways to manage Kafka offsets during stream processing are considered.

Keywords—Distributed systems; Stream processing; Kafka streams; Spark v 2.3.2; HDFS file granulation

I. INTRODUCTION

Heterogeneous interconnected systems which produce a wide variety of data are nowadays common [1-3] for a range of applications, from energy generation and precision agriculture grids to an increasing number of small-scale and dispersed portable smart devices acquiring and transmitting diverse types of data. To support data-driven business models these data flows have to be properly processed in real time regardless of their variety and fragmentation [4]. Ingesting data from outer sources into HDFS in a cost-effective way is a complicated task to be solved by dedicated services that collect data from various sources outside of the Hadoop cluster, retrieve data from many databases, transform [5, 6] and enrich data, and finally push it into a Kafka topic. On the other end, a Spark application interacts with the Kafka topic and puts the data into HDFS. The 'external' service runs periodically, with a periodicity chosen appropriately for the specific features of the task being solved. To conclude this short outline of the system's concept, we note that the system's efficacy depends critically on details discussed in subsequent sections.

The current paper develops a solution to the large-scale multi-party data exchange and stream processing problem. The considered Spark app reads data from Kafka, processes it and stores the data in HDFS files. A standard streaming app runs permanently and processes data on the fly, which is an ineffective use of resources.

Typically, the data load that needs processing lasts minutes to hours, so the resources of the Hadoop cluster will not be used effectively by the Spark containers. The solution may be to start an application at a defined moment to process all records from the Kafka distributed streaming platform, and to put it down for the idle time interval (see Fig. 1).

Fig. 1. Irregular stream server load.

A problem is to determine that the service has already completed the data processing. For this problem, a universal solution is absent as no alarm can be set for the last message arrival. A viable solution to this stream processing problem may be the enactment of a dedicated service application for stream processing.

II. STREAM PROCESSING CUSTOMIZATION

A. Solution idea

To optimally customize the stream processing for an unpredictable flow of data from diverse producers one has to choose and wait a proper time interval while avoiding wasting paid cloud computing services.

If during this interval there are no incoming messages in the Kafka distributed streaming platform (see Fig. 2), then the waiting phase is terminated, the consumed data are post-processed and the application is stopped until the next predefined moment. This way, cluster resources are not wasted and all processing is timely performed in an appropriate way. The practical realization of this simple idea is, however, not so straightforward as it may seem.

B. Implementation of the solution

To test the above idea we split the whole processing time into equal intervals T and monitor periodically whether a new message arrives. To make sure that there is no incoming data in this application run we wait for N intervals and stop the application only if there were no messages in NT time. Let us examine a typical Apache Spark streaming application for big data processing that takes data from a Kafka cluster and stores it in the Parquet binary file format in the Hadoop Distributed File System (HDFS).
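The snippets that follow assume a Spark session and a few configuration values that are not shown explicitly in the paper. A minimal sketch of the assumed setup is given below; the names spark, bootstrapServers, topicIn, triggerInterval, checkpointLocation, outFilePath, queryName and maxRetrives, as well as their values, are illustrative only.

import org.apache.spark.sql.SparkSession

// Hypothetical setup assumed by the following snippets; names and values are illustrative.
val spark: SparkSession = SparkSession.builder()
  .appName("kafka-to-hdfs-ingestion")
  .getOrCreate()

import spark.implicits._   // enables the $"column" syntax and .as[T] conversions

val bootstrapServers   = "broker1:9092,broker2:9092"
val topicIn            = "ingestion-topic"
val triggerInterval    = "60 seconds"
val checkpointLocation = "/data/checkpoints/ingestion"
val outFilePath        = "/data/out/ingestion"
val queryName          = "kafka-to-parquet"
val maxRetrives        = 3   // consecutive empty intervals to tolerate before stopping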

Fig. 2. Architecture concept of the data processing platform.

More details should be taken into consideration [7-9] in real use cases; however, these are irrelevant for the problem being solved. The presented code runs on Spark v 2.3.2, nevertheless it is fully compatible with the latest Spark 2.4.x version. The following code starts listening to and consuming messages from the Kafka topic:

val kafkaMessages: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("subscribe", topicIn)
  .option("startingOffsets", "latest")
  .load()

Here bootstrapServers specifies the address of the Kafka brokers, topicIn is the name of the topic, and "latest" ensures that only the new messages in the topic are listened to. The last option is unwise, as it forces the application to be started before the data-producing service outside Hadoop; the solution to this problem will be given later. For our purposes we require the message payload as a string, and the Kafka message metadata, namely the partition and the offset. Accordingly, we cast the message from Kafka to the following model:

case class KafkaMessage(
  partition: Int,
  offset: Long,
  value: String
)

with every field being self-explanatory.

val message = kafkaMessages.selectExpr("partition",
  "offset", "CAST(value AS STRING)").as[KafkaMessage]

Now, message is a Dataset consisting of KafkaMessage records. To write the messages into a file we run the following stream:

val fileStream: StreamingQuery = message.writeStream
  .format("parquet")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime(triggerInterval))
  .option("checkpointLocation", checkpointLocation)
  .option("path", outFilePath)
  .queryName(queryName)
  .start()

Here, checkpointLocation is the path where the Spark Streaming checkpoint data are stored. This is necessary as Spark Streaming is fault-tolerant, and Spark needs to store its metadata there. queryName is the arbitrary name of the streaming query, and outFilePath is the path to the file on HDFS. triggerInterval is the period of time during which a Spark micro-batch is compiled and then processed by the Parquet writer, one at a time. So, at the moment, we have a stream that reads messages from Kafka and stores them into an HDFS file, and we have temporal granularity of the stream. According to the solution idea, we should be able to check the number of messages ingested during each time interval and, if for N intervals we consumed no messages, stop the stream.

An interface to listen to the stream events (org.apache.spark.sql.streaming.StreamingQueryListener) is:

abstract class StreamingQueryListener {

  import StreamingQueryListener._

  /**
   * Called when a query is started.
   * @note This is called synchronously with
   * [[org.apache.spark.sql.streaming.DataStreamWriter `DataStreamWriter.start()`]],
   * that is, `onQueryStart` will be called on all listeners before
   * `DataStreamWriter.start()` returns the corresponding [[StreamingQuery]]. Please
   * don't block this method as it will block your query.
   * @since 2.0.0
   */
  def onQueryStarted(event: QueryStartedEvent): Unit

  /**
   * Called when there is some status update (ingestion rate updated, etc.)
   *
   * @note This method is asynchronous. The status in [[StreamingQuery]] will always be
   * latest no matter when this method is called. Therefore, the status of [[StreamingQuery]]
   * may be changed before/when you process the event. E.g., you may find [[StreamingQuery]]
   * is terminated when you are processing `QueryProgressEvent`.
   * @since 2.0.0
   */
  def onQueryProgress(event: QueryProgressEvent): Unit

  /**
   * Called when a query is stopped, with or without error.
   * @since 2.0.0
   */
  def onQueryTerminated(event: QueryTerminatedEvent): Unit
}

To provide for the required functionality we create the listener as follows:

class StreamQueryListener(val query: StreamingQuery, val maxEmptyTicks: Int = 3)
    extends StreamingQueryListener {

  private val queryId = query.id
  private var currentEmptyCount = 0
  private var totalCount: Long = 0
  override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = {
    if (event.id == queryId) {
      !s"Query started. (id = $queryId)"
    }
  }

  override def onQueryProgress(event: StreamingQueryListener.QueryProgressEvent): Unit = {
    if (event.progress.id == queryId) {
      !s"Query progress. (id = $queryId)\n\tNumber of input rows = ${event.progress.numInputRows}, currentEmptyCount = $currentEmptyCount (total count = ${totalCount + event.progress.numInputRows})"
      event.progress.numInputRows match {
        case 0 =>
          currentEmptyCount += 1
          checkCounterLimit()
        case x =>
          currentEmptyCount = 0
          totalCount += x
      }
    }
  }

  private def checkCounterLimit(): Unit = {
    if (currentEmptyCount >= maxEmptyTicks) {
      !s"Query will be STOPPED! (id = $queryId)"
      query.stop()
    }
  }

  override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = {
    if (event.id == queryId) {
      !s"Query terminated. (id = $queryId)\n\tTotal rows processed = $totalCount"
    }
  }
}

We add this listener as follows:

spark.streams.addListener(new StreamQueryListener(fileStream, maxRetrives))

Here maxRetrives is the number of retrieves with no messages to wait for before the stream is stopped.

The main logic is in the onQueryProgress method. We look at the event.progress.numInputRows value, which equals the number of rows obtained during the batch time window (set by .trigger(Trigger.ProcessingTime(triggerInterval))). If there are no messages in the stream we increment the currentEmptyCount counter. When it reaches the maximum allowed value, we can gracefully stop the stream by query.stop(). If we have got any messages during the time window, we clear the counter and start monitoring from the beginning. We also count the total number of processed messages here. (The !"string" interpolator used above puts the string into the logs.)

To complete the application workflow we wait for the stream to terminate:

fileStream.awaitTermination()

That is all for our task. We have got a streaming application which reads all data from the Kafka topic and stops when the topic becomes 'empty'. So it does the job for our scheduled task.
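For readability, the fragments above can be wrapped into a single driver routine for one scheduled run. The sketch below is one possible wiring under the assumptions made earlier; the runIngestionOnce name and the final spark.stop() call are our additions, the latter releasing the cluster containers once the query has stopped.

// A possible wrapper for one scheduled run; assumes spark, fileStream and maxRetrives
// are defined as in the snippets above (names are illustrative, not from the paper).
def runIngestionOnce(): Unit = {
  // stop the query after maxRetrives consecutive empty trigger intervals
  spark.streams.addListener(new StreamQueryListener(fileStream, maxRetrives))

  // block until the listener calls query.stop() or the query fails
  fileStream.awaitTermination()

  // post-processing (file aggregation, offset storage) would follow here, see Section IV

  // release cluster resources between scheduled runs
  spark.stop()
}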
III. MANUAL KAFKA OFFSETS MANAGEMENT

Having in mind that our goal is to optimally schedule the streaming application runs, it is unwise to re-read the topic from the beginning each time. Instead, one should start reading from the point where the application stopped last time. Using .option("startingOffsets", "earliest") for the kafkaMessages stream we will always read the topic messages from the beginning. If the starting offsets are specified as "latest", then the reading sequence begins from the end and our goal will not be met, as there could be new (unread) messages in Kafka before the application starts. To solve this problem, let us consider how Spark manages offsets for the Kafka stream consumer (Fig. 3).

There are several options here:

1) Offset information could be stored by Kafka (usually Kafka uses Zookeeper for this). For this purpose each consumer should specify its own group.id, and offsets are stored "per group" (this can be done automatically with the autoCommit option, or manually). This option is completely useless with the Structured Streaming API, as there is no possibility to specify the group.id option (see the documentation); Spark will assign a different group.id to the consumer each time the application starts.

2) The second option is what Spark proposes to do by default: offsets and the information about the output file writes are stored in the directory called checkpoint.

Fig. 3. Process flow segment for offset management.

Checkpoints store intermediate information to ensure fault tolerance. If any sort of exception occurs, i.e. a JVM error, a container fault or any other error takes place, then the application recovers from that point automatically. This powerful mechanism is to be used for critical applications.

However, there are pitfalls in this approach. Firstly, the offsets are stored there too, so this folder cannot be removed without the critical loss of the offset information. Secondly, the output files cannot be removed; if this is done, the next application run will end with an error, as the information in the checkpoint will not match the output files. In our case, we would like to remove the output files between subsequent application runs. The output will physically consist of many parts (each of those parts is the data obtained during one processing time window), so after the application reads all data from the topic we want to aggregate all files into one big file and delete the intermediate files. To this end we have to remove the checkpoint directory as well. For this reason we consider other options to store Kafka offsets.

3) Finally, offsets can be manually stored and specified when creating a stream. This requires more effort but may represent the most flexible solution.

IV. IMPLEMENTATION

Multiple solutions may be attempted. We defined above the case class KafkaMessage for the received messages; it contains the partition and offset information. Therefore, after we have saved all messages in the files and stopped the stream, one can post-process the messages. The goal is to aggregate the useful information into one large file and also store the Kafka offsets for further usage. We split the data into offsets information and useful data in the following way:

val offsets: DataFrame =
  fragmentedMsgs
    .select($"partition", $"offset")
    .groupBy($"partition")
    .agg(
      max($"offset").alias("offset")
    )

for the offsets information. And:

val entities: Dataset[DataLakeEntity] =
  fragmentedMsgs
    .select(from_json(col("value"),
      DataLakeEntitySchemas.dataLakeEntitySchema).as("json_str"))
    .select($"json_str.*")
    .as[DataLakeEntity]
for our DataLakeEntity information. (We use JSON for the messages in the Kafka topic; this could be different for another application, i.e. protobuf or other formats could be used.)
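The aggregation of the intermediate Parquet parts into one large file is not spelled out in the paper. One possible sketch is given below; it assumes the outFilePath and checkpointLocation values from the earlier snippets and an illustrative aggregatedPath target directory.

import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical post-processing step: compact the per-batch Parquet parts into a single file
// and clean up the intermediate output and the checkpoint directory (names are illustrative).
val aggregatedPath = "/data/out/ingestion-aggregated"

spark.read.parquet(outFilePath)      // read back all part files written by the stopped stream
  .coalesce(1)                       // a single output file; acceptable for moderate volumes
  .write
  .mode("append")
  .parquet(aggregatedPath)

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.delete(new Path(outFilePath), true)         // remove the intermediate part files
fs.delete(new Path(checkpointLocation), true)  // remove the checkpoint; offsets are stored elsewhere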
As for the offsets, the work is almost done:

val offsetsList =
  offsets.as[PartitionOffset].collect().toList

if (offsetsList.nonEmpty) {
  // Store offsets somewhere, i.e.:
  offsetStore.insertOrUpdateOffsets(topicIn, offsetsList)
}

Now we need some storage for the offsets. We can write them into a Spark table, into an HBase DB, into a PostgreSQL DB, etc. When starting the stream we should read the offsets information back and pass it as an option for the stream, like .option("startingOffsets", "latest"), but instead of "latest" we should use a special form, like:

{"topicA":{"0":23,"1":-1},"topicB":{"0":-1}}

Further information may be easily found in the documentation.
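A minimal sketch of this last step, under the assumption that PartitionOffset exposes partition and offset fields and that a hypothetical offsetStore.readOffsets helper returns the previously saved offsets, could look as follows:

// Hypothetical helpers; PartitionOffset and offsetStore.readOffsets are illustrative.
def startingOffsetsJson(topic: String, stored: List[PartitionOffset]): String = {
  // The stored offset is the last one consumed, hence "+ 1" to resume with the next record.
  val perPartition = stored
    .map(po => "\"" + po.partition + "\":" + (po.offset + 1))
    .mkString(",")
  "{\"" + topic + "\":{" + perPartition + "}}"
}

val storedOffsets = offsetStore.readOffsets(topicIn)   // e.g. List(PartitionOffset(0, 23))

// The stream is then created as before, only the startingOffsets option changes.
val resumedMessages: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("subscribe", topicIn)
  .option("startingOffsets",
    if (storedOffsets.nonEmpty) startingOffsetsJson(topicIn, storedOffsets) else "earliest")
  .load()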
V. CONCLUSIONS

On the basis of the analysis of practically applicable solutions to large-scale multi-party data exchange and stream processing problems [10, 11] we suggest a mechanism for periodic monitoring and reading of data from a Kafka stream. The proposed method allows us to monitor the data stream and manipulate it effectively under defined conditions. Multiple approaches to operating Kafka offsets in stream processing may be adopted depending on the peculiarities of the system and the processed data. While there are still options for offsets management which have not been discussed in detail in the present paper, this particular question can be addressed separately if a use case is specified.

For platforms driven by huge volumes of data coming live from multiple origins (IoT sensors, mobile devices, etc.), the ability to handle irregular data streams with unpredictable temporal distribution provides an important competitive edge; therefore the proposed architecture improvement of stream processing can lead to a dramatic cost reduction and performance increase for cloud services.

REFERENCES

[1] J. Gama and P. P. Rodrigues, "An Overview on Mining Data Streams," in Foundations of Computational Intelligence Volume 6: Data Mining, A. Abraham, A.-E. Hassanien, A. P. de Leon F. de Carvalho, and V. Snášel, Eds. Berlin, Heidelberg: Springer, 2009, pp. 29–45.
[2] J. Gama and P. P. Rodrigues, "Data Stream Processing," in Learning from Data Streams: Processing Techniques in Sensor Networks, J. Gama and M. M. Gaber, Eds. Berlin, Heidelberg: Springer, 2007, pp. 25–39.
[3] G. Pal, G. Li, and K. Atkinson, "Big Data Real Time Ingestion and Machine Learning," in 2018 IEEE Second International Conference on Data Stream Mining & Processing (DSMP), 2018, pp. 25–31.
[4] A. Batyuk and V. Voityshyn, "Apache Storm based on topology for real-time processing of streaming data from social networks," in 2016 IEEE First International Conference on Data Stream Mining & Processing (DSMP), 2016, pp. 345–349.
[5] P. J. Haas, "Data-Stream Sampling: Basic Techniques and Results," in Data Stream Management: Processing High-Speed Data Streams, M. Garofalakis, J. Gehrke, and R. Rastogi, Eds. Berlin, Heidelberg: Springer, 2016, pp. 13–44.
[6] A. Batyuk, V. Voityshyn, and V. Verhun, "Software Architecture Design of the Real-Time Processes Monitoring Platform," in 2018 IEEE Second International Conference on Data Stream Mining & Processing (DSMP), 2018, pp. 98–101.
[7] D. Vohra, "Apache Parquet," in Practical Hadoop Ecosystem: A Definitive Guide to Hadoop-Related Frameworks and Tools, D. Vohra, Ed. Berkeley, CA: Apress, 2016, pp. 325–335.
[8] C. C. Aggarwal, "An Introduction to Data Streams," in Data Streams: Models and Algorithms, C. C. Aggarwal, Ed. Boston, MA: Springer US, 2007, pp. 1–8.
[9] M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy, "Data Stream Mining," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds. Boston, MA: Springer US, 2010, pp. 759–787.
[10] T. Dunning and E. Friedman, Streaming Architecture: New Designs Using Apache Kafka and MapR Streams. O'Reilly Media, Inc., 2016.
[11] M. Armbrust et al., "Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark," in Proceedings of the 2018 International Conference on Management of Data, Houston, TX, USA, 2018, pp. 601–613.
