Customizing Kafka Stream Processing
Fig. 2. Architecture concept of the data processing platform.

More details should be taken into consideration [7]–[9] in real use cases; however, these are irrelevant for the problem being solved. The presented code runs on Spark v2.3.2; nevertheless, it is fully compatible with the latest Spark 2.4.x version. The following code starts listening and consuming messages from the Kafka topic:
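A minimal sketch of such a consumer, assuming the standard options of the Structured Streaming Kafka source (the broker address is a placeholder; topicIn denotes the input topic, as later in the text), could look like this:

// Hypothetical consumer; broker address is a placeholder.
val message: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", topicIn)
  .option("startingOffsets", "earliest")
  .load()
  // keep the payload and the bookkeeping columns used later on;
  // these correspond to the KafkaMessage fields
  .selectExpr("CAST(value AS STRING) AS value", "partition", "offset")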
Now, message is the DataFrame consisting of the KafkaMessage records. To write the messages into files, we run the following stream:

val fileStream: StreamingQuery = message.writeStream
  .format("parquet")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime(triggerInterval))
  .option("checkpointLocation", checkpointLocation)
  .option("path", outFilePath)
  .queryName(queryName)
  .start()
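The identifiers triggerInterval, checkpointLocation, outFilePath, and queryName above are ordinary configuration values; hypothetical settings could be, for example:

// Hypothetical configuration values for the file sink above.
val triggerInterval    = "30 seconds"                          // micro-batch trigger period
val checkpointLocation = "/data/checkpoints/kafka-to-parquet"  // Spark checkpoint directory
val outFilePath        = "/data/lake/raw/kafka-messages"       // target directory for Parquet files
val queryName          = "kafka-to-parquet"                    // name shown in query progress events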
The onQueryStarted callback of Spark's StreamingQueryListener API is declared as follows:

import StreamingQueryListener._
/**
 * Called when a query is started.
 * @note This is called synchronously with
 * [[org.apache.spark.sql.streaming.DataStreamWriter `DataStreamWriter.start()`]],
 * that is, `onQueryStart` will be called on all listeners before
 * `DataStreamWriter.start()` returns the corresponding [[StreamingQuery]].
 * Please don't block this method as it will block your query.
 * @since 2.0.0
 */
def onQueryStarted(event: QueryStartedEvent): Unit
Our listener implementation keeps the identifier of the monitored query, a counter of consecutive empty micro-batches, and the total number of processed rows:

private val queryId = query.id
private var currentEmptyCount = 0
private var totalCount: Long = 0
override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = {
  if (event.id == queryId) {
    !s"Query started. (id = $queryId)"
  }
}

override def onQueryProgress(event: StreamingQueryListener.QueryProgressEvent): Unit = {
  if (event.progress.id == queryId) {
    !s"Query progress. (id = $queryId)\n\tNumber of input rows = ${event.progress.numInputRows}, currentEmptyCount = $currentEmptyCount (total count = ${totalCount + event.progress.numInputRows})"
    event.progress.numInputRows match {
      case 0 =>
        currentEmptyCount += 1
        checkCounterLimit()
      case x =>
        currentEmptyCount = 0
        totalCount += x
    }
  }
}

private def checkCounterLimit(): Unit = {
  if (currentEmptyCount >= maxEmptyTicks) {
    !s"Query will be STOPPED! (id = $queryId)"
    query.stop()
  }
}

override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = {
  if (event.id == queryId) {
    !s"Query terminated. (id = $queryId)\n\tTotal rows processed = $totalCount"
  }
}
}
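The listener operates on one specific query, and the query and maxEmptyTicks values referenced above are presumably received as constructor parameters; it also has to be registered with the session's StreamingQueryManager. A minimal sketch of the wiring, assuming a hypothetical class name IdleStreamStopListener for the listener shown above:

// Hypothetical wiring; the class name IdleStreamStopListener is an assumption,
// its constructor arguments correspond to the values used inside the listener.
val listener = new IdleStreamStopListener(fileStream, maxEmptyTicks)
spark.streams.addListener(listener)      // register with the StreamingQueryManager
fileStream.awaitTermination()            // block until the listener calls query.stop()
spark.streams.removeListener(listener)   // detach the listener once the query has terminated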
If the starting offsets option is specified as "earliest", the reading sequence begins from the beginning. If it is specified as "latest", then the reading sequence begins from the end and our goal will not be met, as there could be new (unread) messages in Kafka before the application starts. To solve this problem, let's consider how Spark manages offsets for the Kafka stream consumer (Fig. 3).

There are several options here:

1) Offset information could be stored by Kafka (usually Kafka uses ZooKeeper for this purpose). Each consumer should then specify its own group.id, and offsets are stored "per group"; this could be done automatically (the autoCommit option) or manually. This option is completely useless with the Structured Streaming API, as there is no possibility to specify the group.id option (see the documentation): Spark will assign a different group.id to the consumer each time the application starts.

2) The second option is what Spark proposes to do by default: the offsets and the information about the output file writes are stored in a directory called the checkpoint.

IV. IMPLEMENTATION

Multiple solutions may be attempted. We defined above the case class KafkaMessage for the received messages; it contains the partition and offset information.
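The definition of KafkaMessage is not repeated here; a minimal sketch, assuming only the fields that are actually used below, might look like this:

// Assumed shape of the received message; only the fields used in this section are shown.
case class KafkaMessage(
  value: String,   // raw JSON payload of the Kafka record
  partition: Int,  // partition the record was read from
  offset: Long     // offset of the record within its partition
)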
Therefore, after we have saved all the messages into files and stopped the stream, one can post-process the messages. The goal is to aggregate the useful information into one large file and also to store the Kafka offsets for further usage. We split the data into offsets information and useful data in the following way:

val offsets: DataFrame =
  fragmentedMsgs
    .select($"partition", $"offset")
    .groupBy($"partition")
    .agg(
      max($"offset").alias("offset")
    )

for the offsets information. And:

val entities: Dataset[DataLakeEntity] =
  fragmentedMsgs
    .select(from_json(col("value"),
      DataLakeEntitySchemas.dataLakeEntitySchema).as("json_str"))
    .select($"json_str.*")
    .as[DataLakeEntity]

for our DataLakeEntity information. (We use JSON for the messages in the Kafka topic; this could be different for another application, i.e. protobuf or other formats could be used.)
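The concrete shape of DataLakeEntity is application specific; a hypothetical minimal definition, together with the schema object referenced in the from_json call above, could be:

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.StructType

// Hypothetical entity; the field names are placeholders for the real payload structure.
case class DataLakeEntity(id: Long, name: String, createdAt: java.sql.Timestamp)

object DataLakeEntitySchemas {
  // Derive the Spark SQL schema used by from_json from the case class encoder.
  val dataLakeEntitySchema: StructType = Encoders.product[DataLakeEntity].schema
}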
As for the offsets, the work is almost done:

val offsetsList =
  offsets.as[PartitionOffset].collect().toList

if (offsetsList.nonEmpty) {
  // Store offsets somewhere, e.g.:
  offsetStore.insertOrUpdateOffsets(topicIn, offsetsList)
}
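Here PartitionOffset is assumed to simply mirror the two aggregated columns:

// Assumed shape of a stored offset record.
case class PartitionOffset(partition: Int, offset: Long)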
Now we need some storage for the offsets. We can write them into a Spark table, into HBase, into a PostgreSQL database, etc. When starting the stream, we should read the offsets information back and pass it as an option for the stream; instead of .option("startingOffsets", "latest"), we should use a special form, like:

{"topicA":{"0":23,"1":-1},"topicB":{"0":-1}}

Further information may easily be found in the documentation.
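On restart, the stored offsets have to be converted into this JSON form and passed to the Kafka source. A minimal sketch, assuming a hypothetical offsetStore.readOffsets method that returns the previously saved PartitionOffset values:

// Build the startingOffsets JSON for a single topic, e.g. {"topicIn":{"0":24,"1":42}}.
// Depending on whether the stored value is "last processed" or "next to read",
// the +1 adjustment below may or may not be needed.
def startingOffsetsJson(topic: String, offsets: List[PartitionOffset]): String = {
  val perPartition = offsets
    .map(po => s""""${po.partition}":${po.offset + 1}""")
    .mkString(",")
  s"""{"$topic":{$perPartition}}"""
}

val savedOffsets = offsetStore.readOffsets(topicIn)         // hypothetical storage call
val message = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")        // placeholder
  .option("subscribe", topicIn)
  .option("startingOffsets",
    if (savedOffsets.nonEmpty) startingOffsetsJson(topicIn, savedOffsets) else "earliest")
  .load()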
V. CONCLUSIONS

On the basis of the analysis of practically applicable solutions to large-scale multi-party data exchange and stream processing problems [10], [11], we suggest a mechanism for periodic monitoring and reading of data from a Kafka stream. The proposed method allows us to monitor the data stream and manipulate it effectively under defined conditions. Multiple approaches to operating with Kafka offsets in stream processing may be adopted, depending on the peculiarities of the system and the processed data. While there are still options for offsets management which have not been discussed in detail in the present paper, this particular question can be addressed separately once a use case is specified.

For platforms driven by huge volumes of data coming live from multiple origins (IoT sensors, mobile devices, etc.), the ability to handle irregular data streams with unpredictable temporal distribution provides an important competitive edge; therefore, the proposed architectural improvement of stream processing can lead to a dramatic cost reduction and performance increase for cloud services.

REFERENCES

[1] J. Gama and P. P. Rodrigues, “An Overview on Mining Data Streams,” in Foundations of Computational Intelligence Volume 6: Data Mining, A. Abraham, A.-E. Hassanien, A. P. de Leon F. de Carvalho, and V. Snášel, Eds. Berlin, Heidelberg: Springer, 2009, pp. 29–45.
[2] J. Gama and P. P. Rodrigues, “Data Stream Processing,” in Learning from Data Streams: Processing Techniques in Sensor Networks, J. Gama and M. M. Gaber, Eds. Berlin, Heidelberg: Springer, 2007, pp. 25–39.
[3] G. Pal, G. Li, and K. Atkinson, “Big Data Real Time Ingestion and Machine Learning,” in 2018 IEEE Second International Conference on Data Stream Mining and Processing (DSMP), 2018, pp. 25–31.
[4] A. Batyuk and V. Voityshyn, “Apache storm based on topology for real-time processing of streaming data from social networks,” in 2016 IEEE First International Conference on Data Stream Mining and Processing (DSMP), 2016, pp. 345–349.
[5] P. J. Haas, “Data-Stream Sampling: Basic Techniques and Results,” in Data Stream Management: Processing High-Speed Data Streams, M. Garofalakis, J. Gehrke, and R. Rastogi, Eds. Berlin, Heidelberg: Springer, 2016, pp. 13–44.
[6] A. Batyuk, V. Voityshyn, and V. Verhun, “Software Architecture Design of the Real-Time Processes Monitoring Platform,” in 2018 IEEE Second International Conference on Data Stream Mining and Processing (DSMP), 2018, pp. 98–101.
[7] D. Vohra, “Apache Parquet,” in Practical Hadoop Ecosystem: A Definitive Guide to Hadoop-Related Frameworks and Tools, D. Vohra, Ed. Berkeley, CA: Apress, 2016, pp. 325–335.
[8] C. C. Aggarwal, “An Introduction to Data Streams,” in Data Streams: Models and Algorithms, C. C. Aggarwal, Ed. Boston, MA: Springer US, 2007, pp. 1–8.
[9] M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy, “Data Stream Mining,” in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds. Boston, MA: Springer US, 2010, pp. 759–787.
[10] T. Dunning and E. Friedman, Streaming Architecture: New Designs Using Apache Kafka and MapR Streams. O’Reilly Media, Inc., 2016.
[11] M. Armbrust et al., “Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark,” in Proceedings of the 2018 International Conference on Management of Data, Houston, TX, USA, 2018, pp. 601–613.