Coordinate Checkpoint Mechanism on Real-Time Messaging System in Kafka Pipeline Architecture

Thandar Aung, Hla Yin Min, Aung Htein Maw


University of Information Technology, Yangon, Myanmar
[email protected], [email protected], [email protected]

Abstract

The real-time messaging system is critical to computing based on time-critical decision making in many organizations. In a real-time messaging system, fault tolerance is the key challenge in meeting reliability requirements. Apache Kafka is a popular framework for feeding data streams into processing platforms. However, server failures in Apache Kafka raise many challenges in the replication process. To improve fault tolerance in Apache Kafka, this paper focuses on defining a fixed checkpoint interval to reduce the recovery time and the number of lost messages after a server failure. We then measure the cost of checkpointing, the cost of rollback, and the total time cost of the overheads due to the fixed checkpoint intervals. The system shows that the drawback of real-time processing depends on the number of lost messages across various partitions and server failure processes. The experimental results emphasize the checkpoint interval method as a way to reduce recovery time and lost messages. According to the experimental results, the total time cost with a checkpoint interval is approximately 30% lower than without a checkpoint interval.

Key Words- Apache Kafka, Real-time, Optimal Checkpoint Interval, Fault Tolerance, Checkpoint Cost

1. Introduction

Nowadays, the growth of big data has led to a changeover to digital technologies, and data plays an important role in various sources. The most important challenges of big data are collecting huge amounts of data and analyzing them. Information reliability, various computational complexities, and computational methods are important in overcoming the challenges of analyzing big data. Innovative infrastructure and creative thinking are necessary for analyzing user behavior data, application performance tracing, and activity data in the form of logs and messages in business and IT industry work. Big data computing has two types: batch-oriented computing and real-time computing. Real-time processing completes its work within a short period. Real-time processing organizes data capturing, data processing, and data exportation into a fast and active data processing technology.

Many applications need reliable information that is quickly routed to multiple types of receivers. Therefore, a real-time messaging system is appropriate for seamless integration of information between producers and consumers, avoiding any rewriting of an application [9]. Real-time processing technologies can handle large-capacity data streams rapidly. Fault tolerance is becoming a popular topic for improving the reliability of real-time processing. Kafka has high throughput, built-in partitioning, replication, and fault tolerance, which makes it a good solution for large-scale message processing applications [10]. Kafka is an ideal distributed messaging system for delivering a large number of real-time messages. In a distributed system environment, fault tolerance is the main factor in providing reliability to large applications. Coordinated checkpointing is a rollback-recovery technique for achieving fault tolerance.

This paper focuses on fault tolerance and defines the checkpoint interval to improve the performance of Kafka processing. The system performed several experiments to profile the performance of Apache Kafka and evaluated the coordinated checkpointing mechanism using the fixed checkpoint interval method.

The remainder of the paper is organized as follows: Section 2 introduces related work. Section 3 presents the system architecture for the real-time messaging system. The Coordinate Checkpoint Mechanism is described in Section 4. Section 5 describes the experimental setup for the Kafka pipeline architecture and evaluates the results. Section 6 concludes and discusses future work.

2. Related Work

Jay Kreps [5] shows that Kafka is a scalable messaging system that lies between message producers and consumers. It emphasizes that the pull model is better supported than the push model in Kafka. The authors pointed out the checkpoint method to reduce data loss when the consumer crashes. However, their contribution
cannot solve lost messages by using built-in replication for fault tolerance.

Hassan Nazeer, Waheed Iqbal, Fawaz Bokhari, Faisal Bukhari, and Shuja Ur Rehman Baig [2] evaluated a real-time text processing pipeline that uses open-source big data tools to reduce the data stream. The real-time processing pipeline is provided for minimizing the latency in data streams and determining the performance of different cluster sizes and workload generators. They need to improve the data stream in real-time processing.

Xiaohui Wei, Yuan Zhuang, Hongliang Li, and Zhiliang Liu [11] focused on two methods, dividing partitions and merging the data backup protocol at run time, to enhance the reliability of elastic DSPSs. The authors point out that the checkpoint mechanism can address the high recovery cost of existing replication processes.

Mallikarjuna Shastry P.M [7] emphasized the selection of an efficient checkpoint interval between two checkpoint interval methods. The author calculates the total overhead cost due to checkpointing and restarting of the application. The author discusses the advantages of fixed checkpoint interval methods over others when the application fails. But the work does not present an effective checkpointing interval method for real-time processing.

Young [6] presented a first-order approximation for calculating the optimal time interval. It can define the optimal checkpoint interval to recover the time lost to overheads when processing fails in the application. Young's model does not examine the restart time for resuming from the most recent checkpoint when the failure occurs.

Jagdish Makhijani [4] intends to use a fault-tolerance technique such as checkpointing to make a system fault tolerant. By the nature of the coordinate checkpoint, all local checkpoints are kept consistent by coordinating the checkpointing action. The advantage of the coordinate checkpoint algorithm is that it effectively handles orphan and lost messages. However, the proposed fixed checkpoint interval needs to measure the rollback cost and checkpoint cost. Our technique aims to calculate the rollback cost, checkpoint cost, and total cost due to failure.

Martin Kleppmann [9] explains how complex applications can be built by composing replicated logs and stream operators in the Kafka and Samza design. They describe the implementation of stream processing behind the design of Kafka and Samza in the applications.

This paper intends to develop fault tolerance in the real-time messaging system. The system points out the strength of defining checkpoint intervals based on coordinate checkpoint mechanisms. The system demonstrates the improvement of Kafka processing by using the checkpoint interval method.

3. System Architecture for Real-Time Messaging System

Figure 1. System Architecture of Kafka

In Figure 1, the system architecture illustrates the strategy of defining the optimal checkpoint interval in the real-time messaging system. Algorithm 1 shows the processing steps for developing a real-time messaging system. A fixed checkpoint interval method executes this algorithm to improve fault tolerance in Apache Kafka.

Algorithm 1: Kafka processing in pipeline Architecture
Input: real-time messages
Output: total time cost
Begin
 1. Start the Zookeeper server
 2. Start the Kafka local server to define broker id, port, and log dir
 3. Create a topic to define replication factors and partitions
 4. Publish messages asynchronously to the Kafka cluster
    Start
      (a) Divide partitions by round-robin policy
      (b) Identify the offset and execute the replication process
      (c) Check the leader and replicate followers in the ISR (in-sync replica) list
      (d) Store the replica in Zookeeper
    End
    Repeat Step 4 until the termination criterion is met
 5. Zookeeper handles coordination between topic and consumer
 6. if server = fail then
      Process failure recovery
      (a) Define the optimal checkpoint interval
      (b) Run the fixed checkpoint interval method
      (c) Calculate the total overhead cost in failure recovery
 7. Check total messages in processing by using the GetOffsetShell tool
 8. Evaluate the producer performance by using the producer performance tools
 9. Evaluate the consumer performance by using the consumer performance tools
10. Compare the total time cost with and without the checkpoint interval on the various data sizes
End

To evaluate Apache Kafka, the system publishes the messages in batches and is tested on many partitions. This process points out the weakness of asynchronous replication. We perform several exploratory runs to measure the producer and consumer performance of the existing pipeline architecture.
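
Steps 4(a) to 4(c) of Algorithm 1 can be illustrated with a small in-memory model. The sketch below is not the Kafka client API and does not talk to a broker. The class names `Partition` and `Topic`, the replica placement rule, and the choice of the first surviving in-sync replica as the new leader are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Partition:
    """Toy partition: an append-only log plus replica bookkeeping."""
    leader: int                       # broker id of the current leader
    isr: List[int]                    # in-sync replica broker ids (leader included)
    log: List[str] = field(default_factory=list)

    def append(self, message: str) -> int:
        """Append a message and return its offset (its index in the log)."""
        self.log.append(message)
        return len(self.log) - 1

    def fail_broker(self, broker_id: int) -> None:
        """Drop a failed broker from the ISR; promote a survivor if it led."""
        self.isr = [b for b in self.isr if b != broker_id]
        if self.leader == broker_id:
            if not self.isr:
                raise RuntimeError("no in-sync replica left to lead")
            # Assumption: the first surviving in-sync replica becomes leader.
            self.leader = self.isr[0]


class Topic:
    """Toy topic that spreads messages over partitions round-robin."""

    def __init__(self, num_partitions: int, brokers: List[int],
                 replication_factor: int) -> None:
        if replication_factor > len(brokers):
            raise ValueError("replication factor cannot exceed broker count")
        self.partitions: List[Partition] = []
        for p in range(num_partitions):
            # Assumption: replicas of partition p sit on consecutive brokers.
            replicas = [brokers[(p + k) % len(brokers)]
                        for k in range(replication_factor)]
            self.partitions.append(Partition(leader=replicas[0], isr=replicas))
        self._next = 0

    def publish(self, message: str) -> Tuple[int, int]:
        """Step 4(a)-(b): pick a partition round-robin, return (partition, offset)."""
        p = self._next
        self._next = (self._next + 1) % len(self.partitions)
        return p, self.partitions[p].append(message)
```

With three partitions on brokers 0, 1, and 2 and a replication factor of 2, five published messages land on partitions 0, 1, 2, 0, 1 in turn, and each partition numbers its own messages from offset 0. Calling `fail_broker(0)` on every partition then moves leadership of partition 0 to broker 1, which is the kind of ISR-based recovery that step 4(c) relies on.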

We apply the optimal checkpoint interval to improve the replication process in a real-time messaging system. The system architecture focuses on the effectiveness of defining the checkpoint interval to improve the fault tolerance of Apache Kafka. The system then calculates the total checkpoint cost by using the fixed checkpoint interval method.

3.1. Apache Kafka Architecture

Kafka is an open-source, distributed publish-subscribe messaging system that addresses the challenges of consuming both batch and real-time data volumes by implementing a publish-subscribe solution. Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system.

A producer publishes messages synchronously or asynchronously into a Kafka topic. The topic is created on a Kafka broker, which acts as a Kafka server and maintains feeds of the messages. In Figure 2, the topic is divided into partitions, and each message within a partition is assigned an offset, a monotonically increasing integer that serves as a unique identifier for the message within the partition. Each partition has a leader and one or more replicas that exist on followers, but a broker holds at most one replica per partition. The leader is used for all reads and writes. Kafka can replicate partitions across a configurable number of Kafka servers, which is used for fault tolerance. Kafka uses a replication method for developing fault tolerance in processing. Producers push messages to the Kafka broker and consumers pull messages from the broker, using the traditional push and pull model in Kafka. A consumer reads messages from Kafka topics by subscribing to topic partitions. A consumer group handles one or more consumers. A consumer is a member of one consumer group.

The consumer level retains the state of consumed messages and addresses the offset of messages lost due to a failure. The information about the lead replica and the current follower in-sync replicas (ISR) is stored in Zookeeper. When a server failure occurs, the system recovers the messages from the ISR list. Zookeeper [3] is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. In distributed applications, Zookeeper is essential as a high-performance coordination service. Zookeeper manages the state information and message offsets between brokers and consumers. In particular, Zookeeper keeps the broker's hostname and port, and the topics and partitions from the broker. Kafka may perform message delivery between producer and consumer by distributing in many ways.

[Figure 2 diagram: relationships among Producer, Topic, Consumer, Consumer Group, Partition, Zookeeper, Broker, Replica, Leader Replica, and Follower Replica]

Figure 2. Kafka Processing

4. Coordinate Checkpoint Mechanism

In distributed computing, checkpointing is a technique that helps tolerate failures and avoids restarting the process from the beginning. Time-based coordinated checkpointing ensures the consistency and recoverability properties. Coordinated checkpointing performs better than other checkpointing schemes when used by parallel applications. In a coordinated checkpoint mechanism, the processes coordinate their checkpointing actions in such a way that the set of local checkpoints taken is consistent [8]. There are two steps for implementing a coordinated checkpoint mechanism.

(1) The system calculates the optimal checkpoint interval based on the coordinate checkpoint mechanism.
(2) The system computes the total time cost by using a fixed checkpoint interval to optimize the system.

Table 1. Notation Used

Parameter   Meaning
Tc          The size of the optimum checkpoint interval
Tf          Mean time between failures
Ts          Time for saving each checkpoint onto a local disk
Ci          Starting time of the ith checkpoint
Ni          The number of checkpoints in the ith cycle
Ti          Time at which the ith failure occurs
Rbi         The rollback cost in the ith cycle
R           Restart time for resuming the execution of the system from the most recent checkpoint
CCi         The cost of the checkpoint in the ith cycle
TLi         The time lost in the ith cycle

In Table 1, the checkpoint interval model uses ten parameters. Figure 3 defines the checkpoint interval.

When a failure occurs, the system restarts the whole process in real-time processing. Our experiment shows that the coordinate checkpoint mechanism solves this problem.

Figure 3. The Checkpoint Interval (a) without checkpoint and (b) with checkpoint

Firstly, the system calculates the MTBF (mean time between failures). Mean time between failures is the predicted elapsed time between inherent failures of a mechanical or electronic system during normal system operation.

$$MTBF = \frac{\text{Total uptime}}{\text{number of breakdowns}} \tag{1}$$

Total uptime is the total measured time of Kafka processing for the given data sizes. The number of breakdowns reflects the probability of failures occurring in Kafka processing, which depends on the time spent publishing into the Kafka topic. MTBF is applied in the optimal checkpoint interval calculation. Ts represents the default save time for each checkpoint onto the local disk in Kafka processing. The system calculates the optimal checkpoint interval by using Eq (2).

$$T_c = \sqrt{2 T_s T_f} \tag{2}$$

The system improves fault tolerance in Kafka processing by using the best checkpoint interval method. A fixed time interval is also called the smart interval, which can reduce the communication overhead in coordinated checkpointing. The fixed checkpoint interval method has a lower total time cost overhead than other methods because every interval has the same size. Based on the optimal checkpoint interval in Eq (2), the system calculates the starting time of the ith checkpoint using Eq (3).

$$C_i = i T_c + (i - 1) T_s \tag{3}$$

A fixed checkpoint interval keeps the same interval size between consecutive checkpoints in a real-time messaging system. The system uses the total time cost to determine the improvement over existing systems. Figure 4 shows that all checkpoint intervals have the same length and identifies the number of checkpoints in the ith cycle.

[Figure 4 timeline: alternating Tc and Ts segments within cycle i, the execution time until a failure (Ti), the application failure, and the time required to restart]

Figure 4. Fixed Checkpoint Interval

The system determines the number of checkpoints taken in the ith cycle by using Eq (4).

$$N_i = \left\lfloor \frac{T_i}{T_c + T_s} \right\rfloor \tag{4}$$

Then, the checkpointing cost in the ith cycle is computed in Eq (5). Checkpoint cost depends on the number of checkpoints.

$$CC_i = N_i T_s \tag{5}$$

Eq (6) expresses the calculation of rollback cost in our system. Rollback cost decreases as the number of checkpoints increases.

$$Rb_i = T_i - N_i (T_c + T_s) \tag{6}$$

The rollback cost, checkpoint cost, and restart cost are essential for determining the time wasted in the ith cycle of Kafka processing. Restart time is taken from the system restart time after a failure in Kafka. Eq (7) is essential to demonstrating the improvement of our system.

$$TL_i = CC_i + Rb_i + R \tag{7}$$

5. Experimental Setup and Results

To conduct our experiment, the system used the open-source framework Apache Kafka 2.11-0.9.0.0 as the main pipeline architecture. The system runs Java 1.8 on the underlying pipeline architecture. The real-time data for the experiments are mobile phone spam messages from the Grumbletext Web site.

The implementation of the system is based on the optimal checkpoint interval. The experimental results express the improvement of the system. The system evaluates the experimental results on four data sizes and various partitions. The system then demonstrates the effectiveness of our recovery mechanism in a real-time messaging system. The implementation of the system is based on the following hardware specification.

Table 2. Hardware Specification

Operating system   Windows 32-bit Operating system
RAM                4.00 GB
Hard-disk          1 TB
Processor          Intel(R) Core(TM) i7-4770 CPU @3.40GHz
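
As a hedged illustration of the cost model in Section 4, the following sketch evaluates Eqs. (1) to (7) for a single cycle using the notation of Table 1. The function names are hypothetical, and any numbers passed to them are illustrative inputs, not measurements from the experiments reported below.

```python
import math


def mtbf(total_uptime, breakdowns):
    """Eq. (1): mean time between failures, Tf."""
    return total_uptime / breakdowns


def optimal_interval(ts, tf):
    """Eq. (2): Young's first-order optimum, Tc = sqrt(2 * Ts * Tf)."""
    return math.sqrt(2.0 * ts * tf)


def checkpoint_start(i, tc, ts):
    """Eq. (3): start time Ci of the i-th checkpoint (i counts from 1)."""
    return i * tc + (i - 1) * ts


def time_lost(ti, tc, ts, r):
    """Eqs. (4)-(7): return (Ni, CCi, Rbi, TLi) for a failure at time Ti."""
    n = math.floor(ti / (tc + ts))   # Eq. (4): checkpoints completed before Ti
    cc = n * ts                      # Eq. (5): time spent saving checkpoints
    rb = ti - n * (tc + ts)          # Eq. (6): work redone since last checkpoint
    tl = cc + rb + r                 # Eq. (7): total time lost in the cycle
    return n, cc, rb, tl
```

For example, with a hypothetical total uptime of 400 s and 4 breakdowns, Eq. (1) gives Tf = 100 s. With Ts = 2 s, Eq. (2) gives Tc = 20 s, so checkpoints start at 20 s, 42 s, and 64 s. A failure at Ti = 70 s with R = 5 s then yields Ni = 3, CCi = 6 s, Rbi = 4 s, and TLi = 15 s.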

A. Experiment 1: Number of Lost Messages on various partitions

Figure 5: Number of Lost Messages in testing various partitions

Figure 5 shows the performance evaluation using the producer and consumer performance tools in Apache Kafka. The experimental results show that dividing the topic into more partitions requires more recovery time during a single server failure and tends to lose more messages. From the experimental results, the system needs to control the lost messages in any situation.

B. Experiment 2: Number of Lost Messages on two or three servers

Figure 6: Number of Lost Messages in testing on two and three servers

Figure 6 indicates the weak point in fault tolerance by measuring the performance of the Kafka process. When an omission failure occurs in Kafka, the system cannot handle the lost messages under asynchronous replication.

C. Experiment 3: Comparison of Optimal checkpoint interval size between MTBF = 4 times and MTBF = 2 times

Figure 7 shows that the computed optimal checkpoint interval depends on the mean time between failures (MTBF). The system determines MTBF by counting the probability of failures occurring in Kafka processing. We calculate the total uptime based on various data sizes and estimate the probability of failure in this system. The optimal checkpoint interval affects the comparison of total time cost.

Figure 7: Comparison of Optimal checkpoint interval size between MTBF = 4 times and MTBF = 2 times

D. Experiment 4: Comparison of total time cost with checkpoint interval in MTBF = 4 times

Figure 8: Comparison of total time cost with checkpoint interval in MTBF = 4 times

In Figure 8, the system calculates restart time, save time on disk, and time of failure, which depend on the mean time between failures in Kafka processing. Hence, the optimal checkpoint interval lowers the total time cost for a given mean time between failures. Compared to the original system, our system reduces the total time cost by 37%, 34%, 30%, and 26% over the different data sizes (datasets).

E. Experiment 5: Comparison of total time cost with checkpoint interval in MTBF = 2 times

Figure 9 tests restart time, save time on disk, and time of failure, with a focus on the mean time between failures in
Kafka processing. The system shows that the coordinated checkpointing mechanism can reduce the total time cost by 36%, 30%, 31%, and 25% respectively, depending on the different data sizes (datasets).

Figure 9: Comparison of total time cost with checkpoint interval in MTBF = 2 times

To summarize the experimental results: according to Figure 5 and Figure 6, more partitions and servers raise the system performance, but the system needs to capture all of the messages in failure cases. In Figure 7, the checkpoint interval size depends on the probability of the number of failures occurring in Kafka. Figure 8 and Figure 9 then show that defining checkpoint intervals can recover lost messages in Kafka processing and reduce the recovery time.

6. Conclusion

The system emphasizes reliable messaging in a real-time big data pipeline architecture. The coordinated checkpoint mechanism improves the fault tolerance strategy of the original Apache Kafka. As a result, the fixed checkpoint interval method is considered a way to recover from losing messages and reduce the re-execution time after server failures. The system outperforms the original process in the Apache Kafka pipeline architecture. According to the experimental results, using a fixed checkpoint interval method reduces the total time cost by about 30% on average compared with the original system.

In the future, we plan to measure the performance of message recovery in Kafka processing based on checkpointing. By analyzing the performance of message recovery, we can develop more reliable data in the real-time messaging system.

7. References

[1] D. P. Acharjya, Kauser Ahmed P, "A Survey on Big Data Analytics: Challenges, Open Research Issues, and Tools", International Journal of Advanced Computer Science and Applications, 2(7), 2016, pp. 511-518.

[2] Hassan Nazeer, Waheed Iqbal, Fawaz Bokhari, Faisal Bukhari, Shuja Ur Rehman Baig, "Real-time Text Analytics Pipeline Using Open-source Big Data Tools", December 2017.

[3] http://zookeeper.apache.org/

[4] Jagdish Makhijani, Manoj Kumar Niranjan, "An Efficient Protocol Using Smart Interval for Coordinated Checkpointing", International Conference on Advances in Information Technology and Mobile Communication, pp. 6-12, 2011.

[5] Jay Kreps, Neha Narkhede, Jun Rao, "Kafka: a Distributed Messaging System for Log Processing", May 2015.

[6] J.W. Young, "A first-order approximation to the optimum checkpoint interval", Communications of ACM 17, 9 (Sept 1974), 530-531.

[7] Mallikarjuna Shastry P.M, "Selection of a Checkpoint Interval in Coordinated Checkpointing Protocol for Fault-Tolerant Open MPI", (IJCSE) International Journal on Computer Science and Engineering, Vol. 02, No. 06, ISSN: 0975-3397, Engg Journals Publications, India, 2010.

[8] Manoj Kumar Niranjan, "Protocol for Coordinated Checkpointing using Smart Interval with Dual Coordinator", International Journal of Computer Applications (0975-8887), Volume 99, No. 11, August 2014.

[9] Martin Kleppmann, "Kafka, Samza and the Unix Philosophy of Distributed Data", Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, July 2016.

[10] Nishant Garg, "Apache Kafka", PACKT Publishing UK, October 2013.

[11] Xiaohui Wei, Yuan Zhuang, Hongliang Li, Zhiliang Liu, "Reliable stream data processing for elastic distributed stream processing systems", May 2019.
