Coordinate Checkpoint Mechanism On Real-Time Messaging System in Kafka Pipeline Architecture
cannot solve lost messages by using built-in replication for fault tolerance.

Hassan Nazeer, Waheed Iqbal, Fawaz Bokhari, Faisal Bukhari, and Shuja Ur Rehman Baig [2] evaluated a real-time text processing pipeline built from open-source big data tools. The pipeline is designed to minimize latency in data streams and to determine the performance of different cluster sizes and workload generators. However, the handling of the data stream in real-time processing still needs improvement.

Xiaohui Wei, Yuan Zhuang, Hongliang Li, and Zhiliang Liu [11] focused on two methods, dividing partitions and merging the data backup protocol at run time, to enhance the reliability of elastic DSPSs. The authors point out that a checkpoint mechanism can address the high recovery cost of existing replication processes.

Mallikarjuna Shastry P.M. [7] emphasized the selection of an efficient checkpoint interval between two checkpoint interval methods. The author calculates the total overhead cost due to checkpointing and restarting of the application and discusses the advantages of the fixed checkpoint interval method over others when the application fails. However, an effective checkpointing interval method for real-time processing is not presented.

Young [6] presented the calculation of the optimal time interval as a first-order approximation. It defines the optimal checkpoint interval to recover the time overheads lost when the application fails during processing. Young's model does not examine the restart time for resuming from the most recent checkpoint when a failure occurs.

Jagdish Makhijani [4] uses a fault-tolerance technique such as checkpointing to make a system fault tolerable. By the nature of the coordinated checkpoint, all local checkpoints are kept consistent through the coordinated checkpointing action. The advantage of the coordinated checkpoint algorithm is that it effectively handles orphan and lost messages. However, the proposed fixed checkpoint interval needs to measure the rollback cost and checkpoint cost. Our technique aims to calculate the rollback cost, checkpoint cost, and total cost due to failure.

Martin Kleppmann [9] explains how complex applications can be built by composing replicated logs and stream operators in the Kafka and Samza design. He describes the implementation of stream processing behind the design of Kafka and Samza in applications.

This paper intends to develop fault tolerance in the real-time messaging system. The system points out the strength of defining checkpoint intervals based on the coordinated checkpoint mechanism. The system proves the improvement of Kafka processing by using the checkpoint interval method.

3. System Architecture for Real-Time Messaging System

Figure 1. System Architecture of Kafka

In Figure 1, the system architecture illustrates the strategy of defining the optimal checkpoint interval in the real-time messaging system. Algorithm 1 shows the processing steps for developing a real-time messaging system. A fixed checkpoint interval method executes this algorithm to improve fault tolerance in Apache Kafka.

Algorithm 1: Kafka processing in pipeline architecture
Input: real-time messages
Output: total time cost
Begin
1. Start the Zookeeper server
2. Start the local Kafka server to define the broker id, port, and log dir
3. Create a topic to define replication factors and partitions
4. Publish messages by asynchronous type to the Kafka cluster
   Start
   (a) Divide partitions by round-robin policy
   (b) Identify the offset and execute the replication process
   (c) Check the leader and replicate followers in the ISR (in-sync replica) list
   (d) Store replicas in Zookeeper
   End
   Repeat Step 4 until the termination criterion is met
5. Zookeeper handles coordination between topic and consumer
6. if server = fail then
   Process failure recovery
   (a) Define the optimal checkpoint interval
   (b) Run the fixed checkpoint interval method
   (c) Calculate the total overhead cost in failure recovery
7. Check the total messages in processing by using the GetOffsetShell tool
8. Evaluate producer performance by using the producer performance tool
9. Evaluate consumer performance by using the consumer performance tool
10. Compare the total time cost with and without the checkpoint interval on various data sizes
End

To evaluate the development of Apache Kafka, the system publishes the messages in batches and tests them on many partitions. The process points out the weakness of asynchronous replication. We perform several exploratory processes to measure the producer and consumer performance of the existing pipeline architecture.
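To make steps 3 and 4 of Algorithm 1 concrete, the following is a minimal sketch using the kafka-python client against a local broker; the broker address, topic name, partition count, and replication factor are illustrative assumptions rather than the configuration used in our experiments.

```python
from kafka import KafkaProducer
from kafka.admin import KafkaAdminClient, NewTopic

BOOTSTRAP = "localhost:9092"   # assumed local Kafka broker (Algorithm 1, step 2)

# Step 3: create a topic with explicit partitions and replication factor.
admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
admin.create_topics([NewTopic(name="pipeline-topic", num_partitions=3, replication_factor=1)])

# Step 4: publish messages asynchronously; send() returns a future immediately,
# so the producer does not wait for a broker acknowledgement per message.
producer = KafkaProducer(bootstrap_servers=BOOTSTRAP)
for i in range(1000):
    producer.send("pipeline-topic", value=f"message-{i}".encode("utf-8"))
producer.flush()   # block until all buffered records have been sent
producer.close()
admin.close()
```

With more than one partition, the client distributes keyless records across partitions, which corresponds to the round-robin policy in step 4(a).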
We apply the optimal checkpoint interval to develop the replication process in a real-time messaging system. The system architecture focuses on the effectiveness of defining the checkpoint interval to improve the fault tolerance of Apache Kafka. The system then calculates the total checkpoint cost by using the fixed checkpoint interval method.

3.1. Apache Kafka Architecture

Kafka is an open-source, distributed publish-subscribe messaging system that overcomes the challenges of consuming batch data volumes and real-time data by implementing the publish-subscribe solution. Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system.

A producer publishes messages using synchronous and asynchronous types into a Kafka topic. The topic is created on a Kafka broker, which acts as a Kafka server and maintains the feeds of messages. In Figure 2, the topic is divided into partitions, and each message within a partition is assigned an offset, a monotonically increasing integer that serves as a unique identifier for the message within the partition. Each partition has a leader and one or more replicas that exist on followers, but a broker has zero or one replica per partition. The leader is used for all reads and writes. Kafka can replicate partitions across a configurable number of Kafka servers, which is used for fault tolerance. Kafka uses this replication method to develop fault tolerance in processing. Producers push messages to the Kafka broker and consumers pull messages from the broker, following the traditional push and pull model in Kafka. A consumer reads messages from Kafka topics by subscribing to topic partitions. A consumer group handles one or more consumers, and a consumer is a member of one consumer group.

The consumer level retains the state of consumed messages and addresses the offset of lost messages in the failure case. The information about the lead replica and the current follower in-sync replicas (ISR) is stored in Zookeeper. When a server failure occurs, the system recovers the messages from the ISR list. Zookeeper [3] is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. In distributed applications, Zookeeper is essential for dealing with processes as a high-performance coordination service. Zookeeper manages the state information and message offsets between brokers and consumers. In particular, Zookeeper keeps the broker's hostname and port, and the topics and partitions on the broker. Kafka may perform message delivery between producer and consumer by distributing it in many ways.

Figure 2. Kafka Processing (relationships among Producer, Topic, Partition, Consumer Group, Consumer, Zookeeper, Broker, Replica, Leader Replica, and Follower Replica)

4. Coordinate Checkpoint Mechanism

In distributed computing, checkpointing is a technique that helps tolerate failures and avoids restarting the process from the beginning. Time-based coordinated checkpointing ensures the consistency and recoverability properties. Coordinated checkpointing performs better than other checkpointing approaches when used by a parallel application. In a coordinated checkpoint mechanism, the processes coordinate their checkpointing actions in such a way that the set of local checkpoints taken is consistent [8]. There are two steps for implementing a coordinated checkpoint mechanism:

(1) The system calculates the optimal checkpoint interval based on the coordinated checkpoint mechanism.
(2) The system computes the total time cost by using a fixed checkpoint interval to optimize the system.

Table 1. Notation Used
Parameter   Meaning
Tc          The size of the optimum checkpoint interval
Tf          Mean time between failures
Ts          Time for saving each checkpoint onto a local disk
Ci          Starting time of the ith checkpoint
Ni          The number of checkpoints in the ith cycle
Ti          The time at which the ith failure occurs
Rbi         The rollback cost in the ith cycle
R           Restart time for resuming the execution of the system from the most recent checkpoint
CCi         The cost of the checkpoint in the ith cycle
TLi         The time lost in the ith cycle
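As a sketch of how the consumer level can retain the state of consumed messages, the fragment below commits offsets at a fixed interval instead of after every record, so that after a failure the consumer resumes from the last committed offset and only re-reads the records processed since that checkpoint. It assumes the kafka-python client, a local broker, and a hypothetical topic and group name; the interval value is illustrative and is not the optimal Tc derived in the next section.

```python
import time
from kafka import KafkaConsumer

CHECKPOINT_INTERVAL = 10.0  # seconds; illustrative fixed interval, not the computed Tc

def process(value: bytes) -> None:
    # Hypothetical application-level work on one record.
    print(value.decode("utf-8"))

consumer = KafkaConsumer(
    "pipeline-topic",
    bootstrap_servers="localhost:9092",
    group_id="pipeline-consumers",
    enable_auto_commit=False,        # offsets are committed only at checkpoint time
    auto_offset_reset="earliest",
)

last_checkpoint = time.time()
for record in consumer:
    process(record.value)
    if time.time() - last_checkpoint >= CHECKPOINT_INTERVAL:
        consumer.commit()            # checkpoint: persist the consumed offsets
        last_checkpoint = time.time()
```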
When a failure occurs, the system restarts the whole process in real-time processing. Our experiment proves that this problem can be solved by using the coordinated checkpoint mechanism.

Figure 3. The Checkpoint Interval: (a) without checkpoint and (b) with checkpoint

Firstly, the system calculates the MTBF (mean time between failures). The mean time between failures is the predicted elapsed time between inherent failures of a mechanical or electronic system during normal system operation.

MTBF = Total uptime / Number of breakdowns        (1)

Total uptime is the total time measurement of Kafka processing based on data sizes. The number of breakdowns is the count of failures that occur in Kafka processing, which depends on the publishing time into the Kafka topic. The MTBF applies to the optimal checkpoint interval calculation. Ts represents the default save time for each checkpoint onto the local disk in Kafka processing. The system calculates the optimal checkpoint interval by using Eq. (2).

Figure 4. Fixed Checkpoint Interval (the execution time until a failure, Ti, in cycle i is composed of checkpoint intervals Tc, checkpoint save times Ts, and the time required to restart after an application failure)

The system determines the number of checkpoints taken in the ith cycle by using Eq. (4):

Ni = ⌊Ti / (Tc + Ts)⌋        (4)

Then, the checkpointing cost in the ith cycle is computed in Eq. (5). The checkpoint cost depends on the number of checkpoints:

CCi = Ni Ts        (5)

Eq. (6) expresses the calculation of the rollback cost in our system. The rollback cost is inversely proportional to the number of checkpoints:

Rbi = Ti − Ni (Tc + Ts)        (6)

The rollback cost, checkpoint cost, and restart cost are essential for determining the wasted time in the ith cycle of Kafka processing. The restart time R is taken from the system restart time after a failure in Kafka. Eq. (7) is essential to prove the improvement of our system:

TLi = CCi + Rbi + R        (7)
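As an illustration of Eqs. (1) and (4)-(7), the following Python sketch computes the per-cycle checkpoint cost, rollback cost, and total time lost from the notation of Table 1. It is a minimal sketch: the uptime, breakdown count, Tc, Ts, Ti, and R values are hypothetical placeholders, not measurements from our experiments.

```python
import math

def mtbf(total_uptime: float, num_breakdowns: int) -> float:
    """Eq. (1): mean time between failures."""
    return total_uptime / num_breakdowns

def cycle_costs(Ti: float, Tc: float, Ts: float, R: float):
    """Eqs. (4)-(7): costs of the ith cycle under a fixed checkpoint interval."""
    Ni = math.floor(Ti / (Tc + Ts))      # Eq. (4): checkpoints taken before the failure
    CCi = Ni * Ts                        # Eq. (5): total checkpointing cost
    Rbi = Ti - Ni * (Tc + Ts)            # Eq. (6): rollback (re-execution) cost
    TLi = CCi + Rbi + R                  # Eq. (7): total time lost in the cycle
    return Ni, CCi, Rbi, TLi

if __name__ == "__main__":
    # Hypothetical values in seconds, for illustration only.
    Tf = mtbf(total_uptime=3600.0, num_breakdowns=4)
    Ni, CCi, Rbi, TLi = cycle_costs(Ti=900.0, Tc=60.0, Ts=2.0, R=15.0)
    print(f"MTBF={Tf:.1f}s  Ni={Ni}  CCi={CCi:.1f}s  Rbi={Rbi:.1f}s  TLi={TLi:.1f}s")
```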
5. Experimental Setup and Results

A. Experiment 1: Number of Lost Messages on Various Partitions

The system is evaluated on various data sizes and estimates the probability of failure in this system. The optimal checkpoint interval affects the comparison of the total time cost.
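A minimal sketch of how the number of lost messages can be measured, assuming the kafka-python client and a local broker at localhost:9092; the topic name and the produced-message count are hypothetical and stand in for the actual experimental workload. It compares the number of records the producer attempted to publish with the sum of the end offsets over all partitions of the topic.

```python
from kafka import KafkaConsumer, TopicPartition

def count_stored_messages(bootstrap: str, topic: str) -> int:
    """Sum the end offsets of every partition of the topic (total messages stored)."""
    consumer = KafkaConsumer(bootstrap_servers=bootstrap)
    partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
    end_offsets = consumer.end_offsets(partitions)  # next offset per partition
    consumer.close()
    return sum(end_offsets.values())

if __name__ == "__main__":
    produced = 100_000  # hypothetical: records the producer attempted to publish
    stored = count_stored_messages("localhost:9092", "pipeline-topic")
    print(f"published={produced}  stored={stored}  lost={produced - stored}")
```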
The system proves that the coordinated checkpointing mechanism can reduce the total time cost of Kafka processing by 36%, 30%, 31%, and 25%, respectively, depending on the different data sizes (datasets).

The system emphasizes reliable messaging in a real-time big data pipeline architecture. The coordinated checkpoint mechanism improves the fault tolerance strategy of the original Apache Kafka. As a result, the fixed checkpoint interval method is adopted to recover lost messages and reduce the re-execution time after server failures. The system outperforms the original process in the Apache Kafka pipeline architecture. According to the experimental results, using the fixed checkpoint interval method reduces the total time cost by 30% on average compared with the original system.

In the future, we plan to measure the performance of message recovery in Kafka processing based on checkpointing. By analyzing the performance of message recovery, we can develop more reliable data in the real-time messaging system.

7. References

[1] D. P. Acharjya, Kauser Ahmed P, "A Survey on Big Data Analytics: Challenges, Open Research Issues, and Tools", International Journal of Advanced Computer Science and Applications, 2(7), 2016, pp. 511-518.

[3] http://zookeeper.apache.org/

[8] Manoj Kumar Niranjan, "Protocol for Coordinated Checkpointing using Smart Interval with Dual Coordinator", International Journal of Computer Applications (0975-8887), Volume 99, No. 11, August 2014.

[9] Martin Kleppmann, "Kafka, Samza and the Unix Philosophy of Distributed Data", Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, July 2016.

[10] Nishant Garg, "Apache Kafka", Packt Publishing, UK, October 2013.

[11] Xiaohui Wei, Yuan Zhuang, Hongliang Li, Zhiliang Liu, "Reliable stream data processing for elastic distributed stream processing systems", May 2019.