Chapter 10 Kafka Distributed Publish-Subscribe Messaging System
1 Huawei Confidential
Objectives
Contents
1. Introduction
3. Data Management
Introduction
Kafka is a distributed, partitioned, replicated, ZooKeeper-based messaging system. It supports
multiple subscribers and was originally developed by LinkedIn.
Main application scenarios: log collection systems and messaging systems.
A robust message queue is the foundation of distributed messaging: messages are queued
asynchronously between client applications and the messaging system. Two messaging
patterns are available: point-to-point messaging and publish-subscribe (pub-sub) messaging.
Most messaging systems follow the pub-sub pattern, and Kafka implements it as well.
Point-to-Point Messaging
In a point-to-point messaging system, messages are persisted in a queue. One or more
consumers can consume the messages in the queue, but each message can be consumed by at most
one consumer. Once a consumer reads a message, the message disappears from the queue. This
pattern preserves the processing order of the data even when multiple consumers consume
data at the same time.
Publish-Subscribe Messaging
In a publish-subscribe messaging system, messages are persisted in a topic. Unlike in a
point-to-point messaging system, a consumer can subscribe to one or more topics and consume all the
messages in those topics. One message can be consumed by more than one consumer, and a
consumed message is not deleted immediately. In a publish-subscribe messaging system,
message producers are called publishers and message consumers are called subscribers.
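The difference between the two patterns can be sketched with a small in-memory model. This is only an illustration: the class and method names (PointToPointQueue, PubSubTopic, poll, and so on) are hypothetical and are not part of any Kafka API.

```python
from collections import deque


class PointToPointQueue:
    """Each message is delivered to at most one consumer, then removed."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Reading removes the message, so no other consumer can see it.
        return self._messages.popleft() if self._messages else None


class PubSubTopic:
    """Messages stay in the topic; every subscriber reads all of them."""

    def __init__(self):
        self._messages = []
        self._positions = {}  # subscriber name -> index of the next message to read

    def publish(self, message):
        self._messages.append(message)

    def subscribe(self, name):
        self._positions[name] = 0

    def poll(self, name):
        # Return this subscriber's unread messages without deleting them.
        start = self._positions[name]
        self._positions[name] = len(self._messages)
        return self._messages[start:]


queue = PointToPointQueue()
for m in ("m1", "m2"):
    queue.send(m)
first = queue.receive()   # taken by one consumer
second = queue.receive()  # a second consumer gets the next message, never "m1"

topic = PubSubTopic()
topic.subscribe("A")
topic.subscribe("B")
for m in ("m1", "m2"):
    topic.publish(m)
seen_by_a = topic.poll("A")
seen_by_b = topic.poll("B")
print(first, second, seen_by_a, seen_by_b)
```

In the queue, "m1" is gone after the first read. In the topic, both subscribers receive both messages.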
Kafka Features
Kafka persists messages with O(1) time complexity and maintains constant-time data access
performance even with terabytes of data.
Kafka provides high throughput. Even on inexpensive commodity machines, a single node can transmit
100,000 messages per second.
Kafka supports message partitioning and distributed consumption, and ensures that messages are
transmitted in order within each partition.
Kafka supports offline and real-time data processing.
Kafka supports scale-out.
[Figure: Kafka and its surrounding components. Labels: Frontend, Backend, Producer, Flume, Storm, Kafka, Hadoop, Spark, Farmer.]
Kafka Topology
[Figure: Kafka topology with three Kafka brokers and ZooKeeper nodes.]
Kafka Basic Concepts
Broker: A Kafka cluster contains one or more service instances, which are called brokers.
Topic: Each message published to the Kafka cluster has a category, which is called a topic.
Partition: Kafka divides a topic into one or more partitions. Each partition physically
corresponds to a directory for storing all messages of the partition.
Consumer: A client that reads and consumes messages from Kafka brokers.
Consumer Group: Each consumer belongs to a given consumer group. You can specify a group
name for each consumer.
Kafka Topics
Each message published to Kafka belongs to a category, which is called a topic. A topic can also
be interpreted as a message queue. For example, weather can be regarded as a topic (queue)
that stores daily temperature information.
[Figure: A Kafka topic read by consumer group 1, with new messages arriving at the end.]
Producers always append messages to the end of a queue.
Kafka Partition
To improve the throughput of Kafka, each topic is physically divided into one or more partitions.
Each partition is an ordered and immutable sequence of messages. Each partition physically
corresponds to a directory for storing all messages and index files of the partition.
[Figure: A topic with three partitions (Partition 0, 1, and 2). Each partition holds messages numbered from 0, ordered from old to new; writes are appended at the new end.]
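A topic split into append-only partitions can be modeled roughly as follows. The Topic class is hypothetical, and choosing the partition from a CRC32 hash of a message key is an assumption made for this sketch; it is not Kafka's own partitioner.

```python
import zlib


class Topic:
    """A topic modeled as a fixed number of append-only partition logs."""

    def __init__(self, name, num_partitions):
        self.name = name
        # One list per partition; the list index plays the role of the offset.
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Assumption: pick the partition from a CRC32 hash of the key,
        # so that messages with the same key land in the same partition.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append((key, value))
        return partition, len(log) - 1  # (partition, offset of the new message)


weather = Topic("weather", num_partitions=3)
p1, o1 = weather.append("beijing", "21C")
p2, o2 = weather.append("beijing", "23C")
print(p1, o1, p2, o2)
```

Messages are only ever appended, so the order within one partition is fixed. No order is defined across partitions.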
Kafka Partition Offset
The position of a message in a partition's log is called its offset, a long integer that
uniquely identifies the message within that partition. Consumers use topics, partitions, and
offsets to track records.
[Figure: Consumer group C1 reading a topic with three partitions (Partition 0, 1, and 2), with messages numbered by offset from old to new.]
Kafka Offset Storage Mechanism
After a consumer reads a message from the broker, the consumer can commit
the offset of that message, which saves the consumer's position in the partition in
Kafka. When the consumer reads the partition again, reading starts from
the next message.
This mechanism helps ensure that the same consumer does not consume data repeatedly
from Kafka.
Offsets of consumer groups are stored in the internal __consumer_offsets topic. The partition
that holds the offsets of a group is given by the formula:
Math.abs(groupID.hashCode()) % 50
The kafka-logs directory contains multiple __consumer_offsets-n sub-directories because
Kafka creates 50 partitions for __consumer_offsets by default.
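The commit-and-resume behavior and the partition formula above can be sketched as follows. OffsetStore and its methods are hypothetical names for this illustration. java_string_hash re-implements Java's String.hashCode() so that the formula can be evaluated outside the JVM; the Integer.MIN_VALUE corner case of Math.abs is ignored.

```python
def java_string_hash(s):
    """Emulate Java's String.hashCode() as a signed 32-bit integer."""
    h = 0
    data = s.encode("utf-16-be")  # Java hashes UTF-16 code units
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        h = (31 * h + unit) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def offsets_partition(group_id, num_partitions=50):
    """Math.abs(groupID.hashCode()) % 50, ignoring the Integer.MIN_VALUE case."""
    return abs(java_string_hash(group_id)) % num_partitions


class OffsetStore:
    """Hypothetical stand-in for the committed offsets kept by Kafka."""

    def __init__(self):
        self._committed = {}  # (group, topic, partition) -> next offset to read

    def commit(self, group, topic, partition, last_read_offset):
        self._committed[(group, topic, partition)] = last_read_offset + 1

    def next_offset(self, group, topic, partition):
        return self._committed.get((group, topic, partition), 0)


log = ["a", "b", "c", "d"]               # one partition of a hypothetical topic "t"
store = OffsetStore()
start = store.next_offset("g1", "t", 0)  # nothing committed yet, so start at 0
first_read = log[start:start + 2]        # read two messages
store.commit("g1", "t", 0, last_read_offset=start + 1)
resumed = log[store.next_offset("g1", "t", 0):]  # continues after the commit
print(first_read, resumed, offsets_partition("hello"))
```

After the commit, the second read starts at the next message rather than at the beginning of the partition.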
Kafka Consumer Group
Each consumer belongs to a consumer group. Each message can be consumed
by multiple consumer groups, but by only one consumer within each group. That
is, data is shared between groups but exclusive within a group.
[Figure: A Kafka cluster of two servers. Server 1 hosts partitions P0 and P3, and Server 2 hosts P1 and P2; consumers C1 to C6 read from them.]
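"Shared between groups, exclusive within a group" can be illustrated by assigning partitions to the consumers of each group. The round-robin strategy and the split of C1 to C6 into two groups are assumptions made for this sketch; they are not stated in the source.

```python
def assign(partitions, consumers):
    """Round-robin: each partition goes to exactly one consumer in the group."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = ["P0", "P1", "P2", "P3"]
# Assumed grouping: C1 and C2 form one group, C3 to C6 form another.
group_a = assign(partitions, ["C1", "C2"])
group_b = assign(partitions, ["C3", "C4", "C5", "C6"])
print(group_a)
print(group_b)
```

Every partition appears once in each group, so both groups see all the data, while no two consumers in the same group read the same partition.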
Other Important Concepts
Replica:
A copy of a partition, which guarantees high availability of the partition.
Leader:
A replica role. Producers and consumers interact only with the leader.
Follower:
A replica role. Followers replicate data from the leader.
Controller:
A broker in the Kafka cluster that handles leader election and failover.
Contents
1. Introduction
3. Data Management
Data Storage Reliability
Data Transmission Reliability
Old Data Processing Methods
Kafka Partition Replica
[Figure: Partition replicas in a Kafka cluster. A producer writes to the new end of a partition log; the two logs shown hold messages 1 to 10 and 1 to 7, each ordered from old to new.]
Kafka HA
A partition may have multiple replicas (configured with
default.replication.factor=N in server.properties).
Without replicas, if a broker fails, none of the partition data on that
broker can be consumed, and producers cannot store data in those
partitions.
With replication, each partition has multiple replicas. One of
these replicas is elected as the leader. Producers and consumers interact
only with the leader, and the remaining replicas act as followers that copy
messages from the leader.
Leader Failover (1)
A new leader must be elected when the existing one fails. The
new leader must have all the messages
committed by the old leader.
According to the write process, all replicas in the ISR (in-sync replicas) have fully
caught up with the leader. Only replicas in the ISR can be elected as the leader.
With f+1 replicas, a partition can tolerate f failures without losing
committed messages.
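The f+1 rule can be checked with a small simulation. elect_leader is a hypothetical helper, not Kafka's controller logic, and the sketch assumes that all three replicas are in the ISR when the failures begin.

```python
def elect_leader(isr, alive):
    """Return the first ISR replica that is still alive, or None if there is none."""
    for replica in isr:
        if replica in alive:
            return replica
    return None


isr = ["broker-1", "broker-2", "broker-3"]  # f + 1 = 3 replicas, all in sync
alive = set(isr)
leaders = []
for failed in isr:
    alive.discard(failed)                   # one more replica fails
    leaders.append(elect_leader(isr, alive))
print(leaders)
```

A leader can still be elected after the first and second failures (f = 2). Only the third failure leaves the partition without a leader.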
Leader Failover (2)
If all replicas fail, there are two options:
Wait for a replica in the ISR to come back to life and choose it as the
leader. This ensures that no data is lost, but may take a long time.
Choose the first replica (not necessarily in the ISR) that comes back to life as the
leader. This may lose data, but the unavailability time is relatively
short.
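The trade-off between the two options can be sketched as a single choice. choose_leader and its allow_unclean flag are hypothetical names for this illustration. In Kafka itself, the corresponding broker setting is unclean.leader.election.enable.

```python
def choose_leader(recovery_order, isr, allow_unclean):
    """Pick the first recovered replica that is acceptable as the new leader."""
    for replica in recovery_order:
        if allow_unclean or replica in isr:
            return replica
    return None


# Assumed scenario: broker-3 (not in the ISR) recovers before broker-1 (in the ISR).
recovery_order = ["broker-3", "broker-1"]
isr = {"broker-1"}
safe = choose_leader(recovery_order, isr, allow_unclean=False)
fast = choose_leader(recovery_order, isr, allow_unclean=True)
print(safe, fast)
```

The first call waits for broker-1 and loses no committed data. The second call accepts broker-3 as soon as it returns, which shortens the outage but may drop messages that broker-3 never replicated.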
Kafka Data Reliability
All Kafka messages are persisted on disk. When you configure Kafka, you
set the replication factor for topic partitions, which ensures the reliability of stored data.
How is data reliability ensured during message delivery?
Message Delivery Semantics
There are three data delivery modes:
At Most Once
Messages may be lost.
Messages are never redelivered or reprocessed.
At Least Once
Messages are never lost.
Messages may be redelivered and reprocessed.
Exactly Once
Messages are never lost.
Messages are processed only once.
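The first two semantics differ mainly in whether the consumer commits its position before or after processing a message. The following simulation is a sketch under stated assumptions: consume is a hypothetical function, and the consumer crashes exactly once, between the two steps, while handling the message at index crash_at.

```python
def consume(messages, commit_before_processing, crash_at):
    """Process messages with one simulated crash between commit and processing."""
    processed = []
    committed = 0   # next offset to read after a restart
    offset = 0
    crashed = False
    while offset < len(messages):
        # Step 1: either commit first or process first.
        if commit_before_processing:
            committed = offset + 1
        else:
            processed.append(messages[offset])
        # Simulated crash between step 1 and step 2, then restart.
        if offset == crash_at and not crashed:
            crashed = True
            offset = committed
            continue
        # Step 2: the remaining action.
        if commit_before_processing:
            processed.append(messages[offset])
        else:
            committed = offset + 1
        offset += 1
    return processed


messages = ["a", "b", "c"]
at_most_once = consume(messages, commit_before_processing=True, crash_at=1)
at_least_once = consume(messages, commit_before_processing=False, crash_at=1)
print(at_most_once, at_least_once)
```

Committing first loses "b" and never repeats a message. Committing last never loses a message but processes "b" twice.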
Reliability Assurance - Idempotency
An idempotent operation has the same effect whether it is performed once or
multiple times.
Principles:
Each batch of messages sent to Kafka contains a sequence number that the
broker uses to deduplicate data.
The sequence number is persisted to the replica log. Therefore, even if the
leader of the partition fails and another broker takes over as leader, the new leader can
still determine whether a resent message is a duplicate.
The overhead of this mechanism is very low: each batch of messages has only a few
additional fields.
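Deduplication by sequence number can be sketched as follows. PartitionLog is a hypothetical class. Keying the sequence numbers by a producer ID and copying the whole log object to model a follower are simplifications made for this example.

```python
import copy


class PartitionLog:
    """A partition log that rejects batches it has already appended."""

    def __init__(self):
        self.records = []
        self.last_sequence = {}  # producer id -> highest sequence number appended

    def append(self, producer_id, sequence, batch):
        # A batch whose sequence number was already seen is a resend.
        if sequence <= self.last_sequence.get(producer_id, -1):
            return False
        self.records.extend(batch)
        self.last_sequence[producer_id] = sequence
        return True


leader = PartitionLog()
first_try = leader.append("p1", 0, ["a", "b"])
retry = leader.append("p1", 0, ["a", "b"])       # resent after a lost ACK
next_batch = leader.append("p1", 1, ["c"])

# The sequence state is stored with the log, so a replica that takes over
# as leader can still recognize the resend.
new_leader = copy.deepcopy(leader)
retry_on_new_leader = new_leader.append("p1", 1, ["c"])
print(leader.records, retry, retry_on_new_leader)
```

The resent batch is acknowledged as a duplicate and is not written a second time, on either the old or the new leader.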
Reliability Assurance - Acks Mechanism
After the server receives data, it sends an acknowledgement (ACK) to the producer. The acks
setting specifies how many such acknowledgements the producer requires before it considers a request
complete, and therefore controls the durability of the records that are sent. The following settings are common options:
acks=0: Zero indicates that the producer will not wait for any acknowledgment from the server at all. The record will be
immediately added to the socket buffer and considered sent. There is no guarantee that the server has successfully
received the record in this case, and the retry configuration will not take effect (as the client will not generally know of
any failures). The offset given back for each record will always be set to -1.
acks=1: This means that the leader will write the record to its local log but will respond without awaiting full
acknowledgement from all followers. In this case, if the leader fails immediately after acknowledging the record but
before the followers have replicated it, then the record will be lost.
acks=all: This means that the leader will wait for the full set of in-sync replicas to acknowledge the record. This
guarantees that the record will not be lost as long as at least one in-sync replica remains alive. This is the strongest
available guarantee.
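The three settings can be compared with a toy model of one leader and one in-sync follower. The send function is hypothetical and makes two simplifying assumptions: the record always reaches the leader, and the leader may fail immediately after it responds, before any further replication happens.

```python
def send(record, acks, leader_fails_after_responding):
    """Return (producer_got_ack, record_survives) for a single record."""
    leader_log, follower_log = [], []
    leader_log.append(record)          # the leader writes to its local log
    if acks == "all":
        follower_log.append(record)    # the ACK waits for the in-sync follower
    producer_got_ack = acks in (1, "all")  # acks=0 never waits for a response
    if leader_fails_after_responding:
        current_log = follower_log     # the follower takes over as leader
    else:
        current_log = leader_log
    return producer_got_ack, record in current_log


results = {acks: send("r1", acks, leader_fails_after_responding=True)
           for acks in (0, 1, "all")}
print(results)
```

With acks=1 the producer receives an ACK, yet the record is lost when the leader fails. Only acks=all keeps the record in this scenario.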
Old Data Processing Methods
In Kafka, each partition of a topic is subdivided into segments, which makes it easier to
periodically clear or delete old segment files to free up space.
-rw------- 1 omm wheel 10485760 Jun 13 13:44 00000000000000000000.index
-rw------- 1 omm wheel 1081187 Jun 13 13:45 00000000000000000000.log
In traditional message queues, messages that have been consumed are deleted.
The Kafka cluster, however, retains all messages regardless of whether they have
been consumed. Because disk space is limited, retaining all data permanently is impossible
(and unnecessary). Therefore, Kafka needs to process old data.
Old data processing is configured in the Kafka server properties file:
$KAFKA_HOME/config/server.properties
Kafka Log Cleanup
Log cleanup policies: delete and compact.
Thresholds for deleting logs: a retention time limit and a limit on the total size of the logs in a
partition.
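The two deletion thresholds can be sketched as a check over the segments of one partition. Segment and segments_to_delete are hypothetical names, and never deleting the newest segment is a simplification made for this example. In server.properties, the related settings include log.retention.hours and log.retention.bytes.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base_offset: int
    size_bytes: int
    last_modified: float  # seconds since some reference time


def segments_to_delete(segments, now, retention_seconds, retention_bytes):
    """Pick old segments to delete; segments are ordered oldest first."""
    doomed = []
    total = sum(s.size_bytes for s in segments)
    for seg in segments[:-1]:  # simplification: keep the newest segment
        too_old = now - seg.last_modified > retention_seconds
        too_big = total > retention_bytes
        if not (too_old or too_big):
            break
        doomed.append(seg)
        total -= seg.size_bytes
    return doomed


segments = [Segment(0, 100, 0.0), Segment(50, 100, 1000.0), Segment(120, 100, 2000.0)]
by_time = segments_to_delete(segments, now=2500.0,
                             retention_seconds=2000, retention_bytes=250)
by_size = segments_to_delete(segments, now=2500.0,
                             retention_seconds=10000, retention_bytes=150)
print([s.base_offset for s in by_time], [s.base_offset for s in by_size])
```

In the first call only the oldest segment exceeds the time limit. In the second call the two oldest segments are removed to bring the partition under the size limit.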
Kafka Log Compaction
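With the compact policy, Kafka retains at least the latest value for each message key. A minimal sketch of that idea, assuming keyed messages held in an in-memory list (compact is a hypothetical helper, not Kafka's log cleaner):

```python
def compact(log):
    """Keep only the newest record for each key, preserving offset order."""
    latest = {}
    for offset, (key, _value) in enumerate(log):
        latest[key] = offset  # later offsets overwrite earlier ones
    return [(offset, key, value)
            for offset, (key, value) in enumerate(log)
            if latest[key] == offset]


log = [("k1", "a"), ("k2", "b"), ("k1", "c"), ("k3", "d"), ("k2", "e")]
compacted = compact(log)
print(compacted)
```

The surviving records keep their original offsets; older values for k1 and k2 are dropped.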
Summary
This chapter describes the basic concepts of messaging systems and Kafka's
application scenarios, system architecture, and data management.
Quiz
1. Which of the following are characteristics of Kafka? ( )
A. High throughput
B. Distributed
C. Message persistence
2. Which of the following components does the Kafka cluster depend on at run time? ( )
A. HDFS
B. ZooKeeper
C. HBase
D. Spark
Recommendations
Thank you.