Chapter 10 Kafka - Distributed Publish-Subscribe Messaging System


Foreword

• This chapter describes the basic concepts, architecture, and functions of
Kafka. It is important to know how Kafka ensures reliability for data
storage and transmission and how historical data is processed.

1 Huawei Confidential
Objectives

• After completing this course, you will understand:
• Basic concepts of the message system
• Kafka system architecture

Contents

1. Introduction

2. Architecture and Functions

3. Data Management

Introduction
• Kafka is a distributed, partitioned, replicated, and ZooKeeper-based messaging system. It supports
multiple subscribers and was originally developed by LinkedIn.
• Main application scenarios: log collection systems and messaging systems.
• A robust message queue is the foundation of distributed messaging, in which messages are queued
asynchronously between client applications and the messaging system. Two messaging
patterns are available: point-to-point messaging and publish-subscribe (pub-sub) messaging.
Most messaging systems follow the pub-sub pattern, and Kafka implements it as well.

Point-to-Point Messaging
• In a point-to-point messaging system, messages are persisted in a queue. One or
more consumers can read from the queue, but each message can be consumed by
at most one consumer. Once a consumer reads a message, the message is removed
from the queue. This pattern preserves the order in which messages are handed out, even when multiple
consumers read from the queue at the same time.

Publish-Subscribe Messaging
• In a publish-subscribe messaging system, messages are persisted in a topic. Unlike in a
point-to-point messaging system, a consumer can subscribe to one or more topics and consume all the
messages in those topics. One message can be consumed by more than one consumer, and a
consumed message is not deleted immediately. In a publish-subscribe messaging system,
message producers are called publishers and message consumers are called subscribers.
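
The difference between the two patterns can be sketched in a few lines of Python. This is only an in-memory illustration, not Kafka code: the class names PointToPointQueue and PubSubTopic and their methods are hypothetical, chosen for this example.

```python
from collections import deque


class PointToPointQueue:
    """Each message is delivered to at most one consumer, then removed."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Reading removes the message, so no other consumer can see it.
        return self._messages.popleft() if self._messages else None


class PubSubTopic:
    """Messages are retained; every subscriber reads all of them."""

    def __init__(self):
        self._messages = []
        self._positions = {}  # subscriber name -> index of next message to read

    def publish(self, message):
        self._messages.append(message)

    def subscribe(self, subscriber):
        self._positions.setdefault(subscriber, 0)

    def poll(self, subscriber):
        # Reading only advances this subscriber's position; nothing is deleted.
        start = self._positions[subscriber]
        self._positions[subscriber] = len(self._messages)
        return self._messages[start:]
```

In the queue, a second consumer calling receive() gets the next message, never the same one. In the topic, two subscribers that poll() after the same publishes both receive every message.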

Kafka Features
• Kafka persists messages with O(1) time complexity and maintains constant-time data access
performance even with terabytes of data.
• Kafka provides high throughput. Even on inexpensive commodity machines, a single node can transmit
100,000 messages per second.
• Kafka supports message partitioning and distributed consumption, and ensures that messages are transmitted
in sequence within each partition.
• Kafka supports offline and real-time data processing.
• Kafka supports scale-out.

[Figure: Kafka at the center of a data pipeline. Labels: Frontend Producer, Backend Producer, Flume, Storm, Hadoop, Spark, Farmer.]

Contents

1. Introduction

2. Architecture and Functions

3. Data Management

Kafka Topology

[Figure: Kafka topology. Producers (three front ends and a service) push messages to the Kafka brokers.
The brokers are coordinated through ZooKeeper. Consumers (Hadoop cluster, real-time monitoring,
other service, and data warehouse) pull messages from the brokers.]

Kafka Basic Concepts

Broker: A Kafka cluster contains one or more service instances, which are called brokers.

Topic: Each message published to the Kafka cluster has a category, which is called a topic.

Partition: Kafka divides a topic into one or more partitions. Each partition physically
corresponds to a directory for storing all messages of the partition.

Producer: A client that sends messages to Kafka brokers.

Consumer: A client that reads and consumes messages from Kafka
brokers.

Consumer Group: Each consumer belongs to a given consumer group. You can specify a group
name for each consumer.

Kafka Topics
• Each message published to Kafka belongs to a category, which is called a topic. A topic can also
be interpreted as a message queue. For example, weather can be regarded as a topic (queue)
that stores daily temperature information.

[Figure: a Kafka topic drawn as a queue running from older messages to newer messages.
Producers 1 to n always append messages to the end of the queue. Consumer groups 1 and 2
use offsets to record their reading positions. Kafka prunes according to time and size.]

Kafka Partition
• To improve the throughput of Kafka, each topic is physically divided into one or more partitions.
Each partition is an ordered and immutable sequence of messages. Each partition physically
corresponds to a directory for storing all messages and index files of the partition.

[Figure: three partitions of a topic, each ordered from old to new. Partition 0 holds messages 0 to 12,
Partition 1 holds messages 0 to 9, and Partition 2 holds messages 0 to 12. Writes are appended at the new end.]

Kafka Partition Offset
• The position of each message in a partition's log file is called its offset, a long integer that
uniquely identifies the message within that partition. Consumers use topics, partitions, and offsets to track
records.

[Figure: consumer group C1 reading from three partitions, each ordered from old to new. Partition 0 holds
offsets 0 to 12, Partition 1 holds offsets 0 to 9, and Partition 2 holds offsets 0 to 12. Writes are appended at the new end.]
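
A minimal Python sketch of how offsets identify records, assuming an in-memory list stands in for a partition's log file. The Partition and Consumer classes and their method names are hypothetical, not part of any Kafka client.

```python
class Partition:
    """An append-only sequence of messages; the offset is the position."""

    def __init__(self):
        self._log = []

    def append(self, message):
        offset = len(self._log)  # the next offset equals the current length
        self._log.append(message)
        return offset

    def read(self, offset):
        return self._log[offset]

    def end_offset(self):
        return len(self._log)


class Consumer:
    """Tracks its position as one offset per (topic, partition) pair."""

    def __init__(self, topic_partitions):
        # topic_partitions: dict mapping (topic, partition_id) -> Partition
        self._partitions = topic_partitions
        self._next_offset = {key: 0 for key in topic_partitions}

    def poll(self, topic, partition_id):
        key = (topic, partition_id)
        partition = self._partitions[key]
        offset = self._next_offset[key]
        if offset >= partition.end_offset():
            return None  # nothing new in this partition
        self._next_offset[key] = offset + 1
        return offset, partition.read(offset)
```

Offsets restart at 0 in every partition, which is why a record is only identified by the combination of topic, partition, and offset.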

Kafka Offset Storage Mechanism
• After a consumer reads a message from the broker, the consumer can commit
the offset of that message, which saves the read position for the partition in
Kafka. When the consumer reads the partition again, reading starts from
the next message.
• This mechanism helps ensure that the same consumer does not consume data repeatedly
from Kafka.
• Offsets of consumer groups are stored in the internal __consumer_offsets topic.
  ◦ The partition that stores a group's offsets is given by: Math.abs(groupID.hashCode()) % 50
  ◦ Go to the kafka-logs directory and you will find multiple sub-directories. This is
because Kafka generates 50 __consumer_offsets-n directories by default, one per partition of that topic.
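
The formula above can be tried out in Python. The sketch below assumes that hashCode() is Java's String.hashCode() and re-implements it; the function names are hypothetical. One known difference: Python's abs never returns a negative number, so the Java edge case where the hash equals Integer.MIN_VALUE is not reproduced.

```python
def java_string_hash(text):
    """Reproduce Java's String.hashCode() as a signed 32-bit integer."""
    data = text.encode("utf-16-be")
    h = 0
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]  # one UTF-16 code unit
        h = (31 * h + unit) & 0xFFFFFFFF     # wrap like Java int arithmetic
    return h - 0x100000000 if h >= 0x80000000 else h


def offsets_partition(group_id, num_partitions=50):
    """Mirror Math.abs(groupID.hashCode()) % 50 from the slide."""
    return abs(java_string_hash(group_id)) % num_partitions
```

For a group named "hello", the hash is 99162322, so its offsets would land in partition 22, that is, in the __consumer_offsets-22 directory.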

Kafka Consumer Group
• Each consumer belongs to a consumer group. Each message can be consumed
by multiple consumer groups, but by only one consumer within each group. That
is, data is shared between groups but exclusive within a group.

[Figure: a Kafka cluster with two servers. Server 1 hosts partitions P0 and P3, and Server 2 hosts P1 and P2.
Consumers C1 to C6 are divided between consumer group A and consumer group B.]
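
One way to picture "shared between groups, exclusive within a group" is to assign every partition to exactly one consumer per group. The sketch below uses a simple round-robin rule; this rule, the function name, and the split of C1 to C6 into the two groups are assumptions for illustration, not a description of Kafka's actual assignment strategies.

```python
def assign_partitions(partitions, consumers):
    """Spread the partitions over the consumers of one group (round-robin)."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = ["P0", "P1", "P2", "P3"]
# Hypothetical group membership: two consumers in A, four in B.
group_a = assign_partitions(partitions, ["C1", "C2"])
group_b = assign_partitions(partitions, ["C3", "C4", "C5", "C6"])
```

Every partition appears once in group_a and once in group_b, so both groups see all messages, while inside a group each partition has a single owner.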

Other Important Concepts
• Replica:
  ◦ A copy of a partition, which guarantees high availability of the partition.
• Leader:
  ◦ A role played by one replica of a partition. Producers and consumers interact only with the leader.
• Follower:
  ◦ A role played by the other replicas, which replicate data from the leader.
• Controller:
  ◦ A server in a Kafka cluster that handles leader election and failover.

Contents

1. Introduction

2. Architecture and Functions

3. Data Management
• Data Storage Reliability
• Data Transmission Reliability
• Old Data Processing Methods

Kafka Partition Replica

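[Figure: a Kafka cluster of four brokers. Broker 1 holds Partition-0 and Partition-3, Broker 2 holds
Partition-1 and Partition-0, Broker 3 holds Partition-2 and Partition-1, and Broker 4 holds Partition-3
and Partition-2, so each partition is stored on two brokers.]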

Kafka Partition Replica

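[Figure: the follower pulls data from the leader. The producer writes to the leader partition, which holds
messages 1 to 10 (old to new), and an ack is returned to the producer. A replica fetcher thread copies
messages to the follower partition, which so far holds messages 1 to 7.]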

Kafka HA
• A partition may have multiple replicas (set with
default.replication.factor=N in the server.properties configuration).
• Without replicas, once a broker fails, none of the partition data on that
broker can be consumed, and producers cannot store data in those
partitions.
• With replication, a partition has multiple replicas. One of
these replicas is elected as the leader. Producers and consumers interact
only with the leader, and the remaining replicas act as followers that copy
messages from the leader.

Leader Failover (1)
• A new leader must be elected when the existing one fails. The
new leader must have all the messages
committed by the old leader.
• According to the write process, all replicas in the ISR (in-sync replicas) have fully
caught up with the leader. Only replicas in the ISR can be elected as the leader.
• With f+1 replicas, a partition can tolerate f failures without losing
committed messages.

Leader Failover (2)
• If all replicas of a partition fail, there are two options:
  ◦ Wait for a replica in the ISR to come back to life and choose it as the
leader. This ensures that no data is lost, but may take a long time.
  ◦ Choose the first replica (not necessarily in the ISR) that comes back to life as the
leader. This cannot guarantee that no data is lost, but the period of unavailability is relatively
short.
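
The two failover options can be sketched as a single decision function. This is a simplified model, not Kafka's controller logic: the function name and parameters are hypothetical, and `alive` is assumed to list the recovered brokers in the order in which they came back.

```python
def elect_leader(isr, alive, allow_out_of_sync=False):
    """Pick a new leader, or return None if the partition stays unavailable.

    isr: set of broker IDs that were in sync with the old leader.
    alive: list of live broker IDs, ordered by the time they came back.
    allow_out_of_sync: if True, fall back to the first live replica even
        when it is not in the ISR (shorter outage, possible data loss).
    """
    for broker in alive:
        if broker in isr:
            return broker  # safe choice: has every committed message
    if allow_out_of_sync and alive:
        return alive[0]    # risky choice: may miss committed messages
    return None            # keep waiting for an ISR replica
```

In Kafka itself, the choice between the two options is controlled by the unclean.leader.election.enable setting.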

Contents

1. Introduction

2. Architecture and Functions

3. Data Management
• Data Storage Reliability
• Data Transmission Reliability
• Old Data Processing Methods

Kafka Data Reliability
• All Kafka messages are persisted on disk. When you configure Kafka, you
need to set the replication factor for topic partitions, which ensures data reliability.
• How is data reliability ensured during message delivery?

Message Delivery Semantics
• There are three message delivery modes:
  ◦ At Most Once
    • Messages may be lost.
    • Messages are never redelivered or reprocessed.
  ◦ At Least Once
    • Messages are never lost.
    • Messages may be redelivered and reprocessed.
  ◦ Exactly Once
    • Messages are never lost.
    • Messages are processed only once.
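
On the consumer side, the first two modes differ only in whether the offset is committed before or after a message is processed. The simulation below is a sketch under that assumption; run_consumer and its parameters are hypothetical, and the crash is modeled as happening once, between the two steps, for the message at offset crash_at.

```python
def run_consumer(messages, commit_first, crash_at=None):
    """Return the list of messages that end up being processed."""
    processed = []
    committed = 0      # offset from which the consumer resumes after a restart
    crashed = False
    while committed < len(messages):
        offset = committed
        crash_now = offset == crash_at and not crashed
        if commit_first:
            committed = offset + 1   # at most once: commit, then process
            if crash_now:
                crashed = True
                continue             # restart: this message is never processed
            processed.append(messages[offset])
        else:
            processed.append(messages[offset])  # at least once: process first
            if crash_now:
                crashed = True
                continue             # restart: offset not committed, re-read
            committed = offset + 1
    return processed
```

With messages a, b, c and a crash at offset 1, committing first yields a, c (b is lost), while processing first yields a, b, b, c (b is processed twice). Exactly-once behavior needs additional mechanisms, such as the idempotency described next.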

Reliability Assurance - Idempotency
• An idempotent operation is an operation that has the same effect whether it is
performed once or multiple times.
• Principles:
  ◦ Each batch of messages sent to Kafka contains a sequence number that the
broker uses to deduplicate data.
  ◦ The sequence number is persisted to the replica log. Therefore, even if the
leader of the partition fails and another broker takes over as leader, the new leader can
still determine whether a resent message is a duplicate.
  ◦ The overhead of this mechanism is very low: each batch of messages has only a few
additional fields.
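
A minimal sketch of sequence-number deduplication on the broker side. It assumes that each producer is identified by a producer_id and that sequence numbers increase by batch; both details, and the PartitionLog class itself, are simplifications introduced for this example rather than Kafka's actual bookkeeping.

```python
class PartitionLog:
    """Append batches, silently dropping batches that were already written."""

    def __init__(self):
        self.records = []
        self._last_sequence = {}  # producer_id -> last sequence number written

    def append_batch(self, producer_id, sequence, batch):
        last = self._last_sequence.get(producer_id, -1)
        if sequence <= last:
            return False  # duplicate: a retry of a batch that already succeeded
        self.records.extend(batch)
        self._last_sequence[producer_id] = sequence
        return True
```

If a producer resends batch 0 because it never saw the ack, the second append_batch call returns False and the log still contains each message only once.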

Reliability Assurance - Acks Mechanism
• The producer can require an acknowledgement (ACK) from the server after the server receives the data. The acks
configuration sets the number of acknowledgements the producer requires before it considers a request complete, and it
therefore controls the durability of the records that are sent. The following settings are common options:
  ◦ acks=0: The producer does not wait for any acknowledgment from the server at all. The record is
immediately added to the socket buffer and considered sent. There is no guarantee that the server has
received the record in this case, and the retries configuration does not take effect (as the client generally does not know of
any failures). The offset given back for each record is always set to -1.
  ◦ acks=1: The leader writes the record to its local log and responds without awaiting full
acknowledgement from all followers. In this case, if the leader fails immediately after acknowledging the record but
before the followers have replicated it, the record is lost.
  ◦ acks=all: The leader waits for the full set of in-sync replicas to acknowledge the record. This
guarantees that the record is not lost as long as at least one in-sync replica remains alive. This is the strongest
available guarantee.
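
The three settings can be summarized as the number of replicas that must hold a record before the producer is acknowledged. The helper below is a hypothetical illustration of that rule, assuming the in-sync replica list includes the leader; it is not part of any Kafka client API.

```python
def required_acknowledgements(acks, in_sync_replicas):
    """Replicas that must have the record before the producer gets an ack."""
    if acks == 0:
        return 0                      # fire and forget
    if acks == 1:
        return 1                      # the leader only
    if acks == "all":
        return len(in_sync_replicas)  # the leader and every in-sync follower
    raise ValueError(f"unsupported acks value: {acks!r}")
```

For a partition whose ISR is three brokers, acks=all means three copies exist before the ack, which is why the record survives as long as one of them stays alive.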

Contents

1. Introduction

2. Architecture and Functions

3. Data Management
• Data Storage Reliability
• Data Transmission Reliability
• Old Data Processing Methods

Old Data Processing Methods
• In Kafka, each partition of a topic is subdivided into segments, which makes it easier to
periodically clear or delete old segment files to free up space.
-rw------- 1 omm wheel 10485760 Jun 13 13:44 00000000000000000000.index
-rw------- 1 omm wheel 1081187 Jun 13 13:45 00000000000000000000.log
• In traditional message queues, messages that have been consumed are deleted.
The Kafka cluster, however, retains all messages regardless of whether they have
been consumed. Because disk space is limited, it is impossible (and unnecessary) to retain all data
permanently. Therefore, Kafka needs to process old data.
• Old data processing is configured in the Kafka server properties file:

$KAFKA_HOME/config/server.properties

Kafka Log Cleanup
• Log cleanup policies: delete and compact.
• Thresholds for deleting logs: retention time limit and size of all logs in a
partition.

Configuration | Default Value | Description | Range
log.cleanup.policy | delete | Log segments will be deleted when they reach the time limit (beyond the retention time). This can take either the value delete or compact. | delete or compact
log.retention.hours | 168 | Maximum period to keep a log segment before it is deleted. Unit: hour. | 1 - 2147483647
log.retention.bytes | -1 | Maximum size of log data in a partition. By default, the value is not restricted. Unit: byte. | -1 - 9223372036854775807
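
The two thresholds of the delete policy can be sketched as follows. This is a simplified model: the Segment class, its fields, and the function name are hypothetical, segments are assumed to be listed oldest first, and details of the real broker (such as how segment age is measured) are ignored.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base_offset: int
    size_bytes: int
    age_hours: float


def segments_to_delete(segments, retention_hours=168, retention_bytes=-1):
    """Return the segments that the delete policy would remove."""
    # Time threshold: anything older than the retention period goes.
    doomed = [s for s in segments if s.age_hours > retention_hours]
    remaining = [s for s in segments if s.age_hours <= retention_hours]
    # Size threshold: -1 means unlimited; otherwise drop the oldest
    # segments until the partition fits into retention_bytes.
    if retention_bytes >= 0:
        total = sum(s.size_bytes for s in remaining)
        while remaining and total > retention_bytes:
            oldest = remaining.pop(0)
            total -= oldest.size_bytes
            doomed.append(oldest)
    return doomed
```

With the defaults (168 hours, unlimited size), only segments older than seven days are selected.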

Kafka Log Compact
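
A minimal sketch of the compact policy, assuming each record is a (key, value) pair (the record format is an assumption made for this example). Compaction keeps only the most recent value for each key; the function name is hypothetical.

```python
def compact(records):
    """Keep the latest record per key, preserving the original log order."""
    latest = {}
    for offset, (key, value) in enumerate(records):
        latest[key] = (offset, value)  # a later record overwrites earlier ones
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (offset, value) in survivors]
```

For the log (k1, a), (k2, b), (k1, c), compaction keeps (k2, b) and (k1, c), because the older value of k1 has been superseded.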

Summary

• This chapter describes the basic concepts of the message system, and Kafka
application scenarios, system architecture, and data management.

Quiz
1. Which of the following are characteristics of Kafka? ( )
A. High throughput
B. Distributed
C. Message persistence
D. Random message reading

2. Which of the following components does a Kafka cluster depend on at run time? ( )
A. HDFS
B. ZooKeeper
C. HBase
D. Spark

Recommendations

• Huawei Cloud official website:
  ◦ https://www.huaweicloud.com/intl/en-us/
• Huawei MRS documentation:
  ◦ https://www.huaweicloud.com/intl/en-us/product/mrs.html
• Huawei TALENT ONLINE:
  ◦ https://e.huawei.com/en/talent/#/

Thank you.

Bring digital to every person, home, and
organization for a fully connected,
intelligent world.

Copyright © 2020 Huawei Technologies Co., Ltd.
All Rights Reserved.

The information in this document may contain predictive
statements including, without limitation, statements regarding
the future financial and operating results, future product
portfolio, new technology, etc. There are a number of factors that
could cause actual results and developments to differ materially
from those expressed or implied in the predictive statements.
Therefore, such information is provided for reference purpose
only and constitutes neither an offer nor an acceptance. Huawei
may change the information at any time without notice.
