
Big Data Huawei Course

Kafka
NOTICE
This document was generated from Huawei study material. Treat the
information in this document as supporting material.

Centro de Inovação EDGE - Big Data Course


Table of Contents

1. Introduction
   1.1. Characteristics of Kafka
   1.2. Position of Kafka in FusionInsight HD
2. Architecture of Kafka
   2.1. Topic
   2.2. Partition
   2.3. Consumer Group
   2.4. Partition Offset
3. Kafka Partition Replica
4. Kafka Logs
5. Log Cleanup
6. Kafka Message Delivery
7. Kafka Cluster Mirroring
8. Read and Write Process in Kafka



Kafka - Huawei Course
1. Introduction

• In today's world, real-time information is continuously generated by applications of all kinds, whether business, social, or anything else. This information typically includes user behavior data, application performance traces, activity data in the form of logs, event messages, and so on.
• These logs and messages need easy ways to be routed reliably and quickly to multiple types of receivers. Most of the time, however, the applications producing the information and the applications consuming it cannot reach each other directly.
• Therefore, we need a messaging system responsible for transferring data from one application to another, so that each application can focus on the data without worrying about how to share it.
• These applications can use a message broker such as Kafka to transfer data.
• Kafka is a distributed publish-subscribe messaging system that can handle a high volume of data and enables applications to pass messages from one endpoint to another.
• In a publish-subscribe system, messages are persisted in a topic.
• Consumers can subscribe to one or more topics and consume all the messages in those topics.
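The publish-subscribe flow above can be sketched with a toy in-memory broker. This is purely an illustration of the topic/subscribe idea; a real Kafka client would use the producer and consumer APIs against a running cluster:

```python
from collections import defaultdict

class ToyBroker:
    """Minimal in-memory publish-subscribe broker (illustration only)."""

    def __init__(self):
        # Each topic is an append-only list of messages.
        self.topics = defaultdict(list)

    def publish(self, topic, message):
        # Producers push messages to the end of a topic.
        self.topics[topic].append(message)

    def consume(self, topic, offset=0):
        # Consumers pull all messages at or after their offset.
        return self.topics[topic][offset:]

broker = ToyBroker()
broker.publish("weather", "temp=21C")
broker.publish("weather", "temp=19C")
print(broker.consume("weather"))            # both messages
print(broker.consume("weather", offset=1))  # only the second
```

Note that consuming does not remove the message: the topic retains it, and each consumer tracks its own position via the offset.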

1.1. Characteristics of Kafka

• Kafka is mainly designed with the following characteristics:

o Message persistence - In Big Data, no information loss can be afforded. So, in Kafka, messages are persisted on disk as well as replicated within the cluster to prevent data loss.

o High throughput - Kafka is designed to run on commodity hardware and to support millions of messages read or written per second from multiple clients. It also supports distributed processing: messages are partitioned over servers, and consumption is distributed over a cluster of consumer machines.

o Real-time processing - Messages produced by producer threads should be immediately available to consumer threads. This feature is critical to event-based systems. Besides these features, Kafka also supports easy integration of clients from different platforms such as Java, PHP, Ruby, and Python.

o Offline and online message consumption - Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized operational data. It can also be used across an organization to collect logs from multiple services and make them available in a standard format to multiple consumers. Besides, it integrates very well with Apache Storm and Spark for real-time streaming data analysis, which can read data from a topic, process it, and write the processed data to a new topic where it becomes available for users and applications. Kafka's strong durability is also very useful in the context of stream processing.

1.2. Position of Kafka in FusionInsight HD

• Kafka serves as a distributed, partitioned, replicated message publishing and subscription system that supports online and offline message processing, and it provides Java APIs for other components.
• Currently, Kafka is used by multiple components such as Streaming, Spark, and Flume.

2. Architecture of Kafka

• Kafka is composed of the following components: Brokers, Producers, and Consumers.
• A Producer publishes messages to a Kafka Broker. These messages could be page views produced by a web front end, log servers, or any other application.
• A Broker is a service instance in the Kafka cluster. A cluster contains one or more service instances (Brokers). Kafka supports horizontal scaling: with more Brokers, we can increase the cluster throughput.
• A Consumer is simply a client that reads messages from a Kafka Broker.
• This architecture is based on the publish-subscribe approach. So, Kafka is a publish-subscribe messaging system. A Producer publishes messages to a Broker in push mode, and a Consumer subscribes to and consumes messages from a Broker in pull mode.
• Kafka uses ZooKeeper to manage cluster configuration, elect a leader, and rebalance data when Consumers change.
• There are some other concepts such as Topic, Partition, Partition Offset, and Consumer Group.

2.1. Topic

• A Topic is the category of messages published to the Kafka cluster.

o For example, take weather as a topic. Daily temperature information can be stored in this topic.

o A Kafka topic can be interpreted as a queue.

o Inside this queue, each element is a message. Messages produced by Producers are placed at the end of the topic one by one, and Consumers read messages from front to back, using offsets to mark their read positions.

2.2. Partition

• Topics are split into Partitions.
• For each Topic, Kafka keeps a minimum of one Partition.
• Each Partition contains messages in an immutable, ordered sequence.
• The Partition mechanism improves Kafka's throughput.

o The Partitions of a Topic are distributed across multiple nodes.

o Multiple Producers and Consumers can concurrently access these nodes to read and write messages of the Topic.

• The number of Partitions in a Topic can be set during Topic creation.
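How a keyed message is routed to one of a Topic's Partitions can be sketched as follows. This is a simplified stand-in for the client's default partitioner (the real one uses murmur2 hashing and handles keyless messages differently; md5 is used here only as a stable hash for illustration):

```python
import hashlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Pick a partition for a keyed message (simplified sketch).

    Kafka's default partitioner uses murmur2; md5 here is just a
    stand-in that gives a stable hash for illustration.
    """
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Messages with the same key always land in the same partition,
# which preserves per-key ordering within that partition.
p1 = choose_partition(b"sensor-42", 3)
p2 = choose_partition(b"sensor-42", 3)
assert p1 == p2
```

This is also why the Partition count matters: it fixes how finely a Topic's traffic can be spread across Brokers.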

2.3. Consumer Group

• Consumers can join a group by using the same group ID.
• The number of Partitions determines the maximum parallelism of a Consumer Group.
• Kafka assigns the Partitions of a Topic to the Consumers in a Group so that each Partition is consumed by exactly one Consumer in the Group.
• No two Consumers of the same Group read from the same Partition range.
• Kafka guarantees that a message is only ever read by a single Consumer in the Group.
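The one-partition-per-consumer rule can be sketched as a simple round-robin assignment. This is a simplification for illustration; Kafka ships several assignment strategies (such as range and round-robin), and the real protocol also handles rebalancing when consumers join or leave:

```python
def assign_partitions(partitions: list, consumers: list) -> dict:
    """Round-robin sketch: every partition goes to exactly one consumer."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        # Cycle over consumers so the load is spread evenly.
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 4 partitions, 2 consumers: each consumer gets 2 partitions,
# and no partition is shared within the group.
print(assign_partitions([0, 1, 2, 3], ["c1", "c2"]))
# {'c1': [0, 2], 'c2': [1, 3]}
```

With more consumers than partitions, the extra consumers would simply receive nothing, which is why the partition count caps the group's parallelism.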

2.4. Partition Offset

• Each message of a Partition has a unique sequence ID called the Offset.
• It is a long integer that uniquely identifies a message.
• Consumers can use Topics, Partitions, and Offsets to track the messages they want.

3. Kafka Partition Replica

• In order to improve fault tolerance, Kafka has a Partition Replication Policy, which defines the number of Partition Replicas through configuration files.
• Replicas are simply backups of a Partition, and they are used to prevent data loss.



• Kafka has Master and Slave Replicas.

o The Master Replica is called the Leader.

o A Slave Replica is called a Follower.

o The replicas in synchronization are called in-sync replicas (ISR).

• Kafka uses the Leader to implement Partition read and write operations, which means that both Consumers and Producers read and write data from the Leader; neither of them interacts with Followers.
• Followers only synchronize data.

o If a Leader fails, one of the Followers will take over as the new Leader.

o A Follower that lags far behind the failed Leader in terms of data synchronization, due to low performance or network reasons, cannot become the new Leader.

• The Leader server receives all requests. So, to ensure cluster performance, Kafka evenly distributes Leaders across all instances of the cluster.
• During data synchronization between Partitions, the Followers actually pull messages from the Leader, using a single thread called ReplicaFetcherThread to copy data from the Leader.
• When a Broker is started in Kafka, the Replica Management Service, ReplicaManager, is created to maintain the connections between ReplicaFetcherThread and other Brokers.

• The Leader and Follower Partitions are distributed over many Brokers.
• A ReplicaFetcherThread is created for each of these Brokers to synchronize Partition data.
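The follower's pull-based catch-up loop can be sketched in miniature. This is an illustration of the idea only, not the broker's actual ReplicaFetcherThread code: the follower repeatedly fetches from the leader's log starting at its own log-end offset until it is back in sync.

```python
def follower_sync(leader_log: list, follower_log: list, batch: int = 2) -> None:
    """Pull-based replication sketch: the follower fetches from the
    leader starting at its own log-end offset until it has caught up."""
    while len(follower_log) < len(leader_log):
        fetch_offset = len(follower_log)                     # next offset needed
        records = leader_log[fetch_offset:fetch_offset + batch]
        follower_log.extend(records)                         # append in order

leader = ["m0", "m1", "m2", "m3", "m4"]
follower = ["m0"]                  # a lagging replica
follower_sync(leader, follower)
assert follower == leader          # back in the in-sync replica set
```

A follower whose fetch position stays close to the leader's log end remains in the ISR and is eligible to become the new Leader on failure.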

4. Kafka Logs

• Kafka splits a large Partition file into multiple segment files. This makes it easier to delete messages that have already been consumed, reducing disk usage.
• A segment file is composed of two parts, an index file and a log file. These two files have a one-to-one correspondence and come in pairs.
• The files with the suffixes .index and .log are the index file and the segment data file, respectively.
• The index file is used to locate a message quickly.
• The storage layout of Kafka is quite simple. A Topic Partition has a logical log. Physically, a log is a group of segment files of the same size.
• When a Producer delivers a message to a Partition, the Broker appends the message to the last segment file.
• Segment files are written to the hard disk after a certain period of time or when the number of messages reaches a certain threshold. After the write is complete, the messages are made visible to Consumers.
• There are multiple files in each Partition, and while data is being written to one file, the other files in the Partition can only be read. When a file is full, a new file is created for writing and the current file becomes read-only. The new file is named after the offset of its first message.
• With index information, messages can be located quickly.
• Sparse storage of index files greatly reduces the amount of space occupied by index metadata. Here, sparse storage means selectively storing only parts of the complete data.
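Naming segments by their base offset plus keeping a sparse index makes lookup a two-step binary search, which can be sketched as follows. The segment layout and byte positions here are invented for illustration, a simplification of the real on-disk format:

```python
import bisect

# Segments named by the base offset of their first message; each has a
# sparse index storing only some (offset -> byte position) entries.
segments = {
    0:  {"index": [(0, 0), (4, 400)],  "last": 7},
    8:  {"index": [(8, 0), (12, 380)], "last": 15},
    16: {"index": [(16, 0)],           "last": 20},
}
bases = sorted(segments)

def locate(offset: int):
    """Return (segment_base, nearest_indexed_position) for an offset.
    A real broker then scans the .log file forward from that position."""
    # Step 1: binary-search for the segment whose base offset <= target.
    base = bases[bisect.bisect_right(bases, offset) - 1]
    # Step 2: binary-search the sparse index inside that segment.
    idx = segments[base]["index"]
    entries = [o for o, _ in idx]
    _, pos = idx[bisect.bisect_right(entries, offset) - 1]
    return base, pos

print(locate(13))  # (8, 380): segment 8, scan forward from byte 380
```

The sparse index trades a short forward scan for a much smaller index file, which is exactly the space saving described above.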

5. Log Cleanup

• In traditional message queues, consumed messages are usually deleted.
• In a Kafka cluster, however, all messages, consumed or not, are persisted.

o These messages are not persisted permanently, because disk space is limited. Nor do they need to be.

• Kafka provides two policies to handle messages that are out of date: delete and compact.
• You can also set parameters for log cleanup. The log cleanup policy parameter chooses the cleanup method.

o The default is delete.

• There are two other parameters.

o log.retention.hours - sets the number of hours to keep a log file before deleting it. The default value is 168 hours. Logs are cleaned after this time.

o log.retention.bytes - indicates the maximum size of the log for each Partition before deleting it.
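The delete policy can be sketched as pruning whole segments by age and total size, using the two parameters above. This is a simplification of the broker's actual log cleaner; the segment representation here is invented for illustration:

```python
def apply_retention(segments, now_h, retention_hours=168, retention_bytes=None):
    """Drop the oldest segments that violate time or size retention.

    `segments` is a list of dicts with 'created_h' (creation time in
    hours) and 'bytes', ordered oldest first. Returns the survivors.
    """
    # Time-based retention: drop segments older than retention_hours.
    kept = [s for s in segments if now_h - s["created_h"] <= retention_hours]
    # Size-based retention: drop oldest segments until under the cap.
    if retention_bytes is not None:
        while kept and sum(s["bytes"] for s in kept) > retention_bytes:
            kept.pop(0)
    return kept

segs = [{"created_h": 0,   "bytes": 100},
        {"created_h": 200, "bytes": 100},
        {"created_h": 300, "bytes": 100}]
# At hour 310 the first segment is more than 168 hours old, so it goes.
print(apply_retention(segs, now_h=310))
```

Deleting whole segments at once is what makes the segment split described in the previous section pay off: cleanup is a file delete, not a record-by-record scan.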

• Log compaction ensures that Kafka will always retain at least the last known value for each message key within the log of a single Topic Partition.
• The idea is to selectively remove earlier records for which a more recent update with the same primary key exists. This way, the log is guaranteed to hold at least the last state for each key. The example below shows the actual process of compacting a log segment.

o The upper part is the logical structure of a Kafka log with the offset of each message. A message contains a key and a value. A key can have more than one value because messages are updated. After compaction, only the message with the latest value remains for each key. For example, K1 has three messages whose values are V1, V3, and V4. So, after compaction, only the latest message corresponding to K1 is preserved, which is the message whose value is V4.

• Kafka messages are stored in sequence. So, the message with the largest offset is considered the latest message.
• One thing to notice is that messages of the compacted log keep the original offsets assigned when they were first written. This means that offsets never change.
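The K1/V4 example above can be reproduced with a small sketch of the compaction rule: keep only the highest-offset record per key, leaving the surviving offsets untouched. This is an illustration of the policy, not the broker's cleaner implementation:

```python
def compact(log):
    """Compact a log of (offset, key, value) records: keep only the
    record with the largest offset for each key; offsets are unchanged."""
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)   # later offsets overwrite earlier
    # Re-emit the survivors in offset order.
    return sorted(latest.values())

log = [(0, "K1", "V1"), (1, "K2", "V2"), (2, "K1", "V3"),
       (3, "K1", "V4"), (4, "K3", "V5")]
print(compact(log))
# [(1, 'K2', 'V2'), (3, 'K1', 'V4'), (4, 'K3', 'V5')]
```

Note that the compacted log has gaps in its offsets (0 and 2 are gone), which is exactly why consumers must tolerate non-contiguous offsets on compacted topics.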

6. Kafka Message Delivery

• All messages in Kafka are persisted to disk and backed up to prevent data loss.
• Kafka provides three kinds of message delivery guarantees:

o At most once - messages may be lost but are never redelivered.

o At least once - messages are never lost but may be redelivered.

o Exactly once - this is what people actually want: each message is delivered once and only once. Unfortunately, Exactly Once is not available in Kafka currently.

• Messages are delivered in different modes to ensure reliability in different application scenarios.
• Synchronous delivery is highly available. It means that whenever the client writes a message, it is sent to the Broker.
• Asynchronous delivery means that messages are stored in a buffer cache and transferred to the Broker once they reach a certain amount. This mode can easily cause data loss, so it is not usually used.
• Commonly, applications use synchronous delivery with confirmation and synchronous replication.
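The difference between at-most-once and at-least-once comes down to when the consumer records its progress, which can be sketched as follows. This simulation is an illustration of the semantics, not real client code:

```python
def consume(messages, process, commit_first):
    """Simulate a consumer that may crash mid-batch.

    commit_first=True  -> at-most-once: commit the offset, then process
                          (a crash after commit loses the message).
    commit_first=False -> at-least-once: process, then commit
                          (a crash before commit redelivers the message).
    Returns (processed messages, committed offset).
    """
    processed, committed = [], 0
    for i, msg in enumerate(messages):
        if commit_first:
            committed = i + 1          # advance offset before processing
        try:
            process(msg)
            processed.append(msg)
        except RuntimeError:
            break                      # simulated crash
        if not commit_first:
            committed = i + 1          # advance offset after processing
    return processed, committed

def crash_on_b(msg):
    # Simulated consumer failure while handling message "b".
    if msg == "b":
        raise RuntimeError("crash")

# At most once: offset 2 is committed but "b" was never processed -> lost.
print(consume(["a", "b"], crash_on_b, commit_first=True))   # (['a'], 2)
# At least once: offset stays at 1, so "b" is redelivered on restart.
print(consume(["a", "b"], crash_on_b, commit_first=False))  # (['a'], 1)
```

This is why at-least-once consumers are usually paired with idempotent processing: redelivery is the price of never losing a message.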

7. Kafka Cluster Mirroring


• Kafka Cluster Mirroring is Kafka's cross-cluster data synchronization solution.
• It is implemented through the built-in tool MirrorMaker.
• During data synchronization, the Consumer of MirrorMaker consumes data from the source cluster, and then the built-in Producer sends the data to the target cluster.
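MirrorMaker's consume-then-produce pipeline can be sketched in miniature. This is an in-memory illustration only; the real tool wires a Kafka consumer on the source cluster to a Kafka producer on the target cluster:

```python
def mirror(source_topic, target_topic, position=0):
    """One mirroring pass: consume new records from the source topic
    and produce them, in order, to the target topic.

    Returns the new consumer position so the next pass resumes there.
    """
    new_records = source_topic[position:]   # consumer side: fetch
    target_topic.extend(new_records)        # producer side: forward
    return position + len(new_records)

source = ["m1", "m2", "m3"]
target = []
pos = mirror(source, target)        # first pass copies everything
source.append("m4")
pos = mirror(source, target, pos)   # later pass copies only "m4"
assert target == ["m1", "m2", "m3", "m4"]
```

Because mirroring is just consumption plus re-production, the copies in the target cluster get fresh offsets assigned by the target cluster's own log.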

8. Read and Write Process in Kafka

• In the writing process, the Producer connects to any of the live nodes and requests metadata about the Leaders for the Partitions of a Topic. This allows the Producer to send the message directly to the Leader Broker of the Partition.
• In the reading process, the Consumer subscribes to message consumption from a specific Topic on the Kafka Broker.
• The Consumer then issues a fetch request to the Leader Broker to consume the message Partition, specifying the offset at which to begin reading.
• A Kafka Consumer always pulls all available messages after its current position in the Kafka log.
• While reading, or subscribing, the Consumer connects to any of the live nodes and requests metadata about the Leaders for the Partitions of a Topic. This allows the Consumer to communicate directly with the Leader Broker serving the messages.
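Both flows share the same first step, which can be sketched as a metadata lookup followed by direct communication with the leader. The broker names and metadata table here are invented for illustration:

```python
# Cluster metadata as any live broker would report it:
# (topic, partition) -> leader broker id.
metadata = {
    ("weather", 0): "broker-1",
    ("weather", 1): "broker-2",
}

def leader_for(topic, partition):
    """Step 1: ask any live node for the partition's current leader."""
    return metadata[(topic, partition)]

def produce(topic, partition, message, brokers):
    """Step 2: send the message straight to the leader broker."""
    leader = leader_for(topic, partition)
    brokers[leader].append((topic, partition, message))

brokers = {"broker-1": [], "broker-2": []}
produce("weather", 1, "temp=19C", brokers)
assert brokers["broker-2"] == [("weather", 1, "temp=19C")]
```

A consumer fetch would follow the same route: resolve the leader from metadata, then issue the fetch request to that broker directly rather than through an intermediary.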
