Kafka: Big Data Huawei Course
Kafka
NOTICE
This document was generated from Huawei study material. Treat the information in this document as supporting material.
• Well, in today's world, real-time information is continuously generated by applications of every type (business, social, and others), typically including user-behavior data and application-performance tracing data in the form of logs and event messages.
• These logs and messages need easy ways to be reliably and quickly routed to multiple types of receivers. But most of the time, the applications that produce the information and the applications that consume it are inaccessible to each other.
• Therefore, we need a messaging system that is responsible for transferring data from one application to another, so applications can focus on the data without worrying about how to share it.
• These applications can use a message broker such as Kafka for transferring data.
• Kafka is a distributed, publish-subscribe messaging system that can handle a high volume of data and enables applications to pass messages from one endpoint to another.
• Publish-subscribe means that messages are persisted in a Topic.
• Consumers can subscribe to one or more Topics and consume all the messages in those Topics.
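Below is a minimal sketch of this publish-subscribe flow with the Kafka Java client. The Topic name ("weather"), broker address, and group id are illustrative assumptions, not taken from the course material.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class PubSubSketch {
        public static void main(String[] args) {
            // Producer side: publish one message to the "weather" Topic.
            Properties p = new Properties();
            p.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
                producer.send(new ProducerRecord<>("weather", "2024-01-01", "21C"));
            }

            // Consumer side: subscribe to the Topic and read its messages.
            Properties c = new Properties();
            c.put("bootstrap.servers", "localhost:9092");
            c.put("group.id", "weather-readers"); // assumed consumer group
            c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
                consumer.subscribe(List.of("weather"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("offset=%d key=%s value=%s%n", r.offset(), r.key(), r.value());
                }
            }
        }
    }

The producer and consumer never talk to each other directly; the Topic decouples them, which is exactly the point made above.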
o Offline and online message consumption - Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized operational data. It can also be used across an organization to collect logs from multiple services and make them available in a standard format to multiple consumers. Besides, it integrates very well with Apache Storm and Spark for real-time streaming data analysis: such a job reads data from a Topic, processes it, and writes the processed data to a new Topic, where it becomes available for users and applications (see the sketch below). Kafka's strong durability is also very useful in the context of stream processing.
2. Architecture of Kafka
2.1. Topic
o For example, take weather as a topic. So, daily temperature information can be
stored in this topic.
2.2. Partition
o A Topic is divided into Partitions that are spread across the Brokers of the cluster; multiple Producers and Consumers can concurrently access these nodes to read and write messages of the Topic.
• The number of Partitions in one Topic can be set during Topic creation (see the creation example at the end of this section).
• To improve fault tolerance, Kafka has a Partition Replication Policy, which defines the number of Partition Replicas through configuration files.
• Replicas are nothing but backups of a Partition, and they are used to prevent data loss.
• Kafka uses the Leader to implement Partition read and write operations: both Consumers and Producers read and write data through the Leader; neither of them interacts with Followers.
• Followers only synchronize data.
o If a Leader fails, one of the Followers will take over services as the new Leader.
o A Follower that lags far behind the failed Leader in terms of data synchroniza-
tion, due to low performance or network reasons, cannot become the new
Leader.
• The Leader server receives all requests, so to ensure cluster performance, Kafka evenly distributes Leaders across all instances of the cluster.
• During data synchronization between Partitions, the Followers pull messages from the Leader; each Follower uses a single thread, called ReplicaFetcherThread, to copy data from the Leader.
• When a Broker starts in Kafka, the replica management service, ReplicaManager, is created to maintain the connection between ReplicaFetcherThread and other Brokers.
• The Leader and Follower Partitions are distributed over many Brokers.
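As a sketch of how the Partition count and replication factor come together at creation time, the standard kafka-topics.sh tool accepts both; the Topic name and counts below are illustrative.

    # Create a Topic with 3 Partitions, each replicated on 2 Brokers
    # (one Leader plus one Follower per Partition); values are illustrative.
    kafka-topics.sh --create --bootstrap-server localhost:9092 \
      --topic weather --partitions 3 --replication-factor 2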
4. Kafka Logs
• Kafka splits a large Partition file into multiple segment files. This makes it easier to delete messages that have already been consumed, reducing disk usage.
• A segment file is composed of two parts, an index file and a log file. These two files have a one-to-one correspondence and come in pairs.
• The files with the suffixes .index and .log are the index file and the segment data file, respectively.
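For illustration, a Partition's directory on disk might look like the sketch below. The data directory and base offsets are hypothetical; segment files are named after the offset of their first message, zero-padded to 20 digits.

    /var/kafka-logs/weather-0/          # Partition 0 of the "weather" Topic (path assumed)
        00000000000000000000.index     # index of the first segment (base offset 0)
        00000000000000000000.log       # message data of the first segment
        00000000000000368769.index     # next segment starts at offset 368769
        00000000000000368769.log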
5. Log Cleanup
o Indeed, these messages are not permanently persisted, because disk space is limited; and they do not need to be.
o The default cleanup policy is delete.
o log.retention.bytes - the maximum size of the log for each Partition before old segments are deleted.
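A sketch of the corresponding broker settings in server.properties; both property names are standard Kafka broker configs, but the size shown is an illustrative choice, not a default.

    log.cleanup.policy=delete         # default cleanup policy: delete old segments
    log.retention.bytes=1073741824    # keep at most ~1 GB of log per Partition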
• Log compaction ensures that Kafka will always retain at least the last known value for each message key within the log of data for a single Topic Partition.
• The idea is to selectively remove previous records for which there is a more recent update with the same primary key. This way, the log is guaranteed to contain at least the last state for each key. The example below shows the actual process of compacting a log segment.
o The upper part is the logical structure of a Kafka log, with the offset of each message. A message contains a key and a value. A key can have more than one value because messages are updated over time. After compaction, only the message with the latest value is retained for each key. For example, K1 has three messages whose values are V1, V3, and V4; after compaction, only the message with value V4 remains.
• Kafka messages are stored in sequence, so the message with the largest offset is considered the latest message.
• One thing to notice is that messages in the log after compaction keep the original offsets assigned when they were first written. This means that the offsets never change.
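A sketch of the K1 example above; the other keys, values, and offsets are illustrative.

    Before compaction (offset: key=value):
      0: K3=V0   1: K1=V1   2: K2=V2   3: K1=V3   4: K2=V5   5: K1=V4

    After compaction (only the latest value per key survives; offsets are unchanged):
      0: K3=V0   4: K2=V5   5: K1=V4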
• All messages in Kafka are persisted to disk and backed up to prevent data loss.
• Kafka provides three kinds of message delivery guarantees:
o At most once - each message is delivered at most once; messages may be lost, but they are never redelivered.
o At least once - messages are never lost, but they may be redelivered.
o Exactly once - this is what people actually want: each message is delivered once and only once. Unfortunately, Exactly Once is not available in Kafka currently.
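As a sketch, the guarantee an application actually gets depends largely on client configuration. The settings below lean toward at least once; the property names are standard Kafka client configs, but the values are illustrative choices, not part of the course material.

    import java.util.Properties;

    public class DeliverySemanticsSketch {
        public static void main(String[] args) {
            // Producer side: wait for all in-sync Replicas to acknowledge, and
            // retry on failure. A retry can duplicate a message, hence at least once.
            Properties producerProps = new Properties();
            producerProps.put("acks", "all");
            producerProps.put("retries", 3);

            // Consumer side: disable auto-commit and commit offsets only after
            // processing, so a crash leads to redelivery rather than message loss.
            Properties consumerProps = new Properties();
            consumerProps.put("enable.auto.commit", "false");
        }
    }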
• In the writing process, the Producer connects to any of the alive nodes and requests metadata about the Leaders for the Partitions of a Topic. This allows the Producer to put messages directly to the Leader Broker of the Partition.
• In the reading process, the Consumer subscribes to message consumption from a specific Topic on the Kafka Broker.
• The Consumer then issues a fetch request to the Leader Broker to consume messages from the Partition, specifying the offset at which to begin reading (see the sketch at the end of this section).
• A Kafka Consumer always pulls all available messages after its current position in the Kafka log.
• While reading, or subscribing, the Consumer connects to any of the live nodes and requests metadata about the Leaders for the Partitions of a Topic. This allows the Consumer to communicate directly with the Leader Broker when receiving messages.
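A sketch of the fetch-by-offset step described above, using the Java Consumer's assign/seek API. The Topic, Partition number, and starting offset are illustrative assumptions.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class FetchFromOffsetSketch {
        public static void main(String[] args) {
            Properties c = new Properties();
            c.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
                TopicPartition tp = new TopicPartition("weather", 0); // Partition 0
                consumer.assign(List.of(tp)); // take the Partition directly (no group management)
                consumer.seek(tp, 42L);       // begin consuming at offset 42
                // The fetch returns all available messages after the current position.
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
                }
            }
        }
    }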