Apache Kafka - Cluster Architecture
Last Updated: 24 Mar, 2024
Apache Kafka has become a natural fit for building reliable, fault-tolerant, internet-scale streaming applications that handle real-time data at scale. In this article, we will put the Apache Kafka cluster architecture in the spotlight, with examples in Java.
Understanding the Basics of Apache Kafka
Before delving into the cluster architecture, let's establish a foundation by understanding some fundamental concepts of Apache Kafka.
1. Publish-Subscribe Model
Kafka operates on a publish-subscribe model, where data producers publish records to topics, and data consumers subscribe to these topics to receive and process the data. This decoupling of producers and consumers allows for scalable and flexible data processing.
2. Topics and Partitions
Topics are logical channels that categorize and organize data. Within each topic, data is further divided into partitions, enabling parallel processing and efficient load distribution across multiple brokers.
3. Brokers
Brokers are the individual Kafka servers that store and manage data. They are responsible for handling data replication, client communication, and ensuring the overall health of the Kafka cluster.
Key Components of Kafka Cluster Architecture
Key components of Kafka cluster architecture include the following:
Brokers - Nodes in the Kafka Cluster
Responsibilities of Brokers:
- Data Storage: Brokers store topic partitions on disk, collectively forming the Kafka cluster's distributed storage layer.
- Replication: Brokers handle data replication, maintaining redundant copies of partitions so the system stays highly available.
- Client Communication: Brokers act as intermediaries between producers and consumers, receiving published records and serving them to subscribers.
Communication and Coordination Among Brokers:
- Inter-Broker Communication: Efficient communication among brokers, for synchronization and load balancing, is what keeps a distributed system like a Kafka cluster fault-tolerant and scalable.
- Cluster Metadata Management: Brokers collectively manage metadata about topics, partitions, and consumer groups, ensuring a consistent view of the cluster state.
// Code example for creating a Kafka producer and sending a message
// ('props' holds the connection and serializer settings shown in the
// full producer example later in this article)
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
ProducerRecord<String, String> record = new ProducerRecord<>("example-topic", "key", "Hello, Kafka!");
producer.send(record);
producer.close();
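To make broker metadata tangible, the AdminClient API can list the brokers that make up a cluster. A minimal sketch, assuming a broker is reachable at localhost:9092:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import java.util.Properties;

Properties adminProps = new Properties();
adminProps.put("bootstrap.servers", "localhost:9092"); // assumed local broker
try (AdminClient admin = AdminClient.create(adminProps)) {
    // describeCluster() exposes the cluster id and the current broker nodes
    DescribeClusterResult cluster = admin.describeCluster();
    System.out.println("Cluster id: " + cluster.clusterId().get());
    cluster.nodes().get().forEach(node ->
        System.out.printf("Broker %d at %s:%d%n", node.id(), node.host(), node.port()));
}
// (get() may throw checked exceptions; handle or declare them in real code)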
Topics - Logical Channels for Data Organization
Role of Topics in Kafka:
- Data Organization: Topics categorize and organize related records under a named channel, giving structure to the data flowing through the cluster.
- Scalable Data Organization: Because each topic is divided into partitions, topics provide a unit of organization that scales through parallel message processing.
Partitioning Strategies for Topics:
- Partitioning Logic: The partitioning logic decides which partition each message is written to, typically by hashing the message's key.
- Balancing Workload: Distributing partitions evenly across brokers balances the workload, which keeps data processing fast; the sketch below shows the two common ways a producer influences placement.
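As a concrete illustration, here is a minimal sketch of the two ways a producer can influence partition placement; the topic name, key, and values are illustrative only:

// Key-based placement: the default partitioner hashes the key, so records
// sharing a key always land in the same partition
ProducerRecord<String, String> keyed =
    new ProducerRecord<>("example-topic", "user-42", "clicked checkout");
// Explicit placement: the producer pins the record to partition 0 directly
ProducerRecord<String, String> pinned =
    new ProducerRecord<>("example-topic", 0, "user-42", "clicked checkout");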
Partitions - Enhancing Parallelism and Scalability
Partitioning Logic:
- Deterministic Partitioning: The partitioning algorithm is deterministic, so the same input is always assigned to the same partition.
- Key-Based Partitioning: When a message has a key, the key is hashed to determine the partition, guaranteeing that messages with the same key always go to the same partition (messages with different keys may still share a partition); a simplified sketch follows.
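The sketch below illustrates the deterministic principle; it is not Kafka's actual implementation (the default partitioner uses murmur2 hashing), but the idea is the same:

// Simplified illustration of deterministic key-based partitioning:
// the same key always yields the same partition number
static int partitionFor(String key, int numPartitions) {
    // Mask the sign bit to keep the result non-negative
    return (key.hashCode() & 0x7fffffff) % numPartitions;
}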
Importance of Partitions in Data Distribution:
- Parallel Processing: Partitioned messages can be processed in parallel, increasing the cluster's overall throughput.
- Load Distribution: Partitions spread the data workload over several brokers, preventing any single broker from becoming a bottleneck and making better use of resources.
Replication - Ensuring Fault Tolerance
The Role of Replication in Kafka:
- Data Durability: Replication makes data durable by storing copies of each partition on different brokers.
- High Availability: Replication keeps the system highly available, allowing it to continue running even while some brokers are degraded or failing.
Leader-Follower Replication Model:
- Leader-Replica Relationship: Each partition has one leader and a set of follower replicas. The leader handles all writes for the partition, and the followers replicate its data to provide fault tolerance.
- Failover Mechanism: If the leader fails, one of the in-sync followers is elected as the new leader, so the partition stays available and data integrity is preserved. The replica assignment can be inspected as shown below.
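As a hedged sketch (the topic name is assumed, and 'props' holds the admin client's connection settings), the AdminClient API can show each partition's leader, replicas, and in-sync replica set:

// Inspecting leader and replica placement for a topic
try (AdminClient admin = AdminClient.create(props)) {
    TopicDescription desc = admin.describeTopics(Collections.singletonList("example-topic"))
            .allTopicNames().get().get("example-topic");
    desc.partitions().forEach(p ->
        System.out.printf("Partition %d: leader=%s, replicas=%s, isr=%s%n",
            p.partition(), p.leader(), p.replicas(), p.isr()));
}
// (allTopicNames() is available in recent Kafka clients; get() may throw checked exceptions)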
// Code example for creating a Kafka consumer and subscribing to a topic
// ('props' holds the connection, group, and deserializer settings shown in the
// full consumer example later in this article)
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("example-topic"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        System.out.printf("Received message: key=%s, value=%s%n", record.key(), record.value());
    }
}
Data Flow within the Kafka Cluster
Understanding the workflow of both producers and consumers is essential for grasping the dynamics of data transmission within the Kafka cluster.
- Producers - Initiating the Data Flow:
Producers in Kafka:
- Data Initiation: Producers initiate the data flow by pushing records to their assigned topics.
- Asynchronous Messaging: Producers can send messages asynchronously, continuing their work without blocking while waiting for acknowledgments from the Kafka cluster; a callback sketch follows.
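For illustration, the send() method accepts an optional callback that fires when the broker's acknowledgment arrives; this sketch assumes 'producer' and 'record' are set up as in the producer example below:

// Asynchronous send: the call returns immediately, and the callback is
// invoked once the broker has acknowledged (or rejected) the record
producer.send(record, (metadata, exception) -> {
    if (exception != null) {
        exception.printStackTrace(); // e.g. a retriable network error
    } else {
        System.out.printf("Stored in %s-%d at offset %d%n",
                metadata.topic(), metadata.partition(), metadata.offset());
    }
});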
Publishing Messages to Topics:
- Topic Specification: Producers specify the target topic whenever they publish messages, determining where the data will be stored and processed.
- Record Format: Messages are structured as a key, a value, and metadata: the key is an identifier (and drives partitioning), the value carries the message contents, and the metadata is the information attached to the record.
// Sample Kafka Producer in Java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
// Sending a message to the "example-topic" topic
ProducerRecord<String, String> record = new ProducerRecord<>("example-topic", "key", "Hello, Kafka!");
producer.send(record);
// Closing the producer to flush pending records and release resources
producer.close();
- Consumers - Processing the Influx of Data
The Role of Consumers in Kafka:
- Data Consumption: Consumers subscribe to topics and process the records that producers publish, forming the receiving end of the Kafka ecosystem.
- Parallel Processing: Consumers can process data simultaneously by organizing into consumer groups, which spread a topic's partitions across multiple consumer instances for fast, scalable processing.
Subscribing to Topics:
- Topic Subscription: Rather than receiving everything broadcast through the cluster, consumers subscribe to specific topics of interest and receive only the data streams they need.
- Consumer Group Dynamics: Several consumers can join the same consumer group to share the work of consuming a topic without interfering with one another.
Consumer Groups for Parallel Processing:
- Group Coordination: The group coordinator ensures that each message is processed by exactly one consumer in the group, not by all of them.
- Parallel Scaling: Consumer groups scale horizontally; adding consumers to a group increases processing capacity, up to one consumer per partition.
Maintaining Consumer Offsets:
- Offset Tracking: An offset records a consumer's position in each partition, i.e., which message it last consumed.
- Fault Tolerance: By tracking offsets, consumers always know the last message they consumed, so they can resume from where they left off if processing fails. This makes consumption fault-tolerant; a manual-commit sketch follows.
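For finer control over this guarantee, auto-commit can be disabled and offsets committed only after a batch has been fully processed. A minimal sketch, assuming a consumer created with enable.auto.commit=false (the full consumer setup appears in the next example), where handle() is a hypothetical processing step:

// With auto-commit disabled, commit offsets only after processing succeeds,
// so a restart resumes from the last fully handled position
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        handle(record); // hypothetical processing step
    }
    consumer.commitSync();
}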
// Sample Kafka Consumer in Java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "example-group");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
// Subscribing to the "example-topic" topic
consumer.subscribe(Collections.singletonList("example-topic"));
// Polling for messages in an infinite loop
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        // Process the received message
        System.out.printf("Received message: key=%s, value=%s%n", record.key(), record.value());
    }
}
The Role of Zookeeper: Orchestrating Kafka's Symphony
While Kafka has been moving away from Zookeeper since version 2.8.0, which introduced the ZooKeeper-less KRaft mode, understanding its historical significance is valuable.
Historical Significance of Zookeeper in Kafka:
- Coordination Service: Zookeeper's main task was to coordinate the cluster, managing broker membership and the roles the brokers played.
- Metadata Management: Zookeeper maintained metadata about brokers, partitions, and consumer groups, keeping the cluster state consistent.
Managing Broker Metadata:
- Dynamic Broker Discovery: Brokers registered themselves with Zookeeper, making them dynamically discoverable so clients could maintain connectivity to the available brokers.
- Metadata Updates: Zookeeper was responsible for propagating broker metadata updates, keeping clients informed of the latest changes in the Kafka cluster.
Leader Election and Configuration Tracking:
- Leader Election: Zookeeper coordinated leader election for each partition, determining which broker acted as the partition's leader at any given time.
- Configuration Tracking: Zookeeper tracked configuration changes within the Kafka cluster so that nodes always operated with the latest settings.
Navigating the Data Flow: Workflows for Producers and Consumers
Understanding the workflows of both producers and consumers provides insights into how data traverses the Kafka cluster.
- Producer Workflow
Sending Messages to Topics:
- Record Creation: Producers create records containing a value, an optional key, and optional metadata.
- Topic Specification: Producers name a target topic for each record, which determines where the message is delivered.
Determining Message Partition:
- Partition Assignment: Kafka assigns each message to a partition, either through the default partitioner or through a custom partitioner configured by the producer.
- Key-Partition Association: When key-based partitioning is used, a given key is always associated with the same partition.
Replication for Fault Tolerance:
- Acknowledgment Reception: Producers can request acknowledgments from the brokers confirming that data has been written and replicated before treating a send as successful; the configuration sketch below shows the relevant settings.
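A minimal sketch of producer settings that trade some latency for durability; the property names are standard Kafka producer configuration keys, and the broker address is assumed:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
// "all" waits until the leader and all in-sync replicas have the record
props.put("acks", "all");
// Retry transient failures instead of dropping records
props.put("retries", 3);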
- Consumer Workflow
Subscribing to Topics:
- Topic Subscription: Consumers subscribe to specific topics, signaling which data streams they want to receive.
- Consumer Group Assignment: Each consumer joins a consumer group, and the subscribed topics' partitions are divided among the group's members.
Assigning Partitions to Consumers:
- Partition Distribution: Within a consumer group, each partition is assigned to exactly one consumer, so the partitions are processed in parallel without duplication.
- Dynamic Rebalancing: When the group's membership changes, Kafka rebalances, redistributing the partition assignments across the remaining consumers.
Maintaining Offsets for Seamless Resumption:
- Offset Storage: Consumers save offsets to record the last processed message in each partition.
- Offset Commitment: Consumers periodically commit their offsets back to Kafka, so a restarted consumer resumes from its last committed position instead of losing or reprocessing data; the rebalance-aware sketch below combines this with partition reassignment.
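To tie offsets and rebalancing together, a consumer can register a ConsumerRebalanceListener and commit its progress just before partitions are taken away. A sketch, assuming 'consumer' is configured as in the consumer example above (imports: ConsumerRebalanceListener, TopicPartition, java.util.Collection):

// Commit offsets before a rebalance revokes our partitions, so the
// next owner resumes exactly where this consumer stopped
consumer.subscribe(Collections.singletonList("example-topic"),
        new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                consumer.commitSync();
            }
            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                System.out.println("Assigned: " + partitions);
            }
        });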
Achieving Scalability and Fault Tolerance in Kafka Clusters
The success of Apache Kafka lies in its ability to scale horizontally and maintain fault tolerance.
Scalability Through Data Partitioning:
- Parallel Processing: Data partitioning enables parallel processing of messages across multiple brokers, which is what lets the system scale horizontally.
- Load Balancing: Partitioning balances the workload among brokers, leading to better resource utilization and fewer bottlenecks.
// Creating a topic with three partitions and replication factor 1
// (no redundancy; see the replicated example below)
AdminClient adminClient = AdminClient.create(props);
NewTopic newTopic = new NewTopic("example-topic", 3, (short) 1);
adminClient.createTopics(Collections.singletonList(newTopic));
// Close the client to release its resources
adminClient.close();
Ensuring Fault Tolerance with Replication:
- Data Durability: Replication keeps multiple copies of each partition on different brokers, protecting data against loss.
- Continuous Operation: If one broker fails, the brokers holding replicas of its partitions take over automatically, so the cluster keeps serving data without interruption; a replicated-topic sketch follows.
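To illustrate, a topic can be created with a higher replication factor in the same way as above. A sketch, assuming the cluster has at least three brokers and 'adminClient' is set up as in the earlier example:

// Each of the three partitions is stored on three brokers, so the topic
// survives the loss of any single broker
NewTopic replicatedTopic = new NewTopic("example-topic-replicated", 3, (short) 3);
adminClient.createTopics(Collections.singletonList(replicatedTopic));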
Strategies for Seamless Recovery from Node Failures:
- Replica Catch-Up: When a failed broker recovers, its replicas re-fetch the data they missed from the current partition leaders until they are back in sync.
- Dynamic Reassignment: The cluster controller (coordinated through Zookeeper or, in newer versions, KRaft) reassigns the affected partitions' leadership to available brokers, enabling quick recovery.
Conclusion
In conclusion, the cluster architecture of Apache Kafka is a rich ecosystem that allows the construction of robust and scalable data pipelines. From core components like brokers, topics, and partitions to the dynamic workflows of producers and consumers, every piece plays a part in making Kafka efficient at handling real-time data.
With Kafka evolving quickly across new versions and best practices, engineers and architects working on real-time data processing need to keep pace. With a deep understanding of the technical details of a Kafka cluster, you can unleash the full power of this distributed streaming platform and build data pipelines that are not only reliable but can also withstand the increasingly complex demands of today's data-intensive applications.