Kafka Monitoring
Apache Kafka is a distributed system in which topics are partitioned and replicated across multiple nodes. It also supports fault tolerance and durability. Even so, problems can arise that need troubleshooting, so it is necessary to manage and monitor the different Kafka activities. Various Kafka monitoring tools are used to watch the cluster and point to corrective actions.
The Kafka application manager enables users to discover and monitor Kafka servers automatically. It also tracks resource utilization details such as disk storage, CPU, and memory. This helps ensure that the user is not running out of resources. The manager also ensures that the server keeps operating continuously, with alerts sent whenever resource consumption increases.
Apache Kafka depends on the Java garbage collector to free up memory, because the application runs inside the Java Virtual Machine. The more activity in the Kafka cluster, the more often the garbage collector runs. A monitoring tool makes it simple to track JVM heap sizes. The tool also allows the user to track thread utilization with metrics such as the peak, live, and daemon thread counts. This prevents degradation of system performance.
In the Kafka cluster, there is only one controller (broker) which manages the state of the partitions and replicas. With Kafka monitoring tools, it becomes easy to view the active controller, and the active controller helps find out the active leader when an issue occurs. The application manager also monitors the broker's log flush latency, which is directly proportional to the number of pipelines.
The application manager monitoring tool keeps a record of the network usage and aggregates the incoming and outgoing byte rates on the broker's topics to gather more information. Because Apache Kafka is fault tolerant, it becomes easy to determine and analyze faults, which improves the performance of the Kafka cluster.
6) Provides data-rich reports on each performance metric
The Kafka monitoring tool creates detailed reports on each necessary performance attribute. These extensive reports help users understand the overall performance metrics.
Note: Apache Kafka also offers a remote monitoring feature. This feature is enabled through JMX by setting the environment variable 'JMX_PORT'. It provides graphs and alerts on the essential Kafka metrics. Visit the documentation on the official Apache Kafka website to know more.
Message compression offers the following advantages:
1. It reduces the latency and the size of the messages sent to Kafka.
2. It reduces the bandwidth used, which lets users increase the net number of messages sent to the broker.
3. It lowers the cost when the data is stored in Kafka via cloud platforms, because cloud services charge according to the amount of data stored.
4. Message compression does not require any change in the configuration of the broker or the consumer.
5. The reduced disk load leads to fast read and write operations.
The bigger the producer batch, the more effective the message compression technique is.
To compress the data, the 'compression.type' property is used. It lets users decide the type of compression. The type can be 'gzip', 'snappy', 'lz4', or 'none' (the default). 'gzip' has the maximum compression ratio.
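As a minimal sketch, assuming a producer 'properties' object like the one built in the later producer sections (the batch size and linger values below are illustrative assumptions, not values from this tutorial), compression can be enabled with:

properties.setProperty(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip");                 // 'gzip', 'snappy', 'lz4', or 'none'
properties.setProperty(ProducerConfig.BATCH_SIZE_CONFIG, Integer.toString(32 * 1024));  // larger batches compress better (illustrative value)
properties.setProperty(ProducerConfig.LINGER_MS_CONFIG, "20");                          // wait briefly so batches can fill (illustrative value)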
Distribution
Apache Kafka: Kafka consumers are distributed across topic partitions. Each consumer consumes messages from a specific partition at a time.
RabbitMQ: There are a number of consumers for each queue instance, known as competing consumers because they compete with one another to consume a message; each message can be processed only once.

Message Ordering
Apache Kafka: Message ordering is present inside the partition only. It guarantees that the messages of a partition either all fail or all pass together.
RabbitMQ: It maintains the order for flows via a single AMQP channel. In addition, it reorders retransmitted packets inside its queue logic, which saves the consumer from resequencing the buffers.

Message lifetime
Apache Kafka: It keeps a log file that preserves all messages, whether or not they have been consumed.
RabbitMQ: Since it is a queue, messages are removed once they are consumed and the acknowledgment is received.

Use Cases
Apache Kafka: It is mainly used for streaming data.
RabbitMQ: Web servers mainly use it for immediate responses to requests.
Primarily used for
Apache Kafka: It is used as a message broker, but it also does small-batch processing.
Apache Spark: It is used for micro-batch stream processing.

Data source
Apache Kafka: It takes data from the actual data sources, such as Facebook, Twitter, etc.
Apache Spark: It fetches data from Kafka itself for processing.
Spark Streaming
Apache Spark enables the streaming of large datasets through Spark Streaming. Spark Streaming is part of the core Spark API and lets users process live data streams. It takes data from different data sources and processes it using complex algorithms. At last, the processed data is pushed to live dashboards, databases, and file systems.
Kafka Streams
Parameters: Apache Kafka vs. Apache Spark

Data Sources
Apache Kafka: It processes data from Kafka itself via topics and streams.
Apache Spark: Spark ingests data from various sources such as files, Kafka, socket sources, etc.

Latency
Apache Kafka: It has lower latency than Apache Spark.
Apache Spark: It has higher latency.

ETL Transformation
Apache Kafka: It is not supported in Apache Kafka.
Apache Spark: This transformation is supported in Spark.

Use Cases
Apache Kafka: The New York Times, Zalando, Trivago, etc. use Kafka Streams to store and distribute data.
Apache Spark: Booking.com, Yelp (ad platform), etc. use Spark Streaming for handling millions of ad requests per day.
Kafka Streams is a client library for processing and analyzing the data stored in Kafka. It enables users to build applications and microservices and to store the output back in the Kafka cluster. It has no external dependency on systems other than Kafka, and it processes only a single record at a time.
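As an illustration only (not code from this tutorial's snapshots; it assumes a broker on localhost:9092 and two existing topics, here called 'input-topic' and 'output-topic'), a minimal Kafka Streams application looks roughly like this:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class streamsdemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-demo");        // application id (acts like a consumer group id)
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed local broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("input-topic");        // hypothetical input topic
        source.mapValues(value -> value.toUpperCase()).to("output-topic");     // transform records one at a time and write back to Kafka

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));      // close cleanly on shutdown
    }
}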
In this section, we will learn how a producer sends messages to the Kafka topics.
Step2: Type the command 'kafka-console-producer' on the command line. This tool helps the user read data from standard input and write it to the Kafka topic.
Step3: After knowing all the requirements, try to produce a message to a topic using the
command:
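Assuming the Kafka broker is running on localhost:9092 and the topic is 'demo', the command takes the following form (in Kafka 2.3 the console producer takes the broker address through '--broker-list'):
'kafka-console-producer --broker-list localhost:9092 --topic demo'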
A '>' prompt will appear on a new line. Start producing some messages, as shown below:
So, in this way, a producer can produce/send several messages to the Kafka topics.
Here, the key determines the partition to which the message is written, and the value is the message the producer writes to the topic.
The warning occurred because the topic 'demo' did not exist earlier. But, as soon as the producer wrote a message, Kafka automatically created the topic. Since no leader election had taken place for this unexpected topic, the 'LEADER_NOT_AVAILABLE' error could be seen. From the next time onwards, the producer can continue to write messages without the warning appearing again, because the topic is now in the existing list.
The users can check using the '--list' command, as shown below:
The topic 'demo' can be seen on the list.
For example,
When the topic 'demo' is described using the '--describe' command, it shows a 'PartitionCount' and a 'ReplicationFactor' of 1 (the default value). Thus, it is always a better option to create a topic before producing messages to it.
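For reference, a topic can be created beforehand with a command of the following form (the partition and replication values here are only an illustration; on Kafka versions before 2.2, '--zookeeper localhost:2181' is used instead of '--bootstrap-server'):
'kafka-topics --bootstrap-server localhost:9092 --create --topic demo --partitions 3 --replication-factor 1'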
The following steps are taken by the consumer to consume messages from the topic:
Step2: Type the command 'kafka-console-consumer' on the command line. This tool helps the user read data from the Kafka topic and output it to standard output.
The highlighted text shows that a 'bootstrap-server' is required for the consumer to connect to the Kafka cluster. Also, a topic name is required so the consumer knows which topic to read the messages from.
Step3: After knowing all the requirements, try to consume a message from a topic using the
command:
'kafka-console-consumer --bootstrap-server localhost:9092 --topic <topic_name>'. Press enter.
In the previous section, three messages were produced to this topic. But, in the above snapshot, 0 messages can be seen. This is because, by default, a Kafka console consumer reads only those messages that are produced while the consumer is in the active state; it does not read the earlier messages in the topic. This can be categorized as a disadvantage of Apache Kafka.
Let's understand:
Open a new terminal and launch the Kafka console producer. Keep the producer and consumer consoles side by side, as seen below:
Now, produce some messages in the producer console. After doing so, press Ctrl+C and exit.
It is seen that all messages currently produced by the producer console are reflected in the consumer console, because the consumer is in an active state.
To do so, use the '--from-beginning' option with the above kafka-console-consumer command as:
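Assuming the same broker and topic as above, the command takes the following form:
'kafka-console-consumer --bootstrap-server localhost:9092 --topic <topic_name> --from-beginning'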
For example,
In the above snapshot, it is clear that all messages are displayed from the beginning.
Note: The ordering of the messages is not total; the sequence is maintained at the partition level only (as studied in the Kafka introduction section).
For this topic 'myfirst', we had three partitions. So, if a user wishes to see the messages in order, create a topic with a single partition; it will display all messages in sequence.
After completing the message exchange process, press 'Ctrl+C' and stop.
So, messages can be consumed either from the beginning or from the point at which the user wants the consumer to start reading.
Let's see how consumers consume messages from Kafka topics:
Step1: Open the Windows command prompt.
In the above snapshot, the name of the group is 'first_app'. No messages are displayed because no new messages have been produced to this topic. If the '--from-beginning' option is used, all the previous messages will be displayed.
Step3: To view some new messages, produce some instant messages from the producer console (as done in the previous section).
So, the new messages produced by the producer can be seen in the consumer's console.
Step4: But, it was a single consumer reading data in the group. Let's create more consumers to
understand the power of a consumer group. For that, open a new terminal and type the exact
same consumer command as:
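Assuming the group 'first_app' used above and a local broker, the command has the form:
'kafka-console-consumer --bootstrap-server localhost:9092 --topic <topic_name> --group first_app'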
We can further create more consumers under the same group, and each consumer will consume
the messages according to the number of partitions. Try yourself to understand better.
Note: The group id should be the same; only then will the messages be split between the consumers.
However, if any of the consumers is terminated, the partitions will be reassigned to the active
consumers, and these active consumers will receive the messages.
So, in this way, various consumers in a consumer group consume the messages from the Kafka
topics.
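The console consumer can also display the message keys through its 'print.key' and 'key.separator' properties. Assuming a local broker, a command of the following form can be used:
'kafka-console-consumer --bootstrap-server localhost:9092 --topic <topic_name> --from-beginning --property print.key=true --property key.separator=,'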
Using the above command, the consumer can read data with the specified keys.
This command is used to read the messages from the beginning (discussed earlier). Thus, using it in a consumer group gives the following output:
It can be noticed that a new consumer group 'second_app' is used to read the messages from the beginning. If the same command is run one more time, it will not display any output. This is because offsets are committed in Apache Kafka: once a consumer group has read all the messages written so far, it will read only the new messages the next time.
For example, in the below snapshot, when the '--from-beginning' option is used again, only the new messages are read, because all the previous messages were already consumed earlier.
'kafka-consumer-groups' command
This command provides the complete documentation to list all the groups, describe a group, delete consumer info, or reset consumer group offsets. It requires a bootstrap server so that the clients can perform the different functions on the consumer group.
The '--list' option is used to list the consumer groups available in the Kafka cluster. The command is used as:
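With the broker assumed to run on localhost:9092, it looks like:
'kafka-consumer-groups --bootstrap-server localhost:9092 --list'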
The '--describe' option is used to describe a consumer group. The command is used as:
'kafka-consumer-groups.bat --bootstrap-server localhost:9092 --describe --group <group_name>'
This command shows whether any active consumer is present, along with the current offset and lag values; a lag of 0 indicates that the consumer has read all the data.
While resetting the offsets, the user needs to choose three arguments:
1. An execution option
2. Reset Specifications
3. Scope
'--dry-run': It is the default execution option. This option is used to plan which offsets are to be reset, without applying the change.
'--to-datetime': It resets the offsets to the offset corresponding to the given datetime. The format used is 'YYYY-MM-DDTHH:mm:SS.sss'.
'--shift-by': It resets the offsets by shifting the current offset value by 'n'. The value of 'n' can be positive or negative.
'--from-file': It resets the offsets to the values defined in the CSV file.
'--all-topics': It resets the offset value for all the available topics within a group.
'--topic': It resets the offset value for the specified topic only. The user needs to specify the topic name when resetting the offset value.
Note: To shift the offset value by a positive count, it is not necessary to use the '+' symbol with it. By default, the value will be considered positive.
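For example, with the broker, group, and topic names assumed from the earlier sections, shifting a group's offsets back by two and applying the change with the '--execute' option (the counterpart of the default '--dry-run') could look like:
'kafka-consumer-groups --bootstrap-server localhost:9092 --group first_app --reset-offsets --shift-by -2 --topic demo --execute'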
Step1: The build tool Maven contains a 'pom.xml' file. The 'pom.xml' is a default XML file that carries all the information regarding the GroupID, ArtifactID, and version values. The user needs to define all the necessary project dependencies in the 'pom.xml' file. Go to the 'pom.xml' file.
Step3: Now, open a web browser and search for 'Kafka Maven' as shown below:
Click on the highlighted link and select the 'Apache Kafka, Kafka-Clients' repository. A
sample is shown in the below snapshot:
Step4: Select the repository version according to the downloaded Kafka version on the system.
For example, in this tutorial, we are using 'Apache Kafka 2.3.0'. Thus, we require the repository
version 2.3.0 (the highlighted one).
Step5: After clicking on the repository version, a new window will open. Copy the dependency code from there. Since we are using Maven, copy the Maven code. If the user is using Gradle, copy the Gradle code instead.
Step6: Paste the copied code to the '<dependencies>...</dependencies>' block, as shown below:
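For reference, the pasted block should look roughly like this (version 2.3.0 is assumed here to match the Kafka version used in this tutorial):

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>2.3.0</version>
</dependency>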
If the version number appears red, the user has missed enabling the 'Auto-Import' option. If so, go to View > Tool Windows > Maven. A Maven Projects window will appear on the right side of the screen. Click the 'Refresh' button there; this will import the missed Maven projects. Once the color changes to black, the missing dependency has been downloaded, and the user can proceed to the next step.
Step7: Now, open the web browser, search for 'SLF4J Simple', and open the highlighted link shown in the below snapshot:
A bunch of repositories will appear. Click on the appropriate repository.
To know the appropriate repository, look at the Maven projects window, and see the slf4j version
under 'Dependencies'.
Click on the appropriate version and copy the code, and paste below the Kafka dependency in
the 'pom.xml' file, as shown below:
Note: Either comment out or remove the '<scope>test</scope>' line from the code. This scope tag limits the dependency to test code, and since we need this dependency for all code, the scope should not be limited.
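The pasted SLF4J dependency, with the scope line removed as the note says, then looks roughly like the following (the version shown is only an assumption; use the slf4j version listed under 'Dependencies' in the Maven window):

<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-simple</artifactId>
    <version>1.7.26</version> <!-- assumption: match the slf4j-api version pulled in by kafka-clients -->
</dependency>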
Now, we have set all the required dependencies. Let's try the 'Simple Hello World' example.
Firstly, create a java package say, 'com.firstgroupapp.aktutorial' and a java class beneath it.
While creating the java package, follow the package naming conventions. Finally, create the
'hello world' program.
After executing the 'producer1.java' file, the output 'Hello World' is successfully displayed. This confirms that IntelliJ IDEA is working correctly.
There, users can learn about all the producer properties offered by Apache Kafka. Here, we will discuss the required properties, such as:
1. bootstrap.servers: It is a list of host/port pairs used for establishing the initial connection to the Kafka cluster. The bootstrap servers are used only for making the initial connection. The list is given in the form host:port,host:port,...
2. key.serializer: It is the Serializer class for the key, which implements the 'org.apache.kafka.common.serialization.Serializer' interface.
3. value.serializer: It is the Serializer class for the value, which implements the 'org.apache.kafka.common.serialization.Serializer' interface.
Now, let's see the implementation of the producer properties in the IntelliJ IDEA.
When we create the properties, it imports the 'java.util.Properties' to the code.
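A minimal sketch of this step, assuming a local broker on 127.0.0.1:9092 (the snapshot with the original code is not reproduced here, but the same properties appear in the complete listing at the end of this tutorial):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

// create producer properties
String bootstrapServers = "127.0.0.1:9092";  // assumed local broker address
Properties properties = new Properties();
properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());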
So, in this way, the first step to create producer properties is completed.
Here, 'first_producer' is the name of the producer we have chosen. The user can choose
accordingly.
Here, 'record' is the name chosen for creating the producer record, 'my_first' is the topic name,
and 'Hye Kafka' is the message. The user can choose accordingly.
1. first_producer.send(record);
To know the output of the above codes, open the 'kafka-console-consumer' on the CLI using the
command:
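Assuming the topic 'my_first' used in the record above and a local broker, the command is of the form:
'kafka-console-consumer --bootstrap-server localhost:9092 --topic my_first'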
The data produced by a producer is sent asynchronously. Therefore, two additional calls, flush() and close(), are required (as seen in the above snapshot). flush() forces all buffered data to be sent, and close() stops the producer. If these functions are not executed, the data will never be sent to Kafka, and the consumer will not be able to read it.
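Putting the pieces of this section together, a minimal runnable sketch could look like the following (the class name, topic, message, and broker address follow the examples above; this is an illustration, not the exact code from the snapshots):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class producer1 {
    public static void main(String[] args) {
        // create producer properties
        Properties properties = new Properties();
        properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9092"); // assumed local broker
        properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // create the producer and the record
        KafkaProducer<String, String> first_producer = new KafkaProducer<>(properties);
        ProducerRecord<String, String> record = new ProducerRecord<>("my_first", "Hye Kafka");

        // send() is asynchronous; flush() forces the data out and close() stops the producer
        first_producer.send(record);
        first_producer.flush();
        first_producer.close();
    }
}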
The output of the code on the consumer console is shown below:
On the terminal, users can see various log lines. The last line on the terminal says that the Kafka producer is closed. Thus, the message sent asynchronously gets displayed on the consumer console.
Kafka Real Time Example
Till now, we learned how to read and write data to/from Apache Kafka. In this section, we will learn how to connect a real data source to Kafka. Here, we will discuss a real-time application, i.e., Twitter. The users will get to know how to create a Twitter producer and how tweets are produced.
Twitter is a social networking service that allows users to interact and post messages. These messages are known as tweets. Twitter users interact by posting and commenting on different posts through tweets.
To deal with Twitter, we need to get credentials for Twitter apps. It can be done by creating a
Twitter developer account. To do so, follow the below steps:
Step3: A new page will open. Click on "Apply for a developer account"
Step4: A new page will open, asking about the intended use, such as 'How will you use Twitter data?', and so on. A snapshot is shown below:
After giving the appropriate answers, click on Next.
Step5: The next section is the Review section. Here, the user explanations will be reviewed by
Twitter, as shown below:
If Twitter finds the answers appropriate, the 'Looks good' option will get enabled. Then, move to the next section.
Step6: Finally, the user will be asked to review and accept the Developer Agreement. Accept the
agreement by clicking on the checkbox. Submit the application by clicking on the 'Submit
Application'.
Step7: After successful completion, an email confirmation page will open. Confirm with the
provided email id and proceed further.
Step8: After confirmation, a new webpage will open. Click on 'Create an app' as shown below:
Step9: Provide the app details, as shown in the below snapshot:
Step10: After giving the app details, click on the 'Create' option. A dialog box will open
"Review our Developer Terms". Click on the 'Create' option. A snapshot is shown below:
Finally, the app will be created in the following way:
Note: When the app is created, it will generate Keys and Tokens. Do not disclose them, as they are secret and sensitive information. If they are disclosed, the user can regenerate them for safety purposes.
Step11: After creating an app, we need to add the twitter dependency in the 'pom.xml' file. To do
so, open 'github twitter java' on a web browser. A snapshot is shown below:
Open the highlighted link or visit: 'https://github.com/twitter/hbc' to open directly.
Step12: There, the user will find the twitter dependency code. Copy the code and paste it in the
'pom.xml' file below the maven dependency code.
A term 'hbc' is used in the dependency code. It stands for 'Hosebird Client' which is a java HTTP
client. It is used for consuming Twitter's standard streaming API. The Hosebird Client is divided
into two modules:
1. hbc-core: It uses a message queue. This message queue is further used by the consumer
to poll for raw string messages.
2. hbc-twitter4j: This is different from hbc-core as it uses the twitter4j listeners. Twitter4j is an unofficial java library through which we can easily integrate a java application with various twitter services.
In the twitter dependency code, hbc-core is used. Users can also use hbc-twitter4j instead.
So, in this way, the first stage of the real-time example is completed.
Creating Twitter Producer
In this section, we will learn to create a twitter producer.
Step1: Create a new java package, following the package naming convention rules. Then, create
a java class within it, say 'tweetproducer.java.'
Step2: Create a twitter client by creating a method for it. Now, copy the Quickstart code from
the 'github twitter java' to the twitter client method, as shown below:
Paste it in the newly created method. This code will create a connection between the client and the hbc host. The BlockingQueue prevents the client from dequeuing messages when the queue is empty and from enqueuing them when it is already full. As we are using hbc-core, we only require the msgQueue. Also, we will follow terms, not people; therefore, copy the highlighted code only.
Now, copy the 'Creating a client' code given below the connection code as:
Paste the code below the connection code. This code will create a twitter client through the client
builder. As we are using msgQueue, do not copy the red highlighted code, which is for the
eventMessageQueue. It is not required.
Step3: Create the producer in a similar way we learned in the previous sections with a bootstrap
server connection.
Step4: After creating the Kafka producer, it's time to send tweets to Kafka. Copy the while-loop code from the 'github twitter java' page, given below the 'Creating a client' code, and paste it below the producer code.
Now, we are ready to read tweets from Twitter. However, a Kafka producer writes messages to a topic, so create the specified topic using the '--create' command on the CLI. Also, specify the partition value and the replication factor.
For example,
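Assuming the broker runs on localhost:9092 (on Kafka versions before 2.2, use '--zookeeper localhost:2181' instead), the command is of the form:
'kafka-topics --bootstrap-server localhost:9092 --create --topic twitter_topic --partitions 6 --replication-factor 1'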
Here, the topic 'twitter_topic' has been created with partition value 6 and replication-factor 1.
Finally, execute the code and experience Kafka in the real-world application.
The complete code for creating the Twitter Client is given below:
1. package com.github.learnkafka;
2.
3. import com.google.common.collect.Lists;
4. import com.twitter.hbc.ClientBuilder;
5. import com.twitter.hbc.core.Client;
6. import com.twitter.hbc.core.Constants;
7. import com.twitter.hbc.core.Hosts;
8. import com.twitter.hbc.core.HttpHosts;
9. import com.twitter.hbc.core.endpoint.StatusesFilterEndpoint;
10. import com.twitter.hbc.core.processor.StringDelimitedProcessor;
11. import com.twitter.hbc.httpclient.auth.Authentication;
12. import com.twitter.hbc.httpclient.auth.OAuth1;
13. import org.apache.kafka.clients.producer.*;
14. import org.apache.kafka.common.serialization.StringSerializer;
15. import org.slf4j.Logger;
16. import org.slf4j.LoggerFactory;
17.
18. import java.util.List;
19. import java.util.Properties;
20. import java.util.concurrent.BlockingQueue;
21. import java.util.concurrent.LinkedBlockingQueue;
22. import java.util.concurrent.TimeUnit;
23.
24. public class tweetproducer {
25. Logger logger = LoggerFactory.getLogger(tweetproducer.class.getName());
26. String consumerKey = "";//specify the consumer key from the twitter app
27. String consumerSecret = "";//specify the consumerSecret key from the twitter app
28. String token = "";//specify the token key from the twitter app
29. String secret = "";//specify the secret key from the twitter app
30.
31. public tweetproducer() {}//constructor to invoke the producer function
32.
33. public static void main(String[] args) {
34. new tweetproducer().run();
35. }
36.
37. public void run() {
38. logger.info("Setup");
39.
40. BlockingQueue<String> msgQueue = new
41. LinkedBlockingQueue<String>(1000);//Specify the size accordingly.
42. Client client = tweetclient(msgQueue);
43. client.connect(); //invokes the connection function
44. KafkaProducer<String,String> producer=createKafkaProducer();
45.
46. // on a different thread, or multiple different threads....
47. while (!client.isDone()) {
48. String msg = null;
49. try {
50. msg = msgQueue.poll(5, TimeUnit.SECONDS);//specify the time
51. } catch (InterruptedException e) {
52. e.printStackTrace();
53. client.stop();
54. }
55. if (msg != null) {
56. logger.info(msg);
57. producer.send(new ProducerRecord<>("twitter_topic", null, msg), new Callback() {
58. @Override
59. public void onCompletion(RecordMetadata recordMetadata, Exception e) {
60. if(e!=null){
61. logger.error("Something went wrong",e);
62. }
63. }
64. });
65. }
66.
67. }//Specify the topic name, key value, msg
68.
69. logger.info("This is the end");//When the reading is complete, inform logger
70. }
71.
72. public Client tweetclient(BlockingQueue<String> msgQueue) {
73.
74. Hosts hosebirdHosts = new HttpHosts(Constants.STREAM_HOST);
75. StatusesFilterEndpoint hosebirdEndpoint = new StatusesFilterEndpoint();
76. List<String> terms = Lists.newArrayList("India ");//describe
77. //anything for which we want to read the tweets.
78. hosebirdEndpoint.trackTerms(terms);
79. Authentication hosebirdAuth = new OAuth1(consumerKey, consumerSecret, token, secret);
80. ClientBuilder builder = new ClientBuilder()
81. .name("Hosebird-Client-01") // optional: mainly for the logs
82. .hosts(hosebirdHosts)
83. .authentication(hosebirdAuth)
84. .endpoint(hosebirdEndpoint)
85. .processor(new StringDelimitedProcessor(msgQueue));
86.
87.
88. Client hosebirdClient = builder.build();
89. return hosebirdClient; // Attempts to establish a connection.
90. }
91. public KafkaProducer<String,String> createKafkaProducer(){
92. //creating kafka producer
93. //creating producer properties
94. String bootstrapServers="127.0.0.1:9092";
95. Properties properties= new Properties();
96. properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
97. properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
98. properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
99.
100. KafkaProducer<String,String> first_producer = new KafkaProducer<String, String>(properties);
101. return first_producer;
102.
103. }
104. }
In the above code, the user must specify the consumerKey, consumerSecret, token, and secret values. As this is sensitive information, it cannot be displayed here. Copy the keys from the 'Keys and Tokens' section on 'developer.twitter.com' and paste them at their respective positions in the code.
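To verify that tweets are flowing into Kafka, a console consumer can be attached to the topic (the broker address is assumed to be the local one used above):
'kafka-console-consumer --bootstrap-server localhost:9092 --topic twitter_topic'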