
Kafka Monitoring

Apache Kafka is a distributed system in which a topic is partitioned as well as replicated across various nodes, which gives it fault tolerance and durability. Even so, problems can occur, so the different Kafka activities need to be managed and monitored. Various Kafka monitoring tools are used to observe the cluster and help decide on corrective actions.

The following activities are performed under Kafka monitoring:

1) Keeping track of utilized system resources

The Kafka application manager enables users to discover and monitor Kafka servers automatically. It also tracks resource utilization details such as disk storage, CPU, and memory. This helps ensure that the user is not running out of resources. The manager also helps keep the server operating continuously by sending alerts whenever resource consumption increases.

2) Keeps an eye on threads and JVM usage

Apache Kafka depends on the Java garbage collector to free up memory, because a Java application generally runs inside the Java Virtual Machine. Thus, more activity in the Kafka cluster leads to more frequent garbage collection. A monitoring tool makes it simple to track JVM heap sizes. The tool also lets the user track thread utilization with metrics such as peak thread count, live thread count, and daemon thread count. This helps prevent degradation of system performance.

3) Understands controller, broker, and replication statistics

In a Kafka cluster, there is only one active controller (a broker) that manages the state of partitions and replicas. Kafka monitoring tools make it easy to see which broker is the active controller, which helps when diagnosing issues. The application manager also monitors the broker's log flush latency, since high flush latency slows down the message pipeline.

4) Monitors the network as well as Topic Details

The application manager monitoring tool keeps a record of network usage and aggregates the incoming and outgoing byte rates on the broker topics in order to gather more information.

5) Manages faults and troubleshoots quickly

Apache Kafka has built-in fault tolerance. With monitoring in place, it becomes easy to detect and analyze faults, which improves the performance of the Kafka cluster.
6) Provides data-rich reports on each performance metric

The Kafka monitoring tool creates evaluated reports on each necessary performance attribute. These extensive reports help users understand the overall performance of the cluster.

Note: Apache Kafka also offers a remote monitoring feature. It is enabled through JMX by setting the environment variable 'JMX_PORT' before starting the broker. The exposed metrics can then be graphed and alerted on. Visit the documentation on the official Apache Kafka website to know more.
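For example (a minimal sketch, assuming a Linux installation and a free port 9999), the broker can be started with JMX enabled as follows, after which a JMX client such as jconsole can connect to localhost:9999 and browse the broker metrics:

JMX_PORT=9999 bin/kafka-server-start.sh config/server.properties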

Message Compression in Kafka


As we have seen, the producer usually sends data to Kafka in a text format, commonly JSON. JSON has a drawback: data is stored as strings, which repeats field names and stores many duplicated bytes in the Kafka topic, so it occupies a lot of disk space. Consequently, it is desirable to reduce the disk space used. This can be done by compressing (and batching) the data before sending it to Kafka.

Need for Message Compression


The following reasons describe the need to reduce the message size:

1. It reduces the size of the data sent to Kafka and, with it, the latency.
2. It reduces the bandwidth consumed, which lets users send more messages to the broker.
3. It lowers cost when the data is stored in Kafka on cloud platforms, because cloud services charge for the amount of data stored.
4. Message compression does not need any change in the configuration of the broker or the consumer.
5. The reduced disk load leads to faster read and write operations.

Producer Batch/Record Batch


A producer writes messages to Kafka one by one, but Kafka handles this smartly. It waits for the messages being produced, collects them into a batch until the batch is full, and then sends the whole batch to Kafka. Such a batch is known as a Producer Batch. The default batch size is 16 KB, and it can be configured to a larger value. The larger the batch, the better the compression, throughput, and efficiency of producer requests.
Note: A single message should not exceed the batch size; otherwise, it will not be batched. Also, a batch is allocated per partition, so do not set the batch size to a very high number.

The bigger the producer batch, the more effective the message compression technique becomes.
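As a rough sketch (the property names are standard producer configuration keys; the values are only illustrative), batching behaviour is tuned through the producer properties covered later in the Java producer section:

properties.setProperty("batch.size", "32768"); // batch size in bytes; the default is 16384 (16 KB)
properties.setProperty("linger.ms", "20");     // wait up to 20 ms so that more records can join a batch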

Message Compression Format


Message compression is always done at the producer side, so there is no need to change the configuration at the consumer or broker side.
For example, a producer batch of 200 MB may be reduced to 101 MB after compression.

To compress the data, the 'compression.type' producer property is used. This lets users decide the type of compression. The type can be 'gzip', 'snappy', 'lz4', or 'none' (the default). 'gzip' has the maximum compression ratio.
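For example (a sketch only; 'snappy' is chosen arbitrarily here), compression is enabled by adding one more property alongside the batching settings shown earlier:

properties.setProperty("compression.type", "snappy"); // or "gzip", "lz4"; the default is "none"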

Disadvantages of Message Compression


Message compression has the following disadvantages:

1. Producers spend some CPU cycles on compression.
2. Consumers spend some CPU cycles on decompression.
3. Both lead to somewhat higher CPU usage.

Nevertheless, message compression remains a good option for reducing the disk load.

Apache Kafka Vs. RabbitMQ


What is RabbitMQ?
RabbitMQ is a widely used, general-purpose, open-source message broker. It was released in 2007 and became a primary component of many messaging systems; today it is also used for streaming use cases. RabbitMQ can handle background tasks or act as a message broker between microservices. It helps web applications reduce their load and shortens delivery times for tasks or resources that are time-consuming.

What is Apache Kafka?


Apache Kafka is also an open-source, distributed pub/sub messaging system. It was released in 2011 and works as intermediate storage between two applications: the producer writes and stores messages in the Kafka cluster, while the consumer consumes messages from the cluster. It also mitigates the slow delivery of heavy messages.

Kafka Vs. RabbitMQ

Distribution
Apache Kafka: Consumers are distributed across topic partitions; each consumer consumes messages from a specific partition at a time.
RabbitMQ: There are a number of consumers for each queue instance, known as competing consumers because they compete with one another to consume a message. A message can be processed just once.

High Availability
Apache Kafka: With the help of ZooKeeper, it manages the state of the Kafka cluster and supports high availability.
RabbitMQ: Through clustering and highly available queues it provides high-performance data replication, and therefore also high availability.

Performance
Apache Kafka: It can process millions of messages in a second with less hardware.
RabbitMQ: It can also process millions of messages in a second, but it needs more hardware.

Replication
Apache Kafka: Brokers are replicated, so another broker takes over when the leader broker is down.
RabbitMQ: Queues are not automatically replicated; replication must be configured explicitly.

Multi-subscriber
Apache Kafka: Multiple consumer types can subscribe to the same messages in Kafka.
RabbitMQ: Although messages are routed to various queues, only one consumer from a queue can process a given message.

Message Protocols
Apache Kafka: It supports primitives such as int8, int16, etc., and binary messages.
RabbitMQ: It supports standard queue protocols such as STOMP, AMQP, HTTP, etc.

Message Ordering
Apache Kafka: Message ordering exists inside a partition only; it guarantees that messages either all fail or all pass together.
RabbitMQ: It maintains order for flows via a single AMQP channel. In addition, it reorders retransmitted packets inside its queue logic, so the consumer does not have to resequence buffers.

Message lifetime
Apache Kafka: It keeps messages in a log file and retains them regardless of consumption.
RabbitMQ: Since it is a queue, messages are removed once they are consumed and the acknowledgment is received.

Architecture
Apache Kafka: A highly scalable pub/sub distributed messaging system consisting of brokers, topics, and partitions within the Kafka cluster.
RabbitMQ: A general-purpose pub/sub message broker; its architecture differs from Kafka's in that it is built around queues.

Use Cases
Apache Kafka: It is mainly used for streaming data.
RabbitMQ: Web servers mainly use it for immediate responses to requests.

Transactions
Apache Kafka: It supports transactions that follow a "read-process-write" pattern to/from Kafka topics.
RabbitMQ: It does not guarantee atomicity, even when the transaction involves only a single queue.

Language
Apache Kafka: Written in Scala and runs on the JVM.
RabbitMQ: Written in Erlang.

Routing Support
Apache Kafka: It does not support complex routing scenarios.
RabbitMQ: It supports complex routing scenarios.

Developer Experience
Apache Kafka: With high growth it has provided a good experience, but it officially supports only Java clients.
RabbitMQ: It carries mature client libraries that support Java, PHP, Python, Ruby, and many more.

Apache Kafka Vs. Apache Storm


Apache Storm
Apache Storm is an open-source, real-time stream processing system. It was mainly used to speed up traditional processing. It reliably processes unbounded streams and uses spouts and bolts to design Storm applications in the form of a topology. It can be used from any programming language, which makes it simple to adopt, and it can process millions of messages per second.

Kafka Vs. Storm

There are the following differences between Kafka and Storm:

Developers
Apache Kafka: Originally developed by LinkedIn, then donated to the Apache Software Foundation.
Apache Storm: Originally created by Nathan Marz (BackType team), later acquired by Twitter, and eventually became a top-level Apache project.

Programming Language
Apache Kafka: Written in Scala and runs on the JVM.
Apache Storm: Written in Clojure and Java.

Type of system
Apache Kafka: A distributed messaging system.
Apache Storm: A real-time message processing system.

Primarily used for
Apache Kafka: Used as a message broker; it also does small-batch processing.
Apache Storm: Used for micro-batch stream processing.

Data Storage
Apache Kafka: Stores data on the local file system, such as XFS or EXT4.
Apache Storm: Does not store data; it transfers data from the input stream to the output stream.

Depends on
Apache Kafka: Depends on ZooKeeper to run the Kafka server and to let consumers/producers read/write messages to Kafka.
Apache Storm: Has no external dependency.

Latency
Apache Kafka: Has a latency of milliseconds.
Apache Storm: Has a latency of less than 1-2 seconds, because it depends on the data source.

Language Support
Apache Kafka: Best supported by the Java programming language.
Apache Storm: Supports all programming languages.

Security
Apache Kafka: Data is not highly secure.
Apache Storm: Data is highly secured.

Data source
Apache Kafka: Takes data from the actual data sources, such as Facebook, Twitter, etc.
Apache Storm: Fetches data from Kafka itself for processing.

Fault tolerance
Apache Kafka: Thanks to ZooKeeper, it is able to tolerate faults.
Apache Storm: Has a built-in auto-restart feature.

Developer Experience
Apache Kafka: It is durable, scalable, and gives high throughput.
Apache Storm: It is easy and flexible to use.

Kafka Streams Vs. Spark Streaming


Apache Spark
Apache Spark is a distributed, general-purpose processing system that can handle petabytes of data at a time. It is mainly used for streaming and processing data and is distributed across thousands of virtual servers. Large organizations use Spark to handle huge datasets. Apache Spark lets developers build applications faster using roughly 80 high-level operators. It achieves high performance for both streaming and batch data via a query optimizer, a physical execution engine, and a DAG scheduler, which can make it up to a hundred times faster than older batch-processing approaches.

Spark Streaming

Apache Spark enables the streaming of large datasets through Spark Streaming. Spark Streaming is part of the core Spark API and lets users process live data streams. It takes data from different data sources and processes it using complex algorithms. Finally, the processed data is pushed to live dashboards, databases, and file systems.

Kafka Streams

Kafka Streams is a client library for processing and analyzing data stored in Kafka. It enables users to build applications and microservices and to store the output back in the Kafka cluster. It has no external dependency on systems other than Kafka, and it processes a single record at a time.

Kafka Streams Vs. Spark Streaming

Developers
Kafka Streams: Originally developed by LinkedIn, later donated to the Apache Software Foundation.
Spark Streaming: Originally developed at the University of California, later donated to the Apache Software Foundation.

Infrastructure
Kafka Streams: A Java client library, so it can execute wherever Java is supported.
Spark Streaming: Executes on top of the Spark stack, which can be Spark standalone, YARN, or container-based.

Data Sources
Kafka Streams: Processes data from Kafka itself via topics and streams.
Spark Streaming: Ingests data from various sources such as files, Kafka, socket sources, etc.

Processing Model
Kafka Streams: Processes events as they arrive, i.e., an event-at-a-time (continuous) processing model.
Spark Streaming: A micro-batch processing model; it splits the incoming streams into small batches for further processing.

Latency
Kafka Streams: Lower latency than Apache Spark.
Spark Streaming: Higher latency.

ETL Transformation
Kafka Streams: It is not supported in Kafka Streams.
Spark Streaming: This transformation is supported in Spark.

Fault tolerance
Kafka Streams: Fault tolerance is more complex in Kafka.
Spark Streaming: Fault tolerance is easy in Spark.

Language Support
Kafka Streams: Supports Java mainly.
Spark Streaming: Supports multiple languages such as Java, Scala, R, and Python.

Use Cases
Kafka Streams: The New York Times, Zalando, Trivago, etc. use Kafka Streams to store and distribute data.
Spark Streaming: Booking.com and Yelp (ad platform) use Spark Streaming to handle millions of ad requests per day.

Sending data to Kafka Topics


Kafka Console Producer
In order to send data to the Kafka topic, a producer is required. The role of the producer is to
send or write data/messages to the Kafka topics.

In this section, we will learn how a producer sends messages to the Kafka topics.

The following steps are used to launch a producer:

Step1: Start the zookeeper as well as the kafka server.

Step2: Type the command: 'kafka-console-producer' on the command line. This will help the
user to read the data from the standard inputs and write it to the Kafka topic.

Note: Choose '.bat' or '.sh' as per the operating system.


The highlighted text indicates that a 'broker-list' and a topic are required to produce a message, because a producer must know the topic to which the data is to be written.

Step3: After knowing all the requirements, try to produce a message to a topic using the
command:

'kafka-console-producer --broker-list localhost:9092 --topic <topic_name>'. Press enter.


Note: Here, 9092 is the port number of the Kafka server.

Here, 'myfirst' topic is chosen to write messages to.

A '>' will appear in the new line. Start producing some messages, as shown below:

Step4: Press 'Ctrl+C' and exit by pressing the 'Y' key.

So, in this way, a producer can produce/send several messages to the Kafka topics.

Producer with Keys


A Kafka producer can write data to a topic either with or without a key. If the producer does not specify a key, the data is stored in any of the partitions with key=null; otherwise, records with the same key always go to the same partition. The 'parse.key' and 'key.separator' properties are required to specify a key on the console producer. The command used is:

'kafka-console-producer --broker-list localhost:9092 --topic <topic_name> --property parse.key=true --property key.separator=,
> key,value
> another key,another value'

Here, the key determines the partition to which the record is written, and the value is the message to be written by the producer to the topic.

What happens when a topic does not exist?


Suppose the producer wants to send messages to a topic that does not exist yet. In such a situation, a warning appears, as shown in the below snapshot, after producing a message. It is just a warning.

Why this warning?

The warning occurred because the topic 'demo' did not exist before. As soon as the producer wrote a message, Kafka auto-created the topic. However, since no leader had yet been elected for this unexpected topic, a 'LEADER_NOT_AVAILABLE' error could be seen. The next time, the producer can continue to write messages without any warning, because the topic is now in the existing list.

The users can check this using the '--list' option, as shown below:
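A sketch of the command, matching the ZooKeeper-based setup used in this tutorial (newer Kafka versions also accept '--bootstrap-server localhost:9092'):

'kafka-topics --zookeeper localhost:2181 --list'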
The topic 'demo' can be seen on the list.

Describing the new topic


Topics that are auto-created by the producer in this way get the default number of partitions and a replication factor of 1.
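These values can be checked with the '--describe' option; a sketch (again ZooKeeper-based, as elsewhere in this tutorial):

'kafka-topics --zookeeper localhost:2181 --describe --topic demo'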

For example,

When the topic 'demo' is described using the '--describe' option, it gives the values of 'PartitionCount' and 'ReplicationFactor' as 1 (the default). Thus, it is always a better option to create a topic explicitly before producing messages to it.

Changing the Default Values


Follow the below steps to change the default values for the new topic:

1. Open 'server.properties' file using Notepad++, or any other text editor.


2. Edit the value of num.partitions=1 to a new value, say 3. Whenever new topics are auto-created from now on, their PartitionCount will be 3 (or whatever the user has set). The replication factor of auto-created topics is controlled separately by the default.replication.factor property.
3. Save the file.

Still, it is best practice to create topics explicitly before producing to them.
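For reference, the relevant lines in 'server.properties' might look like the following (a sketch; default.replication.factor must not exceed the number of brokers in the cluster):

num.partitions=3
# default.replication.factor=1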

Kafka Console Consumer


In this section, the users will learn how a consumer consumes or reads the messages from the
Kafka topics.

The following steps are taken by the consumer to consume messages from a topic:

Step 1: Start the zookeeper as well as the kafka server initially.

Step2: Type the command: 'kafka-console-consumer' on the command line. This will help the
user to read the data from the Kafka topic and output it to the standard outputs.

Note: Choose '.bat' or '.sh' as per the operating system.

The highlighted text represents that a 'bootstrap-server' is required for the consumer to get
connected to the Kafka topics. Also, a 'topic_id' is required to know from which topic the
consumer will read the messages.

Step3: After knowing all the requirements, try to consume a message from a topic using the
command:
'kafka-console-consumer --bootstrap-server localhost:9092 --topic <topic_name>'. Press enter.

Note: The bootstrap server is the Kafka server, which listens on port 9092.

In the previous section, three messages were produced to this topic. But in the above snapshot, 0 messages can be seen. This is because, by default, the console consumer reads only from the latest offset, i.e., it consumes only those messages that are produced while the consumer is running. This is often cited as a disadvantage of Apache Kafka.

Let's understand:

Open a new terminal. Launch the Kafka console producer. Keep both producer-consumer
consoles together as seen below:
Now, produce some messages in the producer console. After doing so, press Ctrl+C and exit.
It is seen that all messages which are currently produced by the producer console are reflected in
the consumer console. It is because the consumer is in an active state.

Reading whole messages


Apache Kafka allows millions of messages to be produced. Sometimes, a consumer may need to read all the messages of a particular topic.

To do so, use the '--from-beginning' option with the Kafka console consumer command shown above:

'kafka-console-consumer.bat --bootstrap-server 127.0.0.1:9092 --topic myfirst --from-beginning'

This tells the consumer to read all the messages from the beginning of the topic (including those produced while the consumer was not running).

For example,
In the above snapshot, it is clear that all messages are displayed from the beginning.

Note: The ordering of the messages is not total. The sequence is guaranteed at the partition level only (as discussed in the Kafka introduction section).

For this topic 'myfirst', we had three partitions. So, if a user wishes to see a total order, create a topic with a single partition; it will then display all messages in sequence.

After completing the message exchange process, press 'Ctrl+C' and stop.

So, messages can be consumed either from the beginning or only from the point at which the user starts the consumer.


Kafka Consumer Group CLI


Generally, a Kafka consumer belongs to a particular consumer group. A consumer group basically represents the name of an application. To consume messages as part of a consumer group, the '--group' option is used.
Let's see how consumers consume messages from Kafka topics:
Step1: Open the Windows command prompt.

Step2: Use the '--group' option as: 'kafka-console-consumer --bootstrap-server localhost:9092 --topic <topic_name> --group <group_name>'. Give some name to the group. Press enter.

In the above snapshot, the name of the group is 'first_app'. No messages are displayed because no new messages were produced to this topic. If the '--from-beginning' option were used, all the previous messages would be displayed.

Step3: To view some new messages, produce some instant messages from the producer console (as done in the previous section).
So, the new messages produced by the producer can be seen in the consumer's console.

Step4: So far, a single consumer was reading data in the group. Let's create more consumers to understand the power of a consumer group. For that, open a new terminal and type the exact same consumer command:

'kafka-console-consumer.bat --bootstrap-server 127.0.0.1:9092 --topic <topic_name> --group <group_name>'.
In the above snapshot, it is clear that the producer is sending data to the Kafka topic and the two consumers are consuming the messages. Look at the sequence of the messages: since three partitions were created for the 'myfirst' topic (discussed earlier), the messages are split between the consumers along those partitions.

We can further create more consumers under the same group, and each consumer will consume messages according to the number of partitions. Try it yourself to understand this better.

Note: The group id should be the same for all consumers; only then will the messages be split between them.

However, if any of the consumers is terminated, the partitions will be reassigned to the active
consumers, and these active consumers will receive the messages.
So, in this way, various consumers in a consumer group consume the messages from the Kafka
topics.

Consumer with Keys


When a producer attaches a key to the data, the record is stored in the partition determined by that key. If no key is specified, the data goes to any partition. So, when a consumer reads a message and prints its key, the key is displayed as null if none was specified. The 'print.key' and 'key.separator' properties are required to display keys while consuming messages from Kafka topics. The command used is:

'kafka-console-consumer --bootstrap-server localhost:9092 --topic <topic_name> --from-beginning --property print.key=true --property key.separator=,'

Using the above command, the consumer can read data with the specified keys.

More about Consumer Group


'--from-beginning' command

This option is used to read messages from the start (discussed earlier). Using it in a consumer group gives the following output:
It can be noticed that a new consumer group, 'second_app', is used to read the messages from the beginning. If the same command is run again, it will not display any output. This is because offsets are committed in Apache Kafka: once a consumer group has read all the messages written so far, the next time it will read only the new messages.

For example, in the below snapshot, when the '--from-beginning' option is used again, only the new messages are read, because all the previous messages were already consumed.
'kafka-consumer-groups' command

This command provides the tooling to list all the groups, describe a group, delete consumer-group info, or reset consumer group offsets.
It requires a bootstrap server so the client can perform these functions on the consumer group.

Listing Consumer Groups

The '--list' option is used to list the consumer groups available in the Kafka cluster. The command is used as:

'kafka-consumer-groups.bat --bootstrap-server localhost:9092 --list'.

A snapshot is shown below; there are three consumer groups present.

Describing a Consumer Group

The '--describe' option is used to describe a consumer group. The command is used as:
'kafka-consumer-groups.bat --bootstrap-server localhost:9092 --describe --group <group_name>'

This command shows whether any active consumer is present, along with the current offset and the lag value; a lag of 0 indicates that the consumer has read all the data.

Resetting the Offsets


Offsets are committed in Apache Kafka. Therefore, if a user wants to read the messages again, the offset values must be reset. The 'kafka-consumer-groups' command offers an option to reset the offsets. Resetting the offset value means defining the point from which the user wants to read the messages again. It supports only one consumer group at a time, and there must be no active instances of the group.

While resetting the offsets, the user needs to choose three arguments:

1. An execution option
2. Reset Specifications
3. Scope

There are two execution options available:

'--dry-run': This is the default execution option. It only plans which offsets would be reset.

'--execute': This option actually updates the offset values.

The following reset specifications are available:

'--to-datetime': Resets the offsets to the offset corresponding to a datetime. The format used is 'YYYY-MM-DDTHH:mm:SS.sss'.

'--to-earliest': Resets the offsets to the earliest offset.

'--to-latest': Resets the offsets to the latest offset.

'--shift-by': Resets the offsets by shifting the current offset value by 'n'. The value of 'n' can be positive or negative.

'--from-file': Resets the offsets to the values defined in a CSV file.

'--to-current': Resets the offsets to the current offset.

There are two scopes available:

'--all-topics': Resets the offset value for all the topics within the group.

'--topic': Resets the offset value for the specified topic only. The user needs to specify the topic name for which to reset the offsets.

Let's try and see:

1) Using the '--to-earliest' option

In the above snapshot, the offsets are reset to 0. This is because the '--to-earliest' option was used, which resets the offset value to the earliest offset.
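The full command has roughly the following shape (the group and topic names are only examples):

'kafka-consumer-groups.bat --bootstrap-server localhost:9092 --group first_app --reset-offsets --to-earliest --topic myfirst --execute'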

2) Using the '--shift-by' option

In the first snapshot, the offsets are shifted by '+2' (from 0 to 2). In the second one, they are shifted by '-1'.

Note: To shift the offset by a positive count, it is not necessary to use the '+' symbol; a plain number is treated as positive.

Creating Kafka Producer in Java


In the last section, we learned the basic steps to create a Kafka project. Now, before creating a Kafka producer in Java, we need to define the essential project dependencies. In our project, two dependencies are required:
1. Kafka Dependencies
2. Logging Dependencies, i.e., SLF4J Logger.

The following steps are required to set up the dependencies:

Step1: The build tool Maven uses a 'pom.xml' file. The 'pom.xml' is a default XML file that carries all the information regarding the GroupId, ArtifactId, and Version values. The user needs to define all the necessary project dependencies in the 'pom.xml' file. Go to the 'pom.xml' file.

Step2: Firstly, we need to define the Kafka Dependencies. Create a


'<dependencies>...</dependencies>' block within which we will define the required
dependencies.

Step3: Now, open a web browser and search for 'Kafka Maven' as shown below:
Click on the highlighted link and select the 'Apache Kafka, Kafka-Clients' repository. A
sample is shown in the below snapshot:

Step4: Select the repository version according to the downloaded Kafka version on the system.
For example, in this tutorial, we are using 'Apache Kafka 2.3.0'. Thus, we require the repository
version 2.3.0 (the highlighted one).
Step5: After clicking on the repository version, a new window will open. Copy the dependency
code from there.
Since, we are using Maven, copy the Maven code. If the user is using Gradle, copy the Gradle
written code.

Step6: Paste the copied code to the '<dependencies>...</dependencies>' block, as shown below:
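For reference, the pasted Kafka dependency should look roughly like this (the version must match the one selected on the repository page, here 2.3.0):

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>2.3.0</version>
</dependency>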
If the version number appears red, it means the 'Auto-Import' option was not enabled. If so, go to View > Tool Windows > Maven. A Maven Projects window will appear on the right side of the screen. Click on the 'Refresh' button there; this re-imports the Maven project. When the color changes to black, the missing dependency has been downloaded, and the user can proceed to the next step.

Step7: Now, open the web browser, search for 'SLF4J Simple', and open the highlighted link shown in the below snapshot:
A bunch of repositories will appear. Click on the appropriate repository.
To know the appropriate repository, look at the Maven projects window, and see the slf4j version
under 'Dependencies'.
Click on the appropriate version and copy the code, and paste below the Kafka dependency in
the 'pom.xml' file, as shown below:
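The copied SLF4J dependency will look similar to the following (the version number is only an example; use the one shown in the Maven Projects window):

<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-simple</artifactId>
    <version>1.7.26</version>
    <scope>test</scope>
</dependency>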
Note: Either comment out or remove the '<scope>test</scope>' line from the copied code. This scope tag limits the dependency to test code, and since we need this dependency for all of our code, the scope should not be limited.

Now, we have set all the required dependencies. Let's try the 'Simple Hello World' example.

Firstly, create a Java package, say 'com.firstgroupapp.aktutorial', and a Java class beneath it. While creating the Java package, follow the package naming conventions. Finally, create the 'hello world' program.
After executing the 'producer1.java' file, the output 'Hello World' is displayed successfully. This confirms that IntelliJ IDEA is working correctly.

Creating Java Producer


Basically, there are four steps to create a Java producer, as discussed earlier:
1. Create producer properties
2. Create the producer
3. Create a producer record
4. Send the data.

Creating Producer Properties


Apache Kafka offers various properties that are used for creating a producer. To learn about each property, visit the official Apache site, 'https://kafka.apache.org', and move to Kafka > Documentation > Configuration > Producer Configs.

There, users can read about all the producer properties offered by Apache Kafka. Here, we will discuss the required properties:

1. bootstrap.servers: A list of host:port pairs used to establish the initial connection to the Kafka cluster. The bootstrap servers are used only for that initial connection. The value has the form host1:port1,host2:port2,...
2. key.serializer: The serializer class for the key, which implements the 'org.apache.kafka.common.serialization.Serializer' interface.
3. value.serializer: The serializer class for the value, which implements the 'org.apache.kafka.common.serialization.Serializer' interface.

Now, let's see the implementation of the producer properties in the IntelliJ IDEA.
When we create the properties, it imports the 'java.util.Properties' to the code.
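A minimal sketch of this step (using the ProducerConfig constants from the kafka-clients library so the property names cannot be mistyped):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

Properties properties = new Properties();
properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9092");
properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());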

So, in this way, the first step to create producer properties is completed.

Creating the Producer

To create a Kafka producer, we just need to create an object of KafkaProducer.

The object of KafkaProducer can be created as:

1. KafkaProducer<String,String> first_producer = new KafkaProducer<String, String>(properties);

Here, 'first_producer' is the name of the producer we have chosen. The user can choose
accordingly.

Let's see in the below snapshot:


Creating the Producer Record
In order to send data to Kafka, the user needs to create a ProducerRecord, because every message a producer sends is wrapped in a producer record. Here, the producer specifies the topic name as well as the message to be delivered to Kafka.

A ProducerRecord can be created as:

1. ProducerRecord<String, String> record = new ProducerRecord<String, String>("my_first", "Hye Kafka");

Here, 'record' is the name chosen for creating the producer record, 'my_first' is the topic name,
and 'Hye Kafka' is the message. The user can choose accordingly.

Let's see in the below snapshot:


Sending the data
Now, the user is ready to send data to Kafka. The producer just needs to pass the ProducerRecord to the send() method:

1. first_producer.send(record);

Let's see in the below snapshot:

To know the output of the above codes, open the 'kafka-console-consumer' on the CLI using the
command:

'kafka-console-consumer --bootstrap-server 127.0.0.1:9092 --topic my_first --group first_app'

Sending data from a producer is asynchronous. Therefore, two additional calls, flush() and close(), are required (as seen in the above snapshot). flush() forces all buffered records to actually be sent, and close() flushes and then stops the producer. If these methods are not called, the data may never be sent to Kafka, and the consumer will not be able to read it.
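Putting the four steps together, a minimal sketch of the whole producer class (reusing the 'producer1' class name, topic, and message from above) could look like this:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class producer1 {
    public static void main(String[] args) {
        // 1. Create producer properties
        Properties properties = new Properties();
        properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9092");
        properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // 2. Create the producer
        KafkaProducer<String, String> first_producer = new KafkaProducer<String, String>(properties);

        // 3. Create a producer record
        ProducerRecord<String, String> record = new ProducerRecord<String, String>("my_first", "Hye Kafka");

        // 4. Send the data, then flush and close
        first_producer.send(record); // asynchronous
        first_producer.flush();      // force the buffered record to be sent
        first_producer.close();      // flushes and stops the producer
    }
}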
The output of the code on the consumer console is shown below:

On the terminal, users can see various log lines; the last one says that the Kafka producer has been closed. The message then appears on the consumer console.
Kafka Real Time Example
Till now, we have learned how to read and write data to/from Apache Kafka. In this section, we will learn how to connect a real data source to Kafka.

Here, we will work with a real-time application, i.e., Twitter. The users will get to know how to create a Twitter producer and how tweets are produced into Kafka.

Twitter is a social networking service on which users post short messages known as Tweets. Twitter users interact by posting and commenting on different posts through tweets.

To deal with Twitter, we need to get credentials for Twitter apps. It can be done by creating a
Twitter developer account. To do so, follow the below steps:

Step1: Create a Twitter account, if one does not already exist.

Step2: Open 'developer.twitter.com' on the web browser, as shown below:


Click on the Apply option.

Step3: A new page will open. Click on "Apply for a developer account"

Step4: A new page will open, asking about the intended use, for example 'How will you use Twitter data?', and so on. A snapshot is shown below:
After giving the appropriate answers, click on Next.

Step5: The next section is the Review section. Here, the user explanations will be reviewed by
Twitter, as shown below:
If Twitter finds the answers appropriate, the 'Looks good' option will be enabled. Then, move to the next section.

Step6: Finally, the user will be asked to review and accept the Developer Agreement. Accept the
agreement by clicking on the checkbox. Submit the application by clicking on the 'Submit
Application'.
Step7: After successful completion, an email confirmation page will open. Confirm with the
provided email id and proceed further.

Step8: After confirmation, a new webpage will open. Click on 'Create an app' as shown below:
Step9: Provide the app details, as shown in the below snapshot:
Step10: After giving the app details, click on the 'Create' option. A dialog box will open
"Review our Developer Terms". Click on the 'Create' option. A snapshot is shown below:
Finally, the app will be created in the following way:
Note: When the app is created, it will generate Keys and Tokens. Do not disclose them, as this is secret and sensitive information. If they are ever exposed, the user can regenerate them for safety.

Step11: After creating an app, we need to add the twitter dependency in the 'pom.xml' file. To do
so, open 'github twitter java' on a web browser. A snapshot is shown below:
Open the highlighted link or visit: 'https://github.com/twitter/hbc' to open directly.

Step12: There, the user will find the twitter dependency code. Copy the code and paste it in the
'pom.xml' file below the maven dependency code.
The term 'hbc' used in the dependency code stands for 'Hosebird Client', which is a Java HTTP client used for consuming Twitter's standard Streaming API. The Hosebird Client is divided into two modules:

1. hbc-core: It uses a message queue, which the consumer can poll for raw string messages.
2. hbc-twitter4j: This differs from hbc-core in that it uses twitter4j listeners. Twitter4j is an unofficial Java library through which a Java application can easily be integrated with various Twitter services.

In the dependency code here, hbc-core is used; users can also use hbc-twitter4j instead.
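The copied dependency will look roughly like the following (the version number is only an example; use the latest one shown on the GitHub page):

<dependency>
    <groupId>com.twitter</groupId>
    <artifactId>hbc-core</artifactId>
    <version>2.2.0</version>
</dependency>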

So, in this way, the first stage of the real-time example is completed.
Creating Twitter Producer
In this section, we will learn to create a twitter producer.

There are basically three steps to create a twitter producer:

1. Create a twitter client.


2. Create the Producer
3. Send tweets

Step1: Create a new java package, following the package naming convention rules. Then, create
a java class within it, say 'tweetproducer.java.'

Step2: Create a twitter client by creating a method for it. Now, copy the Quickstart code from
the 'github twitter java' to the twitter client method, as shown below:

Paste it in the newly created method. This code will create a connection between the client and the hbc host. The BlockingQueue blocks enqueue or dequeue operations when the queue is full or empty, respectively. As we are using hbc-core, we only require the msgQueue. Also, we will track terms rather than follow people, so copy the highlighted code only.

Now, copy the 'Creating a client' code given below the connection code as:
Paste the code below the connection code. This code will create a Twitter client through the client builder. As we are using msgQueue only, do not copy the red highlighted code, which is for the eventMessageQueue; it is not required.

Step3: Create the producer in a similar way we learned in the previous sections with a bootstrap
server connection.

Step4: After creating the Kafka producer, it's time to send tweets to Kafka. Copy the while-loop code from the 'github twitter java' page, given below the 'Creating a client' code, and paste it below the producer code.

Now, we are ready to read tweets from Twitter. However, the Kafka producer needs a topic to write the tweets to. So, create that topic using the '--create' option on the CLI, specifying the partition count and the replication factor.

For example,
Here, the topic 'twitter_topic' has been created with partition value 6 and replication-factor 1.
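The creation command is roughly (ZooKeeper-based, matching the setup used in this tutorial):

'kafka-topics --zookeeper localhost:2181 --create --topic twitter_topic --partitions 6 --replication-factor 1'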
Finally, execute the code and experience Kafka in the real-world application.

The complete code for creating the Twitter Client is given below:

package com.github.learnkafka;

import com.google.common.collect.Lists;
import com.twitter.hbc.ClientBuilder;
import com.twitter.hbc.core.Client;
import com.twitter.hbc.core.Constants;
import com.twitter.hbc.core.Hosts;
import com.twitter.hbc.core.HttpHosts;
import com.twitter.hbc.core.endpoint.StatusesFilterEndpoint;
import com.twitter.hbc.core.processor.StringDelimitedProcessor;
import com.twitter.hbc.httpclient.auth.Authentication;
import com.twitter.hbc.httpclient.auth.OAuth1;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.List;
import java.util.Properties;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class tweetproducer {
    Logger logger = LoggerFactory.getLogger(tweetproducer.class.getName());
    String consumerKey = "";     // specify the consumer key from the twitter app
    String consumerSecret = "";  // specify the consumerSecret key from the twitter app
    String token = "";           // specify the token key from the twitter app
    String secret = "";          // specify the secret key from the twitter app

    public tweetproducer() {}    // constructor to invoke the producer function

    public static void main(String[] args) {
        new tweetproducer().run();
    }

    public void run() {
        logger.info("Setup");

        BlockingQueue<String> msgQueue = new LinkedBlockingQueue<String>(1000); // specify the size accordingly
        Client client = tweetclient(msgQueue);
        client.connect(); // invokes the connection function
        KafkaProducer<String, String> producer = createKafkaProducer();

        // on a different thread, or multiple different threads....
        while (!client.isDone()) {
            String msg = null;
            try {
                msg = msgQueue.poll(5, TimeUnit.SECONDS); // specify the time
            } catch (InterruptedException e) {
                e.printStackTrace();
                client.stop();
            }
            if (msg != null) {
                logger.info(msg);
                // specify the topic name, key value, and message
                producer.send(new ProducerRecord<>("twitter_topic", null, msg), new Callback() {
                    @Override
                    public void onCompletion(RecordMetadata recordMetadata, Exception e) {
                        if (e != null) {
                            logger.error("Something went wrong", e);
                        }
                    }
                });
            }
        }

        logger.info("This is the end"); // when the reading is complete, inform the logger
    }

    public Client tweetclient(BlockingQueue<String> msgQueue) {
        Hosts hosebirdHosts = new HttpHosts(Constants.STREAM_HOST);
        StatusesFilterEndpoint hosebirdEndpoint = new StatusesFilterEndpoint();
        // describe anything for which we want to read the tweets
        List<String> terms = Lists.newArrayList("India");
        hosebirdEndpoint.trackTerms(terms);
        Authentication hosebirdAuth = new OAuth1(consumerKey, consumerSecret, token, secret);
        ClientBuilder builder = new ClientBuilder()
                .name("Hosebird-Client-01") // optional: mainly for the logs
                .hosts(hosebirdHosts)
                .authentication(hosebirdAuth)
                .endpoint(hosebirdEndpoint)
                .processor(new StringDelimitedProcessor(msgQueue));

        Client hosebirdClient = builder.build();
        return hosebirdClient; // attempts to establish a connection
    }

    public KafkaProducer<String, String> createKafkaProducer() {
        // creating producer properties
        String bootstrapServers = "127.0.0.1:9092";
        Properties properties = new Properties();
        properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // creating the kafka producer
        KafkaProducer<String, String> first_producer = new KafkaProducer<String, String>(properties);
        return first_producer;
    }
}

In the above code, the user must specify the consumerKey, consumerSecret, token, and secret values. Since this is sensitive information, it is not displayed here. Copy the keys from 'Keys and Tokens' on 'developer.twitter.com' and paste them at their respective positions in the code.

The output of the above code will be displayed as:


The client establishes a connection with the Hosebird host. After this, a stream of tweets matching the tracked term 'India' can be seen being produced. Post some tweets containing the tracked term and try it out.

Try out the 'kafka-console-consumer --bootstrap-server 127.0.0.1:9092 --topic twitter_topic'


command on the CLI. The output will be the same as on the IntelliJ IDEA terminal:
In this way, we can create a real Twitter-Kafka-Producer and send tweets to Kafka.
