Apache Kafka
Useful Links
The Kafka Streaming Load Scheduler
https://www.vertica.com/
Zookeeper Recipes and solutions
Kafka Controller Election and Controller Broker
Partitioning and bootstrapping
What is Apache Kafka?
Apache Kafka is an open-source stream-processing software platform originally developed at LinkedIn and written in Scala and Java.
It provides a unified, high-throughput, low-latency platform for handling real-time data feeds. Kafka
connects to external systems via Kafka Connect and provides Kafka Streams, a Java stream-processing library.
Kafka uses a binary TCP-based protocol that is optimized for efficiency and relies on a "message set"
abstraction that naturally groups messages together to reduce the overhead of the network roundtrip.
Point to Point Data Pipeline
Centralized Data Pipeline
Kafka Cluster
Default Topics
Topic Name | Event Type | Notes
th-cef | CEF event data | Can be configured as a SmartConnector or Collector/CTH destination
th-binary_esm | Binary security events, the format consumed by ArcSight ESM | Can be configured as a SmartConnector or Collector/CTH destination
th-syslog | Raw syslog data sent by a Collector, read by one or more Connectors in Transformation Hub | Can be configured as a SmartConnector or Collector/CTH destination
th-cef-other | CEF event data destined for a non-ArcSight subscriber |
th-arcsight-avro-p_metrics | Stream processor operational metrics data | For ArcSight product use only
th-arcsight-avro | Event data in Avro format for use by ArcSight Investigate | For ArcSight product use only
th-arcsightjson-datastore | Event data in JSON format for use by ArcSight infrastructure management | For ArcSight product use only
Note: In addition, using ArcSight Management Center, you can create new custom topics to which your
SmartConnectors can connect and send events.
Transformation Hub Event Flow
Transformation Hub
Avro is defined by a schema (the schema itself is written in JSON)
Data is fully typed
Data is compressed automatically (less CPU usage)
The Kafka Scheduler consumes from the TH Avro topic and copies the data into Vertica. For Kafka Scheduler configuration options, see the Investigate Deployment Guide, p. 48.
Vertica's streaming load scheduler provides high-performance streaming data load from Kafka into your Vertica database.
Documentation is embedded in the schema
The schema can evolve over time in a safe manner (schema evolution)
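As a sketch of what "the schema is written in JSON" means, here is a minimal, hypothetical Avro record schema expressed as a Python dict. The field names are illustrative only, not the actual ArcSight Avro schema.

```python
import json

# Hypothetical Avro schema sketch -- field names are illustrative,
# not the real ArcSight event schema.
event_schema = {
    "type": "record",
    "name": "Event",
    "namespace": "com.example.th",
    "fields": [
        {"name": "deviceVendor", "type": "string"},
        {"name": "deviceProduct", "type": "string"},
        {"name": "severity", "type": "int"},
        # Schema evolution: fields added later should be nullable or carry
        # a default so readers using the old schema remain compatible.
        {"name": "sourceAddress", "type": ["null", "string"], "default": None},
    ],
}

# Avro schemas are themselves JSON documents:
schema_json = json.dumps(event_schema, indent=2)
```

Because every field is declared with a type, data written against this schema is fully typed, and adding the defaulted `sourceAddress` field later is a safe, backward-compatible evolution.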
ESM Topic
The ESM topic receives the binary security event data from connectors.
Binary events are batched and sent to TH in batches of up to 100 events, or after a certain period of
time (depending on configuration) with fewer than 100 events.
From the Apache Kafka documentation: "Our APIs encourage batching small things together for efficiency. We have found this is
a very significant performance win. Both our API to send messages and our API to fetch messages
always work with a sequence of messages, not a single message, to encourage this."
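The batching behavior described above (flush at 100 events, or after a timeout with fewer) can be sketched as follows. The function name and thresholds are illustrative, not the connector's actual implementation.

```python
import time

def batch_events(events, max_batch=100, max_wait=0.5, clock=time.monotonic):
    """Group events into batches of up to `max_batch`, also flushing a
    partial batch once `max_wait` seconds have elapsed since the batch
    was started. A sketch of count-or-time batching, not real code."""
    batches, current, started = [], [], clock()
    for ev in events:
        current.append(ev)
        if len(current) >= max_batch or clock() - started >= max_wait:
            batches.append(current)          # flush: size or time reached
            current, started = [], clock()
    if current:
        batches.append(current)              # flush the trailing partial batch
    return batches

# 250 events become two full batches of 100 plus one partial batch of 50.
batches = batch_events(list(range(250)), max_batch=100, max_wait=999)
```

Batching many small events into one message is exactly the "message set" efficiency win the Kafka documentation describes.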
Topics, partitions and offsets I
Topics: a particular stream of data
Similar to a table in a database (without all the constraints)
You can have as many topics as you want
A topic is identified by its name
Topics are split into partitions
Each partition is ordered
Each message within a partition gets an incremental id, called offset
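The per-partition offset numbering can be illustrated with a toy in-memory model (this is not Kafka's actual storage format):

```python
from collections import defaultdict

class TopicLog:
    """Toy model of a topic: each partition is an append-only list, and a
    message's offset is simply its position within its own partition."""
    def __init__(self):
        self.partitions = defaultdict(list)

    def append(self, partition, message):
        log = self.partitions[partition]
        log.append(message)
        return len(log) - 1   # the offset just assigned

topic = TopicLog()
assert topic.append(0, "a") == 0   # first offset in partition 0
assert topic.append(0, "b") == 1   # offsets increment within a partition
assert topic.append(1, "c") == 0   # partition 1 has its own offset sequence
```

Note that partition 0 and partition 1 each start at offset 0: an offset identifies a message only together with its partition.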
Topics, partitions and offsets II
Offsets only have meaning for a specific partition.
Order is guaranteed only within a partition (not across partitions)
Data is kept only for a limited time (default is one week)
Once the data is written to a partition, it can’t be changed.
Partitions are also the way that Kafka provides redundancy and scalability
Brokers
A Kafka cluster is composed of multiple brokers
Each broker is identified by its ID
Each broker contains certain topic partitions
After connecting to any broker (called a bootstrap broker), you will be connected to the entire cluster
A good number to get started is 3 brokers, but some big clusters have over 100 brokers
Kafka Controller Election
One Kafka broker is elected as the controller in a process known as the "Kafka Controller Election".
The controller is a Kafka service that runs on every broker in a Kafka cluster, but only one can be
active (elected) at any point in time.
When the active controller fails, another broker is elected as the new controller.
ZooKeeper uses the SEQUENCE|EPHEMERAL flags when creating znodes that represent "proposals" of
clients.
With the sequence flag, ZooKeeper automatically appends a sequence number that is greater than any
one previously appended to a child of "/election". The process that created the znode with the smallest
appended sequence number is the leader.
The Kafka Controller Election process relies heavily on the features of Apache ZooKeeper, which acts as the
source of truth and guarantees that only one broker can ever be elected (due to how ephemeral
nodes work).
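The SEQUENCE|EPHEMERAL election recipe above can be simulated with a toy model. This illustrates the ZooKeeper recipe only; it is not real ZooKeeper client code.

```python
import itertools

class ElectionZnode:
    """Toy model of ZooKeeper's SEQUENCE znodes under /election: each
    candidate's proposal gets a monotonically increasing sequence number,
    and the candidate holding the smallest live number is the leader."""
    def __init__(self):
        self._seq = itertools.count()
        self.proposals = {}          # broker_id -> sequence number

    def propose(self, broker_id):
        self.proposals[broker_id] = next(self._seq)

    def session_expired(self, broker_id):
        # EPHEMERAL: the znode disappears when its owner's session dies.
        self.proposals.pop(broker_id, None)

    def leader(self):
        return min(self.proposals, key=self.proposals.get)

election = ElectionZnode()
for broker in ("broker-1", "broker-2", "broker-3"):
    election.propose(broker)
assert election.leader() == "broker-1"    # smallest sequence number wins
election.session_expired("broker-1")      # controller's session dies...
assert election.leader() == "broker-2"    # ...next-smallest takes over
```

Because a dead controller's ephemeral znode vanishes automatically, a successor is determined without any explicit failover coordination.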
Brokers and topics
Example of topic-th-cef with 3 partitions
Example of topic-th-esm with 2 partitions
Example of topic-th-avro with 3 partitions
Topic replication factor
Topics should have a replication factor > 1 (usually between 2 and 3)
This way, if a broker is down, another broker can serve the data
Concept of Leader for a Partition
At any time only ONE broker can be a leader for a given partition
Only that leader can receive and serve data for a partition
The other brokers will synchronize the data
Therefore each partition has one leader and multiple ISRs (in-sync replicas)
Leaders and ISRs are determined by ZooKeeper
Producers
Producers write data to topics (which are made of partitions)
In case of Broker failures, Producers will automatically recover
Producers can choose to receive acknowledgment of data writes:
Acks=0: Producer won’t wait for acknowledgement (possible data loss)
Acks=1: Producer will wait for leader acknowledgement (limited data loss)
Acks=all: Leader + replicas acknowledgement (no data loss)
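As a sketch, the acks options map onto producer client configuration like this. The keys mirror the Python kafka-python client's `KafkaProducer` parameters; the host names and values are illustrative, not deployment recommendations.

```python
# Illustrative producer configuration -- keys mirror kafka-python's
# KafkaProducer parameters; broker addresses are made up.
producer_config = {
    "bootstrap_servers": ["broker1:9092", "broker2:9092", "broker3:9092"],
    # acks=0:     fire and forget (possible data loss)
    # acks=1:     wait for the partition leader only (limited data loss)
    # acks="all": wait for leader + in-sync replicas (no data loss)
    "acks": "all",
    # Retries let the producer recover automatically from transient
    # broker failures, as described above.
    "retries": 5,
}
```

With `acks="all"`, a write is acknowledged only after the leader and its in-sync replicas have it, trading latency for durability.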
Consumers
Consumers read data from a topic (identified by name)
Consumers know which broker to read from
The consumer group assures that each partition is only consumed by one member
If a single consumer fails, the remaining members of the group will rebalance the partitions being
consumed to take over for the missing member
Data is read in order within each partition
It can take up to 24 hours for the broker nodes to balance the partitions among the Kafka consumers.
Consumer Groups
Consumers read data in consumer groups
Consumers will automatically use a Group Coordinator and a Consumer Coordinator to assign consumers
to a partition
Each consumer within a group reads from exclusive partitions
If you have more consumers than partitions, some consumers will be inactive
A consumer of a topic in Transformation Hub can scale the consumption rate by adding more consumers
to the consumer group.
When adding new consumers to the consumer group, consider the partition count of the topic you are consuming from.
When the number of consumers in a consumer group is a single consumer, the single consumer will
consume from all partitions in the source topic.
When the number of consumers in a consumer group is lower than the partition count in the source
topic, each of the consumers will consume from a subset of the topic partitions.
When the number of consumers in a consumer group equals the partition count in the source topic,
each of the consumers will consume from each of the topic partitions.
When the number of consumers in a consumer group is higher than the partition count in the source
topic, each of the consumers will consume from each of the topic partitions, and additional consumers
will stay idle unless new partitions are added to the source topic.
If you change the number of partitions in the source topic to match the consumer group size,
Transformation Hub will automatically re-balance the consumer groups.
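The sizing rules above can be illustrated with a simplified assignment function. The real Kafka assignors (range, round-robin, sticky) are more involved; this only demonstrates how group size relates to partition count.

```python
def assign_partitions(consumers, num_partitions):
    """Round-robin sketch of partition assignment within one consumer
    group: each partition goes to exactly one consumer."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

# One consumer: it owns every partition.
assert assign_partitions(["c1"], 4) == {"c1": [0, 1, 2, 3]}
# Fewer consumers than partitions: each consumes a subset.
assert assign_partitions(["c1", "c2"], 4) == {"c1": [0, 2], "c2": [1, 3]}
# More consumers than partitions: the extras sit idle.
assert assign_partitions(["c1", "c2", "c3"], 2) == {"c1": [0], "c2": [1], "c3": []}
```

The idle consumer in the last case only becomes useful if partitions are added to the source topic.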
Consumer Offsets
Kafka stores the offsets at which a consumer group has been reading
The committed offsets live in a Kafka topic named __consumer_offsets
When a consumer in a group has processed data received from kafka, it should be committing the offsets
If a consumer dies, it will be able to read back from where it left off thanks to the committed consumer
offsets
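A toy sketch of how a committed offset lets a replacement consumer resume where a failed one left off (names and data are illustrative):

```python
def consume(partition_log, committed_offset):
    """Resume reading a partition from the last committed offset --
    a sketch of committed-offset recovery, not a real Kafka client."""
    for offset in range(committed_offset, len(partition_log)):
        yield offset, partition_log[offset]

log = ["e0", "e1", "e2", "e3", "e4"]
# The first consumer processed e0..e2 and committed offset 3 before dying.
replacement = list(consume(log, committed_offset=3))
assert replacement == [(3, "e3"), (4, "e4")]   # nothing lost, nothing re-read
```

This is why committing only after processing matters: committing too early risks skipping unprocessed events on recovery.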
Kafka Broker Discovery
Every Kafka Broker is also called a bootstrap server
That means that you only need to connect to one broker, and you will be connected to the entire cluster
Each broker knows about all brokers, topics and partitions (metadata)
All Kafka brokers can answer a metadata request that describes the current state of the cluster: what
topics there are, which partitions those topics have, which broker is the leader for those partitions, and
the host and port information for these brokers.
The client does not need to keep polling to see if the cluster has changed; it can fetch metadata once
when it is instantiated and cache that metadata until it receives an error indicating that the metadata is out
of date. This error can come in two forms: (1) a socket error indicating the client cannot communicate
with a particular broker, or (2) an error code in the response to a request indicating that this broker no
longer hosts the partition for which data was requested.
Kafka Guarantees
Messages are appended to a topic-partition in the order they are sent
Consumers read messages in the order stored in a topic-partition
With a replication factor of N, producers and consumers can tolerate up to N – 1 brokers being down
This is why a replication factor of 3 is a good idea:
Allows for one broker to be taken down for maintenance
Allows for another broker to be taken down unexpectedly
As long as the number of partitions remains constant for a topic (no new partitions), the same key will
always go to the same partition.
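The hash-mod pattern behind the key-to-partition guarantee can be sketched as follows. Kafka's default partitioner actually uses murmur2; crc32 here is only a stand-in to show the mechanism.

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Key-based partitioning sketch: hash the key, take it modulo the
    partition count. crc32 stands in for Kafka's murmur2 hash."""
    return zlib.crc32(key) % num_partitions

# Same key, same partition count -> always the same partition.
assert partition_for(b"device-42", 6) == partition_for(b"device-42", 6)

# Adding partitions changes the modulus, so an existing key may map to a
# different partition -- which is why the guarantee only holds while the
# partition count stays constant.
before = partition_for(b"device-42", 6)
after = partition_for(b"device-42", 7)
# `before` and `after` are not guaranteed to be equal.
```

Combined with per-partition ordering, a stable key-to-partition mapping gives ordered processing for all events sharing a key.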
Kafka Environment Settings
~]# kubectl exec -it th-kafka-0 -n arcsight-installer-nh2xa -- bash
root@th-kafka-0:/# printenv | grep -i request
KAFKA_SOCKET_REQUEST_MAX_BYTES=104857600
Kafka Logs
You can check the Kafka logs for the following reasons:
Message size change
Creation of a new topic
New License installation
Addition of nodes
Node status
Changes in Topic leaders
Consumer’s information
Data deleted due to retention time policy
Shows who is the GroupCoordinator
And more …
~]# kubectl logs th-kafka-# -n arcsight-installer-nh2xa | grep -i message.max.bytes
message.max.bytes = 1000012
message.max.bytes = 1000012
message.max.bytes = 1000012
~]# kubectl logs th-kafka-(x) -n arcsight-installer-XXX
~]# kubectl get nodes -L kafka
Kafka Concepts
Additional Kafka Notes for TH 3.1.0
Transformation Hub is designed with support for third-party tools. You can create a standard Kafka
consumer and configure it to subscribe to Transformation Hub topics. By doing this you can pull
Transformation Hub events into your own data lake.
Custom consumers must use Kafka client libraries of version 0.11 or later.
Configure the Initial Host:Port(s) parameter field in the Transformation Hub Destination to include all
Kafka broker (worker) nodes as a comma-separated list. Provide all Kafka broker (worker) nodes in both
producer and consumer configurations to avoid a single point of failure.
Producers Configs
Consumers Configs
TLS performance impact is a known Kafka behavior. Exact details of the impact will depend on your specific
configuration, but could reduce the event rate by half or more.
You can manage topic routing and Transformation Hub infrastructure through ArcMC.
Transformation Hub provides the open-source Transformation Hub Kafka Manager to help you monitor and
manage its Kafka services.
For more information about Kafka Manager, refer to https://github.com/yahoo/kafka-manager.
For more information about Kafka monitoring, refer to the monitoring section of the Kafka documentation
The number of custom topics you can create will be limited by Kafka, as well as by the performance and
system resources needed to support the number of topics created.
In Transformation Hub Kafka Manager, users will see different offset values between CEF (Investigate or
Logger) topics and binary (ESM) topics.
In CEF topics, the offset value can generally be associated with the number of events that passed through
the topic, since each message in a CEF topic is an individual event. However, the same association cannot
be made for the ESM topic, as several events are batched into each message.
On a worker node, all event data is stored by default under
/opt/arcsight/k8s-hostpath-volume/th/kafka.
When a Transformation Hub reinstall is performed, all data that resides in a Kafka topic is preserved. No
data is lost.
When a consumer resumes data collection from the Kafka topics, it restarts where it last left off. No data is lost.
The topic offsets are recorded in the topic __consumer_offsets.
34
Additional Kafka Notes for TH 3.1.0
Kafka automatically distributes each event in a topic to the number of broker nodes indicated by the
topic replication level specified during the Transformation Hub configuration.
Although replication decreases throughput slightly, ArcSight recommends that you configure a replication factor
of at least 2.
You need at least one node for each replica. For example, a topic replication level of 5 requires at least
five nodes; one replica would be stored on each node.
A replication level of 2 means that two broker nodes will receive that same event. If one goes down, the
event data would still be present on the other, and would be restored to the first broker node when it
came back up. Both broker nodes would have to go down simultaneously for the data to be lost.
A topic replication level of 3 means that three broker nodes will receive that event. All three would have
to go down simultaneously for the event data to be lost.
Transformation Hub
If a Kafka node is not present in the Kafka manager UI (in a 3Msx3Ws cluster) execute the following
command on the node not showing in the kafka manager UI.
# kube-stop.sh
# systemctl is-active docker kubelet kube-proxy [Make sure all services are down]
# kube-start.sh
Following up from the previous example, make sure that the following file on each node has the correct
broker ID: /opt/arcsight/k8s-hostpath-volume/th/kafka/meta.properties
The IDs should be 1001, 1002, 1003.
Troubleshooting notes and commands
If the cluster list is empty in the Kafka Manager UI, delete the existing Kafka Manager pod and try the UI
again after a new Kafka Manager pod is back to the Running state. Example:
# kubectl delete pod th-kafka-manager-58dc45cb5f-t769s -n arcsight-installer-zxwac
# kubectl get nodes -L kafka
Troubleshooting notes and commands
If the "Under Replicated" column in the topics list is highlighted (shown in pink in the Kafka Manager UI),
one or more nodes are not working properly. You can do the following:
Go to the Brokers page and make sure they are all running
# kubectl get nodes [All nodes should show as ready]
If you see “Ready,SchedulingDisabled” then execute the following
# kubectl uncordon ipaddressOfNode
Zookeeper
Kafka uses Zookeeper to manage the cluster.
Apache ZooKeeper is a distributed, open-source configuration and synchronization service.
It determines cluster state, checking whether Kafka brokers are alive via heartbeat requests to the nodes.
Kafka can’t work without Zookeeper
ZooKeeper gets used for leadership election for Broker Topic Partition Leaders
ZooKeeper sends notifications to Kafka in case of changes (e.g., a new topic is created, a broker dies, a
broker comes up, a topic is deleted, etc.)
Zookeeper by design operates with an odd number of servers (3,5,7)
Zookeeper selects new leaders on broker failures
Zookeeper communicates new leaders to brokers
Zookeeper monitors the liveness of brokers
ZooKeeper has a leader, which handles writes; the rest of the servers are followers, which handle reads
Zookeeper does not store consumer offsets with Kafka > v0.10
Zookeeper manages brokers (Keeps a list of them)
Transformation Hub
How do you use the command line to create topics?
Accessing Zookeeper from the Command Line
[zk: localhost:2181(CONNECTED) 0] ls /
~]# kubectl exec th-zookeeper-0 -n arcsight-installer-y0tyk -- kafka-topics --zookeeper localhost:2181 --list
~]# kubectl exec th-zookeeper-0 -n arcsight-installer-y0tyk -- kafka-topics --zookeeper localhost:2181 --create --topic dandman --replication-factor 2 --partitions 2
Created topic "dandman".
~]# kubectl exec th-zookeeper-0 -n arcsight-installer-y0tyk -- kafka-topics --zookeeper localhost:2181 --describe --topic th-cef
Topic-Level Configs
Thank You.