Big Data-Kafka
+ A unified platform for handling all the real-time data feeds a large company might have.
+ High throughput to support high-volume event feeds.
+ Support real-time processing of these feeds to create new, derived feeds.
+ Support large data backlogs to handle periodic ingestion from offline systems.
+ Support low-latency delivery to handle more traditional messaging use cases.
+ Guarantee fault tolerance in the presence of machine failures.
What is Kafka
An Apache Kafka cluster consists of one or more servers; each of these servers is what we call a broker. Brokers are also responsible for maintaining general state information of the system, leader election, and so on.
+ Producers write data to brokers.
+ Consumers read data from brokers.
+ All of this is distributed.
+ Data is stored in topics.
+ Topics are split into partitions, which are replicated.
More Concepts
+ Partitions: A topic consists of partitions. A partition is an ordered, immutable sequence of messages that is continually appended to, i.e. a commit log. The number of partitions per topic is configurable, and it determines the maximum consumer (group) parallelism.
+ Kafka Log: A log is simply another way to view a partition. Basically, a data source writes messages to the log, and one or more consumers read that data from the log at any time they want.
+ Offset: Messages in a partition are each assigned a unique (per partition) and sequential id called the offset. Consumers track their position via (offset, partition, topic) tuples.
+ Replicas: “Backups” of a partition. They exist solely to prevent data loss. Replicas are never read from and never written to directly; they do NOT help to increase producer or consumer parallelism. Kafka tolerates (numReplicas – 1) dead brokers before losing data.
+ Consumer Group: Kafka can have multiple consumer processes/instances running, and each consumer group has a unique group-id. Within one consumer group, exactly one consumer instance reads the data from a given partition at the time of reading. Since there can be more than one consumer group, one instance from each of these groups can read from the same partition. However, if the number of consumers exceeds the number of partitions, some consumers will be inactive. For example, with 8 consumers and 6 partitions in a single consumer group, there will be 2 inactive consumers (see the sketch below).
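As a rough illustration of this mapping (this is not Kafka's actual partition assignor, just a sketch of the counting argument in Python), assigning partitions round-robin shows why 6 partitions can keep at most 6 of the 8 consumers busy:

# Illustrative sketch only: round-robin assignment of partitions to consumers
# within one consumer group. Kafka's real assignors are more sophisticated,
# but the idle-consumer arithmetic is the same.
def assign_partitions(num_partitions, num_consumers):
    assignment = {c: [] for c in range(num_consumers)}
    for p in range(num_partitions):
        assignment[p % num_consumers].append(p)
    idle = [c for c, parts in assignment.items() if not parts]
    return assignment, idle

assignment, idle = assign_partitions(num_partitions=6, num_consumers=8)
print(assignment)   # consumers 6 and 7 receive no partitions
print(len(idle))    # 2 inactive consumers, as in the example above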
Role of Zookeeper in Kafka
Apache Zookeeper serves as the coordination interface between the Kafka brokers and
consumers.
+ Also, we can say it is a distributed configuration and synchronization service.
+ Basically, the ZooKeeper cluster shares information with the Kafka servers.
+ Moreover, Kafka stores basic metadata in ZooKeeper, such as information about topics, brokers, consumer offsets (queue readers) and so on.
+ In addition, the failure of a ZooKeeper node or a broker does not affect the Kafka cluster, because the critical information stored in ZooKeeper is replicated across its ensemble. Kafka then restores its state as ZooKeeper restarts, leading to zero downtime for Kafka.
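As a minimal sketch of the metadata ZooKeeper holds for Kafka, the snippet below lists the broker and topic znodes. It assumes the kazoo Python client is installed and ZooKeeper is reachable on localhost:2181; /brokers/ids and /brokers/topics are the paths under which Kafka registers its brokers and topics.

# Sketch: peeking at the Kafka metadata stored in ZooKeeper.
# Assumes `pip install kazoo` and a ZooKeeper node on localhost:2181.
from kazoo.client import KazooClient

zk = KazooClient(hosts="localhost:2181")
zk.start()

print(zk.get_children("/brokers/ids"))     # ids of the live, registered brokers
print(zk.get_children("/brokers/topics"))  # topics known to the cluster

zk.stop()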
Producer and Consumer in Kafka
+ You use Kafka “producers” to write data to Kafka brokers.
+ Client libraries are available for the JVM (Java, Scala), C/C++, Python, Ruby, etc.
+ In order to send messages asynchronously to a topic, the KafkaProducer class provides the send() method; in the Java client it takes a ProducerRecord and returns a Future, as sketched below.
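A minimal producer sketch using the kafka-python client (an assumption here; the idea is the same in the Java client). The broker address is a placeholder, and the topic name is the one created later in these notes.

# Sketch: asynchronous send with kafka-python (pip install kafka-python).
# The broker address below is a placeholder for your cluster.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# send() is asynchronous: it returns a future immediately and batches
# messages in the background.
future = producer.send("ineurontopic", key=b"user-1", value=b"hello kafka")

# Block on the future only if the record metadata (partition, offset) is needed.
metadata = future.get(timeout=10)
print(metadata.partition, metadata.offset)

producer.flush()   # make sure buffered messages are actually sent
producer.close()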
+ After creating a Kafka Producer to send messages to the Apache Kafka cluster, we create a Kafka Consumer to consume messages from the Kafka cluster.
+ A Kafka Consumer subscribes to one or more topics in the Kafka cluster and then feeds on tokens or messages from those Kafka topics. Using the heartbeat we can know the connectivity of the Consumer to the Kafka cluster: the heartbeat is set up at the Consumer to let ZooKeeper or the Broker Coordinator know whether the Consumer is still connected to the cluster. If the heartbeat is absent, the Kafka Consumer is no longer considered connected to the cluster, and the Broker Coordinator has to re-balance the load. However, the heartbeat is an overhead to the cluster, so, keeping data throughput and overhead in consideration, we can configure the interval at which the Consumer sends the heartbeat (see the consumer sketch below).
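A minimal consumer sketch, again assuming the kafka-python client; heartbeat_interval_ms and session_timeout_ms are the settings that trade heartbeat overhead against how quickly a dead consumer is detected and the load re-balanced.

# Sketch: a consumer in a consumer group, with explicit heartbeat settings.
# Assumes kafka-python; the broker address is a placeholder.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "ineurontopic",
    bootstrap_servers="localhost:9092",
    group_id="ineuron-group",        # consumers sharing this id split the partitions
    auto_offset_reset="earliest",    # start from the beginning if no committed offset exists
    heartbeat_interval_ms=3000,      # how often a heartbeat is sent to the coordinator
    session_timeout_ms=10000,        # missing heartbeats for this long triggers a re-balance
)

for message in consumer:
    print(message.topic, message.partition, message.offset, message.value)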
Partition to Consumer Group Mapping
Create Topic
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 2 --topic ineurontopic
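The same topic can also be created programmatically; a sketch using kafka-python's admin client, assuming a broker is reachable on localhost:9092:

# Sketch: create the topic from code instead of the CLI (kafka-python admin API).
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([NewTopic(name="ineurontopic", num_partitions=2, replication_factor=1)])
admin.close()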
Producer to a topic
bin/kafka-console-producer.sh --broker-list 172.18.0.2:6667 --topic ineurontopic
Documentation: https://gerardnico.com/dit/kafka/kafka-console-consumer
Class Notes
/FileStore/newdata.csv
/databricks-datasets/
%sh
# move the uploaded CSV from FileStore into the databricks-datasets folder on DBFS
mv /dbfs/FileStore/newdata.csv /dbfs/databricks-datasets/
reduce - pulls the entire result down into a single location (the driver), because it is reducing to one final value.
[1,2,3,4]
rdd.reduce(somefunction) ---> N1,N2,N3....---> single Python result on the driver
[(1,2),(2,3),(2,4),(1,2).....] ->N1,N2,N3....-->[(1,4),(2,7)]
pairedrdd.reduceByKey(somefunction) - one value for each key. Since this reduction can be run on each machine locally first, the result can remain an RDD and have further transformations applied to it.
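A small sketch of both operations with the same numbers as above, assuming a live SparkContext named sc (as in a Databricks notebook):

# Sketch: reduce vs reduceByKey, assuming an existing SparkContext `sc`.
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.reduce(lambda a, b: a + b))             # 10 - single value pulled to the driver

paired = sc.parallelize([(1, 2), (2, 3), (2, 4), (1, 2)])
summed = paired.reduceByKey(lambda a, b: a + b)   # still an RDD; combined per key on each node first
print(summed.collect())                           # [(1, 4), (2, 7)] (order may vary)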