Introduction To Apache Kafka and Its Setup

This document provides an introduction to Apache Kafka, a distributed publish-subscribe messaging system designed for high throughput and fault tolerance. It covers the fundamentals of messaging systems, Kafka's architecture, terminology, and setup instructions. Additionally, it highlights the advantages of using Kafka for real-time data processing and its integration with other technologies like Spark Streaming.

Uploaded by

jayashree


Window-based Stream Data Analytics with Spark and Kafka

4. Introduction to Apache Kafka and its setup

This document is licensed with a Creative Commons Attribution 4.0 International License ©2017
Learning Outline

• What is a messaging system?
  • Point to point
  • Publish-subscribe (pub-sub)
• Apache Kafka
  • Terminology
  • Architecture
• Why Kafka?
• Kafka Setup

What is a Messaging System?

• A messaging system is responsible for transferring data from one application to another.
• The applications themselves do not need to deal with how the data is shared.
• Two patterns of distributed messaging are available:
  • Point to point
  • Publish-subscribe (pub-sub)

Point to Point Messaging System

• This system uses a queue to persist messages.
• One or more consumers can consume the messages in the queue, but a particular message can be consumed by only one consumer.
• Once a consumer has read a message, it disappears from the queue.

[Figure: Sender -> Message Queue -> Receiver]
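
The queue behaviour described above can be sketched in a few lines of Python. This is only an in-memory illustration, not how any real broker is implemented; the names PointToPointQueue, send, and receive are made up for the example.

```python
from collections import deque


class PointToPointQueue:
    """In-memory stand-in for a message queue (illustrative only)."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        # The sender appends the message to the tail of the queue.
        self._messages.append(message)

    def receive(self):
        # Whichever consumer calls first gets the oldest message;
        # the message is removed, so no other consumer can see it.
        if not self._messages:
            return None
        return self._messages.popleft()


queue = PointToPointQueue()
queue.send("order-1")
queue.send("order-2")

consumer_a = queue.receive()  # takes "order-1"
consumer_b = queue.receive()  # takes "order-2"
leftover = queue.receive()    # queue is empty by now, so this is None
print(consumer_a, consumer_b, leftover)
```

Each message is delivered to exactly one of the two consumers, which is the property that distinguishes this pattern from publish-subscribe.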

Publish-Subscribe Messaging System

• In this system, messages are persisted in a topic.
• Consumers can subscribe to one or more topics and consume all the messages in those topics.
• Message producers are called publishers and message consumers are called subscribers.

[Figure: Producers publish (topic, msg) pairs to Topic 1, Topic 2, and Topic 3; consumers subscribe to topics and receive their messages.]
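
As a contrast with the queue model, the publish-subscribe behaviour can be sketched as follows. Again this is an in-memory illustration only; TopicBroker, publish, and read_all are hypothetical names chosen for the example.

```python
from collections import defaultdict


class TopicBroker:
    """In-memory stand-in for a pub-sub broker (illustrative only)."""

    def __init__(self):
        self._topics = defaultdict(list)  # topic name -> persisted messages

    def publish(self, topic, message):
        # Messages are kept in the topic; reading does not remove them.
        self._topics[topic].append(message)

    def read_all(self, topic):
        # Every subscriber of a topic sees every message in it.
        return list(self._topics[topic])


broker = TopicBroker()
broker.publish("topic-1", "hello")
broker.publish("topic-1", "world")
broker.publish("topic-2", "other")

# Two independent subscribers of topic-1 both receive all of its messages.
subscriber_a = broker.read_all("topic-1")
subscriber_b = broker.read_all("topic-1")
print(subscriber_a, subscriber_b)
```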

Apache Kafka
● Kafka is used as an enterprise messaging system that decouples the source and target systems exchanging data.
● Kafka provides high throughput through partitions and fault tolerance through replication.

Source: https://dzone.com/articles/introduction-to-apache-kafka-1

Apache Kafka

• Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of data.
• Kafka maintains feeds of messages in topics. Producers write data to topics and consumers read from topics. Since Kafka is a distributed system, topics are partitioned and replicated across multiple nodes.

[Figure: Several producers write to a Kafka cluster, and several consumers read from it.]

Apache Kafka

[Figure: Producers write (push) data to a topic with three partitions. Each partition is a sequence of offsets ordered from old to new (Partition 1: 0, 1; Partition 2: 0, 1, 2; Partition 3: 0). The partition replicas are held by three Kafka brokers (Server 1, Server 2, Server 3), one acting as leader and the others as followers. Consumers read (pull) data.]

Apache Kafka: Terminology

• Topic: A stream of messages belonging to a particular category.
• Partition: Topics are split into partitions.
  • For each topic, Kafka keeps one or more partitions.
  • Each message in a partition has a unique sequential ID called an offset.
  • Example: In the previous diagram, a topic has three partitions. Partition 1 holds two messages, at offsets 0 and 1.

Source: https://dzone.com/articles/introduction-to-apache-kafka-1
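
A partition can be pictured as an append-only list in which the list index plays the role of the offset. The sketch below is a simplified in-memory model under that assumption, not Kafka's storage format; the Partition class and its method names are hypothetical.

```python
class Partition:
    """Append-only list of messages; the list index acts as the offset."""

    def __init__(self):
        self._log = []

    def append(self, message):
        # The new message gets the next sequential offset.
        self._log.append(message)
        return len(self._log) - 1

    def read_from(self, offset):
        # Reading never removes messages; a consumer only remembers its offset.
        return list(enumerate(self._log[offset:], start=offset))


partition_1 = Partition()
first = partition_1.append("msg-a")   # stored at offset 0
second = partition_1.append("msg-b")  # stored at offset 1

from_start = partition_1.read_from(0)
from_one = partition_1.read_from(1)
print(first, second, from_start, from_one)
```

Because messages stay in the log, two consumers can read the same partition from different offsets without affecting each other.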

Apache Kafka: Terminology

Some points to remember when working with topics and partitions:
● Topics are identified by name. We can have many named topics in a cluster.
● The order of messages is maintained at the partition level, not across the topic as a whole.
● Once data is written to a partition, it is not overwritten. This is called immutability.
● The messages in partitions are stored with keys, values, and timestamps. Kafka publishes all messages with a given key to the same partition.
● Within the Kafka cluster, each partition has a leader that handles the read and write operations for that partition.
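
The rule that a given key always lands in the same partition can be illustrated by hashing the key and taking the result modulo the number of partitions. This is a minimal sketch of the idea: zlib.crc32 is a stand-in chosen because it is deterministic and in the standard library (Kafka's default partitioner in the Java client uses a murmur2 hash), and choose_partition is a hypothetical helper.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    # Stand-in hash: crc32 is NOT the hash Kafka's partitioner uses;
    # it only demonstrates the "same key -> same partition" property.
    return zlib.crc32(key) % num_partitions


NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]

events = [(b"user-1", "login"), (b"user-2", "login"), (b"user-1", "logout")]
for key, value in events:
    index = choose_partition(key, NUM_PARTITIONS)
    partitions[index].append((key, value))

# Both "user-1" messages sit in one partition, in the order they were sent.
user_1_partition = partitions[choose_partition(b"user-1", NUM_PARTITIONS)]
user_1_events = [v for k, v in user_1_partition if k == b"user-1"]
print(user_1_events)
```

Since ordering is only guaranteed within a partition, choosing a key such as a user ID is what keeps all events for that user in order.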

Apache Kafka: Terminology

• Replicas: Backups of a partition, stored on other brokers for fault tolerance.
• Broker: Kafka runs as a cluster of one or more servers, each of which is called a broker.

Apache Kafka: Example

Source: https://dzone.com/articles/introduction-to-apache-kafka-1

Kafka Consumer Group

Source: http://cloudurable.com/blog/kafka-architecture-consumers/index.html

2-Server Kafka Cluster Hosting 4 Partitions (P0-P3)

Source: http://cloudurable.com/blog/kafka-architecture-consumers/index.html

Zookeeper

• ZooKeeper is an open-source, high-performance coordination service for distributed applications.
• ZooKeeper is used for managing and coordinating Kafka brokers.
• It is mainly used to notify producers and consumers when a new broker joins the Kafka system or when an existing broker fails.
• Based on these notifications, producers and consumers decide how to proceed and start coordinating their tasks with another broker.

http://zookeeper.apache.org/

Kafka Architecture

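[Figure: Producers write (push) data to a Kafka cluster of three brokers (Broker 1, Broker 2, Broker 3), and consumers read (pull) data from it. The cluster is connected to ZooKeeper; the links to ZooKeeper are labelled "get Kafka broker id" and "update offset".]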

Why Kafka?

• Kafka is a fast, scalable, durable, and reliable publish-subscribe messaging system.
• Scalability: Kafka scales easily without downtime.
• Durability: Kafka uses a distributed commit log, which means messages are persisted to disk as quickly as possible.
• Reliability: Kafka is distributed, partitioned, replicated, and fault tolerant.
• Performance: Kafka has high throughput for both publishing and subscribing to messages. It maintains stable performance even when dealing with many terabytes of stored messages.
• Real-time streaming data can be processed for real-time analytics.
• Kafka can work in combination with Flume/Flafka, Spark Streaming, Storm, HBase, and Spark for real-time analysis and processing of streaming data.

Kafka Setup

Operating system: Unix-based (Kafka also works on Windows, but these instructions are for Unix-based systems)

Step 1: Download the code and un-tar it:

https://www.apache.org/dyn/closer.cgi?path=/kafka/2.1.0/kafka_2.11-2.1.0.tgz

Ref: https://kafka.apache.org/quickstart

Step 2: Start the ZooKeeper Server

If you do not have a ZooKeeper server on your system, you can use the single-node ZooKeeper instance packaged with Kafka.

> bin/zookeeper-server-start.sh config/zookeeper.properties

When it starts successfully, you will see:

[2013-04-22 15:01:37,495] INFO Reading configuration from: config/zookeeper.properties (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
...

Step 3: Start the Kafka Server

> bin/kafka-server-start.sh config/server.properties

When it starts successfully, you will see:

[2013-04-22 15:01:47,028] INFO Verifying properties (kafka.utils.VerifiableProperties)
[2013-04-22 15:01:47,051] INFO Property socket.send.buffer.bytes is overridden to 1048576 (kafka.utils.VerifiableProperties)
...

Step 4: Create a topic

To create a topic named “test” with a single partition and one replica:

> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

To check that the topic exists, list the topics:

> bin/kafka-topics.sh --list --zookeeper localhost:2181
test
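
The same topic could also be created from Python. The sketch below assumes the third-party kafka-python package is installed and that a broker is listening on localhost:9092. Unlike the command above, its admin client connects to the broker rather than to ZooKeeper. topic_spec and create_topic are hypothetical helper names.

```python
def topic_spec(name, partitions=1, replication_factor=1):
    # Plain description of the topic; mirrors the CLI flags used above.
    return {
        "name": name,
        "num_partitions": partitions,
        "replication_factor": replication_factor,
    }


def create_topic(spec, bootstrap_servers="localhost:9092"):
    # Imported lazily so topic_spec works without kafka-python installed.
    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers=bootstrap_servers)
    try:
        admin.create_topics([NewTopic(**spec)])
    finally:
        admin.close()


spec = topic_spec("test")
print(spec)
# With a broker running: create_topic(spec)
```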

Step 5: Send some messages with the producer

Run the console producer, then type some messages. Each line is sent to the server as a separate message:

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is a test.
This is the second test.
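
For comparison, a Python producer might look like the following. This is a sketch assuming the third-party kafka-python package and a broker on localhost:9092; encode_lines and send_lines are hypothetical helpers, and the call that talks to the broker is left commented out.

```python
def encode_lines(lines):
    # Kafka message values are bytes, so encode each line as UTF-8.
    return [line.encode("utf-8") for line in lines]


def send_lines(lines, topic="test", bootstrap_servers="localhost:9092"):
    # Imported lazily so encode_lines works without kafka-python installed.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)
    try:
        for value in encode_lines(lines):
            producer.send(topic, value=value)
        producer.flush()  # block until buffered messages have been sent
    finally:
        producer.close()


payloads = encode_lines(["This is a test.", "This is the second test."])
print(payloads)
# With a broker running:
# send_lines(["This is a test.", "This is the second test."])
```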

Step 6: Receive the messages with the consumer

Kafka also has a command-line consumer that dumps messages to standard output:

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
This is a test.
This is the second test.
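
A rough Python counterpart of the console consumer is sketched below, again assuming the third-party kafka-python package and a broker on localhost:9092. Setting auto_offset_reset to "earliest" plays the role of --from-beginning for a consumer with no stored offset. decode_value and read_from_beginning are hypothetical helpers.

```python
def decode_value(raw):
    # Message values arrive as bytes; decode them for printing.
    return raw.decode("utf-8")


def read_from_beginning(topic="test", bootstrap_servers="localhost:9092"):
    # Imported lazily so decode_value works without kafka-python installed.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap_servers,
        auto_offset_reset="earliest",  # start at the oldest available message
        enable_auto_commit=False,      # do not store offsets for this reader
        consumer_timeout_ms=5000,      # stop iterating after 5 s without messages
    )
    try:
        return [decode_value(record.value) for record in consumer]
    finally:
        consumer.close()


print(decode_value(b"This is a test."))
# With a broker running: print(read_from_beginning())
```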

• We will discuss Apache Spark Streaming based on Kafka in detail.
• We will also give some examples in PySpark of how to extract features from documents:
○ Tokenizer
○ Stop word removal
○ n-gram
○ TF-IDF
○ word2vec
• We will also give some examples in PySpark of how to classify documents using the extracted features:
○ Decision Tree
○ Naive Bayes
○ SVM

Apache Kafka:

• https://kafka.apache.org/
• https://dev.to/de_maric/what-is-a-consumer-group-in-kafka-49il
• https://blog.cloudera.com/scalability-of-kafka-messaging-using-consumer-groups/
