Introduction To Apache Kafka and Its Setup

This document provides an introduction to Apache Kafka, a distributed publish-subscribe messaging system designed for high throughput and fault tolerance. It covers the fundamentals of messaging systems, Kafka's architecture, terminology, and setup instructions. Additionally, it highlights the advantages of using Kafka for real-time data processing and its integration with other technologies like Spark Streaming.

Uploaded by

jayashree
Window-based Stream Data Analytics with Spark and Kafka

4. Introduction to Apache Kafka and its setup

This document is licensed with a Creative Commons Attribution 4.0 International License ©2017
Learning Outlines

• What is a messaging system?
  • Point to point
  • Publish-subscribe (pub-sub)
• Apache Kafka
  • Terminology
  • Architecture
• Why Kafka?
• Kafka Setup

What is a Messaging System?

• A messaging system is responsible for transferring data from one application to another.
• Applications therefore do not need to be concerned with how the data is shared.

• Distributed messaging follows one of two patterns:
  • Point to point
  • Publish-subscribe (pub-sub)

Point to Point Messaging System

• This system uses a queue to persist messages.
• One or more consumers can consume the messages in the queue, but a particular message can be consumed by only one consumer.
• Once a message has been consumed, it disappears from the queue.

[Figure: Sender -> Message Queue -> Receiver]
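The point-to-point behaviour can be sketched with Python's standard `queue` module. This is only an in-memory illustration, not a real messaging system; the function name `run_point_to_point` and its parameters are made up for this example.

```python
import queue
import threading


def run_point_to_point(messages, consumer_count=2):
    """Deliver each message to exactly one of several competing consumers."""
    q = queue.Queue()
    for msg in messages:
        q.put(msg)

    # One private list per consumer, so we can see who received what.
    received = {cid: [] for cid in range(consumer_count)}

    def consume(cid):
        while True:
            try:
                # Taking a message removes it from the queue,
                # so no other consumer can receive it.
                msg = q.get_nowait()
            except queue.Empty:
                return
            received[cid].append(msg)

    threads = [threading.Thread(target=consume, args=(cid,))
               for cid in range(consumer_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Return the per-consumer deliveries and the number of messages left.
    return received, q.qsize()
```

Which consumer receives which message depends on thread scheduling, but every message is delivered exactly once and the queue ends up empty.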

Publish-Subscribe Messaging System

• In this system, messages are persisted in a topic.
• Consumers can subscribe to one or more topics and consume all the messages in those topics.
• Message producers are called publishers and message consumers are called subscribers.

[Figure: Producers publish (topic, msg) pairs to Topic 1, Topic 2, and Topic 3; consumers subscribe to topics and receive their messages]
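In contrast to a queue, every subscriber of a topic sees every message. A minimal in-memory sketch of that idea follows; the classes `TopicBus` and `Subscriber` are hypothetical names for this illustration and do not correspond to any Kafka API.

```python
from collections import defaultdict


class TopicBus:
    """Keeps every published message, grouped by topic name."""

    def __init__(self):
        self._topics = defaultdict(list)

    def publish(self, topic, msg):
        self._topics[topic].append(msg)

    def read(self, topic, position=0):
        # Reading does not remove anything; other subscribers can read it too.
        return self._topics[topic][position:]


class Subscriber:
    """Tracks its own read position in each topic it subscribed to."""

    def __init__(self, bus, topics):
        self._bus = bus
        self._positions = {topic: 0 for topic in topics}

    def poll(self):
        out = []
        for topic, pos in list(self._positions.items()):
            new_messages = self._bus.read(topic, pos)
            out.extend((topic, msg) for msg in new_messages)
            self._positions[topic] = pos + len(new_messages)
        return out
```

Two subscribers of the same topic each receive the full stream, because a message stays in the topic after it has been read.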

Apache Kafka
● Kafka is used as an enterprise messaging system that decouples source and target systems so that they can exchange data.
● Kafka provides high throughput through partitions and fault tolerance through replication.

Source: https://dzone.com/articles/introduction-to-apache-kafka-1

• Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of data.
• Kafka maintains feeds of messages in topics. Producers write data to topics and consumers read from topics. Because Kafka is a distributed system, topics are partitioned and replicated across multiple nodes.

[Figure: Producers write to the Kafka cluster; consumers read from it]

[Figure: A topic with three partitions, each ordered from old to new: Partition 1 (offsets 0-1), Partition 2 (offsets 0-2), Partition 3 (offset 0). Producers write (push) data and consumers read (pull) data. The partitions are replicated across three Kafka brokers: Server 1 (leader), Server 2 (follower), and Server 3 (follower).]

Apache Kafka: Terminology

• Topic: A stream of messages belonging to a particular category.
• Partition: Topics are split into partitions.
  • For each topic, Kafka keeps one or more partitions.
  • Each message within a partition has a unique sequence ID called an offset.
  • Example: In the previous diagram, the topic has three partitions. Partition 1 holds two messages, with offsets 0 and 1.
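The relationship between a partition and its offsets can be sketched as an append-only list in which the offset is simply the position of a message. The class `PartitionLog` is a hypothetical name for this illustration and is not part of Kafka.

```python
class PartitionLog:
    """An append-only sequence of messages; the offset is the index."""

    def __init__(self):
        self._records = []

    def append(self, value):
        # The next offset is the current length, so offsets start at 0
        # and grow by one with every message.
        offset = len(self._records)
        self._records.append(value)
        return offset

    def read(self, offset):
        return self._records[offset]


partition_1 = PartitionLog()
first = partition_1.append("first message")    # offset 0
second = partition_1.append("second message")  # offset 1
```

As in the diagram's Partition 1, two messages yield the offsets 0 and 1.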

Source: https://dzone.com/articles/introduction-to-apache-kafka-1

Some points to remember when working with partitions:
● Topics are identified by name. A cluster can hold many named topics.
● The order of messages is maintained within a partition, not across a whole topic.
● Once data is written to a partition, it is not overwritten. This is called immutability.
● Messages in partitions are stored with keys, values, and timestamps. Kafka publishes all messages with the same key to the same partition.
● Each partition has a leader in the Kafka cluster that handles the read and write operations for that partition.
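The key-to-partition rule can be sketched as "hash the key, then take the remainder by the number of partitions". The sketch below uses CRC32 from Python's standard library purely as a stand-in hash; Kafka's own default partitioner uses a different hash function, and the names `choose_partition` and `route` are made up for this example.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition; the same key always gives the same result."""
    # Assumption: CRC32 stands in for the hash a real partitioner would use.
    return zlib.crc32(key) % num_partitions


def route(records, num_partitions):
    """Append (key, value) records to their partitions in arrival order."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[choose_partition(key, num_partitions)].append((key, value))
    return partitions
```

Because all messages for one key land in one partition, and a partition preserves arrival order, the messages for that key are read back in the order they were written. Note that the mapping changes if the number of partitions changes.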

• Replicas: backups of a partition.
• Broker: Kafka runs as a cluster of one or more servers, each of which is called a broker.
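How replicas on several brokers protect a partition can be sketched as follows. This is a deliberately simplified model in which followers copy every write immediately; real Kafka replication is more involved. The class `ReplicatedPartition` and its methods are hypothetical names for this illustration.

```python
class ReplicatedPartition:
    """One partition with a copy (replica) on each of several brokers."""

    def __init__(self, broker_ids):
        self.replicas = {broker_id: [] for broker_id in broker_ids}
        self.leader = broker_ids[0]  # the first broker starts as leader

    def write(self, value):
        # Writes go to the leader; followers then copy the message.
        self.replicas[self.leader].append(value)
        for broker_id, log in self.replicas.items():
            if broker_id != self.leader:
                log.append(value)

    def fail_broker(self, broker_id):
        # Drop the failed broker; if it was the leader, promote a follower.
        del self.replicas[broker_id]
        if broker_id == self.leader:
            self.leader = next(iter(self.replicas))

    def read(self):
        # Reads are served from the leader's copy.
        return list(self.replicas[self.leader])
```

After the leader's broker fails, a follower takes over and the data written earlier is still readable from its replica.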

Apache Kafka: Example

Source: https://dzone.com/articles/introduction-to-apache-kafka-1

Kafka Consumer Group

Source: http://cloudurable.com/blog/kafka-architecture-consumers/index.html

2 Server Kafka Cluster Hosting 4 Partition (P0-P5)

Source: http://cloudurable.com/blog/kafka-architecture-consumers/index.html

ZooKeeper

• ZooKeeper is an open-source, high-performance coordination service for distributed applications.
• ZooKeeper is used for managing and coordinating Kafka brokers.
• ZooKeeper is mainly used to notify producers and consumers about the presence of a new broker or the failure of an existing broker in the Kafka system.
• Based on these notifications, producers and consumers decide how to proceed and start coordinating their tasks with another broker.
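The notification role described above can be pictured as a registry that calls back every interested party when a broker joins or fails. The sketch below is a plain Python observer pattern, not ZooKeeper's API; `BrokerRegistry` and its methods are hypothetical names.

```python
class BrokerRegistry:
    """Tracks live brokers and notifies watchers about every change."""

    def __init__(self):
        self._brokers = set()
        self._watchers = []

    def watch(self, callback):
        # callback(event, broker_id, live_brokers) is called on each change.
        self._watchers.append(callback)

    def register(self, broker_id):
        self._brokers.add(broker_id)
        self._notify("joined", broker_id)

    def fail(self, broker_id):
        self._brokers.discard(broker_id)
        self._notify("failed", broker_id)

    def _notify(self, event, broker_id):
        live = sorted(self._brokers)
        for callback in self._watchers:
            callback(event, broker_id, live)
```

A producer or consumer would register a callback with `watch` and use the list of live brokers it receives to decide which broker to work with next.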

http://zookeeper.apache.org/

Kafka Architecture

[Figure: Producers write (push) data to a Kafka cluster of three brokers (Broker 1, Broker 2, Broker 3); consumers read (pull) data from the cluster. ZooKeeper sits below the cluster, with the link labels "get Kafka broker id" and "update offset".]

Why Kafka?

• Kafka is a fast, scalable, durable, and reliable publish-subscribe messaging system.
  • Scalability: the Kafka messaging system scales easily without downtime.
  • Durability: Kafka uses a distributed commit log, which means messages are persisted on disk as fast as possible.
  • Reliability: Kafka is distributed, partitioned, replicated, and fault tolerant.
  • Performance: Kafka has high throughput for both publishing and subscribing to messages. It maintains stable performance even when many terabytes of messages are stored.
• Streaming data can be processed in real time for real-time analytics.
• Kafka can work in combination with Flume/Flafka, Spark Streaming, Storm, HBase, and Spark for real-time analysis and processing of streaming data.

Kafka Setup

Operating system: Unix-based (Kafka also works on Windows, but these instructions are for Unix-based systems).

Step 1: Download the code and un-tar it:

https://www.apache.org/dyn/closer.cgi?path=/kafka/2.1.0/kafka_2.11-2.1.0.tgz

Ref: https://kafka.apache.org/quickstart


Step 2: Start the ZooKeeper server

If you do not have a ZooKeeper server on your system, you can use the single-node ZooKeeper instance packaged with Kafka.

> bin/zookeeper-server-start.sh config/zookeeper.properties

When it starts successfully, you will see:
[2013-04-22 15:01:37,495] INFO Reading configuration from: config/zookeeper.properties (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
...


Step 3: Start the Kafka server

> bin/kafka-server-start.sh config/server.properties

When it starts successfully, you will see:
[2013-04-22 15:01:47,028] INFO Verifying properties (kafka.utils.VerifiableProperties)
[2013-04-22 15:01:47,051] INFO Property socket.send.buffer.bytes is overridden to 1048576 (kafka.utils.VerifiableProperties)
...


Step 4: Create a topic

To create a topic named "test" with a single partition and one replica:

> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

To check that the topic exists, list the topics; the topic name appears in the output:
> bin/kafka-topics.sh --list --zookeeper localhost:2181
test


Step 5: Send some messages with the producer

Use the following command to start the console producer, then type the messages to send to the server, one per line:
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is a test.
This is the second test.


Step 6: Receive the messages with the consumer

Kafka also has a command-line consumer that dumps messages to standard output:

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
This is a test.
This is the second test.
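The effect of the --from-beginning flag can be sketched as a choice of starting offset: either offset 0, or the current end of the log, in which case only messages that arrive later are shown. This is an in-memory illustration for a consumer with no previously stored position; the function names are made up for this example.

```python
def start_offset(log_length, from_beginning):
    """Pick where a new consumer starts reading in a partition."""
    # From the beginning: the oldest message (offset 0).
    # Otherwise: the end of the log, so only new messages are seen.
    return 0 if from_beginning else log_length


def consume(log, from_beginning):
    """Return the messages a new consumer would see right now."""
    return log[start_offset(len(log), from_beginning):]


log = ["This is a test.", "This is the second test."]
seen_from_beginning = consume(log, from_beginning=True)
seen_from_end = consume(log, from_beginning=False)
```

Here `seen_from_beginning` holds both test messages, while `seen_from_end` is empty until further messages are produced.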

Next Lessons

• We will discuss Apache Spark Streaming based on Kafka in detail.
• We will also give some examples in PySpark of how to extract features from documents:
  ○ Tokenizer
  ○ Stop word removal
  ○ n-gram
  ○ TF-IDF
  ○ word2vec
• We will also give some examples in PySpark of how to classify documents with the extracted features:
  ○ Decision Tree
  ○ Naive Bayes
  ○ SVM

Reference

Apache Kafka:

• https://kafka.apache.org/
• https://dev.to/de_maric/what-is-a-consumer-group-in-kafka-49il
• https://blog.cloudera.com/scalability-of-kafka-messaging-using-consumer-groups/
