Big Data - Group 14

Uploaded by

Khả Võ Văn
L01 - Group 14

Big Data Analytics and Business Intelligence - CO4033

Big Data Management - Kafka


A distributed, partitioned, replicated commit log service

Members
1 Trần Mậu Thật
2112342

2 Võ Văn Khả
2110264

3 Dương Thuận Đông
2210762
Agenda
Introduction
Basic concepts
How Kafka works
Key features
Application
Pros and cons
Demo
Introduction
A distributed publish-subscribe messaging system.

Written in Scala and Java, with client libraries available for many languages.

Runs on the Java Virtual Machine (JVM).
Basic concepts
Message
The fundamental unit of data in Kafka.

A record of information that is produced by producers and consumed by consumers.
Basic concepts
Structure of Kafka message
Key: determines the partition to which the message is sent.

Value: the actual data (payload) of the message.

Offset: a unique sequential number assigned to each message within a partition.

Timestamp: indicates when the message was produced.

Headers (optional): key-value pairs that carry additional metadata about the message.
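
As a rough illustration of the fields listed above, the sketch below models a message as a plain Python data class. The class name `Message` and its field names are hypothetical and chosen only for this slide; this is not the record type of any real Kafka client library.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple
import time


@dataclass(frozen=True)
class Message:
    """Illustrative stand-in for a Kafka message (hypothetical class)."""
    value: bytes                          # the payload
    key: Optional[bytes] = None           # used to pick the partition
    timestamp_ms: int = field(            # when the message was produced
        default_factory=lambda: int(time.time() * 1000))
    headers: Tuple[Tuple[str, bytes], ...] = ()   # optional metadata pairs
    offset: Optional[int] = None          # unknown until a partition stores it


msg = Message(
    key=b"user-42",
    value=b'{"action": "login"}',
    headers=(("trace-id", b"abc"),),
)
```

The offset is left empty on purpose: the producer does not choose it, it is assigned when the message is appended to a partition.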
Basic concepts
Topic
A category or channel in which messages are stored and through which they pass from producers to consumers.

Topics are fundamental to Kafka's publish-subscribe model.

Kafka supports two types of topic: regular topics, which discard old messages according to a retention policy, and compacted topics, which keep the latest message for each key.
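
To make the idea of a compacted topic more concrete, here is a minimal sketch of compaction on an in-memory list. The function `compact` and the `(offset, key, value)` tuple layout are assumptions made for this example; real Kafka compaction runs in the background on log segments and also handles deletion markers, which this sketch ignores.

```python
def compact(log):
    """Keep only the newest record for each key, in offset order.

    `log` is a list of (offset, key, value) tuples with increasing offsets.
    """
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)   # later records overwrite earlier ones
    return sorted(latest.values())           # offsets are unique, so this sorts by offset


log = [
    (0, "a", "1"),
    (1, "b", "1"),
    (2, "a", "2"),
    (3, "b", "2"),
    (4, "c", "1"),
]
compacted = compact(log)
```

The surviving records keep their original offsets; only the older versions of each key disappear.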
Basic concepts
Partition
A fundamental unit of Kafka's architecture that enables scalability and fault tolerance.

Each topic in Kafka can be divided into multiple partitions, and each partition is an ordered, immutable sequence of messages.

The partitions of a topic can reside on different brokers.


Basic concepts
Producer
A client process or application that generates messages and sends them to Kafka.

A producer names the topic for each message and can also name the partition; otherwise the partition is derived from the message key or chosen by the default partitioning strategy.

Decides the level of acknowledgment it requires from brokers.
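
The partition choice described above can be sketched as follows. This is a simplified illustration, not the real client's algorithm: the class `SimplePartitioner` is hypothetical, `zlib.crc32` stands in for the hash used by the actual Kafka clients, and plain round-robin stands in for the default strategy for messages without a key.

```python
import zlib


class SimplePartitioner:
    """Hypothetical partitioner: hash the key, or rotate when there is no key."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._next = 0   # round-robin counter for keyless messages

    def partition(self, key):
        if key is None:
            chosen = self._next % self.num_partitions
            self._next += 1
            return chosen
        # Same key always maps to the same partition (assumed hash: CRC32).
        return zlib.crc32(key) % self.num_partitions


partitioner = SimplePartitioner(num_partitions=3)
first = partitioner.partition(b"user-42")
second = partitioner.partition(b"user-42")
keyless = [partitioner.partition(None) for _ in range(4)]
```

Because the partition depends only on the key, all messages with the same key land in the same partition and keep their relative order.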
Basic concepts
Consumer
A client process or application that consumes the messages stored in Kafka topics by subscribing to one or more topics.

Pulls data from Kafka brokers and processes it in the order in which it was stored within each partition.

Each consumer in a consumer group is assigned different partitions, ensuring that no two consumers in the same group read the same partition simultaneously.

After successfully processing a message, the consumer commits its offset to Kafka to track which messages have been handled.
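
A minimal sketch of the two ideas above: partitions are split among the consumers of a group, and each group records how far it has read. The functions `assign_round_robin` and `commit` and the `committed` dictionary are hypothetical names for this example; real Kafka offers several assignment strategies and stores committed offsets on the brokers.

```python
def assign_round_robin(partitions, consumers):
    """Give each partition to exactly one consumer of the group."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


committed = {}   # (group, partition) -> offset of the next message to read


def commit(group, partition, last_processed_offset):
    """Record progress: the committed value is the next offset to consume."""
    committed[(group, partition)] = last_processed_offset + 1


assignment = assign_round_robin([0, 1, 2, 3, 4], ["c1", "c2"])
commit("billing", 0, last_processed_offset=9)
```

With five partitions and two consumers, one consumer owns three partitions and the other owns two; no partition appears twice.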
Basic concepts
Broker
A server that is responsible for storing data and handling communication between producers and consumers.

Can act as a leader or a follower for each partition: the leader broker handles read and write requests, while the follower brokers replicate the leader's data.

Provides an acknowledgment service to the producer, so the producer can verify whether its messages have been successfully received and stored by the broker.
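
The acknowledgment levels a producer can ask for are set with the `acks` producer setting, whose values are `0`, `1` and `all`. The sketch below shows, under simplifying assumptions, when a write counts as acknowledged for each level. The function `is_acknowledged` and its parameters are hypothetical and ignore details such as the minimum number of in-sync replicas.

```python
def is_acknowledged(acks, leader_written, followers_written, followers_in_sync):
    """Decide whether the producer may treat a send as successful.

    acks="0":   the producer does not wait for any broker response.
    acks="1":   the leader has written the message.
    acks="all": the leader and every in-sync follower have written it.
    """
    if acks == "0":
        return True
    if acks == "1":
        return leader_written
    if acks == "all":
        return leader_written and followers_written == followers_in_sync
    raise ValueError(f"unknown acks value: {acks!r}")


fire_and_forget = is_acknowledged("0", False, 0, 2)
leader_only = is_acknowledged("1", True, 0, 2)
waiting_for_follower = is_acknowledged("all", True, 1, 2)
fully_replicated = is_acknowledged("all", True, 2, 2)
```

Stronger levels trade latency for durability: `all` waits longest but survives the loss of the leader.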
Basic concepts
Cluster
A distributed system consisting of multiple brokers working together.

Traditionally uses ZooKeeper to manage cluster metadata, including the list of brokers, leader elections for partitions, and configuration management (newer Kafka versions can use the built-in KRaft mode instead).
How Kafka works
Operation of the Kafka system
First, a producer creates a message and sends it to a topic in Kafka.
The leader broker of the target partition writes the message and then sends an acknowledgement back to the producer.
Once a message is written to a partition, it is stored on disk and retained according to the topic's retention policy.
Consumers in a consumer group subscribe to the topic and read messages from its partitions.
Finally, the consumers process the messages, and Kafka tracks their progress using offsets.
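
The steps above can be walked through with a toy in-memory broker. Everything here (`MiniBroker`, its methods, the topic and group names) is hypothetical and exists only to show the flow produce, acknowledge, store, poll, commit; it has no networking, replication, or disk storage.

```python
from collections import defaultdict


class MiniBroker:
    """Toy single-process broker used only to illustrate the message flow."""

    def __init__(self):
        self.logs = defaultdict(list)      # (topic, partition) -> stored values
        self.committed = defaultdict(int)  # (group, topic, partition) -> next offset

    def produce(self, topic, partition, value):
        log = self.logs[(topic, partition)]
        log.append(value)                  # append-only write
        return len(log) - 1                # the offset doubles as the acknowledgement

    def poll(self, group, topic, partition, max_records=10):
        start = self.committed[(group, topic, partition)]
        log = self.logs[(topic, partition)]
        return list(enumerate(log[start:start + max_records], start=start))

    def commit(self, group, topic, partition, next_offset):
        self.committed[(group, topic, partition)] = next_offset


broker = MiniBroker()
offsets = [broker.produce("orders", 0, v) for v in ("created", "paid", "shipped")]

batch = broker.poll("billing", "orders", 0, max_records=2)
broker.commit("billing", "orders", 0, batch[-1][0] + 1)

rest = broker.poll("billing", "orders", 0)    # continues after the commit
audit = broker.poll("audit", "orders", 0)     # another group starts from 0
```

Reading does not remove messages: the `audit` group still sees all three records because each group tracks its own offsets.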
Key features
Message distribution
Manages the sending, storage and consumption of messages across producers (senders), brokers (Kafka servers) and consumers (receivers).

Built on topics, partitions and consumer groups.

Ensures high availability, scalability and efficient parallel processing.
Key features
Event Streaming
Continuously captures, stores, processes and routes real-time event streams, allowing systems to react to events as they happen.

Handles large volumes of real-time event data from different sources.

Enables data streams to be managed and processed effectively.


Key features
Data storage
Message retention: messages are stored after they are sent to prevent data loss, so they can be re-read and re-consumed, including by consumers that subscribe again later.

Stores large volumes of data, which can feed a data lake or data warehouse.

Supports real-time stream processing (with the Kafka Streams API or other processing frameworks such as Apache Flink or Apache Spark).
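
Retention and replay can be pictured with a small sketch. It assumes a time-based retention window similar in spirit to the `retention.ms` topic setting; the functions `apply_retention` and `replay_from` and the `(offset, timestamp_ms, value)` layout are hypothetical, and real Kafka removes whole log segments rather than single records.

```python
def apply_retention(log, now_ms, retention_ms):
    """Drop records older than the retention window; survivors keep their offsets."""
    return [
        (offset, timestamp_ms, value)
        for offset, timestamp_ms, value in log
        if now_ms - timestamp_ms <= retention_ms
    ]


def replay_from(log, offset):
    """Re-read every retained value starting at the given offset."""
    return [value for record_offset, _, value in log if record_offset >= offset]


log = [(0, 1_000, "a"), (1, 5_000, "b"), (2, 9_000, "c")]
kept = apply_retention(log, now_ms=10_000, retention_ms=6_000)
replayed = replay_from(kept, 0)
```

A consumer that asks for offset 0 after retention has run simply starts at the oldest record that is still stored.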
Pros and cons
High throughput: delivers large volumes of messages quickly and continuously.

Scalability: the cluster is managed and configured by the team itself, so it can be customized and scaled as needed.

Open source.

Automatic message storage: messages are easy to retrieve and replay.
Pros and cons
Lack of a complete monitoring toolkit: various tools exist, but each covers a different aspect of management (Kafka Tool, Lenses, ...).

No wildcard topic selection when producing: a producer must give the exact topic name for each message (consumers, by contrast, can subscribe by regular-expression pattern).

Reduced performance with large messages: as message size increases, more time is spent compressing and decompressing data.

Clumsy behavior at scale: message handling becomes slower as the number of queues in the cluster grows.
Application
E-commerce: processing real-time data smoothly without bottlenecks.
Application
Healthcare: streaming real-time patient data from sensors, enabling continuous monitoring and timely interventions.
Application
Advertising: tracking user behavior on websites in real time and processing that data to deliver personalized ads.
Thank you
for listening!
