Apache Kafka

The document discusses the Apache Kafka architecture. It describes Kafka as a publish-subscribe messaging system that allows producers to publish messages to topics that can be consumed by subscribers. It highlights key aspects of Kafka including its distributed commit log architecture, replication for fault tolerance, and support for horizontal scaling through partitioning. The document also summarizes Kafka's core abstractions like topics, partitions, producers, consumers and the underlying protocol.

Uploaded by

dharmareddyr

Apache: Big Data 2015

The Best of Apache


Kafka Architecture
Ranganathan Balashanmugam
@ran_than
Helló Budapest
About Me

❏ Graduated as Civil Engineer.


❏ <dev> 10+ years </dev>
❏ <Thoughtworker from=”India”/>
❏ Organizer of Hyderabad Scalability Meetup with 2000+ members.
“Form follows function.”
- Louis Sullivan
Gravity Dam
Indirasagar Dam, India

img src: http://www.montanhydraulik.in


Forces on a gravity dam

[Figure: dam with head water, tail water, weight, and uplift forces]

❏ publish-subscribe messaging service
❏ distributed commit/write-ahead log

“producers produce, consumers consume, in large distributed reliable way -- real time”
Why Kafka?

❏ DBs
❏ Logs
❏ Brokers
❏ HDFS

“For highly distributed messages, Kafka stands out.”


Kafka Vs ________

src: https://softwaremill.com/mqperf/
Timeline

2011: Open sourced by LinkedIn, as version 0.6

2012: Graduated from the Apache Incubator

2014: Several engineers who built Kafka create Confluent

2015: Latest stable - 0.8.2.1


A Kafka Message

CRC | magic | attributes | key length | key message | message length | message content

kafka.message.Message

Change requested: KAFKA-2511
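
The field layout above can be sketched in a few lines of Python. This is a minimal illustration, not Kafka's own code: the field widths (4-byte CRC, 1-byte magic, 1-byte attributes, 4-byte length prefixes, and a length of -1 for a missing key) are assumptions based on the 0.8-era format, and `encode_message` / `decode_message` are hypothetical helper names.

```python
import struct
import zlib


def _bytes_field(data):
    """Length-prefixed bytes; a length of -1 marks a missing (null) field (assumed convention)."""
    if data is None:
        return struct.pack(">i", -1)
    return struct.pack(">i", len(data)) + data


def encode_message(key, value, magic=0, attributes=0):
    """Pack: CRC | magic | attributes | key length | key | message length | message content."""
    body = struct.pack(">bb", magic, attributes) + _bytes_field(key) + _bytes_field(value)
    crc = zlib.crc32(body) & 0xFFFFFFFF  # checksum covers everything after the CRC field
    return struct.pack(">I", crc) + body


def decode_message(buf):
    """Verify the CRC and unpack the fields written by encode_message."""
    (crc,) = struct.unpack_from(">I", buf, 0)
    if zlib.crc32(buf[4:]) & 0xFFFFFFFF != crc:
        raise ValueError("CRC mismatch")
    magic, attributes = struct.unpack_from(">bb", buf, 4)
    pos = 6
    fields = []
    for _ in range(2):  # key first, then message content
        (length,) = struct.unpack_from(">i", buf, pos)
        pos += 4
        if length < 0:
            fields.append(None)
        else:
            fields.append(buf[pos:pos + length])
            pos += length
    key, value = fields
    return magic, attributes, key, value
```

The CRC lets a reader detect a corrupted message before trusting any of the lengths that follow it.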
Producers - push

Request => RequiredAcks Timeout [TopicName [Partition MessageSetSize MessageSet]]

Kafka Broker

Response => [TopicName [Partition ErrorCode Offset]]

org.apache.kafka.clients.producer.KafkaProducer
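
The request grammar above nests arrays: topics, then partitions within each topic. A rough sketch of how that body could be serialized follows. The integer widths (int16 for RequiredAcks and string lengths, int32 for Timeout, array counts, Partition and MessageSetSize) are assumptions taken from the version 0 wire protocol, the request header is left out, and `encode_produce_body` is a hypothetical name.

```python
import struct


def _string(text):
    """Protocol string: int16 length followed by UTF-8 bytes (assumed encoding)."""
    raw = text.encode("utf-8")
    return struct.pack(">h", len(raw)) + raw


def encode_produce_body(required_acks, timeout_ms, topics):
    """Serialize RequiredAcks Timeout [TopicName [Partition MessageSetSize MessageSet]].

    topics: {topic_name: {partition: message_set_bytes}}
    """
    out = struct.pack(">hi", required_acks, timeout_ms)
    out += struct.pack(">i", len(topics))  # array = int32 count, then the elements
    for name, partitions in topics.items():
        out += _string(name) + struct.pack(">i", len(partitions))
        for partition, message_set in partitions.items():
            out += struct.pack(">ii", partition, len(message_set)) + message_set
    return out
```

For example, `encode_produce_body(1, 1000, {"t": {0: b"abc"}})` asks for one acknowledgement, a 1000 ms timeout, and writes the three bytes `abc` to partition 0 of topic `t`.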
Topic

Remove messages based on:

❏ number of messages
❏ time
❏ size

kafka.common.Topic
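
The three removal criteria can be illustrated with a toy in-memory log. This is only a sketch under simplifying assumptions: it drops individual messages from the old end, whereas a real broker works on larger chunks of the log, and `RetainedLog` with its parameters is a hypothetical class, not a Kafka API.

```python
from collections import deque


class RetainedLog:
    """Toy log that discards the oldest messages once a limit is exceeded."""

    def __init__(self, max_messages=None, max_age_s=None, max_bytes=None):
        self.max_messages = max_messages
        self.max_age_s = max_age_s
        self.max_bytes = max_bytes
        self.entries = deque()  # (timestamp, payload), oldest first
        self.size = 0

    def append(self, payload, now):
        self.entries.append((now, payload))
        self.size += len(payload)
        self.expire(now)

    def _over_limit(self, now):
        if not self.entries:
            return False
        oldest_ts, _ = self.entries[0]
        too_many = self.max_messages is not None and len(self.entries) > self.max_messages
        too_big = self.max_bytes is not None and self.size > self.max_bytes
        too_old = self.max_age_s is not None and now - oldest_ts > self.max_age_s
        return too_many or too_big or too_old

    def expire(self, now):
        """Drop messages from the old end until no limit is exceeded."""
        while self._over_limit(now):
            _, payload = self.entries.popleft()
            self.size -= len(payload)

    def payloads(self):
        return [payload for _, payload in self.entries]
```

Note that removal does not depend on whether anyone has consumed the message; only the configured limits matter.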
Partitions

Serves: Horizontal scaling, Parallel consumer reads

kafka.cluster.Partition
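
Both purposes can be shown with a small sketch: a key is hashed to pick a partition, and partitions are then divided among consumers so they can read in parallel. The CRC32 hash and the round-robin assignment are assumptions chosen for illustration; the real client partitioners and group assignment strategies differ, and both function names are hypothetical.

```python
import zlib


def choose_partition(key, num_partitions):
    """Map a message key to a partition; the same key always lands in the same partition."""
    return zlib.crc32(key) % num_partitions


def assign_partitions(num_partitions, consumers):
    """Spread partitions over consumers round-robin so each partition has one reader."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment
```

Adding partitions raises the ceiling on how many consumers can read a topic at the same time, which is the horizontal scaling the slide refers to.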
Consumers - pull

[Figure: Consumer 1, Consumer 2]

kafka.consumer.ConsumerConnector,
kafka.consumer.SimpleConsumer
Consumer offsets
committing and fetching consumer offsets

img src: http://www.reynanprinting.com/photos/undefined/impresion-offset1.jpg
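
Committing and fetching offsets amounts to a small key-value store indexed by consumer group, topic, and partition. The sketch below is an in-memory stand-in, not the broker's implementation. It assumes the committed offset means "next message to read" and that an unknown group starts at 0; `OffsetStore` and `poll` are hypothetical names.

```python
class OffsetStore:
    """Remembers, per (group, topic, partition), the next offset to read."""

    def __init__(self):
        self._offsets = {}

    def commit(self, group, topic, partition, offset):
        self._offsets[(group, topic, partition)] = offset

    def fetch(self, group, topic, partition):
        return self._offsets.get((group, topic, partition), 0)  # assumed default: start of log


def poll(log, store, group, topic, partition, max_messages):
    """Pull the next batch for a group and commit the new position."""
    start = store.fetch(group, topic, partition)
    batch = log[start:start + max_messages]
    store.commit(group, topic, partition, start + len(batch))
    return batch
```

Because the position belongs to the group rather than to the log, two groups can read the same partition independently at their own pace.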


kafka:// - protocol

“Binary protocol over TCP”

● Metadata
● Send
● Fetch
● Offsets
● Offset commit
● Offset fetch
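
Every request type listed above shares one framing: a 4-byte size, then a header, then the request body. A minimal sketch of that framing follows. The numeric API keys and header field widths are assumptions taken from the version 0 protocol guide, "Send" on the slide is read here as the Produce request, and `frame_request` is a hypothetical helper.

```python
import struct

# Assumed API keys for the request types on the slide.
API_KEYS = {
    "Produce": 0,
    "Fetch": 1,
    "Offsets": 2,
    "Metadata": 3,
    "OffsetCommit": 8,
    "OffsetFetch": 9,
}


def frame_request(api, correlation_id, client_id, body=b"", api_version=0):
    """Build: Size | ApiKey | ApiVersion | CorrelationId | ClientId | body."""
    cid = client_id.encode("utf-8")
    header = struct.pack(">hhih", API_KEYS[api], api_version, correlation_id, len(cid)) + cid
    payload = header + body
    return struct.pack(">i", len(payload)) + payload  # size prefix excludes itself
```

The correlation id is echoed back in the response, which lets a client match replies to requests on a single TCP connection.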
Mechanical Sympathy
"The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry." - Henry Petroski
Image source: http://www.theguide2surrey.com
Persistence
“Everything is faster till the disk IO.”
Sequential disk access can be faster than random memory access

src: http://queue.acm.org/detail.cfm?id=1563874
Linear Read & Writes

At a high level there are only two operations:

❏ Append to the end of the log
❏ Fetch messages from a partition, beginning from a particular message id

sequential file I/O
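
The two operations can be sketched with a single file that is only ever appended to and read forward. This is a toy under stated assumptions: records are length-prefixed, message ids are simple counters, and the position index lives in memory; `AppendOnlyLog` is a hypothetical class, not Kafka's log implementation.

```python
import struct


class AppendOnlyLog:
    """File-backed log: writes go to the end, reads scan forward from a message id."""

    def __init__(self, path):
        self.path = path
        self.positions = []  # byte position of each record, indexed by message id
        self.end = 0
        open(path, "wb").close()  # start from an empty file

    def append(self, payload):
        record = struct.pack(">i", len(payload)) + payload
        with open(self.path, "ab") as f:  # append mode: always writes at the end
            f.write(record)
        self.positions.append(self.end)
        self.end += len(record)
        return len(self.positions) - 1  # id of the message just written

    def fetch(self, message_id, max_messages):
        out = []
        if message_id >= len(self.positions):
            return out
        with open(self.path, "rb") as f:
            f.seek(self.positions[message_id])  # one seek, then a sequential read
            while len(out) < max_messages:
                head = f.read(4)
                if len(head) < 4:
                    break
                (length,) = struct.unpack(">i", head)
                out.append(f.read(length))
        return out
```

Neither operation rewrites existing bytes, so the disk sees one sequential write stream and mostly sequential reads.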


“Let us play pictionary”
Linux Page Cache

“Kafka ate my RAM”


ZeroCopy

src: http://www.ibm.com/developerworks/library/j-zerocopy/
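
The idea is to let the kernel move file bytes straight to the socket instead of copying them through an application buffer. The sketch below shows the same idea in Python rather than in the broker's own code: `socket.sendfile` uses `os.sendfile` where the platform provides it and falls back to ordinary `send` otherwise. `send_file` and `demo_roundtrip` are hypothetical helpers, and the demo is only meant for small payloads that fit in the socket buffer.

```python
import os
import socket
import tempfile


def send_file(path, sock):
    """Send a whole file over a connected stream socket; returns the bytes sent."""
    with open(path, "rb") as f:
        return sock.sendfile(f)  # kernel-level copy when os.sendfile is available


def demo_roundtrip(payload):
    """Write payload to a temp file, push it through a local socket pair, return what arrives."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    sender, receiver = socket.socketpair()
    try:
        send_file(tmp.name, sender)
        sender.close()  # signals end of stream to the receiver
        chunks = []
        while True:
            chunk = receiver.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        receiver.close()
        os.unlink(tmp.name)
```

This pairs naturally with the page cache: data that is already cached can go from memory to the network without entering the application at all.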
Batching
Trade a small amount of latency for improved throughput

img src: https://prashanthpanduranga.files.wordpress.com/2015/05/tirupati.jpg
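
A batcher holds messages back briefly so that many of them travel in one request. The sketch below flushes when the batch is full or when the oldest message has waited long enough. The `Batcher` class and its `max_batch` and `linger` parameters are hypothetical names for illustration, and time is passed in explicitly to keep the example deterministic.

```python
class Batcher:
    """Collect messages and release them together, by size or by age."""

    def __init__(self, max_batch, linger):
        self.max_batch = max_batch
        self.linger = linger  # longest time the first message may wait
        self.pending = []
        self.first_at = None

    def add(self, message, now):
        if not self.pending:
            self.first_at = now
        self.pending.append(message)
        return self.flush_if_ready(now)

    def flush_if_ready(self, now):
        """Return the batch if it is full or old enough, else an empty list."""
        full = len(self.pending) >= self.max_batch
        waited = bool(self.pending) and now - self.first_at >= self.linger
        if full or waited:
            batch, self.pending = self.pending, []
            return batch
        return []
```

The linger time bounds the extra latency any single message pays, while the batch size bounds memory and request size.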


Compression
Bandwidth between data centers is more expensive per byte to scale than disk I/O, CPU, or network bandwidth capacity within a facility
kafka.message.CompressionCodec
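
Compression pays off most when applied to a batch, because similar messages share redundancy that a per-message codec cannot see. The sketch below compares the two with gzip. The newline delimiter and the sample JSON events are assumptions made up for the example, and the helper names are hypothetical.

```python
import gzip
import json


def compress_batch(messages):
    """Compress many messages as one unit (assumes messages contain no newline)."""
    return gzip.compress(b"\n".join(messages))


def decompress_batch(blob):
    return gzip.decompress(blob).split(b"\n")


def compress_each(messages):
    """Compress every message on its own, for comparison."""
    return [gzip.compress(m) for m in messages]


# Hypothetical sample data: 100 small, similar events.
events = [
    json.dumps({"user": i, "event": "page_view", "url": "/home"}).encode("utf-8")
    for i in range(100)
]
batch_size = len(compress_batch(events))
individual_size = sum(len(c) for c in compress_each(events))
print(batch_size, individual_size)
```

The batch comes out far smaller than the individually compressed messages combined, which is the bandwidth saving the slide argues for.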
Log compaction
kafka.log.LogCleaner, LogCleanerManager
img src: http://kafka.apache.org/083/documentation.html
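
Compaction keeps only the latest value for each key while leaving the surviving messages at their original offsets. A minimal sketch follows. It assumes a `None` value acts as a delete marker and removes such markers immediately, whereas a real cleaner keeps them around for a while; `compact` is a hypothetical function, not the `LogCleaner` API.

```python
def compact(log):
    """Keep the last record per key.

    log: list of (key, value) in offset order; value None marks a deletion.
    Returns a list of (offset, key, value) with original offsets preserved.
    """
    last_offset = {}
    for offset, (key, _value) in enumerate(log):
        last_offset[key] = offset  # later records overwrite earlier ones
    return [
        (offset, key, value)
        for offset, (key, value) in enumerate(log)
        if last_offset[key] == offset and value is not None
    ]
```

For example, compacting `[("k1", "a"), ("k2", "b"), ("k1", "c"), ("k2", None), ("k3", "d")]` leaves offsets 2 and 4: the latest `k1` and the only `k3`.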
Message Delivery

At least once | At most once | Exactly once
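
The difference between the first two guarantees comes down to whether a consumer commits its offset before or after processing a message. The simulation below is a toy, not a Kafka client: it injects a single crash between the two steps and shows what the consumer ends up processing. `run` and its parameters are hypothetical names.

```python
def run(messages, commit_first, crash_at):
    """Consume all messages, crashing once between commit and process at index crash_at."""
    committed = 0  # next offset to read; survives the crash
    processed = []
    crashed = False
    while committed < len(messages):
        offset = committed
        if commit_first:
            committed = offset + 1  # commit, then process
            if offset == crash_at and not crashed:
                crashed = True
                continue  # crash before processing: the message is lost
            processed.append(messages[offset])
        else:
            processed.append(messages[offset])  # process, then commit
            if offset == crash_at and not crashed:
                crashed = True
                continue  # crash before committing: the message is redelivered
            committed = offset + 1
    return processed
```

With `["a", "b", "c"]` and a crash at index 1, committing first yields `["a", "c"]` (at most once) and processing first yields `["a", "b", "b", "c"]` (at least once). Exactly once needs something more, such as making the processing step safe to repeat.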


Replication
un-replicated = replication factor of one
Quorum based

● Better latency
● To tolerate “f” failures, need “2f+1” replicas
Primary-backup replication

[Figure: replicas of Topic 1, Topic 2 and Topic 3 spread across Broker 1 to Broker 4]
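
The replica arithmetic behind the two approaches is easy to state in code. The quorum figure of 2f+1 comes from the slide; the f+1 figure for primary-backup is not on the slide and is added here as the commonly cited counterpart, so treat it as an assumption. The function names are hypothetical.

```python
def quorum_replicas(failures):
    """Replicas needed so a majority survives the given number of failures."""
    return 2 * failures + 1


def primary_backup_replicas(failures):
    """Replicas needed when any single surviving copy can take over (assumed f+1)."""
    return failures + 1


def majority(replicas):
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1
```

Tolerating two failures therefore takes five replicas with a quorum but three with primary-backup, at the cost of waiting for every live replica rather than just the fastest majority.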


ZooKeeper
cluster coordinator
THANK YOU
For questions or suggestions:

Ran.ga.na.than B
[email protected]
@ran_than
