1
Why is My Stream Processing Job Slow?
Xavier Léauté, Software Engineer
Gwen Shapira, Principal Data Architect
2
Kafka 101
Distributed
Scalable
Fault-Tolerant
Partitioned + Replicated Log
Ordering guarantees
Consumers advance independently
Exactly-once delivery
Transactional commits
3
What people think of Stream Monitoring
4
What our typical experience is
Confidential 5
Real Customer Experiences
Client Side Broken Streaming Job / App
End-to-End Slow Replication
6
Your Kafka stream job stopped humming… now what?
Confidential 7
What we check
Consumer Lag
Partition Assignment
Partition Skew
Client Logs
GC Log
Metrics
Request Latencies
Commit Rates
Group Rebalancing
Basic Tuning
Batch Sizes
Commit Rate
Application Profiling
8
The Newbie - During an incident…
GC Logs? Metrics?

How do I get those?
I’ll just change some configs
and reboot everything.
9
Consumer Lag
Wait for me!
10
Bad Capacity Allocation
kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group fast-data-reader
TOPIC      PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG      CONSUMER-ID
fast-data  1          8661694         8703404         41710    myapp-1
fast-data  3          8577975         8616490         38515    myapp-2
fast-data  0          4902354         8741872         3839518  myapp-3
fast-data  2          4922614         8621757         3699143  myapp-3
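To make the capacity problem in this output easier to spot, the lag can be summed per consumer. The following is a minimal sketch, assuming the whitespace-separated column layout shown on the slide; real `kafka-consumer-groups` output may carry extra columns depending on the version, which is why the columns are looked up by header name. The function name `lag_by_consumer` is hypothetical.

```python
from collections import defaultdict

# Sample output copied from the slide above.
DESCRIBE_OUTPUT = """\
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID
fast-data 1 8661694 8703404 41710 myapp-1
fast-data 3 8577975 8616490 38515 myapp-2
fast-data 0 4902354 8741872 3839518 myapp-3
fast-data 2 4922614 8621757 3699143 myapp-3
"""


def lag_by_consumer(describe_output):
    """Sum the LAG column per CONSUMER-ID (hypothetical helper)."""
    lines = describe_output.strip().splitlines()
    header = lines[0].split()
    lag_col = header.index("LAG")
    consumer_col = header.index("CONSUMER-ID")
    totals = defaultdict(int)
    for line in lines[1:]:
        fields = line.split()
        totals[fields[consumer_col]] += int(fields[lag_col])
    return dict(totals)


totals = lag_by_consumer(DESCRIBE_OUTPUT)
# Print the most-lagged consumer first.
for consumer, lag in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{consumer}: {lag}")
```

Here myapp-3 owns two partitions and carries almost all of the lag, while myapp-1 and myapp-2 own one partition each and keep up.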
11
Watch for Partition Skew
kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group fast-data-reader
TOPIC      PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG      CONSUMER-ID
fast-data  1          8661694         8703404         41710    myapp-1
fast-data  3          8577975         8616490         38515    myapp-2
fast-data  0          4902354         8741872         3839518  myapp-3
fast-data  2          4922614         8621757         3699143  myapp-3
12
Not all partitions are created equal
Important for
Keyed topics
Custom partitioned topics
Early warning signs
some partitions lagging
uneven CPU / Network usage
Typical cause
skewed key distribution in your data
bad joins (null keys)
imbalance across brokers
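A skewed key distribution is easy to reproduce offline. The sketch below is illustrative only: the workload (1,000 distinct keys plus one hot key), the partition count, and the helper `partition_for` are all assumptions, and CRC32 stands in for the murmur2 hash that Kafka's default partitioner applies to keyed records.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(key, num_partitions=NUM_PARTITIONS):
    # Stand-in hash: Kafka's default partitioner uses murmur2, not CRC32.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical workload: 1,000 distinct keys plus one hot key seen 5,000 times.
keys = [f"user-{i}" for i in range(1000)] + ["hot-user"] * 5000

counts = Counter(partition_for(k) for k in keys)
hot_partition = partition_for("hot-user")
for partition in range(NUM_PARTITIONS):
    marker = "  <- hot key lands here" if partition == hot_partition else ""
    print(f"partition {partition}: {counts[partition]} records{marker}")
```

Every record with the hot key hashes to the same partition, so that partition (and whichever consumer owns it) receives far more traffic than the rest.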
13
Clients have metrics too!
Start with the basics GC / CPU / Network
General Slowness
Consumer or Producer Side?
Global Request Latencies
Some partitions still lagging
Per Broker metrics (bad node / network)
Per Topic metrics (data / tuning)
Buffer Size
Offset Commit
14
Turn up the log level
The logs took too
much space, so we
deleted them.
15
Time for some profiling
https://github.com/jvm-profiling-tools/async-profiler
https://github.com/brendangregg/FlameGraph
./profiler.sh -d 30 -f flamegraph.svg <pid>
To impress your coworkers
https://github.com/Netflix/flamescope
16
Here’s where your CPU cycles went
[Flame graph; axes: % CPU Time, Stack. Highlighted regions: GC, RocksDB, Kafka poll() loop, Actual Processing Time]
17
Spark Streaming Clickstream Example (using Kafka)
18
Spark Streaming Clickstream Example (using Kafka)
30% deserialization
Shuffle Writes
Scheduler
Event Loop
Read from Kafka
& Processing
19
Maybe it’s your code
20
Let’s commit, just to be safe, right?
Common beginner mistake
Commit only as needed
keep recovery short
maximize throughput
Metrics to validate
commit-rate
commit-latency-avg
MESSAGES
COMMIT
MESSAGES
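One way to commit only as needed is to commit after a fixed number of records or a fixed amount of time, whichever comes first. The sketch below models just that decision and counts how many commits each policy would issue; it does not talk to Kafka, and `CommitPolicy`, `count_commits`, and the thresholds are hypothetical names and values chosen for illustration.

```python
import time


class CommitPolicy:
    """Decide when a commit is due: after N records or T seconds."""

    def __init__(self, max_records=1000, max_seconds=5.0, clock=time.monotonic):
        self.max_records = max_records
        self.max_seconds = max_seconds
        self.clock = clock
        self.pending = 0
        self.last_commit = clock()

    def record_processed(self):
        """Call once per processed record; returns True when a commit is due."""
        self.pending += 1
        due = (self.pending >= self.max_records
               or self.clock() - self.last_commit >= self.max_seconds)
        if due:
            # The caller would commit offsets here; reset the counters.
            self.pending = 0
            self.last_commit = self.clock()
        return due


def count_commits(num_records, policy):
    """Count how many commits the policy triggers over num_records records."""
    return sum(1 for _ in range(num_records) if policy.record_processed())


# The long time limit keeps the comparison purely count-based.
per_message = count_commits(10_000, CommitPolicy(max_records=1, max_seconds=3600))
batched = count_commits(10_000, CommitPolicy(max_records=1000, max_seconds=3600))
print(f"commit per message: {per_message} commits")
print(f"commit per 1000:    {batched} commits")
```

The record and time limits bound how much work is replayed after a failure, which is the "keep recovery short" side of the trade-off; commit-rate and commit-latency-avg show what the chosen policy costs in practice.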
21
Right-size your batches
Bigger Batches
increase throughput
improve compression
Small enough (<< 10MB) to keep GC low
batch.size + linger.ms (don’t forget!)
Watch
request-rate
request-latency-avg
compression-rate
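The interaction between batch.size and linger.ms can be reasoned about with a rough model: a batch holds whatever arrives for a partition during linger.ms, capped at batch.size. The sketch below is a simplification under that assumption (it ignores compression, record overhead, and sender backpressure); the function name and the traffic figures are hypothetical.

```python
def expected_batch_bytes(bytes_per_sec, linger_ms, batch_size):
    """Rough bytes per batch for one partition: what accumulates during
    linger.ms, capped at batch.size (simplified model)."""
    accumulated = bytes_per_sec * linger_ms / 1000.0
    return min(batch_size, int(accumulated))


# Hypothetical partition receiving 1 MB/s, with batch.size=65536.
for linger_ms in (5, 50, 100):
    size = expected_batch_bytes(1_000_000, linger_ms, batch_size=65536)
    print(f"linger.ms={linger_ms:>3} -> ~{size} bytes per batch")
```

Under this model, raising batch.size alone changes nothing while linger.ms is too short to fill the batch, which is why the two settings are tuned together.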
22
My app keeps rebalancing
Symptoms
low throughput
high network chatter
consumer logs galore
no progress
hanging
23
Kafka Consumer Group Rebalancing 101
Starting point: Consumer A and Consumer B share Partitions 1 to 6
Consumer C arrives: Hi!
Join Group
Join Response
Sync Group
Sync Response
New Assignment: Consumers A, B, and C share Partitions 1 to 6
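The effect of the new assignment can be sketched with a toy assignor. This is a simplified round-robin stand-in, not the logic of Kafka's built-in assignors, and `assign_round_robin` is a hypothetical helper; it only shows how ownership of the six partitions changes when Consumer C joins.

```python
def assign_round_robin(consumers, partitions):
    """Deal partitions out to consumers in turn (simplified stand-in)."""
    ordered = sorted(consumers)
    assignment = {c: [] for c in ordered}
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


partitions = list(range(1, 7))  # Partition 1..6, as on the slide
before = assign_round_robin(["A", "B"], partitions)
after = assign_round_robin(["A", "B", "C"], partitions)
print("before:", before)
print("after: ", after)
```

With this naive scheme a single new member changes ownership of most partitions, which is part of why frequent rebalances are disruptive.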
24
Restoring a Happy Balance
Timing Issues
long GC pauses (tens of seconds)
infrequent calls to poll()
timeouts too short?
flaky network
1 bad machine affects the entire group
Watch
join-rate
sync-rate
25
Competent Users
• Monitor Consumer Lag
• Look out for Partition Skew
• Commit Offsets Sparingly
• Collect Logs
• Understand how to tune Batch Sizes
26
Kafka Pros
• Watch Group Partition Assignment
• Monitor Client Metrics
• Understand Consumer Rebalancing
• Profile their applications
• Distinguish Client/App/Broker problems
27
Replication: Everything is Slow
28
Famous last words…
“You just consume, and
produce. How hard
can this be?”
29
Famous last words…
“We have a disaster in our
main cluster. Can we fail over
to secondary? We can’t lose
more than 7 seconds of data.”
30
Monitor Replication Lag - In messages
31
Monitor Replication Lag - or in seconds…
Screenshot of replicator streams monitoring
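Lag in seconds can be approximated by comparing record timestamps on the two clusters. The sketch below is one possible approach, not a description of how the monitoring tool in the screenshot computes it; the function name and the sample timestamps are hypothetical, and the 7-second objective is taken from the quote earlier in the deck.

```python
def replication_lag_seconds(origin_latest_ts_ms, replicated_latest_ts_ms):
    """Time lag: newest origin record timestamp minus the timestamp of the
    newest record that has reached the destination."""
    lag_ms = max(0, origin_latest_ts_ms - replicated_latest_ts_ms)
    return lag_ms / 1000.0


RPO_SECONDS = 7  # "We can't lose more than 7 seconds of data."

# Hypothetical timestamps, in milliseconds since the epoch.
origin_ts = 1_700_000_012_500
replica_ts = 1_700_000_003_000

lag = replication_lag_seconds(origin_ts, replica_ts)
verdict = "within" if lag <= RPO_SECONDS else "exceeds"
print(f"replication lag: {lag:.1f}s ({verdict} the {RPO_SECONDS}s objective)")
```

A lag measured in messages cannot answer the failover question on its own, because the same message count maps to very different time windows at different produce rates.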
Confidential 32
Simple and elegant design
Origin
Destination
Consumer
Producer
Buffer (block when buffer is full)
Buffer
Producer metrics
io-ratio
io-wait-ratio
outgoing-byte-rate
batch-size-avg
batch-size-max
record-retry-rate
record-error-rate
waiting-threads
bufferpool-wait-time
Consumer metrics
io-ratio
io-wait-ratio
bytes-consumed-rate
fetch-size-avg
fetch-size-max
fetch-rate
records-lag-max
Confidential 33
Simple and elegant design
Origin
Destination
Consumer
Producer
Buffer (block when buffer is full)
Buffer
What the metrics point to
network or destination Kafka performance
increase batch.size
destination Kafka issues
network or origin Kafka performance
fetch.max.bytes
fetch.min.bytes
fetch.max.wait.ms
34
Network Tuning
• WAN has high latency. We deal with it.
• Compute buffer size to match: https://www.switch.ch/network/tools/tcp_throughput/
• send.buffer.bytes and receive.buffer.bytes on producer, consumer, brokers
• OS tuning: https://wwwx.cs.unc.edu/~sparkst/howto/network_tuning.php
  net.core.rmem_default, net.core.rmem_max, net.core.wmem_default, net.core.wmem_max
• Enable logging to check if this had any effect:
log4j.logger.org.apache.kafka.common.network.Selector=DEBUG
• Additional tips in our docs
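The buffer size that matches a WAN link is commonly estimated from the bandwidth-delay product: bandwidth multiplied by round-trip time. The sketch below applies that formula; the link speed and RTT are hypothetical, and the result is a starting point for send.buffer.bytes and receive.buffer.bytes rather than a recommended value.

```python
def bdp_bytes(bandwidth_bits_per_sec, rtt_ms):
    """Bandwidth-delay product in bytes: data in flight on a full link."""
    return int(bandwidth_bits_per_sec / 8 * rtt_ms / 1000)


# Hypothetical WAN link: 1 Gbit/s with a 100 ms round-trip time.
buffer_bytes = bdp_bytes(1_000_000_000, 100)
print(f"socket buffer to fill the link: {buffer_bytes} bytes")
```

A socket buffer smaller than this value caps throughput below the link capacity, and the OS limits (net.core.rmem_max, net.core.wmem_max) must be at least as large for the client setting to take effect.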
35
Competent users
• Monitor consumer lag
• Add processes when things are slow
• Automate deployment
36
Kafka Pros
• Monitor time lag
• Collect client metrics
• Know which side to blame
• Know which configs to tune
• Tune the network over the WAN
Resources and Next Steps
https://github.com/confluentinc/cp-demo
https://www.confluent.io/download/
https://slackpass.io/confluentcommunity
https://www.confluent.io/blog
Thank you!
@gwenshap

gwen@confluent.io
@xvrl

xavier@confluent.io
