Why is My Stream Processing Job Slow? with Xavier Leaute

1
Why is My Stream Processing Job Slow?
Xavier Léauté, Software Engineer
Gwen Shapira, Principal Data Architect

2
Kafka 101
Distributed
Scalable
Fault-Tolerant
Partitioned + Replicated Log
Ordering guarantees
Consumers advance independently
Exactly-once delivery
Transactional commits

3
What people think of Stream Monitoring

4
What our typical experience is

Confidential 5
Real Customer Experiences

Confidential 5
Client Side Broken Streaming Job / App
End-to-End Slow Replication

6
Your Kafka stream job stopped humming… now what?

7
What we check
Consumer Lag
Partition Assignment
Partition Skew
Client Logs
GC Log
Metrics
Request Latencies
Commit Rates
Group Rebalancing
Basic Tuning
Batch Sizes
Commit Rate
Application Profiling

8
The Newbie - During an incident…
GC Logs? Metrics? 
How do I get those?
I’ll just change some configs
and reboot everything.

10
Bad Capacity Allocation
kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group fast-data-reader
TOPIC      PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG      CONSUMER-ID
fast-data  1          8661694         8703404         41710    myapp-1
fast-data  3          8577975         8616490         38515    myapp-2
fast-data  0          4902354         8741872         3839518  myapp-3
fast-data  2          4922614         8621757         3699143  myapp-3
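
The allocation problem on this slide becomes easier to spot when the output is grouped by consumer. A minimal sketch, assuming the six-column layout shown above; the helper name summarize_by_consumer is hypothetical and not part of any Kafka tool:

```python
from collections import defaultdict

# Sample rows copied from the slide (header line removed).
DESCRIBE_OUTPUT = """\
fast-data 1 8661694 8703404 41710 myapp-1
fast-data 3 8577975 8616490 38515 myapp-2
fast-data 0 4902354 8741872 3839518 myapp-3
fast-data 2 4922614 8621757 3699143 myapp-3
"""

def summarize_by_consumer(text):
    """Return {consumer_id: (partition_count, total_lag)}."""
    partitions = defaultdict(int)
    lag = defaultdict(int)
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the header (its LAG column is not a number).
        if len(fields) < 6 or not fields[4].isdigit():
            continue
        consumer_id = fields[5]
        partitions[consumer_id] += 1
        lag[consumer_id] += int(fields[4])
    return {c: (partitions[c], lag[c]) for c in partitions}

if __name__ == "__main__":
    summary = summarize_by_consumer(DESCRIBE_OUTPUT)
    for consumer, (count, total) in sorted(summary.items()):
        print(f"{consumer}: {count} partition(s), total lag {total}")
```

In this sample, myapp-3 owns two partitions and carries almost all of the lag, while the other two consumers own one partition each.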

11
Watch for Partition Skew
kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group fast-data-reader
TOPIC      PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG      CONSUMER-ID
fast-data  1          8661694         8703404         41710    myapp-1
fast-data  3          8577975         8616490         38515    myapp-2
fast-data  0          4902354         8741872         3839518  myapp-3
fast-data  2          4922614         8621757         3699143  myapp-3

12
Not all partitions are created equal
Important for
Keyed topics
Custom partitioned topics
Early warning signs
some partitions lagging
uneven CPU / Network usage
Typical cause
skewed key distribution in your data
bad joins (null keys)
imbalance across brokers
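
The effect of a skewed key distribution can be illustrated offline. This is a rough sketch, not Kafka's partitioner: it assumes CRC32 as a stand-in hash (Kafka's default partitioner uses a different hash function), and the key names are hypothetical.

```python
import zlib
from collections import Counter

def partition_counts(keys, num_partitions):
    """Count records per partition using a stand-in hash (CRC32).

    Assumption: exact placement differs from Kafka's own partitioner;
    only the skew effect is illustrated here.
    """
    counts = Counter({p: 0 for p in range(num_partitions)})
    for key in keys:
        counts[zlib.crc32(key.encode("utf-8")) % num_partitions] += 1
    return counts

def skew_ratio(counts):
    """Load of the busiest partition divided by the mean load (1.0 = even)."""
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean

if __name__ == "__main__":
    even = [f"user-{i}" for i in range(10_000)]
    # Hypothetical hot key: half of all records share the key "unknown".
    hot = even[:5_000] + ["unknown"] * 5_000
    print("even keys:", round(skew_ratio(partition_counts(even, 4)), 2))
    print("hot key:  ", round(skew_ratio(partition_counts(hot, 4)), 2))
```

A ratio well above 1.0 means one partition, and therefore one consumer, receives a disproportionate share of the records.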

13
Clients have metrics too!
Start with the basics GC / CPU / Network
General Slowness
Consumer or Producer Side?
Global Request Latencies
Some partitions still lagging
Per Broker metrics (bad node / network)
Per Topic metrics (data / tuning)
Buffer Size
Offset Commit

14
Turn up the log level
The logs took too
much space, so we
deleted them.

15
Time for some profiling
https://github.com/jvm-profiling-tools/async-profiler
https://github.com/brendangregg/FlameGraph
./profiler.sh -d 30 -f flamegraph.svg <pid>
To impress your coworkers
https://github.com/Netflix/flamescope

16
Here’s where your CPU cycles went
[Flame graph: x-axis % CPU Time, y-axis Stack; highlighted regions: GC, RocksDB, Kafka poll() loop, Actual Processing Time]

17
Spark Streaming Clickstream Example (using Kafka)

18
[Flame graph of the Spark Streaming job; highlighted regions: Scheduler, Event Loop, Shuffle Writes, 30% deserialization, Read from Kafka & Processing]

20
Let’s commit, just to be safe, right?
Common beginner mistake
Commit only as needed
keep recovery short
maximize throughput
Metrics to validate
commit-rate
commit-latency-avg
[Diagram: MESSAGES, COMMIT, MESSAGES]
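
One way to commit only as needed is to commit after a number of records or after a time interval, whichever comes first. The sketch below simulates such a policy without a Kafka client; the class CommitPolicy and its thresholds max_records and max_seconds are hypothetical names for illustration, not Kafka configs.

```python
class CommitPolicy:
    """Decide when to commit: every max_records records or max_seconds seconds."""

    def __init__(self, max_records=1000, max_seconds=5.0):
        self.max_records = max_records
        self.max_seconds = max_seconds
        self.uncommitted = 0
        self.last_commit = 0.0

    def record_processed(self, now):
        """Register one processed record; return True if a commit is due."""
        self.uncommitted += 1
        due = (self.uncommitted >= self.max_records
               or now - self.last_commit >= self.max_seconds)
        if due:
            self.uncommitted = 0
            self.last_commit = now
        return due

def count_commits(policy, num_records, seconds_per_record):
    """Simulate processing and count how many commits the policy triggers."""
    commits = 0
    for i in range(1, num_records + 1):
        if policy.record_processed(now=i * seconds_per_record):
            commits += 1
    return commits

if __name__ == "__main__":
    per_record = count_commits(CommitPolicy(max_records=1), 10_000, 0.001)
    batched = count_commits(CommitPolicy(max_records=1000, max_seconds=5.0), 10_000, 0.001)
    print(f"commit per record: {per_record} commits; batched: {batched} commits")
```

The trade-off is the one named on the slide: fewer commits leave more throughput for processing, while the thresholds bound how much work is redone after a restart.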

21
Right-size your batches
Bigger Batches
increase throughput
improve compression
Small enough (<< 10MB) to keep GC low
batch.size + linger.ms
don’t forget!
Watch
request-rate
request-latency-avg
compression-rate
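
How batch.size and linger.ms interact can be approximated with simple arithmetic. This is a deliberately simplified model, assuming a batch for one partition is sent when it reaches batch.size or when linger.ms expires, whichever comes first; the real producer also batches whatever accumulates while earlier requests are in flight. The function names and the traffic figures are hypothetical.

```python
def expected_batch_bytes(batch_size, linger_ms, records_per_sec, record_bytes):
    """Approximate bytes per batch for one partition under the simplified model."""
    bytes_during_linger = records_per_sec * record_bytes * (linger_ms / 1000.0)
    # A batch always carries at least one record and never exceeds batch.size.
    return max(record_bytes, min(batch_size, bytes_during_linger))

def requests_per_sec(batch_size, linger_ms, records_per_sec, record_bytes):
    """Approximate batches sent per second under the same model."""
    per_batch = expected_batch_bytes(batch_size, linger_ms, records_per_sec, record_bytes)
    return records_per_sec * record_bytes / per_batch

if __name__ == "__main__":
    # Hypothetical traffic: 1000 records/s of 200 bytes each to one partition.
    for linger in (0, 50):
        rate = requests_per_sec(16384, linger, 1000, 200)
        print(f"linger.ms={linger}: about {rate:.0f} batches/s")
```

Under this model a small linger.ms already cuts the number of batches sharply, which is why the two settings are tuned together and then validated against request-rate and request-latency-avg.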

22
My app keeps rebalancing
Symptoms
low throughput
high network chatter
consumer logs galore
no progress
hanging

23
Kafka Consumer Group Rebalancing 101
[Diagram: Consumer A and Consumer B share Partitions 1 to 6; Consumer C joins ("Hi!")]
Join Group
Join Response
Sync Group
Sync Response
New Assignment
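
The before and after assignments in this sequence can be sketched in a few lines. The code is a simplified stand-in for a range-style assignor, assuming one topic with six partitions; real assignors (range, round-robin, sticky) differ in detail, and both function names are hypothetical.

```python
def assign_contiguous(partitions, consumers):
    """Split partitions into contiguous ranges, one per consumer."""
    members = sorted(consumers)
    base, extra = divmod(len(partitions), len(members))
    assignment, start = {}, 0
    for i, member in enumerate(members):
        # The first `extra` members take one additional partition.
        size = base + (1 if i < extra else 0)
        assignment[member] = partitions[start:start + size]
        start += size
    return assignment

def moved_partitions(before, after):
    """Partitions whose owner changed between two assignments."""
    owner_before = {p: c for c, ps in before.items() for p in ps}
    owner_after = {p: c for c, ps in after.items() for p in ps}
    return sorted(p for p in owner_after if owner_before.get(p) != owner_after[p])

if __name__ == "__main__":
    partitions = [1, 2, 3, 4, 5, 6]
    before = assign_contiguous(partitions, ["A", "B"])
    after = assign_contiguous(partitions, ["A", "B", "C"])
    print("before:", before)
    print("after: ", after)
    print("moved: ", moved_partitions(before, after))
```

Every partition that changes owner has to be handed over during the rebalance, which is one reason frequent rebalances hurt throughput.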

24
Restoring a Happy Balance
Timing Issues
long GC pauses (tens of seconds)
infrequent calls to poll()
timeouts too short?
flaky network
1 bad machine affects the entire group
Watch
join-rate
sync-rate
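
Infrequent calls to poll() can be checked with back-of-the-envelope arithmetic: a consumer that takes longer than max.poll.interval.ms between polls is removed from the group, which triggers a rebalance. A minimal sketch, assuming a fixed per-record processing time; the function names are hypothetical, and the example values for max.poll.records and max.poll.interval.ms should be checked against your own client configuration.

```python
def time_between_polls_ms(max_poll_records, ms_per_record):
    """Worst-case processing time for one full batch returned by poll()."""
    return max_poll_records * ms_per_record

def risks_rebalance(max_poll_records, ms_per_record, max_poll_interval_ms):
    """True if processing one full batch can exceed max.poll.interval.ms."""
    return time_between_polls_ms(max_poll_records, ms_per_record) > max_poll_interval_ms

if __name__ == "__main__":
    # Example settings: max.poll.records=500, max.poll.interval.ms=300000.
    for ms_per_record in (100, 1000):
        risky = risks_rebalance(500, ms_per_record, 300_000)
        print(f"{ms_per_record} ms/record: rebalance risk = {risky}")
```

If the check fails, the usual options are fewer records per poll, faster processing, or a longer interval.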

25
Competent Users
• Monitor Consumer Lag
• Look out for Partition Skew
• Commit Offsets Sparingly
• Collect Logs
• Understand how to tune Batch Sizes

26
Kafka Pros
• Watch Group Partition Assignment
• Monitor Client Metrics
• Understand Consumer Rebalancing
• Profile their applications
• Distinguish Client/App/Broker problems

27
Replication: Everything is Slow

28
Famous last words…
“You just consume, and
produce. How hard
can this be?”

29
Famous last words…
“We have a disaster in our
main cluster. Can we fail over
to secondary? We can’t lose
more than 7 seconds of data.”

30
Monitor Replication Lag - In messages

31
Monitor Replication Lag - or in seconds…
Screenshot of replicator streams monitoring
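
Lag in seconds can be derived from record timestamps rather than offsets. A minimal sketch, assuming you can read the timestamp of the newest record in the origin cluster and of the newest record already replicated; both function names and the sample timestamps are hypothetical.

```python
def replication_time_lag_s(origin_latest_ts_ms, last_replicated_ts_ms):
    """Seconds between the newest origin record and the newest replicated one."""
    return max(0, origin_latest_ts_ms - last_replicated_ts_ms) / 1000.0

def within_recovery_objective(lag_s, max_loss_s):
    """True if failing over now would lose no more than max_loss_s of data."""
    return lag_s <= max_loss_s

if __name__ == "__main__":
    lag = replication_time_lag_s(1_700_000_012_500, 1_700_000_003_000)
    print(f"time lag: {lag} s; within 7 s objective: {within_recovery_objective(lag, 7)}")
```

Expressed this way, the lag can be compared directly with a requirement such as "we can’t lose more than 7 seconds of data".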

32
Simple and elegant design
[Diagram: Origin, Consumer, Buffer, Producer, Buffer, Destination; block when buffer is full]
Producer metrics
io-ratio
io-wait-ratio
outgoing-byte-rate
batch-size-avg
batch-size-max
record-retry-rate
record-error-rate
waiting-threads
bufferpool-wait-time
Consumer metrics
io-ratio
io-wait-ratio
bytes-consumed-rate
fetch-size-avg
fetch-size-max
fetch-rate
records-lag-max

33
[Diagram: Origin, Consumer, Buffer, Producer, Buffer, Destination; block when buffer is full]
Producer side
network or destination Kafka performance
increase batch.size
destination Kafka issues
Consumer side
network or origin Kafka performance
fetch.max.bytes
fetch.min.bytes
fetch.max.wait.ms

34
Network Tuning
• WAN has high latency. We deal with it.
• Compute buffer size to match: https://www.switch.ch/network/tools/tcp_throughput/
• send.buffer.bytes and receive.buffer.bytes on producer and consumer; socket.send.buffer.bytes and socket.receive.buffer.bytes on brokers
• OS tuning: https://wwwx.cs.unc.edu/~sparkst/howto/network_tuning.php
net.core.rmem_default, net.core.rmem_max, net.core.wmem_default, net.core.wmem_max
• Enable logging to check if this had any effect:
log4j.logger.org.apache.kafka.common.network.Selector=DEBUG
• Additional tips in our docs
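
The buffer size for a high-latency link follows from the bandwidth-delay product: the number of bytes that must be in flight to keep the link full. A minimal sketch with a hypothetical function name and example link figures:

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_sec, rtt_ms):
    """Bytes in flight needed to saturate a link of this bandwidth and round-trip time."""
    return int(bandwidth_bits_per_sec / 8 * rtt_ms / 1000)

if __name__ == "__main__":
    # Example: a 1 Gbit/s WAN link with a 100 ms round-trip time.
    bdp = bandwidth_delay_product_bytes(1_000_000_000, 100)
    print(f"socket buffers should be at least about {bdp} bytes")
```

The result is a lower bound for send.buffer.bytes and receive.buffer.bytes, and the OS limits (net.core.rmem_max, net.core.wmem_max) have to allow buffers of that size.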

35
Competent users
• Monitor consumer lag
• Add processes when things are slow
• Automate deployment

36
Kafka Pros
• Monitor time lag
• Collect client metrics
• Know which side to blame
• Know which configs to tune
• Tune the network over the WAN

Resources and Next Steps
https://github.com/confluentinc/cp-demo
https://www.confluent.io/download/
https://slackpass.io/confluentcommunity
https://www.confluent.io/blog

Thank you!
@gwenshap 
gwen@confluent.io
@xvrl 
xavier@confluent.io
