The goal of most stream processing jobs is to process data and deliver insights to the business, fast. Unfortunately, our stream processing jobs sometimes fall short of this goal. Or perhaps a job used to run fine, but one day it just isn’t fast enough. In this talk, we’ll dive into the challenges of analyzing the performance of real-time stream processing applications. We’ll share troubleshooting suggestions and some of our favorite tools. So the next time someone asks “why is this taking so long?”, you’ll know what to do.
Why is My Stream Processing Job Slow? with Xavier Leaute
1. 1
Why is My Stream Processing Job Slow?
Xavier Léauté, Software Engineer
Gwen Shapira, Principal Data Architect
17. 12
Not all partitions are created equal
Important for
Keyed topics
Custom partitioned topics
Early warning signs
some partitions lagging
uneven CPU / Network usage
Typical cause
skewed key distribution in your data
bad joins (null keys)
imbalance across brokers
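The "some partitions lagging" warning sign can be checked mechanically once per-partition lag is available. The following is a minimal sketch, assuming the lag values have already been collected into a dict; the function name `find_lagging_partitions` and the `factor` threshold are hypothetical and not part of any Kafka API.

```python
from statistics import median


def find_lagging_partitions(lag_by_partition, factor=3.0):
    """Return partitions whose lag exceeds `factor` times the median lag.

    `lag_by_partition` maps partition id -> consumer lag (in messages).
    `factor` is an illustrative assumption; tune it for your own data.
    """
    if not lag_by_partition:
        return []
    typical = median(lag_by_partition.values())
    # Use at least 1 so tiny lags are not flagged when the median is 0.
    threshold = factor * max(typical, 1)
    return sorted(p for p, lag in lag_by_partition.items() if lag > threshold)


# Example: partition 3 receives a hot key (or all the null-key records).
lags = {0: 120, 1: 95, 2: 110, 3: 48000, 4: 130, 5: 101}
print(find_lagging_partitions(lags))
```

A partition that stands out like this usually points back to the causes listed above: a skewed key distribution, null keys from a bad join, or an unbalanced broker.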
18. 13
Clients have metrics too!
Start with the basics: GC / CPU / Network
General Slowness
Consumer or Producer Side?
Global Request Latencies
Some partitions still lagging
Per Broker metrics (bad node / network)
Per Topic metrics (data / tuning)
Buffer Size
Offset Commit
19. 14
Turn up the log level
The logs took too much space, so we deleted them.
20. 15
Time for some profiling
https://github.com/jvm-profiling-tools/async-profiler
https://github.com/brendangregg/FlameGraph
./profiler.sh -d 30 -f flamegraph.svg <pid>
To impress your coworkers
https://github.com/Netflix/flamescope
33. 20
Let’s commit, just to be safe, right?
Common beginner mistake
Commit only as needed
keep recovery short
maximize throughput
Metrics to validate
commit-rate
commit-latency-avg
[Diagram: MESSAGES, COMMIT, MESSAGES]
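To make the cost of over-committing concrete, here is a small simulation that counts commit calls for two commit intervals. It is only a sketch: `FakeConsumer` and `process` are hypothetical stand-ins and do not use a real Kafka client.

```python
class FakeConsumer:
    """Stand-in for a Kafka consumer; it only counts commit calls.

    Real clients commit the offset of the next message to read; this
    stand-in records the last processed offset to keep the sketch short.
    """

    def __init__(self):
        self.commits = 0
        self.committed_offset = -1

    def commit(self, offset):
        self.commits += 1
        self.committed_offset = offset


def process(messages, consumer, commit_every):
    """Handle each message and commit once per `commit_every` messages."""
    last = -1
    for offset, _payload in enumerate(messages):
        # ... real processing of _payload would happen here ...
        last = offset
        if (offset + 1) % commit_every == 0:
            consumer.commit(offset)
    # Commit any tail that did not fill a whole interval.
    if last >= 0 and consumer.committed_offset != last:
        consumer.commit(last)
    return consumer.commits


messages = ["m%d" % i for i in range(1000)]
per_message = process(messages, FakeConsumer(), commit_every=1)
per_batch = process(messages, FakeConsumer(), commit_every=100)
print(per_message, per_batch)
```

The trade-off is the one the slide names: a larger interval means fewer commit round trips, but up to `commit_every` messages may be reprocessed after a crash. The commit-rate and commit-latency-avg metrics show where a real application sits on that curve.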
39. 21
Right-size your batches
Bigger Batches
increase throughput
improve compression
Small enough (<< 10MB) to keep GC low
batch.size + linger.ms (don’t forget!)
Watch
request-rate
request-latency-avg
compression-rate-avg
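A quick back-of-the-envelope check shows why batch.size and linger.ms have to be tuned together: the producer sends a batch when it is full or when linger.ms expires, whichever comes first. The sketch below estimates the fill time for one partition. The function names and the traffic numbers are illustrative assumptions, and compression is ignored.

```python
def batch_fill_time_ms(batch_size_bytes, avg_record_bytes, records_per_sec):
    """Estimate how long one partition's batch takes to fill, in milliseconds.

    `records_per_sec` is the rate for a single partition, since batches
    are built per partition.
    """
    records_per_batch = batch_size_bytes / avg_record_bytes
    return 1000.0 * records_per_batch / records_per_sec


def batches_fill_before_linger(batch_size_bytes, avg_record_bytes,
                               records_per_sec, linger_ms):
    """True if a batch is expected to fill before linger.ms expires."""
    fill_ms = batch_fill_time_ms(batch_size_bytes, avg_record_bytes,
                                 records_per_sec)
    return fill_ms <= linger_ms


# Assumed workload: 1 KiB records at 2000 records/s on one partition,
# with batch.size set to 64 KiB.
fill_ms = batch_fill_time_ms(64 * 1024, 1024, 2000)
print(fill_ms)
print(batches_fill_before_linger(64 * 1024, 1024, 2000, linger_ms=5))
print(batches_fill_before_linger(64 * 1024, 1024, 2000, linger_ms=50))
```

If the estimated fill time is well above linger.ms, raising batch.size alone changes little, because batches are sent before they fill. The batch-size-avg metric is a way to confirm this on a running producer.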
40. 22
My app keeps rebalancing
Symptoms
low throughput
high network chatter
consumer logs galore
no progress
hanging
41. 23
Kafka Consumer Group Rebalancing 101
Consumer A
Consumer B
Partition 1
Partition 2
Partition 3
Partition 4
Partition 5
Partition 6
42. 23
Kafka Consumer Group Rebalancing 101
Consumer A
Consumer B
Consumer C
Partition 1
Partition 2
Partition 3
Partition 4
Partition 5
Partition 6
Hi!
43. 23
Kafka Consumer Group Rebalancing 101
Consumer A
Consumer B
Consumer C
Partition 1
Partition 2
Partition 3
Partition 4
Partition 5
Partition 6
Join Group
Hi!
46. 23
Kafka Consumer Group Rebalancing 101
Consumer A
Consumer B
Consumer C
Partition 1
Partition 2
Partition 3
Partition 4
Partition 5
Partition 6
Join Response
48. 23
Kafka Consumer Group Rebalancing 101
Consumer A
Consumer B
Consumer C
Partition 1
Partition 2
Partition 3
Partition 4
Partition 5
Partition 6
Sync Group
49. 23
Kafka Consumer Group Rebalancing 101
Consumer A
Consumer B
Consumer C
Partition 1
Partition 2
Partition 3
Partition 4
Partition 5
Partition 6
Sync Response
50. 23
Kafka Consumer Group Rebalancing 101
Consumer A
Consumer B
Consumer C
Partition 1
Partition 2
Partition 3
Partition 4
Partition 5
Partition 6
New Assignment
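The effect of the walkthrough above (Consumer C joins, the group goes through Join Group and Sync Group, and a new assignment comes out) can be imitated with a toy assignment function. This is a simplified sketch: `assign` and `owner_of` are hypothetical helpers, and real Kafka assignors (range, round-robin, sticky, cooperative) differ in detail.

```python
def assign(partitions, consumers):
    """Spread partitions over consumers in contiguous ranges.

    A simplified stand-in for a partition assignor, not Kafka's own logic.
    """
    if not consumers:
        raise ValueError("at least one consumer is required")
    consumers = sorted(consumers)
    base, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        # The first `extra` consumers take one additional partition.
        count = base + (1 if i < extra else 0)
        assignment[consumer] = partitions[start:start + count]
        start += count
    return assignment


def owner_of(assignment):
    """Invert {consumer: [partitions]} into {partition: consumer}."""
    return {p: c for c, parts in assignment.items() for p in parts}


partitions = [1, 2, 3, 4, 5, 6]
before = assign(partitions, ["A", "B"])
after = assign(partitions, ["A", "B", "C"])  # Consumer C joins the group

before_owner, after_owner = owner_of(before), owner_of(after)
moved = [p for p in partitions if before_owner[p] != after_owner[p]]
print(before)
print(after)
print(moved)
```

Even in this toy version, one new member changes the owner of half the partitions, which is why frequent rebalances show up as low throughput and no progress.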
51. 24
Restoring a Happy Balance
Timing Issues
long GC pauses (tens of seconds)
infrequent calls to poll()
timeouts too short?
flaky network
1 bad machine affects the entire group
Watch
join-rate
sync-rate
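For the "infrequent calls to poll()" case, a rough budget calculation helps decide whether a timeout is too short or a batch is too large. The sketch below assumes the consumer settings `max.poll.interval.ms` and `max.poll.records`. The helper names, the per-record processing time, and the `safety` margin are hypothetical, and the 300000 ms and 500 record values are the defaults as I understand them for recent clients, so check your version.

```python
def exceeds_poll_interval(records, per_record_ms, max_poll_interval_ms):
    """True if processing one polled batch takes longer than the poll interval."""
    return records * per_record_ms > max_poll_interval_ms


def max_safe_poll_records(max_poll_interval_ms, per_record_ms, safety=0.5):
    """Largest batch that fits in a `safety` share of the poll interval.

    `safety` leaves headroom for GC pauses and slow records (an assumption).
    """
    budget_ms = max_poll_interval_ms * safety
    return max(1, int(budget_ms // per_record_ms))


# Assumed workload: each record takes about 1 second to process.
too_slow = exceeds_poll_interval(500, 1000, 300000)
suggested = max_safe_poll_records(300000, 1000)
print(too_slow, suggested)
```

A consumer that blows this budget is considered failed and triggers a rebalance for the whole group, which then shows up in join-rate and sync-rate. Long GC pauses are a separate path to the same symptom and are not modeled here.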
52. 25
Competent Users
• Monitor Consumer Lag
• Look out for Partition Skew
• Commit Offsets Sparingly
• Collect Logs
• Understand how to tune Batch Sizes
66. 32
Simple and elegant design
[Diagram: Origin, Consumer, Buffer (block when buffer is full), Producer, Destination]
Producer metrics
io-ratio
io-wait-ratio
outgoing-byte-rate
batch-size-avg
batch-size-max
record-retry-rate
record-error-rate
waiting-threads
bufferpool-wait-time
Consumer metrics
io-ratio
io-wait-ratio
bytes-consumed-rate
fetch-size-avg
fetch-size-max
fetch-rate
records-lag-max
72. 33
Simple and elegant design
[Diagram: Origin, Consumer, Buffer (block when buffer is full), Producer, Destination]
Network or destination Kafka performance
Increase batch.size
Destination Kafka issues
Network or origin Kafka performance
fetch.max.bytes
fetch.min.bytes
fetch.max.wait.ms
73. 34
Network Tuning
• WAN has high latency. We deal with it.
• Compute buffer size to match: https://www.switch.ch/network/tools/tcp_throughput/
• send.buffer.bytes and receive.buffer.bytes on producer and consumer; socket.send.buffer.bytes and socket.receive.buffer.bytes on brokers
• OS tuning: https://wwwx.cs.unc.edu/~sparkst/howto/network_tuning.php
net.core.rmem_default, net.core.rmem_max, net.core.wmem_default, net.core.wmem_max
• Enable logging to check if this had any effect:
log4j.logger.org.apache.kafka.common.network.Selector=DEBUG
• Additional tips in our docs
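"Compute buffer size to match" usually means sizing the socket buffers to the bandwidth-delay product of the WAN link. Here is a minimal sketch of that arithmetic; the function name `bdp_bytes` and the link numbers are illustrative assumptions.

```python
def bdp_bytes(bandwidth_mbit_per_s, rtt_ms):
    """Bandwidth-delay product: bytes that must be in flight to fill the link.

    bytes = (bits per second / 8) * (round-trip time in seconds)
    """
    return round(bandwidth_mbit_per_s * 1_000_000 * rtt_ms / 8000)


# Assumed link: 1 Gbit/s with an 80 ms round-trip time.
size = bdp_bytes(1000, 80)

# Client settings named on the slide, sized to the computed value.
suggested = {
    "send.buffer.bytes": size,
    "receive.buffer.bytes": size,
}
print(size)
print(suggested)
```

On Linux the kernel limits named above (net.core.rmem_max and net.core.wmem_max) cap what a socket can actually get, so they need to be at least as large as the value requested here.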
75. 36
Kafka Pros
• Monitor time lag
• Collect client metrics
• Know which side to blame
• Know which configs to tune
• Tune the network over the WAN
76. Resources and Next Steps
https://github.com/confluentinc/cp-demo
https://www.confluent.io/download/
https://slackpass.io/confluentcommunity
https://www.confluent.io/blog