Kafka Interview Problems Clean
A consumer group lags behind during traffic spikes. How would you identify and
resolve the bottleneck?
First, diagnose the source of lag:
Use kafka-consumer-groups.sh --describe --group <group> --bootstrap-server <broker> to inspect per-partition lag.
Determine whether lag is due to message processing time, limited parallelism, or
misconfigured consumer settings.
Common bottlenecks include:
Slow downstream systems (e.g., DB writes, HTTP calls).
Too few partitions to allow parallelism.
Offsets not being committed properly (causing reprocessing).
High GC pressure or threading issues on the consumer app.
To fix the issue:
Optimize the consumer processing logic (e.g., async I/O, batching DB calls).
Increase the number of partitions to allow horizontal scaling.
Tune settings like max.poll.records, fetch.min.bytes, or increase thread pool size.
Add lag monitoring and alerting (e.g., Prometheus/Grafana on consumer lag metrics) and auto-scale consumer instances, up to the partition count.
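The tuning step above can be sketched as a plain consumer configuration. The group name and the values are illustrative starting points (assumptions, not universal recommendations), and in a real application these Properties would be passed to a KafkaConsumer:

```java
import java.util.Properties;

public class ConsumerTuning {
    // Illustrative settings for a lagging consumer; tune against real workloads.
    public static Properties tunedConsumerProps() {
        Properties props = new Properties();
        props.setProperty("group.id", "order-processors");   // hypothetical group name
        props.setProperty("max.poll.records", "1000");       // larger batches per poll()
        props.setProperty("fetch.min.bytes", "65536");       // wait for ~64 KB per fetch...
        props.setProperty("fetch.max.wait.ms", "100");       // ...but at most 100 ms
        props.setProperty("max.poll.interval.ms", "300000"); // allow slow batch processing
        return props;
    }

    public static void main(String[] args) {
        System.out.println(tunedConsumerProps().getProperty("max.poll.records"));
    }
}
```

Raising max.poll.records helps only if processing is batched; raising it with slow per-record processing can instead trigger rebalances when max.poll.interval.ms is exceeded.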
How would you design a Kafka topic strategy for a multi-tenant platform with
millions of users and dozens of data domains?
Avoid creating a topic per user: millions of topics and their partitions would overwhelm broker and controller metadata.
Instead, design shared domain topics:
Embed tenant/user ID in the key (e.g., tenantId:userId) to maintain partition-level
ordering.
Example: user.activity.events topic with key-based partitioning by tenant.
Determine topic granularity based on domain context:
Use logically grouped topics like billing.events, profile.updates, etc.
Balance data volume, retention requirements, and consumer access patterns.
Implement schema management with Avro/Protobuf and Schema Registry.
Enforce access control (e.g., Kafka ACLs) when topics are exposed to external consumers.
Ensure even partition distribution using consistent hashing or composite keys.
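The composite-key idea above can be sketched as follows. Note that Kafka's default partitioner hashes keys with murmur2; the floorMod over hashCode below is a simplified stand-in for illustration, and the tenant/user names are hypothetical:

```java
public class TenantKeying {
    // Composite key so all events for one user land on one partition (per-user ordering).
    static String recordKey(String tenantId, String userId) {
        return tenantId + ":" + userId;
    }

    // Simplified stand-in for Kafka's partitioner (which uses murmur2, not hashCode).
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        String key = recordKey("acme", "user-42");
        System.out.println(key + " -> partition " + partitionFor(key, 12));
    }
}
```

Because the key, not the tenant alone, drives the hash, a large tenant's users still spread across partitions instead of hot-spotting a single one.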
A Kafka topic has out-of-order messages. What could be the cause and how
would you fix it?
Kafka guarantees order only within a single partition.
Common causes of out-of-order delivery:
Improper use of keys — same logical message stream split across partitions.
Multiple producers with inconsistent keying or without keys.
Producer retries with max.in.flight.requests.per.connection greater than 1 (a failed batch may be resent after a later batch succeeds), or replays that re-inject old messages.
To resolve:
Use a consistent partition key (e.g., user ID) to ensure message locality.
Configure producers for FIFO delivery:
Enable enable.idempotence=true.
Set acks=all and max.in.flight.requests.per.connection=1 (with idempotence enabled, values up to 5 still preserve ordering).
Add sequence numbers to the message payload to allow downstream reordering if
needed.
If strict ordering across keys is required, use Kafka Streams with windowed or stateful
logic (with caution).
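The producer settings above, as a minimal sketch using plain Properties (in a real application these would be passed to a KafkaProducer):

```java
import java.util.Properties;

public class OrderedProducerConfig {
    // Producer settings that preserve per-partition ordering even under retries.
    public static Properties orderedProducerProps() {
        Properties props = new Properties();
        props.setProperty("enable.idempotence", "true"); // dedup + ordered retries
        props.setProperty("acks", "all");                // wait for all in-sync replicas
        // Strictest setting; with idempotence enabled, values up to 5 also keep order.
        props.setProperty("max.in.flight.requests.per.connection", "1");
        props.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        return props;
    }
}
```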
How would you implement a Kafka-based system that supports message replay
for debugging or reprocessing?
Start with long-retention or compacted Kafka topics (compaction retains only the latest record per key, so use time-based retention when full history must be replayable).
Ensure all consumers are deterministic and idempotent.
Common replay strategies:
Run a consumer under a fresh group ID with auto.offset.reset=earliest (the setting only takes effect when the group has no committed offsets).
Externally manage offsets (e.g., store checkpoints in DB).
Use a Dead Letter Topic (DLT) to isolate and replay failures.
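The idempotency requirement above can be sketched with a processed-ID set; this is a simplified in-memory illustration (in production the set would live in a durable store such as a database keyed by event ID), and the class and method names are assumptions:

```java
import java.util.HashSet;
import java.util.Set;

public class IdempotentHandler {
    // In production this would be a durable store (DB, Redis), not an in-memory set.
    private final Set<String> processedIds = new HashSet<>();

    // Runs the side effect only the first time an event ID is seen,
    // so replaying the same events is a safe no-op.
    public boolean process(String eventId, Runnable sideEffect) {
        if (!processedIds.add(eventId)) {
            return false; // already processed: skip on replay
        }
        sideEffect.run();
        return true;
    }
}
```

With handlers shaped like this, any of the replay strategies above can be applied without double-applying effects downstream.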
Advanced replay architectures:
Mirror events to a dedicated 'replay' topic.
Use timestamp-based offset seeking with Kafka's API.
Tooling and support:
Persist historical data to S3/Elasticsearch using Kafka Connect.
Use Kafka Streams to rebuild derived states from event history.
Expose a UI (e.g., AKHQ, Kafka UI) to allow selective or targeted replay by operators.