Using Kafka For Real Time Data Ingestion With .NET
Kevin Feasel
Engineering Manager, Predictive Analytics
ChannelAdvisor
#ITDEVCONNECTIONS | ITDEVCONNECTIONS.COM
Who Am I? What Am I Doing Here?
Catallaxy Services
Curated SQL
We Speak Linux
@feaselkl
Apache Kafka
Apache Kafka is a message broker commonly run alongside the Hadoop stack.
It receives messages from producers and serves them to consumers, which
pull at their own pace.
Everything in Kafka is distributed.
Why Use A Broker?
Suppose we have two applications which want to
communicate. We connect them directly.
Why Use A Broker?
We then expand out. Again.
It takes some effort here: we need to manage connection strings and
write to the correct database.
Why Use A Broker?
But what happens when a consumer (database) goes down for too long?
• Producer drops messages
• Producer holds messages (until it runs out of disk)
• Producer returns error
Motivation
Today's talk will focus on using Kafka to ingest, enrich, and
consume data. We will build .NET applications on Windows to
talk to a Kafka cluster on Linux.
Our data source is flight data. I’d like to ask a few questions,
with answers split out by destination state:
1. How many flights did we have in 2008?
2. How many flights' arrivals were delayed?
3. How many minutes of arrival delay did we have?
4. Given a flight with a delay, how long can we expect it to be?
Kafka Concepts
Most message brokers act as queues.
Kafka Concepts
Kafka is a log, not a queue.
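To make the log-versus-queue distinction concrete, here is a minimal in-memory sketch. It is written in Python for brevity and is not the talk's .NET demo code; the `Queue` and `Log` classes are hypothetical models, not Kafka APIs.

```python
from collections import deque


class Queue:
    """Queue-style broker: reading a message removes it."""

    def __init__(self):
        self._items = deque()

    def send(self, message):
        self._items.append(message)

    def receive(self):
        # Destructive read: once one reader takes a message, it is gone.
        return self._items.popleft() if self._items else None


class Log:
    """Log-style broker: messages are appended and stay in place."""

    def __init__(self):
        self._records = []

    def append(self, message):
        self._records.append(message)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset):
        # Non-destructive read: any reader may ask for any retained offset.
        return self._records[offset] if 0 <= offset < len(self._records) else None


queue, log = Queue(), Log()
for m in ["a", "b"]:
    queue.send(m)
    log.append(m)

queue_reads = [queue.receive(), queue.receive(), queue.receive()]
reader_one = [log.read(0), log.read(1)]
reader_two = [log.read(0), log.read(1)]
```

In this model the queue is empty after two reads, while two independent readers can each replay the full log.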
Kafka Concepts
Topics are categories or feeds to which messages get
published. Topics are broken up into partitions. Partitions are
ordered, immutable sequences of records.
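A rough sketch of how a topic splits into ordered partitions. The `Topic` class and the topic name "flights" are hypothetical, and CRC32 is used here only as a stand-in hash; real Kafka clients pick their own partitioner.

```python
import zlib


class Topic:
    """A topic modeled as a fixed number of append-only partitions."""

    def __init__(self, name, partition_count):
        self.name = name
        self.partitions = [[] for _ in range(partition_count)]

    def publish(self, key, value):
        # Hash the key so that the same key always maps to the same partition.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        partition = self.partitions[index]
        partition.append((key, value))
        # Return (partition index, offset within that partition).
        return index, len(partition) - 1


topic = Topic("flights", partition_count=3)
placements = [
    topic.publish(key, value)
    for key, value in [("OH", "flight 1"), ("NC", "flight 2"), ("OH", "flight 3")]
]
```

Ordering is only meaningful within a partition: both "OH" records land in the same partition, and the later one gets the higher offset.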
Kafka Concepts
Producers push messages to Kafka.
Kafka Concepts
Consumers read messages from topics.
Kafka Concepts
Consumers enlist in consumer groups. Consumer groups act
as "logical subscribers" and Kafka distributes load to
consumers in a group.
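One way to picture "logical subscribers" is the sketch below: every group sees all partitions, while the consumers inside a group split them. The round-robin `assign` function and the group names are assumptions for illustration; Kafka's actual assignment strategies are more involved.

```python
def assign(partitions, consumers):
    """Round-robin the partitions across the consumers of one group."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3]
groups = {"reporting": ["r1", "r2"], "archiver": ["a1"]}

# Each group gets its own complete assignment over the same partitions.
plan = {group: assign(partitions, members) for group, members in groups.items()}
```

Here the two "reporting" consumers share the load, and the single "archiver" consumer independently reads everything.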
Kafka Concepts
Records in partitions are immutable. You do not modify existing
records, but you can append new ones.
Kafka Concepts
• Consumers should know where they left off. Kafka assists
by storing consumer group-specific last-read pointer values
per topic and partition.
• Kafka retains messages for a certain (configurable) amount
of time, after which point they drop off.
• Kafka can also delete the oldest messages once a partition's log
reaches a certain (configurable) size on disk.
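The interplay of per-group offsets and retention can be sketched as follows. This is a simplified model, not Kafka's implementation: the `Partition` class, the `poll` helper, and the explicit `now` timestamps are hypothetical, and only time-based retention is modeled.

```python
class Partition:
    """Append-only partition with time-based retention."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.records = []  # (offset, timestamp, value)
        self.next_offset = 0

    def append(self, value, now):
        self.records.append((self.next_offset, now, value))
        self.next_offset += 1

    def expire(self, now):
        # Drop records older than the retention window; offsets never change.
        cutoff = now - self.retention_seconds
        self.records = [r for r in self.records if r[1] >= cutoff]

    def read_from(self, offset):
        return [(o, v) for o, _, v in self.records if o >= offset]


# (group, topic, partition id) -> next offset that group should read
committed = {}


def poll(group, topic, partition_id, partition):
    key = (group, topic, partition_id)
    batch = partition.read_from(committed.get(key, 0))
    if batch:
        committed[key] = batch[-1][0] + 1
    return [value for _, value in batch]


p = Partition(retention_seconds=60)
p.append("m0", now=0)
p.append("m1", now=30)
first = poll("reporting", "flights", 0, p)
p.append("m2", now=70)
p.expire(now=70)  # m0 is now older than 60 seconds and drops off
second = poll("reporting", "flights", 0, p)
late = poll("archiver", "flights", 0, p)
```

The "reporting" group resumes where it left off, while the late-arriving "archiver" group sees only what retention has kept.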
The Competition
• MSMQ and Service Broker: queues in Microsoftland
• Amazon Kinesis and Azure Event Hubs: Kafka-like streaming as a service
• RabbitMQ: complex routing & guaranteed reliability
• Celery: distributed queue built for Python
• ZeroMQ: socket-based distributed queueing
• Queues.io lists dozens of queues and brokers
Building A Producer
Our first application reads data from a CSV and pushes
messages onto a topic.
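The shape of such a producer, sketched in Python rather than the talk's .NET code. The column names are made up for the example, and `send` is a stand-in callable; in a real application it would wrap a Kafka client's produce call.

```python
import csv
import io
import json


def produce_from_csv(lines, send):
    """Read CSV rows and hand each one to `send` as a JSON message."""
    count = 0
    for row in csv.DictReader(lines):
        send(json.dumps(row))
        count += 1
    return count


# Hypothetical sample data standing in for the flight CSV file.
sample = io.StringIO("Year,Dest,ArrDelay\n2008,RDU,12\n2008,CMH,-3\n")
sent = []  # stands in for the topic
total = produce_from_csv(sample, sent.append)
```

Passing `send` in as a parameter keeps the file-reading logic testable without a running cluster.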
Building A Producer
I chose Confluent's Kafka .NET library (née RDKafka-dotnet).
Demo Time
Building An Enricher
Our second application reads data from one topic and pushes
messages onto a different topic.
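A minimal sketch of the read-transform-write loop. The specific enrichment shown (adding a destination state from an airport code) is an assumption chosen to match the questions asked earlier; the lookup table, field names, and `publish` callable are hypothetical.

```python
import json

# Hypothetical lookup table; a real enricher might load this from a database.
AIRPORT_STATE = {"RDU": "NC", "CMH": "OH"}


def enrich(raw_message):
    """Add a DestState field to one JSON-encoded flight message."""
    flight = json.loads(raw_message)
    flight["DestState"] = AIRPORT_STATE.get(flight["Dest"], "Unknown")
    return json.dumps(flight)


def run_enricher(source_messages, publish):
    # Read from the source topic, transform, write to the destination topic.
    for raw in source_messages:
        publish(enrich(raw))


incoming = ['{"Dest": "RDU", "ArrDelay": "12"}', '{"Dest": "XXX", "ArrDelay": "0"}']
enriched = []  # stands in for the enriched topic
run_enricher(incoming, enriched.append)
```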
Building An Enricher
Enrichment opportunities:
Demo Time
Building A Consumer
Our third application reads data from the enriched topic,
aggregates, and periodically writes results to SQL Server.
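The aggregation step might look roughly like this. It covers only the in-memory rollup by destination state, not the periodic write to SQL Server; the field names and the rule that a flight counts as delayed when its arrival delay is positive are assumptions.

```python
from collections import defaultdict


def aggregate(flights):
    """Roll up flight counts and arrival delays by destination state."""
    totals = defaultdict(lambda: {"flights": 0, "delayed": 0, "delay_minutes": 0})
    for flight in flights:
        state = totals[flight["DestState"]]
        state["flights"] += 1
        if flight["ArrDelay"] > 0:  # assumption: positive delay means delayed
            state["delayed"] += 1
            state["delay_minutes"] += flight["ArrDelay"]
    return dict(totals)


def expected_delay(state_totals):
    """Average delay in minutes among delayed flights only."""
    if state_totals["delayed"] == 0:
        return 0.0
    return state_totals["delay_minutes"] / state_totals["delayed"]


flights = [
    {"DestState": "NC", "ArrDelay": 10},
    {"DestState": "NC", "ArrDelay": 30},
    {"DestState": "NC", "ArrDelay": -5},
    {"DestState": "OH", "ArrDelay": 0},
]
results = aggregate(flights)
```

The three counters per state answer the first three questions directly, and `expected_delay` derives the fourth from them.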
Demo Time
Kafka Performance
Basic tips:
Throughput Versus Latency
Maximize throughput when you want to push as many
messages as possible. This is better for bulk loading
operations.
Throughput Versus Latency
Consumer config settings:
• fetch.wait.max.ms
• fetch.min.bytes
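These two settings trade off against each other: fetch.min.bytes is how much data the broker tries to accumulate before answering a fetch, and fetch.wait.max.ms caps how long it may wait to do so. The sketch below shows how the two might be tuned in opposite directions. The numeric values, server address, and group name are illustrative assumptions, not recommendations from the talk.

```python
def consumer_config(profile):
    """Build a consumer config dict tuned for 'throughput' or 'latency'."""
    config = {
        "bootstrap.servers": "localhost:9092",  # hypothetical address
        "group.id": "example-group",            # hypothetical group name
    }
    if profile == "throughput":
        # Wait longer so each fetch returns a larger batch.
        config.update({"fetch.wait.max.ms": 500, "fetch.min.bytes": 65536})
    elif profile == "latency":
        # Respond as soon as any data is available.
        config.update({"fetch.wait.max.ms": 10, "fetch.min.bytes": 1})
    else:
        raise ValueError(f"unknown profile: {profile}")
    return config


bulk = consumer_config("throughput")
interactive = consumer_config("latency")
```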
More, More, More
Kafka is a horizontally distributed system, so when in doubt,
add more:
• More brokers will help accept messages from producers
faster, especially if current brokers are experiencing high
CPU or I/O.
• More consumers in a group will process messages more
quickly.
• You must have at least as many partitions as consumers in
a group! Otherwise, consumers may sit idle.
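The arithmetic behind the partition rule is simple: within a group, a partition is read by at most one consumer, so any consumers beyond the partition count have nothing to read. A small sketch, with a hypothetical helper name:

```python
def idle_consumers(partition_count, consumer_count):
    """Consumers in one group that receive no partition at all."""
    return max(0, consumer_count - partition_count)


over_provisioned = idle_consumers(partition_count=4, consumer_count=6)
well_sized = idle_consumers(partition_count=4, consumer_count=3)
```

With 4 partitions, a sixth and fifth consumer add no processing capacity until more partitions are created.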
Wrapping Up
Apache Kafka is a powerful message broker. There is a small
learning curve associated with Kafka, but this is a technology
well worth learning.