
Using Kafka for Real-Time Data Ingestion with .NET

Kevin Feasel
Engineering Manager, Predictive Analytics
ChannelAdvisor

Who Am I? What Am I Doing Here?
Catallaxy Services

Curated SQL

We Speak Linux
@feaselkl
Apache Kafka
Apache Kafka is a message broker on the Hadoop stack. It receives messages from producers and sends messages to consumers.
Everything in Kafka is distributed.

Why Use A Broker?
Suppose we have two applications which want to communicate. We connect them directly.

Works great at low scale: it's easy to understand, easy to work with, and has fewer working parts to break. But it hits scale limitations.
Why Use A Broker?
We then expand out.

It is easy to expand this way as long as you don't overwhelm the DB; eventually you will.

Why Use A Broker?
We then expand out. Again.
It takes some effort here: we need to manage connection strings and write to the correct DB.

But it's doable and expands indefinitely.

Why Use A Broker?
But what happens when a consumer (database) goes down for too long?
• Producer drops messages
• Producer holds messages (until it runs out of disk)
• Producer returns error

There’s a better way.


Why Use A Broker?
Brokers take messages from producers and feed messages to consumers.

Brokers deal with the jumble of connections, let us be resilient to producer and consumer failures, and help with scale-out.

Motivation
Today's talk will focus on using Kafka to ingest, enrich, and consume data. We will build .NET applications in Windows to talk to a Kafka cluster on Linux.

Our data source is flight data. I'd like to ask a few questions, with answers split out by destination state:
1. How many flights did we have in 2008?
2. How many flights' arrivals were delayed?
3. How many minutes of arrival delay did we have?
4. Given a flight with a delay, how long can we expect it to be?
Kafka Concepts
Most message brokers act as queues.

Kafka Concepts
Kafka is a log, not a queue.

Multiple consumers may read the same message, and a consumer may re-read messages.

Think microservices and replaying data.
Kafka Concepts
Brokers foster communication between producers and consumers. They store the produced messages and keep track of what consumers have read.

Kafka Concepts
Topics are categories or feeds to which messages get published. Topics are broken up into partitions. Partitions are ordered, immutable sequences of records.

Kafka Concepts
Producers push messages to Kafka.

Kafka Concepts
Consumers read messages from topics.

Kafka Concepts
Consumers enlist in consumer groups. Consumer groups act as "logical subscribers," and Kafka distributes load to the consumers in a group.

Kafka Concepts
Records in partitions are immutable. You do not modify the data, but you can add new rows.

Kafka Concepts
• Consumers should know where they left off. Kafka assists by storing consumer group-specific last-read pointer values per topic and partition.
• Kafka retains messages for a certain (configurable) amount of time, after which point they drop off.
• Kafka can also garbage collect messages if you reach a certain (configurable) amount of disk space.

The Competition
• MSMQ and Service Broker: queues in Microsoftland
• Amazon Kinesis and Azure Event Hub: Kafka as a Service
• RabbitMQ: complex routing & guaranteed reliability
• Celery: distributed queue built for Python
• ZeroMQ: socket-based distributed queueing
• Queues.io lists dozens of queues and brokers

Building A Producer
Our first application reads data from a CSV and pushes messages onto a topic.

This application will not try to understand the messages; it simply takes data and pushes it to a topic.

Building A Producer
I chose Confluent's Kafka .NET library (née RDKafka-dotnet).

There are several libraries available, each with their own benefits and drawbacks. This library serves up messages in an event-based model and has official support from Confluent, so use this one. A minimal producer sketch follows.

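To make the shape of the producer concrete, here is a minimal sketch in C# against the current Confluent.Kafka package. The broker address, topic name, and file name are assumptions, and the demo code (built on the older event-based API) will look somewhat different.

using System;
using System.IO;
using System.Threading.Tasks;
using Confluent.Kafka;

class Producer
{
    static async Task Main()
    {
        // Assumed broker address; point this at your own cluster.
        var config = new ProducerConfig { BootstrapServers = "kafka:9092" };

        using var producer = new ProducerBuilder<Null, string>(config).Build();

        // Push each CSV line as-is; the producer does not try to understand
        // the data, it just forwards it to the topic.
        foreach (var line in File.ReadLines("2008.csv"))
        {
            await producer.ProduceAsync("flights-raw",
                new Message<Null, string> { Value = line });
        }

        // Wait for any locally queued messages to reach the broker.
        producer.Flush(TimeSpan.FromSeconds(10));
    }
}

Awaiting every ProduceAsync call keeps the sketch simple; for bulk loads you would normally use the fire-and-forget Produce overload and let the client batch messages.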
Demo Time

Building An Enricher
Our second application reads data from one topic and pushes messages onto a different topic.

This application provides structure to our data and will be the largest application.

Building An Enricher
Enrichment opportunities (a minimal sketch follows this list):

1. Convert "NA" values to appropriate values: either a default value or None (not NULL!).
2. Perform lookups against airports given an airport code.
3. Convert the input CSV record into a structured type (similar to a class).
4. Output results as JSON for later consumers.

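As a rough illustration of the enricher's shape (not the demo code itself): the topic names, column positions, and the Flight type below are assumptions, and the airport lookup is reduced to reading a column rather than a real lookup against an airports table.

using System.Text.Json;
using Confluent.Kafka;

class Enricher
{
    // Structured type for the enriched record; field names are illustrative.
    record Flight(int Year, string DestState, int? ArrDelayMinutes);

    static void Main()
    {
        var consumerConfig = new ConsumerConfig
        {
            BootstrapServers = "kafka:9092",
            GroupId = "flight-enricher",
            AutoOffsetReset = AutoOffsetReset.Earliest
        };
        var producerConfig = new ProducerConfig { BootstrapServers = "kafka:9092" };

        using var consumer = new ConsumerBuilder<Ignore, string>(consumerConfig).Build();
        using var producer = new ProducerBuilder<Null, string>(producerConfig).Build();
        consumer.Subscribe("flights-raw");

        while (true)
        {
            var raw = consumer.Consume();
            var fields = raw.Message.Value.Split(',');

            // "NA" becomes a missing value rather than a magic string.
            int? delay = fields[14] == "NA" ? (int?)null : int.Parse(fields[14]);

            // Column positions are placeholders for the flight data layout.
            var flight = new Flight(int.Parse(fields[0]), fields[17], delay);

            // Emit JSON so later consumers get a structured record.
            producer.Produce("flights-enriched",
                new Message<Null, string> { Value = JsonSerializer.Serialize(flight) });
        }
    }
}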
Demo Time

Building A Consumer
Our third application reads data from the enriched topic, aggregates, and periodically writes results to SQL Server.

We’ve already seen consumer code, so this is easy.

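A condensed sketch of that consumer is below. It assumes the Microsoft.Data.SqlClient package, a hypothetical dbo.FlightCounts table, and the same illustrative Flight type as the enricher sketch; it only counts flights per destination state, where the demo computes the full set of aggregates.

using System;
using System.Collections.Generic;
using System.Text.Json;
using Confluent.Kafka;
using Microsoft.Data.SqlClient;

class Aggregator
{
    record Flight(int Year, string DestState, int? ArrDelayMinutes);

    static void Main()
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = "kafka:9092",
            GroupId = "flight-aggregator",
            AutoOffsetReset = AutoOffsetReset.Earliest
        };

        var flightsByState = new Dictionary<string, int>();
        using var consumer = new ConsumerBuilder<Ignore, string>(config).Build();
        consumer.Subscribe("flights-enriched");

        var lastWrite = DateTime.UtcNow;
        while (true)
        {
            var flight = JsonSerializer.Deserialize<Flight>(consumer.Consume().Message.Value);
            flightsByState[flight.DestState] = flightsByState.GetValueOrDefault(flight.DestState) + 1;

            // Periodically push the running aggregates to SQL Server.
            if (DateTime.UtcNow - lastWrite > TimeSpan.FromSeconds(30))
            {
                WriteToSqlServer(flightsByState);
                lastWrite = DateTime.UtcNow;
            }
        }
    }

    static void WriteToSqlServer(Dictionary<string, int> counts)
    {
        // Connection string and table are placeholders.
        using var conn = new SqlConnection("Server=.;Database=Flights;Integrated Security=true");
        conn.Open();
        foreach (var (state, count) in counts)
        {
            using var cmd = new SqlCommand(
                @"MERGE dbo.FlightCounts AS t
                  USING (SELECT @state AS DestState, @count AS FlightCount) AS s
                  ON t.DestState = s.DestState
                  WHEN MATCHED THEN UPDATE SET FlightCount = s.FlightCount
                  WHEN NOT MATCHED THEN INSERT (DestState, FlightCount)
                       VALUES (s.DestState, s.FlightCount);", conn);
            cmd.Parameters.AddWithValue("@state", state);
            cmd.Parameters.AddWithValue("@count", count);
            cmd.ExecuteNonQuery();
        }
    }
}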
Demo Time

Kafka Performance
Basic tips:

• Maximize your network bandwidth! Your fibre channel will push a lot more messages than my travel router.
• Compress your data. Compression works best with high-throughput scenarios, so test first.
• Minimize message size. This reduces network cost.
• Buffer messages in your code using tools like System.Collections.Concurrent.BlockingCollection (see the sketch after this list).
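For that last tip, here is one hedged sketch of buffering with a BlockingCollection so that file reading and Kafka production run on separate threads; the bounded capacity of 10,000 is an arbitrary illustration, not a recommendation.

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;
using Confluent.Kafka;

class BufferedProducer
{
    static void Main()
    {
        // Bounded buffer so a slow broker applies back-pressure to the reader.
        var buffer = new BlockingCollection<string>(boundedCapacity: 10_000);

        // Reader task: fill the buffer from the CSV file.
        var reader = Task.Run(() =>
        {
            foreach (var line in File.ReadLines("2008.csv"))
                buffer.Add(line);
            buffer.CompleteAdding();
        });

        var config = new ProducerConfig { BootstrapServers = "kafka:9092" };
        using var producer = new ProducerBuilder<Null, string>(config).Build();

        // Producer loop: drain the buffer as quickly as the broker allows.
        foreach (var line in buffer.GetConsumingEnumerable())
            producer.Produce("flights-raw", new Message<Null, string> { Value = line });

        reader.Wait();
        producer.Flush(TimeSpan.FromSeconds(10));
    }
}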
Throughput Versus Latency
Minimize latency when you want the most responsive consumers but don't need to maximize the number of messages flowing.

Throughput Versus Latency
Maximize throughput when you want to push as many messages as possible. This is better for bulk loading operations.

Throughput Versus Latency
Consumer config settings:
• fetch.wait.max.ms
• fetch.min.bytes

Producer config settings:
• batch.num.messages
• queue.buffering.max.ms

A sketch of setting these from .NET follows.

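As one illustration, these librdkafka settings can be passed straight through the Confluent.Kafka configuration objects. The values below lean the consumer toward low latency and the producer toward throughput; they are examples, not tuned recommendations.

using System.Collections.Generic;
using Confluent.Kafka;

// Latency-leaning consumer: return fetches quickly even when little data is ready.
var consumerConfig = new ConsumerConfig(new Dictionary<string, string>
{
    ["bootstrap.servers"] = "kafka:9092",
    ["group.id"] = "flight-aggregator",
    ["fetch.wait.max.ms"] = "10",
    ["fetch.min.bytes"] = "1"
});

// Throughput-leaning producer: batch more messages and wait longer before sending.
var producerConfig = new ProducerConfig(new Dictionary<string, string>
{
    ["bootstrap.servers"] = "kafka:9092",
    ["batch.num.messages"] = "10000",
    ["queue.buffering.max.ms"] = "100"
});

// These plug into ConsumerBuilder and ProducerBuilder exactly as in the earlier sketches.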
More, More, More
Kafka is a horizontally distributed system, so when in doubt, add more:
• More brokers will help accept messages from producers faster, especially if current brokers are experiencing high CPU or I/O.
• More consumers in a group will process messages more quickly.
• You must have at least as many partitions as consumers in a group! Otherwise, consumers may sit idle (see the sketch below).
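Partition counts are set when a topic is created (or altered later). A hedged sketch using Confluent.Kafka's admin client is below; the topic name, partition count, and replication factor are assumptions sized to the largest consumer group you expect.

using System.Threading.Tasks;
using Confluent.Kafka;
using Confluent.Kafka.Admin;

class CreateTopic
{
    static async Task Main()
    {
        var config = new AdminClientConfig { BootstrapServers = "kafka:9092" };
        using var admin = new AdminClientBuilder(config).Build();

        // Plan for at least as many partitions as the largest consumer group,
        // since each partition is read by only one consumer within a group.
        await admin.CreateTopicsAsync(new[]
        {
            new TopicSpecification
            {
                Name = "flights-enriched",
                NumPartitions = 12,
                ReplicationFactor = 3
            }
        });
    }
}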
Wrapping Up
Apache Kafka is a powerful message broker. There is a small learning curve associated with Kafka, but this is a technology well worth learning.

To learn more, go here: https://CSmore.info/on/kafka

And for help, contact me: [email protected] | @feaselkl

