Questions for our application:
- What form is the storage in currently?
- Filezilla - can download from the hard drive
- How are we currently getting the data from the IoT sensors/cameras, where is it going, etc.?
  - We will need to know this to figure out how to write our Producer class for Kafka ingestion.
  - We need the image files available in the Python (or whatever language) Producer file. Once we figure this out, it should be ~relatively~ simple to get into the Kafka Producer/Consumer system.
- How do we send image data through a Producer?
  - Convert the image file to a binary/bytes array.
  - Use a ByteArraySerializer to send that array as the message value.
    - In Java: props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
  - This turns image data into a form of "structured data" that Kafka can handle.
  - Kafka is not ideal for this task, but many sources say it can be done and it's sometimes the most practical method.
  - Tiered storage can help, to avoid clogging up the Kafka brokers.
    - Tiered storage is currently (only?) available via Confluent Cloud.
  - Python example (bottom of the thread): https://github.com/dpkp/kafka-python/issues/1045 (a minimal producer sketch follows this list)
  - The message value size will be the byte size of the image. If the reconstructed image has the same dimensions as the original, our PCA compression won't necessarily matter, because it doesn't change the byte size of the image in Python.
    - The bytes in Python are the same for the original and reconstructed images if the dimensions are the same.
  - We can't directly send the bytes of the PCA compression itself, because we would need the decoding models available on the other side. That isn't possible with our PCA method, since each image trains its own PCA - it defeats the purpose of data compression if you also need to send the models over as data to store.
  - Potential solution: if we went with a deep learning method like an autoencoder to produce a bytes array of a compressed/encoded image, we might be able to send just those bytes, because only a single decoder would be needed on the other side of the Kafka process for decoding.
  - File size reduction of the image should still help speed up the HTTP sending part (from the IoT sensors).
  - This might not be a big problem if we're okay with less-than-ideal performance when sending these image messages through Kafka.
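Below is a minimal producer sketch for the approach described above, assuming the kafka-python package and a broker at localhost:9092; the topic name, file path, and key are placeholders, not from our actual setup:

```python
# Minimal sketch: send an image file as raw bytes through Kafka.
# Assumes kafka-python and a broker at localhost:9092; the topic, file path,
# and key below are placeholders.
from kafka import KafkaProducer

# The message value is already a bytes object, so no extra serializer is
# needed here (the Python analogue of Java's ByteArraySerializer).
producer = KafkaProducer(bootstrap_servers="localhost:9092")

with open("sensor_frame.jpg", "rb") as f:
    image_bytes = f.read()

# Keying by a device ID keeps all frames from one camera in order
# in the same partition.
producer.send("iot-images", key=b"camera-01", value=image_bytes)
producer.flush()
producer.close()
```

Note that kafka-python's default max_request_size (and the broker's default message size limit) is roughly 1 MB, so larger images would need those limits raised or the images compressed first.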
Kafka Cluster
- A Kafka cluster is what we want to use to read/write data
- The cluster could be something we manage ourselves or somewhere on the cloud
- Made up of brokers (machines, servers, etc.)
- Each broker has its local storage and a retention time to keep the events around
- If the cluster is on the cloud, we don't think too closely about the brokers

Producer
- An application we write to get things into a Kafka cluster
- Producer programs write data into the cluster
- The event source could be our IoT sensors

Consumer
- An application we write to get things from a Kafka cluster
- How we read data from the cluster
- Might not think about it directly
- Producers and consumers are decoupled; they don't affect each other

Zookeeper Ensemble
- Manages consensus for all brokers; a distributed consensus manager
- KIP 500 => attempting to remove Zookeeper, but not done yet
- Stores access control lists and secrets
- Failure detection, recovery, and management for brokers

Topics
- Streams of related messages in Kafka
- A logical representation, a log of events/messages (roughly chronological)
- Categorizes messages into groups
- Developers define topics
- Producer <-> Topic: N-to-N relationship
- Unlimited number of topics

Partition
- Topics are broken up into pieces, or partitions
- Each partition is allocated to a separate broker
- Technically a partition is a log, strictly ordered
- A topic might not be as strictly ordered if it is broken up into multiple partitions
- Segments (which make up partitions) are not as relevant here
- Partitions are important for how we model data

Log
- 0 (first entry) … n+1 (next entry to write)
- Prior entries are immutable records
- Retention periods can be set in Kafka to expire prior "log entries"
- Known as a stream in Kafka
- Consuming doesn't destroy the message! One message can be read by multiple consumers

Message
- Key, value - business-relevant data
- Headers - optional metadata
- Timestamp - creation or ingestion time

Broker Basics
- Producers send messages to brokers
- Brokers receive and store messages
- Brokers manage partitions
- Messages of a topic are spread across partitions
- Each broker handles many partitions; each partition is stored on the broker's disk
- Partition: 1…n log files
- Each message in the log is identified by its offset
- Configurable retention policy
- A Kafka cluster can have many brokers
- Broker replication - a failsafe if one broker fails, keeps the partitions safe
- Each partition has replicas (usually 3); one is the leader, the rest are followers

Producer Basics
- Producers write data as messages
- Can be written in any language - Java is the default, but C/C++, Python, etc. are also native
- There is a command-line producer tool

Load Balancing and Semantic Partitioning
- Producers use a partitioning strategy to assign each message to a partition
- The partitioning strategy is specified by the Producer
- No key: default is round-robin
- With a key: hash(key) % number_of_partitions (ordered by key)
- Messages with the same key always land in order in the same partition
- A custom partitioner is possible

Consumer Basics
- Consumers pull messages from 1..n topics
- New inflowing messages are automatically retrieved
- Consumer offset - keeps track of the last message read and helps consumers remember where they are; it is stored in a special topic
- CLI tools exist to read from the cluster; distributed consumption
- Consumers live in Consumer Groups - instances of a consumer to scale (a minimal consumer sketch follows these notes)

Scalable Data Pipeline: Producers > Brokers > Consumers
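A minimal consumer counterpart to the producer sketch above, again assuming kafka-python, a broker at localhost:9092, and the placeholder iot-images topic; the consumer group name is hypothetical:

```python
# Minimal sketch: read the image-bytes messages in a consumer group.
# Assumes kafka-python, a broker at localhost:9092, and the placeholder
# "iot-images" topic from the producer sketch above.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "iot-images",
    bootstrap_servers="localhost:9092",
    group_id="image-archivers",    # hypothetical consumer group name
    auto_offset_reset="earliest",  # start from the oldest retained message
    enable_auto_commit=True,       # committed offsets live in a special topic
)

for record in consumer:
    # record.value is the raw bytes the producer sent; consuming does not
    # destroy the message, so other groups can read it independently.
    print(record.topic, record.partition, record.offset,
          record.key, len(record.value), "bytes")
```

Running a second copy of this script with the same group_id would split the topic's partitions between the two instances, which is the consumer-group load balancing described above.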
4. How Kafka Works | Apache Kafka® Fundamentals
Basic Producer in Java
- Can be very basic, just a public static void main method
- Configuration - a Properties object, putting in property settings - where the cluster is, key/value serializer types
- Create Producer - the KafkaProducer class
- Shutdown Behavior - close the producer when shutting down
- Sending Data - key and value are written - ProducerRecord type; key/value/topic are passed in - producer.send(record)

Basic Consumer in .NET/C#
- Configuration - tell the client where the cluster is, etc. - the deserializer is not in the configuration part; we will get to it later
- Message Handling - consumer.OnMessage - basic: just print the message to stdout
- Error Handling - consumer.OnError - consumer.OnConsumeError
- Subscribing to a Topic - consumer.Subscribe("TopicName") - could subscribe to multiple topics
- Polling Data - while (true), consumer.Poll()

Partition Leadership & Replication
- Example: a cluster with 4 brokers
- No required relationship between the number of brokers and the number of partitions
- Leader partitions receive the writes and reads (producers and consumers)
- Reads can sometimes go to followers, not just leaders…but mainly the leader
- Followers replicate from the leader
- When a broker dies, a new leader is automatically selected, and writing and reading can continue with no loss of data

Data Retention Policy
- Events are immutable; they stay there
- How long do I want to (or can I) store my data in the broker?
- By default it's set to 1 week
- A business decision - could be very short or very long
- Old segments expire when the newest message in the segment is older than the retention period

Producer Design
- Topic - [Partition] - [Key] - Value
- Serializer -> Partitioner -> sends the record to a Partition -> Broker
- The producer waits for an ACK after sending (0 none, 1 leader, -1 all) (an acks sketch follows these notes)
  - 0 = no ack: low latency, some data loss possible (~at most once)
  - 1 = leader ack: the leader has the message, but it can still be lost if the leader fails before followers copy it
  - -1 = all ack: no data loss, safe, more latency (~at least once)
- Exactly Once Semantics (processing, not delivery)
  - Strong transactional guarantees for Kafka
  - Prevents clients from processing duplicate messages
  - Handles failures gracefully

Consumer Groups
- One consumer group with two instances of the same Consumer (just written twice)
- Automatic assignment of partitions to consumers within groups (for load balancing)
- Doesn't migrate state between instances - the Kafka Streams API does that, though

Compacted Topics
- 17 values (over time) in the log but only 4 unique keys
- Keys could be unique IDs of devices in IoT
- The compacted log contains just the unique keys and the latest values/current state (a topic-creation sketch also follows these notes)

Troubleshooting
- Confluent Control Center
- Log files
- Special config settings - SSL logging, authorizer debugging

Security
- Kafka supports encryption in transit
- Kafka supports authentication and authorization
- No encryption at rest out of the box
- Clients can be mixed with & without encryption & authentication
- Or it can be done at the client level (written in code in the producer and consumer)
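A small sketch of the acks trade-off from the Producer Design notes, again using kafka-python with placeholder broker and topic names:

```python
# Sketch of the acks setting described above. acks controls how many
# acknowledgements the producer waits for after sending a record:
#   acks=0     -> fire and forget: lowest latency, data loss possible (~at most once)
#   acks=1     -> the partition leader acknowledges after writing the record
#   acks='all' -> leader waits for all in-sync replicas: safest, highest latency (~at least once)
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # placeholder broker address
    acks="all",
    retries=3,  # retries can reintroduce duplicates unless idempotence is enabled
)

future = producer.send("iot-images", value=b"example payload")
metadata = future.get(timeout=10)  # blocks until the broker acknowledges
print(metadata.topic, metadata.partition, metadata.offset)
producer.flush()
```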
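And a sketch of creating a compacted topic like the one in the Compacted Topics notes, so the log keeps only the latest value per key (for example, per IoT device ID); the topic name and sizing are placeholders, using kafka-python's admin client:

```python
# Sketch: create a compacted topic so the log retains only the latest value
# for each key (e.g. each device ID). Topic name and sizing are placeholders.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(
        name="device-latest-state",   # hypothetical topic name
        num_partitions=3,
        replication_factor=1,         # 3 is more typical in production
        topic_configs={"cleanup.policy": "compact"},
    )
])
admin.close()
```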
5. Integrating Kafka into Your Environment | Apache Kafka® Fundamentals
Apache Kafka Connect
- Connect API - a data integration framework for streaming data between Kafka and other systems
- Open source, simple, scalable, reliable
- The library of connectors includes JMS, JDBC, SQL, Hadoop, Google Cloud, etc.
- No need to write the connectors ourselves; that's not where we add value
- A distributed system with n workers; each worker has different connectors with the tasks split up (only if the connectors can be partitioned and parallelized)
- Scalable and fault tolerant

Confluent REST Proxy
- Talk to non-native Kafka apps, and outside the firewall
- A REST wrapper around the producer and consumer (a REST Proxy sketch follows these notes)
- Exposes some admin capability

Confluent Schema Registry
- Version producers and consumers separately
- For when you want compatibility with old data
- Defines the schema for messages in a topic (expected fields)
- Automatically handles schema changes (e.g., new fields)
- Prevents backwards-incompatible changes
- Supports multi-datacenter environments
- An outside process: the producer and consumer talk to the Schema Registry for the schema definition to check compatibility (runtime assertions)
- Schema Evolution = adding, removing, or merging fields

Confluent ksqlDB
- Streaming SQL engine for Apache Kafka
- Easy to process streams with SQL through ksqlDB without writing your own code in a programming language
- Connect via Control Center, CLI, or UI
- Example use cases: streaming ETL, anomaly detection, event monitoring

Apache Kafka Streams
- Transform data with real-time applications
- Write standard Java applications
- No separate processing cluster required
- Exactly-once processing semantics
- Elastic, highly scalable, fault-tolerant
- Fully integrated with Kafka security
- Microservices, continuous queries, continuous transformations
- Kafka Streams app: it lives with the application code; application instances contact the Kafka cluster
- Simple = KSQL, flexible = producer/consumer; the Kafka Streams API sits in the middle

6. The Confluent Platform | Apache Kafka® Fundamentals
- Event streaming and event-driven applications lead to other needs arising =>
  - Replication between on-prem and cloud, or between regions in the cloud
  - Automatic balancing of data within a cluster
  - Enterprise-grade security
  - Management and monitoring
- All of this is given by Confluent Platform - helps for mature (production) uses of Kafka
- Confluent Platform has Apache Kafka as its base (Connect, continuous commit log, Streams) > development & connectivity > data compatibility > enterprise operations > management & monitoring
- A complete set of development, operations, and management capabilities to run Kafka at scale
- Deployment models: Confluent fully managed = Confluent Cloud; customer self-managed = datacenter or public cloud
- Confluent Platform - the enterprise distribution of Apache Kafka
- Confluent Cloud - Apache Kafka re-engineered for the cloud
  - Available on the public clouds AWS, GCP, and Azure
  - A cloud-native, fully managed service
- Confluent Control Center - management and monitoring for the Kafka cluster and all elements in the platform - web UI
- Confluent CLI - supplements Control Center
  - RBAC (role-based access control) needs the CLI - a new set of security features, often better for business applications
- We could do Kafka and everything else on our own without Confluent, but if it's not too expensive for a small use case, then why not?
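A sketch of producing through the Confluent REST Proxy instead of a native client (e.g. from a device that can't run a Kafka library), assuming the proxy's v2 binary embedded format on its default port 8082; the topic and file names are the same placeholders as before:

```python
# Sketch: send an image through Confluent REST Proxy over HTTP.
# Assumes REST Proxy on localhost:8082 and its v2 binary embedded format;
# binary values must be base64-encoded in the request body.
import base64
import requests

with open("sensor_frame.jpg", "rb") as f:
    payload = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    "http://localhost:8082/topics/iot-images",
    headers={"Content-Type": "application/vnd.kafka.binary.v2+json"},
    json={"records": [{"value": payload}]},
)
resp.raise_for_status()
print(resp.json())  # partition/offset info for the produced record
```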
7. Conclusion | Apache Kafka® Fundamentals
- Overview: topics, partitions, brokers, producers/consumers, setting up a minimal Kafka cluster
- Went over some other courses and certifications.