Questions for our application:
- What form is the storage in currently?
- Filezilla - can download from the hard drive
- How are we currently getting the data from the IoT sensors/cameras, where is it going, etc.?
  - We will need to know this to figure out how to write our Producer class for Kafka ingestion.
  - We need the image files available in the Python (or whatever language) Producer file. Once we figure this out, it should be ~relatively~ simple to get into the Kafka Producer/Consumer system.
- How do we send image data through a Producer?
  - Convert the image file to a binary/bytes array.
  - Use a ByteArraySerializer to send that array as the message value.
    - In Java: props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
  - This turns image data into a form of "structured data" that Kafka can handle.
  - Kafka is not ideal for this task, but many sources say it can be done and it's sometimes the most practical method.
  - Tiered storage can help, to avoid clogging up the Kafka brokers.
    - Tiered storage is currently (only?) available via Confluent Cloud.
  - Python example (bottom of the thread): https://github.com/dpkp/kafka-python/issues/1045 (a minimal producer sketch follows this list)
  - The message value size will be the byte size of the image. If the reconstructed image has the same dimensions as the original, our PCA compression won't necessarily matter, because it doesn't change the byte size of the image in Python.
    - The bytes in Python are the same for the original and reconstructed images if the dimensions are the same.
  - We can't directly send the bytes of the PCA compression itself, because we would need the decoding models available on the other side. That isn't possible with our PCA method, since each image trains its own PCA - it defeats the purpose of data compression if you also need to send the models over as data to store.
  - Potential solution: if we went with a deep learning method like an autoencoder to produce a bytes array of a compressed/encoded image, we might be able to send just those bytes, because only a single decoder would be needed on the other side of the Kafka process for decoding.
  - File size reduction of the image should still help speed up the HTTP sending part (from the IoT sensors).
  - This might not be a big problem if we're okay with less-than-ideal performance when sending these image messages through Kafka.
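Below is a minimal producer sketch for the approach described above, assuming the kafka-python package and a broker at localhost:9092; the topic name, file path, and key are placeholders, not from our actual setup:

```python
# Minimal sketch: send an image file as raw bytes through Kafka.
# Assumes kafka-python and a broker at localhost:9092; the topic, file path,
# and key below are placeholders.
from kafka import KafkaProducer

# The message value is already a bytes object, so no extra serializer is
# needed here (the Python analogue of Java's ByteArraySerializer).
producer = KafkaProducer(bootstrap_servers="localhost:9092")

with open("sensor_frame.jpg", "rb") as f:
    image_bytes = f.read()

# Keying by a device ID keeps all frames from one camera in order
# in the same partition.
producer.send("iot-images", key=b"camera-01", value=image_bytes)
producer.flush()
producer.close()
```

Note that kafka-python's default max_request_size (and the broker's default message size limit) is roughly 1 MB, so larger images would need those limits raised or the images compressed first.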
Kafka Cluster
- A Kafka cluster is what we want to use to read/write data
- The cluster could be something we manage ourselves or somewhere on the cloud
- Made up of brokers (machines, servers, etc.)
- Each broker has its local storage and a retention time to keep the events around
- If the cluster is on the cloud, we don't think too closely about the brokers

Producer
- An application we write to get things into a Kafka cluster
- Producer programs write data into the cluster
- The event source could be our IoT sensors

Consumer
- An application we write to get things from a Kafka cluster
- How we read data from the cluster
- Might not think about it directly
- Producers and consumers are decoupled; they don't affect each other

Zookeeper Ensemble
- Manages consensus for all brokers; a distributed consensus manager
- KIP 500 => attempting to remove Zookeeper, but not done yet
- Stores access control lists and secrets
- Failure detection, recovery, and management for brokers

Topics
- Streams of related messages in Kafka
- A logical representation, a log of events/messages (roughly chronological)
- Categorizes messages into groups
- Developers define topics
- Producer <-> Topic: N-to-N relationship
- Unlimited number of topics

Partition
- Topics are broken up into pieces, or partitions
- Each partition is allocated to a separate broker
- Technically a partition is a log, strictly ordered
- A topic might not be as strictly ordered if it is broken up into multiple partitions
- Segments (which make up partitions) are not as relevant here
- Partitions are important for how we model data

Log
- 0 (first entry) … n+1 (next entry to write)
- Prior entries are immutable records
- Retention periods can be set in Kafka to expire prior "log entries"
- Known as a stream in Kafka
- Consuming doesn't destroy the message! One message can be read by multiple consumers

Message
- Key, value - business-relevant data
- Headers - optional metadata
- Timestamp - creation or ingestion time

Broker Basics
- Producers send messages to brokers
- Brokers receive and store messages
- Brokers manage partitions
- Messages of a topic are spread across partitions
- Each broker handles many partitions; each partition is stored on the broker's disk
- Partition: 1…n log files
- Each message in the log is identified by its offset
- Configurable retention policy
- A Kafka cluster can have many brokers
- Broker replication - a failsafe if one broker fails, keeps the partitions safe
- Each partition has replicas (usually 3); one is the leader, the rest are followers

Producer Basics
- Producers write data as messages
- Can be written in any language - Java is the default, but C/C++, Python, etc. are also native
- There is a command-line producer tool

Load Balancing and Semantic Partitioning
- Producers use a partitioning strategy to assign each message to a partition
- The partitioning strategy is specified by the Producer
- No key: default is round-robin
- With a key: hash(key) % number_of_partitions (ordered by key)
- Messages with the same key always land in order in the same partition
- A custom partitioner is possible

Consumer Basics
- Consumers pull messages from 1..n topics
- New inflowing messages are automatically retrieved
- Consumer offset - keeps track of the last message read and helps consumers remember where they are; it is stored in a special topic
- CLI tools exist to read from the cluster; distributed consumption
- Consumers live in Consumer Groups - instances of a consumer to scale (a minimal consumer sketch follows these notes)

Scalable Data Pipeline: Producers > Brokers > Consumers
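A minimal consumer counterpart to the producer sketch above, again assuming kafka-python, a broker at localhost:9092, and the placeholder iot-images topic; the consumer group name is hypothetical:

```python
# Minimal sketch: read the image-bytes messages in a consumer group.
# Assumes kafka-python, a broker at localhost:9092, and the placeholder
# "iot-images" topic from the producer sketch above.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "iot-images",
    bootstrap_servers="localhost:9092",
    group_id="image-archivers",    # hypothetical consumer group name
    auto_offset_reset="earliest",  # start from the oldest retained message
    enable_auto_commit=True,       # committed offsets live in a special topic
)

for record in consumer:
    # record.value is the raw bytes the producer sent; consuming does not
    # destroy the message, so other groups can read it independently.
    print(record.topic, record.partition, record.offset,
          record.key, len(record.value), "bytes")
```

Running a second copy of this script with the same group_id would split the topic's partitions between the two instances, which is the consumer-group load balancing described above.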
4. How Kafka Works | Apache Kafka® Fundamentals
Basic Producer in Java
- Can be very basic, just a public static void main method
- Configuration - a Properties object, putting in property settings - where the cluster is, key/value serializer types
- Create Producer - the KafkaProducer class
- Shutdown Behavior - close the producer when shutting down
- Sending Data - key and value are written - ProducerRecord type; key/value/topic are passed in - producer.send(record)

Basic Consumer in .NET/C#
- Configuration - tell the client where the cluster is, etc. - the deserializer is not in the configuration part; we will get to it later
- Message Handling - consumer.OnMessage - basic: just print the message to stdout
- Error Handling - consumer.OnError - consumer.OnConsumeError
- Subscribing to a Topic - consumer.Subscribe("TopicName") - could subscribe to multiple topics
- Polling Data - while (true), consumer.Poll()

Partition Leadership & Replication
- Example: a cluster with 4 brokers
- No required relationship between the number of brokers and the number of partitions
- Leader partitions receive the writes and reads (producers and consumers)
- Reads can sometimes go to followers, not just leaders…but mainly the leader
- Followers replicate from the leader
- When a broker dies, a new leader is automatically selected, and writing and reading can continue with no loss of data

Data Retention Policy
- Events are immutable; they stay there
- How long do I want to (or can I) store my data in the broker?
- By default it's set to 1 week
- A business decision - could be very short or very long
- Old segments expire when the newest message in the segment is older than the retention period

Producer Design
- Topic - [Partition] - [Key] - Value
- Serializer -> Partitioner -> sends the record to a Partition -> Broker
- The producer waits for an ACK after sending (0 none, 1 leader, -1 all) (an acks sketch follows these notes)
  - 0 = no ack: low latency, some data loss possible (~at most once)
  - 1 = leader ack: the leader has the message, but it can still be lost if the leader fails before followers copy it
  - -1 = all ack: no data loss, safe, more latency (~at least once)
- Exactly Once Semantics (processing, not delivery)
  - Strong transactional guarantees for Kafka
  - Prevents clients from processing duplicate messages
  - Handles failures gracefully

Consumer Groups
- One consumer group with two instances of the same Consumer (just written twice)
- Automatic assignment of partitions to consumers within groups (for load balancing)
- Doesn't migrate state between instances - the Kafka Streams API does that, though

Compacted Topics
- 17 values (over time) in the log but only 4 unique keys
- Keys could be unique IDs of devices in IoT
- The compacted log contains just the unique keys and the latest values/current state (a topic-creation sketch also follows these notes)

Troubleshooting
- Confluent Control Center
- Log files
- Special config settings - SSL logging, authorizer debugging

Security
- Kafka supports encryption in transit
- Kafka supports authentication and authorization
- No encryption at rest out of the box
- Clients can be mixed with & without encryption & authentication
- Or it can be done at the client level (written in code in the producer and consumer)
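A small sketch of the acks trade-off from the Producer Design notes, again using kafka-python with placeholder broker and topic names:

```python
# Sketch of the acks setting described above. acks controls how many
# acknowledgements the producer waits for after sending a record:
#   acks=0     -> fire and forget: lowest latency, data loss possible (~at most once)
#   acks=1     -> the partition leader acknowledges after writing the record
#   acks='all' -> leader waits for all in-sync replicas: safest, highest latency (~at least once)
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # placeholder broker address
    acks="all",
    retries=3,  # retries can reintroduce duplicates unless idempotence is enabled
)

future = producer.send("iot-images", value=b"example payload")
metadata = future.get(timeout=10)  # blocks until the broker acknowledges
print(metadata.topic, metadata.partition, metadata.offset)
producer.flush()
```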
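And a sketch of creating a compacted topic like the one in the Compacted Topics notes, so the log keeps only the latest value per key (for example, per IoT device ID); the topic name and sizing are placeholders, using kafka-python's admin client:

```python
# Sketch: create a compacted topic so the log retains only the latest value
# for each key (e.g. each device ID). Topic name and sizing are placeholders.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(
        name="device-latest-state",   # hypothetical topic name
        num_partitions=3,
        replication_factor=1,         # 3 is more typical in production
        topic_configs={"cleanup.policy": "compact"},
    )
])
admin.close()
```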
5. Integrating Kafka into Your Environment | Apache Kafka® Fundamentals
Apache Kafka Connect
- Connect API - a data integration framework for streaming data between Kafka and other systems
- Open source, simple, scalable, reliable
- The library of connectors includes JMS, JDBC, SQL, Hadoop, Google Cloud, etc.
- No need to write the connectors ourselves; that's not where we add value
- A distributed system with n workers; each worker has different connectors with the tasks split up (only if the connectors can be partitioned and parallelized)
- Scalable and fault tolerant

Confluent REST Proxy
- Talk to non-native Kafka apps, and outside the firewall
- A REST wrapper around the producer and consumer (a REST Proxy sketch follows these notes)
- Exposes some admin capability

Confluent Schema Registry
- Version producers and consumers separately
- For when you want compatibility with old data
- Defines the schema for messages in a topic (expected fields)
- Automatically handles schema changes (e.g., new fields)
- Prevents backwards-incompatible changes
- Supports multi-datacenter environments
- An outside process: the producer and consumer talk to the Schema Registry for the schema definition to check compatibility (runtime assertions)
- Schema Evolution = adding, removing, or merging fields

Confluent ksqlDB
- Streaming SQL engine for Apache Kafka
- Easy to process streams with SQL through ksqlDB without writing your own code in a programming language
- Connect via Control Center, CLI, or UI
- Example use cases: streaming ETL, anomaly detection, event monitoring

Apache Kafka Streams
- Transform data with real-time applications
- Write standard Java applications
- No separate processing cluster required
- Exactly-once processing semantics
- Elastic, highly scalable, fault-tolerant
- Fully integrated with Kafka security
- Microservices, continuous queries, continuous transformations
- Kafka Streams app: it lives with the application code; application instances contact the Kafka cluster
- Simple = KSQL, flexible = producer/consumer; the Kafka Streams API sits in the middle

6. The Confluent Platform | Apache Kafka® Fundamentals
- Event streaming and event-driven applications lead to other needs arising =>
  - Replication between on-prem and cloud, or between regions in the cloud
  - Automatic balancing of data within a cluster
  - Enterprise-grade security
  - Management and monitoring
- All of this is given by Confluent Platform - helps for mature (production) uses of Kafka
- Confluent Platform has Apache Kafka as its base (Connect, continuous commit log, Streams) > development & connectivity > data compatibility > enterprise operations > management & monitoring
- A complete set of development, operations, and management capabilities to run Kafka at scale
- Deployment models: Confluent fully managed = Confluent Cloud; customer self-managed = datacenter or public cloud
- Confluent Platform - the enterprise distribution of Apache Kafka
- Confluent Cloud - Apache Kafka re-engineered for the cloud
  - Available on the public clouds AWS, GCP, and Azure
  - A cloud-native, fully managed service
- Confluent Control Center - management and monitoring for the Kafka cluster and all elements in the platform - web UI
- Confluent CLI - supplements Control Center
  - RBAC (role-based access control) needs the CLI - a new set of security features, often better for business applications
- We could do Kafka and everything else on our own without Confluent, but if it's not too expensive for a small use case, then why not?
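A sketch of producing through the Confluent REST Proxy instead of a native client (e.g. from a device that can't run a Kafka library), assuming the proxy's v2 binary embedded format on its default port 8082; the topic and file names are the same placeholders as before:

```python
# Sketch: send an image through Confluent REST Proxy over HTTP.
# Assumes REST Proxy on localhost:8082 and its v2 binary embedded format;
# binary values must be base64-encoded in the request body.
import base64
import requests

with open("sensor_frame.jpg", "rb") as f:
    payload = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    "http://localhost:8082/topics/iot-images",
    headers={"Content-Type": "application/vnd.kafka.binary.v2+json"},
    json={"records": [{"value": payload}]},
)
resp.raise_for_status()
print(resp.json())  # partition/offset info for the produced record
```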
7. Conclusion | Apache Kafka® Fundamentals
- Overview: topics, partitions, brokers, producers/consumers, setting up a minimal Kafka cluster
- Went over some other courses and certifications.