Big Data-Kafka
+ A unified platform for handling all the real-time data feeds a large company might have.
+ High throughput to support high-volume event feeds.
+ Support real-time processing of these feeds to create new, derived feeds.
+ Support large data backlogs to handle periodic ingestion from offline systems.
+ Support low-latency delivery to handle more traditional messaging use cases.
+ Guarantee fault tolerance in the presence of machine failures.
What is Kafka
An Apache Kafka cluster consists of one or more servers; each of these servers is what we call a broker. Brokers are also responsible for maintaining general state information of the system, leader election, and so on.
+ Producers write data to brokers.
+ Consumers read data from brokers.
+ All of this is distributed.
+ Data is stored in topics.
+ Topics are split into partitions, which are replicated.
More Concepts
+ Partitions: A topic consists of partitions. A partition is an ordered, immutable sequence of messages that is continually appended to, i.e. a commit log. The number of partitions per topic is configurable, and it determines the maximum consumer (group) parallelism.
+ Kafka Log: A log is simply another way to view a partition. Basically, a data source writes messages to the log, and one or more consumers read that data from the log at any time they want.
+ Offset: Messages in a partition are each assigned a unique (per partition) and sequential id called the offset. Consumers track their position via (offset, partition, topic) tuples.
+ Replicas: “Backups” of a partition. They exist solely to prevent data loss. Replicas are never read from and never written to directly; they do NOT help to increase producer or consumer parallelism. Kafka tolerates (numReplicas – 1) dead brokers before losing data.
+ Consumer Group: Kafka can have multiple consumer processes/instances running, and each consumer group has a unique group-id. Within one consumer group, exactly one consumer instance reads the data from a given partition at the time of reading. Since there can be more than one consumer group, one instance from each of these groups can read from the same partition. However, if the number of consumers exceeds the number of partitions, some consumers will be inactive. For example, with 8 consumers and 6 partitions in a single consumer group, there will be 2 inactive consumers (see the sketch below).
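As a rough illustration of this mapping (this is not Kafka's actual partition assignor, just a sketch of the counting argument in Python), assigning partitions round-robin shows why 6 partitions can keep at most 6 of the 8 consumers busy:

# Illustrative sketch only: round-robin assignment of partitions to consumers
# within one consumer group. Kafka's real assignors are more sophisticated,
# but the idle-consumer arithmetic is the same.
def assign_partitions(num_partitions, num_consumers):
    assignment = {c: [] for c in range(num_consumers)}
    for p in range(num_partitions):
        assignment[p % num_consumers].append(p)
    idle = [c for c, parts in assignment.items() if not parts]
    return assignment, idle

assignment, idle = assign_partitions(num_partitions=6, num_consumers=8)
print(assignment)   # consumers 6 and 7 receive no partitions
print(len(idle))    # 2 inactive consumers, as in the example above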
Role of Zookeeper in Kafka
Apache Zookeeper serves as the coordination interface between the Kafka brokers and
consumers.
+ Also, we can say it is a distributed configuration and synchronization service.
+ Basically, the ZooKeeper cluster shares information with the Kafka servers.
+ Moreover, Kafka stores basic metadata in ZooKeeper, such as information about topics, brokers, consumer offsets (queue readers) and so on.
+ In addition, the failure of a ZooKeeper node or a broker does not affect the Kafka cluster, because the critical information stored in ZooKeeper is replicated across its ensemble. Kafka then restores its state as ZooKeeper restarts, leading to zero downtime for Kafka.
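As a minimal sketch of the metadata ZooKeeper holds for Kafka, the snippet below lists the broker and topic znodes. It assumes the kazoo Python client is installed and ZooKeeper is reachable on localhost:2181; /brokers/ids and /brokers/topics are the paths under which Kafka registers its brokers and topics.

# Sketch: peeking at the Kafka metadata stored in ZooKeeper.
# Assumes `pip install kazoo` and a ZooKeeper node on localhost:2181.
from kazoo.client import KazooClient

zk = KazooClient(hosts="localhost:2181")
zk.start()

print(zk.get_children("/brokers/ids"))     # ids of the live, registered brokers
print(zk.get_children("/brokers/topics"))  # topics known to the cluster

zk.stop()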
Producer and Consumer in Kafka
+ You use Kafka “producers” to write data to Kafka brokers.
+ Client libraries are available for the JVM (Java, Scala), C/C++, Python, Ruby, etc.
+ In order to send messages asynchronously to a topic, the KafkaProducer class provides the send() method; in the Java client it takes a ProducerRecord and returns a Future, as sketched below.
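A minimal producer sketch using the kafka-python client (an assumption here; the idea is the same in the Java client). The broker address is a placeholder, and the topic name is the one created later in these notes.

# Sketch: asynchronous send with kafka-python (pip install kafka-python).
# The broker address below is a placeholder for your cluster.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# send() is asynchronous: it returns a future immediately and batches
# messages in the background.
future = producer.send("ineurontopic", key=b"user-1", value=b"hello kafka")

# Block on the future only if the record metadata (partition, offset) is needed.
metadata = future.get(timeout=10)
print(metadata.partition, metadata.offset)

producer.flush()   # make sure buffered messages are actually sent
producer.close()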
+ After creating a Kafka Producer to send messages to the Apache Kafka cluster, we create a Kafka Consumer to consume messages from the Kafka cluster.
+ A Kafka Consumer subscribes to one or more topics in the Kafka cluster and then feeds on tokens or messages from those Kafka topics. Using the heartbeat we can know the connectivity of the Consumer to the Kafka cluster: the heartbeat is set up at the Consumer to let ZooKeeper or the Broker Coordinator know whether the Consumer is still connected to the cluster. If the heartbeat is absent, the Kafka Consumer is no longer considered connected to the cluster, and the Broker Coordinator has to re-balance the load. However, the heartbeat is an overhead to the cluster, so, keeping data throughput and overhead in consideration, we can configure the interval at which the Consumer sends the heartbeat (see the consumer sketch below).
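A minimal consumer sketch, again assuming the kafka-python client; heartbeat_interval_ms and session_timeout_ms are the settings that trade heartbeat overhead against how quickly a dead consumer is detected and the load re-balanced.

# Sketch: a consumer in a consumer group, with explicit heartbeat settings.
# Assumes kafka-python; the broker address is a placeholder.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "ineurontopic",
    bootstrap_servers="localhost:9092",
    group_id="ineuron-group",        # consumers sharing this id split the partitions
    auto_offset_reset="earliest",    # start from the beginning if no committed offset exists
    heartbeat_interval_ms=3000,      # how often a heartbeat is sent to the coordinator
    session_timeout_ms=10000,        # missing heartbeats for this long triggers a re-balance
)

for message in consumer:
    print(message.topic, message.partition, message.offset, message.value)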
Partition to Consumer Group Mapping
Create Topic
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 2 --topic ineurontopic
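The same topic can also be created programmatically; a sketch using kafka-python's admin client, assuming a broker is reachable on localhost:9092:

# Sketch: create the topic from code instead of the CLI (kafka-python admin API).
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([NewTopic(name="ineurontopic", num_partitions=2, replication_factor=1)])
admin.close()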
Producer to a topic
bin/kafka-console-producer.sh --broker-list 172.18.0.2:6667 --topic ineurontopic
Documentation: https://gerardnico.com/dit/kafka/kafka-console-consumer
Class Notes
/FileStore/newdata.csv
/databricks-datasets/
%sh
# move the uploaded CSV from FileStore into the databricks-datasets folder on DBFS
mv /dbfs/FileStore/newdata.csv /dbfs/databricks-datasets/
reduce - pulls the entire result down into a single location (the driver), because it is reducing to one final value.
[1,2,3,4]
rdd.reduce(somefunction) ---> N1,N2,N3....---> single Python result on the driver
[(1,2),(2,3),(2,4),(1,2).....] ->N1,N2,N3....-->[(1,4),(2,7)]
pairedrdd.reduceByKey(somefunction) - one value for each key. Since this reduction can be run on each machine locally first, the result can remain an RDD and have further transformations applied to it.
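A small sketch of both operations with the same numbers as above, assuming a live SparkContext named sc (as in a Databricks notebook):

# Sketch: reduce vs reduceByKey, assuming an existing SparkContext `sc`.
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.reduce(lambda a, b: a + b))             # 10 - single value pulled to the driver

paired = sc.parallelize([(1, 2), (2, 3), (2, 4), (1, 2)])
summed = paired.reduceByKey(lambda a, b: a + b)   # still an RDD; combined per key on each node first
print(summed.collect())                           # [(1, 4), (2, 7)] (order may vary)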