Slides 5-6: Kafka


Distributed and Parallel Computing

Trong-Hop Do
Kafka – A distributed event streaming platform
What is event streaming?
What can I use event streaming for?
• To process payments and financial transactions in real-time, such as in stock exchanges, banks, and insurance companies.

• To track and monitor cars, trucks, fleets, and shipments in real-time, such as in logistics and the automotive industry.

• To continuously capture and analyze sensor data from IoT devices or other equipment, such as in factories and wind parks.

• To collect and immediately react to customer interactions and orders, such as in retail, the hotel and travel industry, and mobile applications.

• To monitor patients in hospital care and predict changes in condition to ensure timely treatment in emergencies.

• To connect, store, and make available data produced by different divisions of a company.

• To serve as the foundation for data platforms, event-driven architectures, and microservices.
What is a stream?
Store data in Kafka?
Tutorial 1: Kafka installation on Windows

• Download Kafka from https://kafka.apache.org/

• Unzip the downloaded file

• Rename the unzipped folder to “kafka” and move it to the C:\ drive


Kafka installation on Windows

• Open C:\kafka\config\server.properties

• Change the path of log.dirs


Kafka installation on Windows

• Open C:\kafka\config\zookeeper.properties

• Change the path of dataDir

• By default, Apache Kafka runs on port 9092 and Apache ZooKeeper runs on port 2181.
Tutorial 2: Run Apache Kafka on Windows

• Start the Kafka cluster


• Run the following command to start ZooKeeper:
cd C:\kafka\
.\bin\windows\zookeeper-server-start.bat .\config\zookeeper.properties
Run Apache Kafka on Windows

• Start the Kafka cluster


• Run the following command to start the Kafka broker:
cd C:\kafka\
.\bin\windows\kafka-server-start.bat .\config\server.properties
Run Apache Kafka on Windows

• Produce and consume some messages


• Run the kafka-topics command to create a Kafka topic named TestTopic

bin\windows\kafka-topics.bat --create --topic TestTopic --bootstrap-server localhost:9092

• Let’s create another topic named NewTopic


bin\windows\kafka-topics.bat --create --topic NewTopic --bootstrap-server localhost:9092

• Let’s list the created topics


bin\windows\kafka-topics.bat --list --bootstrap-server localhost:9092
Run Apache Kafka on Windows

• Produce and consume some messages

• Run the producer and consumer in separate Command Prompt windows:

bin\windows\kafka-console-producer.bat --topic TestTopic --bootstrap-server localhost:9092


bin\windows\kafka-console-consumer.bat --topic TestTopic --from-beginning --bootstrap-server localhost:9092
Tutorial 3: Kafka Python client
• https://kafka-python.readthedocs.io/en/master/index.html

• Install Kafka-Python

• pip install kafka-python

• Start the ZooKeeper server and Kafka broker

• ZooKeeper runs by default on localhost:2181 and Kafka on localhost:9092


Kafka-Python
• Run consumer code
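The consumer code on this slide is a screenshot; below is a minimal kafka-python sketch of such a consumer, assuming the TestTopic topic from Tutorial 2 and the default broker address localhost:9092:

from kafka import KafkaConsumer

# Subscribe to TestTopic and read from the earliest available offset
consumer = KafkaConsumer(
    'TestTopic',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    value_deserializer=lambda m: m.decode('utf-8'),
)

# Block and print each message as it arrives
for message in consumer:
    print(f"{message.topic}:{message.partition}:{message.offset} {message.value}")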
Kafka-Python
• Run producer code
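The producer code is likewise a screenshot; a minimal sketch, again assuming TestTopic on localhost:9092:

from kafka import KafkaProducer
from time import sleep

producer = KafkaProducer(bootstrap_servers='localhost:9092')

# Send ten plain-text messages, one per second
for i in range(10):
    producer.send('TestTopic', f'message {i}'.encode('utf-8'))
    sleep(1)

# Ensure all buffered messages are delivered before exiting
producer.flush()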
Kafka-Python
• Check the result
Tutorial 4: Run Kafka on Colab
• Download Kafka and unzip it
!curl -sSOL https://downloads.apache.org/kafka/3.3.1/kafka_2.13-3.3.1.tgz
!tar -xzf kafka_2.13-3.3.1.tgz

• Start zookeeper server and kafka server


!./kafka_2.13-3.3.1/bin/zookeeper-server-start.sh -daemon ./kafka_2.13-3.3.1/config/zookeeper.properties
!./kafka_2.13-3.3.1/bin/kafka-server-start.sh -daemon ./kafka_2.13-3.3.1/config/server.properties

• Create a topic
!./kafka_2.13-3.3.1/bin/kafka-topics.sh --create --bootstrap-server 127.0.0.1:9092 --replication-factor 1 --partitions 1 --topic TestTopic
Run Kafka on Colab

• Describe the created topic


!./kafka_2.13-3.3.1/bin/kafka-topics.sh --describe --bootstrap-server 127.0.0.1:9092 --topic TestTopic

• Write some events to the topic


!./kafka_2.13-3.3.1/bin/kafka-console-producer.sh --topic TestTopic --bootstrap-server 127.0.0.1:9092

• Read the events


!./kafka_2.13-3.3.1/bin/kafka-console-consumer.sh --topic TestTopic --from-beginning --bootstrap-server 127.0.0.1:9092
Run Kafka on Colab
• You can run the cells sequentially and get the result (not really streaming)
Run Kafka on Colab
• Or you can run the producer and consumer in parallel in different terminals

• Open a terminal using Xterm and run the consumer (it will be empty at first)

• Open another terminal using Xterm and run the producer; write some lines and they will appear in the consumer’s terminal
Run Kafka on Colab
• Use kafka-python on Colab
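The Colab code is a screenshot; a minimal sketch of a notebook-friendly consumer, assuming the TestTopic topic created above (consumer_timeout_ms makes the cell return instead of blocking forever):

from kafka import KafkaConsumer

# Stop iterating when no message arrives for 5 seconds, so the cell terminates
consumer = KafkaConsumer(
    'TestTopic',
    bootstrap_servers='127.0.0.1:9092',
    auto_offset_reset='earliest',
    consumer_timeout_ms=5000,
)
for message in consumer:
    print(message.value.decode('utf-8'))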
Tutorial 5: Test Kafka and Spark Structured Streaming on Colab
• Start kafka

• Install PySpark

# Currently, 3.3.0 is the latest version; you still need to pin it explicitly.
!pip install pyspark==3.3.0

from pyspark.sql import SparkSession


scala_version = '2.13'
spark_version = '3.3.0'
packages = [
    f'org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}',
    'org.apache.kafka:kafka-clients:3.3.1'
]
spark = SparkSession.builder.master("local").appName("kafka-example") \
    .config("spark.jars.packages", ",".join(packages)).getOrCreate()
spark
• Install kafka-python
!pip install kafka-python

from kafka import KafkaProducer
from json import dumps

topic_name = 'Number'
kafka_server = 'localhost:9092'
producer = KafkaProducer(bootstrap_servers=kafka_server,
                         value_serializer=lambda x: dumps(x).encode('utf-8'))
for e in range(1000):
    data = {'number': e}
    producer.send(topic_name, value=data)
producer.flush()

• You can test whether the messages were sent successfully
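One way to check (a sketch, reusing topic_name and kafka_server from the producer cell) is to read back a few records with kafka-python:

from kafka import KafkaConsumer
from json import loads

consumer = KafkaConsumer(
    topic_name,
    bootstrap_servers=kafka_server,
    auto_offset_reset='earliest',
    value_deserializer=lambda m: loads(m.decode('utf-8')),
    consumer_timeout_ms=5000,  # stop if the topic is quiet for 5 s
)

# Print the first five records to confirm the messages arrived
for i, message in enumerate(consumer):
    print(message.value)
    if i >= 4:
        break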


• Create a dataframe from the Kafka topic

kafkaDf = spark.read.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", topic_name) \
    .option("startingOffsets", "earliest") \
    .load()
kafkaDf.show()

• Show the dataframe in a formatted way


from pyspark.sql.functions import col, concat, lit

kafkaDf.select(
    concat(col("topic"), lit(':'), col("partition").cast("string")).alias("topic_partition"),
    col("offset"),
    col("value").cast("string")
).show()
Tutorial 6: Test Kafka and Spark Structured Streaming locally

• Step 1: Start the Kafka cluster using the Terminal

• Step 2: Run a KafkaProducer in a Jupyter Notebook

from kafka import KafkaProducer
from json import dumps
from time import sleep

topic_name = 'RandomNumber'
kafka_server = 'localhost:9092'

producer = KafkaProducer(bootstrap_servers=kafka_server,
                         value_serializer=lambda x: dumps(x).encode('utf-8'))

for e in range(1000):
    data = {'number': e}
    producer.send(topic_name, value=data)
    print(str(data) + " sent")
    sleep(5)

producer.flush()
• Open another Jupyter Notebook

• You will read data from Kafka in two ways:


• Batch query
• Streaming query
• See more at https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
Creating a Kafka Source for Batch Queries
• Create a dataframe from the Kafka data

topic_name = 'RandomNumber'
kafka_server = 'localhost:9092'

kafkaDf = spark.read.format("kafka") \
    .option("kafka.bootstrap.servers", kafka_server) \
    .option("subscribe", topic_name) \
    .option("startingOffsets", "earliest") \
    .load()

• Show the data (converting the dataframe to pandas for a cleaner view)
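The output itself is a screenshot; a minimal sketch, assuming the kafkaDf defined above:

# value arrives as raw bytes; cast it to a string before converting to pandas
kafkaDf.selectExpr("offset", "CAST(value AS STRING) AS value").toPandas()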


• Show the streaming data using a for loop

from pyspark.sql.functions import col
from time import sleep
from IPython.display import display, clear_output

batchDF = kafkaDf.select(col('topic'), col('offset'),
                         col('value').cast('string').substr(12, 1).alias('rand_number'))

for x in range(0, 2000):
    try:
        print("Showing live view refreshed every 5 seconds")
        print(f"Seconds passed: {x*5}")
        display(batchDF.toPandas())
        sleep(5)
        clear_output(wait=True)
    except KeyboardInterrupt:
        print("break")
        break
print("Live view ended...")
• Perform some data aggregation and show live results
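The aggregation is shown as a screenshot; a minimal sketch, assuming the batchDF from the previous cell (counting how often each digit has appeared so far):

from time import sleep
from IPython.display import display, clear_output

# Re-running toPandas() re-executes the batch read, so the counts grow over time
aggDF = batchDF.groupBy('rand_number').count()

for x in range(0, 2000):
    try:
        print("Live aggregation refreshed every 5 seconds")
        display(aggDF.toPandas())
        sleep(5)
        clear_output(wait=True)
    except KeyboardInterrupt:
        print("break")
        break
print("Live view ended...")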
Creating a Kafka Source for Streaming Queries
• Create Streaming dataframe from Kafka

streamRawDf = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", kafka_server) \
    .option("subscribe", topic_name) \
    .load()

streamDF = streamRawDf.select(col('topic'), col('offset'),
                              col('value').cast('string').substr(12, 1).alias('rand_number'))
checkEvenDF = streamDF.withColumn('Is_Even', col('rand_number').cast('int') % 2 == 0)

• Write stream

from random import randint


randNum=str(randint(0,10000))
q1name = "queryNumber"+randNum
q2name = "queryCheckEven"+randNum

stream_writer1 = (streamDF.writeStream.queryName(q1name).trigger(processingTime="5 seconds").outputMode("append").format("memory"))


stream_writer2 = (checkEvenDF.writeStream.queryName(q2name).trigger(processingTime="5 seconds").outputMode("append").format("memory"))

query1 = stream_writer1.start()
query2 = stream_writer2.start()
• View streaming result
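The memory sink writes each query’s output to an in-memory table named after the query, so the results can be polled with plain SQL. A minimal sketch, assuming the queries started above:

from time import sleep
from IPython.display import display, clear_output

# Poll the in-memory table that the memory sink refreshes every 5 seconds
for x in range(0, 100):
    try:
        print(f"Live view of {q2name}, refreshed every 5 seconds")
        display(spark.sql(f"SELECT * FROM {q2name}").toPandas())
        sleep(5)
        clear_output(wait=True)
    except KeyboardInterrupt:
        print("break")
        break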
Tutorial 7: Kafka and MongoDB on Windows
Tutorial 8
https://towardsdatascience.com/make-a-mock-real-time-stream-of-data-with-python-and-kafka-7e5e23123582
Tutorial 8: streaming from CSV

• sendStream.py
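The script itself is a screenshot; a condensed sketch of the idea, assuming a hypothetical topic named my-stream and a CSV path passed on the command line (the article paces messages by the timestamp deltas in the file; a fixed sleep is used here for brevity):

# sendStream.py -- replay the rows of a CSV file as a Kafka stream
import csv
import sys
from json import dumps
from time import sleep

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda x: dumps(x).encode('utf-8'),
)

# Usage: python sendStream.py data.csv
with open(sys.argv[1]) as f:
    for row in csv.DictReader(f):
        producer.send('my-stream', value=row)
        sleep(1)

producer.flush()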
Tutorial 8: streaming from CSV

• processStream.py
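Again a sketch rather than the article’s exact code, assuming the same my-stream topic; the fixed group_id is what lets a restarted consumer resume from its last committed offset (see the note two slides down):

# processStream.py -- consume the replayed CSV stream and print each record
from json import loads

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'my-stream',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    group_id='csv-stream-reader',  # committed offsets enable resume-on-restart
    value_deserializer=lambda m: loads(m.decode('utf-8')),
)

for message in consumer:
    print(message.value)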
Tutorial 8: streaming from CSV

Start the consumer


Tutorial 8: streaming from CSV
Start the producer
Tutorial 8: streaming from CSV
If you terminate the consumer and then restart it, streaming resumes from where it stopped
Tutorial 9

https://medium.com/@kevin.michael.horan/distributed-video-streaming-with-python-and-kafka-551de69fe1dd
Tutorial 9: Video streaming using Kafka
• Producer.py
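The producer code spans several screenshots; a self-contained sketch of the core idea, assuming OpenCV is installed and a hypothetical topic named video-stream (pass a file path to stream a video, or nothing to stream the webcam):

# Producer.py -- publish video frames to Kafka as JPEG-encoded bytes
import sys

import cv2
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')

# argv[1] is a video file such as Countdow1.mp4; device 0 is the webcam
source = sys.argv[1] if len(sys.argv) > 1 else 0
video = cv2.VideoCapture(source)

while video.isOpened():
    success, frame = video.read()
    if not success:
        break
    # JPEG-encode each frame so every Kafka message is a compact byte payload
    ok, buffer = cv2.imencode('.jpg', frame)
    if ok:
        producer.send('video-stream', buffer.tobytes())

video.release()
producer.flush()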
Tutorial 9: Video streaming using Kafka
• consumer.py
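The consumer in the article serves the frames to a browser; a sketch along those lines, assuming Flask and the same video-stream topic (each JPEG payload is wrapped in a multipart boundary so the browser renders it as motion JPEG):

# consumer.py -- expose the Kafka frame stream as MJPEG over HTTP
from flask import Flask, Response
from kafka import KafkaConsumer

consumer = KafkaConsumer('video-stream', bootstrap_servers='localhost:9092')
app = Flask(__name__)

def gen():
    # Each message value is one JPEG image produced by Producer.py
    for message in consumer:
        yield (b'--frame\r\n'
               b'Content-Type: image/jpeg\r\n\r\n' + message.value + b'\r\n')

@app.route('/video')
def video():
    return Response(gen(), mimetype='multipart/x-mixed-replace; boundary=frame')

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)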
Tutorial 9: Video streaming using Kafka

• Run consumer.py
Tutorial 9: Video streaming using Kafka
• Stream video from webcam
Tutorial 9: Video streaming using Kafka
• Stream a video file named Countdow1.mp4
Tutorial 10

https://towardsdatascience.com/real-time-anomaly-detection-with-apache-kafka-and-python-3a40281c01c9
Tutorial 10: real-time anomaly detection

• Producer.py
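A sketch of the producer side, not the article’s exact code: it emits mostly “normal” 2-D points with occasional outliers, using a hypothetical topic named transactions:

# Producer.py -- emit random 2-D points, occasionally anomalous, as JSON
from json import dumps
from random import random
from time import sleep

import numpy as np
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda x: dumps(x).encode('utf-8'),
)

while True:
    if random() < 0.05:
        point = np.random.uniform(-4, 4, 2)   # rare outlier, far from the cluster
    else:
        point = np.random.normal(0, 0.5, 2)   # normal observation
    producer.send('transactions', value={'data': point.tolist()})
    sleep(1)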
Tutorial 10: real-time anomaly detection

• train.py
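A sketch of offline training under the same assumptions, using scikit-learn’s IsolationForest (one common choice for this kind of unsupervised anomaly detection) and saving the model with joblib:

# train.py -- fit an IsolationForest on normal data and persist it
import joblib
import numpy as np
from sklearn.ensemble import IsolationForest

# Training data drawn from the producer's "normal" distribution
X_train = np.random.normal(0, 0.5, size=(1000, 2))

model = IsolationForest(contamination=0.05, random_state=42)
model.fit(X_train)

joblib.dump(model, 'isolation_forest.joblib')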
Tutorial 10: real-time anomaly detection

• detector.py
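And a sketch of the detector, which loads the saved model and scores every incoming record (predict() returns -1 for anomalies and 1 for normal points):

# detector.py -- flag anomalous points as they arrive on the topic
from json import loads

import joblib
import numpy as np
from kafka import KafkaConsumer

model = joblib.load('isolation_forest.joblib')

consumer = KafkaConsumer(
    'transactions',
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda m: loads(m.decode('utf-8')),
)

for message in consumer:
    point = np.array(message.value['data']).reshape(1, -1)
    if model.predict(point)[0] == -1:
        print(f"ANOMALY detected: {point.ravel()}")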
Tutorial 11: TensorFlow I/O and Kafka

• Just follow https://www.tensorflow.org/io/tutorials/kafka
Tutorial 12: Spotify Recommendation System

https://www.analyticsvidhya.com/blog/2021/06/spotify-recommendation-system-using-pyspark-and-kafka-streaming/
Tutorial 13: Order book simulation
https://github.com/rongpenl/order-book-simulation
Tutorial 14: Create your own data stream
https://aiven.io/blog/create-your-own-data-stream-for-kafka-with-python-and-faker
Tutorial 15: Bigmart sale prediction
• Dataset: https://www.kaggle.com/datasets/brijbhushannanda1979/bigmart-sales-data
• Use the train set to train a simple prediction model with Spark MLlib
• Stream data from the test set to the Kafka server (remember to set a time interval between sends)
• Create a Spark streaming dataframe from Kafka and apply the trained model to get real-time predictions, as sketched below
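A minimal end-to-end sketch of these steps; everything here is an assumption (the model path, the topic name 'bigmart', and the two feature columns are placeholders for whatever you actually train on):

from pyspark.ml import PipelineModel
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StructField, StructType

# 1. Load a pipeline (feature assembler + regressor) trained on the train set
#    and saved beforehand with model.save('bigmart_model')
model = PipelineModel.load('bigmart_model')

# 2. Parse the JSON test rows streamed to the 'bigmart' topic
schema = StructType([
    StructField('Item_Weight', DoubleType()),
    StructField('Item_MRP', DoubleType()),
])
streamDf = (spark.readStream.format('kafka')
            .option('kafka.bootstrap.servers', 'localhost:9092')
            .option('subscribe', 'bigmart')
            .load()
            .select(from_json(col('value').cast('string'), schema).alias('row'))
            .select('row.*'))

# 3. Apply the trained pipeline to the stream and print predictions live
query = (model.transform(streamDf)
         .select('prediction')
         .writeStream.outputMode('append').format('console')
         .start())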
