
Code Explanation

Case Study: Retail Data Analysis


In this project, we work through a real-world use case from the retail sector: data from a centralised Kafka server is streamed in real time into Spark and processed to calculate various key performance indicators (KPIs). UDFs derive per-order metrics such as total cost and item count, tumbling-window aggregations compute time-based and country-based KPIs, and the results are written as JSON files to HDFS.

1. Various SQL functions were imported from the pyspark.sql.functions module. The functions include window, udf, etc.
2. Various SQL types were imported from the pyspark.sql.types module. The types include StringType, ArrayType, TimestampType, IntegerType, DoubleType, StructType, StructField, etc.
3. SparkSession was imported from the pyspark.sql module.
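Taken together, steps 1-3 correspond to an import block along these lines (a sketch; any names beyond window, udf, and the listed types are assumptions based on how they are used in later steps):
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, udf, from_json, col, count, sum, avg
from pyspark.sql.types import StringType, ArrayType, TimestampType, IntegerType, DoubleType, StructType, StructField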
4. Initialised the Spark session using
spark = SparkSession \
.builder \
.appName("KafkaRead") \
.getOrCreate()
5. Streamed the data from the Kafka producer using
orderRaw = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "ec2-18-211-252-152.compute-1.amazonaws.com:9092") \
    .option("subscribe", "real-time-project") \
    .load()
From
Bootstrap server - 18.211.252.152
Port - 9092
Topic - real-time-project
6. The schema is defined using
jsonSchema = StructType() \
.add("invoice_no", StringType()) \
.add("country", StringType()) \
.add("timestamp", TimestampType()) \
.add("type", StringType()) \
.add("items", ArrayType(StructType([
StructField("SKU", StringType()),
StructField("title", StringType()),
StructField("unit_price", DoubleType()),
StructField("quantity", IntegerType())
])))
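The step that applies this schema to the raw Kafka value is not shown in the original; a typical approach (the orderStream and data names here are assumptions) casts the Kafka value to a string and parses it with from_json:
orderStream = orderRaw \
    .select(from_json(col("value").cast("string"), jsonSchema).alias("data")) \
    .select("data.*")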
7. A Python function is written to compute the total cost of an order using
Total cost = Σ(quantity × unit_price), summed over the items in the order.
8. The above function is transformed into a udf (user-defined function) using
add_total_cost = udf(get_total_cost, DoubleType())
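The body of get_total_cost is not shown; a minimal sketch, assuming it follows the formula in step 7:
def get_total_cost(items):
    # Assumed body: sum quantity * unit_price over every item in the order
    total_cost = 0
    for item in items:
        total_cost = total_cost + (item['quantity'] * item['unit_price'])
    return total_cost

add_total_cost = udf(get_total_cost, DoubleType())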
9. A Python function is written to find the total number of items in an order.
10. The above function is transformed into a udf using
add_total_count = udf(get_total_item, IntegerType())
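A minimal sketch of get_total_item, assuming it simply sums the quantity field:
def get_total_item(items):
    # Assumed body: total number of units across all items in the order
    total_count = 0
    for item in items:
        total_count = total_count + item['quantity']
    return total_count

add_total_count = udf(get_total_item, IntegerType())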
11. A Python function is written to flag whether the order is a new order.
12. The above function is transformed into a udf using
add_is_order_flag = udf(get_is_order, IntegerType())
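A sketch of get_is_order, assuming the type column carries the values "ORDER" and "RETURN" (the actual values are not shown in the original):
def get_is_order(order_type):
    # Assumed body: 1 for a new order, 0 otherwise
    return 1 if order_type == 'ORDER' else 0

add_is_order_flag = udf(get_is_order, IntegerType())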
13. A Python function is written to flag whether the order is a return.
14. The above function is transformed into a udf using
add_is_return_flag = udf(get_is_return, IntegerType())
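The matching sketch for get_is_return, under the same assumption about the type values:
def get_is_return(order_type):
    # Assumed body: 1 for a return, 0 otherwise
    return 1 if order_type == 'RETURN' else 0

add_is_return_flag = udf(get_is_return, IntegerType())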
15. The selected data ("invoice_no", "country", "timestamp", "Total_Items", "Total_Cost", "is_order", "is_return") is written to the console.
16. Time-based KPIs (window, OPM or orders per minute, total sales volume, average rate of return, average transaction size) are calculated using a 1-minute tumbling window.
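A sketch of the 1-minute tumbling-window aggregation (the watermark duration and the aggregate expressions are assumptions; the average rate of return is taken here as the mean of the is_return flag, and OPM as the count of invoices per window):
aggStreamByTime = summarisedOrders \
    .withWatermark("timestamp", "1 minute") \
    .groupBy(window("timestamp", "1 minute")) \
    .agg(count("invoice_no").alias("OPM"),
         sum("Total_Cost").alias("total_sales_volume"),
         avg("is_return").alias("average_rate_of_return"),
         avg("Total_Cost").alias("average_transaction_size"))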
17. Time- and country-based KPIs (window, country, OPM, total sales volume, average rate of return) are calculated using a 1-minute tumbling window.
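The country-level variant only adds country to the grouping key (again a sketch under the same assumptions):
aggStreamByCountry = summarisedOrders \
    .withWatermark("timestamp", "1 minute") \
    .groupBy(window("timestamp", "1 minute"), "country") \
    .agg(count("invoice_no").alias("OPM"),
         sum("Total_Cost").alias("total_sales_volume"),
         avg("is_return").alias("average_rate_of_return"))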
18. The computed KPIs were written to files stored on HDFS in JSON format.
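A minimal sketch of the JSON file sink (the HDFS paths and the 1-minute trigger are assumptions; file sinks require a checkpoint location):
queryByTime = aggStreamByTime \
    .writeStream \
    .format("json") \
    .outputMode("append") \
    .option("path", "/tmp/op/time_kpi") \
    .option("checkpointLocation", "/tmp/cp/time_kpi") \
    .trigger(processingTime="1 minute") \
    .start()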
19. The streaming process is manually killed after 10 minutes.
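The original run was stopped by hand; as a hedged alternative, awaitTermination can bound the run to 10 minutes programmatically:
queryByTime.awaitTermination(600)   # block for up to 600 seconds (10 minutes)
queryByTime.stop()                  # then stop the streaming query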
