
Code Explanation

Case Study: Retail Data Analysis


In this project, we work through a real-world use case from the retail sector: data from a centralised Kafka server is streamed in real time into Spark and processed to calculate various key performance indicators (KPIs). UDFs derive per-order metrics such as total cost and item count, tumbling-window aggregations compute time-based and country-based KPIs, and the results are written as JSON files to HDFS.

1. Various SQL functions were imported from the pyspark.sql.functions module. The functions include window, udf, etc.
2. Various SQL types were imported from the pyspark.sql.types module. The types include StringType, ArrayType, TimestampType, IntegerType, DoubleType, StructType, StructField, etc.
3. SparkSession was imported from the pyspark.sql module.
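Taken together, steps 1-3 correspond to an import block along these lines (a sketch; any names beyond window, udf, and the listed types are assumptions based on how they are used in later steps):
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, udf, from_json, col, count, sum, avg
from pyspark.sql.types import StringType, ArrayType, TimestampType, IntegerType, DoubleType, StructType, StructField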
4. Initialised the Spark session using
spark = SparkSession \
.builder \
.appName("KafkaRead") \
.getOrCreate()
5. Streamed the data from the Kafka producer using
orderRaw = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "ec2-18-211-252-152.compute-1.amazonaws.com:9092") \
    .option("subscribe", "real-time-project") \
    .load()
From
Bootstrap server - 18.211.252.152
Port - 9092
Topic - real-time-project
6. The schema is defined using
jsonSchema = StructType() \
.add("invoice_no", StringType()) \
.add("country", StringType()) \
.add("timestamp", TimestampType()) \
.add("type", StringType()) \
.add("items", ArrayType(StructType([
StructField("SKU", StringType()),
StructField("title", StringType()),
StructField("unit_price", DoubleType()),
StructField("quantity", IntegerType())
])))
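The step that applies this schema to the raw Kafka value is not shown in the original; a typical approach (the orderStream and data names here are assumptions) casts the Kafka value to a string and parses it with from_json:
orderStream = orderRaw \
    .select(from_json(col("value").cast("string"), jsonSchema).alias("data")) \
    .select("data.*")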
7. A Python function is written to compute the total cost of an order using
Total cost = Σ(quantity × unit_price), summed over the items in the order.
8. The above function is transformed into a udf (user-defined function) using
add_total_cost = udf(get_total_cost, DoubleType())
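The body of get_total_cost is not shown; a minimal sketch, assuming it follows the formula in step 7:
def get_total_cost(items):
    # Assumed body: sum quantity * unit_price over every item in the order
    total_cost = 0
    for item in items:
        total_cost = total_cost + (item['quantity'] * item['unit_price'])
    return total_cost

add_total_cost = udf(get_total_cost, DoubleType())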
9. A Python function is written to find the total number of items in an order.
10. The above function is transformed into a udf using
add_total_count = udf(get_total_item, IntegerType())
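A minimal sketch of get_total_item, assuming it simply sums the quantity field:
def get_total_item(items):
    # Assumed body: total number of units across all items in the order
    total_count = 0
    for item in items:
        total_count = total_count + item['quantity']
    return total_count

add_total_count = udf(get_total_item, IntegerType())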
11. A Python function is written to flag whether the order is a new order.
12. The above function is transformed into a udf using
add_is_order_flag = udf(get_is_order, IntegerType())
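A sketch of get_is_order, assuming the type column carries the values "ORDER" and "RETURN" (the actual values are not shown in the original):
def get_is_order(order_type):
    # Assumed body: 1 for a new order, 0 otherwise
    return 1 if order_type == 'ORDER' else 0

add_is_order_flag = udf(get_is_order, IntegerType())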
13. A Python function is written to flag whether the order is a return.
14. The above function is transformed into a udf using
add_is_return_flag = udf(get_is_return, IntegerType())
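The matching sketch for get_is_return, under the same assumption about the type values:
def get_is_return(order_type):
    # Assumed body: 1 for a return, 0 otherwise
    return 1 if order_type == 'RETURN' else 0

add_is_return_flag = udf(get_is_return, IntegerType())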
15. The selected data ("invoice_no", "country", "timestamp", "Total_Items", "Total_Cost", "is_order", "is_return") is written to the console.
16. Time-based KPIs (window, OPM or orders per minute, total sales volume, average rate of return, average transaction size) are calculated using a 1-minute tumbling window.
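A sketch of the 1-minute tumbling-window aggregation (the watermark duration and the aggregate expressions are assumptions; the average rate of return is taken here as the mean of the is_return flag, and OPM as the count of invoices per window):
aggStreamByTime = summarisedOrders \
    .withWatermark("timestamp", "1 minute") \
    .groupBy(window("timestamp", "1 minute")) \
    .agg(count("invoice_no").alias("OPM"),
         sum("Total_Cost").alias("total_sales_volume"),
         avg("is_return").alias("average_rate_of_return"),
         avg("Total_Cost").alias("average_transaction_size"))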
17. Time- and country-based KPIs (window, country, OPM, total sales volume, average rate of return) are calculated using a 1-minute tumbling window.
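The country-level variant only adds country to the grouping key (again a sketch under the same assumptions):
aggStreamByCountry = summarisedOrders \
    .withWatermark("timestamp", "1 minute") \
    .groupBy(window("timestamp", "1 minute"), "country") \
    .agg(count("invoice_no").alias("OPM"),
         sum("Total_Cost").alias("total_sales_volume"),
         avg("is_return").alias("average_rate_of_return"))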
18. The computed KPIs were written to files stored on HDFS in JSON format.
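A minimal sketch of the JSON file sink (the HDFS paths and the 1-minute trigger are assumptions; file sinks require a checkpoint location):
queryByTime = aggStreamByTime \
    .writeStream \
    .format("json") \
    .outputMode("append") \
    .option("path", "/tmp/op/time_kpi") \
    .option("checkpointLocation", "/tmp/cp/time_kpi") \
    .trigger(processingTime="1 minute") \
    .start()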
19. The streaming process is manually killed after 10 minutes.
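The original run was stopped by hand; as a hedged alternative, awaitTermination can bound the run to 10 minutes programmatically:
queryByTime.awaitTermination(600)   # block for up to 600 seconds (10 minutes)
queryByTime.stop()                  # then stop the streaming query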
