
ANALYZING DATA WITH APACHE SPARK

MICROSOFT FABRIC - Bamidele Ajamu

The notebook below shows you how to use Apache Spark in Microsoft Fabric.

Load CSV data into a Spark DataFrame

Compare the header "false" and header "true" options
Define a schema with Spark SQL types, since the data has no header row
Explore the DataFrame
Filter a DataFrame
Group and aggregate data
Transform the data
Work with tables and SQL
Visualize the data

Sales Order data exploration


In [1]:
df = spark.read.format("csv").option("header","true").load("Files/orders/2019.csv")
# df now is a Spark DataFrame containing CSV data from "Files/orders/2019.csv".
display(df)

StatementMeta(, d8217fed-e3ea-4bfc-af57-3dc511ee4e88, 3, Finished, Available)


SynapseWidget(Synapse.DataFrame, 56dcd7ce-1a7a-41df-aec4-cd33913dabe6)

In [2]:
df = spark.read.format("csv").option("header","false").load("Files/orders/2019.csv")
# df now is a Spark DataFrame containing CSV data from "Files/orders/2019.csv".
display(df)

StatementMeta(, d8217fed-e3ea-4bfc-af57-3dc511ee4e88, 4, Finished, Available)


SynapseWidget(Synapse.DataFrame, 954ab7e2-496c-4ebc-9e20-98822484efcb)
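A quick way to confirm the effect of the header option is to print each DataFrame's schema. Because this file has no header row, header "true" takes the first order row as column names, while header "false" assigns default names such as _c0, _c1. A minimal sketch using printSchema:

In [ ]:
# Compare the column names produced by the two header settings
df_with_header = spark.read.format("csv").option("header","true").load("Files/orders/2019.csv")
df_no_header = spark.read.format("csv").option("header","false").load("Files/orders/2019.csv")

df_with_header.printSchema()   # column names taken from the first data row of the file
df_no_header.printSchema()     # default column names: _c0, _c1, _c2, ...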

In [3]:
# You can create a new header for the dataframe
from pyspark.sql.types import *

orderSchema = StructType([
    StructField("SalesOrderNumber", StringType()),
    StructField("SalesOrderLineNumber", IntegerType()),
    StructField("OrderDate", DateType()),
    StructField("CustomerName", StringType()),
    StructField("Email", StringType()),
    StructField("Item", StringType()),
    StructField("Quantity", IntegerType()),
    StructField("UnitPrice", FloatType()),
    StructField("Tax", FloatType())
])

df = spark.read.format("csv").schema(orderSchema).load("Files/orders/2019.csv")
display(df)

StatementMeta(, d8217fed-e3ea-4bfc-af57-3dc511ee4e88, 5, Finished, Available)


SynapseWidget(Synapse.DataFrame, 1e4719a3-cd9a-4775-86c6-0274e4bba40e)

In [ ]:

Modify the code so that the file path uses a * wildcard to read the sales order data from all of the
files in the orders folder:

In [4]:
from pyspark.sql.types import *

orderSchema = StructType([
    StructField("SalesOrderNumber", StringType()),
    StructField("SalesOrderLineNumber", IntegerType()),
    StructField("OrderDate", DateType()),
    StructField("CustomerName", StringType()),
    StructField("Email", StringType()),
    StructField("Item", StringType()),
    StructField("Quantity", IntegerType()),
    StructField("UnitPrice", FloatType()),
    StructField("Tax", FloatType())
])

df = spark.read.format("csv").schema(orderSchema).load("Files/orders/*.csv")
display(df)

StatementMeta(, d8217fed-e3ea-4bfc-af57-3dc511ee4e88, 6, Finished, Available)


SynapseWidget(Synapse.DataFrame, d1819bfb-cb89-4a6b-ba7c-4ef69d101683)

Explore data in a dataframe


Filter a DataFrame
Group and aggregate data
Transform the data

1. Filter

In [5]:
# Select just the CustomerName and Email columns (passing column names to df[] returns a new DataFrame)
customers = df['CustomerName', 'Email']
print(customers.count())
print(customers.distinct().count())
display(customers.distinct())

StatementMeta(, d8217fed-e3ea-4bfc-af57-3dc511ee4e88, 7, Finished, Available)


32718
12427
SynapseWidget(Synapse.DataFrame, 5982f9e3-df06-4bbd-b73a-f915962a1044)

In [6]:
# Use the where clause
customers = df.select("CustomerName", "Email").where(df['Item']=='Road-250 Red, 52')
print(customers.count())
print(customers.distinct().count())
display(customers.distinct())

StatementMeta(, d8217fed-e3ea-4bfc-af57-3dc511ee4e88, 8, Finished, Available)


133
133
SynapseWidget(Synapse.DataFrame, 59741d99-f5d2-4bf7-9d92-55fbef8313ef)
2. Aggregate and group data in a dataframe

In [7]:
productSales = df.select("Item", "Quantity").groupBy("Item").sum()
display(productSales)

StatementMeta(, d8217fed-e3ea-4bfc-af57-3dc511ee4e88, 9, Finished, Available)


SynapseWidget(Synapse.DataFrame, a6b2a245-c1ea-41ab-912f-f4d7c7de0304)
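Note that calling sum() on the grouped data produces a column named sum(Quantity). A minimal alternative sketch uses agg with an alias to give the total a friendlier name (the TotalQuantity name here is just an example):

In [ ]:
from pyspark.sql.functions import sum as sum_   # avoid shadowing Python's built-in sum

# Same aggregation as above, but give the total an explicit column name and sort by it
productSales = df.groupBy("Item").agg(sum_("Quantity").alias("TotalQuantity"))
display(productSales.orderBy("TotalQuantity", ascending=False))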

In [9]:
# Yearly aggregate of order counts
from pyspark.sql.functions import *

yearlySales = df.select(year("OrderDate").alias("Year")).groupBy("Year").count().orderBy("Year")
display(yearlySales)

StatementMeta(, d8217fed-e3ea-4bfc-af57-3dc511ee4e88, 11, Finished, Available)


SynapseWidget(Synapse.DataFrame, cc4ac63f-88fd-4066-83db-3ec15ba6bea0)

Data Transformation with Spark


In [11]:
from pyspark.sql.functions import *

# Create Year and Month columns
transformed_df = df.withColumn("Year", year(col("OrderDate"))).withColumn("Month", month(col("OrderDate")))

# Create the new FirstName and LastName fields
transformed_df = transformed_df.withColumn("FirstName", split(col("CustomerName"), " ").getItem(0)).withColumn("LastName", split(col("CustomerName"), " ").getItem(1))

# Filter and reorder columns
transformed_df = transformed_df["SalesOrderNumber", "SalesOrderLineNumber", "OrderDate", "Year", "Month", "FirstName", "LastName", "Email", "Item", "Quantity", "UnitPrice", "Tax"]

# Display the first ten orders
display(transformed_df.limit(10))

StatementMeta(, d8217fed-e3ea-4bfc-af57-3dc511ee4e88, 13, Finished, Available)


SynapseWidget(Synapse.DataFrame, f24dff6a-92cc-4479-a4b1-083346da9a8d)
Save the transformed data

In [13]:
transformed_df.write.mode("overwrite").parquet('Files/transformed_data/orders')
print ("Transformed data has now been saved! Thanks")

StatementMeta(, d8217fed-e3ea-4bfc-af57-3dc511ee4e88, 15, Finished, Available)


Transformed data has now been saved! Thanks

In [14]:
# load a new dataframe from the parquet files in the transformed_orders/orders folder
orders_df = spark.read.format("parquet").load("Files/transformed_data/orders")
display(orders_df)

StatementMeta(, d8217fed-e3ea-4bfc-af57-3dc511ee4e88, 16, Finished, Available)


SynapseWidget(Synapse.DataFrame, e526884b-8e78-4a3a-b76f-f34d21085bba)
Save data in partitioned files

In [16]:
# To partition data by Year and Month use the below code

orders_df.write.partitionBy("Year","Month").mode("overwrite").parquet("Files/partitioned_data")
print ("Transformed and partitioned data has been saved!")

StatementMeta(, d8217fed-e3ea-4bfc-af57-3dc511ee4e88, 18, Finished, Available)


Transformed and partitioned data has been saved!

In [17]:
# You can load a new dataframe from the partitioned parquet files, e.g. all orders from 2020
orders_2020_df = spark.read.format("parquet").load("Files/partitioned_data/Year=2020/Month=*")
display(orders_2020_df)

StatementMeta(, d8217fed-e3ea-4bfc-af57-3dc511ee4e88, 19, Finished, Available)


SynapseWidget(Synapse.DataFrame, 1962b3ad-158b-420d-b1c2-30aa452c9b7c)

Work with tables and SQL


Create a table

In [2]:
# Create a new table
df.write.format("delta").saveAsTable("salesorder")

# Get the table description


spark.sql("DESCRIBE EXTENDED salesorder").show(truncate=False)

In [16]:
%%sql

SELECT * FROM sales
LIMIT 5

StatementMeta(, 15388063-942e-49d6-a1d4-ee8c2fa33ac9, 18, Finished, Available)


<Spark SQL result set with 5 rows and 9 fields>
Out[16]:

In [30]:
df = spark.sql("SELECT * FROM Sales_LakeHouse.sales LIMIT 1000")
display(df)

StatementMeta(, 15388063-942e-49d6-a1d4-ee8c2fa33ac9, 32, Finished, Available)


SynapseWidget(Synapse.DataFrame, bd69700b-df1a-42ef-92d6-d0720160b8ba)

In [25]:
%%sql
SELECT YEAR(OrderDate) AS OrderYear,
SUM((UnitPrice * Quantity) + TaxAmount) AS GrossRevenue
FROM sales
GROUP BY YEAR(OrderDate)
ORDER BY OrderYear

StatementMeta(, 15388063-942e-49d6-a1d4-ee8c2fa33ac9, 27, Finished, Available)


<Spark SQL result set with 1 rows and 2 fields>
Out[25]:

Visualize data with Spark


In [26]:
%%sql
-- The %%sql magic lets you write Spark SQL queries directly in a notebook cell
SELECT * FROM sales

StatementMeta(, 15388063-942e-49d6-a1d4-ee8c2fa33ac9, 28, Finished, Available)


<Spark SQL result set with 1000 rows and 9 fields>
Out[26]:

In [40]:
sqlQuery = "SELECT CAST(YEAR(OrderDate) AS CHAR(4)) AS OrderYear, \
SUM((UnitPrice * Quantity) + TaxAmount) AS GrossRevenue \
FROM Sales_LakeHouse.sales \
GROUP BY CAST(YEAR(OrderDate) AS CHAR(4)) \
ORDER BY OrderYear"

StatementMeta(, 15388063-942e-49d6-a1d4-ee8c2fa33ac9, 42, Finished, Available)

In [42]:
df_spark = spark.sql(sqlQuery)
df_spark.show()

StatementMeta(, 15388063-942e-49d6-a1d4-ee8c2fa33ac9, 44, Finished, Available)


+---------+--------------------+
|OrderYear| GrossRevenue|
+---------+--------------------+
| null|2.2602264277594723E7|
+---------+--------------------+
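To turn the query result into a chart, one common approach is to convert the Spark DataFrame to pandas and plot it with matplotlib. The sketch below assumes matplotlib is available in the notebook environment and reuses the df_spark result from the cell above:

In [ ]:
from matplotlib import pyplot as plt

# Convert the Spark DataFrame to pandas so it can be plotted locally
df_sales = df_spark.toPandas()

# Plot gross revenue per year as a bar chart
plt.clf()
plt.bar(x=df_sales['OrderYear'], height=df_sales['GrossRevenue'])
plt.title('Revenue by Year')
plt.xlabel('Year')
plt.ylabel('Revenue')
plt.show()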

