
Spark SQL using Python

By
Prof Shibdas Dutta
Associate Professor,

DCG Data Core Systems India Pvt Ltd
Kolkata
Table of Contents
• Introduction
• Basic Commands

Introduction

• Many data scientists, analysts, and general business intelligence users rely on interactive SQL queries for exploring data, and Spark SQL is a tool that enables you to do exactly that.
What Is Spark SQL?

Apache Spark is an open-source data processing framework for processing large datasets in a distributed manner (across a cluster).

Spark SQL is a Spark module for structured data processing. One use of Spark SQL is to execute SQL queries. Here, let's focus on 9 basic commands for running SQL queries.

A Dataset is a distributed collection of data. A DataFrame is a Dataset organised into named columns.
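As a quick illustration of those named columns, here is a minimal sketch that builds a small DataFrame from a plain Python list. The column names order_id and order_quantity mirror the ones used in the later examples, and the sketch assumes a SparkSession named spark is already available (creating one is covered in step 1 below).

# Build a small DataFrame with named columns from a Python list
data = [("A001", 5), ("A002", 12)]
df_demo = spark.createDataFrame(data, ["order_id", "order_quantity"])

# Each column now has a name and an inferred type
df_demo.printSchema()
df_demo.show()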
Basic Commands
Getting to Know Spark SQL

1 — Creating a SparkSession

A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read parquet files.

# Import SparkSession from pyspark.sql
from pyspark.sql import SparkSession

# Create my_spark
my_spark = SparkSession.builder.getOrCreate()
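The builder can also take an application name and configuration values before getOrCreate(). The name and setting below are only placeholders for illustration. Note that in the pyspark shell a session named spark is already created for you, which is the name the remaining examples use.

# Optionally name and configure the session (values are illustrative)
my_spark = (SparkSession.builder
            .appName("spark-sql-basics")
            .config("spark.sql.shuffle.partitions", "8")
            .getOrCreate())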
2 — Creating DataFrames

There are many ways to create DataFrames in Spark. One of them is reading from a Spark data source, as in the example below (here, the BigQuery connector).

# Creating a DataFrame from a data source
df = (spark.read.format('bigquery')
      .option('project', '<your_project_ID>')
      .option('table', '<your_table_name>')
      .load())
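Other built-in data sources follow the same read pattern. A small sketch, with placeholder file paths, assuming CSV and JSON files exist at those locations:

# Reading a CSV file with a header row
csv_df = spark.read.format('csv').option('header', 'true').load('/path/to/orders.csv')

# Reading a JSON file via the shorthand reader
json_df = spark.read.json('/path/to/orders.json')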
3 — Inspecting Data

After you've created your DataFrame, the next thing you'll probably want to do is some quick inspection of your data. Here are a few commands!

# Print the schema of df
df.printSchema()

# Display the content of df
df.show()

# Display the first 5 rows of df
df.show(5)

# Print my_spark
print(my_spark)

# Print the tables in the catalog
print(spark.catalog.listTables())
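A few more inspection helpers can be handy at this stage. This is an optional sketch; it only relies on standard DataFrame attributes and methods.

# Summary statistics for the numeric columns
df.describe().show()

# Column names and their data types
print(df.columns)
print(df.dtypes)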
Manipulating Data
4 — Creating columns

Let's say you want to create a new column named bonus_quantity and display everything in a new DataFrame called newdf.

# Creating or replacing a local temporary view with this DataFrame
df.createOrReplaceTempView("people")

# Define my query
query = "SELECT *, (order_quantity*0.3) as bonus_quantity from people"
newdf = spark.sql(query)

# Display the content of the new DataFrame
newdf.show()
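The same column can also be added without SQL, through the DataFrame API. A sketch of the equivalent withColumn call:

# Import the column helper
from pyspark.sql.functions import col

# Equivalent of the SQL query above
newdf = df.withColumn("bonus_quantity", col("order_quantity") * 0.3)
newdf.show()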
5 — Selecting

.select

You can select a single column with:

newdf.select("customer_id").show()
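.select also accepts several columns at once, and .selectExpr lets you use SQL expressions inside the selection. The column names below (customer_id, order_id, bonus_quantity) are carried over from the earlier examples and are assumptions about the table's schema.

# Selecting several columns at once
newdf.select("customer_id", "order_id", "bonus_quantity").show()

# selectExpr allows SQL expressions in the selection
newdf.selectExpr("customer_id", "bonus_quantity * 2 as double_bonus").show()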
6 — Filtering

.filter

Keep only the rows where order_quantity > 10:

# Filtering
df.filter(df["order_quantity"] > 10).show()
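.filter also accepts the condition as a SQL-style string, and conditions can be combined with & and |. A sketch (the second condition is illustrative):

# The same filter written as a SQL expression string
df.filter("order_quantity > 10").show()

# Combining conditions
df.filter((df["order_quantity"] > 10) & (df["order_id"].isNotNull())).show()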
7 — Aggregating

.min() .max() .count()

All of the common aggregation methods, like .min(), .max(), and .count(), are GroupedData methods. A GroupedData object is created by calling the .groupBy() DataFrame method. For example, to find the minimum value of the order_quantity column in a DataFrame, df, you could do:

df.groupBy().min("order_quantity").show()

This creates a GroupedData object (so you can use the .min() method), then finds the minimum value of order_quantity and returns it as a DataFrame.
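The other GroupedData methods work the same way. A short sketch using the same column:

# Maximum of the column
df.groupBy().max("order_quantity").show()

# Number of rows in the DataFrame
df.groupBy().count().show()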
8 — Grouping and Aggregating

.groupBy() | .avg()

For example, say you want to calculate the average order_quantity grouped by order_id:

# Calculate average order_quantity, grouped by order_id
df.groupBy("order_id").avg("order_quantity").show()

When you pass the name of one or more columns in your DataFrame to the .groupBy() method, the aggregation methods behave just like a GROUP BY statement in a SQL query:

SELECT order_id, avg(order_quantity)
FROM df
GROUP BY order_id
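If you need more than one aggregation per group, .agg() can take several aggregate functions from pyspark.sql.functions in a single pass. A sketch on the same columns:

# Import the aggregate helpers (max is aliased to avoid shadowing the builtin)
from pyspark.sql.functions import avg, max as max_

# Several aggregations per group in one pass
df.groupBy("order_id").agg(avg("order_quantity"), max_("order_quantity")).show()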
9 — Running Queries Programmatically

The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame.

You can register a DataFrame as a temporary view and run SQL queries against it.

# Creating or replacing a local temporary view with this DataFrame
df.createOrReplaceTempView("people")

# SQL statements can be run by using the sql method
query = "SELECT order_id, order_quantity from people where order_quantity < 10"
peopleCountDf = spark.sql(query)

# Display the content of the result DataFrame
peopleCountDf.show()
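If the view needs to be visible across SparkSessions in the same application, it can be registered as a global temporary view instead, which lives in the global_temp database. A minimal sketch (the view name is illustrative):

# Register a global temporary view
df.createOrReplaceGlobalTempView("people_global")

# Global temp views are qualified with the global_temp database
spark.sql("SELECT count(*) FROM global_temp.people_global").show()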
Happy Learning
