Spark SQL
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nik").getOrCreate()
df = spark.read.csv("simple-zipcodes.csv", header=True)
df.printSchema()
root
|-- RecordNumber: string (nullable = true)
|-- Country: string (nullable = true)
|-- City: string (nullable = true)
|-- Zipcode: string (nullable = true)
|-- State: string (nullable = true)
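With header=True alone, every column is read as a string. A quick sketch (same file; df_typed is an illustrative name) of letting Spark infer column types instead:
# Re-read with inferSchema so numeric-looking columns are inferred as integers
df_typed = spark.read.csv("simple-zipcodes.csv", header=True, inferSchema=True)
df_typed.printSchema()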
df.show()
+------------+-------+-------------------+-------+-----+
|RecordNumber|Country| City|Zipcode|State|
+------------+-------+-------------------+-------+-----+
| 1| US| PARC PARQUE| 704| PR|
| 2| US|PASEO COSTA DEL SUR| 704| PR|
| 10| US| BDA SAN LUIS| 709| PR|
| 49347| US| HOLT| 32564| FL|
| 49348| US| HOMOSASSA| 34487| FL|
| 61391| US| CINGULAR WIRELESS| 76166| TX|
| 61392| US| FORT WORTH| 76177| TX|
| 61393| US| FT WORTH| 76177| TX|
| 54356| US| SPRUCE PINE| 35585| AL|
| 76511| US| ASH HILL| 27007| NC|
| 4| US| URB EUGENE RICE| 704| PR|
| 39827| US| MESA| 85209| AZ|
| 39828| US| MESA| 85210| AZ|
| 49345| US| HILLIARD| 32046| FL|
| 49346| US| HOLDER| 34445| FL|
| 3| US| SECT LANAUSSE| 704| PR|
| 54354| US| SPRING GARDEN| 36275| AL|
| 54355| US| SPRINGVILLE| 35146| AL|
| 76512| US| ASHEBORO| 27203| NC|
| 76513| US| ASHEBORO| 27204| NC|
+------------+-------+-------------------+-------+-----+
# To run SQL queries on Spark, first register the DataFrame as a temporary view
df.createOrReplaceTempView("Zipcodes")
df.select("country","city","zipcode","state").show(5)
+-------+-------------------+-------+-----+
|country| city|zipcode|state|
+-------+-------------------+-------+-----+
| US| PARC PARQUE| 704| PR|
| US|PASEO COSTA DEL SUR| 704| PR|
| US| BDA SAN LUIS| 709| PR|
| US| HOLT| 32564| FL|
| US| HOMOSASSA| 34487| FL|
+-------+-------------------+-------+-----+
only showing top 5 rows
# In SQL, you can achieve the same with a SELECT ... FROM clause on the temporary view:
spark.sql("SELECT country, city, zipcode, state FROM Zipcodes").show(5)
Filter Rows
To filter rows, use the where() function from the DataFrame API.
df.select("country","city","zipcode","state").where("state == 'AZ'").show(5)
+-------+----+-------+-----+
|country|city|zipcode|state|
+-------+----+-------+-----+
| US|MESA| 85209| AZ|
| US|MESA| 85210| AZ|
+-------+----+-------+-----+
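where() also accepts column expressions instead of a SQL string; a minimal equivalent sketch using col():
from pyspark.sql.functions import col

# Same filter expressed as a column expression
df.select("country","city","zipcode","state") \
  .where(col("state") == "AZ") \
  .show(5)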
spark.sql(""" SELECT country, city, zipcode, state FROM ZIPCODES WHERE state = 'AZ' """).show(5)
+-------+----+-------+-----+
|country|city|zipcode|state|
+-------+----+-------+-----+
| US|MESA| 85209| AZ|
| US|MESA| 85210| AZ|
+-------+----+-------+-----+
Sorting
To sort rows on a specific column, use the orderBy() function on the DataFrame API.
df.select("country","city","zipcode","state")\
.where("state in ('PR','AZ','FL')") \
.orderBy("state") \
.show(10)
+-------+-------------------+-------+-----+
|country| city|zipcode|state|
+-------+-------------------+-------+-----+
| US| MESA| 85209| AZ|
| US| MESA| 85210| AZ|
| US| HOLT| 32564| FL|
| US| HOMOSASSA| 34487| FL|
| US| HILLIARD| 32046| FL|
| US| HOLDER| 34445| FL|
| US| PARC PARQUE| 704| PR|
| US|PASEO COSTA DEL SUR| 704| PR|
| US| BDA SAN LUIS| 709| PR|
| US| URB EUGENE RICE| 704| PR|
+-------+-------------------+-------+-----+
only showing top 10 rows
# SQL equivalent using ORDER BY on the temporary view
spark.sql(""" SELECT country, city, zipcode, state FROM Zipcodes
              WHERE state IN ('PR','AZ','FL') ORDER BY state """).show(10)
+-------+-------------------+-------+-----+
|country| city|zipcode|state|
+-------+-------------------+-------+-----+
| US| MESA| 85209| AZ|
| US| MESA| 85210| AZ|
| US| HOLT| 32564| FL|
| US| HOMOSASSA| 34487| FL|
| US| HILLIARD| 32046| FL|
| US| HOLDER| 34445| FL|
| US| PARC PARQUE| 704| PR|
| US|PASEO COSTA DEL SUR| 704| PR|
| US| BDA SAN LUIS| 709| PR|
| US| URB EUGENE RICE| 704| PR|
+-------+-------------------+-------+-----+
only showing top 10 rows
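orderBy() sorts ascending by default; a sketch of sorting the same selection in descending order with desc():
from pyspark.sql.functions import col

# Descending sort on the state column
df.select("country","city","zipcode","state") \
  .where("state in ('PR','AZ','FL')") \
  .orderBy(col("state").desc()) \
  .show(10)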
Grouping
To aggregate rows, use the groupBy() function on the DataFrame API; count() returns the number of rows in each group.
df.groupBy("state").count() \
  .show()
+-----+-----+
|state|count|
+-----+-----+
| AZ| 2|
| NC| 3|
| AL| 3|
| TX| 3|
| FL| 4|
| PR| 5|
+-----+-----+
# SQL equivalent using GROUP BY on the temporary view
spark.sql(""" SELECT state, count(*) AS count FROM Zipcodes
              GROUP BY state """).show()
+-----+-----+
|state|count|
+-----+-----+
| AZ| 2|
| NC| 3|
| AL| 3|
| TX| 3|
| FL| 4|
| PR| 5|
+-----+-----+
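groupBy() is not limited to count(); a sketch (the aggregate choices here are illustrative) of attaching several aggregations with agg():
from pyspark.sql.functions import count, countDistinct

# Count rows and distinct zip codes per state
df.groupBy("state") \
  .agg(count("zipcode").alias("zip_count"),
       countDistinct("zipcode").alias("distinct_zips")) \
  .show()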
SQL Join Operations
Similarly, if you have two tables, you can perform join operations in Spark. Here is an example.
emp = [(1,"Smith",-1,"2018","10","M",3000),
  (2,"Rose",1,"2010","20","M",4000),
  (3,"Williams",1,"2010","10","M",1000),
  (4,"Jones",2,"2005","10","F",2000),
  (5,"Brown",2,"2010","40","",-1),
  (6,"Brown",2,"2010","50","",-1)
]
empColumns = ["emp_id","name","superior_emp_id","year_joined","emp_dept_id","gender","salary"]
empDF = spark.createDataFrame(emp, empColumns)
empDF.show()
+------+--------+---------------+-----------+-----------+------+------+
|emp_id| name|superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
| 1| Smith| -1| 2018| 10| M| 3000|
| 2| Rose| 1| 2010| 20| M| 4000|
| 3|Williams| 1| 2010| 10| M| 1000|
| 4| Jones| 2| 2005| 10| F| 2000|
| 5| Brown| 2| 2010| 40| | -1|
| 6| Brown| 2| 2010| 50| | -1|
+------+--------+---------------+-----------+-----------+------+------+
dept = [("Finance",10),
  ("Marketing",20),
  ("Sales",30),
  ("IT",40)
]
deptColumns = ["dept_name","dept_id"]
deptDF = spark.createDataFrame(dept, deptColumns)
deptDF.show()
+---------+-------+
|dept_name|dept_id|
+---------+-------+
| Finance| 10|
|Marketing| 20|
| Sales| 30|
| IT| 40|
+---------+-------+
# To run SQL join queries, register both DataFrames as temporary views
empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")
# SQL inner join on the temporary views
spark.sql(""" SELECT * FROM EMP e INNER JOIN DEPT d
              ON e.emp_dept_id = d.dept_id """).show()
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id| name|superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
| 1| Smith| -1| 2018| 10| M| 3000| Finance| 10|
| 3|Williams| 1| 2010| 10| M| 1000| Finance| 10|
| 4| Jones| 2| 2005| 10| F| 2000| Finance| 10|
| 2| Rose| 1| 2010| 20| M| 4000|Marketing| 20|
| 5| Brown| 2| 2010| 40| | -1| IT| 40|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
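The same result can be produced without SQL through the DataFrame API's join(); a minimal sketch equivalent to the query above:
# Inner join via the DataFrame API
empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "inner").show()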