Question SQL - PySpark

The document provides a comparison of PySpark queries and their equivalent SQL queries for various data manipulation tasks. It includes operations such as selecting columns, filtering rows, counting, grouping, joining tables, and calculating aggregates. Additionally, it covers advanced techniques like window functions, pivoting tables, and using common table expressions (CTEs).
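
The snippets below all operate on a Spark DataFrame named df (plus df1/df2 for the join examples) and, for spark.sql calls, on a registered view. As a point of reference, here is a minimal setup sketch; the SparkSession settings and the sample rows are illustrative assumptions, while the column names (name, age, department, salary) follow the examples in the table.

  from pyspark.sql import SparkSession

  # Minimal local session; the examples below assume an existing DataFrame `df`.
  spark = SparkSession.builder.appName("sql-vs-pyspark").getOrCreate()

  # Hypothetical sample data matching the column names used in the examples.
  df = spark.createDataFrame(
      [("Alice", 31, "HR", 50000), ("Bob", 28, "IT", 60000)],
      ["name", "age", "department", "salary"],
  )

  # Registering a temp view lets the same data be queried with spark.sql(...).
  df.createOrReplaceTempView("table_name")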

Question: Select all columns from a table.
PySpark: df.select("*").show()
SQL: SELECT * FROM table_name;

Question: Select specific columns (e.g., name, age) from a table.
PySpark: df.select("name", "age").show()
SQL: SELECT name, age FROM table_name;

Question: Filter rows where age is greater than 30.
PySpark: df.filter(df.age > 30).show()
SQL: SELECT * FROM table_name WHERE age > 30;

Question: Count the number of rows in a table.
PySpark: df.count()
SQL: SELECT COUNT(*) FROM table_name;

Question: Group by a column (e.g., department) and count the number of rows.
PySpark: df.groupBy("department").count().show()
SQL: SELECT department, COUNT(*) FROM table_name GROUP BY department;

Question: Calculate the average of a column (e.g., salary).
PySpark:
  from pyspark.sql.functions import avg
  df.select(avg("salary")).show()
SQL: SELECT AVG(salary) FROM table_name;

Question: Join two tables (df1 and df2) on a common column (e.g., id).
PySpark: df1.join(df2, "id", "inner").show()
SQL: SELECT * FROM table1 INNER JOIN table2 ON table1.id = table2.id;

Question: Perform a left join on two tables.
PySpark: df1.join(df2, "id", "left").show()
SQL: SELECT * FROM table1 LEFT JOIN table2 ON table1.id = table2.id;

Question: Find duplicate rows based on a column (e.g., email).
PySpark: df.groupBy("email").count().filter("count > 1").show()
SQL: SELECT email, COUNT(*) FROM table_name GROUP BY email HAVING COUNT(*) > 1;

Question: Rank rows based on a column (e.g., salary) using window functions.
PySpark:
  from pyspark.sql.window import Window
  from pyspark.sql.functions import rank
  window = Window.orderBy("salary")
  df.withColumn("rank", rank().over(window)).show()
SQL: SELECT *, RANK() OVER (ORDER BY salary) AS rank FROM table_name;

Question: Calculate the cumulative sum of a column (e.g., sales).
PySpark:
  from pyspark.sql.window import Window
  from pyspark.sql.functions import sum
  window = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
  df.withColumn("cumulative_sum", sum("sales").over(window)).show()
SQL:
  SELECT date, sales,
         SUM(sales) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum
  FROM table_name;

Question: Pivot a table to transform rows into columns (e.g., year as columns).
PySpark:
  from pyspark.sql.functions import sum
  df.groupBy("product").pivot("year").agg(sum("sales")).show()
SQL:
  SELECT product,
         SUM(CASE WHEN year = 2021 THEN sales END) AS "2021",
         SUM(CASE WHEN year = 2022 THEN sales END) AS "2022"
  FROM table_name
  GROUP BY product;

Question: Find the third highest value in a column (e.g., salary).
PySpark:
  from pyspark.sql.window import Window
  from pyspark.sql.functions import row_number, col, desc
  window = Window.orderBy(desc("salary"))
  df.withColumn("row_num", row_number().over(window)).filter(col("row_num") == 3).show()
SQL:
  WITH RankedSalaries AS (
    SELECT salary,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank
    FROM employees
  )
  SELECT salary
  FROM RankedSalaries
  WHERE dense_rank = 3;

Question: Calculate the difference between two columns (e.g., revenue - cost).
PySpark: df.withColumn("profit", df.revenue - df.cost).show()
SQL: SELECT revenue - cost AS profit FROM table_name;

Question: Filter rows where a column value is in a list (e.g., id in [1, 2, 3]).
PySpark: df.filter(df.id.isin([1, 2, 3])).show()
SQL: SELECT * FROM table_name WHERE id IN (1, 2, 3);

Question: Find the top N rows based on a column (e.g., salary).
PySpark:
  from pyspark.sql.functions import desc
  df.orderBy(desc("salary")).limit(5).show()
SQL: SELECT * FROM table_name ORDER BY salary DESC LIMIT 5;

Question: Replace null values in a column with a default value (e.g., 0).
PySpark: df.na.fill(0, subset=["column_name"]).show()
SQL: SELECT COALESCE(column_name, 0) FROM table_name;

Question: Concatenate two columns (e.g., first_name and last_name).
PySpark:
  from pyspark.sql.functions import concat
  df.withColumn("full_name", concat(df.first_name, df.last_name)).show()
SQL: SELECT CONCAT(first_name, last_name) AS full_name FROM table_name;

Question: Extract the year from a date column.
PySpark:
  from pyspark.sql.functions import year
  df.withColumn("year", year("date_column")).show()
SQL: SELECT EXTRACT(YEAR FROM date_column) AS year FROM table_name;

Question: Calculate the percentage of total for each row (e.g., sales).
PySpark:
  from pyspark.sql.window import Window
  from pyspark.sql.functions import sum
  window = Window.partitionBy()
  df.withColumn("percentage", (df.sales / sum("sales").over(window)) * 100).show()
SQL: SELECT sales, (sales / SUM(sales) OVER ()) * 100 AS percentage FROM table_name;

Question: Find the nth highest salary department-wise.
PySpark:
  from pyspark.sql.window import Window
  from pyspark.sql.functions import row_number, col, desc
  window = Window.partitionBy("department").orderBy(desc("salary"))
  df.withColumn("row_num", row_number().over(window)).filter(col("row_num") == n).show()
SQL:
  WITH RankedSalaries AS (
    SELECT department, salary,
           ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS row_num
    FROM table_name
  )
  SELECT * FROM RankedSalaries WHERE row_num = n;

Question: Use a CTE to find employees with salary greater than the average salary.
PySpark:
  from pyspark.sql.functions import avg
  avg_salary = df.select(avg("salary")).collect()[0][0]
  df.filter(df.salary > avg_salary).show()
SQL:
  WITH AvgSalary AS (
    SELECT AVG(salary) AS avg_salary FROM table_name
  )
  SELECT * FROM table_name WHERE salary > (SELECT avg_salary FROM AvgSalary);

Question: Find duplicates using a subquery.
PySpark:
  df.createOrReplaceTempView("temp_table")
  spark.sql("SELECT * FROM temp_table WHERE (name, age) IN (SELECT name, age FROM temp_table GROUP BY name, age HAVING COUNT(*) > 1)").show()
SQL:
  SELECT * FROM table_name WHERE (name, age) IN (SELECT name, age FROM table_name GROUP BY name, age HAVING COUNT(*) > 1);

Question: Find the 3rd highest salary in each department using a window function.
PySpark:
  from pyspark.sql.window import Window
  from pyspark.sql.functions import dense_rank, col, desc
  window = Window.partitionBy("department").orderBy(desc("salary"))
  df.withColumn("rank", dense_rank().over(window)).filter(col("rank") == 3).show()
SQL:
  WITH RankedSalaries AS (
    SELECT department, salary,
           DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank
    FROM table_name
  )
  SELECT * FROM RankedSalaries WHERE rank = 3;

Question: Write a query to obtain the third transaction of every user.
SQL:
  WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY transaction_date) AS row_num
    FROM transactions
  )
  SELECT user_id, spend, transaction_date FROM cte WHERE row_num = 3;
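
The source gives only the SQL here; a minimal PySpark sketch follows, assuming the transactions table is loaded into a DataFrame called transactions_df (a hypothetical name) with user_id, spend, and transaction_date columns, mirroring the SQL above.

  from pyspark.sql.window import Window
  from pyspark.sql.functions import row_number, col

  # transactions_df is assumed to hold the same data as the `transactions` table.
  # Number each user's transactions by date, then keep the third one per user.
  window = Window.partitionBy("user_id").orderBy("transaction_date")
  (transactions_df
      .withColumn("row_num", row_number().over(window))
      .filter(col("row_num") == 3)
      .select("user_id", "spend", "transaction_date")
      .show())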

Question: Find the second highest salary among all employees.
SQL:
  WITH cte AS (
    SELECT salary,
           ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num
    FROM employee
  )
  SELECT salary AS second_highest_salary FROM cte WHERE row_num = 2;
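
No PySpark version is shown in the source; a minimal sketch, assuming the employee table is available as a DataFrame named employee_df (hypothetical) with a salary column:

  from pyspark.sql.window import Window
  from pyspark.sql.functions import row_number, desc, col

  # employee_df is assumed to hold the same data as the `employee` table.
  # Rank salaries in descending order and keep the second row.
  window = Window.orderBy(desc("salary"))
  (employee_df
      .withColumn("row_num", row_number().over(window))
      .filter(col("row_num") == 2)
      .selectExpr("salary AS second_highest_salary")
      .show())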

Question: Tweets' rolling averages (3-day rolling average of tweet counts per user).
SQL:
  SELECT user_id, tweet_date,
         ROUND(AVG(tweet_count) OVER (PARTITION BY user_id ORDER BY tweet_date
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW), 2) AS rolling_avg_3d
  FROM tweets;
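
The SQL-only rolling-average query can be mirrored in PySpark with a bounded window; a sketch, assuming the tweets table is loaded as a DataFrame named tweets_df (hypothetical) with user_id, tweet_date, and tweet_count columns:

  from pyspark.sql.window import Window
  from pyspark.sql.functions import avg, round as spark_round

  # tweets_df is assumed to hold the same data as the `tweets` table.
  # 3-row rolling window per user: the current row and the two preceding rows by date.
  window = (Window.partitionBy("user_id")
                  .orderBy("tweet_date")
                  .rowsBetween(-2, Window.currentRow))
  (tweets_df
      .withColumn("rolling_avg_3d", spark_round(avg("tweet_count").over(window), 2))
      .select("user_id", "tweet_date", "rolling_avg_3d")
      .show())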
