
SQL & PySpark Equivalence: A Comprehensive Guide

Structured Query Language (SQL) and PySpark are both powerful tools for handling large-scale
data processing. SQL is widely used for querying and manipulating structured data in relational
databases, while PySpark, built on Apache Spark, is designed for distributed computing and big
data analytics.
Understanding the equivalence between SQL and PySpark is crucial for data engineers and analysts
working in hybrid environments where both technologies are used. SQL provides a declarative way
to interact with data, whereas PySpark leverages Resilient Distributed Datasets (RDDs) and
DataFrames to perform transformations and actions efficiently across distributed systems.
This guide presents a side-by-side comparison of key SQL operations and their equivalent PySpark
implementations. It covers data types, database and table operations, data selection, filtering,
aggregations, joins, window functions, and more, helping professionals transition smoothly
between the two technologies.

1. Data Types

SQL Data Type        | PySpark Equivalent
INT                  | IntegerType()
BIGINT               | LongType()
FLOAT                | FloatType()
DOUBLE               | DoubleType()
CHAR(n) / VARCHAR(n) | StringType()
DATE                 | DateType()
TIMESTAMP            | TimestampType()
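
These types live in pyspark.sql.types and most often appear when casting columns or defining schemas. A minimal sketch (the column names and sample row are illustrative, not from the guide):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()

    # A small DataFrame of strings, then cast to the types listed above.
    df = spark.createDataFrame([("1", "2.5", "2024-01-01")], ["id", "price", "order_date"])
    typed_df = (
        df.withColumn("id", col("id").cast("int"))                   # INT -> IntegerType()
          .withColumn("price", col("price").cast("double"))          # DOUBLE -> DoubleType()
          .withColumn("order_date", col("order_date").cast("date"))  # DATE -> DateType()
    )
    typed_df.printSchema()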

2. Database & Table Operations

Concept         | SQL Query                                        | PySpark Equivalent
Create Database | CREATE DATABASE db_name;                         | spark.sql("CREATE DATABASE db_name")
Use Database    | USE db_name;                                     | spark.catalog.setCurrentDatabase("db_name")
Drop Database   | DROP DATABASE db_name;                           | spark.sql("DROP DATABASE db_name")
Show Databases  | SHOW DATABASES;                                  | spark.sql("SHOW DATABASES").show()
Create Table    | CREATE TABLE table_name (col1 INT, col2 STRING); | df.write.format("parquet").saveAsTable("table_name")
Drop Table      | DROP TABLE table_name;                           | spark.sql("DROP TABLE IF EXISTS table_name")
Truncate Table  | TRUNCATE TABLE table_name;                       | spark.sql("TRUNCATE TABLE table_name")
Describe Table  | DESCRIBE TABLE table_name;                       | df.printSchema()
Show Tables     | SHOW TABLES;                                     | spark.sql("SHOW TABLES").show()
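
A short sketch tying these together, assuming an active SparkSession named spark and a DataFrame df from earlier; the database and table names (sales_db, orders) are illustrative:

    spark.sql("CREATE DATABASE IF NOT EXISTS sales_db")
    spark.catalog.setCurrentDatabase("sales_db")

    # Register the DataFrame as a managed table in the current database.
    df.write.format("parquet").mode("overwrite").saveAsTable("orders")

    spark.sql("SHOW TABLES").show()
    spark.sql("DESCRIBE TABLE orders").show(truncate=False)
    spark.sql("DROP TABLE IF EXISTS orders")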

3. Table Alterations

Concept       | SQL Query                                                  | PySpark Equivalent
Add Column    | ALTER TABLE table_name ADD COLUMN col3 STRING;             | df.withColumn("col3", lit(None).cast("string"))
Rename Column | ALTER TABLE table_name RENAME COLUMN old_name TO new_name; | df.withColumnRenamed("old_name", "new_name")
Drop Column   | ALTER TABLE table_name DROP COLUMN col3;                   | df.drop("col3")
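
Unlike ALTER TABLE, these DataFrame calls do not modify a stored table; they return a new DataFrame that must be written back if the change should persist. A sketch, assuming df has a column named old_name:

    from pyspark.sql.functions import lit

    altered_df = (
        df.withColumn("col3", lit(None).cast("string"))  # add a nullable string column
          .withColumnRenamed("old_name", "new_name")     # rename an existing column
          .drop("col3")                                  # drop the column again
    )
    altered_df.printSchema()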

4. Partitioning & Bucketing

Concept                       | SQL Query                                                                             | PySpark Equivalent
Create Partitioned Table      | CREATE TABLE table_name (col1 INT, col2 STRING) PARTITIONED BY (col3 STRING);         | df.write.partitionBy("col3").format("parquet").saveAsTable("table_name")
Insert into Partitioned Table | INSERT INTO table_name PARTITION (col3='value') SELECT col1, col2 FROM source_table;  | df.write.mode("append").partitionBy("col3").saveAsTable("table_name")
Create Bucketed Table         | CREATE TABLE table_name (col1 INT, col2 STRING) CLUSTERED BY (col1) INTO 10 BUCKETS;  | df.write.bucketBy(10, "col1").saveAsTable("table_name")
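
A sketch of both write paths, assuming df contains col1, col2 and col3; the output table names are illustrative. Note that bucketBy only works with saveAsTable (a metastore-backed table), not with a plain path:

    # Partitioned table: one directory per distinct col3 value.
    (df.write
       .mode("overwrite")
       .partitionBy("col3")
       .format("parquet")
       .saveAsTable("table_partitioned"))

    # Bucketed table: rows hashed on col1 into 10 buckets.
    (df.write
       .mode("overwrite")
       .bucketBy(10, "col1")
       .sortBy("col1")
       .saveAsTable("table_bucketed"))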

5. Views (Temporary & Permanent)

Concept            | SQL Query                                                            | PySpark Equivalent
Create View        | CREATE VIEW view_name AS SELECT * FROM table_name;                   | df.createOrReplaceTempView("view_name")
Drop View          | DROP VIEW view_name;                                                 | spark.sql("DROP VIEW IF EXISTS view_name")
Create Global View | CREATE GLOBAL TEMPORARY VIEW view_name AS SELECT * FROM table_name; | df.createGlobalTempView("view_name")
Show Views         | SHOW VIEWS;                                                          | spark.sql("SHOW VIEWS").show()
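
Global temporary views live in the reserved global_temp database and must be qualified when queried. A sketch, assuming spark and df from earlier; the view names are illustrative:

    df.createOrReplaceTempView("view_name")
    spark.sql("SELECT * FROM view_name").show()

    df.createOrReplaceGlobalTempView("view_name_global")
    spark.sql("SELECT * FROM global_temp.view_name_global").show()

    spark.sql("DROP VIEW IF EXISTS view_name")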


6. Schema Management

Concept                  | SQL Query                                                   | PySpark Equivalent
Define Schema Manually   | CREATE TABLE table_name (col1 INT, col2 STRING, col3 DATE); | from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType; schema = StructType([StructField("col1", IntegerType(), True), StructField("col2", StringType(), True), StructField("col3", DateType(), True)])
Check Schema             | DESCRIBE TABLE table_name;                                  | df.printSchema()
Change Column Data Type  | ALTER TABLE table_name ALTER COLUMN col1 TYPE BIGINT;       | df.withColumn("col1", col("col1").cast("bigint"))
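
A manually defined schema is most useful when reading files, since it avoids schema inference. A sketch; the CSV path is a placeholder:

    from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType
    from pyspark.sql.functions import col

    schema = StructType([
        StructField("col1", IntegerType(), True),
        StructField("col2", StringType(), True),
        StructField("col3", DateType(), True),
    ])

    df = spark.read.schema(schema).option("header", True).csv("path/to/data.csv")
    df = df.withColumn("col1", col("col1").cast("bigint"))  # widen col1 to BIGINT
    df.printSchema()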

7. File-Based Table Operations

Concept             | SQL Query                                            | PySpark Equivalent
Save as Parquet     | N/A (implicit in Hive)                               | df.write.format("parquet").save("path/to/parquet")
Save as Delta Table | CREATE TABLE table_name USING DELTA LOCATION 'path'; | df.write.format("delta").save("path/to/delta")
Save as CSV         | N/A                                                  | df.write.format("csv").option("header", True).save("path/to/csv")
Save as JSON        | N/A                                                  | df.write.format("json").save("path/to/json")
Save as ORC         | N/A                                                  | df.write.format("orc").save("path/to/orc")
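
A sketch writing a few formats and reading them back; the paths are placeholders, and the delta format additionally assumes the delta-spark package is available on the cluster:

    df.write.mode("overwrite").format("parquet").save("path/to/parquet")
    df.write.mode("overwrite").format("csv").option("header", True).save("path/to/csv")
    df.write.mode("overwrite").format("json").save("path/to/json")

    # Reading back mirrors the writer API.
    parquet_df = spark.read.parquet("path/to/parquet")
    csv_df = spark.read.option("header", True).csv("path/to/csv")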

8. Basic SELECT Queries

Concept                 | SQL Query                                    | PySpark Equivalent
Select specific columns | SELECT column1, column2 FROM table;          | df.select("column1", "column2")
Select all columns      | SELECT * FROM table;                         | df.select("*")
Distinct values         | SELECT DISTINCT column FROM table;           | df.select("column").distinct()
WHERE condition         | SELECT * FROM table WHERE column = 'value';  | df.filter(col("column") == 'value')
ORDER BY                | SELECT * FROM table ORDER BY column;         | df.sort("column")
LIMIT rows              | SELECT * FROM table LIMIT n;                 | df.limit(n)
COUNT rows              | SELECT COUNT(*) FROM table;                  | df.count()
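
These calls chain naturally into one query-like pipeline. A sketch with placeholder column names:

    from pyspark.sql.functions import col

    result = (
        df.select("column1", "column2")
          .filter(col("column1") == "value")   # WHERE column1 = 'value'
          .distinct()                          # SELECT DISTINCT
          .orderBy(col("column2").desc())      # ORDER BY column2 DESC
          .limit(10)                           # LIMIT 10
    )
    result.show()
    print(result.count())                      # row count of the limited result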

9. Aggregate Functions

Concept | SQL Query                      | PySpark Equivalent
SUM     | SELECT SUM(column) FROM table; | df.agg(sum("column"))
AVG     | SELECT AVG(column) FROM table; | df.agg(avg("column"))
MAX     | SELECT MAX(column) FROM table; | df.agg(max("column"))
MIN     | SELECT MIN(column) FROM table; | df.agg(min("column"))
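
In practice these are usually imported from pyspark.sql.functions under an alias (commonly F) so they do not shadow Python's built-in sum, max and min. A sketch computing several aggregates at once:

    from pyspark.sql import functions as F

    df.agg(
        F.sum("column").alias("total"),
        F.avg("column").alias("average"),
        F.max("column").alias("maximum"),
        F.min("column").alias("minimum"),
    ).show()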

10. String Functions

Concept              | SQL Query                                            | PySpark Equivalent
String Length        | SELECT LEN(column) FROM table;                       | df.select(length(col("column")))
Convert to Uppercase | SELECT UPPER(column) FROM table;                     | df.select(upper(col("column")))
Convert to Lowercase | SELECT LOWER(column) FROM table;                     | df.select(lower(col("column")))
Concatenate Strings  | SELECT CONCAT(string1, string2) FROM table;          | df.select(concat(col("string1"), col("string2")))
Trim String          | SELECT TRIM(column) FROM table;                      | df.select(trim(col("column")))
Substring            | SELECT SUBSTRING(column, start, length) FROM table;  | df.select(substring(col("column"), start, length))
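
A sketch applying several of these in one select; the column names are placeholders, and substring uses 1-based indexing:

    from pyspark.sql.functions import col, length, upper, trim, concat, lit, substring

    df.select(
        length(col("column")).alias("len"),
        upper(col("column")).alias("upper"),
        trim(col("column")).alias("trimmed"),
        concat(col("string1"), lit(" "), col("string2")).alias("joined"),
        substring(col("column"), 1, 3).alias("first_three"),
    ).show()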

11. Date & Time Functions

Concept           | SQL Query                                   | PySpark Equivalent
Current Date      | SELECT CURDATE();                           | df.select(current_date())
Current Timestamp | SELECT NOW();                               | df.select(current_timestamp())
CAST / CONVERT    | SELECT CAST(column AS datatype) FROM table; | df.select(col("column").cast("datatype"))
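
A sketch combining the two current-time functions with a cast; the column name is a placeholder:

    from pyspark.sql.functions import current_date, current_timestamp, col

    df.select(
        current_date().alias("today"),
        current_timestamp().alias("now"),
        col("column").cast("date").alias("as_date"),  # CAST(column AS DATE)
    ).show(truncate=False)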

12. Conditional Logic

Concept                | SQL Query                                               | PySpark Equivalent
IF (Conditional Logic) | SELECT IF(condition, value1, value2) FROM table;        | df.select(when(condition, value1).otherwise(value2))
COALESCE               | SELECT COALESCE(column1, column2, column3) FROM table;  | df.select(coalesce(col("column1"), col("column2"), col("column3")))
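
A sketch with a concrete (illustrative) condition and threshold; the column names are placeholders:

    from pyspark.sql.functions import when, coalesce, col, lit

    df.select(
        when(col("column1") > 100, lit("high")).otherwise(lit("low")).alias("bucket"),
        coalesce(col("column1"), col("column2"), col("column3")).alias("first_non_null"),
    ).show()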

13. Join, Grouping & Pivoting

Concept  | SQL Query                                                           | PySpark Equivalent
JOIN     | SELECT * FROM table1 JOIN table2 ON table1.column = table2.column;  | df1.join(df2, "column")
GROUP BY | SELECT column, agg_function(column) FROM table GROUP BY column;     | df.groupBy("column").agg(agg_function("column"))
PIVOT    | PIVOT (agg_function(column) FOR pivot_column IN (values));          | df.groupBy("group_column").pivot("pivot_column").agg(agg_function("column"))
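
A sketch of the three operations; df1, df2 and all column names (including amount) are placeholders:

    from pyspark.sql import functions as F

    joined = df1.join(df2, on="column", how="inner")

    grouped = df1.groupBy("column").agg(F.sum("amount").alias("total_amount"))

    pivoted = (
        df1.groupBy("group_column")   # values of group_column become rows
           .pivot("pivot_column")     # distinct values of pivot_column become columns
           .agg(F.sum("amount"))
    )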

14. Logical Operators

Concept               | SQL Query                                                     | PySpark Equivalent
AND / OR              | SELECT * FROM table WHERE column1 = value AND column2 > value; | df.filter((col("column1") == value) & (col("column2") > value))
IS NULL / IS NOT NULL | SELECT * FROM table WHERE column IS NULL;                     | df.filter(col("column").isNull())
LIKE                  | SELECT * FROM table WHERE column LIKE 'value%';               | df.filter(col("column").like("value%"))
BETWEEN               | SELECT * FROM table WHERE column BETWEEN value1 AND value2;   | df.filter((col("column") >= value1) & (col("column") <= value2))
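
Predicates combine with & (AND), | (OR) and ~ (NOT); each comparison needs its own parentheses because these operators bind more tightly than == and >. A sketch with placeholder columns:

    from pyspark.sql.functions import col

    df.filter(
        (col("column1") == "value")
        & (col("column2") > 10)
        & col("column3").isNotNull()
        & col("column4").like("value%")
        & col("column5").between(1, 100)   # shorthand for the BETWEEN row above
    ).show()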

15. Set Operations

Concept   | SQL Query                                                       | PySpark Equivalent
UNION     | SELECT column FROM table1 UNION SELECT column FROM table2;      | df1.select("column").union(df2.select("column")).distinct()
UNION ALL | SELECT column FROM table1 UNION ALL SELECT column FROM table2;  | df1.select("column").union(df2.select("column"))

Note that SQL UNION removes duplicates while DataFrame union() does not, so .distinct() is needed to match UNION; unionAll() is a deprecated alias of union().
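
A sketch; unionByName is worth knowing when the two DataFrames share column names but not column order:

    union_all_df = df1.select("column").union(df2.select("column"))  # UNION ALL
    union_df = union_all_df.distinct()                               # UNION

    by_name_df = df1.unionByName(df2)   # aligns columns by name, not position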

16. Window Functions

Concept                        | SQL Query                                                 | PySpark Equivalent
RANK / DENSE_RANK / ROW_NUMBER | SELECT column, RANK() OVER (ORDER BY column) FROM table;  | df.select("column", rank().over(Window.orderBy("column")).alias("rank"))
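
A sketch showing the three ranking functions over one ordering; without a partitionBy clause Spark moves all rows to a single partition, which is fine for small data but slow at scale:

    from pyspark.sql.window import Window
    from pyspark.sql.functions import rank, dense_rank, row_number

    w = Window.orderBy("column")

    df.select(
        "column",
        rank().over(w).alias("rank"),
        dense_rank().over(w).alias("dense_rank"),
        row_number().over(w).alias("row_number"),
    ).show()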

17. Common Table Expressions (CTEs)

Concept                        | SQL Query                                                                | PySpark Equivalent
CTE (Common Table Expressions) | WITH cte1 AS (SELECT * FROM table1) SELECT * FROM cte1 WHERE condition; | df.createOrReplaceTempView("cte1"); df_cte1 = spark.sql("SELECT * FROM cte1 WHERE condition")
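
A sketch of both styles; the filter condition is illustrative. In the pure-DataFrame style, a CTE is simply an intermediate DataFrame assigned to a variable:

    # SQL style: register the intermediate result as a temp view.
    df.createOrReplaceTempView("cte1")
    df_cte1 = spark.sql("SELECT * FROM cte1 WHERE column > 10")

    # DataFrame style: the "CTE" is just a named intermediate DataFrame.
    cte1 = df
    df_cte1 = cte1.filter("column > 10")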

18. Window Functions

Window functions allow calculations across a set of table rows related to the current row.

Concept      | SQL Query                                                                            | PySpark Equivalent
RANK()       | SELECT column, RANK() OVER (PARTITION BY col2 ORDER BY column) FROM table;           | df.withColumn("rank", rank().over(Window.partitionBy("col2").orderBy("column")))
DENSE_RANK() | SELECT column, DENSE_RANK() OVER (PARTITION BY col2 ORDER BY column) FROM table;     | df.withColumn("dense_rank", dense_rank().over(Window.partitionBy("col2").orderBy("column")))
ROW_NUMBER() | SELECT column, ROW_NUMBER() OVER (PARTITION BY col2 ORDER BY column) FROM table;     | df.withColumn("row_number", row_number().over(Window.partitionBy("col2").orderBy("column")))
LEAD()       | SELECT column, LEAD(column, 1) OVER (PARTITION BY col2 ORDER BY column) FROM table;  | df.withColumn("lead_value", lead("column", 1).over(Window.partitionBy("col2").orderBy("column")))
LAG()        | SELECT column, LAG(column, 1) OVER (PARTITION BY col2 ORDER BY column) FROM table;   | df.withColumn("lag_value", lag("column", 1).over(Window.partitionBy("col2").orderBy("column")))
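
A sketch that defines the partitioned window once and reuses it, assuming df contains col2 and column:

    from pyspark.sql.window import Window
    from pyspark.sql.functions import row_number, lead, lag

    w = Window.partitionBy("col2").orderBy("column")

    df_windowed = (
        df.withColumn("row_number", row_number().over(w))
          .withColumn("lead_value", lead("column", 1).over(w))  # next row within the col2 group
          .withColumn("lag_value", lag("column", 1).over(w))    # previous row within the col2 group
    )
    df_windowed.show()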
