SQL to PySpark
selecting columns
sql
PySpark
use case:
Extract only the necessary columns when performing analysis
or reporting.
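The original code panels are not preserved here; as a minimal sketch, assuming a DataFrame df over a hypothetical employees table with columns name and age:

# SQL: SELECT name, age FROM employees
df.select("name", "age")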
FILTERING ROWS
Application:
Extract rows that meet specific conditions, which is essential
for data cleaning or focused analysis
sql
PySpark
use case:
Narrow down the dataset to include only relevant records for
further analysis.
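A minimal sketch, assuming the same hypothetical df with an age column:

# SQL: SELECT * FROM employees WHERE age > 30
df.filter(df.age > 30)        # or: df.where("age > 30")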
subqueries
Application:
Filter data based on the results of another query, enabling
more complex conditions
sql
PySpark
use case:
Use nested queries to refine data based on complex, multi-
step logic.
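A minimal sketch, assuming hypothetical employees and departments DataFrames; a left-semi join is the usual DataFrame-API stand-in for an IN (subquery) filter:

# SQL: SELECT * FROM employees
#      WHERE department_id IN (SELECT id FROM departments WHERE region = 'EMEA')
emea_ids = departments.filter(departments.region == "EMEA").select("id")
employees.join(emea_ids, employees.department_id == emea_ids.id, "left_semi")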
Data grouping
and
aggregation
group by and agg
Application:
Summarize data by grouping records on one or more columns and
applying aggregate functions at that level.
sql
PySpark
use case:
Summarize large datasets into more understandable metrics,
such as totals, averages, or counts, depending on the type of
insight you want to gain.
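A minimal sketch, assuming hypothetical department and salary columns:

from pyspark.sql import functions as F

# SQL: SELECT department, AVG(salary) AS avg_salary, COUNT(*) AS cnt
#      FROM employees GROUP BY department
df.groupBy("department").agg(
    F.avg("salary").alias("avg_salary"),
    F.count("*").alias("cnt"),
)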
Data COMBINING
and
JOINING
Joins
Application:
Combine data from two or more tables based on a common
column to enrich datasets with related information
sql
use case:
Merge datasets to create a more comprehensive dataset, often
used in data warehousing and analytics.
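A minimal sketch, assuming hypothetical employees and departments DataFrames joined on a department id:

# SQL: SELECT e.*, d.dept_name
#      FROM employees e INNER JOIN departments d ON e.dept_id = d.id
employees.join(departments, employees.dept_id == departments.id, "inner")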
union/unionall
Application:
Use when you need to combine the results of two queries
sql
use case:
Combine datasets with the same structure into one
union: removes duplicates
union all: keeps all records
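A minimal sketch, assuming two DataFrames df1 and df2 with the same columns in the same order:

# SQL: SELECT * FROM t1 UNION ALL SELECT * FROM t2   -- keeps all rows
df1.union(df2)
# SQL: SELECT * FROM t1 UNION SELECT * FROM t2       -- removes duplicates
df1.union(df2).distinct()   # PySpark union() keeps duplicates, so add distinct()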
Data MANIPULATION
and
TRANSFORMATION
ADD NEW COLUMN
Application:
Create new columns derived from existing ones, enabling the
transformation of data into new features or metrics.
sql
PySpark
use case:
Add derived columns to the dataset, useful for feature
engineering in data science projects
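A minimal sketch, deriving a bonus from a hypothetical salary column:

from pyspark.sql import functions as F

# SQL: SELECT *, salary * 0.10 AS bonus FROM employees
df.withColumn("bonus", F.col("salary") * 0.10)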
renaming one COLUMN
Application:
Change column names for better readability or to align with
naming conventions
sql
PySpark
use case:
Standardize column names so they are easier to read and
consistent with downstream naming conventions.
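A minimal sketch, renaming a hypothetical salary column:

# SQL: SELECT salary AS base_salary FROM employees
df.withColumnRenamed("salary", "base_salary")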
renaming multiple COLUMNS
sql
PySpark
or
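A minimal sketch of the two usual approaches (chained renames, or a single mapping on Spark 3.4+), using hypothetical column names:

# SQL: SELECT fname AS first_name, lname AS last_name FROM employees
df.withColumnRenamed("fname", "first_name").withColumnRenamed("lname", "last_name")
# or, on Spark 3.4+:
df.withColumnsRenamed({"fname": "first_name", "lname": "last_name"})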
dropping columns
Application:
Remove unnecessary or redundant columns to simplify the
dataset
sql
PySpark
use case:
Streamline datasets by eliminating columns that are no longer
needed, which can also improve performance
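A minimal sketch, dropping hypothetical temp_id and notes columns:

# SQL: simply leave the unwanted columns out of the SELECT list
df.drop("temp_id", "notes")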
changing data types - 1
Application:
Convert a single column to a different data type to ensure
consistency and compatibility for further analysis.
sql
PySpark
use case:
Adjust data types to meet specific analysis requirements, such
as converting strings to integers for numeric operations
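A minimal sketch, casting a hypothetical age column stored as a string:

from pyspark.sql import functions as F

# SQL: SELECT CAST(age AS INT) AS age FROM employees
df.withColumn("age", F.col("age").cast("int"))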
changing data types - 2
Application:
Convert multiple columns to different data types to ensure
consistency and compatibility for further analysis
sql
PySpark
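A minimal sketch, casting hypothetical age and salary columns:

from pyspark.sql import functions as F

# SQL: SELECT CAST(age AS INT) AS age, CAST(salary AS DOUBLE) AS salary FROM employees
df.withColumn("age", F.col("age").cast("int")) \
  .withColumn("salary", F.col("salary").cast("double"))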
pivot
Application:
Use when you need to pivot data from rows to columns
sql
PySpark
use case:
Reshape data for better analysis or reporting.
pivot - Example
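The example panel is not preserved; a minimal sketch, assuming a hypothetical sales table with year, quarter, and revenue columns:

# Spark SQL: SELECT * FROM sales
#            PIVOT (SUM(revenue) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4'))
df.groupBy("year").pivot("quarter", ["Q1", "Q2", "Q3", "Q4"]).sum("revenue")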
Data cleaning
filtering non-null
Application:
Filter out rows that contain null values, keeping only complete
records to maintain the dataset's integrity.
sql
PySpark
use case:
Drop incomplete records before analysis so missing values do
not skew results.
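A minimal sketch, assuming a hypothetical email column:

from pyspark.sql import functions as F

# SQL: SELECT * FROM employees WHERE email IS NOT NULL
df.filter(F.col("email").isNotNull())   # or: df.na.drop(subset=["email"])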
removing duplicates
sql
PySpark
use case:
Eliminate repeated entries to maintain data uniqueness, crucial
for accurate analysis.
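A minimal sketch; dropDuplicates on a hypothetical key column is the targeted variant:

# SQL: SELECT DISTINCT * FROM employees
df.distinct()
# de-duplicate on specific columns only:
df.dropDuplicates(["employee_id"])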
dropping columns
Application:
Remove unnecessary columns from the dataset to simplify
analysis and reduce noise
sql
PySpark
use case:
Clean up the dataset by removing irrelevant or redundant
columns, improving the focus and efficiency of your analysis
Data organization
sorting data
Application:
Organize data in ascending or descending order for easier
analysis or presentation
sql
PySpark
use case:
Highlight trends, patterns, or outliers, often used in reporting
and data visualization
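A minimal sketch, assuming hypothetical salary and name columns:

from pyspark.sql import functions as F

# SQL: SELECT * FROM employees ORDER BY salary DESC, name ASC
df.orderBy(F.col("salary").desc(), F.col("name").asc())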
counting records
Application:
Determine the number of rows in a dataset, or count non-null
values across multiple columns
sql
PySpark
use case:
Quickly assess the size of a dataset and understand the
distribution of non-null values in multiple columns, often used
in the initial stages of data exploration
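A minimal sketch; count() returns the total row count, while count(col) inside agg counts only non-null values of that hypothetical column:

from pyspark.sql import functions as F

# SQL: SELECT COUNT(*) AS total, COUNT(email) AS with_email FROM employees
df.count()                                   # total number of rows
df.agg(F.count("*").alias("total"),
       F.count("email").alias("with_email"))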