SQL to PySpark


Data Selection and Filtering

Selecting Specific Columns
Application:
Retrieve only specific columns from a dataset, which improves
performance by reducing the amount of data processed.

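A minimal sketch of the SQL-to-PySpark mapping, assuming a small, hypothetical employees DataFrame (all table, column, and session names here are illustrative; the later snippets reuse this spark session, df, and the F alias):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("sql-to-pyspark").getOrCreate()

    # Hypothetical data; the later snippets reuse this df.
    df = spark.createDataFrame(
        [(1, "Alice", "HR", 50000), (2, "Bob", "IT", 65000), (3, "Cara", "IT", None)],
        ["id", "name", "dept", "salary"],
    )

    # SQL: SELECT name, salary FROM employees
    df.select("name", "salary").show()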

Use case:
Extract only the necessary columns when performing analysis or reporting.

Filtering Rows
Application:
Extract rows that meet specific conditions, which is essential for data cleaning or focused analysis.

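A sketch reusing the sample df (the salary threshold is arbitrary):

    # SQL: SELECT * FROM employees WHERE salary > 55000
    df.filter(F.col("salary") > 55000)

    # where() is an alias for filter()
    df.where(F.col("salary") > 55000)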

Use case:
Narrow down the dataset to include only relevant records for further analysis.

Subqueries
Application:
Filter data based on the results of another query, enabling more complex conditions.

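The DataFrame API has no literal subquery syntax; one common translation, sketched here with the sample df, is to evaluate the inner query first (a join works for correlated cases):

    # SQL: SELECT * FROM employees
    #      WHERE salary > (SELECT AVG(salary) FROM employees)
    avg_salary = df.agg(F.avg("salary")).first()[0]
    df.filter(F.col("salary") > avg_salary)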

Use case:
Use nested queries to refine data based on complex, multi-step logic.

Data Grouping and Aggregation

Group By and Agg
Application:
Summarize data by grouping similar records on one or more columns and applying aggregate functions at that level.

Common PySpark agg functions: max, min, sum, avg, stddev.
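A sketch with the sample df:

    # SQL: SELECT dept, AVG(salary) AS avg_salary, COUNT(*) AS n
    #      FROM employees GROUP BY dept
    df.groupBy("dept").agg(
        F.avg("salary").alias("avg_salary"),
        F.count("*").alias("n"),
    )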

Use case:
Summarize large datasets into more understandable metrics, such as totals, averages, or counts, depending on the type of insight you want.

Data Combining and Joining

Joins
Application:
Combine data from two or more tables based on a common column to enrich datasets with related information.

PySpark join types: inner, left, right, outer.
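A sketch, assuming a second hypothetical departments DataFrame to join against:

    # SQL: SELECT e.name, d.location FROM employees e
    #      INNER JOIN departments d ON e.dept = d.dept
    dept_df = spark.createDataFrame([("HR", "NYC"), ("IT", "SF")], ["dept", "location"])
    df.join(dept_df, on="dept", how="inner").select("name", "location")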

Use case:
Merge datasets to create a more comprehensive dataset, often used in data warehousing and analytics.

Union / unionAll
Application:
Use when you need to combine the results of two queries.

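A sketch that unions the sample df with itself just to show the mechanics:

    # SQL: SELECT * FROM t1 UNION ALL SELECT * FROM t2
    df.union(df)               # keeps duplicates, like SQL UNION ALL

    # SQL: SELECT * FROM t1 UNION SELECT * FROM t2
    df.union(df).distinct()    # removes duplicates, like SQL UNION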

Use case:
Combine datasets with the same structure into one. Note the semantics: in SQL, UNION removes duplicates while UNION ALL keeps all records; in PySpark, union() (and its deprecated alias unionAll()) keeps all records, so append .distinct() to get SQL UNION behavior.
Data Manipulation and Transformation

Add New Column
Application:
Create new columns derived from existing ones, enabling the
transformation of data into new features or metrics.

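A sketch with the sample df (the 10% raise is an arbitrary derived metric):

    # SQL: SELECT *, salary * 1.10 AS raised_salary FROM employees
    df.withColumn("raised_salary", F.col("salary") * 1.10)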

Use case:
Add derived columns to the dataset, useful for feature engineering in data science projects.

Renaming One Column
Application:
Change column names for better readability or to align with naming conventions.

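A sketch with the sample df (the new name is illustrative):

    # SQL: SELECT id, name, dept, salary AS base_salary FROM employees
    df.withColumnRenamed("salary", "base_salary")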

Use case:
Standardize column names for readability and consistent naming across datasets.

Renaming Multiple Columns
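Two common sketches, with illustrative new names: chain withColumnRenamed calls, or rename every column positionally with toDF:

    # Rename selected columns by chaining
    df.withColumnRenamed("salary", "base_salary").withColumnRenamed("dept", "department")

    # ...or rename all columns at once (one name per existing column)
    df.toDF("emp_id", "full_name", "department", "base_salary")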
Dropping Columns
Application:
Remove unnecessary or redundant columns to simplify the dataset.

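A sketch with the sample df:

    # SQL: SELECT id, name, dept FROM employees  (everything except salary)
    df.drop("salary")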

Use case:
Streamline datasets by eliminating columns that are no longer needed, which can also improve performance.

Changing Data Types - 1
Application:
Convert a single column to a different data type to ensure consistency and compatibility for further analysis.

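A sketch with the sample df:

    # SQL: SELECT id, name, dept, CAST(salary AS DOUBLE) AS salary FROM employees
    df.withColumn("salary", F.col("salary").cast("double"))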

Use case:
Adjust data types to meet specific analysis requirements, such as converting strings to integers for numeric operations.

Changing Data Types - 2
Application:
Convert multiple columns to different data types to ensure consistency and compatibility for further analysis.

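A sketch casting two columns by chaining (a single select with casts works equally well):

    # Cast salary to double and id to string
    df.withColumn("salary", F.col("salary").cast("double")) \
      .withColumn("id", F.col("id").cast("string"))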
Pivot
Application:
Use when you need to pivot data from rows to columns.

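A sketch with the sample df; listing the pivot values explicitly is optional but saves Spark an extra pass over the data to discover them:

    # Spark SQL: SELECT * FROM employees
    #            PIVOT (AVG(salary) FOR dept IN ('HR', 'IT'))
    df.groupBy("name").pivot("dept", ["HR", "IT"]).agg(F.avg("salary"))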

Use case:
Reshape data for better analysis or reporting.

Data Cleaning

Filtering Non-Null
Application:
Keep only rows with non-null values to maintain the dataset's integrity.
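A sketch with the sample df, whose salary column contains a null:

    # SQL: SELECT * FROM employees WHERE salary IS NOT NULL
    df.filter(F.col("salary").isNotNull())

    # ...or drop rows with nulls in the listed columns
    df.na.drop(subset=["salary"])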

Use case:
Clean datasets by removing incomplete records that could lead to inaccurate analysis or errors.

Filling/Replacing Nulls
Application:
Replace null values with a default value to maintain dataset consistency.

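A sketch replacing nulls in one column with a default of 0:

    # SQL: SELECT id, name, dept, COALESCE(salary, 0) AS salary FROM employees
    df.na.fill({"salary": 0})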

Use case:
Fill missing data with a specific value to avoid dropping rows and losing information.

Remove Duplicates
Application:
Ensure data integrity by removing duplicate records.

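A sketch with the sample df:

    # SQL: SELECT DISTINCT * FROM employees
    df.distinct()

    # ...or deduplicate on a subset of columns
    df.dropDuplicates(["name", "dept"])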

Use case:
Eliminate repeated entries to maintain data uniqueness, crucial for accurate analysis.

Dropping Columns
Application:
Remove unnecessary columns from the dataset to simplify analysis and reduce noise.

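As in the earlier dropping-columns sketch; drop also accepts several columns at once:

    # Remove more than one column in a single call
    df.drop("salary", "dept")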

Use case:
Clean up the dataset by removing irrelevant or redundant columns, improving the focus and efficiency of your analysis.

Data Organization

Sorting Data
Application:
Organize data in ascending or descending order for easier analysis or presentation.

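A sketch with the sample df:

    # SQL: SELECT * FROM employees ORDER BY salary DESC, name ASC
    df.orderBy(F.col("salary").desc(), F.col("name").asc())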

Use case:
Highlight trends, patterns, or outliers, often used in reporting and data visualization.

Counting Records
Application:
Determine the number of rows in a dataset, or count non-null values across multiple columns.

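A sketch with the sample df; count("*") counts rows, while count(col) counts non-null values:

    # SQL: SELECT COUNT(*) FROM employees
    df.count()

    # SQL: SELECT COUNT(name) AS name_ct, COUNT(salary) AS salary_ct FROM employees
    df.select(F.count("name").alias("name_ct"), F.count("salary").alias("salary_ct"))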

Use case:
Quickly assess the size of a dataset and understand the distribution of non-null values in multiple columns, often used in the initial stages of data exploration.