
Mastering PySpark: Essential Cheatsheet

Basics: Loading and Exploring Data


# Create a Spark session and import the helpers used throughout this cheatsheet
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

spark = SparkSession.builder.getOrCreate()

# Read CSV data into a DataFrame
df = spark.read.csv('/path/to/your/input/file')

# Show a preview of the data
df.show()

# Display the first and last n rows
df.head(5)
df.tail(5)

# Display as JSON (collects to the driver; use caution with large datasets)
import json
df = df.limit(10)
print(json.dumps([row.asDict(recursive=True) for row in df.collect()], indent=2))
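
In practice you will usually want a header row and inferred column types when reading CSV; a minimal sketch using standard spark.read options (the path is a placeholder):

# Read CSV with a header row and inferred column types
df = spark.read.csv('/path/to/your/input/file', header=True, inferSchema=True)

# Inspect the resulting schema
df.printSchema()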

Common Operations: Columns and Rows


# Get column names, types, and schema
df.columns
df.dtypes
df.schema
# Get row and column count
df.count()
len(df.columns)

# Write output to disk
df.write.csv('/path/to/your/output/file')

# Convert to a pandas DataFrame
df = df.toPandas()
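
By default the writer fails if the output path already exists; mode() controls that behavior. A short sketch using standard DataFrameWriter options:

# Overwrite any existing output and include a header row
df.write.mode('overwrite').csv('/path/to/your/output/file', header=True)

# Parquet preserves column types and is usually preferred for intermediate data
df.write.mode('overwrite').parquet('/path/to/your/output/parquet')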

Filtering and Sorting


# Filter rows on one or more conditions
df = df.filter(df.age > 25)
df = df.filter((df.age > 25) & (df.is_adult == 'Y'))

# Sort results
df = df.orderBy(df.age.asc())
df = df.orderBy(df.age.desc())
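
The same filters can be written with F.col or as SQL expression strings, and sorts can combine several columns; a quick sketch:

# Equivalent filters using F.col and a SQL expression string
df = df.filter(F.col('age') > 25)
df = df.filter("age > 25 AND is_adult = 'Y'")

# Sort on multiple columns
df = df.orderBy(F.desc('age'), F.asc('name'))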

Joins: Combining Datasets


# Left join with another dataset
df = df.join(person_lookup_table, 'person_id', 'left')

# Join on differently named columns
df = df.join(other_table, df.id == other_table.person_id, 'left')

# Join on multiple columns
df = df.join(other_table, ['first_name', 'last_name'], 'left')
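
When one side of a join is small, a broadcast hint avoids a shuffle; a sketch assuming a hypothetical small_lookup DataFrame that fits in executor memory:

# Broadcast join hint for a small lookup table
df = df.join(F.broadcast(small_lookup), 'person_id', 'left')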

Column Operations: Transforming Data

# Add new columns
df = df.withColumn('status', F.lit('PASS'))

# Construct dynamic columns
df = df.withColumn('full_name',
    F.when(df.fname.isNotNull() & df.lname.isNotNull(),
           F.concat(df.fname, df.lname)).otherwise(F.lit('N/A')))

# Select, drop, and rename columns
df = df.select('name', 'age', F.col('dob').alias('date_of_birth'))
df = df.drop('mod_dt', 'mod_username')
df = df.withColumnRenamed('dob', 'date_of_birth')
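
F.when calls can be chained for multi-branch conditional logic; a small sketch deriving a hypothetical age_group column:

# Multi-branch conditional column
df = df.withColumn('age_group',
    F.when(df.age < 13, 'child')
     .when(df.age < 20, 'teen')
     .otherwise('adult'))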

Casting, Null Handling, and Duplicates


# Cast a column to a different type
df = df.withColumn('price', df.price.cast(T.DoubleType()))

# Replace nulls with specific values
df = df.fillna({'first_name': 'Tom', 'age': 0})

# Coalesce null values
df = df.withColumn('last_name',
    F.coalesce(df.last_name, df.surname, F.lit('N/A')))

# Drop duplicate rows
df = df.dropDuplicates()
df = df.distinct()  # equivalent to dropDuplicates()

# Drop duplicates, considering only specific columns
df = df.dropDuplicates(['name', 'height'])
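
The complement of fillna is dropna, which removes rows containing nulls; a quick sketch:

# Drop rows with nulls in any column, or only in specific columns
df = df.dropna()
df = df.dropna(subset=['first_name', 'age'])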

String Operations: Filters and Functions


# String filters
df = df.filter(df.name.contains('o'))
df = df.filter(df.name.startswith('Al'))
df = df.filter(df.name.endswith('ice'))
df = df.filter(df.is_adult.isNull())
df = df.filter(df.first_name.isNotNull())
df = df.filter(df.name.like('Al%'))
df = df.filter(df.name.rlike('[A-Z]*ice$'))
df = df.filter(df.name.isin('Bob', 'Mike'))

# String functions
df = df.withColumn('short_id', df.id.substr(1, 10))  # substr is 1-indexed
df = df.withColumn('name', F.trim(df.name))
df = df.withColumn('id', F.lpad('id', 4, '0'))
df = df.withColumn('full_name', F.concat('fname', F.lit(' '), 'lname'))
df = df.withColumn('full_name', F.concat_ws('-', 'fname', 'lname'))
df = df.withColumn('id', F.regexp_replace('id', '0F1(.*)', '1F1-$1'))
df = df.withColumn('id', F.regexp_extract('id', '[0-9]*', 0))
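
A few more built-in string functions that come up constantly; a sketch on hypothetical columns:

# Case conversion, length, and splitting into arrays
df = df.withColumn('name_lower', F.lower('name'))
df = df.withColumn('name_upper', F.upper('name'))
df = df.withColumn('name_length', F.length('name'))
df = df.withColumn('name_tokens', F.split('name', ' '))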

Number Operations: Mathematical Functions


# Mathematical operations
df = df.withColumn('price', F.round('price', 0))
df = df.withColumn('price', F.floor('price'))
df = df.withColumn('price', F.ceil('price'))
df = df.withColumn('price', F.abs('price'))
df = df.withColumn('exponential_growth', F.pow('x', 'y'))
df = df.withColumn('least', F.least('subtotal', 'total'))
df = df.withColumn('greatest', F.greatest('subtotal', 'total'))
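
Plain arithmetic operators also work directly on columns; a sketch with hypothetical subtotal/tax columns:

# Column arithmetic without helper functions
df = df.withColumn('total', df.subtotal + df.tax)
df = df.withColumn('discounted', F.col('price') * 0.9)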

Date and Timestamp Operations


# Date and timestamp operations
df = df.withColumn('current_date', F.current_date())
df = df.withColumn('date_of_birth', F.to_date('date_of_birth', 'yyyy-MM-dd'))
df = df.withColumn('time_of_birth', F.to_timestamp('time_of_birth', 'yyyy-MM-dd HH:mm:ss'))
df = df.filter(F.year('date_of_birth') == F.lit(2017))
df = df.withColumn('three_days_after', F.date_add('date_of_birth', 3))
df = df.withColumn('three_days_before', F.date_sub('date_of_birth', 3))
df = df.withColumn('next_month', F.add_months('date_of_birth', 1))
df = df.withColumn('days_between', F.datediff('end', 'start'))  # datediff(end, start)
df = df.withColumn('months_between', F.months_between('end', 'start'))
df = df.filter((F.col('date_of_birth') >= F.lit('2017-05-10')) &
               (F.col('date_of_birth') <= F.lit('2018-07-21')))
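
Individual date parts can be extracted, and dates truncated to a boundary; a short sketch:

# Extract date parts and truncate to the start of the month
df = df.withColumn('birth_year', F.year('date_of_birth'))
df = df.withColumn('birth_month', F.month('date_of_birth'))
df = df.withColumn('month_start', F.trunc('date_of_birth', 'month'))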
Array Operations: Functions and Transformations

# Array operations
df = df.withColumn('full_name', F.array('fname', 'lname'))
df = df.withColumn('empty_array_column', F.array([]))
df = df.withColumn('first_element', F.col("my_array").getItem(0))
df = df.withColumn('array_length', F.size('my_array'))
df = df.withColumn('flattened', F.flatten('my_array'))
df = df.withColumn('unique_elements', F.array_distinct('my_array'))
df = df.withColumn('elem_ids', F.transform(F.col('my_array'), lambda x: x.getField('id')))
df = df.select(F.explode('my_array'))
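
Two more array patterns worth knowing; a sketch (explode_outer keeps rows whose array is null or empty):

# Filter on array membership
df = df.filter(F.array_contains('my_array', 'some_value'))

# Explode while keeping rows with null/empty arrays
df = df.select(F.explode_outer('my_array'))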

Struct Operations: Making and Accessing Struct Columns

# Struct operations
df = df.withColumn('my_struct', F.struct(F.col('col_a'), F.col('col_b')))
df = df.withColumn('col_a', F.col('my_struct').getField('col_a'))
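
Struct fields can also be reached with dot notation, or expanded all at once; a quick sketch:

# Dot-notation access and expanding every field to a top-level column
df = df.withColumn('col_a', F.col('my_struct.col_a'))
df = df.select('my_struct.*')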

Aggregation Operations: Basic and Advanced


# Aggregation operations
df = df.groupBy('gender').agg(F.max('age').alias('max_age_by_gender'))
df = df.groupBy('age').agg(F.collect_set('name').alias('person_names'))

# Window functions for selecting the latest row in each group
from pyspark.sql import Window as W
window = W.partitionBy("first_name", "last_name").orderBy(F.desc("date"))
df = df.withColumn("row_number", F.row_number().over(window))
df = df.filter(F.col("row_number") == 1).drop("row_number")
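
Several aggregates can be computed in a single groupBy pass; a sketch:

# Multiple aggregates at once
df = df.groupBy('gender').agg(
    F.count('*').alias('n'),
    F.avg('age').alias('avg_age'),
    F.max('age').alias('max_age'))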

Repartitioning and UDFs (User Defined Functions)

# Repartitioning (here: collapse to a single partition,
# e.g. to produce a single output file)
df = df.repartition(1)

# UDFs (User Defined Functions)
times_two_udf = F.udf(lambda x: x * 2)
df = df.withColumn('age', times_two_udf(df.age))

import random
random_name_udf = F.udf(lambda: random.choice(['Bob', 'Tom', 'Amy', 'Jenna']))
df = df.withColumn('name', random_name_udf())
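
An untyped UDF returns strings by default; passing an explicit return type keeps the column numeric. A sketch:

# UDF with a declared return type
times_two_int_udf = F.udf(lambda x: x * 2, T.IntegerType())
df = df.withColumn('age', times_two_int_udf(df.age))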

Useful Functions and Transformations

# Flatten nested struct columns (assumes a user-defined `flatten`
# helper; one possible sketch is given at the end of this section)
flat_df = flatten(df, '_')

# Look up and replace values from another DataFrame (assumes a
# user-defined `lookup_and_replace` helper, not a built-in)
df = lookup_and_replace(people, pay_codes, 'id', 'pay_code_id', 'pay_code_desc')
