
Comprehensive Guide on the PySpark Framework
PySpark is the Python API for Apache Spark, an open-source, distributed computing system that provides an
interface for programming entire clusters with implicit data parallelism and fault tolerance. PySpark allows
Python developers to harness the simplicity of Python and the power of Apache Spark to process large
datasets efficiently.
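
To make the guide concrete from the start, here is a minimal quick-start sketch (the app name and sample rows are illustrative):

from pyspark.sql import SparkSession

# Entry point: every PySpark program starts from a SparkSession
spark = SparkSession.builder.appName("QuickStart").getOrCreate()

# A small in-memory DataFrame; Spark distributes its rows across the cluster
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()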

Cheat Sheet Table

Method Name - Definition

spark.read - Returns a DataFrameReader that provides methods for reading data from various data sources.
df.show() - Displays the content of the DataFrame in a tabular format.
df.select() - Selects a set of columns from a DataFrame.
df.filter() - Filters rows that meet a specified condition.
df.groupBy() - Groups the DataFrame using the specified columns.
df.agg() - Applies aggregate functions to the DataFrame.
df.join() - Joins two DataFrames based on one or more common columns.
df.withColumn() - Adds a new column or replaces an existing column in the DataFrame.
df.drop() - Drops specified columns from the DataFrame.
df.distinct() - Returns a new DataFrame with distinct rows.
df.union() - Returns a new DataFrame containing the union of rows from two DataFrames.
df.intersect() - Returns a new DataFrame containing the intersection of rows from two DataFrames.
df.exceptAll() - Returns a new DataFrame containing rows that are in one DataFrame but not the other.
df.sort() - Sorts the DataFrame based on specified columns.
df.orderBy() - Orders the DataFrame based on specified columns.
df.cache() - Caches the DataFrame in memory for quicker access.
df.persist() - Persists the DataFrame with a specified storage level.
df.repartition() - Repartitions the DataFrame into a specified number of partitions.
df.coalesce() - Reduces the number of partitions in the DataFrame.
df.dropna() - Drops rows containing null values.
df.fillna() - Fills null values with a specified value.
df.replace() - Replaces specified values in the DataFrame.
df.describe() - Computes basic statistics for numeric columns.
df.summary() - Provides a summary of the DataFrame.
df.crosstab() - Computes a cross-tabulation of two columns.
df.corr() - Computes the correlation between two columns.
df.cov() - Computes the covariance between two columns.
df.freqItems() - Finds frequent items for columns.
df.approxQuantile() - Calculates approximate quantiles.
df.sampleBy() - Returns a stratified sample without replacement based on the fraction given on each stratum.
df.withColumnRenamed() - Renames an existing column in the DataFrame.
df.dropDuplicates() - Drops duplicate rows based on specified columns.
df.sample() - Returns a sampled subset of the DataFrame.
df.randomSplit() - Randomly splits the DataFrame into multiple DataFrames.
df.take() - Returns the first n rows of the DataFrame.
df.head() - Returns the first row or the first n rows of the DataFrame.
df.first() - Returns the first row of the DataFrame.
df.collect() - Returns all rows as a list of Row objects.


df.toPandas() - Converts the DataFrame to a Pandas DataFrame.
df.write - Interface for saving the content of the DataFrame out into external storage systems.
df.schema - Returns the schema of the DataFrame.
df.dtypes - Returns a list of tuples containing the column name and data type.
df.columns - Returns a list of column names.
df.count() - Returns the number of rows in the DataFrame.
df.isEmpty() - Returns whether the DataFrame is empty.
df.na.fill() - Fills null values with a specified value.
df.na.replace() - Replaces values with specified values.
df.na.drop() - Drops rows with null values.
df.cube() - Computes aggregates with specified columns and aggregators, using cube computation.
df.rollup() - Computes aggregates with specified columns and aggregators, using rollup computation.

Explanation and Usage of Each Method

spark.read

Definition:
Returns a DataFrameReader that provides methods for reading data from various data sources.

Example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PySpark Guide").getOrCreate()

# Reading a CSV file
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
df.show()
df.show()

Definition:
Displays the content of the DataFrame in a tabular format.

Example:

# Show the first 20 rows of the DataFrame
df.show()

# Show the first 10 rows
df.show(10)

df.select()

Definition:
Selects a set of columns from a DataFrame.

Example:

# Select specific columns
selected_df = df.select("column1", "column2")
selected_df.show()

df.filter()

Definition:
Filters rows that meet a specified condition.

Example:

# Filter rows where the value in column1 is greater than 100
filtered_df = df.filter(df.column1 > 100)
filtered_df.show()

df.groupBy()

Definition:
Groups the DataFrame using the specified columns.
Example:

# Group by column1 and calculate the count
grouped_df = df.groupBy("column1").count()
grouped_df.show()

df.agg()

Definition:
Applies aggregate functions to the DataFrame.

Example:

from pyspark.sql.functions import avg

# Calculate the average value of column1
agg_df = df.agg(avg("column1"))
agg_df.show()

df.join()

Definition:
Joins two DataFrames based on one or more common columns.

Example:

# Join df1 and df2 on column1
joined_df = df1.join(df2, df1.column1 == df2.column1)
joined_df.show()

df.withColumn()

Definition:
Adds a new column or replaces an existing column in the DataFrame.

Example:

from pyspark.sql.functions import lit

# Add a new column with a constant value
new_df = df.withColumn("new_column", lit("constant_value"))
new_df.show()
df.drop()

Definition:
Drops specified columns from the DataFrame.

Example:

# Drop column1
dropped_df = df.drop("column1")
dropped_df.show()

df.distinct()

Definition:
Returns a new DataFrame with distinct rows.

Example:

# Get distinct rows
distinct_df = df.distinct()
distinct_df.show()

df.union()

Definition:
Returns a new DataFrame containing the union of rows from two DataFrames.

Example:

# Union of df1 and df2
union_df = df1.union(df2)
union_df.show()

df.intersect()

Definition:
Returns a new DataFrame containing the intersection of rows from two DataFrames.
Example:

# Intersection of df1 and df2
intersect_df = df1.intersect(df2)
intersect_df.show()

df.exceptAll()

Definition:
Returns a new DataFrame containing rows that are in one DataFrame but not the other.

Example:

# Rows in df1 that are not in df2
except_df = df1.exceptAll(df2)
except_df.show()

df.sort()

Definition:
Sorts the DataFrame based on specified columns.

Example:

# Sort by column1
sorted_df = df.sort("column1")
sorted_df.show()

df.orderBy()

Definition:
Orders the DataFrame based on specified columns.

Example:

# Order by column1
ordered_df = df.orderBy("column1")
ordered_df.show()
df.cache()

Definition:
Caches the DataFrame in memory for quicker access.

Example:

# Cache the DataFrame
df.cache()
df.show()

df.persist()

Definition:
Persists the DataFrame with a specified storage level.

Example:

from pyspark import StorageLevel

# Persist the DataFrame
df.persist(StorageLevel.MEMORY_ONLY)
df.show()

df.repartition()

Definition:
Repartitions the DataFrame into a specified number of partitions.

Example:

# Repartition to 10 partitions
repartitioned_df = df.repartition(10)
repartitioned_df.show()

df.coalesce()

Definition:
Reduces the number of partitions in the DataFrame.
Example:

# Coalesce to 2 partitions
coalesced_df = df.coalesce(2)
coalesced_df.show()

df.dropna()

Definition:
Drops rows containing null values.

Example:

# Drop rows with any null values
df_na_dropped = df.dropna()
df_na_dropped.show()

df.fillna()

Definition:
Fills null values with a specified value.

Example:

# Fill null values with 0
df_na_filled = df.fillna(0)
df_na_filled.show()

df.replace()

Definition:
Replaces specified values in the DataFrame.

Example:

# Replace 'old_value' with 'new_value'
df_na_replaced = df.replace('old_value', 'new_value')
df_na_replaced.show()
df.describe()

Definition:
Computes basic statistics for numeric columns.

Example:

# Describe the DataFrame
df.describe().show()

df.summary()

Definition:
Provides a summary of the DataFrame.

Example:

# Summary of the DataFrame
df.summary().show()

df.crosstab()

Definition:
Computes a cross-tabulation of two columns.

Example:

# Crosstab of column1 and column2
crosstab_df = df.crosstab("column1", "column2")
crosstab_df.show()

df.corr()

Definition:
Computes the correlation between two columns.

Example:

# Correlation between column1 and column2
correlation = df.corr("column1", "column2")
print(correlation)
df.cov()

Definition:
Computes the covariance between two columns.

Example:

# Covariance between column1 and column2
covariance = df.cov("column1", "column2")
print(covariance)

df.freqItems()

Definition:
Finds frequent items for columns.

Example:

# Frequent items for column1
freq_items_df = df.freqItems(["column1"])
freq_items_df.show()

df.approxQuantile()

Definition:
Calculates approximate quantiles.

Example:

# Approximate quantiles for column1
quantiles = df.approxQuantile("column1", [0.25, 0.5, 0.75], 0.01)
print(quantiles)

df.sampleBy()

Definition:
Returns a stratified sample without replacement based on the fraction given on each stratum.
Example:

# Stratified sample by column1
sampled_df = df.sampleBy("column1", fractions={"A": 0.1, "B": 0.2}, seed=0)
sampled_df.show()

df.withColumnRenamed()

Definition:
Renames an existing column in the DataFrame.

Example:

# Rename column1 to new_column1
renamed_df = df.withColumnRenamed("column1", "new_column1")
renamed_df.show()

df.dropDuplicates()

Definition:
Drops duplicate rows based on specified columns.

Example:

# Drop duplicates based on column1
deduped_df = df.dropDuplicates(["column1"])
deduped_df.show()

df.sample()

Definition:
Returns a sampled subset of the DataFrame.

Example:

# Sample 10% of the DataFrame
sampled_df = df.sample(fraction=0.1, seed=0)
sampled_df.show()
df.randomSplit()

Definition:
Randomly splits the DataFrame into multiple DataFrames.

Example:

# Split the DataFrame into training (70%) and testing (30%) sets
train_df, test_df = df.randomSplit([0.7, 0.3], seed=0)
train_df.show()
test_df.show()

df.take()

Definition:
Returns the first n rows of the DataFrame.

Example:

# Take the first 5 rows
rows = df.take(5)
for row in rows:
    print(row)

df.head()

Definition:
Returns the first row or the first n rows of the DataFrame.

Example:

# Head of the DataFrame
head_row = df.head()
print(head_row)

df.first()

Definition:
Returns the first row of the DataFrame.
Example:

# First row of the DataFrame
first_row = df.first()
print(first_row)

df.collect()

Definition:
Returns all rows as a list of Row objects.

Example:

# Collect all rows
all_rows = df.collect()
for row in all_rows:
    print(row)

df.toPandas()

Definition:
Converts the DataFrame to a Pandas DataFrame.

Example:

# Convert to Pandas DataFrame
pandas_df = df.toPandas()
print(pandas_df.head())

df.write

Definition:
Interface for saving the content of the DataFrame out into external storage systems.

Example:

# Write DataFrame to CSV
df.write.csv("path/to/output.csv")
df.schema

Definition:
Returns the schema of the DataFrame.

Example:

# Schema of the DataFrame
schema = df.schema
print(schema)

df.dtypes

Definition:
Returns a list of tuples containing the column name and data type.

Example:

# Data types of the DataFrame
dtypes = df.dtypes
print(dtypes)

df.columns

Definition:
Returns a list of column names.

Example:

# Columns of the DataFrame
columns = df.columns
print(columns)

df.count()

Definition:
Returns the number of rows in the DataFrame.
Example:

# Count of rows
row_count = df.count()
print(row_count)

df.isEmpty()

Definition:
Returns whether the DataFrame is empty.

Example:

# Check if DataFrame is empty
is_empty = df.isEmpty()
print(is_empty)
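
df.na (fill / replace / drop)

Definition:
The na accessor bundles the null-handling helpers listed in the cheat sheet: na.fill(), na.replace(), and na.drop() behave like fillna(), replace(), and dropna().

Example (a minimal sketch; the values are illustrative):

# Fill nulls, replace values, and drop null rows via the na accessor
df_filled = df.na.fill(0)
df_replaced = df.na.replace("old_value", "new_value")
df_dropped = df.na.drop()
df_dropped.show()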

Additional Examples

Example 1: Word Count

from pyspark.sql.functions import explode, split

text_df = spark.read.text("path/to/file.txt")
word_df = text_df.select(explode(split(text_df.value, " ")).alias("word"))
word_count_df = word_df.groupBy("word").count()
word_count_df.show()

Example 2: Join DataFrames

df1 = spark.read.csv("path/to/file1.csv", header=True, inferSchema=True)
df2 = spark.read.csv("path/to/file2.csv", header=True, inferSchema=True)

joined_df = df1.join(df2, df1.column1 == df2.column1, "inner")
joined_df.show()
Example 3: Aggregations

from pyspark.sql.functions import sum, avg

agg_df = df.groupBy("category").agg(sum("sales").alias("total_sales"), avg("sales").alias("average_sales"))
agg_df.show()

Example 4: Handling Missing Data

# Fill missing values


filled_df = df.fillna({"column1": 0, "column2": "unknown"})
filled_df.show()

# Drop rows with missing values


dropped_df = df.dropna()
dropped_df.show()

Example 5: Creating UDFs

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def my_upper(s):
    return s.upper()

my_upper_udf = udf(my_upper, StringType())

df_with_udf = df.withColumn("upper_column", my_upper_udf(df.column1))


df_with_udf.show()

Example 6: Reading and Writing Parquet Files

# Reading Parquet file
parquet_df = spark.read.parquet("path/to/file.parquet")
parquet_df.show()

# Writing Parquet file
df.write.parquet("path/to/output.parquet")

Example 7: DataFrame Transformation

transformed_df = df.withColumn("new_column", df.column1 * 2)


transformed_df.show()
Example 8: Filtering Data

filtered_df = df.filter((df.column1 > 100) & (df.column2 == "value"))


filtered_df.show()

Example 9: Date and Time Operations

from pyspark.sql.functions import current_date, current_timestamp

date_df = df.withColumn("current_date", current_date()).withColumn("current_timestamp", current_timestamp())


date_df.show()

Example 10: Working with JSON Data

# Reading JSON file
json_df = spark.read.json("path/to/file.json")
json_df.show()

# Writing JSON file
df.write.json("path/to/output.json")

Example 11: SQL Queries

[Link]("table")
result_df = [Link]("SELECT * FROM table WHERE column1 > 100")
result_df.show()

Example 12: Using Window Functions

from pyspark.sql.window import Window
from pyspark.sql.functions import rank

window_spec = Window.partitionBy("category").orderBy("sales")
ranked_df = df.withColumn("rank", rank().over(window_spec))
ranked_df.show()
Example 13: Pivoting Data

pivot_df = df.groupBy("category").pivot("year").sum("sales")
pivot_df.show()

Example 14: Unpivoting Data

unpivot_df = pivot_df.selectExpr("category", "stack(3, '2020', `2020`, '2021', `2021`, '2022', `2022`) as (year, sales)")


unpivot_df.show()

Example 15: Sampling Data

sampled_df = df.sample(fraction=0.1, seed=0)


sampled_df.show()

Example 16: Creating a DataFrame from RDD

rdd = spark.sparkContext.parallelize([(1, "a"), (2, "b"), (3, "c")])

rdd_df = spark.createDataFrame(rdd, ["id", "value"])
rdd_df.show()

Example 17: Reading Avro Files

avro_df = spark.read.format("avro").load("path/to/file.avro")
avro_df.show()

Example 18: Writing Avro Files

[Link]("avro").save("path/to/[Link]")

Example 19: Handling Complex Types

from pyspark.sql.functions import explode

complex_df = df.withColumn("exploded_column", explode(df.complex_column))


complex_df.show()
Example 20: Machine Learning with PySpark

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Prepare data for ML
assembler = VectorAssembler(inputCols=["column1", "column2"], outputCol="features")
ml_df = assembler.transform(df)

# Train a Linear Regression model
lr = LinearRegression(featuresCol="features", labelCol="label")
lr_model = lr.fit(ml_df)

# Make predictions
predictions = lr_model.transform(ml_df)
predictions.show()
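
Example 21: Cube and Rollup Aggregations

The df.cube() and df.rollup() entries in the cheat sheet have no example above, so here is a minimal sketch, assuming df has illustrative "category", "year", and "sales" columns:

# Cube: aggregates over every combination of category and year, plus subtotals and a grand total
cube_df = df.cube("category", "year").sum("sales")
cube_df.show()

# Rollup: aggregates hierarchically over (category, year), (category), and the grand total
rollup_df = df.rollup("category", "year").sum("sales")
rollup_df.show()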

Follow Sumit Khanna for more updates.
