Comprehensive Guide to the PySpark Framework
PySpark is the Python API for Apache Spark, an open-source, distributed computing system that provides an
interface for programming entire clusters with implicit data parallelism and fault tolerance. PySpark allows
Python developers to harness the simplicity of Python and the power of Apache Spark to process large
datasets efficiently.
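Before diving into the reference material, here is a minimal end-to-end sketch of the workflow most of the methods below operate on (the file path and column names are hypothetical placeholders):
from pyspark.sql import SparkSession
# Start (or reuse) a Spark session
spark = SparkSession.builder.appName("Quickstart").getOrCreate()
# Read a CSV file into a DataFrame and run a simple transformation
df = spark.read.csv("path/to/people.csv", header=True, inferSchema=True)
df.filter(df.age > 30).select("name", "age").show()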
Cheat Sheet Table
spark.read - Returns a DataFrameReader that provides methods for reading data from various data sources.
df.show() - Displays the content of the DataFrame in a tabular format.
df.select() - Selects a set of columns from a DataFrame.
df.filter() - Filters rows that meet a specified condition.
df.groupBy() - Groups the DataFrame using the specified columns.
df.agg() - Applies aggregate functions to the DataFrame.
df.join() - Joins two DataFrames based on one or more common columns.
df.withColumn() - Adds a new column or replaces an existing column in the DataFrame.
df.drop() - Drops specified columns from the DataFrame.
df.distinct() - Returns a new DataFrame with distinct rows.
df.union() - Returns a new DataFrame containing the union of rows from two DataFrames.
df.intersect() - Returns a new DataFrame containing the intersection of rows from two DataFrames.
df.exceptAll() - Returns a new DataFrame containing rows that are in one DataFrame but not the other.
df.sort() - Sorts the DataFrame based on specified columns.
df.orderBy() - Orders the DataFrame based on specified columns.
df.cache() - Caches the DataFrame in memory for quicker access.
df.persist() - Persists the DataFrame with a specified storage level.
df.repartition() - Repartitions the DataFrame into a specified number of partitions.
df.coalesce() - Reduces the number of partitions in the DataFrame.
df.na.drop() - Drops rows containing null values.
df.na.fill() - Fills null values with a specified value.
df.na.replace() - Replaces specified values in the DataFrame.
df.describe() - Computes basic statistics for numeric columns.
df.summary() - Provides a summary of the DataFrame.
df.crosstab() - Computes a cross-tabulation of two columns.
df.corr() - Computes the correlation between two columns.
df.cov() - Computes the covariance between two columns.
df.freqItems() - Finds frequent items for columns.
df.approxQuantile() - Calculates approximate quantiles.
df.sampleBy() - Returns a stratified sample without replacement based on the fraction given for each stratum.
df.withColumnRenamed() - Renames an existing column in the DataFrame.
df.dropDuplicates() - Drops duplicate rows based on specified columns.
df.sample() - Returns a sampled subset of the DataFrame.
df.randomSplit() - Randomly splits the DataFrame into multiple DataFrames.
df.take() - Returns the first n rows of the DataFrame.
df.head() - Returns the first row or first n rows of the DataFrame.
df.first() - Returns the first row of the DataFrame.
df.collect() - Returns all rows as a list of Row objects.
df.toPandas() - Converts the DataFrame to a Pandas DataFrame.
df.write - Interface for saving the content of the DataFrame out to external storage systems.
df.schema - Returns the schema of the DataFrame.
df.dtypes - Returns a list of tuples containing each column name and data type.
df.columns - Returns a list of column names.
df.count() - Returns the number of rows in the DataFrame.
df.isEmpty() - Returns whether the DataFrame is empty.
df.fillna() - Fills null values with a specified value (alias for df.na.fill()).
df.replace() - Replaces values with specified values (alias for df.na.replace()).
df.dropna() - Drops rows with null values (alias for df.na.drop()).
df.cube() - Computes aggregates with the specified columns and aggregators, using cube computation.
df.rollup() - Computes aggregates with the specified columns and aggregators, using rollup computation.
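The last two entries, df.cube() and df.rollup(), are not covered in the walkthrough below, so here is a minimal sketch, assuming hypothetical category, year, and sales columns:
from pyspark.sql.functions import sum
# Cube: aggregates over every combination of the grouping columns, including subtotals and a grand total
cube_df = df.cube("category", "year").agg(sum("sales").alias("total_sales"))
cube_df.show()
# Rollup: aggregates hierarchically (category + year, then category, then grand total)
rollup_df = df.rollup("category", "year").agg(sum("sales").alias("total_sales"))
rollup_df.show()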
Explanation and Usage of Each Method
spark.read
Definition:
Returns a DataFrameReader that provides methods for reading data from various data sources.
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PySpark Guide").getOrCreate()
# Reading a CSV file
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
df.show()
df.show()
Definition:
Displays the content of the DataFrame in a tabular format.
Example:
# Show the first 20 rows of the DataFrame
df.show()
# Show the first 10 rows
df.show(10)
df.select()
Definition:
Selects a set of columns from a DataFrame.
Example:
# Select specific columns
selected_df = df.select("column1", "column2")
selected_df.show()
df.filter()
Definition:
Filters rows that meet a specified condition.
Example:
# Filter rows where the value in column1 is greater than 100
filtered_df = df.filter(df.column1 > 100)
filtered_df.show()
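filter() also accepts a SQL expression string, which is handy when conditions come from configuration. A minimal sketch using the same hypothetical column:
# Equivalent filter using a SQL expression string
filtered_df = df.filter("column1 > 100")
filtered_df.show()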
df.groupBy()
Definition:
Groups the DataFrame using the specified columns.
Example:
# Group by column1 and calculate the count
grouped_df = df.groupBy("column1").count()
grouped_df.show()
df.agg()
Definition:
Applies aggregate functions to the DataFrame.
Example:
from pyspark.sql.functions import avg
# Calculate the average value of column1
agg_df = df.agg(avg("column1"))
agg_df.show()
df.join()
Definition:
Joins two DataFrames based on one or more common columns.
Example:
# Join df1 and df2 on column1
joined_df = df1.join(df2, df1.column1 == df2.column1)
joined_df.show()
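When both DataFrames share the join column's name, passing the name (or a list of names) avoids a duplicated column in the result, and the how parameter selects the join type. A minimal sketch under the same assumptions:
# Join on a shared column name with an explicit join type
left_joined_df = df1.join(df2, on="column1", how="left")
left_joined_df.show()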
df.withColumn()
Definition:
Adds a new column or replaces an existing column in the DataFrame.
Example:
from pyspark.sql.functions import lit
# Add a new column with a constant value
new_df = df.withColumn("new_column", lit("constant_value"))
new_df.show()
df.drop()
Definition:
Drops specified columns from the DataFrame.
Example:
# Drop column1
dropped_df = df.drop("column1")
dropped_df.show()
df.distinct()
Definition:
Returns a new DataFrame with distinct rows.
Example:
# Get distinct rows
distinct_df = df.distinct()
distinct_df.show()
df.union()
Definition:
Returns a new DataFrame containing the union of rows from two DataFrames.
Example:
# Union of df1 and df2
union_df = df1.union(df2)
union_df.show()
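Note that union() matches columns by position, not by name; if the two DataFrames list the same columns in a different order, unionByName() is the safer choice. A minimal sketch:
# Union that resolves columns by name rather than position
union_by_name_df = df1.unionByName(df2)
union_by_name_df.show()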
df.intersect()
Definition:
Returns a new DataFrame containing the intersection of rows from two DataFrames.
Example:
# Intersection of df1 and df2
intersect_df = df1.intersect(df2)
intersect_df.show()
df.exceptAll()
Definition:
Returns a new DataFrame containing rows that are in one DataFrame but not the other.
Example:
# Rows in df1 that are not in df2 (duplicates preserved)
except_df = df1.exceptAll(df2)
except_df.show()
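exceptAll() preserves duplicate rows. For set semantics, where each surviving row appears once, subtract() returns the distinct rows of one DataFrame that are absent from the other. A minimal sketch:
# Distinct rows of df1 that are absent from df2
subtract_df = df1.subtract(df2)
subtract_df.show()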
df.sort()
Definition:
Sorts the DataFrame based on specified columns.
Example:
# Sort by column1
sorted_df = df.sort("column1")
sorted_df.show()
df.orderBy()
Definition:
Orders the DataFrame based on specified columns.
Example:
# Order by column1
ordered_df = df.orderBy("column1")
ordered_df.show()
df.cache()
Definition:
Caches the DataFrame in memory for quicker access.
Example:
# Cache the DataFrame
df.cache()
df.show()
df.persist()
Definition:
Persists the DataFrame with a specified storage level.
Example:
from pyspark import StorageLevel
# Persist the DataFrame in memory only
df.persist(StorageLevel.MEMORY_ONLY)
df.show()
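Cached or persisted data occupies executor memory until it is released, so it is good practice to free a DataFrame once it is no longer reused. A minimal sketch:
# Release the DataFrame from memory/disk storage
df.unpersist()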
df.repartition()
Definition:
Repartitions the DataFrame into a specified number of partitions.
Example:
# Repartition to 10 partitions
repartitioned_df = df.repartition(10)
repartitioned_df.show()
df.coalesce()
Definition:
Reduces the number of partitions in the DataFrame without a full shuffle.
Example:
# Coalesce to 2 partitions
coalesced_df = df.coalesce(2)
coalesced_df.show()
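To verify what repartition() or coalesce() actually did, the partition count can be read off the underlying RDD. A minimal sketch:
# Check the current number of partitions
print(df.rdd.getNumPartitions())
print(coalesced_df.rdd.getNumPartitions())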
df.na.drop()
Definition:
Drops rows containing null values.
Example:
# Drop rows with any null values
df_na_dropped = df.na.drop()
df_na_dropped.show()
df.na.fill()
Definition:
Fills null values with a specified value.
Example:
# Fill null values with 0
df_na_filled = df.na.fill(0)
df_na_filled.show()
df.na.replace()
Definition:
Replaces specified values in the DataFrame.
Example:
# Replace 'old_value' with 'new_value'
df_na_replaced = df.na.replace('old_value', 'new_value')
df_na_replaced.show()
df.describe()
Definition:
Computes basic statistics (count, mean, stddev, min, max) for numeric columns.
Example:
# Describe the DataFrame
df.describe().show()
df.summary()
Definition:
Provides a summary of the DataFrame, including percentiles in addition to the describe() statistics.
Example:
# Summary of the DataFrame
df.summary().show()
df.crosstab()
Definition:
Computes a cross-tabulation of two columns.
Example:
# Crosstab of column1 and column2
crosstab_df = df.crosstab("column1", "column2")
crosstab_df.show()
df.corr()
Definition:
Computes the correlation between two columns.
Example:
# Correlation between column1 and column2
correlation = df.corr("column1", "column2")
print(correlation)
df.cov()
Definition:
Computes the covariance between two columns.
Example:
# Covariance between column1 and column2
covariance = df.cov("column1", "column2")
print(covariance)
df.freqItems()
Definition:
Finds frequent items for columns.
Example:
# Frequent items for column1
freq_items_df = df.freqItems(["column1"])
freq_items_df.show()
df.approxQuantile()
Definition:
Calculates approximate quantiles of a numeric column.
Example:
# Approximate quartiles for column1, with a relative error of 0.01
quantiles = df.approxQuantile("column1", [0.25, 0.5, 0.75], 0.01)
print(quantiles)
df.sampleBy()
Definition:
Returns a stratified sample without replacement based on the fraction given for each stratum.
Example:
# Stratified sample by column1 (keep 10% of "A" rows and 20% of "B" rows)
sampled_df = df.sampleBy("column1", fractions={"A": 0.1, "B": 0.2}, seed=0)
sampled_df.show()
df.withColumnRenamed()
Definition:
Renames an existing column in the DataFrame.
Example:
# Rename column1 to new_column1
renamed_df = df.withColumnRenamed("column1", "new_column1")
renamed_df.show()
df.dropDuplicates()
Definition:
Drops duplicate rows based on specified columns.
Example:
# Drop duplicates based on column1
deduped_df = df.dropDuplicates(["column1"])
deduped_df.show()
df.sample()
Definition:
Returns a sampled subset of the DataFrame.
Example:
# Sample 10% of the DataFrame
sampled_df = df.sample(fraction=0.1, seed=0)
sampled_df.show()
df.randomSplit()
Definition:
Randomly splits the DataFrame into multiple DataFrames.
Example:
# Split the DataFrame into training (70%) and testing (30%) sets
train_df, test_df = df.randomSplit([0.7, 0.3], seed=0)
train_df.show()
test_df.show()
df.take()
Definition:
Returns the first n rows of the DataFrame as a list of Row objects.
Example:
# Take the first 5 rows
rows = df.take(5)
for row in rows:
    print(row)
df.head()
Definition:
Returns the first row, or the first n rows, of the DataFrame.
Example:
# Head of the DataFrame
head_row = df.head()
print(head_row)
df.first()
Definition:
Returns the first row of the DataFrame.
Example:
# First row of the DataFrame
first_row = df.first()
print(first_row)
df.collect()
Definition:
Returns all rows as a list of Row objects on the driver.
Example:
# Collect all rows (use with care on large DataFrames)
all_rows = df.collect()
for row in all_rows:
    print(row)
df.toPandas()
Definition:
Converts the DataFrame to a Pandas DataFrame on the driver.
Example:
# Convert to Pandas DataFrame
pandas_df = df.toPandas()
print(pandas_df.head())
df.write
Definition:
Interface for saving the content of the DataFrame out to external storage systems.
Example:
# Write DataFrame to CSV (Spark writes a directory of part files at this path)
df.write.csv("path/to/output.csv")
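By default the write fails if the target path already exists; mode() controls that behavior, and header=True preserves column names for CSV. A minimal sketch (the output path is a placeholder):
# Overwrite any existing output and include a header row
df.write.mode("overwrite").csv("path/to/output.csv", header=True)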
df.schema
Definition:
Returns the schema of the DataFrame as a StructType.
Example:
# Schema of the DataFrame
schema = df.schema
print(schema)
df.dtypes
Definition:
Returns a list of tuples containing each column name and its data type.
Example:
# Data types of the DataFrame
dtypes = df.dtypes
print(dtypes)
df.columns
Definition:
Returns a list of column names.
Example:
# Columns of the DataFrame
columns = df.columns
print(columns)
df.count()
Definition:
Returns the number of rows in the DataFrame.
Example:
# Count of rows
row_count = df.count()
print(row_count)
df.isEmpty()
Definition:
Returns whether the DataFrame is empty.
Example:
# Check if DataFrame is empty
is_empty = df.isEmpty()
print(is_empty)
Additional Examples
Example 1: Word Count
from pyspark.sql.functions import explode, split
text_df = spark.read.text("path/to/file.txt")
word_df = text_df.select(explode(split(text_df.value, " ")).alias("word"))
word_count_df = word_df.groupBy("word").count()
word_count_df.show()
Example 2: Join DataFrames
df1 = spark.read.csv("path/to/file1.csv", header=True, inferSchema=True)
df2 = spark.read.csv("path/to/file2.csv", header=True, inferSchema=True)
# Inner join, assuming both files share an id column
joined_df = df1.join(df2, df1.id == df2.id, "inner")
joined_df.show()
Example 3: Aggregations
from pyspark.sql.functions import sum, avg
agg_df = df.groupBy("category").agg(sum("sales").alias("total_sales"), avg("sales").alias("average_sales"))
agg_df.show()
Example 4: Handling Missing Data
# Fill missing values
filled_df = df.fillna({"column1": 0, "column2": "unknown"})
filled_df.show()
# Drop rows with missing values
dropped_df = df.dropna()
dropped_df.show()
Example 5: Creating UDFs
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def my_upper(s):
    # Guard against nulls, which would otherwise raise an AttributeError
    return s.upper() if s is not None else None
my_upper_udf = udf(my_upper, StringType())
df_with_udf = df.withColumn("upper_column", my_upper_udf(df.column1))
df_with_udf.show()
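For larger datasets, a vectorized pandas UDF is usually faster than a row-at-a-time UDF because it processes batches of values via Apache Arrow. A minimal sketch of the same uppercase logic, assuming pandas and pyarrow are installed (StringType is imported above):
import pandas as pd
from pyspark.sql.functions import pandas_udf
@pandas_udf(StringType())
def my_upper_pandas(s: pd.Series) -> pd.Series:
    # Vectorized uppercase over a whole batch of values
    return s.str.upper()
df_with_pandas_udf = df.withColumn("upper_column", my_upper_pandas(df.column1))
df_with_pandas_udf.show()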
Example 6: Reading and Writing Parquet Files
# Reading Parquet file
parquet_df = spark.read.parquet("path/to/file.parquet")
parquet_df.show()
# Writing Parquet file
df.write.parquet("path/to/output.parquet")
Example 7: DataFrame Transformation
transformed_df = df.withColumn("new_column", df.column1 * 2)
transformed_df.show()
Example 8: Filtering Data
filtered_df = df.filter((df.column1 > 100) & (df.column2 == "value"))
filtered_df.show()
Example 9: Date and Time Operations
from pyspark.sql.functions import current_date, current_timestamp
date_df = df.withColumn("current_date", current_date()).withColumn("current_timestamp", current_timestamp())
date_df.show()
Example 10: Working with JSON Data
# Reading JSON file
json_df = spark.read.json("path/to/file.json")
json_df.show()
# Writing JSON file
df.write.json("path/to/output.json")
Example 11: SQL Queries
df.createOrReplaceTempView("table")
result_df = spark.sql("SELECT * FROM table WHERE column1 > 100")
result_df.show()
Example 12: Using Window Functions
from pyspark.sql.window import Window
from pyspark.sql.functions import rank
window_spec = Window.partitionBy("category").orderBy("sales")
ranked_df = df.withColumn("rank", rank().over(window_spec))
ranked_df.show()
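The same window specification works with other ranking functions; row_number(), for instance, is a common way to keep one row per group. A minimal sketch reusing window_spec:
from pyspark.sql.functions import row_number
# Keep the single lowest-sales row per category
top_per_category_df = df.withColumn("rn", row_number().over(window_spec)).filter("rn = 1")
top_per_category_df.show()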
Example 13: Pivoting Data
pivot_df = df.groupBy("category").pivot("year").sum("sales")
pivot_df.show()
Example 14: Unpivoting Data
unpivot_df = pivot_df.selectExpr("category", "stack(3, '2020', `2020`, '2021', `2021`, '2022', `2022`) as (year, sales)")
unpivot_df.show()
Example 15: Sampling Data
sampled_df = df.sample(fraction=0.1, seed=0)
sampled_df.show()
Example 16: Creating a DataFrame from RDD
rdd = spark.sparkContext.parallelize([(1, "a"), (2, "b"), (3, "c")])
rdd_df = spark.createDataFrame(rdd, ["id", "value"])
rdd_df.show()
Example 17: Reading Avro Files
avro_df = spark.read.format("avro").load("path/to/file.avro")
avro_df.show()
Example 18: Writing Avro Files
df.write.format("avro").save("path/to/output.avro")
Example 19: Handling Complex Types
from pyspark.sql.functions import explode
complex_df = df.withColumn("exploded_column", explode(df.complex_column))
complex_df.show()
Example 20: Machine Learning with PySpark
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
# Prepare data for ML
assembler = VectorAssembler(inputCols=["column1", "column2"], outputCol="features")
ml_df = assembler.transform(df)
# Train a Linear Regression model
lr = LinearRegression(featuresCol="features", labelCol="label")
lr_model = lr.fit(ml_df)
# Make predictions
predictions = lr_model.transform(ml_df)
predictions.show()
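To gauge model quality, the predictions can be scored with an evaluator. A minimal sketch; note that fitting and evaluating on the same data, as here, only demonstrates the API:
from pyspark.ml.evaluation import RegressionEvaluator
# Compute root-mean-square error of the predictions
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print(rmse)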