PySpark Cheatsheet
df.union Returns a new DataFrame containing the union of rows from two DataFrames.
df.subtract Returns a new DataFrame containing the rows of this DataFrame that are not in the other (the PySpark counterpart of Scala's except).
df.write Interface for saving the content of the DataFrame out into external storage systems.
df.dtypes Returns a list of tuples containing the column name and data type.
spark.read
Definition:
Returns a DataFrameReader that provides methods for reading data from various data sources.
Example:
df.show
Definition:
Displays the content of the DataFrame in a tabular format.
Example:
df.select
Definition:
Selects a set of columns from a DataFrame.
Example:
df.filter
Definition:
Filters rows that meet a specified condition.
Example:
df.groupBy
Definition:
Groups the DataFrame using the specified columns.
Example:
df.agg
Definition:
Applies aggregate functions to the DataFrame.
Example:
df.join
Definition:
Joins two DataFrames based on a common column(s).
Example:
df.withColumn
Definition:
Adds a new column or replaces an existing column in the DataFrame.
Example:
df.drop
Definition:
Drops specified columns from the DataFrame.
Example:
# Drop column1
dropped_df = df.drop("column1")
dropped_df.show()
df.distinct
Definition:
Returns a new DataFrame with distinct rows.
Example:
df.union
Definition:
Returns a new DataFrame containing the union of rows from two DataFrames. Columns are matched by position, and duplicate rows are kept.
Example:
df.intersect
Definition:
Returns a new DataFrame containing the distinct rows that appear in both DataFrames.
Example:
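df.subtract
Definition:
Returns a new DataFrame containing the rows of this DataFrame that are not in the other. PySpark has no df.except, because except is a Python keyword; df.subtract removes duplicates, and df.exceptAll keeps them.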
Example:
df.sort
Definition:
Sorts the DataFrame by the specified columns, in ascending order by default.
Example:
# Sort by column1
sorted_df = df.sort("column1")
sorted_df.show()
df.orderBy
Definition:
Orders the DataFrame by the specified columns. It is an alias for df.sort.
Example:
# Order by column1
ordered_df = df.orderBy("column1")
ordered_df.show()
df.cache
Definition:
Caches the DataFrame with the default storage level (memory, spilling to disk) so that later actions can reuse it instead of recomputing it.
Example:
df.persist
Definition:
Persists the DataFrame with the specified storage level.
Example:
df.repartition
Definition:
Repartitions the DataFrame into the specified number of partitions.
Example:
# Repartition to 10 partitions
repartitioned_df = df.repartition(10)
repartitioned_df.show()
df.coalesce
Definition:
Reduces the number of partitions in the DataFrame.
Example:
# Coalesce to 2 partitions
coalesced_df = df.coalesce(2)
coalesced_df.show()
df.na.drop
Definition:
Drops rows containing null values.
Example:
df.na.fill
Definition:
Fills null values with a specified value.
Example:
df.na.replace
Definition:
Replaces specified values in the DataFrame.
Example:
df.describe
Definition:
Computes basic statistics (count, mean, stddev, min, max) for numeric and string columns.
Example:
df.summary
Definition:
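Computes the requested statistics for numeric and string columns. With no arguments it returns count, mean, stddev, min, approximate quartiles (25%, 50%, 75%), and max.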
Example:
df.crosstab
Definition:
Computes a cross-tabulation of two columns.
Example:
df.stat.corr
Definition:
Computes correlation between two columns.
Example:
df.stat.cov
Definition:
Computes covariance between two columns.
Example:
df.stat.freqItems
Definition:
Finds frequent items for columns.
Example:
df.stat.approxQuantile
Definition:
Calculates approximate quantiles.
Example:
df.stat.sampleBy
Definition:
Returns a stratified sample without replacement based on the fraction given on each stratum.
Example:
df.withColumnRenamed
Definition:
Renames an existing column in the DataFrame.
Example:
df.dropDuplicates
Definition:
Drops duplicate rows based on specified columns.
Example:
df.sample
Definition:
Returns a sampled subset of the DataFrame.
Example:
df.randomSplit
Definition:
Randomly splits the DataFrame into multiple DataFrames.
Example:
# Split the DataFrame into training (70%) and testing (30%) sets
train_df, test_df = df.randomSplit([0.7, 0.3], seed=0)
train_df.show()
test_df.show()
df.take
Definition:
Returns the first n rows of the DataFrame.
Example:
df.head
Definition:
Returns the first row, or a list of the first n rows when n is given.
Example:
df.first
Definition:
Returns the first row of the DataFrame.
Example:
df.collect
Definition:
Returns all rows as a list of Row objects.
Example:
df.toPandas
Definition:
Converts the DataFrame to a Pandas DataFrame.
Example:
df.write
Definition:
Interface for saving the content of the DataFrame out into external storage systems.
Example:
df.schema
Definition:
Returns the schema of the DataFrame.
Example:
df.dtypes
Definition:
Returns a list of tuples containing the column name and data type.
Example:
df.columns
Definition:
Returns a list of column names.
Example:
df.count
Definition:
Returns the number of rows in the DataFrame.
Example:
# Count of rows
row_count = df.count()
print(row_count)
df.isEmpty
Definition:
Returns whether the DataFrame is empty.
Example:
Additional Examples
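# Word count: split each line into words, then count occurrences of each word
from pyspark.sql.functions import explode, split
text_df = spark.read.text("path/to/textfile.txt")
word_df = text_df.select(explode(split(text_df.value, " ")).alias("word"))
word_count_df = word_df.groupBy("word").count()
word_count_df.show()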
# Python function that upper-cases a string
def my_upper(s):
    return s.upper()
json_df = spark.read.json("path/to/jsonfile.json")
json_df.show()
# Register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView("table")
result_df = spark.sql("SELECT * FROM table WHERE column1 > 100")
result_df.show()
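# Rank rows by sales within each category
from pyspark.sql.functions import rank
from pyspark.sql.window import Window
window_spec = Window.partitionBy("category").orderBy("sales")
ranked_df = df.withColumn("rank", rank().over(window_spec))
ranked_df.show()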
Example 13: Pivoting Data
pivot_df = df.groupBy("category").pivot("year").sum("sales")
pivot_df.show()
# Reading and writing Avro requires the external spark-avro package
avro_df = spark.read.format("avro").load("path/to/avrofile.avro")
avro_df.show()
df.write.format("avro").save("path/to/output.avro")
# Make predictions
predictions = lr_model.transform(ml_df)
predictions.show()