Spark Transformations and Actions

Spark transformations operate on RDDs and DataFrames to create new RDDs and DataFrames; common transformations include map, filter, groupBy, join, and distinct. Spark actions, such as collect, count, first, take, reduce, and saveAsTextFile, return values to the driver program or write output. Transformations are lazy: they only record the operations to perform, and no computation happens until an action triggers job execution.
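
A minimal sketch of this laziness, assuming a local PySpark installation (the dataset and app name below are invented for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
    rdd = spark.sparkContext.parallelize(range(10))

    doubled = rdd.map(lambda x: x * 2)            # transformation: builds lineage, no job runs
    evens = doubled.filter(lambda x: x % 4 == 0)  # still no job

    print(evens.count())    # action: triggers execution of the whole lineage -> 5
    print(evens.collect())  # another action: [0, 4, 8, 12, 16]

    spark.stop()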

Transformations

map
  Description: Applies a function to each element of the RDD/DataFrame and returns a new RDD/DataFrame.
  Example: rdd.map(lambda x: x * 2)

filter
  Description: Keeps only the elements of the RDD/DataFrame that satisfy a condition and returns a new RDD/DataFrame.
  Example: rdd.filter(lambda x: x % 2 == 0)

groupBy
  Description: Groups the elements of the RDD/DataFrame by a key and returns a grouped RDD/DataFrame.
  Example: rdd.groupBy(lambda x: x % 2)

distinct
  Description: Returns a new RDD/DataFrame containing only the distinct elements of the original.
  Example: rdd.distinct()

join
  Description: Joins two RDDs/DataFrames on a common key and returns a new RDD/DataFrame.
  Example: df1.join(df2, on="common_column")

sort
  Description: Sorts the rows of the DataFrame by the specified column and returns a new DataFrame.
  Example: df.sort("column_name")

withColumn
  Description: Adds a new column to the DataFrame, or replaces an existing one, and returns a new DataFrame.
  Example: df.withColumn("new_column", df["existing_column"] + 1)

drop
  Description: Removes the specified column(s) from the DataFrame and returns a new DataFrame.
  Example: df.drop("column_name")

groupBy + agg
  Description: Groups the rows of the DataFrame by a key and computes aggregate functions over each group.
  Example: df.groupBy("key_column").agg({"value_column": "sum"})

pivot
  Description: Pivots the DataFrame to cross-tabulate the data and returns a new DataFrame.
  Example: df.groupBy("key_column").pivot("pivot_column").sum("value_column")
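
The short sketch below chains several of these DataFrame transformations on a toy dataset; the column names ("dept", "salary") and the figures are invented for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("transform-demo").getOrCreate()

    df = spark.createDataFrame(
        [("eng", 100), ("eng", 120), ("hr", 90), ("hr", 95)],
        ["dept", "salary"],
    )

    result = (
        df.withColumn("salary_k", F.col("salary") / 1000)  # withColumn: add a derived column
          .filter(F.col("salary") > 90)                    # filter: drop the 90-salary row
          .groupBy("dept")                                 # groupBy + agg: per-department total
          .agg(F.sum("salary").alias("total_salary"))
          .sort("dept")                                    # sort: order the output
    )
    result.show()  # show() is an action; only here does the plan execute
    spark.stop()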

Actions
collect
  Description: Returns all elements of the RDD/DataFrame to the driver as a list (not recommended for large datasets).
  Example: rdd.collect()

count
  Description: Returns the number of elements in the RDD/DataFrame.
  Example: rdd.count()

first
  Description: Returns the first element of the RDD/DataFrame.
  Example: rdd.first()

take
  Description: Returns the first n elements of the RDD/DataFrame.
  Example: rdd.take(5)

reduce
  Description: Aggregates the elements of the RDD using the specified function.
  Example: rdd.reduce(lambda x, y: x + y)

foreach
  Description: Applies a function to each element of the RDD (used for side effects).
  Example: rdd.foreach(lambda x: print(x))

saveAsTextFile
  Description: Saves the RDD as text files in the specified directory.
  Example: rdd.saveAsTextFile("output_dir")

countByKey
  Description: Counts the occurrences of each key in an RDD of (key, value) pairs.
  Example: rdd.countByKey()

foreachPartition
  Description: Applies a function to each partition of the RDD (used for side effects).
  Example: rdd.foreachPartition(lambda partition: process_partition(partition))

toPandas
  Description: Converts a DataFrame to a pandas DataFrame (useful only for small datasets).
  Example: df.toPandas()
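
A sketch exercising several of these actions on a small pair RDD, assuming a local PySpark installation (the toPandas call additionally requires pandas on the driver):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("action-demo").getOrCreate()
    sc = spark.sparkContext

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

    print(pairs.count())       # 3
    print(pairs.first())       # ('a', 1)
    print(pairs.take(2))       # [('a', 1), ('b', 2)]
    print(pairs.countByKey())  # defaultdict(<class 'int'>, {'a': 2, 'b': 1})
    print(pairs.map(lambda kv: kv[1]).reduce(lambda x, y: x + y))  # 1 + 2 + 3 = 6

    # toPandas pulls the whole result to the driver; like collect, it is
    # only safe for small datasets.
    print(spark.createDataFrame(pairs, ["key", "value"]).toPandas())

    spark.stop()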
