Spark Transformations and Actions

Spark transformations operate on RDDs and DataFrames to create new RDDs and DataFrames; common transformations include map, filter, groupBy, join, and distinct. Spark actions, such as collect, count, first, take, reduce, and saveAsTextFile, return values to the driver program or write output. Transformations are lazy: they only record the operations to perform, and no computation happens until an action triggers job execution.
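
A minimal sketch of this laziness, assuming a local PySpark installation (the dataset and app name below are invented for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
    rdd = spark.sparkContext.parallelize(range(10))

    doubled = rdd.map(lambda x: x * 2)            # transformation: builds lineage, no job runs
    evens = doubled.filter(lambda x: x % 4 == 0)  # still no job

    print(evens.count())    # action: triggers execution of the whole lineage -> 5
    print(evens.collect())  # another action: [0, 4, 8, 12, 16]

    spark.stop()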

Transformations

map
  Description: Applies a function to each element of the RDD/DataFrame and returns a new RDD/DataFrame.
  Example: rdd.map(lambda x: x * 2)

filter
  Description: Keeps only the elements of the RDD/DataFrame that satisfy a condition and returns a new RDD/DataFrame.
  Example: rdd.filter(lambda x: x % 2 == 0)

groupBy
  Description: Groups the elements of the RDD/DataFrame by a key and returns a grouped RDD/DataFrame.
  Example: rdd.groupBy(lambda x: x % 2)

distinct
  Description: Returns a new RDD/DataFrame containing only the distinct elements of the original.
  Example: rdd.distinct()

join
  Description: Joins two RDDs/DataFrames on a common key and returns a new RDD/DataFrame.
  Example: df1.join(df2, on="common_column")

sort
  Description: Sorts the rows of the DataFrame by the specified column and returns a new DataFrame.
  Example: df.sort("column_name")

withColumn
  Description: Adds a new column to the DataFrame, or replaces an existing one, and returns a new DataFrame.
  Example: df.withColumn("new_column", df["existing_column"] + 1)

drop
  Description: Removes the specified column(s) from the DataFrame and returns a new DataFrame.
  Example: df.drop("column_name")

groupBy + agg
  Description: Groups the rows of the DataFrame by a key and computes aggregate functions over each group.
  Example: df.groupBy("key_column").agg({"value_column": "sum"})

pivot
  Description: Pivots the DataFrame to cross-tabulate the data and returns a new DataFrame.
  Example: df.groupBy("key_column").pivot("pivot_column").sum("value_column")
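
The short sketch below chains several of these DataFrame transformations on a toy dataset; the column names ("dept", "salary") and the figures are invented for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("transform-demo").getOrCreate()

    df = spark.createDataFrame(
        [("eng", 100), ("eng", 120), ("hr", 90), ("hr", 95)],
        ["dept", "salary"],
    )

    result = (
        df.withColumn("salary_k", F.col("salary") / 1000)  # withColumn: add a derived column
          .filter(F.col("salary") > 90)                    # filter: drop the 90-salary row
          .groupBy("dept")                                 # groupBy + agg: per-department total
          .agg(F.sum("salary").alias("total_salary"))
          .sort("dept")                                    # sort: order the output
    )
    result.show()  # show() is an action; only here does the plan execute
    spark.stop()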

Actions
collect
  Description: Returns all elements of the RDD/DataFrame to the driver as a list (not recommended for large datasets).
  Example: rdd.collect()

count
  Description: Returns the number of elements in the RDD/DataFrame.
  Example: rdd.count()

first
  Description: Returns the first element of the RDD/DataFrame.
  Example: rdd.first()

take
  Description: Returns the first n elements of the RDD/DataFrame.
  Example: rdd.take(5)

reduce
  Description: Aggregates the elements of the RDD using the specified function.
  Example: rdd.reduce(lambda x, y: x + y)

foreach
  Description: Applies a function to each element of the RDD (used for side effects).
  Example: rdd.foreach(lambda x: print(x))

saveAsTextFile
  Description: Saves the RDD as text files in the specified directory.
  Example: rdd.saveAsTextFile("output_dir")

countByKey
  Description: Counts the occurrences of each key in an RDD of (key, value) pairs.
  Example: rdd.countByKey()

foreachPartition
  Description: Applies a function to each partition of the RDD (used for side effects).
  Example: rdd.foreachPartition(lambda partition: process_partition(partition))

toPandas
  Description: Converts a DataFrame to a pandas DataFrame (useful only for small datasets).
  Example: df.toPandas()
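
A sketch exercising several of these actions on a small pair RDD, assuming a local PySpark installation (the toPandas call additionally requires pandas on the driver):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("action-demo").getOrCreate()
    sc = spark.sparkContext

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

    print(pairs.count())       # 3
    print(pairs.first())       # ('a', 1)
    print(pairs.take(2))       # [('a', 1), ('b', 2)]
    print(pairs.countByKey())  # defaultdict(<class 'int'>, {'a': 2, 'b': 1})
    print(pairs.map(lambda kv: kv[1]).reduce(lambda x, y: x + y))  # 1 + 2 + 3 = 6

    # toPandas pulls the whole result to the driver; like collect, it is
    # only safe for small datasets.
    print(spark.createDataFrame(pairs, ["key", "value"]).toPandas())

    spark.stop()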
