PySpark RDD Operations
1. RDD Creation
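A minimal creation sketch, assuming only that PySpark is installed; the input path is a placeholder, and later snippets reuse this SparkContext sc:

from pyspark.sql import SparkSession

# Entry point for the examples below
spark = SparkSession.builder.appName("rdd-cheatsheet").getOrCreate()
sc = spark.sparkContext

# From an in-memory collection (numSlices controls the partition count)
rdd = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

# From a text file, one element per line (placeholder path)
lines = sc.textFile("data/input.txt")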
2. RDD Transformations
● Map: rdd.map(lambda x: x * 2)
● FlatMap: rdd.flatMap(lambda x: [x, x * 2, x * 3])
● Filter: rdd.filter(lambda x: x > 10)
● Distinct: rdd.distinct()
● Sample: rdd.sample(withReplacement=True, fraction=0.5)
● Union: rdd1.union(rdd2)
● Intersection: rdd1.intersection(rdd2)
● Subtract: rdd1.subtract(rdd2)
● Cartesian: rdd1.cartesian(rdd2)
● Zip: rdd1.zip(rdd2)
● ZipWithIndex: rdd.zipWithIndex()
● GroupBy: rdd.groupBy(lambda x: x % 2)
● SortBy: rdd.sortBy(lambda x: x, ascending=False)
● PartitionBy: rdd.partitionBy(3)
● MapPartitions: rdd.mapPartitions(lambda partition: [x * 2 for x in partition])
● MapPartitionsWithIndex: rdd.mapPartitionsWithIndex(lambda index, partition: [(index, x) for x in partition])
● FlatMapValues: rdd.flatMapValues(lambda x: [x, x * 2])
● CombineByKey: rdd.combineByKey(lambda value: (value, 1), lambda acc, value: (acc[0] + value, acc[1] + 1), lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))
● FoldByKey: rdd.foldByKey(0, lambda acc, value: acc + value)
● ReduceByKey: rdd.reduceByKey(lambda a, b: a + b)
● AggregateByKey: rdd.aggregateByKey((0, 0), lambda acc, value: (acc[0] + value, acc[1] + 1), lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))
● Join: rdd1.join(rdd2)
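A short sketch chaining several of the transformations above (assumes the SparkContext sc from section 1; data and results are illustrative):

nums = sc.parallelize([1, 2, 3, 4, 5, 6])
doubled = nums.map(lambda x: x * 2)             # 2, 4, 6, 8, 10, 12
big = doubled.filter(lambda x: x > 6)           # 8, 10, 12
pairs = big.map(lambda x: (x % 4, x))           # (0, 8), (2, 10), (0, 12)
totals = pairs.reduceByKey(lambda a, b: a + b)  # (0, 20), (2, 10)
print(totals.collect())                         # order may vary

Transformations are lazy; nothing executes until an action such as collect() is called.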
3. RDD Actions
● Collect: rdd.collect()
● Take: rdd.take(5)
● First: rdd.first()
● Count: rdd.count()
● CountByValue: rdd.countByValue()
● Reduce: rdd.reduce(lambda a, b: a + b)
● Aggregate: rdd.aggregate((0, 0), lambda acc, value: (acc[0] + value, acc[1] + 1), lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))
● Fold: rdd.fold(0, lambda acc, value: acc + value)
● Max: rdd.max()
● Min: rdd.min()
● Sum: rdd.sum()
● Mean: rdd.mean()
● Variance: rdd.variance()
● Stdev: rdd.stdev()
● TakeSample: rdd.takeSample(withReplacement=True, num=5)
● Foreach: rdd.foreach(lambda x: print(x))
● ForeachPartition: rdd.foreachPartition(lambda partition: [print(x) for x in partition])
● Top: rdd.top(5)
● TakeOrdered: rdd.takeOrdered(5, lambda x: -x)
● SaveAsTextFile: rdd.saveAsTextFile("output/")
● SaveAsPickleFile: rdd.saveAsPickleFile("output/")
● Cache: rdd.cache()
● Persist: rdd.persist(storageLevel=pyspark.StorageLevel.MEMORY_AND_DISK)
● Unpersist: rdd.unpersist()
● Checkpoint: rdd.checkpoint()
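A sketch of the aggregate-style actions above, computing a sum and count in one pass to get a mean (same assumed sc; values are illustrative):

nums = sc.parallelize([3, 1, 4, 1, 5, 9])
print(nums.count())        # 6
print(nums.sum())          # 23
print(nums.top(2))         # [9, 5]

total, count = nums.aggregate(
    (0, 0),
    lambda acc, value: (acc[0] + value, acc[1] + 1),  # fold one value into a partition's accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]),          # merge accumulators across partitions
)
print(total / count)       # 3.8333...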
4. RDD Key-Value Pair Operations
● Union: rdd1.union(rdd2)
● Intersection: rdd1.intersection(rdd2)
● Subtract: rdd1.subtract(rdd2)
● Cartesian: rdd1.cartesian(rdd2)
● ReduceByKey: rdd.reduceByKey(lambda a, b: a + b)
● GroupByKey: rdd.groupByKey()
● CombineByKey: rdd.combineByKey(lambda value: (value, 1), lambda acc, value: (acc[0] + value, acc[1] + 1), lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))
● AggregateByKey: rdd.aggregateByKey((0, 0), lambda acc, value: (acc[0] + value, acc[1] + 1), lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))
● FoldByKey: rdd.foldByKey(0, lambda acc, value: acc + value)
● Join: rdd1.join(rdd2)
● LeftOuterJoin: rdd1.leftOuterJoin(rdd2)
● RightOuterJoin: rdd1.rightOuterJoin(rdd2)
● FullOuterJoin: rdd1.fullOuterJoin(rdd2)
● MapValues: rdd.mapValues(lambda x: x * 2)
● FlatMapValues: rdd.flatMapValues(lambda x: [x, x * 2])
● CountByKey: rdd.countByKey()
● Lookup: rdd.lookup(key)
● SortByKey: rdd.sortByKey(ascending=False)
● CoGroup: rdd1.cogroup(rdd2)
● Sum: rdd.sum()
● Mean: rdd.mean()
● Variance: rdd.variance()
● Stdev: rdd.stdev()
● Histogram: rdd.histogram(buckets=10)
● Collect: rdd.collect()
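To make the combineByKey/aggregateByKey one-liners above concrete, a minimal per-key average sketch (same assumed sc; sample data is illustrative):

sales = sc.parallelize([("a", 10), ("b", 4), ("a", 30), ("b", 6), ("a", 20)])

sum_count = sales.combineByKey(
    lambda value: (value, 1),                                   # createCombiner: first value seen for a key
    lambda acc, value: (acc[0] + value, acc[1] + 1),            # mergeValue: add a value within a partition
    lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]),  # mergeCombiners: merge across partitions
)
averages = sum_count.mapValues(lambda p: p[0] / p[1])
print(averages.collect())   # e.g. [('a', 20.0), ('b', 5.0)] (order may vary)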
5. RDD Partitioning & Optimization
● Caching: rdd.cache()
● Repartitioning: rdd.repartition(numPartitions=10)
● Coalesce: rdd.coalesce(numPartitions=5)
● Broadcast Variables: broadcast_var = sc.broadcast(large_data)
● Accumulator Variables: accumulator = sc.accumulator(0)
● Kryo Serialization: conf = SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
● Spark SQL: df = sqlContext.createDataFrame(rdd, schema)
● DataFrame Operations: df.filter(df.age > 18)
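A closing sketch tying the partitioning, persistence, shared-variable, and DataFrame one-liners together; it assumes the SparkSession spark and SparkContext sc from section 1, and all names and data are illustrative:

from pyspark import StorageLevel

events = sc.parallelize([("US", 1), ("DE", 2), ("US", 3), ("FR", 4)], 8)
events = events.coalesce(2)                     # shrink the partition count without a full shuffle
events.persist(StorageLevel.MEMORY_AND_DISK)    # keep the RDD around for reuse
print(events.getNumPartitions())                # 2

# Broadcast a small lookup table once to every executor
names = sc.broadcast({"US": "United States", "DE": "Germany", "FR": "France"})
named = events.map(lambda kv: (names.value[kv[0]], kv[1]))

# Count processed records with an accumulator (updated as a side effect of the action)
counter = sc.accumulator(0)
named.foreach(lambda kv: counter.add(1))
print(counter.value)                            # 4

# Hand the result to Spark SQL via the SparkSession entry point
df = spark.createDataFrame(named, ["country", "value"])
df.filter(df.value > 1).show()

events.unpersist()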