PySpark RDD Operations

This cheatsheet covers PySpark RDD operations: RDD creation, transformations, actions, persistence and caching, partitioning, set operations, pair operations, numeric operations, data formats, compression, serialization, partitioners, execution, debugging, and optimization. Each section gives one-line code snippets for the relevant functions, making it a quick reference for working with PySpark RDDs.


[ PySpark RDD (Resilient Distributed Datasets) Operations ] [ cheatsheet ]

1. RDD Creation

● Create RDD from a list: rdd = sc.parallelize([1, 2, 3, 4, 5])
● Create RDD from a file: rdd = sc.textFile("file.txt")
● Create RDD from a directory: rdd = sc.wholeTextFiles("directory/")
● Create empty RDD: rdd = sc.emptyRDD()
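
A minimal sketch putting the creation methods above together; the local master URL, app name, and file path are assumptions:

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-cheatsheet")      # local master and app name are placeholders
nums = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)  # distribute a Python list across 2 partitions
lines = sc.textFile("file.txt")                      # one record per line (lazy until an action runs)
print(nums.collect())                                # [1, 2, 3, 4, 5]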

2. RDD Transformations

● Map: rdd.map(lambda x: x * 2)
● FlatMap: rdd.flatMap(lambda x: [x, x * 2, x * 3])
● Filter: rdd.filter(lambda x: x > 10)
● Distinct: rdd.distinct()
● Sample: rdd.sample(withReplacement=True, fraction=0.5)
● Union: rdd1.union(rdd2)
● Intersection: rdd1.intersection(rdd2)
● Subtract: rdd1.subtract(rdd2)
● Cartesian: rdd1.cartesian(rdd2)
● Zip: rdd1.zip(rdd2)
● ZipWithIndex: rdd.zipWithIndex()
● GroupBy: rdd.groupBy(lambda x: x % 2)
● SortBy: rdd.sortBy(lambda x: x, ascending=False)
● PartitionBy: rdd.partitionBy(3)
● MapPartitions: rdd.mapPartitions(lambda partition: [x * 2 for x in partition])
● MapPartitionsWithIndex: rdd.mapPartitionsWithIndex(lambda index, partition: [(index, x) for x in partition])
● FlatMapValues: rdd.flatMapValues(lambda x: [x, x * 2])
● CombineByKey: rdd.combineByKey(lambda value: (value, 1), lambda acc, value: (acc[0] + value, acc[1] + 1), lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))
● FoldByKey: rdd.foldByKey(0, lambda acc, value: acc + value)
● ReduceByKey: rdd.reduceByKey(lambda a, b: a + b)
● AggregateByKey: rdd.aggregateByKey((0, 0), lambda acc, value: (acc[0] + value, acc[1] + 1), lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))
● Join: rdd1.join(rdd2)
● LeftOuterJoin: rdd1.leftOuterJoin(rdd2)
● RightOuterJoin: rdd1.rightOuterJoin(rdd2)
● FullOuterJoin: rdd1.fullOuterJoin(rdd2)
● Cogroup: rdd1.cogroup(rdd2)
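
A minimal sketch chaining several of the transformations above; the sample data is an assumption:

lines = sc.parallelize(["spark makes rdds", "rdds are resilient"])
words = lines.flatMap(lambda line: line.split(" "))          # one record per word
pairs = words.map(lambda w: (w, 1))                          # (word, 1) pairs
counts = pairs.reduceByKey(lambda a, b: a + b)               # sum the counts per word
ordered = counts.sortBy(lambda kv: kv[1], ascending=False)   # most frequent words first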

3. RDD Actions

● Collect: rdd.collect()
● Take: rdd.take(5)
● First: rdd.first()
● Count: rdd.count()
● CountByValue: rdd.countByValue()
● Reduce: rdd.reduce(lambda a, b: a + b)
● Aggregate: rdd.aggregate((0, 0), lambda acc, value: (acc[0] + value, acc[1] + 1), lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))
● Fold: rdd.fold(0, lambda acc, value: acc + value)
● Max: rdd.max()
● Min: rdd.min()
● Sum: rdd.sum()
● Mean: rdd.mean()
● Variance: rdd.variance()
● Stdev: rdd.stdev()
● TakeSample: rdd.takeSample(withReplacement=True, num=5)
● Foreach: rdd.foreach(lambda x: print(x))
● ForeachPartition: rdd.foreachPartition(lambda partition: [print(x) for x in partition])
● Top: rdd.top(5)
● TakeOrdered: rdd.takeOrdered(5, lambda x: -x)
● SaveAsTextFile: rdd.saveAsTextFile("output/")
● SaveAsPickleFile: rdd.saveAsPickleFile("output/")
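
A minimal sketch of the aggregate pattern listed above, computing a mean with a (sum, count) accumulator; the sample data is an assumption:

nums = sc.parallelize([1, 2, 3, 4, 5])
total, count = nums.aggregate(
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),    # fold a value into a partition's accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]))    # merge accumulators across partitions
print(total / count)                            # 3.0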

4. RDD Persistence and Caching

● Cache: rdd.cache()
● Persist: rdd.persist(storageLevel=pyspark.StorageLevel.MEMORY_AND_DISK)
● Unpersist: rdd.unpersist()
● Checkpoint: rdd.checkpoint()
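
A minimal sketch of caching plus checkpointing, assuming an existing RDD named rdd; the checkpoint directory is an assumption, and sc.setCheckpointDir must be called before checkpoint():

from pyspark import StorageLevel

sc.setCheckpointDir("/tmp/checkpoints")              # required once per SparkContext; path is a placeholder
cached = rdd.persist(StorageLevel.MEMORY_AND_DISK)
cached.checkpoint()                                  # mark the RDD for checkpointing
cached.count()                                       # first action materializes the cache and writes the checkpoint
cached.unpersist()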

5. RDD Partitioning

● Repartition: rdd.repartition(numPartitions=10)
● Coalesce: rdd.coalesce(numPartitions=5)
● GetNumPartitions: rdd.getNumPartitions()
● Glom: rdd.glom()
● Zip: rdd1.zip(rdd2)
● ZipPartitions: not available in the PySpark RDD API (Scala/Java only); when both RDDs have identical partitioning, rdd1.zip(rdd2).mapPartitions(...) is a rough substitute
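
A minimal sketch inspecting and changing partitioning:

rdd = sc.parallelize(range(10), 2)
print(rdd.getNumPartitions())        # 2
print(rdd.glom().collect())          # one list per partition
wider = rdd.repartition(4)           # full shuffle into 4 partitions
narrower = wider.coalesce(2)         # shrink the partition count, avoiding a full shuffle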

6. RDD Set Operations

● Union: rdd1.union(rdd2)
● Intersection: rdd1.intersection(rdd2)
● Subtract: rdd1.subtract(rdd2)
● Cartesian: rdd1.cartesian(rdd2)
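
A minimal sketch of the set operations above; the sample data is an assumption:

a = sc.parallelize([1, 2, 3, 4])
b = sc.parallelize([3, 4, 5])
print(a.union(b).collect())          # [1, 2, 3, 4, 3, 4, 5] (duplicates kept)
print(a.intersection(b).collect())   # [3, 4] (order may vary)
print(a.subtract(b).collect())       # [1, 2] (order may vary)
print(a.cartesian(b).count())        # 12 pairs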

7. RDD Pair Operations

● ReduceByKey: rdd.reduceByKey(lambda a, b: a + b)
● GroupByKey: rdd.groupByKey()
● CombineByKey: rdd.combineByKey(lambda value: (value, 1), lambda acc, value: (acc[0] + value, acc[1] + 1), lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))
● AggregateByKey: rdd.aggregateByKey((0, 0), lambda acc, value: (acc[0] + value, acc[1] + 1), lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))
● FoldByKey: rdd.foldByKey(0, lambda acc, value: acc + value)
● Join: rdd1.join(rdd2)
● LeftOuterJoin: rdd1.leftOuterJoin(rdd2)
● RightOuterJoin: rdd1.rightOuterJoin(rdd2)
● FullOuterJoin: rdd1.fullOuterJoin(rdd2)
● MapValues: rdd.mapValues(lambda x: x * 2)
● FlatMapValues: rdd.flatMapValues(lambda x: [x, x * 2])
● CountByKey: rdd.countByKey()
● LookupByKey: rdd.lookup(key)
● SortByKey: rdd.sortByKey(ascending=False)
● CoGroupByKey: rdd1.cogroup(rdd2)
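
A minimal sketch combining the pair operations above; the sample data is an assumption:

sales = sc.parallelize([("a", 10), ("b", 20), ("a", 30)])
names = sc.parallelize([("a", "apples"), ("b", "bananas")])
totals = sales.reduceByKey(lambda x, y: x + y)          # ("a", 40), ("b", 20)
print(totals.join(names).collect())                     # [("a", (40, "apples")), ("b", (20, "bananas"))] (order may vary)
sums_counts = sales.combineByKey(lambda v: (v, 1),
                                 lambda acc, v: (acc[0] + v, acc[1] + 1),
                                 lambda a, b: (a[0] + b[0], a[1] + b[1]))
avgs = sums_counts.mapValues(lambda s: s[0] / s[1])     # per-key average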

8. RDD Numeric Operations

● Sum: rdd.sum()
● Mean: rdd.mean()
● Variance: rdd.variance()
● Stdev: rdd.stdev()
● Histogram: rdd.histogram(10)
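
A minimal sketch of the numeric helpers above; the sample data is an assumption:

nums = sc.parallelize([1.0, 2.0, 3.0, 4.0])
print(nums.sum(), nums.mean(), nums.stdev())   # 10.0 2.5 ~1.118
print(nums.histogram(2))                       # (bucket boundaries, counts per bucket)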

9. RDD Data Formats

● CSV: rdd = sc.textFile("file.csv").map(lambda line: line.split(","))
● JSON: rdd = sc.textFile("file.json").map(lambda line: json.loads(line))
● Parquet: rdd = sqlContext.read.parquet("file.parquet").rdd
● Avro: rdd = sqlContext.read.format("com.databricks.spark.avro").load("file.avro").rdd
● SequenceFile: rdd = sc.sequenceFile("file.seq", keyClass, valueClass)
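
A minimal sketch of the text-based formats above; the file paths are placeholders, and json must be imported:

import json

csv_rdd = sc.textFile("file.csv").map(lambda line: line.split(","))   # list of fields per record
json_rdd = sc.textFile("file.json").map(json.loads)                   # assumes one JSON object per line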

10. RDD Compression

● Compress: rdd.map(lambda x: (x, None)).saveAsSequenceFile("output/", compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
● Decompress: rdd = sc.sequenceFile("output/") (compressed files are detected and decompressed automatically on read)
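
A minimal sketch of writing compressed text output, assuming an existing RDD named rdd; the output path is a placeholder:

rdd.saveAsTextFile("output_gz/",
                   compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")  # gzip-compressed part files
back = sc.textFile("output_gz/")   # compressed input is decompressed transparently on read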

11. RDD Serialization

● Kryo Serializer: conf = SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
● Java Serializer: conf = SparkConf().set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")

12. RDD Partitioner

● HashPartitioner: rdd.partitionBy(numPartitions=10, partitionFunc=lambda x: hash(x))
● RangePartitioner: rdd.partitionBy(numPartitions=10, partitionFunc=lambda x: int(x / 10))
● CustomPartitioner: rdd.partitionBy(numPartitions=10, partitionFunc=lambda x: customPartitionFunction(x))
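
A minimal sketch of a custom partitionFunc; note that partitionBy applies to pair RDDs, and partitionFunc maps a key to a partition index. The sample data is an assumption:

pairs = sc.parallelize([("apple", 1), ("banana", 2), ("cherry", 3)])
by_hash = pairs.partitionBy(4)                                              # default hash of the key
by_initial = pairs.partitionBy(2, partitionFunc=lambda k: 0 if k < "b" else 1)
print(by_initial.glom().collect())                                          # keys before "b" in partition 0, the rest in partition 1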

13. RDD Execution

● Collect: rdd.collect()
● Reduce: rdd.reduce(lambda a, b: a + b)
● Aggregate: rdd.aggregate((0, 0), lambda acc, value: (acc[0] + value, acc[1] + 1), lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))
● Take: rdd.take(5)
● First: rdd.first()
● TakeSample: rdd.takeSample(withReplacement=True, num=5)
● Count: rdd.count()
● CountByValue: rdd.countByValue()
● Foreach: rdd.foreach(lambda x: print(x))
● ForeachPartition: rdd.foreachPartition(lambda partition: [print(x) for x in partition])
● CollectAsMap: rdd.collectAsMap()
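
A minimal sketch of driver-side result collection; the sample data is an assumption:

pairs = sc.parallelize([("a", 1), ("b", 2)])
as_map = pairs.collectAsMap()        # {"a": 1, "b": 2} brought back to the driver
print(as_map["a"])                   # 1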

14. RDD Debugging

● Logging: rdd.foreach(lambda x: print("Processing:", x)) (prefer foreach over map here: map is lazy and would replace each element with None)
● Caching: rdd.cache()
● Checkpointing: rdd.checkpoint()
● Debugging: rdd.filter(lambda x: x == debug_value).collect()
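
A minimal sketch of inspecting an RDD while debugging, assuming an existing RDD named rdd; executor-side print output goes to the executor logs rather than the driver console, except in local mode:

print(rdd.toDebugString().decode())                 # lineage / dependency chain (returned as bytes in PySpark)
rdd.foreach(lambda x: print("Processing:", x))      # runs on the executors; check their logs for output
print(rdd.take(5))                                  # pull a small sample back to the driver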

15. RDD Optimization

● Caching: rdd.cache()
● Repartitioning: rdd.repartition(numPartitions=10)
● Coalesce: rdd.coalesce(numPartitions=5)
● Broadcast Variables: broadcast_var = sc.broadcast(large_data)
● Accumulator Variables: accumulator = sc.accumulator(0)
● Kryo Serialization: conf = SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
● Spark SQL: df = sqlContext.createDataFrame(rdd, schema)
● DataFrame Operations: df.filter(df.age > 18)
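
A minimal sketch of broadcast and accumulator variables; the lookup data is an assumption, and accumulator updates made inside transformations are only visible after an action runs (and may be re-counted on task retries):

lookup = sc.broadcast({"a": 1, "b": 2})              # read-only data shipped once per executor
misses = sc.accumulator(0)

def score(k):
    if k not in lookup.value:
        misses.add(1)                                # counted when the action below runs
    return lookup.value.get(k, 0)

print(sc.parallelize(["a", "b", "c"]).map(score).sum())  # 3
print(misses.value)                                      # 1 (after the action)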

By: Waleed Mousa
