PySpark RDD Operations
1. RDD Creation
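A minimal creation sketch, assuming only that PySpark is installed; the input path is a placeholder, and later snippets reuse this SparkContext sc:

from pyspark.sql import SparkSession

# Entry point for the examples below
spark = SparkSession.builder.appName("rdd-cheatsheet").getOrCreate()
sc = spark.sparkContext

# From an in-memory collection (numSlices controls the partition count)
rdd = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

# From a text file, one element per line (placeholder path)
lines = sc.textFile("data/input.txt")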
2. RDD Transformations
● Map: rdd.map(lambda x: x * 2)
● FlatMap: rdd.flatMap(lambda x: [x, x * 2, x * 3])
● Filter: rdd.filter(lambda x: x > 10)
● Distinct: rdd.distinct()
● Sample: rdd.sample(withReplacement=True, fraction=0.5)
● Union: rdd1.union(rdd2)
● Intersection: rdd1.intersection(rdd2)
● Subtract: rdd1.subtract(rdd2)
● Cartesian: rdd1.cartesian(rdd2)
● Zip: rdd1.zip(rdd2)
● ZipWithIndex: rdd.zipWithIndex()
● GroupBy: rdd.groupBy(lambda x: x % 2)
● SortBy: rdd.sortBy(lambda x: x, ascending=False)
● PartitionBy: rdd.partitionBy(3)
● MapPartitions: rdd.mapPartitions(lambda partition: [x * 2 for x in partition])
● MapPartitionsWithIndex: rdd.mapPartitionsWithIndex(lambda index, partition: [(index, x) for x in partition])
● FlatMapValues: rdd.flatMapValues(lambda x: [x, x * 2])
● CombineByKey: rdd.combineByKey(lambda value: (value, 1), lambda acc, value: (acc[0] + value, acc[1] + 1), lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))
● FoldByKey: rdd.foldByKey(0, lambda acc, value: acc + value)
● ReduceByKey: rdd.reduceByKey(lambda a, b: a + b)
● AggregateByKey: rdd.aggregateByKey((0, 0), lambda acc, value: (acc[0] + value, acc[1] + 1), lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))
● Join: rdd1.join(rdd2)
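A short sketch chaining several of the transformations above (assumes the SparkContext sc from section 1; data and results are illustrative):

nums = sc.parallelize([1, 2, 3, 4, 5, 6])
doubled = nums.map(lambda x: x * 2)             # 2, 4, 6, 8, 10, 12
big = doubled.filter(lambda x: x > 6)           # 8, 10, 12
pairs = big.map(lambda x: (x % 4, x))           # (0, 8), (2, 10), (0, 12)
totals = pairs.reduceByKey(lambda a, b: a + b)  # (0, 20), (2, 10)
print(totals.collect())                         # order may vary

Transformations are lazy; nothing executes until an action such as collect() is called.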
3. RDD Actions
● Collect: rdd.collect()
● Take: rdd.take(5)
● First: rdd.first()
● Count: rdd.count()
● CountByValue: rdd.countByValue()
● Reduce: rdd.reduce(lambda a, b: a + b)
● Aggregate: rdd.aggregate((0, 0), lambda acc, value: (acc[0] + value, acc[1] + 1), lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))
● Fold: rdd.fold(0, lambda acc, value: acc + value)
● Max: rdd.max()
● Min: rdd.min()
● Sum: rdd.sum()
● Mean: rdd.mean()
● Variance: rdd.variance()
● Stdev: rdd.stdev()
● TakeSample: rdd.takeSample(withReplacement=True, num=5)
● Foreach: rdd.foreach(lambda x: print(x))
● ForeachPartition: rdd.foreachPartition(lambda partition: [print(x) for x in partition])
● Top: rdd.top(5)
● TakeOrdered: rdd.takeOrdered(5, lambda x: -x)
● SaveAsTextFile: rdd.saveAsTextFile("output/")
● SaveAsPickleFile: rdd.saveAsPickleFile("output/")
● Cache: rdd.cache()
● Persist: rdd.persist(storageLevel=pyspark.StorageLevel.MEMORY_AND_DISK)
● Unpersist: rdd.unpersist()
● Checkpoint: rdd.checkpoint()
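A sketch of the aggregate-style actions above, computing a sum and count in one pass to get a mean (same assumed sc; values are illustrative):

nums = sc.parallelize([3, 1, 4, 1, 5, 9])
print(nums.count())        # 6
print(nums.sum())          # 23
print(nums.top(2))         # [9, 5]

total, count = nums.aggregate(
    (0, 0),
    lambda acc, value: (acc[0] + value, acc[1] + 1),  # fold one value into a partition's accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]),          # merge accumulators across partitions
)
print(total / count)       # 3.8333...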
4. RDD Key-Value Pair Operations
● Union: rdd1.union(rdd2)
● Intersection: rdd1.intersection(rdd2)
● Subtract: rdd1.subtract(rdd2)
● Cartesian: rdd1.cartesian(rdd2)
● ReduceByKey: rdd.reduceByKey(lambda a, b: a + b)
● GroupByKey: rdd.groupByKey()
● CombineByKey: rdd.combineByKey(lambda value: (value, 1), lambda acc, value: (acc[0] + value, acc[1] + 1), lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))
● AggregateByKey: rdd.aggregateByKey((0, 0), lambda acc, value: (acc[0] + value, acc[1] + 1), lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))
● FoldByKey: rdd.foldByKey(0, lambda acc, value: acc + value)
● Join: rdd1.join(rdd2)
● LeftOuterJoin: rdd1.leftOuterJoin(rdd2)
● RightOuterJoin: rdd1.rightOuterJoin(rdd2)
● FullOuterJoin: rdd1.fullOuterJoin(rdd2)
● MapValues: rdd.mapValues(lambda x: x * 2)
● FlatMapValues: rdd.flatMapValues(lambda x: [x, x * 2])
● CountByKey: rdd.countByKey()
● Lookup: rdd.lookup(key)
● SortByKey: rdd.sortByKey(ascending=False)
● CoGroup: rdd1.cogroup(rdd2)
● Sum: rdd.sum()
● Mean: rdd.mean()
● Variance: rdd.variance()
● Stdev: rdd.stdev()
● Histogram: rdd.histogram(buckets=10)
● Collect: rdd.collect()
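To make the combineByKey/aggregateByKey one-liners above concrete, a minimal per-key average sketch (same assumed sc; sample data is illustrative):

sales = sc.parallelize([("a", 10), ("b", 4), ("a", 30), ("b", 6), ("a", 20)])

sum_count = sales.combineByKey(
    lambda value: (value, 1),                                   # createCombiner: first value seen for a key
    lambda acc, value: (acc[0] + value, acc[1] + 1),            # mergeValue: add a value within a partition
    lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]),  # mergeCombiners: merge across partitions
)
averages = sum_count.mapValues(lambda p: p[0] / p[1])
print(averages.collect())   # e.g. [('a', 20.0), ('b', 5.0)] (order may vary)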
5. RDD Partitioning & Optimization
● Caching: rdd.cache()
● Repartitioning: rdd.repartition(numPartitions=10)
● Coalesce: rdd.coalesce(numPartitions=5)
● Broadcast Variables: broadcast_var = sc.broadcast(large_data)
● Accumulator Variables: accumulator = sc.accumulator(0)
● Kryo Serialization: conf = SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
● Spark SQL: df = sqlContext.createDataFrame(rdd, schema)
● DataFrame Operations: df.filter(df.age > 18)
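A closing sketch tying the partitioning, persistence, shared-variable, and DataFrame one-liners together; it assumes the SparkSession spark and SparkContext sc from section 1, and all names and data are illustrative:

from pyspark import StorageLevel

events = sc.parallelize([("US", 1), ("DE", 2), ("US", 3), ("FR", 4)], 8)
events = events.coalesce(2)                     # shrink the partition count without a full shuffle
events.persist(StorageLevel.MEMORY_AND_DISK)    # keep the RDD around for reuse
print(events.getNumPartitions())                # 2

# Broadcast a small lookup table once to every executor
names = sc.broadcast({"US": "United States", "DE": "Germany", "FR": "France"})
named = events.map(lambda kv: (names.value[kv[0]], kv[1]))

# Count processed records with an accumulator (updated as a side effect of the action)
counter = sc.accumulator(0)
named.foreach(lambda kv: counter.add(1))
print(counter.value)                            # 4

# Hand the result to Spark SQL via the SparkSession entry point
df = spark.createDataFrame(named, ["country", "value"])
df.filter(df.value > 1).show()

events.unpersist()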