PySpark RDD Cheat Sheet
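All snippets below assume a running SparkContext called sc (see Initializing
Spark further down) plus a few small example RDDs and the seqOp/combOp
helpers used by aggregate() and aggregateByKey(). The exact literals are
assumptions chosen to match the outputs shown; a minimal setup sketch:
>>> rdd  = sc.parallelize([('a',7),('a',2),('b',2)])  #(key,value) pairs
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)])  #(key,value) pairs used for sorting
>>> rdd3 = sc.parallelize(range(100))                 #Plain integers 0..99
>>> rdd5 = sc.parallelize(['a',2,'b',7,'a',2])        #Mixed values used by distinct()
>>> from operator import add                          #add() is used by fold()
>>> seqOp  = (lambda x,y: (x[0]+y, x[1]+1))           #Fold one value into a (sum,count) pair
>>> combOp = (lambda x,y: (x[0]+y[0], x[1]+y[1]))     #Merge two (sum,count) pairs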
> Reshaping Data

Reducing
>>> rdd.reduce(lambda a, b: a + b) #Merge the rdd values
('a',7,'a',2,'b',2)
>>> rdd.reduceByKey(lambda x, y: x + y).collect() #Merge the rdd values for each key
[('a',9),('b',2)]

Grouping by
>>> rdd3.groupBy(lambda x: x % 2).mapValues(list).collect() #Return RDD of grouped values
>>> rdd.groupByKey().mapValues(list).collect() #Group rdd by key

Aggregating
>>> rdd3.aggregate((0,0), seqOp, combOp) #Aggregate RDD elements of each partition and then the results
(4950,100)
>>> rdd.aggregateByKey((0,0), seqOp, combOp).collect() #Aggregate the values of each RDD key
[('a',(9,2)),('b',(2,1))]
>>> rdd3.fold(0, add) #Aggregate the elements of each partition, and then the results
4950

> Retrieving RDD Information
>>> rdd.countByKey() #Count RDD instances by key
>>> rdd.countByValue() #Count RDD instances by value
defaultdict(<type 'int'>,{('b',2):1,('a',2):1,('a',7):1})
>>> rdd.collectAsMap() #Return (key,value) pairs as a dictionary
{'a': 2,'b': 2}
>>> rdd3.max() #Maximum value of rdd3 elements
99
>>> rdd3.mean() #Mean value of rdd3 elements
49.5
>>> rdd3.stdev() #Standard deviation of rdd3 elements
28.866070047722118
>>> rdd3.histogram(3) #Compute histogram by bins
([0,33,66,99],[33,33,34])
>>> rdd3.stats() #Summary statistics (count, mean, stdev, max & min)
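rdd3 also supports a few more one-number summaries that are not spelled out
above; a quick sketch (these are standard RDD methods, results shown for
range(100)):
>>> rdd3.min()      #Minimum value of rdd3 elements -> 0
>>> rdd3.sum()      #Sum of rdd3 elements -> 4950
>>> rdd3.variance() #Variance of rdd3 elements -> 833.25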
> Sort
>>> rdd2.sortBy(lambda x: x[1]).collect() #Sort RDD by given function
[('d',1),('b',1),('a',2)]
>>> rdd2.sortByKey().collect() #Sort (key,value) RDD by key
[('a',2),('b',1),('d',1)]

> Initializing Spark

Configuration
>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
         .setMaster("local")
         .setAppName("My app")
         .set("spark.executor.memory", "1g"))
>>> sc = SparkContext(conf = conf)
Using the Shell
In the PySpark shell, a special interpreter-aware SparkContext is already
created in the variable called sc.
$ ./bin/spark-shell --master local[2]
Set which master the context connects to with the --master argument, and
add Python .zip, .egg or .py files to the runtime path by passing a
comma-separated list to --py-files.

> Repartitioning
>>> rdd.repartition(4) #New RDD with 4 partitions
>>> rdd.coalesce(1) #Decrease the number of partitions in the RDD to 1

> Selecting Data

Getting
>>> rdd.top(2) #Take top 2 RDD elements
[('b', 2), ('a', 7)]

Sampling
>>> rdd3.sample(False, 0.15, 81).collect() #Return sampled subset of rdd3
[3,27,31,40,41,42,43,60,76,79,80,86,97]
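sample() returns another RDD; to pull a fixed-size random subset straight
back to the driver there is also takeSample() (a sketch; the exact elements
depend on the seed):
>>> rdd3.takeSample(False, 5, 81) #List of 5 random rdd3 elements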
Filtering
>>> rdd.filter(lambda x: "a" in x).collect() #Filter the RDD
[('a',7),('a',2)]
>>> rdd5.distinct().collect() #Return distinct RDD values
['a',2,'b',7]
>>> rdd.keys().collect() #Return (key,value) RDD's keys
['a', 'a', 'b']

> Saving
>>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
                         'org.apache.hadoop.mapred.TextOutputFormat')

> Loading Data

Parallelized Collections
>>> rdd4 = sc.parallelize([("a",["x","y","z"]),
                           ("b",["p", "r"])])
Iterating
>>> def g(x): print(x)
>>> rdd.foreach(g) #Apply a function to all RDD elements
('a', 7)
('b', 2)
('a', 2)
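foreach() applies g on the executors, so on a real cluster the printed lines
land in the executor logs rather than the driver console; to print on the
driver, pull the elements back first (a plain-Python sketch):
>>> for x in rdd.collect(): #collect() returns a Python list on the driver
...     print(x)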
External Data
Read either one text file from HDFS, a local file system or any
Hadoop-supported file system URI with textFile(), or read in a directory
of text files with wholeTextFiles().
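A minimal usage sketch for the two readers (the paths are placeholders, not
paths taken from the sheet):
>>> textFile = sc.textFile("/my/directory/file.txt")  #One file -> RDD of lines
>>> textFiles = sc.wholeTextFiles("/my/directory/")   #Directory -> RDD of (filename, content) pairs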
> Execution
$ ./bin/spark-submit examples/src/main/python/pi.py
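spark-submit accepts the same --master switch as the shell, e.g. (the
local[4] value is only an illustration):
$ ./bin/spark-submit --master local[4] examples/src/main/python/pi.py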