PySpark RDD Cheat Sheet

This document is a cheat sheet for using PySpark RDD (Resilient Distributed Dataset) operations, including functions for retrieving information, reshaping data, applying functions, and performing aggregations. It provides code snippets for common tasks such as counting, grouping, filtering, and sorting RDDs, as well as initializing Spark and loading data. The document serves as a quick reference for data scientists working with PySpark.


Python For Data Science
Learn PySpark RDD online at www.DataCamp.com

> Spark

PySpark is the Spark Python API that exposes the Spark programming model to Python.

> Initializing Spark

SparkContext

>>> from pyspark import SparkContext
>>> sc = SparkContext(master = 'local[2]')

Inspect SparkContext

>>> sc.version                 #Retrieve SparkContext version
>>> sc.pythonVer               #Retrieve Python version
>>> sc.master                  #Master URL to connect to
>>> str(sc.sparkHome)          #Path where Spark is installed on worker nodes
>>> str(sc.sparkUser())        #Retrieve name of the Spark User running SparkContext
>>> sc.appName                 #Return application name
>>> sc.applicationId           #Retrieve application ID
>>> sc.defaultParallelism      #Return default level of parallelism
>>> sc.defaultMinPartitions    #Default minimum number of partitions for RDDs

Configuration

>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
             .setMaster("local")
             .setAppName("My app")
             .set("spark.executor.memory", "1g"))
>>> sc = SparkContext(conf = conf)

Using The Shell

In the PySpark shell, a special interpreter-aware SparkContext is already created in the variable called sc.

$ ./bin/spark-shell --master local[2]
$ ./bin/pyspark --master local[4] --py-files code.py

Set which master the context connects to with the --master argument, and add Python .zip, .egg or .py files to the runtime path by passing a comma-separated list to --py-files.

> Loading Data

Parallelized Collections

>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)])
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)])
>>> rdd3 = sc.parallelize(range(100))
>>> rdd4 = sc.parallelize([("a",["x","y","z"]),
                           ("b",["p","r"])])

External Data

Read either one text file from HDFS, a local file system, or any Hadoop-supported file system URI with textFile(), or read in a directory of text files with wholeTextFiles().

>>> textFile = sc.textFile("/my/directory/*.txt")
>>> textFile2 = sc.wholeTextFiles("/my/directory/")

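The two loaders return differently shaped RDDs, which is easy to miss from the snippets above. A minimal sketch, assuming the hypothetical paths below exist: textFile() gives one element per line, while wholeTextFiles() gives one (filename, contents) pair per file.

>>> lines = sc.textFile("/my/directory/sample.txt")       #Hypothetical path: RDD of lines
>>> lines.take(2)                                         #First two lines as strings
>>> files = sc.wholeTextFiles("/my/directory/")           #RDD of (filename, contents) pairs
>>> files.map(lambda kv: (kv[0], len(kv[1]))).collect()   #(filename, length in characters)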
> Retrieving RDD Information

Basic Information

>>> rdd.getNumPartitions()   #List the number of partitions
>>> rdd.count()              #Count RDD instances
3
>>> rdd.countByKey()         #Count RDD instances by key
defaultdict(<type 'int'>,{'a':2,'b':1})
>>> rdd.countByValue()       #Count RDD instances by value
defaultdict(<type 'int'>,{('b',2):1,('a',2):1,('a',7):1})
>>> rdd.collectAsMap()       #Return (key,value) pairs as a dictionary
{'a': 2,'b': 2}
>>> rdd3.sum()               #Sum of RDD elements
4950
>>> sc.parallelize([]).isEmpty()   #Check whether RDD is empty
True

Summary

>>> rdd3.max()          #Maximum value of RDD elements
99
>>> rdd3.min()          #Minimum value of RDD elements
>>> rdd3.mean()         #Mean value of RDD elements
49.5
>>> rdd3.stdev()        #Standard deviation of RDD elements
28.866070047722118
>>> rdd3.variance()     #Compute variance of RDD elements
833.25
>>> rdd3.histogram(3)   #Compute histogram by bins
([0,33,66,99],[33,33,34])
>>> rdd3.stats()        #Summary statistics (count, mean, stdev, max & min)

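The stats() call at the end returns all of these summary values at once; a small sketch of pulling individual numbers out of it (the values shown assume rdd3 = sc.parallelize(range(100)) as above):

>>> st = rdd3.stats()   #StatCounter holding count, mean, stdev, max and min together
>>> st.count()          #100
>>> st.mean()           #49.5
>>> st.stdev()          #28.866070047722118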
> Applying Functions

#Apply a function to each RDD element
>>> rdd.map(lambda x: x+(x[1],x[0])).collect()
[('a',7,7,'a'),('a',2,2,'a'),('b',2,2,'b')]
#Apply a function to each RDD element and flatten the result
>>> rdd5 = rdd.flatMap(lambda x: x+(x[1],x[0]))
>>> rdd5.collect()
['a',7,7,'a','a',2,2,'a','b',2,2,'b']
#Apply a flatMap function to each (key,value) pair of rdd4 without changing the keys
>>> rdd4.flatMapValues(lambda x: x).collect()
[('a','x'),('a','y'),('a','z'),('b','p'),('b','r')]

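To make the map/flatMap difference above concrete, a quick sketch on the same rdd: map keeps one output element per input tuple, while flatMap splices the returned tuples into one flat sequence.

>>> rdd.map(lambda x: x+(x[1],x[0])).count()       #3 elements: one 4-tuple per input pair
>>> rdd.flatMap(lambda x: x+(x[1],x[0])).count()   #12 elements: the 4-tuples are flattened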
> Selecting Data

Getting

>>> rdd.collect()   #Return a list with all RDD elements
[('a', 7), ('a', 2), ('b', 2)]
>>> rdd.take(2)     #Take first 2 RDD elements
[('a', 7), ('a', 2)]
>>> rdd.first()     #Take first RDD element
('a', 7)
>>> rdd.top(2)      #Take top 2 RDD elements
[('b', 2), ('a', 7)]

Sampling

>>> rdd3.sample(False, 0.15, 81).collect()   #Return sampled subset of rdd3
[3,4,27,31,40,41,42,43,60,76,79,80,86,97]

Filtering

>>> rdd.filter(lambda x: "a" in x).collect()   #Filter the RDD
[('a',7),('a',2)]
>>> rdd5.distinct().collect()                  #Return distinct RDD values
['a',2,'b',7]
>>> rdd.keys().collect()                       #Return (key,value) RDD's keys
['a', 'a', 'b']

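A small companion sketch: filter() sees whole elements, so the lambda can also test the value side of each pair, and takeSample() returns a fixed-size local list instead of an RDD (its contents depend on the seed).

>>> rdd.filter(lambda x: x[1] > 2).collect()   #Keep pairs whose value is greater than 2
[('a', 7)]
>>> rdd3.takeSample(False, 5, 81)              #5 elements sampled without replacement, seed 81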
> Iterating

>>> def g(x): print(x)
>>> rdd.foreach(g)   #Apply a function to all RDD elements
('a', 7)
('b', 2)
('a', 2)

> Reshaping Data

Reducing

>>> rdd.reduceByKey(lambda x,y: x+y).collect()   #Merge the rdd values for each key
[('a',9),('b',2)]
>>> rdd.reduce(lambda a,b: a+b)                  #Merge the rdd values
('a',7,'a',2,'b',2)

Grouping by

>>> rdd3.groupBy(lambda x: x % 2).mapValues(list).collect()   #Return RDD of grouped values
>>> rdd.groupByKey().mapValues(list).collect()                #Group rdd by key
[('a',[7,2]),('b',[2])]

Aggregating

>>> seqOp = (lambda x,y: (x[0]+y,x[1]+1))
>>> combOp = (lambda x,y: (x[0]+y[0],x[1]+y[1]))
#Aggregate RDD elements of each partition and then the results
>>> rdd3.aggregate((0,0),seqOp,combOp)
(4950,100)
#Aggregate values of each RDD key
>>> rdd.aggregateByKey((0,0),seqOp,combOp).collect()
[('a',(9,2)),('b',(2,1))]
#Aggregate the elements of each partition, and then the results
>>> from operator import add
>>> rdd3.fold(0,add)
4950
#Merge the values for each key
>>> rdd.foldByKey(0,add).collect()
[('a',9),('b',2)]
#Create tuples of RDD elements by applying a function
>>> rdd3.keyBy(lambda x: x+x).collect()

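The aggregate() call above is easiest to read as a (sum, count) accumulator; a short worked sketch (same seqOp and combOp as above) deriving the mean of rdd3 from it:

#seqOp folds each value into a per-partition (sum, count); combOp then merges the partial pairs
>>> total, n = rdd3.aggregate((0,0), seqOp, combOp)
>>> total / float(n)   #4950 / 100, matching rdd3.mean()
49.5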
> Mathematical Operations

>>> rdd.subtract(rdd2).collect()        #Return each rdd value not contained in rdd2
[('b',2),('a',7)]
>>> rdd2.subtractByKey(rdd).collect()   #Return each (key,value) pair of rdd2 with no matching key in rdd
[('d', 1)]
>>> rdd.cartesian(rdd2).collect()       #Return the Cartesian product of rdd and rdd2

> Sort

>>> rdd2.sortBy(lambda x: x[1]).collect()   #Sort RDD by given function
[('d',1),('b',1),('a',2)]
>>> rdd2.sortByKey().collect()              #Sort (key, value) RDD by key
[('a',2),('b',1),('d',1)]

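Both sort calls also accept an ascending flag, shown here by keyword; a brief sketch of sorting in descending order:

>>> rdd2.sortByKey(ascending=False).collect()                #Sort (key, value) RDD by key, descending
[('d',1),('b',1),('a',2)]
>>> rdd2.sortBy(lambda x: x[1], ascending=False).collect()   #Sort RDD by value, descending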
> Repartitioning

>>> rdd.repartition(4)   #New RDD with 4 partitions
>>> rdd.coalesce(1)      #Decrease the number of partitions in the RDD to 1

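Both calls return a new RDD rather than changing rdd in place; a quick sketch that verifies the partition counts:

>>> rdd.repartition(4).getNumPartitions()   #4
>>> rdd.coalesce(1).getNumPartitions()      #1
>>> rdd.getNumPartitions()                  #Unchanged: the original rdd keeps its partitioning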
> Saving

>>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
                         'org.apache.hadoop.mapred.TextOutputFormat')

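saveAsTextFile() writes a directory of part files (one per partition), not a single file; a minimal sketch, assuming the hypothetical path /tmp/rdd_out is writable, that saves the RDD and reads it back:

>>> rdd.saveAsTextFile("/tmp/rdd_out")      #Creates /tmp/rdd_out/part-00000, part-00001, ...
>>> sc.textFile("/tmp/rdd_out").collect()   #Elements come back as strings, e.g. "('a', 7)"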
> Stopping SparkContext

>>> sc.stop()

> Execution

$ ./bin/spark-submit examples/src/main/python/pi.py

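A script passed to spark-submit has to create its own SparkContext, since the shell-provided sc does not exist there. A minimal self-contained sketch, assuming it is saved as count_pairs.py (hypothetical name):

#count_pairs.py -- run with: ./bin/spark-submit count_pairs.py
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("Count pairs")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([('a',7),('a',2),('b',2)])
print(rdd.reduceByKey(lambda x,y: x+y).collect())   #[('a', 9), ('b', 2)]

sc.stop()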
Learn Data Skills Online at www.DataCamp.com
