PySpark RDD Cheat Sheet (Edureka)

PySpark RDDs are a distributed memory abstraction for performing in-memory computations on large clusters in a fault-tolerant manner. RDDs are created after starting PySpark and initializing a SparkContext. Transformations such as map, filter, and flatMap manipulate RDDs, while actions such as reduce, count, and take return values once a computation runs. Common operations include sorting, grouping, joining, and set operations on RDDs.
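The summary mentions joining alongside sorting, grouping, and set operations, but the sheet below has no join snippet, so here is a minimal sketch of a pair-RDD join. It assumes an existing SparkContext sc (created as in the Initialization section below) and uses made-up sample data.

>>> ages = sc.parallelize([('Jim', 24), ('Hope', 25), ('Sue', 26)])    # (name, age) pairs
>>> cities = sc.parallelize([('Jim', 'Boston'), ('Sue', 'Austin')])    # hypothetical (name, city) pairs
>>> sorted(ages.join(cities).collect())                                # inner join on the key
[('Jim', (24, 'Boston')), ('Sue', (26, 'Austin'))]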


PYSPARK RDD CHEAT SHEET
Learn PySpark at www.edureka.co

Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets a programmer perform in-memory computations on large clusters in a fault-tolerant manner.

Initialization

Let's see how to start PySpark and enter the shell.
• Go to the folder where PySpark is installed
• Run the following commands

$ ./sbin/start-all.sh
$ ./bin/pyspark

Now that Spark is up and running, we need to initialize the Spark context, which is the heart of any Spark application.

>>> from pyspark import SparkContext
>>> sc = SparkContext(master = 'local[2]')

Configurations

>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
...         .setMaster("local[2]")
...         .setAppName("Edureka CheatSheet")
...         .set("spark.executor.memory", "1g"))
>>> sc = SparkContext(conf = conf)

Spark Context Inspection

Once the Spark context is initialized, it is time to check that all the versions are correct and to inspect the default parameters being used by the SparkContext.

>>> sc.version                  # SparkContext version
>>> sc.pythonVer                # Python version
>>> sc.appName                  # Application name
>>> sc.applicationId            # Application ID
>>> sc.master                   # Master URL
>>> str(sc.sparkHome)           # Installed Spark path
>>> str(sc.sparkUser())         # Retrieve current SparkContext user
>>> sc.defaultParallelism       # Get default level of parallelism
>>> sc.defaultMinPartitions     # Get minimum number of partitions

Data Loading

Creating RDDs

Using parallelized collections
>>> rdd = sc.parallelize([('Jim', 24), ('Hope', 25), ('Sue', 26)])
>>> rdd = sc.parallelize([('a', 9), ('b', 7), ('c', 10)])
>>> num_rdd = sc.parallelize(range(1, 5000))

From other RDDs
>>> new_rdd = rdd.groupByKey()
>>> new_rdd = rdd.map(lambda x: (x, 1))

From a text file
>>> tfile_rdd = sc.textFile("/path/of_file/*.txt")

Reading a directory of text files
>>> tfile_rdd = sc.wholeTextFiles("/path/of_directory/")

RDD Statistics

Maximum value of RDD elements
>>> rdd.max()
Minimum value of RDD elements
>>> rdd.min()
Mean value of RDD elements
>>> rdd.mean()
Standard deviation of RDD elements
>>> rdd.stdev()
Get the summary statistics (count, mean, stdev, max & min)
>>> rdd.stats()
Number of partitions
>>> rdd.getNumPartitions()

Transformations

map
>>> rdd = sc.parallelize(["b", "a", "c"])
>>> sorted(rdd.map(lambda x: (x, 1)).collect())
[('a', 1), ('b', 1), ('c', 1)]

flatMap
>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
[1, 1, 1, 2, 2, 3]

mapPartitions
>>> rdd = sc.parallelize([1, 2, 3, 4], 2)
>>> def f(iterator): yield sum(iterator)
>>> rdd.mapPartitions(f).collect()
[3, 7]

filter
>>> rdd = sc.parallelize([1, 2, 3, 4, 5])
>>> rdd.filter(lambda x: x % 2 == 0).collect()
[2, 4]

distinct
>>> sorted(sc.parallelize([1, 1, 2, 3]).distinct().collect())
[1, 2, 3]

Actions

reduce
>>> from operator import add
>>> sc.parallelize([1, 2, 3, 4, 5]).reduce(add)
15
>>> sc.parallelize((2 for _ in range(10))).map(lambda x: 1).cache().reduce(add)
10

count
>>> sc.parallelize([2, 3, 4]).count()
3

first
>>> sc.parallelize([2, 3, 4]).first()
2

take
>>> sc.parallelize([2, 3, 4, 5, 6]).cache().take(2)
[2, 3]

countByValue
>>> sorted(sc.parallelize([1, 2, 1, 2, 2], 2).countByValue().items())
[(1, 2), (2, 3)]

Sorting and Grouping

sortBy
>>> tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
>>> sc.parallelize(tmp).sortBy(lambda x: x[0]).collect()
[('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]

sortByKey
>>> tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
>>> sc.parallelize(tmp).sortByKey(True, 1).collect()
[('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]

groupBy
>>> rdd = sc.parallelize([1, 1, 2, 3, 5, 8])
>>> result = rdd.groupBy(lambda x: x % 2).collect()
>>> sorted([(x, sorted(y)) for (x, y) in result])
[(0, [2, 8]), (1, [1, 1, 3, 5])]

groupByKey
>>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> [(k, sorted(v)) for k, v in sorted(x.groupByKey().collect())]
[('a', [1, 1]), ('b', [1])]

fold
>>> from operator import add
>>> sc.parallelize([1, 2, 3, 4, 5]).fold(0, add)
15

Set Operations

__add__
>>> rdd = sc.parallelize([1, 1, 2, 3])
>>> (rdd + rdd).collect()
[1, 1, 2, 3, 1, 1, 2, 3]

subtract
>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtract(y).collect())
[('a', 1), ('b', 4), ('b', 5)]

union
>>> rdd = sc.parallelize([1, 1, 2, 3])
>>> rdd.union(rdd).collect()
[1, 1, 2, 3, 1, 1, 2, 3]

intersection
>>> rdd1 = sc.parallelize([1, 10, 2, 3, 4, 5])
>>> rdd2 = sc.parallelize([1, 6, 2, 3, 7, 8])
>>> rdd1.intersection(rdd2).collect()
[1, 2, 3]

cartesian
>>> rdd = sc.parallelize([1, 2])
>>> sorted(rdd.cartesian(rdd).collect())
[(1, 1), (1, 2), (2, 1), (2, 2)]

Saving

saveAsTextFile
>>> rdd.saveAsTextFile("rdd.txt")

saveAsHadoopFile
>>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent_folder/child_folder",
...                      'org.apache.hadoop.mapred.TextOutputFormat')

saveAsPickleFile
>>> from tempfile import NamedTemporaryFile
>>> tmpFile = NamedTemporaryFile(delete=True)
>>> tmpFile.close()
>>> sc.parallelize([1, 2, 'spark', 'rdd']).saveAsPickleFile(tmpFile.name, 3)
>>> sorted(sc.pickleFile(tmpFile.name, 5).map(str).collect())
['1', '2', 'rdd', 'spark']

Stopping SparkContext and Spark Daemons

Stopping SparkContext
>>> sc.stop()

Stopping Spark Daemons
$ ./sbin/stop-all.sh
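Putting the pieces of the sheet together, a minimal end-to-end sketch of a PySpark RDD session might look like the following, assuming a local Spark installation; the app name and numbers are illustrative only.

>>> from pyspark import SparkConf, SparkContext
>>> conf = SparkConf().setMaster("local[2]").setAppName("Edureka CheatSheet")
>>> sc = SparkContext(conf = conf)              # initialize the Spark context
>>> nums = sc.parallelize(range(1, 101))        # create an RDD from a local collection
>>> evens = nums.filter(lambda x: x % 2 == 0)   # transformation: keep even numbers
>>> evens.count()                               # action: number of even elements
50
>>> evens.reduce(lambda a, b: a + b)            # action: sum of 2 + 4 + ... + 100
2550
>>> sc.stop()                                   # stop the SparkContext when done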
