PySpark RDD Cheat Sheet

This document is a cheat sheet for using PySpark RDD (Resilient Distributed Dataset) operations, including functions for retrieving information, reshaping data, applying functions, and performing aggregations. It provides code snippets for common tasks such as counting, grouping, filtering, and sorting RDDs, as well as initializing Spark and loading data. The document serves as a quick reference for data scientists working with PySpark.


Python For Data Science
Learn PySpark RDD online at www.DataCamp.com

> Spark

PySpark is the Spark Python API that exposes the Spark programming model to Python.

> Initializing Spark

SparkContext

>>> from pyspark import SparkContext
>>> sc = SparkContext(master = 'local[2]')

Inspect SparkContext

>>> sc.version                 #Retrieve SparkContext version
>>> sc.pythonVer               #Retrieve Python version
>>> sc.master                  #Master URL to connect to
>>> str(sc.sparkHome)          #Path where Spark is installed on worker nodes
>>> str(sc.sparkUser())        #Retrieve name of the Spark User running SparkContext
>>> sc.appName                 #Return application name
>>> sc.applicationId           #Retrieve application ID
>>> sc.defaultParallelism      #Return default level of parallelism
>>> sc.defaultMinPartitions    #Default minimum number of partitions for RDDs

Configuration

>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
             .setMaster("local")
             .setAppName("My app")
             .set("spark.executor.memory", "1g"))
>>> sc = SparkContext(conf = conf)

Using The Shell

In the PySpark shell, a special interpreter-aware SparkContext is already created in the variable called sc.

$ ./bin/spark-shell --master local[2]
$ ./bin/pyspark --master local[4] --py-files code.py

Set which master the context connects to with the --master argument, and add Python .zip, .egg or .py files to the runtime path by passing a comma-separated list to --py-files.

> Loading Data

Parallelized Collections

>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)])
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)])
>>> rdd3 = sc.parallelize(range(100))
>>> rdd4 = sc.parallelize([("a",["x","y","z"]),
                           ("b",["p","r"])])

External Data

Read either one text file from HDFS, a local file system, or any Hadoop-supported file system URI with textFile(), or read in a directory of text files with wholeTextFiles().

>>> textFile = sc.textFile("/my/directory/*.txt")
>>> textFile2 = sc.wholeTextFiles("/my/directory/")

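The two loaders return differently shaped RDDs, which is easy to miss from the snippets above. A minimal sketch, assuming the hypothetical paths below exist: textFile() gives one element per line, while wholeTextFiles() gives one (filename, contents) pair per file.

>>> lines = sc.textFile("/my/directory/sample.txt")       #Hypothetical path: RDD of lines
>>> lines.take(2)                                         #First two lines as strings
>>> files = sc.wholeTextFiles("/my/directory/")           #RDD of (filename, contents) pairs
>>> files.map(lambda kv: (kv[0], len(kv[1]))).collect()   #(filename, length in characters)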
> Retrieving RDD Information

Basic Information

>>> rdd.getNumPartitions()   #List the number of partitions
>>> rdd.count()              #Count RDD instances
3
>>> rdd.countByKey()         #Count RDD instances by key
defaultdict(<type 'int'>,{'a':2,'b':1})
>>> rdd.countByValue()       #Count RDD instances by value
defaultdict(<type 'int'>,{('b',2):1,('a',2):1,('a',7):1})
>>> rdd.collectAsMap()       #Return (key,value) pairs as a dictionary
{'a': 2,'b': 2}
>>> rdd3.sum()               #Sum of RDD elements
4950
>>> sc.parallelize([]).isEmpty()   #Check whether RDD is empty
True

Summary

>>> rdd3.max()          #Maximum value of RDD elements
99
>>> rdd3.min()          #Minimum value of RDD elements
>>> rdd3.mean()         #Mean value of RDD elements
49.5
>>> rdd3.stdev()        #Standard deviation of RDD elements
28.866070047722118
>>> rdd3.variance()     #Compute variance of RDD elements
833.25
>>> rdd3.histogram(3)   #Compute histogram by bins
([0,33,66,99],[33,33,34])
>>> rdd3.stats()        #Summary statistics (count, mean, stdev, max & min)

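The stats() call at the end returns all of these summary values at once; a small sketch of pulling individual numbers out of it (the values shown assume rdd3 = sc.parallelize(range(100)) as above):

>>> st = rdd3.stats()   #StatCounter holding count, mean, stdev, max and min together
>>> st.count()          #100
>>> st.mean()           #49.5
>>> st.stdev()          #28.866070047722118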
> Applying Functions

#Apply a function to each RDD element
>>> rdd.map(lambda x: x+(x[1],x[0])).collect()
[('a',7,7,'a'),('a',2,2,'a'),('b',2,2,'b')]
#Apply a function to each RDD element and flatten the result
>>> rdd5 = rdd.flatMap(lambda x: x+(x[1],x[0]))
>>> rdd5.collect()
['a',7,7,'a','a',2,2,'a','b',2,2,'b']
#Apply a flatMap function to each (key,value) pair of rdd4 without changing the keys
>>> rdd4.flatMapValues(lambda x: x).collect()
[('a','x'),('a','y'),('a','z'),('b','p'),('b','r')]

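To make the map/flatMap difference above concrete, a quick sketch on the same rdd: map keeps one output element per input tuple, while flatMap splices the returned tuples into one flat sequence.

>>> rdd.map(lambda x: x+(x[1],x[0])).count()       #3 elements: one 4-tuple per input pair
>>> rdd.flatMap(lambda x: x+(x[1],x[0])).count()   #12 elements: the 4-tuples are flattened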
> Selecting Data

Getting

>>> rdd.collect()   #Return a list with all RDD elements
[('a', 7), ('a', 2), ('b', 2)]
>>> rdd.take(2)     #Take first 2 RDD elements
[('a', 7), ('a', 2)]
>>> rdd.first()     #Take first RDD element
('a', 7)
>>> rdd.top(2)      #Take top 2 RDD elements
[('b', 2), ('a', 7)]

Sampling

>>> rdd3.sample(False, 0.15, 81).collect()   #Return sampled subset of rdd3
[3,4,27,31,40,41,42,43,60,76,79,80,86,97]

Filtering

>>> rdd.filter(lambda x: "a" in x).collect()   #Filter the RDD
[('a',7),('a',2)]
>>> rdd5.distinct().collect()                  #Return distinct RDD values
['a',2,'b',7]
>>> rdd.keys().collect()                       #Return (key,value) RDD's keys
['a', 'a', 'b']

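A small companion sketch: filter() sees whole elements, so the lambda can also test the value side of each pair, and takeSample() returns a fixed-size local list instead of an RDD (its contents depend on the seed).

>>> rdd.filter(lambda x: x[1] > 2).collect()   #Keep pairs whose value is greater than 2
[('a', 7)]
>>> rdd3.takeSample(False, 5, 81)              #5 elements sampled without replacement, seed 81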
> Iterating

>>> def g(x): print(x)
>>> rdd.foreach(g)   #Apply a function to all RDD elements
('a', 7)
('b', 2)
('a', 2)

> Reshaping Data

Reducing

>>> rdd.reduceByKey(lambda x,y: x+y).collect()   #Merge the rdd values for each key
[('a',9),('b',2)]
>>> rdd.reduce(lambda a,b: a+b)                  #Merge the rdd values
('a',7,'a',2,'b',2)

Grouping by

>>> rdd3.groupBy(lambda x: x % 2).mapValues(list).collect()   #Return RDD of grouped values
>>> rdd.groupByKey().mapValues(list).collect()                #Group rdd by key
[('a',[7,2]),('b',[2])]

Aggregating

>>> seqOp = (lambda x,y: (x[0]+y,x[1]+1))
>>> combOp = (lambda x,y: (x[0]+y[0],x[1]+y[1]))
#Aggregate RDD elements of each partition and then the results
>>> rdd3.aggregate((0,0),seqOp,combOp)
(4950,100)
#Aggregate values of each RDD key
>>> rdd.aggregateByKey((0,0),seqOp,combOp).collect()
[('a',(9,2)),('b',(2,1))]
#Aggregate the elements of each partition, and then the results
>>> from operator import add
>>> rdd3.fold(0,add)
4950
#Merge the values for each key
>>> rdd.foldByKey(0,add).collect()
[('a',9),('b',2)]
#Create tuples of RDD elements by applying a function
>>> rdd3.keyBy(lambda x: x+x).collect()

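The aggregate() call above is easiest to read as a (sum, count) accumulator; a short worked sketch (same seqOp and combOp as above) deriving the mean of rdd3 from it:

#seqOp folds each value into a per-partition (sum, count); combOp then merges the partial pairs
>>> total, n = rdd3.aggregate((0,0), seqOp, combOp)
>>> total / float(n)   #4950 / 100, matching rdd3.mean()
49.5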
> Mathematical Operations

>>> rdd.subtract(rdd2).collect()        #Return each rdd value not contained in rdd2
[('b',2),('a',7)]
>>> rdd2.subtractByKey(rdd).collect()   #Return each (key,value) pair of rdd2 with no matching key in rdd
[('d', 1)]
>>> rdd.cartesian(rdd2).collect()       #Return the Cartesian product of rdd and rdd2

> Sort

>>> rdd2.sortBy(lambda x: x[1]).collect()   #Sort RDD by given function
[('d',1),('b',1),('a',2)]
>>> rdd2.sortByKey().collect()              #Sort (key, value) RDD by key
[('a',2),('b',1),('d',1)]

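Both sort calls also accept an ascending flag, shown here by keyword; a brief sketch of sorting in descending order:

>>> rdd2.sortByKey(ascending=False).collect()                #Sort (key, value) RDD by key, descending
[('d',1),('b',1),('a',2)]
>>> rdd2.sortBy(lambda x: x[1], ascending=False).collect()   #Sort RDD by value, descending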
> Repartitioning

>>> rdd.repartition(4)   #New RDD with 4 partitions
>>> rdd.coalesce(1)      #Decrease the number of partitions in the RDD to 1

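Both calls return a new RDD rather than changing rdd in place; a quick sketch that verifies the partition counts:

>>> rdd.repartition(4).getNumPartitions()   #4
>>> rdd.coalesce(1).getNumPartitions()      #1
>>> rdd.getNumPartitions()                  #Unchanged: the original rdd keeps its partitioning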
> Saving

>>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
                         'org.apache.hadoop.mapred.TextOutputFormat')

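saveAsTextFile() writes a directory of part files (one per partition), not a single file; a minimal sketch, assuming the hypothetical path /tmp/rdd_out is writable, that saves the RDD and reads it back:

>>> rdd.saveAsTextFile("/tmp/rdd_out")      #Creates /tmp/rdd_out/part-00000, part-00001, ...
>>> sc.textFile("/tmp/rdd_out").collect()   #Elements come back as strings, e.g. "('a', 7)"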
> Stopping SparkContext

>>> sc.stop()

> Execution

$ ./bin/spark-submit examples/src/main/python/pi.py

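A script passed to spark-submit has to create its own SparkContext, since the shell-provided sc does not exist there. A minimal self-contained sketch, assuming it is saved as count_pairs.py (hypothetical name):

#count_pairs.py -- run with: ./bin/spark-submit count_pairs.py
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("Count pairs")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([('a',7),('a',2),('b',2)])
print(rdd.reduceByKey(lambda x,y: x+y).collect())   #[('a', 9), ('b', 2)]

sc.stop()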
Learn Data Skills Online at www.DataCamp.com
